Traps and Pitfalls of Using SQL with Jinja

In recent years, a hybrid language has emerged for building data transformation pipelines. This language has experienced significant growth, partly owing to the popularity of data transformation tools like dbt.

This hybrid language is based on SQL and mostly resembles it. It also borrows somewhat from Python and resembles it to a lesser extent. In reality, though, it’s neither of the two. I’m going to refer to this language as Jinja SQL.

In a nutshell, developers use Python-like syntax to dynamically stitch together pieces of plain text to produce (hopefully correct) SQL code, so that the latter can be executed against a target database or data warehouse. It’s somewhat similar to a preprocessor in a language like C, except that it lacks any awareness of the grammar of the language it’s stitching together.
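To make this concrete, here is a minimal sketch of what that stitching looks like when driving the jinja2 library directly; the table name, column, and threshold below are made up purely for illustration:

from jinja2 import Template

# To Jinja, the template is just text; it has no idea the output is meant to be SQL.
template = Template(
    "SELECT *\n"
    "FROM {{ table_name }}\n"
    "{% if min_amount %}WHERE amount >= {{ min_amount }}{% endif %}"
)

# Rendering produces a plain string that we hope is valid SQL.
print(template.render(table_name="orders", min_amount=100))

The engine happily renders whatever it is given; whether the result actually parses is only discovered once the database tries to execute it.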

The popularity of this approach stems from its ease of implementation and its perceived user-friendliness for non-programmers, especially analytics engineers who are skilled in SQL but not necessarily in Python.

However, I argue that its apparent simplicity is deceptive, as underneath it lie subpar development practices and mounting technical debt. Once you move past simple variable substitution, the complexity escalates drastically.

In this post, I explore some of the most prominent pitfalls I encountered while working with Jinja SQL. I also kick off a conversation around potential alternatives.

Development environment

Let’s start with the most fundamental shortcoming.

Since Jinja is a templating engine designed for arbitrary text and doesn’t come with native SQL support, Jinja SQL lacks the development environment one would typically expect from a full-fledged programming language. As a consequence, essential features such as syntax checking, highlighting, debugging and auto-complete are non-existent.

This negatively impacts the development experience and leads to slower iteration: lacking a proper debugging process, developers resort to frequent round trips to the database to test things out.

The community has attempted to address these shortcomings; for instance, dbt introduced the debug() macro. This macro allows users to set pdb breakpoints in their Jinja code, directly exposing them to the underlying Python backbone and thereby undermining the very purpose of having a "simple" language.

Unfortunately, the story doesn’t end here. Let's delve deeper into how Jinja SQL leads to an ever-growing accumulation of technical debt.

[Meta]programming

Jinja SQL is not simply a regular programming language; it is, in fact, a metalanguage.

Jinja SQL developers are not writing the final executable code. They write a mix of Jinja code and pieces of SQL text, which, upon evaluation, generates the final SQL that is then executed.

This concept is known as metaprogramming and is notoriously challenging to understand and debug. You can get a taste of how complex it is by googling the pitfalls of using the preprocessor in C/C++ and other languages that support one. The challenges are even greater when the code is generated at runtime rather than at compile time, with no checks in place to ensure the correctness of the dynamically generated code.

Take the following simple example:

SELECT *
FROM some_table {{ some_macro() }}

Can you discern all possible variations of what the final query could look like just by examining the snippet above? Here are a few examples of the possible output of the some_macro() invocation:

  • A table alias
  • A UNION with another query
  • A JOIN clause
  • A WHERE clause
  • An ORDER BY clause

The list goes on and on.

Because it is a plain text preprocessor with no awareness of the semantic meaning of SQL, Jinja SQL deprives us of the ability to reason about the code as we read it.

As no constraints are imposed on the possible outputs of the macro call, we are compelled to dig into the implementation details of this macro to get even a remote idea of the resulting query we will end up with.

Even when examining the final SQL, it can be quite tedious to pinpoint the impact and boundaries of the macro output, especially for larger queries.
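To make the ambiguity concrete, here is a sketch (rendered with jinja2 directly; both macro bodies are hypothetical) of two implementations of some_macro() that are equally legal yet turn the same call site into very different queries:

from jinja2 import Environment

env = Environment()
call_site = "SELECT *\nFROM some_table {{ some_macro() }}"

# Implementation 1: the macro merely appends a table alias.
alias_macro = "{% macro some_macro() %}AS t{% endmacro %}"

# Implementation 2: the macro bolts on a join and a filter.
join_macro = (
    "{% macro some_macro() %}"
    "JOIN other_table ON some_table.id = other_table.id WHERE other_table.active"
    "{% endmacro %}"
)

for macro_def in (alias_macro, join_macro):
    # The macro definition renders to nothing; only the call site produces output.
    print(env.from_string(macro_def + call_site).render())
    print("---")

Both versions render without complaint; only the macro’s implementation decides which query we end up with.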

Modularization

This pitfall mostly concerns macros with complex implementations. At some point, a macro grows too large, prompting a developer to break it down into multiple smaller macros.

The refactoring process might produce new macros that resemble something like the following:

{% macro aliased_subquery(subquery) %}
    {% if 'from' in subquery %}
        , ({{ subquery }}) AS subquery
    {% elif 'join' in subquery %}
        , ({{ subquery }}) AS joined_subquery
    {% endif %}
{% endmacro %}

The macro above returns a partial SQL code snippet, which, on its own, is syntactically incorrect and cannot be used outside the context of a calling macro.

Returning partial code snippets is a questionable practice on its own, made worse by the fact that there is absolutely no way to express that this macro should only be used within the scope of some other macro or a module. Concepts like namespaces or modules are not available in Jinja.
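As a sketch of the problem, rendering such a macro on its own (here via jinja2 directly, with a made-up input) yields a dangling fragment, and nothing in the language stops any other part of the project from doing exactly that:

from jinja2 import Environment

env = Environment(trim_blocks=True, lstrip_blocks=True)

template = env.from_string("""
{% macro aliased_subquery(subquery) %}
    {% if 'from' in subquery %}
        , ({{ subquery }}) AS subquery
    {% elif 'join' in subquery %}
        , ({{ subquery }}) AS joined_subquery
    {% endif %}
{% endmacro %}
{{ aliased_subquery('select 1 from some_table') }}
""")

# Prints a fragment along the lines of ", (select 1 from some_table) AS subquery",
# which is not a valid statement on its own.
print(template.render())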

Whitespace control

Another byproduct of Jinja being a plain text preprocessor is the availability of whitespace control. This mechanism was originally designed to eliminate redundant whitespace when using Jinja with languages sensitive to formatting, such as YAML, HTML, and other markup languages.

SQL doesn't fall into this category, and applying this feature there leads to surprising issues because the generation process is completely unaware of the syntactic rules of the code being generated.

For example, the code below:

SELECT
    1 AS {%- if 1 == 1 %}one{% else %}not_one{% endif -%}

Results in the following generated SQL:

SELECT
    1 ASone

Due to my use of the {%- block instead of {%, the one identifier got concatenated to the AS keyword, resulting in unexpected output.

Now, imagine onboarding a new teammate and guiding them through the intricacies of various flavors of Jinja blocks.

Nondeterminism

Admittedly, this anti-pattern is a general consequence of allowing the preprocessor to generate code at runtime and not unique to Jinja. Nonetheless, I found this practice to be so prevalent that I couldn't overlook it in this post.

Let's make a slight change to the example from the beginning of this post. Now, the macro’s output is determined by the results of a database query:

SELECT *
FROM some_table {{ run_query('SELECT expression FROM some_external_source') }}

In addition to being unable to determine the possible outputs, we can no longer even get an idea just by looking at the code. That's because the result of the run_query() call is nondeterministic and may generate different output depending on when the code is evaluated.

The nondeterminism arises from the fact that the generated SQL ultimately depends on an external state (a database in this case). This essentially means that the query can change at any time without the user making any modifications to the query code itself or even being aware of it.

Furthermore, the query cannot be generated without accessing a database, which presents a challenge for tools like dbt. These tools typically compile model queries before executing anything, and they may not have a valid database connection available at compilation time.

I can certainly envision rare cases in which such behavior might be desirable or even necessary. Even then, I would prefer to have the evaluator enforce some constraints on what can and cannot be returned by the macro, instead of trying to determine the boundaries of its nondeterministic output by eyeballing a massive blob of SQL code after compilation.
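To see the nondeterminism in isolation, here is a sketch in which run_query is stubbed out by a made-up function whose result depends on external state (simulated here with random):

import random

from jinja2 import Environment

def run_query(sql):
    # Stand-in for a real database call: the external state can change at any moment.
    return random.choice(["WHERE region = 'EU'", "WHERE region = 'US'", ""])

env = Environment()
env.globals["run_query"] = run_query

template = env.from_string(
    "SELECT *\nFROM some_table {{ run_query('SELECT expression FROM some_external_source') }}"
)

# Two renders of the exact same template text can yield different SQL.
print(template.render())
print(template.render())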

Can we do better?

The issues described above are challenges that Jinja SQL developers must grapple with in addition to the typical hurdles encountered when using any other programming language.

“But what's the alternative?” one might ask. Can we do any better than merely stitching pieces of text together?

The alternative, I believe, is to extend the SQL language itself with native support for metaprogramming constructs that enable direct invocation of functions implemented in Python (or another programming language).

For instance, the example from the beginning of the post could look like this:

SELECT *
FROM some_table
@WHERE(<condition>) col = value

Now, in this example, the WHERE clause will only be added to the query if the <condition> evaluates to true. What's noteworthy here is that we now have a clear expectation of the output: it will either include the WHERE clause or not, depending on the result of the <condition>.

The same example could also look like this:

SELECT *
FROM some_table
@JOIN(<condition>) some_other_table
ON some_table.key = some_other_table.key

We achieve the same conditional semantics, but for the JOIN clause: the join will be added to the query only when the <condition> expression evaluates to true. Once more, we can easily reason about the code just by reading it.

What if instead of joining directly to the some_other_table table, we need to construct a subquery dynamically as part of a macro? This is where the Python function invocation comes into play:

from sqlglot import exp


@macro()
def get_subquery():
    return exp.select("*").from_("some_other_table").subquery()
    # Or just return a string: return "(SELECT * FROM some_other_table)"

The macro above can be invoked using the same @ syntax:

SELECT *
FROM some_table AS a
@JOIN(<condition>) @get_subquery() AS b
ON a.key = b.key

Now, if the <condition> evaluates to true, the final SQL statement will look as follows:

SELECT *
FROM some_table AS a
JOIN (SELECT * FROM some_other_table) AS b
ON a.key = b.key

Notice how we can immediately tell what can be returned by the get_subquery() macro without reading its contents, while the macro evaluator ensures that its return value is compatible with the @JOIN construct (e.g., a table identifier or a subquery).

The macro evaluator guarantees that each macro returns a syntactically correct SQL expression. Simultaneously, the parser handles things like trailing commas and other formatting issues.
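As a sketch of what enforcing such a contract could look like (this is an illustration built on sqlglot expression types, not the evaluator’s actual implementation):

from sqlglot import exp

def validate_join_target(node: exp.Expression) -> exp.Expression:
    # Accept only expressions that can legally follow a JOIN keyword.
    if not isinstance(node, (exp.Table, exp.Subquery)):
        raise TypeError(
            f"macro returned {type(node).__name__}, expected a table or a subquery"
        )
    return node

# A subquery built by a macro passes the check and renders back to SQL...
subquery = exp.select("*").from_("some_other_table").subquery()
print(validate_join_target(subquery).sql())  # (SELECT * FROM some_other_table)

# ...whereas a stray fragment, such as an ORDER BY expression, would be rejected.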

This approach enables us to reason about the code without diving into the implementation details of each macro used. It also makes the code base look clean since Python and SQL source files are clearly separated, and Python offers ways to modularize your implementation.

Additionally, one can leverage the vast Python ecosystem, including developer tools and best practices. For instance, developers can create unit tests for macros using libraries like unittest or pytest. This also includes the standard Python package manager, which can be used to distribute libraries of macros, thereby eliminating the need for a custom package management solution.
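For example, a couple of pytest tests for the get_subquery() macro shown earlier could look roughly like this (my_macros is a hypothetical module, and the macro is assumed to be callable as a plain Python function):

from sqlglot import exp

from my_macros import get_subquery  # hypothetical module containing the macro


def test_get_subquery_returns_a_subquery():
    assert isinstance(get_subquery(), exp.Subquery)


def test_get_subquery_renders_the_expected_sql():
    assert get_subquery().sql() == "(SELECT * FROM some_other_table)"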

You can find more details about this approach here.

Conclusions

The challenges discussed in this post are mere highlights within the broader array of issues associated with using Jinja SQL. The majority of sections in the Using Jinja guide outline potential problems that would not have arisen had Python and SQL been used separately.

What's worse is that Jinja SQL fails to deliver on its promise of simplicity. The language ended up being just as complex as Python, while also introducing its own set of gotchas and specialized know-how, which makes it harder to debug and maintain.

An alternative solution to metaprogramming in SQL is to keep Python and SQL separate while creating a bridge between the two through a small extension of SQL itself. This approach would help maintain a clean codebase while also leveraging the vast ecosystem of tools and best practices that have been developed for Python over decades.

I’d like to learn more from the community about other common issues encountered when using SQL with Jinja. Additionally, I would greatly appreciate hearing people's thoughts on the proposed solution and any ideas for potential alternatives. Feel free to start a conversation in Slack or contact me directly via email.