Tobiko

Project Variables in SQLMesh: Non-obvious Challenges

Starting from version 0.91.0 of SQLMesh, you can define project-wide variables and use them in model definitions.

The project-wide variables are defined in configuration:

config.yaml

And then used in a model like this:

my_model.sql

So what, you might think - dbt has had this feature forever.

And you’re right!

What makes this feature challenging to implement (and therefore interesting) is SQLMesh’s unique approach to model versioning.

Model Versioning: Recap

Each time a user modifies a model and executes the sqlmesh plan command, SQLMesh generates a self-contained snapshot of that model and all its dependencies.

The snapshot represents a version of the model at a point in time and possesses properties that let SQLMesh replicate the exact behavior of this model version later.

Implementing this is straightforward when model definitions are self-contained, confined to one file, and do not have any shared configurations. But the situation changes drastically when project-wide configuration is shared by some models but not by others, which is precisely what project variables facilitate.

For example, if model_a and model_b reference the variable my_var, while model_c doesn’t, we’d only want to create new versions for the two models that reference the my_var when its value changes. But how can we tell whether a variable is used by any given model?

Static Code Analysis is Here for You

I bet you already know that SQLMesh's foundation is rooted in its ability to semantically understand and analyze SQL with the help of SQLGlot. I’d be surprised if you weren’t aware, since our team just can’t stop talking about it (here, and here, and also here)!

SQLMesh not only parses SQL but also extends each supported dialect with additional mechanics that allow syntax like @my_var in the first place. As a result, the model query above ends up with the following tree representation:

my_var_ast

By traversing the tree, SQLMesh can easily determine whether any given model references a variable with the given name and only creates new versions for affected models when the variable’s value changes.

What About Python Code?

SQL code is not the only place where a user can reference a variable. Variables can also be referenced in Python models and macros.

For example, referenced variables can be listed as arguments to the Python model’s entry point function:

my_model.py

Or you can access variables directly by name via context - for example, context.var(“my_var”).

In either case, SQLMesh detects all referenced variables in order to correctly version the model.

Fortunately, what worked for SQL also works for Python, even though Python’s syntax tree is a lot more complex.

SQLMesh parses the code of Python models and macros, analyzes function signatures, and searches for context.var invocations.

SQLMesh can determine which variables are referenced in the function signature just by looking at the argument names.

The context.var case, though, is troublesome since the variable name must be determined at parsing time. And a user could pass any Python expression that returns a variable name to context.var.

SQLMesh handles this by evaluating the expression found in the first argument to the context.var method. This approach imposes certain limitations on the types of expressions that can be passed. Variables and functions from the global scope can be used, but any expression involving a local variable or function will fail.

For example, this works:

var_name_global_scope.py

But not this:

var_name_global_local.py

Yes, and Jinja as Well!

Finally, the last place where variables can be referenced is Jinja macros. SQLMesh traverses the Jinja syntax tree in a similar fashion to how it does for SQL and Python.

It has the same limitations as well, namely that variable names must be derivable statically at parsing time.

Conclusion

When versioning each individual model separately, something as simple as project-wide variables can become quite complex.

To version models correctly and provide a satisfying user experience, SQLMesh performs static analysis across three different syntax trees: SQL, Python, and Jinja. This is necessary to identify which variables are referenced by any given model.

Variable names must be known at parsing time, which imposes certain constraints on the level of dynamism that can be employed when constructing a variable name in Python/Jinja.

Try this out and tell us if this solves problems you care about in our Slack community! We want to build with people who share our passion for data, SQL, and compilers.