Are you using dbt and have heard of SQLMesh, but aren't quite sure what it is?
You are in luck, because this article is for you! We will teach you the basics of SQLMesh from a dbt user's perspective to help get you up to speed.
We talk about the entire data transformation workflow. In each section, we discuss equivalencies between a dbt concept/tool and its SQLMesh counterpart, note any caveats or differences between the two, and list related SQLMesh features and how they can be beneficial.
This is the second post in the series. The first post covered the components used to create a SQLMesh project - this post is on SQLMesh project development and execution. We recommend reading the first post...first.
Model development and execution
The data transformation rubber meets the road when we actually load and run our models. One of the primary differences between dbt and SQLMesh is the development and execution flow.
Environments are isolated namespaces that allow you to make and test changes to your models before deploying them to your real business operations (into “production”).
dbt environments are separate targets within a connection profile. Each environment is isolated, so your changes in one environment have no relationship with the others. That separation means the entire project must run for each environment; computations done in one environment must be repeated when applied to a new environment or deployed to production.
Environments serve the same purpose in SQLMesh - a way to make and test changes to a project’s code. However, there are two important differences from dbt:
SQLMesh analyzes the specific changes made to the code and only reruns models whose data becomes invalid due to the change, not the entire project.
SQLMesh uses virtual environments when deploying from one environment to another. Instead of re-running the entire project in the new environment, it safely shares already existing tables from other environments into the new one (where possible). Again, only models affected by changes made will require re-execution in the new environment. Check out the above link if you'd like to know the details.
Environments are not pre-defined in a profile. They are dynamically created using
For these reasons, SQLMesh environments can be created quickly and cheaply. While they can still be used like dbt environments, where each developer has their own static environment they use for all their changes, it is now possible to create new environments for each feature or issue.
plan command is the primary tool for developing in SQLMesh, which allows you to understand the impact of your changes and choose how to apply them. It is a new concept not present in dbt. This section talks about what it is and how you use it.
When changing a model in dbt, you make the change and then rerun the project. You, the developer, are responsible for understanding how your change affected the model and all downstream models. As more and more models are added to a project, this becomes a harder and harder task; cross your fingers that you don't receive that 2am call that an important BI dashboard is down.
SQLMesh adds the
plan step, which reviews the affects of the change and prompts for how to apply it to the environment.
When you issue the
sqlmesh plan [environment name] command, SQLMesh compares your local project files to the state of the environment. For each changed model, a list of downstream models that directly or indirectly
SELECT from it, a code diff of the change, and the impact of the change. The impact is classified as either
breaking (downstream models are impacted) or
non-breaking (downstream models are not impacted).
After you examine the implications, and are ready to continue, you can
apply the changes to the specified environment. Optionally, you can choose what timeframe to backfill incremental models, to limit the amount of computation to only what is needed to validate the changes.
For example, adding a column is
non-breaking because no existing columns or data are affected.
WHERE clause is
breaking because the existing table contains rows that the change now filters out. SQLMesh can automatically determine whether changes are breaking on a best effort basis, or you can do it manually for fine-grained control.
This example from the SQLMesh quickstart guide shows a
plan where the user directly modified the
sqlmesh_example.incremental_model (lines 5-9) by adding a column
new_column (line 19).
non-breaking change indirectly modified the downstream
sqlmesh_example.full_model (lines 23-25).
Example 1: SQLMesh plan from the quickstart guide
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
From the dbt node selection docs:
dbt runexecutes all of the models in the dependency graph;
dbt seedcreates all seeds,
dbt snapshotperforms every snapshot. The
--selectflag is used to specify a subset of nodes to execute.”
dbt run and its analogues run everything unless you specify a subset of models/seeds/snapshots with the
--select flag. Depending on the specific project, this could mean that most of the executed computations aren’t needed at all.
SQLMesh makes a distinction between applying new changes (
sqlmesh plan) and computing the latest data for an environment (
sqlmesh run). This section will focus on the latter.
sqlmesh run [environment] combines
dbt run with the concept of cron to automatically determine which models need to run and only runs those models.
Unlike dbt, which runs the local file models, SQLMesh runs the models in the specified environment. The local files do not correlate to what gets run. For example, with SQLMesh you could do
sqlmesh run staging followed by
sqlmesh run production to compute the models in the staging environment and then the models in the production environment.
If you want to propegate model changes to an environment, use
sqlmesh plan. If you want to compute the latest data for an environment, use
SQLMesh does not have separate
snapshot commands. Every model, regardless of type, scheduled to run, will run and nothing else. Of course
seed models on need to be computed when changed and thus execute via
Cron and model scheduling
Coordinating when to run a specific model or set of models requires external tooling in
dbt. SQLMesh has
cron built in to solve this for you via the
cron field available in model configuration.
Consider a model whose
daily. The first time you issue
sqlmesh run today, SQLMesh will detect that the model has not run since yesterday and execute the model. If you issue
sqlmesh run again today, SQLMesh knows that the model has already run and will not execute it again.
Different models can use different
cron parameters, depending on how frequently they need to execute. Every time you issue
sqlmesh run, SQLMesh will determine which models are due and run only those models.
You can execute
sqlmesh run yourself or with the native Airflow integration. If running it yourself, a sensible approach is to use Linux’s
cron tool to execute
sqlmesh run on a cadence at least as frequent as your briefest SQLMesh model
cron parameter. For example, if your briefest model’s
cron is hour, your
cron tool should issue
sqlmesh run at least every hour.
Nightmare scenario: model runs fail, data doesn’t load, and your CEO comes calling about their stale dashboard.
SQLMesh provides notifications via email or Slack when runs start, end, or fail. That way you know about and fix issues before the angry messages start flying.
YOLO is great for parties and terrible for data engineering. Testing your data and code helps you avoid the dreaded YOFO (you’re only fired once)!
dbt tests are an excellent practice! But the name is misleading in the context of software engineering, where “tests” evaluate something about code (not data). Therefore, in SQLMesh these evaluations are called audits - they audit whether existing data meet the specified criteria.
SQLMesh also has tests, which follow the software engineering concept of testing the code. These take the form of “unit tests” for your models.
You specify the data input values and the expected output values, and SQLMesh runs the tests. The tests run locally with DuckDB by default, or you can optionally run them in your data warehouse. SQLMesh tests run as part of
sqlmesh plan, informing you if you have a problem before any transformations take place. They are great at checking expected results and edge cases.
Another way to make sure your code is behaving as expected is SQLMesh's table diff tool. Taking inspiration from the open-source data-diff Python library,
table_diff compares the data in two tables or models and summarizes the difference between them.
With data diff, you can quickly verify your changes had the intended effect to the model's columns, rows, and row data.
Summary of SQLMesh workflow
Taking everything we've learned in this post, this is what the high-level SQLMesh workflow looks like:
Make a change to a model.
sqlmesh plan devto:
Automatically verify SQL syntax and run tests to ensure your SQL logic.
Review a summary of the data contract differences between your local project files and the environment and the downstream impact.
Apply the changes to the environment.
Review audit results and table diff output.
Merge changes into production, and execute
sqlmesh planto apply the changes to the production environment. This will be a virtual update and production will immediately have the latest changes.
sqlmesh run on a regular cadence to compute new data in DWH or SQLMesh's airflow integration to automatically do this for you.
SQLMesh dbt adapter
Good news if you already have a dbt project: SQLMesh can run dbt projects with its dbt adapter!
A tool for converting dbt project files into SQLMesh format is also planned.
This post described the basics of developing and executing SQLMesh projects from a dbt user's perspective - we hope it provided an on-ramp to help you understand how SQLMesh works and can be beneficial for your business.
If you find this series helpful, we'd be happy to continue it. Please join our Slack community and let us know what other topics you'd like to see us cover.
If you want to get going with SQLMesh, check out the quickstart guide for a brief end-to-end example in each of the three SQLMesh interfaces.