Are you using dbt and have heard of SQLMesh, but aren't quite sure what it is?
You are in luck, because this article is for you! We will teach you the basics of SQLMesh from a dbt user's perspective to help get you up to speed.
We talk about the entire data transformation workflow. In each section, we discuss equivalencies between a dbt concept/tool and its SQLMesh counterpart, note any caveats or differences between the two, and list related SQLMesh features and how they can be beneficial.
This is the second post in the series. The first post covered the components used to create a SQLMesh project - this post is on SQLMesh project development and execution. We recommend reading the first post...first.
Model development and execution
The data transformation rubber meets the road when we actually load and run our models. One of the primary differences between dbt and SQLMesh is the development and execution flow.
Environments
Environments are isolated namespaces that allow you to make and test changes to your models before deploying them to your real business operations (into “production”).
dbt environments are separate targets within a connection profile. Each environment is isolated, so your changes in one environment have no relationship with the others. That separation means the entire project must run for each environment; computations done in one environment must be repeated when applied to a new environment or deployed to production.
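For example, a dbt connection profile might define separate `dev` and `prod` targets. This is a minimal sketch using the DuckDB adapter; the project name, adapter type, and connection details are illustrative and will differ for your warehouse:

```yaml
# profiles.yml - each target is a separate, isolated environment
my_project:
  target: dev          # default target used when --target isn't specified
  outputs:
    dev:
      type: duckdb
      path: dev.duckdb
    prod:
      type: duckdb
      path: prod.duckdb
```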
Environments serve the same purpose in SQLMesh - a way to make and test changes to a project’s code. However, there are a few important differences from dbt:
- SQLMesh analyzes the specific changes made to the code and only reruns models whose data becomes invalid due to the change, not the entire project.
- SQLMesh uses virtual environments when deploying from one environment to another. Instead of re-running the entire project in the new environment, it safely shares already existing tables from other environments with the new one (where possible). Again, only models affected by your changes require re-execution in the new environment. Check out the above link if you'd like to know the details.
- Environments are not pre-defined in a profile. They are dynamically created with `sqlmesh plan`.
For these reasons, SQLMesh environments can be created quickly and cheaply. While they can still be used like dbt environments, where each developer has their own static environment they use for all their changes, it is now possible to create new environments for each feature or issue.
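For example, creating and populating a brand-new environment for a feature branch is a single command. This is a sketch; the environment name is arbitrary:

```bash
# Creates the environment if it doesn't exist, shows the plan,
# and prompts you to apply it. Only changed models are (back)filled.
sqlmesh plan my_feature_env
```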
Plan
SQLMesh’s `plan` command is the primary tool for developing in SQLMesh. It allows you to understand the impact of your changes and choose how to apply them. It is a new concept not present in dbt. This section covers what it is and how to use it.
Model changes
When changing a model in dbt, you make the change and then rerun the project. You, the developer, are responsible for understanding how your change affected the model and all downstream models. As more and more models are added to a project, this becomes a harder and harder task; cross your fingers that you don't receive that 2am call that an important BI dashboard is down.
SQLMesh adds the `plan` step, which reviews the effects of the change and prompts you for how to apply it to the environment.
When you issue the `sqlmesh plan [environment name]` command, SQLMesh compares your local project files to the state of the environment. For each changed model, it displays a list of downstream models that directly or indirectly `SELECT` from it, a code diff of the change, and the impact of the change. The impact is classified as either `breaking` (downstream models are impacted) or `non-breaking` (downstream models are not impacted).
After you examine the implications and are ready to continue, you can apply the changes to the specified environment. Optionally, you can choose what timeframe to backfill for incremental models, limiting the computation to only what is needed to validate the changes.
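For example, you might restrict the backfill to a recent window while validating a change in a development environment. This is a sketch; the date values are illustrative:

```bash
# Plan changes against the dev environment, backfilling only one week of data
sqlmesh plan dev --start 2023-01-01 --end 2023-01-07
```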
For example, adding a column is `non-breaking` because no existing columns or data are affected. Adding a `WHERE` clause is `breaking` because the existing table contains rows that the change now filters out. SQLMesh can automatically determine whether changes are breaking on a best-effort basis, or you can classify them manually for fine-grained control.
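To make that concrete, here is a sketch of the two kinds of edits against a hypothetical model query; the table and column names are illustrative:

```sql
-- Non-breaking: adding a column. Existing columns and rows are untouched,
-- so downstream models that don't reference the new column are unaffected.
SELECT
  order_id,
  order_date,
  amount,
  amount * 0.1 AS estimated_tax  -- newly added column
FROM raw.orders;

-- Breaking: adding a WHERE clause. Rows already present in the table would
-- now be filtered out, so downstream models must be recomputed.
SELECT
  order_id,
  order_date,
  amount
FROM raw.orders
WHERE order_date >= '2023-01-01';  -- newly added filter
```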
This example from the SQLMesh quickstart guide shows a `plan` where the user directly modified `sqlmesh_example.incremental_model` (lines 5-9) by adding a column `new_column` (line 19). That `non-breaking` change indirectly modified the downstream `sqlmesh_example.full_model` (lines 23-25).
Example 1: SQLMesh plan from the quickstart guide
Running models
From the dbt node selection docs:
"By default, `dbt run` executes all of the models in the dependency graph; `dbt seed` creates all seeds, `dbt snapshot` performs every snapshot. The `--select` flag is used to specify a subset of nodes to execute.”
`dbt run` and its analogues run everything unless you specify a subset of models/seeds/snapshots with the `--select` flag. Depending on the specific project, this could mean that most of the executed computations aren’t needed at all.
SQLMesh makes a distinction between applying new changes (`sqlmesh plan`) and computing the latest data for an environment (`sqlmesh run`). This section focuses on the latter.
`sqlmesh run [environment]` combines `dbt run` with the concept of cron to automatically determine which models need to run, and it runs only those models.
Unlike dbt, which runs the models in your local files, SQLMesh runs the models in the specified environment. The local files do not determine what gets run. For example, with SQLMesh you could issue `sqlmesh run staging` followed by `sqlmesh run production` to compute the models in the staging environment and then the models in the production environment.
If you want to propagate model changes to an environment, use `sqlmesh plan`. If you want to compute the latest data for an environment, use `sqlmesh run`.
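In shell terms, the day-to-day split looks something like this (the environment names are illustrative):

```bash
# Apply code changes (and any required backfills) to the dev environment
sqlmesh plan dev

# Compute the latest data for models that are due to run in production
sqlmesh run production
```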
SQLMesh does not have separate `seed` and `snapshot` commands. Every model that is scheduled to run, regardless of type, will run - and nothing else. Of course, `seed` models only need to be computed when they change, and thus they execute via `sqlmesh plan`.
Cron and model scheduling
Coordinating when to run a specific model or set of models requires external tooling in dbt. SQLMesh has cron built in to solve this for you via the `cron` field available in the model configuration.
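For example, a model's schedule is declared directly in its `MODEL` block. This is a minimal sketch in the style of the quickstart's example project; the model name, kind, and columns are illustrative:

```sql
MODEL (
  name sqlmesh_example.incremental_model,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date
  ),
  cron '@daily'  -- this model is only due to run once per day
);

SELECT
  id,
  item_id,
  event_date
FROM sqlmesh_example.seed_model
WHERE event_date BETWEEN @start_date AND @end_date;
```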
Consider a model whose `cron` is daily. The first time you issue `sqlmesh run` today, SQLMesh will detect that the model has not run since yesterday and will execute it. If you issue `sqlmesh run` again today, SQLMesh knows that the model has already run and will not execute it again.
Different models can use different `cron` parameters, depending on how frequently they need to execute. Every time you issue `sqlmesh run`, SQLMesh determines which models are due and runs only those models.
You can execute `sqlmesh run` yourself or with the native Airflow integration. If running it yourself, a sensible approach is to use Linux’s `cron` tool to execute `sqlmesh run` on a cadence at least as frequent as your briefest SQLMesh model `cron` parameter. For example, if your briefest model’s `cron` is hourly, your `cron` tool should issue `sqlmesh run` at least every hour.
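As a sketch, a crontab entry like the following would cover a project whose most frequent model cron is hourly; the project path, virtual environment path, and log location are placeholders:

```bash
# m h dom mon dow  command
0 * * * *  cd /path/to/sqlmesh_project && /path/to/venv/bin/sqlmesh run prod >> /var/log/sqlmesh_run.log 2>&1
```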
Notifications
Nightmare scenario: model runs fail, data doesn’t load, and your CEO comes calling about their stale dashboard.
SQLMesh provides notifications via email or Slack when runs start, end, or fail. That way you know about and fix issues before the angry messages start flying.
Tests
YOLO is great for parties and terrible for data engineering. Testing your data and code helps you avoid the dreaded YOFO (you’re only fired once)!
dbt tests are an excellent practice! But the name is misleading in the context of software engineering, where “tests” evaluate something about code (not data). Therefore, in SQLMesh these evaluations are called audits - they audit whether existing data meet the specified criteria.
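For example, a custom audit is just a SQL file whose query should return no rows when the data are valid. This is a sketch modeled on SQLMesh's audit format; the audit name and column are illustrative:

```sql
AUDIT (
  name assert_item_id_is_positive
);

-- @this_model resolves to the model the audit is attached to;
-- any returned rows cause the audit to fail.
SELECT *
FROM @this_model
WHERE item_id <= 0;
```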
SQLMesh also has tests, which follow the software engineering concept of testing the code. These take the form of “unit tests” for your models.
You specify the data input values and the expected output values, and SQLMesh runs the tests. The tests run locally with DuckDB by default, or you can optionally run them in your data warehouse. SQLMesh tests run as part of `sqlmesh plan`, informing you if you have a problem before any transformations take place. They are great at checking expected results and edge cases.
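A test is a small YAML file that pins a model's inputs and expected output. This is a sketch in the style of the quickstart's test file; the model names, columns, and values are illustrative:

```yaml
test_full_model:
  model: sqlmesh_example.full_model
  inputs:
    sqlmesh_example.incremental_model:
      rows:
        - id: 1
          item_id: 1
        - id: 2
          item_id: 1
        - id: 3
          item_id: 2
  outputs:
    query:
      rows:
        - item_id: 1
          num_orders: 2
        - item_id: 2
          num_orders: 1
```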
Another way to make sure your code is behaving as expected is SQLMesh's table diff tool. Taking inspiration from the open-source data-diff Python library, `table_diff` compares the data in two tables or models and summarizes the differences between them. With table diff, you can quickly verify your changes had the intended effect on the model's columns, rows, and row data.
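As a sketch, comparing one model between the prod and dev environments looks something like this; the model name is taken from the quickstart and is illustrative:

```bash
# Compare the dev environment's version of the model against prod's
sqlmesh table_diff prod:dev sqlmesh_example.incremental_model
```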
Summary of SQLMesh workflow
Taking everything we've learned in this post, this is what the high-level SQLMesh workflow looks like:
- Make a change to a model.
- Execute `sqlmesh plan dev` to:
  - Automatically verify SQL syntax and run tests to check that your SQL logic is correct.
  - Review a summary of the data contract differences between your local project files and the environment, along with the downstream impact.
  - Apply the changes to the environment.
- Review audit results and table diff output.
- Merge changes into production, and execute `sqlmesh plan` to apply the changes to the production environment. This will be a virtual update, and production will immediately have the latest changes.

Then use `sqlmesh run` on a regular cadence to compute new data in your data warehouse, or use SQLMesh's Airflow integration to do this for you automatically.
SQLMesh dbt adapter
Good news if you already have a dbt project: SQLMesh can run dbt projects with its dbt adapter!
A tool for converting dbt project files into SQLMesh format is also planned.
Summary
This post described the basics of developing and executing SQLMesh projects from a dbt user's perspective - we hope it provided an on-ramp to help you understand how SQLMesh works and can be beneficial for your business.
If you're looking for more information about SQLMesh and dbt, check out our comparison page and dbt-related FAQs.
If you find this series helpful, we'd be happy to continue it. Please join our Slack community and let us know what other topics you'd like to see us cover.
If you want to get going with SQLMesh, check out the quickstart guide for a brief end-to-end example in each of the three SQLMesh interfaces.