Tobiko

SQLMesh for dbt Users - Part 2

Are you using dbt and have heard of SQLMesh, but aren't quite sure what it is?

You are in luck, because this article is for you! We will teach you the basics of SQLMesh from a dbt user's perspective to help get you up to speed.

We talk about the entire data transformation workflow. In each section, we discuss equivalencies between a dbt concept/tool and its SQLMesh counterpart, note any caveats or differences between the two, and list related SQLMesh features and how they can be beneficial.

This is the second post in the series. The first post covered the components used to create a SQLMesh project - this post is on SQLMesh project development and execution. We recommend reading the first post...first.

Model development and execution

The data transformation rubber meets the road when we actually load and run our models. One of the primary differences between dbt and SQLMesh is the development and execution flow.

Environments

Environments are isolated namespaces that allow you to make and test changes to your models before deploying them to your real business operations (into “production”).

dbt environments are separate targets within a connection profile. Each environment is isolated, so your changes in one environment have no relationship with the others. That separation means the entire project must run for each environment; computations done in one environment must be repeated when applied to a new environment or deployed to production.

Environments serve the same purpose in SQLMesh - a way to make and test changes to a project’s code. However, there are two important differences from dbt:

  • SQLMesh analyzes the specific changes made to the code and only reruns models whose data becomes invalid due to the change, not the entire project.

  • SQLMesh uses virtual environments when deploying from one environment to another. Instead of re-running the entire project in the new environment, it safely shares already existing tables from other environments into the new one (where possible). Again, only models affected by changes made will require re-execution in the new environment. Check out the above link if you'd like to know the details.

  • Environments are not pre-defined in a profile. They are dynamically created using sqlmesh plan.

For these reasons, SQLMesh environments can be created quickly and cheaply. While they can still be used like dbt environments, where each developer has their own static environment they use for all their changes, it is now possible to create new environments for each feature or issue.

Plan

SQLMesh’s plan command is the primary tool for developing in SQLMesh, which allows you to understand the impact of your changes and choose how to apply them. It is a new concept not present in dbt. This section talks about what it is and how you use it.

Model changes

When changing a model in dbt, you make the change and then rerun the project. You, the developer, are responsible for understanding how your change affected the model and all downstream models. As more and more models are added to a project, this becomes a harder and harder task; cross your fingers that you don't receive that 2am call that an important BI dashboard is down.

SQLMesh adds the plan step, which reviews the affects of the change and prompts for how to apply it to the environment.

When you issue the sqlmesh plan [environment name] command, SQLMesh compares your local project files to the state of the environment. For each changed model, a list of downstream models that directly or indirectly SELECT from it, a code diff of the change, and the impact of the change. The impact is classified as either breaking (downstream models are impacted) or non-breaking (downstream models are not impacted).

After you examine the implications, and are ready to continue, you can apply the changes to the specified environment. Optionally, you can choose what timeframe to backfill incremental models, to limit the amount of computation to only what is needed to validate the changes.


Figure 1: Data Bowie makes many changes to SQLMesh models Data Bowie makes many changes to SQLMesh models

For example, adding a column is non-breaking because no existing columns or data are affected.

Adding a WHERE clause is breaking because the existing table contains rows that the change now filters out. SQLMesh can automatically determine whether changes are breaking on a best effort basis, or you can do it manually for fine-grained control.

This example from the SQLMesh quickstart guide shows a plan where the user directly modified the sqlmesh_example.incremental_model (lines 5-9) by adding a column new_column (line 19).

That non-breaking change indirectly modified the downstream sqlmesh_example.full_model (lines 23-25).

Example 1: SQLMesh plan from the quickstart guide

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
$ sqlmesh plan dev
======================================================================
Successfully Ran 1 tests against duckdb
----------------------------------------------------------------------
Summary of differences against `dev`:
├── Directly Modified:
│   └── sqlmesh_example.incremental_model
└── Indirectly Modified:
    └── sqlmesh_example.full_model
---

+++

@@ -1,6 +1,7 @@

SELECT
id,
item_id,
+  'z' AS new_column,
ds
FROM sqlmesh_example.seed_model
WHERE
Directly Modified: sqlmesh_example.incremental_model (Non-breaking)
└── Indirectly Modified Children:
    └── sqlmesh_example.full_model
Apply - Backfill Tables [y/n]: y
    sqlmesh_example.incremental_model ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0%  1/1  0:00:00

All model batches have been executed successfully

Virtually Updating 'dev' ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100.0%  0:00:00

The target environment has been updated successfully

Running models

From the dbt node selection docs:

"By default, dbt run executes all of the models in the dependency graph; dbt seed creates all seeds, dbt snapshot performs every snapshot. The --select flag is used to specify a subset of nodes to execute.”

dbt run and its analogues run everything unless you specify a subset of models/seeds/snapshots with the --select flag. Depending on the specific project, this could mean that most of the executed computations aren’t needed at all.

SQLMesh makes a distinction between applying new changes (sqlmesh plan) and computing the latest data for an environment (sqlmesh run). This section will focus on the latter.

sqlmesh run [environment] combines dbt run with the concept of cron to automatically determine which models need to run and only runs those models.

Unlike dbt, which runs the local file models, SQLMesh runs the models in the specified environment. The local files do not correlate to what gets run. For example, with SQLMesh you could do sqlmesh run staging followed by sqlmesh run production to compute the models in the staging environment and then the models in the production environment.

If you want to propegate model changes to an environment, use sqlmesh plan. If you want to compute the latest data for an environment, use sqlmesh run.

SQLMesh does not have separate seed and snapshot commands. Every model, regardless of type, scheduled to run, will run and nothing else. Of course seed models on need to be computed when changed and thus execute via sqlmesh plan.

Cron and model scheduling

Coordinating when to run a specific model or set of models requires external tooling in dbt. SQLMesh has cron built in to solve this for you via the cron field available in model configuration.

Consider a model whose cron is daily. The first time you issue sqlmesh run today, SQLMesh will detect that the model has not run since yesterday and execute the model. If you issue sqlmesh run again today, SQLMesh knows that the model has already run and will not execute it again.

Different models can use different cron parameters, depending on how frequently they need to execute. Every time you issue sqlmesh run, SQLMesh will determine which models are due and run only those models.

You can execute sqlmesh run yourself or with the native Airflow integration. If running it yourself, a sensible approach is to use Linux’s cron tool to execute sqlmesh run on a cadence at least as frequent as your briefest SQLMesh model cron parameter. For example, if your briefest model’s cron is hour, your cron tool should issue sqlmesh run at least every hour.

Notifications

Nightmare scenario: model runs fail, data doesn’t load, and your CEO comes calling about their stale dashboard.

SQLMesh provides notifications via email or Slack when runs start, end, or fail. That way you know about and fix issues before the angry messages start flying.

Tests

YOLO is great for parties and terrible for data engineering. Testing your data and code helps you avoid the dreaded YOFO (you’re only fired once)!

dbt tests are an excellent practice! But the name is misleading in the context of software engineering, where “tests” evaluate something about code (not data). Therefore, in SQLMesh these evaluations are called audits - they audit whether existing data meet the specified criteria.

SQLMesh also has tests, which follow the software engineering concept of testing the code. These take the form of “unit tests” for your models.

You specify the data input values and the expected output values, and SQLMesh runs the tests. The tests run locally with DuckDB by default, or you can optionally run them in your data warehouse. SQLMesh tests run as part of sqlmesh plan, informing you if you have a problem before any transformations take place. They are great at checking expected results and edge cases.

Another way to make sure your code is behaving as expected is SQLMesh's table diff tool. Taking inspiration from the open-source data-diff Python library, table_diff compares the data in two tables or models and summarizes the difference between them.


Figure 2: SQLMesh table_diff output for a comparison of 'incremental_model' across prod and dev environments SQLMesh table_diff output for a comparison of 'incremental_model' across prod and dev environments

With data diff, you can quickly verify your changes had the intended effect to the model's columns, rows, and row data.

Summary of SQLMesh workflow

Taking everything we've learned in this post, this is what the high-level SQLMesh workflow looks like:

  1. Make a change to a model.

  2. Execute sqlmesh plan dev to:

  3. Automatically verify SQL syntax and run tests to ensure your SQL logic.

  4. Review a summary of the data contract differences between your local project files and the environment and the downstream impact.

  5. Apply the changes to the environment.

  6. Review audit results and table diff output.

  7. Merge changes into production, and execute sqlmesh plan to apply the changes to the production environment. This will be a virtual update and production will immediately have the latest changes.

Then use sqlmesh run on a regular cadence to compute new data in DWH or SQLMesh's airflow integration to automatically do this for you.

SQLMesh dbt adapter

Good news if you already have a dbt project: SQLMesh can run dbt projects with its dbt adapter!

A tool for converting dbt project files into SQLMesh format is also planned.

Summary

This post described the basics of developing and executing SQLMesh projects from a dbt user's perspective - we hope it provided an on-ramp to help you understand how SQLMesh works and can be beneficial for your business.

If you're looking for more information about SQLMesh and dbt, check out our comparison page and dbt-related FAQ questions.

If you find this series helpful, we'd be happy to continue it. Please join our Slack community and let us know what other topics you'd like to see us cover.

If you want to get going with SQLMesh, check out the quickstart guide for a brief end-to-end example in each of the three SQLMesh interfaces.