The False Promise of dbt Contracts

Over the past few years, dbt has led innovation in analytics workflows. It has enabled analysts to meaningfully contribute and deliver business value in a field that was previously owned by data engineers.

The limitations of dbt's architecture have become more apparent as its usage has grown. dbt was initially designed to be a simple build tool, but that simple foundation has kept it from meeting the modern demands of data teams, and it has had to push more complexity onto its users to compensate.

In its latest post, dbt announced that the number of large projects (> 1,000 models) is increasing and proposed two solutions: model contracts and model versions. On the surface, these seem like beneficial changes, but they don't address dbt's core architectural issues and instead push the burden of complexity onto its users.

Model contracts in dbt

Model contracts in dbt are manually defined schemas in YAML: you list column names and data types, and dbt uses them for validation. If a model's SQL doesn't match the declared schema, the build fails. This adds overhead for the user without addressing the fundamental problem.
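
For reference, a dbt contract looks roughly like the following (the model and column names are hypothetical, and the config is trimmed to the essentials):

    # models/schema.yml
    models:
      - name: dim_orders
        config:
          contract:
            enforced: true   # dbt validates the SQL against the declared columns at build time
        columns:
          - name: order_id
            data_type: int
          - name: order_total
            data_type: numeric

Every column now has to be declared twice, once in SQL and once in YAML, and the two copies have to be kept in sync by hand.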

As a model owner, if I change the columns or types in my SQL, it's usually intentional. Having the build fail because I forgot to update the contract isn't really useful, just annoying. What I really want is to understand all the downstream consumers of my model, whether my change is breaking or non-breaking for each of them, and then to easily migrate them. This is a fundamental feature in SQLMesh.

Data contracts in SQLMesh

SQLMesh doesn't make you duplicate information to get the benefits of contracts. Instead, through its native understanding of SQL, the model's SQL itself becomes the source of truth; there's no need to declare the schema a second time.

Simply write SQL and SQLMesh does the rest. It parses the model definition and understands that model 'bronze.a' has 'col_a' of type int and 'col_b' of type text.
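
As a rough sketch, such a model might look like the following (the expressions are illustrative, chosen to line up with the examples that follow):

    -- models/bronze/a.sql
    MODEL (
      name bronze.a,
      kind FULL
    );

    SELECT
      1 + 1 AS col_a,  -- inferred as an int
      'b' AS col_b     -- inferred as text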

When you make a change to model 'bronze.a', SQLMesh's semantic understanding of your entire project allows it to tell you whether or not the change is breaking for downstream models.

SQLMesh understands that changing '1 + 1' to '1 + 2' is a breaking change and prompts you to backfill the downstream models 'b', 'c', and 'd'.
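
Concretely, the edit in that example is a one-character change to the model's logic (again a sketch, reusing the hypothetical model above; the MODEL block is unchanged):

    -- models/bronze/a.sql
    SELECT
      1 + 2 AS col_a,  -- was: 1 + 1
      'b' AS col_b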

In dbt, a change like this wouldn't be detected because the schema hasn't changed: the columns and types are the same. SQLMesh, however, correctly identifies it as a breaking change. dbt's model contracts only take structure into account, not logic, and data consistency is only achievable with data contracts that consider both structure and semantics.

If only a column is added, SQLMesh detects that the change is non-breaking and that no backfilling is needed, because nothing downstream uses the new column yet.
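
For example, appending a new column to the same hypothetical model is classified as non-breaking:

    -- models/bronze/a.sql
    SELECT
      1 + 2 AS col_a,
      'b' AS col_b,
      CURRENT_TIMESTAMP AS col_c  -- new column; nothing downstream selects it yet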

Additionally, the examples above use a multi-project setup. Even though the 'bronze' and 'silver' models live in separate repositories, SQLMesh is able to keep them in sync. dbt is preparing for multi-project deployments by adding contracts, but you'll only be able to operate within the scope of a single project and will have to manage state yourself.

Versions aren't the solution

Because dbt's model contracts don't actually help with migrating downstream dependencies or achieving data consistency, versions were introduced to let consumers of a model migrate at their own pace.

When versioning a model in dbt, you manually make a copy of an existing model and then register the version in a YAML file. Instead of directly addressing the problem of migrating downstream models, dbt provides a configuration workaround that adds tech debt and results in spaghetti code.
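
Roughly, that workflow looks like this (the model name is hypothetical; by default dbt resolves each version to a file suffixed with its version number):

    # models/schema.yml
    models:
      - name: dim_orders
        latest_version: 2
        versions:
          - v: 1   # built from dim_orders_v1.sql
          - v: 2   # built from dim_orders_v2.sql, a hand-maintained copy with the new logic

Downstream models then choose which version to reference, e.g. ref('dim_orders', v=1), and every version you keep around is another table that has to be built and stored.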

If a company adopts this feature, teams inevitably won't have the bandwidth to migrate off versioned tables, and you'll be left with a web of models depending on different versions of their parents. Additionally, you'll incur extra cost for every legacy version you keep maintaining because migration work was deferred.

Correctness without complexity

SQLMesh was built on the principle that correctness is non-negotiable. A transformation platform must scale as the number of models and contributors grows, but it can't do that by pushing complexity onto its users.

Painlessly handling complexity should be a first-class feature of the tools businesses rely on, and it's a core focus for SQLMesh.

We'd love to hear your thoughts on data contracts! Join our Slack channel and say hi!