
dbt + SDF: What Changes and What Doesn't

Last Tuesday, dbt Labs acquired SDF Labs, creators of a Rust-based SQL parser and transformation framework.

First and foremost, I want to congratulate the SDF team on a successful exit. We've collaborated on several issues and shared ideas on how best to do type inference and column-level lineage. They're a very bright group, and I wish them all the best.

Acquiring SDF - A surface-level fix for a deep-rooted problem

dbt™ rose to prominence at a time when its approach to data transformation—leveraging Jinjafied SQL—stood out as the best solution, largely because it was the only solution on the market. Over the past several years, newer SQL-aware transformation tools such as SDF and SQLMesh have surpassed dbt — reducing painfully slow compilation times and adding compile-time syntax checks that dramatically improve the developer experience.

dbt is fundamentally limited by its lack of state management — a constraint rooted in design decisions made and upheld since its founding. This stateless architecture makes it inherently difficult to build efficient and correct data transformation pipelines.

It came as little surprise to us that dbt would acquire the ability to parse SQL. After all, it made an earlier foray to patch this weakness last year, when it began using our parser, SQLGlot, for column-level lineage — taking capabilities from our open-source library to bolster its paid enterprise product.
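For reference, here is roughly what column-level lineage looks like with SQLGlot's open-source API; a minimal sketch where the query and schema are made up for illustration:

    # Minimal sketch: trace an output column back to its source columns.
    # The query and schema are illustrative; `lineage` is part of the
    # open-source sqlglot package (pip install sqlglot).
    from sqlglot.lineage import lineage

    sql = """
    SELECT o.order_id, c.region AS customer_region
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.customer_id
    """

    node = lineage(
        "customer_region",
        sql,
        schema={
            "orders": {"order_id": "int", "customer_id": "int"},
            "customers": {"customer_id": "int", "region": "text"},
        },
    )

    # Walk the lineage graph from the output column down to its sources.
    for n in node.walk():
        print(n.name, "<-", n.expression.sql())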

Acquiring SDF is another step in dbt's efforts to achieve parity with capabilities we already provide in SQLGlot and SQLMesh today, such as syntax checks and "ref-less" dependency detection (sketched below). In some respects, SDF will help — it was created by an impressive team and excels at what it originally set out to be: TypeScript for SQL. But we believe it won't fill dbt's foundational architectural gaps. These are the very gaps we built SQLMesh to address.
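To make "ref-less" dependency detection concrete, here is a minimal sketch using SQLGlot. The model query is invented, but the approach is the point: parse the SQL itself and extract the tables it reads from, so explicit ref() annotations become unnecessary.

    # Minimal sketch of "ref-less" dependency detection: parse a model's SQL
    # and extract its upstream tables, no ref() macros required.
    import sqlglot
    from sqlglot import exp

    query = "SELECT user_id, SUM(amount) AS total FROM staging.payments GROUP BY user_id"

    upstream = {table.sql() for table in sqlglot.parse_one(query).find_all(exp.Table)}
    print(upstream)  # {'staging.payments'}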

Easy to start, hard (very hard) to scale

We've talked to many data engineers who loved dbt when things were small and simple.

dbt is beautiful when you use it exactly as it was originally envisioned: fully refreshing every model on every run. However, as your data grows and your models multiply, this approach becomes unsustainably time-consuming and costly — not unlike razing your house to the ground and rebuilding it from scratch each time you need to fix a single light.

Figure 1: dbt downstream impact

This is where incremental models come into play, offering a targeted and far more efficient way to update only what's necessary. But running and maintaining incremental models in dbt quickly becomes messy and error-prone, a far cry from the simplicity that dbt initially promises. Incremental models are stateful by design: data is appended to them at a regular cadence. But because dbt keeps no state, the burden of managing it falls entirely on you.

You have to manually handle:

  • Finding which interval to run
  • Juggling between initial and incremental runs
  • Scalably backfilling data
  • Efficiently creating dev environments with representative data
  • Restating stale or late data
  • Manually selecting impacted downstream jobs to run
  • Dropping or altering tables by hand to reconcile changes

What was once a minimalistic workflow consisting of just dbt run becomes a maze of convoluted commands that require extensive tribal knowledge to execute correctly.
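To make the first few bullets above concrete, here is a hypothetical sketch of the interval bookkeeping that falls on you. None of this is dbt's API; it's the glue code teams end up writing and maintaining themselves:

    # Hypothetical glue code for manual interval tracking around dbt.
    # Nothing here is dbt's API; it's the bookkeeping you own because the
    # framework keeps no state about what has already been loaded.
    from datetime import date, timedelta

    def missing_intervals(loaded: set[date], start: date, end: date) -> list[date]:
        """Return every daily interval in [start, end) that has not been loaded."""
        days = []
        current = start
        while current < end:
            if current not in loaded:
                days.append(current)
            current += timedelta(days=1)
        return days

    # A gap on the 3rd (late data) plus the unloaded tail, both easy to miss by hand.
    loaded = {date(2025, 1, 1), date(2025, 1, 2), date(2025, 1, 4)}
    print(missing_intervals(loaded, date(2025, 1, 1), date(2025, 1, 7)))
    # [2025-01-03, 2025-01-05, 2025-01-06]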

Adding a GPS won't make a car drive faster

Acquiring SDF gives dbt the potential to improve certain aspects of the developer experience like syntax checking and compile times. But it's similar to adding a navigation system to a car. It's nice to have and will help you not get lost, but it doesn't make the car more powerful.

Figure 2: Misleading SDF benchmark

Figure 3: dbt's admission

dbt's marketing claims that the acquisition of SDF makes dbt 100x faster. This claim is misleading (by dbt's own admission):

  • The benchmarks shown in the graph compare SDF parsing raw SQL against a full dbt compile. Given how the integration actually works, a fairer comparison would add SDF's parse time on top of dbt compile's time, which means that integrating SDF as-is will in fact slow down dbt's overall compilation.
  • Compilation only makes up a small fraction of the time spent in a data pipeline relative to the core data processing that takes place in the warehouse. And because SDF replicates dbt's execution model, it can do very little to improve the overall workflow.

SQLMesh - raising the bar for efficient, scalable, and modern data transformation

On the other hand, SQLMesh was purpose-built as a stateful data transformation platform. Inspired by Terraform, we designed SQLMesh to be a declarative data pipeline framework that could scale with any company.

Understanding SQL and working with incremental models were not features we bolted on; they were built into SQLMesh's architecture from the ground up.

SQLMesh keeps track of which model version is in which environment, which date intervals each contains, what has changed, and what needs to run as a result of those changes. The vehicle for these capabilities is our Virtual Data Environments—isolated development spaces that allow production datasets to be reused for previewing changes before they're promoted to production. Virtual Data Environments make true blue/green deployments possible and ensure the same computation is never repeated.

Read more about Virtual Data Environments.
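As a rough sketch of the day-to-day workflow (assuming a SQLMesh project in the current directory), planning against a development environment computes exactly what changed and backfills only that; everything untouched is reused virtually from production:

    # Rough sketch of working in a Virtual Data Environment via SQLMesh's
    # Python API, assuming a SQLMesh project in the working directory.
    from sqlmesh.core.context import Context

    context = Context(paths=".")

    plan = context.plan("dev")  # diff local changes against the `dev` environment
    context.apply(plan)         # backfill only what changed; reuse the rest virtually

The CLI equivalent is sqlmesh plan dev.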

The impact of our features is clear in practice. When Harness migrated from dbt to SQLMesh, they saw a 30-40% reduction in BigQuery spend. SQLMesh's hyper-efficient evaluation model streamlined incremental loads and eliminated redundant recomputation.

Syntax checks and auto-complete are valuable, but modern data transformation must solve more strategic challenges, like managing state, automating incremental models, and scaling effortlessly with growing data - capabilities that go beyond convenience to drive steep cost reductions and deliver step-change workflow improvements.

Tipping the scale between innovation and exclusivity

dbt owes much of its success to its open-source roots and the vibrant community that supported its growth. But the SDF integration will not be open source, a decision that reinforces a trend of locking new functionality behind paywalls, including:

  • Deprecating dbt docs in favor of dbt Explorer
  • Gating column-level lineage, despite many open source tools offering it for free
  • Acquiring Transform to create a Cloud-only semantic layer

In our opinion, open core products thrive when the core experience remains robust and accessible to all, complemented by paid tiers for hosting, infrastructure, and advanced features tailored to enterprise needs such as security and observability. Our hope is that the features in Tobiko Cloud are a worthwhile investment for data orgs and enterprises. Over-prioritizing paid features at the expense of OSS risks alienating the very users who made the project successful in the first place.

Keeping open source at the core of Tobiko

At the core of SQLMesh lies SQLGlot, our multi-dialect SQL parser. It powers the semantic understanding that makes many of our features possible. Writing SQLGlot was a herculean task, but we've chosen to give every developer the ability to test, modify, and use it — no strings attached. We believe its open-source nature has been instrumental in its rise to become one of the most popular tools of its kind.
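As a one-line taste of what SQLGlot does, this example (adapted from its README) transpiles a DuckDB expression into Hive SQL:

    # Parse a DuckDB expression and re-emit it in Hive syntax.
    import sqlglot

    print(sqlglot.transpile("SELECT EPOCH_MS(1618088028295)", read="duckdb", write="hive")[0])
    # e.g. 'SELECT FROM_UNIXTIME(1618088028295 / POW(10, 3))'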

Figure 4: SQLGlot PyPI ranking

SQLGlot is among the top 1,000 most downloaded Python packages on PyPI and is used by many data products (Superset, Dagster, dbt Cloud, Ibis...).

A SQL transformation framework should understand SQL, and keeping such a foundational capability closed source would stifle many use cases and, ultimately, its adoption and potential. Many of the most successful compilers — CPython (Python), rustc (Rust), GCC (C), and V8 (JavaScript) — achieved widespread use and impact because they embraced open source, and we share the same commitment.

As a venture-backed company, we get asked whether Tobiko will follow a similar path to dbt and prioritize profitability at the expense of OSS like SQLGlot. The answer is no. SQLGlot's reach and applications now far transcend the data transformation space. Removing it from SQLMesh is not only impractical, it also goes against our principle of keeping products core to the developer experience available to all. We view open source as non-negotiable for the core of our technology, and essential to maintaining the balance between commercial success and the growth of our community.

Building a better open source data framework together

Look, I get it... building an open core business model is a tough game, and I don’t envy dbt’s current position. But I believe we can find a healthy balance between open source and commercial features.

SQLMesh already provides an incredibly powerful batteries-included experience. State, static analysis, lineage, efficiency, deployments. Tobiko Cloud makes it even better with managed infra, scheduling, observability, debugging, alerting, and security.

What I can promise you is that we love open source and appreciate every question, piece of feedback, and issue. Check our Slack and GitHub: you will be hard pressed to find another project with our level of responsiveness. We cherish community PRs and work quickly with you to get them merged. We understand how much work you put in, and we reciprocate with the same energy. That's what makes an open source ecosystem great.

If you want to experience this firsthand, we're excited to jump into the trenches with you. We know data engineering can't evolve through our efforts alone. We'll see you in our Slack community.