Fivetran Guest Blog: SQLMesh’s table_diff— Comparing Tables with Confidence

Fivetran shares how their analytics team uses SQLMesh's table_diff feature to validate transformations, streamline QA, and scale model development.

This post was contributed by Ezra Lee, a Data Architect at Fivetran, an industry-leading, cloud-based data movement platform, designed to offer organizations the ability to effortlessly extra, load, and transform data between a wide range of sources and destinations. Ezra has worked in data analytics since 2015. He is part of Fivetran’s Analytics Architecture team, where he builds analytics infrastructure and workflows. His team has deeply integrated SQLMesh’s table_diff for better reliability and scale.

‍

Ensuring that your transformations and data remain consistent is critical when working with SQL models. Whether you're refactoring pipelines, migrating tables, adding/removing columns from existing tables, fixing a table bug (e.g. your table has duplicate records per grain), or verifying data between development and production environments, SQLMesh's table_diff makes it easy to compare and diagnose discrepancies. Instead of checking row-by-row or conjuring ad-hoc SQL queries, table_diff automates the comparison process, surfacing mismatches and ways in which your data has changed.

This tool is especially useful for testing incremental updates, validating schema changes, and spot-checking potential data drift. By identifying key differences, table_diff has helped my team at Fivetran quickly pinpoint inconsistencies, resolve potential bugs, and make frictionless transitions between data models.

Why Table Comparison Matters

For stakeholders, data integrity directly impacts decision-making quality. When reports show unexpected numbers and dashboards display conflicting metrics, business trust in data erodes quickly. table_diff is a crucial quality assurance tool that allows us to uphold said integrity at the core of our data-driven decision-making processes.

It also brings immediate, day-to-day operational and productivity gains. Implementing table_diff in our workflows has streamlined testing and validation across environments. We spend less time conducting manual QA, and our model iterations have become substantially faster.

In this post, I’ll dive into a variety of use cases where our team relies on table_diff—and the benefits we have seen as a result.

Using SQLMesh `table_diff`

SQLMesh’s table_diff is run directly from the CLI, allowing users to compare tables efficiently. Its syntax follows a straightforward structure:

sqlmesh --ignore-warnings table_diff <table1>:<table2> -o <primary_key> 
--show-sample

Where:

<table1>:<table2> → The tables you want to compare (must include schema names).
-o <primary_key> → Specifies the primary key(s) that define unique records.
--show-sample → Displays sample mismatched records if differences are found.

For example, if you’re comparing a single primary key:

sqlmesh --ignore-warnings table_diff my_schema.table1:prod_schema.table1 -o id 
--show-sample

For multiple primary keys:

sqlmesh --ignore-warnings table_diff my_schema.table1:prod_schema.table1 -o id1 -o 
id2 -o id3 --show-sample

When working with models that already define grains (primary keys) in their configurations, the -o flag may not be necessary (see more info here). Before running in your Terminal, first ensure that you “cd” (change directory) into the correct project directory to avoid referencing the wrong tables. More on “cd” and Linux/zsh/bash commands here if you need a primer.

Interpreting Results (and Troubleshooting Caveats/Gotchas!)

Sample screenshot of table_diff output in my Terminal

After running table_diff, SQLMesh provides a structured breakdown of how two given tables (that you specify) compare. Here are some key terms:

FULL MATCH → All values match for records that successfully joined on the primary key(s).
PARTIAL MATCH → Some field values differ between the tables.
<schema.table_name> ONLY → Rows exist in one table but not the other.
pct_match → Percentage of matching field values for records that joined successfully.

These outputs help our team quickly uncover data discrepancies. That said, for the benefit of others adopting table_diff, it’s worth highlighting a few edge cases that may trip up even experienced users. Here are some of the usual suspects:

Negative PARTIAL MATCH values → This usually indicates NULL values in primary keys, which table_diff cannot handle. Solution: Use COALESCE() to replace NULLs before running the comparison.
Frequent PARTIAL MATCH results → Expected for tables with slowly changing dimensions (SCDs), auto-generated fields, or frequently updated timestamps. Check if these discrepancies are meaningful.
Tables without defined grains (primary keys) → table_diff won’t work unless unique keys exist for proper comparisons.

For instances where tables lack primary keys, a workaround (albeit an incomplete one) is to use set-based SQL operations to compare records manually. While not as robust as table_diff, this method can still detect general differences when primary keys are absent:

-- This returns the records in model1 that are NOT in model2. (aka set subtraction 
operation).
SELECT * FROM model1
EXCEPT DISTINCT
SELECT * FROM model2;

-- This returns the records in model2 that are NOT in model1. (aka set subtraction 
operation).
SELECT * FROM model2
EXCEPT DISTINCT
SELECT * FROM model1;

-- If model1 and model2 are identical, both the above queries will return no records.

Custom Workflows & Automation with SQLMesh `table_diff`

table_diff can be run on one table at a time—but doing so would fall far short of realizing its full value, especially for projects spanning multiple models. At Fivetran, we leverage Python automation to programmatically validate multiple tables within our SQLMesh projects.

Below is the approach I recently took. I wrote a Python script to programmatically run table_diff on a large number of tables and summarize the output results for me in a way that is compatible with our data warehouse and BI tools:

Identify tables to compare, either manually or through a generated list.
Validate primary keys, ensuring they exist and are correctly set.
Remove unsupported field types (e.g., STRUCT data types, which table_diff cannot handle.
Run table_diff using SQLMesh’s API for each table.
Store results in a Google Sheet or another tracking system for easy review.

Here’s a useful snippet of the Python script to programmatically iterate execution of table_diff on many tables listed in a Google Sheet. You’ll notice it logs schema differences (diff.schema_diff()) and records mismatches (diff.record_diff()), among other things.

Automating this process ensures consistent data validation across environments, reducing human error and improving model stability. Using table_diff via the Python API also enabled us to scale the power of table_diff by automating what would otherwise be a time-consuming manual process. (More time for fun stuff, woohoo!)

Beyond Refactoring: Everyday Use Cases

While table_diff is now essential for our large-scale refactors or table migrations, I find it can be equally useful for incremental updates to existing models. When adding/removing columns, adjusting business logic, or testing schema changes, table_diff provides immediate feedback on how modifications impact our data.

Common day-to-day use cases include:

Verifying changes after modifying a column’s logic → Ensure expected transformations are applied correctly.
Checking for unintended data loss after schema updates → Detect missing or misaligned records.
Comparing development and production models → Validate that staging changes won’t introduce unexpected issues in production.
Ensuring consistency in incremental models → Confirm that partitioning, clustering, and uniqueness constraints work as expected.
Debugging an existing table (data discrepancy/validation) → To illustrate with an example: if you have duplicate records per grain, you’ll want to verify after your code changes that your model outputs are now indeed unique at the specified grain. table_diff can indicate what specifically is different about the two tables–and whether you fixed the bug correctly. It would output or flag the duplicate records in the original table, but not in your fixed and improved table.

`table_diff`: combining technical and business wins

The impact of table_diff extends well beyond its technical capabilities. It has helped my team operate more efficiently and build greater trust in our data at every level.

Integrating table_diff into our data engineering workflows has shortened debugging time and given us the ability to catch issues early, quickly validate transformations, and iterate on models with substantially less friction. It has played a key role during migrations and platform upgrades—we have higher confidence that inconsistencies and breakage will not catch us off guard.

Running table_diff has become a routine part of the model development workflow within Fivetran’s analytics organization—so much so, we have built in-house IDE extensions to make table_diff even easier and faster to access and execute.

Whether you're tackling a full-scale refactor or making a small but impactful model update, table_diff is an invaluable tool to validate your work.