Tobiko - Are these SQL queries the same?

How can you tell if two SQL queries are the same, producing the same results given the same input?

Checking that two statements are textually equivalent doesn't work well because it would mistakenly include unimportant differences such as white space, casing, and comments.

These four queries return identical results, but they are not textually equivalent:

SELECT price + 2 FROM table WHERE 1 == 1;

-- White space
SELECT price+2
FROM table
WHERE 1 == 1   ;

-- Casing
select PRICE + 2 from table WHERE 1 == 1;

-- Comments
SELECT price + 2 /* comment */ FROM table WHERE 1 == 1;

Unimportant differences can be more nuanced than just formatting. The query can be syntactically different yet still be semantically identical - different ways of writing the same underlying query:

-- This is equivalent to the original query!
SELECT price + (1 + 1) FROM table;

Why does it matter?

If you run SQL queries in data pipelines, it's useful to know when there are meaningful changes to queries so that you can appropriately rerun them. If a query hasn't changed, you may not need to rerun anything.

Most transformation tools do naive text comparisons (if anything at all) when checking if two queries are the same. In this post, I'll explain how you can use SQLGlot to accurately determine whether or not queries are equivalent. This approach is used by SQLMesh, an efficient data transformation framework, to automatically minimize the number of SQL queries run whenever you deploy changes.

At its core, accurately comparing queries requires eliminating all the ways the same underlying semantics (what the query "does") could be written differently. This elimination consists of multiple steps.

Parsing SQL

The first step towards intelligently comparing SQL queries is understanding what each one means. We can achieve that by converting the text into an abstract syntax tree (AST) with SQLGlot:

>>> from sqlglot import parse_one
>>>
>>> query = parse_one("""
... SELECT
...   a, -- comment
...   b,
... FROM table
... """)
>>>
>>> query
Select(
  expressions=[
    Column(
    this=Identifier(this=a, quoted=False),
    _comments=[
        comment]),
    Column(
    this=Identifier(this=b, quoted=False))],
  from=From(
    this=Table(
    this=Identifier(this=table, quoted=False))))

Parsing SQL into an AST eliminates many types of meaningless textual differences. With the AST in hand, we can use SQLGlot to generate a canonical SQL statement with all of the whitespace stripped and consistently formatted:

>>> query.sql()
'SELECT a /* comment */, b FROM table'

Canonicalization

Although parsing removes many of the textual differences, it's possible to do better!

SQLGlot has a library full of "optimizations" which aim to further canonicalize SQL. These optimizations aren't for improving query execution speed; they are for streamlining a query into a more condensed form:

>>> from sqlglot.optimizer import optimize
>>> query = optimize(
...     """
...     SELECT * EXCEPT ("gender"),
...     PRICE * (1 - 0.1) AS discounted_price
...     FROM table
...     WHERE
...         1 = 1 AND (country = 'CAD' AND age > 10) OR country = 'USA'
...     """,
...     schema={
...         "table": {
...             "price": "int",
...             "country": "string",
...             "age": "int",
...             "gender": "string",
...         },
...     },
... )
>>> print(query.sql(pretty=True))
SELECT
  "table"."price" AS "price",
  "table"."country" AS "country",
  "table"."age" AS "age",
  "table"."price" * 0.9 AS "discounted_price"
FROM "table" AS "table"
WHERE
  (
    "table"."age" > 10 OR "table"."country" = 'USA'
  )
  AND (
    "table"."country" = 'CAD' OR "table"."country" = 'USA'
  )

This example showcases many transformations that convert the original query into a standard, condensed form that enables accurate query comparisons.

Some of the optimizations shown include

Lower casing unquoted identifiers (PRICE -> price)
Adding aliases to tables (FROM table -> FROM table AS table)
Expanding stars (SELECT * EXCEPT -> SELECT price, …)
Adding tables and alias to columns (price -> table.price as price)
Mathematical simplification (price * (1 - 0.1) -> price * 0.9)
Boolean simplification (removed WHERE 1=1)
Conjunctive normal form conversion ('CAD' AND age > 10 OR country = 'USA' -> (age > 10 OR country = 'USA' ) AND (country = 'CAD' OR country = 'USA'))

Once the optimizations have been applied, meaningless differences have been removed and the queries can be accurately compared.

Non-Breaking Changes

By accurately comparing queries, you can avoid expensive data rebuilds whenever the text of a SQL query has changed but the queries remain semantically identical and will return the exact same results.

Using this framework, you can go even further by detecting if a set of changes are breaking or non-breaking.

For example, if the changes to a project just add a column to one SQL query, you don't need to rebuild the query's downstream dependencies because none of them actually use the new column. The data that already exists would be unchanged by a rebuild, making the rebuilds an unnecessary waste.

Conclusion

Understanding whether two SQL queries are the same can save you a lot of time and money.

Doing so accurately without truly understanding SQL is not really feasible. You can leverage SQLGLot as a framework to convert SQL into a standardized form or use SQLMesh (which has this functionality built in).

If you're interested in data, SQL, or just wanna chat we'd love to meet you. Please join us in our growing Slack community to say hi!