In my previous post about Semantic Diff for SQL, I mentioned that one of the use cases for understanding the difference between SQL queries is to automatically detect functional changes (as opposed to cosmetic or structural ones) and assess their downstream impact.
In this post, I explain why I built this and share details on how it works.
Why categorize changes?
When a user changes the behavior of a model, the change may or may not have impact on models that are downstream from it. In the former case, downstream models need to be re-computed alongside the changed model to propagate changes correctly, while in the latter case only the directly modified model needs to be re-computed.
Without distinguishing between these two scenarios, users are faced with a dilemma of either constantly re-computing downstream models or never updating them at all. Hence the trade-off between efficiency and correctness. Anything in between requires careful manual intervention on the user’s part, which is time-consuming and error-prone.
Categorizing changes into breaking and non-breaking allows SQLMesh to reduce the amount of necessary recomputation to a minimum while guaranteeing correctness. For example, adding a new column is not considered to be a breaking change, but removing or modifying the existing one is.
Figure 1: Breaking vs. non-breaking changes
Presently, the only kind of change that is being categorized as non-breaking is adding new projections to the query’s
SELECT statement. The implementation is very conservative about all other kinds of changes and categorizes them as breaking by default. Over time, support for additional types of non-breaking changes will be added to the platform.
It can be argued that extending the projection list is not always a non-breaking change. For example, if a downstream query uses
SELECT *, it may very well be impacted in a breaking way.
Fortunately, this is not an issue for SQLMesh, which always expands
* projections and replaces them with actual column references with the help of the SQLGlot optimizer.
Additionally, if users don’t like a category assigned to their changes by SQLMesh, they can always override and set it manually.
The output of the SQLGlot diff function is a so-called “edit script” - a list of operations on individual AST nodes. Applying these operations in order to the source SQL expression transforms it into the target SQL expression. The implementation can return five types of edit operations:
Insert - a new node should be inserted
Remove - an existing node should be removed
Move - an existing node should be moved to a different place in the tree
Update - the value of an existing node needs to be updated
Keep - an existing node is unchanged
Since the only type of change that will be categorized as non-breaking is addition of a new projection, we can safely assume that the change is breaking if the output edit script contains any operation other than Insert or Keep.
New items in the SELECT’s projections list are not always non-breaking. For example, some SQL functions (eg. explode in Spark SQL) may impact the cardinality of rows returned by a query, in which case the change should be considered breaking. This is why we need a special check for whether the inserted node references a User Defined Table Function (UDTF) which may return 0, 1, or multiple rows.
The rest of the algorithm is trivial: for each inserted node, the parent of the node should either be another newly inserted node (to support insertion of complex expressions) or a
The whole algorithm can be expressed in a few lines of Python code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
In the future, additional heuristics will be added to the algorithm to automatically detect other kinds of non-breaking changes. For example, changing the order of boolean expressions chained by the
AND operator can also be treated as non-breaking.
Additionally, we’re looking into using the column-level lineage in SQLMesh to categorize changes per each impacted downstream model individually.
This will allow an even finer balance between correctness and efficiency, since changes like removing a column that is not referenced downstream will no longer be categorized as breaking.
Figure 2: Column-level change categorization
Join our Slack channel to follow this work and other exciting developments that are happening in the world of data transformation.