Until recently, AST was just another acronym I hadn’t memorized. It was never relevant to me in any of my prior roles. But after digging deeper, it made sense why I had never heard of it before and why I wish I had known it sooner. Before we jump in, let’s define what AST means in the context of software engineering.
AST stands for abstract syntax tree. The team here at Tobiko Data has written a couple different blogs regarding the topic but I want to convey my journey in understanding ASTs and their purpose. It’s fair to say that my past experiences never warranted me to think this deeply about data structures.
So what exactly are ASTs and why are they important? At a high level, databases and query engines utilize AST representations of SQL queries to insert, update, delete, and fetch data records. These representations are produced in a tree-like data structure, which SQLMesh uses to extend additional quality-of-life features that improve the user experience and boost productivity.
While knowing how these things work on an atomic level can be great, I always felt like these were unnecessary details taking up brain space for me. It’s like knowing what B-Tree indexes in postgresql are doing under the hood when really I just want to know when and how to leverage it. So when I first came across ASTs, I wrote them off as another esoteric concept someone was shilling.
I later realized it wasn’t just some new fad. Let’s dive into an example and break it down in the context of a SQL model.
SQLMesh parses the SQL statement and creates an AST. The AST is then converted into a canonical SQL statement. SQLMesh uses the canonical query to determine if a set of changes are breaking or non-breaking. This makes sense but why do I really care about a canonical query when I thought my original query was sufficient? We’ll come back to this.
Let’s take a real-life example this time. You’re an analyst at a B2B software company. Your organization uses Salesforce to track opportunities in your pipeline and you just had a killer quarter! How exciting?!
However, leadership is dumbfounded because they weren’t even forecasting close to what ended up happening. You’re tasked with figuring out what really happened.
A couple of things you might do:
- Ask the leadership to give you new probabilities for stages in the sales cycle and backtest those against the last 8-12 quarters.
- Find the opportunities that were anomalies and just update those in your dataset.
- Create a new probability column in your dataset and start tracking that moving forward for all opportunities.
- Do nothing and see if it happens again. Sales teams have been known to be lucky, right?
For the sake of exploration and to keep this blog short, let’s just pick one from above. We will create a new probability column based on what leadership has suggested and start tracking that moving forward for all opportunities.
You open up SQLMesh and update your model to the following.
When we add a new column, SQLMesh takes the SQL and creates an AST, which can be represented as the following:
Figure 3. Initial AST.
Figure 4. New AST after the change.
SQLMesh parses this query and recognizes we have added a column. In this case our AST for the query has been updated and no longer matches the previous version. SQLMesh will automatically detect the new column you’ve added and will classify this as a non-breaking change. But how does it know it is a non-breaking change? Well, that’s where SQLMesh does the hard work for you! It will look at downstream models and understand if models use a specific column and if something needs to be recomputed to reflect the change properly.
This is super important
! Other tools forced me to choose between efficiency and correctness. In this case, SQLMesh won’t require me to rebuild or refresh this whole data pipeline. It recognizes that none of my downstream models are using this new column and therefore don’t need to be rebuilt.
Without AST, you just have a blob of text. Other tools just deal with blobs of SQL, and so if white space changes, the whole world has changed according to them. There's no real understanding of what that blob of text is
.
Instead of treating things as a black hole of SQL, SQLMesh understands that this is a query, that queries x tables, selects y columns, etc.
My old mental model of how to ship changes had become irrelevant. I used to worry about having to wait hours for my models to refresh and update downstream. But after learning about SQLMesh mechanics, I realized I didn’t have to be so scared of making changes, especially in crunch times.
If I have any non-breaking changes, then we only need to re-compute the directly modified models. Again, SQLMesh handles this logic for you and executes only what is necessary. So I can apply these changes, and I have the choice to backfill or not. In this case, I don’t want to backfill values and instead only apply changes going forward. The reason I don’t want to backfill is because the new logic doesn’t apply to my old records and backfilling it would only cause more confusion. This is where forward only can come in handy.
Figure 5. Breaking Change in SQLMesh Plan.
Another scenario could be that we want to update the current definition and not add a new column. SQLMesh again will recognize this has a data impact. We could recompute all previous values or just choose to have new values adopt the updated definition. This is probably not an option I would choose due to the complexity of tracking when I made this change.
In the past, I would lazily rebuild a whole data pipeline without thinking about it. I was comfortable with this because most of the datasets were very small (<10 GBs) and the costs associated with them were minimal. Still, I wasn’t thinking about the other implications that could happen as we scaled our project.
When I first read about ASTs and how they worked, I was a little confused and thought it was a little woo woo. But after working in SQLMesh and actually getting my hands dirty, it makes so much sense. There was definitely a mindset change to embrace this workflow but now I feel confident about making big changes to models. With SQLMesh I still get to focus on making the changes required for the business but I let SQLMesh take care of ensuring it’s done efficiently and intelligently. I get to be lazy and efficient which is a win-win for me.
This semantic understanding of SQL by SQLMesh starts to compound with its virtual data environment feature. Next time, I’ll write about how easy it is to promote these changes between environments without incurring additional warehouse costs!
If you're interested in data, SQL, or just wanna chat we'd love to meet you. Please join us in our growing Slack community to say hi!