Tobiko

Evolving Large Datasets with SQLMesh

Every time a user makes a change to an existing model, SQLMesh automatically versions the change and assigns it one of the following categories:

Breaking - if the change may impact data for downstream dependencies of this model. In this case, a new physical table will be created and fully refreshed for the modified model and all impacted models downstream from it.

Non-breaking - if the change doesn’t impact downstream dependencies. In this case, the physical table will be created and refreshed only for the directly modified model.

Figure 1: Breaking vs. non-breaking changes Figure 1: Breaking vs. non-breaking changes

You may notice that in both cases the physical table for the changed model is rebuilt from scratch. If a change is initially applied to a development environment, the physical table is subsequently reused when the change is promoted to production, ensuring that no computation is wasted.

But what if the dataset represented by the modified model is large enough that rebuilding it with each change is not only expensive but also time-consuming, even if this process only occurs once per change? For example, during my time at Netflix, I worked with core datasets that captured information about playback sessions. These datasets were massive, reaching the scale of petabytes. Querying these datasets for analysis was challenging, as it was only feasible to retrieve a few days worth of data at a time. Naturally, rebuilding a dataset like this from scratch was never on the table.

Welcome Forward

To manage the evolution of such large datasets, SQLMesh supports a third category of changes known as forward-only. The name is derived from the fact that changes of this category will be applied “going forward.”

Individual models can be explicitly configured as “forward-only” within their definitions, ensuring that all subsequent changes made to them are automatically categorized as such. Alternatively, a user can make a one-time forward-only change by adding the --forward-only flag to the sqlmesh plan command.

Regardless of the selected method, SQLMesh will not rebuild the physical table from scratch. Instead, the behavior will vary depending on whether changes are applied to a development environment or the production one.

Development Environment

In a development environment, SQLMesh creates a shallow (a.k.a. “zero-copy”) clone of the changed model’s existing physical table that is currently used in production. Currently, shallow cloning is supported by engines like BigQuery, Databricks and Snowflake. If the target engine doesn’t support this feature, SQLMesh creates an empty temporary table instead.

In both scenarios, evaluating a modified model in a development environment has no effect on the production data. However, it's important to note that users may not obtain a complete data preview of their changes if their target engine does not support shallow cloning.

Because users have the option to choose a custom time window within which they want to preview their changes, they can strike a balance between ensuring their modifications behave as intended and managing the associated computational overhead.

Production Environment

When forward-only changes are promoted to the production environment, they are applied to existing physical tables going forward. Consequently, the existing data within an affected table remains unchanged, while its schema is automatically adjusted to align with the schema of the updated model's query.

Data computed in a development environment will not be reused when the change is promoted to production. In some cases, it may be necessary to apply the changed model retroactively to a portion of the data once it has been promoted to the production environment (which can be achieved with restatement plans).

Unlike with breaking or non-breaking changes, reverting forward-only changes is not as straightforward, as there are no dedicated physical tables associated with each change that can be easily swapped out. Instead, reverting changes involves applying additional forward-only changes, possibly coupled with recomputing portions of data as needed.

Conclusion

Constructing a dedicated physical table for each applied model change is an effective and secure approach to ensuring isolation between environments, providing a complete data preview, handling concurrent changes, and enabling a swift and cost-efficient means of reverting model changes to any previous state.

Nevertheless, in certain cases, datasets are simply too large to rationalize the benefits of reconstructing the table with every change. To support such scenarios, SQLMesh offers the forward-only mode, enabling the evolution of existing tables while maintaining the same isolation guarantees between environments. This comes at the cost of limited data preview capabilities during development and the inability to reuse data computed within a development environment when promoting changes to production.

Join the growing community of data enthusiasts to learn more about this and other exciting developments in SQLMesh.