
Why Data Teams Are Adopting Declarative Pipelines

To understand the trend in declarative data transformation / DataOps product offerings, one simply needs to examine the evolution of the DevOps discipline.

Consider the Chef / Puppet / Ansible family of tools. These tools provide configuration management for cloud infrastructure. They predominantly follow an imperative approach, focusing on running the user-provided scripts as-is on multiple nodes, rather than guaranteeing the correctness of the end state. Users are largely responsible for failure recovery, health checking, resource provisioning, and deployment automation.

Terraform came along after those, introducing a declarative approach that emphasizes where the system should end up as opposed to how it should get there. Later, Kubernetes continued this evolution by introducing additional high-level declarative constructs, integrating features like failure recovery, blue-green deployments, and more. Both Terraform and Kubernetes must maintain internal state to facilitate these capabilities.

In this post, I draw a parallel between the Data Transformation and the DevOps worlds, exploring the evolution from imperative stateless solutions to declarative stateful ones and why it matters.

Imperative and Stateless Data Transformation

Stateless data transformation refers to a transformation process that doesn't retain any information about its invocations.

This implies that each execution of such a transformation is “forgetful,” behaving as if no prior invocations have occurred. One well-known example of a tool that falls into this category is dbt.

The appeal of tools using this approach is understandable. The implementation is straightforward, and the tool is easy to explain to users: it doesn't do much beyond executing the user-provided SQL code in the correct order.

Stateless data transformation is also inherently imperative: because nothing is remembered between runs, users need to specify how the desired state should be achieved rather than focusing on what that state should be.

Maintaining state is not an easy task. Developers have to worry about the consistency of the persisted state, failure recovery, and account for migrations as the product evolves. Users must consider where the state will be stored and manage additional configuration overhead.

However, real-world use cases often demand capabilities that go beyond simply running things in the correct order. Features like data completeness, incremental processing, versioning, and deployments are all stateful by nature. When using stateless tools, users are compelled to build some of these capabilities themselves, or to accept both a substantial bill from their cloud data provider and the risk of testing code changes directly in production.

Furthermore, the stateless process must operate under the assumption that the user-defined transformations are always idempotent, meaning that running them repeatedly yields the same outcome. In practice, more advanced data loading patterns, such as incremental loads, are often not idempotent.
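
To make that concrete, here is a minimal sketch in generic warehouse SQL (the table and column names are hypothetical, and MERGE syntax varies slightly across engines): the append-only load duplicates rows if it is retried for the same day, while the keyed MERGE can be re-run safely.

  -- Non-idempotent incremental load: retrying it for the same day
  -- appends a second copy of yesterday's aggregates.
  INSERT INTO analytics.daily_orders (order_date, order_count)
  SELECT order_date, COUNT(*) AS order_count
  FROM raw.orders
  WHERE order_date = CURRENT_DATE - 1
  GROUP BY order_date;

  -- Idempotent alternative: keyed on order_date, a retry overwrites the
  -- same row instead of inserting a duplicate.
  MERGE INTO analytics.daily_orders AS t
  USING (
    SELECT order_date, COUNT(*) AS order_count
    FROM raw.orders
    WHERE order_date = CURRENT_DATE - 1
    GROUP BY order_date
  ) AS s
  ON t.order_date = s.order_date
  WHEN MATCHED THEN UPDATE SET order_count = s.order_count
  WHEN NOT MATCHED THEN INSERT (order_date, order_count) VALUES (s.order_date, s.order_count);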

In summary, any data transformation product aiming to be a “Terraform for data” must handle state in some way or another.

Declarative and Stateful Data Transformation

The alternative to the imperative stateless data transformation process is the declarative stateful one.

It focuses on achieving the desired state of data in the most efficient and secure manner, rather than relying on a user to figure it out on their own.

For example, SQLMesh maintains internal state to keep track of the following:

  • Data intervals that have already been processed for each model
  • Versions of each model, derived from fingerprints of their definitions
  • Data environments and the model versions deployed to each

In the following sections, I'll explore these in more detail and elaborate on the importance and impact of each.

Data Intervals

Each model (transformation) in SQLMesh is associated with a specific cadence on which it should be evaluated. On each run, SQLMesh records a time interval for which the execution was performed.

SQLMesh uses these intervals to determine whether it’s time to execute a specific transformation. They also enable SQLMesh to execute transformations incrementally when applicable, substantially decreasing the computational overhead.

In addition to reducing redundant computation, this mechanism also helps ensure data completeness. In other words, downstream pipelines can automatically determine if the upstream is ready to be consumed by examining the intervals that have been processed.
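
For illustration, this is roughly what an incremental model looks like in SQLMesh (the model, table, and column names are made up for this sketch): it declares its cadence and time column, and its query only reads the interval currently being processed.

  -- Hypothetical incremental model evaluated on a daily cadence.
  MODEL (
    name analytics.events_daily,
    kind INCREMENTAL_BY_TIME_RANGE (
      time_column event_date
    ),
    cron '@daily'
  );

  -- @start_ds and @end_ds are filled in by SQLMesh from the intervals
  -- recorded in its state, so only missing data is computed.
  SELECT
    event_date,
    COUNT(*) AS event_count
  FROM raw.events
  WHERE event_date BETWEEN @start_ds AND @end_ds
  GROUP BY event_date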

Data Versioning

Whenever a user modifies a model, the change is fingerprinted, and a new record is created to represent this version in the SQLMesh state.

Combined with the fact that SQLMesh maintains a separate physical table for each unique model fingerprint, the availability of state makes the following features possible:

  • The change history for each individual model is easily accessible and used to enrich the data lineage.
  • If one user runs a specific model version, other users can just reuse the produced output as part of their runs, further reducing the computational overhead (see the sketch after this list).
  • By comparing a new model version to the previous one, SQLMesh can determine whether the change is breaking or non-breaking in relation to the model's downstream dependencies.
  • Decoupling from the current state of the code repository in which transformations are defined makes data deployments and rollbacks possible.
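
To picture how this works at the storage layer, here is a hypothetical sketch; the schema names, fingerprints, and queries below are invented, and SQLMesh generates its own naming scheme. Each model version is materialized once as its own physical table, and environment views simply point at whichever version they use:

  -- Two versions of the same model live side by side as physical tables,
  -- one per fingerprint (all names below are purely illustrative).
  CREATE TABLE sqlmesh.events_daily__1001 AS        -- previous version
  SELECT event_date, COUNT(*) AS event_count
  FROM raw.events
  GROUP BY event_date;

  CREATE TABLE sqlmesh.events_daily__2002 AS        -- new version of the query
  SELECT event_date, COUNT(DISTINCT user_id) AS event_count
  FROM raw.events
  GROUP BY event_date;

  -- Environments are views over these tables: dev already points at the
  -- new version, while prod keeps serving the previous one.
  CREATE VIEW analytics__dev.events_daily AS
  SELECT * FROM sqlmesh.events_daily__2002;

  CREATE VIEW analytics.events_daily AS
  SELECT * FROM sqlmesh.events_daily__1001;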

Data Deployments

SQLMesh keeps track of data environments along with the set of model versions associated with each environment. Production is considered one such environment within SQLMesh.

When evaluating models within a specific environment, SQLMesh relies solely on its internal state rather than the current state of the code / GitHub repository.

Consequently, deploying changes to production or any other environment comes down to updating the relevant record in the state and swapping views in the virtual layer.
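
Continuing the illustrative sketch from the versioning section, promoting that change to production then amounts to repointing a view at a physical table that already exists; no data is copied or recomputed:

  -- Deployment as a view swap: production now serves the new model version.
  CREATE OR REPLACE VIEW analytics.events_daily AS
  SELECT * FROM sqlmesh.events_daily__2002;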

With SQLMesh, code changes are applied to a target environment using the plan command, which automatically determines the most efficient way to reflect these changes in that environment.

Hence, the difference between the plan and run commands. The former reflects changes in the source code in a target environment by backfilling missing data and updating the environment’s record in the SQLMesh state. The latter reads from the state to execute transformations for the most recent data based on their current definitions in that environment.

Figure 1: The difference between planning and running in SQLMesh

Closing Thoughts

Maintaining state introduces extra complexity for product developers but offers significant benefits to users.

When using stateful solutions, users can define their data transformations declaratively and reason about them using high-level constructs such as environments and deployments, rather than orchestrating these complex flows themselves.

SQLMesh leverages its internal state to reduce computational overhead, version user changes, manage data environments, and enable an effective data deployment process.

More than 1000 data professionals in our community actively discuss these and other relevant topics. Join us on Slack and become part of the conversation!