Data pipelines consist of three general tasks: extracting, loading, and transforming data.
These tasks can be arranged in multiple ways (ETL, ELT, ETLT, etc.), but the transformation approach is directly tied to the loading approach. Therefore, understanding load patterns is critical to effectively use data transformation tools like SQLMesh.
There are multiple approaches to loading data. The most common approaches can be roughly classified into one of two patterns: full refresh and incremental. This post describes those two patterns and discusses considerations for choosing between them.
Data loading is a hard problem, and the idiosyncrasies of data systems mean that no approach is valid for everyone. We describe the simplest versions of these load patterns here and will introduce additional wrinkles and complexity in future posts.
Load Pattern Considerations
Multiple factors drive the optimal choice of load pattern. Loading patterns differ widely in complexity, and we should prefer simpler patterns unless there are good justifications for additional complexity.
The core considerations for choosing a load pattern are data size, computational intensity of data transformations, and frequency of loading.
The frequency of loading, or “load cadence,” is determined by how quickly the business needs new source data to be available. If data must be available in near real time, the cadence must be rapid - potentially every minute or multiple times an hour. If data is only needed weekly, the cadence can be weekly. Below, we refer to the time between loads as the “cadence interval.”
Data size and the computational intensity of transformations determine the cost of a single load. That cost and the load cadence combine to determine the total cost of a load pattern - a critical concern when designing data pipelines.
Full Refresh Pattern
The full refresh pattern ingests all source data every time a load occurs. This means that all the source data must be pulled and the transformations must run against all of it on every load, making the full refresh the longest and most costly load possible.
Sometimes this pattern cannot be used at all. If a load takes longer than the load cadence interval, the data would never fully load before a new load began.
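The feasibility condition above can be expressed in a few lines. This is a minimal sketch; the function name and the example durations are illustrative, not from any particular tool:

```python
from datetime import timedelta

def full_refresh_feasible(load_duration: timedelta, cadence_interval: timedelta) -> bool:
    """A full refresh is only workable if each load finishes
    before the next load is scheduled to begin."""
    return load_duration < cadence_interval

# A 90-minute full refresh cannot keep up with an hourly cadence,
# but it can keep up with a daily one.
print(full_refresh_feasible(timedelta(minutes=90), timedelta(hours=1)))  # False
print(full_refresh_feasible(timedelta(minutes=90), timedelta(days=1)))   # True
```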
The full refresh pattern is only reasonable when the load is short and the cadence interval is long. In those cases, the following conditions typically hold:
- The data is small
- The data is not being pulled over a slow network connection
- The data transformations are not long-running or computationally intensive
The simplicity of the full refresh pattern makes it the easiest to implement and maintain, but it is costly because it executes full runs every single load.
In cloud data warehouses these costs can accumulate rapidly - every unnecessary refresh is an expenditure that could be avoided with a different load pattern.
Incremental Pattern

The incremental load pattern ingests only the source data that became available after the previous load. Because only a fraction of the source data is loaded each time, the cost per load is lower.
The tradeoff for decreased cost is increased complexity. The primary source of that complexity is the need for information about what has happened in the past.
The full refresh pattern does not require that knowledge because the same operations happen every load regardless of what happened before. (In some scenarios this may not be true - stay tuned for a future blog post!)
The first required piece of information for incremental loading is what source rows should be processed in the current load. There are a few potential sources for this information - we focus on time as the source in this post.
We can determine which rows to process based on each source row’s creation timestamp. The load run specifies a date range (or ranges), and only rows with timestamps in that range are processed.
There are two general approaches to determining the range: using the most recent record in the existing data or using intervals based on the load cadence.
Below, we walk through the general process of standing up a new data pipeline with each approach.
Most Recent Record
The simplest approach to creating the range is to base it on the date/time of the most recent load.
Before we start incremental loads, we must perform an initial data load. We visually depict that load in Figure 1.
The data we have ingested at a point in time is represented by the empty left rectangle, where the absence of color indicates absence of data.
After the initial load, the right rectangle represents the data ingested so far. We have all the data from the beginning of time to the present, so the rectangle is completely filled. In subsequent graphics, we represent the absence of data with empty rectangles appended or interleaved with the data we have already ingested.
For explanatory purposes, we depict the ingested data as ordered by record timestamp, with the beginning of time at the bottom and the present time at the top. In practice, it is not necessary to order records in the database.
After the initial load, we begin performing incremental loads. To do that, we first determine the date range containing the records to process.
With the most recent record approach, the start of the range is determined by querying the most recent timestamp in the data you have already ingested. The end of the range is the time the current load runs.
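The range computation can be sketched as follows. The SQL string, table name, and column name are assumptions for illustration only:

```python
from datetime import datetime, timezone
from typing import Optional, Tuple

# Hypothetical query for the newest ingested timestamp; the table and
# column names are illustrative.
LATEST_TS_QUERY = "SELECT MAX(created_at) FROM analytics.events"

def most_recent_record_range(latest_loaded: datetime,
                             now: Optional[datetime] = None) -> Tuple[datetime, datetime]:
    """The range starts at the newest timestamp already ingested and
    ends at the moment the current load runs."""
    end = now if now is not None else datetime.now(timezone.utc)
    return latest_loaded, end

start, end = most_recent_record_range(
    datetime(2023, 1, 1, tzinfo=timezone.utc),
    now=datetime(2023, 1, 2, tzinfo=timezone.utc),
)
print((end - start).days)  # 1
```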
We visually depict that second load in Figure 2. Time has passed, so we have one cadence interval’s worth of unloaded data - it is depicted as the empty rectangle on the top left.
After the incremental load, we have ingested that interval’s data, depicted by filling the empty rectangle with gray on the top right.
The basic version of the most recent record approach makes at least three assumptions.
Assumption 1: Table already exists
The table you query for the latest date will exist the first time you run an incremental load because you are actively building the pipeline.
However, business processes change, databases are reorganized, and new staff come onboard.
If the table is moved or deleted, the load will fail and require manual intervention by staff. However, it is possible to guard against this possibility with Jinja templating.
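One way to express the guard, sketched here in plain Python rather than Jinja, is to fall back to an initial backfill whenever no latest timestamp can be found (table missing or empty). The helper name and the pipeline start date are hypothetical:

```python
from datetime import datetime
from typing import Optional, Tuple

# Illustrative pipeline start date; a real pipeline would configure this.
BEGINNING_OF_TIME = datetime(2020, 1, 1)

def choose_load_range(latest_loaded: Optional[datetime],
                      now: datetime) -> Tuple[datetime, datetime]:
    """If the table is missing or empty, latest_loaded is None; fall back
    to a full backfill from the beginning of time instead of failing."""
    start = latest_loaded if latest_loaded is not None else BEGINNING_OF_TIME
    return start, now

# Table missing or empty -> backfill everything
print(choose_load_range(None, datetime(2023, 6, 1))[0])  # 2020-01-01 00:00:00
```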
Assumption 2: No gaps in the data
Gaps in the data are problematic because the most recent record method can only fill the interval(s) that have occurred since the previous load.
Figure 3 depicts data containing gaps, with multiple empty rectangles within the data on the left.
After the load, only the most recent interval at the top has been filled (in pink) on the top right. Data for the older gaps was not loaded.
The only way to fill the older gaps is by executing an ad hoc query outside the automated loading cadence. This requires time from technical staff and establishing business processes to remember when and how to execute the queries.
Assumption 3: Table can be backfilled in one pass
Large source data systems may contain more data than the loading tool can computationally process in one pass. This problem can be solved by loading data in multiple batches, each of which is small enough to process.
However, the most recent record approach is only able to divide the data into two segments: before and after the most recent record timestamp. Therefore, it does not have the necessary information to further divide the data into batches for loading.
As with data gaps, the only feasible solution is executing ad hoc queries that manually implement batching, which brings the same costs of staff time and new business processes.
The simplicity of the most recent record approach greatly limits its flexibility and places responsibility for managing load history on the user.
With this approach, the tool triggering the load doesn’t know what to load - it can’t obtain the necessary information. Instead, it blindly executes the query that determines what to load based on the existing data.
Intervals

A more powerful approach is for the tool triggering the load to tell the load what to process, rather than passively determining it during execution. The tool can communicate this using time intervals.
The first step to using intervals is to determine the set of all possible time intervals based on the load cadence. For example, if you have a daily load cadence the set of all intervals would be each day since the loads began.
The tool records which intervals have ever been processed; that way it knows what has been loaded and can ensure there are no gaps in the desired cadence.
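The bookkeeping described above can be sketched with plain Python sets. A daily cadence and the specific dates are illustrative assumptions:

```python
from datetime import date, timedelta

def all_intervals(start, end):
    """Every daily interval from 'start' up to (but not including) 'end'."""
    return {start + timedelta(days=i) for i in range((end - start).days)}

def missing_intervals(start, end, processed):
    """Intervals never recorded as loaded - including gaps in the middle,
    not just the most recent ones."""
    return all_intervals(start, end) - processed

# Jan 1 and Jan 3 were loaded; Jan 2 is a gap and Jan 4 is new.
processed = {date(2023, 1, 1), date(2023, 1, 3)}
todo = missing_intervals(date(2023, 1, 1), date(2023, 1, 5), processed)
print(todo == {date(2023, 1, 2), date(2023, 1, 4)})  # True
```

Because the tool tracks the full set of intervals rather than a single latest timestamp, old gaps surface automatically in `todo`.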
The interval approach is much more powerful because it allows your tools to take responsibility for ensuring all intervals are loaded for a given table.
This enables the tool to process intervals in batches, which allows intervals to be processed in parallel and removes the need for specialized clusters to perform large backfills.
Figure 4 depicts an initial load that contains three intervals. Because the tool knows how to segregate the data into batches, it can process the three intervals in parallel.
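Batching can be as simple as chunking the sorted intervals; the batch size and dates here are illustrative:

```python
from datetime import date, timedelta

def batch_intervals(intervals, batch_size):
    """Split intervals into fixed-size batches so each batch is small
    enough to process on its own (possibly in parallel)."""
    ordered = sorted(intervals)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

# A week-long backfill split into batches of three days.
days = [date(2023, 1, 1) + timedelta(days=i) for i in range(7)]
batches = batch_intervals(days, batch_size=3)
print([len(b) for b in batches])  # [3, 3, 1]
```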
The interval approach also allows the tool to find and repair any data gaps. Figure 5 depicts an incremental load when the existing data contains gaps. The tool can recognize the empty intervals and automatically include them in the incremental load without manual intervention or ad hoc queries.
The interval approach reduces the burden on developers and decreases the need for manual intervention to resolve data problems like gaps. It is conceptually more complex than the most recent record approach, but how much of that complexity falls onto technical staff to manage is strongly affected by choice of tools.
In addition to which records to load, the incremental approach requires knowing whether our data transformation logic has changed.
This information is required because, ideally, all data in our system is internally consistent. If some rows were loaded before a logic change and others after it, they cannot be safely combined in analysis.
If the transformation logic has changed, you have three ways to proceed. A full refresh ensures all the data are transformed using the new logic but may be infeasible due to time or cost. Alternatively, you can reprocess data ingested after a specific date or only apply the new logic in future loads.
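The second option, reprocessing data ingested after a specific date, amounts to selecting the intervals on or after a chosen effective date. The names and dates in this sketch are illustrative:

```python
from datetime import date, timedelta

def intervals_to_reprocess(loaded_intervals, change_effective_date):
    """Reprocess only intervals on or after the date the new logic should
    apply; older intervals keep the results of the old logic."""
    return sorted(d for d in loaded_intervals if d >= change_effective_date)

# Five loaded days; the new logic applies from Jan 4 onward.
loaded = [date(2023, 1, 1) + timedelta(days=i) for i in range(5)]
print(len(intervals_to_reprocess(loaded, date(2023, 1, 4))))  # 2
```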
The latter two require communication with downstream data users to prevent errors from accidentally combining inconsistent data.
Choosing a load pattern
How should you choose a load pattern?
The decision is a tradeoff: simpler systems and higher load costs vs. more complex systems and lower load costs. Fortunately, the complexity of the incremental pattern is not inherent - it is determined, in part, by choice of tools.
If your data loading is implemented with custom code, the system complexity and maintenance burden will be high.
If you use a data transformation tool like dbt, the complexity and maintenance burden will be lower. However, dbt was not originally designed with incremental loading in mind, so it requires additional packages, code customization, and monitoring.
Incremental loading is a core feature of SQLMesh. Our engineers built custom incremental loading pipelines in companies like Netflix and Apple, and the lessons they learned are built directly into SQLMesh so you won’t have to learn them too.
With SQLMesh, choosing a load pattern is simpler because incremental loading is so much easier - give it a try!