Data Completeness: To Stream or Not to Stream

The tremendous push towards streaming pipelines lately has led some to believe that the days of batch processing are over, or soon will be.

We hear things like “batch processing is just a subset of streaming” and “organizations must transition to real-time data to stay competitive through faster decision-making”.

I believe streaming has an important place in the modern data infrastructure, but it’s nowhere close to replacing batch processing any time soon. At least not as long as companies prioritize the quality of their decisions over the speed at which these decisions are made.

I’m not going to talk about the complexity associated with implementing and reasoning about streaming pipelines. Nor will I dive into idempotency, data restatement, and failure recovery.

Instead, I'd like to address something more fundamental to the business itself: data completeness.

What is Data Completeness?

Data completeness is one of the data quality dimensions and refers to individual records not having any missing information.

For the purposes of this post, when I refer to data completeness, I mean the availability of all relevant data points (records) at a point in time. To align it more closely with the original definition, we can consider missing records as those in which all individual values are missing (e.g. nulls).

In my extended definition, completeness is temporal. Values that are initially missing at one point in time can become available at a later point.

Now, completeness is not an “all or nothing” concept. Data is rarely completely complete (pun intended); some data points will always be missing for one reason or another (e.g., late-arriving data, lost records, etc.).

Completeness is more of a spectrum, and only a domain expert can determine what constitutes a 'sufficient' level of it. At some point, you must draw a line in the sand, declare that you’ve collected all data points worth collecting, and move on.

Impact on Reporting

Data (in)completeness has a direct impact on business metrics and reporting.

Consider the following diagram depicting the number of sales made for each day:

Figure 1: Number of sales

What happened on 09/06? Let's consider a few questions we may ask ourselves here:

  • Did the number of sales suddenly drop, indicating potential trouble for the company?
  • Could it simply be an anomaly related to seasonality?
  • Was there a data outage, causing a gap in the records?
  • Or perhaps today is 09/06, and more sales records are expected to arrive?

The last two questions are directly linked to data completeness. Without addressing them first, it’s impossible to answer the others with any degree of certainty. In fact, they boil down to a single question: “Do we have a sufficient number of data points for all days up to and including 09/06 to draw meaningful conclusions from the calculated numbers?”

Some may argue that by visually inspecting the diagram above, one can easily identify a data completeness issue based on whether today is 09/06 or not. However, what if the gap wasn’t as conspicuous? Or the timezone in the data didn’t match your local timezone?

Alternatively, what if we were evaluating the average price of items sold on that day instead of counting transactions? Without a noticeable jump between days, we might never spot discrepancies in the reporting and would miss potential issues in the underlying data.

Ensuring completeness, therefore, entails not exposing downstream consumers to incomplete or partial data. In many cases, a portion of the data may only become visible after it’s been determined that enough data points have been received to proceed with further analysis.

Ensuring Completeness

To efficiently ensure completeness and prevent diagrams resembling the one in the previous section, a pipeline developer needs to validate the data as it lands in a table. This is accomplished with the help of automated data auditing.

For example, the pipeline developer might have some idea about the minimum number of records that are expected to land every hour. They can then check every hour’s worth of data and make sure that the number of records surpasses the threshold. If the batch of records passes the audit, a signal is released to inform downstream consumers about the availability of new data.
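
For illustration, here’s a minimal sketch of such a threshold audit in Python; the table name, the count_records query helper, and the release_signal call are hypothetical stand-ins for whatever the warehouse and orchestrator actually provide:

    from datetime import datetime, timedelta

    MIN_RECORDS_PER_HOUR = 10_000  # threshold agreed upon with a domain expert

    def count_records(table: str, start: datetime, end: datetime) -> int:
        # Stand-in for e.g. SELECT COUNT(*) FROM <table> WHERE ts >= start AND ts < end
        return 0

    def release_signal(table: str, partition: datetime) -> None:
        # Stand-in for whatever signaling mechanism downstream consumers poll
        pass

    def audit_hour(table: str, hour_start: datetime) -> bool:
        n = count_records(table, hour_start, hour_start + timedelta(hours=1))
        if n < MIN_RECORDS_PER_HOUR:
            return False  # keep downstream consumers waiting rather than expose a partial hour
        release_signal(table, hour_start)
        return True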

Figure 2: Auditing the data before releasing the signal to downstream consumers

More advanced auditing techniques compare each individual batch against the overall distribution of batches: calculate the average number of records per batch and the corresponding standard deviation, then verify that the record count of a given batch does not deviate from the average by more than X standard deviations.
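
A rough sketch of such a distribution-based audit, assuming the record counts of previously accepted batches are available:

    from statistics import mean, stdev

    def within_expected_range(batch_count: int, past_counts: list[int], x: float = 3.0) -> bool:
        # Flag batches whose record count deviates from the historical mean
        # by more than x standard deviations.
        return abs(batch_count - mean(past_counts)) <= x * stdev(past_counts)

    # Example: hourly batch counts observed so far vs. a suspiciously small new batch.
    history = [10_250, 9_980, 10_400, 10_120, 9_870]
    print(within_expected_range(4_200, history))  # False -> hold the signal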

Note how in both examples we rely on discrete batches of records to validate our assumptions about the data. Without organizing data into comparable chunks, it's impossible to reason about completeness.

Write-Audit-Publish

Write-Audit-Publish, or simply WAP, is the ultimate methodology for ensuring data completeness.

It's very similar to the approach outlined earlier, but instead of being written immediately into the target table, the newly arrived data is first stored in a staging location inaccessible to downstream consumers.

Staged records are subsequently audited, and upon passing all checks, they are moved to the target table and made available to downstream consumers. Therefore, downstream consumers are never exposed to incomplete or unvalidated data, as it is physically inaccessible.

Figure 3: Write-Audit-Publish
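
To make the flow concrete, here’s a tool-agnostic sketch in Python that uses in-memory lists as stand-in “tables”; in a real pipeline, the staging location would be a separate table, branch, or directory in the warehouse:

    target_table: list[dict] = []    # what downstream consumers read
    staging_table: list[dict] = []   # invisible to downstream consumers

    def passes_audits(rows: list[dict], min_rows: int = 3) -> bool:
        # Stand-in audit: require a minimum number of records in the staged batch.
        return len(rows) >= min_rows

    def write_audit_publish(batch: list[dict]) -> None:
        staging_table.extend(batch)            # 1. Write: land the batch in staging
        if not passes_audits(staging_table):   # 2. Audit: validate before exposure
            raise RuntimeError("Audit failed; target table left untouched")
        target_table.extend(staging_table)     # 3. Publish: make the validated batch visible
        staging_table.clear()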

Support in tooling

SQLMesh supports auditing and signaling out of the box. Supporting WAP is on the product’s roadmap.

dbt doesn’t support either approach out of the box. dbt tests, which are analogous to audits, run separately from the actual transformation. The purpose of this feature is somewhat unclear to me, because it doesn’t prevent downstream consumers from being exposed to bad or incomplete data. There are ongoing community efforts to add support for WAP, but none of them appear to be ready for practical use yet.

Apache Iceberg has first-class support for WAP. There are also ways to emulate WAP using shallow (a.k.a. zero-copy) clones in Delta Lake, BigQuery, and Snowflake, though the latter approach doesn’t generalize well to incrementally populated tables.
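
As an illustration of Iceberg’s flavor of WAP, here’s a rough PySpark outline. It assumes an existing Spark session with an Iceberg catalog named catalog, a db.sales table, and a new_batch_df DataFrame; exact properties and procedures may vary between Iceberg versions, so treat it as a sketch rather than a recipe:

    # Enable WAP on the table so that writes tagged with a WAP id are staged
    # as snapshots without becoming part of the table's current state.
    spark.sql("ALTER TABLE catalog.db.sales SET TBLPROPERTIES ('write.wap.enabled'='true')")

    # 1. Write: stage the new batch under a WAP id; downstream readers don't see it yet.
    spark.conf.set("spark.wap.id", "sales_2023_09_06")
    new_batch_df.writeTo("catalog.db.sales").append()

    # 2. Audit: find the staged snapshot and run checks against it.
    snapshot_id = spark.sql(
        "SELECT snapshot_id FROM catalog.db.sales.snapshots "
        "WHERE summary['wap.id'] = 'sales_2023_09_06'"
    ).first()["snapshot_id"]
    staged_count = spark.sql(
        f"SELECT COUNT(*) AS n FROM catalog.db.sales VERSION AS OF {snapshot_id}"
    ).first()["n"]
    assert staged_count >= 10_000  # e.g. a minimum-record-count audit

    # 3. Publish: cherry-pick the audited snapshot into the table's current state.
    spark.sql(f"CALL catalog.system.cherrypick_snapshot('db.sales', {snapshot_id})")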

Final Thoughts

Streaming data pipelines are here to stay, and so are the batch ones.

When it comes to obtaining accurate numbers rather than quick ones, the advantages offered by batch processing are pretty much unbeatable.

Data audits are essential to prevent incomplete data from being used downstream. Signaling helps inform downstream consumers when new data chunks are ready for use, whereas WAP makes incomplete data physically inaccessible.

Join the Slack community with over 1000 data enthusiasts to discuss completeness and other common challenges that practitioners face on a daily basis.