If you’re using SQLMesh alongside Apache Spark and Apache Iceberg, I have some exciting news for you!
Starting from version 0.57.0, SQLMesh applies the Write-Audit-Publish (WAP) pattern when executing models using Apache Spark and the Apache Iceberg data format. The best part? No user action is required to enable this behavior: it's on by default.
What is the Write-Audit-Publish Pattern
Write-Audit-Publish, or simply WAP, is one of the key design patterns in data engineering that ensures data quality and integrity without ever exposing consumers to bad or incomplete data.
Instead of being written immediately into the target table, newly arrived data is first stored in a staging location inaccessible to downstream consumers.
Staged records are subsequently audited and, upon passing all checks, are moved to the target table and made available to downstream consumers.
Therefore, downstream consumers are never exposed to incomplete or unvalidated data, as it is physically inaccessible.
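The three steps above can be sketched in a few lines of plain Python. This is only an illustration of the pattern, not SQLMesh code: the in-memory lists stand in for tables, and the names `write_audit_publish`, `not_empty`, and `no_null_ids` are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Audit = Callable[[List[Row]], bool]


def not_empty(rows: List[Row]) -> bool:
    """Audit: the batch must contain at least one row."""
    return len(rows) > 0


def no_null_ids(rows: List[Row]) -> bool:
    """Audit: every row must carry a non-null 'id'."""
    return all(row.get("id") is not None for row in rows)


def write_audit_publish(target: List[Row], new_rows: List[Row], audits: List[Audit]) -> bool:
    """Stage new_rows, audit them, and publish to target only if every audit passes."""
    staging = list(new_rows)  # write: consumers never read this list
    if not all(audit(staging) for audit in audits):  # audit
        return False  # target is left untouched
    target.extend(staging)  # publish
    return True


target_table: List[Row] = [{"id": 1, "amount": 10}]
audits = [not_empty, no_null_ids]

# A batch with a null id is rejected; a clean batch is published.
bad_published = write_audit_publish(target_table, [{"id": None, "amount": 5}], audits)
good_published = write_audit_publish(target_table, [{"id": 2, "amount": 7}], audits)
```

The rejected batch never touches `target_table`, which is the whole point: a reader of the target sees either the old state or the fully audited new state.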
To learn more about WAP, check out the blog post we wrote about it.
Why Apache Iceberg
Apache Iceberg is a high-performance data format for large analytics tables. Its design was heavily inspired by Git, from which it inherits many concepts and features.
For example, every write into an Iceberg table creates a new snapshot, akin to a Git commit. Furthermore, Iceberg enables the creation of branches within the table and supports cherry-picking of snapshots from one branch to another as a simple metadata operation. Changes made to one branch are not visible in other branches.
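To make the snapshot and branch ideas concrete, here is a deliberately simplified toy model in Python. It is not how Iceberg is implemented: `ToyTable` and `Snapshot` are hypothetical names, each snapshot stores only the rows it added, and cherry-picking is restricted to the case where the picked snapshot is a direct child of the target branch head, so it reduces to moving a pointer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    parent_id: Optional[int]
    added_rows: Tuple[str, ...]


@dataclass
class ToyTable:
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    # A branch is just a named pointer to a snapshot id.
    branches: Dict[str, Optional[int]] = field(default_factory=lambda: {"main": None})
    _next_id: int = 1

    def create_branch(self, name: str, from_branch: str = "main") -> None:
        self.branches[name] = self.branches[from_branch]

    def write(self, branch: str, rows: List[str]) -> int:
        """Every write creates a new snapshot and advances only the given branch."""
        snap = Snapshot(self._next_id, self.branches[branch], tuple(rows))
        self.snapshots[snap.snapshot_id] = snap
        self.branches[branch] = snap.snapshot_id
        self._next_id += 1
        return snap.snapshot_id

    def read(self, branch: str) -> List[str]:
        """Collect rows by walking the snapshot chain from the branch head back to the root."""
        rows: List[str] = []
        current = self.branches[branch]
        while current is not None:
            snap = self.snapshots[current]
            rows = list(snap.added_rows) + rows
            current = snap.parent_id
        return rows

    def cherry_pick(self, snapshot_id: int, onto: str = "main") -> None:
        """Publish a snapshot by re-pointing a branch; no rows are copied."""
        snap = self.snapshots[snapshot_id]
        if snap.parent_id != self.branches[onto]:
            raise ValueError("toy model only supports picking a direct child of the branch head")
        self.branches[onto] = snapshot_id


table = ToyTable()
table.write("main", ["a"])
table.create_branch("audit")
staged_id = table.write("audit", ["b"])

main_before = table.read("main")   # the write to "audit" is not visible here
audit_view = table.read("audit")
table.cherry_pick(staged_id)
main_after = table.read("main")
```

Note that `cherry_pick` changes only the `branches` mapping, which mirrors the claim that publishing is a metadata operation.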
The unique characteristics of the Apache Iceberg architecture enable SQLMesh to deliver a seamless WAP experience to its users.
When SQLMesh executes a model, it first writes the model’s output into a dedicated table branch created specifically for this execution. Audits are then executed against the data stored in that branch. If the audits succeed, the changes are cherry-picked into the table’s main branch, making them visible to downstream consumers. This last step is a metadata update that doesn’t involve moving any data.
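For readers who want to picture this flow in Iceberg's Spark SQL terms, the sketch below builds the kind of statements such a cycle could involve. It is an assumption-laden illustration, not necessarily the exact statements SQLMesh issues: the helper `wap_sql_sketch`, the catalog, table, and branch names, the audit query, and the `id` column are all hypothetical. In a real run, the snapshot id would come from the result of the `refs` query rather than being passed in.

```python
from typing import List


def wap_sql_sketch(catalog: str, table: str, branch: str, model_query: str, snapshot_id: int) -> List[str]:
    """Return illustrative Spark SQL for one write-audit-publish cycle on an Iceberg table."""
    fqn = f"{catalog}.{table}"
    branch_ref = f"{fqn}.branch_{branch}"  # Iceberg's branch identifier form
    return [
        # Prepare: create a branch dedicated to this execution.
        f"ALTER TABLE {fqn} CREATE BRANCH {branch}",
        # Write: the model output lands on the branch only.
        f"INSERT INTO {branch_ref} {model_query}",
        # Audit: an example check (hypothetical) that reads the branch, not main.
        f"SELECT COUNT(*) FROM {branch_ref} WHERE id IS NULL",
        # Find the snapshot the branch currently points to.
        f"SELECT snapshot_id FROM {fqn}.refs WHERE name = '{branch}'",
        # Publish: cherry-pick that snapshot into main, then clean up the branch.
        f"CALL {catalog}.system.cherrypick_snapshot('{table}', {snapshot_id})",
        f"ALTER TABLE {fqn} DROP BRANCH {branch}",
    ]


statements = wap_sql_sketch(
    catalog="my_catalog",
    table="db.orders",
    branch="wap_run_1",
    model_query="SELECT * FROM db.staged_orders",
    snapshot_id=123,  # placeholder; in practice read from the refs query
)
```

If the audit query reported violations, the publish statements would simply be skipped, leaving the main branch exactly as it was.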
Where to Start
To get started with SQLMesh and WAP, check out the Quickstart guide.
If the model’s storage format is set to iceberg and the target engine is configured to be Apache Spark, SQLMesh will automatically start applying the WAP pattern during execution.