
We Need Even Greater Expectations When Testing Data

Although companies rely heavily on SQL / data pipelines to power applications and make decisions, testing those pipelines is usually an afterthought. Testing software is a well-established practice and considered a critical component of application development, but testing data models is still in its infancy: it is neither as common nor considered as important as testing software.

However, systematic testing of data is critical to creating trustworthy and reliable pipelines. This post highlights the three categories of tests for data: unit tests, table diffs, and audits.

Unit tests

Unit tests are the first line of defense for all software. They ensure that changes to code don't unexpectedly change business logic. They are also great at validating logic and edge cases. As code bases become more complex, it becomes increasingly difficult to test and validate changes manually. Unit tests give teams the confidence to collaborate on changes while knowing that even minor unintended changes will be caught.

In my opinion, unit tests are the most important yet most ignored technique for creating robust data pipelines. Unit testing pipelines can be as simple as a script that checks whether the output of a SQL query on fixed inputs matches expected values.
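
As a minimal sketch of that idea (assuming DuckDB as the engine, with a hypothetical table and query standing in for a real model):

# A minimal sketch of a unit test script, assuming DuckDB as the engine.
# The table and query here are hypothetical stand-ins for a real model.
import duckdb

con = duckdb.connect()  # in-memory database

# Fixed, hand-written input data in place of the real upstream table
con.execute("CREATE TABLE raw_orders (user_id INT, total INT)")
con.execute("INSERT INTO raw_orders VALUES (1, 4), (1, 3), (2, 2)")

# The query under test
actual = con.execute(
    "SELECT user_id, SUM(total) AS gross_total FROM raw_orders GROUP BY user_id ORDER BY user_id"
).fetchall()

# The expected output for that fixed input
expected = [(1, 7), (2, 2)]
assert actual == expected, f"expected {expected}, got {actual}"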

Example unit test

Let's say we want to unit test this SQL query which computes total revenue for users from two upstream tables.

WITH orders AS (
  SELECT *
  FROM raw.orders
  WHERE transaction_type NOT IN ('refund', 'void')
), discounts AS (
  SELECT
    order_id,
    SUM(discount) AS discount
  FROM raw.discounts
  WHERE status = 'used'
  GROUP BY order_id
)
SELECT
  o.user_id::INT AS user_id,
  SUM(o.total)::INT AS gross_total,
  SUM(o.total - COALESCE(d.discount, 0))::INT AS net_total
FROM orders o
LEFT JOIN discounts d
  ON o.order_id = d.order_id
GROUP BY o.user_id;

You can create fake data for each raw table, execute the query against it, and validate that the results are as expected.

test_revenue:
  model: revenue
  inputs:
    raw.orders:
      - order_id: 1
        user_id: 1
        transaction_type: complete
        total: 4
      - order_id: 2
        user_id: 1
        transaction_type: complete
        total: 3
      - order_id: 3
        user_id: 1
        transaction_type: VOID
        total: 3
      - order_id: 4
        user_id: 2
        transaction_type: complete
        total: 2
    raw.discounts:
      - order_id: 1
        status: used
        discount: 1
      - order_id: 1
        status: void
        discount: 1
  outputs:
    # Test the main query
    # SELECT
    #   o.user_id::INT AS user_id,
    #   SUM(o.total)::INT AS gross_total,
    #   SUM(o.total - COALESCE(d.discount, 0))::INT AS net_total
    # FROM orders o
    # LEFT JOIN discounts d
    #   ON o.order_id = d.order_id
    # GROUP BY o.user_id;
    query:
      - user_id: 1
        gross_total: 7
        net_total: 6
      - user_id: 2
        gross_total: 2
        net_total: 2
    ctes:
      # Test the first CTE
      # WITH orders AS (
      #   SELECT *
      #   FROM raw.orders
      #   WHERE transaction_type NOT IN ('refund', 'void')
      # )
      orders:
        - order_id: 1
          user_id: 1
          transaction_type: complete
          total: 4
        - order_id: 2
          user_id: 1
          transaction_type: complete
          total: 3
        - order_id: 4
          user_id: 2
          transaction_type: complete
          total: 2

This is an example YAML file in SQLMesh's format that contains a declarative unit test. It specifies fixed input values for raw.orders and raw.discounts, along with expected outputs for the main query and the orders CTE.

Queries containing CTEs like the one above (orders and discounts) can be challenging to test and debug because the CTEs contain intermediate calculations. Unit tests at the CTE level reveal issues in those calculations.

As an example of an issue this may catch, the query's orders CTE contains the clause

WHERE transaction_type NOT IN ('refund', 'void')

where refund and void are lowercase. The unit test data contains a row in raw.orders with a transaction_type of VOID (uppercase), which will not be correctly filtered out by the orders CTE WHERE clause.

Because that row is not filtered out, a test of the overall query will fail due to incorrect sums in the outer query. Fortunately, we also specified a unit test at the CTE level that reveals exactly where the error occurred:

======================================================================
FAIL: test_revenue (tests/test_revenue.yaml) (cte='orders')
----------------------------------------------------------------------
AssertionError: Data differs
  {'order_id': 1, 'user_id': 1, 'transaction_type': 'complete', 'total': 4}
  {'order_id': 2, 'user_id': 1, 'transaction_type': 'complete', 'total': 3}
+ {'order_id': 3, 'user_id': 1, 'transaction_type': 'VOID', 'total': 3}
  {'order_id': 4, 'user_id': 2, 'transaction_type': 'complete', 'total': 2}

======================================================================
FAIL: test_revenue (tests/test_revenue.yaml)
----------------------------------------------------------------------
AssertionError: Data differs
- {'user_id': 1, 'gross_total': 7, 'net_total': 6}
?                               ^               ^

+ {'user_id': 1, 'gross_total': 10, 'net_total': 9}
?                               ^^              ^

  {'user_id': 2, 'gross_total': 2, 'net_total': 2}

Being able to break down complex queries with many CTEs helps isolate issues more easily.

Running tests in CI

Unit tests should be quick to run, easy to set up, and enjoyable to write. If you use Spark or DuckDB, setting up unit tests is straightforward because both can be installed with pip and run locally. However, if you're using a SQL engine like BigQuery, Redshift, or Snowflake, it's much more difficult because those engines are not easily accessible in CI (Continuous Integration) environments.

If you're unable to access your data warehouse from CI, you can leverage a tool like SQLGlot to transpile your existing queries into DuckDB or Spark, so that they can run in an isolated container. If you're using SQLMesh, it has a built-in unit testing framework which transpiles queries automatically and can run unit tests with the YAML format shown above.
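
For example, here is a minimal sketch of transpiling a query with SQLGlot, assuming the query was written for BigQuery and CI only has DuckDB available; the query itself is a hypothetical example.

# A minimal sketch of transpiling a warehouse query so it can run in CI,
# assuming the query was written for BigQuery and CI only has DuckDB.
import sqlglot

bigquery_sql = "SELECT user_id, DATE_TRUNC(created_at, MONTH) AS month FROM raw.orders"

# Translate the BigQuery dialect into DuckDB's dialect
duckdb_sql = sqlglot.transpile(bigquery_sql, read="bigquery", write="duckdb")[0]
print(duckdb_sql)
# The transpiled query can now be executed against DuckDB with fake input tables.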

Table Diff

As you iterate on new versions of data pipelines, inevitably you'll want to compare your new version's output with the existing one's. Unit tests will catch specific logical changes, but you may want to understand how the data itself is different. Did total revenue change? Does the new version have more rows? Are users still being assigned the correct segmentation categories? Additionally, there may be cases in the actual data that are not represented in the unit tests.

Table diffing, implemented by SQLMesh and Datafold's data-diff, is next in line when it comes to validating changes to data.

Table diffing consists of three parts: schema diffs, statistical summaries, and row level comparisons.

  1. Schema diffs simply compare the columns and types of two tables. They can quickly tell you whether any columns were added, removed, or modified.
  2. Statistical summaries compare things like counts, number of nulls, and average values between two tables.
  3. Row level comparisons show differences between rows if a matching key is defined.

Here is an example of SQLMesh's table diffing output.

Schema Diff Between 'revenue__prod' and 'revenue__dev':
└── Modified Columns:
    ├── gross_total (INT -> DOUBLE)
    └── net_total (INT -> DOUBLE)
Row Count: revenue__dev: 3, revenue__prod: 2 -- -33.33%
 s__user_id  s__gross_total  s__net_total  t__user_id  t__gross_total  t__net_total
         -1             4.0           3.0         NaN             NaN           NaN
          1             2.0           2.0         1               2             3

In this example, 'revenue__dev' has changed the types of 'gross_total' and 'net_total' from INT to DOUBLE. Additionally, an unexpected user '-1' has been added, and user '1' has different values.
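
Under the hood, a row-level comparison like this boils down to a full outer join on the key. Here is a minimal sketch of that idea, assuming DuckDB and two small, hypothetical versions of the revenue table keyed by user_id.

# A minimal sketch of a row-level table diff, assuming DuckDB and two
# hypothetical versions of a revenue table keyed by user_id.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE revenue_prod (user_id INT, net_total INT)")
con.execute("CREATE TABLE revenue_dev  (user_id INT, net_total INT)")
con.execute("INSERT INTO revenue_prod VALUES (1, 3), (2, 2)")
con.execute("INSERT INTO revenue_dev  VALUES (1, 2), (2, 2), (-1, 3)")

# Full outer join on the key: NULLs on either side mean a row was added or
# removed, and differing values mean the row changed between versions.
diff = con.execute("""
    SELECT
      COALESCE(p.user_id, d.user_id) AS user_id,
      p.net_total AS prod_net_total,
      d.net_total AS dev_net_total
    FROM revenue_prod p
    FULL OUTER JOIN revenue_dev d ON p.user_id = d.user_id
    WHERE p.net_total IS DISTINCT FROM d.net_total
""").fetchall()
print(diff)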

Audits

The final type of test is the audit, or data quality check. Audits are the most commonly implemented kind of data test, supported by tools like Great Expectations, dbt, and SQLMesh.

Audits run after a model has been executed and validate constraints like nullability or uniqueness. They can be either blocking or non-blocking: failures either halt further execution or just emit warnings.
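
As a sketch of what such a constraint check could look like outside any particular framework, here is a hypothetical nullability audit run against DuckDB; a non-empty result means the audit fails.

# A minimal sketch of an audit-style constraint check, assuming DuckDB and a
# hypothetical revenue table. A non-empty result means the audit fails.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE revenue (user_id INT, net_total INT)")
con.execute("INSERT INTO revenue VALUES (1, 3), (NULL, 2)")

violations = con.execute(
    "SELECT * FROM revenue WHERE user_id IS NULL"  # nullability constraint
).fetchall()

if violations:
    # A blocking audit would raise here; a non-blocking one would only warn.
    raise AssertionError(f"Audit failed: {len(violations)} rows violate the constraint")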

One of the main benefits of audits is that they can check for upstream data quality issues. For example, if an upstream table has an unidentified issue, a data quality check could catch it because it checks for constraints against actual data.

However, audits alone are not enough to ensure robust pipelines. Unlike unit tests, they are not designed to catch logic differences or edge cases, and they are more expensive to run. They also only trigger after a model has been materialized, which is later in the process than it needs to be.

Here is an example of a SQLMesh audit that has failed.

sqlmesh.utils.errors.AuditError: Audit 'assert_positive_user_ids' for model 'test.revenue' failed.
Got 1 results, expected 0.
SELECT * FROM "revenue__prod"
WHERE "user_id" < 0

The audit is a simple SQL query that checks whether there are any negative user IDs. If any rows are returned, it triggers a failure.

Summary

|                 | Unit Tests | Table Diff | Audits |
|-----------------|------------|------------|--------|
| Purpose         | Validates SQL logic: does it output the expected values given the input? | Validates the change you're making: does it produce the expected differences? | Validates the data: does it conform to expected patterns? |
| Lifecycle       | Build / compile time | Iteration / development | Development / production run time |
| Ease of Use     | Takes time and thought to create comprehensive unit tests | Very easy to use given the existence of two tables | Easy to use predefined audits, but custom audits may be more difficult to set up |
| Maintainability | Logic or schema changes will require updating unit tests | Nothing to maintain | Changes to upstream data can trigger false positives and require audits to be updated |
| Cost to Run     | Extremely cheap and fast; instantaneous validation | Requires a full join and other aggregations over the data, which can take time and be expensive | Can be expensive and occurs after the model is run |
| Caveats         | Does not use real data | Does not measure data quality, only whether it is different | Can assess data quality but only captures what is represented in the data |

Testing is an essential part of creating maintainable and trustworthy pipelines. By leveraging these three techniques, companies can mitigate risks when making changes to data pipelines. As confidence in these tests grows, the speed and frequency of iteration will increase, allowing your company to be more agile in making trusted, data-driven decisions.

Please join our Slack community if you would like to chat or learn more about testing data!