ClickHouse rapidly processes vast amounts of data. With this speed, your data team can quickly detect and respond to changes impacting the business. But data systems are shared, so changing them requires coordination. Coordination is hard, and time pressure makes it even harder.

SQLMesh lets you manage changes to your data transformation pipelines without fear. It ensures that people understand the full impact of code changes they make, that they can safely develop code without impacting others, and that the right version of the data is always deployed in production. In essence, data error prevention is a first-class citizen.

This post adapts an existing ClickHouse example analysis to demonstrate how SQLMesh lets you maintain fast, efficient data pipelines in dynamic organizations.

SQLMesh: A Transformative Data Management Tool

SQLMesh simplifies building and maintaining reliable data transformation pipelines in dynamic organizations.

Designed with the challenges of today’s data teams in mind, SQLMesh enables users to define data transformations in a SQL-first way, manage dependencies, and automate the maintenance of data models.

SQLMesh’s approach to versioned data/code and dependency management makes it an appealing choice for data engineers and analysts.

ClickHouse: The Real-Time Analytical Database

ClickHouse is an open-source, high-performance columnar database management system designed for real-time analytical queries.

ClickHouse has grown into a widely adopted solution for organizations that need to process vast amounts of data quickly.

It’s particularly popular for applications like analytics platforms, log management, and real-time reporting, where speed and performance are paramount.

Why Use SQLMesh for Data Transformation with ClickHouse?

Integrating SQLMesh with ClickHouse was a natural step. SQLMesh is designed for data transformations and modeling, and when paired with ClickHouse, it can offer performance levels that other database engines struggle to match. The integration lets data engineers build high-speed data models with SQLMesh’s framework while taking advantage of ClickHouse’s unparalleled query performance.

Benefits of Using ClickHouse

ClickHouse offers several key benefits for data-driven applications:

  • Speed: ClickHouse’s columnar storage and efficient indexing enable it to quickly execute complex queries on large datasets.
  • Scalability: Optimized for high throughput, ClickHouse can scale horizontally and vertically to accommodate massive data volumes.
  • Cost-Effective: ClickHouse Cloud’s architecture ensures effective data tiering and scaling, resulting in a leading price/performance ratio on the market. Additionally, as an open-source solution, ClickHouse allows organizations to leverage top-tier performance regardless of their deployment model of choice.

How ClickHouse Differs from Other Engines

Unlike traditional row-oriented databases, ClickHouse’s columnar storage format allows it to read only the necessary columns in a query, making it exceptionally fast for analytical workloads. Traditional databases like MySQL and PostgreSQL are optimized for transactional processing, while ClickHouse is specifically engineered for high-performance analytics. This makes it an ideal pairing for SQLMesh, which focuses on transforming and preparing data for analytical use.
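To make the contrast concrete, here is a hypothetical table and query (the table and column names are illustrative, not from any real project). Because storage is columnar, the aggregate below reads only the `event_date` and `revenue` columns from disk rather than entire rows:

```sql
-- Hypothetical events table using ClickHouse's MergeTree engine.
CREATE TABLE events
(
    event_date Date,
    user_id    UInt64,
    page       String,
    revenue    Float64
)
ENGINE = MergeTree
ORDER BY (event_date, user_id);

-- An analytical query like this touches only two of the four columns,
-- which is why columnar storage stays fast as the dataset grows.
SELECT
    event_date,
    sum(revenue) AS daily_revenue
FROM events
GROUP BY event_date
ORDER BY event_date;
```

A row-oriented database would read every column of every row to answer the same query.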

Benefits of Using SQLMesh

SQLMesh provides several value-added benefits to ClickHouse users:

  • SQL-First Data Modeling: SQLMesh lets users define data transformations in SQL, making it accessible to analysts and engineers.
  • Dependency Management: SQLMesh tracks dependencies between models, ensuring that data transformations are up-to-date and accurate.
  • Virtual Data Environments: Provide low-cost, efficient, and safe isolated environments for development, easy previewing of changes, and seamless promotion to production without data duplication or movement, reducing the risk of accidental errors or stale results in ClickHouse.
  • Automatic Detection of Breaking Changes: Categorizes changes as breaking or non-breaking based on their impact on downstream models, minimizing unnecessary recomputation while ensuring correctness.
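As a sketch of what SQL-first modeling looks like (the model and column names here are illustrative), a SQLMesh model is an annotated SQL file. Note that the upstream dependency is never declared explicitly — SQLMesh parses the query and infers it:

```sql
MODEL (
  name analytics.daily_orders,  -- illustrative model name
  kind FULL,
  cron '@daily'
);

SELECT
  order_date,
  COUNT(*) AS num_orders
FROM raw.orders  -- SQLMesh infers this upstream dependency from the query itself
GROUP BY order_date
```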

By combining SQLMesh’s transformation framework with ClickHouse’s high-speed query capabilities, organizations can build robust data pipelines that are easy to maintain and scale.

Example: Using SQLMesh and ClickHouse Together

This section demonstrates running the SQLMesh quickstart project on the ClickHouse engine.

We begin by installing both ClickHouse and SQLMesh.

Installation

To get started, we need a ClickHouse server to execute the analysis. Fortunately, it is easy to download and run a standalone server.

First, we open a CLI terminal and create a new directory `sqlmesh-example` where both ClickHouse and our SQLMesh project files will live.

Next, we create a `clickhouse` sub-directory then download and run the ClickHouse binary:

❯ mkdir sqlmesh-example && cd sqlmesh-example
❯ mkdir clickhouse && cd clickhouse
❯ curl https://clickhouse.com/ | sh

Successfully downloaded the ClickHouse binary, you can run it as:
    ./clickhouse

❯ ./clickhouse server

Application: Starting ClickHouse 24.11.1.679 (revision: 54492)
[...]

We’re now ready to create a new Python virtual environment and install SQLMesh:

❯ cd ..
❯ python3 -m venv .env
❯ source .env/bin/activate
❯ pip install "sqlmesh[clickhouse,web]"

Collecting sqlmesh[clickhouse,web]
  Downloading sqlmesh-0.130.1-py3-none-any.whl.metadata (10 kB)
[...]

Setup

Now that ClickHouse and SQLMesh are installed, we’re ready to set up the example project and configure SQLMesh’s connection to ClickHouse.

First, we initialize the SQLMesh example project, specifying ClickHouse as our default SQL code dialect.

❯ sqlmesh init clickhouse
❯ ls
audits               config.yaml          macros               seeds
clickhouse           models               tests

Our directory now contains sub-directories to organize project files and a pre-populated `config.yaml` file for configuring our project.

Let’s update the `config.yaml` file to specify a gateway connecting to our ClickHouse server:

[...]
# ClickHouse connection (default credentials)
clickhouse:
  connection:
    type: clickhouse
    host: 127.0.0.1
    port: 8123
    username: default
    password: ''
[...]
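For context, the snippet above sits under the project’s `gateways` key. A minimal complete `config.yaml` along these lines might look like the following — the `default_gateway` and `model_defaults` values shown here are assumptions based on what `sqlmesh init` generates:

```yaml
gateways:
  clickhouse:
    connection:
      type: clickhouse
      host: 127.0.0.1
      port: 8123
      username: default
      password: ''

# Use the ClickHouse gateway unless told otherwise.
default_gateway: clickhouse

# Dialect used to parse the project's SQL models.
model_defaults:
  dialect: clickhouse
```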

Let’s make sure the connection is working properly with the `sqlmesh info` command:

❯ sqlmesh info

Models: 3
Macros: 0
Data warehouse connection succeeded
State backend connection succeeded

Success!

Making a plan

SQLMesh projects are built on models. Roughly, a model corresponds to a single data transformation step whose outputs are materialized in the database.

Models are typically written in SQL, but SQLMesh also supports Python models to implement functionality not provided by the data warehouse. The SQLMesh example project contains three SQL models and no Python models.
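For reference, the quickstart’s incremental model looks roughly like this (reproduced from memory of the example project, so details may differ slightly). The `@start_date` and `@end_date` macros let SQLMesh restrict each run to just the time range being processed, instead of recomputing the whole table:

```sql
MODEL (
  name sqlmesh_example.incremental_model,
  kind INCREMENTAL_BY_TIME_RANGE (
    time_column event_date
  ),
  start '2020-01-01',
  cron '@daily'
);

SELECT
  id,
  item_id,
  event_date
FROM
  sqlmesh_example.seed_model
WHERE
  -- SQLMesh substitutes the interval being computed for these macros
  event_date BETWEEN @start_date AND @end_date
```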

SQLMesh’s core workflow involves applying plans to environments. An environment is an isolated namespace containing specific versions of project files and the corresponding database objects storing their outputs.

Project files are modified as business needs change. A plan identifies the sequence of actions needed to bring an environment into alignment with the modified project files. Plans summarize those actions so you understand the implications of the changes before they are implemented.

A new SQLMesh project contains one empty environment `prod` (or “production”) that will contain the model versions deployed in your business’s production environment. Other environments can be cheaply created and destroyed on demand.
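For example, a development environment named `dev` (the name is arbitrary) can be created with a single command. Because environments are virtual, models that are unchanged relative to `prod` reuse its physical tables rather than being recomputed:

```shell
# Create (or update) an isolated environment called `dev`.
# Only changed models are rebuilt; the rest are shared with prod virtually.
❯ sqlmesh plan dev
```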

Let’s populate our project’s `prod` environment with the `sqlmesh plan` command:

❯ sqlmesh plan       
======================================================================
Successfully Ran 1 tests against duckdb
----------------------------------------------------------------------
New environment `prod` will be created from `prod`

Summary of differences against `prod`:
Models:
└── Added:
    ├── sqlmesh_example.full_model
    ├── sqlmesh_example.incremental_model
    └── sqlmesh_example.seed_model

Models needing backfill (missing dates):
├── sqlmesh_example.full_model: 2024-11-04 - 2024-11-04
├── sqlmesh_example.incremental_model: 2020-01-01 - 2024-11-04
└── sqlmesh_example.seed_model: 2024-11-04 - 2024-11-04

Apply - Backfill Tables [y/n]:

SQLMesh recognized that the project files contain three models being added to the empty `prod` environment. It also recognized that all three models need a backfill (a generic term for updating or processing data) because they have not yet been executed.

We enter `y` at the backfill prompt and hit Enter.

The output reports SQLMesh’s creation of physical tables to store the model data, its execution of the model transformation code, and its updates to the environment:

Creating physical tables ━━━━━━━━━━━━━━━━━━ 100.0% • 3/3 • 0:00:00

All model versions have been created successfully

[1/1] sqlmesh_example.seed_model evaluated in 0.00s
[1/1] sqlmesh_example.incremental_model evaluated in 0.01s
[1/1] sqlmesh_example.full_model evaluated in 0.01s
Evaluating models ━━━━━━━━━━━━━━━━━ 100.0% • 3/3 • 0:00:00 

All model batches have been executed successfully

Virtually Updating 'prod' ━━━━━━━━━━━━━━━━━ 100.0% • 0:00:00

The target environment has been updated successfully

We have now executed the SQLMesh example project on ClickHouse!

Learn more about how SQLMesh enables safe, isolated code development in the SQLMesh Quickstart guide.

Conclusion

SQLMesh and ClickHouse offer an efficient, powerful data transformation and analysis solution. SQLMesh’s SQL-first, model-driven approach complements ClickHouse’s high-speed query capabilities, allowing organizations to harness the best of both tools. Whether you're handling real-time analytics or large-scale data transformations, this duo provides a streamlined way to achieve accurate, fast insights.

If you’re interested in learning more or connecting with the communities behind SQLMesh and ClickHouse, feel free to join the respective Slack channels: