Tobiko

Unlocking Data Insights with Ibis and SQLMesh

Ibis is the portable Python DataFrame library: a unified DataFrame API for any execution engine. Currently, Ibis supports DuckDB, Polars, DataFusion, Snowflake, BigQuery, PySpark, and over a dozen more backends -- most powered by SQL engines, offering the flexibility of Python with the scale of modern SQL. Ibis enables data practitioners to write their transformation code once and reuse it with little to no change across 20+ backends. Ibis is a self-governing open-source project primarily maintained by Voltron Data's team of open-source and DataFrame experts.

Have you ever needed to learn new dialects of database languages as a data scientist or struggled with the differences between database languages? Does your company manage different production pipelines with multiple databases or engines? Have you needed to rewrite data pipelines from experimentation to deployment? These are challenges that SQLMesh and Ibis together can solve.

In today's data-driven world, the ability to efficiently analyze and derive insights from vast amounts of data is paramount. Leveraging powerful open-source tools like SQLMesh and Ibis can streamline this process, enabling you to easily manipulate and query data. In this guide, we'll walk you through the steps of using SQLMesh and Ibis together, empowering you to harness the full potential of your data analytics workflow.

Getting Started

To begin, let's ensure we have the necessary tools installed and set up properly. Follow these steps:

1. Clone or Copy Repository:

  • Start by cloning the repository from  GitHub.

This repository contains the files we'll be working with throughout this guide.

2. Navigate to Repository:

Once the repository is cloned or copied, navigate to its directory (sqlmesh-examples/002_ibis) in your terminal or command prompt.

3. Set Up Virtual Environment:

Create a virtual environment using the following command:

python -m venv .env

4. Activate Environment:

Activate the virtual environment with the command:

. .env/bin/activate

5. Install Packages and Dependencies:

Install Ibis and SQLMesh along with all their dependencies from the requirements.txt using the following command:

pip install -r requirements.txt

Note: SQLMesh is installed with the UI (the [web] part in the requirements file) so you can easily visualize your changes and lineage.

Setting Up Environments and Models

With the prerequisites in place, let's set up environments and models for our data analysis.

Imagine you’re a data scientist developing ETL pipelines and machine learning features. You’re working on a new feature that you think would improve model performance. You have experimented locally and want to deploy the transformation into production.

In the walkthrough below, we’ll recreate this hypothetical scenario, which you’re likely already familiar with. We will first create a production environment that is the status quo and then change and deploy a new transformation.

1. Create Production Environment:

Run the following command to create the production environment and populate it with your tables:

sqlmesh plan

A development environment (dev) will be created later when we have made some changes to our models.

2. Explore Model Files:

Within the /models folder, you'll find five files:

  • seed_model.sql : is the first model that reads the data from ../seeds/seed_data.csv that has 3 columns; id, item_id and ds.

  • incremental_model.sql: a downstream model from seed, it brings in all 3 of the columns from the seed data.

  • full_model.sql: downstream model from incremental, having 2 columns item_id and num_orders (from counting unique id’s).

  • ibis_full_model_sql.py: a python file using ibis to generate and return a sql query. It is functionally identical to the full_model.sql.

  • ibis_full_model_python.py: a SQLMesh python model using Ibis to generate and return a Pandas DataFrame. It is functionally identical to the full_model.sql.

These files represent different stages of data processing and analysis. SQLMesh lets you write queries using either SQL or Python (provided your Python code returns a DataFrame or SQL), which means that you can write your data transformations in Ibis code and use built-in methods to convert it to either SQL or DataFrames in a single line of code. This way, you can reap the benefits of Ibis’s unified and user-friendly interface while also leveraging the full scale of SQLMesh’s many innovations.

You can use either of these methods, but there is a caveat. SQLMesh parses and understands SQL queries from its use of SQLGlot. This means that you can get table and column-level lineage from any model in your project that is defined by SQL code. If you instead want to convert your Ibis query to a DataFrame, you will only get table-level lineage for those, since no SQL is created. (Note in the SQLMesh UI screenshot that the ibis.ibis_full_model_python model’s column names are greyed out.)

Lineage of Ibis Models Figure 1. Lineage of Ibis Models.

Querying Data

Now that our environments and models are set up, let's query some data and make modifications.

1. Fetch Data:

To retrieve data from a specific model, use the following command:

sqlmesh fetchdf "select * from ibis.ibis_full_model_sql;"

This command fetches data from the specified model from the terminal and displays it for you inline:

Model Before Changes Figure 2. Model Before Changes.

2. Making Changes:

Open the ibis_full_model_sql.py file. We will use ibis’s mutate method to add a new column to one of our models:

change = query.mutate(new_col=query.item_id * 2)

Add this code to line 27 just before the python return statement.

Adjust the return statement to return your change.

return ibis.to_sql(change)

Save your file and run sqlmesh plan dev to apply the changes to the development environment.

Note: This same change can be made to the ibis_full_model_python.py file as well. Add the same change column code before the return statement and adjust the return to be return change.to_pandas() instead of return query.to_pandas().

3. Verify Changes:

Fetch data again using sqlmesh fetchdf to verify that the changes have been applied correctly. To fetch data from the dev environment add a __dev to the name of the model as shown below:

sqlmesh fetchdf "select * from ibis__dev.ibis_full_model_sql;"

Model After Changes Figure 2. Model After Changes.

See how this model now has a new column, reflecting our change! Once you’re happy with all your changes in dev, you can run sqlmesh plan to apply this and any other changes you made to the production environment.

Conclusion

By combining the power of SQLMesh and Ibis, you can streamline your data analysis workflow and derive valuable insights from your datasets. Experiment with different models and analyses to uncover hidden patterns and trends within your data!

Armed with these techniques, you can confidently explore your data and unlock its full potential. Happy analyzing!