Tobiko

Automated Data Warehouse Migrations

Migrating from one data warehouse to another is difficult. A million things can go wrong when converting SQL code, securely moving data, and validating results in the new warehouse. According to McKinsey, 75% of migration projects run over budget and 38% finish behind schedule; that's painful for engineers and executives alike.

With the advent of open table formats like Delta, Hudi, and Iceberg, many organizations are migrating to data lakes to save on costs and to avoid vendor lock-in.

However, getting data into the data lake is only half the battle when moving to a vendor-agnostic platform. The SQL queries that transform data into insights and applications are written in vendor-specific dialects, and your team is on the hook for translating the codebase.

SQL or Gobbledygook

SQL is the lingua franca of data. It was created in 1974, and the first ANSI SQL standard was published in 1986.

Since its creation, there have been dozens of flavors of SQL with drastically different syntax and semantics. There are variations in data types, functions, case sensitivity, datetime syntax, and more.

standards.png Every SQL engine / warehouse has its own flavor of our favorite data language

One cause of this fragmentation is that there's little incentive for data warehouse implementers to conform to a standard. Existing customers have large code bases, and non-portable SQL makes it more difficult (and less likely) for companies to switch warehouses.

infinite_doors.webp Millions of lines of SQL can take years to migrate

Motivation

We at Tobiko have been thinking about bridging the gap between SQL engines for quite some time.

At companies like Netflix and Airbnb, Spark and Trino are used to query petabytes of data stored on Iceberg. Trino is used for quick ad hoc analysis and dashboards, while Spark is used for reliable batch computation.

Unfortunately, Spark and Trino use very different SQL dialects. At Netflix, I started working on SQLGlot to give scientists and analysts the ability to reuse queries between Trino and Spark rather than manually rewriting them.

That initial motivation led to Tobiko creating SQLMesh, which is built on top of SQLGlot. It was designed from the ground up to enable projects to work seamlessly across multiple SQL engines.

Translating SQL

There are three major steps a translation tool must accomplish to run SQL across different platforms: parsing, normalization, and generation.

transpilation_flow.png Transpiling is hard but can be made easier with a canonical representation

Parsing

Understanding SQL is the first step towards portable SQL. Without parsing, SQL queries are just raw text with no meaning. But once you parse SQL queries into an Abstract Syntax Tree (AST), it becomes possible to systematically understand and manipulate queries.

SQLMesh uses SQLGlot to parse SQL queries into an AST, which makes it possible to convert them from one dialect to another. SQLGlot handles mappings of data types, functions, date formats, JSON, and even advanced constructs such as unnest / explode.
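
To make this concrete, here's a minimal sketch of parsing a query with SQLGlot (the query and table names are just examples):

import sqlglot

# Parse a query into SQLGlot's AST; the result is a tree of expression nodes
# that can be inspected, rewritten, and re-generated as SQL.
ast = sqlglot.parse_one("SELECT user_id, COUNT(*) FROM events GROUP BY user_id", read="duckdb")

print(type(ast))                  # <class 'sqlglot.expressions.Select'>
print(ast.sql(dialect="duckdb"))  # round-trips back to SQL text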

Supporting these mappings across 20+ dialects is an arduous and never-ending task, but SQLGlot's architecture makes it easy to fill in the gaps. SQLGlot isn't perfect; there are still cases it doesn't transpile correctly. But the beauty of open source is that anyone can contribute issues or pull requests, and everyone benefits.
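
As an example of these mappings, a dialect-specific function can be rewritten in terms of functions another engine understands (exact output can vary between SQLGlot versions):

import sqlglot

# DuckDB's EPOCH_MS has no direct equivalent in Hive / Spark SQL,
# so SQLGlot rewrites it using functions those engines do support.
print(sqlglot.transpile("SELECT EPOCH_MS(1618088028295)", read="duckdb", write="hive")[0])
# e.g. SELECT FROM_UNIXTIME(1618088028295 / 1000)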

We pride ourselves on resolving issues quickly, so if you ever run into one, we welcome your issue reports and PRs.

oss_insights.png We generally respond to issues within an hour

Normalization

After parsing SQL code into an AST... we're still not ready to translate.

Just as an email written in ALL CAPITAL LETTERS reads differently, some SQL engines care a lot about capitalization. SQLMesh handles these differences by normalizing the names of objects in your SQL code.

Each engine / warehouse has different rules regarding case sensitivity of column and table names. There are four categories of case sensitivity: case insensitive, lower case, upper case, and mixed.

Case insensitive

Case insensitive engines (Redshift, DuckDB, Trino, SQLite, Spark, and Microsoft SQL Server) treat all identifiers as the same, regardless of quotes.

"silver"."COUNTRIES" = SILVER.COUNTRIES

Lower case

Lower case engines (ClickHouse and Postgres) treat unquoted identifiers as lower case.

"silver"."countries" = silver.countries
"SILVER"."COUNTRIES" <> silver.countries

Upper case

Upper case engines (Snowflake and Oracle, following the SQL standard) treat unquoted identifiers as upper case.

"MY"."TABLE" = my.table
"my"."table" <> my.table

Mixed

Finally, in BigQuery, CTE names are case insensitive, but UDFs (user-defined functions) and table names are case sensitive. MySQL's case sensitivity depends on your file system.

SQLMesh understands all of these nuances and captures the behavior of your project's quoting at parse time before normalizing identifiers for use in other dialects.

snowflake.png A simple model definition written for Snowflake can be challenging to translate to other engines

In this particular example, because all identifiers are unquoted, Snowflake treats them as equivalent to upper case.

bigquery.png This would be the equivalent definition written in BigQuery, taking into account all of Snowflake's assumptions.
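
Here's a rough sketch of that normalization in code, using SQLGlot's identifier-normalization helper (the table and column names are illustrative, and the exact output can vary between versions):

import sqlglot
from sqlglot.optimizer.normalize_identifiers import normalize_identifiers

# Unquoted identifiers in Snowflake resolve to upper case...
expr = sqlglot.parse_one("SELECT id FROM silver.countries", read="snowflake")
expr = normalize_identifiers(expr, dialect="snowflake")

# ...so when generating SQL for a case sensitive engine like BigQuery,
# the identifiers keep the casing Snowflake would have resolved them to.
print(expr.sql(dialect="bigquery"))  # e.g. SELECT ID FROM SILVER.COUNTRIES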

Generation

Now that the entire project's code is understood and normalized, SQLMesh is ready to generate the appropriate SQL for a specific engine.

Although SQLGlot handles most of the nuances of translating functions and data types, each engine has its own feature set for things like transactions, insert / merge capabilities, and other DDL / DML differences.

SQLMesh captures those differences with individual engine adapters, which implement high-level methods like listing the data objects in a schema.

The adapters allow projects with code written in one dialect to seamlessly execute on an engine that runs a different dialect. You can even use multiple dialects across a project's models (although that might get confusing).
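
To illustrate the idea, here is a simplified, hypothetical sketch of what an adapter interface can look like (the names below are illustrative, not SQLMesh's actual API):

from abc import ABC, abstractmethod

# Hypothetical sketch: every engine exposes the same high-level operations,
# and each adapter translates them into that engine's DDL / DML.
class EngineAdapter(ABC):
    dialect: str  # the SQL dialect to generate for this engine

    @abstractmethod
    def list_data_objects(self, schema: str) -> list[str]:
        """Return the tables and views that exist in a schema."""

    @abstractmethod
    def merge(self, target_table: str, source_query: str, unique_key: list[str]) -> None:
        """Upsert rows, using MERGE where the engine supports it
        or an equivalent DELETE + INSERT otherwise."""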

Migrating warehouses with SQLMesh

SQLMesh makes migrating warehouses easy. It can even be used when you want to maintain support for multiple warehouses at the same time.

It allows you to write SQL queries in one dialect and run them in another. All you need to do is set up a gateway. SQLMesh will parse and normalize all of your SQL queries and then generate the corresponding code to run in the warehouse of your choice.

config.png An example SQLMesh config file that can run models written in DuckDB in both DuckDB and Postgres.
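
In Python config form, the same setup looks roughly like this (connection values are placeholders; see the SQLMesh configuration docs for the exact fields available in your version):

from sqlmesh.core.config import (
    Config,
    GatewayConfig,
    ModelDefaultsConfig,
    DuckDBConnectionConfig,
    PostgresConnectionConfig,
)

# Models are written in DuckDB's dialect but can run on either gateway.
config = Config(
    model_defaults=ModelDefaultsConfig(dialect="duckdb"),
    default_gateway="duckdb",
    gateways={
        "duckdb": GatewayConfig(connection=DuckDBConnectionConfig(database="local.db")),
        "postgres": GatewayConfig(
            connection=PostgresConnectionConfig(
                host="localhost",  # placeholder connection settings
                port=5432,
                user="sqlmesh",
                password="sqlmesh",
                database="analytics",
            )
        ),
    },
)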

SQLMesh is also compatible with dbt, so if you have a dbt project running on Snowflake and want to migrate to Databricks, SQLMesh can do that for you. Beyond transpilation, it provides many additional features like column-level lineage.

To sum it all up, open formats are only the first step towards avoiding vendor lock-in. Warehouses trap you with custom SQL dialects. SQLGlot understands many dialects and facilitates transpilation. SQLMesh leverages SQLGlot to normalize and orchestrate your queries, taking into account differing engine capabilities.

Although it may not be 100% perfect, Tobiko is committed to working with the open-source community to make your SQL codebase runnable anywhere.

We'd love for you to use SQLMesh to help you migrate data warehouses or move toward being vendor agnostic. Join our Slack community and let us know if you run into any issues or have any questions!