Transpiling SQL #1: JSON Paths

While there is a SQL standard, most database engines choose to support a variation of it. This makes it difficult to write portable SQL code, resulting in applications that are inflexible due to database vendor lock-in.

One of SQLGlot's goals is to bridge the gap between all these different variations, or "dialects", through an easily extensible SQL transpilation framework. This bridge can greatly simplify database migrations and enable SQL code to seamlessly run in different engines, automatically handling compatibility issues.

This post is the beginning of a series that explains how SQLGlot works by analyzing solutions to real-world transpilation issues.

It is assumed that the reader is familiar with the concepts of parsing and Abstract Syntax Trees (ASTs). For those new to SQLGlot, we suggest taking a quick look at its AST primer.

Introduction

Generally speaking, in order to transpile a query from dialect A to dialect B, one has to start by parsing that query according to the syntactical rules of A. This process usually produces an AST that represents the query, which then needs to be converted into a semantically equivalent query in B, or something as close as possible to it if there is no exact equivalent.

In our case there are many SQL dialects, each with its own unique traits that we'll potentially need to transpile. In order to avoid creating countless AST nodes to represent these different traits, SQLGlot chooses to define a standardized AST which unifies similar concepts across dialects.

Figure 1: The transpilation flow in a nutshell

For example, let's take the conversion of strings to timestamps. Most dialects provide a function to do this, a sample of which is shown below:

Dialect    Function
---------  --------------------------------------------
BigQuery   parse_timestamp(format, string[, time_zone])
DuckDB     strptime(string, format)
Postgres   to_timestamp(string, format)
Presto     date_parse(string, format)

Even though these functions have different names and number of arguments, they express the same concept, so SQLGlot represents all of them using a single AST node:

class StrToTime(Func):
    arg_types = {"this": True, "format": True, "zone": False}

The node ("string to time") contains three attributes: this (the string to be converted), format, and zone. The zone attribute is optional because only BigQuery's parse_timestamp accepts the third time_zone argument; the other three dialects never use it.
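To make this concrete, each of the four dialect-specific spellings above should parse into the same StrToTime node. The snippet below is a small sanity check based on the mappings described in this post:

from sqlglot import exp, parse_one

# Each dialect spells the function differently, but all of them are
# expected to parse into the standardized StrToTime node.
examples = [
    ("parse_timestamp('%Y-%m-%d', '2020-12-25')", "bigquery"),
    ("strptime('2020-12-25', '%Y-%m-%d')", "duckdb"),
    ("to_timestamp('2020-12-25', 'YYYY-MM-DD')", "postgres"),
    ("date_parse('2020-12-25', '%Y-%m-%d')", "presto"),
]

for sql, dialect in examples:
    assert isinstance(parse_one(sql, dialect=dialect), exp.StrToTime)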

After parsing the functions into this AST node, we can generate SQL in a new dialect from it. The generation logic can be easily overridden, allowing us to produce the correct function name and arguments in each case.

The following example uses SQLGlot's parse_one function to parse BigQuery's parse_timestamp() call into the ast object, and then uses ast's sql() method to generate the equivalent expression in the BigQuery, DuckDB, PostgreSQL, and Presto dialects:

from sqlglot import parse_one

ast = parse_one(
    "parse_timestamp('%Y-%m-%d', '2020-12-25')",
    dialect="bigquery",
)

ast.sql("bigquery")
# "PARSE_TIMESTAMP('%Y-%m-%d', '2020-12-25')"

ast.sql("duckdb")
# "STRPTIME('2020-12-25', '%Y-%m-%d')"

ast.sql("postgres")
# "TO_TIMESTAMP('2020-12-25', 'YYYY-MM-DD')"

ast.sql("presto")
# "DATE_PARSE('2020-12-25', '%Y-%m-%d')"

It's not just SQL

One of the hurdles that comes with transpiling SQL is that you often have to transpile things that go beyond plain SQL syntax, such as the actual... data themselves. 🤯

In the above example, the timestamp format used in BigQuery ('%Y-%m-%d') was the same in DuckDB and Presto, but different from the one used in Postgres! This means that transpiling this function requires also being able to convert between different date/time formats correctly.

This is another case where having a standardized representation is really useful. Instead of creating its own representation of date/time formats, SQLGlot parses them into Python's existing strftime components and then converts those to the target dialect at generation time. However, we won't go into details here - transpiling date/time formats deserves a post on its own.
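Still, as a quick taste of that machinery, the earlier round-trip can be run in the opposite direction. The outputs below are the expected results given the format mappings shown above; exact rendering may differ slightly between SQLGlot versions:

from sqlglot import parse_one

# Postgres spells the format as 'YYYY-MM-DD'; DuckDB and Presto use
# strftime-style '%Y-%m-%d'.
ast = parse_one("to_timestamp('2020-12-25', 'YYYY-MM-DD')", dialect="postgres")

ast.sql("duckdb")
# Expected: "STRPTIME('2020-12-25', '%Y-%m-%d')"

ast.sql("presto")
# Expected: "DATE_PARSE('2020-12-25', '%Y-%m-%d')"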

A task that has proven to be quite difficult to transpile is extracting values from JSON, mainly because there are so many different ways to express the information to extract.

To demonstrate, consider the JSON document

'{"class": {"students": [{"name": "Jane"}, {"name": "John"}]}}'

that contains the names of two students in a class. This is how we could extract the first student's name in the following dialects:

Dialect     Syntax
----------  -----------------------------------------------------------
BigQuery    json_extract(doc, '$.class.students[0].name')
ClickHouse  JSONExtractString(doc, 'class', 'students', 1, 'name')
DuckDB      doc -> '$.class.students[0].name'
Postgres    doc::json -> 'class' -> 'students' -> 0 -> 'name'
Presto      json_extract(doc, '$.class.students[0].name')
Redshift    json_extract_path_text(doc, 'class', 'students', '0', 'name')
Snowflake   get_path(parse_json(doc), 'class.students[0].name')

A few interesting things to note here:

  1. Some dialects choose to only provide functions like json_extract(), while others also provide binary operators like ->
  2. The function names and their number of arguments can differ
  3. The expected type of JSON the functions take can differ (e.g. having to parse_json(doc) in Snowflake)
  4. The representation of the extraction path can differ (e.g., '$.class.students[0].name' in BigQuery vs. 'class.students[0].name' in Snowflake)

Respecting (1) and (2) when transpiling is generally simple. However, the same doesn't hold for (3) and (4) because the information we can extract from the query may be insufficient.

For instance, if the JSON document is stored in a column instead of being hard-coded as a string, we need to know the column's data type to decide if it's necessary to call parse_json() on it when generating Snowflake syntax.

Similarly, if the extraction path (e.g. doc -> x -> y -> z) is stored in one or more columns, then we can't tell from the query syntax alone what value it leads to, so we may not be able to transpile it. For the rest of this post, we'll only consider paths that are expressed as strings, such as '$.foo'.

Transpiling the extraction paths

It turns out that dialects generally fall into two groups: those that support a subset or variant of the JSONPath syntax (e.g., Presto) and those where the extraction paths are expressed as a sequence of keys and array indexes (e.g., Postgres).

Continuing the pattern of the previous examples, we want to parse and convert these paths into a standardized representation so their transpilation becomes simpler.

For the former group, a hand-written JSONPath parser takes care of converting input extraction paths into a JSONPath node. For the latter group, SQLGlot first parses the corresponding path and then losslessly transforms the resulting AST into a JSONPath node, so that both categories share the same representation.

Finally, we can again override the generation logic of each constituent part per dialect to produce the correct syntax, or even warn if some part in the path can't be transpiled. Some conversions that are supported today can be found in SQLGlot's test suite.
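To see this end to end, the BigQuery extraction from the table above can be re-generated for a few other dialects. The outputs below are the shapes the table leads us to expect; the exact rendering may vary between SQLGlot versions:

from sqlglot import parse_one

ast = parse_one(
    "json_extract(doc, '$.class.students[0].name')",
    dialect="bigquery",
)

ast.sql("presto")
# Expected: "JSON_EXTRACT(doc, '$.class.students[0].name')"

ast.sql("duckdb")
# Expected: "doc -> '$.class.students[0].name'"

ast.sql("redshift")
# Expected: "JSON_EXTRACT_PATH_TEXT(doc, 'class', 'students', '0', 'name')"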

Closing remarks

SQLGlot's support for JSON value extraction has been significantly improved starting with v21.0.0, using the ideas described in this post. There is still room for improvement, as there may be some rough edges related to type incompatibilities, so if you come across any bugs or have a feature request, let us know.

Transpiling SQL is hard, but one can do a pretty good job by bundling similar concepts together into a single representation - that way it's easier to control how each concept maps back to the right SQL syntax. This strategy works especially well for SQLGlot, which currently supports 20 SQL dialects, making it important to minimize the number of changes needed to fix transpilation issues.

If you'd like to suggest topics for future posts or just chat about SQLGlot in general, we would be glad to welcome you in our growing Slack community!