Starting today, Tobiko is releasing a first-of-its-kind feature that empowers data engineers to mix and match the best query engine for each individual task — all within a single project.
The data landscape continues to evolve, and one of the most significant paradigm shifts now taking place is the emergence of data lakes as the new architecture of choice for modern organizations. Recent developments, most notably AWS’s integration with Apache Iceberg, underscore the growing momentum toward vendor neutrality and adoption of open standards.
Across the industry, companies are decoupling storage from computation, demanding interoperability across platforms and tools, sharpening their focus on cost efficiency, and looking for ways to accommodate increasingly diverse data formats and use cases within the same catalog.
At Tobiko, we’re hearing from customers about their challenges with handling datasets and workflows that span a proliferation of sources, types, and applications. That’s why we’re launching support for multi-engine projects — to minimize context- and task-switching as engineers work, and to provide a framework for tackling each task with far more flexibility and precision.
Tobiko’s Multi-Engine Support: One project, multiple engines
Support for multi-engine projects (rolling out for both SQLMesh and our hosted enterprise offering, Tobiko Cloud) allows users to leverage multiple engines within a single project and assign the most suitable engine to each job.
This means you can run individual models on specific engines, depending on your use case and each engine’s strengths. For a workload that calls for large-scale processing, you might use Spark; for medium-sized loads that unify data from mixed sources, Trino; and for quickly processing data locally at minimal cost, DuckDB. With multi-engine support, all of this can now be done in one centralized project.
What sets Tobiko’s Multi-Engine Support apart?
Pairing each task with the engine best suited to it yields obvious efficiency gains. Support for multi-engine projects also frees you from relying on specific query providers or ecosystems, giving you the flexibility to switch tools and adapt as your project’s needs evolve.
Unlike other platforms, such as dbt™, where multiple engines can be used but users must spin up a new project for each engine, Tobiko keeps all engine use cases neatly consolidated within one project. What’s more, dbt restricts its cross-platform mesh to its paid cloud offering, while Tobiko’s feature is free and available to both our paying customers and our open-source community.
We are firm believers that setting up new environments and toggling among them takes engineers out of their flow state (read our blog on how we designed Tobiko’s products to optimize for the developer experience). We built our multi-engine support to do away entirely with these distractions, and we’re committed to making this seamless developer experience accessible to all without a paywall.
To put the proverbial cherry on top, our multi-engine support is enhanced by being part of the robust ecosystem built on top of our open-source parser and transpiler, SQLGlot. Should your code be written in a dialect unique to a specific engine or warehouse, SQLGlot gives you the ability to query otherwise incompatible datasets without skipping a beat, delivering truly effortless cross-engine interoperability.
How to enable multi-engine support
To get you started, we’ll use an example to illustrate how to configure a SQLMesh project with multiple engines. Note that the models you’ll be running must reside in a shared data catalog, and for each engine you use, you must secure admin permissions to perform read/write operations on that shared catalog.
- Begin by adding all required gateway connections in your configuration.
- Next, specify the gateway to be used for execution in the MODEL DDL for each model; if no gateway is explicitly defined, the default gateway will be used for that model. A minimal sketch of both steps follows this list.
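Here’s what those two steps might look like, assuming hypothetical gateway names and placeholder connection details (swap in your own engines and credentials):

```yaml
# config.yaml: gateway names and credentials are illustrative
gateways:
  engine_one:
    connection:
      type: duckdb            # e.g., a local DuckDB connection
  engine_two:
    connection:
      type: postgres          # e.g., a PostgreSQL connection
      host: 127.0.0.1
      port: 5432
      database: main_db
      user: postgres
      password: password

default_gateway: engine_two   # used by any model that doesn't specify a gateway

model_defaults:
  dialect: postgres           # default SQL dialect for the project's models
```

```sql
-- MODEL DDL: pin this (hypothetical) model to a specific gateway
MODEL (
  name analytics.my_model,
  kind FULL,
  gateway engine_one  -- omit this property to fall back to the default gateway
);

SELECT 1 AS id;
```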
In SQLMesh, the virtual layer consists of views, one for each model, each pointing to a snapshot table in the physical layer. In a multi-engine project, the virtual layer is managed by the engine corresponding to the default gateway.
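Conceptually, promotion amounts to the default gateway’s engine maintaining one view per model that points at the current snapshot table (the names below are illustrative, not SQLMesh’s exact physical naming scheme):

```sql
-- Virtual layer: a view per model, managed by the default gateway's engine,
-- pointing at a snapshot table in the physical layer (names are illustrative)
CREATE OR REPLACE VIEW main_db.analytics.orders AS
SELECT * FROM main_db.sqlmesh__analytics.analytics__orders__2489121898;
```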
You can learn more about our virtual data environments in this blog, which explains how they allow you to preview changes with minimal cost before promoting them to production.
Example DuckDB + PostgreSQL setup
Consider a scenario where you use PostgreSQL for most of your project’s models but would like to employ DuckDB for the rest, capitalizing on its vectorized query execution for better performance and efficiency. Below is a simple example of setting up a project with connections to both DuckDB and PostgreSQL.
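A config.yaml along these lines wires up both engines (connection details are placeholders; adjust them to your environment):

```yaml
# config.yaml: connection details are placeholders
gateways:
  duckdb:
    connection:
      type: duckdb
      catalogs:
        # Attach the PostgreSQL database as the main_db catalog so that
        # DuckDB gets read-write access to its physical tables.
        main_db:
          type: postgres
          path: 'dbname=main_db user=postgres host=127.0.0.1'
  postgres:
    connection:
      type: postgres
      host: 127.0.0.1
      port: 5432
      database: main_db
      user: postgres
      password: password

default_gateway: postgres  # PostgreSQL manages the virtual layer's views

model_defaults:
  dialect: postgres
```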
In this setup, the PostgreSQL engine is set as the default gateway, so it will be used to manage views in your virtual layer. Meanwhile, DuckDB’s attach feature enables read-write access to the PostgreSQL catalog’s physical tables.
Given this configuration, when a model’s gateway is explicitly set to DuckDB, the model will be materialized within the PostgreSQL main_db catalog but evaluated using DuckDB’s engine.
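For instance, an order_ship_date model might look like this (the Iceberg table path is a placeholder):

```sql
MODEL (
  name main_db.order_ship_date,
  kind FULL,
  gateway duckdb  -- evaluate this model with the DuckDB engine
);

-- Scan an Iceberg table directly using DuckDB's iceberg extension
SELECT
  l_orderkey,
  l_shipdate
FROM iceberg_scan('data/bucket/lineitem_iceberg', allow_moved_paths = true);
```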
In the order_ship_date model, the gateway is set to DuckDB, so DuckDB evaluates the query and creates the physical table in the PostgreSQL database. This allows you to efficiently scan data from an Iceberg table, or even query files directly from S3 when used with the HTTPFS extension.
In models where no gateway is specified, the default PostgreSQL engine will be used to create physical tables, as well as to create and manage views of the virtual layer.
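For example, a hypothetical downstream order_count model with no gateway property would be created and evaluated by PostgreSQL:

```sql
-- No gateway specified, so the default gateway (PostgreSQL) is used
MODEL (
  name main_db.order_count,
  kind FULL
);

SELECT COUNT(*) AS order_count
FROM main_db.order_ship_date;
```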
In this example, using both PostgreSQL and DuckDB in a single project offers the best of both worlds: PostgreSQL handles transactional workloads with its robust ACID compliance and relational capabilities, while DuckDB excels at fast, in-memory analytics on large datasets in formats like CSV and Parquet. Unlike PostgreSQL, which requires careful indexing to achieve optimal query performance, DuckDB runs complex analytical queries much faster without that setup. This approach lets you leverage PostgreSQL’s versatility for transactional operations and DuckDB’s speed for analytics, applying the optimal engine to each model to streamline your workflow and maximize overall performance.
Advancing the MDS: the right engines for the right jobs
Tobiko’s support for multi-engine projects is a substantial leap forward in data engineering flexibility and efficiency. In a single project, users can harness the strengths of multiple engines by selecting the ones tailored to each workstream. Whether it’s fine-tuning for analytics speed, workload size, or cost-effectiveness, our users can frictionlessly adapt to project needs without losing focus or momentum.
We’re proud to deliver a feature that lets data engineers work smarter and more easily as the industry embraces data lakes, open table formats, and fluid interoperability. Keep an eye out as we continue to expand and refine our suite of capabilities for the modern data stack.