The Future of Tobiko - From Sinks to SQL

We've had an incredibly fun journey so far, and we're excited to announce the availability of Tobiko Cloud, our managed cloud and enterprise product, along with a $17.3 million Series A investment round.

The investment is led by Theory Ventures, with participation from Unusual Ventures, 20Sales, Fivetran CEO George Fraser, Census CEO Boris Jabes, and MotherDuck CEO Jordan Tigani. Tomasz Tunguz, Founder of Theory Ventures, will join our founders and Wei Lien Dang of Unusual Ventures on the Board as part of the investment. This brings Tobiko's total funding raised to $21.8 million, including a previously unannounced $4.5M seed round led by Unusual Ventures.

When Toby, Iaroslav, and I began talking about starting a company, we originally looked downstream in the analytics stack but concluded that the source of truth really lies at the transformation layer. As we started exploring the space, we discovered that the approaches used in existing data engineering solutions were rather naive.

The Kitchen Sink

There has been plenty of innovation in the dev tools space for software engineers. Today, software engineers deploy code at companies of all stages and sizes (even the smallest of startups) with sophistication, the benefit of a robust ecosystem of tools, and a strong and developed set of best practices. While some of these ideas translate over to data, data is fundamentally different.

How many lines of code can a team write? Thousands? Maybe millions if it's a larger team, but certainly not billions. If you make a change to your software, it's not the end of the world to recompile your project. It's different with data because your deliverable isn't the actual code that's being written — it's the underlying tables. One small change to a SQL query can easily affect billions of rows of data.

As an analogy, let's say you want to replace the kitchen sink in your home. Since we know that the sink is a unit, and we know what it does and how it works (it's connected to pipes), we can cut out the sink, replace it, and reconnect the pipes. The understanding of “what is a kitchen sink” and “sinks need to be connected to pipes” is intuitive to us.

Computers, at least today, don't have the luxury of human intuition. If your transformation tool doesn't understand the nature of a change to your SQL, it doesn't have much of a choice except to rebuild your entire data warehouse. Rebuilding might work in software development or even for small data teams, but the moment you have meaningful data and workloads, it becomes expensive, both in warehouse compute costs and in the hit to your analysts' productivity as they wait for jobs to finish. Essentially, if you don't understand the nature of what you're doing, you end up tearing down and rebuilding your entire house to replace the kitchen sink.

Our Approach to Data

We created SQLMesh, our open source data transformation framework, because we knew there was a better way to ship data.

Much of our team previously worked at the largest companies in tech: Google, Apple, Airbnb, and Netflix. At this scale, it is impractical to rebuild the warehouse on every minor change. As a result, we took an approach to managing data pipelines that works for small teams and keeps working as organizations grow in both data volume and team size.

While there are many philosophies within our company, there are three main ideas that come to mind:

  • Data is different from software; understanding your construction plans is critical to efficiently building a house (and your data). What started out as an open-source project has been a foundational tool for our journey in data transformation: SQLGlot. SQLGlot allows SQLMesh to parse SQL and semantically understand changes to a query. Did you add a column? If yes, we don't need to rebuild downstream tables. Did you modify a column? Then we need to rebuild downstream tables if those tables depended on the modified column. Understanding the change allows SQLMesh to do only what's necessary, saving money and time. Furthermore, because we understand the nature of the change, SQLMesh can tell you what's going to happen before you do it. I used to be an administrator in the world of competitive Rubik's Cube solving and I accidentally replaced the WHERE clause in a DELETE SQL statement. Bye bye database. I still get flak for that 15 years later…which would have been completely avoided with SQLMesh.

  • Avoid redoing work that has already been done. Computation costs money (and it gets worse the more data you have), so if you've already built a table, why do it again? We introduced SQLMesh's Virtual Data Environments to do exactly this. It's common for analysts to answer similar questions and to iterate on their analysis. They may make a small change to one part, but current tools will recompute the entire analysis. If two analysts run the same query, the computation gets done each time even though nothing has changed in the result. SQLMesh's Virtual Data Environments avoid this inefficiency (which can exist even if you are simply setting up your project).

  • A stateful experience. Successfully doing the above requires you to understand the state of your data and what's happened to it. It's hard to be efficient, even with a robust understanding of your construction plans, if you don't know how much work was done the day before. SQLMesh is built as a stateful experience, meaning we have first-class support for incremental models. It's wasteful to rebuild your entire table just to add the latest day's worth of data.
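To make the first idea concrete, here is a toy sketch of the "did you add a column or modify one?" decision. It is not how SQLMesh is implemented: the real tool parses SQL with SQLGlot, while this sketch assumes each model's columns have already been extracted into a plain dictionary mapping column name to defining expression. The function names, the `orders` example, and the downstream tables are all hypothetical, and only direct downstream tables are considered.

```python
def classify_change(old_cols, new_cols):
    """Compare two versions of a model's columns.

    Both arguments map column name -> defining expression (as a string).
    Returns (added, modified); a removed column counts as modified because
    anything reading it is affected.
    """
    added = set(new_cols) - set(old_cols)
    modified = {c for c in old_cols if new_cols.get(c) != old_cols[c]}
    return added, modified


def downstream_rebuilds(old_cols, new_cols, downstream_reads):
    """Return the direct downstream tables that need a rebuild.

    downstream_reads maps table name -> set of columns it reads from the
    changed model. A table is rebuilt only if it reads a modified column;
    purely added columns never trigger a rebuild.
    """
    _, modified = classify_change(old_cols, new_cols)
    return sorted(
        name for name, cols in downstream_reads.items() if modified & set(cols)
    )


# Hypothetical "orders" model and two tables that read from it.
orders_v1 = {"id": "id", "amount": "amount"}
reads = {"revenue": {"amount"}, "order_ids": {"id"}}

# Adding a column: nothing downstream depends on it yet.
orders_add = {**orders_v1, "currency": "currency"}
print(downstream_rebuilds(orders_v1, orders_add, reads))

# Modifying a column: only tables that read it are affected.
orders_mod = {"id": "id", "amount": "amount * 100"}
print(downstream_rebuilds(orders_v1, orders_mod, reads))
```

In this sketch, adding `currency` triggers no rebuilds, while changing `amount` rebuilds only `revenue` and leaves `order_ids` alone. That is the kitchen sink being replaced without tearing down the house.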

We want to bring software engineering best practices to the world of data, but recognize that data simply isn't the same as software. We hope these ideas will equip data teams to build and maintain their pipelines in a better way, and with a delightful experience.

Open Source, Cloud, and Observer

As firm believers in open source, we are grateful to the community for its support and enthusiasm. Close partnerships with our community members and organizations such as Fivetran, Harness, Dreamhaven, and Pipe have been extremely helpful in ensuring that SQLMesh meets the needs of teams in production environments, and we have seen tangible results: Harness's use of SQLMesh reduced their BigQuery spending by 30-40%.

It has been inspiring to see SQLMesh bring material value to organizations, and today we are thrilled to announce the availability of our managed cloud offering. Tobiko Cloud brings an enterprise-grade hosted version of SQLMesh to organizations. We are still committed to a free and powerful experience with open-source SQLMesh, and we believe that Tobiko Cloud will bring tremendous value to organizations looking to simplify their data transformation workflows and seamlessly scale. Tobiko Cloud includes Observer, our observability product, which lets you rapidly understand and evaluate what's happening in your pipelines.

SQLMesh understands SQL; Observer understands SQLMesh. With Observer, data teams can understand every version of every pipeline that's been run. Because Observer is integrated with SQLMesh, it semantically understands the change to your SQL. When something breaks, we can not only tell you that something is broken, but we can show you what contributed to the break (bad code vs. bad data). We believe that Observer will significantly improve the productivity of data teams and allow them to debug their projects with more speed and precision.

What's Next?

The modern paradigm for data transformation is only a few years old, but it's easy to get complacent with inconveniences. We accept these inconveniences because life was better after the tool than before it. It's easy to think a tool is okay simply because one can keep creating workarounds for its headaches.

Our hope is that Tobiko can redefine the baseline expectations for data transformation: data is different from software, and it needs a tailored solution built on the foundational principles of how data teams work. Data transformation should be intuitive for data analysts and powerful for data engineers. Small teams should write SQL (and avoid debugging messy Jinja), and larger teams shouldn't have to worry about their warehouse costs exploding. We're excited that Tobiko Cloud gives us the opportunity to make this a reality. If you'd like to learn more, please reach out!

P.S.: You can reach me or anyone else on our team through Slack. We're always eager to talk data!