
The Data Dev Environment Graveyard

More Overhead, Less Work: What's Happening Now?

"When was the last time you created a development environment for testing data and felt good about it?"

Perhaps it was last week when you created a new feature branch to get some data prepped for the board meeting? Or maybe it was earlier today when you needed to make an update to an existing pipeline calculating ad spend ROI for the marketing team.

Before you got started, you might have had some idea of what you needed to do, which tables you would be pulling from, and maybe even some of the business logic you would write. But how were you planning on testing this? Maybe the change you needed to make affected a very complex data pipeline, maybe it would impact millions of rows of data, maybe you would need to backfill several tables as well.

That's a ton of friction before you've even started! Questions like these break your flow state. This is a huge hurdle for me, and I bet it is for you too. Frankly, many tools haven't been built to properly address it, including dbt.

I once worked with a dbt customer that gave every developer a separate schema for development. In the beginning this was great: it provided isolated environments and the flexibility for different teams to work on things as needed. Over time, though, it caused a ton of issues. Many of those dev environments went stale quickly. As production moved on and new sources and new rows appeared (read: data drift), it became harder to keep up with the nuances or even maintain an accurate representation of what was in production. The teams were spending more time and money waiting for their dev environments to update than actually building new data products. I think we can all agree that doing more meta work just to get to the actual work shouldn't be normal.

Data Chaos

I remember another customer a while back who had some serious pain from friction in how they worked. Their analyst team would write and execute queries in the Snowflake console, then copy those queries over to dbt, adding in the ref and source functions manually. As you might imagine, without proper guardrails in place, you end up spending more time making sure the copied query still works. The team would then run several additional queries just to test their data, without knowing or understanding the downstream impacts. Can you imagine how much their Snowflake bill was because of this workflow? Hopefully not $500,000.

I want you to feel how I feel

It's easy to criticize workflows, frameworks, and tools as lacking, and I don't intend to write 1,000 words just to do that. Instead, I want you to live in what I'm living in now: a solution that intimately knows the problems above and is informed by those nasty battle scars.

Throughout my exploration of SQLMesh, I noticed myself comparing it to a workflow I was already familiar with rather than testing it on its own terms. That inherently wouldn't lead to a fair assessment, so I regrouped and tried again. The first major concept I needed to digest was Virtual Data Environments. The team has written extensively about them (here and here), but to truly grasp how they work you have to get your hands dirty.

To me, getting my hands dirty isn't just pip installing some software and running a bunch of commands from the documentation. I believe there are a few specific steps to test any technology properly. The first should be defining the actual problems I want to solve. After that, I need to develop a proof of concept. From the proof of concept, I need to test and evaluate it against my defined requirements. Lastly, I need to summarize the results and determine if it provides the value I need.

Back to what matters now. As I loaded the browser UI for SQLMesh (one of three ways to access it), I noticed a dropdown to select or create environments.

Figure 1. Virtual Environments

When I created a new environment, I expected to wait some time for it to build before it was ready to use. In the past, that was enough time to make myself a fresh americano on my Breville Barista machine. However, there was no wait time. Once I clicked add and then clicked plan, my environment was instantly ready for development.

"I must have messed something up. I need to try again."

At first I was pretty skeptical, so I created another environment and ran a sqlmesh plan against it to make sure this wasn't some bug. Again, the environment was ready to use!
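The browser UI is just one way to do this; the same check works from the CLI. Here's a minimal sketch of what that looks like, assuming a project that's already been initialized; the environment name my_dev is a placeholder I made up.

# Create (or update) a virtual environment named "my_dev".
# Because environments are virtual, the plan applies almost instantly
# when nothing needs to be backfilled.
sqlmesh plan my_dev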

Still feeling a bit skeptical about how easy it was to set up a completely new and isolated environment, I decided to edit some models to see what would happen. As I edited models, executed queries, and even ran data diffs between environments, I realized how rarely I had to switch tabs and exercise my copy-and-paste muscles.

Figure 2. Data diff and SQLMesh plan
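The data diffs came from SQLMesh's table diffing, which compares the same model across two environments. A rough sketch of the CLI equivalent is below; the model name marketing.ad_spend_roi is hypothetical, and the exact flags may vary by version, so check the table_diff docs.

# Compare a model between prod and my virtual "my_dev" environment.
# "marketing.ad_spend_roi" is a made-up model name for illustration.
sqlmesh table_diff prod:my_dev marketing.ad_spend_roi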

More output + faster iterations + less context switching = better flow state. But the benefits extend beyond that equation. I no longer needed to think about all those times I hadn't rebuilt my dev schema in a week, or even a day. The annoyance of waiting for every table to rebuild just so I could make the smallest of updates is gone. It just doesn't matter anymore. The extra warehouse costs that came with it are gone too. I no longer need a macro that automatically drops stale datasets based on their age. Most importantly, my environment is a true reflection of production, so I will always be correct.

"I didn't know how many guessing games I was playing until I didn't have to play them anymore."

This completely changed my mental model for data engineering powered by SQLMesh. I still had an itch to understand how this really works in production with several collaborators. In theory, like many things, it sounded great, and I could easily list the benefits from my point of view, but what were others thinking?

I decided to chat with some teams who were already using SQLMesh in production. One of the VPs said that while query cost was important at first, he quickly realized he cared more about his team's output. Even if they were spending the same amount on warehouse costs, he was fine with that because his team was able to execute more queries at a faster rate while staying correct. His team's efficiency drastically increased while those costs stayed the same, and the additional insights they could produce were invaluable.

Figure 3. Data Efficiency

What's Next?

Virtual data environments have been a pleasure to work with, and I'm really enjoying this new workflow. While it was unfamiliar at first, it now feels like the right ergonomics for any data engineering work I do. Next time, I'll write about how you can extend this functionality using our GitHub CI/CD bot!

If you're interested in data, SQL, or just wanna chat, we'd love to meet you. Please join us in our growing Slack community to say hi!
