Tobiko

Docs + Testing: Are We Optimizing for the Wrong Thing?

Working in Public is so Wonderful

jenna_sung_convo

Ever since she read my personal blog post about joining Tobiko, I’ve had a string of wonderful conversations with Jenna Jordan!

I was noodling on an initial backlog of content that I was slow cooking until I spoke with her about documentation and developer experience (read: ergonomics). I have never been so glad to put content back in the oven and build something from scratch, all because of candid conversations with someone who surprised me. Those conversations crystallized a question that’s been humming in my head ever since.

How much context and assurance do I need to know I’m developing in the right direction?

This question is timeless and something we mutter to ourselves whenever we work. If you don’t ask this of yourself when you work, I’ll just assume you always know what to do at any given time ;)

  • What table do I start with?
  • Who owns this table?
  • Who do I go to ask about table XYZ?

What Questions Matter to Us?

Right now we live in a reality where column-level lineage is a table-stakes commodity (here’s looking at you, SQLGlot ;)), major data transformation tools like SQLMesh and dbt treat testing (both audits and unit testing) as a first-class citizen, and documentation is automated to some degree. Now that the baseline starts so much higher than it did even two years ago, our collective standards are higher. We are asking harder and necessary questions about how we work.

  • Is it a fair ask for people to write so much yaml by hand?
  • Do we really want a better data catalog or better search?
  • Are we all optimizing for the wrong thing?
  • Is self service context…possible?
  • Wave a magic hand, what does the UX look and feel like to answer the above questions?

I asked Jenna what she thought of these questions, and she helped articulate them for me in much greater detail.

  • Is it a fair ask to expect people to write so much yaml by hand? -> What are we trying to accomplish with the information that goes into the YAML files, and in which of those cases is writing yaml by hand appropriate for accomplishing those goals (vs when is it boilerplate that should be automated/abstracted away)? What would a developer flow look like if that YAML is automated/abstracted away?
  • Do we really want a data catalog or do we want better search? -> When people go to a data catalog, what are their primary use cases? In what situations is dbt docs/Explorer sufficient, vs when is a dedicated data catalog tool needed? For transformation use cases that don't require a dedicated data catalog tool, what does an ideal UI look like (is it just search?)?
  • Are we optimizing for the wrong thing? -> What is dbt optimizing for (esp. after their recent showcase)? What is SQLMesh optimizing for? (maybe what are other transformation tools optimizing for?)... What should we be optimizing for to meet a variety of business needs in order to actually add value (and reduce costs.... improve ROI of the data team)?

Alright, let’s answer these questions.

What are we trying to accomplish with the information that we manually write by hand?

dbt Reflection

Let’s take a look at what’s included in the docs experience in dbt and how it answers the tiny questions we ask ourselves when tasked with updating this model.

--stg_payments.sql
with source as (

    {#-
    Normally we would select from the table here, but we are using seeds to load
    our data in this project
    #}
    select * from {{ ref('raw_payments') }}

),

renamed as (

    select
        id as payment_id,
        order_id,
        payment_method,

        -- `amount` is currently stored in cents, so we convert it to dollars
        amount / 100 as amount

    from source

)

select * from renamed
# schema.yml
version: 2

models:
  - name: stg_payments
    columns:
      - name: payment_id
        description: primary key
        tests:
          - unique
          - not_null
      - name: payment_method
        description: credit_card, bank_transfer, etc.
        tests:
          - accepted_values:
              values: ['credit_card', 'coupon', 'bank_transfer', 'gift_card']
  • Is this tested?
    • Yes, I need to open another tab and look at the yaml
  • How is this tested?
    • Unique and not_null tests in the yaml, along with an accepted_values test
  • How often does this rerun?
    • I’ll have to check my cron schedule in a GitHub Action, in Airflow, or in dbt Cloud if I paid for it
  • When was the previous run and when’s the next run?
    • See above and do some mental math
  • What does this table do?
    • No one filled out the description, but from the stg_ naming convention it looks like it’s cleaning up a raw source table
  • What does each column mean?
    • Descriptions in the yaml and an inline comment in the SQL file itself
  • How does this relate to other tables?
    • Run dbt docs generate or use the dbt Power User VS Code extension and get node-level lineage
  • What’s the blast radius of change when I update this table?
    • I get a bird’s-eye view that it will affect things at the node level, but it’s up to me to eyeball how column-level impact will cascade or not
  • How do I create a unit test for this?

    • I have to write it all by hand. Maybe someone created a code generation dbt package, but I’ll probably google for 5 minutes before my laziness makes me give up.
    • See docs
      Example: that’s right, I’m too lazy to write one based on the actual code above, but there’s a rough sketch right after this list. dbt_create_test
  • How has this table been used so far?

    • Tough luck kid. You gotta DM someone.
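
For the curious, here’s a rough sketch of what that hand-written unit test could look like with dbt 1.8+’s unit_tests spec. The test name and row values are made up for illustration, not pulled from a real project:

# schema.yml
unit_tests:
  - name: test_stg_payments_converts_cents_to_dollars
    model: stg_payments
    given:
      - input: ref('raw_payments')
        rows:
          - {id: 1, order_id: 1, payment_method: credit_card, amount: 1000}
    expect:
      rows:
        - {payment_id: 1, order_id: 1, payment_method: credit_card, amount: 10}

Every input fixture and expected row gets typed out by hand, which is exactly the part my laziness objects to.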

This is what I’m used to. This is what a lot of us are used to.

And that familiarity lends itself to productivity. But you and I both know:

  • You’re lucky if your dbt project is documented consistently in yaml
  • You’re lucky if all the columns are listed out
  • You’re lucky if your team is willing to write all those unit tests by hand

I know these things from working at dbt Labs AND from selling software at Datafold that complements dbt. People don’t keep up their yaml documentation. They just don’t. I know this very deeply because of the friction of getting Datafold customers to add this config to their yaml so that data diffing would work properly. This isn’t a thought experiment. This is real life.

# schema.yml
models:
  - name: dim_orgs
    columns:
      - name: org_id
        tests:
          - not_null
          - unique
        tags: ['primary-key'] # literally just this, oh how many customer support tickets I wished to have avoided

What has actually made progress in motivating docs maintenance is a tool built by a company outside of dbt Labs.

dbt Power User, baby. dbt_power_user_gif

This is a great tool. I shower it with well-deserved flowers. I loved using it. Heck, I still poke my head in there every once in a while, even while working at Tobiko, just to see what they’re cooking up. Because it solves a problem that resonates so deeply with me.

Automate the boring work for me please and give me one less tab to babysit.

And it looks like dbt Labs thinks so too

dbt_assist_gif

Or maybe Matt Palmer can help us all out. Here’s a post he made.

However, there’s something these docs don’t do well. They don’t tell me how this model interfaces with the people I’m serving. They don’t tell me the most important things that matter. Who is this really for? Am I solving a problem that matters? Is it worth the effort to update this vs. add a brand new SQL model?

I still have to DM someone.

SQLMesh Reflection

Let’s take a look at what’s included in the docs experience in SQLMesh and how it answers the tiny questions we ask ourselves when tasked with updating this model.

--stg_payments.sql

MODEL (
  name demo.stg_payments,
  cron '@daily',
  grain payment_id,
  audits (UNIQUE_VALUES(columns = (
    payment_id
  )), NOT_NULL(columns = (
    payment_id
  )))
);

SELECT
  id AS payment_id, --primary key
  order_id, --foreign key
  payment_method, -- ex: credit_card, bank_transfer
  amount / 100 AS amount /* `amount` is currently stored in cents, so we convert it to dollars */
FROM demo.seed_raw_payments

sqlmesh_docs

  • Is this tested?
    • Yes, looks like via audits inline with top-level config
  • How is this tested?
    • Looks like unique and not nulls are tested on a single column
  • How often does this rerun?
    • Daily, hmm, is that just 12am UTC?
  • When was the previous run and when’s the next run?
    • See the cron periods in the docs metadata
  • What does this table do?
    • No one filled out the description, but from the stg_ naming convention it looks like it’s cleaning up a raw source table
  • What does each column mean?
    • Inline SQL comments that I can view in the UI to understand
  • How does this relate to other tables?
    • Ahh I can see column level lineage in the editor UI along with more friendly UI for the inline SQL comments
  • What’s the blast radius of change when I update this table?
    • Column level lineage helps me drag around the dependencies and glance at what matters
  • How do I create a unit test for this?

    • Run this: sqlmesh create_test demo.stg_payments --query demo.seed_raw_payments "select * from demo.seed_raw_payments limit 5"
      Automatically generated unit test based on data queried live (rough sketch of the output right after this list). That’s right. Juicy determinism. sqlmesh_create_test
  • How has this table been used so far?

    • Tough luck kid. You gotta DM someone.
    • Probably look at query history if I’m extra motivated, but this is an intermediate staging table so it won’t give much signal unless I look at the query history of the downstream table that most end users are actually querying ad hoc or from a BI report
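
To make that concrete, here’s roughly the shape of the YAML that create_test writes out. The file name and row values below are invented for illustration; the real file is generated from whatever the query actually returns:

# tests/test_stg_payments.yaml
test_stg_payments:
  model: demo.stg_payments
  inputs:
    demo.seed_raw_payments:
      rows:
        - id: 1
          order_id: 1
          payment_method: credit_card
          amount: 1000
  outputs:
    query:
      rows:
        - payment_id: 1
          order_id: 1
          payment_method: credit_card
          amount: 10.0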

Keep in mind, I’ve been entrenched in dbt’s UX/UI for years of my data engineering career, so my mind is ferociously asking itself how much of my productivity is rooted in familiarity vs. good UI/UX. I don’t know if I can ever fully remove that bias, but heck, it’s worth trying!

I hate context switching.

I say this because I never knew how much friction came from building docs and configs in a separate file outside of my SQL, and just how much unnecessary friction that added to simply understanding what the heck this SQL file is really doing. You notice all these little questions I ask? I can answer most of them just by looking at the raw SQL file. If I want a higher fidelity experience, I can open the SQLMesh UI and instantly validate my mental model that the configs are doing what I expect them to do. The inline SQL comments (how I naturally do this in real life) translate to canonical docs in the metadata pane on the right side. I even know the data types without having to hard code them or check them against the query console. And this is coming from the guy who built v1 of dbt model contracts. I even know the next scheduled run! But honestly, the best part is that this allows me to be lazy in the ways it makes sense to be lazy: building unit tests. create_test is the best command I never knew I wanted until now (see video above to see it in action).

However, there’s something these docs don’t do well either. They don’t tell me how this model interfaces with the people I’m serving. They don’t tell me the most important things that matter.

  • Who is this really for?
  • Am I solving for a problem that matters?
  • Is it worth the effort to update this vs. add a brand new SQL model?

I still have to DM someone.

When people go to a data catalog, what are their primary use cases?

The obvious answer: to understand the upstream and downstream dependencies in the transformation chain.

  • What Looker, Tableau, and Power BI reports does this touch?
  • Who’s building ad hoc tables/views on top of this?
  • Are there random duplicates floating in the ether?
  • Are we using Fivetran or a custom loader for this?
  • Who owns all the above components?

SQLMesh and dbt can’t answer these and that’s okay. These tools aren’t data catalogs. They are transformation tools that provide useful docs that feel like catalogs. I’m tempted to write more, but this is not the main character in today’s story.

The answer to that is going to vary heavily depending on who those ppl are. An analyst trying to build a report is not going to give a fig about the lineage, they just want to know what table to use for their report.

-Jenna

What’s Worth Optimizing?

dbt sets a precedent that verbosity == productivity and that automating and formatting that verbosity is the goal.

SQLMesh sets a precedent that inline business logic + metadata == productivity and that automating UI/UX inline is the goal.

Both have their merits. Both have their charm.

However, it feels like answering the above questions has culminated in the realization that I never wanted to swim in oceans of documentation and yaml. Increasing my typing speed feels like succeeding at the wrong thing. I want surgical answers. I want how search makes me feel. Like how Perplexity opened the internet’s collective consciousness to the idea that search was really about specific answers with cited sources rather than blue links with friction. Like how people don’t want to swim through a data catalog every time just to understand where to start and where their changes will end. I think LLMs can help, but building data pipelines is such a human, opinionated experience (think: do we all have the same definition of daily active user?). What we really want is someone we trust. A better DM starting point, a forum, something where a person we respect puts their name on the line to share the answers I need. That’s what DMs are all about. That’s what reddit karma is all about.
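
To make the daily active user point concrete, here are two equally defensible definitions that will quietly disagree in every dashboard (table and column names here are hypothetical):

-- dau_definitions.sql (hypothetical)
-- Definition 1: any event at all counts as "active"
SELECT event_date, COUNT(DISTINCT user_id) AS daily_active_users
FROM events
GROUP BY event_date;

-- Definition 2: only "meaningful" activity counts
SELECT event_date, COUNT(DISTINCT user_id) AS daily_active_users
FROM events
WHERE event_type NOT IN ('page_view', 'heartbeat')
GROUP BY event_date;

No amount of generated documentation settles which one the business actually means. That’s the conversation the DM is for.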

We all do this, don’t lie.


I think a realistic starting point for us all is reframing the original question.

How much context and assurance do I need to know I’m developing in the right direction?

What’s the best DM I can send my teammate so we can both move on with our days and I don’t waste their time?

That starts with automating and enhancing the ergonomics around that question so splendidly that 90% of the DM isn’t even needed anymore (think: look at all the questions we answered above!). Then we just ask, “So you want daily active user to be defined as…?” and, well, the rest is still unwritten.

What Now?

You can start by DMing me ;) Ahhhh I’m just playing. But really, it’s about talking with yourself, your team, and me about what SQLMesh, dbt, heck, the whole data industry should be optimizing for to elevate how you work. And maybe we’ll help you send one less poorly constructed DM along the way 🙂

Ciao y’all.

Note: If you DM me, I may just write a blog about it and give you your well deserved flowers in public. Ball’s in your court.