dbt Docs vs. Automatic Lineage

dbt's built-in documentation and lineage graph are among the best features of the modern data stack. If your entire transformation layer lives in dbt models, you can generate a complete lineage graph with a single command, publish it to dbt Cloud, and give your whole team visibility into how your models connect. For many teams, this is good enough — until it is not.

The problem is not with dbt docs. The problem is the assumption that your entire data stack is visible to dbt. In practice, almost no data team's transformation logic lives exclusively in dbt models. There are Snowpark procedures from three years ago that no one wants to migrate. There are Airflow tasks that run custom SQL against raw tables. There are ad hoc queries that analysts run in a Jupyter notebook that write results back to a production schema. There are Python scripts in Lambda functions that call the warehouse API directly. None of this appears in dbt docs. All of it is part of your real lineage graph.

What dbt Docs Gives You

dbt generates lineage from the Jinja SQL in your model files. It parses the SQL, identifies ref() and source() calls, and builds a directed acyclic graph (DAG) of model dependencies. The output is accurate, comprehensive, and automatically updated every time you run dbt compile. For the portion of your transformation layer that lives in dbt models, this is excellent lineage coverage.
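To make the mechanism concrete, here is a minimal, stdlib-only sketch of the idea: extract ref() calls from toy model SQL with a regular expression and order the resulting DAG topologically. The model names and SQL are hypothetical, and real dbt uses its own Jinja compiler rather than regex, so treat this as an illustration of the technique, not dbt's implementation.

```python
import re
from graphlib import TopologicalSorter

# Toy model files: name -> Jinja SQL (hypothetical; not real dbt internals)
models = {
    "stg_orders": "select * from {{ source('shop', 'orders') }}",
    "stg_customers": "select * from {{ source('shop', 'customers') }}",
    "fct_orders": (
        "select * from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on o.customer_id = c.id"
    ),
}

REF = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")

# Model -> set of upstream models, derived from ref() calls
deps = {name: set(REF.findall(sql)) for name, sql in models.items()}

# A topological order of the DAG is a valid build order
order = list(TopologicalSorter(deps).static_order())
print(order)  # upstream models appear before downstream ones
```

Models that only call source() end up with no upstream edges, which is exactly how they appear at the roots of the dbt DAG.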

dbt docs also captures schema information (column definitions), tests (data quality assertions), and descriptions you document in YAML files. When maintained diligently, this gives you a searchable documentation layer that is genuinely useful for onboarding new engineers and for understanding the intent behind complex models.

The limitation is scope. dbt docs knows about dbt. It does not know about Fivetran connectors, Snowpark procedures, Airflow SQL operators, direct API calls, or anything else in your pipeline that is not expressed as a dbt model.

What Automatic Lineage Extraction Adds

Automatic lineage extraction works from a different starting point: the query history in your warehouse. Every SQL statement that runs against Snowflake, BigQuery, or Redshift — whether it originated from dbt, Fivetran, a Python script, an Airflow task, or an analyst's SQL editor — leaves a record in the query log. By parsing that query log and inferring column-level read and write patterns, it is possible to construct a lineage graph that covers all transformations that actually ran against your warehouse, regardless of what generated them.
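The core move can be sketched in a few lines: scan query-log statements for what they write and what they read, and emit an edge per read-write pair. The log entries below are made up, and a production system would use a full SQL parser (aliases, CTEs, subqueries) rather than regex, but the shape of the extraction is the same.

```python
import re

# Toy query-log entries (hypothetical; real logs come from the warehouse's
# query history and require a proper SQL parser, not regex)
query_log = [
    "INSERT INTO analytics.daily_rev SELECT order_date, sum(amount) "
    "FROM raw.orders GROUP BY order_date",
    "CREATE TABLE marts.orders_enriched AS SELECT * "
    "FROM analytics.daily_rev JOIN raw.fx_rates ON d.order_date = f.day",
]

WRITE = re.compile(r"(?:INSERT INTO|CREATE TABLE)\s+([\w.]+)", re.I)
READ = re.compile(r"(?:FROM|JOIN)\s+([\w.]+)", re.I)

edges = set()
for sql in query_log:
    target = WRITE.search(sql)
    if not target:
        continue  # pure reads create no lineage edge
    for source in READ.findall(sql):
        edges.add((source, target.group(1)))  # upstream -> downstream

print(sorted(edges))
```

Notice that nothing here asks who issued the query: a dbt model, an Airflow task, and an analyst's editor all leave the same kind of record.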

The result is a more complete picture. The dbt DAG shows you the lineage for your modeled transformation layer. The automatic lineage graph shows you the full picture, including the edges that dbt cannot see: the Airflow task that writes directly to a schema that a dbt model later reads, the Snowflake Streamlit app that queries a production table and writes aggregates to a new schema, the Fivetran connector whose raw tables feed your staging models.

These additional edges matter enormously for incident response. When a pipeline breaks, the blast radius question — what downstream assets are affected? — requires knowing the complete lineage graph, not just the dbt portion of it. A data engineer who sees from dbt docs that a broken staging model feeds five downstream models may miss the three Tableau extracts and the Airflow task that also read from the same staging model through a route that is invisible to dbt.
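The blast-radius computation itself is just a graph traversal over the unified edge set. The sketch below uses hypothetical asset names that mirror the scenario above, mixing dbt models with non-dbt assets in one adjacency map:

```python
from collections import deque

# Unified lineage: asset -> direct downstream assets (names hypothetical;
# dbt models and non-dbt assets share one graph)
downstream = {
    "stg_orders": {"fct_orders", "airflow:rev_export", "tableau:orders_extract"},
    "fct_orders": {"tableau:exec_dashboard"},
}

def blast_radius(asset):
    """Everything transitively downstream of `asset`, via BFS."""
    seen, queue = set(), deque([asset])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(blast_radius("stg_orders")))
```

Run against only the dbt portion of this graph, the same traversal would return just the two dbt models and miss the Airflow and Tableau assets entirely.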

The Freshness Problem with Static Documentation

dbt docs are only as current as the last time someone ran dbt docs generate and republished the docs site. In practice, many teams publish docs weekly or even less frequently. In a fast-moving environment where pipelines change daily, this means the docs are always somewhat stale. A model that was deprecated last Tuesday still appears in a docs site last published two weeks ago.

Automatic lineage extraction from query history is continuously updated. Because it is derived from actual queries that ran, the lineage graph reflects the current state of your pipelines rather than the last documented state. If an analyst added a new dependency by running a query yesterday, that dependency is visible in the lineage graph today without anyone having to update documentation.

Column-Level vs. Model-Level Granularity

dbt lineage is model-level by default. The graph shows you that model A depends on model B. It does not tell you which specific columns from model B are used in model A's logic unless you have manually configured column-level lineage documentation in dbt YAML files — a maintenance burden that most teams do not sustain.

Automatic lineage extraction from query logs produces column-level lineage by default because SQL queries operate at the column level. A SELECT statement specifies which columns are read; an INSERT or MERGE specifies which columns are written. Parsing these statements gives you column-level dependency information without any manual annotation.
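As a minimal illustration of that parsing step, the sketch below pairs the column list of an INSERT with its SELECT list positionally. The statement is hypothetical, and positional pairing only works for this simple shape; real parsers resolve aliases, expressions, and CTEs.

```python
import re

# Toy INSERT ... SELECT (hypothetical; production parsers handle aliases,
# expressions, CTEs, and multi-table joins)
sql = (
    "INSERT INTO marts.daily_rev (order_date, revenue) "
    "SELECT order_date, amount FROM raw.orders"
)

m = re.search(
    r"INSERT INTO\s+([\w.]+)\s*\(([^)]+)\)\s*SELECT\s+(.+?)\s+FROM\s+([\w.]+)",
    sql,
    re.I,
)
target, tgt_cols, select_list, source = m.groups()
tgt_cols = [c.strip() for c in tgt_cols.split(",")]
src_cols = [c.strip() for c in select_list.split(",")]

# Positional pairing: source column i feeds target column i
column_edges = [
    (f"{source}.{s}", f"{target}.{t}") for s, t in zip(src_cols, tgt_cols)
]
print(column_edges)
```

Each edge names a specific source column and the specific target column it feeds, which is precisely the granularity model-level lineage lacks.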

The difference matters for impact analysis. A column rename in a source table may break 3 of 20 downstream models — not all 20. Model-level lineage shows you that all 20 models depend on the source table. Column-level lineage shows you which 3 are actually affected. That distinction saves hours of unnecessary investigation.
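The arithmetic of that claim can be shown in a few lines, with hypothetical model names: table-level lineage flags all twenty dependents, while column-level lineage narrows the rename's impact to the models that actually read the column.

```python
# Hypothetical lineage for one source table with 20 downstream models
table_readers = {f"model_{i:02d}" for i in range(20)}   # table-level: all 20
column_readers = {"model_03", "model_07", "model_12"}   # read raw.orders.amount

# Renaming raw.orders.amount: investigate only the intersection
to_investigate = table_readers & column_readers
print(sorted(to_investigate))
```

Seventeen of the twenty candidates drop out before anyone opens a single model file.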

How Decube Integrates dbt and Automatic Lineage

Decube integrates with both dbt Cloud and the warehouse query log to produce a unified lineage graph that includes both sources. dbt model metadata enriches the graph with the documentation, descriptions, and test results that dbt captures. The warehouse query log fills in the gaps — the non-dbt transformation edges that dbt docs cannot see. The result is a single lineage interface that covers the entire transformation stack.

This integration also enables Decube to surface dbt test failures in the context of lineage. When a dbt test fails on a staging model, the platform shows which downstream dbt models depend on that staging model and, where query log lineage is available, which non-dbt assets also depend on it. The combination of static dbt lineage and dynamic query log lineage gives a more complete blast radius than either source alone.

See Your Full Lineage Graph

Decube connects dbt Cloud and warehouse query logs to show you the complete picture.

Book a Demo