Column-Level vs Table-Level Lineage

Table-level lineage is easy to build. Parse the SQL, find the FROM and JOIN clauses, draw an edge between source and destination. Most modern warehouses and orchestration tools will give you this essentially for free. The problem is that table-level lineage answers a question nobody is actually asking during an incident. "Which tables are upstream of this one?" is a fine graph theory question. It is useless when an executive is asking why their Q3 revenue number changed at 8am on a Wednesday.

Column-level lineage is harder. It requires understanding what happens inside the SQL - which columns from which source tables flow into which columns in the destination table, through what transformation logic. The difference between these two levels of detail is the difference between a 3-hour debugging session and a 3-minute one. This article breaks down exactly what you lose when you only have table-level lineage, and what you gain when you get to columns.

What Table-Level Lineage Actually Tells You

Table-level lineage gives you a directed acyclic graph where nodes are tables and edges are data flow relationships. If table B depends on table A, there is an edge from A to B. This is useful for two things: understanding the rough shape of your pipeline and identifying which tables might be affected when an upstream source goes down.

Practically, it means you can answer: "If I drop table A, which tables break?" It does not mean you can answer: "The discount_pct column in our revenue table shows a null rate of 23% this morning. Which transformation introduced that?" For that question, you need to know which dbt model computes discount_pct, which source column feeds it, and what filter or CASE statement might produce nulls.
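The "which tables break" question is plain graph reachability over the table-level DAG. A minimal sketch in Python, with illustrative table names (none of them from any real warehouse):

```python
# Table-level lineage as an adjacency list: source table -> tables built from it.
# Table names are illustrative.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["orders_daily", "fct_revenue_daily"],
    "orders_daily": [],
    "fct_revenue_daily": [],
}

def downstream_of(table: str) -> set:
    """Every table transitively built from `table` - the 'what breaks if I drop it' set."""
    affected = set()
    for child in LINEAGE.get(table, []):
        affected.add(child)
        affected |= downstream_of(child)
    return affected

print(sorted(downstream_of("raw_orders")))
# ['fct_revenue_daily', 'orders_daily', 'stg_orders']
```

Note that the answer is a set of tables and nothing more - exactly the granularity ceiling this section describes.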

Most open-source lineage tooling - OpenLineage, Marquez, dbt's built-in lineage graph - operates primarily at the table level. These tools produce beautiful graphs that look comprehensive and are genuinely useful for documentation. They are not sufficient for incident triage at column granularity.

What Column-Level Lineage Tells You That Tables Cannot

Column-level lineage traces each column in a destination table back through every transformation step to its origin in a source system. For a column like net_revenue_usd in a Snowflake reporting table, column-level lineage might reveal a chain like this: payment records land from the source payments API, a staging model standardizes the local-currency amounts, and fct_revenue_daily joins those amounts against ref_fx_rates to convert them to USD.

When the value in that column drops by 18% overnight, column-level lineage tells you immediately that ref_fx_rates did not update yesterday (the Fivetran connector for the FX rate provider had a 401 auth error), which means all USD conversions used stale rates from the day before. Table-level lineage tells you that fct_revenue_daily depends on ref_fx_rates. That is a substantially different amount of actionable information.
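A chain like this can be represented as column-level edges and walked upstream. A sketch with hypothetical staging and source names (only fct_revenue_daily and ref_fx_rates come from the scenario above; the rest are invented for illustration):

```python
# Column-level edges: destination (table, column) -> source (table, column) list.
# Table and column names are illustrative.
COLUMN_PARENTS = {
    ("fct_revenue_daily", "net_revenue_usd"): [
        ("stg_payments", "amount_local"),
        ("ref_fx_rates", "usd_rate"),
    ],
    ("stg_payments", "amount_local"): [("payments_api.payments", "amount")],
    ("ref_fx_rates", "usd_rate"): [("fivetran.fx_feed", "rate")],
}

def trace_upstream(table: str, column: str):
    """Yield every (table, column) that feeds the given column, depth-first."""
    for parent in COLUMN_PARENTS.get((table, column), []):
        yield parent
        yield from trace_upstream(*parent)

for src in trace_upstream("fct_revenue_daily", "net_revenue_usd"):
    print(src)
```

A stale ref_fx_rates shows up here as one specific upstream node on one specific column's trace, not as one of a dozen upstream tables.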

The Incident Triage Math

Consider a common scenario: a data quality alert fires at 6:47am saying the row count in orders_daily is 31% below the 30-day average. Your on-call data engineer gets paged. With table-level lineage, the debugging flow looks like this:

Step 1 (8 minutes): Open the lineage graph. Identify 14 upstream tables.
Step 2 (15 minutes): Check each upstream table's freshness. Three look delayed.
Step 3 (20 minutes): Check which columns in orders_daily are populated vs null to narrow down which upstream is responsible.
Step 4 (25 minutes): Find the specific dbt model that joins the problematic upstream.
Step 5 (10 minutes): Identify the engineer who owns that model and page them.
Total: roughly 80 minutes to identify root cause and owner.

With column-level lineage:
Step 1 (45 seconds): Open the alert. The monitoring platform shows that the freshness violation originates at the payment_method_id column, which traces directly to a Stripe API connector that last synced at 11pm.
Step 2 (90 seconds): The platform shows the owning engineer for the Stripe connector and pages them.
Total: under 3 minutes.

Multiply this across a data team handling 15-20 incidents per month, and the arithmetic becomes compelling quickly. As discussed in our Seed Round announcement, this time savings was one of the core design goals for the Decube platform.
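The arithmetic is worth making explicit. A quick back-of-envelope using the numbers from this section (roughly 80 minutes vs roughly 3 minutes per incident):

```python
# Time recovered per month if triage drops from ~80 to ~3 minutes per incident.
table_level_min = 80
column_level_min = 3
saved_per_incident = table_level_min - column_level_min  # 77 minutes

for incidents_per_month in (15, 20):
    hours_saved = saved_per_incident * incidents_per_month / 60
    print(f"{incidents_per_month} incidents/month -> {hours_saved:.1f} engineer-hours saved")
```

At 15-20 incidents per month that works out to roughly 19-26 engineer-hours recovered - half a working week, every month, for one on-call rotation.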

Why Column-Level Lineage Is Hard to Build

Building accurate column-level lineage requires more than SQL parsing. There are four specific technical challenges that make this significantly harder than table-level lineage.

Aliasing and CTEs: SQL commonly uses column aliases and common table expressions that rename columns at intermediate steps. A naive tracer that matches output names against source names without resolving aliases will lose track of a column that gets renamed four times across three CTEs. Accurate column lineage requires full semantic analysis of the SQL AST, not just structural parsing.
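The rename problem can be made concrete with a toy resolver that follows alias mappings backwards, step by step. This is a sketch over hand-built rename maps; a real implementation would derive these maps from the parsed SQL AST:

```python
# Each CTE step maps its output column name back to the input name it selected.
# A real tool derives these maps from the SQL AST; here they are hand-built.
RENAMES = [
    {"rev": "gross_amount"},     # cte1:  SELECT gross_amount AS rev ...
    {"revenue": "rev"},          # cte2:  SELECT rev AS revenue ...
    {"net_revenue": "revenue"},  # final: SELECT revenue AS net_revenue ...
]

def resolve(column: str) -> str:
    """Walk the rename maps from the final SELECT back to the original column."""
    for step in reversed(RENAMES):
        column = step.get(column, column)
    return column

print(resolve("net_revenue"))  # gross_amount
```

A tracer that only matches names literally would report that net_revenue has no source, because no upstream table contains a column by that name.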

Dynamic SQL: Templated SQL in dbt uses Jinja macros that generate column lists at compile time. The actual columns selected are not visible in the raw SQL file - you need to resolve the compiled SQL output that dbt generates. This means integrating with the dbt Cloud API or the local dbt manifest.json artifact, not just reading raw model files.
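Resolving compiled SQL from the dbt artifact can be sketched as follows. The manifest here is a hand-built stand-in for a real target/manifest.json, and note the field-name caveat: recent dbt versions store compiled SQL under compiled_code, while older versions used compiled_sql:

```python
# Minimal stand-in for dbt's target/manifest.json; real manifests are much larger.
# The node key format ("model.<project>.<model_name>") follows dbt's convention,
# but this specific project and model are invented for illustration.
manifest = {
    "nodes": {
        "model.analytics.fct_revenue_daily": {
            "compiled_code": "select payment_method_id, amount from stg_payments",
        }
    }
}

def compiled_sql(manifest: dict, node_id: str) -> str:
    """Return the compiled (Jinja-resolved) SQL for a model node."""
    node = manifest["nodes"][node_id]
    # Field name changed across dbt versions: try the newer key first.
    return node.get("compiled_code") or node.get("compiled_sql", "")

print(compiled_sql(manifest, "model.analytics.fct_revenue_daily"))
```

This compiled output - not the raw Jinja-templated model file - is what a column tracer has to parse.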

Warehouse query logs: For warehouses like Snowflake and BigQuery, the most accurate source of column-level lineage is not model definitions - it is the actual query logs that show exactly which SQL ran at what time. Query log-based lineage is more accurate than definition-based lineage because it captures what actually executed, including ad-hoc queries and BI tool-generated SQL that is never captured in dbt models.
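A sketch of turning one query-log record into column-level edges. The record shape loosely follows the objects_modified structure that Snowflake's ACCESS_HISTORY view documents, but the record itself is invented for illustration and a real pipeline would read these rows from the warehouse:

```python
# Shape loosely modeled on Snowflake ACCESS_HISTORY's objects_modified column;
# the record contents are fabricated for illustration.
record = {
    "objects_modified": [
        {
            "objectName": "ANALYTICS.FCT_REVENUE_DAILY",
            "columns": [
                {
                    "columnName": "NET_REVENUE_USD",
                    "directSources": [
                        {"objectName": "ANALYTICS.REF_FX_RATES", "columnName": "USD_RATE"},
                        {"objectName": "ANALYTICS.STG_PAYMENTS", "columnName": "AMOUNT"},
                    ],
                }
            ],
        }
    ]
}

def column_edges(record: dict) -> list:
    """Flatten a log record into ((src_table, src_col), (dst_table, dst_col)) edges."""
    edges = []
    for obj in record.get("objects_modified", []):
        for col in obj.get("columns", []):
            for src in col.get("directSources", []):
                edges.append(
                    ((src["objectName"], src["columnName"]),
                     (obj["objectName"], col["columnName"]))
                )
    return edges

for edge in column_edges(record):
    print(edge)
```

Because these records describe queries that actually executed, the resulting edges also cover ad-hoc and BI-generated SQL that never appears in a dbt project.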

Cross-platform stitching: A column might originate in a Kafka topic, be landed by a Fivetran connector into Snowflake, transformed by a dbt model, and then read by a Looker dashboard. Stitching the lineage graph across four different systems - each with a different metadata format - requires integration with all four APIs, not just one.
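Stitching largely reduces to normalizing each system's identifiers into one canonical form before joining edges, similar in spirit to how OpenLineage names datasets by namespace. A sketch with hypothetical topic, table, and dashboard names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ColumnRef:
    platform: str   # "kafka", "snowflake", "looker", ...
    dataset: str    # topic, table, or dashboard element
    column: str

    def key(self) -> str:
        # Canonical, case-insensitive key so the same physical column matches
        # across systems that report it in different formats.
        return f"{self.platform}://{self.dataset}/{self.column}".lower()

# Each integration emits edges in its own vocabulary; stitching joins on key().
# All names below are invented for illustration.
edges = [
    (ColumnRef("kafka", "payments.v1", "amount"),
     ColumnRef("snowflake", "raw.payments", "AMOUNT")),
    (ColumnRef("snowflake", "raw.payments", "AMOUNT"),
     ColumnRef("snowflake", "analytics.fct_revenue_daily", "net_revenue_usd")),
    (ColumnRef("snowflake", "analytics.fct_revenue_daily", "net_revenue_usd"),
     ColumnRef("looker", "revenue_dashboard", "net_revenue")),
]

graph = {}
for src, dst in edges:
    graph.setdefault(src.key(), []).append(dst.key())

# Walk the stitched chain end to end, Kafka through to Looker.
node, path = "kafka://payments.v1/amount", []
while node:
    path.append(node)
    node = graph.get(node, [None])[0]
print(" -> ".join(path))
```

The hard engineering is in producing correct ColumnRef values from four different metadata APIs; once the identifiers agree, the graph itself is trivial.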

What to Look for in a Column-Level Lineage Tool

Not all column-level lineage tools are equal. Several claim column-level support but implement it through a combination of manual tagging and limited SQL parsing that only works for simple SELECT statements. When evaluating a lineage platform, ask for a live demonstration on realistic workloads: tracing a column through multi-CTE queries that rename it at each step, resolving compiled dbt SQL rather than raw Jinja templates, deriving lineage from warehouse query logs (including BI-generated SQL), and stitching lineage across ingestion, warehouse, and BI layers.

Tools that require manual YAML annotation to capture lineage will always be incomplete - engineers do not update documentation when they change transformation logic under deadline pressure. Lineage has to be automatic and derived from what actually ran, not what was documented.

Column-Level Lineage and Data Governance

The governance use case for column-level lineage is equally strong. Regulations like GDPR and CCPA require organizations to know exactly where personal data flows - not just which tables contain PII, but which columns, and which downstream tables read from those columns. A table-level lineage map that shows users is upstream of events does not tell you whether the email column from users actually propagates into events. Column-level lineage makes that determination precise and automatic.
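The GDPR question - which columns transitively receive personal data - is a transitive closure over the column-level edge graph. A sketch with illustrative table and column names:

```python
from collections import deque

# Column-level edges: (table, column) -> downstream columns that read from it.
# Names are illustrative.
DOWNSTREAM = {
    ("users", "email"): [("events", "user_email"), ("stg_users", "email")],
    ("stg_users", "email"): [("marketing_list", "contact_email")],
    ("users", "id"): [("events", "user_id")],
}

def pii_blast_radius(seed: tuple) -> set:
    """All downstream columns that transitively receive data from `seed`."""
    seen, queue = set(), deque([seed])
    while queue:
        for nxt in DOWNSTREAM.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(pii_blast_radius(("users", "email")))
```

Run on real column edges, this is the deletion and audit scope for a data subject request: every column it returns holds the PII, and every column it excludes can be left alone.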

Impact analysis follows the same pattern. When a database administrator needs to rename a column in a production source table, column-level lineage shows every downstream model, mart, and dashboard that reads that specific column - not just every table that reads from that source. The difference between "this change affects 47 downstream tables" and "this change affects the revenue column read by 3 dashboards" is the difference between a week of testing and a 30-minute targeted review.

Building Toward Full Context

Column-level lineage is necessary but not sufficient for full data context. Knowing that net_revenue_usd comes from Stripe via five transformation steps tells you the technical provenance. It does not tell you what business definition was certified for that column, which team certified it, or whether the definition matches how the finance team uses it in their quarterly reports.

That is why Decube combines column-level lineage with a business glossary layer that links technical columns to business definitions, and a quality monitoring layer that detects when the values in those columns deviate from expected patterns. Lineage tells you where data came from. Quality monitoring tells you when it is wrong. Business glossary tells you what it means. Together, those three layers are what we call data context.

See Column-Level Lineage in Action

Connect Decube to your Snowflake or BigQuery warehouse and get a full column-level lineage graph in under 10 minutes. No agents, no YAML, no professional services.

Book a Demo