Data Observability Program

Most data teams do not lack monitoring tools. They lack a coherent observability program — a set of interlocking capabilities that together answer the question "what is the current state of my data, and how do I know?" This post describes what a mature data observability program looks like, why it matters, and the sequence in which most teams successfully build it.

The distinction between monitoring and observability matters here. Monitoring tells you whether a threshold was breached. Observability tells you why. A monitoring alert saying "row count in orders_daily dropped 40%" is the beginning of an incident, not the end. Observability means you can answer within minutes: which upstream source changed, what the downstream impact is, and who is responsible for the fix.

The Four Pillars of Data Observability

A complete data observability program rests on four interconnected capabilities. They can be built independently, but the value of each multiplies when they are connected.

1. Freshness monitoring: Is data arriving when it should? Freshness checks verify that tables and datasets are updated within expected time windows. A dashboard that runs on data that should be refreshed at 6 AM but has not updated since 11 PM the previous night is already broken before any user has opened it. Freshness monitoring catches this before the first stakeholder complaint.

2. Volume and schema monitoring: Is the data shaped as expected? Volume monitors track row counts, null rates, and the cardinality of key columns. Schema monitors detect structural changes: new columns, removed columns, renamed columns, and changed data types. Schema changes are one of the most common root causes of downstream pipeline failures, yet teams without a formal observability program almost never detect them automatically.

3. Column-level data lineage: Where did the data come from, and where does it go? Lineage answers the blast radius question for any incident: when something breaks, which downstream assets are affected? Without column-level lineage, answering this question requires manual investigation that takes hours. With column-level lineage, the answer is available in seconds. Lineage also enables root cause analysis in reverse: when something looks wrong in a report, you can trace backwards through the transformation graph to find the source of the issue.

4. Ownership and routing: Who is responsible for each piece of data? Monitoring without routing is broadcast alerting — Slack channels full of alerts that no one acts on. The final pillar of a mature observability program is connecting every monitored asset to a named owner who is paged when that asset fails. This requires integrating ownership data from your LDAP, Okta, or HR system and mapping it to data assets in your warehouse and transformation layer.
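Several of these pillars reduce to small, checkable computations. Pillar 2's schema monitor, for example, is at heart a diff of two column snapshots. A minimal sketch; the table schemas and Snowflake-style type names are hypothetical:

```python
def diff_schema(old: dict, new: dict) -> dict:
    """Compare two {column: type} snapshots and report structural changes."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    retyped = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return {"added": added, "removed": removed, "retyped": retyped}

# Yesterday's and today's snapshots of a hypothetical orders table.
yesterday = {"order_id": "NUMBER", "amount": "FLOAT", "status": "VARCHAR"}
today = {"order_id": "VARCHAR", "amount": "FLOAT", "status_code": "VARCHAR"}

# Reports a renamed column (seen as one added, one removed) and a type change.
print(diff_schema(yesterday, today))
```

A production monitor would pull the snapshots from the warehouse's information schema on a schedule, but the comparison logic is no more complicated than this.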

Why Most Teams Get Stuck on Pillar 1

Freshness monitoring is the most approachable entry point: it requires minimal instrumentation, produces immediate value, and has clear pass/fail thresholds. The problem is that teams build freshness monitors and stop there, declaring their observability problem solved. A few months later they hit a major incident caused by a schema change, or by a transformation error that only lineage could have exposed, and realize that freshness checks catch only a fraction of real-world data quality failures.
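That approachability is real: the core pass/fail logic of a freshness check fits in a few lines. The timestamps and allowed age below are illustrative:

```python
from datetime import datetime, timedelta

def is_stale(last_updated: datetime, max_age: timedelta, now: datetime) -> bool:
    """A table is stale when its most recent update is older than the allowed window."""
    return now - last_updated > max_age

now = datetime(2024, 1, 15, 9, 0)            # 9 AM health check
last_update = datetime(2024, 1, 14, 23, 0)   # table last loaded at 11 PM yesterday

# Data is expected by 6 AM, so at 9 AM anything older than six hours is late.
# This fires before the first stakeholder opens the dashboard.
print(is_stale(last_update, timedelta(hours=6), now))
```

The hard part is not this check but choosing `max_age` per table, which is exactly what the baseline week described later is for.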

Teams stall at freshness because the next capability, lineage, feels daunting. Traditional approaches to lineage required extensive manual annotation of dbt YAML files, instrumentation of every data source, or expensive professional services engagements to map transformation graphs. These approaches are too costly for most teams to justify until after they have experienced a catastrophic incident.

The modern alternative is automatic lineage extraction from query logs. By analyzing the SQL query history in Snowflake, BigQuery, or Redshift, it is possible to infer the full lineage graph — which columns in which tables were read by which queries that produced which output columns — without any manual annotation. This approach has limitations (it cannot see transformations that happen outside the warehouse) but it covers the vast majority of transformation logic for most teams and requires no ongoing maintenance effort.
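Real extractors parse the SQL properly and handle CTEs, views, and merge statements; the toy sketch below conveys the idea for a single INSERT ... SELECT from a hypothetical query-log entry, using a regex in place of a parser:

```python
import re

# Hypothetical entry from a warehouse query log.
QUERY = """
INSERT INTO marts.orders_daily
SELECT o.order_date, SUM(o.amount) AS revenue
FROM staging.orders_clean o
JOIN staging.customers c ON c.id = o.customer_id
GROUP BY o.order_date
"""

def table_lineage(sql: str) -> list:
    """Infer table-level edges: every FROM/JOIN source feeds the INSERT target.
    A regex only handles simple statements; real tools use a full SQL parser."""
    target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.I).group(1)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.I)
    return [(src, target) for src in sources]

# Yields (source, target) edges for the lineage graph.
print(table_lineage(QUERY))
```

Run this over 90 days of query history and union the edges, and the table-level graph falls out; column-level lineage applies the same idea to the select list instead of the FROM clause.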

The Implementation Sequence That Works

Based on what we have seen across dozens of data teams implementing observability programs, this is the sequence that produces results fastest with the least disruption:

Week 1-2: Connect your warehouse and establish a freshness baseline. Before you add monitoring rules, spend a week simply observing your data. How often is each table updated? What are the normal row count ranges for your most important tables? Which columns have high null rates, and which of those rates are expected? This baseline is the foundation for your monitoring thresholds; rules set without one produce too many false positives to be useful.
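As a sketch of turning observations into thresholds, assuming two weeks of illustrative daily row counts for one table, a simple three-sigma band works as a first-pass volume baseline:

```python
from statistics import mean, stdev

# Two weeks of observed daily row counts for a hypothetical table.
row_counts = [98_400, 101_200, 99_800, 102_500, 97_900, 100_300, 103_100,
              99_200, 101_800, 98_700, 100_900, 102_200, 99_500, 101_100]

mu, sigma = mean(row_counts), stdev(row_counts)

# First-pass bounds: flag anything outside three standard deviations.
lower, upper = mu - 3 * sigma, mu + 3 * sigma
print(f"expected daily rows: {lower:,.0f} to {upper:,.0f}")
```

Two weeks is a short window, so expect to widen or tighten these bounds as more history accumulates; the point is that thresholds come from observed behavior, not guesses.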

Week 3-4: Build lineage from query history. Extract column-level lineage from the last 90 days of query history in your warehouse. This produces a graph that shows which columns feed which transformations feed which output tables. The graph will have gaps — not all transformations are visible from warehouse query history alone — but it will cover your core pipeline and give you the blast radius information you need for incident response.
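Once the edges exist, the blast-radius question is answered by a plain breadth-first traversal of the downstream graph. A sketch with a hypothetical handful of column-level edges:

```python
from collections import deque

# Hypothetical column-level lineage: source column -> downstream columns.
LINEAGE = {
    "raw.orders.amount": ["staging.orders_clean.amount"],
    "staging.orders_clean.amount": ["marts.orders_daily.revenue",
                                    "marts.finance.gross"],
    "marts.orders_daily.revenue": ["dash.exec_kpis.revenue"],
}

def blast_radius(column: str) -> set:
    """Walk downstream breadth-first to collect every affected column."""
    affected, queue = set(), deque([column])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

# One bad source column; four downstream columns in the blast radius.
print(sorted(blast_radius("raw.orders.amount")))
```

Root cause analysis is the same traversal over the reversed edges, walking upstream from the broken report instead of downstream from the broken source.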

Week 5-6: Add volume and schema monitoring to your top 20 tables. Do not try to monitor everything at once. Start with the 20 tables that downstream dashboards and applications depend on most directly. Set volume bounds based on your Week 1-2 baseline. Add schema change detection. These 20 tables probably account for 80% of your data incident surface area.
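Those volume and null-rate bounds can be expressed as a small rule table evaluated on each run; all names and numbers below are illustrative:

```python
# Monitoring rules derived from the Week 1-2 baseline (hypothetical bounds).
RULES = [
    {"table": "orders_daily", "metric": "row_count",
     "min": 94_000, "max": 107_000},
    {"table": "orders_daily", "metric": "null_rate:customer_id",
     "min": 0.0, "max": 0.01},
]

def evaluate(observed: dict) -> list:
    """Return one alert per rule whose observed value falls outside its bounds."""
    alerts = []
    for rule in RULES:
        value = observed[rule["table"]][rule["metric"]]
        if not rule["min"] <= value <= rule["max"]:
            alerts.append(f"{rule['table']}.{rule['metric']}={value} "
                          f"outside [{rule['min']}, {rule['max']}]")
    return alerts

# A 40% row-count drop trips the volume rule; the null rate stays in bounds.
print(evaluate({"orders_daily": {"row_count": 58_000,
                                 "null_rate:customer_id": 0.002}}))
```

Keeping rules as data rather than code is what makes the later "expand coverage" step cheap: adding a table is adding rows to this list, not writing new monitors.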

Week 7-8: Connect ownership and routing. Map each of the 20 monitored tables to a named owner. Configure routing so that alerts about each table go directly to the responsible engineer's on-call channel, not to a generic Slack alert channel. This single change typically reduces mean time to resolution by 60-70% because it eliminates the triage step of figuring out who to contact.
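The routing step itself is a lookup, assuming a hypothetical ownership map synced from your identity provider; every name and channel below is invented for illustration:

```python
# Hypothetical ownership map, synced from an identity provider such as Okta.
OWNERS = {
    "marts.orders_daily": {"owner": "priya", "channel": "#oncall-orders"},
    "marts.finance":      {"owner": "wei",   "channel": "#oncall-finance"},
}
FALLBACK = {"owner": "data-platform", "channel": "#data-alerts"}

def route(table: str, message: str) -> str:
    """Deliver the alert to the owning engineer's channel, not a shared catch-all."""
    target = OWNERS.get(table, FALLBACK)
    return f"{target['channel']} @{target['owner']}: {message}"

print(route("marts.orders_daily", "row count dropped 40% below baseline"))
```

The fallback channel matters: an unmapped table should surface loudly as a gap in ownership coverage rather than fail silently.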

Ongoing: Expand coverage and reduce noise. After the initial 8-week sprint, iterate: expand monitoring coverage to more tables, tune threshold rules to eliminate false positives, and add monitoring for new pipelines as they are built. The goal is to make observability a default part of how your team builds, not a bolt-on after the fact.

What Decube Does Differently

Decube was designed to support this implementation sequence without requiring weeks of professional services or manual configuration. Connecting Snowflake, BigQuery, or Redshift via read-only OAuth produces an initial lineage graph and freshness baseline in under 10 minutes. Monitoring rules can be set from that baseline automatically or tuned manually by the team. Ownership data syncs from LDAP or Okta. Alert routing is configurable at the asset level.

The goal was to make the 8-week implementation plan an 8-day one for teams that are ready to move. Not every team is — some need to resolve data architecture questions first, or build consensus across the engineering and analytics organization. But for teams that are ready, the technical barrier to a functioning observability program should be hours, not weeks.

Start Your Observability Program

Connect your warehouse and get a full lineage graph and freshness baseline in under 10 minutes.

Book a Demo