Data quality monitoring has become standard practice. Most data teams running Snowflake or BigQuery have some form of monitoring in place - whether through dbt tests, a dedicated observability platform, or custom SQL checks running in Airflow. The alerts fire. The on-call engineer gets paged. And then, consistently, the same problem: the engineer stares at a Slack notification that says "row count 34% below baseline on orders_daily" and has no idea where to start.
The monitoring did its job. The alert fired within 60 seconds of the anomaly. But the engineer still spent 90 minutes triaging a problem that turned out to be a single Fivetran connector that hit a rate-limit error at 1:17am. The alert told them something was wrong. It did not tell them what caused it, what it affects, or who owns the fix.
The Alert-to-Context Gap
The failure mode in modern data quality monitoring is not sensitivity. Statistical anomaly detection has gotten very good at catching when a metric deviates from its historical pattern. The failure is what happens after detection. An alert message that contains a table name, a metric, and a deviation percentage contains exactly enough information to confirm that something is wrong and exactly zero information about how to fix it.
Consider what an on-call data engineer actually needs to resolve an incident at 2am:
- Which upstream source broke? (root cause, not just symptom)
- When did the break happen? (to understand the blast radius)
- Which downstream tables, models, and dashboards are affected?
- Who owns the upstream source that broke?
- Has this happened before? (pattern or one-off?)
A threshold alert on orders_daily answers none of these. It confirms the symptom. The engineer still has to open the lineage graph in a separate tool, check the freshness of upstream tables in another tool, find the owner of the relevant connector in a spreadsheet or Confluence page, and then piece together a coherent picture of what happened.
Why Broadcast Alerts Make This Worse
The default behavior of most monitoring platforms is to broadcast alerts to a shared Slack channel. A message appears in #data-alerts, the channel with 47 members, and it reads something like: "ALERT: table analytics.orders_daily failed row count check (expected 12,400-14,200, got 8,903)".
Three people see it. Two of them do not know that table and assume someone else will handle it. One of them is technically on-call but is unsure whether this is their pipeline or another team's. The alert sits unacknowledged for 22 minutes before the on-call engineer finally opens it.
Broadcast alerts create diffusion of responsibility. When an alert is sent to 47 people, the probability that any single person feels urgently responsible approaches zero. Targeted alerts to the specific table owner - routed through PagerDuty or direct Slack DM - resolve significantly faster. But targeted routing requires knowing who owns the specific table that broke, not just which alert channel covers data problems generally.
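The routing described above can be sketched in a few lines. This is an illustrative example, not a real platform API: the table names, owner handles, and fallback channel are all hypothetical placeholders standing in for an automated ownership lookup.

```python
# Owner-targeted routing: resolve the alert target from an ownership
# map keyed by fully qualified table name, falling back to the shared
# channel only when no owner is registered. All values are illustrative.

TABLE_OWNERS = {
    "analytics.orders_daily": "@maria",
    "raw.stripe_charges": "@devon",
}

FALLBACK_CHANNEL = "#data-alerts"

def route_alert(table: str, message: str) -> str:
    """Return the delivery target: the owner's DM if known, else the channel."""
    target = TABLE_OWNERS.get(table, FALLBACK_CHANNEL)
    return f"to={target} msg={message}"
```

The key property is that the fallback to a broadcast channel is the exception path, not the default.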
The Ownership Problem
Data ownership is almost universally managed in Confluence, Notion, or a spreadsheet that was accurate when it was written six months ago. Engineers leave teams. Tables get migrated to new owners. New pipelines get built without anyone updating the ownership registry. When an incident fires, the on-call engineer looks up the owner, finds a name, pings that person, and gets an out-of-office reply.
This is not a process failure - it is a tooling architecture failure. Ownership data lives in a separate system from monitoring data, which lives in a separate system from lineage data. None of them are connected. When an incident fires, the engineer has to manually traverse three separate systems to build a complete picture of the problem.
Decube solves this by making ownership a first-class attribute of the lineage graph. Every table and column has an owner, stored directly in the platform alongside the lineage relationships and quality checks. When an alert fires, the platform knows immediately which engineer owns the upstream source that broke, and routes the page directly - without requiring the on-call engineer to consult a separate ownership registry.
Statistical Anomaly Detection Adds Noise Without Context
ML-based anomaly detection has become a selling point for several data observability platforms. Train a model on historical metric distributions, and flag when current values fall outside the predicted range. The promise is fewer false positives than threshold-based rules. In practice, the reduction in false positives is modest, and it comes at a significant cost in opacity.
When a threshold-based rule fires, the engineer knows exactly why: "row count dropped below 10,000." When a statistical model fires, the alert says "anomaly detected." The engineer does not know whether this is a legitimate problem or a statistical artifact of a holiday weekend changing the baseline. Without context about why the anomaly detection model flagged this metric, the engineer has to manually verify the alert before acting on it.
For data quality monitoring, threshold rules with business context are more actionable than statistical models with no context. A rule that says "row count on orders_daily must be between 10,000 and 16,000 on weekdays, 2,000 and 6,000 on weekends" is transparent, debuggable, and understood by non-engineers. A statistical model that says "this is 2.3 standard deviations from the 30-day mean" is opaque and requires a data scientist to interpret.
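The weekday/weekend rule above is simple enough to express directly in code. A minimal sketch, using the illustrative bounds from the text:

```python
from datetime import date

# Transparent, business-aware threshold rule: different row count
# bounds for weekdays and weekends. Bounds are the example numbers
# from the text; weekday() >= 5 means Saturday or Sunday.

WEEKDAY_BOUNDS = (10_000, 16_000)
WEEKEND_BOUNDS = (2_000, 6_000)

def row_count_ok(row_count: int, day: date) -> bool:
    """Check a daily row count against the bounds for that day of week."""
    low, high = WEEKEND_BOUNDS if day.weekday() >= 5 else WEEKDAY_BOUNDS
    return low <= row_count <= high
```

When this rule fires, the alert can state exactly which bound was violated and why that bound exists, which is the debuggability the statistical model lacks.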
What a Good Alert Actually Looks Like
An alert that enables fast resolution contains five elements:
- The specific symptom: which metric, which table, what value, what was expected.
- The probable root cause: which upstream source is the first anomalous node in the lineage graph.
- The downstream impact: which tables, models, and dashboards are affected by this failure.
- The owner: who is responsible for the broken upstream source.
- The escalation path: what to do if the primary owner is unresponsive.
That is a fundamentally different artifact than a row count alert with a table name. Building it requires the monitoring system to have real-time access to the lineage graph, the ownership registry, and the dependency map between tables and dashboards. Tools that operate in isolation cannot produce it.
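As a concrete sketch, the five-element artifact could be modeled as a small data structure. The field names and message format here are illustrative assumptions, not the output of any particular platform:

```python
from dataclasses import dataclass, field

# The five-element alert artifact described above, as a dataclass.
# Every field name is illustrative.

@dataclass
class EnrichedAlert:
    symptom: str                # metric, table, observed vs expected value
    probable_root_cause: str    # first anomalous upstream node in the lineage graph
    downstream_impact: list[str] = field(default_factory=list)  # affected tables/dashboards
    owner: str = ""             # owner of the broken upstream source
    escalation: str = ""        # who to page if the owner is unresponsive

    def to_message(self) -> str:
        """Render the alert as a multi-line notification body."""
        return "\n".join([
            f"Symptom: {self.symptom}",
            f"Probable root cause: {self.probable_root_cause}",
            f"Downstream impact: {', '.join(self.downstream_impact) or 'none known'}",
            f"Owner: {self.owner}",
            f"Escalation: {self.escalation}",
        ])
```

Populating the root cause, impact, and owner fields is the hard part, and it is exactly what requires live access to lineage and ownership data.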
Freshness vs Quality: Two Different Problems
Data monitoring tools often conflate freshness monitoring and quality monitoring under the "data observability" umbrella. They are different problems with different causes and different fixes.
Freshness failure means a table did not update when expected. The cause is almost always upstream: a connector failed, a job timed out, a dependency was not met. Resolution is operational: restart the connector, rerun the job, page the infrastructure owner.
Quality failure means the table updated but the values are wrong. The cause is typically a transformation logic bug, a schema change in the source, or a business rule that was not accounted for. Resolution is analytical: understand what changed in the data or the transformation, determine if the output values are acceptable or need a rerun.
These two failure modes require different investigation paths and different owners. Treating them identically in a single monitoring dashboard creates confusion during triage. A well-designed monitoring platform separates freshness checks from quality checks and routes incidents accordingly. Freshness incidents go to infrastructure owners. Quality incidents go to data model owners. As we covered in our article on column-level lineage, knowing which specific column is affected by a quality failure dramatically narrows the investigation scope.
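The routing split can be sketched as a single dispatch function. The owner maps and fallback channels below are hypothetical placeholders for whatever ownership registry the platform maintains:

```python
# Failure-type routing: freshness incidents page infrastructure owners,
# quality incidents page data model owners. All names are illustrative.

INFRA_OWNERS = {"analytics.orders_daily": "@infra-oncall"}
MODEL_OWNERS = {"analytics.orders_daily": "@maria"}

def route_incident(table: str, failure_type: str) -> str:
    """Return who gets paged for this table and failure type."""
    if failure_type == "freshness":
        return INFRA_OWNERS.get(table, "#data-infra")
    if failure_type == "quality":
        return MODEL_OWNERS.get(table, "#data-modeling")
    raise ValueError(f"unknown failure type: {failure_type}")
```

The point of the structure is that the two failure modes never land in the same queue, so triage starts with the right person.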
Making the 2am Alert Actionable
Four concrete changes make data quality alerts dramatically more actionable without requiring a complete platform replacement:
Add lineage context to every alert. For every table that fires a quality check, automatically include the immediately upstream tables in the alert message. A single additional line saying "Upstream tables: stripe_charges (last synced 8h ago), fx_rates (last synced 1h ago)" narrows the investigation scope immediately.
Route to table owners, not channels. Replace channel broadcasts with direct pages to the engineer who owns the affected table, with secondary escalation to team leads after 15 minutes. Ownership lookup has to be automated - not a manual spreadsheet step.
Separate freshness from quality alerts. Run freshness checks on all tables and page orchestration owners when they fail. Run quality checks on tables with defined rules and page data model owners when they fail. Routing based on failure type cuts mean-time-to-resolve significantly.
Include downstream impact in every alert. An engineer who knows their pipeline break affects three dashboards that four executives read is motivated differently than an engineer who knows only that a row count check failed. Impact visibility creates urgency proportional to actual business risk.
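The first change above, appending an upstream-freshness line, is a small formatting step once the lineage metadata is available. A minimal sketch, where the table names and sync ages are illustrative values rather than live metadata:

```python
# Format the "Upstream tables: ..." context line from a mapping of
# upstream table name to hours since last successful sync.

def upstream_context(upstreams: dict[str, int]) -> str:
    """Render upstream tables with their last-sync age for an alert body."""
    parts = [f"{name} (last synced {hours}h ago)" for name, hours in upstreams.items()]
    return "Upstream tables: " + ", ".join(parts)
```

A single line like this in the alert body immediately points the engineer at the stale upstream source.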
Build Alerts That Actually Help
Decube combines lineage, ownership, and quality monitoring so every alert contains the context needed to resolve it - not just the signal that something broke.
Book a Demo