What is Data Observability and How Does It Support Data Quality
Understanding how observability tools help monitor data health, prevent issues, and maintain trust in our systems
In software development, observability means adding enough “sensors” (logs, metrics, traces) so you can tell what your app is doing and fix problems fast. Tools like Datadog collect those signals and show you if something breaks.
Data observability does the same thing, but for data.
Picture a fitness band for your data pipelines: instead of tracking heart rate, it tracks whether data is fresh, complete, in the right shape, and moving through the steps it should. If something looks off, you get an alert before it hurts reports or machine‑learning models.
As Andy Petrella defines it in the Fundamentals of Data Observability book1:
“Data observability is the capability of a system that generates information on how the data influences its behavior and, conversely, how the system affects the data.”
In short, data observability lets teams trust their data the same way software observability lets them trust their apps.
However, you might ask: What is the difference between data quality and data observability?
Data quality is the goal: delivering data that’s clean, complete, fresh, and trustworthy.
Data observability provides the sensors, dashboards, and alerts that show when data quality slips and pinpoint where to fix it.
We recently launched a series on data quality. If you're interested, you can check it here2 and subscribe so you never miss an update! 😊
In this post, we will discuss:
Where Data Observability Is Applied
Types of Data Observability
Implementing Observability in Data Projects
Selecting the Right Observability Solution
Establishing Effective Ownership
Implementation Best Practices
Note: In this post, we used some insights from the Master AI-Ready Data Infrastructure3 course by Chad Sanderson, a pioneer in data quality/data contracts and co-author of the Data Contracts book.
Where Data Observability Is Applied
Let’s see where you can actually put observability to work.
1. Records and Files
Observability can be applied to individual files and records. These are materialised data assets that land in some storage system.
We’re either:
Validating on a row-by-row basis in near real-time, or
Running large aggregations after the fact, in a more reactive way, to check for anomalies.
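As a sketch of the row-by-row approach, a record-level validator might look like the following. The field names ("order_id", "amount") and the rules are hypothetical examples, not a real schema:

```python
# A minimal row-level validation sketch: check each record as it lands
# and collect the ones that break a rule.

def validate_record(record: dict) -> list[str]:
    """Return a list of rule violations for a single record."""
    errors = []
    if not record.get("order_id"):
        errors.append("order_id is missing")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("amount must be a non-negative number")
    return errors

records = [
    {"order_id": "A-1", "amount": 19.99},
    {"order_id": "", "amount": -5},
]
# Keep the failing records together with their violations.
bad = [(r, errs) for r in records if (errs := validate_record(r))]
```

In a real-time setup this runs per record as data arrives; in the reactive setup, the same rules run over batches on a schedule.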
2. Tables and Views
The second place where we can implement observability is on tables and views within our analytical database. We might set up an observability tool within Snowflake, BigQuery, or Redshift to monitor how data flows through the system.
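A table-freshness monitor, for instance, reduces to comparing the latest load timestamp against an agreed maximum age. This is a minimal sketch; real tools pull the timestamp from the warehouse's metadata:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_loaded_at, max_age, now=None):
    """True if the table's most recent load breaches the freshness window."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at > max_age

# Example: a table last refreshed 3 hours ago, against a 1-hour freshness SLA.
now = datetime.now(timezone.utc)
stale = is_stale(now - timedelta(hours=3), max_age=timedelta(hours=1), now=now)
```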
3. Pipelines
We can also apply observability to the pipelines themselves. This helps answer questions like:
Are jobs being performed on schedule?
Have they been delayed? For how long?
Together, these pipeline-level checks help you build a comprehensive view across your entire data ecosystem.
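The scheduling questions above boil down to comparing expected and actual start times. A minimal sketch, with hypothetical timestamps:

```python
from datetime import datetime, timedelta, timezone

def schedule_delay(expected_start, actual_start):
    """Positive result: the job started late. Zero or negative: on time."""
    return actual_start - expected_start

expected = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)   # daily 06:00 run
actual = datetime(2024, 1, 1, 6, 45, tzinfo=timezone.utc)    # observed start
delay = schedule_delay(expected, actual)  # 45 minutes late
```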
Types of Data Observability
The following four types of data observability represent different approaches to maintaining data quality, each with distinct advantages and limitations. Understanding these methods will help us implement a comprehensive strategy tailored to our organisation's specific needs.
1. Schema Change Detection
Schema change detection happens close to real-time when a change occurs from a data source or when new files are loaded into a system. Ideally, this happens before the pipeline executes, allowing us to verify if the schema has been updated or is backward-compatible with the previous version.
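A backward-compatibility check of this kind can be sketched by representing each schema as a column-to-type mapping (the columns and types below are hypothetical):

```python
def breaking_changes(old_schema: dict, new_schema: dict) -> list[str]:
    """List schema changes that would break downstream consumers.
    Added columns are backward-compatible; removals and type changes are not."""
    issues = []
    for column, dtype in old_schema.items():
        if column not in new_schema:
            issues.append(f"column removed: {column}")
        elif new_schema[column] != dtype:
            issues.append(f"type changed: {column} {dtype} -> {new_schema[column]}")
    return issues

old = {"id": "int", "email": "string"}
new = {"id": "int", "email": "int", "signup_date": "date"}
issues = breaking_changes(old, new)  # flags the email type change only
```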
Pros:
Excellent for root cause analysis, as we can pinpoint the exact moment a schema changed.
When implemented across all phases of the data engineering life cycle, it helps identify where issues first materialise.
Cons:
It can get very noisy, leading to alert fatigue.
Too many alerts can cause teams to tune them out.
Note: It's often better to have highly contextual, sensitive alerts than too many alerts. In some cases, it's even preferable to miss something occasionally than to have excessive alerts that get ignored.
2. Data Change Detection
This focuses on change detection at the record level rather than the schema level. For example, when files are loaded into storage, we may check each record against predefined data quality rules.
An example would be validating that a field containing a 7-character string doesn't suddenly contain a 6-character string instead.
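That length rule can be expressed as a simple regex check, a sketch of the kind of test tools like Great Expectations let you declare:

```python
import re

# Rule: the field must be exactly 7 characters long.
SEVEN_CHARS = re.compile(r".{7}")

def field_ok(value: str) -> bool:
    """True only if the value is exactly 7 characters."""
    return SEVEN_CHARS.fullmatch(value) is not None

# "ABC1234" passes; a truncated "ABC123" would be flagged.
```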
Pros:
Excellent for root cause analysis at the record level.
More comprehensive than schema checking.
Flexible testing options (Regex, tools like Great Expectations).
Cons:
It can be very expensive when checking every record continuously.
As a result, most teams opt for sampling approaches (checking a subset of records periodically).
3. Aggregation/Comparative Checks
These checks are ideal for SLA (Service Level Agreement) monitoring. They involve looking at data over a period of time, aggregating it, and comparing it against established thresholds.
We can also use these checks to validate relationships between data entities. For example, ensuring there's only one truck driver per shipment, while allowing multiple shipments per shipper.
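A sketch of that relationship check over hypothetical (shipment_id, driver_id) assignment pairs:

```python
from collections import defaultdict

def shipments_with_multiple_drivers(assignments):
    """Flag shipments that violate the one-driver-per-shipment rule."""
    drivers = defaultdict(set)
    for shipment_id, driver_id in assignments:
        drivers[shipment_id].add(driver_id)
    return {s for s, d in drivers.items() if len(d) > 1}

# D1 appearing on two shipments is allowed; S2 having two drivers is not.
assignments = [("S1", "D1"), ("S2", "D2"), ("S2", "D3"), ("S3", "D1")]
violations = shipments_with_multiple_drivers(assignments)  # {"S2"}
```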
Pros:
Essential for SLA management and data contracts.
Effective for detecting relationship violations between entities.
Cons:
Reactive rather than proactive.
It can be costly and time-consuming to set up.
4. Anomaly Detection
This is the AI/ML-driven approach to data observability. Vendors offer machine learning-based anomaly detection, where we provide them with time series data. Their models analyse patterns and common fluctuations, alerting us when they detect anything outside normal parameters.
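As a toy illustration of the idea (vendor models are far more sophisticated), a z-score check over a hypothetical daily-row-count series:

```python
import statistics

def zscore_anomalies(series, threshold=2.0):
    """Indices whose value deviates from the mean by more than `threshold`
    standard deviations. A toy stand-in for vendor ML models; the threshold
    is deliberately low because the sample here is tiny."""
    mean = statistics.fmean(series)
    stdev = statistics.stdev(series)
    if stdev == 0:
        return []
    return [i for i, x in enumerate(series) if abs(x - mean) / stdev > threshold]

daily_row_counts = [1000, 1020, 990, 1010, 1005, 100]  # last day's volume collapsed
anomalies = zscore_anomalies(daily_row_counts)  # flags index 5
```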
Pros:
Can detect unexpected patterns that rule-based systems might miss.
Works autonomously once properly trained.
Cons:
Often creates additional work investigating false positives.
Difficult to distinguish between data anomalies and legitimate business changes.
Requires treating all alerts with the same level of rigour.
Implementing Observability in Data Projects
Data observability has become essential for maintaining data quality and reliability across organisations. Here's a structured approach to implementing observability in your data management practices:
Selecting the Right Observability Solution
When choosing a data observability solution, it’s important to consider our team's needs, budget, and technical expertise. Options range from enterprise-grade platforms to open source tools and AI-focused solutions, each with unique strengths.
Enterprise Data Observability Platforms
Comprehensive monitoring: Provides pre-built monitors covering data quality, freshness, volume anomalies, and schema changes
Simple deployment: Typically cloud-hosted with minimal setup requirements
Integration capabilities: Connects with most modern data stack components (warehouses, pipelines, BI tools)
Example: Monte Carlo
Open Source Solutions
Cost-effective: Minimal licensing costs, ideal for budget-conscious organisations
Customisation: Highly adaptable to specific requirements and tech stacks
Implementation complexity: Requires more engineering resources and expertise
Examples: Great Expectations, dbt tests, Dagster
AI-Centric Observability Tools
Specialised monitoring: Focuses on ML model drift detection and performance
Automated insights: Uses AI to identify anomalies and potential root causes
Complementary implementation: Works alongside traditional observability tools
Cost considerations: Often, pricing scales with data volume and model complexity
Examples: WhyLabs, Arize AI, Fiddler AI
Establishing Effective Ownership
Clear ownership ensures observability tools are used effectively. Different teams bring unique strengths, so aligning responsibilities helps maintain reliable, high-quality data systems.
Data Platform Teams
Primary responsibility: Initial implementation and platform maintenance.
Configuration management: Establishing monitoring templates and baseline thresholds.
Limited scope: Typically not responsible for business-specific monitoring logic.
Cross-functional coordination: Acts as a bridge between technical and business teams.
Data Engineering Teams
Operational ownership: Monitor performance and incident response.
Root cause analysis: Diagnosing and resolving data quality issues.
Implementation variations:
Centralised teams: Standardised monitoring approaches across datasets.
Decentralised teams: Domain-specific monitoring tailored to business units.
Data Science Teams
Business logic expertise: Deep understanding of data context and acceptable bounds.
Alert definition: Setting appropriate thresholds based on domain knowledge.
Limited operational focus: Less involved in infrastructure troubleshooting.
Collaboration opportunity: Partner with engineering for a comprehensive monitoring strategy.
Implementation Best Practices
Effective observability isn't just about tools; it's about how we roll them out, manage alerts, and keep improving. A thoughtful, phased approach helps teams avoid alert fatigue and stay focused on what matters.
Phased Rollout
Begin with critical datasets before expanding coverage.
Start with basic metrics (freshness, volume) before advanced ones (anomaly detection).
Establish baseline performance before setting alert thresholds.
Alert Management
Implement severity levels to prioritise response efforts.
Create clear escalation paths for different issue types.
Develop runbooks for common failure scenarios.
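One way to encode severity levels and escalation paths is a simple routing table; the severity names and channels below are illustrative, not a standard:

```python
# Hypothetical mapping from alert severity to an escalation channel.
ESCALATION = {
    "critical": "pagerduty",  # page on-call immediately
    "warning": "slack",       # triage during business hours
    "info": "dashboard",      # logged only, no notification
}

def route_alert(severity: str) -> str:
    """Route an alert; unknown severities fall back to the dashboard."""
    return ESCALATION.get(severity, "dashboard")
```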
Continuous Improvement
Regularly review alert effectiveness and adjust thresholds.
Document lessons learned from major incidents.
Schedule periodic reviews of monitoring coverage gaps.
Finding the Right Balance
The challenge with data observability is that there's no one-size-fits-all approach. What often works better than complex anomaly detection is when:
Something clearly breaks for data producers, making the issue obvious.
Data consumers (like AI/ML teams) notice and report that "something doesn't look right."
Even without a specific test or monitor, these signals can help identify issues that need attention in the future.
Conclusion
Data observability is essential for maintaining trust in our data systems. By implementing appropriate levels of monitoring across files, tables, and pipelines, teams can catch issues early and ensure data quality.
The key is finding the right balance, implementing enough observability to catch critical issues while avoiding alert fatigue. The best approach combines automated checks with human intuition and clear communication between data producers and consumers.
By thoughtfully addressing all aspects of data observability implementation, from selecting the right tools to establishing clear ownership and following best practices, organisations can significantly improve data reliability and reduce operational disruptions.
If you found this post helpful, check out the posts below where we discuss Data Engineering concepts.
We Value Your Feedback
If you have any feedback, suggestions, or additional topics you’d like us to cover, please share them with us. We’d love to hear from you!
What are your experiences with data quality? How do you manage data quality in your role? Feel free to share with the community so we can all learn from each other.
https://www.oreilly.com/library/view/fundamentals-of-data/9781098133283/
https://pipeline2insights.substack.com/t/data-quality
https://maven.com/chad-sanderson/datainfra