In the last two posts of our Data Quality series, we explored the fundamentals of data quality, including its definition, key dimensions, and real-world examples. We also shared insights from our careers and provided a roadmap for bridging theory with practice to implement effective data quality checks.
If you haven’t caught up yet, you can read the previous posts here: [Data Quality Series]1
In this post, we’ll dive deeper into the impact of data quality on AI. We’ll explore why data quality is crucial for AI and discuss key questions you can ask to assess the data quality within your organisation.
Note: These posts are based on our experiences and insights from the Master AI-Ready Data Infrastructure2 by Chad Sanderson, a pioneer in data quality/data contracts and co-author of the Data Contracts book.
Before we dive into the details, let's check out Chad's definition of data quality.
Data quality refers to the measure of data's condition, suitability, and effectiveness for its intended use in operations, decision-making, and planning. Data quality issues occur when the expectations of data producers do not meet the expectations of data consumers.
Let’s start by discussing who the data producers and consumers are.
Consumers and Producers
Data Producers
Data producers are the engineers, developers, and third-party platforms responsible for generating raw data. They serve as the foundation for all AI training, analytics, and decision-making processes by providing the initial data inputs.
Data Consumers
Data consumers are individuals, systems, or applications that use processed data to extract insights, make informed decisions, and drive business outcomes. While they have limited visibility into upstream systems, they rely on data for various functions. Key consumers include AI/ML engineers and data scientists who develop models, analysts who generate insights, business teams (Sales, Marketing, Executives) who drive strategy, and customer service and product teams who enhance user experience.
Both data producers and consumers play a critical role in the AI ecosystem, ensuring that high-quality data drives innovation and decision-making.
Why is Data Quality Essential to AI?
The Foundation of Learning
AI systems learn from data; without it, they can’t function. The quality, accuracy, and relevance of data determine how effectively an AI model learns and adapts over time.
Garbage In, Garbage Out
An AI model is only as good as the data it's trained on. If the data is inaccurate, biased, or incomplete, the model's outputs will be equally unreliable, reducing the system’s effectiveness.
Impact on Decision-Making
AI plays a critical role in automating decisions. High-quality data ensures that these decisions are accurate, reliable, and trustworthy, increasing confidence in AI-driven processes.
Poor Data Can Break AI Models
Inconsistent or low-quality data can cause AI models to degrade over time or even fail entirely, leading to costly inefficiencies and requiring constant retraining.
Why Do Data Quality Issues Occur?
Lack of Ownership
Data is often collected for operational purposes rather than being explicitly designed for analytics, machine learning, or AI. Without clear ownership, no one is accountable for maintaining its accuracy and reliability.
Limited Awareness
Many organisations struggle with data visibility: teams don’t always know what data exists, where it’s stored, or how it’s being used. This lack of awareness leads to inconsistent or redundant data usage.
Poor Change Management
When data structures, sources, or processes change, key stakeholders are often left uninformed. Without proper communication, outdated or incorrect data continues to be used, leading to errors and inconsistencies.
Lack of a Single Source of Truth
Conflicting data definitions and multiple versions of the same data create confusion. Without a well-defined semantic framework, organisations struggle to establish a consistent and trustworthy view of their data.
Common Data Quality Problems and Their Impact on AI
1. Breaking Schema Changes
These occur when unexpected modifications in the data schema disrupt AI models. Changes like altering data types, modifying formats, or adding/removing fields can create incompatibilities that break data pipelines.
How Do They Impact AI?
Misinterpretation of Data: AI models may process data incorrectly if field meanings or formats change unexpectedly.
Processing Failures: Schema mismatches can cause errors, preventing AI systems from functioning properly.
Inaccurate Outcomes: When data structures shift without proper handling, AI outputs become unreliable, leading to flawed predictions and decisions.
Example: A column named “User” is renamed to “Customer”, breaking every downstream query that references the old name.
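A breaking schema change like the rename above can be caught before it reaches a pipeline by comparing each incoming batch against an expected schema. The sketch below is a minimal illustration; the column names and type labels are hypothetical, not taken from any real pipeline.

```python
# Expected schema for an incoming batch: column name -> type label.
# These names are illustrative assumptions.
EXPECTED_SCHEMA = {"user": "str", "signup_date": "str", "plan": "str"}

def find_schema_breaks(expected: dict, actual: dict) -> list:
    """Return human-readable descriptions of breaking schema changes."""
    issues = []
    for column, dtype in expected.items():
        if column not in actual:
            issues.append(f"missing column: {column}")
        elif actual[column] != dtype:
            issues.append(f"type change on {column}: {dtype} -> {actual[column]}")
    for column in actual:
        if column not in expected:
            issues.append(f"unexpected column: {column}")
    return issues

# A batch where "user" was renamed to "customer":
incoming = {"customer": "str", "signup_date": "str", "plan": "str"}
print(find_schema_breaks(EXPECTED_SCHEMA, incoming))
```

In practice this kind of check usually runs at the ingestion boundary, so the rename is flagged as one missing column plus one unexpected column before any model consumes the data.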
2. Missed Data SLAs
Service Level Agreements (SLAs) define expectations for data quality, including timeliness, completeness, and accuracy. When these standards aren’t met, data becomes outdated, incomplete, or arrives too late to be useful.
How Do Missed Data SLAs Impact AI?
Delayed Decision-Making: Late or missing data prevents AI systems from making timely, data-driven decisions.
Reduced Model Performance: Incomplete or outdated data weakens AI models, leading to less accurate predictions.
Operational Inefficiencies: Business processes relying on AI slow down, causing disruptions and increased costs.
Example: The expected number of events per hour drops from 1,000 to 0.
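A missed SLA like the one above is typically detected with freshness and volume checks on each batch. The following is a minimal sketch; the 60-minute staleness limit and 100-row minimum are illustrative assumptions, not thresholds from the post.

```python
from datetime import datetime, timedelta

def check_sla(last_loaded: datetime, row_count: int,
              max_staleness: timedelta = timedelta(minutes=60),
              min_rows: int = 100) -> list:
    """Return SLA violations (freshness and volume) for a single batch."""
    violations = []
    if datetime.utcnow() - last_loaded > max_staleness:
        violations.append("freshness SLA missed: data is stale")
    if row_count < min_rows:
        violations.append(f"volume SLA missed: {row_count} rows < {min_rows}")
    return violations

# A batch that landed three hours ago with zero events:
stale_batch = check_sla(datetime.utcnow() - timedelta(hours=3), row_count=0)
print(stale_batch)
```

Alerting on these violations lets consumers learn about late or empty data before an AI model silently trains or scores on it.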
3. Data Duplication
Data duplication occurs when identical or nearly identical information appears multiple times in a dataset. This redundancy often results from data entry errors, merging multiple sources, or poor data management practices.
How Does Data Duplication Impact AI?
Skewed Analysis: Duplicate data inflates counts, distorts trends, and misleads insights.
Increased Processing Costs: Larger datasets require more storage and computing power, slowing down AI operations.
Compromised Model Training: Training on redundant data biases AI models, reducing their accuracy and effectiveness.
Example: A transaction event is loaded via batch into a Bronze table, and the same event is streamed via CDC into a different Bronze table.
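When the same event arrives through two paths, as in the batch-plus-CDC example above, duplicates can be detected by counting occurrences of a stable identifier. This is a minimal sketch; the `event_id` field name and sample records are hypothetical.

```python
from collections import Counter

def find_duplicates(events: list, key: str = "event_id") -> dict:
    """Return {identifier: count} for identifiers appearing more than once."""
    counts = Counter(e[key] for e in events)
    return {k: v for k, v in counts.items() if v > 1}

# The same transaction arriving once via batch load and once via CDC:
events = [
    {"event_id": "txn-1001", "amount": 50.0},  # batch load
    {"event_id": "txn-1001", "amount": 50.0},  # CDC stream
    {"event_id": "txn-1002", "amount": 75.0},
]
print(find_duplicates(events))
```

Running a check like this on the merged table surfaces inflated counts before they skew training data or analytics.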
4. Semantic Violations
Semantic violations are gradual or sudden changes in the underlying business processes, rules, or objectives that data models are supposed to represent. This drift can occur due to changes in market conditions, company policies, or customer behaviour, and it is often not immediately reflected in the data-driven models.
How Do Semantic Violations Impact AI?
Misinterpretation of Data: AI models continue using outdated assumptions, leading to incorrect insights.
Reduced Accuracy and Reliability: Models trained on old patterns struggle to align with current realities, lowering their effectiveness.
Increased Error Rates: As the gap between business changes and model logic widens, AI-driven decisions become more error-prone.
Example: An operational database table changes the business logic of a distance_traveled column from kilometres to miles.
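A silent unit change like the kilometres-to-miles example above can often be caught with a simple distribution check: if the column switches units, the mean of new batches shifts by roughly the 0.62 conversion factor. The sketch below is a minimal illustration; the baseline mean and 20% tolerance are assumptions chosen for the example.

```python
def unit_drift_suspected(baseline_mean: float, batch: list,
                         tolerance: float = 0.2) -> bool:
    """Flag a batch whose mean deviates from the baseline mean by more
    than the given relative tolerance."""
    batch_mean = sum(batch) / len(batch)
    return abs(batch_mean - baseline_mean) / baseline_mean > tolerance

# Historical batches averaged ~100 (km); a batch recorded in miles
# averages ~62, which trips the relative-deviation check.
print(unit_drift_suspected(100.0, [60.0, 63.0, 61.5]))  # → True
```

This is deliberately crude; real semantic checks are usually encoded as explicit expectations (for example in a data contract) so producers and consumers agree on the column's meaning, not just its statistics.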
Key Questions to Address Data Quality Issues
If you haven’t yet addressed data quality issues, it’s important to start with the right questions. These key questions will help you identify and understand the problems so you can take action:
What are the most common data quality issues we face?
What is causing these issues?
What impact are these issues having on our operations and decisions?
How can we effectively detect these issues early?
What steps can we take to prevent them from happening in the future?
By asking the right questions, you can begin to form a strategy for improving your data quality and ensuring more reliable outcomes.
Conclusion
Good data quality is crucial for AI to function properly. In this post, we discussed what data quality means, why it’s important for AI, and the common issues that can arise. Poor data leads to poor decisions. AI models depend on clean, accurate data to produce reliable insights. If the data is messy or inconsistent, AI can fail, causing mistakes and unnecessary costs.
To improve data quality, organisations must find the problems, understand their causes, and take action to fix them. There are several solutions to help improve data quality, such as Data Observability, Data Catalogs, Data Lineage, Testing and Validation, and Data Contracts. These tools help ensure that data is accurate and usable for AI. However, all of these solutions start with one key factor: communication. Open communication between teams, whether data producers, consumers, or other stakeholders is key to identifying issues, setting standards, and working towards common goals. By focusing on communication, organisations can use these tools to improve data quality and make sure their AI systems perform well.
If you enjoyed this post, you may also like the posts below, where we discuss other fundamental concepts in Data Engineering.
We Value Your Feedback
If you have any feedback, suggestions, or additional topics you’d like us to cover, please share them with us. We’d love to hear from you!
https://pipeline2insights.substack.com/t/data-quality
https://maven.com/chad-sanderson/datainfra
https://open.substack.com/pub/procurefyi/p/the-end-of-big-dumb-ai-data?r=223ajc&utm_medium=ios