Building Trust in Data: The Fundamentals of Data Quality
Understanding Data Quality Dimensions and How They Help Data Solutions
Data quality is a major challenge for companies, especially with the rise of AI and other data-driven products.
Monte Carlo’s recent annual survey of data professionals revealed that while nearly all data teams are pursuing AI, 68% aren’t fully confident in the quality of their data [1]. One reason could be that most data wasn’t created with AI in mind. It was designed for specific operational or transactional purposes. As a result, repurposing this data for AI without proper preparation can lead to poor outcomes.
As Andrew Ng, a globally recognised leader in AI, emphasises, AI systems rely on both the model and the data. Preparing and adapting data thoughtfully is key to building effective AI solutions. Consequently, implementing robust data quality solutions and infrastructure has become increasingly essential.
As a data professional, you might wonder how to tackle data quality issues in your career and take action instead of just discussing them. While this is important, it’s equally essential to understand the fundamentals first. A solid grasp of these basics helps us know where and when to apply quality checks throughout the data engineering lifecycle.
When I joined an accounting software company, I faced challenges with delays in delivering data to our downstream stakeholders because data quality checks were missing. Our team sometimes received incomplete or late data, and we’d only discover the issues after pipelines had already run for 7 to 8 hours.
To address this, I was tasked with automating data quality checks to save time and reduce manual effort. With no prior knowledge of data quality or its dimensions, I began by studying data quality principles, dimensions, and frameworks. I then identified all the issues we were facing and created checks based on the concepts I learned. This approach led to significant improvements in both data accuracy and efficiency.
In this post, we’ll cover the essentials, starting with:
What is Data Quality?
An introduction to data quality dimensions as defined in the Data Management Body of Knowledge (DAMA-DMBOK). These dimensions provide a framework for evaluating and improving data quality.
Examples for each dimension, along with insights from our careers.
This foundational knowledge will help you take your first steps toward solving data quality challenges effectively.
What is Data Quality?
To better understand what data quality is, here are three definitions from key sources:
“Data quality is the health of data at any stage in its life cycle. Data quality can be impacted at any stage of the data pipeline: before ingestion, in production, or even during analysis.” (Data Quality Fundamentals by Barr Moses, Lior Gavish, and Molly Vorwerck)
“Data Quality can be defined as the degree to which dimensions of Data Quality meet the requirements. This implies that requirements should be formulated for each (relevant) dimension. A much shorter definition for quality of data is ‘fit for purpose’.” (DAMA-DMBOK)
“Data quality is an organisation’s ability to understand the degree of correctness of its data assets, and the tradeoffs of operationalising such data at various degrees of correctness throughout the data lifecycle, as it pertains to being fit for use by the data consumer.” (Data Contracts by Chad Sanderson and Mark Freeman)
Simply put, data quality is about how well data meets specific needs: it’s about being fit for purpose, or fit for use.
If data quality is poor:
Best case: A few wrong numbers lead to minor misunderstandings about the business.
Worst case: Incorrect data causes harm to people or regulatory violations.
Key Factors to consider when dealing with Data Quality
Define Data Quality for Your Business
Each organisation must define data quality based on its unique goals, needs, and use cases. Aligning this definition with business objectives ensures relevance and practicality.
Data Quality Doesn't Mean Perfection
Data quality isn’t about achieving perfection; it’s about ensuring data is “good enough” to meet business needs while balancing effort and accuracy.
Data Quality Dimensions with Examples
A data quality dimension is a specific characteristic or attribute used to assess the quality of data. It represents a distinct aspect or feature that helps determine whether the data is accurate, complete, and reliable.
The sources we mentioned earlier highlight various data quality dimensions. In this discussion, we will focus on the dimensions defined in the DAMA-DMBOK framework, a foundational reference for systematically assessing and improving data quality. That said, it’s always valuable to explore other dimensions tailored to your business needs, as every organisation may require unique criteria to ensure data quality.
Note: Most of the examples provided below are based on a company specialising in air conditioning system services.
Validity
Definition: Validity measures how well data conforms to the format, type, or range the business expects.
Example: A valid phone number should include the country code and a nine-digit number, depending on the business requirements. If the business operates locally, including a country code might not be necessary. However, for international operations, such as in Australia and New Zealand, country codes become essential.
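As a rough sketch, a validity rule like this can be expressed as a regular expression. The pattern and the AU/NZ country codes below are illustrative assumptions for this example, not a universal phone format:

```python
import re

# Illustrative assumption: "+" plus an AU/NZ country code (61 or 64)
# followed by a nine-digit subscriber number, per the example above.
INTL_PHONE_PATTERN = re.compile(r"^\+(61|64)\d{9}$")

def is_valid_phone(phone: str) -> bool:
    """Return True if the phone number matches the expected format."""
    return bool(INTL_PHONE_PATTERN.match(phone.replace(" ", "")))

print(is_valid_phone("+61 412 345 678"))  # True
print(is_valid_phone("0412 345 678"))     # False: no country code
```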
Completeness
Definition: Completeness ensures that all necessary data is present.
Example: A complete address should include the apartment or building number and street name to ensure accurate delivery and proper service to the customer.
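A minimal completeness check might look like the following sketch; the required field names are assumptions made up for this example:

```python
# Hypothetical list of fields the business considers mandatory for an address.
REQUIRED_ADDRESS_FIELDS = ["building_number", "street_name", "suburb", "state"]

def missing_fields(address: dict) -> list[str]:
    """Return the required address fields that are absent or empty."""
    return [field for field in REQUIRED_ADDRESS_FIELDS
            if not str(address.get(field) or "").strip()]

address = {"building_number": "12", "street_name": "", "suburb": "Richmond"}
print(missing_fields(address))  # ['street_name', 'state']
```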
Integrity
Definition: Integrity ensures that data is plausible and matches reality.
Example: if a company's address lists a location in New South Wales (NSW) but the company is based in Victoria, this creates a data integrity issue, as the information doesn’t match the real-world context.
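One way to sketch such a check is to compare incoming records against trusted reference data; the company names and home states below are hypothetical:

```python
# Hypothetical reference data: the state each company is actually based in.
COMPANY_HOME_STATE = {"AcmeAir": "VIC", "CoolCo": "NSW"}

def integrity_issues(records: list[dict]) -> list[dict]:
    """Flag records whose address state contradicts the reference data."""
    return [r for r in records
            if COMPANY_HOME_STATE.get(r["company"]) not in (None, r["state"])]

records = [{"company": "AcmeAir", "state": "NSW"},  # contradicts VIC
           {"company": "CoolCo", "state": "NSW"}]   # consistent
print(integrity_issues(records))  # [{'company': 'AcmeAir', 'state': 'NSW'}]
```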
Timeliness
Definition: Timeliness refers to how quickly data is refreshed according to business expectations.
Example: If data is expected to be refreshed within two weeks, any delay beyond that would be considered untimely and may impact decision-making.
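Assuming the two-week freshness expectation from the example, a timeliness check could be sketched like this:

```python
from datetime import datetime, timedelta, timezone

# The two-week refresh expectation from the example above.
MAX_STALENESS = timedelta(weeks=2)

def is_timely(last_refreshed: datetime, now: datetime) -> bool:
    """Return True if the data was refreshed within the agreed window."""
    return now - last_refreshed <= MAX_STALENESS

last_load = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(is_timely(last_load, datetime(2024, 1, 10, tzinfo=timezone.utc)))  # True
print(is_timely(last_load, datetime(2024, 2, 1, tzinfo=timezone.utc)))   # False
```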
Currency
Definition: Currency focuses on how current the data is.
Example: When a record’s status changes, say a deal marked as “won”, the approval date should reflect the most recent update. If the approval date is listed before the “won” date, it indicates a currency issue: the data has not been updated to match the actual timeline.
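A currency check for this scenario might simply compare the two dates; the record structure and field names below are hypothetical:

```python
from datetime import date

def has_currency_issue(record: dict) -> bool:
    """Flag records whose approval date precedes the 'won' date, i.e.
    the approval field was never updated after the status change."""
    return record["approval_date"] < record["won_date"]

# Hypothetical record: approved 20 Feb, but the deal was won 1 Mar.
record = {"won_date": date(2024, 3, 1), "approval_date": date(2024, 2, 20)}
print(has_currency_issue(record))  # True: the data is out of date
```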
Reasonableness
Definition: Reasonableness ensures that data values are logical and meet expected business logic.
Example: In an HVAC materials purchase order, we should only see materials. An entry like "AC hire" doesn’t make sense here, as it should be captured under a different concept, such as a service or rental, rather than a purchase. This is tied to the business logic and expected data values.
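One way to sketch this as a check is an allow-list of expected line-item categories; the categories below are invented for illustration, not an industry taxonomy:

```python
# Invented allow-list of categories expected on a materials purchase order.
ALLOWED_PO_CATEGORIES = {"ducting", "refrigerant", "filters", "belts", "compressors"}

def unreasonable_lines(order_lines: list[dict]) -> list[dict]:
    """Return line items whose category falls outside the expected set."""
    return [line for line in order_lines
            if line["category"] not in ALLOWED_PO_CATEGORIES]

lines = [{"item": "V-belt", "category": "belts"},
         {"item": "AC hire", "category": "rental"}]  # a service, not a material
print(unreasonable_lines(lines))  # [{'item': 'AC hire', 'category': 'rental'}]
```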
Uniqueness
Definition: Uniqueness ensures that each record or entity appears only once, which is crucial when merging data from different systems.
Example: Unique IDs are essential to avoid duplicate records.
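A basic uniqueness check can be sketched by counting occurrences of each ID; the customer IDs below are made up:

```python
from collections import Counter

def duplicate_ids(ids: list[str]) -> list[str]:
    """Return any IDs that appear more than once in the dataset."""
    return [id_ for id_, count in Counter(ids).items() if count > 1]

# Hypothetical customer IDs merged from two systems.
print(duplicate_ids(["C-001", "C-002", "C-001"]))  # ['C-001']
```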
Accuracy
Definition: Accuracy measures the correctness of data.
Example: In an HVAC-related job, a technician should spend a reasonable amount of time preparing, fixing, or replacing equipment. If we see a belt replacement job that took 20 hours, this would indicate an accuracy issue, as the time spent is unusually high and doesn't reflect the actual time required for such a task.
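A simple plausibility check for this example might bound the recorded duration; the 0.5 to 4 hour range below is an assumed threshold for illustration, not an industry standard:

```python
# Assumed plausibility bounds, in hours, for a belt replacement job.
MIN_HOURS, MAX_HOURS = 0.5, 4.0

def is_plausible_duration(hours: float) -> bool:
    """Return True if the recorded duration falls within the expected bounds."""
    return MIN_HOURS <= hours <= MAX_HOURS

print(is_plausible_duration(1.5))   # True
print(is_plausible_duration(20.0))  # False: likely a recording error
```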
Conclusion
Data quality is key to the success of AI, data-driven solutions, and business operations. By applying key data quality dimensions, companies can ensure their data meets their needs. Building strong data quality systems improves efficiency and decision-making while reducing risk. In the end, focusing on data quality leads to better business results and more reliable AI solutions.
In upcoming posts, we will explore:
The impact of data quality.
Why data quality issues occur.
Common data quality problems and their impact on AI.
Solutions for improving data quality.
Tools and resources to start learning more about data quality.
If you're preparing for a Data Engineering interview, don't forget to check out our series: Pipeline To Insights Data Engineering Interview Preparation Guide [2].
If you're looking to improve your data modelling skills or enhance your data pipeline designs, check out our earlier posts on those topics.
We Value Your Feedback
If you have any feedback, suggestions, or additional topics you’d like us to cover, please share them with us. We’d love to hear from you!
What are your experiences with data quality? How do you manage data quality in your role? Feel free to share with the community so we can all learn from each other.
Resources
🚀 3-step framework to scaling data quality in the age of generative AI [3]
Creating A Basic Data Quality Framework [4] by Prithu Barnwal
Data Quality: Core Concepts [5] (LinkedIn Learning course)

[1] https://www.montecarlodata.com/blog-2024-state-of-reliable-ai-survey/
[2] https://pipeline2insights.substack.com/t/interview-preperation
[3] https://substack.com/home/post/p-146677839
[4] https://medium.com/@prithubarnwal007/creating-a-basic-data-quality-framework-f6168970cffd
[5] https://www.linkedin.com/learning/data-quality-core-concepts/the-importance-of-data-quality