Implementing a Data Quality Framework with dbt
How to Identify Data Quality Issues and Implement a Solution Using dbt
Imagine you’ve just joined a company and encountered data quality challenges. Processes need to be reworked, insights shared with stakeholders fall short of expectations, data delays slow down pipelines, and even seemingly valuable source data is too flawed to generate real impact.
What would you do? Where would you start? 🤔💭
You’re not alone! Many of us have faced similar challenges throughout our careers. In this post, we’ll share a step-by-step guide to help you take the initiative and address these issues effectively.
Before diving into this post, we highly recommend checking out our previous post to familiarise yourself with the fundamentals. We covered:
What is Data Quality?
An introduction to data quality dimensions
Examples for each dimension, along with insights from our careers
Understanding the definition of data quality, its dimensions, and metrics is one aspect of the journey. However, the real challenge lies in connecting these concepts and applying them to extract meaningful insights. In this post, we aim to bridge that gap by presenting a practical scenario that a data engineer, analytics engineer, or a similar professional might encounter in a company. We will then outline a structured roadmap to effectively address the issue, demonstrating how data quality principles translate into real-world impact.
Scenario
You’ve just joined a company as a Data Engineer (or in a similar role), and your first task is to create a data ingestion pipeline for your AI Engineer.
After gathering requirements and exploring the company's SQL Server data, you identify 8 tables to prototype the pipeline. The goal is to enable the AI Engineer to perform analytics and build models. However, after further analysis and discussions with stakeholders, you discover a major roadblock: the source tables lack the data quality needed to meet business requirements and support the AI Engineer’s needs.
Now, it’s time to take action!
In this post, we’ll walk through a clear, step-by-step approach to building a data quality framework that addresses these challenges and ensures your data is reliable, accurate, and ready for AI-driven insights.
Step 1
The first step is to understand the fundamentals of data quality, including its definition, key dimensions, and common measurement techniques.
You can find this in our previous post here.
Step 2
The second step is to list the data quality issues you encountered while working with the tables intended for the pipeline.
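A quick profiling query per candidate table can help surface these issues, such as duplicate keys, missing values, or impossible dates, before you write them down. The sketch below uses T-SQL, since the scenario's source is SQL Server; the table and column names (dbo.Customers, CustomerID, Email, CreatedAt) are hypothetical placeholders, not part of the scenario.

```sql
-- Minimal profiling sketch for one candidate table (names are hypothetical).
SELECT
    COUNT(*)                                               AS row_count,
    COUNT(DISTINCT CustomerID)                             AS distinct_ids,   -- fewer than row_count => duplicates
    SUM(CASE WHEN Email IS NULL THEN 1 ELSE 0 END)         AS missing_emails, -- completeness issue
    SUM(CASE WHEN CreatedAt > GETDATE() THEN 1 ELSE 0 END) AS future_dates    -- validity issue
FROM dbo.Customers;
```

Repeating a query like this for each of the tables gives you a concrete, evidence-backed list of issues to carry into the next steps.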
Step 3
The third step is to talk with downstream stakeholders (in this case, the AI Engineers) to understand what better data quality means to them. It is also worth talking with upstream stakeholders (software engineers, third parties, and so on) to understand the systems this data is generated from. Remember, as we mentioned in the previous post, data quality issues often arise because the source system was designed for specific operational or transactional purposes rather than for analytics.
Step 4
Based on the information and requirements gathered in the previous three steps, document all findings and create a comprehensive list of tables along with their associated data quality issues.
Next, use your understanding of data quality dimensions to categorise each issue under its corresponding dimension. Mapping issues to the dimensions’ definitions helps you quantify data quality and decide which quality checks to implement.
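How you document this is up to you; a lightweight issue log that records the table, the issue, the dimension it maps to, and the check you propose is often enough. The YAML sketch below is one hypothetical way to structure such a log, with all table and column names invented for illustration.

```yaml
# Hypothetical data quality issue log -- all names are illustrative.
- table: customers
  issue: email is NULL for a large share of rows
  dimension: completeness
  proposed_check: not_null test on email
- table: orders
  issue: duplicate order_id values
  dimension: uniqueness
  proposed_check: unique test on order_id
- table: orders
  issue: order_date values in the future
  dimension: validity
  proposed_check: custom test comparing order_date with the current date
```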
Step 5
Choose a tool to implement your quality checks based on the information you prepared in Step 4. In this series, that tool is dbt.
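With dbt, many of the checks you listed translate directly into tests declared alongside your models. Below is a minimal sketch of a schema file; the model and column names (customers, orders, customer_id, email, order_id, status) and the accepted status values are assumptions for illustration, not taken from the scenario.

```yaml
# models/schema.yml -- minimal sketch; model, column, and value names are hypothetical.
version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique      # uniqueness dimension
          - not_null    # completeness dimension
      - name: email
        tests:
          - not_null    # completeness dimension
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:    # validity dimension
              values: ['placed', 'shipped', 'delivered', 'returned']
```

Running `dbt test` evaluates every declared check and reports failures per column, giving you a first measurable view of the dimensions you documented in Step 4.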