The Crucial Midpoint of Data Engineering: Data Transformation
From Raw Data to Gourmet Insights: The Art and Science of Data Transformation
Data transformation is like preparing ingredients for a gourmet dish: You start with raw produce, full of potential but not yet ready to eat. The transformation stage is where Data Engineers become chefs, cleaning, chopping, and blending until everything is just right. The raw data turns into insights-ready information, prepped and plated for analytical consumption.
Transforming data allows for the unification and integration of data. Once transformed, data can be viewed as a cohesive entity. Without effective transformation, achieving a unified view across an organisation is impossible. As Bill Inmon, the father of the data warehouse, emphasised, data transformation is crucial for integration.
In the Data Pipelines Pocket Reference book, James Densmore highlights that data transformation is the essential "T" in both ETL and ELT processes.
Stage 1: Querying Data
Definition: Requesting specific records from a database or storage system.
Data Types: You might query various types of data, such as tabular or semi-structured data, often stored in a cloud data warehouse.
Common Languages:
SQL (Structured Query Language): Widely used for relational databases.
NoSQL Query Languages: Such as MongoDB Query Language for document-oriented databases.
Graph Query Languages: Like Cypher for querying graph databases such as Neo4j.
Uses:
Cleaning Data.
Joining and aggregating datasets.
Filtering records to retrieve specific information.
Potential Issues with Poor Queries:
Performance problems in the source database
"Row explosion" from unintended joins, increasing record counts, and straining storage
Slow or overly broad queries, causing delays in reporting and analytics
Stage 2: Data Modelling
Definition: Represents how data aligns with real-world concepts.
Purpose: Creates a structured format to make data valuable for the business.
Benefits:
Enables analysts to efficiently query and access data for reporting
Reflects the organisation's processes, definitions, workflows, and logic
Succeeding in Data Modelling requires:
Collaboration: Work with stakeholders to understand key terms (e.g., "customer").
Alignment: Ensure the data model aligns with business objectives.
Stage 3: Transformation
Stages of Transformation:
Before Ingestion: Apply initial transformations, such as adding timestamps in the source system and enriching records with additional fields or calculations before they reach the data warehouse.
Purpose: Manipulate and enhance data to ensure it's ready for downstream use.
During Ingestion: Perform in-flight transformations to adjust data types and standardise formats.
Further Downstream: Apply schema transformations, denormalise data, perform aggregations for reporting, or prepare features for machine learning models.
Note: Many of these practices are also applicable to unstructured data, including images, videos, and raw text.
Collaboration in Data Transformation
Collaboration between upstream and downstream stakeholders is vital to ensure data quality, performance, and business relevance.
Upstream Stakeholders:
Business Definition Controllers: Define the meaning and usage of data (e.g., business stakeholders). They ensure data aligns with business rules.
System Controllers: Engineers who manage data-generating systems. They focus on system performance and aim for transformations that avoid heavy system loads.
Downstream Stakeholders:
Data Users: Include data analysts, scientists, ML engineers, and business users who depend on transformed data. They require data to be fast, reliable, and accurate for insights, models, and decision-making.
Conclusion
Data transformation is essential for making raw data useful and actionable. By collaborating effectively and following best practices, data engineers can deliver high-quality, business-aligned data that supports analysis and informed decision-making. This stage lays the groundwork for subsequent steps in the data lifecycle.
Stay tuned by subscribing for our upcoming post on the next stage: Serving.
If you're interested in learning about the ingestion stage of the data engineering lifecycle, check out our post on ingestion.
If you'd like to learn more about data Engineering please check:
[Data Engineering Professional course] on Coursera/Deeplearning.ai.
Join The Conversation!
We hope you found this post helpful. If you have any questions, comments, or experiences you'd like to share, we'd love to hear from you!