Week 6/33: Data Modelling for Data Engineering Interviews (Part #3)
What is Data Vault 2.0 and its role in Data Engineering Interviews
In this post, we explore the Data Vault modelling technique, with a focus on Data Vault 2.0. This approach has gained significant popularity recently, especially among enterprise-level organisations. As a result, questions about Data Vault are becoming increasingly common. In this post, we will cover:
What is Data Vault 2.0
Common Data Vault interview questions
Bridging the fundamental models to Data Vault
A Case Study: Converting a Dimensional Model to a Data Vault
This is the third and final post in our series covering data modelling for data engineering interviews. If you have missed the previous posts, you can check them here: Pipeline To Insights Interview Preparation Series [1]
Note: Data Vault 2.0 is an advanced data modelling method that builds upon foundational data modelling principles. To fully grasp the concepts discussed in this post, we highly recommend ensuring a strong understanding of the following fundamentals:
ER Diagrams: Understanding entity relationships and how they map to database structures.
3rd Normal Form (3NF): The principles of normalisation and how to design databases for minimal redundancy.
Dimensional Modelling: The design of star and snowflake schemas for analytical purposes.
If you are not confident with these topics yet, we suggest starting with the post linked below to build a solid foundation:
What is Data Vault 2.0
Data Vault 2.0 is a methodology for designing scalable, flexible, and agile data warehouses. It builds upon the original Data Vault concepts, adding enhancements to support modern data environments.
At a high level, the Data Vault structure is built around three core components: Hubs, Links, and Satellites. Hubs store unique business keys, acting as the foundation for capturing core entities. Links establish and maintain relationships between these entities, while Satellites enrich the model by storing descriptive attributes and tracking historical changes.
Common Attributes Across Components
All Data Vault components share some common attributes to ensure consistency, auditability, and traceability across the system (a short code sketch follows the list):
Hash Keys: Ensure uniqueness and optimise joins. Hash keys are generated using a hashing algorithm applied to business keys or concatenated business keys.
Load Date (LOAD_DTS): Captures the exact timestamp when the data was loaded into the Data Vault, enabling accurate historical tracking.
Record Source (REC_SRC): Tracks the origin of the data, providing transparency and supporting data lineage for audit and compliance purposes.
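To make these fields concrete, here is a minimal Python sketch of how the common attributes might be populated. The `||` delimiter, the upper-casing rule, and the choice of MD5 are illustrative assumptions; implementations vary, and SHA-256 is a common alternative.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys) -> str:
    # Normalise (trim, upper-case) and join the keys with a delimiter,
    # then hash, so the same input always produces the same hash key.
    normalised = "||".join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()

# The three common attributes attached to a hub record:
record = {
    "Hub_HashKey": hash_key("CUST-1001"),    # hashed business key
    "Customer_ID": "CUST-1001",              # the original business key
    "LOAD_DTS": datetime.now(timezone.utc),  # load timestamp
    "REC_SRC": "CRM",                        # record source
}
```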
Key Structural Components and Features
1. Hubs
Definition: Hubs represent unique business keys. These are the identifiers for core business entities, such as Customer IDs, Product IDs, or Order IDs.
Purpose: They provide a single, centralised point to manage and access unique business keys.
Additional Key:
HashKey: A hash value generated from the business key (e.g., Customer_ID) to ensure uniqueness across the system. This serves as the primary key for the hub and optimises joins.
Example:
A Hub_Customer table might include the following (a loading sketch follows the list):
Hub_HashKey (HK_Customer_ID): Primary key generated by hashing the Customer_ID.
Customer_ID: Unique identifier for a customer.
LOAD_DTS: Timestamp for when the record was first loaded.
REC_SRC: Source of the data (e.g., CRM, ERP).
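Continuing the previous sketch, the snippet below illustrates the insert-only nature of a hub: a business key lands once, stamped with the load timestamp and record source of its first appearance. A plain dict stands in for the Hub_Customer table, and the hash_key helper is reused from the sketch above.

```python
hub_customer: dict = {}  # stands in for Hub_Customer, keyed by hash key

def load_hub_customer(customer_id: str, rec_src: str) -> None:
    # Hubs are insert-only: write a business key once, never update it.
    hk = hash_key(customer_id)
    if hk in hub_customer:
        return  # key already captured; nothing to do
    hub_customer[hk] = {
        "Hub_HashKey": hk,
        "Customer_ID": customer_id,
        "LOAD_DTS": datetime.now(timezone.utc),
        "REC_SRC": rec_src,
    }

load_hub_customer("CUST-1001", "CRM")
load_hub_customer("CUST-1001", "ERP")  # same key, second source: no new row
assert len(hub_customer) == 1
```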
2. Links
Definition: Links capture the relationships between business keys. They connect hubs to represent how entities are related.
Purpose: Links ensure that relationships can be managed independently from the data attributes.
Additional Key:
Link_HashKey: A hash value generated by combining the hash keys of all connected hubs, ensuring that each relationship is unique.
Example:
A Link_Customer_Order table might connect Customer and Order (a hash-key sketch follows the list):
Link_HashKey (HK_L_Cust_Order): Primary key generated by hashing the concatenated hash keys of Hub_Customer and Hub_Order.
Customer_HashKey (HK_Customer_ID): Foreign key referencing the Hub_Customer table.
Order_HashKey (HK_Order_ID): Foreign key referencing the Hub_Order table.
LOAD_DTS: Timestamp for when the record was first loaded.
REC_SRC: Source of the data.
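Still continuing the same sketch, the link hash key can be derived by hashing the concatenated hash keys of the parent hubs. The field names and source value here are illustrative assumptions, not a fixed standard.

```python
customer_hk = hash_key("CUST-1001")  # would come from Hub_Customer
order_hk = hash_key("ORD-9001")      # would come from Hub_Order

link_row = {
    # Link hash key: a hash over the concatenated parent hash keys,
    # so each distinct customer/order pairing maps to exactly one row.
    "Link_HashKey": hash_key(customer_hk, order_hk),
    "Customer_HashKey": customer_hk,  # FK -> Hub_Customer
    "Order_HashKey": order_hk,        # FK -> Hub_Order
    "LOAD_DTS": datetime.now(timezone.utc),
    "REC_SRC": "SALES_DB",
}
```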
3. Satellites
Definition: Satellites store descriptive attributes and historical changes for hubs and links.
Purpose: Satellites allow tracking of changes over time without impacting the core hubs or links.
Additional Key:
HashKey: Foreign key linking back to the hub or link it describes.
Example:
A Sat_Customer table might include the following (a change-tracking sketch follows the list):
Hub_HashKey (HK_Customer_ID): Foreign key referencing the Hub_Customer table.
LOAD_DTS: Timestamp for when the record was first loaded.
CustomerName: Customer name.
Address: Customer address.
REC_SRC: Source of the data.
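The sketch below shows how a satellite might append a new row only when something actually changed. It adds a hash of the descriptive attributes (often called a hash diff in Data Vault 2.0 implementations) to detect changes cheaply; that column is an assumption of this sketch and is not part of the field list above. The hash_key helper is reused from the first sketch.

```python
sat_customer: list = []  # stands in for the Sat_Customer table

def attribute_hash(attrs: dict) -> str:
    # Hash of the descriptive attributes (a "hash diff"), used to
    # detect whether anything changed since the latest row.
    return hash_key(*(f"{k}={v}" for k, v in sorted(attrs.items())))

def load_sat_customer(hub_hk: str, attrs: dict, rec_src: str) -> None:
    latest = next((r for r in reversed(sat_customer)
                   if r["Hub_HashKey"] == hub_hk), None)
    if latest and latest["Hash_Diff"] == attribute_hash(attrs):
        return  # no change: satellites only append new versions
    sat_customer.append({
        "Hub_HashKey": hub_hk, "Hash_Diff": attribute_hash(attrs),
        **attrs, "LOAD_DTS": datetime.now(timezone.utc), "REC_SRC": rec_src,
    })

hk = hash_key("CUST-1001")
load_sat_customer(hk, {"CustomerName": "Ada Smith", "Address": "1 Elm St"}, "CRM")
load_sat_customer(hk, {"CustomerName": "Ada Smith", "Address": "1 Elm St"}, "CRM")   # unchanged: skipped
load_sat_customer(hk, {"CustomerName": "Ada Smith", "Address": "9 Oak Ave"}, "CRM")  # new address: new row
assert len(sat_customer) == 2
```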
4. Key Features
Modularity: Each component (hub, link, satellite) is independent, ensuring changes in one do not affect others.
Scalability: Parallel loading of components supports large-scale data environments.
Historical Tracking: Satellites provide full traceability of changes over time.
Automation: Metadata-driven approaches enable automation across:
ETL Processes: Automated creation and loading of hubs, links, and satellites.
Schema Management: Automated generation of table schemas with standardised structures.
Auditing and Governance: Automated tracking of data lineage, LoadDate, and RecordSource for compliance (see the sketch after this list).
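As a toy illustration of the metadata-driven idea, the sketch below generates a standardised hub schema from a small metadata record. The metadata shape and the column types are hypothetical; real implementations typically drive this from YAML/JSON configs or templating tools.

```python
HUB_METADATA = {"name": "Hub_Customer", "business_key": "Customer_ID"}

def hub_ddl(meta: dict) -> str:
    # Every hub gets the same standardised shape: hash key,
    # business key, load timestamp, and record source.
    return (
        f"CREATE TABLE {meta['name']} (\n"
        f"  Hub_HashKey CHAR(32) PRIMARY KEY,\n"
        f"  {meta['business_key']} VARCHAR(50) NOT NULL,\n"
        f"  LOAD_DTS TIMESTAMP NOT NULL,\n"
        f"  REC_SRC VARCHAR(50) NOT NULL\n"
        f");"
    )

print(hub_ddl(HUB_METADATA))
```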
5. Simple Data Vault example with all components
Bridging The Fundamental Models to Data Vault
Data Vault can also be defined as a hybrid approach that combines the fundamental models of the 3rd Normal Form (3NF) and Star Schema. This hybrid nature allows Data Vault to inherit the strengths of both approaches while addressing their limitations, making it a robust solution for modern data environments.
Focus on Normalisation and Core Entities
In 3NF, data is organised into highly normalised tables to reduce redundancy and ensure data integrity. Each table represents a single subject or entity (e.g., customers, products).
In Data Vault: The concept of Hubs mirrors this focus by representing unique business keys (e.g., Customer_ID, Product_ID). Hubs act as the centralised storage for core entities, ensuring uniqueness and reducing redundancy while supporting scalability.
Simplified Relationships for Querying
In Star Schema, relationships are denormalised and simplified for analytical purposes. Dimensions connect directly to a central fact table, making it easy to query aggregated data.
In Data Vault: The concept of Links resembles the fact table in a Star Schema but offers more flexibility. Links capture relationships between entities (e.g., customer-product-store relationships) while maintaining a normalised design, ensuring adaptability and ease of change.
Historical Tracking Without Overwriting
In 3NF, historical tracking often requires additional design effort (e.g., audit tables or temporal databases). Changes to attributes can overwrite existing data.
In Data Vault: Satellites solve this problem by storing descriptive attributes and tracking historical changes. Each satellite is designed to store history independently, preserving the integrity of the hubs and links.
Query-Optimised Structures
Star Schema structures are designed for performance in analytical queries, with pre-aggregated data and simplified joins.
In Data Vault: While maintaining flexibility, Data Vault uses satellites to provide descriptive attributes that can be selectively queried. This allows for a balance between query performance and scalability.
Scalability and Flexibility
Both 3NF and Star Schema face challenges in scalability when dealing with large, complex datasets or evolving requirements.
In Data Vault: The modular design of hubs, links, and satellites supports parallel loading and incremental updates, making it ideal for enterprise-level and cloud-native environments.
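To illustrate the parallel-loading point, here is a minimal sketch using Python's thread pool. The load function is a placeholder for a real job (e.g., a COPY or MERGE into the warehouse); the hub names match the case study later in this post.

```python
from concurrent.futures import ThreadPoolExecutor

def load_table(table_name: str) -> str:
    # Placeholder for a real load job (e.g., a COPY/MERGE statement).
    return f"{table_name} loaded"

# Hubs have no dependencies on one another, so they can load in
# parallel; links and satellites follow once parent keys exist.
hubs = ["Hub_Customer", "Hub_Product", "Hub_Store", "Hub_Date"]
with ThreadPoolExecutor() as pool:
    print(list(pool.map(load_table, hubs)))
```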
Tip: We suggest keeping the visual below on hand as a Data Warehouse Modelling reference.
Interview Questions
Question 1
What is Data Vault 2.0? Can you explain its components and how they work together?
Answer: Data Vault 2.0 is a scalable and flexible data modelling methodology designed to handle large, dynamic datasets. It breaks down data into three key components: Hubs, Links, and Satellites, which work together to create a modular and historical data warehouse.
Hubs act as the foundation of the data model, ensuring a single source of truth for core entities. Each hub contains:
A hash key: a hashed value of the business key, ensuring uniqueness and efficient joins.
A business key: the actual unique identifier.
A load date: when the data was loaded.
A record source: the origin of the data.
Links capture relationships between business keys in hubs. They define connections independently of attributes, supporting flexibility as relationships evolve. Each link contains:
A unique hash combining the hash keys of related hubs.
Foreign keys referencing the hash keys of the connected hubs.
When the relationship was recorded.
Source of the data.
Satellites provide historical tracking and enrich hubs and links with details like customer names or transaction amounts. Satellites include:
Links back to the relevant hub or link.
Descriptive data, such as Name or Amount.
When the data was loaded.
Source of the data.
Therefore, Hubs store unique entities, Links define relationships between those entities, and Satellites provide descriptive and historical context for both hubs and links in a Data Vault.
Question 2
How does Data Vault 2.0 handle historical data tracking?
Answer: Data Vault 2.0 tracks historical data using satellites, which are designed to store descriptive attributes and record changes over time. Each satellite is associated with a hub or a link and includes specific columns to ensure accurate historical tracking:
Load Date: Captures when the data was first loaded into the system. This ensures an immutable record of the data’s entry into the warehouse.
Effective Date and End Date: Tracks the validity period of the data. The Effective Date marks when the attribute value became valid, while the End Date indicates when it was replaced or updated. If the record is still current, the End Date is typically left as NULL.
This approach ensures that historical information is preserved without modifying the hubs or links.
Example: If a customer moves to a new address, the satellite will store the old address with an End Date, and a new record will be added for the new address with a fresh Effective Date. The Hub_Customer remains unchanged, maintaining its role as the unique identifier for the customer.
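A minimal sketch of that address change, assuming an in-memory list for the satellite and the Effective/End Date columns described above; the column names and hash key value are illustrative.

```python
from datetime import datetime, timezone

sat_customer = [{
    "Hub_HashKey": "hk-cust-1001",
    "Address": "1 Elm St",
    "Effective_Date": datetime(2023, 5, 1, tzinfo=timezone.utc),
    "End_Date": None,  # NULL -> still the current address
}]

def change_address(hub_hk: str, new_address: str) -> None:
    now = datetime.now(timezone.utc)
    for row in sat_customer:
        if row["Hub_HashKey"] == hub_hk and row["End_Date"] is None:
            row["End_Date"] = now  # close the old version
    sat_customer.append({
        "Hub_HashKey": hub_hk,
        "Address": new_address,
        "Effective_Date": now,  # new version starts now
        "End_Date": None,
    })

change_address("hk-cust-1001", "9 Oak Ave")  # Hub_Customer itself is untouched
```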
Question 3
What advantages does Data Vault 2.0 offer over traditional Dimensional Modelling?
Answer: Data Vault 2.0 provides significant benefits in scalability, flexibility, adaptability, and governance, making it ideal for modern data environments:
Scalability: Hubs, links, and satellites can be loaded independently and in parallel, accelerating data pipelines and handling large datasets efficiently.
For example, in systems with millions of daily transactions, customer, product, and transaction data can be processed simultaneously without bottlenecks.
Flexibility: Adding new attributes or entities is non-disruptive due to the modular structure, enabling easy schema evolution. Seamlessly integrates structured, semi-structured, and unstructured data.
For instance, adding a new customer attribute involves creating a satellite, leaving the core hubs and links untouched.
Adaptability: Satellites preserve all historical changes, enabling robust analysis of past data states. Incremental updates accommodate evolving business needs with minimal impact.
For example, a new pricing model can be added as a satellite, preserving historical prices alongside new ones.
Governance: Metadata fields like LoadDate and RecordSource ensure full auditability. For example, REC_SRC provides a clear data lineage for regulatory audits.
These features highlight its ability to handle large volumes, schema changes, and historical tracking, showcasing its value for modern data challenges.
Case Study: Converting Dimensional Model to Data Vault 2.0
A company produces products such as electronics (phones, tablets) and beauty items (perfumes, lipsticks). Customers shop at nearby stores, often purchasing a mix of products in a single visit, recorded on one invoice.
Suppose we are given the ER diagram below, which shows the star schema:
The provided star schema includes a fact table (FACT_Sales) and dimension tables (DIM_Customer, DIM_Product, DIM_Store, DIM_Date). Converting a star schema into a Data Vault is generally straightforward: we convert the dimension tables into hubs and the fact table into a link, and then define satellites for the descriptive attributes.
Let's break down how this schema can be transformed into a Data Vault model.
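As a preview of the three steps below, the whole conversion can be summarised as a simple mapping; this structure is purely illustrative.

```python
# Star schema object -> Data Vault components it becomes
STAR_TO_VAULT = {
    "DIM_Customer": {"hub": "Hub_Customer", "satellites": ["Sat_Customer", "Sat_Contact"]},
    "DIM_Product":  {"hub": "Hub_Product",  "satellites": ["Sat_Product"]},
    "DIM_Store":    {"hub": "Hub_Store",    "satellites": ["Sat_Store"]},
    "DIM_Date":     {"hub": "Hub_Date",     "satellites": ["Sat_Date"]},
    "FACT_Sales":   {"link": "Link_Sales",  "satellites": ["Sat_Sales"]},
}
```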
Step 1: Identify Business Keys for Hubs
In Data Vault, business keys are stored in hubs. These represent unique identifiers for core business entities:
Hub_Customer
Business Key: Customer_ID
Attributes: Customer_ID, LOAD_DTS, REC_SRC
Hub_Product
Business Key: Product_ID
Attributes: Product_ID, LOAD_DTS, REC_SRC
Hub_Store
Business Key: Store_ID
Attributes: Store_ID, LOAD_DTS, REC_SRC
Hub_Date
Business Key: Date_ID
Attributes: Date_ID, LOAD_DTS, REC_SRC
Step 2: Define Relationships for Links
Relationships between business keys are stored in links, which connect hubs.
Link_Sales
Business Keys: Customer_ID, Product_ID, Store_ID, Date_ID
Attributes: Link_HashKey (hashed from the concatenation of business keys), Customer_HashKey, Product_HashKey, Store_HashKey, Date_HashKey, Sales_ID, LOAD_DTS, REC_SRC
Note: Sales_ID here is known as a degenerate key or dependent child key; including it in the hash ensures that each link row stays unique even when the same combination of business keys occurs multiple times, as the sketch below shows.
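A quick sketch of why the dependent child key matters, with the hash_key helper redefined so the snippet stands alone: without Sales_ID, two invoices sharing the same customer, product, store, and date would hash to the same link row. The key values are hypothetical.

```python
import hashlib

def hash_key(*keys) -> str:
    # Normalise, join with a delimiter, and hash the business keys.
    return hashlib.md5("||".join(str(k).strip().upper() for k in keys).encode()).hexdigest()

base = ("CUST-1", "PROD-7", "STORE-3", "2024-01-15")
# Same business-key combination, two different invoices:
without_sales_id = (hash_key(*base), hash_key(*base))
with_sales_id = (hash_key(*base, "SALE-100"), hash_key(*base, "SALE-101"))

assert without_sales_id[0] == without_sales_id[1]  # rows would collide
assert with_sales_id[0] != with_sales_id[1]        # Sales_ID keeps them distinct
```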
Step 3: Create Satellites for Attributes
Satellites store descriptive attributes and track changes for hubs and links.
Satellites for Customer Attributes:
Table: Sat_Customer
Attributes: Hub_Customer_HashKey, First_Name, Last_Name, LOAD_DTS, REC_SRC
Table: Sat_Contact
Attributes: Hub_Customer_HashKey, Email, Postal_Code, City, LOAD_DTS, REC_SRC
Note: The reason for splitting the customer attributes across two satellites is that the attributes in Sat_Contact are more likely to change than those in Sat_Customer. As a result:
Updates to frequently changing attributes don’t affect the structure or historical tracking of other attributes.
This reduces the risk of unnecessary data duplication in satellites and optimises storage usage.
Satellite for Product Attributes:
Table: Sat_Product
Attributes: Hub_Product_HashKey, Product_Name, Category_Name, LOAD_DTS, REC_SRC
Satellite for Store Attributes:
Table: Sat_Store
Attributes: Hub_Store_HashKey, Store_Name, LOAD_DTS, REC_SRC
Satellite for Date Attributes:
Table: Sat_Date
Attributes: Hub_Date_HashKey, Date, LOAD_DTS, REC_SRC
Satellite for Sales Attributes:
Table: Sat_Sales
Attributes: Link_Sales_HashKey, Unit_Price, Quantity, LOAD_DTS, REC_SRC
Benefits of This Conversion
Scalability: Hubs, links, and satellites can be loaded in parallel, making the model suitable for high-volume data ingestion.
Flexibility: Adding new attributes (e.g., a new product feature) involves creating additional satellites without altering existing structures.
Historical Tracking: Satellites track attribute changes over time, ensuring full traceability and robust analytical capabilities.
Adaptability: The modular structure allows seamless integration of new data sources and relationships.
Note: You may have noticed that the provided dimensional model has no dimension-to-dimension links. This is typical of a star schema, where dimensions relate directly only to the fact table. In a snowflake schema, however, dimension-to-dimension links are more common; in that case, we could represent each dimension-to-dimension relationship in the Data Vault as a link table between the respective hubs, giving us multiple link tables for such relationships.
Conclusion
Data Vault 2.0 is an essential methodology for data engineering interviews, particularly due to its popularity in enterprise-level organisations. In this post, we covered:
The foundational concepts of Data Vault 2.0,
Common interview questions,
A detailed case study that demonstrates how to convert a dimensional model into a Data Vault structure.
For those seeking more advanced knowledge about Data Vault 2.0 or aiming to get certified, we suggest exploring this Udemy Course [2]. However, we believe the concepts and case studies covered in this post are more than sufficient to tackle Data Vault-related questions in data engineering interviews confidently.
If you are interested in refreshing or deepening your knowledge of data modelling, we highly recommend exploring our previous post below. In it, we dive deep into topics such as Normalised Forms, Slowly Changing Dimensions (SCDs), the Kimball and Inmon approaches, and One Big Table by solving interview questions:
There is also a free Data Vault training by Datavault for interested candidates; you can check the link for dates and times [3].
We Value Your Feedback
If you have any feedback, suggestions, or additional topics you’d like us to cover, please share them with us. We’d love to hear from you!
Have you used Data Vault 2.0 before? We’re curious, how did you use it, and what was your experience like? Share your story and tips!
[1] https://pipeline2insights.substack.com/t/interview-preperation
[2] https://www.udemy.com/course/modeling-data-warehouse-with-data-vault-for-beginners/
[3] https://data-vault.com/free-data-vault-training/#applytoday