Week 10/33: ETL and ELT Processes for Data Engineering Interviews #1
Understanding ETL, ELT, their use cases and differences
Data Engineering interviews often test your understanding of how data moves through modern systems. Two key processes that come up in these interviews are ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform). These processes form the core of today's data systems and are common topics in technical tests and case studies.
In recent years, many companies have moved to cloud platforms like AWS, Azure, and Google Cloud. This shift has changed how data pipelines are built and managed. As a result, companies expect you to know both the technical details of ETL and ELT and how they work in real cloud-based environments.
In this post, we'll cover:
ETL vs. ELT: Understanding the Definition and Key Characteristics.
Comparing ETL and ELT, considering their advantages and disadvantages.
ETL and ELT scenario-based interview questions.
A real case study with step-by-step implementation.
In the next post, we will dive deeper into ETL and ELT best practices, explore tool choices, and walk through building an end-to-end ELT pipeline using dlt and dbt.
For the previous posts in this series, check here: [Data Engineering Interview Preparation Series](https://pipeline2insights.substack.com/t/interview-preperation)
What is ETL?
ETL stands for Extract, Transform, Load. It is a process used to move data from one or more sources into a centralised system, such as a data warehouse. This process makes it easier for organisations to analyse and use their data.
The usual way ETL pipelines work is as follows:
Extract: Data is gathered from various sources like databases, flat files, APIs, etc. This step pulls the raw data needed for further processing.
Transform: Once extracted, the data is cleaned and converted into a useful format. This might include removing errors, filtering out unnecessary information, or converting data types to ensure consistency.
Load: The transformed data is then loaded into a target system, such as a data warehouse or a data lake, where it can be used for analysis and reporting.
Key Characteristics of ETL
Batch Processing: ETL often works in batches, processing large volumes of data at scheduled intervals.
Data Quality: The transformation step allows for cleaning and validating data, ensuring it meets business standards.
Integration: ETL brings together data from multiple sources into one unified format, making it easier to analyse.
Note: While understanding ETL fundamentals is important, many interviewers prefer to see how you apply these concepts in real-world scenarios. ETL-related interview questions are often scenario-based or involve hands-on assessments that test your practical skills.
Scenario-based ETL Questions
Question 1: Retail Company Sales Data Integration
Imagine you work for a retail company where each store collects sales data using different systems and formats. What would be the most suitable approach to consolidate this data into a central data warehouse, and why?
Answer: Choosing the right data integration method depends on stakeholder requirements and access patterns. However, assuming that the primary need is to ensure data quality and comprehensive reporting, an ETL approach is highly suitable. The following steps can address this scenario:
Extraction: Data is collected from various store systems regardless of their differing formats.
Transformation: This step cleans and standardises the data, converting different date formats, currencies, and units into a unified structure.
Loading: The transformed data is then loaded into a central data warehouse where stakeholders can easily access it for analytics and decision-making.
This method ensures that the final dataset is reliable and ready for use, making it a good fit for a retail environment focused on periodic analysis.
Question 2: Healthcare Provider Data Consolidation
Consider a healthcare provider that collects patient data from several clinics. What is the most suitable approach for integrating this data into a unified system, and what are the reasons behind your choice?
Answer: Assuming that the key priority is to ensure high data quality and consistency for regulatory compliance and clinical decision-making, an ETL approach is highly suitable.
Extraction: Patient data is gathered from each clinic's system.
Transformation: This stage is essential for cleaning and standardising the data.
Loading: After transformation, the clean data is loaded into a centralised system or data warehouse.
In this scenario, ETL ensures that data from diverse sources is transformed into a format that meets the strict quality and consistency standards required in healthcare.
What is ELT?
ELT stands for Extract, Load, Transform. It is a process used to move data from one or more sources into a centralised system, typically a cloud data warehouse or a data lake, where the transformation happens. This approach leverages the power of modern data platforms to process data after it has been loaded.
The usual way ELT pipelines work is as follows:
Extract: Data is gathered from various sources like databases, flat files, APIs, etc. This step pulls the raw data needed for further processing.
Load: The raw data is immediately loaded into the target system without any prior transformation. This ensures that the data is stored in its complete form, preserving every detail.
Transform: Once the data is in the target system, it is transformed as needed. This can include cleaning the data, converting data types, and applying business logic, all performed using the processing power of the cloud or the data warehouse.
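To make the contrast with ETL concrete, below is a minimal ELT-style sketch in Python: the raw data is loaded into the warehouse first, and the transformation runs inside the warehouse as SQL. The file, table, and column names are illustrative, and a local PostgreSQL instance stands in for a cloud warehouse.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Illustrative connection; in practice this would point at a cloud warehouse.
engine = create_engine("postgresql://user:password@localhost:5432/analytics")

# Extract + Load: land the raw events with no prior transformation.
raw_events = pd.read_json("events.json")
raw_events.to_sql("raw_events", engine, if_exists="append", index=False)

# Transform: run on demand, inside the warehouse, using its compute.
with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS daily_event_counts AS
        SELECT event_date, event_type, COUNT(*) AS total_events
        FROM raw_events
        GROUP BY event_date, event_type
    """))
```

Because the raw events stay in the warehouse, new transformations can be added later without re-extracting anything from the sources.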
Key Characteristics of ELT
Real-Time or Near Real-Time Processing: ELT is well-suited for scenarios where data is needed quickly. By loading raw data immediately, organisations can transform it on-demand to suit various analytical needs.
Scalability: Modern cloud-based systems provide the flexibility and power to handle vast amounts of data efficiently, making ELT a robust choice for large-scale data operations.
Flexibility: Since raw data is stored in its entirety, you have the option to re-transform or analyse it differently as business requirements evolve.
Integration: ELT centralises data first, which makes it easier to explore and apply multiple transformation models later, depending on the analytical or reporting needs.
Note: Similar to ETL, many interviewers prefer to see how you apply ELT concepts in real-world scenarios.
Scenario-based ELT Questions
Question 1: Social Media Data Analysis
You work for a company that collects large volumes of raw social media data from various sources. What would be the most suitable approach to integrate this data into a cloud-based platform for real-time analytics, and why?
Answer: Since the primary use of this data is real-time analytics, an ELT approach is suitable. Near real-time analytics can be achieved with the following steps:
Extraction: Data is collected from different social media APIs and sources.
Loading: The raw data is immediately loaded into a cloud-based data warehouse, ensuring that no detail is lost.
Transformation: Transformations are applied later based on specific analytical requirements, allowing for on-demand processing using the platform’s resources.
This approach offers flexibility and scalability, enabling real-time insights while preserving the raw data for future analysis.
Question 2: IoT Data Integration
Consider a manufacturing company that continuously collects sensor data from its machinery. What is the most suitable data integration approach to manage this high-volume, high-velocity data stream, and why?
Answer: Processing such high-volume, high-velocity data on the fly can be challenging. Therefore, an ELT approach is preferable: the raw sensor data is loaded immediately, and transformations are applied later as processing needs evolve.
This method ensures that the sensor data is available promptly for both immediate monitoring and in-depth historical analysis, taking full advantage of the scalable nature of cloud platforms.
ETL vs. ELT
When deciding between ETL and ELT, it's important to understand their key differences, advantages, disadvantages, and the use cases that make each approach suitable. The choice often depends on use cases, stakeholder requirements, and the technical environment.
Each approach therefore has distinct use cases that follow from its properties.
ETL is ideal when your business requirements demand that data is cleansed and standardised before it enters your analytical systems. This is especially important when:
Data Quality and Consistency Are Paramount: Industries like finance, healthcare, and retail often have strict regulatory and reporting requirements. In these environments, stakeholders expect high-quality, uniform data that can be trusted for decision-making.
Batch Processing: ETL is a great fit for batch processing scenarios where data is updated at regular intervals (e.g., nightly or weekly).
ELT is ideal for environments that require rapid data ingestion, flexible processing, and the ability to handle massive volumes of raw data. This approach is particularly advantageous when:
Immediate Data Availability Is Critical: In scenarios where data must be ingested quickly, such as real-time analytics or monitoring, ELT allows you to load data in its raw form first.
Dynamic and Evolving Transformation Requirements: When business needs are likely to change or when you need to support ad-hoc queries, having raw data stored allows you to experiment with different transformation logic without re-extracting data.
Handling Unstructured or Semi-Structured Data: Modern data lakes and warehouses can efficiently store and process unstructured data. ELT takes advantage of these capabilities, enabling organisations to ingest large volumes of diverse data (e.g., JSON logs, clickstreams, IoT sensor readings) and later transform it as required for different analytical purposes.
For more details about pipeline design patterns, check out our post where we explore a variety of patterns in detail.
ETL Case Study
The case study below may seem simple, but it is inspired by the ETL portion of a technical assessment given for a Mid-Level Data Engineering position in Australia.
Objective
You are a Data Engineer at P2I Analytics, a company that tracks customer purchases. The company wants to build an ETL pipeline that consolidates data from multiple sources to generate a daily product sale report and a daily customer purchase report.
Your task is to extract, transform, and load (ETL) data from different sources, perform necessary cleaning and aggregations, and load the final dataset into a database for reporting.
Scenario
The company receives data from multiple sources:
Orders Data (API) - JSON file simulating an API response.
Customer Data (CSV) - A file extracted from the CRM system.
Product Data (PostgreSQL) - Stored in a relational database.
You need to integrate this data and generate a daily revenue summary.
Tasks
Explain your plan.
Implement the ETL pipeline.
Consider Edge Cases and Potential Optimisations (Do not implement).
Output Format
README.md file explaining the plan, instructions to run the ETL scripts, and edge cases
Python Scripts or SQL Queries implementing the ETL pipeline
Note: Companies generally specify the expected output formats and tools; the above is given as an example. We will share our output in this post.
Solution
Step 1: Plan
Before diving into implementation, it's essential to understand the data and its relationships. We have three data sources:
Orders Data: Contains fields such as `order_id`, `customer_id`, `product_id`, `amount`, and `order_date`.
Customer Data: Includes `customer_id`, `name`, `email`, and `signup_date`.
Product Data: Comprises `product_id`, `name`, `price`, and `category`.
The primary keys are `order_id` for orders, `customer_id` for customers, and `product_id` for products. The relationships are established through foreign keys: `customer_id` in Orders links to `customer_id` in Customers, and `product_id` in Orders links to `product_id` in Products.
Understanding these connections is crucial for determining the necessary joins and transformations to consolidate the data effectively.
Then, to approach this ETL task, we will first focus on extracting data from multiple sources. This includes:
reading orders from a JSON file,
customer data from a CSV file,
and product details from a PostgreSQL database.
Once the data is extracted, we will move on to the transformation phase.
First, we will merge orders with customer data using `customer_id`.
Then we will merge the resulting dataset with product data using `product_id`.
Since both the customer and product tables contain a `name` column, we will rename them accordingly, `customer_name` for customer data and `product_name` for product data, to prevent conflicts. Additionally, we will handle missing values, particularly customer emails, by assigning a default value. We will also ensure there are no duplicate orders by applying deduplication before aggregation.
Following this, we will compute two separate aggregated summaries:
Product-Level Summary: This will include total orders and total revenue per `order_date` and `product_name`.
Customer-Level Summary: This will include total orders and total spending per `order_date` and `customer_name`.
Finally, we will proceed with the loading phase by storing both `daily_product_sale_summary` and `daily_customer_purchase_summary` in a PostgreSQL database. This will allow efficient retrieval and reporting of customer purchases and product sales.
Step 2: Implementation
Import dependencies:
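A minimal set of imports for this pipeline, assuming we use pandas for the transformations, SQLAlchemy for the PostgreSQL connection, and the standard json module to read the simulated API response:

```python
import json  # to read the JSON file simulating the orders API response

import pandas as pd
from sqlalchemy import create_engine
```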
Extract orders data (JSON data used to simulate API data):
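A simple sketch, assuming the simulated API response is saved locally as `orders.json` (the file name is illustrative):

```python
def extract_orders(path: str = "orders.json") -> pd.DataFrame:
    """Read the JSON file that simulates the orders API response."""
    with open(path) as f:
        orders = json.load(f)
    return pd.DataFrame(orders)

orders_df = extract_orders()
```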
Extract customer data (CSV)
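Similarly, assuming the CRM export is available as `customers.csv`:

```python
def extract_customers(path: str = "customers.csv") -> pd.DataFrame:
    """Read the customer extract exported from the CRM system."""
    return pd.read_csv(path)

customers_df = extract_customers()
```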
Extract product data (PostgreSQL)
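Assuming the product data lives in a local PostgreSQL instance (the connection string and table name are illustrative):

```python
# Replace the connection string with your own credentials.
engine = create_engine("postgresql://user:password@localhost:5432/p2i")

def extract_products() -> pd.DataFrame:
    """Read the product table from PostgreSQL."""
    query = "SELECT product_id, name, price, category FROM products"
    return pd.read_sql(query, engine)

products_df = extract_products()
```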
Now, we join the datasets and perform data cleaning.
Merge orders with customers
Merge orders_customers with products
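A sketch of both joins, building on the DataFrames created above; left joins keep every order even if a matching customer or product record is missing:

```python
# Merge orders with customers on customer_id
orders_customers = orders_df.merge(customers_df, on="customer_id", how="left")

# Merge the result with products on product_id
orders_full = orders_customers.merge(products_df, on="product_id", how="left")
```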
Since both the customer and product tables have `name` fields, which pandas renames to `name_x` and `name_y` by default after the merge, it is better to rename them explicitly to keep the data clean.
Rename name fields
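A minimal sketch of the rename, assuming the `name_x`/`name_y` suffixes produced by the merges above:

```python
# name_x came from the customer table, name_y from the product table
orders_full = orders_full.rename(
    columns={"name_x": "customer_name", "name_y": "product_name"}
)
```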
Alright, our merged dataframe now contains the combined order, customer, and product fields. Inspecting it, there is a NaN value in the email column that we may want to address. Since the task does not specify how to handle nulls, we can assume a replacement value.
Note: Since the email field is not a required field in the given task, we can totally leave it as it is. However, it may be advantageous in real interviews to show that we are familiar with null handling and checking features of data in detail.
We can approach this based on a specific assumption: in this scenario, we assume that emails should always be present and never missing. Based on this assumption, we fill null values with a default email.
While this might not always be the best practice, for the sake of this case study, we are making this assumption to proceed with the transformation.
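A sketch of that assumption (the placeholder address is illustrative):

```python
# Assumption: emails must always be present, so fill missing ones with a placeholder.
orders_full["email"] = orders_full["email"].fillna("unknown@example.com")
```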
Note: Any assumptions you make while solving this case study should be clearly documented in your solution. Be prepared to justify them if asked during follow-up questions in the next round.
Our data doesn't have any duplicates, but we can still include a deduplication step: in case of duplicate `order_id`s, we keep only the first occurrence.
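A minimal sketch of this deduplication:

```python
# Keep only the first occurrence of each order_id
orders_full = orders_full.drop_duplicates(subset="order_id", keep="first")
```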
Now that we have merged and cleaned our data, let's perform the last transformation step: aggregating for analysis purposes.
Daily product sale summary
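A sketch of the product-level aggregation, assuming `amount` holds the order value as described in the plan:

```python
# Total orders and total revenue per day and product
daily_product_sale_summary = (
    orders_full.groupby(["order_date", "product_name"])
    .agg(total_orders=("order_id", "count"), total_revenue=("amount", "sum"))
    .reset_index()
)
```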
Daily customer purchase summary
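The customer-level aggregation follows the same pattern:

```python
# Total orders and total spending per day and customer
daily_customer_purchase_summary = (
    orders_full.groupby(["order_date", "customer_name"])
    .agg(total_orders=("order_id", "count"), total_spent=("amount", "sum"))
    .reset_index()
)
```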
Lastly, we will load our data into PostgreSQL for our stakeholders to use.
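A sketch of the load step, reusing the SQLAlchemy engine created earlier; here we simply replace the tables on each daily run (an incremental strategy is discussed under optimisations below):

```python
daily_product_sale_summary.to_sql(
    "daily_product_sale_summary", engine, if_exists="replace", index=False
)
daily_customer_purchase_summary.to_sql(
    "daily_customer_purchase_summary", engine, if_exists="replace", index=False
)
```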
Now we have our summary tables available in the PostgreSQL database.
`select * from daily_product_sale_summary;`

Expected Output

| order_date | product_name | total_orders | total_revenue |
| --- | --- | --- | --- |
| 2024-02-01 | Keyboard | 1 | 49.99 |
| 2024-02-01 | Laptop | 1 | 99.99 |
`select * from daily_customer_purchase_summary;`

Expected Output

| order_date | customer_name | total_orders | total_spent |
| --- | --- | --- | --- |
| 2024-02-01 | Alice | 1 | 49.99 |
| 2024-02-01 | Bob | 1 | 99.99 |
Step 3: Edge Cases and Potential Optimisations
While the ETL pipeline is designed for smooth operations, it is essential to anticipate potential issues and future optimisations:
API Downtime: If the orders API fails, we need a retry mechanism or fallback logic with a logging implementation to keep track of API calls.
Schema Evolution: If columns are added or removed in future versions of the customer or product datasets, the ETL should be designed to handle schema changes gracefully.
Incremental Loading: Instead of reloading all records daily, implement logic to only process new or updated records if the data volume is relatively high.
Data Validation Framework: Implementing a tool like Great Expectations to enforce data quality checks before loading into the database.
Automation with Airflow: Scheduling the ETL pipeline using Apache Airflow to handle dependencies and retries automatically.
Conclusion
Understanding ETL and ELT is essential for modern data engineers, as these processes form the backbone of data workflows. In this post, we explored the fundamentals of ETL and ELT, their distinct advantages, and when to use each approach. Through scenario-based questions, we demonstrated how these methodologies apply to real-world challenges across industries like retail, healthcare, social media, and IoT.
We also implemented a hands-on ETL case study, guiding you through the extraction, transformation, and loading of data from multiple sources.
In the next post, we will dive into tools and more advanced practices for ETL and ELT processes, and we will walk through another case study where we implement an ELT pipeline using dlt and dbt.
If you want to explore more about dlt and dbt before the next post, you can check out our previous posts about them.
We Value Your Feedback
If you have any feedback, suggestions, or additional topics you’d like us to cover, please share them with us. We’d love to hear from you!