Week 17/31: Data Pipelines and Workflow Orchestration for Data Engineering Interviews (Part #2)
Understanding Apache Airflow and Dagster and their roles in Data Engineering with real interview questions
In the first post, we discussed the importance and concepts of data pipeline orchestration. Specifically, we covered:
Fundamentals of data pipeline orchestration.
Key terms and concepts.
Roles involved in orchestration.
Core problems solved by orchestration.
A real-world example.
If you missed the first post, you can check it out here:
Data orchestration tools help us automate and manage our data pipelines with scheduling, dependency management, and monitoring. We can run tasks at fixed times or trigger them based on events. Tools like Airflow and Dagster offer different ways to design these workflows, and if you know Python, you can learn them easily.
In this post, we will explore:
What Apache Airflow is and how it works
What Dagster is and how it works
The differences and trade-offs between tools
Interview questions related to orchestration tools
For the previous posts of this interview series, check here: [Data Engineering Interview Preparation Series][1]
Apache Airflow
Apache Airflow, developed originally at Airbnb and now an Apache Software Foundation project, is built on a task-centric paradigm. It models workflows as Directed Acyclic Graphs (DAGs) where each node represents a task, and edges represent dependencies between tasks.
Apache Airflow is available in two main forms: an open-source version and a managed service version. The open source version is the original, community-driven platform that users install and maintain on their own infrastructure. This requires handling installation, configuration, scaling, and ongoing maintenance. In contrast, managed versions are provided by cloud providers (like Google Cloud Composer, Amazon MWAA, or Astronomer) that handle the infrastructure, updates, and maintenance aspects for you, allowing teams to focus more on workflow development rather than operational concerns.
Companies of all sizes use Apache Airflow to manage their data workflows (e.g., Airbnb, Netflix, Spotify, and Walmart).
Key Concepts in Airflow
DAGs (Directed Acyclic Graphs): A DAG is like a to-do list with a specific order. It shows which tasks need to run and in what order.
Example: First load data, then transform it, then save it.
Tasks: A task is an actual job created using an operator.
Example: A PythonOperator that runs a script to clean your data.
Operators: These are task templates that tell Airflow what to do.
Example: PythonOperator runs Python code, BashOperator runs a command in the terminal.
XComs (Cross-Communication): A way for tasks to share small bits of data.
Example: One task finishes and sends its output to the next task using XCom.
Sensors: These are special tasks that wait for something to happen before moving on.
Example: A sensor waits until a file appears in a folder before starting the next step.
Airflow's philosophy centers on control flow rather than data flow. The DAG defines how tasks relate to each other in terms of execution order, but the actual data passing between tasks is secondary. Airflow was designed with the assumption that each task might interact with different external systems, and the pipeline's main purpose is to coordinate these interactions.
Example:
This Airflow DAG downloads sales data, processes it to calculate product totals, and sends a notification when complete, demonstrating a basic ETL pipeline with task dependencies.
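The code for this example isn't reproduced inline here, so below is a minimal sketch of what such a DAG could look like, assuming a recent Airflow 2.x install. The task and function names (download_sales_data, calculate_product_totals, send_notification) and the hard-coded sample rows are illustrative stand-ins for a real data source and notification channel:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def download_sales_data(**context):
    # Placeholder extract step: a real pipeline would pull from an API or object store.
    return [
        {"product": "A", "amount": 120.0},
        {"product": "B", "amount": 80.0},
        {"product": "A", "amount": 45.5},
    ]


def calculate_product_totals(**context):
    # Pull the upstream task's return value via XCom and aggregate per product.
    rows = context["ti"].xcom_pull(task_ids="download_sales_data")
    totals = {}
    for row in rows:
        totals[row["product"]] = totals.get(row["product"], 0) + row["amount"]
    return totals


def send_notification(**context):
    totals = context["ti"].xcom_pull(task_ids="calculate_product_totals")
    print(f"Sales pipeline finished. Product totals: {totals}")


with DAG(
    dag_id="sales_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    download = PythonOperator(task_id="download_sales_data", python_callable=download_sales_data)
    totals = PythonOperator(task_id="calculate_product_totals", python_callable=calculate_product_totals)
    notify = PythonOperator(task_id="send_notification", python_callable=send_notification)

    # The edges of the DAG: download runs first, then aggregation, then notification.
    download >> totals >> notify
```

The tasks pass small results to each other through XCom return values, and the final `download >> totals >> notify` line encodes the dependencies.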
If you want to learn more about Airflow, check out the official tutorial here:
Dagster
Dagster takes an asset-centric approach to data orchestration. Instead of focusing primarily on tasks and their execution order, Dagster centers on data assets (the inputs and outputs of computations) and their relationships.
Dagster is a modern data orchestration platform available in both open source and managed versions (Dagster+). The open-source Dagster is the core platform that users can deploy and manage themselves, handling installation, configuration, maintenance, and scaling on their own infrastructure.
Dagster+, the managed version, is provided by Dagster Labs (formerly Elementl), the company behind Dagster. It handles the infrastructure, upgrades, and operational aspects, allowing teams to concentrate on building data pipelines rather than managing the underlying orchestration system.
Key Concepts
Assets: These are data objects that our system uses.
Example: A customer database, a trained machine learning model, or a visualisation dashboard.
Asset Definitions: These are recipes that tell Dagster how to create or update assets.
Example: A Python function that transforms raw customer data into clean, analysed data.
Asset Graph: This shows how assets depend on each other, creating an automatic flow.
Example: Raw data → Cleaned data → Analytics → Dashboard, where each step requires the previous.
Software-Defined Assets (SDAs): The code-based way to describe your assets and their relationships.
Example: Python decorators like @asset that turn ordinary functions into asset-creating recipes.
Materialisations: The actual process of creating or updating an asset.
Example: Running the code that transforms your raw data into the cleaned dataset and storing the results.
Partitions: Ways to divide assets into manageable chunks.
Example: Splitting customer data by month, so you can process January, February, etc., separately.
Schedules: Instructions for when to update assets automatically.
Example: "Refresh the sales dashboard every morning at 6 AM."
Resources: External systems or tools your assets need.
Example: Database connections, API clients, or cloud storage access.
I/O Managers: Components that handle how assets are stored and loaded.
Example: A manager that saves processed data to S3 and knows how to load it back.
Ops: Reusable building blocks for asset definitions.
Example: A data validation function that can be used in multiple asset creation processes.
Dagster emphasises data flow over control flow. Its design philosophy centers on these principles:
Data-Aware Orchestration: Workflows are designed around the data assets they produce, not just the tasks they execute.
Data Quality: Built-in mechanisms for testing and monitoring data quality.
Observability: Extensive logging and a rich UI for tracking lineage and status of data assets.
Testability: First-class support for unit testing pipeline components.
Hands-on Example: Creating Dagster Assets
This Dagster pipeline generates sample sales data and transforms it into both product-based and date-based summaries, showcasing how data assets flow through a data pipeline with automatic dependency tracking.
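The code itself isn't shown inline, so here is a minimal sketch of what such an asset graph could look like, assuming a recent Dagster release and pandas. The asset names (raw_sales_data, product_summary, daily_summary) and the generated sample data are illustrative:

```python
import pandas as pd
from dagster import Definitions, asset


@asset
def raw_sales_data() -> pd.DataFrame:
    # Placeholder: generate a small sample dataset instead of reading from a real source.
    return pd.DataFrame(
        {
            "order_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
            "product": ["A", "B", "A"],
            "amount": [120.0, 80.0, 45.5],
        }
    )


@asset
def product_summary(raw_sales_data: pd.DataFrame) -> pd.DataFrame:
    # Product-based summary; the dependency on raw_sales_data is inferred from the parameter name.
    return raw_sales_data.groupby("product", as_index=False)["amount"].sum()


@asset
def daily_summary(raw_sales_data: pd.DataFrame) -> pd.DataFrame:
    # Date-based summary built from the same upstream asset.
    return raw_sales_data.groupby("order_date", as_index=False)["amount"].sum()


# Register the assets so they appear in the Dagster UI and can be materialised.
defs = Definitions(assets=[raw_sales_data, product_summary, daily_summary])
```

Because product_summary and daily_summary declare raw_sales_data as a function parameter, Dagster tracks the dependency automatically and renders the resulting asset graph with its lineage in the UI.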
If you want to learn more about Dagster, check out the official tutorial here.
Conceptual Differences Between Airflow and Dagster
Now that we've seen both tools in action, let's explore their key philosophical and conceptual differences:
1. Core Abstraction
Airflow: Task-centric with DAGs representing execution flow between tasks.
Dagster: Asset-centric with graphs representing data dependencies between assets.
2. Data Flow vs Control Flow
Airflow: Emphasises control flow (when and in what order tasks are executed).
Dagster: Emphasises data flow (how data moves through the system and transforms).
3. Data Passing
Airflow: Uses XComs for passing small pieces of data between tasks, but this is not the primary focus
Dagster: First-class support for passing data between operations with explicit typing.
4. Testing Approach
Airflow: Testing typically involves running the full DAG or mocking components
Dagster: Built-in unit testing capabilities for individual asset definitions
5. Monitoring and Observability
Airflow: Focused on task execution status and logs
Dagster: Provides asset-level observability, including data lineage and materialisation history
6. Configuration Management
Airflow: Configuration often spread across DAG definition and separate config files
Dagster: Structured configuration with built-in validation
7. Errors and Recovery
Airflow: Retry logic at the task level
Dagster: Structured error handling with typed exceptions and retries
8. UI Focus
Airflow: UI centered around DAGs, task instances, and execution history
Dagster: UI displays asset graph, lineage, and materialisation history
When to Choose Each Tool
Choose Airflow When:
You need a battle-tested orchestration tool with a large community
Your workflows primarily coordinate actions across various systems
You have complex scheduling requirements
You need a wide range of pre-built integrations
Choose Dagster When:
Your workflows are primarily data pipelines with clear data dependencies
Data quality and testing are high priorities
You want greater visibility into data lineage and asset history
You prefer a more modern Python development experience with type hints
Interview Questions
When it comes to data orchestration questions, you're often asked about your overall experience with orchestration tools. Based on your response and the company's interview style, they may dive deeper to assess whether you have hands-on experience with these tools or just have surface-level exposure.
Therefore, the initial questions we’d expect from the interviewer on this topic are usually:
Have you used an orchestration tool before in your previous work or projects?
How have you used it? (If the previous answer is yes!)
Then, the follow-up questions below are relatively common.
Q1: How do you typically structure your workflows or DAGs (Directed Acyclic Graphs) within an orchestration tool?
Answer:
I usually start by defining the logical sequence of tasks that need to happen from the moment data is ingested until it’s loaded or processed. I break down large processes into modular tasks that each perform a single, well-defined function.
In Airflow, for instance, I’d define each task as an operator and link them in a DAG to reflect dependencies. This modularity makes it easier to maintain, troubleshoot, and scale individual parts of the pipeline without affecting the whole.
Q2: How have you approached error handling, retries, and alerting within your orchestrated pipelines?
Answer:
I typically define a clear error-handling strategy for each task. For instance, if a task fails due to a temporary network outage, I set a limited number of retries with an exponential backoff period to give the system time to recover. If the error persists after the retries, the system marks the job as failed.
Also, I prefer to configure a chat notification (such as Slack) that triggers on failures, so the team can investigate the issue quickly.
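As a rough sketch of this setup (assuming Airflow 2.x; notify_slack here is an illustrative placeholder for a real Slack webhook call or the Slack provider's hook):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def notify_slack(context):
    # Illustrative failure callback; a real setup would post to a Slack webhook
    # instead of printing.
    task_id = context["task_instance"].task_id
    print(f"Task {task_id} failed on {context['ds']}")


default_args = {
    "retries": 3,                          # retry transient failures a limited number of times
    "retry_delay": timedelta(minutes=2),   # initial wait between retries
    "retry_exponential_backoff": True,     # increase the wait on each subsequent retry
    "on_failure_callback": notify_slack,   # alert the team once retries are exhausted
}

with DAG(
    dag_id="resilient_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    PythonOperator(task_id="load_data", python_callable=lambda: print("loading..."))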
Q3: Can you share how you monitor the performance and health of your orchestrated pipelines?
Answer:
I monitored the orchestrated pipelines by looking at task execution times, resource usage, and overall success/failure rates. For example, in Airflow or Dagster, I used the tools’ web UI to view DAG runs, track each task’s runtime, and spot bottlenecks.
Preferably, I’d integrate logs and metrics into a monitoring platform (such as Grafana) so I can set up custom dashboards and alerts. However, I did not have a chance to do such integration in my previous work (or projects).
Note: Do not forget, it's always better to be honest in interviews than to exaggerate.
Q4: Have you had to handle backfills or reprocessing of historical data? If so, how did you manage that in your orchestration tool?
Answer:
Yes, I had to handle backfills occasionally when there was a change in the transformation logic or when historical data needed to be updated.
In Airflow, I used the DAG’s catch-up feature, setting a start_date that goes back to the point in time where data needs reprocessing. I ensure that tasks can handle re-running safely by building idempotent tasks or by writing data in a way that won’t create duplicates.
Additionally, I prefer to schedule backfills during off-peak hours to avoid overwhelming production resources and do all necessary tests before deploying.
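A minimal sketch of this approach, assuming Airflow 2.x; the DAG id and dates are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="transactions_pipeline",
    start_date=datetime(2023, 1, 1),  # go back to where reprocessing should start
    schedule="@daily",
    catchup=True,                     # Airflow creates a run for every missed interval
) as dag:
    PythonOperator(task_id="reprocess", python_callable=lambda: print("reprocessing..."))

# A specific date range can also be backfilled from the CLI:
#   airflow dags backfill transactions_pipeline --start-date 2023-01-01 --end-date 2023-03-31
```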
Q5: How do you ensure that an Airflow workflow is idempotent?
Answer:
Idempotency in Airflow workflows is achieved by ensuring that tasks can be run multiple times without causing unintended side effects. This can be done by:
Using unique identifiers for task outputs.
Implementing checks to skip tasks if they have already been executed.
Storing intermediate results in a manner that can be checked and reused if needed. A minimal sketch of these ideas is shown below.
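Here is a minimal sketch of those ideas, assuming Airflow 2.x; the file-based write is a stand-in for a real warehouse load, where the same pattern would be a delete-then-insert or MERGE keyed by the run date:

```python
import csv
import os
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_daily_partition(ds, **context):
    # 'ds' is the run's logical date; using it as a partition key makes re-runs safe.
    rows = [{"order_date": ds, "product": "A", "amount": 120.0}]  # placeholder extract step

    # Write (not append) to a date-keyed path: re-running the task for the same date
    # overwrites the same partition instead of creating duplicates.
    os.makedirs("/tmp/sales", exist_ok=True)
    with open(f"/tmp/sales/{ds}.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_date", "product", "amount"])
        writer.writeheader()
        writer.writerows(rows)


with DAG(
    dag_id="idempotent_sales_load",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="load_daily_partition", python_callable=load_daily_partition)
```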
Q6: Imagine you have a data pipeline that processes customer transaction data hourly. Sometimes, upstream systems delay data delivery by up to 30 minutes. How would you design this pipeline in Airflow/Dagster to handle these delays without creating duplicate processing?
Answer:
I would implement a sensor operator that checks for the availability of data before proceeding:
A sensor is a special type of operator in Airflow/Dagster that continuously checks for a specific condition to be met before allowing workflow execution to proceed.
For handling delayed data, a sensor would:
Poll at regular intervals to check if transaction data has arrived.
Wait patiently during the 30-minute potential delay window.
Trigger downstream processing tasks only when data is detected.
Timeout after a configured maximum wait period.
This approach prevents duplicate processing because the pipeline only executes once per data batch, regardless of when it arrives within the acceptable window.
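A minimal sketch of this pattern in Airflow 2.x; the DAG id and the templated file path are illustrative assumptions about where the upstream system drops its files:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="hourly_transactions",
    start_date=datetime(2024, 1, 1),
    schedule="@hourly",
    catchup=False,
) as dag:
    # Wait for this hour's transaction file to land; the templated path ties each run
    # to its own logical interval, so a delayed file is still processed exactly once.
    wait_for_file = FileSensor(
        task_id="wait_for_transactions",
        filepath="/data/transactions/{{ ts_nodash }}.csv",  # one file per hourly run
        poke_interval=300,   # check every 5 minutes
        timeout=45 * 60,     # give up after 45 minutes (covers the 30-minute delay)
        mode="reschedule",   # free the worker slot between checks
    )

    process = PythonOperator(
        task_id="process_transactions",
        python_callable=lambda: print("processing the hour's batch..."),
    )

    wait_for_file >> process
```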
Q7: You've noticed that a particular task in your pipeline occasionally fails due to memory issues during peak hours. How would you modify your orchestration to handle this scenario?
Answer:
Assuming we are using Airflow, I’d follow the steps below:
1. Resource Allocation:
Configure the task with a higher-memory worker or pod (if using the Kubernetes or Celery executor).
Set execution_timeout to prevent runaway processes.
2. Schedule Adjustment:
Shift the DAG’s schedule to run during off-peak hours.
Limit concurrent executions with a queue or pool.
3. Task Optimisation:
Modify the task to process data in chunks rather than all at once, as sketched below.
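Putting these pieces together, a rough sketch could look like this (assuming Airflow 2.x; the DAG id, file path, and pool name are illustrative, and the heavy_tasks pool would be created separately via the UI or CLI):

```python
from datetime import datetime, timedelta

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator


def process_in_chunks(**context):
    # Stream the file in fixed-size chunks instead of loading it all into memory.
    totals = {}
    for chunk in pd.read_csv("/data/transactions/latest.csv", chunksize=50_000):
        grouped = chunk.groupby("product")["amount"].sum()
        for product, amount in grouped.items():
            totals[product] = totals.get(product, 0) + amount
    print(totals)


with DAG(
    dag_id="memory_heavy_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="0 3 * * *",   # shifted to an off-peak hour
    catchup=False,
) as dag:
    PythonOperator(
        task_id="process_transactions",
        python_callable=process_in_chunks,
        pool="heavy_tasks",                       # an Airflow pool caps concurrent heavy runs
        execution_timeout=timedelta(minutes=30),  # kill runaway processes
    )
```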
Therefore, I’d ensure the task runs during off-peak hours, uses sufficient memory resources, and is optimised to process data in smaller chunks to avoid memory-related failures.
Conclusion
In this post, we explored two leading data orchestration tools—Apache Airflow and Dagster—and common interview questions about them. We covered the fundamentals of Apache Airflow and its operational mechanics, Dagster's architecture and workflow approach, key differences and trade-offs between these orchestration platforms, and frequently asked interview questions with detailed answers.
Coming up next in our series, we'll dive into hands-on practice. We'll share a complete technical assessment with working orchestration examples in a GitHub repository, plus clear explanations of the code and design choices.
If you are interested in orchestration in practice, you can check out the below post where we use Dagster to schedule a data ingestion task.
We Value Your Feedback
If you have any feedback, suggestions, or additional topics you’d like us to cover, please share them with us. We’d love to hear from you!
[1] https://pipeline2insights.substack.com/t/interview-preperation