Week 21/34: Open Table Formats for Data Engineering Interviews
What Open Table Formats are and why they matter for data engineers
In the fast-paced world of data engineering, managing large datasets efficiently is essential. For years, companies have turned to Hadoop, data warehouses, and data lakes to process and store their data. But each of these technologies had its own set of problems. Open Table Formats have emerged as a modern solution, bridging the gap by adding database-like features to data lakes. They enhance data reliability, performance, and governance, offering a unified and scalable approach to managing big data.
In this post, we’ll cover:
What are Open Table Formats, and why did they emerge?
What problems do Open Table Formats solve?
Three Popular Open Table Formats
What is the Unified table format?
Key interview questions
Before diving into this post, we recommend reading "Data Serialisation: Choosing the Best Format for Performance and Efficiency" to explore various open file formats and their impact on performance and efficiency.
If you want to see what's covered in our interview series, please take a look at the Data Engineering Preparation Guide series1 for all the posts.
What are Open Table Formats, and why did they emerge?
The journey to Open Table Formats started with tools like Hadoop for processing large datasets. While revolutionary at the time, Hadoop was complex to manage, expensive to maintain, and slow to evolve with changing business needs.
Next came data warehouses, designed specifically for structured data analytics. They worked well for well-defined data schemas but had significant limitations:
High costs at scale.
Difficulty in changing column names or data types.
Challenges in handling unstructured data.
Vendor lock-in concerns.
The next evolution was the Data Lake. A data lake allows organisations to store raw data in cost-effective storage solutions like Amazon S3, Azure Blob Storage, or Google Cloud Storage. This approach was flexible, scalable, and economical, but it introduced new challenges:
No ACID transactions: Changes weren't guaranteed to be reliable or consistent.
Poor schema evolution: Changing the data structure over time was difficult.
Slow performance: Queries over raw files (like Parquet or ORC) often scanned far more data than necessary, because there was no table-level metadata to prune files.
No time travel/version control: Rolling back changes or accessing previous versions wasn't possible.
Imagine this scenario: Our team updates customer records in our data lake. Halfway through the process, the job fails. Now, some records are updated while others remain in their previous state. Our dashboards show inconsistent data, and analysts can't trust the results. This is the reality that many data teams faced with traditional data lakes.
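To make the scenario concrete, here is a minimal sketch of what an update looks like against a raw Parquet data lake. The directory and column names are made up for illustration, and the local folder stands in for an object-store prefix such as an S3 bucket:

```python
import pandas as pd
from pathlib import Path

# A "table" in a raw data lake is just a directory of Parquet files.
# "customers/" stands in for an object-store prefix like s3://bucket/customers/.
table_dir = Path("customers")

for path in sorted(table_dir.glob("*.parquet")):
    df = pd.read_parquet(path)
    # Apply the update to this file's slice of the data (hypothetical columns).
    df.loc[df["country"] == "UK", "segment"] = "priority"
    # Each file is overwritten independently: if the job crashes here, some
    # files are updated and others are not, with no rollback and no isolation
    # for concurrent readers.
    df.to_parquet(path, index=False)
```

Because every file is rewritten on its own, there is no single commit point: a failure halfway through leaves the dataset in exactly the half-updated state described above.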
While data lakes offered flexibility and cost advantages, they lacked the reliability and performance of traditional databases. This gap created the perfect opportunity for Open Table Formats to emerge.
An open table format provides a layer of abstraction on top of a data lake, allowing data to be managed and optimised more efficiently. The added structure also enables database-like features such as ACID transactions, schema evolution, and time travel. Apache Iceberg2, Delta Lake3, and Apache Hudi4 are some examples, which we will look at in more detail later.
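As a rough sketch of what this abstraction buys us, here is the same customer update done through Delta Lake's Python API on PySpark (Iceberg and Hudi offer comparable transactional writes and time travel). It assumes the delta-spark package is installed; the path, column names, and the tiny example dataset are made up for illustration:

```python
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Spark session configured with the Delta Lake extensions.
builder = (
    SparkSession.builder.appName("otf-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/customers_delta"  # stands in for an object-store location

# Initial load: writing as Delta records the data files in a transaction log.
spark.createDataFrame(
    [(1, "UK", "standard"), (2, "DE", "standard")],
    ["id", "country", "segment"],
).write.format("delta").mode("overwrite").save(path)

# The same update as before, but now it is atomic: either every matching row
# is updated and a new table version is committed, or nothing changes.
DeltaTable.forPath(spark, path).update(
    condition="country = 'UK'",
    set={"segment": "'priority'"},
)

# Time travel: read the table as it was before the update.
previous = spark.read.format("delta").option("versionAsOf", 0).load(path)
previous.show()
```

The key difference is the transaction log: readers only ever see a committed version of the table, and earlier versions remain queryable, which addresses the consistency, rollback, and trust problems from the failed-job scenario above.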