Pipeline To Insights

11 Storage Formats for Data Engineers

How to leverage storage formats for efficient and scalable data systems

Erfan Hesami
Jan 02, 2025

In Data Engineering, the choice of storage format is a foundational decision that influences the efficiency, scalability, and overall cost-effectiveness of a system. Storage formats are the backbone of how data is ingested, processed, stored, and queried. By understanding the strengths and limitations of each format, Data Engineers can design systems that meet current demands and scale seamlessly as data grows in complexity and volume.

In this post, we explore 11 key storage formats, their principles, and their practical applications, to make choosing the right format simpler.

1. Row-Based Storage

  • Purpose: Optimised for transactional systems where entire rows are frequently accessed.

  • Method: Stores data row by row in sequential order.

  • Benefits:

    • Ideal for OLTP workloads.

    • Simple and intuitive.

  • Limitations:

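    • Less efficient for analytical queries, because reading a few columns still means scanning entire rows.

    • Values from different columns are interleaved, so repeated values compress poorly and take more storage.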

  • Examples: MySQL, PostgreSQL.
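
To make the row-by-row layout concrete, here is a minimal sketch using only Python's standard `struct` module. The fixed-width record format and the sample order values are assumptions made for illustration; real engines such as MySQL or PostgreSQL use far more elaborate page formats.

```python
import struct

# Hypothetical fixed-width record: order_id, user_id, product_id (ints),
# price (double), quantity (int). "<" means little-endian with no padding.
ROW = struct.Struct("<iiidi")

# Illustrative sample data, not taken from any real system.
orders = [
    (1, 101, 501, 9.99, 2),
    (2, 102, 502, 24.50, 1),
    (3, 101, 503, 4.25, 5),
]

# Row-based layout: all fields of row 0, then all fields of row 1, and so on.
buffer = b"".join(ROW.pack(*row) for row in orders)


def read_row(buf: bytes, index: int) -> tuple:
    """Fetch one complete row with a single contiguous read."""
    return ROW.unpack_from(buf, index * ROW.size)


def sum_quantity(buf: bytes) -> int:
    """Aggregate one column; every full row still has to be visited."""
    total = 0
    for offset in range(0, len(buf), ROW.size):
        total += ROW.unpack_from(buf, offset)[4]
    return total
```

Fetching a whole order, as in `read_row(buffer, 1)`, is one contiguous read, which is the access pattern OLTP workloads favour. `sum_quantity(buffer)` needs only one field per row but still steps through every record.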

Figure: Row-based versus column-based storage of an "Orders Table" with the columns Order ID, User ID, Product ID, Price, and Quantity. In the row-based layout, each row holds all data for a single order; in the column-based layout, each column's values are stored together. Source: blog.bytebytego.com

2. Columnar Storage

  • Purpose: Designed for analytical workloads involving aggregations and filtering.

  • Method: Stores data column by column, enabling faster reads for specific columns.

  • Benefits:

    • Excellent for OLAP systems.

    • Highly compressible and query-efficient.

  • Limitations:

    • Slower for insert/update operations, because one row's values must be written to several separate column segments.

  • Examples: Apache Parquet, ORC, Amazon Redshift.
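
The trade-offs listed above can be sketched in a few lines of plain Python. The column names and values are assumptions for illustration, and run-length encoding is used here only as one simple example of why a column of similar values compresses well; it is not a description of how any particular format is implemented.

```python
from itertools import groupby

# Columnar layout: one sequence per column (hypothetical order data).
columns = {
    "order_id": [1, 2, 3, 4, 5, 6],
    "user_id": [101, 101, 101, 102, 102, 103],
    "quantity": [2, 1, 5, 1, 1, 3],
}


def total(column_name: str) -> int:
    """An aggregate reads only the single column it needs."""
    return sum(columns[column_name])


def run_length_encode(values: list) -> list:
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def append_row(row: dict) -> None:
    """Inserting one logical row costs one write per column."""
    for name, value in row.items():
        columns[name].append(value)
```

Here `total("quantity")` never touches `order_id` or `user_id`, and `run_length_encode(columns["user_id"])` shrinks six values to three pairs. `append_row`, by contrast, has to modify every column, which hints at why inserts and updates are slower in this layout.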

3. Key-Value Storage

  • Purpose: Stores data as key-value pairs, optimised for quick lookups and scalable architectures.

  • Method: Associates unique keys with corresponding values.

  • Benefits:

    • Simple and highly scalable.

    • Efficient for caching and session management.

  • Limitations:

    • Not suited for complex queries, since data is typically retrieved only by key.

  • Examples: Redis, DynamoDB.
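
As a rough illustration of the caching and session-management use case, the sketch below builds a toy in-memory key-value store with per-key expiry. The class `KeyValueStore`, its methods, and the `session:42` key are hypothetical names chosen for this example; they do not mirror the Redis or DynamoDB APIs.

```python
import time


class KeyValueStore:
    """Toy in-memory key-value store with optional per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (value, expires_at or None)
        self._clock = clock  # injectable so expiry can be tested

    def set(self, key, value, ttl_seconds=None):
        """Store a value, optionally expiring after ttl_seconds."""
        expires_at = None
        if ttl_seconds is not None:
            expires_at = self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        """Look up a value by key; expired entries behave as missing."""
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily evict the expired entry
            return default
        return value
```

A session could be stored with `store.set("session:42", {"user": "ada"}, ttl_seconds=30)` and read back with `store.get("session:42")`. Every operation goes through the key, so a question such as "which sessions belong to user ada?" would require scanning all values.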

Figure: NoSQL Database Types: Understanding the Differences.

4. Document-Oriented Storage

  • Purpose: Stores semi-structured data in a flexible, schema-less manner.

  • Method: Data is saved as documents, often in JSON or BSON formats.

  • Benefits:

    • Supports nested data and dynamic schemas.

    • Easy to scale horizontally.

  • Limitations:

    • Limited support for complex relationships, such as joins across many collections.

  • Examples: MongoDB, CouchDB.
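
The idea of nested data and dynamic schemas can be sketched with plain Python dictionaries and the standard `json` module. The `users` collection, its fields, and the helpers `get_path` and `find` are hypothetical and are not the MongoDB or CouchDB query API.

```python
import json

# Two documents in one hypothetical "users" collection. They do not share
# a schema: the second has no "address" and adds a "tags" list.
users = [
    {"_id": 1, "name": "Ada", "address": {"city": "London", "zip": "N1"}},
    {"_id": 2, "name": "Linus", "tags": ["kernel", "git"]},
]


def get_path(document: dict, path: str, default=None):
    """Follow a dotted path such as 'address.city' through nested dicts."""
    current = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def find(collection: list, path: str, expected) -> list:
    """Return every document whose value at `path` equals `expected`."""
    return [doc for doc in collection if get_path(doc, path) == expected]


# A document serialises directly to JSON text.
serialised = json.dumps(users[0])
```

`find(users, "address.city", "London")` returns only Ada's document, and the document without an `address` field is simply skipped rather than causing a schema error.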

Figure: Why Relational Databases are not the Cure-All. Strength and Weaknesses.

5. Graph Storage

  • Purpose: Designed for relationship-driven data and graph-based queries.

  • Method: Stores data as nodes, edges, and properties to represent relationships.

  • Benefits:

    • Optimised for connected data (e.g., social networks).

    • Intuitive querying of relationships.

  • Limitations:

    • Can be complex to model and scale.

  • Examples: Neo4j, Amazon Neptune.
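
The node, edge, and property model described above can be sketched with standard Python containers. The people, the `FOLLOWS` relationship, and the helper `within_hops` are assumptions for illustration; graph databases such as Neo4j expose this kind of traversal through their own query languages instead.

```python
from collections import deque

# Nodes carry properties; edges are (source, relationship, target) triples.
nodes = {
    "alice": {"age": 30},
    "bob": {"age": 25},
    "carol": {"age": 35},
    "dave": {"age": 40},
}
edges = [
    ("alice", "FOLLOWS", "bob"),
    ("bob", "FOLLOWS", "carol"),
    ("carol", "FOLLOWS", "dave"),
]

# Adjacency index: following an edge becomes a direct lookup.
adjacency = {name: [] for name in nodes}
for source, _relationship, target in edges:
    adjacency[source].append(target)


def within_hops(start: str, max_hops: int) -> dict:
    """Breadth-first search returning {node: hop_count} up to max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distances[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency[current]:
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    del distances[start]  # the start node is not its own result
    return distances
```

`within_hops("alice", 2)` returns `{"bob": 1, "carol": 2}`, a "friends of friends" style question that would need repeated self-joins in a relational schema.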

Source: Neo4j


6. Time-Series Storage

This post is for paid subscribers

© 2025 Erfan Hesami