Think of SQL optimisation as riding a bike on a long journey. If you start with the wrong gear or an unbalanced load, you’ll waste energy pedalling harder than necessary and slow yourself down. Choosing the right file formats is like selecting a smooth route, while understanding your data types ensures you’re in the correct gear to maintain momentum. A well-optimised query, much like a perfectly tuned bike ride, lets you glide efficiently, conserve energy, and reach your destination faster and with less effort. 🚴‍♂️🛤️
This week, we began exploring DuckDB [1] and its performance under different conditions. Using a mix of real-world datasets and practical examples, we highlighted how columnar storage, pre-sorting data, and efficient formats like Parquet can significantly enhance query performance. Each example focused on showcasing DuckDB's capabilities in modern analytical workloads.
Day 36: PostgreSQL vs. DuckDB
Goal: Compare PostgreSQL and DuckDB’s performances on a basic aggregation query.
Details:
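As a rough illustration, the comparison comes down to running the same aggregation unchanged on both engines and timing it. The table and column names below (trips, passenger_count, fare_amount) are placeholders, not the exact dataset from the post.

```sql
-- Identical aggregation for PostgreSQL and DuckDB; only the engine changes.
-- Table/column names are illustrative assumptions.
SELECT passenger_count,
       COUNT(*)         AS trip_count,
       AVG(fare_amount) AS avg_fare
FROM trips
GROUP BY passenger_count
ORDER BY passenger_count;
```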
Day 37: Sorted Data vs. Hash Joins
Goal: Retrieve matching vehicles based on their state and model year.
Details:
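A minimal sketch of the pre-sorting idea, assuming two hypothetical tables, vehicles and registrations, that share state and model_year columns. Sorting both sides on the join keys beforehand improves data locality when the join runs, though DuckDB still picks the join algorithm itself.

```sql
-- Pre-sort both tables on the join keys (table names are assumptions).
CREATE TABLE vehicles_sorted AS
SELECT * FROM vehicles ORDER BY state, model_year;

CREATE TABLE registrations_sorted AS
SELECT * FROM registrations ORDER BY state, model_year;

-- Retrieve matching vehicles by state and model year.
SELECT v.*
FROM vehicles_sorted v
JOIN registrations_sorted r
  ON v.state = r.state
 AND v.model_year = r.model_year;
```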
Day 38: Optimising Range Queries
Goal: Count the total number of vehicles within a specific electric_range.
Details:
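One way this might look, assuming a vehicles table with an electric_range column and example bounds of 100 to 200 miles. Ordering the data by the range column lets DuckDB skip row groups whose min/max fall entirely outside the filter.

```sql
-- Store the data ordered by the column used in the range predicate.
CREATE TABLE vehicles_by_range AS
SELECT * FROM vehicles ORDER BY electric_range;

-- Count rows inside the range (bounds are illustrative).
SELECT COUNT(*) AS vehicles_in_range
FROM vehicles_by_range
WHERE electric_range BETWEEN 100 AND 200;
```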
Day 39: Temporary Tables in DuckDB
Goal: Calculate the average session time for each user from web log data.
Details:
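A hedged sketch of the temporary-table approach, assuming a web_logs table with user_id, session_id, and event_time columns. The per-session durations are computed once, cached in a temporary table, and then reused for the per-user average (and any follow-up queries).

```sql
-- Cache intermediate per-session durations in a temporary table.
CREATE TEMP TABLE session_durations AS
SELECT user_id,
       session_id,
       date_diff('second', MIN(event_time), MAX(event_time)) AS session_seconds
FROM web_logs
GROUP BY user_id, session_id;

-- Reuse the cached results for the final aggregation.
SELECT user_id,
       AVG(session_seconds) AS avg_session_seconds
FROM session_durations
GROUP BY user_id;
```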
Day 40: Querying CSV vs. Parquet
Goal: Calculate the total retail sales grouped by item type for the year 2020.
Details:
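Roughly, the comparison boils down to running the same query against each file format and timing it. The file names and columns (item_type, retail_sales, year) are assumptions rather than the exact files used in the post.

```sql
-- CSV source: DuckDB infers the schema while scanning the whole text file.
SELECT item_type, SUM(retail_sales) AS total_retail_sales
FROM read_csv_auto('retail_sales.csv')
WHERE "year" = 2020
GROUP BY item_type;

-- Parquet source: same query, but only the needed columns are read.
SELECT item_type, SUM(retail_sales) AS total_retail_sales
FROM read_parquet('retail_sales.parquet')
WHERE "year" = 2020
GROUP BY item_type;
```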
Day 41: Querying JSON vs. Parquet
Goal: Calculate the total retail sales grouped by item type for the year 2020.
Details:
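The JSON side of the comparison might look like the sketch below, again with assumed file and column names; the Parquet query is the same as in Day 40. DuckDB's json extension handles the parsing.

```sql
-- Enable JSON reading (bundled with DuckDB).
INSTALL json;
LOAD json;

-- Same aggregation, but every row must be parsed from JSON text.
SELECT item_type, SUM(retail_sales) AS total_retail_sales
FROM read_json_auto('retail_sales.json')
WHERE "year" = 2020
GROUP BY item_type;
```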
Day 42: Varchar vs. Timestamp
Goal: Calculate the average day value from a date column in a crime dataset.
Details:
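A small sketch of the difference, assuming a hypothetical crimes table with both a VARCHAR column (incident_date_str, ISO-formatted) and a proper TIMESTAMP column (incident_date). The string version forces a cast on every row before the day can be extracted; the typed column avoids that work entirely.

```sql
-- VARCHAR column: parse the string on every row, then extract the day.
SELECT AVG(EXTRACT(day FROM CAST(incident_date_str AS TIMESTAMP))) AS avg_day
FROM crimes;

-- TIMESTAMP column: extract the day directly, no per-row parsing.
SELECT AVG(EXTRACT(day FROM incident_date)) AS avg_day
FROM crimes;
```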
Conclusion
In this sixth week of our "100 Days of SQL Optimisation" journey, we focused on optimising performance with DuckDB while leveraging efficient data structures and query techniques:
Columnar Storage for Speed: Compared PostgreSQL with DuckDB and demonstrated how columnar storage accelerates aggregations by reading only the relevant columns and keeping them in memory.
Sorted Data for Joins and Range Queries: Pre-sorting data before joins and queries reduced execution time and memory usage significantly.
Temporary Tables: Used temporary tables to cache intermediate results, improving performance for repetitive calculations.
Data Format Impact: Highlighted the efficiency of Parquet over CSV and JSON formats due to its smaller size and faster querying capabilities.
Optimal Data Types: Showed the importance of using TIMESTAMP instead of VARCHAR for date operations to reduce memory usage and processing overhead.
These optimisations showcased how leveraging in-memory execution, efficient storage formats, and optimal data structures can deliver significant performance improvements in modern analytical databases like DuckDB.
If you're interested in learning more practical SQL optimisations, follow along with our 100-day journey. Subscribe to stay updated; each day, you'll gain new insights into making your SQL queries more powerful and efficient!
If you missed the Week 5 recap, you can catch up by checking it out here:
If you’re interested in using DuckDB, the dlt library is a fantastic tool for effortlessly ingesting datasets into DuckDB. Check out our post to learn more about how dlt simplifies data importing from various sources.
[1] https://duckdb.org/why_duckdb.html