Think of SQL optimisation as riding a bike on a long journey. If you start with the wrong gear or an unbalanced load, you’ll waste energy pedalling harder than necessary and slow yourself down. Choosing the right file formats is like selecting a smooth route, while understanding your data types ensures you’re in the correct gear to maintain momentum. A well-optimised query, much like a perfectly tuned bike ride, lets you glide efficiently, conserve energy, and reach your destination faster and with less effort. 🚴‍♂️🛤️
This week, we began exploring DuckDB [1] and its performance under different conditions. Using a mix of real-world datasets and practical examples, we highlighted how columnar storage, pre-sorting data, and efficient formats like Parquet can significantly enhance query performance. Each example focused on showcasing DuckDB's capabilities in modern analytical workloads.
Day 36: PostgreSQL vs. DuckDB
Goal: Compare PostgreSQL and DuckDB’s performances on a basic aggregation query.
Details:
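As a rough illustration, the comparison comes down to running the same aggregation unchanged on both engines and timing it. The table and column names below (trips, passenger_count, fare_amount) are placeholders, not the exact dataset from the post.

```sql
-- Identical aggregation for PostgreSQL and DuckDB; only the engine changes.
-- Table/column names are illustrative assumptions.
SELECT passenger_count,
       COUNT(*)         AS trip_count,
       AVG(fare_amount) AS avg_fare
FROM trips
GROUP BY passenger_count
ORDER BY passenger_count;
```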
Day 37: Sorted Data vs. Hash Joins
Goal: Retrieve matching vehicles based on their state and model year.
Details:
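A minimal sketch of the pre-sorting idea, assuming two hypothetical tables, vehicles and registrations, that share state and model_year columns. Sorting both sides on the join keys beforehand improves data locality when the join runs, though DuckDB still picks the join algorithm itself.

```sql
-- Pre-sort both tables on the join keys (table names are assumptions).
CREATE TABLE vehicles_sorted AS
SELECT * FROM vehicles ORDER BY state, model_year;

CREATE TABLE registrations_sorted AS
SELECT * FROM registrations ORDER BY state, model_year;

-- Retrieve matching vehicles by state and model year.
SELECT v.*
FROM vehicles_sorted v
JOIN registrations_sorted r
  ON v.state = r.state
 AND v.model_year = r.model_year;
```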
Day 38: Optimising Range Queries
Goal: Count the total number of vehicles within a specific electric_range.
Details:
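One way this might look, assuming a vehicles table with an electric_range column and example bounds of 100 to 200 miles. Ordering the data by the range column lets DuckDB skip row groups whose min/max fall entirely outside the filter.

```sql
-- Store the data ordered by the column used in the range predicate.
CREATE TABLE vehicles_by_range AS
SELECT * FROM vehicles ORDER BY electric_range;

-- Count rows inside the range (bounds are illustrative).
SELECT COUNT(*) AS vehicles_in_range
FROM vehicles_by_range
WHERE electric_range BETWEEN 100 AND 200;
```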
Day 39: Temporary Tables in DuckDB
Goal: Calculate the average session time for each user from web log data.
Details:
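A hedged sketch of the temporary-table approach, assuming a web_logs table with user_id, session_id, and event_time columns. The per-session durations are computed once, cached in a temporary table, and then reused for the per-user average (and any follow-up queries).

```sql
-- Cache intermediate per-session durations in a temporary table.
CREATE TEMP TABLE session_durations AS
SELECT user_id,
       session_id,
       date_diff('second', MIN(event_time), MAX(event_time)) AS session_seconds
FROM web_logs
GROUP BY user_id, session_id;

-- Reuse the cached results for the final aggregation.
SELECT user_id,
       AVG(session_seconds) AS avg_session_seconds
FROM session_durations
GROUP BY user_id;
```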
Day 40: Querying CSV vs. Parquet
Goal: Calculate the total retail sales grouped by item type for the year 2020.
Details:
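Roughly, the comparison boils down to running the same query against each file format and timing it. The file names and columns (item_type, retail_sales, year) are assumptions rather than the exact files used in the post.

```sql
-- CSV source: DuckDB infers the schema while scanning the whole text file.
SELECT item_type, SUM(retail_sales) AS total_retail_sales
FROM read_csv_auto('retail_sales.csv')
WHERE "year" = 2020
GROUP BY item_type;

-- Parquet source: same query, but only the needed columns are read.
SELECT item_type, SUM(retail_sales) AS total_retail_sales
FROM read_parquet('retail_sales.parquet')
WHERE "year" = 2020
GROUP BY item_type;
```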
Day 41: Querying JSON vs. Parquet
Goal: Calculate the total retail sales grouped by item type for the year 2020.
Details:
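The JSON side of the comparison might look like the sketch below, again with assumed file and column names; the Parquet query is the same as in Day 40. DuckDB's json extension handles the parsing.

```sql
-- Enable JSON reading (bundled with DuckDB).
INSTALL json;
LOAD json;

-- Same aggregation, but every row must be parsed from JSON text.
SELECT item_type, SUM(retail_sales) AS total_retail_sales
FROM read_json_auto('retail_sales.json')
WHERE "year" = 2020
GROUP BY item_type;
```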
Day 42: Varchar vs. Timestamp
Goal: Calculate the average day value from a date column in a crime dataset.
Details:
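A small sketch of the difference, assuming a hypothetical crimes table with both a VARCHAR column (incident_date_str, ISO-formatted) and a proper TIMESTAMP column (incident_date). The string version forces a cast on every row before the day can be extracted; the typed column avoids that work entirely.

```sql
-- VARCHAR column: parse the string on every row, then extract the day.
SELECT AVG(EXTRACT(day FROM CAST(incident_date_str AS TIMESTAMP))) AS avg_day
FROM crimes;

-- TIMESTAMP column: extract the day directly, no per-row parsing.
SELECT AVG(EXTRACT(day FROM incident_date)) AS avg_day
FROM crimes;
```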
Conclusion
In this sixth week of our "100 Days of SQL Optimisation" journey, we focused on optimising performance with DuckDB while leveraging efficient data structures and query techniques:
Columnar Storage for Speed: Compared PostgreSQL with DuckDB and demonstrated how columnar storage accelerates aggregations by reading only the relevant columns and keeping them in memory.
Sorted Data for Joins and Range Queries: Pre-sorting data before joins and queries reduced execution time and memory usage significantly.
Temporary Tables: Used temporary tables to cache intermediate results, improving performance for repetitive calculations.
Data Format Impact: Highlighted the efficiency of Parquet over CSV and JSON formats due to its smaller size and faster querying capabilities.
Optimal Data Types: Showed the importance of using TIMESTAMP instead of VARCHAR for date operations to reduce memory usage and processing overhead.
These optimisations showcased how leveraging in-memory execution, efficient storage formats, and optimal data structures can deliver significant performance improvements in modern analytical databases like DuckDB.
If you're interested in learning more practical SQL optimisations, follow along with our 100-day journey. Subscribe to stay updated; each day, you'll gain new insights into making your SQL queries more powerful and efficient!
If you missed the Week 5 recap, you can catch up by checking it out here:
If you’re interested in using DuckDB, the dlt library is a fantastic tool for effortlessly ingesting datasets into DuckDB. Check out our post to learn more about how dlt simplifies data importing from various sources.
[1] https://duckdb.org/why_duckdb.html