Certainly telling us Polars believers what we already know, but this was a great breakdown. Pandas has certain functions that sometimes beat it out (date times come to mind), but I honestly can’t imagine using Pandas ever again.
Hi, thanks for the great work. Is there a plan to compare Polars to PySpark?
Yeah, for sure. I’m expanding it to DuckDB and PySpark and will share the outcome soon. Would you like to see the benchmarking using the same size of data (around 1GB)?
Yeah, 1GB sounds good as a starting point and will definitely give interesting results. Just a note though: Spark is really optimized for much larger scales (100s of GB to TB). At 1GB, everything still fits comfortably in memory, so frameworks like DuckDB and Polars will likely shine since Spark’s overhead (scheduler, shuffle, serialization) will dominate.
For context, Spark usually targets partition sizes around 128MB, so 1GB ends up being only ~8 partitions – not really stressing the engine. It could be great to also include a larger dataset later (say 50–100GB) to highlight the scaling differences and where Spark starts to show its strengths.
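For a rough idea of how to check that, here's a minimal sketch (the Parquet path is a made-up placeholder, and it assumes the default `spark.sql.files.maxPartitionBytes` of 128MB):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-check").getOrCreate()

# Default split size Spark aims for when reading files (128MB = 134217728 bytes).
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# "benchmark_1gb.parquet" is a hypothetical path, not from the article.
df = spark.read.parquet("benchmark_1gb.parquet")

# For a ~1GB file this typically lands around 8 partitions,
# which is far too few to stress the scheduler or shuffle machinery.
print(df.rdd.getNumPartitions())
```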
Thank you so much for your feedback💐😊
Great post, Erfan! Just a quick tip: instead of `time.time()`, use `time.perf_counter()`. The `time.time()` method returns the current wall-clock time of your device, whereas `time.perf_counter()` gives a relative, monotonically increasing time meant for measuring intervals [Ref1]. From [Ref2]:
"It's measured (perf_counter) using a CPU counter and, as specified in the docs, should only be used to measure time intervals"
Ref1: https://blog.dailydoseofds.com/p/dont-use-timetime-to-measure-execution
Ref2: https://stackoverflow.com/questions/66036844/time-time-or-time-perf-counter-which-is-faster
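A minimal sketch of the swap (the `run_benchmark` function is just a stand-in for whatever workload is actually being timed):

```python
import time

def run_benchmark():
    # Placeholder workload: swap in the real Polars/Pandas query here.
    return sum(i * i for i in range(1_000_000))

# perf_counter() is monotonic and high-resolution, so the difference
# of two readings gives a reliable elapsed-time measurement.
start = time.perf_counter()
run_benchmark()
elapsed = time.perf_counter() - start
print(f"Elapsed: {elapsed:.4f} s")
```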
Thank you so much, Ro, for this great tip!