Certainly telling us Polars believers what we already know, but this was a great breakdown. Pandas has certain functions that sometimes beat it out (date times come to mind), but I honestly can’t imagine using Pandas ever again.
Hi, thanks for the great work. Is there a plan to compare Polars to PySpark?
Yeah, for sure. I’m expanding it to DuckDB and PySpark and will share the outcome soon. Would you like to see the benchmarking using the same size of data (around 1GB)?
Yeah, 1GB sounds good as a starting point and will definitely give interesting results. Just a note though: Spark is really optimized for much larger scales (100s of GB to TB). At 1GB, everything still fits comfortably in memory, so frameworks like DuckDB and Polars will likely shine since Spark’s overhead (scheduler, shuffle, serialization) will dominate.
For context, Spark usually targets partition sizes around 128MB, so 1GB ends up being only ~8 partitions – not really stressing the engine. It could be great to also include a larger dataset later (say 50–100GB) to highlight the scaling differences and where Spark starts to show its strengths.
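For a rough idea of how to check that, here's a minimal sketch (the Parquet path is a made-up placeholder, and it assumes the default `spark.sql.files.maxPartitionBytes` of 128MB):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-check").getOrCreate()

# Default split size Spark aims for when reading files (128MB = 134217728 bytes).
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# "benchmark_1gb.parquet" is a hypothetical path, not from the article.
df = spark.read.parquet("benchmark_1gb.parquet")

# For a ~1GB file this typically lands around 8 partitions,
# which is far too few to stress the scheduler or shuffle machinery.
print(df.rdd.getNumPartitions())
```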
Thank you so much for your feedback💐😊
Great post, Erfan! Just a quick tip: instead of `time.time()`, use `time.perf_counter()`. The `time.time()` method returns the current wall-clock time of your device, whereas `time.perf_counter()` gives a relative, monotonically increasing time meant for measuring intervals [Ref1]. From [Ref2]:
"It's measured (perf_counter) using a CPU counter and, as specified in the docs, should only be used to measure time intervals"
Ref1: https://blog.dailydoseofds.com/p/dont-use-timetime-to-measure-execution
Ref2: https://stackoverflow.com/questions/66036844/time-time-or-time-perf-counter-which-is-faster
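A minimal sketch of the swap (the `run_benchmark` function is just a stand-in for whatever workload is actually being timed):

```python
import time

def run_benchmark():
    # Placeholder workload: swap in the real Polars/Pandas query here.
    return sum(i * i for i in range(1_000_000))

# perf_counter() is monotonic and high-resolution, so the difference
# of two readings gives a reliable elapsed-time measurement.
start = time.perf_counter()
run_benchmark()
elapsed = time.perf_counter() - start
print(f"Elapsed: {elapsed:.4f} s")
```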
Thank you so much, Ro, for this great tip!