Created
November 1, 2021 20:54
A benchmark with Polars.
polars (0.16.16) 4.9s
pandas (1.5.3) 28.2s
pandas (2.0.0rc1) throws an error: <class 'numpy.intc'>
My code for the comparison is at https://github.com/wgong/py4kids/blob/master/lesson-14.6-polars/polars-cookbook/cookbook/pandas_vs_polars.py#L532
I would not have expected .transform() to be slower than .join()!
But yeah, I'll also poke around some more here, but thanks for the reply!
Ah you're right, thanks! This is a bit more tricky than I thought.
I overlooked that pandas.DataFrame.transform returns a DataFrame with the same dimensions as the input DataFrame, so your original code avoids having to do a join, while my first revision above misses that.
Here's another revision that uses join instead of assign. This fixes these issues and is about an order of magnitude faster than the original! I added a compare method to make sure the new code and the original code have identical outputs!
I'm getting these benchmarks for the full pipeline (set types, sessionize, add features, and remove bots) in Polars and Pandas (after loading both into a df):
So I agree with you that Polars is almost an order of magnitude faster here, but at least it's not two orders! 😄
(Also minor note, not important for the benchmark: the sessionize default thresholds are set differently for Polars vs Pandas.)
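For anyone following along, the transform-versus-join point above can be sketched on toy data. This is only a minimal illustration, not the code from the gist or from either revision: the frame, the column names (user, value, user_total) and the sum aggregation are all made up for the example. It shows that a groupby transform gives a result already aligned with the input rows, while the join route aggregates to one row per group and then joins back on the key, and that both end up with the same frame.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative, not taken from the gist.
df = pd.DataFrame({
    "user": ["a", "a", "b", "b", "b"],
    "value": [1, 2, 3, 4, 5],
})

# Approach 1: groupby + transform returns a result aligned with the input rows,
# so it can be assigned directly as a new column without a join.
via_transform = df.assign(
    user_total=df.groupby("user")["value"].transform("sum")
)

# Approach 2: aggregate to one row per group, then join it back on the key.
totals = df.groupby("user")["value"].sum().rename("user_total")
via_join = df.join(totals, on="user")

# Sanity check in the spirit of the compare method mentioned above:
# both approaches should produce identical frames.
pd.testing.assert_frame_equal(via_transform, via_join)
```

Which of the two is faster will depend on the data and the pandas version, so it is worth timing both on the real pipeline rather than relying on this sketch.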