@hannes
Created March 4, 2022 10:30
import time
import statistics
import tempfile
import os

import pandas as pd
import numpy as np

test_polars = True
test_duckdb = True
test_arrow = True
test_pandas = True


def timeit(f):
    # run the function twice and report the median wall-clock time
    times = []
    for n in range(1, 3):
        start = time.time()
        f()
        times.append(time.time() - start)
    return statistics.median(times)


def experiment(n_rows, n_groups):
    # two grouping columns (g1 x g2 yields n_groups combinations) and one data column
    g1 = np.repeat(np.arange(0, int(n_groups / 2)), int(n_rows / (n_groups / 2)))
    g2 = np.tile(np.arange(0, 2), int(n_rows / 2))
    d = np.random.randint(0, n_rows, n_rows)
    df = pd.DataFrame.from_dict({'g1': g1, 'g2': g2, 'd': d})
    # shuffle the rows so groups are not laid out contiguously
    df = df.reindex(np.random.permutation(df.index))
    assert len(df[['g1', 'g2']].drop_duplicates()) == n_groups

    if test_pandas:
        time_pandas = timeit(lambda: df.groupby(['g1', 'g2']).agg(
            s=('d', 'sum'), ct=('d', 'count')).head(1))
        print("%d\t%d\tpandas\t%0.3f" % (n_rows, n_groups, time_pandas))

    if test_polars:
        import polars as pl
        df_polars = pl.DataFrame(df)
        time_polars = timeit(lambda: df_polars.groupby(['g1', 'g2']).agg(
            [
                pl.sum('d'),
                pl.count(),
            ]
        ).head(1))
        print("%d\t%d\tpolars\t%0.3f" % (n_rows, n_groups, time_polars))

    if test_arrow:
        import pyarrow as pa
        df_arrow = pa.Table.from_pandas(df)
        time_arrow = timeit(lambda: df_arrow.group_by(['g1', 'g2']).aggregate(
            [("d", "sum"), ("d", "count")]).take([1]))
        print("%d\t%d\tarrow\t%0.3f" % (n_rows, n_groups, time_arrow))

    if test_duckdb:
        import duckdb as dd
        con = dd.connect()
        time_duckdb = timeit(lambda: con.from_df(df
            ).aggregate('sum(d), count(*)', 'g1, g2').limit(1).execute())
        print("%d\t%d\tduckdb\t%0.3f" % (n_rows, n_groups, time_duckdb))


for n_rows in 1000000, 10000000, 100000000:
    for n_groups in 1000, n_rows / 1000, n_rows / 100, n_rows / 10, n_rows:
        experiment(n_rows, n_groups)
@jorgecarleitao

fyi, in line 58, shouldn't con.from_df(df) be outside the timeit, so that the pandas -> dd conversion is done outside of the timing? It seems more consistent with the other engines (not sure what the cost of pandas -> dd is, but it could be O(N)).
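A minimal sketch of that variant, hoisting the pandas -> duckdb registration out of the timed lambda (the rel name is just illustrative, not from the original script):

    rel = con.from_df(df)   # register the pandas frame once, outside the timing
    time_duckdb = timeit(lambda: rel.aggregate('sum(d), count(*)', 'g1, g2').limit(1).execute())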

@hannes

hannes commented Mar 8, 2022

Good point, but the pandas registration in DuckDB is O(1), so it should matter little.
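A quick way to sanity-check that claim with the same helpers (a sketch, not part of the original gist): time just the registration step on its own.

    # reuses con, df and timeit from the script above; only the registration is timed
    reg_time = timeit(lambda: con.from_df(df))
    print("registration only\t%0.6f" % reg_time)

If the registration really is O(1), this number should stay flat as n_rows grows.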

@bwlewis

bwlewis commented Mar 8, 2022

Hi Hannes, impressive! I can come close to exactly reproducing the Python results you show, but I am struggling to reproduce the results in R. What am I doing wrong? See quick notes here:

https://github.com/bwlewis/duckdb_and_r/tree/main/group_by

best regards,

bryan

@hannes

hannes commented Mar 9, 2022

@bwlewis the reason this is not as fast in R is that we don't parallelise R data frame scans yet. I have a branch in the works that does this though.

@alexander-beedie

alexander-beedie commented Jul 26, 2022

Just started evaluating duckdb and vaex for out-of-core usage, and am having a good time - very nice work! :)

However, after reading the aggregate hashtable article I was curious about this benchmarking script as it doesn't show the same performance characteristics that we're observing (specifically vs polars, which I've made a few code contributions to recently).

There are actually two significant problems with the benchmarking script that mean it isn't really comparing like-for-like, and I think I may have found a performance-related bug in duckdb with 64-bit values (update: potential bug logged as duckdb/duckdb#4208).

Benchmark issues

  • Unnecessary intermediate result materialization: in the script above duckdb doesn't materialize any intermediate results, and there's actually no reason for polars to either; just make the frame lazy (the preferred mode of execution) and collect the result:

     # initialization - just add 'lazy()'
     df_polars = pl.DataFrame(df).lazy()
    
     # everything else is the same, you just collect
     # the result at the end (like duckdb 'execute()')
     .head(1).collect() 
  • Column dtypes aren't declared: it seems that duckdb is able to use the smallest necessary integer type for the given data, irrespective of the pandas column dtype. The test values all fit in 32 bits, but polars initializes 64-bit integer columns, as per the implicit pandas dtype, which causes it to use more memory and run slower than necessary. Setting a more accurate dtype causes polars to run at essentially the same speed as duckdb (a bit faster in some regimes, a bit slower in others, but overall very similar).

    One-line fix:

    df = pd.DataFrame.from_dict({'g1':g1, 'g2':g2, 'd':d}, dtype=np.int32)  # << explicitly set dtype(s)

    (Note: the g2 col is only ever 1 or 0, but for simplicity's sake am just setting all to 32-bit ;) A combined sketch applying both fixes is shown right after this list.
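A minimal combined sketch applying both fixes (lazy execution plus explicit 32-bit dtypes), reusing the g1/g2/d arrays and the timeit helper from the script above:

    # explicit 32-bit dtypes, as suggested above
    df = pd.DataFrame.from_dict({'g1': g1, 'g2': g2, 'd': d}, dtype=np.int32)
    df = df.reindex(np.random.permutation(df.index))

    # lazy frame: nothing is materialized until collect()
    df_polars = pl.DataFrame(df).lazy()
    time_polars = timeit(lambda: df_polars.groupby(['g1', 'g2']).agg(
        [pl.sum('d'), pl.count()]
    ).head(1).collect())   # collect() triggers execution, like duckdb's execute()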

Updated benchmark results

As per the article, the original results came from a machine with an M1 Max, which (lucky me) I am also running, so the following results should be very comparable to the original (I've omitted pandas, because it remains every bit as awful as you'd expect ;)


(Note: used latest package versions - duckdb 0.4.0, polars 0.13.57)

rows         groups        pyarrow  duckdb  polars
10,000,000   1,000          0.0590  0.0140  0.0310
10,000,000   10,000         0.0660  0.0460  0.0340
10,000,000   100,000        0.0910  0.0860  0.0400
10,000,000   1,000,000      0.2130  0.1510  0.1210
10,000,000   10,000,000     0.4840  0.2080  0.2220
100,000,000  1,000          0.6160  0.1360  0.3540
100,000,000  100,000        0.9110  0.5820  0.4250
100,000,000  1,000,000      1.6040  0.9080  0.9620
100,000,000  10,000,000     3.6330  1.6710  1.7290
100,000,000  100,000,000    6.0120  2.8590  3.7870

Summary

Given the impact of the two minor fixes above, I'm not really sure that the article's conclusion...

we can see how DuckDB consistently outperforms the other systems, with the single-threaded Pandas being slowest, Polars and Arrow being generally similar

...entirely holds when the benchmark is updated to better compare like-for-like (aside from the final 100,000,000 / 100,000,000 case, where duckdb does indeed definitively lead 🚀)

Still, it's great stuff, and I'm looking forward to using it more :)
