```python
import time
import statistics
import tempfile
import os
import pandas as pd
import numpy as np

test_polars = True
test_duckdb = True
test_arrow = True
test_pandas = True


def timeit(f):
    # run twice and report the median wall-clock time
    times = []
    for n in range(1, 3):
        start = time.time()
        f()
        times.append(time.time() - start)
    return statistics.median(times)


def experiment(n_rows, n_groups):
    g1 = np.repeat(np.arange(0, int(n_groups / 2)), int(n_rows / (n_groups / 2)))
    g2 = np.tile(np.arange(0, 2), int(n_rows / 2))
    d = np.random.randint(0, n_rows, n_rows)
    df = pd.DataFrame.from_dict({'g1': g1, 'g2': g2, 'd': d})
    df = df.reindex(np.random.permutation(df.index))
    assert len(df[['g1', 'g2']].drop_duplicates()) == n_groups

    if test_pandas:
        time_pandas = timeit(lambda: df.groupby(['g1', 'g2']).agg(
            s=('d', 'sum'), ct=('d', 'count')).head(1))
        print("%d\t%d\tpandas\t%0.3f" % (n_rows, n_groups, time_pandas))

    if test_polars:
        import polars as pl
        df_polars = pl.DataFrame(df)
        time_polars = timeit(lambda: df_polars.groupby(['g1', 'g2']).agg(
            [
                pl.sum('d'),
                pl.count(),
            ]
        ).head(1))
        print("%d\t%d\tpolars\t%0.3f" % (n_rows, n_groups, time_polars))

    if test_arrow:
        import pyarrow as pa
        df_arrow = pa.Table.from_pandas(df)
        time_arrow = timeit(lambda: df_arrow.group_by(['g1', 'g2']).aggregate(
            [("d", "sum"), ("d", "count")]).take([1]))
        print("%d\t%d\tarrow\t%0.3f" % (n_rows, n_groups, time_arrow))

    if test_duckdb:
        import duckdb as dd
        con = dd.connect()
        time_duckdb = timeit(lambda: con.from_df(df).aggregate(
            'sum(d), count(*)', 'g1, g2').limit(1).execute())
        print("%d\t%d\tduckdb\t%0.3f" % (n_rows, n_groups, time_duckdb))


for n_rows in 1000000, 10000000, 100000000:
    for n_groups in 1000, n_rows // 1000, n_rows // 100, n_rows // 10, n_rows:
        experiment(n_rows, n_groups)
```
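To make the group construction above concrete, here is a tiny illustration (with made-up sizes, not the benchmark's) of how `np.repeat` and `np.tile` combine to give exactly `n_groups` distinct `(g1, g2)` pairs:

```python
import numpy as np
import pandas as pd

# hypothetical small sizes for illustration only
n_rows, n_groups = 8, 4

# g1 holds each of n_groups/2 values in one contiguous run
g1 = np.repeat(np.arange(0, n_groups // 2), n_rows // (n_groups // 2))  # [0 0 0 0 1 1 1 1]
# g2 alternates 0/1 across every row
g2 = np.tile(np.arange(0, 2), n_rows // 2)                              # [0 1 0 1 0 1 0 1]

df = pd.DataFrame({'g1': g1, 'g2': g2})
# every combination of a g1 run with g2's alternation is one group
print(len(df[['g1', 'g2']].drop_duplicates()))  # → 4
```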
Good point, but the pandas registration in DuckDB is O(1) so it should matter little
Hi Hannes, impressive! I can almost exactly reproduce the Python results you show, but I am struggling to reproduce the results in R. What am I doing wrong? See quick notes here:
https://github.com/bwlewis/duckdb_and_r/tree/main/group_by
best regards,
bryan
@bwlewis the reason this is not as fast in R yet is that we don't yet parallelise R data frame scans. I have a branch in the works that does this, though.
Just started evaluating `duckdb` and `vaex` for out-of-core usage, and am having a good time - very nice work! :)

However, after reading the aggregate hashtable article I was curious about this benchmarking script, as it doesn't show the same performance characteristics that we're observing (specifically vs `polars`, to which I've recently made a few code contributions).

There are actually two significant problems with the benchmarking script that mean it isn't really comparing like-for-like, and I think I may have found a performance-related bug in `duckdb` with 64-bit values (update: potential bug logged as duckdb/duckdb#4208).
Benchmark issues

- Unnecessary intermediate result materialization: in the script above `duckdb` doesn't materialize any intermediate results, and there's actually no reason for `polars` to either - just make the frame lazy (the preferred mode of execution) and collect the result:

  ```python
  # initialization - just add 'lazy()'
  df_polars = pl.DataFrame(df).lazy()

  # everything else is the same, you just collect
  # the result at the end (like duckdb 'execute()')
  .head(1).collect()
  ```

- Column dtypes aren't declared: it seems that `duckdb` is able to use the smallest necessary integer type for the given data, irrespective of the pandas column dtype. The test values are all <= 32-bit, but `polars` inits 64-bit int columns, as per the implicit pandas dtype, which causes it to use more memory and run slower than necessary. Setting a more accurate dtype causes `polars` to run at essentially the same speed as `duckdb` (a bit faster in some regimes, a bit slower in others, but overall very similar). One-line fix:

  ```python
  df = pd.DataFrame.from_dict({'g1': g1, 'g2': g2, 'd': d}, dtype=np.int32)  # << explicitly set dtype(s)
  ```

  (Note: the `g2` col is only ever 1 or 0, but for simplicity's sake am just setting all to 32-bit ;)
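As a quick, runnable sanity check of the dtype point (a sketch, not from the thread): declaring `int32` halves the per-row memory that pandas would otherwise allocate for these values:

```python
import numpy as np
import pandas as pd

n_rows = 1_000_000
d = np.random.randint(0, n_rows, n_rows)

# default construction keeps the array's inferred (typically 64-bit) integer dtype
df_default = pd.DataFrame.from_dict({'d': d})

# explicit 32-bit dtype, as in the one-line fix above
df_int32 = pd.DataFrame.from_dict({'d': d}, dtype=np.int32)

print(df_default['d'].dtype, df_default.memory_usage()['d'])  # usually int64, 8 bytes per row
print(df_int32['d'].dtype, df_int32.memory_usage()['d'])      # int32, 4 bytes per row
```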
Updated benchmark results

As per the article, the original results came from a machine with an M1 Max, which (lucky me) I am also running, so the following results should be very comparable to the original (I've omitted `pandas`, because it remains every bit as awful as you'd expect ;)

(Note: used latest package versions - duckdb 0.4.0, polars 0.13.57)
| rows | groups | pyarrow | duckdb | polars |
|---|---|---|---|---|
| 10,000,000 | 1,000 | 0.0590 | 0.0140 | 0.0310 |
| 10,000,000 | 10,000 | 0.0660 | 0.0460 | 0.0340 |
| 10,000,000 | 100,000 | 0.0910 | 0.0860 | 0.0400 |
| 10,000,000 | 1,000,000 | 0.2130 | 0.1510 | 0.1210 |
| 10,000,000 | 10,000,000 | 0.4840 | 0.2080 | 0.2220 |
| 100,000,000 | 1,000 | 0.6160 | 0.1360 | 0.3540 |
| 100,000,000 | 100,000 | 0.9110 | 0.5820 | 0.4250 |
| 100,000,000 | 1,000,000 | 1.6040 | 0.9080 | 0.9620 |
| 100,000,000 | 10,000,000 | 3.6330 | 1.6710 | 1.7290 |
| 100,000,000 | 100,000,000 | 6.0120 | 2.8590 | 3.7870 |
Summary

Given the impact of the two minor fixes above, I'm not really sure that the article's conclusion...

> we can see how DuckDB consistently outperforms the other systems, with the single-threaded Pandas being slowest, Polars and Arrow being generally similar

...entirely holds when the benchmark is updated to better compare like-for-like (aside from the final 100,000,000 / 100,000,000 case, where `duckdb` does indeed definitively lead 🚀).

Still, it's great stuff, and I'm looking forward to using it more :)
fyi, in the duckdb branch, shouldn't `con.from_df(df)` be outside the `timeit`, so that the `pandas -> dd` conversion is done outside the timed region? It seems more consistent with the other engines (not sure what the cost of `pandas -> dd` is, but it could be `O(N)`).