Created July 27, 2015 17:04
Quick comparison between `pandas` and `dask` groupby functionality.
If I understand this correctly, you are comparing a pandas groupby to dask converting from pandas then doing a groupby.
Is this really a fair test?
Nice comparison.
If your data fits in memory then you should almost always just use Pandas. Full groupby-applies like `df.groupby(...).apply(func)` are hard to do in parallel and require a full dataset shuffle. Dask (or any parallel library) should perform about as well on groupby-reductions with standard reductions like `df.groupby(...).col.mean()`.
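To illustrate why a standard reduction such as a mean parallelizes well, here is a minimal plain-Python sketch. It is not Dask's actual implementation; the helper names `partial_sums` and `combine_means` and the toy partitions are made up for illustration. Each partition is reduced on its own to small per-key `(sum, count)` pairs, and only those pairs are combined. An arbitrary `apply(func)` has no such shortcut, because `func` needs every row of a group in one place.

```python
from collections import defaultdict

def partial_sums(rows):
    """Per-partition step: accumulate [sum, count] for each key."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in rows:
        acc[key][0] += value
        acc[key][1] += 1
    return acc

def combine_means(partials):
    """Merge the small per-partition results and finish the mean."""
    total = defaultdict(lambda: [0.0, 0])
    for part in partials:
        for key, (s, n) in part.items():
            total[key][0] += s
            total[key][1] += n
    return {key: s / n for key, (s, n) in total.items()}

# Two hypothetical partitions of (key, value) rows; each could be
# processed by a different worker without seeing the other.
partitions = [
    [("a", 1.0), ("b", 2.0), ("a", 3.0)],
    [("b", 4.0), ("a", 5.0)],
]

means = combine_means(partial_sums(p) for p in partitions)
print(means)
```

Only the per-key pairs move between partitions here, never the raw rows, which is roughly why reductions avoid the full shuffle that a general groupby-apply needs.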