@darribas
Created July 27, 2015 17:04
Quick comparison between `pandas` and `dask` groupby functionality.
@mrocklin

mrocklin commented Aug 3, 2015

Nice comparison.

If your data fits in memory, you should almost always just use Pandas. Full groupby-applies like `df.groupby(...).apply(func)` are hard to do in parallel and require a full dataset shuffle. Dask (or any parallel library) should perform about as well on groupby-reductions that use standard reductions, like `df.groupby(...).col.mean()`.
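
To illustrate the difference, here is a rough pure-Python sketch. It is not how Dask is implemented, and the names `partial_sums`, `combine_means`, `shuffle_by_key` and the toy `partitions` data are all made up for the example. A mean only needs a small (sum, count) summary per partition, which can be merged cheaply. A general apply needs every row of a group gathered in one place first.

```python
from collections import defaultdict


def partial_sums(partition):
    """Per-partition step: accumulate [sum, count] for each key."""
    acc = defaultdict(lambda: [0.0, 0])
    for key, value in partition:
        acc[key][0] += value
        acc[key][1] += 1
    return acc


def combine_means(partials):
    """Merge the small per-partition summaries, then finish the mean."""
    total = defaultdict(lambda: [0.0, 0])
    for acc in partials:
        for key, (s, n) in acc.items():
            total[key][0] += s
            total[key][1] += n
    return {key: s / n for key, (s, n) in total.items()}


def shuffle_by_key(partitions):
    """What an arbitrary apply needs: all rows of each group moved together."""
    groups = defaultdict(list)
    for partition in partitions:
        for key, value in partition:
            groups[key].append(value)
    return dict(groups)


# Two hypothetical partitions of (key, value) rows.
partitions = [
    [("a", 1.0), ("b", 2.0), ("a", 3.0)],
    [("b", 4.0), ("a", 5.0)],
]

# Reduction: only tiny summaries cross partition boundaries.
means = combine_means(partial_sums(p) for p in partitions)

# Apply: every row has to be regrouped before func can run on a group.
groups = shuffle_by_key(partitions)
```

In the reduction path each partition could be summarised independently, so the work parallelises well. In the apply path the amount of data moved grows with the full dataset, which is the shuffle cost mentioned above.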

@spott

spott commented Feb 8, 2018

If I understand this correctly, you are comparing a pandas groupby against dask first converting from pandas and then doing a groupby.

Is this really a fair test?
