@mrocklin
Last active June 16, 2016 20:09

This is in response to https://peadarcoyle.wordpress.com/2016/03/02/a-map-of-the-pydata-stack/ . It started off as an e-mail but I decided to keep things public.

First, let me just say that I think that reproducing the ML decision graph is a really cool idea. I suspect that it'll get hairy for a while as people speak up. I'll speak up below, but I'm obviously invested in some of these projects, so you should probably take everything I say with a grain of salt. OK, here we go:

To me, scientific data overlaps with tabular and array data. It's not clear that the choice between {array, dataframe, scientific} is easy to make: "Well, I have scientific array data, what do I choose now?"

The same issue exists somewhat for time series: "Well, I have tabular time series, which branch do I take?" I recommend removing Castra from the map; it's not a very serious project.

I can envision a separate map for storage technologies (hdf5, netcdf, bcolz, castra, csv, parquet, ...). It's odd to have computational systems like xarray side-by-side with storage systems like bcolz.

For xarray I think the main distinction is labeled axes. "Do my axes have labels? (e.g. time, latitude, longitude)" is a good question that branches numpy->pandas in 1d and numpy->xarray in nd.

<particularly biased> In the "My data is distributed" section I claim that various dask projects could fit in each one of those boxes. Dask.array handles distributed arrays just fine (see recent blogpost), and the serialization behind SFrame is not particularly more or less fancy than the nice formats that dump to numpy or pandas and are accessible from Bolt or dask.{array,dataframe}. Generally SFrame excels on machine learning algorithms that were specifically co-designed along with the data structure. Personally I avoid using the term "dask" to refer to any of the dask collections, as I think is done here for dask.dataframe. Different users use the term "a dask" to refer variously to a graph, an array, a bag, or a dataframe. This gets confusing. </particularly biased>

One way to structure this around computational systems would be to ask the following three questions:

  1. How is your data laid out {array, tabular, text, nested}?
  2. What is the scale of your data?
  3. How do you label/index entries in your data?

I suspect that answers to these three questions would be enough to narrow things down to a single choice, or at worst a very small number of choices.

I think that there is a separate set of questions for storage systems, though that's probably a separate conversation.
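To make the idea concrete, here is a rough sketch of how answers to the three questions might map onto tools. The function name `suggest_tools`, its parameters, and the specific mapping are all my own illustrative assumptions pieced together from the points above; this is not a definitive decision tree, and real cases will have more nuance.

```python
from typing import List


def suggest_tools(layout: str, fits_in_ram: bool, labeled: bool) -> List[str]:
    """Hypothetical mapping from the three questions to candidate tools.

    layout      -- one of "array", "tabular", "text", "nested"
    fits_in_ram -- a crude stand-in for "what is the scale of your data?"
    labeled     -- whether entries/axes carry labels (e.g. time, latitude)
    """
    if layout == "array":
        if not fits_in_ram:
            return ["dask.array"]
        # Labeled n-d arrays branch to xarray; per the note above, the
        # 1-d labeled case would branch to pandas instead.
        return ["xarray"] if labeled else ["numpy"]
    if layout == "tabular":
        if not fits_in_ram:
            return ["dask.dataframe"]
        # SQL databases are also a reasonable answer in the in-memory case.
        return ["pandas", "SQL database"]
    if layout in ("text", "nested"):
        # Assumption: dask.bag for larger data, plain Python otherwise.
        return ["dask.bag"] if not fits_in_ram else ["plain Python collections"]
    raise ValueError(f"unknown layout: {layout!r}")


print(suggest_tools("array", fits_in_ram=True, labeled=True))
print(suggest_tools("tabular", fits_in_ram=False, labeled=False))
```

The point is only that each combination of answers lands on one or two candidates, which is what a map like this needs.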

SQL could easily be used in the fits-in-ram tabular case as well. Databases are good things.
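As a small illustration of SQL in the fits-in-ram tabular case, here is a sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `readings`, its columns, and the sample rows are made up for the example; any SQL database would do.

```python
import sqlite3

# An in-memory SQLite database: nothing touches disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, temp REAL)")

# Hypothetical sample data, small enough to fit in RAM many times over.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 10.0), ("a", 14.0), ("b", 20.0)],
)

# A group-by aggregation, the kind of thing one might otherwise do in pandas.
rows = conn.execute(
    "SELECT station, AVG(temp) FROM readings GROUP BY station ORDER BY station"
).fetchall()
conn.close()

print(rows)  # [('a', 12.0), ('b', 20.0)]
```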

@mrocklin

mrocklin commented Mar 3, 2016

Comments welcome
