mrocklin/pydata-map-discussion.md

## pydata-map-discussion.md

      
This is in response to https://peadarcoyle.wordpress.com/2016/03/02/a-map-of-the-pydata-stack/ .  It started off as an e-mail but I decided to keep things public.
First, let me just say that I think that reproducing the ML decision graph is a really cool idea.  I suspect that it'll get hairy for a while as people speak up.  I'll speak up below, but I'm obviously invested in some of these projects, so you should probably take everything I say with a grain of salt.  OK, here we go:
To me, scientific data overlaps with tabular and array data.  It's not clear that the choice between {array, dataframe, scientific} is easy to make: "Well, I have scientific array data, what do I choose now?"
The same issue exists to some extent for time series: "Well, I have tabular time series, which branch do I take?"  I recommend removing Castra from the map; it's not a very serious project.
I can envision a separate map for storage technologies (hdf5, netcdf, bcolz, castra, csv, parquet, ...).  It's odd to have computational systems like xarray side by side with storage systems like bcolz.
For xarray I think the main distinction is labeled axes.  "Do my axes have labels (e.g. time, latitude, longitude)?" is a good question that branches between numpy->pandas in 1d and numpy->xarray in nd.
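
To make the labeled-axes question concrete, here is a minimal sketch of the 1d branch (numpy->pandas).  The temperatures and dates are made up for illustration; in nd, an xarray `DataArray` plays the analogous role, which is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures; values and dates are invented for this sketch.
temps = np.array([11.5, 12.0, 9.8, 10.4])

# Unlabeled axis: a position is the only way to address an element.
third_day = temps[2]

# Labeled axis: the same values with a time index attached.
days = pd.date_range("2016-03-01", periods=4, freq="D")
labeled = pd.Series(temps, index=days, name="temperature")

# Selection by label instead of by position.
same_day = labeled.loc[pd.Timestamp("2016-03-03")]
```

If the axis has no meaningful labels, the plain array is all you need; once you find yourself keeping a separate list of dates next to the array, that is the point where the branch to a labeled structure pays off.
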
<particularly biased> In the "My data is distributed" section I claim that various dask projects could fit in each one of those boxes.  Dask.array handles distributed arrays just fine (see recent blogpost), and the serialization behind SFrame is not particularly more or less fancy than the nice formats that dump to numpy or pandas and are accessible from Bolt or dask.{array,dataframe}.  Generally SFrame excels at machine learning algorithms that were specifically co-designed along with the data structure.  Personally I avoid using the term "dask" to refer to any of the dask collections, as I think is done here for dask.dataframe.  Different users use the term "a dask" to refer separately to a graph, an array, a bag, and a dataframe.  This gets confusing. </particularly biased>
One way to structure this around computational systems would be to ask the following three questions:

1. How is your data laid out {array, tabular, text, nested}?
2. What is the scale of your data?
3. How do you label/index entries in your data?

I suspect that answers to these three questions would be enough to narrow things down to a single choice or, at worst, a very small number of choices.
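
As a rough sketch of how the three questions could drive a lookup, something like the following would do.  The function name `candidate_tools` and the exact mapping are my own illustration, pieced together from the projects mentioned in this note; it is deliberately incomplete and is not a proposal for what the map should say.

```python
from typing import List


def candidate_tools(layout: str, fits_in_ram: bool, labeled: bool) -> List[str]:
    """Return candidate projects for one combination of answers.

    The mapping is an assumption made for illustration; combinations
    that this note does not discuss return an empty list.
    """
    if layout == "array":
        if not fits_in_ram:
            return ["dask.array"]
        if labeled:
            # numpy->pandas in 1d, numpy->xarray in nd
            return ["pandas (1d)", "xarray (nd)"]
        return ["numpy"]
    if layout == "tabular":
        if not fits_in_ram:
            return ["dask.dataframe"]
        # SQL is a reasonable answer in the fits-in-ram tabular case too.
        return ["pandas", "SQL"]
    # text and nested layouts are left unfilled in this sketch
    return []


print(candidate_tools("array", fits_in_ram=True, labeled=True))
```

The point is only that each triple of answers lands on a short list, which is what would make the map easy to follow.
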
I think that there is a separate set of questions for storage systems, though that's probably a separate conversation.
SQL could easily be used in the fits-in-ram tabular case as well.  Databases are good things.
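
For instance, a small in-memory table can be queried with nothing but the standard library's `sqlite3` module.  The table name, columns, and values below are invented for this sketch.

```python
import sqlite3

# An in-memory database: nothing touches disk, everything fits in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, temperature REAL)")

# Hypothetical rows, made up for illustration.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 10.0), ("a", 12.0), ("b", 7.0)],
)

# A group-by aggregation, the bread and butter of tabular work.
rows = conn.execute(
    "SELECT station, AVG(temperature) FROM readings "
    "GROUP BY station ORDER BY station"
).fetchall()
conn.close()
```
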