@ExpandingMan
Created July 14, 2021 00:12
plan for distributed tables in Julia

The Plan

The ultimate goal is to have a package or set of packages for large-scale distributed computing on tabular data in Julia that works "out of the box". We would like to be able to replace processes that run in e.g. Apache Spark or Python's dask.

Examples of things we'd like to be able to do:

  • Get table metadata for a table that is spread across a hundred CSV files in HDFS.
  • Join a 10^10 row table stored as parquet files in S3 buckets with a 100 row table created locally in memory, perform a groupby operation, and save the result as a new table of parquet files on S3.
  • Efficiently perform a trivially parallelizable row-wise operation on a 10^10 row table stored as parquet files in S3 buckets.
  • Perform simple machine learning at large scale, for example fitting a linear regression to a 10^10 row table stored in S3 buckets using e.g. online statistics (OnlineStats.jl).
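The aggregation pattern behind the last two bullets can be sketched with plain Distributed.jl: each worker reduces its chunk of the table to a small summary, and only the summaries travel back to be merged. This is a minimal illustration, not a proposed API; the in-memory `fake_chunks` stand in for lazy views of remote Parquet files.

```julia
using Distributed

# Stand-in for remote table chunks; in the real system each element would be
# a lazy view of a Parquet file on S3 fetched on demand.
fake_chunks = [rand(10) for _ in 1:4]

# Each worker computes a cheap partial summary (sum, count) over its chunk.
# With no extra workers added, pmap falls back to serial map, so this runs
# anywhere; with addprocs(n) the chunks are processed in parallel.
partials = pmap(fake_chunks) do chunk
    (sum(chunk), length(chunk))
end

# Merge the partial summaries into a global statistic.
total, n = reduce(((s1, n1), (s2, n2)) -> (s1 + s2, n1 + n2), partials)
global_mean = total / n
```

The same merge-of-summaries shape generalizes to anything with an associative combine step (variances, histograms, linear regression sufficient statistics), which is why online statistics fit the distributed-table setting so well.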

Loading Tables

A good first step might be to have easy and flexible ways of loading views of extremely large tables. Local workers should hold the table's complete metadata in memory and be able to easily fetch specified subsets of the table on demand.

Two things that would be good to have for this:

  1. Some sort of CacheArrays.jl package, which would provide an interface and some basic implementations of AbstractArray objects that can load specified ranges into memory from any source, and evict them again. For example, we should be able to have an array representing a file on S3 that automatically loads a range of bytes when indexed. Similarly, we should have a type which, when used in a Tables.jl-compatible table, automatically loads only the Parquet chunks needed, based on indexing.
  2. Packages for loading tables, such as CSV.jl and Parquet.jl, need improved functionality for loading only metadata where possible, and for producing lazy objects which represent pieces of tables that can be retrieved on demand. Parquet.jl, for example, already loads in chunks, which is what we want, but it is currently very difficult to have it do so from anything other than files on the local file system.
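A minimal sketch of the CacheArrays.jl idea from point 1, assuming a user-supplied `fetch_chunk` callback. All names here are hypothetical; a real implementation would have `fetch_chunk` issue an S3 ranged GET or read a Parquet row group, whereas this sketch reads from an in-memory vector.

```julia
# An AbstractVector that loads fixed-size chunks from a backing source on
# first access and caches them in memory.
struct CachedVector{T,F} <: AbstractVector{T}
    fetch_chunk::F            # hypothetical callback: chunk index -> Vector{T}
    chunksize::Int
    len::Int
    cache::Dict{Int,Vector{T}}
end

CachedVector{T}(fetch_chunk, chunksize, len) where {T} =
    CachedVector{T,typeof(fetch_chunk)}(fetch_chunk, chunksize, len, Dict{Int,Vector{T}}())

Base.size(v::CachedVector) = (v.len,)

function Base.getindex(v::CachedVector, i::Int)
    # Map the linear index to a (chunk, offset) pair and fetch the chunk
    # only if it is not already cached.
    chunk_id, offset = divrem(i - 1, v.chunksize)
    chunk = get!(() -> v.fetch_chunk(chunk_id + 1), v.cache, chunk_id + 1)
    chunk[offset + 1]
end

# "Delete specified ranges from memory": simply drop a cached chunk.
evict!(v::CachedVector, chunk_id::Int) = (delete!(v.cache, chunk_id); v)

# Usage: a 10-element vector backed by two chunks of 5.
data = collect(1.0:10.0)
v = CachedVector{Float64}(i -> data[(i - 1) * 5 + 1 : min(i * 5, length(data))],
                          5, length(data))
v[7]   # first access to chunk 2 triggers fetch_chunk(2); chunk 1 stays unloaded
```

Because the type subtypes AbstractVector, it composes with Tables.jl-style columnar tables for free; the open design question is the eviction policy (explicit `evict!` as above, versus an LRU bound on cached bytes).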

TODO:

  • Look at JuliaDB.jl implementation much more carefully.