@ExpandingMan
Created July 14, 2021 00:12
plan for distributed tables in Julia

The Plan

The ultimate goal is to have a package or set of packages for large-scale distributed computing on tabular data in Julia that works "out of the box". We would like to be able to replace processes that run in e.g. Apache Spark or Python's dask.

Examples of things we'd like to be able to do:

  • Get table metadata for a table that is spread across a hundred CSV files in HDFS.
  • Join a 10^10 row table stored as parquet files in S3 buckets with a 100 row table created locally in memory, perform a groupby operation, and save the result as a new table of parquet files on S3.
  • Efficiently perform a trivially parallelizable row-wise operation on a 10^10 row table stored as parquet files in S3 buckets.
  • Perform simple machine learning at large scale, for example fitting a linear regression to a 10^10 row table stored in S3 buckets using e.g. online statistics (OnlineStats.jl).
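The aggregation pattern behind the last two bullets can be sketched with plain Distributed.jl: each worker reduces its chunk of the table to a small summary, and only the summaries travel back to be merged. This is a minimal illustration, not a proposed API; the in-memory `fake_chunks` stand in for lazy views of remote Parquet files.

```julia
using Distributed

# Stand-in for remote table chunks; in the real system each element would be
# a lazy view of a Parquet file on S3 fetched on demand.
fake_chunks = [rand(10) for _ in 1:4]

# Each worker computes a cheap partial summary (sum, count) over its chunk.
# With no extra workers added, pmap falls back to serial map, so this runs
# anywhere; with addprocs(n) the chunks are processed in parallel.
partials = pmap(fake_chunks) do chunk
    (sum(chunk), length(chunk))
end

# Merge the partial summaries into a global statistic.
total, n = reduce(((s1, n1), (s2, n2)) -> (s1 + s2, n1 + n2), partials)
global_mean = total / n
```

The same merge-of-summaries shape generalizes to anything with an associative combine step (variances, histograms, linear regression sufficient statistics), which is why online statistics fit the distributed-table setting so well.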

Loading Tables

A good first step might be to have easy and flexible ways of loading views of extremely large tables. Local workers should hold the table's complete metadata in memory and be able to easily fetch specified subsets of the table on demand.

Two things that would be good to have for this:

  1. Some sort of CacheArrays.jl package, which would provide an interface and some basic implementations of AbstractArray objects that can load specified ranges into memory from any source, and evict them again. For example, we should be able to have an array representing a file on S3 that automatically loads a range of bytes when indexed. Similarly, we should have a type which, when used in a Tables.jl-compatible table, automatically loads only the Parquet chunks needed, based on indexing.
  2. Packages for loading tables, such as CSV.jl and Parquet.jl, need improved functionality for loading only metadata where possible, and for producing lazy objects which represent pieces of tables that can be retrieved on demand. Parquet.jl, for example, already loads in chunks, which is what we want, but it is currently very difficult to have it do so from anything other than files on the local file system.
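A minimal sketch of the CacheArrays.jl idea from point 1, assuming a user-supplied `fetch_chunk` callback. All names here are hypothetical; a real implementation would have `fetch_chunk` issue an S3 ranged GET or read a Parquet row group, whereas this sketch reads from an in-memory vector.

```julia
# An AbstractVector that loads fixed-size chunks from a backing source on
# first access and caches them in memory.
struct CachedVector{T,F} <: AbstractVector{T}
    fetch_chunk::F            # hypothetical callback: chunk index -> Vector{T}
    chunksize::Int
    len::Int
    cache::Dict{Int,Vector{T}}
end

CachedVector{T}(fetch_chunk, chunksize, len) where {T} =
    CachedVector{T,typeof(fetch_chunk)}(fetch_chunk, chunksize, len, Dict{Int,Vector{T}}())

Base.size(v::CachedVector) = (v.len,)

function Base.getindex(v::CachedVector, i::Int)
    # Map the linear index to a (chunk, offset) pair and fetch the chunk
    # only if it is not already cached.
    chunk_id, offset = divrem(i - 1, v.chunksize)
    chunk = get!(() -> v.fetch_chunk(chunk_id + 1), v.cache, chunk_id + 1)
    chunk[offset + 1]
end

# "Delete specified ranges from memory": simply drop a cached chunk.
evict!(v::CachedVector, chunk_id::Int) = (delete!(v.cache, chunk_id); v)

# Usage: a 10-element vector backed by two chunks of 5.
data = collect(1.0:10.0)
v = CachedVector{Float64}(i -> data[(i - 1) * 5 + 1 : min(i * 5, length(data))],
                          5, length(data))
v[7]   # first access to chunk 2 triggers fetch_chunk(2); chunk 1 stays unloaded
```

Because the type subtypes AbstractVector, it composes with Tables.jl-style columnar tables for free; the open design question is the eviction policy (explicit `evict!` as above, versus an LRU bound on cached bytes).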

TODO:

  • Look at JuliaDB.jl implementation much more carefully.