The Plan
The ultimate goal is to have a package, or set of packages, for large-scale distributed computing on tabular data in Julia that works "out of the box". We would like to be able to replace workloads that currently run in systems such as Apache Spark or Python's Dask.
Examples of things we'd like to be able to do:
- Get table metadata for a table that is spread across a hundred CSV files in HDFS.
- Join a 10^10-row table stored as Parquet files in S3 buckets with a 100-row table that we create locally in memory, perform a groupby operation, and save the result as a new table of Parquet files on S3 (a small sketch of this pipeline follows the list).
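To make the second example concrete, here is a minimal single-machine sketch of the join-then-groupby pipeline using DataFrames.jl. The point of the plan is that the same operations should work unchanged when the large table is a distributed table backed by Parquet files on S3 rather than an in-memory `DataFrame`; the column names and data below are made up purely for illustration.

```julia
using DataFrames

# Stand-in for the 10^10-row table (tiny here, of course); in the
# envisioned system this would be a distributed table backed by
# Parquet files on S3.
big = DataFrame(key = rand(1:100, 1_000), value = rand(1_000))

# The 100-row lookup table we create locally in memory.
small = DataFrame(key = 1:100, label = string.("group_", 1:100))

# Join, group, and aggregate -- conceptually the same pipeline we want
# to run against data on S3 and then write back out as Parquet files.
joined  = innerjoin(big, small, on = :key)
result  = combine(groupby(joined, :label), :value => sum => :total)
```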