This benchmark considers the case of reading a few thousand csv files (all using a common schema) into a single table. The data is about 2.4 million rows, and about 100 MB as parquet or 400 MB as a single uncompressed csv.
We compare access directly over the POSIX filesystem to S3-based access over a local network (i.e., the same server hosts both the MinIO bucket and the RStudio instance on which we run the tests).
We also compare these times against first serializing the collected csv files into a single file (parquet, csv, or compressed .csv.gz) and then reading that file.
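
As a rough illustration of the access patterns being compared, here is a minimal sketch in R. It is not the benchmark code itself: the tiny generated files, the directory name `csv_parts`, the helper `open_remote_csv()`, and the endpoint `localhost:9000` are all hypothetical stand-ins, and it assumes the readr (2.0 or later), arrow, and dplyr packages are installed.

```r
library(readr)
library(arrow)

# Hypothetical stand-in for the real data: three small csv files, one schema.
dir <- tempfile("csv_parts")
dir.create(dir)
for (i in 1:3) {
  part <- data.frame(id = seq_len(4) + 4L * (i - 1L), value = i * 1.5)
  write_csv(part, file.path(dir, sprintf("part-%02d.csv", i)))
}
files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

# Many files -> one table over the POSIX filesystem.
# readr accepts a vector of paths and row-binds the results.
tbl_readr <- read_csv(files, show_col_types = FALSE)

# arrow treats the same paths as one dataset; collect() pulls it into memory.
tbl_arrow <- dplyr::collect(open_dataset(files, format = "csv"))

# Serialize the combined table into a single file in each format.
parquet_path <- file.path(dir, "combined.parquet")
gz_path <- file.path(dir, "combined.csv.gz")
write_parquet(tbl_readr, parquet_path)
write_csv(tbl_readr, gz_path)  # readr compresses based on the .gz extension

# S3-based access (defined but not run here). The bucket name, credentials,
# and endpoint are assumptions about a local MinIO setup.
open_remote_csv <- function(bucket, prefix, access_key, secret_key,
                            endpoint = "localhost:9000") {
  fs <- s3_bucket(bucket,
                  access_key = access_key,
                  secret_key = secret_key,
                  scheme = "http",
                  endpoint_override = endpoint)
  open_dataset(fs$path(prefix), format = "csv")
}
```

Wrapping each read in `system.time()` or a benchmarking package would give the kind of timings discussed here; the sketch only shows the shape of each access path.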
Note that the goal here is not a comparison of arrow vs readr, but really a comparison of the costs of local network-based access