@jeroen
Last active February 12, 2021 13:27
Quick 2021 benchmark of CSV readers in R
# Note: CSV parsers are NOT DIRECTLY COMPARABLE.
# - data.table does not parse the date-time column here, it just gives strings.
# - data.table is only fast when OpenMP is supported, which is not the default on macOS.
# - vroom takes advantage of ALTREP, which defers some parsing until the data is accessed.
# - arrow takes advantage of hardware extensions if available.
# - results will be different if you specify the types of the columns.
library(vroom)
library(arrow)
library(data.table)
library(readr)
library(nycflights13)
# Max multi-threading in data.table (not supported by default on macOS)
data.table::setDTthreads(0)
# On your marks...
write.csv(flights, 'flights.csv', row.names = FALSE)
system.time({readr <- readr::read_csv('flights.csv')})
system.time({vroom <- vroom::vroom('flights.csv', delim = ',')})
system.time({arrow <- arrow::read_csv_arrow('flights.csv')})
system.time({datatable <- data.table::fread(file = 'flights.csv', sep = ',')})
# Note: with these settings data.table does not parse the date-time column
class(flights$time_hour) # INPUT: POSIXct (date-time)
class(readr$time_hour) # POSIXct
class(vroom$time_hour) # POSIXct
class(arrow$time_hour) # POSIXct
class(datatable$time_hour) # character :(
# Go
bench::mark(
  readr::read_csv('flights.csv'),
  vroom::vroom('flights.csv', delim = ','),
  vroom::vroom('flights.csv', delim = ',', altrep = FALSE),
  arrow::read_csv_arrow('flights.csv'),
  data.table::fread(file = 'flights.csv', sep = ','),
  check = FALSE, min_iterations = 10
)
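# A minimal sketch of the "specify the types of the columns" caveat from the
# notes at the top. It is not part of the original benchmark.
# Assumptions: only time_hour is typed explicitly, all timestamps are read as
# UTC (an illustrative choice, not a claim about the data), and fread's `tz`
# argument requires data.table >= 1.13.0.
if (!file.exists('flights.csv')) {
  write.csv(nycflights13::flights, 'flights.csv', row.names = FALSE)
}
typed_readr <- readr::read_csv(
  'flights.csv',
  col_types = readr::cols(time_hour = readr::col_datetime())
)
typed_vroom <- vroom::vroom(
  'flights.csv', delim = ',',
  col_types = vroom::cols(time_hour = vroom::col_datetime())
)
# tz = 'UTC' asks fread to treat unmarked date-times as UTC instead of strings
typed_datatable <- data.table::fread(file = 'flights.csv', sep = ',', tz = 'UTC')
# Inspect the resulting classes; each reader should now return a date-time
class(typed_readr$time_hour)
class(typed_vroom$time_hour)
class(typed_datatable$time_hour)
# Re-run the timing with the typed calls; numbers may differ from the run above
bench::mark(
  readr::read_csv('flights.csv',
    col_types = readr::cols(time_hour = readr::col_datetime())),
  vroom::vroom('flights.csv', delim = ',',
    col_types = vroom::cols(time_hour = vroom::col_datetime())),
  data.table::fread(file = 'flights.csv', sep = ',', tz = 'UTC'),
  check = FALSE, min_iterations = 10
)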