@jmcastagnetto
Created July 1, 2021 15:39
Testing readr::read_csv(), data.table::fread() and vroom::vroom()
# Test done to check/answer the question at https://stackoverflow.com/questions/68211842/why-is-vroom-so-slow
# Downloaded CSV file on 2021-07-01 from:
# https://www.datosabiertos.gob.pe/dataset/vacunaci%C3%B3n-contra-covid-19-ministerio-de-salud-minsa
# and then compressed it with gzip
# $ zcat vacunas_covid.csv.gz | wc -l
# 7311644
library(readr)
library(vroom)
library(data.table)
library(microbenchmark)
csv_file <- "vacunas_covid.csv.gz"
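microbenchmark(
  # Each expression reads the gzipped CSV and then writes it back to the
  # same path, so every timing covers a full read + write round trip.
  readr = {
    t <- read_csv(csv_file, col_types = cols())
    write_csv(t, csv_file)
  },
  data.table = {
    t <- fread(csv_file)
    fwrite(t, csv_file, sep = ",")
  },
  vroom = {
    t <- vroom(csv_file, delim = ",", show_col_types = FALSE)
    vroom_write(t, csv_file, delim = ",")
  },
  times = 5
)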
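
The benchmark above times reading and writing together. As a rough sketch of how the read step could be timed on its own, the block below builds a small synthetic gzipped CSV (the `toy` data frame and its columns are made up for illustration, not taken from the real dataset) and reads it with each package. It assumes `altrep = FALSE` is passed to `vroom()` so that all columns are materialized eagerly, which keeps the comparison with the other two readers closer to like for like. It also assumes the R.utils package is installed, which `fread()` needs to read `.gz` files.

```r
library(readr)
library(vroom)
library(data.table)
library(microbenchmark)

# Hypothetical stand-in for vacunas_covid.csv.gz; swap in the real path to
# repeat the comparison on the actual data.
toy <- data.frame(
  id     = 1:1000,
  dose   = rep(1:2, 500),
  region = rep(c("LIMA", "CUSCO"), 500)
)
in_file <- tempfile(fileext = ".csv.gz")
write_csv(toy, in_file)  # readr compresses based on the .gz extension

# Time only the read step; nothing is written back to in_file.
read_only <- microbenchmark(
  readr      = read_csv(in_file, col_types = cols()),
  data.table = fread(in_file),  # .gz input requires R.utils
  vroom      = vroom(in_file, delim = ",", show_col_types = FALSE,
                     altrep = FALSE),  # force eager materialization
  times = 5
)
print(read_only)
```

A write-only benchmark could be set up the same way, reading the data once outside `microbenchmark()` and pointing each writer at its own `tempfile()` so the input is never overwritten.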
R version 4.1.0 (2021-05-18) -- "Camp Pontanezen"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> # Test done to check/answer the question at https://stackoverflow.com/questions/68211842/why-is-vroom-so-slow
> # Downloaded CSV file on 2021-07-01 from:
> # https://www.datosabiertos.gob.pe/dataset/vacunaci%C3%B3n-contra-covid-19-ministerio-de-salud-minsa
> # and then compressed it with gzip
>
> library(readr)
> library(vroom)
> library(data.table)
> library(microbenchmark)
> csv_file <- "vacunas_covid.csv.gz"
> microbenchmark(
+ readr={
+ t <- read_csv(csv_file, col_types=cols())
+ write_csv(t, csv_file)
+ },data.table={
+ t <- fread(csv_file)
+ fwrite(t, csv_file, sep=",")
+ },vroom={
+ t <- vroom(csv_file, delim=",", show_col_types = F)
+ vroom_write(t, csv_file, delim=",")
+ },
+ times=5
+ )
Unit: seconds
       expr       min        lq      mean    median        uq       max neval cld
      readr 101.72094 105.75384 109.16869 106.08111 108.06967 124.21788     5   c
 data.table  28.18751  30.32570  31.06592  30.44838  33.12746  33.24055     5 a
      vroom  48.65399  51.52445  55.78264  52.89823  53.83582  72.00071     5  b
>
>
>
> proc.time()
    user   system  elapsed
1065.499   39.475  990.722
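
To make the table easier to read at a glance, the following sketch recomputes two things from the numbers reported above: each package's median time relative to the fastest one, and how much of the total elapsed time is accounted for by the benchmarked expressions. The values are copied by hand from the output. The variable names are only illustrative.

```r
# Figures copied from the microbenchmark table and proc.time() output above.
median_s  <- c(readr = 106.08111, data.table = 30.44838, vroom = 52.89823)
mean_s    <- c(readr = 109.16869, data.table = 31.06592, vroom = 55.78264)
neval     <- 5
elapsed_s <- 990.722

# Median time of each package relative to the fastest one.
relative <- median_s / min(median_s)
print(round(relative, 2))

# Total seconds spent inside the benchmarked expressions (mean * runs),
# and what is left over of the elapsed wall-clock time.
benchmarked_s <- sum(mean_s) * neval
print(benchmarked_s)
print(elapsed_s - benchmarked_s)
```

On these figures the data.table round trip is roughly 3.5 times faster than readr and roughly 1.7 times faster than vroom. The benchmarked expressions account for about 980 of the 990.7 elapsed seconds. The remaining 10 seconds or so presumably went to loading packages and other overhead outside the timed code.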