@markdanese
Last active April 22, 2016 11:35
A test of the new feather package in R using Medicare Part D drug reimbursement data
# load libraries --------------------------------------------------------------------
library(data.table)
library(feather)
# US Part D Drug prices 2013: 500 MB zip, 2.9 GB uncompressed -----------------------
pde_link <- "http://download.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/PartD_Prescriber_PUF_NPI_DRUG_13.zip"
tf <- tempfile()
download.file(pde_link, tf)
x <- unzip(tf, exdir = tempdir())
df <- fread(x[2], verbose = TRUE)
unlink(x)
rm(x, tf)
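# make sure the output directory exists (assumption: it may be missing on a fresh machine)
dir.create("./data/analysis", recursive = TRUE, showWarnings = FALSE)
# various write/save options --------------------------------------------------------------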
write_feather_time <-
system.time(
write_feather(df, "./data/analysis/pde2013.fthr")
)
write_rds_T_time <-
system.time(
saveRDS(df, "./data/analysis/pde2013T.rds", compress = TRUE)
)
write_rds_F_time <-
system.time(
saveRDS(df, "./data/analysis/pde2013F.rds", compress = FALSE)
)
write_csv_time <-
system.time(
fwrite(df, "./data/analysis/pde2013.csv")
) # requires data.table 1.9.7 or later, which added fwrite()
# various read options --------------------------------------------------------------
read_feather_time <-
system.time(
df1 <- read_feather("./data/analysis/pde2013.fthr")
)
rm(df1)
gc()
read_rds_T_time <-
system.time(
df2 <- readRDS("./data/analysis/pde2013T.rds")
)
rm(df2)
gc()
read_rds_F_time <-
system.time(
df3 <- readRDS("./data/analysis/pde2013F.rds")
)
rm(df3)
gc()
# summarize results -----------------------------------------------------------------
output <- ls(pattern = "_time")
times <- mget(output) # named list of the timing objects, keyed by variable name
print(times)
@markdanese
Author

My times on a 2014 MacBook Pro with SSD:

$read_feather_time
   user  system elapsed 
 15.315   4.497  34.324 

$read_rds_F_time
   user  system elapsed 
 82.351   3.617  83.322 

$read_rds_T_time
   user  system elapsed 
 74.282   1.136  75.604 

$write_feather_time
   user  system elapsed 
  7.125   6.361  13.891 

$write_rds_F_time
   user  system elapsed 
 65.706   5.236  71.482 

$write_rds_T_time
   user  system elapsed 
180.748   0.687 181.557 

In addition, fread() reported that it took 28 seconds to read in the CSV, and fwrite() took about 200 seconds to write the CSV with the defaults.
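
To put the numbers above side by side, here is a minimal sketch that takes only the elapsed values reported in this comment and expresses each one as a multiple of the feather time for the same operation. The variable names (`elapsed`, `feather_ref`, `comparison`) are illustrative and not part of the gist.

```r
# Elapsed seconds copied from the timings reported above.
elapsed <- c(
  read_feather  = 34.324,
  read_rds_F    = 83.322,
  read_rds_T    = 75.604,
  write_feather = 13.891,
  write_rds_F   = 71.482,
  write_rds_T   = 181.557
)

# With the script's `times` list, a similar vector could be built with:
# elapsed <- sapply(times, function(t) t[["elapsed"]])

# Operation ("read" or "write") is the part of each name before the first "_".
op <- sub("_.*$", "", names(elapsed))

# Feather's elapsed time for each operation, used as the reference.
feather_ref <- c(
  read  = elapsed[["read_feather"]],
  write = elapsed[["write_feather"]]
)

# Ratio of each elapsed time to the feather time for the same operation.
ratio <- round(elapsed / feather_ref[op], 1)

comparison <- data.frame(
  operation   = op,
  elapsed_sec = unname(elapsed),
  vs_feather  = unname(ratio),
  row.names   = names(elapsed)
)
print(comparison)
```

On these figures, the RDS reads come out at a little over twice the feather read time, and the compressed RDS write at roughly thirteen times the feather write time.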

@markdanese
Author

I updated to the new, parallel version of fwrite() in data.table 1.9.7 and the write speed was 13 seconds (that is not a typo). For comparison, readr::read_csv() was 228 seconds. I don't know whether this is because of the solid state drive, but it is pretty amazing.
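
For anyone who wants to check the effect of the parallel writer without the 2.9 GB file, the following is a rough sketch on a small synthetic table. It assumes a recent CRAN release of data.table in which `fwrite()` accepts `nThread` and `getDTthreads()` is available. The column names and row count are made up for illustration, and timings on a table this small will not resemble the ones reported here.

```r
library(data.table)

# Synthetic stand-in for the Part D table (hypothetical columns, far smaller).
set.seed(1)
n <- 1e5
dt <- data.table(
  npi        = sample(1e9, n, replace = TRUE),
  drug_name  = sample(c("DRUG_A", "DRUG_B", "DRUG_C"), n, replace = TRUE),
  total_cost = round(runif(n, 0, 1e4), 2)
)

out <- tempfile(fileext = ".csv")

# Time a single-threaded write, then one using data.table's configured thread count.
t_single <- system.time(fwrite(dt, out, nThread = 1L))
t_multi  <- system.time(fwrite(dt, out, nThread = getDTthreads()))

# Show only the elapsed (wall-clock) column for the two runs.
print(rbind(single = t_single, multi = t_multi)[, "elapsed", drop = FALSE])

# Round-trip check: the file read back should have the same shape.
back <- fread(out)
stopifnot(nrow(back) == n, ncol(back) == 3L)
unlink(out)
```

Whether the multi-threaded run is faster will depend on core count, disk, and table size, so this is a way to measure on your own machine, not a prediction of the result.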

@jangorecki

Nice, the speedup is amazing. You could update the R script and uncomment the fwrite lines so that it can be reproduced by copy-paste.

@markdanese
Author

Thanks @jangorecki -- fixed.
Today's update (20 April 2016) now has fwrite() writing the same file in 3.8-4.0 seconds. (By the way, the previous version was 11.2 sec and not 13. Not that it matters any more!)
