A test of the new feather package in R using Medicare Part D drug reimbursement data
# load libraries --------------------------------------------------------------------
library(data.table)
library(feather)
# US Part D Drug prices 2013: 500 MB zip, 2.9 GB uncompressed -----------------------
pde_link <- "http://download.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Downloads/PartD_Prescriber_PUF_NPI_DRUG_13.zip"
tf <- tempfile()
download.file(pde_link, tf)
x <- unzip(tf, exdir = tempdir())
df <- fread(x[2], verbose = TRUE)
unlink(x)
rm(x, tf)
# various write/save options --------------------------------------------------------------
write_feather_time <-
  system.time(
    write_feather(df, "./data/analysis/pde2013.fthr")
  )
write_rds_T_time <-
  system.time(
    saveRDS(df, "./data/analysis/pde2013T.rds", compress = TRUE)
  )
write_rds_F_time <-
  system.time(
    saveRDS(df, "./data/analysis/pde2013F.rds", compress = FALSE)
  )
write_csv_time <-
  system.time(
    fwrite(df, "./data/analysis/pde2013.csv")
  ) # requires data.table 1.9.7 or later, which added fwrite
# various read options --------------------------------------------------------------
read_feather_time <-
  system.time(
    df1 <- read_feather("./data/analysis/pde2013.fthr")
  )
rm(df1)
gc()
read_rds_T_time <-
  system.time(
    df2 <- readRDS("./data/analysis/pde2013T.rds")
  )
rm(df2)
gc()
read_rds_F_time <-
  system.time(
    df3 <- readRDS("./data/analysis/pde2013F.rds")
  )
rm(df3)
gc()
# summarize results -----------------------------------------------------------------
output <- ls(pattern = "_time")
times <- mget(output) # named list of the system.time() results
print(times)
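For anyone who wants to try the timing pattern without the 2.9 GB download, here is a minimal sketch on a small synthetic table. The column names, row count, and file names are made up for illustration and are not from the Part D file; feather is left out so the sketch runs with base R alone, and `fwrite()` is only timed if data.table happens to be installed. Timings on a table this small say nothing about the full data set.

```r
# Hypothetical stand-in for the Part D table (invented columns, 100k rows).
set.seed(1)
n <- 1e5
toy <- data.frame(
  npi = sample(1e6, n, replace = TRUE),
  drug_name = sample(c("A", "B", "C"), n, replace = TRUE),
  total_cost = runif(n, 0, 1000),
  stringsAsFactors = FALSE
)

# Write into a throwaway directory so nothing is left behind.
out_dir <- tempfile("bench_")
dir.create(out_dir)

# Time each writer and keep the results in one named list.
timings <- list(
  write_rds_T = system.time(
    saveRDS(toy, file.path(out_dir, "toy_T.rds"), compress = TRUE)
  ),
  write_rds_F = system.time(
    saveRDS(toy, file.path(out_dir, "toy_F.rds"), compress = FALSE)
  )
)

# Optional: time fwrite() only when data.table is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  timings$write_csv <- system.time(
    data.table::fwrite(toy, file.path(out_dir, "toy.csv"))
  )
}

# "elapsed" is the wall-clock time in seconds.
elapsed <- vapply(timings, function(t) t[["elapsed"]], numeric(1))
print(elapsed)

# Check that the compressed RDS file reads back unchanged, then clean up.
roundtrip <- readRDS(file.path(out_dir, "toy_T.rds"))
unlink(out_dir, recursive = TRUE)
```

Collecting the `system.time()` results in a list up front avoids the `ls(pattern = "_time")` lookup, at the cost of departing slightly from the layout of the script above.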
I updated to the new, parallel version of fwrite() in data.table 1.9.7 and the write speed was 13 seconds (that is not a typo). For comparison, readr::read_csv() was 228 seconds. I don't know whether this is because of the solid state drive, but it is pretty amazing.
Nice, the speedup is amazing. You may want to update the R script and uncomment the fwrite lines so it can be reproduced by copy-paste.
Thanks @jangorecki -- fixed.
Today's update (20 April 2016) now has fwrite() writing the same file in 3.8-4.0 seconds. (By the way, the previous version took 11.2 seconds, not 13. Not that it matters any more!)
My times on a 2014 MacBook Pro with SSD:
And fread reported that it took 28 seconds to read in the CSV. fwrite took about 200 seconds to write the CSV with the defaults.