@xhdong-umd
Last active May 26, 2017 15:38
R function that decompresses zip, gz, bzip2, or xz into a temp file, runs a function on it, then removes the temp file

The need for this came from the fact that read.csv can read compressed files directly, but data.table::fread cannot take connections as input because it requires random file seeks. A simple alternative is data.table::fread(paste0("zcat < ", PATH_TO_FILE)), but that depends on the command line tool gzip, which is not always available on Windows. See here for more details.

The code is based on R.utils::decompressFile, with several modifications:

  1. The input file is no longer removed. I lost several data files and spent too much time puzzled by this, even though I had read the documentation and knew about this behavior from the beginning.
  2. According to ?connections, gzfile can handle gz, bzip2, and xz, so there is no need to specify the format or use different functions to decompress.
  3. The only exception is that gzfile cannot handle zip. We can use unzip to decompress the file directly, without any need for connections. unzip does not support Unicode filenames as introduced in zip 3.0; see ?unzip for its limitations. If you really need Unicode filenames, it might be easier to just install the command line tool gzip (if it is not already available, as on Windows) and use data.table::fread(paste0("zcat < ", PATH_TO_FILE)) directly.
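
Point 2 can be checked with base R alone. The following is a minimal sketch, not part of the original function: it writes the same lines through gzfile(), bzfile(), and xzfile(), then reads each file back through gzfile() only. The helper name `roundtrip` and the sample data are assumptions made for illustration.

```r
# Sample content; any small text works.
lines_in <- c("a,b,c", "1,2,3", "4,5,6")

# Hypothetical helper: compress with `writer`, then decompress with gzfile().
roundtrip <- function(writer, ext) {
  path <- tempfile(fileext = ext)
  on.exit(unlink(path), add = TRUE)

  # Write the lines through the given compressing connection.
  con <- writer(path, open = "wt")
  writeLines(lines_in, con)
  close(con)

  # Read back through gzfile(), whatever the compression format was.
  con <- gzfile(path, open = "rt")
  lines_out <- readLines(con)
  close(con)
  lines_out
}

results <- list(
  gz  = roundtrip(gzfile, ".gz"),
  bz2 = roundtrip(bzfile, ".bz2"),
  xz  = roundtrip(xzfile, ".xz")
)

# Every format should give back the original lines.
stopifnot(all(vapply(results, identical, logical(1), lines_in)))
```

This is why temp_unzip needs only one code path for all three formats and a separate branch just for zip.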

read.csv can read regular files and compressed files with the same syntax. temp_unzip can also take a regular file as input, in which case the file is simply written to the temp directory again. Obviously this is not optimal, so we want to test whether the file is compressed. To use fread with the same syntax for regular or compressed files, we can have something like this:

library(data.table)
fread_all <- function(object, ...) {
  # try a direct read of a few rows to test whether it is a regular file
  data <- try(fread(object, nrows = 5), silent = TRUE)
  # fread returns a data.table, which has two classes, so test with inherits()
  if (inherits(data, "data.frame")) {
    return(fread(object, ...))
  } else {
    return(temp_unzip(object, fread, ...))
  }
}
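
An alternative to the trial read is to look at the first few bytes of the file. The sketch below is an assumption-laden illustration rather than part of the gist: `detect_compression` is a hypothetical helper, and it only recognizes the usual leading signatures of gz, bzip2, xz, and non-empty zip files.

```r
# Hypothetical helper: classify a file by its leading "magic" bytes.
detect_compression <- function(path) {
  # raw = TRUE asks file() to skip its own compressed-file detection.
  con <- file(path, open = "rb", raw = TRUE)
  on.exit(close(con))
  head_bytes <- readBin(con, what = "raw", n = 6L)

  starts_with <- function(sig) {
    length(head_bytes) >= length(sig) &&
      identical(head_bytes[seq_along(sig)], sig)
  }

  if (starts_with(as.raw(c(0x1f, 0x8b)))) return("gz")
  if (starts_with(charToRaw("BZh"))) return("bzip2")
  if (starts_with(as.raw(c(0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00)))) return("xz")
  if (starts_with(as.raw(c(0x50, 0x4b, 0x03, 0x04)))) return("zip")
  "none"
}

# Small demonstration with throwaway files.
plain <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3"), plain)

gz <- tempfile(fileext = ".gz")
con <- gzfile(gz, open = "wt")
writeLines(c("a,b,c", "1,2,3"), con)
close(con)

bz <- tempfile(fileext = ".bz2")
con <- bzfile(bz, open = "wt")
writeLines(c("a,b,c", "1,2,3"), con)
close(con)

kinds <- c(plain = detect_compression(plain),
           gz    = detect_compression(gz),
           bz    = detect_compression(bz))
unlink(c(plain, gz, bz))
```

A wrapper like fread_all could then call fread directly when the result is "none" and fall back to temp_unzip otherwise, which avoids reading part of the file twice.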

Updates

2017-03-06 Added a warning for multiple files in a zip. Mac OS adds a hidden folder even to a single-file zip. The function still supports this case, but gives a warning and reports which file will be extracted.

2017-03-07 The zip file contents are now scanned. After filtering out hidden files (names starting with . on Unix, names ending with $ on Windows, and the __MACOSX folder on Mac), the function only proceeds if exactly one visible file remains.
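
The hidden-file filter can be tried on its own. This is a small sketch with made-up entry names; the pattern is a simplified form of the one used in temp_unzip and is intended to match the same names.

```r
# Made-up zip listing, as utils::unzip(..., list = TRUE)$Name might return it.
entries <- c("data.csv", "__MACOSX/._data.csv", ".DS_Store", "share$")

# Hidden entries: inside __MACOSX/, starting with ".", or ending with "$".
hidden_pattern <- "^__MACOSX/|^\\.|\\$$"

visible <- entries[!grepl(hidden_pattern, entries)]
stopifnot(identical(visible, "data.csv"))
```

Note that the pattern is anchored at the start of the whole entry name, so a dot-file inside a subfolder (for example sub/.hidden) would still count as visible.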

The input file used here is a 160M csv, compressed to bzip2, gz, and zip.

We compared

  • reading original csv directly
  • zcat method (note we need to quote the file name because there is & in it)
  • temp_unzip with bzip2, zip, gz

bzip2 needs significantly more time because bzip2 decompression is slow. zcat is actually slightly faster than reading the original csv. gz and zip with temp_unzip perform very similarly to reading the original uncompressed file.

library(microbenchmark)
microbenchmark(
  fread(eg_csv),
  fread(input = paste0("zcat < '", eg_gz, "'")), 
  temp_unzip(eg_bz, fread),
  temp_unzip(eg_zip, fread),
  temp_unzip(eg_gz, fread),
  times = 1)
Unit: seconds
                                          expr      min       lq     mean   median       uq      max neval
                                 fread(eg_csv) 2.117812 2.117812 2.117812 2.117812 2.117812 2.117812     1
 fread(input = paste0("zcat < '", eg_gz, "'")) 1.984009 1.984009 1.984009 1.984009 1.984009 1.984009     1
                      temp_unzip(eg_bz, fread) 6.304849 6.304849 6.304849 6.304849 6.304849 6.304849     1
                     temp_unzip(eg_zip, fread) 2.481650 2.481650 2.481650 2.481650 2.481650 2.481650     1
                      temp_unzip(eg_gz, fread) 2.487811 2.487811 2.487811 2.487811 2.487811 2.487811     1
# Decompress zip, gz, bzip2, or xz into a temp file, run a function on it, then remove the temp file.
temp_unzip <- function(filename, fun, ...) {
  BFR.SIZE <- 1e7
  if (!file.exists(filename)) {
    stop("No such file: ", filename)
  }
  if (!is.function(fun)) {
    stop(sprintf("Argument 'fun' is not a function: %s", mode(fun)))
  }
  temp_dir <- tempdir()
  # test if it's zip: listing the contents fails for non-zip files
  files_in_zip <- try(utils::unzip(filename, list = TRUE)$Name, silent = TRUE)
  if (!inherits(files_in_zip, "try-error")) {
    # hidden files can be ignored: starting with ., ending with $, __MACOSX folder
    visible_files <- files_in_zip[!grepl("((^__MACOSX\\/.*)|(^\\..*)|(^.*\\$$))",
                                         files_in_zip)]
    # will not continue for multiple non-hidden files since behavior is not well defined
    if (length(visible_files) > 1) {
      stop(paste0("Zip file contains multiple visible files:\n",
                  paste0(" ", visible_files, collapse = "\n")))
    }
    if (length(visible_files) == 0) {
      stop("\n No visible file found in Zip file")
    }
    # proceed with single non-hidden file
    utils::unzip(filename, files = visible_files[1], exdir = temp_dir, overwrite = TRUE)
    dest_file <- file.path(temp_dir, visible_files[1])
  } else {
    dest_file <- tempfile()
    # set up input and output connections; gzfile also reads bzip2 and xz
    inn <- gzfile(filename, open = "rb")
    out <- file(description = dest_file, open = "wb")
    # copy the decompressed bytes to the temp file in chunks of BFR.SIZE
    nbytes <- 0
    repeat {
      bfr <- readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE)
      n <- length(bfr)
      if (n == 0L) break
      nbytes <- nbytes + n
      writeBin(bfr, con = out, size = 1L)
      bfr <- NULL # not needed anymore
    }
    close(inn)
    close(out)
  }
  # remove the temp file on exit, even if fun fails
  on.exit(file.remove(dest_file), add = TRUE)
  # call fun with temp file
  res <- fun(dest_file, ...)
  return(res)
}
# needs `R.utils` to compress files and `data.table` for fread
library(R.utils)
library(data.table)
temp_test <- "temp.csv"
temp_content <- "a,b,c
1,2,3
4,5,6
"
cat(file = temp_test, temp_content)
# always remember the `remove` parameter when using `R.utils`!!!
gzip(temp_test, remove = FALSE)
bzip2(temp_test, remove = FALSE)
temp_unzip(paste0(temp_test, ".bz2"), fread)
temp_unzip(paste0(temp_test, ".gz"), fread)