@omarbenites
Forked from xhdong-umd/1. intro.md
Created March 6, 2017 16:47
R function that decompresses zip, gz, bzip2, or xz into a temp file, runs a function on it, then removes the temp file

The need for this came from the fact that read.csv can read compressed files directly, but data.table::fread cannot take connections as input because it requires random file seeks. A simple workaround is data.table::fread(paste0("zcat < ", PATH_TO_FILE)), but that depends on the command line tool gzip, which is not always available on Windows. See here for more details.
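To make the connection-based route concrete, here is a minimal base R sketch. It writes a tiny gz file and reads it back with read.csv through a gzfile connection. The file and its contents are made up for the example. fread is not called, because it expects a file path or a shell command rather than a connection.

```r
# Write a small CSV through a gzip connection (example data, not the benchmark file).
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, open = "wt")
writeLines(c("a,b,c", "1,2,3", "4,5,6"), con)
close(con)

# read.csv accepts a connection, so it can decompress while reading.
df <- read.csv(gzfile(gz_path))
print(df)

unlink(gz_path)
```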

The code is based on R.utils::decompressFile with lots of modifications:

  1. The input file is no longer removed. I lost several data files and spent too much time puzzled because of this, even though I had read the documentation and knew about this behavior from the beginning.
  2. According to ?connections, gzfile can handle gz, bzip2, and xz, so there is no need to choose a different function for each format.
  3. The only exception is that gzfile cannot handle zip. We can use unzip to decompress the file directly, without any connection. unzip does not support Unicode filenames as introduced in zip 3.0; see ?unzip for its limitations. If you really need Unicode filenames, it might be easier to install the command line tool gzip (if it is not already available, as on Windows) and use data.table::fread(paste0("zcat < ", PATH_TO_FILE)) directly.
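The second point can be checked with base R alone. The sketch below writes the same lines through gzfile, bzfile, and xzfile, then reads all three back with gzfile only. The helper name write_compressed and the sample lines are assumptions made for the example, and xz support depends on how R was built.

```r
lines_in <- c("a,b,c", "1,2,3", "4,5,6")

# Hypothetical helper: write the sample lines with a given compressing connection.
write_compressed <- function(open_con, ext) {
  path <- tempfile(fileext = ext)
  con <- open_con(path, open = "wt")
  on.exit(close(con))
  writeLines(lines_in, con)
  path
}

paths <- c(
  gz  = write_compressed(gzfile, ".gz"),
  bz2 = write_compressed(bzfile, ".bz2"),
  xz  = write_compressed(xzfile, ".xz")
)

# gzfile() detects the compression type when reading, so one call covers all three.
read_back <- lapply(paths, function(p) {
  con <- gzfile(p, open = "rt")
  on.exit(close(con))
  readLines(con)
})
print(read_back)

unlink(paths)
```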

read.csv can read a regular file and a compressed file with the same syntax. temp_unzip can also take a regular file as input, in which case the file is simply written to the temp directory again. Obviously this is not optimal, so we want to test whether the file is compressed first. To use fread with the same syntax for regular and compressed files, we can write something like this:

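library(data.table)

fread_all <- function(object, ...) {
  # Try a small direct read first; it succeeds when the file is a regular text file.
  data <- try(fread(object, nrows = 5), silent = TRUE)
  # fread returns a data.table, whose class vector has length 2, so use
  # inherits() instead of comparing class() with ==.
  if (inherits(data, "data.frame")) {
    return(fread(object, ...))
  } else {
    return(temp_unzip(object, fread, ...))
  }
}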
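Another way to test whether a file is compressed is to look at its first bytes. The sketch below is an assumption-laden alternative, not part of the original code. detect_compression is a hypothetical helper that compares the leading bytes with the well-known signatures of gzip, bzip2, xz, and zip. A plain text file that happens to start with the same bytes would be misclassified, so treat the result as a hint.

```r
# Hypothetical helper: guess the compression format from the first bytes of a file.
detect_compression <- function(path) {
  magic <- readBin(path, what = "raw", n = 6L)
  starts_with <- function(sig) {
    length(magic) >= length(sig) && all(magic[seq_along(sig)] == sig)
  }
  if (starts_with(as.raw(c(0x1f, 0x8b)))) return("gz")
  if (starts_with(charToRaw("BZh"))) return("bzip2")
  if (starts_with(as.raw(c(0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00)))) return("xz")
  if (starts_with(charToRaw("PK"))) return("zip")
  "none"
}

# Example files created only for this illustration.
plain_path <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,2"), plain_path)
gz_path <- tempfile(fileext = ".gz")
con <- gzfile(gz_path, open = "wt")
writeLines(c("a,b", "1,2"), con)
close(con)

kind_plain <- detect_compression(plain_path)
kind_gz <- detect_compression(gz_path)
print(c(plain = kind_plain, gz = kind_gz))
```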

Updates

2017-03-06 Added a warning for zip files that contain multiple files. The Mac OS built-in compressor adds a hidden folder even to a single-file zip. The function still supports this case, but it gives a warning and reports which file will be extracted.
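A possible refinement, sketched here under assumptions, is to skip the hidden entries before picking a file. pick_zip_entry is a hypothetical helper that takes the entry names returned by unzip(path, list = TRUE)$Name. The `__MACOSX/` prefix and the leading-dot rule are assumptions about how such archives usually look, not something the function above checks.

```r
# Hypothetical helper: choose the first regular file among zip entry names.
pick_zip_entry <- function(entry_names) {
  is_dir    <- grepl("/$", entry_names)
  is_macosx <- grepl("^__MACOSX/", entry_names)
  is_hidden <- grepl("^\\.", basename(entry_names))
  candidates <- entry_names[!(is_dir | is_macosx | is_hidden)]
  if (length(candidates) == 0L) {
    stop("No regular file found in zip listing")
  }
  if (length(candidates) > 1L) {
    warning("Zip contains multiple files; using the first: ", candidates[1])
  }
  candidates[1]
}

# Illustrative listing for a single-file zip that carries extra hidden entries.
picked <- pick_zip_entry(c("data.csv", "__MACOSX/", "__MACOSX/._data.csv"))
print(picked)
```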

The input file used here is a 160M csv, compressed to bzip2, gz, and zip.

We compared:

  • reading the original csv directly
  • the zcat method (note that we need to quote the file name because it contains an &)
  • temp_unzip with bzip2, zip, and gz

bzip2 takes significantly longer because bzip2 decompression is slow. The zcat method is actually slightly faster than reading the original csv. temp_unzip with gz and zip is somewhat slower than reading the uncompressed file (about 2.5 s versus 2.1 s). Each expression was run only once (times = 1), so small differences may be noise.

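library(data.table)
library(microbenchmark)
# eg_csv, eg_gz, eg_bz and eg_zip hold the paths of the 160M csv and its compressed copies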
microbenchmark(
  fread(eg_csv),
  fread(input = paste0("zcat < '", eg_gz, "'")), 
  temp_unzip(eg_bz, fread),
  temp_unzip(eg_zip, fread),
  temp_unzip(eg_gz, fread),
  times = 1)
Unit: seconds
                                          expr      min       lq     mean   median       uq      max neval
                                 fread(eg_csv) 2.117812 2.117812 2.117812 2.117812 2.117812 2.117812     1
 fread(input = paste0("zcat < '", eg_gz, "'")) 1.984009 1.984009 1.984009 1.984009 1.984009 1.984009     1
                      temp_unzip(eg_bz, fread) 6.304849 6.304849 6.304849 6.304849 6.304849 6.304849     1
                     temp_unzip(eg_zip, fread) 2.481650 2.481650 2.481650 2.481650 2.481650 2.481650     1
                      temp_unzip(eg_gz, fread) 2.487811 2.487811 2.487811 2.487811 2.487811 2.487811     1
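Because each expression above ran only once, repeating the measurement would make the comparison more reliable. The sketch below uses only base R. time_median is a hypothetical helper, and the small generated csv is a stand-in for the real 160M file. With the real data you could pass something like function() temp_unzip(eg_gz, fread) instead.

```r
# Hypothetical helper: run f() several times and return the median elapsed seconds.
time_median <- function(f, times = 5L) {
  elapsed <- vapply(seq_len(times), function(i) {
    system.time(f())[["elapsed"]]
  }, numeric(1))
  median(elapsed)
}

# Stand-in workload (illustrative data only).
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = seq_len(1e4), y = rnorm(1e4)), csv_path, row.names = FALSE)

t_csv <- time_median(function() read.csv(csv_path), times = 5L)
print(t_csv)

unlink(csv_path)
```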
# Decompress zip, gz, bzip2, or xz into a temp file, run a function on it, then remove the temp file.
temp_unzip <- function(filename, fun, ...) {
  BFR.SIZE <- 1e7  # bytes copied per chunk
  if (!file.exists(filename)) {
    stop("No such file: ", filename)
  }
  if (!is.function(fun)) {
    stop(sprintf("Argument 'fun' is not a function: %s", mode(fun)))
  }
  temp_dir <- tempdir()
  # Test whether it is a zip: listing the contents fails for non-zip files.
  files_in_zip <- try(unzip(filename, list = TRUE)$Name, silent = TRUE)
  if (!inherits(files_in_zip, "try-error")) {
    if (length(files_in_zip) > 1) {
      warning(paste0(
        " Zip file contains multiple files.\n Mac OS built in zip compressor will add hidden folder for even single file zip.\nUsing the first file: ",
        files_in_zip[1]))
    }
    unzip(filename, exdir = temp_dir, overwrite = TRUE)
    dest_file <- file.path(temp_dir, files_in_zip[1])
  } else {
    dest_file <- tempfile()
    # Set up input and output connections; gzfile also reads bzip2 and xz
    inn <- gzfile(filename, open = "rb")
    out <- file(description = dest_file, open = "wb")
    # Copy the decompressed bytes chunk by chunk
    repeat {
      bfr <- readBin(inn, what = raw(0L), size = 1L, n = BFR.SIZE)
      if (length(bfr) == 0L) break
      writeBin(bfr, con = out, size = 1L)
    }
    close(inn)
    close(out)
  }
  # Remove the temp file when the function exits, even if fun fails
  on.exit(file.remove(dest_file), add = TRUE)
  # Call fun with the temp file
  fun(dest_file, ...)
}
# needs `R.utils` to compress files and `data.table` for fread
library(data.table)
temp_test <- "temp.csv"
temp_content <- "a,b,c
1,2,3
4,5,6
"
cat(file = temp_test, temp_content)
library(R.utils)
# always remember the `remove` parameter when using `R.utils`!!!
gzip(temp_test, remove = FALSE)
bzip2(temp_test, remove = FALSE)
temp_unzip(paste0(temp_test, ".bz2"), fread)
temp_unzip(paste0(temp_test, ".gz"), fread)
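If R.utils is not installed, the chunked copy at the heart of the function can still be sanity-checked with base R. The sketch below compresses a small made-up file through a gzfile connection, copies it back out in deliberately tiny chunks, and compares checksums with tools::md5sum. It re-implements the loop inline instead of calling temp_unzip, so it is an illustration of the technique, not a test of the function itself.

```r
# Create a small source file (example content).
src <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,6"), src)

# Compress it with a gzfile connection.
gz_path <- paste0(src, ".gz")
out <- gzfile(gz_path, open = "wb")
writeBin(readBin(src, what = "raw", n = file.size(src)), out)
close(out)

# Decompress chunk by chunk; the 4-byte chunk size forces several iterations.
dest <- tempfile()
inn <- gzfile(gz_path, open = "rb")
out <- file(dest, open = "wb")
repeat {
  chunk <- readBin(inn, what = "raw", n = 4L)
  if (length(chunk) == 0L) break
  writeBin(chunk, out)
}
close(inn)
close(out)

# The round trip should reproduce the original bytes.
same <- unname(tools::md5sum(src) == tools::md5sum(dest))
print(same)

unlink(c(src, gz_path, dest))
```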