xhdong-umd/1. intro.md

## 1. intro.md

      
This came from the fact that `read.csv` can read compressed files directly, but `data.table::fread` cannot take a connection as input because it requires random file seeks. A simple workaround is `data.table::fread(paste0("zcat < ", PATH_TO_FILE))`, but that depends on the command-line tool gzip, which is not always available on Windows. See here for more details.
The code is based on `R.utils::decompressFile`, with many modifications:

- The input file is no longer removed. I lost several data files and spent too much time puzzled by this, even though I had read the documentation and knew about this behavior from the beginning.
- According to `?connections`, `gzfile` can handle gz, bzip2 and xz, so there is no need to specify the format or use different functions to decompress.
- The only exception is that `gzfile` cannot handle zip. We can use `unzip` to decompress the file directly, without needing connections. `unzip` does not support Unicode filenames as introduced in zip 3.0; see `?unzip` for its limitations. If you really need Unicode filenames, it might be easier to install the command-line tool gzip (if it is not already available, as on Windows) and use `data.table::fread(paste0("zcat < ", PATH_TO_FILE))` directly.
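
The point about `gzfile` reading several formats can be checked with a small sketch that uses only base R. The sample data and temp file names here are made up for illustration; this is not part of the original code.

```r
# Write the same text with two different compressors (base R only).
lines_in <- c("a,b,c", "1,2,3", "4,5,6")
gz_path <- tempfile(fileext = ".gz")
bz_path <- tempfile(fileext = ".bz2")

con <- gzfile(gz_path, open = "wt"); writeLines(lines_in, con); close(con)
con <- bzfile(bz_path, open = "wt"); writeLines(lines_in, con); close(con)

# In read mode, gzfile() works out the compression type from the file itself,
# so one reader covers both files.
read_any <- function(path) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  readLines(con)
}

from_gz <- read_any(gz_path)
from_bz <- read_any(bz_path)
```

Both `from_gz` and `from_bz` should contain the original three lines, which is why `temp_unzip` needs only a single `gzfile` branch for everything except zip.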

`read.csv` can read a regular file and a compressed file with the same syntax. `temp_unzip` can actually take a regular file as input, in which case the file is simply written to the temp directory again. This is obviously not optimal, so we want to test whether the file is compressed. To use `fread` with the same syntax for regular and compressed files, we can write something like this:
fread_all <- function(object, ...) {
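  # try a small direct read first to test whether it is a regular file
  data <- try(fread(object, nrows = 5), silent = TRUE)
  # fread returns a data.table, whose class vector has more than one element,
  # so use inherits() instead of comparing class() with ==
  if (inherits(data, "data.frame")) {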
    return(fread(object, ...))
  } else {
    return(temp_unzip(object, fread, ...))
  }
}
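
An alternative to the trial read above is to look at the first bytes of the file. The following is only a sketch: `detect_compression` is a hypothetical helper, not part of the original gist, and it assumes the usual leading signature bytes for gzip, bzip2, xz and (non-empty) zip files.

```r
# Hypothetical helper: guess the compression type from the leading bytes.
detect_compression <- function(path) {
  magic <- readBin(path, what = "raw", n = 6L)
  starts_with <- function(sig) {
    length(magic) >= length(sig) && identical(magic[seq_along(sig)], sig)
  }
  if (starts_with(as.raw(c(0x1f, 0x8b)))) return("gzip")
  if (starts_with(charToRaw("BZh"))) return("bzip2")
  if (starts_with(as.raw(c(0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00)))) return("xz")
  # "PK\003\004" marks the first local file header of a non-empty zip
  if (starts_with(as.raw(c(0x50, 0x4b, 0x03, 0x04)))) return("zip")
  "none"
}

# Usage with made-up temp files: one plain, one gzip-compressed.
plain_path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3"), plain_path)
gz_path <- tempfile(fileext = ".gz")
con <- gzfile(gz_path, open = "wt")
writeLines(c("a,b,c", "1,2,3"), con)
close(con)

plain_type <- detect_compression(plain_path)
gz_type <- detect_compression(gz_path)
```

A wrapper like `fread_all` could then call `fread` directly when the result is `"none"` and `temp_unzip` otherwise, which avoids parsing the first rows twice.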
Updates

- 2017-03-06: Added a warning about multiple files in a zip. Mac OS adds a hidden folder even to a single-file zip. Our function still supports this case, but also gives a warning and information about which file will be extracted.
- 2017-03-07: Now scans the zip file contents. After filtering out hidden files (names starting with `.` on Unix, ending with `$` on Windows, and the `__MACOSX` folder on Mac), it only proceeds if there is exactly one visible file.
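
The hidden-file filter can be tried on its own. This sketch uses a simplified pattern that should match the same names as the one in `temp_unzip`; the entry names are invented for illustration.

```r
# Hypothetical zip listing: one real file plus typical hidden entries.
files_in_zip <- c("data.csv", "__MACOSX/._data.csv", ".DS_Store", "share$")

# Hidden entries: inside __MACOSX/, starting with ".", or ending with "$".
hidden <- grepl("^__MACOSX/|^\\.|\\$$", files_in_zip)
visible_files <- files_in_zip[!hidden]
# Only "data.csv" is left, so extraction would proceed with that file.
```

Note that `^\\.` only catches names that start with a dot at the top level of the archive; a hidden file inside a subfolder would still count as visible.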

  
## 2. benchmark.md

      
The input file used here is a 160M csv, compressed to bzip2, gz and zip.
We compared:

- reading the original csv directly
- the zcat method (note that we need to quote the file name because there is an `&` in it)
- `temp_unzip` with bzip2, zip and gz

bzip2 takes significantly longer because bzip2 decompression is slow. zcat is actually slightly faster than reading the original csv. gz and zip with `temp_unzip` are only slightly slower than reading the original uncompressed file (about 2.5 s versus 2.1 s). Note that each expression was run only once (`times = 1`).
library(microbenchmark)
microbenchmark(
  fread(eg_csv),
  fread(input = paste0("zcat < '", eg_gz, "'")), 
  temp_unzip(eg_bz, fread),
  temp_unzip(eg_zip, fread),
  temp_unzip(eg_gz, fread),
  times = 1)
Unit: seconds
                                          expr      min       lq     mean   median       uq      max neval
                                 fread(eg_csv) 2.117812 2.117812 2.117812 2.117812 2.117812 2.117812     1
 fread(input = paste0("zcat < '", eg_gz, "'")) 1.984009 1.984009 1.984009 1.984009 1.984009 1.984009     1
                      temp_unzip(eg_bz, fread) 6.304849 6.304849 6.304849 6.304849 6.304849 6.304849     1
                     temp_unzip(eg_zip, fread) 2.481650 2.481650 2.481650 2.481650 2.481650 2.481650     1
                      temp_unzip(eg_gz, fread) 2.487811 2.487811 2.487811 2.487811 2.487811 2.487811     1


## 3. temp_unzip.R
# To decompress zip, gz, bzip2, xz into temp file, run function then remove temp file.
temp_unzip <- function(filename, fun, ...){
  BFR.SIZE <- 1e7
  if (!file.exists(filename)) {
    stop("No such file: ", filename);
  }
  if (!is.function(fun)) {
    stop(sprintf("Argument 'fun' is not a function: %s", mode(fun)));
  }
  temp_dir <- tempdir()
  # test if it's zip
  files_in_zip <- try(utils::unzip(filename, list = TRUE)$Name, silent = TRUE)
  if (is.character(files_in_zip)) {
    # hidden files can be ignored: starting with ., ending with $, __MACOSX folder
    visible_files <- files_in_zip[!grepl("((^__MACOSX\\/.*)|(^\\..*)|(^.*\\$$))",
                                          files_in_zip)]
    # will not continue for multiple non-hidden files since behavior is not well defined.
    if(length(visible_files)>1) {
      stop(paste0("Zip file contains multiple visible files:\n",
                  paste0("    ", visible_files, collapse = "\n")))
    }
    if(length(visible_files) == 0) { stop("\n  No visible file found in Zip file")}
    # proceed with single non-hidden file
    utils::unzip(filename, files = visible_files[1], exdir = temp_dir, overwrite = TRUE)
    dest_file <- file.path(temp_dir, visible_files[1])
  } else {
    dest_file <- tempfile()
    # Setup input and output connections
    inn <- gzfile(filename, open = "rb")
    out <- file(description = dest_file, open = "wb")
    # Process
    nbytes <- 0
    repeat {
      bfr <- readBin(inn, what=raw(0L), size=1L, n=BFR.SIZE)
      n <- length(bfr)
      if (n == 0L) break;
      nbytes <- nbytes + n
      writeBin(bfr, con=out, size=1L)
      bfr <- NULL  # Not needed anymore
    }
    close(inn)
    close(out)
  }
  # call fun with temp file; remove the temp file even if fun fails
  on.exit(file.remove(dest_file), add = TRUE)
  res <- fun(dest_file, ...)
  return(res)
}

## 4. test.R
# need `R.utils` to compress files
temp_test <- "temp.csv"
temp_content <- "a,b,c
1,2,3
4,5,6
"
cat(file = temp_test, temp_content)

library(data.table)  # provides fread
library(R.utils)
# always remember the `remove` parameter when using `R.utils`!!!
gzip(temp_test, remove = FALSE)
bzip2(temp_test, remove = FALSE)

temp_unzip(paste0(temp_test, ".bz2"), fread)
temp_unzip(paste0(temp_test, ".gz"), fread)