@markdanese
Last active January 9, 2024 09:23
data.table fread fixed width file reader
# Reads fixed-width files, i.e. files with no delimiter (compare read_fwf() in the readr package).
# Requires GNU parallel and gawk to be installed and on the PATH.
# col_widths is a numeric vector of column widths (e.g., c(8, 4, 2, 9))
# input_file is a character string with the path to the input file (e.g., "./data/read.txt")
# Timings on a 300 MB file with 143 columns, on a 2018 MacBook Pro:
#   read_fwf from readr package: 10.8 sec
#   non-parallel use of gawk: 10.5 sec
#   parallel use of gawk: 4.4 sec (the function below)
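library(data.table)  # provides fread()

flat_fread <- function(col_widths, input_file) {
  # gawk expects FIELDWIDTHS as a space-separated list, e.g. "8 4 2 9"
  col_spec <- paste0(col_widths, collapse = " ")
  # GNU parallel splits the file into chunks; gawk rewrites each fixed-width line as CSV
  gawk_string <- paste0("parallel --pipepart -a ", input_file, " gawk \\'\\$1=\\$1\\' FIELDWIDTHS=\\'", col_spec, "\\' OFS=,")
  dt <- fread(cmd = gawk_string, header = FALSE, stringsAsFactors = FALSE)
  return(dt)
}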
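
To make the gawk step easier to follow, here is a minimal base-R sketch of the same transformation: each fixed-width line is cut at the given column widths and rejoined with commas, which is roughly what `$1=$1` with `FIELDWIDTHS` and `OFS=,` does per line. The helper name `fixed_to_csv` and the two sample lines are hypothetical and only for illustration; this is not the parallel pipeline itself and makes no claim about speed.

```r
# Hypothetical helper (not part of the gist): mimic the gawk rewrite in base R.
fixed_to_csv <- function(lines, col_widths) {
  ends <- cumsum(col_widths)          # last character of each column
  starts <- ends - col_widths + 1     # first character of each column
  vapply(
    lines,
    function(line) paste(substring(line, starts, ends), collapse = ","),
    character(1),
    USE.NAMES = FALSE
  )
}

# Made-up sample data matching the widths c(8, 4, 2, 9) from the comments above
sample_lines <- c("20240109ABCDXY123456789",
                  "20231231WXYZAB987654321")

csv_lines <- fixed_to_csv(sample_lines, c(8, 4, 2, 9))
print(csv_lines)

# Parse the CSV text; colClasses = "character" keeps every column as text
parsed <- read.csv(text = csv_lines, header = FALSE, colClasses = "character")
print(parsed)
```

With gawk and GNU parallel installed, the equivalent call on a real file would be flat_fread(c(8, 4, 2, 9), "./data/read.txt"), using the example values from the header comments.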