Last active
August 29, 2015 14:22
filter short lines while reading CSV
# My input files have short header lines, then CSV data, then short footer lines.
# I'm currently trimming the short lines with an external call to sed,
# but I want a pure-R solution for portability.
# This version works nicely on small examples but gets very slow on large files,
# because append() grows the list, triggering a memory reallocation, for every line.
# Suggestions for speed improvement requested.
read.longline = function(file){
	f = file(file, "r")
	lines = list()
	repeat{ # read short headers & discard
		l = readLines(f, n=1)
		if(length(l) == 0){
			# Hit end-of-file before finding any data rows;
			# bail out instead of looping forever.
			break
		}
		if(nchar(l) > 65){
			# We've found the first data row.
			# Leave it on the stack to process in the next loop.
			pushBack(l, f)
			break
		}
	}
	repeat{ # read long lines, add to CSV, break when short lines start again
		l = readLines(f, n=1)
		if(length(l) > 0 && nchar(l) > 65){
			# Naive implementation!
			# Likely to be VERY slow because we're growing lines every time.
			lines = append(lines, l)
		}else{
			# Either we've hit a short line == beginning of PGP block,
			# or an empty read == end of the file.
			# Either way we're done.
			break
		}
	}
	close(f)
	# Now stitch lines together into a dataframe
	txtdat = do.call("paste", c(lines, sep="\n"))
	return(read.csv(text=txtdat, stringsAsFactors=FALSE))
}
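
# One possible speedup, sketched below and not tested against the real input
# files: read the whole file at once with readLines(), do the length test with
# a vectorized nchar() instead of a per-line loop, and paste the survivors into
# one string. The 65-character threshold is carried over from the function
# above. Note one behavioral assumption: this keeps *every* long line in the
# file, whereas the original stops at the first short line after the data
# block, so it is only equivalent when headers and footers are all short.

read.longline2 = function(file, min.chars = 65){
	all.lines = readLines(file)           # one read, no per-line loop
	keep = nchar(all.lines) > min.chars   # vectorized length test
	txtdat = paste(all.lines[keep], collapse = "\n")
	read.csv(text = txtdat, stringsAsFactors = FALSE)
}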
It took me a while to convince myself, but it looks as if pasting the lines together first is significantly faster than (at least this variant of) using a `textConnection`: https://gist.github.com/infotroph/cec9a9fb0158530d817f is a self-contained demo comparing the speed of different reading approaches with no filtering. On my machine, a few thousand lines take milliseconds with a `paste`d string and minutes with a vector of lines, `textConnection` or not. What especially surprises me is that both approaches have a similar memory footprint & the vector read seems to be entirely CPU-bound. No idea what it spends all those cycles on... I'm still not sure if this is general or if my test case is pathological, but bottom line: if you came here looking to pass a vector of lines through a `textConnection` into `read.csv`, try `paste`ing them all together as one string with internal newlines and passing that into `readr::read_csv` instead; maybe you'll get a speedup.
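
For reference, the two patterns being compared look like this (a minimal illustration with made-up data, using base `read.csv` for both so the only difference is how the lines are delivered):

```r
lines_vec <- c("a,b", "1,2", "3,4")

# Pattern 1: one connection backed by a vector of lines
slow <- read.csv(textConnection(lines_vec))

# Pattern 2: collapse to a single string with internal newlines first
fast <- read.csv(text = paste(lines_vec, collapse = "\n"))

identical(slow, fast)  # TRUE; same result, very different speed at scale
```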