
@infotroph
Last active August 29, 2015 14:22
filter short lines while reading CSV
# My input files have short header lines, then CSV data, then short footer lines.
# I'm currently trimming the short lines with an external call to sed,
# but I want a pure-R solution for portability.
# This version works nicely on small examples but gets very slow on large files,
# because append() grows the list, triggering a memory reallocation, for every line.
# Suggestions for speed improvement requested.
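# Hypothetical sketch of the input shape (not actual data):
#   <short metadata line>
#   <short metadata line>
#   col1,col2,col3,...            <- long CSV header row (> 65 characters)
#   1.2,3.4,5.6,...               <- long CSV data rows (> 65 characters)
#   -----BEGIN PGP SIGNATURE----- <- short footer lines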
read.longline = function(file){
    f = file(file, "r")
    lines = list()
    repeat{ # read short headers & discard
        l = readLines(f, n=1)
        if(length(l) == 0){
            break # hit EOF before finding any data rows
        }
        if(nchar(l) > 65){
            # We've found the first data row.
            # Leave it on the stack to process in the next loop.
            pushBack(l, f)
            break
        }
    }
    repeat{ # read long lines, add to CSV, break when short lines start again
        l = readLines(f, n=1)
        if(length(l) > 0 && nchar(l) > 65){
            # Naive implementation!
            # Likely to be VERY slow because we're growing lines every time.
            lines = append(lines, l)
        }else{
            # Either we've hit a short line == beginning of the PGP block,
            # or an empty read == end of the file.
            # Either way we're done.
            break
        }
    }
    close(f)
    # Now stitch lines together into a dataframe
    txtdat = do.call("paste", c(lines, sep="\n"))
    return(read.csv(text=txtdat, stringsAsFactors=FALSE))
}
@infotroph (Author)

Best suggestion so far: scan the file twice, once to find start and end lines and once with read.csv at those offsets:

read.twice = function(file){
    f = file(file, "r")
    skip = NA
    last = NA
    cur = 0
    repeat{
        l = readLines(f, n=1)
        if(length(l) > 0){
            cur = cur + 1
            if(nchar(l) > 65 && is.na(skip)){
                skip = cur - 1
            }
            if(nchar(l) <= 65 && !is.na(skip)){
                last = cur
                break
            }
        }else{
            break
        }
    }
    close(f)
    if(is.na(skip)){ skip = 0 }
    if(is.na(last)){ last = cur + 1 } # no footer found: pretend one sits just past EOF
    # Data rows run from line skip+2 (after the CSV header row) to line last-1.
    return(read.csv(file, skip=skip, nrows=(last - skip - 2)))
}

This is still about 2x slower than read.csv() on a clean version of the same data, so I suspect (without having profiled it) that essentially all of this version's remaining overhead is in file reading.
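(For anyone who wants to reproduce the comparison, a minimal timing sketch; the file names are hypothetical, and clean.csv stands for a copy of the data with the short lines already stripped:)

system.time(a <- read.twice("raw.csv"))   # raw.csv: short headers + CSV + short footers
system.time(b <- read.csv("clean.csv"))   # clean.csv: the same CSV with no extra lines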

@bpbond commented Jun 1, 2015

What about reading the file into a vector of strings, then identifying the first data line? For example (this is off the top of my head, I haven't run it):

fileContents <- readLines(f)
firstDataLine <- min(which(nchar(fileContents) > 65))
return(read.csv(connection(fileContents), skip=firstDataLine-1))

@infotroph (Author)

Is connection() part of a package I don't have, or pseudocode?

@infotroph (Author)

This seems to get me within a few percent of read.csv instead of multiples of it:

read.longlines = function(file){
    f = file(file, "r")
    f_lines = readLines(f)
    wanted = which(nchar(f_lines) > 65)
    firstData = min(wanted)
    lastData = max(wanted)
    return(read.csv(
        text=do.call(
            what="paste", 
            args=c(as.list(f_lines[firstData:lastData]), sep="\n")), 
        stringsAsFactors=FALSE))
}   

It runs in under about a minute on a 150,000-line test file, which isn't blazing but is probably good enough for my current purposes. Thanks!
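(A quick way to sanity-check the filtered read against a hand-cleaned copy of the same data; the file names here are hypothetical:)

a <- read.longlines("raw.csv")
b <- read.csv("clean.csv", stringsAsFactors=FALSE)
identical(a, b)  # should be TRUE if the 65-character cutoff catches exactly the data rows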

@bpbond commented Jun 2, 2015

You're welcome. Sorry, I meant textConnection. Here's some revised code--this time, I actually tested it!

fileContents <- readLines("test.txt")
firstDataLine <- min(which(nchar(fileContents) >= MIN_DATA_LINE_LENGTH))
lastDataLine <- max(which(nchar(fileContents) >= MIN_DATA_LINE_LENGTH))
con <- textConnection(fileContents)
read.table(con, skip=firstDataLine-1, nrows=lastDataLine-firstDataLine,sep=",",header=TRUE)

@infotroph (Author)

It took me a while to convince myself, but it looks as if pasting the lines together first is significantly faster than (at least this variant of) using a textConnection: https://gist.github.com/infotroph/cec9a9fb0158530d817f is a self-contained demo comparing the speed of different reading approaches with no filtering.

On my machine, a few thousand lines take milliseconds with a pasted string and minutes with a vector of lines, textConnection or not. What especially surprises me is that both approaches have a similar memory footprint, and the vector read seems to be entirely CPU-bound. No idea what it spends all those cycles on...

I'm still not sure whether this is general or whether my test case is pathological, but bottom line: if you came here looking to pass a vector of lines through a textConnection into read.csv, try pasting them all together into one string with embedded newlines and passing that to readr::read_csv instead; you may get a speedup.
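A rough sketch of that pattern (the file name is hypothetical, and depending on your readr version you may need to wrap the pasted string in I() to mark it as literal data):

library(readr)
f_lines <- readLines("raw.csv")
keep <- which(nchar(f_lines) > 65)
txt <- paste(f_lines[min(keep):max(keep)], collapse="\n")
dat <- read_csv(txt)  # readr reads a string with embedded newlines as literal data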
