@rvprasad
Last active August 29, 2015 14:05
read.csv-style functions in R read the entire data file in one sweep. Hence, they are hard to use with files that do not fit into the memory of the host machine. Here's an R function to read such large files in chunks as separate data frames. The only requirement is that there is one column in the read data such that all records/rows with identical v…
#' Read a file in chunks
#'
#' @param theFile an open connection providing the data, e.g., file('data/transactions.csv', 'r').
#' @param headers column names of the data being read.
#' @param leftOver rows that were read but not returned by the previous invocation of this function (e.g., NULL on the first invocation).
#' @param col column on which the data is grouped.
#' @return a list of two elements: data provided by the current invocation and leftOver to be used during the next invocation.
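getDataFrameForNextId <- function(theFile, headers, leftOver, col) {
  # Keep reading until leftOver holds at least two distinct values of col,
  # which means all rows for the first value have been read.
  while (NROW(leftOver) == 0 || NROW(unique(leftOver[, col])) < 2) {
    # header = FALSE: column names come from `headers`, so no data row is
    # consumed as a header (assumes any header line was already read).
    # read.csv raises an error on an exhausted connection; treat it as EOF.
    tmp1 <- tryCatch(read.csv(theFile, header = FALSE, nrows = 100000),
                     error = function(e) NULL)
    if (NROW(tmp1) == 0) { break }
    colnames(tmp1) <- headers
    leftOver <- rbind(leftOver, tmp1)
  }
  # Split off the rows that belong to the first value of col.
  tmp1 <- unique(leftOver[, col])[1]
  data <- leftOver[leftOver[, col] == tmp1, ]
  leftOver <- leftOver[leftOver[, col] != tmp1, ]
  return(list(data = data, leftOver = leftOver))
}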
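Reading in chunks like this depends on how R connections behave: when a connection is already open, successive read.csv calls continue from where the previous call stopped. The sketch below illustrates that with made-up in-memory data, using textConnection as a stand-in for file(..., 'r'). The sample values and the chunk size of 2 are assumptions chosen for illustration only.

```r
# Made-up sample data standing in for a large file; there is no header line.
conn <- textConnection(c("1,10", "1,20", "2,30", "3,40", "3,50"))

# Each call consumes the next `nrows` lines from the open connection.
# header = FALSE keeps the first line of every chunk as data.
chunk1 <- read.csv(conn, header = FALSE, nrows = 2)
chunk2 <- read.csv(conn, header = FALSE, nrows = 2)
chunk3 <- read.csv(conn, header = FALSE, nrows = 2)  # only one line is left

# Once the connection is exhausted, read.csv signals an error
# rather than returning an empty data frame.
atEnd <- tryCatch({
  read.csv(conn, header = FALSE, nrows = 2)
  FALSE
}, error = function(e) TRUE)

close(conn)
```

With the default header = TRUE, read.csv would instead treat the first line of every chunk as column names, so one data row per chunk would be lost.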