@kmader
Last active August 29, 2015 13:56
SparkR with CSV Data
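# Load SparkR from a local build and start a local Spark context with 40 threads
library(SparkR, lib.loc = "/home/mader/tools/SparkR-pkg/lib")
sc <- sparkR.init(master = "local[40]")
# Read the CSV as an RDD of text lines
cfile <- textFile(sc, "/home/mader/work/simple.csv")
# Peek at the first rows; the first line holds the column names
testinput <- take(cfile, 3)
headertext <- strsplit(testinput[[1]], ",")[[1]]
# Index of the column used as the grouping key
sample.col <- which(headertext == "file.name")
# Turn one CSV line into list(key, named list of numeric values)
format.row <- function(in.row) {
  in.txt <- strsplit(in.row, ",")[[1]]
  # Note: the header line is still part of cfile; its fields convert to NA with a warning
  list(in.txt[sample.col],
       setNames(as.list(as.double(in.txt[-sample.col])), headertext[-sample.col]))
}
lab.file <- lapply(cfile, format.row)
# Group the rows that share a file.name, using 1000 partitions
sample.files <- groupByKey(lab.file, 1000L)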
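
The row-parsing step can be checked locally in plain R, without a Spark context, before mapping it over the RDD. The sketch below is only an illustration: the header line, the data line, and the column names `x` and `y` are made up and stand in for the real contents of simple.csv, and `format.row` is written with `setNames` to build the named value list. It shows the (key, values) pair that each CSV line would contribute to `groupByKey`.

```r
# Hypothetical header and data line, standing in for the first rows of simple.csv
header.line <- "file.name,x,y"
data.line <- "scan_001.tif,1.5,2"

# Column names and the position of the key column
headertext <- strsplit(header.line, ",")[[1]]
sample.col <- which(headertext == "file.name")

# Convert one CSV line into list(key, named list of numeric values)
format.row <- function(in.row) {
  in.txt <- strsplit(in.row, ",")[[1]]
  list(in.txt[sample.col],
       setNames(as.list(as.double(in.txt[-sample.col])), headertext[-sample.col]))
}

parsed <- format.row(data.line)
str(parsed)
```

The first element of `parsed` is the file name used as the key, and the second is a list of doubles named after the remaining columns.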