@oluies
Created September 12, 2016 13:04
Element-wise sum of a Vector column in Spark
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import breeze.linalg.{DenseVector => BDV}

// Load the pivoted per-customer data and keep the key plus the assembled feature vector.
val t_df = sqlContext.read.parquet("/user/s89718/Pivoted_cust_weekday_total_with_Clusters.parquet")
val tm_df = t_df.select("IP_ID", "assembled")

// Zero vector used as the fold's neutral element (7 weekday slots).
val emptyVector = BDV(Array.fill(7)(0.0))

// Convert each MLlib Vector to a Breeze DenseVector and sum them element-wise.
val zeVector = tm_df
  .rdd
  .map { case Row(k: String, v: Vector) => (k, BDV(v.toDense.values)) }
  .map(_._2)
  .fold(emptyVector) { (acc, t) => acc += t }

// Convert the Breeze result back to an MLlib Vector and print it.
val zeVectorDense = Vectors.dense(zeVector.toArray)
println(zeVectorDense)
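
If a per-dimension statistical summary (mean, variance, min/max) is wanted rather than just the element-wise sum, MLlib's Statistics.colStats can compute it directly over an RDD[Vector]. The sketch below is a minimal alternative, assuming the same tm_df DataFrame and Spark 1.x MLlib API as the snippet above.

import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.Statistics

// Extract the MLlib Vector column as an RDD[Vector].
val vecRdd = tm_df.rdd.map { case Row(_: String, v: Vector) => v }

// colStats returns a MultivariateStatisticalSummary with per-dimension statistics.
val summary = Statistics.colStats(vecRdd)
println(summary.mean)      // element-wise mean
println(summary.variance)  // element-wise variance
println(summary.normL1)    // element-wise sum of absolute values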