
@mlehman
mlehman / MultipleOutputsExample.scala
Last active April 11, 2022 06:54
Hadoop MultipleOutputs on Spark Example
/* Example using MultipleOutputs to write a Spark RDD to multiple files.
 * Based on saveAsNewAPIHadoopFile implemented in org.apache.spark.rdd.PairRDDFunctions,
 * org.apache.hadoop.mapreduce.SparkHadoopMapReduceUtil.
 */
val values = sc.parallelize(List(
  ("fruit/items", "apple"),
  ("vegetable/items", "broccoli"),
  ("fruit/items", "pear"),
  ("fruit/items", "peach"),
  ("vegetable/items", "celery"),
  ("vegetable/items", "spinach")
))
mlehman / gist:69dc9eaa1e254080833c
Created August 2, 2014 15:05
Save Case Class as TSV On Spark
implicit class ProductRDD[T <: Product](rdd: RDD[T]) {
  /* Saves an RDD of Tuples into a TSV.
   * Ex: Employee(emp_id = 123, Name(first="Bob",last="Smith")) => "123\tBob\tSmith"
   */
  def saveAsTsv(path: String): Unit = {
    rdd.map(p => p.productIterator.flatMap {
      case a: Product => a.productIterator // flattens nested case classes
      case b => Seq(b)
    }.mkString("\t")).saveAsTextFile(path)
  }
}
mlehman / ejson.sh
Last active December 14, 2015 18:22
MongoDB to Extended JSON with SED
sed -e 's/NumberLong("*\(-*[[:digit:]]*\)"*)/{ "$numberLong" : "\1" }/' -e 's/ObjectId("*\([[:alnum:]]*\)"*)/{ "$oid" : "\1" }/'
mlehman / hgrep.sh
Created June 22, 2016 18:00
history grep
# utility function for your profile to search history
hgrep() {
  # uniq -c -f 1 skips the leading history index, so identical commands
  # are counted together; the final sort orders results by count
  history | grep "$1" | sort -k 2 | uniq -c -f 1 | sort
}
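The counting trick can be seen in isolation by feeding the pipeline simulated `history` output (the three lines below stand in for real history entries, which have the same "index, then command" shape):

```shell
# Each input line mimics a history entry: "  <index>  <command>".
# sort -k 2 groups identical commands; uniq -c -f 1 ignores the index
# field while counting, so the two "git status" entries collapse into one
# line with a count of 2.
printf '  1  git status\n  2  git push\n  3  git status\n' \
  | sort -k 2 \
  | uniq -c -f 1 \
  | sort
```

Without `-f 1`, the differing history indexes would make every line unique and all counts would be 1.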