@afrad
afrad / WracBaseTest.scala
Created May 2, 2016 18:33
Read and search Common Crawl WARC files directly from S3 using Spark
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.LongWritable
import org.apache.spark.rdd.RDD
import org.apache.spark.{SerializableWritable, SparkConf, SparkContext}
import org.warcbase.io.WarcRecordWritable
import org.warcbase.mapreduce.WacWarcInputFormat
import org.warcbase.spark.archive.io.{ArchiveRecord, WarcRecord}
import org.warcbase.spark.rdd.RecordRDD._
/**