Mayank K Rastogi mayankrastogi

mayankrastogi / MultiTagXmlInputFormat.scala
Created March 6, 2019 06:28
Hadoop XML Input Format that supports sharding by multiple start and end tags
import java.io.IOException
import java.nio.charset.StandardCharsets
import com.google.common.io.Closeables
import MultiTagXmlInputFormat.MultiTagXmlRecordReader
import com.typesafe.scalalogging.LazyLogging
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{DataOutputBuffer, LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}