
Sachin Thirumala stdatalabs

@stdatalabs
stdatalabs / customRecordReader.java
Last active October 5, 2017 19:44
A customRecordReader class to read every input split. More @ stdatalabs.blogspot.com
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
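The preview above stops at the imports. As an illustration (not the gist's actual Hadoop code), the job a record reader does for text input, turning a raw byte stream into (offset, line) records, can be sketched with plain JDK I/O:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only -- not the gist's code. A Hadoop record reader for text
// input walks a byte stream and hands back (byte offset, line) pairs; this
// sketch does the same with plain JDK I/O.
public class LineRecordReaderSketch {

    /** One (key, value) record, like Hadoop's (LongWritable, Text) pair. */
    public static final class Record {
        public final long offset; // byte offset where the line starts
        public final String line;
        Record(long offset, String line) { this.offset = offset; this.line = line; }
    }

    /** Read every line of the stream, pairing it with its starting byte offset. */
    public static List<Record> readAll(InputStream in) {
        List<Record> records = new ArrayList<>();
        long offset = 0;
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(new Record(offset, line));
                // advance past the line plus the '\n' that readLine() consumed
                offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream("ab\ncd\nef".getBytes(StandardCharsets.UTF_8));
        for (Record r : readAll(in)) {
            System.out.println(r.offset + "\t" + r.line);
        }
    }
}
```

A real Hadoop RecordReader also tracks split boundaries and progress; this sketch keeps only the offset/line pairing to show the shape of the contract.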
stdatalabs / CustomTextInputFormat.java
Last active October 24, 2016 08:03
A CustomTextInputFormat class that extends from TextInputFormat and calls the customRecordReader class. More @ stdatalabs.blogspot.com
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
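Again only the imports survive in the preview. The pattern the description names, an input format whose factory method hands back the custom reader, can be sketched with hypothetical stand-in interfaces (none of these names or signatures are Hadoop's real API):

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for Hadoop's InputFormat/RecordReader pair; every
// name and signature here is invented to show the delegation pattern only.
public class CustomInputFormatPattern {

    public interface RecordReaderSketch {
        String next(); // null once the split is exhausted
    }

    public interface InputFormatSketch {
        RecordReaderSketch getRecordReader(List<String> split);
    }

    // The custom input format overrides the factory method, so callers pull
    // records through the custom reader instead of a default line reader.
    public static class CustomTextInputFormatSketch implements InputFormatSketch {
        @Override
        public RecordReaderSketch getRecordReader(List<String> split) {
            Iterator<String> it = split.iterator();
            // "custom" behaviour for the demo: upper-case every record
            return () -> it.hasNext() ? it.next().toUpperCase() : null;
        }
    }
}
```

In the actual gist the overridden method is `getRecordReader` on `org.apache.hadoop.mapred.TextInputFormat`, returning the customRecordReader from the previous gist.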
stdatalabs / Hive customInputFormat - pom.xml
Created October 9, 2016 07:57
Dependencies for the Hive customInputFormat
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>0.11.0</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.1</version>
</dependency>
stdatalabs / JsonWordSplitterBolt.java
Last active October 24, 2016 08:02
A Storm bolt to split tuples sent by the KafkaSpout. More @ stdatalabs.blogspot.com
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import twitter4j.Status;
import java.util.Map;
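The imports hint at a BaseRichBolt whose execute() splits each tweet's text and emits one word per tuple. A stdlib sketch of just the splitting step (the regex and the kept characters are assumptions, not the gist's code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch of the splitting a word-splitter bolt performs on each
// incoming tuple's text; a real bolt would emit one tuple per returned word.
public class WordSplitterSketch {

    /** Lower-case the text and break it into alphanumeric words (keeping # and @). */
    public static List<String> split(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9#@]+")) {
            if (!token.isEmpty()) words.add(token);
        }
        return words;
    }
}
```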
stdatalabs / TwitterSampleSpout.java
Last active September 29, 2021 14:50
A Storm spout to produce a stream of tuples from the Twitter streaming API. More @ stdatalabs.blogspot.com
import java.util.Map;
import java.util.concurrent.LinkedBlockingQueue;
import twitter4j.FilterQuery;
import twitter4j.StallWarning;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
import twitter4j.StatusListener;
import twitter4j.TwitterStream;
stdatalabs / TwitterWordCountTopology.java
Last active October 24, 2016 08:00
A Storm topology to count the top words used in tweets about a topic. More @ stdatalabs.blogspot.com
import java.util.*;
import com.stdatalabs.Storm.IgnoreWordsBolt;
import com.stdatalabs.Storm.TwitterSampleSpout;
import com.stdatalabs.Storm.WordCounterBolt;
import com.stdatalabs.Storm.JsonWordSplitterBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.Config;
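Downstream of the splitter, the topology tallies words and keeps the most frequent ones. A minimal sketch of that counting step with plain Java collections (not the gist's WordCounterBolt):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the tallying the topology performs downstream of the
// splitter bolt: count each word, then report the n most frequent.
public class TopWordsSketch {

    public static List<String> topWords(List<String> words, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) counts.merge(w, 1, Integer::sum);
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> counts.get(b) - counts.get(a)); // most frequent first
        return ranked.subList(0, Math.min(n, ranked.size()));
    }
}
```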
stdatalabs / dbms_crypto_UDF.scala
Last active January 23, 2019 10:30
A Spark UDF to find the MD5 message digest of a column. More @ stdatalabs.blogspot.com
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._
import hiveContext.sql
import java.security.MessageDigest
/**
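The import list shows the digest comes from the JDK's java.security.MessageDigest. A minimal sketch of the hashing step such a UDF wraps, rendering the digest as lower-case hex (the exact output format of the gist's UDF is not shown in the preview):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// The hashing step behind an MD5 column UDF: digest the UTF-8 bytes of the
// input and render the 16-byte result as lower-case hex.
public class Md5Sketch {

    public static String md5Hex(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(32);
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e); // never on a standard JRE
        }
    }
}
```

In Spark such a function would be registered as a UDF and applied per row of the target column.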
stdatalabs / KafkaSparkPopularHashTags.scala
Last active November 11, 2017 03:26
A Spark Streaming and Kafka integration to receive Twitter data from a Kafka topic and find the popular hashtags. More @ stdatalabs.blogspot.com
import java.util.HashMap
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerConfig, ProducerRecord }
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.twitter._
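Once tweets arrive from the Kafka topic, finding popular hashtags reduces to pulling '#'-prefixed tokens out of each status and counting them per window. A stdlib sketch of that tally (the tokenisation rule is an assumption, not the gist's code):

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Sketch of the per-window hashtag tally the streaming job maintains: extract
// '#'-prefixed tokens from each tweet's text and count their occurrences.
public class HashtagCountSketch {

    public static Map<String, Integer> countHashtags(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split("\\s+")) {
            if (token.startsWith("#") && token.length() > 1) {
                counts.merge(token.toLowerCase(Locale.ROOT), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```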
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka_2.10 -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.10</artifactId>
<version>0.8.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.twitter4j/twitter4j-core -->
<dependency>
<groupId>org.twitter4j</groupId>
<artifactId>twitter4j-core</artifactId>