Skip to content

Instantly share code, notes, and snippets.

View stdatalabs's full-sized avatar

Sachin Thirumala stdatalabs

View GitHub Profile
@stdatalabs
stdatalabs / WordCounterBolt.java
Last active October 24, 2016 07:54
A word counter bolt to print list of top words over a time interval. More @ stdatalabs.blogspot.com
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.HashMap;
import java.util.Map;
@stdatalabs
stdatalabs / IgnoreWordsBolt.java
Last active October 24, 2016 07:53
A bolt to filters out a predefined set of words. More @ stdatalabs.blogspot.com
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import java.util.Arrays;
import java.util.HashSet;
@stdatalabs
stdatalabs / StringWordSplitterBolt.java
Last active October 24, 2016 07:52
A string word splitter bolt that receives tweets and emits its words which are over a certain length. More @ stdatalabs.blogspot.com
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import twitter4j.Status;
import java.util.Map;
@stdatalabs
stdatalabs / KafkaTwitterTopology.java
Last active October 24, 2016 07:50
A Storm topology to consume messages from kafka topic and count list of top words used in tweets. More @ stdatalabs.blogspot.com
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;
import java.util.Arrays;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
@stdatalabs
stdatalabs / KafkaTwitterProducer.java
Last active October 24, 2016 07:49
A Kafka producer that receives twitter data streams and publishes to kafka topic. More @ stdatalabs.blogspot.com
import java.util.Arrays;
import java.util.Properties;
import java.util.concurrent.LinkedBlockingQueue;
import twitter4j.*;
import twitter4j.conf.*;
import twitter4j.StallWarning;
import twitter4j.Status;
import twitter4j.StatusDeletionNotice;
<dependency>
<groupId>org.apache.storm</groupId>
<artifactId>storm-core</artifactId>
<version>0.10.0</version>
<scope>provided</scope>
<exclusions>
<exclusion>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</exclusion>
@stdatalabs
stdatalabs / TwitterAvroSource.conf
Created September 29, 2016 18:45
A Flume Avro source to consume Twitter Stream
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = avroSink
# Describing/Configuring the source
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey=<-- paste consumer key here -->
TwitterAgent.sources.Twitter.consumerSecret=<-- paste consumer secret here -->
TwitterAgent.sources.Twitter.accessToken=<-- paste access token here -->
@stdatalabs
stdatalabs / FlumeSparkPopularHashTags.scala
Last active October 24, 2016 07:46
Spark Streaming - Flume Avro Sink - PopularHashTags. More @ stdatalabs.blogspot.com
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.twitter._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.{ SparkContext, SparkConf }
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume._
/**
@stdatalabs
stdatalabs / SparkPopularHashTags.scala
Last active November 11, 2017 03:23
TwitterPopularHashTags using Spark Streaming. More @ stdatalabs.blogspot.com
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.twitter._
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.{ SparkContext, SparkConf }
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.flume._
/**