David Rodriguez (@DavidRdgz)

  • San Francisco State University - Graduate Student
  • Berkeley, CA
DavidRdgz / MapRawData.scala
Created May 7, 2018 17:16
Approximate unique counts in Spark using Algebird's HyperLogLog (HLL) monoid, with tests
package com.dvidr.counts
import com.twitter.algebird.{HLL, HyperLogLogMonoid}
import org.apache.spark.rdd.RDD
case class EmailSchema(sender: String,
                       to: String,
                       cc: String,
                       bcc: String,
                       sentDate: String,
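
The preview above cuts off, but the idea is classic: HLL sketches form a monoid, so per-partition sketches can be summed into one global approximate distinct count. A minimal Python sketch of the same idea, using the datasketch library as a stand-in for Algebird's HyperLogLogMonoid (the library choice, p value, and sample data are assumptions, not the gist's code):

# Approximate distinct senders, mirroring the HLL idea from the gist.
# datasketch stands in for Algebird's HyperLogLogMonoid (an assumption).
from datasketch import HyperLogLog

senders = ["alice@x.com", "bob@y.com", "alice@x.com", "carol@z.com"]

hll = HyperLogLog(p=12)          # 2^12 registers; more registers -> lower error
for s in senders:
    hll.update(s.encode("utf8"))
print(int(hll.count()))          # ~3; the count is approximate by design

# HLLs merge like a monoid: union two sketches without re-reading the data.
other = HyperLogLog(p=12)
other.update(b"dave@w.com")
hll.merge(other)
print(int(hll.count()))          # ~4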
DavidRdgz / Vagrantfile
Created May 5, 2018 18:00
A quick Vagrant machine with Hadoop & Spark, provisioned with Ansible
Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/xenial64"
  config.vm.hostname = "spark.xenial.box"
  config.vm.network :private_network, ip: "192.168.0.42"
  config.vm.synced_folder "./data", "/vagrant_data"
  config.vm.provider "virtualbox" do |vb|
    vb.gui = false
    vb.memory = 4096
  end
  # Ansible provisioner per the description; the playbook path is an assumption.
  config.vm.provision "ansible" do |ansible|
    ansible.playbook = "playbook.yml"
  end
end
DavidRdgz / MapperLMDB.java
Created March 26, 2018 14:46
A Hadoop mapper that appends and tags new information on a field, via a file memory-mapped with LMDB.
/**
* gradle clean
* gradle build
* <p>
* hadoop jar build/libs/mapper-lmdb-1.0-SNAPSHOT.jar com.dvidr.MapperLMDB src/main/resources/keys.txt src/main/resources/output
*/
package com.dvidr;
import org.apache.hadoop.conf.Configuration;
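
A minimal Python sketch of the LMDB half of this pattern, using the lmdb package as a stand-in for the gist's Java bindings (the file name, keys, and tags below are made up for illustration): build a memory-mapped key-value store once, then tag each record by looking up one of its fields.

# Sketch: enrich records via lookups in a memory-mapped LMDB file.
# The gist does the same from a Hadoop mapper in Java; paths/keys are made up.
import lmdb

env = lmdb.open("keys.lmdb", map_size=10 * 1024 * 1024)

# One-time build: store tags keyed by the field we will join on.
with env.begin(write=True) as txn:
    txn.put(b"alice@x.com", b"internal")
    txn.put(b"bob@y.com", b"external")

# Map phase: tag each record by its sender field.
records = [{"sender": "alice@x.com"}, {"sender": "eve@q.com"}]
with env.begin() as txn:
    for rec in records:
        tag = txn.get(rec["sender"].encode("utf8"))
        rec["tag"] = tag.decode("utf8") if tag else "unknown"
        print(rec)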
DavidRdgz / PivotTable.java
Created February 16, 2018 20:44
ML models consume vectors, often sparse ones. Building a pivot table in MapReduce is one way to produce a sparse vector.
/**
* gradle clean
* gradle build
*
* hadoop jar build/libs/pivot-table-1.0-SNAPSHOT.jar com.dvidr.PivotTable src/main/resources/pivotdata.txt src/main/resources/output
*
*/
package com.dvidr;
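
A minimal Python sketch of the pivot idea under the same assumption the gist makes: input arrives as (row, column, value) triples, and grouping by row yields one sparse vector per row, just as the MapReduce shuffle would (the triples below are made up):

# Sketch of the pivot-table idea: group (row, column, value) triples by row,
# producing one sparse vector per row, as a MapReduce shuffle-and-reduce would.
from collections import defaultdict

triples = [
    ("user1", "clicks", 3),
    ("user1", "buys", 1),
    ("user2", "clicks", 7),
]

# "Map" emits (row, (col, val)); this group-by stands in for the shuffle.
rows = defaultdict(dict)
for row, col, val in triples:
    rows[row][col] = rows[row].get(col, 0) + val  # sum duplicate cells

# "Reduce" output: each row is a sparse vector keyed by column name.
for row, vec in rows.items():
    print(row, vec)   # e.g. user1 {'clicks': 3, 'buys': 1}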
DavidRdgz / reservoir-sampling.py
Last active February 11, 2018 17:00
Sampling an infinite stream with reservoir sampling
#!/usr/bin/python
import random
shuf = random.sample(range(100000), 1000)
# Keep a bag (reservoir) of k = 10 elements
# Pre-fill the bag with the first k items of the stream
bag = shuf[:10]
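
The preview stops after pre-filling the bag; here is a self-contained sketch of the remaining step of Algorithm R (my completion, not the gist's code): each later item i replaces a random slot with probability k/(i+1), which keeps the bag a uniform sample of everything seen so far.

# Reservoir sampling, Algorithm R: a uniform sample of k items from a stream
# of unknown length, in one pass and O(k) memory.
import random

k = 10
stream = iter(range(100000))             # stands in for an unbounded stream

bag = [next(stream) for _ in range(k)]   # pre-fill with the first k items
for i, item in enumerate(stream, start=k):
    j = random.randrange(i + 1)          # uniform over positions 0..i
    if j < k:
        bag[j] = item                    # item i survives with probability k/(i+1)

print(bag)   # a uniform sample of everything seen so far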
DavidRdgz / time-dependent-graphs.py
Last active February 24, 2017 17:09
Plotting time-dependent graphs with NetworkX and matplotlib
import networkx as nx
import matplotlib.pyplot as plt
import matplotlib
# -------------- Get Data/Graphs -------------
with open('graph-data/g1.txt', 'r') as g:
g1 = map(lambda x: eval(x), g.readlines())
with open('graph-data/g2.txt', 'r') as g:
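
A minimal sketch of where this preview is headed, with assumptions: hard-coded edge lists stand in for the graph-data/*.txt files, ast.literal_eval replaces the gist's eval, and one subplot is drawn per time step.

# One NetworkX graph per time step, drawn side by side.
import ast
import matplotlib.pyplot as plt
import networkx as nx

# Each line is an edge list for one time step; ast.literal_eval is a safer
# stand-in for the gist's eval().
lines = ["[(1, 2), (2, 3)]", "[(1, 2), (2, 3), (3, 1)]"]
snapshots = [ast.literal_eval(line) for line in lines]

fig, axes = plt.subplots(1, len(snapshots))
pos = None
for t, (edges, ax) in enumerate(zip(snapshots, axes)):
    G = nx.Graph()
    G.add_edges_from(edges)
    pos = pos or nx.spring_layout(G)   # reuse the layout so nodes don't jump
    nx.draw(G, pos=pos, ax=ax, with_labels=True)
    ax.set_title("t = %d" % t)
plt.show()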
DavidRdgz / spike.py
Created August 15, 2016 17:33
LSTM-based anomaly detection of web domain query activity gathered from OpenDNS
from __future__ import print_function
"""
Using OpenDNS domain query activity, we retrieve 5 days
of queries/hour to a domain for 240+ domains (stored
in dns.json). We predict the number of queries in
the next hour using an LSTM recurrent neural network.
An ad hoc anomaly detection scheme is outlined in the
final for loop.
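
A compressed sketch of that experiment's shape (synthetic data and the current Keras API, so everything here is an assumption except the overall recipe): predict the next value from a sliding window with an LSTM, then flag points whose prediction error is far outside the norm.

# Train an LSTM to predict the next hour from a sliding window, then flag
# hours with unusually large prediction error. Synthetic data, not dns.json.
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

series = np.sin(np.linspace(0, 50, 600)) + 0.1 * np.random.randn(600)
series[500] += 5.0                          # inject a spike to detect

window = 24                                 # 24 hours of history per sample
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., np.newaxis]                      # (samples, timesteps, features)

model = Sequential([LSTM(32, input_shape=(window, 1)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)

# Ad hoc anomaly rule: error beyond 4 standard deviations of all errors.
errors = np.abs(model.predict(X, verbose=0).ravel() - y)
print(np.where(errors > errors.mean() + 4 * errors.std())[0])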
DavidRdgz / AliceInGraphXLand.scala
Last active March 14, 2016 02:00
Mining Alice in Wonderland with Spark's GraphX, as if the book were chat messages between Alice, Rabbit, Magpie, and Hatter.
package com.dvidr
import org.apache.spark.graphx.{VertexRDD, Edge, Graph}
import org.apache.spark.sql._
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random
case class Chat(id: Int, name: String, talk: String)
case class ChatGraph(id: Int, dst: Int, replyIn: Int, name: String, talk: String)
case class TopChat(id: Int, name: String, talk: String, inDeg: Int, outDeg: Int)
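
The TopChat summary hinges on in- and out-degrees of the reply graph. A tiny Python sketch of the same computation, using networkx as a stand-in for GraphX (the names and edges below are illustrative, not parsed from the book):

# The TopChat idea in miniature: who sends and receives the most messages.
# networkx stands in for GraphX's inDegrees/outDegrees here.
import networkx as nx

G = nx.MultiDiGraph()          # allows multiple replies between the same pair
replies = [("Alice", "Rabbit"), ("Rabbit", "Alice"),
           ("Hatter", "Alice"), ("Magpie", "Alice")]
G.add_edges_from(replies)

for name in G.nodes:
    print(name, "in:", G.in_degree(name), "out:", G.out_degree(name))
# Alice receives the most replies, as the book would suggest.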
DavidRdgz / AliceInAggregatorLand.scala
package com.dvidr
import com.twitter.algebird.{Moments, Aggregator}
import scala.util.Random
/*
Please refer to AliceInAggregatorLand first:
https://gist.github.com/johnynek/814fc1e77aad1d295bb7
This is an adaptation where "Alice In Wonderland" is turned
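
The point of pairing Moments with an Aggregator is that Moments is a monoid: two running summaries combine into the summary of the concatenated data. A Python sketch of that combine step for count, mean, and variance (my reconstruction, tracking fewer moments than Algebird's Moments does):

# Why Moments forms a monoid: two summaries merge into the summary of the
# combined data (count / mean / M2 here; Algebird also tracks higher moments).
from dataclasses import dataclass

@dataclass
class Moments:
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0            # sum of squared deviations from the mean

    def __add__(self, other):
        n = self.n + other.n
        if n == 0:
            return Moments()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Moments(n, mean, m2)

def of(x):                     # lift one value, like Moments(x) in Algebird
    return Moments(1, float(x), 0.0)

data = [1.0, 2.0, 3.0, 4.0]
left = sum((of(x) for x in data[:2]), Moments())
right = sum((of(x) for x in data[2:]), Moments())
total = left + right
print(total.n, total.mean, total.m2 / total.n)   # 4, 2.5, variance 1.25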
DavidRdgz / MyMonoid.scala
Created February 27, 2016 02:24
An Algebird map monoid over Numeric value types, allowing per-key addition of the values.
package com.dvidr.storm.bolt
import com.twitter.algebird.Monoid
// Context bound (V: Numeric) supplies the implicit that .sum needs;
// the original bound V <: Numeric does not compile.
class MyMonoid[K, V: Numeric] extends Monoid[Map[K, V]] {
  override def zero: Map[K, V] = Map.empty[K, V]
  override def plus(x: Map[K, V], y: Map[K, V]): Map[K, V] = {
    val list = x.toList ++ y.toList
    list.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }
  }
}
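
Note that Algebird also ships a generic map monoid: given a Semigroup on the value type, it derives a Monoid[Map[K, V]] that adds values per key, so this class is mostly instructive. For quick intuition, Python's collections.Counter behaves like the same map monoid:

# Counter addition is per-key numeric addition: the map monoid in miniature.
from collections import Counter

a = Counter({"apples": 1, "pears": 2})
b = Counter({"apples": 3})
print(a + b)   # Counter({'apples': 4, 'pears': 2})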