Juan Martin Pampliega juanpampliega

@juanpampliega
juanpampliega / gist:fc089003d28f2718b54cdc2e6741888f
Created September 24, 2019 12:21
Ubuntu Setup Install New Machine Alacritty, tmux, python3, vscode, fzf,
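# Python 3 and pip
sudo apt install python3
sudo apt install python3-pip
# VS Code: trust Microsoft's signing key, then add the repository
curl https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor > microsoft.gpg
sudo install -o root -g root -m 644 microsoft.gpg /etc/apt/trusted.gpg.d/
sudo add-apt-repository "deb [arch=amd64] https://packages.microsoft.com/repos/vscode stable main"
sudo apt update
sudo apt install code
# Build dependencies for Alacritty
sudo apt-get install -y cmake libfreetype6-dev libfontconfig1-dev xclip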
juanpampliega / gist:f7b68c3546d921154ac9eaabf06a8911
Created June 2, 2018 21:46
Install OpenX Hive JSON SerDe in Amazon EMR to use it with Presto
# Do this on every node of the cluster
curl -O http://www.congiu.net/hive-json-serde/1.3.8/hdp23/json-serde-1.3.8-jar-with-dependencies.jar
sudo cp json-serde-1.3.8-jar-with-dependencies.jar /usr/lib/presto/plugin/hive-hadoop2/
sudo chown presto:presto /usr/lib/presto/plugin/hive-hadoop2/json-serde-1.3.8-jar-with-dependencies.jar
# Restart Presto so it loads the new JAR
sudo restart presto-server
juanpampliega / gettweets.py
Last active November 4, 2019 04:32
Python example to get tweets from stream using tweepy and write them to a file
#!/usr/bin/python
from tweepy import Stream, OAuthHandler
from tweepy.streaming import StreamListener
from progressbar import ProgressBar, Percentage, Bar
import json
import sys
# Twitter app information
consumer_secret = 'Your consumer secret'
juanpampliega / csv2sqlitedb.sh
Created October 6, 2017 20:42
Convert a csv file to a sqlite db file to enable easy querying
function csv2db() {
  # Import $1.csv into table $1 of $1.db, then open an interactive session on it.
  echo -e ".mode csv\n.import $1.csv $1" | sqlite3 "$1.db" && \
  sqlite3 -header -column "$1.db"
}
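The same import can be sketched in Python with only the standard library, for cases where the sqlite3 command-line tool is not available. This is an illustration, not part of the original gist: the name `csv_to_db` is hypothetical, the first CSV row is assumed to be a header, and every column is stored as TEXT.

```python
import csv
import sqlite3


def csv_to_db(csv_path, db_path, table):
    """Load csv_path into `table` inside the SQLite file db_path.

    Returns the number of data rows inserted.
    """
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # first row supplies the column names
        rows = list(reader)

    # Quote identifiers so column or table names with spaces still work.
    def quote(name):
        return '"{}"'.format(name.replace('"', '""'))

    columns = ", ".join("{} TEXT".format(quote(c)) for c in header)
    marks = ", ".join("?" for _ in header)

    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS {} ({})".format(quote(table), columns)
            )
            conn.executemany(
                "INSERT INTO {} VALUES ({})".format(quote(table), marks), rows
            )
    finally:
        conn.close()
    return len(rows)
```

Calling `csv_to_db("sales.csv", "sales.db", "sales")` would leave a `sales.db` file that can then be opened with `sqlite3 -header -column sales.db`, as in the shell function.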
val docs = sc.textFile("/opt/dataset/don-quijote.txt.gz")
val lower = docs.map(line => line.toLowerCase)
val words = lower.flatMap(line => line.split("\\s+"))
val counts = words.map(word => (word, 1))
val freq = counts.reduceByKey(_ + _)
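To make the data flow of the Spark snippet concrete without a cluster, here is a small plain-Python sketch of the same lowercase, split, and count-per-key steps. It is an illustration only: `word_freq` is a hypothetical helper, it works on an in-memory list rather than an RDD, and unlike the Scala version it discards the empty strings that splitting on whitespace can produce.

```python
from collections import Counter


def word_freq(lines):
    """Count lowercase word occurrences across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # str.split() with no arguments splits on runs of whitespace
        # and drops empty strings.
        counts.update(line.lower().split())
    return counts


freq = word_freq(["En un lugar de la Mancha", "de cuyo nombre no quiero acordarme"])
print(freq["de"])  # 2
```

`Counter.update` plays the role of `reduceByKey(_ + _)` here: each occurrence of a word adds 1 to that word's running total.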
juanpampliega / gist:1c5ffa6618cd3df8f1b2
Last active August 29, 2015 14:22
get detailed logs from yarn application master node
yarn logs -applicationId <application_id>
e.g.
yarn logs -applicationId application_1424284032717_0066
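When the command is run from a script, it can help to validate the ID before calling out to YARN. The sketch below is a hypothetical helper (`yarn_logs_command` is not part of any library) that assumes IDs follow the `application_<cluster timestamp>_<sequence>` shape shown in the example, and it uses `-applicationId`, the flag spelling documented for the YARN CLI.

```python
import re

# application_<cluster start timestamp>_<sequence number>
APP_ID_RE = re.compile(r"^application_(\d+)_(\d+)$")


def yarn_logs_command(application_id):
    """Return the argv list that fetches the logs of one application."""
    if not APP_ID_RE.match(application_id):
        raise ValueError("not a YARN application id: {!r}".format(application_id))
    return ["yarn", "logs", "-applicationId", application_id]
```

The returned list can be passed to `subprocess.run(..., check=True)` on a machine where the `yarn` client is installed and configured.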
juanpampliega / log4j.properties
Created May 24, 2015 04:40
log4j.properties for lowering spark-shell logging to WARN level. Place this file in the $SPARK_HOME/conf directory.
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=WARN
juanpampliega / TwitterTopHashtags.scala
Created May 16, 2015 06:18
Twitter Top Hashtags with Spark Streaming in spark-shell
import com.google.gson.Gson
import org.apache.spark.streaming.twitter.TwitterUtils
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.storage.StorageLevel
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
juanpampliega / TwitterSentiment.scala
Last active August 29, 2015 14:21
Code for running Twitter sentiment analysis with Spark Streaming in spark-shell
import com.google.gson.Gson
import org.apache.spark.streaming.twitter.TwitterUtils
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.storage.StorageLevel
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
juanpampliega / spark_shell_twitter_deps
Last active October 3, 2015 05:38
Download Twitter dependencies for Spark Streaming and execute spark-shell with them
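#!/usr/bin/env bash
USER_NAME=hbd
USER_HOME="/home/$USER_NAME"
# Create the download directory if it does not exist yet.
mkdir -p "$USER_HOME/twitter4j"
cd "$USER_HOME/twitter4j" || exit 1
# Get the Spark Streaming JAR.
# central.maven.org has been retired; repo1.maven.org is the current Maven Central host.
curl -O "https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-twitter_2.10/1.5.0/spark-streaming-twitter_2.10-1.5.0.jar"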
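The download URL follows the standard Maven repository layout, where the group ID becomes a directory path and the file name is `<artifact>-<version>.jar`. A minimal sketch of that mapping is shown below; `maven_jar_url` is a hypothetical helper, and the `base` default of `https://repo1.maven.org/maven2` is an assumption that can be replaced with whichever mirror is in use.

```python
def maven_jar_url(group_id, artifact_id, version,
                  base="https://repo1.maven.org/maven2"):
    """Build the download URL of a JAR from its Maven coordinates."""
    # Dots in the group ID become path separators in the repository layout.
    group_path = group_id.replace(".", "/")
    return "{}/{}/{}/{}/{}-{}.jar".format(
        base, group_path, artifact_id, version, artifact_id, version
    )


url = maven_jar_url("org.apache.spark", "spark-streaming-twitter_2.10", "1.5.0")
print(url)
```

The same function can generate URLs for other JARs needed on the spark-shell classpath from their coordinates, instead of hard-coding each one.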