
Umberto Griffo umbertogriffo

umbertogriffo / TwitterSentimentAnalysisAndN-gramWithHadoopAndHiveSQL.md
Last active May 11, 2021 13:22
Step by step Tutorial on Twitter Sentiment Analysis and n-gram with Hadoop and Hive SQL

PREREQUISITES

* Download the JSON SerDe from http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar and rename it to hive-serdes-1.0.jar.
* Add the JAR to the HIVE_AUX_JARS_PATH of HiveServer2:

    1. Copy the JAR file to the host on which HiveServer2 is running. Save the JAR to any directory you choose, and make a note of the path (create a directory in /usr/share/).
umbertogriffo / broadcast_join_medium_size.scala
Last active December 11, 2020 16:05
broadcast_join_medium_size
import org.apache.spark.sql.functions._
val mediumDf = Seq((0, "zero"), (4, "one")).toDF("id", "value")
val largeDf = Seq((0, "zero"), (2, "two"), (3, "three"), (4, "four"), (5, "five")).toDF("id", "value")
mediumDf.show()
largeDf.show()
/*
+---+-----+
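The gist above exercises Spark's broadcast join, where the smaller table is shipped to every executor as a hash map and the large side is joined by probing it. A minimal sketch of that shape in plain Java (no Spark; `BroadcastJoinSketch` and its inlined data are hypothetical stand-ins for `mediumDf` and `largeDf`):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BroadcastJoinSketch {
    // In-memory stand-ins for mediumDf and largeDf above (id -> value pairs).
    static final Map<Integer, String> MEDIUM = Map.of(0, "zero", 4, "one");
    static final List<Map.Entry<Integer, String>> LARGE = List.of(
            Map.entry(0, "zero"), Map.entry(2, "two"), Map.entry(3, "three"),
            Map.entry(4, "four"), Map.entry(5, "five"));

    // Broadcast-join shape: the small side becomes a hash map (the "broadcast"),
    // and every row of the large side probes it (inner join on id).
    static List<String> joined() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> row : LARGE) {
            String matched = MEDIUM.get(row.getKey());
            if (matched != null) {
                out.add(row.getKey() + "," + row.getValue() + "," + matched);
            }
        }
        return out;
    }
}
```

In Spark itself the same result typically comes from `largeDf.join(broadcast(mediumDf), Seq("id"))`, using the `broadcast` hint from `org.apache.spark.sql.functions`.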
umbertogriffo / DataFrameSuite.scala
Last active February 12, 2020 06:13
DataFrameSuite allows you to check whether two DataFrames are equal. You can assert DataFrame equality with the method assertDataFrameEquals. When the DataFrames contain doubles or Spark MLlib Vectors, you can assert that they are approximately equal with the method assertDataFrameApproximateEquals.
package test.com.idlike.junit.df
import breeze.numerics.abs
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{Column, DataFrame, Row}
/**
* Created by Umberto on 06/02/2017.
*/
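The approximate check is essentially an element-wise tolerance comparison. A minimal sketch of that core idea in plain Java (`ApproxEquality` is a hypothetical helper for illustration, not DataFrameSuite's actual API):

```java
public class ApproxEquality {
    // Two rows of doubles are "approximately equal" when every pair of
    // values differs by less than a tolerance; length mismatch fails fast.
    static boolean approxEqual(double[] a, double[] b, double tol) {
        if (a.length != b.length) return false;
        for (int i = 0; i < a.length; i++) {
            if (Math.abs(a[i] - b[i]) >= tol) return false;
        }
        return true;
    }
}
```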
umbertogriffo / RddAPI.scala
Last active January 29, 2020 12:57
This is a collection of examples covering Apache Spark's RDD API. These examples aim to help me test the RDD functionality.
/*
This is a collection of examples covering Apache Spark's RDD API. These examples aim to help me test the RDD functionality.
References:
http://spark.apache.org/docs/latest/programming-guide.html
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
*/
object RddAPI {
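Many of the RDD transformations can be sanity-checked against plain in-memory collections, since they have direct local analogues. A small Java sketch (hypothetical `RddLikeOps`; `reduceByKey` over `(word, 1)` pairs corresponds to grouping and counting locally):

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class RddLikeOps {
    // Local stand-in for an RDD[String]: a word count via the same
    // logical steps as map + reduceByKey, without a cluster.
    static final List<String> WORDS = List.of("spark", "rdd", "api", "spark");

    static Map<String, Long> wordCounts() {
        // groupingBy(identity) + counting mirrors reduceByKey(_ + _) on (w, 1).
        return WORDS.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }
}
```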
umbertogriffo / Winner.java
Created February 15, 2017 09:02
Java 8 Streams Cookbook
package knowledgebase.java.stream;
import java.time.Duration;
import java.util.*;
import static java.util.stream.Collectors.*;
/**
* Created by Umberto on 15/02/2017.
* https://dzone.com/articles/a-java-8-streams-cookbook
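The preview cuts off at the imports; as a rough, hypothetical illustration of the Collectors recipes such a cookbook covers (the `Winner` type and sample data here are invented for the sketch, not the gist's actual code):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StreamsCookbookSketch {
    // Hypothetical stand-in for the gist's Winner type.
    static final class Winner {
        final String country;
        final int year;
        Winner(String country, int year) { this.country = country; this.year = year; }
    }

    // groupingBy + mapping: collect the winning years per country,
    // preserving encounter order within each group.
    static Map<String, List<Integer>> yearsPerCountry(List<Winner> winners) {
        return winners.stream().collect(Collectors.groupingBy(
                w -> w.country,
                Collectors.mapping(w -> w.year, Collectors.toList())));
    }
}
```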
umbertogriffo / falsehoods-programming-time-list.md
Created August 6, 2019 10:03 — forked from timvisee/falsehoods-programming-time-list.md
Falsehoods programmers believe about time, in a single list

Falsehoods programmers believe about time

This is a compiled list of falsehoods programmers tend to believe about working with time.

Don't re-invent a date time library yourself. If you think you understand everything about time, you're probably doing it wrong.

Falsehoods

  • There are always 24 hours in a day.
  • February is always 28 days long.
  • Any 24-hour period will always begin and end in the same day (or week, or month).
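The first falsehood is easy to demonstrate with java.time: in a zone that observes daylight saving, the spring-forward day has only 23 hours.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class DayLength {
    // Length of a civil day: from local midnight to the next local midnight.
    // On 2021-03-14 America/New_York springs forward, so the day is 23 hours.
    static long hoursInDay(LocalDate date, ZoneId zone) {
        ZonedDateTime start = date.atStartOfDay(zone);
        ZonedDateTime end = date.plusDays(1).atStartOfDay(zone);
        return Duration.between(start, end).toHours();
    }
}
```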
umbertogriffo / Transpose.scala
Created October 26, 2016 08:05
Utility Methods to Transpose a org.apache.spark.mllib.linalg.distributed.RowMatrix
def transposeRowMatrix(m: RowMatrix): RowMatrix = {
  val transposedRowsRDD = m.rows.zipWithIndex.map { case (row, rowIndex) => rowToTransposedTriplet(row, rowIndex) }
    .flatMap(x => x) // now we have triplets (newRowIndex, (newColIndex, value))
    .groupByKey
    .sortByKey().map(_._2) // sort rows and remove row indexes
    .map(buildRow) // restore order of elements in each row and remove column indexes
  new RowMatrix(transposedRowsRDD)
}
def rowToTransposedTriplet(row: Vector, rowIndex: Long): Array[(Long, (Long, Double))] = {
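Locally, transposition is just swapping indices; the triplet/groupByKey dance above is only needed because a RowMatrix's rows are distributed across machines. A plain-Java sketch of the end result the method computes (hypothetical `TransposeSketch`, operating on in-memory arrays rather than a RowMatrix):

```java
public class TransposeSketch {
    // Every value moves from position (r, c) to (c, r) in the result,
    // which is exactly what the distributed triplets encode.
    static double[][] transpose(double[][] m) {
        int rows = m.length;
        int cols = m[0].length;
        double[][] t = new double[cols][rows];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                t[c][r] = m[r][c];
            }
        }
        return t;
    }
}
```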
umbertogriffo / JavaRddAPI.java
Created February 23, 2018 11:50
This is a collection of examples covering Apache Spark's JavaRDD API. These examples aim to help me test the JavaRDD functionality.
package test.idlike.spark.datastructure;
import org.apache.commons.lang3.SystemUtils;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;
import java.util.*;
import java.util.Map.Entry;
import java.util.stream.Collectors;
/**
* Created by Umberto on 16/05/2017.
*/
public class HashMapUtils {
umbertogriffo / TestPerformance.scala
Last active April 13, 2017 09:33
This Scala code tests the performance of Euclidean distance implemented with the map-reduce pattern, treeReduce and treeAggregate.
import org.apache.commons.lang.SystemUtils
import org.apache.spark.mllib.random.RandomRDDs._
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import scala.math.sqrt
/**
* Created by Umberto on 08/02/2017.
*/
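The distance being benchmarked fits the map-reduce pattern directly: map each coordinate pair to a squared difference, then reduce by summing. A local Java sketch of that shape (hypothetical `EuclideanSketch`; in Spark the same associative sum can be handed to reduce, treeReduce or treeAggregate):

```java
import java.util.stream.IntStream;

public class EuclideanSketch {
    // Euclidean distance between two vectors of equal dimension.
    static double distance(double[] a, double[] b) {
        double sumSquares = IntStream.range(0, a.length)
                .mapToDouble(i -> (a[i] - b[i]) * (a[i] - b[i])) // map step
                .sum();                                          // reduce step
        return Math.sqrt(sumSquares);
    }
}
```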