Umberto Griffo (umbertogriffo)

umbertogriffo / RddAPI.scala
Last active January 29, 2020 12:57
This is a collection of examples of Apache Spark's RDD API. These examples aim to help me test the RDD functionality.
/*
This is a collection of examples of Apache Spark's RDD API. These examples aim to help me test the RDD functionality.
References:
http://spark.apache.org/docs/latest/programming-guide.html
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html
*/
object RddAPI {
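The preview above stops at the object declaration. As a Spark-free illustration of the same idea, a plain `java.util.stream` pipeline mirrors the RDD `map`/`filter`/`reduce` chain; the data and names below are illustrative only, not from the gist:

```java
import java.util.Arrays;
import java.util.List;

// Spark-free sketch: java.util.stream mirrors the RDD map/filter/reduce chain.
// The List here is a local stand-in for an RDD; names are illustrative only.
public class RddLikeOps {
    public static int sumOfEvenSquares(List<Integer> data) {
        return data.stream()
                .map(x -> x * x)          // analogous to RDD.map
                .filter(x -> x % 2 == 0)  // analogous to RDD.filter
                .reduce(0, Integer::sum); // analogous to RDD.reduce
    }

    public static void main(String[] args) {
        System.out.println(sumOfEvenSquares(Arrays.asList(1, 2, 3, 4))); // prints 20
    }
}
```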
umbertogriffo / HashMapUtils.java
import java.util.*;
import java.util.Map.Entry;
import java.util.stream.Collectors;
/**
 * Created by Umberto on 16/05/2017.
 */
public class HashMapUtils {
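The preview shows only the imports of this utility class. Assuming the gist uses `Entry` and `Collectors` for the common task of sorting a map by value (an assumption, since the body is cut off), a minimal sketch could look like:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Map.Entry;
import java.util.stream.Collectors;

// Sketch only: the gist preview shows just the imports, so this assumes the
// utility sorts a map by value, a typical use of Entry plus Collectors.
public class MapSortSketch {
    public static <K, V extends Comparable<? super V>> Map<K, V> sortByValue(Map<K, V> map) {
        return map.entrySet().stream()
                .sorted(Entry.comparingByValue())
                .collect(Collectors.toMap(
                        Entry::getKey, Entry::getValue,
                        (a, b) -> a,           // merge function (keys are already unique)
                        LinkedHashMap::new));  // LinkedHashMap preserves the sorted order
    }
}
```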
umbertogriffo / TestPerformance.scala
Last active April 13, 2017 09:33
This Scala code tests the performance of the Euclidean distance implemented using the map-reduce pattern, treeReduce, and treeAggregate.
import org.apache.commons.lang.SystemUtils
import org.apache.spark.mllib.random.RandomRDDs._
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import scala.math.sqrt
/**
* Created by Umberto on 08/02/2017.
*/
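The preview cuts off before the benchmark body. The arithmetic being benchmarked is straightforward; as a plain-Java sketch of the map-reduce pattern (Spark's treeReduce/treeAggregate only change how partial sums are combined across partitions, not the computation itself):

```java
import java.util.stream.IntStream;

// Plain-Java sketch of the map-reduce pattern the Spark gist benchmarks:
// map each coordinate pair to a squared difference, reduce by summing,
// then take the square root of the total.
public class EuclideanDistance {
    public static double distance(double[] a, double[] b) {
        double sumOfSquares = IntStream.range(0, a.length)
                .mapToDouble(i -> (a[i] - b[i]) * (a[i] - b[i])) // map step
                .sum();                                          // reduce step
        return Math.sqrt(sumOfSquares);
    }
}
```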
umbertogriffo / Winner.java
Created February 15, 2017 09:02
Java 8 Streams Cookbook
package knowledgebase.java.stream;
import java.time.Duration;
import java.util.*;
import static java.util.stream.Collectors.*;
/**
* Created by Umberto on 15/02/2017.
* https://dzone.com/articles/a-java-8-streams-cookbook
*/
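The preview ends before any recipe appears. A representative recipe in the style of that cookbook is grouping and counting with `Collectors.groupingBy` and `counting`; the data below is illustrative, not taken from the gist:

```java
import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.*;

// Representative "cookbook" recipe: group a list and count occurrences
// with groupingBy + counting. The data is illustrative, not from the gist.
public class StreamRecipes {
    public static Map<String, Long> countByCountry(List<String> winners) {
        return winners.stream().collect(groupingBy(w -> w, counting()));
    }
}
```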
umbertogriffo / DataFrameSuite.scala
Last active February 12, 2020 06:13
DataFrameSuite lets you check whether two DataFrames are equal. You can assert exact equality with assertDataFrameEquals; when the DataFrames contain doubles or Spark MLlib Vectors, you can assert approximate equality with assertDataFrameApproximateEquals.
package test.com.idlike.junit.df
import breeze.numerics.abs
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{Column, DataFrame, Row}
/**
* Created by Umberto on 06/02/2017.
*/
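The core idea behind the approximate check can be sketched without Spark: two rows of doubles are "approximately equal" when every pair of values differs by at most a tolerance. This is a minimal illustration of that idea, not the suite's actual implementation, which works per Row and per MLlib Vector:

```java
// Spark-free sketch of the idea behind assertDataFrameApproximateEquals:
// two rows of doubles are approximately equal when every pair of values
// differs by at most a tolerance.
public class ApproxEquals {
    public static boolean approxEquals(double[] a, double[] b, double tol) {
        if (a.length != b.length) return false;
        for (int i = 0; i < a.length; i++) {
            if (Math.abs(a[i] - b[i]) > tol) return false;
        }
        return true;
    }
}
```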
umbertogriffo / Method1.java
Last active January 22, 2017 14:21
How to make the run() method of the class NoThreadSafe thread-safe in Java
public class Method1 {
/*
Adding synchronized to this method makes it thread-safe.
When synchronized is added to a static method, the Class object is the object which is locked.
*/
public static void main(String[] args) throws InterruptedException {
ProcessingThreadS pt = new ProcessingThreadS();
Thread t1 = new Thread(pt, "t1");
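The preview cuts off mid-setup. A minimal self-contained sketch of the fix the gist describes (a simplified stand-in for ProcessingThreadS, with a counter as the shared state) shows the effect of the static lock:

```java
// Minimal sketch of the fix described above: marking the method synchronized
// serializes access, and because the method is static the lock is the Class
// object (SafeCounter.class), so all threads contend on the same monitor.
public class SafeCounter {
    private static int count = 0;

    public static synchronized void increment() { // lock: SafeCounter.class
        count++;
    }

    public static int getCount() {
        return count;
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> { for (int i = 0; i < 10_000; i++) increment(); };
        Thread t1 = new Thread(task, "t1");
        Thread t2 = new Thread(task, "t2");
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(getCount()); // always 20000 thanks to synchronized
    }
}
```

Without synchronized, the unguarded count++ (a read-modify-write) lets the two threads interleave and lose updates.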
umbertogriffo / UniqueId.java
Last active March 6, 2023 08:16
Generate Long ID from UUID
/**
* Generate unique ID from UUID in positive space
* Reference: http://www.gregbugaj.com/?p=587
* @return long value representing UUID
*/
private Long generateUniqueId()
{
long val = -1;
do
{
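The preview truncates inside the do-while. One way to complete it, following the retry-until-positive idea the comment describes (a sketch, not necessarily the gist's exact body):

```java
import java.util.UUID;

// Sketch completing the truncated loop above: take the UUID's most
// significant bits and retry until the value lands in positive space.
public class UniqueId {
    public static long generateUniqueId() {
        long val = -1;
        do {
            val = UUID.randomUUID().getMostSignificantBits();
        } while (val < 0);
        return val;
    }
}
```

The loop is needed because roughly half of all random UUIDs have a negative high half; retrying discards those.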
umbertogriffo / Transpose.scala
Created October 26, 2016 08:05
Utility Methods to Transpose a org.apache.spark.mllib.linalg.distributed.RowMatrix
def transposeRowMatrix(m: RowMatrix): RowMatrix = {
val transposedRowsRDD = m.rows.zipWithIndex.map{case (row, rowIndex) => rowToTransposedTriplet(row, rowIndex)}
.flatMap(x => x) // now we have triplets (newRowIndex, (newColIndex, value))
.groupByKey
.sortByKey().map(_._2) // sort rows and remove row indexes
.map(buildRow) // restore order of elements in each row and remove column indexes
new RowMatrix(transposedRowsRDD)
}
def rowToTransposedTriplet(row: Vector, rowIndex: Long): Array[(Long, (Long, Double))] = {
  // emit (newRowIndex, (newColIndex, value)) triplets for each element
  row.toArray.zipWithIndex.map { case (value, colIndex) => (colIndex.toLong, (rowIndex, value)) }
}
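The triplet trick exists because a distributed RowMatrix cannot be transposed by index arithmetic alone. Locally the same regrouping collapses to a direct index swap; a plain-Java sketch on a 2D array (illustrative only, not the gist's code) makes the mapping explicit:

```java
// Local sketch of what the distributed version above does: each value is
// conceptually emitted as a (newRowIndex, newColIndex, value) triplet and
// regrouped by new row. On a plain 2D array that is a direct index swap.
public class TransposeSketch {
    public static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int row = 0; row < m.length; row++) {
            for (int col = 0; col < m[row].length; col++) {
                t[col][row] = m[row][col]; // triplet (col, row, value)
            }
        }
        return t;
    }
}
```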
umbertogriffo / ObjectPool.java
Created June 28, 2016 08:01
Generic Java object pool with minimalistic code
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
/**
* @param <T>
*/
public abstract class ObjectPool<T> {
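The preview stops at the class declaration. The essential borrow/return cycle of such a pool can be sketched in a few lines (the gist's imports suggest it also shrinks the pool on a schedule via Executors; that part is omitted here):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Minimal sketch of the pool's borrow/return cycle. The gist likely adds a
// scheduled shrinker (its imports include Executors); omitted for brevity.
public abstract class SimplePool<T> {
    private final Queue<T> pool = new ConcurrentLinkedQueue<>();

    protected abstract T createObject(); // subclass decides what to pool

    public T borrowObject() {
        T obj = pool.poll();                        // reuse if available
        return obj != null ? obj : createObject();  // otherwise create fresh
    }

    public void returnObject(T obj) {
        if (obj != null) pool.offer(obj); // hand the instance back for reuse
    }
}
```

A ConcurrentLinkedQueue keeps borrow and return lock-free, which is why it is a natural backing structure for a minimalistic pool.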
umbertogriffo / Kmeans Readme.md
Last active March 8, 2024 13:40
Step-by-step code tutorial on implementing a basic k-means in Spark in order to cluster geo-located devices

DATASET

  • Download dataset here

CODE

  • Follow the well-commented code in kmeans.scala
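The two alternating steps of basic k-means that the tutorial implements in Spark can be sketched locally in plain Java: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points, and repeat. Points here are double[2] pairs (e.g. latitude/longitude); data and names are illustrative, not from the tutorial:

```java
// Plain-Java sketch of the basic k-means loop: an assignment step (nearest
// centroid per point) alternating with an update step (centroid = mean of
// its assigned points) for a fixed number of iterations.
public class KMeansSketch {
    static double dist2(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy; // squared distance is enough for comparisons
    }

    public static int[] cluster(double[][] points, double[][] centroids, int iterations) {
        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: nearest centroid for each point.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (dist2(points[p], centroids[c]) < dist2(points[p], centroids[best])) {
                        best = c;
                    }
                }
                assignment[p] = best;
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[centroids.length][2];
            int[] counts = new int[centroids.length];
            for (int p = 0; p < points.length; p++) {
                sums[assignment[p]][0] += points[p][0];
                sums[assignment[p]][1] += points[p][1];
                counts[assignment[p]]++;
            }
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] > 0) {
                    centroids[c][0] = sums[c][0] / counts[c];
                    centroids[c][1] = sums[c][1] / counts[c];
                }
            }
        }
        return assignment;
    }
}
```

A production version would also check for convergence (assignments no longer changing) instead of a fixed iteration count, which is what Spark's MLlib implementation does with a tolerance parameter.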