Umberto Griffo umbertogriffo

@umbertogriffo
umbertogriffo / TwitterSentimentAnalysisAndN-gramWithHadoopAndHiveSQL.md
Last active May 11, 2021 13:22
Step-by-step tutorial on Twitter sentiment analysis and n-grams with Hadoop and Hive SQL

PREREQUISITES

* Download the JSON SerDe from:
* http://files.cloudera.com/samples/hive-serdes-1.0-SNAPSHOT.jar
* and rename it to hive-serdes-1.0.jar
  • Add the JAR to the HIVE_AUX_JARS_PATH of HiveServer2:

    1. Copy the JAR files to the host on which HiveServer2 is running. Save the JARs to any directory you choose and make a note of the path (create a directory under /usr/share/ for them).
umbertogriffo / HBaseBackup.rb
Last active March 24, 2023 15:01
This code takes a snapshot of all HBase tables, using the snapshot command (No file copies are performed). Tested on CDH-5.4.4-1
# Check that the hbase.snapshot.enabled property in hbase-site.xml is set to true
# To run the script, launch this command from the shell: hbase shell HBaseBackup.rb
@clusterToSave = "hdfs://srv2:8082/hbase"
# CHECK THE PATH OF HBase lib
@libjars = `ls /opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p0.4/lib/hbase/*.jar | tr "\n" ","`
@ignore = [ /zipkin\..*/i, /.*_temp/i, /.*tmp/i, /test_.*/i, /.*_test/i, /.*_old/i ]
@mappers = "2"
include Java
umbertogriffo / HBaseRestore.rb
Created February 19, 2016 15:04
This code restores the snapshots of all HBase tables saved using the script HBaseBackup.rb (https://gist.github.com/umbertogriffo/fe1bce24f8e9ee68c75f). Tested on CDH-5.4.4-1
# To run the script, launch this command from the shell: hbase shell HBaseRestore.rb
include Java
java_import org.apache.hadoop.hbase.HBaseConfiguration
java_import org.apache.hadoop.hbase.client.HBaseAdmin
java_import org.apache.hadoop.hbase.snapshot.ExportSnapshot
java_import org.apache.hadoop.hbase.TableExistsException
java_import org.apache.hadoop.util.ToolRunner
umbertogriffo / Kmeans Readme.md
Last active March 8, 2024 13:40
Step-by-step code tutorial on implementing a basic k-means in Spark in order to cluster geo-located devices

DATASET

  • Download dataset here

CODE

* Follow the well-commented code kmeans.scala
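
As a rough, single-machine illustration of what a basic k-means does, the sketch below runs Lloyd's algorithm over (latitude, longitude) pairs in plain Java. It is not the code from kmeans.scala: the class and method names are hypothetical, the distance is plain squared Euclidean distance on degrees (an assumption that ignores the Earth's curvature), and the initial centroids are simply the first k points.

```java
import java.util.Arrays;

// Hypothetical single-machine sketch; not the Spark code from kmeans.scala.
final class GeoKMeansSketch {

    // Squared Euclidean distance between two (lat, lon) pairs.
    static double squaredDistance(double[] a, double[] b) {
        double dLat = a[0] - b[0];
        double dLon = a[1] - b[1];
        return dLat * dLat + dLon * dLon;
    }

    // Index of the centroid closest to the given point.
    static int closest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Lloyd's algorithm: assign each point to its nearest centroid, recompute the
    // means, and stop when no assignment changes or maxIterations is reached.
    // Assumes k <= points.length.
    static double[][] cluster(double[][] points, int k, int maxIterations) {
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone(); // assumption: seed with the first k points
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                int c = closest(points[i], centroids);
                if (c != assignment[i]) {
                    assignment[i] = c;
                    changed = true;
                }
                sums[c][0] += points[i][0];
                sums[c][1] += points[i][1];
                counts[c]++;
            }
            if (!changed) {
                break;
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // leave an empty cluster's centroid where it is
                    centroids[c][0] = sums[c][0] / counts[c];
                    centroids[c][1] = sums[c][1] / counts[c];
                }
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        // Made-up coordinates, used only to exercise the sketch.
        double[][] devices = {
            {45.0, 9.0}, {10.0, 20.0}, {45.2, 9.2}, {10.2, 20.2}
        };
        for (double[] centroid : cluster(devices, 2, 20)) {
            System.out.println(Arrays.toString(centroid));
        }
    }
}
```

In Spark the assignment step would typically be a map over the points and the update step a reduce by cluster index; the loop structure stays the same.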
umbertogriffo / ObjectPool.java
Created June 28, 2016 08:01
Generic Java object pool with minimalistic code
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
/**
 * @param <T> the type of object held by the pool
 */
public abstract class ObjectPool<T> {
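
The preview above is cut off at the class declaration. To illustrate the general idea of a generic object pool, here is a minimal sketch under stated assumptions: the names `SimplePoolSketch`, `create`, `borrow`, `giveBack`, and `createdCount` are hypothetical and are not the gist's API, and the scheduled maintenance suggested by the gist's imports is left out.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical minimal pool; the names below are illustrative, not the gist's API.
abstract class SimplePoolSketch<T> {
    private final Queue<T> idle = new ConcurrentLinkedQueue<>();
    private final AtomicInteger created = new AtomicInteger();

    // Subclasses decide how a new pooled object is built.
    protected abstract T create();

    // Reuse an idle object if one exists, otherwise build a new one.
    public T borrow() {
        T object = idle.poll();
        if (object == null) {
            created.incrementAndGet();
            object = create();
        }
        return object;
    }

    // Hand an object back so that a later borrow() can reuse it.
    public void giveBack(T object) {
        if (object != null) {
            idle.offer(object);
        }
    }

    // Number of objects built so far by this pool.
    public int createdCount() {
        return created.get();
    }
}

final class StringBuilderPoolDemo {
    static SimplePoolSketch<StringBuilder> newPool() {
        return new SimplePoolSketch<StringBuilder>() {
            @Override
            protected StringBuilder create() {
                return new StringBuilder();
            }
        };
    }

    public static void main(String[] args) {
        SimplePoolSketch<StringBuilder> pool = newPool();
        StringBuilder first = pool.borrow();
        pool.giveBack(first);
        StringBuilder second = pool.borrow();
        System.out.println(first == second);      // true: the idle instance is reused
        System.out.println(pool.createdCount());  // 1
    }
}
```

A `ConcurrentLinkedQueue` keeps borrow and return lock-free; this sketch places no upper bound on the number of objects created, which a production pool would usually enforce.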
umbertogriffo / Transpose.scala
Created October 26, 2016 08:05
Utility methods to transpose an org.apache.spark.mllib.linalg.distributed.RowMatrix
def transposeRowMatrix(m: RowMatrix): RowMatrix = {
  val transposedRowsRDD = m.rows.zipWithIndex.map{case (row, rowIndex) => rowToTransposedTriplet(row, rowIndex)}
    .flatMap(x => x) // now we have triplets (newRowIndex, (newColIndex, value))
    .groupByKey
    .sortByKey().map(_._2) // sort rows and remove row indexes
    .map(buildRow) // restore order of elements in each row and remove column indexes
  new RowMatrix(transposedRowsRDD)
}
def rowToTransposedTriplet(row: Vector, rowIndex: Long): Array[(Long, (Long, Double))] = {
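
The preview stops at the signature of rowToTransposedTriplet. The triplet idea in the comments above (emit (newRowIndex, (newColIndex, value)), group by the new row index, sort, then rebuild each row) can be illustrated locally without Spark. The plain-Java sketch below is a non-distributed analogue with hypothetical names, assuming a dense rectangular matrix.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Local, non-distributed illustration of the triplet approach; no Spark involved.
final class TransposeSketch {

    static double[][] transpose(double[][] rows) {
        // newRowIndex (old column) -> (newColIndex (old row) -> value).
        // TreeMap keeps both levels sorted, standing in for sortByKey.
        Map<Integer, Map<Integer, Double>> grouped = new TreeMap<>();
        for (int rowIndex = 0; rowIndex < rows.length; rowIndex++) {
            for (int colIndex = 0; colIndex < rows[rowIndex].length; colIndex++) {
                grouped.computeIfAbsent(colIndex, key -> new TreeMap<>())
                       .put(rowIndex, rows[rowIndex][colIndex]);
            }
        }
        // Rebuild each transposed row from its (newColIndex, value) pairs.
        double[][] result = new double[grouped.size()][];
        int newRow = 0;
        for (Map<Integer, Double> cells : grouped.values()) {
            double[] rebuilt = new double[rows.length];
            for (Map.Entry<Integer, Double> cell : cells.entrySet()) {
                rebuilt[cell.getKey()] = cell.getValue();
            }
            result[newRow++] = rebuilt;
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] matrix = {{1, 2, 3}, {4, 5, 6}};
        // prints [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
        System.out.println(Arrays.deepToString(transpose(matrix)));
    }
}
```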
umbertogriffo / UniqueId.java
Last active March 6, 2023 08:16
Generate Long ID from UUID
/**
 * Generate a unique ID from a UUID in the positive space
 * Reference: http://www.gregbugaj.com/?p=587
 * @return long value representing the UUID
 */
private Long generateUniqueId()
{
long val = -1;
do
{
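
The preview ends at the start of the loop. One way such a loop could be completed is sketched below, assuming the approach is to fold a random UUID's 128 bits into a `long` and retry until the result is non-negative; the class name is hypothetical and the body is an assumption, not the gist's confirmed code. Reducing 128 bits to 64 means collisions are unlikely but possible, so the result is not guaranteed to be unique.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical completion of the idea; not the gist's confirmed implementation.
final class UniqueIdSketch {

    static long generateUniqueId() {
        long value;
        do {
            UUID uuid = UUID.randomUUID();
            // Pack both halves of the UUID into a 16-byte buffer.
            ByteBuffer buffer = ByteBuffer.wrap(new byte[16]);
            buffer.putLong(uuid.getLeastSignificantBits());
            buffer.putLong(uuid.getMostSignificantBits());
            // longValue() keeps only the low-order 64 bits of the 128-bit number.
            value = new BigInteger(buffer.array()).longValue();
        } while (value < 0); // retry until the value lands in the positive space
        return value;
    }

    public static void main(String[] args) {
        System.out.println(generateUniqueId());
    }
}
```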
umbertogriffo / Method1.java
Last active January 22, 2017 14:21
How to make the method run() of class NoThreadSafe thread-safe in Java
public class Method1 {
/*
Adding synchronized to this method will make it thread-safe.
When synchronized is added to a static method, the Class object is the object that is locked.
*/
public static void main(String[] args) throws InterruptedException {
ProcessingThreadS pt = new ProcessingThreadS();
Thread t1 = new Thread(pt, "t1");
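
The preview is truncated here. The point made in the comment (a `static synchronized` method locks on the Class object) can be shown with a small self-contained sketch. The class below is hypothetical and is not the gist's Method1 or ProcessingThreadS; it only demonstrates that two threads incrementing a shared static counter through a synchronized static method do not lose updates.

```java
// Hypothetical demo class; it is not the gist's Method1 or ProcessingThreadS.
final class SynchronizedCounterSketch {
    private static int count = 0;

    // static + synchronized: the monitor is SynchronizedCounterSketch.class.
    static synchronized void increment() {
        count++;
    }

    static synchronized int current() {
        return count;
    }

    static synchronized void reset() {
        count = 0;
    }

    // Runs two threads that each call increment() the given number of times.
    static int runTwoThreads(int incrementsPerThread) {
        reset();
        Runnable task = () -> {
            for (int i = 0; i < incrementsPerThread; i++) {
                increment();
            }
        };
        Thread t1 = new Thread(task, "t1");
        Thread t2 = new Thread(task, "t2");
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return current();
    }

    public static void main(String[] args) {
        System.out.println(runTwoThreads(100_000)); // 200000, since no update is lost
    }
}
```

Without `synchronized` on `increment()`, the read-modify-write in `count++` could interleave between the two threads and the final total could come out lower.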
umbertogriffo / DataFrameSuite.scala
Last active February 12, 2020 06:13
DataFrameSuite allows you to check whether two DataFrames are equal. You can assert DataFrame equality using the method assertDataFrameEquals. When the DataFrames contain doubles or Spark MLlib Vectors, you can assert that the DataFrames are approximately equal using the method assertDataFrameApproximateEquals.
package test.com.idlike.junit.df
import breeze.numerics.abs
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{Column, DataFrame, Row}
/**
* Created by Umberto on 06/02/2017.
*/
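
The approximate-equality idea from the description can be illustrated without Spark. The sketch below is a plain-Java analogue under stated assumptions: rows are modelled as `double[]` arrays, the names are hypothetical, and the rule used (every pair of cells differs by at most an absolute tolerance) is an assumption about what "approximately equal" means, not the suite's confirmed behaviour.

```java
// Plain-Java analogue with hypothetical names; it does not use Spark DataFrames.
final class ApproximateEqualsSketch {

    // True when both tables have the same shape and every pair of cells
    // differs by at most the given absolute tolerance.
    static boolean approximatelyEqual(double[][] expected, double[][] actual, double tolerance) {
        if (expected.length != actual.length) {
            return false;
        }
        for (int r = 0; r < expected.length; r++) {
            if (expected[r].length != actual[r].length) {
                return false;
            }
            for (int c = 0; c < expected[r].length; c++) {
                double difference = Math.abs(expected[r][c] - actual[r][c]);
                // Written as !(x <= tol) so that a NaN difference counts as a mismatch.
                if (!(difference <= tolerance)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[][] expected = {{1.0, 2.0}};
        double[][] actual = {{1.0005, 2.0}};
        System.out.println(approximatelyEqual(expected, actual, 1e-3)); // true
        System.out.println(approximatelyEqual(expected, actual, 1e-4)); // false
    }
}
```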
umbertogriffo / Winner.java
Created February 15, 2017 09:02
Java 8 Streams Cookbook
package knowledgebase.java.stream;
import java.time.Duration;
import java.util.*;
import static java.util.stream.Collectors.*;
/**
* Created by Umberto on 15/02/2017.
* https://dzone.com/articles/a-java-8-streams-cookbook