
Umberto Griffo umbertogriffo

@squito
squito / AccumulatorListener.scala
Last active March 15, 2019 06:34
Accumulator Examples
import scala.collection.mutable.Map
import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
import org.apache.spark.scheduler.{SparkListenerStageCompleted, SparkListener}
import org.apache.spark.SparkContext._
/**
* just print out the values for all accumulators from the stage.
* you will only get updates from *named* accumulators, though
name := "playground"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"
libraryDependencies += "net.sf.opencsv" % "opencsv" % "2.3"
@ahoy-jon
ahoy-jon / CogroupDf.scala
Last active February 3, 2020 11:08
DataFrame.cogroup is the new HList.flatMap (UNFORTUNATELY, THIS IS VERY SLOW)
package org.apache.spark.sql.utils
import org.apache.spark.Partitioner
import org.apache.spark.rdd.{CoGroupedRDD, RDD}
import org.apache.spark.sql.catalyst.{CatalystTypeConverters, ScalaReflection}
import org.apache.spark.sql.execution.LogicalRDD
import org.apache.spark.sql.types.{ArrayType, StructField, StructType}
import org.apache.spark.sql.{SQLContext, DataFrame, Row}
import scala.reflect.ClassTag
import scala.reflect.runtime.universe.TypeTag
# Spark Streaming Logging Configuration
# See also: http://spark.apache.org/docs/2.0.2/running-on-yarn.html#debugging-your-application
log4j.rootLogger=INFO, stderr
# application namespace configuration
log4j.logger.de.inovex.mysparkapp=stderr, stdout
# Write all logs to standard Spark stderr file
log4j.appender.stderr=org.apache.log4j.RollingFileAppender
@tomron
tomron / spark_knn_approximation.py
Created November 19, 2015 16:47
A naive approximation of the k-NN algorithm (k-nearest neighbors) in PySpark. Approximation quality can be controlled by the number of repartitions.
from __future__ import print_function
import sys
from math import sqrt
import argparse
from collections import defaultdict
from random import randint
from pyspark import SparkContext
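The idea named in the description (approximate k-NN by searching only within partitions of the data) can be sketched without Spark. The following is a minimal plain-Python sketch under stated assumptions, not the gist's implementation: the function `approx_knn` and the parameters `num_partitions` and `num_rounds` are hypothetical names, and the random partitioning scheme is one plausible reading of "repartitions".

```python
import random
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def approx_knn(points, k, num_partitions, num_rounds, seed=0):
    """Return {index: [(distance, neighbour_index), ...]} with up to k entries each.

    Hypothetical sketch: each round scatters the points into random
    partitions and runs an exact search inside each partition only.
    """
    rng = random.Random(seed)
    # For every point, the candidate neighbours seen so far (index -> distance).
    best = {i: {} for i in range(len(points))}
    for _ in range(num_rounds):
        # Randomly assign every point to one partition for this round.
        buckets = [[] for _ in range(num_partitions)]
        for i in range(len(points)):
            buckets[rng.randrange(num_partitions)].append(i)
        # Compare only points that share a partition.
        for bucket in buckets:
            for i in bucket:
                for j in bucket:
                    if i != j:
                        best[i][j] = euclidean(points[i], points[j])
    # Keep the k closest candidates seen across all rounds.
    return {
        i: sorted((d, j) for j, d in candidates.items())[:k]
        for i, candidates in best.items()
    }
```

With `num_partitions=1` the search is exact. More partitions mean less work per round but a higher chance of missing a true neighbour; extra rounds recover some of those misses.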
@lesstif
lesstif / tomcat-service.sh
Last active July 27, 2021 01:26
RHEL/CentOS tomcat7 init.d service script.
#!/bin/bash
#
# tomcat
#
# chkconfig: 345 96 30
# description: Start up the Tomcat servlet engine.
#
# processname: java
# pidfile: /var/run/tomcat.pid
#
@frgomes
frgomes / AnyToDouble.scala
Last active January 23, 2022 23:15
Scala - Converts Any to Double, to LocalDate and Date
// this flavour is pure magic...
def toDouble: (Any) => Double = { case i: Int => i case f: Float => f case d: Double => d }
// whilst this flavour is longer but you are in full control...
object any2Double extends Function[Any,Double] {
  def apply(any: Any): Double =
    any match { case i: Int => i case f: Float => f case d: Double => d }
}
// like when you can invoke any2Double from another similar conversion...
@bernhardschaefer
bernhardschaefer / spark-submit-streaming-yarn.sh
Last active March 21, 2022 05:04
spark-submit template for running Spark Streaming on YARN (referenced in https://www.inovex.de/blog/247-spark-streaming-on-yarn-in-production/)
#!/bin/bash
# Minimum TODOs on a per job basis:
# 1. define name, application jar path, main class, queue and log4j-yarn.properties path
# 2. remove properties not applicable to your Spark version (Spark 1.x vs. Spark 2.x)
# 3. tweak num_executors, executor_memory (+ overhead), and backpressure settings
# the two most important settings:
num_executors=6
executor_memory=3g
@iamaziz
iamaziz / cipynb.py
Created February 16, 2015 01:01
Convert all ipython notebook(s) in a given directory into the selected format and place output in a separate folder. Using: ipython nbconvert and find command (Unix-like OS).
#!/usr/bin/env python
__author__ = 'Aziz'
"""
Convert all ipython notebook(s) in a given directory into the selected format and place output in a separate folder.
usages: python cipynb.py `directory` [-to FORMAT]
Using: ipython nbconvert and find command (Unix-like OS).
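The batch conversion described above can be sketched as follows. This is a hedged sketch, not the gist's code: it swaps the gist's `ipython nbconvert` plus `find` for `jupyter nbconvert` plus `pathlib`, and the names `build_commands`, `convert_all`, and the default output folder `converted` are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(directory, to_format="html", out_dir="converted"):
    """Build one `jupyter nbconvert` command per notebook found under `directory`."""
    root = Path(directory)
    target = root / out_dir  # hypothetical default output folder
    commands = []
    for notebook in sorted(root.rglob("*.ipynb")):
        # Skip Jupyter's autosave copies.
        if ".ipynb_checkpoints" in notebook.parts:
            continue
        commands.append(
            ["jupyter", "nbconvert", "--to", to_format,
             "--output-dir", str(target), str(notebook)]
        )
    return commands


def convert_all(directory, to_format="html", out_dir="converted"):
    """Run every command; raises CalledProcessError if a conversion fails."""
    for command in build_commands(directory, to_format, out_dir):
        subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes it possible to inspect or log the commands before anything is run, for example `build_commands(".", "markdown")`.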
@gbishop
gbishop / Args.ipynb
Last active July 18, 2022 11:43
Allow arguments to be passed to notebooks via URL or command line.