Nick Pentreath (MLnick)

  • Automattic
  • Cape Town, South Africa
  • X @MLnick
MLnick / StreamingHLL.scala
Last active January 24, 2024 19:39
Spark Streaming meets Algebird's HyperLogLog Monoid
import spark.streaming.StreamingContext._
import spark.streaming.{Seconds, StreamingContext}
import spark.SparkContext._
import spark.storage.StorageLevel
import spark.streaming.examples.twitter.TwitterInputDStream
import com.twitter.algebird.HyperLogLog._
import com.twitter.algebird._
/**
 * Example of using HyperLogLog monoid from Twitter's Algebird together with Spark Streaming's
 * TwitterInputDStream
 */
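The preview ends at the gist boundary, but the core idea stands alone: Algebird's HLL values form a monoid, so per-element sketches merge associatively and can be reduced across batches. A minimal standalone sketch (bit width and inputs are illustrative, not from the gist):

import com.twitter.algebird.{HLL, HyperLogLogMonoid}

object HLLExample extends App {
  // 12 bits => 2^12 registers, roughly 1.6% standard error
  val hllMonoid = new HyperLogLogMonoid(12)
  val userIds = Seq("alice", "bob", "alice", "carol")
  // build one sketch per element, then merge them all with the monoid
  val merged: HLL = hllMonoid.sum(userIds.map(id => hllMonoid.create(id.getBytes("UTF-8"))))
  println(s"approx distinct users: ${merged.estimatedSize}")
}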
MLnick / StreamingCMS.scala
Created February 13, 2013 15:00
Spark Streaming with CountMinSketch from Twitter Algebird
import spark.streaming.{Seconds, StreamingContext}
import spark.storage.StorageLevel
import spark.streaming.examples.twitter.TwitterInputDStream
import com.twitter.algebird._
import spark.streaming.StreamingContext._
import spark.SparkContext._
/**
 * Example of using CountMinSketch monoid from Twitter's Algebird together with Spark Streaming's
 * TwitterInputDStream
 */
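The same monoid pattern applies here: Count-Min sketches merge with plus/sum, so per-batch frequency sketches reduce cleanly. A minimal standalone sketch using the generic CMS API of more recent Algebird releases (parameters are illustrative):

import com.twitter.algebird.CMS

object CMSExample extends App {
  // eps and delta control the sketch's width and depth; the seed fixes the hash functions
  val cmsMonoid = CMS.monoid[Long](0.001, 1e-8, 1)
  val itemIds = Seq(1L, 2L, 2L, 3L, 2L)
  // one sketch per observation, merged via the monoid
  val cms = cmsMonoid.sum(itemIds.map(cmsMonoid.create))
  println(s"approx count of item 2: ${cms.frequency(2L).estimate}")
}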
MLnick / MovieSimilarities.scala
Created April 1, 2013 17:49
Movie Similarities with Spark
import spark.SparkContext
import SparkContext._
/**
* A port of [[http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/]]
* to Spark.
* Uses movie ratings data from MovieLens 100k dataset found at [[http://www.grouplens.org/node/73]]
*/
object MovieSimilarities {
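The port's core computation is a pairwise similarity over movies' co-ratings. As a hypothetical helper (not lifted from the gist), cosine similarity over two movies' user-to-rating maps looks like this:

// rating vectors keyed by user id
def cosineSimilarity(a: Map[Int, Double], b: Map[Int, Double]): Double = {
  // only users who rated both movies contribute to the dot product
  val common = a.keySet intersect b.keySet
  val dot = common.iterator.map(u => a(u) * b(u)).sum
  val normA = math.sqrt(a.values.map(r => r * r).sum)
  val normB = math.sqrt(b.values.map(r => r * r).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}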
MLnick / HyperLogLogStoreUDAF.scala
Last active March 16, 2022 05:31
Experimenting with Spark SQL UDAF: a HyperLogLog UDAF for distinct counts that stores the actual HLL for each row to allow further aggregation
class HyperLogLogStoreUDAF extends UserDefinedAggregateFunction {

  override def inputSchema = new StructType()
    .add("stringInput", BinaryType)

  override def update(buffer: MutableAggregationBuffer, input: Row) = {
    // This input Row only has a single column storing the input value in String (or other Binary data).
    // We only update the buffer when the input value is not null.
    if (!input.isNullAt(0)) {
      if (buffer.isNullAt(0)) {
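The preview stops mid-method. A sketch of how the remaining UDAF methods could fit together, assuming the buffer holds the HLL serialized with Algebird's HyperLogLog.toBytes/fromBytes (an assumption, not necessarily the gist's exact approach):

import com.twitter.algebird.{HLL, HyperLogLog, HyperLogLogMonoid}
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class HyperLogLogStoreUDAF extends UserDefinedAggregateFunction {
  private val hll = new HyperLogLogMonoid(12)

  override def inputSchema = new StructType().add("stringInput", BinaryType)
  override def bufferSchema = new StructType().add("hll", BinaryType)
  override def dataType: DataType = BinaryType
  override def deterministic = true

  override def initialize(buffer: MutableAggregationBuffer) = buffer.update(0, null)

  override def update(buffer: MutableAggregationBuffer, input: Row) = {
    if (!input.isNullAt(0)) {
      val incoming = hll.create(input.getAs[Array[Byte]](0))
      val merged =
        if (buffer.isNullAt(0)) incoming
        else hll.plus(HyperLogLog.fromBytes(buffer.getAs[Array[Byte]](0)), incoming)
      buffer.update(0, HyperLogLog.toBytes(merged))
    }
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    if (!buffer2.isNullAt(0)) {
      if (buffer1.isNullAt(0)) {
        buffer1.update(0, buffer2.getAs[Array[Byte]](0))
      } else {
        val merged = hll.plus(
          HyperLogLog.fromBytes(buffer1.getAs[Array[Byte]](0)),
          HyperLogLog.fromBytes(buffer2.getAs[Array[Byte]](0)))
        buffer1.update(0, HyperLogLog.toBytes(merged))
      }
    }
  }

  // return the serialized HLL itself, so downstream queries can keep merging
  override def evaluate(buffer: Row) = buffer.get(0)
}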
$ sbt/sbt assembly/assembly
$ sbt/sbt examples/assembly
$ SPARK_CLASSPATH=examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.4.jar IPYTHON=1 ./bin/pyspark
...
14/06/03 15:34:11 INFO SparkUI: Started SparkUI at http://10.0.0.4:4040
Welcome to
      ____              __
     / __/__  ___ _____/ /__
MLnick / SQLTransformerWithJoin.scala
Created August 18, 2016 07:55
Using SQLTransformer to join DataFrames in ML Pipeline
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_77)
Type in expressions to have them evaluated.
Type :help for more information.
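The preview captures only the shell banner. The technique rests on SQLTransformer's __THIS__ placeholder, which stands in for the DataFrame passed to transform(), so the statement can join it against a registered temp view. A minimal sketch in a spark-shell session (table and column names are illustrative):

import org.apache.spark.ml.feature.SQLTransformer

// a lookup table registered as a temp view so the transformer's SQL can see it
val lookup = spark.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "category")
lookup.createOrReplaceTempView("lookup")

// __THIS__ is replaced by whatever DataFrame is passed to transform()
val sqlTrans = new SQLTransformer().setStatement(
  "SELECT __THIS__.id, __THIS__.value, lookup.category " +
  "FROM __THIS__ JOIN lookup ON __THIS__.id = lookup.id")

val input = spark.createDataFrame(Seq((1, 10.0), (2, 20.0))).toDF("id", "value")
sqlTrans.transform(input).show()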
MLnick / onnx.pb
Last active October 17, 2019 11:39
graph {
  node {
    input: "X"
    input: "W"
    output: "Y"
    name: "matmult"
    op_type: "Mul"
  }
  input {
    name: "X"
MLnick / sklearn-lr-spark.py
Created February 4, 2013 14:29
SGD in Spark using Scikit-learn
import sys
from pyspark.context import SparkContext
from numpy import array, random as np_random
from sklearn import linear_model as lm
from sklearn.base import copy
N = 10000 # Number of data points
D = 10 # Number of dimensions
ITERATIONS = 5
MLnick / dask-ps.py
Created May 31, 2017 08:57
Dask Parameter Server - Initial WIP
# ==== dask-ps
import dask
import dask.array as da
from dask import delayed
from dask_glm import families
from dask_glm.algorithms import lbfgs
from distributed import LocalCluster, Client, worker_client
import numpy as np
import time
1. Error: gapply() and gapplyCollect() on a DataFrame (@test_sparkSQL.R#2569) --
org.apache.spark.SparkException: Job aborted due to stage failure: Task 114 in stage 957.0 failed 1 times, most recent failure: Lost task 114.0 in stage 957.0 (TID 13209, localhost, executor driver): org.apache.spark.SparkException: R computation failed with
[1] 1
[1] 3
[1] 2
[1][1] 1 2
[1] 3
[1] 2
[1] 2