Michael Reynolds reynoldsm88

  • Two Six Labs
  • New York City
@reynoldsm88
reynoldsm88 / shingleprints.py
Created December 2, 2021 20:15 — forked from dustinboswell/shingleprints.py
Computing shingleprints for a document
def min_max_hashes(text, window=60):
    hashes = [murmurhash(text[i:i+window]) for i in range(len(text)-window+1)]
    return [min(hashes), max(hashes)]

def shingleprints(text):
    # integer division so the midpoint is a valid slice index in Python 3
    min1, max1 = min_max_hashes(text[0:len(text)//2])
    min2, max2 = min_max_hashes(text[len(text)//2:])
    # combine pairs, using your favorite hash-value combiner
    return [hash_combine(min1, min2),
            hash_combine(min1, max2),
reynoldsm88 / minhash.py
Created December 2, 2021 19:55 — forked from dustinboswell/minhash.py
Rough code for comparing document similarity with MinHash
def minhash(text, window=25):  # assume len(text) > 50
    hashes = [murmurhash(text[i:i+window]) for i in range(len(text)-window+1)]
    return set(sorted(hashes)[0:20])

def similarity(text1, text2):
    hashes1 = minhash(text1)
    hashes2 = minhash(text2)
    return len(hashes1 & hashes2) / len(hashes1)

A = "one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen"
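The MinHash preview above leaves murmurhash undefined, so here is a minimal self-contained sketch of the same idea in Python 3. It assumes a stand-in hash built from hashlib.md5 (not the murmurhash the gist refers to), and the second string B is a hypothetical example added only for illustration.

```python
import hashlib


def stable_hash(shingle):
    # Stand-in for murmurhash (an assumption): the first 8 bytes of the
    # MD5 digest, read as a big-endian integer.
    digest = hashlib.md5(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash(text, window=25, k=20):
    # Hash every character window, then keep the k smallest hash values.
    hashes = [stable_hash(text[i:i + window])
              for i in range(len(text) - window + 1)]
    return set(sorted(hashes)[:k])


def similarity(text1, text2):
    hashes1 = minhash(text1)
    hashes2 = minhash(text2)
    if not hashes1:
        # Text shorter than the window produces no shingles.
        return 0.0
    return len(hashes1 & hashes2) / len(hashes1)


A = ("one two three four five six seven eight nine ten eleven twelve "
     "thirteen fourteen fifteen sixteen seventeen eighteen")
# Hypothetical second document: A with one word changed.
B = A.replace("seven ", "SEVEN ")

print(similarity(A, A))
print(similarity(A, B))
```

Identical texts score 1.0 because their sketches match exactly. The score for A against B depends on how many of the 20 smallest hashes come from windows that overlap the changed word.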
reynoldsm88 / _aws_golang_examples.md
Created January 26, 2021 21:45 — forked from eferro/_aws_golang_examples.md
golang aws: examples

AWS Golang SDK examples

reynoldsm88 / swagger-yaml-to-html.py
Created September 19, 2019 02:19 — forked from oseiskar/swagger-yaml-to-html.py
Converts Swagger YAML to a static HTML document (needs: pip install PyYAML)
#!/usr/bin/python
#
# Copyright 2017 Otto Seiskari
# Licensed under the Apache License, Version 2.0.
# See http://www.apache.org/licenses/LICENSE-2.0 for the full text.
#
# This file is based on
# https://github.com/swagger-api/swagger-ui/blob/4f1772f6544699bc748299bd65f7ae2112777abc/dist/index.html
# (Copyright 2017 SmartBear Software, Licensed under Apache 2.0)
#
package org.apache.spark.countSerDe
import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
reynoldsm88 / get_job_status.sh
Created December 4, 2017 21:38 — forked from arturmkrtchyan/get_job_status.sh
Apache Spark Hidden REST API
curl http://spark-cluster-ip:6066/v1/submissions/status/driver-20151008145126-0000
reynoldsm88 / build.sbt
Created June 22, 2017 18:30 — forked from seratch/build.sbt
Scala School - Testing with specs2 examples
organization := "net.seratch"
name := "sandbox"
version := "0.1"
scalaVersion := "2.9.1"
libraryDependencies ++= Seq(
"junit" % "junit" % "4.9" withSources(),

Spark internals through code

Nothing gives you more detail about Spark internals than actually reading its source code. In addition, you get to learn many design techniques and improve your Scala coding skills. These are the random notes I make while reading the Spark code. The best way to follow the notes is to load the Spark code into an IDE, e.g. IntelliJ, and navigate the code alongside them.

Genesis - creation of a Spark cluster

The scripts for creating a Spark cluster are start-master.sh and start-slave.sh. Read them carefully and you will see that the two scripts are very similar except for the value of the $CLASS variable. For start-master.sh, the value is CLASS="org.apache.spark.deploy.master.Master", while the value for start-slave.sh is shown below with more context.

# NOTE: This exact class name is matched downstream by SparkSubmit.