def min_max_hashes(text, window=60):
    # murmurhash stands for any string hash function; it is not defined here
    hashes = [murmurhash(text[i:i+window]) for i in range(len(text)-window+1)]
    return [min(hashes), max(hashes)]

def shingleprints(text):
    half = len(text) // 2  # integer division, so the slice index stays an int
    min1, max1 = min_max_hashes(text[:half])
    min2, max2 = min_max_hashes(text[half:])
    # combine pairs, using your favorite hash-value combiner
    return [hash_combine(min1, min2),
            hash_combine(min1, max2),
def minhash(text, window=25):  # assume len(text) > 50
    # murmurhash stands for any string hash function; it is not defined here
    hashes = [murmurhash(text[i:i+window]) for i in range(len(text)-window+1)]
    return set(sorted(hashes)[0:20])

def similarity(text1, text2):
    hashes1 = minhash(text1)
    hashes2 = minhash(text2)
    # float() keeps this a true division on Python 2 as well as Python 3
    return len(hashes1 & hashes2) / float(len(hashes1))

A = "one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen"
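The minhash snippet above leaves `murmurhash` undefined. Below is a minimal runnable sketch of the same idea, assuming a truncated MD5 digest as a stand-in hash (`stable_hash` is a hypothetical helper, not part of the original gist) and a hypothetical parameter `k` for the number of hashes kept. It also guards against texts shorter than the window, which the original rules out by assumption.

```python
import hashlib


def stable_hash(s):
    # Stand-in for murmurhash (an assumption of this sketch):
    # the first 8 bytes of the MD5 digest, read as an integer.
    return int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")


def minhash(text, window=25, k=20):
    # Hash every `window`-character shingle and keep the k smallest values.
    hashes = [stable_hash(text[i:i + window])
              for i in range(len(text) - window + 1)]
    return set(sorted(hashes)[:k])


def similarity(text1, text2):
    # Fraction of text1's retained hashes that also appear in text2's.
    hashes1 = minhash(text1)
    hashes2 = minhash(text2)
    if not hashes1:
        return 0.0  # text1 is shorter than the window, so it has no shingles
    return len(hashes1 & hashes2) / len(hashes1)


if __name__ == "__main__":
    a = ("one two three four five six seven eight nine ten eleven twelve "
         "thirteen fourteen fifteen sixteen seventeen eighteen")
    b = a.replace("nine", "9")
    print(similarity(a, a), similarity(a, b))
```

Any deterministic string hash would do here; MD5 is used only because it ships with the standard library and gives the same values across runs, unlike Python's built-in `hash` on strings.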
#!/usr/bin/python
#
# Copyright 2017 Otto Seiskari
# Licensed under the Apache License, Version 2.0.
# See http://www.apache.org/licenses/LICENSE-2.0 for the full text.
#
# This file is based on
# https://github.com/swagger-api/swagger-ui/blob/4f1772f6544699bc748299bd65f7ae2112777abc/dist/index.html
# (Copyright 2017 SmartBear Software, Licensed under Apache 2.0)
#
package org.apache.spark.countSerDe

import org.apache.spark.sql.catalyst.util._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
curl http://spark-cluster-ip:6066/v1/submissions/status/driver-20151008145126-0000
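The same status request can be issued from a script. The following is a rough sketch using only the Python standard library. The function names are hypothetical, and it assumes the endpoint replies with a JSON object that carries a `driverState` field; check the actual response of your cluster before relying on that field name.

```python
import json
from urllib.request import urlopen


def status_url(master_host, driver_id, port=6066):
    # Mirrors the URL layout used in the curl command above.
    return "http://{}:{}/v1/submissions/status/{}".format(
        master_host, port, driver_id)


def parse_driver_state(body):
    # Assumption: the reply is a JSON object with a "driverState" field.
    return json.loads(body).get("driverState", "UNKNOWN")


def fetch_driver_state(master_host, driver_id, port=6066, timeout=10.0):
    # Performs the HTTP GET; needs a reachable master to succeed.
    with urlopen(status_url(master_host, driver_id, port),
                 timeout=timeout) as resp:
        return parse_driver_state(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(status_url("spark-cluster-ip", "driver-20151008145126-0000"))
```

Keeping URL construction and response parsing separate from the network call makes both easy to check without a running cluster.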
organization := "net.seratch"

name := "sandbox"

version := "0.1"

scalaVersion := "2.9.1"

libraryDependencies ++= Seq(
  "junit" % "junit" % "4.9" withSources(),
Nothing gives you more detail about Spark internals than actually reading its source code. Along the way, you also learn many design techniques and improve your Scala coding skills. These are the random notes I made while reading the Spark code. The best way to follow them is to load the Spark code into an IDE, e.g. IntelliJ, and navigate the code alongside.
The scripts for creating a Spark cluster are start-master.sh and start-slave.sh. Read them carefully and you will see that the two scripts are very similar except for the value of the $CLASS variable. For start-master.sh, the value is CLASS="org.apache.spark.deploy.master.Master", while the value for start-slave.sh is shown below with more context.
# NOTE: This exact class name is matched downstream by SparkSubmit.