Skip to content

Instantly share code, notes, and snippets.

@ssimeonov
Created September 12, 2015 04:42
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save ssimeonov/eb12fcda75615e4a8d46 to your computer and use it in GitHub Desktop.
Save ssimeonov/eb12fcda75615e4a8d46 to your computer and use it in GitHub Desktop.

#Scala .hashCode vs. MurmurHash3 for Spark's MLlib

This is simple test of two hashing functions:

  • Scala's native implementation (obj.##), used in HashingTF
  • MurmurHash3, included in Scala, used by Vowpal Wabbit and many others

The test uses the aspell dictionary generated with the "insane" setting (download), which produces 676,547 entries, and explores the following grid:

  • Feature vector sizes: 2^18..22
  • Multiple: 1..4 (a 4 multiple repeats the word 4 times to emulate longer strings)

For English dictionary words, which are short, Scala's native implementation does OK relative to MurmurHash3 at all feature vector sizes. However, as input string length increases, the native hashCode implementation performs badly, its collision rate becoming 200-500% worse than Murmur's.

Dictionary size 676547 with 1x multiple and 262144 features:
    hashCode: 242165 uniques, 64.2057% collision rate
    murmur: 242251 uniques, 64.1930% collision rate

Dictionary size 676547 with 2x multiple and 262144 features:
    hashCode: 121887 uniques, 81.9840% collision rate
    murmur: 242273 uniques, 64.1898% collision rate

Dictionary size 676547 with 3x multiple and 262144 features:
    hashCode: 242281 uniques, 64.1886% collision rate
    murmur: 242236 uniques, 64.1952% collision rate

Dictionary size 676547 with 4x multiple and 262144 features:
    hashCode: 65225 uniques, 90.3591% collision rate
    murmur: 242196 uniques, 64.2012% collision rate

Dictionary size 676547 with 1x multiple and 524288 features:
    hashCode: 379436 uniques, 43.9158% collision rate
    murmur: 380189 uniques, 43.8045% collision rate

Dictionary size 676547 with 2x multiple and 524288 features:
    hashCode: 194525 uniques, 71.2474% collision rate
    murmur: 380070 uniques, 43.8221% collision rate

Dictionary size 676547 with 3x multiple and 524288 features:
    hashCode: 379644 uniques, 43.8851% collision rate
    murmur: 380139 uniques, 43.8119% collision rate

Dictionary size 676547 with 4x multiple and 524288 features:
    hashCode: 121794 uniques, 81.9977% collision rate
    murmur: 379846 uniques, 43.8552% collision rate

Dictionary size 676547 with 1x multiple and 1048576 features:
    hashCode: 497309 uniques, 26.4931% collision rate
    murmur: 498609 uniques, 26.3009% collision rate

Dictionary size 676547 with 2x multiple and 1048576 features:
    hashCode: 266298 uniques, 60.6387% collision rate
    murmur: 498648 uniques, 26.2951% collision rate

Dictionary size 676547 with 3x multiple and 1048576 features:
    hashCode: 497693 uniques, 26.4363% collision rate
    murmur: 498645 uniques, 26.2956% collision rate

Dictionary size 676547 with 4x multiple and 1048576 features:
    hashCode: 194769 uniques, 71.2113% collision rate
    murmur: 498493 uniques, 26.3181% collision rate

Dictionary size 676547 with 1x multiple and 2097152 features:
    hashCode: 577030 uniques, 14.7095% collision rate
    murmur: 578521 uniques, 14.4892% collision rate

Dictionary size 676547 with 2x multiple and 2097152 features:
    hashCode: 336288 uniques, 50.2935% collision rate
    murmur: 578186 uniques, 14.5387% collision rate

Dictionary size 676547 with 3x multiple and 2097152 features:
    hashCode: 577426 uniques, 14.6510% collision rate
    murmur: 578171 uniques, 14.5409% collision rate

Dictionary size 676547 with 4x multiple and 2097152 features:
    hashCode: 266570 uniques, 60.5985% collision rate
    murmur: 578144 uniques, 14.5449% collision rate

Dictionary size 676547 with 1x multiple and 4194304 features:
    hashCode: 623614 uniques, 7.8240% collision rate
    murmur: 625075 uniques, 7.6080% collision rate

Dictionary size 676547 with 2x multiple and 4194304 features:
    hashCode: 415648 uniques, 38.5633% collision rate
    murmur: 624846 uniques, 7.6419% collision rate

Dictionary size 676547 with 3x multiple and 4194304 features:
    hashCode: 623818 uniques, 7.7938% collision rate
    murmur: 624398 uniques, 7.7081% collision rate

Dictionary size 676547 with 4x multiple and 4194304 features:
    hashCode: 336577 uniques, 50.2508% collision rate
    murmur: 624710 uniques, 7.6620% collision rate
#!/bin/sh
exec scala -J-Xmx8g "$0" "$@"
!#
import scala.io.Source._
import scala.util.hashing.MurmurHash3
def pow2(n: Int) = Math.pow(2, n).toInt
val murmur = MurmurHash3.stringHashing
case class HashFunction(name: String, hashFunction: (String) => Int, numFeatures: Int) {
val uniques = scala.collection.mutable.SortedSet[Int]()
def hash(str: String) = uniques += nonNegativeMod(hashFunction(str))
def collisionRate(numFeatures: Int) = (numFeatures - uniques.size).toDouble / numFeatures
private
def nonNegativeMod(x: Int): Int = {
val rawMod = x % numFeatures
rawMod + (if (rawMod < 0) numFeatures else 0)
}
}
case class Test(numFeatures: Int, multiple: Int, hashFunctions: Array[HashFunction]) {
def hash(str: String) = {
val scaledString = str * multiple
hashFunctions.foreach(_.hash(scaledString))
}
def report(wordCount: Int) = {
println(s"\nDictionary size $wordCount with ${multiple}x multiple and $numFeatures features:")
hashFunctions.foreach { hashFunction =>
println(f" ${hashFunction.name}: ${hashFunction.uniques.size} uniques, ${100 * hashFunction.collisionRate(wordCount)}%1.4f%% collision rate")
}
}
}
object Script {
val tests = for (
numFeatures <- Seq(18, 19, 20, 21, 22).map(pow2);
multiple <- Seq(1, 2, 3, 4)
) yield new Test(
numFeatures,
multiple,
Array(
new HashFunction("hashCode", (word: String) => word.##, numFeatures),
new HashFunction("murmur", (word: String) => murmur.hash(word), numFeatures)
))
def main(args: Array[String]) {
val path = args(0)
var wordCount = 0
fromFile(path).getLines().foreach { word =>
wordCount += 1
tests.foreach(_.hash(word))
}
tests.foreach(_.report(wordCount))
}
}
Script.main(args)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment