@feynmanliang
Last active August 29, 2015 14:24

Cluster

  • Spark 1.4 + 1da3c7f
  • Databricks Cloud
  • 8 Workers, EC2 Spot instances
  • Workers: 240 GB Memory, 32 Cores
  • Driver: 30 GB Memory, 4 Cores

Data

  • 10,000 points: 100 drawn from N(x, 1) for each mean x in {0, 3, 6, ..., 297}
  • Each n-dimensional point consists of n i.i.d. draws from N(x, 1)

e.g. for 40 features:

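import scala.util.Random
import org.apache.spark.mllib.linalg.Vectors
val rng = new Random() // rng is assumed to be a scala.util.Random
// 100 points per mean; `until` excludes 300, so the means are 0, 3, ..., 297 (10,000 points).
val data40D = sc.parallelize(
  (0 until 300 by 3).flatMap { mean =>
    // As written, each coordinate is drawn from N(100 * mean, 1).
    Seq.fill(100)(Vectors.dense((0 until 40).map(_ => rng.nextGaussian() + 100 * mean).toArray))
  }
)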
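
The same recipe can be parameterized by the number of features. The following is a minimal sketch using plain Scala collections instead of an RDD, so it runs without a Spark cluster. `SyntheticData`, `makePoints`, its parameters, and the fixed seed are hypothetical names introduced here for illustration. It follows the description in the Data list (100 draws per mean, means 0, 3, ..., 297, unit variance), which may differ from the exact data used for the timings.

```scala
import scala.util.Random

object SyntheticData {
  // Hypothetical helper: for every mean, create `pointsPerMean` points whose
  // coordinates are i.i.d. draws from N(mean, 1).
  def makePoints(numFeatures: Int, rng: Random,
                 means: Seq[Int] = 0 until 300 by 3,
                 pointsPerMean: Int = 100): Seq[Array[Double]] =
    means.flatMap { mean =>
      Seq.fill(pointsPerMean)(Array.fill(numFeatures)(rng.nextGaussian() + mean))
    }

  def main(args: Array[String]): Unit = {
    // A fixed seed (an assumption, not from the original) makes runs repeatable.
    val points = makePoints(numFeatures = 40, rng = new Random(42L))
    println(s"${points.size} points, ${points.head.length} features each")
  }
}
```

Assuming an existing SparkContext `sc`, the result could be turned into an RDD of MLlib vectors with `sc.parallelize(points.map(p => Vectors.dense(p)))`.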

Results

Parallel

| Num Features | Num Centers | Runtimes |
|---|---|---|
| 1 | 10 | 10.68, 9.14, 8.45 |
| 1 | 50 | 11.36, 10.08, 9.91 |
| 1 | 100 | 16.92, 13.53, 16.05 |
| 30 | 10 | 2.21, 2.23, 2.47 |
| 30 | 50 | 9.14, 9.16, 8.67 |
| 30 | 100 | 17.12, 17.13, 16.92 |
| 100 | 10 | 2.41, 2.59, 2.41, 2.40, 2.36 |
| 100 | 100 | 14.54, 17.62, 17.11 |
| 300 | 10 | 16.38, 16.23, 16.21, 16.20, 16.34 |
| 300 | 100 | 155.86, 147.93, 153.36 |
| 40 | 10000 | 383.29, 373.50, 372.90 |

Sequential

| Num Features | Num Centers | Runtimes |
|---|---|---|
| 1 | 10 | 5.57, 5.31, 5.57 |
| 1 | 50 | 8.96, 8.48, 7.18 |
| 1 | 100 | 10.56, 9.48, 12.88 |
| 30 | 10 | 2.39, 2.55, 2.33 |
| 30 | 50 | 9.91, 9.64, 9.55 |
| 30 | 100 | 18.87, 18.73, 18.64 |
| 100 | 10 | 3.16, 2.37, 2.31, 2.30, 2.37 |
| 100 | 100 | 17.86, 21.49, 17.96 |
| 300 | 10 | 18.44, 17.97, 20.49, 19.00 |
| 300 | 100 | 183.12, 172.17, 175.38 |
| 40 | 10000 | 397.97, 396.32, 396.21 |
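
To compare the two tables, one option is to reduce each row to its mean runtime and take the sequential-to-parallel ratio. The following is a minimal sketch in plain Scala (no Spark required). `Config`, `RuntimeSummary`, `mean`, and `ratio` are hypothetical names introduced here, and only three rows are transcribed from the tables for illustration.

```scala
// Hypothetical key type: one benchmark configuration.
final case class Config(numFeatures: Int, numCenters: Int)

object RuntimeSummary {
  // Runtimes transcribed from three rows of the Parallel table.
  val parallel: Map[Config, Seq[Double]] = Map(
    Config(1, 10)    -> Seq(10.68, 9.14, 8.45),
    Config(30, 50)   -> Seq(9.14, 9.16, 8.67),
    Config(300, 100) -> Seq(155.86, 147.93, 153.36)
  )

  // The same three rows from the Sequential table.
  val sequential: Map[Config, Seq[Double]] = Map(
    Config(1, 10)    -> Seq(5.57, 5.31, 5.57),
    Config(30, 50)   -> Seq(9.91, 9.64, 9.55),
    Config(300, 100) -> Seq(183.12, 172.17, 175.38)
  )

  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // A ratio above 1 means the parallel runs were faster on average.
  def ratio(c: Config): Double = mean(sequential(c)) / mean(parallel(c))

  def main(args: Array[String]): Unit =
    parallel.keys.toSeq.sortBy(c => (c.numFeatures, c.numCenters)).foreach { c =>
      println(f"${c.numFeatures}%4d features, ${c.numCenters}%4d centers: " +
        f"parallel ${mean(parallel(c))}%.2f, sequential ${mean(sequential(c))}%.2f, " +
        f"ratio ${ratio(c)}%.2f")
    }
}
```

For the three rows transcribed here, the ratio is below 1 for 1 feature with 10 centers and above 1 for the other two. With only three to five runs per row, these averages are rough indications rather than precise measurements.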