@kaja47
Last active October 4, 2016 14:25
Pearson correlation coefficient in scalanlp/breeze
import breeze.linalg._

// Pearson correlation coefficient of two equal-length dense vectors.
def corr(a: DenseVector[Double], b: DenseVector[Double]): Double = {
  if (a.length != b.length)
    sys.error("vectors must have the same length")

  val n = a.length
  val (amean, avar) = meanAndVariance(a)
  val (bmean, bvar) = meanAndVariance(b)
  val astddev = math.sqrt(avar)
  val bstddev = math.sqrt(bvar)

  // Sum of the products of the standardized elements, divided by n - 1.
  1.0 / (n - 1.0) * sum( ((a - amean) / astddev) :* ((b - bmean) / bstddev) )
}
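
For reference, the last line of `corr` appears to compute the usual sample form of the Pearson coefficient. Written out, and assuming `meanAndVariance` returns the sample variance (an n - 1 denominator), it reads:

```latex
r_{ab} = \frac{1}{n-1} \sum_{i=1}^{n}
         \left(\frac{a_i - \bar{a}}{s_a}\right)
         \left(\frac{b_i - \bar{b}}{s_b}\right),
\qquad
s_a = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (a_i - \bar{a})^2}
```

Here \bar{a} and s_a correspond to `amean` and `astddev` in the code, and likewise for b. If the variance were computed with an n denominator instead, the result would be off by a factor of (n - 1)/n.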
@tbertelsen

This method is really inefficient when used on SparseVectors. I have created a fork with a version that runs about 1000 times faster on huge, sparse vectors.

Quick benchmark for two vectors with approximately 1000 non-zero elements.

| Length     | Efficient method | Original method    |
|------------|------------------|--------------------|
| 1 048 576  | <0.005 s         | 0.22 s             |
| 4 194 304  | <0.005 s         | 1.05 s             |
| 16 777 216 | <0.005 s         | 2.86 s             |
| 67 108 864 | ~0.01 s          | Throws OutOfMemory |
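
The forked code is not reproduced in this thread, so the following is only a rough sketch of one way to get that kind of speedup, not the fork's actual implementation. It expands the correlation into plain sums (sum of a, sum of a squared, sum of a times b, and so on), which only need the stored non-zero entries. The name `sparseCorr` is hypothetical, and this one-pass form can lose precision when the mean is large relative to the spread.

```scala
import breeze.linalg.SparseVector

// Hypothetical helper: Pearson correlation that touches only stored entries.
// Not the code from the fork mentioned above.
def sparseCorr(a: SparseVector[Double], b: SparseVector[Double]): Double = {
  require(a.length == b.length, "vectors must have the same length")
  val n = a.length.toDouble

  // Implicit zeros contribute nothing to any of these sums.
  val sumA  = a.activeValuesIterator.sum
  val sumB  = b.activeValuesIterator.sum
  val sumAA = a.activeValuesIterator.map(x => x * x).sum
  val sumBB = b.activeValuesIterator.map(x => x * x).sum
  // b(i) is zero wherever b has no stored entry at index i.
  val sumAB = a.activeIterator.map { case (i, x) => x * b(i) }.sum

  // The 1 / (n - 1) factors in covariance and variances cancel out.
  val cov  = sumAB - sumA * sumB / n
  val varA = sumAA - sumA * sumA / n
  val varB = sumBB - sumB * sumB / n
  cov / math.sqrt(varA * varB)
}

// Usage: b is exactly 2 * a, so the result should be 1.0 up to rounding.
val sa = SparseVector.zeros[Double](8)
sa(1) = 2.0
sa(4) = 1.0
val sb = SparseVector.zeros[Double](8)
sb(1) = 4.0
sb(4) = 2.0
val r = sparseCorr(sa, sb)
```

The work is proportional to the number of stored entries rather than the vector length, and no dense temporary of the full length is allocated, which is presumably what the original method runs out of memory on.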

@hyangminj

Does this work? I could not compile it in the Scala shell, so I modified it:

import breeze.linalg._
import breeze.stats.meanAndVariance

def corr(a: DenseVector[Double], b: DenseVector[Double]): Double = {
  if (a.length != b.length)
    sys.error("vectors must have the same length")

  val n = a.length

  // meanAndVariance returns an object with .mean and .variance fields,
  // so it cannot be destructured as a pair.
  val ameanavar = meanAndVariance(a)
  val amean = ameanavar.mean
  val avar = ameanavar.variance
  val bmeanbvar = meanAndVariance(b)
  val bmean = bmeanbvar.mean
  val bvar = bmeanbvar.variance
  val astddev = math.sqrt(avar)
  val bstddev = math.sqrt(bvar)

  1.0 / (n - 1.0) * sum( ((a - amean) / astddev) :* ((b - bmean) / bstddev) )
}
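
A quick way to check a version like this in the shell is to feed it vectors whose correlation is known in advance. The sketch below assumes a Breeze release where `meanAndVariance` lives in `breeze.stats` and returns an object with `.mean` and `.variance`. `corrCheck` is a hypothetical compact restatement that uses `dot` instead of the element-wise product, so it does not depend on which spelling of that operator the installed release provides.

```scala
import breeze.linalg._
import breeze.stats.meanAndVariance

// Hypothetical compact restatement of corr, for sanity checking only.
def corrCheck(a: DenseVector[Double], b: DenseVector[Double]): Double = {
  require(a.length == b.length, "vectors must have the same length")
  val ma = meanAndVariance(a)
  val mb = meanAndVariance(b)
  // Sample covariance divided by the product of the standard deviations.
  val cov = ((a - ma.mean) dot (b - mb.mean)) / (a.length - 1.0)
  cov / math.sqrt(ma.variance * mb.variance)
}

val x = DenseVector(1.0, 2.0, 3.0, 4.0)
val y = DenseVector(2.0, 4.0, 6.0, 8.0) // y = 2 * x, perfectly correlated
val z = DenseVector(4.0, 3.0, 2.0, 1.0) // z falls as x rises

val rxy = corrCheck(x, y) // should be 1.0 up to rounding
val rxz = corrCheck(x, z) // should be -1.0 up to rounding
```

If `corr` from the comment above is loaded in the same session, `corr(x, y)` and `corr(x, z)` should agree with these values to within floating-point rounding.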
