
from sklearn import linear_model
from scipy import stats
import numpy as np
class LinearRegression(linear_model.LinearRegression):
    """
    LinearRegression class after sklearn's, but calculates t-statistics
    and p-values for model coefficients (betas).
    Additional attributes available after .fit(): self.t, self.p (intercept excluded).
    """
    def fit(self, X, y):
        super().fit(X, y)
        sse = np.sum((self.predict(X) - y) ** 2) / (X.shape[0] - X.shape[1])
        se = np.sqrt(sse * np.linalg.inv(X.T @ X).diagonal())
        self.t = self.coef_ / se
        self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), X.shape[0] - X.shape[1]))
        return self
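For reference, the statistics the class intends to expose can be computed standalone. This is a hedged sketch, not the gist's exact code: np.linalg.lstsq stands in for an intercept-free sklearn fit so the example needs only numpy and scipy, and the synthetic data here is purely illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic data (illustrative): y depends strongly on X[:, 0] only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)  # OLS coefficients
resid = y - X @ beta
dof = X.shape[0] - X.shape[1]                      # residual degrees of freedom
sigma2 = resid @ resid / dof                       # residual variance estimate
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X).diagonal())
t = beta / se                                      # t-statistics
p = 2 * (1 - stats.t.cdf(np.abs(t), dof))          # two-sided p-values
```

With this data the coefficient on the first column is hugely significant, so its p-value is effectively zero, while all p-values stay in [0, 1].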
/*
 * Object in Scala for calculating cosine similarity
 * Reuben Sutton - 2012
 * More information: http://en.wikipedia.org/wiki/Cosine_similarity
 */
object CosineSimilarity {
  // This method takes 2 equal-length arrays of integers and returns
  // their cosine similarity: dot(x, y) / (|x| * |y|).
  def cosineSimilarity(x: Array[Int], y: Array[Int]): Double = {
    require(x.length == y.length, "arrays must be the same length")
    val dot = (x zip y).map { case (a, b) => a.toDouble * b }.sum
    val norm = (v: Array[Int]) => math.sqrt(v.map(a => a.toDouble * a).sum)
    dot / (norm(x) * norm(y))
  }
}
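A quick standalone sanity check of the computation (self-contained copy; the dot-product-over-norms formulation is assumed from the truncated comment and the Wikipedia link above):

```scala
// Cosine similarity of two equal-length Int arrays: dot(x, y) / (|x| * |y|).
def cosine(x: Array[Int], y: Array[Int]): Double = {
  require(x.length == y.length)
  val dot = (x zip y).map { case (a, b) => a.toDouble * b }.sum
  val norm = (v: Array[Int]) => math.sqrt(v.map(a => a.toDouble * a).sum)
  dot / (norm(x) * norm(y))
}
```

Parallel vectors give 1.0 and orthogonal vectors give 0.0, which is an easy check that the norms and the dot product line up.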
gracio / spark-svd.scala (last active August 29, 2015 14:11; forked from vrilleup/spark-svd.scala)
(i, j, val) to RowMatrix without going through CoordinateMatrix

import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg._
import org.apache.spark.{SparkConf, SparkContext}
// To use the latest sparse SVD implementation, please build your spark-assembly after this
// change: https://github.com/apache/spark/pull/1378
// Input tsv has 3 fields: rowIndex (Long), columnIndex (Long), weight (Double); indices start at 0.
// Assumes the number of rows is larger than the number of columns, and the number of
// columns is smaller than Int.MaxValue.
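The core of the transformation the header describes is grouping (rowIndex, columnIndex, weight) entries by row and building one sparse row per group. A local, Spark-free analogue of that grouping step (in the actual job the groupBy runs on an RDD of parsed tsv lines, and each per-row entry list would feed Vectors.sparse(nCols, entries) to construct the RowMatrix; the function name here is illustrative):

```scala
// Local analogue of the (i, j, val) -> sparse-row grouping.
def toSparseRows(entries: Seq[(Long, Int, Double)]): Map[Long, Seq[(Int, Double)]] =
  entries
    .groupBy(_._1)                                   // group by row index
    .map { case (row, es) =>
      row -> es.map(e => (e._2, e._3)).sortBy(_._1)  // (col, weight), sorted by col
    }
```

Sorting by column index keeps each row's entries in the order a sparse-vector constructor expects.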
Often we use Scalding to compute a distributed algorithm that generates tons of data.
For example, imagine a simple Scalding job that would:
- comb through 100 million user requests
- find the (lat, lng) where each request originated
- convert each (lat, lng) to a zipcode via reverse geocoding
- visualize the result as a histogram over a bunch of zipcodes.
So say you pick 10 zipcodes in some county: I show you how many people hit your website from each zipcode.
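The shape of that pipeline can be sketched in plain Scala (the real job would run the same map/groupBy shape over a Scalding pipe; Request and reverseGeocode here are hypothetical stand-ins, and the geocoding logic is a placeholder, not a real lookup):

```scala
case class Request(lat: Double, lng: Double)

// Placeholder for a real reverse-geocoding lookup (illustrative only).
def reverseGeocode(lat: Double, lng: Double): String =
  if (lat >= 40.0) "10001" else "94103"

def zipHistogram(requests: Seq[Request]): Map[String, Int] =
  requests
    .map(r => reverseGeocode(r.lat, r.lng))        // request -> zipcode
    .groupBy(identity)                             // zipcode -> hits
    .map { case (zip, hits) => zip -> hits.size }  // zipcode -> count
```

The histogram itself is just the final count per zipcode; at 100 million requests only the distributed execution changes, not this shape.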
The hard problem here isn't the scalding job -