
from sklearn import linear_model
from scipy import stats
import numpy as np
class LinearRegression(linear_model.LinearRegression):
    """
    LinearRegression class after sklearn's, but calculates t-statistics
    and p-values for model coefficients (betas).
    Additional attributes available after .fit(): self.t, self.p (intercept excluded).
    """
    def fit(self, X, y):
        super().fit(X, y)
        sse = np.sum((self.predict(X) - y) ** 2) / (X.shape[0] - X.shape[1])
        se = np.sqrt(sse * np.linalg.inv(X.T @ X).diagonal())
        self.t = self.coef_ / se
        self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), X.shape[0] - X.shape[1]))
        return self
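For reference, the statistics the class intends to expose can be computed standalone. This is a hedged sketch, not the gist's exact code: np.linalg.lstsq stands in for an intercept-free sklearn fit so the example needs only numpy and scipy, and the synthetic data here is purely illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic data (illustrative): y depends strongly on X[:, 0] only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)  # OLS coefficients
resid = y - X @ beta
dof = X.shape[0] - X.shape[1]                      # residual degrees of freedom
sigma2 = resid @ resid / dof                       # residual variance estimate
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X).diagonal())
t = beta / se                                      # t-statistics
p = 2 * (1 - stats.t.cdf(np.abs(t), dof))          # two-sided p-values
```

With this data the coefficient on the first column is hugely significant, so its p-value is effectively zero, while all p-values stay in [0, 1].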
/*
 * Object in Scala for calculating cosine similarity
 * Reuben Sutton - 2012
 * More information: http://en.wikipedia.org/wiki/Cosine_similarity
 */
object CosineSimilarity {
  // This method takes 2 equal-length arrays of integers and returns
  // their cosine similarity: dot(x, y) / (|x| * |y|).
  def cosineSimilarity(x: Array[Int], y: Array[Int]): Double = {
    require(x.length == y.length, "arrays must be the same length")
    val dot = (x zip y).map { case (a, b) => a.toDouble * b }.sum
    val norm = (v: Array[Int]) => math.sqrt(v.map(a => a.toDouble * a).sum)
    dot / (norm(x) * norm(y))
  }
}
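A quick standalone sanity check of the computation (self-contained copy; the dot-product-over-norms formulation is assumed from the truncated comment and the Wikipedia link above):

```scala
// Cosine similarity of two equal-length Int arrays: dot(x, y) / (|x| * |y|).
def cosine(x: Array[Int], y: Array[Int]): Double = {
  require(x.length == y.length)
  val dot = (x zip y).map { case (a, b) => a.toDouble * b }.sum
  val norm = (v: Array[Int]) => math.sqrt(v.map(a => a.toDouble * a).sum)
  dot / (norm(x) * norm(y))
}
```

Parallel vectors give 1.0 and orthogonal vectors give 0.0, which is an easy check that the norms and the dot product line up.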
gracio / spark-svd.scala (last active August 29, 2015 14:11; forked from vrilleup/spark-svd.scala)
(i, j, val) to RowMatrix without going through CoordinateMatrix

import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg._
import org.apache.spark.{SparkConf, SparkContext}
// To use the latest sparse SVD implementation, please build your spark-assembly after this
// change: https://github.com/apache/spark/pull/1378
// Input tsv has 3 fields: rowIndex (Long), columnIndex (Long), weight (Double); indices start at 0.
// Assumes the number of rows is larger than the number of columns, and the number of
// columns is smaller than Int.MaxValue.
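The core of the transformation the header describes is grouping (rowIndex, columnIndex, weight) entries by row and building one sparse row per group. A local, Spark-free analogue of that grouping step (in the actual job the groupBy runs on an RDD of parsed tsv lines, and each per-row entry list would feed Vectors.sparse(nCols, entries) to construct the RowMatrix; the function name here is illustrative):

```scala
// Local analogue of the (i, j, val) -> sparse-row grouping.
def toSparseRows(entries: Seq[(Long, Int, Double)]): Map[Long, Seq[(Int, Double)]] =
  entries
    .groupBy(_._1)                                   // group by row index
    .map { case (row, es) =>
      row -> es.map(e => (e._2, e._3)).sortBy(_._1)  // (col, weight), sorted by col
    }
```

Sorting by column index keeps each row's entries in the order a sparse-vector constructor expects.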
Often we use Scalding to compute a distributed algorithm that generates tons of data.
For example, imagine a simple Scalding job that would:
- comb through 100 million user requests
- find the (lat, lng) where each request originated
- convert each (lat, lng) to a zipcode via reverse geocoding
- visualize the result as a histogram over a bunch of zipcodes.
So say you pick 10 zipcodes in some county: I show you how many people hit your website from each zipcode.
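The shape of that pipeline can be sketched in plain Scala (the real job would run the same map/groupBy shape over a Scalding pipe; Request and reverseGeocode here are hypothetical stand-ins, and the geocoding logic is a placeholder, not a real lookup):

```scala
case class Request(lat: Double, lng: Double)

// Placeholder for a real reverse-geocoding lookup (illustrative only).
def reverseGeocode(lat: Double, lng: Double): String =
  if (lat >= 40.0) "10001" else "94103"

def zipHistogram(requests: Seq[Request]): Map[String, Int] =
  requests
    .map(r => reverseGeocode(r.lat, r.lng))        // request -> zipcode
    .groupBy(identity)                             // zipcode -> hits
    .map { case (zip, hits) => zip -> hits.size }  // zipcode -> count
```

The histogram itself is just the final count per zipcode; at 100 million requests only the distributed execution changes, not this shape.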
The hard problem here isn't the scalding job -