manoj kumar MechCoder

************* Module pyspark
W: 48, 0: Wildcard import pyspark.status (wildcard-import)
************* Module pyspark.broadcast
W: 27, 4: Redefining built-in 'unicode' (redefined-builtin)
C: 1, 0: Missing module docstring (missing-docstring)
C: 27, 4: Invalid class name "unicode" (invalid-name)
C: 33, 0: Invalid constant name "_broadcastRegistry" (invalid-name)
W: 37, 4: Redefining name '_broadcastRegistry' from outer scope (line 33) (redefined-outer-name)
C: 36, 0: Missing function docstring (missing-docstring)
W: 37, 4: Module import itself (import-self)
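The warnings above map to standard pylint fixes; below is a minimal sketch, not pyspark's actual code (the commented-out import names are illustrative), of the explicit-import and non-shadowing style pylint asks for:

```python
"""A module docstring, which clears pylint's missing-docstring check."""

# wildcard-import: replace `from pyspark.status import *` with explicit
# names, so pylint can verify each one is actually used, e.g.:
# from pyspark.status import SparkJobInfo, SparkStageInfo  # illustrative

# redefined-builtin: instead of rebinding the builtin name `unicode`,
# bind the Python 2/3 text type to a distinct alias.
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3: `unicode` no longer exists

print(text_type("spark"))
```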
@MechCoder
MechCoder / Shell_commands.sh
Created July 6, 2015 17:06
Shell_commands
echo "$whatever"
# To make the shell search this directory for executables
export "PATH=newdirec:$PATH"
# To make Python import packages from outside the default
# /usr/local/lib/dist-packages directory
export "PYTHONPATH=newdirec:$PYTHONPATH"
# To iterate over the words of a string
for i in $string_separated_sentence; do
    echo "$i"
done
@MechCoder
MechCoder / bench_gaussian.scala
Last active August 29, 2015 14:16
A script to benchmark Gaussian with distributed vs. non-distributed mean and covariance updates.
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.stat.distribution.MultivariateGaussian
import org.apache.spark.mllib.clustering.GaussianMixture
import scala.util.Random
val rng = Random
rng.setSeed(0)
val nSamplesArray = Array(100, 200)
val nFeaturesArray = Array(10, 20, 50, 100, 200)
val trainData = {
  if (sparse) {
    data.map(sample => sample.asInstanceOf[SparseVector]).cache()
  } else {
    data.map(u => u.toBreeze.toDenseVector).cache()
  }
}
// Because the two branches give trainData different element types (SparseVector
// vs. a Breeze dense vector), the compiler infers an unhelpful common supertype
// and the next statement fails to compile.
val sums = {
@MechCoder
MechCoder / sql.scala
Created January 3, 2015 20:27
Spark SQL Errors
[error] /home/manoj/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala:308: polymorphic expression cannot be instantiated to expected type;
[error] found : [T(in method apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method apply)]
[error] required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(in method functionToUdfBuilder)]
[error] implicit def functionToUdfBuilder[T: TypeTag](func: Function1[_, T]): ScalaUdfBuilder[T] = ScalaUdfBuilder(func)
[error] ^
[error] /home/manoj/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala:310: polymorphic expression cannot be instantiated to expected type;
[error] found : [T(in method apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method apply)]
[error] required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(in method functionToUdfBuilder)]
@MechCoder
MechCoder / birch.txt
Last active August 29, 2015 14:10
Progress on Birch
1. It does not scale well to very sparse, high-dimensional data, where memory blows up:
for example, on the newsgroups dataset with around 80k features, it runs out of memory on my laptop.
2. From the profile, it seems about as optimized as possible. It is
slightly faster than MiniBatchKMeans for high n_clusters (around 1000),
slower than MiniBatchKMeans for higher n_features,
slightly faster than MiniBatchKMeans for higher n_features (~400) combined with high n_clusters (~1000).
3. Setting the threshold is a problem: almost every time I had to set it manually.
Total time: 0.18872 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
18 @profile
19 def _iterate_X(X):
20 """
21 This little hack returns a densified row when iterating over a sparse
22 matrix, instead of constructing a sparse matrix for every row,
23 which is expensive.
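The docstring above describes densifying one row at a time while iterating a sparse matrix; here is a minimal sketch of that trick using scipy's CSR layout directly. This is my reconstruction of the idea, not the profiled scikit-learn code:

```python
import numpy as np
from scipy import sparse

def iterate_rows_densified(X):
    """Yield each row of CSR matrix X as a dense 1-D ndarray.

    Reads X.indptr / X.indices / X.data directly, avoiding the cost of
    constructing a one-row sparse matrix on every iteration.
    """
    for i in range(X.shape[0]):
        row = np.zeros(X.shape[1], dtype=X.dtype)
        start, end = X.indptr[i], X.indptr[i + 1]
        row[X.indices[start:end]] = X.data[start:end]
        yield row

X = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
rows = list(iterate_rows_densified(X))
```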
@MechCoder
MechCoder / tamiltunes.py
Last active August 29, 2015 14:09
Store downloaded tamil songs in a directory
# Store downloaded tamil songs in a directory from tamiltunes.com
# Supply links like http://tamiltunes.com/kayal-2014.html
# TODO: Format stuff like % in songs
import urllib
import os
a = raw_input("Enter link ")
b = urllib.urlopen(a)
html = b.read().split()
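The gist above is Python 2 (`raw_input`, `urllib.urlopen`) and cuts off here; below is a Python 3 sketch of the same fetch-and-split step, with the parsing factored out so it can run without a network. The function names are mine, not the gist's:

```python
from urllib.request import urlopen

def split_html(html_bytes):
    """Decode a downloaded page body and split it into whitespace tokens."""
    return html_bytes.decode("utf-8", errors="replace").split()

def fetch_tokens(url):
    """Download `url` and return its HTML as a list of tokens."""
    with urlopen(url) as resp:
        return split_html(resp.read())

# Offline example of the splitting step:
tokens = split_html(b"<a href='song.mp3'>song</a>\ndownload")
```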
This file has been truncated, but you can view the full file.
5.816400000000000000e+04 8.134310000000000000e+05
5.816000000000000000e+04 8.451400000000000000e+04
5.814800000000000000e+04 5.500010000000000000e+05
5.814400000000000000e+04 8.226450000000000000e+05
5.813400000000000000e+04 4.472990000000000000e+05
5.812100000000000000e+04 8.176100000000000000e+04
5.811500000000000000e+04 3.284160000000000000e+05
5.810900000000000000e+04 3.391240000000000000e+05
5.809800000000000000e+04 2.581170000000000000e+05
5.809100000000000000e+04 3.233850000000000000e+05
==============================================================
110 @profile
111 def insert_cf_subcluster(self, subcluster):
112 """
113 Insert a new subcluster into the node.
114 """
115 265652 183822 0.7 1.9 if not self.subclusters_:
116 1 3 3.0 0.0 self.update(subcluster)
117 1 0 0.0 0.0 return False
118