manoj kumar MechCoder

************* Module pyspark
W: 48, 0: Wildcard import pyspark.status (wildcard-import)
************* Module pyspark.broadcast
W: 27, 4: Redefining built-in 'unicode' (redefined-builtin)
C: 1, 0: Missing module docstring (missing-docstring)
C: 27, 4: Invalid class name "unicode" (invalid-name)
C: 33, 0: Invalid constant name "_broadcastRegistry" (invalid-name)
W: 37, 4: Redefining name '_broadcastRegistry' from outer scope (line 33) (redefined-outer-name)
C: 36, 0: Missing function docstring (missing-docstring)
W: 37, 4: Module import itself (import-self)
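The warnings above map to standard pylint fixes; below is a minimal sketch, not pyspark's actual code (the commented-out import names are illustrative), of the explicit-import and non-shadowing style pylint asks for:

```python
"""A module docstring, which clears pylint's missing-docstring check."""

# wildcard-import: replace `from pyspark.status import *` with explicit
# names, so pylint can verify each one is actually used, e.g.:
# from pyspark.status import SparkJobInfo, SparkStageInfo  # illustrative

# redefined-builtin: instead of rebinding the builtin name `unicode`,
# bind the Python 2/3 text type to a distinct alias.
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3: `unicode` no longer exists

print(text_type("spark"))
```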
@MechCoder
MechCoder / Shell_commands.sh
Created July 6, 2015 17:06
Shell_commands
echo "$whatever"
# To make the shell search this directory for executables
export "PATH=newdirec:$PATH"
# To make Python import packages from outside the default
# /usr/local/lib/dist-packages directory
export "PYTHONPATH=newdirec:$PYTHONPATH"
# To iterate over the words of a string
for i in $string_separated_sentence; do
    echo "$i"
done
@MechCoder
MechCoder / bench_gaussian.scala
Last active August 29, 2015 14:16
A script to benchmark Gaussian with distributed vs. non-distributed mean and covariance updates.
import org.apache.spark.mllib.linalg.{Vectors, Matrices}
import org.apache.spark.mllib.stat.distribution.MultivariateGaussian
import org.apache.spark.mllib.clustering.GaussianMixture
import scala.util.Random
val rng = Random
rng.setSeed(0)
val nSamplesArray = Array(100, 200)
val nFeaturesArray = Array(10, 20, 50, 100, 200)
val trainData = {
  if (sparse) {
    data.map(sample => sample.asInstanceOf[SparseVector]).cache()
  } else {
    data.map(u => u.toBreeze.toDenseVector).cache()
  }
}
// Because the two branches give trainData different element types (SparseVector
// vs. a Breeze dense vector), the compiler infers an unhelpful common supertype
// and the next statement fails to compile.
val sums = {
@MechCoder
MechCoder / sql.scala
Created January 3, 2015 20:27
Spark SQL Errors
[error] /home/manoj/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala:308: polymorphic expression cannot be instantiated to expected type;
[error] found : [T(in method apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method apply)]
[error] required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(in method functionToUdfBuilder)]
[error] implicit def functionToUdfBuilder[T: TypeTag](func: Function1[_, T]): ScalaUdfBuilder[T] = ScalaUdfBuilder(func)
[error] ^
[error] /home/manoj/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala:310: polymorphic expression cannot be instantiated to expected type;
[error] found : [T(in method apply)]org.apache.spark.sql.catalyst.dsl.ScalaUdfBuilder[T(in method apply)]
[error] required: org.apache.spark.sql.catalyst.dsl.package.ScalaUdfBuilder[T(in method functionToUdfBuilder)]
@MechCoder
MechCoder / birch.txt
Last active August 29, 2015 14:10
Progress on Birch
1. It does not scale well to very sparse, high-dimensional data, where memory blows up:
for example, on the newsgroups dataset with around 80k features, it runs out of memory on my laptop.
2. From the profile, it seems about as optimized as possible. It is
slightly faster than MiniBatchKMeans for high n_clusters (around 1000),
slower than MiniBatchKMeans for higher n_features,
slightly faster than MiniBatchKMeans for higher n_features (~400) combined with high n_clusters (~1000).
3. Setting the threshold is a problem: almost every time I had to set it manually.
Total time: 0.18872 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
18 @profile
19 def _iterate_X(X):
20 """
21 This little hack returns a densified row when iterating over a sparse
22 matrix, instead of constructing a sparse matrix for every row,
23 which is expensive.
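The docstring above describes densifying one row at a time while iterating a sparse matrix; here is a minimal sketch of that trick using scipy's CSR layout directly. This is my reconstruction of the idea, not the profiled scikit-learn code:

```python
import numpy as np
from scipy import sparse

def iterate_rows_densified(X):
    """Yield each row of CSR matrix X as a dense 1-D ndarray.

    Reads X.indptr / X.indices / X.data directly, avoiding the cost of
    constructing a one-row sparse matrix on every iteration.
    """
    for i in range(X.shape[0]):
        row = np.zeros(X.shape[1], dtype=X.dtype)
        start, end = X.indptr[i], X.indptr[i + 1]
        row[X.indices[start:end]] = X.data[start:end]
        yield row

X = sparse.csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
rows = list(iterate_rows_densified(X))
```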
@MechCoder
MechCoder / tamiltunes.py
Last active August 29, 2015 14:09
Store downloaded tamil songs in a directory
# Store downloaded tamil songs in a directory from tamiltunes.com
# Supply links like http://tamiltunes.com/kayal-2014.html
# TODO: Format stuff like % in songs
import urllib
import os
a = raw_input("Enter link ")
b = urllib.urlopen(a)
html = b.read().split()
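The gist above is Python 2 (`raw_input`, `urllib.urlopen`) and cuts off here; below is a Python 3 sketch of the same fetch-and-split step, with the parsing factored out so it can run without a network. The function names are mine, not the gist's:

```python
from urllib.request import urlopen

def split_html(html_bytes):
    """Decode a downloaded page body and split it into whitespace tokens."""
    return html_bytes.decode("utf-8", errors="replace").split()

def fetch_tokens(url):
    """Download `url` and return its HTML as a list of tokens."""
    with urlopen(url) as resp:
        return split_html(resp.read())

# Offline example of the splitting step:
tokens = split_html(b"<a href='song.mp3'>song</a>\ndownload")
```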
This file has been truncated, but you can view the full file.
5.816400000000000000e+04 8.134310000000000000e+05
5.816000000000000000e+04 8.451400000000000000e+04
5.814800000000000000e+04 5.500010000000000000e+05
5.814400000000000000e+04 8.226450000000000000e+05
5.813400000000000000e+04 4.472990000000000000e+05
5.812100000000000000e+04 8.176100000000000000e+04
5.811500000000000000e+04 3.284160000000000000e+05
5.810900000000000000e+04 3.391240000000000000e+05
5.809800000000000000e+04 2.581170000000000000e+05
5.809100000000000000e+04 3.233850000000000000e+05
==============================================================
110 @profile
111 def insert_cf_subcluster(self, subcluster):
112 """
113 Insert a new subcluster into the node.
114 """
115 265652 183822 0.7 1.9 if not self.subclusters_:
116 1 3 3.0 0.0 self.update(subcluster)
117 1 0 0.0 0.0 return False
118