Olivier Grisel ogrisel

@ogrisel
ogrisel / gist:1225080
Created September 18, 2011 13:39
pignlproc topics - sample output
$ python categorize.py http://en.wikinews.org/wiki/Denmark_elects_new_centre-left_coalition_and_prime_minister
Category:College_of_Europe [0.454]
Category:Elections_in_Denmark [0.447]
Category:Party_of_European_Socialists [0.432]
Category:Politics_of_Denmark [0.425]
Category:Danish_law [0.408]
$ python categorize.py http://en.wikinews.org/wiki/Zimbabwe_minister_warns_media
ogrisel / output.txt
Created October 1, 2011 13:37
load_svmlight_file line profile
%lprun -f datasets.svmlight_format._load_svmlight_file _ = datasets.load_svmlight_file('competition_data/public_train_data.svmlight.dat')
Timer unit: 1e-06 s
File: /home/ogrisel/coding/scikit-learn/sklearn/datasets/svmlight_format.py
Function: _load_svmlight_file at line 21
Total time: 58.3922 s
Line # Hits Time Per Hit % Time Line Contents
==============================================================
ogrisel / profiler_output.txt
Created October 2, 2011 19:46
CountVectorizer profiling
Timer unit: 1e-06 s
File: sklearn/feature_extraction/text.py
Function: fit_transform at line 290
Total time: 16.3795 s
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   290                                           def fit_transform(self, raw_documents, y=None):
   291                                               """Learn the vocabulary dictionary and return the count vectors
ogrisel / emi_out.txt
Created October 23, 2011 09:05
Expected Mutual Information profiling
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   627                                           def expected_mutual_information(contingency, n_samples):
   628                                               """Calculate the expected mutual information for two labelings."""
   629        25           82      3.3      0.0      R, C = contingency.shape
   630        25           53      2.1      0.0      N = n_samples
   631        25         1059     42.4      0.0      a = np.sum(contingency, axis=1, dtype='int')
   632        25          932     37.3      0.0      b = np.sum(contingency, axis=0, dtype='int')
   633        25           58      2.3      0.0      emi = 0
   634      1839         4005      2.2      0.0      for i in range(R):
ogrisel / learning_curves.png
Created December 30, 2011 16:04
Learning Curves for under/overfitting evaluation
learning_curves.png
ogrisel / README.md
Created January 9, 2012 09:07
Closed form formula for the eps estimate from Johnson Lindenstrauss' lemma

Goal

Find a closed-form formula for the estimated epsilon (the distortion ratio of squared distances) in the Johnson-Lindenstrauss lemma. The goal is to implement it as a pure Python or pure NumPy function that computes eps from the number of points (samples or observations) n and the target dimension d.

Problem

The root expression found by SymPy is hard to rewrite as a Python function that returns a complex value: evaluating the intermediate expressions tends to overflow. SymPy can compute the numerical value with evalf, but lambdify does not compile the expression into a numerically stable NumPy implementation.
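
As a point of comparison, here is a minimal sketch that sidesteps the closed form entirely and recovers eps numerically by bisection. It is not the closed-form solution this gist is looking for. It assumes the usual statement of the bound, d >= 4 ln(n) / (eps^2 / 2 - eps^3 / 3), with eps restricted to (0, 1); the function names `jl_min_dim` and `jl_eps` are hypothetical.

```python
import math


def jl_min_dim(n_samples, eps):
    """Minimum target dimension d for n_samples points and distortion eps.

    Assumes the bound d >= 4 ln(n) / (eps^2 / 2 - eps^3 / 3).
    """
    return 4 * math.log(n_samples) / (eps ** 2 / 2 - eps ** 3 / 3)


def jl_eps(n_samples, n_components, tol=1e-12):
    """Invert the bound above for eps in (0, 1) by bisection.

    f(eps) = eps^2 / 2 - eps^3 / 3 is strictly increasing on (0, 1)
    because f'(eps) = eps * (1 - eps) > 0 there, and f(1) = 1 / 6.
    """
    target = 4 * math.log(n_samples) / n_components
    if target >= 1 / 6:
        raise ValueError("n_components is too small for any eps in (0, 1)")
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mid ** 2 / 2 - mid ** 3 / 3 < target:
            lo = mid  # f(mid) is too small, so the root lies above mid
        else:
            hi = mid  # f(mid) is large enough, so the root lies at or below mid
    return (lo + hi) / 2
```

Bisection only ever evaluates a low-degree polynomial on values in (0, 1), so none of the large intermediate terms of the symbolic root appear. A quick sanity check is that `jl_min_dim(n, jl_eps(n, d))` gives back d up to the tolerance.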

ogrisel / test.md
Created January 11, 2012 14:55
Markdown sandbox

This is a test

12
ogrisel / cython-0.16rc0-scikit-learn-stacktrace.md
Created March 31, 2012 21:41
cython 0.16rc0 crash report on scikit-learn master

To reproduce:

git clone https://github.com/scikit-learn/scikit-learn.git && cd scikit-learn
make cython

Here is the stderr output:


Error compiling Cython file:
ogrisel / output.txt
Created July 10, 2012 15:36
memmapping for random forests
/Users/oliviergrisel/coding/scikit-learn/sklearn/externals/joblib/parallel.py in __call__(self=Parallel(n_jobs=2), iterable=<generator object <genexpr> at 0x10467b3c0>)
470 self.n_dispatched = 0
471 try:
472 for function, args, kwargs in iterable:
473 self.dispatch(function, args, kwargs)
474
--> 475 self.retrieve()
self.retrieve = <bound method Parallel.retrieve of Parallel(n_jobs=2)>
476 # Make sure that we get a last message telling us we are done
477 elapsed_time = time.time() - self._start_time
ogrisel / .gitignore
Created December 14, 2012 19:19
Scratchpad for feature selection for clustering using a consensus ensemble method.
joblib
*.pyc