View matrix_plotting.py
import numpy as np
import pylab as pl
import matplotlib.transforms as mtransforms
################################################################################
# Display correlation matrices
def fit_axes(ax):
""" Redimension the given axes to have labels fitting.
"""
View notes.rst

Notes on scikit-learn round table

Q: What possible additions to scikit-learn are important to you?

  • Xavier Dupré (Microsoft): keep .fit in the API, but let X be a stream (from Spark, for example), transparently for the user. Gaël: would "indexable, with len(X) equal to n_samples" be good enough? Answer: X should be accessible through a sequential iterator.
  • Jean-François Puget (IBM): IBM is betting on Spark at the scale of the company. Most machine-learning applications have small data, but some don't. How can the bridge between scikit-learn and Spark get better? How can scikit-learn be used in a distributed environment? Not all algorithms can work out of core; some need distributed algorithms.
  • Jean-Paul Smet (Nexedi): Nexedi is one example of such a company. Wendelin.core helps us remove the overhead and enables out-of-core computing. The next step of the story comes in a year.
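The streaming-X idea in the first bullet can be sketched as a toy estimator whose fit only walks X through a sequential iterator, never indexing it or calling len(X). Everything here (the class name, the statistic computed) is hypothetical, not a scikit-learn API:

```python
import numpy as np

class StreamingMean:
    """Toy estimator whose fit() touches X only via sequential iteration,
    as discussed for Spark-backed or otherwise streamed inputs."""

    def fit(self, X, y=None):
        # X may be any iterable of samples; we never index it or ask len(X)
        total = None
        n = 0
        for row in X:
            row = np.asarray(row, dtype=float)
            total = row if total is None else total + row
            n += 1
        self.mean_ = total / n
        return self

# Works even when X is a one-shot iterator rather than an array
est = StreamingMean().fit(iter([[0.0, 2.0], [2.0, 4.0]]))
```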
View strategies_comparison.py
"""Persistence strategies comparison script.
This script computes the speed, memory use and disk space used when dumping and
loading arbitrary data. The data are chosen from:
- the scikit-learn Labeled Faces in the Wild (LFW) dataset
- a fully random numpy array of shape 10000x10000
- a dictionary with 1M random keys/values
- a list containing 10M random values
The compared persistence strategies are:
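The excerpt cuts off before listing the strategies. As a minimal sketch of the kind of measurement the script describes, here is a dump/load timing for one assumed strategy, plain pickle (the helper name and setup are mine, not the script's):

```python
import os
import pickle
import tempfile
import time

import numpy as np

def bench_pickle(obj):
    """Time dumping and loading obj with pickle.

    Returns (dump_seconds, load_seconds, file_size_bytes).
    """
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
    try:
        t0 = time.time()
        with open(path, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
        dump_s = time.time() - t0

        size = os.path.getsize(path)  # disk space used

        t0 = time.time()
        with open(path, "rb") as f:
            pickle.load(f)
        load_s = time.time() - t0
        return dump_s, load_s, size
    finally:
        os.remove(path)

dump_s, load_s, size = bench_pickle(np.zeros((100, 100)))
```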
View plot_standardizing_linear_model.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
# cross_validation and grid_search were later merged into model_selection
from sklearn.model_selection import ShuffleSplit, GridSearchCV
from sklearn.utils import check_random_state
from sklearn import datasets
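The imports suggest a comparison of penalized linear models with and without standardization. A minimal sketch of such a setup, using a scaler-plus-Ridge pipeline on synthetic data (the pipeline, dataset, and grid here are assumptions, not the script's actual experiment):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=5.0,
                       random_state=0)

# Standardize features before the penalized fit, so the penalty sees
# all coefficients on a comparable scale
model = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(
    model,
    param_grid={"ridge__alpha": np.logspace(-3, 3, 7)},
    cv=ShuffleSplit(n_splits=5, test_size=0.25, random_state=0),
)
search.fit(X, y)
best_alpha = search.best_params_["ridge__alpha"]
```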
View gist:33e7a7b297425890fefa

Introductory quote:

"Machine learning people use hugely complex algorithms on trivially simple datasets. Biology does trivially simple algorithms on hugely complex datasets."

Concepts of reproducible science

  • Replicability
View mutual_info.py
'''
Non-parametric computation of entropy and mutual information.
Adapted by G Varoquaux from code created by R Brette, itself
drawn from several papers (see references in the code).
These computations rely on nearest-neighbor statistics.
'''
import numpy as np
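As a sketch of what "nearest-neighbor statistics" means here, the classic Kozachenko-Leonenko k-NN estimator of differential entropy can be written in a few lines (this implementation is my own minimal version, not the gist's code):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma, gamma

def knn_entropy(X, k=3):
    """Kozachenko-Leonenko k-NN estimate of differential entropy, in nats.

    X : array of shape (n_samples, n_dims)
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # Distance from each point to its k-th nearest neighbor
    # (query k+1 neighbors: the nearest one is the point itself)
    dist, _ = cKDTree(X).query(X, k=k + 1)
    eps = dist[:, -1]
    # Volume of the d-dimensional Euclidean unit ball
    v_d = np.pi ** (d / 2) / gamma(d / 2 + 1)
    return digamma(n) - digamma(k) + np.log(v_d) + d * np.mean(np.log(eps))

rng = np.random.RandomState(0)
# For a 1-D standard Gaussian the true entropy is 0.5 * log(2*pi*e)
h = knn_entropy(rng.randn(5000, 1))
```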
View gist:a471e895eb4363def580
### Keybase proof
I hereby claim:
* I am GaelVaroquaux on github.
* I am gaelvaroquaux (https://keybase.io/gaelvaroquaux) on keybase.
* I have a public key whose fingerprint is 44B8 B843 6321 47EB 59A9 8992 6C52 6A43 ABE0 36FC
To claim this, I am signing this object:
View Discussion
This gist is only meant for discussion.
View bench_dbscan.py
import numpy as np
import time
from sklearn import cluster
from sklearn import datasets
lfw = datasets.fetch_lfw_people()
X_lfw = lfw.data[:, :5]
eps = 8.  # this choice of eps gives 44 clusters
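The excerpt stops before the clustering call. A minimal sketch of the timed DBSCAN run, on synthetic blobs instead of LFW (the eps/min_samples values below are chosen for this toy data, not the benchmark's):

```python
import time

import numpy as np
from sklearn import cluster, datasets

centers = np.array([[0, 0], [5, 5], [-5, 5], [5, -5], [-5, -5]])
X, _ = datasets.make_blobs(n_samples=1000, centers=centers,
                           cluster_std=0.5, random_state=0)

t0 = time.time()
db = cluster.DBSCAN(eps=0.8, min_samples=5).fit(X)
elapsed = time.time() - t0

# Number of clusters found, ignoring the noise label (-1)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
```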
View map_wrapper.pyx
"""
Uses C++ map containers for fast dict-like behavior, with integer
keys and float values.
"""
# Author: Gael Varoquaux
# License: BSD
# XXX: this needs Cython 0.17.1 or later; with older versions you will get a C++ compilation error.
import numpy as np