Gael Varoquaux GaelVaroquaux

@GaelVaroquaux
GaelVaroquaux / lasso.py
Created October 5, 2012 07:12 — forked from aweinstein/lasso.py
scikit-learn Lasso path: fat vs. thin X matrix
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, lars_path
np.random.seed(42)
def gen_data(n, m, k):
    X = np.random.randn(n, m)
    w = np.zeros((m, 1))
    i = np.arange(0, m)
GaelVaroquaux / map_wrapper.pyx
Created October 17, 2012 10:31
Wrapping a C++ map container in a dict-like Python object
"""
Uses C++ map containers for fast dict-like behavior, with integer keys
and float values.
"""
# Author: Gael Varoquaux
# License: BSD
# XXX: this needs Cython 0.17.1 or later. Otherwise you will get a C++ compilation error.
import numpy as np
GaelVaroquaux / bench_dbscan.py
Last active December 20, 2015 10:19
Benchmarking the scikit-learn 0.14.X release
import numpy as np
import time
from sklearn import cluster
from sklearn import datasets
lfw = datasets.fetch_lfw_people()
X_lfw = lfw.data[:, :5]
eps = 8. # This choice of EPS gives 44 clusters
GaelVaroquaux / Discussion
Created February 27, 2014 14:38
Temporary. Only for discussion purposes
This gist is only meant for discussion.
### Keybase proof
I hereby claim:
* I am GaelVaroquaux on github.
* I am gaelvaroquaux (https://keybase.io/gaelvaroquaux) on keybase.
* I have a public key whose fingerprint is 44B8 B843 6321 47EB 59A9 8992 6C52 6A43 ABE0 36FC
To claim this, I am signing this object:
GaelVaroquaux / mutual_info.py
Last active June 18, 2023 12:25
Estimating entropy and mutual information with scikit-learn: visit https://github.com/mutualinfo/mutual_info
'''
Non-parametric computation of entropy and mutual-information
Adapted by G Varoquaux from code created by R Brette, itself
from several papers (see in the code).
This code is maintained at https://github.com/mutualinfo/mutual_info
Please download the latest code there, to have improvements and
bug fixes.
GaelVaroquaux / gist:33e7a7b297425890fefa
Created July 10, 2015 14:07
Notes from reproducible machine learning discussion at ICML 2015 MLOSS workshop

Introductory quote:

"Machine learning people use hugely complex algorithms on trivially simple datasets. Biology does trivially simple algorithms on hugely complex datasets."

Concepts of reproducible science

  • Replicability
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
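# sklearn.cross_validation and sklearn.grid_search were removed in
# scikit-learn 0.20; both classes now live in sklearn.model_selection.
from sklearn.model_selection import ShuffleSplit, GridSearchCV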
from sklearn.utils import check_random_state
from sklearn import datasets
GaelVaroquaux / strategies_comparison.py
Last active May 17, 2016 20:22 — forked from aabadie/strategies_comparison.py
Persistence strategies comparison
"""Persistence strategies comparison script.
This script computes the speed, memory used and disk space used when dumping
and loading arbitrary data. The data are taken from:
- scikit-learn Labeled Faces in the Wild dataset (LFW)
- a fully random numpy array with 10000x10000 shape
- a dictionary with 1M random keys/values
- a list containing 10M random values
The compared persistence strategies are:
GaelVaroquaux / notes.rst
Last active June 21, 2016 03:31
Notes from Pydata Paris discussion on scikit-learn

Notes on scikit-learn round table

Q: What possible additions to scikit-learn are important to you?

  • Xavier Dupré (Microsoft): keep .fit in the API, but allow X to be a stream, for example from Spark, in a way that is transparent for the user. Gaël: if X is indexable and its len is n_samples, is that good enough? Answer: X is accessible through a sequential iterator.
  • Jean-François Puget (IBM): IBM is betting on Spark company-wide. Most machine learning applications have small data, but some don't. How can the bridge between scikit-learn and Spark be improved? How can scikit-learn be used in a distributed environment? Not all algorithms can work out-of-core; some need a distributed algorithm.
  • Jean-Paul Smet (Nexedi): Nexedi is an example company. Wendelin.core helps us remove the overhead and enables out-of-core computing. Next step of the story in a year.
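
The first request above (keep `.fit`, but let X be something that is only read through a sequential iterator) could look roughly like the sketch below. `StreamingStandardizer` and `row_stream` are hypothetical names made up for illustration; they are not scikit-learn API, and the generator merely stands in for a real stream such as rows arriving from Spark. The estimator makes a single pass over X and never calls `len(X)` or indexes into it.

```python
import math
from typing import Iterable, Sequence


class StreamingStandardizer:
    """Hypothetical estimator (not scikit-learn API).

    Learns per-feature mean and standard deviation from any iterable
    of rows, using Welford's one-pass update.
    """

    def fit(self, X: Iterable[Sequence[float]]) -> "StreamingStandardizer":
        n = 0
        mean = None
        m2 = None  # running sum of squared deviations per feature
        for row in X:  # single sequential pass; no len(X), no X[i]
            if mean is None:
                mean = [0.0] * len(row)
                m2 = [0.0] * len(row)
            n += 1
            for j, value in enumerate(row):
                delta = value - mean[j]
                mean[j] += delta / n
                m2[j] += delta * (value - mean[j])
        if n == 0:
            raise ValueError("X yielded no samples")
        self.n_samples_seen_ = n
        self.mean_ = mean
        self.scale_ = [math.sqrt(v / n) for v in m2]
        return self


def row_stream():
    """Stand-in for an external stream: yields one row at a time."""
    for i in range(1, 6):
        yield (float(i), 10.0 * i)


# The same .fit call works on a stream and on an in-memory list.
streamed = StreamingStandardizer().fit(row_stream())
in_memory = StreamingStandardizer().fit(list(row_stream()))
```

The calling code is identical in both cases, which is the "transparent for the user" property asked for. Only estimators whose statistics can be updated incrementally fit this pattern, which matches the remark that not all algorithms can work out-of-core.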