Gael Varoquaux GaelVaroquaux

## lasso.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso, lars_path

np.random.seed(42)

def gen_data(n, m, k):
    X = np.random.randn(n, m)
    w = np.zeros((m, 1))
    i = np.arange(0, m)

## map_wrapper.pyx
"""
Uses C++ map containers for fast dict-like behavior with keys being
integers, and values float.
"""
# Author: Gael Varoquaux
# License: BSD

# XXX: this needs Cython 0.17.1 or later. Otherwise you will get a C++ compilation error.

import numpy as np

## bench_dbscan.py
import numpy as np
import time

from sklearn import cluster
from sklearn import datasets


lfw = datasets.fetch_lfw_people()
X_lfw = lfw.data[:, :5]
eps = 8. # This choice of EPS gives 44 clusters

## Discussion
This gist is only meant for discussion.

## gist:a471e895eb4363def580
### Keybase proof

I hereby claim:

  * I am GaelVaroquaux on github.
  * I am gaelvaroquaux (https://keybase.io/gaelvaroquaux) on keybase.
  * I have a public key whose fingerprint is 44B8 B843 6321 47EB 59A9  8992 6C52 6A43 ABE0 36FC

To claim this, I am signing this object:

## mutual_info.py
'''
Non-parametric computation of entropy and mutual-information

Adapted by G Varoquaux from code created by R Brette, itself
from several papers (see in the code).

This code is maintained at https://github.com/mutualinfo/mutual_info
Please download the latest code there, to have improvements and
bug fixes.
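
As a rough illustration of what a non-parametric mutual-information estimate looks like, here is a minimal sketch based on a plain 2D histogram (a plug-in estimator). This is a swapped-in technique chosen for brevity, not necessarily the estimator used in the maintained code. The names `entropy_from_counts` and `binned_mutual_info` and the `bins=16` default are assumptions made for this example.

```python
import numpy as np


def entropy_from_counts(counts):
    """Shannon entropy (in nats) of a histogram given as raw counts."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))


def binned_mutual_info(x, y, bins=16):
    """Plug-in estimate of I(X; Y) = H(X) + H(Y) - H(X, Y) from a 2D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    h_x = entropy_from_counts(joint.sum(axis=1))  # marginal over y
    h_y = entropy_from_counts(joint.sum(axis=0))  # marginal over x
    h_xy = entropy_from_counts(joint.ravel())     # joint entropy
    return h_x + h_y - h_xy


rng = np.random.RandomState(0)
x = rng.randn(5000)
noise = rng.randn(5000)

# A strongly dependent pair versus an independent pair.
mi_dependent = binned_mutual_info(x, x + 0.5 * noise)
mi_independent = binned_mutual_info(x, noise)
print(mi_dependent, mi_independent)
```

Binned estimates like this one are biased by the choice of bin count and sample size, so the independent pair will usually score slightly above zero rather than exactly zero.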

## gist:33e7a7b297425890fefa

Notes from reproducible machine learning discussion at ICML 2015 MLOSS workshop

Created July 10, 2015 14:07

Introductory quote:

"Machine learning people use hugely complex algorithms on trivially simple datasets. Biology does trivially simple algorithms on hugely complex datasets."

Concepts of reproducible science


Replicability


## plot_standardizing_linear_model.py
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import Ridge, Lasso
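# sklearn.cross_validation and sklearn.grid_search were merged into
# sklearn.model_selection (scikit-learn 0.18) and later removed
from sklearn.model_selection import ShuffleSplit, GridSearchCV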
from sklearn.utils import check_random_state
from sklearn import datasets


## strategies_comparison.py

"""Persistence strategies comparison script.

This script computes the speed, memory used and disk space used when dumping and
loading arbitrary data. The data are taken from:
- the scikit-learn Labeled Faces in the Wild dataset (LFW)
- a fully random numpy array with 10000x10000 shape
- a dictionary with 1M random keys/values
- a list containing 10M random values
The compared persistence strategies are:

## notes.rst

Notes from PyData Paris discussion on scikit-learn

Last active June 21, 2016 03:31

Notes on scikit-learn round table

Q: What possible additions to scikit-learn are important to you?


Xavier Dupré (Microsoft): keep .fit in the API, but let X be a stream, from Spark for example, transparently for the user. Gaël: X is indexable and its len is n_samples; is that good enough? Answer: X accessible through a sequential iterator.
Jean-François Puget (IBM): IBM is betting on Spark at the scale of the company. Most machine learning applications have small data, but some don't. How can the bridge between scikit-learn and Spark get better? How can scikit-learn be used in a distributed environment? Not all algorithms can work out-of-core; some need distributed algorithms.
Jean-Paul Smet (Nexedi): Nexedi is an example company. Wendelin.core helps us remove the overhead and enables out-of-core computing. Next step of the story in a year.
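
The first suggestion, keeping `.fit` while letting X be consumed through a sequential iterator, can be sketched in a few lines. The class below is hypothetical: `StreamingStandardizer` is not a scikit-learn estimator, and `row_stream` merely stands in for a remote source such as a Spark cursor. It uses Welford's running-variance update so that each row is seen once and never stored.

```python
import math


class StreamingStandardizer:
    """Hypothetical estimator: learns per-feature mean/std from a stream of rows."""

    def fit(self, X):
        # X only needs to be iterable (list, generator, cursor over rows).
        # It is consumed once, sequentially, without indexing or len().
        n = 0
        mean = None
        m2 = None  # running sum of squared deviations (Welford's algorithm)
        for row in X:
            if mean is None:
                mean = [0.0] * len(row)
                m2 = [0.0] * len(row)
            n += 1
            for j, x in enumerate(row):
                delta = x - mean[j]
                mean[j] += delta / n
                m2[j] += delta * (x - mean[j])
        if n == 0:
            raise ValueError("empty stream")
        self.n_samples_seen_ = n
        self.mean_ = mean
        self.std_ = [math.sqrt(v / n) for v in m2]  # population std
        return self


def row_stream():
    """Stand-in for a remote source: yields rows one at a time."""
    for i in range(5):
        yield [float(i), 2.0 * i]


est = StreamingStandardizer().fit(row_stream())
print(est.mean_, est.std_)
```

This only works for estimators whose statistics can be updated incrementally in a single pass; algorithms that need several passes or random access to samples would need a different contract.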