Gael Varoquaux GaelVaroquaux

GaelVaroquaux / impact_encoding.py
Created Oct 29, 2018
Target encoding (or impact encoding)
View impact_encoding.py
# How to use: df should be the dataframe restricted to the categorical columns to encode,
# target should be the pd.Series of target values.
# Use fit, transform, etc.
# Three target types: binary, multiple, continuous.
# For now m is a parameter <===== but what should we put here? I guess some function of the total shape.
# I mean, what would be the value of m we want to have for 0.5?
import pandas as pd
import numpy as np
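The preview stops before the encoder itself. As a rough illustration of what a smoothing parameter m usually does in target encoding, here is a minimal pure-Python sketch. The function names, the additive-smoothing formula, and the toy data are assumptions for illustration, not the gist's actual code (which works on pandas objects).

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, m=10.0):
    """Learn a smoothed mean target per category (hypothetical helper).

    Each category is encoded as (sum_of_targets + m * prior) / (count + m),
    so the weight on the category's own mean is count / (count + m).
    """
    prior = sum(targets) / len(targets)  # global mean of the target
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    mapping = {
        cat: (sums[cat] + m * prior) / (counts[cat] + m) for cat in counts
    }
    return mapping, prior


def transform_target_encoding(categories, mapping, prior):
    """Encode categories; unseen categories fall back to the global mean."""
    return [mapping.get(cat, prior) for cat in categories]


# Toy data, invented for the example.
cities = ["paris", "paris", "lyon", "paris", "lyon", "nice"]
clicked = [1, 0, 1, 1, 1, 0]
mapping, prior = fit_target_encoding(cities, clicked, m=2.0)
encoded = transform_target_encoding(["paris", "nice", "lille"], mapping, prior)
```

Under this particular formula, a category observed exactly m times gets a weight of 0.5 on its own mean and 0.5 on the global mean, which is one way to read the question about m in the comments above.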
GaelVaroquaux / deconfound.py
Last active Apr 5, 2018
Linear deconfounding in a fit-transform API
View deconfound.py
"""
A scikit-learn like transformer to remove a confounding effect on X.
"""
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.linear_model import LinearRegression
import numpy as np
class DeConfounder(BaseEstimator, TransformerMixin):
    """ A transformer removing the effect of y on X.
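The class body is not shown in this preview. A minimal sketch of linear deconfounding in a fit/transform style could look like the following. It assumes the confound is passed as y to both methods and uses `np.linalg.lstsq` in place of the `LinearRegression` estimator the gist imports; the class name and the toy data are hypothetical.

```python
import numpy as np


class LinearDeconfounderSketch:
    """Hypothetical stand-in: regress each column of X on y, keep residuals."""

    @staticmethod
    def _design(y):
        # Build the design matrix [1, y] so that an intercept is fitted too.
        y = np.asarray(y, dtype=float)
        if y.ndim == 1:
            y = y[:, np.newaxis]
        return np.hstack([np.ones((y.shape[0], 1)), y])

    def fit(self, X, y):
        # Least-squares coefficients of every column of X on the confound y.
        X = np.asarray(X, dtype=float)
        self.coef_, *_ = np.linalg.lstsq(self._design(y), X, rcond=None)
        return self

    def transform(self, X, y):
        # Subtract the part of X that is linearly predicted by y.
        return np.asarray(X, dtype=float) - self._design(y) @ self.coef_


# Toy data, invented for the example: a signal contaminated by a confound.
rng = np.random.default_rng(0)
confound = rng.normal(size=200)
signal = rng.normal(size=(200, 3))
X = signal + 2.0 * confound[:, np.newaxis]
X_clean = LinearDeconfounderSketch().fit(X, confound).transform(X, confound)
```

On the data used for fitting, the residuals are uncorrelated with the confound by construction. On held-out data this holds only approximately.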
GaelVaroquaux / count_3_grams.py
Created Oct 31, 2017
Fast 3-gram counting on small strings
View count_3_grams.py
"""
Fast counting of 3-grams for short strings.
Quick benchmarking seems to show that pure Python code is faster
for strings of fewer than 1000 characters, and the numpy version is faster for
longer strings.
Very long strings would benefit from probabilistic counting (Bloom
filter, count-min sketch) as implemented e.g. in the "bounter" module.
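The implementations are not included in this preview. For orientation, a pure-Python 3-gram counter for short strings can be as small as the sketch below. The function name is an assumption, and this is not necessarily the variant benchmarked in the gist.

```python
from collections import Counter


def count_3_grams(text):
    """Count overlapping 3-character substrings of text (hypothetical helper)."""
    # range() is empty for strings shorter than 3 characters, so those
    # simply yield an empty Counter.
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


counts = count_3_grams("abababc")
# counts == Counter({"aba": 2, "bab": 2, "abc": 1})
```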
View matrix_plotting.py
import numpy as np
import pylab as pl
import matplotlib.transforms as mtransforms
################################################################################
# Display correlation matrices
def fit_axes(ax):
    """ Resize the given axes so that the labels fit.
    """
GaelVaroquaux / notes.rst
Last active Jun 21, 2016
Notes from Pydata Paris discussion on scikit-learn
View notes.rst

Notes on scikit-learn round table

Q: What possible additions to scikit-learn are important to you?

  • Xavier Dupré (Microsoft): keep .fit in the API, but let X be a stream, for example from Spark, transparently for the user. Gaël: if X is indexable and len(X) is n_samples, is that good enough? Answer: X should be accessible through a sequential iterator.
  • Jean-François Puget (IBM): IBM is betting on Spark company-wide. Most machine learning applications have small data, but some don't. How can the bridge between scikit-learn and Spark get better? How can scikit-learn be used in a distributed environment? Not all algorithms can work out-of-core; some need a distributed algorithm.
  • Jean-Paul Smet (Nexedi): Nexedi is an example company. Wendelin.core helps us remove the overhead and enables out-of-core computing. Next step of the story in a year.
GaelVaroquaux / strategies_comparison.py
Last active May 17, 2016 — forked from aabadie/strategies_comparison.py
Persistence strategies comparison
View strategies_comparison.py
"""Persistence strategies comparison script.
This script computes the speed, memory used and disk space used when dumping and
loading arbitrary data. The data are taken from:
- scikit-learn Labeled Faces in the Wild dataset (LFW)
- a fully random numpy array with 10000x10000 shape
- a dictionary with 1M random keys/values
- a list containing 10M random values
The compared persistence strategies are:
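The list of strategies is cut off in this preview. As a rough idea of the kind of measurement the docstring describes, here is a minimal sketch that times a dump and a load and records the size on disk, using only the standard-library `pickle`. The helper name, the returned keys, and the much smaller test list are assumptions; memory usage is not measured here.

```python
import os
import pickle
import random
import tempfile
import time


def benchmark_pickle(obj):
    """Time pickle dump/load of obj and report the file size (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "data.pkl")

        start = time.perf_counter()
        with open(path, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
        dump_seconds = time.perf_counter() - start

        size_bytes = os.path.getsize(path)

        start = time.perf_counter()
        with open(path, "rb") as f:
            loaded = pickle.load(f)
        load_seconds = time.perf_counter() - start

    stats = {
        "dump_seconds": dump_seconds,
        "load_seconds": load_seconds,
        "size_bytes": size_bytes,
    }
    return stats, loaded


# A far smaller list than the 10M values mentioned above, to keep the run short.
data = [random.random() for _ in range(10_000)]
stats, loaded = benchmark_pickle(data)
```

Other strategies could be compared by swapping the dump and load calls while keeping the same timing and size bookkeeping.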
View plot_standardizing_linear_model.py
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso
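# sklearn.cross_validation and sklearn.grid_search were removed in
# scikit-learn 0.20; both classes now live in sklearn.model_selection.
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import GridSearchCV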
from sklearn.utils import check_random_state
from sklearn import datasets
GaelVaroquaux / gist:33e7a7b297425890fefa
Created Jul 10, 2015
Notes from reproducible machine learning discussion at ICML 2015 MLOSS workshop
View gist:33e7a7b297425890fefa

Introductory quote:

"Machine learning people use hugely complex algorithms on trivially simple datasets. Biology does trivially simple algorithms on hugely complex datasets."

Concepts of reproducible science

  • Replicability
GaelVaroquaux / mutual_info.py
Last active Jun 11, 2019
Estimating entropy and mutual information with scikit-learn
View mutual_info.py
'''
Non-parametric computation of entropy and mutual-information
Adapted by G Varoquaux from code created by R Brette, itself
from several papers (see in the code).
These computations rely on nearest-neighbor statistics
'''
import numpy as np
View gist:a471e895eb4363def580
### Keybase proof
I hereby claim:
* I am GaelVaroquaux on github.
* I am gaelvaroquaux (https://keybase.io/gaelvaroquaux) on keybase.
* I have a public key whose fingerprint is 44B8 B843 6321 47EB 59A9 8992 6C52 6A43 ABE0 36FC
To claim this, I am signing this object: