Gael Varoquaux GaelVaroquaux

@GaelVaroquaux
GaelVaroquaux / 00README.rst
Last active September 15, 2023 03:58
Copy-less bindings of C-generated arrays with Cython

Cython example of exposing C-computed arrays in Python without data copies

The goal of this example is to show how an existing C codebase for numerical computing (here c_code.c) can be wrapped in Cython to be exposed in Python.

The meat of the example is that the data is allocated in C, but exposed in Python without a copy using the PyArray_SimpleNewFromData numpy
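The gist itself relies on Cython and the NumPy C API. As a rough pure-Python analogy only (not the gist's code), the sketch below uses `ctypes` to stand in for a C-allocated buffer and `np.ctypeslib.as_array` to view that memory without copying it. The names `c_buffer` and `arr` are illustrative.

```python
import ctypes

import numpy as np

# Stand-in for memory allocated by C code: a ctypes array of 5 doubles.
n = 5
c_buffer = (ctypes.c_double * n)(*[float(i) for i in range(n)])

# View the existing memory as a NumPy array; no data is copied.
arr = np.ctypeslib.as_array(c_buffer)

# Because the memory is shared, a write on either side is seen by the other.
arr[0] = 42.0
c_buffer[1] = -1.0
```

When the memory is owned by C, care is needed that the buffer is not freed while an array viewing it is still in use.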

GaelVaroquaux / mutual_info.py
Last active June 18, 2023 12:25
Estimating entropy and mutual information with scikit-learn: visit https://github.com/mutualinfo/mutual_info
'''
Non-parametric computation of entropy and mutual-information
Adapted by G Varoquaux from code created by R Brette, itself
based on several papers (see the code).
This code is maintained at https://github.com/mutualinfo/mutual_info
Please download the latest code there, to have improvements and
bug fixes.
GaelVaroquaux / impact_encoding.py
Created October 29, 2018 14:19
Target encoding (or impact encoding)
# How to use: df should be the dataframe restricted to the categorical values to impact,
# target should be the pd.Series of target values.
# Use fit, transform, etc.
# Three types: binary, multiple, continuous.
# For now m is a parameter <===== but what should we put here? I guess some function of the total shape.
# I mean, what would be the value of m we want to have for 0.5?
import pandas as pd
import numpy as np
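The comments above leave the role of `m` open. One common reading, assumed here and not confirmed by the preview, is m-estimate smoothing: each category's target mean is shrunk toward the global mean, with `m` acting as a number of pseudo-observations. The function name `smoothed_target_encode` is hypothetical.

```python
import pandas as pd


def smoothed_target_encode(categories, target, m=10.0):
    """Encode a categorical Series by the smoothed mean of the target.

    Assumed formula (m-estimate): (sum_c + m * prior) / (count_c + m),
    where prior is the global mean of the target.
    """
    prior = target.mean()
    # Per-category sum and count of the target values.
    stats = target.groupby(categories).agg(["sum", "count"])
    encoding = (stats["sum"] + m * prior) / (stats["count"] + m)
    # Map each row's category to its encoded value.
    return categories.map(encoding)


cats = pd.Series(["a", "a", "b"])
y = pd.Series([1.0, 0.0, 1.0])
encoded = smoothed_target_encode(cats, y, m=1.0)
```

Under this formula, a category observed exactly `m` times gets an encoding halfway between its own mean and the global mean, and `m=0` reduces to plain per-category means.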
GaelVaroquaux / deconfound.py
Last active July 18, 2021 12:35
Linear deconfounding in a fit-transform API
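A minimal sketch of what linear deconfounding with a fit/transform interface could look like, using NumPy least squares instead of scikit-learn. The class name `LinearDeconfounderSketch` and the choice to pass the confound to both `fit` and `transform` are assumptions made for illustration, not the gist's actual API.

```python
import numpy as np


class LinearDeconfounderSketch:
    """Remove the linear effect of a confound y from each column of X."""

    @staticmethod
    def _design(y, n_samples):
        # Intercept column plus the confound(s), one row per sample.
        Y = np.asarray(y, dtype=float).reshape(n_samples, -1)
        return np.column_stack([np.ones(n_samples), Y])

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        design = self._design(y, len(X))
        # Least-squares regression of every column of X on the confound.
        self.coef_, *_ = np.linalg.lstsq(design, X, rcond=None)
        return self

    def transform(self, X, y):
        X = np.asarray(X, dtype=float)
        # Subtract the part of X predicted from the confound.
        return X - self._design(y, len(X)) @ self.coef_


rng = np.random.default_rng(0)
confound = rng.normal(size=100)
X = np.outer(confound, [2.0, -1.0]) + rng.normal(size=(100, 2))
X_clean = LinearDeconfounderSketch().fit(X, confound).transform(X, confound)
```

Since the design matrix includes an intercept, the residuals on the training data are also mean-centered.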
"""
A scikit-learn like transformer to remove a confounding effect on X.
"""
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.linear_model import LinearRegression
import numpy as np
class DeConfounder(BaseEstimator, TransformerMixin):
    """ A transformer removing the effect of y on X.
GaelVaroquaux / map_wrapper.pyx
Created October 17, 2012 10:31
Wrapping CPP map container to a dict-like Python object
"""
Uses C++ map containers for fast dict-like behavior with keys being
integers, and values float.
"""
# Author: Gael Varoquaux
# License: BSD
# XXX: this needs Cython 0.17.1 or later. Otherwise you will get a C++ compilation error.
import numpy as np
GaelVaroquaux / SparsePCA.py
Created January 28, 2011 10:12
A sparse PCA implementation based on the LARS algorithm
import time
import sys
import numpy as np
from numpy.lib.stride_tricks import as_strided
from math import sqrt
from scipy import linalg
from scikits.learn.linear_model import Lasso, lars_path
from joblib import Parallel, delayed
GaelVaroquaux / count_3_grams.py
Created October 31, 2017 19:23
Fast 3-gram counting on small strings
"""
Fast counting of 3-grams for short strings.
Quick benchmarking seems to show that pure Python code is faster for
strings of less than 1000 characters, and that the numpy version is
faster for longer strings.
Very long strings would benefit from probabilistic counting (bloom
filter, count-min sketch) as implemented e.g. in the "bounter" module.
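As a rough illustration of the pure-Python approach mentioned in this preview, 3-grams can be counted with a sliding window and `collections.Counter`. This is a sketch, not the gist's implementation, and the function name `count_3_grams` is assumed.

```python
from collections import Counter


def count_3_grams(string):
    """Count every overlapping 3-character substring of `string`."""
    # Strings shorter than 3 characters yield an empty Counter.
    return Counter(string[i:i + 3] for i in range(len(string) - 2))


counts = count_3_grams("abcabc")
```

The window advances one character at a time, so a string of length n contributes n - 2 overlapping 3-grams.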
import numpy as np
import pylab as pl
import matplotlib.transforms as mtransforms
################################################################################
# Display correlation matrices
def fit_axes(ax):
    """ Resize the given axes so that the labels fit.
    """
GaelVaroquaux / notes.rst
Last active June 21, 2016 03:31
Notes from Pydata Paris discussion on scikit-learn

Notes on scikit-learn round table

Q: What possible additions to scikit-learn are important to you?

  • Xavier Dupré (Microsoft): keep .fit in the API, but let X be a stream, from Spark for example, transparently for the user. Gaël: X is indexable and its len is n_samples, is that good enough? Answer: X should be accessible through a sequential iterator.
  • Jean-François Puget (IBM): IBM is betting on Spark at the scale of the company. Most machine learning applications have small data, but some don't. How can the bridge between scikit-learn and Spark get better? How can scikit-learn be used in a distributed environment? Not all algorithms can work out-of-core; some need a distributed algorithm.
  • Jean-Paul Smet (Nexedi): Nexedi is an example company. Wendelin.core helps us remove the overhead and enables out-of-core computing. Next step of the story in a year.
GaelVaroquaux / strategies_comparison.py
Last active May 17, 2016 20:22 — forked from aabadie/strategies_comparison.py
Persistence strategies comparison
"""Persistence strategies comparison script.
This script computes the speed, memory used and disk space used when dumping and
loading arbitrary data. The data are taken from:
- scikit-learn Labeled Faces in the Wild dataset (LFW)
- a fully random numpy array with 10000x10000 shape
- a dictionary with 1M random keys/values
- a list containing 10M random values
The compared persistence strategies are: