@rdoume
Last active December 7, 2020 18:31

Awesome Data Science with Python

Core - What we are going to use

pandas
scikit-learn

plotly-express

matplotlib
seaborn
pandas_summary - DataFrameSummary(df).summary().
pandas_profiling
sklearn_pandas - Helpful DataFrameMapper class.

- Depending on the quality of the data, if it has a lot of missing values

janitor - Cleans column names.
missingno
qgrid - Pandas DataFrame sorting.

Efficient Pandas:

** modin - Parallelization library for faster pandas DataFrame. **

swifter - Apply any function to a pandas dataframe faster.
xarray - Extends pandas to n-dimensional arrays.

Pimp your notebook

General tricks: link
Python debugger (pdb) - blog post, video, cheatsheet
cookiecutter-data-science - Project template for data science projects.
nteract - Open Jupyter Notebooks with doubleclick.
blackcellmagic - Code formatting for jupyter notebooks.
pivottablejs - Drag n drop Pivot Tables and Charts for jupyter notebooks.
ipysheet - Jupyter spreadsheet widget.
nbdime - Diff two notebook files, Alternative GitHub App: ReviewNB.

Big Data - Depending on the size of the dataset they hand us (presumably small)

dask, dask-ml - Pandas DataFrame for big data and machine learning library, resources, talk1, talk2, notebooks, videos.
turicreate - Helpful SFrame class for out-of-memory dataframes.
h2o - Helpful H2OFrame class for out-of-memory dataframes.
datatable - Data Table for big data support.
ray - Flexible, high-performance distributed execution framework.
vaex - Out-of-Core DataFrames.

Command line tools - Possibly for data preparation

ni - Command line tool for big data.
xsv - Command line tool for indexing, slicing, analyzing, splitting and joining CSV files.
csvkit - Another command line tool for CSV files.
csvsort - Sort large csv files.

Exploration and Cleaning - The big Stuff

impyute - Imputations.
fancyimpute - Matrix completion and imputation algorithms.

** imbalanced-learn - Resampling for imbalanced datasets. **

tspreprocess - Time series preprocessing: Denoising, Compression, Resampling.

** Kaggler - Utility functions (OneHotEncoder(min_obs=100)) **

pyupset - Visualizing intersecting sets.
pyemd - Earth Mover's Distance, similarity between histograms.
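
The `OneHotEncoder(min_obs=100)` mentioned for Kaggler above suggests one-hot encoding with a minimum number of observations per category. Below is a minimal stdlib-only sketch of that idea. It does not call Kaggler, and the function name `one_hot_min_obs`, the `__other__` column, and the choice to pool rare categories into a single column are assumptions made for illustration.

```python
from collections import Counter

def one_hot_min_obs(values, min_obs=100):
    """One-hot encode `values`, pooling categories seen fewer than `min_obs` times."""
    counts = Counter(values)
    # Categories frequent enough to get their own column, in a stable order.
    kept = sorted(c for c, n in counts.items() if n >= min_obs)
    columns = kept + ["__other__"]  # hypothetical catch-all column for rare values
    index = {c: i for i, c in enumerate(kept)}
    other = len(kept)
    rows = []
    for v in values:
        row = [0] * len(columns)
        row[index.get(v, other)] = 1
        rows.append(row)
    return columns, rows

colors = ["red"] * 3 + ["blue"] * 2 + ["teal"]
columns, rows = one_hot_min_obs(colors, min_obs=2)
```

Here `teal` appears only once, so it lands in the catch-all column instead of creating a sparse column of its own.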

Feature Engineering

sklearn - Pipeline, examples.
pdpipe - Pipelines for DataFrames.

few - Feature engineering wrapper for sklearn.
skoot - Pipeline helper functions.
categorical-encoding - Categorical encoding of variables.
dirty_cat - Encoding dirty categorical variables.

** mlxtend - Nice for stacking. **

** featuretools - Automated feature engineering, example. **

tsfresh - Time series feature engineering.
pypeln - Concurrent data pipelines.

Dataiku (https://www.dataiku.com/) - We know it

Feature Selection

Tutorial, Talk

sklearn - Feature selection.
mlxtend - Features selection helper methods around sklearn.
scikit-feature - Feature selection algorithms.
stability-selection - Stability selection.
scikit-rebate - Relief-based feature selection algorithms.
scikit-genetic - Genetic feature selection.
boruta_py - Feature selection, explanation, example.
linselect - Feature selection package.

Dimensionality Reduction

prince - Dimensionality reduction, factor analysis (PCA, MCA, CA, FAMD).
sklearn - Multidimensional scaling (MDS).
sklearn - t-distributed Stochastic Neighbor Embedding (t-SNE), intro. Faster implementations: lvdmaaten, MulticoreTSNE.
sklearn - Truncated SVD (aka LSA).
mdr - Dimensionality reduction, multifactor dimensionality reduction (MDR).
umap - Uniform Manifold Approximation and Projection.
FIt-SNE - Fast Fourier Transform-accelerated Interpolation-based t-SNE.

Visualization

All charts, Austrian monuments.
cufflinks - Dynamic visualization library, wrapper for plotly, medium, example.
physt - Better histograms, talk.
matplotlib_venn - Venn diagrams.
joypy - Draw stacked density plots.
mosaic plots - Categorical variable visualization, example.
yellowbrick - Wrapper for matplotlib for diagnostic ML plots.
bokeh - Interactive visualization library, Examples, Examples.
animatplot - Animate plots built on matplotlib.
plotnine - ggplot for Python.
altair - Declarative statistical visualization library.
bqplot - Plotting library for IPython/Jupyter Notebooks.
hvplot - High-level plotting library built on top of holoviews.
dtreeviz - Decision tree visualization and model interpretation.
chartify - Generate charts.
VivaGraphJS - Graph visualization (JS package).
pm - Navigable 3D graph visualization (JS package), example.
python-ternary - Triangle plots.
falcon - Interactive visualizations for big data.

Recommender Systems - Might be overkill in 24 hours?

Examples: 1, 2, 2-ipynb, 3.
surprise - Recommender, talk.
turicreate - Recommender.
implicit - Fast Collaborative Filtering for Implicit Feedback Datasets.
spotlight - Deep recommender models using PyTorch.
lightfm - Recommendation algorithms for both implicit and explicit feedback.
funk-svd - Fast SVD.
pywFM - Factorization.

Decision Tree Models

Intro to Decision Trees and Random Forests, Intro to Gradient Boosting
lightgbm - Gradient boosting (GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, doc.
xgboost - Gradient boosting (GBDT, GBRT or GBM) library, doc, Methods for CIs: link1, link2.
catboost - Gradient boosting.
thundergbm - GBDTs and Random Forest.
h2o - Gradient boosting.
forestci - Confidence intervals for random forests.
scikit-garden - Quantile Regression.
grf - Generalized random forest.
dtreeviz - Decision tree visualization and model interpretation.
rfpimp - Feature Importance for RandomForests using Permutation Importance.
Why the default feature importance for random forests is wrong: link
treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions.
bartpy - Bayesian Additive Regression Trees.
infiniteboost - Combination of RFs and GBDTs.
merf - Mixed Effects Random Forest for Clustering, video
rrcf - Robust Random Cut Forest algorithm for anomaly detection on streams.
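
Permutation importance, mentioned for rfpimp above, can be sketched in a few lines: shuffle one feature column at a time and measure how much the score drops. This is a stdlib-only illustration under stated assumptions, not rfpimp's API. The `predict` function is a hypothetical stand-in for a trained model, and accuracy is an arbitrary choice of metric.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled version.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def predict(row):
    # Hypothetical "model": it looks at feature 0 only and ignores feature 1.
    return int(row[0] > 0.5)

data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [predict(row) for row in X]
importances = permutation_importance(predict, X, y)
```

Shuffling the ignored feature leaves the predictions unchanged, so its importance is zero, while shuffling the feature the model relies on costs a large share of the accuracy.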

Regression

Understanding SVM Regression: slides, forum, paper

pyearth - Multivariate Adaptive Regression Splines (MARS), tutorial.
pygam - Generalized Additive Models (GAMs), Explanation.
GLRM - Generalized Low Rank Models.

Clustering

pyclustering - All sorts of clustering algorithms.
somoclu - Self-organizing map.
hdbscan - Clustering algorithm.
nmslib - Similarity search library and toolkit for evaluation of k-NN methods.
buckshotpp - Outlier-resistant and scalable clustering algorithm.
merf - Mixed Effects Random Forest for Clustering, video

Interpretable Classifiers and Regressors

skope-rules - Interpretable classifier, IF-THEN rules.
sklearn-expertsys - Interpretable classifiers, Bayesian Rule List classifier.

Multi-label classification

scikit-multilearn - Multi-label classification, talk.

Hyperparameter Tuning - Sweet spot for us, this is where we are going to beat them

sklearn - GridSearchCV, RandomizedSearchCV.
hyperopt - Hyperparameter optimization.
hyperopt-sklearn - Hyperopt + sklearn.
skopt - BayesSearchCV for Hyperparameter search.
tune - Hyperparameter search with a focus on deep learning and deep reinforcement learning.
optuna - Hyperparameter optimization.
hypergraph - Global optimization methods and hyperparameter optimization.
bbopt - Black box hyperparameter optimization.
dragonfly - Scalable Bayesian optimisation.
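
As a reminder of what the grid-search tools above automate, here is a minimal stdlib-only sketch of an exhaustive search scored by k-fold cross-validation. It is not sklearn's `GridSearchCV`. The helper names, the toy targets, and the "shrunken mean" model with its `shrink` hyperparameter are all assumptions made up for illustration.

```python
from itertools import product
from statistics import mean

def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, test_idx) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + base + (1 if fold < extra else 0)
        test_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop

def grid_search(fit_score, param_grid, n_samples, n_splits=3):
    """Return (best_params, best_score); higher scores are better."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    # Try every combination of the candidate values.
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        fold_scores = [fit_score(params, train_idx, test_idx)
                       for train_idx, test_idx in k_fold_indices(n_samples, n_splits)]
        score = mean(fold_scores)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy data and a toy "model": predict shrink * mean(training targets).
targets = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0]

def fit_score(params, train_idx, test_idx):
    prediction = params["shrink"] * mean(targets[i] for i in train_idx)
    # Negative mean squared error, so that larger is better.
    return -mean((targets[i] - prediction) ** 2 for i in test_idx)

best_params, best_score = grid_search(
    fit_score, {"shrink": [0.0, 0.5, 1.0, 1.5]}, n_samples=len(targets))
```

The cost grows with the product of the grid sizes, which is why randomized and Bayesian search are listed alongside the exhaustive variant.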

Model Evaluation

pycm - Multi-class confusion matrix.
pandas_ml - Confusion matrix.
Plotting learning curve: link.
yellowbrick - Learning curve.
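
For reference, a multi-class confusion matrix of the kind pycm reports can be computed by hand. This is a minimal stdlib-only sketch with made-up labels; it follows the common convention of true classes on rows and predicted classes on columns, and it is not pycm's API.

```python
def confusion_matrix(y_true, y_pred):
    """Return (labels, matrix) where matrix[i][j] counts true labels[i] predicted as labels[j]."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return labels, matrix

y_true = ["cat", "dog", "dog", "bird", "cat", "bird"]
y_pred = ["cat", "dog", "cat", "bird", "cat", "dog"]
labels, matrix = confusion_matrix(y_true, y_pred)

# Per-class recall: diagonal count divided by the row total.
recall = {label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)}
```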

Model Explanation, Interpretability, Feature Importance - Very important for explainability

Book, Examples
shap - Explain predictions of machine learning models, talk.
treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions.
lime - Explaining the predictions of any machine learning classifier, talk, Warning (Myth 7).
lime_xgboost - Create LIMEs for XGBoost.
eli5 - Inspecting machine learning classifiers and explaining their predictions.
lofo-importance - Leave One Feature Out Importance, talk.
pybreakdown - Generate feature contribution plots.
FairML - Model explanation, feature importance.
pycebox - Individual Conditional Expectation Plot Toolbox.
pdpbox - Partial dependence plot toolbox, example.
partial_dependence - Visualize and cluster partial dependence.
skater - Unified framework to enable model interpretation.
anchor - High-Precision Model-Agnostic Explanations for classifiers.
l2x - Instancewise feature selection as methodology for model interpretation.
contrastive_explanation - Contrastive explanations.
DrWhy - Collection of tools for explainable AI.
lucid - Neural network interpretability.
xai - An eXplainability toolbox for machine learning.

Deployment and Lifecycle Management - Could be useful if we run lots of experiments

m2cgen - Transpile trained ML models into other languages.
sklearn-porter - Transpile trained scikit-learn estimators to C, Java, JavaScript and others.
mlflow - Manage the machine learning lifecycle, including experimentation, reproducibility and deployment.
modelchimp - Experiment Tracking.
skll - Command-line utilities to make it easier to run machine learning experiments.
BentoML - Package and deploy machine learning models for serving in production

List of things I am not sure we will use:

Statistics - In case we ever have to run statistical tests - I doubt it

Common statistical tests explained
Bland-Altman Plot - Plot for agreement between two methods of measurement.
scikit-posthocs - Statistical post-hoc tests for pairwise multiple comparisons.

Null Hypothesis Significance Testing (NHST), Correlation, Cohen's d, Confidence Interval, Equivalence, non-inferiority and superiority testing, Bayesian two-sample t test, Distribution of p-values when comparing two groups, Understanding the t-distribution and its normal approximation

Dashboards - I don't think this applies in our case, but just in case.

dash - Dashboarding solution by plot.ly. Tutorial: 1, 2, 3, 4, 5, example
bokeh - Dashboarding solution.
visdom - Dashboarding library by facebook.
bowtie - Dashboarding solution.
panel - Dashboarding solution.
altair example - Video

Geographical Tools - Likewise, I don't think we will need this, but just in case

folium - Plot geographical maps using the Leaflet.js library, jupyter plugin.
stadiamaps - Plot geographical maps.
datashader - Draw millions of points on a map.
sklearn - BallTree, Example.
pynndescent - Nearest neighbor descent for approximate nearest neighbors.
geocoder - Geocoding of addresses, IP addresses.
Conversion of different geo formats: talk, repo
geopandas - Tools for geographic data
Low Level Geospatial Tools (GEOS, GDAL/OGR, PROJ.4)
Vector Data (Shapely, Fiona, Pyproj)
Raster Data (Rasterio)
Plotting (Descartes, Cartopy)
Predict economic indicators from Open Street Map ipynb.
PySal - Python Spatial Analysis Library.
geography - Extract countries, regions and cities from a URL or text.

Natural Language Processing (NLP) / Text Processing

talk-nb, nb2, talk.
Text classification Intro, Preprocessing blog post.
gensim - NLP, doc2vec, word2vec, text processing, topic modelling (LSA, LDA), Example, Coherence Model for evaluation.
Embeddings - GloVe ([1], [2]), StarSpace, wikipedia2vec.
pyldavis - Visualization for topic modelling.
spaCy - NLP.
NLTK - NLP, helpful KMeansClusterer with cosine_distance.
pytext - NLP from Facebook.
fastText - Efficient text classification and representation learning.
annoy - Approximate nearest neighbor search.
faiss - Approximate nearest neighbor search.
pysparnn - Approximate nearest neighbor search.
infomap - Cluster (word-)vectors to find topics, example.
datasketch - Probabilistic data structures for large data (MinHash, HyperLogLog).
flair - NLP Framework by Zalando.
stanfordnlp - NLP Library.

Libs

fastai - Neural Networks in pytorch.

Time Series - I doubt we will need this, but who knows?

statsmodels - Time series analysis, seasonal decompose example, SARIMA, granger causality.
pyramid, pmdarima - Wrapper for (Auto-) ARIMA.
pyflux - Time series prediction algorithms (ARIMA, GARCH, GAS, Bayesian).
prophet - Time series prediction library.
htsprophet - Hierarchical Time Series Forecasting using Prophet.
tensorflow - LSTM and others, examples: link, link, link, Explain LSTM, seq2seq: 1, 2, 3, 4
tspreprocess - Preprocessing: Denoising, Compression, Resampling.
tsfresh - Time series feature engineering.
thunder - Data structures and algorithms for loading, processing, and analyzing time series data.
gatspy - General tools for Astronomical Time Series, talk.
gendis - shapelets, example.
tslearn - Time series clustering and classification, TimeSeriesKMeans.
pastas - Simulation of time series.
fastdtw - Dynamic Time Warp Distance.
fable - Time Series Forecasting (R package).
CausalImpact - Causal Impact Analysis (R package).
pydlm - Bayesian time series modeling (R package, Blog post)
PyAF - Automatic Time Series Forecasting.
luminol - Anomaly Detection and Correlation library from Linkedin.
matrixprofile-ts - Detecting patterns and anomalies, website, ppt.
obspy - Seismology package. Useful classic_sta_lta function.
RobustSTL - Robust Seasonal-Trend Decomposition.
seglearn - Time Series library.
pyts - Time series transformation and classification, Imaging time series.
Turn time series into images and use Neural Nets: example, example.
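
The Dynamic Time Warp distance listed for fastdtw above can be illustrated with the classic dynamic-programming recurrence. This is a minimal sketch of exact DTW in O(n*m) time with absolute difference as the local cost (an assumption); fastdtw itself computes an approximation and is not used here.

```python
def dtw_distance(a, b):
    """Exact dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] against b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

The first pair has the same shape at different speeds, so the warping absorbs the difference and the distance is zero, which a point-by-point comparison could not even compute for sequences of unequal length.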

Survival Analysis

Time-dependent Cox Model in R.
lifelines - Survival analysis, Cox PH Regression, talk, talk2.
scikit-survival - Survival analysis.
xgboost - "objective": "survival:cox" NHANES example
survivalstan - Survival analysis, intro.
convoys - Analyze time lagged conversions.
RandomSurvivalForests (R packages: randomForestSRC, ggRandomForests).

Outlier Detection & Anomaly Detection - Same here, I don't think so

sklearn - Isolation Forest and others.
pyod - Outlier Detection / Anomaly Detection.
eif - Extended Isolation Forest.
AnomalyDetection - Anomaly detection (R package).
luminol - Anomaly Detection and Correlation library from Linkedin.

Ranking - Who knows?

lightning - Large-scale linear classification, regression and ranking.

Scoring - Who Knows?

SLIM - Scoring systems for classification, Supersparse linear integer models.

Stacking Models and Ensembles - Important for gaining points.

Model Stacking Blog Post
mlxtend - EnsembleVoteClassifier, StackingRegressor, StackingCVRegressor for model stacking.
vecstack - Stacking ML models.
StackNet - Stacking ML models.
mlens - Ensemble learning.
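
The simplest ensemble in this family is hard (majority) voting, the idea behind a vote classifier such as mlxtend's EnsembleVoteClassifier. Below is a minimal stdlib-only sketch that combines label predictions already produced by several models. The three prediction lists are made up, and nothing here calls mlxtend.

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over per-model label lists, all of the same length."""
    voted = []
    for sample_votes in zip(*predictions):
        # On a tie, the label that was encountered first (earliest model) wins.
        voted.append(Counter(sample_votes).most_common(1)[0][0])
    return voted

model_a = [0, 1, 1, 0]
model_b = [0, 1, 0, 0]
model_c = [1, 1, 0, 1]
ensemble = hard_vote([model_a, model_b, model_c])
```

Stacking goes one step further by training a second-level model on the base models' out-of-fold predictions instead of counting votes.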

Automated Machine Learning - Extremely important. This is how we are going to save time.

AdaNet - Automated machine learning based on tensorflow.
tpot - Automated machine learning tool, optimizes machine learning pipelines.
auto_ml - Automated machine learning for analytics & production.
autokeras - AutoML for deep learning.
nni - Toolkit for neural architecture search and hyper-parameter tuning by Microsoft.
automl-gs - Automated machine learning.
H2o-AutoML - Probably the simplest

Evolutionary Algorithms & Optimization - Funky methods to get new features

deap - Evolutionary computation framework (Genetic Algorithm, Evolution strategies).
evol - DSL for composable evolutionary algorithms, talk.
platypus - Multiobjective optimization.
autograd - Efficiently computes derivatives of numpy code.
nevergrad - Derivative-free optimization.
gplearn - Sklearn-like interface for genetic programming.
blackbox - Optimization of expensive black-box functions.
Optometrist algorithm - paper.

Other

metric-learn - Metric learning.

Awesome Lists

Data Science Notebooks
Awesome Adversarial Machine Learning
Awesome AI Bookmarks
Awesome AI on Kubernetes
Awesome Big Data
Awesome Business Machine Learning
Awesome CSV
Awesome Data Science with Ruby
Awesome Deep Learning
Awesome ETL
Awesome Financial Machine Learning
Awesome GAN Applications
Awesome Machine Learning
Awesome Machine Learning Interpretability
Awesome Machine Learning Operations
Awesome Network Embedding
Awesome Python
Awesome Python Data Science
Awesome Python Data Science
Awesome Recommender Systems
Awesome Semantic Segmentation
Awesome Sentence Embedding
Awesome Time Series
Awesome Time Series Anomaly Detection
Recommender Systems (Microsoft)
The GAN Zoo - List of Generative Adversarial Networks
