sklearn quickref

Scikit Learn

  • supports numpy array, scipy sparse matrix, pandas dataframe.
  • Estimator - learns from data: can be a classification, regression or clustering algorithm that extracts/filters useful features from raw data - implements set_params, fit(X,y), predict(T), score (judges the quality of the fit / prediction), predict_proba (confidence level)
  • Transformer - transform (reduce dimensionality) / inverse_transform - clean (sklearn.preprocessing), reduce dimensions (unsupervised dimensionality reduction), expand (sklearn.kernel_approximation) or generate feature representations (sklearn.feature_extraction).
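
A minimal sketch of this shared API (the iris dataset, StandardScaler and LogisticRegression are arbitrary illustrative choices, not prescribed by these notes):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

scaler = StandardScaler()               # Transformer
X_scaled = scaler.fit_transform(X)      # fit + transform in one call
X_back = scaler.inverse_transform(X_scaled)

clf = LogisticRegression()              # Estimator
clf.fit(X_scaled, y)
print(clf.predict(X_scaled[:5]))        # class labels
print(clf.predict_proba(X_scaled[:5]))  # confidence levels
print(clf.score(X_scaled, y))           # mean accuracy
clf.set_params(C=0.5)                   # update hyperparameters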

sklearn.cluster

properties: labels_, cluster_centers_. distance metrics - maximize distance between samples in different classes, and minimize it within each class: Euclidean distance (l2), Manhattan distance (l1) - good for sparse features, cosine distance - invariant to global scalings, or any precomputed affinity matrix.

  • DBSCAN - deterministically separates areas of high density from areas of low density. the cluster shape doesn't have to be convex. uses ball trees / kd-trees to determine the neighborhood of points, which avoids calculating the full distance matrix.
  • Birch - lossily compresses the dataset. builds a Characteristic Feature Tree (CFT). does not scale very well to high dimensional data.
  • KMeans - General-purpose, even cluster size, convex cluster shape, flat geometry, not too many clusters. separates samples into n groups of equal variance, choosing centroids that minimize inertia or within-cluster sum-of-squares. Inertia limits - not suited to non-convex or non-isotropic elongated clusters, or manifolds with irregular shapes; and is not a normalized metric - in very high-dimensional spaces, Euclidean distances tend to become inflated. 2 steps per iteration: A) assign each sample to its nearest centroid. B) adjust each centroid to the mean value of all of the cluster samples. equivalent to the EM algorithm with a small, all-equal, diagonal covariance matrix. visualized with Voronoi diagrams. always converges, however this may be to a local minimum - highly dependent on the initialization of the centroids.
  • MiniBatchKMeans - reduce the computation time, with random subsets
  • AffinityPropagation - A cluster is described by a small number of representative exemplars - send messages between pairs of samples until convergence
  • MeanShift - discover blobs in a smooth density of samples
  • SpectralClustering - low-dimension embedding of the affinity matrix between samples, followed by a KMeans in the low dimensional space.
  • AgglomerativeClustering - hierarchical bottom-up successive merging: more computationally efficient
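
An illustrative KMeans run on synthetic blobs, showing the labels_, cluster_centers_ and inertia_ attributes mentioned above (dataset and parameters are arbitrary):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])        # cluster assignment per sample
print(km.cluster_centers_)    # centroids that minimize inertia
print(km.inertia_)            # within-cluster sum of squares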

sklearn.cluster.bicluster

simultaneously cluster rows and columns of a data matrix - each bicluster determines a submatrix.

  • SpectralCoclustering - finds biclusters with values higher than those in the corresponding other rows and columns. treats the input data matrix as a bipartite graph and performs generalized eigenvalue decomposition of the Laplacian of the graph.
  • SpectralBiclustering - assumes that the input data matrix has a hidden checkerboard structure.

sklearn.covariance

  • EmpiricalCovariance - estimation of a population’s covariance matrix using maximum likelihood estimator
  • ShrunkCovariance - applies transformation with a user-defined shrinkage coefficient to improve MLE estimation of the eigenvalues
  • LedoitWolf - compute the optimal shrinkage coefficient that minimizes the MSE of the covariance matrix
  • OAS - assumes Gaussian distributed data; yields a smaller MSE than Ledoit-Wolf under that assumption
  • GraphLasso - uses an l1 penalty to enforce sparsity on the precision matrix (inverse of the covariance matrix)
  • MinCovDet - find a given proportion of good observations which are not outliers and compute their empirical covariance matrix, then rescale it (with weights according to their Mahalanobis distance) to compensate for the performed selection of observations
  • EllipticEnvelope - outlier detection - fits an ellipse to the central data points, ignoring points outside the central mode. decide whether a new observation belongs to the same distribution as existing observations (it is an inlier).
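
A short sketch comparing the empirical estimator with the shrinkage estimators above (synthetic Gaussian data, arbitrary sizes):

import numpy as np
from sklearn.covariance import EmpiricalCovariance, LedoitWolf, OAS

rng = np.random.RandomState(0)
X = rng.multivariate_normal(mean=[0, 0, 0], cov=np.diag([1.0, 2.0, 3.0]), size=50)

for est in (EmpiricalCovariance(), LedoitWolf(), OAS()):
    est.fit(X)
    # diagonal of the estimated covariance matrix; shrinkage pulls it toward a common value
    print(type(est).__name__, np.round(est.covariance_.diagonal(), 2))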

sklearn.datasets

  • load_iris
  • load_diabetes
  • load_digits
  • fetch_lfw_people - images
  • fetch_20newsgroups - NLP
  • make_blobs, make_classification, make_gaussian_quantiles, make_hastie_10_2, make_circles, make_moons, make_multilabel_classification, make_biclusters, make_checkerboard, make_regression, make_friedman1/2/3, make_s_curve, make_swiss_roll, make_low_rank_matrix, make_sparse_coded_signal, make_spd_matrix, make_sparse_spd_matrix - Sample generators
  • load_boston - house prices
  • sklearn.datasets.base.Bunch:
{'DESCR': str,
 'data': numpy.ndarray,             # (n_samples, n_features) array
 'feature_names': list, / 'images': numpy.ndarray,
 'target': numpy.ndarray,           # (n_samples,) array
 'target_names': numpy.ndarray}
  • load_svmlight_file, fetch_olivetti_faces, fetch_20newsgroups, fetch_20newsgroups_vectorized, fetch_mldata (from mldata.org), fetch_lfw_people (jpeg archive), fetch_lfw_pairs, fetch_covtype, fetch_rcv1 (news corpus)
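
Loading a bundled dataset (a Bunch with the fields listed above) and generating a synthetic one (sizes are arbitrary):

from sklearn.datasets import load_iris, make_classification

iris = load_iris()
print(iris.data.shape, iris.target.shape)     # (150, 4) (150,)
print(iris.feature_names, iris.target_names)

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)
print(X.shape, y.shape)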

sklearn.decomposition

decompose a multivariate dataset into a set of successive orthogonal components that explain a maximum amount of the variance. also provides a probabilistic interpretation that can give a likelihood of data based on the amount of variance it explains

  • PCA - looks for a combination of features that captures well the variance of the original features. selects the successive orthogonal components that explain the maximum variance in the signal. finds the directions in which the data is not flat --> reduce the dimensionality: explained_variance_. limitations for large datasets - only supports batch processing as data must fit in memory. can set svd_solver='randomized' to project data to a lower-dimensional space that preserves most of the variance, by dropping the singular vectors of components associated with lower singular values.
  • IncrementalPCA - minibatch support allows for partial computations
  • FastICA - Independent component analysis separates a multivariate signal into additive subcomponents - so that the distribution of their loadings carries a maximum amount of independent information. It is able to recover non-Gaussian independent signals
  • KernelPCA - non-linear dimensionality reduction through the use of kernels. for denoising, compression and structured prediction (kernel dependency estimation).
  • SparsePCA - extracting the set of sparse components that best reconstruct the data. yields a more parsimonious, interpretable representation.
  • MiniBatchSparsePCA - a faster, but less accurate version
  • TruncatedSVD - variant of singular value decomposition (SVD) that only computes the largest singular values. used on TF-IDF matrices for latent semantic analysis (LSA).
  • SparseCoder - transform signals into sparse linear combination of atoms from a fixed, precomputed dictionary such as a discrete wavelet basis.
  • DictionaryLearning - use matrix factorization to find a dictionary that can sparsely encode the fitted data: Representing data as sparse combinations of atoms from an overcomplete dictionary
  • MiniBatchDictionaryLearning - a faster, but less accurate version
  • FactorAnalysis - a classical statistical model
  • NMF - Non-negative matrix factorization - finds a decomposition of samples into two matrices of non-negative elements, by optimizing the squared Frobenius norm
  • LatentDirichletAllocation - LDA - generative probabilistic model for discovering abstract topics from a collection of documents. Uses variational Bayes to maximize the Evidence Lower Bound (ELBO) - equivalent to minimizing the Kullback-Leibler(KL) divergence.
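
A PCA sketch showing explained_variance_ratio_ and the randomized solver mentioned above (the digits dataset and n_components=10 are illustrative choices):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA(n_components=10, svd_solver='randomized', random_state=0).fit(X)
X_reduced = pca.transform(X)
print(X_reduced.shape)                      # (1797, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance kept by 10 components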

sklearn.discriminant_analysis

supervised dimensionality reduction to N dimensions, by projecting the input data to a linear subspace consisting of the directions which maximize the separation between classes. closed-form solutions that can be easily computed, are inherently multiclass, no hyperparameters to tune. derived from simple probabilistic models which model the class conditional distribution of the data for each class . Use Bayes’ rule with Gaussian prior (with mean and covariance estimate from training data) to select class which maximizes posterior conditional probability.

  • LinearDiscriminantAnalysis - LDA - the Gaussians for each class are assumed to share the same covariance matrix
  • QuadraticDiscriminantAnalysis - QDA - no assumptions on the covariance matrices of the Gaussians. if the covariance matrices are diagonal, then the inputs are assumed to be conditionally independent in each class, and the resulting classifier is equivalent to naive_bayes.GaussianNB
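
LDA used as supervised dimensionality reduction, plus QDA as a classifier (iris is an arbitrary illustrative dataset):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)        # project onto 2 class-separating directions
print(X_2d.shape, lda.score(X, y))

qda = QuadraticDiscriminantAnalysis().fit(X, y)
print(qda.score(X, y))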

sklearn.ensemble

combine the predictions of multiple base estimators (usually decision trees) -> improve generalizability / robustness. techniques: averaging for complex models (random forests, bagging), boosting for simple models (sequential - AdaBoost, Gradient Tree Boosting) - doesn't scale. main hyperparameters are n_estimators and max_features

  • BaggingClassifier / BaggingRegressor - average on random subsets of the original training set. reduce variance of base estimator by introducing randomization into its construction procedure and then making an ensemble out of it -> reduce overfitting. pasting - draw without replacement, bagging - with replacement.
  • RandomForestClassifier / RandomForestRegressor - average randomized decision trees built by bagging. choose best split among a random subset of the features
  • ExtraTreesClassifier / ExtraTreesRegressor - instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature then pick best one as the splitting rule --> reduce variance
  • RandomTreesEmbedding - use random forest for unsupervised data transformation: neighbor data points are more likely to lie within the same leaf of a tree - take one-of-K leaf indices to form a sparse high-dimensional binary encoding of data --> implicit, non-parametric density estimation.
  • AdaBoostClassifier / AdaBoostRegressor - fit a sequence of weak learners on repeatedly modified versions of the data, iteratively learning sample weights (increase weights for mispredicted samples and decrease them for correctly predicted ones) - examples that are difficult to predict receive ever-increasing influence, and each subsequent weak learner is thereby forced to concentrate on the examples missed by the previous ones in the sequence. combine predictions through a weighted majority vote (or sum).
  • GradientBoostingClassifier / GradientBoostingRegressor - a generalization of boosting to arbitrary differentiable loss functions; subsampling combines gradient boosting with bootstrap averaging (bagging). choice of tree size, loss function, shrinkage (regularization) and subsampling can further increase the accuracy. feature_importances_ - ranks the relative importance of features
  • VotingClassifier - combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. hard voting - majority voting , soft voting - argmax of the sum of predicted weighted probabilities. can also be used with GridSearch in order to tune the hyperparameters of the individual estimators.
  • IsolationForest - outlier detection - ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
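
An averaging vs. boosting sketch on a synthetic problem; n_estimators / max_features as noted above (all values are arbitrary illustrative choices):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=0)
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)

for model in (rf, gb):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())

print(rf.fit(X, y).feature_importances_[:5])  # relative feature importances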

sklearn.ensemble.partial_dependence

  • plot_partial_dependence - Partial dependence plots (PDP) show the dependence between the target response and a set of ‘target’ features, marginalizing over the values of all other features (the ‘complement’ features)

sklearn.externals

  • joblib: dump, load - persistence
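
A persistence sketch. In the release these notes describe, joblib ships under sklearn.externals; in current releases it is imported directly (the model and file name are arbitrary):

try:
    from sklearn.externals import joblib   # older scikit-learn releases
except ImportError:
    import joblib                           # current releases
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression().fit(X, y)

joblib.dump(clf, 'model.joblib')            # serialize the fitted estimator to disk
clf_loaded = joblib.load('model.joblib')
print(clf_loaded.predict(X[:3]))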

sklearn.feature_selection

dimensionality reduction. univariate statistical test params - For regression: f_regression, mutual_info_regression; For classification: chi2, f_classif, mutual_info_classif

  • VarianceThreshold - removes all features whose variance doesn’t meet some threshold.
  • SelectKBest removes all but the k highest scoring features
  • SelectPercentile removes all but a user-specified highest scoring percentage of features
  • SelectFpr - univariate statistical tests for false positive rate of each feature
  • SelectFdr - univariate statistical tests for false discovery rate of each feature
  • SelectFwe - univariate statistical tests for family wise error of each feature
  • GenericUnivariateSelect - univariate feature selection with a configurable strategy.
  • RFE - recursive feature elimination: recursively considering smaller and smaller sets of features
  • RFECV - performs RFE in a cross-validation loop
  • SelectFromModel - remove if model coef_ or feature_importances_ values are below the provided threshold
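
Univariate selection and recursive elimination sketches (chi2, k=2 and LogisticRegression are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

X_k = SelectKBest(chi2, k=2).fit_transform(X, y)    # keep the 2 highest-scoring features
print(X_k.shape)

rfe = RFE(LogisticRegression(), n_features_to_select=2).fit(X, y)
print(rfe.support_, rfe.ranking_)                    # which features survived elimination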

sklearn.feature_extraction

extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and image.

  • DictVectorizer - convert feature arrays represented as lists of standard Python dict objects to one-hot coding for categorical (aka nominal, discrete) features. uses a scipy.sparse matrix by default instead of a numpy.ndarray. get_feature_names
  • FeatureHasher - high-speed, low-memory vectorizer: applies a hash function to the features to determine their column index in sample matrices. the signed MurmurHash3 function is used so that hash collisions tend to cancel out rather than accumulate.
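
Dict-of-features to sparse matrix, with FeatureHasher as the stateless low-memory alternative (the toy records and n_features=8 are arbitrary):

from sklearn.feature_extraction import DictVectorizer, FeatureHasher

records = [{'city': 'London', 'temp': 12.0},
           {'city': 'Paris',  'temp': 18.0}]

vec = DictVectorizer()
X = vec.fit_transform(records)      # scipy.sparse matrix, one-hot for 'city'
print(vec.feature_names_)           # also exposed via get_feature_names() in older releases
print(X.toarray())

hasher = FeatureHasher(n_features=8)
X_h = hasher.transform(records)     # hashed column indices, no vocabulary kept in memory
print(X_h.shape)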

sklearn.feature_extraction.image

  • extract_patches_2d - extracts patches from an image 2D array (or 3D with color information along the third axis)
  • reconstruct_from_patches_2d - inverse
  • PatchExtractor
  • img_to_graph - graph of pixel-to-pixel gradient connections (connectivity matrix) of an image
  • grid_to_graph - for agglomerative clustering (cluster neighboring pixels of an image, forming contiguous patches), specify which samples to cluster together via a connectivity graph (sparse adjacency matrix) - connected regions of an image

sklearn.feature_extraction.text

Text preprocessing, tokenizing and filtering of stopwords. build a dictionary of features and transform documents to feature vectors:

  • CountVectorizer - implements tokenization and occurrence counting for bag of words representation (document word counts as high dimensional sparse matrix). supports counts of N-grams of words or consecutive characters, token occurrence frequency. vocabulary_, build_analyzer. disadv: cannot capture phrases and multi-word expressions - does not preserve local structure (word order dependence) of sentences and paragraphs, doesn’t account for misspellings or word derivations - need to lemmatize.
  • TfidfTransformer - re-weight the count features so that high frequency terms don't shadow the frequencies of rarer yet more interesting terms.
  • TfidfVectorizer - combines CountVectorizer and TfidfTransformer in a single model
  • HashingVectorizer - combines the FeatureHasher and CountVectorizer
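
Bag-of-words and TF-IDF on a tiny toy corpus (the documents and ngram_range are arbitrary):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['the cat sat on the mat',
          'the dog sat on the log',
          'cats and dogs']

counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(corpus)
print(counts.shape)                 # sparse document-term matrix with unigrams + bigrams

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)     # CountVectorizer + TfidfTransformer in one step
print(sorted(tfidf.vocabulary_)[:5])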

sklearn.gaussian_process

for regression and probabilistic classification. advantages: the prediction interpolates the observations and is probabilistic (Gaussian) - can compute empirical confidence intervals --> online / adaptive refitting of the region of interest; versatile: different kernels; does not suffer from the exponential scaling of a kernel ridge regression grid search. disadvantages: not sparse - uses the entire samples/features, loses efficiency in medium+ dimensional spaces.

  • GaussianProcessRegressor - specify a prior distribution and maximize the log-marginal-likelihood (LML). uses a kernel to define the covariance of a prior distribution over the target functions and uses the observed training data to define a likelihood function. Based on Bayes theorem, a (Gaussian) posterior distribution over target functions is defined, whose mean is used for prediction.
  • GaussianProcessClassifier - class probabilities. places a GP prior on a latent nuisance function, which is then squashed through a logistic link function to obtain the probabilistic classification. approximates the non-Gaussian posterior with a Gaussian based on the Laplace approximation. multi-class classification (one-versus-rest or one-versus-one).

sklearn.gaussian_process.kernels

compute the GP’s covariance between datapoints. covariance functions determine the shape of prior and posterior of the GP. They encode the assumptions on the function being learned by defining the “similarity” of two datapoints combined with the assumption that similar datapoints should have similar target values. stationary kernels depend only on the distance of two datapoints and not on their absolute values and are thus invariant to translations in the input space, while non-stationary kernels depend also on the specific values of the datapoints. isotropic stationary kernels are also invariant to rotations in the input space.

  • Kernel - abstract base
  • ConstantKernel
  • Product
  • Sum
  • RBF - for long term, smooth rising trend. stationary, isotropic.
  • Matern - generalization of RBF - has an additional parameter to control function smoothness
  • ExpSineSquared - periodic RBF
  • RationalQuadratic - a scale mixture (an infinite sum) of RBF kernels with different characteristic length-scales. smaller, medium term irregularities
  • WhiteKernel - includes noise term
  • DotProduct - non-stationary.
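
Kernel composition (Sum / Product arise from the + and * operators) feeding a GP regressor; the noisy sine data and kernel hyperparameters are arbitrary illustrative choices:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, ConstantKernel

X = np.linspace(0, 10, 30).reshape(-1, 1)
y = np.sin(X).ravel()

kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel).fit(X, y)

y_mean, y_std = gpr.predict(X, return_std=True)    # probabilistic prediction
print(gpr.kernel_)                                  # hyperparameters tuned by maximizing LML
print(gpr.log_marginal_likelihood(gpr.kernel_.theta))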

sklearn.kernel_approximation

explicit functions that approximate the implicit feature mappings that correspond to certain kernels -> efficiency

  • Nystroem - a general method for low-rank approximations of kernels
  • RBFSampler - constructs a Monte Carlo approximate mapping for the radial basis function kernel.
  • AdditiveChi2Sampler - additive chi squared kernel is a kernel on histograms, often used in computer vision. sample the Fourier transform in regular intervals, instead of approximating using Monte Carlo sampling.
  • SkewedChi2Sampler
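
A Nystroem feature map feeding a linear SGD classifier, approximating a kernelized SVM (gamma and n_components are arbitrary, untuned values):

from sklearn.datasets import load_digits
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
model = make_pipeline(
    Nystroem(kernel='rbf', gamma=0.02, n_components=300, random_state=0),
    SGDClassifier(max_iter=1000, tol=1e-3, random_state=0))
print(model.fit(X, y).score(X, y))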

sklearn.linear_model

  • LinearRegression: coef_, intercept_. minimizes the residual sum of squares between the observed and the predicted responses - makes the sum of the squared residuals of the model as small as possible. coefficient estimates for Ordinary Least Squares rely on the independence of the model terms. multicollinearity - when terms are correlated, the estimate becomes highly sensitive to random errors in the observed response, producing a large variance.
  • Ridge - imposes a penalty on the size of coefficients: linear least squares with l2-norm regularization. Shrinkage solution: shrink regression coefficients toward zero (with few samples per dimension, any two randomly chosen sets of observations are likely to be uncorrelated). bias/variance tradeoff: the larger the ridge alpha parameter, the higher the bias and the lower the variance - choose alpha to minimize the left-out error. introduces bias (regularizes) to decrease the contribution of non-informative features. alpha controls the degree of shrinkage; note that it is an l1 penalty (Lasso, below) that leads to sparse solutions, driving most coefficients to zero.
  • RidgeCV - ridge regression with built-in cross-validation of the alpha parameter. like GridSearchCV except that it defaults to Generalized Cross-Validation (GCV), an efficient form of leave-one-out cross-validation
  • Lasso - (least absolute shrinkage and selection operator). set some coefficients to zero (the sparse method) --> simpler models converge faster for some high dimensional data
  • LassoLarsCV - Least Angle Regression algorithm For high-dimensional datasets with many collinear regressors
  • LassoLarsIC - use Akaike information criterion (AIC) + Bayes Information criterion (BIC). computationally cheaper
  • MultiTaskLasso - estimates sparse coefficients for multiple regression problems jointly
  • RandomizedLasso / RandomizedLogisticRegression - use randomization to prevent overfitting
  • ElasticNet - trained with L1 and L2 prior as regularizer --> learning a sparse model where few of the weights are non-zero like Lasso, while still maintaining the regularization properties of Ridge. control the convex combination of L1 and L2 using the l1_ratio parameter. useful when there are multiple features which are correlated with one another.
  • MultiTaskElasticNet
  • Lars - Least-angle regression for high-dimensional data. similar to forward stepwise regression, but instead of including variables at each step, the estimated parameters are increased in a direction equiangular to each one’s correlations with the residual. Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each value of the L1 norm of the parameter vector.
  • LassoLars
  • OrthogonalMatchingPursuit - the OMP algorithm for approximating the fit of a linear model with constraints imposed on the number of non-zero coefficients. a forward feature selection method
  • BayesianRidge - introducing spherical Gaussian priors over the hyper parameters of the model (gamma distributions). robust.
  • ARDRegression - similar to Bayesian Ridge Regression, but with elliptical Gaussian priors --> sparser weights
  • LogisticRegression - sigmoid: gives less weight to data far from the decision frontier. for classification. known as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. can fit binary, One-vs-Rest (separate binary classifiers are trained for all classes), or multinomial logistic regression with optional L2 or L1 regularization. uses solvers “liblinear” (coordinate descent (CD) algorithm via the C++ LIBLINEAR lib), “newton-cg”, “lbfgs” (multinomial) and “sag” (very large datasets). returns well-calibrated probability estimates by default as it directly optimizes log-loss.
  • SGDClassifier / SGDRegressor - Stochastic Gradient Descent using different convex loss functions (loss: log [logistic regression] / hinge [soft-margin SVC], modified_huber [smoothed hinge], least-squares [ridge regression], epsilon-insensitive [soft-margin SVR]) and different penalties (penalty = l1 / l2 / elasticnet) on coef_
  • Perceptron - for large scale learning
  • PassiveAggressiveClassifier / PassiveAggressiveRegressor - perceptron with regularization
  • RANSAC - robust regression for outliers in Y - (RANdom SAmple Consensus) fits a model from random subsets of inliers from the complete data set.
  • TheilSenRegressor - robust multivariate regression for outliers in X for low dimensionality - uses a generalization of the median in multiple dimensions.
  • HuberRegressor - fastest most robust regression for small datasets - regularization technique applies a linear loss to downweigh samples that are classified as outliers
  • KernelRidge - combines Ridge Regression with the kernel trick. unlike SVR, uses squared error loss - can be done in closed form, but model is non-sparse. faster than SVR for small to medium-sized training sets
  • SGDClassifier - Stochastic Gradient Descent. for unconstrained optimization problems. In contrast to (batch) gradient descent, approximates the true gradient by considering a single training example at a time. adv: efficiency (linear in the number of training examples), tunable. disadv: sensitive to feature scaling (requires preprocessing: StandardScaler)
  • SGDRegressor
  • IsotonicRegression - fits a non-decreasing function to data.
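
A regularization sketch comparing OLS, Ridge (l2) and Lasso (l1) coefficients on synthetic data (alpha values are arbitrary, untuned):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    model.fit(X, y)
    # Lasso drives most coefficients to exactly zero; Ridge only shrinks them
    print(type(model).__name__, np.round(model.coef_, 1))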

sklearn.manifold

manifold learning - a nearest-neighbour approach to non-linear dimensionality reduction. linear dimensionality reduction algorithms [Principal Component Analysis (PCA), Independent Component Analysis (ICA), Linear Discriminant Analysis] are powerful, but often miss important non-linear structure in the data.

  • Isomap - Isometric Mapping - extension of Multi-dimensional Scaling (MDS) or Kernel PCA. Isomap seeks a lower-dimensional embedding which maintains geodesic distances between all points. 3 stages: 1) Nearest neighbor search using BallTree 2) Shortest-path graph search using Dijkstra’s Algorithm or Floyd-Warshall algorithm 3) Partial eigenvalue decomposition - embedding is encoded in the eigenvectors corresponding to the n largest eigenvalues
  • LocallyLinearEmbedding - LLE - lower-dimensional projection of the data which preserves distances within local neighborhoods. Like a series of local PCAs. alternative stage 2 to Isomap: performs Weight Matrix Construction. Regularization variants via the method param: A) modified - modified LLE (MLLE) - uses multiple weight vectors in each neighborhood. B) hessian - Hessian Eigenmapping - a hessian-based quadratic form at each neighborhood is used to recover the locally linear structure. C) ltsa - Local Tangent Space Alignment: characterizes the local geometry at each neighborhood via its tangent space, and performs a global optimization to align these local tangent spaces.
  • SpectralEmbedding - Laplacian Eigenmaps - finds a low dimensional representation of the data using a spectral decomposition of the graph Laplacian. a discrete approximation of the low dimensional manifold in the high dimensional space. Minimization of a cost function based on the graph ensures that points close to each other on the manifold are mapped close to each other in the low dimensional space, preserving local distances. alternative stage 2 to Isomap: performs Graph Laplacian Construction.
  • MDS - retain the distance ratios of the original high-dimensional space. for analyzing similarity as distance. 2 types: A) metric: the input similarity matrix arises from a metric (and thus respects the triangular inequality) -> distances between two output points are set to be as close as possible to the similarity / dissimilarity. B) non-metric: preserve the order of the distances, and hence seek a monotonic relationship between the distances in the embedded space and the similarities/dissimilarities
  • TSNE - t-distributed Stochastic Neighbor Embedding - converts Gaussian joint-probability affinities of data points in the original space into Student's t-distribution affinities in the embedded space. uses gradient descent to minimize the Kullback-Leibler (KL) divergence between the joint probabilities of the original space and the embedded space. advantages: sensitive to local structure - good for extracting clustered local groups of samples, revealing the structure at many scales and data in different manifolds / clusters, reducing the tendency to crowd points together at the center. disadvantages: computationally expensive, limited dimensionality, stochastic (requires multiple restarts)
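
Non-linear embedding sketches: Isomap and t-SNE down to 2 components on a slice of digits (the subset size and n_neighbors are arbitrary choices to keep it cheap):

from sklearn.datasets import load_digits
from sklearn.manifold import Isomap, TSNE

X, _ = load_digits(return_X_y=True)
X = X[:500]                                    # t-SNE is computationally expensive

X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)
print(X_iso.shape, X_tsne.shape)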

sklearn.metrics

  • classification_report - text report showing the main classification metrics
  • confusion_matrix - evaluates classification accuracy: number of observations actually in group i, but predicted to be in group j.
  • cohen_kappa_score - compare labelings by different human annotators, not a classifier versus a ground truth.
  • hamming_loss - computes the average Hamming distance between two sets of samples.
  • hinge_loss - computes the average distance between the model and the data. considers only prediction errors. used in maximal margin classifiers
  • log_loss - evaluate the probability outputs (predict_proba) of a classifier instead of its discrete predictions. the negative log-likelihood of the classifier given the true label
  • zero_one_loss - the sum or the average of the 0-1 classification loss
  • brier_score_loss - computes the Brier score for binary classes: measures the accuracy of probabilistic predictions
  • coverage_error - computes the average number of true labels that have to be included in the final prediction such that all true labels are predicted.
  • jaccard_similarity_score - computes the average (default) or sum of Jaccard similarity coefficients between pairs of label sets.
  • adjusted_rand_score - ARI - for clustering: measures similarity of two assignments, ignoring permutations. score is normalized to [-1.0, 1.0], with random (uniform) at 0.0. No assumption is made on the cluster structure, but requires knowledge of the ground truth classes
  • normalized_mutual_info_score / adjusted_mutual_info_score - for clustering: measures agreement of two assignments, ignoring permutations. score is normalized to [0.0, 1.0]. requires knowledge of the ground truth classes
  • accuracy_score - default score function of classifiers to evaluate a parameter setting
  • r2_score - default score function of regressors to evaluate a parameter setting. the coefficient of determination: how well future samples are likely to be predicted by the model.
  • homogeneity_score - each cluster contains only members of a single class.
  • completeness_score - all members of a given class are assigned to the same cluster.
  • v_measure_score - harmonic mean of the pairwise precision and recall
  • fowlkes_mallows_score - geometric mean of the pairwise precision and recall
  • silhouette_score - composed of two scores: The mean distance between a sample and all other points in the same class; The mean distance between a sample and all other points in the next nearest cluster.
  • calinski_harabaz_score - ratio of the between-clusters dispersion mean and the within-cluster dispersion
  • precision_score - Compute the precision: the ability of the classifier not to label as positive a sample that is negative
  • recall_score - Compute the recall: ability of the classifier to find all the positive samples
  • average_precision_score - Compute average precision (AP) from prediction scores
  • f1_score - Compute the F-measure: weighted harmonic mean of the precision and recall.
  • fbeta_score - Compute the F-beta score
  • precision_recall_curve - Compute precision-recall pairs for different probability thresholds
  • precision_recall_fscore_support - Compute precision, recall, F-measure and support for each class
  • matthews_corrcoef - measure of the quality of binary (two-class) classifications. It takes into account true and false positives and negatives
  • roc_curve - receiver operating characteristic - performance of a binary classifier system as its discrimination threshold is varied. TPR (true positive rate) vs. FPR (false positive rate), at various threshold settings
  • roc_auc_score - summarized to one number
  • label_ranking_average_precision_score - average over each ground truth label assigned to each sample, of the ratio of true vs. total labels with lower score.
  • label_ranking_loss - averages over the samples the number of label pairs that are incorrectly ordered
  • mean_squared_error - Regression metrics: expected value of the squared (quadratic) loss
  • mean_absolute_error - Regression metrics: expected value of the absolute error L1-norm loss
  • mean_squared_log_error - Regression metrics: expected value of the squared logarithmic (quadratic) loss
  • median_absolute_error - Regression metrics: median of all absolute differences between the target and the prediction. robust to outliers.
  • explained_variance_score - Regression metrics
  • DummyClassifier (in sklearn.dummy) - sanity-check baseline
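
Typical classification metric calls given true labels, hard predictions and scores (the toy arrays are arbitrary):

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, f1_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.6, 0.9, 0.8, 0.4, 0.2]   # e.g. predict_proba[:, 1]

print(accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(roc_auc_score(y_true, y_score))
print(classification_report(y_true, y_pred))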

sklearn.metrics.pairwise

  • distance metrics and kernels (measures of similarity) to evaluate pairwise distances or affinity of sets of samples.
  • cosine_similarity - computes the L2-normalized dot product of vectors.
  • polynomial_kernel - computes the degree-d polynomial kernel between two vectors.
  • rbf_kernel - computes the radial basis function (RBF) kernel between two vectors
  • laplacian_kernel - a variant on the rbf_kernel that uses manhattan distance
  • chi2_kernel

sklearn.mixture

Gaussian Mixture Models - unsupervised probabilistic model where data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters: generalizing k-means clustering to incorporate information about the covariance structure of the data and centers of the latent Gaussians. Supports diagonal, spherical, tied and full covariance matrices.

  • GaussianMixture - implements the expectation-maximization (EM) iterative algorithm for maximum likelihood: the Gaussian each sample most probably belongs to. can also draw confidence ellipsoids for multivariate models, and compute the BIC (Bayesian Information Criterion) to assess the number of clusters in the data. options to constrain the covariance of the different classes estimated: spherical, diagonal, tied or full covariance. adv: fast, unbiased. disadv: singularities, must specify the number of components (can use BIC for this only in the asymptotic regime) - requires a lot of data; with unlabeled data it is difficult to tell which points came from which latent component.
  • BayesianGaussianMixture - variational inference - EM maximizes a lower bound on model evidence (including priors) instead of data likelihood. add priors for regularization weights to avoid singularities - 2 prior types: A) finite mixture model with Dirichlet distribution, B) infinite mixture model with the Dirichlet Process. adv: automatic selection, less sensitivity to the number of parameters, regularization. disadv: slower, extra hyperparam, implicit bias. Dirichlet process - prior probability distribution on clusterings with an infinite, unbounded, number of partitions. calculated using stick breaking process.
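
A GaussianMixture sketch: EM fit, soft assignments and BIC for choosing the component count (blobs and n_components=3 are arbitrary):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0).fit(X)
print(gmm.predict(X[:5]))        # hard cluster assignment
print(gmm.predict_proba(X[:5]))  # soft responsibilities per component
print(gmm.bic(X))                # lower BIC suggests a better component count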

sklearn.model_selection

  • KFold - splits data into K folds, trains on K-1 of them and then tests on the left-out fold. method: split
  • StratifiedKFold - Same as K-Fold but preserves the class distribution within each fold. each set contains approximately the same percentage of samples of each target class as the complete set.
  • GroupKFold - Ensures that the same group is not in both testing and training sets.
  • ShuffleSplit - Generates train/test indices based on random permutation. generate a user defined number of independent train / test dataset splits. Samples are first shuffled and then split into a pair of train and test sets.
  • StratifiedShuffleSplit - Same as shuffle split but preserves the class distribution within each iteration.
  • GroupShuffleSplit - Ensures that the same group is not in both testing and training sets.
  • LeaveOneGroupOut - Takes a group array to group observations.
  • LeavePGroupsOut - Leave P groups out.
  • LeaveOneOut - Leave one observation out.
  • LeavePOut - Leave P observations out.
  • PredefinedSplit - Generates train/test indices based on predefined splits
  • TimeSeriesSplit - successive training sets are supersets of those that come before them.
  • cross_val_score - splits the data repeatedly into a training and a testing set, trains the estimator using the training set and computes the scores based on the testing set. validation set is no longer needed. uses the KFold or StratifiedKFold strategies. Cross validation iterators can also be used to directly perform model selection using Grid Search for the optimal hyperparameters of the model.
  • GridSearchCV - select the hyperparameter with the maximum score on multiple validation sets. exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter. computes score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score. best_score_, best_estimator_
  • RandomizedSearchCV - randomized search over parameters, where each setting is sampled from a distribution over possible parameter values.
  • train_test_split - random split of the data into training and test sets - helps prevent overfitting
  • cross_val_predict - returns, for each element in the input, the prediction that was obtained for that element when it was in the test set.
  • validation_curve - plot the influence of a single hyperparameter on the training score and the validation score to find out if estimator is overfitting / underfitting for some hyperparameter values. If the training score and the validation score are both low, the estimator will be underfitting. If the training score is high and the validation score is low, the estimator is overfitting and otherwise it is working very well. A low training score and a high validation score is usually not possible.
  • learning_curve - plot the average scores on the training sets and the average scores on the validation sets. how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error. If both the validation score and the training score converge to a value that is too low with increasing size of the training set, we will not benefit much from more training data. If the training score is much greater than the validation score for the maximum number of training samples, adding more training samples will most likely increase generalization.
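
cross_val_score plus an exhaustive grid search over SVC hyperparameters (the parameter grid values are arbitrary, untuned):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

print(cross_val_score(SVC(), X_train, y_train, cv=5).mean())

grid = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
print(grid.best_estimator_.score(X_test, y_test))   # evaluate on the held-out test set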

sklearn.multiclass

  • All classifiers in scikit-learn do multiclass classification out of the box; this module implements meta-estimators that solve multiclass and multilabel classification problems by decomposing them into binary classification problems. Multitarget regression is also supported. Multiclass classification - each sample is assigned to one and only one label. Multilabel classification - each sample is assigned a set of target labels that are not mutually exclusive (e.g. preferences); expressed with a binary label indicator 2D array (n_samples, n_classes); preprocess with MultiLabelBinarizer. Multioutput regression - each sample is assigned a set of target values (multi-dimensional datapoints). Multioutput-multiclass classification - a single estimator handles several joint classification tasks. Multi-task classification - Multioutput-multiclass classification with different model formulations.
  • OneVsRestClassifier - For each classifier, the class is fitted against all the other classes. adv: efficient, interpretability.
  • OneVsOneClassifier - constructs one binary classifier per pair of classes. selects the class with the highest aggregate classification confidence by summing over the underlying classifiers.
  • OutputCodeClassifier - binary matrix representing dimension of each class in a Euclidean space, where each dimension can only be 0 or 1 (code book - where code size is the dimensionality)
  • MultiOutputRegressor - Multioutput regression
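
One-vs-rest and one-vs-one meta-estimators wrapping a binary classifier (LinearSVC and iris are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, y)  # one classifier per class
ovo = OneVsOneClassifier(LinearSVC(random_state=0)).fit(X, y)   # one per pair of classes
print(len(ovr.estimators_), len(ovo.estimators_))               # 3 and 3 for 3 classes
print(ovr.predict(X[:3]), ovo.predict(X[:3]))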

sklearn.naive_bayes

“naive” assumption of independence between every pair of features. use Maximum A Posteriori (MAP) estimation to estimate the prior (relative frequency of each class in the training set) and posterior. uses: document classification, spam filtering. adv: fast, requires only small data. decoupling of the class conditional feature distributions -> each distribution can be independently estimated as a 1D distribution -> alleviates the curse of dimensionality. a decent classifier, but a bad probability estimator.

  • GaussianNB
  • MultinomialNB
  • BernoulliNB
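
GaussianNB for continuous features and MultinomialNB for counts such as bag-of-words (the toy spam corpus and labels are invented for illustration):

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

X, y = load_iris(return_X_y=True)
print(GaussianNB().fit(X, y).score(X, y))

docs = ['free money now', 'meeting at noon', 'win money free', 'noon project meeting']
labels = [1, 0, 1, 0]                        # 1 = spam, 0 = not spam (toy labels)
counts = CountVectorizer().fit_transform(docs)
print(MultinomialNB().fit(counts, labels).predict(counts))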

sklearn.neighbors

foundation of manifold learning, kernel density estimation and spectral clustering. weights: distance - proportional to the inverse of the euclidean distance from the query point; uniform - majority vote of the nearest neighbors; or a user-defined distance function. non-generalizing - simply “remembers” all its training data. the algorithm param transforms the data into a fast indexing structure: kd_tree - reduces the required number of distance calculations by efficiently encoding relative aggregate distance information in a recursive tree; ball_tree - where KD trees partition data along Cartesian axes, ball trees partition data in a series of nesting hyper-spheres - very efficient on highly-structured data; brute - naive brute-force computation of distances between all pairs of points, for small data sets. all are based on routines in sklearn.metrics.pairwise. leaf_size tunes the trees. non-parametric: can handle cases where the decision boundary is very irregular.

  • KNeighborsClassifier - supervised. number of samples is a user-defined constant (k-nearest). optimal choice of k is highly data-dependent: larger suppresses the effects of noise, but makes classification boundaries less distinct.
  • RadiusNeighborsClassifier- supervised. number of samples varies based on the local density of points (radius-based). for cases where the data is not uniformly sampled, low-dimensional parameter spaces
  • KNeighborsRegressor - data labels are continuous
  • RadiusNeighborsRegressor
  • NearestCentroid - simple classifier. shrink_threshold - value of each feature for each centroid is divided by the within-class variance of that feature: removing noisy features -> increases the accuracy.
  • LSHForest - Locality Sensitive Hashing Forest - approximate nearest neighbor to speedup query time with high dimensional data
  • NearestNeighbors: kneighbors, kneighbors_graph
  • LocalOutlierFactor - outlier detection - measures the local density deviation of a given data point with respect to its neighbors
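
k-NN classification and raw neighbor queries against a KD-tree index (n_neighbors values and the dataset are arbitrary):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

X, y = load_iris(return_X_y=True)

knn = KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='kd_tree')
print(knn.fit(X, y).score(X, y))

nn = NearestNeighbors(n_neighbors=3).fit(X)
dist, ind = nn.kneighbors(X[:2])    # distances and indices of the 3 closest samples
print(ind)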

sklearn.neighbors.kde

  • KernelDensity - learn a non-parametric generative model of a dataset in order to efficiently draw new samples from this generative model. kernel: gaussian / tophat / epanechnikov / exponential / linear / cosine

sklearn.neural_network

not intended for large-scale applications. Multi-layer Perceptron - a non-linear function approximator: weighted linear summation followed by a non-linear activation function. disadv: non-convex loss function with multiple local minima -> different random weight initializations make it non-deterministic; many hyperparameters; sensitive to feature scaling. trains using SGD, Adam, or L-BFGS.

  • MLPClassifier - supports only the Cross-Entropy loss function; uses softmax as the output function for multi-class probability estimates.
  • MLPRegressor - MSE loss function
  • BernoulliRBM - Restricted Boltzmann machine: nonlinear feature learners based on a probabilistic model (uses binary Stochastic Maximum Likelihood). It can be approximated by Markov chain Monte Carlo using block iterative Gibbs sampling. The graphical model of an RBM is a fully-connected bipartite graph.
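
A small MLP classifier, with the feature scaling the section warns about (layer size, solver and max_iter are arbitrary, untuned):

from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)    # MLPs are sensitive to feature scaling

mlp = MLPClassifier(hidden_layer_sizes=(50,), solver='adam', max_iter=300, random_state=0)
print(mlp.fit(X, y).score(X, y))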

sklearn.pipeline

  • Pipeline: chain multiple estimators into one. adv: Convenience - single call to fit / predict , Joint parameter selection - grid search over parameters of all estimators. All estimators in a pipeline, except the last one, must be transformers. set_params - hyperparameters, steps, named_steps.
  • FeatureUnion combines several transformer objects into a new transformer that combines their output
  • make_pipeline, make_union - helpers
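
A Pipeline plus FeatureUnion sketch, including the step__param naming used for nested hyperparameters (the particular steps are arbitrary illustrative choices):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

features = FeatureUnion([('pca', PCA(n_components=2)), ('kbest', SelectKBest(k=1))])
pipe = Pipeline([('scale', StandardScaler()), ('features', features), ('svc', SVC())])
pipe.set_params(svc__C=10)                 # address nested hyperparameters with '__'
print(pipe.fit(X, y).score(X, y))
print(pipe.named_steps['features'])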

sklearn.preprocessing

utility functions and transformer classes to change raw feature vectors into a representation for downstream estimators.

  • scale / StandardScaler - scale 1D array to Gaussian with zero mean and unit variance. scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1. Centering sparse data would destroy the sparseness structure in the data - specify with_mean=False.
  • MinMaxScaler / MaxAbsScaler - scale features to a range [0, 1] / [-1, 1] by dividing through the largest maximum value in each feature
  • robust_scale / RobustScaler - scale data with outliers
  • LabelBinarizer: fit_transform - create a label indicator matrix from a list of multi-class labels: binarize the 2d array of multilabels to fit upon
  • PolynomialFeatures - adds complexity to the model by considering nonlinear features of the input data. transforms an input data matrix into a new data matrix of a given degree. for polynomial regression: extending linear models with basis functions - the expanded features can then be used within a linear model. by combining the features in second-order polynomials, we can fit a paraboloid to the data instead of a plane and fit a much broader range of data. interaction features multiply together distinct input features.
  • MultiLabelBinarizer - convert a collection of collections of labels to Multilabel classification indicator format
  • KernelCenterer - transform a kernel matrix (inner products in a feature space defined by function) - removal of the mean in that space.
  • normalize / Normalizer - scaling individual samples to have L1 or L2 unit norm
  • Binarizer - thresholding numerical features to get boolean values - for downstream probabilistic estimators
  • OneHotEncoder - convert categorical features to one-hot encoding
  • Imputer - impute the missing values: infer them from the known part of the data using the mean, median or the most frequent value
  • FunctionTransformer - convert an existing Python function into a transformer to assist in data cleaning or processing
  • LabelEncoder - normalize labels such that they contain only values between 0 and n_classes-1
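
Common preprocessing transforms side by side on toy numeric and categorical data (values are arbitrary):

import numpy as np
from sklearn.preprocessing import (StandardScaler, MinMaxScaler, Binarizer,
                                   OneHotEncoder, LabelEncoder, PolynomialFeatures)

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
print(StandardScaler().fit_transform(X))       # zero mean, unit variance per column
print(MinMaxScaler().fit_transform(X))         # scaled to [0, 1]
print(Binarizer(threshold=250.0).fit_transform(X))
print(PolynomialFeatures(degree=2).fit_transform(X).shape)   # adds x1*x2, x1^2, ...

print(LabelEncoder().fit_transform(['cat', 'dog', 'cat']))   # [0, 1, 0]
print(OneHotEncoder().fit_transform([[0], [1], [2]]).toarray())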

sklearn.random_projection

  • GaussianRandomProjection: data reduction by random projections. fit_transform
  • SparseRandomProjection

sklearn.semi_supervised

for when some of the training data samples are not labeled, construct a similarity graph over all items in the input dataset

  • LabelPropagation - hard clamping of input labels. 2 kernels: rbf, knn
  • LabelSpreading - minimizes a loss function with regularization --> more robust to noise. performs spectral clustering: iterates on a similarity graph and computes the normalized graph Laplacian matrix -> normalizes the edge weights

sklearn.svm

  • discriminant model family: try to find a combination of samples to build a plane maximizing the margin between the two classes. constructs hyper-planes in a high or infinite dimensional space - a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class; the larger the margin, the lower the generalization error. Advantages: effective in high dimensional spaces, versatile (different kernel functions). Disadvantages: performs poorly if the number of features is much greater than the number of samples, does not directly provide probability estimates. members: support_vectors_, support_ (indices) and n_support_ (count); decision_function gives per-class scores for each sample (signed distance to the hyperplane). quadratic programming (QP) solver of C++ libsvm. kernel values: linear / polynomial / rbf / sigmoid / precomputed (custom precomputed Gram matrix), custom (python function).
  • SVC - support vector classifier. multi-class capable (“one-against-one” approach). the cost function for building the model does not care about training points that lie beyond the margin
  • NuSVC - extra parameter for upper bound on the fraction of training errors and a lower bound of the fraction of support vectors.
  • LinearSVC - linear kernel, “one-vs-the-rest” multi-class strategy
  • OneClassSVM - for novelty detection - classify new points as belonging to that set or not.
  • SVR - cost function for building the model ignores any training data close to the model prediction.
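
SVC with an RBF kernel, showing the support-vector and decision_function members listed above (C and gamma are arbitrary, untuned):

from sklearn.datasets import load_iris
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)

svc = SVC(kernel='rbf', C=1.0, gamma='auto').fit(X, y)
print(svc.n_support_)                  # support vector count per class
print(svc.support_vectors_.shape)
print(svc.decision_function(X[:3]))    # per-sample scores (one-against-one aggregated)

print(LinearSVC().fit(X, y).score(X, y))   # linear kernel, one-vs-the-rest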

sklearn.tree

non-parametric. Use min_samples_split or min_samples_leaf to control the number of samples at a leaf node. reduce dimensionality and balance the dataset (via sampling) before running. Tree algorithms: ID3, C4.5, C5.0 and CART (classification and regression trees). recursively partitions until the maximum allowable depth, selecting split parameters that minimize impurity. impurity measures: Gini, Cross-Entropy, Misclassification, MSE. advantages: whitebox - simple to understand and to interpret, O(log n) prediction, handles both numerical and categorical data. disadvantages: do not generalise well - prone to overfitting, unstable, can be biased (with unbalanced classes), approximate linear and other smooth relationships only with piecewise-constant splits

  • DecisionTreeClassifier: export_graphviz
  • DecisionTreeRegressor
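
A decision tree fit with depth/leaf controls and the Graphviz export mentioned above (max_depth, min_samples_leaf and the output file name are arbitrary):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(criterion='gini', max_depth=3, min_samples_leaf=5,
                              random_state=0).fit(X, y)
print(tree.score(X, y), tree.feature_importances_)

export_graphviz(tree, out_file='tree.dot')   # render with: dot -Tpng tree.dot -o tree.png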

sklearn.utils.testing

  • SkipTest

sklearn.utils.fixes

  • sp_version