sklearn quickref

Scikit Learn

  • supports numpy array, scipy sparse matrix, pandas dataframe.
  • Estimator - learns from data: a classifier, regressor or clustering algorithm, or a transformer that extracts/filters useful features from raw data - implements set_params, fit(X, y), predict(T), score (judges the quality of the fit / predictions), predict_proba (per-class probability estimates)
  • Transformer - transform / inverse_transform - clean (sklearn.preprocessing), reduce dimensionality (sklearn.decomposition), expand (sklearn.kernel_approximation) or generate feature representations (sklearn.feature_extraction).
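
A minimal sketch of the estimator / transformer API described above, assuming the iris dataset, a StandardScaler and a LogisticRegression as stand-ins (none of these choices come from the notes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Transformer: fit learns the statistics, transform applies them.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
X_back = scaler.inverse_transform(X_scaled)   # maps back to the original scale

# Estimator: fit learns the model, predict / predict_proba / score use it.
clf = LogisticRegression(max_iter=1000)
clf.set_params(C=0.5)                         # hyperparameters can be changed before fit
clf.fit(X_scaled, y)
labels = clf.predict(X_scaled[:5])
proba = clf.predict_proba(X_scaled[:5])       # one row per sample, one column per class
accuracy = clf.score(X_scaled, y)             # mean accuracy for classifiers
```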


properties: labels_, cluster_centers_. distance metrics - maximize distance between samples in different classes, and minimizes it within each class: Euclidean distance (l2), Manhattan distance (l1) - good for sparse features, cosine distance - invariant to global scalings, or any precomputed affinity matrix.

  • dbscan - deterministically separates areas of high density from areas of low density. cluster shape doesn't have to be convex. uses ball trees / kd-trees to determine the neighborhood of points, which avoids calculating the full distance matrix.
  • birch - the dataset is lossily compressed into a Clustering Feature Tree (CFT). does not scale very well to high dimensional data.
  • KMeans - General-purpose, even cluster size, convex cluster shape, flat geometry, not too many clusters. separates samples into n groups of equal variance, choosing centroids that minimize the inertia (within-cluster sum-of-squares). Inertia limits - not suited to non-convex or non-isotropic elongated clusters, or manifolds with irregular shapes; and it is not a normalized metric - in very high-dimensional spaces, Euclidean distances tend to become inflated. 2 steps per iteration: A) assign each sample to its nearest centroid. B) move each centroid to the mean of the samples assigned to it. equivalent to the EM algorithm with a small, all-equal, diagonal covariance matrix. visualized with Voronoi diagrams. always converges, but possibly to a local minimum - highly dependent on the initialization of the centroids.
  • MiniBatchKMeans - reduce the computation time, with random subsets
  • AffinityPropagation - A cluster is described by a small number of representative exemplars - send messages between pairs of samples until convergence
  • MeanShift - discover blobs in a smooth density of samples
  • SpectralClustering - low-dimension embedding of the affinity matrix between samples, followed by a KMeans in the low dimensional space.
  • AgglomerativeClustering - hierarchical bottom-up successive merging: more computationally efficient
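
The two KMeans steps above can be sketched in plain NumPy. This is an illustration under stated assumptions (random initialisation from the samples, a hypothetical helper name lloyd_kmeans, toy two-blob data), not scikit-learn's implementation:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Random initialisation from k distinct samples (not k-means++).
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Step A: assign each sample to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step B: move each centroid to the mean of its samples
        # (an empty cluster keeps its previous centroid).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centers, centers)
        centers = new_centers
        if converged:
            break
    # Final assignment and inertia (within-cluster sum of squares).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float((dists.min(axis=1) ** 2).sum())
    return labels, centers, inertia

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centers, inertia = lloyd_kmeans(X, k=2)
total_ss = float(((X - X.mean(axis=0)) ** 2).sum())  # inertia of a single cluster
```

sklearn.cluster.KMeans exposes the same quantities as labels_, cluster_centers_ and inertia_, and by default uses a smarter initialisation with several restarts to reduce the risk of a poor local minimum.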


simultaneously cluster rows and columns of a data matrix - Each determines a submatrix.

  • SpectralCoclustering - finds biclusters with values higher than those in the corresponding other rows and columns. treats the input data matrix as a bipartite graph and performs generalized eigenvalue decomposition of the Laplacian of the graph.
  • SpectralBiclustering - assumes that the input data matrix has a hidden checkerboard structure.
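
A small sketch of co-clustering on synthetic data, assuming a block-diagonal matrix from make_biclusters (the shape, noise level and number of clusters are arbitrary choices for illustration):

```python
from sklearn.cluster import SpectralCoclustering
from sklearn.datasets import make_biclusters
from sklearn.metrics import consensus_score

# Synthetic matrix with 3 planted biclusters; rows and columns are shuffled.
data, rows, columns = make_biclusters(
    shape=(60, 40), n_clusters=3, noise=1.0, shuffle=True, random_state=0
)

model = SpectralCoclustering(n_clusters=3, random_state=0).fit(data)

# Each bicluster determines a submatrix through its row and column indices.
row_idx, col_idx = model.get_indices(0)
submatrix = data[row_idx][:, col_idx]

# Similarity between the found and the planted biclusters (1.0 = identical).
score = consensus_score(model.biclusters_, (rows, columns))
```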


  • EmpiricalCovariance - estimation of a population’s covariance matrix using maximum likelihood estimator
  • ShrunkCovariance - applies transformation with a user-defined shrinkage coefficient to improve MLE estimation of the eigenvalues
  • LedoitWolf - compute the optimal shrinkage coefficient that minimizes the MSE of the covariance matrix
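  • OAS - Oracle Approximating Shrinkage: assuming the data is Gaussian distributed, chooses a shrinkage coefficient that yields a smaller MSE than Ledoit-Wolf's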
  • GraphLasso - uses an l1 penalty to enforce sparsity on the precision matrix (inverse of the covariance matrix)
  • MinCovDet - find a given proportion of good observations which are not outliers and compute their empirical covariance matrix, then rescale it (with weights according to their Mahalanobis distance) to compensate the performed selection of observations
  • EllipticEnvelope - outlier detection - fits an ellipse to the central data points, ignoring points outside the central mode. decide whether a new observation belongs to the same distribution as existing observations (it is an inlier).
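
A sketch comparing the maximum likelihood estimate with Ledoit-Wolf shrinkage, assuming random Gaussian data with few samples relative to the number of features (the sizes are arbitrary):

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, LedoitWolf

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 20))        # few samples relative to features

emp = EmpiricalCovariance().fit(X)   # maximum likelihood estimate
lw = LedoitWolf().fit(X)             # shrinkage coefficient chosen automatically

shrinkage = lw.shrinkage_            # value in [0, 1]
# Shrinking towards a scaled identity pulls the eigenvalues together,
# so the shrunk estimate is better conditioned.
cond_emp = np.linalg.cond(emp.covariance_)
cond_lw = np.linalg.cond(lw.covariance_)
precision = lw.precision_            # inverse of the covariance estimate
```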


  • load_iris
  • load_diabetes
  • load_digits
  • fetch_lfw_people - images
  • fetch_20newsgroups - NLP
  • make_blobs, make_classification, make_gaussian_quantiles, make_hastie_10_2, make_circles, make_moons, make_multilabel_classification, make_biclusters, make_checkerboard, make_regression, make_friedman1/2/3, make_s_curve, make_swiss_roll, make_low_rank_matrix, make_sparse_coded_signal, make_spd_matrix, make_sparse_spd_matrix - Sample generators
  • load_boston - house prices
  • sklearn.datasets.base.Bunch:
{'DESCR': str,
 'data': numpy.ndarray,             # (n_samples, n_features) array
 'feature_names': list, / 'images': numpy.ndarray,
 'target': numpy.ndarray,           # (n_samples,) array
 'target_names': numpy.ndarray}
  • load_svmlight_file, fetch_olivetti_faces, fetch_20newsgroups, fetch_20newsgroups_vectorized, fetch_mldata (from, fetch_lfw_people (jpeg archive), fetch_lfw_pairs, fetch_covtype , fetch_rcv1 (news corpus)
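
A short sketch of working with a loaded Bunch and a sample generator, using load_iris and make_blobs (the generator parameters are arbitrary):

```python
from sklearn.datasets import load_iris, make_blobs

iris = load_iris()                     # returns a Bunch (dict with attribute access)
X, y = iris.data, iris.target          # shapes (150, 4) and (150,)
feature_names = iris.feature_names     # list of 4 column names
class_names = iris.target_names        # array of 3 class names
same_object = iris["data"] is iris.data

# Sample generators return plain arrays rather than a Bunch.
X_blobs, y_blobs = make_blobs(n_samples=100, centers=3, n_features=2, random_state=0)
```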


decompose a multivariate dataset in a set of successive orthogonal components that explain a maximum amount of the variance. also provides a probabilistic interpretation that can give a likelihood of data based on the amount of variance it explains

  • PCA - looks for a combination of features that capture well the variance of the original features. the successive components are the eigenvectors of the data covariance matrix (computed via SVD), ordered by the variance they explain (applied to face images, these components are known as eigenfaces). finds the directions in which the data is not flat --> reduce the dimensionality: explained_variance_. limitation for large datasets - only supports batch processing, as the data must fit in memory. can set svd_solver='randomized' to efficiently project data to a lower-dimensional space that preserves most of the variance, by dropping the singular vectors of components associated with lower singular values.
  • IncrementalPCA - minibatch support allows for partial computations
  • FastICA - Independent component analysis separates a multivariate signal into additive subcomponents - so that the distribution of their loadings carries a maximum amount of independent information. It is able to recover non-Gaussian independent signals
  • KernelPCA - non-linear dimensionality reduction through the use of kernels. for denoising, compression and structured prediction (kernel dependency estimation).
  • SparsePCA - extracting the set of sparse components that best reconstruct the data. yields a more parsimonious, interpretable representation.
  • MiniBatchSparsePCA - a faster, but less accurate version
  • TruncatedSVD - variant of singular value decomposition (SVD) that only computes the largest singular values. use for preprocessing TF-IDF, LSA
  • SparseCoder - transform signals into sparse linear combination of atoms from a fixed, precomputed dictionary such as a discrete wavelet basis.
  • DictionaryLearning - use matrix factorization to find a dictionary that can sparsely encode the fitted data: Representing data as sparse combinations of atoms from an overcomplete dictionary
  • MiniBatchDictionaryLearning - a faster, but less accurate version
  • FactorAnalysis - a classical statistical model
  • NMF - Non-negative matrix factorization - finds a decomposition of samples into two matrices of non-negative elements, by optimizing the squared Frobenius norm
  • LatentDirichletAllocation - LDA - generative probabilistic model for discovering abstract topics from a collection of documents. Uses variational Bayes to maximize the Evidence Lower Bound (ELBO) - equivalent to minimizing the Kullback-Leibler(KL) divergence.
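
A minimal PCA sketch, assuming the digits dataset and 10 components (both arbitrary choices for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)            # 1797 samples, 64 pixel features

pca = PCA(n_components=10, svd_solver="randomized", random_state=0)
X_low = pca.fit_transform(X)                   # project onto the top 10 components

ratio = pca.explained_variance_ratio_          # fraction of variance per component
kept = ratio.sum()                             # total fraction of variance retained
X_rec = pca.inverse_transform(X_low)           # lossy reconstruction in pixel space
```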


supervised dimensionality reduction to N dimensions, by projecting the input data to a linear subspace consisting of the directions which maximize the separation between classes. closed-form solutions that can be easily computed, are inherently multiclass, and have no hyperparameters to tune. derived from simple probabilistic models which model the class conditional distribution of the data for each class. Use Bayes’ rule with Gaussian class-conditional densities (with means and covariances estimated from the training data) to select the class which maximizes the posterior probability.

  • LinearDiscriminantAnalysis - LDA - the Gaussians for each class are assumed to share the same covariance matrix
  • QuadraticDiscriminantAnalysis - QDA - no assumptions on the covariance matrices of the Gaussians. if the covariance matrices are diagonal, then the inputs are assumed to be conditionally independent in each class, and the resulting classifier is equivalent to naive_bayes.GaussianNB
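
A sketch of LDA used both as a classifier and as a supervised projection, assuming the iris dataset. With 3 classes the projection has at most n_classes - 1 = 2 dimensions:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)          # supervised projection to 2 dimensions
lda_accuracy = lda.score(X, y)          # the same fitted model also classifies

qda = QuadraticDiscriminantAnalysis().fit(X, y)   # one covariance matrix per class
qda_accuracy = qda.score(X, y)
```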


combine the predictions of multiple base estimators (usually decision trees) -> improve generalizability / robustness. techniques: averaging for complex models (random forests, bagging), boosting for simple models (sequential - AdaBoost, Gradient Tree Boosting) - doesn't scale. main hyperparameters are n_estimators and max_features

  • BaggingClassifier / BaggingRegressor - average on random subsets of the original training set. reduce variance of base estimator by introducing randomization into its construction procedure and then making an ensemble out of it -> reduce overfitting. pasting - draw without replacement, bagging - with replacement.
  • RandomForestClassifier / RandomForestRegressor - average randomized decision trees built by bagging. choose best split among a random subset of the features
  • ExtraTreesClassifier / ExtraTreesRegressor - instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature then pick best one as the splitting rule --> reduce variance
  • RandomTreesEmbedding - use random forest for unsupervised data transformation: neighbor data points are more likely to lie within the same leaf of a tree - take one-of-K leaf indices to form a sparse high-dimensional binary encoding of data --> implicit, non-parametric density estimation.
  • AdaBoostClassifier / AdaBoostRegressor - fit a sequence of weak learners on repeatedly modified versions of the data, iteratively learning sample weights (increase weights for mispredicted samples and reduce them for correctly predicted ones) - examples that are difficult to predict receive ever-increasing influence. each subsequent weak learner is thereby forced to concentrate on the examples that are missed by the previous ones in the sequence. combine predictions through a weighted majority vote (or sum).
  • GradientBoostingClassifier / GradientBoostingRegressor - a generalization of boosting to arbitrary differentiable loss functions. with subsample < 1.0, combines gradient boosting with bootstrap averaging (bagging). choice of tree size, loss function, and regularization via shrinkage (learning_rate) and subsampling can further increase the accuracy. feature_importances_ - relative importance of each feature (can be plotted)
  • VotingClassifier - combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels. hard voting - majority voting , soft voting - argmax of the sum of predicted weighted probabilities. can also be used with GridSearch in order to tune the hyperparameters of the individual estimators.
  • IsolationForest outlier detection - ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
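
A sketch combining an averaging ensemble, a boosting ensemble and a linear model in a soft-voting classifier. The dataset, split and hyperparameter values are assumptions made for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
boost = GradientBoostingClassifier(n_estimators=100, random_state=0)

# Soft voting: argmax of the averaged predicted probabilities.
vote = VotingClassifier(
    estimators=[("rf", forest), ("gb", boost), ("lr", LogisticRegression(max_iter=1000))],
    voting="soft",
)
vote.fit(X_train, y_train)
test_accuracy = vote.score(X_test, y_test)

# Tree ensembles expose per-feature importances after fitting.
importances = forest.fit(X_train, y_train).feature_importances_
```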


  • plot_partial_dependence - Partial dependence plots (PDP) show the dependence between the target response and a set of ‘target’ features, marginalizing over the values of all other features (the ‘complement’ features)


  • joblib: dump, load - persistence
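
A sketch of persisting a fitted model with joblib, assuming the standalone joblib package (installed alongside scikit-learn), a decision tree as the model and a hypothetical file name:

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")   # hypothetical file name
    joblib.dump(model, path)                   # write the fitted estimator to disk
    restored = joblib.load(path)               # read it back

same_predictions = bool((restored.predict(X) == model.predict(X)).all())
```

joblib files are pickle-based, so only load files from a trusted source, ideally with the same scikit-learn version that wrote them.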


dimensionality reduction. univariate statistical test params - For regression: f_regression, mutual_info_regression; For classification: chi2, f_classif, mutual_info_classif

  • VarianceThreshold - removes all features whose variance doesn’t meet some threshold.
  • SelectKBest removes all but the k highest scoring features
  • SelectPercentile removes all but a user-specified highest scoring percentage of features
  • SelectFpr - univariate statistical tests for false positive rate of each feature
  • SelectFdr - univariate statistical tests for false discovery rate of each feature
  • SelectFwe - univariate statistical tests for family wise error of each feature
  • GenericUnivariateSelect - univariate feature selection with a configurable strategy.
  • RFE - recursive feature elimination: recursively considering smaller and smaller sets of features
  • RFECV - performs RFE in a cross-validation loop
  • SelectFromModel - remove if model coef_ or feature_importances_ values are below the provided threshold
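
A sketch of univariate selection and variance thresholding, assuming the iris dataset, k=2 and a threshold of 0.2 (arbitrary values chosen for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-score against the class labels.
selector = SelectKBest(score_func=f_classif, k=2)
X_best = selector.fit_transform(X, y)
kept = selector.get_support()      # boolean mask over the original columns
scores = selector.scores_          # one F-score per original feature

# Drop features whose variance does not exceed the threshold.
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)
```

On iris the two petal measurements score far higher than the sepal ones, so they are the columns kept by SelectKBest.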


extract features in a format supported by machine learning algorithms from datasets consisting of formats such as text and image.

  • DictVectorizer - convert feature arrays represented as lists of standard Python dict objects to one-hot coding for categorical (aka nominal, discrete) features. uses a scipy.sparse matrix by default instead of a numpy.ndarray. get_feature_names
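  • FeatureHasher - high-speed, low-memory vectorizer: applies a hash function (signed 32-bit MurmurHash3) to the features to determine their column index in sample matrices. the sign of the hash determines the sign of the stored value, so that colliding features tend to cancel out rather than accumulate error.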
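
A sketch of both vectorizers on a list of dicts. The records and the number of hash columns are made up for illustration:

```python
from sklearn.feature_extraction import DictVectorizer, FeatureHasher

# Hypothetical records mixing a categorical and a numeric feature.
records = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "Dubai", "temperature": 35.0},
]

vec = DictVectorizer()               # returns a scipy.sparse matrix by default
X = vec.fit_transform(records)
columns = vec.vocabulary_            # feature name -> column index
dense = X.toarray()                  # one-hot columns for city, raw temperature

# FeatureHasher needs no fit and stores no vocabulary: the column is hash(name) % n_features.
hasher = FeatureHasher(n_features=8, input_type="dict")
H = hasher.transform(records)
```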


  • extract_patches_2d - extracts patches from an image 2D array (or 3D with color information along the third axis)
  • reconstruct_from_patches_2d - inverse
  • PatchExtractor
  • img_to_graph - connectivity matrix
  • grid_to_graph - for agglomerative clustering (cluster neighboring pixels of an image, forming contiguous patches), specify which samples to cluster together via a connectivity graph (sparse adjacency matrix) - connected regions of an image


Text preprocessing, tokenizing and filtering of stopwords. build a dictionary of features and transform documents to feature vectors:

  • CountVectorizer - implements tokenization and occurrence counting for bag of words representation (document word counts as high dimensional sparse matrix). supports counts of N-grams of words or consecutive characters, token occurrence frequency. vocabulary_, build_analyzer. disadv: cannot capture phrases and multi-word expressions - does not preserve local structure (word order dependence) of sentences and paragraphs, doesn’t account for misspellings or word derivations - need to lemmatize.
  • TfidfTransformer - re-weights the count features so that high frequency terms don't shadow the frequencies of rarer yet more interesting terms.
  • TfidfVectorizer - combines CountVectorizer and TfidfTransformer in a single model
  • HashingVectorizer - combines the FeatureHasher and CountVectorizer
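
A sketch of bag-of-words counting, n-gram analysis and TF-IDF weighting on a toy corpus (the three documents are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]

counts = CountVectorizer()
X_counts = counts.fit_transform(docs)          # sparse document-term count matrix
vocab = counts.vocabulary_                     # term -> column index
the_in_doc0 = X_counts[0, vocab["the"]]        # "the" appears twice in the first document

# The analyzer performs preprocessing, tokenization and n-gram extraction.
analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
tokens = analyze("the cat sat")

tfidf = TfidfVectorizer()                      # CountVectorizer + TfidfTransformer
X_tfidf = tfidf.fit_transform(docs)
row_norms = np.sqrt(np.asarray(X_tfidf.multiply(X_tfidf).sum(axis=1)))  # rows are l2-normalized by default
```

Note that "cat" and "cats" end up in separate columns, which is the word-derivation limitation mentioned above.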


for regression and probabilistic classification.

  • advantages: the prediction interpolates the observations and is probabilistic (Gaussian) - can compute empirical confidence intervals --> online / adaptive refitting of a region of interest. versatile: different kernels can be specified. does not suffer from the exponential scaling of the grid search used for kernel ridge regression.
  • disadvantages: not sparse - uses the entire samples/features information. loses efficiency in medium- to high-dimensional spaces.
  • GaussianProcessRegressor - specify a prior distribution and maximize the log-marginal-likelihood (LML). uses a kernel to define the covariance of a prior distribution over the target functions and uses the observed training data to define a likelihood function. Based on Bayes' theorem, a (Gaussian) posterior distribution over target functions is defined, whose mean is used for prediction.
  • GaussianProcessClassifier - class probabilities. places a GP prior on a latent nuisance function, which is then squashed through a logistic link function to obtain the probabilistic classification. approximates the non-Gaussian posterior with a Gaussian based on the Laplace approximation. multi-class classification (one-versus-rest or one-versus-one).


compute the GP’s covariance between datapoints. covariance functions determine the shape of prior and posterior of the GP. They encode the assumptions on the function being learned by defining the “similarity” of two datapoints combined with the assumption that similar datapoints should have similar target values. stationary kernels depend only on the distance of two datapoints and not on their absolute values and are thus invariant to translations in the input space, while non-stationary kernels depend also on the specific values of the datapoints. isotropic stationary kernels are also invariant to rotations in the input space.

  • Kernel - abstract base
  • ConstantKernel
  • Product
  • Sum
  • RBF - for long term, smooth rising trend. stationary, isotropic.
  • Matern - generalization of RBF - has additional parameter to control function smoothness
  • ExpSineSquared - periodic RBF
  • RationalQuadratic - a scale mixture (an infinite sum) of RBF kernels with different characteristic length-scales. smaller, medium term irregularities
  • WhiteKernel - includes noise term
  • DotProduct - non-stationary.
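
A sketch of GP regression with a composite kernel (Product, Sum, RBF, WhiteKernel), assuming noisy samples of a sine function as toy data. The initial kernel hyperparameters are arbitrary; fitting tunes them by maximizing the log-marginal-likelihood:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 30).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=30)   # noisy observations

# Kernels compose with * and +: scaled RBF for the signal, WhiteKernel for the noise.
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, random_state=0).fit(X, y)

X_new = np.linspace(0, 10, 100).reshape(-1, 1)
mean, std = gpr.predict(X_new, return_std=True)   # posterior mean and standard deviation
fitted_kernel = gpr.kernel_                       # kernel with optimized hyperparameters
lml = gpr.log_marginal_likelihood_value_          # LML at the optimum
mean_abs_error = float(np.abs(mean - np.sin(X_new).ravel()).mean())
```

mean plus or minus 1.96 * std gives an approximate 95% interval around the prediction.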


explicit functions that approximate the implicit feature mappings that correspond to certain kernels -> efficiency

  • Nystroem - a general method for low-rank approximations of kernels
  • RBFSampler - constructs a Monte Carlo approximate mapping for the radial basis function kernel.
  • AdditiveChi2Sampler - additive chi squared kernel is a kernel on histograms, often used in computer vision. sample the Fourier transform in regular intervals, instead of approximating using Monte Carlo sampling.
  • SkewedChi2Sampler
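
A sketch checking that the explicit RBFSampler feature map approximates the exact RBF kernel: the dot product of mapped samples should be close to the kernel value. Data, gamma and the number of components are arbitrary choices:

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
gamma = 0.5

exact = rbf_kernel(X, gamma=gamma)            # full 50 x 50 kernel matrix

# Explicit feature map: more components give a better approximation.
sampler = RBFSampler(gamma=gamma, n_components=2000, random_state=0)
Z = sampler.fit_transform(X)                  # shape (50, 2000)
approx = Z @ Z.T                              # inner products in the mapped space

mean_abs_error = float(np.abs(exact - approx).mean())
```

The mapped features Z can then be fed to a linear model such as SGDClassifier, which is where the efficiency gain comes from on large datasets.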


  • LinearRegression: coef_, intercept_. minimizes the residual sum of squares between the observed and the predicted responses. coefficient estimates for Ordinary Least Squares rely on the independence of the model terms. multicollinearity - when terms are correlated, the least-squares estimate becomes highly sensitive to random errors in the observed response, producing a large variance.
  • Ridge - imposes a penalty on the size of the coefficients: linear least squares with l2-norm regularization. Shrinkage solution for high-dimensional data: shrink the regression coefficients toward zero, since any two randomly chosen sets of observations are likely to be uncorrelated. bias/variance tradeoff: the larger the alpha parameter, the higher the bias and the lower the variance - choose alpha to minimize left-out error. introducing bias (regularizing) decreases the contribution of non-informative features. the l2 penalty shrinks coefficients but does not set them exactly to zero; sparse solutions come from an L1 penalty (see Lasso), where alpha controls the degree of sparsity of the estimated coefficients.
  • RidgeCV - ridge regression with built-in cross-validation of the alpha parameter. like GridSearchCV except that it defaults to Generalized Cross-Validation (GCV), an efficient form of leave-one-out cross-validation
  • Lasso - (least absolute shrinkage and selection operator). set some coefficients to zero (the sparse method) --> simpler models converge faster for some high dimensional data
  • LassoLarsCV - Least Angle Regression algorithm For high-dimensional datasets with many collinear regressors
  • LassoLarsIC - uses the Akaike information criterion (AIC) or the Bayes Information criterion (BIC) to select alpha. computationally cheaper than cross-validation
  • MultiTaskLasso - estimates sparse coefficients for multiple regression problems jointly
  • RandomizedLasso / RandomizedLogisticRegression - use randomization to prevent overfitting
  • ElasticNet - trained with L1 and L2 prior as regularizer --> learning a sparse model where few of the weights are non-zero like Lasso, while still maintaining the regularization properties of Ridge. control the convex combination of L1 and L2 using the l1_ratio parameter. useful when there are multiple features which are correlated with one another.
  • MultiTaskElasticNet
  • Lars - Least-angle regression for high-dimensional data. similar to forward stepwise regression, but instead of including variables at each step, the estimated parameters are increased in a direction equiangular to each one’s correlations with the residual. Instead of giving a vector result, the LARS solution consists of a curve denoting the solution for each value of the L1 norm of the parameter vector.
  • LassoLars
  • OrthogonalMatchingPursuit - the OMP algorithm for approximating the fit of a linear model with constraints imposed on the number of non-zero coefficients. a forward feature selection method
  • BayesianRidge - introduces a spherical Gaussian prior over the coefficients, with gamma priors over the hyperparameters of the model, which are estimated from the data. robust to ill-posed problems.
  • ARDRegression - similar to Bayesian Ridge Regression, but with elliptical Gaussian priors --> sparser weights
  • LogisticRegression - sigmoid: gives less weight to data far from the decision frontier. for classification. known as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. can fit binary, One-vs-Rest (separate binary classifiers are trained for all classes), or multinomial logistic regression with optional L2 or L1 regularization. uses solvers “liblinear” (coordinate descent (CD) algorithm via C++ LIBLINEAR lib), “newton-cg”, “lbfgs” (Multinomial) and “sag” (very large datasets). can return class probabilities (predict_proba) - well calibrated predictions by default as it directly optimizes log-loss.
  • SGDClassifier / SGDRegressor - Stochastic Gradient Descent using different convex loss functions (loss: log [logistic regression] / hinge [soft margin SVC], modified_huber [smoothed hinge], Least-Squares (Ridge Regression), Epsilon-Insensitive (soft-margin SVR)) and different penalties (penalty = l1 / l2 / elasticnet) on coef_
  • Perceptron - for large scale learning
  • PassiveAggressiveClassifier / PassiveAggressiveRegressor - perceptron with regularization
  • RANSACRegressor - RANSAC (RANdom SAmple Consensus): robust regression for outliers in y - fits a model from random subsets of inliers from the complete data set.
  • TheilSenRegressor - robust multivariate regression for outliers in X for low dimensionality - uses a generalization of the median in multiple dimensions.
  • HuberRegressor - robust regression: applies a linear (rather than squared) loss to samples that are classified as outliers, which down-weights their influence. faster than RANSAC and Theil-Sen unless the number of samples is very large
  • KernelRidge - combines Ridge Regression with the kernel trick. unlike SVR, uses squared error loss - can be done in closed form, but model is non-sparse. faster than SVR for small to medium-sized training sets
  • SGDClassifier - Stochastic Gradient Descent. for unconstrained optimization problems. In contrast to (batch) gradient descent, approximates the true gradient by considering a single training example at a time. adv: efficiency (linear in the number of training examples), tunable. disadv: sensitive to feature scaling (requires preprocessing: StandardScaler)
  • SGDRegressor
  • IsotonicRegression - fits a non-decreasing function to data.
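
A sketch contrasting the l2 penalty (Ridge) with the l1 penalty (Lasso, ElasticNet) on synthetic data where only 5 of 20 features are informative. The data and the alpha values are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(
    n_samples=100, n_features=20, n_informative=5, noise=5.0, random_state=0
)

models = {
    "ridge": Ridge(alpha=1.0),                     # l2: shrinks, keeps all features
    "lasso": Lasso(alpha=1.0),                     # l1: sets some coefficients to zero
    "enet": ElasticNet(alpha=1.0, l1_ratio=0.5),   # convex mix of l1 and l2
}

zero_counts = {}
for name, model in models.items():
    model.fit(X, y)
    zero_counts[name] = int(np.sum(model.coef_ == 0))   # exactly-zero coefficients
```

Ridge leaves every coefficient non-zero, while Lasso drops most of the non-informative features.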


manifold learning - nearest-neighbour approach to non-linear dimensionality reduction. linear dimensionality reduction algorithms [Principal Component Analysis (PCA), Independent Component Analysis (ICA), Linear Discriminant Analysis] are powerful, but often miss important non-linear structure in the data.

  • Isomap - Isometric Mapping - extension of Multi-dimensional Scaling (MDS) or Kernel PCA. Isomap seeks a lower-dimensional embedding which maintains geodesic distances between all points. 3 stages: 1) Nearest neighbor search using BallTree 2) Shortest-path graph search using Dijkstra’s Algorithm or Floyd-Warshall algorithm 3) Partial eigenvalue decomposition - embedding is encoded in the eigenvectors corresponding to the n largest eigenvalues
  • LocallyLinearEmbedding - LLE - lower-dimensional projection of the data which preserves distances within local neighborhoods. Like a series of local PCA. alternative stage 2 to Isomap: performs Weight Matrix Construction. Regularization variants using the method param: A) modified - modified LLE (MLLE) - use multiple weight vectors in each neighborhood. B) hessian - Hessian Eigenmapping - hessian-based quadratic form at each neighborhood used to recover the locally linear structure. C) ltsa - Local tangent space alignment: characterize the local geometry at each neighborhood via its tangent space, and perform a global optimization to align these local tangent spaces.
  • SpectralEmbedding - Laplacian Eigenmaps - finds a low dimensional representation of the data using a spectral decomposition of the graph Laplacian. a discrete approximation of the low dimensional manifold in the high dimensional space. Minimization of a cost function based on the graph ensures that points close to each other on the manifold are mapped close to each other in the low dimensional space, preserving local distances. alternative stage 2 to Isomap: performs Graph Laplacian Construction.
  • MDS - retains the distance ratios of the original high-dimensional space. for analyzing similarity or dissimilarity data as distances. 2 types: A) metric: the input similarity matrix arises from a metric (and thus respects the triangle inequality) -> distances between two output points are set to be as close as possible to the similarity / dissimilarity data. B) non-metric: preserves the order of the distances, and hence seeks a monotonic relationship between the distances in the embedded space and the similarities / dissimilarities.
  • TSNE - t-distributed Stochastic Neighbor Embedding - converts affinities of data points to probabilities: affinities in the original space are represented by Gaussian joint probabilities, and affinities in the embedded space by Student’s t-distributions. uses gradient descent to minimize the Kullback-Leibler (KL) divergence between the joint probabilities in the original space and the embedded space. advantages: sensitive to local structure - good for extracting clustered local groups of samples, revealing the structure at many scales and data in different manifolds / clusters, reducing the tendency to crowd points together at the center. disadvantages: computationally expensive, limited dimensionality, stochastic (requires multiple restarts)
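
A sketch embedding a 3D S-curve into 2 dimensions with Isomap and t-SNE. The sample size, neighbor count and perplexity are arbitrary choices:

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import TSNE, Isomap

# 3D points lying on a 2D manifold; `position` is each point's location along the curve.
X, position = make_s_curve(n_samples=400, random_state=0)

# Isomap: preserves geodesic distances measured along the neighbor graph.
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# t-SNE: preserves local neighborhoods; the result depends on the random seed.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_tsne = tsne.fit_transform(X)
kl = tsne.kl_divergence_     # final value of the objective being minimized
```

Unlike Isomap, TSNE has no transform method for new points, so it is mainly a visualization tool.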


  • classification_report - text report showing the main classification metrics
  • confusion_matrix - evaluates classification accuracy: number of observations actually in group i, but predicted to be in group j.
  • cohen_kappa_score - compare labelings by different human annotators, not a classifier versus a ground truth.
  • hamming_loss - computes the average Hamming distance between two sets of samples.
  • hinge_loss - computes the average distance between the model and the data. considers only prediction errors. used in maximal margin classifiers
  • log_loss - evaluate the probability outputs (predict_proba) of a classifier instead of its discrete predictions. the negative log-likelihood of the classifier given the true label
  • zero_one_loss - the sum or the average of the 0-1 classification loss
  • brier_score_loss - computes the Brier score for binary classes: measures the accuracy of probabilistic predictions
  • coverage_error - computes the average number of true labels that have to be included in the final prediction such that all true labels are predicted.
  • jaccard_similarity_score - computes the average (default) or sum of Jaccard similarity coefficients between pairs of label sets.
  • adjusted_rand_score - ARI - for clustering: measures similarity of two assignments, ignoring permutations. score is normalized to [-1.0, 1.0], with random (uniform) at 0.0. No assumption is made on the cluster structure, but requires knowledge of the ground truth classes
  • normalized_mutual_info_score / adjusted_mutual_info_score - for clustering: measures agreement of two assignments, ignoring permutations. score is normalized to [0.0, 1.0]. requires knowledge of the ground truth classes
  • accuracy_score - default score function of classifiers to evaluate a parameter setting
  • r2_score - default score function of regressors to evaluate a parameter setting. the coefficient of determination: how well future samples are likely to be predicted by the model.
  • homogeneity_score - each cluster contains only members of a single class.
  • completeness_score - all members of a given class are assigned to the same cluster.
  • v_measure_score - harmonic mean of homogeneity and completeness
  • fowlkes_mallows_score - geometric mean of the pairwise precision and recall
  • silhouette_score - composed of two scores: The mean distance between a sample and all other points in the same class; The mean distance between a sample and all other points in the next nearest cluster.
  • calinski_harabaz_score - ratio of the between-clusters dispersion mean and the within-cluster dispersion
  • precision_score - Compute the precision: the ability of the classifier not to label as positive a sample that is negative
  • recall_score - Compute the recall: ability of the classifier to find all the positive samples
  • average_precision_score - Compute average precision (AP) from prediction scores
  • f1_score - Compute the F-measure: weighted harmonic mean of the precision and recall.
  • fbeta_score - Compute the F-beta score
  • precision_recall_curve - Compute precision-recall pairs for different probability thresholds
  • precision_recall_fscore_support - Compute precision, recall, F-measure and support for each class
  • matthews_corrcoef - measure of the quality of binary (two-class) classifications. It takes into account true and false positives and negatives
  • roc_curve - receiver operating characteristic - performance of a binary classifier system as its discrimination threshold is varied. TPR (true positive rate) vs. FPR (false positive rate), at various threshold settings
  • roc_auc_score - summarized to one number
  • label_ranking_average_precision_score - average over each ground truth label assigned to each sample, of the ratio of true vs. total labels with lower score.
  • label_ranking_loss - averages over the samples the number of label pairs that are incorrectly ordered
  • mean_squared_error - Regression metrics: expected value of the squared (quadratic) loss
  • mean_absolute_error - Regression metrics: expected value of the absolute error L1-norm loss
  • mean_squared_log_error - Regression metrics: expected value of the squared logarithmic (quadratic) loss
  • median_absolute_error - Regression metrics: median of all absolute differences between the target and the prediction. robust to outliers.
  • explained_variance_score - Regression metrics
  • DummyClassifier - sanity test
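
A worked example of the main classification metrics on a small hand-made labeling (the label vectors are made up so the counts can be checked by hand), plus two clustering metrics that ignore permutations of the cluster ids:

```python
from sklearn.metrics import (
    accuracy_score,
    adjusted_rand_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    v_measure_score,
)

y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)    # rows: true class, columns: predicted class
# cm is [[3, 1], [2, 4]], i.e. TN=3, FP=1, FN=2, TP=4
acc = accuracy_score(y_true, y_pred)     # (3 + 4) / 10 = 0.7
prec = precision_score(y_true, y_pred)   # TP / (TP + FP) = 4 / 5
rec = recall_score(y_true, y_pred)       # TP / (TP + FN) = 4 / 6
f1 = f1_score(y_true, y_pred)            # 2*TP / (2*TP + FP + FN) = 8 / 11

# Same partition with swapped cluster ids: both scores are 1.0.
ari = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
vm = v_measure_score([0, 0, 1, 1], [1, 1, 0, 0])
```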


  • distance metrics and kernels (measures of similarity) to evaluate pairwise distances or affinity of sets of samples.
  • cosine_similarity - computes the L2-normalized dot product of vectors.
  • polynomial_kernel - computes the degree-d polynomial kernel between two vectors.
  • rbf_kernel - computes the radial basis function (RBF) kernel between two vectors
  • laplacian_kernel - a variant on the rbf_kernel that uses manhattan distance
  • chi2_kernel
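
A small sketch of pairwise distances and kernels on three 2D vectors, chosen so the values can be verified by hand:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, pairwise_distances, rbf_kernel

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

cos = cosine_similarity(X)                       # cos[0, 1] = 0, cos[0, 2] = 1/sqrt(2)
l1 = pairwise_distances(X, metric="manhattan")   # l1[0, 1] = |1-0| + |0-1| = 2
l2 = pairwise_distances(X, metric="euclidean")   # l2[0, 1] = sqrt(2)
K = rbf_kernel(X, gamma=0.5)                     # exp(-gamma * ||x - y||^2), K[0, 1] = exp(-1)
```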


Gaussian Mixture Models - unsupervised probabilistic model where data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters: generalizing k-means clustering to incorporate information about the covariance structure of the data and centers of the latent Gaussians. Supports diagonal, spherical, tied and full covariance matrices.

  • GaussianMixture - implements the expectation-maximization (EM) iterative algorithm for fitting by maximum likelihood; predict assigns each sample to the Gaussian it most probably belongs to. can also draw confidence ellipsoids for multivariate models, and compute the BIC (Bayesian Information Criterion) to assess the number of clusters in the data. options to constrain the covariance of the different classes estimated: spherical, diagonal, tied or full covariance. adv: fast, unbiased. disadv: singularities, must specify number of components (can use BIC for this only if asymptotic) - requires a lot of data, with unlabeled data difficult to tell which points came from which latent component.
  • BayesianGaussianMixture - variational inference - EM maximizes a lower bound on model evidence (including priors) instead of data likelihood. add priors for regularization weights to avoid singularities - 2 prior types: A) finite mixture model with Dirichlet distribution, B) infinite mixture model with the Dirichlet Process. adv: automatic selection, less sensitivity to the number of parameters, regularization. disadv: slower, extra hyperparam, implicit bias. Dirichlet process - prior probability distribution on clusterings with an infinite, unbounded, number of partitions. calculated using stick breaking process.
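
A hedged sketch of using BIC to choose the number of components, as described above. The two synthetic blobs and the candidate range 1..4 are assumptions made for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (illustrative): two well-separated 2D Gaussian blobs.
rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    rng.normal(10.0, 1.0, size=(200, 2)),
])

# Fit one model per candidate number of components and record its BIC.
candidates = [1, 2, 3, 4]
bics = []
for k in candidates:
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    bics.append(gmm.fit(X).bic(X))

# Lower BIC is better.
best_k = candidates[int(np.argmin(bics))]
best = GaussianMixture(n_components=best_k, covariance_type="full", random_state=0).fit(X)

labels = best.predict(X)                  # hard assignment: most probable component
responsibilities = best.predict_proba(X)  # soft assignment: one probability per component
```

Swapping covariance_type for "spherical", "diag" or "tied" trades flexibility for fewer parameters, which also changes the BIC penalty.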


  • KFold - Splits data into K folds, trains on K-1 of them and then tests on the left-out fold.
  • StratifiedKFold - Same as K-Fold but preserves the class distribution within each fold. each set contains approximately the same percentage of samples of each target class as the complete set.
  • GroupKFold - Ensures that the same group is not in both testing and training sets.
  • ShuffleSplit - Generates train/test indices based on random permutation. generate a user defined number of independent train / test dataset splits. Samples are first shuffled and then split into a pair of train and test sets.
  • StratifiedShuffleSplit - Same as shuffle split but preserves the class distribution within each iteration.
  • GroupShuffleSplit - Ensures that the same group is not in both testing and training sets.
  • LeaveOneGroupOut - Takes a group array to group observations.
  • LeavePGroupsOut - Leave P groups out.
  • LeaveOneOut - Leave one observation out.
  • LeavePOut - Leave P observations out.
  • PredefinedSplit - Generates train/test indices based on predefined splits
  • TimeSeriesSplit - successive training sets are supersets of those that come before them.
  • cross_val_score - splits the data repeatedly into a training and a testing set, trains the estimator using the training set and computes the scores based on the testing set. validation set is no longer needed. uses the KFold or StratifiedKFold strategies. Cross validation iterators can also be used to directly perform model selection using Grid Search for the optimal hyperparameters of the model.
  • GridSearchCV - select the hyperparameter with the maximum score on multiple validation sets. exhaustively generates candidates from a grid of parameter values specified with the param_grid parameter. computes score during the fit of an estimator on a parameter grid and chooses the parameters to maximize the cross-validation score. best_score_, best_estimator_
  • RandomizedSearchCV - randomized search over parameters, where each setting is sampled from a distribution over possible parameter values.
  • train_test_split - random split into a training and a test set; holding out a test set is the basic guard against undetected overfitting.
  • cross_val_predict - returns, for each element in the input, the prediction that was obtained for that element when it was in the test set.
  • validation_curve - plot the influence of a single hyperparameter on the training score and the validation score to find out whether the estimator is overfitting or underfitting for some hyperparameter values. If the training score and the validation score are both low, the estimator is underfitting. If the training score is high and the validation score is low, the estimator is overfitting; otherwise it is working well. A low training score and a high validation score is usually not possible.
  • learning_curve - plot the average scores on the training sets and the average scores on the validation sets. how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error. If both the validation score and the training score converge to a value that is too low with increasing size of the training set, we will not benefit much from more training data. If the training score is much greater than the validation score for the maximum number of training samples, adding more training samples will most likely increase generalization.
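
The splitters and search helpers above can be sketched together as follows. The iris dataset, the SVC estimator, the parameter grid and the group assignment are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    GridSearchCV,
    GroupKFold,
    KFold,
    StratifiedKFold,
    cross_val_score,
)
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# KFold: 150 samples in 5 folds gives 30 test samples per fold.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [len(test_idx) for _, test_idx in kf.split(X)]

# GroupKFold: a group never appears in both train and test.
# Hypothetical grouping: 15 groups of 10 consecutive samples.
groups = np.arange(len(X)) // 10
gkf = GroupKFold(n_splits=3)
group_overlaps = [
    set(groups[train_idx]) & set(groups[test_idx])
    for train_idx, test_idx in gkf.split(X, y, groups)
]

# cross_val_score with an explicit stratified splitter.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="rbf", C=1.0), X, y, cv=cv)

# GridSearchCV: exhaustive search over param_grid, scored by cross-validation.
search = GridSearchCV(SVC(kernel="rbf"), param_grid={"C": [0.1, 1, 10]}, cv=cv)
search.fit(X, y)
best_params, best_score = search.best_params_, search.best_score_
```

After fitting, search.best_estimator_ is refit on the whole dataset by default, so it can be used directly for prediction.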


  • All classifiers in scikit-learn do multiclass classification out of the box. sklearn.multiclass implements meta-estimators to solve multiclass and multilabel classification problems by decomposing such problems into binary classification problems. Multitarget regression is also supported.
  • Multiclass classification - each sample is assigned to one and only one label.
  • Multilabel classification - each sample is assigned a set of target labels - not mutually exclusive, eg preferences. expressed with a label binary indicator 2D array (n_samples, n_classes). preprocess with MultiLabelBinarizer.
  • Multioutput regression - each sample is assigned a set of target values (multi-dimensional datapoints).
  • Multioutput-multiclass classification - single estimator handling several joint classification tasks.
  • Multi-task classification - Multioutput-multiclass classification with different model formulations.
  • OneVsRestClassifier - one classifier per class: for each classifier, the class is fitted against all the other classes. adv: efficient (only n_classes classifiers), interpretability.
  • OneVsOneClassifier - constructs one binary classifier per pair of classes. predicts the class that receives the most votes; ties are broken by the highest aggregate classification confidence, summed over the underlying classifiers.
  • OutputCodeClassifier - each class is represented by a binary code (an array of 0s and 1s), i.e. a point in a Euclidean space where each dimension can only be 0 or 1. the code book is the matrix holding the code of each class; the code size is the dimensionality of that space.
  • MultiOutputRegressor - Multioutput regression
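
A short sketch contrasting the two decomposition strategies above and showing the multilabel indicator format. The synthetic 4-class blobs, the LogisticRegression base estimator and the genre labels are illustrative assumptions.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Synthetic 4-class problem (illustrative).
X, y = make_blobs(n_samples=200, centers=4, random_state=0)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

n_ovr = len(ovr.estimators_)  # one binary classifier per class
n_ovo = len(ovo.estimators_)  # one binary classifier per pair of classes

# Multilabel targets: a collection of label sets becomes a binary indicator matrix.
mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform([{"sci-fi", "thriller"}, {"comedy"}])
label_order = list(mlb.classes_)  # column order of the indicator matrix
```

With n classes, one-vs-rest trains n classifiers and one-vs-one trains n * (n - 1) / 2, which is why one-vs-one grows expensive as the number of classes increases.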


“naive” assumption of conditional independence between every pair of features given the class. uses Maximum A Posteriori (MAP) estimation: predicts the class with the highest posterior, i.e. prior (relative frequency of the class in the training set) times the per-feature likelihoods. uses: document classification, spam filtering. adv: fast, requires only a small amount of training data. decoupling of the class conditional feature distributions -> each distribution can be independently estimated as a 1D distribution -> alleviates the curse of dimensionality. a decent classifier, but a bad probability estimator.

  • GaussianNB
  • MultinomialNB
  • BernoulliNB
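
A minimal sketch of GaussianNB showing that the learned prior is the relative class frequency, as described above. The iris dataset is an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)  # 3 classes, 50 samples each

nb = GaussianNB().fit(X, y)

priors = nb.class_prior_            # relative frequency of each class in the training set
proba = nb.predict_proba(X[:5])     # posterior per class; treat as a rough score only
pred = nb.predict(X[:5])            # class with the highest posterior (MAP decision)
```

MultinomialNB and BernoulliNB follow the same fit / predict interface but expect count-like and binary features respectively.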


foundation of manifold learning, kernel density estimation and spectral clustering. weights: uniform (majority vote of the nearest neighbors) / distance (weight proportional to the inverse of the distance from the query point) / user-defined weight function. non-generalizing - simply “remembers” all its training data, possibly transformed into a fast indexing structure. algorithm: kd_tree - reduces the required number of distance calculations by efficiently encoding aggregate distance information in a recursive tree; ball_tree - where KD trees partition data along Cartesian axes, ball trees partition data in a series of nesting hyper-spheres, very efficient on highly-structured data; brute - naive brute-force computation of distances between all pairs of points (based on routines in sklearn.metrics.pairwise), for small data sets. leaf_size - tunes the trees. non-parametric: can handle very irregular decision boundaries.

  • KNeighborsClassifier - supervised. number of samples is a user-defined constant (k-nearest). optimal choice of k is highly data-dependent: a larger k suppresses the effects of noise, but makes classification boundaries less distinct.
  • RadiusNeighborsClassifier - supervised. number of samples varies based on the local density of points (radius-based). for cases where the data is not uniformly sampled, low-dimensional parameter spaces.
  • KNeighborsRegressor - data labels are continuous
  • RadiusNeighborsRegressor
  • NearestCentroid - simple classifier. shrink_threshold - value of each feature for each centroid is divided by the within-class variance of that feature, then reduced by shrink_threshold; a value that crosses zero is set to zero: removes noisy features -> can increase the accuracy.
  • LSHForest - Locality Sensitive Hashing Forest - approximate nearest neighbor to speed up query time with high dimensional data (deprecated in scikit-learn 0.19 and removed in 0.21).
  • NearestNeighbors: kneighbors, kneighbors_graph
  • LocalOutlierFactor - outlier detection - measures the local density deviation of a given data point with respect to its neighbors
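
A small sketch of the neighbors API described above on a 1D toy dataset. The data, k and the algorithm choices are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

# Toy 1D data: two samples per class.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Supervised: vote among the 3 nearest neighbors, weighted by inverse distance.
knn = KNeighborsClassifier(
    n_neighbors=3, weights="distance", algorithm="kd_tree", leaf_size=30
).fit(X, y)
pred = knn.predict([[1.1]])

# Unsupervised: query the index directly.
nn = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(X)
distances, indices = nn.kneighbors([[1.1]])  # nearest training points to the query
graph = nn.kneighbors_graph(X)               # sparse connectivity matrix
```

For the query 1.1 the three nearest training points are 1.0, 2.0 and 0.0, so class 0 wins the vote with both uniform and distance weighting.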


  • KernelDensity - learn a non-parametric generative model of a dataset in order to efficiently draw new samples from this generative model. kernel: gaussian / tophat / epanechnikov / exponential / linear / cosine
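
A hedged sketch of fitting a KernelDensity model and drawing new samples from it. The synthetic data and the bandwidth of 0.5 are arbitrary illustrative choices; in practice the bandwidth would be tuned, for example with a grid search.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Synthetic 1D data drawn from a standard normal (illustrative).
rng = np.random.RandomState(0)
X = rng.normal(0.0, 1.0, size=(200, 1))

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)

# score_samples returns the log-density at each query point.
log_density = kde.score_samples(np.array([[0.0], [5.0]]))

# Draw new points from the fitted generative model.
new_samples = kde.sample(n_samples=5, random_state=0)
```

Sampling is only implemented for the gaussian and tophat kernels; the other kernels can still be used for density evaluation.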


not intended for large-scale applications. Multi-layer Perceptron - non-linear function approximator. weighted linear summation, followed by a non-linear activation function. disadv: with hidden layers the loss function is non-convex and has more than one local minimum -> different random weight initializations can lead to different results; many hyperparameters; sensitive to feature scaling. trains using SGD, Adam, or L-BFGS.

  • MLPClassifier - supports only the Cross-Entropy loss function; Softmax output function for multiclass.
  • MLPRegressor - MSE loss function
  • BernoulliRBM - Restricted Boltzmann machine: nonlinear feature learner based on a probabilistic model, with binary visible and hidden units. trained with Stochastic Maximum Likelihood (Persistent Contrastive Divergence); the intractable part of the gradient is approximated by Markov chain Monte Carlo using block Gibbs sampling. The graphical model of an RBM is a fully-connected bipartite graph.
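
A minimal sketch of MLPClassifier inside a scaling pipeline, since the notes above flag sensitivity to feature scaling and to weight initialization. The make_moons data, the single hidden layer of 16 units and the L-BFGS solver are illustrative assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic non-linearly-separable data (illustrative).
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Scale first; fix random_state so the weight initialization is reproducible.
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), solver="lbfgs", max_iter=1000, random_state=0),
)
clf.fit(X, y)

proba = clf.predict_proba(X[:3])  # per-class probabilities
train_accuracy = clf.score(X, y)
```

L-BFGS tends to suit small datasets like this one; Adam or SGD are the usual choices once the data gets larger.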


  • Pipeline: chain multiple estimators into one. adv: Convenience - single call to fit / predict, Joint parameter selection - grid search over parameters of all estimators. All estimators in a pipeline, except the last one, must be transformers. set_params - set hyperparameters of a step with the <step>__<parameter> syntax. steps, named_steps - access the individual stages.
  • FeatureUnion combines several transformer objects into a new transformer that combines their output
  • make_pipeline, make_union - helpers
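
A sketch of a Pipeline with a FeatureUnion, searched jointly with GridSearchCV as described above. The step names ("scale", "features", "clf"), the chosen transformers and the grid values are hypothetical choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# FeatureUnion concatenates the outputs of its transformers column-wise.
features = FeatureUnion([("pca", PCA(n_components=2)), ("kbest", SelectKBest(k=1))])
union_width = features.fit(X, y).transform(X).shape[1]  # 2 PCA columns + 1 selected column

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("features", features),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Nested parameters are addressed as <step>__<parameter>.
param_grid = {"features__pca__n_components": [1, 2], "clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid=param_grid, cv=3).fit(X, y)

# make_pipeline names steps automatically after the lowercased class names.
auto = make_pipeline(StandardScaler(), LogisticRegression())
auto_names = [name for name, _ in auto.steps]
```

Because the scaler lives inside the pipeline, it is refit on the training part of every cross-validation split, which avoids leaking test-fold statistics into training.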


utility functions and transformer classes to change raw feature vectors into a representation for downstream estimators.

  • scale / StandardScaler - standardize each feature to zero mean and unit variance (many estimators assume roughly Gaussian, centered features). in general, either scale each attribute of the input X to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1. Centering sparse data would destroy the sparseness structure in the data - specify with_mean=False.
  • MinMaxScaler / MaxAbsScaler - scale features to a range: MinMaxScaler maps each feature to [0, 1] using its minimum and maximum; MaxAbsScaler maps each feature to [-1, 1] by dividing by the largest absolute value in that feature.
  • robust_scale / RobustScaler - scale data with outliers
  • LabelBinarizer: fit_transform - create a label indicator matrix (one column per class) from a list of multi-class labels. for several labels per sample, use MultiLabelBinarizer.
  • PolynomialFeatures - add complexity to the model by considering nonlinear features of the input data. transforms an input data matrix into a new data matrix of polynomial terms up to a given degree. for Polynomial regression: extending linear models with basis functions - the result can then be used within a linear model. to fit a paraboloid to the data instead of a plane, combine the features in second-order polynomials; this fits a much broader range of data. interaction_only=True keeps only products of distinct features.
  • MultiLabelBinarizer - convert a collection of collections of labels to Multilabel classification indicator format
  • KernelCenterer - transform a kernel matrix (inner products in a feature space defined by function) - removal of the mean in that space.
  • normalize / Normalizer - scaling individual samples to have L1 or L2 unit norm
  • Binarizer - thresholding numerical features to get boolean values - for downstream probabilistic estimators
  • OneHotEncoder - convert categorical features to one-hot encoding
  • Imputer - impute the missing values: infer them from the known part of the data using the mean, median or the most frequent value (replaced by sklearn.impute.SimpleImputer from scikit-learn 0.20).
  • FunctionTransformer - convert an existing Python function into a transformer to assist in data cleaning or processing
  • LabelEncoder - normalize labels such that they contain only values between 0 and n_classes-1
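
A compact sketch of several of the preprocessing transformers above on tiny inputs, so the effect of each one is visible. All input values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import (
    LabelEncoder,
    MaxAbsScaler,
    MinMaxScaler,
    OneHotEncoder,
    PolynomialFeatures,
    StandardScaler,
)

X = np.array([[1.0], [2.0], [3.0]])

standardized = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
minmax = MinMaxScaler().fit_transform(X)          # (x - min) / (max - min)
maxabs = MaxAbsScaler().fit_transform(np.array([[-2.0], [1.0], [4.0]]))  # x / max(|x|)

# Degree-2 expansion of (a, b) = (2, 3): columns are 1, a, b, a^2, a*b, b^2.
poly = PolynomialFeatures(degree=2).fit_transform([[2.0, 3.0]])
# interaction_only drops the pure powers: columns are 1, a, b, a*b.
interactions = PolynomialFeatures(degree=2, interaction_only=True).fit_transform([[2.0, 3.0]])

# Labels are mapped to 0..n_classes-1 in sorted order.
codes = LabelEncoder().fit_transform(["b", "a", "c", "a"])

# One column per category (sorted); the default output is sparse.
onehot = OneHotEncoder().fit_transform([["red"], ["green"], ["red"]]).toarray()
```

All of these follow the fit / transform protocol, so in real use they would be fitted on training data only and then applied to held-out data.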


  • GaussianRandomProjection: data reduction by random projections. fit_transform
  • SparseRandomProjection
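
A minimal sketch of reducing dimensionality with the two random projection transformers above. The random input matrix and the target of 50 components are illustrative assumptions.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

# Synthetic high-dimensional data (illustrative): 100 samples, 1000 features.
rng = np.random.RandomState(0)
X = rng.rand(100, 1000)

# Dense Gaussian projection matrix.
X_gauss = GaussianRandomProjection(n_components=50, random_state=0).fit_transform(X)

# Sparse projection matrix: cheaper to store and apply.
X_sparse = SparseRandomProjection(n_components=50, random_state=0).fit_transform(X)
```

Both transformers also accept n_components="auto", which picks the target dimension from the number of samples and the eps distortion parameter.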


for when some of the training samples are not labeled (marked with -1). constructs a similarity graph over all items in the input dataset.

  • LabelPropagation - hard clamping of input labels. 2 kernels: rbf, knn
  • LabelSpreading - minimizes a loss function with regularization --> more robust to noise. iterates on a modified similarity graph and normalizes the edge weights by computing the normalized graph Laplacian matrix (the same procedure is used in spectral clustering).
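
A hedged sketch of semi-supervised fitting with LabelSpreading, where unlabeled samples are marked with -1. Hiding about 70% of the iris labels, the knn kernel and the parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

X, y = load_iris(return_X_y=True)

# Hide roughly 70% of the labels; -1 marks a sample as unlabeled.
rng = np.random.RandomState(0)
hidden = rng.rand(len(y)) < 0.7
y_partial = y.copy()
y_partial[hidden] = -1

model = LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.2).fit(X, y_partial)

# transduction_ holds the label inferred for every training sample.
inferred = model.transduction_
accuracy_on_hidden = (inferred[hidden] == y[hidden]).mean()
```

LabelPropagation exposes the same interface; the difference is that it keeps the given labels fixed (hard clamping) instead of allowing them to be relaxed.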


  • discriminant model family: try to find a combination of samples to build a plane maximizing the margin between the two classes. constructs hyper-planes in a high or infinite dimensional space - a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class; the larger the margin, the lower the generalization error. Advantages: Effective in high dimensional spaces, Versatile (different Kernel functions). Disadvantages: if the number of features is much greater than the number of samples, the choice of kernel and regularization term is crucial to avoid overfitting; does not directly provide probability estimates. members: support_vectors_, support_ (indices) and n_support_ (count per class). decision_function gives per-class scores for each sample (signed distance to the hyperplane). quadratic programming (QP) solver of C++ libsvm. kernel values: linear / poly (polynomial) / rbf / sigmoid / precomputed (custom precomputed Gram matrix), or a custom python function.
  • SVC - support vector classifier. multi-class capable (“one-against-one” approach). cost function for building the model does not care about training points that lie beyond the margin
  • NuSVC - extra parameter nu: an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.
  • LinearSVC - linear kernel, “one-vs-the-rest” multi-class strategy
  • OneClassSVM - for novelty detection - classify new points as belonging to that set or not.
  • SVR - cost function for building the model ignores any training data close to the model prediction.
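
A short sketch of fitting an SVC and inspecting the members mentioned above. The iris dataset and the rbf kernel settings are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

support_vectors = clf.support_vectors_  # the training points that define the margin
support_indices = clf.support_          # their indices in X
per_class_counts = clf.n_support_       # number of support vectors per class

# Per-class decision scores; one column per class with the default 'ovr' shape.
scores = clf.decision_function(X[:2])
```

Only the support vectors influence the fitted model, so training points that lie well beyond the margin could be removed without changing the decision function.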


non-parametric. Use min_samples_split or min_samples_leaf to control the number of samples at a leaf node. reduce dimensionality and balance the dataset (via sampling) before running. Tree algorithms: ID3, C4.5, C5.0 and CART (classification and regression trees). recursively partitions until the maximum allowable depth. selects the split parameters that minimize impurity. impurity measures: Gini, Cross-Entropy, Misclassification, MSE. advantages: whitebox - simple to understand and to interpret, O(log n) prediction cost, handles both numerical and categorical data. disadvantages: do not generalise well - prone to overfitting, unstable, can be biased (unbalanced), linear

  • DecisionTreeClassifier: export_graphviz
  • DecisionTreeRegressor
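
A minimal sketch of a depth-limited DecisionTreeClassifier exported with export_graphviz. The iris dataset and the depth / leaf-size limits are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

X, y = load_iris(return_X_y=True)

# Limit depth and leaf size to curb overfitting.
tree = DecisionTreeClassifier(
    criterion="gini", max_depth=3, min_samples_leaf=5, random_state=0
).fit(X, y)

depth = tree.get_depth()
importances = tree.feature_importances_  # impurity-based, sums to 1

# With out_file=None the DOT source is returned as a string instead of written to disk.
dot_source = export_graphviz(tree, out_file=None, filled=True)
```

The DOT string can be rendered with the Graphviz tools to get the whitebox view of the tree; no Graphviz installation is needed just to produce the string.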


  • SkipTest


  • sp_version