The gist of AI/ML: just a summary and probably a few links; expand as you go...
Machine learning tasks are usually one of the following 5:
Regression
Classification
Clustering
Feature Selection
Feature Extraction
Below are some of the prominent algorithms for each of these tasks.
1. Regression:
Purpose: Modeling and predicting continuous, numeric variables like test scores and housing prices
Types of regression algorithms and implementation...
1.1. (Regularized) Linear Regression : http://scikit-learn.org/stable/modules/linear_model.html
There are several variants of the algorithm: OLS, Ridge, Lasso, etc.
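A minimal sketch with scikit-learn (the synthetic data and alpha values are just for illustration):
```python
# Regularized linear regression sketch; synthetic data for illustration only.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.randn(100)   # continuous target

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty shrinks coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty can zero out coefficients
print(ridge.coef_, lasso.coef_)
```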
1.2. Regression Tree (Ensembles)
Ensemble methods, such as Random Forests (RF) and Gradient Boosted Trees (GBM), combine predictions from many individual trees
1.2.1: Random forests :: http://scikit-learn.org/stable/modules/ensemble.html#random-forests
1.2.2: Gradient boosting regressor :: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor
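A minimal sketch of both ensemble regressors (synthetic data and hyperparameters are just for illustration):
```python
# Random forest vs gradient boosting for regression; both average/combine many trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = np.sin(6 * X[:, 0]) + X[:, 1] + 0.1 * rng.randn(200)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
gbm = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                random_state=0).fit(X, y)
print(rf.predict(X[:3]), gbm.predict(X[:3]))
```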
1.3. Deep Learning/CNN/DNN: with enough layers and neurons, neural networks can model relationships like an infinitely flexible function
2. Classification
Purpose: Modeling and predicting categorical variables
2.1. (Regularized) Logistic Regression: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Classification counterpart to linear regression; the output is a binary class (extendable to multiclass, e.g. one-vs-rest).
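A minimal sketch (toy data; C=1.0 is just an illustrative setting):
```python
# Regularized logistic regression sketch; C is the inverse regularization strength.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
print(clf.predict(X[:5]))                 # hard class labels
print(clf.predict_proba(X[:5])[:, 1])     # predicted probabilities for class 1
```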
2.2 Classification Tree (Ensembles):
http://scikit-learn.org/stable/modules/ensemble.html#random-forests
http://scikit-learn.org/stable/modules/ensemble.html#regression
Commonly referred to as "decision trees" or by the umbrella term "classification and regression trees" (CART); the classification counterpart to regression trees.
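A minimal sketch, mirroring the regression version above (toy data, illustrative settings):
```python
# Classification tree ensembles: random forest and gradient boosting classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
gbm = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)
print(rf.score(X, y), gbm.score(X, y))   # training accuracy, just to show the API
```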
2.3 Deep Learning
Just as with deep learning for regression, there is a counterpart for classification: retrain the output layer of a pre-trained model accordingly.
Check out the Keras examples (a minimal sketch follows after this item).
TODO: pytorch examples..
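A minimal Keras sketch of the idea, assuming TensorFlow/Keras is installed; the base model, input size, and layer widths are illustrative assumptions, not from the gist:
```python
# Transfer learning sketch: freeze a pre-trained convolutional base and train
# only a new classification head on top of it.
import numpy as np
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras import layers, models

base = MobileNetV2(weights="imagenet", include_top=False,
                   pooling="avg", input_shape=(96, 96, 3))
base.trainable = False   # keep the pre-trained weights fixed

model = models.Sequential([
    base,
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # binary output, e.g. dog vs cat
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy data just to show the training call; replace with real images/labels.
X = np.random.rand(8, 96, 96, 3).astype("float32")
y = np.random.randint(0, 2, size=(8,))
model.fit(X, y, epochs=1, batch_size=4)
```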
2.4. Support Vector Machines
SVMs use kernels, which essentially calculate distance between two observations. The SVM algorithm then finds a decision boundary that maximizes the distance between the closest members of separate classes
Kernel explanation : https://stats.stackexchange.com/questions/152897/how-to-intuitively-explain-what-a-kernel-is
http://scikit-learn.org/stable/modules/svm.html#classification
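A minimal sketch with an RBF kernel (C and gamma are illustrative settings):
```python
# SVM classification sketch; the kernel measures similarity between observations,
# and the fitted boundary maximizes the margin between classes.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.predict(X[:5]))
```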
2.5. Naive Bayes
Naive Bayes (NB) is a very simple algorithm based around conditional probability and counting
http://scikit-learn.org/stable/modules/naive_bayes.html
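A minimal sketch using the Gaussian variant (toy data for illustration):
```python
# Naive Bayes sketch: fits per-class feature distributions and combines
# them with Bayes' rule, assuming features are conditionally independent.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
nb = GaussianNB().fit(X, y)
print(nb.predict_proba(X[:3]))   # class posterior probabilities
```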
3. Clustering
Clustering is an unsupervised learning task for finding natural groupings of observations (i.e. clusters) based on the inherent structure within your dataset
Used for say: grouping customers, products, social networks
Because clustering is unsupervised (i.e. there's no "right answer"), data visualization is usually used to evaluate results
It helps to have pre-labeled data; otherwise you'll have to inspect the data points in each cluster to see what makes it a cluster :)
3.1. K-Means
K-Means is a general purpose algorithm that makes clusters based on geometric distances (i.e. distance on a coordinate plane) between points.
The clusters are grouped around centroids, causing them to be globular and have similar sizes
http://scikit-learn.org/stable/modules/clustering.html#k-means
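A minimal sketch (the blob data and k=3 are just for illustration):
```python
# K-Means sketch: choose k up front; each point is assigned to the nearest centroid.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, random_state=0).fit(X)
print(km.cluster_centers_)   # centroid coordinates
print(km.labels_[:10])       # cluster assignment per point
```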
3.2. Affinity Propagation:
Affinity Propagation is a relatively new clustering technique that makes clusters based on graph distances between points
http://scikit-learn.org/stable/modules/clustering.html#affinity-propagation
3.3. Hierarchical / Agglomerative
Hierarchical clustering, a.k.a. agglomerative clustering, is a suite of algorithms based on the same idea:
(1) Start with each point in its own cluster.
(2) For each cluster, merge it with another based on some criterion.
(3) Repeat until only one cluster remains and you are left with a hierarchy of clusters.
http://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
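A minimal sketch (ward linkage and n_clusters=3 are illustrative choices):
```python
# Agglomerative (hierarchical) clustering sketch: points start in their own
# clusters and are merged by a linkage criterion until n_clusters remain.
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print(agg.labels_[:10])
```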
3.4. DBSCAN:
DBSCAN is a density based algorithm that makes clusters for dense regions of points.
An updated version, HDBSCAN, allows clusters of varying density.
http://scikit-learn.org/stable/modules/clustering.html#dbscan
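A minimal sketch (eps and min_samples are illustrative and usually need tuning):
```python
# DBSCAN sketch: clusters are dense regions of points; points in sparse
# regions are labeled -1 (noise).
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))   # cluster ids, with -1 for noise points
```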
4. Feature Selection:
Filtering irrelevant or redundant features from your dataset; features here are the independent variables...
Regularized regression and random forests have feature selection built in...
They can be unsupervised (e.g. Variance Thresholds) or supervised (e.g. Genetic Algorithms)
4.1. Variance Thresholds:
Variance thresholds remove features whose values don't change much from observation to observation (i.e. their variance falls below a threshold)
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html
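A minimal sketch (the tiny array and the 0.01 threshold are just for illustration):
```python
# Variance threshold sketch: drop features whose variance is below the threshold.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 1.0, 2.0],
              [0.0, 2.0, 0.5],
              [0.0, 3.0, 1.5]])          # first column is constant
selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)
print(selector.get_support())            # [False  True  True] -> constant column dropped
print(X_reduced.shape)
```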
4.2. Correlation Thresholds
Correlation thresholds remove features that are highly correlated with others (i.e. their values change very similarly to another feature's). Such features provide redundant information.
https://gist.github.com/Swarchal/881976176aaeb21e8e8df486903e99d6
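The linked gist uses pandas; a minimal sketch of the same idea (the 0.95 cutoff is an illustrative assumption):
```python
# Drop one feature from each highly correlated pair; the cutoff is illustrative.
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({"a": rng.rand(100)})
df["b"] = df["a"] * 2 + 0.01 * rng.rand(100)   # nearly a duplicate of "a"
df["c"] = rng.rand(100)

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)                                  # ['b']
df_reduced = df.drop(columns=to_drop)
```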
4.3. Genetic Algorithms (GA):
Genetic algorithms (GA) are a broad class of algorithms that can be adapted to different purposes. They are search algorithms that are inspired by evolutionary biology and natural selection, combining mutation and cross-over to efficiently traverse large solution spaces.
https://pypi.org/project/deap/
5. Feature Extraction
Feature extraction is for creating a new, smaller set of features that still captures most of the useful information. Again, feature selection keeps a subset of the original features, while feature extraction creates new ones.
Some algorithms already have built-in feature extraction. The best example is Deep Learning, which extracts increasingly useful representations of the raw input data through each hidden neural layer
5.1. Principal Component Analysis (PCA)
Principal component analysis (PCA) is an unsupervised algorithm that creates linear combinations of the original features.
The new features are orthogonal, which means that they are uncorrelated.
They are ranked in order of their "explained variance." The first principal component (PC1) explains the most variance in your dataset, PC2 explains the second-most variance and so on
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
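A minimal sketch on the iris data (n_components=2 is just for illustration):
```python
# PCA sketch: project onto orthogonal components ranked by explained variance.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                      # 150 x 4
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)               # 150 x 2
print(pca.explained_variance_ratio_)      # PC1 explains the most variance, then PC2
```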
5.2. Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) - not to be confused with latent Dirichlet allocation - also creates linear combinations of your original features.
However, unlike PCA, LDA doesn't maximize explained variance. Instead, it maximizes the separability between classes
http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis
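A minimal sketch; note that, unlike PCA, LDA needs the labels y and gives at most (n_classes - 1) components:
```python
# LDA sketch: supervised projection that maximizes separability between classes.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

data = load_iris()
X, y = data.data, data.target
lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)            # needs labels, unlike PCA
print(X_2d.shape)
```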
5.3. Autoencoders:
Autoencoders are neural networks that are trained to reconstruct their original inputs.
Image autoencoders are trained to reproduce the original images instead of classifying the image as a dog or a cat.
Check Keras: the key is to structure the hidden layer to have fewer neurons than the input/output layers. Thus, that hidden layer will learn to produce a smaller representation of the original image
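A minimal Keras sketch, assuming TensorFlow/Keras is installed; the layer sizes and the flat (non-image) input are illustrative assumptions:
```python
# Autoencoder sketch: the bottleneck layer is smaller than the input/output,
# so the network learns a compressed representation of its inputs.
import numpy as np
from tensorflow.keras import layers, models

input_dim, bottleneck = 64, 8
autoencoder = models.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(bottleneck, activation="relu"),     # compressed code
    layers.Dense(input_dim, activation="sigmoid"),   # reconstruction
])
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(256, input_dim).astype("float32")
autoencoder.fit(X, X, epochs=2, batch_size=32)       # target == input
```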
6. Curse of dimensionality: Simply put, when the number of features is very large relative to the number of observations in your dataset, certain algorithms struggle to train effective models; this is particularly observable in clustering.
7. AUROC: area under the ROC curve. The ROC (receiver operating characteristic) curve plots a model's true positive rate (y-axis) vs its false positive rate (x-axis).
AUROC is often a better metric than accuracy for out-of-sample evaluation of a model.
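A minimal sketch of computing AUROC out of sample with scikit-learn (the model and data are just for illustration):
```python
# AUROC sketch: computed from predicted probabilities, not hard class labels.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]
print(roc_auc_score(y_te, probs))   # out-of-sample AUROC
```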
TODO:
L1/L2 regularization (L1 != L2)
No free lunch theorem
Density Estimation and Anomaly Detection
Intuition for genetic algorithms: http://www.obitko.com/tutorials/genetic-algorithms/