The gist of AI/ML: just a summary and probably a few links.. expand as you go...
Machine learning tasks are usually one of the following five:
Regression
Classification
Clustering
Feature Selection
Feature Extraction
Below are some of the prominent algorithms that do just that..
1. Regression:
Purpose: modeling and predicting continuous, numeric variables like test scores and housing prices.
Types of regression algorithms and implementations...
1.1. (Regularized) Linear Regression: http://scikit-learn.org/stable/modules/linear_model.html
There are several implementations of the algorithm: OLS, Ridge, Lasso, etc.
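A minimal sketch of a regularized linear fit with scikit-learn's Ridge; the synthetic data and the alpha value here are illustrative assumptions, not from the gist:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical synthetic data: y is roughly 3*x plus noise
rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 3 * X[:, 0] + 0.1 * rng.randn(100)

# alpha controls the strength of the L2 penalty (Lasso would use an L1 penalty)
model = Ridge(alpha=1.0)
model.fit(X, y)
pred = model.predict(X[:5])
```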
1.2. Regression Tree (Ensembles)
Ensemble methods, such as Random Forests (RF) and Gradient Boosted Trees (GBM), combine predictions from many individual trees.
1.2.1: Random Forests :: http://scikit-learn.org/stable/modules/ensemble.html#random-forests
1.2.2: Gradient Boosting Regressor :: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor
1.2.3: Deep Learning/CNN/DNN: a large enough neural network can model relationships like an infinitely flexible function.
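A tree-ensemble regression sketch; the nonlinear toy target and hyperparameters are assumptions chosen just to show the API:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy nonlinear target: the forest averages many trees' predictions
rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = X[:, 0] ** 2 + X[:, 1]

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)
score = forest.score(X, y)  # R^2 on the training data
```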
2. Classification
Purpose: modeling and predicting categorical variables.
2.1. (Regularized) Logistic Regression: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Classification counterpart to linear regression; the output is binary.
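A regularized logistic regression sketch on a bundled binary dataset (the dataset choice and C value are assumptions for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C is the inverse regularization strength (smaller C = stronger penalty)
clf = LogisticRegression(C=1.0, max_iter=5000)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```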
2.2. Classification Tree (Ensembles):
http://scikit-learn.org/stable/modules/ensemble.html#random-forests
http://scikit-learn.org/stable/modules/ensemble.html#regression
Commonly referred to as "decision trees" or by the umbrella term "classification and regression trees" (CART); the classification counterpart to regression trees.
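A classification-ensemble sketch mirroring the regression one above (the iris dataset and tree count are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree votes; the forest predicts the majority class
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```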
2.3. Deep Learning
Just as with deep learning for regression, there is a counterpart for classification: train the outer layer of a pre-trained model accordingly.
Check out the Keras examples..
TODO: PyTorch examples..
2.4. Support Vector Machines
SVMs use kernels, which essentially calculate the distance (similarity) between two observations. The SVM algorithm then finds the decision boundary that maximizes the margin between the closest members of separate classes.
Kernel explanation: https://stats.stackexchange.com/questions/152897/how-to-intuitively-explain-what-a-kernel-is
http://scikit-learn.org/stable/modules/svm.html#classification
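A kernel-SVM sketch on a dataset a linear boundary can't separate; the two-moons data and RBF kernel parameters are assumptions for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-circles: not linearly separable
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# The RBF kernel measures similarity between observations,
# letting the margin-maximizing boundary bend around the moons
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
acc = clf.score(X, y)
```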
2.5. Naive Bayes
Naive Bayes (NB) is a very simple algorithm based on conditional probability (Bayes' rule) and counting.
http://scikit-learn.org/stable/modules/naive_bayes.html
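A minimal Naive Bayes sketch; GaussianNB and the iris dataset are illustrative choices (for count data, MultinomialNB would be the usual variant):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Assumes feature values are Gaussian within each class,
# then applies Bayes' rule with the "naive" independence assumption
nb = GaussianNB()
nb.fit(X, y)
acc = nb.score(X, y)
```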
3. Clustering
Clustering is an unsupervised learning task for finding natural groupings of observations (i.e. clusters) based on the inherent structure of your dataset.
Used for, say, grouping customers, products, or members of social networks.
Because clustering is unsupervised (i.e. there is no "right answer"), data visualization is usually used to evaluate results.
It helps to have pre-labeled data; otherwise you'll have to look into the data points that form each cluster to see what makes it a cluster :)
3.1. K-Means
K-Means is a general-purpose algorithm that makes clusters based on geometric distances (i.e. distance on a coordinate plane) between points.
The clusters are grouped around centroids, causing them to be globular and similarly sized.
http://scikit-learn.org/stable/modules/clustering.html#k-means
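A K-Means sketch on synthetic blobs; the number of clusters and the data are assumptions (in practice k must be chosen, e.g. via the elbow method):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated globular blobs: the easy case for K-Means
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)  # each point assigned to its nearest centroid
```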
3.2. Affinity Propagation:
Affinity Propagation is a relatively recent clustering technique that makes clusters based on graph distances (message passing) between points; it does not require the number of clusters up front.
http://scikit-learn.org/stable/modules/clustering.html#affinity-propagation
3.3. Hierarchical / Agglomerative
Hierarchical clustering, a.k.a. agglomerative clustering, is a suite of algorithms based on the same idea:
(1) Start with each point in its own cluster.
(2) For each cluster, merge it with another based on some criterion.
(3) Repeat until only one cluster remains and you are left with a hierarchy of clusters.
http://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
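The merge-until-done steps above can be sketched via scikit-learn, cutting the hierarchy at an assumed 3 clusters (ward linkage is the merge criterion here, one of several options):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# linkage="ward" merges the pair of clusters that least increases
# within-cluster variance; stop when 3 clusters remain
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
```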
3.4. DBSCAN:
DBSCAN is a density-based algorithm that makes clusters for dense regions of points.
An updated version, HDBSCAN, allows clusters of varying density.
http://scikit-learn.org/stable/modules/clustering.html#dbscan
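A DBSCAN sketch; the eps and min_samples values are assumptions tuned to this toy data (DBSCAN labels sparse points as noise, -1):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: points needed to call a region dense
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 = noise
```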
4. Feature Selection:
Filtering irrelevant or redundant features from your dataset; features are like independent variables...
Regularized Regression and Random Forests have feature selection built in...
Methods can be unsupervised (e.g. Variance Thresholds) or supervised (e.g. Genetic Algorithms).
4.1. Variance Thresholds:
Variance thresholds remove features whose values don't change much from observation to observation (i.e. their variance falls below a threshold).
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html
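A tiny sketch with a constant column, the toy matrix and threshold being assumptions:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# First column never changes, so its variance is zero
X = np.array([[0, 1.0, 0.1],
              [0, 2.0, 0.2],
              [0, 3.0, 0.1]])

selector = VarianceThreshold(threshold=0.0)  # drop zero-variance features
X_reduced = selector.fit_transform(X)
```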
4.2. Correlation Thresholds
Correlation thresholds remove features that are highly correlated with others (i.e. their values change very similarly to another's). These features provide redundant information.
https://gist.github.com/Swarchal/881976176aaeb21e8e8df486903e99d6
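A pandas-based sketch (not the linked gist's exact code): the 0.95 cutoff, the synthetic columns, and "drop the later of each correlated pair" are all assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({"a": rng.rand(100)})
df["b"] = df["a"] * 2 + 0.01 * rng.randn(100)  # nearly a copy of "a"
df["c"] = rng.rand(100)                         # independent

# Keep only the upper triangle so each pair is checked once,
# then drop one feature from each pair with |corr| > 0.95
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
```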
4.3. Genetic Algorithms (GA):
Genetic algorithms (GA) are a broad class of algorithms that can be adapted to different purposes. They are search algorithms inspired by evolutionary biology and natural selection, combining mutation and cross-over to efficiently traverse large solution spaces.
https://pypi.org/project/deap/
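A toy GA for feature selection in plain Python (DEAP above provides real building blocks); the fitness function, the "useful" feature set, and all rates/sizes are made-up assumptions:

```python
import random

random.seed(0)

# Hypothetical setup: features 0, 2, 5 are "useful", extras carry a cost
USEFUL = {0, 2, 5}
N_FEATURES = 8

def fitness(mask):
    # Reward selected useful features, penalize every selected feature a little
    return sum(1 for i in USEFUL if mask[i]) - 0.2 * sum(mask)

def mutate(mask, rate=0.1):
    # Flip each bit with a small probability
    return [1 - b if random.random() < rate else b for b in mask]

def crossover(a, b):
    # Single-point cross-over between two parent masks
    cut = random.randrange(1, N_FEATURES)
    return a[:cut] + b[cut:]

# Evolve a population of 30 masks for 40 generations, keeping the top 10
pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(30)]
for _ in range(40):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]
    pop = parents + [
        mutate(crossover(random.choice(parents), random.choice(parents)))
        for _ in range(20)
    ]
best = max(pop, key=fitness)
```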
5. Feature Extraction
Feature extraction creates a new, smaller set of features that still captures most of the useful information. Again: feature selection keeps a subset of the original features, while feature extraction creates new ones.
Some algorithms already have built-in feature extraction. The best example is Deep Learning, which extracts increasingly useful representations of the raw input data through each hidden layer.
5.1. Principal Component Analysis (PCA)
Principal component analysis (PCA) is an unsupervised algorithm that creates linear combinations of the original features.
The new features are orthogonal, which means that they are uncorrelated.
They are ranked in order of their "explained variance": the first principal component (PC1) explains the most variance in your dataset, PC2 the second-most, and so on.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
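A PCA sketch; the random data and the choice of 2 components are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 5)

# Project 5 features down to the 2 directions of greatest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
ratios = pca.explained_variance_ratio_  # sorted: PC1 explains the most
```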
5.2. Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) - not to be confused with latent Dirichlet allocation - also creates linear combinations of your original features.
However, unlike PCA, LDA doesn't maximize explained variance. Instead, it maximizes the separability between classes.
http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis
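An LDA sketch on iris (the dataset is an assumption); note LDA is supervised, so fit_transform takes y as well:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# With 3 classes, LDA can produce at most 3 - 1 = 2 components
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)  # directions that best separate the classes
```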
5.3. Autoencoders:
Autoencoders are neural networks that are trained to reconstruct their original inputs.
Image autoencoders are trained to reproduce the original images instead of classifying them as a dog or a cat.
Check Keras: the key is to structure the hidden layer to have fewer neurons than the input/output layers. That hidden layer will then learn to produce a smaller representation of the original image.
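The gist points to Keras; as a dependency-light stand-in, an sklearn MLPRegressor with a bottleneck hidden layer shows the same idea (layer sizes and data are illustrative assumptions, not a production autoencoder):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 8)

# Bottleneck hidden layer (3 units < 8 inputs/outputs) forces compression
ae = MLPRegressor(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
ae.fit(X, X)            # target equals input: learn to reconstruct
X_rec = ae.predict(X)   # reconstructions squeezed through the bottleneck
```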
6. Curse of dimensionality: simply put, when the number of features is very large relative to the number of observations in your dataset, certain algorithms struggle to train effective models; this is particularly observable in clustering.
7. AUROC: area under the ROC curve. ROC (receiver operating characteristic) is a plot of true positive rate (y) vs false positive rate (x) for a model.
AUROC is better than accuracy as a metric for out-of-sample evaluation of a model.
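A sketch of computing AUROC with scikit-learn (the dataset and classifier are assumptions; the point is scoring on held-out probabilities, not labels):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
auroc = roc_auc_score(y_test, scores)     # area under the TPR-vs-FPR curve
```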
TODO:
L1/L2 regularization
No Free Lunch theorem
Density Estimation and Anomaly Detection
Intuition for genetic algorithms: http://www.obitko.com/tutorials/genetic-algorithms/