The gist of AI/ML: just a summary and probably a few links.. expand as you go...
Machine learning tasks are usually one of the following five:
Regression
Classification
Clustering
Feature Selection
Feature Extraction
Below are some of the prominent algorithms that do just that..
1. Regression:
Purpose: modeling and predicting continuous, numeric variables like test scores and housing prices.
Types of regression algorithms and implementations...
1.1. (Regularized) Linear Regression: http://scikit-learn.org/stable/modules/linear_model.html
There are several implementations of the algorithm: OLS, Ridge, Lasso, etc.
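A minimal sketch of a regularized linear fit with scikit-learn's Ridge; the synthetic data and the alpha value here are illustrative assumptions, not from the gist:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical synthetic data: y is roughly 3*x plus noise
rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 3 * X[:, 0] + 0.1 * rng.randn(100)

# alpha controls the strength of the L2 penalty (Lasso would use an L1 penalty)
model = Ridge(alpha=1.0)
model.fit(X, y)
pred = model.predict(X[:5])
```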
1.2. Regression Tree (Ensembles)
Ensemble methods, such as Random Forests (RF) and Gradient Boosted Trees (GBM), combine predictions from many individual trees.
1.2.1: Random Forests :: http://scikit-learn.org/stable/modules/ensemble.html#random-forests
1.2.2: Gradient Boosting Regressor :: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor
1.2.3: Deep Learning/CNN/DNN: a large enough neural network can model relationships like an infinitely flexible function.
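A tree-ensemble regression sketch; the nonlinear toy target and hyperparameters are assumptions chosen just to show the API:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy nonlinear target: the forest averages many trees' predictions
rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = X[:, 0] ** 2 + X[:, 1]

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)
score = forest.score(X, y)  # R^2 on the training data
```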
2. Classification
Purpose: modeling and predicting categorical variables.
2.1. (Regularized) Logistic Regression: http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Classification counterpart to linear regression; the output is binary.
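A regularized logistic regression sketch on a bundled binary dataset (the dataset choice and C value are assumptions for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# C is the inverse regularization strength (smaller C = stronger penalty)
clf = LogisticRegression(C=1.0, max_iter=5000)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```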
2.2. Classification Tree (Ensembles):
http://scikit-learn.org/stable/modules/ensemble.html#random-forests
http://scikit-learn.org/stable/modules/ensemble.html#regression
Commonly referred to as "decision trees" or by the umbrella term "classification and regression trees" (CART); the classification counterpart to regression trees.
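A classification-ensemble sketch mirroring the regression one above (the iris dataset and tree count are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each tree votes; the forest predicts the majority class
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```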
2.3. Deep Learning
Just as with deep learning for regression, there is a counterpart for classification: train the outer layer of a pre-trained model accordingly.
Check out the Keras examples..
TODO: PyTorch examples..
2.4. Support Vector Machines
SVMs use kernels, which essentially calculate the distance (similarity) between two observations. The SVM algorithm then finds the decision boundary that maximizes the margin between the closest members of separate classes.
Kernel explanation: https://stats.stackexchange.com/questions/152897/how-to-intuitively-explain-what-a-kernel-is
http://scikit-learn.org/stable/modules/svm.html#classification
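A kernel-SVM sketch on a dataset a linear boundary can't separate; the two-moons data and RBF kernel parameters are assumptions for illustration:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-circles: not linearly separable
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# The RBF kernel measures similarity between observations,
# letting the margin-maximizing boundary bend around the moons
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
acc = clf.score(X, y)
```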
2.5. Naive Bayes
Naive Bayes (NB) is a very simple algorithm based on conditional probability (Bayes' rule) and counting.
http://scikit-learn.org/stable/modules/naive_bayes.html
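A minimal Naive Bayes sketch; GaussianNB and the iris dataset are illustrative choices (for count data, MultinomialNB would be the usual variant):

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Assumes feature values are Gaussian within each class,
# then applies Bayes' rule with the "naive" independence assumption
nb = GaussianNB()
nb.fit(X, y)
acc = nb.score(X, y)
```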
3. Clustering
Clustering is an unsupervised learning task for finding natural groupings of observations (i.e. clusters) based on the inherent structure of your dataset.
Used for, say, grouping customers, products, or members of social networks.
Because clustering is unsupervised (i.e. there is no "right answer"), data visualization is usually used to evaluate results.
It helps to have pre-labeled data; otherwise you'll have to look into the data points that form each cluster to see what makes it a cluster :)
3.1. K-Means
K-Means is a general-purpose algorithm that makes clusters based on geometric distances (i.e. distance on a coordinate plane) between points.
The clusters are grouped around centroids, causing them to be globular and similarly sized.
http://scikit-learn.org/stable/modules/clustering.html#k-means
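A K-Means sketch on synthetic blobs; the number of clusters and the data are assumptions (in practice k must be chosen, e.g. via the elbow method):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated globular blobs: the easy case for K-Means
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)  # each point assigned to its nearest centroid
```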
3.2. Affinity Propagation:
Affinity Propagation is a relatively recent clustering technique that makes clusters based on graph distances (message passing) between points; it does not require the number of clusters up front.
http://scikit-learn.org/stable/modules/clustering.html#affinity-propagation
3.3. Hierarchical / Agglomerative
Hierarchical clustering, a.k.a. agglomerative clustering, is a suite of algorithms based on the same idea:
(1) Start with each point in its own cluster.
(2) For each cluster, merge it with another based on some criterion.
(3) Repeat until only one cluster remains and you are left with a hierarchy of clusters.
http://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering
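The merge-until-done steps above can be sketched via scikit-learn, cutting the hierarchy at an assumed 3 clusters (ward linkage is the merge criterion here, one of several options):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# linkage="ward" merges the pair of clusters that least increases
# within-cluster variance; stop when 3 clusters remain
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
```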
3.4. DBSCAN:
DBSCAN is a density-based algorithm that makes clusters for dense regions of points.
An updated version, HDBSCAN, allows clusters of varying density.
http://scikit-learn.org/stable/modules/clustering.html#dbscan
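A DBSCAN sketch; the eps and min_samples values are assumptions tuned to this toy data (DBSCAN labels sparse points as noise, -1):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: points needed to call a region dense
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 = noise
```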
4. Feature Selection:
Filtering irrelevant or redundant features from your dataset; features are like independent variables...
Regularized Regression and Random Forests have feature selection built in...
Methods can be unsupervised (e.g. Variance Thresholds) or supervised (e.g. Genetic Algorithms).
4.1. Variance Thresholds:
Variance thresholds remove features whose values don't change much from observation to observation (i.e. their variance falls below a threshold).
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html
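A tiny sketch with a constant column, the toy matrix and threshold being assumptions:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# First column never changes, so its variance is zero
X = np.array([[0, 1.0, 0.1],
              [0, 2.0, 0.2],
              [0, 3.0, 0.1]])

selector = VarianceThreshold(threshold=0.0)  # drop zero-variance features
X_reduced = selector.fit_transform(X)
```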
4.2. Correlation Thresholds
Correlation thresholds remove features that are highly correlated with others (i.e. their values change very similarly to another's). These features provide redundant information.
https://gist.github.com/Swarchal/881976176aaeb21e8e8df486903e99d6
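A pandas-based sketch (not the linked gist's exact code): the 0.95 cutoff, the synthetic columns, and "drop the later of each correlated pair" are all assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({"a": rng.rand(100)})
df["b"] = df["a"] * 2 + 0.01 * rng.randn(100)  # nearly a copy of "a"
df["c"] = rng.rand(100)                         # independent

# Keep only the upper triangle so each pair is checked once,
# then drop one feature from each pair with |corr| > 0.95
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
```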
4.3. Genetic Algorithms (GA):
Genetic algorithms (GA) are a broad class of algorithms that can be adapted to different purposes. They are search algorithms inspired by evolutionary biology and natural selection, combining mutation and cross-over to efficiently traverse large solution spaces.
https://pypi.org/project/deap/
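A toy GA for feature selection in plain Python (DEAP above provides real building blocks); the fitness function, the "useful" feature set, and all rates/sizes are made-up assumptions:

```python
import random

random.seed(0)

# Hypothetical setup: features 0, 2, 5 are "useful", extras carry a cost
USEFUL = {0, 2, 5}
N_FEATURES = 8

def fitness(mask):
    # Reward selected useful features, penalize every selected feature a little
    return sum(1 for i in USEFUL if mask[i]) - 0.2 * sum(mask)

def mutate(mask, rate=0.1):
    # Flip each bit with a small probability
    return [1 - b if random.random() < rate else b for b in mask]

def crossover(a, b):
    # Single-point cross-over between two parent masks
    cut = random.randrange(1, N_FEATURES)
    return a[:cut] + b[cut:]

# Evolve a population of 30 masks for 40 generations, keeping the top 10
pop = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(30)]
for _ in range(40):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]
    pop = parents + [
        mutate(crossover(random.choice(parents), random.choice(parents)))
        for _ in range(20)
    ]
best = max(pop, key=fitness)
```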
5. Feature Extraction
Feature extraction creates a new, smaller set of features that still captures most of the useful information. Again: feature selection keeps a subset of the original features, while feature extraction creates new ones.
Some algorithms already have built-in feature extraction. The best example is Deep Learning, which extracts increasingly useful representations of the raw input data through each hidden layer.
5.1. Principal Component Analysis (PCA)
Principal component analysis (PCA) is an unsupervised algorithm that creates linear combinations of the original features.
The new features are orthogonal, which means that they are uncorrelated.
They are ranked in order of their "explained variance": the first principal component (PC1) explains the most variance in your dataset, PC2 the second-most, and so on.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
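A PCA sketch; the random data and the choice of 2 components are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 5)

# Project 5 features down to the 2 directions of greatest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
ratios = pca.explained_variance_ratio_  # sorted: PC1 explains the most
```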
5.2. Linear Discriminant Analysis (LDA)
Linear discriminant analysis (LDA) - not to be confused with latent Dirichlet allocation - also creates linear combinations of your original features.
However, unlike PCA, LDA doesn't maximize explained variance. Instead, it maximizes the separability between classes.
http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis
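An LDA sketch on iris (the dataset is an assumption); note LDA is supervised, so fit_transform takes y as well:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# With 3 classes, LDA can produce at most 3 - 1 = 2 components
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)  # directions that best separate the classes
```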
5.3. Autoencoders:
Autoencoders are neural networks that are trained to reconstruct their original inputs.
Image autoencoders are trained to reproduce the original images instead of classifying them as a dog or a cat.
Check Keras: the key is to structure the hidden layer to have fewer neurons than the input/output layers. That hidden layer will then learn to produce a smaller representation of the original image.
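The gist points to Keras; as a dependency-light stand-in, an sklearn MLPRegressor with a bottleneck hidden layer shows the same idea (layer sizes and data are illustrative assumptions, not a production autoencoder):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 8)

# Bottleneck hidden layer (3 units < 8 inputs/outputs) forces compression
ae = MLPRegressor(hidden_layer_sizes=(3,), max_iter=2000, random_state=0)
ae.fit(X, X)            # target equals input: learn to reconstruct
X_rec = ae.predict(X)   # reconstructions squeezed through the bottleneck
```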
6. Curse of dimensionality: simply put, when the number of features is very large relative to the number of observations in your dataset, certain algorithms struggle to train effective models; this is particularly observable in clustering.
7. AUROC: area under the ROC curve. ROC (receiver operating characteristic) is a plot of true positive rate (y) vs false positive rate (x) for a model.
AUROC is better than accuracy as a metric for out-of-sample evaluation of a model.
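A sketch of computing AUROC with scikit-learn (the dataset and classifier are assumptions; the point is scoring on held-out probabilities, not labels):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]  # probability of the positive class
auroc = roc_auc_score(y_test, scores)     # area under the TPR-vs-FPR curve
```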
TODO:
L1/L2 regularization
No Free Lunch theorem
Density Estimation and Anomaly Detection
Intuition for genetic algorithms: http://www.obitko.com/tutorials/genetic-algorithms/