Notes for Bellet 2014 survey
Mahalanobis distance: d_M(x, x')^2 = (x − x')^T M (x − x')
where M belongs to the cone of symmetric PSD d×d matrices
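A minimal NumPy sketch of this squared distance (assuming M is already a valid symmetric PSD matrix):
```python
import numpy as np

def mahalanobis_sq(x, x_prime, M):
    """Squared Mahalanobis distance d_M(x, x')^2 = (x - x')^T M (x - x')."""
    diff = x - x_prime
    return float(diff @ M @ diff)

# M = I recovers the squared Euclidean distance
x, y = np.array([1.0, 2.0]), np.array([3.0, 0.0])
print(mahalanobis_sq(x, y, np.eye(2)))  # 8.0
```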
Must-link / cannot-link constraints (sometimes called positive / negative pairs):
S = {(xi, xj) : xi and xj should be similar},
D = {(xi, xj) : xi and xj should be dissimilar}.
Relative constraints (sometimes called training triplets):
R = {(xi, xj , xk) : xi should be more similar to xj than to xk}.
Optimization problem of the following general form: min_M { ℓ(M, S, D, R) + λ R(M) }
where ℓ(M, S, D, R) is a loss function that incurs a penalty when training constraints
are violated, R(M) is some regularizer on the parameters M of the learned metric, and λ ≥ 0 is the regularization parameter
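As an illustration of this general form, here is one possible instantiation, a hinge loss on pair constraints plus a Frobenius-norm regularizer; the specific loss and regularizer here are assumptions for the sketch, not the survey's single canonical choice:
```python
import numpy as np

def pair_hinge_loss(M, S, D, margin=1.0):
    """Penalty when similar pairs are far apart or dissimilar pairs are close."""
    def d2(x, y):
        diff = x - y
        return diff @ M @ diff
    loss = sum(max(0.0, d2(x, y) - margin) for x, y in S)   # similar pairs should fall inside the margin
    loss += sum(max(0.0, margin - d2(x, y)) for x, y in D)  # dissimilar pairs should fall outside it
    return loss

def objective(M, S, D, lam=0.1):
    return pair_hinge_loss(M, S, D) + lam * np.sum(M ** 2)  # loss + lambda * ||M||_F^2
```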
Related topics:
Kernel learning
- limited to the transductive setting (the kernel is only learned for the observed data)
Multiple Kernel learning
- effective, inductive, but more restricted than Kernel learning
Dimensionality reduction
- assume that the (unlabeled) data lie on an embedded low-dimensional manifold within the higher-dimensional space and aim at “unfolding” it
Metric learning:
Learning paradigm
- fully, weakly, semi-supervised
Form of metric
- linear, nonlinear, local
Scalability
- w.r.t. examples / dimension
Optimality of the solution
- local, global
Dimensionality reduction
- yes, no
Early Approaches
Mahalanobis metric for clustering (MMC) (Xing) [&]
- convex formulation with no regularization
- projected gradient approach requiring the full eigenvalue decomposition of M at each iteration
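The projection onto the PSD cone needed at each iteration is exactly where the full eigenvalue decomposition comes in; a minimal sketch:
```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by zeroing negative eigenvalues."""
    M = (M + M.T) / 2.0                    # symmetrize first
    eigvals, eigvecs = np.linalg.eigh(M)   # full eigendecomposition, O(d^3)
    eigvals = np.clip(eigvals, 0.0, None)
    return eigvecs @ np.diag(eigvals) @ eigvecs.T
```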
S&J (Schultz & Joachims)
M = A^T W A, where A is fixed and known and W is diagonal
- less general; requires choosing A manually
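A minimal sketch of this parametrization; keeping the diagonal weights nonnegative guarantees that M stays PSD:
```python
import numpy as np

def sj_metric(A, w):
    """Build M = A^T diag(w) A for a fixed, known A and learned weights w."""
    W = np.diag(np.clip(w, 0.0, None))  # w >= 0 ensures M is PSD
    return A.T @ W @ A
```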
Approaches Driven by Nearest Neighbours
Neighbourhood Component Analysis (NCA) (Goldberger) [&]
- "to optimize the expected leave-one-out error of a stochastic nearest neighbor classifier in the projection space induced by dM"
- non-convex => local maxima
- later generalized to k-NN (Tarlow)
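A hedged sketch of the quantity NCA optimizes: the expected leave-one-out accuracy of a stochastic nearest-neighbor classifier in the space induced by a linear map L (labels y assumed to be a NumPy array):
```python
import numpy as np

def nca_objective(L, X, y):
    """Expected leave-one-out accuracy of a stochastic 1-NN classifier after projecting by L."""
    Z = X @ L.T                       # project all points
    total = 0.0
    for i in range(len(X)):
        d2 = np.sum((Z - Z[i]) ** 2, axis=1)
        d2[i] = np.inf                # a point cannot pick itself as its neighbor
        p = np.exp(-d2)
        p /= p.sum()                  # stochastic neighbor probabilities p_ij
        total += p[y == y[i]].sum()   # probability of picking a same-class neighbor
    return total
```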
Maximally Collapsing Metric Learning (MCML) (Globerson & Roweis)
- modification of NCA to convex problem, but costly projections onto PSD cone
- based on minimizing a KL divergence between pij and an ideal distribution
Large Margin Nearest Neighbors (LMNN) (Weinberger) [&] [!!]
- constraints are defined locally
- sub-gradient descent (+ 4 more papers with different approaches)
- performs well, some overfitting
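A simplified sketch of the LMNN objective: target neighbors are pulled close while differently-labeled "impostors" are pushed beyond a unit margin; the mu trade-off follows the usual formulation, but treat the details as a sketch rather than the exact solver input:
```python
import numpy as np

def lmnn_loss(M, X, y, target_neighbors, mu=0.5):
    """Simplified LMNN objective: pull target neighbors, push impostors past a unit margin."""
    def d2(i, j):
        diff = X[i] - X[j]
        return diff @ M @ diff

    pull = sum(d2(i, j) for i, js in target_neighbors.items() for j in js)
    push = 0.0
    for i, js in target_neighbors.items():
        for j in js:
            for l in range(len(X)):
                if y[l] != y[i]:  # impostor candidates: differently-labeled points
                    push += max(0.0, 1.0 + d2(i, j) - d2(i, l))
    return (1 - mu) * pull + mu * push
```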
Information-Theoretic Approaches
(optimization problems involving an information measure)
Relevant Component Analysis (RCA) (Shental) [&]
- positive pairs only (S), using transitive closure
- efficient algorithm, but poor results (doesn't use any negative information)
Information-Theoretic Metric Learning (ITML) (Davis) [&] [!!]
- introduces LogDet divergence regularization
- efficient, converges to the global minimum and the resulting distance performs well in practice
- using positive (S) and negative (D) pairs
- M0 (regularization matrix) must be picked by hand, usually I => Euclidean distance
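The LogDet divergence between the learned M and the prior M0, D_ld(M, M0) = tr(M M0^{-1}) - log det(M M0^{-1}) - d, assuming both matrices are full-rank PSD:
```python
import numpy as np

def logdet_divergence(M, M0):
    """D_ld(M, M0) = tr(M M0^{-1}) - log det(M M0^{-1}) - d."""
    d = M.shape[0]
    P = M @ np.linalg.inv(M0)
    _, logdet = np.linalg.slogdet(P)  # determinant is positive for full-rank PSD inputs
    return np.trace(P) - logdet - d
```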
Sparse Distance Metric Learning (SDML) (Qi)
- deals with high dimensional data (n << d)
Online Approaches
(online learning methods often come with regret bounds,
stating that the accumulated loss suffered along the way
is not much worse than that of the best hypothesis chosen in hindsight)
Pseudo-metric Online Learning Algorithm (POLA) (Shalev-Shwartz)
- at each step it receives a pair (S or D) and does two orthogonal projections
Logdet Exact Gradient Online (LEGO) (Jain)
- overall improved POLA with LogDet regularization
Regularized Distance Metric Learning (RDML) (Jin)
- similar to POLA, but more flexible
- performs similarly to LMNN and ITML
Mirror Descent Metric Learning (MDML) (Kunapuli & Shavlik)
- composite mirror descent, which allows online optimization of many regularized problems
- comparable to LMNN and ITML, is fast, induces low-rank solutions, but has not been evaluated on large-scale datasets
Multi-Task Metric Learning
(given a set of related tasks, one learns a metric for each in a coupled fashion in order to improve the performance on all tasks)
Multi-Task LMNN (mt-LMNN) (Parameswaran & Weinberger) [&]
- outperforms single-task metric learning methods
- formulation remains convex and can be solved using the same efficient solver as LMNN
Multi-Task Low-Rank Metric Learning Based on Common Subspace (MLCS) (Yang)
- learns a Mahalanobis metric for each task, parametrized by a transformation matrix, which is then decomposed
- not convex (local minima); unlike mt-LMNN, it can be made low-rank (fewer parameters)
Geometry Preserving Multi-task Metric Learning (GPML) (Yang)
- generalization of any previous MTL
- their formulation can extend any metric learning algorithm to multi-task setting
- outperforms single-task LMNN and mt-LMNN especially when training data is scarce
Transfer Metric Learning (TML) (Zhang & Yeung) [&]
- they assume we have already solved similar tasks with enough labeled data and learned a Mahalanobis distance for each of those tasks
- then they use this for another task which has scarce data
Other Approaches
(sparse matrix learning, theory of boosting, ...)
Learning Sparse Metrics via Linear Programming (LPML) (Rosales & Fung)
- aims at learning matrices with entire columns/rows set to zero (making M low-rank)
- but uses a regularizer that only drives individual entries to zero (not entire rows/columns)
Sparse Matrix Learning (SML) (Ying) [&]
- uses L_2,1 norm which tends to zero-out entire rows of M (=> performs feature selection)
- difficult to optimize, fast convergence, but slow iteration O(d^3)
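The L_2,1 norm mentioned above is simply the sum of the L2 norms of the rows; minimizing it drives whole rows to zero:
```python
import numpy as np

def l21_norm(M):
    """||M||_{2,1} = sum_i ||M_i.||_2  (sum of row-wise L2 norms)."""
    return np.sum(np.linalg.norm(M, axis=1))
```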
BoostMetric (Shen) [&]
- The method is based on the property that any PSD matrix can be decomposed into a positive linear combination of trace-one rank-one matrices.
- requires many iterations for high-dimensional datasets
- improved upon by many others (scalability, redundancy, regularization)
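The decomposition property BoostMetric relies on can be illustrated directly via an eigendecomposition: each term is a trace-one rank-one matrix weighted by a nonnegative coefficient:
```python
import numpy as np

def rank_one_decomposition(M):
    """Decompose a PSD matrix into a nonnegative combination of trace-one rank-one matrices."""
    eigvals, eigvecs = np.linalg.eigh(M)
    terms = []
    for lam, v in zip(eigvals, eigvecs.T):
        if lam > 1e-12:
            Z = np.outer(v, v)      # rank-one, trace(Z) = ||v||^2 = 1
            terms.append((lam, Z))  # M = sum_i lam_i * Z_i with lam_i >= 0
    return terms
```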
Distance Metric Learning with Eigenvalue Optimization (DML-p) (Ying)
- revisiting original MMC
- solved efficiently using "first-order algorithm" that only needs largest eigenvalue at each iteration
- good results and low computational complexity, but not clear how to accommodate a regularizer
Robust Metric Learning (RML) (Huang)
- deals with noisy, incorrect training constraints
- good robustness up to 30% of incorrect triplets
Metric Learning to Rank (MLR) (McFee & Lanckriet) [&]
- learns a metric for a ranking task (for a given query)
- needs full eigen decomposition at each iteration (problem for high-dimensional data)
Other Advances in Metric Learning
Linear Similarity Learning
Similarity Learning for Nearest Neighbor Classification (SiLA) (Qamar)
- M need not be PSD nor symmetric
- generalization of the cosine similarity (text and image retrieval)
- subsequent work studied an online feature reweighting algorithm
Online and Batch Learning of Generalized Cosine Similarities (gCosLA) (Qamar & Gaussier)
- similar to POLA (2 projections per iteration)
- fewer iterations and better accuracy than SiLA
- competitive with LMNN and ITML
An Online Algorithm for Large Scale Image Similarity Learning (OASIS) (Chechik) [&]
- bilinear, online, uses triplets (R), is efficient, more efficient for sparse inputs
- can define similarity between instances of different dimensions
- competitive results on medium-scale problems
- is scalable to problems with millions of training instances
- cannot incorporate complex regularizers
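A hedged sketch of the bilinear similarity and triplet hinge loss that OASIS-style methods optimize (the passive-aggressive update rule itself is omitted):
```python
import numpy as np

def bilinear_similarity(W, x, x_prime):
    """s_W(x, x') = x^T W x' (W need not be PSD or symmetric)."""
    return x @ W @ x_prime

def triplet_hinge(W, x, x_pos, x_neg, margin=1.0):
    """Penalty when the positive example is not ranked above the negative by the margin."""
    return max(0.0, margin - bilinear_similarity(W, x, x_pos) + bilinear_similarity(W, x, x_neg))
```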
Similarity Learning for Linear Classification (SLLC) (Bellet)
- different approach: instead of pairwise or triplet constraints, they use constraints averaged over sets of examples
- bilinear, creates extremely sparse classifiers
- con: cannot directly deal with the multi-class setting? (one-vs-all, one-vs-one)
Riemannian Similarity Learning (RSL) (Cheng)
- bilinear, focuses on the setting of pair matching (predicting whether two instances are similar)
Nonlinear methods
Kernelization of Linear Methods
Learning Nonlinear Forms of Metrics
Learning a Similarity Metric Discriminatively, with Application to Face Verification (LSMD) (Chopra)
- Convolutional NN, SGD (=> local optima)
- requires fine-tuning of hyperparameters
Nonlinear NCA (NNCA) (Salakhutdinov & Hinton)
- deep belief network + optimizing NCA for the last layer
- performs well when there is enough data
Support Vector Metric Learning (SVML) (Xu)
- alternates between (i) learning the SVM model with respect to the current Mahalanobis distance
- and (ii) learning a Mahalanobis distance that minimizes a surrogate of the validation error of the current SVM model
Gradient-Boosted LMNN (GB-LMNN) (Kedem)
- gradient boosted regression trees of limited depth
- robust to overfitting, comparable or better than LMNN and ITML
Hamming Distance Metric Learning (HDML) (Norouzi) [&]
- the goal here is to optimize a mapping b(x) that projects a d-dimensional real-valued input x to a q-dimensional binary code
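A minimal sketch, assuming a simple linear-threshold mapping b(x) = sign(Wx + b); the actual mapping learned by HDML can be more general:
```python
import numpy as np

def binary_code(x, W, b):
    """Map a d-dimensional real-valued input to a q-dimensional binary code."""
    return (W @ x + b > 0).astype(int)

def hamming_distance(c1, c2):
    """Number of bits on which the two codes differ."""
    return int(np.sum(c1 != c2))
```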
Local Metric Learning
Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification (Frome)
- not reviewed in the paper
- feature vectors from image patches (important step)
- too much training data, so 15 candidates were chosen per category -> 15*14*1500*101 (candidates * positives * negatives * categories)
- trained over triplets of data, tightly coupled (slower, but globally comparable), convex problem (solved via the dual)
Multiple Metrics LMNN (MM-LMNN) (Weinberger & Saul) [&]
- expects C clusters (from labeled data or K-means)
- learns C metrics in coupled fashion
- large improvements, but prone to overfitting
Generative Local Metric Learning (GLML) (Noh)
- 1-NN classifier, calculating Mahalanobis distance for each training instance, regularized towards diagonal matrix
- each metric calculated independently (=> scalable)
- performs poorly on complex problems
Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering (Bk-means) (Wu)
- does not satisfy the triangle inequality; is convex; generalizes the Mahalanobis distance and the KL divergence
- positive (S) and negative (D) pairs; learning ϕ can be seen as learning an infinite number of local Mahalanobis distances
- n+1 variables instead of d^2 (scalable), but requires n kernel evaluations (expensive for large datasets)
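For reference, the Bregman divergence generated by a strictly convex function ϕ; the choice ϕ(x) = ||x||^2 below is an assumption for illustration, and recovers the squared Euclidean distance:
```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# With phi(x) = ||x||^2, the divergence equals the squared Euclidean distance
phi = lambda x: float(x @ x)
grad_phi = lambda x: 2.0 * x
x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(bregman_divergence(phi, grad_phi, x, y))  # 2.0
```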
Parametric Local Metric Learning (PLML) (Wang) [&]
- a Mahalanobis metric d^2_{M_i} is learned for each training instance x_i
- needs an eigen-decomposition at each step in O(d^3) (intractable for high-dimensional data)
- like LMNN, PLML is sensitive to the relevance of the Euclidean distance for assessing the similarity between anchor points (obtained as cluster means from K-means)
Random Forest Distance (RFD) (Xiong) [&]
- pair classification problem; encodes both relative and absolute position (compared to Mahalanobis, which only encodes relative information)
- uses forest of decision trees
- very efficient, each tree takes O(n log n) to generate and can be done in parallel
- drawback: evaluation needs output of all trees
- outperforms some global and local metric learning methods
Metric Learning for Histogram Data
Generalization Guarantees for Metric Learning
Standard Semi-Supervised Setting
(unlabeled pairs)
Laplacian Regularized Metric Learning (LRML) (Hoi) [&]
- manifold regularization, weight matrix encodes the similarity between pairs of points
- good for data with scarce side information (??)
- drawback: computing W is intractable for large-scale datasets
Semisupervised Multiview Distance Metric Learning (M-DML) (Zha)
- augments LRML by learning multiple metrics from auxiliary datasets
Information-theoretic Semisupervised Metric Learning via Entropy Regularization (SERAPH) (Niu) [&]
- optimizing a probability of labeling a given pair parameterized by a Mahalanobis distance
- Intuitively, the regularization enforces low uncertainty of unobserved weak labels.
Metric Learning for Domain Adaptation
(the labeled training data and the test data come from different (but somehow related) distributions (referred to as the source and target distributions respectively))
Consistent Distance Metric Learning (CDML) (Cao)
- deals with the setting of covariate shift, which assumes that source and target data distributions
pS(x) and pT (x) are different but the conditional distribution of the labels given
the features, p(y|x), remains the same
Domain Adaptation Metric Learning (DAML) (Geng)
- maximum mean discrepancy
- trade-off between satisfying the constraints on the labeled source data
and finding a projection that minimizes the discrepancy between the source and target
distribution
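The maximum mean discrepancy that DAML builds on can be sketched with a simple empirical estimate; this sketch assumes an explicit feature map phi (identity by default) rather than the kernelized form:
```python
import numpy as np

def mmd(X_source, X_target, phi=lambda x: x):
    """Squared MMD between empirical feature-space means of source and target samples."""
    mean_s = np.mean([phi(x) for x in X_source], axis=0)
    mean_t = np.mean([phi(x) for x in X_target], axis=0)
    diff = mean_s - mean_t
    return float(diff @ diff)
```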
String Edit Distance Learning