Notes for Bellet 2014 survey
Mahalanobis distance: d_M(x, x')^2 = (x − x')^T M (x − x')
where M belongs to the cone of symmetric PSD d×d matrices
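A minimal NumPy sketch of this squared distance (assuming M is already a valid symmetric PSD matrix):
```python
import numpy as np

def mahalanobis_sq(x, x_prime, M):
    """Squared Mahalanobis distance d_M(x, x')^2 = (x - x')^T M (x - x')."""
    diff = x - x_prime
    return float(diff @ M @ diff)

# M = I recovers the squared Euclidean distance
x, y = np.array([1.0, 2.0]), np.array([3.0, 0.0])
print(mahalanobis_sq(x, y, np.eye(2)))  # 8.0
```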
Must-link / cannot-link constraints (sometimes called positive / negative pairs):
S = {(xi, xj) : xi and xj should be similar},
D = {(xi, xj) : xi and xj should be dissimilar}.
Relative constraints (sometimes called training triplets):
R = {(xi, xj , xk) : xi should be more similar to xj than to xk}.
Optimization problem of the following general form: min_M { ℓ(M, S, D, R) + λ R(M) }
where ℓ(M, S, D, R) is a loss function that incurs a penalty when training constraints
are violated, R(M) is some regularizer on the parameters M of the learned metric, and λ ≥ 0 is the regularization parameter
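As an illustration of this general form, here is one possible instantiation, a hinge loss on pair constraints plus a Frobenius-norm regularizer; the specific loss and regularizer here are assumptions for the sketch, not the survey's single canonical choice:
```python
import numpy as np

def pair_hinge_loss(M, S, D, margin=1.0):
    """Penalty when similar pairs are far apart or dissimilar pairs are close."""
    def d2(x, y):
        diff = x - y
        return diff @ M @ diff
    loss = sum(max(0.0, d2(x, y) - margin) for x, y in S)   # similar pairs should fall inside the margin
    loss += sum(max(0.0, margin - d2(x, y)) for x, y in D)  # dissimilar pairs should fall outside it
    return loss

def objective(M, S, D, lam=0.1):
    return pair_hinge_loss(M, S, D) + lam * np.sum(M ** 2)  # loss + lambda * ||M||_F^2
```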
Related topics:
Kernel learning
- limited to the transductive setting (the kernel is only learned for the observed data)
Multiple Kernel learning
- effective, inductive, but more restricted than Kernel learning
Dimensionality reduction
- assume that the (unlabeled) data lie on an embedded low-dimensional manifold within the higher-dimensional space and aim at “unfolding” it
Metric learning:
Learning paradigm
- fully, weakly, semi-supervised
Form of metric
- linear, nonlinear, local
Scalability
- w.r.t. examples / dimension
Optimality of the solution
- local, global
Dimensionality reduction
- yes, no
Early Approaches
Mahalanobis metric for clustering (MMC) (Xing) [&]
- convex formulation with no regularization
- projected gradient approach requiring the full eigenvalue decomposition of M at each iteration
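The projection onto the PSD cone needed at each iteration is exactly where the full eigenvalue decomposition comes in; a minimal sketch:
```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by zeroing negative eigenvalues."""
    M = (M + M.T) / 2.0                    # symmetrize first
    eigvals, eigvecs = np.linalg.eigh(M)   # full eigendecomposition, O(d^3)
    eigvals = np.clip(eigvals, 0.0, None)
    return eigvecs @ np.diag(eigvals) @ eigvecs.T
```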
S&J (Schultz & Joachims)
M = A^T W A, where A is fixed and known and W is diagonal
- less general; requires choosing A manually
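A minimal sketch of this parametrization; keeping the diagonal weights nonnegative guarantees that M stays PSD:
```python
import numpy as np

def sj_metric(A, w):
    """Build M = A^T diag(w) A for a fixed, known A and learned weights w."""
    W = np.diag(np.clip(w, 0.0, None))  # w >= 0 ensures M is PSD
    return A.T @ W @ A
```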
Approaches Driven by Nearest Neighbours
Neighbourhood Component Analysis (NCA) (Goldberger) [&]
- "to optimize the expected leave-one-out error of a stochastic nearest neighbor classifier in the projection space induced by dM"
- non-convex => local maxima
- later generalized to k-NN (Tarlow)
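A hedged sketch of the quantity NCA optimizes: the expected leave-one-out accuracy of a stochastic nearest-neighbor classifier in the space induced by a linear map L (labels y assumed to be a NumPy array):
```python
import numpy as np

def nca_objective(L, X, y):
    """Expected leave-one-out accuracy of a stochastic 1-NN classifier after projecting by L."""
    Z = X @ L.T                       # project all points
    total = 0.0
    for i in range(len(X)):
        d2 = np.sum((Z - Z[i]) ** 2, axis=1)
        d2[i] = np.inf                # a point cannot pick itself as its neighbor
        p = np.exp(-d2)
        p /= p.sum()                  # stochastic neighbor probabilities p_ij
        total += p[y == y[i]].sum()   # probability of picking a same-class neighbor
    return total
```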
Maximally Collapsing Metric Learning (MCML) (Globerson & Roweis)
- modification of NCA to convex problem, but costly projections onto PSD cone
- based on minimizing a KL divergence between pij and an ideal distribution
Large Margin Nearest Neighbors (LMNN) (Weinberger) [&] [!!]
- constraints are defined locally
- sub-gradient descent (+ 4 more papers with different approaches)
- performs well, some overfitting
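A simplified sketch of the LMNN objective: target neighbors are pulled close while differently-labeled "impostors" are pushed beyond a unit margin; the mu trade-off follows the usual formulation, but treat the details as a sketch rather than the exact solver input:
```python
import numpy as np

def lmnn_loss(M, X, y, target_neighbors, mu=0.5):
    """Simplified LMNN objective: pull target neighbors, push impostors past a unit margin."""
    def d2(i, j):
        diff = X[i] - X[j]
        return diff @ M @ diff

    pull = sum(d2(i, j) for i, js in target_neighbors.items() for j in js)
    push = 0.0
    for i, js in target_neighbors.items():
        for j in js:
            for l in range(len(X)):
                if y[l] != y[i]:  # impostor candidates: differently-labeled points
                    push += max(0.0, 1.0 + d2(i, j) - d2(i, l))
    return (1 - mu) * pull + mu * push
```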
Information-Theoretic Approaches
(optimization problems involving an information measure)
Relevant Component Analysis (RCA) (Shental) [&]
- positive pairs only (S), using transitive closure
- efficient algorithm, but poor results (doesn't use any negative information)
Information-Theoretic Metric Learning (ITML) (Davis) [&] [!!]
- introduces LogDet divergence regularization
- efficient, converges to the global minimum and the resulting distance performs well in practice
- using positive (S) and negative (D) pairs
- M0 (regularization matrix) must be picked by hand, usually I => Euclidean distance
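The LogDet divergence between the learned M and the prior M0, D_ld(M, M0) = tr(M M0^{-1}) - log det(M M0^{-1}) - d, assuming both matrices are full-rank PSD:
```python
import numpy as np

def logdet_divergence(M, M0):
    """D_ld(M, M0) = tr(M M0^{-1}) - log det(M M0^{-1}) - d."""
    d = M.shape[0]
    P = M @ np.linalg.inv(M0)
    _, logdet = np.linalg.slogdet(P)  # determinant is positive for full-rank PSD inputs
    return np.trace(P) - logdet - d
```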
Sparse Distance Metric Learning (SDML) (Qi)
- deals with high dimensional data (n << d)
Online Approaches
(online learning methods often come with regret bounds,
stating that the accumulated loss suffered along the way
is not much worse than that of the best hypothesis chosen in hindsight)
Pseudo-metric Online Learning Algorithm (POLA) (Shalev-Shwartz)
- at each step it receives a pair (S or D) and does two orthogonal projections
Logdet Exact Gradient Online (LEGO) (Jain)
- overall improved POLA with LogDet regularization
Regularized Distance Metric Learning (RDML) (Jin)
- similar to POLA, but more flexible
- performs similarly to LMNN and ITML
Mirror Descent Metric Learning (MDML) (Kunapuli & Shavlik)
- composite mirror descent, which allows online optimization of many regularized problems
- comparable to LMNN and ITML, is fast, induces low-rank solutions, but has not been evaluated on large-scale datasets
Multi-Task Metric Learning
(given a set of related tasks, one learns a metric for each in a coupled fashion in order to improve the performance on all tasks)
Multi-Task LMNN (mt-LMNN) (Parameswaran & Weinberger) [&]
- outperforms single-task metric learning methods
- formulation remains convex and can be solved using the same efficient solver as LMNN
Multi-Task Low-Rank Metric Learning Based on Common Subspace (MLCS) (Yang)
- learns a Mahalanobis metric for each task, parametrized by a transformation matrix, which is then decomposed
- not convex (local minima); unlike mt-LMNN, it can be made low-rank (fewer parameters)
Geometry Preserving Multi-task Metric Learning (GPML) (Yang)
- generalization of any previous MTL
- their formulation can extend any metric learning algorithm to multi-task setting
- outperforms single-task LMNN and mt-LMNN especially when training data is scarce
Transfer Metric Learning (TML) (Zhang & Yeung) [&]
- they assume we have already solved similar tasks with enough labeled data and learned a Mahalanobis distance for each of those tasks
- then they use this for another task which has scarce data
Other Approaches
(sparse matrix learning, theory of boosting, ...)
Learning Sparse Metrics via Linear Programming (LPML) (Rosales & Fung)
- aims at learning matrices with entire columns/rows set to zero (making M low-rank)
- but uses a regularizer that only drives individual entries to zero (not entire rows/columns)
Sparse Matrix Learning (SML) (Ying) [&]
- uses L_2,1 norm which tends to zero-out entire rows of M (=> performs feature selection)
- difficult to optimize, fast convergence, but slow iteration O(d^3)
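The L_2,1 norm mentioned above is simply the sum of the L2 norms of the rows; minimizing it drives whole rows to zero:
```python
import numpy as np

def l21_norm(M):
    """||M||_{2,1} = sum_i ||M_i.||_2  (sum of row-wise L2 norms)."""
    return np.sum(np.linalg.norm(M, axis=1))
```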
BoostMetric (Shen) [&]
- The method is based on the property that any PSD matrix can be decomposed into a positive linear combination of trace-one rank-one matrices.
- requires many iterations for high-dimensional datasets
- improved upon by many others (scalability, redundancy, regularization)
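The decomposition property BoostMetric relies on can be illustrated directly via an eigendecomposition: each term is a trace-one rank-one matrix weighted by a nonnegative coefficient:
```python
import numpy as np

def rank_one_decomposition(M):
    """Decompose a PSD matrix into a nonnegative combination of trace-one rank-one matrices."""
    eigvals, eigvecs = np.linalg.eigh(M)
    terms = []
    for lam, v in zip(eigvals, eigvecs.T):
        if lam > 1e-12:
            Z = np.outer(v, v)      # rank-one, trace(Z) = ||v||^2 = 1
            terms.append((lam, Z))  # M = sum_i lam_i * Z_i with lam_i >= 0
    return terms
```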
Distance Metric Learning with Eigenvalue Optimization (DML-p) (Ying)
- revisiting original MMC
- solved efficiently using "first-order algorithm" that only needs largest eigenvalue at each iteration
- good results and low computational complexity, but not clear how to accommodate a regularizer
Robust Metric Learning (RML) (Huang)
- deals with noisy, incorrect training constraints
- good robustness up to 30% of incorrect triplets
Metric Learning to Rank (MLR) (McFee & Lanckriet) [&]
- learns a metric for a ranking task (for a given query)
- needs full eigen decomposition at each iteration (problem for high-dimensional data)
Other Advances in Metric Learning
Linear Similarity Learning
Similarity Learning for Nearest Neighbor Classification (SiLA) (Qamar)
- M need not be PSD nor symmetric
- generalization of the cosine similarity (text and image retrieval)
- subsequent work studied an online feature reweighting algorithm
Online and Batch Learning of Generalized Cosine Similarities (gCosLA) (Qamar & Gaussier)
- similar to POLA (2 projections per iteration)
- fewer iterations and better accuracy than SiLA
- competitive with LMNN and ITML
An Online Algorithm for Large Scale Image Similarity Learning (OASIS) (Chechik) [&]
- bilinear, online, uses triplets (R), is efficient, more efficient for sparse inputs
- can define similarity between instances of different dimensions
- competitive results on medium-scale problems
- is scalable to problems with millions of training instances
- cannot incorporate complex regularizers
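A hedged sketch of the bilinear similarity and triplet hinge loss that OASIS-style methods optimize (the passive-aggressive update rule itself is omitted):
```python
import numpy as np

def bilinear_similarity(W, x, x_prime):
    """s_W(x, x') = x^T W x' (W need not be PSD or symmetric)."""
    return x @ W @ x_prime

def triplet_hinge(W, x, x_pos, x_neg, margin=1.0):
    """Penalty when the positive example is not ranked above the negative by the margin."""
    return max(0.0, margin - bilinear_similarity(W, x, x_pos) + bilinear_similarity(W, x, x_neg))
```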
Similarity Learning for Linear Classification (SLLC) (Bellet)
- different approach: instead of pairwise or triplet constraints, they use constraints averaged over sets of examples
- bilinear, creates extremely sparse classifiers
- con: cannot directly deal with the multi-class setting? (one-vs-all, one-vs-one)
Riemannian Similarity Learning (RSL) (Cheng)
- bilinear, focuses on the setting of pair matching (predicting whether two instances are similar)
Nonlinear methods
Kernelization of Linear Methods
Learning Nonlinear Forms of Metrics
Learning a Similarity Metric Discriminatively, with Application to Face Verification (LSMD) (Chopra)
- Convolutional NN, SGD (=> local optima)
- requires fine-tuning of hyperparameters
Nonlinear NCA (NNCA) (Salakhutdinov & Hinton)
- deep belief network + optimizing NCA for the last layer
- performs well when there is enough data
Support Vector Metric Learning (SVML) (Xu)
- alternates between (i) learning the SVM model with respect to the current Mahalanobis distance
- and (ii) learning a Mahalanobis distance that minimizes a surrogate of the validation error of the current SVM model
Gradient-Boosted LMNN (GB-LMNN) (Kedem)
- gradient boosted regression trees of limited depth
- robust to overfitting, comparable or better than LMNN and ITML
Hamming Distance Metric Learning (HDML) (Norouzi) [&]
- the goal here is to optimize a mapping b(x) that projects a d-dimensional real-valued input x to a q-dimensional binary code
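A minimal sketch, assuming a simple linear-threshold mapping b(x) = sign(Wx + b); the actual mapping learned by HDML can be more general:
```python
import numpy as np

def binary_code(x, W, b):
    """Map a d-dimensional real-valued input to a q-dimensional binary code."""
    return (W @ x + b > 0).astype(int)

def hamming_distance(c1, c2):
    """Number of bits on which the two codes differ."""
    return int(np.sum(c1 != c2))
```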
Local Metric Learning
Learning Globally-Consistent Local Distance Functions for Shape-Based Image Retrieval and Classification (Frome)
- not reviewed in the paper
- feature vectors from image patches (important step)
- too much training data, so 15 candidates were chosen per category -> 15*14*1500*101 (candidates * positives * negatives * categories)
- trained over triplets of data, tightly coupled (slower, but globally comparable), convex problem (solved via the dual)
Multiple Metrics LMNN (MM-LMNN) (Weinberger & Saul) [&]
- expects C clusters (from labeled data or K-means)
- learns C metrics in coupled fashion
- large improvements, but prone to overfitting
Generative Local Metric Learning (GLML) (Noh)
- 1-NN classifier, calculating Mahalanobis distance for each training instance, regularized towards diagonal matrix
- each metric calculated independently (=> scalable)
- performs poorly on complex problems
Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering (Bk-means) (Wu)
- does not satisfy the triangle inequality; is convex; generalizes the Mahalanobis distance and the KL divergence
- positive (S) and negative (D) pairs; learning ϕ can be seen as learning an infinite number of local Mahalanobis distances
- n+1 variables instead of d^2 (scalable), but requires n kernel evaluations (expensive for large datasets)
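For reference, the Bregman divergence generated by a strictly convex function ϕ; the choice ϕ(x) = ||x||^2 below is an assumption for illustration, and recovers the squared Euclidean distance:
```python
import numpy as np

def bregman_divergence(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>."""
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# With phi(x) = ||x||^2, the divergence equals the squared Euclidean distance
phi = lambda x: float(x @ x)
grad_phi = lambda x: 2.0 * x
x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(bregman_divergence(phi, grad_phi, x, y))  # 2.0
```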
Parametric Local Metric Learning (PLML) (Wang) [&]
- a Mahalanobis metric d^2_{M_i} is learned for each training instance x_i
- needs an eigen-decomposition at each step in O(d^3) (intractable for high-dimensional data)
- like LMNN, PLML is sensitive to the relevance of the Euclidean distance for assessing the similarity between anchor points (obtained as cluster means from K-means)
Random Forest Distance (RFD) (Xiong) [&]
- pair classification problem; encodes both relative and absolute position (compared to Mahalanobis, which only encodes relative information)
- uses forest of decision trees
- very efficient, each tree takes O(n log n) to generate and can be done in parallel
- drawback: evaluation needs output of all trees
- outperforms some global and local metric learning methods
Metric Learning for Histogram Data
Generalization Guarantees for Metric Learning
Standard Semi-Supervised Setting
(unlabeled pairs)
Laplacian Regularized Metric Learning (LRML) (Hoi) [&]
- manifold regularization, weight matrix encodes the similarity between pairs of points
- good for data with scarce side information (??)
- drawback: computing W is intractable for large-scale datasets
Semisupervised Multiview Distance Metric Learning (M-DML) (Zha)
- augments LRML by learning multiple metrics from auxiliary datasets
Information-theoretic Semisupervised Metric Learning via Entropy Regularization (SERAPH) (Niu) [&]
- optimizing a probability of labeling a given pair parameterized by a Mahalanobis distance
- Intuitively, the regularization enforces low uncertainty of unobserved weak labels.
Metric Learning for Domain Adaptation
(the labeled training data and the test data come from different (but somehow related) distributions (referred to as the source and target distributions respectively))
Consistent Distance Metric Learning (CDML) (Cao)
- deals with the setting of covariate shift, which assumes that source and target data distributions
pS(x) and pT (x) are different but the conditional distribution of the labels given
the features, p(y|x), remains the same
Domain Adaptation Metric Learning (DAML) (Geng)
- maximum mean discrepancy
- trade-off between satisfying the constraints on the labeled source data
and finding a projection that minimizes the discrepancy between the source and target
distribution
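The maximum mean discrepancy that DAML builds on can be sketched with a simple empirical estimate; this sketch assumes an explicit feature map phi (identity by default) rather than the kernelized form:
```python
import numpy as np

def mmd(X_source, X_target, phi=lambda x: x):
    """Squared MMD between empirical feature-space means of source and target samples."""
    mean_s = np.mean([phi(x) for x in X_source], axis=0)
    mean_t = np.mean([phi(x) for x in X_target], axis=0)
    diff = mean_s - mean_t
    return float(diff @ diff)
```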
String Edit Distance Learning