@shagunsodhani
Created May 15, 2016 09:30
Notes for t-SNE paper

Visualizing Data using t-SNE

Introduction

  • A method to visualize high-dimensional datapoints in 2- or 3-dimensional space.
  • Data visualization techniques like Chernoff faces and graph approaches just provide a representation and not an interpretation.
  • Dimensionality reduction techniques fail to retain both local and global structure of the data simultaneously. For example, PCA and MDS are linear techniques and fail on data lying on a non-linear manifold.
  • t-SNE approach converts data into a matrix of pairwise similarities and visualizes this matrix.
  • Based on SNE (Stochastic Neighbor Embedding)
  • Link to paper

SNE

  • Given a set of datapoints x1, x2, ..., xn, p(j|i) is the conditional probability that xi would pick xj as its neighbor if neighbors were picked in proportion to their probability density under a Gaussian centred at xi. The choice of the variance σi is described later.
  • Similarly, define q(j|i) as the conditional probability corresponding to the low-dimensional representations yi and yj (of xi and xj). The variance of the Gaussian in this case is set to 1/sqrt(2).
  • The argument is that if yi and yj correctly capture the similarity between xi and xj, then q(j|i) should equal p(j|i). So the objective to be minimized is the sum of Kullback-Leibler (KL) divergences KL(Pi || Qi) over all datapoints, where Pi (Qi) is the conditional probability distribution over all other points given xi (yi).
  • Since KL divergence is not symmetric, the objective focuses on retaining local structure: it is very costly to place yi and yj far apart when xi and xj are close, but cheap to place them close together when xi and xj are far apart.
  • Users specify a value called perplexity (a smooth measure of the effective number of neighbors). A binary search is performed for the σi that produces a Pi with the perplexity specified by the user (see the sketch after this list).
  • Gradient descent (with momentum) is used to minimize the objective, and Gaussian noise is added to the map points in the early stages to perform a form of simulated annealing.
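A minimal NumPy sketch of the two pieces above, the Gaussian conditional probabilities p(j|i) and the binary search over σi for a target perplexity; the function names, tolerances, and search bounds are illustrative choices, not from the paper.

```python
import numpy as np

def conditional_probs(dists_i, sigma):
    """p(j|i) for one point: Gaussian affinities over squared distances to all other points."""
    # dists_i: squared distances from x_i to every other point (self already excluded)
    logits = -dists_i / (2.0 * sigma ** 2)
    logits -= logits.max()              # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def sigma_for_perplexity(dists_i, target_perplexity, tol=1e-5, n_iter=50):
    """Binary search over sigma_i so that 2^H(P_i) matches the user-chosen perplexity."""
    lo, hi = 1e-10, 1e10
    for _ in range(n_iter):
        sigma = (lo + hi) / 2.0
        p = conditional_probs(dists_i, sigma)
        entropy = -np.sum(p * np.log2(p + 1e-12))
        perplexity = 2.0 ** entropy
        if abs(perplexity - target_perplexity) < tol:
            break
        if perplexity > target_perplexity:
            hi = sigma                  # too many effective neighbors -> shrink sigma
        else:
            lo = sigma
    return sigma
```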

t-SNE (t-Distributed SNE)

Symmetric SNE

  • A single KL Divergence between P (joint probability distribution in high-dimensional space) and Q (joint probability distribution in low-dimensional space) is minimized.
  • Symmetric because the joint probabilities satisfy pij = pji and qij = qji (the paper sets pij = (p(j|i) + p(i|j)) / 2n; see the sketch after this list).
  • More robust to outliers and has a simpler gradient expression.
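A small sketch of the symmetrisation step, assuming a precomputed n x n matrix of conditional probabilities; it uses the paper's choice pij = (p(j|i) + p(i|j)) / 2n, which also ensures every datapoint contributes at least 1/(2n) to the cost.

```python
import numpy as np

def joint_probabilities(P_cond):
    """Symmetrise an n x n matrix of conditional p(j|i) into joint p_ij."""
    n = P_cond.shape[0]
    P = (P_cond + P_cond.T) / (2.0 * n)
    np.fill_diagonal(P, 0.0)            # a point is not its own neighbor
    return P
```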

Crowding Problem

  • When a high-dimensional dataset is modeled in 2 (or 3) dimensions, it is difficult to segregate nearby datapoints from moderately distant ones, and gaps cannot form between natural clusters.
  • One way around this problem is UNI-SNE, but its cost function is difficult to optimize.

t-SNE

  • Instead of a Gaussian, use a heavy-tailed distribution (the Student-t distribution with one degree of freedom) to convert distances into probabilities in the low-dimensional space. This way, a moderate distance in the high-dimensional space can be modeled by a much larger distance in the map.
  • The Student-t distribution is an infinite mixture of Gaussians, yet the density of a point under it can be evaluated very quickly because it does not involve an exponential.
  • The resulting cost function is easier to optimize (see the gradient sketch after this list).
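A sketch of the low-dimensional affinities and the gradient, assuming P is the symmetrised joint distribution and Y the current embedding; variable and function names are my own.

```python
import numpy as np

def tsne_grad(P, Y):
    """One gradient evaluation of the t-SNE cost.

    P: n x n joint probabilities in the high-dimensional space.
    Y: n x d current low-dimensional embedding.
    Returns (grad, Q) where grad follows the paper's form
    dC/dy_i = 4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^{-1}.
    """
    diff = Y[:, None, :] - Y[None, :, :]            # pairwise differences y_i - y_j
    num = 1.0 / (1.0 + np.sum(diff ** 2, axis=2))   # Student-t kernel with one degree of freedom
    np.fill_diagonal(num, 0.0)
    Q = num / num.sum()                             # joint q_ij
    PQ = (P - Q) * num
    grad = 4.0 * np.einsum('ij,ijk->ik', PQ, diff)
    return grad, Q
```

In the full algorithm this gradient would be plugged into the momentum update mentioned in the SNE section, together with the optimization tricks below.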

Optimization Tricks

Early Compression

  • At the start of optimization, force the datapoints (in low-dimensional space) to stay close together so that datapoints can easily move from one cluster to another.
  • Implemented as an additional L2 penalty term proportional to the sum of squared distances of the map points from the origin (see the sketch below).
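A tiny sketch of how such a penalty could enter the gradient; the weight and the iteration at which it is removed are set by hand in the paper, so the values here are placeholders.

```python
import numpy as np

def early_compression_grad(Y, weight=0.01):
    """Gradient of an extra term weight * sum_i ||y_i||^2 that pulls all map points
    toward the origin; added to the t-SNE gradient only during the first iterations."""
    return 2.0 * weight * Y
```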

Early Exaggeration

  • Multiply all the pij's by a constant (the paper uses 4) in the initial stages of the optimization. The qij's are then too small to model the exaggerated pij's, so the natural clusters in the data form tight, widely separated clusters, creating a lot of empty space in the map (see the sketch below).
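A sketch of the exaggeration schedule, assuming P is the joint-probability matrix; the factor of 4 is from the paper, while the cutoff iteration used here is a placeholder.

```python
def exaggerate(P, iteration, factor=4.0, stop_iter=50):
    # Use exaggerated probabilities early on, the original P afterwards.
    return P * factor if iteration < stop_iter else P
```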

t-SNE on large datasets

  • Space and time complexity is quadratic in the number of datapoints, so applying t-SNE directly to large datasets is infeasible.
  • Select a random subset of points (called landmark points) to display.
  • Build a neighborhood graph over all the datapoints and, for each landmark point, define random walks that start at that landmark and terminate as soon as they land on another landmark point.
  • p(j|i) is defined as the fraction of random walks starting at landmark xi that terminate at landmark xj. Because the walks follow the dense parts of the graph, p(j|i) is not sensitive to "short-circuits" in the graph (caused by noisy datapoints). A simulation-based sketch appears after this list.
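A sketch of the random-walk affinities by explicit simulation, assuming a precomputed neighborhood graph; all names here are illustrative, and the same fractions can also be computed without simulating walks one by one.

```python
import numpy as np

def random_walk_affinities(neighbors, landmarks, walks_per_landmark=1000, seed=0):
    """Estimate p(j|i) between landmark points via random walks on a neighborhood graph.

    neighbors: list where neighbors[v] is a sequence of node indices adjacent to node v.
    landmarks: sequence of node indices chosen as landmark points.
    Returns an L x L row-stochastic matrix whose (i, j) entry estimates p(j|i).
    """
    rng = np.random.default_rng(seed)
    landmark_index = {int(v): idx for idx, v in enumerate(landmarks)}
    P = np.zeros((len(landmarks), len(landmarks)))
    for i, start in enumerate(landmarks):
        for _ in range(walks_per_landmark):
            v = int(start)
            while True:
                v = int(rng.choice(neighbors[v]))       # step to a random neighbor
                if v in landmark_index and v != start:  # stop at any *other* landmark
                    break
            P[i, landmark_index[v]] += 1
    return P / walks_per_landmark
```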

Advantages of t-SNE

  • The Gaussian kernel employed by t-SNE (in the high-dimensional space) defines a soft border between the local and global structure of the data.
  • Both nearby and distant pairs of datapoints get equal importance in modeling the low-dimensional coordinates.
  • The local neighborhood size of each datapoint is determined on the basis of the local density of the data.
  • Random walk version of t-SNE takes care of "short-circuit" problem.

Limitations of t-SNE

  • It is unclear how t-SNE would perform on the general dimensionality-reduction task of reducing to more than 3 dimensions. For such higher-dimensional targets, a Student-t distribution with more degrees of freedom is likely more appropriate.
  • t-SNE reduces dimensionality mainly based on local properties of the data, which means it may fail on data with an intrinsically high-dimensional structure (curse of dimensionality).
  • The cost function of t-SNE is not convex, so several optimization parameters must be chosen, and different choices (and runs) can produce different solutions.