# Visualizing Large-scale and High-dimensional Data

Introduction

  • LargeVis - a technique to visualize large-scale and high-dimensional data in a low-dimensional space.
  • The problem relates to both the information visualization and the machine learning (and data mining) domains.
  • Link to the paper: https://arxiv.org/abs/1602.00370

t-SNE

  • The state-of-the-art method for the visualization problem.
  • Starts by constructing a similarity structure from the data and then projects the structure onto a 2/3-dimensional space (a sketch of such a similarity structure follows this list).
  • An accelerated version has been proposed that uses a K-Nearest Neighbor (KNN) graph as the similarity structure.
  • Limitations
    • The computational cost of constructing the KNN graph for high-dimensional data.
    • The efficiency of the graph visualization step breaks down for large datasets (O(N log N) complexity).
    • The parameters are sensitive to the dataset and have to be re-tuned for every new dataset.
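
For reference, here is a minimal sketch of the kind of similarity structure t-SNE builds, assuming plain NumPy. The fixed bandwidth `sigma` is an illustrative simplification: real t-SNE tunes a per-point bandwidth by binary search so that each row matches a target perplexity.

```python
import numpy as np

def gaussian_similarities(X, sigma=1.0):
    """Conditional similarities p_{j|i} under a Gaussian kernel, as in
    t-SNE (simplified: one global sigma instead of a per-point one)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)               # p_{i|i} = 0
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)         # normalize each row
```

t-SNE then symmetrizes these conditional similarities and searches for a low-dimensional layout whose similarities match them.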

LargeVis

  • Constructs the KNN graph more efficiently than t-SNE does.
  • Uses a principled probabilistic model for graph visualization.
  • Let us say the input is N datapoints in a d-dimensional space.

KNN Graph Construction

  • Random projection trees are used for nearest-neighbor search in the high-dimensional space, with Euclidean distance as the distance metric.
  • The tree is constructed by recursively partitioning the entire space; the nodes of the tree correspond to subspaces.
  • For every non-leaf node of the tree, a random hyperplane is selected that splits the node's subspace into two children.
  • This is done until the number of points in a subspace falls below a threshold.
  • Once the tree is constructed, each data point is assigned to a leaf node, and the points in the subspace of that leaf node are the candidate nearest neighbors of the data point.
  • For high accuracy, a large number of trees has to be constructed, which increases the computational cost.
  • The paper counters this bottleneck with the idea that "a neighbor of my neighbor is also my neighbor" to improve the accuracy of the constructed graph.
  • Concretely: construct an approximate KNN graph using random projection trees and then, for each node, also search its neighbors' neighbors (see the sketch after this list).
  • Edges are assigned weights just as in t-SNE.
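
Below is a minimal sketch of this construction, assuming plain NumPy; the split rule (a hyperplane perpendicular to the line between two random points), `leaf_size`, `n_trees`, and the single refinement pass are illustrative choices, not the paper's tuned implementation.

```python
import numpy as np
from collections import defaultdict

def rp_tree_leaves(X, idx, leaf_size, rng):
    """Recursively split the points with random hyperplanes until each
    subspace holds at most `leaf_size` points; return the leaf index sets."""
    if len(idx) <= leaf_size:
        return [idx]
    a, b = rng.choice(idx, size=2, replace=False)
    proj = X[idx] @ (X[a] - X[b])           # random split direction
    mask = proj <= np.median(proj)
    if mask.all() or not mask.any():        # degenerate split: fall back
        mask = rng.random(len(idx)) < 0.5
    return (rp_tree_leaves(X, idx[mask], leaf_size, rng)
            + rp_tree_leaves(X, idx[~mask], leaf_size, rng))

def approximate_knn(X, k, n_trees=3, leaf_size=25, seed=0):
    rng = np.random.default_rng(seed)
    cand = defaultdict(set)
    for _ in range(n_trees):                # union of candidates over trees
        for leaf in rp_tree_leaves(X, np.arange(len(X)), leaf_size, rng):
            for i in leaf:
                cand[i].update(leaf)
    knn = {}
    for i in range(len(X)):
        pool = set(cand[i])                 # "a neighbor of my neighbor
        for j in cand[i]:                   #  is also my neighbor"
            pool |= cand[j]
        pool.discard(i)
        pool = np.fromiter(pool, dtype=int)
        dists = np.linalg.norm(X[pool] - X[i], axis=1)
        knn[i] = pool[np.argsort(dists)[:k]]
    return knn
```

The paper repeats the neighbor-exploring refinement for a few iterations; one pass is shown here for brevity.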

Probabilistic Model For Graph Visualization

  • Given a pair of vertices, the probability of observing a binary edge between them is modeled as a function of the distance between the projections of the two vertices in the low-dimensional space.
  • The probability of observing an edge with weight w is the wth power of the probability of observing a binary edge.
  • Given a weighted graph G, the log-likelihood of the graph is modeled as the log-likelihood of the observed edges plus that of the negative edges (vertex pairs without edges).
  • To keep the objective function tractable, only some negative edges are sampled for each vertex, using a noise distribution.
  • The positive edges are sampled with probability proportional to their weights and then treated as binary edges.
  • The resulting objective function can be optimized with asynchronous stochastic gradient descent, which is very effective on sparse graphs (a single-threaded sketch follows this list).
  • The overall complexity is O(sMN), where s is the dimension of the low-dimensional space, M is the number of negative samples, and N is the number of vertices.
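
Below is a single-threaded sketch of this optimization, assuming plain NumPy, with f(x) = 1 / (1 + x^2) as the edge-probability function (one of the choices discussed in the paper); `gamma`, the gradient clipping, and the fixed learning rate are illustrative stand-ins for the paper's settings.

```python
import numpy as np

def layout(edges, weights, n, dim=2, n_samples=100_000, M=5,
           gamma=7.0, lr=1.0, seed=0):
    """SGD on a LargeVis-style objective. `edges` is a list of (i, j)
    pairs, `weights` the matching edge weights, `n` the vertex count."""
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-4, size=(n, dim))   # low-dimensional layout
    edge_p = np.asarray(weights, dtype=float)
    edge_p /= edge_p.sum()                      # sample edges by weight
    deg = np.bincount(np.asarray(edges).ravel(), minlength=n) ** 0.75
    noise_p = deg / deg.sum()                   # P_n(j) ~ degree^0.75
    for _ in range(n_samples):
        i, j = edges[rng.choice(len(edges), p=edge_p)]
        pairs = [(j, True)] + [(v, False)
                               for v in rng.choice(n, size=M, p=noise_p)]
        for v, positive in pairs:
            if v == i or (not positive and v == j):
                continue
            d = Y[i] - Y[v]
            d2 = float(d @ d) + 1e-8
            if positive:    # ascend log f(d): pull the pair together
                g = -2.0 * d / (1.0 + d2)
            else:           # ascend gamma * log(1 - f(d)): push apart
                g = gamma * 2.0 * d / (d2 * (1.0 + d2))
            g = np.clip(g, -5.0, 5.0)           # clip for stability
            Y[i] += lr * g
            Y[v] -= lr * g
    return Y
```

Each sampled edge costs O(sM) work, which is where the overall O(sMN) complexity comes from when the number of samples grows linearly with the number of vertices.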

Experiments

  • The data is first preprocessed with embedding-learning techniques like SkipGram and LINE to bring it down to a 100-dimensional space.
  • One limitation is precisely this preprocessing step: reducing the data to 100 dimensions seems to go against the claim of scaling to hundreds of dimensions.
  • KNN graph construction is fastest for LargeVis, followed by random projection trees, NN-Descent, and vantage-point trees (used in t-SNE).
  • LargeVis requires very few iterations to create highly accurate KNN graphs.
  • A KNN classifier is used to measure the accuracy of the graph visualization quantitatively.
  • Compared to t-SNE, LargeVis is much more stable with respect to the learning rate across datasets.
  • LargeVis benefits from its linear complexity (compared to t-SNE's log-linear complexity) and, with the default learning rate, outperforms t-SNE on larger datasets.