# Visualizing Large-scale and High-dimensional Data

Introduction

  • LargeVis - a technique to visualize large-scale and high-dimensional data in a low-dimensional space.
  • The problem relates to both the information visualization and the machine learning (and data mining) domains.
  • Link to the paper: https://arxiv.org/abs/1602.00370

t-SNE

  • The state-of-the-art method for the visualization problem.
  • Starts by constructing a similarity structure from the data and then projects the structure onto a 2/3-dimensional space (a sketch of such a similarity structure follows this list).
  • An accelerated version has been proposed that uses a K-Nearest Neighbor (KNN) graph as the similarity structure.
  • Limitations
    • The computational cost of constructing the KNN graph for high-dimensional data.
    • The efficiency of the graph visualization step breaks down for large datasets (O(N log N) complexity).
    • The parameters are sensitive to the dataset and have to be re-tuned for every new dataset.
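
For reference, here is a minimal sketch of the kind of similarity structure t-SNE builds, assuming plain NumPy. The fixed bandwidth `sigma` is an illustrative simplification: real t-SNE tunes a per-point bandwidth by binary search so that each row matches a target perplexity.

```python
import numpy as np

def gaussian_similarities(X, sigma=1.0):
    """Conditional similarities p_{j|i} under a Gaussian kernel, as in
    t-SNE (simplified: one global sigma instead of a per-point one)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)               # p_{i|i} = 0
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)         # normalize each row
```

t-SNE then symmetrizes these conditional similarities and searches for a low-dimensional layout whose similarities match them.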

LargeVis

  • Constructs the KNN graph more efficiently than t-SNE does.
  • Uses a principled probabilistic model for graph visualization.
  • Let us say the input is N datapoints in a d-dimensional space.

KNN Graph Construction

  • Random projection trees are used for nearest-neighbor search in the high-dimensional space, with Euclidean distance as the distance metric.
  • The tree is constructed by recursively partitioning the entire space; the nodes of the tree correspond to subspaces.
  • For every non-leaf node of the tree, a random hyperplane is selected that splits the node's subspace into two children.
  • This is done until the number of points in a subspace falls below a threshold.
  • Once the tree is constructed, each data point is assigned to a leaf node, and the points in the subspace of that leaf node are the candidate nearest neighbors of the data point.
  • For high accuracy, a large number of trees has to be constructed, which increases the computational cost.
  • The paper counters this bottleneck with the idea that "a neighbor of my neighbor is also my neighbor" to improve the accuracy of the constructed graph.
  • Concretely: construct an approximate KNN graph using random projection trees and then, for each node, also search its neighbors' neighbors (see the sketch after this list).
  • Edges are assigned weights just as in t-SNE.
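
Below is a minimal sketch of this construction, assuming plain NumPy; the split rule (a hyperplane perpendicular to the line between two random points), `leaf_size`, `n_trees`, and the single refinement pass are illustrative choices, not the paper's tuned implementation.

```python
import numpy as np
from collections import defaultdict

def rp_tree_leaves(X, idx, leaf_size, rng):
    """Recursively split the points with random hyperplanes until each
    subspace holds at most `leaf_size` points; return the leaf index sets."""
    if len(idx) <= leaf_size:
        return [idx]
    a, b = rng.choice(idx, size=2, replace=False)
    proj = X[idx] @ (X[a] - X[b])           # random split direction
    mask = proj <= np.median(proj)
    if mask.all() or not mask.any():        # degenerate split: fall back
        mask = rng.random(len(idx)) < 0.5
    return (rp_tree_leaves(X, idx[mask], leaf_size, rng)
            + rp_tree_leaves(X, idx[~mask], leaf_size, rng))

def approximate_knn(X, k, n_trees=3, leaf_size=25, seed=0):
    rng = np.random.default_rng(seed)
    cand = defaultdict(set)
    for _ in range(n_trees):                # union of candidates over trees
        for leaf in rp_tree_leaves(X, np.arange(len(X)), leaf_size, rng):
            for i in leaf:
                cand[i].update(leaf)
    knn = {}
    for i in range(len(X)):
        pool = set(cand[i])                 # "a neighbor of my neighbor
        for j in cand[i]:                   #  is also my neighbor"
            pool |= cand[j]
        pool.discard(i)
        pool = np.fromiter(pool, dtype=int)
        dists = np.linalg.norm(X[pool] - X[i], axis=1)
        knn[i] = pool[np.argsort(dists)[:k]]
    return knn
```

The paper repeats the neighbor-exploring refinement for a few iterations; one pass is shown here for brevity.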

Probabilistic Model For Graph Visualization

  • Given a pair of vertices, the probability of observing a binary edge between them is modeled as a function of the distance between the projections of the two vertices in the low-dimensional space.
  • The probability of observing an edge with weight w is the wth power of the probability of observing a binary edge.
  • Given a weighted graph G, the log-likelihood of the graph is modeled as the log-likelihood of the observed edges plus that of the negative edges (vertex pairs without edges).
  • To keep the objective function tractable, only some negative edges are sampled for each vertex, using a noise distribution.
  • The positive edges are sampled with probability proportional to their weights and then treated as binary edges.
  • The resulting objective function can be optimized with asynchronous stochastic gradient descent, which is very effective on sparse graphs (a single-threaded sketch follows this list).
  • The overall complexity is O(sMN), where s is the dimension of the low-dimensional space, M is the number of negative samples, and N is the number of vertices.
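
Below is a single-threaded sketch of this optimization, assuming plain NumPy, with f(x) = 1 / (1 + x^2) as the edge-probability function (one of the choices discussed in the paper); `gamma`, the gradient clipping, and the fixed learning rate are illustrative stand-ins for the paper's settings.

```python
import numpy as np

def layout(edges, weights, n, dim=2, n_samples=100_000, M=5,
           gamma=7.0, lr=1.0, seed=0):
    """SGD on a LargeVis-style objective. `edges` is a list of (i, j)
    pairs, `weights` the matching edge weights, `n` the vertex count."""
    rng = np.random.default_rng(seed)
    Y = rng.normal(scale=1e-4, size=(n, dim))   # low-dimensional layout
    edge_p = np.asarray(weights, dtype=float)
    edge_p /= edge_p.sum()                      # sample edges by weight
    deg = np.bincount(np.asarray(edges).ravel(), minlength=n) ** 0.75
    noise_p = deg / deg.sum()                   # P_n(j) ~ degree^0.75
    for _ in range(n_samples):
        i, j = edges[rng.choice(len(edges), p=edge_p)]
        pairs = [(j, True)] + [(v, False)
                               for v in rng.choice(n, size=M, p=noise_p)]
        for v, positive in pairs:
            if v == i or (not positive and v == j):
                continue
            d = Y[i] - Y[v]
            d2 = float(d @ d) + 1e-8
            if positive:    # ascend log f(d): pull the pair together
                g = -2.0 * d / (1.0 + d2)
            else:           # ascend gamma * log(1 - f(d)): push apart
                g = gamma * 2.0 * d / (d2 * (1.0 + d2))
            g = np.clip(g, -5.0, 5.0)           # clip for stability
            Y[i] += lr * g
            Y[v] -= lr * g
    return Y
```

Each sampled edge costs O(sM) work, which is where the overall O(sMN) complexity comes from when the number of samples grows linearly with the number of vertices.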

Experiments

  • The data is first preprocessed with embedding-learning techniques like SkipGram and LINE to bring it down to a 100-dimensional space.
  • One limitation is precisely this preprocessing step: reducing the data to 100 dimensions seems to go against the claim of scaling to hundreds of dimensions.
  • KNN graph construction is fastest for LargeVis, followed by random projection trees, NN-Descent, and vantage-point trees (used in t-SNE).
  • LargeVis requires very few iterations to create highly accurate KNN graphs.
  • A KNN classifier is used to measure the accuracy of the graph visualization quantitatively.
  • Compared to t-SNE, LargeVis is much more stable with respect to the learning rate across datasets.
  • LargeVis benefits from its linear complexity (compared to t-SNE's log-linear complexity) and, with the default learning rate, outperforms t-SNE on larger datasets.