
SimRank - A Measure of Structural-Context Similarity: Notes

Two objects a and b are similar if they are related to similar objects. The similarity between a and b, written s(a, b), is proportional to the average similarity between the in-neighbors of a and the in-neighbors of b, and ranges from 0 to 1. Similarity scores are symmetric, and each node is maximally similar to itself, with s(a, a) = 1. Scores for pairs of nodes known a priori to be similar can also be preassigned.
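Written out, the recursive equation has the following form, where I(v) denotes the set of in-neighbors of v, I_i(a) its i-th element, and C is the decay constant discussed below:

```latex
s(a,b) =
\begin{cases}
1 & \text{if } a = b, \\
0 & \text{if } I(a) = \emptyset \text{ or } I(b) = \emptyset, \\
\dfrac{C}{|I(a)|\,|I(b)|}
  \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\bigl(I_i(a),\, I_j(b)\bigr)
  & \text{otherwise.}
\end{cases}
```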

Similarity can be thought of as propagating from pair to pair in a derived graph G^2 = (V^2, E^2), where each node in V^2 is a pair (a, b) of nodes of G, and an edge (a, b) -> (c, d) exists in E^2 iff the edges a -> c and b -> d exist in E (so a and b are in-neighbors of c and d, respectively). Since similarity is symmetric, the pairs (a, b) and (b, a) are represented by a single node {a, b}. Similarity propagates in G^2 from node to node, starting from the singleton nodes {a, a}. A constant C between 0 and 1 acts as a confidence level or decay factor, reducing similarity with each propagation step. The scores are computed by iterating the SimRank equations to a fixed point; the paper shows that a unique solution exists and that the iteration converges to it.
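A minimal sketch of this fixed-point iteration in Python, assuming the graph is given as a dict mapping each node to the list of its in-neighbors (the function name, graph representation, and toy graph are illustrative assumptions, not from the paper):

```python
from itertools import product

def simrank(in_neighbors, C=0.8, iterations=10):
    """Naive SimRank by fixed-point iteration.

    in_neighbors: dict mapping each node to a list of its in-neighbors.
    C: decay factor in (0, 1).
    Returns a dict {(a, b): similarity score}.
    """
    nodes = list(in_neighbors)
    # Start from the "identity" scores: s(a, a) = 1 and s(a, b) = 0 otherwise.
    s = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}

    for _ in range(iterations):
        s_next = {}
        for a, b in product(nodes, nodes):
            if a == b:
                s_next[(a, b)] = 1.0
                continue
            Ia, Ib = in_neighbors[a], in_neighbors[b]
            if not Ia or not Ib:
                s_next[(a, b)] = 0.0  # no in-neighbors on one side => 0
                continue
            total = sum(s[(i, j)] for i in Ia for j in Ib)
            s_next[(a, b)] = C * total / (len(Ia) * len(Ib))
        s = s_next
    return s

# Tiny illustrative graph, given by in-neighbor lists:
# Univ -> ProfA, Univ -> ProfB, ProfA -> StudentA, ProfB -> StudentB.
G_in = {
    "Univ": [],
    "ProfA": ["Univ"],
    "ProfB": ["Univ"],
    "StudentA": ["ProfA"],
    "StudentB": ["ProfB"],
}
scores = simrank(G_in)
print(round(scores[("ProfA", "ProfB")], 2))        # 0.8  (shared in-neighbor Univ)
print(round(scores[("StudentA", "StudentB")], 2))  # 0.64 (similar in-neighbors)
```

Each pass touches every pair of nodes and every pair of their in-neighbors, which is why the pruning ideas discussed later matter at scale.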

Similarity extends naturally to bipartite domains, following the idea that people are similar if they purchase similar items, and items are similar if they are purchased by similar people. Accordingly there are two coupled equations, one over out-neighbors (the items a person buys) and one over in-neighbors (the people who buy an item). Because only the fraction of purchases in common matters, not the absolute count, two pairs of items can receive the same score even if one pair is purchased far more often; absolute counts can be incorporated if desired, as described in the full version of the paper. The bipartite equations can also be applied to homogeneous domains, such as the web and citation graphs, by treating each node as both a hub and an authority, as in HITS; either score, or a combination of the two, may be used.
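The two coupled bipartite equations have the following shape, with O(.) denoting out-neighbors (items a person bought), I(.) denoting in-neighbors (people who bought an item), and C1, C2 the decay constants for each side:

```latex
s(A,B) = \frac{C_1}{|O(A)|\,|O(B)|}
         \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s\bigl(O_i(A),\, O_j(B)\bigr),
\qquad
s(c,d) = \frac{C_2}{|I(c)|\,|I(d)|}
         \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s\bigl(I_i(c),\, I_j(d)\bigr).
```

As before, s(A, A) = 1 for each person and s(c, c) = 1 for each item.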

The similarity score s(a, b) measures how soon two random surfers are expected to meet at the same node if they start at nodes a and b and walk the graph backwards (the expected meeting distance). The expected-f meeting distance with f(z) = C^z is exactly the SimRank score; see the proof in the paper.
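Roughly, the equivalence can be restated as follows (a paraphrase of the theorem, with both surfers moving in lockstep to a uniformly random in-neighbor at each step):

```latex
s(a,b) \;=\; \mathbb{E}\bigl[\, C^{\tau(a,b)} \bigr],
```

where tau(a, b) is the first time the two backward walks from a and b occupy the same node, and C^infinity is taken to be 0 for walks that never meet.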

To address scalability and efficiency, the corpus can be divided into chunks, computing similarity scores for each chunk and later combining them. When the number of nodes n is large, a node's neighborhood is a very small fraction (< 1%) of the entire domain, and nodes far from a node a, whose neighborhoods barely overlap with a's, tend to have low similarity to a. We can therefore set the similarity between far-apart nodes to 0 and compute scores only for node pairs within a radius r of each other. SimRank can also be combined with traditional textual similarity, or with other domain-specific similarity measures. Another way of representing similarity is given in the full version of the paper.
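A small sketch of the radius-r pruning step, assuming "near" is measured by undirected graph distance (that choice, and the helper name, are assumptions for illustration):

```python
from collections import deque

def pairs_within_radius(in_neighbors, r):
    """Enumerate ordered node pairs (a, b) at undirected distance <= r.

    Pairs outside this set are simply assigned similarity 0, following
    the pruning heuristic above; the undirected notion of distance is
    an illustrative choice, not prescribed by the paper.
    """
    # Build an undirected adjacency structure from the in-neighbor lists.
    adj = {}
    for v, ins in in_neighbors.items():
        adj.setdefault(v, set())
        for u in ins:
            adj.setdefault(u, set())
            adj[v].add(u)
            adj[u].add(v)

    near_pairs = set()
    for a in adj:
        # Breadth-first search out to depth r from a.
        dist = {a: 0}
        queue = deque([a])
        while queue:
            v = queue.popleft()
            if dist[v] == r:
                continue
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        near_pairs.update((a, b) for b in dist)
    return near_pairs
```

Only the returned pairs need to be iterated over; every other pair keeps a fixed score of 0.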



References
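Glen Jeh and Jennifer Widom. "SimRank: A Measure of Structural-Context Similarity." Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2002).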
