Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save wolfram77/986c77f1ad947e12a8a3076f332afacd to your computer and use it in GitHub Desktop.
Save wolfram77/986c77f1ad947e12a8a3076f332afacd to your computer and use it in GitHub Desktop.
Accelerating pairwise SimRank estimation over static and dynamic graphs : NOTES

In the paper "Accelerating pairwise SimRank estimation over static and dynamic graphs", Wang et al. motivate that sometimes people may issue a set of similarity queries, which may not be restricted to a particular source. For example, people may want to know "Which two scientists in the DB area, who have not collaborated with each other before, would likely collaborate in the future?", or "Which scientists in DB, who have collaborated with AI researchers before, would keep on collaborating with other AI scientists?"

Wang et al. note that iterative deterministic methods have two drwabacks. One is its BFS fixed-order expansion, which is not optimized for the local structure of query nodes. It may also contain unnecessary computation, which contributes little to the final SimRank score. Second, as k increases, the number of k-hop neighbors grows exponentially.

Most recent solutions for dynamic SimRank query are based on its probabilistic interpretation: s(a,b) is the expected f-meeting time of two reverse random walks from a and b. Random walk based indexes are commonly designed, such as by Jiang et al. and Shao et al., which are then used for Monte Carlo sampling. Wang et al. observe that in some cases, random walks are much more expensive than deterministic approaches. Further, they observe that sample size is the major factor that determines the efficiency and accuracy of Monte Carlo methods.

They propose three accuracy aware pairwise SimRank estimation over static and dynamic graphs, using sample reduction techniques designed specially for single-pair queries - by exploring the local structure of query nodes. Their first method, called Backward Local Push and MC sampling (BLPMC), uses BLP - a deterministic method - for pairwise SimRank estimation. BLP has unpredictable cost due to different local structures around query nodes. Thus in the worst case, performance of BLPMC would be the same as pure MC sampling methods.

Next, they present an index-based method, called Forward Local Push and MC sampling (FLPMC) which uses global residuals pushed from singleton nodes to reduce online samples. Its query efficiency is independent of query nodes. It relaxes the constraint where only two random walks that meet can contribute to s(a,b).

Finally, Wang et al. combine the two strategies, using a method called Forward and Backward Local Push and MC sampling (FBLPMC), to achieve better query efficiency. I will have to go through the details later again, but basically the index consists of an estimate SimRank vector and a residual vector, which is then using with MC sampling to obtain actual SimRank scores.

From the results, it is observed that Tian et al.'s SLING (static algorithm) outperforms BLPMC/FLPMC/FBLPMCC in terms of result quality, but BLPMC/FLPMC/FBLPMCC significantly outperform SLING in terms of query time and index update time.

Critical Review

In this paper, Wang et al. consider the original formulation of SimRank, which has been found to return unintuitive scores due to its requirement of distance symmetricity from similar source nodes. Further, they only study the case for one edge insertion or deletion (for index update) - they do not consider a batch of updates.

Related work

Tian et al. propose SLING, a static algorithm for single-pair SimRank, which has O(1/ε) query time, O(n/ε) index size, and O(m/ε + ...) index time. Li et al. iteratively aggregate the meeting probability of k-hop neighbors from a and b.

Single-pair SimRank

Fogaras et al. use a fingerprint data structure for aggregating random walks - for index building in O(|V|N) and query in O(Nt) time. A commonly used technique for linearization (by Kusumoto et al., Li et al., and Maehara et al.), is to calculate a diagonal correction matrix. However, it does not guarentee error ε worst-case error, and is hard to compute.

Dynamic Single-source/Top-k SimRank

Shao et al. propose an algorithm called TSF, which requires O(N|V|) time for index-building and O(R_q*R_g*t) for querying. Jiang et al. present another random walk method, READS, which uses a set based indexing schema, with a querying time of O(N|V|), update time of O(Nt), and an index size of O(N|V|). Liu et al. proposes ProbeSim, an index-free method which has an expected query time of O(|V|/ε log |V|/δ).

Incremental All-pairs SimRank

Li et al. present a linear system based algorithm for SimRank with SVD, and are able to deal with incremental updates of all-pairs SimRank. In two separate papers, Yu et al. formulate ΔS as a rank-one Sylvester equation and solve it iteratively. They propose pruning rules to reduce unnecessary computation by identifying an affected region. They use an approximate version of SimRank, and thus have no accuracy guarentee. Wang et al. (same first author), introduce a corrected linear system and then propose Forward Local Push (FLP) algorithm to update all-pairs SimRank.

Other variants of SimRank

Tao et al. and Zhang et al. find top-k node pairs among all pairs (SimRank based Similarity Join). Yu et al. compute partial pairs SimRank, which computes SimRank scores in Cartesian product of two node sets. Yoon et al. handle dynamic updates for Random Walk with Restart (RWR), which is another node similarity measurement.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment