I think I wrote a summary of this thesis, but I cannot find it anymore.
const fs = require('fs'); | |
const path = require('path'); | |
const readline = require('readline'); | |
// Read the header of a graph file. | |
async function readGraphHeader(pth) { | |
var fstream = fs.createReadStream(pth); |
const cp = require('child_process'); | |
const fs = require('fs'); | |
const path = require('path'); | |
const _ = require('lodash'); | |
// Main function. | |
function main() { | |
for (var file of fs.readdirSync('./')) { |
# Beware! This file is rewritten by htop when settings are changed in the interface. | |
# The parser is also very primitive, and not human-friendly. | |
fields=0 48 17 18 38 39 40 2 46 47 49 1 | |
sort_key=46 | |
sort_direction=-1 | |
tree_sort_key=0 | |
tree_sort_direction=1 | |
hide_kernel_threads=1 | |
hide_userland_threads=0 | |
shadow_other_users=0 |
const os = require('os'); | |
const fs = require('fs'); | |
// Read a file. | |
function readFile(pth) { | |
var txt = fs.readFileSync(pth, 'utf8'); | |
return txt.replace(/\r\n?/g, '\n'); |
2023-07-30 08:51:17 OMP_NUM_THREADS=40 | |
2023-07-30 08:51:17 Loading graph /home/hema/Data/indochina-2004.mtx ... | |
2023-07-30 08:51:37 order: 7414866 size: 194109311 [directed] {} | |
2023-07-30 08:51:42 order: 7414866 size: 199021693 [directed] {} (selfLoopAllVertices) | |
2023-07-30 08:51:48 order: 7414866 size: 199021693 [directed] {} (transposeWithDegree) | |
{-5.000e-05/+5.000e-05 batchf, 000/001 threads 0000ms @ 0.00e+00 none failure} -> {0051604.6/0051604.6ms, 074 iter, 4.73e-10 err, 000 crashed] pagerankBasicOmp | |
{-5.000e-05/+5.000e-05 batchf, 000/001 threads 0000ms @ 0.00e+00 none failure} -> {0048078.0/0048078.0ms, 070 iter, 5.30e-10 err, 000 crashed] pagerankBasicNaiveDynamicOmp | |
{-5.000e-05/+5.000e-05 batchf, 000/001 threads 0000ms @ 0.00e+00 none failure} -> {0053318.4/0053318.4ms, 070 iter, 5.29e-10 err, 000 crashed] pagerankBasicDynamicFrontierOmp | |
{-5.000e-05/+5.000e-05 batchf, 000/001 threads 0000ms @ 0.00e+00 none failure} -> {0066079.3/0066079.3ms, 093 iter, 5.15e-10 err, 000 crashed] pagerankBarrierfreeOmp |
In the paper "SLING: A Near-Optimal Index Structure for SimRank", Tian and Xiao present SimRank via Local updates and sampling (SLING), an efficient index structure which guarentees at most ε
additive error, and answers single pair and single source SimRank queries in O(1/ε)
and O(n/ε)
time - which is near optimal. Further, SLING requires only O(n/ε)
space and O(m/ε + n log ...)
pre-computation time.
Tian and Xiao use √c
reverse random walks, where at each step, we inspect the in-neighbors of current node and select one of them uniformly at random with a probability of √c
, or stop with a probability of 1 - √c
. Each √c
walk has an expected length of 1/(1-√c)
. They then use hitting probabilities hˡ(i,k)
and hˡ(j,k)
to compute SimRank score s(i,j)
with dₖ
- a correction factor which represents the probability that two √c
walks do not meet each other after the 0th step. They propose to pre-compute approximate versions of dₖ
and hˡ(i,k)
as the index, and use them upon query to es
In the poster "Efficient Top-K SimRank-based Similarity Join", Wenbo Tao presents an algorithm to find k
pairs of nodes with the largest SimRank values, using a random-walk based method. Tao limits random walks to five steps without loos of too much accuracy, and computes summation of one-way paths sim(a, x, l)
using dynamic programming, where a
is the source node, x
is the target node, and l
is the length of the path. Top-k SimRank-based similarity join problem can then be solved by identifying k
pairs of vectors with the largest inner product values. This can be solved with the matrix compression method by Low et al. in O(nD^ξ)
, where ξ
is the number of iterations. Their results show good performance improvement over SRJ Query by Zheng et al., which requires index building and selection of a threshold value.
While Tao explores the problem of finding top-k SimRank values across the entire graph, Tao does not consider the problem of finding top-k SimRank values within
In the paper "Accelerating pairwise SimRank estimation over static and dynamic graphs", Wang et al. motivate that sometimes people may issue a set of similarity queries, which may not be restricted to a particular source. For example, people may want to know "Which two scientists in the DB area, who have not collaborated with each other before, would likely collaborate in the future?", or "Which scientists in DB, who have collaborated with AI researchers before, would keep on collaborating with other AI scientists?"
Wang et al. note that iterative deterministic methods have two drwabacks. One is its BFS fixed-order expansion, which is not optimized for the local structure of query nodes. It may also contain unnecessary computation, which contributes little to the final SimRank score. Second, as k
increases, the number of k-hop neighbors grows exponentially.
Most recent solutions for dynamic SimRank query are based on its probabilistic interpretation: s(a,b)
is the expected f-meeting time of two revers
In the paper "A Novel and Fast SimRank Algorithm", Lu et al. avoid the approximation of SimRank with (1-c)I
, or use of diagonal matrix to compute exact SimRank (which is costly to compute). Instead they use conjugate gradient algorithm along with jacobi preconditioner - after splitting matrices into "same" (S
) and "disjoint" (D
) components:
- Preconditioned conjugate gradient method for undirected graph, and
- Preconditioned bi-conjugate gradient stabilized method for directed graph
These methods speed up the convergence rate by finding the optimal search direction in each iteration. Further, they use common-neighbor sharing technique to optimize computation.
Lu et al. compare the algorithms with Optimal In-neighbors Partitioning (OIP), Successive Over-Relaxation (SOR), and Scalable SimRank Join (SSJ, stochastic) algorithms. They use a decay factor of c = 0.8
(although 0.6
is more common in literature), and an error bound of ε = 0.001
. In SOR, they select different optimal w
for each dat