Skip to content

Instantly share code, notes, and snippets.

@branliu0
Last active December 28, 2015 01:39
Show Gist options
  • Save branliu0/7421918 to your computer and use it in GitHub Desktop.
Save branliu0/7421918 to your computer and use it in GitHub Desktop.

CS 285: Final Project Proposals

by Brandon Liu November 12, 2013

Proposal: Distributed Hitting Time

(a) Motivation

One part of the course discussed trust, reputation, and sybil-proofness. My independent thesis research has been on this topic, and much of my research has been examining hitting time, which is a transitive trust mechanism and ranking system similar to PageRank [2] that has some sybil-proofness properties.

There are at least two ways in which malicious agents can attempt to manipulate PageRank with sybils: (1) adding sybils that capture the uniform restart probability, and direct it back to the malicious node; and (2) forming many cycles of length 2, which increases the weight of the stationary distribution for the malicious node.

Hitting time automatically solves the issue of (2) by ignoring the outgoing edges of a node when computing its hitting time score. Hitting time can also deal with (1) by specifying a trusted set of nodes that are known not to be sybils to compute the trust scores. By directing restart probability only to nodes in the trust set, this ensures that sybils do not capture any restart probability.

Hitting time can be measured in many ways. The standard definition of the hitting time from node i to node j is the expected number of steps a random walk takes starting from node i and ending at node j. An alternate definition used by Sheldon and Hopcroft [1] adds a restart probability, similar to that used in the PageRank algorithm, and defines a global hitting time score for a node j as the probability that a random walk starting from some starting distribution over the nodes will hit the node j before restarting.

Existing computational methods for hitting time work for small graphs, but there does not appear to be much research into the computation of hitting time in a large-scale web graph with hundreds of thousands of nodes. I propose a novel distributed computational method for estimating the Sheldon and Hopcroft definition of hitting time that makes use of distributed computational capacity in a network of machines.

This is based off an existing Monte Carlo method that I developed for my thesis. This distributed method extends that method to a parallel, distributed, and high-scale setting.

This distributed algorithm relates to our study of multi-agent systems because it involves distributed computational ability and distributed information (no node in the network needs to know the entire graph).

I am not outlining the distributed algorithm I have in mind here, but I have already sketched it out a bit on paper and in my mind. I already have a correct implementation of the Monte Carlo method mentioned above, so if I want to do a comparative runtime analysis, at least that part of the code is already written.

(b) Research Questions

  1. Conduct a brief literature review on computational methods for hitting time and computational methods on random walks in a distributed setting.

  2. Provide theoretical bounds for the algorithm: big-O analysis of overall runtime, big-O memory usage by each machine, big-O analysis of message size and communication, and expected number of simulated random walks needed for a certain level of convergence.

  3. Study the utility of probabilistic data structures and algorithms, such as Bloom filters, to improve runtime at some cost of accuracy. For example, in one step of the algorithm, nodes will need to keep track of which random walks have already passed through it. If we are simulating a large number of random walks (in the millions or billions), then it could be expensive to store all the IDs of those random walks. A Bloom filter could be used to probabilistically determine whether a random walk has already passed by.

  4. Use computational simulations to study (1) the runtime gain of using the distributed algorithm over a normal algorithm; and (2) the tradeoff of using probabilistic data structures

  5. Possibly generalize the framework for a more general study of random walks in a distributed setting. Or generalize it to a distributed method for computing the stationary distribution of a large Markov chain.

(c) Related Work

[1] J. Hopcroft and D. Sheldon, “Manipulation-resistant reputations using hitting time,” pp. 68–81, 2007.

Sheldon and Hopcroft provide a theoretical study of hitting time, its resistance to manipulation, and some basic algorithms for computing hitting time. There are no computational or empirical results.

[2] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRank citation ranking: bringing order to the web.,” 1999.

Classic PageRank paper.

[3] S. D. Kamvar, M. T. Schlosser, and H. Garcia-Molina, “The eigentrust algorithm for reputation management in p2p networks,” pp. 640–651, 2003.

Discusses a distributed algorithm to compute reputation scores for P2P networks of a kind similar to PageRank.

(d) Questions for Feedback

I also realize that it is possible that this proposal may be a bit ambitious for the final project. However, I can do a subset of the research questions I posed above for the final project, and continue the remainder for my thesis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment