Random Walk Generation on a Directed Graph with PySpark
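def generate_random_walks(page_ids, adjacency_list, num_walks=10, len_walks=20):
    """
    Convenience method to generate num_walks random walks for every page id.
    :param page_ids: an RDD of page ids for which the random walks should be generated.
    :param adjacency_list: a pair RDD with tuples of the form (page_id, [list(id)]).
    :param num_walks: optional. The number of walks to generate for each page id.
    :param len_walks: optional. The maximum length of each walk.
    :return: an RDD of random walks, each a list of page ids
    """
    # Each walker is a pair (current_page_id, walk_so_far). Build a fresh list
    # per walker so that no two walkers share the same list object.
    walkers = page_ids.flatMap(
        lambda page_id: [(page_id, [page_id]) for _ in range(num_walks)])
    for _ in range(len_walks - 1):
        # The join yields (page_id, (walk, neighbors)); neighbors is None for
        # pages without an adjacency entry. random_append is not defined in
        # this snippet; it is expected to return (new_current_page_id, walk).
        walkers = walkers \
            .leftOuterJoin(adjacency_list) \
            .map(random_append) \
            .coalesce(200)  # keep the result at 200 partitions or fewer
    # Drop the key and keep only the walks.
    return walkers.map(lambda x: x[1])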
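
The function above calls `random_append`, which is not defined in this snippet. A minimal sketch of what it might look like follows, assuming each joined record has the shape `(page_id, (walk, neighbors))` and that a walk simply stops growing at a page with no outgoing links. The body, the dead-end handling, and the optional `rng` parameter are assumptions for illustration, not the original author's helper.

```python
import random


def random_append(record, rng=random):
    """Hypothetical helper: extend one walk by a randomly chosen neighbor.

    record is (page_id, (walk, neighbors)) as produced by leftOuterJoin;
    neighbors is None when page_id has no entry in the adjacency list.
    Returns (new_current_page_id, walk) so the result can be joined again.
    """
    page_id, (walk, neighbors) = record
    if not neighbors:
        # Dead end (assumption): keep the walk as it is, keyed by the same page.
        return page_id, walk
    next_id = rng.choice(neighbors)
    # Build a new list instead of mutating the walk in place.
    return next_id, walk + [next_id]


# With a single neighbor the step is deterministic.
print(random_append((1, ([1], [2]))))  # → (2, [1, 2])
```

Because a dead-end walker keeps its key, it passes through the remaining join iterations unchanged, which matches the docstring's description of `len_walks` as a maximum length.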