Random Walk Generation on a Directed Graph with PySpark
def generate_random_walks(page_ids, adjacency_list, num_walks=10, len_walks=20):
    """
    Convenience method to generate num_walks random walks for each page id.
    Nothing is written to disk; the walks are returned as an RDD.
    :param page_ids: an RDD of page ids for which the random walks should be generated.
    :param adjacency_list: a simple RDD with tuples of the form (page_id, [list(id)]).
    :param num_walks: optional. The number of walks to generate for each page id.
    :param len_walks: optional. The maximum length of each walk.
    :return: an RDD of random walks, each a list of page ids
    """
    # One (current_page_id, walk_so_far) pair per walker. The comprehension gives
    # every walker its own list, so walks never share a list object.
    walkers = page_ids.flatMap(
        lambda page_id: [(page_id, [page_id]) for _ in range(num_walks)])
    for _ in range(len_walks - 1):
        # The left outer join keeps walkers whose current page has no adjacency
        # entry; random_append then receives None as the neighbor list.
        walkers = walkers \
            .leftOuterJoin(adjacency_list) \
            .map(random_append) \
            .coalesce(200)
    # Drop the current-page key and keep only the walks.
    return walkers.map(lambda x: x[1])
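
`random_append` is called in the loop but is not defined in this gist. A minimal sketch of what it might look like, assuming each joined record has the shape `(current_page_id, (walk_so_far, neighbors_or_None))`, that the next page is picked uniformly at random, and that a walk simply stops growing at a dead end (all three are assumptions, not the author's confirmed implementation):

```python
import random


def random_append(record):
    """Extend one walk by a single uniformly chosen neighbor (hypothetical helper).

    `record` is assumed to be one element of walkers.leftOuterJoin(adjacency_list):
    (current_page_id, (walk_so_far, neighbors_or_None)).
    """
    current_id, (walk, neighbors) = record
    if not neighbors:
        # Dead end (no adjacency entry, or an empty neighbor list):
        # keep the walk unchanged, so it ends up shorter than len_walks.
        return current_id, walk
    next_id = random.choice(neighbors)
    # Build a new list instead of mutating `walk` in place.
    return next_id, walk + [next_id]
```

Returning the pair keyed by the newly chosen page is what lets the next `leftOuterJoin` look up that page's neighbors.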