Created
May 17, 2017 08:23
Random Walk Generation on a Directed Graph with PySpark
def generate_random_walks(page_ids, adjacency_list, num_walks=10, len_walks=20):
    """
    Convenience method to generate num_walks random walks for each page id.
    :param page_ids: an RDD of page ids for which the random walks should be generated.
    :param adjacency_list: an RDD of tuples of the form (page_id, [neighbor_id, ...]).
    :param num_walks: optional. The number of walks to generate for each page id.
    :param len_walks: optional. The maximum length of each walk.
    :return: an RDD of random walks, each a list of page ids
    """
    # Key each walker by its current page so the join below finds that page's neighbors.
    # A comprehension gives every walker its own list instead of num_walks shared references.
    walkers = page_ids.flatMap(lambda page_id: [(page_id, [page_id]) for _ in range(num_walks)])
    for _ in range(len_walks - 1):
        # random_append is not defined in this snippet; it is expected to map
        # (page_id, (walk, neighbors)) to (next_page_id, extended_walk).
        walkers = walkers \
            .leftOuterJoin(adjacency_list) \
            .map(random_append) \
            .coalesce(200)
    # Drop the join key and keep only the walks.
    return walkers.map(lambda x: x[1])
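
The function above calls `random_append`, which this gist does not define. The following is a minimal sketch of what such a helper might look like, not the author's implementation. It assumes that each joined record has the shape `(page_id, (walk, neighbors))`, that `neighbors` is `None` when a page has no entry in the adjacency list (as `leftOuterJoin` produces), and that the next page is chosen uniformly at random.

```python
import random


def random_append(joined):
    """Hypothetical helper: extend a walk by one uniformly chosen neighbor.

    `joined` is one record produced by leftOuterJoin:
    (current_page_id, (walk, neighbors)), where neighbors is None
    when the current page has no entry in the adjacency list.
    """
    page_id, (walk, neighbors) = joined
    if not neighbors:
        # Dead end: keep the walk unchanged and stay keyed on the same page.
        return page_id, walk
    next_id = random.choice(list(neighbors))
    # Build a new list instead of mutating, so walks never share state.
    return next_id, walk + [next_id]
```

Under these assumptions, a walker that reaches a page without outgoing links simply stops growing, which matches `len_walks` being documented as a maximum length. With the helper in scope and a SparkContext `sc`, a call such as `generate_random_walks(sc.parallelize([1, 2]), sc.parallelize([(1, [2]), (2, [1])]))` should return an RDD of walks.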