Filter GitHub URLs returned from deps2repos for top Python packages
"""Filter deps2repos output for top Python package GitHub URLs"""
input_filename = "pypi_repo_deps2repos_output.txt"
output_filename = "deps2repos_output_post_filtering.txt"
# read in output of deps2repos
with open(input_filename, "r") as file:
    # only start collecting data on the 79th line because
    # the first 78 lines are about packages lacking GitHub
    # repos
    lines = file.readlines()[78:]
# collect all repos into a list and strip new lines
repos = []
for line in lines:
    repos.append(line.strip("\n"))
# only capture unique GitHub URLs
unique_repos = set(repos)
# clean a dirty link
dirty_link = "https://github.com/Edinburgh-Genome-Foundry/Proglog>`_"
clean_link = "https://github.com/Edinburgh-Genome-Foundry/Proglog"
unique_repos = [clean_link if x == dirty_link else x for x in unique_repos]
# write the unique repos out
with open(output_filename, "w") as file:
    file.writelines(line + "\n" for line in unique_repos)
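
The script above repairs one known dirty link by exact match. If more links carry similar trailing debris, a pattern-based cleanup might generalize better. The following is only a sketch: the helper names `clean_github_url` and `unique_clean_repos` and the regular expression are hypothetical, not part of the original script, and they assume every wanted URL has the form `https://github.com/<owner>/<repo>`. Unlike the script, the sketch also sorts its result and drops blank lines.

```python
import re

# Assumption: a valid repo URL looks like https://github.com/<owner>/<repo>,
# where owner and repo contain only letters, digits, "-", "_" and ".".
GITHUB_REPO_RE = re.compile(r"https://github\.com/[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+")


def clean_github_url(url):
    """Return the leading owner/repo URL, dropping any trailing debris."""
    match = GITHUB_REPO_RE.match(url.strip())
    # Leave the line unchanged (apart from whitespace) if it does not match.
    return match.group(0) if match else url.strip()


def unique_clean_repos(lines):
    """Clean every line, drop blanks and duplicates, and sort the result."""
    return sorted({clean_github_url(line) for line in lines if line.strip()})
```

Cleaning before deduplicating means a dirty and a clean copy of the same URL collapse into one entry, and sorting makes the output file reproducible from run to run, which iterating over a plain `set` does not guarantee.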