@jspeed-meyers
Last active August 7, 2021 16:59
A script that keeps only those Anaconda packages with a GitHub link
"""Keep only packages with a GitHub link.
Take as input a .csv file with a field called cleaned_link, then
output only those values that include https://github.com.
The output is a .csv file with a single cleaned_link column,
one GitHub link per row.
"""
import time
import pandas as pd
DATETIME_STAMP = time.strftime("%Y%m%d-%H%M%S")
INPUT_FILENAME = "../data/cleaned_packages_name_links_py3.9_linux-64.csv"
OUTPUT_FILENAME = "only_packages_with_github_links_" + DATETIME_STAMP + ".csv"
# open input file and store in pandas dataframe
df = pd.read_csv(INPUT_FILENAME, header=0, index_col=False)
# filter in only rows with a link to a GitHub repo
# regex=False matches the URL literally; by default "." would match any character
df_filtered = df[
    df["cleaned_link"].str.contains("https://github.com", na=False, regex=False)
]
# export filtered dataset to csv file
df_filtered.to_csv(OUTPUT_FILENAME, columns=["cleaned_link"], index=False)
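If a plain-text file with one link per line and no header row is wanted instead of a .csv, the same filter can be sketched with only the standard library. This is a minimal sketch, not the gist author's method: the helper names `extract_github_links` and `write_links` and the sample rows are hypothetical, and it assumes the input has a `cleaned_link` column as in the script above.

```python
import csv
import io


def extract_github_links(csv_file, column="cleaned_link", prefix="https://github.com"):
    """Return the values of `column` that contain `prefix`, in file order.

    Hypothetical helper: rows with an empty or missing value are skipped.
    """
    reader = csv.DictReader(csv_file)
    return [row[column] for row in reader if prefix in (row.get(column) or "")]


def write_links(links, txt_file):
    """Write one link per line, with no header row (hypothetical helper)."""
    for link in links:
        txt_file.write(link + "\n")


# Tiny in-memory demonstration with made-up rows.
sample = io.StringIO(
    "name,cleaned_link\n"
    "pkg_a,https://github.com/example/pkg_a\n"
    "pkg_b,https://gitlab.com/example/pkg_b\n"
    "pkg_c,\n"
)
links = extract_github_links(sample)
out = io.StringIO()
write_links(links, out)
```

With real files, the same two helpers could be called on handles from `open(INPUT_FILENAME, newline="")` and `open("links.txt", "w")`, where `links.txt` is a placeholder name.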