Created October 1, 2023 03:39
Code that clusters the dirty journal name property of an arXiv citation graph to create clean journal names as labels for classification
import pandas as pd
from sentence_transformers import SentenceTransformer

#
# Create a pd.DataFrame of the nodes for analysis in a notebook
#

# Extract nodes and their attributes into a list of dictionaries.
# G is assumed to be the citation graph (e.g. a networkx graph) built earlier; it is not defined here.
node_data = [{"node": node, **attr} for node, attr in G.nodes(data=True)]

# Convert the list of dictionaries into a DataFrame
node_df = pd.DataFrame(node_data)

# Embed the dirty Journal-ref and cluster it to produce labels.
model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")
for column in [
    "Journal-ref",
]:  # "Title", "Abstract"]:
    # Nodes without this attribute become NaN in the DataFrame; encode() expects strings.
    texts = node_df[column].fillna("").astype(str).tolist()
    embeddings = model.encode(texts)
    # Store one embedding (a list of floats) per row
    node_df[f"{column}Embedding"] = embeddings.tolist()
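The snippet stops after the embedding step, so the clustering that turns embeddings into clean journal labels is not shown. The following is a minimal sketch of one way that step could look, not the author's method. It assumes a simple greedy cosine-similarity clustering, and the names `greedy_cluster`, `clean_names`, and the `threshold` value are hypothetical. Toy 2-D vectors stand in for the real sentence embeddings so the example runs without any model.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_cluster(embeddings, threshold=0.8):
    """Hypothetical helper: assign each vector to the first cluster whose
    seed vector is at least `threshold` similar, else start a new cluster."""
    seeds = []   # first vector seen for each cluster
    labels = []  # cluster id for each input vector
    for vec in embeddings:
        for cluster_id, seed in enumerate(seeds):
            if cosine_similarity(vec, seed) >= threshold:
                labels.append(cluster_id)
                break
        else:
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


def clean_names(raw_names, labels):
    """Hypothetical helper: map each cluster id to its most common raw name."""
    counts_by_cluster = {}
    for name, label in zip(raw_names, labels):
        counts_by_cluster.setdefault(label, Counter())[name] += 1
    return {
        label: counts.most_common(1)[0][0]
        for label, counts in counts_by_cluster.items()
    }


# Toy data: made-up journal strings and made-up 2-D "embeddings".
raw_names = ["Phys. Rev. D", "Phys.Rev. D", "Phys. Rev. D", "Nucl. Phys. B"]
toy_embeddings = [[1.0, 0.0], [0.98, 0.05], [1.0, 0.0], [0.0, 1.0]]

labels = greedy_cluster(toy_embeddings)
label_names = clean_names(raw_names, labels)
cleaned = [label_names[label] for label in labels]
```

With the DataFrame above, the inputs would be `node_df["Journal-ref"].tolist()` and `node_df["Journal-refEmbedding"].tolist()`. Greedy assignment depends on row order and on the chosen threshold, so a library clustering algorithm may be a better fit for a large graph.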