@cj2001
Created February 9, 2021 23:36
Clean arXiv author and category lists
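def get_author_list(line):
    # Cleans author dataframe column, creating a list of authors in the row.
    # Each entry is assumed to start with [last name, first name], so the
    # result reads "first last".
    return [e[1] + ' ' + e[0] for e in line]


def get_category_list(line):
    # Cleans category dataframe column, creating a list of categories in the row.
    # Categories are stored as a single space-separated string.
    return line.split(" ")

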
# Assumes df is a pandas DataFrame already loaded with the arXiv metadata columns.
df['cleaned_authors_list'] = df['authors_parsed'].map(get_author_list)
df['category_list'] = df['categories'].map(get_category_list)

# Drop the raw columns that are no longer needed after cleaning.
df = df.drop(['submitter', 'authors',
              'comments', 'journal-ref',
              'doi', 'report-no', 'license',
              'versions', 'update_date',
              'abstract', 'authors_parsed',
              'categories'], axis=1)
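
To see what the two helpers return for a single row, here is a minimal sketch on plain Python values, without a dataframe. The sample author entries and category string are made up for illustration; they assume that `authors_parsed` holds `[last name, first name, suffix]` lists and that `categories` is a space-separated string, which is what the functions above expect.

```python
def get_author_list(line):
    # Same logic as above: reorder each entry to "first last".
    return [e[1] + ' ' + e[0] for e in line]


def get_category_list(line):
    # Same logic as above: split the space-separated category string.
    return line.split(" ")


# Hypothetical sample row values (not taken from the real dataset).
sample_authors = [["Doe", "J.", ""], ["Smith", "A. B.", ""]]
sample_categories = "math.CO cs.CG"

authors = get_author_list(sample_authors)
categories = get_category_list(sample_categories)

print(authors)     # ['J. Doe', 'A. B. Smith']
print(categories)  # ['math.CO', 'cs.CG']
```

Applied with `.map`, each function runs once per row, so every cell in the new columns holds a list like the ones printed here.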