@formigone
Created December 3, 2019 17:37
Simple Python script that scrapes the first 200 abstracts (with author list and arXiv ID) from arxiv.org for a given query string and renders the list as Markdown.
'''
The purpose of this script is to quickly collect a large number of abstracts
from arxiv.org, to simplify scanning many documents on a single topic
before committing the time to an in-depth reading of any particular paper.
@author formigone
'''
import re

import requests
from bs4 import BeautifulSoup
from IPython.display import Markdown


def fetch(query):
    # `query` must already be URL-encoded (e.g. 'deep+learning').
    args = [
        'query={}'.format(query),
        'searchtype=all',
        'source=header',
        'size=200',  # ask for the first 200 results
    ]
    url = 'https://arxiv.org/search/?' + '&'.join(args)
    res = requests.get(url).text
    soup = BeautifulSoup(res, 'html.parser')

    doc = []
    for row in soup.find_all(class_='arxiv-result'):
        doc.append('## ' + row.find(class_='title').text.strip())
        authors = [auth.text.strip() for auth in row.find(class_='authors').find_all('a')]
        doc_id = row.find(class_='list-title').find('a').text
        doc.append('[{} | {}]'.format(', '.join(authors), doc_id))
        # Drop the "△ Less" toggle text that the page appends to the full abstract.
        abstract = row.find(class_='abstract-full').text.strip().replace('△ Less', '').strip()
        # Collapse whitespace runs; re.sub returns a new string, so keep the result.
        abstract = re.sub(r'\s\s+', ' ', abstract)
        doc.append('\n{}\n'.format(abstract))
    return '\n'.join(doc)


# In a Jupyter notebook, this renders the collected abstracts as Markdown.
Markdown(fetch('deep+learning'))
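
The script joins the query string into the URL as-is, so the caller has to pass it pre-encoded ('deep+learning' rather than 'deep learning'). A minimal sketch of building the same search URL from free text using only the standard library; `build_search_url` is a hypothetical helper name, not part of the original gist, and the parameter set is copied from the script above:

```python
from urllib.parse import urlencode

# Base URL taken from the script above.
SEARCH_URL = 'https://arxiv.org/search/?'


def build_search_url(query, size=200):
    """Hypothetical helper: build the arXiv search URL from a free-text query.

    urlencode() percent-encodes each value and turns spaces into '+',
    so callers can pass 'deep learning' instead of 'deep+learning'.
    """
    params = {
        'query': query,
        'searchtype': 'all',
        'source': 'header',
        'size': size,
    }
    return SEARCH_URL + urlencode(params)


print(build_search_url('deep learning'))
```

With a helper like this, `fetch` could call `requests.get(build_search_url(query))` and accept queries containing spaces or characters such as `&` without the caller encoding them by hand.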