@evanmiltenburg
Last active January 5, 2022 09:35
Download OHLDM
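import re
import time

import requests

HEADERS = {'User-agent': 'Mozilla/5.0'}

# Fetch the book's landing page, which links to the chapter PDFs.
r = requests.get('https://direct.mit.edu/books/book/5244/The-Open-Handbook-of-Linguistic-Data-Management',
                 stream=True, headers=HEADERS)
r.raise_for_status()

# Collect hrefs ending in ".pdf". The dot is escaped so it matches literally,
# and [^"] keeps a match from running past the closing quote of the attribute.
urls = re.findall(r'href="([^"]*\.pdf)"', r.text)
base = 'https://direct.mit.edu'
urls = [base + path for path in urls if '/book/' in path]

for i, url in enumerate(urls):
    # Request the current URL (urls[0] would save the first PDF every time).
    r = requests.get(url, stream=True, headers=HEADERS)
    with open(f'{i}.pdf', 'wb') as f:
        f.write(r.content)
    # Be nice to the server:
    time.sleep(2)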
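
The link-extraction step can be checked offline by pulling it into a small function and running it on a sample string. This is a minimal sketch, not part of the gist: the function name `extract_pdf_urls` and the sample HTML paths are hypothetical, and the de-duplication of repeated links is an addition that the script above does not do.

```python
import re
from urllib.parse import urljoin


def extract_pdf_urls(html, base='https://direct.mit.edu'):
    """Return absolute URLs for hrefs in `html` that end in .pdf and contain '/book/'."""
    # [^"]+ keeps each match inside a single href attribute.
    paths = re.findall(r'href="([^"]+\.pdf)"', html)
    urls = []
    for path in paths:
        url = urljoin(base, path)
        # Same '/book/' filter as the script; skip links already collected.
        if '/book/' in path and url not in urls:
            urls.append(url)
    return urls


# Hypothetical sample markup, made up for illustration only.
sample = '''
<a href="/books/book/1/chapter-pdf/100/c000100.pdf">Chapter 1</a>
<a href="/books/book/1/chapter-pdf/100/c000100.pdf">Chapter 1 (again)</a>
<a href="/about">About</a>
<a href="/static/flyer.pdf">Flyer</a>
'''

print(extract_pdf_urls(sample))
```

Keeping the parsing separate from the downloading means the regular expression can be adjusted without sending any requests to the server.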