@yjzhang
Created October 4, 2022 21:53
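# Download the .xml.gz files linked from the PubMed baseline listing:
# https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
import requests
from selectolax.parser import HTMLParser

base_url = 'https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/'

# Fetch the directory listing and parse its HTML.
r = requests.get(base_url)
r.raise_for_status()
tree = HTMLParser(r.content)

for node in tree.css('a'):
    # Substring test: this also matches names such as '*.xml.gz.md5'.
    if '.xml.gz' in node.text():
        url = base_url + node.attributes['href']
        print(url)
        req = requests.get(url)
        req.raise_for_status()
        # Save under the link text, i.e. the file name shown in the listing.
        with open(node.text(), 'wb') as f:
            f.write(req.content)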
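
The link-selection step in the script above can be tried without touching the network. The following is a minimal sketch, not the gist author's method: it swaps selectolax for the standard library's `html.parser` so it runs with no extra packages, filters on the `href` ending in `.xml.gz` rather than on a substring of the link text, and joins URLs with `urljoin`. The names `LinkCollector` and `archive_urls` and the sample listing are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def archive_urls(html, base_url):
    """Return absolute URLs for links whose href ends with '.xml.gz'."""
    collector = LinkCollector()
    collector.feed(html)
    return [
        urljoin(base_url, href)
        for href in collector.hrefs
        if href.endswith('.xml.gz')
    ]


# Made-up listing for illustration; not copied from the real server.
sample = (
    '<a href="../">Parent</a>'
    '<a href="sample0001.xml.gz">sample0001.xml.gz</a>'
    '<a href="sample0001.xml.gz.md5">sample0001.xml.gz.md5</a>'
)
urls = archive_urls(sample, 'https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/')
print(urls)
```

Under these assumptions only the archive link survives the filter; the parent-directory link and the `.md5` entry are skipped. The resulting list could then be fed to a download loop like the one in the script.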