@BrambleXu
Last active November 18, 2019 23:51
A simple crawler example to build a course material downloader
import os
import requests
from lxml import etree
import wget
# prepare
download_directory = 'slides/'
os.makedirs(download_directory, exist_ok=True)  # make sure the target directory exists
url = 'http://inst.eecs.berkeley.edu/~cs61a/fa18/'
# make request
r = requests.get(url)
r.raise_for_status()  # stop early if the page could not be fetched
html = etree.HTML(r.text)
# extract links
slide_links = html.xpath('//li/a[text()="8pp"]/@href')
slide_links = list(set(slide_links)) # remove the duplicated links
print(len(slide_links))
# download
for slide in slide_links:
    print(slide)
    download_link = url + slide  # complete download link
    file_name = os.path.basename(slide)
    download_path = download_directory + file_name  # local path to save the file
    wget.download(download_link, download_path)
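
The link-extraction and path-building steps above can be tried offline before any download is attempted. The following is a minimal sketch, not the gist author's method: it swaps `lxml` and XPath for the standard library's `html.parser`, and resolves links with `urllib.parse.urljoin` instead of string concatenation. The names `LinkTextCollector` and `plan_downloads`, and the sample HTML with its file names, are hypothetical and used only for illustration. Unlike the XPath `//li/a[text()="8pp"]`, this sketch matches any `<a>` element regardless of its parent and strips surrounding whitespace from the link text.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkTextCollector(HTMLParser):
    """Collect href values of <a> tags whose text equals a target label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            if "".join(self._text).strip() == self.label:
                self.links.append(self._href)
            self._href = None


def plan_downloads(page_html, base_url, label="8pp", directory="slides"):
    """Return (download_url, local_path) pairs without touching the network."""
    parser = LinkTextCollector(label)
    parser.feed(page_html)
    unique = list(dict.fromkeys(parser.links))  # drop duplicates, keep order
    return [
        (urljoin(base_url, href), os.path.join(directory, os.path.basename(href)))
        for href in unique
    ]


if __name__ == "__main__":
    # Made-up sample page; the real course page layout may differ.
    sample = (
        '<ul><li><a href="assets/slides/01.pdf">8pp</a></li>'
        '<li><a href="assets/slides/01.pdf">8pp</a></li>'
        '<li><a href="assets/slides/02.pdf">1pp</a></li></ul>'
    )
    for link, path in plan_downloads(sample, "http://inst.eecs.berkeley.edu/~cs61a/fa18/"):
        print(link, "->", path)
```

Each returned pair could then be passed to `wget.download(link, path)` as in the loop above. Using `urljoin` keeps the result correct when a link is absolute or the base URL lacks a trailing slash, which plain `url + slide` would not.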