@spyros12
Forked from AO8/crawler.py
Created August 30, 2020 15:18
Crawl a website and gather all internal links with Python and BeautifulSoup.
# Adapted from example in Ch.3 of "Web Scraping With Python, Second Edition" by Ryan Mitchell
import re

import requests
from bs4 import BeautifulSoup

pages = set()

def get_links(page_url):
    global pages
    # Internal links are hrefs that begin with "/"
    pattern = re.compile("^(/)")
    # Replace your_URL with the target site's base URL, e.g. "https://example.com"
    html = requests.get(f"your_URL{page_url}").text  # f-strings require Python 3.6+
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                # New internal page: record it, print it, and crawl it recursively
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

get_links("")