
@AO8
Last active May 23, 2023 09:12
Crawl a website and gather all internal links with Python and BeautifulSoup.
# Adapted from example in Ch. 3 of "Web Scraping with Python, Second Edition" by Ryan Mitchell
import re

import requests
from bs4 import BeautifulSoup

pages = set()  # all internal paths discovered so far


def get_links(page_url):
    global pages
    # Match only hrefs that start with "/", i.e. links internal to the site
    pattern = re.compile("^(/)")
    html = requests.get(f"your_URL{page_url}").text  # f-strings require Python 3.6+
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                # New internal page: record it, print it, then crawl it recursively
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)


get_links("")
@amirmohammadrazmy

Perfect project!

@AnthonyForan

Hi,
Thanks for the solution. I'm having similar issues to @spyros12, but I can't seem to implement his solution in a way that works.
Has anyone been able to solve this?
Thanks!
