Crawl a website and gather all internal links with Python and BeautifulSoup.
# Adapted from an example in Ch. 3 of "Web Scraping with Python, Second Edition" by Ryan Mitchell
import re

import requests
from bs4 import BeautifulSoup

pages = set()  # all internal links discovered so far

def get_links(page_url):
    """Recursively follow internal links (hrefs that start with "/")."""
    global pages
    pattern = re.compile("^(/)")  # internal links begin with a slash
    html = requests.get(f"your_URL{page_url}").text  # f-strings require Python 3.6+; replace your_URL with the site's base URL
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)  # recurse into the newly discovered page

get_links("")  # start the crawl at the site's home page
Hi,
Thanks for the solution. I'm having similar issues to @spyros12; however, I can't seem to implement his solution in a way that works.
Has anyone been able to solve this?
Thanks!