Crawl a website and gather all internal links with Python and BeautifulSoup.
# Adapted from example in Ch.3 of "Web Scraping With Python, Second Edition" by Ryan Mitchell
import re

import requests
from bs4 import BeautifulSoup

pages = set()  # every internal path discovered so far

def get_links(page_url):
    global pages
    pattern = re.compile("^(/)")  # internal links are relative paths starting with "/"
    html = requests.get(f"your_URL{page_url}").text  # f-strings require Python 3.6+
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                # A page we haven't seen yet: record it, then crawl it in turn
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

get_links("")
Perfect project!
Hi,
Thanks for the solution. I'm having issues similar to @spyros12's, but I can't seem to implement his solution in a way that works.
Has anyone been able to solve this?
Thanks!
Hi! Really excellent code, very handy, a life saver! I've been trying to make something similar using Scrapy, but had issues with Rules and other things. I feel Scrapy would be faster and do more (like scraping the links of links and tabularising/organising them). Anyway, I'm running your code, adapted for a particular website, but it's slow. Do you know why? Have you managed to make it faster? It's very clean and accurate, and gets exactly the links one has in mind, but it's slow.
I adapted it so it gets the links from each "next page" in a list (page1 ... page 30) using:
I then applied that to the code found here: http://www.learningaboutelectronics.com/Articles/How-to-find-all-hyperlinks-on-a-web-page-in-Python-using-BeautifulSoup.php
and it's much faster (although it scrapes all of each page's content links). Still, it's very fast compared to your link-crawling code; I feel your code should have been just as fast.
Still, it got me started, and it's very, very handy. A life saver!
I'd been stuck looking for a Scrapy solution to this and never managed it. I would still be stuck on Scrapy if I hadn't found your code (so thank you!) :)
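On the slowness question: each requests.get call in the gist opens a fresh HTTP connection, and pages are fetched strictly one at a time, so most of the wall-clock time goes to connection setup and network latency rather than parsing. A hedged sketch of one common speedup, reusing a single requests.Session so connections are pooled and kept alive (the your_URL placeholder is kept from the gist):

import re

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # pooled, keep-alive connections reused across requests
pages = set()

def get_links(page_url):
    pattern = re.compile("^(/)")
    html = session.get(f"your_URL{page_url}").text  # placeholder URL as in the gist
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        href = link.attrs["href"]
        if href not in pages:
            print(href)
            pages.add(href)
            get_links(href)

get_links("")

For bigger gains, concurrent fetching (e.g. concurrent.futures or an async client) would help, but a shared Session is the smallest change to the original code.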