@AO8
Last active May 23, 2023 09:12
Crawl a website and gather all internal links with Python and BeautifulSoup.
# Adapted from an example in Ch. 3 of "Web Scraping with Python, Second Edition" by Ryan Mitchell
import re

import requests
from bs4 import BeautifulSoup

pages = set()  # internal links found so far, so each page is only crawled once

def get_links(page_url):
    global pages
    pattern = re.compile("^(/)")  # match only site-relative (internal) hrefs
    html = requests.get(f"your_URL{page_url}").text  # f-strings require Python 3.6+
    soup = BeautifulSoup(html, "html.parser")
    for link in soup.find_all("a", href=pattern):
        if "href" in link.attrs:
            if link.attrs["href"] not in pages:
                new_page = link.attrs["href"]
                print(new_page)
                pages.add(new_page)
                get_links(new_page)  # recurse into the newly discovered page

get_links("")  # start crawling from the site root
@spyros12

Hi! Really excellent code, very handy, a life saver! I've been trying to make something similar using Scrapy, but had issues with Rules and the other pieces; I feel Scrapy would be faster and do more (like scraping the links of links and tabulating/organising them). Anyway, I'm running your code, adapted for a particular website, but it's slow. Do you know why? Have you managed to make it faster? It's very clean and accurate (it gets exactly the links one has in mind), but it's slow.

I adapted it so it gets the links from each "next page" in a list (page 1 ... page 30) using:

pagesx = list(range(33))
for ii in pagesx:
    getpage = requests.get(f"THEWEBPAGEISWASSCRAPINGFROM={ii}")

I then applied this to the code found here: http://www.learningaboutelectronics.com/Articles/How-to-find-all-hyperlinks-on-a-web-page-in-Python-using-BeautifulSoup.php

import requests
from bs4 import BeautifulSoup

pagesx = list(range(33))
for ii in pagesx:
    getpage = requests.get(f"THEWEBPAGEISWASSCRAPINGFROM={ii}")  # P.S. THIS WEBSITE IS PUBLIC AND ALLOWS SCRAPING

    getpage_soup = BeautifulSoup(getpage.text, 'html.parser')

    all_links = getpage_soup.findAll('a')

    for link in all_links:
        print(link)

and it's much faster (although it scrapes every link on each page, not just the internal ones). Still, it's very fast compared to your links code; I feel your code should have been just as fast.

Still, it got me started and it's very, very handy. A life saver.

I've been stuck looking for a Scrapy solution to this and never managed it. I would still be stuck on Scrapy if I hadn't found your code (so thank you!) :)
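
A likely reason for the speed difference (my own reading; neither poster confirmed it): the recursive crawler opens a fresh HTTP connection for every single link it discovers and keeps following links until the whole site is exhausted, whereas the pagination loop above makes exactly 33 requests and never recurses. Below is a minimal sketch of an iterative alternative that reuses a single requests.Session for connection pooling and caps the number of pages; the base_url parameter, max_pages limit, and example.com URL are illustrative assumptions, not part of either poster's code:

# A minimal sketch, not the gist author's code: the same internal-link crawl,
# rewritten iteratively with a shared requests.Session and an optional page cap.
import re
from collections import deque

import requests
from bs4 import BeautifulSoup

def crawl_internal_links(base_url, max_pages=100):
    pattern = re.compile("^(/)")   # internal (site-relative) hrefs only
    session = requests.Session()   # reuse one connection instead of opening a new one per request
    seen = set()
    queue = deque([""])            # start from the site root
    while queue and len(seen) < max_pages:
        page_url = queue.popleft()
        html = session.get(f"{base_url}{page_url}").text
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.find_all("a", href=pattern):
            href = link.attrs["href"]
            if href not in seen:
                print(href)
                seen.add(href)
                queue.append(href)
    return seen

# Example usage (hypothetical URL):
# crawl_internal_links("https://example.com", max_pages=50)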

@amirmohammadrazmy

Perfect project!

@AnthonyForan

Hi,
Thanks for the solution. I'm having similar issues to @spyros12; however, I can't seem to implement his solution in a way that works.
Has anyone been able to solve this?
Thanks!
