Last active June 5, 2023 15:36
Code to retrieve all links from a given URL
import requests
from bs4 import BeautifulSoup

def get_all_links(url):
    content = requests.get(url).content
    soup = BeautifulSoup(content, 'html.parser')
    links = []
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            href = link['href']
            if href and not href.startswith('#'):
                links.append(href)
    return links

if __name__ == '__main__':
    url = 'https://www.ashish.ch'
    links = get_all_links(url)
    print(links)
Hi, if I wanted to make a list of all of these hrefs, how would I go about that?
Make an empty list (hrefs = []) on line 5, and change line 8 to hrefs.append(link['href']).
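A minimal sketch of that change, using an inline HTML snippet as a stand-in for the fetched page so it runs without a network call:

```python
from bs4 import BeautifulSoup

# Stand-in for requests.get(url).content; any HTML with <a> tags works.
content = '<a href="/about">About</a><a name="top"></a><a href="/blog">Blog</a>'

soup = BeautifulSoup(content, 'html.parser')
hrefs = []  # the empty list suggested above
for link in soup.find_all('a'):
    if link.has_attr('href'):  # skip anchors with no href at all
        hrefs.append(link['href'])

print(hrefs)  # ['/about', '/blog']
```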
Hi, I am getting this error:
Traceback (most recent call last):
  File "c:\Users\yyy\Downloads\assignment\testwebscrapping.py", line 69, in <module>
    print(link['href'])
  File "C:\Users\yyy\AppData\Roaming\Python\Python39\site-packages\bs4\element.py", line 1519, in __getitem__
    return self.attrs[key]
KeyError: 'href'
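That KeyError happens when an <a> tag has no href attribute (for example a named anchor) and is subscripted directly. A small sketch of the two common guards, using a placeholder HTML snippet:

```python
from bs4 import BeautifulSoup

# An <a> tag without href triggers KeyError when subscripted directly;
# link.get('href') or link.has_attr('href') avoids it.
html = '<a name="top"></a><a href="/page">link</a>'
soup = BeautifulSoup(html, 'html.parser')

for link in soup.find_all('a'):
    href = link.get('href')  # returns None instead of raising KeyError
    if href:
        print(href)  # /page
```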
Updated the code to handle a few edge cases.
Thanks for this! By the way, line 6 should probably be:
for link in BeautifulSoup(content, parse_only=SoupStrainer('a'), features='lxml'):
as there is a warning when the parser isn't pinned down (this is the list of available parsers: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use)
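A sketch of that SoupStrainer suggestion, with two deviations worth flagging: it uses html.parser so it runs without installing lxml, and it calls find_all on the strained soup rather than iterating it directly, since direct iteration can also yield non-tag nodes. The HTML snippet is a placeholder:

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '<html><body><p>text</p><a href="/a">A</a><a href="/b">B</a></body></html>'

# parse_only tells BeautifulSoup to keep only matching tags, so the
# parser skips everything except the <a> elements.
only_a = SoupStrainer('a')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_a)

links = [tag['href'] for tag in soup.find_all('a') if tag.has_attr('href')]
print(links)  # ['/a', '/b']
```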