@PandaWhoCodes
Last active June 5, 2023 15:36
Code to retrieve all links from a given URL

carlinmack commented Jan 23, 2021

thanks for this! btw line 6 should probably be:

for link in BeautifulSoup(content, parse_only=SoupStrainer('a'), features='lxml'):

Full details of the warning:
GuessedAtParserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 24 of the file getTheories.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

(This is the list of available parsers https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use)
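
Put together, the pattern under discussion with the parser named explicitly would look roughly like this; the fetch via requests and the example URL are assumptions for illustration, not part of the gist:

```python
import requests
from bs4 import BeautifulSoup, SoupStrainer

# Example fetch; any HTML string would work here.
content = requests.get('https://example.com').text

# Naming the parser ('lxml') silences GuessedAtParserWarning and keeps
# behaviour consistent across machines; SoupStrainer('a') parses only <a> tags.
for link in BeautifulSoup(content, features='lxml', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print(link['href'])
```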

@simrangopal

Hi, if I wanted to make a list of all of these hrefs, how would I go about that?

@carlinmack

make an empty list (hrefs = []) on line 5, and change line 8 to hrefs.append(link['href'])
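
Spelled out as a self-contained sketch (the requests fetch and the URL are again just placeholders, not part of the gist):

```python
import requests
from bs4 import BeautifulSoup, SoupStrainer

content = requests.get('https://example.com').text  # placeholder URL

hrefs = []  # collect the links instead of printing them
for link in BeautifulSoup(content, features='lxml', parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        hrefs.append(link['href'])

print(hrefs)
```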

@sukanyaghosh1234

Hi, I am getting this error:

```
Traceback (most recent call last):
  File "c:\Users\yyy\Downloads\assignment\testwebscrapping.py", line 69, in <module>
    print(link['href'])
  File "C:\Users\yyy\AppData\Roaming\Python\Python39\site-packages\bs4\element.py", line 1519, in __getitem__
    return self.attrs[key]
KeyError: 'href'
```

@PandaWhoCodes (Author)

updated the code to handle a few edge cases
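
For reference, the KeyError above comes from <a> tags that carry no href attribute (named anchors, for example). One way to guard against it, independent of whatever the updated gist does, is to restrict the strainer to anchors that actually have an href; the requests fetch and URL below are placeholders:

```python
import requests
from bs4 import BeautifulSoup, SoupStrainer

content = requests.get('https://example.com').text  # placeholder URL

# href=True keeps only <a> tags that actually have an href attribute,
# so link['href'] can no longer raise KeyError.
only_links = SoupStrainer('a', href=True)
for link in BeautifulSoup(content, features='lxml', parse_only=only_links):
    print(link['href'])
```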
