http://docs.python-requests.org/en/latest/ and http://www.crummy.com/software/BeautifulSoup/bs4/doc/
# import the requests Python library for programmatically making HTTP requests
# after installing it according to these instructions:
# http://docs.python-requests.org/en/latest/user/install/#install
import requests

# import the BeautifulSoup Python library according to these instructions:
# http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
# use this syntax as described on the documentation page:
# http://www.crummy.com/software/BeautifulSoup/bs4/doc/#making-the-soup
from bs4 import BeautifulSoup

# the URL of the NY Times website we want to parse
base_url = 'http://www.nytimes.com'

# the syntax (according to the documentation) for how to
# "load" a webpage through Python
r = requests.get(base_url)

# parse the HTML of the NY Times homepage; r comes from the
# requests call above. Naming the parser explicitly avoids a
# UserWarning and keeps behavior consistent across systems.
soup = BeautifulSoup(r.text, "html.parser")

# find and loop through all elements on the page with the
# class name "story-heading"
for story_heading in soup.find_all(class_="story-heading"):
    # for the story headings that are links, print out the text
    # and format it nicely;
    # for the others, take the contents out and format them nicely
    if story_heading.a:
        print(story_heading.a.text.replace("\n", " ").strip())
    else:
        print(story_heading.contents[0].strip())
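The `find_all(class_="story-heading")` pattern above can be exercised against a static HTML snippet, without a network request. The markup below is a made-up stand-in for the NY Times page, just to show both branches of the `if story_heading.a` check:

```python
from bs4 import BeautifulSoup

# hypothetical snippet mimicking the page structure the scraper expects
html = """
<h2 class="story-heading"><a href="/a">Headline One</a></h2>
<h2 class="story-heading">Plain heading</h2>
"""

soup = BeautifulSoup(html, "html.parser")
headings = []
for story_heading in soup.find_all(class_="story-heading"):
    if story_heading.a:
        # heading wraps a link: take the link text
        headings.append(story_heading.a.text.replace("\n", " ").strip())
    else:
        # bare heading: take its first child, a plain string
        headings.append(story_heading.contents[0].strip())
print(headings)
```

The first heading goes through the link branch, the second through the plain-contents branch.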
On 14.04.2020 I get a warning and do not know how to fix it!
TEST.py:21: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 21 of the file TEST.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.
soup = BeautifulSoup(r.text)
Python 3.8.1 // Text Editor: Geany
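The warning message itself names the fix: pass an explicit parser to the BeautifulSoup constructor. A minimal reproduction of the silenced call, using a throwaway string instead of the fetched page:

```python
from bs4 import BeautifulSoup

# naming the parser explicitly removes the UserWarning and pins
# the parse behavior regardless of which parsers are installed
soup = BeautifulSoup("<p>ok</p>", features="html.parser")
print(soup.p.text)  # prints "ok"
```

Applied to the gist, line 21 becomes `soup = BeautifulSoup(r.text, features="html.parser")` (the second positional argument is `features`, so `BeautifulSoup(r.text, "html.parser")` works too).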