Created
July 22, 2018 02:10
Simple requests + bs4 example
from bs4 import BeautifulSoup
from requests import get as req_get

# Stating the bot's identification
headers = {'user-agent': 'topsongs:v0.1.0 by me@mail.com'}

url = 'http://www.billboard.com/charts/billboard-200'
request = req_get(url, headers=headers)
request.raise_for_status()  # in case something goes wrong

soup = BeautifulSoup(request.text, 'html.parser')
songs = soup.find_all('div', class_='chart-row__title')  # top 200 list

top = []
for song in songs:
    name = song.find(class_='chart-row__song').text.strip()
    artist = song.find(class_='chart-row__artist').text.strip()
    # list with song and artist in dict format
    top.append({'name': name, 'artist': artist})

# You don't need this second for loop, you could print inside the previous one;
# this is just an example showcasing how to access the dictionaries
for position, song in enumerate(top, 1):
    name = song['name']
    artist = song['artist']
    print(f'#{position:03} {name} by {artist}')
Data should be scraped responsibly; here follows a good guide on best practices for web scraping.
Some bad articles teach you to spoof the header so your bot goes undetected. Bear in mind that this is not ethical at all! I'd advise setting the header to the name and version of your app plus some means of contacting its owner. I'd also suggest reading Reddit's API guide, which explains why a bot's header should carry good identifying information instead of being spoofed.
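As a sketch of what "responsible" looks like in practice, the snippet below pairs a descriptive user-agent (the same format as the gist above) with a robots.txt check using the standard library's `urllib.robotparser`. The robots.txt rules are parsed from a literal here purely to keep the example offline; against a real site you would call `rp.read()` on its live `/robots.txt` URL instead.

```python
from urllib.robotparser import RobotFileParser

# Descriptive user-agent: app name, version, and a contact address,
# so the site owner can reach you instead of blocking you outright
headers = {'user-agent': 'topsongs:v0.1.0 by me@mail.com'}

# Parse a hypothetical robots.txt inline (normally: rp.set_url(...); rp.read())
rp = RobotFileParser()
rp.parse(['User-agent: *', 'Disallow: /private/'])

# Ask before you crawl: only fetch paths the site permits
print(rp.can_fetch(headers['user-agent'], '/charts/billboard-200'))  # True
print(rp.can_fetch(headers['user-agent'], '/private/page'))          # False
```

Checking `can_fetch` before each request, plus a small delay between requests, goes a long way toward not being the bot that gets everyone else rate-limited.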