@arthurazs
Created July 22, 2018 02:10
Simple requests + bs4 example
from bs4 import BeautifulSoup
from requests import get as req_get
# Stating the bot's identification
headers = {'user-agent': 'topsongs:v0.1.0 by me@mail.com'}
url = 'http://www.billboard.com/charts/billboard-200'
request = req_get(url, headers=headers)
request.raise_for_status() # in case something goes wrong
soup = BeautifulSoup(request.text, 'html.parser')
songs = soup.find_all('div', class_='chart-row__title') # top 200 list
top = []
for song in songs:
    name = song.find(class_='chart-row__song').text.strip()
    artist = song.find(class_='chart-row__artist').text.strip()
    # list with song and artist in dict format
    top.append({'name': name, 'artist': artist})
# You don't need this second for, you could print it in the previous for
# This is just an example showcasing how you could access the dictionary
for position, song in enumerate(top, 1):
    name = song['name']
    artist = song['artist']
    print(f'#{position:03} {name} by {artist}')
@arthurazs
Author

Data should be scraped responsibly; here follows a good guide covering the best practices for web scraping.
Some articles teach you how to spoof the user-agent header to avoid being detected. Bear in mind that this is not ethical at all! Instead, I'd advise setting the header to the name and version of your app, plus some means of contacting its owner. I'd also suggest reading reddit's guide for their API, which explains why a bot's header should carry accurate identifying information rather than a spoofed value.
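
The advice above could be sketched roughly as follows. This is only a minimal sketch, not part of the original gist: `build_user_agent`, `is_allowed` and `EXAMPLE_RULES` are hypothetical names, and checking robots.txt is an extra courtesy step assumed here as one common best practice. It uses the standard library's `urllib.robotparser` and parses the rules from a string, so it runs without touching the network.

```python
from urllib.robotparser import RobotFileParser


def build_user_agent(app: str, version: str, contact: str) -> str:
    """Return an identifying user-agent in the same style as the gist's header."""
    return f'{app}:v{version} by {contact}'


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as plain text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content (an assumption, not any real site's rules)
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

user_agent = build_user_agent('topsongs', '0.1.0', 'me@mail.com')
headers = {'user-agent': user_agent}  # pass this to requests, as in the gist

print(is_allowed(EXAMPLE_RULES, user_agent, 'http://example.com/charts/'))
print(is_allowed(EXAMPLE_RULES, user_agent, 'http://example.com/private/data'))
```

With real sites you would presumably fetch the site's own robots.txt first (for instance with `RobotFileParser.set_url` and `read`) and skip any URL for which the check returns `False`.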
