@la5942
Last active August 7, 2017 05:52
Python Scraping

iTunes publishes an RSS feed of its top songs, which I would use to get the daily top 100. The feed you want is https://itunes.apple.com/gb/rss/topsongs/limit=100/json and the feed generator is at https://rss.itunes.apple.com/gb/. A sample response (with limit=1) looks like this:

{"feed":{"author":{"name":{"label":"iTunes Store"}, "uri":{"label":"http://www.apple.com/uk/itunes/"}}, "entry":{"im:name":{"label":"Wrecking Ball"}, "im:image":[
{"label":"http://a166.phobos.apple.com/us/r30/Music/v4/7f/c5/85/7fc58500-1dd8-6581-2524-c07870b7defa/886444193757.55x55-70.jpg", "attributes":{"height":"55"}}, 
{"label":"http://a888.phobos.apple.com/us/r30/Music/v4/7f/c5/85/7fc58500-1dd8-6581-2524-c07870b7defa/886444193757.60x60-50.jpg", "attributes":{"height":"60"}}, 
{"label":"http://a231.phobos.apple.com/us/r30/Music/v4/7f/c5/85/7fc58500-1dd8-6581-2524-c07870b7defa/886444193757.170x170-75.jpg", "attributes":{"height":"170"}}], "im:collection":{"im:name":{"label":"Wrecking Ball - Single"}, "link":{"attributes":{"rel":"alternate", "type":"text/html", "href":"https://itunes.apple.com/gb/album/wrecking-ball-single/id711469715?uo=2"}}, "im:contentType":{"im:contentType":{"attributes":{"term":"Album", "label":"Album"}}, "attributes":{"term":"Music", "label":"Music"}}}, "im:price":{"label":"£0.99", "attributes":{"amount":"0.99000", "currency":"GBP"}}, "im:contentType":{"im:contentType":{"attributes":{"term":"Track", "label":"Track"}}, "attributes":{"term":"Music", "label":"Music"}}, "rights":{"label":"℗ 2013 RCA Records, a division of Sony Music Entertainment"}, "title":{"label":"Wrecking Ball - Miley Cyrus"}, "link":[
{"attributes":{"rel":"alternate", "type":"text/html", "href":"https://itunes.apple.com/gb/album/wrecking-ball/id711469715?i=711469887&uo=2"}}, 
{"im:duration":{"label":"30000"}, "attributes":{"title":"Preview", "rel":"enclosure", "type":"audio/x-m4a", "href":"http://a1853.phobos.apple.com/us/r1000/048/Music4/v4/7f/5e/47/7f5e4757-6717-f1db-fd62-82a3043159ed/mzaf_8084741812362172321.plus.aac.p.m4a", "im:assetType":"preview"}}], "id":{"label":"https://itunes.apple.com/gb/album/wrecking-ball/id711469715?i=711469887&uo=2", "attributes":{"im:id":"711469887"}}, "im:artist":{"label":"Miley Cyrus", "attributes":{"href":"https://itunes.apple.com/gb/artist/miley-cyrus/id137057909?uo=2"}}, "category":{"attributes":{"im:id":"14", "term":"Pop", "scheme":"https://itunes.apple.com/gb/genre/music-pop/id14?uo=2", "label":"Pop"}}, "im:releaseDate":{"label":"2013-10-06T00:00:00-07:00", "attributes":{"label":"06 October 2013"}}}, "updated":{"label":"2013-10-08T07:32:19-07:00"}, "rights":{"label":"Copyright 2008 Apple Inc."}, "title":{"label":"iTunes Store: Top Songs"}, "icon":{"label":"http://itunes.apple.com/favicon.ico"}, "link":[
{"attributes":{"rel":"alternate", "type":"text/html", "href":"https://itunes.apple.com/WebObjects/MZStore.woa/wa/viewTop?cc=gb&id=25106&popId=1"}}, 
{"attributes":{"rel":"self", "href":"https://itunes.apple.com/gb/rss/topsongs/limit=1/json"}}], "id":{"label":"https://itunes.apple.com/gb/rss/topsongs/limit=1/json"}}}

Read the JSON/XML data using the following tools: simplejson (http://simplejson.readthedocs.org/en/latest/) and urllib2 (http://docs.python.org/2/library/urllib2.html).
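
As a minimal sketch of the parsing step, the helper below pulls the track id, title, price and page URL out of a feed document. It is written for Python 3 using only the standard-library `json` module, whereas the tools linked above target Python 2. The function name `extract_entries` and the trimmed sample string are illustrative assumptions; the field names come from the sample response. It also allows for `entry` being a single object rather than a list, which is what the limit=1 sample shows.

```python
import json

# Trimmed-down copy of the limit=1 sample response; only the fields used below are kept.
SAMPLE_FEED = """
{"feed": {"entry": {
    "id": {"label": "https://itunes.apple.com/gb/album/wrecking-ball/id711469715?i=711469887&uo=2",
           "attributes": {"im:id": "711469887"}},
    "title": {"label": "Wrecking Ball - Miley Cyrus"},
    "im:price": {"attributes": {"amount": "0.99000", "currency": "GBP"}}
}}}
"""


def extract_entries(feed_json):
    """Return a list of dicts with the id, title, price and page URL of each song."""
    entries = feed_json["feed"].get("entry", [])
    # With limit=1 the sample shows "entry" as a single object, not a list.
    if isinstance(entries, dict):
        entries = [entries]
    songs = []
    for entry in entries:
        songs.append({
            "track_id": entry["id"]["attributes"]["im:id"],
            "title": entry["title"]["label"],
            "price": float(entry["im:price"]["attributes"]["amount"]),
            "url": entry["id"]["label"],
        })
    return songs


if __name__ == "__main__":
    for song in extract_entries(json.loads(SAMPLE_FEED)):
        print(song["track_id"], song["title"], song["price"])
```

The `url` value is the song page that would later be scraped for ratings.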

Use the resulting data to grab the customer ratings, e.g. the number of stars and the number of votes. I am not sure whether you will be able to get the customer reviews, as a song page only shows a few at a time.

Hacking around with Scrapy:

# Install Scrapy.
sudo pip install Scrapy
# Play with shell.
scrapy shell "https://itunes.apple.com/gb/album/you-make-me/id685987899?i=685987903&ign-mpt=uo%3D2"
# Get ratings. Older Scrapy shells expose `hxs`; newer releases use response.xpath(...) with the same expression.
hxs.select("//div[@class='rating']/span[@class='rating-count']/text()").extract()

I am not sure Scrapy is the best tool for the job; BeautifulSoup looks like it may be a better fit: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup

When comparing against the official charts, check whether they cover digital downloads only or combine digital and retail sales, because comparing a combined chart against the iTunes digital download chart may skew your results.

You could then take the concept further by collecting feed data from other API sources, e.g. last.fm and Spotify, and comparing those too. You could also compare the UK and US charts.
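
One way to line two charts up is to normalise titles and artists and then compare positions. The sketch below (Python 3, standard library only) assumes each chart has already been reduced to an ordered list of `(title, artist)` pairs. The function names `normalise` and `compare_charts` are hypothetical, and the rows in the demo are placeholders, not real chart data.

```python
import re


def normalise(text):
    """Lower-case, drop bracketed suffixes such as "(Radio Edit)" and strip punctuation.

    Note: this simple version also discards non-ASCII letters.
    """
    text = re.sub(r"[\(\[].*?[\)\]]", "", text.lower())
    text = re.sub(r"[^a-z0-9 ]", "", text)
    return " ".join(text.split())


def compare_charts(chart_a, chart_b):
    """Map each song found in both charts to (position in a, position in b).

    Each chart is an ordered list of (title, artist) pairs, position 1 first.
    """
    positions_b = {}
    for pos, (title, artist) in enumerate(chart_b, start=1):
        # Keep the first (highest) position if a song appears twice.
        positions_b.setdefault((normalise(title), normalise(artist)), pos)
    shared = {}
    for pos, (title, artist) in enumerate(chart_a, start=1):
        key = (normalise(title), normalise(artist))
        if key in positions_b:
            shared[(title, artist)] = (pos, positions_b[key])
    return shared


if __name__ == "__main__":
    # Placeholder rows for illustration only, not real chart data.
    itunes = [("Song A", "Artist One"), ("Song B (Radio Edit)", "Artist Two")]
    official = [("SONG B", "ARTIST TWO"), ("Song C", "Artist Three"), ("Song A", "Artist One")]
    for song, (pos_itunes, pos_official) in compare_charts(itunes, official).items():
        print(song, pos_itunes, pos_official)
```

Exact matching on normalised strings will miss songs whose titles are spelled differently between sources, so treat unmatched rows as candidates for manual review rather than as genuinely absent.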

Quick hack with the BeautifulSoup library and the iTunes feeds.

## This needs to be rewritten as OOP; for now it is a basic proof-of-concept test of the functionality.

import urllib2
import simplejson as json
from bs4 import BeautifulSoup

# Get the RSS feed of the iTunes top songs.
# limit=5 is used here; use limit=100 for the full top 100.
req = urllib2.Request("https://itunes.apple.com/gb/rss/topsongs/limit=5/json", None)
opener = urllib2.build_opener()
itunes = opener.open(req)
# Parse the JSON response into a dict.
itunes_json = json.load(itunes)


# Loop through the feed entries to get each song URL.
for entry in itunes_json['feed']['entry']:
	# Print out detailed song info we may want to store in a DB.
	print entry['id']['attributes']['im:id']+"/"+entry['title']['label'] +"/"+ entry['im:price']['label']
	# Get the itunes song URL.
	scrape_url =  entry['id']['label']
	# Open the url in urllib2.
	scrape_request = urllib2.Request(scrape_url, None)
	# Get the data from the page.
	scrape_data  = urllib2.urlopen(scrape_request)
	# Load the data into the scraping library, naming the parser explicitly.
	soup = BeautifulSoup(scrape_data, "html.parser")

	# Get some of the page elements that we want.
	# The rating is held in the "aria-label" attribute of the <div class="rating">
	# inside the <div class="customer-ratings">.
	ratings_tag = soup.find("div","customer-ratings").find("div","rating")
	# Split the "aria-label" text into its stars part and its ratings-count part.
	stars,rating = ratings_tag['aria-label'].split(',')
	# Print the raw text - it probably needs some conversion from text to numeric.
	print stars.strip()
	print rating.strip()
	# Store all this data somewhere.
	# End


# Scrape the official singles chart.

# URL to scrape
charts_scrape_url =  "http://www.officialcharts.com/singles-chart/"
# Load the request.
charts_scrape_request = urllib2.Request(charts_scrape_url, None)
# Get the data.
charts_scrape_data  = urllib2.urlopen(charts_scrape_request)
# Load data into the scraping library, naming the parser explicitly.
soup1 = BeautifulSoup(charts_scrape_data, "html.parser")

# Loop through all entry class items.
for row in soup1.find_all(attrs={'class': 'entry'}):
    # Collect all the text fragments in the row into a list.
    data = list(row.stripped_strings)
    # Chart position today
    chart_pos = data[0]
    # Chart position last week
    lw = data[1]
    # Weeks at position
    wks = data[2]

    # Get additional data from specific elements in the rows.
    # Song title
    title = row.select("h3")[0].text
    # Artist
    artist = row.select("h4")[0].text
    # Label
    label = row.select("h5")[0].text
    # Image
    image = row.select("img")[0]['src']

    # Store all this data somewhere.
    print chart_pos + ":" + title + ":" + artist
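
The stars and ratings values printed by the iTunes script are still plain text. A minimal sketch of converting them to numbers follows, written for Python 3 with the standard library only. The label format used here (`"4.5 stars, 1234 Ratings"`) is an assumption for illustration; the real `aria-label` text may differ, in which case the patterns would need adjusting. `parse_rating_label` is a hypothetical helper name.

```python
import re


def parse_rating_label(label):
    """Split an aria-label such as "4.5 stars, 1234 Ratings" into (stars, votes).

    Either value is None when its part of the label contains no number.
    """
    # Everything before the first comma describes the stars, the rest the vote count.
    stars_text, _, votes_text = label.partition(",")
    stars_match = re.search(r"\d+(?:\.\d+)?", stars_text)
    # Remove thousands separators before looking for digits.
    votes_match = re.search(r"\d+", votes_text.replace(",", ""))
    stars = float(stars_match.group()) if stars_match else None
    votes = int(votes_match.group()) if votes_match else None
    return stars, votes


if __name__ == "__main__":
    # Assumed label format, for illustration only.
    print(parse_rating_label("4.5 stars, 1234 Ratings"))
```

Returning `None` instead of raising keeps a scraping loop running when a song page has no ratings yet; whether that is the right behaviour depends on how the data is stored afterwards.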