@mplewis
Last active August 29, 2015 13:58
Exporting your Last.fm history into an SQLite DB for analysis
{
"metadata": {
"celltoolbar": "Raw Cell Format",
"name": "",
"signature": "sha256:322a2974ad58e39abb5e8cfd03bfb8a7a1dda552cc215255577037e09b5195fc"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exporting your Last.fm history into an SQLite DB for analysis\n",
"\n",
"Ever heard an album or a song for the first time and loved it so much you played it on repeat for three days straight? That's happened to me a few times, and I wanted to dig into my listening history to find out which songs I loved, got sick of, and forgot about for years.\n",
"\n",
"Good thing I use Last.fm obsessively. [Last.fm](http://www.last.fm/) is a service that \"scrobbles\"\u2014keeps track of\u2014every song you listen to past the halfway mark. I've been using Last.fm since 2006 and have a seriously extensive scrobble log. I wanted to move that into an SQL database so I could analyze my listening patterns or tie the data into other apps.\n",
"\n",
"## How to scrape your Last.fm history\n",
"\n",
"Last.fm has a JSON API for accessing your music history. As with many APIs, it requires an API key\u2014[you can create an API account here](http://www.last.fm/api/account/create). The results are also paginated, so we'll fetch our tracks in pages of up to 200 at a time.\n",
"\n",
"We'll be using the [user.getRecentTracks](http://www.last.fm/api/show/user.getRecentTracks) API method to access our listening history. We're going to start at the most recent page and work backwards until we reach the end.\n",
"\n",
"Here's the code I used to retrieve all pages of my scrobble log and store the raw paginated data in a simple Python dict:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import requests\n",
"\n",
"# Enter your own username and API key here.\n",
"user = 'trickybeta'\n",
"api_key = 'YOUR_LAST_FM_API_KEY_GOES_HERE'\n",
"\n",
"# These are the API parameters for our scraping requests.\n",
"per_page = 200\n",
"api_url = 'http://ws.audioscrobbler.com/2.0/?method=user.getrecenttracks&user=%s&api_key=%s&format=json&page=%s&limit=%s'\n",
"\n",
"def recent_tracks(user, api_key, page, limit):\n",
" \"\"\"Get the most recent tracks from `user` using `api_key`. Start at page `page` and limit results to `limit`.\"\"\"\n",
" return requests.get(api_url % (user, api_key, page, limit)).json()\n",
"\n",
"# We need to get the first page so we can find out how many total pages there are in our listening history.\n",
"resp = recent_tracks(user, api_key, 0, 200)\n",
"total_pages = int(resp['recenttracks']['@attr']['totalPages'])\n",
"\n",
"# Retrieve all pages of user listening history and store them into all_pages.\n",
"all_pages = []\n",
"for page_num in xrange(1, total_pages + 1):\n",
" print 'Page', page_num, 'of', total_pages\n",
" page = recent_tracks(user, api_key, page_num, 200)\n",
" all_pages.append(page)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Page 1 of 211\n",
"Page"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 2 of 211\n",
"Page"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 3 of 211\n",
"Page"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" ...\n",
"Page"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 209 of 211\n",
"Page"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 210 of 211\n",
"Page"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 211 of 211\n"
]
}
],
"prompt_number": 2
},
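{
"cell_type": "code",
"collapsed": false,
"input": [
"# A quick sanity check (a sketch, not part of the original run): the Last.fm\n",
"# API reports failures as a JSON body containing an `error` code rather than\n",
"# an HTTP error, so it's worth confirming no page came back as an error payload.\n",
"bad_pages = [n + 1 for n, page in enumerate(all_pages) if 'error' in page]\n",
"if bad_pages:\n",
"    print 'Pages with API errors:', bad_pages"
],
"language": "python",
"metadata": {},
"outputs": []
},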
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cool. We have all the paginated data, but it's not exactly clean or orderly.\n",
"\n",
"## Cleaning up the raw API data\n",
"\n",
"Let's take a look at what one of the tracks looks like after decoding its JSON:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"raw_track = all_pages[0]['recenttracks']['track'][0]\n",
"raw_track"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
"{u'album': {u'#text': u'', u'mbid': u''},\n",
" u'artist': {u'#text': u'John Legend',\n",
" u'mbid': u'75a72702-a5ef-4513-bca5-c5b944903546'},\n",
" u'date': {u'#text': u'9 Apr 2014, 08:24', u'uts': u'1397031847'},\n",
" u'image': [{u'#text': u'', u'size': u'small'},\n",
" {u'#text': u'', u'size': u'medium'},\n",
" {u'#text': u'', u'size': u'large'},\n",
" {u'#text': u'', u'size': u'extralarge'}],\n",
" u'mbid': u'',\n",
" u'name': u'All Of Me (Steve James Remix)',\n",
" u'streamable': u'0',\n",
" u'url': u'http://www.last.fm/music/John+Legend/_/All+Of+Me+(Steve+James+Remix)'}"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There's a lot of nesting, and the nasty `#` characters in key names will make SQL really hard to write. Here are some helper functions to flatten our data into something that plays nice with a row-and-column database:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import collections\n",
"\n",
"def flatten(d, parent_key=''):\n",
" \"\"\"From http://stackoverflow.com/a/6027615/254187. Modified to strip # symbols from dict keys.\"\"\"\n",
" items = []\n",
" for k, v in d.items():\n",
" new_key = parent_key + '_' + k if parent_key else k\n",
" if isinstance(v, collections.MutableMapping):\n",
" items.extend(flatten(v, new_key).items())\n",
" else:\n",
" new_key = new_key.replace('#', '') # Strip pound symbols from column names\n",
" items.append((new_key, v))\n",
" return dict(items)\n",
"\n",
"def process_track(track):\n",
" \"\"\"Removes `image` keys from track data. Replaces empty strings for values with None.\"\"\"\n",
" if 'image' in track:\n",
" del track['image']\n",
" flattened = flatten(track)\n",
" for key, val in flattened.iteritems():\n",
" if val == '':\n",
" flattened[key] = None\n",
" return flattened"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we can clean up our data, let's see how `process_track` operates on one of our tracks:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"process_track(raw_track)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": [
"{u'album_mbid': None,\n",
" u'album_text': None,\n",
" u'artist_mbid': u'75a72702-a5ef-4513-bca5-c5b944903546',\n",
" u'artist_text': u'John Legend',\n",
" u'date_text': u'9 Apr 2014, 08:24',\n",
" u'date_uts': u'1397031847',\n",
" u'mbid': None,\n",
" u'name': u'All Of Me (Steve James Remix)',\n",
" u'streamable': u'0',\n",
" u'url': u'http://www.last.fm/music/John+Legend/_/All+Of+Me+(Steve+James+Remix)'}"
]
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks much nicer! Now `album_text` will be SQL `NULL` instead of the empty string.\n",
"\n",
"## Exporting to a database\n",
"\n",
"Our data is relatively structured: the Last.fm API returns every track in the same consistent shape, so we don't need to do any hardcore database or schema design up front.\n",
"\n",
"### Accessing a database using Dataset\n",
"\n",
"Instead of declaring our DB schema explicitly, let's save time and use the Python [Dataset package](https://dataset.readthedocs.org/en/latest/).\n",
"\n",
"Why do we like Dataset? Because it gives us:\n",
"\n",
"> A simple abstraction layer removes most direct SQL statements without the necessity for a full ORM model - essentially, databases can be used like a JSON file or NoSQL store.\n",
"\n",
"Basically, we can store stuff straight into our database without worrying about schema declarations. Great!\n",
"\n",
"To get started with the database, we first need to:\n",
"\n",
"* Connect to the SQLite DB, creating the DB file if it doesn't already exist\n",
"* Create a table for our data\n",
"\n",
"### Iterating through our data\n",
"\n",
"Once we have our DB and our shiny new table, all we need to do is:\n",
"\n",
"* Iterate through our raw page data\n",
"* Iterate through the raw tracks on each page\n",
"* Process each track\n",
"* Insert each processed track into the table we just created"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import dataset\n",
"\n",
"# Create a SQLite database with the filename `lastfm.sqlite`\n",
"db = dataset.connect('sqlite:///lastfm.sqlite')\n",
"\n",
"# Set up a table called `tracks`\n",
"tracks = db['tracks']\n",
"\n",
"# Iterate through all pages\n",
"num_pages = len(all_pages)\n",
"for page_num, page in enumerate(all_pages):\n",
" print 'Page', page_num + 1, 'of', num_pages\n",
" # On each page, iterate through all tracks\n",
" for track in page['recenttracks']['track']:\n",
" # Process each track and insert it into the `tracks` table\n",
" tracks.insert(process_track(track))\n",
"\n",
"# Confirm our tracks were inserted into the database\n",
"print 'Done!', len(tracks), 'rows in table `tracks`.'"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Page 1 of 211\n",
"Page"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 2 of 211\n",
"Page"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 3 of 211\n",
"Page"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" ...\n",
"Page"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 209 of 211\n",
"Page"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 210 of 211\n",
"Page"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 211 of 211\n",
"Done!"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 42039 rows in table `tracks`.\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fantastic. Now let's verify our database's structure and data. We'll start by checking the columns in table `tracks`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"tracks.columns"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": [
"['id',\n",
" 'streamable',\n",
" 'date_uts',\n",
" 'name',\n",
" 'url',\n",
" 'album_mbid',\n",
" 'mbid',\n",
" 'album_text',\n",
" 'artist_mbid',\n",
" 'artist_text',\n",
" 'date_text']"
]
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now let's grab the first track from the database to see how it looks:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"dict(tracks.all().next())"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": [
"{u'album_mbid': None,\n",
" u'album_text': None,\n",
" u'artist_mbid': u'75a72702-a5ef-4513-bca5-c5b944903546',\n",
" u'artist_text': u'John Legend',\n",
" u'date_text': u'9 Apr 2014, 08:24',\n",
" u'date_uts': u'1397031847',\n",
" u'id': 1,\n",
" u'mbid': None,\n",
" u'name': u'All Of Me (Steve James Remix)',\n",
" u'streamable': u'0',\n",
" u'url': u'http://www.last.fm/music/John+Legend/_/All+Of+Me+(Steve+James+Remix)'}"
]
}
],
"prompt_number": 8
},
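{
"cell_type": "code",
"collapsed": false,
"input": [
"# The `date_uts` column holds each play's time as a Unix timestamp (stored as\n",
"# text), which sorts and subtracts cleanly in SQL. For analysis in Python, it\n",
"# converts straight into a datetime\u2014a quick sketch using the row above:\n",
"from datetime import datetime\n",
"datetime.utcfromtimestamp(int(u'1397031847'))"
],
"language": "python",
"metadata": {},
"outputs": []
},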
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perfect. All our data is in our database and ready to query!\n",
"\n",
"## Querying your scrobble log\n",
"\n",
"At this point, you can:\n",
"\n",
"* Use Dataset to query your DB\n",
"* Open your `lastfm.sqlite` DB in an external program and do the querying there\n",
"\n",
"Let's try running a query in Python using Dataset. Say I want to find all the times I played the song [\"Cheap Sunglasses\" by RAC](https://www.youtube.com/watch?v=EirUcCowe9E)."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"query = '''\n",
"SELECT name, artist_text, date_text\n",
"FROM tracks t\n",
"WHERE t.'name' LIKE '%cheap sunglasses%'\n",
"ORDER BY date_uts ASC;\n",
"'''\n",
"\n",
"results = db.query(query)\n",
"for result in results:\n",
" print dict(result)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses', u'date_text': u'8 Apr 2014, 20:21'}\n",
"{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses', u'date_text': u'8 Apr 2014, 20:25'}\n",
"{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses', u'date_text': u'8 Apr 2014, 20:37'}\n",
"{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses (feat. Matthew Koma)', u'date_text': u'8 Apr 2014, 20:56'}\n",
"{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses', u'date_text': u'8 Apr 2014, 21:11'}\n",
"{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses', u'date_text': u'8 Apr 2014, 21:18'}\n",
"{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses', u'date_text': u'9 Apr 2014, 02:40'}\n",
"{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses', u'date_text': u'9 Apr 2014, 05:45'}\n",
"{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses (feat. Matthew Koma)', u'date_text': u'9 Apr 2014, 08:04'}\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fantastic. I really love that song.\n",
"\n",
"## Back to the Start\n",
"\n",
"At the start of this experiment, I wanted to find the songs I played obsessively within the first week of hearing them. We can use SQL to look for exactly that!\n",
"\n",
"Let's define our heuristic:\n",
"\n",
"> I want to see a list of all the tracks I played more than 3 times within a week of their first play.\n",
"\n",
"This may not sound significant, but it turns out this filters out a **lot** of tracks and leaves me with ones I actually found memorable.\n",
"\n",
"### Structuring our SQL Query\n",
"\n",
"Here's the query I'm going to run to implement the above heuristic:\n",
"\n",
">`SELECT COUNT(fp.url) AS plays, fp.date_text AS first_played, fp.artist_text AS artist, fp.name AS track, fp.album_text AS album`\n",
"\n",
"Give the columns friendly names, and create a `plays` column that counts each track's plays within its first week.\n",
"\n",
"Thanks to how Last.fm organizes its track data, we can tell one track from another by comparing URLs.\n",
"\n",
"The `url` column is a link to the Last.fm page for that track. If two rows share the same `url` value, they represent two plays of the same track.\n",
"\n",
">`FROM (SELECT * FROM tracks GROUP BY url ORDER BY date_uts) fp`\n",
"\n",
"**Subquery**: Reduce the table to one row per unique track, ordered by `date_uts` ascending.\n",
"\n",
"This leaves us with a table where:\n",
"\n",
"* each track is represented at most once\n",
"* each track has a `date_uts` value that represents when the track was first played\n",
"\n",
">`INNER JOIN tracks ap ON fp.url = ap.url AND ap.date_uts - fp.date_uts <= (86400 * 7)`\n",
"\n",
"Join our subquery and our full table, keeping only rows where the play date is within 7 days of the first-played date.\n",
"\n",
">`GROUP BY ap.url`\n",
"\n",
"Group duplicate tracks into one row...\n",
"\n",
">`HAVING COUNT(ap.url) > 3`\n",
"\n",
"...but only keep rows with more than 3 plays within 7 days of the first-played date.\n",
"\n",
">`ORDER BY ap.date_uts DESC`\n",
"\n",
"Put most recently discovered tracks at the top of the list.\n",
"\n",
">`LIMIT 10`\n",
"\n",
"For this experiment, let's just check out my 10 most recently obsessed-over tracks instead of grabbing all of them."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"query = '''\n",
"SELECT COUNT(fp.url) AS plays, fp.date_text AS first_played, fp.artist_text AS artist, fp.name AS track, fp.album_text AS album FROM\n",
" (SELECT * FROM tracks GROUP BY url ORDER BY date_uts) fp\n",
"INNER JOIN tracks ap ON\n",
" fp.url = ap.url AND\n",
" ap.date_uts - fp.date_uts <= (86400 * 7)\n",
"GROUP BY ap.url\n",
"HAVING COUNT(ap.url) > 3\n",
"ORDER BY ap.date_uts DESC\n",
"LIMIT 10\n",
"'''\n",
"\n",
"results = db.query(query)\n",
"for result in results:\n",
" r = dict(result)\n",
" print r['plays'], 'plays', '\\t', r['first_played'], '\\t', r['artist'], '-', r['track']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"7 plays \t8 Apr 2014, 20:21 \tRAC - Cheap Sunglasses\n",
"5 plays \t4 Apr 2014, 21:23 \tBasement Jaxx Vs. Metropole Orkest - Do Your Thing\n",
"5 plays \t4 Apr 2014, 22:46 \tBasement Jaxx Vs. Metropole Orkest - Where\u2019s Your Head At\n",
"4 plays \t18 Mar 2014, 06:12 \tThe-Dream, Pusha T, Ma$e, Cocaine 80s - Higher (Album Version (Edited))\n",
"4 plays \t9 Mar 2014, 16:49 \tRobert DeLong - Global Concepts (Robert DeLong Club Remix)\n",
"4 plays \t12 Mar 2014, 01:29 \tSkrillex - All Is Fair In Love And Brostep\n",
"4 plays \t5 Mar 2014, 05:51 \tRAC - Let Go (feat. Kele & MNDR)\n",
"4 plays \t28 Feb 2014, 19:57 \tKendrick Lamar - Sing That Shit (20syl Juicy Remix)\n",
"4 plays \t28 Feb 2014, 19:44 \tPharrell Williams - Gust of Wind\n",
"4 plays \t20 Feb 2014, 06:17 \tCom Truise - Mind\n"
]
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$wag.\n",
"\n",
"## Conclusion\n",
"\n",
"Now you've got an SQLite database that has all of your music history, ready for analysis.\n",
"\n",
"Let me know if you come up with something cool! Email me at [matt@mplewis.com](mailto:matt@mplewis.com) or tweet me [@mplewis](http://www.twitter.com/mplewis).\n",
"\n",
"Here are some of my humble suggestions for external SQLite management programs:\n",
"* [Base 2](http://menial.co.uk/base/), the finest SQLite manager known to OS X\n",
"* [Valentina Studio](https://www.valentina-db.com/en/valentina-studio-overview), an excellent **free** option for all platforms"
]
}
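,
{
"cell_type": "code",
"collapsed": false,
"input": [
"# One parting example (a sketch in the same vein, with output omitted since it\n",
"# depends on your own scrobbles): an all-time top-artists query against the\n",
"# same `tracks` table.\n",
"query = '''\n",
"SELECT artist_text AS artist, COUNT(*) AS plays\n",
"FROM tracks\n",
"GROUP BY artist_text\n",
"ORDER BY plays DESC\n",
"LIMIT 10\n",
"'''\n",
"\n",
"for result in db.query(query):\n",
"    r = dict(result)\n",
"    print r['plays'], 'plays', '\\t', r['artist']"
],
"language": "python",
"metadata": {},
"outputs": []
}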
],
"metadata": {}
}
]
}