Skip to content

Instantly share code, notes, and snippets.

@misterhay
Created November 1, 2022 17:40
Show Gist options
  • Save misterhay/1cba2d46550abd163b0fc53d4f5bbb7d to your computer and use it in GitHub Desktop.
Save misterhay/1cba2d46550abd163b0fc53d4f5bbb7d to your computer and use it in GitHub Desktop.
intro to music with Spotify
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Callysto.ca Banner](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-top.jpg?raw=true)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Music\n",
"\n",
"<table><tr>\n",
"<td style=\"font-size:8px;\"> <img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/5/57/Treble_a.svg/1920px-Treble_a.svg.png\" alt=\"Musical Staff\" style=\"width: 350px;\"/><br><a href=\"https://en.wikipedia.org/wiki/Musical_note#/media/File:Treble_a.svg\">By Dbolton - Own work, CC0, https://commons.wikimedia.org/w/index.php?curid=17813642</a></td>\n",
"<td> <img src=\"https://storage.googleapis.com/pr-newsroom-wp/1/2018/11/Spotify_Logo_CMYK_Green.png\" alt=\"Spotify Logo\" style=\"width: 650px;\"/> </td>\n",
"</tr></table>\n",
"\n",
"Music is an art loved by many people around the world, and it has been an important part of people's life.\n",
"\n",
"On a regular day, you might be listening to your artist or trying to play your favourite songs. In this hackathon notebook let's try to find out more about the most popular songs and what they have in common. Hopefully you will find some interesting insights that might be difficult to determine otherwise, while learning some new coding skills."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting ready\n",
"\n",
"This section sets up many things behind the scenes which are required for the rest of this notebook. Most of the code blocks in this section are ready-to-run so you won't have to do any modifications. You don't need to know everything about various tasks being accomplished by the code cell in this section to complete the challenges. However feel free to ask mentors about anything that makes you curious."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Import libraries\n",
"\n",
"Run the cell below to import required Python libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import plotly.express as px\n",
"print('Setup Complete')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Importing data and creating a dataframe\n",
"\n",
"[Spotify](https://en.wikipedia.org/wiki/Spotify), an audio streaming platform, has a huge database of songs and information about them.\n",
"\n",
"Run the cell below to import a dataset of about 40,000 songs that has been [exported from Spotify](https://developer.spotify.com/documentation/web-api)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"music = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/hackathon/spotify.csv')\n",
"# display the dataframe\n",
"music"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Columns\n",
"\n",
"Let's have a look at the columns in our data set."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for c in music.columns:\n",
" print(c)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you know which columns are there in the dataset, but what do those columns refer to?\n",
"\n",
"**Danceability**: How suitable a track is for dancing. A value of 0.0 is least danceable and 1.0 is most danceable.\n",
"\n",
"**Energy**: A perceptual measure of intensity and activity that ranges between 0 to 1. Typically, energetic tracks feel fast, loud, and noisy.\n",
"\n",
"**Key**: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on.\n",
"\n",
"**Loudness**: The average loudness of a track in decibels (dB). Values typically ranges between -60 and 0 db.\n",
"\n",
"**Mode**: The modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.\n",
"\n",
"**Speechiness**: Indicates the presence of spoken words in a track. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech while below 0.33 most likely represent music and other non-speech-like tracks.\n",
"\n",
"**Acousticness**: A confidence measure indicating whether the track is acoustic. Value of 1 represents highest confidence.\n",
"\n",
"**Instrumentalness**: Predicts whether a track contains no vocals. The closer the value is to 1.0, the greater likelihood the track contains no vocal content.\n",
"\n",
"**Liveness**: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.\n",
"\n",
"**Valence**: A measure to describe the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).\n",
"\n",
"**Tempo**: The overall estimated tempo (speed or pace) of a track in beats per minute (BPM).\n",
"\n",
"**duration_ms**: The duration of the track in milliseconds.\n",
"\n",
"**time_signature**: An estimated overall time signature of a track. The time signature is a notational convention to specify how many beats are in each bar (or measure)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Cleaning\n",
"\n",
"### Adding New Columns\n",
"\n",
"We can add a new column to show the duration of the track in seconds instead of milliseconds."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"music['duration_s'] = music['duration_ms']/1000\n",
"music"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also add a column of links to the tracks."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"music['link'] = 'https://open.spotify.com/track/' + music['track_id']\n",
"music"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looking at the `release_date` column, we can see that for some songs it is just the year and for some it is a [standard date](https://www.iso.org/iso-8601-date-and-time-format.html). Let's create a new column called `release_year` that is just the first four characters of the `release_date`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"music['release_year'] = music['release_date'].str[:4].astype(int)\n",
"music"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Analysis\n",
"\n",
"### Song Duration \n",
"\n",
"Let's visualize the song lengths over the years to see if there is anything strange in our dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"px.scatter(music, x='release_year', y='duration_s', title='Song Duration Over Time', hover_data=['artist', 'track', 'link'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We may want to eliminate some of the outliers, for example songs released before 1950."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_music = music[music['release_year'] >= 1949]\n",
"px.scatter(new_music, x='release_year', y='duration_s', title='Song Duration Over Time', hover_data=['artist', 'track', 'link'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or only look at songs from the 1990s with a duration less than 10 minutes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"short_90s_music = music[(music['release_year']>1989) & (music['release_year']<2000) & (music['duration_s']<10*60)]\n",
"px.scatter(short_90s_music, x='release_year', y='duration_s', title='Song Duration Over Time', hover_data=['artist', 'track', 'link'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also see if the average `danceability` has changed over time in our dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"average_danceability = music.groupby('release_year')['danceability'].mean()\n",
"px.line(average_danceability, title='Average Danceability Over Time')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course for years where there are not a lot of songs in our in our dataset, the average will not be a useful value.\n",
"\n",
"Let's look instead at another dataset containing the top 50 tracks from 2010 to 2019."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"music2 = pd.read_csv('https://raw.githubusercontent.com/callysto/data-files/main/hackathon/spotify-top-50-from-2010-2019.csv')\n",
"px.line(music2.groupby('year')['danceability'].mean(), title='Top 50 Average Danceability Over Time')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see if there is a relationship between `energy` and `danceability` in either dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"px.scatter(music, x='energy', y='danceability', title='Energy vs Danceability', hover_data=['artist', 'track'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"px.scatter(music2, x='energy', y='danceability', title='Energy vs Danceability for Top 50', hover_data=['artist', 'title'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also explore and visualize other song features from the datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"music.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check out the [next notebook](music-challenge.ipynb) to continue your own analysis.\n",
"\n",
"<hr>\n",
"\n",
"#### Getting More Data Using the Spotify API (Optional)\n",
"\n",
"An [application programming interface](https://en.wikipedia.org/wiki/API) is a set of commands to access data from another system. The [Spotify Web API](https://developer.spotify.com/documentation/web-api/) allows us to get information about songs, albums, and artists.\n",
"\n",
"If you want to retireve more data and have a [Spotify account](https://www.spotify.com/us/signup), you can sign in to the [Developers Dashboard](https://developer.spotify.com/dashboard/login). From the Dashboard, you can click the `CREATE AN APP` button, type a name and description, and then click `CREATE`. Clicking on your new app in the Dashboard will show you the `Client ID` and `CLIENT SECRET` that you can paste into the code cell below."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"CLIENT_ID = 'PASTE_YOUR_CLIENT_ID_HERE'\n",
"CLIENT_SECRET = 'PASTE_YOUR_CLIENT_SECRET_HERE'\n",
"\n",
"import requests\n",
"try:\n",
" auth_response = requests.post('https://accounts.spotify.com/api/token', {'grant_type':'client_credentials', 'client_id':CLIENT_ID, 'client_secret':CLIENT_SECRET})\n",
" auth_response_data = auth_response.json()\n",
" access_token = auth_response_data['access_token']\n",
" headers = {'Authorization':'Bearer {token}'.format(token=access_token)}\n",
"except:\n",
" print('Remember to paste your client ID and secret into the code')\n",
"\n",
"def find_tracks(search_string):\n",
" try:\n",
" r = requests.get('https://api.spotify.com/v1/search?q=' + search_string + '&type=track', headers=headers)\n",
" info = r.json()\n",
" except:\n",
" print('Error with search string:', search_string)\n",
" info = None\n",
" return info\n",
"\n",
"def get_track_info(track_id):\n",
" try:\n",
" r = requests.get('https://api.spotify.com/v1/tracks/' + track_id, headers=headers)\n",
" info = r.json()\n",
" except:\n",
" print('Error with track id:', track_id)\n",
" info = None\n",
" return info\n",
"\n",
"def get_track_features(track_id):\n",
" try:\n",
" r = requests.get('https://api.spotify.com/v1/audio-features/' + track_id, headers=headers)\n",
" info = r.json()\n",
" except:\n",
" print('Error with track id:', track_id)\n",
" info = None\n",
" return info\n",
"\n",
"print('Spotify API setup complete')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can now search for a song, for example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"search_string = 'i want a hippopotamus for christmas'\n",
"info = find_tracks(search_string)\n",
"\n",
"for item in info['tracks']['items']:\n",
" if item['type'] == 'track':\n",
" print(item['id'], item['name'], item['external_urls']['spotify'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can get information about a song:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"track_info = get_track_info('5P8Xvy5tzhmfyfA6GplQ8h')\n",
"print(track_info['name'])\n",
"print(track_info['album']['artists'][0]['name'])\n",
"print('duration_ms', track_info['duration_ms'])\n",
"print('popularity', track_info['popularity'])\n",
"print('release_date', track_info['album']['release_date'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And audio features of a song:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"track_features = get_track_features('5P8Xvy5tzhmfyfA6GplQ8h')\n",
"track_features"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an example of a list of songs to check out, try the [Billions Club Playlist](https://open.spotify.com/playlist/37i9dQZF1DX7iB3RCnBnN4)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"playlist_id = '37i9dQZF1DX7iB3RCnBnN4'\n",
"\n",
"tracks = []\n",
"\n",
"for x in range(4): # it only returns 100 tracks at a time\n",
" offset = x*100\n",
" r = requests.get('https://api.spotify.com/v1/playlists/' + playlist_id + '/tracks?offset=' + str(offset), headers=headers)\n",
" for item in r.json()['items']:\n",
" tracks.append([item['track']['artists'][0]['name'], item['track']['name'], item['track']['id']])\n",
"\n",
"billions = pd.DataFrame(tracks, columns=['artist', 'track', 'id'])\n",
"billions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[![Callysto.ca License](https://github.com/callysto/curriculum-notebooks/blob/master/callysto-notebook-banner-bottom.jpg?raw=true)](https://github.com/callysto/curriculum-notebooks/blob/master/LICENSE.md)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.10 64-bit",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.10"
},
"nbTranslate": {
"displayLangs": [
"*"
],
"hotkey": "alt-t",
"langInMainMenu": true,
"sourceLang": "en",
"targetLang": "fr",
"useGoogleTranslate": true
},
"vscode": {
"interpreter": {
"hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment