Skip to content

Instantly share code, notes, and snippets.

@joseph-higaki
Last active August 5, 2021 00:23
Show Gist options
  • Save joseph-higaki/f93973b29737e6d3f812ab119e4f8870 to your computer and use it in GitHub Desktop.
Save joseph-higaki/f93973b29737e6d3f812ab119e4f8870 to your computer and use it in GitHub Desktop.
Understanding my Spotify Listening Habits - Part 0
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Understanding my Spotify Listening Habits - Part 0\n",
"\n",
"I like that time of the year when Spotify sends those gorgeous visuals of your listening habits.\n",
"\n",
"I couldn't wait any longer, I requested for my data [here](https://www.spotify.com/account/privacy).\n",
"\n",
"After following the instructions, a few days after I received a link to download some json files with my personal information, amongst them my *last 12 months of streaming history*.\n",
"\n",
"To get us started, I'm just trying to get my:\n",
"* 10 most played artists \n",
"* 10 most played songs\n",
"* my most played artist each month\n",
"\n",
"![Dataiku favorite bands MoM](https://user-images.githubusercontent.com/11904085/128270731-c8bb313f-31ab-4592-8510-90f6571ae18f.png)\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## I'm 👴🏿... so I like strong types and curly braces\n",
"Let's use **Scala**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## First things first, the json files\n",
"Each Spotify's StreamingHistory_.json file has a 10K record limit, you may have several of them"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Intitializing Scala interpreter ..."
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Spark Web UI available at http://DESKTOP-DOS8M7K:4041\n",
"SparkContext available as 'sc' (version = 3.1.2, master = local[*], app id = local-1628122781306)\n",
"SparkSession available as 'spark'\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"import java.io.File\r\n",
"path: String = data/joseph.higaki/my_spotify_data/MyData/\r\n",
"streamHistoryFilePaths: Array[String] = Array(data/joseph.higaki/my_spotify_data/MyData/StreamingHistory0.json, data/joseph.higaki/my_spotify_data/MyData/StreamingHistory1.json)\r\n"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import java.io.File\n",
"val path = \"data/joseph.higaki/my_spotify_data/MyData/\" \n",
"val streamHistoryFilePaths = new File(path).listFiles\n",
".filter(_.getName.startsWith(\"StreamingHistory\"))\n",
".map(_.getPath().replace(\"\\\\\",\"/\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Second, put them into DataFrames"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----------------+----------------+--------+--------------------+\n",
"| artistName| endTime|msPlayed| trackName|\n",
"+-----------------+----------------+--------+--------------------+\n",
"| The Black Keys|2020-07-17 03:31| 191798| Howlin' For You|\n",
"| Royal Blood|2020-07-17 03:34| 183578| Figure It Out|\n",
"|The White Stripes|2020-07-17 03:38| 231920| Seven Nation Army|\n",
"|Cage The Elephant|2020-07-17 03:41| 175493|Ain't No Rest For...|\n",
"| Des Rocs|2020-07-17 03:44| 200933| Wayne|\n",
"| Barns Courtney|2020-07-17 03:47| 197510| Fire|\n",
"| Dirty Honey|2020-07-17 03:51| 204987| When I'm Gone|\n",
"| The Blue Stones|2020-07-17 03:54| 222936|Shakin' Off The Rust|\n",
"| Wolfmother|2020-07-17 04:11| 27963| Joker And The Thief|\n",
"| Guns N' Roses|2020-07-17 06:11| 0| Paradise City|\n",
"+-----------------+----------------+--------+--------------------+\n",
"only showing top 10 rows\n",
"\n"
]
},
{
"data": {
"text/plain": [
"import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}\r\n",
"streamHistorySchema: org.apache.spark.sql.types.StructType = StructType(StructField(artistName,StringType,true), StructField(endTime,StringType,true), StructField(msPlayed,LongType,true), StructField(trackName,StringType,true))\r\n",
"streamHistoryReader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@6ae2589b\r\n",
"streamHistory: org.apache.spark.sql.DataFrame = [artistName: string, endTime: string ... 2 more fields]\r\n"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}\n",
"\n",
"val streamHistorySchema = StructType(Array(\n",
" StructField(\"artistName\", StringType, true),\n",
" StructField(\"endTime\", StringType, true),\n",
" StructField(\"msPlayed\", LongType, true),\n",
" StructField(\"trackName\", StringType, true)\n",
" ))\n",
"\n",
"val streamHistoryReader = spark.read.\n",
"option(\"multiline\",\"true\").\n",
"format(\"json\").\n",
"schema(streamHistorySchema)\n",
"\n",
"var streamHistory = streamHistoryReader.load()\n",
"streamHistoryFilePaths.foreach(p => streamHistory = streamHistory.union(streamHistoryReader.load(p)))\n",
"\n",
"streamHistory.show(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I tried to load multiple json files using a single DataFrameReader method, and failed miserably. \n",
"In the interest of time I just iterated and union'ed the json files into my DataFrame\n",
"(Sorry Scala, my DataFrame was not immutable)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"//val streamHistory = spark.read.option(\"multiline\",\"true\").json(streamHistoryFilePaths)\n",
"// *****************************\n",
"// I don't know why it doesnt take the array. \n",
"// Now I know why: \n",
"// json doesn't take Array[String]\n",
"// https://github.com/apache/spark/blob/82464fb2e02ca4e4d425017815090497b79dc93f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L233 \n",
"// Load does takes an Array[String]\n",
"// https://github.com/apache/spark/blob/82464fb2e02ca4e4d425017815090497b79dc93f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L133\n",
"\n",
"// ***************************** \n",
"// It looks like json doesnt take multiple files as path or load\n",
"//val streamHistory = spark.read.option(\"multiline\",\"true\").format(\"json\").schema(streamHistorySchema).option(\"paths\", streamHistoryFilePaths.mkString(\",\")).load()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's peek the date ranges for my data dump"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------+----------------+\n",
"|summary| endTime|\n",
"+-------+----------------+\n",
"| count| 16369|\n",
"| mean| null|\n",
"| stddev| null|\n",
"| min|2020-07-17 03:31|\n",
"| max|2021-07-16 22:22|\n",
"+-------+----------------+\n",
"\n"
]
}
],
"source": [
"streamHistory.describe(\"endTime\").show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Third, Some Quick Insights\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"streamHistory.createOrReplaceTempView(\"stream_history\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+-------+\n",
"| artistName|minutes|\n",
"+--------------------+-------+\n",
"| Pearl Jam| 2351.5|\n",
"| Metallica|2059.21|\n",
"|Bob Marley & The ...|1231.93|\n",
"| The Beatles|1109.93|\n",
"|Rage Against The ...|1076.81|\n",
"| Laguna Pai|1051.49|\n",
"|Red Hot Chili Pep...| 818.37|\n",
"| The Doors| 693.38|\n",
"|The Smashing Pump...| 690.26|\n",
"| Daft Punk| 643.83|\n",
"+--------------------+-------+\n",
"only showing top 10 rows\n",
"\n"
]
}
],
"source": [
"spark.sql(\"\"\"\n",
"select artistName, round(sum(msPlayed) / (60 * 1000), 2) as minutes\n",
"from stream_history \n",
"group by artistName \n",
"order by sum(msPlayed) desc\"\"\").show(10)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+--------------------+-------+\n",
"| trackName| artistName|minutes|\n",
"+--------------------+--------------------+-------+\n",
"| Canales| Laguna Pai| 354.21|\n",
"| Rock Fort Rock| The Skatalites| 337.31|\n",
"| Somos Pocos| Laguna Pai| 293.85|\n",
"| Roller Skates| Steel Pulse| 267.31|\n",
"|Hoping (Herbert's...| Louie Austen| 254.19|\n",
"| Jerusalem| Alpha Blondy| 241.35|\n",
"| Riders on the Storm| The Doors| 237.24|\n",
"| Night Nurse| Gregory Isaacs| 215.67|\n",
"| Killing In The Name|Rage Against The ...| 207.24|\n",
"|And The Living Is...| Guts| 201.28|\n",
"+--------------------+--------------------+-------+\n",
"only showing top 10 rows\n",
"\n"
]
}
],
"source": [
"spark.sql(\"\"\"\n",
"select trackName, artistName, round(sum(msPlayed) / (60 * 1000), 2) as minutes\n",
"from stream_history \n",
"group by trackName, artistName \n",
"order by sum(msPlayed) desc\"\"\")\n",
".show(10)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+-------+--------+-------+\n",
"| artistName| month|msPlayed|minutes|\n",
"+--------------------+-------+--------+-------+\n",
"| Mattiel|2020-07| 2266116| 37.77|\n",
"| Laguna Pai|2020-08| 6881334| 114.69|\n",
"| Metallica|2020-09| 7155300| 119.26|\n",
"|Rage Against The ...|2020-10|10145011| 169.08|\n",
"| Pearl Jam|2020-11| 6818054| 113.63|\n",
"| Apache|2020-12|16384105| 273.07|\n",
"| Guts|2021-01| 7421847| 123.7|\n",
"| Laguna Pai|2021-02| 7873529| 131.23|\n",
"| Alice In Chains|2021-03|17728951| 295.48|\n",
"| Metallica|2021-04|45733120| 762.22|\n",
"| Pearl Jam|2021-05|50962015| 849.37|\n",
"| The Beatles|2021-06|25730009| 428.83|\n",
"| The Beatles|2021-07|21899167| 364.99|\n",
"+--------------------+-------+--------+-------+\n",
"\n"
]
}
],
"source": [
"spark.sql(\"\"\"\n",
"select artistName, month, msPlayed, round(msPlayed / (60 * 1000), 2) as minutes\n",
"from \n",
"(\n",
" select *, \n",
" ROW_NUMBER() OVER (PARTITION BY month ORDER BY msPlayed DESC) as rank \n",
" from \n",
" (\n",
" select artistName, substring(endTime, 1, 7) as month, sum(msPlayed) as msPLayed\n",
" from stream_history \n",
" group by artistName, substring(endTime, 1, 7)\n",
" order by substring(endTime, 1, 7) asc, sum(msPlayed) desc\n",
" )\n",
")\n",
"where rank = 1\n",
"order by month, rank\n",
"\"\"\")\n",
".show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## That's all folks (for now)\n",
"I'm planning to build some more interesting things on top of this, so this is just the first element from a collection of insights based on spotify's data."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "spylon-kernel",
"language": "scala",
"name": "spylon-kernel"
},
"language_info": {
"codemirror_mode": "text/x-scala",
"file_extension": ".scala",
"help_links": [
{
"text": "MetaKernel Magics",
"url": "https://metakernel.readthedocs.io/en/latest/source/README.html"
}
],
"mimetype": "text/x-scala",
"name": "scala",
"pygments_lexer": "scala",
"version": "0.4.1"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment