Last active
August 5, 2021 00:23
-
-
Save joseph-higaki/f93973b29737e6d3f812ab119e4f8870 to your computer and use it in GitHub Desktop.
Understanding my Spotify Listening Habits - Part 0
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"attachments": {}, | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Understanding my Spotify Listening Habits - Part 0\n", | |
"\n", | |
"I like that time of the year when Spotify sends those gorgeous visuals of your listening habits.\n", | |
"\n", | |
"I couldn't wait any longer, I requested for my data [here](https://www.spotify.com/account/privacy).\n", | |
"\n", | |
"After following the instructions, a few days after I received a link to download some json files with my personal information, amongst them my *last 12 months of streaming history*.\n", | |
"\n", | |
"To get us started, I'm just trying to get my:\n", | |
"* 10 most played artists \n", | |
"* 10 most played songs\n", | |
"* my most played artist each month\n", | |
"\n", | |
"![Dataiku favorite bands MoM](https://user-images.githubusercontent.com/11904085/128270731-c8bb313f-31ab-4592-8510-90f6571ae18f.png)\"" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## I'm 👴🏿... so I like strong types and curly braces\n", | |
"Let's use **Scala**" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## First things first, the json files\n", | |
"Each Spotify's StreamingHistory_.json file has a 10K record limit, you may have several of them" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"Intitializing Scala interpreter ..." | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"Spark Web UI available at http://DESKTOP-DOS8M7K:4041\n", | |
"SparkContext available as 'sc' (version = 3.1.2, master = local[*], app id = local-1628122781306)\n", | |
"SparkSession available as 'spark'\n" | |
] | |
}, | |
"metadata": {}, | |
"output_type": "display_data" | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"import java.io.File\r\n", | |
"path: String = data/joseph.higaki/my_spotify_data/MyData/\r\n", | |
"streamHistoryFilePaths: Array[String] = Array(data/joseph.higaki/my_spotify_data/MyData/StreamingHistory0.json, data/joseph.higaki/my_spotify_data/MyData/StreamingHistory1.json)\r\n" | |
] | |
}, | |
"execution_count": 1, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import java.io.File\n", | |
"val path = \"data/joseph.higaki/my_spotify_data/MyData/\" \n", | |
"val streamHistoryFilePaths = new File(path).listFiles\n", | |
".filter(_.getName.startsWith(\"StreamingHistory\"))\n", | |
".map(_.getPath().replace(\"\\\\\",\"/\"))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Second, put them into DataFrames" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+-----------------+----------------+--------+--------------------+\n", | |
"| artistName| endTime|msPlayed| trackName|\n", | |
"+-----------------+----------------+--------+--------------------+\n", | |
"| The Black Keys|2020-07-17 03:31| 191798| Howlin' For You|\n", | |
"| Royal Blood|2020-07-17 03:34| 183578| Figure It Out|\n", | |
"|The White Stripes|2020-07-17 03:38| 231920| Seven Nation Army|\n", | |
"|Cage The Elephant|2020-07-17 03:41| 175493|Ain't No Rest For...|\n", | |
"| Des Rocs|2020-07-17 03:44| 200933| Wayne|\n", | |
"| Barns Courtney|2020-07-17 03:47| 197510| Fire|\n", | |
"| Dirty Honey|2020-07-17 03:51| 204987| When I'm Gone|\n", | |
"| The Blue Stones|2020-07-17 03:54| 222936|Shakin' Off The Rust|\n", | |
"| Wolfmother|2020-07-17 04:11| 27963| Joker And The Thief|\n", | |
"| Guns N' Roses|2020-07-17 06:11| 0| Paradise City|\n", | |
"+-----------------+----------------+--------+--------------------+\n", | |
"only showing top 10 rows\n", | |
"\n" | |
] | |
}, | |
{ | |
"data": { | |
"text/plain": [ | |
"import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}\r\n", | |
"streamHistorySchema: org.apache.spark.sql.types.StructType = StructType(StructField(artistName,StringType,true), StructField(endTime,StringType,true), StructField(msPlayed,LongType,true), StructField(trackName,StringType,true))\r\n", | |
"streamHistoryReader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@6ae2589b\r\n", | |
"streamHistory: org.apache.spark.sql.DataFrame = [artistName: string, endTime: string ... 2 more fields]\r\n" | |
] | |
}, | |
"execution_count": 2, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}\n", | |
"\n", | |
"val streamHistorySchema = StructType(Array(\n", | |
" StructField(\"artistName\", StringType, true),\n", | |
" StructField(\"endTime\", StringType, true),\n", | |
" StructField(\"msPlayed\", LongType, true),\n", | |
" StructField(\"trackName\", StringType, true)\n", | |
" ))\n", | |
"\n", | |
"val streamHistoryReader = spark.read.\n", | |
"option(\"multiline\",\"true\").\n", | |
"format(\"json\").\n", | |
"schema(streamHistorySchema)\n", | |
"\n", | |
"var streamHistory = streamHistoryReader.load()\n", | |
"streamHistoryFilePaths.foreach(p => streamHistory = streamHistory.union(streamHistoryReader.load(p)))\n", | |
"\n", | |
"streamHistory.show(10)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"I tried to load multiple json files using a single DataFrameReader method, and failed miserably. \n", | |
"In the interest of time I just iterated and union'ed the json files into my DataFrame\n", | |
"(Sorry Scala, my DataFrame was not immutable)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"//val streamHistory = spark.read.option(\"multiline\",\"true\").json(streamHistoryFilePaths)\n", | |
"// *****************************\n", | |
"// I don't know why it doesnt take the array. \n", | |
"// Now I know why: \n", | |
"// json doesn't take Array[String]\n", | |
"// https://github.com/apache/spark/blob/82464fb2e02ca4e4d425017815090497b79dc93f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L233 \n", | |
"// Load does takes an Array[String]\n", | |
"// https://github.com/apache/spark/blob/82464fb2e02ca4e4d425017815090497b79dc93f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L133\n", | |
"\n", | |
"// ***************************** \n", | |
"// It looks like json doesnt take multiple files as path or load\n", | |
"//val streamHistory = spark.read.option(\"multiline\",\"true\").format(\"json\").schema(streamHistorySchema).option(\"paths\", streamHistoryFilePaths.mkString(\",\")).load()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Let's peek the date ranges for my data dump" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+-------+----------------+\n", | |
"|summary| endTime|\n", | |
"+-------+----------------+\n", | |
"| count| 16369|\n", | |
"| mean| null|\n", | |
"| stddev| null|\n", | |
"| min|2020-07-17 03:31|\n", | |
"| max|2021-07-16 22:22|\n", | |
"+-------+----------------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"streamHistory.describe(\"endTime\").show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Third, Some Quick Insights\n" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [], | |
"source": [ | |
"streamHistory.createOrReplaceTempView(\"stream_history\")" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+--------------------+-------+\n", | |
"| artistName|minutes|\n", | |
"+--------------------+-------+\n", | |
"| Pearl Jam| 2351.5|\n", | |
"| Metallica|2059.21|\n", | |
"|Bob Marley & The ...|1231.93|\n", | |
"| The Beatles|1109.93|\n", | |
"|Rage Against The ...|1076.81|\n", | |
"| Laguna Pai|1051.49|\n", | |
"|Red Hot Chili Pep...| 818.37|\n", | |
"| The Doors| 693.38|\n", | |
"|The Smashing Pump...| 690.26|\n", | |
"| Daft Punk| 643.83|\n", | |
"+--------------------+-------+\n", | |
"only showing top 10 rows\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"spark.sql(\"\"\"\n", | |
"select artistName, round(sum(msPlayed) / (60 * 1000), 2) as minutes\n", | |
"from stream_history \n", | |
"group by artistName \n", | |
"order by sum(msPlayed) desc\"\"\").show(10)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+--------------------+--------------------+-------+\n", | |
"| trackName| artistName|minutes|\n", | |
"+--------------------+--------------------+-------+\n", | |
"| Canales| Laguna Pai| 354.21|\n", | |
"| Rock Fort Rock| The Skatalites| 337.31|\n", | |
"| Somos Pocos| Laguna Pai| 293.85|\n", | |
"| Roller Skates| Steel Pulse| 267.31|\n", | |
"|Hoping (Herbert's...| Louie Austen| 254.19|\n", | |
"| Jerusalem| Alpha Blondy| 241.35|\n", | |
"| Riders on the Storm| The Doors| 237.24|\n", | |
"| Night Nurse| Gregory Isaacs| 215.67|\n", | |
"| Killing In The Name|Rage Against The ...| 207.24|\n", | |
"|And The Living Is...| Guts| 201.28|\n", | |
"+--------------------+--------------------+-------+\n", | |
"only showing top 10 rows\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"spark.sql(\"\"\"\n", | |
"select trackName, artistName, round(sum(msPlayed) / (60 * 1000), 2) as minutes\n", | |
"from stream_history \n", | |
"group by trackName, artistName \n", | |
"order by sum(msPlayed) desc\"\"\")\n", | |
".show(10)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"+--------------------+-------+--------+-------+\n", | |
"| artistName| month|msPlayed|minutes|\n", | |
"+--------------------+-------+--------+-------+\n", | |
"| Mattiel|2020-07| 2266116| 37.77|\n", | |
"| Laguna Pai|2020-08| 6881334| 114.69|\n", | |
"| Metallica|2020-09| 7155300| 119.26|\n", | |
"|Rage Against The ...|2020-10|10145011| 169.08|\n", | |
"| Pearl Jam|2020-11| 6818054| 113.63|\n", | |
"| Apache|2020-12|16384105| 273.07|\n", | |
"| Guts|2021-01| 7421847| 123.7|\n", | |
"| Laguna Pai|2021-02| 7873529| 131.23|\n", | |
"| Alice In Chains|2021-03|17728951| 295.48|\n", | |
"| Metallica|2021-04|45733120| 762.22|\n", | |
"| Pearl Jam|2021-05|50962015| 849.37|\n", | |
"| The Beatles|2021-06|25730009| 428.83|\n", | |
"| The Beatles|2021-07|21899167| 364.99|\n", | |
"+--------------------+-------+--------+-------+\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"spark.sql(\"\"\"\n", | |
"select artistName, month, msPlayed, round(msPlayed / (60 * 1000), 2) as minutes\n", | |
"from \n", | |
"(\n", | |
" select *, \n", | |
" ROW_NUMBER() OVER (PARTITION BY month ORDER BY msPlayed DESC) as rank \n", | |
" from \n", | |
" (\n", | |
" select artistName, substring(endTime, 1, 7) as month, sum(msPlayed) as msPLayed\n", | |
" from stream_history \n", | |
" group by artistName, substring(endTime, 1, 7)\n", | |
" order by substring(endTime, 1, 7) asc, sum(msPlayed) desc\n", | |
" )\n", | |
")\n", | |
"where rank = 1\n", | |
"order by month, rank\n", | |
"\"\"\")\n", | |
".show()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## That's all folks (for now)\n", | |
"I'm planning to build some more interesting things on top of this, so this is just the first element from a collection of insights based on spotify's data." | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "spylon-kernel", | |
"language": "scala", | |
"name": "spylon-kernel" | |
}, | |
"language_info": { | |
"codemirror_mode": "text/x-scala", | |
"file_extension": ".scala", | |
"help_links": [ | |
{ | |
"text": "MetaKernel Magics", | |
"url": "https://metakernel.readthedocs.io/en/latest/source/README.html" | |
} | |
], | |
"mimetype": "text/x-scala", | |
"name": "scala", | |
"pygments_lexer": "scala", | |
"version": "0.4.1" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment