joseph-higaki/read_spotify_user_streams.ipynb

## read_spotify_user_streams.ipynb
{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Understanding my Spotify Listening Habits - Part 0\n",
    "\n",
    "I like that time of the year when Spotify sends those gorgeous visuals of your listening habits.\n",
    "\n",
    "I couldn't wait any longer, I requested for my data [here](https://www.spotify.com/account/privacy).\n",
    "\n",
    "After following the instructions, a few days after I received a link to download some json files with my personal information, amongst them my *last 12 months of streaming history*.\n",
    "\n",
    "To get us started, I'm just trying to get my:\n",
    "* 10 most played artists \n",
    "* 10 most played songs\n",
    "* my most played artist each month\n",
    "\n",
    "![Dataiku favorite bands MoM](https://user-images.githubusercontent.com/11904085/128270731-c8bb313f-31ab-4592-8510-90f6571ae18f.png)\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  I'm 👴🏿...  so I like strong types and curly braces\n",
    "Let's use **Scala**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## First things first, the json files\n",
    "Each Spotify's StreamingHistory_.json file has a 10K record limit, you may have several of them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Intitializing Scala interpreter ..."
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "Spark Web UI available at http://DESKTOP-DOS8M7K:4041\n",
       "SparkContext available as 'sc' (version = 3.1.2, master = local[*], app id = local-1628122781306)\n",
       "SparkSession available as 'spark'\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "import java.io.File\r\n",
       "path: String = data/joseph.higaki/my_spotify_data/MyData/\r\n",
       "streamHistoryFilePaths: Array[String] = Array(data/joseph.higaki/my_spotify_data/MyData/StreamingHistory0.json, data/joseph.higaki/my_spotify_data/MyData/StreamingHistory1.json)\r\n"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import java.io.File\n",
    "val path = \"data/joseph.higaki/my_spotify_data/MyData/\" \n",
    "val streamHistoryFilePaths = new File(path).listFiles\n",
    ".filter(_.getName.startsWith(\"StreamingHistory\"))\n",
    ".map(_.getPath().replace(\"\\\\\",\"/\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Second, put them into DataFrames"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-----------------+----------------+--------+--------------------+\n",
      "|       artistName|         endTime|msPlayed|           trackName|\n",
      "+-----------------+----------------+--------+--------------------+\n",
      "|   The Black Keys|2020-07-17 03:31|  191798|     Howlin' For You|\n",
      "|      Royal Blood|2020-07-17 03:34|  183578|       Figure It Out|\n",
      "|The White Stripes|2020-07-17 03:38|  231920|   Seven Nation Army|\n",
      "|Cage The Elephant|2020-07-17 03:41|  175493|Ain't No Rest For...|\n",
      "|         Des Rocs|2020-07-17 03:44|  200933|               Wayne|\n",
      "|   Barns Courtney|2020-07-17 03:47|  197510|                Fire|\n",
      "|      Dirty Honey|2020-07-17 03:51|  204987|       When I'm Gone|\n",
      "|  The Blue Stones|2020-07-17 03:54|  222936|Shakin' Off The Rust|\n",
      "|       Wolfmother|2020-07-17 04:11|   27963| Joker And The Thief|\n",
      "|    Guns N' Roses|2020-07-17 06:11|       0|       Paradise City|\n",
      "+-----------------+----------------+--------+--------------------+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}\r\n",
       "streamHistorySchema: org.apache.spark.sql.types.StructType = StructType(StructField(artistName,StringType,true), StructField(endTime,StringType,true), StructField(msPlayed,LongType,true), StructField(trackName,StringType,true))\r\n",
       "streamHistoryReader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@6ae2589b\r\n",
       "streamHistory: org.apache.spark.sql.DataFrame = [artistName: string, endTime: string ... 2 more fields]\r\n"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}\n",
    "\n",
    "val streamHistorySchema = StructType(Array(\n",
    "    StructField(\"artistName\", StringType, true),\n",
    "    StructField(\"endTime\", StringType, true),\n",
    "    StructField(\"msPlayed\", LongType, true),\n",
    "    StructField(\"trackName\", StringType, true)\n",
    "  ))\n",
    "\n",
    "val streamHistoryReader = spark.read.\n",
    "option(\"multiline\",\"true\").\n",
    "format(\"json\").\n",
    "schema(streamHistorySchema)\n",
    "\n",
    "var streamHistory = streamHistoryReader.load()\n",
    "streamHistoryFilePaths.foreach(p => streamHistory = streamHistory.union(streamHistoryReader.load(p)))\n",
    "\n",
    "streamHistory.show(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I tried to load multiple json files using a single DataFrameReader method, and failed miserably. \n",
    "In the interest of time I just iterated and union'ed the json files into my DataFrame\n",
    "(Sorry Scala, my DataFrame was not immutable)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "//val streamHistory = spark.read.option(\"multiline\",\"true\").json(streamHistoryFilePaths)\n",
    "// *****************************\n",
    "// I don't know why it doesnt take the array. \n",
    "// Now I know why: \n",
    "// json doesn't take Array[String]\n",
    "// https://github.com/apache/spark/blob/82464fb2e02ca4e4d425017815090497b79dc93f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L233 \n",
    "// Load does takes an Array[String]\n",
    "// https://github.com/apache/spark/blob/82464fb2e02ca4e4d425017815090497b79dc93f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L133\n",
    "\n",
    "// ***************************** \n",
    "// It looks like json doesnt take multiple files as path or load\n",
    "//val streamHistory = spark.read.option(\"multiline\",\"true\").format(\"json\").schema(streamHistorySchema).option(\"paths\", streamHistoryFilePaths.mkString(\",\")).load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's peek the date ranges for my data dump"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+-------+----------------+\n",
      "|summary|         endTime|\n",
      "+-------+----------------+\n",
      "|  count|           16369|\n",
      "|   mean|            null|\n",
      "| stddev|            null|\n",
      "|    min|2020-07-17 03:31|\n",
      "|    max|2021-07-16 22:22|\n",
      "+-------+----------------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "streamHistory.describe(\"endTime\").show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Third, Some Quick Insights\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "streamHistory.createOrReplaceTempView(\"stream_history\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+-------+\n",
      "|          artistName|minutes|\n",
      "+--------------------+-------+\n",
      "|           Pearl Jam| 2351.5|\n",
      "|           Metallica|2059.21|\n",
      "|Bob Marley & The ...|1231.93|\n",
      "|         The Beatles|1109.93|\n",
      "|Rage Against The ...|1076.81|\n",
      "|          Laguna Pai|1051.49|\n",
      "|Red Hot Chili Pep...| 818.37|\n",
      "|           The Doors| 693.38|\n",
      "|The Smashing Pump...| 690.26|\n",
      "|           Daft Punk| 643.83|\n",
      "+--------------------+-------+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "spark.sql(\"\"\"\n",
    "select artistName, round(sum(msPlayed) / (60 * 1000), 2) as minutes\n",
    "from stream_history \n",
    "group by artistName \n",
    "order by sum(msPlayed) desc\"\"\").show(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+--------------------+-------+\n",
      "|           trackName|          artistName|minutes|\n",
      "+--------------------+--------------------+-------+\n",
      "|             Canales|          Laguna Pai| 354.21|\n",
      "|      Rock Fort Rock|      The Skatalites| 337.31|\n",
      "|         Somos Pocos|          Laguna Pai| 293.85|\n",
      "|       Roller Skates|         Steel Pulse| 267.31|\n",
      "|Hoping (Herbert's...|        Louie Austen| 254.19|\n",
      "|           Jerusalem|        Alpha Blondy| 241.35|\n",
      "| Riders on the Storm|           The Doors| 237.24|\n",
      "|         Night Nurse|      Gregory Isaacs| 215.67|\n",
      "| Killing In The Name|Rage Against The ...| 207.24|\n",
      "|And The Living Is...|                Guts| 201.28|\n",
      "+--------------------+--------------------+-------+\n",
      "only showing top 10 rows\n",
      "\n"
     ]
    }
   ],
   "source": [
    "spark.sql(\"\"\"\n",
    "select trackName, artistName, round(sum(msPlayed) / (60 * 1000), 2) as minutes\n",
    "from stream_history \n",
    "group by trackName, artistName \n",
    "order by sum(msPlayed) desc\"\"\")\n",
    ".show(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "+--------------------+-------+--------+-------+\n",
      "|          artistName|  month|msPlayed|minutes|\n",
      "+--------------------+-------+--------+-------+\n",
      "|             Mattiel|2020-07| 2266116|  37.77|\n",
      "|          Laguna Pai|2020-08| 6881334| 114.69|\n",
      "|           Metallica|2020-09| 7155300| 119.26|\n",
      "|Rage Against The ...|2020-10|10145011| 169.08|\n",
      "|           Pearl Jam|2020-11| 6818054| 113.63|\n",
      "|              Apache|2020-12|16384105| 273.07|\n",
      "|                Guts|2021-01| 7421847|  123.7|\n",
      "|          Laguna Pai|2021-02| 7873529| 131.23|\n",
      "|     Alice In Chains|2021-03|17728951| 295.48|\n",
      "|           Metallica|2021-04|45733120| 762.22|\n",
      "|           Pearl Jam|2021-05|50962015| 849.37|\n",
      "|         The Beatles|2021-06|25730009| 428.83|\n",
      "|         The Beatles|2021-07|21899167| 364.99|\n",
      "+--------------------+-------+--------+-------+\n",
      "\n"
     ]
    }
   ],
   "source": [
    "spark.sql(\"\"\"\n",
    "select artistName, month, msPlayed, round(msPlayed / (60 * 1000), 2) as minutes\n",
    "from \n",
    "(\n",
    "    select *, \n",
    "    ROW_NUMBER() OVER (PARTITION BY month ORDER BY msPlayed DESC) as rank         \n",
    "    from \n",
    "    (\n",
    "        select artistName, substring(endTime, 1, 7) as month, sum(msPlayed) as msPLayed\n",
    "        from stream_history \n",
    "        group by artistName, substring(endTime, 1, 7)\n",
    "        order by substring(endTime, 1, 7) asc, sum(msPlayed) desc\n",
    "    )\n",
    ")\n",
    "where rank = 1\n",
    "order by month, rank\n",
    "\"\"\")\n",
    ".show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## That's all folks (for now)\n",
    "I'm planning to build some more interesting things on top of this, so this is just the first element from a collection of insights based on spotify's data."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "spylon-kernel",
   "language": "scala",
   "name": "spylon-kernel"
  },
  "language_info": {
   "codemirror_mode": "text/x-scala",
   "file_extension": ".scala",
   "help_links": [
    {
     "text": "MetaKernel Magics",
     "url": "https://metakernel.readthedocs.io/en/latest/source/README.html"
    }
   ],
   "mimetype": "text/x-scala",
   "name": "scala",
   "pygments_lexer": "scala",
   "version": "0.4.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
	{
	"cells": [
	{
	"attachments": {},
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Understanding my Spotify Listening Habits - Part 0\n",
	"\n",
	"I like that time of the year when Spotify sends those gorgeous visuals of your listening habits.\n",
	"\n",
	"I couldn't wait any longer, I requested for my data [here](https://www.spotify.com/account/privacy).\n",
	"\n",
	"After following the instructions, a few days after I received a link to download some json files with my personal information, amongst them my last 12 months of streaming history.\n",
	"\n",
	"To get us started, I'm just trying to get my:\n",
	"* 10 most played artists \n",
	"* 10 most played songs\n",
	"* my most played artist each month\n",
	"\n",
	"![Dataiku favorite bands MoM](https://user-images.githubusercontent.com/11904085/128270731-c8bb313f-31ab-4592-8510-90f6571ae18f.png)\""
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## I'm 👴🏿... so I like strong types and curly braces\n",
	"Let's use Scala"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## First things first, the json files\n",
	"Each Spotify's StreamingHistory_.json file has a 10K record limit, you may have several of them"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 1,
	"metadata": {},
	"outputs": [
	{
	"data": {
	"text/plain": [
	"Intitializing Scala interpreter ..."
	]
	},
	"metadata": {},
	"output_type": "display_data"
	},
	{
	"data": {
	"text/plain": [
	"Spark Web UI available at http://DESKTOP-DOS8M7K:4041\n",
	"SparkContext available as 'sc' (version = 3.1.2, master = local[*], app id = local-1628122781306)\n",
	"SparkSession available as 'spark'\n"
	]
	},
	"metadata": {},
	"output_type": "display_data"
	},
	{
	"data": {
	"text/plain": [
	"import java.io.File\r\n",
	"path: String = data/joseph.higaki/my_spotify_data/MyData/\r\n",
	"streamHistoryFilePaths: Array[String] = Array(data/joseph.higaki/my_spotify_data/MyData/StreamingHistory0.json, data/joseph.higaki/my_spotify_data/MyData/StreamingHistory1.json)\r\n"
	]
	},
	"execution_count": 1,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"import java.io.File\n",
	"val path = \"data/joseph.higaki/my_spotify_data/MyData/\" \n",
	"val streamHistoryFilePaths = new File(path).listFiles\n",
	".filter(_.getName.startsWith(\"StreamingHistory\"))\n",
	".map(_.getPath().replace(\"\\\\\",\"/\"))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Second, put them into DataFrames"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 2,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"+-----------------+----------------+--------+--------------------+\n",
	"\| artistName\| endTime\|msPlayed\| trackName\|\n",
	"+-----------------+----------------+--------+--------------------+\n",
	"\| The Black Keys\|2020-07-17 03:31\| 191798\| Howlin' For You\|\n",
	"\| Royal Blood\|2020-07-17 03:34\| 183578\| Figure It Out\|\n",
	"\|The White Stripes\|2020-07-17 03:38\| 231920\| Seven Nation Army\|\n",
	"\|Cage The Elephant\|2020-07-17 03:41\| 175493\|Ain't No Rest For...\|\n",
	"\| Des Rocs\|2020-07-17 03:44\| 200933\| Wayne\|\n",
	"\| Barns Courtney\|2020-07-17 03:47\| 197510\| Fire\|\n",
	"\| Dirty Honey\|2020-07-17 03:51\| 204987\| When I'm Gone\|\n",
	"\| The Blue Stones\|2020-07-17 03:54\| 222936\|Shakin' Off The Rust\|\n",
	"\| Wolfmother\|2020-07-17 04:11\| 27963\| Joker And The Thief\|\n",
	"\| Guns N' Roses\|2020-07-17 06:11\| 0\| Paradise City\|\n",
	"+-----------------+----------------+--------+--------------------+\n",
	"only showing top 10 rows\n",
	"\n"
	]
	},
	{
	"data": {
	"text/plain": [
	"import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}\r\n",
	"streamHistorySchema: org.apache.spark.sql.types.StructType = StructType(StructField(artistName,StringType,true), StructField(endTime,StringType,true), StructField(msPlayed,LongType,true), StructField(trackName,StringType,true))\r\n",
	"streamHistoryReader: org.apache.spark.sql.DataFrameReader = org.apache.spark.sql.DataFrameReader@6ae2589b\r\n",
	"streamHistory: org.apache.spark.sql.DataFrame = [artistName: string, endTime: string ... 2 more fields]\r\n"
	]
	},
	"execution_count": 2,
	"metadata": {},
	"output_type": "execute_result"
	}
	],
	"source": [
	"import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}\n",
	"\n",
	"val streamHistorySchema = StructType(Array(\n",
	" StructField(\"artistName\", StringType, true),\n",
	" StructField(\"endTime\", StringType, true),\n",
	" StructField(\"msPlayed\", LongType, true),\n",
	" StructField(\"trackName\", StringType, true)\n",
	" ))\n",
	"\n",
	"val streamHistoryReader = spark.read.\n",
	"option(\"multiline\",\"true\").\n",
	"format(\"json\").\n",
	"schema(streamHistorySchema)\n",
	"\n",
	"var streamHistory = streamHistoryReader.load()\n",
	"streamHistoryFilePaths.foreach(p => streamHistory = streamHistory.union(streamHistoryReader.load(p)))\n",
	"\n",
	"streamHistory.show(10)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"I tried to load multiple json files using a single DataFrameReader method, and failed miserably. \n",
	"In the interest of time I just iterated and union'ed the json files into my DataFrame\n",
	"(Sorry Scala, my DataFrame was not immutable)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 3,
	"metadata": {},
	"outputs": [],
	"source": [
	"//val streamHistory = spark.read.option(\"multiline\",\"true\").json(streamHistoryFilePaths)\n",
	"// *****************************\n",
	"// I don't know why it doesnt take the array. \n",
	"// Now I know why: \n",
	"// json doesn't take Array[String]\n",
	"// https://github.com/apache/spark/blob/82464fb2e02ca4e4d425017815090497b79dc93f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L233 \n",
	"// Load does takes an Array[String]\n",
	"// https://github.com/apache/spark/blob/82464fb2e02ca4e4d425017815090497b79dc93f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L133\n",
	"\n",
	"// ***************************** \n",
	"// It looks like json doesnt take multiple files as path or load\n",
	"//val streamHistory = spark.read.option(\"multiline\",\"true\").format(\"json\").schema(streamHistorySchema).option(\"paths\", streamHistoryFilePaths.mkString(\",\")).load()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Let's peek the date ranges for my data dump"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 4,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"+-------+----------------+\n",
	"\|summary\| endTime\|\n",
	"+-------+----------------+\n",
	"\| count\| 16369\|\n",
	"\| mean\| null\|\n",
	"\| stddev\| null\|\n",
	"\| min\|2020-07-17 03:31\|\n",
	"\| max\|2021-07-16 22:22\|\n",
	"+-------+----------------+\n",
	"\n"
	]
	}
	],
	"source": [
	"streamHistory.describe(\"endTime\").show()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Third, Some Quick Insights\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 5,
	"metadata": {
	"scrolled": true
	},
	"outputs": [],
	"source": [
	"streamHistory.createOrReplaceTempView(\"stream_history\")"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 6,
	"metadata": {
	"scrolled": true
	},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"+--------------------+-------+\n",
	"\| artistName\|minutes\|\n",
	"+--------------------+-------+\n",
	"\| Pearl Jam\| 2351.5\|\n",
	"\| Metallica\|2059.21\|\n",
	"\|Bob Marley & The ...\|1231.93\|\n",
	"\| The Beatles\|1109.93\|\n",
	"\|Rage Against The ...\|1076.81\|\n",
	"\| Laguna Pai\|1051.49\|\n",
	"\|Red Hot Chili Pep...\| 818.37\|\n",
	"\| The Doors\| 693.38\|\n",
	"\|The Smashing Pump...\| 690.26\|\n",
	"\| Daft Punk\| 643.83\|\n",
	"+--------------------+-------+\n",
	"only showing top 10 rows\n",
	"\n"
	]
	}
	],
	"source": [
	"spark.sql(\"\"\"\n",
	"select artistName, round(sum(msPlayed) / (60 * 1000), 2) as minutes\n",
	"from stream_history \n",
	"group by artistName \n",
	"order by sum(msPlayed) desc\"\"\").show(10)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 7,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"+--------------------+--------------------+-------+\n",
	"\| trackName\| artistName\|minutes\|\n",
	"+--------------------+--------------------+-------+\n",
	"\| Canales\| Laguna Pai\| 354.21\|\n",
	"\| Rock Fort Rock\| The Skatalites\| 337.31\|\n",
	"\| Somos Pocos\| Laguna Pai\| 293.85\|\n",
	"\| Roller Skates\| Steel Pulse\| 267.31\|\n",
	"\|Hoping (Herbert's...\| Louie Austen\| 254.19\|\n",
	"\| Jerusalem\| Alpha Blondy\| 241.35\|\n",
	"\| Riders on the Storm\| The Doors\| 237.24\|\n",
	"\| Night Nurse\| Gregory Isaacs\| 215.67\|\n",
	"\| Killing In The Name\|Rage Against The ...\| 207.24\|\n",
	"\|And The Living Is...\| Guts\| 201.28\|\n",
	"+--------------------+--------------------+-------+\n",
	"only showing top 10 rows\n",
	"\n"
	]
	}
	],
	"source": [
	"spark.sql(\"\"\"\n",
	"select trackName, artistName, round(sum(msPlayed) / (60 * 1000), 2) as minutes\n",
	"from stream_history \n",
	"group by trackName, artistName \n",
	"order by sum(msPlayed) desc\"\"\")\n",
	".show(10)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 8,
	"metadata": {},
	"outputs": [
	{
	"name": "stdout",
	"output_type": "stream",
	"text": [
	"+--------------------+-------+--------+-------+\n",
	"\| artistName\| month\|msPlayed\|minutes\|\n",
	"+--------------------+-------+--------+-------+\n",
	"\| Mattiel\|2020-07\| 2266116\| 37.77\|\n",
	"\| Laguna Pai\|2020-08\| 6881334\| 114.69\|\n",
	"\| Metallica\|2020-09\| 7155300\| 119.26\|\n",
	"\|Rage Against The ...\|2020-10\|10145011\| 169.08\|\n",
	"\| Pearl Jam\|2020-11\| 6818054\| 113.63\|\n",
	"\| Apache\|2020-12\|16384105\| 273.07\|\n",
	"\| Guts\|2021-01\| 7421847\| 123.7\|\n",
	"\| Laguna Pai\|2021-02\| 7873529\| 131.23\|\n",
	"\| Alice In Chains\|2021-03\|17728951\| 295.48\|\n",
	"\| Metallica\|2021-04\|45733120\| 762.22\|\n",
	"\| Pearl Jam\|2021-05\|50962015\| 849.37\|\n",
	"\| The Beatles\|2021-06\|25730009\| 428.83\|\n",
	"\| The Beatles\|2021-07\|21899167\| 364.99\|\n",
	"+--------------------+-------+--------+-------+\n",
	"\n"
	]
	}
	],
	"source": [
	"spark.sql(\"\"\"\n",
	"select artistName, month, msPlayed, round(msPlayed / (60 * 1000), 2) as minutes\n",
	"from \n",
	"(\n",
	" select *, \n",
	" ROW_NUMBER() OVER (PARTITION BY month ORDER BY msPlayed DESC) as rank \n",
	" from \n",
	" (\n",
	" select artistName, substring(endTime, 1, 7) as month, sum(msPlayed) as msPLayed\n",
	" from stream_history \n",
	" group by artistName, substring(endTime, 1, 7)\n",
	" order by substring(endTime, 1, 7) asc, sum(msPlayed) desc\n",
	" )\n",
	")\n",
	"where rank = 1\n",
	"order by month, rank\n",
	"\"\"\")\n",
	".show()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## That's all folks (for now)\n",
	"I'm planning to build some more interesting things on top of this, so this is just the first element from a collection of insights based on spotify's data."
	]
	}
	],
	"metadata": {
	"kernelspec": {
	"display_name": "spylon-kernel",
	"language": "scala",
	"name": "spylon-kernel"
	},
	"language_info": {
	"codemirror_mode": "text/x-scala",
	"file_extension": ".scala",
	"help_links": [
	{
	"text": "MetaKernel Magics",
	"url": "https://metakernel.readthedocs.io/en/latest/source/README.html"
	}
	],
	"mimetype": "text/x-scala",
	"name": "scala",
	"pygments_lexer": "scala",
	"version": "0.4.1"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 4
	}