*** create_dataframes ***
## users_df
+----------------+-------+
| user_name|user_id|
+----------------+-------+
| 2814-6890| 1|
| Alan_New| 2|
| Aleksanderis| 3|
--------------------------
## recordings_df
+--------------------+--------------------+------------+
| recording_mbid| artist_mbids|recording_id|
+--------------------+--------------------+------------+
|00026127-dbca-454...|[c17f08f4-2542-46...| 1|
|00045795-60e4-443...|[e893c7b0-9861-45...| 2|
|000b104a-8aba-4ce...|[183d6ef6-e161-47...| 3|
--------------------------------------------------------
*** Here, we know that (recording_mbid, artist_mbids, track_name, artist_name) will always be unique.
*** Keeping recording_mbid distinct here should not be compromised; for instance, fetching the table below (which adds track_name and artist_name) slightly increased the row count.
+--------------------+--------------------+--------------------+--------------------+------------+
| recording_mbid| artist_mbids| track_name| artist_name|recording_id|
+--------------------+--------------------+--------------------+--------------------+------------+
|00026127-dbca-454...|[c17f08f4-2542-46...| Archetype| Fear Factory| 1|
|00045795-60e4-443...|[e893c7b0-9861-45...| Voor Jou| Karin Bloemen| 2|
|000b104a-8aba-4ce...|[183d6ef6-e161-47...|They'll Need A Crane|They Might Be Giants| 3|
|000bcdbb-de53-480...|[558168a3-fa3d-4f...|A Basic Urge to Kill| Wolfpack| 4|
-------------------------------------------------------------------------------------------------
## playcounts_df
+-------+------------+-----+
|user_id|recording_id|count|
+-------+------------+-----+
| 302| 414| 2|
| 368| 4676| 6|
| 16| 11868| 8|
---------------------------
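*** For reference, a minimal sketch of how these three dataframes could be derived from the mapped listens.
*** mapped_listens_df and the use of row_number() for id assignment are assumptions for illustration here,
*** not the actual create_dataframes implementation.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# users_df: one row per user, with a dense integer user_id.
# (A global Window without partitionBy pulls everything onto one partition;
# acceptable for a sketch, not at production scale.)
users_df = mapped_listens_df.select('user_name').distinct() \
    .withColumn('user_id', F.row_number().over(Window.orderBy('user_name')))

# recordings_df: one row per distinct recording, with a dense recording_id.
recordings_df = mapped_listens_df.select('recording_mbid', 'artist_mbids').distinct() \
    .withColumn('recording_id', F.row_number().over(Window.orderBy('recording_mbid')))

# playcounts_df: how many times each user played each recording.
playcounts_df = mapped_listens_df \
    .join(users_df, 'user_name') \
    .join(recordings_df, 'recording_mbid') \
    .groupBy('user_id', 'recording_id') \
    .agg(F.count('*').alias('count'))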
*** create_dataframes looks fine as of now.
*** We have saved the mapped listens to be used in future scripts (no need to join again and again to get the mbid->msid mapping).
*** Just ensure that the listens used to get the top artists of a user are a subset of the listens used to train
*** the model, so that the mapping can be reused.
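*** That subset condition can be sanity-checked cheaply. A sketch, assuming both dataframes carry
*** (user_name, recording_msid, listened_at) as identifying columns:
# A left_anti join keeps the rows of the left side that have NO match on the
# right; if any top-artist listen is missing from the training listens, the
# saved mbid->msid mapping cannot be reused for it.
stray_listens = top_artist_listens_df.join(
    train_listens_df, ['user_name', 'recording_msid', 'listened_at'], 'left_anti')
assert stray_listens.count() == 0, 'top-artist listens are not a subset of training listens'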
----------------------------------------------------------------------------------------------------------------------
*** candidate_sets ***
*** Introducing the mapping in candidate sets has been a lot of work.
*** Note that runtime has not been optimized in this first cut after introducing the mapping; that is a whole different
*** process. The generation of HTML files has led to fetching extra data and performing extra operations, so removing
*** them will reduce the overall time IMO. I think it will be a good idea to remove the HTML files after the first
*** release, since right now we need them to monitor data within the community.
## artist_artist_relation
+--------------------+--------------------+--------------------+--------------------+-----+
| artist_mbid_0| artist_mbid_1| artist_name_0| artist_name_1|count|
+--------------------+--------------------+--------------------+--------------------+-----+
|352749d3-93dd-422...|4da3742f-8b2f-49e...| Ray Lynch| The Prairie Cartel| 4|
|352749d3-93dd-422...|dc453ee9-f2d4-458...| Ray Lynch| Баста| 4|
|afdb7919-059d-43c...|e204ed91-3684-456...| Marvin Gaye| Sick of It All| 4|
------------------------------------------------------------------------------------------
## top_artist_df
+--------------------+--------------------+
| artist_mbid| artist_name|
+--------------------+--------------------+
|bbd80354-597e-4d5...| Saxon|
|0ce302fe-c064-4bf...| ミニモニ。 |
|17b53d9f-5c63-4a0...| The Kinks|
|65f4f0c5-ef9e-490...| Metallica|
|6907a049-d207-454...| The Firm|
-------------------------------------------
top_artists_df = listens_df.groupBy('user_name', 'artist_mbids', 'artist_name').agg(count('artist_mbids')) \
    .where(col('user_name') == user_name).limit(config.TOP_ARTISTS_LIMIT) \
    .select(explode('artist_mbids').alias('artist_mbid'), 'artist_name')
*** Let us analyse the query. After choosing the top artists for a user, we explode the artist_mbids column.
*** artist_mbids is a list of artist MBIDs; we explode the column for two reasons:
*** 1) artist_mbid in artist_artist_relation is of string type, so we cannot compare the ArrayType column
*** (artist_mbids) with it to get similar artists.
*** 2) if a user prefers an artist with artist_mbids [a,b,c], it can be inferred that the user likes all three
*** artists, and hence we try to get similar artists for all three.
## before explode
+------------+-----------+
|artist_mbids|artist_name|
+------------+-----------+
|[a,b,c]     |          X|
---------------------------
## after explode
+-----------+-----------+
|artist_mbid|artist_name|
+-----------+-----------+
|a          |          X|
|b          |          X|
|c          |          X|
-------------------------
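*** The explode behaviour above can be reproduced in isolation with toy data (a self-contained sketch):
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(['a', 'b', 'c'], 'X')], ['artist_mbids', 'artist_name'])

# explode() emits one row per array element, repeating the other columns.
df.select(explode('artist_mbids').alias('artist_mbid'), 'artist_name').show()
# +-----------+-----------+
# |artist_mbid|artist_name|
# +-----------+-----------+
# |          a|          X|
# |          b|          X|
# |          c|          X|
# +-----------+-----------+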
## top_artist_recording_ids
+-------+------------+
|user_id|recording_id|
+-------+------------+
| 14| 327|
| 14| 451|
| 14| 473|
----------------------
# Explode recordings_df so that each (artist_mbid, recording) pair is its own row,
# then keep the recordings whose artist is one of the user's top artists.
df = recordings_df.select('recording_id', 'recording_mbid', explode('artist_mbids').alias('artist_mbid'))
top_artists_recording_ids_df = top_artists_df.join(df, 'artist_mbid', 'inner') \
    .withColumn('user_id', lit(user_id)).select('user_id', 'recording_id')
*** Here we have used explode on recordings_df as well.
## before explode
+--------------------+--------------------+------------+
|      recording_mbid|        artist_mbids|recording_id|
+--------------------+--------------------+------------+
|0                   |[a,b,c]             |           1|
--------------------------------------------------------
## after explode
+--------------------+--------------------+------------+
|      recording_mbid|         artist_mbid|recording_id|
+--------------------+--------------------+------------+
|0                   |a                   |           1|
|0                   |b                   |           1|
|0                   |c                   |           1|
--------------------------------------------------------
*** An important inference here is that if we try to get recording ids corresponding to each artist_mbid, the result would look like:
+-------+------------+
|user_id|recording_id|
+-------+------------+
| 14| 1|
| 14| 1|
| 14| 1|
---------------------
*** So if, by chance, recording_id = 1 is amongst the highest scorers and is chosen as one of the recommendations, we will
*** have three duplicate recordings, and the number of distinct recommendations will decrease.
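*** One way to collapse such rows at the candidate-set stage (a sketch, not the current code) is to drop
*** duplicate (user_id, recording_id) pairs before they reach the recommender:
# One (user_id, recording_id) pair is enough for the candidate set; extra
# copies only inflate it and later resurface as repeated recommendations.
top_artists_recording_ids_df = top_artists_recording_ids_df \
    .dropDuplicates(['user_id', 'recording_id'])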
*** Such duplication can increase further in the case of similar_artists. Let us have a look.
*** Note that we have ensured that similar artists and top artists don't intersect, so that users can have diverse playlists.
+--------------------+-------------------+------------+
| similar_artist_mbid|similar_artist_name| artist_name|
+--------------------+-------------------+------------+
|u                   |                  W|           L|
|v                   |                  X|           L|
|w                   |                  Z|           L|
|u                   |                  W|           M|
|v                   |                  X|           M|
|w                   |                  Z|           M|
-------------------------------------------------------
*** Here, artists L and M have common similar artists. Also, u, v, w formed a collaboration, so they were assigned one
*** recording id; after explode, the three appear in different rows with the same recording_id. So we have six equal
*** recording ids. If this particular id is recommended, then we have even more duplicate recommendations.
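*** For context, a table like the one above could come from a symmetric lookup in artist_artist_relation;
*** a sketch (the actual query may differ), since a top artist can appear on either side of the relation:
from pyspark.sql.functions import col

left = artist_artist_relation_df \
    .join(top_artist_df, artist_artist_relation_df.artist_mbid_0 == top_artist_df.artist_mbid) \
    .select(col('artist_mbid_1').alias('similar_artist_mbid'),
            col('artist_name_1').alias('similar_artist_name'),
            'artist_name')
right = artist_artist_relation_df \
    .join(top_artist_df, artist_artist_relation_df.artist_mbid_1 == top_artist_df.artist_mbid) \
    .select(col('artist_mbid_0').alias('similar_artist_mbid'),
            col('artist_name_0').alias('similar_artist_name'),
            'artist_name')
similar_artists_df = left.union(right)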
*** There are two options. The first is to change the data type of mbid in artist_relation to array, so it will be
*** something like [a]; in this case we will (most probably) not get duplicate recordings, but we will miss out on
*** many songs. Many recordings can be collaborations of artists in which 'a' appears, like [a,b], [a,b,c], [a,z],
*** but when we compare the two arrays they won't be equal and we will not be able to choose these recordings.
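*** The comparison problem with the first option is easy to demonstrate: plain array equality misses collaborations,
*** while a membership test such as array_contains keeps them (shown only to illustrate the trade-off, not as the
*** chosen fix; recordings_df is assumed from above):
from pyspark.sql.functions import array, array_contains, col, lit

# Exact equality matches only solo recordings by 'a': [a] == [a,b] is False,
# so collaborations like [a,b] or [a,z] are silently dropped.
solo_only = recordings_df.where(col('artist_mbids') == array(lit('a')))

# A membership test keeps the collaborations, but reintroduces the explode-style
# duplication whenever several similar artists appear on the same recording.
with_collabs = recordings_df.where(array_contains(col('artist_mbids'), 'a'))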
*** If we go with the explode option, there can be many, many duplicates, but we have all possible songs of that
*** artist. One possible solution, given that we have sufficient data, is to fetch more recommendations than
*** required (e.g. fetch 100 if you need 20) and remove the duplicates.
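*** That could look roughly like this (a sketch; scored_candidates_df, the score column, and the limits are assumptions):
from pyspark.sql.functions import col

NUM_RECOMMENDATIONS = 20
OVER_FETCH_FACTOR = 5

# Take more of the highest-scoring candidates than we need, collapse duplicate
# recording ids, then keep the top N of what survives.
recommendations_df = scored_candidates_df \
    .orderBy(col('score').desc()) \
    .limit(NUM_RECOMMENDATIONS * OVER_FETCH_FACTOR) \
    .dropDuplicates(['recording_id']) \
    .orderBy(col('score').desc()) \
    .limit(NUM_RECOMMENDATIONS)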
*** I think at this stage we should not worry much about quality, because the recommendations are okay to be shipped,
*** but yes, duplication must be taken care of. We would not like to recommend the same recording twice to a user.