@Mithrandir0x
Created October 23, 2014 14:42
{
"metadata": {
"name": "nui_prac2_entrega.ipynb"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Lab Assignments for New Uses of Computing (Nous Usos de la Inform\u00e0tica)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Requirements for the lab assignments:"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"1) A simple way to install all the required packages is to install the Anaconda Python distribution: https://store.continuum.io/\n",
"It automatically installs a set of tools (matplotlib, NumPy, SciPy, NetworkX, IPython, pandas, etc.) that make up the scientific computing environment needed for the assignments in this course. The alternative is to install the matplotlib, NumPy, SciPy, NetworkX, IPython and pandas packages individually.\n",
"\n",
"2) Assignments may be submitted either as an IPython notebook containing all the software developed by the student, or simply as a Python module containing all the software needed to run the assignment. Information about IPython: http://ipython.org/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>SUBMISSION: </b>\n",
"The deadline for submitting this assignment is <b>October 18 at 23:55</b>\n",
"\n",
"<b>Submission format</b>\n",
"Submissions are made through the virtual campus. Upload one file per group. The file name must follow this pattern:\n",
"NUI_1_FirstinitialSurnameMember1_FirstinitialSurnameMember2.ipynb\n",
"\n",
"Example: <br>\n",
"Member 1: Maria del Carme Vil\u00e0<br>\n",
"Member 2: Francesc Castell<br>\n",
"\n",
"File name: <b>NUI_1_MVila_FCastell.ipynb</b>\n",
"\n"
]
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Assignment 1. Recommender Systems"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"The MovieLens 1M dataset (http://www.grouplens.org/node/73) contains 1,000,209 ratings of approximately 3,900 movies, made by 6,040 anonymous users of the MovieLens online recommender (http://www.movielens.org/) who joined in 2000. "
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"The contents of the dataset are:\n",
"\n",
"RATINGS FILE DESCRIPTION\n",
"================================================================================\n",
"\n",
"All ratings are contained in the file \"ratings.dat\" and are in the following format:\n",
"\n",
"UserID::MovieID::Rating::Timestamp\n",
"\n",
"- UserIDs range between 1 and 6040 \n",
"- MovieIDs range between 1 and 3952\n",
"- Ratings are made on a 5-star scale (whole-star ratings only)\n",
"- Timestamp is represented in seconds since the epoch as returned by time(2)\n",
"- Each user has at least 20 ratings\n",
"\n",
"USERS FILE DESCRIPTION\n",
"================================================================================\n",
"\n",
"User information is in the file \"users.dat\" and is in the following format:\n",
"\n",
"UserID::Gender::Age::Occupation::Zip-code\n",
"\n",
"All demographic information is provided voluntarily by the users and is not checked for accuracy. Only users who have provided some demographic information are included in this data set.\n",
"\n",
"- Gender is denoted by a \"M\" for male and \"F\" for female\n",
"- Age is chosen from the following ranges:\n",
"\n",
"\t* 1: \"Under 18\"\n",
"\t* 18: \"18-24\"\n",
"\t* 25: \"25-34\"\n",
"\t* 35: \"35-44\"\n",
"\t* 45: \"45-49\"\n",
"\t* 50: \"50-55\"\n",
"\t* 56: \"56+\"\n",
"\n",
"- Occupation is chosen from the following choices:\n",
"\n",
"\t* 0: \"other\" or not specified\n",
"\t* 1: \"academic/educator\"\n",
"\t* 2: \"artist\"\n",
"\t* 3: \"clerical/admin\"\n",
"\t* 4: \"college/grad student\"\n",
"\t* 5: \"customer service\"\n",
"\t* 6: \"doctor/health care\"\n",
"\t* 7: \"executive/managerial\"\n",
"\t* 8: \"farmer\"\n",
"\t* 9: \"homemaker\"\n",
"\t* 10: \"K-12 student\"\n",
"\t* 11: \"lawyer\"\n",
"\t* 12: \"programmer\"\n",
"\t* 13: \"retired\"\n",
"\t* 14: \"sales/marketing\"\n",
"\t* 15: \"scientist\"\n",
"\t* 16: \"self-employed\"\n",
"\t* 17: \"technician/engineer\"\n",
"\t* 18: \"tradesman/craftsman\"\n",
"\t* 19: \"unemployed\"\n",
"\t* 20: \"writer\"\n",
"\n",
"MOVIES FILE DESCRIPTION\n",
"================================================================================\n",
"\n",
"Movie information is in the file \"movies.dat\" and is in the following format:\n",
"\n",
"MovieID::Title::Genres\n",
"\n",
"- Titles are identical to titles provided by the IMDB (including year of release)\n",
"- Genres are pipe-separated and are selected from the following genres:\n",
"\n",
"\t* Action\n",
"\t* Adventure\n",
"\t* Animation\n",
"\t* Children's\n",
"\t* Comedy\n",
"\t* Crime\n",
"\t* Documentary\n",
"\t* Drama\n",
"\t* Fantasy\n",
"\t* Film-Noir\n",
"\t* Horror\n",
"\t* Musical\n",
"\t* Mystery\n",
"\t* Romance\n",
"\t* Sci-Fi\n",
"\t* Thriller\n",
"\t* War\n",
"\t* Western\n",
"\n",
"- Some MovieIDs do not correspond to a movie due to accidental duplicate entries and/or test entries\n",
"- Movies are mostly entered by hand, so errors and inconsistencies may exist\n"
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"Download the dataset and copy it to a local directory on your machine (e.g. \"C:\\Documents and Settings\\UB\\Escritorio\\My Dropbox\\iPython\"). "
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"Read the three tables of the dataset into three pandas DataFrames:"
]
},
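{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"\n",
"# The .dat files use '::' as field separator; a multi-character\n",
"# separator requires the Python parser engine.\n",
"unames = ['user_id', 'gender', 'age', 'occupation', 'zip']\n",
"users = pd.read_table('./data/users.dat', sep='::', header=None, names=unames, engine='python')\n",
"\n",
"rnames = ['user_id', 'movie_id', 'rating', 'timestamp']\n",
"ratings = pd.read_table('./data/ratings.dat', sep='::', header=None, names=rnames, engine='python')\n",
"\n",
"# movies.dat is Latin-1 encoded (some titles contain accented characters)\n",
"mnames = ['movie_id', 'title', 'genres']\n",
"movies = pd.read_table('./data/movies.dat', sep='::', header=None, names=mnames, engine='python', encoding='latin-1')"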
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(users[:10])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" user_id gender age occupation zip\n",
"0 1 F 1 10 48067\n",
"1 2 M 56 16 70072\n",
"2 3 M 25 15 55117\n",
"3 4 M 45 7 02460\n",
"4 5 M 25 20 55455\n",
"5 6 F 50 9 55117\n",
"6 7 M 35 1 06810\n",
"7 8 M 25 12 11413\n",
"8 9 M 25 17 61614\n",
"9 10 F 35 1 95370\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"users"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<pre>\n",
"&lt;class 'pandas.core.frame.DataFrame'&gt;\n",
"Int64Index: 6040 entries, 0 to 6039\n",
"Data columns (total 5 columns):\n",
"user_id 6040 non-null values\n",
"gender 6040 non-null values\n",
"age 6040 non-null values\n",
"occupation 6040 non-null values\n",
"zip 6040 non-null values\n",
"dtypes: int64(3), object(2)\n",
"</pre>"
],
"output_type": "pyout",
"prompt_number": 3,
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 6040 entries, 0 to 6039\n",
"Data columns (total 5 columns):\n",
"user_id 6040 non-null values\n",
"gender 6040 non-null values\n",
"age 6040 non-null values\n",
"occupation 6040 non-null values\n",
"zip 6040 non-null values\n",
"dtypes: int64(3), object(2)"
]
}
],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(ratings.groupby(by='movie_id'))"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"<pandas.core.groupby.DataFrameGroupBy object at 0x0000000007F9BCC0>\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ratings"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<pre>\n",
"&lt;class 'pandas.core.frame.DataFrame'&gt;\n",
"Int64Index: 1000209 entries, 0 to 1000208\n",
"Data columns (total 4 columns):\n",
"user_id 1000209 non-null values\n",
"movie_id 1000209 non-null values\n",
"rating 1000209 non-null values\n",
"timestamp 1000209 non-null values\n",
"dtypes: int64(4)\n",
"</pre>"
],
"output_type": "pyout",
"prompt_number": 5,
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 1000209 entries, 0 to 1000208\n",
"Data columns (total 4 columns):\n",
"user_id 1000209 non-null values\n",
"movie_id 1000209 non-null values\n",
"rating 1000209 non-null values\n",
"timestamp 1000209 non-null values\n",
"dtypes: int64(4)"
]
}
],
"prompt_number": 5
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ratings.sort_values(by='movie_id')[:8]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>user_id</th>\n",
" <th>movie_id</th>\n",
" <th>rating</th>\n",
" <th>timestamp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>427702</th>\n",
" <td> 2599</td>\n",
" <td> 1</td>\n",
" <td> 4</td>\n",
" <td> 973796689</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1966 </th>\n",
" <td> 18</td>\n",
" <td> 1</td>\n",
" <td> 4</td>\n",
" <td> 978154768</td>\n",
" </tr>\n",
" <tr>\n",
" <th>683688</th>\n",
" <td> 4089</td>\n",
" <td> 1</td>\n",
" <td> 5</td>\n",
" <td> 965428947</td>\n",
" </tr>\n",
" <tr>\n",
" <th>596207</th>\n",
" <td> 3626</td>\n",
" <td> 1</td>\n",
" <td> 4</td>\n",
" <td> 966594018</td>\n",
" </tr>\n",
" <tr>\n",
" <th>465902</th>\n",
" <td> 2873</td>\n",
" <td> 1</td>\n",
" <td> 5</td>\n",
" <td> 972784317</td>\n",
" </tr>\n",
" <tr>\n",
" <th>78200 </th>\n",
" <td> 528</td>\n",
" <td> 1</td>\n",
" <td> 5</td>\n",
" <td> 976245400</td>\n",
" </tr>\n",
" <tr>\n",
" <th>106468</th>\n",
" <td> 701</td>\n",
" <td> 1</td>\n",
" <td> 3</td>\n",
" <td> 979094230</td>\n",
" </tr>\n",
" <tr>\n",
" <th>434718</th>\n",
" <td> 2652</td>\n",
" <td> 1</td>\n",
" <td> 5</td>\n",
" <td> 973535127</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 6,
"text": [
" user_id movie_id rating timestamp\n",
"427702 2599 1 4 973796689\n",
"1966 18 1 4 978154768\n",
"683688 4089 1 5 965428947\n",
"596207 3626 1 4 966594018\n",
"465902 2873 1 5 972784317\n",
"78200 528 1 5 976245400\n",
"106468 701 1 3 979094230\n",
"434718 2652 1 5 973535127"
]
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"movies[:5]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>movie_id</th>\n",
" <th>title</th>\n",
" <th>genres</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 1</td>\n",
" <td> Toy Story (1995)</td>\n",
" <td> Animation|Children's|Comedy</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 2</td>\n",
" <td> Jumanji (1995)</td>\n",
" <td> Adventure|Children's|Fantasy</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 3</td>\n",
" <td> Grumpier Old Men (1995)</td>\n",
" <td> Comedy|Romance</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> 4</td>\n",
" <td> Waiting to Exhale (1995)</td>\n",
" <td> Comedy|Drama</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td> 5</td>\n",
" <td> Father of the Bride Part II (1995)</td>\n",
" <td> Comedy</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 7,
"text": [
" movie_id title genres\n",
"0 1 Toy Story (1995) Animation|Children's|Comedy\n",
"1 2 Jumanji (1995) Adventure|Children's|Fantasy\n",
"2 3 Grumpier Old Men (1995) Comedy|Romance\n",
"3 4 Waiting to Exhale (1995) Comedy|Drama\n",
"4 5 Father of the Bride Part II (1995) Comedy"
]
}
],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ratings[0:5]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>user_id</th>\n",
" <th>movie_id</th>\n",
" <th>rating</th>\n",
" <th>timestamp</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> 1</td>\n",
" <td> 1193</td>\n",
" <td> 5</td>\n",
" <td> 978300760</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> 1</td>\n",
" <td> 661</td>\n",
" <td> 3</td>\n",
" <td> 978302109</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> 1</td>\n",
" <td> 914</td>\n",
" <td> 3</td>\n",
" <td> 978301968</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> 1</td>\n",
" <td> 3408</td>\n",
" <td> 4</td>\n",
" <td> 978300275</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td> 1</td>\n",
" <td> 2355</td>\n",
" <td> 5</td>\n",
" <td> 978824291</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 8,
"text": [
" user_id movie_id rating timestamp\n",
"0 1 1193 5 978300760\n",
"1 1 661 3 978302109\n",
"2 1 914 3 978301968\n",
"3 1 3408 4 978300275\n",
"4 1 2355 5 978824291"
]
}
],
"prompt_number": 8
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"1.1 Example: Computing mean ratings by the users' gender and age."
]
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"Suppose we want to compute the mean ratings of a movie by gender and age. The first step is to obtain a single structure that contains all the information. \n",
"To do so we can use the pandas \"merge\" function. It automatically infers which columns to join on from the column names that the two tables have in common:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"data = pd.merge(pd.merge(ratings, users), movies) # Inner joins: \"ratings\" with \"users\" (on user_id), then with \"movies\" (on movie_id)\n",
"ratings_users_by_movies = data\n",
"print(data[:10]) # Show the first 10 records"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" user_id movie_id rating timestamp gender age occupation zip \\\n",
"0 1 1193 5 978300760 F 1 10 48067 \n",
"1 2 1193 5 978298413 M 56 16 70072 \n",
"2 12 1193 4 978220179 M 25 12 32793 \n",
"3 15 1193 4 978199279 M 25 7 22903 \n",
"4 17 1193 5 978158471 M 50 1 95350 \n",
"5 18 1193 4 978156168 F 18 3 95825 \n",
"6 19 1193 5 982730936 M 1 10 48073 \n",
"7 24 1193 5 978136709 F 25 7 10023 \n",
"8 28 1193 3 978125194 F 25 1 14607 \n",
"9 33 1193 5 978557765 M 45 3 55421 \n",
"\n",
" title genres \n",
"0 One Flew Over the Cuckoo's Nest (1975) Drama \n",
"1 One Flew Over the Cuckoo's Nest (1975) Drama \n",
"2 One Flew Over the Cuckoo's Nest (1975) Drama \n",
"3 One Flew Over the Cuckoo's Nest (1975) Drama \n",
"4 One Flew Over the Cuckoo's Nest (1975) Drama \n",
"5 One Flew Over the Cuckoo's Nest (1975) Drama \n",
"6 One Flew Over the Cuckoo's Nest (1975) Drama \n",
"7 One Flew Over the Cuckoo's Nest (1975) Drama \n",
"8 One Flew Over the Cuckoo's Nest (1975) Drama \n",
"9 One Flew Over the Cuckoo's Nest (1975) Drama \n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"The \"loc\" indexer lets us select a subset of rows and/or columns by label:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(data.loc[1])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"user_id 2\n",
"movie_id 1193\n",
"rating 5\n",
"timestamp 978298413\n",
"gender M\n",
"age 56\n",
"occupation 16\n",
"zip 70072\n",
"title One Flew Over the Cuckoo's Nest (1975)\n",
"genres Drama\n",
"Name: 1, dtype: object\n"
]
}
],
"prompt_number": 10
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"To obtain the mean rating of each movie grouped by gender, we can use the 'pivot_table' method:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# One row per title, one column per gender, each cell holding the mean rating\n",
"mean_ratings = data.pivot_table('rating', index='title', columns='gender', aggfunc='mean')\n",
"mean_ratings[:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>gender</th>\n",
" <th>F</th>\n",
" <th>M</th>\n",
" </tr>\n",
" <tr>\n",
" <th>title</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>$1,000,000 Duck (1971)</th>\n",
" <td> 3.375000</td>\n",
" <td> 2.761905</td>\n",
" </tr>\n",
" <tr>\n",
" <th>'Night Mother (1986)</th>\n",
" <td> 3.388889</td>\n",
" <td> 3.352941</td>\n",
" </tr>\n",
" <tr>\n",
" <th>'Til There Was You (1997)</th>\n",
" <td> 2.675676</td>\n",
" <td> 2.733333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>'burbs, The (1989)</th>\n",
" <td> 2.793478</td>\n",
" <td> 2.962085</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...And Justice for All (1979)</th>\n",
" <td> 3.828571</td>\n",
" <td> 3.689024</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1-900 (1994)</th>\n",
" <td> 2.000000</td>\n",
" <td> 3.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10 Things I Hate About You (1999)</th>\n",
" <td> 3.646552</td>\n",
" <td> 3.311966</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101 Dalmatians (1961)</th>\n",
" <td> 3.791444</td>\n",
" <td> 3.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101 Dalmatians (1996)</th>\n",
" <td> 3.240000</td>\n",
" <td> 2.911215</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12 Angry Men (1957)</th>\n",
" <td> 4.184397</td>\n",
" <td> 4.328421</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 11,
"text": [
"gender F M\n",
"title \n",
"$1,000,000 Duck (1971) 3.375000 2.761905\n",
"'Night Mother (1986) 3.388889 3.352941\n",
"'Til There Was You (1997) 2.675676 2.733333\n",
"'burbs, The (1989) 2.793478 2.962085\n",
"...And Justice for All (1979) 3.828571 3.689024\n",
"1-900 (1994) 2.000000 3.000000\n",
"10 Things I Hate About You (1999) 3.646552 3.311966\n",
"101 Dalmatians (1961) 3.791444 3.500000\n",
"101 Dalmatians (1996) 3.240000 2.911215\n",
"12 Angry Men (1957) 4.184397 4.328421"
]
}
],
"prompt_number": 11
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"We can keep only the movies that have received at least 250 ratings. To do so, we group the data by title and use \"size()\" to obtain a Series with the number of ratings for each title:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ratings_by_title = data.groupby('title').size()\n",
"active_titles = ratings_by_title.index[ratings_by_title >= 250]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 12
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"The index of titles with at least 250 ratings can then be used to select the rows of \"mean_ratings\": "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"mean_ratings = mean_ratings.loc[active_titles]\n",
"mean_ratings[:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>gender</th>\n",
" <th>F</th>\n",
" <th>M</th>\n",
" </tr>\n",
" <tr>\n",
" <th>title</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>'burbs, The (1989)</th>\n",
" <td> 2.793478</td>\n",
" <td> 2.962085</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10 Things I Hate About You (1999)</th>\n",
" <td> 3.646552</td>\n",
" <td> 3.311966</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101 Dalmatians (1961)</th>\n",
" <td> 3.791444</td>\n",
" <td> 3.500000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>101 Dalmatians (1996)</th>\n",
" <td> 3.240000</td>\n",
" <td> 2.911215</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12 Angry Men (1957)</th>\n",
" <td> 4.184397</td>\n",
" <td> 4.328421</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13th Warrior, The (1999)</th>\n",
" <td> 3.112000</td>\n",
" <td> 3.168000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2 Days in the Valley (1996)</th>\n",
" <td> 3.488889</td>\n",
" <td> 3.244813</td>\n",
" </tr>\n",
" <tr>\n",
" <th>20,000 Leagues Under the Sea (1954)</th>\n",
" <td> 3.670103</td>\n",
" <td> 3.709205</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2001: A Space Odyssey (1968)</th>\n",
" <td> 3.825581</td>\n",
" <td> 4.129738</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2010 (1984)</th>\n",
" <td> 3.446809</td>\n",
" <td> 3.413712</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 13,
"text": [
"gender F M\n",
"title \n",
"'burbs, The (1989) 2.793478 2.962085\n",
"10 Things I Hate About You (1999) 3.646552 3.311966\n",
"101 Dalmatians (1961) 3.791444 3.500000\n",
"101 Dalmatians (1996) 3.240000 2.911215\n",
"12 Angry Men (1957) 4.184397 4.328421\n",
"13th Warrior, The (1999) 3.112000 3.168000\n",
"2 Days in the Valley (1996) 3.488889 3.244813\n",
"20,000 Leagues Under the Sea (1954) 3.670103 3.709205\n",
"2001: A Space Odyssey (1968) 3.825581 4.129738\n",
"2010 (1984) 3.446809 3.413712"
]
}
],
"prompt_number": 13
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"To see the movies rated highest by women, we can sort by the F column in descending order:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)\n",
"top_female_ratings[:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>gender</th>\n",
" <th>F</th>\n",
" <th>M</th>\n",
" </tr>\n",
" <tr>\n",
" <th>title</th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Close Shave, A (1995)</th>\n",
" <td> 4.644444</td>\n",
" <td> 4.473795</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Wrong Trousers, The (1993)</th>\n",
" <td> 4.588235</td>\n",
" <td> 4.478261</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Sunset Blvd. (a.k.a. Sunset Boulevard) (1950)</th>\n",
" <td> 4.572650</td>\n",
" <td> 4.464589</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Wallace &amp; Gromit: The Best of Aardman Animation (1996)</th>\n",
" <td> 4.563107</td>\n",
" <td> 4.385075</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Schindler's List (1993)</th>\n",
" <td> 4.562602</td>\n",
" <td> 4.491415</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Shawshank Redemption, The (1994)</th>\n",
" <td> 4.539075</td>\n",
" <td> 4.560625</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Grand Day Out, A (1992)</th>\n",
" <td> 4.537879</td>\n",
" <td> 4.293255</td>\n",
" </tr>\n",
" <tr>\n",
" <th>To Kill a Mockingbird (1962)</th>\n",
" <td> 4.536667</td>\n",
" <td> 4.372611</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Creature Comforts (1990)</th>\n",
" <td> 4.513889</td>\n",
" <td> 4.272277</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Usual Suspects, The (1995)</th>\n",
" <td> 4.513317</td>\n",
" <td> 4.518248</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 14,
"text": [
"gender F M\n",
"title \n",
"Close Shave, A (1995) 4.644444 4.473795\n",
"Wrong Trousers, The (1993) 4.588235 4.478261\n",
"Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589\n",
"Wallace & Gromit: The Best of Aardman Animation (1996) 4.563107 4.385075\n",
"Schindler's List (1993) 4.562602 4.491415\n",
"Shawshank Redemption, The (1994) 4.539075 4.560625\n",
"Grand Day Out, A (1992) 4.537879 4.293255\n",
"To Kill a Mockingbird (1962) 4.536667 4.372611\n",
"Creature Comforts (1990) 4.513889 4.272277\n",
"Usual Suspects, The (1995) 4.513317 4.518248"
]
}
],
"prompt_number": 14
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"Now suppose we want the movies on which the ratings of men and women differ the most. One way is to add a column to \"mean_ratings\" containing the difference between the two means, and then sort by it:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 15
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"Sorting by 'diff' in ascending order gives the movies with the largest rating gap that women rated higher than men:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sorted_by_diff = mean_ratings.sort_values(by='diff')\n",
"sorted_by_diff[:15]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>gender</th>\n",
" <th>F</th>\n",
" <th>M</th>\n",
" <th>diff</th>\n",
" </tr>\n",
" <tr>\n",
" <th>title</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Dirty Dancing (1987)</th>\n",
" <td> 3.790378</td>\n",
" <td> 2.959596</td>\n",
" <td>-0.830782</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Jumpin' Jack Flash (1986)</th>\n",
" <td> 3.254717</td>\n",
" <td> 2.578358</td>\n",
" <td>-0.676359</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Grease (1978)</th>\n",
" <td> 3.975265</td>\n",
" <td> 3.367041</td>\n",
" <td>-0.608224</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Little Women (1994)</th>\n",
" <td> 3.870588</td>\n",
" <td> 3.321739</td>\n",
" <td>-0.548849</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Steel Magnolias (1989)</th>\n",
" <td> 3.901734</td>\n",
" <td> 3.365957</td>\n",
" <td>-0.535777</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Anastasia (1997)</th>\n",
" <td> 3.800000</td>\n",
" <td> 3.281609</td>\n",
" <td>-0.518391</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Rocky Horror Picture Show, The (1975)</th>\n",
" <td> 3.673016</td>\n",
" <td> 3.160131</td>\n",
" <td>-0.512885</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Color Purple, The (1985)</th>\n",
" <td> 4.158192</td>\n",
" <td> 3.659341</td>\n",
" <td>-0.498851</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Age of Innocence, The (1993)</th>\n",
" <td> 3.827068</td>\n",
" <td> 3.339506</td>\n",
" <td>-0.487561</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Free Willy (1993)</th>\n",
" <td> 2.921348</td>\n",
" <td> 2.438776</td>\n",
" <td>-0.482573</td>\n",
" </tr>\n",
" <tr>\n",
" <th>French Kiss (1995)</th>\n",
" <td> 3.535714</td>\n",
" <td> 3.056962</td>\n",
" <td>-0.478752</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Little Shop of Horrors, The (1960)</th>\n",
" <td> 3.650000</td>\n",
" <td> 3.179688</td>\n",
" <td>-0.470312</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Guys and Dolls (1955)</th>\n",
" <td> 4.051724</td>\n",
" <td> 3.583333</td>\n",
" <td>-0.468391</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mary Poppins (1964)</th>\n",
" <td> 4.197740</td>\n",
" <td> 3.730594</td>\n",
" <td>-0.467147</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Patch Adams (1998)</th>\n",
" <td> 3.473282</td>\n",
" <td> 3.008746</td>\n",
" <td>-0.464536</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 16,
"text": [
"gender F M diff\n",
"title \n",
"Dirty Dancing (1987) 3.790378 2.959596 -0.830782\n",
"Jumpin' Jack Flash (1986) 3.254717 2.578358 -0.676359\n",
"Grease (1978) 3.975265 3.367041 -0.608224\n",
"Little Women (1994) 3.870588 3.321739 -0.548849\n",
"Steel Magnolias (1989) 3.901734 3.365957 -0.535777\n",
"Anastasia (1997) 3.800000 3.281609 -0.518391\n",
"Rocky Horror Picture Show, The (1975) 3.673016 3.160131 -0.512885\n",
"Color Purple, The (1985) 4.158192 3.659341 -0.498851\n",
"Age of Innocence, The (1993) 3.827068 3.339506 -0.487561\n",
"Free Willy (1993) 2.921348 2.438776 -0.482573\n",
"French Kiss (1995) 3.535714 3.056962 -0.478752\n",
"Little Shop of Horrors, The (1960) 3.650000 3.179688 -0.470312\n",
"Guys and Dolls (1955) 4.051724 3.583333 -0.468391\n",
"Mary Poppins (1964) 4.197740 3.730594 -0.467147\n",
"Patch Adams (1998) 3.473282 3.008746 -0.464536"
]
}
],
"prompt_number": 16
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"Reversing the order of the rows and slicing off the top 15 gives the movies that men preferred and that women rated lower: "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"sorted_by_diff[::-1][:15]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>gender</th>\n",
" <th>F</th>\n",
" <th>M</th>\n",
" <th>diff</th>\n",
" </tr>\n",
" <tr>\n",
" <th>title</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Good, The Bad and The Ugly, The (1966)</th>\n",
" <td> 3.494949</td>\n",
" <td> 4.221300</td>\n",
" <td> 0.726351</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Kentucky Fried Movie, The (1977)</th>\n",
" <td> 2.878788</td>\n",
" <td> 3.555147</td>\n",
" <td> 0.676359</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Dumb &amp; Dumber (1994)</th>\n",
" <td> 2.697987</td>\n",
" <td> 3.336595</td>\n",
" <td> 0.638608</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Longest Day, The (1962)</th>\n",
" <td> 3.411765</td>\n",
" <td> 4.031447</td>\n",
" <td> 0.619682</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Cable Guy, The (1996)</th>\n",
" <td> 2.250000</td>\n",
" <td> 2.863787</td>\n",
" <td> 0.613787</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Evil Dead II (Dead By Dawn) (1987)</th>\n",
" <td> 3.297297</td>\n",
" <td> 3.909283</td>\n",
" <td> 0.611985</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hidden, The (1987)</th>\n",
" <td> 3.137931</td>\n",
" <td> 3.745098</td>\n",
" <td> 0.607167</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Rocky III (1982)</th>\n",
" <td> 2.361702</td>\n",
" <td> 2.943503</td>\n",
" <td> 0.581801</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Caddyshack (1980)</th>\n",
" <td> 3.396135</td>\n",
" <td> 3.969737</td>\n",
" <td> 0.573602</td>\n",
" </tr>\n",
" <tr>\n",
" <th>For a Few Dollars More (1965)</th>\n",
" <td> 3.409091</td>\n",
" <td> 3.953795</td>\n",
" <td> 0.544704</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Porky's (1981)</th>\n",
" <td> 2.296875</td>\n",
" <td> 2.836364</td>\n",
" <td> 0.539489</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Animal House (1978)</th>\n",
" <td> 3.628906</td>\n",
" <td> 4.167192</td>\n",
" <td> 0.538286</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Exorcist, The (1973)</th>\n",
" <td> 3.537634</td>\n",
" <td> 4.067239</td>\n",
" <td> 0.529605</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Fright Night (1985)</th>\n",
" <td> 2.973684</td>\n",
" <td> 3.500000</td>\n",
" <td> 0.526316</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Barb Wire (1996)</th>\n",
" <td> 1.585366</td>\n",
" <td> 2.100386</td>\n",
" <td> 0.515020</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 17,
"text": [
"gender F M diff\n",
"title \n",
"Good, The Bad and The Ugly, The (1966) 3.494949 4.221300 0.726351\n",
"Kentucky Fried Movie, The (1977) 2.878788 3.555147 0.676359\n",
"Dumb & Dumber (1994) 2.697987 3.336595 0.638608\n",
"Longest Day, The (1962) 3.411765 4.031447 0.619682\n",
"Cable Guy, The (1996) 2.250000 2.863787 0.613787\n",
"Evil Dead II (Dead By Dawn) (1987) 3.297297 3.909283 0.611985\n",
"Hidden, The (1987) 3.137931 3.745098 0.607167\n",
"Rocky III (1982) 2.361702 2.943503 0.581801\n",
"Caddyshack (1980) 3.396135 3.969737 0.573602\n",
"For a Few Dollars More (1965) 3.409091 3.953795 0.544704\n",
"Porky's (1981) 2.296875 2.836364 0.539489\n",
"Animal House (1978) 3.628906 4.167192 0.538286\n",
"Exorcist, The (1973) 3.537634 4.067239 0.529605\n",
"Fright Night (1985) 2.973684 3.500000 0.526316\n",
"Barb Wire (1996) 1.585366 2.100386 0.515020"
]
}
],
"prompt_number": 17
},
{
"cell_type": "raw",
"metadata": {},
"source": [
"If we want the films whose ratings disagree the most, regardless of gender, we can use the variance or the standard deviation of the ratings: "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Standard deviation of rating grouped by title\n",
"rating_std_by_title = data.groupby('title')['rating'].std()\n",
"# Filter down to active_titles\n",
"rating_std_by_title = rating_std_by_title.ix[active_titles]\n",
"rating_std_by_title.order(ascending=False)[:10]"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 18,
"text": [
"title\n",
"Dumb & Dumber (1994) 1.321333\n",
"Blair Witch Project, The (1999) 1.316368\n",
"Natural Born Killers (1994) 1.307198\n",
"Tank Girl (1995) 1.277695\n",
"Rocky Horror Picture Show, The (1975) 1.260177\n",
"Eyes Wide Shut (1999) 1.259624\n",
"Evita (1996) 1.253631\n",
"Billy Madison (1995) 1.249970\n",
"Fear and Loathing in Las Vegas (1998) 1.246408\n",
"Bicentennial Man (1999) 1.245533\n",
"Name: rating, dtype: float64"
]
}
],
"prompt_number": 18
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b> EXERCISE 1:</b>\n",
"Compute the mean rating of each user. Which film is rated highest? "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Mean rating of each title, split by gender (kept for reference; not used below)\n",
"mean_ratings = ratings_users_by_movies.pivot_table('rating', rows='title',cols='gender', aggfunc='mean')\n",
"\n",
"# mean_rating_by_user is a Series indexed by user_id that holds each user's mean rating\n",
"mean_rating_by_user = ratings_users_by_movies.groupby('user_id')['rating'].mean()\n",
"\n",
"# critic_id_higher_mean is meant to be the id of the user with the highest mean rating\n",
"critic_id_higher_mean = mean_rating_by_user.index[mean_rating_by_user.argmax()]\n",
"\n",
"# Print the title of the film that this user rated highest\n",
"print data.ix[data[data['user_id'] == critic_id_higher_mean]['rating'].idxmax()]['title']"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Bug's Life, A (1998)\n"
]
}
],
"prompt_number": 32
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>EXERCISE 2:</b> Define a function named <b>top_movie</b> that, given a user, returns the film that user rated highest.<br> \n",
"def top_movie(user)\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def top_movie(dataFrame, user_id):\n",
"    \"Returns a movie title with the maximum possible rating from a user\"\n",
"    user = dataFrame[dataFrame['user_id'] == user_id]\n",
"    df = user[['title', 'rating']]\n",
"    id_max = df['rating'].idxmax()\n",
"    return df.ix[id_max]['title']\n",
"\n",
"print top_movie(data, 1)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"One Flew Over the Cuckoo's Nest (1975)\n"
]
}
],
"prompt_number": 23
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"1.2 Building a recommender."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>Exercise 3: </b>\n",
"\n",
"Build two functions, <b>distEuclid(x,y)</b> and <b>coefPearson(x,y)</b>, that implement the Euclidean distance and the Pearson correlation coefficient between two vectors. Write the functions that compute the similarity between two users following this structure:\n",
"\n",
"<b>def SimEuclid (DataFrame, User1, User2)</b>\n",
" Compute the representative vectors of each user, C1 and C2, from the ratings of the items that both users have rated.<br>\n",
" If there are no ratings in common, return 0.<br>\n",
" Return 1/(1+distEuclid(C1, C2))<br>\n",
" \n",
"<b>def SimPearson (DataFrame, User1, User2)</b>\n",
" Compute the representative vectors of each user, C1 and C2, from the ratings of the items that both users have rated.<br>\n",
" If there are no ratings in common, return 0.<br>\n",
" Return coefPearson(C1,C2)<br>\n",
" \n",
"\n",
"<b>Use pandas to complete this exercise.</b>"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import math\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"def distEuclid(x, y):\n",
"    \"Given two same-sized lists of numerical values, return the Euclidean distance\"\n",
"    if len(x) == 0 or len(y) == 0 or len(x) != len(y): return 0\n",
"    return np.sqrt(np.sum((x - y) ** 2))\n",
"\n",
"def coefPearson(x, y):\n",
"    \"Given two same-sized lists of numerical values, return the Pearson's coefficient\"\n",
"    if len(x) == 0 or len(y) == 0 or len(x) != len(y): return 0\n",
"    mean_x = np.mean(x)\n",
"    mean_y = np.mean(y)\n",
"    numerator = np.sum((x - mean_x) * (y - mean_y))\n",
"    denominator_left = np.sqrt(np.sum((x - mean_x) ** 2))\n",
"    denominator_right = np.sqrt(np.sum((y - mean_y) ** 2))\n",
"    # A constant vector has zero variance, so the coefficient is undefined; return 0\n",
"    if denominator_left == 0 or denominator_right == 0: return 0\n",
"    return numerator / ( denominator_left * denominator_right )\n",
"\n",
"def sim_euclid(t):\n",
"    \"Return a similarity in (0, 1] derived from the Euclidean distance of a pair of same-sized lists\"\n",
"    return 1 / ( distEuclid(t[0], t[1]) + 1 )\n",
"\n",
"def sim_coef_pearson(t):\n",
"    \"Return the Pearson's coefficient given a pair of same-sized lists\"\n",
"    return coefPearson(t[0], t[1])\n",
"\n",
"def get_user_ratings(dataFrame, user_a, user_b):\n",
"    \"\"\"\n",
"    Given two users and a cartesian product of M-R-U, return a pair of lists\n",
"    that contains the rating of the movies viewed by both users.\n",
"    \"\"\"\n",
"    def select_user(id): return dataFrame[dataFrame['user_id'] == id]\n",
"    user_a = select_user(user_a)\n",
"    user_b = select_user(user_b)\n",
"    merged = pd.merge(user_a, user_b, left_on = 'movie_id', right_on = 'movie_id')\n",
"    return ( merged['rating_x'], merged['rating_y'] )\n",
"\n",
"def simEuclid(data, user_a, user_b):\n",
"    \"Calculate the Euclidean similarity of two users by comparing their ratings.\"\n",
"    ratings_a, ratings_b = get_user_ratings(data, user_a, user_b)\n",
"    # No ratings in common: return 0, as the exercise statement requires\n",
"    if len(ratings_a) == 0: return 0\n",
"    return 1 / ( distEuclid(ratings_a, ratings_b) + 1 )\n",
"\n",
"def simPearson(data, user_a, user_b):\n",
"    \"Calculate the Pearson's coefficient of two users by comparing their ratings.\"\n",
"    ratings_a, ratings_b = get_user_ratings(data, user_a, user_b)\n",
"    return coefPearson(ratings_a, ratings_b)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 24
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Execute functions\n",
"print simEuclid(ratings_users_by_movies, 1, 2)\n",
"print simPearson(ratings_users_by_movies, 1, 2)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"0.333333333333\n",
"0.416666666667\n"
]
}
],
"prompt_number": 25
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>Exercise 4:</b>\n",
"\n",
"Write two functions, <b>getNBestEuclid(DataFrame, user,n)</b> and <b>getNBestPearson(DataFrame, user,n)</b>, that return the n most similar users according to these two similarity measures."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def get_common_ratings(user_a, user_b):\n",
"    \"Return a pair of lists that contains the rating of the movies viewed by both users.\"\n",
"    merged = pd.merge(user_a, user_b, left_on = 'movie_id', right_on = 'movie_id')\n",
"    return ( merged['rating_x'], merged['rating_y'] )\n",
"\n",
"def getNBest(dataFrame, simalg, user_id, k):\n",
"    \"Return a list of the k users most similar to the user provided.\"\n",
"    def select_user(id): return dataFrame[dataFrame['user_id'] == id]\n",
"    user_a = select_user(user_id)\n",
"    sim_users = [ (simalg(get_common_ratings(user_a, user_b)), user_b) for u_id, user_b in dataFrame[dataFrame['user_id'] != user_id].groupby('user_id') ]\n",
"    return sorted(sim_users, reverse = True, key = lambda x: x[0])[0:k]\n",
"\n",
"def getNBestEuclid(dataFrame, user, n):\n",
"    ls = getNBest(dataFrame, sim_euclid, user, n)\n",
"    return [ (coef, user['user_id'].values[0]) for coef, user in ls ]\n",
"\n",
"def getNBestPearson(dataFrame, user, n):\n",
"    ls = getNBest(dataFrame, sim_coef_pearson, user, n)\n",
"    return [ (coef, user['user_id'].values[0]) for coef, user in ls ]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 26
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# EXECUTE FUNCTIONS\n",
"print getNBestEuclid(ratings_users_by_movies, 1, 10)\n",
"print getNBestPearson(ratings_users_by_movies, 1, 10)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[(1.0, 16), (1, 46), (1.0, 50), (1.0, 61), (1, 158), (1.0, 171), (1.0, 185), (1, 210), (1, 218), (1.0, 234)]\n",
"[(1.0000000000000002, 298), (1.0000000000000002, 448), (1.0000000000000002, 565), (1.0000000000000002, 994), (1.0000000000000002, 1076), (1.0000000000000002, 1388), (1.0000000000000002, 1455), (1.0000000000000002, 1877), (1.0000000000000002, 2318), (1.0000000000000002, 2426)]"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\n"
]
}
],
"prompt_number": 27
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>Exercise 5:</b>\n",
"\n",
"Develop a user-based collaborative recommender system. The main function, <b>getRecommendationsUser</b>, must take as input a ratings table, a \"user_id\", the type of similarity measure to use (Euclidean or Pearson) and the maximum number n of recommendations wanted. As output it must give the list of the 5 best films we could recommend to that user according to their similarity with other users.\n",
"\n",
"Note: Avoid comparing \"user_id\" with itself."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def get_movie_user_matrix():\n",
"    \"Returns a dataframe that contains the user's ratings of a movie\"\n",
"    DF = ratings_users_by_movies\n",
"    df = pd.DataFrame(columns = movies['movie_id'], index = users['user_id'])\n",
"\n",
"    for user_id in df.index.values:\n",
"        user = DF[DF['user_id'] == user_id]\n",
"        for movie_id, rating in user[['movie_id', 'rating']].values:\n",
"            df[movie_id][user_id] = rating\n",
"\n",
"    return df\n",
"\n",
"def getRecommendationsUser(dataFrame, user_id, n, sim = sim_coef_pearson):\n",
"    def select_user(id): return dataFrame[dataFrame['user_id'] == id]\n",
"\n",
"    # Rows are users, columns are movies; cells hold similarity-weighted ratings\n",
"    df = pd.DataFrame(columns = movies['movie_id'], index = users['user_id'])\n",
"\n",
"    user = select_user(user_id)\n",
"    filtered_movies = user['movie_id'].tolist()\n",
"\n",
"    coefs_users = getNBest(dataFrame, sim, user_id, None)\n",
"    similarities = { user_b['user_id'].values[0]: coef for coef, user_b in coefs_users }\n",
"\n",
"    for coef, user_b in coefs_users:\n",
"        user_b_id = user_b['user_id'].values[0]\n",
"        for movie_id, corrected_coef in ( (movie_id, coef * rating) for movie_id, rating in user_b[['movie_id', 'rating']].values if movie_id not in filtered_movies ):\n",
"            df[movie_id][user_b_id] = corrected_coef\n",
"\n",
"    recommended_movies = []\n",
"    for movie_id in df.columns:\n",
"        total = 0\n",
"        sim_sum = 0\n",
"        for other_id, rating in df[[movie_id]].itertuples():\n",
"            if not np.isnan(rating):\n",
"                total += rating\n",
"                sim_sum += similarities[other_id]\n",
"        if sim_sum > 0:\n",
"            recommended_movies.append(( movies.ix[movies['movie_id'] == movie_id]['title'].values[0], total / sim_sum ))\n",
"\n",
"    # Highest predicted rating first, so the slice below keeps the n best movies\n",
"    recommended_movies = sorted(recommended_movies, key = lambda x: x[1], reverse = True)\n",
"\n",
"    return recommended_movies[0:n]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 45
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print getRecommendationsUser(ratings_users_by_movies, 1, 10, sim_euclid)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"[('Silence of the Palace, The (Saimt el Qusur) (1994)', 1.0)]\n"
]
}
],
"prompt_number": 46
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>Exercise 6:</b>\n",
"\n",
"Develop an item-based collaborative recommender system. \n",
"\n",
"First, write a function <b>CalcSimItems(DataFrame)</b> that builds and returns a table, itemsim, with the similarities between items.\n",
"Then write the main function, <b>getRecommendationsItem(DataFrame, itemsim, user, n)</b>, which must take as input the users' ratings, the item similarity table, a \"user_id\" and the maximum number n of recommendations wanted. As output it must give the n best films."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def get_movie_user_matrix():\n",
"    DF = ratings_users_by_movies\n",
"    df = pd.DataFrame(columns = movies['movie_id'], index = users['user_id'])\n",
"\n",
"    for user_id in df.index.values:\n",
"        user = DF[DF['user_id'] == user_id]\n",
"        for movie_id, rating in user[['movie_id', 'rating']].values:\n",
"            df[movie_id][user_id] = rating\n",
"\n",
"    return df\n",
"\n",
"def CalcSimItems(DF):\n",
"    df = pd.DataFrame(columns = movies['movie_id'], index = movies['movie_id'])\n",
"    movie_ids = df.columns\n",
"\n",
"    movies_by_users = get_movie_user_matrix()\n",
"\n",
"    for movie_id_a in movie_ids: # ROWS\n",
"        for movie_id_b in movie_ids: # COLUMNS\n",
"            if movie_id_a != movie_id_b:\n",
"                ratings_a = movies_by_users[movie_id_a]\n",
"                ratings_b = movies_by_users[movie_id_b]\n",
"                # Compare only the users who rated both movies\n",
"                both = ratings_a.notnull() & ratings_b.notnull()\n",
"                df[movie_id_b][movie_id_a] = sim_coef_pearson((ratings_a[both].astype(float), ratings_b[both].astype(float)))\n",
"\n",
"    return df\n",
"\n",
"def getRecommendationsItem(DF, IS, user_id, n, sim = sim_coef_pearson):\n",
"    def select_user(id): return DF[DF['user_id'] == id]\n",
"\n",
"    user = select_user(user_id)\n",
"    user_movies = user['movie_id'].tolist()\n",
"\n",
"    user_ratings = { movie_id: rating for movie_id, rating in user[['movie_id', 'rating']].values }\n",
"    new_movies = [ movie_id for movie_id in movies['movie_id'].tolist() if movie_id not in user_movies ]\n",
"\n",
"    # Rows are movies the user rated, columns are unseen movies\n",
"    df = pd.DataFrame(columns = new_movies, index = user_movies)\n",
"\n",
"    for user_movie_id, row in df.iterrows():\n",
"        for movie_id in new_movies:\n",
"            df[movie_id][user_movie_id] = IS[movie_id][user_movie_id] * user_ratings[user_movie_id]\n",
"\n",
"    recommended_movies = []\n",
"    for movie_id in df.columns:\n",
"        total = 0\n",
"        sim_sum = 0\n",
"        for user_movie_id, corrected_similarity in df[[movie_id]].itertuples():\n",
"            if not np.isnan(corrected_similarity):\n",
"                total += corrected_similarity\n",
"                sim_sum += IS[movie_id][user_movie_id]\n",
"        if sim_sum > 0:\n",
"            recommended_movies.append(( movies.ix[movies['movie_id'] == movie_id]['title'].values[0], total / sim_sum ))\n",
"\n",
"    # Highest predicted rating first, so the slice below keeps the n best movies\n",
"    recommended_movies = sorted(recommended_movies, key = lambda x: x[1], reverse = True)\n",
"\n",
"    return recommended_movies[0:n]"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 47
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"#Executions\n",
"itemsim=CalcSimItems(data)\n",
"getRecommendationsItem(data, itemsim, 1, 10)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 102,
"text": [
"0"
]
}
],
"prompt_number": 102
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>Exercise 7:</b> Create a function, <b>EvaluateRecommendationsUser(DataFrame,DataFrameTest,similarity=SimPearson)</b>, that, given a training data set and a test data set, evaluates the accuracy of the system.\n",
"For each element of the test set we must predict its value and compare it with the actual value the user assigned.<br> The measure we will use to evaluate the system is the following:\n",
"$$accuracy = \\frac{1}{N}\\sum_{i=1}^{N} \\left| rating_i - rating_i^* \\right| $$ where rating is the actual rating the user gave the film and rating* is the value predicted by the recommender system developed.\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import numpy as np\n",
"from random import randint\n",
"\n",
"ratings_headers=['user_id', 'movie_id', 'rating']\n",
"users_headers=['user_id']\n",
"movies_headers=['movie_id']\n",
"\n",
"users = None\n",
"ratings = None\n",
"movies = None\n",
"ratings_users_by_movies = None\n",
"ratings_users_by_movies_filtered = None\n",
"\n",
"def generate_random_ratings(n, limit_cells_per_column = 3):\n",
"    \"Fill the global tables with random ratings of n movies by n users, plus a copy with hidden cells\"\n",
"    global users\n",
"    global ratings\n",
"    global movies\n",
"    global ratings_users_by_movies\n",
"    global ratings_users_by_movies_filtered\n",
"    idx = range(1, n+1)\n",
"    usr=[x for x in idx for y in idx]\n",
"    mov=[y for x in idx for y in idx]\n",
"    rat=[randint(1, 5) for x in idx for y in idx]\n",
"    dict1={'user_id':usr,'movie_id':mov,'rating':rat}\n",
"    ratings = pd.DataFrame(dict1)\n",
"    users=pd.DataFrame({'user_id':idx})\n",
"    movies=pd.DataFrame({'movie_id':idx})\n",
"    ratings_users_by_movies=pd.merge(pd.merge(ratings, users), movies)\n",
"\n",
"    _df = get_movie_user_matrix(ratings_users_by_movies)\n",
"    _df = remove_randomly_cells(_df, limit_cells_per_column)\n",
"    r2 = []\n",
"    for user_id in _df.index.values:\n",
"        for movie_id in _df.columns:\n",
"            j = _df[movie_id][user_id]\n",
"            if not np.isnan(j):\n",
"                r2.append(int(j))\n",
"            else:\n",
"                r2.append(np.nan)\n",
"\n",
"    ratings2 = pd.DataFrame({'user_id':usr,'movie_id':mov,'rating':r2})\n",
"    ratings_users_by_movies_filtered = pd.merge(pd.merge(ratings2, users), movies)\n",
"\n",
"def get_movie_user_matrix(DF):\n",
"    df = pd.DataFrame(columns = movies['movie_id'], index = users['user_id'])\n",
"\n",
"    for user_id in df.index.values:\n",
"        user = DF[DF['user_id'] == user_id]\n",
"        for movie_id, rating in user[['movie_id', 'rating']].values:\n",
"            df[movie_id][user_id] = rating\n",
"\n",
"    return df\n",
"\n",
"\n",
"def remove_randomly_cells(df, limit = 3):\n",
"    \"Blank out roughly one cell in five, keeping more than limit ratings per movie and per user\"\n",
"    if limit > len(df.columns):\n",
"        limit = ( len(df.columns) / 2 ) - 1\n",
"\n",
"    for user_id, row in df.iterrows():\n",
"        for movie_id in df.columns:\n",
"            # df[movie_id] is the movie's column, df.ix[user_id] is the user's row\n",
"            if randint(1,5)==1 and df[movie_id].count()>limit and df.ix[user_id].count()>limit:\n",
"                df[movie_id][user_id] = np.nan\n",
"    return df\n",
"\n",
"\n",
"def user_per_movie_df(df):\n",
"    user_ids = list(set(df['user_id'].tolist()))\n",
"    movie_ids=list(set(df['movie_id'].tolist()))\n",
"    df2=pd.DataFrame(columns = user_ids, index = movie_ids)\n",
"    for user_id in user_ids:\n",
"        for movie_id in movie_ids:\n",
"            df2[user_id][movie_id]=(df['rating'][(df['user_id']==user_id) &(df['movie_id']==movie_id)]).tolist()[0]\n",
"    return df2\n",
"\n",
"def getRecommendationsUser(dataFrame, user_id, sim = sim_coef_pearson):\n",
"    def select_user(id): return dataFrame[dataFrame['user_id'] == id]\n",
"\n",
"    df = pd.DataFrame(columns = movies['movie_id'], index = users['user_id'])\n",
"\n",
"    user = select_user(user_id)\n",
"    filtered_movies = user[user['rating'].notnull()]['movie_id'].tolist()\n",
"\n",
"    if len(filtered_movies) == 0:\n",
"        return None\n",
"\n",
"    coefs_users = getNBest(dataFrame, sim, user_id, None)\n",
"    similarities = { user_b['user_id'].values[0]: coef for coef, user_b in coefs_users }\n",
"\n",
"    for coef, user_b in coefs_users:\n",
"        user_b_id = user_b['user_id'].values[0]\n",
"        for movie_id, corrected_coef in ( (movie_id, coef * rating) for movie_id, rating in user_b[['movie_id', 'rating']].values if movie_id not in filtered_movies ):\n",
"            df[movie_id][user_b_id] = corrected_coef\n",
"\n",
"    recommended_movies = {}\n",
"    for movie_id in df.columns:\n",
"        total = 0\n",
"        sim_sum = 0\n",
"        for other_id, rating in df[[movie_id]].itertuples():\n",
"            if not np.isnan(rating):\n",
"                total += rating\n",
"                sim_sum += similarities[other_id]\n",
"        if sim_sum > 0:\n",
"            recommended_movies[movie_id] = total / sim_sum\n",
"\n",
"    return recommended_movies\n",
"\n",
"def get_generated_ratings(df_test, sim = sim_coef_pearson):\n",
"    \"Return a copy of df_test whose missing ratings are filled in with predictions\"\n",
"    generated_ratings = df_test.copy()\n",
"    for user_id in users['user_id'].values:\n",
"        not_seen_movies = getRecommendationsUser(df_test, user_id, sim)\n",
"        if not_seen_movies is not None:\n",
"            for movie_id, rating in not_seen_movies.iteritems():\n",
"                # Locate the (user, movie) row of the long-format table and store the prediction\n",
"                mask = (generated_ratings['user_id'] == user_id) & (generated_ratings['movie_id'] == movie_id)\n",
"                generated_ratings.loc[mask, 'rating'] = rating\n",
"    return generated_ratings\n",
"\n",
"def EvaluateRecommendationsUser(df_training, df_test, sim = sim_coef_pearson):\n",
"    generated_ratings = get_generated_ratings(df_test, sim)\n",
"    DF = get_movie_user_matrix(df_training)\n",
"    DFT = get_movie_user_matrix(generated_ratings)\n",
"    DFTest = get_movie_user_matrix(df_test)\n",
"    total_error = 0.0 # sum of absolute errors\n",
"    count = 0\n",
"    for user_id in DF.index:\n",
"        for movie_id in DF.columns:\n",
"            # Score only the cells hidden in the test set that received a prediction\n",
"            if pd.isnull(DFTest[movie_id][user_id]) and not pd.isnull(DFT[movie_id][user_id]):\n",
"                total_error += abs(DF[movie_id][user_id] - DFT[movie_id][user_id])\n",
"                count += 1\n",
"    if count == 0: return None\n",
"    return total_error / count # mean absolute error"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 103
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"generate_random_ratings(10, 5)\n",
"print EvaluateRecommendationsUser(ratings_users_by_movies, ratings_users_by_movies_filtered)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 104
}
],
"metadata": {}
}
]
}