@onpillow
Created May 17, 2018 13:46
medium01_6
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### <b>B. Sample preparation: filter the top 100 and bottom 100 movies of each year as the research sample.</b>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>A) Select the top 100 most popular movies in each year.</b>"
]
},
{
"cell_type": "code",
"execution_count": 479,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>popularity</th>\n",
" <th>budget</th>\n",
" <th>revenue</th>\n",
" <th>original_title</th>\n",
" <th>cast</th>\n",
" <th>director</th>\n",
" <th>keywords</th>\n",
" <th>runtime</th>\n",
" <th>genres</th>\n",
" <th>production_companies</th>\n",
" <th>release_date</th>\n",
" <th>vote_count</th>\n",
" <th>vote_average</th>\n",
" <th>release_year</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>539</td>\n",
" <td>2.610362</td>\n",
" <td>806948.0</td>\n",
" <td>32000000.0</td>\n",
" <td>Psycho</td>\n",
" <td>Anthony Perkins|Vera Miles|John Gavin|Janet Le...</td>\n",
" <td>Alfred Hitchcock</td>\n",
" <td>hotel|clerk|arizona|shower|rain</td>\n",
" <td>109</td>\n",
" <td>Drama|Horror|Thriller</td>\n",
" <td>Shamley Productions</td>\n",
" <td>8/14/60</td>\n",
" <td>1180</td>\n",
" <td>8.0</td>\n",
" <td>1960</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>966</td>\n",
" <td>1.872132</td>\n",
" <td>2000000.0</td>\n",
" <td>4905000.0</td>\n",
" <td>The Magnificent Seven</td>\n",
" <td>Yul Brynner|Eli Wallach|Steve McQueen|Charles ...</td>\n",
" <td>John Sturges</td>\n",
" <td>horse|village|friendship|remake|number in title</td>\n",
" <td>128</td>\n",
" <td>Action|Adventure|Western</td>\n",
" <td>The Mirisch Corporation|Alpha Productions</td>\n",
" <td>10/23/60</td>\n",
" <td>224</td>\n",
" <td>7.0</td>\n",
" <td>1960</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id popularity budget revenue original_title \\\n",
"0 539 2.610362 806948.0 32000000.0 Psycho \n",
"1 966 1.872132 2000000.0 4905000.0 The Magnificent Seven \n",
"\n",
" cast director \\\n",
"0 Anthony Perkins|Vera Miles|John Gavin|Janet Le... Alfred Hitchcock \n",
"1 Yul Brynner|Eli Wallach|Steve McQueen|Charles ... John Sturges \n",
"\n",
" keywords runtime \\\n",
"0 hotel|clerk|arizona|shower|rain 109 \n",
"1 horse|village|friendship|remake|number in title 128 \n",
"\n",
" genres production_companies \\\n",
"0 Drama|Horror|Thriller Shamley Productions \n",
"1 Action|Adventure|Western The Mirisch Corporation|Alpha Productions \n",
"\n",
" release_date vote_count vote_average release_year \n",
"0 8/14/60 1180 8.0 1960 \n",
"1 10/23/60 224 7.0 1960 "
]
},
"execution_count": 479,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select the top 100 most popular movies in each year.\n",
"# First sort by release year (ascending) and popularity (descending).\n",
"df_top_p = df.sort_values(['release_year','popularity'], ascending=[True, False])\n",
"# Group by year and keep the first 100 rows of each group, i.e. the 100 most popular.\n",
"df_top_p = df_top_p.groupby('release_year').head(100).reset_index(drop=True)\n",
"# Check: the result should start at 1960 and run from high to low popularity.\n",
"df_top_p.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>B) Select the top 100 highest-revenue movies in each year.</b>"
]
},
{
"cell_type": "code",
"execution_count": 480,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>popularity</th>\n",
" <th>budget</th>\n",
" <th>revenue</th>\n",
" <th>original_title</th>\n",
" <th>cast</th>\n",
" <th>director</th>\n",
" <th>keywords</th>\n",
" <th>runtime</th>\n",
" <th>genres</th>\n",
" <th>production_companies</th>\n",
" <th>release_date</th>\n",
" <th>vote_count</th>\n",
" <th>vote_average</th>\n",
" <th>release_year</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>967</td>\n",
" <td>1.136943</td>\n",
" <td>12000000.0</td>\n",
" <td>60000000.0</td>\n",
" <td>Spartacus</td>\n",
" <td>Kirk Douglas|Laurence Olivier|Jean Simmons|Cha...</td>\n",
" <td>Stanley Kubrick</td>\n",
" <td>gladiator|roman empire|gladiator fight|slavery...</td>\n",
" <td>197</td>\n",
" <td>Action|Drama|History</td>\n",
" <td>Bryna Productions</td>\n",
" <td>10/6/60</td>\n",
" <td>211</td>\n",
" <td>6.9</td>\n",
" <td>1960</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>539</td>\n",
" <td>2.610362</td>\n",
" <td>806948.0</td>\n",
" <td>32000000.0</td>\n",
" <td>Psycho</td>\n",
" <td>Anthony Perkins|Vera Miles|John Gavin|Janet Le...</td>\n",
" <td>Alfred Hitchcock</td>\n",
" <td>hotel|clerk|arizona|shower|rain</td>\n",
" <td>109</td>\n",
" <td>Drama|Horror|Thriller</td>\n",
" <td>Shamley Productions</td>\n",
" <td>8/14/60</td>\n",
" <td>1180</td>\n",
" <td>8.0</td>\n",
" <td>1960</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id popularity budget revenue original_title \\\n",
"0 967 1.136943 12000000.0 60000000.0 Spartacus \n",
"1 539 2.610362 806948.0 32000000.0 Psycho \n",
"\n",
" cast director \\\n",
"0 Kirk Douglas|Laurence Olivier|Jean Simmons|Cha... Stanley Kubrick \n",
"1 Anthony Perkins|Vera Miles|John Gavin|Janet Le... Alfred Hitchcock \n",
"\n",
" keywords runtime \\\n",
"0 gladiator|roman empire|gladiator fight|slavery... 197 \n",
"1 hotel|clerk|arizona|shower|rain 109 \n",
"\n",
" genres production_companies release_date vote_count \\\n",
"0 Action|Drama|History Bryna Productions 10/6/60 211 \n",
"1 Drama|Horror|Thriller Shamley Productions 8/14/60 1180 \n",
"\n",
" vote_average release_year \n",
"0 6.9 1960 \n",
"1 8.0 1960 "
]
},
"execution_count": 480,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select the top 100 highest-revenue movies in each year.\n",
"# First sort by release year (ascending) and revenue (descending).\n",
"df_top_r = df.sort_values(['release_year','revenue'], ascending=[True, False])\n",
"# Group by year and keep the first 100 rows of each group, i.e. the 100 highest revenues.\n",
"df_top_r = df_top_r.groupby('release_year').head(100).reset_index(drop=True)\n",
"# Check: the result should start at 1960 and run from high to low revenue.\n",
"df_top_r.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>C) Select the top 100 highest-rated movies in each year.</b>"
]
},
{
"cell_type": "code",
"execution_count": 481,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>popularity</th>\n",
" <th>budget</th>\n",
" <th>revenue</th>\n",
" <th>original_title</th>\n",
" <th>cast</th>\n",
" <th>director</th>\n",
" <th>keywords</th>\n",
" <th>runtime</th>\n",
" <th>genres</th>\n",
" <th>production_companies</th>\n",
" <th>release_date</th>\n",
" <th>vote_count</th>\n",
" <th>vote_average</th>\n",
" <th>release_year</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>539</td>\n",
" <td>2.610362</td>\n",
" <td>806948.0</td>\n",
" <td>32000000.0</td>\n",
" <td>Psycho</td>\n",
" <td>Anthony Perkins|Vera Miles|John Gavin|Janet Le...</td>\n",
" <td>Alfred Hitchcock</td>\n",
" <td>hotel|clerk|arizona|shower|rain</td>\n",
" <td>109</td>\n",
" <td>Drama|Horror|Thriller</td>\n",
" <td>Shamley Productions</td>\n",
" <td>8/14/60</td>\n",
" <td>1180</td>\n",
" <td>8.0</td>\n",
" <td>1960</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>284</td>\n",
" <td>0.947307</td>\n",
" <td>3000000.0</td>\n",
" <td>25000000.0</td>\n",
" <td>The Apartment</td>\n",
" <td>Jack Lemmon|Shirley MacLaine|Fred MacMurray|Ra...</td>\n",
" <td>Billy Wilder</td>\n",
" <td>new york|new year's eve|lovesickness|age diffe...</td>\n",
" <td>125</td>\n",
" <td>Comedy|Drama|Romance</td>\n",
" <td>United Artists|The Mirisch Company</td>\n",
" <td>6/15/60</td>\n",
" <td>235</td>\n",
" <td>7.9</td>\n",
" <td>1960</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id popularity budget revenue original_title \\\n",
"0 539 2.610362 806948.0 32000000.0 Psycho \n",
"1 284 0.947307 3000000.0 25000000.0 The Apartment \n",
"\n",
" cast director \\\n",
"0 Anthony Perkins|Vera Miles|John Gavin|Janet Le... Alfred Hitchcock \n",
"1 Jack Lemmon|Shirley MacLaine|Fred MacMurray|Ra... Billy Wilder \n",
"\n",
" keywords runtime \\\n",
"0 hotel|clerk|arizona|shower|rain 109 \n",
"1 new york|new year's eve|lovesickness|age diffe... 125 \n",
"\n",
" genres production_companies release_date \\\n",
"0 Drama|Horror|Thriller Shamley Productions 8/14/60 \n",
"1 Comedy|Drama|Romance United Artists|The Mirisch Company 6/15/60 \n",
"\n",
" vote_count vote_average release_year \n",
"0 1180 8.0 1960 \n",
"1 235 7.9 1960 "
]
},
"execution_count": 481,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Select the top 100 highest-rated movies in each year.\n",
"# First sort by release year (ascending) and score rating (descending).\n",
"df_top_s = df.sort_values(['release_year','vote_average'], ascending=[True, False])\n",
"# Group by year and keep the first 100 rows of each group, i.e. the 100 highest ratings.\n",
"df_top_s = df_top_s.groupby('release_year').head(100).reset_index(drop=True)\n",
"# Check: the result should start at 1960 and run from high to low score rating.\n",
"df_top_s.head(2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>D) For comparison, I also create three sub-datasets holding the bottom 100 movies of each year.</b>"
]
},
{
"cell_type": "code",
"execution_count": 482,
"metadata": {},
"outputs": [],
"source": [
"# The 100 least popular movies in each year (popularity sorted ascending).\n",
"df_low_p = df.sort_values(['release_year','popularity'], ascending=[True, True])\n",
"df_low_p = df_low_p.groupby('release_year').head(100).reset_index(drop=True)\n",
"# The 100 lowest-revenue movies in each year (revenue sorted ascending).\n",
"df_low_r = df.sort_values(['release_year','revenue'], ascending=[True, True])\n",
"df_low_r = df_low_r.groupby('release_year').head(100).reset_index(drop=True)\n",
"# The 100 lowest-rated movies in each year (vote_average sorted ascending).\n",
"df_low_s = df.sort_values(['release_year','vote_average'], ascending=[True, True])\n",
"df_low_s = df_low_s.groupby('release_year').head(100).reset_index(drop=True)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:py3]",
"language": "python",
"name": "conda-env-py3-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
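
The six selections in the notebook all repeat one pattern: sort by `release_year` and a ranking column, then take the first 100 rows of each year group. As a minimal sketch, that pattern could be folded into a single helper. The name `top_n_per_year`, its `lowest` flag, and the tiny `demo` frame are assumptions made for illustration; none of them appear in the original notebook, and the demo values are invented.

```python
import pandas as pd


def top_n_per_year(df, column, n=100, lowest=False):
    """Return up to n rows per release_year, ranked by `column`.

    Hypothetical helper, not part of the original notebook.
    lowest=False keeps the highest values, lowest=True the lowest ones.
    """
    # Year ascending; the ranking column ascending only when we want the lowest.
    ordered = df.sort_values(['release_year', column], ascending=[True, lowest])
    # head(n) keeps the first n rows of every year group in sorted order.
    return ordered.groupby('release_year').head(n).reset_index(drop=True)


# Made-up frame, only to show the call shape (not the real movie data).
demo = pd.DataFrame({
    'release_year': [1960, 1960, 1960, 1961, 1961],
    'popularity': [2.6, 1.9, 0.5, 3.0, 1.0],
})

top2 = top_n_per_year(demo, 'popularity', n=2)
low1 = top_n_per_year(demo, 'popularity', n=1, lowest=True)
```

With the notebook's data, `top_n_per_year(df, 'revenue')` would play the role of `df_top_r` and `top_n_per_year(df, 'revenue', lowest=True)` that of `df_low_r`. One detail worth keeping in mind: `sort_values` places missing values last by default in both directions, so rows with a missing ranking value only enter a selection when a year has fewer than `n` rows with a value.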