{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Cleaning and Exploration: The Movie\n",
"\n",
"(See the README of this repository for the entire write-up.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I collected quite a bit of data: 43837 separate movies. The actual cleaning of the data was as tedious and dry as the following paragraphs."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"43837"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"import imdb\n",
"import re\n",
"import pandas as pd\n",
"import numpy as np\n",
"import ast\n",
"from datetime import datetime, timedelta\n",
"from sklearn.feature_extraction.text import CountVectorizer\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lots of the data came in lists, such as a list of the first four actors in a film. I had to go through and strip every member of each list, then join the lists on pipes so they could later be split back apart by my count vectorizer. I wrote a bunch of functions to clean the data, including a few I didn't end up needing. These functions did things like convert data to floats or create new columns of data."
]
},
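{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a toy illustration (hypothetical names, not rows from the real DataFrame): members get stripped, joined on pipes, and the vectorizer's tokenizer later splits them back apart on `|`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical actor list with stray spaces\n",
"raw = [' Bill Murray', 'Dan Aykroyd ', ' Sigourney Weaver ']\n",
"stripped = [name.strip() for name in raw]\n",
"joined = '|'.join(stripped)\n",
"print(joined) # Bill Murray|Dan Aykroyd|Sigourney Weaver\n",
"print(joined.split('|')) # the tokenizer undoes the join"
]
},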
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"def release_to_datetime(n):\n",
" if isinstance(n, str):\n",
" n = n.replace(' Nov ', '-11-').replace(' Jun ', '-6-').replace(' Aug ', '-8-').replace(' Dec ', '-12-')\n",
" n = n.replace(' Oct ', '-10-').replace(' Jan ', '-1-').replace(' Feb ', '-2-').replace(' Mar ', '-3-')\n",
" n = n.replace(' Apr ', '-4-').replace(' May ', '-5-').replace(' Jul ', '-7-').replace(' Sep ', '-9-')\n",
" n = datetime.strptime(n, '%d-%m-%Y').date()\n",
" return n\n",
" else:\n",
" return n\n",
"\n",
"def delta_release(n):\n",
" y2k = datetime.strptime('01-01-2000', '%d-%m-%Y').date()\n",
" try:\n",
" m = y2k - n\n",
" return m.days\n",
" except Exception:\n",
" return np.nan\n",
"\n",
"def pull_month(n):\n",
" try:\n",
" return n.month\n",
" except Exception:\n",
" return np.nan\n",
"\n",
"def pull_day(n):\n",
" try:\n",
" return n.day\n",
" except Exception:\n",
" return np.nan\n",
" \n",
"def runtime_to_float(x):\n",
" try:\n",
" return float(x)\n",
" except Exception:\n",
" return np.nan\n",
" \n",
"def boxoffice_to_float(x):\n",
" try:\n",
" return float(x.replace(',',\"\").replace(\"$\",\"\"))\n",
" except Exception:\n",
" return np.nan\n",
"\n",
"def RT_pull_out(entry):\n",
" # Return the Rotten Tomatoes score if present; fall through to NaN\n",
" # only after checking every source, not on the first non-RT entry\n",
" for m in entry:\n",
" if m['Source'] == 'Rotten Tomatoes':\n",
" n = m['Value'].replace('%', '')\n",
" return int(n)\n",
" return np.nan\n",
" \n",
"def evan_train_test_df_cvec_capstone(train, test, min_df):\n",
" dummy_list_train = []\n",
" dummy_list_test = []\n",
" for x in train.columns:\n",
" cvec = CountVectorizer(binary=True,\n",
" tokenizer=(lambda m: m.split('|') ),\n",
" min_df = min_df,\n",
" stop_words = 'english',\n",
" strip_accents='unicode')\n",
" cvec.fit(train[x])\n",
" lonely_matrix_train = cvec.transform(train[x])\n",
" lonely_matrix_test = cvec.transform(test[x])\n",
" df_train = pd.DataFrame(lonely_matrix_train.todense(), columns=cvec.get_feature_names())\n",
" df_test = pd.DataFrame(lonely_matrix_test.todense(), columns=cvec.get_feature_names())\n",
" dummy_list_train.append(df_train)\n",
" dummy_list_test.append(df_test)\n",
" dummied_df_train = pd.concat(dummy_list_train, axis=1)\n",
" dummied_df_test = pd.concat(dummy_list_test, axis=1)\n",
" return dummied_df_train, dummied_df_test\n",
"\n",
"def movie_split_and_join(train, test, func, min_df=1):\n",
" train_obj = train.select_dtypes(include=[np.object_])\n",
" train_num = train.select_dtypes(include=[np.number, np.bool_])\n",
" test_obj = test.select_dtypes(include=[np.object_])\n",
" test_num = test.select_dtypes(include=[np.number, np.bool_])\n",
" train_obj_dums, test_obj_dums = func(train_obj, test_obj, min_df)\n",
" train_obj_dums.reset_index(drop=True, inplace=True)\n",
" test_obj_dums.reset_index(drop=True, inplace=True)\n",
" train_num.reset_index(drop=True, inplace=True)\n",
" test_num.reset_index(drop=True, inplace=True)\n",
" final_train = pd.concat([train_num, train_obj_dums], axis=1)\n",
" final_test = pd.concat([test_num, test_obj_dums], axis=1)\n",
" return final_train, final_test\n",
"\n",
"def strip_list(column):\n",
" for n in column:\n",
" for m in range(len(n)):\n",
" n[m] = n[m].strip()\n",
" return column\n",
"\n",
"def put_in_avgs(train, test, df):\n",
" ind = 0\n",
" for n in train.columns:\n",
" for m in list(zip(df.name, df.avgscore)):\n",
" if n == m[0]:\n",
" train[n] *= m[1]\n",
" ind += 1\n",
" if ind % 10000 == 0:\n",
" print(ind)\n",
" ind = 0\n",
" for n in test.columns:\n",
" for m in list(zip(df.name, df.avgscore)):\n",
" if n == m[0]:\n",
" test[n] *= m[1]\n",
" ind += 1\n",
" if ind % 10000 == 0:\n",
" print(ind)\n",
" print(train.shape)\n",
" print(test.shape)\n",
" return train, test\n",
"\n",
"def single_column_cvec(train, test, min_df):\n",
" cvec = CountVectorizer(binary=True,\n",
" tokenizer=(lambda m: m.split('|') ),\n",
" min_df = min_df,\n",
" stop_words = 'english',\n",
" strip_accents='unicode')\n",
" cvec.fit(train)\n",
" lonely_matrix_train = cvec.transform(train)\n",
" lonely_matrix_test = cvec.transform(test)\n",
" new_train = pd.DataFrame(lonely_matrix_train.todense(), columns=cvec.get_feature_names())\n",
" new_test = pd.DataFrame(lonely_matrix_test.todense(), columns=cvec.get_feature_names())\n",
" return new_train, new_test"
]
},
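{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, the month-name replacements in `release_to_datetime` could also be handled by `strptime`'s `%b` directive, which parses abbreviated month names directly (a sketch, assuming the default English locale):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datetime import datetime\n",
"\n",
"# '%b' matches 'Jun', 'Nov', etc. without any manual replacement\n",
"datetime.strptime('11 Jun 1993', '%d %b %Y').date() # 1993-06-11"
]
},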
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv(\"new_all_my_movies_final.csv\", converters={\"Actors\": ast.literal_eval, \n",
" \"Director\": ast.literal_eval, \n",
" \"Genre\": ast.literal_eval, \n",
" \"RTRating\": ast.literal_eval, \n",
" \"Writer\": ast.literal_eval,\n",
" \"Year\": ast.literal_eval})"
]
},
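{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `converters` are needed because list-valued columns round-trip through CSV as their string representation; `ast.literal_eval` turns them back into real lists. A toy sketch:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import ast\n",
"\n",
"# What a list looks like after being written to and read back from a CSV\n",
"cell = \"['Bill Murray', 'Dan Aykroyd']\"\n",
"ast.literal_eval(cell) # back to a real Python list"
]
},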
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following cell contains remnants from what I'll call \"The Actor Average Debacle\" later on in the presentation."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"award_df = pd.read_csv('meta_award_add_final.csv')\n",
"writers_df = pd.read_csv('writers_df.csv')\n",
"actors_df = pd.read_csv('actors_df.csv')\n",
"directors_df = pd.read_csv('directors_df.csv')\n",
"# actoravg= pd.read_csv('NewActorAvg.csv') # Plaguing problem \n",
"# morta_df = pd.read_csv('morta.csv') # Same Plaguing problem"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Getting rid of pesky extra spaces\n",
"df.Actors = strip_list(df.Actors)\n",
"df.Director = strip_list(df.Director)\n",
"df.Writer = strip_list(df.Writer)\n",
"\n",
"# Getting rid of my silly index column and dropping the duplicates\n",
"df.drop(['Unnamed: 0'], axis=1, inplace=True) \n",
"df = df.drop_duplicates(subset=['imdbID'], keep='first')\n",
"\n",
"# Joining actor list as pipes\n",
"df.Actors = df.Actors.map(lambda x: '|'.join(x))\n",
"\n",
"# Joining directors as pipes\n",
"# Taking out any stuff in parentheses\n",
"df.Director = df.Director.map(lambda x: '|'.join(x))\n",
"df.Director = df.Director.map(lambda x: re.sub(\"[\\(\\[].*?[\\)\\]]\", \"\", x))\n",
"\n",
"# Joining genres as pipes\n",
"df.Genre = df.Genre.map(lambda x: '|'.join(x))\n",
"\n",
"# Joining writers as pipes\n",
"# Taking out any stuff in parentheses\n",
"df.Writer = df.Writer.map(lambda x: '|'.join(x))\n",
"df.Writer = df.Writer.map(lambda x: re.sub(\"[\\(\\[].*?[\\)\\]]\", \"\", x))\n",
"\n",
"# Pulling out rotten tomato rating from the RTRating column\n",
"df.RTRating = df.RTRating.map(RT_pull_out)\n",
"\n",
"# Turning released to datetime object as well as creating a delta column\n",
"# Also creating a column for number of month and number of day\n",
"df.Released = df.Released.map(release_to_datetime)\n",
"df['days_from_y2k'] = df.Released.map(delta_release)\n",
"df['month'] = df.Released.map(pull_month)\n",
"df['day'] = df.Released.map(pull_day)\n",
"\n",
"# Turning Runtime and BoxOffice to float objects\n",
"df.Runtime = df.Runtime.map(runtime_to_float)\n",
"df.BoxOffice = df.BoxOffice.map(boxoffice_to_float)\n",
"\n",
"# Sorting the DataFrame on released\n",
"df = df.sort_values(['Released'], ascending=True)\n",
"df.reset_index(drop=True, inplace=True)\n",
"\n",
"# Adding a title length column \n",
"df['title_length'] = df.Title.map(lambda x: len(x))\n",
"\n",
"# Saving as a csv\n",
"pd.DataFrame(df).to_csv('cleaned_movie_df.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because of the size of the dataset, I didn't spend much time imputing missing values, especially since many of them came from lesser-known foreign films that probably wouldn't have added much to a model. The graph of missing values is below."
]
},
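{
"cell_type": "markdown",
"metadata": {},
"source": [
"The counts behind such a graph can be sketched on a toy frame (hypothetical columns, not the real dataset):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"toy = pd.DataFrame({'BoxOffice': [1.0, np.nan, np.nan],\n",
" 'Runtime': [90.0, 101.0, np.nan]})\n",
"missing = toy.isnull().sum().sort_values(ascending=False)\n",
"missing # BoxOffice 2, Runtime 1; missing.plot(kind='bar') draws the graph"
]
},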
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {