@onpillow
Created December 7, 2018 08:57
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Data Cleaning "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**First, following the earlier decision, drop the unnecessary columns: `imdb_id`, `homepage`, `tagline`, `overview`, `budget_adj`, and `revenue_adj`.**"
]
},
{
"cell_type": "code",
"execution_count": 383,
"metadata": {},
"outputs": [],
"source": [
"# Drop the columns that are not needed for the analysis.\n",
"col = ['imdb_id', 'homepage', 'tagline', 'overview', 'budget_adj', 'revenue_adj']\n",
"df.drop(col, axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": 384,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style>\n",
" .dataframe thead tr:only-child th {\n",
" text-align: right;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>id</th>\n",
" <th>popularity</th>\n",
" <th>budget</th>\n",
" <th>revenue</th>\n",
" <th>original_title</th>\n",
" <th>cast</th>\n",
" <th>director</th>\n",
" <th>keywords</th>\n",
" <th>runtime</th>\n",
" <th>genres</th>\n",
" <th>production_companies</th>\n",
" <th>release_date</th>\n",
" <th>vote_count</th>\n",
" <th>vote_average</th>\n",
" <th>release_year</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>135397</td>\n",
" <td>32.985763</td>\n",
" <td>150000000</td>\n",
" <td>1513528810</td>\n",
" <td>Jurassic World</td>\n",
" <td>Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...</td>\n",
" <td>Colin Trevorrow</td>\n",
" <td>monster|dna|tyrannosaurus rex|velociraptor|island</td>\n",
" <td>124</td>\n",
" <td>Action|Adventure|Science Fiction|Thriller</td>\n",
" <td>Universal Studios|Amblin Entertainment|Legenda...</td>\n",
" <td>6/9/15</td>\n",
" <td>5562</td>\n",
" <td>6.5</td>\n",
" <td>2015</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id popularity budget revenue original_title \\\n",
"0 135397 32.985763 150000000 1513528810 Jurassic World \n",
"\n",
" cast director \\\n",
"0 Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... Colin Trevorrow \n",
"\n",
" keywords runtime \\\n",
"0 monster|dna|tyrannosaurus rex|velociraptor|island 124 \n",
"\n",
" genres \\\n",
"0 Action|Adventure|Science Fiction|Thriller \n",
"\n",
" production_companies release_date vote_count \\\n",
"0 Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 \n",
"\n",
" vote_average release_year \n",
"0 6.5 2015 "
]
},
"execution_count": 384,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Confirm that the columns were dropped.\n",
"df.head(1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Drop the duplicates.**"
]
},
{
"cell_type": "code",
"execution_count": 385,
"metadata": {},
"outputs": [],
"source": [
"# Drop rows that are exact duplicates of an earlier row\n",
"df.drop_duplicates(inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:py3]",
"language": "python",
"name": "conda-env-py3-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}