{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Collection: The Movie\n",
"\n",
"(Go to the READ.ME of this repository for the entire write-up.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first step in my capstone project to collect a large database of films was to first try and get a list of every movie I could. I figured that Wikipedia would have as many movies as I would need, and if a movie wasn't on wikipedia, it was also unlikely to be one that provided me with any useful information in regards to Metascore. So, I used a Wikipedia Python package. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import wikipedia\n",
"import warnings\n",
"warnings.simplefilter(\"ignore\")\n",
"import re\n",
"import pandas as pd\n",
"import numpy as np\n",
"from bs4 import BeautifulSoup\n",
"import requests\n",
"import ast"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After several attempts at trying to search for movies by year on Wikipedia, I found out that Wikipedia just has a page for \"list of movies,\" which was great because it was easy to scrape, though slightly frustrating as I had already put in several hours of trying to scrape the movies by year pages. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"movie_list = []\n",
"slug_list = ['numbers', 'A','B','C','D','E','F','G','H','I','J–K','L','M','N-O','P','Q–R','S','T','U-W','X–Z']\n",
"for title in slug_list:\n",
" df = wikipedia.page(f'List of films:_{title}')\n",
" movie_list.extend(df.links)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The entries weren't just clean titles, and there was some section headers in there as well. Also, the movies had \"(film)\" appended at the end, sometimes with a year, such as \"(2009 film)\". Because there are many remakes of movies and movies with the same title, getting the year here was important. So I removed all section headers, removed all \"(film)\"s, and extracted the year in a tuple. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for index, n in enumerate(movie_list):\n",
" if 'ist of fil' in n:\n",
" movie_list.remove(n)\n",
" if '(film)' in n:\n",
" movie_list[index] = n[0:-7]\n",
" if ' film)' in n:\n",
" movie_list[index] = (n.split(' (')[0],n.split('(')[1][0:4])\n",
"\n",
"movie_list"
]
},
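{
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration, the cleaning rules behave roughly like this on a few made-up entries (the sample titles below are hypothetical, not taken from the scraped list):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical sample entries, one per cleaning rule\n",
"samples = ['List of films: H', 'Heat (film)', 'Heat (1995 film)']\n",
"for n in samples:\n",
"    if 'ist of fil' in n:\n",
"        continue  # section-header link, dropped\n",
"    if '(film)' in n:\n",
"        print(n[0:-7])  # -> 'Heat'\n",
"    elif ' film)' in n:\n",
"        print((n.split(' (')[0], n.split('(')[1][0:4]))  # -> ('Heat', '1995')"
]
},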
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here I made sure the list contained no doubles. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"movie_list = list(set(movie_list))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It was after this painstaking process that I found a huge list of movies on the site movielens (https://grouplens.org/datasets/movielens/latest/). However, instead of deciding to give up my life as a data scientist and moving to the remote woods, I noticed that this list was only updated as of August 2017, so I knew I had more movies from my scrape. Thus, I download the csv then extracted titles and years from it. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"data = pd.read_csv('movies.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"data.title[2][0:-7]\n",
"data.title[2][-5:-1]"
]
},
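{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the slicing concrete, here is what it does on an illustrative MovieLens-style title (the sample string is my own, not from the dataset):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sample = 'Toy Story (1995)'  # illustrative 'Title (YYYY)' format\n",
"print(sample[0:-7])   # 'Toy Story' -- everything before ' (YYYY)'\n",
"print(sample[-5:-1])  # '1995'      -- the year inside the parentheses"
]
},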
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then I appended the titles to my previous list of movies, taking out certain strings that I would later find to be problematic. Also, because every movie in the downloaded database had a year attached, I took that to and made a list of tuples. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for n in data.title:\n",
" movie_list.append((n[0:-7].split(' (')[0].split(', The')[0].replace('&','and'), n[-5:-1]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lists of tuples were causing problems for me when loading in the data, so I decided to turn this list of tuples and strings to a DataFrame, putting in NaN values when I didn't have a year."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"list_of_dicts = []\n",
"for title in (movie_list):\n",
" temp_dict = {}\n",
" if type(title) == str:\n",
" temp_dict['title'] = title\n",
" temp_dict['year'] = np.nan\n",
" else:\n",
" temp_dict['title'] = title[0]\n",
" temp_dict['year'] = title[1]\n",
" list_of_dicts.append(temp_dict)\n",
"list_of_dicts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Saved it as a csv."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.DataFrame(list_of_dicts).to_csv('new_movie_list.csv', index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Loaded it back in for Amazon Web Services. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_movie_list = pd.read_csv('new_movie_list.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, I used a website that collects data from Metacritic, Rotten Tomatoes, IMDB, and a couple others and allows you to search through it. The API provided the following information: \n",
"\n",
"- Actors\n",
"- Awards\n",
"- BoxOffice\n",
"- Country\n",
"- DVD\n",
"- Director\n",
"- Genre\n",
"- Language\n",
"- Metascore\n",
"- Plot\n",
"- Poster\n",
"- Production\n",
"- Rated\n",
"- Ratings\n",
"- Released\n",
"- Response\n",
"- Runtime\n",
"- Title\n",
"- Type\n",
"- Website\n",
"- Writer\n",
"- Year\n",
"- imdbID\n",
"- imdbRating\n",
"- imdbVotes\n",
"\n",
"I didn't take \"Awards,\" \"DVD,\" \"Plot,\" or \"imdbVotes,\" because all of those attributes are things you will never have access to before a movies comes out. I kept the rating values to use as my target variable. \n",
"\n",
"I also didn't take \"Country,\" \"Language,\" \"Poster,\" \"Reponse,\" \"Type,\" or \"Website,\" because none of these things gave any valuable information. Perhaps country or language would be somewhat illuminating, and I may take them at a future date.\n",
"\n",
"My main issues with this API were that it restricted actors to the top four billed, had no other crew (Cinematographer? Hello? Composer?) and also didn't have things like opening weekend box office, budget, months in production, whether the movie is part of a franchise, etc. There are other databases with this type of information and I plan to access those in the future. \n",
"\n",
"Running through the script below took approximately 11 hours and required my patronage of the API at a rate of 1 dollar a month, which I thought was pretty reasonable. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"new_movie_list = pd.read_csv('new_movie_list.csv')\n",
"new_movie_successes = []\n",
"new_movie_failures = []\n",
"new_all_my_movies = []\n",
"index=0\n",
"for title, year in zip(new_movie_list.title, new_movie_list.year):\n",
" if index % 1000 == 0:\n",
" pd.DataFrame(new_all_my_movies).to_csv(f'new_all_my_movies_{index}.csv')\n",
" try: \n",
" temp_dict = {}\n",
" if type(year) == str:\n",
" murl = (f'http://www.omdbapi.com/?apikey=eac947e0&t={title}&y={year}&r=json')\n",
" else:\n",
" murl = (f'http://www.omdbapi.com/?apikey=eac947e0&t={title}&r=json') \n",
" res = requests.get(murl)\n",
" res_json = res.json()\n",
" temp_dict['Title'] = res_json['Title']\n",
" temp_dict['Rated'] = res_json['Rated']\n",
" temp_dict['Released'] = res_json['Released']\n",
" temp_dict['Year'] = res_json['Year']\n",
" temp_dict['Runtime'] = res_json['Runtime'][0:-4]\n",
" temp_dict['Genre'] = res_json['Genre'].split(',')\n",
" temp_dict['Director'] = res_json['Director'].split(',')\n",
" temp_dict['Writer'] = res_json['Writer'].replace(' (additional dialogue)', '')\\\n",
" .replace(' (characters)', '').replace(' (screenplay)', '').replace(' (story)', '').split(',')\n",
" temp_dict['Actors'] = res_json['Actors'].split(',')\n",
" temp_dict['Metascore'] = res_json['Metascore']\n",
" temp_dict['RTRating'] = res_json['Ratings']\n",
" temp_dict['imdbRating'] = res_json['imdbRating']\n",
" temp_dict['imdbID'] = res_json['imdbID']\n",
" temp_dict['BoxOffice'] = res_json['BoxOffice']\n",
" temp_dict['Production'] = res_json['Production']\n",
" new_all_my_movies.append(temp_dict)\n",
" new_movie_successes.append(title)\n",
" pd.DataFrame(new_movie_successes).to_csv('new_movie_successes.csv')\n",
" index += 1\n",
" except:\n",
" new_movie_failures.append(title)\n",
" pd.DataFrame(new_movie_failures).to_csv('new_movie_failures.csv')\n",
" index += 1\n",
" pass\n",
"pd.DataFrame(new_all_my_movies).to_csv('new_all_my_movies_final.csv')"
]
},
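{
"cell_type": "markdown",
"metadata": {},
"source": [
"One note on the \"Ratings\" field: it comes back as a list of source/value dictionaries rather than a single number, so individual scores still have to be extracted downstream. A minimal sketch of that extraction, assuming the standard OMDb format (the payload below is illustrative, not a real response):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative OMDb-style Ratings payload (not a real API response)\n",
"ratings = [{'Source': 'Internet Movie Database', 'Value': '8.1/10'},\n",
"           {'Source': 'Rotten Tomatoes', 'Value': '94%'}]\n",
"# Pull the Rotten Tomatoes value if present, else NaN\n",
"rt = next((r['Value'] for r in ratings if r['Source'] == 'Rotten Tomatoes'), np.nan)\n",
"rt"
]
},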
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I came back to scrape the award column, but in the end I thought better of using it and it now gathers dust in my repository. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"additional = pd.read_csv('cleaned_movie_df.csv')\n",
"meta_award_add = []\n",
"index=0\n",
"for imdbid in additional.imdbID:\n",
" if index % 1000 == 0:\n",
" pd.DataFrame(meta_award_add).to_csv(f'meta_award_add{index}.csv')\n",
" try: \n",
" temp_dict = {}\n",
" murl = (f'http://www.omdbapi.com/?apikey=eac947e0&i={imdbid}&r=json') \n",
" res = requests.get(murl)\n",
" res_json = res.json()\n",
" temp_dict['imdbID'] = imdbid\n",
" temp_dict['Awards'] = res_json['Awards']\n",
" meta_award_add.append(temp_dict)\n",
" index+=1\n",
" except:\n",
" index+=1\n",
" pass\n",
"pd.DataFrame(meta_award_add).to_csv('meta_award_add_final.csv')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [default]",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}