@priyankamandikal
Created November 2, 2015 15:44
IMDb scraper using Pattern and BeautifulSoup
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# IMDb Scraping\n",
"\n",
"`requests` is a Python library for making HTTP requests, such as fetching web pages.<br>\n",
"http://docs.python-requests.org/en/v2.0-0/user/quickstart/"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import requests\n",
"from pattern import web\n",
"from BeautifulSoup import BeautifulSoup"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Two ways of making requests\n",
"\n",
"#### 1. Explicit URL"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2015\n"
]
}
],
"source": [
"url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2015'\n",
"r = requests.get(url)\n",
"print r.url"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2. Base URL with GET dictionary\n",
"`params` is a way to pass the query-string parameters as a dictionary; `requests` encodes them and appends them to the base URL."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"http://www.imdb.com/search/title?sort=num_votes%2Cdesc&start=1&title_type=feature&year=1950%2C2015\n"
]
}
],
"source": [
"url = 'http://www.imdb.com/search/title'\n",
"params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2015')\n",
"r = requests.get(url, params=params)\n",
"print r.url # notice it constructs the full url for you"
]
},
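{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of what `params` does, assuming only that `requests` is installed: `requests.Request(...).prepare()` builds the final URL without sending anything over the network, so the encoding can be inspected offline. The name `prepared` is illustrative and not part of the original notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import requests\n",
"\n",
"url = 'http://www.imdb.com/search/title'\n",
"params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2015')\n",
"\n",
"# Build the request without sending it; prepare() encodes params into the URL\n",
"prepared = requests.Request('GET', url, params=params).prepare()\n",
"\n",
"# The commas in the values are percent-encoded as %2C\n",
"print(prepared.url)"
]
},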
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Pattern\n",
"\n",
"Documentation for `pattern.web`:<br/>\n",
"http://www.clips.ua.ac.be/pages/pattern-web<br/>\n",
"`r.text` holds the HTML source of the entire page. `dom` is the parsed tree built from that source, so its tags can be searched with `by_tag`."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The Shawshank Redemption 142 mins. 9.3 [u'Crime', u'Drama']\n",
"The Dark Knight 152 mins. 9.0 [u'Action', u'Crime', u'Drama']\n",
"Inception 148 mins. 8.8 [u'Action', u'Mystery', u'Sci-Fi', u'Thriller']\n",
"Fight Club 139 mins. 8.9 [u'Drama']\n",
"Pulp Fiction 154 mins. 8.9 [u'Crime', u'Drama']\n",
"The Lord of the Rings: The Fellowship of the Ring 178 mins. 8.8 [u'Adventure', u'Fantasy']\n",
"Forrest Gump 142 mins. 8.8 [u'Drama', u'Romance']\n",
"The Matrix 136 mins. 8.7 [u'Action', u'Sci-Fi']\n",
"The Lord of the Rings: The Return of the King 201 mins. 8.9 [u'Adventure', u'Fantasy']\n",
"The Godfather 175 mins. 9.2 [u'Crime', u'Drama']\n",
"The Dark Knight Rises 165 mins. 8.5 [u'Action', u'Thriller']\n",
"The Lord of the Rings: The Two Towers 179 mins. 8.7 [u'Adventure', u'Fantasy']\n",
"Se7en 127 mins. 8.6 [u'Crime', u'Drama', u'Mystery', u'Thriller']\n",
"The Avengers 143 mins. 8.1 [u'Action', u'Adventure', u'Sci-Fi', u'Thriller']\n",
"Gladiator 155 mins. 8.5 [u'Action', u'Drama']\n",
"Batman Begins 140 mins. 8.3 [u'Action', u'Adventure']\n",
"Django Unchained 165 mins. 8.5 [u'Western']\n",
"Avatar 162 mins. 7.9 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n",
"The Silence of the Lambs 118 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n",
"Saving Private Ryan 169 mins. 8.6 [u'Action', u'Drama', u'War']\n",
"Star Wars 121 mins. 8.7 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n",
"The Departed 151 mins. 8.5 [u'Crime', u'Drama', u'Thriller']\n",
"Schindler&#x27;s List 195 mins. 8.9 [u'Biography', u'Drama', u'History']\n",
"Inglourious Basterds 153 mins. 8.3 [u'Adventure', u'Drama', u'War']\n",
"Memento 113 mins. 8.5 [u'Mystery', u'Thriller']\n",
"The Prestige 130 mins. 8.5 [u'Drama', u'Mystery', u'Thriller']\n",
"Interstellar 169 mins. 8.7 [u'Adventure', u'Drama', u'Sci-Fi']\n",
"American Beauty 122 mins. 8.4 [u'Drama', u'Romance']\n",
"Pirates of the Caribbean: The Curse of the Black Pearl 143 mins. 8.1 [u'Action', u'Adventure', u'Fantasy']\n",
"Titanic 194 mins. 7.7 [u'Drama', u'Romance']\n",
"Star Wars: Episode V - The Empire Strikes Back 124 mins. 8.8 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n",
"V for Vendetta 132 mins. 8.2 [u'Action', u'Drama', u'Thriller']\n",
"American History X 119 mins. 8.6 [u'Crime', u'Drama']\n",
"The Godfather: Part II 200 mins. 9.0 [u'Crime', u'Drama']\n",
"The Green Mile 189 mins. 8.5 [u'Crime', u'Drama', u'Fantasy', u'Mystery']\n",
"Shutter Island 138 mins. 8.1 [u'Mystery', u'Thriller']\n",
"Terminator 2: Judgment Day 137 mins. 8.5 [u'Action', u'Sci-Fi']\n",
"The Usual Suspects 106 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n",
"Braveheart 178 mins. 8.4 [u'Biography', u'Drama', u'History', u'War']\n",
"Kill Bill: Vol. 1 111 mins. 8.1 [u'Action']\n",
"Goodfellas 146 mins. 8.7 [u'Biography', u'Crime', u'Drama']\n",
"The Wolf of Wall Street 180 mins. 8.2 [u'Biography', u'Comedy', u'Crime', u'Drama']\n",
"L&#xE9;on: The Professional 110 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n",
"Back to the Future 116 mins. 8.5 [u'Adventure', u'Comedy', u'Sci-Fi']\n",
"The Hunger Games 142 mins. 7.3 [u'Adventure', u'Drama', u'Sci-Fi', u'Thriller']\n",
"The Sixth Sense 107 mins. 8.2 [u'Drama', u'Mystery', u'Thriller']\n",
"WALL&#xB7;E 98 mins. 8.4 [u'Animation', u'Adventure', u'Family', u'Sci-Fi']\n",
"Iron Man 126 mins. 7.9 [u'Action', u'Adventure', u'Sci-Fi']\n",
"One Flew Over the Cuckoo&#x27;s Nest 133 mins. 8.7 [u'Drama']\n",
"Finding Nemo 100 mins. 8.2 [u'Animation', u'Adventure', u'Comedy', u'Family']\n"
]
}
],
"source": [
"dom = web.Element(r.text)\n",
"for movie in dom.by_tag('td.title'):\n",
"    title = movie.by_tag('a')[0].content  # content of a tag is the stuff between the opening and closing tags\n",
"    runtime = movie.by_tag('span.runtime')[0].content\n",
"    rating = movie.by_tag('span.value')[0].content\n",
"    genres = movie.by_tag('span.genre')[0].by_tag('a')\n",
"    genre = [g.content for g in genres]\n",
"    # could have as well done\n",
"    # genre = []\n",
"    # for g in genres:\n",
"    #     genre.append(g.content)\n",
"    print title, runtime, rating, genre"
]
},
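{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using BeautifulSoup\n",
"\n",
"Beautiful Soup is a Python library for pulling data out of HTML and XML files.<br/>\n",
"The documentation linked here covers version 4; the `BeautifulSoup` module imported above is the older version 3, which supports the same `find` and `findAll` calls used below.<br/>\n",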
"http://www.crummy.com/software/BeautifulSoup/bs4/doc/"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The Shawshank Redemption 142 mins. 9.3 [u'Crime', u'Drama']\n",
"The Dark Knight 152 mins. 9.0 [u'Action', u'Crime', u'Drama']\n",
"Inception 148 mins. 8.8 [u'Action', u'Mystery', u'Sci-Fi', u'Thriller']\n",
"Fight Club 139 mins. 8.9 [u'Drama']\n",
"Pulp Fiction 154 mins. 8.9 [u'Crime', u'Drama']\n",
"The Lord of the Rings: The Fellowship of the Ring 178 mins. 8.8 [u'Adventure', u'Fantasy']\n",
"Forrest Gump 142 mins. 8.8 [u'Drama', u'Romance']\n",
"The Matrix 136 mins. 8.7 [u'Action', u'Sci-Fi']\n",
"The Lord of the Rings: The Return of the King 201 mins. 8.9 [u'Adventure', u'Fantasy']\n",
"The Godfather 175 mins. 9.2 [u'Crime', u'Drama']\n",
"The Dark Knight Rises 165 mins. 8.5 [u'Action', u'Thriller']\n",
"The Lord of the Rings: The Two Towers 179 mins. 8.7 [u'Adventure', u'Fantasy']\n",
"Se7en 127 mins. 8.6 [u'Crime', u'Drama', u'Mystery', u'Thriller']\n",
"The Avengers 143 mins. 8.1 [u'Action', u'Adventure', u'Sci-Fi', u'Thriller']\n",
"Gladiator 155 mins. 8.5 [u'Action', u'Drama']\n",
"Batman Begins 140 mins. 8.3 [u'Action', u'Adventure']\n",
"Django Unchained 165 mins. 8.5 [u'Western']\n",
"Avatar 162 mins. 7.9 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n",
"The Silence of the Lambs 118 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n",
"Saving Private Ryan 169 mins. 8.6 [u'Action', u'Drama', u'War']\n",
"Star Wars 121 mins. 8.7 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n",
"The Departed 151 mins. 8.5 [u'Crime', u'Drama', u'Thriller']\n",
"Schindler&#x27;s List 195 mins. 8.9 [u'Biography', u'Drama', u'History']\n",
"Inglourious Basterds 153 mins. 8.3 [u'Adventure', u'Drama', u'War']\n",
"Memento 113 mins. 8.5 [u'Mystery', u'Thriller']\n",
"The Prestige 130 mins. 8.5 [u'Drama', u'Mystery', u'Thriller']\n",
"Interstellar 169 mins. 8.7 [u'Adventure', u'Drama', u'Sci-Fi']\n",
"American Beauty 122 mins. 8.4 [u'Drama', u'Romance']\n",
"Pirates of the Caribbean: The Curse of the Black Pearl 143 mins. 8.1 [u'Action', u'Adventure', u'Fantasy']\n",
"Titanic 194 mins. 7.7 [u'Drama', u'Romance']\n",
"Star Wars: Episode V - The Empire Strikes Back 124 mins. 8.8 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n",
"V for Vendetta 132 mins. 8.2 [u'Action', u'Drama', u'Thriller']\n",
"American History X 119 mins. 8.6 [u'Crime', u'Drama']\n",
"The Godfather: Part II 200 mins. 9.0 [u'Crime', u'Drama']\n",
"The Green Mile 189 mins. 8.5 [u'Crime', u'Drama', u'Fantasy', u'Mystery']\n",
"Shutter Island 138 mins. 8.1 [u'Mystery', u'Thriller']\n",
"Terminator 2: Judgment Day 137 mins. 8.5 [u'Action', u'Sci-Fi']\n",
"The Usual Suspects 106 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n",
"Braveheart 178 mins. 8.4 [u'Biography', u'Drama', u'History', u'War']\n",
"Kill Bill: Vol. 1 111 mins. 8.1 [u'Action']\n",
"Goodfellas 146 mins. 8.7 [u'Biography', u'Crime', u'Drama']\n",
"The Wolf of Wall Street 180 mins. 8.2 [u'Biography', u'Comedy', u'Crime', u'Drama']\n",
"L&#xE9;on: The Professional 110 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n",
"Back to the Future 116 mins. 8.5 [u'Adventure', u'Comedy', u'Sci-Fi']\n",
"The Hunger Games 142 mins. 7.3 [u'Adventure', u'Drama', u'Sci-Fi', u'Thriller']\n",
"The Sixth Sense 107 mins. 8.2 [u'Drama', u'Mystery', u'Thriller']\n",
"WALL&#xB7;E 98 mins. 8.4 [u'Animation', u'Adventure', u'Family', u'Sci-Fi']\n",
"Iron Man 126 mins. 7.9 [u'Action', u'Adventure', u'Sci-Fi']\n",
"One Flew Over the Cuckoo&#x27;s Nest 133 mins. 8.7 [u'Drama']\n",
"Finding Nemo 100 mins. 8.2 [u'Animation', u'Adventure', u'Comedy', u'Family']\n"
]
}
],
"source": [
"bs = BeautifulSoup(r.text)  # parse the HTML source into a searchable tree\n",
"for movie in bs.findAll('td','title'):\n",
"    title = movie.find('a').contents[0]  # find returns only the first match; use it when you want a single tag\n",
"    runtime = movie.find('span','runtime').contents[0]\n",
"    rating = movie.find('span','value').contents[0]\n",
"    genres = movie.find('span','genre').findAll('a')\n",
"    genre = [g.contents[0] for g in genres]\n",
"    print title, runtime, rating, genre\n",
"\n",
"#http://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"<br/>\n",
"<br/>\n",
"#### So now how do you get the top 200 movies?<br/>\n",
"You'll have to iterate over the `start` parameter of the GET request.<br/>\n",
"The syntax of `xrange` is `xrange(start, stop[, step])`.<br/>\n",
"The loop ends before the value reaches `stop`."
]
},
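{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick check of the values the loop will use for `start`. This is only a sketch: `range` is used instead of `xrange` so the cell runs on both Python 2 and 3, and `starts` is an illustrative name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Each results page holds 50 movies, so step through start in jumps of 50\n",
"starts = list(range(1, 200, 50))\n",
"print(starts)  # [1, 51, 101, 151]"
]
},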
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The Shawshank Redemption (1994) 142 mins. 9.3 [u'Crime', u'Drama']\n",
"The Dark Knight (2008) 152 mins. 9.0 [u'Action', u'Crime', u'Drama']\n",
"Inception (2010) 148 mins. 8.8 [u'Action', u'Mystery', u'Sci-Fi', u'Thriller']\n",
"Fight Club (1999) 139 mins. 8.9 [u'Drama']\n",
"Pulp Fiction (1994) 154 mins. 8.9 [u'Crime', u'Drama']\n",
"The Lord of the Rings: The Fellowship of the Ring (2001) 178 mins. 8.8 [u'Adventure', u'Fantasy']\n",
"Forrest Gump (1994) 142 mins. 8.8 [u'Drama', u'Romance']\n",
"The Matrix (1999) 136 mins. 8.7 [u'Action', u'Sci-Fi']\n",
"The Lord of the Rings: The Return of the King (2003) 201 mins. 8.9 [u'Adventure', u'Fantasy']\n",
"The Godfather (1972) 175 mins. 9.2 [u'Crime', u'Drama']\n",
"The Dark Knight Rises (2012) 165 mins. 8.5 [u'Action', u'Thriller']\n",
"The Lord of the Rings: The Two Towers (2002) 179 mins. 8.7 [u'Adventure', u'Fantasy']\n",
"Se7en (1995) 127 mins. 8.6 [u'Crime', u'Drama', u'Mystery', u'Thriller']\n",
"The Avengers (2012) 143 mins. 8.1 [u'Action', u'Adventure', u'Sci-Fi', u'Thriller']\n",
"Gladiator (2000) 155 mins. 8.5 [u'Action', u'Drama']\n",
"Batman Begins (2005) 140 mins. 8.3 [u'Action', u'Adventure']\n",
"Django Unchained (2012) 165 mins. 8.5 [u'Western']\n",
"Avatar (2009) 162 mins. 7.9 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n",
"The Silence of the Lambs (1991) 118 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n",
"Saving Private Ryan (1998) 169 mins. 8.6 [u'Action', u'Drama', u'War']\n",
"Star Wars (1977) 121 mins. 8.7 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n",
"The Departed (2006) 151 mins. 8.5 [u'Crime', u'Drama', u'Thriller']\n",
"Schindler&#x27;s List (1993) 195 mins. 8.9 [u'Biography', u'Drama', u'History']\n",
"Inglourious Basterds (2009) 153 mins. 8.3 [u'Adventure', u'Drama', u'War']\n",
"Memento (2000) 113 mins. 8.5 [u'Mystery', u'Thriller']\n",
"The Prestige (2006) 130 mins. 8.5 [u'Drama', u'Mystery', u'Thriller']\n",
"Interstellar (2014) 169 mins. 8.7 [u'Adventure', u'Drama', u'Sci-Fi']\n",
"American Beauty (1999) 122 mins. 8.4 [u'Drama', u'Romance']\n",
"Pirates of the Caribbean: The Curse of the Black Pearl (2003) 143 mins. 8.1 [u'Action', u'Adventure', u'Fantasy']\n",
"Titanic (1997) 194 mins. 7.7 [u'Drama', u'Romance']\n",
"Star Wars: Episode V - The Empire Strikes Back (1980) 124 mins. 8.8 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n",
"V for Vendetta (2005) 132 mins. 8.2 [u'Action', u'Drama', u'Thriller']\n",
"American History X (1998) 119 mins. 8.6 [u'Crime', u'Drama']\n",
"The Godfather: Part II (1974) 200 mins. 9.0 [u'Crime', u'Drama']\n",
"The Green Mile (1999) 189 mins. 8.5 [u'Crime', u'Drama', u'Fantasy', u'Mystery']\n",
"Shutter Island (2010) 138 mins. 8.1 [u'Mystery', u'Thriller']\n",
"Terminator 2: Judgment Day (1991) 137 mins. 8.5 [u'Action', u'Sci-Fi']\n",
"The Usual Suspects (1995) 106 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n",
"Braveheart (1995) 178 mins. 8.4 [u'Biography', u'Drama', u'History', u'War']\n",
"Kill Bill: Vol. 1 (2003) 111 mins. 8.1 [u'Action']\n",
"Goodfellas (1990) 146 mins. 8.7 [u'Biography', u'Crime', u'Drama']\n",
"The Wolf of Wall Street (2013) 180 mins. 8.2 [u'Biography', u'Comedy', u'Crime', u'Drama']\n",
"L&#xE9;on: The Professional (1994) 110 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n",
"Back to the Future (1985) 116 mins. 8.5 [u'Adventure', u'Comedy', u'Sci-Fi']\n",
"The Hunger Games (2012) 142 mins. 7.3 [u'Adventure', u'Drama', u'Sci-Fi', u'Thriller']\n",
"The Sixth Sense (1999) 107 mins. 8.2 [u'Drama', u'Mystery', u'Thriller']\n",
"WALL&#xB7;E (2008) 98 mins. 8.4 [u'Animation', u'Adventure', u'Family', u'Sci-Fi']\n",
"Iron Man (2008) 126 mins. 7.9 [u'Action', u'Adventure', u'Sci-Fi']\n",
"One Flew Over the Cuckoo&#x27;s Nest (1975) 133 mins. 8.7 [u'Drama']\n",
"Finding Nemo (2003) 100 mins. 8.2 [u'Animation', u'Adventure', u'Comedy', u'Family']\n",
"Sin City (2005) 124 mins. 8.1 [u'Crime', u'Thriller']\n",
"Eternal Sunshine of the Spotless Mind (2004) 108 mins. 8.4 [u'Drama', u'Romance', u'Sci-Fi']\n",
"The Truman Show (1998) 103 mins. 8.1 [u'Drama']\n",
"Raiders of the Lost Ark (1981) 115 mins. 8.6 [u'Action', u'Adventure']\n",
"Reservoir Dogs (1992) 99 mins. 8.4 [u'Crime', u'Drama']\n",
"Slumdog Millionaire (2008) 120 mins. 8.0 [u'Drama', u'Romance']\n",
"Up (2009) 96 mins. 8.3 [u'Animation', u'Adventure', u'Comedy', u'Family']\n",
"The Hobbit: An Unexpected Journey (2012) 169 mins. 8.0 [u'Adventure', u'Fantasy']\n",
"Star Wars: Episode VI - Return of the Jedi (1983) 134 mins. 8.4 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n",
"The Lion King (1994) 89 mins. 8.5 [u'Animation', u'Adventure', u'Drama', u'Family', u'Musical']\n",
"300 (2006) 117 mins. 7.8 [u'Action', u'Fantasy', u'War']\n",
"Guardians of the Galaxy (2014) 121 mins. 8.1 [u'Action', u'Adventure', u'Sci-Fi']\n",
"No Country for Old Men (2007) 122 mins. 8.1 [u'Crime', u'Drama', u'Thriller']\n",
"Toy Story (1995) 81 mins. 8.3 [u'Animation', u'Adventure', u'Comedy', u'Family', u'Fantasy']\n",
"Snatch. (2000) 104 mins. 8.3 [u'Comedy', u'Crime']\n",
"The Shining (1980) 146 mins. 8.4 [u'Drama', u'Horror']\n",
"A Beautiful Mind (2001) 135 mins. 8.2 [u'Biography', u'Drama']\n",
"Good Will Hunting (1997) 126 mins. 8.3 [u'Drama']\n",
"The Terminator (1984) 107 mins. 8.1 [u'Action', u'Sci-Fi']\n",
"The Hangover (2009) 100 mins. 7.8 [u'Comedy']\n",
"Die Hard (1988) 131 mins. 8.3 [u'Action', u'Thriller']\n",
"Jurassic Park (1993) 127 mins. 8.1 [u'Adventure', u'Sci-Fi', u'Thriller']\n",
"Donnie Darko (2001) 113 mins. 8.1 [u'Drama', u'Sci-Fi']\n",
"Gravity (2013) 91 mins. 7.9 [u'Sci-Fi', u'Thriller']\n",
"Requiem for a Dream (2000) 102 mins. 8.4 [u'Drama']\n",
"Monsters, Inc. (2001) 92 mins. 8.1 [u'Animation', u'Adventure', u'Comedy', u'Family', u'Fantasy']\n",
"Gran Torino (2008) 116 mins. 8.2 [u'Drama']\n",
"Black Swan (2010) 108 mins. 8.0 [u'Drama', u'Thriller']\n",
"Alien (1979) 117 mins. 8.5 [u'Horror', u'Sci-Fi']\n",
"A Clockwork Orange (1971) 136 mins. 8.4 [u'Crime', u'Drama', u'Sci-Fi']\n",
"Iron Man 3 (2013) 130 mins. 7.3 [u'Action', u'Adventure', u'Sci-Fi']\n",
"Spider-Man (2002) 121 mins. 7.3 [u'Action', u'Adventure']\n",
"District 9 (2009) 112 mins. 8.0 [u'Action', u'Sci-Fi']\n",
"Silver Linings Playbook (2012) 122 mins. 7.8 [u'Comedy', u'Drama', u'Romance']\n",
"Scarface (1983) 170 mins. 8.3 [u'Crime', u'Drama']\n",
"City of God (2002) 130 mins. 8.7 [u'Crime', u'Drama']\n",
"The Big Lebowski (1998) 117 mins. 8.2 [u'Comedy', u'Crime']\n",
"Am&#xE9;lie (2001) 122 mins. 8.4 [u'Comedy', u'Romance']\n",
"Toy Story 3 (2010) 103 mins. 8.4 [u'Animation', u'Adventure', u'Comedy', u'Family', u'Fantasy']\n",
"Man of Steel (2013) 143 mins. 7.2 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n",
"I Am Legend (2007) 101 mins. 7.2 [u'Drama', u'Sci-Fi', u'Thriller']\n",
"Harry Potter and the Deathly Hallows: Part 2 (2011) 130 mins. 8.1 [u'Adventure', u'Drama', u'Fantasy', u'Mystery']\n",
"Transformers (2007) 144 mins. 7.1 [u'Action', u'Adventure', u'Sci-Fi']\n",
"Thor (2011) 115 mins. 7.0 [u'Action', u'Adventure', u'Fantasy']\n",
"Pirates of the Caribbean: Dead Man&#x27;s Chest (2006) 151 mins. 7.3 [u'Action', u'Adventure', u'Fantasy']\n",
"Gone Girl (2014) 149 mins. 8.2 [u'Drama', u'Mystery', u'Thriller']\n",
"Star Trek (2009) 127 mins. 8.0 [u'Action', u'Adventure', u'Sci-Fi']\n",
"Kill Bill: Vol. 2 (2004) 137 mins. 8.0 [u'Action', u'Crime', u'Thriller']\n",
"X-Men: First Class (2011) 132 mins. 7.8 [u'Action', u'Adventure', u'Sci-Fi']\n",
"Indiana Jones and the Last Crusade (1989) 127 mins. 8.3 [u'Action', u'Adventure']\n",
"Iron Man 2 (2010) 124 mins. 7.0 [u'Action', u'Adventure', u'Sci-Fi']\n",
"Skyfall (2012) 143 mins. 7.8 [u'Action', u'Adventure', u'Thriller']\n",
"Star Wars: Episode I - The Phantom Menace (1999) 136 mins. 6.5 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n",
"The Good, the Bad and the Ugly (1966) 161 mins. 8.9 [u'Western']\n",
"Taxi Driver (1976) 113 mins. 8.4 [u'Crime', u'Drama']\n",
"Taken (2008) 93 mins. 7.9 [u'Action', u'Thriller']\n",
"The King&#x27;s Speech (2010) 118 mins. 8.1 [u'Biography', u'Drama']\n",
"The Hunger Games: Catching Fire (2013) 146 mins. 7.6 [u'Adventure', u'Sci-Fi', u'Thriller']\n",
"X-Men: Days of Future Past (2014) 132 mins. 8.1 [u'Action', u'Adventure', u'Sci-Fi', u'Thriller']\n",
"Aliens (1986) 137 mins. 8.4 [u'Action', u'Horror', u'Sci-Fi']\n",
"Star Wars: Episode III - Revenge of the Sith (2005) 140 mins. 7.7 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n",
"The Pianist (2002) 150 mins. 8.5 [u'Biography', u'Drama', u'War']\n",
"Catch Me If You Can (2002) 141 mins. 8.0 [u'Biography', u'Crime', u'Drama']\n",
"The Bourne Ultimatum (2007) 115 mins. 8.1 [u'Action', u'Thriller']\n",
"Captain America: The First Avenger (2011) 124 mins. 6.8 [u'Action', u'Adventure', u'Sci-Fi']\n",
"Sherlock Holmes (2009) 128 mins. 7.6 [u'Action', u'Adventure', u'Crime', u'Mystery', u'Thriller']\n",
"Full Metal Jacket (1987) 116 mins. 8.3 [u'Drama', u'War']\n",
"Million Dollar Baby (2004) 132 mins. 8.1 [u'Drama', u'Sport']\n",
"The Hobbit: The Desolation of Smaug (2013) 161 mins. 7.9 [u'Adventure', u'Fantasy']\n",
"The Intouchables (2011) 112 mins. 8.6 [u'Biography', u'Comedy', u'Drama']\n",
"The Social Network (2010) 120 mins. 7.8 [u'Biography', u'Drama']\n",
"Ted (2012) 106 mins. 7.0 [u'Comedy', u'Fantasy']\n",
"The Incredibles (2004) 115 mins. 8.0 [u'Animation', u'Action', u'Adventure', u'Family']\n",
"Pirates of the Caribbean: At World&#x27;s End (2007) 169 mins. 7.1 [u'Action', u'Adventure', u'Fantasy']\n",
"How to Train Your Dragon (2010) 98 mins. 8.2 [u'Animation', u'Adventure', u'Family', u'Fantasy']\n",
"Trainspotting (1996) 94 mins. 8.2 [u'Drama']\n",
"Ratatouille (2007) 111 mins. 8.0 [u'Animation', u'Comedy', u'Family', u'Fantasy']\n",
"Pan&#x27;s Labyrinth (2006) 118 mins. 8.2 [u'Drama', u'Fantasy', u'War']\n",
"Prometheus (2012) 124 mins. 7.0 [u'Adventure', u'Mystery', u'Sci-Fi']\n",
"Shrek (2001) 90 mins. 7.9 [u'Animation', u'Adventure', u'Comedy', u'Family', u'Fantasy']\n",
"The Curious Case of Benjamin Button (2008) 166 mins. 7.8 [u'Drama', u'Fantasy', u'Romance']\n",
"Twelve Monkeys (1995) 129 mins. 8.1 [u'Mystery', u'Sci-Fi', u'Thriller']\n",
"Blade Runner (1982) 117 mins. 8.2 [u'Sci-Fi', u'Thriller']\n",
"World War Z (2013) 116 mins. 7.0 [u'Action', u'Adventure', u'Horror', u'Sci-Fi', u'Thriller']\n",
"Casino Royale (2006) 144 mins. 8.0 [u'Action', u'Adventure', u'Thriller']\n",
"The Amazing Spider-Man (2012) 136 mins. 7.1 [u'Action', u'Adventure', u'Fantasy']\n",
"Captain America: The Winter Soldier (2014) 136 mins. 7.8 [u'Action', u'Adventure', u'Sci-Fi']\n",
"Apocalypse Now (1979) 153 mins. 8.5 [u'Drama', u'War']\n",
"Argo (2012) 120 mins. 7.8 [u'Drama', u'History', u'Thriller']\n",
"X-Men (2000) 104 mins. 7.4 [u'Action', u'Adventure', u'Sci-Fi']\n",
"Fargo (1996) 98 mins. 8.2 [u'Crime', u'Drama', u'Thriller']\n",
"Kick-Ass (2010) 117 mins. 7.7 [u'Action', u'Comedy']\n",
"Harry Potter and the Sorcerer&#x27;s Stone (2001) 152 mins. 7.5 [u'Adventure', u'Family', u'Fantasy']\n",
"Drive (2011) 100 mins. 7.8 [u'Crime', u'Drama']\n",
"12 Angry Men (1957) 96 mins. 8.9 [u'Crime', u'Drama']\n",
"Heat (1995) 170 mins. 8.3 [u'Action', u'Crime', u'Drama', u'Thriller']\n",
"The Matrix Reloaded (2003) 138 mins. 7.2 [u'Action', u'Sci-Fi']\n",
"Life of Pi (2012) 127 mins. 8.0 [u'Adventure', u'Drama', u'Fantasy']\n",
"Superbad (2007) 113 mins. 7.6 [u'Comedy']\n",
"The Grand Budapest Hotel (2014) 99 mins. 8.1 [u'Adventure', u'Comedy', u'Drama']\n",
"Looper (2012) 119 mins. 7.5 [u'Action', u'Crime', u'Sci-Fi', u'Thriller']\n",
"Star Wars: Episode II - Attack of the Clones (2002) 142 mins. 6.7 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n",
"Groundhog Day (1993) 101 mins. 8.1 [u'Comedy', u'Fantasy', u'Romance']\n",
"Independence Day (1996) 145 mins. 6.9 [u'Action', u'Adventure', u'Sci-Fi']\n",
"Now You See Me (2013) 115 mins. 7.3 [u'Crime', u'Mystery', u'Thriller']\n",
"Juno (2007) 96 mins. 7.5 [u'Comedy', u'Drama', u'Romance']\n",
"Into the Wild (2007) 148 mins. 8.2 [u'Adventure', u'Biography', u'Drama']\n",
"2001: A Space Odyssey (1968) 160 mins. 8.3 [u'Mystery', u'Sci-Fi']\n",
"L.A. Confidential (1997) 138 mins. 8.3 [u'Crime', u'Drama', u'Mystery', u'Thriller']\n",
"Limitless (2011) 105 mins. 7.4 [u'Mystery', u'Sci-Fi', u'Thriller']\n",
"Psycho (1960) 109 mins. 8.6 [u'Horror', u'Mystery', u'Thriller']\n",
"Lock, Stock and Two Smoking Barrels (1998) 107 mins. 8.2 [u'Comedy', u'Crime']\n",
"12 Years a Slave (2013) 134 mins. 8.1 [u'Biography', u'Drama', u'History']\n",
"Edge of Tomorrow (2014) 113 mins. 7.9 [u'Action', u'Adventure', u'Sci-Fi']\n",
"Jaws (1975) 124 mins. 8.1 [u'Adventure', u'Drama', u'Thriller']\n",
"Rise of the Planet of the Apes (2011) 105 mins. 7.6 [u'Action', u'Drama', u'Sci-Fi', u'Thriller']\n",
"Frozen (2013) 102 mins. 7.6 [u'Animation', u'Adventure', u'Comedy', u'Family', u'Fantasy', u'Musical']\n",
"21 Jump Street (2012) 109 mins. 7.2 [u'Action', u'Comedy', u'Crime']\n",
"Minority Report (2002) 145 mins. 7.7 [u'Action', u'Mystery', u'Sci-Fi', u'Thriller']\n",
"Life Is Beautiful (1997) 116 mins. 8.6 [u'Comedy', u'Drama', u'Romance']\n",
"Ocean&#x27;s Eleven (2001) 116 mins. 7.8 [u'Crime', u'Thriller']\n",
"The Bourne Identity (2002) 119 mins. 7.9 [u'Action', u'Mystery', u'Thriller']\n",
"Spirited Away (2001) 125 mins. 8.6 [u'Animation', u'Adventure', u'Family', u'Fantasy']\n",
"X2 (2003) 134 mins. 7.5 [u'Action', u'Adventure', u'Sci-Fi', u'Thriller']\n",
"Men in Black (1997) 98 mins. 7.2 [u'Comedy', u'Sci-Fi']\n",
"Spider-Man 2 (2004) 127 mins. 7.3 [u'Action', u'Adventure', u'Fantasy']\n",
"Blood Diamond (2006) 143 mins. 8.0 [u'Adventure', u'Drama', u'Thriller']\n",
"Shaun of the Dead (2004) 99 mins. 8.0 [u'Comedy', u'Horror']\n",
"The Notebook (2004) 123 mins. 7.9 [u'Drama', u'Romance']\n",
"Star Trek Into Darkness (2013) 132 mins. 7.8 [u'Action', u'Adventure', u'Sci-Fi']\n",
"Thor: The Dark World (2013) 112 mins. 7.1 [u'Action', u'Adventure', u'Fantasy']\n",
"Rain Man (1988) 133 mins. 8.0 [u'Drama']\n",
"The Imitation Game (2014) 114 mins. 8.1 [u'Biography', u'Drama', u'Thriller', u'War']\n",
"Mad Max: Fury Road (2015) 120 mins. 8.2 [u'Action', u'Adventure', u'Sci-Fi', u'Thriller']\n",
"Cast Away (2000) 143 mins. 7.7 [u'Adventure', u'Drama']\n",
"Watchmen (2009) 162 mins. 7.6 [u'Action', u'Mystery', u'Sci-Fi']\n",
"Zombieland (2009) 88 mins. 7.7 [u'Adventure', u'Comedy', u'Horror']\n",
"I, Robot (2004) 115 mins. 7.1 [u'Action', u'Mystery', u'Sci-Fi', u'Thriller']\n",
"Oblivion (2013) 124 mins. 7.0 [u'Action', u'Adventure', u'Mystery', u'Sci-Fi']\n",
"Harry Potter and the Chamber of Secrets (2002) 161 mins. 7.4 [u'Adventure', u'Family', u'Fantasy', u'Mystery']\n",
"Crazy, Stupid, Love. (2011) 118 mins. 7.4 [u'Comedy', u'Drama', u'Romance']\n",
"Harry Potter and the Goblet of Fire (2005) 157 mins. 7.6 [u'Adventure', u'Family', u'Fantasy', u'Mystery']\n",
"Monty Python and the Holy Grail (1975) 91 mins. 8.3 [u'Adventure', u'Comedy', u'Fantasy']\n",
"Toy Story 2 (1999) 92 mins. 7.9 [u'Animation', u'Adventure', u'Comedy', u'Family', u'Fantasy']\n",
"Troy (2004) 163 mins. 7.2 [u'Adventure']\n",
"Pacific Rim (2013) 131 mins. 7.0 [u'Action', u'Adventure', u'Sci-Fi']\n",
"X-Men: The Last Stand (2006) 104 mins. 6.8 [u'Action', u'Adventure', u'Sci-Fi', u'Thriller']\n",
"The Hangover Part II (2011) 102 mins. 6.5 [u'Comedy']\n",
"Despicable Me (2010) 95 mins. 7.7 [u'Animation', u'Comedy', u'Family']\n",
"(500) Days of Summer (2009) 95 mins. 7.8 [u'Comedy', u'Drama', u'Romance']\n"
]
}
],
"source": [
"url = 'http://www.imdb.com/search/title'\n",
"for i in xrange(1,200,50):\n",
"    params = dict(sort='num_votes,desc', start=i, title_type='feature', year='1950,2015')\n",
"    r = requests.get(url, params=params)\n",
"    dom = web.Element(r.text)\n",
"    for movie in dom.by_tag('td.title'):\n",
"        title = movie.by_tag('a')[0].content\n",
"        year = movie.by_tag('span.year_type')[0].content\n",
"        runtime = movie.by_tag('span.runtime')[0].content\n",
"        rating = movie.by_tag('span.value')[0].content\n",
"        genres = movie.by_tag('span.genre')[0].by_tag('a')\n",
"        genre = [g.content for g in genres]\n",
"        print title, year, runtime, rating, genre"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing the scraped data into a file\n",
"Now, we'll write this data into a file for further analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"imdb = open('imdb_top_200.txt','a')\n",
"\n",
"url = 'http://www.imdb.com/search/title'\n",
"for i in xrange(1,200,50):\n",
"    params = dict(sort='num_votes,desc', start=i, title_type='feature', year='1950,2015')\n",
"    r = requests.get(url, params=params)\n",
"    dom = web.Element(r.text)\n",
"    for movie in dom.by_tag('td.title'):\n",
"        title = movie.by_tag('a')[0].content\n",
"        year = movie.by_tag('span.year_type')[0].content\n",
"        runtime = movie.by_tag('span.runtime')[0].content\n",
"        rating = movie.by_tag('span.value')[0].content\n",
"        genres = movie.by_tag('span.genre')[0].by_tag('a')\n",
"        genre = [g.content for g in genres]\n",
"        imdb.write(title+'\\t'+year+'\\t'+str(rating) +'\\t'+str(runtime)+'\\t'+str(genre)+'\\n')\n",
"imdb.close()"
]
},
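{
"cell_type": "markdown",
"metadata": {},
"source": [
"This gives me the top 200 movies, but some redundant movies are written at the top of the file.<br/>A likely cause is that the file is opened in append mode (`'a'`), so every rerun of the cell adds its rows to whatever an earlier run already wrote. Still to be confirmed."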
]
}
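,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to clean up repeated rows after the fact is to keep only the first occurrence of each line. This is a sketch on made-up sample rows in a similar tab-separated layout, not a fix to the scraper itself; `dedupe_lines` and `sample` are hypothetical names."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def dedupe_lines(lines):\n",
"    # Keep the first occurrence of each line, preserving the original order\n",
"    seen = set()\n",
"    unique = []\n",
"    for line in lines:\n",
"        if line not in seen:\n",
"            seen.add(line)\n",
"            unique.append(line)\n",
"    return unique\n",
"\n",
"# Made-up rows for illustration only; the first and third are identical\n",
"sample = [\n",
"    'Movie A\\t(2001)\\t8.0\\t100 mins.\\n',\n",
"    'Movie B\\t(2002)\\t7.5\\t95 mins.\\n',\n",
"    'Movie A\\t(2001)\\t8.0\\t100 mins.\\n',\n",
"]\n",
"print(len(dedupe_lines(sample)))  # 2"
]
}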
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}