Created
November 2, 2015 15:44
IMDb scraper using Pattern and BeautifulSoup
{ | |
"cells": [ | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# IMDb Scraping\n",
"\n",
"`requests` is a Python library for making HTTP requests, such as fetching web pages.<br>\n",
"http://docs.python-requests.org/en/v2.0-0/user/quickstart/" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"import requests\n", | |
"from pattern import web\n", | |
"from BeautifulSoup import BeautifulSoup" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Two ways of making requests\n", | |
"\n", | |
"#### 1. Explicit url" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2015\n" | |
] | |
} | |
], | |
"source": [ | |
"url = 'http://www.imdb.com/search/title?sort=num_votes,desc&start=1&title_type=feature&year=1950,2015'\n", | |
"r = requests.get(url)\n", | |
"print r.url" | |
] | |
}, | |
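{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 2. Base url with GET dictionary\n",
"`params` is a way to pass the query-string parameters as a dictionary, separate from the base URL. requests URL-encodes the values, which is why the commas appear as `%2C` in the printed URL."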
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"http://www.imdb.com/search/title?sort=num_votes%2Cdesc&start=1&title_type=feature&year=1950%2C2015\n" | |
] | |
} | |
], | |
"source": [ | |
"url = 'http://www.imdb.com/search/title'\n", | |
"params = dict(sort='num_votes,desc', start=1, title_type='feature', year='1950,2015')\n", | |
"r = requests.get(url, params=params)\n", | |
"print r.url # notice it constructs the full url for you" | |
] | |
}, | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Pattern\n",
"\n",
"See the documentation for Pattern's web module:<br/>\n",
"http://www.clips.ua.ac.be/pages/pattern-web<br/>\n",
"`r.text` holds the HTML source of the entire page. `web.Element(r.text)` parses it into `dom`, a tree of elements that can be searched by tag and class, e.g. `td.title`."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The Shawshank Redemption 142 mins. 9.3 [u'Crime', u'Drama']\n", | |
"The Dark Knight 152 mins. 9.0 [u'Action', u'Crime', u'Drama']\n", | |
"Inception 148 mins. 8.8 [u'Action', u'Mystery', u'Sci-Fi', u'Thriller']\n", | |
"Fight Club 139 mins. 8.9 [u'Drama']\n", | |
"Pulp Fiction 154 mins. 8.9 [u'Crime', u'Drama']\n", | |
"The Lord of the Rings: The Fellowship of the Ring 178 mins. 8.8 [u'Adventure', u'Fantasy']\n", | |
"Forrest Gump 142 mins. 8.8 [u'Drama', u'Romance']\n", | |
"The Matrix 136 mins. 8.7 [u'Action', u'Sci-Fi']\n", | |
"The Lord of the Rings: The Return of the King 201 mins. 8.9 [u'Adventure', u'Fantasy']\n", | |
"The Godfather 175 mins. 9.2 [u'Crime', u'Drama']\n", | |
"The Dark Knight Rises 165 mins. 8.5 [u'Action', u'Thriller']\n", | |
"The Lord of the Rings: The Two Towers 179 mins. 8.7 [u'Adventure', u'Fantasy']\n", | |
"Se7en 127 mins. 8.6 [u'Crime', u'Drama', u'Mystery', u'Thriller']\n", | |
"The Avengers 143 mins. 8.1 [u'Action', u'Adventure', u'Sci-Fi', u'Thriller']\n", | |
"Gladiator 155 mins. 8.5 [u'Action', u'Drama']\n", | |
"Batman Begins 140 mins. 8.3 [u'Action', u'Adventure']\n", | |
"Django Unchained 165 mins. 8.5 [u'Western']\n", | |
"Avatar 162 mins. 7.9 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n", | |
"The Silence of the Lambs 118 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n", | |
"Saving Private Ryan 169 mins. 8.6 [u'Action', u'Drama', u'War']\n", | |
"Star Wars 121 mins. 8.7 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n", | |
"The Departed 151 mins. 8.5 [u'Crime', u'Drama', u'Thriller']\n", | |
"Schindler's List 195 mins. 8.9 [u'Biography', u'Drama', u'History']\n", | |
"Inglourious Basterds 153 mins. 8.3 [u'Adventure', u'Drama', u'War']\n", | |
"Memento 113 mins. 8.5 [u'Mystery', u'Thriller']\n", | |
"The Prestige 130 mins. 8.5 [u'Drama', u'Mystery', u'Thriller']\n", | |
"Interstellar 169 mins. 8.7 [u'Adventure', u'Drama', u'Sci-Fi']\n", | |
"American Beauty 122 mins. 8.4 [u'Drama', u'Romance']\n", | |
"Pirates of the Caribbean: The Curse of the Black Pearl 143 mins. 8.1 [u'Action', u'Adventure', u'Fantasy']\n", | |
"Titanic 194 mins. 7.7 [u'Drama', u'Romance']\n", | |
"Star Wars: Episode V - The Empire Strikes Back 124 mins. 8.8 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n", | |
"V for Vendetta 132 mins. 8.2 [u'Action', u'Drama', u'Thriller']\n", | |
"American History X 119 mins. 8.6 [u'Crime', u'Drama']\n", | |
"The Godfather: Part II 200 mins. 9.0 [u'Crime', u'Drama']\n", | |
"The Green Mile 189 mins. 8.5 [u'Crime', u'Drama', u'Fantasy', u'Mystery']\n", | |
"Shutter Island 138 mins. 8.1 [u'Mystery', u'Thriller']\n", | |
"Terminator 2: Judgment Day 137 mins. 8.5 [u'Action', u'Sci-Fi']\n", | |
"The Usual Suspects 106 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n", | |
"Braveheart 178 mins. 8.4 [u'Biography', u'Drama', u'History', u'War']\n", | |
"Kill Bill: Vol. 1 111 mins. 8.1 [u'Action']\n", | |
"Goodfellas 146 mins. 8.7 [u'Biography', u'Crime', u'Drama']\n", | |
"The Wolf of Wall Street 180 mins. 8.2 [u'Biography', u'Comedy', u'Crime', u'Drama']\n", | |
"Léon: The Professional 110 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n", | |
"Back to the Future 116 mins. 8.5 [u'Adventure', u'Comedy', u'Sci-Fi']\n", | |
"The Hunger Games 142 mins. 7.3 [u'Adventure', u'Drama', u'Sci-Fi', u'Thriller']\n", | |
"The Sixth Sense 107 mins. 8.2 [u'Drama', u'Mystery', u'Thriller']\n", | |
"WALL·E 98 mins. 8.4 [u'Animation', u'Adventure', u'Family', u'Sci-Fi']\n", | |
"Iron Man 126 mins. 7.9 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"One Flew Over the Cuckoo's Nest 133 mins. 8.7 [u'Drama']\n", | |
"Finding Nemo 100 mins. 8.2 [u'Animation', u'Adventure', u'Comedy', u'Family']\n" | |
] | |
} | |
],
"source": [
"dom = web.Element(r.text)\n",
"for movie in dom.by_tag('td.title'):\n",
"    title = movie.by_tag('a')[0].content  # content is whatever sits between the opening and closing tags\n",
"    runtime = movie.by_tag('span.runtime')[0].content\n",
"    rating = movie.by_tag('span.value')[0].content\n",
"    genres = movie.by_tag('span.genre')[0].by_tag('a')\n",
"    genre = [g.content for g in genres]\n",
"    # the list comprehension above is equivalent to:\n",
"    # genre = []\n",
"    # for g in genres:\n",
"    #     genre.append(g.content)\n",
"    print title, runtime, rating, genre"
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Using BeautifulSoup\n", | |
"\n", | |
"Beautiful Soup is a Python library for pulling data out of HTML and XML files.<br/>\n", | |
"Check documentation here.<br/>\n", | |
"http://www.crummy.com/software/BeautifulSoup/bs4/doc/" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The Shawshank Redemption 142 mins. 9.3 [u'Crime', u'Drama']\n", | |
"The Dark Knight 152 mins. 9.0 [u'Action', u'Crime', u'Drama']\n", | |
"Inception 148 mins. 8.8 [u'Action', u'Mystery', u'Sci-Fi', u'Thriller']\n", | |
"Fight Club 139 mins. 8.9 [u'Drama']\n", | |
"Pulp Fiction 154 mins. 8.9 [u'Crime', u'Drama']\n", | |
"The Lord of the Rings: The Fellowship of the Ring 178 mins. 8.8 [u'Adventure', u'Fantasy']\n", | |
"Forrest Gump 142 mins. 8.8 [u'Drama', u'Romance']\n", | |
"The Matrix 136 mins. 8.7 [u'Action', u'Sci-Fi']\n", | |
"The Lord of the Rings: The Return of the King 201 mins. 8.9 [u'Adventure', u'Fantasy']\n", | |
"The Godfather 175 mins. 9.2 [u'Crime', u'Drama']\n", | |
"The Dark Knight Rises 165 mins. 8.5 [u'Action', u'Thriller']\n", | |
"The Lord of the Rings: The Two Towers 179 mins. 8.7 [u'Adventure', u'Fantasy']\n", | |
"Se7en 127 mins. 8.6 [u'Crime', u'Drama', u'Mystery', u'Thriller']\n", | |
"The Avengers 143 mins. 8.1 [u'Action', u'Adventure', u'Sci-Fi', u'Thriller']\n", | |
"Gladiator 155 mins. 8.5 [u'Action', u'Drama']\n", | |
"Batman Begins 140 mins. 8.3 [u'Action', u'Adventure']\n", | |
"Django Unchained 165 mins. 8.5 [u'Western']\n", | |
"Avatar 162 mins. 7.9 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n", | |
"The Silence of the Lambs 118 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n", | |
"Saving Private Ryan 169 mins. 8.6 [u'Action', u'Drama', u'War']\n", | |
"Star Wars 121 mins. 8.7 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n", | |
"The Departed 151 mins. 8.5 [u'Crime', u'Drama', u'Thriller']\n", | |
"Schindler's List 195 mins. 8.9 [u'Biography', u'Drama', u'History']\n", | |
"Inglourious Basterds 153 mins. 8.3 [u'Adventure', u'Drama', u'War']\n", | |
"Memento 113 mins. 8.5 [u'Mystery', u'Thriller']\n", | |
"The Prestige 130 mins. 8.5 [u'Drama', u'Mystery', u'Thriller']\n", | |
"Interstellar 169 mins. 8.7 [u'Adventure', u'Drama', u'Sci-Fi']\n", | |
"American Beauty 122 mins. 8.4 [u'Drama', u'Romance']\n", | |
"Pirates of the Caribbean: The Curse of the Black Pearl 143 mins. 8.1 [u'Action', u'Adventure', u'Fantasy']\n", | |
"Titanic 194 mins. 7.7 [u'Drama', u'Romance']\n", | |
"Star Wars: Episode V - The Empire Strikes Back 124 mins. 8.8 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n", | |
"V for Vendetta 132 mins. 8.2 [u'Action', u'Drama', u'Thriller']\n", | |
"American History X 119 mins. 8.6 [u'Crime', u'Drama']\n", | |
"The Godfather: Part II 200 mins. 9.0 [u'Crime', u'Drama']\n", | |
"The Green Mile 189 mins. 8.5 [u'Crime', u'Drama', u'Fantasy', u'Mystery']\n", | |
"Shutter Island 138 mins. 8.1 [u'Mystery', u'Thriller']\n", | |
"Terminator 2: Judgment Day 137 mins. 8.5 [u'Action', u'Sci-Fi']\n", | |
"The Usual Suspects 106 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n", | |
"Braveheart 178 mins. 8.4 [u'Biography', u'Drama', u'History', u'War']\n", | |
"Kill Bill: Vol. 1 111 mins. 8.1 [u'Action']\n", | |
"Goodfellas 146 mins. 8.7 [u'Biography', u'Crime', u'Drama']\n", | |
"The Wolf of Wall Street 180 mins. 8.2 [u'Biography', u'Comedy', u'Crime', u'Drama']\n", | |
"Léon: The Professional 110 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n", | |
"Back to the Future 116 mins. 8.5 [u'Adventure', u'Comedy', u'Sci-Fi']\n", | |
"The Hunger Games 142 mins. 7.3 [u'Adventure', u'Drama', u'Sci-Fi', u'Thriller']\n", | |
"The Sixth Sense 107 mins. 8.2 [u'Drama', u'Mystery', u'Thriller']\n", | |
"WALL·E 98 mins. 8.4 [u'Animation', u'Adventure', u'Family', u'Sci-Fi']\n", | |
"Iron Man 126 mins. 7.9 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"One Flew Over the Cuckoo's Nest 133 mins. 8.7 [u'Drama']\n", | |
"Finding Nemo 100 mins. 8.2 [u'Animation', u'Adventure', u'Comedy', u'Family']\n" | |
] | |
} | |
],
"source": [
"bs = BeautifulSoup(r.text)  # parses the page source into a searchable tree\n",
"for movie in bs.findAll('td','title'):\n",
"    title = movie.find('a').contents[0]  # find returns only the first match; use it when you expect a single tag\n",
"    runtime = movie.find('span','runtime').contents[0]\n",
"    rating = movie.find('span','value').contents[0]\n",
"    genres = movie.find('span','genre').findAll('a')\n",
"    genre = [g.contents[0] for g in genres]\n",
"    print title, runtime, rating, genre\n",
"\n",
"# http://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all"
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"collapsed": true | |
},
"source": [
"<br/>\n",
"<br/>\n",
"#### So now how do you get the top 200 movies?<br/>\n",
"Each results page lists 50 movies, so iterate over the `start` parameter of the GET request.<br/>\n",
"The syntax of xrange is `xrange(start, stop[, step])`.<br/>\n",
"The loop ends before the value reaches `stop`, so `xrange(1, 200, 50)` yields 1, 51, 101 and 151."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"The Shawshank Redemption (1994) 142 mins. 9.3 [u'Crime', u'Drama']\n", | |
"The Dark Knight (2008) 152 mins. 9.0 [u'Action', u'Crime', u'Drama']\n", | |
"Inception (2010) 148 mins. 8.8 [u'Action', u'Mystery', u'Sci-Fi', u'Thriller']\n", | |
"Fight Club (1999) 139 mins. 8.9 [u'Drama']\n", | |
"Pulp Fiction (1994) 154 mins. 8.9 [u'Crime', u'Drama']\n", | |
"The Lord of the Rings: The Fellowship of the Ring (2001) 178 mins. 8.8 [u'Adventure', u'Fantasy']\n", | |
"Forrest Gump (1994) 142 mins. 8.8 [u'Drama', u'Romance']\n", | |
"The Matrix (1999) 136 mins. 8.7 [u'Action', u'Sci-Fi']\n", | |
"The Lord of the Rings: The Return of the King (2003) 201 mins. 8.9 [u'Adventure', u'Fantasy']\n", | |
"The Godfather (1972) 175 mins. 9.2 [u'Crime', u'Drama']\n", | |
"The Dark Knight Rises (2012) 165 mins. 8.5 [u'Action', u'Thriller']\n", | |
"The Lord of the Rings: The Two Towers (2002) 179 mins. 8.7 [u'Adventure', u'Fantasy']\n", | |
"Se7en (1995) 127 mins. 8.6 [u'Crime', u'Drama', u'Mystery', u'Thriller']\n", | |
"The Avengers (2012) 143 mins. 8.1 [u'Action', u'Adventure', u'Sci-Fi', u'Thriller']\n", | |
"Gladiator (2000) 155 mins. 8.5 [u'Action', u'Drama']\n", | |
"Batman Begins (2005) 140 mins. 8.3 [u'Action', u'Adventure']\n", | |
"Django Unchained (2012) 165 mins. 8.5 [u'Western']\n", | |
"Avatar (2009) 162 mins. 7.9 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n", | |
"The Silence of the Lambs (1991) 118 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n", | |
"Saving Private Ryan (1998) 169 mins. 8.6 [u'Action', u'Drama', u'War']\n", | |
"Star Wars (1977) 121 mins. 8.7 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n", | |
"The Departed (2006) 151 mins. 8.5 [u'Crime', u'Drama', u'Thriller']\n", | |
"Schindler's List (1993) 195 mins. 8.9 [u'Biography', u'Drama', u'History']\n", | |
"Inglourious Basterds (2009) 153 mins. 8.3 [u'Adventure', u'Drama', u'War']\n", | |
"Memento (2000) 113 mins. 8.5 [u'Mystery', u'Thriller']\n", | |
"The Prestige (2006) 130 mins. 8.5 [u'Drama', u'Mystery', u'Thriller']\n", | |
"Interstellar (2014) 169 mins. 8.7 [u'Adventure', u'Drama', u'Sci-Fi']\n", | |
"American Beauty (1999) 122 mins. 8.4 [u'Drama', u'Romance']\n", | |
"Pirates of the Caribbean: The Curse of the Black Pearl (2003) 143 mins. 8.1 [u'Action', u'Adventure', u'Fantasy']\n", | |
"Titanic (1997) 194 mins. 7.7 [u'Drama', u'Romance']\n", | |
"Star Wars: Episode V - The Empire Strikes Back (1980) 124 mins. 8.8 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n", | |
"V for Vendetta (2005) 132 mins. 8.2 [u'Action', u'Drama', u'Thriller']\n", | |
"American History X (1998) 119 mins. 8.6 [u'Crime', u'Drama']\n", | |
"The Godfather: Part II (1974) 200 mins. 9.0 [u'Crime', u'Drama']\n", | |
"The Green Mile (1999) 189 mins. 8.5 [u'Crime', u'Drama', u'Fantasy', u'Mystery']\n", | |
"Shutter Island (2010) 138 mins. 8.1 [u'Mystery', u'Thriller']\n", | |
"Terminator 2: Judgment Day (1991) 137 mins. 8.5 [u'Action', u'Sci-Fi']\n", | |
"The Usual Suspects (1995) 106 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n", | |
"Braveheart (1995) 178 mins. 8.4 [u'Biography', u'Drama', u'History', u'War']\n", | |
"Kill Bill: Vol. 1 (2003) 111 mins. 8.1 [u'Action']\n", | |
"Goodfellas (1990) 146 mins. 8.7 [u'Biography', u'Crime', u'Drama']\n", | |
"The Wolf of Wall Street (2013) 180 mins. 8.2 [u'Biography', u'Comedy', u'Crime', u'Drama']\n", | |
"Léon: The Professional (1994) 110 mins. 8.6 [u'Crime', u'Drama', u'Thriller']\n", | |
"Back to the Future (1985) 116 mins. 8.5 [u'Adventure', u'Comedy', u'Sci-Fi']\n", | |
"The Hunger Games (2012) 142 mins. 7.3 [u'Adventure', u'Drama', u'Sci-Fi', u'Thriller']\n", | |
"The Sixth Sense (1999) 107 mins. 8.2 [u'Drama', u'Mystery', u'Thriller']\n", | |
"WALL·E (2008) 98 mins. 8.4 [u'Animation', u'Adventure', u'Family', u'Sci-Fi']\n", | |
"Iron Man (2008) 126 mins. 7.9 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"One Flew Over the Cuckoo's Nest (1975) 133 mins. 8.7 [u'Drama']\n", | |
"Finding Nemo (2003) 100 mins. 8.2 [u'Animation', u'Adventure', u'Comedy', u'Family']\n", | |
"Sin City (2005) 124 mins. 8.1 [u'Crime', u'Thriller']\n", | |
"Eternal Sunshine of the Spotless Mind (2004) 108 mins. 8.4 [u'Drama', u'Romance', u'Sci-Fi']\n", | |
"The Truman Show (1998) 103 mins. 8.1 [u'Drama']\n", | |
"Raiders of the Lost Ark (1981) 115 mins. 8.6 [u'Action', u'Adventure']\n", | |
"Reservoir Dogs (1992) 99 mins. 8.4 [u'Crime', u'Drama']\n", | |
"Slumdog Millionaire (2008) 120 mins. 8.0 [u'Drama', u'Romance']\n", | |
"Up (2009) 96 mins. 8.3 [u'Animation', u'Adventure', u'Comedy', u'Family']\n", | |
"The Hobbit: An Unexpected Journey (2012) 169 mins. 8.0 [u'Adventure', u'Fantasy']\n", | |
"Star Wars: Episode VI - Return of the Jedi (1983) 134 mins. 8.4 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n", | |
"The Lion King (1994) 89 mins. 8.5 [u'Animation', u'Adventure', u'Drama', u'Family', u'Musical']\n", | |
"300 (2006) 117 mins. 7.8 [u'Action', u'Fantasy', u'War']\n", | |
"Guardians of the Galaxy (2014) 121 mins. 8.1 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"No Country for Old Men (2007) 122 mins. 8.1 [u'Crime', u'Drama', u'Thriller']\n", | |
"Toy Story (1995) 81 mins. 8.3 [u'Animation', u'Adventure', u'Comedy', u'Family', u'Fantasy']\n", | |
"Snatch. (2000) 104 mins. 8.3 [u'Comedy', u'Crime']\n", | |
"The Shining (1980) 146 mins. 8.4 [u'Drama', u'Horror']\n", | |
"A Beautiful Mind (2001) 135 mins. 8.2 [u'Biography', u'Drama']\n", | |
"Good Will Hunting (1997) 126 mins. 8.3 [u'Drama']\n", | |
"The Terminator (1984) 107 mins. 8.1 [u'Action', u'Sci-Fi']\n", | |
"The Hangover (2009) 100 mins. 7.8 [u'Comedy']\n", | |
"Die Hard (1988) 131 mins. 8.3 [u'Action', u'Thriller']\n", | |
"Jurassic Park (1993) 127 mins. 8.1 [u'Adventure', u'Sci-Fi', u'Thriller']\n", | |
"Donnie Darko (2001) 113 mins. 8.1 [u'Drama', u'Sci-Fi']\n", | |
"Gravity (2013) 91 mins. 7.9 [u'Sci-Fi', u'Thriller']\n", | |
"Requiem for a Dream (2000) 102 mins. 8.4 [u'Drama']\n", | |
"Monsters, Inc. (2001) 92 mins. 8.1 [u'Animation', u'Adventure', u'Comedy', u'Family', u'Fantasy']\n", | |
"Gran Torino (2008) 116 mins. 8.2 [u'Drama']\n", | |
"Black Swan (2010) 108 mins. 8.0 [u'Drama', u'Thriller']\n", | |
"Alien (1979) 117 mins. 8.5 [u'Horror', u'Sci-Fi']\n", | |
"A Clockwork Orange (1971) 136 mins. 8.4 [u'Crime', u'Drama', u'Sci-Fi']\n", | |
"Iron Man 3 (2013) 130 mins. 7.3 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"Spider-Man (2002) 121 mins. 7.3 [u'Action', u'Adventure']\n", | |
"District 9 (2009) 112 mins. 8.0 [u'Action', u'Sci-Fi']\n", | |
"Silver Linings Playbook (2012) 122 mins. 7.8 [u'Comedy', u'Drama', u'Romance']\n", | |
"Scarface (1983) 170 mins. 8.3 [u'Crime', u'Drama']\n", | |
"City of God (2002) 130 mins. 8.7 [u'Crime', u'Drama']\n", | |
"The Big Lebowski (1998) 117 mins. 8.2 [u'Comedy', u'Crime']\n", | |
"Amélie (2001) 122 mins. 8.4 [u'Comedy', u'Romance']\n", | |
"Toy Story 3 (2010) 103 mins. 8.4 [u'Animation', u'Adventure', u'Comedy', u'Family', u'Fantasy']\n", | |
"Man of Steel (2013) 143 mins. 7.2 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n", | |
"I Am Legend (2007) 101 mins. 7.2 [u'Drama', u'Sci-Fi', u'Thriller']\n", | |
"Harry Potter and the Deathly Hallows: Part 2 (2011) 130 mins. 8.1 [u'Adventure', u'Drama', u'Fantasy', u'Mystery']\n", | |
"Transformers (2007) 144 mins. 7.1 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"Thor (2011) 115 mins. 7.0 [u'Action', u'Adventure', u'Fantasy']\n", | |
"Pirates of the Caribbean: Dead Man's Chest (2006) 151 mins. 7.3 [u'Action', u'Adventure', u'Fantasy']\n", | |
"Gone Girl (2014) 149 mins. 8.2 [u'Drama', u'Mystery', u'Thriller']\n", | |
"Star Trek (2009) 127 mins. 8.0 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"Kill Bill: Vol. 2 (2004) 137 mins. 8.0 [u'Action', u'Crime', u'Thriller']\n", | |
"X-Men: First Class (2011) 132 mins. 7.8 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"Indiana Jones and the Last Crusade (1989) 127 mins. 8.3 [u'Action', u'Adventure']\n", | |
"Iron Man 2 (2010) 124 mins. 7.0 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"Skyfall (2012) 143 mins. 7.8 [u'Action', u'Adventure', u'Thriller']\n", | |
"Star Wars: Episode I - The Phantom Menace (1999) 136 mins. 6.5 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n", | |
"The Good, the Bad and the Ugly (1966) 161 mins. 8.9 [u'Western']\n", | |
"Taxi Driver (1976) 113 mins. 8.4 [u'Crime', u'Drama']\n", | |
"Taken (2008) 93 mins. 7.9 [u'Action', u'Thriller']\n", | |
"The King's Speech (2010) 118 mins. 8.1 [u'Biography', u'Drama']\n", | |
"The Hunger Games: Catching Fire (2013) 146 mins. 7.6 [u'Adventure', u'Sci-Fi', u'Thriller']\n", | |
"X-Men: Days of Future Past (2014) 132 mins. 8.1 [u'Action', u'Adventure', u'Sci-Fi', u'Thriller']\n", | |
"Aliens (1986) 137 mins. 8.4 [u'Action', u'Horror', u'Sci-Fi']\n", | |
"Star Wars: Episode III - Revenge of the Sith (2005) 140 mins. 7.7 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n", | |
"The Pianist (2002) 150 mins. 8.5 [u'Biography', u'Drama', u'War']\n", | |
"Catch Me If You Can (2002) 141 mins. 8.0 [u'Biography', u'Crime', u'Drama']\n", | |
"The Bourne Ultimatum (2007) 115 mins. 8.1 [u'Action', u'Thriller']\n", | |
"Captain America: The First Avenger (2011) 124 mins. 6.8 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"Sherlock Holmes (2009) 128 mins. 7.6 [u'Action', u'Adventure', u'Crime', u'Mystery', u'Thriller']\n", | |
"Full Metal Jacket (1987) 116 mins. 8.3 [u'Drama', u'War']\n", | |
"Million Dollar Baby (2004) 132 mins. 8.1 [u'Drama', u'Sport']\n", | |
"The Hobbit: The Desolation of Smaug (2013) 161 mins. 7.9 [u'Adventure', u'Fantasy']\n", | |
"The Intouchables (2011) 112 mins. 8.6 [u'Biography', u'Comedy', u'Drama']\n", | |
"The Social Network (2010) 120 mins. 7.8 [u'Biography', u'Drama']\n", | |
"Ted (2012) 106 mins. 7.0 [u'Comedy', u'Fantasy']\n", | |
"The Incredibles (2004) 115 mins. 8.0 [u'Animation', u'Action', u'Adventure', u'Family']\n", | |
"Pirates of the Caribbean: At World's End (2007) 169 mins. 7.1 [u'Action', u'Adventure', u'Fantasy']\n", | |
"How to Train Your Dragon (2010) 98 mins. 8.2 [u'Animation', u'Adventure', u'Family', u'Fantasy']\n", | |
"Trainspotting (1996) 94 mins. 8.2 [u'Drama']\n", | |
"Ratatouille (2007) 111 mins. 8.0 [u'Animation', u'Comedy', u'Family', u'Fantasy']\n", | |
"Pan's Labyrinth (2006) 118 mins. 8.2 [u'Drama', u'Fantasy', u'War']\n", | |
"Prometheus (2012) 124 mins. 7.0 [u'Adventure', u'Mystery', u'Sci-Fi']\n", | |
"Shrek (2001) 90 mins. 7.9 [u'Animation', u'Adventure', u'Comedy', u'Family', u'Fantasy']\n", | |
"The Curious Case of Benjamin Button (2008) 166 mins. 7.8 [u'Drama', u'Fantasy', u'Romance']\n", | |
"Twelve Monkeys (1995) 129 mins. 8.1 [u'Mystery', u'Sci-Fi', u'Thriller']\n", | |
"Blade Runner (1982) 117 mins. 8.2 [u'Sci-Fi', u'Thriller']\n", | |
"World War Z (2013) 116 mins. 7.0 [u'Action', u'Adventure', u'Horror', u'Sci-Fi', u'Thriller']\n", | |
"Casino Royale (2006) 144 mins. 8.0 [u'Action', u'Adventure', u'Thriller']\n", | |
"The Amazing Spider-Man (2012) 136 mins. 7.1 [u'Action', u'Adventure', u'Fantasy']\n", | |
"Captain America: The Winter Soldier (2014) 136 mins. 7.8 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"Apocalypse Now (1979) 153 mins. 8.5 [u'Drama', u'War']\n", | |
"Argo (2012) 120 mins. 7.8 [u'Drama', u'History', u'Thriller']\n", | |
"X-Men (2000) 104 mins. 7.4 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"Fargo (1996) 98 mins. 8.2 [u'Crime', u'Drama', u'Thriller']\n", | |
"Kick-Ass (2010) 117 mins. 7.7 [u'Action', u'Comedy']\n", | |
"Harry Potter and the Sorcerer's Stone (2001) 152 mins. 7.5 [u'Adventure', u'Family', u'Fantasy']\n", | |
"Drive (2011) 100 mins. 7.8 [u'Crime', u'Drama']\n", | |
"12 Angry Men (1957) 96 mins. 8.9 [u'Crime', u'Drama']\n", | |
"Heat (1995) 170 mins. 8.3 [u'Action', u'Crime', u'Drama', u'Thriller']\n", | |
"The Matrix Reloaded (2003) 138 mins. 7.2 [u'Action', u'Sci-Fi']\n", | |
"Life of Pi (2012) 127 mins. 8.0 [u'Adventure', u'Drama', u'Fantasy']\n", | |
"Superbad (2007) 113 mins. 7.6 [u'Comedy']\n", | |
"The Grand Budapest Hotel (2014) 99 mins. 8.1 [u'Adventure', u'Comedy', u'Drama']\n", | |
"Looper (2012) 119 mins. 7.5 [u'Action', u'Crime', u'Sci-Fi', u'Thriller']\n", | |
"Star Wars: Episode II - Attack of the Clones (2002) 142 mins. 6.7 [u'Action', u'Adventure', u'Fantasy', u'Sci-Fi']\n", | |
"Groundhog Day (1993) 101 mins. 8.1 [u'Comedy', u'Fantasy', u'Romance']\n", | |
"Independence Day (1996) 145 mins. 6.9 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"Now You See Me (2013) 115 mins. 7.3 [u'Crime', u'Mystery', u'Thriller']\n", | |
"Juno (2007) 96 mins. 7.5 [u'Comedy', u'Drama', u'Romance']\n", | |
"Into the Wild (2007) 148 mins. 8.2 [u'Adventure', u'Biography', u'Drama']\n", | |
"2001: A Space Odyssey (1968) 160 mins. 8.3 [u'Mystery', u'Sci-Fi']\n", | |
"L.A. Confidential (1997) 138 mins. 8.3 [u'Crime', u'Drama', u'Mystery', u'Thriller']\n", | |
"Limitless (2011) 105 mins. 7.4 [u'Mystery', u'Sci-Fi', u'Thriller']\n", | |
"Psycho (1960) 109 mins. 8.6 [u'Horror', u'Mystery', u'Thriller']\n", | |
"Lock, Stock and Two Smoking Barrels (1998) 107 mins. 8.2 [u'Comedy', u'Crime']\n", | |
"12 Years a Slave (2013) 134 mins. 8.1 [u'Biography', u'Drama', u'History']\n", | |
"Edge of Tomorrow (2014) 113 mins. 7.9 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"Jaws (1975) 124 mins. 8.1 [u'Adventure', u'Drama', u'Thriller']\n", | |
"Rise of the Planet of the Apes (2011) 105 mins. 7.6 [u'Action', u'Drama', u'Sci-Fi', u'Thriller']\n", | |
"Frozen (2013) 102 mins. 7.6 [u'Animation', u'Adventure', u'Comedy', u'Family', u'Fantasy', u'Musical']\n", | |
"21 Jump Street (2012) 109 mins. 7.2 [u'Action', u'Comedy', u'Crime']\n", | |
"Minority Report (2002) 145 mins. 7.7 [u'Action', u'Mystery', u'Sci-Fi', u'Thriller']\n", | |
"Life Is Beautiful (1997) 116 mins. 8.6 [u'Comedy', u'Drama', u'Romance']\n", | |
"Ocean's Eleven (2001) 116 mins. 7.8 [u'Crime', u'Thriller']\n", | |
"The Bourne Identity (2002) 119 mins. 7.9 [u'Action', u'Mystery', u'Thriller']\n", | |
"Spirited Away (2001) 125 mins. 8.6 [u'Animation', u'Adventure', u'Family', u'Fantasy']\n", | |
"X2 (2003) 134 mins. 7.5 [u'Action', u'Adventure', u'Sci-Fi', u'Thriller']\n", | |
"Men in Black (1997) 98 mins. 7.2 [u'Comedy', u'Sci-Fi']\n", | |
"Spider-Man 2 (2004) 127 mins. 7.3 [u'Action', u'Adventure', u'Fantasy']\n", | |
"Blood Diamond (2006) 143 mins. 8.0 [u'Adventure', u'Drama', u'Thriller']\n", | |
"Shaun of the Dead (2004) 99 mins. 8.0 [u'Comedy', u'Horror']\n", | |
"The Notebook (2004) 123 mins. 7.9 [u'Drama', u'Romance']\n", | |
"Star Trek Into Darkness (2013) 132 mins. 7.8 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"Thor: The Dark World (2013) 112 mins. 7.1 [u'Action', u'Adventure', u'Fantasy']\n", | |
"Rain Man (1988) 133 mins. 8.0 [u'Drama']\n", | |
"The Imitation Game (2014) 114 mins. 8.1 [u'Biography', u'Drama', u'Thriller', u'War']\n", | |
"Mad Max: Fury Road (2015) 120 mins. 8.2 [u'Action', u'Adventure', u'Sci-Fi', u'Thriller']\n", | |
"Cast Away (2000) 143 mins. 7.7 [u'Adventure', u'Drama']\n", | |
"Watchmen (2009) 162 mins. 7.6 [u'Action', u'Mystery', u'Sci-Fi']\n", | |
"Zombieland (2009) 88 mins. 7.7 [u'Adventure', u'Comedy', u'Horror']\n", | |
"I, Robot (2004) 115 mins. 7.1 [u'Action', u'Mystery', u'Sci-Fi', u'Thriller']\n", | |
"Oblivion (2013) 124 mins. 7.0 [u'Action', u'Adventure', u'Mystery', u'Sci-Fi']\n", | |
"Harry Potter and the Chamber of Secrets (2002) 161 mins. 7.4 [u'Adventure', u'Family', u'Fantasy', u'Mystery']\n", | |
"Crazy, Stupid, Love. (2011) 118 mins. 7.4 [u'Comedy', u'Drama', u'Romance']\n", | |
"Harry Potter and the Goblet of Fire (2005) 157 mins. 7.6 [u'Adventure', u'Family', u'Fantasy', u'Mystery']\n", | |
"Monty Python and the Holy Grail (1975) 91 mins. 8.3 [u'Adventure', u'Comedy', u'Fantasy']\n", | |
"Toy Story 2 (1999) 92 mins. 7.9 [u'Animation', u'Adventure', u'Comedy', u'Family', u'Fantasy']\n", | |
"Troy (2004) 163 mins. 7.2 [u'Adventure']\n", | |
"Pacific Rim (2013) 131 mins. 7.0 [u'Action', u'Adventure', u'Sci-Fi']\n", | |
"X-Men: The Last Stand (2006) 104 mins. 6.8 [u'Action', u'Adventure', u'Sci-Fi', u'Thriller']\n", | |
"The Hangover Part II (2011) 102 mins. 6.5 [u'Comedy']\n", | |
"Despicable Me (2010) 95 mins. 7.7 [u'Animation', u'Comedy', u'Family']\n", | |
"(500) Days of Summer (2009) 95 mins. 7.8 [u'Comedy', u'Drama', u'Romance']\n" | |
] | |
} | |
],
"source": [
"url = 'http://www.imdb.com/search/title'\n",
"for i in xrange(1,200,50):\n",
"    params = dict(sort='num_votes,desc', start=i, title_type='feature', year='1950,2015')\n",
"    r = requests.get(url, params=params)\n",
"    dom = web.Element(r.text)\n",
"    for movie in dom.by_tag('td.title'):\n",
"        title = movie.by_tag('a')[0].content\n",
"        year = movie.by_tag('span.year_type')[0].content\n",
"        runtime = movie.by_tag('span.runtime')[0].content\n",
"        rating = movie.by_tag('span.value')[0].content\n",
"        genres = movie.by_tag('span.genre')[0].by_tag('a')\n",
"        genre = [g.content for g in genres]\n",
"        print title, year, runtime, rating, genre"
] | |
}, | |
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing the scraped data into a file\n",
"Now we'll write the same fields to a tab-separated text file for further analysis."
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"collapsed": false | |
},
"outputs": [],
"source": [
"imdb = open('imdb_top_200.txt','a')\n",
"\n",
"url = 'http://www.imdb.com/search/title'\n",
"for i in xrange(1,200,50):\n",
"    params = dict(sort='num_votes,desc', start=i, title_type='feature', year='1950,2015')\n",
"    r = requests.get(url, params=params)\n",
"    dom = web.Element(r.text)\n",
"    for movie in dom.by_tag('td.title'):\n",
"        title = movie.by_tag('a')[0].content\n",
"        year = movie.by_tag('span.year_type')[0].content\n",
"        runtime = movie.by_tag('span.runtime')[0].content\n",
"        rating = movie.by_tag('span.value')[0].content\n",
"        genres = movie.by_tag('span.genre')[0].by_tag('a')\n",
"        genre = [g.content for g in genres]\n",
"        imdb.write(title+'\\t'+year+'\\t'+str(rating)+'\\t'+str(runtime)+'\\t'+str(genre)+'\\n')\n",
"imdb.close()"
] | |
}, | |
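{
"cell_type": "markdown",
"metadata": {},
"source": [
"This pretty much gives me the top 200 movies, but some redundant movies are written at the top of the file.<br/>One possible cause: the file is opened in append mode (`'a'`), so rows from earlier runs stay in the file; opening it with `'w'` would overwrite them. Will have to look into that."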
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 2", | |
"language": "python", | |
"name": "python2" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 2 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.10" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
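The paging and file-writing steps in the notebook can be tried without touching the network. The sketch below is not part of the original notebook: it assumes Python 3 (the notebook itself runs on a Python 2 kernel), and the helper names `page_starts` and `write_rows` are hypothetical. It shows which `start` values cover 200 results at 50 per page, and one way to skip repeated movies while writing tab-separated rows. The sample rows are copied from the notebook's printed output, with one deliberate repeat.

```python
import csv
import io


def page_starts(total, page_size=50):
    """Return the 1-based `start` values needed to cover `total` results."""
    return list(range(1, total + 1, page_size))


def write_rows(rows, handle):
    """Write (title, year, rating, runtime, genres) rows as TSV.

    Rows whose (title, year) pair was already written are skipped.
    Returns the number of rows actually written.
    """
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    seen = set()
    written = 0
    for title, year, rating, runtime, genres in rows:
        key = (title, year)
        if key in seen:
            continue  # repeated movie, do not write it again
        seen.add(key)
        writer.writerow([title, year, rating, runtime, ', '.join(genres)])
        written += 1
    return written


# Rows taken from the notebook output; the first one is repeated on purpose.
sample = [
    ('The Shawshank Redemption', '(1994)', '9.3', '142 mins.', ['Crime', 'Drama']),
    ('The Dark Knight', '(2008)', '9.0', '152 mins.', ['Action', 'Crime', 'Drama']),
    ('The Shawshank Redemption', '(1994)', '9.3', '142 mins.', ['Crime', 'Drama']),
]

buffer = io.StringIO()  # stands in for a file opened with mode 'w'
count = write_rows(sample, buffer)

print(page_starts(200))
print(count)
print(buffer.getvalue())
```

With a real file, the same `write_rows` call could be given a handle from `open('imdb_top_200.txt', 'w', encoding='utf-8', newline='')`, so that titles such as "Léon: The Professional" are encoded explicitly. Joining the genres into a plain string, instead of writing `str(genre)`, keeps the `u'...'` list notation out of the file.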