@pybokeh
Last active August 29, 2015 13:56
Webscraping CarComplaintsDotCom
{
"metadata": {
"name": ""
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Python script used to scrape vehicle complaints information from [www.carcomplaints.com](http://www.carcomplaints.com) website."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Programming environment:\n",
"\n",
"* [Python](https://xkcd.com/353/) version 3.3.3\n",
"* [IPython notebook](http://ipython.org/notebook.html) version 1.1 for web-based, interactive data analysis\n",
"* [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) version 4.2 HTML/XML parsing library"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>Before proceeding, I should probably explain why one would obtain data from websites in this manner, since a person can obviously just navigate to the data manually.&nbsp;&nbsp;Manual navigation is perfectly suitable for obtaining one-off data, but there are times when we need to &#42;pull large amounts of data at once and store it in a database. The data we want may be on one page or spread across several disparate pages. With some programming, we can automate the process.&nbsp;&nbsp;Plus, it is so fun!</p><br><p>&#42;Caveat: Some websites have a built-in \"limiter\" on how much data you can pull from them, so pulling a lot of data isn't always possible.&nbsp;&nbsp;There are some workarounds, or you can subscribe to their developer API if they happen to provide such a service.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>carcomplaints.com has a very easy-to-understand URL scheme. It is basically in this format: http://www.carcomplaints.com/make/model/year/system/sub-system.shtml</p>"
]
},
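{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>As a minimal sketch of the URL scheme described above, the cell below joins the path segments into a full address. The helper name <code>build_url</code> and its parameters are hypothetical and not part of the original script; the only assumption is the make/model/year/system/sub-system layout stated above.</p>"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"BASE_URL = 'http://www.carcomplaints.com/' # site root, as used throughout this notebook\n",
"\n",
"def build_url(make, model=None, year=None, system=None, subsystem=None):\n",
"    \"\"\"Hypothetical helper: build a carcomplaints.com URL from its path segments\"\"\"\n",
"    url = BASE_URL\n",
"    # Append each directory-style segment, stopping at the first one that is missing\n",
"    for segment in (make, model, year, system):\n",
"        if segment is None:\n",
"            return url\n",
"        url += str(segment) + '/'\n",
"    # The sub-system page is a .shtml file rather than a directory\n",
"    if subsystem is not None:\n",
"        url += subsystem + '.shtml'\n",
"    return url\n",
"\n",
"build_url('Honda', 'Civic', 2001, 'transmission', 'pops_out_of_gear')"
],
"language": "python",
"metadata": {},
"outputs": []
},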
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Below is a brief Python script to parse the entire contents of the Honda page at carcomplaints.com"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from collections import OrderedDict # Not necessary, only being used to maintain sort order of our qty summaries\n",
"from bs4 import BeautifulSoup # non-standard Python library to parse HTML/XML pages\n",
"import urllib.request as request # standard Python library for opening HTML pages\n",
"import re # regular expressions module to enable us to search for text patterns\n",
"\n",
"url_Honda = 'http://www.carcomplaints.com/Honda/'\n",
"html_Honda = request.urlopen(url_Honda)\n",
"\n",
"soup_Honda = BeautifulSoup(html_Honda)"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Honda Models Overall Complaint Counts (http://www.carcomplaints.com/Honda/)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>Looking at the page source of http://www.carcomplaints.com/Honda/, it appears the data we want is embedded in the &lt;ul&gt; element, with class='column bar', and id=c1 or c2 or c3.&nbsp;&nbsp;Using the BeautifulSoup API, we can grab data from certain HTML tags or elements:</p>"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ul = soup_Honda.find_all('ul', class_='column bar', id=re.compile(r'^c\\d+$')) # match ids c1, c2, c3, ...\n",
"ul"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 2,
"text": [
"[<ul class=\"column bar\" id=\"c1\">\n",
"<li><a href=\"/Honda/Accord/\" title=\"Honda Accord complaints (8,601)\">Accord</a> <span class=\"count\">8,601</span> <span class=\"index\" style=\"width: 100%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Accord_Crosstour/\" title=\"Honda Accord Crosstour complaints (6)\">Accord Crosstour</a> <span class=\"count\">6</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Accord_Hybrid/\" title=\"Honda Accord Hybrid complaints (13)\">Accord Hybrid</a> <span class=\"count\">13</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Ballade/\" title=\"Honda Ballade complaints (1)\">Ballade</a> <span class=\"count\">1</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Brio/\" title=\"Honda Brio complaints (1)\">Brio</a> <span class=\"count\">1</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/City/\" title=\"Honda City complaints (22)\">City</a> <span class=\"count\">22</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Civic/\" title=\"Honda Civic complaints (3,793)\">Civic</a> <span class=\"count\">3,793</span> <span class=\"index\" style=\"width: 44%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Civic_Hybrid/\" title=\"Honda Civic Hybrid complaints (174)\">Civic Hybrid</a> <span class=\"count\">174</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/CR-V/\" title=\"Honda CR-V complaints (760)\">CR-V</a> <span class=\"count\">760</span> <span class=\"index\" style=\"width: 8%;\">\u00a0</span></li>\n",
"</ul>,\n",
" <ul class=\"column bar\" id=\"c2\">\n",
"<li><a href=\"/Honda/CR-Z/\" title=\"Honda CR-Z complaints (11)\">CR-Z</a> <span class=\"count\">11</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Crosstour/\" title=\"Honda Crosstour complaints (9)\">Crosstour</a> <span class=\"count\">9</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/CRX/\" title=\"Honda CRX complaints (4)\">CRX</a> <span class=\"count\">4</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Del_Sol/\" title=\"Honda Del Sol complaints (2)\">Del Sol</a> <span class=\"count\">2</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Element/\" title=\"Honda Element complaints (112)\">Element</a> <span class=\"count\">112</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Fit/\" title=\"Honda Fit complaints (136)\">Fit</a> <span class=\"count\">136</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Fit_EV/\" title=\"Honda Fit EV complaints (0)\">Fit EV</a> <span class=\"count\">0</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Insight/\" title=\"Honda Insight complaints (12)\">Insight</a> <span class=\"count\">12</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Jazz/\" title=\"Honda Jazz complaints (12)\">Jazz</a> <span class=\"count\">12</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"</ul>,\n",
" <ul class=\"column bar\" id=\"c3\">\n",
"<li><a href=\"/Honda/Odyssey/\" title=\"Honda Odyssey complaints (1,565)\">Odyssey</a> <span class=\"count\">1,565</span> <span class=\"index\" style=\"width: 18%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Orthia/\" title=\"Honda Orthia complaints (1)\">Orthia</a> <span class=\"count\">1</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Passport/\" title=\"Honda Passport complaints (66)\">Passport</a> <span class=\"count\">66</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Pilot/\" title=\"Honda Pilot complaints (527)\">Pilot</a> <span class=\"count\">527</span> <span class=\"index\" style=\"width: 6%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Prelude/\" title=\"Honda Prelude complaints (54)\">Prelude</a> <span class=\"count\">54</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/Ridgeline/\" title=\"Honda Ridgeline complaints (83)\">Ridgeline</a> <span class=\"count\">83</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"<li><a href=\"/Honda/S2000/\" title=\"Honda S2000 complaints (4)\">S2000</a> <span class=\"count\">4</span> <span class=\"index\" style=\"width: 5%;\">\u00a0</span></li>\n",
"</ul>]"
]
}
],
"prompt_number": 2
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>As you can see from above, the data I want (the model name and # of complaints) is in the &lt;li&gt; tags. I will make a Python dictionary (i.e. key:value pairs) of this data:</p>"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"honda_model_counts_dict = {}\n",
"num_column_data = len(ul) # The data is divided up in arbitrary number of columns per HTML page source\n",
"for i in range(num_column_data): # For each column of data...\n",
"    for row in ul[i].find_all('li'):\n",
"        honda_model_counts_dict[row.a.get_text()] = int(row.span.get_text().replace(\",\",\"\"))"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>OK, we created our Python dictionary containing Honda models and their respective number of complaints:</p>"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"honda_model_counts_dict"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": [
"{'Accord': 8601,\n",
" 'Accord Crosstour': 6,\n",
" 'Accord Hybrid': 13,\n",
" 'Ballade': 1,\n",
" 'Brio': 1,\n",
" 'CR-V': 760,\n",
" 'CR-Z': 11,\n",
" 'CRX': 4,\n",
" 'City': 22,\n",
" 'Civic': 3793,\n",
" 'Civic Hybrid': 174,\n",
" 'Crosstour': 9,\n",
" 'Del Sol': 2,\n",
" 'Element': 112,\n",
" 'Fit': 136,\n",
" 'Fit EV': 0,\n",
" 'Insight': 12,\n",
" 'Jazz': 12,\n",
" 'Odyssey': 1565,\n",
" 'Orthia': 1,\n",
" 'Passport': 66,\n",
" 'Pilot': 527,\n",
" 'Prelude': 54,\n",
" 'Ridgeline': 83,\n",
" 'S2000': 4}"
]
}
],
"prompt_number": 4
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Acura Models Overall Complaint Counts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the same procedure I did for Honda, I will get the Acura models and their respective complaint counts.&nbsp;&nbsp;But this time, I will output the results into an actual HTML table just to show off IPython notebook's various output capabilities:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from collections import OrderedDict\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import HTML # Used to make HTML table\n",
"import urllib.request as request\n",
"import re\n",
"\n",
"url_Acura = 'http://www.carcomplaints.com/Acura/'\n",
"html_Acura = request.urlopen(url_Acura)\n",
"\n",
"soup_Acura = BeautifulSoup(html_Acura)\n",
"ul = soup_Acura.find_all('ul', class_='column bar', id=re.compile(r'^c\\d+$')) # match ids such as c1, c2, ...\n",
"\n",
"acura_model_counts_dict = {}\n",
"num_column_data = len(ul) # The data is divided up in arbitrary number of columns\n",
"for i in range(num_column_data): # For each column of data...\n",
"    for row in ul[i].find_all('li'):\n",
"        acura_model_counts_dict[row.a.get_text()] = int(row.span.get_text().replace(\",\",\"\"))\n",
"\n",
"OD_Acura = OrderedDict(sorted(acura_model_counts_dict.items(), key=lambda t: t[1], reverse=True)) # Sort by values descending\n",
"\n",
"s_header = '<table border=\"1\"><tr><th>Model Name</th><th># of Complaints</th></tr>'\n",
"\n",
"s_data = ''\n",
"for key in OD_Acura.keys():\n",
"    s_data = s_data + '<tr><td align=\"center\">' + key + '</td>' + '<td align=\"center\">' + str(OD_Acura[key]) + '</td></tr>'\n",
"\n",
"s_footer = \"</table>\"\n",
"\n",
"h = HTML(s_header+s_data+s_footer);h"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<table border=\"1\"><tr><th>Model Name</th><th># of Complaints</th></tr><tr><td align=\"center\">TL</td><td align=\"center\">109</td></tr><tr><td align=\"center\">MDX</td><td align=\"center\">50</td></tr><tr><td align=\"center\">TSX</td><td align=\"center\">36</td></tr><tr><td align=\"center\">Legend</td><td align=\"center\">27</td></tr><tr><td align=\"center\">Integra</td><td align=\"center\">25</td></tr><tr><td align=\"center\">RDX</td><td align=\"center\">20</td></tr><tr><td align=\"center\">CL</td><td align=\"center\">17</td></tr><tr><td align=\"center\">RSX</td><td align=\"center\">15</td></tr><tr><td align=\"center\">RL</td><td align=\"center\">11</td></tr><tr><td align=\"center\">1.7EL</td><td align=\"center\">5</td></tr><tr><td align=\"center\">EL</td><td align=\"center\">4</td></tr><tr><td align=\"center\">Vigor</td><td align=\"center\">2</td></tr><tr><td align=\"center\">RLX</td><td align=\"center\">1</td></tr><tr><td align=\"center\">SLX</td><td align=\"center\">1</td></tr><tr><td align=\"center\">ILX Hybrid</td><td align=\"center\">0</td></tr><tr><td align=\"center\">ILX</td><td align=\"center\">0</td></tr><tr><td align=\"center\">NSX</td><td align=\"center\">0</td></tr><tr><td align=\"center\">ZDX</td><td align=\"center\">0</td></tr></table>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 204,
"text": [
"<IPython.core.display.HTML at 0x5c89870>"
]
}
],
"prompt_number": 204
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Honda version executed all in one cell"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from collections import OrderedDict\n",
"from bs4 import BeautifulSoup\n",
"from IPython.display import HTML\n",
"import urllib.request as request\n",
"import re\n",
"\n",
"url_Honda = 'http://www.carcomplaints.com/Honda/'\n",
"html_Honda = request.urlopen(url_Honda)\n",
"\n",
"soup_Honda = BeautifulSoup(html_Honda)\n",
"ul = soup_Honda.find_all('ul', class_='column bar', id=re.compile(r'^c\\d+$')) # match ids c1, c2, c3, ...\n",
"\n",
"honda_model_counts_dict = {}\n",
"num_column_data = len(ul) # The data is divided up in arbitrary number of columns per HTML page source\n",
"for i in range(num_column_data): # For each column of data...\n",
"    for row in ul[i].find_all('li'):\n",
"        honda_model_counts_dict[row.a.get_text()] = int(row.span.get_text().replace(\",\",\"\"))\n",
"\n",
"OD_Honda = OrderedDict(sorted(honda_model_counts_dict.items(), key=lambda t: t[1], reverse=True)) # Sort by values descending\n",
"\n",
"s_header = '<table border=\"1\"><tr><th>Model Name</th><th># of Complaints</th></tr>'\n",
"\n",
"s_data = ''\n",
"for key in OD_Honda.keys():\n",
"    s_data = s_data + '<tr><td align=\"center\">' + key + '</td>' + '<td align=\"center\">' + str(OD_Honda[key]) + '</td></tr>'\n",
"\n",
"s_footer = \"</table>\"\n",
"\n",
"h = HTML(s_header+s_data+s_footer);h"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<table border=\"1\"><tr><th>Model Name</th><th># of Complaints</th></tr><tr><td align=\"center\">Accord</td><td align=\"center\">8560</td></tr><tr><td align=\"center\">Civic</td><td align=\"center\">3775</td></tr><tr><td align=\"center\">Odyssey</td><td align=\"center\">1555</td></tr><tr><td align=\"center\">CR-V</td><td align=\"center\">750</td></tr><tr><td align=\"center\">Pilot</td><td align=\"center\">522</td></tr><tr><td align=\"center\">Civic Hybrid</td><td align=\"center\">174</td></tr><tr><td align=\"center\">Fit</td><td align=\"center\">133</td></tr><tr><td align=\"center\">Element</td><td align=\"center\">112</td></tr><tr><td align=\"center\">Ridgeline</td><td align=\"center\">81</td></tr><tr><td align=\"center\">Passport</td><td align=\"center\">66</td></tr><tr><td align=\"center\">Prelude</td><td align=\"center\">54</td></tr><tr><td align=\"center\">City</td><td align=\"center\">22</td></tr><tr><td align=\"center\">Jazz</td><td align=\"center\">12</td></tr><tr><td align=\"center\">Insight</td><td align=\"center\">12</td></tr><tr><td align=\"center\">Accord Hybrid</td><td align=\"center\">12</td></tr><tr><td align=\"center\">CR-Z</td><td align=\"center\">11</td></tr><tr><td align=\"center\">Crosstour</td><td align=\"center\">9</td></tr><tr><td align=\"center\">Accord Crosstour</td><td align=\"center\">6</td></tr><tr><td align=\"center\">S2000</td><td align=\"center\">4</td></tr><tr><td align=\"center\">CRX</td><td align=\"center\">4</td></tr><tr><td align=\"center\">Del Sol</td><td align=\"center\">2</td></tr><tr><td align=\"center\">Brio</td><td align=\"center\">1</td></tr><tr><td align=\"center\">Orthia</td><td align=\"center\">1</td></tr><tr><td align=\"center\">Ballade</td><td align=\"center\">1</td></tr><tr><td align=\"center\">Fit EV</td><td align=\"center\">0</td></tr></table>"
],
"metadata": {},
"output_type": "pyout",
"prompt_number": 205,
"text": [
"<IPython.core.display.HTML at 0x8917030>"
]
}
],
"prompt_number": 205
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"OK, now that I've shown how to obtain the number of complaints for each model in a step-by-step manner, it is time to make a function out of this so we can re-use all this code"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from collections import OrderedDict # used below to keep the models in page order\n",
"from bs4 import BeautifulSoup\n",
"import urllib.request as request\n",
"import re\n",
"\n",
"def getCountsByModel(make):\n",
"    \"\"\"Function that returns the number of complaints for each model based on vehicle make\n",
"    Applicable make values are: 'Honda','Acura','Ford','GMC',etc\n",
"    Returns an OrderedDict where the key is the model, value is the qty of complaints\"\"\"\n",
"\n",
"    url = 'http://www.carcomplaints.com/'\n",
"    url_make = url+make+'/'\n",
"    html_make = request.urlopen(url_make)\n",
"\n",
"    soup = BeautifulSoup(html_make)\n",
"    ul = soup.find_all('ul', class_='column bar', id=re.compile(r'^c\\d+$')) # match ids such as c1, c2, ...\n",
"\n",
"    make_model_counts_dict = OrderedDict()\n",
"    num_column_data = len(ul) # The data is divided up in arbitrary number of columns per HTML page source\n",
"    for i in range(num_column_data): # For each column of data...\n",
"        for row in ul[i].find_all('li'):\n",
"            make_model_counts_dict[row.a.get_text()] = int(row.span.get_text().replace(\",\",\"\"))\n",
"\n",
"    return make_model_counts_dict"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"countsByModel = getCountsByModel('Honda')"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 17
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for key in countsByModel.keys():\n",
" print(key, countsByModel[key])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Accord 8601\n",
"Accord Crosstour 6\n",
"Accord Hybrid 13\n",
"Ballade 1\n",
"Brio 1\n",
"City 22\n",
"Civic 3793\n",
"Civic Hybrid 174\n",
"CR-V 760\n",
"CR-Z 11\n",
"Crosstour 9\n",
"CRX 4\n",
"Del Sol 2\n",
"Element 112\n",
"Fit 136\n",
"Fit EV 0\n",
"Insight 12\n",
"Jazz 12\n",
"Odyssey 1565\n",
"Orthia 1\n",
"Passport 66\n",
"Pilot 527\n",
"Prelude 54\n",
"Ridgeline 83\n",
"S2000 4\n"
]
}
],
"prompt_number": 19
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"I also made a function to get all available makes at carcomplaints.com"
]
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"from bs4 import BeautifulSoup\n",
"import urllib.request as request\n",
"import re\n",
"\n",
"def getMakes():\n",
"    \"\"\"Function to get all the makes available at carcomplaints.com\"\"\"\n",
"\n",
"    url = 'http://www.carcomplaints.com/'\n",
"    html = request.urlopen(url)\n",
"\n",
"    soup = BeautifulSoup(html)\n",
"    sections = soup.find_all('section', id=re.compile('makes'))\n",
"\n",
"    make_list = []\n",
"    for section in range(len(sections)):\n",
"        for li in sections[section].find_all('li'):\n",
"            make_list.append(li.a['href'].replace('/',''))\n",
"\n",
"    return make_list"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 20
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"getMakes()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 21,
"text": [
"['Acura',\n",
" 'Audi',\n",
" 'BMW',\n",
" 'Buick',\n",
" 'Cadillac',\n",
" 'Chevrolet',\n",
" 'Chrysler',\n",
" 'Dodge',\n",
" 'Ford',\n",
" 'GMC',\n",
" 'Honda',\n",
" 'Hyundai',\n",
" 'Infiniti',\n",
" 'Isuzu',\n",
" 'Jeep',\n",
" 'Kia',\n",
" 'Lexus',\n",
" 'Lincoln',\n",
" 'Mazda',\n",
" 'Mercedes-Benz',\n",
" 'Mercury',\n",
" 'Mini',\n",
" 'Mitsubishi',\n",
" 'Nissan',\n",
" 'Oldsmobile',\n",
" 'Plymouth',\n",
" 'Pontiac',\n",
" 'Porsche',\n",
" 'Ram',\n",
" 'Saab',\n",
" 'Saturn',\n",
" 'Scion',\n",
" 'Subaru',\n",
" 'Toyota',\n",
" 'Volvo',\n",
" 'Volkswagen',\n",
" 'Alfa_Romeo',\n",
" 'AMC',\n",
" 'Bentley',\n",
" 'Chery',\n",
" 'Daewoo',\n",
" 'Datsun',\n",
" 'Eagle',\n",
" 'Ferrari',\n",
" 'Fiat',\n",
" 'Geo',\n",
" 'Holden',\n",
" 'HSV',\n",
" 'Hummer',\n",
" 'Jaguar',\n",
" 'Kenworth',\n",
" 'Lamborghini',\n",
" 'Land_Rover',\n",
" 'Mahindra',\n",
" 'Maruti',\n",
" 'Opel',\n",
" 'Peugeot',\n",
" 'Renault',\n",
" 'Rover',\n",
" 'Seat',\n",
" 'Skoda',\n",
" 'Suzuki',\n",
" 'Tata',\n",
" 'Tesla',\n",
" 'Vauxhall',\n",
" 'Zimmer']"
]
}
],
"prompt_number": 21
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Function to get available model years and complaint qty for a given model"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from bs4 import BeautifulSoup\n",
"import urllib.request as request\n",
"import re\n",
"\n",
"def getYearCounts(make, model):\n",
"    \"\"\"Function that returns a Python dict that contains model years and their complaint qty\"\"\"\n",
"\n",
"    url = 'http://www.carcomplaints.com/'+make+'/'+model+'/'\n",
"    html = request.urlopen(url)\n",
"\n",
"    soup = BeautifulSoup(html)\n",
"    li = soup.find_all('li', id=re.compile('^bar')) # ids that start with 'bar'\n",
"\n",
"    year_counts_dict = {}\n",
"    for item in li:\n",
"        year = int(item.find('span',class_='label').get_text())\n",
"        count = int(item.find('span',class_='count').get_text().replace(\",\",\"\"))\n",
"        year_counts_dict[year] = count\n",
"\n",
"    return year_counts_dict"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 22
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Now for a specific model, we can list out the number of complaints for each model year:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"getYearCounts('Honda','Accord')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 222,
"text": [
"{1979: 1,\n",
" 1986: 15,\n",
" 1987: 2,\n",
" 1988: 13,\n",
" 1989: 16,\n",
" 1990: 33,\n",
" 1991: 69,\n",
" 1992: 39,\n",
" 1993: 37,\n",
" 1994: 44,\n",
" 1995: 20,\n",
" 1996: 38,\n",
" 1997: 52,\n",
" 1998: 392,\n",
" 1999: 313,\n",
" 2000: 424,\n",
" 2001: 488,\n",
" 2002: 836,\n",
" 2003: 1447,\n",
" 2004: 460,\n",
" 2005: 184,\n",
" 2006: 141,\n",
" 2007: 212,\n",
" 2008: 2031,\n",
" 2009: 696,\n",
" 2010: 243,\n",
" 2011: 123,\n",
" 2012: 112,\n",
" 2013: 78,\n",
" 2014: 1}"
]
}
],
"prompt_number": 222
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Function to get Top Systems by Qty"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from bs4 import BeautifulSoup\n",
"from collections import OrderedDict\n",
"import urllib.request as request\n",
"import re\n",
"\n",
"def getTopSystemsQty(make, model, year):\n",
"    \"\"\"Function that returns an OrderedDict containing system problems and their complaint qty\"\"\"\n",
"\n",
"    url = 'http://www.carcomplaints.com/'+make+'/'+model+'/'+str(year)+'/'\n",
"    html = request.urlopen(url)\n",
"\n",
"    soup = BeautifulSoup(html)\n",
"    li = soup.find_all('li', id=re.compile('^bar')) # ids that start with 'bar'\n",
"\n",
"    problem_counts_dict = OrderedDict() # We want to maintain insertion order\n",
"    for item in li:\n",
"        try:\n",
"            problem_counts_dict[item.a['href'][:-1]]=int(item.span.get_text().replace(\",\",\"\"))\n",
"        except Exception:\n",
"            # Skip list items that lack a link or a numeric count\n",
"            pass\n",
"\n",
"    return problem_counts_dict"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 24
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Below are the top system failures for the 2012 Honda Odyssey:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"topSystemsQty = getTopSystemsQty(year=2012, make='Honda', model='Odyssey')\n",
"for key in topSystemsQty.keys():\n",
" print(key, topSystemsQty[key])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"engine 6\n",
"accessories-interior 4\n",
"windows_windshield 4\n",
"body_paint 3\n",
"lights 3\n",
"suspension 3\n",
"electrical 1\n",
"miscellaneous 1\n",
"steering 1\n",
"transmission 1\n",
"wheels_hubs 1\n"
]
}
],
"prompt_number": 26
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Function to get the qty of NHTSA complaints for each system category"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from bs4 import BeautifulSoup\n",
"from collections import OrderedDict\n",
"import urllib.request as request\n",
"import re\n",
"\n",
"\n",
"def getNhtsaSystemsQty(make, model, year):\n",
"    \"\"\"Function that returns an OrderedDict containing qty of NHTSA complaints by system\"\"\"\n",
"\n",
"    url = 'http://www.carcomplaints.com/'+make+'/'+model+'/'+str(year)+'/'\n",
"    html = request.urlopen(url)\n",
"\n",
"    soup = BeautifulSoup(html)\n",
"\n",
"    nhtsa = soup.find_all('em', class_='nhtsa')\n",
"\n",
"    nhtsa_counts = []\n",
"    for item in nhtsa:\n",
"        try:\n",
"            # There are 3 string tokens separated by whitespace, I want the 3rd token which is the qty\n",
"            nhtsa_counts.append(int(item.span.get_text().split()[2]))\n",
"        except (IndexError, ValueError):\n",
"            # Unfortunately, some only have 2 tokens\n",
"            nhtsa_counts.append(int(item.span.get_text().split()[1]))\n",
"\n",
"    systems = soup.find_all('li', id=re.compile('^bar')) # ids that start with 'bar'\n",
"\n",
"    systems_list = []\n",
"    for item in systems:\n",
"        systems_list.append(item.a['href'][:-1]) # Remove the ending forward slash\n",
"\n",
"    nhtsa_systems_counts = list(zip(systems_list,nhtsa_counts))\n",
"\n",
"    nhtsa_systems_qty_dict = OrderedDict()\n",
"    for item in nhtsa_systems_counts:\n",
"        nhtsa_systems_qty_dict[item[0]]=item[1]\n",
"\n",
"    return nhtsa_systems_qty_dict"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 32
},
{
"cell_type": "code",
"collapsed": true,
"input": [
"getNhtsaSystemsQty('Honda','Accord','2001')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 33,
"text": [
"OrderedDict([('transmission', 129), ('seat_belts_air_bags', 327), ('engine', 76), ('body_paint', 8), ('electrical', 49), ('accessories-interior', 41), ('AC_heater', 1), ('brakes', 57), ('exhaust_system', 5), ('windows_windshield', 10), ('cooling_system', 1), ('drivetrain', 53), ('lights', 5), ('suspension', 20), ('fuel_system', 16), ('steering', 11), ('wheels_hubs', 24), ('miscellaneous', 6), ('accessories-exterior', 4), ('clutch', 1)])"
]
}
],
"prompt_number": 33
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Function to get qty of complaints by sub-system"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from bs4 import BeautifulSoup\n",
"from collections import OrderedDict\n",
"import urllib.request as request\n",
"import re\n",
"\n",
"make = 'Honda'\n",
"model = 'Civic'\n",
"year = 2001\n",
"system = 'transmission'\n",
"\n",
"def getSubSystemsQty(make, model, year, system):\n",
"    \"\"\"Function that will return an OrderedDict of # of complaints by sub-system\"\"\"\n",
"\n",
"    url = 'http://www.carcomplaints.com/'+make+'/'+model+'/'+str(year)+'/'+system+'/'\n",
"    html = request.urlopen(url)\n",
"    soup = BeautifulSoup(html)\n",
"\n",
"    li = soup.find_all('li', id=re.compile('^bar')) # ids that start with 'bar'\n",
"\n",
"    subsystem_counts_dict = OrderedDict() # We want to maintain insertion order\n",
"    for item in li:\n",
"        subsystem_counts_dict[item.a['href'].split(\".\")[0]]=int(item.span.get_text().replace(\",\",\"\"))\n",
"\n",
"    return subsystem_counts_dict"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 34
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"getSubSystemsQty(year=2012, make='Honda', model='Odyssey',system='engine')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 35,
"text": [
"OrderedDict([('engine_noise', 1), ('engine_revving_causing_sudden_acceleration', 1), ('hesitation_on_acceleration', 1), ('loss_of_power_engine_noise', 1), ('sudden_unintended_acceleration', 1), ('vibrates_and_rides_rough', 1), ('vehicle_speed_control', 6), ('engine_and_engine_cooling', 4), ('engine', 3)])"
]
}
],
"prompt_number": 35
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Below are the sub-system quantities, which currently also include NHTSA's quantities.&nbsp;&nbsp;I'll have to revisit the code to separate the NHTSA complaints."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"subSystemsQty = getSubSystemsQty(year=2012, make='Honda', model='Odyssey',system='engine')\n",
"for key in subSystemsQty.keys():\n",
" print(key, subSystemsQty[key])"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"engine_noise 1\n",
"engine_revving_causing_sudden_acceleration 1\n",
"hesitation_on_acceleration 1\n",
"loss_of_power_engine_noise 1\n",
"sudden_unintended_acceleration 1\n",
"vibrates_and_rides_rough 1\n",
"vehicle_speed_control 6\n",
"engine_and_engine_cooling 4\n",
"engine 3\n"
]
}
],
"prompt_number": 29
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": [
"Function to get the review text for a specific system failure"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>This was actually harder than I thought. If there are more than 50 complaint reviews, then the reviews are spread out over multiple pages. So I had to account for this in the code.</p>"
]
},
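{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>A minimal sketch of the page-count check, assuming the subtitle text contains a marker of the form \"Page 1 of N\". The sample string below is made up for illustration, and <code>parse_page_count</code> is a hypothetical helper, not part of the original script:</p>"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import re\n",
"\n",
"def parse_page_count(subtitle_text):\n",
"    \"\"\"Hypothetical helper: total number of pages, or 1 when no 'Page X of N' marker is found\"\"\"\n",
"    match = re.search(r'Page\\s+\\d+\\s+of\\s+(\\d+)', subtitle_text)\n",
"    return int(match.group(1)) if match else 1\n",
"\n",
"parse_page_count('(Page 1 of 3)') # made-up sample text"
],
"language": "python",
"metadata": {},
"outputs": []
},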
{
"cell_type": "code",
"collapsed": false,
"input": [
"from bs4 import BeautifulSoup\n",
"import urllib.request as request\n",
"import time # needed for the time.sleep() delay between page requests\n",
"\n",
"def getReviews(make, model, year, system, subsystem):\n",
"    \"\"\"Function that returns a list of all (maybe) customer reviews\n",
"    NOTE: If there are more than 50 reviews, then the reviews are spread out over multiple pages.\"\"\"\n",
"\n",
"    url = 'http://www.carcomplaints.com/'+make+'/'+model+'/'+str(year)+'/'+system+'/'+subsystem+'.shtml'\n",
"    html = request.urlopen(url)\n",
"    soup = BeautifulSoup(html)\n",
"\n",
"    reviews = soup.find_all('div', itemprop=\"reviewBody\")\n",
"\n",
"    complaints = []\n",
"    for complaint in reviews:\n",
"        complaints.append(complaint.p.get_text())\n",
"\n",
"    ##### Read the first page, now check if there are 2 or more pages #####\n",
"    # Get the subtitle so we can then figure out if there are multiple pages\n",
"    page_count_text = soup.find('div', id=\"subtitle\").span.get_text()\n",
"\n",
"    # If 'Page 1' exists, then there must be more than one page to read...loop thru all available pages\n",
"    if 'Page 1' in page_count_text:\n",
"        # Get total number of pages\n",
"        num_pages = int(page_count_text.split()[3].replace(\")\",\"\"))\n",
"        print(\"Page 1 of\",num_pages,\"parsed\")\n",
"        for page in range(2,num_pages+1):\n",
"            url = 'http://www.carcomplaints.com/'+make+'/'+model+'/'+str(year)+'/'+system+'/'+subsystem+'-'+str(page)+'.shtml'\n",
"            html = request.urlopen(url)\n",
"            try: # Thru testing, found page(s) that BeautifulSoup could not parse due to page having bad markup syntax\n",
"                soup = BeautifulSoup(html)\n",
"                reviews = soup.find_all('div', itemprop=\"reviewBody\")\n",
"                for complaint in reviews:\n",
"                    complaints.append(complaint.p.get_text())\n",
"                print(\"Page\",page,\"of\",num_pages,\"parsed\")\n",
"                time.sleep(5) # Need to add delay to prevent Connection Refused error\n",
"            except Exception:\n",
"                print(\"Page\", page,\"has severely bad markup!\",\"No data from this page was parsed.\")\n",
"        print(\"Retrieval of review text completed\\n\\n\")\n",
"    else:\n",
"        print(\"There was only 1 page to parse. Retrieval of review text completed.\\n\\n\")\n",
"\n",
"    return complaints"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 41
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"for complaint in getReviews('Honda','Civic','2001','transmission','pops_out_of_gear'):\n",
"    print(complaint)\n",
"    print('*'*120) # Print a row of 120 asterisks between comments"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"There was only 1 page to parse. Retrieval of review text completed.\n",
"\n",
"\n",
"This is the second time that the speed sensor has gone out in the last nine months.\n",
"************************************************************************************************************************\n",
"Syncros are bad, pops out of gear and doesn't get back in.. throw out bearing was bad too\n",
"************************************************************************************************************************\n",
"The transmission of my 2001 civic began popping out of 4th gear on occasion. The car was my wifes and I didn't normally drive it. When I did and if popped out during acceleration she told me it had been doing it for a while. We decided to get a new car since we saw a trend since we just put in new struts and had the anti-lock break system replaced for about $1000 each . The Honda dealership test drove it and said it needed a new transmission. This had a large impact on the trade in value. I need to keep an eye on this site and when I see if our 2010 civic might have issues and trade in before the problem manifests\n",
"************************************************************************************************************************\n",
"as a auto mechanic and auto car collector for 40 years, it is clear honda did not build these manual transmissions correctly. they should not pop out of gear at 100000.\n",
"************************************************************************************************************************\n",
"HONDA IS NO LONGER THE RELIABILITY CHAMP . I WON`T BE BUYING ANOTHER ONE .\n",
"************************************************************************************************************************\n",
"My 2001 Honda Civic EX will NOT stay in 5 gear for more than a split second. As soon as I put it in gear it pops out, so now I can't travel on roads over 50mph for any longer than a few minutes. At speeds around 45/50 and up the car eats gas worse than a chevrolet suburban because it runs around 3500rpm on up in 4th gear. Honda knows about this problem but refuse to fix it. They should fix our transmissions that are popping out at a million miles, it is their fault.\n",
"************************************************************************************************************************\n",
"I purchased a 2001 Honda Civic Ex for my daughter in November 2009. A couple weeks later, I find that the gear jumps outta 2nd gear when I'm driving around residential areas. I should have reported it to the dealership I bought it from but they would have countered with the \"as is\" clause when buying a new car. And like the other consumer complaints, the gears do feel as if they are grinding when shifting to another gear. Put me down as a litigant for a Honda Civic recall!\n",
"************************************************************************************************************************\n",
"My car either grinds into or pops out of 2nd gear. Very Annoying!! I will probably never buy a Honda again.\n",
"************************************************************************************************************************\n",
"Well I guess I'm not alone in this category! All of a sudden about a week or 2 ago I noticed it became difficult to move my stick shift. Someone else here said and I quote \" feels like I'm moving the stick shift through sand\" and that is just how I feel. In addition to the fact that it also pops out of 2nd and 4th gear. I'm not happy :( taking it in to get looked at, but can't understand for the life of me why Honda doesn't fix the problem with so many complaints filed. We will see\n",
"************************************************************************************************************************\n",
"felt my gears getting harder to shift, very difficulrt to get it into 4th gear and pops out of third...so I brought it to the mechanic \n",
"************************************************************************************************************************\n",
"I bought the 2001 Honda Civic as a cheap reliable car to help me travel 350 miles from home to school whenever the holidays or breaks would occur. Unfortunately, that is where I failed in my decision, believing Honda Civics was the SOLUTION!!! I now own a car that has already forced me out of the road once and may screw me over the coming months, I just don't know when. The transmission is seemingly destined for failure as fellow owners of this Honda all seem to have the same terrible end. What must we do for a class action suit, cause I'm so IN!!! Please do not stop communicating we need to move on this. Anyone feel similarly, please e-mail me at rocketboyssk@gmail.com.\n",
"************************************************************************************************************************\n",
"The service department told my wife we weren't the only ones with this problem and he thought they may come out with a recall.\n",
"************************************************************************************************************************\n",
"WOW! what a nightmare we are all sharing. I've dumped $6K into my 2001 Civic EX bought in Oct. '02 in the last 18 months alone. First the catalytic converter crapped out (twice!) and they found a problem with the struts. That was Sept. '03 (at the very beginning of a new job after being out for 2 yrs, thank you very much!) and THEN, just last Friday, I noticed it felt like I was pushing the shift thru sand to get it into 4th gear, so I brought it to the mechanic.Guess I'm luckier than most.... My baby didn't dump me. \n",
"************************************************************************************************************************\n",
"My husband I bought a 2001 civic in 2007. Soon after the gear would pop out of 2nd. From asking around, it seemed to be less expensive to replace the transmission with a used one then to repair it. I wish we didn't put out cash into this piece of crap and bought an automatic instead!\n",
"************************************************************************************************************************\n",
"Well I guess I get to join the Honda hate club. My tranny just went out while driving through town thank god I was at a red light when it happened otherwise who know what might have happened. I'm hopping a family friend can get this fixed but after all the post I read I think I'm screw I hope I can get some info on what will be the best to do through those that already fixed this problem.\n",
"************************************************************************************************************************\n",
"Thought Hondas' were the best thing there was for the money. Still do i guess. Just bought the car and was told that the gears were slippin. Being mechanically savvy i thought that i could take care of it. It however increased in occurrence and then now is a literal pain in the ass. I love my car, spent a lot of money on it. Now i just need it to be great again.\n",
"************************************************************************************************************************\n",
"For two years, I'd been putting up with the gearshifter popping out of 2nd gear on occasion (especially when I'm turning right and engaging the gear at the same time). Over time, it got more frequent, and I finally decided to bite the bullet and have the transmission looked at by a mechanic. Turns out the synchros were shot, and the transmission needed a rebuild (or, I could have opted to replace the transmission for another few hundred bucks). \n",
"************************************************************************************************************************\n"
]
}
],
"prompt_number": 42
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"This is basically it.&nbsp;&nbsp;When I have the time, I will make a Part 2 where I show how to insert the data we scraped into a database."
]
}
],
"metadata": {}
}
]
}
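
The multi-page handling in getReviews comes down to two small steps: reading the total page count out of the subtitle text, and building one URL per page. Below is a minimal, network-free sketch of just those two steps. It assumes the subtitle reads something like "(Page 1 of 3)", which is only inferred from how the notebook code splits that text, and the helper names parse_page_count and build_page_urls are hypothetical, not part of the notebook.

```python
import re

BASE_URL = 'http://www.carcomplaints.com'


def parse_page_count(subtitle_text):
    """Return the total page count from text like '(Page 1 of 3)'.

    Assumption: the site marks multi-page listings with 'Page X of N'.
    If no such marker is found, treat the listing as a single page.
    """
    match = re.search(r'Page\s+\d+\s+of\s+(\d+)', subtitle_text)
    return int(match.group(1)) if match else 1


def build_page_urls(make, model, year, system, subsystem, num_pages):
    """Return one URL per page, in order.

    The first page has no suffix; later pages end in -2, -3, ...
    (the same pattern the notebook code builds by string concatenation).
    """
    stem = '/'.join([BASE_URL, make, model, str(year), system, subsystem])
    urls = [stem + '.shtml']
    for page in range(2, num_pages + 1):
        urls.append('{}-{}.shtml'.format(stem, page))
    return urls


if __name__ == '__main__':
    pages = parse_page_count('(Page 1 of 3)')
    for url in build_page_urls('Honda', 'Civic', 2001,
                               'transmission', 'pops_out_of_gear', pages):
        print(url)
```

Keeping these two steps as pure functions means they can be checked without hitting the website, and a regular expression is a bit more forgiving than counting whitespace-separated tokens if the subtitle wording shifts slightly.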