Skip to content

Instantly share code, notes, and snippets.

@rmax
Last active July 27, 2016 16:32
Show Gist options
  • Save rmax/087ef3dbcc499c63f1b4a2248161423e to your computer and use it in GitHub Desktop.
Save rmax/087ef3dbcc499c63f1b4a2248161423e to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Extracting content from tables is a highly common requirement not only web scraping projects but also for casual users and data analyst who want to capture. The project `extruct` could aim to be a good-enough content extractor. It supports Json+ld and microfortmats, and we can add more common structures extractors.\n",
"\n",
"Below is shown a prototype for a `tables` extractor. The initial objective is to be better than `pandas.read_html` and based on `lxml` so it can be used along with `parsel`.\n",
"\n",
"The current prototype does a naive recognition of rows and headers, supporting multiple column headers. Further improvements could be:\n",
"\n",
"* Allow parse callbacks (i.e.: allow the user to process the `td` content).\n",
"* Support of vertical headers (if defined with `th`)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import requests\n",
"import pandas as pd\n",
"from extruct.tables import TablesExtractor\n",
"te = TablesExtractor()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# extract.tables"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th></th>\n",
" <th colspan=\"3\" halign=\"left\">Team size</th>\n",
" <th colspan=\"8\" halign=\"left\"></th>\n",
" <th colspan=\"4\" halign=\"left\">Awards</th>\n",
" <th colspan=\"2\" halign=\"left\"></th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th>Country</th>\n",
" <th>All</th>\n",
" <th>M</th>\n",
" <th>F</th>\n",
" <th>P1</th>\n",
" <th>P2</th>\n",
" <th>P3</th>\n",
" <th>P4</th>\n",
" <th>P5</th>\n",
" <th>P6</th>\n",
" <th>Total</th>\n",
" <th>Rank</th>\n",
" <th>G</th>\n",
" <th>S</th>\n",
" <th>B</th>\n",
" <th>HM</th>\n",
" <th>Leader</th>\n",
" <th>Deputy leader</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>United States of America</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td></td>\n",
" <td>42</td>\n",
" <td>32</td>\n",
" <td>31</td>\n",
" <td>42</td>\n",
" <td>25</td>\n",
" <td>13</td>\n",
" <td>185</td>\n",
" <td>1</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Po-Shen Loh</td>\n",
" <td>John Berman</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>People's Republic of China</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td></td>\n",
" <td>42</td>\n",
" <td>36</td>\n",
" <td>12</td>\n",
" <td>42</td>\n",
" <td>22</td>\n",
" <td>27</td>\n",
" <td>181</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Bin Xiong</td>\n",
" <td>Qiusheng Li</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Republic of Korea</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td></td>\n",
" <td>42</td>\n",
" <td>15</td>\n",
" <td>18</td>\n",
" <td>42</td>\n",
" <td>25</td>\n",
" <td>19</td>\n",
" <td>161</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>Yongjin Song</td>\n",
" <td>Bo-Hae Im</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Democratic People's Republic of Korea</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td></td>\n",
" <td>42</td>\n",
" <td>25</td>\n",
" <td>17</td>\n",
" <td>42</td>\n",
" <td>22</td>\n",
" <td>8</td>\n",
" <td>156</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Yong Chol Ham</td>\n",
" <td>Ryong Gol Yom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Vietnam</td>\n",
" <td>6</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>35</td>\n",
" <td>21</td>\n",
" <td>23</td>\n",
" <td>42</td>\n",
" <td>28</td>\n",
" <td>2</td>\n",
" <td>151</td>\n",
" <td>5</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>Bá Khánh Trình Lê</td>\n",
" <td>Anh Vinh Lê</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Team size \\\n",
" Country All M F P1 P2 P3 P4 P5 \n",
"0 United States of America 6 6 42 32 31 42 25 \n",
"1 People's Republic of China 6 6 42 36 12 42 22 \n",
"2 Republic of Korea 6 6 42 15 18 42 25 \n",
"3 Democratic People's Republic of Korea 6 6 42 25 17 42 22 \n",
"4 Vietnam 6 5 1 35 21 23 42 28 \n",
"\n",
" Awards \n",
" P6 Total Rank G S B HM Leader Deputy leader \n",
"0 13 185 1 5 1 0 0 Po-Shen Loh John Berman \n",
"1 27 181 2 4 2 0 0 Bin Xiong Qiusheng Li \n",
"2 19 161 3 3 1 2 0 Yongjin Song Bo-Hae Im \n",
"3 8 156 4 3 3 0 0 Yong Chol Ham Ryong Gol Yom \n",
"4 2 151 5 2 3 1 0 Bá Khánh Trình Lê Anh Vinh Lê "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r = requests.get('https://www.imo-official.org/year_country_r.aspx?year=2015')\n",
"te.extract_tables(r.text)[0].to_pandas().head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>date</th>\n",
" <th>type</th>\n",
" <th>registration</th>\n",
" <th>operator</th>\n",
" <th>fat.</th>\n",
" <th>location</th>\n",
" <th></th>\n",
" <th>pic</th>\n",
" <th>cat</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>02-JAN-2016</td>\n",
" <td>Cessna 208B Grand Caravan</td>\n",
" <td>N540ME</td>\n",
" <td>Wright Air Service</td>\n",
" <td>0</td>\n",
" <td>near Anaktuvuk Pa...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>A2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>03-JAN-2016</td>\n",
" <td>Canadair Regional Jet CRJ-702</td>\n",
" <td>N164GJ</td>\n",
" <td>GoJet Airlines, opf. United Express</td>\n",
" <td>0</td>\n",
" <td>Chicago-O'Ha...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>A2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>04-JAN-2016</td>\n",
" <td>Beechcraft 200 Super King Air</td>\n",
" <td>N275X</td>\n",
" <td>Skyway Aircraft Inc.</td>\n",
" <td>0</td>\n",
" <td>St. Petersbu...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>A2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>05-JAN-2016</td>\n",
" <td>Basler BT-67 Turbo 67 (AC-47T)</td>\n",
" <td>FAC-1658</td>\n",
" <td>Colombian AF</td>\n",
" <td>0</td>\n",
" <td>Tres Esquina...</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>A2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>08-JAN-2016</td>\n",
" <td>Canadair Regional Jet CRJ-200PF</td>\n",
" <td>SE-DUX</td>\n",
" <td>West Air Sweden</td>\n",
" <td>2</td>\n",
" <td>near Akkajaure</td>\n",
" <td></td>\n",
" <td></td>\n",
" <td>A1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" date type registration \\\n",
"0 02-JAN-2016 Cessna 208B Grand Caravan N540ME \n",
"1 03-JAN-2016 Canadair Regional Jet CRJ-702 N164GJ \n",
"2 04-JAN-2016 Beechcraft 200 Super King Air N275X \n",
"3 05-JAN-2016 Basler BT-67 Turbo 67 (AC-47T) FAC-1658 \n",
"4 08-JAN-2016 Canadair Regional Jet CRJ-200PF SE-DUX \n",
"\n",
" operator fat. location pic cat \n",
"0 Wright Air Service 0 near Anaktuvuk Pa... A2 \n",
"1 GoJet Airlines, opf. United Express 0 Chicago-O'Ha... A2 \n",
"2 Skyway Aircraft Inc. 0 St. Petersbu... A2 \n",
"3 Colombian AF 0 Tres Esquina... A2 \n",
"4 West Air Sweden 2 near Akkajaure A1 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r = requests.get('https://aviation-safety.net/database/dblist.php?Year=2016')\n",
"te.extract_tables(r.text)[0].to_pandas().head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Status:</td>\n",
" <td>Preliminary - official</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Date:</td>\n",
" <td>Saturday 2 January 2016</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Time:</td>\n",
" <td>12:05</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Type:</td>\n",
" <td>Cessna 208B Grand Caravan</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Operator:</td>\n",
" <td>Wright Air Service</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Registration:</td>\n",
" <td>N540ME</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>C/n / msn:</td>\n",
" <td>208B-0540</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>First flight:</td>\n",
" <td>1996</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Engines:</td>\n",
" <td>1 Pratt &amp; Whitney Canada PT6A-114A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Crew:</td>\n",
" <td>Fatalities: 0 / Occupants: 1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Passengers:</td>\n",
" <td>Fatalities: 0 / Occupants: 7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>Total:</td>\n",
" <td>Fatalities: 0 / Occupants: 8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Airplane damage:</td>\n",
" <td>Substantial</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>Location:</td>\n",
" <td>10 km (6.3 mls) SW of Anaktuvuk Pass Airport, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Phase:</td>\n",
" <td>En route (ENR)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>Nature:</td>\n",
" <td>Domestic Scheduled Passenger</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>Departure airport:</td>\n",
" <td>Fairbanks International Airport, AK (FAI/PAFA)...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>Destination airport:</td>\n",
" <td>Anaktuvuk Pass Airport, AK (AKP/PAKP), United ...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1\n",
"0 Status: Preliminary - official\n",
"1 Date: Saturday 2 January 2016\n",
"2 Time: 12:05\n",
"3 Type: Cessna 208B Grand Caravan\n",
"4 Operator: Wright Air Service\n",
"5 Registration: N540ME\n",
"6 C/n / msn: 208B-0540\n",
"7 First flight: 1996\n",
"8 Engines: 1 Pratt & Whitney Canada PT6A-114A\n",
"9 Crew: Fatalities: 0 / Occupants: 1\n",
"10 Passengers: Fatalities: 0 / Occupants: 7\n",
"11 Total: Fatalities: 0 / Occupants: 8\n",
"12 Airplane damage: Substantial\n",
"13 Location: 10 km (6.3 mls) SW of Anaktuvuk Pass Airport, ...\n",
"14 Phase: En route (ENR)\n",
"15 Nature: Domestic Scheduled Passenger\n",
"16 Departure airport: Fairbanks International Airport, AK (FAI/PAFA)...\n",
"17 Destination airport: Anaktuvuk Pass Airport, AK (AKP/PAKP), United ..."
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r = requests.get('https://aviation-safety.net/database/record.php?id=20160102-0')\n",
"pd.concat([t.to_pandas() for t in te.extract_tables(r.text)])"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Comparing with pandas.read_html "
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Country</th>\n",
" <th>Team size</th>\n",
" <th>P1</th>\n",
" <th>P2</th>\n",
" <th>P3</th>\n",
" <th>P4</th>\n",
" <th>P5</th>\n",
" <th>P6</th>\n",
" <th>Total</th>\n",
" <th>Rank</th>\n",
" <th>Awards</th>\n",
" <th>Leader</th>\n",
" <th>Deputy leader</th>\n",
" <th>All</th>\n",
" <th>M</th>\n",
" <th>F</th>\n",
" <th>G</th>\n",
" <th>S</th>\n",
" <th>B</th>\n",
" <th>HM</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>United States of America</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td>NaN</td>\n",
" <td>42</td>\n",
" <td>32</td>\n",
" <td>31</td>\n",
" <td>42</td>\n",
" <td>25</td>\n",
" <td>13</td>\n",
" <td>185</td>\n",
" <td>1</td>\n",
" <td>5</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Po-Shen Loh</td>\n",
" <td>John Berman</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>People's Republic of China</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td>NaN</td>\n",
" <td>42</td>\n",
" <td>36</td>\n",
" <td>12</td>\n",
" <td>42</td>\n",
" <td>22</td>\n",
" <td>27</td>\n",
" <td>181</td>\n",
" <td>2</td>\n",
" <td>4</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Bin Xiong</td>\n",
" <td>Qiusheng Li</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Republic of Korea</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td>NaN</td>\n",
" <td>42</td>\n",
" <td>15</td>\n",
" <td>18</td>\n",
" <td>42</td>\n",
" <td>25</td>\n",
" <td>19</td>\n",
" <td>161</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>Yongjin Song</td>\n",
" <td>Bo-Hae Im</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Democratic People's Republic of Korea</td>\n",
" <td>6</td>\n",
" <td>6</td>\n",
" <td>NaN</td>\n",
" <td>42</td>\n",
" <td>25</td>\n",
" <td>17</td>\n",
" <td>42</td>\n",
" <td>22</td>\n",
" <td>8</td>\n",
" <td>156</td>\n",
" <td>4</td>\n",
" <td>3</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>Yong Chol Ham</td>\n",
" <td>Ryong Gol Yom</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Vietnam</td>\n",
" <td>6</td>\n",
" <td>5</td>\n",
" <td>1.0</td>\n",
" <td>35</td>\n",
" <td>21</td>\n",
" <td>23</td>\n",
" <td>42</td>\n",
" <td>28</td>\n",
" <td>2</td>\n",
" <td>151</td>\n",
" <td>5</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>Bá Khánh Trình Lê</td>\n",
" <td>Anh Vinh Lê</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Country Team size P1 P2 P3 P4 P5 P6 \\\n",
"0 United States of America 6 6 NaN 42 32 31 42 \n",
"1 People's Republic of China 6 6 NaN 42 36 12 42 \n",
"2 Republic of Korea 6 6 NaN 42 15 18 42 \n",
"3 Democratic People's Republic of Korea 6 6 NaN 42 25 17 42 \n",
"4 Vietnam 6 5 1.0 35 21 23 42 \n",
"\n",
" Total Rank Awards Leader Deputy leader All M F G \\\n",
"0 25 13 185 1 5 1 0 0 Po-Shen Loh \n",
"1 22 27 181 2 4 2 0 0 Bin Xiong \n",
"2 25 19 161 3 3 1 2 0 Yongjin Song \n",
"3 22 8 156 4 3 3 0 0 Yong Chol Ham \n",
"4 28 2 151 5 2 3 1 0 Bá Khánh Trình Lê \n",
"\n",
" S B HM \n",
"0 John Berman NaN NaN \n",
"1 Qiusheng Li NaN NaN \n",
"2 Bo-Hae Im NaN NaN \n",
"3 Ryong Gol Yom NaN NaN \n",
"4 Anh Vinh Lê NaN NaN "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_html('https://www.imo-official.org/year_country_r.aspx?year=2015', tupleize_cols=True)[0].head()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" <th>5</th>\n",
" <th>6</th>\n",
" <th>7</th>\n",
" <th>8</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>date</td>\n",
" <td>type</td>\n",
" <td>registration</td>\n",
" <td>operator</td>\n",
" <td>fat.</td>\n",
" <td>location</td>\n",
" <td>NaN</td>\n",
" <td>pic</td>\n",
" <td>cat</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>02-JAN-2016</td>\n",
" <td>Cessna 208B Grand Caravan</td>\n",
" <td>N540ME</td>\n",
" <td>Wright Air Service</td>\n",
" <td>0</td>\n",
" <td>near Anaktuvuk Pa...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>A2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>03-JAN-2016</td>\n",
" <td>Canadair Regional Jet CRJ-702</td>\n",
" <td>N164GJ</td>\n",
" <td>GoJet Airlines, opf. United Express</td>\n",
" <td>0</td>\n",
" <td>Chicago-O'Ha...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>A2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>04-JAN-2016</td>\n",
" <td>Beechcraft 200 Super King Air</td>\n",
" <td>N275X</td>\n",
" <td>Skyway Aircraft Inc.</td>\n",
" <td>0</td>\n",
" <td>St. Petersbu...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>A2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>05-JAN-2016</td>\n",
" <td>Basler BT-67 Turbo 67 (AC-47T)</td>\n",
" <td>FAC-1658</td>\n",
" <td>Colombian AF</td>\n",
" <td>0</td>\n",
" <td>Tres Esquina...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>A2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 \\\n",
"0 date type registration \n",
"1 02-JAN-2016 Cessna 208B Grand Caravan N540ME \n",
"2 03-JAN-2016 Canadair Regional Jet CRJ-702 N164GJ \n",
"3 04-JAN-2016 Beechcraft 200 Super King Air N275X \n",
"4 05-JAN-2016 Basler BT-67 Turbo 67 (AC-47T) FAC-1658 \n",
"\n",
" 3 4 5 6 7 \\\n",
"0 operator fat. location NaN pic \n",
"1 Wright Air Service 0 near Anaktuvuk Pa... NaN NaN \n",
"2 GoJet Airlines, opf. United Express 0 Chicago-O'Ha... NaN NaN \n",
"3 Skyway Aircraft Inc. 0 St. Petersbu... NaN NaN \n",
"4 Colombian AF 0 Tres Esquina... NaN NaN \n",
"\n",
" 8 \n",
"0 cat \n",
"1 A2 \n",
"2 A2 \n",
"3 A2 \n",
"4 A2 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.read_html('https://aviation-safety.net/database/dblist.php?Year=2016')[0].head()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Status:</td>\n",
" <td>Preliminary - official</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Date:</td>\n",
" <td>Saturday 2 January 2016</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Time:</td>\n",
" <td>12:05</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Type:</td>\n",
" <td>Cessna 208B Grand Caravan</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Operator:</td>\n",
" <td>Wright Air Service</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Registration:</td>\n",
" <td>N540ME</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>C/n / msn:</td>\n",
" <td>208B-0540</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>First flight:</td>\n",
" <td>1996</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Engines:</td>\n",
" <td>1 Pratt &amp; Whitney Canada PT6A-114A</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Crew:</td>\n",
" <td>Fatalities: 0 / Occupants: 1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Passengers:</td>\n",
" <td>Fatalities: 0 / Occupants: 7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>Total:</td>\n",
" <td>Fatalities: 0 / Occupants: 8</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Airplane damage:</td>\n",
" <td>Substantial</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>Location:</td>\n",
" <td>10 km (6.3 mls) SW of Anaktuvuk Pass Airport, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>14</th>\n",
" <td>Phase:</td>\n",
" <td>En route (ENR)</td>\n",
" </tr>\n",
" <tr>\n",
" <th>15</th>\n",
" <td>Nature:</td>\n",
" <td>Domestic Scheduled Passenger</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>Departure airport:</td>\n",
" <td>Fairbanks International Airport, AK (FAI/PAFA)...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>17</th>\n",
" <td>Destination airport:</td>\n",
" <td>Anaktuvuk Pass Airport, AK (AKP/PAKP), United ...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1\n",
"0 Status: Preliminary - official\n",
"1 Date: Saturday 2 January 2016\n",
"2 Time: 12:05\n",
"3 Type: Cessna 208B Grand Caravan\n",
"4 Operator: Wright Air Service\n",
"5 Registration: N540ME\n",
"6 C/n / msn: 208B-0540\n",
"7 First flight: 1996\n",
"8 Engines: 1 Pratt & Whitney Canada PT6A-114A\n",
"9 Crew: Fatalities: 0 / Occupants: 1\n",
"10 Passengers: Fatalities: 0 / Occupants: 7\n",
"11 Total: Fatalities: 0 / Occupants: 8\n",
"12 Airplane damage: Substantial\n",
"13 Location: 10 km (6.3 mls) SW of Anaktuvuk Pass Airport, ...\n",
"14 Phase: En route (ENR)\n",
"15 Nature: Domestic Scheduled Passenger\n",
"16 Departure airport: Fairbanks International Airport, AK (FAI/PAFA)...\n",
"17 Destination airport: Anaktuvuk Pass Airport, AK (AKP/PAKP), United ..."
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.concat(pd.read_html('https://aviation-safety.net/database/record.php?id=20160102-0'))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "DS (py3)",
"language": "python",
"name": "python3-ds"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment