pandas is also super handy for scraping HTML tables (<table> tags). ref: https://qiita.com/kitsuyui/items/4906bb457af4d0e2d0a5
<table width="99%" border="1" cellpadding="5" cellspacing="0" class="datatable">
<caption>
所得税の速算表
</caption>
<tr>
<th width="62%" scope="col"> 課税される所得金額</th>
<th width="13%" scope="col"> 税率</th>
<th width="25%" scope="col"> 控除額</th>
</tr>
<tr>
<th align="left" scope="row"> 195万円以下</th>
<td align="center"> 5%</td>
<td align="right"> 0円</td>
</tr>
<tr>
<th align="left" scope="row"> 195万円を超え 330万円以下</th>
<td align="center"> 10%</td>
<td align="right"> 97,500円</td>
</tr>
<tr>
<th align="left" scope="row"> 330万円を超え 695万円以下</th>
<td align="center">20%</td>
<td align="right">427,500円</td>
</tr>
<tr>
<th align="left" scope="row"> 695万円を超え 900万円以下</th>
<td align="center">23%</td>
<td align="right">636,000円</td>
</tr>
<tr>
<th align="left" scope="row"> 900万円を超え 1,800万円以下</th>
<td align="center"> 33%</td>
<td align="right"> 1,536,000円</td>
</tr>
<tr>
<th align="left" scope="row"> 1,800万円を超え4,000万円以下</th>
<td align="center"> 40%</td>
<td align="right"> 2,796,000円</td>
</tr>
<tr>
<th align="left" scope="row"> 4,000万円超</th>
<td align="center"> 45%</td>
<td align="right"> 4,796,000円</td>
</tr>
</table>
$ pip install pandas lxml html5lib BeautifulSoup4
$ python3
>>> import pandas
>>> url = 'https://www.nta.go.jp/taxanswer/shotoku/2260.htm'
>>> # read_html returns a list with one DataFrame per <table> found on the page
>>> fetched_dataframes = pandas.read_html(url)
>>> fetched_dataframes[0]
   0                        1     2
0  課税される所得金額              税率    控除額
1  195万円以下                  5%    0円
2  195万円を超え 330万円以下      10%   97,500円
3  330万円を超え 695万円以下      20%   427,500円
4  695万円を超え 900万円以下      23%   636,000円
5  900万円を超え 1,800万円以下    33%   1,536,000円
6  1,800万円超                  40%   2,796,000円
>>> fetched_dataframes[1]
   0                          1     2
0  課税される所得金額                税率    控除額
1  195万円以下                    5%    0円
2  195万円を超え 330万円以下        10%   97,500円
3  330万円を超え 695万円以下        20%   427,500円
4  695万円を超え 900万円以下        23%   636,000円
5  900万円を超え 1,800万円以下      33%   1,536,000円
6  1,800万円を超え 4,000万円以下    40%   2,796,000円
7  4,000万円超                    45%   4,796,000円
>>> fetched_dataframes[2]
   0                        1     2
0  課税される所得金額              税率    控除額
1  330万円以下                  10%   0円
2  330万円を超え 900万円以下      20%   330,000円
3  900万円を超え 1,800万円以下    30%   1,230,000円
4  1,800万円超                  37%   2,490,000円
>>>
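The frames in this session still hold the header text as row 0 and keep the amounts as strings such as `97,500円`. Below is a minimal sketch of one way to tidy that up. It assumes a hand-built frame `raw` standing in for the first rows of `fetched_dataframes[1]`, and the helper name `tidy` is hypothetical, not part of the original gist.

```python
import pandas as pd

# Stand-in for fetched_dataframes[1] (first rows only), built by hand so the
# sketch runs offline. The default column labels 0, 1, 2 mirror the session.
raw = pd.DataFrame([
    ["課税される所得金額", "税率", "控除額"],
    ["195万円以下", "5%", "0円"],
    ["195万円を超え 330万円以下", "10%", "97,500円"],
    ["330万円を超え 695万円以下", "20%", "427,500円"],
])


def tidy(df):
    """Hypothetical helper: promote row 0 to the header and parse the numbers."""
    out = df.iloc[1:].reset_index(drop=True)
    out.columns = df.iloc[0].tolist()
    # "10%" -> 10 (rate kept as an integer percentage)
    out["税率"] = out["税率"].str.rstrip("%").astype(int)
    # "97,500円" -> 97500 (deduction in yen as an integer)
    out["控除額"] = (
        out["控除額"].str.replace(",", "", regex=False).str.rstrip("円").astype(int)
    )
    return out


table = tidy(raw)
print(table)
```

With numeric columns in place, the exported CSV no longer needs quoting around the amounts, which may make it easier to load elsewhere.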
>>> fetched_dataframes[0].to_csv('heisei19to26.csv')
>>> fetched_dataframes[1].to_csv('heisei27to.csv')
>>> fetched_dataframes[2].to_csv('heisei11to18.csv')
,0,1,2
0,課税される所得金額,税率,控除額
1,195万円以下,5%,0円
2,195万円を超え 330万円以下,10%,"97,500円"
3,330万円を超え 695万円以下,20%,"427,500円"
4,695万円を超え 900万円以下,23%,"636,000円"
5,"900万円を超え 1,800万円以下",33%,"1,536,000円"
6,"1,800万円を超え 4,000万円以下",40%,"2,796,000円"
7,"4,000万円超",45%,"4,796,000円"
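As a rough illustration of putting the exported file to use, the sketch below reads CSV text shaped like `heisei27to.csv` with the standard `csv` module and applies the quick-calculation rule (tax = taxable income × rate − deduction). The inline `CSV_TEXT`, the `BRACKET_LIMITS` list, and the helper names `load_brackets` and `income_tax` are assumptions added for this example. The upper limits are typed in by hand from the first column rather than parsed, and the rounding rules of the real tax calculation are ignored.

```python
import csv
import io

# Inline copy of the exported CSV so the sketch runs without the file on disk.
CSV_TEXT = ''',0,1,2
0,課税される所得金額,税率,控除額
1,195万円以下,5%,0円
2,195万円を超え 330万円以下,10%,"97,500円"
3,330万円を超え 695万円以下,20%,"427,500円"
4,695万円を超え 900万円以下,23%,"636,000円"
5,"900万円を超え 1,800万円以下",33%,"1,536,000円"
6,"1,800万円を超え 4,000万円以下",40%,"2,796,000円"
7,"4,000万円超",45%,"4,796,000円"
'''

# Upper bound of each bracket in yen, in row order (assumption: entered by
# hand from the first column). None means the bracket has no upper bound.
BRACKET_LIMITS = [1_950_000, 3_300_000, 6_950_000, 9_000_000,
                  18_000_000, 40_000_000, None]


def load_brackets(text):
    """Return a list of (upper_limit, rate_percent, deduction_yen) tuples."""
    # Skip the pandas column-index line and the row holding the header labels.
    rows = list(csv.reader(io.StringIO(text)))[2:]
    brackets = []
    for limit, (_index, _label, rate, deduction) in zip(BRACKET_LIMITS, rows):
        rate_percent = int(rate.rstrip("%"))
        deduction_yen = int(deduction.replace(",", "").rstrip("円"))
        brackets.append((limit, rate_percent, deduction_yen))
    return brackets


def income_tax(taxable_income, brackets):
    """Quick-calculation rule: income x rate - deduction for the matching bracket."""
    for limit, rate_percent, deduction_yen in brackets:
        if limit is None or taxable_income <= limit:
            return taxable_income * rate_percent // 100 - deduction_yen
    raise ValueError("no bracket matched")


brackets = load_brackets(CSV_TEXT)
print(income_tax(7_000_000, brackets))
```

For a taxable income of 7,000,000 yen the matching row is the 23% bracket, so the result is 7,000,000 × 0.23 − 636,000 = 974,000 yen.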