@wassname
Last active October 25, 2023 08:28
Scrape a table from wikipedia using python. Allows for cells spanning multiple rows and/or columns. Outputs csv files for each table
# -*- coding: utf-8 -*-
"""
Scrape a table from wikipedia using python. Allows for cells spanning multiple rows and/or columns. Outputs csv files for each table

url: https://gist.github.com/wassname/5b10774dfcd61cdd3f28
authors: panford, wassname, muzzled, Yossi
license: MIT
"""
from bs4 import BeautifulSoup
import requests
import os
import codecs

wiki = "https://en.wikipedia.org/wiki/International_Phonetic_Alphabet_chart_for_English_dialects"
header = {
    'User-Agent': 'Mozilla/5.0'
}  # Needed to prevent 403 error on Wikipedia
page = requests.get(wiki, headers=header)
soup = BeautifulSoup(page.content)

tables = soup.findAll("table", {"class": "wikitable"})

# show tables
for i, table in enumerate(tables):
    print("#" * 10 + "Table {}".format(i) + '#' * 10)
    print(table.text[:100])
    print('.' * 80)
    print("#" * 80)

for tn, table in enumerate(tables):

    # preinit list of lists
    rows = table.findAll("tr")
    row_lengths = [len(r.findAll(['th', 'td'])) for r in rows]
    ncols = max(row_lengths)
    nrows = len(rows)
    data = []
    for i in range(nrows):
        rowD = []
        for j in range(ncols):
            rowD.append('')
        data.append(rowD)

    # process html
    for i in range(len(rows)):
        row = rows[i]
        rowD = []
        cells = row.findAll(["td", "th"])
        for j in range(len(cells)):
            cell = cells[j]

            # lots of cells span cols and rows so let's deal with that
            cspan = int(cell.get('colspan', 1))
            rspan = int(cell.get('rowspan', 1))
            l = 0
            for k in range(rspan):
                # Shifts to the first empty cell of this row
                while data[i + k][j + l]:
                    l += 1
                for m in range(cspan):
                    cell_n = j + l + m
                    row_n = i + k
                    # in some cases the colspan can overflow the table, in those cases just get the last item
                    cell_n = min(cell_n, len(data[row_n]) - 1)
                    data[row_n][cell_n] += cell.text
                    print(cell.text)
        data.append(rowD)

    # write data out to tab separated format
    page = os.path.split(wiki)[1]
    fname = 'output_{}_t{}.tsv'.format(page, tn)
    f = codecs.open(fname, 'w')
    for i in range(nrows):
        rowStr = '\t'.join(data[i])
        rowStr = rowStr.replace('\n', '')
        print(rowStr)
        f.write(rowStr + '\n')
    f.close()
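
For reference (not part of the gist): the TSV files it writes can be read back with, e.g., pandas. The filename below just follows the output_{page}_t{n}.tsv pattern used above for the example Wikipedia URL.

    import pandas as pd

    # read one of the generated files back into a DataFrame (tab-separated, no header row)
    df = pd.read_csv(
        'output_International_Phonetic_Alphabet_chart_for_English_dialects_t0.tsv',
        sep='\t', header=None)
    print(df.head())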

gkcng commented Mar 24, 2018

Hi, great work! However, this doesn't work on some tables with multiple row spans in different columns; the cell insertion goes into the wrong places, e.g. https://www.theiphonewiki.com/wiki/Models

This will fix it, at least for the tables within the above page. Instead of:

            for k in range(rspan):
                for l in range(cspan):
                    data[i+k][j+l]+=cell.text

Do not append to already filled cells:

            l = 0
            for k in range(rspan):
                # Shifts to the first empty cell of this row
                while data[i+k][j+l]:
                    l+=1
                for m in range(cspan):
                    data[i+k][j+l+m]+=cell.text
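
For illustration, here is a minimal self-contained sketch (not part of the gist or the snippet above) showing why the shift matters: column A spans two rows, so the only parsed cell of row 2 must land in column 1, not column 0.

    from bs4 import BeautifulSoup

    html = """<table>
    <tr><td rowspan="2">A</td><td>B1</td></tr>
    <tr><td>B2</td></tr>
    </table>"""
    rows = BeautifulSoup(html, "html.parser").findAll("tr")
    data = [['', ''] for _ in rows]      # 2x2 grid, hard-coded for this tiny example
    for i, row in enumerate(rows):
        for j, cell in enumerate(row.findAll(["td", "th"])):
            cspan = int(cell.get('colspan', 1))
            rspan = int(cell.get('rowspan', 1))
            l = 0
            for k in range(rspan):
                while data[i + k][j + l]:    # skip cells already filled by an earlier span
                    l += 1
                for m in range(cspan):
                    data[i + k][j + l + m] += cell.text
    print(data)   # [['A', 'B1'], ['A', 'B2']]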


panford commented Jun 12, 2019

Great code! It was of great help to me. However, I was scraping a wiki table whose figures already contained commas (e.g. 8,133,000), and after writing they were split at each comma and displaced into separate cells, so I changed the delimiter from ',' to '\t' in the last for loop. It looks like this: rowStr = '\t'.join(data[i])
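
To illustrate the problem (a standalone example, not from the original comment): a comma-joined row tears a figure like 8,133,000 into several fields when the file is read back, while a tab-joined row keeps it whole.

    row = ['Aircraft', '8,133,000']
    print(','.join(row).split(','))      # ['Aircraft', '8', '133', '000']  - figure torn apart
    print('\t'.join(row).split('\t'))    # ['Aircraft', '8,133,000']        - figure kept intact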

wassname (Author) commented:

Oh, people are using this, that's great. Thanks for the improvements, panford and muzzled; I tested and incorporated them into the gist above and they help a lot.


Yossi commented Sep 10, 2019

lines 29 and 30 can be replaced with:
for tn, table in enumerate(tables):

wassname (Author) commented Sep 11, 2019

Thanks, I added that change.


dsvrsec commented Sep 13, 2019

When I try to use the code on the link "https://en.wikipedia.org/wiki/Loganair", I get the error below. Can you please help me solve this?

Traceback (most recent call last):
  File "", line 65, in <module>
    while data[i + k][j + l]:
IndexError: list index out of range

wassname (Author) commented:

Worked for me, and I notice your line number doesn't correspond to the latest version. Perhaps try the latest code?
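
If anyone else hits this IndexError: one possible cause (an assumption; the Loganair page worked here with the latest code) is a row where the skip-ahead walks past the preallocated ncols columns. A defensive variant of the same loop, written against the gist's existing variables, would bound the shift and grow the row instead of indexing out of range:

    l = 0
    for k in range(rspan):
        # skip filled cells, but stop at the end of the preallocated row
        while j + l < len(data[i + k]) and data[i + k][j + l]:
            l += 1
        for m in range(cspan):
            col = j + l + m
            # grow the row rather than indexing past it
            if col >= len(data[i + k]):
                data[i + k].extend([''] * (col - len(data[i + k]) + 1))
            data[i + k][col] += cell.text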


dsvrsec commented Sep 13, 2019

I am trying to extract infobox information, so I used class="infobox". For some Wikipedia pages it works, but for this type of company infobox it throws an error.

funkycoder commented:

Lines #37 to #42 could be replaced by: data = [[''] * ncols for i in range(nrows)]
#45: for i in range(nrows):
#47, #68: all instances of rowD should be removed
#73: f = codecs.open(fname, mode='w', encoding='utf-8') <- to deal with non-English characters
Thank you very much for this great gist!!!
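
Putting those suggestions together, the tail of the table loop might look like this (a sketch of the proposed changes against the gist's existing variables, not the version currently in the gist above):

    data = [[''] * ncols for _ in range(nrows)]          # replaces the nested preinit loops
    # ... fill `data` exactly as before, with the unused rowD lines removed ...
    f = codecs.open(fname, mode='w', encoding='utf-8')   # handles non-English characters
    for row in data:
        f.write('\t'.join(row).replace('\n', '') + '\n')
    f.close()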
