@jseabold
Last active June 14, 2018 03:02
Read an HTML table using pandas
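# Fallback for when pd.read_html fails to find a table on its own:
# select the table with an XPath query via lxml, then parse only that fragment.
# With bs4 >= 4.2.1 the lxml step is unnecessary because read_html scrapes
# the tables automatically; bs4 4.2.0 won't work.
from io import StringIO

import pandas as pd
from lxml import html

url = "http://www.uesp.net/wiki/Skyrim:No_Stone_Unturned"
xpath = '//*[@id="mw-content-text"]/table[3]'

tree = html.parse(url)
table = tree.xpath(xpath)[0]

# Serialize to text (not bytes) and wrap it in StringIO, since newer pandas
# versions deprecate passing a literal HTML string to read_html.
raw_html = html.tostring(table, encoding="unicode")
dta = pd.read_html(StringIO(raw_html), header=0)[0]
dta["completed"] = 0
del dta["Map"]

# Resolve relative hrefs against the page URL, then collect each data row's
# map link (second cell, first child element), skipping the header row.
table.make_links_absolute()
dta["map_link"] = [row[1][0].get("href") for row in table[1:]]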
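
The same approach can be tried offline. The sketch below is not part of the original gist: the inline page, the `content` id, the column names, the `example.com` base URL, and the helper name `table_with_links` are all made up for illustration. It assumes pandas and lxml are installed and that each data row holds its link as the first child of the second cell, as in the snippet above.

```python
from io import StringIO

import pandas as pd
from lxml import html

# Hypothetical stand-in page; ids, headers, and hrefs are invented.
PAGE = """
<html><body><div id="content">
  <table><tr><th>Ignore</th></tr><tr><td>x</td></tr></table>
  <table>
    <tr><th>Location</th><th>Map</th></tr>
    <tr><td>Whiterun</td><td><a href="/maps/whiterun">map</a></td></tr>
    <tr><td>Riften</td><td><a href="/maps/riften">map</a></td></tr>
  </table>
</div></body></html>
"""


def table_with_links(page, xpath, base_url):
    """Parse one XPath-selected table and attach each row's absolute link."""
    root = html.fromstring(page)
    table = root.xpath(xpath)[0]

    # The page comes from a string, so the base URL must be given explicitly.
    table.make_links_absolute(base_url)

    # Hand pandas only the selected table, as text wrapped in a file-like object.
    raw = html.tostring(table, encoding="unicode")
    frame = pd.read_html(StringIO(raw), header=0)[0]

    # Skip the header row; take the href of the first element in the second cell.
    frame["map_link"] = [row[1][0].get("href") for row in table[1:]]
    del frame["Map"]
    return frame


if __name__ == "__main__":
    print(table_with_links(PAGE, '//*[@id="content"]/table[2]', "http://example.com/"))
```

Passing `base_url` explicitly is the main difference from the gist, where `html.parse(url)` already records the page URL for `make_links_absolute()` to use.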

cpcloud commented Jul 14, 2013

I changed the column index name.

jseabold (author) commented

Yep, works fine with bs4 4.2.1.
