#Parsing API xml data using BeautifulSoup
I had a difficult time extracting data from an xml object retrieved using the requests library. Simply, I had done something like:
[In] 1: import requests
[In] 2: socket = requests.get('https://the?xml?shitting&api?url')
Most of the documentation I found pointed me towards the 'xml' package and ElementTree. After several attempts with no success at all, I was able to get the desired result by using BeautifulSoup, specifically as follows:
##Step 1:
Load the data into a bs4 object, as suggested in the documentation where the parsing libraries are described.
[In] 10: soup = bs4.BeautifulSoup(socket.content, ['lxml', 'xml'])
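For a self-contained illustration of this step, here is a minimal sketch that parses a small inline XML string instead of a live API response. The sample document and its tag names (`records`, `record`, `name`, `value`) are made up for illustration, and the sketch assumes lxml is installed, since BeautifulSoup's `"xml"` feature relies on it.

```python
import bs4

# Hypothetical sample payload standing in for socket.content.
sample_xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<records>
  <record><name>alpha</name><value>1</value></record>
  <record><name>beta</name><value>2</value></record>
</records>"""

# "xml" selects lxml's XML parser (requires the lxml package).
soup = bs4.BeautifulSoup(sample_xml, "xml")

# Quick sanity checks that the tree was parsed as expected.
root = soup.find("records")
record_names = [tag.text for tag in soup.find_all("name")]
print(root.name)
print(record_names)
```

If the root tag or the names come back empty, that is usually a sign the payload was parsed with an HTML parser or is not the XML you expected.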
##Step 2:
Take a look at the mess of xml data to ascertain its structure.
[In] 11: f = open('pretty_xml.xml', 'w')
[In] 12: f.write(soup.prettify())
[In] 13: f.close()
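The same dump can be written with a context manager, so the file is closed even if an error occurs. This is a sketch under assumptions: the sample XML is invented, lxml is assumed to be installed, and the file goes to a temporary directory rather than the working directory.

```python
import os
import tempfile

import bs4

# Hypothetical sample XML standing in for the API response.
sample_xml = "<records><record><name>alpha</name></record></records>"
soup = bs4.BeautifulSoup(sample_xml, "xml")

out_path = os.path.join(tempfile.mkdtemp(), "pretty_xml.xml")

# prettify() returns a single string, indented according to nesting depth.
with open(out_path, "w", encoding="utf-8") as f:
    f.write(soup.prettify())

# Read the file back to inspect the structure.
with open(out_path, encoding="utf-8") as f:
    pretty = f.read()
print(pretty)
```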
##Step 3:
Begin the extraction.
Once you have looked at the data more carefully, use the findChildren() method of the bs4 object to extract each node into an element of a list. For example:
[In] 14: soup_list = soup.findChildren()
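One detail worth checking against your own data: called with no arguments, `findChildren()` walks the whole tree and returns every descendant tag, not only the top-level records. The sketch below, on made-up sample XML and assuming lxml is installed, contrasts that with `recursive=False`, which returns direct children only.

```python
import bs4

# Hypothetical sample XML; the tag names are invented for illustration.
sample_xml = """<records>
  <record><name>alpha</name><value>1</value></record>
  <record><name>beta</name><value>2</value></record>
</records>"""
soup = bs4.BeautifulSoup(sample_xml, "xml")

# Default (recursive=True): every descendant tag, at any depth.
all_tags = [tag.name for tag in soup.findChildren()]

# recursive=False: only the direct children of the root element.
root = soup.find("records")
records = root.findChildren(recursive=False)

print(all_tags)
print([tag.name for tag in records])
```

Which of the two you want depends on how deeply the records are nested in your response.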
Then (assuming the tags are the same for each of the list elements, or mostly similar as in my case), create an (index, values) pairing using the text and name attributes of the bs4 Tag elements, doing something like this:
[In] 15: d = {}
[In] 16: for i, element in enumerate(soup_list):
    ...:     index = [child.name for child in element.findChildren()]
    ...:     vals = [child.text for child in element.findChildren()]
    ...:     d[i] = pandas.Series(vals, index=index)
Then you're all set up to make a pandas.DataFrame by passing the dict of Series.
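Putting the steps together, the last step might look like the following sketch. The sample XML and its tag names are invented, lxml and pandas are assumed to be installed, and iterating over the root's direct children (rather than every descendant) is an assumption about what one row per record should mean.

```python
import bs4
import pandas

# Hypothetical sample XML standing in for the API response.
sample_xml = """<records>
  <record><name>alpha</name><value>1</value></record>
  <record><name>beta</name><value>2</value></record>
</records>"""
soup = bs4.BeautifulSoup(sample_xml, "xml")

# One Series per record: child tag names become the index, child text the values.
d = {}
for i, element in enumerate(soup.find("records").findChildren(recursive=False)):
    children = element.findChildren(recursive=False)
    d[i] = pandas.Series([c.text for c in children],
                         index=[c.name for c in children])

# Each Series becomes a column, so transpose to get one row per record.
df = pandas.DataFrame(d).T
print(df)
```

The transpose is optional: without it, each record is a column and each tag name a row. All values are strings at this point, so numeric columns would still need converting.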