@benjaminmgross
Created June 21, 2014 07:29
Parsing `xml` data with [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4) instead of [xml](https://docs.python.org/2/library/xml.html)

#Parsing API xml data using BeautifulSoup

I had a difficult time extracting data from an xml object retrieved using requests. Simply put, I had done something like:

[In] 1: import requests
[In] 2: socket = requests.get('https://the?xml?shitting&api?url')

Most of the documentation I found pointed me towards the standard library's xml package and ElementTree. After several attempts with no success at all, I got the desired result by using BeautifulSoup instead. Specifically:

##Step 1:

Load the data into a bs4 object, as the BeautifulSoup documentation suggests in the section describing the parser libraries:

[In] 9: import bs4
[In] 10: soup = bs4.BeautifulSoup(socket.content, ['lxml', 'xml'])
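
To make that concrete, here's a minimal sketch that parses a small made-up document instead of a live API response. SAMPLE_XML and its tag names (records, record, ticker, price) are hypothetical stand-ins for socket.content, and the "xml" feature assumes lxml is installed alongside bs4.

```python
import bs4

# Hypothetical stand-in for socket.content (the raw bytes of an API response).
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<records>
  <record><ticker>AAA</ticker><price>10.5</price></record>
  <record><ticker>BBB</ticker><price>20.25</price></record>
</records>"""

# "xml" selects lxml's XML parser, so lxml must be installed for this to work.
soup = bs4.BeautifulSoup(SAMPLE_XML, "xml")

# Quick sanity checks that the document parsed into a navigable tree.
first_ticker = soup.find("ticker").text
record_count = len(soup.find_all("record"))
print(first_ticker, record_count)
```

If the parse went wrong (for instance because the XML parser isn't available), these lookups are usually the first place it shows up.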

##Step 2:

Take a look at the mess of xml data to ascertain its structure.

[In] 11: f = open('pretty_xml.xml', 'w')
[In] 12: f.write(soup.prettify())
[In] 13: f.close()
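
If the file is too big to eyeball, counting how often each tag name occurs is another way to get a feel for the structure. This is just a sketch on a made-up document: SAMPLE_XML and its tag names are hypothetical, and the "xml" feature again assumes lxml is installed.

```python
import collections

import bs4

# Hypothetical document standing in for the real API response.
SAMPLE_XML = """<records>
  <record><ticker>AAA</ticker><price>10.5</price></record>
  <record><ticker>BBB</ticker><price>20.25</price></record>
</records>"""

soup = bs4.BeautifulSoup(SAMPLE_XML, "xml")

# find_all(True) matches every tag in the document, at any depth.
tag_counts = collections.Counter(tag.name for tag in soup.find_all(True))

# Print the tag names from most to least frequent.
for name, count in tag_counts.most_common():
    print(name, count)
```

Tags that show up many times with the same count (here ticker and price) are usually the fields of a repeated record.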

##Step 3:

Begin the extraction

Once you've looked at the data more carefully, use the findChildren() method of the bs4 object to collect every tag in the document into a list (the search is recursive by default, so nested tags are included too). For example:

[In] 14: soup_list = soup.findChildren()
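
Because the search is recursive, the list contains the root, each record, and each field inside each record. Here's a sketch of the difference between that and asking for direct children only, using find_all (the current bs4 name for the same method as findChildren). The document and its tag names are made up, and the "xml" feature assumes lxml is installed.

```python
import bs4

# Hypothetical document standing in for the real API response.
SAMPLE_XML = """<records>
  <record><ticker>AAA</ticker><price>10.5</price></record>
  <record><ticker>BBB</ticker><price>20.25</price></record>
</records>"""

soup = bs4.BeautifulSoup(SAMPLE_XML, "xml")

# Recursive (the default): every tag at every depth, in document order.
all_tags = soup.find_all(True)
all_names = [tag.name for tag in all_tags]

# Direct children only: just the <record> tags directly under <records>.
root = soup.find("records")
records = root.find_all(True, recursive=False)
record_names = [tag.name for tag in records]

print(all_names)
print(record_names)
```

Which of the two you want depends on the shape of the data; for a flat list of repeated records, the non-recursive version avoids picking up the root and the leaf fields as list elements.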

Then (assuming the tags are the same for each of the list elements, or mostly similar as in my case), create an (index, values) pairing from the name and text attributes of each bs4 Tag element, doing something like this:

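[In] 15: import pandas
[In] 16: d = {}
[In] 17: for i, element in enumerate(soup_list):
    ...:     children = element.findChildren()  # all tags nested under this element
    ...:     index = [child.name for child in children]  # tag names become the labels
    ...:     vals = [child.text for child in children]  # tag text becomes the values
    ...:     d[i] = pandas.Series(vals, index=index)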

You're then all set to make a pandas.DataFrame by passing it the dict of Series; each Series becomes a column, keyed by its position in the list.
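
Putting the steps together, here's a minimal end-to-end sketch on a made-up document. It's not exactly what I ran: it assumes the repeated element is called record (a hypothetical name, like ticker and price), loops over only those elements rather than every tag, and assumes lxml is installed for the "xml" feature.

```python
import bs4
import pandas

# Hypothetical document standing in for the real API response.
SAMPLE_XML = """<records>
  <record><ticker>AAA</ticker><price>10.5</price></record>
  <record><ticker>BBB</ticker><price>20.25</price></record>
</records>"""

soup = bs4.BeautifulSoup(SAMPLE_XML, "xml")

d = {}
for i, record in enumerate(soup.find_all("record")):
    # Direct children of each record are its fields.
    fields = record.find_all(True, recursive=False)
    d[i] = pandas.Series([field.text for field in fields],
                         index=[field.name for field in fields])

# Each Series becomes a column, so transpose to get one row per record.
df = pandas.DataFrame(d).T

# Everything parsed out of xml is text; convert numeric columns explicitly.
df["price"] = df["price"].astype(float)
print(df)
```

Restricting the loop to the repeated element keeps every Series the same shape, which is what makes the resulting DataFrame line up cleanly.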
