benjaminmgross/gist:1b38348239d8c0dff149

## gistfile1.md

      
#Parsing API XML data using BeautifulSoup
I had a difficult time extracting data from an XML object retrieved using the requests library.  Simply, I had done something like:
[In] 1: import requests
[In] 2: socket = requests.get('https://the?xml?shitting&api?url')
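
Before parsing, it can help to confirm that the request actually succeeded, since an HTTP error page parses into a very confusing tree. A minimal sketch, assuming the requests library; the helper names response_to_xml_bytes and fetch_xml and the timeout value are hypothetical and were not part of my original session:

```python
import requests


def response_to_xml_bytes(response):
    """Return the raw body of a requests.Response, raising on HTTP errors."""
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    # .content is the undecoded body (bytes), which lets the XML parser
    # honour the encoding declared in the document itself.
    return response.content


def fetch_xml(url, timeout=10):
    """Hypothetical helper: GET `url` and return the XML body as bytes."""
    return response_to_xml_bytes(requests.get(url, timeout=timeout))
```

With this in place, fetch_xml(url) would stand in for socket.content in the steps that follow.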

Most of the documentation I found pointed me towards the 'xml' package and using ElementTree.  After several attempts, with no success at all, I was able to get the desired result by using BeautifulSoup.  Specifically, the following:
##Step 1:
Load the data into a bs4 object, as suggested in the part of the BeautifulSoup documentation that describes the parser libraries
[In] 10: soup = bs4.BeautifulSoup(socket.content, ['lxml', 'xml'])
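
To try this step without hitting a real API, here is a small self-contained sketch. The SAMPLE_XML payload and its tag names (records, record, symbol, price) are made up for illustration and stand in for socket.content; it assumes both bs4 and lxml are installed:

```python
import bs4

# Hypothetical sample payload standing in for socket.content.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<records>
  <record><symbol>AAA</symbol><price>1.5</price></record>
  <record><symbol>BBB</symbol><price>2.5</price></record>
</records>"""

# 'xml' selects lxml's XML parser, so the lxml package must be installed.
soup = bs4.BeautifulSoup(SAMPLE_XML, 'xml')

# Look up the first <record> tag and read the text of its <symbol> child.
first_record = soup.find('record')
first_symbol = first_record.find('symbol').text
print(first_symbol)
```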

##Step 2:
Take a look at the mess of XML data to ascertain its structure.
[In] 11: f = open('pretty_xml.xml', 'w')
[In] 12: f.write(soup.prettify())
[In] 13: f.close()
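
The same dump can be written a little more defensively. This is a sketch rather than what I originally ran: the write_pretty helper and the sample payload are hypothetical, and the explicit UTF-8 encoding is an assumption about what the API returns:

```python
import io

import bs4

# Hypothetical sample payload; in practice this would be socket.content.
SAMPLE_XML = b"<records><record><symbol>AAA</symbol></record></records>"
soup = bs4.BeautifulSoup(SAMPLE_XML, 'xml')


def write_pretty(soup, path):
    """Write the indented form of `soup` to `path` as UTF-8 text."""
    # io.open accepts an encoding on both Python 2 and Python 3, and the
    # with-block closes the file even if the write fails.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(soup.prettify())


write_pretty(soup, 'pretty_xml.xml')
```

Opening pretty_xml.xml in an editor then shows one tag per line with nesting shown by indentation.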

##Step 3:
Begin the extraction
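Once you have looked at the data more carefully, use the findChildren() method of the bs4 object to extract each node into an element of a list.  Note that findChildren() is recursive by default, so the list holds every tag in the document, not just the top-level ones.  For example: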
[In] 14: soup_list = soup.findChildren()
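
To see what findChildren() returns at different depths, here is a sketch on a made-up payload (the tag names are hypothetical). Passing recursive=False restricts the search to direct children, which may be closer to "one list element per record":

```python
import bs4

# Hypothetical sample payload with two records under a single root tag.
SAMPLE_XML = b"""<records>
  <record><symbol>AAA</symbol><price>1.5</price></record>
  <record><symbol>BBB</symbol><price>2.5</price></record>
</records>"""
soup = bs4.BeautifulSoup(SAMPLE_XML, 'xml')

# Recursive by default: the root, both records, and all four leaf tags.
all_tags = soup.findChildren()
print([tag.name for tag in all_tags])

# Only the direct children of the root element, i.e. one tag per record.
records = soup.find('records').findChildren(recursive=False)
print([tag.name for tag in records])
```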

Then (assuming the tags are the same for each of the list elements, or mostly similar as in my case), create an (index, values) pairing using the name and text attributes of the bs4 Tag elements, doing something like this:
[In] 15: d = {}
[In] 16: for i, element in enumerate(soup_list):
...:        index = [x.name for x in element.findChildren()]
...:        vals = [x.text for x in element.findChildren()]
...:        d[i] = pandas.Series(vals, index = index)

Then you're all set up to make a pandas.DataFrame by passing it the dict of Series.
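
Putting the last step together on a made-up payload, as a rough sketch: the tag names and the choice to loop over only the record tags are assumptions for illustration, not what my real data looked like. A dict of Series becomes one column per key, so I transpose to get one row per record:

```python
import bs4
import pandas

# Hypothetical sample payload; in practice this would be socket.content.
SAMPLE_XML = b"""<records>
  <record><symbol>AAA</symbol><price>1.5</price></record>
  <record><symbol>BBB</symbol><price>2.5</price></record>
</records>"""
soup = bs4.BeautifulSoup(SAMPLE_XML, 'xml')

d = {}
# Iterate over the record tags only, so every Series has the same fields.
for i, record in enumerate(soup.find_all('record')):
    fields = record.findChildren(recursive=False)
    d[i] = pandas.Series([field.text for field in fields],
                         index=[field.name for field in fields])

# Each dict key becomes a column; transpose to get one row per record.
df = pandas.DataFrame(d).T
# Tag text is always a string, so convert numeric columns explicitly.
df['price'] = df['price'].astype(float)
print(df)
```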