@SKalt
Last active August 6, 2017 05:03
Functions to transform XML to JSON-able dicts

lxml's current FAQ includes a method for transforming XML into a dict of dicts, but not into JSON. The mismatch between a dict of dicts and JSON appears when an element has multiple children with the same tag name: a dict holds only one value per key, whereas JSON conventions represent repeated children of the same name as an array (a list in Python). The two Python functions in this gist attempt to collect repeated tags into lists. I'd appreciate suggestions for improvements.
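
To make the mismatch concrete, here is a minimal sketch of what goes wrong when each child tag is used directly as a dict key. The sample document and the names `xml_text` and `naive` are made up for illustration, and the standard library's `xml.etree.ElementTree` is used as a stand-in so the snippet runs without lxml installed; `lxml.etree` supports the same calls used here.

```python
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree

# Hypothetical sample document with a repeated <item> tag.
xml_text = "<order><item>apple</item><item>pear</item><id>7</id></order>"
root = ET.fromstring(xml_text)

# Naive conversion: one dict key per child tag.
# The second <item> silently overwrites the first.
naive = {child.tag: child.text for child in root}
print(naive)  # {'item': 'pear', 'id': '7'}
```

A JSON-style result would instead keep both values, for example `{"item": ["apple", "pear"], "id": "7"}`.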

def recursive_dict(element):
    "Given an lxml.etree._Element, recursively transform its children to dicts structured as JSON"
    if not len(element):
        return element.text
    else:
        results = {}
        for child in element:
            if results.get(child.tag, False):
                if type(results[child.tag]) != list:
                    results[child.tag] = [results[child.tag]]
                results[child.tag].append(recursive_dict(child))
            else:
                results[child.tag] = recursive_dict(child)
        return results

def tojson(element):
    "Given an lxml.etree._Element, group its children by tag via XPath; repeated tags become lists"
    unique_child_tags = set([child.tag for child in element])
    results = {}
    if not unique_child_tags:
        # Leaf element: return its text content
        return element.text
    for tag in unique_child_tags:
        # One XPath query per distinct tag name
        children_with_tag = element.xpath(tag)
        if len(children_with_tag) == 1:
            results[tag] = tojson(children_with_tag[0])
        else:
            results[tag] = [tojson(child) for child in children_with_tag]
    return results

scoder commented Aug 6, 2017

    if results.get(child.tag, False):
        if type(results[child.tag]) != list:

For safety, I'd always spell that like this:

    if child.tag in results:
        if type(results[child.tag]) is not list:
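
As a rough illustration of why the membership test is safer: `results.get(child.tag, False)` is also falsy when the stored value is `None` or an empty string, so an empty first child would be overwritten instead of collected into a list. The sketch below applies the `in` check to `recursive_dict`. The sample document is invented, and `xml.etree.ElementTree` is assumed here as a stand-in for `lxml.etree`, which offers the same calls.

```python
import json
import xml.etree.ElementTree as ET  # assumption: stdlib stand-in for lxml.etree


def recursive_dict(element):
    """Convert an element to a JSON-able structure; repeated tags become lists."""
    if not len(element):
        return element.text
    results = {}
    for child in element:
        value = recursive_dict(child)
        if child.tag in results:  # membership test: falsy values still count as present
            if type(results[child.tag]) is not list:
                results[child.tag] = [results[child.tag]]
            results[child.tag].append(value)
        else:
            results[child.tag] = value
    return results


# Hypothetical input: the first <item> is empty, so its value is None.
root = ET.fromstring("<order><item/><item>pear</item><id>7</id></order>")
converted = recursive_dict(root)
print(json.dumps(converted))  # {"item": [null, "pear"], "id": "7"}
```

With the `.get(child.tag, False)` check, the same input would yield `"item": "pear"`, because the stored `None` is treated as if the key were missing.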

I would expect the second approach to be much slower than the first, but generally speaking, I don't think there is a one-size-fits-all conversion. If the XML format is not intended to be JSON conforming by design, users would probably end up applying one format quirk fix or the other at some point. Giving them an example is already the best we can do.
