@ruthtillman
Created November 11, 2019 17:58
Python snippets for parsing XML into CSV

Make sure you've got these:

from lxml import etree
import csv

These snippets all focus on evaluating XPaths with etree rather than on the CSV-writing side.

A useful function for counting how many nodes match a single XPath. For example, if you have three dc:creator fields at positions 0, 1, and 2, it returns 3. This is super helpful if you need to set up a condition for processing one element vs. multiples.

def lenTest(path,tree):
  return len(tree.xpath(path))
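
As a rough sketch of the "one element vs. multiples" condition, the example below counts matches and then branches on the result. The sample record, the count_matches and joined_values names, and the "; " separator are all assumptions made for illustration; the namespaces argument is only there because a prefixed XPath such as dc:creator needs a prefix map passed to xpath().

```python
from lxml import etree

# Hypothetical sample record, made up for illustration.
SAMPLE = b"""<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Letters</dc:title>
  <dc:creator>Ada</dc:creator>
  <dc:creator>Grace</dc:creator>
  <dc:creator>Mary</dc:creator>
</record>"""

# Prefix map so the XPath engine can resolve the dc: prefix.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

def count_matches(path, tree, namespaces=None):
    # Same idea as lenTest, plus an optional prefix map (an assumption, not in the original).
    return len(tree.xpath(path, namespaces=namespaces))

def joined_values(path, tree, namespaces=None, sep="; "):
    # Branch on the match count: nothing, a single value, or several joined values.
    count = count_matches(path, tree, namespaces)
    if count == 0:
        return ""
    nodes = tree.xpath(path, namespaces=namespaces)
    if count == 1:
        return nodes[0].text or ""
    return sep.join(node.text or "" for node in nodes)

tree = etree.fromstring(SAMPLE)
creator_count = count_matches("/record/dc:creator", tree, NS)
creators = joined_values("/record/dc:creator", tree, NS)
title = joined_values("/record/dc:title", tree, NS)
missing = joined_values("/record/dc:subject", tree, NS)
```

Joining multiples into one cell is only one option; depending on your CSV layout you might prefer one column per value instead.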

How to get the value of an element using etree. If path is the XPath of the item, elemZero is the actual element within the tree at that XPath, specifically the 0th match (vs. [1], [2], [3], etc.). However, elemZero itself is just a reference to the element, not its text. To get the text you can use etree.tostring with method='text', which serializes the element's text content. I include the encoding and method values here because they're useful as a reference point. Note that in Python 3, encoding='UTF-8' returns bytes; pass encoding='unicode' if you want a str.

elemZero = tree.xpath(path)[0]
elemZeroValue = etree.tostring(elemZero, encoding='UTF-8', method='text')

Even if there's just one instance of the element at that XPath, you still need the [0], because xpath() returns a list of matches rather than a single element.
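
To make the difference between the element and its value concrete, here is a small sketch comparing a few ways of pulling text out of the 0th match. The XML fragment is made up for illustration. with_tail is a real etree.tostring keyword that controls whether the whitespace or text following the element's closing tag is included in the output.

```python
from lxml import etree

# Hypothetical fragment: the title contains a child element and is followed by whitespace.
root = etree.fromstring(
    "<titlestmt><titleproper>Papers of <emph>Jane Doe</emph>, 1900-1950</titleproper>\n"
    "  <author>Staff</author></titlestmt>"
)
elem = root.xpath("/titlestmt/titleproper")[0]

# As in the snippet above: bytes in Python 3, and the trailing "tail" text is included.
as_bytes = etree.tostring(elem, encoding="UTF-8", method="text")

# A str instead of bytes, without the tail after the closing tag.
as_str = etree.tostring(elem, encoding="unicode", method="text", with_tail=False)

# .text only gives the text before the first child element.
text_only = elem.text
```

If your elements never contain child markup, elem.text may be enough, but serializing with method="text" also picks up text nested inside children such as the emph element here.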

A simple example in EAD3:

# titleXPath is declared as a variable so you can switch to the EAD 2002 path if you want.
# The .strip() is tacked on in case your data has stray whitespace.

titleXPath = "/ead/control/filedesc/titlestmt/titleproper"
titleString = etree.tostring(tree.xpath(titleXPath)[0], encoding='UTF-8', method='text').strip()
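
For the CSV half, one possible way to wire the title lookup into csv.writer is sketched below. The sample document, the first_text helper, and the use of io.StringIO in place of a real output file are assumptions for illustration. The sample is deliberately un-namespaced so the XPath above matches as written; if your file declares a default namespace, the unprefixed path will not match and you would need to pass a prefix map to xpath().

```python
import csv
import io
from lxml import etree

# Hypothetical, un-namespaced EAD3-style fragment for illustration only.
SAMPLE = b"""<ead><control><filedesc><titlestmt>
  <titleproper>  Jane Doe papers </titleproper>
</titlestmt></filedesc></control></ead>"""

titleXPath = "/ead/control/filedesc/titlestmt/titleproper"

def first_text(tree, path):
    # Return the stripped text of the first match, or "" if nothing matches,
    # so a missing element does not raise an IndexError on [0].
    matches = tree.xpath(path)
    if not matches:
        return ""
    return etree.tostring(
        matches[0], encoding="unicode", method="text", with_tail=False
    ).strip()

tree = etree.fromstring(SAMPLE)

# StringIO stands in for a real file opened with newline="".
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["title"])
writer.writerow([first_text(tree, titleXPath)])
csv_text = buffer.getvalue()
```

Using encoding="unicode" here means the csv module receives a str rather than bytes, which avoids cells that look like b'...' under Python 3.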
