@akr4
Last active September 28, 2016 02:18
Python's XMLPullParser runs out of memory
import xml.etree.ElementTree as ET
import argparse

arg_parser = argparse.ArgumentParser()
arg_parser.add_argument('in_file')
args = arg_parser.parse_args()

def main(args):
    with open(args.in_file) as f:
        parser = ET.XMLPullParser(['start', 'end'])
        for line in f:
            parser.feed(line)
            for event, elem in parser.read_events():
                pass

main(args)

akr4 commented Sep 28, 2016

This seems to be intended behavior.

http://docs.python.jp/3/library/xml.etree.elementtree.html

If you don’t mind your application blocking on reading XML data but would still like to have incremental parsing capabilities, take a look at iterparse(). It can be useful when you’re reading a large XML document and don’t want to hold it wholly in memory.
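A small sketch of why memory keeps growing, assuming the cause is that XMLPullParser builds the full element tree as it goes: every element handed out by read_events() stays attached to its parent, so the root ends up holding the whole document. The function name and the sample chunks are made up for illustration.

```python
import xml.etree.ElementTree as ET

def count_retained_children(chunks):
    """Feed chunks to an XMLPullParser and report how many children
    the root element still holds after all events were consumed."""
    parser = ET.XMLPullParser(['start', 'end'])
    root = None
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            # The first 'start' event belongs to the root element.
            if root is None and event == 'start':
                root = elem
    # Consuming the events does not detach anything from the tree.
    return len(root)

chunks = ['<root>', '<item/>', '<item/>', '<item/>', '</root>']
retained = count_retained_children(chunks)
```

Even though the loop ignores each element, all three item elements remain reachable from the root here, which matches the memory growth seen with a large file.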

akr4 commented Sep 28, 2016

Switching to iterparse() from lxml or cElementTree and calling element.clear() on each element once it has been handled solved the problem.

import lxml.etree as ET
import argparse

arg_parser = argparse.ArgumentParser()
arg_parser.add_argument('in_file')
args = arg_parser.parse_args()

def main(args):
    with open(args.in_file, 'rb') as f:
        it = ET.iterparse(f, ['start', 'end'])
        for event, elem in it:
            elem.clear()

main(args)
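For comparison, here is a rough sketch of the same idea using only the standard library's xml.etree.ElementTree.iterparse(). It is not the code used above; count_items and the sample document are hypothetical, and it listens only for 'end' events so that an element is cleared after its content has been fully read.

```python
import io
import xml.etree.ElementTree as ET

def count_items(source, tag):
    """Count elements named `tag` while clearing each element
    as soon as its end tag has been parsed."""
    count = 0
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == tag:
            count += 1
        # Drop the element's children, text, and attributes.
        elem.clear()
    return count

sample = io.BytesIO(b'<root><item>a</item><item>b</item><other/></root>')
n = count_items(sample, 'item')
```

One caveat with this sketch: clear() empties an element but does not detach it from its parent, so with the standard library the root still accumulates empty child elements. For very large files that may need extra handling, such as also clearing the root periodically.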
