@stevecassidy
Created July 27, 2018 00:56
Script to split a single large XML export from NLA Trove into one file per article.
"""
Author: Steve Cassidy (Steve.Cassidy@mq.edu.au)
Script to split an XML export from Trove into one file per article.
The XML export from Trove consists of a single XML file with many
<article> elements, one per article. Since an export file can be very
large, this makes processing the data hard. This script breaks the
large file into many small files that can then be fed to
later processes. Each file is named for the article id number
(with an .html extension) and written to the directory 'output'.
Usage:
python3 xml2text.py <trove xml file>
"""
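import os
import sys
from xml.sax import parse
from xml.sax.handler import ContentHandler

OUTPUT_DIR = 'output'


class ArticleHandler(ContentHandler):
    """SAX handler that writes each <articleText> to its own file."""

    def __init__(self):
        super().__init__()
        self.current_id = None   # id attribute of the enclosing <article>
        self.in_text = False     # True while inside <articleText>
        self.text = ""

    def startElement(self, name, attrs):
        if name == 'article':
            self.current_id = attrs['id']
        if name == "articleText":
            self.in_text = True
            self.text = ""

    def characters(self, content):
        # May be called several times per element, so accumulate.
        if self.in_text:
            self.text += content

    def endElement(self, name):
        if name == 'articleText':
            self.in_text = False
            path = os.path.join(OUTPUT_DIR, self.current_id + ".html")
            with open(path, 'w', encoding='utf-8') as out:
                out.write(self.text)


if __name__ == '__main__':
    # Create the output directory if it does not exist yet.
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    # Pass the file name so the parser handles the XML encoding itself.
    parse(sys.argv[1], ArticleHandler())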
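
A quick way to check the splitting logic without a full Trove export is to run the same SAX approach over a small in-memory document. The sketch below is illustrative only: the sample XML, the SplitHandler class, and the split_sample helper are hypothetical names made up for this example, and the sample assumes the layout described in the docstring (an id attribute on each <article> and escaped markup inside <articleText>), which a real export may not match exactly.

```python
import os
import tempfile
from xml.sax import parseString
from xml.sax.handler import ContentHandler

# Hypothetical sample data; a real Trove export may contain other elements.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<articles>
  <article id="101"><heading>First</heading><articleText>&lt;p&gt;Hello&lt;/p&gt;</articleText></article>
  <article id="102"><heading>Second</heading><articleText>&lt;p&gt;World&lt;/p&gt;</articleText></article>
</articles>
"""


class SplitHandler(ContentHandler):
    """Write the text of each <articleText> to <output_dir>/<article id>.html."""

    def __init__(self, output_dir):
        super().__init__()
        self.output_dir = output_dir
        self.current_id = None
        self.in_text = False
        self.parts = []
        self.written = []  # paths of the files created so far

    def startElement(self, name, attrs):
        if name == "article":
            self.current_id = attrs["id"]
        elif name == "articleText":
            self.in_text = True
            self.parts = []

    def characters(self, content):
        # Only keep character data that belongs to <articleText>.
        if self.in_text:
            self.parts.append(content)

    def endElement(self, name):
        if name == "articleText":
            self.in_text = False
            path = os.path.join(self.output_dir, self.current_id + ".html")
            with open(path, "w", encoding="utf-8") as out:
                out.write("".join(self.parts))
            self.written.append(path)


def split_sample(output_dir):
    """Parse SAMPLE and return the list of files written to output_dir."""
    os.makedirs(output_dir, exist_ok=True)
    handler = SplitHandler(output_dir)
    parseString(SAMPLE, handler)
    return handler.written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in split_sample(tmp):
            print(os.path.basename(path))
```

Collecting the character data in a list and joining it once at the end avoids repeated string concatenation, which can matter when a single article body is long. The <heading> text in the sample is ignored because the handler only records characters while it is inside <articleText>.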