@arne-cl
Created December 11, 2013 16:15
Converts TigerXML files into tokenized plain text (one word per line, with an empty line between sentences).
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: Arne Neumann
#
# Purpose: extracts sentences from a Tiger XML input file and writes
# them to an output file (one word per line with an empty line
# between sentences).

import sys
import codecs
from lxml import etree

if __name__ == '__main__':
    if len(sys.argv) != 3:
        # print() with a single argument behaves the same on Python 2 and 3
        print("Usage: {0} tiger_input.xml plain_output.txt".format(sys.argv[0]))
        sys.exit(1)
    else:
        input_file_path = sys.argv[1]
        output_file_path = sys.argv[2]

        # parse the whole TigerXML document into memory
        tree = etree.parse(input_file_path)
        with codecs.open(output_file_path, 'w', 'utf8') as output_file:
            # every <s> element holds one sentence
            for sent in tree.iterfind('//s'):
                # every <t> terminal is one token; its surface form is in 'word'
                for token in sent.iterfind('./graph/terminals/t'):
                    output_file.write(token.attrib['word'] + '\n')
                # an empty line separates sentences
                output_file.write('\n')
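
To make the expected input and output concrete, here is a minimal sketch that runs the same extraction on a tiny in-memory document. Several things in it are assumptions rather than part of the script above: the two-sentence sample is invented, the helper name `tiger_to_plain_text` is hypothetical, and it uses the standard library's `xml.etree.ElementTree` instead of `lxml` so that it runs without extra dependencies (hence `.//s` rather than `//s`).

```python
import xml.etree.ElementTree as ET

# Invented sample in the layout the script expects: <s> elements that
# contain graph/terminals/t, where each <t> carries a "word" attribute.
SAMPLE_TIGER_XML = """\
<corpus>
  <body>
    <s id="s1">
      <graph>
        <terminals>
          <t id="s1_1" word="Das"/>
          <t id="s1_2" word="ist"/>
          <t id="s1_3" word="gut"/>
        </terminals>
      </graph>
    </s>
    <s id="s2">
      <graph>
        <terminals>
          <t id="s2_1" word="Ja"/>
        </terminals>
      </graph>
    </s>
  </body>
</corpus>
"""


def tiger_to_plain_text(xml_string):
    """Return one token per line, with an empty line after each sentence.

    Hypothetical helper; it mirrors the loop in the script but returns
    a string instead of writing to a file.
    """
    root = ET.fromstring(xml_string)
    lines = []
    for sent in root.iterfind('.//s'):
        for token in sent.iterfind('./graph/terminals/t'):
            lines.append(token.attrib['word'])
        lines.append('')  # blank line marks the sentence boundary
    return '\n'.join(lines) + '\n' if lines else ''


if __name__ == '__main__':
    print(tiger_to_plain_text(SAMPLE_TIGER_XML), end='')
```

Returning a string rather than writing straight to a file keeps the sketch easy to check by hand; the script itself streams to the output file, which is the better choice for a full corpus.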

arne-cl commented Oct 31, 2014

This script is now part of https://github.com/arne-cl/lingconv.
