@enewe101
Created July 22, 2016 21:18
Basic reader of parc xml files
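from bs4 import BeautifulSoup as Soup


class AnnotatedText(object):
    """Parse a PARC XML string into lists of sentences and words."""

    def __init__(self, parc_xml):
        # html.parser lowercases tag and attribute names, which is why the
        # lookups below use 'attributionrole' and 'rolevalue'.
        self.soup = Soup(parc_xml, 'html.parser')
        self.words = []      # flat list of every word in the file
        self.sentences = []  # one dict per sentence, each with a 'words' list

        sentence_tags = self.soup.find_all('sentence')
        for sentence_tag in sentence_tags:

            sentence = {'words': []}
            self.sentences.append(sentence)

            word_tags = sentence_tag.find_all('word')
            for word_tag in word_tags:

                word = {
                    'token': word_tag['text'],
                }

                # A word is annotated only if it contains an <attribution> tag.
                attribution = word_tag.find('attribution')
                if attribution:
                    word['attribution'] = {
                        'role': attribution.find('attributionrole')['rolevalue'],
                        'id': attribution['id']
                    }
                else:
                    word['attribution'] = None

                # The same dict is shared by the flat list and the sentence.
                self.words.append(word)
                sentence['words'].append(word)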
@enewe101 (Author)

>>> from parc_reader import AnnotatedText as A
>>> annotated_text = A(open('some-file.xml').read())

An AnnotatedText object has a sentences attribute, which is a list of sentences. Each sentence is a dictionary with a words key, which holds a list of all the words in the sentence. Words are dictionaries too: each has a token key (the original text of the word) and an attribution key. The attribution value is either None or a dictionary with a role and an id. The role identifies whether the word belongs to a cue, content, or source span. The object also has a words attribute, which is a flat list of every word in the file.
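As a rough sketch of how this structure might be consumed, the snippet below groups tokens by attribution id and role. The helper name group_attribution_spans is hypothetical and not part of the gist. The sample data is hand-built for illustration. The role strings 'source', 'cue', and 'content', and the idea that words sharing an id belong to the same attribution, are assumptions about the data and not something the reader code guarantees.

```python
from collections import defaultdict


def group_attribution_spans(sentences):
    """Hypothetical helper: return {attribution_id: {role: [token, ...]}}."""
    spans = defaultdict(lambda: defaultdict(list))
    for sentence in sentences:
        for word in sentence['words']:
            attribution = word['attribution']
            if attribution is None:
                # Word is not part of any attribution; skip it.
                continue
            spans[attribution['id']][attribution['role']].append(word['token'])
    # Convert the nested defaultdicts to plain dicts for easier inspection.
    return {attr_id: dict(roles) for attr_id, roles in spans.items()}


# Hand-built sample in the same shape as AnnotatedText.sentences.
# The id and role values here are made up for illustration.
sample_sentences = [
    {'words': [
        {'token': 'Analysts', 'attribution': {'role': 'source', 'id': 'example_attr_1'}},
        {'token': 'said', 'attribution': {'role': 'cue', 'id': 'example_attr_1'}},
        {'token': 'profits', 'attribution': {'role': 'content', 'id': 'example_attr_1'}},
        {'token': 'rose', 'attribution': {'role': 'content', 'id': 'example_attr_1'}},
        {'token': '.', 'attribution': None},
    ]},
]

spans = group_attribution_spans(sample_sentences)
```

With a real file, the same call would presumably be group_attribution_spans(annotated_text.sentences).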
