Created July 22, 2016 21:18
Basic reader of parc xml files
from bs4 import BeautifulSoup as Soup


class AnnotatedText(object):

    def __init__(self, parc_xml):
        # html.parser lowercases tag and attribute names, so all lookups
        # below use lowercase names.
        self.soup = Soup(parc_xml, 'html.parser')
        self.words = []      # every word in the document, in order
        self.sentences = []  # one dict per sentence, each with a 'words' list

        sentence_tags = self.soup.find_all('sentence')
        for sentence_tag in sentence_tags:
            sentence = {'words': []}
            self.sentences.append(sentence)

            word_tags = sentence_tag.find_all('word')
            for word_tag in word_tags:
                word = {
                    'token': word_tag['text'],
                }
                # A word carries an attribution only if it is part of a span.
                attribution = word_tag.find('attribution')
                if attribution:
                    word['attribution'] = {
                        'role': attribution.find('attributionrole')['rolevalue'],
                        'id': attribution['id']
                    }
                else:
                    word['attribution'] = None

                # The same dict is shared by the flat list and the sentence.
                self.words.append(word)
                sentence['words'].append(word)
Once you have an `AnnotatedText` object, it has a `sentences` property, which is a list of sentences. Each sentence is a dictionary with a `words` key, which holds a list of all the words in the sentence. Words are dictionaries too: each has a `token` key (the original text of the word) and an `attribution` key. The `attribution` value is either `None` or a dictionary with a `role` and an `id`. The role identifies whether the word belongs to a cue, content, or source span.
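
As a rough sketch of how this structure might be walked, the snippet below groups tokens by attribution `id` and then by `role`. It runs on a hand-built list with the same shape as `sentences` rather than on a real PARC file; the tokens, the id `attr_0`, the exact role strings, and the helper name `group_spans` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hand-built stand-in with the same shape as AnnotatedText.sentences.
# Tokens, ids, and role strings are made up for illustration.
sentences = [
    {'words': [
        {'token': 'Analysts', 'attribution': {'role': 'source', 'id': 'attr_0'}},
        {'token': 'said', 'attribution': {'role': 'cue', 'id': 'attr_0'}},
        {'token': 'profits', 'attribution': {'role': 'content', 'id': 'attr_0'}},
        {'token': 'fell', 'attribution': {'role': 'content', 'id': 'attr_0'}},
        {'token': '.', 'attribution': None},
    ]},
]


def group_spans(sentences):
    """Collect tokens by attribution id, then by role (hypothetical helper)."""
    spans = defaultdict(lambda: defaultdict(list))
    for sentence in sentences:
        for word in sentence['words']:
            attribution = word['attribution']
            if attribution is None:
                continue  # word is outside every attribution span
            spans[attribution['id']][attribution['role']].append(word['token'])
    return spans


spans = group_spans(sentences)
for attribution_id, roles in spans.items():
    for role, tokens in roles.items():
        print(attribution_id, role, ' '.join(tokens))
```

With a real file, passing `AnnotatedText(xml_string).sentences` to the same helper should work the same way, since it only relies on the `words`, `token`, and `attribution` keys described above.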