@scott2b
Last active April 19, 2018 04:16
Extract themes from the GDELT SET_EVENTPATTERNS.xml file
"""
The patterns file is here: https://github.com/ahalterman/GKG-Themes/blob/master/SET_EVENTPATTERNS.xml
It is not valid XML, so this script uses a regex instead of an XML parser.
The file also contains non-theme entries, which are not considered here. The globals section at the top of the
file should be taken into account when processing documents with pattern matches.
"""
import re

# Match each THEME category; capture its name and the raw body of its <TERMS> block.
p = re.compile(r'^<CATEGORY NAME="([^"]+)" TYPE="THEME">\s*<TERMS>([^<]+)</TERMS>', re.M|re.S)
themes = {}
with open('SET_EVENTPATTERNS.xml') as f:
    for theme, terms in p.findall(f.read()):
        # Keep only lines that split into exactly two tab-separated fields.
        terms = [tuple(t.split('\t')) for t in terms.split('\n') if t and len(t.split('\t')) == 2]
        if terms:
            themes[theme] = terms
print(themes)
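
The extraction logic above can be tried without downloading the patterns file. What follows is a minimal sketch, assuming a small in-memory sample that was invented for illustration and is not taken from SET_EVENTPATTERNS.xml. The names `THEME_PATTERN`, `parse_themes`, and `SAMPLE` are hypothetical, and the source does not say what the two tab-separated fields mean, so they are kept as plain string pairs.

```python
import re

# Same regex as the script above, compiled under a hypothetical name.
THEME_PATTERN = re.compile(
    r'^<CATEGORY NAME="([^"]+)" TYPE="THEME">\s*<TERMS>([^<]+)</TERMS>',
    re.M | re.S,
)


def parse_themes(text):
    """Return {theme_name: [(field1, field2), ...]} for each THEME category in text."""
    themes = {}
    for theme, block in THEME_PATTERN.findall(text):
        # Split the <TERMS> body into lines, then each line into tab-separated fields.
        rows = [tuple(line.split('\t')) for line in block.split('\n')]
        # Drop blank lines and anything that is not exactly two fields.
        rows = [row for row in rows if len(row) == 2]
        if rows:
            themes[theme] = rows
    return themes


# Invented sample data (assumption): one THEME category and one non-theme category.
SAMPLE = (
    '<CATEGORY NAME="EXAMPLE_THEME" TYPE="THEME">\n'
    '<TERMS>\n'
    'first term\t1\n'
    'second term\t2\n'
    'line without a tab\n'
    '</TERMS>\n'
    '</CATEGORY>\n'
    '<CATEGORY NAME="EXAMPLE_OTHER" TYPE="OTHER">\n'
    '<TERMS>\n'
    'ignored\t1\n'
    '</TERMS>\n'
    '</CATEGORY>\n'
)

sample_themes = parse_themes(SAMPLE)
print(sample_themes)
```

Wrapping the logic in a function that takes text rather than a filename keeps the parsing step separate from file I/O, which makes it easier to check against small inputs like this one. The non-theme category and the line without a tab are both skipped.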