@andreasvc
andreasvc / corenlpxmltoconll2012.py
Last active May 8, 2019
Convert XML output of Stanford CoreNLP to CoNLL 2012 format
View corenlpxmltoconll2012.py
"""Convert XML output of Stanford CoreNLP to CoNLL 2012 format.
$ ./corenlp.sh -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref \
-output.printSingletonEntities true \
-file /tmp/example.txt
$ python3 corenlpxmltoconll2012.py example.txt.xml > example.conll
"""
import re
import sys
from lxml import etree
View iwpt2013.xml
<?xml version='1.0' encoding='UTF-8'?>
<volume id="W13">
<paper id="5700">
<title>Proceedings of The 13th International Conference on Parsing Technologies (IWPT 2013)</title>
<editor><first>Harry</first><last>Bunt</last></editor>
<editor><first>Khalil</first><last>Sima'an</last></editor>
<editor><first>Liang</first><last>Huang</last></editor>
<month>November</month>
<year>2013</year>
<address>Nara, Japan</address>
andreasvc / udstyle.py
Last active Oct 6, 2021
Compute complexity metrics from Universal Dependencies
View udstyle.py
"""Compute complexity metrics from Universal Dependencies.
Usage: python3 udstyle.py [OPTIONS] FILE...
    --parse=LANG       parse texts with Stanza; provide 2 letter language code
    --output=FILENAME  write result to a tab-separated file.
Reported metrics:
- LEN: mean sentence length in words (excluding punctuation).
- MDD: mean dependency distance (Gibson, 1998).
- NDD: normalized dependency distance (Lei & Jockers, 2018).
- ADJ: proportion of adjacent dependencies.
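Two of the metrics listed above, MDD and ADJ, can be illustrated with a minimal sketch. This is not the code from udstyle.py, whose body is not shown here. It assumes a sentence is given as a list of 1-based head indices in CoNLL-U style, with 0 marking the root. It also assumes the root relation is left out of the count and that punctuation is not filtered. The function names are hypothetical.

```python
def dependency_distances(heads):
    """Return |head - dependent| for every non-root token.

    heads[i] is the 1-based position of the head of token i + 1;
    a head of 0 marks the root, which has no distance (assumption).
    """
    return [abs(head - dep)
            for dep, head in enumerate(heads, 1)
            if head != 0]


def mean_dependency_distance(heads):
    """MDD: average linear distance between dependents and their heads."""
    dists = dependency_distances(heads)
    return sum(dists) / len(dists) if dists else 0.0


def adjacent_proportion(heads):
    """ADJ: share of dependencies whose head is a neighbouring token."""
    dists = dependency_distances(heads)
    return sum(d == 1 for d in dists) / len(dists) if dists else 0.0


# "The dog chased the cat": The->dog, dog->chased, chased=root,
# the->cat, cat->chased
heads = [2, 3, 0, 5, 3]
print(mean_dependency_distance(heads))  # 1.25
print(adjacent_proportion(heads))       # 0.75
```

For the example sentence the distances are 1, 1, 1 and 2, so three of the four dependencies are adjacent.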
andreasvc / xmientityrename.py
Last active Apr 1, 2019
Rename numeric entity labels in .xmi file to text of first mention
View xmientityrename.py
"""Rename numeric entity labels in .xmi file to text of first mention.
Usage: python3 xmientityrename.py <FILE>...
Original file is modified in-place.
Only non-empty entities with numeric names are changed.
See https://github.com/nilsreiter/CorefAnnotator/issues/173"""
import os
import sys
from lxml import etree
andreasvc / preprocess.py
Created Feb 10, 2019
Preprocess movie review polarity dataset v2.0
View preprocess.py
"""Preprocess movie review polarity dataset v2.0.
http://www.cs.cornell.edu/people/pabo/movie-review-data/
"""
import os
import re
import glob
import random
from syntok.tokenizer import Tokenizer
def process(path, pattern, out):
View cellbench.pyx
"""Run with python -c 'import pyximport; pyximport.install(); import cellbench; cellbench.main()'
"""
from libc.stdint cimport uint32_t
from libc.math cimport sqrt, modf
from libc.math cimport round as c_round
ctypedef uint32_t Label
cdef inline size_t cellidx(short start, short end, short lensent,
        Label nonterminals):
View adventofcode.py
"""Advent of Code 2017. http://adventofcode.com/2017 """
import sys
import array
from collections import Counter, defaultdict
from operator import xor
from functools import reduce
from itertools import count
from binascii import hexlify
import numpy as np
andreasvc / README.md
Last active Feb 6, 2018
Word lists for extraction of physical descriptions
View README.md

Word lists for extraction of physical descriptions

These are XPath macros used in our DSH paper on physical descriptions of appearance.

English translation of macro names

  • uiterlijkN = looksN
  • uiterlijkA = looksA
  • persoon = person
  • kleding = clothing
View checkall.py
"""Tool to check if function/class definitions in Python files match with
their __all__ attribute. Rudimentary support for Cython.
"""
import sys
import re
from collections import Counter
for filename in sys.argv[1:]:
    with open(filename, 'rt') as inp:
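The preview above is cut off before the matching logic. As a rough illustration of the check the docstring describes, here is a sketch that uses the standard `ast` module. This is a different technique from the original, whose `re` import suggests it is regex-based, and unlike the original it cannot handle Cython sources. The function name `check_all` and its return convention are assumptions made for this example.

```python
import ast


def check_all(source):
    """Compare top-level def/class names in source with its __all__.

    Returns (missing, unknown):
    missing: public names that are defined but not listed in __all__;
    unknown: names listed in __all__ without a top-level def/class.
    Assumes __all__ is assigned a literal list or tuple of strings;
    anything else makes ast.literal_eval raise ValueError.
    """
    defined = set()
    exported = None
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == '__all__':
                    exported = set(ast.literal_eval(node.value))
    if exported is None:  # no __all__ in this module: nothing to compare
        return set(), set()
    public = {name for name in defined if not name.startswith('_')}
    return public - exported, exported - defined


example = (
    "__all__ = ['foo', 'Bar', 'gone']\n"
    "def foo(): pass\n"
    "def baz(): pass\n"
    "def _private(): pass\n"
    "class Bar: pass\n"
)
missing, unknown = check_all(example)
print(sorted(missing))  # ['baz']
print(sorted(unknown))  # ['gone']
```

Names starting with an underscore are treated as private and are never reported as missing. That is a convention chosen for this sketch, not something the original states.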
andreasvc / aclrename.py
Created Jan 31, 2016
Script to rename papers from ACL Anthology to 'author year title.pdf'
View aclrename.py
"""Script to rename papers from ACL Anthology to 'author year title.pdf'
Given PDF files from the ACL Anthology (http://aclweb.org/anthology/),
downloads the bibtex file and extracts author, year, and title
to suggest more descriptive names.
Before: N04-1016.pdf
After: Lapata & Keller 2004 The Web as a Baseline: Evaluating the Perform[...]
Usage: