@fginter
Last active April 27, 2018 11:42
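# Column indices of a 10-column CoNLL-U token line
ID,FORM,LEMMA,UPOS,XPOS,FEATS,HEAD,DEPREL,DEPS,MISC=range(10)

def read_conll(inp,max_sent=0,drop_tokens=True,drop_nulls=True):
    """Yield (sent, comments) pairs from an iterable of CoNLL-U lines.

    sent is a list of token lines, each split into its tab-separated columns;
    comments is the list of "#" lines collected since the previous sentence.
    max_sent=0 means no limit. drop_tokens skips lines whose ID contains "-"
    (multiword token ranges); drop_nulls skips lines whose ID contains "."
    (empty nodes).
    """
    comments=[]
    sent=[]
    yielded=0
    for line in inp:
        line=line.strip()
        if line.startswith("#"):
            comments.append(line)
        elif not line:
            # A blank line ends the current sentence
            if sent:
                yield sent,comments
                yielded+=1
                if max_sent>0 and yielded==max_sent:
                    break
                sent,comments=[],[]
        else:
            cols=line.split("\t")
            if drop_tokens and "-" in cols[ID]:
                continue
            if drop_nulls and "." in cols[ID]:
                continue
            sent.append(cols)
    else:
        # Input ended without a break: yield the last sentence
        # if it was not followed by a blank line
        if sent:
            yield sent,comments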
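
The two filters in read_conll look only at the ID column. As a rough illustration of what drop_tokens and drop_nulls skip, the sketch below classifies a few ID values using the same substring checks. The helper name classify_id and the sample rows are made up for this example and are not part of the gist.

```python
# Hypothetical helper (not in the gist): names the three kinds of ID values
# that can appear in the first column of a CoNLL-U token line.
def classify_id(token_id):
    if "-" in token_id:
        return "multiword"  # range such as "1-2"; skipped when drop_tokens=True
    if "." in token_id:
        return "empty"      # empty node such as "2.1"; skipped when drop_nulls=True
    return "word"           # plain integer ID; always kept

# Made-up sample rows with 10 tab-separated columns each, for illustration only.
sample = [
    "1-2\tcannot\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tcan\tcan\tAUX\t_\t_\t0\troot\t_\t_",
    "2\tnot\tnot\tPART\t_\t_\t1\tadvmod\t_\t_",
    "2.1\t_\t_\t_\t_\t_\t_\t_\t_\t_",
]

# Classify each row by its first column, the same column read_conll checks.
kinds = [classify_id(row.split("\t")[0]) for row in sample]
print(kinds)
```

With both flags left at their defaults, only the rows classified as "word" would end up in a sentence.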