@GreenRiverRUS
Created November 25, 2017 20:41
Simple converter to CoNLL-2003 NER format for spaCy model training
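# Each document is a list of sentences; each sentence is a pair
# [tokens, tags] with one tag per token.
DATA = [
    [
        [['Who', 'is', 'Shaka', 'Khan', '?'], ['O', 'O', 'I-PER', 'I-PER', 'O']]
    ],
    [
        [['I', 'like', 'London', 'and', 'Berlin', '.'], ['O', 'O', 'I-LOC', 'O', 'I-LOC', 'O']]
    ]
]

with open('output.conll', 'w') as f:
    for doc in DATA:
        # Every document starts with a -DOCSTART- marker line.
        f.write('-DOCSTART- -X- O O\n')
        for sentence, sent_entities in doc:
            for token, BIO_tag in zip(sentence, sent_entities):
                # Columns: token, POS placeholder, chunk placeholder, NER tag.
                f.write('{} -X- _ {}\n'.format(token, BIO_tag))
            # A blank line ends each sentence.
            f.write('\n')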
## Result
# -DOCSTART- -X- O O
# Who -X- _ O
# is -X- _ O
# Shaka -X- _ I-PER
# Khan -X- _ I-PER
# ? -X- _ O
#
# -DOCSTART- -X- O O
# I -X- _ O
# like -X- _ O
# London -X- _ I-LOC
# and -X- _ O
# Berlin -X- _ I-LOC
# . -X- _ O
@dipansh-girdhar

Hey, is there a way to go from CoNLL format to spaCy format?

You can check out my repo for the solution.
https://github.com/dipansh-girdhar/NLP/tree/master/NER/Spacy%20NER
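
For the reverse direction, here is a minimal sketch that is not taken from the linked repo. It assumes the file layout written by the gist above (token in the first column, tag in the last, a blank line between sentences) and that tokens can be re-joined with single spaces, which will not reproduce the original spacing of the text. The names `read_conll` and `to_offsets` are hypothetical. The output mimics the `(text, {'entities': [(start, end, label)]})` tuples used in spaCy v2 training examples, so check it against the spaCy version you use.

```python
def read_conll(lines):
    """Yield (tokens, tags) for each sentence in CoNLL-style lines."""
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        # Blank lines and -DOCSTART- markers both close the current sentence.
        if not line or line.startswith('-DOCSTART-'):
            if tokens:
                yield tokens, tags
                tokens, tags = [], []
            continue
        parts = line.split()
        tokens.append(parts[0])   # first column: token
        tags.append(parts[-1])    # last column: NER tag
    if tokens:
        yield tokens, tags


def to_offsets(tokens, tags):
    """Return (text, {'entities': [(start, end, label)]}).

    Assumption: tokens are joined with single spaces, so offsets refer to
    that joined text, not to the original raw text.
    """
    entities = []
    current = None  # [start, end, label] of the entity being built
    pos = 0
    for token, tag in zip(tokens, tags):
        start, end = pos, pos + len(token)
        label = None if tag == 'O' else tag[2:]
        if current and label == current[2] and tag.startswith('I-'):
            # Same label continued by an I- tag: extend the open entity.
            current[1] = end
        else:
            # Close the open entity (if any) and maybe start a new one.
            if current:
                entities.append(tuple(current))
            current = [start, end, label] if label else None
        pos = end + 1  # skip the joining space
    if current:
        entities.append(tuple(current))
    return ' '.join(tokens), {'entities': entities}


SAMPLE = [
    '-DOCSTART- -X- O O',
    'Who -X- _ O',
    'is -X- _ O',
    'Shaka -X- _ I-PER',
    'Khan -X- _ I-PER',
    '? -X- _ O',
    '',
    '-DOCSTART- -X- O O',
    'I -X- _ O',
    'like -X- _ O',
    'London -X- _ I-LOC',
    'and -X- _ O',
    'Berlin -X- _ I-LOC',
    '. -X- _ O',
]

examples = [to_offsets(tokens, tags) for tokens, tags in read_conll(SAMPLE)]
for text, annotations in examples:
    print(text, annotations)
```

To read the file produced by the gist, pass the open file object instead of `SAMPLE`, for example `read_conll(open('output.conll'))`.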
