@rjurney
Created January 28, 2021 18:55
Can't serialize Docs created with spacy-transformers
import bz2
import json

import spacy
import tqdm as tq  # assumed alias; the original import is not shown
from spacy.cli import download
from spacy.tokens import DocBin

# Load the spaCy transformers model based on English web content
download("en_core_web_trf")
# download("en_core_web_lg")
nlp = spacy.load("en_core_web_trf")

# Store the documents of the articles because the transformer model is slow
doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=False)

# Text is in the 'body' field
articles, docs, sentences = [], [], []
with bz2.open("/path/to/file/foo.json.bz2") as f:
    for line in tq.tqdm(f.readlines()):
        article = json.loads(line.rstrip())
        articles.append(article)

        doc = nlp(article["body"])
        docs.append(doc)

        # Add so it can be serialized (this is the call that raises the TypeError)
        doc_bin.add(doc)

        sents = [sent.text.strip() for sent in doc.sents if len(sent) > 3]
        sentences += sents
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')
0% 0/531 [00:02<?, ?it/s]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-236a13821dcd> in <module>()
     22
     23         # Add so it can be serialized
---> 24         doc_bin.add(doc)
     25
     26         sents = [sent.text.strip() for sent in doc.sents if len(sent) > 3]
7 frames
/usr/local/lib/python3.6/dist-packages/srsly/msgpack/_packer.pyx in srsly.msgpack._packer.Packer._pack()
TypeError: can not serialize 'TransformerData' object
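The traceback shows msgpack failing on a 'TransformerData' object. One possible workaround, sketched here under assumptions, is to clear that object from each Doc before adding it to the DocBin. The helper name `clear_transformer_output` is hypothetical. The sketch assumes the object is stored in the custom extension attribute `doc._.trf_data` and that the transformer output is not needed after the documents are reloaded. It has not been run against the exact spaCy version that produced this error.

```python
def clear_transformer_output(doc):
    """Drop the transformer output attached to a spaCy Doc, in place.

    Assumption: the TransformerData object lives in the custom extension
    attribute `doc._.trf_data`, and nothing that later reads the DocBin
    needs it.
    """
    # has_extension is a spaCy Doc method; skip docs from pipelines
    # that never registered the attribute.
    if doc.has_extension("trf_data"):
        doc._.trf_data = None
    return doc


# Intended use inside the loop from the gist (not executed here):
#     doc = nlp(article["body"])
#     sents = [sent.text.strip() for sent in doc.sents if len(sent) > 3]
#     doc_bin.add(clear_transformer_output(doc))
```

Clearing the attribute only after the sentences have been extracted keeps the rest of the loop unchanged. The lemma and entity annotations listed in `attrs` are stored as token attributes, so they should not depend on `trf_data`.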