@conormm
Created May 18, 2021 20:38
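import pandas as pd


def tidy_tokens(docs):
    """Extract tokens and metadata from a list of spaCy docs."""
    # Output column order; doc_id is added per document below.
    cols = [
        "doc_id", "token", "token_order", "lemma",
        "ent_type", "tag", "dep", "pos", "is_stop",
        "is_alpha", "is_digit", "is_punct"
    ]
    meta_df = []
    for ix, doc in enumerate(docs):
        # extract_tokens_plus_meta is not defined in this gist; it is assumed
        # to return one record per token with fields ordered as cols[1:].
        meta = extract_tokens_plus_meta(doc)
        meta = pd.DataFrame(meta)
        meta.columns = cols[1:]
        meta = meta.assign(doc_id=ix).loc[:, cols]
        meta_df.append(meta)
    # Each per-document frame keeps its own 0-based index after concatenation.
    return pd.concat(meta_df)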
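
`tidy_tokens` calls `extract_tokens_plus_meta`, which this gist does not define. The block below is a minimal sketch of what such a helper might look like, not the author's version. It assumes the helper returns one tuple per token, with fields in the same order as `cols[1:]`. The attribute names (`text`, `i`, `lemma_`, `ent_type_`, `tag_`, `dep_`, `pos_`, `is_stop`, `is_alpha`, `is_digit`, `is_punct`) are standard spaCy `Token` attributes. The `fake_doc` stand-in is purely illustrative, so the sketch runs without a spaCy model installed.

```python
from types import SimpleNamespace


def extract_tokens_plus_meta(doc):
    """Hypothetical helper: one tuple per token, ordered to match cols[1:]."""
    return [
        (
            tok.text,       # token
            tok.i,          # token_order (position within the doc)
            tok.lemma_,     # lemma
            tok.ent_type_,  # ent_type
            tok.tag_,       # tag
            tok.dep_,       # dep
            tok.pos_,       # pos
            tok.is_stop,
            tok.is_alpha,
            tok.is_digit,
            tok.is_punct,
        )
        for tok in doc
    ]


# Illustrative stand-in for a spaCy Doc: any iterable of objects exposing
# the same attributes works, since the helper only reads attributes.
fake_doc = [
    SimpleNamespace(
        text="Hello", i=0, lemma_="hello", ent_type_="", tag_="UH",
        dep_="ROOT", pos_="INTJ", is_stop=False, is_alpha=True,
        is_digit=False, is_punct=False,
    )
]
rows = extract_tokens_plus_meta(fake_doc)
```

With a loaded spaCy pipeline `nlp`, something like `tidy_tokens(list(nlp.pipe(texts)))` should then yield one row per token, with `doc_id` identifying the source document.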