@Mlawrence95
Last active November 5, 2019 19:13
Takes a document (string) and returns a pandas DataFrame containing the number of occurrences of each unique word. An iterable of documents can be handled by joining it into a single string first. Note that this is not efficient enough to replace scikit-learn's CountVectorizer class as a bag-of-words transformer.
import numpy as np
import pandas as pd


def get_word_counts(document: str) -> pd.DataFrame:
    """
    Turns a document into a dataframe of words and their counts.

    Use preprocessing/lowercasing before this step for best results.

    If passing many documents, use document = '\n'.join(iterable_of_documents)
    """
    # Split on whitespace, then count each distinct token
    vocab, counts = np.unique(document.split(), return_counts=True)
    combined_df = pd.DataFrame({'vocab': vocab,
                                'counts': counts})
    # Most frequent words first, with a fresh 0..n-1 index
    return combined_df.sort_values('counts', ascending=False).reset_index(drop=True)
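
The docstring suggests lowercasing first and joining multiple documents into one string. A minimal sketch of that usage follows. The sample `documents` list is invented for illustration, and the function is repeated here only so the example runs on its own.

```python
import numpy as np
import pandas as pd


def get_word_counts(document: str) -> pd.DataFrame:
    """Same logic as the function above, repeated so this sketch is self-contained."""
    vocab, counts = np.unique(document.split(), return_counts=True)
    combined_df = pd.DataFrame({'vocab': vocab, 'counts': counts})
    return combined_df.sort_values('counts', ascending=False).reset_index(drop=True)


# Hypothetical sample documents, not part of the original gist
documents = ["The cat sat", "the dog sat", "The end"]

# Lowercase each document so "The" and "the" are counted together,
# then join everything into one string as the docstring recommends
corpus = '\n'.join(doc.lower() for doc in documents)

word_counts = get_word_counts(corpus)
print(word_counts)
```

Because `str.split()` with no arguments splits on any whitespace, the newline used as the join separator never appears as a token. Punctuation is not stripped, so "sat" and "sat." would be counted as different words unless the text is cleaned beforehand.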