@amankharwal
Created October 4, 2020 08:31
import pandas as pd
import numpy as np
import textdistance
import re
from collections import Counter
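words = []
# Read the whole corpus, lowercase it, and split it into word tokens
with open('moby.txt', 'r') as f:
    file_name_data = f.read()
    file_name_data = file_name_data.lower()
    # Raw string so that \w is not treated as an invalid string escape
    words = re.findall(r'\w+', file_name_data)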
# This is our vocabulary
V = set(words)
print(f"The first ten words in the text are: \n{words[0:10]}")
print(f"There are {len(V)} unique words in the vocabulary.")
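
The same tokenization step can be tried without `moby.txt` by running it on an in-memory string. This is only a minimal sketch: the helper name `build_vocabulary` and the sample sentences are hypothetical and not part of the original gist. Note that `\w+` also matches digits and underscores, so those end up in the vocabulary as well.

```python
import re

def build_vocabulary(text):
    # Hypothetical helper: lowercase the text, extract runs of word
    # characters, and return both the token list and the set of unique tokens.
    words = re.findall(r"\w+", text.lower())
    return words, set(words)

# Illustrative sample text (an assumption, standing in for moby.txt)
sample = "Call me Ishmael. Some years ago, never mind how long precisely."
sample_words, sample_vocab = build_vocabulary(sample)

print(f"The first three words are: {sample_words[:3]}")
print(f"There are {len(sample_vocab)} unique words in the sample.")

# Case differences and punctuation do not create separate vocabulary entries
repeat_words, repeat_vocab = build_vocabulary("The whale, the WHALE!")
```

Because the text is lowercased before matching, `The` and `the` count as one vocabulary entry, which is usually what a word-frequency step built on this vocabulary expects.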