@ethanwillis
Last active August 29, 2015 14:09
Python script to find ngrams in a corpus
def ngramsProgram(corpus, n):
    # Input comes from our parameters.
    # Process our input to find ngrams.
    ngrams = findNGrams(corpus, n)
    # Output our ngrams.
    outputNGrams(ngrams)


def findNGrams(corpus, n):
    # Tokenize our corpus on single spaces, which also gives us all of our unigrams.
    tokenizedCorpus = corpus.split(' ')
    # Initialize our list of "windows" (ngrams).
    ngrams = []
    # Find every window of size n and add it to our list of windows.
    # There are len(tokenizedCorpus) - (n - 1) such windows.
    for i in range(0, len(tokenizedCorpus) - (n - 1)):
        # Initialize an empty ngram.
        curNGram = []
        # Build the ngram for the current window, starting with the ith unigram.
        for x in range(i, i + n):
            # Append the xth unigram to the current ngram.
            curNGram.append(tokenizedCorpus[x])
        # Add this ngram to our list of ngrams.
        ngrams.append(curNGram)
    return ngrams


def outputNGrams(ngrams):
    # Print each ngram (a list of tokens) on its own line.
    for ngram in ngrams:
        print(str(ngram))


ngramsProgram("The quick brown fox jumped over the lazy dog.", 4)
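
The sliding-window loop in findNGrams can also be written with list slicing. The following is a minimal sketch, not part of the original gist: the name find_ngrams_sliced is hypothetical, it assumes n is at least 1, and it tokenizes with split() instead of split(' '), so runs of whitespace are collapsed rather than producing empty tokens.

```python
def find_ngrams_sliced(corpus, n):
    """Return every window of n consecutive tokens as a list of lists.

    Hypothetical helper for illustration; assumes n >= 1.
    """
    # split() with no argument splits on any whitespace and drops empty tokens,
    # which differs from the split(' ') used in the script above.
    tokens = corpus.split()
    # There are len(tokens) - n + 1 windows; range() is empty when n is larger
    # than the number of tokens, so the result is then an empty list.
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    for ngram in find_ngrams_sliced("The quick brown fox jumped over the lazy dog.", 4):
        print(ngram)
```

For the sample sentence, which contains no repeated spaces, this should yield the same windows as findNGrams. The slice tokens[i:i + n] replaces the inner loop that appends one unigram at a time.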