Last active
November 5, 2015 23:12
get all possible n-grams from an iterable
def get_ngrams(iterable):
    # despite the parameter name, the argument must support len() and slicing
    # (e.g. a str, list, or tuple)
    length = len(iterable)
    if length == 1:
        # a single-element input has exactly one n-gram: the element itself
        yield iterable[0]
        return
    # the 'starting position' loop
    for n in range(length):
        # the 'skip step' loop
        for step in range(1, length):
            # make substrings starting at position 'n',
            # ending at each step-aligned position up to the end of the input
            for index in range(n, length, step):
                # the slice end is exclusive, so end one past the current position
                index += 1
                # don't emit duplicates: if the step is greater than one
                # and also greater than the span of the would-be substring,
                # the slice holds a single element already emitted at step 1
                if step > 1 and index - n < step:
                    continue
                ngram = iterable[n:index:step]
                yield ngram
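A short usage sketch may help show what the generator emits. The function is restated compactly here only so that the snippet runs on its own; the variable names `string_ngrams`, `words`, and `word_ngrams` are illustrative and not part of the gist. Note that the output includes evenly spaced "skip" slices such as `'ac'` as well as contiguous ones, and that a true generator has to be materialized into a sequence first, because the function relies on `len()` and slicing.

```python
def get_ngrams(iterable):
    """Compact restatement of the gist's generator (same logic as above)."""
    length = len(iterable)
    if length == 1:
        yield iterable[0]
        return
    for n in range(length):               # starting position
        for step in range(1, length):     # skip step
            for index in range(n, length, step):
                index += 1                # slice end is exclusive
                # for step > 1, skip one-element slices already emitted at step 1
                if step > 1 and index - n < step:
                    continue
                yield iterable[n:index:step]


# Strings work directly because they support len() and slicing.
string_ngrams = list(get_ngrams("abc"))
print(string_ngrams)  # ['a', 'ab', 'abc', 'ac', 'b', 'bc', 'c']

# A generator expression has no len(), so convert it to a tuple first.
words = (w.lower() for w in ["The", "Quick", "Brown"])
word_ngrams = list(get_ngrams(tuple(words)))
print(word_ngrams)
```

With a tuple as input, each n-gram comes back as a tuple slice, for example `('the', 'brown')` for the step-2 slice that starts at the first word.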