@fagci
Created October 27, 2021 18:02
Extracts the most frequent character bigrams (or n-grams) from the words of a text.
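#!/usr/bin/env python3
"""Print the most frequent character n-grams (bigrams by default) in a text."""
from collections import Counter
from re import findall
from sys import argv


def main(text, top, n=2):
    ngrams = []
    for word in findall(r'\w+', text.lower()):
        wlen = len(word)
        if wlen >= n:
            # The set counts each n-gram at most once per word.
            ngrams.extend({word[i:i + n] for i in range(wlen - n + 1)})
    print([k for k, _ in Counter(ngrams).most_common(top)])


if __name__ == '__main__':
    # argv values are strings, so convert the optional count to int.
    main(argv[1], int(argv[2]) if len(argv) == 3 else 300)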
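
To make the counting rule easier to inspect, here is a minimal sketch of the same idea as a function that returns the list instead of printing it. The name `top_ngrams` and its default arguments are hypothetical and not part of the gist; the logic is assumed to mirror the script above, including counting each n-gram at most once per word.

```python
from collections import Counter
from re import findall


def top_ngrams(text, top=300, n=2):
    """Return up to `top` most frequent character n-grams in `text`.

    Hypothetical helper, not part of the original gist.
    """
    counts = Counter()
    for word in findall(r'\w+', text.lower()):
        # A set keeps each n-gram at most once per word, as in the script above.
        # Words shorter than n give an empty range and contribute nothing.
        counts.update({word[i:i + n] for i in range(len(word) - n + 1)})
    return [gram for gram, _ in counts.most_common(top)]


if __name__ == '__main__':
    # 'ba', 'an' and 'na' each occur in both words; 'nd' and 'da' in one only.
    print(top_ngrams('banana bandana', 3))
```

For the input `'banana bandana'`, the three bigrams `ba`, `an` and `na` each get a count of 2, so they fill the top three slots. Their relative order is not guaranteed, because they tie and the per-word sets have no fixed iteration order.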