@inaz2
Last active August 29, 2015 14:02
Calculate character n-gram statistics
$ wget http://printkjv.ifbweb.com/AV_txt.zip
2014-06-09 15:30:46 (535 KB/s) - `AV_txt.zip' saved [1262589/1262589]
$ unzip AV_txt.zip
Archive: AV_txt.zip
inflating: AV1611Bible.txt
$ python ngram.py < AV1611Bible.txt
1 gram result:
414963 'e'
314449 't'
282870 'h'
261344 'a'
237507 'o'
226169 'n'
187795 's'
183418 'i'
165122 'r'
151045 'd'
122093 'l'
84381 'u'
82332 'f'
77908 'm'
72022 ','
63982 'w'
58666 'y'
54141 'c'
49803 'g'
44875 'b'
2 gram result:
155741 'th'
129983 'he'
64964 'nd'
64533 'an'
47281 'in'
46226 'er'
44532 'ha'
40756 're'
38494 'of'
33978 '\r\n'
33325 'hi'
32298 'at'
31541 'ou'
30126 'en'
29251 'to'
29128 'or'
26966 'al'
26301 'on'
26006 'll'
25170 'it'
3 gram result:
97918 'the'
45833 'and'
24210 '.\r\n'
17551 'all'
17147 'hat'
16184 'ing'
15591 'her'
14092 'tha'
13130 'for'
12877 'And'
12409 'sha'
11925 'hal'
11906 'ere'
11451 'his'
11172 '\r\n1'
11151 'nto'
10484 'unt'
10457 'hou'
10162 'ith'
7609 'not'
4 gram result:
12893 'that'
11395 'shal'
10502 'ther'
9927 'hall'
9025 'unto'
7872 '.\r\n1'
7075 'they'
7009 'with'
6937 'them'
6747 'here'
6583 'LORD'
6283 'thou'
5451 '.\r\n2'
4494 'hich'
4360 'whic'
4070 'heir'
4033 'said'
3981 'thei'
3957 'have'
3923 'will'
5 gram result:
9773 'shall'
4576 'there'
4360 'which'
3980 'their'
3451 'efore'
2606 'srael'
2606 'Israe'
2436 'ation'
2316 'again'
2296 'ather'
2262 'house'
2157 'peopl'
2157 'eople'
2039 'child'
2029 'thing'
1829 'hildr'
1829 'ldren'
1829 'ildre'
1820 'befor'
1786 'other'
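# ngram.py (Python 2: uses xrange and the print statement)
import sys
from collections import Counter


def print_ngram(data, n):
    # Count every n-character window that contains no space.
    c = Counter()
    for i in xrange(len(data) - (n - 1)):
        s = data[i:i+n]
        if ' ' not in s:
            c.update([s])
    # Print the 20 most common n-grams, one "count<TAB>repr" per line.
    print "%d gram result:" % n
    for s, count in c.most_common(20):
        print "%d\t%r" % (count, s)


# Read the whole text from stdin and report 1-grams through 5-grams.
data = sys.stdin.read()
for n in xrange(1, 6):
    print_ngram(data, n)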
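
The script above targets Python 2. A minimal Python 3 sketch of the same counting follows, as an illustration rather than the author's version. The function name `top_ngrams`, its `limit` parameter, and the inline sample string are hypothetical and not part of the original gist. It keeps the same rule: slide an n-character window over the text and skip any window that contains a space.

```python
from collections import Counter


def top_ngrams(data, n, limit=20):
    """Return the `limit` most common n-character windows with no space.

    `top_ngrams` and `limit` are illustrative names, not from the gist.
    """
    # Every window of length n, starting at each valid offset.
    windows = (data[i:i + n] for i in range(len(data) - n + 1))
    # Same filter as the original script: drop windows containing a space.
    counts = Counter(w for w in windows if ' ' not in w)
    return counts.most_common(limit)


# Small inline sample (an assumption for the demo, not the Bible text).
sample = "the cat and the hat"
for n in range(1, 4):
    print("%d gram result:" % n)
    for ngram, count in top_ngrams(sample, n, limit=3):
        print("%d\t%r" % (count, ngram))
```

To process a file as in the transcript, pass `sys.stdin.read()` as `data`. Python 3 text-mode stdin normally translates `\r\n` to `\n`, so entries involving line endings may not match the output shown above.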