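- Use the text in VOA.txt to build the language model (LM)
- Build a dictionary of n-grams and their counts (n = 1, 2); split the text by newline and lowercase it
- Compute Nr, the number of distinct n-grams that occur exactly r times, for unigrams and bigrams
- Compute the pairs (r, r*) for unigrams and bigrams: r* = (r+1) N_{r+1} / N_r for r < k, and r* = r for r >= k (k = 10)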
- Compute the normalization factor N / (N + k Nk)
- Compute the Good-Turing estimates of P(w) and P(w'|w)
- Compute the probability P(w1, w2, ..., wn) of each sentence (w1, w2, ..., wn) on the course webpage:
  P(w1, w2, ..., wn) = P(w1) x P(w2|w1) x P(w3|w2) x ... x P(wn|wn-1)
- Use a vocabulary size of V = 80,000 (needed for N0, the number of unseen n-grams)
- import math
- Use math.log10(P(w)) and math.log10(P(w'|w)) instead of P(w) and P(w'|w)
- Add log P instead of multiplying P
- Convert an integer to a float by casting, e.g., float(int1) returns int1 as a floating-point number
- Add
  from __future__ import division
  as the first line. Then x/y returns a reasonable approximation of the mathematical result of the division ("true division"), and x//y returns the floor ("floor division").
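The last few tips can be sketched as follows. This is only an illustration, assuming Python 3, where / is already true division (under Python 2 the __future__ import above is what enables it). The probabilities and counts are made-up values, not taken from VOA.txt.

```python
import math

# Hypothetical probabilities, chosen only for illustration.
probs = [0.001, 0.02, 0.5]

# Multiplying raw probabilities works here, but the product
# underflows towards 0.0 once a sentence gets long enough.
product = 1.0
for p in probs:
    product *= p

# Adding log10 probabilities gives the log10 of the same product.
log_sum = sum(math.log10(p) for p in probs)

# True division vs. floor division on integer counts.
count, total = 3, 4
true_ratio = count / total          # 0.75 (needs the __future__ import on Python 2)
floor_ratio = count // total        # 0
cast_ratio = float(count) / total   # 0.75 on both Python 2 and Python 3
```

Here log_sum and math.log10(product) agree up to floating-point rounding, which is why the sentence scores can be accumulated as sums.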
- ngram counts
  # unigram
  the 174276
  . 124709
  , 107373
  to 74496
  of 71847
  # bigram
  of the 17426
  in the 15846
  the united 8512
  , and 7457
  , the 7202
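One way such counts could be built is sketched below. It assumes tokens are separated by whitespace and that n-grams do not cross line boundaries. The name count_ngrams and the two toy lines are illustrative stand-ins, not the actual VOA.txt data.

```python
from collections import Counter

def count_ngrams(lines, n):
    """Count n-grams line by line; an n-gram never spans a newline."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()  # assumption: whitespace-separated tokens
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy stand-in for the lines of VOA.txt.
lines = ["This is a book .", "This is a pen ."]
unigrams = count_ngrams(lines, 1)
bigrams = count_ngrams(lines, 2)
```

For the real corpus, lines would presumably come from reading VOA.txt and splitting on newlines. unigrams.most_common(5) then lists the most frequent entries, and sum(unigrams.values()) gives N.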
- Nr
  Nr1: 0: 43037, 1: 12350, 2: 4864, 3: 2825, 4: 1989, 5: 1419
  Nr2: 0: 6399548570, 1: 249454, 2: 68452, 3: 34227, 4: 20665, 5: 13641
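A possible sketch of the Nr computation follows. It assumes that N0 is the number of possible n-grams, V ** n, minus the number of distinct n-grams actually seen; the notes do not state this, but it seems to fit V = 80,000. The function name and the toy counts are hypothetical.

```python
from collections import Counter

def count_of_counts(ngram_counts, n, vocab_size):
    """Return Nr: how many distinct n-grams occur exactly r times."""
    nr = Counter(ngram_counts.values())
    # Assumption: every one of the vocab_size ** n possible n-grams
    # that never occurred contributes to N0.
    nr[0] = vocab_size ** n - len(ngram_counts)
    return nr

# Toy unigram counts with a hypothetical vocabulary of 10 words.
toy_counts = {("a",): 3, ("b",): 1, ("c",): 1, ("d",): 2}
nr = count_of_counts(toy_counts, 1, 10)
```

In the toy data two words occur once, one occurs twice, one occurs three times, and six of the ten vocabulary words are unseen.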
- r*
  r* unigram: 0: 0.2869623812068685, 1: 0.7876923076923077, 2: 1.742393092105263, 3: 2.816283185840708, 4: 3.5671191553544497, 5: 4.558139534883721, 6: 5.681818181818182, 7: 6.153142857142857, 8: 7.863298662704309, 9: 9.200680272108844, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, ...
  r* bigram: 0: 3.8979936986399026e-05, 1: 0.5488146111106657, 2: 1.5000438263308595, 3: 2.415052443976977, 4: 3.3005081054923786, 5: 4.260391466901254, 6: 5.099318604170969, 7: 6.272108843537415, 8: 7.145336225596529, 9: 8.039617486338798, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, ...
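The adjusted counts can be sketched as below, using r* = (r+1) N_{r+1} / N_r for r < k and r* = r otherwise. The function name r_star is hypothetical. The Nr values are the unigram values listed in the sample output; only r = 0..5 are given there, so only r* for r = 0..4 can be reproduced from them.

```python
def r_star(r, nr, k=10):
    """Good-Turing adjusted count; counts of k or more are left unchanged."""
    if r >= k:
        return r
    # float() keeps this a true division even without the __future__ import.
    return (r + 1) * float(nr[r + 1]) / nr[r]

# Unigram Nr values copied from the sample output (r = 0..5 only).
nr1 = {0: 43037, 1: 12350, 2: 4864, 3: 2825, 4: 1989, 5: 1419}
adjusted = {r: r_star(r, nr1) for r in range(5)}
```

With these inputs adjusted[0] is 12350 / 43037, which matches the first r* value listed for unigrams. The sketch does not handle the case where some Nr with r < k is zero.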
- N
  N unigram: 2914867
  N bigram: 2789569
- normalization factor
  norm fact unigram: 0.998147436014
  norm fact bigram: 0.987500349572
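The factor N / (N + k Nk) can be checked on a small example: summing r* x Nr over all r gives N + k Nk, so scaling by the factor brings the total probability mass back to 1. The sketch below uses a made-up Nr table with k = 2, not the VOA.txt values, and assumes each n-gram seen r times gets probability norm x r* / N.

```python
def r_star(r, nr, k):
    """Good-Turing adjusted count; counts of k or more are left unchanged."""
    if r >= k:
        return r
    return (r + 1) * float(nr[r + 1]) / nr[r]

def normalization_factor(nr, k):
    """Return N / (N + k * Nk), where N is the total number of n-gram tokens."""
    n_total = sum(r * n for r, n in nr.items())
    return float(n_total) / (n_total + k * nr[k])

# Hypothetical count-of-counts table, for illustration only.
nr = {0: 5, 1: 4, 2: 2, 3: 1}
k = 2
n_total = sum(r * n for r, n in nr.items())   # 1*4 + 2*2 + 3*1 = 11
norm = normalization_factor(nr, k)            # 11 / (11 + 2*2)

# Total probability mass over all n-grams, seen and unseen.
mass = sum(n * norm * r_star(r, nr, k) / n_total for r, n in nr.items())
```

Here mass comes out as 1 up to floating-point rounding, which is the point of the normalization.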
- P(w), P(w, w'), P(w'|w) (values are log10 probabilities)
  P('this',) = -2.7286292930534612
  P('this', 'is') = -3.6509704908731844
  P('is'| 'this') = -0.9223411978197231
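The three sample values are consistent with log10 P(w'|w) = log10 P(w, w') - log10 P(w). The sketch below checks that, and also shows one plausible form of the estimate itself, norm x r* / N. That form is an assumption based on the normalization step, and both function names are hypothetical.

```python
import math

def log_prob(r_star, n_total, norm):
    """log10 of the assumed Good-Turing estimate norm * r* / N."""
    return math.log10(norm * float(r_star) / n_total)

def log_cond_prob(log_p_bigram, log_p_unigram):
    """log10 P(w'|w) = log10 P(w, w') - log10 P(w)."""
    return log_p_bigram - log_p_unigram

# log10 values copied from the sample output.
log_p_this = -2.7286292930534612
log_p_this_is = -3.6509704908731844
log_p_is_given_this = log_cond_prob(log_p_this_is, log_p_this)
```

The result agrees with the listed value for P('is'| 'this') up to floating-point rounding.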
- LM (log10 probability of each sentence)
  P(this is a book .) = -9.63062529859
  P(this is an book .) = -15.5540223618
  P(she can speak english .) = -11.7383028302
  P(she can say english .) = -17.5952641259
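Scoring a sentence then amounts to adding log10 P(w1) and the log10 P(wi|wi-1) terms. The sketch below assumes the log probabilities are already available in two lookup tables; the function name and the table values are made up for illustration and are not the VOA.txt estimates.

```python
def sentence_log_prob(words, log_p_unigram, log_p_cond):
    """log10 P(w1..wn) = log10 P(w1) + sum of log10 P(wi | wi-1)."""
    total = log_p_unigram[words[0]]
    for prev, cur in zip(words, words[1:]):
        total += log_p_cond[(prev, cur)]
    return total

# Hypothetical log10 probabilities, for illustration only.
log_p_unigram = {"this": -2.0}
log_p_cond = {("this", "is"): -1.0, ("is", "a"): -1.5}

words = "This is a".lower().split()
score = sentence_log_prob(words, log_p_unigram, log_p_cond)  # -2.0 + -1.0 + -1.5
```

A full model would fall back to the r = 0 estimate for n-grams missing from the tables; this sketch simply raises KeyError in that case.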