NLP Lab2

steps

  1. Use the text in VOA.txt to build the LM
  2. Build a dictionary of ngrams and their counts (n = 1, 2) (split by newline, lowercase)
  3. Compute Nr, the number of distinct ngram types occurring exactly r times, for 1-grams and 2-grams
  4. Compute (r, r*) for 1-grams and 2-grams (k = 10)
  5. Compute the normalization factor N / (N + k * Nk)
  6. Good-Turing estimation of P(w) and P(w'|w)
  7. Compute the probability P(w1, w2, ..., wn) of each sentence (w1, w2, ..., wn) on the course webpage, as sketched in the code below
P(w1, w2, ..., wn) = P(w1) x P(w2|w1) x P(w3|w2) x ... x P(wn|wn-1)
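
A minimal sketch of steps 1-4 (written for Python 3, so the __future__ import mentioned in the notes below is unnecessary; function names such as build_counts and r_star are my own, and VOA.txt is assumed to be pre-tokenized, one sentence per line):

    from collections import Counter

    K = 10      # Good-Turing cutoff (step 4)
    V = 80000   # vocabulary size (note 1)

    def build_counts(path='VOA.txt'):
        """Step 2: count lowercased unigrams and bigrams, one sentence per line."""
        unigrams, bigrams = Counter(), Counter()
        with open(path) as f:
            for line in f:
                words = line.lower().split()
                unigrams.update(words)
                bigrams.update(zip(words, words[1:]))
        return unigrams, bigrams

    def build_Nr(counts, n_types):
        """Step 3: Nr[r] = number of distinct ngram types seen exactly r times.
        n_types is V for unigrams and V**2 for bigrams, so Nr[0] counts unseen types."""
        Nr = Counter(counts.values())
        Nr[0] = n_types - len(counts)
        return Nr

    def r_star(r, Nr, k=K):
        """Step 4: discount low counts only: r* = (r+1) * Nr[r+1] / Nr[r] for r < k, else r."""
        return (r + 1) * Nr[r + 1] / Nr[r] if r < k else r

    unigrams, bigrams = build_counts()
    Nr_uni, Nr_bi = build_Nr(unigrams, V), build_Nr(bigrams, V ** 2)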

note

  1. V = 80,000
  2. import math
  3. Use math.log10(P(w)) and math.log10(P(w'|w)) instead of P(w) and P(w'|w)
  4. Add log P values instead of multiplying P values, to avoid floating-point underflow (see the sketch after these notes)
  5. Convert an integer to a float by casting, e.g., float(int1) is the floating-point value of int1
  6. Add from __future__ import division as the first line, so that x/y returns a reasonable approximation of the mathematical result of the division ("true division") and x//y returns the floor ("floor division")
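
Continuing the sketch above, steps 5-7 work in log space as the notes suggest (gt_logprob and sentence_logprob are again my own names):

    import math

    def gt_logprob(counts, Nr, k=K):
        """Steps 5-6: log10 of the normalized Good-Turing probability
        P = (r*/N) * N / (N + k * Nk)."""
        N = sum(counts.values())             # total ngram tokens
        norm = N / (N + k * Nr[k])           # normalization factor (step 5)
        def logP(ngram):
            r = counts[ngram]                # Counter returns 0 for unseen ngrams
            return math.log10(r_star(r, Nr, k) / N * norm)
        return logP

    logP_uni = gt_logprob(unigrams, Nr_uni)  # argument is a word
    logP_bi = gt_logprob(bigrams, Nr_bi)     # argument is a (w, w') tuple

    def sentence_logprob(sentence):
        """Step 7: log10 P(w1..wn) = log10 P(w1) + sum of log10 P(wi|wi-1),
        using log10 P(w'|w) = log10 P(w, w') - log10 P(w)."""
        words = sentence.lower().split()
        logp = logP_uni(words[0])
        for w, w_next in zip(words, words[1:]):
            logp += logP_bi((w, w_next)) - logP_uni(w)
        return logp

    print(sentence_logprob('this is a book .'))  # result 7 below reports -9.63062529859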

results of each step

  1. ngram counts

     # unigram
     the 	174276
     . 	124709
     , 	107373
     to 	74496
     of 	71847
    
     # bigram
     of the 	17426
     in the 	15846
     the united 	8512
     , and 	7457
     , the 	7202
    
  2. Nr

    Nr1: 0: 43037,      1: 12350,  2: 4864,  3: 2825,  4: 1989,  5: 1419
    Nr2: 0: 6399548570, 1: 249454, 2: 68452, 3: 34227, 4: 20665, 5: 13641
    
  3. r*

    r* unigram: 
    	0: 0.2869623812068685, 1: 0.7876923076923077, 2: 1.742393092105263, 3: 2.816283185840708, 4: 3.5671191553544497, 5: 4.558139534883721, 6: 5.681818181818182, 7: 6.153142857142857, 8: 7.863298662704309, 9: 9.200680272108844, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, ...
    r* bigram: 
    	0: 3.8979936986399026e-05, 1: 0.5488146111106657, 2: 1.5000438263308595, 3: 2.415052443976977, 4: 3.3005081054923786, 5: 4.260391466901254, 6: 5.099318604170969, 7: 6.272108843537415, 8: 7.145336225596529, 9: 8.039617486338798, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, ...
    	
    
  4. N (total ngram tokens)

    N unigram: 2914867
    N bigram: 2789569
    
  5. normalization factor

    norm fact unigram:  0.998147436014
    norm fact bigram:  0.987500349572
    
  6. P(w), P(w, w'), P(w'|w)

    P('this') = -2.7286292930534612
    
    P('this', 'is') = -3.6509704908731844
    
    P('is'|'this') = -0.9223411978197231
    
    
  7. LM

    P(this is a book .) = -9.63062529859
    P(this is an book .) = -15.5540223618
    P(she can speak english .) = -11.7383028302
    P(she can say english .) = -17.5952641259
    
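A quick consistency check on these numbers: the r* values in result 3 follow from the Nr values in result 2 via r* = (r+1) * Nr+1 / Nr, e.g. for unigrams r*(0) = N1/N0 = 12350/43037 = 0.28696... and r*(1) = 2 * 4864/12350 = 0.78769...; and result 6 obeys the conditional-probability identity, since log10 P('is'|'this') = log10 P('this','is') - log10 P('this') = -3.65097 - (-2.72863) = -0.92234.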