NLP Lab2

steps

  1. Use the text in VOA.txt to build the LM
  2. Build a dictionary of ngrams and their counts (n = 1, 2) (split by newline, lowercase)
  3. Compute Nr, the number of distinct ngram types occurring exactly r times, for 1-grams and 2-grams
  4. Compute (r, r*) for 1-grams and 2-grams (k = 10)
  5. Compute the normalization factor N / (N + k * Nk)
  6. Good-Turing estimation of P(w) and P(w'|w)
  7. Compute the probability P(w1, w2, ..., wn) of each sentence (w1, w2, ..., wn) on the course webpage, as sketched in the code below
P(w1, w2, ..., wn) = P(w1) x P(w2|w1) x P(w3|w2) x ... x P(wn|wn-1)
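
A minimal sketch of steps 1-4 (written for Python 3, so the __future__ import mentioned in the notes below is unnecessary; function names such as build_counts and r_star are my own, and VOA.txt is assumed to be pre-tokenized, one sentence per line):

    from collections import Counter

    K = 10      # Good-Turing cutoff (step 4)
    V = 80000   # vocabulary size (note 1)

    def build_counts(path='VOA.txt'):
        """Step 2: count lowercased unigrams and bigrams, one sentence per line."""
        unigrams, bigrams = Counter(), Counter()
        with open(path) as f:
            for line in f:
                words = line.lower().split()
                unigrams.update(words)
                bigrams.update(zip(words, words[1:]))
        return unigrams, bigrams

    def build_Nr(counts, n_types):
        """Step 3: Nr[r] = number of distinct ngram types seen exactly r times.
        n_types is V for unigrams and V**2 for bigrams, so Nr[0] counts unseen types."""
        Nr = Counter(counts.values())
        Nr[0] = n_types - len(counts)
        return Nr

    def r_star(r, Nr, k=K):
        """Step 4: discount low counts only: r* = (r+1) * Nr[r+1] / Nr[r] for r < k, else r."""
        return (r + 1) * Nr[r + 1] / Nr[r] if r < k else r

    unigrams, bigrams = build_counts()
    Nr_uni, Nr_bi = build_Nr(unigrams, V), build_Nr(bigrams, V ** 2)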

note

  1. V = 80,000
  2. import math
  3. Use math.log10(P(w)) and math.log10(P(w'|w)) instead of P(w) and P(w'|w)
  4. Add log P values instead of multiplying P values, to avoid floating-point underflow (see the sketch after these notes)
  5. Convert an integer to a float by casting, e.g., float(int1) is the floating-point value of int1
  6. Add from __future__ import division as the first line, so that x/y returns a reasonable approximation of the mathematical result of the division ("true division") and x//y returns the floor ("floor division")
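
Continuing the sketch above, steps 5-7 work in log space as the notes suggest (gt_logprob and sentence_logprob are again my own names):

    import math

    def gt_logprob(counts, Nr, k=K):
        """Steps 5-6: log10 of the normalized Good-Turing probability
        P = (r*/N) * N / (N + k * Nk)."""
        N = sum(counts.values())             # total ngram tokens
        norm = N / (N + k * Nr[k])           # normalization factor (step 5)
        def logP(ngram):
            r = counts[ngram]                # Counter returns 0 for unseen ngrams
            return math.log10(r_star(r, Nr, k) / N * norm)
        return logP

    logP_uni = gt_logprob(unigrams, Nr_uni)  # argument is a word
    logP_bi = gt_logprob(bigrams, Nr_bi)     # argument is a (w, w') tuple

    def sentence_logprob(sentence):
        """Step 7: log10 P(w1..wn) = log10 P(w1) + sum of log10 P(wi|wi-1),
        using log10 P(w'|w) = log10 P(w, w') - log10 P(w)."""
        words = sentence.lower().split()
        logp = logP_uni(words[0])
        for w, w_next in zip(words, words[1:]):
            logp += logP_bi((w, w_next)) - logP_uni(w)
        return logp

    print(sentence_logprob('this is a book .'))  # result 7 below reports -9.63062529859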

results of each step

  1. ngram counts

     # unigram
     the 	174276
     . 	124709
     , 	107373
     to 	74496
     of 	71847
    
     # bigram
     of the 	17426
     in the 	15846
     the united 	8512
     , and 	7457
     , the 	7202
    
  2. Nr

    Nr1: 0: 43037,      1: 12350,  2: 4864,  3: 2825,  4: 1989,  5: 1419
    Nr2: 0: 6399548570, 1: 249454, 2: 68452, 3: 34227, 4: 20665, 5: 13641
    
  3. r*

    r* unigram: 
    	0: 0.2869623812068685, 1: 0.7876923076923077, 2: 1.742393092105263, 3: 2.816283185840708, 4: 3.5671191553544497, 5: 4.558139534883721, 6: 5.681818181818182, 7: 6.153142857142857, 8: 7.863298662704309, 9: 9.200680272108844, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, ...
    r* bigram: 
    	0: 3.8979936986399026e-05, 1: 0.5488146111106657, 2: 1.5000438263308595, 3: 2.415052443976977, 4: 3.3005081054923786, 5: 4.260391466901254, 6: 5.099318604170969, 7: 6.272108843537415, 8: 7.145336225596529, 9: 8.039617486338798, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, ...
    	
    
  4. N (total ngram tokens)

    N unigram: 2914867
    N bigram: 2789569
    
  5. normalization factor

    norm fact unigram:  0.998147436014
    norm fact bigram:  0.987500349572
    
  6. P(w), P(w, w'), P(w'|w)

    P('this') = -2.7286292930534612
    
    P('this', 'is') = -3.6509704908731844
    
    P('is'|'this') = -0.9223411978197231
    
    
  7. LM

    P(this is a book .) = -9.63062529859
    P(this is an book .) = -15.5540223618
    P(she can speak english .) = -11.7383028302
    P(she can say english .) = -17.5952641259
    
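A quick consistency check on these numbers: the r* values in result 3 follow from the Nr values in result 2 via r* = (r+1) * Nr+1 / Nr, e.g. for unigrams r*(0) = N1/N0 = 12350/43037 = 0.28696... and r*(1) = 2 * 4864/12350 = 0.78769...; and result 6 obeys the conditional-probability identity, since log10 P('is'|'this') = log10 P('this','is') - log10 P('this') = -3.65097 - (-2.72863) = -0.92234.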