Viterbi decoding is one of the most important techniques for sequence tagging. In natural language processing, Viterbi decoding usually plays the role of finding the tagging of a sequence with the maximum likelihood, for tasks such as part-of-speech (POS) tagging and named entity recognition (NER). I will describe how it works by predicting POS tags on simple sentences, using maximum-likelihood estimates from a first-order (bigram) hidden Markov model (HMM).
First of all, let's create some sentences, with each element consisting of a word and its corresponding POS tag separated by '/':
sentences = [
"the/DT can/NN is/VB in/IN the/DT shed/NN",
"the/DT dog/NN can/VB see/VB the/DT cat/NN"
]
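A quick illustration of the token format: each whitespace-separated token splits on '/' into a (word, tag) pair.

```python
sentences = [
    "the/DT can/NN is/VB in/IN the/DT shed/NN",
    "the/DT dog/NN can/VB see/VB the/DT cat/NN"
]

# the first token of the first sentence splits into a word and its tag
first_token = sentences[0].split()[0]
word, tag = first_token.split('/')
print(word, tag)  # the DT
```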
Next, count the bigrams of tags, the unigrams of tags, and the co-occurrences of word and tag. Note that in the following code, '*' and '$' stand for the virtual start and end symbols of the sequence, respectively. There are no words associated with these symbols.
pos = [[word.split('/') for word in sentence.split()] for sentence in sentences]
c_bigram_tags, c_unigram_tag, c_word_tag = {}, {}, {}
for sentence in pos:
    for i, (word, tag) in enumerate(sentence):
        if i == 0:
            # transition from the virtual start symbol to the first tag
            c_bigram_tags[('*', tag)] = c_bigram_tags.get(('*', tag), 0) + 1
        else:
            prev_tag = sentence[i-1][1]
            c_bigram_tags[(prev_tag, tag)] = c_bigram_tags.get((prev_tag, tag), 0) + 1
        c_unigram_tag[tag] = c_unigram_tag.get(tag, 0) + 1
        c_word_tag[(word, tag)] = c_word_tag.get((word, tag), 0) + 1
    # transition from the last tag to the virtual end symbol, stored with '$' first
    c_bigram_tags[('$', tag)] = c_bigram_tags.get(('$', tag), 0) + 1
    c_unigram_tag['*'] = c_unigram_tag.get('*', 0) + 1
# '*' is not a real tag, so exclude it from the candidate tags
tags = [tag for tag in c_unigram_tag if tag != '*']
We want to return 0 for any keys that are not in these counts.
def count_bigram_tag(prev_tag, tag):
    return c_bigram_tags.get((prev_tag, tag), 0)

def count_unigram_tag(tag):
    return c_unigram_tag.get(tag, 0)

def count_word_tag(word, tag):
    return c_word_tag.get((word, tag), 0)
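As a quick sanity check on the maximum-likelihood estimates these counts yield, here is a standalone recomputation with collections.Counter (self-contained so it runs on its own; the variable names here are illustrative, not part of the code above):

```python
from collections import Counter

sentences = [
    "the/DT can/NN is/VB in/IN the/DT shed/NN",
    "the/DT dog/NN can/VB see/VB the/DT cat/NN"
]
c_bigram, c_unigram, c_word = Counter(), Counter(), Counter()
for sentence in sentences:
    prev = '*'
    c_unigram['*'] += 1
    for token in sentence.split():
        word, tag = token.split('/')
        c_bigram[(prev, tag)] += 1
        c_unigram[tag] += 1
        c_word[(word, tag)] += 1
        prev = tag
    c_bigram[('$', prev)] += 1  # same convention as above: the end symbol is stored first

# maximum-likelihood transition and emission estimates
p_nn_after_dt = c_bigram[('DT', 'NN')] / c_unigram['DT']  # 4/4: DT is always followed by NN
p_can_given_nn = c_word[('can', 'NN')] / c_unigram['NN']  # 1/4: 'can' is one of four NN tokens
print(p_nn_after_dt, p_can_given_nn)  # 1.0 0.25
```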
The idea of HMM is based on generative learning. Different from discriminative learning, which models the conditional probability of the label y given the input x, i.e., P(y|x), generative learning models the joint probability of y and x, i.e., P(x, y). Applying Bayes' rule to rewrite the joint probability, P(x, y) = P(y|x) P(x), we can see that generative learning in some way reaches the same form as discriminative learning. Note that, to find the y* that maximizes P(y|x), P(x) is irrelevant:

y* = argmax_y P(y|x) = argmax_y P(x, y) / P(x) = argmax_y P(x, y)
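A toy numeric illustration of that last point (the numbers here are made up, not derived from the example sentences): dividing every joint probability by the same P(x) cannot change which label wins the argmax.

```python
# assumed toy joint distribution P(x, y) for a single observed x and two candidate labels
joint = {'NN': 0.03, 'VB': 0.01}
p_x = sum(joint.values())                      # P(x) = P(x, NN) + P(x, VB) = 0.04
cond = {y: p / p_x for y, p in joint.items()}  # P(y|x) = P(x, y) / P(x)

best_joint = max(joint, key=joint.get)
best_cond = max(cond, key=cond.get)
print(best_joint, best_cond)  # NN NN
```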
In the case of sequence labeling, e.g., POS tagging, HMM looks to maximize the joint probability of the sequence of words x_1, ..., x_n and the sequence of tags y_1, ..., y_n, i.e., to maximize P(x_1, ..., x_n, y_0, y_1, ..., y_n, y_{n+1}); note that y_0 and y_{n+1} represent the virtual symbols '*' and '$' for the start and end of a sentence, respectively. Based on the chain rule and the definition of a first-order HMM, the joint probability can be rewritten and approximated as

P(x_1, ..., x_n, y_0, ..., y_{n+1}) ≈ ∏_{i=1..n+1} P(y_i | y_{i-1}) × ∏_{i=1..n} P(x_i | y_i)

P(y_i | y_{i-1}) is called the transition probability, while P(x_i | y_i) is called the emission probability.
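To make the factorization concrete, here is the joint probability of the first training sentence with its gold tags, multiplied out by hand; each factor is a transition or emission estimate read directly off the two training sentences.

```python
# P(the can is in the shed, * DT NN VB IN DT NN $)
p = ((2/2) * (4/4)    # P(DT|*)  * P(the|DT)
     * (4/4) * (1/4)  # P(NN|DT) * P(can|NN)
     * (2/4) * (1/3)  # P(VB|NN) * P(is|VB)
     * (1/3) * (1/1)  # P(IN|VB) * P(in|IN)
     * (1/1) * (4/4)  # P(DT|IN) * P(the|DT)
     * (4/4) * (1/4)  # P(NN|DT) * P(shed|NN)
     * (2/4))         # P($|NN)
print(p)  # 1/576 ≈ 0.0017361
```

This matches the predict score printed for this sentence in the results at the end.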
The way Viterbi decoding works is by using dynamic programming to keep track of the maximum probability at each position for each choice of current tag. If we define pi(i, t) as the maximum probability at position i when the current tag is t, and bp(i, t) as the tag of the previous position that achieves it, i.e., the backpointers, the essence of the Viterbi decoding algorithm is

pi(i, t) = max over t' of pi(i-1, t') × P(t | t') × P(x_i | t)
bp(i, t) = argmax over t' of pi(i-1, t') × P(t | t') × P(x_i | t)

Note that
- the initial value of pi at position 0 is set to 1, i.e., pi(0, '*') = 1
- at the final position, the only available tag to choose from is '$', and the emission probability is removed from the equation at this step, since there is no word here
- the backpointer at the final position points to the choice of tag at the end of the sentence, i.e., the last word. This is critical, as we can now backtrack the choice of tags by looking at the backpointers calculated for all previous positions.
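As a small hand-computed trace of the first step of the recurrence on "the can is in the shed": only DT survives at position 1, since the start-transition count c(*, t) and the word count c(the, t) are both zero for every other tag.

```python
# pi(1, t) = P(t|*) * P(the|t), with counts read off the two training sentences
pi_1 = {
    'DT': (2/2) * (4/4),  # both sentences start with DT, and 'the' is always tagged DT
    'NN': (0/2) * (0/4),
    'VB': (0/2) * (0/3),
    'IN': (0/2) * (0/1),
}
best = max(pi_1, key=pi_1.get)
print(best, pi_1[best])  # DT 1.0
```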
Now, given a sentence of words only (without tags), we can tag the sentence using Viterbi decoding.
def decode(sentence):
    pi = []  # pi[i][t]: maximum probability of a tag sequence ending at position i with tag t
    bp = []  # backpointers: the previous tag that yields the maximum probability for each current tag
    for i, word in enumerate(sentence):
        pi.append({})
        if i == 0:
            # at the first step there is no previous pi to consider:
            # transition probability from the start symbol '*' to tag,
            # times the emission probability of the word given the tag
            for tag in tags:
                pi[-1][tag] = (count_bigram_tag('*', tag) / count_unigram_tag('*')
                               * count_word_tag(word, tag) / count_unigram_tag(tag))
        else:
            bp.append({})
            for tag in tags:
                # find the previous tag that maximizes the likelihood of the current tag:
                # pi at the previous step, times the transition probability from the
                # previous tag, times the emission probability of the word given the tag
                temp = {}
                for prev_tag in tags:
                    temp[prev_tag] = (pi[-2][prev_tag]
                                      * count_bigram_tag(prev_tag, tag) / count_unigram_tag(prev_tag)
                                      * count_word_tag(word, tag) / count_unigram_tag(tag))
                best_prev, best_val = max(temp.items(), key=lambda x: x[1])
                pi[-1][tag] = best_val   # maximum likelihood with current tag = tag
                bp[-1][tag] = best_prev  # previous tag that leads to that maximum
    pi.append({})
    for tag in tags:
        # no emission probability here, since the virtual end symbol has no word attached to it
        pi[-1][tag] = pi[-2][tag] * count_bigram_tag('$', tag) / count_unigram_tag(tag)
    # maximum likelihood over tag sequences, and the tag at the final step that produces it
    last_tag, pi_val = max(pi[-1].items(), key=lambda x: x[1])
    bp_val = [last_tag]
    for b_p in bp[::-1]:  # backtrack the choice of tag at each step
        bp_val.insert(0, b_p[bp_val[0]])
    return pi_val, bp_val
Make predictions on the sentences:
for sentence in pos:
    sent = [word for word, _ in sentence]
    gold_tags = [tag for _, tag in sentence]  # do not shadow the global `tags` list used by decode
    pi_val, bp_val = decode(sent)
    print('sentence: ', sent)
    print('actual tags: ', gold_tags)
    print('predicted tags: ', bp_val)
    print('predict score: ', pi_val)
    print()
# result
# sentence: ['the', 'can', 'is', 'in', 'the', 'shed']
# actual tags: ['DT', 'NN', 'VB', 'IN', 'DT', 'NN']
# predicted tags: ['DT', 'NN', 'VB', 'IN', 'DT', 'NN']
# predict score: 0.001736111111111111
#
# sentence: ['the', 'dog', 'can', 'see', 'the', 'cat']
# actual tags: ['DT', 'NN', 'VB', 'VB', 'DT', 'NN']
# predicted tags: ['DT', 'NN', 'VB', 'VB', 'DT', 'NN']
# predict score: 0.00019290123456790122
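One practical note beyond this toy example: the score is a product of many probabilities, so it shrinks geometrically with sentence length and will underflow on long sequences. Real implementations therefore usually add log probabilities instead of multiplying; a minimal sketch using the per-step factors (transition × emission) of the first sentence:

```python
import math

# the seven per-step factors for "the can is in the shed" with tags DT NN VB IN DT NN
factors = [1.0, 1/4, 1/6, 1/3, 1.0, 1/4, 1/2]
log_score = sum(math.log(f) for f in factors)  # safe to accumulate on long sentences
score = math.exp(log_score)
print(score)  # ≈ 0.0017361, the same predict score as above
```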