Viterbi decoding is one of the most important techniques for sequence tagging. In natural language processing, Viterbi decoding usually plays the role of finding the tagging of a sequence with maximum likelihood, for tasks such as part-of-speech (POS) tagging and named entity recognition (NER). I will describe how it works by predicting POS tags on simple sentences, based on maximum likelihood estimates from a first order (bigram) hidden Markov model (HMM).

First of all, let's create some sentences, with each element consisting of a word and its corresponding POS tag separated by '/':

sentences = [
  "the/DT can/NN is/VB in/IN the/DT shed/NN",
  "the/DT dog/NN can/VB see/VB the/DT cat/NN"
]

Next, we compute the counts of tag bigrams, tag unigrams, and word-tag co-occurrences. Note that in the following code, '*' and '$' stand for the virtual start and end symbols of the sequence, respectively. There are no words associated with these symbols.

pos = [[word.split('/') for word in sentence.split()] for sentence in sentences]

c_bigram_tags, c_unigram_tags, c_word_tag = {}, {}, {}
for sentence in pos:
  for i, (word, tag) in enumerate(sentence):
    if i == 0:
      c_bigram_tags[('*', tag)] = c_bigram_tags.get(('*', tag), 0) + 1
    else:
      prev_tag = sentence[i-1][1]
      c_bigram_tags[(prev_tag, tag)] = c_bigram_tags.get((prev_tag, tag), 0) + 1
    c_unigram_tags[tag] = c_unigram_tags.get(tag, 0) + 1
    c_word_tag[(word, tag)] = c_word_tag.get((word, tag), 0) + 1
  last_tag = sentence[-1][1]
  c_bigram_tags[(last_tag, '$')] = c_bigram_tags.get((last_tag, '$'), 0) + 1 # transition from the last tag to the virtual end symbol
  c_unigram_tags['*'] = c_unigram_tags.get('*', 0) + 1 # one virtual start symbol per sentence

tags = [tag for tag in c_unigram_tags if tag != '*'] # candidate tags; the virtual start symbol is not a real tag

We want to return 0 for any keys that are not in these counts.

def count_bigram_tag(prev_tag, tag):
  return c_bigram_tags.get((prev_tag, tag), 0)
def count_unigram_tag(tag):
  return c_unigram_tags.get(tag, 0)
def count_word_tag(word, tag):
  return c_word_tag.get((word, tag), 0)
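
For example, with the two toy sentences above, these helpers should give something like the following (the numbers come from counting the tagged corpus by hand):

count_bigram_tag('*', 'DT')  # 2, both sentences start with a determiner
count_unigram_tag('DT')      # 4, 'the' appears twice per sentence as DT
count_word_tag('can', 'NN')  # 1, 'can' also occurs once as VB
count_bigram_tag('DT', 'VB') # 0, an unseen tag bigram falls back to 0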

The idea of HMM is based on generative learning. Different from discriminative learning, which is based on the conditional probability of the label $y$ given the word $x$, i.e., $p(y \mid x)$, generative learning is based on the joint probability of the label $y$ and the word $x$, i.e., $p(x, y)$. Applying the basic Bayes rule to rewrite the joint probability, we can see that generative learning in some way reaches the same form as discriminative learning.

$$p(x, y) = p(y \mid x)\, p(x)$$

$$\arg\max_y p(x, y) = \arg\max_y p(y \mid x)\, p(x)$$

$$\arg\max_y p(x, y) = \arg\max_y p(y \mid x)$$

Note that, to find the $y$ that maximizes $p(x, y)$, $p(x)$ is irrelevant, since it does not depend on $y$.

In the case of sequence labeling, e.g., POS tagging, HMM looks to maximize the joint probability of the sequence of words $x_1, \ldots, x_n$ and the sequence of tags $y_1, \ldots, y_n$, i.e., to maximize $p(x_1, \ldots, x_n, y_0, y_1, \ldots, y_n, y_{n+1})$. Note that $y_0$ and $y_{n+1}$ represent the virtual symbols '*' and '$' for the start and end of a sentence, respectively. Based on the Bayes rule and the definition of a first order HMM, the joint probability can be rewritten and approximated as,

$$p(x_1, \ldots, x_n, y_0, y_1, \ldots, y_{n+1}) = p(y_0, y_1, \ldots, y_{n+1}) \; p(x_1, \ldots, x_n \mid y_0, y_1, \ldots, y_{n+1})$$

$$p(y_0, y_1, \ldots, y_{n+1}) \approx \prod_{i=1}^{n+1} p(y_i \mid y_{i-1})$$

$$p(x_1, \ldots, x_n \mid y_0, y_1, \ldots, y_{n+1}) \approx \prod_{i=1}^{n} p(x_i \mid y_i)$$

$p(y_i \mid y_{i-1})$ is called the transition probability, while $p(x_i \mid y_i)$ is called the emission probability.
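
In code, these two probabilities are just ratios of the counts computed earlier. One way to write them is as small helpers; the decoder below inlines the same ratios directly, so these are only for illustration.

def transition_prob(prev_tag, tag):
  # maximum likelihood estimate of the transition probability p(tag | prev_tag)
  return count_bigram_tag(prev_tag, tag) / count_unigram_tag(prev_tag)

def emission_prob(word, tag):
  # maximum likelihood estimate of the emission probability p(word | tag)
  return count_word_tag(word, tag) / count_unigram_tag(tag)

# on the toy corpus above: transition_prob('*', 'DT') == 1.0, emission_prob('can', 'NN') == 0.25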

The way Viterbi decoding works is by using dynamic programming to keep track of the maximum probability at position $k$ when the choice of the current tag is $t$. If we define $\pi(k, t)$ as the maximum probability at position $k$ when the current tag is equal to $t$, and $bp(k, t)$ as the tag at the previous position $k-1$ that achieves that maximum, i.e., the backpointers, the essence of the Viterbi decoding algorithm is,

$$\pi(k, t) = \max_{t'} \; \pi(k-1, t') \, p(t \mid t') \, p(x_k \mid t)$$

$$bp(k, t) = \operatorname{arg\,max}_{t'} \; \pi(k-1, t') \, p(t \mid t') \, p(x_k \mid t)$$

Note that

  1. the initial value of $\pi$ at position 0 is set to 1, i.e., $\pi(0, *) = 1$
  2. at the final position, the only available tag to choose from is the virtual end symbol '$', and the emission probability is removed from the equation at this step, since there is no word at this position.
  3. the backpointer at the final position points to the choice of tag at the end of the sentence, i.e., for the last word. This is critical, as we can then backtrack the choice of tags by looking at the backpointers calculated for all previous positions.
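
For example, at the first position of either toy sentence the word is 'the', and with $\pi(0, *) = 1$ the recurrence gives

$$\pi(1, \mathrm{DT}) = \pi(0, *) \, p(\mathrm{DT} \mid *) \, p(\mathrm{the} \mid \mathrm{DT}) = 1 \cdot \tfrac{2}{2} \cdot \tfrac{4}{4} = 1$$

while every other tag gets $\pi(1, t) = 0$, because no other tag ever emits 'the' in the training sentences.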

Now, given a sentence of mere words (the observed symbols, without tags), we can tag the sentence using Viterbi decoding.

from six import iteritems

def decode(sentence):
  pi = [] # pi[k][tag]: maximum probability of any tag sequence ending at position k with the current tag
  bp = [] # backpointers that keep track of the previous tag that results in the current tag with maximum likelihood
  for i, word in enumerate(sentence):
    pi.append({})
    if i == 0: # at the first step, we do not need to consider a previous pi, since there has been nothing yet
      for tag in tags:
        pi[-1][tag] = (
          # transition probability from the start symbol '*' to tag
          count_bigram_tag('*', tag) / count_unigram_tag('*') *
          # emission probability: given tag, the probability of generating word
          count_word_tag(word, tag) / count_unigram_tag(tag)
        )
    else:
      bp.append({})
      for tag in tags:
        temp = {}
        for prev_tag in tags:
          # find the previous tag that maximizes the likelihood of the choice of the current tag
          temp[prev_tag] = (
            # pi at the previous step
            pi[-2][prev_tag] *
            # transition probability from the previous tag to tag
            count_bigram_tag(prev_tag, tag) / count_unigram_tag(prev_tag) *
            # emission probability: given tag, the probability of generating word
            count_word_tag(word, tag) / count_unigram_tag(tag)
          )
        best_prev, best_val = max(iteritems(temp), key=lambda x: x[1])
        pi[-1][tag] = best_val  # maximum likelihood with current tag = tag
        bp[-1][tag] = best_prev # previous tag that leads to the maximum likelihood with current tag = tag
  pi.append({})
  for tag in tags:
    # transition from tag to the virtual end symbol '$'; there is no emission probability,
    # since the virtual end symbol does not have any word attached to it
    pi[-1][tag] = pi[-2][tag] * count_bigram_tag(tag, '$') / count_unigram_tag(tag)
  # the best tag at the final step and the maximum likelihood of the whole tag sequence
  backpointer, pi_val = max(iteritems(pi[-1]), key=lambda x: x[1])
  bp_val = [backpointer]
  for b_p in bp[::-1]: # backtrack the choice of tags at each step that eventually yields pi_val
    backpointer = b_p[bp_val[0]]
    bp_val.insert(0, backpointer)
  return pi_val, bp_val

Make predictions on the sentences:

for sentence in pos:
  sent = [word for word, _ in sentence]
  gold_tags = [tag for _, tag in sentence] # do not reuse the name `tags`, which decode relies on globally
  pi, bp = decode(sent)
  print('sentence: ', sent)
  print('actual tags: ', gold_tags)
  print('predicted tags: ', bp)
  print('predict score: ', pi)
  print()
  
# result
# sentence: ['the', 'can', 'is', 'in', 'the', 'shed']
# actual tags: ['DT', 'NN', 'VB', 'IN', 'DT', 'NN']
# predicted tags: ['DT', 'NN', 'VB', 'IN', 'DT', 'NN']
# predict score: 0.001736111111111111
#
# sentence: ['the', 'dog', 'can', 'see', 'the', 'cat']
# actual tags: ['DT', 'NN', 'VB', 'VB', 'DT', 'NN']
# predicted tags: ['DT', 'NN', 'VB', 'VB', 'DT', 'NN']
# predict score: 0.00019290123456790122