Viterbi decoding is one of the most important techniques for sequence tagging. In natural language processing, Viterbi decoding usually plays the role of finding the tagging of a sequence with the maximum likelihood, for tasks such as part-of-speech (POS) tagging and named entity recognition (NER). I will describe how it works by predicting POS tags on simple sentences, using maximum-likelihood estimates from a first-order (bigram) hidden Markov model (HMM).
First of all, let's create some sentences, with each element consisting of a word and its corresponding POS tag separated by '/':
sentences = [
"the/DT can/NN is/VB in/IN the/DT shed/NN",
"the/DT dog/NN can/VB see/VB the/DT cat/NN"
]
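A quick illustration of the token format: each whitespace-separated token splits on '/' into a (word, tag) pair.

```python
sentences = [
    "the/DT can/NN is/VB in/IN the/DT shed/NN",
    "the/DT dog/NN can/VB see/VB the/DT cat/NN"
]

# the first token of the first sentence splits into a word and its tag
first_token = sentences[0].split()[0]
word, tag = first_token.split('/')
print(word, tag)  # the DT
```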
Next, count the bigrams of tags, the unigrams of tags, and the co-occurrences of word and tag. Note that in the following code, '*' and '$' stand for the virtual start and end symbols of the sequence, respectively. There are no words associated with these symbols.
pos = [[word.split('/') for word in sentence.split()] for sentence in sentences]
c_bigram_tags, c_unigram_tag, c_word_tag = {}, {}, {}
for sentence in pos:
    for i, (word, tag) in enumerate(sentence):
        if i == 0:
            # transition from the virtual start symbol to the first tag
            c_bigram_tags[('*', tag)] = c_bigram_tags.get(('*', tag), 0) + 1
        else:
            prev_tag = sentence[i-1][1]
            c_bigram_tags[(prev_tag, tag)] = c_bigram_tags.get((prev_tag, tag), 0) + 1
        c_unigram_tag[tag] = c_unigram_tag.get(tag, 0) + 1
        c_word_tag[(word, tag)] = c_word_tag.get((word, tag), 0) + 1
    # transition from the last tag to the virtual end symbol, stored with '$' first
    c_bigram_tags[('$', tag)] = c_bigram_tags.get(('$', tag), 0) + 1
    c_unigram_tag['*'] = c_unigram_tag.get('*', 0) + 1
# '*' is not a real tag, so exclude it from the candidate tags
tags = [tag for tag in c_unigram_tag if tag != '*']
We want to return 0 for any keys that are not in these counts.
def count_bigram_tag(prev_tag, tag):
    return c_bigram_tags.get((prev_tag, tag), 0)

def count_unigram_tag(tag):
    return c_unigram_tag.get(tag, 0)

def count_word_tag(word, tag):
    return c_word_tag.get((word, tag), 0)
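As a quick sanity check on the maximum-likelihood estimates these counts yield, here is a standalone recomputation with collections.Counter (self-contained so it runs on its own; the variable names here are illustrative, not part of the code above):

```python
from collections import Counter

sentences = [
    "the/DT can/NN is/VB in/IN the/DT shed/NN",
    "the/DT dog/NN can/VB see/VB the/DT cat/NN"
]
c_bigram, c_unigram, c_word = Counter(), Counter(), Counter()
for sentence in sentences:
    prev = '*'
    c_unigram['*'] += 1
    for token in sentence.split():
        word, tag = token.split('/')
        c_bigram[(prev, tag)] += 1
        c_unigram[tag] += 1
        c_word[(word, tag)] += 1
        prev = tag
    c_bigram[('$', prev)] += 1  # same convention as above: the end symbol is stored first

# maximum-likelihood transition and emission estimates
p_nn_after_dt = c_bigram[('DT', 'NN')] / c_unigram['DT']  # 4/4: DT is always followed by NN
p_can_given_nn = c_word[('can', 'NN')] / c_unigram['NN']  # 1/4: 'can' is one of four NN tokens
print(p_nn_after_dt, p_can_given_nn)  # 1.0 0.25
```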
The idea of HMM is based on generative learning. Different from discriminative learning, which models the conditional probability of the label y given the input x, i.e., P(y|x), generative learning models the joint probability of y and x, i.e., P(x, y). Applying Bayes' rule to rewrite the joint probability, P(x, y) = P(y|x) P(x), we can see that generative learning in some way reaches the same form as discriminative learning. Note that, to find the y* that maximizes P(y|x), P(x) is irrelevant:

y* = argmax_y P(y|x) = argmax_y P(x, y) / P(x) = argmax_y P(x, y)
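A toy numeric illustration of that last point (the numbers here are made up, not derived from the example sentences): dividing every joint probability by the same P(x) cannot change which label wins the argmax.

```python
# assumed toy joint distribution P(x, y) for a single observed x and two candidate labels
joint = {'NN': 0.03, 'VB': 0.01}
p_x = sum(joint.values())                      # P(x) = P(x, NN) + P(x, VB) = 0.04
cond = {y: p / p_x for y, p in joint.items()}  # P(y|x) = P(x, y) / P(x)

best_joint = max(joint, key=joint.get)
best_cond = max(cond, key=cond.get)
print(best_joint, best_cond)  # NN NN
```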
In the case of sequence labeling, e.g., POS tagging, HMM looks to maximize the joint probability of the sequence of words x_1, ..., x_n and the sequence of tags y_1, ..., y_n, i.e., to maximize P(x_1, ..., x_n, y_0, y_1, ..., y_n, y_{n+1}); note that y_0 and y_{n+1} represent the virtual symbols '*' and '$' for the start and end of a sentence, respectively. Based on the chain rule and the definition of a first-order HMM, the joint probability can be rewritten and approximated as

P(x_1, ..., x_n, y_0, ..., y_{n+1}) ≈ ∏_{i=1..n+1} P(y_i | y_{i-1}) × ∏_{i=1..n} P(x_i | y_i)

P(y_i | y_{i-1}) is called the transition probability, while P(x_i | y_i) is called the emission probability.
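To make the factorization concrete, here is the joint probability of the first training sentence with its gold tags, multiplied out by hand; each factor is a transition or emission estimate read directly off the two training sentences.

```python
# P(the can is in the shed, * DT NN VB IN DT NN $)
p = ((2/2) * (4/4)    # P(DT|*)  * P(the|DT)
     * (4/4) * (1/4)  # P(NN|DT) * P(can|NN)
     * (2/4) * (1/3)  # P(VB|NN) * P(is|VB)
     * (1/3) * (1/1)  # P(IN|VB) * P(in|IN)
     * (1/1) * (4/4)  # P(DT|IN) * P(the|DT)
     * (4/4) * (1/4)  # P(NN|DT) * P(shed|NN)
     * (2/4))         # P($|NN)
print(p)  # 1/576 ≈ 0.0017361
```

This matches the predict score printed for this sentence in the results at the end.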
The way Viterbi decoding works is by using dynamic programming to keep track of the maximum probability at each position for each choice of current tag. If we define pi(i, t) as the maximum probability at position i when the current tag is t, and bp(i, t) as the tag of the previous position that achieves it, i.e., the backpointers, the essence of the Viterbi decoding algorithm is

pi(i, t) = max over t' of pi(i-1, t') × P(t | t') × P(x_i | t)
bp(i, t) = argmax over t' of pi(i-1, t') × P(t | t') × P(x_i | t)

Note that
- the initial value of pi at position 0 is set to 1, i.e., pi(0, '*') = 1
- at the final position, the only available tag to choose from is '$', and the emission probability is removed from the equation at this step, since there is no word here
- the backpointer at the final position points to the choice of tag at the end of the sentence, i.e., the last word. This is critical, as we can now backtrack the choice of tags by looking at the backpointers calculated for all previous positions.
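As a small hand-computed trace of the first step of the recurrence on "the can is in the shed": only DT survives at position 1, since the start-transition count c(*, t) and the word count c(the, t) are both zero for every other tag.

```python
# pi(1, t) = P(t|*) * P(the|t), with counts read off the two training sentences
pi_1 = {
    'DT': (2/2) * (4/4),  # both sentences start with DT, and 'the' is always tagged DT
    'NN': (0/2) * (0/4),
    'VB': (0/2) * (0/3),
    'IN': (0/2) * (0/1),
}
best = max(pi_1, key=pi_1.get)
print(best, pi_1[best])  # DT 1.0
```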
Now, given a sentence of words only (without tags), we can tag the sentence using Viterbi decoding.
def decode(sentence):
    pi = []  # pi[i][t]: maximum probability of a tag sequence ending at position i with tag t
    bp = []  # backpointers: the previous tag that yields the maximum probability for each current tag
    for i, word in enumerate(sentence):
        pi.append({})
        if i == 0:
            # at the first step there is no previous pi to consider:
            # transition probability from the start symbol '*' to tag,
            # times the emission probability of the word given the tag
            for tag in tags:
                pi[-1][tag] = (count_bigram_tag('*', tag) / count_unigram_tag('*')
                               * count_word_tag(word, tag) / count_unigram_tag(tag))
        else:
            bp.append({})
            for tag in tags:
                # find the previous tag that maximizes the likelihood of the current tag:
                # pi at the previous step, times the transition probability from the
                # previous tag, times the emission probability of the word given the tag
                temp = {}
                for prev_tag in tags:
                    temp[prev_tag] = (pi[-2][prev_tag]
                                      * count_bigram_tag(prev_tag, tag) / count_unigram_tag(prev_tag)
                                      * count_word_tag(word, tag) / count_unigram_tag(tag))
                best_prev, best_val = max(temp.items(), key=lambda x: x[1])
                pi[-1][tag] = best_val   # maximum likelihood with current tag = tag
                bp[-1][tag] = best_prev  # previous tag that leads to that maximum
    pi.append({})
    for tag in tags:
        # no emission probability here, since the virtual end symbol has no word attached to it
        pi[-1][tag] = pi[-2][tag] * count_bigram_tag('$', tag) / count_unigram_tag(tag)
    # maximum likelihood over tag sequences, and the tag at the final step that produces it
    last_tag, pi_val = max(pi[-1].items(), key=lambda x: x[1])
    bp_val = [last_tag]
    for b_p in bp[::-1]:  # backtrack the choice of tag at each step
        bp_val.insert(0, b_p[bp_val[0]])
    return pi_val, bp_val
Make predictions on the sentences:
for sentence in pos:
    sent = [word for word, _ in sentence]
    gold_tags = [tag for _, tag in sentence]  # do not shadow the global `tags` list used by decode
    pi_val, bp_val = decode(sent)
    print('sentence: ', sent)
    print('actual tags: ', gold_tags)
    print('predicted tags: ', bp_val)
    print('predict score: ', pi_val)
    print()
# result
# sentence: ['the', 'can', 'is', 'in', 'the', 'shed']
# actual tags: ['DT', 'NN', 'VB', 'IN', 'DT', 'NN']
# predicted tags: ['DT', 'NN', 'VB', 'IN', 'DT', 'NN']
# predict score: 0.001736111111111111
#
# sentence: ['the', 'dog', 'can', 'see', 'the', 'cat']
# actual tags: ['DT', 'NN', 'VB', 'VB', 'DT', 'NN']
# predicted tags: ['DT', 'NN', 'VB', 'VB', 'DT', 'NN']
# predict score: 0.00019290123456790122
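One practical note beyond this toy example: the score is a product of many probabilities, so it shrinks geometrically with sentence length and will underflow on long sequences. Real implementations therefore usually add log probabilities instead of multiplying; a minimal sketch using the per-step factors (transition × emission) of the first sentence:

```python
import math

# the seven per-step factors for "the can is in the shed" with tags DT NN VB IN DT NN
factors = [1.0, 1/4, 1/6, 1/3, 1.0, 1/4, 1/2]
log_score = sum(math.log(f) for f in factors)  # safe to accumulate on long sentences
score = math.exp(log_score)
print(score)  # ≈ 0.0017361, the same predict score as above
```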