Entropy and WSD.

Let p(x) be the probability mass function of a random variable X over a discrete set of symbols X:

p(x) = P(X=x)

For example, if we toss two coins and count the number of heads, we have a random variable X with p(0) = 1/4, p(1) = 1/2 and p(2) = 1/4.

Note that the probability mass function sums to one: 1/4 + 1/2 + 1/4 = 1.
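A quick sanity check in code, just plugging in the coin-toss numbers from above (the dict name is only for illustration):

>>> coin_pmf = {0: 1/4.0, 1: 1/2.0, 2: 1/4.0}  # no. of heads -> probability
>>> sum(coin_pmf.values())
1.0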


In a WSD scenario, let x be any synset (sense/meaning/concept) of a word, let X be the word itself, and specify some probability distribution over the word's synsets.

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('dog')
[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]

So in the above case, wn.synsets('dog') = X and Synset('dog.n.01') = x1, Synset('frump.n.01') = x2, ...
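In code, the size of X is just the number of synsets:

>>> len(wn.synsets('dog'))
8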


The entropy (or self-information) is the average uncertainty of a single random variable:

H(p) = H(X) = -1 * sum( [p(xi)*log(p(xi),2) for xi in X] )

Note that entropy is used to quantify the random variable itself and not the individual outcomes that the random variable can take.

So in a WSD scenario, the entropy of a word can be measured, but not the entropy of a synset.


For example:

H('dog') =  -1 * sum( [p(xi)*log(p(xi),2) for xi in X] )
x1 = Synset('dog.n.01')
x2 = Synset('frump.n.01')
x3 = Synset('dog.n.03')
x4 = Synset('cad.n.01')
x5 = Synset('frank.n.02')
x6 = Synset('pawl.n.01')
x7 = Synset('andiron.n.01')
x8 = Synset('chase.v.01') 

So if we assume a uniform probability distribution, p(xi) = 1/8.0 = 0.125 and

H('dog') = -1 * sum([0.125*log(0.125,2)]*8)

And in code:

>>> from math import log
>>> -1 * sum([0.125*log(0.125,2)]*8)
3.0
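The same computation can be wrapped into a small helper (uniform_entropy is not part of NLTK; it's just a sketch that assumes a uniform distribution over a word's synsets):

>>> from math import log
>>> from nltk.corpus import wordnet as wn
>>> def uniform_entropy(word):
...     """Entropy of a word assuming each of its synsets is equally likely."""
...     senses = wn.synsets(word)
...     p = 1.0 / len(senses)  # uniform probability for each synset
...     return -1 * sum([p * log(p, 2) for _ in senses])
...
>>> uniform_entropy('dog')
3.0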

So here's another example (in pure code):

>>> wn.synsets('cat')
[Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset("cat-o'-nine-tails.n.01"), Synset('caterpillar.n.02'), Synset('big_cat.n.01'), Synset('computerized_tomography.n.01'), Synset('cat.v.01'), Synset('vomit.v.01')]
>>> len(wn.synsets('cat'))
10
>>> 1 / 10.0
0.1
>>> -1 * sum([0.1*log(0.1,2)]*10)
3.321928094887362

So given H('dog') = 3 and H('cat') ≈ 3.32, we can say that 'cat' has a higher entropy than 'dog', which means there is more uncertainty when it comes to deciding the correct sense for 'cat'.

Given a word with only a single meaning, you get H(X) = 0:

>>> wn.synsets('michael_jackson')
[Synset('jackson.n.02')]
>>> -1 * sum([1.0*log(1.0,2)]*1)
0.0

The problem comes when the scores generated by a WSD system don't sum to one, i.e. sum(prob(synset)) != 1, since the entropy formula assumes that sum(p(xi)) = 1.

In the case of UKB WSD, if it uses the original PageRank algorithm, the numbers generated for each synset don't sum to 1. They are not probabilities but ranking scores computed from a random walk over a graph.

So normalizing the rank scores for each word would be the first solution to attempt when trying to compute the entropy of a word. The problem is that a ranking algorithm doesn't care about absolute scores, so you might not get an exact H(X) estimate, but you can still claim that H('dog') < H('cat'), since the ranking of the entropies should correspond to the ranking of the summed scores.
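For example, here is a sketch of that normalization with made-up rank scores for one word's synsets (these numbers are hypothetical, not actual UKB output):

>>> from math import log
>>> scores = [0.8, 0.5, 0.2, 0.16, 0.14, 0.1, 0.06, 0.04]  # hypothetical rank scores, sum != 1
>>> total = sum(scores)
>>> probs = [s / total for s in scores]  # normalize so the scores behave like a pmf
>>> round(sum(probs), 10)
1.0
>>> round(-1 * sum([p * log(p, 2) for p in probs]), 3)
2.402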

So to verify the claim that entropy is correlated with UKB's PageRank scores, you have to check, for every pair of words X and Y, that one of the following conditions holds (see the sketch after the next paragraph):

  • sum([p(xi) for xi in X]) > sum([p(yi) for yi in Y]) and H(X) > H(Y) or
  • sum([p(xi) for xi in X]) < sum([p(yi) for yi in Y]) and H(X) < H(Y)

If there is any pair of words for which neither condition holds, then UKB's scores cannot be used to estimate entropy; you will have to go back to the original Pr = cMPr + (1 − c)v formula from the UKB paper and make sure that Pr returns a probability distribution for every vector v.
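As a rough sketch of that pairwise check (the helpers entropy_and_mass and consistent, and the score lists, are made up for illustration; they are not real UKB output):

>>> from math import log
>>> def entropy_and_mass(scores):
...     """Return (entropy of the normalized scores, sum of the raw scores)."""
...     mass = sum(scores)
...     probs = [s / mass for s in scores]  # normalize before taking entropy
...     return -1 * sum([p * log(p, 2) for p in probs]), mass
...
>>> def consistent(scores_x, scores_y):
...     """Check that the entropy ranking agrees with the raw score-mass ranking."""
...     h_x, m_x = entropy_and_mass(scores_x)
...     h_y, m_y = entropy_and_mass(scores_y)
...     return (m_x > m_y and h_x > h_y) or (m_x < m_y and h_x < h_y)
...
>>> scores_x = [0.8, 0.5, 0.2, 0.16, 0.14, 0.1, 0.06, 0.04]  # hypothetical scores for word X
>>> scores_y = [0.6, 0.2, 0.05, 0.04, 0.03, 0.03, 0.02, 0.01, 0.01, 0.01]  # hypothetical scores for word Y
>>> consistent(scores_x, scores_y)
True

In practice you would run consistent() over every pair of words in your UKB output (e.g. with itertools.combinations) and look for any pair where it returns False.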
