
Our dearest script, the one everyone uses in the modern #neuralempty revolution in machine translation, is multi-bleu.perl

But consider this:

alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl 
Use of uninitialized value $ARGV[0] in string eq at multi-bleu.perl line 11.
usage: multi-bleu.pl [-lc] reference < hypothesis
Reads the references from reference or reference0, reference1, ...
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ cat hyp.txt 
foo bar
alvas@ubi:~/git/mosesdecoder/scripts/generic$ cat ref.txt 
foo bar
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt
BLEU = 0.00, 100.0/100.0/0.0/0.0 (BP=1.000, ratio=1.000, hyp_len=2, ref_len=2)
alvas@ubi:~/git/mosesdecoder/scripts/generic$ echo 'foo bar' > hyp.txt
alvas@ubi:~/git/mosesdecoder/scripts/generic$ echo 'foo bar' > ref.txt
alvas@ubi:~/git/mosesdecoder/scripts/generic$ cat hyp.txt 
foo bar
alvas@ubi:~/git/mosesdecoder/scripts/generic$ cat ref.txt 
foo bar
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt
BLEU = 0.00, 100.0/100.0/0.0/0.0 (BP=1.000, ratio=1.000, hyp_len=2, ref_len=2)

What BLACK MAGIC is that?!

Yes, you're not seeing it wrong. Now let's look at a stabilized version of BLEU in NLTK, after a bunch of whack-a-mole with logarithms, exponentials and smoothing:

alvas@ubi:~/git/nltk$ python
Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.translate import bleu
>>> ref = hyp = 'foo bar'.split() # Note that the input in NLTK is a list of str.
>>> bleu([ref], hyp) # Note that NLTK allows multiple references, thus the list wrapping.
1.0

Okay, that's more reasonable...

But why didn't multi-bleu.perl work? Is it because it needs more than 1 sentence?

alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt 
BLEU = 0.00, 100.0/100.0/100.0/0.0 (BP=1.000, ratio=1.000, hyp_len=6, ref_len=6)

NOOOOOOOO!!!!

Let's try again with a longer sentence:

alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar bar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar bar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt 
BLEU = 100.00, 100.0/100.0/100.0/100.0 (BP=1.000, ratio=1.000, hyp_len=5, ref_len=5)

Oh yes! All is not lost!

Let's try again with 2 sentences:

alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt 
BLEU = 0.00, 100.0/100.0/100.0/0.0 (BP=1.000, ratio=1.000, hyp_len=6, ref_len=6)

Now you're doing the rabbit-out-of-the-hat thing! It's mirrors, there must be some mirror!! I know it, it's a trapdoor and there's a mirror!

Okay, so I'll do the big reveal thing. Now I feel like Penn Jillette.

Notice the 0.0s in the output, 100.0/100.0/0.0/0.0? Whenever any of the precisions is 0.0, the script will return `BLEU = 0.00`. Let's take a look at the code at https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl#L156

$bleu = $brevity_penalty * exp((my_log( $bleu[1] ) +
				my_log( $bleu[2] ) +
				my_log( $bleu[3] ) +
				my_log( $bleu[4] ) ) / 4) ;

and @bleu is actually an array indexed by the n-gram order, where each value is the precision of the n-gram overlaps at that order:

my $bleu = 0;
my @bleu=();
for(my $n=1;$n<=4;$n++) {
  if (defined ($TOTAL[$n])){
    $bleu[$n]=($TOTAL[$n])?$CORRECT[$n]/$TOTAL[$n]:0;
    # print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n";
  }else{
    $bleu[$n]=0;
  }
}
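In Python terms, that loop does roughly the following. This is only a sketch: the CORRECT/TOTAL counts below are hypothetical, filled in by hand for the "foo bar" vs "foo bar" example above.

```python
# Rough Python translation of the Perl loop above, with hypothetical
# CORRECT/TOTAL counts for the "foo bar" vs "foo bar" example:
# 1-grams: 2 of 2 match, 2-grams: 1 of 1 matches,
# 3-grams and 4-grams: a 2-token sentence has none at all.
CORRECT = {1: 2, 2: 1}
TOTAL = {1: 2, 2: 1}

bleu_n = {}
for n in range(1, 5):
    if TOTAL.get(n):
        bleu_n[n] = CORRECT[n] / TOTAL[n]
    else:
        bleu_n[n] = 0  # same fallback as the Perl code
print(bleu_n)  # {1: 1.0, 2: 1.0, 3: 0, 4: 0} -> the 100.0/100.0/0.0/0.0 line
```

The zeros for n=3 and n=4 are exactly the 0.0s we saw printed.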

But that still doesn't explain the zero BLEU!!!

Let's see what happens in Python when we take the exponential of a sum of logs:

>>> from math import exp, log
>>> exp(log(0.5) + log(0.4) + log(0.3) + log(0.2))
0.011999999999999999

What if we have a log(0.0)? (no, it's not an emoji...)

>>> exp(log(0.5) + log(0.4) + log(0.3) + log(0.0))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: math domain error

So it gives a math domain error, but how did NLTK give us a BLEU score?!

It's a long story; it started with a rough hack that explicitly checks for log(0.0): https://github.com/nltk/nltk/commits/develop/nltk/translate/bleu_score.py

But the real solution is to do some smart smoothing:

class SmoothingFunction:
    """
    This is an implementation of the smoothing techniques
    for segment-level BLEU scores that was presented in
    Boxing Chen and Colin Cherry (2014) A Systematic Comparison of
    Smoothing Techniques for Sentence-Level BLEU. In WMT14.
    http://acl2014.org/acl2014/W14-33/pdf/W14-3346.pdf
    """
    def __init__(self, epsilon=0.1, alpha=5, k=5):
        """
        This will initialize the parameters required for the various smoothing
        techniques, the default values are set to the numbers used in the
        experiments from Chen and Cherry (2014).
        >>> hypothesis1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which', 'ensures',
        ...                 'that', 'the', 'military', 'always', 'obeys', 'the',
        ...                 'commands', 'of', 'the', 'party']
        >>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that', 'ensures',
        ...               'that', 'the', 'military', 'will', 'forever', 'heed',
        ...               'Party', 'commands']
        >>> chencherry = SmoothingFunction()
        >>> print (sentence_bleu([reference1], hypothesis1)) # doctest: +ELLIPSIS
        0.4118...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method0)) # doctest: +ELLIPSIS
        0.4118...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method1)) # doctest: +ELLIPSIS
        0.4118...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method2)) # doctest: +ELLIPSIS
        0.4489...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method3)) # doctest: +ELLIPSIS
        0.4118...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method4)) # doctest: +ELLIPSIS
        0.4118...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method5)) # doctest: +ELLIPSIS
        0.4905...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method6)) # doctest: +ELLIPSIS
        0.4135...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method7)) # doctest: +ELLIPSIS
        0.4905...
        :param epsilon: the epsilon value use in method 1
        :type epsilon: float
        :param alpha: the alpha value use in method 6
        :type alpha: int
        :param k: the k value use in method 4
        :type k: int
        """

Hey, but why did multi-bleu.perl return 0.0?

Remember this my_log function in multi-bleu.perl, which basically takes the log:

sub my_log {
  return -9999999999 unless $_[0];
  return log($_[0]);
}

What it's doing: if the input to my_log() is 0 (or anything Perl considers false), return -9999999999 instead of calling log().
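A direct Python mirror of my_log (an illustrative sketch, not the Perl itself) shows how the 100.0/100.0/0.0/0.0 case collapses:

```python
from math import exp, log

# Python mirror of the Perl my_log: falsy input -> huge negative number.
def my_log(x):
    return -9999999999 if not x else log(x)

# Precisions 1.0/1.0/0.0/0.0 with brevity penalty 1.0, as in the
# "foo bar" examples above:
bleu = 1.0 * exp((my_log(1.0) + my_log(1.0) + my_log(0.0) + my_log(0.0)) / 4)
print(bleu)  # 0.0 -- exp of a huge negative number underflows to zero
```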

Let's see what happens mathematically.

If we have a precision of 1.0 and we take log(), we get 0.0 and if we take an exponential of that we get 1.0:

>>> from math import exp, log
>>> log(1.0)
0.0
>>> exp(log(1.0))
1.0

But what does exp(-9999999999) give you?

>>> exp(-9999999999)
0.0

So it gives zero; but really, the true value is just sooo close to zero that exp() underflows and returns exactly zero. To prove this point, we can simply take the exponential of negative infinity:

>>> exp(float('-inf'))
0.0

And remember what log(1.0) is? It's zero!! And if we add negative infinity to zero, what do we get?

Yes, it's negative infinity!! To prove the point again:

>>> float('-inf') + 0
-inf
>>> -9999999999 + 0
-9999999999
>>> exp(-9999999999 + 0)
0.0

Ah so now I get why multi-bleu.perl returns 0.0 when one of the n-gram order's precision is 0.0.

"You've found a bug, why don't you report it?!" Because it's not exactly a bug; it's how BLEU was meant to be.

BLEU is a corpus-wide measure, and the original implementation of BLEU assumes that the hypotheses and references will surely share at least one 4-gram, so we'll never meet the exp(-inf) problem on a real corpus.
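The corpus-wide logic can be seen with a tiny sketch: multi-bleu.perl sums the CORRECT and TOTAL counts over all segments before dividing, so a single longer sentence keeps the 4-gram precision above zero. The per-segment counts below are hypothetical.

```python
# Hypothetical (correct, total) 4-gram counts per segment:
segments = [(0, 0),   # "foo bar" -- too short, no 4-grams at all
            (5, 6)]   # a longer sentence with some 4-gram matches

# Counts are pooled over the whole corpus *before* dividing:
correct = sum(c for c, t in segments)
total = sum(t for c, t in segments)
p4 = correct / total if total else 0
print(p4)  # 5/6 -- nonzero, so the corpus-level BLEU survives
```

The trouble starts only when no segment in the corpus contributes a 4-gram match.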

But what if my corpus is made up of short sentences??!!!

Then you have to be extra careful when using BLEU. To reiterate the point:

alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt 
BLEU = 0.00, 100.0/100.0/100.0/0.0 (BP=1.000, ratio=1.000, hyp_len=6, ref_len=6)

alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar bar\nbar black sheep sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt 
Use of uninitialized value in division (/) at multi-bleu.perl line 139, <STDIN> line 2.
BLEU = 0.00, 85.7/80.0/66.7/0.0 (BP=1.000, ratio=1.167, hyp_len=7, ref_len=6)

But that being said, there is actually a version of BLEU that includes a smoothing method: mteval-v13a.pl. That is the OFFICIAL evaluation script used by the Conference on Machine Translation (WMT) shared tasks, not multi-bleu.perl.

As much as it's a little more troublesome to get mteval-v13a.pl set up, its smoothing function is the most appropriate for sentence-level BLEU, or for corpora where sentences are mostly shorter than 4 words: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl#L834

BTW, NLTK has an implementation of that smoothing function too, at https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L475, and the BLEU implementation in NLTK is tested to be as close as possible to mteval-v13a.pl.

So what's the lesson learnt?

  1. Use mteval-v13a.pl and avoid multi-bleu.perl if possible (simple to use != correct)
  2. Stop using BLEU

And I stress on point 2 once again!!

As part of the machine translation community, we are addicted to BLEU. We use it regardless of how well we know its oddities (and there are a lot of papers describing the flaws of BLEU).

And to quote an example from the original BLEU paper:

>>> from nltk.translate import bleu
>>> ref1 = 'the cat is on the mat'.split()
>>> ref2 = 'there is a cat on the mat'.split()
>>> hyp = 'the the the the the the the'.split()
>>> bleu([ref1, ref2], hyp)
0.7311104457090247

How is it possible that 'the the the the the the the' gets 73+ BLEU?!
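Fun fact: 0.7311 is exactly (2/7)^(1/4), which suggests the zero higher-order precisions were simply being ignored here, while the unigram precision was correctly clipped to 2/7, the value the BLEU paper itself reports for this example. Here's a sketch of the paper's modified (clipped) unigram precision:

```python
from collections import Counter

# Modified (clipped) unigram precision from the BLEU paper: each
# hypothesis word is credited at most as many times as it appears
# in any single reference.
def clipped_precision(hyp, refs):
    hyp_counts = Counter(hyp)
    clipped = 0
    for word, count in hyp_counts.items():
        max_ref = max(ref.count(word) for ref in refs)
        clipped += min(count, max_ref)
    return clipped / len(hyp)

ref1 = 'the cat is on the mat'.split()
ref2 = 'there is a cat on the mat'.split()
hyp = 'the the the the the the the'.split()
print(clipped_precision(hyp, [ref1, ref2]))  # 2/7: 'the' is clipped at 2
```

Clipping alone tames the degenerate hypothesis at the unigram level, but the final score still depends on how the zero bigram-and-up precisions are handled.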

IKR, we need to set up some "12 steps to kick the habit of BLEU".
