Our dearest script that everyone uses in the modern #neuralempty revolution in machine translation is multi-bleu.perl

But consider this:

alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl 
Use of uninitialized value $ARGV[0] in string eq at multi-bleu.perl line 11.
usage: multi-bleu.pl [-lc] reference < hypothesis
Reads the references from reference or reference0, reference1, ...
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ cat hyp.txt 
foo bar
alvas@ubi:~/git/mosesdecoder/scripts/generic$ cat ref.txt 
foo bar
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt
BLEU = 0.00, 100.0/100.0/0.0/0.0 (BP=1.000, ratio=1.000, hyp_len=2, ref_len=2)
alvas@ubi:~/git/mosesdecoder/scripts/generic$ echo 'foo bar' > hyp.txt
alvas@ubi:~/git/mosesdecoder/scripts/generic$ echo 'foo bar' > ref.txt
alvas@ubi:~/git/mosesdecoder/scripts/generic$ cat hyp.txt 
foo bar
alvas@ubi:~/git/mosesdecoder/scripts/generic$ cat ref.txt 
foo bar
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt
BLEU = 0.00, 100.0/100.0/0.0/0.0 (BP=1.000, ratio=1.000, hyp_len=2, ref_len=2)

What BLACK MAGIC is that?!

Yes, you're not seeing it wrong. Now let's look at a stabilized version of BLEU in NLTK, after a bunch of whack-a-mole with logarithms, exponentials and smoothing:

alvas@ubi:~/git/nltk$ python
Python 2.7.12 (default, Jul  1 2016, 15:12:24) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.translate import bleu
>>> ref = hyp = 'foo bar'.split() # Note that the input in NLTK is a list of str.
>>> bleu([ref], hyp) # Note that NLTK allows multiple references, thus the list casting.
1.0

Okay, that's more reasonable...

But why didn't multi-bleu.perl work? Is it because it needs more than 1 sentence?

alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt 
BLEU = 0.00, 100.0/100.0/100.0/0.0 (BP=1.000, ratio=1.000, hyp_len=6, ref_len=6)

NOOOOOOOO!!!!

Let's try again with a longer sentence:

alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar bar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar bar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt 
BLEU = 100.00, 100.0/100.0/100.0/100.0 (BP=1.000, ratio=1.000, hyp_len=5, ref_len=5)

Oh yes! All is not lost!

Let's try again with 2 sentences:

alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt 
BLEU = 0.00, 100.0/100.0/0.0/0.0 (BP=0.368, ratio=0.500, hyp_len=3, ref_len=6)

Now, you're doing the rabbit out of the hat thing! It's mirrors, there must be some mirror!! I know it, it's a trapdoor and there's a mirror!

Okay, so I'll do the big reveal thing. Now, I feel like Penn Jillette.

Notice the 0.0s in the output 100.0/100.0/0.0/0.0? Whenever any of those n-gram precisions is 0.0, the script returns BLEU = 0.00. Let's take a look at the code at https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl#L156

$bleu = $brevity_penalty * exp((my_log( $bleu[1] ) +
				my_log( $bleu[2] ) +
				my_log( $bleu[3] ) +
				my_log( $bleu[4] ) ) / 4) ;

and @bleu is actually an array (think of it as a Python dict keyed by the n-gram order), where each value is the precision of the n-gram overlaps for that order:

my $bleu = 0;
my @bleu=();
for(my $n=1;$n<=4;$n++) {
  if (defined ($TOTAL[$n])){
    $bleu[$n]=($TOTAL[$n])?$CORRECT[$n]/$TOTAL[$n]:0;
    # print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n";
  }else{
    $bleu[$n]=0;
  }
}
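
As a rough Python sketch of what that loop computes for the two-sentence foo bar / black sheep example above (the n-gram counts are filled in by hand here, purely for illustration):

# Hand-counted n-gram statistics for hyp = ref = ["foo bar bar", "bar black sheep"]
# TOTAL[n]   = number of n-grams in the hypothesis, summed over both sentences
# CORRECT[n] = number of those n-grams that also appear in the reference
TOTAL   = {1: 6, 2: 4, 3: 2, 4: 0}   # two 3-token sentences contain no 4-grams at all
CORRECT = {1: 6, 2: 4, 3: 2, 4: 0}

precisions = {}
for n in range(1, 5):
    # mirrors: $bleu[$n] = ($TOTAL[$n]) ? $CORRECT[$n]/$TOTAL[$n] : 0;
    precisions[n] = CORRECT[n] / TOTAL[n] if TOTAL[n] else 0.0

print(precisions)  # {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.0}, i.e. the 100.0/100.0/100.0/0.0 above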

But that still doesn't explain the zero BLEU!!!

Let's see what happens in Python when we take the exponential of a sum of logs:

>>> from math import exp, log
>>> exp(log(0.5) + log(0.4) + log(0.3) + log(0.2))
0.011999999999999999

What if we have a log(0.0)? (no, it's not an emoji...)

>>> from math import exp, log
>>> exp(log(0.5) + log(0.4) + log(0.3) + log(0.0))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: math domain error

So it gives a math domain error, but how did NLTK give us a BLEU score?!

It's a long story, but it started with a rough hack that simply checks for log(0.0): https://github.com/nltk/nltk/commits/develop/nltk/translate/bleu_score.py

But the real solution is to do some smart smoothing:

class SmoothingFunction:
    """
    This is an implementation of the smoothing techniques
    for segment-level BLEU scores that was presented in
    Boxing Chen and Colin Cherry (2014) A Systematic Comparison of
    Smoothing Techniques for Sentence-Level BLEU. In WMT14.
    http://acl2014.org/acl2014/W14-33/pdf/W14-3346.pdf
    """
    def __init__(self, epsilon=0.1, alpha=5, k=5):
        """
        This will initialize the parameters required for the various smoothing
        techniques, the default values are set to the numbers used in the
        experiments from Chen and Cherry (2014).
        >>> hypothesis1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which', 'ensures',
        ...                 'that', 'the', 'military', 'always', 'obeys', 'the',
        ...                 'commands', 'of', 'the', 'party']
        >>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that', 'ensures',
        ...               'that', 'the', 'military', 'will', 'forever', 'heed',
        ...               'Party', 'commands']
        >>> chencherry = SmoothingFunction()
        >>> print (sentence_bleu([reference1], hypothesis1)) # doctest: +ELLIPSIS
        0.4118...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method0)) # doctest: +ELLIPSIS
        0.4118...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method1)) # doctest: +ELLIPSIS
        0.4118...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method2)) # doctest: +ELLIPSIS
        0.4489...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method3)) # doctest: +ELLIPSIS
        0.4118...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method4)) # doctest: +ELLIPSIS
        0.4118...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method5)) # doctest: +ELLIPSIS
        0.4905...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method6)) # doctest: +ELLIPSIS
        0.4135...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method7)) # doctest: +ELLIPSIS
        0.4905...
        :param epsilon: the epsilon value use in method 1
        :type epsilon: float
        :param alpha: the alpha value use in method 6
        :type alpha: int
        :param k: the k value use in method 4
        :type k: int
        """

Hey, but why did multi-bleu.perl return 0.0?

Remember the my_log function in the multi-bleu.perl formula above? Here's what my_log does; it basically takes the log:

sub my_log {
  return -9999999999 unless $_[0];
  return log($_[0]);
}

What it's doing is saying: if the input to my_log() is 0, return -9999999999 instead of calling log().

Let's see what happens mathematically.

If we have a precision of 1.0 and we take log(), we get 0.0 and if we take an exponential of that we get 1.0:

>>> from math import exp, log
>>> log(1.0)
0.0
>>> exp(log(1.0))
1.0

But what does exp(-9999999999) give you?

>>> exp(-9999999999)
0.0

So it gives zero. Actually the true value is sooo close to zero that it underflows to 0.0. To prove this point, we can simply take the exponential of negative infinity:

>>> exp(float('-inf'))
0.0

And remember what log(1.0) is? It's zero!! So if we add negative infinity to zero, what do we get?

Yes, negative infinity!! To prove the point again:

>>> float('-inf') + 0
-inf
>>> -9999999999 + 0
-9999999999
>>> exp(-9999999999 + 0)
0.0

Ah, so now we see why multi-bleu.perl returns 0.0 whenever any one of the n-gram precisions is 0.0.
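
Putting it all together, here is a rough Python re-creation of that final combination step in multi-bleu.perl; the precisions are the ones from the two-sentence toy example, and my_log mirrors the Perl sub above:

from math import exp, log

def my_log(x):
    # mirrors multi-bleu.perl: return a huge negative number instead of log(0)
    return -9999999999 if not x else log(x)

brevity_penalty = 1.0
precisions = {1: 1.0, 2: 1.0, 3: 1.0, 4: 0.0}   # the 100.0/100.0/100.0/0.0 case

bleu = brevity_penalty * exp((my_log(precisions[1]) +
                              my_log(precisions[2]) +
                              my_log(precisions[3]) +
                              my_log(precisions[4])) / 4)
print(bleu)  # 0.0 -- the single zero precision drags the whole sum into exp(-huge) territory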

"You've found a bug, why don't you report it?!!" Well, it's not exactly a bug; it's how BLEU was meant to be.

BLEU is a corpus-wide measure, and the original implementation of BLEU assumes that the hypothesis and reference will surely share at least one 4-gram, so we never hit the exp(-inf) problem on a real corpus.
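
Here is a hedged sketch of what that means in practice with NLTK's corpus_bleu (the toy sentences are made up): as long as at least one segment in the corpus contributes matching 4-grams, the pooled 4-gram precision is non-zero and the score doesn't collapse.

from nltk.translate.bleu_score import corpus_bleu

# Two short segments (no 4-grams) plus one longer segment that does have 4-grams.
hyps = ['foo bar bar'.split(),
        'bar black sheep'.split(),
        'foo bar bar black sheep have you any wool'.split()]
refs = [[h] for h in hyps]   # one perfect-match reference per hypothesis

# The n-gram counts are pooled over the whole corpus, so the longer segment supplies
# the 4-gram matches that the short segments lack, and BLEU is no longer 0.
print(corpus_bleu(refs, hyps))  # 1.0 for this perfect-match toy corpus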

But what if my corpus is made up of short sentences??!!!

Then you have to be extra careful when using BLEU. To reiterate the point:

alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt 
BLEU = 0.00, 100.0/100.0/100.0/0.0 (BP=1.000, ratio=1.000, hyp_len=6, ref_len=6)

alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar bar\nbar black sheep sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt 
Use of uninitialized value in division (/) at multi-bleu.perl line 139, <STDIN> line 2.
BLEU = 0.00, 85.7/80.0/66.7/0.0 (BP=1.000, ratio=1.167, hyp_len=7, ref_len=6)

But that being said, there is actually a version of BLEU that includes a smoothing method: it's mteval-v13a.pl, and that is the OFFICIAL evaluation script used by the Conference on Machine Translation (WMT) shared tasks, not multi-bleu.perl.

As much as it's a little more troublesome to get mteval-v13a.pl set up, its smoothing function is the most appropriate for sentence-level BLEU, or for corpora where most sentences are shorter than 4 words: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl#L834

BTW, NLTK has an implementation of that smoothing function too, at https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L475, and the BLEU implementation in NLTK is tested to be as close as possible to mteval-v13a.pl.
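
If I read the linked code correctly, that NIST-style smoothing is exposed in NLTK as SmoothingFunction().method3 (geometric sequence smoothing); a quick sketch, with the method number to be double-checked against your NLTK version:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

ref = hyp = 'bar black sheep'.split()   # 3 tokens, so the 4-gram count is zero
chencherry = SmoothingFunction()

# method3 replaces the k-th zero-count precision with 1/2^k instead of 0,
# the exponentially decaying smoothing that mteval-v13a.pl applies.
print(sentence_bleu([ref], hyp, smoothing_function=chencherry.method3))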

So what are the lessons learnt?

  1. Use mteval-v13a.pl and avoid multi-bleu.perl if possible (simple to use != correct)
  2. Stop using BLEU

And I stress point 2 once again!!

As part of the machine translation community, we are addicted to BLEU. We keep using it no matter how well we know its oddities, and there are plenty more papers describing the flaws of BLEU.

And to quote an example from the original BLEU paper:

>>> from nltk.translate import bleu
>>> ref1 = 'the cat is on the mat'.split()
>>> ref2 = 'there is a cat on the mat'.split()
>>> hyp = 'the the the the the the the'.split()
>>> bleu([ref1, ref2], hyp)
0.7311104457090247

How is it possible that 'the the the the the the the' gets 73+ BLEU?!
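
For reference, here is the clipped ("modified") unigram precision from the BLEU paper computed by hand for this example; the hypothesis should be heavily penalized, which is why a 73+ score looks so absurd:

from collections import Counter

ref1 = 'the cat is on the mat'.split()
ref2 = 'there is a cat on the mat'.split()
hyp = 'the the the the the the the'.split()

hyp_counts = Counter(hyp)   # 'the' appears 7 times in the hypothesis
max_ref_counts = {w: max(Counter(ref1)[w], Counter(ref2)[w]) for w in hyp_counts}

# Each hypothesis word only counts up to its maximum count in any single reference:
# 'the' occurs at most 2 times in a reference, so only 2 of the 7 are "correct".
clipped = sum(min(hyp_counts[w], max_ref_counts[w]) for w in hyp_counts)
print(clipped / len(hyp))   # 2/7 ~= 0.2857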

IKR, we need to set up some "12 steps to kick the habit of BLEU".
