Our dearest script, the one everyone uses in the modern #neuralempty machine translation revolution, is multi-bleu.perl
But consider this:
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl
Use of uninitialized value $ARGV[0] in string eq at multi-bleu.perl line 11.
usage: multi-bleu.pl [-lc] reference < hypothesis
Reads the references from reference or reference0, reference1, ...
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ cat hyp.txt
foo bar
alvas@ubi:~/git/mosesdecoder/scripts/generic$ cat ref.txt
foo bar
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt
BLEU = 0.00, 100.0/100.0/0.0/0.0 (BP=1.000, ratio=1.000, hyp_len=2, ref_len=2)
alvas@ubi:~/git/mosesdecoder/scripts/generic$ echo 'foo bar' > hyp.txt
alvas@ubi:~/git/mosesdecoder/scripts/generic$ echo 'foo bar' > ref.txt
alvas@ubi:~/git/mosesdecoder/scripts/generic$ cat hyp.txt
foo bar
alvas@ubi:~/git/mosesdecoder/scripts/generic$ cat ref.txt
foo bar
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt
BLEU = 0.00, 100.0/100.0/0.0/0.0 (BP=1.000, ratio=1.000, hyp_len=2, ref_len=2)
No, you're not seeing it wrong. Now let's look at a stabilized version of BLEU in NLTK, after a bunch of whack-a-mole with logarithms, exponentials and smoothing:
alvas@ubi:~/git/nltk$ python
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk.translate import bleu
>>> ref = hyp = 'foo bar'.split() # Note that the input in NLTK is a list of str.
>>> bleu([ref], hyp) # Note that NLTK allows multiple references, thus the list casting.
1.0
But why didn't multi-bleu.perl work? Is it because it needs more than 1 sentence?
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt
BLEU = 0.00, 100.0/100.0/100.0/0.0 (BP=1.000, ratio=1.000, hyp_len=6, ref_len=6)
Let's try again with a longer sentence:
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar bar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar bar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt
BLEU = 100.00, 100.0/100.0/100.0/100.0 (BP=1.000, ratio=1.000, hyp_len=5, ref_len=5)
Let's try again with 2 sentences:
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt
BLEU = 0.00, 100.0/100.0/0.0/0.0 (BP=0.368, ratio=0.500, hyp_len=3, ref_len=6)
Now you're doing the rabbit-out-of-the-hat thing! It's mirrors, there must be some mirror!! I know it, it's a trapdoor and there's a mirror!
Okay, so I'll do the big reveal thing. Now I feel like Penn Jillette.
Notice the 0.0s in the output 100.0/100.0/0.0/0.0? Whenever there's any 0.0 among the n-gram precisions, the script returns BLEU = 0.00. Let's take a look at the code at https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl#L156
$bleu = $brevity_penalty * exp((my_log( $bleu[1] ) +
                                my_log( $bleu[2] ) +
                                my_log( $bleu[3] ) +
                                my_log( $bleu[4] ) ) / 4) ;
and @bleu (accessed element-wise as $bleu[$n]) is actually an array indexed by the n-gram order, whose values are the precisions of the n-gram overlaps at each order:
my $bleu = 0;

my @bleu=();

for(my $n=1;$n<=4;$n++) {
  if (defined ($TOTAL[$n])){
    $bleu[$n]=($TOTAL[$n])?$CORRECT[$n]/$TOTAL[$n]:0;
    # print STDERR "CORRECT[$n]:$CORRECT[$n] TOTAL[$n]:$TOTAL[$n]\n";
  }else{
    $bleu[$n]=0;
  }
}
Let's see what happens in Python when we take the exponential of a sum of logs:
>>> from math import exp, log
>>> exp(log(0.5) + log(0.4) + log(0.3) + log(0.2))
0.011999999999999999
What if we have a log(0.0)? (No, it's not an emoji...)
>>> from math import exp, log
>>> exp(log(0.5) + log(0.4) + log(0.3) + log(0.2))
0.011999999999999999
>>> exp(log(0.5) + log(0.4) + log(0.3) + log(0.0))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: math domain error
So it gives a math domain error, but how did NLTK give us a BLEU score?!
It's a long story, but it started out as a rough hack that simply checked for log(0.0): https://github.com/nltk/nltk/commits/develop/nltk/translate/bleu_score.py
But the real solution is to do some smart smoothing:
class SmoothingFunction:
    """
    This is an implementation of the smoothing techniques
    for segment-level BLEU scores that was presented in
    Boxing Chen and Collin Cherry (2014) A Systematic Comparison of
    Smoothing Techniques for Sentence-Level BLEU. In WMT14.
    http://acl2014.org/acl2014/W14-33/pdf/W14-3346.pdf
    """

    def __init__(self, epsilon=0.1, alpha=5, k=5):
        """
        This will initialize the parameters required for the various smoothing
        techniques, the default values are set to the numbers used in the
        experiments from Chen and Cherry (2014).

        >>> hypothesis1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'which', 'ensures',
        ...                'that', 'the', 'military', 'always', 'obeys', 'the',
        ...                'commands', 'of', 'the', 'party']
        >>> reference1 = ['It', 'is', 'a', 'guide', 'to', 'action', 'that', 'ensures',
        ...               'that', 'the', 'military', 'will', 'forever', 'heed',
        ...               'Party', 'commands']

        >>> chencherry = SmoothingFunction()
        >>> print (sentence_bleu([reference1], hypothesis1)) # doctest: +ELLIPSIS
        0.4118...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method0)) # doctest: +ELLIPSIS
        0.4118...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method1)) # doctest: +ELLIPSIS
        0.4118...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method2)) # doctest: +ELLIPSIS
        0.4489...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method3)) # doctest: +ELLIPSIS
        0.4118...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method4)) # doctest: +ELLIPSIS
        0.4118...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method5)) # doctest: +ELLIPSIS
        0.4905...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method6)) # doctest: +ELLIPSIS
        0.4135...
        >>> print (sentence_bleu([reference1], hypothesis1, smoothing_function=chencherry.method7)) # doctest: +ELLIPSIS
        0.4905...

        :param epsilon: the epsilon value use in method 1
        :type epsilon: float
        :param alpha: the alpha value use in method 6
        :type alpha: int
        :param k: the k value use in method 4
        :type k: int
        """
Remember the my_log function in multi-bleu.perl? It basically takes the log:
sub my_log {
  return -9999999999 unless $_[0];
  return log($_[0]);
}
What it's doing is saying that if the input to my_log() is 0, it returns -9999999999.
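In Python terms, a rough equivalent of my_log (my own sketch, not code from the Moses repo) would be:

from math import log

def my_log(x):
    # mirrors multi-bleu.perl's my_log: a huge negative number stands in for log(0)
    return -9999999999 if not x else log(x)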
Let's see what happens mathematically. If we have a precision of 1.0 and we take log(), we get 0.0, and if we take the exponential of that, we get 1.0:
>>> from math import exp, log
>>> log(1.0)
0.0
>>> exp(log(1.0))
1.0
But what does exp(-9999999999) give you?
>>> exp(-9999999999)
0.0
So it gives zero; actually the true value is just sooo close to zero that, to prevent underflow, it is rounded down to zero. To prove this point, we can simply take the exponential of negative infinity:
>>> exp(float('-inf'))
0.0
And remember what log(1.0) is? It's zero!! And if we add negative infinity to zero, we get?
Yes, negative infinity!! To prove the point again:
>>> float('-inf') + 0
-inf
>>> -9999999999 + 0
-9999999999
>>> exp(-9999999999 + 0)
0.0
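Putting the pieces together, here is a small Python sketch (my own approximation of what multi-bleu.perl computes, not the script itself) of why the single 'foo bar' sentence collapses to BLEU = 0:

from math import exp, log

def my_log(x):
    # same trick as multi-bleu.perl
    return -9999999999 if not x else log(x)

# n-gram precisions for hyp == ref == 'foo bar': the unigrams (2/2) and bigrams (1/1)
# match perfectly, but there are no 3-grams or 4-grams at all, so those orders fall back to 0
precisions = [1.0, 1.0, 0.0, 0.0]

brevity_penalty = 1.0  # hyp_len == ref_len, so no penalty
bleu = brevity_penalty * exp(sum(my_log(p) for p in precisions) / 4)
print(bleu)  # 0.0, because the huge negative logs drag the geometric mean down to zero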
"You've found a bug, why don't you report it?!!" Well, it's not exactly a bug; it's how BLEU was meant to be.
BLEU is a corpus-wide measure, and the original implementation of BLEU assumes that the hypothesis and reference surely have at least one matching 4-gram, so we'll never hit the exp(-inf) problem on a real corpus.
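To illustrate the corpus-level pooling (a sketch with toy sentences of my own, using NLTK's corpus_bleu): as long as at least one segment contributes 4-gram matches, the pooled counts never hit the log(0) problem.

from nltk.translate.bleu_score import corpus_bleu

# one segment long enough to contribute 4-gram matches, plus one short segment
references = [['foo bar bar black sheep'.split()],
              ['bar black sheep'.split()]]
hypotheses = ['foo bar bar black sheep'.split(),
              'bar black sheep'.split()]

# the 4-gram counts from the first segment keep every pooled precision above zero,
# so the corpus-level score is 1.0 here (the hypotheses equal the references)
print(corpus_bleu(references, hypotheses))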
So you have to be extra careful when using BLEU; to reiterate the point:
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt
BLEU = 0.00, 100.0/100.0/100.0/0.0 (BP=1.000, ratio=1.000, hyp_len=6, ref_len=6)
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('hyp.txt', 'w').write('foo bar bar\nbar black sheep sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ python3 -c "open('ref.txt', 'w').write('foo bar bar\nbar black sheep\n')"
alvas@ubi:~/git/mosesdecoder/scripts/generic$ perl multi-bleu.perl ref.txt < hyp.txt
Use of uninitialized value in division (/) at multi-bleu.perl line 139, <STDIN> line 2.
BLEU = 0.00, 85.7/80.0/66.7/0.0 (BP=1.000, ratio=1.167, hyp_len=7, ref_len=6)
But that being said, there is actually a version of BLEU that includes a smoothing method: mteval-v13a.pl. That is the OFFICIAL evaluation script used by the Conference on Machine Translation (WMT) shared tasks, not multi-bleu.perl.
As much as it's a little more troublesome to set mteval-v13a.pl up, its smoothing function is the most appropriate for sentence-level BLEU, or for corpora where most sentences are shorter than 4 words: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl#L834
BTW, NLTK has an implementation of that smoothing function too, at https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L475, and the BLEU implementation in NLTK is tested to be as close as possible to mteval-v13a.pl.
- Use mteval-v13a.pl and avoid multi-bleu.perl if possible (simple to use != correct)
- Stop using BLEU
As part of the machine translation community, we are addicted to BLEU. We keep using it no matter how well we know its oddities (a non-exhaustive list; there are many more papers describing the flaws of BLEU):
- http://www.mt-archive.info/ACL-2004-Babych.pdf
- http://www.aclweb.org/anthology/E06-1032
- https://www.aclweb.org/anthology/W15-5009.pdf (Disclaimer: Shameless plug)
And to quote an example from the original BLEU paper:
>>> from nltk.translate import bleu
>>> ref1 = 'the cat is on the mat'.split()
>>> ref2 = 'there is a cat on the mat'.split()
>>> hyp = 'the the the the the the the'.split()
>>> bleu([ref1, ref2], hyp)
0.7311104457090247
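Where does that suspiciously high score come from? The clipped unigram precision in this example is 2/7 ('the' occurs seven times in the hypothesis but at most twice in any one reference), and 0.7311... is just (2/7) raised to the default unigram weight of 0.25, which suggests the zero higher-order precisions are effectively being ignored here rather than zeroing out the score. A quick arithmetic check (my own, not from the paper):

from math import exp, log

p1 = 2 / 7                 # clipped unigram precision from the BLEU paper example

print(p1 ** 0.25)          # ~0.7311, matching the score above
print(exp(0.25 * log(p1))) # the same value, written as the exp of a weighted log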