MT Evaluation with Travatar

Overview
Supported Evaluation Measures
Significance Testing
Segmenting MT Output for Evaluation

Overview

Travatar includes a tool for evaluating translations with various automatic evaluation metrics. To use it, pass the reference (ref.txt) and one or more system outputs (sys1.txt, sys2.txt, and so on) to the mt-evaluator program.

travatar/src/bin/mt-evaluator -ref ref.txt sys1.txt sys2.txt

This prints BLEU and RIBES scores for each system. Typical systems score roughly 10-50 BLEU and 50-90 RIBES; scores near the low end of each range generally indicate poor output, and scores near the high end indicate good output. To get separate evaluation results for every sentence, add -sent true.

Supported Evaluation Measures

The evaluator supports several measures, each of which can optionally be parameterized with a string following a colon:

BLEU (indicator string: bleu:order=4,smooth=0,scope=corpus)
RIBES (indicator string: ribes:alpha=0.25,beta=0.1)
TER (indicator string: ter)
WER (indicator string: wer)

You can specify one or more evaluation measures using the -eval option. For example, to evaluate both BLEU of order 5 and RIBES, run the following command:

travatar/src/bin/mt-evaluator -eval "bleu:order=5 ribes" -ref ref.txt sys1.txt sys2.txt

It is also possible (mainly for tuning) to linearly interpolate two or more evaluation measures using the "interp" pseudo-measure:

Interpolated (indicator string: interp:0.4|bleu:order=5|0.6|ribes)
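As a rough illustration of what a linear interpolation of measures amounts to, the following Python sketch parses an indicator string of the form shown above and computes a weighted sum. It assumes that each weight precedes the measure it applies to and that the combined score is simply the sum of weight times score; the function names parse_interp and interpolate and the example scores are hypothetical and are not part of Travatar.

```python
def parse_interp(spec):
    """Split 'interp:w1|m1|w2|m2|...' into a list of (weight, measure) pairs.

    Assumption: each weight comes directly before its measure.
    """
    prefix, _, body = spec.partition(":")
    if prefix != "interp":
        raise ValueError("not an interp indicator string")
    fields = body.split("|")
    if len(fields) % 2 != 0:
        raise ValueError("expected alternating weight|measure fields")
    return [(float(w), m) for w, m in zip(fields[0::2], fields[1::2])]


def interpolate(terms, scores):
    """Weighted sum of per-measure scores (assumed semantics of 'interp')."""
    return sum(weight * scores[measure] for weight, measure in terms)


if __name__ == "__main__":
    terms = parse_interp("interp:0.4|bleu:order=5|0.6|ribes")
    # Made-up scores, for illustration only.
    example_scores = {"bleu:order=5": 25.0, "ribes": 70.0}
    print(interpolate(terms, example_scores))
```

With these made-up scores, the result is 0.4 times the BLEU score plus 0.6 times the RIBES score.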

For more advanced usage, Travatar supports an advanced interpolation measure that takes other evaluation measures as variables and combines them with a single mathematical expression:

Advanced Interpolation (indicator string: ainterp:A|bleu:order=5|B|ribes|2*A*B/(A+B))

Here we compute the harmonic mean of BLEU of order 5 and RIBES. Currently, Travatar supports up to 26 variables (A-Z inclusive), the standard binary arithmetic operators (+, -, /, *), and parentheses.
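To make the arithmetic concrete, here is a minimal Python sketch that splits an ainterp indicator string into its variable bindings and expression, then evaluates the expression for given scores. It is not Travatar's implementation: the functions parse_ainterp and evaluate are hypothetical, the scores are made up, and the evaluator accepts only the operators named above (+, -, /, *), parentheses, numbers, and variable names.

```python
import ast
import operator

# Binary operators the text lists as supported by ainterp.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def parse_ainterp(spec):
    """Split 'ainterp:A|m1|B|m2|EXPR' into ({variable: measure}, EXPR)."""
    prefix, _, body = spec.partition(":")
    if prefix != "ainterp":
        raise ValueError("not an ainterp indicator string")
    fields = body.split("|")
    if len(fields) < 3 or len(fields) % 2 == 0:
        raise ValueError("expected variable|measure pairs followed by an expression")
    *pairs, expr = fields
    return dict(zip(pairs[0::2], pairs[1::2])), expr


def evaluate(expr, values):
    """Evaluate expr using only +, -, *, /, parentheses, numbers, and variables."""
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        if isinstance(node, ast.Name) and node.id in values:
            return float(values[node.id])
        raise ValueError("unsupported element in expression")
    return walk(ast.parse(expr, mode="eval").body)


if __name__ == "__main__":
    variables, expr = parse_ainterp("ainterp:A|bleu:order=5|B|ribes|2*A*B/(A+B)")
    # Made-up scores, for illustration only.
    scores = {"bleu:order=5": 25.0, "ribes": 70.0}
    values = {var: scores[measure] for var, measure in variables.items()}
    print(evaluate(expr, values))
```

Because the harmonic mean is dominated by the smaller of its two inputs, a system cannot obtain a high combined score by doing well on only one of the two measures.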

Significance Testing

It is also possible to perform significance testing for differences between the systems using bootstrap resampling. You can do this by adding -bootstrap XXX, where XXX is the number of random samples you will use. 10000 is usually a reasonable number of samples:

travatar/src/bin/mt-evaluator -bootstrap 10000 -ref ref.txt sys1.txt sys2.txt

After the normal evaluation output, you will also get a significance report for each system pair and evaluation measure. Each report gives three values: the first is the fraction of samples in which the first system achieves a higher score than the second, the middle value is the fraction of ties, and the last is the fraction of samples in which the first system scores lower.
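The idea behind paired bootstrap resampling can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Travatar's code: it takes one precomputed score per sentence for each system and compares sums over each resampled set, whereas a corpus-level measure such as BLEU would have to be recomputed from the resampled sentences. The function name paired_bootstrap and the example scores are hypothetical.

```python
import random


def paired_bootstrap(scores1, scores2, num_samples, seed=0):
    """Return (win, tie, loss) fractions for system 1 against system 2.

    scores1 and scores2 hold one score per sentence, in the same order.
    Each sample draws len(scores1) sentence indices with replacement and
    compares the two systems on exactly the same drawn sentences.
    """
    if len(scores1) != len(scores2):
        raise ValueError("both systems need one score per sentence")
    if not scores1 or num_samples <= 0:
        raise ValueError("need at least one sentence and one sample")
    rng = random.Random(seed)
    n = len(scores1)
    wins = ties = losses = 0
    for _ in range(num_samples):
        indices = [rng.randrange(n) for _ in range(n)]
        total1 = sum(scores1[i] for i in indices)
        total2 = sum(scores2[i] for i in indices)
        if total1 > total2:
            wins += 1
        elif total1 < total2:
            losses += 1
        else:
            ties += 1
    return wins / num_samples, ties / num_samples, losses / num_samples


if __name__ == "__main__":
    # Made-up sentence-level scores, for illustration only.
    sys1 = [0.42, 0.10, 0.55, 0.31, 0.27]
    sys2 = [0.40, 0.15, 0.50, 0.33, 0.20]
    print(paired_bootstrap(sys1, sys2, num_samples=1000))
```

The three returned fractions correspond to the three values in the report: first system higher, tie, and first system lower.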

Segmenting MT Output for Evaluation

Sometimes the sentence segmentation of the reference translations and system output does not match, such as in speech translation where sentence boundaries are not clear (Matusov et al. 2005). In this case, you can use the mt-segmenter utility to divide the system output into sentences before evaluation. Usage is as follows, where ref.txt is the reference, sys.txt is the system output (line boundaries are ignored), and seg.txt is the segmented output.

travatar/src/bin/mt-segmenter -ref ref.txt sys.txt > seg.txt

By default, the segmentation is created to approximately optimize BLEU (with +1 smoothing to make dynamic programming go smoothly), but you can use other optimization measures as listed above. Note that this process is very slow, as it does not use any approximate search techniques. If you have document boundaries, it is best to divide the data along them and send one document at a time to mt-segmenter, which will greatly improve speed.
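One way to process one document at a time is sketched below in Python. It assumes you already know how many lines each document occupies in both the reference and the system output; the helper names chunk, run_mt_segmenter, and segment_by_document are hypothetical. The external call uses only the invocation shown above (-ref followed by the system output file) and assumes the segmented text is written to standard output.

```python
import subprocess
import tempfile
from pathlib import Path


def chunk(lines, sizes):
    """Split lines into consecutive documents with the given line counts."""
    if sum(sizes) != len(lines):
        raise ValueError("document sizes must add up to the number of lines")
    docs, start = [], 0
    for size in sizes:
        docs.append(lines[start:start + size])
        start += size
    return docs


def run_mt_segmenter(ref_lines, sys_lines,
                     binary="travatar/src/bin/mt-segmenter"):
    """Segment a single document by calling mt-segmenter on temporary files."""
    with tempfile.TemporaryDirectory() as tmp:
        ref_path = Path(tmp) / "ref.txt"
        sys_path = Path(tmp) / "sys.txt"
        ref_path.write_text("".join(l + "\n" for l in ref_lines), encoding="utf-8")
        sys_path.write_text("".join(l + "\n" for l in sys_lines), encoding="utf-8")
        result = subprocess.run(
            [binary, "-ref", str(ref_path), str(sys_path)],
            check=True, capture_output=True, text=True, encoding="utf-8",
        )
    return result.stdout.splitlines()


def segment_by_document(ref_docs, sys_docs, segment=run_mt_segmenter):
    """Segment each document separately and concatenate the results in order."""
    if len(ref_docs) != len(sys_docs):
        raise ValueError("reference and system output need the same documents")
    segmented = []
    for ref_lines, sys_lines in zip(ref_docs, sys_docs):
        segmented.extend(segment(ref_lines, sys_lines))
    return segmented
```

Passing the segmenter as a parameter keeps the splitting logic testable without the Travatar binary; in normal use the default run_mt_segmenter would be called once per document.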