MT Evaluation with Travatar
Overview
Travatar contains a tool for evaluating translations according to various automatic evaluation metrics. Evaluation is performed by passing the reference (ref.txt) and one or more system outputs (sys1.txt, sys2.txt, etc.) to the mt-evaluator program:
travatar/src/bin/mt-evaluator -ref ref.txt sys1.txt sys2.txt
This will print scores for BLEU and RIBES. Typical scores are a BLEU of around 10-50 and a RIBES of around 50-90, with systems at the low end of these ranges generally being bad and those at the high end being good. If you would like separate evaluation results for every sentence, you can also add -sent true.
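For example, the following command (a sketch using the -sent flag just described) prints per-sentence scores for a single system output:
travatar/src/bin/mt-evaluator -sent true -ref ref.txt sys1.txt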
Supported Evaluation Measures
The evaluator supports several evaluation measures, each of which can optionally be parameterized using a string following a colon:
- BLEU (indicator string: bleu:order=4,smooth=0,scope=corpus)
- RIBES (indicator string: ribes:alpha=0.25,beta=0.1)
- TER (indicator string: ter)
- WER (indicator string: wer)
It is possible to specify one or more evaluation measures using the -eval option. For example, if you want to evaluate BLEU of order 5 and RIBES, you can call the following command:
travatar/src/bin/mt-evaluator -eval "bleu:order=5 ribes" -ref ref.txt sys1.txt sys2.txt
It is also possible (mainly for tuning) to linearly interpolate two or more evaluation measures using the "interp" pseudo-measure:
- Interpolated (indicator string: interp:0.4|bleu:order=5|0.6|ribes)
- Advanced Interpolation (indicator string: ainterp:A|bleu:order=5|B|ribes|2*A*B/(A+B))
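In the second form, A and B name the scores of the two measures, and the final field is an expression combining them (here their harmonic mean, 2*A*B/(A+B)). As a sketch, an interpolated measure can be passed to the -eval option like any other indicator string:
travatar/src/bin/mt-evaluator -eval "interp:0.4|bleu:order=5|0.6|ribes" -ref ref.txt sys1.txt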
Significance Testing
It is also possible to perform significance testing for differences between the systems using bootstrap resampling. You can do this by adding -bootstrap XXX, where XXX is the number of random samples to use; 10000 is usually a reasonable number:
travatar/src/bin/mt-evaluator -bootstrap 10000 -ref ref.txt sys1.txt sys2.txt
After the normal evaluation output, you will also get a significance report for each system pair and evaluation measure. The first column is the fraction of samples in which the first system achieves a higher score than the second, the second column the fraction of ties, and the third column the fraction in which the first system achieves a lower score.
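Since the report is produced per measure, the -bootstrap option should also combine with the -eval option described above to test significance under other measures, for example (a sketch reusing flags documented above):
travatar/src/bin/mt-evaluator -eval "bleu ter" -bootstrap 10000 -ref ref.txt sys1.txt sys2.txt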
Segmenting MT Output for Evaluation
Sometimes the sentence segmentation of the reference translations and the system output does not match, for example in speech translation, where sentence boundaries are not clear (Matusov et al. 2005). In this case, you can use the mt-segmenter utility to divide the system output into sentences before evaluation. Usage is as follows, where ref.txt is the reference, sys.txt is the system output (its line boundaries are ignored), and seg.txt is the segmented output:
travatar/src/bin/mt-segmenter -ref ref.txt sys.txt > seg.txt
By default the segmentation is created to approximately optimize BLEU (with +1 smoothing, which keeps the dynamic programming well-behaved), but you can optimize any of the other measures listed above instead. Also note that this process is very slow, as it does not use any approximate search techniques. If you have document boundaries, it is best to split the data along them and send one document at a time to mt-segmenter, which will greatly improve speed, as in the sketch below.
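For example, if the reference and system output have already been split into one file per document (the ref.doc1.txt/sys.doc1.txt names below are hypothetical), a per-document run might look like:
for d in doc1 doc2 doc3; do
  travatar/src/bin/mt-segmenter -ref ref.$d.txt sys.$d.txt
done > seg.txt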