Preprocessing for Travatar

Because Travatar requires parsed input, preprocessing is heavier than for most other translation toolkits. The Parsing page describes how to do this for each language. However, there are many fine details that are easy to get wrong, so we provide a default preprocessing script that supports English (en), Japanese (ja), Chinese (zh), German (de), and French (fr). To use it, follow the directions below:

Installing Programs

The first thing you need to do is install the programs, other than Travatar, that are needed to process your particular language pair. The best place to install them is $HOME/usr/local. If you install them elsewhere, you can specify the location with -program-dir /path/to/tools when you run the preprocessing script.

GIZA++ Aligner (all): Unzip, enter the giza-pp directory, run make, and then run the following command:

cp GIZA++-v2/{GIZA++,plain2snt.out,snt2cooc.out,snt2plain.out,trainGIZA++.sh} mkcls-v2/mkcls .

Ckylark Parser (en, zh, ja): Clone from git, enter the directory, and run autoreconf -i, ./configure, and make.
Egret Parser (en, zh, ja): Unzip and run bash make-linux.sh.
KyTea Morphological Analyzer (ja): Unzip, enter the directory, and run ./configure, make, and sudo make install.
Nile (optional): Requires several libraries. Follow the install directions.

Running Preprocessing

Preprocessing Training Data

If you have all of the tools installed in the $HOME/usr/local directory, running the preprocessing script is quite simple. For example, considering translation between English and Japanese, we could run the following command to tokenize, clean, parse, and align our training data:

$TRAVATAR_DIR/script/preprocess/preprocess.pl \
   -src en -trg ja -threads 4 -clean 60 -align train.en train.ja preproc-train

where train.en is the source file, train.ja is the target file, and preproc-train is the output directory. -threads indicates the number of threads you want to use (if you use more, preprocessing will finish more quickly), -clean 60 indicates you want to remove all sentences of more than 60 words, and -align indicates that you want to run GIZA++ to generate alignments. When preprocessing finishes, you will have a number of directories under preproc-train containing the corpus in various forms.
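To make the effect of -clean concrete, the following is a minimal sketch of length-based filtering on a parallel corpus. It is not the code that preprocess.pl runs; the function name clean_parallel is hypothetical, and it assumes that words are whitespace-separated and that a sentence pair is dropped when either side exceeds the limit.

```shell
#!/usr/bin/env bash
# Illustrative only: NOT the implementation behind preprocess.pl -clean.
# Keep a sentence pair only if neither side has more than MAX words.
# Assumptions: words are whitespace-separated; line i of SRC pairs with
# line i of TRG; a pair is dropped if either side is too long.
clean_parallel() {
    local src=$1 trg=$2 max=$3 out_src=$4 out_trg=$5
    # Make sure both outputs exist even if every pair is filtered out.
    : > "$out_src"
    : > "$out_trg"
    awk -v max="$max" -v trgfile="$trg" -v osrc="$out_src" -v otrg="$out_trg" '
    {
        src_line = $0
        src_len = NF                      # word count of the source line
        # Read the matching target line; stop if the target file is shorter.
        if ((getline trg_line < trgfile) <= 0) exit
        trg_len = split(trg_line, words, " ")
        if (src_len <= max && trg_len <= max) {
            print src_line > osrc
            print trg_line > otrg
        }
    }' "$src"
}
```

For example, clean_parallel train.en train.ja 60 clean.en clean.ja would write the surviving pairs to clean.en and clean.ja, keeping the two files line-aligned.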

In addition, if you don't mind a small amount of extra work to achieve the best accuracy possible, you can use Nile to obtain better word alignments. This can be done by adding "-nile-model MODEL_NAME.txt" where MODEL_NAME.txt is the name of a model for Nile that you have trained according to the Nile directions. You can download a model for the following language pairs (if you have a model for a different language pair, please contribute! We will list it here):

English-Japanese (en-ja)

If the model was trained in the opposite direction from the source/target order given to the preprocessing script, add the "-nile-direction trgsrc" option.

Preprocessing Testing Data

When preprocessing your development and testing data, you might want to use a command like this:

$TRAVATAR_DIR/script/preprocess/preprocess.pl \
   -src en -trg ja -threads 4 -forest-src dev.en dev.ja preproc-dev
$TRAVATAR_DIR/script/preprocess/preprocess.pl \
   -src en -trg ja -threads 4 -forest-src test.en test.ja preproc-test

In this case we should not filter the data and do not need to align it, so -clean and -align are omitted. However, we may want to generate source-side forests for the languages that support them (e.g. en, zh), which is what -forest-src requests.
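When there are several held-out sets, the same command can be generated in a loop. The sketch below only prints the commands so they can be inspected before running; the function name heldout_commands is hypothetical, and it assumes the SPLIT.en / SPLIT.ja / preproc-SPLIT naming used in the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) one preprocess.pl command per
# held-out split. Assumes input files are named SPLIT.SRC and SPLIT.TRG
# and that output goes to preproc-SPLIT, as in the dev/test commands above.
heldout_commands() {
    local src=$1 trg=$2 threads=$3
    shift 3
    local split
    for split in "$@"; do
        # $TRAVATAR_DIR is printed literally and expanded when the line is run.
        printf '%s\n' "\$TRAVATAR_DIR/script/preprocess/preprocess.pl -src $src -trg $trg -threads $threads -forest-src $split.$src $split.$trg preproc-$split"
    done
}
```

Calling heldout_commands en ja 4 dev test prints the two commands shown above, each on a single line; piping the output to bash would execute them.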

Training/Tuning/Testing

Once we have run the preprocessing script, it is pretty easy to run training, tuning, and testing. For example, to build an English-Japanese system we first run training (assuming we've already built a language model ja.blm as mentioned on the training guide page):

nohup travatar/script/train/train-travatar.pl -work_dir train -lm_file ja.blm -src_file preproc-train/treelow/en -trg_file preproc-train/low/ja -align_file preproc-train/giza/enja -travatar_dir travatar -threads 2 &> train.log &
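Because training can run for a long time, it may be worth checking the inputs before launching it. The following is a minimal sketch, assuming that the tree, target, and alignment files each hold one sentence per line; check_parallel_files is a hypothetical helper and not part of Travatar.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Travatar: confirm that every file given
# exists, is non-empty, and has the same number of lines as the others.
# Assumption: each file stores one sentence (tree, string, or alignment) per line.
check_parallel_files() {
    local expected="" f n
    if [ "$#" -eq 0 ]; then
        echo "usage: check_parallel_files FILE..." >&2
        return 1
    fi
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
        # Strip whitespace because some wc implementations pad the count.
        n=$(wc -l < "$f" | tr -d '[:space:]')
        if [ -z "$expected" ]; then
            expected=$n
        elif [ "$n" != "$expected" ]; then
            echo "line count mismatch: $f has $n lines, expected $expected" >&2
            return 1
        fi
    done
    echo "ok: $# files, $expected lines each"
}
```

For the training command above, this could be called as check_parallel_files preproc-train/treelow/en preproc-train/low/ja preproc-train/giza/enja.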

Next, we perform tuning, using the forests (hence -in-format egret) we created for the dev set:

nohup travatar/script/mert/mert-travatar.pl -in-format egret -travatar-config train/model/travatar.ini -src preproc-dev/forlow/en -ref preproc-dev/low/ja -travatar-dir travatar -working-dir tune &> tune.log &

Finally, we decode using the tuned model:

travatar/src/bin/travatar -config_file tune/travatar.ini -in_format egret < preproc-test/forlow/en > output.ja

and measure the accuracy:

travatar/src/bin/mt-evaluator -ref preproc-test/low/ja output.ja