This guide will take you through the steps required to create a translation model for Travatar. In particular, we will use English-Japanese translation as an example, but this should work for other languages as well. If you have any trouble with this tutorial, feel free to ask questions, as mentioned on the front page.
First, let's create a directory to work in and change to it:
mkdir ~/travatar-tutorial
cd ~/travatar-tutorial
Next, you must get the latest version of Travatar. Follow the directions on the download page and place the compiled program in the travatar directory below travatar-tutorial. If you can run the following commands and get the help message, everything should be working properly.
cd ~/travatar-tutorial
travatar/src/bin/travatar --help
In addition, you will need to install a syntactic parser to parse the input sentences, a tokenizer for the output sentences, and a word aligner. In this tutorial, in addition to Travatar, we will use the Ckylark parser for parsing English, the KyTea word segmenter for segmenting Japanese, and GIZA++ for word alignment. You can go to each of these sites and download the code, or preferably download the code using git as follows:
cd ~/travatar-tutorial
git clone https://github.com/odashi/ckylark.git
git clone https://github.com/neubig/kytea.git
git clone https://github.com/moses-smt/giza-pp.git
Then, compile Ckylark:
cd ckylark
autoreconf -i
./configure
make
cd ~/travatar-tutorial
Next, compile KyTea and install it under travatar-tutorial/usr:
cd kytea
autoreconf -i
./configure --prefix=$HOME/travatar-tutorial/usr
make
make install
cd ~/travatar-tutorial
GIZA++ can be compiled as follows, and we additionally copy all of the binaries into the top directory for convenience later:
cd giza-pp
make
cp GIZA++-v2/GIZA++ GIZA++-v2/*.out mkcls-v2/mkcls .
cd ~/travatar-tutorial
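Before moving on, it can help to confirm that the builds actually produced the binaries used later in this tutorial. The following is a minimal sketch, not part of Travatar: the helper name check_executables is hypothetical, and the paths are simply the ones used in the commands of this guide.

```shell
# Print every path that is not an executable regular file.
# Returns non-zero if at least one path is missing.
# (check_executables is a hypothetical helper, not a Travatar script.)
check_executables() {
  missing=0
  for path in "$@"; do
    if [ -f "$path" ] && [ -x "$path" ]; then
      :
    else
      echo "missing or not executable: $path"
      missing=1
    fi
  done
  return "$missing"
}

# Paths as used elsewhere in this tutorial.
base="$HOME/travatar-tutorial"
if check_executables \
    "$base/travatar/src/bin/travatar" \
    "$base/ckylark/src/bin/ckylark" \
    "$base/usr/bin/kytea" \
    "$base/giza-pp/GIZA++" \
    "$base/giza-pp/mkcls"; then
  echo "all tools found"
else
  echo "some tools are missing; revisit the build steps"
fi
```

If any path is reported, re-run the corresponding build step and look at its output for compiler errors.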
Next, we need to collect data for training the translation and language models. In this guide, we will use data from the Kyoto Free Translation Task. You can acquire this data using the following command:
wget http://www.phontron.com/kftt/download/kftt-data-1.0.tar.gz
tar -xzf kftt-data-1.0.tar.gz
The next step is to prepare the data in a format that Travatar's training and translation can use. This consists of parsing the input and tokenizing the output. (We recommend that you try this on your own once, but Travatar also provides a single preprocessing script that performs all these steps at once for several languages.) First, let's make a directory for the data we will use:
mkdir data
The first thing we need to do is tokenize our data (in other words, divide it into words). For English, we can use the tokenizer included with Travatar.
travatar/src/bin/tokenizer < kftt-data-1.0/data/orig/kyoto-train.en > data/kyoto-train.tok.en
If you take a look at data/kyoto-train.tok.en you should see that the words have been tokenized. Next, we do the same for kyoto-dev and kyoto-test.
travatar/src/bin/tokenizer < kftt-data-1.0/data/orig/kyoto-dev.en > data/kyoto-dev.tok.en
travatar/src/bin/tokenizer < kftt-data-1.0/data/orig/kyoto-test.en > data/kyoto-test.tok.en
Next, we tokenize Japanese with KyTea. We add -notags and -wsconst D to suppress the output of POS tags and prevent segmentation of numbers.
usr/bin/kytea -notags -wsconst D < kftt-data-1.0/data/orig/kyoto-train.ja > data/kyoto-train.tok.ja
You can also check to see that the Japanese has been properly segmented into words. Again, we do the same for kyoto-dev and kyoto-test.
usr/bin/kytea -notags -wsconst D < kftt-data-1.0/data/orig/kyoto-dev.ja > data/kyoto-dev.tok.ja
usr/bin/kytea -notags -wsconst D < kftt-data-1.0/data/orig/kyoto-test.ja > data/kyoto-test.tok.ja
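Because the English and Japanese files are parallel, each pair should still contain the same number of lines after tokenization. A quick sanity check along these lines may catch a failed or interrupted run early. This is only a sketch; same_line_count is a hypothetical helper, not something shipped with Travatar.

```shell
# Succeed only if both files have the same number of lines.
# (same_line_count is a hypothetical helper for illustration.)
same_line_count() {
  n1=$(awk 'END { print NR }' "$1")
  n2=$(awk 'END { print NR }' "$2")
  if [ "$n1" -eq "$n2" ]; then
    echo "ok: $1 and $2 both have $n1 lines"
  else
    echo "mismatch: $1 has $n1 lines, $2 has $n2 lines"
    return 1
  fi
}

# Check each tokenized pair that exists in the data directory.
for part in train dev test; do
  en="data/kyoto-$part.tok.en"
  ja="data/kyoto-$part.tok.ja"
  if [ -f "$en" ] && [ -f "$ja" ]; then
    if ! same_line_count "$en" "$ja"; then
      echo "re-run tokenization for kyoto-$part"
    fi
  fi
done
```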
Cleaning the Training Data
When very long sentences exist in the training data, they can cause parsing and alignment to take a very long time, or even worse, fail. To get rid of these sentences from the training data, we use a script included with Travatar to clean the corpus. (By changing the -max_len setting, you can change the maximum length of the data used.)
travatar/script/train/clean-corpus.pl -max_len 60 data/kyoto-train.tok.en data/kyoto-train.tok.ja data/kyoto-clean.tok.en data/kyoto-clean.tok.ja
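To make the idea of length-based cleaning concrete, here is a minimal sketch of this kind of filter written with paste and awk. It is not the implementation of clean-corpus.pl, and its exact rules are assumptions: it keeps a sentence pair only when both sides have at least one token and at most the given maximum, and it assumes the text contains no tab characters. The function name filter_by_length is hypothetical.

```shell
# Keep only sentence pairs where both sides have between 1 and max_len tokens.
# Usage: filter_by_length MAX_LEN SRC TRG OUT_SRC OUT_TRG
# (hypothetical helper; the real cleaning is done by clean-corpus.pl)
filter_by_length() {
  max_len=$1
  src=$2
  trg=$3
  out_src=$4
  out_trg=$5
  # Start from empty output files so they exist even if nothing is kept.
  : > "$out_src"
  : > "$out_trg"
  # Join the two files line by line with a tab, then count tokens per side.
  paste "$src" "$trg" | awk -F '\t' -v max="$max_len" -v os="$out_src" -v ot="$out_trg" '
    {
      ns = split($1, a, " ")   # number of source tokens
      nt = split($2, b, " ")   # number of target tokens
      if (ns >= 1 && nt >= 1 && ns <= max && nt <= max) {
        print $1 >> os
        print $2 >> ot
      }
    }'
}

# Example call (paths as in this tutorial, output names are arbitrary):
# filter_by_length 60 data/kyoto-train.tok.en data/kyoto-train.tok.ja out.en out.ja
```

Filtering both sides together matters: dropping a line from only one file would shift every later sentence pair out of alignment.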
In addition, as you will probably want to go through this tutorial quickly, we will use only some of the training data (e.g. the first 20000 lines).
head -20000 < data/kyoto-clean.tok.en > data/kyoto-head.tok.en
head -20000 < data/kyoto-clean.tok.ja > data/kyoto-head.tok.ja
Note that if you actually want to build a good translation system, you should use all of the data you have. If you want to do the tutorial with the full data set, just replace kyoto-head with kyoto-clean for the rest of the tutorial.
Next, we will use the Ckylark parser to parse the source side English sentences.
ckylark/src/bin/ckylark --add-root-tag --model ckylark/data/wsj < data/kyoto-head.tok.en > data/kyoto-head.parse.en
And do the same for kyoto-dev and kyoto-test.
ckylark/src/bin/ckylark --add-root-tag --model ckylark/data/wsj < data/kyoto-dev.tok.en > data/kyoto-dev.parse.en
ckylark/src/bin/ckylark --add-root-tag --model ckylark/data/wsj < data/kyoto-test.tok.en > data/kyoto-test.parse.en
Note that parsing is slow, so these commands will take a while (an hour or two). Parsing issues, including ways to speed things up, are discussed in more detail on the parsing page.
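One generic way to shorten the wait on a multi-core machine is to split the input into chunks, process the chunks in parallel, and concatenate the results in order. The sketch below shows the idea with a hypothetical helper, run_in_chunks. It assumes the command reads from standard input, writes to standard output, and treats every line independently; whether this holds for a given parser, and how much memory each process needs for its model, should be checked before relying on it.

```shell
# Run a line-by-line filter command on chunks of a file in parallel.
# Usage: run_in_chunks INPUT OUTPUT LINES_PER_CHUNK COMMAND [ARGS...]
# (hypothetical helper; starts one background process per chunk)
run_in_chunks() {
  input=$1
  output=$2
  chunk_lines=$3
  shift 3
  if [ ! -s "$input" ]; then
    echo "input is missing or empty: $input"
    return 1
  fi
  chunk_dir=$(mktemp -d)
  # split names the pieces chunk.aa, chunk.ab, ... in order.
  split -l "$chunk_lines" "$input" "$chunk_dir/chunk."
  for chunk in "$chunk_dir"/chunk.*; do
    "$@" < "$chunk" > "$chunk.out" &
  done
  wait
  # The glob expands in sorted order, so the original line order is kept.
  cat "$chunk_dir"/chunk.*.out > "$output"
  rm -rf "$chunk_dir"
}

# Example call with the parser command from this tutorial (4 chunks of 5000 lines):
# run_in_chunks data/kyoto-head.tok.en data/kyoto-head.parse.en 5000 \
#   ckylark/src/bin/ckylark --add-root-tag --model ckylark/data/wsj
```

Choose the chunk size so that the number of chunks does not exceed the number of cores, since every chunk is processed at the same time.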
Training the Language Model
As with most statistical translation systems, Travatar can use a language model (LM) to improve the fluency of its output. In order to train the LM, we first make a language model directory:
mkdir lm
Next, we convert the output data to lowercase:
travatar/script/tree/lowercase.pl < data/kyoto-train.tok.ja > data/kyoto-train.toklow.ja
Then we run KenLM (included with Travatar) to build a 5-gram language model:
travatar/src/kenlm/lm/lmplz -o 5 < data/kyoto-train.toklow.ja > lm/kyoto-train.ja.arpa
and binarize it for faster loading:
travatar/src/kenlm/lm/build_binary -i lm/kyoto-train.ja.arpa lm/kyoto-train.ja.blm
Training the Translation Model
Training the translation model requires the parsed training data, so you have to wait until the parsing is finished, at least for the training set. In order to prevent lower case words and upper case words from being treated differently, we will first want to convert all the data to lower case:
travatar/script/tree/lowercase.pl < data/kyoto-head.parse.en > data/kyoto-head.parselow.en
travatar/script/tree/lowercase.pl < data/kyoto-head.tok.ja > data/kyoto-head.toklow.ja
And do the same for kyoto-dev and kyoto-test.
travatar/script/tree/lowercase.pl < data/kyoto-dev.parse.en > data/kyoto-dev.parselow.en
travatar/script/tree/lowercase.pl < data/kyoto-dev.tok.ja > data/kyoto-dev.toklow.ja
travatar/script/tree/lowercase.pl < data/kyoto-test.parse.en > data/kyoto-test.parselow.en
travatar/script/tree/lowercase.pl < data/kyoto-test.tok.ja > data/kyoto-test.toklow.ja
Once this data is prepared, we run the following training script. Note that this takes our parsed English, tokenized Japanese, and language model as input. We specify the directories for GIZA++ and Travatar, and also our working directory, where the model will be stored. This will take a little while, so we will run it in the background using nohup at the beginning and & at the end. In addition, if you have a computer with multiple cores, you can specify the number of cores you would like to use with -threads (for example, with 2 threads below).
nohup travatar/script/train/train-travatar.pl -work_dir $HOME/travatar-tutorial/train -lm_file $HOME/travatar-tutorial/lm/kyoto-train.ja.blm -src_file data/kyoto-head.parselow.en -trg_file data/kyoto-head.toklow.ja -travatar_dir travatar -bin_dir giza-pp -threads 2 &> train.log &
If training ends very quickly, there is probably something wrong, so check train.log for any error messages. There are a couple of options for training that can improve the accuracy of translation, so once you have gone through the tutorial it will be worth checking them out.
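When scanning a long log by eye is tedious, a small helper can list the lines that look suspicious. The following is a sketch only: show_log_problems is a hypothetical helper, and the search patterns are guesses at typical error wording, not messages known to be printed by the Travatar scripts.

```shell
# Print log lines (with line numbers) that look like errors.
# The patterns are assumptions about common error wording.
# (show_log_problems is a hypothetical helper, not a Travatar script.)
show_log_problems() {
  if [ ! -f "$1" ]; then
    echo "no such log file: $1"
    return 1
  fi
  if ! grep -n -i -E 'error|fail|no such file|not found' "$1"; then
    echo "no suspicious lines found in $1"
  fi
}

# Example call once training has been started:
# show_log_problems train.log
```

An empty result does not prove that the run succeeded, so it is still worth reading the end of the log directly, for example with tail.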
The above training creates the fundamental translation model, and we are able to perform translation. However, to achieve reasonable accuracy, we must perform tuning, which adjusts the weights of the translation model, language model, word penalties, etc.
This is done with the mert-travatar.pl script in the following fashion. This also takes a considerable amount of time, as we have to translate the development set several times.
nohup travatar/script/mert/mert-travatar.pl -travatar-config train/model/travatar.ini -nbest 100 -src data/kyoto-dev.parselow.en -ref data/kyoto-dev.toklow.ja -travatar-dir travatar -working-dir tune &> tune.log &
Again, if this finishes very quickly, there is probably an error in tune.log. Also, if you want to speed up the tuning process you can use multiple processors by adding -threads XX where XX is the number of processors to use.
When tuning finishes, there will be an appropriately tuned model in tune/travatar.ini. We can now use this model to translate the test text using the Travatar decoder. Before translating, we will want to filter the model file to remove rules that are not needed for the test set, which reduces the memory footprint:
mkdir test
travatar/script/train/filter-model.pl tune/travatar.ini test/filtered-test.ini test/filtered-test "travatar/script/train/filter-rt.pl -src data/kyoto-test.parselow.en"
Here, the final argument in quotes is the command used to filter the rule table. You should change the -src option to whatever file you will be translating. Once we are done filtering, we can translate the test set as follows (again, add -threads XX to speed things up):
travatar/src/bin/travatar -config_file test/filtered-test.ini < data/kyoto-test.parselow.en > test/kyoto-test.out
Finally, we can measure the accuracy of our translations using an automatic evaluation metric. We do this by passing the reference and system output to the mt-evaluator program included with Travatar (other evaluation options are also available).
travatar/src/bin/mt-evaluator -ref data/kyoto-test.toklow.ja test/kyoto-test.out
If everything went OK, you should get a BLEU score of around 10-12 and a RIBES score of around 56-58 with the smaller data set, or higher scores with the full training set. If you want to improve the accuracy even more, please be sure to visit the parsing or training options sections for more details!