Parsing for Travatar
Travatar is a tree-to-string translation system, which means that it requires source language parse trees to perform translation. Within this framework, better parse trees will often directly improve your translation accuracy, so it is worth investing a little time to create good trees. At decoding time, Travatar can also use packed forests, which encode a large number of trees, so a parser that can output packed forests is also effective. This page describes how to perform the parsing step on your own, which is useful if you want to adjust various parameters of the parser, but there is also a preprocessing script that makes it a bit easier to get going quickly.
For parsing English, you can use any phrase structure parser of your choosing. In particular, we recommend the Ckylark parser for trees, or the Egret parser for forests. Note that parsing has a large effect on translation results (Neubig & Duh, ACL 2014), so it might be worth trying other, newer parsers.
Before parsing, you will have to tokenize your input. Travatar includes a tokenizer that does a reasonable job at tokenizing English, so you can use it on an input text input.txt as follows:
cat input.txt | travatar/src/bin/tokenizer > tokenized.txt
As mentioned in the training guide, the Ckylark parser is a good choice for generating trees, as it is accurate and rarely fails at parsing. To install the parser, if you haven't already, clone it from GitHub and follow the installation directions. Then, to parse English, unzip the English parsing model (named WSJ, as it was trained on the Wall Street Journal section of the Penn Treebank).
Next, we can use this model to parse English sentences using the following command.
cat tokenized.txt | Ckylark/src/ckylark --model Ckylark/model/wsj > ckylark-parsed.txt
Egret is a parser that is able to output packed forests that can be used in translation.
Download Egret from the github page and unzip it. You can then run the following command in the top directory to compile:
Parsing 1-best Trees
Travatar training currently supports only 1-best trees for rule learning, so we will first describe how to create 1-best trees. Given the tokenized file tokenized.txt, we run Egret as follows, where $EGRET_DIR is the directory to which you downloaded Egret.
$EGRET_DIR/egret -lapcfg -i=tokenized.txt -data=$EGRET_DIR/eng_grammar | sed 's/( /(ROOT /g' > egret-parsed.txt
The "sed" command here replaces the empty root symbol with an explicit "ROOT" symbol, which is necessary for Travatar.
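As a quick illustration of what this substitution does, the sketch below runs the same "sed" expression on a single hand-written line. The sample tree is an assumption made up for this example (it only mimics the empty-root shape described above) and is not actual Egret output; add_root is a hypothetical helper name.

```shell
# Wrap the sed expression from the command above in a small helper.
# add_root is a hypothetical name used only for this illustration.
add_root() {
  sed 's/( /(ROOT /g'
}

# Hand-written sample line with an empty root label (not real Egret output).
sample='( (S (NP (PRP I)) (VP (VBP run))) )'

# Prints: (ROOT (S (NP (PRP I)) (VP (VBP run))) )
printf '%s\n' "$sample" | add_root
```

Only an opening parenthesis that is immediately followed by a space is rewritten, so labeled nodes such as "(S" and "(NP" are left untouched.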
Unfortunately, Egret fails to parse some sentences (roughly one in every 10,000) and outputs an empty tree "(())" for them. Rules cannot be extracted from these failed sentences, and they cannot be translated. If we already have parse trees from the Ckylark parser, we can use the script replace-failed-parse.pl included with Travatar to replace only the failed parses.
$TRAVATAR_DIR/script/tree/replace-failed-parse.pl ckylark-parsed.txt egret-parsed.txt > egret-parsefixed.txt
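To see how many sentences are affected before or after the replacement, it may help to count the lines that consist of exactly the empty tree "(())". The following is a minimal sketch using only standard grep and cut; count_failed and failed_line_numbers are hypothetical helper names, not part of Travatar or Egret.

```shell
# Count lines that are exactly the empty tree "(())".
# -x matches the whole line, -F treats the pattern as a fixed string.
# "|| true" keeps the exit status at 0 when there are no failures.
count_failed() {
  grep -c -x -F '(())' "$1" || true
}

# Print the 1-based line numbers of the failed parses, one per line.
failed_line_numbers() {
  grep -n -x -F '(())' "$1" | cut -d: -f1
}

# Example usage (assumes the file names used on this page):
#   count_failed egret-parsed.txt
#   count_failed egret-parsefixed.txt
```

If the second count is not zero, the Ckylark parse for those sentences presumably failed as well, and the listed line numbers point to the sentences to inspect.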
If parsing is taking a lot of time, you can parse in parallel by running Egret on multiple processors and specifying -range to indicate which sentences you want to parse in that particular instance. For example, if you have 1000 sentences to parse, you could run the following
$EGRET_DIR/egret -range=1-500 -lapcfg -i=tokenized.txt -data=$EGRET_DIR/eng_grammar > egret-parsed-1.txt &
$EGRET_DIR/egret -range=501-1000 -lapcfg -i=tokenized.txt -data=$EGRET_DIR/eng_grammar > egret-parsed-2.txt &
and then combine the two files together when both processes finish running.
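For more than two processes, the ranges can be computed rather than written by hand. The sketch below is one possible way to do this, under two assumptions taken from the example above: that -range is 1-based and inclusive, and that concatenating the partial outputs in range order gives the full result. split_ranges and parse_in_parallel are hypothetical helper names, and the partial file names are illustrative.

```shell
# Print one "start-end" range per line, covering sentences 1..total
# in at most "jobs" contiguous chunks (assumes 1-based, inclusive ranges).
split_ranges() {
  sr_total=$1
  sr_jobs=$2
  sr_chunk=$(( (sr_total + sr_jobs - 1) / sr_jobs ))
  sr_start=1
  while [ "$sr_start" -le "$sr_total" ]; do
    sr_end=$(( sr_start + sr_chunk - 1 ))
    if [ "$sr_end" -gt "$sr_total" ]; then
      sr_end=$sr_total
    fi
    echo "$sr_start-$sr_end"
    sr_start=$(( sr_end + 1 ))
  done
}

# Run one Egret process per range, wait for all of them, then
# concatenate the partial outputs in range order.
# Arguments: tokenized input file, number of jobs, output file.
# Assumes file names without spaces and that $EGRET_DIR is set.
parse_in_parallel() {
  pp_input=$1
  pp_jobs=$2
  pp_output=$3
  pp_total=$(( $(wc -l < "$pp_input") ))
  pp_parts=""
  for pp_range in $(split_ranges "$pp_total" "$pp_jobs"); do
    "$EGRET_DIR/egret" -range="$pp_range" -lapcfg -i="$pp_input" \
      -data="$EGRET_DIR/eng_grammar" > "$pp_output.$pp_range" &
    pp_parts="$pp_parts $pp_output.$pp_range"
  done
  wait
  cat $pp_parts > "$pp_output"
}

# Example usage (not run here):
#   parse_in_parallel tokenized.txt 4 egret-parsed.txt
```

With 1000 sentences and two jobs, split_ranges yields the same 1-500 and 501-1000 ranges as the commands above. As in those commands, the output does not pass through the "sed" step, so apply it to the combined file if you need the explicit "ROOT" symbol.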
Creating forests with Egret just requires a few extra options. An example of the command can be found below:
$EGRET_DIR/egret -lapcfg -i=tokenized.txt -data=$EGRET_DIR/eng_grammar -nbest4threshold=100 -printForest > egret-forest.txt
Here -nbest4threshold=100 indicates that we will only keep nodes or edges that appear in the 100 best parse trees. This is generally a good setting, but larger forests can sometimes improve accuracy (Neubig & Duh, ACL 2014).
Again, as some parses may fail, you can replace them with parses from Ckylark. In order to do so, we first convert the Ckylark trees to forests with the tree-converter tool provided with Travatar, and then use the replace-failed-parse.pl script as we did before.
$TRAVATAR_DIR/src/bin/tree-converter -input_format penn -output_format egret < ckylark-parsed.txt > ckylark-forest.txt
$TRAVATAR_DIR/script/tree/replace-failed-parse.pl -format egret ckylark-forest.txt egret-forest.txt > egret-forestfixed.txt
When using forests in decoding, we must pass the option "-in_format egret" to Travatar. When running mert-travatar.pl, this can be done using "-in-format egret" (note that one uses an underscore and the other a hyphen).
Before parsing Japanese, it is necessary to tokenize the text. This can be done using the KyTea segmenter. Install it according to the directions on the site, then tokenize using the following command (-notags turns off POS tagging, and -wsconst D prevents numbers from being divided):
kytea -notags -wsconst D < input.txt > tokenized.txt
For parsing Japanese, you can use Ckylark and Egret in the same way as for English. The Japanese model in Ckylark is labeled jdc, and the Japanese model in Egret is labeled jpn_grammar.
The Stanford Parser supports German and French, so it can be used to generate trees.