Travatar Model Format

This page describes the basics of the Travatar model format and gives some tips on how to modify the model if necessary.

Model Format Basics

After training, the Travatar model can be found in train/model/rule-table.gz, where "train" is the output directory specified at training time. Each rule in the table has the following format:

source ||| target ||| features ||| counts ||| alignments

Below is an example of a rule:

vp ( vbd ( "expected" ) x0:sbar ) ||| x0:sbar "予想" "し" "た" @ vp ||| fgep=-6.07 w=3 egfp=-4.52 fgel=-3.11 p=1 egfl=-13.28 lfreq=2.22 ||| 1 10 47 ||| 0-0 0-1 0-2

source: This is a source-side subtree, representing the portion of a syntactic tree to be translated. The head of the tree comes first, followed by its bracketed children. Strings surrounded by quotes are terminals, and strings like "x0:sbar" are variables marking the places where a child tree connects. The variables ("x0", "x1", and so on) must appear in ascending order.
target: This has a form similar to the source, but with no brackets or tree structure. The ":sbar" after the x0 is optional. The "@ vp" at the end of the string indicates the syntactic head of the rule, and is also optional.
features: This is a list of "string=value" pairs. The string is the name of a feature, and the value is its value. Travatar can handle sparse features during decoding, so it is possible to add features with any name you like.
counts: This is the number of times the rule appeared in the training corpus. The three numbers are the counts of the source-target pair, the source, and the target respectively. This field does not affect decoding.
alignments: This is an alignment between all terminals (words surrounded by quotes) in the source and target sides of the rule. This field also has no effect on decoding.
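
As a rough illustration of the field layout described above, the following Python sketch splits one rule line on the "|||" delimiter and converts each field into a more convenient type. The function name parse_rule and the dictionary keys are hypothetical names chosen for this example; they are not part of Travatar. The sketch assumes that fields after the target may be missing, and it reads counts as floats in case they are not whole numbers.

```python
def parse_rule(line):
    """Split one rule-table line into its fields (hypothetical helper)."""
    fields = [f.strip() for f in line.rstrip("\n").split("|||")]
    # Pad with empty strings so rules with fewer than five fields still parse.
    fields += [""] * (5 - len(fields))
    source, target, features, counts, alignments = fields[:5]
    return {
        "source": source,
        "target": target,
        # "name=value" pairs; split on the last "=" to get the value.
        "features": {name: float(value) for name, value in
                     (pair.rsplit("=", 1) for pair in features.split())},
        "counts": [float(c) for c in counts.split()],
        # "0-1" becomes the tuple (0, 1).
        "alignments": [tuple(int(i) for i in a.split("-"))
                       for a in alignments.split()],
    }


example = ('vp ( vbd ( "expected" ) x0:sbar ) ||| x0:sbar "予想" "し" "た" @ vp ||| '
           'fgep=-6.07 w=3 egfp=-4.52 fgel=-3.11 p=1 egfl=-13.28 lfreq=2.22 ||| '
           '1 10 47 ||| 0-0 0-1 0-2')
rule = parse_rule(example)
print(rule["target"])
print(rule["features"]["egfp"])
```

A rule that stops after the features field, such as a hand-written one, is returned with an empty "counts" list and an empty "alignments" list.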

The features in the example above are the standard dense features used in machine translation systems (see Philipp Koehn's "Statistical Machine Translation" for more details). They are described briefly below:

egfp, fgep: These are the log conditional probabilities "log P(e|f)" and "log P(f|e)", where f is the source side and e is the target side of the rule.
egfl, fgel: These are log lexical probabilities, where the probability of each word is calculated separately: "log P_l(e|f)" and "log P_l(f|e)".
lfreq: This is the log frequency "log c(e,f)".
w: The number of words on the target side.
p: This is the phrase count feature; it is always 1.
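
The definition of w can be checked against the example rule shown earlier. The sketch below counts the quoted terminals on the target side, which for that rule gives the same value as its w feature. The helper name count_target_words is hypothetical, and the sketch assumes that terminals never contain whitespace.

```python
def count_target_words(target):
    """Count quoted terminals in a target string (hypothetical helper)."""
    # A terminal is a whitespace-separated token wrapped in double quotes.
    # Variables such as x0:sbar and the optional "@ head" suffix are unquoted,
    # so they are not counted.
    return sum(1 for tok in target.split()
               if len(tok) >= 2 and tok[0] == '"' and tok[-1] == '"')


target = 'x0:sbar "予想" "し" "た" @ vp'
print(count_target_words(target))  # prints 3, matching w=3 in the example rule
```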

Modifying Travatar Models

If you would like to modify the Travatar model for whatever reason, you can do so directly, because it is in text format. For example, if we really don't want to translate anything about apples, we can remove all rules that contain the string "apple":

zcat train/model/rule-table.gz | grep -v apple | gzip > train/model/no-apples.gz
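
Note that grep -v apple drops every line containing that substring, so a rule about "pineapple" or a rule with a feature named apple_rule would be removed as well. If that is too broad, a narrower filter can match only the quoted terminal on the source or target side. The following Python code is a minimal sketch of that idea; the function names has_terminal and filter_table are hypothetical, and the sketch assumes that fields are separated by " ||| " as in the format above.

```python
import gzip


def has_terminal(line, word):
    """Return True if the source or target side contains the terminal "word"."""
    fields = line.split(" ||| ")
    quoted = '"%s"' % word
    # Only the first two fields (source and target) hold terminals.
    return any(quoted in field.split() for field in fields[:2])


def filter_table(in_path, out_path, word):
    """Copy a gzipped rule table, skipping rules that contain the terminal."""
    with gzip.open(in_path, "rt", encoding="utf-8") as fin, \
         gzip.open(out_path, "wt", encoding="utf-8") as fout:
        for line in fin:
            if not has_terminal(line, word):
                fout.write(line)


# Example call, using the same paths as the command above:
# filter_table("train/model/rule-table.gz", "train/model/no-apples.gz", "apple")
```

Because rules are only removed and the remaining lines keep their order, the output is in the same order as the input table.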

It is also possible to add new rules. Let's say we want to add a rule to translate the proper noun "apple" into "アップル" (the company) and the regular noun "apple" into "りんご" (the fruit). We can do so by creating a file apple.txt:

echo 'nnp ( "apple" ) ||| "アップル" ||| apple_rule=1' >> apple.txt
echo 'nn ( "apple" ) ||| "りんご" ||| apple_rule=1' >> apple.txt

We can then combine this file with our original rule table. Note that Travatar rule tables must be sorted, so we perform a sort on the newly concatenated table.

gzip apple.txt
zcat train/model/rule-table.gz apple.txt.gz | LC_ALL=C sort | gzip > train/model/with-apples.gz
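
For readers who prefer to script this step, the sketch below performs the same concatenate-and-sort in Python. It sorts the UTF-8 encoded lines as raw bytes, which is intended to mirror what LC_ALL=C sort does; that byte order is the order Travatar expects is an assumption based on the command above. The function names merge_sorted and merge_tables are hypothetical. The whole table is held in memory, so this is only suitable for tables that fit in RAM.

```python
import gzip


def merge_sorted(*line_groups):
    """Merge groups of rule lines and sort them by byte value."""
    # Strip newlines first so they do not take part in the comparison.
    rules = [line.rstrip("\n") for group in line_groups for line in group]
    return sorted(rules, key=lambda r: r.encode("utf-8"))


def merge_tables(out_path, *in_paths):
    """Concatenate gzipped rule tables and write one byte-sorted table."""
    groups = []
    for path in in_paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            groups.append(f.readlines())
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for rule in merge_sorted(*groups):
            out.write(rule + "\n")


# Example call, using the same paths as the commands above:
# merge_tables("train/model/with-apples.gz",
#              "train/model/rule-table.gz", "apple.txt.gz")
```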

After creating the new rule table, we can modify the travatar.ini file to point to it.