Travatar Model Format
This page covers the basics of the Travatar model format and gives some tips on how to modify a model if necessary.
Model Format Basics
After training, the Travatar model can be found in train/model/rule-table.gz, where "train" is the output directory specified at training time. Each line of the table is one rule, with its fields separated by " ||| " as follows:
source ||| target ||| features ||| counts ||| alignments
Below is an example of a rule:
vp ( vbd ( "expected" ) x0:sbar ) ||| x0:sbar "予想" "し" "た" @ vp ||| fgep=-6.07 w=3 egfp=-4.52 fgel=-3.11 p=1 egfl=-13.28 lfreq=2.22 ||| 1 10 47 ||| 0-0 0-1 0-2
- source: This is a source side subtree, representing the portion of a syntactic tree to be translated. The head of the tree comes first, followed by bracketed children. Strings surrounded with quotes are terminals, and strings like "x0:sbar" are places that connect to a child tree. The "x0" "x1" variables must be in ascending order.
- target: This is in a similar form as the source, but with no brackets or tree structure. The ":sbar" after the x0 is optional. The "@ vp" at the end of the string indicates the syntactic head of the rule, and is also optional.
- features: This is a list of "string=value" pairs. The string is the name of a feature, and the value is its value. Travatar can handle sparse features during decoding, so it is possible to add features with any name you like.
- counts: These are the number of times the rule appeared in the training corpus. The three numbers are counts of source-target, source, and target respectively. This field does not affect decoding.
- alignments: This is an alignment between all terminals (words surrounded by quotes) in the source and target sides of the rule. This also has no effect on decoding.
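To make the field layout concrete, here is a minimal sketch that splits one rule on the " ||| " delimiter with awk. It uses the example rule from above as a literal string; the variable names rule and fields are illustrative only, and in practice the input would come from the rule table itself.

```shell
# Example rule copied from this page; in practice, read lines from
# `zcat train/model/rule-table.gz` instead of a literal string.
rule='vp ( vbd ( "expected" ) x0:sbar ) ||| x0:sbar "予想" "し" "た" @ vp ||| fgep=-6.07 w=3 egfp=-4.52 fgel=-3.11 p=1 egfl=-13.28 lfreq=2.22 ||| 1 10 47 ||| 0-0 0-1 0-2'

# Split on the " ||| " delimiter and label each of the five fields.
fields=$(printf '%s\n' "$rule" | awk -F' [|][|][|] ' '{
    printf "source: %s\n", $1
    printf "target: %s\n", $2
    printf "features: %s\n", $3
    printf "counts: %s\n", $4
    printf "alignments: %s\n", $5
}')
printf '%s\n' "$fields"
```

Splitting on the full delimiter, spaces included, keeps the whitespace inside each field untouched.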
With regard to the features in particular, these are the dense features commonly used in statistical machine translation systems (take a look at Philipp Koehn's "Statistical Machine Translation" for more details). They are described briefly below:
- egfp, fgep: These are log conditional probabilities "log P(e|f)" and "log P(f|e)".
- egfl, fgel: These are log lexical probabilities, where the probability of each word is calculated separately: "log Pl(e|f)" and "log Pl(f|e)".
- lfreq: This is the log frequency "log c(e,f)".
- w: The number of words on the target side.
- p: This is the phrase count feature; it is always one.
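As a small illustration of reading these features, the sketch below pulls one value out of the features field and converts it from a log probability back to a probability. The helper feature_value is hypothetical, not part of Travatar, and the conversion assumes the scores are natural logarithms, which this page does not state.

```shell
# Features field copied from the example rule on this page.
features='fgep=-6.07 w=3 egfp=-4.52 fgel=-3.11 p=1 egfl=-13.28 lfreq=2.22'

# feature_value NAME (hypothetical helper): print the value of feature NAME,
# or nothing if the rule does not carry that feature.
feature_value() {
    printf '%s\n' "$features" | tr ' ' '\n' |
        awk -F= -v name="$1" '$1 == name { print $2 }'
}

egfp=$(feature_value egfp)
# Assumption: the score is a natural log, so exp() recovers the probability.
prob=$(awk -v x="$egfp" 'BEGIN { printf "%.4f", exp(x) }')
echo "egfp=$egfp P(e|f)=$prob"
```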
Modifying Travatar Models
If you would like to modify the Travatar model for whatever reason, it is in text format, so you can do so directly. For example, if we really don't want to translate anything about apples, we can remove all rules that contain the string "apple":
zcat train/model/rule-table.gz | grep -v apple | gzip > train/model/no-apples.gz
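Note that a plain grep -v matches anywhere in the line, so it would also drop rules for words such as "pineapple". If that is not what you want, one possible refinement is to test only the source field for the exact quoted terminal. The sketch below runs on a tiny made-up sample rather than a real rule table; the rules in it are hypothetical.

```shell
# Tiny stand-in for `zcat train/model/rule-table.gz` (hypothetical rules).
sample=$(mktemp)
cat > "$sample" <<'EOF'
nn ( "apple" ) ||| "りんご" ||| p=1
nn ( "pineapple" ) ||| "パイナップル" ||| p=1
nn ( "pear" ) ||| "梨" ||| p=1
EOF

# Keep a rule unless its source side (field 1) contains the exact
# quoted terminal "apple".
kept=$(awk -F' [|][|][|] ' 'index($1, "\"apple\"") == 0' "$sample")
printf '%s\n' "$kept"
rm -f "$sample"
```

On a real table, the awk command would sit between zcat and gzip in place of grep -v apple.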
It is also possible to add new rules. Let's say we want to add a rule to translate the proper noun "apple" into "アップル" (the company) and the regular noun "apple" into "りんご" (the fruit). We can do so by creating a file apple.txt:
echo 'nnp ( "apple" ) ||| "アップル" ||| apple_rule=1' >> apple.txt
echo 'nn ( "apple" ) ||| "りんご" ||| apple_rule=1' >> apple.txt
We can then combine this file with our original rule table. Note that Travatar rule tables must be sorted, so we perform a sort on the newly concatenated table.
gzip apple.txt
zcat train/model/rule-table.gz apple.txt.gz | LC_ALL=C sort | gzip > train/model/with-apples.gz
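Because the table must be sorted, it can be worth checking the result before using it. The following is a minimal sketch of such a check using sort -c, run in a temporary directory on a made-up one-rule table so that it does not touch a real model; the file contents and the workdir and sorted variables are illustrative.

```shell
workdir=$(mktemp -d)

# Hypothetical stand-ins for the original table and the new rules.
printf '%s\n' 'nn ( "pear" ) ||| "梨" ||| p=1' | gzip > "$workdir/rule-table.gz"
printf '%s\n' 'nnp ( "apple" ) ||| "アップル" ||| apple_rule=1' \
              'nn ( "apple" ) ||| "りんご" ||| apple_rule=1' | gzip > "$workdir/apple.txt.gz"

# Merge and sort in the C locale, mirroring the command above.
gzip -cd "$workdir/rule-table.gz" "$workdir/apple.txt.gz" |
    LC_ALL=C sort | gzip > "$workdir/with-apples.gz"

# sort -c exits non-zero if its input is not already sorted.
if gzip -cd "$workdir/with-apples.gz" | LC_ALL=C sort -c; then
    sorted=yes
else
    sorted=no
fi
echo "sorted: $sorted"
rm -rf "$workdir"
```

The check uses the same LC_ALL=C setting as the sort itself, since a different locale can order the lines differently.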
After creating a new rule table, we can then modify the travatar.ini file to point to our new table.