Tree Converter

Travatar comes with a program, tree-converter, that accepts the following options. Some of the functionality is described in more detail below.


 -input_format              The format of the input (penn/json/egret)
 -output_format             The format of the output (penn/json/egret/word)
 -split                     A regular expression to split words in the tree (e.g. "-")
 -compoundsplit             The language model file for use in compound splitting
 -compoundsplit_filler      Optional fillers for compound splitting, e.g. "e:es" for German
 -compoundsplit_threshold   Words with unigram probability mass above this threshold will not be split
 -compoundsplit_minchar     Minimum required characters in a subword for compound splitting
 -binarize                  How to binarize the trees (none/left/right/cky)
 -case                      How to case the trees (none/low/title/true:model=FILE)
 -flatten                   Whether to flatten unary productions
 -debug                     How much debug output to produce


The -binarize option binarizes the tree using one of the methods listed above (none/left/right/cky).


The -case option lowercases all words in the tree (low), title-cases the first word (title), or true-cases the first word according to a model (true:model=FILE).
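
As a rough illustration of the low and title modes, here is a minimal Python sketch that operates on a flat list of words. It is not Travatar's implementation. The function name apply_case is hypothetical, and the sketch assumes that title casing only uppercases the first character of the first word and leaves everything else untouched. The true mode is omitted because it depends on a model file.

```python
def apply_case(words, mode):
    """Illustrative casing of a sentence given as a list of words.

    Hypothetical helper, not part of Travatar.
    """
    if mode == "none":
        return list(words)
    if mode == "low":
        # Lowercase every word in the sentence.
        return [w.lower() for w in words]
    if mode == "title":
        if not words:
            return []
        first = words[0]
        # Assumption: only the first character of the first word changes.
        return [first[:1].upper() + first[1:]] + list(words[1:])
    raise ValueError("unsupported mode: " + mode)
```

For example, apply_case(["the", "cat"], "title") capitalizes only "the", while apply_case(["The", "Cat"], "low") lowercases both words.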

Regex Splitting

The -split option takes an arbitrary regular expression. Every word that contains a match for this expression is split around the match. This is useful for splitting hyphenated or slashed expressions, which are not split by the Penn Treebank or other parsing standards.
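
The idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Travatar's code: the function name split_word is hypothetical, and the sketch assumes that the matched delimiter is kept as a token of its own, which may differ from what tree-converter actually emits.

```python
import re

def split_word(word, pattern):
    """Split a word around every match of pattern.

    Hypothetical helper. Assumption: the delimiter is kept as its own token.
    """
    # A capturing group makes re.split return the delimiters as well.
    # Patterns that contain their own capturing groups would add extra pieces.
    pieces = re.split("(" + pattern + ")", word)
    # Drop empty strings produced by a match at the start or end of the word.
    return [p for p in pieces if p]
```

With the pattern "-", a word such as "high-speed" is broken at the hyphen, and a word without a match is returned unchanged as a single-element list.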

Compound Splitting

WordSplitterCompound works similarly to the Moses script compound-splitter.perl: it compares the unigram probability of the word (e.g. "autobahn") with the mean unigram probability of its subwords (e.g. "auto" and "bahn") and picks whichever is higher. It also considers fillers between subwords and deletes them if necessary (e.g. "arbeitstier" splits into "arbeit" + (filler=s) + "tier").

Example use:

bin/tree-converter -compoundsplit LMFile -compoundsplit_filler "es:s:e" >

The language model (LMFile) provides the unigram statistics and should be trained on text that has been tokenized in the same way as the input. Since KenLM is used, the model is assumed to be a bigram model or higher, even though the algorithm only looks at unigrams. The fillers are specified in a colon-delimited (:) format.

Additional options for this class are -compoundsplit_threshold and -compoundsplit_minchar, which determine which words are candidates for splitting. Usually we don't want to consider a word for splitting if its unigram probability is above some high threshold, or if its subwords are too short. The default values are probably fine. Using the "-debug 1" option will generate statistics on the number of words split.
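
To make the interaction of the threshold, the minimum subword length, and the fillers concrete, here is a simplified Python sketch of a two-way split. It is not the WordSplitterCompound code. The function name split_compound, the toy probability table, and the default parameter values are all hypothetical. The sketch also assumes a geometric mean of the subword probabilities, whereas the description above only says "mean", and it reads probabilities from a plain dictionary rather than a KenLM model.

```python
import math

def split_compound(word, prob, fillers=(), threshold=1.0, minchar=3):
    """Return [left, right] if a two-way split scores higher, else [word].

    Hypothetical sketch: prob maps words to unigram probabilities.
    """
    p_word = prob.get(word, 0.0)
    # Words that are already frequent enough are not candidates.
    if p_word > threshold:
        return [word]
    best, best_score = [word], p_word
    # Try every split point that leaves at least minchar characters on the left.
    for i in range(minchar, len(word) - minchar + 1):
        left, rest = word[:i], word[i:]
        # The empty filler covers a split with nothing deleted.
        for filler in set(fillers) | {""}:
            if not rest.startswith(filler):
                continue
            right = rest[len(filler):]
            if len(right) < minchar:
                continue
            # Assumption: geometric mean of the two subword probabilities.
            score = math.sqrt(prob.get(left, 0.0) * prob.get(right, 0.0))
            if score > best_score:
                best, best_score = [left, right], score
    return best

# Toy probabilities, made up for illustration only.
toy_prob = {
    "autobahn": 1e-6, "auto": 1e-3, "bahn": 1e-4,
    "arbeitstier": 1e-8, "arbeit": 1e-3, "tier": 1e-4,
}
```

With these toy numbers, split_compound("arbeitstier", toy_prob, fillers=("es", "s", "e")) splits the word into "arbeit" and "tier" and drops the filler "s", while lowering the threshold below the probability of "autobahn" leaves that word unsplit.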