prontron is a tool for pronunciation estimation. It mainly focuses on the pronunciation of unknown Japanese words, but it is written in a general way so that it can be used for any string-to-string conversion task. I created it as a quick challenge to see if I could apply discriminative learning (the structured perceptron) to Japanese pronunciation estimation, and I am posting it here in case anybody finds it useful.
Latest Version: prontron 0.1
Bleeding-Edge Code: @github
Past Versions:
None yet!
The code of prontron is released under the Common Public License v 1.0 and can be freely redistributed under the terms of that license.
To estimate the pronunciation of words with prontron, you can use the models included in the model directory. If you have a file input.txt with one word per line, run the program as follows:
$ prontron.pl model/model.dict model/model.feat < input.txt > output.txt
This will output pronunciations, one per line, into output.txt.
Prontron training is a two-step process: first you build a dictionary of "subword/pronunciation" pairs, then you run weight training.
First, create two files train.word and train.pron that contain words and their pronunciations. Then run the alignment program to create a dictionary model/model.dict of subword/pronunciation pairs:
$ mono-align.pl train.word train.pron model/model.dict
You can add more entries to the dictionary if you notice that anything important is missing. Another aligner such as mpaligner could also be used, although we haven't tried it. Next, train the feature weights model/model.feat using the perceptron algorithm (note that this will take a while):
$ prontron-train.pl train.word train.pron model/model.dict model/model.feat
That is it! Both of these programs have a number of training options (mins and maxes should be the same for both):
Both:
  -fmin     minimum length of the input unit (1)
  -fmax     maximum length of the input unit (1)
  -emin     minimum length of the output unit (0)
  -emax     maximum length of the output unit (5)
  -iters    maximum number of iterations (10)
  -word     use word units instead of characters
mono-align.pl only:
  -cut      all pairs that have a maximum posterior probability less than this will be trimmed (0.01)
prontron-train.pl only:
  -inarow   skip training examples we've gotten right this many times
  -recheck  re-check skipped examples in this many times
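Prontron uses discriminative training based on the structured perceptron. This is good because it lets training use many arbitrary features. The basic idea of the structured perceptron algorithm, in its standard form, is: for each training word, find the highest-scoring pronunciation p under the current weights; if p differs from the correct pronunciation p*, add f(p*) - f(p) to the weights, where f(·) is the feature vector of a pronunciation.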
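As a rough illustration, the standard structured perceptron loop can be sketched in Python as below. This is not prontron's code: the feature names, the function names (`features`, `viterbi`, `train`), the toy candidate dictionary `CANDS`, and the restriction to exactly one pronunciation unit per input character are all assumptions made to keep the example short.

```python
from collections import defaultdict

def features(chars, prons):
    """Toy feature vector: char/pron pair counts plus bigrams over pron units.
    The feature names are illustrative, not prontron's real feature set."""
    feats = defaultdict(int)
    prev = "<s>"
    for c, p in zip(chars, prons):
        feats[f"pair:{c}/{p}"] += 1
        feats[f"bi:{prev}|{p}"] += 1
        prev = p
    feats[f"bi:{prev}|</s>"] += 1
    return feats

def viterbi(chars, cands, weights):
    """Return the best-scoring list of pron units, one unit per character."""
    best = {"<s>": (0.0, [])}  # last unit -> (score, path ending in that unit)
    for c in chars:
        nxt = {}
        for p in cands[c]:
            options = [
                (s + weights[f"pair:{c}/{p}"] + weights[f"bi:{prev}|{p}"], path + [p])
                for prev, (s, path) in best.items()
            ]
            nxt[p] = max(options, key=lambda x: x[0])
        best = nxt
    final = [(s + weights[f"bi:{prev}|</s>"], path) for prev, (s, path) in best.items()]
    return max(final, key=lambda x: x[0])[1]

def train(data, cands, iters=10):
    """Perceptron training: on each mistake, add f(p*) and subtract f(p)."""
    weights = defaultdict(float)
    for _ in range(iters):
        mistakes = 0
        for chars, gold in data:
            pred = viterbi(chars, cands, weights)
            if pred != gold:
                mistakes += 1
                for k, v in features(chars, gold).items():
                    weights[k] += v
                for k, v in features(chars, pred).items():
                    weights[k] -= v
        if mistakes == 0:  # every training example decoded correctly
            break
    return weights

# Hypothetical toy data: candidate units per character and two training words.
CANDS = {"発": ["はつ", "はっ"], "音": ["おん"], "表": ["ぴょう", "ひょう"]}
DATA = [(list("発音"), ["はつ", "おん"]), (list("発表"), ["はっ", "ぴょう"])]

weights = train(DATA, CANDS)
for chars, gold in DATA:
    print("".join(chars), "".join(viterbi(chars, CANDS, weights)))
```

On this toy data the loop stops after a few passes, once both words decode correctly. Real input needs variable-length units (the -fmin/-fmax and -emin/-emax options), which this sketch leaves out.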
In the case of pronunciation estimation, it is not too difficult to find p, f(p), and f(p*) using the Viterbi algorithm. For the current features in prontron, we use bigram and length features over four sequences:
Word: | 発音 | 発表 |
---|---|---|
Pronunciation: | はつおん | はっぴょう |
Seq 1 -- Char/Pron. Pairs: | 発/はつ 音/おん | 発/はっ 表/ぴょう |
Seq 2 -- Pron. Strings: | はつ おん | はっ ぴょう |
Seq 3 -- Pron. Characters: | は つ お ん | は っ ぴ ょ う |
Seq 4 -- (Almost) Phonemes: | h a t u o n | h a x p i xyo u |
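To make the idea of bigram and length features more concrete, here is a small Python sketch that counts them for one sequence. The function name, the feature-name format, and the reading of "length" as the number of characters in each unit are assumptions for illustration only; prontron's actual feature definitions may differ. Seq 4 is omitted because it would need a kana-to-phoneme table.

```python
from collections import Counter

def bigram_and_length_features(name, units, with_length=True):
    """Count bigram features (and optionally unit-length features) for one sequence.

    `name` labels the sequence, e.g. "pron_str". The feature-name format is
    hypothetical and only meant to show the idea.
    """
    feats = Counter()
    padded = ["<s>"] + list(units) + ["</s>"]
    for left, right in zip(padded, padded[1:]):
        feats[f"{name}:bi:{left}|{right}"] += 1
    if with_length:
        for unit in units:
            feats[f"{name}:len:{len(unit)}"] += 1
    return feats

# Build Seq 1-3 for 発音 from its character/pronunciation alignment.
pairs = [("発", "はつ"), ("音", "おん")]
seq1 = [f"{c}/{p}" for c, p in pairs]      # char/pron pairs
seq2 = [p for _, p in pairs]               # pronunciation strings
seq3 = [ch for p in seq2 for ch in p]      # pronunciation characters

all_feats = Counter()
all_feats += bigram_and_length_features("pair", seq1, with_length=False)
all_feats += bigram_and_length_features("pron_str", seq2)
all_feats += bigram_and_length_features("pron_chr", seq3, with_length=False)
for feature_name, count in sorted(all_feats.items()):
    print(feature_name, count)
```

Because each feature is just a named count, adding a new feature type only means adding more keys to the same counter, which is what makes the perceptron setup flexible.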
Examples of some high-weighted features learned over each of these sequences are as follows:
In a quick test using 90% of the unique words in BCCWJ as training data and the remaining 10% as test data, prontron gets 66% correct, while a noisy channel model gets 62% (a joint trigram model would probably do better). More importantly, it gives the flexibility to incorporate new features easily, which could lead to much larger increases in accuracy.
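A test of this kind could be set up roughly as in the Python sketch below. It is not the script used for the numbers above: the function names, the seeded shuffle, and the assumption that "correct" means an exact whole-word match are all hypothetical.

```python
import random

def split_unique(words, prons, test_fraction=0.1, seed=0):
    """Deduplicate by word, shuffle, and return (train, test) lists of (word, pron)."""
    # dict() keeps one entry per word (the last pronunciation seen for it).
    unique = list(dict(zip(words, prons)).items())
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    return unique[n_test:], unique[:n_test]

def accuracy(predicted, reference):
    """Fraction of items whose predicted pronunciation exactly matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("prediction and reference counts differ")
    if not reference:
        return 0.0
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

# Toy usage with made-up entries; real data would come from train.word / train.pron.
words = ["発音", "発表", "発音", "音"]
prons = ["はつおん", "はっぴょう", "はつおん", "おと"]
train_pairs, test_pairs = split_unique(words, prons, test_fraction=0.34)
print(len(train_pairs), len(test_pairs))
print(accuracy(["はつおん", "はっひょう"], ["はつおん", "はっぴょう"]))
```

The training half would be written out as train.word and train.pron, and the test words run through prontron.pl before comparing its output to the held-out pronunciations.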
If you are interested in participating in the prontron project, particularly tackling any of the interesting challenges below, please send an email to neubig at gmail dot com.
There are a bunch of possible improvements that would be quite interesting and useful: