This is the Kyoto Language Modeling toolkit (Kylm), written in Java. Its features include n-gram model training with a variety of smoothing methods (maximum likelihood, Good-Turing, Witten-Bell, absolute, and Kneser-Ney), unknown word modeling, class-based models, and output in ARPA, binary, and WFST formats.
Latest Version: Kylm 0.0.7
Source code can be found here.
CountNgrams is a program to calculate a smoothed n-gram model from a text corpus.
Example: java -cp kylm.jar kylm.main.CountNgrams training.txt model.arpa

N-gram model options
  -n: the length of the n-gram context [default: 3]
  -trim: the trimming for each level of the n-gram (example: 0:1:1)
  -name: the name of the model
  -smoothuni: whether or not to smooth unigrams

Symbol/Vocabulary options
  -vocab: the vocabulary file to use
  -startsym: the symbol to use for sentence starts [default: <s>]
  -termsym: the terminal symbol for sentences [default: </s>]
  -vocabout: the vocabulary file to write out to
  -ukcutoff: the cut-off for unknown words [default: 0]
  -uksym: the symbol to use for unknown words [default: <unk>]
  -ukexpand: expand unknown symbols in the vocabulary
  -ukmodel: model unknown words. Arguments are processed first to last, so the most general model should be specified last.
            Format: "symbol:vocabsize[:regex(.*)][:order(2)][:smoothing(wb)]"

Class options
  -classes: a file containing word class definitions ("class word [prob]", one per line)

Smoothing options [default: kn]
  -ml: maximum likelihood smoothing
  -gt: Good-Turing smoothing (Katz Backoff)
  -wb: Witten-Bell smoothing
  -abs: absolute smoothing
  -kn: Kneser-Ney smoothing (default)
  -mkn: Modified Kneser-Ney smoothing (of Chen & Goodman)

Output options [default: arpa]
  -bin: output in binary format
  -wfst: output in weighted finite-state transducer (WFST) format
  -arpa: output in ARPA format
  -neginf: the number to print for non-existent backoffs (default: null, example: -99)

Miscellaneous options
  -debug: the level of debugging information to print [default: 0]
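As a sketch of how these options combine (the file names training.vocab and model.arpa are illustrative), the following command trains a 4-gram model with modified Kneser-Ney smoothing, maps words appearing once or less to the unknown symbol, and writes out the resulting vocabulary:

Example: java -cp kylm.jar kylm.main.CountNgrams -n 4 -mkn -ukcutoff 1 -vocabout training.vocab training.txt model.arpa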
CrossEntropy is a program to calculate the cross-entropy of a test corpus using one or more language models.
Usage: java -cp kylm.jar kylm.main.CrossEntropy [OPTIONS] test.txt
Example: CrossEntropy -arpa model1.arpa:model2.arpa test.txt

  -arpa: models in ARPA format (model1.arpa:model2.arpa)
  -bin: models in binary format (model3.bin:model4.bin)
  -debug: print statistics for 0: document, 1: sentence, 2: word [default: 0]
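For example, to compare a Kneser-Ney model against a modified Kneser-Ney model on the same test set with per-sentence statistics (the model file names model-kn.arpa and model-mkn.arpa are illustrative):

Example: java -cp kylm.jar kylm.main.CrossEntropy -arpa model-kn.arpa:model-mkn.arpa -debug 1 test.txt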
The easiest way to handle unknown words is the "-smoothuni" option. This applies smoothing to the unigram distribution, saving some probability mass for unknown words.
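For example (file names are illustrative):

Example: java -cp kylm.jar kylm.main.CountNgrams -smoothuni training.txt model.arpa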
Another way is to set "-ukcutoff 1" (or any other number). This trims all unigrams with a count less than or equal to one and reserves their probability mass for unknown words.
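For example, the following (again with illustrative file names) trims singleton unigrams and assigns their probability mass to the default unknown symbol <unk>:

Example: java -cp kylm.jar kylm.main.CountNgrams -ukcutoff 1 training.txt model.arpa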
Additional developers are welcome. If you are interested, please send an email to kylm@.
Kylm is released under the GNU Lesser General Public License (LGPL).