Kylm - The Kyoto Language Modeling Toolkit

This is the Kyoto Language Modeling toolkit (Kylm), a language modeling toolkit written in Java. Its features include smoothed n-gram models with a choice of smoothing methods (maximum likelihood, Good-Turing, Witten-Bell, absolute, Kneser-Ney, and Modified Kneser-Ney), configurable unknown-word and word-class models, and model output in ARPA, binary, and WFST formats.

Download/Install

Latest Version: Kylm 0.0.7

Source code can be found here.

Program Documentation

CountNgrams

CountNgrams is a program that estimates a smoothed n-gram model from a text corpus. A fuller example combining several of the options below follows the option list.

Example: java -cp kylm.jar kylm.main.CountNgrams training.txt model.arpa

N-gram model options
    -n:         the length of the n-gram context [default: 3]
    -trim:      the trimming threshold (count cutoff) for each level of the n-gram (example: 0:1:1)
    -name:      the name of the model
    -smoothuni: whether or not to smooth unigrams

Symbol/Vocabulary options
    -vocab:     the vocabulary file to use
    -startsym:  the symbol to use for sentence starts [default: <s>]
    -termsym:   the terminal symbol for sentences [default: </s>]
    -vocabout:  the vocabulary file to write out to
    -ukcutoff:  the cut-off for unknown words [default: 0]
    -uksym:     the symbol to use for unknown words [default: <unk>]
    -ukexpand:  expand unknown symbols in the vocabulary
    -ukmodel:   model unknown words. Arguments are processed first to last, 
                so the most general model should be specified last. 
                Format: "symbol:vocabsize[:regex(.*)][:order(2)][:smoothing(wb)]"

Class options
    -classes:   a file containing word class definitions 
                ("class word [prob]", one per line)

Smoothing options [default: kn]
    -ml:        maximum likelihood smoothing
    -gt:        Good-Turing smoothing (Katz Backoff)
    -wb:        Witten-Bell smoothing
    -abs:       absolute smoothing
    -kn:        Kneser-Ney smoothing (default)
    -mkn:       Modified Kneser-Ney smoothing (of Chen & Goodman)
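
For background, interpolated Kneser-Ney smoothing (the default) is usually written as follows. This is the textbook formulation of Chen & Goodman, given here only as a reference, not as a transcription of Kylm's implementation:

    P_{KN}(w \mid u) = \frac{\max(c(uw) - d, 0)}{c(u)} + \frac{d \, N_{1+}(u \cdot)}{c(u)} \, P_{KN}(w \mid u')

Here c(uw) is the count of word w after context u, d is the discount, N_{1+}(u ·) is the number of distinct word types observed after u, and u' is the context with its earliest word dropped. At lower orders, Kneser-Ney replaces raw counts with continuation counts (the number of distinct contexts in which a word appears).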

Output options [default: arpa]
    -bin:       output in binary format
    -wfst:      output in weighted finite state transducer format (WFST)
    -arpa:      output in ARPA format
    -neginf:    the number to print for non-existent backoffs [default: null, example: -99]

Miscellaneous options
    -debug:     the level of debugging information to print [default: 0]
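
Putting several of the options above together, a command like the following (file names are illustrative) trains a trigram model with Modified Kneser-Ney smoothing, treats words seen once or less as unknown, and writes the vocabulary out alongside the ARPA model:

    java -cp kylm.jar kylm.main.CountNgrams -n 3 -mkn -ukcutoff 1 \
         -vocabout vocab.txt training.txt model.arpa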

CrossEntropy

CrossEntropy is a program that calculates the cross-entropy of a test corpus under one or more language models.

Usage: java -cp kylm.jar kylm.main.CrossEntropy [OPTIONS] test.txt
Example: java -cp kylm.jar kylm.main.CrossEntropy -arpa model1.arpa:model2.arpa test.txt
    -arpa:  models in arpa format (model1.arpa:model2.arpa)
    -bin:   models in binary format (model3.bin:model4.bin)
    -debug: the level at which to print statistics (0: document, 1: sentence, 2: word) [default: 0]
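
As an end-to-end illustration, the following trains a model with the default settings and then reports per-sentence statistics on held-out text (file names are illustrative):

    java -cp kylm.jar kylm.main.CountNgrams training.txt model.arpa
    java -cp kylm.jar kylm.main.CrossEntropy -debug 1 -arpa model.arpa test.txt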

Kylm API
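
The full Javadoc is not reproduced here. As a minimal sketch, the tools documented above can also be driven from within Java through their main entry points; beyond the class names and flags already shown, everything in this example (the file names, and the assumption that the tools return normally rather than calling System.exit) is illustrative:

    // Minimal sketch: call the documented command-line entry points from Java.
    public class TrainAndEvaluate {
        public static void main(String[] args) throws Exception {
            // Train a model with the default settings (trigram, Kneser-Ney).
            kylm.main.CountNgrams.main(new String[] { "training.txt", "model.arpa" });
            // Measure the cross-entropy of held-out text under that model.
            kylm.main.CrossEntropy.main(new String[] { "-arpa", "model.arpa", "test.txt" });
        }
    }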

FAQ

How do I make a model that handles unknown words?

The easiest way is to use the "-smoothuni" option. This applies smoothing to the unigram distribution, saving some probability mass for unknown words.
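
For example (file names illustrative, and assuming -smoothuni is a bare switch as the option list above suggests):

    java -cp kylm.jar kylm.main.CountNgrams -smoothuni training.txt model.arpa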

Another way is to set "-ukcutoff 1" (or any other number). This trims all unigrams with a count of one or less and reserves their probability mass for unknown words.
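
For example (file names illustrative):

    java -cp kylm.jar kylm.main.CountNgrams -ukcutoff 1 training.txt model.arpa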

Development Information

Main Developer

Contributors

Additional developers are welcome. If you are interested, please send an email to kylm@.

Kylm is released under the GNU Lesser General Public License (LGPL).

Revision History

Planned Future Features:

Version 0.0.7 (4/21/2012)

Version 0.0.6 (5/21/2010)

Version 0.0.5 (11/25/2009)

Version 0.0.4 (11/13/2009)

Version 0.0.3 (6/22/2009)

Version 0.0.2 (5/28/2009)

Version 0.0.1 (Initial Alpha Release)