Kylm - The Kyoto Language Modeling Toolkit

日本語

This is the Kyoto Language Modeling toolkit (Kylm), a language modeling toolkit written in Java. It contains features including:

Tools for comparing the effectiveness of various types of language models.
Ability to model unknown words using sub-word units (characters).
Support for a number of different smoothing methods.
Output in WFST format for use with WFST decoders (such as Kyfd).

Download/Install
Documentation

CountNgrams
CrossEntropy
FAQ

Development

Download/Install

Latest Version: Kylm 0.0.7

Source code can be found here.

Program Documentation

CountNgrams

CountNgrams is a program to calculate a smoothed n-gram model from a text corpus.

Example: java -cp kylm.jar kylm.main.CountNgrams training.txt model.arpa

N-gram model options
    -n:         the length of the n-gram context [default: 3]
    -trim:      the trimming for each level of the n-gram (example: 0:1:1)
    -name:      the name of the model
    -smoothuni: whether or not to smooth unigrams

Symbol/Vocabulary options
    -vocab:     the vocabulary file to use
    -startsym:  the symbol to use for sentence starts [default: <s>]
    -termsym:   the terminal symbol for sentences [default: </s>]
    -vocabout:  the vocabulary file to write out to
    -ukcutoff:  the cut-off for unknown words [default: 0]
    -uksym:     the symbol to use for unknown words [default: <unk>]
    -ukexpand:  expand unknown symbols in the vocabulary
    -ukmodel:   model unknown words. Arguments are processed first to last, 
                so the most general model should be specified last. 
                Format: "symbol:vocabsize[:regex(.*)][:order(2)][:smoothing(wb)]"

Class options
    -classes:   a file containing word class definitions 
                ("class word [prob]", one per line)

Smoothing options [default: kn]
    -ml:        maximum likelihood smoothing
    -gt:        Good-Turing smoothing (Katz Backoff)
    -wb:        Witten-Bell smoothing
    -abs:       absolute smoothing
    -kn:        Kneser-Ney smoothing (default)
    -mkn:       Modified Kneser-Ney smoothing (of Chen & Goodman)

Output options [default: arpa]
    -bin:       output in binary format
    -wfst:      output in weighted finite state transducer format (WFST)
    -arpa:      output in ARPA format
    -neginf:    the number to print for non-existent backoffs (default: null, example: -99)

Miscellaneous options
    -debug:     the level of debugging information to print [default: 0]

CrossEntropy

A program to calculate the cross-entropy of a corpus using one or more language models.

Usage: java -cp kylm.jar kylm.main.CrossEntropy [OPTIONS] test.txt
Example: CrossEntropy -arpa model1.arpa:model2.arpa test.txt
    -arpa:  models in arpa format (model1.arpa:model2.arpa)
    -bin:   models in binary format (model3.bin:model4.bin)
    -debug: print statistics for 0: document, 1: sentence, 2: word [default: 0]

Kylm API

FAQ

How do I make a model that handles unknown words?

The easiest way is by using the "-smoothuni" option. This will apply smoothing to unigrams, and save some probability mass for unknown words.

Another way is by setting "-ukcutoff 1" (or any other number). This will trim all unigrams with a count less than or equal to one, and reserve the rest of the probability for unknown words.

Development Information

Main Developer

Graham Neubig

Contributers

Xuchen Yao

Additional developers are welcome. If you are interested, please send an email to kylm@.

Kylm is released under the GNU Lesser General Public License

Revision History

Planned Future Features:

Pitman-Yor LMs
Latent Dirichlet Allocation
Linear regression for Good-Turing discounts
Linear interpolation for various models
Estimation of class-based models

Ver. 0.0.7 (4/21/2012)

Added better error messages for when there is too little data to estimate Modified Kneser-Ney parameters appropriately (thanks to Dinu John for pointing this out).

Ver. 0.0.6 (5/21/2010)

Fixed a bug that caused an error when a vocabulary is specified manually, and a word in that vocabulary occurs in the test corpus but not the training corpus (thanks to Zhonghua Qu for pointing this out).

Ver. 0.0.5 (11/25/2009)

Fixed a bug in specifying the start and terminal symbols (thanks to Toru Taniguchi for pointing it out.)

Ver. 0.0.4 (11/13/2009)

Support for class-based LMs added
Smoothing for unigrams
Added the ability to output separate beginning and terminal symbols
Added support for Modified Kneser-Ney smoothing (of Chen and Goodman)
Fixed a bug in the unknown model code that affected models with unknown characters

Version 0.0.3 (6/22/2009)

Fixed the creation of empty strings when multiple white spaces exist
A variety of speed and memory improvements (removal of linked lists, indexing of the root node)
Added support for character-based modeling of unknown words
Fixed trimming so it works with Good-Turing smoothed models
Fixed a problem when piping data in to CountNgrams
Fixed a problem with WFST output that was killing beginning-of-sentence context

Version 0.0.2 (5/28/2009)

It is now possible to trim n-grams by count
A set vocabulary list can be used to limit the vocabulary
Output in AT&T WFST format was added as an option
Documentation has been improved

Version 0.0.1 (Initial Alpha Release)

Two programs, CountNgrams and CrossEntropy
Support for n-gram language models and several smoothing techniques (Maximum Likelihood, Good-Turing, Absolute Discounting, Witten-Bell, Kneser-Ney)
Support for input/output in ARPA and binary format
JUnit scripts to perform regression tests