This is the page for latticelm, a tool for Bayesian non-parametric word segmentation and language model learning using Pitman-Yor language models. Its main feature is that it can be applied not only to regular text but also to lattices. The tool was motivated by research into learning language models from continuous speech, presented in the following paper:

It may also be used for word segmentation of regular text, in which case it is a slight modification of the method presented by Mochihashi et al. in ACL 2009.


Latest Version

latticelm 0.4

OpenFst must be installed before latticelm. Once it is in place, simply type "make" in the latticelm directory. latticelm has been tested on Ubuntu 10.04, but should run on any recent version of Linux. The source is also available on GitHub.

Program Documentation

To get acquainted with the program, I recommend reading the two tutorials included with the package download. Options for the program are as follows:


Usage: latticelm -prefix out/ input.txt
  -burnin:       The number of iterations to execute as burn-in (20)
  -annealsteps:  The number of annealing steps to perform (3)
                 See Goldwater+ 2009 for details on annealing.
  -anneallength: The length of each annealing step in iterations (5)
  -samps:        The number of samples to take (100)
  -samprate:     The frequency (in iterations) at which to take samples (1)
  -knownn:       The n-gram length of the language model (3)
  -unkn:         The n-gram length of the spelling model (3)
  -prune:        If this is activated, paths that are worse than the
                 best path by at least this much will be pruned.
  -input:        The type of input (text/fst, default text).
  -filelist:     A list of input files, one file per line.
                 For fst input, files must be in OpenFst binary
                 format, tropical semiring. Text files consist of one
                 sentence per line.
  -symbolfile:   The symbol file for the WFSTs, not used for text input.
  -prefix:       The prefix under which to print all output.
  -separator:    The string to use to separate 'characters'.
  -cacheinput:   For WFST input, cache the WFSTs in memory (otherwise
                 they will be loaded from disk every iteration).
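
The options above can also be assembled from a script. The following is a minimal Python sketch, not part of the latticelm distribution: the helper name build_command and the VALUE_OPTIONS tuple are hypothetical, and the sketch assumes that every option it lists is passed as "-name value" before the input file, as in the usage line. The -cacheinput option is left out because the usage text does not say whether it takes a value.

```python
import subprocess

# Value-taking options listed in the usage text above. -cacheinput is left out
# because the usage text does not say whether it takes a value.
VALUE_OPTIONS = (
    "burnin", "annealsteps", "anneallength", "samps", "samprate",
    "knownn", "unkn", "prune", "input", "filelist", "symbolfile",
    "prefix", "separator",
)


def build_command(input_files, binary="latticelm", **options):
    """Return an argv list shaped like `latticelm -prefix out/ input.txt`.

    Hypothetical helper: assumes each option is written as `-name value`
    and that options come before the input files.
    """
    unknown = sorted(set(options) - set(VALUE_OPTIONS))
    if unknown:
        raise ValueError(f"options not in the usage text: {unknown}")
    if options.get("input", "text") not in ("text", "fst"):
        raise ValueError("-input must be 'text' or 'fst'")
    argv = [binary]
    for name in VALUE_OPTIONS:  # fixed order keeps the command reproducible
        if name in options:
            argv += [f"-{name}", str(options[name])]
    return argv + list(input_files)


if __name__ == "__main__":
    cmd = build_command(["input.txt"], prefix="out/", burnin=20, samps=100)
    print(" ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment once latticelm is built
```

Building the command as a list rather than a single string avoids shell quoting problems when a prefix or file name contains spaces. Rejecting option names that are not in the usage text catches typos before a long sampling run starts.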

Development Information

latticelm is currently developed solely by Graham Neubig, but any additional developers are welcome. If you are interested, please send an email to latticelm@.

latticelm is released under the Apache License, Version 2.0.

Revision History

latticelm - Release 0.4 (4/9/2013)

latticelm - Release 0.3 (6/5/2012)

latticelm - Release 0.2 (9/21/2010)

latticelm - Release 0.1 (9/18/2010)