This is the page for latticelm, a tool for Bayesian non-parametric word segmentation and language model learning using Pitman-Yor language models. The main feature of the toolkit is that it can be applied not only to regular text but also to lattices. The tool was motivated by research into learning language models from continuous speech, presented in the following paper:
It may also be used for word segmentation of regular text, in which case it is a slight modification of the method presented by Mochihashi et al. in ACL 2009.
OpenFst must be installed before building latticelm. Once OpenFst is installed, type "make" in the latticelm directory. latticelm has been tested on Ubuntu 10.04, but should run on any recent version of Linux. You can also find the source on GitHub.
To get acquainted with the program, I recommend reading the two tutorials that are included with the package download. Options for the program are as follows:
Usage: latticelm -prefix out/ input.txt

Options:
  -burnin:       The number of iterations to execute as burn-in (20)
  -annealsteps:  The number of annealing steps to perform (3)
                 See Goldwater+ 2009 for details on annealing.
  -anneallength: The length of each annealing step in iterations (5)
  -samps:        The number of samples to take (100)
  -samprate:     The frequency (in iterations) at which to take samples (1)
  -knownn:       The n-gram length of the language model (3)
  -unkn:         The n-gram length of the spelling model (3)
  -prune:        If this is activated, paths that are worse than the
                 best path by at least this much will be pruned.
  -input:        The type of input (text/fst, default text).
  -filelist:     A list of input files, one file per line.
                 For fst input, files must be in OpenFst binary format,
                 tropical semiring.
                 Text files consist of one sentence per line.
  -symbolfile:   The symbol file for the WFSTs, not used for text input.
  -prefix:       The prefix under which to print all output.
  -separator:    The string to use to separate 'characters'.
  -cacheinput:   For WFST input, cache the WFSTs in memory (otherwise
                 they will be loaded from disk every iteration).
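When running many experiments, it can be convenient to assemble the command line from a script. The following is a minimal Python sketch, not part of latticelm. The helper name build_latticelm_command is hypothetical, and it assumes that every option is written as "-name value", the way -prefix appears in the usage line. It leaves out -cacheinput because the option list does not show whether that flag takes a value.

```python
# Hypothetical helper, not part of latticelm: assembles an argument list
# from the options documented in the usage text.

# Option names taken from the usage text. -cacheinput is omitted because
# the usage text does not show whether it takes a value.
DOCUMENTED_OPTIONS = {
    "burnin", "annealsteps", "anneallength", "samps", "samprate",
    "knownn", "unkn", "prune", "input", "filelist", "symbolfile",
    "prefix", "separator",
}


def build_latticelm_command(input_files, **options):
    """Return an argv list such as ['latticelm', '-prefix', 'out/', 'input.txt']."""
    unknown = set(options) - DOCUMENTED_OPTIONS
    if unknown:
        raise ValueError(f"undocumented option(s): {sorted(unknown)}")
    argv = ["latticelm"]
    for name, value in options.items():
        # Assumption: each option is passed as "-name value".
        argv += [f"-{name}", str(value)]
    # Input files go last, as in the usage line.
    argv += list(input_files)
    return argv


command = build_latticelm_command(
    ["input.txt"], prefix="out/", knownn=3, unkn=3, samps=100
)
print(" ".join(command))
```

Once the latticelm binary is on the PATH, the resulting list could be passed to subprocess.run(command, check=True). Rejecting option names that are not in the usage text catches typos before a long sampling run starts.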
latticelm is currently developed solely by Graham Neubig, but any additional developers are welcome. If you are interested, please send an email to latticelm@.
latticelm is released under the Apache License, Version 2.0.