KyTea Version History

Future Features/Known Issues

EUC input can only handle 2-byte EUC, but future versions will handle 3-byte characters as well.
Improved the efficiency of the dictionary implementation

Major refactoring to make the code easier to read and slightly improve compile time
Added the ability to select -out eda for compatibility with the Eda parser

Added the -wsconst option, making it possible to not segment some character types (e.g. "-wsconst D" prevents segmentation of digits)

Normalization of half-width characters
Model updated for better accuracy
Added a new input option to take tokenized text with no tags, which is the default when using the -nows option
Other minor bug fixes

Made it possible to train using feature files as described on the training page.
Fixed a bug with the "-nounk" option.

Made it possible to estimate multiple tags at one time, and combined the default POS and pronunciation models into a single model.
Added support for the global tagging model described in the ACL 2011 paper.
Fixed a bug that caused a confidence of 100 for some probabilistic models.

Added an API for programmatic access to KyTea
Upgraded LIBLINEAR to version 1.7 to allow for improved logistic regression (-solver 7), and updated the models on the models page
Allowed for the specification of the separator between words and tags using options
Added support for a default tag (-deftag) that is used when no tag candidates are generated; it is set to "/UNK" by default.
Made it possible to tune the cost parameter (-cost) for SVM or LR training.
Added a -debug option that allows the printing of more details for analyzing KyTea's results.
Changed the text model format to make it easier to read (feature weights are now written directly below their names).
Quantization is now disabled by default, which will reduce speed but increase stability when training models. It can be re-enabled by using the --enable-quantize option of the configure script.

Changed unknown word pronunciation from full search to beam search. The -unkbeam option was added. This should fix crashes due to long unknown words.
Fixed a bug where probabilities were not properly output when using the "-out conf" option with a model trained with "-solver 6" (including the provided logistic regression models).
Fixed a bug that prevented models from being trained with "-nope".
Confidence weighted output now has a single blank line after each sentence to ease processing.
Fixed a bug that caused dictionaries without pronunciations to be read improperly.
Fixed a bug that broke models trained using the -modtext option.

When using multiple dictionaries, they are now treated as separate features (as opposed to a generic dictionary feature in previous versions).
Updated the model file format (incompatible with version 0.0.3)
Partially annotated files may now use '?' in addition to ' ' for unlabeled boundaries (to express human annotator uncertainty).
Double spaces are handled as a single space in full annotation.
Escaped characters are allowed in annotated input.
The output of logistic regression now reflects probabilities.
A model is now included with the package, and the model can be specified using an environment variable.
Added the ability to output multiple answers in order of preference.