This is the home of the Kyoto Text Analysis Toolkit (KyTea, pronounced "cutie"). It is a general toolkit developed for analyzing text, with a focus on Japanese and other languages requiring word or morpheme segmentation.
KyTea is able to perform the following types of processing: word segmentation (KyWs) and pronunciation estimation (KyPe).
Both KyWs and KyPe use a pointwise, classifier-based (SVM or logistic regression) approach, which allows for training on partially annotated data. The classifiers are trained with LIBLINEAR. More details on KyTea's classification approach can be found here or in the following paper.
Latest Version: KyTea 0.1.2
This package contains the source code and a default model that uses the UTF-8 character encoding and estimates pronunciations according to keyboard input (which differs slightly from the actual phonetic pronunciations). More details and a number of other models can be found on the KyTea Models page.
Past Versions: KyTea 0.1.1 KyTea 0.1.0 KyTea 0.0.3 KyTea 0.0.2 KyTea 0.0.1
The code of KyTea is distributed according to the Apache License Version 2, and can be distributed freely according to this license. The models included with KyTea or distributed on the KyTea models page may be used for research or commercial purposes, but may not be re-distributed without prior permission.
KyTea has been tested on Linux, Mac OSX, and Windows (via Cygwin). On Linux or Cygwin, download the source code, and install using the following commands.
tar -xzf kytea-X.X.X.tar.gz
cd kytea-X.X.X
./configure
make
make install
kytea --help
If this prints a help message, KyTea is working properly. There are a number of options that can be set during compile-time to adjust the install location or program efficiency.
After you have installed KyTea, run the program to split the text into words and annotate each word with a pronunciation. If test.raw is a file that contains raw text, the following command will create annotated text in the file test.full.
kytea < test.raw > test.full
While KyTea comes with a default model, if you have your own annotated text it is both simple and useful to build your own model. First, you must prepare a corpus with one sentence per line in the following format (if you only want to do word segmentation, the pronunciations are not necessary):
word1/pron1 word2/pron2 word3/pron3 word4/pron4 word5/pron5 word6/pron6
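The corpus format above can be read with a few lines of Python. This is a minimal sketch rather than part of KyTea: the function names parse_full_line and to_raw are hypothetical, and it assumes that the last '/' in a token separates the word from its pronunciation and that no escaping is needed.

```python
from typing import List, Optional, Tuple

# A token is a (word, pronunciation) pair; the pronunciation is None
# for segmentation-only corpora.
Token = Tuple[str, Optional[str]]


def parse_full_line(line: str) -> List[Token]:
    """Split one corpus line into (word, pronunciation) pairs.

    Assumption: the last '/' in a token separates the word from the
    pronunciation, and a token without '/' carries no pronunciation.
    """
    tokens: List[Token] = []
    for item in line.split():
        word, sep, pron = item.rpartition("/")
        if sep:
            tokens.append((word, pron))
        else:
            tokens.append((item, None))
    return tokens


def to_raw(tokens: List[Token]) -> str:
    """Rebuild unsegmented text by concatenating the words."""
    return "".join(word for word, _ in tokens)
```

Concatenating the words without spaces suits text such as Japanese, where the raw input has no word delimiters; this could be used to derive a raw test file from an annotated one.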
Let's say that this corpus is named train.full (full means that the file is fully annotated in the above format). If we have an unsegmented file named test.raw, we can create a model and analyze the unsegmented file using the following commands.
train-kytea -full train.full -model model.dat
kytea -model model.dat < test.raw > test.full
test.full will now contain the segmented text, with each word annotated with a pronunciation.
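The same two steps could also be driven from a script. The sketch below is one possible Python wrapper, not something shipped with KyTea: the helper names are hypothetical, it assumes train-kytea and kytea are on the PATH, and it uses only the -full and -model options described on this page.

```python
import subprocess
from typing import List


def train_command(full_corpus: str, model: str) -> List[str]:
    """Command line that trains a model from a fully annotated corpus."""
    return ["train-kytea", "-full", full_corpus, "-model", model]


def analyze_command(model: str) -> List[str]:
    """Command line that analyzes raw text from stdin with a given model."""
    return ["kytea", "-model", model]


def train_and_analyze(full_corpus: str, model: str,
                      raw_path: str, out_path: str) -> None:
    """Train a model, then analyze raw_path and write the result to out_path.

    Assumes both programs are installed; check=True raises
    CalledProcessError if either one exits with a non-zero status.
    """
    subprocess.run(train_command(full_corpus, model), check=True)
    with open(raw_path, "rb") as raw, open(out_path, "wb") as out:
        # Equivalent to: kytea -model MODEL < raw_path > out_path
        subprocess.run(analyze_command(model), stdin=raw, stdout=out,
                       check=True)
```

Calling train_and_analyze("train.full", "model.dat", "test.raw", "test.full") would correspond to the two commands above.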
kytea performs word segmentation and pronunciation estimation given a model
Options:
  -model     The model file to use when analyzing text
  -in        The formatting of the input (raw/full/part/conf, default raw)
  -out       The formatting of the output (full/part/conf, default full)
  -nows      Don't do word segmentation (raw input cannot be accepted)
  -nope      Don't do pronunciation estimation (full input cannot be accepted)
  -nounk     Don't estimate the pronunciation of unknown words
  -unkcount  The maximum number of unknown pronunciations to print (default 5, 0 implies no limit)
  -unktag    A tag to append to indicate words not in the dictionary
  -unkbeam   The width of the beam to use in beam search for unknown words (default 50, 0 for full search)
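Some of these options constrain each other, so a script that assembles kytea command lines may want to check a combination before running it. The following is a rough sketch with a hypothetical function name; it encodes only the restrictions stated in the option list above and has not been checked against KyTea's own argument handling.

```python
from typing import List

# Allowed values, taken from the -in and -out descriptions above.
INPUT_FORMATS = ("raw", "full", "part", "conf")
OUTPUT_FORMATS = ("full", "part", "conf")


def option_conflicts(in_format: str = "raw", out_format: str = "full",
                     nows: bool = False, nope: bool = False) -> List[str]:
    """Return human-readable problems with a combination of kytea options.

    An empty list means no conflict was found under the rules stated in
    the option list; kytea itself may apply further checks.
    """
    problems: List[str] = []
    if in_format not in INPUT_FORMATS:
        problems.append("unknown -in format: " + in_format)
    if out_format not in OUTPUT_FORMATS:
        problems.append("unknown -out format: " + out_format)
    if nows and in_format == "raw":
        problems.append("-nows cannot be used with raw input")
    if nope and in_format == "full":
        problems.append("-nope cannot be used with full input")
    return problems
```

For example, option_conflicts(in_format="raw", nows=True) reports one problem, since raw text has no word boundaries to reuse when segmentation is turned off.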
train-kytea is a program to train models for KyTea.
Input/Output Options:
  -encode   The text encoding to be used for input/output (utf8/euc/sjis; default: utf8)
  -full     A file of fully annotated training data (can be specified multiple times)
  -part     A file of partially annotated training data (can be specified multiple times)
  -conf     A file of training data annotated with confidences (can be specified multiple times)
  -dict     A dictionary file (in the form of one 'word/pron' entry per line)
  -subword  A dictionary file of subword units. Adding this will enable unknown word PE
  -model    The file to write the trained model to
  -modtext  Print a text model (instead of the default binary)

Model Training Options (basic):
  -nows     Don't train a word segmentation model
  -nope     Don't train a pronunciation estimation model

Model Training Options (for advanced users):
  -charw    The window of characters to use on either side of a boundary for WS (default 3)
  -charn    The maximum length of character n-grams to use for WS (default 3)
  -typew    The window of character types to use on either side of a boundary for WS (default 3)
  -typen    The maximum length of character type n-grams to use for WS (default 3)
  -dictn    All dictionary words greater than this length will be bucketed together (default 4)
  -unkn     The order of the language model to use for unknown words (default 3)
  -eps      The epsilon stopping criterion for classifier training
  -bias     The bias value to use in classifier training (default 1)
  -solver   The solver (0 = logistic regression, 1 = SVM, etc.; default 1)
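A dictionary for the -dict option can be generated from an existing word list. The sketch below writes one 'word/pron' entry per line, as described above; the function name write_dictionary is hypothetical, and rejecting whitespace inside entries is an assumption made here to keep each entry on a single line, not a documented KyTea rule.

```python
from typing import Iterable, Tuple


def write_dictionary(entries: Iterable[Tuple[str, str]], path: str,
                     encoding: str = "utf-8") -> int:
    """Write (word, pronunciation) pairs as one 'word/pron' entry per line.

    Returns the number of entries written. The default encoding matches
    the utf8 default of train-kytea's -encode option.
    """
    count = 0
    with open(path, "w", encoding=encoding, newline="\n") as handle:
        for word, pron in entries:
            # Assumption: entries must be non-empty and free of whitespace.
            if not word or any(ch.isspace() for ch in word + pron):
                raise ValueError("invalid dictionary entry: %r/%r" % (word, pron))
            handle.write(word + "/" + pron + "\n")
            count += 1
    return count
```

The resulting file could then be passed to train-kytea with -dict alongside the training data.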
If you are interested in participating in the KyTea project, please send an email to kytea@
.
Last Modified: 2010-8-18 by neubig