This page contains information about training models for KyTea. Getting started is extremely simple, but there are a few fine points that can help you train models more effectively.
In order to train a model for KyTea, you first need data. Prepare a fully annotated file (let's say train.txt) like this:
コーパス/N/ko:pasu の/P/no 文/N/buN で/V/de す/TL/su 。/SYM/.
もう/ADV/mo: ひと/N/hito つ/SUF/tsu の/P/no 文/N/buN で/V/de す/TL/su 。/SYM/.
Note that the file contains one sentence per line, with each word followed by its POS tag and its pronunciation, separated by slashes. These tags can actually be other things as well, such as chunking or named entity recognition tags.
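If you are preparing such a file yourself, it can help to check the format before training. The following is a minimal sketch using standard Unix tools that reports any token without exactly two slashes (train.txt is the file above; this assumes no slashes appear inside the words themselves):
awk '{ for (i = 1; i <= NF; i++) { n = gsub("/", "/", $i); if (n != 2) print "line " NR ": bad token " $i } }' train.txt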
Next, we run the training program:
train-kytea -full train.txt -model train.mod
Training will run for a while, and the trained model will be output to train.mod. That's all you need to do to train a simple model! The following are more advanced techniques that you may want to use to get the best possible performance.
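Once training is done, the model can be used for analysis by passing it to the kytea program; for example, to analyze a file of raw sentences (input.txt and output.txt are placeholder names):
kytea -model train.mod < input.txt > output.txt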
Starting with KyTea 0.3.2, it is possible to train models using files containing the features used in KyTea training. The (rather large) file below contains the features used in training a standard KyTea model.
This can be specified during training using the -feat option. While there is normally no need to do this (a model is already included with KyTea), it is very useful if you have additional annotated resources that you would like to combine with the data behind the standard KyTea model. If you have an additional fully annotated file called train.txt, you can train a model using
train-kytea -feat kytea-0.3.2.feat -full train.txt -model train.mod
which will use the data in your annotated file in addition to the regular KyTea training data. Note that this will create a very large model, and training should ideally be run on a machine with at least 4 GB of memory.
Below is a list of language resources that are available for training KyTea. If you know of any other resources, especially high-quality (99.9% accuracy or higher) ones or ones that have been annotated in KyTea's format, please tell us at kytea@.
The main sources of data that go into KyTea's Japanese model are:
Other Japanese resources that are not included in the default model include (with a focus on those that are freely available or those that we have previously used with KyTea):
A good, publicly available source of data for training Chinese word segmenters is the data from the SIGHAN word segmentation bakeoff. In addition, we are using the Lancaster Corpus of Mandarin Chinese, as it is annotated with Pinyin transcriptions. A freely available Chinese dictionary annotated with Pinyin is CC-CEDICT.
KyTea's accuracy will greatly improve if you have a good dictionary or two that can be used along with the training corpus. Dictionaries can be specified by using the -dict option:
train-kytea -full train.txt -dict dict1.txt -dict dict2.txt -model train.mod
The format of the dictionary file is the same as that of the corpus, but with one word per line:
コーパス/N/ko:pasu
の/P/no
文/N/buN
...
Note that it is possible to specify multiple dictionaries. If you have dictionaries from different sources (say, one in-domain dictionary and one out-of-domain dictionary), it is often better to separate them into different files and specify them one by one, as KyTea treats the words in different files differently.
Also, note that while all corpora (files specified by -full or -part) should be segmented according to a coherent standard, dictionaries may consist of entries that don't match the standard, such as noun phrases or compound words. Finally, do NOT make a dictionary simply using all the words in the training corpus. This will cause KyTea to over-fit the training corpus and over-segment all unknown words on your actual test data.
KyTea is able to train tagging models in two ways: with local models, which train a separate classifier for each word in the corpus, or with global models, which train a single classifier shared by all words. For example, given the following corpus:
A/X A/Y B/Y B/Z C/Z
the local model will train a classifier for A to decide between the two tags X and Y, train a classifier for B to decide between the two tags Y and Z, and always assign Z to C, as it is the only tag that appears with C in the corpus. Local models are good for situations such as pronunciation estimation, where there is no clear relation between the tags of different words.
KyTea trains local tagging models by default. By specifying the -global n option, it is possible to specify that a global model should be used for the nth tag of each word.
train-kytea -full train.txt -global 1 -model train.mod
In the corpus format above, the POS is the first tag, so -global 1 indicates that POS estimation should use a global model. Note that the default model that comes with KyTea uses a local model for pronunciation estimation and a global model for POS estimation.
For the local models described in the previous section, words that don't appear in the corpus or dictionaries used in training become unknown words. KyTea contains functionality to predict pronunciations for unknown words by breaking words down into their component characters. In order to make this functionality active, you must add a subword dictionary in the form:
コー/ko:
パ/pa
文/buN
文/fumi
with one subword followed by a pronunciation on each line. Note that subwords can consist of multiple characters, and that a single character can have multiple pronunciations. This can be specified during training as follows.
train-kytea -full train.txt -subword subword.txt -model train.mod
When training models, there are two important parameters to be chosen: -solver and -cost. The -solver parameter indicates what kind of learning algorithm KyTea should use to learn the weights for the model; more details can be found on the LIBLINEAR site, but important settings include:
In general, SVM-based models have higher accuracy, but logistic regression models are able to output probabilities. L2-regularized models will have slightly higher accuracy, but L1-regularized models will be significantly more compact.
The second parameter, -cost, designates how heavily to regularize the model. For SVM models the default (1) is generally good, but for logistic regression models it can often help to hold out a small amount of data for testing, try several values of -cost, and pick the one that gives the highest accuracy on your held-out data.
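To choose -cost concretely, a small grid search is often enough. The following is a minimal sketch (file names are placeholders, and the final scoring step is left to whatever evaluation script you prefer):
for c in 0.25 0.5 1 2 4; do
    train-kytea -full train.txt -cost $c -model cost-$c.mod
    kytea -model cost-$c.mod < heldout-raw.txt > heldout-$c.out
    # compare heldout-$c.out with the held-out annotations here
done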
It is possible to train KyTea with data that is not fully annotated with tags, or even with word boundaries. If we have a file that is fully annotated with word boundaries but is missing some tags, we create a file in the following form:
コーパス の/P 文//buN で す/TL 。//.
Note that the words "の" and "す" are annotated with only parts of speech, the words "文" and "。" are annotated with only pronunciations, and the words "コーパス" and "で" have no tags at all. For the pronunciations, notice that we leave the POS tag empty, which indicates no tag. These sorts of corpora must be specified with the -full option during training.
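Such a file can then simply be passed with -full, optionally alongside a fully annotated corpus (a sketch, assuming -full can be given multiple times; part-tagged.txt is a placeholder name):
train-kytea -full train.txt -full part-tagged.txt -model train.mod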
For corpora partially annotated with word boundaries, we create a file in KyTea's partially annotated format:
コ-ー-パ-ス|の/P|文//buN|で す|。//.
Here, notice that the boundary between the characters "で" and "す" is unannotated, while the other boundaries are annotated. Files in this format must be passed to training using the -part option.
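For example, to combine a fully annotated corpus with a partially annotated one (part.txt is a placeholder name):
train-kytea -full train.txt -part part.txt -model train.mod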
Last Modified: 2011-4-25 by neubig