This page contains information about training models for KyTea. Getting started is extremely simple, but there are a few fine points that can help you train models more effectively.
In order to train a model for KyTea, you first need data. Prepare a fully annotated file (let's say train.txt) like this:
コーパス/N/ko:pasu の/P/no 文/N/buN で/V/de す/TL/su 。/SYM/.
もう/ADV/mo: ひと/N/hito つ/SUF/tsu の/P/no 文/N/buN で/V/de す/TL/su 。/SYM/.
Note that the file contains one sentence per line, with each word followed by its POS tag and its pronunciation, separated by slashes. These tags can actually be other things as well, such as chunking or named entity recognition tags.
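If you are preparing such a file yourself, it can help to check the format before training. The following is a minimal sketch using standard Unix tools that reports any token without exactly two slashes (train.txt is the file above; this assumes no slashes appear inside the words themselves):
awk '{ for (i = 1; i <= NF; i++) { n = gsub("/", "/", $i); if (n != 2) print "line " NR ": bad token " $i } }' train.txt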
Next, we run the training program:
train-kytea -full train.txt -model train.mod
Training will run for a while, and the trained model will be output to train.mod. That's all you need to do to train a simple model! The following are more advanced techniques that you may want to use to get the best possible performance.
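Once training is done, the model can be used for analysis by passing it to the kytea program; for example, to analyze a file of raw sentences (input.txt and output.txt are placeholder names):
kytea -model train.mod < input.txt > output.txt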
Starting with KyTea 0.3.2, it is possible to train models using files containing the features used in KyTea training. The (rather large) file below contains the features used in training a standard KyTea model.
This can be specified during training using the -feat option. While there is normally no need to do this (a model is already included with KyTea), it is very useful if you have additional annotated resources that you would like to combine with the data behind the standard KyTea model. If you have an additional fully annotated file called train.txt, you can train a model using
train-kytea -feat kytea-0.3.2.feat -full train.txt -model train.mod
which will use the data in your annotated file in addition to the regular KyTea training data. Note that this will create a very large model, and training should ideally be run on a machine with at least 4 GB of memory.
Below is a list of language resources that are available for training KyTea. If you know of any other resources, especially high-quality (99.9% accuracy or higher) ones or ones that have been annotated in KyTea's format, please tell us at kytea@.
The main sources of data that go into KyTea's Japanese model are:
Other Japanese resources that are not included in the default model include (with a focus on those that are freely available or those that we have previously used with KyTea):
A good, publicly available source of data for training Chinese word segmenters is the data from the SIGHAN word segmentation bakeoff. In addition, we are using the Lancaster Corpus of Mandarin Chinese, as it is annotated with Pinyin transcriptions. A freely available Chinese dictionary annotated with Pinyin is CC-CEDICT.
KyTea's accuracy will greatly improve if you have a good dictionary or two that can be used along with the training corpus. Dictionaries can be specified by using the -dict option:
train-kytea -full train.txt -dict dict1.txt -dict dict2.txt -model train.mod
The format of the dictionary file is the same as that of the corpus, but with one word per line:
コーパス/N/ko:pasu
の/P/no
文/N/buN
...
Note that it is possible to specify multiple dictionaries. If you have dictionaries from different sources (say, one in-domain dictionary and one out-of-domain dictionary), it is often better to separate them into different files and specify them one by one, as KyTea treats the words in different files differently.
Also, note that while all corpora (files specified by -full or -part) should be segmented according to a coherent standard, dictionaries may consist of entries that don't match the standard, such as noun phrases or compound words. Finally, do NOT make a dictionary simply using all the words in the training corpus. This will cause KyTea to over-fit the training corpus and over-segment all unknown words on your actual test data.
KyTea is able to train tagging models in two ways: with local models, which train a separate classifier for each word in the corpus, or with global models, which train a single classifier shared by all words. For example, given the following corpus:
A/X A/Y B/Y B/Z C/Z
the local model will train a classifier for A to decide between the two tags X and Y, train a classifier for B to decide between the two tags Y and Z, and always assign Z to C, as it is the only tag that appears with C in the corpus. Local models are good for situations such as pronunciation estimation, where there is no clear relation between the tags of different words.
KyTea trains local tagging models by default. By specifying the -global n option, it is possible to specify that a global model should be used for the nth tag of each word.
train-kytea -full train.txt -global 1 -model train.mod
In the corpus format above, the POS is the first tag, so -global 1 indicates that POS estimation should use a global model. Note that the default model that comes with KyTea uses a local model for pronunciation estimation and a global model for POS estimation.
For the local models described in the previous section, words that don't appear in the corpus or dictionaries used in training become unknown words. KyTea contains functionality to predict pronunciations for unknown words by breaking words down into their component characters. In order to make this functionality active, you must add a subword dictionary in the form:
コー/ko:
パ/pa
文/buN
文/fumi
with one subword followed by a pronunciation on each line. Note that subwords can consist of multiple characters, and that a single character can have multiple pronunciations. This can be specified during training as follows.
train-kytea -full train.txt -subword subword.txt -model train.mod
When training models, there are two important parameters to be chosen: -solver and -cost. The -solver parameter indicates what kind of learning algorithm KyTea should use to learn the weights for the model; more details can be found on the LIBLINEAR site, but important settings include:
In general, SVM-based models have higher accuracy, but logistic regression models are able to output probabilities. L2-regularized models will have slightly higher accuracy, but L1-regularized models will be significantly more compact.
The second parameter, -cost, designates how heavily to regularize the model. For SVM models the default (1) is generally good, but for logistic regression models it can often help to hold out a small amount of data for testing, try several values of -cost, and pick the one that gives the highest accuracy on your held-out data.
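To choose -cost concretely, a small grid search is often enough. The following is a minimal sketch (file names are placeholders, and the final scoring step is left to whatever evaluation script you prefer):
for c in 0.25 0.5 1 2 4; do
    train-kytea -full train.txt -cost $c -model cost-$c.mod
    kytea -model cost-$c.mod < heldout-raw.txt > heldout-$c.out
    # compare heldout-$c.out with the held-out annotations here
done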
It is possible to train KyTea with data that is not fully annotated with tags, or even with word boundaries. If we have a file that is fully annotated with word boundaries but is missing some tags, we create a file in the following form:
コーパス の/P 文//buN で す/TL 。//.
Note that the words "の" and "す" are annotated with only parts of speech, the words "文" and "。" are annotated with only pronunciations, and the words "コーパス" and "で" have no tags at all. For the pronunciations, notice that we leave the POS tag empty, which indicates no tag. These sorts of corpora must be specified with the -full option during training.
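Such a file can then simply be passed with -full, optionally alongside a fully annotated corpus (a sketch, assuming -full can be given multiple times; part-tagged.txt is a placeholder name):
train-kytea -full train.txt -full part-tagged.txt -model train.mod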
For corpora partially annotated with word boundaries, we create a file in KyTea's partially annotated format:
コ-ー-パ-ス|の/P|文//buN|で す|。//.
Here, notice that the boundary between the characters "で" and "す" is unannotated, while the other boundaries are annotated. Files in this format must be passed to training using the -part option.
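For example, to combine a fully annotated corpus with a partially annotated one (part.txt is a placeholder name):
train-kytea -full train.txt -part part.txt -model train.mod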
Last Modified: 2011-4-25 by neubig