KyTea Models


Here we provide a number of models for KyTea that you can download and use in your work. For all varieties of models, there are four models provided: full and compact versions of both SVM and LR (logistic regression) models.

Models can be specified at run time with the -model option of kytea, or through the KYTEA_MODEL environment variable.
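As a rough illustration of the two ways of choosing a model, the Python sketch below builds a kytea invocation either with the -model option or with KYTEA_MODEL set in the environment. The helper names and the model path are hypothetical, and the sketch assumes that a kytea binary is on the PATH and that it reads text from standard input and writes its analysis to standard output.

```python
import os
import subprocess


def kytea_command(model_path=None):
    """Build the argument list for kytea, adding -model when a path is given."""
    cmd = ["kytea"]
    if model_path is not None:
        cmd += ["-model", model_path]
    return cmd


def kytea_environment(model_path, base_env=None):
    """Return a copy of the environment with KYTEA_MODEL pointing at the model."""
    env = dict(os.environ if base_env is None else base_env)
    env["KYTEA_MODEL"] = model_path
    return env


def analyze(text, model_path=None, use_env=False):
    """Pipe text through kytea and return its output.

    Assumption: kytea is installed and reads stdin / writes stdout.
    """
    if use_env and model_path is not None:
        cmd, env = kytea_command(), kytea_environment(model_path)
    else:
        cmd, env = kytea_command(model_path), None
    result = subprocess.run(
        cmd, input=text, capture_output=True, encoding="utf-8", env=env, check=True
    )
    return result.stdout
```

For example, analyze(text, "path/to/model.bin") would pass the (hypothetical) path through -model, while analyze(text, "path/to/model.bin", use_env=True) would pass it through KYTEA_MODEL instead.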

Japanese Models

These models are for use in analyzing Japanese. They are mainly trained on the Balanced Corpus of Contemporary Written Japanese (BCCWJ) and UniDic, with a few additional resources to increase coverage. The models may be used for research or commercial purposes, but may not be re-distributed without prior permission.

All of the models can be used for word (morpheme) segmentation, part-of-speech estimation, and pronunciation estimation. According to the segmentation standard (in Japanese), morphemes are used as the basic unit of segmentation, and conjugations are separated from their stems. All POS tags are in Japanese, but this script can change them into English. For reference, I show the word, word+pronunciation, and word+POS tag F-measure on some held-out data from BCCWJ.

Name           Full SVM  Compact SVM  Compact LR
Size           31M  11M  12M  233M
Word/Pron/POS  97.66%  97.75%  97.54%
Encoding       UTF8  UTF8  UTF8
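For readers unfamiliar with the F-measure reported above, the following Python sketch shows the usual way such a score is computed for segmentation: each word becomes a character span, and precision and recall are taken over the spans shared by the reference and the system output. This is an assumed, conventional definition for a single sentence; the page does not state exactly how its figures were computed, and the function names are hypothetical.

```python
def spans(tokens):
    """Turn a token list into a set of (start, end, label) items.

    Tokens are either plain strings (word only) or (word, label) pairs,
    where the label could be a POS tag or a pronunciation.
    """
    items, pos = set(), 0
    for tok in tokens:
        surface, label = tok if isinstance(tok, tuple) else (tok, None)
        items.add((pos, pos + len(surface), label))
        pos += len(surface)
    return items


def f_measure(reference, hypothesis):
    """Harmonic mean of precision and recall over matching spans."""
    ref, hyp = spans(reference), spans(hypothesis)
    correct = len(ref & hyp)
    if correct == 0:
        return 0.0
    precision = correct / len(hyp)
    recall = correct / len(ref)
    return 2 * precision * recall / (precision + recall)


# Word-only score: one of two predicted words matches one of three reference words.
word_f = f_measure(["これ", "は", "テスト"], ["これ", "はテスト"])

# Word+POS score: a word counts only if both its span and its tag match.
pos_f = f_measure(
    [("これ", "代名詞"), ("は", "助詞")],
    [("これ", "代名詞"), ("は", "名詞")],
)
```

A corpus-level score would accumulate the span counts over all sentences before computing precision and recall, rather than averaging per-sentence scores.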

These models work for KyTea version 0.4.0 and higher. Older models can be found here.

Chinese Models

Word Segmentation/Pronunciation Estimation

The following models are built from the Lancaster Corpus of Mandarin Chinese, and the CC-CEDICT dictionary. The models are in UTF-8 format with simplified characters, and can be used for both word segmentation and pronunciation estimation. Note that the pronunciation estimation model essentially chooses the most common pronunciation for each word, and cannot properly estimate the pronunciation of words with pronunciation variation such as "了", and thus should only be used as a reference. These models may be used for both research and commercial purposes.

                          Full SVM Model   Compact SVM Model   Compact LR Model
LCMC Model (Simplified)   Download (13M)   Download (5M)       Download (4M)
Word Accuracy             96.9%            97.0%               96.2%
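The limitation of choosing the most common pronunciation for each word can be illustrated with a small Python sketch. This is not KyTea's implementation, only a toy baseline under that description; the function name and the training counts are made up for illustration.

```python
from collections import Counter, defaultdict


def train_most_common(pairs):
    """Map each word to the pronunciation seen most often in training."""
    counts = defaultdict(Counter)
    for word, pron in pairs:
        counts[word][pron] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


# Toy training data (invented counts): "了" is usually read "le",
# but occasionally "liao3".
training = [("了", "le")] * 3 + [("了", "liao3")] + [("好", "hao3")]
model = train_most_common(training)

# Every occurrence of "了" now receives "le", regardless of context,
# so the less frequent reading "liao3" is never produced.
predicted = [model[w] for w in ["了", "好", "了"]]
```

Because the lookup ignores context, a word with more than one reading is always assigned the same one, which is why such output is best treated as a reference only.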

In addition, we provide the CEDICT dictionary, and the subword dictionary that were used to create these models. These are licensed under the Creative Commons Attribution-Share Alike 3.0 License.

Word Segmentation Only

The following models are based on the Penn Chinese Treebank, and can be used for segmentation in accordance with the treebank.

                         Full SVM Model   Compact SVM Model   Compact LR Model
CTB Model (Simplified)   Download (25M)   Download (5.8M)     Download (5.0M)
Word Accuracy            95.7%            95.2%               95.0%

The following models are based on the MSR and AS data provided for the Second International Chinese Word Segmentation Bakeoff, and can be used for word segmentation. Accuracies and comparison with other systems entered in the word segmentation bakeoff are provided for reference (except for the full LR model, which we have not had a chance to test yet). These models may be used for non-commercial purposes.

                                  Full SVM Model   Compact SVM Model   Compact LR Model
MSR Model (Simplified)            Download (28M)   Download (8M)       Download (4M)
Word Accuracy, Rank in MSR task   96.5%, 1/30      96.5%, 1/30         95.9%, 5/30
AS Model (Traditional)            Download (42M)   Download (14M)      Download (6M)
Word Accuracy, Rank in AS task    95.0%, 2/11      94.6%, 4/11         94.4%, 5/11

Last Modified: 2012-01-27 by neubig