Here we provide a number of models for KyTea that you can download and use in your work. For each variety of model, four models are provided: Full SVM, Compact SVM, Compact LR, and Full LR.
A model can be selected at run time with the -model option of kytea, or by setting the environment variable KYTEA_MODEL.
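As a minimal sketch of the two ways of selecting a model, the following Python helpers build a kytea command line with the -model option, or an environment with KYTEA_MODEL set. The model path `models/jp-model.mod` is a hypothetical placeholder, and `run_kytea` assumes that kytea is on your PATH and reads text from standard input and writes its analysis to standard output.

```python
import os
import subprocess


def kytea_command(model_path=None):
    """Build the kytea argument list, optionally passing -model."""
    cmd = ["kytea"]
    if model_path is not None:
        cmd += ["-model", model_path]
    return cmd


def kytea_env(model_path, base_env=None):
    """Return a copy of the environment with KYTEA_MODEL set."""
    env = dict(os.environ if base_env is None else base_env)
    env["KYTEA_MODEL"] = model_path
    return env


def run_kytea(text, cmd, env=None):
    """Pipe text to kytea and return its output.

    Assumption: kytea is installed and reads stdin / writes stdout.
    """
    result = subprocess.run(
        cmd, input=text, capture_output=True,
        encoding="utf-8", env=env, check=True,
    )
    return result.stdout


# Hypothetical model location; replace with the file you downloaded.
MODEL = "models/jp-model.mod"

# Option 1: pass the model on the command line.
cmd_with_option = kytea_command(MODEL)

# Option 2: leave the command bare and set KYTEA_MODEL instead.
cmd_bare = kytea_command()
env_with_model = kytea_env(MODEL)
```

With kytea installed, either `run_kytea(text, cmd_with_option)` or `run_kytea(text, cmd_bare, env=env_with_model)` should analyze `text` with the chosen model.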
These models are for analyzing Japanese. They are trained mainly on the Balanced Corpus of Contemporary Written Japanese (BCCWJ) and UniDic, with a few additional resources to increase coverage. The models may be used for research or commercial purposes, but may not be redistributed without prior permission.
All of the models can be used for word (morpheme) segmentation, part-of-speech estimation, and pronunciation estimation. According to the segmentation standard (in Japanese), morphemes are the basic unit of segmentation, and conjugations are separated from their stems. All POS tags are in Japanese, but this script can change them into English. For reference, we show the word, word+pronunciation, and word+POS tag F-measure on some held-out data from BCCWJ.
Name | Full SVM | Compact SVM | Compact LR | |
---|---|---|---|---|
Size | 31M | 11M | 12M | 233M |
Word/Pron/POS | 97.66% | 97.75% | 97.54% | |
Encoding | UTF8 | UTF8 | UTF8 | |
These models work for KyTea version 0.4.0 and higher. Older models can be found here.
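For readers unfamiliar with how an F-measure over words or word+tag pairs is computed, here is a minimal sketch. It uses the common span-matching definition (a predicted token counts as correct only if its character span, and any attached labels, match the reference exactly); this is an assumption, not necessarily the exact evaluation script used for the figures above, and it scores a single sentence rather than a whole corpus.

```python
def labeled_spans(tokens):
    """Turn tokens into a set of (start, end, *labels) tuples.

    Each token is a tuple: (surface,) for plain segmentation,
    or (surface, tag) for word+POS or word+pronunciation scoring.
    """
    spans, start = set(), 0
    for surface, *labels in tokens:
        end = start + len(surface)
        spans.add((start, end, *labels))
        start = end
    return spans


def f_measure(reference, hypothesis):
    """Harmonic mean of precision and recall over matching spans."""
    ref = labeled_spans(reference)
    hyp = labeled_spans(hypothesis)
    correct = len(ref & hyp)
    if correct == 0:
        return 0.0
    precision = correct / len(hyp)
    recall = correct / len(ref)
    return 2 * precision * recall / (precision + recall)


# Word-only scoring: only the token "に" has the same span in both.
ref_words = [("東京",), ("都",), ("に",), ("住",), ("む",)]
hyp_words = [("東京都",), ("に",), ("住む",)]
word_f = f_measure(ref_words, hyp_words)

# Word+POS scoring: same segmentation, but one tag differs.
ref_tagged = [("東京", "名詞"), ("に", "助詞")]
hyp_tagged = [("東京", "名詞"), ("に", "名詞")]
tagged_f = f_measure(ref_tagged, hyp_tagged)
```

Because a labeled token can only be correct if its segmentation is also correct, the word+POS and word+pronunciation scores computed this way can never exceed the word score on the same output.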
The following models are built from the Lancaster Corpus of Mandarin Chinese (LCMC) and the CC-CEDICT dictionary. The models are in UTF-8 format with simplified characters, and can be used for both word segmentation and pronunciation estimation. Note that the pronunciation estimation model essentially chooses the most common pronunciation for each word, so it cannot properly estimate the pronunciation of words with pronunciation variation such as "了". Its pronunciation output should therefore be used only as a reference. These models may be used for both research and commercial purposes.
Model | Full SVM Model | Compact SVM Model | Compact LR Model |
---|---|---|---|
LCMC Model (Simplified), Word Accuracy | Download (13M), 96.9% | Download (5M), 97.0% | Download (4M), 96.2% |
In addition, we provide the CEDICT dictionary and the subword dictionary that were used to create these models. These are licensed under the Creative Commons Attribution-Share Alike 3.0 License.
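To illustrate why a model that essentially chooses the most common pronunciation for each word struggles with words that have more than one reading, here is a toy most-frequent-pronunciation baseline. It is only an illustration of the idea, not KyTea's implementation, and the training pairs and their counts are invented for the example.

```python
from collections import Counter, defaultdict


def train_most_common(pairs):
    """Map each word to the pronunciation seen most often with it."""
    counts = defaultdict(Counter)
    for word, pron in pairs:
        counts[word][pron] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


# Invented toy data: "了" appears twice as "le5" and once as "liao3".
training_pairs = [
    ("了", "le5"),
    ("了", "le5"),
    ("了", "liao3"),
    ("好", "hao3"),
]

model = train_most_common(training_pairs)

# The baseline ignores context, so every "了" gets the same reading,
# including occurrences where "liao3" would be correct.
predicted = [model[w] for w in ["好", "了", "了"]]
```

Because the lookup never consults the surrounding words, the less frequent reading is never produced, which is why such output is best treated as a reference only.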
The following models are based on the Penn Chinese Treebank, and can be used for segmentation in accordance with the treebank.
Model | Full SVM Model | Compact SVM Model | Compact LR Model |
---|---|---|---|
CTB Model (Simplified), Word Accuracy | Download (25M), 95.7% | Download (5.8M), 95.2% | Download (5.0M), 95.0% |
The following models are based on the MSR and AS data provided for the Second International Chinese Word Segmentation Bakeoff, and can be used for word segmentation. Accuracies and ranks relative to the other systems entered in the bakeoff are provided for reference (except for the full LR model, which we have not had a chance to test yet). These models may be used for non-commercial purposes.
Model | Full SVM Model | Compact SVM Model | Compact LR Model |
---|---|---|---|
MSR Model (Simplified), Word Accuracy, Rank in MSR task | Download (28M), 96.5%, 1/30 | Download (8M), 96.5%, 1/30 | Download (4M), 95.9%, 5/30 |
AS Model (Traditional), Word Accuracy, Rank in AS task | Download (42M), 95.0%, 2/11 | Download (14M), 94.6%, 4/11 | Download (6M), 94.4%, 5/11 |
Last Modified: 2012-01-27 by neubig