KyTea

This is the home of the Kyoto Text Analysis Toolkit (KyTea, pronounced "cutie"). It is a general toolkit developed for analyzing text, with a focus on Japanese and other languages requiring word or morpheme segmentation.

Features

KyTea is able to perform two types of processing: word segmentation and pronunciation estimation.

Both word segmentation (KyWs) and pronunciation estimation (KyPe) use a pointwise, classifier-based approach (SVM or logistic regression), which allows training on partially annotated data. The classifiers are trained with LIBLINEAR. More details on KyTea's classification approach can be found here or in the following paper.

Download/Install

Download

Latest Version: KyTea 0.1.2

This package contains the source code and a default model that uses the UTF-8 character encoding and estimates pronunciations according to keyboard input (which differs slightly from the actual phonetic pronunciation). More details and a number of other models can be found on the KyTea Models page.

Past Versions: KyTea 0.1.1, KyTea 0.1.0, KyTea 0.0.3, KyTea 0.0.2, KyTea 0.0.1

The code of KyTea is distributed according to the Apache License Version 2, and can be distributed freely according to this license. The models included with KyTea or distributed on the KyTea models page may be used for research or commercial purposes, but may not be re-distributed without prior permission.

Install

KyTea has been tested on Linux, Mac OS X, and Windows (via Cygwin). On Linux or Cygwin, download the source code and install it using the following commands.

tar -xzf kytea-X.X.X.tar.gz
cd kytea-X.X.X
./configure
make
make install
kytea --help

If this prints a help message, KyTea is working properly. A number of options can be set at compile time to adjust the install location or program efficiency.

Program Documentation

Using the Program

After you have installed KyTea, run the program to split the text into words and annotate each word with a pronunciation. If test.raw is a file that contains raw text, the following command will create annotated text in the file test.full.

kytea < test.raw > test.full

Training a Model

While KyTea comes with a default model, if you have your own annotated text it is both simple and useful to build your own model. First, you must prepare a corpus with one sentence per line in the following format (if you only want to do word segmentation, the pronunciations are not necessary):

word1/pron1 word2/pron2 word3/pron3
word4/pron4 word5/pron5 word6/pron6

Let's say that this corpus is named train.full (full means that the file is fully annotated in the above format). If we have an unsegmented file named test.raw, we can create a model and analyze the unsegmented file using the following commands.

train-kytea -full train.full -model model.dat
kytea -model model.dat < test.raw > test.full

test.full will now contain the segmented text, with each word annotated with a pronunciation.
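If the annotated data lives in another program, the corpus file can be generated with a few lines of code. The following is a minimal Python sketch, assuming only the space-separated word/pron format shown above; the function names (format_sentence, parse_sentence, write_corpus) are hypothetical helpers and not part of KyTea, and words or pronunciations that themselves contain a slash or a space are not handled.

```python
def format_sentence(tokens):
    """Join (word, pron) pairs into one corpus line.

    pron may be None, which writes the bare word
    (enough when only word segmentation is trained).
    """
    parts = []
    for word, pron in tokens:
        parts.append(word if pron is None else f"{word}/{pron}")
    return " ".join(parts)


def parse_sentence(line):
    """Split one corpus line back into (word, pron) pairs."""
    tokens = []
    for item in line.split():
        # Assumption: the first slash separates the word from its pronunciation.
        word, sep, pron = item.partition("/")
        tokens.append((word, pron if sep else None))
    return tokens


def write_corpus(path, sentences):
    """Write one sentence per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in sentences:
            f.write(format_sentence(tokens) + "\n")
```

For example, write_corpus("train.full", [[("これ", "これ"), ("は", "は"), ("テスト", "てすと")]]) writes a one-sentence file that can be passed to train-kytea with -full.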

Usage

kytea

kytea performs word segmentation and pronunciation estimation given a model.

Options: 
  -model   The model file to use when analyzing text
  -in      The formatting of the input  (raw/full/part/conf, default raw)
  -out     The formatting of the output (full/part/conf, default full)
  -nows    Don't do word segmentation (raw input cannot be accepted)
  -nope    Don't do pronunciation estimation (full input cannot be accepted)
  -nounk   Don't estimate the pronunciation of unknown words
  -unkcount The maximum number of unknown pronunciations to print (default 5, 0 implies no limit)
  -unktag  A tag to append to indicate words not in the dictionary
  -unkbeam The width of the beam to use in beam search for unknown words (default 50, 0 for full search)
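When kytea is called from a script, it can be convenient to assemble the command line from the options above. The following Python sketch is one way to do that; kytea_command and analyze are hypothetical helper names, the flags are the ones listed above, and passing each format as a separate argument after -in or -out is assumed to follow the same pattern as -model in the earlier examples. The analyze function only works if kytea is installed and on the PATH.

```python
import subprocess

# Formats listed for -in and -out in the option summary.
INPUT_FORMATS = {"raw", "full", "part", "conf"}
OUTPUT_FORMATS = {"full", "part", "conf"}


def kytea_command(model=None, in_format="raw", out_format="full", extra=()):
    """Build the argument list for a kytea call."""
    if in_format not in INPUT_FORMATS:
        raise ValueError(f"unsupported input format: {in_format}")
    if out_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {out_format}")
    cmd = ["kytea"]
    if model is not None:
        cmd += ["-model", model]
    cmd += ["-in", in_format, "-out", out_format]
    cmd += list(extra)  # e.g. ["-nope"] to skip pronunciation estimation
    return cmd


def analyze(text, **options):
    """Pipe UTF-8 text through kytea and return its output as a string."""
    result = subprocess.run(
        kytea_command(**options),
        input=text.encode("utf-8"),
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError if kytea exits with an error
    )
    return result.stdout.decode("utf-8")
```

Calling analyze(text, model="model.dat") corresponds to the command kytea -model model.dat -in raw -out full with the text on standard input.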

train-kytea

train-kytea is a program to train models for KyTea.

Input/Output Options: 
  -encode  The text encoding to be used for input/output (utf8/euc/sjis; default: utf8)
  -full    A file of fully annotated training data (can be specified multiple times)
  -part    A file of partially annotated training data (can be specified multiple times)
  -conf    A file of training data annotated with confidences (can be specified multiple times)
  -dict    A dictionary file (in the form of one 'word/pron' entry per line)
  -subword A dictionary file of subword units. Adding this will enable unknown word PE
  -model   The file to write the trained model to
  -modtext Print a text model (instead of the default binary)
Model Training Options (basic): 
  -nows    Don't train a word segmentation model
  -nope    Don't train a pronunciation estimation model
Model Training Options (for advanced users): 
  -charw   The window of characters to use on either side of a boundary for WS (default 3)
  -charn   The maximum length of character n-grams to use for WS (default 3)
  -typew   The window of character types to use on either side of a boundary for WS (default 3)
  -typen   The maximum length of character type n-grams to use for WS (default 3)
  -dictn   All dictionary words greater than this length will be bucketed together (default 4)
  -unkn    The order of the language model to use for unknown words (default 3)
  -eps     The epsilon stopping criterion for classifier training
  -bias    The bias value to use in classifier training (default 1)
  -solver  The solver (0 = logistic regression, 1 = SVM, etc.; default 1)
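
To give an intuition for what -charw and -charn control, the following Python sketch collects character n-grams from a window around one candidate word boundary. It is an illustration only: the feature naming and the exact set of n-grams are assumptions made for this example, not KyTea's actual feature templates, and char_ngram_features is a hypothetical function name.

```python
def char_ngram_features(text, boundary, window=3, max_n=3):
    """Character n-grams around the boundary between
    text[boundary - 1] and text[boundary].

    window plays the role of -charw (characters on either side),
    max_n plays the role of -charn (longest n-gram).
    """
    lo = max(0, boundary - window)          # leftmost character in the window
    hi = min(len(text), boundary + window)  # one past the rightmost character
    feats = []
    for n in range(1, max_n + 1):
        for start in range(lo, hi - n + 1):
            offset = start - boundary  # n-gram start relative to the boundary
            feats.append(f"{n}gram[{offset}]={text[start:start + n]}")
    return feats
```

In a pointwise approach, a feature list like this is computed independently for each candidate boundary and passed to the classifier, which decides whether a word boundary is present at that position.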

Program Details

Detailed explanation of various parts of KyTea can be found below:

KyTea Contributors

If you are interested in participating in the KyTea project, please send an email to kytea@.

Revision History

Future Features/Known Issues

Version 0.1.2 (8/18/2010)

Version 0.1.1 (5/11/2010)

Version 0.1.0 (3/5/2010)

Version 0.0.3 (11/30/2009)

Version 0.0.2 (11/16/2009)

Version 0.0.1 (11/05/2009)

Last Modified: 2010-8-18 by neubig