KyTea's Word Segmentation and Tagging


This page describes the methods used in KyTea's word segmentation and tagging.

Word Segmentation

Word segmentation is performed using pointwise estimation. This means that each potential boundary between two characters is classified separately, without regard to the word boundaries that exist in other parts of the sentence. Because every decision stands alone, training is simple and efficient, and data that has been only partially annotated can be used.
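The independence of the decisions can be made concrete with a minimal Python sketch. It is not KyTea's implementation: the names `segment`, `training_examples`, and `toy_score` are hypothetical, and the scorer is a stand-in for a trained classifier. The second function illustrates why partial annotation is enough: only the boundaries that carry a label become training examples.

```python
def segment(chars, boundary_score):
    """Split `chars` into words by classifying every boundary on its own.

    `boundary_score(chars, i)` is an assumed callback that returns a
    positive number when a word boundary should be placed before chars[i].
    """
    if not chars:
        return []
    words, start = [], 0
    for i in range(1, len(chars)):
        # The decision at position i never consults decisions made elsewhere.
        if boundary_score(chars, i) > 0:
            words.append(chars[start:i])
            start = i
    words.append(chars[start:])
    return words


def training_examples(labels):
    """Yield (position, label) pairs for annotated boundaries only.

    `labels[k]` describes the boundary before character k + 1:
    True (boundary), False (no boundary), or None (not annotated).
    """
    for position, label in enumerate(labels, start=1):
        if label is not None:
            yield position, label


# Toy scorer, not a trained model: split where digits meet non-digits.
def toy_score(chars, i):
    return 1.0 if chars[i - 1].isdigit() != chars[i].isdigit() else -1.0


print(segment("ab12cd", toy_score))
print(list(training_examples([True, None, False])))
```

In a real system the scorer would be a linear classifier over features of the surrounding text, but the control flow stays the same.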

Pointwise word segmentation

The information surrounding the potential word boundary under consideration is used as features for the classifier. There are a total of three types of features used in this decision: character n-grams, character type n-grams, and dictionary word features.

Character n-grams

Character n-grams provide information about the characters surrounding the potential word boundary. The maximum n-gram length n can be set with the "-charn" option, and the window size with the "-charw" option. The example below shows the character n-grams for -charn=3, -charw=2 (the default settings of KyTea).

[Figure: n-gram features for word segmentation]
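As a rough illustration of how such features might be enumerated, the Python sketch below collects every n-gram of length 1 to n that lies inside a window of w characters on each side of the boundary. The function name, the feature string format (`X<offset>:<n-gram>`), and the choice to skip n-grams that run past the sentence edge are assumptions made for this example, not KyTea's actual encoding.

```python
def char_ngram_features(chars, boundary, n=3, window=2, prefix="X"):
    """List n-gram features for the gap just before chars[boundary].

    Each feature records the n-gram's start offset relative to the
    boundary, so the same string at different positions stays distinct.
    """
    lo, hi = boundary - window, boundary + window
    feats = []
    for start in range(lo, hi):
        for length in range(1, n + 1):
            end = start + length
            if end > hi:
                break  # n-gram would leave the window
            if start < 0 or end > len(chars):
                continue  # assumption: ignore n-grams past the sentence edge
            feats.append(f"{prefix}{start - boundary}:{chars[start:end]}")
    return feats


# Boundary between "c" and "d" with the defaults n=3, window=2.
for feat in char_ngram_features("abcdef", 3):
    print(feat)
```

With the default settings the window covers four characters, which yields at most nine n-gram features per boundary in this sketch.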

Character type n-grams

In addition to the n-grams over the characters themselves, n-grams over character types are included to reduce data sparsity. The character types are Kanji, Katakana, Hiragana, Roman characters, Numbers, and "Other" (including symbols, etc.). As with character n-grams, n can be adjusted using "-typen" and the window using "-typew".

Dictionary Words

Finally, when one or more dictionaries are supplied with the "-dict" option, features are included that indicate whether a dictionary word ends immediately to the left of the point under consideration (L), starts immediately to its right (R), or contains the point under consideration inside it (I). These features are also separated by word length. For example, if a word of length 4 occurs to the left of the point under consideration, the feature "L4" will be activated. In addition, the "-dicn" option can be used to set an upper limit on word length. In the case of -dicn=4, all words of length 4 or more will be bucketed into the features "L4", "R4", and "I4". The following diagram gives an example:

[Figure: Dictionary word features for word segmentation]
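The L/R/I features and the length bucketing can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name and the brute-force substring search are hypothetical, and `max_len` simply plays the role that "-dicn" plays in the description above.

```python
def dict_features(chars, boundary, dictionary, max_len=4):
    """Return the set of L/R/I features for the gap before chars[boundary]."""
    feats = set()
    for word in dictionary:
        if not word:
            continue
        # Words of length max_len or more share one bucket.
        bucket = min(len(word), max_len)
        start = chars.find(word)
        while start != -1:
            end = start + len(word)
            if end == boundary:
                feats.add(f"L{bucket}")   # word ends at the boundary
            elif start == boundary:
                feats.add(f"R{bucket}")   # word starts at the boundary
            elif start < boundary < end:
                feats.add(f"I{bucket}")   # boundary falls inside the word
            start = chars.find(word, start + 1)
    return feats


words = {"abc", "de", "bcde", "abcdef"}
print(sorted(dict_features("abcdef", 3, words)))
```

Here "abc" ends at the boundary (L3), "de" starts at it (R2), and both "bcde" and "abcdef" span it; with max_len=4 the two spanning words collapse into the single feature I4.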

Tagging

Tagging is performed after word segmentation has been completed. Like word segmentation, tagging is performed pointwise, with the tag of each word estimated independently of the tags of the other words. Two types of models can be used for tagging: local models and global models.
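The pointwise tagging step can be pictured with the short Python sketch below. The names `tag_words` and `toy_tag_score` are hypothetical, and the scorer is a placeholder for a trained local or global model; the point is only that each word's tag is chosen without looking at the tags assigned to its neighbors.

```python
def tag_words(words, tag_score, tagset):
    """Choose the best-scoring tag for every word independently.

    `tag_score(words, i, tag)` is an assumed callback that scores `tag`
    for words[i] using the surrounding text, never other predicted tags.
    """
    tags = []
    for i in range(len(words)):
        best = max(tagset, key=lambda tag: tag_score(words, i, tag))
        tags.append(best)
    return tags


# Toy scorer, not a trained model: digit strings are NUM, the rest WORD.
def toy_tag_score(words, i, tag):
    is_number = words[i].isdigit()
    return 1.0 if (tag == "NUM") == is_number else 0.0


print(tag_words(["ab", "12", "cd"], toy_tag_score, ["WORD", "NUM"]))
```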

Local Models

Estimation using local models is performed as follows:

Global Models

Global models perform tag prediction for every word, whether it is known or unknown, with a single classifier. Details are described in this paper, but essentially the feature set uses character and character type n-grams, as well as the identity of the word, the characters it contains, and whether it is in the included dictionaries.

Last Modified: 2011-4-26 by neubig