Analyzing KyTea's Results

Return to KyTea

Starting at KyTea version 0.2.0, it is now possible to analyze why KyTea gave you the result that it did. Before analyzing the results, it's necessary to understand how KyTea works, so it'd be best if you read the method page first.

Ok, now let's say that we have some result that KyTea gave us that we don't like so much. For example, in the following sentence, we would prefer that "党内" be separated into two words:

$ kytea
$ 民主党内の保守派
民主/みんしゅ 党内/とうない の/の 保守/ほしゅ 派/は

Now, let's run KyTea again, this time with the -debug 2 option. This will output a detailed list about the features that contributed to every segmentation decision:

$ kytea
$ 民主党内の保守派
WB 1 (民主): D0I2=1.17385 X1主=-0.148052 T1KKK=0.124833 X2党=0.118182 X0民主=0.0995188 T0KKK=0.0936877 D4R1=-0.0651905 D0R1=-0.0507923 T2KK=-0.0465374 D3I2=-0.0414231 T0KK=0.0321902 T2K=-0.0285143 T3K=-0.0260331 D3R1=0.0209904 D3L1=0.0151808 T1K=-0.0132829 T0K=-0.00677783 D4L1=0.00657395 D0L1=-0.00558881 T1KK=-0.000365658 BIAS=0 --- TOTAL=1.25245
…
WB 3 (党内): D0I2=1.17385 X0党=-0.380375 X1内の=-0.345203 T0KKH=0.344339 T-1KK=-0.215887 T-1K=-0.199744 T-1KKK=-0.150186 T-2KKK=0.149819 T2HK=-0.116214 T-2KK=0.0979653 D4R1=-0.0651905 D0R1=-0.0507923 T1KHK=0.0500478 T2H=-0.0474248 X-1主=-0.0461335 X1内=-0.0438923 T1KH=-0.038262 T0KK=0.0321902 T-2K=0.026408 T3K=-0.0260331 X3保=-0.0182091 X2の=0.0168658 T1K=-0.0132829 T0K=-0.00677783 D4L1=0.00657395 D0L1=-0.00558881 X0党内=-0.00256318 X-2民=0.00125031 BIAS=0 --- TOTAL=0.127554
…
PE 5 (派->は/ぱ): BIAS=0.975138 --- TOTAL=0.975138
民主/みんしゅ 党内/とうない の/の 保守/ほしゅ 派/は

Wow, that's a lot of text! And I deleted most of the non-important parts, but if we take a look one by one we can figure out exactly what this means. First, remember from the method page that KyTea is using three types of features, character n-gram, character type n-gram, and dictionary features. The below diagram summarizes these.

One by one, the IDs of the features mean:

Character/Type n-grams: Have IDs that take the format "[T or X][POSITION][STRING]. An ID starting with X means a character n-gram, and an ID starting with T means a type n-gram. The position is the relative position compared to the decision point. Position 0 means a n-gram that starts directly to the left of the decision point, position 1 means an n-gram that starts directly to the right of the decision point, and all other positions are relative to these. Finally, the string is the actual character or type string that starts at these points.
Dictionary Words: Have IDs that take the format "D[DICT ID][L or R or I][LENGTH]. As KyTea can be trained with multiple dictionaries, which dictionary generated every word is saved, and the information is used when generating the features. Here, L, R, and I, are that the point we are considering is to the right, left, or inside of the word. Words of different lengths are also treated differently, as long words are generally more informative than shorter words.
Bias: The bias feature allows a latent penalty or bonus to be given before any features are seen. This is useful when one of the options is much more or less likely than the others.

Now let's look back at the features that are active for the boundary between "党" and "内" and see why the mistake was made. Note that weights that are more than zero mean that adding a word boundary is less likely, and weights that are less than zero means that adding a word boundary is more likely. Also, features are ordered from largest to smallest, so the most important features occur at the beginning of the list.

D0I2=1.17385: This feature means that "the decision point lies within a word of length two in dictionary 0." In other words, "党内" is in dictionary 0. As we generally don't want to break apart words that are in the dictionary, this is given a very high weight, reducing the likelihood that a boundary will be inserted.
X0党=-0.380375: This feature means that "党" occurs directly before the decision point. As "党" generally stands on its own as a word, this is given a negative value, increasing the likelihood that a word boundary will be inserted.
X1内の=-0.345203: This means "内の" occurs directly after the decision point. As "内" is often used as a suffix, this increases the likelihood of adding a boundary.
T0KKH=0.344339: This means that the decision point occurs after the first character in a "Kanji, Kanji, Hiragana" sequence. This is a little less intuitive, but as there are many two-Kanji words followed by a hiragana particle, the classifier has decided that this gives a good hint not to insert a word boundary.
…
TOTAL=0.127554: There are more features, but as you probably get the idea, we'll skip to the final total. It can be seen that the total is greater than zero, which means that we will not insert a word boundary. However, this total is quite small, which means that the classifier was uncertain about its decision. This means that probably if we had a single instance of training data with this pattern in it, we would have had no problem separating this correctly.

Now What?

Ok, so we've analyzed KyTea's results, now what? Well one thing that you could do if you're making your own model is decide what kind of data you need to add to make your results better. Are most of the mistakes places where a dictionary word that should be in the dictionary is conspicuously absent? If so, add more entries to the dictionary. Or, if like this case the dictionary words are present but KyTea is still making a mistake, you can try creating a pointwise annotated corpus containing the spots that KyTea is having trouble with. Finally, if you find a particularly egregious feature value, it would also be possible to add a few examples to the corpus that contradict the example to reduce its negative effect.

Return to KyTea
Last Modified: 2010-12-24 by neubig