Analyzing KyTea's Results


Starting with KyTea version 0.2.0, it is possible to analyze why KyTea gave you the result that it did. Before analyzing the results, it's necessary to understand how KyTea works, so it'd be best if you read the method page first.

Ok, now let's say that KyTea has given us a result we don't like. For example, in the following sentence, we would prefer that "党内" be separated into two words:

$ kytea
$ 民主党内の保守派
民主/みんしゅ 党内/とうない の/の 保守/ほしゅ 派/は

Now, let's run KyTea again, this time with the -debug 2 option. This will output a detailed list of the features that contributed to each segmentation decision:

$ kytea
$ 民主党内の保守派
WB 1 (民主): D0I2=1.17385 X1主=-0.148052 T1KKK=0.124833 X2党=0.118182 X0民主=0.0995188 T0KKK=0.0936877 D4R1=-0.0651905 D0R1=-0.0507923 T2KK=-0.0465374 D3I2=-0.0414231 T0KK=0.0321902 T2K=-0.0285143 T3K=-0.0260331 D3R1=0.0209904 D3L1=0.0151808 T1K=-0.0132829 T0K=-0.00677783 D4L1=0.00657395 D0L1=-0.00558881 T1KK=-0.000365658 BIAS=0 --- TOTAL=1.25245
…
WB 3 (党内): D0I2=1.17385 X0党=-0.380375 X1内の=-0.345203 T0KKH=0.344339 T-1KK=-0.215887 T-1K=-0.199744 T-1KKK=-0.150186 T-2KKK=0.149819 T2HK=-0.116214 T-2KK=0.0979653 D4R1=-0.0651905 D0R1=-0.0507923 T1KHK=0.0500478 T2H=-0.0474248 X-1主=-0.0461335 X1内=-0.0438923 T1KH=-0.038262 T0KK=0.0321902 T-2K=0.026408 T3K=-0.0260331 X3保=-0.0182091 X2の=0.0168658 T1K=-0.0132829 T0K=-0.00677783 D4L1=0.00657395 D0L1=-0.00558881 X0党内=-0.00256318 X-2民=0.00125031 BIAS=0 --- TOTAL=0.127554
…
PE 5 (派->は/ぱ): BIAS=0.975138 --- TOTAL=0.975138
民主/みんしゅ 党内/とうない の/の 保守/ほしゅ 派/は
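If you want to dig through this output programmatically rather than by eye, a small script can help. The sketch below is my own helper, not part of KyTea: it parses a single WB line into feature/weight pairs, checks that the weights add up to the reported TOTAL, and lists the features by how much they influence the decision. The example line is shortened, with illustrative weights rather than values from a real model.

```python
import re

def parse_wb_line(line):
    """Parse one 'WB' line of KyTea's -debug 2 output into
    (position, context, {feature: weight}, total)."""
    m = re.match(r"WB (\d+) \((\S+)\): (.*) --- TOTAL=(\S+)", line)
    pos, context, feats, total = m.groups()
    weights = {}
    for tok in feats.split():
        name, _, val = tok.rpartition("=")  # split on the last '=' only
        weights[name] = float(val)
    return int(pos), context, weights, float(total)

# A shortened example line; the weights are made up for illustration.
line = "WB 3 (党内): D0I2=1.2 X0党=-0.4 T0KKH=0.3 BIAS=0 --- TOTAL=1.1"
pos, ctx, weights, total = parse_wb_line(line)

# The individual weights should sum to the reported total
# (up to rounding in the printed output).
assert abs(sum(weights.values()) - total) < 1e-6

# List features from most to least influential (by absolute weight).
for name, w in sorted(weights.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}\t{w:+.4f}")
```

Running this on the real WB 3 line above gives you the same ordering shown in the output, since KyTea already sorts by importance.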

Wow, that's a lot of text! (I've already elided most of the unimportant parts with "…".) But if we take a look at it piece by piece, we can figure out exactly what it means. First, remember from the method page that KyTea uses three types of features: character n-gram, character type n-gram, and dictionary features. The below diagram summarizes these.

One by one, the feature IDs correspond to these three types: IDs starting with X are character n-gram features, IDs starting with T are character type n-gram features (K stands for kanji and H for hiragana), and IDs starting with D are dictionary features. The numbers indicate the position of the feature relative to the boundary being decided, and BIAS is a constant bias term.

Now let's look back at the features that are active for the boundary between "党" and "内" and see why the mistake was made. Note that weights greater than zero mean that adding a word boundary is less likely, and weights less than zero mean that adding a word boundary is more likely. Also, the features are ordered from largest to smallest absolute value, so the most important features come at the beginning of the list.

Now What?

Ok, so we've analyzed KyTea's results; now what? Well, one thing you can do if you're making your own model is decide what kind of data you need to add to improve your results. Are most of the mistakes in places where a word that should be in the dictionary is conspicuously absent? If so, add more entries to the dictionary. Or if, as in this case, the dictionary words are present but KyTea is still making a mistake, you can try creating a pointwise annotated corpus containing the spots that KyTea is having trouble with. Finally, if you find a particularly egregious feature weight, you can also add a few examples to the corpus that contradict it, to reduce its negative effect.
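For example, to tell KyTea that there should be a boundary between "党" and "内" without committing to the rest of the sentence, you could add a line like the following to a partially annotated corpus. (This assumes KyTea's partial annotation format as I understand it: "|" marks a word boundary, "-" marks a non-boundary, and a plain space leaves the decision unannotated; check the training documentation to confirm.)

```
民 主 党|内 の 保 守 派
```

Only the one boundary you annotated is used as training signal, which is exactly the point of pointwise training: you can target the model's specific weaknesses without annotating whole sentences.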

Last Modified: 2010-12-24 by neubig