There are four different IO formats that KyTea is able to handle. The format of the input and output can be specified during both training and analysis.
Full annotation is the traditional method for annotating word segmentations, by putting a space between every word. When pronunciations are annotated as well, a slash must be placed between the word and its pronunciation.
コーパス/ko:pasu の/no 文/buN で/de す/su 。/.
Here, words are separated by spaces, and tags are separated by slashes. However, these settings can be changed with the "-wordbound" and "-tagbound" options. Also note that in the output, words for which no pronunciation could be generated are annotated with "/UNK" by default. This can be changed using the "-deftag" option. In addition, as KyTea is able to generate pronunciations using an unknown word model, all unknown words (whether they have a pronunciation or not) can be marked using the "-unktag" option.
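To make the layout of a fully annotated line concrete, here is a minimal Python sketch that splits such a line into words and tags. It is not part of KyTea: the function name parse_full is hypothetical, its two parameters simply mirror the default separators described above, and it assumes that neither separator appears (escaped or otherwise) inside a word or tag.

```python
def parse_full(line, wordbound=" ", tagbound="/"):
    """Split a fully annotated line into (word, [tags]) pairs.

    Hypothetical helper for illustration only; assumes the separators
    never occur inside a word or a tag.
    """
    pairs = []
    for token in line.split(wordbound):
        if not token:
            continue  # ignore empty tokens caused by repeated separators
        word, *tags = token.split(tagbound)
        pairs.append((word, tags))
    return pairs


sentence = "コーパス/ko:pasu の/no 文/buN で/de す/su 。/."
parsed = parse_full(sentence)
for word, tags in parsed:
    print(word, tags)
```

Passing different wordbound and tagbound values lets the same sketch read data written with non-default separators.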
Tokenized data is identical to fully annotated data, but does not use any tags.
コーパス の 文 で す 。
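Because tokenized data is fully annotated data without the tags, one format can be derived from the other. The following Python sketch shows one way to do this outside of KyTea. The name to_tokenized is hypothetical, and the sketch assumes the default separators and no escaped separator characters.

```python
def to_tokenized(full_line, wordbound=" ", tagbound="/"):
    """Drop every tag from a fully annotated line, keeping only the words.

    Hypothetical helper for illustration only.
    """
    words = [
        token.split(tagbound)[0]       # text before the first tag separator
        for token in full_line.split(wordbound)
        if token                       # skip empty tokens
    ]
    return wordbound.join(words)


tokenized = to_tokenized("コーパス/ko:pasu の/no 文/buN で/de す/su 。/.")
print(tokenized)
```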
Partial annotation uses one of three tags between each character: the "word boundary exists (-hasbound)" tag "|", the "word boundary doesn't exist (-nobound)" tag "-", and the "existence of a word boundary is unknown (-unkbound)" tag " " (single space). When performing pronunciation estimation, a pronunciation can be added after a slash. The second sentence demonstrates an annotation where not all of the boundaries have been annotated.
コ-ー-パ-ス/ko:pasu|の/no|文/buN|で/de|す/su|。/.
境-界|未 知 の 文|で す 。
In addition, to allow annotators to skip annotating word boundaries they are unsure about, KyTea also has a "skipped boundary (-skipbound)" tag, which functions identically to the unknown boundary tag.
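The alternation of characters and boundary tags can be illustrated with a short Python sketch. It is a simplification rather than KyTea's own reader: the name parse_partial and the label strings are hypothetical, it handles only lines without pronunciation tags (such as the second example sentence), and it does not handle the skipped boundary tag.

```python
def parse_partial(line, hasbound="|", nobound="-", unkbound=" "):
    """Split a partially annotated line into characters and boundary labels.

    labels[i] describes the gap between chars[i] and chars[i + 1].
    Hypothetical helper; assumes the line carries no "/pronunciation" tags.
    """
    names = {hasbound: "boundary", nobound: "no-boundary", unkbound: "unknown"}
    if len(line) % 2 == 0:
        raise ValueError("expected characters and boundary tags to alternate")
    chars = list(line[0::2])  # characters sit at even positions
    marks = line[1::2]        # boundary tags sit at odd positions
    labels = []
    for mark in marks:
        if mark not in names:
            raise ValueError(f"unexpected boundary tag: {mark!r}")
        labels.append(names[mark])
    return chars, labels


chars, labels = parse_partial("境-界|未 知 の 文|で す 。")
for left, right, label in zip(chars, chars[1:], labels):
    print(left, right, label)
```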
The default input for the analyzer is a raw corpus with no annotation.
コーパスの文です。
KyTea is also able to output results annotated with confidence measures. For models using SVMs, this confidence measure indicates the distance from the SVM separating hyperplane. For models using logistic regression, this confidence measure is the probability of the decision. Regardless of the type of classifier, the confidence measures for unknown words are probabilities, as unknown word pronunciations are estimated probabilistically. For confidence-annotated output, four lines are output for every line of input.
コーパス/ko:pasu の/no 文/buN&moN&fumi で/de す/su 。/.
3.18908 1.7448 3.91682 2.57838 2.23258 1.28151 2.6298 1.98738
100 100 0.309393&-1.36203e-17&-0.348795 100 100 100
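The sample output can be read programmatically along the following lines. This Python sketch rests on an interpretation inferred from the sample rather than stated in the text: the first line is the annotation, with alternative pronunciations joined by "&"; the second line holds one value per character boundary (eight values for the nine-character sentence); and the third line holds one value per word, with "&"-joined values matching the alternative pronunciations. The name parse_confidence is hypothetical.

```python
def parse_confidence(block):
    """Parse one confidence-annotated result into words and boundary scores.

    Hypothetical helper; the meaning assigned to each line is an assumption
    inferred from the sample output, and default separators are assumed.
    """
    annotation, boundary_line, tag_line = block.strip().split("\n")[:3]
    tokens = annotation.split(" ")
    tag_scores = tag_line.split(" ")
    if len(tokens) != len(tag_scores):
        raise ValueError("expected one tag confidence entry per word")
    words = []
    for token, scores in zip(tokens, tag_scores):
        word, candidates = token.split("/", 1)
        # Pair each candidate pronunciation with its confidence value.
        pairs = list(zip(candidates.split("&"), map(float, scores.split("&"))))
        words.append((word, pairs))
    boundaries = [float(value) for value in boundary_line.split(" ")]
    return words, boundaries


sample = """コーパス/ko:pasu の/no 文/buN&moN&fumi で/de す/su 。/.
3.18908 1.7448 3.91682 2.57838 2.23258 1.28151 2.6298 1.98738
100 100 0.309393&-1.36203e-17&-0.348795 100 100 100"""

words, boundaries = parse_confidence(sample)
for word, pairs in words:
    print(word, pairs)
```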
Last Modified: 2010-5-11 by neubig