KyTea's IO Formats

Return to KyTea

KyTea can handle the five IO formats described below. The format of the input and output can be specified during both training and analysis.

Full Annotation

Full annotation is the traditional method for annotating word segmentations, by putting a space between every word. When pronunciations are annotated as well, a slash must be placed between the word and its pronunciation.

コーパス/ko:pasu の/no 文/buN で/de す/su 。/.

Here, words are separated by spaces, and tags are separated by slashes. These separators can be changed with the "-wordbound" and "-tagbound" options. Also note that in the output, words for which no pronunciation could be generated are annotated with "/UNK" by default. This can be changed using the "-deftag" option. In addition, because KyTea can generate pronunciations using an unknown word model, all unknown words (whether they have a pronunciation or not) can be marked using the "-unktag" option.
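As an illustration of the format, the following minimal Python sketch splits a fully annotated line into words and their tags. The function name parse_full_annotation is hypothetical and not part of KyTea; the sketch assumes the default separators (a space and a slash) and does not handle words or tags that themselves contain a separator character.

```python
def parse_full_annotation(line, wordbound=" ", tagbound="/"):
    """Split a fully annotated line into (word, [tags]) pairs.

    Hypothetical helper, not part of KyTea. Assumes that neither
    separator appears inside a word or a tag.
    """
    pairs = []
    for token in line.split(wordbound):
        if not token:
            continue  # ignore empty tokens caused by repeated separators
        word, *tags = token.split(tagbound)
        pairs.append((word, tags))
    return pairs


line = "コーパス/ko:pasu の/no 文/buN で/de す/su 。/."
for word, tags in parse_full_annotation(line):
    print(word, tags)
```

If the separators were changed with "-wordbound" or "-tagbound", the same characters would have to be passed to the function.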

Tokenized Data

Tokenized data is identical to fully annotated data, but does not use any tags.

コーパス の 文 で す 。
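The relationship between the two formats can be sketched in a few lines of Python: dropping everything after the first tag separator in each token turns a fully annotated line into a tokenized one. The function name to_tokenized is hypothetical, and the default separators are assumed.

```python
def to_tokenized(full_line, wordbound=" ", tagbound="/"):
    """Strip the tags from a fully annotated line, keeping only the words.

    Hypothetical helper, not part of KyTea. Assumes that the tag
    separator never appears inside a word.
    """
    words = [tok.split(tagbound)[0] for tok in full_line.split(wordbound) if tok]
    return wordbound.join(words)


full = "コーパス/ko:pasu の/no 文/buN で/de す/su 。/."
print(to_tokenized(full))  # コーパス の 文 で す 。
```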

Partial Annotation

Partial annotation places one of three tags between each pair of adjacent characters: the "word boundary exists (-hasbound)" tag "|", the "word boundary doesn't exist (-nobound)" tag "-", and the "existence of a word boundary is unknown (-unkbound)" tag " " (single space). When performing pronunciation estimation, a pronunciation can be added after a slash. In the example below, the first sentence is completely annotated, while the second demonstrates an annotation where not all of the boundaries have been annotated.

コ-ー-パ-ス/ko:pasu|の/no|文/buN|で/de|す/su|。/.
境-界|未 知 の 文|で す 。

In addition, to allow annotators to skip word boundaries they are unsure about, KyTea also has a "skipped boundary (-skipbound)" tag, which functions identically to the unknown boundary tag.
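To make the character-by-character layout concrete, here is a minimal Python sketch that reads a partially annotated line into a list of characters, a list of boundary decisions (True for "boundary exists", False for "no boundary", None for "unknown"), and any pronunciations found. The function name parse_partial is hypothetical, the default tag characters described above are assumed, the skipped boundary tag is not handled, and characters of the text that coincide with a tag character are not supported.

```python
def parse_partial(line, hasbound="|", nobound="-", unkbound=" ", tagbound="/"):
    """Parse a partially annotated line.

    Hypothetical helper, not part of KyTea. Returns (chars, bounds, tags):
      chars  - the characters of the sentence
      bounds - one entry per gap between characters: True, False or None
      tags   - maps the index of a word-final character to its pronunciation
    """
    labels = {hasbound: True, nobound: False, unkbound: None}
    chars, bounds, tags = [], [], {}
    i, n = 0, len(line)
    while i < n:
        chars.append(line[i])
        i += 1
        # An optional pronunciation runs up to the next boundary tag.
        if i < n and line[i] == tagbound:
            j = i + 1
            while j < n and line[j] not in (hasbound, unkbound):
                j += 1
            tags[len(chars) - 1] = line[i + 1:j]
            i = j
        # The next symbol, if any, is the boundary tag for this gap.
        if i < n:
            if line[i] not in labels:
                raise ValueError(f"unexpected symbol {line[i]!r} at position {i}")
            bounds.append(labels[line[i]])
            i += 1
    return chars, bounds, tags


chars, bounds, tags = parse_partial("境-界|未 知 の 文|で す 。")
print("".join(chars))
print(bounds)
```

For a sentence of n characters the sketch yields n-1 boundary decisions, one per gap between adjacent characters.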

Raw Corpus

The default input for the analyzer is a raw corpus with no annotation.

コーパスの文です。

Confidence Annotation

KyTea is also able to output results annotated with confidence measures. For models using SVMs, this confidence measure indicates the distance from the SVM separating hyperplane. For models using logistic regression, this confidence measure is the probability of the decision. Regardless of the type of classifier, the confidence measures for unknown words are probabilities, as unknown word pronunciations are estimated probabilistically. For confidence-annotated output, four lines are output for every line of input.

  1. Line 1: Outputs words and pronunciations in a format similar to that of full annotation. For words that have multiple possible pronunciations, every possible pronunciation is output in order of descending confidence, separated by the "element boundary tag (-elembound)," here "&".
  2. Line 2: Outputs the confidence measure of the word segmentation. The ith number on this line is the confidence of the decision about whether to insert a word boundary between characters i and i+1. For example, in the example below, the first confidence value "3.18908" is associated with the gap between the first two characters "コ" and "ー".
  3. Line 3: Outputs the confidence measure of pronunciation estimation. When there are multiple possible pronunciations, the confidence of each pronunciation is output, separated by "&".
  4. Line 4: An empty line. This was added in version 0.1.2.

コーパス/ko:pasu の/no 文/buN&moN&fumi で/de す/su 。/.
3.18908 1.7448 3.91682 2.57838 2.23258 1.28151 2.6298 1.98738
100 100 0.309393&-1.36203e-17&-0.348795 100 100 100
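The three non-empty lines of such a block can be lined up with each other as in the following minimal Python sketch, which pairs each pronunciation candidate with its confidence and collects the segmentation confidences. The function name parse_confidence_block is hypothetical, the default separators (space, "/" and "&") are assumed, and the sketch makes no attempt to interpret what the individual values mean.

```python
def parse_confidence_block(lines, wordbound=" ", tagbound="/", elembound="&"):
    """Parse the first three lines of a confidence-annotated block.

    Hypothetical helper, not part of KyTea. Returns (words, seg_conf):
      words    - list of (word, [(pronunciation, confidence), ...])
      seg_conf - one confidence per gap between adjacent characters
    """
    word_line, seg_line, tag_line = lines[:3]
    seg_conf = [float(x) for x in seg_line.split()]
    tag_conf = [[float(c) for c in item.split(elembound)]
                for item in tag_line.split()]
    words = []
    for token, confs in zip(word_line.split(wordbound), tag_conf):
        word, _, candidates = token.partition(tagbound)
        words.append((word, list(zip(candidates.split(elembound), confs))))
    return words, seg_conf


block = [
    "コーパス/ko:pasu の/no 文/buN&moN&fumi で/de す/su 。/.",
    "3.18908 1.7448 3.91682 2.57838 2.23258 1.28151 2.6298 1.98738",
    "100 100 0.309393&-1.36203e-17&-0.348795 100 100 100",
]
words, seg_conf = parse_confidence_block(block)
for word, candidates in words:
    print(word, candidates)
```

In the example, the nine characters of the sentence give eight segmentation confidences, and the six words give six entries on the pronunciation line.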

Last Modified: 2010-5-11 by neubig