Japanese Parallel Data

This is a list of data that can be used for creating machine translation systems to and from Japanese. It focuses on Japanese-English, but at the bottom there is information on data sets for Japanese aligned with other languages as well. If I am missing any data, please tell me! If you want a general-purpose list of parallel texts, there are several others: 1, 2, 3.


Parallel Corpora

These corpora contain parallel sentences that can be used for training statistical machine translation systems. Each entry lists the name, link, size, whether it can be used for research or commercial purposes, and a description. The list is limited to corpora of over 100k sentences, or ones that are particularly interesting for some reason.

Name | Size | Research | Commercial | Description
TAUS Memory | ~9.0M | Pricing | Pricing | Data from translation memories of commercial companies. Lots from the computer software/hardware manual domain.
JParaCrawl | 8.7M | Free | Contact | Crawled data from the web.
Japanese-English Subtitle Corpus | 3.2M | Free | Yes? | A corpus of parallel Japanese-English subtitles.
NTCIR PatentMT | 3.0M | Free (w/ Contract) | No | A large corpus of parallel patents.
ASPEC | 3.0M | Free (w/ Contract) | No | A corpus of parallel scientific abstracts.
BTEC | 700k | No | No | A corpus of travel conversation sentences. This corpus has been used in research workshops in the past, but is no longer available.
Kyoto Wiki (KFTT) | 443k | Free | Free? | Wikipedia articles about Kyoto manually translated by NICT.
Eijiro Sentences | ~420k | ~2,000 yen/license | Quote | A large collection of example sentences from the Eijiro dictionary.
Open Source | 402k | Free | Free? | Translations of manuals from open source software.
JaEn Legal Corpus | 260k | Free | Free? | Japanese laws translated into English.
JENAAD | 150k | Free (w/ Contract) | No | Japanese-English news article parallel data (somewhat noisy).
Tanaka Corpus | 150k | Free | Free | A corpus of sentences collected by Japanese students of English (somewhat noisy).
Novels | 107k | Free | Free | Novels from Project Gutenberg and Aozora-bunko.
TED Talks | 100k | Free | No | Translated subtitles of talks from TED.
Reuters | 57k | Free | No | Aligned sentences from the Reuters newsfeed.
Japanese SemCor (data) | 15k | Free | Free | A corpus annotated with word senses from WordNet and Japanese WordNet. It can be used together with SemCor as parallel data.
Basic Expressions | 5k | Free | Free | A corpus in Japanese-English-Chinese covering very common expressions and grammatical structures in these languages.
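Many parallel corpora are distributed as two line-aligned text files, one sentence per line in each language. The sketch below shows one way to read such a pair and count the usable sentence pairs. The line-aligned layout is an assumption and not a description of any particular corpus above, so check each corpus's own documentation for its actual format. The function names and sample sentences are made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def read_parallel(
    ja_lines: Iterable[str], en_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (ja, en) pairs from two line-aligned sources.

    Assumes line i of one source translates line i of the other.
    Pairs where either side is empty after stripping are skipped.
    """
    for ja, en in zip(ja_lines, en_lines):
        ja, en = ja.strip(), en.strip()
        if ja and en:
            yield ja, en


def count_pairs(ja_lines: Iterable[str], en_lines: Iterable[str]) -> int:
    """Count the non-empty sentence pairs."""
    return sum(1 for _ in read_parallel(ja_lines, en_lines))


# Illustrative in-memory data; real corpora would be read from files.
sample_ja = ["こんにちは。\n", "\n", "ありがとう。\n"]
sample_en = ["Hello.\n", "(no source sentence)\n", "Thank you.\n"]

for ja, en in read_parallel(sample_ja, sample_en):
    print(ja, "|||", en)
print("pairs:", count_pairs(sample_ja, sample_en))
```

With real data, the two lists would be replaced by open file handles (for example, hypothetical files `train.ja` and `train.en` opened with `encoding="utf-8"`), since file objects iterate line by line.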

Parallel Dictionaries

These are parallel dictionaries that can be used for training statistical machine translation systems.

Name | Size | Research | Commercial | Description
Eijiro | 2.0M | ~2,000 yen/license | Quote | An extremely extensive dictionary of English→Japanese words/phrases.
Wikipedia Links | ~400k | Free | Free? | Wikipedia has links between languages which can also be used as a dictionary.
EDICT | 150k | Free | Free | A fairly extensive Japanese→English dictionary. There are also sub-dictionaries for various topics.
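As a rough illustration of how Wikipedia's inter-language links can be turned into dictionary entries, the sketch below builds a query for the MediaWiki Action API (`action=query` with `prop=langlinks`) and extracts (Japanese title, English title) pairs from a response in the default JSON layout. The helper names and the sample response are made up for illustration, and no network request is made here. For a full dictionary, bulk processing of the database dumps is likely more practical than per-title API queries.

```python
from urllib.parse import urlencode

API_URL = "https://ja.wikipedia.org/w/api.php"


def build_query_url(titles, target_lang="en"):
    """Build a langlinks query URL for the given Japanese page titles."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": "|".join(titles),  # multiple titles are separated by "|"
        "lllang": target_lang,       # only return links to this language
        "lllimit": "max",
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def extract_pairs(response, target_lang="en"):
    """Return sorted (source_title, target_title) pairs from a parsed response.

    Expects the default JSON layout, in which "pages" is a dict keyed by
    page ID and each language link stores its title under the "*" key.
    """
    pairs = []
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for link in page.get("langlinks", []):
            if link.get("lang") == target_lang:
                pairs.append((page["title"], link["*"]))
    return sorted(pairs)


# Hand-written sample in the shape of an API response (illustrative only).
sample_response = {
    "query": {
        "pages": {
            "1": {"pageid": 1, "title": "京都市",
                  "langlinks": [{"lang": "en", "*": "Kyoto"}]},
            "2": {"pageid": 2, "title": "リンクのないページ"},
        }
    }
}

print(build_query_url(["京都市"]))
print(extract_pairs(sample_response))
```

Pages without a link to the target language simply produce no entry, which is one reason the resulting dictionary covers only part of the vocabulary.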

Manually Aligned Data

This data contains manually created word-aligned sentences for training or testing word alignment toolkits.

Name | Size | Research | Commercial | Description
KFTT | 1.2k | Free | Free | Hand-aligned articles from the Kyoto Wikipedia data.


These are resources that are available for translation from Japanese to other languages.

Parallel Corpora

Name | Size | Research | Commercial | Description
ASPEC | 672k | Free (w/ Contract) | No | A Chinese-Japanese corpus of scientific abstracts.
TED Talks | 100k | Free | No | Translated subtitles of talks from TED between 33 languages.

Parallel Dictionaries

Name | Size | Research | Commercial | Description
Wikipedia Links | 50k-400k | Free | Free? | Wikipedia has links between languages which can also be used as a dictionary for a number of languages.
JaLexBD | 16k | Free | Free | A dictionary of common nouns between Japanese and French.