Japanese Parallel Data

This is a list of data that can be used for creating machine translation systems to and from Japanese. It focuses on Japanese-English, but at the bottom there is info on data sets for Japanese aligned with other languages as well. If I am missing any data, please tell me! If you want a general-purpose list of parallel texts, there are several others: 1 2 3.

Japanese-English

Parallel Corpora

These corpora contain parallel sentences that can be used for training statistical machine translation systems. Each entry has the name, link, size, whether it can be used for research or commercial purposes, and a description. The list is limited to corpora of over 100k sentences, or ones that are particularly interesting for some reason.

Name	Size	Research	Commerce	Description
TAUS Memory	~9.0M	Pricing		Data from translation memories of commercial companies. Lots from the computer software/hardware manual domain.
JParaCrawl	8.7M	Free	Contact	Crawled data from the web.
Japanese-English Subtitle Corpus	3.2M	Free	Yes?	A corpus of parallel Japanese-English subtitles.
NTCIR PatentMT	3.0M	Free (w/ Contract)	No	A large corpus of parallel patents.
ASPEC	3.0M	Free (w/ Contract)	No	A corpus of parallel scientific abstracts.
BTEC	700k	No		A corpus of travel conversation sentences. This corpus has been used in research workshops in the past, but is no longer available.
Kyoto Wiki (KFTT)	443k	Free	Free?	Wikipedia articles about Kyoto manually translated by NICT.
Eijiro Sentences	~420k	~2,000 yen/license	Quote	A large collection of example sentences from the Eijiro dictionary.
Open Source	402k	Free	Free?	Translations of manuals from open source software.
JaEn Legal Corpus	260k	Free	Free?	Japanese laws translated into English.
JENAAD	150k	Free (w/ Contract)	No	Japanese-English news article parallel data (somewhat noisy).
Tanaka Corpus	150k	Free		A corpus of sentences collected by Japanese students of English (somewhat noisy).
Novels	107k	Free	Free	Novels from Project Gutenberg and Aozora Bunko.
TED Talks	100k	Free	No	Translated subtitles of talks from TED.
Reuters	57k	Free	No	Aligned sentences from the Reuters newsfeed.
Japanese SemCor (data)	15k	Free		A corpus annotated with word senses from WordNet and Japanese WordNet. It can be used together with SemCor as parallel data.
Basic Expressions	5k	Free		A corpus in Japanese-English-Chinese covering very common expressions and grammatical structures in these languages.
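
As a rough illustration of how parallel sentences like these are typically consumed, here is a minimal sketch that pairs up two line-aligned sides of a corpus. It assumes the data comes as one sentence per line, with line N of the Japanese side translating line N of the English side. That is an assumption: several of the corpora above ship in other formats and need converting first. The function name `read_parallel` and the sample sentences are made up for this example.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def read_parallel(ja_lines: Iterable[str],
                  en_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (japanese, english) pairs from two line-aligned iterables.

    Hypothetical helper: assumes one sentence per line on each side.
    """
    for ja, en in zip_longest(ja_lines, en_lines):
        # A missing line on either side means the alignment is broken.
        if ja is None or en is None:
            raise ValueError("the two sides have different numbers of lines")
        ja, en = ja.strip(), en.strip()
        # Drop pairs where either side is empty.
        if ja and en:
            yield ja, en


# Made-up sample data standing in for two corpus files.
ja_side = ["これはペンです。\n", "\n", "ありがとう。\n"]
en_side = ["This is a pen.\n", "Untranslated line.\n", "Thank you.\n"]

pairs = list(read_parallel(ja_side, en_side))
for ja, en in pairs:
    print(ja, "|||", en)
```

With real data, the two lists would be replaced by open file handles (opened with `encoding="utf-8"`), which are also iterables of lines.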

Parallel Dictionaries

These are parallel dictionaries that can be used for training statistical machine translation systems.

Name	Size	Research	Commerce	Description
Eijiro	2.0M	~2,000 yen/license	Quote	An extremely extensive dictionary of English→Japanese words/phrases.
Wikipedia Links	~400k	Free	Free?	Wikipedia has links between languages which can also be used as a dictionary.
EDICT	150k	Free		A fairly extensive Japanese→English dictionary. There are also sub-dictionaries for various topics.

Manually Aligned Data

This data contains manually word-aligned sentences for training or testing word alignment toolkits.

Name	Size	Research	Commerce	Description
KFTT	1.2k	Free	Free	Hand-aligned articles from the Kyoto Wikipedia data.

Japanese-Other

These are resources that are available for translation from Japanese to other languages.

Parallel Corpora

Name	Size	Research	Commerce	Description
ASPEC	672k	Free (w/ Contract)	No	A Chinese-Japanese corpus of scientific abstracts.
TED Talks	100k	Free	No	Translated subtitles of talks from TED in 33 languages.

Parallel Dictionaries

Name	Size	Research	Commerce	Description
Wikipedia Links	50k-400k	Free	Free?	Wikipedia has links between languages which can also be used as a dictionary for a number of languages.
JaLexBD	16k	Free		A dictionary of common nouns between Japanese and French.