This is a list of data that can be used for creating machine translation systems to and from Japanese. It focuses on Japanese-English, but at the bottom there is information on data sets for Japanese aligned with other languages as well. If I am missing any data, please tell me! If you want a general-purpose list of parallel texts, there are several others: 1 2 3.
These corpora contain parallel sentences that can be used for training statistical machine translation systems. Each entry has the name, link, size, whether it can be used for research or commercial purposes, and description. These are limited to corpora of over 100k sentences, or ones that are particularly interesting for some reason.
|TAUS Memory||~9.0M||Pricing||Data from translation memories of commercial companies. Lots from the computer software/hardware manual domain.|
|JParaCrawl||8.7M||Free||Contact||Crawled data from the web.|
|Japanese-English Subtitle Corpus||3.2M||Free||Yes?||A corpus of parallel Japanese-English subtitles.|
|NTCIR PatentMT||3.0M||Free (w/ Contract)||No||A large corpus of parallel patents.|
|ASPEC||3.0M||Free (w/ Contract)||No||A corpus of parallel scientific abstracts.|
|BTEC||700k||No||A corpus of travel conversation sentences. This corpus has been used in research workshops in the past, but is now not available.|
|Kyoto Wiki (KFTT)||443k||Free||Free?||Wikipedia articles about Kyoto manually translated by NICT.|
|Eijiro Sentences||~420k||~2,000 yen/license||Quote||A large collection of example sentences from the Eijiro dictionary.|
|Open Source||402k||Free||Free?||Translations of manuals from open source software.|
|JaEn Legal Corpus||260k||Free||Free?||Japanese laws translated into English.|
|JENAAD||150k||Free (w/ Contract)||No||Japanese-English news article parallel data (somewhat noisy).|
|Tanaka Corpus||150k||Free||A corpus of sentences collected by Japanese students of English (somewhat noisy).|
|Novels||107k||Free||Free||Novels from Project Gutenberg and Aozora-bunko.|
|TED Talks||100k||Free||No||Translated subtitles of talks from TED.|
|Reuters||57k||Free||No||Aligned sentences from the Reuters newsfeed.|
|Japanese SemCor (data)||15k||Free||A corpus annotated with word senses from WordNet and Japanese WordNet. It can be used together with SemCor as parallel data.|
|Basic Expressions||5k||Free||A corpus in Japanese-English-Chinese covering very common expressions and grammatical structures in these languages.|
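Several of the corpora above are marked as somewhat noisy, so it is common to drop empty or badly mismatched sentence pairs before training. The following is only a minimal sketch of that kind of filter, not a method prescribed by any of these corpora. It assumes the data is stored as two line-aligned sequences (line i of one side translates line i of the other) and that the Japanese side is already split into space-separated words. The function name `filter_parallel` and the thresholds `max_len` and `max_ratio` are illustrative choices of mine.

```python
from typing import Iterable, Iterator, Tuple


def filter_parallel(
    src_lines: Iterable[str],
    tgt_lines: Iterable[str],
    max_len: int = 80,       # illustrative threshold, not from any corpus
    max_ratio: float = 9.0,  # illustrative threshold, not from any corpus
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs that pass simple length checks.

    Lengths are counted in whitespace-separated tokens, so Japanese text
    is assumed to be word-segmented already.
    """
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop pairs where either side is empty.
        if n_src == 0 or n_tgt == 0:
            continue
        # Drop pairs where either side is too long.
        if n_src > max_len or n_tgt > max_len:
            continue
        # Drop pairs whose lengths differ by too large a factor.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        yield src, tgt


ja = ["これ は ペン です 。", "", "はい"]
en = ["This is a pen .", "empty source", "yes " * 20]
pairs = list(filter_parallel(ja, en))
```

Because the function only needs two iterables of strings, open file handles for the two sides of a corpus can be passed in directly instead of the in-memory lists used here.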
These are parallel dictionaries that can be used for training statistical machine translation systems.
|Eijiro||2.0M||~2,000 yen/license||Quote||An extremely extensive dictionary of English→Japanese words/phrases.|
|Wikipedia Links||~400k||Free||Free?||Wikipedia has links between languages which can also be used as a dictionary.|
|EDICT||150k||Free||A fairly extensive Japanese→English dictionary. There are also sub-dictionaries for various topics.|
Manually Aligned Data
This data contains manually created word-aligned sentences for training or testing word alignment toolkits.
|KFTT||1.2k||Free||Free||Hand-aligned articles from the Kyoto Wikipedia data.|
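As a rough illustration of how hand-aligned data can be used to test an aligner, the sketch below scores predicted alignments against gold ones with precision, recall, and F-measure. It assumes each sentence's alignment is written as space-separated `i-j` index pairs on one line; that format is an assumption here, so check the actual files before relying on it. The function names `parse_alignment` and `alignment_prf` are hypothetical helpers, not part of any toolkit.

```python
from typing import Iterable, Set, Tuple

Link = Tuple[int, int]


def parse_alignment(line: str) -> Set[Link]:
    """Parse one line such as "0-0 1-2" into a set of (i, j) links.

    The "i-j" format is assumed, not confirmed for any specific data set.
    """
    links = set()
    for tok in line.split():
        i, j = tok.split("-")
        links.add((int(i), int(j)))
    return links


def alignment_prf(
    gold_lines: Iterable[str], pred_lines: Iterable[str]
) -> Tuple[float, float, float]:
    """Return corpus-level (precision, recall, F-measure) over all links."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold_lines, pred_lines):
        gold, pred = parse_alignment(g), parse_alignment(p)
        tp += len(gold & pred)   # links present in both gold and prediction
        n_gold += len(gold)
        n_pred += len(pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


gold = ["0-0 1-1 2-2"]
pred = ["0-0 1-2 2-2"]
precision, recall, f_measure = alignment_prf(gold, pred)
```

In this toy case two of the three predicted links match the gold links, so precision and recall are both two thirds. Counts are pooled over the whole file rather than averaged per sentence, so long sentences weigh more than short ones.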
These are resources that are available for translation from Japanese to other languages.
|ASPEC||672k||Free (w/ Contract)||No||A Chinese-Japanese corpus of scientific abstracts.|
|TED Talks||100k||Free||No||Translated subtitles of talks from TED between 33 languages.|
|Wikipedia Links||50k-400k||Free||Free?||Wikipedia has links between languages which can also be used as a dictionary for a number of languages.|
|JaLexBD||16k||Free||A dictionary of common nouns between Japanese and French.|