Wikipedia Chinese-Japanese Parallel Corpus
Update History
- 2015: Added parallel fragments to the corpus
- 2014: Initial release, containing parallel sentences
Description
This resource is a Chinese-Japanese parallel corpus automatically extracted from Wikipedia. It contains both parallel sentences and parallel fragments.
- Description of the files (a loading sketch follows this list):
- sentence.zh & sentence.ja:
- 126,811 parallel sentences automatically extracted from Wikipedia, which can be used as training data for machine translation (MT).
- fragment.zh & fragment.ja:
- 131,509 parallel fragments automatically extracted from Wikipedia, which can be used as training data for MT.
- dev.zh & dev.ja:
- 198 manually selected parallel sentences, which can be used as tuning (development) data for MT.
- test.zh & test.ja:
- 198 manually selected parallel sentences, which can be used as test data for MT.
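The paired .zh and .ja files presumably follow the usual parallel-corpus convention of line alignment, i.e. line i of a .zh file is the translation of line i of the corresponding .ja file; this page does not state the file format explicitly, so the convention, one-segment-per-line layout, and UTF-8 encoding assumed below are assumptions. A minimal loading sketch under those assumptions:

```python
# Minimal sketch for reading a pair of line-aligned parallel files.
# Assumptions (not stated on this page): one segment per line,
# UTF-8 encoding, and line i of the .zh file aligned with line i
# of the .ja file.
from pathlib import Path

def load_parallel(zh_path: str, ja_path: str) -> list[tuple[str, str]]:
    """Return (Chinese, Japanese) sentence pairs from a file pair."""
    zh_lines = Path(zh_path).read_text(encoding="utf-8").splitlines()
    ja_lines = Path(ja_path).read_text(encoding="utf-8").splitlines()
    if len(zh_lines) != len(ja_lines):
        raise ValueError("parallel files must have the same number of lines")
    return list(zip(zh_lines, ja_lines))

if __name__ == "__main__":
    pairs = load_parallel("sentence.zh", "sentence.ja")
    print(f"{len(pairs)} sentence pairs loaded")  # expected: 126,811
    zh, ja = pairs[0]
    print("zh:", zh)
    print("ja:", ja)
```

The same function works for the fragment, dev, and test file pairs, since all four share the paired-file layout.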
License
Please refer to Wikipedia's copyright policy when using this resource: http://en.wikipedia.org/wiki/Wikipedia:Copyrights
Note that we do not endorse any particular use of this resource and shall not be held responsible or liable for damages resulting from its inappropriate use.
Download
wiki_zh_ja_tallip2015.tgz
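A minimal sketch for unpacking the archive with Python's standard library, assuming wiki_zh_ja_tallip2015.tgz has already been downloaded from this page (the directory layout inside the archive is not documented here):

```python
# Unpack the corpus archive; the member paths inside it are an
# assumption, so list the contents before extracting.
import tarfile

with tarfile.open("wiki_zh_ja_tallip2015.tgz", "r:gz") as tar:
    tar.list()                     # show what the archive contains
    tar.extractall("wiki_zh_ja")   # unpack into a local directory
```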
References
- Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi.
Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: A Case Study on Chinese-Japanese Wikipedia,
ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Vol.15, No.2, pp.10:1-10:22, (2015.12).
- Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi.
Constructing a Chinese-Japanese Parallel Corpus from Wikipedia,
In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp.642-647, Reykjavik, Iceland, (2014.5).
Contact and Bug Reports
MAIL: nl-resource at nlp.ist.i.kyoto-u.ac.jp