Wikipedia Chinese-Japanese Parallel Corpus
Update History
- 2015: Added parallel fragments to the corpus
- 2014: Initial release, containing parallel sentences
Description
This resource is a Chinese-Japanese parallel corpus automatically extracted from Wikipedia. It contains both parallel sentences and parallel fragments.
- Description of the files (a loading sketch follows this list):
- sentence.zh & sentence.ja:
- 126,811 parallel sentences automatically extracted from Wikipedia, which can be used as training data for machine translation (MT).
- fragment.zh & fragment.ja:
- 131,509 parallel fragments automatically extracted from Wikipedia, which can be used as training data for MT.
- dev.zh & dev.ja:
- 198 manually selected parallel sentences, which can be used as tuning (development) data for MT.
- test.zh & test.ja:
- 198 manually selected parallel sentences, which can be used as test data for MT.
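The paired .zh and .ja files presumably follow the usual parallel-corpus convention of line alignment, i.e. line i of a .zh file is the translation of line i of the corresponding .ja file; this page does not state the file format explicitly, so the convention, one-segment-per-line layout, and UTF-8 encoding assumed below are assumptions. A minimal loading sketch under those assumptions:

```python
# Minimal sketch for reading a pair of line-aligned parallel files.
# Assumptions (not stated on this page): one segment per line,
# UTF-8 encoding, and line i of the .zh file aligned with line i
# of the .ja file.
from pathlib import Path

def load_parallel(zh_path: str, ja_path: str) -> list[tuple[str, str]]:
    """Return (Chinese, Japanese) sentence pairs from a file pair."""
    zh_lines = Path(zh_path).read_text(encoding="utf-8").splitlines()
    ja_lines = Path(ja_path).read_text(encoding="utf-8").splitlines()
    if len(zh_lines) != len(ja_lines):
        raise ValueError("parallel files must have the same number of lines")
    return list(zip(zh_lines, ja_lines))

if __name__ == "__main__":
    pairs = load_parallel("sentence.zh", "sentence.ja")
    print(f"{len(pairs)} sentence pairs loaded")  # expected: 126,811
    zh, ja = pairs[0]
    print("zh:", zh)
    print("ja:", ja)
```

The same function works for the fragment, dev, and test file pairs, since all four share the paired-file layout.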
License
Please refer to Wikipedia's copyright policy when using this resource: http://en.wikipedia.org/wiki/Wikipedia:Copyrights
Note that we do not endorse any particular use of this resource and shall not be held responsible or liable for damages resulting from its inappropriate use.
Download
wiki_zh_ja_tallip2015.tgz
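A minimal sketch for unpacking the archive with Python's standard library, assuming wiki_zh_ja_tallip2015.tgz has already been downloaded from this page (the directory layout inside the archive is not documented here):

```python
# Unpack the corpus archive; the member paths inside it are an
# assumption, so list the contents before extracting.
import tarfile

with tarfile.open("wiki_zh_ja_tallip2015.tgz", "r:gz") as tar:
    tar.list()                     # show what the archive contains
    tar.extractall("wiki_zh_ja")   # unpack into a local directory
```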
References
- Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi.
Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: A Case Study on Chinese-Japanese Wikipedia,
ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Vol.15, No.2, pp.10:1-10:22, (2015.12).
- Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi.
Constructing a Chinese-Japanese Parallel Corpus from Wikipedia,
In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp.642-647, Reykjavik, Iceland, (2014.5).
Contact and Bug Reports
MAIL: nl-resource at nlp.ist.i.kyoto-u.ac.jp