Coursera Parallel Corpus - LANGUAGE MEDIA PROCESSING LAB

OpenCourseWare Parallel Corpus / Coursera Parallel Corpus

Update history

2021 Added Chinese-English parallel dataset
2020 Added Japanese-English parallel dataset

Description

This resource contains Japanese-English and Chinese-English parallel datasets extracted from Coursera. The text is spoken language from the educational domain. The training sets are automatically aligned, while the dev and test sets are manually checked.

Description of the files in the Japanese-English dataset:
- train.ja & train.en:
  - 40,770 parallel sentences extracted and aligned from Coursera, which could be used as training data for machine translation (MT).
- dev.ja & dev.en:
  - 541 manually checked parallel sentences, which could be used as tuning data for MT.
- test.ja & test.en:
  - 2,005 manually checked parallel sentences, which could be used as testing data for MT.

Description of the files in the Chinese-English dataset:
- train.zh & train.en:
  - 40,074 parallel sentences extracted and aligned from Coursera, which could be used as training data for machine translation (MT).
- dev.zh & dev.en:
  - 865 manually checked parallel sentences, which could be used as tuning data for MT.
- test.zh & test.en:
  - 2,009 manually checked parallel sentences, which could be used as testing data for MT.
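
The file pairs listed above can be read into sentence pairs with a short Python sketch such as the following. It assumes that each file is UTF-8 plain text with one sentence per line and that line i of one file corresponds to line i of the other; this page does not state the file format, so treat both as assumptions. The helper names `read_lines` and `read_parallel` are hypothetical and not part of the corpus distribution.

```python
def read_lines(path, encoding="utf-8"):
    """Read a text file and return its lines without line terminators.

    newline="\n" keeps Python from splitting on other separators,
    which could otherwise shift the alignment between the two files.
    """
    with open(path, encoding=encoding, newline="\n") as f:
        return [line.rstrip("\r\n") for line in f]


def read_parallel(src_path, tgt_path, encoding="utf-8"):
    """Return a list of (source, target) sentence pairs.

    Assumes the two files are line-aligned (an assumption, see the lead-in).
    Raises ValueError if the files have different numbers of lines.
    """
    src = read_lines(src_path, encoding)
    tgt = read_lines(tgt_path, encoding)
    if len(src) != len(tgt):
        raise ValueError(
            f"line count mismatch: {src_path} has {len(src)} lines, "
            f"{tgt_path} has {len(tgt)} lines"
        )
    return list(zip(src, tgt))
```

For example, `read_parallel("train.ja", "train.en")` would return the Japanese-English training pairs; if the assumptions hold, the length of the result should match the sentence count given above for that split. The explicit length check is there so that a truncated or mismatched download fails loudly instead of silently producing misaligned pairs.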

Sample

Japanese-English parallel sentences

   Ja: 誰かに何をすべきか言うのか、 私がこれをしたら起こったことについて言うのかの違いです。
   En: It's the difference between telling someone what to do versus saying this is what happened when I did this.

Chinese-English parallel sentences

   Zh: 如今, 云计算包括虚拟化数据中心, 虚拟机和应用程序编程接口。
   En: Today, cloud computing involves virtualized datacenters, virtual machines and APIs.

Reference

Haiyue Song, Raj Dabre, Atsushi Fujita and Sadao Kurohashi. Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pp. 3640–3649, Marseille, France, May 2020.

bib

   @inproceedings{song-etal-2020-coursera,
       title = "{C}oursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation",
       author = "Song, Haiyue and
         Dabre, Raj and
         Fujita, Atsushi and
         Kurohashi, Sadao",
       booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
       month = may,
       year = "2020",
       address = "Marseille, France",
       publisher = "European Language Resources Association",
       url = "https://www.aclweb.org/anthology/2020.lrec-1.449",
       pages = "3640--3649",
       language = "English",
       ISBN = "979-10-95546-34-4",
   }

Contact

song AT nlp.ist.i.kyoto-u.ac.jp
