OpenCourseWare Parallel Corpus / Coursera Parallel Corpus

Update history

Description

This resource contains Japanese-English and Chinese-English parallel datasets extracted from Coursera. The text is spoken language from the educational domain. The training set is aligned automatically; the development and test sets are manually evaluated.

Sample

   Ja: 誰かに何をすべきか言うのか、 私がこれをしたら起こったことについて言うのかの違いです。
   En: It's the difference between telling someone what to do versus saying this is what happened when I did this.
   Zh: 如今, 云计算包括虚拟化数据中心, 虚拟机和应用程序编程接口。
   En: Today, cloud computing involves virtualized datacenters, virtual machines and APIs.

Download

Japanese-English:

https://github.com/shyyhs/CourseraParallelCorpusMining/blob/master/data/Coursera_En-Ja.zip

Chinese-English:

https://github.com/shyyhs/CourseraParallelCorpusMining/blob/master/data/Coursera_En-Zh.zip
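
The links above point to GitHub "blob" pages, which display a file in the browser rather than returning its bytes. A minimal sketch of fetching an archive from a script follows. It assumes that GitHub's usual `/raw/` URL form serves these files directly; the helper names `blob_to_raw_url` and `list_archive_members` are hypothetical and not part of the repository. The internal layout of the archives is not described here, so the sketch only lists the member names instead of assuming any file names.

```python
import io
import urllib.request
import zipfile

# Download link for the Japanese-English data, copied from the list above.
JA_EN_BLOB_URL = (
    "https://github.com/shyyhs/CourseraParallelCorpusMining"
    "/blob/master/data/Coursera_En-Ja.zip"
)


def blob_to_raw_url(blob_url: str) -> str:
    """Turn a GitHub 'blob' page URL into the matching 'raw' download URL."""
    # Only the first "/blob/" segment is rewritten; the rest of the path is kept.
    return blob_url.replace("/blob/", "/raw/", 1)


def list_archive_members(blob_url: str) -> list[str]:
    """Download the archive into memory and return the names of its members.

    Assumption: the raw URL returns the zip file itself.
    """
    with urllib.request.urlopen(blob_to_raw_url(blob_url)) as response:
        payload = response.read()
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        return archive.namelist()
```

Calling `list_archive_members(JA_EN_BLOB_URL)` requires network access and returns whatever file names the archive contains; the same call should work for the Chinese-English link.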

GitHub

https://github.com/shyyhs/CourseraParallelCorpusMining

Reference

Haiyue Song, Raj Dabre, Atsushi Fujita and Sadao Kurohashi. Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pp. 3640-3649, Marseille, France, May 2020.

BibTeX

   @inproceedings{song-etal-2020-coursera,
       title = "{C}oursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation",
       author = "Song, Haiyue  and
         Dabre, Raj  and
         Fujita, Atsushi  and
         Kurohashi, Sadao",
       booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
       month = may,
       year = "2020",
       address = "Marseille, France",
       publisher = "European Language Resources Association",
       url = "https://www.aclweb.org/anthology/2020.lrec-1.449",
       pages = "3640--3649",
       language = "English",
       ISBN = "979-10-95546-34-4",
   }

Contact

song AT nlp.ist.i.kyoto-u.ac.jp