This resource contains Japanese-English and Chinese-English parallel datasets extracted from Coursera. The text is spoken language from the educational domain. The training set is aligned automatically; the development and test sets are manually evaluated.
Ja: 誰かに何をすべきか言うのか、 私がこれをしたら起こったことについて言うのかの違いです。 En: It's the difference between telling someone what to do versus saying this is what happened when I did this.
Zh: 如今, 云计算包括虚拟化数据中心, 虚拟机和应用程序编程接口。 En: Today, cloud computing involves virtualized datacenters, virtual machines and APIs.
https://github.com/shyyhs/CourseraParallelCorpusMining/blob/master/data/Coursera_En-Ja.zip
https://github.com/shyyhs/CourseraParallelCorpusMining/blob/master/data/Coursera_En-Zh.zip
https://github.com/shyyhs/CourseraParallelCorpusMining
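The two archive links above point at GitHub "blob" pages, which serve an HTML viewer rather than the file itself. The following is a minimal Python sketch of one way to fetch an archive and inspect it, assuming GitHub's usual behavior of serving the file when `/blob/` is replaced with `/raw/`. The helper names (`blob_to_raw`, `archive_members`, `fetch_members`) are hypothetical, and nothing is assumed about the file layout inside the archives: the sketch only lists member names.

```python
import io
import urllib.request
import zipfile

# URL copied from the listing above.
EN_JA = "https://github.com/shyyhs/CourseraParallelCorpusMining/blob/master/data/Coursera_En-Ja.zip"


def blob_to_raw(url: str) -> str:
    """Turn a GitHub 'blob' page URL into a direct-download 'raw' URL."""
    return url.replace("/blob/", "/raw/", 1)


def archive_members(data: bytes) -> list:
    """Return the member names of a zip archive held in memory."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


def fetch_members(url: str) -> list:
    """Download the archive behind a blob URL and list its members.

    Requires network access; the archive contents are not assumed.
    """
    with urllib.request.urlopen(blob_to_raw(url)) as response:
        return archive_members(response.read())
```

Calling `fetch_members(EN_JA)` returns the file names in the English-Japanese archive, which is a reasonable first step before deciding how to read the individual files.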
Haiyue Song, Raj Dabre, Atsushi Fujita and Sadao Kurohashi. Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pp. 3640-3649, Marseille, France, May 2020.
@inproceedings{song-etal-2020-coursera,
    title = "{C}oursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation",
    author = "Song, Haiyue and Dabre, Raj and Fujita, Atsushi and Kurohashi, Sadao",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.449",
    pages = "3640--3649",
    language = "English",
    ISBN = "979-10-95546-34-4",
}
song AT nlp.ist.i.kyoto-u.ac.jp