#author("2021-03-15T16:11:30+09:00","kurohashi_kawahara_lab","kurohashi_kawahara_lab")
#author("2022-04-14T17:36:57+09:00","kurohashi_kawahara_lab","kurohashi_kawahara_lab")
* OpenCourseWare Parallel Corpus / Coursera Parallel Corpus [#mae4a384]

** Update history [#g075a542]
- 2021 Added Chinese-English parallel dataset
- 2020 Added Japanese-English parallel dataset

** Description [#c9d7e6af]

This resource contains Japanese-English and Chinese-English parallel datasets extracted from Coursera. The text is spoken language from the educational domain. The training set is automatically aligned, while the dev and test sets are manually checked.

- Description of the files in the Japanese-English dataset:
-- train.ja & train.en:
--- 40,770 parallel sentences extracted and aligned from Coursera, which could be used as training data for machine translation (MT).
-- dev.ja & dev.en:
--- 541 manually checked parallel sentences, which could be used as tuning data for MT.
-- test.ja & test.en:
--- 2,005 manually checked parallel sentences, which could be used as testing data for MT.

- Description of the files in the Chinese-English dataset:
-- train.zh & train.en:
--- 40,074 parallel sentences extracted and aligned from Coursera, which could be used as training data for machine translation (MT).
-- dev.zh & dev.en:
--- 865 manually checked parallel sentences, which could be used as tuning data for MT.
-- test.zh & test.en:
--- 2,009 manually checked parallel sentences, which could be used as testing data for MT.
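
The file pairs listed above can be read as aligned sentence pairs. The following is a minimal sketch, assuming each file is UTF-8 text with one sentence per line and that line i of the source file corresponds to line i of the English file (the page does not state the file format explicitly). The function name read_parallel and the directory name in the usage comment are hypothetical.

```python
def read_parallel(src_path, tgt_path):
    """Read two line-aligned files and return a list of (source, target) pairs.

    Assumption: UTF-8 encoding, one sentence per line, same line count in both files.
    """
    with open(src_path, encoding="utf-8") as src_file:
        src_lines = [line.rstrip("\n") for line in src_file]
    with open(tgt_path, encoding="utf-8") as tgt_file:
        tgt_lines = [line.rstrip("\n") for line in tgt_file]

    # A length mismatch means the files are not aligned line by line.
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line count mismatch: {len(src_lines)} vs {len(tgt_lines)}"
        )
    return list(zip(src_lines, tgt_lines))


# Example usage (the directory name is a placeholder for wherever the zip was extracted):
# pairs = read_parallel("Coursera_En-Ja/train.ja", "Coursera_En-Ja/train.en")
# print(len(pairs))
```

Comparing the number of pairs returned for each split against the sentence counts listed above is a quick way to confirm that the download and extraction were complete.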

** Sample [#b6b769d8]
- Japanese-English parallel sentences

    Ja: 誰かに何をすべきか言うのか、 私がこれをしたら起こったことについて言うのかの違いです。
    En: It's the difference between telling someone what to do versus saying this is what happened when I did this.


- Chinese-English parallel sentences

    Zh: 如今, 云计算包括虚拟化数据中心, 虚拟机和应用程序编程接口。
    En: Today, cloud computing involves virtualized datacenters, virtual machines and APIs.

** Download [#h7c72293]
*** Japanese-English: [#ld1eacaa]
https://github.com/shyyhs/CourseraParallelCorpusMining/raw/master/data/Coursera_En-Ja.zip
https://github.com/shyyhs/CourseraParallelCorpusMining/blob/master/data/Coursera_En-Ja.zip

*** Chinese-English: [#u3834cd9]

https://github.com/shyyhs/CourseraParallelCorpusMining/raw/master/data/Coursera_En-Zh.zip
https://github.com/shyyhs/CourseraParallelCorpusMining/blob/master/data/Coursera_En-Zh.zip

** GitHub [#c35c6d03]
https://github.com/shyyhs/CourseraParallelCorpusMining

** Reference [#hff08bcb]
Haiyue Song, Raj Dabre, Atsushi Fujita and Sadao Kurohashi.
Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation.
Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC 2020), pp. 3640-3649, Marseille, France, May 2020.

*** bib [#na444d6c]
    @inproceedings{song-etal-2020-coursera,
    title = "{C}oursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation",
    author = "Song, Haiyue  and
      Dabre, Raj  and
      Fujita, Atsushi  and
      Kurohashi, Sadao",
    booktitle = "Proceedings of the 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.449",
    pages = "3640--3649",
    language = "English",
    ISBN = "979-10-95546-34-4",
    }

** Contact [#s0d94ebb]
song AT nlp.ist.i.kyoto-u.ac.jp
