Backup of JCID(No. 4) - LANGUAGE MEDIA PROCESSING LAB

Japanese Commonsense Inference Dataset

This is a Japanese commonsense inference dataset built from a web corpus of 715 million sentences.
It consists of 104k multiple-choice questions that ask about contingency relations between basic events.
The questions are generated from event pairs verified by crowdsourcing.
Here is an example:

お腹が空いたので (I am hungry, so)
   a. コーヒーを飲む (I drink coffee)
   b. ご飯を食べる (I have a meal)
   c. 汗をかく (I sweat)
   d. 眠くなる (I get sleepy)

The task is to select the choice that most appropriately continues the given context.
In this case, b is the correct answer.

Data format

The data is provided in JSON Lines (jsonl) format. An example record:

{
   "id": "0", 
   "label": "d", 
   "agreement": "2",
   "context": "電池 の 減り は やはり 早い ので 、", 
   "choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん",
   "choice_b": "今回 は 期間 限定 で の お 届け に なり ます", 
   "choice_c": "原子 炉 を 手動 停止 する",
   "choice_d": "充電 用 に ＵＳＢ ケーブル 買い ます"
} ...

"label" is the correct choice (one of a, b, c, d).
"agreement" is the result of contingency verification for the pair of the context and the correct choice; its value is one of {2, 3, 4}.
"context" and "choice_{a, b, c, d}" have already been segmented into morphemes using Juman++ 2.0.0-rc3, with morphemes separated by spaces.
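
As a rough illustration of working with this format, the sketch below parses jsonl records, looks up the correct choice, and scores a predictor by accuracy. It assumes one JSON object per line with string values as in the example above. The helper names (read_jsonl, detokenize, correct_choice, accuracy) are hypothetical and not part of the dataset release, and the sample record is the one shown above, read from memory rather than from the downloaded files.

```python
import io
import json


def read_jsonl(stream):
    """Yield one record (a dict) per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def detokenize(segmented):
    """Join space-separated morphemes back into an unsegmented string."""
    return "".join(segmented.split())


def correct_choice(record):
    """Return the text of the choice named by the record's "label" field."""
    return record["choice_" + record["label"]]


def accuracy(records, predict):
    """Fraction of records for which predict(record) equals the gold label."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(1 for r in records if predict(r) == r["label"])
    return hits / len(records)


# The example record from this page, serialized as a single jsonl line.
sample_line = json.dumps(
    {
        "id": "0",
        "label": "d",
        "agreement": "2",
        "context": "電池 の 減り は やはり 早い ので 、",
        "choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん",
        "choice_b": "今回 は 期間 限定 で の お 届け に なり ます",
        "choice_c": "原子 炉 を 手動 停止 する",
        "choice_d": "充電 用 に ＵＳＢ ケーブル 買い ます",
    },
    ensure_ascii=False,
)

records = list(read_jsonl(io.StringIO(sample_line + "\n")))
print(detokenize(records[0]["context"]))
print(detokenize(correct_choice(records[0])))
# A trivial predictor that always answers "d", used only to exercise accuracy().
print(accuracy(records, lambda r: "d"))
```

To read a downloaded file instead, pass an open file object (opened with encoding="utf-8") to read_jsonl in place of the in-memory stream.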

Download

We provide training, development, and test sets.
Please refer to [1] for the detailed statistics.

Download (10.4MB)

Demo

You can try some training examples on the demo site below.

Demo (in Japanese)

Update history

Release ver1.0 - October 6, 2020

References

[1] Kazumasa Omura, Daisuke Kawahara, and Sadao Kurohashi: A Method for Building a Commonsense Inference Dataset based on Basic Events. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020).