Japanese Commonsense Inference Dataset

This is a Japanese commonsense inference dataset built from a web corpus of 715 million sentences.
It consists of 104k multiple-choice questions that ask about contingency relations between basic events.
The questions are generated from event pairs that were verified by crowdsourcing.
Here is an example:

お腹が空いたので (I am hungry, so)
   a. コーヒーを飲む (I drink coffee)
   b. ご飯を食べる (I have a meal)
   c. 汗をかく (I sweat)
   d. 眠くなる (I get sleepy)

The task is to select the choice that most appropriately continues the context.
In this case, b is the correct answer.
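
Since each question has exactly one correct choice, a system's output can be scored by comparing predicted labels with gold labels. The following is a minimal sketch, assuming plain accuracy is the measure of interest; the function name `accuracy` and the toy label lists are hypothetical and are not taken from the dataset.

```python
def accuracy(gold_labels, predicted_labels):
    """Return the fraction of questions whose predicted label equals the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted label lists must have the same length")
    if not gold_labels:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


# Toy labels for illustration only (not real dataset content).
gold = ["b", "d", "a", "c"]
pred = ["b", "d", "c", "c"]
score = accuracy(gold, pred)
print(f"accuracy = {score:.2f}")
```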

Data format

The data is in jsonl format: each line is one JSON object like the following (shown here across multiple lines for readability).

{
   "id": "0", 
   "label": "d", 
   "agreement": "2",
   "context": "電池 の 減り は やはり 早い ので 、", 
   "choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん",
   "choice_b": "今回 は 期間 限定 で の お 届け に なり ます", 
   "choice_c": "原子 炉 を 手動 停止 する",
   "choice_d": "充電 用 に USB ケーブル 買い ます"
} ...

"agreement" is the result of the contingency verification for each pair of a context and its correct choice; its value is one of {2, 3, 4}.
"context" and "choice_{a, b, c, d}" have already been segmented into morphemes with Juman++ 2.0.0-rc3.

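A record in this format could be read as in the sketch below. It assumes that morphemes are separated by single spaces, as in the example above, and that "agreement" is stored as a string. The helper names `parse_record` and `load_jsonl` are hypothetical, and the sample line simply reproduces the record shown above.

```python
import json

CHOICE_KEYS = ("a", "b", "c", "d")


def parse_record(line):
    """Parse one jsonl line into a dict with tokenized context and choices."""
    raw = json.loads(line)
    return {
        "id": raw["id"],
        "label": raw["label"],
        # "agreement" appears as a string in the example, so convert it to int.
        "agreement": int(raw["agreement"]),
        # Assumption: morphemes are separated by single spaces.
        "context": raw["context"].split(" "),
        "choices": {k: raw[f"choice_{k}"].split(" ") for k in CHOICE_KEYS},
    }


def load_jsonl(path):
    """Read every non-empty line of a jsonl file (the path is supplied by the caller)."""
    with open(path, encoding="utf-8") as f:
        return [parse_record(line) for line in f if line.strip()]


# The example record from this page, serialized on a single line as in jsonl.
sample_line = json.dumps({
    "id": "0",
    "label": "d",
    "agreement": "2",
    "context": "電池 の 減り は やはり 早い ので 、",
    "choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん",
    "choice_b": "今回 は 期間 限定 で の お 届け に なり ます",
    "choice_c": "原子 炉 を 手動 停止 する",
    "choice_d": "充電 用 に USB ケーブル 買い ます",
}, ensure_ascii=False)

record = parse_record(sample_line)
# Join the morphemes of the correct choice back into an unsegmented string.
answer = "".join(record["choices"][record["label"]])
print(record["agreement"], answer)
```

Keeping the morphemes as lists makes it easy either to feed them to a tokenizer that expects pre-segmented input or to join them back into raw text, as done for `answer` here.
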
Download

We provide training, development, and test sets.
Please refer to [1] for the detailed statistics.

Demo

You can try some training examples on the demo site below.

Update history

References

