This dataset is a Japanese commonsense inference dataset built from a web corpus of 715 million sentences.
It consists of 104k multiple-choice questions that ask about contingency relations between basic events.
The questions are generated from event pairs that were verified by crowdsourcing.
Here is an example:
お腹が空いたので (I am hungry, so)
a. コーヒーを飲む (I drink coffee)
b. ご飯を食べる (I have a meal)
c. 汗をかく (I sweat)
d. 眠くなる (I get sleepy)
The task is to select the choice that most appropriately continues the given context.
In this case, b is the correct answer.
The data is in JSONL format, as follows.
{
"id": "0",
"label": "d",
"agreement": "2",
"context": "電池 の 減り は やはり 早い ので 、",
"choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん",
"choice_b": "今回 は 期間 限定 で の お 届け に なり ます",
"choice_c": "原子 炉 を 手動 停止 する",
"choice_d": "充電 用 に USB ケーブル 買い ます"
} ...
"agreement" is the result of the contingency verification for the pair of the context and the correct choice; its value is one of {2, 3, 4}.
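As a rough illustration of working with this format, the following Python sketch parses JSONL text, looks up the choice named by "label", and filters records by "agreement". The helper names (parse_jsonl, gold_choice, filter_by_agreement) are hypothetical and not part of the dataset; the sample reuses the record shown above, and the sketch assumes each record occupies a single line and that "agreement" is stored as a string, as in that record.

```python
import json

def parse_jsonl(text):
    """Parse JSONL text (one JSON object per line) into a list of dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def gold_choice(record):
    """Return the text of the choice named by the "label" field."""
    return record["choice_" + record["label"]]

def filter_by_agreement(records, minimum):
    """Keep records whose "agreement" value is at least `minimum`."""
    # Assumption: "agreement" is a string such as "2", so convert before comparing.
    return [r for r in records if int(r["agreement"]) >= minimum]

# The example record from this README, serialized onto one line.
sample = {
    "id": "0",
    "label": "d",
    "agreement": "2",
    "context": "電池 の 減り は やはり 早い ので 、",
    "choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん",
    "choice_b": "今回 は 期間 限定 で の お 届け に なり ます",
    "choice_c": "原子 炉 を 手動 停止 する",
    "choice_d": "充電 用 に USB ケーブル 買い ます",
}
text = json.dumps(sample, ensure_ascii=False) + "\n"

records = parse_jsonl(text)
print(gold_choice(records[0]))
```

To read an actual file, the same function can be applied to the file contents, for example parse_jsonl(open(path, encoding="utf-8").read()), where path is whichever split you downloaded.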
"context" and "choice_{a, b, c, d}" have already been divided into morphemes using Juman++ 2.0.0-rc3.
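If unsegmented text is needed (for example, to apply a different tokenizer), one simple approach is to remove the separators between morphemes. The sketch below assumes morphemes are separated by whitespace, as in the record shown above; join_morphemes is a hypothetical helper name. Note that this also removes any spaces that were present in the original sentence, such as those between Latin-script words.

```python
def join_morphemes(segmented):
    """Concatenate whitespace-separated morphemes into one unsegmented string."""
    # str.split() with no arguments splits on any run of whitespace.
    return "".join(segmented.split())

print(join_morphemes("電池 の 減り は やはり 早い ので 、"))
# prints: 電池の減りはやはり早いので、
```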
We prepared training, development, and test sets.
Please refer to [1] for the detailed statistics.
Please contact us by e-mail at "nl-resource at nlp.ist.i.kyoto-u.ac.jp" if you have any questions.
You can try some training examples on the demo site below.