Kyoto University Commonsense Inference Dataset (KUCI)

KUCI is a Japanese commonsense inference dataset built from a web corpus of 715 million sentences.
It consists of 104k multiple-choice questions that ask about contingency between basic events.
The questions are generated from event pairs verified by crowdsourcing.
Here is an example:

電池の減りはやはり早いので、 (The battery drains so fast that)
    a. 実際の半導体製造装置は実現しません (we cannot actually make the semiconductor manufacturing equipment)
    b. 今回は期間限定でのお届けになります (it is a limited-time offer this time)
    c. 原子炉を手動停止する (we manually shut down a nuclear reactor)
    d. 充電用にUSBケーブル買います (I buy a USB cable for charging)

The task is to select the choice that most appropriately continues the context.
In this case, d is the correct answer.

Data format

The data is in JSON Lines (jsonl) format, as follows.

{
    "id": "0", 
    "label": "d", 
    "agreement": "2",
    "context": "電池 の 減り は やはり 早い ので 、", 
    "choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん",
    "choice_b": "今回 は 期間 限定 で の お 届け に なり ます", 
    "choice_c": "原子 炉 を 手動 停止 する",
    "choice_d": "充電 用 に USB ケーブル 買い ます"
} ...
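
As a rough illustration of how records in this format might be read, here is a minimal Python sketch. It assumes one JSON object per line and space-separated tokens in "context" and "choice_{a, b, c, d}". The helper names parse_kuci_lines and detokenize are hypothetical and not part of the dataset distribution. Joining tokens without spaces is only an approximation of the original surface text.

```python
import json
from typing import Dict, Iterable, List

CHOICE_KEYS = ("choice_a", "choice_b", "choice_c", "choice_d")


def parse_kuci_lines(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Parse records, assuming one JSON object per non-empty line."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        examples.append({
            "id": record["id"],
            # Assumption: "label" and "agreement" may be absent in some splits.
            "label": record.get("label"),
            "agreement": int(record["agreement"]) if "agreement" in record else None,
            # Fields are space-separated tokens, so split them into lists.
            "context": record["context"].split(" "),
            "choices": {key[-1]: record[key].split(" ") for key in CHOICE_KEYS},
        })
    return examples


def detokenize(tokens: List[str]) -> str:
    """Approximate the surface text by joining tokens without spaces."""
    return "".join(tokens)


# The example record shown above, serialized onto a single line.
sample_line = json.dumps({
    "id": "0",
    "label": "d",
    "agreement": "2",
    "context": "電池 の 減り は やはり 早い ので 、",
    "choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん",
    "choice_b": "今回 は 期間 限定 で の お 届け に なり ます",
    "choice_c": "原子 炉 を 手動 停止 する",
    "choice_d": "充電 用 に USB ケーブル 買い ます",
}, ensure_ascii=False)

examples = parse_kuci_lines([sample_line])
first = examples[0]
context_text = detokenize(first["context"])
answer_text = detokenize(first["choices"][first["label"]])
print(context_text + answer_text)
```

To read a downloaded split, an open file object can be passed instead of the list, for example `parse_kuci_lines(open("train.jsonl", encoding="utf-8"))`, where train.jsonl is a hypothetical file name.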

"agreement" is the result of the contingency verification for the pair of the context and its correct choice; its value is one of {2, 3, 4}.
"context" and "choice_{a, b, c, d}" are already segmented into morphemes using Juman++ 2.0.0-rc3.

Download

We prepared training, development, and test sets.
Please refer to [1] for the detailed statistics.

Please contact us by e-mail at "nl-resource at nlp.ist.i.kyoto-u.ac.jp" if you have any questions.

Demo

You can try some training examples on the demo site below.

Update history

References

