Last Update: 2024-02-16
KUCI is a Japanese dataset for training and evaluating the linguistic capability to infer basic contingency (hereafter, commonsense contingency reasoning). The dataset comprises 104k multiple-choice questions about basic contingency. It is also characterized by its semi-automatic construction method: pairs of basic event expressions that have a contingent relation are automatically extracted from a raw corpus, verified through crowdsourcing, and then used to automatically generate commonsense contingency reasoning problems. Here is an example of a commonsense contingency reasoning problem:
電池の減りはやはり早いので、 (The battery drains so fast that)
a. 実際の半導体製造装置は実現しません (actual semiconductor manufacturing equipment is not realized)
b. 今回は期間限定でのお届けになります (it is a limited-time offer this time)
c. 原子炉を手動停止する ({I} manually shut down a nuclear reactor)
d. 充電用にUSBケーブル買います ({I} buy a USB cable for charging)
※ {} denotes a dropped pronoun.
The task is to choose the most appropriate continuation of a given context. In this case, d is the correct choice.
cf. [2], [3]
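Since each problem has exactly one correct choice, a system's predictions can be scored by comparing predicted letters with gold letters. The following is a minimal sketch that assumes plain accuracy is the measure of interest; the function name `accuracy` and the toy label lists are hypothetical and are not taken from the dataset.

```python
from typing import Sequence


def accuracy(gold_labels: Sequence[str], predicted_labels: Sequence[str]) -> float:
    """Return the fraction of problems whose predicted letter equals the gold letter."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("gold and predicted label lists must have the same length")
    if not gold_labels:
        raise ValueError("at least one problem is required")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


# Toy labels for illustration only (not real dataset labels or system outputs).
toy_gold = ["d", "a", "c", "b"]
toy_pred = ["d", "a", "b", "b"]
print(accuracy(toy_gold, toy_pred))  # 0.75 (3 of 4 letters match)
```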
Train | Dev | Test |
83,127 | 10,228 | 10,291 |
In addition, 862k pseudo-problems are available.
The data format is JSON Lines.
{ "id": 0, "context": "電池 の 減り は やはり 早い ので 、", "choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん", "choice_b": "今回 は 期間 限定 で の お 届け に なり ます", "choice_c": "原子 炉 を 手動 停止 する", "choice_d": "充電 用 に USB ケーブル 買い ます", "label": "d", "agreement": 2, "core_event_pair": "減り/へりv,ガ,早い/はやい|ケーブル/けーぶる,ヲ,買う/かう" }
Key | Type | Description |
id | int | unique integer ID for each problem (0-based) |
context | str | context (segmented into morphemes using Juman++ Version: 2.0.0-rc3) |
choice_{a, b, c, d} | str | choice (segmented in the same way as context) |
label | str | letter of the correct choice (one of {a, b, c, d}) |
agreement | int | number of crowdworkers who agreed that the base has a contingent relation (one of {2, 3, 4}) |
core_event_pair | str | pair of core events constituting the base |
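The fields above can be read with the Python standard library alone. The sketch below parses the example record, maps the label letter to a 0-based index, and removes the morpheme boundaries. The names `SAMPLE_LINE`, `CHOICE_KEYS`, `parse_problem`, and `join_morphemes` are hypothetical. Splitting `core_event_pair` on "|" is an assumption based only on the example record, and deleting every space assumes that spaces occur only as morpheme boundaries.

```python
import json

# The example record shown above; a real file holds one such object per line.
SAMPLE_LINE = (
    '{"id": 0, '
    '"context": "電池 の 減り は やはり 早い ので 、", '
    '"choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん", '
    '"choice_b": "今回 は 期間 限定 で の お 届け に なり ます", '
    '"choice_c": "原子 炉 を 手動 停止 する", '
    '"choice_d": "充電 用 に USB ケーブル 買い ます", '
    '"label": "d", "agreement": 2, '
    '"core_event_pair": "減り/へりv,ガ,早い/はやい|ケーブル/けーぶる,ヲ,買う/かう"}'
)

CHOICE_KEYS = ("choice_a", "choice_b", "choice_c", "choice_d")


def parse_problem(line: str) -> dict:
    """Parse one JSON Lines record into a dict that is easier to work with."""
    record = json.loads(line)
    choices = [record[key] for key in CHOICE_KEYS]
    # The label is one of the letters a-d; convert it to a 0-based index.
    answer_index = "abcd".index(record["label"])
    return {
        "id": record["id"],
        "context": record["context"],
        "choices": choices,
        "answer_index": answer_index,
        "answer": choices[answer_index],
        "agreement": record["agreement"],
        # Assumption: "|" separates the two core events of the pair.
        "core_events": record["core_event_pair"].split("|"),
    }


def join_morphemes(text: str) -> str:
    """Remove the spaces that mark morpheme boundaries.

    Assumption: every space in the text is a morpheme boundary.
    """
    return text.replace(" ", "")


problem = parse_problem(SAMPLE_LINE)
print(join_morphemes(problem["context"]))
print(join_morphemes(problem["answer"]))
```

To read a whole file, the same function can be applied to each non-empty line of the file opened with `encoding="utf-8"`.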
After internal discussion, we have decided to license this dataset under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
If you find any problems, please contact us by e-mail at "nl-resource at nlp.ist.i.kyoto-u.ac.jp" or "omura at nlp.ist.i.kyoto-u.ac.jp".
(" at " = @)