Last Update: 2024-02-16

Kyoto University Commonsense Inference dataset (KUCI)

KUCI is a Japanese dataset for training and evaluating the linguistic capability to infer basic contingency (hereafter, commonsense contingency reasoning). The dataset comprises 104k multiple-choice questions that ask about basic contingency. It is built semi-automatically in three steps: pairs of basic event expressions that stand in a contingent relation are extracted automatically from a raw corpus, the pairs are verified through crowdsourcing, and commonsense contingency reasoning problems are generated automatically from the verified pairs. Here is an example of a commonsense contingency reasoning problem:

電池の減りはやはり早いので、 (The battery drains so fast that)
  a. 実際の半導体製造装置は実現しません (actual semiconductor manufacturing equipment is not realized)
  b. 今回は期間限定でのお届けになります (it is a limited-time offer this time)
  c. 原子炉を手動停止する ({I} manually shut down a nuclear reactor)
  d. 充電用にUSBケーブル買います ({I} buy a USB cable for charging)
Note: {} denotes a dropped pronoun.

The task is to choose, from the four choices, the most appropriate continuation of a given context. In this case, d is the correct choice.
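
The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `predict`, `accuracy`, and `toy_score` are hypothetical names, the nested `choices` dictionary is a convenience layout rather than the dataset's own format, and `toy_score` is a hand-written stand-in for whatever model would actually score a continuation.

```python
def predict(context, choices, score):
    """Return the letter of the choice that `score` rates highest for `context`.

    `choices` maps a letter ("a".."d") to the choice text.
    `score` is any callable (context, choice) -> number; higher means more plausible.
    """
    return max(choices, key=lambda letter: score(context, choices[letter]))


def accuracy(problems, score):
    """Fraction of problems whose predicted letter equals the gold label."""
    if not problems:
        raise ValueError("no problems to evaluate")
    correct = sum(
        predict(p["context"], p["choices"], score) == p["label"] for p in problems
    )
    return correct / len(problems)


def toy_score(context, choice):
    # Hypothetical stand-in for a real model: rewards one hand-picked keyword.
    return 1.0 if "充電" in choice else 0.0


# The example problem from this page, in an illustrative (assumed) layout.
example = {
    "context": "電池の減りはやはり早いので、",
    "choices": {
        "a": "実際の半導体製造装置は実現しません",
        "b": "今回は期間限定でのお届けになります",
        "c": "原子炉を手動停止する",
        "d": "充電用にUSBケーブル買います",
    },
    "label": "d",
}

print(predict(example["context"], example["choices"], toy_score))
print(accuracy([example], toy_score))
```

A real evaluation would replace `toy_score` with a model-based scorer; the argmax and accuracy logic would stay the same.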

Definitions of Terms

cf. [2], [3]

contingency
  the discourse relation between events established when one is likely to cause the other
core event
  high-frequency predicate-argument structure (acquired from case frames)
base
  pair of a context and a correct choice constituting each problem

Statistics

Train     Dev       Test
83,127    10,228    10,291

In addition, 862k pseudo-problems are available.

Data Format

The data format is JSON Lines.

 {
   "id": 0, 
   "context": "電池 の 減り は やはり 早い ので 、", 
   "choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん",
   "choice_b": "今回 は 期間 限定 で の お 届け に なり ます", 
   "choice_c": "原子 炉 を 手動 停止 する",
   "choice_d": "充電 用 に USB ケーブル 買い ます",
   "label": "d", 
   "agreement": 2,
   "core_event_pair": "減り/へりv,ガ,早い/はやい|ケーブル/けーぶる,ヲ,買う/かう"
 }

Key                  Type  Description
id                   int   unique whole number for each problem (0-origin)
context              str   context (segmented into morphemes using Juman++ Version: 2.0.0-rc3)
choice_{a, b, c, d}  str   choice (segmented in the same way as context)
label                str   letter corresponding to the correct choice (one of {a, b, c, d})
agreement            int   number of crowdworkers who agreed that the base has a contingent relation (one of {2, 3, 4})
core_event_pair      str   pair of core events constituting the base
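
As a rough illustration of how records in this format might be read, the sketch below parses JSON Lines text with Python's standard `json` module and checks the `label` and `agreement` values against the ranges listed above. The function names (`read_problems`, `correct_choice`, `join_morphemes`) are hypothetical and not part of any official tooling. It also assumes that deleting the separating spaces recovers the unsegmented text, which may not hold if the original text itself contained spaces.

```python
import io
import json

LABELS = ("a", "b", "c", "d")


def read_problems(lines):
    """Parse JSON Lines text (one problem per line) into a list of dicts."""
    problems = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if record["label"] not in LABELS:
            raise ValueError(f"line {line_number}: unexpected label {record['label']!r}")
        if record["agreement"] not in (2, 3, 4):
            raise ValueError(f"line {line_number}: unexpected agreement {record['agreement']!r}")
        problems.append(record)
    return problems


def correct_choice(record):
    """Return the text of the choice that the label points to."""
    return record["choice_" + record["label"]]


def join_morphemes(text):
    # Assumption: removing the separating spaces recovers the surface string.
    return text.replace(" ", "")


# The sample record from this page, serialized as a single JSON line.
sample_record = {
    "id": 0,
    "context": "電池 の 減り は やはり 早い ので 、",
    "choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん",
    "choice_b": "今回 は 期間 限定 で の お 届け に なり ます",
    "choice_c": "原子 炉 を 手動 停止 する",
    "choice_d": "充電 用 に USB ケーブル 買い ます",
    "label": "d",
    "agreement": 2,
    "core_event_pair": "減り/へりv,ガ,早い/はやい|ケーブル/けーぶる,ヲ,買う/かう",
}
sample_lines = io.StringIO(json.dumps(sample_record, ensure_ascii=False) + "\n")

problems = read_problems(sample_lines)
print(join_morphemes(problems[0]["context"]) + join_morphemes(correct_choice(problems[0])))
```

In practice, `read_problems` would be given an open file object for one of the splits (opened with `encoding="utf-8"`) instead of the in-memory sample used here.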

License

After internal discussion, we have decided to license this dataset under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). If you find any problems, please contact us by e-mail at "nl-resource at nlp.ist.i.kyoto-u.ac.jp" or "omura at nlp.ist.i.kyoto-u.ac.jp".
(" at " = @)

External Links

History

References