Kyoto University Commonsense Inference dataset (KUCI)

KUCI is a Japanese commonsense inference dataset of 104k multiple-choice questions that directly ask about contingency between basic events.
The dataset is also characterized by its semi-automatic construction method: contingent pairs of basic event expressions are automatically extracted from a web corpus of 715 million sentences, verified through crowdsourcing, and then used to automatically generate commonsense inference problems.
Here is an example:

電池の減りはやはり早いので、 (The battery drains so fast that)
    a. 実際の半導体製造装置は実現しません (we cannot actually make the semiconductor manufacturing equipment)
    b. 今回は期間限定でのお届けになります (it is a limited-time offer this time)
    c. 原子炉を手動停止する (we manually shut down a nuclear reactor)
    d. 充電用にＵＳＢケーブル買います (I buy a USB cable for charging)

The task is to select the choice that most appropriately continues the context.
In this example, d is the correct answer.
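
As a rough illustration of the task, the selection step can be sketched in Python as follows. The `predict` and `accuracy` helpers are hypothetical names introduced only for this sketch, and the scores are made-up numbers standing in for the output of whatever model rates a (context, choice) pair.

```python
from typing import Callable, Sequence

LABELS = ("a", "b", "c", "d")


def predict(context: str, choices: Sequence[str],
            score: Callable[[str, str], float]) -> str:
    """Return the label of the choice that scores highest for the context."""
    scores = [score(context, choice) for choice in choices]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[best]


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of questions whose predicted label equals the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("gold must not be empty")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


context = "電池の減りはやはり早いので、"
choices = [
    "実際の半導体製造装置は実現しません",
    "今回は期間限定でのお届けになります",
    "原子炉を手動停止する",
    "充電用にＵＳＢケーブル買います",
]
# Made-up scores (assumption): a real system would compute these with a model.
toy_scores = dict(zip(choices, [0.05, 0.10, 0.02, 0.83]))
predicted = predict(context, choices, lambda ctx, choice: toy_scores[choice])
```

With these toy scores, `predicted` is the label of the fourth choice. Passing the scorer as a function keeps the selection logic independent of any particular model.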

Data format

The data is provided in jsonl format, with records structured as follows.

 {
   "id": "0", 
   "context": "電池 の 減り は やはり 早い ので 、", 
   "choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん",
   "choice_b": "今回 は 期間 限定 で の お 届け に なり ます", 
   "choice_c": "原子 炉 を 手動 停止 する",
   "choice_d": "充電 用 に ＵＳＢ ケーブル 買い ます",
   "label": "d", 
   "agreement": "2",
   "core_event_pair": "減り/へりv,ガ,早い/はやい|ケーブル/けーぶる,ヲ,買う/かう"
 }, ...

"context" and "choice_{a, b, c, d}" have already been segmented into morphemes with Juman++ 2.0.0-rc3. "agreement" is the result of contingency verification for the pair of the context and the correct choice, and takes a value in {2, 3, 4}.
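
A minimal sketch of reading one record in Python is shown below. It assumes that each line of the file holds one JSON object with the fields shown above and that every record carries "label" and "agreement". The helper names `join_morphemes` and `parse_record` are hypothetical. Removing the spaces restores the unsegmented text, on the assumption that the original sentences contain no spaces of their own.

```python
import json

CHOICE_KEYS = ("choice_a", "choice_b", "choice_c", "choice_d")


def join_morphemes(segmented: str) -> str:
    """Undo the space-separated morpheme segmentation."""
    return segmented.replace(" ", "")


def parse_record(line: str) -> dict:
    """Parse one jsonl line into a record with unsegmented text."""
    record = json.loads(line)
    # Map "choice_a" -> "a", and so on, so that choices can be looked up by label.
    choices = {key[-1]: join_morphemes(record[key]) for key in CHOICE_KEYS}
    return {
        "id": record["id"],
        "context": join_morphemes(record["context"]),
        "choices": choices,
        "label": record["label"],
        "answer": choices[record["label"]],
        # "agreement" is a string in the example record, so convert it.
        "agreement": int(record["agreement"]),
    }


# The example record from above, serialized as a single line.
SAMPLE_LINE = json.dumps({
    "id": "0",
    "context": "電池 の 減り は やはり 早い ので 、",
    "choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん",
    "choice_b": "今回 は 期間 限定 で の お 届け に なり ます",
    "choice_c": "原子 炉 を 手動 停止 する",
    "choice_d": "充電 用 に ＵＳＢ ケーブル 買い ます",
    "label": "d",
    "agreement": "2",
    "core_event_pair": "減り/へりv,ガ,早い/はやい|ケーブル/けーぶる,ヲ,買う/かう",
}, ensure_ascii=False)

parsed = parse_record(SAMPLE_LINE)
```

To process a whole file, the same function can be applied to each non-empty line read from a file opened with `encoding="utf-8"`.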

Links

KUCI (37.4MB)
- We prepared training, development, and test sets.
  Please refer to [1] for the detailed statistics.
Demo (in Japanese)
- You can try some training examples.
Pseudo-data
Code

If you have any questions, please contact us by e-mail at "omura at nlp.ist.i.kyoto-u.ac.jp" or "nl-resource at nlp.ist.i.kyoto-u.ac.jp".
(" at " = @)

Update history

Updated README - September 15, 2022
Renamed to "Kyoto University Commonsense Inference dataset (KUCI)" - October 7, 2020
Released ver1.0 - October 6, 2020

References

[1] Kazumasa Omura, Daisuke Kawahara and Sadao Kurohashi: A Method for Building a Commonsense Inference Dataset based on Basic Events, In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020).
[2] Kazumasa Omura and Sadao Kurohashi: Improving Commonsense Contingent Reasoning by Pseudo-data and its Application to the Related Tasks, In Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022).