Japanese Commonsense Inference Dataset (JCID)

JCID is a Japanese commonsense inference dataset built from a web corpus of 715 million sentences.

Data format

The data is provided in JSON Lines (jsonl) format. An example record is shown below.

{
   "id": "0", 
   "agree": "2", 
   "gold": "d", 
   "context": "電池 の 減り は やはり 早い ので 、", 
   "choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん",
   "choice_b": "今回 は 期間 限定 で の お 届け に なり ます", 
   "choice_c": "原子 炉 を 手動 停止 する",
   "choice_d": "充電 用 に USB ケーブル 買い ます"
} ...

"agree" is the result of contingency relation verification for each problem. "context" and "choice_{a, b, c, d}" have already been divided into morphemes using Juman++ 2.0.0-rc3.

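As a minimal sketch of how a record in this format might be read, the Python snippet below parses JSON Lines input and splits the pre-segmented text on spaces. It makes several assumptions that this page does not state: that each record sits on a single line (as JSON Lines normally requires), that "gold" holds the letter of the correct choice, and that morphemes are separated by whitespace as in the example above. The names read_jcid and to_example are hypothetical helpers, not part of the dataset distribution.

```python
import io
import json

CHOICE_KEYS = ["choice_a", "choice_b", "choice_c", "choice_d"]


def read_jcid(stream):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def to_example(record):
    """Split space-separated morphemes and map the gold letter to an index."""
    context = record["context"].split()
    choices = [record[key].split() for key in CHOICE_KEYS]
    # Assumption: "gold" is one of the letters a-d naming the correct choice.
    gold_index = "abcd".index(record["gold"])
    return context, choices, gold_index


# The example record from this page, serialized on a single line.
sample = json.dumps(
    {
        "id": "0",
        "agree": "2",
        "gold": "d",
        "context": "電池 の 減り は やはり 早い ので 、",
        "choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん",
        "choice_b": "今回 は 期間 限定 で の お 届け に なり ます",
        "choice_c": "原子 炉 を 手動 停止 する",
        "choice_d": "充電 用 に USB ケーブル 買い ます",
    },
    ensure_ascii=False,
)

records = list(read_jcid(io.StringIO(sample + "\n")))
context, choices, gold_index = to_example(records[0])
print(context, choices[gold_index])
```

To read a downloaded file instead of the in-memory sample, pass a file object opened with encoding="utf-8" to read_jcid. Note that "id" and "agree" are stored as strings in the example, so they would need an explicit int() conversion before any numeric comparison.
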
Download

We prepared training, development, and test sets.

Update history


