Japanese Wikipedia Typo Dataset

This is a dataset of Japanese typos extracted from Wikipedia's revision history. Typo-correction sentence pairs are obtained by comparing each revision with the one immediately preceding it. The dataset covers four typo categories (substitution, deletion, insertion, and kanji-conversion) and contains approximately half a million sentence pairs in total. See References for more information.

Data format

The data is in JSON Lines (jsonl) format. An example record is shown below.

{"category": "kanji-conversion", "page": "366", "pre_rev": "72387", "post_rev": "77423", "pre_loss": 122.24,
"post_loss": 120.72, "pre_text": "信長の死後、豊臣秀吉が実権を握ると、前田利家は加賀も領して、金沢に入場した。",
"post_text": "信長の死後、豊臣秀吉が実権を握ると、前田利家は加賀も領して、金沢に入城した。",
"diffs": [{"pre": "入場", "post": "入城"}]}

category is the typo category (substitution, deletion, insertion, or kanji-conversion). page is the Wikipedia article page ID, and pre_rev (post_rev) is the Wikipedia revision ID before (after) the correction. pre_loss (post_loss) is the total loss that a character-based LSTM language model assigns to the sentence before (after) the correction. pre_text (post_text) is the sentence before (after) the correction. diffs lists the morpheme-level differences between pre_text and post_text.
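
As a minimal sketch of how these fields might be consumed, the Python snippet below parses JSON Lines records and prints the category counts and the morpheme-level diffs. The helper names parse_records and loss_drop are hypothetical and not part of the dataset. The sample string is the example record above written on a single line. Reading a positive pre_loss minus post_loss as "the corrected sentence is more plausible to the language model" is an assumption that lower loss means a more likely sentence.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def parse_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON Lines row (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def loss_drop(record: dict) -> float:
    """Return pre_loss - post_loss for one record.

    Assumption: a lower loss means the language model finds the
    sentence more plausible, so a positive value favors post_text.
    """
    return record["pre_loss"] - record["post_loss"]


# The example record from this document, joined onto a single line.
sample = (
    '{"category": "kanji-conversion", "page": "366", '
    '"pre_rev": "72387", "post_rev": "77423", '
    '"pre_loss": 122.24, "post_loss": 120.72, '
    '"pre_text": "信長の死後、豊臣秀吉が実権を握ると、前田利家は加賀も領して、金沢に入場した。", '
    '"post_text": "信長の死後、豊臣秀吉が実権を握ると、前田利家は加賀も領して、金沢に入城した。", '
    '"diffs": [{"pre": "入場", "post": "入城"}]}'
)

records = list(parse_records([sample]))

# Count how many records fall into each typo category.
print(Counter(r["category"] for r in records))

# Show each morpheme-level correction.
for r in records:
    for d in r["diffs"]:
        print(d["pre"], "->", d["post"])
```

To process a downloaded file, an open file handle (opened with encoding="utf-8") can be passed to parse_records in place of the one-element list, since a text file iterates line by line.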

Download

This dataset has a training set and a test set. Unlike the training set, the test set is filtered using crowdsourced evaluation results, so it contains less noise.

License

The license for this dataset is the same as the license for Japanese Wikipedia, which is CC-BY-SA 3.0. For more information, please refer to Japanese Wikipedia License.

Update history

References