JWTD - LANGUAGE MEDIA PROCESSING LAB

Japanese Wikipedia Typo Dataset

This is a Japanese typo dataset built from Wikipedia's revision history. Typo-correction sentence pairs are extracted by comparing each revision with the one immediately preceding it. The dataset covers four categories of typos (substitution, deletion, insertion, and kanji-conversion) and contains approximately half a million sentence pairs in total. See References for more information.

Data format

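The data is distributed in JSON Lines (jsonl) format, with one JSON object per sentence pair. A sample record is shown below, wrapped across several lines for display.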

{"category": "kanji-conversion", "page": "366", "pre_rev": "72387", "post_rev": "77423", "pre_loss": 122.24,
"post_loss": 120.72, "pre_text": "信長の死後、豊臣秀吉が実権を握ると、前田利家は加賀も領して、金沢に入場した。",
"post_text": "信長の死後、豊臣秀吉が実権を握ると、前田利家は加賀も領して、金沢に入城した。",
"diffs": [{"pre": "入場", "post": "入城"}]}

category is the typo category (substitution, deletion, insertion, or kanji-conversion). page is the Wikipedia article page ID. pre_rev and post_rev are the Wikipedia revision IDs before and after the edit. pre_loss and post_loss are the total loss values that a character-based LSTM language model assigns to the sentence before and after the edit. pre_text and post_text are the sentence before and after the edit. diffs lists the morpheme-level differences between pre_text and post_text.
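As a rough illustration of how records in this format might be consumed, the following Python sketch parses JSON Lines input, counts records per category, and prints the morpheme-level diffs. The helper names (parse_records, count_categories) are hypothetical and not part of the dataset distribution. The sample line is the record shown above, joined onto a single line.

```python
import json
from collections import Counter

# The sample record from this page, joined onto one line as JSON Lines expects.
SAMPLE_LINE = (
    '{"category": "kanji-conversion", "page": "366", "pre_rev": "72387", '
    '"post_rev": "77423", "pre_loss": 122.24, "post_loss": 120.72, '
    '"pre_text": "信長の死後、豊臣秀吉が実権を握ると、前田利家は加賀も領して、金沢に入場した。", '
    '"post_text": "信長の死後、豊臣秀吉が実権を握ると、前田利家は加賀も領して、金沢に入城した。", '
    '"diffs": [{"pre": "入場", "post": "入城"}]}'
)


def parse_records(lines):
    """Yield one dict per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_categories(records):
    """Count how many records fall into each typo category (hypothetical helper)."""
    return Counter(record["category"] for record in records)


records = list(parse_records([SAMPLE_LINE]))
category_counts = count_categories(records)
first = records[0]

# Print each morpheme-level difference as "before -> after".
for diff in first["diffs"]:
    print(diff["pre"], "->", diff["post"])
print(dict(category_counts))
```

To read a downloaded file instead, an open file object can be passed to parse_records, for example open(path, encoding="utf-8"), where path is wherever the extracted jsonl file was saved. The file names inside the archive are not stated on this page.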

Download

The dataset consists of a training set and a test set. Unlike the training set, the test set was filtered using crowdsourced evaluation results, so it contains less noise.

Download (282MB)

License

This dataset is distributed under the same license as Japanese Wikipedia, CC-BY-SA 3.0. For more information, please refer to the Japanese Wikipedia license.

Update history

Release. April 25, 2020

References

[1] Yu Tanaka, Yugo Murawaki, Daisuke Kawahara, Sadao Kurohashi: Building a Japanese Typo Dataset from Wikipedia's Revision History, ACL 2020 Student Research Workshop.