KWDLC - LANGUAGE MEDIA PROCESSING LAB

Kyoto University Web Document Leads Corpus †

This is a Japanese text corpus that consists of lead three sentences of web documents with various linguistic annotations. By collecting lead three sentences of web documents, this corpus contains documents with various genres and styles, such as news articles, encyclopedic articles, blogs and commercial pages. It comprises approximately 5,000 documents, which correspond to 15,000 sentences.

The linguistic annotations consist of annotations of morphology, named entities, dependencies, predicate-argument structures including zero anaphora, coreferences, and discourse. All the annotations were given by manually modifying automatic analyses of the morphological analyzer JUMAN and the dependency, case structure and anaphora analyzer KNP. For the discourse annotations, a small corpus annotated by experts and a large corpus annotated by crowdsourcing are included.

↑

Download †

GitHub repository: https://github.com/ku-nlp/KWDLC
Old version
- Kyoto University Web Document Leads Corpus Version 1.0 (bzip2 compression; 4,527,262 bytes): Download Form

↑

References †

Masatsugu Hangyo, Daisuke Kawahara and Sadao Kurohashi.
Building a Diverse Document Leads Corpus Annotated with Semantic Relations,
In Proceedings of the 26th Pacific Asia Conference on Language Information and Computing, pp.535-544, 2012.

Daisuke Kawahara, Yuichiro Machida, Tomohide Shibata, Sadao Kurohashi, Hayato Kobayashi and Manabu Sassano.
Rapid Development of a Corpus with Discourse Annotations using Two-stage Crowdsourcing,
In Proceedings of the 25th International Conference on Computational Linguistics, pp.269-278, 2014.

↑

Acknowledgment †

The creation of this corpus was supported by JSPS KAKENHI Grant Number 24300053 and JST CREST "Advanced Core Technologies for Big Data Integration." The discourse annotations were acquired by crowdsourcing under the support of Yahoo! Japan Corporation. We deeply appreciate their support.

↑

Contact †

If you have any questions or problems about this corpus, please send an email to nl-resource at nlp.ist.i.kyoto-u.ac.jp. If you have a request to add source information or to delete a document in the corpus, please send an email to this mail address.