Kyoto University Web Document Leads Corpus

This is a Japanese text corpus consisting of the first three sentences of web documents, annotated with various kinds of linguistic information. Because only the leading three sentences of each document are collected, the corpus covers a wide range of genres and styles, such as news articles, encyclopedic articles, blogs, and commercial pages. It comprises approximately 5,000 documents, or about 15,000 sentences.
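To illustrate what taking the lead (first) three sentences of a document means, here is a minimal Python sketch. The function name `lead_sentences` and the naive split on sentence-final punctuation are assumptions made for illustration only; the description above does not say how sentences were segmented when the corpus was built, and the sample text is invented.

```python
import re
from typing import List


def lead_sentences(text: str, n: int = 3) -> List[str]:
    """Return the first n sentences of text (hypothetical helper).

    Assumption: a sentence ends at one of 。！？!? ; real web text
    usually needs a more careful segmenter than this.
    """
    # Match a run of non-terminator characters plus an optional terminator,
    # so each sentence keeps its own final punctuation mark.
    parts = re.findall(r"[^。！？!?]+[。！？!?]?", text)
    sentences = [p.strip() for p in parts if p.strip()]
    return sentences[:n]


# Invented sample document with four sentences.
doc = "今日は晴れです。散歩に行きました。公園で本を読みました。夕方に帰りました。"
for sentence in lead_sentences(doc):
    print(sentence)
```

Documents shorter than three sentences are returned whole, since slicing past the end of the list is harmless in Python.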

The linguistic annotations cover morphology, named entities, dependencies, predicate-argument structures (including zero anaphora), coreference, and discourse. All annotations were produced by manually correcting the automatic analyses of the morphological analyzer JUMAN and of KNP, an analyzer for dependencies, case structures, and anaphora. For the discourse annotations, both a small corpus annotated by experts and a large corpus annotated by crowdsourcing are included.

Download

References

Acknowledgment

The creation of this corpus was supported by JSPS KAKENHI Grant Number 24300053 and JST CREST "Advanced Core Technologies for Big Data Integration." The discourse annotations were obtained through crowdsourcing with the support of Yahoo! Japan Corporation. We deeply appreciate their support.

Contact

If you have any questions about this corpus or run into problems with it, please send an email to nl-resource at nlp.ist.i.kyoto-u.ac.jp. Requests to add source information or to delete a document from the corpus should be sent to the same address.