Backup of DDLC(No. 7) - LANGUAGE MEDIA PROCESSING LAB

List of Backups
- 1 (2014-08-12 (Tue) 06:19:52)
- 2 (2014-08-20 (Wed) 01:33:33)
- 3 (2014-11-09 (Sun) 08:16:23)
- 4 (2014-12-05 (Fri) 06:55:25)
- 5 (2015-04-24 (Fri) 03:21:48)
- 6 (2016-03-28 (Mon) 01:11:51)
- 7 (2016-03-31 (Thu) 08:26:49)

Kyoto University Web Document Leads Corpus

This Japanese text corpus consists of the first three sentences of web documents, annotated with several layers of linguistic information. Because only the leading three sentences of each document are collected, the corpus covers a wide range of genres and styles, such as news articles, encyclopedic articles, blogs, and commercial pages. It comprises approximately 5,000 documents.

The annotations cover morphology, named entities, dependencies, predicate-argument structures (including zero anaphora), coreference, and discourse. All annotations except the discourse annotations were produced by manually correcting the automatic analyses of the morphological analyzer JUMAN and of KNP, a dependency, case structure, and anaphora analyzer. The discourse annotations were obtained through crowdsourcing.

Download

Kyoto University Web Document Leads Corpus Version 1.0 (bzip2-compressed; 4,526,420 bytes)
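
Since the archive size is stated above, a downloaded copy can be sanity-checked before use. The following is a minimal sketch, not an official tool: the function name check_download and the file name in the usage note are hypothetical, and the check only compares the byte count and confirms that the file begins with a readable bzip2 stream. It does not verify the contents of the corpus.

```python
import bz2
import os

# Byte count stated on this page for Version 1.0.
EXPECTED_SIZE = 4_526_420


def check_download(path, expected_size=EXPECTED_SIZE):
    """Return True if the file has the expected size and opens as bzip2."""
    if os.path.getsize(path) != expected_size:
        return False
    try:
        with bz2.open(path, "rb") as stream:
            stream.read(1)  # force decompression of the first byte
    except (OSError, EOFError):
        # OSError: not a bzip2 stream; EOFError: truncated stream
        return False
    return True
```

For example, check_download("corpus.bz2") would be called on the local copy, where "corpus.bz2" is a placeholder for whatever name the downloaded file was saved under.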

References

Masatsugu Hangyo, Daisuke Kawahara and Sadao Kurohashi.
Building a Diverse Document Leads Corpus Annotated with Semantic Relations,
In Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, pp. 535-544, 2012.

Daisuke Kawahara, Yuichiro Machida, Tomohide Shibata, Sadao Kurohashi, Hayato Kobayashi and Manabu Sassano.
Rapid Development of a Corpus with Discourse Annotations using Two-stage Crowdsourcing,
In Proceedings of the 25th International Conference on Computational Linguistics, pp. 269-278, 2014.