Chinese Penn Treebank 5.0 Reannotation - LANGUAGE MEDIA PROCESSING LAB

Chinese Penn Treebank 5.0 Re-annotation

Description

A re-annotated CTB5.0 corpus based on a set of new annotation guidelines for Chinese word segmentation and POS tagging that address the data consistency issue.

The definition of “word” is an open issue in Chinese linguistics. In previous studies of Chinese corpus annotation, whether a meaningful string counts as a word has been decided by morphological analysis. A morpheme in Chinese is defined as the smallest combination of meaning and phonetic sound in the language, and morphemes are classified into two major types:

free morphemes, which can either be words by themselves or form words with other morphemes; and
bound morphemes, which can only form words by attaching to other morphemes.

One issue with defining words by morpheme classification is that it can undermine the consistency of word representation. For example, “论” (theory) is a bound morpheme, so the string “进化论” (theory of evolution) is treated as one word; the string “进化 | 理论” (theory of evolution), on the other hand, is treated as two words, even though the two strings have the same meaning and structure. Similarly, “者” (person) is considered a bound morpheme, so “反对自由贸易者” (people who are against free trade) is treated as one word, while the string without the bound morpheme, “反对 | 自由 | 贸易” (be against free trade), can only be treated as a phrase of three words.

The morphology-based word definition can also worsen data sparsity in corpus annotation. In the Penn Chinese Treebank 5.0 (CTB5), an annotated corpus widely used to train Chinese morphological analysis systems, we found that one major source of out-of-vocabulary (OOV) words is compounds that end with a monosyllabic bound morpheme. For example, the compounds 利用率 (utilization rate) and 次品率 (rate of defective products) end with the bound morpheme 率 (rate); 完成度 (degree of completion) and 活跃度 (degree of activity) end with the bound morpheme 度 (degree); 持续性 (sustainability) and 挥发性 (volatility) end with the bound morpheme 性 (property). While these compounds are sparse in the corpus, their component morphemes are observed frequently, so a word segmenter can observe and learn these OOV words if the morphemes are split into individual words in the annotation. We therefore re-annotated the entire CTB5 using a new approach that addresses both issues: inconsistency and data sparsity.
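The sparsity argument can be illustrated with a toy calculation. The sketch below is not the re-annotation procedure itself: the word lists, the suffix set, and the helper names split_suffix and oov_rate are made up for illustration, and the actual guidelines may split compounds differently.

```python
# Toy illustration only: the word lists and the suffix set are invented for
# this sketch and are not taken from CTB5 or the re-annotation guidelines.
BOUND_SUFFIXES = {"率", "度", "性"}


def split_suffix(word):
    """Split a trailing monosyllabic bound morpheme off a compound."""
    if len(word) > 1 and word[-1] in BOUND_SUFFIXES:
        return [word[:-1], word[-1]]
    return [word]


def oov_rate(train_words, test_words):
    """Fraction of test tokens that never occur in the training tokens."""
    vocab = set(train_words)
    unseen = [w for w in test_words if w not in vocab]
    return len(unseen) / len(test_words)


# Morphology-based segmentation: each compound is a single word.
train = ["利用率", "完成度", "持续性", "次品", "活跃", "挥发"]
test = ["次品率", "活跃度", "挥发性"]

# Alternative segmentation: the bound suffix becomes its own token.
train_split = [tok for w in train for tok in split_suffix(w)]
test_split = [tok for w in test for tok in split_suffix(w)]

print(oov_rate(train, test))              # 1.0: every test compound is unseen
print(oov_rate(train_split, test_split))  # 0.0: every test token was seen
```

In this contrived setting every test compound is OOV under whole-word segmentation, while all of its parts are already in the vocabulary once the suffix is split off.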

This package includes the following files:

convert_ctb.py: a script to patch the original CTB5.0 with re-annotated word segmentation and POS tags.
reannotation_map.txt: a mapping from the original CTB5.0 word segmentation and POS tagging to the re-annotated version.
README: usage of the script.

Usage

1. Patch one of the original CTB files with the version in the "patch" directory:

patch path-to-PennChineseTreebank50/data/postagged/chtb_414.pos < patch/chtb_414.patch

2. Combine the following pos-tagged files into one file, "chtb.pos":

PennChineseTreebank50/data/postagged/chtb_[1-270].pos

PennChineseTreebank50/data/postagged/chtb_[400-931].pos

PennChineseTreebank50/data/postagged/chtb_[1001-1151].pos
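Step 2 could be scripted along the following lines. This is a minimal sketch, not part of the package: the function names are hypothetical, and it assumes file numbers are either zero-padded to three digits (chtb_001.pos) or unpadded (chtb_1.pos), so it tries both spellings and skips numbers with no file. Files are copied as raw bytes, so no assumption is made about their text encoding.

```python
from pathlib import Path

# Inclusive file-number ranges listed in step 2.
RANGES = [(1, 270), (400, 931), (1001, 1151)]


def candidate_names(n):
    # Assumption: names are zero-padded to three digits or not padded at all.
    return [f"chtb_{n:03d}.pos", f"chtb_{n}.pos"]


def combine(postagged_dir, out_path="chtb.pos"):
    """Concatenate the listed .pos files in numeric order; return names used."""
    postagged_dir = Path(postagged_dir)
    used = []
    with open(out_path, "wb") as out:
        for lo, hi in RANGES:
            for n in range(lo, hi + 1):
                for name in candidate_names(n):
                    src = postagged_dir / name
                    if src.is_file():
                        data = src.read_bytes()
                        out.write(data)
                        # Keep files from running together on one line.
                        if data and not data.endswith(b"\n"):
                            out.write(b"\n")
                        used.append(name)
                        break
    return used
```

Calling combine("PennChineseTreebank50/data/postagged") writes chtb.pos to the current directory and returns the list of files that were found, which makes it easy to check for gaps.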

3. Convert the pos-tagged file into one-sentence-per-line format:

python script/convert_ctb_postagged.py < chtb.pos > [original CTB5 file]

4. Apply the re-annotation map to the one-sentence-per-line file to produce the re-annotated corpus:

python script/convert_ctb.py [original CTB5 file] annotation/reannotation_map.txt > [reannotated CTB5 file]
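Steps 3 and 4 can also be chained from a small driver. The sketch below only mirrors the two command lines above; build_steps and run_steps are hypothetical helpers, the intermediate and output file names are placeholders chosen by the caller, and it assumes a "python" executable on the PATH that is compatible with the packaged scripts.

```python
import subprocess


def build_steps(pos_file, sentence_file, map_file, out_file, python="python"):
    """Return (argv, stdin_path, stdout_path) tuples for steps 3 and 4."""
    return [
        # Step 3: reads the combined .pos file on stdin.
        ([python, "script/convert_ctb_postagged.py"], pos_file, sentence_file),
        # Step 4: takes the step 3 output and the map as arguments.
        ([python, "script/convert_ctb.py", sentence_file, map_file],
         None, out_file),
    ]


def run_steps(steps):
    """Run each step in order, wiring stdin/stdout to the given files."""
    for argv, stdin_path, stdout_path in steps:
        stdin = open(stdin_path, "rb") if stdin_path else None
        try:
            with open(stdout_path, "wb") as stdout:
                # check=True stops the pipeline if a script fails.
                subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)
        finally:
            if stdin:
                stdin.close()
```

For example, run_steps(build_steps("chtb.pos", "ctb5.txt", "annotation/reannotation_map.txt", "ctb5_reannotated.txt")) would run both conversions from the package directory, where ctb5.txt and ctb5_reannotated.txt are placeholder names.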

Download

ctb5_reannotation.tar.gz

Contact

MAIL: msmoshen at gmail.com