Chinese Character-level POS annotation

Description

CTB 5.0 (Penn Chinese Treebank 5.0) augmented with character-level part-of-speech (POS) annotation.

Each Chinese character originated as a word with a complete and independent meaning. It can therefore be treated as the minimal morphological unit of Chinese and should carry its own part-of-speech. For example, the character “打” (beat) is a verb and the character “破” (broken) is an adjective. A word, on the other hand, is either a single character or a compound formed from single-character words. For example, the verb “打破” (break) can be seen as a compound of these two single-character words with the construction “verb + adjective”.
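As a minimal illustration of the idea above, a word can be represented as a sequence of (character, POS) pairs. The class name, the helper functions, and the plain-English labels "verb" and "adjective" are assumptions made for this sketch; they are not the tagset or the file format used in this package.

```python
from typing import List, NamedTuple


class TaggedChar(NamedTuple):
    """One character together with its (hypothetical) character-level POS label."""
    char: str
    cpos: str


def surface(chars: List[TaggedChar]) -> str:
    """Join the characters back into the surface word."""
    return "".join(c.char for c in chars)


def construction(chars: List[TaggedChar]) -> str:
    """Describe the word's internal construction, e.g. 'verb + adjective'."""
    return " + ".join(c.cpos for c in chars)


# The compound from the text: 打 (verb) + 破 (adjective) = 打破 (break)
dapo = [TaggedChar("打", "verb"), TaggedChar("破", "adjective")]
print(surface(dapo), "=", construction(dapo))  # 打破 = verb + adjective
```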

This package includes the following files:

For more details, please refer to the paper "Chinese Morphological Analysis with Character-level POS Tagging" (http://www.aclweb.org/anthology/P14-2042).

Usage

1. Patch one of the original CTB files with the version in the "patch" dir:

patch path-to-PennChineseTreebank50/data/postagged/chtb_414.pos < patch/chtb_414.patch

2. Combine the following pos-tagged files into one file, "chtb.pos":

PennChineseTreebank50/data/postagged/chtb_[1-270].pos

PennChineseTreebank50/data/postagged/chtb_[400-931].pos

PennChineseTreebank50/data/postagged/chtb_[1001-1151].pos

3. Convert the combined pos-tagged file into one-sentence-per-line format:

python script/convert_ctb_postagged.py < chtb.pos > [original CTB5 file]

4. Add the character-level POS annotation from annotation/cpos_dic.txt to the converted file:

python script/add_cpos_to_ctb.py [original CTB5 file] annotation/cpos_dic.txt > [reannotated CTB5 file]
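Steps 2 to 4 could be scripted roughly as follows. This is a sketch and not part of the package. The function names and the intermediate file names "ctb5.txt" and "ctb5_cpos.txt" are hypothetical. It assumes the bracketed ranges in step 2 are inclusive file-number ranges, and it reads the number out of each file name so that zero-padded names are matched as well. Files are concatenated as raw bytes so the original encoding is left untouched. The interpreter name is a parameter because the package scripts may need a different Python than the one running this sketch.

```python
import re
import subprocess
from contextlib import ExitStack
from pathlib import Path

# Inclusive file-number ranges listed in step 2.
RANGES = [(1, 270), (400, 931), (1001, 1151)]
_POS_NAME = re.compile(r"chtb_(\d+)\.pos")


def select_files(directory, ranges=RANGES):
    """Return the chtb_*.pos files whose number falls in one of the ranges, in numeric order."""
    selected = []
    for path in Path(directory).iterdir():
        match = _POS_NAME.fullmatch(path.name)
        if match and any(lo <= int(match.group(1)) <= hi for lo, hi in ranges):
            selected.append((int(match.group(1)), path))
    return [path for _, path in sorted(selected, key=lambda item: item[0])]


def combine(directory, output="chtb.pos"):
    """Step 2: concatenate the selected files into one file; returns the file count."""
    files = select_files(directory)
    with open(output, "wb") as out:
        for path in files:
            data = path.read_bytes()
            out.write(data)
            if data and not data.endswith(b"\n"):
                out.write(b"\n")  # keep files from running together
    return len(files)


def run_step(args, stdin_path=None, stdout_path=None):
    """Run one command, redirecting stdin/stdout to files like '<' and '>' in a shell."""
    with ExitStack() as stack:
        stdin = stack.enter_context(open(stdin_path, "rb")) if stdin_path else None
        stdout = stack.enter_context(open(stdout_path, "wb")) if stdout_path else None
        subprocess.run(args, stdin=stdin, stdout=stdout, check=True)


def reannotate(combined="chtb.pos", converted="ctb5.txt",
               output="ctb5_cpos.txt", python="python"):
    """Steps 3 and 4: run the two package scripts with the arguments shown above."""
    run_step([python, "script/convert_ctb_postagged.py"],
             stdin_path=combined, stdout_path=converted)
    run_step([python, "script/add_cpos_to_ctb.py", converted, "annotation/cpos_dic.txt"],
             stdout_path=output)
```

Run from the package root after applying the patch in step 1, this would be used as combine("path-to-PennChineseTreebank50/data/postagged") followed by reannotate().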


Download


Contact