Chinese Character-level POS Annotation - LANGUAGE MEDIA PROCESSING LAB

Chinese Character-level POS Annotation

Description

This package augments CTB5.0 (Penn Chinese Treebank 5.0) with character-level part-of-speech (POS) annotation.

Each Chinese character originated as a word with a complete and independent meaning. It should therefore be treated as the minimal morphological unit of Chinese and carry its own part of speech. For example, the character “打” (beat) is a verb and the character “破” (broken) is an adjective. A word, on the other hand, is either a single character or a compound formed from single-character words. For example, the verb “打破” (break) can be seen as a compound of two single-character words with the construction “verb + adjective”.
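As a rough illustration of this view, a word can be modeled as a sequence of characters that each carry their own POS. The sketch below is not part of the package: the class names are hypothetical, and the labels "verb" and "adjective" are plain-English placeholders rather than the actual tag names, which are defined in the README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharUnit:
    """One character with its character-level POS (placeholder label)."""
    char: str
    cpos: str


@dataclass(frozen=True)
class Word:
    """A word with its word-level POS and its character-level units."""
    pos: str
    chars: tuple

    @property
    def surface(self):
        # Join the characters back into the word form.
        return "".join(c.char for c in self.chars)

    @property
    def construction(self):
        # Describe the word as a compound of character-level POS labels.
        return " + ".join(c.cpos for c in self.chars)


# The example from the text: the verb 打破 built from 打 (verb) and 破 (adjective).
da_po = Word(pos="verb", chars=(CharUnit("打", "verb"), CharUnit("破", "adjective")))
print(da_po.surface, "=", da_po.construction)
```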

This package includes the following files:

add_cpos_to_ctb.py: a script to patch the original CTB5.0 with character-level POS annotation.
cpos_dic.txt: a lexicon used to augment words in the original CTB5.0 with a character-level POS layer.
README: definition of each character-level POS tag and usage of the script.

For more details, please refer to the paper "Chinese Morphological Analysis with Character-level POS Tagging" (http://www.aclweb.org/anthology/P14-2042).

Usage

1. Patch one of the original CTB files using the patch file in the "patch" directory:

patch path-to-PennChineseTreebank50/data/postagged/chtb_414.pos < patch/chtb_414.patch

2. Combine the following pos-tagged files into a single file, "chtb.pos" (the bracketed numbers denote ranges of file numbers):

PennChineseTreebank50/data/postagged/chtb_[1-270].pos

PennChineseTreebank50/data/postagged/chtb_[400-931].pos

PennChineseTreebank50/data/postagged/chtb_[1001-1151].pos
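The bracketed ranges above are not valid shell globs, so the files have to be selected by number. A minimal sketch of step 2 in Python follows. It is not part of the package: the function names are hypothetical, and the file-name pattern (numbers zero-padded to at least three digits, as in chtb_414.pos) is an assumption. Numbers with no corresponding file are skipped, and files are copied byte-for-byte so that the original encoding is left untouched.

```python
from pathlib import Path

# Inclusive file-number ranges taken from the list above.
RANGES = [(1, 270), (400, 931), (1001, 1151)]


def candidate_names(ranges=RANGES):
    """Yield candidate file names in numeric order.

    Assumption: names look like chtb_001.pos, chtb_414.pos, chtb_1001.pos.
    """
    for start, end in ranges:
        for number in range(start, end + 1):
            yield f"chtb_{number:03d}.pos"


def combine_postagged(postagged_dir, out_path, ranges=RANGES):
    """Concatenate the existing files into out_path; return how many were merged."""
    postagged_dir = Path(postagged_dir)
    merged = 0
    with open(out_path, "wb") as out:
        for name in candidate_names(ranges):
            src = postagged_dir / name
            if not src.is_file():
                continue  # not every number in a range necessarily exists
            data = src.read_bytes()
            out.write(data)
            # Make sure consecutive files do not run together on one line.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
            merged += 1
    return merged
```

For example, combine_postagged("PennChineseTreebank50/data/postagged", "chtb.pos") would write the combined file and return the number of files it found.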

3. Convert the pos-tagged file into one-sentence-per-line format:

python script/convert_ctb_postagged.py < chtb.pos > [original CTB file]

4. Add the character-level POS annotation using the lexicon:

python script/add_cpos_to_ctb.py [original CTB5 file] annotation/cpos_dic.txt > [reannotated CTB5 file]
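Steps 3 and 4 can also be chained from Python. The sketch below simply wraps the two commands shown above with subprocess. The function name and the intermediate and output file names are hypothetical, and the Python version the scripts require is not stated here, so the interpreter is left as a parameter.

```python
import subprocess


def run_steps_3_and_4(combined="chtb.pos",
                      converted="ctb5_original.txt",        # hypothetical name
                      reannotated="ctb5_reannotated.txt",   # hypothetical name
                      python="python",
                      cwd=None):
    """Run the conversion and annotation scripts from the package directory (cwd)."""
    # Step 3: feed chtb.pos on stdin and capture the one-sentence-per-line output.
    with open(combined, "rb") as src, open(converted, "wb") as dst:
        subprocess.run([python, "script/convert_ctb_postagged.py"],
                       stdin=src, stdout=dst, check=True, cwd=cwd)
    # Step 4: add character-level POS using the lexicon; output goes to a file.
    with open(reannotated, "wb") as dst:
        subprocess.run([python, "script/add_cpos_to_ctb.py",
                        converted, "annotation/cpos_dic.txt"],
                       stdout=dst, check=True, cwd=cwd)
    return reannotated
```

With check=True, a failure in either script raises subprocess.CalledProcessError instead of silently leaving an incomplete output file.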

Download

ctb5_charlevel_pos_annotation.tar.gz

Contact

MAIL: shen at nlp.ist.i.kyoto-u.ac.jp