Augmented CTB5.0 with character-level part-of-speech annotation.
Each Chinese character was originally created as a word with a complete and independent meaning. It should therefore be treated as the minimal morphological unit of the Chinese language and should carry its own part of speech. For example, the character “打” (beat) is a verb and the character “破” (broken) is an adjective. A word, on the other hand, is either a single character or a compound formed from single-character words. For example, the verb “打破” (break) can be seen as a compound of these two single-character words with the construction “verb + adjective”.
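The idea can be illustrated with a minimal Python sketch. The class and function names here (Char, construction) are hypothetical and are not part of this package, and the plain labels "verb" and "adjective" stand in for whatever tag set the annotation actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Char:
    """A single character together with its character-level part of speech."""
    form: str
    pos: str


def construction(chars):
    """Describe a word by the parts of speech of its characters."""
    return " + ".join(c.pos for c in chars)


# The example from the text: 打 (beat) is a verb, 破 (broken) is an adjective.
da = Char("打", "verb")
po = Char("破", "adjective")
word = [da, po]

print("".join(c.form for c in word), "=", construction(word))
# prints: 打破 = verb + adjective
```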
This package includes the following files:
For more details, please refer to the paper "Chinese Morphological Analysis with Character-level POS Tagging" (http://www.aclweb.org/anthology/P14-2042)
Usage:
1. Patch one of the original CTB files using the patch file in the "patch" directory:
patch path-to-PennChineseTreebank50/data/postagged/chtb_414.pos < patch/chtb_414.patch
2. Combine the following POS-tagged files (file numbers in the ranges shown) into one file, "chtb.pos":
PennChineseTreebank50/data/postagged/chtb_[1-270].pos
PennChineseTreebank50/data/postagged/chtb_[400-931].pos
PennChineseTreebank50/data/postagged/chtb_[1001-1151].pos
3. Convert the POS-tagged file into one-sentence-per-line format:
python script/convert_ctb_postagged.py < chtb.pos > [original CTB5 file]
4. Add the character-level POS annotation from annotation/cpos_dic.txt:
python script/add_cpos_to_ctb.py [original CTB5 file] annotation/cpos_dic.txt > [reannotated CTB5 file]
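Step 2 above can be done by hand or with a small helper. The following is a minimal sketch, not part of this package. It assumes the files are named chtb_<number>.pos (as in chtb_414.pos) and that the bracketed ranges are inclusive file numbers. The function names (file_number, select_files, combine) are hypothetical. Files are copied as raw bytes, so no assumption is made about their text encoding.

```python
import re
import sys
from pathlib import Path

# Inclusive file-number ranges listed in step 2.
RANGES = [(1, 270), (400, 931), (1001, 1151)]

# Assumed naming scheme: chtb_<number>.pos, with or without leading zeros.
NAME_RE = re.compile(r"chtb_(\d+)\.pos$")


def file_number(path):
    """Return the numeric part of a chtb_<number>.pos name, or None."""
    m = NAME_RE.match(Path(path).name)
    return int(m.group(1)) if m else None


def select_files(directory):
    """Return the .pos files whose numbers fall in RANGES, in numeric order."""
    numbered = []
    for p in Path(directory).iterdir():
        n = file_number(p)
        if n is not None and any(lo <= n <= hi for lo, hi in RANGES):
            numbered.append((n, p))
    return [p for _, p in sorted(numbered)]


def combine(directory, out_path):
    """Concatenate the selected files into out_path; return the file count."""
    files = select_files(directory)
    with open(out_path, "wb") as out:
        for p in files:
            data = p.read_bytes()
            out.write(data)
            # Keep files separated even if one lacks a trailing newline.
            if data and not data.endswith(b"\n"):
                out.write(b"\n")
    return len(files)


if __name__ == "__main__":
    # Usage (hypothetical script name):
    #   python combine_pos.py path-to-PennChineseTreebank50/data/postagged chtb.pos
    if len(sys.argv) == 3:
        count = combine(sys.argv[1], sys.argv[2])
        print(f"combined {count} files into {sys.argv[2]}", file=sys.stderr)
```

Run this after applying the patch in step 1 so that the patched chtb_414.pos is the version that ends up in chtb.pos.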