Chinese Character-level POS annotation

Description

Augmented CTB5.0 with character-level part-of-speech annotation.

Each Chinese character originated as a word with a complete and independent meaning. It should therefore be treated as the minimal morphological unit of Chinese, and should carry its own part of speech. For example, the character “打” (beat) is a verb and the character “破” (broken) is an adjective. A word, on the other hand, is either a single character or a compound formed from single-character words. For example, the verb “打破” (break) can be seen as a compound of the two single-character words “打” and “破”, with the construction “verb + adjective”.
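The idea can be illustrated with a minimal Python sketch. The names CHAR_POS and construction are hypothetical and are not part of this package. The labels are the plain English words from the example above, not the tagset used in the annotation.

```python
# Hypothetical character-level POS lookup, built from the example in the text.
# The labels are illustrative English words, not the package's actual tagset.
CHAR_POS = {
    "打": "verb",       # beat
    "破": "adjective",  # broken
}


def construction(word, char_pos=CHAR_POS):
    """Describe a word as a sequence of character-level POS labels.

    Characters missing from the lookup are labeled "unknown".
    """
    return " + ".join(char_pos.get(ch, "unknown") for ch in word)


print(construction("打破"))  # verb + adjective
```

A single-character word such as “打” simply yields its own label, which matches the view that a word is either one character or a compound of single-character words.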

This package includes the following files:

For more details, please refer to the paper "Chinese Morphological Analysis with Character-level POS Tagging" (http://www.aclweb.org/anthology/P14-2042)

Usage

python add_cpos_to_ctb.py [original CTB5 file] cpos_dic.txt > [reannotated CTB5 file]

"original CTB5 file" corresponds to chapters 1−260 in CTB5.0 (LDC2005T01).

Each line of this file is a segmented and POS-tagged sentence:

All sentences should be encoded in UTF-8.
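Because the input must be UTF-8, it may help to check the encoding of the CTB5 file before running the script. The following is a minimal sketch and is not part of this package; the helper name first_non_utf8_line is an assumption made for illustration.

```python
def first_non_utf8_line(path):
    """Return the 1-based number of the first line that is not valid UTF-8.

    Returns None if every line in the file decodes cleanly.
    This helper is hypothetical and not shipped with the package.
    """
    # Read raw bytes so that decoding errors can be caught line by line.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                return lineno
    return None
```

Calling the helper on the input file returns None when the file is safe to pass to add_cpos_to_ctb.py. Otherwise it returns the number of the first line that needs re-encoding, for example a line left in the GB encoding in which CTB data is sometimes distributed.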


Download


Contact

