Translated by Jim Barnett.
Translator's comments appear between slashes: //...//
User's Guide for the Juman System, a User-Extensible Morphological
Analyzer for Japanese. Version 0.5
Yuji Matsumoto, Sadao Kurohashi, Yutaka Nyoki, Hitoshi Shinho,
and Makoto Nagao.
1. The Morphological Grammar of Japanese
1.1 Part of Speech and Subpart of Speech
Morphemes that perform a grammatically similar role are classified
together. We call the resulting classes parts of speech (keitaihinsi.)
The current system allows the part of speech to be further
sub-classified. These sub-classes are called subparts of speech
(keitaihinsisaibunrui.)
1.2 Paradigm and Desinence
Like the "verb," "adjective," and "auxiliary verb" described in
school grammars, some morphemes change their form according to the
morphemes that occur before and after them. This change of form is
called "conjugation." The part of the surface form that doesn't change
is called the root, and the part that does change is called the
suffix. Most conjugation is regular. Categorizing
morphemes according to this regularity, we call the resulting classes
Paradigms. The combined surface forms that actually occur are called
desinences.
// "Paradigm" and "desinence" are arcane words, but they're the best I
can do. "Paradigm" is a generalization of "conjugation" for verbs and
"declension" for nouns. It denotes a set of words that inflect the
same way. A "desinence" is one of the inflected forms of a Paradigm.
To take an example from Latin, the first conjugation verbs (the ones
with principal parts -are, -avi, -atus) form a Paradigm, as do the
second declension nouns (the ones with nominatives in -us and
accusatives in -um.) The first person singular active present of a
verb (e.g., the form "amo" of the verb "amare") is a desinence, as is
the accusative singular of a noun. //
1.3 Morpheme Structure
If morpheme m has part of speech H1, subpart of speech H2, Paradigm
K1, Desinence K2 and surface form M, we call the list (H1, H2, K1, K2, M)
the "morpheme structure of morpheme m."
Elements H1, H2, K1, and K2 are filled with the names of the part of
speech, the subpart of speech, the paradigm and desinence.
Furthermore, M is filled with a surface form expressed in kana or
kanji.
This notation is used to describe the connection rules.
{A morpheme structure may be underspecified, in which case it stands
for the set of morphemes that unify with it.}
1.4 Connection Relations and Connection Rules
The fact that all the members of the sets of morphemes A1 and A2 can
be connected (can occur next to each other in a sentence) is expressed
as: (A1, A2). This is called a Connection Rule. A set of morphemes is
expressed as a list of morpheme structures. Given a connection rule
(A1, A2), every morpheme in every morpheme structure alpha1 in A1 can
be followed by any morpheme in any morpheme structure alpha2 in A2.
The set of Connection Rules is called the Connection Relations.
// If A1 is a list of morpheme structures (a11,...a1n) and similarly A2
= (a21, ... a2k), then (A1, A2) means that any morpheme that unifies
with any of the descriptions (a11...a1n) can be followed by any one
that unifies with one of (a21...a2k) //
2. Dictionary Definitions and Data Structures
Here we discuss the internal data structures of the dictionary (fig.
1). The speciality of this dictionary system is that it allows the user
to freely define parts of speech and Connection Relations.
We call the dictionary that has been defined by the user the User
Dictionary. The User Dictionary is divided into the Grammar Dictionary
and the Morpheme dictionary. The Grammar Dictionary, briefly described
//I'm not sure what "issetsu de nobeta" means//, is used to describe
the morphological grammar of Japanese. It is composed of the Subpart
of Speech Dictionary, the Conjugation Dictionary, the Connection Rule
Dictionary, and the Conjugation Connection Dictionary (section 2.1). The
Morpheme Dictionary contains information about individual morphemes
(sec. 2.2).
The dictionary that is used for morphological analysis is called the
System Dictionary. The System Dictionary consists of the Connection
Table and Connection Matrix, which are generated from the Connection
Rule Dictionary and the Conjugation Connection Dictionary; the Tree
Structure Dictionary, which contains the information in the Morpheme
Dictionary in symbolic form and is constructed by referring to the
Grammar Dictionary; the Keyword Dictionary; the Readings (yomi)
Dictionary; and the Semantic Dictionary.
The dictionaries described here must be placed in the directory
held in the Unix environment variable JUMANPATH.
In this system, JUMANPATH refers to `/juman/dic'.
// The original reads: "hon sistemu wa, JUMANPATH wa
`/juman/dic' o sansho siteiru" //
Furthermore, this system comes equipped with a standard User
Dictionary. It will be introduced in section 2.5.
2.1 Definition of the Grammar Dictionary
The Grammar Dictionary is set up in the following way. Each
dictionary is defined in S-expression format.
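For readers who want to inspect such files programmatically, here is a minimal S-expression reader in Python. It is a sketch under simplifying assumptions: it handles only parentheses and whitespace-separated atoms (no comments or quoted strings), and the sample entries are hypothetical rather than taken from the distributed dictionaries.

```python
def tokenize(text):
    """Split S-expression text into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens):
    """Parse one expression from the front of the token list (consumes tokens)."""
    if not tokens:
        raise ValueError("unexpected end of input")
    token = tokens.pop(0)
    if token == ")":
        raise ValueError("unexpected ')'")
    if token != "(":
        return token  # an atom
    items = []
    while tokens and tokens[0] != ")":
        items.append(parse(tokens))
    if not tokens:
        raise ValueError("missing ')'")
    tokens.pop(0)  # drop the closing ")"
    return items

def read_all(text):
    """Parse every top-level expression in the text into nested Python lists."""
    tokens = tokenize(text)
    expressions = []
    while tokens:
        expressions.append(parse(tokens))
    return expressions

# Hypothetical sample: two top-level lists.
sample = "((noun) (common) (proper)) ((verb %))"
print(read_all(sample))
# [[['noun'], ['common'], ['proper']], [['verb', '%']]]
```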
1. The Morpheme Subpart of Speech Dictionary (cf. JUMAN.grammar)
Defines the names of the parts and subparts of speech that the
system uses. A single part of speech or subpart of speech is
represented by a single S-expression (list structure). The list's
first element contains the part of speech. If there are subparts
of speech, elements 2 and beyond of the list each contain a
subpart of speech in list format.
Individual parts of speech or subparts of speech are represented
as single element lists, but if the morphemes belonging to the
part or subpart of speech are inflected, the symbol `%' is added
as the second element of the list. If a part of speech has the
symbol `%', all morphemes belonging to any of its subparts of
speech are inflected.
2. Conjugation Dictionary (cf. JUMAN.katuyou)
A table of the Paradigms of inflectable morphemes and the
definitions of the Desinences contained in the Paradigms.
The list's first element contains the Paradigm, the second
element is a list containing lists of Desinences and Suffixes.
The base form of the morpheme cannot be left out as a Desinence,
since for inflectable words it is the base form which is entered
in the Morpheme Dictionary.
The word root is what remains after the removal of the base form
suffix from the form that is entered in the Morpheme Dictionary
// i.e., from the base form //. If the suffix does not appear in
the surface form, the Suffix entry for a Desinence is marked with
`*'.
3. Connection Rule Dictionary (cf. JUMAN.connect)
The Connection Rule Dictionary is the set of connection rules. A
connection rule is a pair of sets of morpheme structures (v.
section 1.3) and is expressed as a two-element list. A set of
morpheme structures can also be expressed as a list.
// A rule is a 2-element list, where each element is either a
morpheme structure or a list of morpheme structures. In the
latter case, the rule can be viewed as shorthand for a number
of simple rules. Alternatively, we can view each element of the
rule as denoting a set of morphemes, where a morpheme structure
denotes the set of elements that unify it, and a list of
morpheme structures denotes the union of the sets its elements
denote. //
All morphemes contained in the first element of a connection rule can
be connected to all the elements contained in the second element.
A single morpheme may be contained in multiple morpheme structures
within a rule.
A special symbol `*' can be used in any morpheme structure. It
denotes a "don't care" value. // unifies with anything.// For
example, the morpheme structure
alpha1 = (*)
can be viewed as the set of all morphemes. In the same
manner, the morpheme structure
alpha2 = (noun)
expresses the set of all morphemes classified as nouns. Further,
the morpheme structure
alpha3 = (* * * mizenkei)
denotes the set of all morphemes that can take the mizenkei
desinence. <alpha1 >= alpha2, alpha1 >= alpha3.>
// A morpheme structure stands for all the morphemes that unify
with it, where `*' unifies with anything. AND the following
convention holds: each morpheme structure is 5 elements long,
and shorter structures are filled with `*' to the right.
Thus, `(noun)' really stands for `(noun * * * *)' and
`(* * * mizenkei)' stands for `(* * * mizenkei *)'. //
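The padding convention and the "don't care" symbol can be sketched in Python as follows. This is an illustration only: the tiny lexicon, the "-" placeholder for fields that do not apply, and the paradigm name "k-row" are hypothetical, and the lexicon lists inflected forms directly for simplicity.

```python
WILDCARD = "*"
ARITY = 5  # (H1, H2, K1, K2, M)

def pad(structure):
    """Fill a short morpheme structure with `*` on the right up to 5 elements."""
    structure = tuple(structure)
    return structure + (WILDCARD,) * (ARITY - len(structure))

def unifies(structure, morpheme):
    """True if each field of the padded structure is `*` or equals the morpheme's."""
    return all(s == WILDCARD or s == m for s, m in zip(pad(structure), morpheme))

def denoted(structure, lexicon):
    """The set of morphemes in `lexicon` that unify with `structure`."""
    return {m for m in lexicon if unifies(structure, m)}

# Hypothetical fully specified morphemes; "-" marks a field that does not apply.
LEXICON = {
    ("noun", "common", "-", "-", "hon"),
    ("verb", "-", "k-row", "kihonkei", "kaku"),
    ("verb", "-", "k-row", "mizenkei", "kaka"),
}

alpha1 = ("*",)
alpha2 = ("noun",)
alpha3 = ("*", "*", "*", "mizenkei")

print(pad(alpha2))                                           # ('noun', '*', '*', '*', '*')
print(denoted(alpha1, LEXICON) >= denoted(alpha2, LEXICON))  # True
print(denoted(alpha3, LEXICON))  # {('verb', '-', 'k-row', 'mizenkei', 'kaka')}
```

Here Python's `>=` on sets tests for a superset, so the second line checks that everything matched by `(noun)` is also matched by `(*)`.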
2.2 Definition of the Morpheme Dictionary
The Morpheme Dictionary is defined in list form. It is stored in
files with extension '.dic'. It may be divided into multiple files.
Here is the BNF for the dictionary.
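<morpheme> ::= (<part of speech> <...>) |
               (<part of speech> (<subpart of speech> <...>))
<...> ::= <...> | NIL
<...> ::= (<keyword> <reading> <paradigm> <meaning information>)
<keyword> ::= (keyword <...>)
// the word "keyword" followed by the appropriate form //
<reading> ::= (reading <...>)
// the word "reading" (yomi) followed by the appropriate form //
<paradigm> ::= (paradigm <...>) | Nil
<meaning information> ::= (meaning information <...>) | Nil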
// I don't understand the first sentence //
Part of speech, subpart of speech: These must be defined in the
Morpheme Part of Speech Dictionary.
Paradigm: This cannot be left out if the part of speech or subpart
of speech has been defined to be inflectable.
Keyword: This should be the surface form of the word. In the case
of inflectable morphemes, this should be the base form.
Reading: This contains the morpheme's reading. In addition to an
arbitrary sequence of kana, it is possible to store other
information here.
Meaning information: This contains semantic information. It
consists of arbitrary text. There is no restriction on length.
2.3 Construction of the System Dictionary
Building the System Dictionary from the User Dictionary requires
converting the Connection Rule Dictionary and then converting the
dictionary data in two steps (v. fig. 1). Before running the
conversions, JUMANPATH must be set to the name of the directory
in which the dictionary is stored.
2.3.1 Conversion of the Connection Rule Dictionary
This is done by executing `makemat'. `makemat' doesn't take any
arguments. At this step, the connection table (JUMANTREE.table)
and the connection matrix (JUMANTREE.matrix) are generated from
the Connection Rule Dictionary (JUMAN.connect) and the
Conjugation Connection Dictionary (JUMAN.kankei).
The connection table contains an entry for each part or subpart
of speech. The entry is a pointer into a row and a column of the
connection matrix (the row entry contains information about what
can follow the morpheme and the column entry contains
information about what can precede it). In the case of
inflectable parts of speech, the offset for the desinence in
question is added to the entry for the part of speech and the
resulting entry is consulted.
This allows us to define entries in a uniform manner for
arbitrary morpheme structures. The connection matrix records
whether a pair of morpheme structures can be connected or not.
// whether they can occur adjacent to each other. // Morpheme
structures that can combine with the same things to their right
share a row in the table. Those which have the same
possibilities for leftward combination share a column.
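The row and column sharing described above can be sketched in Python. This is not the `makemat' implementation: the state names, the input format (a set of allowed pairs), and the function names are assumptions, and the desinence offsets are left out to keep the example short.

```python
def build_connection_data(states, allowed):
    """Return (table, matrix) for the given states and allowed (left, right) pairs.

    table maps each state to (row, column); states with the same set of
    possible right neighbours share a row, and states with the same set of
    possible left neighbours share a column.
    """
    rows, cols = [], []  # distinct right-neighbour sets / left-neighbour sets
    table = {}
    for s in states:
        right = frozenset(r for r in states if (s, r) in allowed)
        left = frozenset(l for l in states if (l, s) in allowed)
        if right not in rows:
            rows.append(right)
        if left not in cols:
            cols.append(left)
        table[s] = (rows.index(right), cols.index(left))
    matrix = [[0] * len(cols) for _ in rows]
    for left_state, right_state in allowed:
        matrix[table[left_state][0]][table[right_state][1]] = 1
    return table, matrix

def can_connect(table, matrix, left_state, right_state):
    """Look up the row of the left state and the column of the right state."""
    return matrix[table[left_state][0]][table[right_state][1]] == 1

# Hypothetical states and rules.
states = ["noun-common", "noun-proper", "particle", "verb-kihonkei"]
allowed = {
    ("noun-common", "particle"),
    ("noun-proper", "particle"),
    ("particle", "verb-kihonkei"),
}
table, matrix = build_connection_data(states, allowed)
print(table["noun-common"] == table["noun-proper"])        # True
print(matrix)                                              # [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
print(can_connect(table, matrix, "particle", "verb-kihonkei"))  # True
```

In this example the two noun states behave identically on both sides, so four states need only a 3 x 3 matrix.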
2.3.2 Conversion to the Intermediate Dictionary
The information in the Morpheme Dictionary is converted into the
Intermediate Dictionary. // I'm not sure what "ittan" means in
this sentence - maybe "partially". // This processing is invoked
via `makeint'. `makeint' should be passed a file with extension
`.dic' as an argument. To convert all the dictionary files in
JUMANPATH or the appropriate directory, use
makeint *.dic
At this stage, the following processing takes place:
1. By consultation with the Morpheme Part of Speech Dictionary
and the Conjugation Dictionary, the part of speech, subpart of
speech, paradigm, and desinence are converted into single byte
symbolic values.
2. By consultation with the connection table, each morpheme is
assigned an entry number, represented as a 4 byte integer.
3. For inflectable words, consulting the Conjugation Dictionary,
only the word root is entered into the Intermediate
Dictionary. In the case of words whose entire surface form
changes due to inflection, all desinences are entered into
the Intermediate Dictionary.
The reason for the symbolization in steps 1 and 2 is the
convenience of processing fixed-length structures and the need
to restrict the size of the system dictionary files. The
information from before the symbolization can be recovered from
the Grammar Dictionary. One Intermediate Dictionary file is
created for each Morpheme Dictionary file. The files have the
same name as the original files, except that they have the
extension `.int'.
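Step 3 relies on the relation between base form, root, and suffix given in section 2.1. The Python sketch below shows that relation for one made-up paradigm. The paradigm contents, the desinence name "stem", and the romanized forms are hypothetical; only the `*' convention for a suffix that does not appear in the surface form comes from the text.

```python
# Hypothetical paradigm: desinence name -> suffix ("*" means no suffix appears).
K_ROW = {
    "kihonkei": "ku",   # base form, as entered in the Morpheme Dictionary
    "mizenkei": "ka",
    "renyoukei": "ki",
    "stem": "*",
}

def root_of(base_form, paradigm, base="kihonkei"):
    """Remove the base-form suffix from the dictionary form to obtain the root."""
    suffix = paradigm[base]
    if suffix == "*":
        return base_form
    if not base_form.endswith(suffix):
        raise ValueError(f"{base_form!r} does not end in {suffix!r}")
    return base_form[: len(base_form) - len(suffix)]

def surface_forms(base_form, paradigm):
    """Attach each desinence's suffix to the root."""
    root = root_of(base_form, paradigm)
    return {name: root + ("" if suffix == "*" else suffix)
            for name, suffix in paradigm.items()}

print(root_of("kaku", K_ROW))  # ka
print(surface_forms("kaku", K_ROW))
# {'kihonkei': 'kaku', 'mizenkei': 'kaka', 'renyoukei': 'kaki', 'stem': 'ka'}
```

When the base-form suffix is as long as the base form itself, the root comes out empty, which corresponds to the case above where every desinence has to be entered separately.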
2.3.3 Conversion to the System Dictionary
The system dictionary is constructed from the intermediate
dictionary. This processing is performed by `maketree', and the
Tree Structure Dictionary (JUMANTREE.main), the Keyword
Dictionary (JUMANTREE.mida), the Readings Dictionary
(JUMANTREE.yomi) and the Semantic Dictionary (JUMANTREE.imis)
are generated. `maketree' must be passed files with extension
`.int' as arguments. To convert all intermediate dictionary
files, type:
maketree *.int
If there is a file JUMANTREE.main in the directory specified by
JUMANPATH, the morphemes in the intermediate dictionary are
added to it. Otherwise the file JUMANTREE.main is created.
Multiple entries are not created for morphemes with the same
part of speech, subpart of speech, keyword, and reading. During
the conversion various sorts of information are recorded in the
file maketree.log.
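The three conversion steps of sections 2.3.1 to 2.3.3 can be summarized as an ordered list of commands. The Python sketch below only assembles that list and does not run anything. The function name `plan_build` is hypothetical, and it assumes the commands are later executed with the JUMANPATH directory as the working directory.

```python
import os
from pathlib import Path

def plan_build(env=os.environ):
    """Return the conversion commands, in order, for the directory in JUMANPATH."""
    dic_dir = Path(env["JUMANPATH"])
    dic_files = sorted(p.name for p in dic_dir.glob("*.dic"))
    # makeint writes one `.int' file per `.dic' file, with the same base name.
    int_files = [Path(name).with_suffix(".int").name for name in dic_files]
    return [
        ["makemat"],                # 2.3.1: connection table and matrix
        ["makeint", *dic_files],    # 2.3.2: intermediate dictionary
        ["maketree", *int_files],   # 2.3.3: system dictionary
    ]

if "JUMANPATH" in os.environ:
    for command in plan_build():
        print(" ".join(command))
```

`makemat' comes first because `makeint' consults the connection table when it assigns entry numbers.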
2.4 The Dictionary's Internal Data Structures
// I am very unsure of the translation of this section.//
The dictionary is represented as a set of B-Trees[2]. The B-Trees
are ordered by using the initial character of the Keyword as a
key. That is, all the morphemes in a given B-Tree have keywords
that start with the same character.
The B-Trees' internal data structure is given in figure 2. The
search key is the Keyword. Among the information that is in the
morpheme dictionary, the variable length items Keyword, Reading
and Meaning are stored in the Keyword Dictionary, the Reading
Dictionary and the Semantic Dictionary. In each case, the B-Tree
contains an absolute offset (pointer) from the head of the
dictionary file. The following 8 fields contain data:
H1: symbolized part of speech
H2: " " subpart "
K1: " " Paradigm
K2: " " Desinence
contbl: connection table entry number
ptr_midasi: pointer into the Keyword Dictionary
ptr_yomi: pointer into the Readings Dictionary
ptr_imit: pointer into the Semantic Dictionary
Morphemes with the same Keyword index are stored in a linear
list at the nodes of the B-Tree. To make this possible, the
field ptr_next contains either a pointer to the next entry which
shares the same Keyword, or Nil.
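A record of this shape might be modelled as follows in Python. The field names come from the list above, and the one-byte symbols and four-byte entry number come from section 2.3.2. The four-byte pointer width, the little-endian byte order, and the helper names are assumptions of this sketch and may not match the actual file layout.

```python
import struct
from dataclasses import dataclass
from typing import Optional

# Assumed layout: four 1-byte symbols, a 4-byte entry number, three 4-byte offsets.
RECORD_FORMAT = "<4Bi3I"

@dataclass
class Entry:
    h1: int          # symbolized part of speech
    h2: int          # symbolized subpart of speech
    k1: int          # symbolized Paradigm
    k2: int          # symbolized Desinence
    contbl: int      # connection table entry number
    ptr_midasi: int  # offset into the Keyword Dictionary
    ptr_yomi: int    # offset into the Readings Dictionary
    ptr_imit: int    # offset into the Semantic Dictionary
    ptr_next: Optional["Entry"] = None  # next entry with the same Keyword, or None

def pack_fixed(entry):
    """Pack the eight data fields into a fixed-length byte string."""
    return struct.pack(RECORD_FORMAT, entry.h1, entry.h2, entry.k1, entry.k2,
                       entry.contbl, entry.ptr_midasi, entry.ptr_yomi,
                       entry.ptr_imit)

def same_keyword_entries(head):
    """Walk the linear list of entries that share one Keyword."""
    node = head
    while node is not None:
        yield node
        node = node.ptr_next

# Two hypothetical homographs stored under the same Keyword.
second = Entry(2, 0, 5, 1, 17, 0, 40, 0)
first = Entry(1, 3, 0, 0, 4, 0, 32, 0, ptr_next=second)
print(len(pack_fixed(first)))                              # 20
print([e.contbl for e in same_keyword_entries(first)])     # [4, 17]
```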
2.5 The Standard System Grammar
A standard grammar has been prepared as the system's user
dictionary. This is called the Standard System Grammar. The Part
of Speech Dictionary, Conjugation Dictionary, and Connection
Dictionary were built as extensions of the analysis contained in
the Masuoka and Takubo grammar[1].
1. Part of Speech Dictionary: We defined 14 parts of speech,
adding the class "special" (punctuation, symbols, parentheses,
etc.) to Masuoka and Takubo's system, and dividing affixes into
Prefixes and Suffixes. (cf. JUMAN.grammar)
2. Conjugation Dictionary: We defined 21 standard Paradigms
plus 6 special Paradigms, extending the Masuoka and Takubo
grammar in order to deal with literary language (bungo),
colloquial language, and polite language. (cf. JUMAN.katuyou).
3. Connection Dictionary: This was designed from scratch,
consulting Masuoka and Takubo.
4. Conjugation Connection Dictionary: A table of inflectable
morpheme structures plus a table of the Desinences they can
take. (cf. JUMAN.kankei)
3. Morphological Analysis
The environment variable JUMANPATH defines the absolute path for
the dictionary directory. There are C and Prolog versions of the
morphology analysis program.
3.1 C Version
Morphological analysis is invoked by
juman -[b|m|p] -[f|e|c]
Input is read a line at a time from standard input.
Contents of the analysis: The system searches for the analysis
with the fewest unknown words, morphemes and independent words.
The results are displayed according to the following options:
If the analysis is ambiguous:
-b display a single analysis with the longest matching suffix.
// I don't know what gohoosaichooichi means, but compositionally
it would appear to mean "longest matching suffix"//
-m display all possible morphemes in ambiguous parts of the
input // (but not duplicating unambiguous parts).//
-p display all analyses // duplicating unambiguous parts //
For each morpheme:
-f Display arranged in columns // "karamu" //
-e Display all morpheme info in character format
//i.e., spelled out in kana and kanji //
-c Display all morpheme info in code format
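One way to read the selection criterion is as a lexicographic comparison: fewest unknown words first, then fewest morphemes, then fewest independent words. The text does not say how the three counts are combined, so that ordering is an assumption of the Python sketch below, as are the dictionary keys and the sample analyses.

```python
def cost(analysis):
    """Return (unknown words, morphemes, independent words) for one analysis."""
    unknown = sum(1 for m in analysis if m["pos"] == "unknown")
    independent = sum(1 for m in analysis if m["independent"])
    return (unknown, len(analysis), independent)

def best_analysis(analyses):
    """Pick the analysis whose cost tuple is smallest (tuples compare left to right)."""
    return min(analyses, key=cost)

# Three hypothetical analyses of the same two-character input.
candidates = [
    [{"surface": "AB", "pos": "unknown", "independent": True}],
    [{"surface": "A", "pos": "noun", "independent": True},
     {"surface": "B", "pos": "particle", "independent": False}],
    [{"surface": "A", "pos": "noun", "independent": True},
     {"surface": "B", "pos": "noun", "independent": True}],
]

for analysis in candidates:
    print(cost(analysis))
# (1, 1, 1)
# (0, 2, 1)
# (0, 2, 2)
print([m["pos"] for m in best_analysis(candidates)])  # ['noun', 'particle']
```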
3.2 Prolog Version
Processing Environment
Juman is designed for SICStus Prolog 0.7 #4.
Invocation
Move to the directory juman/juman_pl, start Prolog, and
`consult' `juman.pl'. Processing can proceed a sentence at a
time or a file at a time.
1. | ?- juman()
2. | ?- juman.
Input file name?
Output file name? user.
Contents of the Analysis: The system seeks the analysis with the
fewest unknown words, morphemes, and independent words. In case
of ambiguity, output is expressed in a lattice structure.
Ambiguous parts of the analysis are indented.
References
[1] Takashi Masuoka and Yukinori Takubo. "Basic Japanese Grammar" Kurosio
Publishers. 1989
[2] Knuth, D.E. "The Art of Computer Programming" vol. 3 Sorting
and Searching. Addison Wesley. 1973.
Figures.
FIGURE 1. "Generation of the Dictionaries"
The boxes on the left are titled (from top to bottom)
Connection Rule Dictionary
Conjugation Connection Dictionary
// these two boxes are surrounded by a dotted line //
Morpheme Part of Speech Dictionary
Conjugation Dictionary
// these two dictionaries are surrounded by a dotted line //
Grammar Dictionary
// "grammar dictionary" is the label for the solid line surrounding
the top 4 dictionaries //
Morpheme Dictionary
// this is the label for the solid box at the bottom. It has several
unlabeled boxes inside of it.//
The line connecting "makeint" to "maketree" is labeled
Intermediate Structure
The boxes on the right are labeled (from top to bottom)
Connection Table
Connection Matrix
// there is a solid line surrounding these two boxes//
Tree Structure Dictionary
// Arrows point from the Tree Structure Dictionary
to the following 3 dictionaries, which are drawn
in overlapping boxes //
Keyword Dictionary
Reading Dictionary
Semantic Dictionary
// The entire right side is labeled //
System Dictionary
// at the bottom //
FIGURE 2. "Dictionary Data Structures"
At the right of the diagram there are 5 boxes with Japanese labels.
"ptr_midasi" points to the top one, and "contbl" points to the
bottom one.
From top to bottom, the labels are:
Keyword Dictionary
Reading Dictionary
Semantic Dictionary
Connection Table
Connection Array