JUMAN Information

  Translated by Jim Barnett.
  Translator's comments appear between slashes: //...//


 User's Guide for the Juman System, a User-Extensible Morphological
Analyzer for Japanese. Version 0.5

    Yuji Matsumoto, Sadao Kurohashi, Yutaka Nyoki, Hitoshi Shinho, 
                     and Makoto Nagao.


1. The Morphological Grammar of Japanese

1.1 Part of Speech and Subpart of Speech

 Morphemes that perform a grammatically similar role are classified
together. We call the resulting classes parts of speech (keitaihinsi.)
The current system allows the part of speech to be further
sub-classified. These sub-classes are called subparts of speech
(keitaihinsisaibunrui.)

1.2 Paradigm and Desinence

 Like the "verb," "adjective," and "auxiliary verb" described in
school grammars, some morphemes change their form according to the
morphemes that occur before and after them. This change of form is
called "conjugation." The part of the surface form that doesn't change
is called the root, and the part that does change is called the
suffix. Most conjugation is regular. Categorizing morphemes according
to this regularity, we call the resulting classes Paradigms. The
combined surface forms that actually occur are called desinences.

 // "Paradigm" and "desinence" are arcane words, but they're the best I
can do. "Paradigm" is a generalization of "conjugation" for verbs and
"declension" for nouns. It denotes a set of words that inflect the
same way. A "desinence" is one of the inflected forms of a Paradigm.
To take an example from Latin, the first conjugation verbs (the ones
with principal parts -are, -avi, -atus) form a Paradigm, as do the
second declension nouns (the ones with nominatives in -us and
accusatives in -um.) The first person singular active present of a
verb (e.g., the form "amo" of the verb "amare") is a desinence, as is
the accusative singular of a noun. //
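
 The relation between root, suffix, Paradigm, and desinence can be
sketched in a few lines of Python. This is only an illustration: the
Paradigm below, its desinence names other than "mizenkei" and "base
form", and the romanized example word are assumptions made for the
sketch and are not taken from the JUMAN dictionaries. An empty suffix
stands for a desinence whose suffix does not appear in the surface form.

```python
# Hypothetical paradigm for illustration only; the desinence names and
# romanized suffixes are NOT taken from the JUMAN dictionaries.
PARADIGM = {
    "base form": "ru",
    "mizenkei": "",      # empty string: no suffix appears in the surface form
    "kateikei": "re",
}


def root_of(base_form: str, paradigm: dict) -> str:
    """Strip the base-form suffix from the dictionary form to get the root."""
    suffix = paradigm["base form"]
    if not base_form.endswith(suffix):
        raise ValueError(f"{base_form!r} does not end in {suffix!r}")
    return base_form[: len(base_form) - len(suffix)]


def inflect(base_form: str, paradigm: dict) -> dict:
    """Return the surface form (root + suffix) of every desinence."""
    root = root_of(base_form, paradigm)
    return {name: root + suffix for name, suffix in paradigm.items()}


forms = inflect("taberu", PARADIGM)
print(forms["mizenkei"], forms["kateikei"])
```

 Only the base form needs to be stored for a regular word; every other
desinence is derived from the root and the Paradigm.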


1.3 Morpheme Structure

 If morpheme m has part of speech H1, subpart of speech H2, Paradigm
K1, Desinence K2 and surface form M, we call the list (H1, H2, K1, K2,
M) the "morpheme structure of morpheme m."

  Elements H1, H2, K1, and K2 are filled with the names of the part of
speech, the subpart of speech, the paradigm and desinence.
Furthermore, M is filled with a surface form expressed in kana or
kanji.

 This notation is used to describe the connection rules.

 {A morpheme structure may be underspecified, in which case it stands
for the set of morphemes that unify with it.}

1.4 Connection Relations and Connection Rules

 The fact that every member of the morpheme set A1 can be connected
to (can occur immediately before) every member of the morpheme set A2
is expressed as (A1, A2). This is called a Connection Rule. A set of
morphemes is expressed as a list of morpheme structures. Given
connection rule (A1, A2), every morpheme in every morpheme structure
alpha1 in A1 can be followed by any morpheme in any morpheme structure
alpha2 in A2. The set of Connection Rules is called the Connection
Relations.

  // If A1 is a list of morpheme structures (a11,...a1n) and similarly A2
= (a21, ... a2k), then (A1, A2) means that any morpheme that unifies
with any of the descriptions (a11...a1n) can be followed by any one
that unifies with one of (a21...a2k) //
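
 As a rough sketch of these semantics (not the system's actual
implementation), a Connection Rule can be modelled as a pair of lists
of morpheme structures, and two morphemes may be adjacent if some rule
covers the pair. The rule, the part-of-speech names, and the romanized
surface forms below are invented for the example, and None is used here
to mark an unspecified slot.

```python
from typing import Optional, Sequence, Tuple

# A morpheme structure (H1, H2, K1, K2, M); None marks an unspecified slot.
Structure = Tuple[Optional[str], ...]
Rule = Tuple[Sequence[Structure], Sequence[Structure]]


def matches(description: Structure, morpheme: Structure) -> bool:
    """True if every specified slot of the description equals the morpheme's."""
    return all(d is None or d == m for d, m in zip(description, morpheme))


def can_follow(rules: Sequence[Rule], left: Structure, right: Structure) -> bool:
    """True if some rule (A1, A2) has left in A1 and right in A2."""
    return any(
        any(matches(a, left) for a in a1) and any(matches(b, right) for b in a2)
        for a1, a2 in rules
    )


# Hypothetical rule: any noun may be followed by any case particle.
RULES = [
    ([("noun", None, None, None, None)],
     [("particle", "case", None, None, None)]),
]

hon = ("noun", "common", None, None, "hon")
ga = ("particle", "case", None, None, "ga")
print(can_follow(RULES, hon, ga), can_follow(RULES, ga, hon))
```

 Because each side of a rule is a list, one rule stands for the whole
cross product of the morphemes on its left and right sides.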

2. Dictionary Definitions and Data Structures

 Here we discuss the internal data structures of the dictionary (fig.
1). The distinctive feature of this dictionary system is that it allows
the user to freely define parts of speech and Connection Relations.
 We call the dictionary that has been defined by the user the User
Dictionary. The User Dictionary is divided into the Grammar Dictionary
and the Morpheme Dictionary. The Grammar Dictionary is used to describe
the morphological grammar of Japanese presented in section 1
//"issetsu de nobeta", i.e., "described in section 1"//. It is composed
of the Subpart of Speech Dictionary, the Conjugation Dictionary, the
Connection Rule Dictionary, and the Conjugation Connection Dictionary
(section 2.1). The Morpheme Dictionary contains information about
individual morphemes (sec. 2.2).
 The dictionary that is used for morphological analysis is called the
System Dictionary. The System Dictionary consists of: the Connection
Table and Connection Matrix, which are generated from the Connection
Rule Dictionary and the Conjugation Connection Dictionary; the Tree
Structure Dictionary, which contains the information in the Morpheme
Dictionary in symbolic form and is constructed by referring to the
Grammar Dictionary; the (Lexical) Key Dictionary; the Readings (yomi)
Dictionary; and the Meaning Dictionary.
 The dictionaries described here must be placed in the directory
held in the Unix environment variable JUMANPATH. 
// The next sentence, "hon sistemu wa, JUMANPATH wa `/juman/dic' o
sansho siteiru", appears to say that in this system JUMANPATH refers
to `/juman/dic'. //
 Furthermore, this system comes equipped with a standard User
Dictionary. It will be introduced in section 2.5.

2.1 Definition of the Grammar Dictionary
 The Grammar Dictionary is set up in the following way. Each
dictionary is defined in S-expression format.

 1. The Morpheme Subpart of Speech Dictionary (cf. JUMAN.grammar)
     Defines the names of the parts and subparts of speech that the
     system uses. A single part of speech or subpart of speech is
     represented by a single S-expression (list structure). The list's
     first element contains the part of speech. If there are subparts
     of speech, elements 2 and beyond of the list each contain a
     subpart of speech in list format.

     Individual parts of speech or subparts of speech are represented
     as single element lists, but if the morphemes belonging to the
     part or subpart of speech are inflected, the symbol `%' is added
     as the second element of the list. If a part of speech has the
     symbol  `%', all morphemes belonging to any of its subparts of
     speech are inflected.

 2. Conjugation Dictionary (cf. JUMAN.katuyou)
     A table of the Paradigms of inflectable morphemes and the
     definitions of the Desinences contained in the Paradigms.
     The list's first element contains the Paradigm, the second
     element is a list containing lists of Desinences and Suffixes. 
     The base form of the morpheme cannot be left out as a Desinence,
     since for inflectable words it is the base form which is entered
     in the Morpheme Dictionary.

     The word root is what remains after the removal of the base form
     suffix from the form that is entered in the Morpheme Dictionary
     // i.e., from the base form //. If the suffix does not appear in
     the surface form, the Suffix entry for a Desinence is marked with
     `*'.

 3. Connection Rule Dictionary (cf. JUMAN.connect)
     The Connection Rule Dictionary is the set of connection rules. A
     connection rule is a pair of sets of morpheme structures (v.
     section 1.3) and is expressed as a two-element list. A set of
     morpheme structures can also be expressed as a list. 
     // A rule is a 2-element list, where each element is either a
        morpheme structure or a list of morpheme structures. In the
        latter case, the rule can be viewed as shorthand for a number
        of simple rules. Alternatively, we can view each element of the
        rule as denoting a set of morphemes, where a morpheme structure
        denotes the set of elements that unify it, and a list of
        morpheme structures denotes the union of the sets its elements
        denote. //

    All morphemes contained in the first element of a connection rule can
    be connected to all the elements contained in the second element.
    A single morpheme may be contained in multiple morpheme structures
    within a rule.

    A special symbol `*' can be used in any morpheme structure. It
    denotes a "don't care" value. // unifies with anything.// For
    example, the morpheme structure
                     
                    alpha1 = (*)

    denotes the set of all morphemes. In the same
    manner, the morpheme structure
            
                   alpha2 = (noun)

   expresses the set of all morphemes classified as nouns. Further,
   the morpheme structure
  
                   alpha3 = (* * * mizenkei)

   denotes the set of all morphemes that can take the mizenkei
   desinence. Thus alpha1 >= alpha2 and alpha1 >= alpha3: alpha1
   contains both of the other sets.

   // A morpheme structure stands for all the morphemes that unify
      with it, where `*' unifies with anything. The following
      convention also holds: each morpheme structure is 5 elements long,
      and shorter structures are filled with `*' to the right.
      Thus, `(noun)' really stands for `(noun * * * *)' and
      `(* * * mizenkei)' stands for `(* * * mizenkei *)'. //
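
   The padding convention and the "don't care" symbol can be sketched
   as follows. This is an illustrative model only, not the system's
   code; the function names and the sample morpheme (including its
   Paradigm name and romanized surface form) are assumptions made for
   the example.

```python
WILDCARD = "*"
SLOTS = 5  # (H1, H2, K1, K2, M)


def pad(structure):
    """Fill a short morpheme structure with `*` on the right."""
    if len(structure) > SLOTS:
        raise ValueError("a morpheme structure has at most 5 elements")
    return tuple(structure) + (WILDCARD,) * (SLOTS - len(structure))


def unifies(structure, morpheme):
    """True if the fully specified morpheme is in the set the structure denotes."""
    return all(s == WILDCARD or s == m for s, m in zip(pad(structure), morpheme))


def subsumes(general, specific):
    """True if every morpheme matching `specific` also matches `general`."""
    return all(
        g == WILDCARD or g == s for g, s in zip(pad(general), pad(specific))
    )


alpha1 = ("*",)
alpha2 = ("noun",)
alpha3 = ("*", "*", "*", "mizenkei")

# Hypothetical morpheme; the paradigm name "vowel-stem" is invented.
tabe = ("verb", "*", "vowel-stem", "mizenkei", "tabe")
print(pad(alpha2), unifies(alpha3, tabe), subsumes(alpha1, alpha3))
```

   Here subsumes(alpha1, alpha2) and subsumes(alpha1, alpha3) both
   hold, while neither alpha2 nor alpha3 subsumes the other.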

2.2 Definition of the Morpheme Dictionary

   The Morpheme Dictionary is defined in list form. It is stored in 
files with extension '.dic'. It may be divided into multiple files.
Here is the BNF for the dictionary.

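   <morpheme dictionary entry> ::=
        (<part of speech> <morpheme information list>) |
        (<part of speech> (<subpart of speech>
                           <morpheme information list>))


   <morpheme information list> ::=
        <morpheme information> <morpheme information list> | NIL


   <morpheme information> ::= (<keyword> <reading>
                               <paradigm> <meaning information>)

   <keyword> ::= (keyword <string>)
    // the word "keyword" followed by the appropriate form //

   <reading> ::= (reading <string>)
    //  the word "reading" (yomi) followed by the appropriate form //

   <paradigm> ::= (paradigm <paradigm name>) | NIL

   <meaning information> ::= (meaning information <string>) | NIL



    <part of speech>, <subpart of speech>
     // I don't understand the first sentence //
     The part of speech and subpart of speech must be defined in the
     Morpheme Part of Speech Dictionary.

    <paradigm>
     This cannot be left out if the part of speech or subpart of
     speech has been defined to be inflectable.

    <keyword>
      This should be the surface form of the word. In the case of
      inflectable morphemes, this should be the base form.

    <reading>
      This contains the morpheme's reading. In addition to an arbitrary
      sequence of kana, it is possible to store other information
      here.

    <meaning information>
      This contains semantic information. It consists of arbitrary
      text. There is no restriction on length.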


2.3 Construction of the System Dictionary

     To build the system dictionary from the user dictionary, a
     two-step conversion of the dictionary data is necessary, along
     with the conversion of the Connection Rule Dictionary (v. fig 1).
     Before the conversions, JUMANPATH must be set to contain the name
     of the directory in which the dictionary is stored.


2.3.1 Conversion of the Connection Rule Dictionary

      This is done by executing `makemat'. `makemat' doesn't take any
      arguments. At this step, the connection table (JUMANTREE.table)
      and the connection matrix (JUMANTREE.matrix) are generated from
      the Connection Rule Dictionary (JUMAN.connect) and the
      Conjugation Connection Dictionary (JUMAN.kankei).

      The connection table contains an entry for each part or subpart
      of speech. The entry is a pointer into a row and a column of the
      connection matrix (the row entry contains information about what
      can follow the morpheme and the column entry contains
      information about what can precede it.) In the case of
      inflectable parts of speech, the offset for the desinence in
      question is added to the entry for the part of speech and the
      resulting entry is consulted. This allows us to define entries
      in a uniform manner for arbitrary morpheme structures. The
      connection matrix records whether a pair of morpheme structures
      can be connected or not // whether they can occur adjacent to
      each other. // Morpheme structures that can combine with the
      same things to their right share a row in the table. Those
      which have the same possibilities for leftward combination
      share a column.
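
      The lookup described above can be sketched as below. This is a
      toy model under stated assumptions, not the format of
      JUMANTREE.table or JUMANTREE.matrix: the class names, entry
      numbers, and matrix contents are all invented for illustration.

```python
# Hypothetical data throughout: class names, entry numbers, and matrix
# contents are invented for illustration.

# Connection table, part 1: base entry number of each (part, subpart).
BASE_ENTRY = {("noun", "common"): 0, ("particle", "case"): 1, ("verb", "*"): 2}

# Connection table, part 2: entry number -> (row, column) in the matrix.
# Entries 2 and 3 are two desinences of the verb (offsets 0 and 1).
# They differ in what may follow them (rows 2 and 3) but share column 2
# because the same things may precede both.
ENTRY = [(0, 0), (1, 1), (2, 2), (3, 2)]

# MATRIX[row][column] is True when the left morpheme (row) may be
# followed by the right morpheme (column).
MATRIX = [
    [False, True, False],   # after a common noun: case particle
    [True, False, True],    # after a case particle: noun or verb
    [False, False, False],  # after verb desinence 0: nothing (toy data)
    [True, False, False],   # after verb desinence 1: noun
]


def entry_number(part, subpart, desinence_offset=0):
    """Base entry of the part of speech plus the desinence offset."""
    return BASE_ENTRY[(part, subpart)] + desinence_offset


def connectable(left_entry, right_entry):
    """Consult the left morpheme's row and the right morpheme's column."""
    row = ENTRY[left_entry][0]
    column = ENTRY[right_entry][1]
    return MATRIX[row][column]


noun = entry_number("noun", "common")
particle = entry_number("particle", "case")
verb_1 = entry_number("verb", "*", desinence_offset=1)
print(connectable(noun, particle), connectable(particle, verb_1))
```

      Sharing rows and columns keeps the matrix small: its size depends
      on the number of distinct connection behaviours, not on the
      number of table entries.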

2.3.2 Conversion to the Intermediate Dictionary
     
      The information in the Morpheme Dictionary is first converted
      into the Intermediate Dictionary. // "ittan" here most likely
      means "first" or "for the time being". // This processing is
      invoked via `makeint'. `makeint' should be passed a file with
      extension `.dic' as an argument. To convert all the dictionary
      files in JUMANPATH or the appropriate directory, use

             makeint *.dic

      At this stage, the following processing takes place:

      1. By consultation with the Morpheme Part of Speech Dictionary
         and the Conjugation Dictionary, the part of speech, subpart
         of speech, paradigm, and desinence are converted into single
         byte symbolic values.

      2. By consultation with the connection table, each morpheme is
         assigned an entry number, represented as a 4 byte integer.

      3. For inflectable words, consulting the Conjugation Dictionary,
         only the word root is entered into the Intermediate
         Dictionary. In the case of words whose entire surface form
         changes due to inflection, all desinences are entered into
         the Intermediate Dictionary.


      The symbolization in steps 1 and 2 makes it convenient to
      process fixed-length structures and restricts the size of the
      system dictionary files. The information from before the
      symbolization can be recovered from the Grammar Dictionary. One
      Intermediate Dictionary file is created for each Morpheme
      Dictionary file. The files have the same name as the original
      files, except that they have the extension `.int'.
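
      To illustrate why symbolization yields fixed-length structures,
      the sketch below packs four single-byte codes and a 4-byte entry
      number into an 8-byte record. The field order, byte order, and
      code values are assumptions made for the example; the actual
      layout of the `.int' files is not specified here.

```python
import struct

# Assumed record layout (NOT the actual .int file format): four 1-byte
# codes for H1, H2, K1, K2, then a 4-byte connection table entry number.
# "<" selects little-endian with no padding, so the record is 8 bytes.
RECORD = struct.Struct("<4BI")

# Hypothetical grammar dictionary contents; a code is a list position,
# so the original name can be recovered from the code.
PARTS_OF_SPEECH = ["noun", "verb", "particle"]


def encode(h1, h2, k1, k2, entry_number):
    """Pack the symbolized fields into one fixed-length record."""
    return RECORD.pack(h1, h2, k1, k2, entry_number)


def decode(record):
    """Unpack a record into (h1, h2, k1, k2, entry_number)."""
    return RECORD.unpack(record)


record = encode(PARTS_OF_SPEECH.index("verb"), 0, 3, 1, 70000)
h1, h2, k1, k2, entry_number = decode(record)
print(len(record), PARTS_OF_SPEECH[h1], entry_number)
```

      Every record has the same size regardless of how long the part
      of speech or Paradigm names are in the Grammar Dictionary.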

2.3.3 Conversion to the System Dictionary

      The system dictionary is constructed from the intermediate
      dictionary. This processing is performed by `maketree', which
      generates the Tree Structure Dictionary (JUMANTREE.main), the
      Keyword Dictionary (JUMANTREE.mida), the Readings Dictionary
      (JUMANTREE.yomi) and the Semantic Dictionary (JUMANTREE.imis).
      `maketree' must be passed files with extension `.int' as
      arguments. To convert all intermediate dictionary files, type:

             maketree *.int

      If there is a file JUMANTREE.main in the directory specified by
      JUMANPATH, the morphemes in the intermediate dictionary are
      added to it. Otherwise the file JUMANTREE.main is created.

      Multiple entries are not created for morphemes with the same
      part of speech, subpart of speech, keyword, and reading. During
      the conversion various sorts of information are recorded in the
      file maketree.log.
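
      The three conversion steps of section 2.3 can be strung together
      as in the following sketch. It only uses the command names given
      above (`makemat', `makeint', `maketree'); running them from the
      dictionary directory with bare file names is an assumption, and
      the helper names build_commands and run_build are invented for
      the example. `makemat' comes first because `makeint' consults the
      connection table.

```python
import glob
import os
import subprocess


def build_commands(dic_dir):
    """Return the argv lists for makemat, makeint, and maketree, in order."""
    dics = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(dic_dir, "*.dic"))
    )
    # makeint writes one .int file per .dic file, with the same base name.
    ints = [os.path.splitext(name)[0] + ".int" for name in dics]
    commands = [["makemat"]]          # makemat takes no arguments
    if dics:
        commands.append(["makeint", *dics])
        commands.append(["maketree", *ints])
    return commands


def run_build(dic_dir):
    """Run the conversions with JUMANPATH pointing at dic_dir (assumed usage)."""
    env = dict(os.environ, JUMANPATH=dic_dir)
    for argv in build_commands(dic_dir):
        subprocess.run(argv, cwd=dic_dir, env=env, check=True)


# Dry run: show what would be executed, without running anything.
for argv in build_commands(os.environ.get("JUMANPATH", ".")):
    print(" ".join(argv))
```

      Calling run_build requires the JUMAN tools to be installed and on
      the search path; build_commands alone has no side effects.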


2.4   The Dictionary's Internal Data Structures

      // I am very unsure of the translation of this section.//

      The dictionary is represented as a set of B-Trees[2]. The B-Trees
      are ordered by using the initial character of the Keyword as a
      key. That is, all the morphemes in a given B-Tree have keywords
      that start with the same character. 

      The B-Trees' internal data structure is given in figure 2. The
      search key is the Keyword. Among the information that is in the
      morpheme dictionary, the variable length items Keyword, Reading
      and Meaning are stored in the Keyword Dictionary, the Reading
      Dictionary and the Semantic Dictionary. In each case, the B-Tree
      contains an absolute offset (pointer) from the head of the
      dictionary file. The following 8 fields contain data:

      H1: symbolized part of speech
      H2:   "   "    subpart "
      K1:   "   "    Paradigm
      K2:   "   "    Desinence
      contbl: connection table entry number
      ptr_midasi: pointer into the Keyword Dictionary
      ptr_yomi: pointer into the Readings Dictionary
      ptr_imit: pointer into the Semantic Dictionary

      Morphemes with the same Keyword index are stored in a linear
      list at the nodes of the B-Tree. To make this possible, the
      field ptr_next contains either a pointer to the next entry which
      shares the same Keyword, or Nil. 
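
      The entry layout and the chaining through ptr_next can be
      modelled as below. This is a simplified sketch: a Python dict
      keyed by the first character stands in for the set of B-Trees,
      object references stand in for file offsets, and the sample
      values and romanized keywords are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass
class Entry:
    h1: int            # symbolized part of speech
    h2: int            # symbolized subpart of speech
    k1: int            # symbolized Paradigm
    k2: int            # symbolized Desinence
    contbl: int        # connection table entry number
    ptr_midasi: int    # offset into the Keyword Dictionary
    ptr_yomi: int      # offset into the Readings Dictionary
    ptr_imit: int      # offset into the Semantic Dictionary
    ptr_next: Optional["Entry"] = None  # next entry with the same Keyword


# Stand-in for the B-Trees: first character -> {keyword -> head of chain}.
Trees = Dict[str, Dict[str, Entry]]


def insert(trees: Trees, keyword: str, entry: Entry) -> None:
    """Add an entry at the head of the chain for its keyword."""
    if not keyword:
        raise ValueError("keyword must not be empty")
    tree = trees.setdefault(keyword[0], {})
    entry.ptr_next = tree.get(keyword)
    tree[keyword] = entry


def lookup(trees: Trees, keyword: str) -> Iterator[Entry]:
    """Yield every entry that shares the given keyword."""
    entry = trees.get(keyword[:1], {}).get(keyword)
    while entry is not None:
        yield entry
        entry = entry.ptr_next


trees: Trees = {}
insert(trees, "hon", Entry(0, 0, 0, 0, 10, 0, 0, 0))
insert(trees, "hon", Entry(3, 1, 0, 0, 42, 0, 4, 9))
insert(trees, "hana", Entry(0, 0, 0, 0, 10, 4, 8, 20))
print([e.contbl for e in lookup(trees, "hon")])
```

      A single keyword search therefore returns all homographs, and the
      variable-length strings are fetched only when needed through the
      three offsets.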

2.5   The Standard System Grammar

      A standard grammar has been prepared as the system's user
      dictionary. This is called the Standard System Grammar. The Part
      of Speech Dictionary, Conjugation Dictionary, and Connection
      Dictionary were built as extensions of the analysis contained in
      the Masuoka and Takubo grammar[1].

      1. Part of Speech Dictionary: We defined 14 parts of speech,
      adding the class "special" (punctuation, symbols, parentheses,
      etc.) to Masuoka and Takubo's system, and dividing affixes into
      Prefixes and Suffixes. (cf. JUMAN.grammar)

      2. Conjugation Dictionary:  We defined 21 standard Paradigms
      plus 6 special Paradigms, extending the Masuoka and Takubo
      grammar in order to deal with literary language (bungo),
      colloquial language, and polite language. (cf. JUMAN.katuyou)

      3. Connection Dictionary: This was designed from scratch,
      consulting Masuoka and Takubo.

      4. Conjugation Connection Dictionary:  A table of inflectable
      morpheme structures plus a table of the Desinences they can
      take. (cf. JUMAN.kankei)

3.    Morphological Analysis

      The environment variable JUMANPATH defines the absolute path for
      the dictionary directory. There are C and Prolog versions of the
      morphology analysis program.

3.1   C Version

      Morphological analysis is invoked by

      juman -[b|m|p] -[f|e|c]

      Input is read a line at a time from standard input.

      Contents of the analysis: The system searches for the analysis
      with the fewest unknown words, morphemes and independent words.
      The results are displayed according to the following options:

      If the analysis is ambiguous:

        -b display a single analysis with the longest matching suffix.
      // I don't know what gohoosaichooichi means, but compositionally
      it would appear to mean "longest matching suffix"//

        -m display all possible morphemes in ambiguous parts of the
      input // (but not duplicating unambiguous parts).//

        -p display all analyses // duplicating unambiguous parts //

      For each morpheme:

        -f Display with the fields aligned in columns
           // "karamu" is the loanword "column" //

        -e Display all morpheme info in character format
           //i.e., spelled out in kana and kanji //

        -c Display all morpheme info in code format


3.2   Prolog Version

      Processing Environment
      Juman is designed for SICStus Prolog 0.7 #4.

      Invocation
      Move to the directory juman/juman_pl, start Prolog, and
      `consult' `juman.pl'. Processing can proceed a sentence at a
      time or a file at a time.

      1. | ?- juman()
      2. | ?- juman.
           Input file name? 
           Output file name? user.
      
      Contents of the Analysis: The system seeks the analysis with the
      fewest unknown words, morphemes, and independent words. In case
      of ambiguity, output is expressed in a lattice structure.
      Ambiguous parts of the analysis are indented.

References

     [1] Takashi Masuoka and Yukinori Takubo. "Basic Japanese Grammar"
      Kurosio Publishers. 1989.

     [2] Knuth, D.E. "The Art of Computer Programming" vol. 3 Sorting
      and Searching. Addison Wesley. 1973.



Figures.

FIGURE 1. "Generation of the Dictionaries"
The boxes on the left are titled (from top to bottom)
   Connection Rule Dictionary
   Conjugation Connection Dictionary
   // these two boxes are surrounded by a dotted line //
   Morpheme Part of Speech Dictionary
   Conjugation Dictionary
   // these two dictionaries are surrounded by a dotted line //
   Grammar Dictionary
   // "grammar dictionary" is the label for the solid line surrounding
     the top 4 dictionaries //
   Morpheme Dictionary
   // this is the label for the solid box at the bottom. It has several
      unlabeled boxes inside of it.//


The line connecting "makeint" to "maketree" is labeled 
    Intermediate Structure

The boxes on the right are labeled (from top to bottom)

    Connection Table
    Connection Matrix
    // there is a solid line surrounding these two boxes//
    Tree Structure Dictionary
    // Arrows point from the  Tree Structure Dictionary
       to the following 3 dictionaries, which are drawn
       in overlapping boxes //
    Keyword Dictionary
    Reading Dictionary
    Semantic Dictionary

   // The entire right side is labeled //
    System Dictionary
   // at the bottom //


FIGURE 2. "Dictionary Data Structures"

  At the right of the diagram there are 5 boxes with Japanese labels.
  "ptr_midasi" points to the top one, and "contbl" points to the
   bottom one.

   From top to bottom, the labels are:

   Keyword Dictionary
   Reading Dictionary
   Semantic Dictionary
   Connection Table
   Connection Array

webmaster@www-nagao.kuee.kyoto-u.ac.jp
last update on 1997/05/19