Alicia Garrido, Amaia Iturraspe,
Sandra Montserrat, Hermínia
Pastor,
and Mikel L. Forcada
Departament de Llenguatges i Sistemes Informàtics,
Universitat d'Alacant,
E-03071 Alacant, Spain.
E-mail: {alicia,amaia,sandra,herminia}@torsimany.ua.es,
mlf@dlsi.ua.es
Morphological analysis and generation are essential to many natural-language processing tasks such as machine translation. Morphological analysis reads the inflected surface form of each word in a text and writes its lexical form, consisting of a canonical form (or lemma) of the word and a set of tags showing its syntactical category and morphological characteristics. Generation is the inverse process. Both analysis and generation rely on two sources of information: a dictionary of the valid lemmas of the language and a set of inflection paradigms.
One of the most efficient approaches to morphological analysis and generation uses finite-state transducers (FST) (mohri97j; oncina93j), a class of finite-state automata, that will be described in the following. FST may be used used as one-pass morphological analysers and generators and may be very efficiently implemented. Other analysers-generators, of which aguiar96th is a complete example for Spanish, use an intuitive pattern-matching approach which tries first to decompose the word in a number of stem-inflection pairs which are subsequently validated. There are a number of tools for the construction of FST-based morphological analysers available, the best known being those developed at Xerox [Karttunen1994,Karttunen1993,Chanod1994] (for a review in Spanish on finite-state morphology, see alegria96j).
We have developed an aid to the construction and maintenance of FST-based morphological analysers and generators. It is a compiler that reads a morphological dictionary containing a static description of the lemmas and the inflection paradigms and writes a C program that implements a compact FST-based morphological analyser (or generator) performing the task. This allows the linguist to focus on describing the lexicon and morphology of the language in question in a simple format and frees him or her of having to think as a programmer.
The morphological analysers and generators are based on finite-state
transducers; in particular, we use letter transducers
[Roche Schabes1997]. Any finite-state transducer may always be turned into
an equivalent letter transducer. A letter transducer is
defined as
,
where Q is a finite set of states, L a set
of transition labels,
the initial state,
the
set of final states, and
the
transition function (where 2Q represents the set of all finite sets
of states).
The set of transition labels is
where
is the alphabet of input
symbols,
the alphabet of output symbols, and
represents the empty symbol. According to this definition, state
transition labels may therefore be of four kinds:
,
meaning that symbol
is read and symbol
is written;
,
meaning that a
symbol is read but nothing is written;
,
meaning
that nothing is read but a symbol is written; and
means that a state transition occurs without
reading or writing. The last kind of transitions are not necessary
neither convenient in final FSTs, but may be useful during
construction. It is customary to represent the empty symbol
with a zero (``0''). A letter transducer is said to be deterministic when
.
Note that a
letter transducer which is deterministic with respect to the alphabet
may still
be non-deterministic with respect to the input alphabet
.
A string
is considered to be a transduction of an
input string
if there is at least one path from the
initial state qI to a final state in F whose transition labels
form the pair w:w' when concatenated. There may in principle be
more than one of such paths for a given transduction; this should be
avoided, and is partially eliminated by determinization (see below).
On the other hand, there may be more than a
valid transduction for a string w (in analysis, this would
correspond to lexical ambiguity; in generation, this should be
avoided). In analysis, the symbols in
are those found in
texts, and the symbols in
are those necessary to form the lemmas
and special symbols representing morphological information, such as
<noun>, <fem>, <2p>, etc. In
generation,
and
are exchanged.
The general definition of letter transducers is completely parallel to
that of non-deterministic finite automata (NFA) and that of
deterministic letter transducers, parallel to that of DFA;
accordingly, letter transducers may be determinized and minimized
(with respect to the alphabet L)
using the existing algorithms for NFA and DFA
[Hopcroft Ullman1979,Salomaa1973,van de
Snepscheut1993]. Transitions labeled
may be eliminated during determinization using a
technique parallel to
-closure.
Unlike other compilers like karttunen93tr, the compiler
described in this paper builds letter transducers having no cycles
(transitions form a directed acyclic graph) which, in addition, have a
unique final state. The absence of cycles is due to the fact that only
concatenations and alternations are allowed in the morphological
dictionary (see section 3)2. To minimize the resulting transducer, we use an algorithm
described by vandesnepscheut93b, which has two identical steps
which may be summarized as follows: in each step, the transition
arrows in the letter transducer are reversed, so that the final state
is initial and the initial state is final, and the resulting
transducer is determinized with respect to L (that is, new states
are formed with sets of old states so that the new
is
). The transducer resulting from the
double reversal-determinization process is minimal. This algorithm is
particularly efficient in the case of acyclic letter transducers.
Moreover, the two steps have a simple interpretation: the first step
joins common endings (finds regularities in suffixes) and the second
one joins common beginnings of transductions (finds regularities in
prefixes).
FST-based analysers output all possible analyses for a homograph3.
The morphological dictionary is a text file where spaces, tabulators and newlines may be freely inserted for legibility. Any text between ``#'' and the end of line is ignored and may be used as a comment. The dictionary has the following three sections:
<fem> or
<sg>) are explicitly declared.
[verbs_in_ducir] for Spanish verbs such as
traducir, producir, inducir, etc.) and declared
in advance so that it may be used in the dictionary. Paradigms may
be indefinitely nested, that is, the names of paradigms previously
defined may be used to define other paradigms (paradigms are
compiled into subtransducers that are then integrated to build the
complete transducer). A paradigm is basically a name for a set of
alternate transductions; the simplest transduction is a
pair of strings (input and output), such as `` (com:comer<verb>)'', which we call a couple; a paradigm
name, such as ``[pi1]'', is always a valid transduction; and
the concatenation of one or more of transductions, such as `` (am:amar<verb>)[V1C]'' is also a valid transduction4.
Couples are treated as follows: if zeroes (``0'') are used to
align explicitly the input and output strings in a couple so that
both have the same length, as in ``(am000:amar<verb>)'',
zeroes are understood to represent the empty symbol and the
alignment is preserved in the transducer; if both strings have
different length and contain no zeroes, as in `` (am:amar<verb>)'', then the compiler will align them using a
simple heuristic: the shortest string in the couple is completed
with trailing zeroes5. We have experimentally found this heuristic to
work very well for Spanish; the corresponding experiments are described in
section 5. If, for some reason, one of the strings should be
empty, a single zero may be used instead, as in `` (0:<3p><sg>)''. If zeroes are used, but the lengths of input and
output strings do not match, as in ``(com00:comer<verb>)'',
the compiler assumes that the designer of the morphological
dictionary intended to supply an explicit alignment but has failed
to give a correct one; accordingly, the compiler issues a warning to
the effect, ignores the zeroes and uses the heuristic described
above.
Note that paradigms do not necessarily have to describe endings
(suffixes), because they may be placed anywhere in a transduction.
[tener]'' with all the
forms of the irregular Spanish verb tener (``to have''); the
dictionary entry for tener would then simply be the paradigm
name, which could be in turn be used to define other entries sharing
the same inflective pattern such as detener (``to arrest'') as
``(de:de)[tener]'' or contener (``to contain'') as
``(con:con)[tener]''.
Figure 1 shows an example of the format of the morphological dictionary.
The compiler has been developed under Linux using bison (an
evolved version of yacc, johnson75tr) and flex
(an evolved version of lex, lesk75tr) to build a
front-end which reads in the morphological dictionary file and
combines the partial transducers corresponding to the declared
paradigms and the dictionary entries into a single transducer containing one initial and one
final state using
transitions as ``glue'' where
convenient. Error messages are designed to help the linguist correct
possible errors in the format of the morphological dictionary file
being compiled (such as paradigms or symbols used but not defined,
mismatched parentheses, etc.). The back-end minimizes the resulting
transducer (as described in section 2) and combines the
resulting code with a standard skeleton to produce a C program which
is ready to be used on its own or included in a larger application
such as a machine translation system.
We have performed experiments to evaluate the heuristic used by the compiler to align string couples when the linguist who has constructed the morphological dictionary decides not to supply an explicit alignment. To that effect, a linguist constructed a morphological dictionary D1 for 3 099 of the most frequent Spanish words which included all the nominal, pronominal, adjectival and verbal flexive paradigms of the Spanish language. She used her linguistic knowledge to align explicitly all couples in the dictionary using zeroes where necessary. A second morphological dictionary D2 was created from D1 by removing all zeroes. Each one of these morphological dictionaries was compiled into an analyser (A1 and A2 respectively). The number of states of A1 (5 574) and A2 (5 589) was almost indistinguishable. We also used a small corpus of Spanish legal texts containing 798 801 words6 to study the number of live hypotheses per letter, that is, the number of partial transductions, or, equivalently, the number of states alive per input letter. This is a measure of non-determinism with respect to input symbols and therefore of time complexity. The results for A1 and A2 are almost indistinguishable: average 4.0 and standard deviation 2.5 for A1 (linguistically motivated alignment) and average 3.9 and standard deviation 2.5 for A2 (automatic alignment). Speeds observed lie in the range of 20 000 words per second on a Pentium running at 400 MHz.
A compiler to automatically build finite-state-transducer-based
morphological analysers and generators from morphological dictionaries
has been described. This tool may be of great interest when building
natural-language processing systems such as machine translation
programs. When the linguist does not supply an explicit alignment
between surface forms and lexical forms, the compiler uses a simple
heuristic to produce an alignment that has been experimentally shown
to be equally efficient. We are currently testing an extended version
the program which determinizes and minimizes the complete transducer
into a p-subsequential letter transducer [Mohri1997a,Mohri1997b] which
behaves exactly as the minimal deterministic finite automaton with
respect to the input alphabet
;
according to the experimental
results presented in the previous section, we expect to obtain speeds
on the order of 100 000 words per second on state-of-the-art desktop
workstations.