MorphInd: Indonesian Morphological Analyzer

MorphInd is a robust finite-state morphology tool designed specifically for Indonesian. It handles both morphological analysis and lemmatization of a given surface word form, so that its output is suitable for further language processing. MorphInd consists of morphosyntactic and morphophonemic rules for analyzing Indonesian derived and inflected surface words.

MorphInd uses a positional tagset with up to 3 morphological tags and a special lemma tag that directly follows the lemma. The complete tagset can be found in the Tagset section. Given below is the MorphInd output structure:


Figure 1. MorphInd Output Structure

The surface word form is followed by 1 to 3 morphological tags. The lemma tag directly follows the lemma, so the lemma can be found easily for lemmatization purposes (see the “Lemma Position” in the figure above). Extra chunks, such as clitics (proclitics and enclitics) or particles, are analyzed as independent surface word forms but are attached to the main chunk by a plus sign (+). Other output examples can be found in the Examples section.
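The structure described above can be sketched in a few lines of Python. This is only an illustration and not part of MorphInd: the function name parse_analysis and the regular expressions are assumptions, and the sketch assumes that every chunk ends in an underscore followed by its tag and that neither lemmas nor affixes contain an underscore. It also strips the ^...$ wrapper that appears in the raw tool output.

```python
import re

# Assumption: each chunk ends in "_" plus a tag of up to three positions,
# and chunks are joined by "+" (as in the examples in this document).
CHUNK_RE = re.compile(r"(?P<body>[^_]+)_(?P<tag>[A-Z0-9-]{1,3})(?:\+|$)")
# The lemma is the morpheme immediately followed by a lemma tag such as <v>.
LEMMA_RE = re.compile(r"(?P<lemma>[^+<]+)<(?P<lemma_tag>[a-z])>")


def parse_analysis(analysis):
    """Split one MorphInd analysis into chunks with lemma and tags."""
    text = analysis.strip()
    # Drop the ^...$ wrapper used in the raw output, if present.
    if text.startswith("^") and text.endswith("$"):
        text = text[1:-1]
    chunks = []
    for match in CHUNK_RE.finditer(text):
        body = match.group("body")
        lemma = LEMMA_RE.search(body)
        chunks.append({
            "morphemes": body,
            "lemma": lemma.group("lemma") if lemma else None,
            "lemma_tag": lemma.group("lemma_tag") if lemma else None,
            "tag": match.group("tag"),
        })
    return chunks


if __name__ == "__main__":
    for chunk in parse_analysis("aku<p>_PS1+meN+kirim<v>+kan_VSA+dia<p>_PS3"):
        print(chunk["lemma"], chunk["lemma_tag"], chunk["tag"])
```

Anchoring each chunk on its trailing tag, rather than splitting on every plus sign, keeps affixes such as meN and kan together with the chunk they belong to.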

Tagset

MorphInd has a fine-grained tagset which was inspired by the Penn Treebank tagset and adapted accordingly for Indonesian morphology. The tagset also adopts the concept of positional tags of the Prague Dependency Treebank tagset to cover most of the language behaviors that occur simultaneously in a surface word. Given in the table below is the complete MorphInd tagset.

Table 1. Morphological Tagset

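1st Position                | 2nd Position         | 3rd Position
----------------------------+----------------------+-----------------
N Noun                      | P Plural             | F Feminine
                            | S Singular           | M Masculine
                            |                      | D Non-Specified
----------------------------+----------------------+-----------------
P Personal Pronoun          | P Plural             | 1 First Person
                            | S Singular           | 2 Second Person
                            |                      | 3 Third Person
----------------------------+----------------------+-----------------
V Verb                      | P Plural             | A Active Voice
                            | S Singular           | P Passive Voice
----------------------------+----------------------+-----------------
C Numeral                   | C Cardinal Numeral   |
                            | O Ordinal Numeral    |
                            | D Collective Numeral |
----------------------------+----------------------+-----------------
A Adjective                 | P Plural             | P Positive
                            | S Singular           | S Superlative
----------------------------+----------------------+-----------------
H Coordinating Conjunction  |                      |
S Subordinating Conjunction |                      |
F Foreign Word              |                      |
R Preposition               |                      |
M Modal                     |                      |
B Determiner                |                      |
D Adverb                    |                      |
T Particle                  |                      |
G Negation                  |                      |
I Interjection              |                      |
O Copula                    |                      |
W Question                  |                      |
X Unknown                   |                      |
Z Punctuation               |                      |

Table 2. Lemma Tagset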

Lemma Tag
n Noun
p Personal Pronoun
v Verb
c Numeral
q Adjective
h Coordinating Conjunction
s Subordinating Conjunction
f Foreign Word
r Preposition
m Modal
b Determiner
d Adverb
t Particle
g Negation
i Interjection
o Copula
w Question
x Unknown
z Punctuation
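
As an illustration of how the positional tags in Table 1 can be read, the following Python sketch maps a tag such as VSA to its labels. It is not part of MorphInd: the name decode_tag and the dictionaries are assumptions transcribed from Table 1, and the sketch assumes that unused positions are filled with a hyphen, as in the CC- and CO- tags shown in the Examples section.

```python
# Labels transcribed from Table 1; the structure itself is illustrative.
FIRST = {
    "N": "Noun", "P": "Personal Pronoun", "V": "Verb", "C": "Numeral",
    "A": "Adjective", "H": "Coordinating Conjunction",
    "S": "Subordinating Conjunction", "F": "Foreign Word",
    "R": "Preposition", "M": "Modal", "B": "Determiner", "D": "Adverb",
    "T": "Particle", "G": "Negation", "I": "Interjection", "O": "Copula",
    "W": "Question", "X": "Unknown", "Z": "Punctuation",
}
NUMBER = {"P": "Plural", "S": "Singular"}
# The meaning of the 2nd and 3rd positions depends on the 1st position.
SECOND = {
    "N": NUMBER, "P": NUMBER, "V": NUMBER, "A": NUMBER,
    "C": {"C": "Cardinal Numeral", "O": "Ordinal Numeral",
          "D": "Collective Numeral"},
}
THIRD = {
    "N": {"F": "Feminine", "M": "Masculine", "D": "Non-Specified"},
    "P": {"1": "First Person", "2": "Second Person", "3": "Third Person"},
    "V": {"A": "Active Voice", "P": "Passive Voice"},
    "A": {"P": "Positive", "S": "Superlative"},
}


def decode_tag(tag):
    """Return the list of labels for a positional tag, e.g. 'VSA'."""
    first = tag[0]
    labels = [FIRST.get(first, "Unrecognized(%s)" % first)]
    for position, table in ((1, SECOND), (2, THIRD)):
        # Assumption: "-" marks an unused position.
        if position < len(tag) and tag[position] != "-":
            code = tag[position]
            labels.append(table.get(first, {}).get(code, "Unrecognized(%s)" % code))
    return labels


if __name__ == "__main__":
    for example in ("VSA", "NSD", "PS1", "CC-"):
        print(example, decode_tag(example))
```

The same letter means different things in different positions (P is Personal Pronoun, Plural, Positive or Passive Voice), which is why the lookup is keyed on the first position.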

Examples

This section shows several examples of the tool's output. Given below is a phrase with a proclitic and an enclitic:

ph. kumengirimkannya (ph. I deliver him) yields aku<p>_PS1+meN+kirim<v>+kan_VSA+dia<p>_PS3

In some derivational cases, the lexical category of the lemma can differ from the lexical category of the whole surface form, as shown in the examples below:

v. kirim (v. deliver) yields kirim<v>_VSA
v. mengirim (v. deliver) yields meN+kirim<v>_VSA
n. kiriman (n. package) yields kirim<v>+an_NSD
n. pengiriman (n. delivery) yields peN+kirim<v>+an_NSD
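
The difference between the lemma category and the surface category can be checked mechanically. The sketch below is a hypothetical helper, not part of MorphInd. It assumes a single-chunk analysis (no clitics) and assumes that a lemma tag corresponds to the uppercase first-position tag, except for the adjective, which Table 2 lists as q and Table 1 as A.

```python
import re

# Assumption: lemma tags map to first-position tags by uppercasing,
# except the adjective (q in Table 2, A in Table 1).
LEMMA_TAG_TO_CATEGORY = {"q": "A"}


def lemma_category(analysis):
    """Category of the lemma, taken from the <x> lemma tag."""
    match = re.search(r"<([a-z])>", analysis)
    if match is None:
        raise ValueError("no lemma tag in %r" % analysis)
    tag = match.group(1)
    return LEMMA_TAG_TO_CATEGORY.get(tag, tag.upper())


def surface_category(analysis):
    """Category of the whole surface form (1st position of the final tag)."""
    match = re.search(r"_([A-Z])[A-Z0-9-]{0,2}\$?$", analysis.strip())
    if match is None:
        raise ValueError("no morphological tag in %r" % analysis)
    return match.group(1)


def changes_category(analysis):
    """True when derivation changed the lexical category of the lemma."""
    return lemma_category(analysis) != surface_category(analysis)


if __name__ == "__main__":
    for item in ("kirim<v>_VSA", "meN+kirim<v>_VSA",
                 "kirim<v>+an_NSD", "peN+kirim<v>+an_NSD"):
        print(item, changes_category(item))
```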

Given below are plural surface word forms. There are also several special plural cases that use an infix; these are hard-coded in the dictionary:

n. gerigi (n. teeth) yields gerigi<n>_NPD
n. gigi-gigi (n. teeth) yields gigi<n>_NPD

Given below is an example of numeral-noun agreement:

n. 2 buku (n. 2 books) yields 2<c>_CC- buku<n>_NSD
(lit n. *2 book)
n. dua buku (n. two books) yields dua<c>_CC- buku<n>_NSD
(lit n. *two book)
n. buku-buku (n. books) yields buku<n>_NPD
n. *2 buku-buku (lit n. 2 books) yields 2<c>_CC- buku<n>_NPD

Given below is an example of numeral alternation:

num. 2 (num. 2) yields 2<c>_CC-
num. dua (num. two) yields dua<c>_CC-
num. ke-2 (num. second) yields ke+2<c>_CO-
num. kedua (num. second) yields ke+dua<c>_CO-

Prerequisite

Currently, MorphInd can be run only on Unix operating systems.

Given below is the prerequisite for running MorphInd:

  • Foma 0.9.13alpha or higher. Foma can be downloaded from its project site or, on Ubuntu, installed with “apt-get install foma-bin”.

Download

  • Uploaded in 2012: Added a disambiguation module.
    svn --username public co https://svn.ms.mff.cuni.cz/svn/morphind/trunk/morphind.v.1.2 morphind
    password: “public”
  • Uploaded in February 2013: Added a disambiguation module.
    svn --username public co https://svn.ms.mff.cuni.cz/svn/morphind/trunk/morphind.v.1.3 morphind
    password: “public”
  • Uploaded in May 2013: Added a compound word module.
    svn --username public co https://svn.ms.mff.cuni.cz/svn/morphind/trunk/morphind.v.1.4 morphind
    password: “public”

If you encounter any problems, don’t hesitate to contact septina[dot]larasati[at]gmail[dot]com.

How to Run

  • bash$ cat INPUTFILE | perl MorphInd.pl > OUTPUTFILE
  • bash$ echo "mengirim" | perl MorphInd.pl > OUTPUTFILE
  • bash$ cat INPUTFILE | perl MorphInd.pl [-cache-file CACHEFILE -bin-file BINARYFILE -disambiguate (0/1)] > OUTPUTFILE

Example:

  • bash$ echo "kirim kiriman kumengirimkannya pengirim pengiriman" | perl MorphInd.pl

Output:

^kirim<v>_VSA$ ^kirim<v>+an_NSD$ ^aku<p>_PS1+meN+kirim<v>_VSA+dia<p>_PS3$ ^peN+kirim<v>_NSD$ ^peN+kirim<v>+an_NSD$

Try it yourself:

  • bash$ cat sample.txt | perl MorphInd.pl > sample.out
  • bash$ cat sample.txt | perl MorphInd.pl -cache-file cache/default.cache -bin-file bin/combined.bin -disambiguate 1 > sample.out
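
For use from a script, the command lines above can be wrapped in Python. The following is a minimal sketch, not something shipped with MorphInd. The function names are hypothetical, the option names are taken from the usage line above, and it assumes that perl and Foma are installed, that the script is run from the MorphInd directory, and that analyses in the output are separated by whitespace.

```python
import subprocess


def build_command(script="MorphInd.pl", cache_file=None, bin_file=None,
                  disambiguate=None):
    """Build the argument list for the command lines shown above."""
    command = ["perl", script]
    if cache_file is not None:
        command += ["-cache-file", cache_file]
    if bin_file is not None:
        command += ["-bin-file", bin_file]
    if disambiguate is not None:
        # The usage line gives this option as 0 or 1.
        command += ["-disambiguate", str(int(disambiguate))]
    return command


def run_morphind(text, **options):
    """Pipe text to MorphInd on stdin and return its raw stdout."""
    result = subprocess.run(build_command(**options), input=text,
                            capture_output=True, text=True, check=True)
    return result.stdout


def split_analyses(output):
    """Return the analyses in the output without their ^...$ wrappers."""
    return [token[1:-1] for token in output.split()
            if token.startswith("^") and token.endswith("$")]


if __name__ == "__main__":
    # Only the parsing step is shown here; run_morphind needs a local install.
    print(split_analyses("^kirim<v>_VSA$ ^kirim<v>+an_NSD$"))
```

Passing the arguments as a list avoids shell quoting issues with the input text.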

Known Problems

Some problems that have been encountered:

  • foma-proc: error while loading shared libraries: libreadline.so.5: cannot open shared object file: No such file or directory
    Create a symlink named libreadline.so.5 that points to the newer library installed on your system (commonly libreadline.so.6). Do the same for libhistory: create a symlink named libhistory.so.5 that points to libhistory.so.6.

Publication

S.D. Larasati, V. Kuboň, and D. Zeman, “Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus”, Proceedings of the Workshop on Systems and Frameworks for Computational Morphology (SFCM 2011), Zurich, Switzerland, 2011. Springer. Presented 08/2011.

License

If you use MorphInd, please cite our publication.

MorphInd by the Institute of Formal and Applied Linguistics (UFAL) is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Acknowledgement

This project was financially supported by the grant LC536 Centrum Komputační Lingvistiky of the Czech Ministry of Education.