Septina Dian Larasati

MorphInd: Indonesian Morphological Analyzer

MorphInd is a robust morphological analyzer for Indonesian words developed in a Finite State architecture.

Introduction

MorphInd is a robust finite-state morphology tool for Indonesian. It performs both morphological analysis and lemmatization of a given surface word form, so its output is suitable for further language processing. MorphInd consists of morphosyntactic and morphophonemic rules covering Indonesian derivational and inflectional surface words.

MorphInd uses a positional tagset with three morphological tag positions and a special lemma tag that directly follows the lemma. The complete tagset is given in the Tagset section. The MorphInd output structure is shown below:

Figure 1. MorphInd Output Structure

The surface word form is followed by one to three morphological tags. The lemma tag directly follows the lemma, so the lemma can be easily identified for lemmatization purposes (see “lemma position” in the figure above). Extra chunks, such as clitics (proclitics and enclitics) or particles, are analyzed as independent surface word forms but are glued to the main chunk with a plus sign (+). More output examples can be found in the Examples section.
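The structure described above can be sketched as a small parser. This is only an illustration, not part of MorphInd: the function name parse_analysis and the dictionary keys are hypothetical, and the regular expressions assume the format seen in the examples on this page (a lemma followed by a lowercase tag in angle brackets, and a tag of one to three characters after an underscore).

```python
import re

# One chunk: everything up to "_", then a 1-3 character morphological tag,
# then either a "+" gluing the next chunk or the end of the string.
# (Assumed from the examples on this page, not from the MorphInd sources.)
CHUNK_RE = re.compile(r"(?P<body>[^_]+)_(?P<tag>[A-Z0-9-]{1,3})(?:\+|$)")
# The lemma is the morpheme immediately followed by a lemma tag such as <v>.
LEMMA_RE = re.compile(r"(?P<lemma>[^+<>]+)<(?P<lemma_tag>[a-z])>")


def parse_analysis(analysis):
    """Split one analysis string into a list of chunk dictionaries."""
    chunks = []
    pos = 0
    while pos < len(analysis):
        match = CHUNK_RE.match(analysis, pos)
        if match is None:
            raise ValueError(f"cannot parse {analysis!r} at offset {pos}")
        body = match.group("body")
        lemma_match = LEMMA_RE.search(body)
        if lemma_match is None:
            raise ValueError(f"no lemma tag in chunk {body!r}")
        chunks.append({
            "lemma": lemma_match.group("lemma"),
            "lemma_tag": lemma_match.group("lemma_tag"),
            # Morphemes before and after the lemma inside the same chunk.
            "prefixes": [m for m in body[:lemma_match.start()].split("+") if m],
            "suffixes": [m for m in body[lemma_match.end():].split("+") if m],
            "tag": match.group("tag"),
        })
        pos = match.end()
    return chunks


for chunk in parse_analysis("aku<p>_PS1+meN+kirim<v>+kan_VSA+dia<p>_PS3"):
    print(chunk["lemma"], chunk["lemma_tag"], chunk["tag"])
```

The parser treats the underscore and tag as the end of a chunk, which is how it tells a plus sign between morphemes from a plus sign that glues a clitic to the main chunk.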

Tagset

MorphInd has a fine-grained tagset that was inspired by the Penn Treebank tagset and adapted to Indonesian morphology. The tagset also adopts the positional-tag concept of the Prague Dependency Treebank tagset, so that several grammatical features occurring simultaneously in one surface word can be encoded together. The complete MorphInd tagset is given in the table below.

Table 1. Morphological Tagset

1st Position                 2nd Position             3rd Position
N  Noun                      P  Plural                F  Feminine
                             S  Singular              M  Masculine
                                                      D  Non-Specified
---                          ---                      ---
P  Personal Pronoun          P  Plural                1  First Person
                             S  Singular              2  Second Person
                                                      3  Third Person
---                          ---                      ---
V  Verb                      P  Plural                A  Active Voice
                             S  Singular              P  Passive Voice
---                          ---                      ---
C  Numeral                   C  Cardinal Numeral
                             O  Ordinal Numeral
                             D  Collective Numeral
---                          ---                      ---
A  Adjective                 P  Plural                P  Positive
                             S  Singular              S  Superlative
---                          ---                      ---
H  Coordinating Conjunction
S  Subordinating Conjunction
F  Foreign Word
R  Preposition
M  Modal
B  Determiner
D  Adverb
T  Particle
G  Negation
I  Interjection
O  Copula
W  Question
X  Unknown
Z  Punctuation

Table 2. Lemma Tagset

Lemma Tag
n Noun
p Personal Pronoun
v Verb
c Numeral
q Adjective
h Coordinating Conjunction
s Subordinating Conjunction
f Foreign Word
r Preposition
m Modal
b Determiner
d Adverb
t Particle
g Negation
i Interjection
o Copula
w Question
x Unknown
z Punctuation
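To illustrate how the positional tags are read, the sketch below decodes a tag such as VSA into the labels of Table 1. The function name decode_tag and the dictionary names are hypothetical. The code also assumes that "-" marks an unused position, which is inferred from tags such as CC- in the Examples section rather than stated in the table.

```python
NUMBER = {"P": "Plural", "S": "Singular"}

# First position: lexical category (Table 1).
FIRST = {
    "N": "Noun", "P": "Personal Pronoun", "V": "Verb", "C": "Numeral",
    "A": "Adjective", "H": "Coordinating Conjunction",
    "S": "Subordinating Conjunction", "F": "Foreign Word", "R": "Preposition",
    "M": "Modal", "B": "Determiner", "D": "Adverb", "T": "Particle",
    "G": "Negation", "I": "Interjection", "O": "Copula", "W": "Question",
    "X": "Unknown", "Z": "Punctuation",
}
# Second and third positions depend on the first-position category.
SECOND = {
    "N": NUMBER, "P": NUMBER, "V": NUMBER, "A": NUMBER,
    "C": {"C": "Cardinal Numeral", "O": "Ordinal Numeral",
          "D": "Collective Numeral"},
}
THIRD = {
    "N": {"F": "Feminine", "M": "Masculine", "D": "Non-Specified"},
    "P": {"1": "First Person", "2": "Second Person", "3": "Third Person"},
    "V": {"A": "Active Voice", "P": "Passive Voice"},
    "A": {"P": "Positive", "S": "Superlative"},
}


def decode_tag(tag):
    """Return the list of feature labels encoded by a positional tag."""
    if not 1 <= len(tag) <= 3:
        raise ValueError(f"tag must have 1 to 3 characters: {tag!r}")
    padded = tag.ljust(3, "-")  # assumption: "-" means "position unused"
    category = padded[0]
    if category not in FIRST:
        raise ValueError(f"unknown category in tag {tag!r}")
    features = [FIRST[category]]
    for position, table in ((1, SECOND), (2, THIRD)):
        code = padded[position]
        if code == "-":
            continue
        allowed = table.get(category, {})
        if code not in allowed:
            raise ValueError(f"invalid code {code!r} in tag {tag!r}")
        features.append(allowed[code])
    return features


print(decode_tag("VSA"))
print(decode_tag("CO-"))
```

Keeping the second and third positions in tables keyed by the first position mirrors the layout of Table 1, where the same letter (for example P or S) means different things under different categories.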

Examples

This section shows several examples of the tool's output. Below is an example of a phrase with a proclitic and an enclitic:

ph. kumengirimkannya (ph. I deliver him) yields aku<p>_PS1+meN+kirim<v>+kan_VSA+dia<p>_PS3

In some derivational cases, the lexical category of the lemma can differ from the lexical category of the whole surface form, as shown in the examples below:

v. kirim (v. deliver) yields kirim<v>_VSA
v. mengirim (v. deliver) yields meN+kirim<v>_VSA
n. kiriman (n. package) yields kirim<v>+an_NSD
n. pengiriman (n. delivery) yields peN+kirim<v>+an_NSD
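The examples above suggest a simple way to lemmatize and to detect a category change: read the lemma and its lemma tag, then compare the lemma tag with the first position of the morphological tag. The sketch below is a hypothetical helper (lemma_and_shift is not a MorphInd function) for single-chunk analyses. It relies on Tables 1 and 2, where every lemma tag is the lowercase form of the first-position letter except q, which corresponds to A (Adjective).

```python
import re

# lemma<lemma_tag> ... _TAG at the end of a single-chunk analysis.
ANALYSIS_RE = re.compile(r"([^+<>]+)<([a-z])>.*_([A-Z])[A-Z0-9-]{0,2}$")


def lemma_and_shift(analysis):
    """Return (lemma, True if the surface category differs from the lemma's)."""
    match = ANALYSIS_RE.search(analysis)
    if match is None:
        raise ValueError(f"unrecognized analysis: {analysis!r}")
    lemma, lemma_tag, surface_category = match.groups()
    # Lemma tag q (Adjective) maps to first-position tag A; all other
    # lemma tags map to their uppercase letter (Tables 1 and 2).
    lemma_category = "A" if lemma_tag == "q" else lemma_tag.upper()
    return lemma, lemma_category != surface_category


for analysis in ("kirim<v>_VSA", "meN+kirim<v>_VSA",
                 "kirim<v>+an_NSD", "peN+kirim<v>+an_NSD"):
    print(analysis, lemma_and_shift(analysis))
```

All four forms share the lemma kirim, while only the two noun forms report a category change from verb to noun.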

Below are examples of plural surface word forms. There are also several special plural cases that use an infix; these are hard-coded in the dictionary:

n. gerigi (n. teeth) yields gerigi<n>_NPD
n. gigi-gigi (n. teeth) yields gigi<n>_NPD

Below is an example of numeral-noun agreement:

n. 2 buku (n. 2 books) yields 2<c>_CC- buku<n>_NSD
(lit n. *2 book)
n. dua buku (n. two books) yields dua<c>_CC- buku<n>_NSD
(lit n. *two book)
n. buku-buku (n. books) yields buku<n>_NPD
n. *2 buku-buku (lit n. 2 books) yields 2<c>_CC- buku<n>_NPD

Below is an example of numeral alternation:

num. 2 (num. 2) yields 2<c>_CC-
num. dua (num. two) yields dua<c>_CC-
num. ke-2 (num. second) yields ke+2<c>_CO-
num. kedua (num. second) yields ke+dua<c>_CO-

Prerequisite

Currently, MorphInd runs only on Unix operating systems.

MorphInd requires the following:

  • Foma 0.9.13alpha or higher, available from the Foma project page.

Download

  • Uploaded in 2012: Added disambiguation module.

    svn --username public co https://svn.ms.mff.cuni.cz/svn/morphind/trunk/morphind.v.1.2 morphind

    password: "public"

  • Uploaded in February 2013: Added disambiguation module.

    svn --username public co https://svn.ms.mff.cuni.cz/svn/morphind/trunk/morphind.v.1.3 morphind

    password: "public"

  • Uploaded in May 2013: Added compound word module.

    svn --username public co https://svn.ms.mff.cuni.cz/svn/morphind/trunk/morphind.v.1.4 morphind

    password: "public"

If you encounter any problems, don’t hesitate to contact larasati [at] ufal [dot] mff [dot] cuni [dot] cz.

How to Run

  • bash$ cat INPUTFILE | perl Morphind.pl > OUTPUTFILE
  • bash$ echo "mengirim" | perl Morphind.pl > OUTPUTFILE
  • bash$ cat INPUTFILE | perl Morphind.pl [-cache-file CACHEFILE -bin-file BINARYFILE -disambiguate (0/1)] > OUTPUTFILE

Example:

  • bash$ echo "kirim kiriman kumengirimkannya pengirim pengiriman" | perl MorphInd.pl

Output:

^kirim<v>_VSA$ ^kirim<v>+an_NSD$ ^aku<p>_PS1+meN+kirim<v>_VSA+dia<p>_PS3$ ^peN+kirim<v>_NSD$ ^peN+kirim<v>+an_NSD$
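An output line like the one above can be split back into one analysis per input word. The following sketch assumes, based only on this sample, that each analysis is wrapped in ^ and $ and that the analyses appear in the same order as the whitespace-separated input words. The function name split_output is hypothetical.

```python
import re

# Each analysis is delimited by "^" and "$" in the sample output.
TOKEN_RE = re.compile(r"\^(.*?)\$")


def split_output(line):
    """Return the analyses found in one output line, without delimiters."""
    return TOKEN_RE.findall(line)


words = "kirim kiriman kumengirimkannya pengirim pengiriman".split()
line = ("^kirim<v>_VSA$ ^kirim<v>+an_NSD$ "
        "^aku<p>_PS1+meN+kirim<v>_VSA+dia<p>_PS3$ "
        "^peN+kirim<v>_NSD$ ^peN+kirim<v>+an_NSD$")

analyses = split_output(line)
# Pair each input word with its analysis (assumes a one-to-one alignment).
pairs = dict(zip(words, analyses))
for word, analysis in pairs.items():
    print(word, "->", analysis)
```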

Try it yourself:

  • bash$ cat sample.txt | perl Morphind.pl > sample.out
  • bash$ cat sample.txt | perl Morphind.pl -cache-file cache/default.cache -bin-file bin/combined.bin -disambiguate 1 > sample.out
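For use inside a larger pipeline, the commands above can be wrapped in a short script. The sketch below is one possible wrapper, not something shipped with MorphInd: build_command and run_morphind are hypothetical names, and only the options shown on this page (-cache-file, -bin-file, -disambiguate) are used. The script name is passed in explicitly because this page spells it both Morphind.pl and MorphInd.pl.

```python
import subprocess


def build_command(script, cache_file=None, bin_file=None, disambiguate=None):
    """Build the argument list for one MorphInd invocation."""
    cmd = ["perl", script]
    if cache_file is not None:
        cmd += ["-cache-file", cache_file]
    if bin_file is not None:
        cmd += ["-bin-file", bin_file]
    if disambiguate is not None:
        # The page documents this option as taking 0 or 1.
        cmd += ["-disambiguate", str(int(disambiguate))]
    return cmd


def run_morphind(text, script, **options):
    """Send text to MorphInd on standard input and return its output."""
    if not text.endswith("\n"):
        text += "\n"  # mimic what `echo` sends
    result = subprocess.run(
        build_command(script, **options),
        input=text, capture_output=True, text=True, check=True,
    )
    return result.stdout


print(build_command("Morphind.pl", cache_file="cache/default.cache",
                    bin_file="bin/combined.bin", disambiguate=1))
```

Calling run_morphind("mengirim", "Morphind.pl") would correspond to the echo example above; it requires Perl, Foma, and the MorphInd files to be present in the working directory.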

Known Problems

Problems encountered so far:

  • foma-proc: error while loading shared libraries: libreadline.so.5: cannot open shared object file: No such file or directory
    Create a symlink named libreadline.so.5 that points to the newer library installed on your system (commonly libreadline.so.6). Do the same for libhistory.so.5, pointing it to libhistory.so.6.

If you encounter any other problems, don’t hesitate to contact us.

Documentation

If you use MorphInd, please cite our publication.

Publications

Septina Dian Larasati, Vladislav Kuboň, and Daniel Zeman: Indonesian Morphology Tool (MorphInd): Towards an Indonesian Corpus.
SFCM 2011, August 2011, Zurich, Switzerland. To appear in the Springer CCIS proceedings of the Workshop on Systems and Frameworks for Computational Morphology.

Manuals

Under construction.

License

If you use MorphInd, please cite our publication (see the Documentation section).

MorphInd by the Institute of Formal and Applied Linguistics (UFAL) is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

Acknowledgement

This project was financially supported by the grant LC536 Centrum Komputační Lingvistiky of the Czech Ministry of Education.