ARABIC PART-OF-SPEECH/MORPHOLOGICAL ANALYSIS TAGGING
The Penn Arabic Treebank uses a level of annotation more accurately
described as morphological analysis than as part-of-speech tagging. In
October 2001, the decision was taken to use Tim Buckwalter's morphological
analyzer and main lexicon, which currently contains over 77,800 stem
entries representing some 45,000 lexical items.
A DESCRIPTION OF TIM BUCKWALTER'S ARABIC MORPHOLOGICAL ANALYSIS TOOL
The Arabic morphological analysis and part-of-speech tagging were performed
with the Buckwalter Arabic Morphological Analyzer, an open-source software
package distributed by the LDC. The source code of the program and a full
technical description can be downloaded for free from the LDC website:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2002L49.
What follows is a brief description of the morphological analysis
algorithm and the structure of the lexicon entries.
The morphological analysis is based on these assumptions:
1. Arabic words are composed of three elements: prefix, stem, and suffix
2. Prefix length is 0-4 characters
3. Stem length is 1 or more characters (no upper limit)
4. Suffix length is 0-6 characters
Given these rules, an Arabic word can be segmented as follows (using
wbAlErbyp as an example; an empty prefix or suffix is shown as (null)):
Prefix   Stem        Suffix
(null)   wbAlErbyp   (null)
(null)   wbAlErby    p
(null)   wbAlErb     yp
(null)   wbAlEr      byp
(null)   wbAlE       rbyp
(null)   wbAl        Erbyp
(null)   wbA         lErbyp
w        bAlErbyp    (null)
w        bAlErby     p
w        bAlErb      yp
w        bAlEr       byp
w        bAlE        rbyp
w        bAl         Erbyp
w        bA          lErbyp
wb       AlErbyp     (null)
wb       AlErby      p
wb       AlErb       yp
wb       AlEr        byp
wb       AlE         rbyp
wb       Al          Erbyp
wb       A           lErbyp
wbA      lErbyp      (null)
wbA      lErby       p
wbA      lErb        yp
wbA      lEr         byp
wbA      lE          rbyp
wbA      l           Erbyp
wbAl     Erbyp       (null)
wbAl     Erby        p
wbAl     Erb         yp
wbAl     Er          byp
wbAl     E           rbyp
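The enumeration above follows mechanically from the three length rules. The
following is a minimal Python sketch of it, not the analyzer's own code (the
function name segment and its parameter names are illustrative assumptions):

```python
def segment(word, max_prefix=4, max_suffix=6):
    """Return every (prefix, stem, suffix) split allowed by the length rules.

    The prefix has 0..max_prefix characters, the suffix 0..max_suffix
    characters, and the stem always keeps at least one character.
    """
    n = len(word)
    splits = []
    for p in range(0, min(max_prefix, n - 1) + 1):
        for s in range(0, min(max_suffix, n - p - 1) + 1):
            splits.append((word[:p], word[p:n - s], word[n - s:]))
    return splits


if __name__ == "__main__":
    for prefix, stem, suffix in segment("wbAlErbyp"):
        print(prefix or "(null)", stem, suffix or "(null)")
```

For wbAlErbyp this yields the 32 candidate segmentations listed above, in
the same order (shortest prefix first, then shortest suffix first).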
Arabic dictionary look-up consists of asking, for each segmentation:
1. does the prefix exist in the lexicon of prefixes?
2. if so, does the stem exist in the lexicon of stems?
3. if so, does the suffix exist in the lexicon of suffixes?
Note that the dictionary of prefixes contains not only the individual
prefixes (wa-, fa-, li-, Al-, bi-, etc.) but all valid concatenations of
these as well (waAl-, biAl-, wabiAl-, etc.). The same applies to the
dictionary of suffixes (-ap, -At, -Ani, -athu, -Athum, -Anihi, -tumuwhA,
etc.).
Here are some sample entries from the dictionary of prefixes:
wl     wali     NPref-Li     and + for/to          wa/CONJ+li/PREP+
ll     lil      NPref-Li     to/for + the          li/PREP+Al/DET+
wll    walil    NPref-Lil    and + to/for + the    wa/CONJ+li/PREP+Al/DET+
wbAl   wabiAl   NPref-BiAl   and + with/by the     wa/CONJ+bi/PREP+Al/DET+
The first column contains the unvocalized string that is looked up, and the
second column has the vocalized version of the same string. The third
column has the morphological category (whose function is explained further
below). The fourth column has the English gloss, followed by the
part-of-speech information for the constituent morphemes.
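The part-of-speech information in these entries is a "+"-joined sequence of
morpheme/TAG pairs. As an illustration only, such a string could be split
into pairs as follows (parse_pos is a hypothetical helper, not part of the
distributed analyzer):

```python
def parse_pos(field):
    """Split a string such as 'wa/CONJ+li/PREP+' into (morpheme, tag) pairs.

    Leading or trailing '+' signs only mark where the stem attaches, so the
    empty chunks they produce are skipped.
    """
    pairs = []
    for chunk in field.split("+"):
        if not chunk:
            continue
        morpheme, _, tag = chunk.partition("/")
        pairs.append((morpheme, tag))
    return pairs


if __name__ == "__main__":
    print(parse_pos("wa/CONJ+bi/PREP+Al/DET+"))
```

The same helper applies to suffix strings, which carry a leading "+"
instead of a trailing one.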
Here are some sample entries from the dictionary of stems (lines beginning
with ";; " contain the lemma ID string):
;; Earabiy~_1
Erby   Earabiy~   N/ap    Arab                Earabiy~/NOUN
Erb    Earab      N       Arabs               Earab/NOUN
Erby   Earabiy~   N/ap    Arab                Earabiy~/ADJ
Erb    Earab      N       Arab                Earab/ADJ
;; Earabiy~_2
Erby   Earabiy~   N-ap    Arabic;Arab         Earabiy~/ADJ
;; Earabiy~_3
Erby   Earabiy~   N0      Arabi               Earabiy~/NOUN_PROP
;; Earabiy~ap_1
Erby   Earabiy~   NapAt   Arabic (language)   Earabiy~/NOUN
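Because every ";; " line names the lemma for the stem entries that follow
it, the stem dictionary can be read as a mapping from lemma ID to entries.
The sketch below assumes, for illustration, that the fields of an entry are
separated by tabs; the function name group_by_lemma is hypothetical.

```python
def group_by_lemma(lines):
    """Map each lemma ID to the list of stem entries listed beneath it."""
    lemmas = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(";; "):
            # A lemma ID line opens a new group.
            current = line[3:].strip()
            lemmas.setdefault(current, [])
        elif line.strip() and current is not None:
            # Assumption: entry fields are tab-separated.
            lemmas[current].append(line.split("\t"))
    return lemmas


sample = [
    ";; Earabiy~_2",
    "Erby\tEarabiy~\tN-ap\tArabic;Arab\tEarabiy~/ADJ",
    ";; Earabiy~_3",
    "Erby\tEarabiy~\tN0\tArabi\tEarabiy~/NOUN_PROP",
]

if __name__ == "__main__":
    for lemma_id, entries in group_by_lemma(sample).items():
        print(lemma_id, [entry[2] for entry in entries])
```

Here the same look-up string Erby ends up under two different lemma IDs,
each with its own morphological category.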
The following are sample entries from the dictionary of suffixes:
p    ap    NSuff-ap   [fem.sg.]   +ap/NSUFF_FEM_SG
Ak   Aka   NSuff-Ah   your two    +A/NSUFF_MASC_DU_NOM+ka/POSS_PRON_2MS
Ak   Aki   NSuff-Ah   your two    +A/NSUFF_MASC_DU_NOM+ki/POSS_PRON_2FS
If all three word elements (prefix, stem, suffix) are found in their
respective lexicons, we then use their respective morphological categories
(the string in column 3) to determine whether they are compatible. We ask:
1. is the morphological category of the prefix compatible with the
morphological category of the stem? (i.e., is the combination found in the
list of compatible prefix-stem morphological categories?)
2. if so, is the morphological category of the prefix compatible with the
morphological category of the suffix? (i.e., is the combination found in
the list of compatible prefix-suffix morphological categories?)
3. if so, is the morphological category of the stem compatible with the
morphological category of the suffix? (i.e., is the combination found in
the list of compatible stem-suffix morphological categories?)
If the answer to all three questions is "yes", then the morphological
analysis is valid.
Example:
INPUT STRING: وصفه
LOOK-UP WORD: wSfh
SOLUTION 1: (waSafahu) [waSaf-i_1] waSaf/VERB_PERFECT+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
(GLOSS): + describe/characterize + he/it it/him
SOLUTION 2: (waSafahu) [waSaf-i_1] waSaf/VERB_PERFECT+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
(GLOSS): + prescribe/give a prescription to + he/it it/him
SOLUTION 3: (waSofh) [waSof_1] waSof/NOUN+hu/POSS_PRON_3MS
(GLOSS): + description/portrayal/characterization + its/his
SOLUTION 4: (waSofh) [waSof_2] waSof/NOUN+hu/POSS_PRON_3MS
(GLOSS): + characteristic + its/his
SOLUTION 5: (waSaf~ahu) [Saf~-u_1] wa/CONJ+Saf~/VERB_PERFECT+a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
(GLOSS): and + arrange/classify + he/it it/him
SOLUTION 6: (waSaf~h) [Saf~_1] wa/CONJ+Saf~/NOUN+hu/POSS_PRON_3MS
(GLOSS): and + line/row/class + its/his
Solution #1 was found to be valid because:
1. All 3 components (null)+wSf+h exist in their respective lexicons (note
that there is a literal entry for the null prefix):
(null)   (null)   Pref-0      (null)
wSf      waSaf    PV          describe;characterize
h        ahu      PVSuff-ah   he/it it/him            +a/PVSUFF_SUBJ:3MS+hu/PVSUFF_DO:3MS
2. The morphological categories of all 3 components are listed as
compatible pairs in the relevant compatibility tables:
1. "Pref-0 PV" (listed in the table of compatible prefix-stem
morphological categories)
2. "PV PVSuff-ah" (listed in the table of compatible stem-suffix
morphological categories)
3. "Pref-0 PVSuff-ah" (listed in the table of compatible prefix-suffix
morphological categories)
Solution #6 was found to be valid because:
1. All 3 components w+Sf+h exist in their respective lexicons:
w    wa     Pref-Wa   and              wa/CONJ+
Sf   Saf~   Ndu       line;row;class
h    h      NSuff-h   its/his          +hu/POSS_PRON_3MS
2. The morphological categories of all 3 components are listed as
compatible pairs in the relevant compatibility tables:
1. "Pref-Wa Ndu" (listed in the table of compatible prefix-stem
morphological categories)
2. "Ndu NSuff-h" (listed in the table of compatible stem-suffix
morphological categories)
3. "Pref-Wa NSuff-h" (listed in the table of compatible prefix-suffix
morphological categories)
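The look-up and compatibility checks walked through above can be sketched
in Python as follows. This is an illustration under stated assumptions, not
the analyzer's implementation: the lexicons and compatibility tables hold
only the handful of entries quoted for Solutions #1 and #6, the null prefix
is stored as an empty string, and all names (PREFIXES, analyze, and so on)
are hypothetical.

```python
# Toy lexicons: look-up string -> list of (vocalized form, category).
PREFIXES = {"": [("", "Pref-0")], "w": [("wa", "Pref-Wa")]}
STEMS = {"wSf": [("waSaf", "PV")], "Sf": [("Saf~", "Ndu")]}
SUFFIXES = {"h": [("ahu", "PVSuff-ah"), ("h", "NSuff-h")]}

# Toy compatibility tables: sets of allowed category pairs.
PREFIX_STEM = {("Pref-0", "PV"), ("Pref-Wa", "Ndu")}
PREFIX_SUFFIX = {("Pref-0", "PVSuff-ah"), ("Pref-Wa", "NSuff-h")}
STEM_SUFFIX = {("PV", "PVSuff-ah"), ("Ndu", "NSuff-h")}


def analyze(word, max_prefix=4, max_suffix=6):
    """Return the vocalized form of every valid prefix+stem+suffix analysis."""
    n = len(word)
    solutions = []
    for p in range(0, min(max_prefix, n - 1) + 1):
        for s in range(0, min(max_suffix, n - p - 1) + 1):
            prefix, stem, suffix = word[:p], word[p:n - s], word[n - s:]
            # Each .get() returns [] when the element is not in its lexicon,
            # which ends the check for this segmentation.
            for pre_voc, pre_cat in PREFIXES.get(prefix, []):
                for stem_voc, stem_cat in STEMS.get(stem, []):
                    for suf_voc, suf_cat in SUFFIXES.get(suffix, []):
                        if ((pre_cat, stem_cat) in PREFIX_STEM
                                and (pre_cat, suf_cat) in PREFIX_SUFFIX
                                and (stem_cat, suf_cat) in STEM_SUFFIX):
                            solutions.append(pre_voc + stem_voc + suf_voc)
    return solutions


if __name__ == "__main__":
    print(analyze("wSfh"))
```

With these toy tables, wSfh yields only the analyses corresponding to
Solutions #1 and #6; the remaining solutions would require the full
lexicons and compatibility tables.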
The lexicon of stems used in the morphological analysis contains 83,811
entries and 39,321 lemmas (as of Dec. 20, 2002).
AFP ARABIC POS TAGS
ABBREV
ADJ
ADV
CONJ
DEM_PRON_F
DEM_PRON_FD
DEM_PRON_FS
DEM_PRON_MD
DEM_PRON_MP
DEM_PRON_MS
DET
EMPHATIC_PARTICLE
EXCEPT_PART
FUNC_WORD
FUT
INTERJ
INTERROG_PART
IV1P
IV1S
IV2D
IV2FS
IV2MP
IV2MS
IV3FD
IV3FP
IV3FS
IV3MD
IV3MP
IV3MS
IVSUFF_DO:1P
IVSUFF_DO:1S
IVSUFF_DO:2MP
IVSUFF_DO:2MS
IVSUFF_DO:3D
IVSUFF_DO:3FS
IVSUFF_DO:3MP
IVSUFF_DO:3MS
IVSUFF_SUBJ:2FS_MOOD:SJ
IVSUFF_SUBJ:D_MOOD:I
IVSUFF_SUBJ:D_MOOD:SJ
IVSUFF_SUBJ:FP
IVSUFF_SUBJ:MP_MOOD:I
IVSUFF_SUBJ:MP_MOOD:SJ
NEG_PART
NO_FUNC
NON_ALPHABETIC
NON_ARABIC
NOUN
NOUN_PROP
NSUFF_FEM_DU_ACCGEN
NSUFF_FEM_DU_ACCGEN_POSS
NSUFF_FEM_DU_NOM
NSUFF_FEM_DU_NOM_POSS
NSUFF_FEM_PL
NSUFF_FEM_SG
NSUFF_MASC_DU_ACCGEN
NSUFF_MASC_DU_ACCGEN_POSS
NSUFF_MASC_DU_NOM
NSUFF_MASC_DU_NOM_POSS
NSUFF_MASC_PL_ACCGEN
NSUFF_MASC_PL_ACCGEN_POSS
NSUFF_MASC_PL_NOM
NSUFF_MASC_PL_NOM_POSS
NSUFF_MASC_SG_ACC_INDEF
NUM
NUMERIC_COMMA
PART
POSS_PRON_1P
POSS_PRON_1S
POSS_PRON_2FS
POSS_PRON_2MP
POSS_PRON_2MS
POSS_PRON_3D
POSS_PRON_3FP
POSS_PRON_3FS
POSS_PRON_3MP
POSS_PRON_3MS
PREP
PRON_1P
PRON_1S
PRON_2FS
PRON_2MP
PRON_2MS
PRON_3D
PRON_3FP
PRON_3FS
PRON_3MP
PRON_3MS
PUNC
PVSUFF_DO:1P
PVSUFF_DO:1S
PVSUFF_DO:3D
PVSUFF_DO:3FS
PVSUFF_DO:3MP
PVSUFF_DO:3MS
PVSUFF_SUBJ:1P
PVSUFF_SUBJ:1S
PVSUFF_SUBJ:2FS
PVSUFF_SUBJ:2MP
PVSUFF_SUBJ:3FD
PVSUFF_SUBJ:3FP
PVSUFF_SUBJ:3FS
PVSUFF_SUBJ:3MD
PVSUFF_SUBJ:3MP
PVSUFF_SUBJ:3MS
REL_PRON
REL_ADV
RESULT_CLAUSE_PARTICLE
SUBJUNC
VERB_IMPERFECT
VERB_PERFECT
VERB_PASSIVE
AFP POS COVERAGE STATISTICS
The AFP Corpus contains 140,265 tokens, of which 16,455 are punctuation,
numbers, and Latin strings, and 123,810 are Arabic word tokens.
Punctuation, Numbers, Latin strings    16,455
Arabic Word Tokens                    123,810
TOTAL                                 140,265
Of the 123,810 Arabic word tokens, 112,215 (90.63%) were provided with an
accurate morphological analysis and POS tag. The analyses of the remaining
11,595 (9.37%) Arabic word tokens were judged to be inaccurate and were
flagged with a Comment describing the nature of the inaccuracy.
Accurately parsed Arabic Word Tokens   112,215    90.63%
Commented Arabic Word Tokens            11,595     9.37%
TOTAL                                  123,810   100.00%
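The totals and percentages above can be re-derived from three of the
reported counts. A short Python check (variable names are illustrative):

```python
total_tokens = 140_265   # all tokens in the AFP Corpus
non_arabic = 16_455      # punctuation, numbers, and Latin strings
accurate = 112_215       # Arabic word tokens with an accurate analysis

arabic = total_tokens - non_arabic   # Arabic word tokens
commented = arabic - accurate        # tokens flagged with a Comment

accurate_pct = round(100 * accurate / arabic, 2)
commented_pct = round(100 * commented / arabic, 2)

if __name__ == "__main__":
    print(arabic, commented, accurate_pct, commented_pct)
```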
Of the 11,595 Comments, the most frequently identified problems are the
inaccurate parsing of proper names (28.47%) and the improper tagging of
adjectives (18.30%). A large group of Comments (29.16%) could not be
interpreted automatically (via scripting languages such as Perl) and was
classified as Miscellaneous.
ARABIC POS QUALITY CONTROL COMPARISON, 6-26-02
Five files with a total of 853 words (and a varying number of POS choices
per word) were each tagged independently by five annotators for a quality
control comparison of POS annotators.
Out of the total of 853 words, 128 show some disagreement. All five
annotators agreed on 85% of the words; the pairwise (between 2 annotators)
agreement rate is at least 92.2%.
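The 85% figure follows from the counts: 853 - 128 = 725 words with full
agreement, and 725 / 853 is approximately 0.85. As a sketch of how full and
pairwise agreement rates could be computed from per-annotator tag lists (the
function name and the toy tags below are hypothetical, not the actual
annotation data):

```python
from itertools import combinations


def agreement_rates(labels):
    """labels holds one list of tags per annotator, all of equal length.

    Returns the share of words on which all annotators agree, plus the
    agreement rate for every pair of annotators (keyed by their indexes).
    """
    n_words = len(labels[0])
    full = sum(
        1 for i in range(n_words) if len({tags[i] for tags in labels}) == 1
    ) / n_words
    pairwise = {}
    for (i, a), (j, b) in combinations(enumerate(labels), 2):
        pairwise[(i, j)] = sum(x == y for x, y in zip(a, b)) / n_words
    return full, pairwise


# Toy data: three annotators, four words.
toy = [
    ["NOUN", "ADJ", "PREP", "NOUN"],
    ["NOUN", "ADJ", "PREP", "NOUN"],
    ["NOUN", "NOUN", "PREP", "NOUN"],
]

if __name__ == "__main__":
    full, pairwise = agreement_rates(toy)
    print(full, min(pairwise.values()))
    print(round((853 - 128) / 853, 2))
```

The minimum over the pairwise values corresponds to the "at least" figure
quoted for pairwise agreement.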
There are a total of 82 words where four annotators agreed and only one
disagreed. Of those, 55 are cases of "no selection" having been chosen
from among the POS choices, due to one annotator's definition of
good-enough-match differing from all of the others'. The annotators have
since reached agreement on which cases are truly "no selection", and thus
the rate of this disagreement should fall markedly in future POS files,
raising the rate of overall agreement.
In addition, we plan to revise the same five files to create a gold
standard, which in the future may be used to evaluate and guide new
annotators during their training period.
AFP POS ANNOTATORS:
Current annotators:
Wigdan EL MEKKI
Mohamed MANSOUR
Zohra BENTAOUIT
Rachida FATHALLAH
Dalel ZAKHARY
Tasneem GHANDOUR
Ichraf AMGHOUZ
Niama LAADIOUI
Past annotators:
Fatima EL HIMYANI
Alexa FIRAT
Sarah TLILI
Gordon WITTY