Collapse of Arabic POS tags as they occur with the parsed Arabic
Treebank tokens (taglist.selected) into old Penn English Treebank POS
tags.

ALL CAPS is Arabic tag, followed by --> PENNTAG and/or comment
(a list of Arabic tags may be followed by one Penn tag, meaning that
the whole list can go to that one tag).

This list is done alphabetically by the Arabic POS tags from
taglist.selected.

ARABIC+TAG --> its corresponding Penn English Treebank POS tag
        (possible comment)

1-28-03, Ann Bies
bies@ldc.upenn.edu

==========================================================================

List of Penn POS tags used:

JJ      adjective
RB      adverb
CC      coordinating conjunction
DT      determiner/demonstrative pronoun
FW      foreign word
NN      common noun, singular
NNS     common noun, plural or dual
NNP     proper noun, singular
NNPS    proper noun, plural or dual
RP      particle
VBP     imperfect verb (***nb: imperfect rather than present tense)
VBN     passive verb (***nb: passive rather than past participle)
VBD     perfect verb (***nb: perfect rather than past tense)
UH      interjection
PRP     personal pronoun
PRP$    possessive personal pronoun
CD      cardinal number
IN      subordinating conjunction (FUNC_WORD) or preposition (PREP)
WP      relative pronoun
WRB     wh-adverb
,       punctuation, token is , (PUNC)
.       punctuation, token is . (PUNC)
:       punctuation, token is : or other (PUNC)

==========================================================================

ABBREV --> NN
        (Not sure what to say about ABBREV -- most likely NN. Penn
        English POS had abbreviations tagged as their full word
        function, but I expect most abbreviations in this Arabic
        corpus to be nouns.)

ADJ
ADJ+NSUFF_FEM_DU_ACCGEN
ADJ+NSUFF_FEM_DU_ACCGEN_POSS
ADJ+NSUFF_FEM_DU_NOM
ADJ+NSUFF_FEM_PL
ADJ+NSUFF_FEM_SG
ADJ+NSUFF_MASC_DU_ACCGEN
ADJ+NSUFF_MASC_DU_ACCGEN_POSS
ADJ+NSUFF_MASC_DU_NOM
ADJ+NSUFF_MASC_DU_NOM_POSS
ADJ+NSUFF_MASC_PL_ACCGEN
ADJ+NSUFF_MASC_PL_ACCGEN_POSS
ADJ+NSUFF_MASC_PL_NOM
ADJ+NSUFF_MASC_SG_ACC_INDEF
ADJ_PROP
ADJ_PROP+NSUFF_FEM_SG
ADJ_PROP+NSUFF_MASC_PL_NOM
ADJ_PROP+NSUFF_MASC_SG_ACC_INDEF
        --> JJ

ADV
ADV+NSUFF_FEM_SG
ADV+NSUFF_MASC_SG_ACC_INDEF
        --> RB

CONJ --> CC

CONJ+NEG_PART --> CC cliticized to an RP

DEM_PRON_F
DEM_PRON_FD
DEM_PRON_FS
DEM_PRON_MD
DEM_PRON_MP
DEM_PRON_MS
DET
        --> DT

DET+ADJ
DET+ADJ+NSUFF_FEM_DU_ACCGEN
DET+ADJ+NSUFF_FEM_DU_NOM
DET+ADJ+NSUFF_FEM_PL
DET+ADJ+NSUFF_FEM_SG
DET+ADJ+NSUFF_MASC_DU_ACCGEN
DET+ADJ+NSUFF_MASC_DU_NOM
DET+ADJ+NSUFF_MASC_PL_ACCGEN
DET+ADJ+NSUFF_MASC_PL_NOM
DET+ADJ_PROP
DET+ADJ_PROP+NSUFF_FEM_SG
DET+ADJ_PROP+NSUFF_MASC_PL_ACCGEN
        --> JJ

DET+ADV+NSUFF_FEM_SG --> RB

DET+NEG_PART --> DT cliticized to an RP

DET+NOUN --> NN

DET+NOUN+NSUFF_FEM_DU_ACCGEN
DET+NOUN+NSUFF_FEM_DU_NOM
DET+NOUN+NSUFF_FEM_PL
        --> NNS

DET+NOUN+NSUFF_FEM_SG --> NN

DET+NOUN+NSUFF_MASC_DU_ACCGEN
DET+NOUN+NSUFF_MASC_DU_NOM
DET+NOUN+NSUFF_MASC_PL_ACCGEN
DET+NOUN+NSUFF_MASC_PL_NOM
        --> NNS

DET+NOUN_PROP --> NNP

DET+NOUN_PROP+NSUFF_FEM_PL --> NNPS

DET+NOUN_PROP+NSUFF_FEM_SG --> NNP

DET+NOUN_PROP+NSUFF_MASC_DU_ACCGEN
DET+NOUN_PROP+NSUFF_MASC_PL_ACCGEN
DET+NOUN_PROP+NSUFF_MASC_PL_NOM
        --> NNPS

DET+PREP --> DT cliticized to an IN

EMPHATIC_PARTICLE
EXCEPT_PART
        --> RP

FUNC_WORD --> IN (subordinating conjunction, rather than preposition)

FUT+IV1P+VERB_IMPERFECT
FUT+IV1S+VERB_IMPERFECT
FUT+IV2MP+VERB_IMPERFECT+IVSUFF_SUBJ:MP_MOOD:I
FUT+IV2MS+VERB_IMPERFECT
FUT+IV3FS+VERB_IMPERFECT
        --> VBP (imperfect verb, using the old present tense verb tag)

FUT+IV3FS+VERB_PASSIVE
        --> VBN (passive verb, using the old past participle tag)

FUT+IV3MD+VERB_IMPERFECT+IVSUFF_SUBJ:D_MOOD:I
FUT+IV3MP+VERB_IMPERFECT+IVSUFF_SUBJ:MP_MOOD:I
FUT+IV3MS+VERB_IMPERFECT
        --> VBP (imperfect verb, using the old present tense verb tag)

FUT+IV3MS+VERB_PASSIVE
        --> VBN (passive verb, using the old past participle tag)

INTERJ
INTERJ+NSUFF_MASC_SG_ACC_INDEF
        --> UH

INTERROG_PART --> RP

IV1P+VERB_IMPERFECT
        --> VBP (imperfect verb, using the old present tense verb tag)

IV1P+VERB_PASSIVE
        --> VBN (passive verb, using the old past participle tag)

IV1S+VERB_IMPERFECT
        --> VBP (imperfect verb, using the old present tense verb tag)

IV1S+VERB_PASSIVE
        --> VBN (passive verb, using the old past participle tag)

IV2D+VERB_IMPERFECT+IVSUFF_SUBJ:D_MOOD:I
IV2FS+VERB_IMPERFECT+IVSUFF_SUBJ:2FS_MOOD:SJ
IV2MP+VERB_IMPERFECT+IVSUFF_SUBJ:MP_MOOD:I
IV2MP+VERB_IMPERFECT+IVSUFF_SUBJ:MP_MOOD:SJ
IV2MS+VERB_IMPERFECT
        --> VBP (imperfect verb, using the old present tense verb tag)

IV2MS+VERB_PASSIVE
        --> VBN (passive verb, using the old past participle tag)

IV3FD+VERB_IMPERFECT+IVSUFF_SUBJ:D_MOOD:I
IV3FD+VERB_IMPERFECT+IVSUFF_SUBJ:D_MOOD:SJ
IV3FP+VERB_IMPERFECT+IVSUFF_SUBJ:FP
IV3FS+VERB_IMPERFECT
        --> VBP (imperfect verb, using the old present tense verb tag)

IV3FS+VERB_PASSIVE
        --> VBN (passive verb, using the old past participle tag)

IV3MD+VERB_IMPERFECT+IVSUFF_SUBJ:D_MOOD:I
IV3MD+VERB_IMPERFECT+IVSUFF_SUBJ:D_MOOD:SJ
IV3MP+VERB_IMPERFECT+IVSUFF_SUBJ:MP_MOOD:I
IV3MP+VERB_IMPERFECT+IVSUFF_SUBJ:MP_MOOD:SJ
        --> VBP (imperfect verb, using the old present tense verb tag)

IV3MP+VERB_PASSIVE+IVSUFF_SUBJ:MP_MOOD:I
        --> VBN (passive verb, using the old past participle tag)

IV3MS+VERB_IMPERFECT
        --> VBP (imperfect verb, using the old present tense verb tag)

IV3MS+VERB_PASSIVE
        --> VBN (passive verb, using the old past participle tag)

IVSUFF_DO:1P
IVSUFF_DO:1S
IVSUFF_DO:2MP
IVSUFF_DO:2MS
IVSUFF_DO:3D
IVSUFF_DO:3FS
IVSUFF_DO:3MP
IVSUFF_DO:3MS
        --> PRP

NEG_PART
NEG_PART+PVSUFF_SUBJ:3MS
        --> RP

NO_FUNC --> no POS tag given, but I'd say NNP is probably a good guess
        for most of these, since most though not all unknown words
        are names.

NON_ALPHABETIC
NON_ARABIC
        --> FW or NUM or punctuation (non-Arabic characters)

NOUN --> NN

NOUN+NSUFF_FEM_DU_ACCGEN
NOUN+NSUFF_FEM_DU_ACCGEN_POSS
NOUN+NSUFF_FEM_DU_NOM
NOUN+NSUFF_FEM_DU_NOM_POSS
NOUN+NSUFF_FEM_PL
        --> NNS

NOUN+NSUFF_FEM_SG --> NN

NOUN+NSUFF_MASC_DU_ACCGEN
NOUN+NSUFF_MASC_DU_ACCGEN_POSS
NOUN+NSUFF_MASC_DU_NOM
NOUN+NSUFF_MASC_DU_NOM_POSS
NOUN+NSUFF_MASC_PL_ACCGEN
NOUN+NSUFF_MASC_PL_ACCGEN_POSS
NOUN+NSUFF_MASC_PL_NOM
NOUN+NSUFF_MASC_PL_NOM_POSS
        --> NNS

NOUN+NSUFF_MASC_SG_ACC_INDEF --> NN

NOUN_PROP --> NNP

NOUN_PROP+NSUFF_FEM_PL --> NNPS

NOUN_PROP+NSUFF_FEM_SG --> NNP

NOUN_PROP+NSUFF_MASC_PL_ACCGEN --> NNPS

NOUN_PROP+NSUFF_MASC_SG_ACC_INDEF --> NNP

NUM --> CD

NUMERIC_COMMA --> , or decimal point
        (This is an Arabic letter that looks like a comma, and is
        therefore often used in place of a comma for punctuation or
        for decimal points.)

PART --> RP

POSS_PRON_1P
POSS_PRON_1S
POSS_PRON_2FS
POSS_PRON_2MP
POSS_PRON_2MS
POSS_PRON_3D
POSS_PRON_3FP
POSS_PRON_3FS
POSS_PRON_3MP
POSS_PRON_3MS
        --> PRP$

PREP
PREP+NSUFF_FEM_SG
PREP+NSUFF_MASC_SG_ACC_INDEF
PREP_PROP
        --> IN

PRON
PRON_1P
PRON_1S
PRON_2FS
PRON_2MP
PRON_2MS
PRON_3D
PRON_3FP
PRON_3FS
PRON_3MP
PRON_3MS
        --> PRP

PUNC --> , or . or :
        (We use the PUNC tag for all types of punctuation (period,
        comma, etc.), with the exception of the Arabic character
        punctuation (which is tagged NUMERIC_COMMA). Actual
        conversion to Penn English POS tags would require looking at
        the token itself as well as the tag.)

PVSUFF_DO:1P
PVSUFF_DO:1S
PVSUFF_DO:3D
PVSUFF_DO:3FS
PVSUFF_DO:3MP
PVSUFF_DO:3MS
PVSUFF_SUBJ:1P
PVSUFF_SUBJ:1S
PVSUFF_SUBJ:3FS
        --> PRP

REL_ADV --> WRB

REL_PRON --> WP

REL_PRON+PREP --> WP cliticized to an IN

RESULT_CLAUSE_PARTICLE
SUBJUNC
        --> RP

VERB_PASSIVE
VERB_PASSIVE+PVSUFF_SUBJ:1S
VERB_PASSIVE+PVSUFF_SUBJ:3FS
VERB_PASSIVE+PVSUFF_SUBJ:3MD
VERB_PASSIVE+PVSUFF_SUBJ:3MP
VERB_PASSIVE+PVSUFF_SUBJ:3MS
        --> VBN (passive verb, using the old past participle tag)

VERB_PERFECT
VERB_PERFECT+PVSUFF_SUBJ:1P
VERB_PERFECT+PVSUFF_SUBJ:1S
VERB_PERFECT+PVSUFF_SUBJ:2FS
VERB_PERFECT+PVSUFF_SUBJ:2MP
VERB_PERFECT+PVSUFF_SUBJ:3FD
VERB_PERFECT+PVSUFF_SUBJ:3FP
VERB_PERFECT+PVSUFF_SUBJ:3FS
VERB_PERFECT+PVSUFF_SUBJ:3MD
VERB_PERFECT+PVSUFF_SUBJ:3MP
VERB_PERFECT+PVSUFF_SUBJ:3MS
        --> VBD (perfect verb, using the old past tense verb tag)
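
==========================================================================

Illustration (not part of the original list): a minimal Python sketch of
how the collapse might be applied as a lookup. The names ARABIC_TO_PENN,
punc_to_penn and collapse are hypothetical, and the table is only a short
excerpt of the entries above. PUNC is resolved from the token, as the
PUNC note requires. Entries whose Penn tag is left open in the list
(NO_FUNC, NON_ALPHABETIC, NON_ARABIC, NUMERIC_COMMA) and the
"cliticized" entries are deliberately left out, so they return None here.

```python
# Excerpt of the collapse list; the full list has many more entries.
ARABIC_TO_PENN = {
    "ABBREV": "NN",
    "ADJ": "JJ",
    "ADJ+NSUFF_FEM_SG": "JJ",
    "ADV": "RB",
    "CONJ": "CC",
    "DEM_PRON_MS": "DT",
    "DET": "DT",
    "DET+ADJ": "JJ",
    "DET+NOUN": "NN",
    "DET+NOUN+NSUFF_FEM_PL": "NNS",
    "DET+NOUN_PROP": "NNP",
    "FUNC_WORD": "IN",
    "IV3MS+VERB_IMPERFECT": "VBP",
    "IV3MS+VERB_PASSIVE": "VBN",
    "NOUN": "NN",
    "NOUN+NSUFF_MASC_PL_NOM": "NNS",
    "NOUN_PROP": "NNP",
    "NUM": "CD",
    "POSS_PRON_3MS": "PRP$",
    "PREP": "IN",
    "PRON_3MS": "PRP",
    "REL_ADV": "WRB",
    "REL_PRON": "WP",
    "VERB_PERFECT+PVSUFF_SUBJ:3MS": "VBD",
}


def punc_to_penn(token):
    """Pick the Penn punctuation tag from the token itself.

    Follows the tag list: ',' and '.' keep their own tag,
    and ':' covers ':' and every other punctuation token.
    """
    if token in (",", "."):
        return token
    return ":"


def collapse(arabic_tag, token=None):
    """Return the Penn tag for an Arabic tag, or None if it is not in the excerpt."""
    if arabic_tag == "PUNC":
        if token is None:
            raise ValueError("PUNC needs the token to choose among , . :")
        return punc_to_penn(token)
    return ARABIC_TO_PENN.get(arabic_tag)
```

For example, collapse("DET+NOUN+NSUFF_FEM_PL") gives "NNS", and
collapse("PUNC", ";") gives ":". Returning None for anything outside the
table keeps the open cases visible instead of guessing a tag for them.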