Multilingual Advanced Processing System

Synopsis

MAPS (Multilingual Advanced Processing System) is our professional multilingual lexical processing system. It is modular and compact yet versatile, and handles many tasks related to Arabic content management and NLP in general; in particular, it is engineered to process Arabic etymology, orthography, and toponymy.

MAPS takes in data in native script, either through a simple copy-and-paste mechanism or by loading a file. Batch processing is supported for large text files, and the system supports several encodings, with UTF-8 being the dominant one.
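As an illustration of the encoding handling described above, the following minimal Python sketch decodes input bytes by trying UTF-8 first and then falling back to another encoding. The function name decode_text and the fallback list (here Windows-1256, a common legacy Arabic code page) are assumptions made for this example; they are not part of MAPS.

```python
def decode_text(raw: bytes, encodings=("utf-8", "cp1256")) -> str:
    """Decode raw bytes, trying each candidate encoding in order.

    The candidate list is an assumption for this sketch; put the
    strictest encoding first, because a permissive single-byte code
    page will accept almost any byte sequence.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the input")


if __name__ == "__main__":
    sample = "محمد"
    # The same text survives a round trip through either encoding.
    print(decode_text(sample.encode("utf-8")) == sample)
    print(decode_text(sample.encode("cp1256")) == sample)
```

UTF-8 is tried first because invalid UTF-8 fails loudly, whereas a single-byte code page would silently produce garbled text.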

MAPS outputs results in different file formats and encodings, e.g. TXT, HTML, JSON, and XML, as well as DOC and PDF. The following paragraphs take you on a tour of the system in detail.

Potential applications include:
- Information Retrieval (IR)
- Machine Translation (MT)
- Named Entity Recognition (NER)
- Cross-Language Information Retrieval systems (CLIR)
- Law enforcement applications
- Text to Speech systems (TTS)

Information

Last updated: 21/1/2023

Terminology

We use a few custom terms throughout this document which we would like to clarify. Each task performed by MAPS is called a "process". The following table lists language pairs in the first four columns, whose headers show the "direction" of the process; the name of the process itself is shown in the last column. Please follow the links for details on each process.

Source Language | Input Script       | Target Language    | Output Script      | Process Name
Arabic          | Unvocalized Arabic | Arabic             | Vocalized Arabic   | Vocalization
Non-Arabic      | Language dependent | Arabic             | Arabic             | Retrieval (Arabic)
Multilingual    | Language dependent | Language dependent | Language dependent | Transcription (Phonemic)
Multilingual    | Language dependent | Latin script based | Phonemic Latin     | Romanization
Arabic          | Arabic             | Multilingual       | Language dependent | Retrieval (Multilingual)
Multilingual    | Language dependent | Arabic             | Arabic             | Arabicization

MAPS modules hierarchy

MAPS Families

MAPSOno© is the family name of the MAPS branch that deals with anthroponyms ("personal names"). It can be described as a knowledge-based system in which an elaborate set of rules is used together with some technical approaches to accomplish specific operations. The system depends primarily on algorithmic techniques and very small databases of names that are updated automatically with every input processing task; this technique is used for name scoring in particular. Other techniques, such as heuristics, are used for global name romanization.

Both Arabic and English lack some of each other’s sounds and letters. For example, English has no perfect match for the pharyngeals [Haa', Ein] or the uvulars [Qaf, Khaa', Ghain], and Arabic has none for (P, V). This leads to ambiguities during the romanization process: an Arabic name containing one of these sounds will have variant spellings in English.

This is in fact a major stumbling block when converting non-Western scripts such as the Arabic Abjad to Roman characters. It is especially challenging for Arabic-to-English conversion because Arabic writes consonants and long vowels and only rarely uses diacritics for disambiguation, making it difficult to return a single accurate English version of an Arabic name.

Our system takes these peculiarities into account and supports many transliteration/transcription standards, including UNGEGN, ALA-LC, DIN31635, SATTS, and ISO233, as well as some academic transliteration systems such as Buckwalter, Khoja, and Qalam. This makes it essential as an integral transliteration module in NLP applications like Machine Translation (MT) and Cross-Language Information Retrieval (CLIR). Please refer to the Personal Names Romanizer for details.
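To make the idea of a character-level transliteration scheme concrete, here is a minimal Python sketch of the widely published Buckwalter mapping, one of the academic systems named above. It is only an illustration: the function name to_buckwalter is an assumption for this example, the table covers the core letters and diacritics only, and nothing here describes how MAPS itself implements any standard.

```python
# Core Buckwalter table: Arabic code point -> ASCII symbol.
BUCKWALTER = {
    "\u0621": "'",  # hamza
    "\u0622": "|",  # alef with madda
    "\u0623": ">",  # alef with hamza above
    "\u0624": "&",  # waw with hamza
    "\u0625": "<",  # alef with hamza below
    "\u0626": "}",  # yeh with hamza
    "\u0627": "A", "\u0628": "b", "\u0629": "p", "\u062A": "t",
    "\u062B": "v", "\u062C": "j", "\u062D": "H", "\u062E": "x",
    "\u062F": "d", "\u0630": "*", "\u0631": "r", "\u0632": "z",
    "\u0633": "s", "\u0634": "$", "\u0635": "S", "\u0636": "D",
    "\u0637": "T", "\u0638": "Z", "\u0639": "E", "\u063A": "g",
    "\u0640": "_",  # tatweel (elongation mark)
    "\u0641": "f", "\u0642": "q", "\u0643": "k", "\u0644": "l",
    "\u0645": "m", "\u0646": "n", "\u0647": "h", "\u0648": "w",
    "\u0649": "Y", "\u064A": "y",
    "\u064B": "F", "\u064C": "N", "\u064D": "K",  # tanween
    "\u064E": "a", "\u064F": "u", "\u0650": "i",  # short vowels
    "\u0651": "~",  # shadda
    "\u0652": "o",  # sukun
}


def to_buckwalter(text: str) -> str:
    """Map each Arabic character to its Buckwalter symbol.

    Characters outside the table (spaces, digits, Latin) pass through.
    """
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # Unvocalized "Muhammad": meem, hah, meem, dal
    print(to_buckwalter("\u0645\u062D\u0645\u062F"))  # mHmd
    # Vocalized "Burhan": beh+damma, reh+sukun, heh+fatha, alef, noon
    print(to_buckwalter("\u0628\u064F\u0631\u0652\u0647\u064E\u0627\u0646"))  # burohaAn
```

The unvocalized output "mHmd" shows why romanization is ambiguous: without diacritics in the input, a one-to-one scheme has no vowels to emit.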

Representing personal names across different languages has always been described as a challenge for most cross-language content management and data mining systems. MAPS works not only for Romanization but also for a dozen languages: the integral transliteration system takes names in native script or Romanized form, performs the transcription, and returns the results in the target native script. The output ("transcribed names") is formatted for readers of the particular geographic region. For instance, the Arabic name "بُرْهَان" is rendered "Бурхан" for Russian readers, "Burhan" for English, Czech, or Spanish speakers, "Borhane" for Francophones, and "Borhan" for German and Polish readers. Please refer to the Personal Names Transcriber for details and samples.

Transliteration is the process of representing words of one language using the writing system of another. The challenge of importing non-Arabic ("foreign") names into Arabic is no less important than the reverse process. In MAPS terminology this is called "Arabicization": the process of representing in Arabic script names that are written in other scripts. Such a process does not pose great challenges for language pairs that employ very close writing and sound systems, such as Spanish/English or French/Spanish.

Two important points should be distinguished here:

  • The Arabic phonetic system accepts imported foreign phonemes and represents them well enough for trained readers and native speakers to spell a name and pronounce it close to the original, except for some phonemes and digraphs, e.g. (P, V) and ("CH",""); this holds given the extra set of short vowels, which our system makes the best use of.
    Unlike Arabic Romanization, variants here arise from the varieties of Arabic used in different geographic regions; our system adopts Modern Standard Arabic (MSA) in general but can generate other varieties of Arabic as well.
  • The second point can be illustrated with the name Michael, which is commonly spelled "مايكل" in Arabic but also "ميشيل" and "ميشال"; the latter is common in the Levant. A similar example is the name Nichola, which yields "نيكولا" and "نقولا". Another example is the name "Clinton", commonly spelled "كلنتون", which works just fine, although another variant, "كلنطون", exists and is common in North Africa, Egypt in particular. The samples in the link below show some other nuances. MAPS deals not only with these but also with other complexities between different varieties of Arabic; please refer to the Personal Names Arabicizer for details.

The system is capable of regenerating names back to their original languages and returning the result in the native script of the target language. This rebuilding capability makes it ideal for applications such as Named Entity Recognition (NER) and Cross-Language Information Retrieval (CLIR). The module makes heavy use of detailed conversion rules and heuristics to rebuild each input name correctly, no matter how badly the original name was damaged by the transcription process. Please refer to the Personal Names Retriever for a detailed output sample.

Currently applicable to Arabic names only, this module makes use of the fact that different geographic regions have different name patterns, and most have a specific set of names unique to them alongside other patterns that are shared. For example, the names "حفني" /ħaf'ni/, "مرسي" /mursi/, and "مدبولي" /mad'bu:li/ are unique to Egypt, while the names "أحمد" /ʔaħ'mad/ and "محمد" /muħam'mad/ cannot be assigned to a specific geographic region since they, along with other names such as "علي" /ʕali/, rank among the most frequent in all regions of the Arabic-speaking countries. The module also gives some hints (the "gist") about gender, and guesses the religion for some names of non-Arabic origin, e.g. "جرجس" /girgis/, "مينا" /mi:na/, and "حنا" /ħan'na/, which denote Coptic or Christian names common in Egypt and Iraq. MAPS uses this embedded module to give highly reliable results. Please refer to the Personal Names Indexer for details.
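The region-hinting idea described above can be sketched as a simple lookup with voting. The Python below is a toy illustration only: the table holds just the three Egypt-specific names quoted in the text, the function name guess_region is hypothetical, and the actual MAPS indexer is presumably far richer.

```python
from collections import Counter

# Toy table built only from the examples in the text (an assumption of
# this sketch). High-frequency names such as أحمد, محمد and علي are
# deliberately absent because they carry no regional signal.
REGION_SPECIFIC = {
    "حفني": "Egypt",
    "مرسي": "Egypt",
    "مدبولي": "Egypt",
}


def guess_region(full_name: str):
    """Return the region with the most votes among the name's tokens,
    or None when no token carries a regional signal."""
    votes = Counter(
        REGION_SPECIFIC[token]
        for token in full_name.split()
        if token in REGION_SPECIFIC
    )
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(guess_region("محمد مرسي"))  # Egypt
    print(guess_region("أحمد علي"))   # None
```

Splitting a full name into tokens lets an uninformative given name be combined with an informative family name, as in the first call.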

MAPSTopo© is the family name of our toponym ("geographic names") processing package. MAPSTopo© performs either transcription or transliteration of place names into multiple languages, including IPA and user-defined custom transcription systems.

This module performs the most commonly required process, Arabic place name Romanization, in more than 10 official Romanization systems. Please refer to the Geographic Names Romanizer for details.

This module is currently being developed. Please refer to the Geographic Names Transcriber for details.

Arabic is one of the official languages of the UN, spoken by over 200 million people in 20 countries, mainly in the Gulf region and North Africa. MAPSTopo supports all Arabic-oriented Romanization systems, including UNGEGN and its variants, i.e. UNGEGN2002, BGN/PCGN1956, RJGC, IGN, ISO233, and SES. This module performs Arabicization of geographic names. Please refer to the Geographic Names Arabicizer for details.

This module retrieves geographic names from many different languages back into their original script; it performs the reverse of transcription/transliteration, i.e. retrieval of any transcribed name. Please refer to the Geographic Names Retriever for details.

MAPSOrtho© is the family name of the MAPS set of modules that deal with orthography. The system depends primarily on heuristics and a comprehensive set of rules.

Arabic is a non-concatenative language. It can be described as a derivational language, meaning that the morphotactics depend on affixation, i.e. adding morphemes onto the word without changing the root, that is, preserving the core order of the verb binyanim. This results in the highly regular inflectional pattern that distinguishes the language.

The Inflection Generator (or simply conjugator) is a full-form lexical production module built on a root-based algorithm. A root like [ksr] "to break" may be seeded into the system, yielding roughly 30,000 conjugations; this is theoretically true for any other sound triconsonantal root.
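A tiny slice of root-based generation can be sketched as follows. This Python example fills a sound triconsonantal root into the Form I perfect-tense (past) active template and attaches the standard person suffixes, producing 13 forms. The Latin transliteration, the person labels, and the function name conjugate_perfect are assumptions made for this illustration; the sketch covers only a small fraction of the full paradigm mentioned above and is not the MAPS algorithm.

```python
# Perfect-tense (past) active suffixes for a sound Form I verb,
# written in a simple Latin transliteration (long vowels doubled).
PERFECT_SUFFIXES = {
    "1sg": "tu", "2sg.m": "ta", "2sg.f": "ti",
    "3sg.m": "a", "3sg.f": "at",
    "2du": "tumaa", "3du.m": "aa", "3du.f": "ataa",
    "1pl": "naa", "2pl.m": "tum", "2pl.f": "tunna",
    "3pl.m": "uu", "3pl.f": "na",
}


def conjugate_perfect(root: str, stem_vowel: str = "a") -> dict:
    """Generate perfect-tense forms from a three-letter root.

    Each radical must be a single character (e.g. "ksr"). The template
    is C1-a-C2-V-C3, where V is the lexically determined stem vowel.
    """
    c1, c2, c3 = root  # raises ValueError unless exactly three radicals
    stem = f"{c1}a{c2}{stem_vowel}{c3}"
    return {person: stem + suffix for person, suffix in PERFECT_SUFFIXES.items()}


if __name__ == "__main__":
    forms = conjugate_perfect("ksr")
    print(forms["3sg.m"])  # kasara  "he broke"
    print(forms["1sg"])    # kasartu "I broke"
    print(len(forms))      # 13
```

Multiplying such a table across tenses, moods, voices, derived verb forms, and attached pronouns is what makes the number of generated forms per root grow so large.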

What Kalmasoft offers here is not a thorough listing of the verb conjugation paradigm but rather the software that can be used to recreate the whole inflectional model of the language, or just the conjugation table of a specific verb form; binary scripts are also available. Please refer to the Arabic Verb Conjugator for details.

Arabic noun declension is the process of inflecting nouns into their grammatical sub-categories. MAPS inflects every Arabic noun into more than a dozen categories, including classes such as Verbal Noun, Noun of Instrument, Active Participle, Passive Participle, Locative Noun, and Numerative Noun, and the three cases Accusative, Nominative, and Genitive. The first group is derived directly from the corresponding verbs but is grammatically classified as nouns.

Other stem-inherent or generic characteristics, e.g. semantic classification, are not reflected in the table below; rather, they are hard-coded directly throughout the declension. Please refer to the Arabic Noun Inflector for details.

Arabic is a highly inflectional language, meaning it uses a productive system to generate and derive words. Stemming is the process of removing affixes from such words and reducing them to their roots. Our full-fledged morphological analyzer uses a light stemmer that performs not only affix removal but also root extraction. It uses elaborate techniques to handle all forms of assimilated, hollow, and defective tokens; the morphological analyzer performs the pattern recognition necessary to complete the task and returns the correct form of the root or stem. A root dictionary is implemented to boost the system, which can be used in Arabic monolingual document retrieval. Please refer to the Arabic Root Extractor for details.
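The affix-removal half of the stemming process described above can be sketched in a few lines of Python. This is a generic light-stemming illustration under stated assumptions: the prefix and suffix lists are a small set chosen for the example, the function name light_stem is hypothetical, and the sketch performs no root extraction and no handling of assimilated, hollow, or defective forms.

```python
# Illustrative affix lists (assumptions of this sketch), ordered so that
# longer prefixes are tried before the shorter ones they contain.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي")


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, never shortening the
    word below min_len characters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    print(light_stem("المدرسة"))  # مدرس
    print(light_stem("والكتاب"))  # كتاب
    print(light_stem("معلمون"))   # معلم
```

The minimum-length guard keeps short words from being stripped down to nothing, which is the usual failure mode of naive affix removal.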

Arabic is one of the official languages of the UN and is read from right to left. The language has an inflectional system and is known for its rich vocabulary and complex morphology. The Arabic writing system (Abjad) consists of twenty-eight letters, twenty-five of which are consonants; the remaining three are long vowels. A distinguishing feature of Arabic is that no letters are used to represent short vowels. Instead, they are represented by short strokes called diacritics, which are placed either above or below the preceding consonant.

Another feature is that Arabic text is written unvocalized, except for classical themes and Koranic text; this is a major stumbling block for any NLP system. The Kalmasoft diacritizing module was developed to perform full and partial vocalization of raw input text. Please refer to the Arabic Text Diacritizer for details.
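The relationship between vocalized and unvocalized text can be shown with the easy direction of the problem: removing diacritics. The Python sketch below strips the combining marks in the Unicode range U+064B to U+0652; the function name strip_diacritics is an assumption for this example. Restoring the marks, which is what a diacritizer must do, is the hard inverse and is not attempted here.

```python
import re

# Arabic diacritics: tanween (U+064B-U+064D), fatha, damma, kasra
# (U+064E-U+0650), shadda (U+0651) and sukun (U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Return the unvocalized form of a (partially) vocalized string."""
    return DIACRITICS.sub("", text)


if __name__ == "__main__":
    vocalized = "\u0628\u064F\u0631\u0652\u0647\u064E\u0627\u0646"  # Burhan, vocalized
    plain = strip_diacritics(vocalized)
    print(plain == "\u0628\u0631\u0647\u0627\u0646")  # True
    print(len(vocalized), len(plain))                # 8 5
```

A function like this is one way to derive unvocalized input from vocalized reference text when checking a diacritizer's output.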

This module is currently being developed. Please refer to the Arabic Stemmer for details.

MAPSSeman© is the family name of our Arabic semantics processing package, a set of specialized modules tuned for applications such as information retrieval, document clustering, rule-based machine translation (RBMT), example-based machine translation (EBMT), and many others. It can be described as a knowledge-based system in which an elaborate set of rules is used together with some technical approaches to accomplish specific semantic operations. The system depends primarily on algorithmic techniques and a very small set of lexical databases.

POS tagging is the process of assigning a part-of-speech tag, such as noun, verb, pronoun, preposition, adverb, or adjective, to each word in a sentence. The tag reflects the word's syntactic category based on its context, for the purpose of resolving lexical ambiguity.

This is a rule-based module that makes use of an extensive knowledge base of rules developed by our linguists to define precisely when to apply each tag. Please refer to the Arabic POS Tagger for details.
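The flavor of a rule-based tagger can be conveyed with a deliberately small Python sketch: an ordered list of rules in which the first match wins. Everything here is an assumption for illustration (the tag names, the five-word preposition list, the definite-article heuristic, and the function names); a production rule base such as the one described above would also use context from neighboring words, which this toy omits.

```python
# Closed-class lexicon: a handful of common prepositions (illustrative).
PREPOSITIONS = {"في", "من", "إلى", "على", "عن"}


def tag_token(token: str) -> str:
    """Apply ordered rules to one token; the first matching rule wins."""
    # Rule 1: closed-class lookup.
    if token in PREPOSITIONS:
        return "PREP"
    # Rule 2: the definite article attaches to nominals. This is a crude
    # heuristic: it also fires on adjectives and on the few verbs that
    # happen to begin with the same two letters.
    if token.startswith("ال") and len(token) > 3:
        return "NOUN"
    # Fallback: leave the token untagged for later rules.
    return "UNK"


def tag_sentence(sentence: str) -> list:
    """Tag a whitespace-tokenized sentence."""
    return [(token, tag_token(token)) for token in sentence.split()]


if __name__ == "__main__":
    for token, tag in tag_sentence("ذهب الولد إلى المدرسة"):
        print(token, tag)
```

Ordering matters in such systems: placing the closed-class lookup first prevents a broad heuristic from overriding a certain answer.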

Please refer to the Arabic Named Entity Extractor for details.

Parsing is key to accurate translation: once text is correctly disassembled, it is much easier to transfer to a different language. Kalmasoft has developed a unique parser for the Arabic language that can correctly analyze natural text, represent it as abstract elements and relationships, and then use that representation as the seed for generating text in a new language. This technology is language dependent but requires only a few changes, a different rule set, and a dictionary for each additional language. This module is currently being developed. Please refer to the Arabic Text Parser for details.