Multitasking Arabic Processing System

Synopsis

MAPS (Multitasking Arabic Processing System) is our professional multilingual lexical processing system. It is modular and compact yet versatile, capable of handling many tasks related to Arabic content management and NLP in general; in particular, it is engineered to process Arabic etymology, orthography, and toponymy.

MAPS takes in linguistic data in native script through a virtual keyboard or a simple cut-and-paste mechanism. Batch processing is supported for large text files, and the system supports several encodings, with UTF-8 being dominant.

MAPS outputs results in different file formats and encodings, e.g. TXT, HTML, and XML. The following paragraphs take you on a tour of the system in detail.

Terminology

We use some new terms throughout this document which we would like to clarify. Each task performed by MAPS is called a "process". In the following table, the first four columns show the language and script pairs, and the column headers show the "direction" of the process; the name of the process itself is shown in the last column. Please follow the links for details on each process.

Source Language | Input Script       | Target Language    | Output Script      | Process Name
Arabic          | Unvocalized Arabic | Arabic             | Vocalized Arabic   | Vocalization
Non-Arabic      | Language dependent | Arabic             | Arabic Alphabet    | Retrieval (Arabic)
Arabic          | Arabic Alphabet    | Non-Arabic         | Language dependent | Transcription (Phonemic)
Arabic          | Arabic Alphabet    | Latin script based | Phonemic Latin     | Romanization
Arabic          | Arabic Alphabet    | Multilingual       | Language dependent | Retrieval (Multilingual)
Multilingual    | Language dependent | Arabic             | Arabic Alphabet    | Arabicization


Features and specifications

  • Full support for Unicode v5.0
  • Interactive and batch mode support
  • Input/output in native languages
  • Over 20 languages supported
  • More than 12 different file encodings
  • Output to TXT, HTML, RTF, PDF, ODT, XML, SQL
  • Virtual keyboard (where applicable)
  • Highly customizable, user-friendly interfaces

Potential applications

  • Text to Speech systems (TTS)
  • Information Retrieval (IR)
  • Machine Translation (MT)
  • Named Entity Recognition (NER)
  • Cross-Language Information Retrieval systems (CLIR)
  • Law enforcement applications


MAPS Suite families and their sub-modules (diagram):

  • ONO: Romanizer, Transcription, Retrieval, Indexer, Arabicizer
  • TOPO: Romanizer, Transcription, Retrieval, Arabicizer
  • ORTHO: Diacritizer, Extractor, Conjugation, Inflection, Stemmer
  • SEMAN: Tagger, Parser, Ontology




Module pages layout

MAPS pages are laid out as follows:
  • family home page (modules overview and definitions)
  • module's main page (description and output samples)
  • module's specifications page (data and screenshots)
  • module's support page (download links and documentation)

Family layout (diagram): Family Home Page, Main Page, Specifications, Support

How MAPSOrtho works

MAPSOrtho® is the family name of the MAPS set of modules that deal with Arabic orthography. The system relies primarily on heuristics and a comprehensive set of rules.

 Arabic Diacritizer
Arabic is one of the UN official languages and is read from right to left. The language has an inflectional system and is known for its rich vocabulary and complex morphology. The Arabic abjad consists of twenty-eight letters, twenty-five of which are consonants; the remaining three are long vowels. A distinguishing feature of Arabic is that no letters are used to represent short vowels. Instead, they are represented by short strokes called diacritics, which are placed either above or below the preceding consonant.

Another feature is that Arabic text is normally written unvocalized, except for classical themes and Koranic text; this is a major stumbling block for any NLP system. The Kalmasoft diacritizing module is developed to perform full and semi-vocalization of raw input text. This module is currently being developed. Please refer to Arabic Text Diacritizer for details.
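
Because short vowels are separate combining marks in Unicode, the relationship between vocalized and unvocalized text can be illustrated with a minimal sketch. It only strips diacritics, which is the easy reverse of what the Diacritizer does; the function name and the chosen mark range are our own assumptions for illustration and are not part of MAPS.

```python
# Arabic short-vowel and related marks (tanween, fatha, damma, kasra,
# shadda, sukun) occupy U+064B..U+0652 in Unicode.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}

def strip_diacritics(text: str) -> str:
    """Return the unvocalized form of `text` by dropping diacritic marks."""
    return "".join(ch for ch in text if ch not in HARAKAT)

vocalized = "بُرْهَان"  # the name "Burhan", fully vocalized
print(strip_diacritics(vocalized))  # prints the bare consonantal skeleton
```

Going the other way (restoring the marks) is the hard problem, since one unvocalized skeleton can correspond to several vocalized words.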

 Arabic Root Extractor
Arabic is a highly inflectional language, meaning it uses a productive system to generate and derive words. Stemming is the process of removing affixes from such words and reducing them to their roots. Our full-fledged morphological analyzer uses a light stemmer that performs not only affix removal but also root extraction. It uses elaborate techniques to handle all forms of assimilated, hollow, and defective tokens; the morphological analyzer performs the pattern recognition necessary to complete the task and returns the correct form of the root or stem. A root dictionary is implemented to boost the system, which can be used in Arabic monolingual document retrieval. This module is currently being developed. Please refer to Arabic Text Stemmer/Root Extractor for details.
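
The affix-removal half of stemming can be sketched in a few lines. This is a simplified illustration, not the MAPS analyzer: the affix lists and the `min_len` guard are assumptions, and it makes no attempt at root extraction for assimilated, hollow, or defective forms.

```python
# Illustrative affix lists (assumed); longer prefixes are tried first.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]

def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word

example = "المعلمون"  # "the teachers": article + stem + plural suffix
print(light_stem(example))
```

The length guard keeps short words from being stripped down to nothing; recovering a true triconsonantal root from the resulting stem still requires pattern matching.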

 Arabic Conjugator
Arabic is a non-concatenative language. It can be described as a derivational language, meaning that the morphotactics depend on affixation, i.e. adding morphemes onto the word without changing the root, that is, preserving the core order of the verb binyanim. This results in the highly regular inflectional pattern that distinguishes the language.

The Inflection Generator (or simply conjugator) is a full-form lexical production module built on a root-based algorithm; a root like [ksr] "to break" may be seeded into the system, yielding roughly 30,000 conjugations. This is theoretically true for any other triconsonantal sound root.

What Kalmasoft offers here is not an exhaustive listing of the verb conjugation paradigm but rather the software, which can be used to regenerate the whole inflectional model of the language or just the conjugation table of a specific verb form; binary scripts are also available. Please refer to the list of tagged roots for further information. This module is currently being developed; please refer to Arabic Conjugator for details.
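
The idea of seeding a root into a pattern-driven generator can be sketched as follows. This is a toy illustration in Latin transliteration, not the MAPS algorithm: the pattern notation (digits 1, 2, 3 for the root consonants), the function names, and the partial suffix table are assumptions, and the stem vowel "a" is assumed to suit the example root.

```python
def fill_pattern(root: str, pattern: str) -> str:
    """Substitute the three root consonants for the digits 1, 2, 3."""
    c1, c2, c3 = root
    return pattern.replace("1", c1).replace("2", c2).replace("3", c3)

# Perfect-tense person suffixes (a partial, transliterated paradigm).
PERFECT_SUFFIXES = {
    "1sg": "tu", "2sg.m": "ta", "2sg.f": "ti",
    "3sg.m": "a", "3sg.f": "at",
    "1pl": "naa", "2pl.m": "tum", "3pl.m": "uu",
}

def conjugate_perfect(root: str) -> dict:
    """Build perfect-tense forms of a sound triconsonantal root."""
    stem = fill_pattern(root, "1a2a3")  # e.g. "ksr" gives the stem "kasar"
    return {person: stem + suffix for person, suffix in PERFECT_SUFFIXES.items()}

for person, form in conjugate_perfect("ksr").items():
    print(person, form)
```

Multiplying such tables across tenses, moods, voices, derived verb forms, and attached pronouns is what makes the number of forms per root grow so quickly.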

 Arabic Inflector
Arabic noun declension is the process of inflecting nouns into their grammatical sub-categories. MAPS inflects every Arabic noun into more than a dozen categories, including classes such as Verbal Noun, Noun of Instrument, Active Participle, Passive Participle, Noun of Place, and Noun of Time, and the three cases Nominative, Accusative, and Genitive. The first group is derived directly from the parallel verbs; its members are grammatically classified as nouns.

Other stem-inherent or generic characteristics, e.g. semantic classification, are not reflected in the table below; they have instead been hard-coded directly throughout the declension. This module is currently being developed; please refer to Arabic Inflector for details.
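
As a rough illustration of deriving noun classes from a verb root and then declining them for case, consider the sketch below. It uses Latin transliteration and only three common templates; the pattern notation, names, and the restriction to indefinite endings are assumptions for illustration, not the MAPS implementation.

```python
# Templates for a few derived noun classes; digits mark root consonants.
NOUN_PATTERNS = {
    "active participle": "1aa2i3",    # doer of the action
    "passive participle": "ma12uu3",  # thing acted upon
    "noun of place": "ma12a3",        # where the action happens
}

# Indefinite singular case endings.
CASE_ENDINGS = {"nominative": "un", "accusative": "an", "genitive": "in"}

def derive(root: str, noun_class: str) -> str:
    """Fill the template of `noun_class` with the three root consonants."""
    c1, c2, c3 = root
    table = str.maketrans({"1": c1, "2": c2, "3": c3})
    return NOUN_PATTERNS[noun_class].translate(table)

def decline(stem: str) -> dict:
    """Attach each case ending to the noun stem."""
    return {case: stem + ending for case, ending in CASE_ENDINGS.items()}

writer = derive("ktb", "active participle")
print(writer, decline(writer))
```

With the root k-t-b ("to write"), the three templates yield the stems for "writer", "written", and "office".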


How MAPSSeman works

MAPSSeman® is the family name of our Arabic semantics processing package, a set of specialized modules tuned for applications such as information retrieval, document clustering, rule-based machine translation (RBMT), example-based machine translation (EBMT), and many others. It can be described as a knowledge-based system in which an elaborate set of rules is used together with some technical approaches to accomplish specific semantic operations. The system relies primarily on algorithmic techniques and a very small set of lexical databases.

 Arabic POS Tagger
POS tagging is the process of assigning a part-of-speech tag, such as noun, verb, pronoun, preposition, adverb, or adjective, to each word in a sentence. The tag reflects the word's syntactic category based on its context, for the purpose of resolving lexical ambiguity.

This is a rule-based module that makes use of an extensive knowledge base of rules developed by our linguists to define precisely when to apply each tag. Please refer to Arabic POS Tagger for details.
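
A rule-based tagger applies an ordered list of hand-written rules to each token. The sketch below shows the general shape only: the four rules, the tag names, and the tiny preposition list are invented for illustration and are far cruder than a real linguistic knowledge base.

```python
# Hypothetical closed-class list; a real knowledge base would be far larger.
PREPOSITIONS = {"في", "من", "إلى", "على", "عن"}

def tag(tokens):
    """Assign one coarse tag per token using ordered, hand-written rules."""
    tags = []
    for i, tok in enumerate(tokens):
        if tok in PREPOSITIONS:
            t = "PREP"  # rule 1: closed-class lookup
        elif tok.startswith("ال") and len(tok) > 3:
            t = "NOUN"  # rule 2: the definite article marks a nominal
        elif i > 0 and tags[i - 1] == "PREP":
            t = "NOUN"  # rule 3: context, a preposition governs a noun
        elif i == 0:
            t = "VERB"  # rule 4: crude default for verb-initial sentences
        else:
            t = "UNK"
        tags.append(t)
    return list(zip(tokens, tags))

sentence = ["ذهب", "الولد", "إلى", "المدرسة"]  # "the boy went to the school"
print(tag(sentence))
```

The first token of the example is ambiguous in unvocalized text ("went" or "gold"); here the positional rule selects the verb reading, which is the kind of context-based disambiguation described above.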

 Arabic Parser
Parsing is key to accurate translation: once text is correctly disassembled, it is much easier to transfer to a different language. Kalmasoft has developed a unique parser for the Arabic language which can correctly analyze natural text, represent it as abstract elements and relationships, and then use that representation as a seed to generate text in a new language. This technology is language dependent but requires only a few changes, a different rule set, and a dictionary for each additional language. This module is currently being developed. Please refer to Arabic Parser for details.

 Arabic Ontology Processor
This module is currently being developed. Please refer to Arabic Ontology Processor for details.


How MAPSOno works

MAPSOno® is the family name of the MAPS branch that deals with anthroponyms ("personal names"). It can be described as a knowledge-based system in which an elaborate set of rules is used together with some technical approaches to accomplish specific operations. The system relies primarily on algorithmic techniques and very small name databases that are updated automatically with every input processing task; this technique is used for name scoring in particular. Other techniques, such as heuristics, are used for global name transliteration.

 Personal Names Romanizer
Both Arabic and English lack some of each other’s sounds and letters. For example, there is no perfect match for pharyngeals [Haa', Ein] or uvulars [Qaf, Khaa', Ghain] in English and (P, V) in Arabic. This leads to ambiguities during the transliteration process. Hence, if there is an Arabic name with one of these sounds, variant spellings will result in English.

This is in fact a major stumbling block when converting non-Western scripts such as the Arabic alphabet to Roman characters. It is especially challenging for Arabic-to-English conversion because the Arabic alphabet writes mainly consonants and only rarely uses diacritics for disambiguation, making it difficult to return a single accurate English version of an Arabic name.

Our system takes these peculiarities into account and supports many transliteration standards, including UNGEGN, ALA-LC, DIN31635, SATTS, and ISO233, as well as some academic transliteration systems like Buckwalter, Khoja, and Qalam. This makes it essential as an integral transliteration module in NLP applications like Machine Translation (MT) and Cross-Language Information Retrieval (CLIR). Please refer to Name Romanizer for details.
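
Of the schemes listed, Buckwalter is the simplest to sketch because it is a strict one-to-one character mapping. The code below is our own minimal illustration, not the MAPS Romanizer; the helper names are hypothetical, and the symbol strings follow the commonly published Buckwalter table for the basic letters and diacritics only.

```python
def _table(start: int, symbols: str) -> dict:
    """Map consecutive code points beginning at `start` to `symbols`."""
    return {chr(start + i): s for i, s in enumerate(symbols)}

BUCKWALTER = {
    **_table(0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza .. ghain
    **_table(0x0641, "fqklmnhwYy"),                  # feh .. yeh
    **_table(0x064B, "FNKaui~o"),                    # tanween .. sukun
}

def romanize(text: str) -> str:
    """Transliterate character by character; unknown characters pass through."""
    return "".join(BUCKWALTER.get(ch, ch) for ch in text)

print(romanize("محمد"))      # unvocalized input: consonants only
print(romanize("بُرْهَان"))  # vocalized input keeps its short vowels
```

The first call shows the ambiguity discussed above: without diacritics the output carries no short vowels, so a reader-friendly English spelling cannot be recovered from the mapping alone.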

 Arabic Name Indexer/Geolocator
This module exploits the fact that different geographical regions have different name patterns, and most have a specific set of names unique to them alongside patterns that are shared. For example, the names "حفني" /ħaf'ni /, "مرسي" /mursi /, and "مدبولي" /mad'bu:li / are unique to Egypt, while names such as "أحمد" /ʔħ'mad/, "محمد" /muħam'mad/, and "علي" /ʕli / cannot be assigned to a specific geographical region, since they are among the most frequent names in all Arabic-speaking countries. The module also gives some hints (a "gist") about gender and guesses the religion for some names of non-Arabic origin, e.g. "جرجس" / girgis/, "مينا" /mi:na/, and "حنا" /ħan'na/, which denote Coptic or Christian names common in Egypt and Iraq. MAPS uses this embedded module to give highly reliable results. Please refer to Name Indexer for details.
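
The distinction between region-specific and pan-regional names can be pictured as a simple lookup. The sketch below uses only the example names given above; the data structures, labels, and function name are assumptions, and a real indexer would presumably score names against much larger frequency data.

```python
# Names the text describes as unique to one region.
REGION_SPECIFIC = {"حفني": "Egypt", "مرسي": "Egypt", "مدبولي": "Egypt"}

# Very frequent names that carry no regional signal.
PAN_REGIONAL = {"أحمد", "محمد", "علي"}

def guess_region(name: str) -> str:
    """Return a region label, "pan-regional", or "unknown" for a single name."""
    if name in REGION_SPECIFIC:
        return REGION_SPECIFIC[name]
    if name in PAN_REGIONAL:
        return "pan-regional"
    return "unknown"

samples = ["مرسي", "محمد"]
for name in samples:
    print(name, guess_region(name))
```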

 Personal Names Arabicizer
Transliteration is the process of representing words of one language using the alphabet of another. The challenge of importing non-Arabic ("foreign") names into Arabic is no less important than the reverse process; in MAPS terminology this is called "Arabicization": the process of representing in Arabic script names written in scripts other than the Arabic alphabet. This process does not pose such great challenges for language pairs that employ very close alphabets and sound systems, such as Spanish/English or French/Spanish. Two important points should be distinguished here:

  • The Arabic phonetic system accepts imported foreign phonemes and represents them well enough for trained readers and native speakers to spell the name and pronounce it close to the original, except for some phonemes and digraphs, e.g. (P,V) and ("CH",""); this holds given the extra set of short vowels, of which our system makes the best use.

    Unlike Arabic Romanization, variants here occur as a result of the Arabic varieties used in different geographical regions; our system adopts MSA in general but can generate other varieties of Arabic as well.

  • The second point can be illustrated with the name Michael, which is spelled "مايكل" in Arabic but also "ميشيل" and "ميشال"; the latter is common in the Levant. A similar example is the name Nichola, which yields "نيكولا" and "نقولا". Another example is the name Clinton, which is commonly spelled "كلنتون"; this works just fine, but another variant, "كلنطون", exists and is common in North Africa, Egypt in particular. The samples in the link below show some other nuances. MAPS deals not only with these but also with other complexities between different varieties of Arabic; please refer to Global Name Arabicizer for details.

 Personal Names Transcription System
Representing Arabic names written in Arabic script in other languages, and vice versa, has always been described as a challenge for most cross-language content management and data mining systems. MAPS works not only for Romanization but also for a dozen languages: the integral transliteration system makes it possible to take names in native script or Romanized form, perform the transcription, and return results in the target native script. The output ("transcribed names") is formatted specifically for readers of the particular geographic region. For instance, the Arabic name "بُرْهَان" is rendered "Бурхан" for Russian; "Burhan" for English, Czech, or Spanish speakers; "Borhane" for Francophones; and "Borhan" for German and Polish. Please refer to the Global Personal Name Transcription System for details and samples.

 Personal Names Retrieval System
The system is capable of regenerating names back into their original languages and returning the result in the native script of the target language. This rebuilding capability makes it ideal for applications like Named Entity Recognition (NER) and Cross-Language Information Retrieval (CLIR). For now, the retrieval feature works only for Arabic and a subset of Latin-script names. This module makes heavy use of detailed conversion rules and heuristics to correctly rebuild each input name, no matter how badly the original name was damaged by the transliteration process. Please refer to Name Retrieval System for a detailed output sample.
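
One simple way to picture retrieval is to reduce every Romanized variant to a key that survives the damage done by transliteration, then look that key up. The sketch below uses a consonant skeleton as the key; this is a swapped-in textbook heuristic, not the MAPS conversion rules, and the one-entry index and all names in it are hypothetical.

```python
VOWELS = set("aeiou")

def skeleton(latin_name: str) -> str:
    """Lower-case the name and keep only its consonant letters."""
    return "".join(
        ch for ch in latin_name.lower() if ch.isalpha() and ch not in VOWELS
    )

# Hypothetical index: consonant skeleton -> original Arabic spelling.
INDEX = {"brhn": "برهان"}

def retrieve(latin_name: str):
    """Return the Arabic original for a Romanized variant, or None."""
    return INDEX.get(skeleton(latin_name))

for variant in ["Burhan", "Borhane", "Borhan"]:
    print(variant, retrieve(variant))
```

All three spellings collapse to the same key and so map back to a single Arabic name. A skeleton key is deliberately lossy, so distinct names can collide; that is one reason detailed rules and heuristics are needed in practice.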


How MAPSTopo works

MAPSTopo® is the family name of our toponym ("geographic names") processing package. Arabic is one of the UN official languages, spoken by over 200 million people in 20 countries, mainly in the Gulf area and North Africa. MAPSTopo supports all Arabic-oriented Romanization systems, including UNGEGN and its variants, i.e. UNGEGN2002, BGN/PCGN1956, RJGC, IGN, ISO233, and SES. Arabic Romanization is not restricted to the "official" systems, since MAPS can also export Arabic geographical names to over 50 languages in their native scripts; these include, for example, all EU official languages, Russian, Afrikaans, Icelandic, and Turkish. MAPSTopo® also performs a broader transcription of Arabic place names using IPA plus a user-defined custom transcription system. The reverse process is also possible, i.e. retrieval of any transcribed Arabic place name or Arabicized non-Arabic place name. Another module is responsible for Arabicizing any non-Arabic toponym.

 Geographic Names Romanizer
This module performs the most commonly required Arabic place name Romanization processes in more than 10 official Romanization systems. Please refer to Geographic Names Romanizer for details.

 Geographical Names Arabicizer
This module is currently being developed. Please refer to Geographical Names Arabicizer for details.

 Geographic Names Transcription System
This module is currently being developed. Please refer to Geographic Names Transcription System for details.

 Geographic Names Retrieval System
This module is currently being developed. Please refer to Geographic Names Retrieval System for details.

Facts

Category Software | Reference MAPS | Modules 17 | Last updated 13/02/2016