Kalmasoft Databases and Glossaries
Synopsis
Kalmasoft maintains a central repository of large multilingual datasets, collected progressively over the last two decades. The main goal is to provide effective and accurate reference material for developing linguistic support in software packages and for the growing range of web technologies related to language engineering and NLP. We keep our data updated and organized so that correct information is available whenever it is needed.
All materials presented here are commercially available and can be customized and fine-tuned to meet your specific requirements.
Potential applications include:
- Anti-Money Laundering
- Customer Data Management
- Employment Diversity
- Electronic Health Records
- Fraud Detection
- Identity Matching
- Identity Resolution
- Immigration Control
- Intelligence Analysis
- KYC and Due Diligence
- Law Enforcement
- Passenger Screening
- Voter List Correction
Information
Reference: DBASES
Total entries: 250,000,000+
Last updated: 14/1/2023
Anthroponyms (personal names)
Thousands of Arabic given names in native script with Arabic diacritical marks (short vowels), Roman transcription, and gender fields; transcription here follows the common way a name is spelled in English but other transcription systems are available too. A separate database of frequency statistics on each name can be supplied upon request.
This is the most interesting database for those developing NER applications or name scoring software packages. Like the given-names database above, it includes Arabic diacritical marks and full Roman transcription. Frequency statistics on each name can be supplied as a separate database upon request.
An extended database of more than 1 million records of Arabic names romanized for six languages (English, German, Dutch, Spanish, Italian, and French), based on 300K original Arabic names.
A database of millions of real-world Arabic names collected from many sources and supplied with gender and locale fields, covering the entire Arabic region as well as three additional countries known to be under the strong influence of Arabic culture.
A few hundred names of Arabic origin, mostly from countries known to have been under the umbrella of Islamic culture, e.g. Turkey, India, Spain, Persia, and a few African countries.
A database of 3.6M records based on 300K original Arabic names, canonically transcribed into 12 languages: Amharic, Hebrew, Greek, Japanese, Russian, Armenian, Georgian, Hindi, Thai, Bengali, Tagalog, and Malayalam.
A huge database of 40 million records containing Arabic names with all their possible Roman variants, based on 300K original Arabic names.
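The name databases above carry short-vowel diacritics, while everyday Arabic text usually omits them, so a lookup typically compares the unvocalized forms. The following is a minimal sketch of that normalization step using only the Python standard library; the record layout and field names are assumptions made for illustration and do not reflect the actual Kalmasoft schema.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Drop nonspacing combining marks (Unicode category Mn).

    The Arabic short-vowel signs and shadda (U+064B to U+0652)
    all fall in this category.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# Hypothetical record: the field names are assumptions, not the vendor's schema.
record = {
    "name": "\u0645\u064f\u062d\u064e\u0645\u0651\u064e\u062f",  # "Muhammad", fully vocalized
    "roman": "Muhammad",
    "gender": "M",
}

bare = strip_diacritics(record["name"])
print(bare)  # the same name without diacritics
```

Stripping every mark in category Mn is a deliberately broad choice; a production system might remove only a specific subset of marks.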
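A table that pairs each Arabic name with its Roman variants lends itself to identity matching: two Latin spellings can be treated as the same name when they map to a common Arabic original. The sketch below illustrates the idea with a handful of made-up rows; the row format and the function names `build_index` and `same_name` are hypothetical and are not part of any Kalmasoft product.

```python
from collections import defaultdict

# Illustrative rows only: (Arabic original, Roman variant).
# These are not taken from the actual database.
rows = [
    ("\u0645\u062d\u0645\u062f", "Muhammad"),
    ("\u0645\u062d\u0645\u062f", "Mohammed"),
    ("\u0645\u062d\u0645\u062f", "Mohamed"),
    ("\u064a\u0648\u0633\u0641", "Yusuf"),
    ("\u064a\u0648\u0633\u0641", "Youssef"),
]

def build_index(rows):
    """Map each case-folded Roman variant to the set of Arabic originals."""
    index = defaultdict(set)
    for arabic, roman in rows:
        index[roman.casefold()].add(arabic)
    return index

def same_name(a, b, index):
    """True when both spellings share at least one Arabic original."""
    originals_a = index.get(a.casefold(), set())
    originals_b = index.get(b.casefold(), set())
    return bool(originals_a & originals_b)

index = build_index(rows)
print(same_name("Mohamed", "MUHAMMAD", index))
print(same_name("Mohamed", "Yusuf", index))
```

A set is used per variant because one Roman spelling can correspond to more than one Arabic name.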
Unique and indigenous names from all Arabic speaking countries.
Names from all over the world. What is new in this database is that an Arabic transcription and a gender field are added to every name; most records also carry additional information, e.g. locale and meaning.
A database of names whose counterintuitive pronunciations make them difficult to spell or read even though they are written in Latin characters, because the phonemic characteristics of the original language affect the way these names are used.
A database of names that share the same spelling across multiple languages but have different meanings and pronunciations.
A database of Ethiopic names with Arabic parallel names that closely share the same meaning.
Toponyms (place names)
Highly organized gazetteer (populated places only) of thousands of Arabic place names ready for publishing on the internet with many Arabic transcription systems.
Highly organized gazetteer (populated places only) of thousands of Arabic place names ready for publishing on the internet with many Arabic transcription systems.
World gazetteer is a full-featured database of geographic information covering all world countries and their geographic features, such as mountains, waterways, and roads. This database complements the two databases above.
The first of its kind, this long-awaited database of odonyms (street names) is now available in Arabic. It covers world street names for more than two million geographic entities and is very useful for information retrieval systems, e.g. NER applications, web crawlers, search engines, and CLIR applications.
This is part of the above database. It has all common geographic features, e.g. valley, creek, summit, etc., as well as some world-famous features including oceans, continents, and major cities.
Most geographic terms can be found here; this database is compiled for use with electronic dictionaries and MT applications.
Entity Names Databases
Famous names and celebrities from over 100 countries.
A full suite of bilingual databases covering almost all aspects of life, e.g. sports, politics, science, and more. Each database may have additional fields, e.g. "type", but all of them include the "locale" field.
Unique and valuable, this is a database of all entity keywords found in Arabic. It also comprises the Arabic counterparts of entity keywords like company, society, union, factory committee and other terms. It is very useful for NER applications, web crawlers, search engines, and CLIR applications.
A valuable listing of indigenous and rare names found in Arabic countries; the biggest database of its kind now available electronically.
Industrial entities, e.g. consumer electronics, heavy industries, construction, automobiles, housing, information technology, medical equipment, military industries, etc. Good for MT software developers and web-based search engines.
Acronyms and Initialisms
Thousands of acronyms and abbreviations covering many fields, such as aviation, aerospace, military, sports, education, science, engineering, media, law, recreation and entertainment, and more.
Orthographic Databases
Tagged Arabic corpus encoded in UTF-8, Windows-1256, or the Kalmasoft generic transliteration system "KATS"; essential for MT applications based on statistical techniques, and as a reference for POS taggers and text parsers.
Extended database of Arabic roots. The database comes in two forms: in native script, encoded in either UTF-8 or Windows-1256, or in the Kalmasoft native transliteration system "KATS", which uses ASCII characters to facilitate text processing. It is essential for every root-based Arabic processing system, in particular POS taggers and inflection generation systems.
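For data delivered in Windows-1256, a consumer working in UTF-8 would usually convert on load. The sketch below shows that round trip with Python's built-in `cp1256` codec on a single three-letter root; the helper name `to_utf8` is hypothetical, and the KATS transliteration is not shown because its character mapping is not described here.

```python
# The three-letter Arabic root k-t-b, written with escapes for clarity.
root = "\u0643\u062a\u0628"

utf8_bytes = root.encode("utf-8")      # two bytes per Arabic letter
cp1256_bytes = root.encode("cp1256")   # one byte per Arabic letter

def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a legacy code page to UTF-8 (hypothetical helper)."""
    return data.decode(source_encoding).encode("utf-8")

converted = to_utf8(cp1256_bytes, "cp1256")
print(len(utf8_bytes), len(cp1256_bytes), converted == utf8_bytes)
```

Decoding to a Python `str` first keeps all later processing independent of the delivery encoding.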
Arabic full-form verbs actually found in ordinary running text; this database includes all regular conjugated verbs.
Arabic full-form nouns actually found in ordinary running text; this database includes all regular inflected nouns.
Aramolex is an Arabic morphological lexicon: a dictionary database generated to serve as a full-form lexicon for the entire regular vocabulary of the Arabic language, as well as the non-regular surface forms of the Arabic vocabulary.
Full information about 5,000+ loanwords of multiple origins, including English, French, Turkish, etc., coded in native Arabic script and Kalmasoft "KATS" with their possible original parallels. This database is good for text abridging, parsers, and other kinds of NLP applications.
Full information about 50,000+ loan terms of multiple origins, including English, Spanish, Italian, French, Turkish, etc., coded in native Arabic script and Kalmasoft "KATS" with their possible original parallels. This database is good for text abridging, parsers, and other kinds of NLP applications.
Full information about thousands of Amharic loanwords found in present-day Arabic and in classical Arabic. This database is compiled for the purposes of CLIR and other IR disciplines.
Full information about thousands of Syriac loanwords found in present-day Arabic and in classical Arabic. This database is compiled for the purposes of CLIR and other IR disciplines.
Full information about thousands of words that are common to both Amharic and Syriac.
Full information about thousands of loanwords from both Amharic and Syriac found in present-day Arabic and in classical Arabic. This database is compiled for the purposes of CLIR and other IR disciplines.
Semantic Databases
Hundreds of Arabic idiomatic expressions with their meanings and English parallels; important for MT applications.
Thousands of Arabic proverbs with their English parallels; important for MT and TMM.
Thousands of Arabic newspaper expressions with their meanings and English parallels; important for MT applications.
Fauna and Flora
under construction
under construction
Ontology and Semantic Databases
Arabic noun ontology database (under construction).
Arabic verb ontology database (under construction).
Taxonomy Databases
Arabic noun taxonomy database (under construction).