Abstract
The location of events in multilingual texts, particularly in Arabic, represents a challenge for epidemiological monitoring. Systems such as PADI-web rely on English translation to extract spatial entities, but the scarcity of annotated spatial entities in Arabic can hamper the reliability of translations and extraction. In this context, PADI-Location-AR-EN, which is a dataset of 328 spatial entities that were manually extracted from 96 Arabic-language news articles collected by the PADI-web epidemiological monitoring system, is presented in this paper. Each entity was manually translated into English, normalized using the GeoNames database, and then classified according to its type and spatial category. The dataset can be used to evaluate the translation quality of three machine translation systems (DeepL, Microsoft Azure and Reverso) as well as the performance of named entity recognition models on the translated texts.
Keywords: Arabic, Epidemiology, Spatial entities, Translation, English, NLP
Specifications Table
| Subject | Computer Science: Information System |
| Specific subject area | Animal Diseases Monitoring |
| Type of data | Tabular data (*.csv). Raw and Standardized. Data format: Filtered, Processed |
| Data collection | Data comes from the PADI-web monitoring tool database (https://padi-web.cirad.fr/en/). |
| Data source location | The data are hosted on the Dataverse CIRAD in the context of the MOOD (MOnitoring Outbreaks for Disease surveillance in a data science context) project. |
| Data accessibility | Repository name: Dataverse CIRAD. Data identification number: doi: 10.18167/DVN1/ZA1YUO. Direct URL to data: https://doi.org/10.18167/DVN1/ZA1YUO |
| Related research article | None |
1. Value of the Data
- This dataset provides a valuable opportunity to evaluate translation systems in terms of the quality of the translation of spatial entities from Arabic to English.
- This dataset consists of spatial entities annotated in Arabic, which makes it valuable for the evaluation of Arabic spatial entity recognition systems.
- This dataset can be used for few-shot evaluation or as a test set for large language models (LLMs).
2. Background
Text-based health surveillance systems, such as HealthMap [1] and PADI-web [2], use artificial intelligence to automate the extraction of epidemiological information from text, including online news. To ensure direct access to local information, these systems integrate multilingual content that is translated, usually into English or French, using tools such as Google Translate or Microsoft Translator. This translation facilitates the application of pretrained models for tasks such as named entity recognition (NER) and event localization. However, although spatial entity recognition is crucial for identifying epidemic hotspots and tracking their spread, the effect of Arabic text translation on the quality of this extraction has, to our knowledge, never been evaluated in the context of epidemiological surveillance.
Several datasets have been designed for the Arabic NER task, including ANERcorp [3], AQMAR [4], and Wojood [5], with 5052, 1888, and 917 spatial entities, respectively. In addition, several Arabic–English parallel corpora, such as the King Abdullah II Parallel Corpus [6] and the Leeds Parallel Corpus [7], have been developed to evaluate the quality of machine translation. However, none of these corpora specifically focus on the translation and recognition of spatial entities in the context of epidemiological surveillance.
Translating Arabic place names in medical news presents unique linguistic challenges, including high morphological variability, inconsistent transliteration, descriptive and relative spatial expressions, and ambiguity between administrative and geographic references [8]. These challenges can directly affect the accuracy and quality of spatial entity recognition, which highlights the need for careful validation of translations. By combining manual validation with automatic methods, our dataset addresses these language-specific issues, thereby ensuring higher reliability in location-based surveillance.
Recent studies have highlighted that cross-platform compatibility and standardized data exchange are essential for large-scale monitoring systems [9]. This finding illustrates a general principle: whether in industrial IoT or epidemiological surveillance, interoperability relies on high-quality, standardized data. In our study, validated GeoNames identifiers play this role, thereby enabling different AI and information systems to interoperate seamlessly and support effective decision-making.
To address this gap, we propose a new Arabic–English dataset that is dedicated to spatial entities extracted from epidemiological texts and that integrates both validated manual translations and automatic translations. The manual translations were validated using geographic databases, and the dataset supports the evaluation of models applied to the translated texts.
This kind of study is crucial for improving event-based surveillance systems and outbreak detection on the basis of location recognition in a multilingual context.
3. Data Description
The PADI-Location-AR-EN dataset comprises 231 sentences, which contain 328 spatial entities that correspond to all the locations present in the sentences. These sentences belong to 96 Arabic articles that were automatically extracted by the PADI-web tool [2] and manually labeled (see Section 4) to create a “gold standard”. Each Arabic spatial entity is associated with its translation into English. The spatial entities are classified into two categories: absolute spatial entities and relative spatial entities [10]:
- Absolute Spatial Entities (ASEs): These entities are direct references to precise, locatable geographic spaces; that is, they can be located on a map or in a geographic database.
- Relative Spatial Entities (RSEs): These entities are defined in relation to other spatial entities, using topological relationships. Examples include expressions that translate as “West of Mosul”, “western Iraq”, and “northwestern Hungary”.
Examples of each spatial entity category and the number of entities associated with it are shown in Table 1.
Table 1.
Categories of spatial entities and their distribution in our dataset, with representative examples for each category.
| Categories | Number of entities | Examples |
|---|---|---|
| ASE | 275 | Egypt, Morocco, Bizerte |
| RSE | 53 | East of Mosul, Oriental region, Northwestern China |
In addition to categories, spatial entities are classified into 10 specific types: continent, country, geographical region, city, region, province, village, governorate, rural community, and route [5,11]. Examples of each type of spatial entity, together with the number of entities that correspond to each type, are shown in Table 2.
Table 2.
Spatial entity types, number of occurrences for each type, and representative examples extracted from the dataset.
| Type | Number of entities | Example |
|---|---|---|
| Continent | 10 | Africa, Asia |
| Country | 76 | Morocco, Egypt |
| Geographical Region | 39 | Middle East, West Africa |
| Region | 24 | Talat district, El-Tod East, Oriental region |
| Province | 15 | Tebessa Province, Figuig province |
| City | 116 | Rabat, Ismailia |
| Village | 13 | Koumir, Banub, Al Adaymah |
| Governorate | 18 | Ismailia Governorate, Qalyubia, Monufia, Faiyum Governorate |
| Rural community | 14 | Al Kifaf, Lagnadiz, Oulad Abdoune |
| Route | 3 | Laban, Al-Miqdad Passage Street |
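As a quick arithmetic check, the counts in Table 1 and Table 2 can be summed to confirm that both breakdowns cover the same 328 entities. The sketch below only restates the figures from the two tables; the helper name `shares` is introduced here for illustration.

```python
# Counts copied from Table 1 (categories) and Table 2 (types) of this paper.
CATEGORY_COUNTS = {"ASE": 275, "RSE": 53}
TYPE_COUNTS = {
    "Continent": 10, "Country": 76, "Geographical Region": 39, "Region": 24,
    "Province": 15, "City": 116, "Village": 13, "Governorate": 18,
    "Rural community": 14, "Route": 3,
}


def shares(counts):
    """Return each label's share of the total, in percent, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


total_by_category = sum(CATEGORY_COUNTS.values())
total_by_type = sum(TYPE_COUNTS.values())
category_shares = shares(CATEGORY_COUNTS)
```

Both totals come out to 328, and absolute spatial entities account for roughly 84% of the dataset.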
The annotated spatial entities were standardized using identifiers from the 2022 version of the GeoNames database (https://www.geonames.org/). The GeoNames identifier was selected manually: each translated spatial entity was searched in the GeoNames graphical interface. When a feature was present, the most appropriate identifier was selected on the basis of the context. In this process, we initially examined the first result proposed; if it corresponded to the context and the translated feature, it was selected; otherwise, other results were explored to identify the most relevant feature. Finally, the GeoNames feature class associated with the chosen identifier was assigned to each entity. When no match was found in GeoNames, the GeoNames ID and GeoNames feature class fields were left blank.
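The selection rule described above was applied by hand in the GeoNames interface. Purely as an illustration, it can be sketched as a function that walks an ordered list of candidates and keeps the first one accepted by a context check. The candidate fields `geonameId` and `fcode`, the function name, and the sample identifiers are assumptions made for this sketch; the human judgment is stood in for by a caller-supplied predicate.

```python
def select_geonames_match(candidates, matches_context):
    """Return (id, feature code) of the first candidate accepted by the context check.

    `candidates` is an ordered list of dicts (first search result first).
    `matches_context` stands in for the annotator's judgment.
    Blank strings are returned when nothing fits, mirroring the empty
    GeoNames ID and feature class fields in the dataset.
    """
    for candidate in candidates:
        if matches_context(candidate):
            return str(candidate["geonameId"]), candidate["fcode"]
    return "", ""


# Invented candidates: the identifiers 1 and 2 are placeholders, not real GeoNames IDs.
results = [
    {"geonameId": 1, "name": "Ismailia", "fcode": "PPLA"},
    {"geonameId": 2, "name": "Ismailia Governorate", "fcode": "ADM1"},
]

# The sentence refers to the governorate, so the first result is rejected.
chosen = select_geonames_match(results, lambda c: c["fcode"] == "ADM1")
no_match = select_geonames_match(results, lambda c: c["fcode"] == "RD")
```

Here `chosen` holds the second candidate and `no_match` holds two blank strings.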
For each type of spatial entity, the GeoNames class(es) (feature class(es)) selected from our database are shown in Table 3.
Table 3.
Types of spatial entities and their corresponding GeoNames feature classes in our dataset.
| Type of spatial entity | GeoNames Feature class |
|---|---|
| Continent | continent (CONT) |
| Country | independent political entity (PCLI) |
| Geographical Region | region (RGN) |
| City | seat of a first-order administrative division (PPLA), seat of a second-order administrative division (PPLA2), seat of a third-order administrative division (PPLA3), populated place (PPL), capital of a political entity (PPLC) |
| Region | first-order administrative division (ADM1), territory (TERR) |
| Province | first-order administrative division (ADM1), historical first-order administrative division (ADM1H), second-order administrative division (ADM2) |
| Village | populated place (PPL) |
| Governorate | first-order administrative division (ADM1) |
| Rural community | second-order administrative division (ADM2), third-order administrative division (ADM3) |
| Route | - |
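The correspondence in Table 3 can also be expressed as a lookup that flags rows whose type and feature class disagree. This is a minimal sketch: it assumes that the “GeoNames feature class” column stores the short codes shown in parentheses in Table 3, and it uses the type names from Table 2. The function name is hypothetical.

```python
# Feature codes per entity type, transcribed from Table 3.
EXPECTED_FEATURE_CODES = {
    "Continent": {"CONT"},
    "Country": {"PCLI"},
    "Geographical Region": {"RGN"},
    "City": {"PPLA", "PPLA2", "PPLA3", "PPL", "PPLC"},
    "Region": {"ADM1", "TERR"},
    "Province": {"ADM1", "ADM1H", "ADM2"},
    "Village": {"PPL"},
    "Governorate": {"ADM1"},
    "Rural community": {"ADM2", "ADM3"},
    "Route": set(),  # no route was matched in GeoNames
}


def is_consistent(entity_type, feature_code):
    """True when the pair agrees with Table 3.

    A blank code means "no GeoNames match" and is accepted for every type.
    """
    if not feature_code:
        return True
    return feature_code in EXPECTED_FEATURE_CODES.get(entity_type, set())
```

For instance, `is_consistent("City", "PPLC")` holds, whereas `is_consistent("Village", "ADM1")` does not.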
The dataset is a table with fourteen columns: Id, Arabic sentence, Arabic location, English location, GeoNames ID, GeoNames feature class, Type, Category, Translation DeepL, Translation Microsoft Azure, Translation Reverso, English sentences translated by DeepL, English sentences translated by Microsoft Azure, and English sentences translated by Reverso. The “Id” column lists the unique identifier of the online news article from which the sentence was extracted. “Arabic sentence” contains the Arabic sentence from which the spatial entity was extracted. “Arabic location” contains the Arabic spatial entity that was manually extracted from the corresponding sentence. The “English location” column provides the manual translation of the spatial entity into English, which is based on reliable sources, such as Google Maps and the GeoNames database, and on expert validation. The “GeoNames ID” column presents the unique ID from the GeoNames database that corresponds to each spatial entity (it is empty if there is no match in GeoNames). The “GeoNames feature class” column contains the feature class that corresponds to the GeoNames ID (it is empty if there is no match in GeoNames). The “Type” column indicates the type of spatial entity, and “Category” distinguishes between the two categories of entities (i.e., absolute or relative). The “Translation DeepL”, “Translation Microsoft Azure”, and “Translation Reverso” columns present the automatic contextual translation of the spatial entity provided by each of the three translation systems. Finally, the columns “English sentences translated by DeepL/Microsoft Azure/Reverso” contain the automatically translated English sentences, from which the spatial entities for each translator have been extracted. Notably, in some cases observed during the analysis, the spatial entity is completely absent from the translated sentence.
As a result, the column that contains the translated entity may be empty when the system does not produce an explicit spatial entity. Concrete examples of rows extracted from the dataset are presented in Table 4 to provide a clear view of its structure and content.
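Given this column layout, one simple way to compare the three translation systems is to measure how often the automatic translation of an entity equals the manual one, and how often a system yields no entity at all. The sketch below is only an illustration under stated assumptions: the exact CSV header spellings are taken from the description above and may differ in the published file, exact match after lowercasing is a deliberately crude metric, and the sample rows (including the variant “Benzert”) are invented.

```python
import csv
import io

SYSTEMS = ("DeepL", "Microsoft Azure", "Reverso")


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join((text or "").lower().split())


def score_translations(rows):
    """Per system: share of exact matches and share of empty (missing) entities."""
    # Entities without a manual translation cannot be scored.
    rows = [r for r in rows if normalize(r["English location"])]
    if not rows:
        return {}
    scores = {}
    for system in SYSTEMS:
        column = f"Translation {system}"
        exact = sum(normalize(r[column]) == normalize(r["English location"]) for r in rows)
        missing = sum(normalize(r[column]) == "" for r in rows)
        scores[system] = {"exact": exact / len(rows), "missing": missing / len(rows)}
    return scores


# Toy rows invented for illustration; they are not taken from the dataset.
SAMPLE = """English location,Translation DeepL,Translation Microsoft Azure,Translation Reverso
Egypt,Egypt,Egypt,egypt
Bizerte,Bizerte,Benzert,
"""
scores = score_translations(list(csv.DictReader(io.StringIO(SAMPLE))))
```

To run the same function on the real file, the in-memory sample would be replaced by `csv.DictReader(open(path, encoding="utf-8", newline=""))`, where `path` points to the downloaded CSV.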
Table 4.
Examples of rows from the PADI-Location-AR-EN dataset.
4. Experimental Design, Materials and Methods
To construct our PADI-Location-AR-EN dataset, we followed the process illustrated in Fig. 1.
Fig. 1.
Pipeline for building our reference dataset.
First, we collected Arabic articles from the PADI-web resource. The first author, who is a native Arabic speaker, subsequently manually extracted the spatial entities from the articles to create the PADI-Location-AR dataset. The latter associates each article, which is identified by a unique ID, with all the spatial entities mentioned in it. We then manually verified the translation of these entities from Arabic to English using the GeoNames database and Google Maps. The joint use of these two resources proved necessary because of the variable coverage of spatial entities: some entities absent from GeoNames were present in Google Maps, and vice versa. This hybrid approach maximized translation quality and completeness. In addition, the spatial entities were standardized using the GeoNames database: each entity was manually linked to the most appropriate GeoNames identifier on the basis of the context, and the corresponding feature class was assigned. We subsequently categorized these spatial entities and classified them according to different types. Finally, we integrated the translations provided by three separate translators.
Limitations
Several datasets have been designed for the task of named entity recognition in Arabic, including ANERcorp [3], AQMAR [4], and Wojood [5]. In addition, several Arabic–English parallel corpora, such as the King Abdullah II Parallel Corpus [6], the Leeds Parallel Corpus [7], and the Parallel Corpus of Saudi Laws and Regulations [12], have been developed to evaluate the quality of machine translation in political, legal, and educational contexts. However, the translation and recognition of spatial entities in the context of epidemiological surveillance in Arabic are not expressly addressed by any of the corpora that are currently available. The proposed corpus described in this data paper is smaller, but more focused on spatial entity recognition related to the epidemiological domain and integrates validated manual translations using geographic databases, normalization using the GeoNames database, and automatic translations from several translation systems.
Finally, three spatial entities are not translated in our reference dataset, specifically in the “English Location” column, because their translations are not available in the literature, particularly in Google Maps or the GeoNames database.
Among the extracted relative spatial entities, 42 were not normalized because there was no match in the GeoNames database, whereas the others were correctly normalized and had a GeoNames ID. The absolute spatial entity “Laban”, which is of type route, was also not normalized.
Representative examples of spatial entities for which no match exists in the GeoNames database, including Arabic and English locations, type, and category, are presented in Table 5. The limitations of GeoNames in covering Arabic epidemiological locations are illustrated in this table.
Table 5.
Representative examples of spatial entities with no match in GeoNames.
These limitations are essentially related to the very nature of relative spatial entities, which mostly correspond to directional or descriptive subdivisions of known places and do not constitute officially recognized administrative units. Therefore, they generally lack both an identifier and a feature class in GeoNames, whose coverage is focused primarily on formal administrative divisions, populated localities, or large-scale regions. During the translation process, we relied, when possible, on the absolute spatial entity included in the relative expression, as well as on translations from the scientific literature, to ensure consistency and redirection. However, this strategy did not allow for complete standardization, which explains the absence of GeoNames identifiers for the majority of the relative entities. Only certain institutionally recognized regions, such as West Africa or Sub-Saharan Africa, are correctly normalized, whereas fine-grained relative expressions remain largely underrepresented. This is a structural limitation for all geocoding tasks on unstructured texts, including Arabic epidemiological texts, where relative spatial descriptions play an important role in the contextualization of health events [13].
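The idea of falling back on the absolute entity embedded in a relative expression can be illustrated on the English side with a small pattern-based splitter. This is a hypothetical sketch, not the procedure used to build the dataset: the patterns cover only compass-direction expressions such as those in Table 1, and a full expression that is itself a recognized region (for example, West Africa) should be looked up as a whole before any split is attempted.

```python
import re

# Compass directions, optionally compound ("northwest") and adjectival ("western").
DIRECTION = r"(?:north|south|east|west)(?:east|west)?(?:ern)?"
OF_PATTERN = re.compile(rf"^({DIRECTION})\s+of\s+(.+)$", re.IGNORECASE)   # "West of Mosul"
ADJ_PATTERN = re.compile(rf"^({DIRECTION})\s+(.+)$", re.IGNORECASE)       # "western Iraq"


def split_relative(expression):
    """Split a relative expression into (direction, anchor entity).

    Returns None when no compass-direction pattern applies, for example
    for descriptive expressions such as "Oriental region".
    """
    text = expression.strip()
    for pattern in (OF_PATTERN, ADJ_PATTERN):
        match = pattern.match(text)
        if match:
            return match.group(1).lower(), match.group(2)
    return None
```

With this helper, `split_relative("West of Mosul")` yields the anchor “Mosul”, which can then be searched in GeoNames even though the full relative expression has no entry.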
Ethics Statement
No conflict of interest exists in this submission. The authors declare that the work described in this paper is original and not under consideration for publication elsewhere, in whole or in part. Its publication is approved by all the authors listed.
CRediT Author Statement
Fatima Ezzahra El Houbri: Methodology, Software, Data curation, Writing; Najlae Idrissi: Supervision, Methodology, Writing – review & editing; Mathieu Roche: Supervision, Methodology, Writing – review & editing; Sarah Valentin: Supervision, Methodology, Writing – review & editing.
Acknowledgments
The authors thank Julien Rabatel for the English translation of sentences via the Microsoft Azure solution integrated into PADI-web.
This work is carried out with the support of the CNRST under the “PhD Associate Scholarship – PASS” program for doctoral student monitors in the digital field by 2027 – National Strategy Morocco 2030.
This work has been funded by the MOOD (MOnitoring Outbreaks for Disease surveillance in a data science context) project, DGAL and the French National Research Agency under the Investments for the Future Program: ANR-16-CONV-0004 (#DigitAg).
Declaration of Competing Interest
The authors declare that they have no financial or personal interests that could influence the work reported in this paper.
References
- 1. Freifeld C.C., Mandl K.D., Reis B.Y., Brownstein J.S. HealthMap: global infectious disease monitoring through automated classification and visualization of internet media reports. J. Am. Med. Inf. Assoc. 2008;15(2):150–157. doi: 10.1197/jamia.M2544.
- 2. Valentin S., et al. PADI-web 3.0: a new framework for extracting and disseminating fine-grained information from the news for animal disease surveillance. One Health. 2021;13. doi: 10.1016/j.onehlt.2021.100357.
- 3. Benajiba Y., Rosso P., Ruiz J.M.B. Lect. Notes Comput. Sci. Springer; 2007. ANERsys: an Arabic named entity recognition system based on maximum entropy; pp. 143–153.
- 4. Mohit B., Schneider N., Bhowmick R., Oflazer K., Smith N.A. Proc. EACL. 2012. Recall-oriented learning of named entities in Arabic Wikipedia; pp. 162–173.
- 5. Jarrar M., Khalilia M., Ghanem S. Proc. 13th Language Resources and Evaluation Conf. ELRA; Marseille, France: 2022. Wojood: nested Arabic named entity corpus and recognition using BERT; pp. 3626–3636.
- 6. Al-Khalafat L., Haider A.S. A corpus-assisted translation study of strategies used in rendering culture-bound expressions in the speeches of King Abdullah II. Theory Pr. Lang. Stud. 2022;12(1):130–142. doi: 10.17507/tpls.1201.16.
- 7. El-Farahaty H., Khallaf N., Alonayzan A. Building the Leeds monolingual and parallel legal corpora of Arabic and English countries’ Constitutions: methods, challenges and solutions. Corpus. Pragmat. 2023;7(2):103–119. doi: 10.1007/s41701-023-00138-x.
- 8. Alayba A.M. Arabic Natural Language Processing (NLP): a comprehensive review of challenges, techniques, and emerging trends. Computers. 2025;14(11). doi: 10.3390/computers14110497.
- 9. Addula S.R., Tyagi A.K., Naithani K., Kumari S. Blockchain-empowered internet of things (iots) platforms for automation in various sectors. Artif. Intell.-Enabled Digit. Twin Smart Manuf. Sep. 2024:443–477. doi: 10.1002/9781394303601.ch20.
- 10. Flamein H. Annotation automatique des lieux dans l’oral spontané transcrit (Automatic annotation of places in the transcribed oral). Proc. 24th Conf. Traitement Automatique des Langues Naturelles (RECITAL 2017); Orléans, France; ATALA; 2017. pp. 121–134.
- 11. Suwaileh R., Imran M., Elsayed T. Proc. Annual Meeting of the Assoc. Comput. Linguistics. 2023. IDRISI-RA: the first Arabic location mention recognition dataset of disaster tweets; pp. 16298–16317.
- 12. El-Farahaty H. Integrating corpora and AI tools in the teaching, learning, and assessing of Arabic/English legal translation. Int. J. Semiot. Law. 2025;38(6):2031–2060. doi: 10.1007/s11196-025-10281-0.
- 13. Syed M.A., Arsevska E., Roche M., Teisseire M. GeospatRE: extraction and geocoding of spatial relation entities in textual documents. Cartogr. Geogr. Inf. Sci. 2025;52(3):221–236. doi: 10.1080/15230406.2023.2264753.