Version Changes
Revised. Amendments from Version 1
To address concerns about geographic and publication biases, the literature search strategy was expanded beyond global databases like PubMed and Web of Science. The protocol now includes plans to search regional repositories such as African Journals Online (AJOL) and SciELO in a second round of data extraction, aiming to capture older or regionally-published studies. The protocol also clarifies that non-English language papers are not excluded and will be handled using a combination of automated translation and expert native speakers. Significant updates were made to the data processing section to enhance transparency and rigor. For species name harmonization, a clear set of decision rules was introduced to resolve ambiguous or outdated taxonomic names, with a commitment to retain original names and flag uncertainties. A new section was added to explain how the database will handle conflicting assay results (e.g., serology vs. PCR) without making assumptions. By preserving all test results, the dataset will enable users to analyze different biological signals and account for varying assay sensitivities. Finally, the protocol now more explicitly states the project's strategy for addressing sampling bias. It details plans to create a spatiotemporal raster of rodent sampling effort from the database's metadata. This tool will allow researchers to identify surveillance gaps and support more robust, bias-aware ecological modeling. These changes collectively make the protocol more comprehensive, transparent, and aligned with best practices in data synthesis for infectious disease ecology.
Abstract
Arenaviruses and Hantaviruses, primarily hosted by rodents and shrews, represent significant public health threats due to their potential for zoonotic spillover into human populations. Despite their global distribution, the full impact of these viruses on human health remains poorly understood, particularly in regions like Africa, where data are sparse. Both virus families continue to emerge, with pathogen evolution and spillover driven by anthropogenic factors such as land use change, climate change, and biodiversity loss. Recent research highlights the complex interactions between ecological dynamics, host species, and environmental factors in shaping the risk of pathogen transmission and spillover. This underscores the need for integrated ecological and genomic approaches to better understand these zoonotic diseases. A comprehensive, spatially and temporally explicit dataset, incorporating host-pathogen dynamics and human disease data, is crucial for improving risk assessments, enhancing disease surveillance, and guiding public health interventions. Such a dataset (ArHa) would also support predictive modelling efforts aimed at mitigating future spillover events. This paper proposes the development of this unified database for small-mammal hosts of Arenaviruses and Hantaviruses, identifying gaps in current research and promoting a more comprehensive understanding of pathogen prevalence, spillover risk, and viral evolution.
Keywords: database production protocol, arenaviruses, hantaviruses, rodent-borne zoonoses, pathogen surveillance
Plain Language summary
Arenaviruses and Hantaviruses are globally distributed pathogens that can spread from animals to humans (zoonoses). Here, we describe the production of a dataset (ArHa) to synthesise spatial and temporal information from published research, providing a unique resource for understanding geographic and temporal trends in Arenavirus and Hantavirus host-pathogen relationships. By explicitly quantifying sampling biases and detection efforts, the ArHa dataset will allow more robust and accurate assessments of pathogen prevalence and distribution. The spatial scale of the produced dataset offers a platform for linking ecological data with human health outcomes, which will support the identification of spillover hotspots. The ArHa dataset relies on currently published material, which may vary in terms of detail, accuracy, and completeness. Missing or imprecise information may impact the reliability of subsequent analyses. The dataset as produced will be a static resource, which could limit its relevance over time as emerging data will not be added. Encouragingly, researchers in the field of zoonotic infections increasingly make primary data available, which will mitigate this limitation.
Introduction
Arenaviruses and Hantaviruses are globally distributed pathogens which primarily infect rodents (order Rodentia) and shrews (order Eulipotyphla) 1, 2 . Some of these Arenaviruses and Hantaviruses occasionally spill over (cross-species transmission of a parasite into a host population not previously infected) into human populations, with variable morbidity and mortality rates 3 . However, the overall human health impact remains poorly understood in many cases 4, 5 . Arenaviruses include Mammarenavirus lassaense (LASV), responsible for Lassa fever in West Africa, and Mammarenavirus juninense (JUNV), which causes Argentine haemorrhagic fever in Argentina 6, 7 . Lassa fever is estimated to infect 900,000 individuals annually across West Africa, with over 200 deaths reported from Nigeria in 2024 8, 9 . In contrast to the limited distribution of LASV, Mammarenavirus choriomeningitidis (causing lymphocytic choriomeningitis) has a global distribution with few infections reported 10 .
Hantaviruses, including Orthohantavirus puumalaense (PUUV) and Orthohantavirus seoulense (SEOV), cause haemorrhagic fever with renal syndrome (HFRS) in Europe and Asia, while Orthohantavirus sinnombreense (SNV) causes hantavirus pulmonary syndrome (HPS) in the Americas 11– 13 . HFRS is associated with 23,000 reported annual infections, but fewer than 100 deaths, whereas HPS, which has a higher case fatality rate (12-45%), is reported in fewer than 300 cases annually 14 . The distribution and human health impact of hantaviruses in Africa are even more poorly understood 15 .
Recent research indicates that both Arenaviruses and Hantaviruses continue to emerge, with pathogen evolution and spillover driven by anthropogenic factors such as land use change, climate change and small-mammal biodiversity loss 16, 17 . In addition to direct biodiversity changes, the structure and dynamics of host populations and their community interactions have also been shown to influence pathogen prevalence 18, 19 . Human activities such as deforestation, urbanization, and agricultural expansion bring humans into closer (e.g., increased frequency and intensity) contact with rodent reservoirs, increasing the likelihood of spillover events 20, 21 . Changes in climate, land use, rodent ecology, and human behaviour all affect human-reservoir contact, contributing to the complex human-animal-environment nexus that drives pathogen spillover and persistence 22– 24 . Although human-to-human transmission is a concern (e.g., Disease X), most infections result from rodent-to-human transmission, with LASV classified as a priority pathogen by the WHO due to its potential for human-to-human spread 4, 25 .
Understanding the economic and societal impacts of these diseases is also critical. In addition to direct healthcare costs, there are significant indirect costs (e.g., long-term productivity losses), which exacerbate economic strain on vulnerable populations 4 . Outbreaks of emerging rodent-borne zoonoses, like HPS during the Four Corners Outbreak (1993) or Yosemite outbreak (2012), demonstrate the broader social and economic disruptions that these diseases can cause 26, 27 . Long-term health consequences, such as sensorineural hearing defects in survivors of Lassa fever, further complicate the disease burden 28 . Although only ~20% of LASV infections are symptomatic, the full scale of both acute and long-term disease remains poorly understood 29 . Under-reporting of these diseases, likely due to difficulties in diagnosing acute infections, means that many cases go undetected, particularly in endemic regions 14 .
The traditional paradigm in which a specific pathogen is hosted by a single reservoir species is being overturned by increased pathogen surveillance in endemic settings 30 . These data have identified multiple potential hosts for pathogens including LASV, SEOV and SNV 13, 31, 32 . Similarly, a single host species may host multiple Arenaviruses and Hantaviruses throughout its geographic distribution; for example, Mastomys natalensis, the primary host of LASV, is known to host at least 7 distinct Arenaviruses 33 .
The ecology of rodent hosts — such as population dynamics, behaviour, habitat preferences, and community interactions — directly influences pathogen transmission and spillover risk 34 . Linking ecological and genomic data enables the identification of genetic markers associated with increased virulence or transmissibility, providing critical insights into the evolutionary trajectories of Arenaviruses and Hantaviruses 35, 36 . Small-mammal sampling is essential for uncovering how these ecological factors moderate the prevalence, persistence, and spread of zoonotic pathogens within their communities 37, 38 . For example, fluctuations in host population density or shifts in community composition, often driven by habitat changes or seasonal cycles, can amplify transmission within reservoirs or increase human exposure risk 39, 40 . Understanding interactions between host species within shared environments, including competition or co-occurrence, further elucidates how pathogens circulate, expand, and evolve across interconnected populations 41– 43 . These insights underscore the value of systematic small-mammal sampling to inform surveillance, predict areas of heightened spillover risk, and develop targeted strategies to mitigate zoonotic disease emergence.
A dataset that includes detailed spatial and temporal data on rodent hosts and their pathogens would enable the linkage of local ecological findings with public health patterns at broader scales. While human disease data are often aggregated at national or subnational levels to preserve patient anonymity, clinical practitioners and public health authorities will have access to higher-resolution case data. For these stakeholders, integrating ecological data with localised human disease patterns could enhance the identification of hotspots of transmission risk and inform more targeted interventions. Data fragmentation currently limits efforts to connect local-scale, high-intensity research on host-pathogen dynamics to broader-scale trends, which impairs predictive modelling and disease management efforts. A unified dataset would support these efforts, enabling better understanding of ecological drivers of zoonotic disease risk, viral evolution, and pathogen emergence across landscapes and over time, improving both risk assessments and public health interventions 44 .
Historically, research on rodent-borne zoonoses has been uneven, with undersampling in high-risk regions like Africa, South Asia, and Southeast Asia 31, 45, 46 . Identifying these undersampled regions is crucial for guiding future research and prioritising viral discovery efforts 47 . Addressing these gaps will improve our understanding of pathogen diversity, distribution, and evolution, and help predict and mitigate future spillover events 48, 49 . A comprehensive global sampling effort will highlight knowledge gaps and inform future research priorities.
These uneven patterns of sampling are often shaped by factors such as research funding availability, geopolitical accessibility, diagnostic infrastructure, and regional disease prioritisation. While the causes of sampling bias cannot typically be inferred directly from the published data, our dataset is designed to capture and quantify the spatial and temporal distribution of sampling effort. By systematically documenting both presence and absence data for pathogens within rodent hosts, alongside metadata such as sampling dates, locations, and diagnostic methods, the database will allow researchers to visualise gaps in surveillance and identify systematic biases in sampling coverage.
An early planned analysis will involve constructing a spatiotemporal raster of small mammal sampling effort, enabling bias-aware ecological modelling and informing future pathogen discovery and surveillance activities. This approach will not only highlight where surveillance has occurred but also where it is lacking, facilitating more equitable and evidence-based targeting of future research.
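As a rough illustration of what a sampling-effort raster involves, the sketch below bins records into latitude-longitude grid cells by year and sums the trap effort in each cell. It is written in Python purely for illustration. The one-degree cell size, the `year` field (the extraction sheets record a period in `eventDate`), and the example values are assumptions made here, not part of the protocol.

```python
import math
from collections import defaultdict


def effort_raster(records, cell_size_deg=1.0):
    """Sum trap effort per (row, column, year) grid cell.

    Row and column are the floor of latitude and longitude divided by the
    cell size, so each key identifies one square cell in one year.
    """
    raster = defaultdict(float)
    for rec in records:
        row = math.floor(rec["decimalLatitude"] / cell_size_deg)
        col = math.floor(rec["decimalLongitude"] / cell_size_deg)
        raster[(row, col, rec["year"])] += rec["trapEffort"]
    return dict(raster)


# Hypothetical records: two sampling events fall in the same cell and year.
example_records = [
    {"decimalLatitude": 8.2, "decimalLongitude": -11.7, "year": 2019, "trapEffort": 300},
    {"decimalLatitude": 8.9, "decimalLongitude": -11.1, "year": 2019, "trapEffort": 200},
    {"decimalLatitude": 7.5, "decimalLongitude": -11.5, "year": 2020, "trapEffort": 100},
]
raster = effort_raster(example_records)
```

Cells that never appear as keys are the candidate surveillance gaps. A real analysis would likely use an equal-area grid rather than degree cells, whose area shrinks towards the poles.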
A unified and comprehensive dataset on Arenaviruses and Hantaviruses in wild-caught small mammals is urgently needed to address these gaps. By collating spatial, temporal, and genomic data, such a dataset will be a critical resource for identifying undersampled regions and potential hosts, guiding viral discovery, and informing risk assessments, predictive models, and public health interventions. Adhering to Open Science principles, we will ensure that research tools, data extraction methods, and processing code are shared on suitable platforms following the FAIR guidelines 50 . This will ensure global accessibility and foster evidence-based decision-making and collaboration in disease surveillance. As automated tools increasingly drive scientific research, we must standardize datasets to avoid overlooking valuable data. By curating this dataset, we aim to preserve scientific knowledge and ensure its accessibility to global researchers through platforms like the Global Biodiversity Information Facility (GBIF) 51 .
Existing data on rodent-pathogen associations are often limited by geographic and temporal constraints, which limits broader analyses of pathogen prevalence, spillover risk, and range expansion. The proposed ArHa database will address these limitations by synthesizing and standardizing data from diverse sources, enabling research on host-pathogen dynamics, range expansion, and spillover risk. This unified resource will strengthen predictive models of zoonotic diseases, improve public health strategies, and enhance preparedness for future disease threats.
Protocol
Study objectives
The ArHa dataset aims to:
Construct a global, standardised and relational dataset of Arenavirus and Hantavirus detections in wild-caught rodents and shrews, integrating ecological, epidemiological and genomic information.
Quantify and map sampling effort, identifying spatiotemporal patterns in small mammal surveillance and highlighting geographic and taxonomic gaps.
Characterise pathogen detection and reporting biases, including diagnostic method, assay sensitivity and data completeness across studies.
Link host, pathogen, and genomic data across time and space, enabling the exploration of host pathogen associations and viral diversity.
Enable bias-aware modelling of pathogen prevalence, distribution, and spillover risk through harmonized spatial and temporal metadata.
Facilitate integration of ecological and genomic data with public health datasets, enabling identification of spillover hotspots and informing disease surveillance strategies.
Promote data reuse and interoperability by harmonizing taxonomy and adhering to FAIR principles with open access releases.
Search strategy
We searched NCBI PubMed and Clarivate Web of Science for relevant citations, with no restrictions on publication date, language, or sampling locations. Table 1 details the search terms used and the number of results returned by each database.
Table 1. Search terms used for identifying relevant literature.
| Search term | PubMed | Web of Science |
|---|---|---|
| 1. rodent* OR rat OR mouse | 3,878,308 | 4,090,059 |
| 2. shrew OR eulipotyphla | 7,582 | 9,108 |
| 3. arenavir* | 2,525 | 1,966 |
| 4. hantavir* | 4,316 | 5,084 |
| 5. 1 OR 2 | 3,883,403 | 4,096,018 |
| 6. 3 OR 4 | 6,746 | 6,899 |
| 7. 5 AND 6 | 3,100 | 2,904 |
Search terms 1-4 (Table 1) included expanded terms. Citations returned by the searches were downloaded and de-duplicated by matching on titles, authors, publication identifiers, or digital object identifiers. The initial search (conducted 2023-10-16) resulted in 2,755 unique publications. These citations were uploaded to the Rayyan platform to assess against inclusion and exclusion criteria 52 .
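The de-duplication step can be sketched as follows. This is a minimal illustration in Python, not the procedure actually used: it assumes each citation is a dictionary with optional `doi`, `pmid`, and `title` fields (hypothetical field names), and it treats two citations as duplicates if any one of these keys matches after simple normalisation. Author matching is omitted for brevity.

```python
import re


def normalise_title(title: str) -> str:
    """Lower-case a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(citations: list[dict]) -> list[dict]:
    """Keep the first citation seen for each DOI, PMID, or normalised title."""
    seen: set[tuple[str, str]] = set()
    unique = []
    for cit in citations:
        keys = []
        if cit.get("doi"):
            keys.append(("doi", cit["doi"].lower().strip()))
        if cit.get("pmid"):
            keys.append(("pmid", str(cit["pmid"]).strip()))
        if cit.get("title"):
            keys.append(("title", normalise_title(cit["title"])))
        if any(key in seen for key in keys):
            continue  # already have a citation sharing at least one key
        seen.update(keys)
        unique.append(cit)
    return unique


# Hypothetical export combining both databases.
records = [
    {"title": "Hantavirus in wild rodents", "doi": "10.1000/example.1", "source": "PubMed"},
    {"title": "Hantavirus in Wild Rodents.", "doi": None, "source": "Web of Science"},
    {"title": "Arenavirus survey of shrews", "doi": "10.1000/example.2", "source": "PubMed"},
    {"title": "A different title", "doi": "10.1000/EXAMPLE.2", "source": "Web of Science"},
]
unique_records = deduplicate(records)
```

Matching on any single key is deliberately permissive. It may merge distinct papers that share a generic title, so a manual check of merged records would still be prudent.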
To supplement our core searches in PubMed and Web of Science, we plan in a second round of data extraction to search additional regional repositories, including Google Scholar, African Journals Online (AJOL), SciELO, LILACS, J-STAGE (Japan), KoreaMed, IndMED, MedIND (India), ASEAN Citation Index, and the Index Medicus for the Eastern Mediterranean Region (IMEMR). These platforms will help us to capture relevant literature that may not be indexed in global bibliographic databases, particularly older or regionally published studies.
Study screening
Inclusion criteria. We screened studies against the following inclusion criteria. Studies were included if they reported:
1. Rodent OR Shrew sampling from wild populations AND
2. Direct or indirect detection of microorganisms in the Arenaviridae and Hantaviridae families AND
3. Information on the sampling location of small mammals
Exclusion criteria. Studies were excluded if they:
1. Did not contain information on the host species of the sample for direct or indirect detection of Arenaviridae or Hantaviridae,
2. Reported experimental infections or solely human infections,
3. Were abstracts or conference proceedings that did not provide any description of the method of animal sampling or microorganism detection.
Direct detection was defined as detection via nucleic acid amplification tests (e.g., Polymerase Chain Reaction (PCR)) or virus culture; indirect detection was defined as detection of specific antibodies or antigens.
We first screened the study titles and abstracts. For articles published in languages other than English we used automated translation services (e.g., Google Translate) if study authors were unable to review the manuscript in its published language. All titles and abstracts were then assessed against our inclusion and exclusion criteria. Assessments were made by two members of the study team (DS and RR). Studies deemed to meet all inclusion criteria by at least one author proceeded to full-text review.
Reference chaining was performed on studies considered for inclusion at the full text screening stage and relevant publications were added as manually identified relevant articles. The PRISMA flow chart for the search process is shown in Figure 1 which also indicates the status of pilot data extraction (10% of the 917 studies included at title and abstract stage). The search will be re-run to capture more recently published articles when data extraction from the current search has been completed.
Figure 1. PRISMA flow diagram for records identified in the initial search conducted 2023-10-16.
Studies excluded at full text review included those with incomplete data on ≥1 of pathogen, host, or study location. For studies containing duplicated data, the highest resolution dataset (i.e., temporal, spatial, taxonomic) was retained.
Data extraction
We aimed to extract included study meta-data and to produce three linked datasets that could be used to assess sampling of a) small-mammal hosts, b) viral prevalence and c) genetic variability. We have developed and refined the current data extraction tools in the pilot search (Figure 1). A publicly available RShiny web-application will be developed to present the extracted data while the search is ongoing 53 .
Included studies. Included study meta-data was abstracted to ensure appropriate attribution for each rodent, pathogen and genomic material record to the original publication that presented it (Table 2). Each included study was assigned a unique identifier which provides linkage to the reference. We extracted reported sampling effort, when this information was available (e.g., number of trap-nights). For articles not reporting sampling effort at the study level we inferred total sampling effort by summing effort across individual trapping sessions and sites. For studies that did not report a measure of sampling effort amenable to imputation (e.g., a variable number of traps per trapping line, or days per study session) we recorded a value of ‘not reported’. Finally, we recorded the level of data aggregation of rodent or pathogen sampling, classified as individual (data at the individual level) or summarised (data were aggregated; e.g., at site or sampling session level).
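The inference of study-level sampling effort described above can be sketched as below. This Python illustration assumes each trapping session is described by a number of traps and a number of nights (the hypothetical `traps` and `nights` fields). If either is missing for any session, it falls back to 'not reported' rather than returning a partial total.

```python
from typing import Union


def total_trap_nights(sessions: list[dict]) -> Union[int, str]:
    """Sum traps x nights across sessions, or return 'not reported'."""
    if not sessions:
        return "not reported"
    total = 0
    for session in sessions:
        traps, nights = session.get("traps"), session.get("nights")
        if traps is None or nights is None:
            # Effort cannot be imputed reliably for the whole study.
            return "not reported"
        total += traps * nights
    return total


# Hypothetical study with two sites, each trapped in one session.
complete = [
    {"site": "A", "traps": 100, "nights": 3},
    {"site": "B", "traps": 50, "nights": 4},
]
incomplete = [
    {"site": "A", "traps": 100, "nights": 3},
    {"site": "B", "traps": None, "nights": 4},
]
effort_complete = total_trap_nights(complete)      # 100*3 + 50*4
effort_incomplete = total_trap_nights(incomplete)
```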
Table 2. Study information extraction sheet.
| Column name | Description |
|---|---|
| study_id | An internal unique identifier for the included study |
| full_text_id | The internal ID for the full text manuscript and its associated citation |
| datasetName | The title of the manuscript, report or book section |
| sampling_effort | A free text entry to capture the effort of sampling, ideally in trap nights for rodent studies |
| data_access | Whether data is available for individual small mammals or whether it is aggregated |
| data_resolution | The level of aggregation available |
| linked_manuscripts | The DOI or weblink to other studies including the same dataset either in its entirety or a subset, this will be used to attempt to de-duplicate data |
| notes | Details that may be important for interpreting the extracted data |
Small-mammal sampling. Data were extracted into a rodent sheet (Table 3) at the highest available level of temporal and spatial resolution. For example, a study reporting a single species at four spatially distinct sampling sites would be associated with four records. If this same study reported observations across four distinct trapping sessions at each of these sites there would be 16 (4 × 4) records associated with the study. Studies presenting data on N individuals are associated with N records. These records may not correspond to unique individuals; for example, in capture-mark-recapture studies the same individual may be detected over multiple sessions. For this reason, we do not report a unique identifier at the individual level.
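The record counts in the example above follow from taking every combination of species, site, and session. The Python sketch below shows this for one species at four sites over four sessions. The identifier format and the site and session labels are hypothetical.

```python
from itertools import product


def expand_records(study_id, species, sites, sessions):
    """Create one rodent record per species x site x session combination."""
    combos = product(species, sites, sessions)
    return [
        {
            "rodent_record_id": f"{study_id}_{i}",  # hypothetical ID format
            "study_id": study_id,
            "scientificName": sp,
            "locality": site,
            "eventDate": session,
        }
        for i, (sp, site, session) in enumerate(combos, start=1)
    ]


rodent_records = expand_records(
    "study_001",
    ["Mastomys natalensis"],
    ["Site A", "Site B", "Site C", "Site D"],
    ["2019-01", "2019-04", "2019-07", "2019-10"],
)
```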
Table 3. Extracted rodent sampling variables.
| Column name | Description |
|---|---|
| rodent_record_id | A unique internal identifier for the rodent record. Record resolution depends on the extent of aggregation in the study |
| study_id | A unique identifier to link a record to its respective study |
| scientificName | The binomial species name of the small-mammal reported in the study |
| eventDate | The period in which sampling was conducted |
| locality | The location of sampling effort, reported to the highest spatial resolution available |
| country | Country where trapping occurred. For multinational studies where counts are not disaggregated by country, all countries are reported |
| verbatimLocality | High level habitat type will be recorded here at the scale for which trapping is recorded, particularly useful when locality does not discriminate between multiple sampling sites |
| coordinate_resolution | The spatial resolution of coordinates |
| decimalLatitude | Latitude in decimal format, converted from reported coordinates as required |
| decimalLongitude | Longitude in decimal format, converted from reported coordinates as required |
| individualCount | The number of detected individuals associated with a record |
| trapEffort | Trap effort (recorded as number of trap nights) associated with a record |
| trapEffortResolution | The resolution of trapping effort for the record |
Identification of a detected small mammal was recorded as reported within the article; we will not systematically assess the method of identification and therefore assume accurate identification as reported. Where identification is to genus or higher taxonomic level we report to this level (e.g., Rattus spp., Rodentia). Further details are provided below on data processing and species name harmonisation.
Since we expected studies to report sampling location to varying degrees of resolution, we extract locality as the highest resolution of location data that could be matched to administrative levels within a country (e.g., city, county). In cases where location is reported to a higher spatial resolution which may not map to administrative levels (e.g., ‘Site A’) we also include verbatimLocality. For studies not reporting geographic coordinates of sampling but including some information that describes the location (e.g., the name of the village sampled) we will locate coordinates by searching Google Maps, Wikipedia or the GEOnet Names Server provided by the National Geospatial-Intelligence Agency (USA).
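Where studies report coordinates in degrees, minutes, and seconds, conversion to the decimal format used in decimalLatitude and decimalLongitude (Table 3) is a simple calculation. The Python sketch below is an illustration under the usual convention that southern and western hemispheres are negative. The function name and example coordinates are hypothetical.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float, hemisphere: str) -> float:
    """Convert degrees-minutes-seconds to signed decimal degrees."""
    hemi = hemisphere.upper()
    if hemi not in {"N", "S", "E", "W"}:
        raise ValueError(f"Unknown hemisphere: {hemisphere!r}")
    value = abs(degrees) + minutes / 60 + seconds / 3600
    # South and west are negative by convention.
    return -value if hemi in {"S", "W"} else value


latitude = dms_to_decimal(8, 30, 0, "N")     # 8 degrees 30 minutes north
longitude = dms_to_decimal(11, 15, 0, "W")   # 11 degrees 15 minutes west
```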
Pathogen sampling. The pathogen sheet (Table 4) reports the assays used to detect Arenaviruses and Hantaviruses. Analogously to the rodent sheet, a record is produced at the highest resolution of pathogen sampling. One-to-many matching was permitted to record all data in cases where a single rodent was tested for a pathogen using multiple assays (e.g., serology and PCR).
Table 4. Pathogen information extraction sheet.
| Column name | Description |
|---|---|
| pathogen_record_id | A unique identifier for the group of samples from the same rodent species, at a specific location or timepoint, tested for the same pathogen using the same method |
| associated_rodent_record_id | Linking identifier from the rodent sheet |
| study_id | A unique identifier to link a study to the descriptive sheet entry for that study |
| associatedTaxa | The scientificName of the rodent from the associated_rodent_record_id |
| family | The family of the pathogen, either Arenaviridae or Hantaviridae |
| scientificName | The species name of the pathogen assayed for, if available. Where assays are specific to a higher taxa than species this is recorded here. |
| assay | Assay type, e.g., serology, PCR, or other |
| number_tested | The number of samples tested |
| number_negative | The number of negative samples |
| number_positive | The number of positive samples |
| number_inconclusive | The number of samples with inconclusive results |
| note | Notes relevant to this record |
The species or family targeted by the assay was recorded as reported within the relevant article; we did not make any assessment of the suitability, sensitivity or specificity of an assay. Where authors describe an assay as being family-wide (e.g., hantavirus antibody detection) we retain this level of specificity in the pathogen record. We recorded the number of samples tested using the assay for a specific pathogen. This may differ from the number of counted small-mammals as not all samples may have been suitable for testing, or the study authors may have decided to subset available samples. The numbers of positive and negative samples relate to this tested number. Where reported we also extracted the number of samples with an inconclusive test result. As above, we recorded positives and negatives as reported by study authors and make no assessment of these classifications.
Sequences. If studies include complete or partial nucleotide sequences of hosts or viruses archived in NCBI (USA), EMBL (Europe), DDBJ (Japan) or CNGBdb (China) they will be linked through the sequences sheet (Table 5). A record will be produced for each accession available. A sequence_record_id will be associated with each accession and, depending on whether the sequence relates to a pathogen or host, each record will be associated with one or both of these (i.e., a pathogen sequence will be linked to both a pathogen_record_id and rodent_record_id, while a host sequence will only be linked to a rodent_record_id). Many-to-one matching of sequences may occur for several reasons. First, Hantaviruses (three) and Arenaviruses (two) contain different numbers of genome segments, and not all segments may have been successfully sequenced for each acutely infected small mammal. Second, reporting of pathogen sampling may be aggregated with multiple sequences obtained from a single reported assay (e.g., pooled sampling). Pathogen sequences obtained from infected humans will be extracted but will only be linked at the study level if reported within the manuscript from the same geographic location or temporal period.
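To make the relationship between the count columns of the pathogen sheet (Table 4) concrete, the Python sketch below checks that a record's counts add up and computes the proportion positive. The consistency rule (tested = positive + negative + inconclusive) is an assumption made here for illustration. The protocol states only that positive and negative counts relate to the number tested.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PathogenRecord:
    pathogen_record_id: str
    assay: str
    number_tested: int
    number_positive: int
    number_negative: int
    number_inconclusive: int = 0

    def is_consistent(self) -> bool:
        """Assumed rule: the three outcome counts sum to the number tested."""
        outcomes = self.number_positive + self.number_negative + self.number_inconclusive
        return self.number_tested == outcomes

    def proportion_positive(self) -> Optional[float]:
        """Positives as a share of samples tested; None if nothing was tested."""
        if self.number_tested == 0:
            return None
        return self.number_positive / self.number_tested


# Hypothetical record: 40 samples tested by PCR.
record = PathogenRecord("p1", "PCR", number_tested=40, number_positive=4,
                        number_negative=35, number_inconclusive=1)
```

Because the denominator is the number tested rather than the number trapped, this proportion describes the tested subset only.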
Table 5. Pathogen sequences extraction sheet.
| Column name | Description |
|---|---|
| sequence_record_id | A unique identifier for the sequence record |
| sequenceType | One of Pathogen or Host |
| associated_pathogen_record_id | An associated pathogen record for this sequence |
| associated_rodent_record_id | An associated rodent record for this sequence |
| study_id | Study ID associated with this sequence |
| associatedTaxa | Species name of small mammal sampled/sequenced |
| scientificName | Species name of pathogen sequenced |
| accession_number | NCBI nucleotide accession |
| note | Notes relevant to this record |
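The linkage rules for the sequences sheet can be expressed as a small validation check, sketched here in Python. It covers only the rule stated for small-mammal records: pathogen sequences carry both identifiers and host sequences carry only a rodent identifier. Human-derived pathogen sequences, which are linked at the study level, are outside this sketch. The returned messages are illustrative.

```python
def validate_sequence_record(record: dict) -> list[str]:
    """Return a list of linkage problems for one sequences-sheet record."""
    problems = []
    seq_type = record.get("sequenceType")
    has_rodent = bool(record.get("associated_rodent_record_id"))
    has_pathogen = bool(record.get("associated_pathogen_record_id"))

    if seq_type not in {"Pathogen", "Host"}:
        problems.append("sequenceType must be 'Pathogen' or 'Host'")
    if not has_rodent:
        problems.append("missing associated_rodent_record_id")
    if seq_type == "Pathogen" and not has_pathogen:
        problems.append("pathogen sequence missing associated_pathogen_record_id")
    if seq_type == "Host" and has_pathogen:
        problems.append("host sequence should not link to a pathogen record")
    return problems


# Hypothetical records; the accession numbers are placeholders.
valid_pathogen = {
    "sequence_record_id": "s1", "sequenceType": "Pathogen",
    "associated_pathogen_record_id": "p1", "associated_rodent_record_id": "r1",
    "accession_number": "XX000001",
}
invalid_host = {
    "sequence_record_id": "s2", "sequenceType": "Host",
    "associated_pathogen_record_id": "p1", "associated_rodent_record_id": "r1",
    "accession_number": "XX000002",
}
```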
Data processing
This section describes data cleaning and harmonisation. All data processing will be conducted in R with scripts retained in a version controlled git repository 54, 55 . Raw data will be downloaded from Google Sheets using the googledrive API in R, with date stamped files stored locally 56 .
Species name harmonisation. To standardise reported species we will match reported names to both the GBIF backbone taxonomy and the NCBI taxonomy database 51, 57 . The taxize R package was used to query the APIs of these platforms 58 . Reported names that do not match accepted taxonomic entries are reviewed and resolved using a reproducible set of decision rules:
1. Subspecies and outdated synonyms are corrected to the currently accepted species name (e.g., Rattus rattus alexandrinus -> Rattus rattus).
2. Taxonomic revisions are incorporated where genus-level reassignments have occurred (e.g., Mus/Nannomys becomes Mus at the genus level).
3. If no unambiguous match is found at the species level (e.g., due to vague or malformed entries), the identification is:
   1. Promoted to the lowest reliable taxonomic rank (e.g., genus or subfamily);
   2. Flagged as ambiguous in a dedicated column.
4. To ensure transparency we retain unprocessed species names alongside the cleaned and matched taxonomy.
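The decision rules above can be sketched as a lookup-and-fallback function. The Python illustration below replaces the GBIF and NCBI queries with small hard-coded lookup tables, which are purely hypothetical stand-ins. It covers synonym correction, promotion to genus with an ambiguity flag, and retention of the verbatim name, but not genus-level reassignments.

```python
# Hypothetical stand-ins for the accepted names returned by a taxonomy service.
ACCEPTED_SPECIES = {"Rattus rattus", "Mastomys natalensis"}
SYNONYMS = {"Rattus rattus alexandrinus": "Rattus rattus"}
ACCEPTED_GENERA = {"Rattus", "Mastomys", "Mus"}


def harmonise(reported: str) -> dict:
    """Resolve a reported name, keeping the verbatim name and an ambiguity flag."""
    name = " ".join(reported.split())  # collapse stray whitespace
    if name in ACCEPTED_SPECIES:
        resolved, rank, ambiguous = name, "species", False
    elif name in SYNONYMS:
        resolved, rank, ambiguous = SYNONYMS[name], "species", False
    else:
        genus = name.split(" ")[0].capitalize() if name else ""
        if genus in ACCEPTED_GENERA:
            # No species-level match: promote to genus and flag.
            resolved, rank, ambiguous = genus, "genus", True
        else:
            resolved, rank, ambiguous = None, None, True
    return {
        "verbatim_name": reported,
        "resolved_name": resolved,
        "rank": rank,
        "ambiguous": ambiguous,
    }


synonym_result = harmonise("Rattus rattus alexandrinus")
genus_result = harmonise("Mastomys sp.")
unmatched_result = harmonise("unknown rodent")
```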
To harmonise pathogen names, we match to the ICTV binomial taxonomy. If harmonisation is not possible (e.g., due to generic naming such as "Hantavirus sp." or references to obsolete clades), we match to the lowest reliable taxonomic rank (e.g., genus or family), with an accompanying flag.
Location of sampling. Locations were associated with administrative divisions using the Database of Global Administrative Areas (GADM) accessed through the geodata R package 59 .
For coordinates labelled as a site or study site, the coordinate resolution will be set to 100 metres. For locations associated with an administrative area but no higher resolution data, we record the coordinate resolution as the radius of the administrative region to represent uncertainty in these coordinates.
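One way to derive a radius for an irregularly shaped administrative region is to take the radius of a circle with the same area, r = sqrt(A / pi). The Python sketch below uses this equal-area approximation, which is an assumption made here. The protocol does not specify how the radius is calculated, and the labels and area value are hypothetical.

```python
import math

SITE_RESOLUTION_M = 100.0  # fixed resolution for site-level coordinates


def coordinate_resolution_m(label: str, admin_area_km2: float = None) -> float:
    """Return coordinate uncertainty in metres for a location record."""
    if label in {"site", "study site"}:
        return SITE_RESOLUTION_M
    if admin_area_km2 is None:
        raise ValueError("An administrative area is required for non-site records")
    # Radius of a circle with the same area as the region, converted to metres.
    return math.sqrt(admin_area_km2 / math.pi) * 1000


site_resolution = coordinate_resolution_m("site")
district_resolution = coordinate_resolution_m("district", admin_area_km2=314.159)
```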
Imputed non-detections. Non-detection of a small mammal species in a location it may be expected to be (i.e., based on IUCN range maps or prior research) is of interest to researchers. We will enrich the detection-only data of included studies by imputing non-detections. This imputation will be restricted to species that have been detected at other sites or sessions in the study. Imputed non-detections will be labelled as imputed in the final data product.
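The imputation can be sketched as completing the grid of species by sampling event within a study. In the Python illustration below, a sampling event is assumed to be a unique combination of locality and eventDate. Any species detected somewhere in the study but absent from an event receives a zero-count record labelled as imputed. The `imputed` column name and the example data are hypothetical.

```python
def impute_non_detections(records: list[dict]) -> list[dict]:
    """Add zero-count records for species not reported at a sampling event."""
    species = sorted({r["scientificName"] for r in records})
    events = sorted({(r["locality"], r["eventDate"]) for r in records})
    observed = {(r["scientificName"], r["locality"], r["eventDate"]) for r in records}

    completed = [dict(r, imputed=False) for r in records]
    for locality, event_date in events:
        for sp in species:
            if (sp, locality, event_date) not in observed:
                completed.append({
                    "scientificName": sp,
                    "locality": locality,
                    "eventDate": event_date,
                    "individualCount": 0,
                    "imputed": True,  # distinguishes inferred from reported data
                })
    return completed


# Hypothetical study: Rattus rattus was reported at Village 1 only.
detections = [
    {"scientificName": "Mastomys natalensis", "locality": "Village 1", "eventDate": "2019-01", "individualCount": 12},
    {"scientificName": "Mastomys natalensis", "locality": "Village 2", "eventDate": "2019-01", "individualCount": 5},
    {"scientificName": "Rattus rattus", "locality": "Village 1", "eventDate": "2019-01", "individualCount": 3},
]
completed_records = impute_non_detections(detections)
imputed_rows = [r for r in completed_records if r["imputed"]]
```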
Conflicting pathogen detection results. We preserve potentially conflicting assay results (e.g., seropositive but PCR-negative individuals) without reconciliation or exclusion, by maintaining a one-to-many relationship between host records and pathogen assay records.
Each host record, whether representing a single animal or a pooled group of conspecifics trapped at the same site and time, can be linked to multiple pathogen assay records. For example, if 10 Apodemus agrarius were trapped at a site during a single session (i.e., a single rodent_record_id) and tested for hantavirus using both ELISA and PCR, the host dataset would contain one record, while the pathogen dataset would include two linked records (i.e., two pathogen_record_id values), each specifying test type, number tested, and number positive.
We retain all assay results irrespective of concordance, as different diagnostics capture distinct biological signals (e.g., prior exposure vs. current infection) and vary in sensitivity. No prioritisation is applied. Each assay record is annotated with test type, diagnostic target (e.g., antigen, antibody, RNA), and outcome, enabling users to apply their own criteria when analysing discordant results.
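The one-to-many structure can be illustrated with the Apodemus agrarius example, using a short Python sketch. Only rodent_record_id and pathogen_record_id are identifiers named in the protocol; the remaining field names, the record identifiers, and the numbers of positives are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostRecord:
    rodent_record_id: str
    species: str
    number_trapped: int


@dataclass(frozen=True)
class PathogenRecord:
    pathogen_record_id: str
    rodent_record_id: str  # foreign key back to the host record
    assay: str
    target: str
    number_tested: int
    number_positive: int


host = HostRecord("R001", "Apodemus agrarius", 10)
assays = [
    PathogenRecord("P001", "R001", "ELISA", "antibody", 10, 3),
    PathogenRecord("P002", "R001", "PCR", "RNA", 10, 1),
]


def assays_for(host_record, pathogen_records):
    """Return every assay record linked to one host record."""
    return [p for p in pathogen_records
            if p.rodent_record_id == host_record.rodent_record_id]


def prevalence_by_assay(host_record, pathogen_records):
    """Report each assay separately, without reconciling discordant results."""
    return {p.assay: p.number_positive / p.number_tested
            for p in assays_for(host_record, pathogen_records)}


linked = assays_for(host, assays)
prevalence = prevalence_by_assay(host, assays)
```

Keeping the two assay records separate leaves the choice of how to treat seropositive but PCR-negative results to the user of the data.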
Discussion
This novel dataset will provide spatially and temporally explicit information on small mammal sampling for Arenaviruses and Hantaviruses, offering a more comprehensive view of host-pathogen associations than is currently available. By including sampling effort and explicit spatial data, this dataset enhances the ability to quantify biases in where and how frequently sampling occurs, assess geographic gaps, and better understand the spatial ecology of these globally distributed pathogens of public health importance. The insights gained from this dataset could improve our understanding of how environmental changes, such as habitat fragmentation and urbanisation, influence pathogen dynamics and spillover risks.
Importantly, these data could inform the development of ecology-driven predictive models, helping to identify areas at heightened risk of zoonotic spillover 60, 61. The ability to link ecological data with human health outcomes could inform targeted public health interventions, enhancing outbreak preparedness and response. For instance, identifying hotspots of high pathogen prevalence in rodents could guide targeted development of ecologically-based rodent control measures in regions identified as vulnerable.
Existing host-pathogen datasets lack detailed spatial or temporal information, limiting their use in ecological and epidemiological modelling 62. By addressing these gaps, our dataset provides information about host-pathogen interactions that span multiple spatial and temporal scales. Moreover, the explicit reporting of sampling effort allows for more robust analyses of detection probability, which is crucial for understanding the true distribution of pathogens 63, 64.
While currently available global host-pathogen datasets do not account for sampling biases, the dataset proposed here will explicitly report the extent of sampling for pathogens within their host species, providing critical insights into spatial sampling biases and detection efforts. For instance, while Mastomys natalensis is expected to be sampled throughout its range, specific pathogens such as LASV and Morogoro virus have so far been detected only in its West African and East African radiations, respectively. By detailing both presence and absence data, the dataset allows for a more nuanced understanding of pathogen distribution within host species, beyond mere occurrence records. Quantification of detection effort for specific pathogens is vital for assessing spatial sampling biases, identifying under-sampled regions, and refining ecological and epidemiological models 65.
As part of our future analytical work, we aim to develop a spatiotemporal raster of rodent sampling effort, based on georeferenced metadata from the included studies. This raster will allow us to quantify both spatial and temporal surveillance intensity across the ranges of known and suspected reservoir species. It will help identify under-sampled but environmentally suitable regions, inform targeted field surveys, and support downstream predictive modelling. Unlike bibliometric proxies that only reflect publication counts, this approach incorporates species- and pathogen-specific detail at a finer spatial resolution.
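One simple way to picture such a raster is to bin georeferenced sampling records into regular latitude-longitude cells per year and sum the number of animals sampled. The Python sketch below does this with the standard library only; the one-degree cell size, the record keys (lon, lat, year, n_sampled), and the example coordinates are assumptions made for illustration and do not describe the planned implementation.

```python
import math
from collections import defaultdict


def effort_raster(records, cell_size_deg=1.0):
    """Sum sampled animals per (year, row, col) cell of a regular lat-lon grid."""
    grid = defaultdict(int)
    for r in records:
        # Shift coordinates so that cell indices start at zero in the south-west corner.
        col = math.floor((r["lon"] + 180.0) / cell_size_deg)
        row = math.floor((r["lat"] + 90.0) / cell_size_deg)
        grid[(r["year"], row, col)] += r["n_sampled"]
    return dict(grid)


sampling = [
    {"lon": -11.2, "lat": 8.1, "year": 2019, "n_sampled": 40},
    {"lon": -11.7, "lat": 8.9, "year": 2019, "n_sampled": 25},
    {"lon": 127.0, "lat": 37.5, "year": 2019, "n_sampled": 10},
]
raster = effort_raster(sampling)
```

The first two records fall in the same one-degree cell and are summed, while cells absent from the result are those with no recorded sampling effort in that year.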
Despite its strengths, this dataset has several limitations. First, the inherent variability in the quality and detail of reported data from included studies means that some records may lack critical information, such as exact coordinates or specific sampling dates. Second, the imputation of non-detections introduces assumptions that could affect the interpretation of absence data. Additionally, the dataset is inherently static and will not reflect data that emerge after its production.
Furthermore, the reliance on published literature introduces publication bias, as studies with significant findings are more likely to be published than those with negative results 66. This could skew the dataset towards areas and hosts with known or previously detected pathogens, potentially under-representing regions where sampling has occurred but without positive detections. We will not routinely contact study authors to obtain data beyond those reported within publications. Therefore, data that are not reported within the article or its supplementary appendices are not included in the dataset. Obtaining such unreported data is the subject of an ongoing data request study. To address the limitations associated with extracting data without access to the raw data, we encourage scientists with ongoing field studies and pathogen surveillance programmes to submit their data to dynamic repositories (e.g., Pharos and GBIF 51, 67). Beyond research applications, the dataset could serve as a tool for policymakers, assisting in the identification of priority areas for viral discovery and public health interventions.
Conclusion
Overall, this dataset will be a valuable resource for understanding the ecology of Arenaviruses and Hantaviruses in their natural reservoirs. By bridging the gap between local-scale ecological studies and broader public health needs, it has the potential to enhance our ability to predict and mitigate the risks posed by these emerging pathogens. Continued efforts to update and expand this resource will be crucial for maintaining its utility in a rapidly changing epidemiological landscape.
Ethics and consent
Ethical approval and consent were not required.
Funding Statement
This work was supported by Wellcome [220179]; NSF-NIH-NIFA and BBSRC Ecology and Evolution of Infectious Disease Award [2208034]; and NSF Biology Integration Institute [2213854].
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 2; peer review: 2 approved, 2 approved with reservations]
Data availability
No data are associated with this article. All data products of this project will be made available on GitHub (https://github.com/DidDrog11/arenavirus_hantavirus) and a finalized product will be archived on Zenodo.
OSF: PRISMA-P checklist for Protocol to produce a systematic Arenavirus and Hantavirus host-pathogen database: Project ArHa. https://doi.org/10.17605/OSF.IO/23KTF
Authors’ contributions
D.S. - conceptualisation, methodology, data curation, investigation, software, supervision, writing - original draft, writing - review and editing
R.R. - methodology, data curation, investigation, writing - review and editing
H.L.M.G. - methodology, data curation, investigation, writing - review and editing
A. M-C G. - methodology, data curation, investigation, writing - review and editing
G. C. M. - methodology, supervision, writing - review and editing
G. R. - methodology, data curation, investigation, writing - review and editing
D. W. R. - funding acquisition, resources, supervision, writing - review and editing
S. N. S. - conceptualisation, funding acquisition, methodology, resources, supervision, writing - review and editing
References
- 1. Childs JE, Peters CJ: Ecology and epidemiology of arenaviruses and their hosts. In: The Arenaviridae. Springer;1993;331–84. 10.1007/978-1-4615-3028-2_19
- 2. Jonsson CB, Figueiredo LT, Vapalahti O: A global perspective on hantavirus ecology, epidemiology, and disease. Clin Microbiol Rev. 2010;23(2):412–41. 10.1128/CMR.00062-09
- 3. Wells K, Clark NJ: Host specificity in variable environments. Trends Parasitol. 2019;35(6):452–65. 10.1016/j.pt.2019.04.001
- 4. Smith DRM, Turner J, Fahr P, et al.: Health and economic impacts of Lassa vaccination campaigns in West Africa. Nat Med. 2024;30(12):3568–3577. 10.1038/s41591-024-03232-y
- 5. Vapalahti O, Mustonen J, Lundkvist A, et al.: Hantavirus infections in Europe. Lancet Infect Dis. 2003;3(10):653–61. 10.1016/s1473-3099(03)00774-6
- 6. Gibb R, Moses LM, Redding DW, et al.: Understanding the cryptic nature of Lassa fever in West Africa. Pathog Glob Health. 2017;111(6):276–88. 10.1080/20477724.2017.1369643
- 7. Gallo GL, López N, Loureiro ME: The virus-host interplay in Junín mammarenavirus infection. Viruses. 2022;14(6):1134. 10.3390/v14061134
- 8. Nigeria Centre for Disease Control: An update of Lassa fever outbreak in Nigeria. 2024.
- 9. Basinski AJ, Fichet-Calvet E, Sjodin AR, et al.: Bridging the gap: using reservoir ecology and human serosurveys to estimate Lassa virus spillover in West Africa. PLoS Comput Biol. 2021;17(3):e1008811. 10.1371/journal.pcbi.1008811
- 10. Vilibic-Cavlek T, Savic V, Ferenc T, et al.: Lymphocytic choriomeningitis—emerging trends of a neglected virus: a narrative review. Trop Med Infect Dis. 2021;6(2):88. 10.3390/tropicalmed6020088
- 11. Makary P, Kanerva M, Ollgren J, et al.: Disease burden of Puumala virus infections, 1995–2008. Epidemiol Infect. 2010;138(10):1484–92. 10.1017/S0950268810000087
- 12. Park Y: Epidemiologic study on changes in occurrence of Hemorrhagic Fever with Renal Syndrome in Republic of Korea for 17 years according to age group: 2001–2017. BMC Infect Dis. 2019;19(1):153. 10.1186/s12879-019-3794-9
- 13. Goodfellow SM, Nofchissey RA, Schwalm KC, et al.: Tracing transmission of Sin Nombre virus and discovery of infection in multiple rodent species. J Virol. 2021;95(23):e0153421. 10.1128/JVI.01534-21
- 14. Vial PA, Ferrés M, Vial C, et al.: Hantavirus in humans: a review of clinical aspects and management. Lancet Infect Dis. 2023;23(9):e371–e382. 10.1016/S1473-3099(23)00128-7
- 15. Kang HJ, Stanley WT, Esselstyn JA, et al.: Expanded host diversity and geographic distribution of hantaviruses in sub-Saharan Africa. J Virol. 2014;88(13):7663–7. 10.1128/JVI.00285-14
- 16. Yanagihara R, Gu SH, Arai S, et al.: Hantaviruses: rediscovery and new beginnings. Virus Res. 2014;187:6–14. 10.1016/j.virusres.2013.12.038
- 17. McMahon BJ, Morand S, Gray JS: Ecosystem change and zoonoses in the Anthropocene. Zoonoses Public Health. 2018;65(7):755–65. 10.1111/zph.12489
- 18. Pei S, Yu P, Raghwani J, et al.: Anthropogenic land consolidation intensifies zoonotic host diversity loss and disease transmission in human habitats. Nat Ecol Evol. 2025;9(1):99–110. 10.1038/s41559-024-02570-x
- 19. Tian H, Hu S, Cazelles B, et al.: Urbanization prolongs hantavirus epidemics in cities. Proc Natl Acad Sci U S A. 2018;115(18):4707–12. 10.1073/pnas.1712767115
- 20. Plowright RK, Parrish CR, McCallum H, et al.: Pathways to zoonotic spillover. Nat Rev Microbiol. 2017;15(8):502–10. 10.1038/nrmicro.2017.45
- 21. Dearing MD, Dizney L: Ecology of hantavirus in a changing world. Ann N Y Acad Sci. 2010;1195(1):99–112. 10.1111/j.1749-6632.2010.05452.x
- 22. Keesing F, Belden LK, Daszak P, et al.: Impacts of biodiversity on the emergence and transmission of infectious diseases. Nature. 2010;468(7324):647–52. 10.1038/nature09575
- 23. Ecke F, Han BA, Hörnfeldt B, et al.: Population fluctuations and synanthropy explain transmission risk in rodent-borne zoonoses. Nat Commun. 2022;13(1):7532. 10.1038/s41467-022-35273-7
- 24. Gibb R, Redding DW, Friant S, et al.: Towards a ‘people and nature’ paradigm for biodiversity and infectious disease. Philos Trans R Soc Lond B Biol Sci. 2025;380(1917):20230259. 10.1098/rstb.2023.0259
- 25. World Health Organization: Blueprint for R&D preparedness and response to public health emergencies due to highly infectious pathogens. World Health Organization, 2015.
- 26. Van Hook CJ: Hantavirus pulmonary syndrome—the 25th anniversary of the Four Corners outbreak. Emerg Infect Dis. 2018;24(11):2056–2060. 10.3201/eid2411.180381
- 27. Bach M: Social implications of the Yosemite hantavirus outbreak. Pathw Stanford J Public Health. 2017;6:23–6.
- 28. Ficenec SC, Percak J, Arguello S, et al.: Lassa Fever induced hearing loss: the neglected disability of hemorrhagic fever. Int J Infect Dis. 2020;100:82–7. 10.1016/j.ijid.2020.08.021
- 29. McCormick JB, Webb PA, Krebs JW, et al.: A prospective study of the epidemiology and ecology of Lassa Fever. J Infect Dis. 1987;155(3):437–44. 10.1093/infdis/155.3.437
- 30. Reusken C, Heyman P: Factors driving hantavirus emergence in Europe. Curr Opin Virol. 2013;3(1):92–9. 10.1016/j.coviro.2013.01.002
- 31. Simons D, Attfield LA, Jones KE, et al.: Rodent trapping studies as an overlooked information source for understanding endemic and novel zoonotic spillover. PLoS Negl Trop Dis. 2023;17(1):e0010772. 10.1371/journal.pntd.0010772
- 32. Clement J, LeDuc JW, Lloyd G, et al.: Wild rats, laboratory rats, pet rats: global Seoul hantavirus disease revisited. Viruses. 2019;11(7):652. 10.3390/v11070652
- 33. De Bellocq JG, Bryjová A, Martynov AA, et al.: Dhati welel virus, the missing mammarenavirus of the widespread Mastomys natalensis. J Vertebr Biol. 2020;69(2):20018.1-11. 10.25225/jvb.20018
- 34. Voutilainen L, Kallio ER, Niemimaa J, et al.: Temporal dynamics of Puumala Hantavirus infection in cyclic populations of bank voles. Sci Rep. 2016;6(1):21323. 10.1038/srep21323
- 35. Kozakiewicz CP, Burridge CP, Funk WC, et al.: Pathogens in space: advancing understanding of pathogen dynamics and disease ecology through landscape genetics. Evol Appl. 2018;11(10):1763–78. 10.1111/eva.12678
- 36. Näpflin K, O’Connor EA, Becks L, et al.: Genomics of host-pathogen interactions: challenges and opportunities across ecological and spatiotemporal scales. PeerJ. 2019;7:e8013. 10.7717/peerj.8013
- 37. Childs JE, Gordon ER: Surveillance and control of zoonotic agents prior to disease detection in humans. Mt Sinai J Med. 2009;76(5):421–8. 10.1002/msj.20133
- 38. Vora NM, Hannah L, Walzer C, et al.: Interventions to reduce risk for pathogen spillover and early disease spread to prevent outbreaks, epidemics, and pandemics. Emerg Infect Dis. 2023;29(3):1–9. 10.3201/eid2903.221079
- 39. Keesing F, Ostfeld RS: Emerging patterns in rodent-borne zoonotic diseases. Science. 2024;385(6715):1305–10. 10.1126/science.adq7993
- 40. García-Peña GE, Rubio AV, Mendoza H, et al.: Land-use change and rodent-borne diseases: hazards on the shared socioeconomic pathways. Philos Trans R Soc Lond B Biol Sci. 2021;376(1837):20200362. 10.1098/rstb.2020.0362
- 41. Zou Y, Hu J, Wang ZX, et al.: Genetic characterization of hantaviruses isolated from Guizhou, China: evidence for spillover and reassortment in nature. J Med Virol. 2008;80(6):1033–41. 10.1002/jmv.21149
- 42. Zou Y, Hu J, Wang ZX, et al.: Molecular diversity and phylogeny of Hantaan Virus in Guizhou, China: evidence for Guizhou as a radiation center of the present Hantaan Virus. J Gen Virol. 2008;89(Pt 8):1987–97. 10.1099/vir.0.2008/000497-0
- 43. Dellicour S, Bastide P, Rocu P, et al.: How fast are viruses spreading in the wild? Dushoff J, editor. PLoS Biol. 2024;22(12):e3002914. 10.1371/journal.pbio.3002914
- 44. Arruda LB, Free HB, Simons D, et al.: Current sampling and sequencing biases of Lassa mammarenavirus limit inference from phylogeography and molecular epidemiology in Lassa Fever endemic regions. PLOS Glob Public Health. 2023;3(11):e0002159. 10.1371/journal.pgph.0002159
- 45. Lachish S, Murray KA: The certainty of uncertainty: potential sources of bias and imprecision in disease ecology studies. Front Vet Sci. 2018;5:90. 10.3389/fvets.2018.00090
- 46. Wille M, Geoghegan JL, Holmes EC: How accurately can we assess zoonotic risk? PLoS Biol. 2021;19(4):e3001135. 10.1371/journal.pbio.3001135
- 47. Gibb R, Albery GF, Mollentze N, et al.: Mammal virus diversity estimates are unstable due to accelerating discovery effort. Biol Lett. 2022;18(1):20210427. 10.1098/rsbl.2021.0427
- 48. Woolhouse M, Scott F, Hudson Z, et al.: Human viruses: discovery and emergence. Philos Trans R Soc Lond B Biol Sci. 2012;367(1604):2864–71. 10.1098/rstb.2011.0354
- 49. Carroll D, Daszak P, Wolfe ND, et al.: The global virome project. Science. 2018;359(6378):872–874. 10.1126/science.aap7463
- 50. Wilkinson MD, Dumontier M, Aalbersberg IJ, et al.: The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3(1):160018. 10.1038/sdata.2016.18
- 51. GBIF Secretariat: Global biodiversity information facility. 2024.
- 52. Ouzzani M, Hammady H, Fedorowicz Z, et al.: Rayyan—a web and mobile app for systematic reviews. Syst Rev. 2016;5(1):210. 10.1186/s13643-016-0384-4
- 53. Simons D: Arenaviruses and Hantaviruses of rodents. 2024.
- 54. R Core Team: R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2023.
- 55. Simons D, Seifert SN: ArHa: Arenavirus and hantavirus sampling in small mammals. 2024.
- 56. D’Agostino McGowan L, Bryan J: Googledrive: an interface to Google Drive. 2023. 10.32614/CRAN.package.googledrive
- 57. Schoch CL, Ciufo S, Domrachev M, et al.: NCBI taxonomy: a comprehensive update on curation, resources and tools. Database (Oxford). 2020;2020:baaa062. 10.1093/database/baaa062
- 58. Chamberlain SA, Szöcs E: Taxize: taxonomic search and retrieval in R [version 2; peer review: 3 approved]. F1000Res. 2013;2:191. 10.12688/f1000research.2-191.v2
- 59. Hijmans RJ, Barbosa M, Ghosh A, et al.: Geodata: download geographic data. 2023. 10.32614/CRAN.package.geodata
- 60. Kreuder Johnson C, Hitchens PL, Smiley Evans T, et al.: Spillover and pandemic properties of zoonotic viruses with high host plasticity. Sci Rep. 2015;5(1):14830. 10.1038/srep14830
- 61. Olival KJ, Hosseini PR, Zambrana-Torrelio C, et al.: Host and viral traits predict zoonotic spillover from mammals. Nature. 2017;546(7660):646–650. 10.1038/nature22975
- 62. Gibb R, Albery GF, Becker DJ, et al.: Data proliferation, reconciliation, and synthesis in viral ecology. Bioscience. 2021;71(11):1148–1156. 10.1093/biosci/biab080
- 63. Baele G, Suchard MA, Rambaut A, et al.: Emerging concepts of data integration in pathogen phylodynamics. Syst Biol. 2017;66(1):e47–65. 10.1093/sysbio/syw054
- 64. Didelot X, Fraser C, Gardy J, et al.: Genomic infectious disease epidemiology in partially sampled and ongoing outbreaks. Mol Biol Evol. 2017;34(4):997–1007. 10.1093/molbev/msw275
- 65. Lecointre G, Philippe H, Vân Lê HL, et al.: Species sampling has a major impact on phylogenetic inference. Mol Phylogenet Evol. 1993;2(3):205–24. 10.1006/mpev.1993.1021
- 66. Dickersin K, Min YI: NIH clinical trials and publication bias. Online J Curr Clin Trials. 1993; 4967-words.
- 67. The pathogen harmonized observatory (PHAROS): The viral emergence research initiative. 2024.

