Wellcome Open Research. 2025 Aug 6;10:227. Originally published 2025 Apr 28. [Version 2] doi: 10.12688/wellcomeopenres.24037.2

Protocol to produce a systematic Arenavirus and Hantavirus host-pathogen database: Project ArHa.

David Simons 1,a, Ricardo Rivero 2, Ana Martinez-Checa Guiote 3, Harry Luke Mackenzie Gordon 3, Gregory C Milne 3, Grant Rickard 2, David W Redding 3, Stephanie N Seifert 2
PMCID: PMC12415512  PMID: 40927039

Version Changes

Revised. Amendments from Version 1

To address concerns about geographic and publication biases, the literature search strategy was expanded beyond global databases like PubMed and Web of Science. The protocol now includes plans to search regional repositories such as African Journals Online (AJOL) and SciELO in a second round of data extraction, aiming to capture older or regionally-published studies. The protocol also clarifies that non-English language papers are not excluded and will be handled using a combination of automated translation and expert native speakers. Significant updates were made to the data processing section to enhance transparency and rigor. For species name harmonization, a clear set of decision rules was introduced to resolve ambiguous or outdated taxonomic names, with a commitment to retain original names and flag uncertainties. A new section was added to explain how the database will handle conflicting assay results (e.g., serology vs. PCR) without making assumptions. By preserving all test results, the dataset will enable users to analyze different biological signals and account for varying assay sensitivities. Finally, the protocol now more explicitly states the project's strategy for addressing sampling bias. It details plans to create a spatiotemporal raster of rodent sampling effort from the database's metadata. This tool will allow researchers to identify surveillance gaps and support more robust, bias-aware ecological modeling. These changes collectively make the protocol more comprehensive, transparent, and aligned with best practices in data synthesis for infectious disease ecology.

Abstract

Arenaviruses and Hantaviruses, primarily hosted by rodents and shrews, represent significant public health threats due to their potential for zoonotic spillover into human populations. Despite their global distribution, the full impact of these viruses on human health remains poorly understood, particularly in regions like Africa, where data are sparse. Both virus families continue to emerge, with pathogen evolution and spillover driven by anthropogenic factors such as land use change, climate change, and biodiversity loss. Recent research highlights the complex interactions between ecological dynamics, host species, and environmental factors in shaping the risk of pathogen transmission and spillover. This underscores the need for integrated ecological and genomic approaches to better understand these zoonotic diseases. A comprehensive, spatially and temporally explicit dataset, incorporating host-pathogen dynamics and human disease data, is crucial for improving risk assessments, enhancing disease surveillance, and guiding public health interventions. Such a dataset (ArHa) would also support predictive modelling efforts aimed at mitigating future spillover events. This paper proposes the development of this unified database for small-mammal hosts of Arenaviruses and Hantaviruses, identifying gaps in current research and promoting a more comprehensive understanding of pathogen prevalence, spillover risk, and viral evolution.

Keywords: database production protocol, arenaviruses, hantaviruses, rodent-borne zoonoses, pathogen surveillance

Plain Language summary

Arenaviruses and Hantaviruses are globally distributed pathogens that can spread from animals to humans (zoonoses). Here, we describe the production of a dataset (ArHa) to synthesise spatial and temporal information from published research, providing a unique resource for understanding geographic and temporal trends in Arenavirus and Hantavirus host-pathogen relationships. By explicitly quantifying sampling biases and detection efforts, the ArHa dataset will allow more robust and accurate assessments of pathogen prevalence and distribution. The spatial scale of the dataset offers a platform for linking ecological data with human health outcomes, which will support the identification of spillover hotspots. The ArHa dataset relies on currently published material, which may vary in detail, accuracy, and completeness. Missing or imprecise information may affect the reliability of subsequent analyses. The dataset as produced will be a static resource, which could limit its relevance over time because emerging data will not be added. Encouragingly, researchers in the field of zoonotic infections increasingly make primary data available, which will mitigate this limitation.

Introduction

Arenaviruses and Hantaviruses are globally distributed pathogens which primarily infect rodents (order Rodentia) and shrews (order Eulipotyphla) 1, 2 . Some of these Arenaviruses and Hantaviruses occasionally spill over (cross-species transmission of a parasite into a host population not previously infected) into human populations, with variable morbidity and mortality rates 3 . However, the overall human health impact remains poorly understood in many cases 4, 5 . Arenaviruses include Mammarenavirus lassaense (LASV), responsible for Lassa fever in West Africa, and Mammarenavirus juninense (JUNV), which causes Argentine haemorrhagic fever in Argentina 6, 7 . Lassa fever is estimated to infect 900,000 individuals annually across West Africa, with over 200 deaths reported from Nigeria in 2024 8, 9 . In contrast to the limited distribution of LASV, Mammarenavirus choriomeningitidis (causing Lymphocytic choriomeningitis) has a global distribution with few infections reported 10 .

Hantaviruses, including Orthohantavirus puumalaense (PUUV) and Orthohantavirus seoulense (SEOV), cause haemorrhagic fever with renal syndrome (HFRS) in Europe and Asia, while Orthohantavirus sinnombreense (SNV) causes hantavirus pulmonary syndrome (HPS) in the Americas 11–13 . HFRS is associated with 23,000 reported annual infections, but fewer than 100 deaths, whereas HPS, which has a higher case fatality rate (12-45%), accounts for fewer than 300 reported cases annually 14 . The distribution and human health impact of hantaviruses in Africa is even more poorly understood 15 .

Recent research indicates that both Arenaviruses and Hantaviruses continue to emerge, with pathogen evolution and spillover driven by anthropogenic factors such as land use change, climate change and small-mammal biodiversity loss 16, 17 . In addition to direct biodiversity changes, the structure and dynamics of host populations and their community interactions have also been shown to influence pathogen prevalence 18, 19 . Human activities such as deforestation, urbanization, and agricultural expansion bring humans into closer contact with rodent reservoirs (e.g., contact of increased frequency and intensity), increasing the likelihood of spillover events 20, 21 . Changes in climate, land use, rodent ecology, and human behaviour all affect human-reservoir contact, contributing to the complex human-animal-environment nexus that drives pathogen spillover and persistence 22–24 . Although human-to-human transmission is a concern (e.g., Disease X), most infections result from rodent-to-human transmission. LASV is nevertheless classified as a priority pathogen by the WHO due to its potential for human-to-human spread 4, 25 .

Understanding the economic and societal impacts of these diseases is also critical. In addition to direct healthcare costs, there are significant indirect costs (e.g., long-term productivity losses), which exacerbate economic strain on vulnerable populations 4 . Outbreaks of emerging rodent-borne zoonoses, like HPS during the Four Corners Outbreak (1993) or Yosemite outbreak (2012), demonstrate the broader social and economic disruptions that these diseases can cause 26, 27 . Long-term health consequences, such as sensorineural hearing defects in survivors of Lassa fever, further complicate the disease burden 28 . Although only ~20% of LASV infections are symptomatic, the full scale of both acute and long-term disease remains poorly understood 29 . Under-reporting of these diseases, likely due to difficulties in diagnosing acute infections, means that many cases go undetected, particularly in endemic regions 14 .

The traditional paradigm in which a specific pathogen is hosted by a single reservoir species is being overturned by increased pathogen surveillance in endemic settings 30 . These data have identified multiple potential hosts for pathogens including LASV, SEOV and SNV 13, 31, 32 . Similarly, a single host species may host multiple Arenaviruses and Hantaviruses throughout its geographic distribution; for example, Mastomys natalensis, the primary host of LASV, is known to host at least 7 distinct Arenaviruses 33 .

The ecology of rodent hosts, including population dynamics, behaviour, habitat preferences, and community interactions, directly influences pathogen transmission and spillover risk 34 . Linking ecological and genomic data enables the identification of genetic markers associated with increased virulence or transmissibility, providing critical insights into the evolutionary trajectories of Arenaviruses and Hantaviruses 35, 36 . Small-mammal sampling is essential for uncovering how these ecological factors moderate the prevalence, persistence, and spread of zoonotic pathogens within their communities 37, 38 . For example, fluctuations in host population density or shifts in community composition, often driven by habitat changes or seasonal cycles, can amplify transmission within reservoirs or increase human exposure risk 39, 40 . Understanding interactions between host species within shared environments, including competition or co-occurrence, further elucidates how pathogens circulate, expand, and evolve across interconnected populations 41–43 . These insights underscore the value of systematic small-mammal sampling to inform surveillance, predict areas of heightened spillover risk, and develop targeted strategies to mitigate zoonotic disease emergence.

A dataset that includes detailed spatial and temporal data on rodent hosts and their pathogens would enable the linkage of local ecological findings with public health patterns at broader scales. While human disease data are often aggregated at national or subnational levels to preserve patient anonymity, clinical practitioners and public health authorities will have access to higher-resolution case data. For these stakeholders, integrating ecological data with localised human disease patterns could enhance the identification of hotspots of transmission risk and inform more targeted interventions. Data fragmentation currently limits efforts to connect local-scale, high-intensity research on host-pathogen dynamics to broader-scale trends, which impairs predictive modelling and disease management efforts. A unified dataset would support these efforts, enabling better understanding of ecological drivers of zoonotic disease risk, viral evolution, and pathogen emergence across landscapes and over time, improving both risk assessments and public health interventions 44 .

Historically, research on rodent-borne zoonoses has been uneven, with undersampling in high-risk regions like Africa, South Asia, and Southeast Asia 31, 45, 46 . Identifying these undersampled regions is crucial for guiding future research and prioritising viral discovery efforts 47 . Addressing these gaps will improve our understanding of pathogen diversity, distribution, and evolution, and help predict and mitigate future spillover events 48, 49 . A comprehensive global sampling effort will highlight knowledge gaps and inform future research priorities.

These uneven patterns of sampling are often shaped by factors such as research funding availability, geopolitical accessibility, diagnostic infrastructure, and regional disease prioritisation. While the causes of sampling bias cannot typically be inferred directly from the published data, our dataset is designed to capture and quantify the spatial and temporal distribution of sampling effort. By systematically documenting both presence and absence data for pathogens within rodent hosts, alongside metadata such as sampling dates, locations, and diagnostic methods, the database will allow researchers to visualise gaps in surveillance and identify systematic biases in sampling coverage.

An early planned analysis will involve constructing a spatiotemporal raster of small mammal sampling effort, enabling bias-aware ecological modelling and informing future pathogen discovery and surveillance activities. This approach will not only highlight where surveillance has occurred but also where it is lacking, facilitating more equitable and evidence-based targeting of future research.
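As a rough illustration of what such a raster could look like, the sketch below bins trapping records into latitude/longitude grid cells by year and sums trap effort per cell. It is written in Python for brevity (the protocol's own processing is planned in R). The one-degree cell size, the `year` field and the function name `effort_raster` are assumptions made for this example, not part of the protocol.

```python
import math
from collections import defaultdict

def effort_raster(records, cell_size_deg=1.0):
    """Sum trap effort per (row, column, year) grid cell.

    Rows are counted from 90 degrees S and columns from 180 degrees W.
    The cell size is an illustrative assumption.
    """
    grid = defaultdict(float)
    for r in records:
        row = math.floor((r["decimalLatitude"] + 90.0) / cell_size_deg)
        col = math.floor((r["decimalLongitude"] + 180.0) / cell_size_deg)
        grid[(row, col, r["year"])] += r["trapEffort"]
    return dict(grid)

# Hypothetical records: two sessions in the same cell and year, one a year later.
records = [
    {"decimalLatitude": 8.2, "decimalLongitude": -11.3, "year": 2019, "trapEffort": 100},
    {"decimalLatitude": 8.9, "decimalLongitude": -11.9, "year": 2019, "trapEffort": 200},
    {"decimalLatitude": 8.2, "decimalLongitude": -11.3, "year": 2020, "trapEffort": 50},
]
raster = effort_raster(records)
```

Cells absent from the result received no recorded effort, which is the signal of interest when looking for surveillance gaps.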

A unified and comprehensive dataset on Arenaviruses and Hantaviruses in wild-caught small mammals is urgently needed to address these gaps. By collating spatial, temporal, and genomic data, such a dataset will be a critical resource for identifying undersampled regions and potential hosts, guiding viral discovery, and informing risk assessments, predictive models, and public health interventions. Adhering to Open Science principles, we will ensure that research tools, data extraction methods, and processing code are shared on suitable platforms following the FAIR guidelines 50 . This will ensure global accessibility and foster evidence-based decision-making and collaboration in disease surveillance. As automated tools increasingly drive scientific research, we must standardize datasets to avoid overlooking valuable data. By curating this dataset, we aim to preserve scientific knowledge and ensure its accessibility to global researchers through platforms like the Global Biodiversity Information Facility (GBIF) 51 .

Existing data on rodent-pathogen associations are often limited by geographic and temporal constraints, which limits broader analyses of pathogen prevalence, spillover risk, and range expansion. The proposed ArHa database will address these limitations by synthesizing and standardizing data from diverse sources, enabling research on host-pathogen dynamics, range expansion, and spillover risk. This unified resource will strengthen predictive models of zoonotic diseases, improve public health strategies, and enhance preparedness for future disease threats.

Protocol

Study objectives

The ArHa dataset aims to:

  • Construct a global, standardised and relational dataset of Arenavirus and Hantavirus detections in wild-caught rodents and shrews, integrating ecological, epidemiological and genomic information.

  • Quantify and map sampling effort, identifying spatiotemporal patterns in small mammal surveillance and highlighting geographic and taxonomic gaps.

  • Characterise pathogen detection and reporting biases, including diagnostic method, assay sensitivity and data completeness across studies.

  • Link host, pathogen, and genomic data across time and space, enabling the exploration of host pathogen associations and viral diversity.

  • Enable bias-aware modelling of pathogen prevalence, distribution, and spillover risk through harmonized spatial and temporal metadata.

  • Facilitate integration of ecological and genomic data with public health datasets, enabling identification of spillover hotspots and informing disease surveillance strategies.

  • Promote data reuse and interoperability by harmonizing taxonomy and adhering to FAIR principles with open access releases.

Search strategy

We searched NCBI PubMed and Clarivate Web of Science for relevant citations, with no restrictions on publication date, language, or sampling location. Table 1 lists the search terms and the number of results returned by each database.

Table 1. Search terms used for identifying relevant literature.

Search term | PubMed | Web of Science
1. rodent* OR rat OR mouse | 3,878,308 | 4,090,059
2. shrew OR eulipotyphla | 7,582 | 9,108
3. arenavir* | 2,525 | 1,966
4. hantavir* | 4,316 | 5,084
5. 1 OR 2 | 3,883,403 | 4,096,018
6. 3 OR 4 | 6,746 | 6,899
7. 5 AND 6 | 3,100 | 2,904

Search terms 1-4 (Table 1) included expanded terms. Citations returned by the searches were downloaded and de-duplicated by matching on titles, authors, publication identifiers, or digital object identifiers. The initial search (conducted 2023-10-16) resulted in 2,755 unique publications. These citations were uploaded to the Rayyan platform to assess against inclusion and exclusion criteria 52 .
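The de-duplication step can be sketched as follows. This is a minimal Python illustration, assuming citations are held as dictionaries with optional `doi`, `pmid` and `title` keys. The matching rules (case-insensitive DOI, PMID, or punctuation-insensitive title) are a simplification chosen for the example and are not necessarily the exact procedure used by the authors.

```python
import re

def normalise_title(title):
    """Lower-case a title and collapse punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def deduplicate(citations):
    """Keep the first citation seen for each DOI, PMID or normalised title."""
    seen = set()
    unique = []
    for cit in citations:
        keys = set()
        if cit.get("doi"):
            keys.add(("doi", cit["doi"].lower()))
        if cit.get("pmid"):
            keys.add(("pmid", str(cit["pmid"])))
        if cit.get("title"):
            keys.add(("title", normalise_title(cit["title"])))
        if keys & seen:
            continue  # already captured from another database
        seen |= keys
        unique.append(cit)
    return unique

# Hypothetical exports from the two databases.
pubmed = [{"title": "Hantavirus in wild rodents.", "doi": "10.1000/abc", "pmid": 1}]
wos = [
    {"title": "Hantavirus in Wild Rodents", "doi": "10.1000/ABC"},
    {"title": "Arenavirus survey of shrews", "doi": None},
]
merged = deduplicate(pubmed + wos)
```

Title-only matching can wrongly merge distinct papers with identical titles, so in practice such matches would warrant a manual check.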

To supplement our core searches in PubMed and Web of Science, we plan in a second round of data extraction to search additional regional repositories, including Google Scholar, African Journals Online (AJOL), SciELO, LILACS, J-STAGE (Japan), KoreaMed, IndMED, MedIND (India), ASEAN Citation Index, and the Index Medicus for the Eastern Mediterranean Region (IMEMR). These platforms will help us to capture relevant literature that may not be indexed in global bibliographic databases, particularly older or regionally published studies.

Study screening

Inclusion criteria. We screened studies against the following inclusion criteria. Studies were included if they reported:

  1. Rodent OR Shrew sampling from wild populations AND

  2. Direct or indirect detection of microorganisms in the Arenaviridae and Hantaviridae families AND

  3. Information on the sampling location of small mammals

Exclusion criteria. Studies were excluded if they:

  1. Did not contain information on the host species of the sample for direct or indirect detection of Arenaviridae or Hantaviridae,

  2. Reported experimental infections or solely human infections,

  3. Were abstracts or conference proceedings that did not provide any description of the method of animal sampling or microorganism detection.

Direct detection was defined as detection via nucleic acid amplification tests (e.g., Polymerase Chain Reaction (PCR)) or virus culture; indirect detection was defined as detection of specific antibodies or antigens.

We first screened the study titles and abstracts. For articles published in languages other than English we used automated translation services (e.g., Google Translate) if study authors were unable to review the manuscript in its published language. All titles and abstracts were then assessed against our inclusion and exclusion criteria. Assessments were made by two members of the study team (DS and RR). Studies deemed to meet all inclusion criteria by at least one author proceeded to full-text review.

Reference chaining was performed on studies considered for inclusion at the full text screening stage and relevant publications were added as manually identified relevant articles. The PRISMA flow chart for the search process is shown in Figure 1 which also indicates the status of pilot data extraction (10% of the 917 studies included at title and abstract stage). The search will be re-run to capture more recently published articles when data extraction from the current search has been completed.

Figure 1. PRISMA flow diagram for records identified in the initial search conducted 2023-10-16.


Studies excluded at full text review included those with incomplete data on ≥1 of pathogen, host, or study location. For studies containing duplicated data, the highest resolution dataset (i.e., temporal, spatial, taxonomic) was retained.

Data extraction

We aimed to extract included study meta-data and to produce three linked datasets that could be used to assess sampling of a) small-mammal hosts, b) viral prevalence and c) genetic variability. We have developed and refined the current data extraction tools in the pilot search (Figure 1). A publicly available RShiny web-application will be developed to present the extracted data while the search is ongoing 53 .

Included studies. Included study meta-data was abstracted to ensure that each rodent, pathogen and genomic material record is attributed to the original publication that presented it (Table 2). Each included study was assigned a unique identifier which provides linkage to the reference. We extracted reported sampling effort when this information was available (e.g., number of trap-nights). For articles not reporting sampling effort at the study level, we inferred total sampling effort by summing effort across individual trapping sessions and sites. For studies that did not report a measure of sampling effort amenable to imputation (e.g., a variable number of traps per trapping line, or days per study session), we recorded a value of ‘not reported’. Finally, we recorded the level of data aggregation of rodent or pathogen sampling, classified as individual (data at the individual level) or summarised (data were aggregated; e.g., at site or sampling session level).
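The effort inference described above might look like the following Python sketch. The session fields (`trap_nights`, `traps`, `nights`) and the rule that trap-nights equal traps multiplied by nights are assumptions made for illustration. The only behaviour taken from the protocol is summing across sessions and falling back to 'not reported' when effort cannot be derived.

```python
def total_trap_nights(sessions):
    """Sum trap-nights over sessions, or return 'not reported' if any session lacks the inputs."""
    if not sessions:
        return "not reported"
    total = 0
    for s in sessions:
        if s.get("trap_nights") is not None:
            total += s["trap_nights"]
        elif s.get("traps") is not None and s.get("nights") is not None:
            # Assumed rule: a fixed number of traps set for a fixed number of nights.
            total += s["traps"] * s["nights"]
        else:
            return "not reported"
    return total

# Hypothetical study with one session reported directly and one derived.
effort = total_trap_nights([{"trap_nights": 100}, {"traps": 50, "nights": 3}])
missing = total_trap_nights([{"traps": 50}])
```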

Table 2. Study information extraction sheet.

Column name | Description
study_id | An internal unique identifier for the included study
full_text_id | The internal ID for the full text manuscript and its associated citation
datasetName | The title of the manuscript, report or book section
sampling_effort | A free text entry to capture the effort of sampling, ideally in trap nights for rodent studies
data_access | Whether data is available for individual small mammals or whether it is aggregated
data_resolution | The level of aggregation available
linked_manuscripts | The DOI or weblink to other studies including the same dataset either in its entirety or a subset; this will be used to attempt to de-duplicate data
notes | Details that may be important for interpreting the extracted data

Small-mammal sampling. Data were extracted into a rodent sheet (Table 3) at the highest available level of temporal and spatial resolution. For example, a study reporting a single species at four spatially distinct sampling sites would be associated with four records. If the same study reported observations across four distinct trapping sessions at each of these sites, there would be 16 (4 × 4) records associated with the study. Studies presenting data on N individuals are associated with N records. These records may not correspond to unique individuals; in capture-mark-recapture studies, for example, the same individual may be detected over multiple sessions. For this reason, we do not report a unique identifier at the individual level.
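The record count in this example can be reproduced with a short Python sketch that creates one record per species, site and session combination. The identifier format, site names and session dates are hypothetical. The field names follow the rodent sheet.

```python
from itertools import product

def expand_records(study_id, species, sites, sessions):
    """Create one rodent record per species x site x session combination."""
    records = []
    combos = product(species, sites, sessions)
    for i, (sp, site, session) in enumerate(combos, start=1):
        records.append({
            "rodent_record_id": f"{study_id}_{i}",  # hypothetical ID format
            "study_id": study_id,
            "scientificName": sp,
            "locality": site,
            "eventDate": session,
        })
    return records

rodent_records = expand_records(
    "study_001",
    ["Mastomys natalensis"],
    ["Site A", "Site B", "Site C", "Site D"],
    ["2019-01", "2019-04", "2019-07", "2019-10"],
)
```

One species at four sites over four sessions yields 16 records, matching the example in the text.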

Table 3. Extracted rodent sampling variables.

Column name | Description
rodent_record_id | A unique internal identifier for the rodent record. Record resolution depends on the extent of aggregation in the study
study_id | A unique identifier to link a record to its respective study
scientificName | The binomial species name of the small-mammal reported in the study
eventDate | The period in which sampling was conducted
locality | The location of sampling effort, reported to the highest spatial resolution available
country | Country where trapping occurred. For multinational studies where counts are not disaggregated by country, all countries are reported
verbatimLocality | High level habitat type will be recorded here at the scale for which trapping is recorded, particularly useful when locality does not discriminate between multiple sampling sites
coordinate_resolution | The spatial resolution of coordinates
decimalLatitude | Latitude in decimal format, converted from reported coordinates as required
decimalLongitude | Longitude in decimal format, converted from reported coordinates as required
individualCount | The number of detected individuals associated with a record
trapEffort | Trap effort (recorded as number of trap nights) associated with a record
trapEffortResolution | The resolution of trapping effort for the record

Identification of a detected small mammal was recorded as reported within the article. We will not systematically assess the method of identification and therefore assume that identification is accurate as reported. Where identification is to genus or a higher taxonomic level we report to this level (e.g., Rattus spp., Rodentia). Further details are provided below on data processing and species name harmonisation.

Since we expected studies to report sampling location at varying resolutions, we extract locality as the highest resolution of location data that could be matched to administrative levels within a country (e.g., city, county). In cases where location is reported to a higher spatial resolution which may not map to administrative levels (e.g., ‘Site A’) we also include verbatimLocality. For studies not reporting geographic coordinates of sampling but including some information that describes the location (e.g., the name of the village sampled) we will locate coordinates by searching Google Maps, Wikipedia or the GEOnet Names Server provided by the US National Geospatial-Intelligence Agency.
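Where studies report coordinates in degrees, minutes and seconds, they need converting to the decimal format used for decimalLatitude and decimalLongitude. A minimal Python sketch of that conversion is given below. The function name and argument layout are illustrative assumptions.

```python
def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees/minutes/seconds to decimal degrees (S and W are negative)."""
    hemisphere = hemisphere.upper()
    if hemisphere not in {"N", "S", "E", "W"}:
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    value = abs(degrees) + minutes / 60 + seconds / 3600
    return -value if hemisphere in {"S", "W"} else value

# Hypothetical coordinates reported as 8°30'N, 11°15'W.
latitude = dms_to_decimal(8, 30, 0, "N")
longitude = dms_to_decimal(11, 15, 0, "W")
```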

Pathogen sampling. The pathogen sheet (Table 4) reports the assays used to detect Arenaviruses and Hantaviruses. Analogously to the rodent sheet, a record is produced at the highest resolution of pathogen sampling. One-to-many matching was permitted to record all data in cases where a single rodent was tested for a pathogen using multiple assays (e.g., serology and PCR).

Table 4. Pathogen information extraction sheet.

Column name | Description
pathogen_record_id | A unique identifier for the group of samples from the same rodent species, at a specific location or timepoint, tested for the same pathogen using the same method
associated_rodent_record_id | Linking identifier from the rodent sheet
study_id | A unique identifier to link a study to the descriptive sheet entry for that study
associatedTaxa | The scientificName of the rodent from the associated_rodent_record_id
family | The family of the pathogen, either Arenaviridae or Hantaviridae
scientificName | The species name of the pathogen assayed for, if available. Where assays are specific to a higher taxon than species this is recorded here.
assay | Assay type, e.g., serology, PCR, or other
number_tested | The number of samples tested
number_negative | The number of negative samples
number_positive | The number of positive samples
number_inconclusive | The number of samples with inconclusive results
note | Notes relevant to this record

The species or family targeted by the assay was recorded as reported within the relevant article; we did not make any assessment of the suitability, sensitivity or specificity of an assay. Where authors describe an assay as being family-wide (e.g., hantavirus antibody detection) we retain this level of specificity in the pathogen record. We recorded the number of samples tested using the assay for a specific pathogen. This may differ from the number of counted small mammals, as not all samples may have been suitable for testing, or the study authors may have decided to test only a subset of the available samples. The numbers of positive and negative samples are reported relative to this tested number. Where reported, we also extracted the number of samples with an inconclusive test result. As above, we recorded positives and negatives as reported by study authors and make no assessment of these classifications.
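A simple internal consistency check on a pathogen record is sketched below in Python, together with a crude prevalence calculation. Two points are assumptions for this example rather than protocol rules: that positive, negative and inconclusive counts should sum to number_tested, and that inconclusive samples are left out of the prevalence denominator.

```python
def check_pathogen_record(rec):
    """Return a list of problems found in one pathogen record (empty if none)."""
    problems = []
    inconclusive = rec.get("number_inconclusive") or 0
    total = rec["number_positive"] + rec["number_negative"] + inconclusive
    if total != rec["number_tested"]:
        problems.append("counts do not sum to number_tested")
    return problems

def crude_prevalence(rec):
    """Proportion positive among conclusive results, or None if there are none."""
    conclusive = rec["number_positive"] + rec["number_negative"]
    return rec["number_positive"] / conclusive if conclusive else None

# Hypothetical record: 10 samples tested, 2 positive, 7 negative, 1 inconclusive.
record = {
    "number_tested": 10,
    "number_positive": 2,
    "number_negative": 7,
    "number_inconclusive": 1,
}
problems = check_pathogen_record(record)
prevalence = crude_prevalence(record)
```

A flagged record would be reviewed against the source article rather than corrected automatically, since the values are extracted as reported.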

Sequences. If studies include complete or partial nucleotide sequences of hosts or viruses archived in NCBI (USA), EMBL (Europe), DDBJ (Japan) or CNGBdb (China), they will be linked through the sequences sheet (Table 5). A record will be produced for each available accession. A sequence_record_id will be associated with each accession and, depending on whether the sequence relates to a pathogen or a host, each record will be associated with one or both of the other sheets (i.e., a pathogen sequence will be linked to both a pathogen_record_id and a rodent_record_id, while a host sequence will only be linked to a rodent_record_id). Many-to-one matching of sequences may occur for several reasons. First, Hantaviruses (three) and Arenaviruses (two) contain different numbers of genome segments, and not all segments may have been successfully sequenced for each acutely infected small mammal. Second, reporting of pathogen sampling may be aggregated, with multiple sequences obtained from a single reported assay (e.g., pooled sampling). Pathogen sequences obtained from infected humans will be extracted but will only be linked at the study level, and only if the manuscript reports them from the same geographic location or temporal period.

Table 5. Pathogen sequences extraction sheet.

Column name | Description
sequence_record_id | A unique identifier for the sequence record
sequenceType | One of Pathogen or Host
associated_pathogen_record_id | An associated pathogen record for this sequence
associated_rodent_record_id | An associated rodent record for this sequence
study_id | Study ID associated with this sequence
associatedTaxa | Species name of small mammal sampled/sequenced
scientificName | Species name of pathogen sequenced
accession_number | NCBI nucleotide accession
note | Notes relevant to this record
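The linking rule for small-mammal-derived sequences (a pathogen sequence carries both a pathogen and a rodent record identifier, a host sequence carries only a rodent record identifier) can be expressed as a small validation function. This Python sketch is illustrative only. It does not cover human-derived pathogen sequences, which are linked at the study level, and the accession values are placeholders.

```python
def links_are_valid(rec):
    """Check that a sequence record carries the links expected for its sequenceType."""
    has_rodent = bool(rec.get("associated_rodent_record_id"))
    has_pathogen = bool(rec.get("associated_pathogen_record_id"))
    if rec["sequenceType"] == "Pathogen":
        return has_rodent and has_pathogen
    if rec["sequenceType"] == "Host":
        return has_rodent and not has_pathogen
    raise ValueError(f"unknown sequenceType: {rec['sequenceType']!r}")

# Hypothetical records. Two segments from one infected animal give a many-to-one link.
sequences = [
    {"sequenceType": "Pathogen", "associated_rodent_record_id": "r1",
     "associated_pathogen_record_id": "p1", "accession_number": "PLACEHOLDER_S"},
    {"sequenceType": "Pathogen", "associated_rodent_record_id": "r1",
     "associated_pathogen_record_id": "p1", "accession_number": "PLACEHOLDER_L"},
    {"sequenceType": "Host", "associated_rodent_record_id": "r1",
     "associated_pathogen_record_id": None, "accession_number": "PLACEHOLDER_HOST"},
]
all_valid = all(links_are_valid(s) for s in sequences)
```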

Data processing

This section describes data cleaning and harmonisation. All data processing will be conducted in R with scripts retained in a version controlled git repository 54, 55 . Raw data will be downloaded from Google Sheets using the googledrive API in R, with date stamped files stored locally 56 .

Species name harmonisation. To standardise reported species we will match reported names to both the GBIF backbone taxonomy and the NCBI taxonomy database 51, 57 . The taxize R package was used to query the APIs of these platforms 58 . Reported names that do not match accepted taxonomic entries are reviewed and resolved using a reproducible set of decision rules:

  1. Subspecies and outdated synonyms are corrected to the currently accepted species name (e.g., Rattus rattus alexandrinus -> Rattus rattus).

  2. Taxonomic revisions are incorporated where genus-level reassignments have occurred (e.g., Mus/Nannomys becomes Mus at the genus level).

  3. If no unambiguous match is found at the species level (e.g., due to vague or malformed entries), the identification is:

     1. Promoted to the lowest reliable taxonomic rank (e.g., genus or subfamily);

     2. Flagged as ambiguous in a dedicated column.

To ensure transparency we retain unprocessed species names alongside the cleaned and matched taxonomy.
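The decision rules above can be sketched as a small function. In this Python illustration the GBIF and NCBI lookups are replaced by tiny hard-coded tables (`ACCEPTED_SPECIES`, `SYNONYMS`, `KNOWN_GENERA`), which are assumptions standing in for the real taxonomy services. The actual workflow queries those services through the taxize R package.

```python
# Hypothetical stand-ins for the GBIF/NCBI taxonomy lookups.
ACCEPTED_SPECIES = {"Rattus rattus", "Mastomys natalensis", "Mus minutoides"}
SYNONYMS = {
    "Rattus rattus alexandrinus": "Rattus rattus",  # subspecies to species
    "Nannomys minutoides": "Mus minutoides",        # genus-level reassignment
}
KNOWN_GENERA = {"Rattus", "Mastomys", "Mus"}

def harmonise(reported):
    """Apply the decision rules, always retaining the name as reported."""
    name = " ".join(reported.split())  # collapse stray whitespace
    result = {"reported_name": reported, "ambiguous": False}
    if name in ACCEPTED_SPECIES:
        result.update(clean_name=name, rank="species")
    elif name in SYNONYMS:
        result.update(clean_name=SYNONYMS[name], rank="species")
    else:
        genus = name.split(" ")[0].capitalize() if name else ""
        if genus in KNOWN_GENERA:
            # No species-level match: promote to genus and flag.
            result.update(clean_name=genus, rank="genus", ambiguous=True)
        else:
            result.update(clean_name=None, rank=None, ambiguous=True)
    return result

resolved = harmonise("Rattus rattus alexandrinus")
promoted = harmonise("Rattus sp.")
```

Keeping `reported_name` in every result mirrors the commitment to retain unprocessed names alongside the cleaned taxonomy.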

To harmonise pathogen names, we match to the ICTV binomial taxonomy. If harmonisation is not possible (e.g., due to generic naming such as "Hantavirus sp." or references to obsolete clades), we match to the lowest reliable taxonomic rank, with an accompanying flag.

Location of sampling. Locations were associated with administrative divisions using the Database of Global Administrative Areas (GADM) accessed through the geodata R package 59 .

The coordinate resolution for coordinates labelled ‘site’ or ‘study site’ will be set to 100 meters. For locations associated with an administrative area but with no higher-resolution data, we record the coordinate resolution as the radius of the administrative region, to represent the uncertainty in these coordinates.
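The protocol does not specify how the radius of an administrative region is calculated. One plausible approach, shown in the Python sketch below purely as an assumption, treats the region as a circle of equal area, so that the radius is the square root of area divided by pi. The function name and the area input are hypothetical.

```python
import math

SITE_RESOLUTION_M = 100.0  # fixed resolution for 'site' or 'study site' coordinates

def coordinate_resolution_m(label, admin_area_km2=None):
    """Return the coordinate resolution in meters for a location.

    Assumption: an administrative region is approximated by a circle of
    equal area, so radius = sqrt(area / pi).
    """
    if label in {"site", "study site"}:
        return SITE_RESOLUTION_M
    if admin_area_km2 is None:
        raise ValueError("administrative area is required for non-site locations")
    radius_km = math.sqrt(admin_area_km2 / math.pi)
    return radius_km * 1000.0

site_resolution = coordinate_resolution_m("site")
# Hypothetical district of about 314 km2, giving a radius of roughly 10 km.
district_resolution = coordinate_resolution_m("district", admin_area_km2=math.pi * 100)
```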

Imputed non-detections. Non-detection of a small mammal species in a location where it may be expected to occur (i.e., based on IUCN range maps or prior research) is of interest to researchers. We will enrich the detection-only data of included studies by imputing non-detections. This imputation will be restricted to species that have been detected at other sites or sessions in the same study. Imputed non-detections will be labelled as imputed in the final data product.
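A minimal Python sketch of this imputation, for the records of a single study, is given below. For every site and session combination sampled in the study, it adds a zero-count record for each species detected elsewhere in that study. The `imputed` flag name and the input layout are assumptions for illustration.

```python
def impute_non_detections(records):
    """Add zero-count records for species seen elsewhere in the same study."""
    species = {r["scientificName"] for r in records}
    events = {(r["locality"], r["eventDate"]) for r in records}
    observed = {(r["scientificName"], r["locality"], r["eventDate"]) for r in records}
    imputed = []
    for locality, event_date in sorted(events):
        for sp in sorted(species):
            if (sp, locality, event_date) not in observed:
                imputed.append({
                    "scientificName": sp,
                    "locality": locality,
                    "eventDate": event_date,
                    "individualCount": 0,
                    "imputed": True,  # labelled so users can exclude these rows
                })
    return records + imputed

# Hypothetical study: one species at both sites, a second species at Site A only.
detections = [
    {"scientificName": "Mastomys natalensis", "locality": "Site A", "eventDate": "2019-01", "individualCount": 5},
    {"scientificName": "Mastomys natalensis", "locality": "Site B", "eventDate": "2019-01", "individualCount": 2},
    {"scientificName": "Rattus rattus", "locality": "Site A", "eventDate": "2019-01", "individualCount": 1},
]
enriched = impute_non_detections(detections)
```

Here the only imputed row is Rattus rattus at Site B, because it was detected at another site in the same study.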

Conflicting pathogen detection results. We preserve potentially conflicting assay results (e.g., seropositive but PCR-negative individuals) without reconciliation or exclusion, by maintaining a one-to-many relationship between host records and pathogen assay records.

Each host record, whether representing a single animal or a pooled group of conspecifics trapped at the same site and time, can be linked to multiple pathogen assay records. For example, if 10 Apodemus agrarius were trapped at a site during a single session (i.e., a single rodent_record_id) and tested for hantavirus using both ELISA and PCR, the host dataset would contain one record, while the pathogen dataset would include two linked records (i.e., two pathogen_record_id values), each specifying test type, number tested, and number positive.

We retain all assay results irrespective of concordance, as different diagnostics capture distinct biological signals (e.g., prior exposure vs. current infection) and vary in sensitivity. No prioritisation is applied. Each assay record is annotated with test type, diagnostic target (e.g., antigen, antibody, RNA), and outcome, enabling users to apply their own criteria when analysing discordant results.
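The one-to-many structure can be illustrated with the Apodemus agrarius example. In the sketch below, `rodent_record_id` and `pathogen_record_id` follow the protocol, while the remaining field names, the identifier values and the numbers positive are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostRecord:
    rodent_record_id: str
    species: str
    n_individuals: int


@dataclass(frozen=True)
class PathogenRecord:
    pathogen_record_id: str
    rodent_record_id: str  # link back to the host record
    assay: str             # test type
    target: str            # diagnostic target, e.g. antibody or RNA
    n_tested: int
    n_positive: int


# One host record for the 10 animals trapped in a single session.
host = HostRecord("r001", "Apodemus agrarius", 10)

# Two linked assay records; the positive counts are made up.
assays = [
    PathogenRecord("p001", "r001", "ELISA", "antibody", 10, 3),
    PathogenRecord("p002", "r001", "PCR", "RNA", 10, 1),
]


def assays_for(host_id, records):
    """Return every assay record linked to one host record."""
    return [r for r in records if r.rodent_record_id == host_id]


def prevalence_by_assay(host_id, records):
    """Report each assay separately; no reconciliation is attempted."""
    return {
        r.assay: r.n_positive / r.n_tested
        for r in assays_for(host_id, records)
        if r.n_tested > 0
    }
```

Because the discordant ELISA and PCR results sit in separate rows, a user can analyse exposure and active infection independently or apply their own reconciliation rule.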

Discussion

This novel dataset will provide spatially and temporally explicit information on small mammal sampling for Arenaviruses and Hantaviruses, offering a more comprehensive view of host-pathogen associations than is currently available. By including sampling effort and explicit spatial data, this dataset enhances the ability to quantify biases in where and how frequently sampling occurs, assess geographic gaps, and better understand the spatial ecology of these globally distributed pathogens of public health importance. The insights gained from this dataset could improve our understanding of how environmental changes, such as habitat fragmentation and urbanisation, influence pathogen dynamics and spillover risks.

Importantly, these data could inform the development of ecology-driven predictive models, helping to identify areas at heightened risk of zoonotic spillover 60, 61 . The ability to link ecological data with human health outcomes could inform targeted public health interventions, enhancing outbreak preparedness and response. For instance, identifying hotspots of high pathogen prevalence in rodents could guide targeted development of ecologically-based rodent control measures in regions identified as vulnerable.

Existing host-pathogen datasets lack detailed spatial or temporal information, limiting their use in ecological and epidemiological modelling 62 . By addressing these gaps, our dataset provides information about host-pathogen interactions that span multiple spatial and temporal scales. Moreover, the explicit reporting of sampling effort allows for more robust analyses of detection probability, which is crucial for understanding the true distribution of pathogens 63, 64 .

While currently available global host-pathogen datasets do not account for sampling biases, the dataset proposed here will explicitly report the extent of sampling for pathogens within their host species, providing critical insights into spatial sampling biases and detection efforts. For instance, while it is expected that Mastomys natalensis sampling occurs throughout its range, specific pathogens such as LASV and Morogoro virus are currently only detected in its West African and East African radiations, respectively. By detailing both presence and absence data, the dataset allows for a more nuanced understanding of pathogen distribution within host species, beyond mere occurrence records. Quantification of detection effort for specific pathogens is vital for assessing spatial sampling biases, identifying under-sampled regions, and refining ecological and epidemiological models 65.

As part of our future analytical work, we aim to develop a spatiotemporal raster of rodent sampling effort, based on georeferenced metadata from the included studies. This raster will allow us to quantify both spatial and temporal surveillance intensity across the ranges of known and suspected reservoir species. It will help identify under-sampled but environmentally suitable regions, inform targeted field surveys, and support downstream predictive modelling. Unlike bibliometric proxies that only reflect publication counts, this approach incorporates species- and pathogen-specific detail at a finer spatial resolution.
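A very simplified sketch of such a sampling-effort surface is given below. It bins georeferenced records into latitude-longitude cells by year and sums the number of animals sampled. The 1-degree cell size, the tuple layout and the function name are assumptions made for illustration; the planned raster may use a different resolution, projection or effort metric.

```python
import math
from collections import defaultdict


def effort_grid(records, cell_deg=1.0):
    """Sum animals sampled per (lat cell, lon cell, year).

    records: iterable of (latitude, longitude, year, n_sampled) tuples.
    """
    grid = defaultdict(int)
    for lat, lon, year, n_sampled in records:
        # floor() keeps negative coordinates in the correct cell.
        cell = (math.floor(lat / cell_deg), math.floor(lon / cell_deg), year)
        grid[cell] += n_sampled
    return dict(grid)
```

Cells within a species' range that are absent from the returned dictionary would correspond to locations with no recorded sampling in that year.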

Despite its strengths, this dataset has several limitations. First, the inherent variability in the quality and detail of reported data from included studies means that some records may lack critical information, such as exact coordinates or specific sampling dates. The imputation of non-detections introduces assumptions that could affect the interpretation of absence data. Additionally, the dataset is inherently static and will not reflect data that emerge after its production.

Furthermore, the reliance on published literature introduces publication bias, as studies with significant findings are more likely to be published than those with negative results 66. This could skew the dataset towards areas and hosts with known or previously detected pathogens, potentially under-representing regions where sampling has occurred but without positive detections. We will not routinely contact study authors to obtain data beyond what is reported within publications. Therefore, data that are not reported within the article or its supplementary appendices are not included in the dataset. This is the subject of an ongoing data request study. To address limitations associated with extracting data without access to the raw data, we encourage scientists with ongoing field studies and pathogen surveillance programmes to submit their data to dynamic repositories (e.g., Pharos and GBIF 51, 67). Beyond research applications, the dataset could serve as a tool for policymakers, assisting in the identification of priority areas for viral discovery and public health interventions.

Conclusion

Overall, this dataset will be a valuable resource for understanding the ecology of Arenaviruses and Hantaviruses in their natural reservoirs. By bridging the gap between local-scale ecological studies and broader public health needs, it has the potential to enhance our ability to predict and mitigate the risks posed by these emerging pathogens. Continued efforts to update and expand this resource will be crucial for maintaining its utility in a rapidly changing epidemiological landscape.

Ethics and consent

Ethical approval and consent were not required.

Funding Statement

This work was supported by Wellcome [220179]; NSF-NIH-NIFA and BBSRC Ecology and Evolution of Infectious Disease Award [2208034]; and NSF Biology Integration Institute [2213854].

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

[version 2; peer review: 2 approved, 2 approved with reservations]

Data availability

No data are associated with this article. All data products of this project will be made available on GitHub (https://github.com/DidDrog11/arenavirus_hantavirus) and a finalized product will be archived on Zenodo.

OSF: PRISMA-P checklist for Protocol to produce a systematic Arenavirus and Hantavirus host-pathogen database: Project ArHa. https://doi.org/10.17605/OSF.IO/23KTF

Authors’ contributions

D.S. - conceptualisation, methodology, data curation, investigation, software, supervision, writing - original draft, writing - review and editing

R.R. - methodology, data curation, investigation, writing - review and editing

H.L.M.G. - methodology, data curation, investigation, writing - review and editing

A. M-C G. - methodology, data curation, investigation, writing - review and editing

G. C. M. - methodology, supervision, writing - review and editing

G. R. - methodology, data curation, investigation, writing - review and editing

D. W. R. - funding acquisition, resources, supervision, writing - review and editing

S. N. S. - conceptualisation, funding acquisition, methodology, resources, supervision, writing - review and editing

References

  • 1. Childs JE, Peters CJ: Ecology and epidemiology of arenaviruses and their hosts. In: The Arenaviridae. Springer; 1993;331–84. 10.1007/978-1-4615-3028-2_19
  • 2. Jonsson CB, Figueiredo LT, Vapalahti O: A global perspective on hantavirus ecology, epidemiology, and disease. Clin Microbiol Rev. 2010;23(2):412–41. 10.1128/CMR.00062-09
  • 3. Wells K, Clark NJ: Host specificity in variable environments. Trends Parasitol. 2019;35(6):452–65. 10.1016/j.pt.2019.04.001
  • 4. Smith DRM, Turner J, Fahr P, et al.: Health and economic impacts of Lassa vaccination campaigns in West Africa. Nat Med. 2024;30(12):3568–3577. 10.1038/s41591-024-03232-y
  • 5. Vapalahti O, Mustonen J, Lundkvist A, et al.: Hantavirus infections in Europe. Lancet Infect Dis. 2003;3(10):653–61. 10.1016/s1473-3099(03)00774-6
  • 6. Gibb R, Moses LM, Redding DW, et al.: Understanding the cryptic nature of Lassa fever in West Africa. Pathog Glob Health. 2017;111(6):276–88. 10.1080/20477724.2017.1369643
  • 7. Gallo GL, López N, Loureiro ME: The virus-host interplay in Junín mammarenavirus infection. Viruses. 2022;14(6):1134. 10.3390/v14061134
  • 8. Nigeria Centre for Disease Control: An update of Lassa fever outbreak in Nigeria. 2024.
  • 9. Basinski AJ, Fichet-Calvet E, Sjodin AR, et al.: Bridging the gap: using reservoir ecology and human serosurveys to estimate Lassa virus spillover in West Africa. PLoS Comput Biol. 2021;17(3):e1008811. 10.1371/journal.pcbi.1008811
  • 10. Vilibic-Cavlek T, Savic V, Ferenc T, et al.: Lymphocytic choriomeningitis—emerging trends of a neglected virus: a narrative review. Trop Med Infect Dis. 2021;6(2):88. 10.3390/tropicalmed6020088
  • 11. Makary P, Kanerva M, Ollgren J, et al.: Disease burden of Puumala virus infections, 1995–2008. Epidemiol Infect. 2010;138(10):1484–92. 10.1017/S0950268810000087
  • 12. Park Y: Epidemiologic study on changes in occurrence of Hemorrhagic Fever with Renal Syndrome in Republic of Korea for 17 years according to age group: 2001–2017. BMC Infect Dis. 2019;19(1):153. 10.1186/s12879-019-3794-9
  • 13. Goodfellow SM, Nofchissey RA, Schwalm KC, et al.: Tracing transmission of Sin Nombre virus and discovery of infection in multiple rodent species. J Virol. 2021;95(23):e0153421. 10.1128/JVI.01534-21
  • 14. Vial PA, Ferrés M, Vial C, et al.: Hantavirus in humans: a review of clinical aspects and management. Lancet Infect Dis. 2023;23(9):e371–e382. 10.1016/S1473-3099(23)00128-7
  • 15. Kang HJ, Stanley WT, Esselstyn JA, et al.: Expanded host diversity and geographic distribution of hantaviruses in sub-Saharan Africa. J Virol. 2014;88(13):7663–7. 10.1128/JVI.00285-14
  • 16. Yanagihara R, Gu SH, Arai S, et al.: Hantaviruses: rediscovery and new beginnings. Virus Res. 2014;187:6–14. 10.1016/j.virusres.2013.12.038
  • 17. McMahon BJ, Morand S, Gray JS: Ecosystem change and zoonoses in the Anthropocene. Zoonoses Public Health. 2018;65(7):755–65. 10.1111/zph.12489
  • 18. Pei S, Yu P, Raghwani J, et al.: Anthropogenic land consolidation intensifies zoonotic host diversity loss and disease transmission in human habitats. Nat Ecol Evol. 2025;9(1):99–110. 10.1038/s41559-024-02570-x
  • 19. Tian H, Hu S, Cazelles B, et al.: Urbanization prolongs hantavirus epidemics in cities. Proc Natl Acad Sci U S A. 2018;115(18):4707–12. 10.1073/pnas.1712767115
  • 20. Plowright RK, Parrish CR, McCallum H, et al.: Pathways to zoonotic spillover. Nat Rev Microbiol. 2017;15(8):502–10. 10.1038/nrmicro.2017.45
  • 21. Dearing MD, Dizney L: Ecology of hantavirus in a changing world. Ann N Y Acad Sci. 2010;1195(1):99–112. 10.1111/j.1749-6632.2010.05452.x
  • 22. Keesing F, Belden LK, Daszak P, et al.: Impacts of biodiversity on the emergence and transmission of infectious diseases. Nature. 2010;468(7324):647–52. 10.1038/nature09575
  • 23. Ecke F, Han BA, Hörnfeldt B, et al.: Population fluctuations and synanthropy explain transmission risk in rodent-borne zoonoses. Nat Commun. 2022;13(1):7532. 10.1038/s41467-022-35273-7
  • 24. Gibb R, Redding DW, Friant S, et al.: Towards a ‘people and nature’ paradigm for biodiversity and infectious disease. Philos Trans R Soc Lond B Biol Sci. 2025;380(1917):20230259. 10.1098/rstb.2023.0259
  • 25. World Health Organization: Blueprint for r&d preparedness and response to public health emergencies due to highly infectious pathogens. World Health Organization, 2015.
  • 26. Van Hook CJ: Hantavirus pulmonary syndrome—the 25th anniversary of the four corners outbreak. Emerg Infect Dis. 2018;24(11):2056–2060. 10.3201/eid2411.180381
  • 27. Bach M: Social implications of the yosemite hantavirus outbreak. Pathw Stanford J Public Health. 2017;6:23–6.
  • 28. Ficenec SC, Percak J, Arguello S, et al.: Lassa Fever induced hearing loss: the neglected disability of hemorrhagic fever. Int J Infect Dis. 2020;100:82–7. 10.1016/j.ijid.2020.08.021
  • 29. McCormick JB, Webb PA, Krebs JW, et al.: A prospective study of the epidemiology and ecology of Lassa Fever. J Infect Dis. 1987;155(3):437–44. 10.1093/infdis/155.3.437
  • 30. Reusken C, Heyman P: Factors driving hantavirus emergence in Europe. Curr Opin Virol. 2013;3(1):92–9. 10.1016/j.coviro.2013.01.002
  • 31. Simons D, Attfield LA, Jones KE, et al.: Rodent trapping studies as an overlooked information source for understanding endemic and novel zoonotic spillover. PLoS Negl Trop Dis. 2023;17(1):e0010772. 10.1371/journal.pntd.0010772
  • 32. Clement J, LeDuc JW, Lloyd G, et al.: Wild rats, laboratory rats, pet rats: global seoul hantavirus disease revisited. Viruses. 2019;11(7):652. 10.3390/v11070652
  • 33. De Bellocq JG, Bryjová A, Martynov AA, et al.: Dhati welel virus, the missing mammarenavirus of the widespread Mastomys natalensis. J Vertebr Biol. 2020;69(2):20018.1-11. 10.25225/jvb.20018
  • 34. Voutilainen L, Kallio ER, Niemimaa J, et al.: Temporal dynamics of Puumala Hantavirus infection in cyclic populations of bank voles. Sci Rep. 2016;6(1):21323. 10.1038/srep21323
  • 35. Kozakiewicz CP, Burridge CP, Funk WC, et al.: Pathogens in space: advancing understanding of pathogen dynamics and disease ecology through landscape genetics. Evol Appl. 2018;11(10):1763–78. 10.1111/eva.12678
  • 36. Näpflin K, O’Connor EA, Becks L, et al.: Genomics of host-pathogen interactions: challenges and opportunities across ecological and spatiotemporal scales. PeerJ. 2019;7:e8013. 10.7717/peerj.8013
  • 37. Childs JE, Gordon ER: Surveillance and control of zoonotic agents prior to disease detection in humans. Mt Sinai J Med. 2009;76(5):421–8. 10.1002/msj.20133
  • 38. Vora NM, Hannah L, Walzer C, et al.: Interventions to reduce risk for pathogen spillover and early disease spread to prevent outbreaks, epidemics, and pandemics. Emerg Infect Dis. 2023;29(3):1–9. 10.3201/eid2903.221079
  • 39. Keesing F, Ostfeld RS: Emerging patterns in rodent-borne zoonotic diseases. Science. 2024;385(6715):1305–10. 10.1126/science.adq7993
  • 40. García-Peña GE, Rubio AV, Mendoza H, et al.: Land-use change and rodent-borne diseases: hazards on the shared socioeconomic pathways. Philos Trans R Soc Lond B Biol Sci. 2021;376(1837):20200362. 10.1098/rstb.2020.0362
  • 41. Zou Y, Hu J, Wang ZX, et al.: Genetic characterization of hantaviruses isolated from Guizhou, China: evidence for spillover and reassortment in nature. J Med Virol. 2008;80(6):1033–41. 10.1002/jmv.21149
  • 42. Zou Y, Hu J, Wang ZX, et al.: Molecular diversity and phylogeny of Hantaan Virus in Guizhou, China: evidence for Guizhou as a radiation center of the present Hantaan Virus. J Gen Virol. 2008;89(Pt 8):1987–97. 10.1099/vir.0.2008/000497-0
  • 43. Dellicour S, Bastide P, Rocu P, et al.: How fast are viruses spreading in the wild? Dushoff J, editor. PLoS Biol. 2024;22(12):e3002914. 10.1371/journal.pbio.3002914
  • 44. Arruda LB, Free HB, Simons D, et al.: Current sampling and sequencing biases of lassa mammarenavirus limit inference from phylogeography and molecular epidemiology in Lassa Fever endemic regions. PLOS Glob Public Health. 2023;3(11):e0002159. 10.1371/journal.pgph.0002159
  • 45. Lachish S, Murray KA: The certainty of uncertainty: potential sources of bias and imprecision in disease ecology studies. Front Vet Sci. 2018;5:90. 10.3389/fvets.2018.00090
  • 46. Wille M, Geoghegan JL, Holmes EC: How accurately can we assess zoonotic risk? PLoS Biol. 2021;19(4):e3001135. 10.1371/journal.pbio.3001135
  • 47. Gibb R, Albery GF, Mollentze N, et al.: Mammal virus diversity estimates are unstable due to accelerating discovery effort. Biol Lett. 2022;18(1):20210427. 10.1098/rsbl.2021.0427
  • 48. Woolhouse M, Scott F, Hudson Z, et al.: Human viruses: discovery and emergence. Philos Trans R Soc Lond B Biol Sci. 2012;367(1604):2864–71. 10.1098/rstb.2011.0354
  • 49. Carroll D, Daszak P, Wolfe ND, et al.: The global virome project. Science. 2018;359(6378):872–874. 10.1126/science.aap7463
  • 50. Wilkinson MD, Dumontier M, Aalbersberg IJ, et al.: The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3(1):160018. 10.1038/sdata.2016.18
  • 51. GBIF Secretariat: Global biodiversity information facility. 2024.
  • 52. Ouzzani M, Hammady H, Fedorowicz Z, et al.: Rayyan—a web and mobile app for systematic reviews. Syst Rev. 2016;5(1):210. 10.1186/s13643-016-0384-4
  • 53. Simons D: Arenaviruses and Hantaviruses of rodents. 2024.
  • 54. R Core Team: R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2023.
  • 55. Simons D, Seifert SN: ArHa: Arenavirus and hantavirus sampling in small mammals. 2024.
  • 56. D’Agostino McGowan L, Bryan J: Googledrive: an interface to google drive. 2023. 10.32614/CRAN.package.googledrive
  • 57. Schoch CL, Ciufo S, Domrachev M, et al.: NCBI taxonomy: a comprehensive update on curation, resources and tools. Database (Oxford). 2020;2020:baaa062. 10.1093/database/baaa062
  • 58. Chamberlain SA, Szöcs E: Taxize: taxonomic search and retrieval in R [version 2; peer review: 3 approved]. F1000Res. 2013;2:191. 10.12688/f1000research.2-191.v2
  • 59. Hijmans RJ, Barbosa M, Ghosh A, et al.: Geodata: download geographic data. 2023. 10.32614/CRAN.package.geodata
  • 60. Kreuder Johnson C, Hitchens PL, Smiley Evans T, et al.: Spillover and pandemic properties of zoonotic viruses with high host plasticity. Sci Rep. 2015;5(1):14830. 10.1038/srep14830
  • 61. Olival KJ, Hosseini PR, Zambrana-Torrelio C, et al.: Host and viral traits predict zoonotic spillover from mammals. Nature. 2017;546(7660):646–650. 10.1038/nature22975
  • 62. Gibb R, Albery GF, Becker DJ, et al.: Data proliferation, reconciliation, and synthesis in viral ecology. Bioscience. 2021;71(11):1148–1156. 10.1093/biosci/biab080
  • 63. Baele G, Suchard MA, Rambaut A, et al.: Emerging concepts of data integration in pathogen phylodynamics. Syst Biol. 2017;66(1):e47–65. 10.1093/sysbio/syw054
  • 64. Didelot X, Fraser C, Gardy J, et al.: Genomic infectious disease epidemiology in partially sampled and ongoing outbreaks. Mol Biol Evol. 2017;34(4):997–1007. 10.1093/molbev/msw275
  • 65. Lecointre G, Philippe H, Vân Lê HL, et al.: Species sampling has a major impact on phylogenetic inference. Mol Phylogenet Evol. 1993;2(3):205–24. 10.1006/mpev.1993.1021
  • 66. Dickersin K, Min YI: NIH clinical trials and publication bias. Online J Curr Clin Trials. 1993; 4967-words.
  • 67. The pathogen harmonized observatory (PHAROS): The viral emergence research initiative. 2024.
Wellcome Open Res. 2025 Sep 9. doi: 10.21956/wellcomeopenres.27303.r129419

Reviewer response for version 2

Richard Ostfeld 1

The paper describes a protocol to produce a database called “Project ArHa” that consists of published records of direct or indirect detections of Arenaviruses and Hantaviruses in rodents and shrews worldwide. Such a database organizes and curates published associations between these small mammal hosts and two epidemiologically important groups of viruses, each with zoonotic members.  The authors expect the database to be applied to studies of human disease risk, including the causes of zoonotic spillover and mitigation of disease outbreaks. Given the importance of rodents and shrews as hosts of numerous pathogens, including zoonotic viruses, such a database would help correct the current biased emphasis on other mammalian orders.

The protocol includes detailed descriptions of the strategy for conducting literature searches, screening studies to facilitate decisions on inclusion, extracting the data from included studies, and processing the data before posting/publishing.  The study evinces considerable care in attending to many pitfalls inherent to the construction of similar databases addressing the occurrences of pathogens in hosts.  These pitfalls, well described in the paper, include issues such as how to interpret non-detections of a pathogen in a host, how to detect and address sampling bias and publication bias, and how to reconcile conflicting diagnostic methods and conclusions. Responding to the first round of review, the authors altered their search strategy and data processing protocols to further safeguard against incorporation of biases and address uncertainties in data quality.  I would expect the communities of disease ecologists and epidemiologists to pounce on the ArHa database, as has happened with a much more taxonomically diverse forebear, the Global Mammal Parasite Database (Stephens et al. 2017), which as of 14 August 2025 has been cited 122 times (Google Scholar statistic).   

My major concern with the study is with the potential for inappropriate uses of the database.  On the one hand, once the well-curated database is available for use, misuses can hardly be blamed on the authors.  On the other hand, the authors could help prevent unfortunate applications by better addressing both the motivations for creating the database and limitations in its use. The paper repeatedly stresses improving understanding of risk of pathogen transmission and spillover. Indeed, the database seems designed specifically to stimulate studies to address potential “spillover hotspots”, their ecological causes, and possible interventions.  Unfortunately, the database consists of detections of hantaviruses or arenaviruses in particular small-mammal hosts; it does not address transmission of the viruses by those hosts. It is easy to misuse data on pathogen detection when prevalence (or seroprevalence) in a population or species of host does not reflect the magnitude of onward transmission (including spillover) by that population or species.  As suggested by recent literature in disease ecology, users of the database could easily conflate the two. The elephant shrew in the room is that mere detection of a pathogen in a host is not tantamount to that host being important in transmission of that pathogen to other hosts.  For example, Sin Nombre Virus has been detected in wild populations of Mus musculus and Tamias minimus in New Mexico at prevalences similar to that of Peromyscus maniculatus, but with no corresponding evidence that M. musculus or T. minimus can act as “reservoir hosts”, amplifying the virus or contributing to spillover transmission (Goodfellow et al. 2021). Sciurus carolinensis is frequently infected with the bacterial agent of Lyme disease, but its net effect is to reduce overall transmission risk of this tick-borne pathogen to humans (Keesing et al. 2009, Levi et al. 2016). 
Distinguishing between the mere presence of a virus (or antibodies to a virus) in a host population and the ability of that host to transmit the virus onward is perhaps beyond the scope of the database.  But I would suggest that the authors not miss the opportunity to include this as a key limitation of their effort. 

I am concerned about the ability of the general search terms in Table 1 to retrieve the relevant literature. For instance, papers on squirrels, chipmunks, beavers, marmots, gophers, gerbils, jerboas, etc., might not be picked up with the search terms “rodent”, “rat”, and “mouse”.  I was confused about whether the database was intended to include only shrews within the eulipotyphlans or all eulipotyphlans. Although the term “insectivore” refers both to the Order Insectivora and a dietary category, use of this term might catch many relevant papers that would otherwise be missed.  “Small mammal” might be another important, general search term. I was also confused as to why mention of shrews as one of the key host taxa was dropped from the “Data extraction” section onwards.  I did not understand the insistence that the dataset would be “static”, suggesting that there would be no efforts to update it.  The care applied to the literature search, screening, extracting, and processing would seem to lend itself to a workflow that could be reapplied and updated at reasonable intervals, say every five years.

The paper consists of an Introduction, Methods, and Discussion, but no Results. It seems that the authors intend for the community of users to generate worthy results based on their database and wish to devote detailed attention to their protocol to foster transparency and utility.  Nevertheless, a paper proposing the development of the database seems less useful than one that includes the database itself, or that uses the database to address patterns or hypotheses.

Is the study design appropriate for the research question?

Yes

Is the rationale for, and objectives of, the study clearly described?

Yes

Are sufficient details of the methods provided to allow replication by others?

Yes

Are the datasets clearly presented in a useable and accessible format?

Not applicable

Reviewer Expertise:

Community ecology, disease ecology, population ecology, biology of mammals

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

References

  • 1. Goodfellow et al.: Tracing Transmission of Sin Nombre Virus and Discovery of Infection in Multiple Rodent Species. Journal of Virology. 2021;95(23). 10.1128/JVI.01534-21
  • 2. Keesing et al.: Hosts as ecological traps for the vector of Lyme disease. Proceedings of the Royal Society B: Biological Sciences. 2009;276(1675):3911-3919. 10.1098/rspb.2009.1159
  • 3. Levi et al.: Quantifying dilution and amplification in a community of hosts for tick‐borne pathogens. Ecological Applications. 2016;26(2):484-498. 10.1890/15-0122
  • 4. Stephens et al.: Global Mammal Parasite Database version 2.0. Ecology. 2017;98(5). 10.1002/ecy.1799
Wellcome Open Res. 2025 Sep 8. doi: 10.21956/wellcomeopenres.27303.r129413

Reviewer response for version 2

Phelipe Magalhães Duarte 1

The manuscript, titled "Protocol to produce a systematic Arenavirus and Hantavirus host-pathogen database: Project ArHa," aimed to build a global, standardized, and relational dataset of arenavirus and hantavirus detections in wild-caught rodents and shrews, integrating ecological, epidemiological, and genomic information.

The study uses appropriate language, is coherent, and is technically sound.

The manuscript sequence is coherent and logical.

The methodology is adequate.

The results are based on scientific findings and are replicable.

Suggestion:

Keywords could be improved.

The purpose of keywords is to increase the manuscript's reach in databases and through search methods based on the topic.

I suggest including keywords in the references that are not in the title or abstract to increase reach.

Is the study design appropriate for the research question?

Yes

Is the rationale for, and objectives of, the study clearly described?

Yes

Are sufficient details of the methods provided to allow replication by others?

Yes

Are the datasets clearly presented in a useable and accessible format?

Yes

Reviewer Expertise:

virology, one health

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Wellcome Open Res. 2025 Jun 3. doi: 10.21956/wellcomeopenres.26520.r123245

Reviewer response for version 1

Roger Hewson 1

Summary: The manuscript presents a protocol for the development of ArHa, a unified, spatially and temporally explicit database synthesising published data on Arenaviruses and Hantaviruses in wild-caught rodents and shrews. These zoonotic viruses pose significant public health threats, yet their ecology and spillover risks remain under characterised, particularly in Africa and parts of Asia.

The proposed ArHa dataset will collate and harmonise data on host sampling, pathogen detection, and genomic sequences from global literature using a systematic review approach. Key features include metadata on sampling effort, species identification, detection assays and geospatial resolution. This enables users to assess sampling bias, prevalence, and host-pathogen associations across space and time. The dataset will be developed in R, made openly accessible via GitHub and Zenodo and adhere to FAIR principles. It is designed to support predictive modelling, ecological research, and public health planning.

The rationale for, and objectives of, the study are well described. The manuscript articulates a strong rationale grounded in:

- The zoonotic threat posed by arenaviruses and hantaviruses;

- Sparse and fragmented ecological data, particularly in LMICs;

- Increasing anthropogenic drivers of spillover (land-use change, biodiversity loss, climate change);

- The need for integrated ecological and genomic data to support risk assessments and public health interventions.

Objectives are clearly stated and include: constructing a global, standardised, relational dataset; setting out sampling effort and detection biases; linking rodent, pathogen and genomic data across time and geography; and ensuring data interoperability through taxonomic harmonisation and public release via FAIR-compliant platforms.

The study design is appropriate and well aligned with objectives:

- A systematic review approach using two major literature databases (PubMed and Web of Science);

- Transparent inclusion/exclusion criteria, screening via Rayyan and documentation using a PRISMA flowchart;

- Extraction of three linked datasets: small mammal sampling, pathogen testing results and genomic sequences;

- Metadata capture on spatial resolution, sampling effort and methodological details (assay type, species identification);

- Harmonisation of species names (via GBIF and NCBI) and pathogen names (via ICTV) to ensure taxonomic consistency.

The design is appropriate for developing existing literature into a coherent, workable 'data product' that can serve ecological, virological and public health purposes.

The planned datasets are clearly presented, and the authors acknowledge limitations of the study such as data incompleteness, the static nature of the dataset, and potential publication bias. To address these, they propose mitigations such as flagging imputed data and encouraging data deposition to GBIF.

Scientific points to address

While the objectives are described within the text, they could be more clearly listed in bullet or numbered format near the end of the introduction or start of the methods for clarity.

It would be useful if the authors could identify how ambiguous or inconsistent taxonomic identifications will be handled. Similarly, the treatment of conflicting pathogen detection results (e.g. seropositive but PCR-negative individuals) should be described.

Overall the manuscript is a scientifically robust and well-articulated protocol for developing a valuable global resource on arenavirus and hantavirus ecology.

Is the study design appropriate for the research question?

Yes

Is the rationale for, and objectives of, the study clearly described?

Yes

Are sufficient details of the methods provided to allow replication by others?

Yes

Are the datasets clearly presented in a useable and accessible format?

Yes

Reviewer Expertise:

Virology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Wellcome Open Res. 2025 Aug 5.
David Simons 1

Thank you for your constructive and detailed summary of our protocol. Your feedback has helped us clarify and improve several key aspects of our methodology. Following your suggestions, we have revised the manuscript to:

  1. List objectives explicitly: We have added a bulleted list of study objectives in the introduction to improve clarity and readability.

  2. Refine data processing: We have provided a detailed, reproducible set of decision rules for taxonomic harmonization, which will ensure consistency and transparency in how we handle ambiguous species names.

  3. Clarify handling of conflicting results: We added a section to explain our approach to conflicting pathogen detection data. We will preserve all assay results, acknowledging that different tests capture distinct biological signals, and will maintain a one-to-many data structure between host and pathogen records.

These revisions directly address your points and significantly enhance the scientific rigor and transparency of our protocol.
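As an illustration of the one-to-many structure described in point 3, the following minimal Python sketch keeps every assay result attached to its host record and only flags disagreement rather than resolving it. The class names, field names, and example values are hypothetical and are not taken from the ArHa schema; the project itself is developed in R.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class AssayResult:
    """One pathogen test applied to a host record (hypothetical fields)."""
    pathogen: str
    assay: str        # e.g. "serology" or "PCR"
    positive: bool


@dataclass
class HostRecord:
    """One host record that can carry many assay results (one-to-many)."""
    host_id: str
    species: str
    assays: List[AssayResult] = field(default_factory=list)

    def add(self, result: AssayResult) -> None:
        # Every result is stored; nothing is overwritten or collapsed.
        self.assays.append(result)

    def discordant_pathogens(self) -> List[str]:
        """Return pathogens whose assays disagree, without picking a winner."""
        outcomes: Dict[str, Set[bool]] = {}
        for r in self.assays:
            outcomes.setdefault(r.pathogen, set()).add(r.positive)
        return sorted(p for p, seen in outcomes.items() if len(seen) > 1)


# Illustrative values only: a seropositive but PCR-negative host.
host = HostRecord("host-001", "Mastomys natalensis")
host.add(AssayResult("Lassa virus", "serology", True))
host.add(AssayResult("Lassa virus", "PCR", False))
```

Here `host.discordant_pathogens()` lists the pathogens with conflicting results while both underlying records remain available, so a user could analyse past exposure (serology) and active infection (PCR) separately.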

Wellcome Open Res. 2025 May 28. doi: 10.21956/wellcomeopenres.26520.r122839

Reviewer response for version 1

Peter Daszak 1

Review of Simons et al. Project ArHa

The paper describes a database to collate information on arenaviruses and hantaviruses. These viruses include some significant zoonotic threats to human health with high case loads (e.g. Lassa fever), and high mortality (e.g. Hantavirus pulmonary syndrome) and that are distributed widely (Asia, Africa, Americas). Despite this, data on the distribution of the viruses in their rodent reservoirs is fragmented, not systematically collected, and either privately held or deposited in a range of online databases. Likewise case reports and outbreak investigation data are often deposited in a non-systematic way, making it difficult to analyze distribution, assess biases in reporting and surveillance, and conduct predictive modeling. There are some notable exceptions, but these usually focus on single viruses, clusters of case reports over short time periods, or test specific hypotheses where the viral data often are scant compared to the environmental, demographic or other correlated data in the analysis (e.g. rodent hantaviral prevalence vs. climate, Lassa fever human cases vs. demography).

There is huge utility for databases on zoonotic viruses that have wildlife reservoirs and are sensitive to environmental and demographic factors for their emergence. Arenaviruses and hantaviruses cause diseases that appear to have cycles of high prevalence (e.g. HPS in the Southeastern USA, Lassa fever in West Africa) and for which causal correlates are complicated to analyze and require large, multi-year datasets.

The search protocols, data inclusion and exclusion criteria, data extraction and processing protocols are logical, and deal appropriately with all of the typical pitfalls that these data present (e.g. lack of consistency in sampling denominator reporting, species misidentification).

The goal is very important and useful to global health, disease ecology, surveillance and predictive modeling, as well as to potentially analyzing the drivers of spillover risk. The approach is valid. The key weaknesses that should probably be addressed in more detail are:

 

  1. There may be substantial data on prevalence in papers from regional journals, published over a decade ago, that may not be indexed in PubMed or WoS

  2. Many non-English language papers will be of relevance and an approach to include these may make the database more powerful

  3. Large stretches of suitable habitat for potential reservoir rodent species may have had very little surveillance and this needs to be addressed, or perhaps should be one of the early studies from the database

  4. Sampling biases are likely to be substantial geographically and temporally, and to be linked to the ability to fund research and surveillance and to the availability of testing technology. It would be good to have a strategy to deal with some of this in the database design and also in the early research from the database

Is the study design appropriate for the research question?

Yes

Is the rationale for, and objectives of, the study clearly described?

Yes

Are sufficient details of the methods provided to allow replication by others?

Partly

Are the datasets clearly presented in a useable and accessible format?

Not applicable

Reviewer Expertise:

Emerging zoonotic diseases. Wildlife reservoirs of zoonoses. Viral emergence.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Wellcome Open Res. 2025 Aug 5.
David Simons 1

Thank you for your valuable feedback. We agree that a comprehensive database must address the limitations of traditional search methods and sampling biases. In response to your comments, we have updated the protocol to:

  1. Expand the search strategy: We will now conduct a second round of searches in regional databases such as African Journals Online (AJOL) and SciELO to capture literature not indexed in global databases.

  2. Improve data inclusivity: Our protocol confirms we do not exclude non-English language papers and will use both automated tools and native speakers from our network for translation.

  3. Address sampling bias: We have added a section detailing our plan to create a spatiotemporal raster of sampling effort. This will allow us to map surveillance gaps and enable more robust, bias-aware predictive modeling.

These changes strengthen our methodology and ensure the resulting dataset will be a more accurate and representative resource for the research community.
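To make the idea of a spatiotemporal raster of sampling effort concrete, the following Python sketch bins sampling records into latitude/longitude cells per year and sums the number of animals sampled. The function name, the 1-degree cell size, the record layout, and all coordinates and counts are assumptions for illustration only; the protocol does not specify the resolution or implementation, and the project itself is developed in R.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (latitude, longitude, year, number of animals sampled): hypothetical layout
Record = Tuple[float, float, int, int]
Cell = Tuple[int, int, int]


def effort_raster(records: Iterable[Record], cell_deg: float = 1.0) -> Dict[Cell, int]:
    """Sum sampling effort per (lat cell, lon cell, year) grid cell."""
    grid: Dict[Cell, int] = defaultdict(int)
    for lat, lon, year, n_sampled in records:
        # floor() gives consistent cell indices for negative coordinates too.
        key = (math.floor(lat / cell_deg), math.floor(lon / cell_deg), year)
        grid[key] += n_sampled
    return dict(grid)


# Made-up records, not drawn from any study.
records = [
    (8.5, -11.2, 2019, 40),
    (8.9, -11.7, 2019, 25),
    (8.5, -11.2, 2020, 10),
    (-1.3, 36.8, 2019, 5),
]
raster = effort_raster(records)
```

Cells absent from `raster` received no recorded effort in that year, so `raster.get(cell, 0)` over a region of suitable habitat would highlight candidate surveillance gaps.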

Associated Data


    Data Availability Statement

    No data are associated with this article. All data products of this project will be made available on GitHub (https://github.com/DidDrog11/arenavirus_hantavirus) and a finalized product will be archived on Zenodo.

    OSF: PRISMA-P checklist for Protocol to produce a systematic Arenavirus and Hantavirus host-pathogen database: Project ArHa. https://doi.org/10.17605/OSF.IO/23KTF


    Articles from Wellcome Open Research are provided here courtesy of The Wellcome Trust
