PLOS Pathogens. 2020 Nov 30;16(11):e1009079. doi: 10.1371/journal.ppat.1009079

Global discovery of human-infective RNA viruses: A modelling analysis

Feifei Zhang 1,*, Margo Chase-Topping 2,3, Chuan-Guo Guo 4, Bram A D van Bunnik 1,2, Liam Brierley 5, Mark E J Woolhouse 1,2
Editor: Stephen Morse 6
PMCID: PMC7728385  PMID: 33253277

Abstract

RNA viruses are a leading cause of human infectious diseases and the prediction of where new RNA viruses are likely to be discovered is a significant public health concern. Here, we geocoded the first peer-reviewed reports of 223 human RNA viruses. Using a boosted regression tree model, we matched these virus data with 33 explanatory factors related to natural virus distribution and research effort to predict the probability of virus discovery across the globe in 2010–2019. Stratified analyses by virus transmissibility and transmission mode were also performed. The historical discovery of human RNA viruses has been concentrated in eastern North America, Europe, central Africa, eastern Australia, and north-eastern South America. Virus discovery can be predicted by a combination of socio-economic, land use, climate, and biodiversity variables. Remarkably, vector-borne viruses and strictly zoonotic viruses are more associated with climate and biodiversity whereas non-vector-borne viruses and human transmissible viruses are more associated with GDP and urbanization. The areas with the highest predicted probability for 2010–2019 include three new regions (East and Southeast Asia, India, and Central America), which likely reflect both increasing surveillance and the diversity of their virome. Our findings can inform priority regions for investment in surveillance systems for new human RNA viruses.

Author summary

There is a lack of evidence on the factors driving the discovery of RNA viruses globally. Here, we recorded the initial discovery sites of all 223 human RNA viruses and revealed their global distribution pattern. Using a machine learning method, we found that virus discovery was driven by a combination of variables describing socio-economic level, land use, climate and biodiversity, with GDP and GDP growth found to be the two leading predictors. We also predicted the probability of virus discovery in 2010–2019 across the globe, and identified three new areas (East and Southeast Asia, India, and Central America) in addition to the historical high-risk areas. Further stratified analyses (distinguishing viruses transmissible in humans or strictly zoonotic, and vector-borne or non-vector-borne) helped pinpoint the explanatory factors for the discovery of specific categories of viruses and confirm the plausibility of the model. The results of our study advance understanding of the spatial distribution of human RNA virus discovery and map the likelihood of further discoveries across the world. By identifying where new viruses are most likely to be discovered in the near future, the study helps identify priority areas for surveillance.

Introduction

Since the first identification of a virus in humans (yellow fever virus in 1901), viruses have been recognised as a leading cause of human infectious diseases [1]. Numerous human diseases, from the common cold [2] to life-threatening haemorrhagic fevers [3], are caused by RNA viruses. RNA viruses such as dengue virus, norovirus, and HIV impose significant burdens on global health and the global economy [4–6]. Despite the striking declines in the incidence and mortality of RNA virus-related diseases in humans following the introduction of vaccination, infections due to measles virus, yellow fever virus, and Japanese encephalitis virus continue to endanger human health and cause hundreds to thousands of deaths each year [7–9], particularly in countries with limited resources to launch mass vaccination campaigns.

Human RNA viruses comprise a total of 214 International Committee on Taxonomy of Viruses (ICTV)-recognised species as of July 2017, classified into 55 genera and 22 families [1]. Many of these—such as rabies virus, dengue virus, and measles virus—have circulated in humans for thousands of years [6,9,10], though some—such as HIV-1 and SARS coronavirus—have emerged much more recently. Typically, a virus is identified through investigation of the aetiology of a human disease (e.g. yellow fever virus [11], measles virus [12]), although some have been identified during active virus discovery programmes (e.g. rotavirus C [13], parechovirus B [14]). Viruses such as hepatitis delta virus [15] and Highlands J virus [16] were discovered by chance, as incidental findings as part of a disease investigation.

The discovery curve of human viruses, for both RNA viruses and DNA viruses, was described for the first time in 2008 [17]. Up to nine new human virus species have been detected each year since the 1950s, and this is projected to continue in coming decades [17]. The factors driving the discovery of human viruses remain to be elucidated, though two previous studies have identified predictors of the emergence of infectious diseases more generally [18,19]. In this paper, we take a spatiotemporal modelling approach to identify explanatory factors influencing the discovery of RNA viruses in humans. We assume virus discovery is determined by two underlying spatiotemporal patterns: the geographical distribution of viruses in nature, and the process of virus detection—a human activity. Geographical ranges, which vary from worldwide (e.g. Norwalk virus [4], HIV-1 [5]) to very localised (e.g. Hendra virus [20], Menangle virus [21]), are mostly determined by virus natural history, vector distribution (for vector-borne viruses), and non-human host distribution(s) (for zoonotic viruses) [22]. In contrast, virus detection reflects scientific resources and research effort [18]. An uneven distribution of research effort will lead to an uneven distribution of virus discoveries. Geographical ranges and discovery effort are likely to have different drivers [23]. Previous studies [18,24] have attempted to allow for variation in discovery effort, although this is hard to do as no direct and effective measures are available. Here, we take a different approach by identifying explanatory factors of the raw virus discovery data and then interpreting in the discussion whether these effects might relate to virus geographic range or discovery effort or both.

Materials and methods

Methods overview

In this study, we followed methods and used code derived from Allen et al. [19]. We compiled and geocoded the first reports in the peer-reviewed literature of human infection for each RNA virus in our database over a period of 118 years from 1901 to 2018. A Poisson boosted regression tree (BRT) model, a method that handles spatially dependent data well, was fitted to the human RNA virus data with a set of variables thought to be potential explanatory factors. By matching the virus discovery count and all explanatory factors in each 1° resolution grid cell (approximately 110 km at the equator) by decade, we ranked the contribution of each explanatory factor to the predictions. We then used the parameter estimates from the best fitting BRT model to predict the probability of virus discovery for all grid cells across the globe in 2010–2019 using the values of all explanatory factors in 2015. We also conducted stratified analyses (distinguishing viruses transmissible in humans or strictly zoonotic, and vector-borne or non-vector-borne) to find the explanatory factors for the discovery of specific categories of viruses.

Data source of human RNA viruses and updating

Data on human RNA viruses were derived from an updated version of our previously published database (https://datashare.is.ed.ac.uk/handle/10283/2970), which contains 214 viruses, with discovery dates between 1901 and 2017. Search terms, databases searched, and inclusion or exclusion criteria for data collection were provided in our previous paper [1]. The updated version to 2018 includes nine additional human virus species recently recognised by ICTV or newly added to the database: Nairobi sheep disease orthonairovirus, Achimota virus 2, Menangle rubulavirus, Madariaga virus, Pegivirus H, Central chimpanzee simian foamy virus, Guenon simian foamy virus, Enterovirus H and Orthohepevirus C (S1 Table). The metadata provide information on discovery date, transmissibility, transmission route, and host range [1].

We defined “discovery” as the first report of an ICTV-recognised RNA virus species from human(s) in the peer-reviewed literature, and the location of initial human exposure/infection with the virus was taken as the discovery location. When this location was not given in the original paper, the site of the research laboratory was used as the discovery location (n = 3). If neither the human exposure/infection location nor the research laboratory site was available, the address of the first author was used as the discovery location instead (n = 19). In our database, locations of initial human exposure/infection were used for 201 (90%) viruses (S1 Table) and none of these infections was contracted while travelling. The locations were georeferenced as precisely as possible according to the original literature, ranging from precise coordinates of points to polygon-level data (e.g., city, county, district, state, or country) (see S1 Text for details). For unspecified locations covering more than one grid cell (S2 Table), sampling was used in our bootstrap framework as described below.
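The fallback order for choosing a discovery location can be written as a small lookup. This is only an illustrative sketch: the record layout and the field names (`exposure_location`, `laboratory_location`, `first_author_address`) are hypothetical and are not taken from the published database.

```python
# Hypothetical field names, listed in the priority order described in the text.
LOCATION_PRIORITY = (
    "exposure_location",     # site of initial human exposure/infection
    "laboratory_location",   # site of the research laboratory
    "first_author_address",  # address of the first author
)

def discovery_location(record):
    """Return (location, source_field) using the first available field."""
    for field in LOCATION_PRIORITY:
        value = record.get(field)
        if value:  # skips missing keys, None and empty strings
            return value, field
    raise ValueError("no usable location in record")
```

Returning the source field alongside the location makes it straightforward to count how many records fell back to each rule, or to drop the fallback records in a sensitivity analysis.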

Spatial explanatory factors

A set of 33 variables potentially affecting the spatial distribution of RNA virus discovery were collated and used as explanatory factors. Full details of sources and original resolutions, along with the definitions, are provided in S3 Table. The variables were assigned to four groups: climatic, socio-economic, land use, and biodiversity. We expect variables such as GDP, GDP growth and university count to be correlated with discovery effort, as they imply more resources that could be invested in virus research [25,26]. The other groups of variables (land use, climate, and biodiversity) are more likely to be related to the natural geographic range of the virus [27], i.e. these variables will affect discovery via the intermediate step of emergence.

All explanatory factors and virus locations were matched by a 1° spatial grid cell, having rescaled or transformed the data where necessary (details of data transformation are provided in S2 Text). Our model matched the RNA virus discovery count in each grid cell with historical decadal climatic variables, population, GDP, and land use data (described below), so we extrapolated the data for these variables back to 1901 (see S2 Text for details).
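As an illustration of matching point locations to a 1° grid, the sketch below maps a latitude/longitude pair to a cell index and estimates the cell's east-west width. The indexing scheme (rows counted from 90°S, columns from 180°W) and the constant of 111.32 km per degree are assumptions made for this example; the rescaling and transformation actually used are described in S2 Text.

```python
import math

def grid_cell(lat, lon, resolution=1.0):
    """Return the (row, col) index of the grid cell containing a point.

    Rows count up from latitude -90, columns from longitude -180.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    n_rows = int(round(180.0 / resolution))
    n_cols = int(round(360.0 / resolution))
    # Clamp so that lat = 90 or lon = 180 falls in the last cell.
    row = min(int(math.floor((lat + 90.0) / resolution)), n_rows - 1)
    col = min(int(math.floor((lon + 180.0) / resolution)), n_cols - 1)
    return row, col

def cell_width_km(lat, resolution=1.0):
    """Approximate east-west cell width at a latitude, on a spherical Earth."""
    km_per_degree_equator = 111.32  # assumed constant; the text rounds to ~110 km
    return km_per_degree_equator * resolution * math.cos(math.radians(lat))
```

The width function shows why "approximately 110 km" holds only at the equator: at 60° latitude a 1° cell is roughly half as wide.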

BRT modelling approach

By fitting a Poisson BRT model, we estimated the relative risk of RNA virus discovery for each 1° resolution grid cell across the world as a function of the 33 explanatory factors. BRT is a tree-based machine learning method that is increasingly used in ecological studies [28,29]. It applies the technique of boosting to combine many simpler tree models adaptively, which improves predictive performance [30,31]. Tree-based learning methods are useful tools for modelling non-linear relationships and higher order interactions between variables. In addition, BRT handles spatially dependent data well, as it can capture complex structures within the data that many other modelling methods cannot [32]. We calculated Moran’s I (an index of spatial dependence) to estimate the ability of the BRT model to account for spatial dependence in the virus data, using package spdep in R v. 3.5.1 (fixed distance weights were generated based on spherical distance, with cut-off values ranging from 1 to 30 times the width of a 1° resolution grid cell at the equator, i.e. 110 km to 3300 km) [33]. Unlike traditional significance-based approaches, BRT assesses the individual effect of each variable by estimating its relative importance to the predictions.
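To make the spatial dependence index concrete, the following sketch computes Moran's I with fixed-distance weights based on great-circle distance. It is a plain Python illustration and not the spdep code used in the study; in particular, it assumes binary weights (1 within the cut-off, 0 otherwise) and a spherical Earth of radius 6371 km, whereas the weighting style actually used is not stated in the text.

```python
import math

def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def morans_i(points, values, cutoff_km):
    """Moran's I with binary weights: w_ij = 1 if i != j and d_ij <= cutoff."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    num = 0.0
    w_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and haversine_km(points[i], points[j]) <= cutoff_km:
                num += dev[i] * dev[j]
                w_sum += 1.0
    if w_sum == 0 or denom == 0:
        raise ValueError("Moran's I undefined: no neighbours or zero variance")
    return (n / w_sum) * (num / denom)
```

Values near zero indicate little spatial dependence, positive values indicate clustering of similar values, and negative values indicate that neighbouring cells tend to differ. Applying the function to raw counts and then to model residuals, over a range of cut-offs, mirrors the comparison described in the text.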

The bootstrap resampling approach was applied to account for spatial uncertainty in the location of virus discoveries and generated 95% quantiles. For viruses with imprecise discovery locations, one grid cell was randomly selected each time. For each grid cell with virus discovery, two grid cells with no discovery were randomly selected from all cells throughout the world that were “virus discovery free” at all time points. So, in each model, 223 grid cells with virus discovery and 446 with no virus discovery were included. We matched the virus data with all explanatory factors (using the same decade for time-varying explanatory factors, e.g. 2010 values of variables were matched with viruses discovered in 2005–2014). We assumed that the virus count in any given grid cell in each decade follows a Poisson distribution, and used the virus discovery count in each grid cell by decade as the response variable.
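One bootstrap replicate of the sampling scheme described above might be assembled as follows. This is a sketch under stated assumptions: the data structures and function names are hypothetical, the decade rule is inferred from the single example in the text (2005–2014 matched to 2010 values), and the assignment of background cells to decades is not specified in the text and is therefore left out.

```python
import random
from collections import Counter

def matching_decade(year):
    """Decade label whose covariate values are matched to a discovery year.

    Follows the example in the text: 2005-2014 -> 2010.
    """
    return ((year + 5) // 10) * 10

def draw_replicate(discoveries, candidate_cells, background_cells, rng, ratio=2):
    """Build one bootstrap replicate.

    discoveries      : list of (virus_id, year) tuples
    candidate_cells  : dict virus_id -> list of possible grid cells
                       (a single entry when the location is precise)
    background_cells : cells with no discovery at any time point
    Returns (counts, absences): discovery counts keyed by (cell, decade),
    and `ratio` background cells per discovery, drawn without replacement.
    """
    counts = Counter()
    for virus_id, year in discoveries:
        cell = rng.choice(candidate_cells[virus_id])  # resolves imprecise locations
        counts[(cell, matching_decade(year))] += 1
    absences = rng.sample(background_cells, ratio * len(discoveries))
    return counts, absences
```

With 223 discoveries and a ratio of 2, each replicate draws 446 background cells, matching the counts given in the text. Repeating the draw with different random states is what propagates location uncertainty into the 95% quantiles.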

Using bootstrap resampling, we fitted 1000 replicate BRT models and generated relative contribution plots and partial dependence plots with 95% quantiles. The relative contribution, or the influence/weight, of each variable is an indicator of that variable’s importance for predicting virus discovery counts. The relative contributions of all variables of a BRT model sum to 100%, with higher numbers indicating stronger influence on the response. We defined the most influential explanatory factors as those whose relative contributions were greater than the mean level (i.e. 100% divided by the number of explanatory factors; in this study, 100%/33 = 3.03%) [28]. Partial dependence plots are a method of visualizing the relationships between a BRT’s predictive variables and its outcome after accounting for the average effects of all other variables. The means of the predictions of all 1000 models were used to predict the probability of virus discovery across the globe in 2010–2019, using 2015 values of the 33 explanatory factors. Using the equation of the Poisson probability distribution, we converted the continuous prediction map to a probability map. We used the packages dismo and gbm in R v. 3.5.1 to fit BRT models. Parameters including tree complexity (reflecting the number of nodes in a tree), learning rate (shrinking the contribution of each added tree), and bag fraction (specifying the proportion of data to be selected at each step) were set following Elith et al. [31] to make sure each resampling model contained at least 1000 trees. The final parameters of the optimal model had the following values: tree complexity = 5, learning rate = 0.003, bag fraction = 0.5. A cross-validation stagewise function was used to identify the optimal number of trees in each model [31]. With these parameters, the 1000 replicate BRT models fitted a mean of 1214 trees.
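Two of the calculations in this paragraph are simple enough to sketch directly. The first applies the "greater than the mean contribution" rule. The second converts a predicted Poisson mean count into a probability; the text does not give the exact equation used, so the sketch assumes the probability of at least one discovery, P(N ≥ 1) = 1 − exp(−λ), which is one natural reading. Function names are illustrative.

```python
import math

def influential(contributions):
    """Keep variables whose relative contribution exceeds the mean level.

    `contributions` maps every variable in the model to its relative
    contribution in percent (summing to 100), so the mean is 100 / n.
    """
    threshold = 100.0 / len(contributions)
    return {name: c for name, c in contributions.items() if c > threshold}

def prob_at_least_one(expected_count):
    """P(N >= 1) for N ~ Poisson(expected_count); an assumed conversion."""
    if expected_count < 0:
        raise ValueError("expected count must be non-negative")
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-expected_count)
```

For 33 variables the threshold is 100/33, about 3.03%. Because most grid cells have predicted counts far below 1, the converted probability is close to the predicted count itself in those cells.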

The model’s predictive performance was assessed by calculating the deviance of the bootstrap model, as well as by conducting 50 rounds of ten-fold cross-validation. Details of model validation are provided in S3 Text and S4 Table.
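For readers unfamiliar with the two validation ingredients named here, the sketch below shows a mean Poisson deviance and a ten-fold split. It is a generic illustration, not the procedure in S3 Text: the deviance is averaged per observation and the folds are formed by a simple shuffle, both of which are assumptions.

```python
import math
import random

def poisson_deviance(observed, predicted):
    """Mean Poisson deviance: 2 * mean(y*log(y/mu) - (y - mu)), with 0*log(0) = 0."""
    total = 0.0
    for y, mu in zip(observed, predicted):
        if mu <= 0:
            raise ValueError("predicted means must be positive")
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (term - (y - mu))
    return total / len(observed)

def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and split them into k folds of near-equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]
```

In each round, a model would be fitted on nine folds and its deviance evaluated on the held-out fold; repeating the split with a fresh shuffle gives the multiple rounds of cross-validation. Lower deviance indicates predictions closer to the observed counts.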

We also performed sensitivity analyses by i) using data from 1980 to 2000 only (as explanatory variables are available without extrapolation only for this period), and ii) removing the 22 discovery reports that were not locations of infected humans (as these are less precise). Model parameters are provided in S5 Table.

Stratified analysis

Two stratified analyses were conducted to find explanatory factors specific to discoveries of different categories of virus. The first stratified analysis distinguished 131 viruses that are strictly zoonotic (all human infections are acquired from an infection in a non-human reservoir) and the 92 viruses that can spread within human populations (i.e. are transmissible, directly or indirectly, between humans) (S1 Table), based on previously published data [34]. A second stratified analysis was performed separately for 93 vector-borne viruses and 130 non-vector-borne viruses (S1 Table). We used the same BRT modelling approach for stratified analyses as we described before, and relative contribution plots and partial dependence plots with 95% quantiles were drawn for each category of virus. Model parameters are provided in S5 Table. Based on stratified BRT models, predictions of discovery probability for each category of viruses in 2010–2019 were also performed by using 2015 values of the 33 explanatory factors.

All statistical analyses were performed using R software, version 3.5.1 (R Foundation for Statistical Computing, Vienna, Austria), and all maps were visualised by using ArcGIS Desktop 10.5.1 (Environmental Systems Research Institute). The world shapefile used in the study was obtained from Data and Maps for ArcGIS (formerly Esri Data & Maps, https://www.arcgis.com/home/group.html?id=24838c2d95e14dd18c25e9bad55a7f82#overview) under a CC-BY license (S4 Text). Raw data and supporting R scripts used to generate figures for the full model are presented in S1 R script.

Results

The five regions with the highest virus count were eastern North America, Europe, central Africa, eastern Australia, and north-eastern South America (Fig 1A). Strictly zoonotic viruses and vector-borne viruses were mostly discovered in central Africa and north-eastern South America, while transmissible viruses and non-vector-borne viruses were mostly discovered in eastern North America and Europe (S1 Fig). The cumulative discovery count increased slowly before the 1950s, and thereafter increased at a constant rate (Fig 1B). The rate of discovery varies by geographic region (S1 Video). More viruses have been discovered in North America and Europe, but the numbers have decreased in recent decades. By contrast, an increased number of viruses have been discovered in Asia. Transmissible viruses and non-vector-borne viruses showed a temporal pattern similar to the curve for all human RNA viruses, with an obvious increase in 1950 (S1 Fig). Strictly zoonotic viruses and vector-borne viruses showed a similar pattern in the early phase, with an obvious increase in 1925, but the number of new vector-borne viruses decreased after 1980 (S1 Fig).

Fig 1. Spatiotemporal distribution of human RNA virus discovery count from 1901 to 2018.


(A) Spatial distribution. The red spots indicate discovery points or centroids of polygons (administrative regions), depending on the precision of the location provided by the original paper, with the size representing the cumulative virus species count. The centroid is the coordinate of the centre of mass of a spatial object. (B) Temporal distribution. The red curve indicates the cumulative virus species discovery count over time.

Based on the full BRT model involving all 223 viruses, twelve variables had relative contributions greater than the mean (3.03%) (Fig 2), including two socio-economic variables (GDP growth: 12.7%, GDP: 9.9%), four variables concerning urbanization [urbanized land: 8.7%, urbanization of secondary land (i.e. the percentage of land area change from secondary land to urban land; secondary land is natural vegetation that is recovering from previous human disturbance, see S3 Table for details): 4.8%, growth of urbanized land area: 3.6%, and urbanization of cropland (i.e. the percentage of land area change from cropland to urban land, see S3 Table for details): 3.3%], five climatic variables (minimum temperature: 6.3%, precipitation change: 5.0%, latitude: 4.3%, total precipitation: 3.6%, minimum precipitation: 3.5%), and one biodiversity variable (mammal species richness: 5.1%). The partial dependence plots in S2 Fig show the relationships between these explanatory factors and virus discovery. For the majority of explanatory factors, the relationship with discovery probability is non-linear, with large effects often seen over a narrow range of values. For example, discovery probability fell sharply if GDP growth was negative, and for very low GDP and low percentage of urbanized land, whereas it rose sharply for high minimum temperature and high mammal richness.

Fig 2. Relative contribution of explanatory factors to human RNA virus discovery in the full model.


The boxplots show the median (black bar) and interquartile range (box) of the relative contribution across 1000 replicate models, with whiskers indicating minimum and maximum and black dots indicating outliers.

Our full BRT model reduced the Moran’s I for the raw virus data from a range of 0.04–0.31 to 0.007–0.065 (S3 Fig), indicating that this modelling method with 33 explanatory factors effectively removed the spatial dependence of the model residuals. Sensitivity analyses (the analysis using data from 1980 to 2000 and the analysis after removing the 22 viruses with least certain discovery locations) revealed consistent trends with the full model, though with several changes of relative contribution.

In the transmissibility-stratified BRT model, ten variables had relative contributions greater than 3.03% for discovering strictly zoonotic viruses (Fig 3A, partial dependence plots in S4A Fig), including four climatic variables (minimum temperature: 13.1%, latitude: 6.2%, precipitation change: 5.3%, total precipitation: 3.6%), three land use variables (urbanized land: 7.7%, urbanization of secondary land: 5.6%, growth of urbanized land area: 5.2%), two socio-economic variables (GDP: 8.3%, GDP growth: 7.9%), and one biodiversity variable (mammal species richness: 5.6%). In contrast, eight variables had relative contributions greater than 3.03% for discovering viruses transmissible in humans (Fig 3B, partial dependence plots in S4B Fig), including four explanatory factors involving urbanization (urbanized land: 13.6%, urbanization of cropland: 9.3%, urbanization of secondary land: 6.6%, growth of urbanized land area: 3.6%), three socio-economic variables (GDP growth: 14.4%, GDP: 14.0%, population growth: 3.6%), and one climatic variable (minimum precipitation: 5.0%).

Fig 3. Relative contribution of explanatory factors to human RNA virus discovery in the stratified model by transmissibility.


(A) Strictly zoonotic, (B) Transmissible in humans. The boxplots show the median (black bar) and interquartile range (box) of the relative contribution across 1000 replicate models, with whiskers indicating minimum and maximum and black dots indicating outliers.

In the vector-borne-stratified BRT model, thirteen variables had relative contributions greater than 3.03% for discovering vector-borne viruses (Fig 4A, partial dependence plots in S5A Fig), including five climatic variables (minimum temperature: 17.1%, precipitation change: 7.9%, latitude: 6.2%, total precipitation: 3.8%, maximum precipitation: 3.3%), two socio-economic variables (GDP growth: 7.4%, GDP: 4.4%), one biodiversity variable (mammal species richness: 6.7%), and five land use variables (urbanization of secondary land: 4.8%, urbanized land: 4.1%, growth of cropland area: 3.7%, growth of urbanized land area: 3.6%, growth of pasture area: 3.4%). In contrast, seven variables had relative contributions greater than 3.03% for discovering non-vector-borne viruses (Fig 4B, partial dependence plots in S5B Fig), including four land use variables (urbanized land: 19.6%, urbanization of secondary land: 7.5%, urbanization of cropland: 4.5%, growth of urbanized land area: 3.5%), two socio-economic variables (GDP: 18.7%, GDP growth: 12.4%), and one climatic variable (minimum precipitation: 3.3%).

Fig 4. Relative contribution of explanatory factors to human RNA virus discovery in the stratified model by transmission mode.


(A) Vector-borne, (B) Non-vector-borne. The boxplots show the median (black bar) and interquartile range (box) of the relative contribution across 1000 replicate models, with whiskers indicating minimum and maximum and black dots indicating outliers.

The summary of the cumulative relative contribution of each group of explanatory factors to human RNA virus discovery in each model is shown in Fig 5. In comparison with non-vector-borne viruses and human transmissible viruses, the discovery of vector-borne viruses and strictly zoonotic viruses is better predicted by climatic variables and biodiversity than by socio-economic variables and land use.

Fig 5. Cumulative relative contribution of explanatory factors to human RNA virus discovery by group in each model.


The relative contributions of all explanatory factors sum to 100% in each model, and each colour represents the cumulative relative contribution of all explanatory factors within each group. The relative contribution of different groups to virus discovery varies across each model.

By applying 2015 values of all 33 explanatory factors (S6 Fig) to the fitted full BRT model, we obtained a predicted probability of human RNA virus discovery in 2010–2019 (Fig 6). Comparison with Fig 1 indicates that virus discoveries remain relatively likely in eastern North America, Europe, central Africa, eastern Australia and north-eastern South America but, in addition, we predict high probabilities of virus discovery across East and Southeast Asia, India and Central America. All eighteen new virus species since 2010 were discovered in regions of high risk as predicted by our model (75.0–99.9 percentiles of predicted probability over the global range), and eleven of them were discovered in very high-risk areas (90.0–99.9 percentiles of predicted probability over the global range). The predictions of discovery for each category of virus are shown in S7 Fig. Patterns broadly similar to the full prediction model were seen for all four categories: high probabilities of virus discoveries are predicted in East and Southeast Asia, India, and Central America in comparison with the historical distribution (S1 Fig). However, there is some variation between virus categories: strictly zoonotic viruses are more likely to be discovered in northern South America, central Africa, and Southeast Asia, while transmissible viruses are more likely to be discovered in North America, East Asia, and India (S7 Fig); and vector-borne viruses are predicted to be more likely to be discovered in northern South America, central Africa, India, and Southeast Asia than non-vector-borne viruses (S7 Fig).
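The percentile bands used to describe risk can be illustrated with a small ranking function. This sketch assumes that "percentile over the global range" means the share of all grid cells whose predicted probability is less than or equal to that of a given cell; the text does not define the calculation, and the band labels are taken from the wording above.

```python
from bisect import bisect_right

def percentile_rank(value, sorted_values):
    """Percentage of grid cells whose predicted value is <= `value`.

    `sorted_values` must be sorted in ascending order.
    """
    return 100.0 * bisect_right(sorted_values, value) / len(sorted_values)

def risk_band(rank):
    """Label a percentile rank using the cut-offs quoted in the text."""
    if rank >= 90.0:
        return "very high"
    if rank >= 75.0:
        return "high"
    return "lower"
```

Under this reading, a discovery site counts as validating the map when the cell it falls in ranks at or above the 75th percentile of predicted probability across all cells.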

Fig 6. Predicted probability of human RNA virus discovery in 2010–2019.


The triangles represent the actual discovery sites from 2010 to 2018, and the background colour represents the predicted discovery probability.

Discussion

In this study we compiled a large body of information on global spatiotemporal patterns of human RNA virus discovery and developed a spatiotemporal modelling framework to identify explanatory factors for the discovery of new viruses. The maps of human RNA virus discovery indicate five regions with historically high discovery counts: eastern North America, Europe, central Africa, eastern Australia, and north-eastern South America. BRT modelling suggests that virus discovery is well predicted by socio-economic variables (especially GDP and GDP growth), land use variables (especially those related to urbanization), climate variables (including minimum temperature, precipitation change, latitude, minimum precipitation, total precipitation), and biodiversity (especially mammal species richness). The predicted probability map in 2010–2019 identified three new areas across East and Southeast Asia, India, and Central America in addition to the historical high-risk areas.

We focused on the discovery of RNA viruses in humans in this study, rather than their emergence. This reflects the nature of the database itself, which records the first report of each human RNA virus in the literature. The discovery location may or may not represent the origin of the virus. For example, HIV-1 is believed to have originated from non-human primates in West-central Africa and is estimated to have transferred to humans in the 1920s [35], but the first published case in the peer-reviewed literature was a Caucasian patient reported by researchers in France [36].

In both the full and the stratified BRT models, GDP and GDP growth were among the top predictors of virus discovery count. This is likely to reflect that richer, more developed areas have more research funding, better access to technologies for virus detection and more effective surveillance systems. In the United States, for example, the National Institute of Allergy and Infectious Diseases (NIAID) budget for emerging infectious diseases rose from less than $50 million in 1994 to more than $1.7 billion in 2005, a more than thirty-fold increase [37]. Comparison of Fig 1 with S6 Fig suggests that more viruses have been discovered in developed regions, with or without fast GDP growth, including North America, Europe, and Australia. We note that more developed countries are more likely to be the first to detect viruses circulating in multiple regions. North America and Europe account for a smaller fraction of discovered viruses in more recent decades (1985–2018: 32/86 = 37%) than previously (1901–1984: 78/137 = 57%), whereas Asia accounts for a higher fraction (1901–1984: 16/137 = 12%; 1985–2018: 22/86 = 26%) (S1 Video). This can be partly explained by the higher GDP and faster GDP growth in Asia in recent decades. In addition, there have also been historical hotspots in individual countries (e.g. Brazil, Nigeria and Uganda) associated with active virus discovery initiatives such as those supported by the Rockefeller Foundation [26]. More viruses are likely to be discovered in the near future in areas with high GDP growth and GDP, including most of Asia (except North and Central Asia), Europe and North America.

In contrast to GDP, all other explanatory factors identified in this study appear more directly associated with virus geographic distributions; our study has the important advantage that their influence is estimated independently of GDP. We note that the relative importance of GDP is less, though still substantial, for strictly zoonotic viruses and vector-borne viruses (two large, overlapping subsets of human RNA viruses: 73 of the 93 vector-borne viruses (78.5%) are strictly zoonotic) (S1 Table). This likely reflects the fact that most such viruses have geographic ranges restricted by the distributions of their vectors and/or reservoir hosts.

Consistent with this interpretation, explanatory factors related to urbanization, a consistently important category, have the greatest influence for human-transmissible viruses and non-vector-borne viruses. This, again, can be explained by the fact that more viruses have been discovered in areas (especially in Asia) which have experienced rapid urbanization in recent decades (especially after 1980 [38]). Population density and growth, in contrast, are much less prominent explanatory factors, with particularly little influence on strictly zoonotic viruses and vector-borne viruses. This implies that change in habitat from natural or rural to urban [39] has a greater influence on virus discovery (by altering the virus geographic distributions in nature) than human population size or density.

We also found associations between the discovery of RNA viruses and climate: five of the most influential explanatory factors in the full model were minimum temperature, precipitation change, latitude, minimum precipitation, and total precipitation. The finding that a warmer and wetter climate (higher minimum temperature, more precipitation and lower latitude) is positively associated with virus discovery is consistent with previous studies [19]. Climatic variables (especially minimum temperature) were relatively more important predictors of vector-borne and strictly zoonotic virus discovery; both these categories are more often discovered in tropical and sub-tropical regions. Forty-two percent (93 out of 223) of human RNA virus species are vector-borne [1] and the distribution and abundance of these viruses is strongly influenced by the impact of climate on vector populations [18,40]. That climate is also relatively important for the discovery of strictly zoonotic viruses may be at least partly explained by the large overlap between the two categories (73 of the 93 vector-borne viruses are strictly zoonotic), although there may also be an association between climate and the distribution of reservoir hosts.

For biodiversity, mammal species richness was shown to make an influential contribution to human RNA virus discovery, again particularly for vector-borne viruses and strictly zoonotic viruses. Most but not all previous studies have indicated that the risk of spill-over of a virus from mammal hosts to humans is positively correlated with host species richness [18,19,41]. This is consistent with mammals being the main source of zoonotic viruses [34] and with the richness of the pool of viral zoonoses increasing as mammal species richness increases [42]. Where zoonotic viruses are first discovered will be influenced, inter alia, by a range of environmental, ecological and socioeconomic factors that increase the interaction between humans and mammal reservoirs [43].

Our predicted discovery map from the full model, along with two stratified models, identified three areas (East and Southeast Asia, India, and Central America) where more viruses were likely to be detected in 2010–2019 than in the past. Inspection of the historical predicted probabilities of virus discovery in S8 Fig indicates that there have always been, and still are, fewer discoveries than expected in these regions. This suggests that our model is missing explanatory factors (positive or negative) relevant to these regions. However, as mentioned before, two of the predicted high-risk areas (East and Southeast Asia, and India) account for higher fractions of discoveries in more recent times. The underlying reason may be that the explanatory factors with the greatest influence on virus discovery, such as GDP and land use variables related to urbanization, have changed substantially over time in these areas (especially China).

This study had several limitations. First, as indicated above, our model is missing explanatory factors (positive or negative) relevant to the three newly identified high-risk regions. Second, there is often a lag between virus discovery and publication date, though we used the latter for consistency. Third, there are other potential biases concerning spatiotemporal variation in the virus detection methodologies used and in diagnostic accuracy [1]. Fourth, we used ICTV species classification following other studies [44,45], though we note that viral species for each family are defined by independent groups using different criteria, which may lead to over- or under-representation of species entries for certain families in our study compared to their phylogenetic diversity. However, we regard ICTV taxonomy as the most authoritative for comparative analysis. Last, we did not attempt to correct for reporting bias by devising a plausible metric, though previous studies have done so [18,19]. However, we explicitly included predictors that we expect to be correlated with discovery effort, e.g. GDP and university count; these are indirect and likely partial measures of effort.

The strengths of the study include the use of a comprehensive data set for human RNA virus discovery, the large set of high-resolution global variables postulated to influence RNA virus discovery, and a more robust model (BRT) that combines the strengths of both regression trees and boosting and is capable of handling spatial dependence. We also performed further stratified analyses (distinguishing viruses transmissible in humans or strictly zoonotic, and vector-borne or non-vector-borne) and identified differences between explanatory factors for the discovery of these specific categories of viruses. These results further understanding of the spatial distribution of virus discovery for different types of viruses, and also demonstrate that such a method can be used to identify differences between strictly zoonotic and human-transmissible viruses or between vector-borne and non-vector-borne viruses.

In conclusion, the discovery of human RNA viruses shows both spatial and temporal variation, and is a process associated with socio-economic variables, land use, climate, and biodiversity, although the relative importance of these variables differs across different categories of RNA viruses. Our study helps distinguish the relative contributions of explanatory factors reflecting the natural virus distribution and those reflecting the effort invested in virus discovery to the spatial distribution of first reports of human viruses. New human viruses are more likely to be found in areas with more rapid socio-economic growth. But the underlying geographic distribution of viruses with the potential to infect humans may be somewhat different, reflecting climate, biodiversity and changes in land use. This implies that extra investment in virus discovery in settings that are resource-poor but have other risk factors may be warranted.

Supporting information

S1 Fig. Spatiotemporal distribution of human RNA virus discovery count split by category from 1901 to 2018.

The map was plotted with respect to transmissibility (top left: strictly zoonotic, top right: transmissible in humans), and transmission mode (bottom left: vector-borne viruses, bottom right: non-vector-borne viruses). In each subplot, the red spots indicate discovery points or centroids of polygons (administrative regions), depending on the precision of the location provided by the original paper, with the size representing the cumulative virus species count. A centroid is the coordinate of the centre of mass of a spatial object. The red curve at the bottom left corner indicates the cumulative virus species discovery count over time.

(PDF)

S2 Fig. Partial dependence plots for all explanatory factors that influence human RNA virus discovery in the full model.

Partial dependence plots show the effect of an individual explanatory factor over its range on the response after factoring out other explanatory factors. Fitted lines represent the median (black) and 95% quantiles (coloured) based on 1000 replicated models. Y axes are centred around the mean without scaling. X axes show the range of sampled values of explanatory factors.

(PDF)
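
The partial dependence summary described in this caption (and reused for S4 Fig and S5 Fig) can be illustrated with a minimal, model-agnostic sketch. It is not the authors' R code: the replicate "models" below are simple stand-in functions rather than fitted BRT models, and the feature names `gdp` and `temp`, the grid, and the number of replicates are all hypothetical choices made only for illustration. For each replicate, the prediction is averaged over the data with one factor forced to each grid value, the curve is centred around its mean without scaling, and the pointwise median and 95% quantiles are taken across replicates.

```python
import math
import random

def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append(sum(preds) / len(preds))
    return curve

def centre(curve):
    """Centre a curve around its mean without rescaling."""
    mean = sum(curve) / len(curve)
    return [c - mean for c in curve]

def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def summarise(curves, lower=0.025, upper=0.975):
    """Pointwise (lower, median, upper) across replicate curves."""
    return [(quantile(pt, lower), quantile(pt, 0.5), quantile(pt, upper))
            for pt in zip(*curves)]

def make_replicate(rng):
    """Hypothetical stand-in for one fitted replicate model (NOT a real BRT)."""
    slope = rng.gauss(2.0, 0.2)
    return lambda row: slope * row["gdp"] + 0.5 * row["temp"]

rng = random.Random(0)
# Toy data with two made-up explanatory factors.
rows = [{"gdp": rng.random(), "temp": rng.random()} for _ in range(50)]
grid = [0.0, 0.5, 1.0]
curves = [centre(partial_dependence(make_replicate(rng), rows, "gdp", grid))
          for _ in range(200)]
bands = summarise(curves)  # one (2.5%, median, 97.5%) triple per grid value
```

In practice each replicate would be a model fitted to a resampled data set, and `predict` would wrap that model's prediction function; the summary step is unchanged.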

S3 Fig. Moran’s I across different spherical distances.

The solid line and dots represent the median Moran’s I value, and the grey area represents its 95% quantiles generated from 1000 samples (A: Raw virus data) or replicate BRT models (B: Model residuals).

(PDF)
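
As a rough illustration of how Moran’s I can be evaluated within a band of spherical (great-circle) distances, the sketch below uses only the Python standard library. The binary distance-band weights, the function names, the Earth radius constant and the toy coordinates are assumptions made for this example; they are not taken from the authors' analysis, which may have used a different weighting scheme.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius for this sketch

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def morans_i(values, coords, d_min, d_max):
    """Moran's I with binary weights: w_ij = 1 when d_min < dist(i, j) <= d_max."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0   # sum of w_ij * dev_i * dev_j
    w_sum = 0.0   # sum of all weights
    for i in range(n):
        for j in range(n):
            if i != j and d_min < haversine_km(coords[i], coords[j]) <= d_max:
                cross += dev[i] * dev[j]
                w_sum += 1.0
    denom = sum(d * d for d in dev)
    if w_sum == 0 or denom == 0:
        return float("nan")  # undefined without neighbours or without variance
    return (n / w_sum) * (cross / denom)

# Toy example: four points spaced one degree apart along the equator,
# so only adjacent points fall inside the 0-150 km band.
coords = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0), (0.0, 3.0)]
clustered = morans_i([1, 1, 0, 0], coords, 0, 150)    # similar values adjacent
alternating = morans_i([1, 0, 1, 0], coords, 0, 150)  # dissimilar values adjacent
```

Repeating the calculation over successive distance bands gives a curve of Moran’s I against distance; positive values indicate that nearby cells tend to have similar values, and negative values indicate the opposite.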

S4 Fig. Partial dependence plots for all explanatory factors that influence human RNA virus discovery in the stratified model by transmissibility.

(A) Strictly zoonotic, (B) Transmissible in humans. Partial dependence plots show the effect of an individual explanatory factor over its range on the response after factoring out other explanatory factors. Fitted lines represent the median (black) and 95% quantiles (coloured) based on 1000 replicated models. Y axes are centred around the mean without scaling. X axes show the range of sampled values of explanatory factors.

(PDF)

S5 Fig. Partial dependence plots for all explanatory factors that influence human RNA virus discovery in the stratified model by transmission mode.

(A) Vector-borne, (B) Non-vector-borne. Partial dependence plots show the effect of an individual explanatory factor over its range on the response after factoring out other explanatory factors. Fitted lines represent the median (black) and 95% quantiles (coloured) based on 1000 replicated models. Y axes are centred around the mean without scaling. X axes show the range of sampled values of explanatory factors.

(PDF)

S6 Fig. Distribution maps for 32 explanatory factors in 2015.

The values of these explanatory variables and latitude in each grid cell were used to predict the virus discovery in the corresponding grid cell across the globe in 2010–2019. Explanatory variables were log transformed where necessary for better visualization; this does not mean that they entered the model as logged values.

(PDF)

S7 Fig. Predicted probability of human RNA virus discovery in 2010–2019 split by category.

The triangles represent the actual discovery sites from 2010 to 2018, and the background colour represents the predicted discovery probability.

(PDF)

S8 Fig. Historical predicted probability of human RNA virus discovery by decade (except the first period with four years).

The triangles represent the actual discovery sites in each decade, and the background colour represents the predicted discovery probability.

(PDF)

S1 Table. Summary of the human RNA virus database.

(DOCX)

S2 Table. Resolution and covered grid cells for virus discovery data.

(DOCX)

S3 Table. List of explanatory factors included in the model.

(DOCX)

S4 Table. Model validation statistics for stratified analyses.

(DOCX)

S5 Table. Model parameters for sensitivity analyses and stratified analyses.

(DOCX)

S1 Text. Georeferencing human RNA virus discovery locations.

(DOCX)

S2 Text. Transformation of resolution for explanatory factors and data extrapolation.

(DOCX)

S3 Text. Result of model validation.

(DOCX)

S4 Text. Source and permission for the world shapefile used in the study.

(DOCX)

S1 Video. The spatiotemporal pattern of human RNA virus discovery.

The red spot represents the discovery location of each virus species over time. The red curve at the bottom-left corner represents the cumulative virus species count over time.

(MP4)

S1 R script. A zipped file with the raw data and R code that was used for generating figures for the full model.

(ZIP)

Acknowledgments

We thank Donald Smith (University of Edinburgh, Edinburgh, UK) for validating the database, Melina Beykou and Melissa Taylor (University of Edinburgh, Edinburgh, UK) for checking the transmissibility of RNA viruses, and Thibaud Porphyre (University of Edinburgh, Edinburgh, UK) for statistical guidance.

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

FFZ is funded by the Darwin Trust of Edinburgh (https://darwintrust.bio.ed.ac.uk/edinburgh). MEJW has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 874735 (VEO) (http://www.veo-europe.eu/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Woolhouse MEJ, Brierley L. Epidemiological characteristics of human-infective RNA viruses. Scientific data. 2018;5:180017. doi:10.1038/sdata.2018.17
  • 2. Tang JW, Lam TT, Zaraket H, Lipkin WI, Drews SJ, Hatchette TF, et al. Global epidemiology of non-influenza RNA respiratory viruses: data gaps and a growing need for surveillance. The Lancet Infectious diseases. 2017;17(10):e320–e6. doi:10.1016/S1473-3099(17)30238-4
  • 3. Clark LE, Mahmutovic S, Raymond DD, Dilanyan T, Koma T, Manning JT, et al. Vaccine-elicited receptor-binding site antibodies neutralize two New World hemorrhagic fever arenaviruses. Nature communications. 2018;9(1):1884. doi:10.1038/s41467-018-04271-z
  • 4. Lopman BA, Steele D, Kirkwood CD, Parashar UD. The Vast and Varied Global Burden of Norovirus: Prospects for Prevention and Control. PLoS medicine. 2016;13(4):e1001999. doi:10.1371/journal.pmed.1001999
  • 5. WHO. HIV/AIDS report 2018. Geneva: World Health Organization. Available from: https://www.who.int/en/news-room/fact-sheets/detail/hiv-aids (accessed 19 July 2018).
  • 6. Guzman A, Isturiz RE. Update on the global spread of dengue. International journal of antimicrobial agents. 2010;36 Suppl 1:S40–2. doi:10.1016/j.ijantimicag.2010.06.018
  • 7. Barrett ADT. The reemergence of yellow fever. Science. 2018;361(6405):847–8. doi:10.1126/science.aau8225
  • 8. WHO. Japanese encephalitis report 2015. Geneva: World Health Organization. Available from: https://www.who.int/en/news-room/fact-sheets/detail/japanese-encephalitis (accessed 31 December 2015).
  • 9. Moss WJ. Measles. Lancet. 2017;390(10111):2490–502. doi:10.1016/S0140-6736(17)31463-0
  • 10. Fisher CR, Streicker DG, Schnell MJ. The spread and evolution of rabies virus: conquering new frontiers. Nature reviews Microbiology. 2018;16(4):241–55. doi:10.1038/nrmicro.2018.11
  • 11. Reed W, Carroll JS, Agramonte A. The etiology of yellow fever: An additional note. Journal of the American Medical Association. 1901;XXXVI(7):431–40.
  • 12. Goldberger J, Anderson JF. The nature of the virus of measles. Journal of the American Medical Association. 1911;LVII(12):971–2.
  • 13. Bridger JC, Pedley S, McCrae MA. Group C rotaviruses in humans. Journal of clinical microbiology. 1986;23(4):760–3. doi:10.1128/JCM.23.4.760-763.1986
  • 14. Niklasson B, Heller KE, Schonecker B, Bildsoe M, Daniels T, Hampe CS, et al. Development of type 1 diabetes in wild bank voles associated with islet autoantibodies and the novel ljungan virus. International journal of experimental diabesity research. 2003;4(1):35–44. doi:10.1080/15438600303733
  • 15. Rizzetto M, Canese MG, Arico S, Crivelli O, Trepo C, Bonino F, et al. Immunofluorescence detection of new antigen-antibody system (delta/anti-delta) associated to hepatitis B virus in liver and in serum of HBsAg carriers. Gut. 1977;18(12):997–1003. doi:10.1136/gut.18.12.997
  • 16. Meehan PJ, Wells DL, Paul W, Buff E, Lewis A, Muth D, et al. Epidemiological features of and public health response to a St. Louis encephalitis epidemic in Florida, 1990–1. Epidemiology and infection. 2000;125(1):181–8. doi:10.1017/s0950268899004227
  • 17. Woolhouse ME, Howey R, Gaunt E, Reilly L, Chase-Topping M, Savill N. Temporal trends in the discovery of human viruses. Proceedings Biological sciences. 2008;275(1647):2111–5. doi:10.1098/rspb.2008.0294
  • 18. Jones KE, Patel NG, Levy MA, Storeygard A, Balk D, Gittleman JL, et al. Global trends in emerging infectious diseases. Nature. 2008;451(7181):990–3. doi:10.1038/nature06536
  • 19. Allen T, Murray KA, Zambrana-Torrelio C, Morse SS, Rondinini C, Di Marco M, et al. Global hotspots and correlates of emerging zoonotic diseases. Nature communications. 2017;8(1):1124. doi:10.1038/s41467-017-00923-8
  • 20. Escaffre O, Borisevich V, Rockx B. Pathogenesis of Hendra and Nipah virus infection in humans. Journal of infection in developing countries. 2013;7(4):308–11. doi:10.3855/jidc.3648
  • 21. Philbey AW, Kirkland PD, Ross AD, Davis RJ, Gleeson AB, Love RJ, et al. An apparently new virus (family Paramyxoviridae) infectious for pigs, humans, and fruit bats. Emerging infectious diseases. 1998;4(2):269–71. doi:10.3201/eid0402.980214
  • 22. Babayan SA, Orton RJ, Streicker DG. Predicting reservoir hosts and arthropod vectors from evolutionary signatures in RNA virus genomes. Science. 2018;362(6414):577–80. doi:10.1126/science.aap9072
  • 23. Brierley L, Vonhof MJ, Olival KJ, Daszak P, Jones KE. Quantifying Global Drivers of Zoonotic Bat Viruses: A Process-Based Perspective. The American naturalist. 2016;187(2):E53–64. doi:10.1086/684391
  • 24. Olival KJ, Hosseini PR, Zambrana-Torrelio C, Ross N, Bogich TL, Daszak P. Host and viral traits predict zoonotic spillover from mammals. Nature. 2017;546(7660):646–50. doi:10.1038/nature22975
  • 25. Lipkin WI. The changing face of pathogen discovery and surveillance. Nature reviews Microbiology. 2013;11(2):133–41. doi:10.1038/nrmicro2949
  • 26. Rosenberg R, Johansson MA, Powers AM, Miller BR. Search strategy has influenced the discovery rate of human viruses. Proceedings of the National Academy of Sciences. 2013:201307243. doi:10.1073/pnas.1307243110
  • 27. Morse SS. Factors in the emergence of infectious diseases. Emerging infectious diseases. 1995;1(1):7–15. doi:10.3201/eid0101.950102
  • 28. Shearer FM, Longbottom J, Browne AJ, Pigott DM, Brady OJ, Kraemer MUG, et al. Existing and potential infection risk zones of yellow fever worldwide: a modelling analysis. Lancet Global Health. 2018;6(3):E270–E8. doi:10.1016/S2214-109X(18)30024-X
  • 29. Redding DW, Moses LM, Cunningham AA, Wood J, Jones KE. Environmental-mechanistic modelling of the impact of global change on human zoonotic disease emergence: a case study of Lassa fever. Methods Ecol Evol. 2016;7(6):646–55.
  • 30. De'ath G. Boosted trees for ecological modeling and prediction. Ecology. 2007;88(1):243–51. doi:10.1890/0012-9658(2007)88[243:btfema]2.0.co;2
  • 31. Elith J, Leathwick JR, Hastie T. A working guide to boosted regression trees. The Journal of animal ecology. 2008;77(4):802–13. doi:10.1111/j.1365-2656.2008.01390.x
  • 32. Crase B, Liedloff AC, Wintle BA. A new method for dealing with residual spatial autocorrelation in species distribution models. Ecography. 2012;35(10):879–88.
  • 33. Cliff AD, Ord JK. Spatial processes: Models and applications. London: Pion Limited; 1981.
  • 34. Woolhouse ME, Brierley L, McCaffery C, Lycett S. Assessing the Epidemic Potential of RNA and DNA Viruses. Emerging infectious diseases. 2016;22(12):2037–44. doi:10.3201/eid2212.160123
  • 35. Faria NR, Rambaut A, Suchard MA, Baele G, Bedford T, Ward MJ, et al. HIV epidemiology. The early spread and epidemic ignition of HIV-1 in human populations. Science. 2014;346(6205):56–61. doi:10.1126/science.1256739
  • 36. Barre-Sinoussi F, Chermann JC, Rey F, Nugeyre MT, Chamaret S, Gruest J, et al. Isolation of a T-lymphotropic retrovirus from a patient at risk for acquired immune deficiency syndrome (AIDS). Science. 1983;220(4599):868–71. doi:10.1126/science.6189183
  • 37. Fauci AS, Touchette NA, Folkers GK. Emerging infectious diseases: a 10-year perspective from the National Institute of Allergy and Infectious Diseases. Emerging infectious diseases. 2005;11(4):519–25. doi:10.3201/eid1104.041167
  • 38. Chini LP, Hurtt GC, Frolking S. Harmonized Global Land Use for Years 1500–2100, V1. ORNL Distributed Active Archive Center; 2014.
  • 39. Hassell JM, Begon M, Ward MJ, Fevre EM. Urbanization and Disease Emergence: Dynamics at the Wildlife-Livestock-Human Interface. Trends in ecology & evolution. 2017;32(1):55–67. doi:10.1016/j.tree.2016.09.012
  • 40. Li LM, Grassly NC, Fraser C. Genomic analysis of emerging pathogens: methods, application and future trends. Genome biology. 2014;15(11):541. doi:10.1186/s13059-014-0541-9
  • 41. Wood CL, Lafferty KD, DeLeo G, Young HS, Hudson PJ, Kuris AM. Does biodiversity protect humans against infectious disease? Ecology. 2014;95(4):817–32. doi:10.1890/13-1041.1
  • 42. Keesing F, Belden LK, Daszak P, Dobson A, Harvell CD, Holt RD, et al. Impacts of biodiversity on the emergence and transmission of infectious diseases. Nature. 2010;468(7324):647–52. doi:10.1038/nature09575
  • 43. Mackey TK, Liang BA, Cuomo R, Hafen R, Brouwer KC, Lee DE. Emerging and reemerging neglected tropical diseases: a review of key characteristics, risk factors, and the policy and innovation environment. Clinical microbiology reviews. 2014;27(4):949–79. doi:10.1128/CMR.00045-14
  • 44. Walker JW, Han BA, Ott IM, Drake JM. Transmissibility of emerging viral zoonoses. PloS one. 2018;13(11):e0206926. doi:10.1371/journal.pone.0206926
  • 45. Shu X, Zang X, Liu X, Yang J, Wang J. Predicting MicroRNA Mediated Gene Regulation between Human and Viruses. Cells. 2018;7(8). doi:10.3390/cells7080100

Decision Letter 0

David Wang, Stephen Morse

29 Jul 2020

Dear Ms Zhang,

Thank you very much for submitting your manuscript "Global discovery of human-infective RNA viruses: a modelling analysis" for consideration at PLOS Pathogens. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

The manuscript has significant strengths.  Many of the specific points have been addressed by the reviewers.  Here, I would like to summarize some overarching points, as well as reinforcing some of the reviewers' comments. 

The manuscript is a methodological tour de force, consonant with this group's previous work. The approach is very interesting, building on this group's earlier work on rates and locations of pathogen discovery (e.g., their Ref. 11). However, as the manuscript was not submitted as a methodological paper, it should explicitly discuss how the work advances current knowledge. In addition, as Reviewer 2 notes, the distinction and overlap between results from studies of (i) pathogen discovery and (ii) pathogen emergence should be clarified. In the absence of formal analysis of reporting/publication bias, disentangling these threads becomes particularly critical. The authors themselves at times seem unsure which of these they're referring to. For example, on p. 3, ll. 47-50, the authors state, "The results of our study further understanding of the spatial distribution of human RNA virus discovery, and maps the likelihood of further discoveries across the world. By identifying where new viruses are most likely to occur in the near future the study helps identify priority areas for surveillance.", but in the Introduction, the authors note, quite astutely, "We assume virus discovery is determined by two underlying spatiotemporal patterns: the geographical distribution of viruses in nature, and the process of virus detection—a human activity." In the cover letter, the authors imply that using this complementary approach can help to correct for resource bias in identification (discovery), among other things. In that case, it should be possible to demonstrate the value added by this approach. Some specific examples could be used here. What are the significance and interpretation of these results, and are there recommendations? If there is a close correlation of explanatory factors between studies of discovery and studies of "emergence", why is that so?
Does that imply we're looking in the right places, or simply the obvious ones, or that the factors that enhance discovery coincide with those that promote emergence, such as urbanization, travel, and increasing scientific capacity in developing countries?  Does the analysis suggest a better strategy for searching?  With an analysis this detailed one hopes it would be possible to go beyond the descriptive and begin more critically exploring mechanisms and hypotheses.

The figures are very elegant, but to this digital non-native, some seemed duplicative, and seemed to obscure the main points.  The text itself is rather dense.  For the readership of PLoS Pathogens, I think more explanation and context are required.  Similarly, the Discussion should more explicitly discuss any unique benefits and added value of this approach.  Again, an example or a few examples of how this can be used to advance our knowledge would be invaluable.  Earlier analyses appear to have arrived at similar conclusions from a different vantage point so ways to use the analyses in the manuscript to extend this knowledge beyond the descriptive would be very useful.  The analysis does study trends over time (including a video), and that would seem to merit more discussion as well.

This is admittedly a difficult subject. The authors' group has developed an excellent database that they have generously and most commendably shared freely with all interested researchers. As they indicate, however, the data are still sparse and rife with questions of definition. Nevertheless, it remains a useful resource, and I hope that this manuscript can advance understanding of the mechanism of the emergence process. The Discussion seems somewhat limited, and I would encourage the authors to consider additional conclusions and implications of their work in their discussion.

On another note, my sincere apologies to the authors for the unusually prolonged review period.  It is an unfortunate irony of timing to submit a manuscript on this subject just as an emerging viral pandemic appears, preoccupying all our colleagues.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Steve

Stephen Morse

Guest Editor

PLOS Pathogens

David Wang

Section Editor

PLOS Pathogens

Kasturi Haldar

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0001-5065-158X

Michael Malim

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0002-7699-2064

***********************

Reviewer's Responses to Questions

Part I - Summary

Please use this section to discuss strengths/weaknesses of study, novelty/significance, general execution and scholarship.

Reviewer #1: This paper uses a number of techniques, most significantly an ensemble of boosted regression trees (BRT)-based approach, to find associations between a set of predictor variables and their outcome variable representing RNA viral discovery.

The paper appears to follow methods and workflows strongly derived from Allen et al. (2017) (currently citation 13), particularly in the generation of its BRT models and associated figures. I would suggest that the Methods section include a statement like, “We followed methods and used code derived from Allen et al (2017)…,” where describing derived methods/code/workflows.

In addition, the paper should share the code used to generate its models, as per the recommendations in PLOS’s Materials and Software Sharing policies (https://journals.plos.org/plospathogens/s/materials-and-software-sharing). Sharing model and figure code removes any ambiguity that may be present in the natural language description of a study’s methods as a result of space constraints or the difficulty of describing complex processes in natural language. Code should preferably be shared with an appropriate license on GitHub and assigned a DOI using a service such as Zenodo, but could also be included with the study’s Supplementary Information, as its dataset already is.

Following a similar process but with a separate outcome dataset with different inclusion criteria and a number of different predictors, the paper comes to similar conclusions as other studies on the same topic (particularly references 12 and 13) for its main model. That the different datasets would yield similar results undergirds the conclusions of the earlier papers. Some of their high-influence variables are similar (e.g. mammal species richness) and others are divergent (e.g. finding an association and significant impact for climate variables). These differences from similar previous studies are worth publishing.

This study also runs different BRT models for biologically distinct subsets of RNA viruses, and finds notable differences between the drivers of these models. These findings are notable first because they describe differences in the global distribution of viral discovery for different types of viruses, and second because they demonstrate that such a method can be used to identify these sorts of differences between, say, zoonotic-only and human-transmissible diseases.

It’s my opinion that these contributions (the influence of climate variables and the stratified analyses) are the biggest points of departure from earlier literature, and thus this study’s best contributions.

There are a few other points that could be improved in the study’s discussion of its findings. These are noted below.

Reviewer #2: There are many strengths to this study, including their recognition of the need to deal with the critical issue of variation in research effort.

However, throughout the manuscript I found myself challenged by what ‘virus discovery’ really means in the context of this study. I was often confused about whether the authors are truly focused on the factors that influence virus discovery, or if they are actually focused on factors that influence virus emergence. Indeed, the opening two sentences of the abstract reflect this unresolved conflict between predictors of emergence (“…the prediction of where new RNA viruses will emerge is a significant public health concern”) and virus discovery (“we matched these data….to predict the probability of virus discovery”). While this might seem like semantics, I think it is critical to identify and articulate specifically what the paper is trying to achieve. Discovery and emergence are not the same process, and thus presumably should be approached differently in study design and interpretation. There are many many examples throughout the text where the two processes are confused and, unfortunately, this limited my ability to evaluate the specific contribution of this study.

I kept going back to the example of Reston virus, which was discovered in Virginia (U.S.), but geocoded to Manila, Philippines. If the study is about understanding factors underlying virus discovery, it seems odd to ascribe the discovery of Reston virus to Manila. Reston virus was discovered because infected animals happened to end up in the hands of an ebolavirus expert at a high-containment lab, in a completely different country. I can understand why the authors decided to code Reston virus to the Philippines (it is where the virus naturally circulates); however, I question whether these spatial covariates are relevant to its discovery in the U.S. In this case, I think the spatial data that is linked to the Philippines would tell us more about virus transmission and emergence than discovery, and I worry about spurious associations. The same issue exists for Ebola virus (species Zaire ebolavirus), which was ascribed to DRC when it was actually discovered by Prof Piot in Belgium.

Ignoring, for a moment, whether these viruses are appropriately geocoded for their particular question, there is another problem. That is, the approach taken to geocode the viruses is inconsistent. For example, while Reston virus and Ebola virus were coded to reflect their point of origin (i.e., where the virus was naturally circulating), other viruses like influenza were coded to reflect where they were indeed first ‘discovered’. Influenza was first characterized in 1933 in Mill Hill, London, and the authors have geocoded it accordingly. The approach taken in coding influenza therefore appears to be different from the approach taken for Reston and Ebola. While the spatial covariates for Reston virus and Ebola virus are capturing one type of information (location of original circulation), the covariates for other viruses like Influenza or Hepatitis C are capturing a different type of information (location where the virus was actually discovered, but not necessarily or exclusively where it was circulating).

Throughout the paper, I worry that ‘discovery’ and ‘emergence’ have been confused. In the author summary, they state “By identifying where new viruses are most likely to occur in the future, the study helps identify priority areas for surveillance.” The phrase “most likely to occur” suggests they are indeed interested in emergence. Equally, in the discussion they state “This implies that it is the change in habitat—from natural or rural to urban—has a greater influence on virus discovery than human population size or density.” Again, this sentence implies that the study is about emergence, not discovery. However, elsewhere in the text they refer specifically to need to “identify the factors driving the discovery of RNA viruses”. I think it would be important to more clearly articulate exactly what the study is focused on (I think this is virus discovery) and comment on how this links or informs on our understanding of viral emergence (if indeed it does). Currently, I am not sure how to interpret the following statement in their abstract: “…areas with the highest predicted probability for 2015–2024 include new foci in East and Southeast Asia, India, and Central America”. Does this really mean areas where new viruses are likely to be discovered? Or does it mean areas where new viruses are likely to emerge, and then subsequently discovered simply because they are emerging?

I also found the definition of ‘discovery’ confusing. It was defined as ‘the first isolation of a virus in a human patient.’ Given that many human viruses were first isolated years after they were actually discovered, it would be good to clarify whether the authors really do mean ‘isolated’. (To a virologist, this means culture of the virus in vitro or in vivo.) In reviewing the supplemental material, I see that the authors have used 1989 for hepatitis C. This makes me think they do not actually mean ‘isolated’, but rather ‘discovered’. Hep C was first cloned in 1989, but it was not successfully grown in culture (‘isolated’) until around 2005. Again, this is inconsistent with examples like influenza. Note, I did not review the list of viruses exhaustively, so I would recommend that the authors double-check everything. In addition, the term ‘patient’ is included in their definition and is perhaps misleading. Many of the viruses they have included are probably not human pathogens, or at least have not been conclusively linked with human pathology. Reston virus is a good example of this.

Reviewer #3: This paper is a timely advance for understanding discovery of human RNA viruses that will be of interest to researchers working on emerging infectious disease. The methods are rigorous and this work uses a new approach to evaluate patterns in virus detections. This work highlights important correlates of newly recognized RNA viruses that can help predict future emergence. The insights are not especially novel but findings are important and confirm findings previously reported using new data and methods. The only major concern is that the very broad presentation of correlates in a model limits the impact and relevance of this work.

**********

Part II – Major Issues: Key Experiments Required for Acceptance

Please use this section to detail the key new experiments or modifications of existing experiments that should be absolutely required to validate study conclusions.

Generally, there should be no more than 3 such required experiments or major modifications for a "Major Revision" recommendation. If more than 3 experiments are necessary to validate the study conclusions, then you are encouraged to recommend "Reject".

Reviewer #1: There are no major new experiments required for publication.

Reviewer #2: (No Response)

Reviewer #3: The analyses are thorough and evaluate a wide range of potential correlates for virus discovery. But a major limitation of this manuscript is that the inferences are not clear beyond very broad assessment of relative contributions for factors evaluated that are presented in numerous figures in the main text and SI. Concrete and meaningful functional relationships between key predictors and the outcome are not derived or explained in the results and discussion. Exploration of specific values for predictive factors and how these influence quantity of virus discovery would bring meaning to these findings. For example, the authors should explore how urbanization influences virus discovery – what levels of urbanization were more influential and what does this mean for virus discovery based on expected future trends in urbanization? Examples to highlight virus discoveries and the factors related to their first detection that underlie broad patterns in the data would bring clarity and meaning to findings.

The many figures in the main paper and SI are largely of the same variety of relative contributions to the model and probability maps – these should be narrowed down to only those that highlight major findings with improved quality. A figure that shows specific values of the one or two key factors and the predicted influence on number of viruses discovered would enhance the presentation of results.

Can the authors further explore the rate of detection over time? How did rate of detection or the temporal distribution vary by the 4 classifications of viruses; did the rate of detection vary by geographic region? The influence of the predictors on prediction is the most novel aspect of this work yet it is not explored beyond highlighting broad regions with increased probability. Can the authors add specific inferences based on the analysis in terms of the number of viruses expected to be discovered every year, by virus groups examined? Is it possible to predict this beyond 2024?

**********

Part III – Minor Issues: Editorial and Data Presentation Modifications

Please use this section for editorial suggestions as well as relatively minor modifications of existing data that would enhance clarity.

Reviewer #1: On line 49, the authors state that they identify “where new viruses are most likely to occur”, but they are actually identifying where they are most likely to be discovered. This might seem like splitting hairs, but since the authors did not treat the effect of reporting bias separately, it’s an important distinction to make.

The authors’ discussion of this very phenomenon on lines 74–83 is cogent and succinct. Their distinction between virus species range and discovery probability is one of the clearest explanations of this subject I’ve read.

However, their methodological treatment of reporting bias is not clearly described here or in the methods section. In the introduction, it is described on lines 84-86: “Here, we take a different approach by identifying explanatory factors of the raw virus discovery data and then considering whether these relate to virus geographic range or discovery effort or both.” This might be made clearer by saying something like “and then interpreting in the results whether these effects might relate to virus geographic range or discovery or both.”

The Discussion section notes that the paper does not attempt to explicitly correct for reporting bias (line 361). In this mention, they cite Jones et al. (ref. 12) but not the subsequent paper (ref. 13) which was an improvement on the Jones et al. method. However, the study includes variables expected to be associated with discovery effort, e.g. GDP and university count. Earlier in the discussion, it is noted that GDP growth is among the top predictors, and its effect differs across the stratified models. That the study does not attempt to factor out reporting effort means that its authors should be careful to interpret their results only as “viral discovery” and not “viral emergence”.

The paper should be given a minor copy edit (e.g. “spatial dependencies” -> “spatially dependent” on line 93).

The contribution of the paper’s use of k-means clustering to its central research questions is unclear. The paper’s initial discussion of the k-means clustering analysis suggests that this was perhaps used as a selection criterion for including points in the model (line 92, “if the spherical K function detected a clustered pattern, a Poisson boosted…”). However, the flow chart in the Supplementary Information gives the impression that the clustering analysis was not part of the BRT modeling workflow, but a separate analysis. If the k-means analysis is part of the modeling workflow, its relation to the BRT models should be made clearer. If it is a separate analysis, the authors should devote space in the Discussion section to interpreting its outcome (the figure in the supplementary information is not self-explanatory) or remove it from the paper entirely.

Line 167: For readers unfamiliar with BRTs, perhaps move the sentence about partial dependence plots to a new paragraph, or introduce it in a slightly different way, something like, “Partial dependence plots are a method of visualizing the relationships between a BRT’s predictive variables and its outcome…”.
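To make the reviewer's description concrete, the following is a minimal sketch of how a partial dependence curve can be computed for any fitted model. It is an illustration only, not the authors' BRT code: the function `partial_dependence` and the stand-in model `toy_predict` are hypothetical names introduced here, and the data are invented. The idea is to fix one explanatory factor at each value on a grid, leave all other factors as observed, and average the model's predictions.

```python
from statistics import mean


def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction when one feature is fixed at each grid value."""
    curve = []
    for value in grid:
        predictions = []
        for row in rows:
            modified = list(row)              # copy so the input data are untouched
            modified[feature_index] = value   # overwrite only the feature of interest
            predictions.append(predict(modified))
        curve.append(mean(predictions))
    return curve


# Hypothetical stand-in for a fitted model (assumption: not the authors' BRT).
def toy_predict(row):
    return 2 * row[0] + row[1] ** 2


# Two invented observations with two explanatory factors each.
rows = [[0.0, 1.0], [0.0, 3.0]]
curve = partial_dependence(toy_predict, rows, feature_index=0, grid=[0.0, 1.0, 2.0])
print(curve)
```

Plotting `curve` against the grid values gives the partial dependence plot for that factor; the averaging over the other factors is what "factoring out" means in the figure legends.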

Line 199: The paper talks about making predictions for 2015-2024, but these predictions are in fact the model’s output for 2015 variables. The authors do not go into enough detail about how the variables were matched to decade for us to assess this. If the outcome of the model uses 2015 variables because these are the most recent ones available, the model’s conclusions are still valid, but stating those predictions as “2015-2024” without justification would be misleading.

Line 231: The use of Moran’s I to demonstrate the model’s removal of spatial residuals is very clear.
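For readers unfamiliar with the statistic, a minimal sketch of global Moran's I is given below. It assumes a simple binary neighbour matrix on a chain of four locations; the manuscript evaluates Moran's I across spherical distance bands, which this sketch does not reproduce. The names `morans_i` and `chain_weights` are hypothetical. Values near 0 indicate little spatial autocorrelation, which is the pattern one hopes to see in model residuals.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square spatial weight matrix."""
    n = len(values)
    mean_value = sum(values) / n
    deviations = [v - mean_value for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted sum of cross-products of deviations between all pairs of locations.
    cross = sum(
        weights[i][j] * deviations[i] * deviations[j]
        for i in range(n)
        for j in range(n)
    )
    squared = sum(d * d for d in deviations)
    return (n / total_weight) * (cross / squared)


def chain_weights(n):
    """Assumed toy neighbourhood: each location neighbours the next one in a line."""
    return [[1.0 if abs(i - j) == 1 else 0.0 for j in range(n)] for i in range(n)]


w = chain_weights(4)
trend = morans_i([1.0, 2.0, 3.0, 4.0], w)          # similar values sit together
alternating = morans_i([1.0, -1.0, 1.0, -1.0], w)  # neighbours always differ
print(trend, alternating)
```

In this toy setting a smooth trend gives a positive value and an alternating pattern gives a negative one, so a residual series whose Moran's I sits close to zero suggests the model has absorbed most of the spatial structure.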

Figures 3 and 4 use the labels “Relative Contribution” and “Relative Influence” inconsistently. They should use only one. Readability would be much improved if the subplots were titled, rather than just being labeled A and B, so the reader can see which group they represent rather than referring to the caption.

The maps use a color palette which is not perceptually uniform — a color palette such as Viridis or Google’s Turbo color palette would be better for displaying this kind of continuous data on a map.

Line 365: The high-resolution variables were all scaled down to a 1° grid, so the higher resolution of the source variables was not a factor in the models’ output.
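The point about resolution can be illustrated with a small sketch. It assumes block averaging as the aggregation rule, which may differ from the transformation the authors describe in S2 Text, and uses invented values; `aggregate_mean` is a hypothetical helper. Two fine-resolution layers with very different within-cell detail become identical once averaged to the coarse grid.

```python
def aggregate_mean(grid, factor):
    """Average non-overlapping factor-by-factor blocks of a 2D grid (list of rows)."""
    n_rows, n_cols = len(grid), len(grid[0])
    if n_rows % factor or n_cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    coarse = []
    for r in range(0, n_rows, factor):
        coarse_row = []
        for c in range(0, n_cols, factor):
            block = [
                grid[r + dr][c + dc]
                for dr in range(factor)
                for dc in range(factor)
            ]
            coarse_row.append(sum(block) / len(block))
        coarse.append(coarse_row)
    return coarse


# Invented fine-resolution layers: one patchy, one uniform.
fine_patchy = [[1.0, 3.0, 0.0, 0.0],
               [3.0, 1.0, 0.0, 8.0]]
fine_uniform = [[2.0, 2.0, 2.0, 2.0],
                [2.0, 2.0, 2.0, 2.0]]

coarse_patchy = aggregate_mean(fine_patchy, 2)
coarse_uniform = aggregate_mean(fine_uniform, 2)
print(coarse_patchy, coarse_uniform)
```

Because the model only sees the coarse cells, any claim that depends on the finer source resolution cannot be supported by the model output alone.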

Reviewer #2: The authors considered a combination of socio-economic, land use, climate, and biodiversity variables as correlates of virus discovery. What is the rationale for how/why they are relevant to virus discovery? And given that land use, climate, and biodiversity variables seem more associated with emergence, can they provide some discussion on why these were selected, beyond saying ‘it’s a way to deal with variation in research effort’.

For their future predictions – did they account for projected changes in the factors that were correlated? Forecasted changes in GDP, land use, climate, etc.

Line 61 – “Human RNA viruses”. Can the authors define what they mean by a human RNA virus? Is it a virus that was found in a human at least once? And if so, is that sufficient to call it a human virus?

Lines 62-64: Species don’t circulate; viruses do. Please correct taxonomy and use virus names (not species names) when talking about viruses as nouns. No italics.

Line 66-67: “…some have been identified during active viral discovery programmes”. Please provide citations as examples. Also consider adding a sentence to say that some viruses have been discovered by chance, as incidental findings as part of a disease investigation. (I.e., a virus was identified in a sick person, but the virus was not thought to be the cause of the disease. Just an incidental finding).

Lines 109-111. I realize that this is likely not a fair thing to ask, but given the impact of the current pandemic, it seems that including SARS-CoV-2 would be good?

I would appreciate some justification for the taxonomic boundaries they have used, or at least some discussion about the potential limitations of this approach. While it might seem robust to use ICTV classifications – species demarcations vary widely by virus family. The factors used to demarcate species in one family can be quite different from the factors used to demarcate species in another family. I recognize that there is clearly no good way to address this problem, just as there is no clear answer for how to deal with variation in research effort. However, like the research effort problem, this has the potential to significantly impact their results. For example – is it reasonable to consider all influenza viruses as one example while separating out enteroviruses into so many species?

Figure 6. Do these hotspots indicate where good lab capacity exists? I am trying to understand the specific significance of this figure. What does it really mean and how would I use it to guide surveillance?

Reviewer #3: Abstract should include mention of investigative effort specifically (the process of virus detection) as a driver of discovery, as highlighted in the paper.

Given there is overlap between zoonotic viruses, human transmissible viruses, and vector-borne viruses, this overlap should be clarified in the text by showing the % overlap among categories, i.e., for human transmissible viruses, what % are zoonotic.

Information on how the geocoded location for virus detection was ascertained should be described at least briefly in the main paper, as opposed to only in the SI. When describing the patient’s address, was this the location of the hospital or clinic, or the site of initial human exposure/infection with the virus? Were discoveries of viruses in people who likely contracted the virus while travelling included? If so, how would this affect the findings?

The methods for data collection should be fully characterized, including search terms, databases searched, and inclusion or exclusion criteria, to enable repeatability. This information does not appear to be readily available in the link provided.

Line 224 - The partial dependence plots do not provide detailed descriptions of the relationships. This should be rephrased. For all figures of partial dependence plots, the x axes should be labelled. Many of these variables (eg primary land, secondary land, urbanization of primary land) are not readily understood unless the SI table is reviewed. Brief descriptions within the main text would ensure improved understanding of the main findings.

The relationship between GDP and virus discovery should be better explained, especially with respect to how the GDP of the country of virus detection and GDP change are related to detection. Does the GDP of a country where a virus was first detected generally reflect effort or resources spent toward investigative efforts at that location? Share evidence to support interpretation. Brazil, Nigeria, and Uganda are areas with high virus detection – how does GDP explain these detections and what were the specific factors involved in virus discovery; was this evenly distributed in time or based on unique efforts?

Lines 327-331 – does urbanization reflect a change in habitat? If so over what time scale? Or is this variable reflective of the level of urbanization at the time of analysis? Provide supporting evidence or documentation characterizing urbanization as a change.

Line 347 – the effect of biodiversity on virus emergence has been debated and this finding is minimally discussed here even though this is one of the effects that could be causally related to virus discovery in humans. Could the authors add more discussion of the relationship between biodiversity and virus discovery – examine this by potentially causal mechanisms and infer what this means for future trends?

The conclusion could be more forward looking, rather than a restatement of main findings already discussed.

Minor editorial suggestions

Line 31: is = are

Line 89: Expand literature review methods description.

Results: start this section with a description of the major findings, not an immediate reference to a figure in the first word of the first sentence.

Line 331: rephrase, missing a verb

Line 359: awkward sentence, especially as start of new paragraph – what were other limitations mentioned?

Use of the term ‘species’ in figures and legends should be rephrased as ‘virus species’.

Figure 1: Legend should provide more detail and clarity on data points (to replace “occurrences” and define “centroids”). The 1B figure should have a descriptive y axis label.

Figure 5: revise to have a title and descriptive legend.

SI Fig. this methods overview is not especially helpful in understanding approach or methods.

S2 Fig needs a more descriptive legend. The bar plots overlap the numbers and are difficult to discern. Improve this graphic for readability and quality.

S4 Fig, S7 Fig, S9 Fig, S10 Fig; units are unclear for some plots.

S12 Fig – the units of measure are not clear for many plots. Colors on maps do not always match the range shown on the color bar. The legend does not seem to match this figure; expand description of what is being shown in the maps.

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Toph Allen

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example see here on PLOS Biology: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, PLOS recommends that you deposit laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see http://journals.plos.org/plospathogens/s/submission-guidelines#loc-materials-and-methods

Decision Letter 1

David Wang, Stephen Morse

19 Oct 2020

Dear Ms Zhang,

We are pleased to inform you that your manuscript 'Global discovery of human-infective RNA viruses: A modelling analysis' has been provisionally accepted for publication in PLOS Pathogens.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Pathogens.

Best regards,

Stephen Morse

Guest Editor

PLOS Pathogens

David Wang

Section Editor

PLOS Pathogens

Kasturi Haldar

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0001-5065-158X

Michael Malim

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0002-7699-2064

***********************************************************

Thank you for your revisions in response to the review comments.

Reviewer Comments (if any, and for reference):

Acceptance letter

David Wang, Stephen Morse

18 Nov 2020

Dear Ms Zhang,

We are delighted to inform you that your manuscript, "Global discovery of human-infective RNA viruses: A modelling analysis," has been formally accepted for publication in PLOS Pathogens.

We have now passed your article onto the PLOS Production Department who will complete the rest of the pre-publication process. All authors will receive a confirmation email upon publication.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any scientific or type-setting errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript. Note: Proofs for Front Matter articles (Pearls, Reviews, Opinions, etc...) are generated on a different schedule and may not be made available as quickly.

Soon after your final files are uploaded, the early version of your manuscript, if you opted to have an early version of your article, will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Pathogens.

Best regards,

Kasturi Haldar

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0001-5065-158X

Michael Malim

Editor-in-Chief

PLOS Pathogens

orcid.org/0000-0002-7699-2064

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Spatiotemporal distribution of human RNA virus discovery count split by category from 1901 to 2018.

    The map was plotted with respect to transmissibility (top left: strictly zoonotic, top right: transmissible in humans), and transmission mode (bottom left: vector-borne viruses, bottom right: non-vector-borne viruses). In each subplot, the red spots indicate discovery points or centroids of polygons (administrative regions)–depending on the preciseness of the location provided by the original paper, with the size representing the cumulative virus species count. Centroid is the coordinate of the centre of mass in a spatial object. The red curve at the bottom left corner indicates the cumulative virus species discovery count over time.

    (PDF)

    S2 Fig. Partial dependence plots for all explanatory factors that influence human RNA virus discovery in the full model.

    Partial dependence plots show the effect of an individual explanatory factor over its range on the response after factoring out other explanatory factors. Fitted lines represent the median (black) and 95% quantiles (coloured) based on 1000 replicated models. Y axes are centred around the mean without scaling. X axes show the range of sampled values of explanatory factors.

    (PDF)

    S3 Fig. Moran’s I across different spherical distances.

    The solid line and dots represented the median Moran’s I value, and the grey area represented its 95% quantiles generated from 1000 samples (A: Raw virus data) or replicate BRT models (B: Model residuals).

    (PDF)

    S4 Fig. Partial dependence plots for all explanatory factors that influence human RNA virus discovery in the stratified model by transmissibility.

    (A) Strictly zoonotic, (B) Transmissible in humans. Partial dependence plots show the effect of an individual explanatory factor over its range on the response after factoring out other explanatory factors. Fitted lines represent the median (black) and 95% quantiles (coloured) based on 1000 replicated models. Y axes are centred around the mean without scaling. X axes show the range of sampled values of explanatory factors.

    (PDF)

    S5 Fig. Partial dependence plots for all explanatory factors that influence human RNA virus discovery in the stratified model by transmission mode.

    (A) Vector-borne, (B) Non-vector-borne. Partial dependence plots show the effect of an individual explanatory factor over its range on the response after factoring out other explanatory factors. Fitted lines represent the median (black) and 95% quantiles (coloured) based on 1000 replicated models. Y axes are centred around the mean without scaling. X axes show the range of sampled values of explanatory factors.

    (PDF)

    S6 Fig. Distribution maps for 32 explanatory factors in 2015.

    The values of these explanatory variables and latitude in each grid cell were used to predict the virus discovery in the corresponding grid cell across the globe in 2010–2019. Explanatory variables were log-transformed where necessary for better visualization; this does not mean that they entered the model as logged values.

    (PDF)

    S7 Fig. Predicted probability of human RNA virus discovery in 2010–2019 split by category.

    The triangles represented the actual discovery sites from 2010 to 2018, and the background colour represented the predicted discovery probability.

    (PDF)

    S8 Fig. Historical predicted probability of human RNA virus discovery by decade (except the first period with four years).

    The triangles represented the actual discovery sites in each decade, and the background colour represented the predicted discovery probability.

    (PDF)

    S1 Table. Summary of the human RNA virus database.

    (DOCX)

    S2 Table. Resolution and covered grid cells for virus discovery data.

    (DOCX)

    S3 Table. List of explanatory factors included in the model.

    (DOCX)

    S4 Table. Model validation statistics for stratified analyses.

    (DOCX)

    S5 Table. Model parameters for sensitivity analyses and stratified analyses.

    (DOCX)

    S1 Text. Georeferencing human RNA virus discovery locations.

    (DOCX)

    S2 Text. Transformation of resolution for explanatory factors and data extrapolation.

    (DOCX)

    S3 Text. Result of model validation.

    (DOCX)

    S4 Text. Source and permission for the world shapefile used in the study.

    (DOCX)

    S1 Video. The spatiotemporal pattern of human RNA virus discovery.

    The red spot represents the discovery location of each virus species over time. The red curve at the bottom-left corner represents the cumulative virus species count over time.

    (MP4)

    S1 R script. A zipped file with the raw data and R code that was used for generating figures for the full model.

    (ZIP)

    Attachment

    Submitted filename: Response to Editors and Reviewers.docx

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.


    Articles from PLoS Pathogens are provided here courtesy of PLOS
