PLOS ONE. 2022 Dec 12;17(12):e0278982. doi: 10.1371/journal.pone.0278982

Machine learning models exploring characteristic single-nucleotide signatures in yellow fever virus

Álvaro Salgado 1,*,#, Raquel C de Melo-Minardi 2,#, Marta Giovanetti 1,3,#, Adriano Veloso 2,#, Francielly Morais-Rodrigues 1, Talita Adelino 4, Ronaldo de Jesus 5, Stephane Tosta 1, Vasco Azevedo 1, José Lourenco 6,*, Luiz Carlos J Alcantara 1,3,*
Editor: Rakesh Kumar Verma 7
PMCID: PMC9744328  PMID: 36508435

Abstract

Yellow fever virus (YFV) is the agent of the most severe mosquito-borne disease in the tropics. Recently, Brazil suffered major YFV outbreaks with a high fatality rate, affecting areas where the virus had not been reported for decades, including urban areas where a large number of unvaccinated people live. We developed a machine learning framework combining three different algorithms (XGBoost, random forest and regularized logistic regression) to analyze YFV genomic sequences. This method was applied to 56 YFV sequences from human infections and 27 from non-human primate (NHP) infections to investigate the presence of genetic signatures possibly related to disease severity (in human-related sequences) and differences in PCR cycle threshold (Ct) values (in NHP-related sequences). Our analyses reveal four non-synonymous single nucleotide variations (SNVs) in sequences from human infections, in proteins NS3 (E614D), NS4a (I69V) and NS5 (R727G, V643A), and six non-synonymous SNVs in NHP sequences, in proteins E (L385F), NS1 (A171V), NS3 (I184V) and NS5 (N11S, I374V, E641D). We performed comparative protein structural analysis on these SNVs, describing possible impacts on protein function. Although the dataset is limited in size and this study does not consider virus-host interactions, our work highlights the use of machine learning as a versatile and fast initial approach to genomic data exploration.

Introduction

Yellow fever (YF) is an acute viral hemorrhagic disease endemic in tropical areas of Africa and Latin America. The causative agent, yellow fever virus (YFV), is the prototypical member of the genus Flavivirus (family Flaviviridae). It is a single-stranded, positive-sense RNA virus with a genome of about 11 kb and a single open reading frame of 10,233 nucleotides [1, 2]. Disease varies from nonspecific febrile illness to a fatal hemorrhagic fever. Symptoms usually appear after an incubation period of three to six days following the bite of an infected mosquito, with a period of infection lasting several days [2–4]. The World Health Organization (WHO) reports case fatality rates in the order of 15 to 50% [5]. Vaccination remains the most effective YF prevention method, providing lifetime immunity in up to 99% of vaccinated people [6]. Nevertheless, the burden of YF is estimated at between 84,000 and 170,000 severe cases and between 29,000 and 60,000 deaths annually [7, 8], while an estimated 35 million people remain unvaccinated in areas at risk in Brazil alone [9].

YFV spreads in two different cycles: sylvatic and urban. The sylvatic transmission cycle occurs in forested areas, where the virus is endemically transmitted between several non-human primate (NHP) species. The urban transmission cycle occurs when the virus is introduced into densely populated human settlements with urban-dwelling mosquitoes (mainly Aedes aegypti) [3]. Urban cycles of YFV transmission have been eradicated in Brazil since 1942 due to vaccination and vector control campaigns [10–13].

In the last decade in Brazil, however, human cases and NHP epizootics of YF have been reported in places beyond the limits of regions previously considered endemic (sylvatic) for the virus [14–17]. The severe impact of these recent outbreaks can be measured, in part, by their fatality rate of around 34%, higher than the general rate estimated by Monath and colleagues [4]. This raises the question of which factors could contribute to such a high fatality rate, and whether YFV genetic signatures could be among them.

Additionally, important findings in recent epidemics [11, 18] show a significant difference in the distribution of NHP Ct values, in which Callithrix spp. exhibit generally higher Ct values than other NHP species, do not develop fatal YFV infections similar to those reported in humans and can persist for longer, thus increasing the infectious period. The latter can be an essential factor in igniting an urban cycle of transmission, mainly due to the genus’ proximity to densely urbanized areas.

In this respect, genomic and epidemiological monitoring have become an integral part of the national (Brazil) and international response to emerging and ongoing epidemics of viral infectious diseases, making a large amount of genomic data available [19–23].

Machine learning (ML) approaches are commonly applied in genomic and epidemiological monitoring analysis [24, 25], as described in works that analyze the effectiveness of large-scale genome-wide association studies (GWAS). Their appeal lies in their capability to computationally model the relationship between observed outcomes and combinations of single nucleotide variants, other genetic variations and environmental factors [26–28].

We curated two different datasets of YFV genomes, one from human cases and the other from NHP cases. After data curation, the human dataset contained 56 YFV sequences, with 40 sequences related to infections leading to severe outcomes or death, and 16 sequences related to cases with no severe outcome. We also gathered an NHP (Callithrix spp.) dataset that, after curation, contained 27 sequences, of which 21 were related to low Ct values (< 20) and 6 were related to high Ct values (≥ 20).

We applied three different ML models to each dataset to improve the robustness of the ML analysis [29]. We then analyzed the models using SHAP (SHapley Additive exPlanation) [30–32] to highlight genetic signatures. The possible biological impacts of these signatures were investigated and discussed by means of in-silico protein structural analysis coupled with literature review.

Results

Non-human primates Ct value statistical analysis

Fig 1 shows the distribution of cycle threshold (Ct) values from Callithrix spp. sequences, with two distinct clusters roughly around 12 and 30, and a median value of 26.1. Hartigan’s dip test of unimodality [33] rejected the null hypothesis of a unimodal distribution for Ct values (p < .001), which indicates the existence of two groups of Callithrix spp. Ct values.

Fig 1. Ct values associated with Callithrix spp. sequences.

Fig 1

The median value is shown by a dashed line. Hartigan’s dip test of unimodality indicates bimodal distribution (p < .001).

Machine learning models’ performance

Fig 2 shows the confusion matrices for the machine learning models applied. For the human dataset, the XGBoost classification model correctly classified 16 severe/death and 5 not severe/not death cases, out of a total of 28 instances, achieving an accuracy of 75% on the test set, with F1 scores of 0.59 and 0.82 for classes 0 (not severe/not death) and 1 (severe/death), respectively. The random forest model correctly classified 16 severe/death and 7 not severe/not death cases, out of a total of 28 instances, achieving an accuracy of 82% on the test set, with F1 scores of 0.74 and 0.86 for classes 0 and 1, respectively. The performance of the regularized logistic regression model was the same as that of random forest.

Fig 2. Confusion matrices.

Fig 2

Each box contains one confusion matrix, which measures the performance of different machine learning algorithms over different datasets. For each matrix, the rows represent instances in the actual class, while the columns represent instances in the predicted class. Instances found along the diagonal were correctly classified, while those outside the diagonal were misclassified (false positives and false negatives). Top three matrices correspond to human test dataset and the bottom three matrices correspond to Callithrix spp. test dataset, for XGBoost, random forest and regularized logistic regression respectively.

For the Callithrix spp. dataset, the three models achieved an accuracy of 100% on the test set, correctly predicting 5 low Ct cases and 1 high Ct case, out of 6 instances in the test set, with F-1 scores of 1.00 and 1.00 for classes 0 (low Ct) and 1 (high Ct) respectively.
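The accuracy and F1 figures reported above follow directly from the confusion-matrix counts. The sketch below recomputes them. The text gives only the correctly classified counts, so the off-diagonal cells for the human dataset are an assumption: they presume a test set of 8 not severe/not death and 20 severe/death cases, a split that reproduces the reported scores. The Callithrix spp. matrix is fully determined by the text.

```python
def scores(cm):
    """Return (accuracy, F1 for class 0, F1 for class 1) from a 2x2 matrix.

    cm[i][j] counts instances of actual class i predicted as class j.
    """
    (tn, fp), (fn, tp) = cm
    accuracy = (tn + tp) / (tn + fp + fn + tp)

    def f1(true_pos, false_pos, false_neg):
        return 2 * true_pos / (2 * true_pos + false_pos + false_neg)

    # For class 0 the roles of the off-diagonal cells are swapped.
    return accuracy, f1(tn, fn, fp), f1(tp, fp, fn)


# Diagonals come from the text; off-diagonals ASSUME 8 class-0 and
# 20 class-1 test cases (not stated in the text).
xgboost_cm = [[5, 3], [4, 16]]
random_forest_cm = [[7, 1], [4, 16]]
# Callithrix spp.: 5 low Ct and 1 high Ct, all correctly predicted.
callithrix_cm = [[5, 0], [0, 1]]

for name, cm in [("XGBoost", xgboost_cm),
                 ("random forest", random_forest_cm),
                 ("Callithrix spp.", callithrix_cm)]:
    acc, f1_0, f1_1 = scores(cm)
    print(f"{name}: accuracy={acc:.2f}, F1(0)={f1_0:.2f}, F1(1)={f1_1:.2f}")
```

Because the class-0 group is small, a single misclassified not severe/not death case moves its F1 score noticeably, which is worth keeping in mind when comparing the models.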

YFV genetic signatures

The machine learning methods identified the non-synonymous SNVs shown in Table 1. The table displays each protein where the SNV was found, with nucleotide position on YFV genome, position relative to the protein’s sequence, amino acid position relative to the translated protein, reference genome amino acid and corresponding codon, analyzed sequences amino acid variation and corresponding codon and SNV position inside codon (1st, 2nd or 3rd).

Table 1. YFV genetic signatures.

Human dataset

| Protein | nn position | nn position on protein | aa position on protein | aa reference | codon reference | aa variation | codon variation | SNV codon position (1, 2, 3) |
|---|---|---|---|---|---|---|---|---|
| NS3 | 6412 | 1842 | 614 | E | gaa | D / D | gac / gat | 3 |
| NS4a | 6644 | 205 | 69 | I | atc | V | gtc | 1 |
| NS5 | 9815 | 2179 | 727 | R | agg | G | ggg | 1 |
| NS5 | 9564 | 1928 | 643 | V | gtt | A | gct | 2 |

NHP dataset

| Protein | nn position | nn position on protein | aa position on protein | aa reference | codon reference | aa variation | codon variation | SNV codon position (1, 2, 3) |
|---|---|---|---|---|---|---|---|---|
| NS5 | 8756 | 1120 | 374 | I | atc | V | gtc | 1 |
| NS5 | 9559 | 1923 | 641 | E | gaa | D / D | gac / gat | 3 |
| NS3 | 5120 | 550 | 184 | I | atc | V | gtc | 1 |
| E | 2126 | 1153 | 385 | L | ctc | F | ttc | 1 |
| NS5 | 7647 | 11 | 4 | N | aat | S | agt | 2 |
| NS1 | 2964 | 512 | 171 | A | gca | V | gta | 2 |

The analysis of human YFV sequences highlighted 4 SNV positions that result in an amino acid change, and the analysis of Callithrix spp. sequences highlighted 6 SNV positions that result in an amino acid change.
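The relationship between the columns of Table 1 is simple integer arithmetic: a 1-based nucleotide position inside a coding region determines both the amino acid position and the position inside the codon. The sketch below illustrates this with the values from Table 1; the function names are illustrative and not taken from the study's code.

```python
def codon_coordinates(nt_pos_in_protein):
    """Map a 1-based nucleotide position within a coding region to
    (1-based amino acid position, position inside the codon: 1, 2 or 3)."""
    zero_based = nt_pos_in_protein - 1
    return zero_based // 3 + 1, zero_based % 3 + 1


def changed_codon_positions(ref_codon, var_codon):
    """Return the 1-based codon positions at which two codons differ."""
    return [i + 1 for i, (a, b) in enumerate(zip(ref_codon, var_codon)) if a != b]


# "nn position on protein" -> (aa position, SNV codon position), from Table 1
table_1 = {
    1842: (614, 3), 205: (69, 1), 2179: (727, 1), 1928: (643, 2),
    1120: (374, 1), 1923: (641, 3), 550: (184, 1), 1153: (385, 1),
    11: (4, 2), 512: (171, 2),
}

for nt_pos, expected in table_1.items():
    assert codon_coordinates(nt_pos) == expected

print(changed_codon_positions("gaa", "gac"))  # third codon position differs
```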

Protein structural analysis

We performed protein structural analysis for all SNVs indicated in Table 1. The templates, their resolution, the quality of models provided from Swiss-Model [34] and the changes in binding affinity and stability predicted by mCSM-NA [35] are summarized in Table 2.

Table 2. Protein structural analysis results.

| Dataset | Callithrix spp. | Callithrix spp. | Callithrix spp. | Callithrix spp. | Callithrix spp. | Callithrix spp. | Human | Human | Human | Human |
|---|---|---|---|---|---|---|---|---|---|---|
| Protein | E | NS1 | NS3 | NS5 | NS5 | NS5 | NS3 | NS4a | NS5 | NS5 |
| SNV | L385F | A171V | I184V | N11S | I374V | E641D | E614D | I69V | R727G | V643A |
| Template (PDB ID) | Template (6WI5) (?) | Zika virus NS1 (5K6K) (27) | Dengue virus NS3 (5YV8:A) (31) | Yellow fever virus NS5 (6QSN) (32) | Yellow fever virus NS5 (6QSN) (32) | Yellow fever virus NS5 (6QSN) (32) | Yellow fever virus NS3 (1YKS) (29) | No structure deposited in the PDB | Yellow fever virus NS5 (6QSN) (32) | Yellow fever virus NS5 (6QSN) (32) |
| Resolution | 1.83 Å | 1.89 Å | 2.5 Å | 3.00 Å | 3.00 Å | 3.00 Å | 1.80 Å | - | 3.00 Å | 3.00 Å |
| Coverage (%) | | 100% | 99% | | | | | - | | |
| Sequence identity | | 47.58% | 50.57% | | | | | - | | |
| Localization at protein | Domain III | Wing flexible loop | Linker region | MTase domain | Palm subdomain | Palm subdomain | Helicase domain | - | Thumb subdomain | Palm subdomain |
| Global Model Quality Estimation (GMQE) | | 0.79 | 0.76 | | | | | - | | |
| QMEAN | | -2.5 | -2.22 | | | | | - | | |
| Predicted change in stability | | | | | -1.908 kcal/mol (Destabilising) | | -0.254 kcal/mol (Destabilising) | - | -0.751 kcal/mol (Destabilising) | -1.884 kcal/mol (Destabilising) |

Predicted change in binding affinity, as listed: Target distant from binding sites; ΔΔG = 0.003 kcal/mol (Increased affinity); ΔΔG = 0.025 kcal/mol (Increased affinity); -; ΔΔG = -1.545 kcal/mol (Reduced affinity); ΔΔG = -0.001 kcal/mol (Reduced affinity).

Fig 3 shows the structural representation of proteins E, NS1, NS3 and NS5, with corresponding SNVs.

Fig 3. YFV proteins analysis.

Fig 3

(A) Envelope protein. (A, left) PDB id 6wi5: tetramer of E protein, with LEU385 shown as spheres. (A, right) PDB id 6ivz: in green, one chain of protein E; in yellow, the light and heavy chains of monoclonal antibody 5A. Note that the L385F SNV is distant from both the binding interface between protein E chains and the antibody recognition site. (B) NS1 protein. Comparative model built with Swiss-Model and PDB id 5k6k. In orange, the wing flexible loop; in spheres, SNV ALA171. (C) NS3 protein. (C, left) Comparative model built with Swiss-Model and PDB id 5yvu. In orange, the interdomain linker region; in spheres, SNV ILE184. (C, right) GLU614 from PDB id 1yks, showing that the SNV E614D is close to the cleft where RNA binds the NS3 protein. (D) NS5 protein. In green, SNVs. In yellow, important residues conserved across ZIKV, DENV and WNV. In cyan, the active site. In grey, the ligand S-adenosyl-L-homocysteine. Sulfate and Zn ions are also represented as spheres.

E (envelope) protein

LEU385 (NHP dataset) is far from the inter-chain binding site and the antibody recognition site (Fig 3A). LEU385 lies in Domain III (DIII), which has an immunoglobulin C-like (IgC-like) domain with a seven-stranded fold and is thought to contain the receptor-binding site. DIII undergoes a rotation and moves closer to the fusion loop (FL), bringing the C-terminal part of DIII (residue 392) close to FL. LEU385 is 20.3 Å away from GLU392.

NS1 protein

The SNV A171V (NHP dataset) is in close contact with a region called wing flexible loop (highlighted in orange in Fig 3B) and it is also a probable glycosylation site [36].

NS3 protein

Although there is a crystallographic structure (PDB id: 1yks) of the helicase domain [37] and another containing part of NS3 complexed with NS2B (PDB id 6urv) [38] deposited in the PDB, the referred SNV occurs in the unresolved stretch. The I184V (NHP dataset) is in the linker region (shown in orange in Fig 3C, left). It connects protease and helicase domains and corresponds to sequence KEEGKEELQEIP that encompasses residues between 174 and 185. The SNV E614D (human dataset) occurs in the helicase domain and is located in the RNA binding cleft, shown in red on Fig 3C, right.

NS4a protein

Since it has multiple transmembrane hydrophobic segments, structural analysis of NS4a has been unsuccessful and, so far, there is no structure deposited in the PDB. It is still one of the least characterized proteins from YFV. It was not possible to obtain a good structural model since the best template found had a coverage of only 37% and a sequence identity of 25.53%.

NS5 protein

A YFV NS5 structure was recently deposited in the PDB (PDB id 6qsn) [39]. The analyzed SNVs (shown in green in Fig 3D) are N11S, the only SNV in the MTase domain, and I374V and E641D, both located in the palm subdomain. These three SNVs were found in the NHP dataset. Additionally, SNV V643A, located in the palm subdomain, and R727G, in the thumb subdomain, were found in the human dataset. As depicted in Fig 3D, the SNVs (green) are located far from the Zn and sulfate ions and from the ligand S-adenosyl-L-homocysteine (grey). They are not close to any of the important conserved residues (yellow) that interact with the nucleic acid. SNV I374V is also present in ZIKV, DENV and WNV. The residue at position 641 (E641D) varies across other viruses (K, N, R). Position 643 (V643A) is also not conserved, being an insertion, K or N. Position 727 (R727G) is S, E or T in other flaviviruses.

Discussion

Emerging and reemerging viruses present a highly complex challenge for the Brazilian public health system. Among them, arboviruses transmitted by mosquitoes are agents capable of causing serious diseases, such as hemorrhagic fevers, encephalitis and meningitis. For these reasons, real-time genomic surveillance is extremely important to guide prevention and control measures, as it allows reconstruction of the origins of epidemics and the estimation of transmission rates at different times and geographic regions, subject to environmental and human factors. In addition, genomic surveillance makes it possible to identify emerging, re-emerging, circulating and co-circulating variants, through viral genetic diversity quantification, making it possible to estimate the likelihood of new outbreaks and/or possible escapes from existing vaccines and treatments. As a result, relevant information is acquired for the design of public health policies, in addition to contributing to the development of vaccines, new drugs and improved serological and molecular diagnostic methods [40, 41].

In this context, Brazil has become a global reference in real-time genomic surveillance, achieving fundamental results in early detection and monitoring of outbreaks. However, the large amount of data produced using next-generation sequencing platforms demands sophisticated analytical approaches, capable of dealing with complex and large datasets, aiming at the extraction of as much information as possible. In this sense, Machine Learning algorithms have been successfully used in Bioinformatics, motivating their application in the search for genetic signatures in arboviruses, associated with phenotypic or epidemiological characteristics in recent outbreaks in Brazil.

In this study, we demonstrate the potential of applying ML approaches on real-time genomic surveillance, to quickly identify genetic loci which may be of public health interest. Further studies and analytical strategies in line with the present work can help improve real-time epidemiological surveillance in Brazil and the Americas, resulting in better public health policy outcomes.

We find signals in multiple genetic loci and present a structural-based review on the potential impact of changes at those loci.

However, the limited number of sequences analyzed demands caution when presenting the results. A large number of high-quality sequences is ideal for the application of ML analysis, especially when dealing with viruses, whose high mutation rates tend to introduce many variations into their genomes. Furthermore, our analysis did not consider virus-host interactions or host factors, such as the host genome, the immune system and pre-existing health conditions. In this regard, efficient host data collection, such as through Electronic Health Records (EHR), is of paramount importance for a thorough investigation of clinical outcomes.

The envelope (E) protein is related to virus attachment and fusion [42]. NHP dataset analysis shows SNV L385F in Domain III, a region that contains an IgC-like domain and is thought to contain a receptor-binding site crucial for virion maturation [43]. This SNV could affect the plasticity of this domain and, in turn, the virus’ receptor-binding site. It would be interesting to investigate this behaviour through molecular dynamics simulations in future work.

NS1 protein is a crucial non-structural protein [36]. We found a non-synonymous SNV (A171V) in the NHP dataset, located near a highly flexible region of the protein called the wing flexible loop, which is a probable glycosylation site (GS) [44]. NS1 is also a key protein secreted by infected cells, with the potential to interact with adaptive immune system responses [36]. In dengue infections, NS1 is known to modulate capillary leakage in severe disease and may thus play a role in the severity of YFV infection [45].

NS3 protein, which is composed of protease and helicase domains, has functions related to viral polyprotein processing and cleavage, viral genome replication and RNA capping [42]. SNV I184V, found in the NHP dataset, is in a region with probably limited functional constraints [46]. SNV E614D, found in the human dataset, occurs in the NS3 helicase domain and is located in the RNA binding cleft. ASP and GLU are both negatively charged amino acids, but ASP has a shorter side chain, which can cause it to lose access to the ligand. We used mCSM-NA [35] to evaluate the impact of the SNV on stability and affinity with RNA, showing a small destabilizing effect on the interaction with RNA (-0.646 kcal/mol), which could have an impact on its function, fundamental to viral genome replication. Unfortunately, there are no structures in complex with RNA available. Models built with protein-RNA docking techniques could help elucidate whether this SNV has a significant impact on RNA interaction.

We found one SNV in the NS4a protein (I69V, human dataset). However, the lack of current knowledge on YFV NS4a prevented us from further exploring the possible role of I69V in human hosts. Based on the protein’s proposed functions [47–49], this SNV could, in principle, affect viral replication, but such a hypothesis would have to be tested by non-computational means.

As an outlier, NS5 had the highest number of identified variations (I374V, E641D, N11S, R727G and V643A), from both human and NHP YFV sequences. NS5 protein is a fundamental enzyme for viral replication because it contains an N-terminal methyltransferase domain (MTase) and a C-terminal RNA-dependent RNA polymerase domain (RdRp) [50]. The MTase domain has important functions in protecting viral RNA from degradation and from the innate immune response. RdRp is essential for viral RNA replication, because its activities cannot be performed by host polymerases, and is a promising target for antiviral drug development [51]. Furthermore, dengue virus NS5 has been associated with immune response evasion [52]. In the structural analyses, none of the SNVs was found in a position with a reported connection to protein function or structure. However, R727G in the human dataset, in the thumb subdomain of the RdRp domain, shows a reduction in predicted affinity for RNA upon SNV occurrence. This reduction in affinity could impact polymerase function and viral replication efficiency.

Conclusions

In conclusion, even though the proposed method was applied to data that were already available from other sources, our study demonstrates that it is efficient and easy to replicate, making it suitable for real-time genomic surveillance, in which genetic data are analyzed as they are generated. This approach may help detect and inform on possible connections between ongoing genetic changes and public health in a timely manner.

Materials and methods

Ethics approval and consent to participate

This project was reviewed and approved by the Comissão Nacional de Ética em Pesquisa (CONEP) [National Research Ethics Committee] from the Brazilian Ministry of Health (BrMoH), as part of the arboviral genomic surveillance efforts within the terms of Resolution 510/2016 of CONEP, by the Pan American Health Organization Ethics Review Committee (PAHOERC) (Ref. No. PAHO-2016-08-0029), and by the Oswaldo Cruz Foundation Ethics Committee (CAAE: 90249218.6.1001.5248).

Datasets

We retrieved YFV complete or near-complete genome sequences from the recent Brazilian outbreaks, available on public databases [19–23], with associated epidemiological and clinical data containing relevant information regarding clinical severity and outcome for human infection samples, as well as PCR cycle threshold value (Ct) for both human and NHP samples. The alignment was made using MAFFT online [53] and was manually verified and corrected using AliView (https://ormbunkar.se/aliview/). Sequences with coverage lower than 90% were removed from the study. For ML analysis, human infection sequences were divided between not severe/not death and severe/death. Callithrix spp. infection sequences were divided by low Ct (< 20) and high Ct (≥ 20). After curation, the human dataset contained 56 YFV sequences, with 40 sequences related to severe/death cases, and 16 sequences related to not severe/not death cases. The NHP (Callithrix spp.) dataset, after curation, contained 27 sequences, of which 21 were related to low Ct values (< 20) and 6 were related to high Ct values (≥ 20).
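The curation rules described above (removing sequences with coverage below 90% and splitting Callithrix spp. sequences at Ct = 20) can be sketched as follows. This is a minimal illustration, not the study's code: coverage is assumed here to mean the fraction of alignment columns with a called base (A, C, G or T), and the record layout and example records are hypothetical. The class encoding (0 for low Ct, 1 for high Ct) follows the Results section.

```python
def coverage(aligned_seq):
    """Fraction of alignment columns carrying a called base (not a gap or N).

    This definition of coverage is an assumption made for illustration.
    """
    called = sum(1 for base in aligned_seq.upper() if base in "ACGT")
    return called / len(aligned_seq)


def ct_label(ct, threshold=20.0):
    """Return 0 for low Ct (< threshold) and 1 for high Ct (>= threshold)."""
    return 1 if ct >= threshold else 0


def curate(records, min_coverage=0.90):
    """Keep records with coverage >= min_coverage and attach a Ct class."""
    kept = []
    for name, seq, ct in records:
        if coverage(seq) >= min_coverage:
            kept.append((name, seq, ct_label(ct)))
    return kept


# Hypothetical toy records: (name, aligned sequence, Ct value)
records = [
    ("sample_a", "ACGTACGTAC", 12.0),  # full coverage, low Ct
    ("sample_b", "ACGTNNNN--", 30.0),  # 40% coverage, removed
    ("sample_c", "ACGTACGTA-", 20.0),  # 90% coverage, high Ct
]
curated = curate(records)
print([(name, label) for name, _, label in curated])
```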

Machine learning model adjustment

We applied three different ML models to each of the two analyzed datasets: XGBoost [54, 55], random forest [56] and regularized logistic regression [57]. We adjusted the XGBoost model parameters in a “grid-search cross-validation” scheme with five folds. Random forest adjustment used “out of bag” data as validation. Regularized logistic regression parameters were adjusted in a 10-fold cross-validation scheme, using their averages in the final model. Part of each dataset (the test set) was held out of model adjustment and validation, and was used afterwards to test the models’ performance (as presented earlier).
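A grid-search cross-validation scheme of the kind described above can be sketched in a few lines. This is an illustration under stated assumptions, not the study's implementation: the parameter names in the grid are examples only (the text does not say which XGBoost parameters were tuned), `score_fn` is a hypothetical callback that would train a model on the training indices and return a validation score, and the toy scoring function below stands in for real model training.

```python
import random
from itertools import product


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and yield (train, validation) lists for k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def grid_search(param_grid, score_fn, n_samples, k=5):
    """Return the parameter combination with the best mean validation score."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_scores = [score_fn(params, train, val)
                       for train, val in k_fold_indices(n_samples, k)]
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


# Illustrative grid and toy score; a real score_fn would fit a model.
param_grid = {"max_depth": [2, 3, 4], "learning_rate": [0.1, 0.3]}


def toy_score(params, train, val):
    return -abs(params["max_depth"] - 3) - params["learning_rate"]


# 28 samples assumed for adjustment (56 human sequences minus 28 test cases).
best_params, best_score = grid_search(param_grid, toy_score, n_samples=28)
print(best_params, best_score)
```

Keeping the held-out test set outside `k_fold_indices` entirely is what allows the test-set scores to be read as an estimate of performance on unseen sequences.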

Model interpretation

Feature importance was computed using SHAP (SHapley Additive exPlanation) [30–32]. We followed the author’s suggestion (https://github.com/slundberg/shap/issues/397) on dealing with categorical data when using SHAP.
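One common way to handle categorical features with SHAP, assumed here for illustration and not confirmed by the text, is to one-hot encode each alignment site and then sum the SHAP values of all columns that belong to the same site, which the additivity of SHAP values permits. The sketch below ranks sites by the mean absolute summed value. The column naming scheme (`<site>_<base>`) and the SHAP values are made up for the example.

```python
from collections import defaultdict


def aggregate_by_site(shap_rows, column_names):
    """Sum SHAP values of one-hot columns that encode the same alignment site.

    shap_rows: per-sample lists of SHAP values, one value per one-hot column.
    column_names: names of the form "<site>_<base>" (hypothetical scheme).
    Returns {site: mean absolute per-sample summed SHAP value}.
    """
    site_of = [name.rsplit("_", 1)[0] for name in column_names]
    totals = defaultdict(float)
    for row in shap_rows:
        per_site = defaultdict(float)
        for site, value in zip(site_of, row):
            per_site[site] += value
        for site, value in per_site.items():
            totals[site] += abs(value)
    return {site: total / len(shap_rows) for site, total in totals.items()}


# Made-up SHAP values for two samples and two sites (not study results).
columns = ["6412_A", "6412_C", "6644_A", "6644_G"]
shap_rows = [
    [0.1, 0.3, 0.0, -0.05],
    [-0.2, -0.2, 0.05, 0.0],
]
importance = aggregate_by_site(shap_rows, columns)
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)
```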

Protein structural analysis

We searched the Protein Data Bank (PDB) [58] for experimentally resolved structures. For those proteins that did not have structures, we looked for templates for comparative modeling with at least 30% identity. The comparative models were built with the Swiss-Model server [59]. We used the mCSM-NA method [35] to predict the impact of SNVs on protein stability and interactions.

Data Availability

Data were retrieved from open access sources and repositories, available at: http://www.nature.com/articles/s41598-019-56650-1; https://science.sciencemag.org/content/361/6405/894; https://jvi.asm.org/content/94/1/e01623-19; https://dx.plos.org/10.1371/journal.pntd.0008405; https://dx.plos.org/10.1371/journal.ppat.1008699. Code is available at GitHub (https://github.com/alvarosalgado/yfv_code).

Funding Statement

A.S. was supported by Decit, SCTIE, Brazilian Ministry of Health, Conselho Nacional de Desenvolvimento Científico - CNPq - (Grants 440685/2016-8 and 440856/2016-7) https://www.gov.br/cnpq/pt-br; Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - CAPES (Grants 88887.130716/2016-00, 88881.130825/2016-00 and 88887.130823/2016-00) https://www.gov.br/capes/pt-br; The European Union’s Horizon 2020 Research and Innovation Programme under ZIKAlliance Grant Agreement No. 734548 - https://zikalliance.tghn.org. All other authors received no specific funding for this work. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Chambers T. J., Hahn C. S., Galler R., and Rice C. M., “Flavivirus genome organization, expression, and replication,” Annu. Rev. Microbiol., vol. 44, no. 1, pp. 649–688, Oct. 1990, doi: 10.1146/annurev.mi.44.100190.003245
2. Monath T. P., “Yellow fever: an update,” Lancet Infect Dis, vol. 1, no. 1, pp. 11–20, Aug. 2001, doi: 10.1016/S1473-3099(01)00016-0
3. Gardner C. L. and Ryman K. D., “Yellow Fever: A Reemerging Threat,” Clinics in Laboratory Medicine, vol. 30, no. 1, pp. 237–260, Mar. 2010, doi: 10.1016/j.cll.2010.01.001
4. Monath T. P., “Treatment of yellow fever,” Antiviral Research, vol. 78, no. 1, pp. 116–124, Apr. 2008, doi: 10.1016/j.antiviral.2007.10.009
5. “WHO Report on Global Surveillance of Epidemic-prone Infectious Diseases—Yellow fever,” World Health Organization, 2015. https://www.who.int/csr/resources/publications/yellowfev/CSR_ISR_2000_1/en/ (accessed Jul. 02, 2020).
6. “Yellow Fever,” Pan American Health Organization / World Health Organization, Accessed 2020. https://www.paho.org/hq/index.php?option=com_topics&view=article&id=69&Itemid=40784&lang=en (accessed Jun. 29, 2020).
7. Paules C. I. and Fauci A. S., “Yellow Fever—Once Again on the Radar Screen in the Americas,” New England Journal of Medicine, vol. 376, no. 15, pp. 1397–1399, Apr. 2017, doi: 10.1056/NEJMp1702172
8. “Yellow fever,” World Health Organization, May 07, 2019. https://www.who.int/news-room/fact-sheets/detail/yellow-fever (accessed Jul. 01, 2020).
9. Shearer F. M. et al., “Global yellow fever vaccination coverage from 1970 to 2016: an adjusted retrospective analysis,” The Lancet Infectious Diseases, vol. 17, no. 11, pp. 1209–1217, 2017.
10. Consoli R. A. and de Oliveira R. L., Principais mosquitos de importância sanitária no Brasil. SciELO-Editora FIOCRUZ, 1994.
11. de M M. A. M. Mares-Guia et al., “Yellow fever epizootics in non-human primates, Southeast and Northeast Brazil (2017 and 2018),” Parasites & Vectors, vol. 13, no. 1, 2020, doi: 10.1186/s13071-020-3966-x
12. “Yellow fever, the return of an old threat,” Fiocruz, May 02, 2017. https://portal.fiocruz.br/en/news/yellow-fever-return-old-threat (accessed Jul. 01, 2020).
13. da C P. F. Vasconcelos, “Yellow fever in Brazil: thoughts and hypotheses on the emergence in previously free areas,” Revista de Saúde Pública, vol. 44, no. 6, pp. 1144–1149, Dec. 2010, doi: 10.1590/S0034-89102010005000046
14. Delatorre E. et al., “Distinct YFV Lineages Co-circulated in the Central-Western and Southeastern Brazilian Regions From 2015 to 2018,” Front. Microbiol., vol. 10, 2019, doi: 10.3389/fmicb.2019.01079
15. “DIVE—Boletim Epidemiológico n° 06/2020 Situação epidemiológica da Febre Amarela em Santa Catarina (Atualizado em 10/06/2020).” http://www.dive.sc.gov.br/index.php/arquivo-noticias/1204-boletim-epidemiologico-n-06-2020-situacao-epidemiologica-da-febre-amarela-em-santa-catarina-atualizado-em-10-06-2020 (accessed Jan. 06, 2021).
16. “Boletim epidemiológico da Febre Amarela no Brasil 2019/2020 | RETS—Rede Internacional de Educação de Técnicos em Saúde.” http://www.rets.epsjv.fiocruz.br/biblioteca/boletim-epidemiologico-da-febre-amarela-no-brasil-20192020 (accessed Jan. 06, 2021).
17. “Febre Amarela | Secretaria da Saúde,” 2020. https://www.saude.pr.gov.br/Editoria/Febre-Amarela (accessed Jan. 06, 2021).
18. Cunha M. S. et al., “Epizootics due to Yellow Fever Virus in São Paulo State, Brazil: viral dissemination to new areas (2016–2017),” Scientific Reports, vol. 9, no. 1, p. 5474, Apr. 2019, doi: 10.1038/s41598-019-41950-3
19. dos P M. Cunha et al., “Origin of the São Paulo Yellow Fever epidemic of 2017–2018 revealed through molecular epidemiological analysis of fatal cases,” Scientific Reports, vol. 9, no. 1, Dec. 2019, doi: 10.1038/s41598-019-56650-1
20. Faria N. R. et al., “Genomic and epidemiological monitoring of yellow fever virus transmission potential,” Science, vol. 361, no. 6405, pp. 894–899, Aug. 2018, doi: 10.1126/science.aat7115
21. Giovanetti M. et al., “Yellow Fever Virus Reemergence and Spread in Southeast Brazil, 2016–2019,” Journal of Virology, vol. 94, no. 1, Oct. 2019, doi: 10.1128/JVI.01623-19
22. Goes de Jesus J. et al., “Yellow fever transmission in non-human primates, Bahia, Northeastern Brazil,” PLOS Neglected Tropical Diseases, vol. 14, no. 8, p. e0008405, Aug. 2020, doi: 10.1371/journal.pntd.0008405
23. Hill S. C. et al., “Genomic Surveillance of Yellow Fever Virus Epizootic in São Paulo, Brazil, 2016–2018,” PLOS Pathogens, vol. 16, no. 8, p. e1008699, Aug. 2020, doi: 10.1371/journal.ppat.1008699
24. Chen X. and Ishwaran H., “Random forests for genomic data analysis,” Genomics, vol. 99, no. 6, pp. 323–329, Jun. 2012, doi: 10.1016/j.ygeno.2012.04.003
25. Ishwaran H., Kogalur U. B., Gorodeski E. Z., Minn A. J., and Lauer M. S., “High-Dimensional Variable Selection for Survival Data,” Journal of the American Statistical Association, vol. 105, no. 489, pp. 205–217, Mar. 2010, doi: 10.1198/jasa.2009.tm08622
26. Behravan H. et al., “Machine learning identifies interacting genetic variants contributing to breast cancer risk: A case study in Finnish cases and controls,” Scientific Reports, vol. 8, no. 1, Dec. 2018, doi: 10.1038/s41598-018-31573-5
27. Ho D. S. W., Schierding W., Wake M., Saffery R., and O’Sullivan J., “Machine Learning SNP Based Prediction for Precision Medicine,” Frontiers in Genetics, vol. 10, Mar. 2019, doi: 10.3389/fgene.2019.00267
28. Moore J. H., Asselbergs F. W., and Williams S. M., “Bioinformatics challenges for genome-wide association studies,” Bioinformatics, vol. 26, no. 4, pp. 445–455, Feb. 2010, doi: 10.1093/bioinformatics/btp713
29. Wolpert D. H., “The Lack of A Priori Distinctions Between Learning Algorithms,” Neural Computation, vol. 8, no. 7, pp. 1341–1390, Oct. 1996, doi: 10.1162/neco.1996.8.7.1341
30. Lundberg S. and Lee S.-I., “A Unified Approach to Interpreting Model Predictions,” arXiv:1705.07874 [cs, stat], Nov. 2017, Accessed: Jul. 10, 2020. [Online]. Available: http://arxiv.org/abs/1705.07874
31. Lundberg S. M. and Lee S.-I., “Consistent feature attribution for tree ensembles,” arXiv:1706.06060 [cs, stat], Feb. 2018, Accessed: May 01, 2020. [Online]. Available: http://arxiv.org/abs/1706.06060
32. Lundberg S. M. et al., “Explainable machine-learning predictions for the prevention of hypoxaemia during surgery,” Nature Biomedical Engineering, vol. 2, no. 10, pp. 749–760, Oct. 2018, doi: 10.1038/s41551-018-0304-0
33. Hartigan J. A. and Hartigan P. M., “The Dip Test of Unimodality,” Ann. Statist., vol. 13, no. 1, pp. 70–84, Mar. 1985, doi: 10.1214/aos/1176346577
34. Arnold K., Bordoli L., Kopp J., and Schwede T., “The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling,” Bioinformatics, vol. 22, no. 2, pp. 195–201, Jan. 2006, doi: 10.1093/bioinformatics/bti770
35. Pires D. E. V. and Ascher D. B., “mCSM-NA: predicting the effects of mutations on protein-nucleic acids interactions,” Nucleic Acids Res., vol. 45, no. W1, pp. W241–W246, 03 2017, doi: 10.1093/nar/gkx236
  • 36.Brown W. C. et al. , “Extended Surface for Membrane Association in Zika Virus NS1 Structure,” Nat Struct Mol Biol, vol. 23, no. 9, pp. 865–867, Sep. 2016, doi: 10.1038/nsmb.3268 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Wu J., Bera A. K., Kuhn R. J., and Smith J. L., “Structure of the Flavivirus Helicase: Implications for Catalytic Activity, Protein Interactions, and Proteolytic Processing,” J Virol, vol. 79, no. 16, pp. 10268–10277, Aug. 2005, doi: [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Noske G. D. et al. , “Structural characterization and polymorphism analysis of the NS2B-NS3 protease from the 2017 Brazilian circulating strain of Yellow Fever virus,” Biochim Biophys Acta Gen Subj, vol. 1864, no. 4, p. 129521, Apr. 2020, doi: 10.1016/j.bbagen.2020.129521 [DOI] [PubMed] [Google Scholar]
  • 39.Dubankova A. and Boura E., “Structure of the yellow fever NS5 protein reveals conserved drug targets shared among flaviviruses,” Antiviral Research, vol. 169, p. 104536, Sep. 2019, doi: 10.1016/j.antiviral.2019.104536 [DOI] [PubMed] [Google Scholar]
  • 40.Faria N. R., Sabino E. C., Nunes M. R. T., Alcantara L. C. J., Loman N. J., and Pybus O. G., “Mobile real-time surveillance of Zika virus in Brazil,” Genome Medicine, vol. 8, no. 1, p. 97, Sep. 2016, doi: 10.1186/s13073-016-0356-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Faria N. R. et al. , “Establishment and cryptic transmission of Zika virus in Brazil and the Americas,” Nature, vol. 546, no. 7658, pp. 406–410, Jun. 2017, doi: 10.1038/nature22401 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Barrows N. J. et al. , “Biochemistry and Molecular Biology of Flaviviruses,” Chemical Reviews, vol. 118, no. 8, pp. 4448–4482, Apr. 2018, doi: 10.1021/acs.chemrev.7b00719 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Lu X. et al. , “Double Lock of a Human Neutralizing and Protective Monoclonal Antibody Targeting the Yellow Fever Virus Envelope,” Cell Reports, vol. 26, no. 2, pp. 438–446.e5, Jan. 2019, doi: 10.1016/j.celrep.2018.12.065 [DOI] [PubMed] [Google Scholar]
  • 44.Watanabe Y., Bowden T. A., Wilson I. A., and Crispin M., “Exploitation of glycosylation in enveloped virus pathobiology,” Biochimica et Biophysica Acta (BBA)—General Subjects, vol. 1863, no. 10, pp. 1480–1497, Oct. 2019, doi: 10.1016/j.bbagen.2019.05.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Puerta-Guardo H., Glasner D. R., and Harris E., “Dengue Virus NS1 Disrupts the Endothelial Glycocalyx, Leading to Hyperpermeability,” PLOS Pathogens, vol. 12, no. 7, p. e1005738, Jul. 2016, doi: 10.1371/journal.ppat.1005738 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Luo D., Xu T., Hunke C., Grüber G., Vasudevan S. G., and Lescar J., “Crystal Structure of the NS3 Protease-Helicase from Dengue Virus,” Journal of Virology, vol. 82, no. 1, pp. 173–183, Jan. 2008, doi: 10.1128/JVI.01788-07 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Zou J. et al. , “Characterization of Dengue Virus NS4A and NS4B Protein Interaction,” Journal of Virology, vol. 89, no. 7, pp. 3455–3470, Apr. 2015, doi: 10.1128/JVI.03453-14 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Miller S., Kastner S., Krijnse-Locker J., Bühler S., and Bartenschlager R., “The non-structural protein 4A of dengue virus is an integral membrane protein inducing membrane alterations in a 2K-regulated manner,” J. Biol. Chem., vol. 282, no. 12, pp. 8873–8882, Mar. 2007, doi: 10.1074/jbc.M609919200 [DOI] [PubMed] [Google Scholar]
  • 49.Lin M.-H., Hsu H.-J., Bartenschlager R., and Fischer W. B., “Membrane undulation induced by NS4A of Dengue virus: a molecular dynamics simulation study,” Journal of Biomolecular Structure and Dynamics, vol. 32, no. 10, pp. 1552–1562, Oct. 2014, doi: 10.1080/07391102.2013.826599 [DOI] [PubMed] [Google Scholar]
  • 50.El Sahili A. and Lescar J., “Dengue Virus Non-Structural Protein 5,” Viruses, vol. 9, no. 4, Apr. 2017, doi: 10.3390/v9040091 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Zhu W., Chen C. Z., Gorshkov K., Xu M., Lo D. C., and Zheng W. , “RNA-Dependent RNA Polymerase as a Target for COVID-19 Drug Discovery,” SLAS DISCOVERY: Advancing the Science of Drug Discovery, p. 247255522094212, Jul. 2020, doi: 10.1177/2472555220942123 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Ashour J., Laurent-Rolle M., Shi P.-Y., and García-Sastre A., “NS5 of Dengue Virus Mediates STAT2 Binding and Degradation,” J Virol, vol. 83, no. 11, pp. 5408–5418, Jun. 2009, doi: 10.1128/JVI.02188-08 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Katoh K., Rozewicki J., and Yamada K. D., “MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization,” Briefings in Bioinformatics, vol. 20, no. 4, pp. 1160–1166, Jul. 2019, doi: 10.1093/bib/bbx108 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Friedman J. H., “Greedy Function Approximation: A Gradient Boosting Machine,” The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001. [Google Scholar]
  • 55.Schapire R. E., “The Boosting Approach to Machine Learning: An Overview,” in Nonlinear Estimation and Classification, vol. 171, Denison D. D., Hansen M. H., Holmes C. C., Mallick B., and Yu B., Eds. New York, NY: Springer New York, 2003, pp. 149–171. doi: 10.1007/978-0-387-21579-2_9 [DOI] [Google Scholar]
  • 56.Breiman L., “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001. [Google Scholar]
  • 57.Morais-Rodrigues F. et al. , “Analysis of the microarray gene expression for breast cancer progression after the application modified logistic regression,” Gene, vol. 726, p. 144168, Feb. 2020, doi: 10.1016/j.gene.2019.144168 [DOI] [PubMed] [Google Scholar]
  • 58.Goodsell D. S. et al. , “RCSB Protein Data Bank: Enabling biomedical research and drug discovery,” Protein Sci., vol. 29, no. 1, pp. 52–65, 2020, doi: 10.1002/pro.3730 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Schwede T., Kopp J., Guex N., and Peitsch M. C., “SWISS-MODEL: an automated protein homology-modeling server,” Nucleic Acids Res, vol. 31, no. 13, pp. 3381–3385, Jul. 2003, doi: 10.1093/nar/gkg520 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Rakesh Kumar Verma

17 Jun 2022

PONE-D-21-15059

Machine learning models exploring characteristic single-nucleotide signatures in yellow fever virus

PLOS ONE

Dear Dr. Salgado,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 30 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Rakesh Kumar Verma, Ph.D

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. 

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

3. Your ethics statement should only appear in the Methods section of your manuscript. If your ethics statement is written in any section besides the Methods, please move it to the Methods section and delete it from any other section. Please ensure that your ethics statement is included in your manuscript, as the ethics statement entered into the online submission form will not be published alongside your manuscript. 

4. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

********** 

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

********** 

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

********** 

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

********** 

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors developed a machine learning framework combining three different algorithms to analyze YFV genomic sequences. This method was applied to 56 YFV sequences from human infections and 27 from non-human primate (NHP) infections, and they investigated the presence of genetic signatures possibly related to disease severity (in human related sequences) and differences in PCR cycle threshold (Ct) values (in NHP related sequences).

The authors found four non-synonymous single nucleotide variations (SNVs) on sequences from human infections, in proteins NS3 (E614D), NS4a (I69V), NS5 (R727G, V643A) and six non-synonymous SNVs on NHP sequences, in proteins E (L385F), NS1 (A171V), NS3 (I184V) and NS5 (N11S, I374V, E641D).

They performed comparative protein structural analysis on these SNVs, trying to define possible impacts on protein function.

I have some major points to discuss here with authors:

Do you think that using only Callithrix spp. sequences as the NHP dataset could bias your analysis, since it is known that Alouatta spp. is the most sensitive NHP genus for YFV in the sylvatic cycle?

I think Fig 2 needs to be explained in more detail in the figure legend. It is hard for readers to understand the figure.

I think the authors should better discuss the importance of their findings for the control of YFV in Brazil and Latin America.

********** 

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Dec 12;17(12):e0278982. doi: 10.1371/journal.pone.0278982.r002

Author response to Decision Letter 0


4 Nov 2022

#1 Reviewer’s comment:

Do you think that using only Callithrix spp. sequences as the NHP dataset could bias your analysis, since it is known that Alouatta spp. is the most sensitive NHP genus for YFV in the sylvatic cycle?

Author’s answer:

The use of only Callithrix spp. sequences was intentional. It is indeed known that Alouatta spp. is the most sensitive NHP genus for YFV in the sylvatic cycle, but these animals usually live far from human populations, in forest areas separated from large urban centers, and are unlikely to be responsible for starting an urban cycle of transmission. Callithrix spp., on the other hand, are easily found in densely populated urban areas, in parks and in forested regions close to cities. A variation in PCR cycle threshold (Ct) values in Callithrix spp. therefore poses a greater risk of igniting an urban cycle of transmission, because it can signify a change in the persistence and infectivity of yellow fever virus in these animals.

Therefore, the present work intends to investigate if there are any genetic signatures possibly related to this specific phenomenon, and this is why the work focused on Callithrix spp. alone.

#2 Reviewer’s comment:

I think Fig 2 needs to be explained in more detail in the figure legend. It is hard for readers to understand the figure.

Author’s answer:

A change was made to the manuscript, reproduced here for your convenience:

“Machine learning models’ performance

Fig 2 shows the confusion matrices for the machine learning models applied. For the human dataset, the XGBoost classification model correctly classified 16 severe/death and 5 not severe/not death cases, out of a total of 28 instances, achieving an accuracy of 75% on the test set, with F1 scores of 0.59 and 0.82 for classes 0 (not severe/not death) and 1 (severe/death), respectively. The random forest model correctly classified 16 severe/death and 7 not severe/not death cases, out of a total of 28 instances, achieving an accuracy of 82% on the test set, with F1 scores of 0.74 and 0.86 for classes 0 and 1, respectively. The performance of the modified logistic regression model was the same as that of the random forest.

For the Callithrix spp. dataset, the three models achieved an accuracy of 100% on the test set, correctly predicting 5 low Ct cases and 1 high Ct case, out of 6 instances in the test set, with F1 scores of 1.00 and 1.00 for classes 0 (low Ct) and 1 (high Ct), respectively.

Fig 2. Confusion matrices. Each box contains one confusion matrix, which measures the performance of a machine learning algorithm on a dataset. For each matrix, the rows represent instances in the actual class, while the columns represent instances in the predicted class. Instances found along the diagonal were correctly classified, while those outside the diagonal were misclassified (false positives and false negatives). The top three matrices correspond to the human test dataset and the bottom three matrices correspond to the Callithrix spp. test dataset, for XGBoost, random forest and regularized logistic regression, respectively.”
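
The accuracy and F1 values quoted above can be checked from the counts given in the text alone. The following is a minimal sketch, not the authors' code: the helper name `binary_metrics` is hypothetical. It relies on the fact that, in a two-class problem, the F1 score of a class equals 2·TP / (2·TP + FP + FN), and FP + FN for either class is the total number of off-diagonal (misclassified) instances, however they split between the two cells.

```python
from typing import Dict


def binary_metrics(correct_0: int, correct_1: int, total: int) -> Dict[str, float]:
    """Accuracy and per-class F1 for a 2x2 confusion matrix.

    correct_0 and correct_1 are the diagonal cells (correctly classified
    instances of class 0 and class 1); total is the test-set size.
    """
    # Every instance off the diagonal is a misclassification.
    errors = total - correct_0 - correct_1

    def f1(true_positives: int) -> float:
        # F1 = 2*TP / (2*TP + FP + FN); with two classes, FP + FN == errors.
        denominator = 2 * true_positives + errors
        return 2 * true_positives / denominator if denominator else 0.0

    return {
        "accuracy": (correct_0 + correct_1) / total,
        "f1_class_0": f1(correct_0),
        "f1_class_1": f1(correct_1),
    }


# Counts as reported in the text for each test set.
xgboost_human = binary_metrics(correct_0=5, correct_1=16, total=28)
random_forest_human = binary_metrics(correct_0=7, correct_1=16, total=28)
callithrix = binary_metrics(correct_0=5, correct_1=1, total=6)

for name, result in [
    ("XGBoost (human)", xgboost_human),
    ("Random forest (human)", random_forest_human),
    ("All models (Callithrix spp.)", callithrix),
]:
    print(name, {key: round(value, 2) for key, value in result.items()})
```

Rounded to two decimals, these reproduce the reported figures, which suggests the stated counts, accuracies and F1 scores are mutually consistent.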

#3 Reviewer’s comment:

I think the authors should better discuss the importance of their findings for the control of YFV in Brazil and Latin America.

Author’s answer:

Changes were made to the manuscript, reproduced here for your convenience:

“Emerging and reemerging viruses present a highly complex challenge for the Brazilian public health system. Among them, arboviruses transmitted by mosquitoes are agents capable of causing serious diseases, such as hemorrhagic fevers, encephalitis and meningitis. For these reasons, real-time genomic surveillance is extremely important to guide prevention and control measures, as it allows reconstruction of the origins of epidemics and the estimation of transmission rates at different times and geographic regions, subject to environmental and human factors. In addition, genomic surveillance makes it possible to identify emerging, re-emerging, circulating and co-circulating variants, through viral genetic diversity quantification, making it possible to estimate the likelihood of new outbreaks and/or possible escapes from existing vaccines and treatments. As a result, relevant information is acquired for the design of public health policies, in addition to contributing to the development of vaccines, new drugs and improved serological and molecular diagnostic methods [40, 41].

In this context, Brazil has become a global reference in real-time genomic surveillance, achieving fundamental results in early detection and monitoring of outbreaks. However, the large amount of data produced using next-generation sequencing platforms demands sophisticated analytical approaches, capable of dealing with complex and large datasets, aiming at the extraction of as much information as possible. In this sense, machine learning algorithms have been successfully used in bioinformatics, motivating their application in the search for genetic signatures in arboviruses, associated with phenotypic or epidemiological characteristics in recent outbreaks in Brazil.

In this study, we demonstrate the potential of applying ML approaches to real-time genomic surveillance, to quickly identify genetic loci which may be of public health interest. Further studies and analytical strategies in line with the present work can help improve real-time epidemiological surveillance in Brazil and the Americas, resulting in better public health policy outcomes.”

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Rakesh Kumar Verma

29 Nov 2022

Machine learning models exploring characteristic single-nucleotide signatures in yellow fever virus

PONE-D-21-15059R1

Dear Dr. Salgado,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Rakesh Kumar Verma, Ph.D

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Dear Author,

The manuscript fulfils all the requirements and is suitable for publication in PLOS ONE.

Thank You

Reviewers' comments:

Acceptance letter

Rakesh Kumar Verma

3 Dec 2022

PONE-D-21-15059R1

Machine learning models exploring characteristic single-nucleotide signatures in yellow fever virus

Dear Dr. Salgado:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Rakesh Kumar Verma

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    Data was retrieved from open access sources and repositories, available at: http://www.nature.com/articles/s41598-019-56650-1 https://science.sciencemag.org/content/361/6405/894 https://jvi.asm.org/content/94/1/e01623-19 https://dx.plos.org/10.1371/journal.pntd.0008405 https://dx.plos.org/10.1371/journal.ppat.1008699 Code is available at GitHub (https://github.com/alvarosalgado/yfv_code).


    Articles from PLOS ONE are provided here courtesy of PLOS
