Science Progress. 2025 Jan 22;108(1):00368504241301535. doi: 10.1177/00368504241301535

On the importance of data curation for knowledge mining in antiviral research

Holli-Joi Martin 1, Cleber C Melo-Filho 1, Alexey V Zakharov 2, Eugene Muratov 1, Alexander Tropsha 1
PMCID: PMC11752173  PMID: 39840476

Abstract

The recent severe acute respiratory syndrome coronavirus 2 pandemic has clearly exemplified the need for broad-spectrum antiviral (BSA) medications. However, previous outbreaks show that about one year after an outbreak, interest in antiviral research diminishes and the work toward an effective medication is left unfinished. Martin et al. endeavored to support the early stages of focused BSA development by creating the Small Molecule Antiviral Compound Collection (SMACC), which is publicly available online at https://smacc.mml.unc.edu. SMACC is a highly curated database with over 32,500 entries of chemical compounds tested in both phenotypic and target-based assays across 13 viruses from the NIAID's list of emerging infectious diseases/pathogens. The authors advise judicious use of knowledge collections such as SMACC and recommend users critically evaluate retrieved data and resulting hypotheses prior to experimental testing. When used correctly, SMACC-like databases may serve as a reference for medicinal chemists and virologists working to develop novel BSA medications. To summarize, we emphasize the importance of data curation for both novel outbreak prediction and development of BSAs against these outbreaks.

Keywords: Broad-spectrum antiviral, drug discovery, data curation, knowledge mining, outbreak prediction, drug repurposing


Comment on: Martin HJ, Melo-Filho CC, Korn D, et al. (2023) Small molecule antiviral compound collection (SMACC): A comprehensive, highly curated database to support the discovery of broad-spectrum antiviral drug molecules. Antiviral Research. 217: 105620. 1

Data curation in antiviral research

Infectious diseases have had profound impacts on global human health throughout history. In the past two decades, factors such as population growth and travel have increased the rate of viral outbreaks, with a new viral threat seen nearly every year. 2 The millions of lives lost and trillions of dollars in damage to the global economy due to the recent pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) highlight the importance of medications that can offer protection against diverse viral threats that may emerge in the future. 3,4 The development of broad-spectrum antiviral (BSA) drugs is a promising strategy against emergent viruses, but discovering BSAs has been a challenge to date. According to DrugBank, there are only 109 antiviral medications, with even fewer approved to treat multiple viral diseases. Drug discovery campaigns against herpes, hepatitis, and human immunodeficiency viruses have been the most successful, but the resulting drugs are generally ineffective against novel viral threats such as SARS-CoV-2. Although few, medications active against multiple viruses do exist, suggesting that concerted strategic efforts could lead to BSA development. However, previous outbreaks show that about one year after an outbreak, antiviral research support diminishes and the work toward an effective medication is left unfinished. 5

Martin et al. 1 endeavored to support the focused development of BSAs by creating the Small Molecule Antiviral Compound Collection (SMACC), available online at https://smacc.mml.unc.edu. Focusing on a subset of viruses from the NIAID's list of emerging infectious diseases/pathogens, they utilized the ChEMBL database 6 to build a rigorously curated and further annotated database with over 32,500 entries of chemical compounds tested in both phenotypic and target-based assays across 13 viruses. Even with all chemogenomic data integrated into SMACC, a large degree of sparsity (93%) still existed within the data matrix because many of the viruses of high concern remain understudied. This underscores that novel results can be obtained through targeted testing of the compounds in SMACC against additional proteins. The authors also identified ∼50 compounds targeting the flavi- and coronaviruses with a polyviral inhibition profile. This analysis reinforces that previous uncoordinated studies have led to very few successful BSAs and that special, focused efforts must be established going forward.
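To make the sparsity figure concrete, the short Python sketch below computes the fraction of empty cells in a toy compound-by-target matrix. It assumes that sparsity means the share of compound-target pairs with no reported measurement; the function name `matrix_sparsity` and all compound and target labels are hypothetical and are not taken from SMACC.

```python
from typing import Set, Tuple


def matrix_sparsity(tested: Set[Tuple[str, str]]) -> float:
    """Return the fraction of empty cells in the compound x target matrix.

    `tested` holds one (compound, target) pair per measured combination.
    The matrix dimensions are inferred from the pairs themselves, which is
    an assumption of this sketch rather than a documented SMACC convention.
    """
    compounds = {compound for compound, _ in tested}
    targets = {target for _, target in tested}
    total_cells = len(compounds) * len(targets)
    if total_cells == 0:
        raise ValueError("no measurements supplied")
    return 1.0 - len(tested) / total_cells


# Toy data: 3 compounds x 3 targets = 9 cells, of which 4 are measured.
tested_pairs = {
    ("cmpd_A", "target_1"),
    ("cmpd_A", "target_2"),
    ("cmpd_B", "target_1"),
    ("cmpd_C", "target_3"),
}

sparsity = matrix_sparsity(tested_pairs)
print(f"sparsity = {sparsity:.1%}")
```

In this toy case five of nine cells are empty. Each empty cell corresponds to a compound-target combination that has not been tested, which is the pool from which targeted experiments could be nominated.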

A major emphasis of their work arose as the authors identified multiple fundamental issues in assay reporting and consolidation within the ChEMBL database. Extensive curation efforts were required to improve the quality and usability of the extracted antiviral activity data and create SMACC. The greatest issue involved the phenotypic assays for antiviral compounds, which had inconsistent or missing ontological annotations, misused BioAssay Ontology annotations, and an insufficient level of detail provided for assay procedures. In their original state, the data made proper analysis challenging, put reproducibility in question, and prohibited meaningful integration of multiple assay results. Chemical bioactivity data curators have improved their inputs over the years; 7 however, careful processing of chemical structures and bioactivity labels remains an important focus so that users can retrieve all of the data relevant to their search.

Although antiviral research regained the attention of the broader research community and funding agencies upon the emergence of SARS-CoV-2, curation efforts for antiviral activity data were already underway. For example, in 2016, there was a comprehensive effort to analyze the structure-activity information for antiviral compounds within the ChEMBL database. 8 As in the development of the SMACC database, these authors used cheminformatics tools to carefully curate and process the collected data and faced many of the same issues, including the need for multiple text queries to find relevant entries and the removal of inconsistent or irrelevant data. Specifically, the authors stated that a “lack of clear ontology and homogeneity of antiviral data made grouping of antivirals a challenge.” Nikitina et al. 9 developed a semi-automatic method for extracting antiviral data after realizing the inefficiency of the virus taxonomy annotations of these data in ChEMBL. Remarkably, these authors were able to extract “1.6 times more assays linked to 2.5 times more compounds and data points than ChEMBL web interface allows.” These challenges were shared by all of these authors, which demonstrates a clear problem in extracting antiviral data and the time required for subsequent processing. These challenges also inherently limit the size and scope of databases such as SMACC and DenvInD, a dengue virus inhibitors database for clinical and molecular research. 10 They also affect basic research and computational method development that rely on these data as input, such as a recent paper on recommender systems. 11

As antiviral assay data continue to be published online, SMACC-like databases could be integral in developing BSAs for novel viruses because they provide clear and accessible antiviral data to the greater scientific community. As described above, antiviral assays are often poorly described in online collections due to the difficulty in extracting complex assay information from publications. Their complexities could also explain the minimal coverage of virology-related journals in ChEMBL and other public resources, which leaves a large number of antiviral datasets unprocessed. Martin et al. outline several suggestions for improving antiviral assay reporting so that the data can be easily added to online repositories. Suggestions for publications included adding clear descriptions of antiviral assays and creating obvious links to the associated data sets. For data scientists, the authors recommended being as descriptive as possible when using or developing antiviral data ontology and annotation frameworks, inputting single information types per column, and breaking complex data into different columns as needed. The development of comprehensive assay reporting guidelines is integral to the future success of antiviral research. However, more comprehensive guidelines require a larger effort made in partnership with the broader community, so we must act together. Improving the accuracy of assay reporting is a task for all scientists, and together we can improve data quality and accessibility for a wide variety of researchers.
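The recommendation to keep a single information type per column can be illustrated with a minimal Python sketch. It assumes a hypothetical free-text activity field of the form "IC50 = 2.5 uM" and splits it into endpoint, relation, value, and unit. The pattern and the function name `split_activity` are illustrative assumptions; real assay descriptions vary far more than this pattern can handle.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for strings such as "IC50 = 2.5 uM" or "EC50 <= 10 nM".
_ACTIVITY = re.compile(
    r"(?P<endpoint>[A-Za-z0-9]+)\s*"
    r"(?P<relation>[<>=]{1,2})\s*"
    r"(?P<value>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>[A-Za-z%]+)"
)


def split_activity(text: str) -> Optional[Dict[str, object]]:
    """Split one combined activity string into single-purpose fields.

    Returns None when the text does not match the assumed pattern, so that
    unparsed entries can be flagged for manual curation rather than guessed.
    """
    match = _ACTIVITY.search(text)
    if match is None:
        return None
    return {
        "endpoint": match.group("endpoint"),
        "relation": match.group("relation"),
        "value": float(match.group("value")),
        "unit": match.group("unit"),
    }


row = split_activity("IC50 = 2.5 uM")
print(row)
```

Storing the relation, numeric value, and unit separately means that values can be filtered and compared numerically without re-parsing text, and entries that fail to parse remain visible for manual review.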

The knowledge-based curation, database generation, computational hit discovery, and experimental nomination strategy described in SMACC is generalizable, workable, and applicable to many understudied proteins, especially where structural or evolutionary linkages can be made within or between species. This is especially important for antiviral research as there remains little research on many proteins important for viral replication such as helicases, methyltransferases, proteases, and polymerases. By harnessing reasoning over existing knowledge, this workflow can be used to identify subsets of existing chemogenomics data that are relevant to these understudied targets and nominate compounds for targeted experimental assays.

The SMACC database was instrumental in the early stages of hit discovery for the Rapidly Emerging Antiviral Drug Development Initiative-Antiviral Drug Discovery Center funded by the National Institute of Allergy and Infectious Diseases (https://www.niaid.nih.gov/research/antiviral-drug-discovery-centers-pathogens-pandemic-concern). For example, in the first use of SMACC, 73 compounds with a known activity of <10 μM or >50% inhibition against any flavivirus were selected for testing in a nanoLuc assay against Dengue 2 and Dengue 4 at 1 and 10 μM. In this assay, 43 out of 73 compounds had significant activity, defined as <50% relative light units, with 13 and 10 compounds showing significant activity at 1 μM against Dengue 2 and Dengue 4, respectively (unpublished data). Further experimentation with the top hits revealed a large degree of cytotoxicity, and overcoming this requires further medicinal chemistry efforts for derivative design. Nonetheless, SMACC is a tool developed for early hit discovery and has been successfully utilized in this capacity.
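The selection rule described above (activity <10 μM or >50% inhibition against any flavivirus) can be sketched as a simple filter. The Python example below is a minimal illustration under stated assumptions: the `Record` fields, the family label "Flaviviridae", and all data values are hypothetical and do not reflect the actual SMACC schema or the compounds that were tested.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Record:
    """One hypothetical activity record; not the real SMACC schema."""
    compound: str
    virus_family: str
    potency_um: Optional[float] = None      # potency in uM, if reported
    inhibition_pct: Optional[float] = None  # percent inhibition, if reported


def is_candidate(record: Record, family: str = "Flaviviridae") -> bool:
    """True if the record meets either threshold for the given virus family."""
    if record.virus_family != family:
        return False
    potent = record.potency_um is not None and record.potency_um < 10.0
    inhibits = record.inhibition_pct is not None and record.inhibition_pct > 50.0
    return potent or inhibits


def select_compounds(records: Iterable[Record], family: str = "Flaviviridae") -> List[str]:
    """Return sorted, de-duplicated compounds with at least one qualifying record."""
    return sorted({r.compound for r in records if is_candidate(r, family)})


# Made-up records for illustration only.
records = [
    Record("cmpd_A", "Flaviviridae", potency_um=2.0),
    Record("cmpd_B", "Flaviviridae", inhibition_pct=35.0),
    Record("cmpd_B", "Flaviviridae", potency_um=50.0),
    Record("cmpd_C", "Coronaviridae", potency_um=0.5),
    Record("cmpd_D", "Flaviviridae", inhibition_pct=80.0),
]

selected = select_compounds(records)
print(selected)
```

A compound qualifies if any one of its records passes either threshold, which mirrors the "against any flavivirus" wording. Records with missing values are skipped rather than treated as inactive, since an unreported value is not evidence of inactivity.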

It is important to note that users must use knowledge collections such as SMACC with extreme caution. Data cannot just be extracted from the database blindly; there must be a well-formulated question. A query that is too general will return a very large output with little relevance to one's goals, whereas a query that is too specific may not return any compounds. The user is responsible for generating detailed interpretations and understanding the mechanisms of action of the selected compound(s). Users are also highly encouraged to research all available in vivo and clinical information on compounds they select to ensure they are appropriate for their specific applications prior to experimental testing. Users must also have a solid understanding of the data inside the collection. In the SMACC article, the authors used the hypothesis that similarities between homologous viral proteins could be exploited for target selection and the development of broad-spectrum anti-coronaviral compounds, 12 to suggest the experimental testing of compounds from SMACC. This is a promising strategy to develop BSA drugs prior to the emergence of novel viruses from known viral families. Building on this, the authors have used this hypothesis and the generalized SMACC framework to develop a Helicase-targeting SMACC to focus specifically on identifying promising viral helicase inhibitors using assay data from viral, human, and bacterial helicases. However, in both cases, just building the database, making it publicly available, and nominating compounds is not enough. This exercise alone will not lead to new BSA drugs. Collaboration is imperative to the success of antiviral drug discovery in the modern age. Scientists of various disciplines must join perspectives and skill sets toward common goals, especially when pursuing a goal with as much breadth as BSA drug discovery.

Rigorous curation of the data retrieved by knowledge mining is also critical for viral outbreak prediction research. This can be achieved by careful manual data analysis as demonstrated by the Baric lab for SARS-CoV-2. 13 Knowledge mining tools can generate predictions on a larger scale but suffer from a high false positive rate that could be decreased by rigorous data curation. For instance, Sessions et al. 4 completed a comprehensive literature review, performed data analysis, and conducted outbreak simulations to provide a perspective on the regions of the world that are most likely to encounter future outbreaks. Despite the complexities associated with establishing reliable outbreak predictions, extensive knowledge mining and rigorous data curation enabled the authors to successfully predict the outbreaks of Marburg in Africa 14 and Dengue Virus in Brazil. 15

Conclusions

Society needs BSA medications; however, previous outbreaks show that about one year after an outbreak, antiviral research support diminishes, leaving the work unfinished. 5 Martin et al. 1 endeavored to support the focused development of BSAs and made recommendations for targeted experiments from their SMACC database. These recommendations were made after considerable curation efforts, which exposed issues with antiviral assay recording and reporting. Several actionable steps were suggested to improve data quality and accessibility. Despite the open nature of the SMACC database, the authors advise judicious use of knowledge collections such as SMACC and recommend users critically evaluate retrieved data prior to experimental testing. Employing meticulous data curation for modeling and knowledge-mining strategies is critical for the accurate identification of both novel antiviral medications and even viral outbreaks. However, these efforts are in vain without the collaboration of scientists from many diverse backgrounds working toward the discovery of BSA compounds.

Footnotes

Author contributions: HJM, AVZ, EM, and AT contributed to conceptualization; HJM, CMF, and EM contributed to writing—original draft preparation; EM and AT contributed to writing—review and editing; EM and AT contributed to funding acquisition. All authors have read and agreed to the published version of the manuscript.

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: AT and ENM are co-founders of Predictive, LLC, which develops novel alternative methods and software for toxicity prediction. All other authors have nothing to disclose.

Funding: The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors from UNC-Chapel Hill were supported by the National Institutes of Health (Grants U19AI171292 and R01GM140154). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This research was supported in part by the Intramural/Extramural research program of the NCATS, NIH. HJM received support from the American Foundation for Pharmaceutical Education's (AFPE) Predoctoral Fellowship.

ORCID iD: Eugene Muratov https://orcid.org/0000-0003-4616-7036

References

  1. Martin HJ, Melo-Filho CC, Korn D, et al. Small molecule antiviral compound collection (SMACC): a comprehensive, highly curated database to support the discovery of broad-spectrum antiviral drug molecules. Antiviral Res 2023; 217: 105620.
  2. Bloom DE, Black S, Rappuoli R. Emerging infectious diseases: a proactive approach. Proc Natl Acad Sci USA 2017; 114: 4055–4059.
  3. Melo-Filho CC, Bobrowski T, Martin HJ, et al. Conserved coronavirus proteins as targets of broad-spectrum antivirals. Antiviral Res 2022; 204: 105360.
  4. Sessions Z, Bobrowski T, Martin HJ, et al. Praemonitus praemunitus: can we forecast and prepare for future viral disease outbreaks? FEMS Microbiol Rev 2023; 47: fuad048.
  5. Bobrowski T, Melo-Filho CC, Korn D, et al. Learning from history: do not flatten the curve of antiviral research! Drug Discov Today 2020; 25: 1604–1613.
  6. Gaulton A, Hersey A, Nowotka M, et al. The ChEMBL database in 2017. Nucleic Acids Res 2017; 45: D945–D954.
  7. Papadatos G, Gaulton A, Hersey A, et al. Activity, assay and target data curation and quality in the ChEMBL database. J Comput Aided Mol Des 2015; 29: 885–896.
  8. Klimenko K, Marcou G, Horvath D, et al. Chemical space mapping and structure–activity analysis of the ChEMBL antiviral compound set. J Chem Inf Model 2016; 56: 1438–1454.
  9. Nikitina AA, Orlov AA, Kozlovskaya LI, et al. Enhanced taxonomy annotation of antiviral activity data from ChEMBL. Database (Oxford) 2019; 2019: bay139.
  10. Dwivedi VD, Arya A, Yadav P, et al. DenvInD: dengue virus inhibitors database for clinical and molecular research. Brief Bioinform 2021; 22: bbaa098.
  11. Sosnina EA, Sosnin S, Nikitina AA, et al. Recommender systems in antiviral drug discovery. ACS Omega 2020; 5: 15039–15051.
  12. Yazdani S, De Maio N, Ding Y, et al. Genetic variability of the SARS-CoV-2 pocketome. J Proteome Res 2021; 20: 4215.
  13. Menachery VD, Yount BL, Debbink K, et al. A SARS-like cluster of circulating bat coronaviruses shows potential for human emergence. Nat Med 2015; 21: 1508–1513.
  14. Sibomana OKE. First-ever Marburg virus disease outbreak in Equatorial Guinea and Tanzania: an imminent crisis in West and East Africa. Immun Inflamm Dis 2023; 11: e980.
  15. Lenharo M. Brazil’s record dengue surge: why a vaccine campaign is unlikely to stop it. Nature 2024; 627: 250–251.

Articles from Science Progress are provided here courtesy of SAGE Publications
