Introduction: Computational biology has become an integral part of modern biomedical research
Over the past two decades, the landscape of computational research has undergone a remarkable transformation, primarily driven by the concurrent decrease in sequencing costs and the explosive growth of large-scale biological data. Modern high-throughput assays (e.g., genomics, transcriptomics, proteomics, and metabolomics) have generated a wealth of diverse biological data1,2, which are essential in various fields such as drug discovery, clinical diagnostics, prognosis, and therapeutic response. The magnitude of available data often requires relying solely on computational methods to navigate through it and convert raw data into testable predictions or draw meaningful conclusions. The computational community has developed a multitude of software tools3 to derive meaningful insights from these large high-throughput datasets, reshaping the landscape of modern biology and making computational biology an essential component of biomedical research2. The growth of this ensemble of computational tools has been paralleled by a notable cultural shift towards computational, data-centric research practices and the widespread sharing of data in the public domain.
Computational thinking and techniques have become vital to life science research. The integration of computational methodologies with technological innovation has sparked a surge in interdisciplinary collaboration. Partnerships between experts from various fields such as biology, mathematics, statistics, and computer science have not only opened new avenues of research but also accelerated bioinformatics as a mainstream component of biology, transforming how we study life systems4. This collaborative effort has prompted scientists to adopt an interdisciplinary approach to modeling, analysis, and interpretation. By harnessing different computational tools, researchers can enhance scientific knowledge and gain deeper insights into the progression of diseases and their impact on human health5. Computational research stands at the forefront of scientific inquiry - from decoding genetic regulation to understanding the complexities of cellular signaling pathways, it holds the potential to revolutionize our understanding of nature and lead to groundbreaking discoveries6,7.
The availability of vast and diverse public datasets enables computational researchers — encompassing computer scientists, data scientists, bioinformaticians, statisticians, and other quantitative researchers who work in the domain of big data — to analyze complex datasets that demand interdisciplinary skills in computer science and biology. Leveraging these datasets to generate new knowledge and develop innovative approaches to address complex biological questions empowers computational researchers to take on more independent and leadership roles in modern life sciences. This further enables computational researchers to aggregate larger sample sizes, granting them the statistical power necessary to generate novel results with greater reliability8. Moreover, the availability of datasets, accompanied by detailed and accurate metadata, allows for effective secondary analysis - a research approach that involves analyzing pre-existing data before formulating a hypothesis. While secondary analysis of public data holds immense potential to extract novel biological insights and uncover new information from the vast datasets available to the public, this process poses various challenges, including data preparation hurdles and the complexities of analyzing and interpreting results8. The reusability of data for secondary analysis benefits the biomedical community by optimizing resources, generating new discoveries9, and even upholding the robustness and validity of pre-existing findings.
Computational research initially emerged primarily as a tool to assist researchers rather than as a distinct discipline with its own defined set of fundamental questions. Consequently, computational researchers have traditionally played a supportive role within research programs led by other scientists, who determined the feasibility and significance of the scientific inquiries to pursue. The role of computational researchers is now evolving from that of service providers to that of leading innovators who advance science together with experimental researchers10. The success of computational research relies heavily on bidirectional collaborations between computational researchers and domain experts such as experimental biologists. These collaborations facilitate the selection of optimal analysis strategies for intricate biological datasets. They also require the early involvement of computational researchers in experimental design, which is essential for effectively addressing complex scientific questions.
This commentary presents insights from our interdisciplinary group comprising prominent researchers from eleven institutions worldwide, who have collectively provided their perspectives on the evolving roles and expanding opportunities of computational researchers in a data-centric world. We discuss the pitfalls, challenges, and opportunities confronting data-centric research in biology. We explore the evolving perception of computational, data-centric research and its emergence as an independent domain within biomedical research. Moreover, we examine the significant collaborative prospects that arise from integrating computational research with experimental and translational biology efforts. Furthermore, our discussion extends to the future direction of data-centric research and its potential applications across various domains within the biomedical field. We discuss the dynamic landscape of computational research and its implications for advancing scientific understanding and biomedical innovation. The primary goal of this paper is to increase awareness within the research community about the importance of interdisciplinary collaborations, particularly between experimental and computational biologists.
Challenges and pitfalls of computational data-centric research
Navigating through computational data-centric research presents a multitude of challenges and pitfalls that researchers must overcome. Firstly, handling large-scale biological data introduces complexities, including the need to ensure the robustness and integrity of results through rigorous evaluation, data cleaning, and preprocessing techniques11. Data sources may contain errors, inconsistencies, or biases that can significantly impact the results of the analysis, highlighting the need to design effective methodologies to overcome or compensate for these errors or biases12. Despite the efforts to navigate these obstacles, methods developed and used by computational researchers can still capture and model noise or random fluctuations in the data rather than genuine underlying patterns. This can lead to subpar performance when applied to new, unseen data. This phenomenon is known as overfitting, which can result in reduced accuracy, difficulty in interpretation of derived results, and decreased generalizability of results. This is particularly problematic when dealing with complex models or high-dimensional data. To mitigate this risk, researchers must strike a delicate balance between model complexity and generalization performance. This can be achieved through the implementation of various strategies such as regularization techniques, cross-validation, hold-out test sets, use of data sets from multiple sites or centers, and meticulous model selection criteria.
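The safeguards named above (regularization, cross-validation, and a hold-out test set) can be illustrated with a minimal Python sketch. It is not a method taken from any cited study: the one-feature ridge model, the synthetic data, the penalty grid, and all function names are assumptions chosen only to keep the example self-contained.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_ridge_1d(xs, ys, penalty):
    """Closed-form ridge estimate for the one-feature model y = w * x (no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)


def mean_squared_error(w, xs, ys):
    """Average squared prediction error of the model y = w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_validated_error(xs, ys, penalty, k=5, seed=0):
    """Average error over k folds, each fold held out once during fitting."""
    errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], penalty)
        errors.append(
            mean_squared_error(w, [xs[i] for i in held_out], [ys[i] for i in held_out])
        )
    return sum(errors) / len(errors)


def select_penalty(xs, ys, penalties, k=5, seed=0):
    """Pick the penalty with the lowest cross-validated error."""
    scores = {p: cross_validated_error(xs, ys, p, k, seed) for p in penalties}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Synthetic data (an assumption for illustration): y = 2x plus Gaussian noise.
    rng = random.Random(1)
    xs = [rng.gauss(0.0, 1.0) for _ in range(100)]
    ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]

    # The final 20 samples form a hold-out test set that is never used for tuning.
    train_x, train_y, test_x, test_y = xs[:80], ys[:80], xs[80:], ys[80:]

    best, _ = select_penalty(train_x, train_y, [0.0, 0.1, 1.0, 10.0, 100.0])
    w = fit_ridge_1d(train_x, train_y, best)
    print("selected penalty:", best)
    print("hold-out test error:", mean_squared_error(w, test_x, test_y))
```

The hold-out samples take no part in choosing the penalty, so the final error is an estimate of performance on unseen data rather than a by-product of the tuning itself.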
Secondly, computational researchers must deal with various ethical and privacy concerns, requiring strict compliance with regulations on data handling, especially when working with data derived from human subjects. In particular, they must follow ethical standards and regulations governing the handling of clinical and genetic data, which can include obtaining informed consent and removing personal identifiers. Additionally, all researchers (including computational ones) must contemplate the societal impact of their work, addressing concerns surrounding fairness, bias, discrimination, and implementing measures to promote equitable outcomes for those affected by the research outcomes.
An important challenge in computational data-centric research is handling large-scale biological data, which is heightened by restricted access to data: obtaining access often requires a substantial amount of time and effort and, in some cases, incurs costs and restrictions (e.g., UK Biobank13). Ideally, all generated biological data and cleaned, analyzed data should be made available publicly and accompanied by detailed metadata, allowing for effective repurposing and secondary analysis. Restricted or incomplete datasets can hinder the progress of computational data-centric research, as restricted access to data limits the ability to replicate findings across diverse research groups. Some of these barriers have been addressed through open science practices, which promote the transparency, credibility, and reproducibility of research findings14,15. These practices include data sharing initiatives such as the National Center for Biotechnology Information (NCBI)16,17, the European Bioinformatics Institute (EMBL-EBI)18,19, and the DNA Data Bank of Japan (DDBJ)20, which have created centralized repositories. In addition, open-access policies implemented by the National Institutes of Health (NIH) in the United States require researchers to deposit data into publicly accessible databases like GenBank21. Other platforms like UniProt22 and Gene Ontology (GO)23,24 contain curated protein sequences and functional information in addition to providing standardized vocabularies for annotating gene function. Cloud-based platforms and big data technologies such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP) enable scalable storage and processing and provide infrastructure for hosting and analyzing large-scale data.
Sharing data with another research group for reanalysis raises concerns related to privacy and confidentiality, as well as broader philosophical and sociological implications25–28. However, balancing the need to make data open to other researchers with a concern for indigenous populations and low- and middle-income countries is a complex subject outside the scope of this article. Many computational tools and platforms are also proprietary29–31, which can limit access to algorithms and the ability to modify or fully understand them. Dependence on proprietary tools also raises issues of sustainability and control over research tools and outputs. Addressing issues related to data accessibility is crucial for advancing computational data-centric research and ensuring the integrity and reliability of scientific findings.
The scattered nature of these large-scale publicly available datasets, which also often lack metadata, limits the potential of data to answer crucial biological questions32,33. Metadata refers to the descriptive and contextual information about the generation, provenance, and context of raw data, including experimental design, instrumentation parameters, and data processing steps34. Data is usually generated by the clinicians and/or biomedical researchers in the wet lab from clinical assessments, clinical trials, or wet lab experiments. The data, both raw and metadata, is typically shared on public databases for other researchers, including computational biologists, to be able to access it for secondary analysis. Often, the data is scattered across multiple repositories, leading researchers to encounter discrepancies between datasets, which may arise from the data of a single project not being located in a single place but instead available through different repositories or from various parts of a single project not being adequately interlinked. Despite the data being “public”, there are still many instances where some data are not shared or are restricted, with particular limitations in the release of raw data hindering comprehensive analysis and collaborations25,32,33,35. In some cases, the data can be accessed, but the information in the data may be incomplete - due to annotation challenges or incomplete details in the metadata and raw data33,35. To fully unlock the great potential of public data, it is essential to ensure comprehensive documentation of metadata and the proper annotation of raw data, by working together from the experimental design and data collection stages. This would facilitate easier integration of information from different datasets and streamline the data cleaning process, making it less time-consuming.
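One way to make "comprehensive documentation of metadata" concrete is to check each record automatically before a dataset is shared. The Python sketch below is an illustration only: the field names in REQUIRED_FIELDS and the example records are hypothetical and do not follow the schema of any particular repository.

```python
# Hypothetical required fields, loosely following the kinds of information the
# text lists (experimental design, instrumentation parameters, processing steps).
REQUIRED_FIELDS = (
    "sample_id",
    "experimental_design",
    "instrument_parameters",
    "processing_steps",
)


def missing_metadata_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in one metadata record."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None:
            missing.append(field)
        elif isinstance(value, str) and not value.strip():
            missing.append(field)  # empty or whitespace-only text
        elif isinstance(value, (list, tuple, dict)) and len(value) == 0:
            missing.append(field)  # empty container
    return missing


def audit_records(records, required=REQUIRED_FIELDS):
    """Map the position of each incomplete record to its list of missing fields."""
    report = {}
    for position, record in enumerate(records):
        missing = missing_metadata_fields(record, required)
        if missing:
            report[position] = missing
    return report


if __name__ == "__main__":
    # Made-up records for illustration.
    records = [
        {
            "sample_id": "S1",
            "experimental_design": "case-control",
            "instrument_parameters": {"read_length": 100},
            "processing_steps": ["adapter trimming", "alignment"],
        },
        {"sample_id": "S2", "experimental_design": ""},
    ]
    for position, missing in audit_records(records).items():
        print(f"record {position} is missing: {', '.join(missing)}")
```

Running such a check at the data collection stage, rather than at submission, is in the spirit of the joint planning described above, since gaps are cheapest to fill while the experiment is still under way.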
Given the modern data-centric scientific landscape, computational resources are crucial for conducting research. In contrast to wet lab experiments that require specialized equipment like sequencing machines and molecular cloning tools, computational researchers rely heavily on computational facilities. These facilities must be equipped with sufficient storage capacity and computational power to handle large datasets and perform complex analyses. Without adequate access to computational resources, researchers may face significant barriers to processing data efficiently or running computational simulations effectively. Leading research institutions typically have access to those resources, but resource-constrained universities often need help in obtaining the necessary computational infrastructure. Dry labs rely on a spectrum of resources, ranging from high-performance computing clusters and cloud computing platforms to specialized software and data storage systems, as personal computational devices often do not have enough storage or computational resources to process large-scale data. These resources offer the scalability and computational power necessary to handle the complexities inherent in large-scale data analysis, enabling researchers to conduct thorough analyses and extract meaningful insights without being hindered by hardware limitations. For example, initiatives like the H3ABioNet36 address the increasing need for bioinformatics research in Africa amidst growing volumes of complex genomics data and limited research infrastructure. They provide training and support in human genetics and genomics, with effects extending to other areas of bioinformatics, enhancing the continent’s capacity for high-throughput biology research.
Another significant opportunity for enhancing research arises in the collaborative planning stages of projects, where critical decisions about the selection of measurement technologies are made. Measurement technologies encompass various high-throughput techniques used to process large volumes of data efficiently and rapidly. In the context of computational data-centric research, these protocols are crucial for generating massive datasets required for analysis. Choosing the most appropriate measurement technology for a research project requires careful consideration of factors such as the research question, the biological system under study, the type of data needed, and available resources. Computational researchers frequently encounter limitations, forced to utilize datasets generated by predetermined methods, even if they don’t align well with their intended analytical approaches. For example, in transcriptomics, researchers may face challenges when analyzing RNA-seq data generated using low-depth sequencing methods, as these may not provide sufficient coverage for detecting lowly expressed genes accurately.
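The effect of sequencing depth on lowly expressed genes can be made concrete with a back-of-the-envelope calculation. The Python sketch below assumes a simple Poisson sampling model, in which the number of reads from a gene is Poisson-distributed with a mean equal to the total read count multiplied by the gene's share of the transcript pool. Both the model and the numbers in the example are illustrative assumptions, not values from the studies discussed here.

```python
import math


def expected_reads(total_reads, transcript_fraction):
    """Mean number of reads from a gene that makes up the given share of the library."""
    return total_reads * transcript_fraction


def prob_at_least(min_reads, mean_reads):
    """P(X >= min_reads) for X ~ Poisson(mean_reads)."""
    below = sum(
        math.exp(-mean_reads) * mean_reads ** i / math.factorial(i)
        for i in range(min_reads)
    )
    return 1.0 - below


if __name__ == "__main__":
    # Hypothetical gene contributing one in ten million transcripts.
    fraction = 1e-7
    for depth in (5_000_000, 50_000_000):
        mean = expected_reads(depth, fraction)
        print(
            f"depth={depth:>11,d}  expected reads={mean:.1f}  "
            f"P(at least 5 reads)={prob_at_least(5, mean):.3f}"
        )
```

Under these assumptions a tenfold increase in depth raises the expected read count for the gene tenfold, which is why depth is best agreed on with the analysts before sequencing rather than discovered as a limitation afterwards.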
As computational research continues to evolve, it is increasingly becoming a driving force in scientific advancement in a data-centric world. However, this shift toward big data and large-scale computing also raises growing awareness of its environmental impact. The operation of large data centers and computing infrastructures, essential for processing large biological datasets, contributes significantly to carbon emissions and energy consumption. Many researchers in this domain are actively working on and advocating for energy-efficient computing strategies to mitigate these impacts and promote sustainable practices37–40.
A positive shift in perception towards computational data-centric research
Computational data-centric research has undergone a remarkable transformation in perception, evolving from being viewed merely as a supportive tool for researchers to becoming a driving force in scientific progress and innovation. Several developments have driven this transformation. First, the size of biological datasets has risen relentlessly: the exponential growth in sample size and volume of data has necessitated the development of more advanced computational methods to manage, analyze, and derive meaningful insights from these vast datasets. Second, improvements in computational hardware and software have significantly enhanced the scalability and efficiency of data analysis tools, allowing researchers to tackle increasingly complex problems with greater speed41,42. Third, advances in machine learning and artificial intelligence have revolutionized the field, enabling the development of predictive models and algorithms capable of deriving actionable insights from large and heterogeneous datasets43–46. Lastly, the establishment of interdisciplinary collaborations has facilitated the exchange of knowledge and the ability to harness computational power, bringing computational researchers and domain experts together to tackle real-world problems. These advancements have collectively contributed to a growing recognition of the value and potential of computational data-centric research. They enable the scientific community to gain insights and generate hypotheses that were not possible before the adoption of data-centric approaches, fostering a more positive perception among both researchers and the wider scientific community.
The advancement in computational algorithms and data analytics now allows for the utilization of big data to advance science and medicine. As computational data-centric research gains momentum, its impact is evident in the discovery of potential biomarkers and therapeutic targets, enabling the design of novel treatments, optimization of existing therapies, and even the repurposing of approved drugs for different indications47–49. Furthermore, the identification of new biomarkers can enhance disease progression monitoring, facilitating timely adjustments in treatment plans. This research not only revolutionizes treatment but also stimulates advancements in diagnostics, fostering the development of more accurate and timely diagnostic tools. For instance, although primary tumor biopsies can yield important genetic information that may guide initial treatment decisions, liquid biopsies aimed at measuring circulating tumor DNA (ctDNA), cell-free DNA or circulating tumor cells (CTCs) can provide information on disease progression — and highlight detailed molecular information of a patient’s cancer as it evolves over the course of a therapeutic regimen50. By incorporating frequently updated molecular information, along with bioinformatics and AI-based methods as part of the clinical trial protocol, computational researchers have the opportunity to develop generalized approaches that lead to optimal treatment strategies for the disease. This holds the potential to increase overall survival and improve patient outcomes43–46. Together with increasing opportunities to access longitudinal patient data linked to measurements for precision care, computational data-centric research has evolved from bioinformatic, tool-driven research activity to participation in interdisciplinary research teams.
Moreover, the integration of computational approaches in drug discovery and personalized medicine represents a transformative shift in biomedical research. For example, GWAS analysis, now complemented by deep learning approaches like AlphaMissense51, enables efficient approaches to target discovery. The availability of structural information for most macromolecular targets, combined with the advent of breakthrough computational tools such as AlphaFold252 and more accurate tools for molecular docking and binding energy predictions53, has made computational screening and design a highly competitive approach to the initial discovery of hits and leads54, one that is often preferred over traditional high-throughput screening (HTS) assays. This is further supported by the development of ultra-large chemical libraries55 and even larger giga- and tera-scale on-demand chemical spaces56, which enable computational screening of vast drug-like chemical spaces and subsequent experimental validation of the predictions. For the design of proteins and peptides, AlphaFold2 and other deep learning tools have also been shown to greatly accelerate the discovery of high-affinity molecules binding to their targets57,58. This paradigm shift towards computationally driven drug discovery is a significant advance from the very limited role molecular modelers played in this field only a few years ago59. In personalized medicine, the shift from a one-size-fits-all approach to personalized interventions holds immense promise for improving patient outcomes and reducing healthcare costs. Ultimately, the integration of individualized genomic information into medical decision-making marks a progression toward more effective and efficient healthcare practices.
Data-centric research, particularly in computational settings, allows for the formulation of hypotheses based on insights derived from data interpretation, fostering creativity in characterizing novel clinical questions, developing methods, and improving data analysis tools. This approach distinguishes itself from hypothesis-driven wet lab research primarily through the scale and nature of the data employed60. Importantly, computational research spans the entire spectrum from method development to analysis of biological data, providing a valuable and efficient means that complements wet lab experiments, which are essential for generating foundational data and validating hypotheses.
Engaging in bioinformatics research does not demand a substantial time investment in data generation. This characteristic enhances the potential for computational data-centric research to aggregate extensive datasets for interpretation, facilitating effective collaboration with wet lab biologists to validate discoveries whenever feasible. However, the realization of this potential often remains aspirational, as major findings from large-scale computational analyses frequently encounter challenges in experimental validation due to small effect sizes or the complexity and variability inherent in biological systems, underscoring the need for innovative approaches to bridge the gap between computational discovery and wet lab validation. With the expertise of computational researchers, experimental biologists can explore their data from a different perspective, proposing new hypotheses or refining existing ones. Furthermore, the scale of data analysis in wet and dry labs varies significantly. Experimental biologists typically work with smaller sample sizes or fewer biological replicates, focusing on obtaining statistically significant results through carefully controlled experiments. By contrast, computational researchers often have the opportunity to work with larger sample sizes by aggregating data across multiple existing studies and open public datasets, enabling them to obtain statistically significant results and broaden the scope of their analyses. By combining these strengths, both fields can enhance their research outcomes: experimental biologists provide in-depth, controlled insights, while computational researchers offer comprehensive, data-driven perspectives.
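The link between aggregated sample size, small effect sizes, and statistical significance can be sketched numerically. The Python example below uses a textbook normal approximation for a two-sided comparison of two group means. The effect size, significance level, and sample sizes are hypothetical choices, and a real study would need a power analysis matched to its own design.

```python
from statistics import NormalDist


def approximate_power(effect_size, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample comparison of means.

    effect_size is the standardized mean difference (difference in means
    divided by the common standard deviation).
    """
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return z.cdf(shift - critical) + z.cdf(-shift - critical)


def smallest_n_for_power(effect_size, target_power=0.8, alpha=0.05, max_n=1_000_000):
    """Smallest per-group sample size reaching the target power, or None if not found."""
    for n in range(2, max_n + 1):
        if approximate_power(effect_size, n, alpha) >= target_power:
            return n
    return None


if __name__ == "__main__":
    # Hypothetical small effect: 0.2 standard deviations between groups.
    for n in (20, 100, 1000):
        print(f"n per group={n:>5d}  power={approximate_power(0.2, n):.2f}")
    print("n per group needed for 80% power:", smallest_n_for_power(0.2))
```

Under this approximation, a small effect that is unlikely to be detected with a few dozen samples per group becomes detectable once data from several studies are pooled, although pooling adds between-study variability that the sketch ignores.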
Although many computational researchers have become more independent and now develop solutions to sophisticated computational problems, they often remain at the end of the experimental pipeline, joining only after data is collected, which is a persistent limitation. Integrating data analysis right from the start, while designing experiments, facilitates the seamless integration of knowledge from various sources, such as clinical and preclinical data. For example, Model-Informed Drug Development (MIDD) techniques ensure continuous integration of diverse datasets, enhancing the depth and reliability of research outcomes61–63. Traditionally, computational researchers were expected to have secondary roles10 in biomedical projects and were not anticipated to lead projects geared toward significant biomedical discoveries. However, the complexity and volume of modern biological datasets require effective collaborations, which underscores the importance of combining diverse expertise to drive important discoveries. Furthermore, in collaborations with experimentalists, some computational researchers may perceive their role as addressing loose ends in the research question, rather than actively participating in hypothesis formulation and project ownership64. It is essential to recognize that ownership increases accountability and contributes to more robust and reliable results by deepening the understanding of underlying biological questions and experimental goals. Ensuring that contributions from both computational and experimental biologists are appropriately recognized in publications and presentations can help build trust and foster long-term collaborations.
Examples of successful interdisciplinary collaborations
Embracing interdisciplinary collaboration and leveraging complementary expertise will enable researchers to navigate the intricacies of biological systems more effectively, thereby advancing our understanding of biology and its complexities. Dialogue between computational and experimental biologists is currently occurring, yet there are opportunities to improve it. For example, in study design, planning can be improved by understanding both the technical details of the analytical tools and the specifics of the experiments and data to be collected. This highlights the valuable opportunity for further enhanced collaboration and dialogue between computational and experimental researchers.
There are major advancements in life sciences that have already been made through successful interdisciplinary collaborations. For instance, the GTEx (Genotype-Tissue Expression) Consortium65 has provided the first detailed atlas of gene expression and regulation across previously unstudied human tissues. This achievement was made possible by the collaboration of clinical pathologists, who collected and processed tissue samples, and experimental scientists, who subsequently ensured the quality and relevance of these samples. Computational researchers then utilized sophisticated computational solutions to analyze the extensive RNA-Seq samples, encompassing a wide variety of tissues and conditions. This collaborative effort has yielded valuable insights into gene function and the genetic basis of diseases.
Similarly, the Human Microbiome Project (HMP) has mapped the microbial communities in and on the human body, demonstrating how interdisciplinary collaboration can lead to a deeper understanding of microbiome dynamics and their impact on human health. Through the integration of experimental data and computational analysis, the HMP has provided critical insights into how microbial communities influence human physiology and pathology. The HMP brought together microbiologists, clinicians, and computational biologists who worked together to collect and analyze vast amounts of metagenomics data. This collaborative approach revealed the complex interactions between human hosts and their microbiomes, highlighting their importance in maintaining health and contributing to understanding disease66,67.
Combining experimental and computational expertise is also essential to translate genomic discoveries into clinical applications. One such project, The Cancer Genome Atlas (TCGA)68, has advanced our understanding of cancer genomics through the identification of key genomic alterations across various cancer types. By integrating high-throughput sequencing data with robust computational tools, TCGA has uncovered numerous genetic mutations, copy number variations, and epigenetic modifications that drive cancer development and progression. These findings have not only explained the molecular basis of different cancers but also highlighted potential therapeutic targets, paving the way for precision medicine approaches in oncology.
Collaborations between experimental and computational researchers allow us to build upon each other’s work by creating robust resources and databases sharing both omics and structural data. Resources like the Protein Data Bank (PDB)69 have facilitated the sharing of structural data on biomolecules, allowing researchers to access and build upon each other’s work. EMBL-EBI18 and ChEMBL70 have provided platforms for data submission and retrieval, enhancing drug discovery efforts through shared computational analysis and experimental insights. There are multiple interdisciplinary endeavors focused on the reuse of omics data, including immunological data. The NIAID ImmPort repository71 has enabled the reuse of immunological data, supporting the development of new hypotheses and the validation of existing ones through collaborative efforts. Creating such platforms requires interdisciplinary collaborations between experimental and computational scientists. While we briefly highlight a few examples from different fields, we acknowledge the existence of numerous other significant collaborations that could not be included due to space limitations. These examples illustrate how pooling resources, expertise, and data from different scientific areas can lead to significant breakthroughs.
Establishing meaningful bidirectional collaborations between computational and experimental biologists
Biological research increasingly depends on computational tools for collecting experimental data as well as for analyzing, interpreting, and visualizing the results. For example, modern sequencing instruments rely heavily on embedded computers to produce sequencing data. Modern experiments often generate extensive datasets that demand complex and detailed analysis, frequently with multiple computational tools employed over the course of a single project. However, experimental scientists may not have the training required to use these tools effectively. The increasing complexity and volume of biological data therefore call for effective communication between computational and experimental biologists and for bidirectional collaborations between them72. These collaborations not only revolve around the analysis of experimental data but also frequently concentrate on devising strategies for using computational tools to shape scientific questions and guide future laboratory experiments. They also serve as a link for formulating hypotheses and refining experimental approaches through computational insights. The integration of computational and experimental strategies has become increasingly crucial for unlocking novel insights and addressing intricate scientific questions73. Computational scientists are often attracted to challenging problems in biology for which they can develop novel and practical algorithms. As a result, many active areas of computational research are closely linked to scientific problems, particularly those in biology. Bidirectional collaborations between computational and experimental biologists enable system-level approaches across multiple domains of biology, including the use of computer simulations to investigate the behavior and relationships of all elements of biological systems. Through such collaborations, computational and experimental scientists can design better simulations, work out how to interpret the results, and develop methods to compare them with real experimental data74.
The complementary nature of computational, data-centric research and experimental research fosters mutual learning through bidirectional exchanges of knowledge and skills. Computational researchers gain insights into experimental techniques, biological principles, and real-world applications, while their experimental counterparts can acquire computational skills, data analysis techniques, and computational modeling approaches. As these collaborations continue, experimental scientists are gaining a deeper understanding of the principles underlying the algorithms they use, the principles and techniques by which information is managed, and the practical advantages and limitations of computational tools in terms of both accuracy and scalability. Many successful undergraduate and graduate programs around the world, particularly in the US, have been developed to prepare the next generation of scientists in bioinformatics, computational biology, and quantitative biology. The rise in majors in these fields reflects the growing recognition of the importance of interdisciplinary training in biology. However, while these programs are a significant step forward, it remains challenging to provide students with the in-depth knowledge and experience in both experimental and computational areas (computer science, math, statistics, and theoretical and experimental biology) necessary to fully engage in interdisciplinary research75. An integrative approach to training is still necessary to ensure that students from diverse backgrounds gain a comprehensive understanding of both life sciences and computational research. This includes incorporating interdisciplinary coursework, hands-on research experiences, and collaborative projects that bring together students from different disciplines. By doing so, we can better prepare scientists who are adept at navigating the complexities of both computational and experimental research, fostering more effective and innovative collaborations in the future.
Current trends toward simpler installation and enhanced usability of software tools and machine learning algorithms make such training more effective. Approaches such as packaging and containerization are used to distribute computational tools pre-configured with the software they depend on, thereby facilitating easier installation and use. Similarly, automated machine learning (AutoML) holds promise by helping to select and optimize algorithms for a given dataset, unlocking the potential of machine learning for experimental researchers76. However, it is essential to recognize that simply making tools more accessible does not negate the need for thorough training and domain knowledge. For example, while AutoML platforms can streamline algorithm selection, understanding the appropriateness, assumptions, biases, and generalizability of the resulting models requires deep expertise in statistics and machine learning. Without this expertise, users may misinterpret results or apply models inappropriately. Therefore, alongside advancements in usability, comprehensive training programs that integrate domain-specific knowledge with statistical principles and machine learning techniques remain crucial. These programs ensure that researchers not only use computational tools effectively but also critically evaluate their outputs and implications in biological research.
Discussion
This paper discusses the unique opportunities and challenges of computational research in a data-centric world. We describe the transformative role of computational, data-centric research in modern biology, illustrating how it has transitioned from a supportive tool to a central component driving scientific innovation. The collaboration between computational researchers and traditional biologists has led to novel insights and technological advancements, underpinning significant progress in various fields, including drug discovery and disease progression monitoring.
Partnerships between experts from various fields, such as biology, mathematics, statistics, and computer science, have not only opened new avenues of research but also accelerated computational research as a mainstream component of biology, transforming how we study life systems. We seek to address the evolving perception of computational research: initially seen as merely a tool, it is now recognized as a vital discipline within biology. The integration of computational methodologies with technological innovation has sparked a surge in interdisciplinary collaboration.
This shift reflects broader cultural changes toward data-centric research practices that promise to redefine how we approach and solve complex biological challenges, from enhancing the understanding of diseases to revolutionizing treatment strategies and drug discovery. Moving away from separate departments toward integrated centers with broader expertise under a common umbrella or research theme is crucial for fostering these interdisciplinary collaborations. Such integration can lead to a more cohesive and efficient research environment where diverse expertise can collaborate to tackle complex problems.
Acknowledgments
We thank Dr. McWeeney, Dr. Jönsson, Dr. Frolova, Dr. Obolensk, and Dr. Nadel for their valuable feedback and discussion.
Contributor Information
Dhrithi Deshpande, Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, California, 90089, USA.
Karishma Chhugani, Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, California, 90089, USA.
Tejasvene Ramesh, Department of Pharmacology and Pharmaceutical Sciences, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, California, 90089, USA.
Matteo Pellegrini, Department of Molecular, Cell and Developmental Biology, University of California, Los Angeles, CA 90095.
Sagiv Shiffman, Department of Genetics, The Alexander Silberman Inst. of Life Sciences, The Hebrew University of Jerusalem, Jerusalem 9190401.
Malak S. Abedalthagafi, Genomics Research Department, King Fahad Medical City, Riyadh, Saudi Arabia; Department of Pathology & Laboratory Medicine, Emory University Hospital, Atlanta, GA, USA.
Saleh Alqahtani, The Liver Transplant Unit, King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia; The Division of Gastroenterology and Hepatology, Johns Hopkins University.
Jimmie Ye, Department of Epidemiology & Biostatistics, Institute for Human Genetics, University of California, San Francisco. 513 Parnassus Ave S965F San Francisco, CA 94143.
Xiaole Shirley Liu, GV20 Oncotherapy. One Broadway, 14th floor, Kendall Square, Cambridge, MA 02142, USA.
Jeffrey T. Leek, Biostatistics and Oncology at the Johns Hopkins Bloomberg School of Public Health and Johns Hopkins Data Science Lab, Johns Hopkins University, 615 N. Wolfe Street, Baltimore, Maryland 21205.
Alvis Brazma, EMBL’s European Bioinformatics Institute.
Roel A. Ophoff, Department of Psychiatry and Human Genetics, Center for Neurobehavioral Genetics, University of California Los Angeles, Los Angeles, CA USA
Gauri Rao, Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA 90033, USA.
Atul J. Butte, Bakar Computational Health Sciences Institute, University of California, San Francisco. 490 Illinois Street, San Francisco CA 94158
Jason H. Moore, Department of Computational Biomedicine, Cedars-Sinai Medical Center, 700 N. San Vicente Blvd, Pacific Design Center Suite G540, West Hollywood, CA 90068
Vsevolod Katritch, Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, 90007, USA.
Serghei Mangul, Titus Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA 90033, USA; Department of Quantitative and Computational Biology, USC Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, CA 90033, USA.
References
- 1. Katz K, Shutov O, Lapoint R, Kimelman M, Brister JR, and O’Sullivan C (2022). The Sequence Read Archive: a decade more of explosive growth. Nucleic Acids Res 50, D387–D390.
- 2. Mangul S, Mosqueiro T, Abdill RJ, Duong D, Mitchell K, Sarwal V, Hill B, Brito J, Littman RJ, Statz B, et al. (2019). Challenges and recommendations to improve the installability and archival stability of omics computational tools. PLoS Biol 17, e3000333.
- 3. Allan C, Burel J-M, Moore J, Blackburn C, Linkert M, Loynton S, Macdonald D, Moore WJ, Neves C, Patterson A, et al. (2012). OMERO: flexible, model-driven data management for experimental biology. Nat. Methods 9, 245–253.
- 4. Rung J, and Brazma A (2013). Reuse of public genome-wide gene expression data. Nat. Rev. Genet 14, 89–99.
- 5. Van Noorden R, Maher B, and Nuzzo R (2014). The top 100 papers. 10.1038/514550a.
- 6. Zitnik M, Nguyen F, Wang B, Leskovec J, Goldenberg A, and Hoffman MM (2019). Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities. Inf. Fusion 50, 71–91.
- 7. Way GP, Greene CS, Carninci P, Carvalho BS, de Hoon M, Finley SD, Gosline SJC, Lȇ Cao K-A, Lee JSH, Marchionni L, et al. (2021). A field guide to cultivating computational biology. PLoS Biol 19, e3001419.
- 8. Hippen AA, and Greene CS (2021). Expanding and Remixing the Metadata Landscape. Trends Cancer Res 7, 276–278.
- 9. Rosinger AY, and Ice G (2019). Secondary data analysis to answer questions in human biology. Am. J. Hum. Biol 31, e23232.
- 10. Yanai I, and Chmielnicki E (2017). Computational biologists: moving to the driver’s seat. Genome Biol 18, 223.
- 11. Fan C, Chen M, Wang X, Wang J, and Huang B (2021). A review on data preprocessing techniques toward efficient and reliable knowledge discovery from building operational data. Front. Energy Res 9. 10.3389/fenrg.2021.652801.
- 12. Baldwin JR, Pingault J-B, Schoeler T, Sallis HM, and Munafò MR (2022). Protecting against researcher bias in secondary data analysis: challenges and potential solutions. Eur. J. Epidemiol 37, 1–10.
- 13. Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, Downey P, Elliott P, Green J, Landray M, et al. (2015). UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med 12, e1001779.
- 14. Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten J-W, da Silva Santos LB, Bourne PE, et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018.
- 15. Enhancing Reproducibility through Rigor and Transparency. https://grants.nih.gov/policy/reproducibility/index.htm.
- 16. Sayers EW, Bolton EE, Brister JR, Canese K, Chan J, Comeau DC, Connor R, Funk K, Kelly C, Kim S, et al. (2022). Database resources of the national center for biotechnology information. Nucleic Acids Res 50, D20–D26.
- 17. Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, and Sayers EW (2016). GenBank. Nucleic Acids Res 44, D67–D72.
- 18. Kanz C, Aldebert P, Althorpe N, Baker W, Baldwin A, Bates K, Browne P, van den Broek A, Castro M, Cochrane G, et al. (2005). The EMBL Nucleotide Sequence Database. Nucleic Acids Res 33, D29–D33.
- 19. Leinonen R, Akhtar R, Birney E, Bower L, Cerdeno-Tárraga A, Cheng Y, Cleland I, Faruque N, Goodgame N, Gibson R, et al. (2011). The European Nucleotide Archive. Nucleic Acids Res 39, D28–D31.
- 20. Fukuda A, Kodama Y, Mashima J, Fujisawa T, and Ogasawara O (2021). DDBJ update: streamlining submission and access of human data. Nucleic Acids Res 49, D71–D75.
- 21. Suber P (2008). An open access mandate for the National Institutes of Health. Open Med 2, e39–e41.
- 22. Consortium UniProt (2023). UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res 51, D523–D531.
- 23. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet 25, 25–29.
- 24. Gene Ontology Consortium, Aleksander SA, Balhoff J, Carbon S, Cherry JM, Drabkin HJ, Ebert D, Feuermann M, Gaudet P, Harris NL, et al. (2023). The Gene Ontology knowledgebase in 2023. Genetics 224. 10.1093/genetics/iyad031.
- 25. Figueiredo AS (2017). Data Sharing: Convert Challenges into Opportunities. Front Public Health 5, 327.
- 26. Joly Y, Dyke SOM, Knoppers BM, and Pastinen T (2016). Are Data Sharing and Privacy Protection Mutually Exclusive? Cell 167, 1150–1154.
- 27. Bartlett A, Penders B, and Lewis J (2017). Bioinformatics: indispensable, yet hidden in plain sight? BMC Bioinformatics 18, 311.
- 28. Bauchner H, McDermott MM, and Butte AJ (2023). Data Sharing Enters a New Era. Ann. Intern. Med 176, 400–401.
- 29. GastroPlus® PBPK & PBBM modeling and simulation (2016). Simulations Plus. https://www.simulations-plus.com/software/gastroplus/.
- 30. Ingenuity pathway analysis (2023). Bioinformatics Software | QIAGEN Digital Insights. https://digitalinsights.qiagen.com/IPA.
- 31. Geneious (2023). Geneious. https://www.geneious.com/.
- 32. Rajesh A, Chang Y, Abedalthagafi MS, Wong-Beringer A, Love MI, and Mangul S (2021). Improving the completeness of public metadata accompanying omics studies. Genome Biol 22, 106.
- 33. Huang Y-N, Jaiswal PV, Rajesh A, Yadav A, Yu D, Liu F, Scheg G, Boldirev G, Nakashidze I, Mehta JH, et al. (2023). The systematic assessment of completeness of public metadata accompanying omics studies. bioRxiv, 2021.11.22.469640. 10.1101/2021.11.22.469640.
- 34. Leipzig J, Nüst D, Hoyt CT, Ram K, and Greenberg J (2021). The role of metadata in reproducible computational research. Patterns (N Y) 2, 100322.
- 35. Callebaut W (2012). Scientific perspectivism: A philosopher of science’s response to the challenge of big data biology. Stud. Hist. Philos. Biol. Biomed. Sci 43, 69–80.
- 36. Mulder NJ, Adebiyi E, Alami R, Benkahla A, Brandful J, Doumbia S, Everett D, Fadlelmola FM, Gaboun F, Gaseitsiwe S, et al. (2016). H3ABioNet, a sustainable pan-African bioinformatics network for human heredity and health in Africa. Genome Res 26, 271–277.
- 37. A comprehensive review and conceptual framework for cloud computing adoption in bioinformatics (2023). Healthcare Analytics 3, 100190.
- 38. Grealey J, Lannelongue L, Saw W-Y, Marten J, Méric G, Ruiz-Carmona S, and Inouye M (2022). The Carbon Footprint of Bioinformatics. Mol. Biol. Evol 39, msac034.
- 39. European Bioinformatics Institute. Reducing the carbon footprint of scientific computing. https://www.ebi.ac.uk/about/news/perspectives/greener-principles/.
- 40. Singh Banipal I, and Mazumder S (2024). How to make AI sustainable. Nature India. 10.1038/d44151-024-00024-8.
- 41. Xu Y, Liu X, Cao X, Huang C, Liu E, Qian S, Liu X, Wu Y, Dong F, Qiu C-W, et al. (2021). Artificial intelligence: A powerful paradigm for scientific research. Innovation (Camb) 2, 100179.
- 42. Dash S, Shakyawar SK, Sharma M, and Kaushik S (2019). Big data in healthcare: management, analysis and future prospects. Journal of Big Data 6, 1–25.
- 43. West J, You L, Zhang J, Gatenby RA, Brown JS, Newton PK, and Anderson ARA (2020). Towards Multidrug Adaptive Therapy. Cancer Res 80, 1578–1589.
- 44. Pritchard JR, Bruno PM, Gilbert LA, Capron KL, Lauffenburger DA, and Hemann MT (2013). Defining principles of combination drug mechanisms of action. Proc. Natl. Acad. Sci. U. S. A 110, E170–E179.
- 45. Jonsson VD, Blakely CM, Lin L, Asthana S, Matni N, Olivas V, Pazarentzos E, Gubens MA, Bastian BC, Taylor BS, et al. (2017). Novel computational method for predicting polytherapy switching strategies to overcome tumor heterogeneity and evolution. Sci. Rep 7, 44206.
- 46. Irurzun-Arana I, McDonald TO, Trocóniz IF, and Michor F (2020). Pharmacokinetic Profiles Determine Optimal Combination Treatment Schedules in Computational Models of Drug Resistance. Cancer Res 80, 3372–3382.
- 47. Sliwoski G, Kothiwale S, Meiler J, and Lowe EW Jr (2014). Computational methods in drug discovery. Pharmacol. Rev 66, 334–395.
- 48. Park K (2019). A review of computational drug repurposing. Transl Clin Pharmacol 27, 59–63.
- 49. Zong N, Wen A, Moon S, Fu S, Wang L, Zhao Y, Yu Y, Huang M, Wang Y, Zheng G, et al. (2022). Computational drug repurposing based on electronic health records: a scoping review. NPJ Digit Med 5, 77.
- 50. Azad TD, Chaudhuri AA, Fang P, Qiao Y, Esfahani MS, Chabon JJ, Hamilton EG, Yang YD, Lovejoy A, Newman AM, et al. (2020). Circulating Tumor DNA Analysis for Detection of Minimal Residual Disease After Chemoradiotherapy for Localized Esophageal Cancer. Gastroenterology 158, 494–505.e6.
- 51. Cheng J, Novati G, Pan J, Bycroft C, Žemgulytė A, Applebaum T, Pritzel A, Wong LH, Zielinski M, Sargeant T, et al. (2023). Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 381, eadg7492.
- 52. Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, Qin C, Žídek A, Nelson AWR, Bridgland A, et al. (2020). Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710.
- 53. Abel R, Wang L, Mobley DL, and Friesner RA (2017). A Critical Review of Validation, Blind Testing, and Real-World Use of Alchemical Protein-Ligand Binding Free Energy Calculations. Curr. Top. Med. Chem 17, 2577–2585.
- 54. Sadybekov AV, and Katritch V (2023). Computational approaches streamlining drug discovery. Nature 616, 673–685.
- 55. Lyu J, Wang S, Balius TE, Singh I, Levit A, Moroz YS, O’Meara MJ, Che T, Algaa E, Tolmachova K, et al. (2019). Ultra-large library docking for discovering new chemotypes. Nature 566, 224–229.
- 56. Sadybekov AA, Sadybekov AV, Liu Y, Iliopoulos-Tsoutsouvas C, Huang X-P, Pickett J, Houser B, Patel N, Tran NK, Tong F, et al. (2022). Synthon-based ligand discovery in virtual libraries of over 11 billion compounds. Nature 601, 452–459.
- 57. Krishna R, Wang J, Ahern W, Sturmfels P, Venkatesh P, Kalvet I, Lee GR, Morey-Burrows FS, Anishchenko I, Humphreys IR, et al. (2024). Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science, eadl2528.
- 58. Vázquez Torres S, Leung PJY, Venkatesh P, Lutz ID, Hink F, Huynh H-H, Becker J, Yeh AH-W, Juergens D, Bennett NR, et al. (2024). De novo design of high-affinity binders of bioactive helical peptides. Nature 626, 435–442.
- 59. Wagner J, Dahlem AM, Hudson LD, Terry SF, Altman RB, Gilliland CT, DeFeo C, and Austin CP (2018). A dynamic map for learning, communicating, navigating and improving therapeutic development. Nat. Rev. Drug Discov 17, 150.
- 60. Yanai I, and Lercher M (2020). A hypothesis is a liability. Genome Biol 21, 231.
- 61. Madabushi R, Seo P, Zhao L, Tegenge M, and Zhu H (2022). Review: Role of Model-Informed Drug Development Approaches in the Lifecycle of Drug Development and Regulatory Decision-Making. Pharm. Res 39, 1669–1680.
- 62. Madabushi R, Wang Y, and Zineh I (2019). A Holistic and Integrative Approach for Advancing Model-Informed Drug Development. CPT Pharmacometrics Syst Pharmacol 8, 9–11.
- 63. Marshall S, Madabushi R, Manolis E, Krudys K, Staab A, Dykstra K, and Visser SAG (2019). Model-Informed Drug Discovery and Development: Current Industry Good Practice and Regulatory Expectations and Future Perspectives. CPT Pharmacometrics Syst Pharmacol 8, 87–96.
- 64. Mangul S, Martin LS, Langmead B, Sanchez-Galan JE, Toma I, Hormozdiari F, Pevzner P, and Eskin E (2019). How bioinformatics and open data can boost basic science in countries and universities with limited resources. Nat. Biotechnol. 37, 324–326.
- 65. Lonsdale J, Thomas J, Salvatore M, Phillips R, Lo E, Shad S, Hasz R, Walters G, Garcia F, Young N, et al. (2013). The Genotype-Tissue Expression (GTEx) project. Nat. Genet 45, 580–585.
- 66. About the Human Microbiome. https://www.hmpdacc.org/ihmp/overview/#.
- 67. Human Microbiome Project Consortium (2012). Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214.
- 68. Weinstein JN, Collisson EA, Mills GB, Shaw KM, Ozenberger BA, Ellrott K, Shmulevich I, Sander C, Stuart JM, and Cancer Genome Atlas Research Network (2013). The Cancer Genome Atlas Pan-Cancer Analysis Project. Nat. Genet 45, 1113.
- 69. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, and Bourne PE (2000). The Protein Data Bank. Nucleic Acids Res 28, 235–242.
- 70. Gaulton A, Bellis LJ, Patricia Bento A, Chambers J, Davies M, Hersey A, Light Y, McGlinchey S, Michalovich D, Al-Lazikani B, et al. (2012). ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40, D1100.
- 71. Website https://www.niaid.nih.gov/research/immport.
- 72. Morrison-Smith S, Boucher C, Sarcevic A, Noyes N, O’Brien C, Cuadros N, and Ruiz J (2022). Challenges in large-scale bioinformatics projects. Humanities and Social Sciences Communications 9, 1–9.
- 73. Foster I (2006). 2020 computing: a two-way street to science’s future. Nature 440, 419.
- 74. Cohen JE (2004). Mathematics is biology’s next microscope, only better; biology is mathematics’ next physics, only better. PLoS Biol 2, e439.
- 75. Spreafico R, Mitchell S, and Hoffmann A (2015). Training the 21st Century Immunologist. Trends Immunol. 36, 283–285.
- 76. Choi H, Moran J, Matsumoto N, Hernandez ME, and Moore JH (2023). Aliro: an automated machine learning tool leveraging large language models. Bioinformatics 39. 10.1093/bioinformatics/btad606.
