Abstract
Bioinformatics is an interdisciplinary, fast-developing, and broad-ranging field, coevolving with and empowered by advanced technologies across multiple related disciplines. Given the ever-growing volume of biological data generated at multiple levels and scales, bioinformatics represents a holistic approach to deciphering the complexity of biological systems and thus holds significant potential to realize a paradigm shift by transforming data into theory. Here I articulate a vision of expanding bioinformatics from data to theory that paves the way for this paradigm shift in biology, which can consolidate fragmented research findings within a theoretical framework, enable theory-guided AI modelling and experimentation with enhanced explainability and reduced parameter space, drive biological research from a holistic perspective, and further strengthen the identity and coherence of bioinformatics as a discipline.
Keywords: Bioinformatics, Data, Theory, Paradigm, AI
Bioinformatics is an interdisciplinary field that integrates biology, computer science, mathematics, statistics, and related disciplines, and it emerged as a distinct discipline in the 1960s [1]. Since then, propelled by rapid advances in multiple related fields [2] and accelerated by landmark genome projects such as the Human Genome Project (HGP), launched in 1990, bioinformatics has developed rapidly, transforming biology from a purely lab-based science into an information-based science that supports both hypothesis-driven and data-driven research. Bioinformatics is now widely recognized as an important discipline of the 21st century, playing a vital role in coping with the growing amounts of multi-dimensional biological data.
Intrinsically, bioinformatics is a fast-developing discipline, coevolving with and empowered by advanced technologies across various related disciplines, particularly biotechnology (BT) and information technology (IT). Consequently, it is becoming increasingly data-intensive and shows promise in tackling complex biological questions through artificial intelligence (AI), statistical learning, and related approaches. Coupled with the ever-growing data generated at multiple levels and scales of biological systems, bioinformatics now has great potential to drive a paradigm shift in biology.
1. What is the current paradigm in biology?
The term “paradigm”, originating from Thomas Kuhn's famous conception of scientific revolutions [3], refers to a set of fundamental concepts, research methods, postulates, and norms that are widely adopted as a guiding framework by the scientific community. Over time, a paradigm becomes incompatible with new phenomena and problems that cannot be solved within the current framework. This stimulates the creation of a new paradigm that replaces the prior one, and the shift from one paradigm to another characterizes a scientific revolution.
Foremost, it is crucial to know that there are four paradigms in science: (1) First Paradigm—empirical science, based on experiments and observations of natural phenomena along with empirical data accumulation; (2) Second Paradigm—theoretical science, with the development of scientific theories to explain the principles behind natural phenomena; (3) Third Paradigm—computational science, involving computational modelling, simulation, and algorithm development; and (4) Fourth Paradigm—data-driven science, with knowledge discovery by processing large amounts of data. It is noted that the four paradigms are linked coherently and it cannot be established that one paradigm is superior to another paradigm.
The question, then, is what the current paradigm in biology is. Only by knowing this can we better understand the direction of the paradigm shift. At first glance, biology seems to cover all four paradigms. Here I would argue, however, that biology is primarily in the first paradigm, since we still lack massive high-quality empirical data. Although BT has advanced rapidly over the past years and IT (particularly AI) has been extensively utilized in biology, we have to admit that multi-dimensional data have been and are still generated with varying quality and, most importantly, account for only a tiny portion of the data universe across species, organisms, organs, tissues, and cells at the full range of spatial-temporal scales. In short, biology is still in the first paradigm, accumulating high-quality empirical data via observations and experiments.
2. Paradigm shift in biology: Data deluge, theory desert
Thus, it is clear that the direction of the paradigm shift in biology is toward the second paradigm of theoretical science, aiming to develop theories that are derived from the data accumulated in the first paradigm of empirical science. Unfortunately, theory in biology is highly underappreciated and its significance in advancing biology is not fully recognized [4]. Compared to other disciplines (like physics), biology has relatively few theories to date, particularly quantitative ones (like Chargaff's rules proposed in 1952 [5]). Yet theory, as an integral component of biology [6], has great potential to aggregate fragmented research findings and direct theory-guided AI modelling and experimentation. As a saying by Leonardo da Vinci (1452–1519) goes: “He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast”. Only when equipped with theory can we avoid aimless drift, navigate the sea of biological science with clear direction, and ultimately decipher the basic principles of complex biological systems.
Notably, biology is fragmented into many subdisciplines. As a consequence, research findings are fragmented in a manner akin to the parable of the blind men and the elephant, which poses great challenges in connecting research findings from different areas, understanding complex biological systems from a holistic perspective, and forming a theoretical framework to shift the paradigm.
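To make the notion of a quantitative theory concrete, the following minimal sketch illustrates Chargaff's first parity rule (in double-stranded DNA, the amount of A equals that of T and the amount of G equals that of C). It is an illustration only, not a method from the cited work: the function names and the toy sequence are hypothetical, and the duplex is modelled simply as one strand plus its reverse complement.

```python
from collections import Counter

# Watson-Crick base pairing: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the reverse complement of a DNA strand."""
    return strand.upper().translate(COMPLEMENT)[::-1]


def base_fractions(sequence: str) -> dict:
    """Return the fraction of A, C, G, and T among the canonical bases."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return {base: counts[base] / total for base in "ACGT"}


def satisfies_first_parity_rule(strand: str, tolerance: float = 1e-9) -> bool:
    """Check A = T and G = C on the duplex formed by a strand and its partner."""
    duplex = strand + reverse_complement(strand)
    fractions = base_fractions(duplex)
    return (
        abs(fractions["A"] - fractions["T"]) <= tolerance
        and abs(fractions["G"] - fractions["C"]) <= tolerance
    )


if __name__ == "__main__":
    toy_strand = "ATGCGGATTA"  # hypothetical sequence, for illustration only
    print(base_fractions(toy_strand))               # single-strand composition
    print(satisfies_first_parity_rule(toy_strand))  # duplex-level check
```

In this sketch the rule holds on the duplex by construction, because every base is paired with its complement. The composition of a single strand is not constrained in the same way: in the toy strand above, G and C occur at different frequencies.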
3. Expanding bioinformatics from data to theory
It is no exaggeration to say that transforming data into theory can be regarded as a scientific revolution that shifts the paradigm in biology. Thus, it is time to gain a deeper understanding of bioinformatics by defining and expanding its research areas. So why is it bioinformatics that can realize the paradigm shift in biology? Over the past years, bioinformatics has already made a historic impact on a wide range of research across life, medicine, and health sciences [7]. As mentioned, biology is in an era of data deluge and theory desert, thirsting for fundamental theories to fully understand complex biological systems from a holistic view. Bioinformatics represents a holistic approach to deciphering the complexity of biological systems through the systematic integration of biological data, computational methods, and computing resources [8], and is accordingly well placed to realize the paradigm shift through the generalization and conceptualization of theories.
To contribute to the paradigm shift in biology, bioinformatics, positioned with a holistic vision in the life sciences and standing at the cross-disciplinary forefront of the field, should be expanded to have its own distinct research areas ranging from data to theory (Fig. 1): (1) Database: build database resources to manage the data with value-added curation and integration; (2) Algorithm: develop algorithms (as well as related tools and pipelines) in aid of data modelling and simulation; (3) Analysis: analyze the data and interpret them in a biologically meaningful manner; and (4) Theory: formulate theoretical principles and laws that are derived from vast amounts of high-quality multi-dimensional data. In the era of big data, theory is vital to direct biological research and guide AI with improved explainability and reduced parameter space.
Fig. 1.
Bioinformatics research areas, involving database, algorithm, analysis, and theory, in aid of a paradigm shift from data to theory.
The first three areas were born with the advent of bioinformatics, despite the paucity of available data and computing resources at that time. For the first two, Margaret Dayhoff (1925–1983) conducted pioneering work and made significant contributions to the field: she built the “Atlas of Protein Sequence and Structure” in 1965, the first biological sequence database, with 65 protein sequences in its first edition, and developed COMPROTEIN in 1962, the first computer program for determining protein primary structure [1]. For the third area, the analysis of genome sequence data, a critical component of the success of the HGP, has proved to be a landmark for bioinformatics. Regarding the fourth area, a case in point is the notion of whole-genome duplication as a powerful mechanism of evolutionary innovation, which was eventually confirmed through bioinformatic analysis of the complete genomes of Saccharomyces cerevisiae and its close relatives [9]. Another example is the three laws of genome nucleotide composition [10], which can be used as a theoretical framework for studying genome organization and evolution and for driving synthetic genome engineering. Clearly, unlike other disciplines, bioinformatics possesses cross-disciplinary, fast-developing, broad-ranging, big-data-driven, and holistic features together with collaborative visions, and therefore has tremendous potential to decipher the code of life by developing fundamental theories and revitalizing theoretical biology research [11] (which was greatly inspired by “What Is Life?”, a book published in 1944 by Erwin Schrödinger (1887–1961)).
4. Concluding thoughts
As biology is a natural science of life, bioinformatics can be viewed as a data science of life, involving multi-disciplinary methods at diverse scales, from as small as molecules and cells to as large as species and populations. Essentially, bioinformatics is an interdisciplinary, fast-evolving, and broad-ranging discipline that has gradually expanded in response to rapid advances in related disciplines. In retrospect, bioinformatics has passed through distinct stages associated with landmark events (Fig. 2), namely, sequence-oriented (since 1952, Chargaff's rules), omics-driven (since 1990, HGP launched), and AI-powered (since 2018, AlphaFold). Going forward, bioinformatics is set to enter a theory-guided stage (beyond 2024), directing AI modelling and experimentation in biology. Inevitably, challenges lie ahead, and they are primarily rooted in data. Specifically, data (particularly high-quality data) are key to the formulation of theory, which, in turn, can lead to more successful applications of AI, ideally marked by fewer parameters to capture essential patterns, reduced training cost and time, and enhanced explainability with biological reasoning. This feedback chain, in which data inform theory and theory further guides AI, creates and accelerates the paradigm shift in biology.
Fig. 2.
Schematic representation of four stages with major landmarks in bioinformatics. The history of bioinformatics roughly falls into four stages with an open-ended future: sequence-oriented stage since 1952, omics-driven stage since 1990, AI-powered stage since 2018, and theory-guided stage beyond 2024. The major landmarks in bioinformatics are: Chargaff's rules in 1952, COMPROTEIN in 1962, Atlas of Protein Sequence and Structure in 1965, Needleman-Wunsch algorithm for global sequence alignment in 1970, Smith-Waterman algorithm for local sequence alignment in 1981, GenBank in 1982, BLAST (Basic Local Alignment Search Tool) in 1990, TCGA Data Portal in 2010, and AlphaFold in 2018.
Collectively, the cross-disciplinary nature of bioinformatics and its proliferation into many biological research fields enable it to lead the way toward the paradigm shift from data to theory, viz., from the first paradigm to the second paradigm, which can direct biological research from a holistic vision and further enhance its disciplinary identity and coherence.
CRediT authorship contribution statement
Zhang Zhang: Conceptualization, Writing – original draft, Writing – review & editing, Supervision, Project administration, Funding acquisition.
Declaration of competing interest
The author declares that he has no conflicts of interest in this work.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (32030021), National Key R&D Program of China (2023YFC2604400) and International Partnership Program of the Chinese Academy of Sciences (153F11KYSB20160008). I thank Yu Xue for valuable discussions on this work. My sincere apologies for omitting many citations due to space limitation.
Biography
Zhang Zhang (BRID: 03082.00.07730) is a distinguished professor of China National Center for Bioinformation & Beijing Institute of Genomics, Chinese Academy of Sciences (CAS). He received his PhD degree from Institute of Computing Technology, CAS in 2007. His research interests include bioinformatics, theoretical biology, big data integration, and development of omics data resources and algorithms.
References
1. Gauthier J., Vincent A.T., Charette S.J., et al. A brief history of bioinformatics. Brief Bioinform. 2019;20:1981–1996. doi: 10.1093/bib/bby063.
2. Hagen J.B. The origins of bioinformatics. Nat. Rev. Genet. 2000;1:231–236. doi: 10.1038/35042090.
3. Kuhn T.S. The Structure of Scientific Revolutions. University of Chicago Press; Chicago, United States: 1962.
4. National Research Council. The Role of Theory in Advancing 21st-Century Biology: Catalyzing Transformative Research. The National Academies Press; Washington, DC: 2008.
5. Chargaff E., Lipshitz R., Green C. Composition of the desoxypentose nucleic acids of four genera of sea-urchin. J. Biol. Chem. 1952;195:155–160.
6. Hogeweg P. The roots of bioinformatics in theoretical biology. PLoS Comput. Biol. 2011;7. doi: 10.1371/journal.pcbi.1002021.
7. Zhang Z., Hu S., Yu J. Toward a new paradigm of genomics research. Geno. Proteom. Bioinf. 2023;21:904–909.
8. Searls D.B. The roots of bioinformatics. PLoS Comput. Biol. 2010;6. doi: 10.1371/journal.pcbi.1000809.
9. Kellis M., Birren B.W., Lander E.S. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004;428:617–624. doi: 10.1038/nature02424.
10. Zhang Z. Laws of genome nucleotide composition. Geno. Proteom. Bioinf. 2024;22:qzae061. doi: 10.1093/gpbjnl/qzae061.
11. Ouzounis C.A. Rise and demise of bioinformatics? Promise and progress. PLoS Comput. Biol. 2012;8. doi: 10.1371/journal.pcbi.1002487.


