eBioMedicine. 2024 Nov 8;109:105446. doi: 10.1016/j.ebiom.2024.105446

Generative AI agents are transforming biology research: high resolution functional genome annotation for multiscale understanding of life

Stefan Harrer, Rahul V Rane, Robert E Speight
PMCID: PMC11583719  PMID: 39520825

Life is inherently multimodal and governed by interactions between DNA, RNA, proteins and metabolites that vary in time and space, from cell to cell, and genome to biome. This complexity prevents us from understanding many biological systems and has brought heuristic analytical techniques to their limits. Holistic approaches driven by narrow AI have over the last decade shown promise for overcoming those limitations and offered new ways of explaining and designing biological systems. The evolution of Large Language Model (LLM)-based generative AI techniques and the currently unfolding Cambrian Explosion of novel agentic AI tools (Agents) allow biologists to bundle AI, biological data, and advanced life sciences technologies into a suite of scientific and engineering capabilities. Referred to as generative biology, this rapidly expanding field fundamentally redefines our ability to make sense of life from genome to biome.

The genome is the blueprint of an organism. To understand its biology, the quantitative functions, characteristics, and interactions of the encoded gene products need to be known and annotated in the genome. Whilst many biological systems have been well characterised through extensive experimentation, we still lack a detailed understanding of the function of every gene. With the vast amount of genomic sequence data and the expansion of high-throughput experimental methods (e.g. biofoundries), LLMs are playing a growing role in predicting functions from gene sequences and improving genome annotations. The insights unlocked in this way accelerate drug development and other valuable applications, such as the engineering of new biological systems and organisms. This Commentary describes current key applications of generative AI in biology and explains how progress in AI and agent technology will shape the future of biology and the role of biologists.

AI and biology are made for each other

As the biological systems we study grow in scale and complexity, fewer heuristic tools remain available to describe them, because the performance of these tools is tied to that complexity. Models can reach their limits surprisingly quickly, even for simpler systems. For example, approximately 70% of the bodyweight of a mammal is made up of water, largely inside cells, where it facilitates biochemical processes essential for life. Despite this ubiquity of water, first-principles modelling of interacting water molecules in solution was demonstrated only in 2022, is limited to a system of 256 water molecules, and required the development of compute-intensive quantum mechanical and machine learning algorithms [1].

Biological datasets are large and unstructured, with differences in data creation speed, size, quality, usability, types, and modalities across different ‘omics layers. Scientists often lack prior knowledge of these factors needed to choose data that fits their analytical goals and to annotate them for heuristic models or supervised learning algorithms. This is being resolved by the introduction of unsupervised and self-supervised learning methods, capable of discovering patterns and correlations in unstructured and unlabelled data.

While these AI technologies can be trained to explain specific aspects of biological systems, such as predicting the 3D structure of proteins from amino acid sequences [2] or genome-scale metabolic modelling [3], verifiable insights remain confined to problems for which suitably large and high-quality datasets are available. In 2023, Sir Demis Hassabis, CEO of Isomorphic Labs and Google DeepMind and recipient of the 2024 Nobel Prize in Chemistry, declared modelling a virtual digital cell to be the next frontier of using AI in biology, a feat which would require holistic bridging of genomics, transcriptomics, proteomics, metabolomics and cell signalling with spatial and temporal considerations.

Breakthroughs in model architecture design have moved AI from application-specific to use-case agnostic and Hassabis' vision from aspirational to feasible. Large Language Models (LLMs), a type of generative AI, can perform tasks they weren't specifically trained for by processing large, unstructured data. LLMs can autonomously answer questions, manage knowledge, and create new information when prompted. The most recent evolution of LLMs also marks a quantum leap in their use: so-called Agents are LLM-driven programs that, in response to user requests, autonomously plan and execute entire workflows using other tools and data [4]. Biologists are now creating and using Agents to aid in scientific design and execution. These Agents offer open-ended problem-solving, can contextualise complex, multimodal data, design experimental workflows, and use advanced analytical tools, thus revolutionising biology research.
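The plan-then-execute pattern described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two sequence tools, the `TOOLS` registry, and the keyword-based `plan` function are hypothetical stand-ins, and in a real Agent the planning step would be performed by an LLM rather than by string matching.

```python
from typing import Callable, Dict, List, Tuple


def gc_content(seq: str) -> float:
    """Hypothetical tool: fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def reverse_complement(seq: str) -> str:
    """Hypothetical tool: reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(seq.upper()))


# Registry of tools the agent is allowed to call (illustrative names).
TOOLS: Dict[str, Callable[[str], object]] = {
    "gc_content": gc_content,
    "reverse_complement": reverse_complement,
}


def plan(request: str) -> List[str]:
    """Stub planner: keyword matching stands in for the LLM that would
    normally translate a user request into an ordered list of tool calls."""
    text = request.lower()
    steps: List[str] = []
    if "reverse complement" in text:
        steps.append("reverse_complement")
    if "gc" in text:
        steps.append("gc_content")
    return steps


def run_agent(request: str, seq: str) -> List[Tuple[str, object]]:
    """Execute each planned step on the input sequence and collect the results."""
    return [(name, TOOLS[name](seq)) for name in plan(request)]


if __name__ == "__main__":
    for tool_name, result in run_agent(
        "Report the GC content and the reverse complement", "ATGC"
    ):
        print(tool_name, result)
```

In this sketch each tool is applied independently to the original sequence. A fuller Agent would also pass intermediate results between steps and revise its plan based on what the tools return.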

Augmenting human intelligence: how AI can help unravel biology

Synthetic biology, also known as engineering biology, creates knowledge and tools to reprogram and engineer biomolecules for beneficial uses. In recent decades, the field has demonstrated impact through the generation of new pharmaceuticals and vaccines, improved crops, and industrial processes [5]. These successes have mainly come from incremental changes to the biomolecule or host organism. However, the field's truly transformative potential is limited by insufficiently detailed knowledge of diverse biological systems and by a limited ability to accurately predict engineering outcomes. The field is now being transformed through the combination of advances in AI (multimodal biological LLMs and Agents) and large biological datasets, a combination referred to as generative biology. The ability to predict biological function is dramatically accelerating the generation of new knowledge and biological data, and with it more impactful outcomes.

Function prediction accuracy is, however, constrained by the availability of suitable volumes of quality data. AlphaFold benefits from decades of research that has delivered almost 225,000 structures with defined quality (resolution) in the Protein Data Bank [6]. Structure provides important clues about function, but for higher resolution functional parameters like enzyme kinetics and substrate specificity, datasets are much more limited and variable in quality. Despite decades of research, curated enzyme kinetics parameters from literature only exist for around seventeen thousand entries from nearly eight thousand protein sequences [7]. Large datasets generated in a standardised fashion in single experiments that link sequence with enzyme function to explore fitness landscapes are now becoming available, with more expected using biofoundries and high-throughput experimentation [8]. But sequence space is vast, and almost a billion natural protein sequences are available [9]. With each sequence having potentially different substrate specificity, kinetics, inhibition, stability, interactions, and spatial–temporal expression, experimental characterisation of a significant proportion of the available sequences is unfeasible.
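To put these figures in proportion, the short Python sketch below computes rough coverage ratios from the rounded numbers quoted in the paragraph above. The values are approximations taken from the text ("almost 225,000", "around seventeen thousand", "nearly eight thousand", "almost a billion"), and the structure ratio assumes, purely for illustration, that each Protein Data Bank structure corresponds to one distinct sequence, which is not the case in practice.

```python
# Rounded figures quoted in the text above (approximations, not exact counts).
PDB_STRUCTURES = 225_000            # "almost 225,000 structures"
CURATED_KINETIC_ENTRIES = 17_000    # "around seventeen thousand entries"
SEQUENCES_WITH_KINETICS = 8_000     # "nearly eight thousand protein sequences"
NATURAL_SEQUENCES = 1_000_000_000   # "almost a billion natural protein sequences"

# Share of known natural sequences that have curated enzyme kinetics data.
kinetics_coverage = SEQUENCES_WITH_KINETICS / NATURAL_SEQUENCES

# Upper-bound style ratio of structures to sequences (illustrative assumption:
# one structure per distinct sequence).
structure_ratio = PDB_STRUCTURES / NATURAL_SEQUENCES

# Average number of curated kinetic entries per characterised sequence.
entries_per_sequence = CURATED_KINETIC_ENTRIES / SEQUENCES_WITH_KINETICS

print(f"Sequences with curated kinetics: {kinetics_coverage:.4%}")
print(f"Structures relative to sequences: {structure_ratio:.4%}")
print(f"Kinetic entries per characterised sequence: {entries_per_sequence:.2f}")
```

Under these assumptions, fewer than one in a hundred thousand known natural sequences has curated kinetic parameters, which illustrates why exhaustive experimental characterisation is out of reach.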

To better understand these interactions and their impact on biology, scientists have started to develop multimodal generative models for biology. These models help generate and study DNA, RNA, proteins and their interactions or interdependencies, all while having interpretability in mind. This presents a complexity that is unlike traditional large language models but brings the promise of accelerating synthetic biology by extrapolating data and rules based on the limited data we have per species. This advancement is coupled with the use of Agents to study such high-dimensional systems in new and more efficient ways: NVIDIA released NIM Agent Blueprint for drug discovery, a service that allows biologists to integrate different AI models such as AlphaFold2 and NVIDIA BioNeMo into screening workflows for small molecules. InSilico Medicine runs an agentised AI end-to-end drug discovery pipeline including an experimental biolab that is controlled by AI-driven robots. Biotech Owkin and its recent spin-out Bioptimus build LLM agents for creating and analysing large high-quality bio datasets. Marinka Zitnik's lab at Harvard Medical School is building a rapidly expanding suite of AI agents for studying biomedical areas such as genetics, cell biology and chemical biology. And Google DeepMind and BioNTech have both announced that they are building LLM-powered lab assistants to help researchers plan scientific experiments and better predict their outcomes. We are witnessing the onset of a process that will fundamentally transform how biologists work and what impact they will achieve. Owkin CEO Thomas Clozel has recently shared his pointed view on the matter: “Imagine an intelligence made up of hundreds of specialized AI agents that could run millions of virtual experiments and choose the best one to run in a real, automated lab”. With laboratory experimentation remaining a major bottleneck, this is where biology is heading.

AI will not replace biologists, but those who use AI will set the pace of scientific breakthroughs

Two trends shape the future of biology. Firstly, the development of narrow, specialised AI, such as the AlphaFold or AlphaMissense [10] model series, will continue alongside the curation of large high-quality datasets for training, validating, and scaling these models. Secondly, the development of broad generative AI will pivot from building ever larger LLMs to smaller customised assistive Agents. Biologists will use such Agents to access specialised AI models and other analytical and engineering tools as they plan and execute scientific studies of biological systems at unprecedented scales.

These developments will help support breakthroughs in two main areas of biology within the next 3–5 years. Firstly, models of entire virtual digital cells will become available. Secondly, building on this step, models that directly link genotypes to phenotypes by bridging or bypassing the layers in between will overcome the conventional waterfall-style approach, which postulates that intermediate levels, such as the tissue level, must be understood before levels further upstream, such as the organ or whole-organism level, can be understood.

The ground truth for any such extrapolation is the genome, hence rendering high-resolution functional genome annotation a critically important component for all biological research. Novel ways of reading and understanding the genome are, however, only half of the transformational journey biology is currently undergoing. The other half will come from transforming the understanding of a genome to re-writing it using synthetic biology and the genome editing technologies that have given biologists paradigm-shifting engineering capabilities that match the disruptiveness of the AI-powered analytical toolbox. We expect some of the most impactful Agents in biology to connect analytical functional genome annotation technology with experimental genome writing and validation capabilities.

Equitably accessible, large, high-quality biological datasets are instrumental for all of this to unfold. For AI to be safe, trusted, and adopted it needs to be designed, deployed and assessed following responsible and ethical standards. These standards are best deployed through collaboration and transparent co-design and validation of models and their use across the developer and adopter community, including scientists. We therefore expect the most notable progress to be made by the open-source community and by research organisations and labs which are embedded in collaborative and interdisciplinary environments with strong incentives to solve real-world problems.

Industrial and societal impact will be profound: pharma and biotech will improve the efficiency of the drug development cycle. The agriculture and food industry will custom-develop crops with increased resistance to local threats such as pathogens and global threats such as climate change. For the bioeconomy, supply chains and processes will become more robust, efficient, and sustainable.

NVIDIA CEO Jensen Huang recently observed that ‘for the first time in human history, biology has the opportunity to be engineering not science’. He then explained that ‘when something becomes engineering not science it becomes exponentially improving’. That is indeed the paradigm shift AI has brought to biology. AI gives biologists unprecedented power to understand life. But it also gives them unparalleled power to change and improve biological systems at multiple scales, from single proteins to biomes. For the first time, biology has the opportunity to truly be both engineering and science.

Contributors

SH, RVR, and RES jointly wrote the manuscript. All three are responsible for all parts of the study including conceptualisation, investigation, methodology, project administration, resources, validation, editing, writing, reviewing, and revising the original draft as well as the published manuscript. SH, RVR, and RES conducted this study as employees of Australia's National Science Agency, the Commonwealth Scientific and Industrial Research Organisation (CSIRO), and all three accept responsibility to submit for publication.

Declaration of interests

SH is an inventor on the following granted US patents in the fields of DNA sequencing and generative AI: US11250219B2 ‘Cognitive natural language generation with style model’, US10267784 ‘DNA sequencing using multiple metal layer structure with different organic coatings forming different transient bondings to DNA’, and US9651518 ‘Nano-fluidic field effective device to control DNA transport through the same’.

Acknowledgements

The authors would like to thank Denis Bauer, Rad Suchecki, Andrew Warden, Cheng Soon Ong, Michael Kuiper, Dan Steinberg, and Regina Campbell from CSIRO as well as Astitva Chopra from Google Research for fruitful discussions.

Contributor Information

Stefan Harrer, Email: stefan.harrer@csiro.au.

Robert E. Speight, Email: robert.speight@csiro.au.

References

1. Yu Q., Qu C., Houston P.L., Conte R., Nandi A., Bowman J.M. q-AQUA: a many-body CCSD(T) water potential, including four-body interactions, demonstrates the quantum nature of water from clusters to the liquid phase. J Phys Chem Lett. 2022;13(22):5068–5074. doi: 10.1021/acs.jpclett.2c00966.
2. Abramson J., Adler J., Dunger J., et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630:493–500. doi: 10.1038/s41586-024-07487-w.
3. Kundu P., Beura S.K., Mondal S., Das A.K., Ghosh A. Machine learning for the advancement of genome-scale metabolic modelling. Biotechnol Adv. 2024. doi: 10.1016/j.biotechadv.2024.108400.
4. Schick T., Dwivedi-Yu J., Dessì R., et al. Toolformer: language models can teach themselves to use tools. Adv Neural Inf Process Syst. 2024;36.
5. Mock M., Langmead C.J., Grandsard P., Edavettal S., Russell A.J.C. Recent advances in generative biology for biotherapeutic discovery. Trends Pharmacol Sci. 2024;45(3):255–267. doi: 10.1016/j.tips.2024.01.003.
6. PDB statistics: overall growth of released structures per year (rcsb.org). 2024.
7. Li F., Yuan L., Lu H., et al. Deep learning-based kcat prediction enables improved enzyme-constrained model reconstruction. Nat Catal. 2022;5(8):662–672.
8. Papkou A., Garcia-Pastor L., Escudero J.A., Wagner A. A rugged yet easily navigable fitness landscape. Science. 2023;382(6673). doi: 10.1126/science.adh3860.
9. Sayers E.W., Bolton E.E., Brister J.R., et al. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2022;50(D1):D20–D26. doi: 10.1093/nar/gkab1112.
10. Cheng J., Novati G., Pan J.P., et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science. 2023;381(6664). doi: 10.1126/science.adg7492.

Articles from eBioMedicine are provided here courtesy of Elsevier
