Briefings in Bioinformatics. 2025 Nov 11;26(6):bbaf601. doi: 10.1093/bib/bbaf601

The rise and potential opportunities of large language model agents in bioinformatics and biomedicine

Tiantian Yang 1, Yihang Xiao 2, Zhijie Bao 3,4, Jianye Hao 5, Jiajie Peng 6,7
PMCID: PMC12602188  PMID: 41214870

Abstract

Large language model (LLM) agents have demonstrated remarkable potential in the fields of bioinformatics and biomedicine. This paper reviews the technical foundations of LLM agents, including their core architecture, key technologies, and collaborative modes. We explore the applications of LLM agents in multi-omics, drug development, chemical research, clinical diagnosis, and health management. The paper also analyzes the major challenges faced by LLM agents, such as the interaction and extension of their frameworks, data privacy and security, model hallucinations and interpretability, timeliness of knowledge updates, and ethical and legal risks. Furthermore, we discuss future directions, including paradigms for human-artificial intelligence collaboration and the development of open-source ecosystems and standardization. This paper aims to provide a comprehensive perspective and guidance on the advancement of LLM agents in bioinformatics and biomedicine.

Keywords: LLMs, agents, bioinformatics, biomedicine

Introduction

Large language models (LLMs) [1, 2] are artificial intelligence (AI) models built on deep learning architectures, with the Transformer at their core, that acquire strong language understanding and generation capabilities through pre-training on large-scale text data. Models such as GPT-4 [3–5], LLaMA [6, 7], Gemini [8], and PaLM [9] have recently advanced a wide range of natural language processing tasks (such as text generation, question answering, and summarization), demonstrating human-level proficiency in interpreting, planning, and processing multimodal data [10]. Leveraging their powerful understanding, generation, and reasoning capabilities, LLMs have been introduced into bioinformatics and biomedicine, resulting in domain-specialized models [11–14]. For instance, BioGPT [15] enables efficient processing of biomedical literature, while MedGPT [16] focuses on knowledge-based Q&A and clinical support.

Despite this progress, current biomedical LLMs face several limitations [17–22]: (i) They lack experimental capabilities, being limited to textual processing and unable to carry out experimental operations or data collection, which restricts practical scientific applications. (ii) Their knowledge is constrained by training data, making it difficult to handle tasks requiring up-to-date or domain-specific expertise; thus, answers may be outdated or inaccurate, especially in fast-evolving fields like biomedicine. (iii) They struggle to interpret complex biomedical data (e.g. gene sequences, protein structures) that often require specialized algorithms. (iv) Personalized medical advice remains beyond their scope, as it typically requires integrating genomic, lifestyle, and clinical data, which exceeds current LLM capabilities.

To address these issues, the concept of LLM agents has gained increasing attention. LLM agents represent a class of systems capable of autonomously executing complex tasks, extending beyond the text-generation capabilities of conventional language models. An LLM agent is fundamentally an intelligent entity with the abilities to reason, plan, and act. At its core is an LLM that not only comprehends and processes natural language instructions but also interacts with the environment through external tools [23–26]. Systems like BioMANIA [27] use LLMs to interpret user instructions and automate bioinformatics workflows via application programming interface (API) integration, while MEDAGENTS [28] demonstrates the value of multi-agent collaboration in enhancing domain reasoning. As biomedical data and research complexity increase [29–32], LLM agents, which integrate domain knowledge with advanced task execution, are emerging as promising solutions.

While reviews have discussed LLMs in medicine [11, 14], few cover bioinformatics, and none systematically review agent applications in biomedicine. This review addresses this gap by systematically analyzing LLM agents’ potential and practical applications in bioinformatics and biomedicine, focusing on key use cases and future directions. The section “Overview of LLMs” details core characteristics and architectural categories of LLMs; “Technical foundations of agents” details agent architectures and enabling technologies; “Applications of agents in bioinformatics and biomedicine” explores application scenarios; and “Challenges and future directions” discusses current challenges and outlines prospects for future research.

By integrating insights from computational biology, clinical informatics, and AI ethics, this review outlines the current landscape and future roadmap for LLM-powered agents in bioinformatics and biomedicine. As these agents evolve from task-specific tools to “AI scientists” [33] that partner with researchers and clinicians, they promise to accelerate precision medicine, scientific discovery, and the adoption of advanced biomedical technologies, thereby advancing the field to new heights.

Overview of LLMs

An LLM is a deep learning model based on the Transformer architecture, which is composed of multi-head self-attention and feed-forward sub-networks [34]. LLMs distinguish themselves by comprising billions of parameters [19] and are trained on massive, Web-scale corpora of unlabeled text via self-supervised pre-training to estimate the probability of the next token in a sequence. After this large-scale optimization, the same parameter set can be readily applied to numerous downstream tasks, such as classification, summarization, or question answering, through prompt conditioning and in-context learning. Empirical studies consistently demonstrate that increasing the model’s scale and training data yields systematic performance improvements and can elicit qualitatively new capabilities at specific scale thresholds [35].
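The next-token objective mentioned above can be illustrated with a minimal sketch. It assumes the model has already produced one vector of raw scores (logits) per position; the three-word vocabulary and the numbers are hypothetical and serve only to show how the average negative log-likelihood is computed.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_step, target_ids):
    """Average negative log-likelihood of the observed next tokens."""
    losses = []
    for logits, target in zip(logits_per_step, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical toy vocabulary: 0="gene", 1="protein", 2="cell"
logits = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.3]]
targets = [0, 1]
loss = next_token_loss(logits, targets)
```

Pre-training lowers this quantity over a very large corpus; the sketch omits the Transformer that would produce the logits.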

Core characteristics of LLMs

The transformative impact of LLMs stems from three interwoven core characteristics that distinguish them from traditional natural language processing models:

Scalable knowledge assimilation

LLMs are pre-trained on trillions of tokens harvested from diverse web corpora. As the parameter count increases from millions to hundreds of billions, the models compress an ever-larger fraction of public knowledge into their weights, yielding predictable log-linear improvements on downstream biomedical benchmarks and enabling single-model coverage of tasks that previously required specialized corpora [36].

Few-shot and zero-shot learning capabilities

By conditioning on prompt contexts that contain only a handful of labeled examples (or none at all), LLMs can instantly absorb new entity types, relation schemas, or output formats without gradient updates. This in-context learning eliminates the need for task-specific training sets that are often small or privacy-restricted in medical domains [4].
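As a rough illustration of in-context learning, the sketch below assembles a few-shot prompt for chemical and disease entity extraction. The function name, the prompt layout, and the label format are assumptions made for this example; passing an empty example list gives the zero-shot case.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an in-context prompt; zero-shot when `examples` is empty."""
    parts = [instruction.strip()]
    for text, label in examples:
        parts.append(f"Text: {text}\nEntities: {label}")
    # The final block is left open so the model completes the answer.
    parts.append(f"Text: {query}\nEntities:")
    return "\n\n".join(parts)

# Hypothetical labeled examples in an illustrative format.
examples = [
    ("Metformin is used to treat type 2 diabetes.",
     "Metformin [CHEMICAL]; type 2 diabetes [DISEASE]"),
    ("Imatinib is prescribed for chronic myeloid leukemia.",
     "Imatinib [CHEMICAL]; chronic myeloid leukemia [DISEASE]"),
]
instruction = "Extract chemical and disease mentions."
query = "Warfarin is given to patients with atrial fibrillation."

few_shot = build_few_shot_prompt(instruction, examples, query)
zero_shot = build_few_shot_prompt(instruction, [], query)
```

No model weights change between the two prompts; only the conditioning text differs.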

Advanced reasoning and generation

Causal-decoder architectures equipped with chain-of-thought prompting produce step-wise explanations, reconcile contradictory evidence across articles, and generate executable analysis scripts (R, Python, SQL) that chain multiple tools together [37].

Architectural categories

Modern LLMs can be grouped into four architectural families that determine how they are best deployed in biomedical pipelines.

Encoder-only

This architecture uses bidirectional attention over the full input and is optimized for understanding tasks [38]. Its representative models include BERT [38] and BioBERT [39]. Its typical use covers named-entity recognition of chemicals or diseases, sentence-level relation extraction, and embedding biomedical sentences for similarity search.

Encoder–decoder

This architecture has separate stacks for context encoding and token generation and is naturally suited for transformations. Its representative models include T5 [40], BART [41], and mT5 [42]. Its typical use includes medical question-to-SQL generation, summaries of radiology reports, and paraphrasing patient-friendly descriptions of clinical notes.

Decoder-only

This architecture adopts a single left-to-right transformer stream, scales efficiently to hundreds of billions of parameters and excels at open-ended generation. Its representative models include GPT-4 [4], PaLM2 [43], and Gemini [8]. Its typical use involves conversational agents that interpret wet-lab protocols, generate R or Python analysis scripts on demand, or draft patient-specific discharge instructions.

Mixture-of-experts sparse decoders

This architecture uses causal decoding with conditional computation: only a subset of parameters is activated per token, which allows trillion-scale models without a proportional increase in FLOPs. Its representative models include GLaM [44] and Switch Transformer [45]. Its typical use features hospital-scale triage assistants that merge electronic health record (EHR) text, imaging, and genomics by routing each modality to dedicated experts.
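Conditional computation can be sketched with a toy top-k router. This is not the routing code of any cited model: the experts are hypothetical scalar functions, the gate scores are made up, and `k` is a free parameter of the sketch.

```python
import math

def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    m = max(gate_logits[i] for i in chosen)
    exps = {i: math.exp(gate_logits[i] - m) for i in chosen}
    total = sum(exps.values())
    return {i: exps[i] / total for i in chosen}

def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum over the selected experts only; the others never run."""
    weights = top_k_route(gate_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

# Four hypothetical experts; only two are evaluated for this token.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
gate_logits = [0.1, 2.0, -1.0, 2.0]
y = moe_forward(3.0, experts, gate_logits, k=2)
```

Because only `k` experts are evaluated per token, compute grows with `k` rather than with the total number of experts.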

Choosing among these classes involves trading off latency, memory footprint, and downstream accuracy.
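One structural difference between the encoder-style and decoder-style families described above is the attention mask. The sketch below builds both masks for a short sequence; it only illustrates the visibility pattern and is not tied to any particular model implementation.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

bidirectional = attention_mask(4, causal=False)  # encoder-style: full context
left_to_right = attention_mask(4, causal=True)   # decoder-style: past only

# Render the causal mask: "x" marks an allowed position, "." a blocked one.
for row in left_to_right:
    print("".join("x" if allowed else "." for allowed in row))
```

In an encoder–decoder model, the encoder would use the full mask and the decoder the causal one.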

Technical foundations of agents

Core architecture of agents

The architecture of an LLM agent includes four core components: planning, perception, action, and memory. As shown in Fig. 1, these modules allow agents to autonomously decompose tasks, perceive complex environments, execute actions, and accumulate knowledge—enabling efficient biomedical workflow management [23, 26, 46–49].

Figure 1.

Alt text: A core architecture diagram showing four interconnected components: planning, perception, action, and memory. This system processes biomedical workflows from initial input through perception, planning, and action to knowledge utilization.

Core architecture of agents: planning, perception, action, and memory. These components enable autonomous handling of biomedical workflows from input through perception, planning, and action to knowledge utilization.

Planning

Planning breaks down complex biomedical tasks into manageable subtasks and forms execution strategies [47, 49]. For example, CellAgent [50] uses LLMs to interpret instructions and decompose scRNA-seq analysis, while ProtChat [51] automates planning for protein property and interaction tasks.

Action

Action modules execute plans through tool invocation and interaction with the environment [52–54]. DrugAgent [55], for instance, applies DeepPurpose for automated drug-target prediction, iteratively refining outputs.

Perception

Perception collects and processes multimodal data—text, images [56, 57], audio [58, 59], molecules [60, 61], and tables [62]. AutoBA [29] integrates multi-omics data for comprehensive automated analysis.

Memory

In the field of AI agents, memory is regarded as a core component for continual learning and contextual decision-making, and is generally divided into two complementary layers: short-term storage, which maintains the current task context to ensure coherent multi-step reasoning, and long-term storage, which accumulates cross-task experiences, code snippets, or personalized knowledge that can later be retrieved, reused, or fine-tuned to steadily expand the model’s domain capacity [47, 63–66]. EHRAgent [67] stores code snippets to optimize clinical data handling, and RareAgents [68] maintains personalized experience bases for complex biomedical decision support.
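The interplay of the four components can be sketched as a small loop. Everything here is a stand-in: `plan` uses a fixed lookup where a real agent would query an LLM, `act` returns a status string instead of calling a tool, and the step names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Memory:
    short_term: list = field(default_factory=list)  # context of the current task
    long_term: dict = field(default_factory=dict)   # reusable plans across tasks

def perceive(raw_input):
    """Perception: normalize the raw request into a structured observation."""
    return {"task": raw_input.strip().lower()}

def plan(observation):
    """Planning: a fixed lookup stands in for an LLM-generated plan."""
    recipes = {"scrna-seq analysis": ["quality_control", "normalize", "cluster"]}
    return recipes.get(observation["task"], ["ask_user_for_clarification"])

def act(step):
    """Action: stand-in for a tool call; returns a status string."""
    return f"{step}: done"

def run_agent(raw_input, memory):
    observation = perceive(raw_input)
    task = observation["task"]
    # Reuse a stored plan when the same task was solved before.
    steps = memory.long_term.get(task) or plan(observation)
    memory.short_term = []
    for step in steps:
        memory.short_term.append(act(step))
    memory.long_term[task] = steps
    return memory.short_term

memory = Memory()
results = run_agent("scRNA-seq analysis", memory)
```

Short-term memory is reset per task, while long-term memory persists across calls, mirroring the two layers described above.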

Architecture-guided application paradigms

In data-intensive bioinformatics, the application paradigm targets efficient processing of multimodal biological data (scRNA-seq, molecules, genes). Perception integrates multi-omics data and parses molecular features; planning decomposes tasks; action automates toolchains via APIs; memory stores reusable assets to boost reproducibility.

In clinical biomedicine, the application paradigm serves diagnosis, treatment, and drug development. Perception fuses clinical data; planning generates dynamic decision chains; action links clinical equipment; memory supports personalization and interpretability.

Key technologies of agents

Knowledge base and retrieval augmented generation

The development of biological knowledge infrastructures, such as DrugBank [69], PDB [70], and the 1000 Genomes Project [71], provides standardized, searchable resources that underpin intelligent systems. Databases like AlphaFold [72] store over 200 million predicted structures, while search engines such as FoldSeek [73] enable precise structure retrieval by mapping 3D structures to the sequence feature space. However, these systems lack advanced reasoning and dynamic filtering abilities.

As shown in Fig. 2, agents overcome these limitations by integrating domain knowledge bases with retrieval augmented generation (RAG) [74], allowing for dynamic query strategy generation during inference. For example, BioMaster [75] employs multi-agent collaboration, where planning agents use RAG to design workflows, task agents invoke tool knowledge bases, and debug/verify agents ensure correctness. In complex workflows such as Hi-C or RNA-seq analysis, BioMaster has achieved reduced error rates and high task completion. Such architectures, deeply integrating knowledge bases, RAG, and multi-agent division, provide scalable, interpretable solutions for bioinformatics.

Figure 2.

Alt text: A diagram illustrating four key technologies that enhance agent capabilities: RAG for query-based information retrieval, tool use for external tool interaction, multimodal fusion for processing diverse data types (text, image, video, graph, sequence, table), and continual learning through interaction and feedback cycles. These technologies enable agents to handle complex biomedical tasks.

Key technologies of agents. (1) RAG, showing how users’ queries trigger retrieval from data sources, which then combines with the query for response generation. (2) Tool use, illustrating input transformation, external tool invocation, and output generation. (3) Multimodal fusion, listing diverse data types (text, image, video, graph, sequence, table) agents can process for richer cognition. (4) Continual learning, depicting cycles of interaction, feedback, and adaptation, enabling agents to learn continuously. These technologies collectively empower agents to solve complex biomedical tasks.

Nonetheless, challenges persist: fusing multimodal biomedical data, maintaining up-to-date knowledge, and enhancing interpretability remain difficult. Future directions include developing interpretable RAG architectures and lifelong-learning biomedical knowledge graphs.
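The retrieve-then-generate pattern behind RAG can be sketched in a few lines. This is a toy under stated assumptions: a bag-of-words vector replaces a learned embedding model, the three documents are invented, and the final call to an LLM is omitted, so the sketch stops at the assembled prompt.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real system would use a learned encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_rag_prompt(query, documents, k=2):
    """Combine the retrieved passages with the query for the generator."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# Hypothetical knowledge-base entries.
docs = [
    "hi-c analysis starts with read alignment and contact matrix construction",
    "rna-seq analysis quantifies transcript abundance from aligned reads",
    "protein structure search compares three-dimensional folds",
]
top = retrieve("which steps does hi-c analysis need", docs, k=1)
prompt = build_rag_prompt("which steps does hi-c analysis need", docs, k=2)
```

Updating the document list changes what the generator sees without retraining it, which is the property that makes RAG attractive for fast-moving knowledge.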

Tool use

To address diverse tasks, agents leverage a wide range of tools, including APIs [76], search engines [77], ML models [78], biomedical databases [69–71], and medical robots [79].

Advanced frameworks like BioMANIA [27] and AutoBA [29] integrate Python tool APIs and biomedical resources for automated omics analysis, while ClinicalAgent [80] combines GPT-4 with clinical database retrieval for trial prediction. Agents typically invoke these tools by generating specific instruction formats or controlling pre-trained models. KGARevion [81] uses knowledge graph mechanisms to optimize tool invocation, and MEDAGENTS [28] applies multi-agent collaboration to improve tool utilization.

A key challenge for LLM agents lies in integrating various external systems, which often operate in separate data silos. To overcome this, agents increasingly rely on standardized interfaces such as the Model Context Protocol (MCP) [82]. The MCP is a foundational tool that enables seamless interaction between AI models and external resources, allowing agents to overcome data silos and access a wide range of tools and databases. Unlike traditional, task-specific APIs, the MCP provides a universal framework for interoperability, allowing models to securely and efficiently communicate with heterogeneous systems [82]. Overall, the integration of diverse tools significantly broadens the functional scope and problem-solving abilities of biomedical agents. As tool ecosystems and interoperability standards continue to evolve, agents are expected to achieve even greater automation, adaptability, and reliability, further accelerating progress in biomedical research and clinical practice.
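Tool invocation through a generated instruction format can be sketched as a registry plus a dispatcher. The JSON layout (`tool`, `arguments`) and both tools are assumptions of this example; it does not reproduce the MCP wire format or the interface of any framework cited above.

```python
import json

TOOLS = {}

def tool(name):
    """Register a function under a name the model can refer to."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("gc_content")
def gc_content(sequence):
    """Fraction of G and C bases in a DNA sequence."""
    seq = sequence.upper()
    return sum(base in "GC" for base in seq) / len(seq)

@tool("reverse_complement")
def reverse_complement(sequence):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[b] for b in reversed(sequence.upper()))

def dispatch(model_output):
    """Parse a JSON tool call emitted by the model and run the matching tool."""
    call = json.loads(model_output)
    name = call["tool"]
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return {"result": TOOLS[name](**call["arguments"])}

# Hypothetical model output; the schema is an assumption of this sketch.
reply = dispatch('{"tool": "gc_content", "arguments": {"sequence": "ATGC"}}')
```

Returning an error object for unknown tools, rather than raising, lets the agent feed the failure back to the model for a corrected call.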

Multimodal fusion

Multimodal fusion empowers agents with comprehensive cognitive abilities by processing heterogeneous data from sources such as text, images, and gene sequences. Current multimodal fusion techniques can be grouped into three streams: (i) concat fusion merely links features and merges them at the decision layer, offering speed but limited interaction, and is now fading out; (ii) cross-attention uses Query-Key matching to let image and text search each other, providing the current sweet spot between accuracy and computational cost; and (iii) unified embedding compresses every modality into one shared vector space, enabling single-sequence autoregressive reasoning and is regarded as the ultimate route to scalable agents [48].
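The contrast between concatenation and cross-attention can be made concrete with a small sketch. It omits the learned query, key, and value projections that real models use, and the two-dimensional token and patch vectors are invented for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def concat_fusion(text_vec, image_vec):
    """Concatenation: features are joined with no interaction between them."""
    return text_vec + image_vec

def cross_attention(queries, keys, values):
    """Each query (e.g. a text token) takes a weighted average of the values
    (e.g. image patches), weighted by scaled query-key similarity."""
    scale = math.sqrt(len(keys[0]))
    fused = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        fused.append([sum(w * v[d] for w, v in zip(weights, values))
                      for d in range(len(values[0]))])
    return fused

# Hypothetical features: two text tokens and two image patches.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_patches = [[4.0, 0.0], [0.0, 4.0]]
fused = cross_attention(text_tokens, image_patches, image_patches)
```

Each text token ends up dominated by the patch it matches best, which is the interaction that plain concatenation lacks.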

MMedAgent [83] first converts medical images into a handful of learnable query tokens via lightweight cross-attention, then lets the LLM decide which tools to call, stitching visual clues with textual intent into a final answer. Surgical-LLaVA [84] aligns surgical video streams with electronic medical record text via vision-language joint training, enabling instant recognition of critical steps and intraoperative recommendations. Its multimodal fusion delivers high accuracy in real time without extra annotations.

These advances enable agents to integrate evidence from pathology, genomics, and clinical records, thus supporting precision medicine. Ongoing research focuses on standardized multimodal alignment, spatiotemporal data integration for disease prediction, and lightweight models for real-time clinical analysis.

Continual learning

Continual learning enables agents to accumulate knowledge across sequential tasks or time periods without catastrophic forgetting, a crucial capability for ever-evolving biomedical data streams [85].

In drug discovery, MolRL-MGPT [61] retains previously learned chemical scaffolds while generating novel SARS-CoV-2 inhibitors, demonstrating lifelong molecular optimization. MEDCO [86] allows virtual medical students to incrementally refine diagnostic questioning strategies as new cases arrive, mimicking human clinical apprenticeship. These paradigms turn biomedical agents into lifelong learners that evolve alongside real-world data, ensuring up-to-date and personalized healthcare support.

Continual learning is becoming a driving force for next-generation biomedical agents, enabling them to continuously evolve, generalize across tasks, and deliver more intelligent, personalized solutions in complex healthcare and research environments.

Cooperative mode of agents

Single agent

In the biomedical field, LLM-based single-agent systems offer efficient solutions for specialized tasks through centralized decision-making and integration of domain knowledge and tools. For instance, RL-GenRisk [87] combines graph convolutional network feature learning and deep Q-network optimization for risk gene identification in clear cell renal cell carcinoma, improving low-frequency gene identification accuracy by over 40%. RDguru [88] integrates phenotype matching, the Orphanet database, and LLMs for rare disease diagnosis, raising the top-5 diagnosis capture rate to 63.87%. BioDiscoveryAgent [89] leverages LLMs for hypothesis generation and experimental planning, boosting gene perturbation experiment efficiency by 46% compared to the baselines.

Although single agent systems are effective for domain-specific tasks, their limitations become apparent in complex, interdisciplinary, or dynamically changing scenarios. In such cases, multi-agent systems with collaborative reasoning and role division demonstrate stronger adaptability [24, 25].

Multiple agents

As shown in Fig. 3, multi-agent collaboration is defined as a paradigm in which multiple LLM agents work together in a coordinated fashion. Each agent is assigned a specific role and capability, and they communicate and cooperate to collectively solve a complex problem, thereby overcoming the limitations of a single agent [24]. Common architectures include [24, 90]:

Figure 3.

Alt text: A diagram illustrating the three cooperative modes of intelligent agents: single-agent mode for autonomous task execution, multi-agent mode for collaboration among agents, and human-agent mode for synergy between agents and human experts. These modes allow the system to handle biomedical tasks from simple execution to complex cooperation.

Cooperative mode of agents. Single-agent mode shows autonomous task processing; multi-agent mode depicts collaborative patterns among agents; human-agent mode highlights human-machine synergy between agents and experts. These modes enable agents to address biomedical tasks ranging from single-task execution to complex multi-party cooperation.

Centralized architecture. The architecture employs a single coordinator to aggregate expert outputs and return the final decision—simple and accountable, yet vulnerable to a single point of failure. The Virtual Lab [91] uses a centralized multi-agent architecture, with one LLM Principal Investigator as the coordinator to guide LLM scientist agents, integrate their outputs, incorporate human feedback, build a nanobody design pipeline, and design 92 new SARS-CoV-2 nanobodies.

Decentralized architecture. The architecture lets agents negotiate peer-to-peer, offering higher fault-tolerance and parallelism, but demands richer messaging and conflict-resolution protocols. The ProtAgents [92] framework adopts a decentralized multi-agent architecture, where user proxy, planner, assistant, and critic agents interact peer-to-peer under a dynamic Group Chat Manager, each undertaking specific tasks with mutual feedback and error correction for fault tolerance and parallelism.

Hierarchical architecture. The architecture decomposes tasks from top-level goals through mid-level allocation, to low-level execution, thereby balancing scalability and explainability while requiring boundary retuning when new data modalities emerge. ClinicalAgent’s [80] hierarchical architecture decomposes top-level clinical trial tasks into mid-level subproblems and allocates them to specialized agents. Mid-level agents use low-level tools for execution, with the reasoning agent synthesizing results to balance scalability and explainability.

Communication patterns align with the above architectures [93, 94]. In centralized systems, a single hub collects messages from all agents and then broadcasts decisions downward, keeping traffic low and making conflicts easy to detect. Decentralized systems let every agent talk directly to its neighbors; this peer-to-peer exchange improves fault tolerance and parallelism but generates more messages and requires additional consensus steps. Hierarchical systems pass commands downward layer by layer and send summaries back upward, maintaining a clear global view while allowing local nodes to respond quickly. These patterns are consistent with principles from complex network analysis, where spectral methods enable precise characterization of system structure [95, 96] and can inform agent role assignment and interaction design.
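The traffic trade-off among the three patterns can be approximated with a back-of-the-envelope count of messages per decision round. The formulas rest on simplifying assumptions that are not taken from the cited systems: the hub is separate from the agents, decentralized peers are fully connected, and the hierarchy is a tree in which every edge carries one command down and one summary up.

```python
def centralized_messages(n_agents):
    """Each agent reports to the hub, then the hub sends one decision to each."""
    return 2 * n_agents

def decentralized_messages(n_agents):
    """Fully connected peers: every agent messages every other agent once."""
    return n_agents * (n_agents - 1)

def hierarchical_messages(n_leaves, branching):
    """One command down and one summary up per edge of the tree."""
    edges, level = 0, n_leaves
    while level > 1:
        edges += level                     # each node has one edge to its parent
        level = -(-level // branching)     # ceiling division: nodes one level up
    return 2 * edges

for n in (4, 8, 16):
    print(n, centralized_messages(n), decentralized_messages(n),
          hierarchical_messages(n, 2))
```

Under these assumptions, hub traffic grows linearly and peer-to-peer traffic quadratically in the number of agents, with the hierarchy close to the centralized case; sparser neighbor graphs would reduce the decentralized count.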

These patterns enable effective collaboration in complex biomedical scenarios via task decomposition and knowledge complementarity. Key challenges for future research include improving communication, conflict resolution, privacy, and combining adaptive learning and multimodal data integration for precision medicine [97].

Human-agent collaboration

Human-AI collaboration refers broadly to any joint activity involving humans and AI systems. Human-agent collaboration denotes sustained, goal-oriented partnerships in which autonomous or semi-autonomous agents proactively plan, execute, and refine tasks alongside human experts, integrating contextual knowledge with scalable computation to accelerate biomedical workflows.

MEDCO [86] uses a triangular collaboration model (proxy patient, expert doctor, radiologist), where students improve diagnostic accuracy by 23% through interaction with AI agents. The MEDAGENTS framework [28] supports human-machine collaborative reasoning, raising zero-shot reasoning accuracy on the MedQA dataset by 17% over single expert performance. BioMaster [75] demonstrates layered human-agent collaboration in Hi-C analysis, cutting task completion time from 48 to 6 h and reducing the error rate by 37%.

Further advances require research into natural interactive interfaces (e.g. Surgical-LLaVA’s [84] real-time intraoperative collaboration), dynamic trust calibration for human-machine decision weight adjustment, and cross-institutional collaboration using federated learning frameworks like FedMRL [98].

Applications of agents in bioinformatics and biomedicine

As shown in Fig. 4, the integration of LLM agents into bioinformatics and biomedicine has driven transformative advancements across molecular biology, clinical practice, and research. The functions, applications, and limitations of agents are compiled in Table 1. This section highlights representative applications and case studies demonstrating the versatility and impact of LLM agents in these fields.

Figure 4.

Alt text: A comprehensive visualization of agent applications across bioinformatics and biomedicine. Key domains are grouped into Research and Development (Drug Discovery, Genomics, Protein Engineering, Automated Synthesis), Clinical Practice (Decision Support, Text Processing, Imaging Diagnosis), and Medical Support (Education, Inquiry, Psychological Health). The agents leverage multiple knowledge sources, multimodal fusion, and automated reasoning to transform molecular biology, clinical care, and research.

Applications of agents in bioinformatics and biomedicine. Key domains include Drug Discovery and Development, Psychological Health Support, Chemical Automated Synthesis, Genomic and Transcriptomic Analysis, Gene Edit, Clinical Decision Support, Clinical Text Processing, Medical Inquiry and Response, Medical Education, Medical Imaging Assisted Diagnosis, and Protein Engineering and Proteomic Analysis. These applications show how agents use multiple knowledge sources, multimodal fusion, and automated reasoning to advance molecular biology, clinical practice, and research, highlighting their transformative role across the biomedical landscape.

Table 1.

Summary of AI agent systems across key biomedical domains, including their primary applications, core functions, and current limitations. The table is organized into three major categories: drug discovery and development, medical diagnosis and support, and multi-omic analysis. Each agent is evaluated based on its practical utility and existing challenges, highlighting opportunities for future improvement in robustness, generalization, and clinical integration.

| Application | Work | Function | Limitation |
| --- | --- | --- | --- |
| Drug discovery and development | DrugAgent [55] | Predict drug–target interactions | Lack autonomous knowledge updates |
| | MolRL-MGPT [61] | De novo drug design | Data quality limits |
| | SwiftDossier [99] | Auto target dossier | Need expert input |
| | ProtAgents [92] | De novo protein design | Chroma design flaws |
| | BioDiscoveryAgent [89] | Validate drug target efficacy | Tool benefit varies |
| | SAMPLE [100] | Auto protein engineering | Throughput bottleneck |
| | The Virtual Lab [91] | Design nanobodies | Rely on human feedback |
| | ProtChat [51] | Auto protein analysis | Poor model interpretability |
| | TourSynbio [101] | Protein engineering aid | Few case studies |
| | Chemist-X [60] | Recommend reaction conditions | Few reaction types tested |
| | CACTUS [102] | Aid drug discovery | Input type limited |
| | LLM-RDF [103] | Auto chemical synthesis | LLM response unreliability |
| | ChemCrow [104] | Auto chemical synthesis | Weak spatial reasoning |
| | PRIME [105] | Aid protein engineering | No wet-lab validation |
| | DrugPilot [106] | Aid drug discovery | No wet-lab validation |
| Medical diagnosis and support | MMedAgent [83] | Aid medical tasks | Few tasks/modalities |
| | IOMIDS [107] | Ophthalmic AI diagnosis | Poor slit-lamp model |
| | FedMRL [98] | Aid medical imaging | Few dataset validations |
| | Cancer Cell Removing Agent [108] | Aid cancer cell removal | Small dataset size |
| | MAGDA [109] | Guide medical diagnosis | No real-clinic test |
| | MEDCO [86] | Aid medical education | Only virtual student tested |
| | AIPatient [110] | Aid medical education | Verbose responses |
| | Beyond Flashcards [111] | Personalized medical learning | Over-reliance on RAG |
| | Med-PaLM2 [112] | Aid medical QA | Inferior to specialists |
| | KGARevion [81] | Aid biomedical QA | Need more revision rounds |
| | RDguru [88] | Aid rare disease QA | Fixed tool reliance |
| | ASTRID [113] | Evaluate clinical RAG QA | Only single-turn focus |
| | LLM-MedQA [114] | Aid MedQA | High computation cost |
| | Surgical-LLaVA [84] | Aid surgical VQA | No open-ended quant eval |
| | Active Inference Strategy [115] | Aid medical QA | Single LLM reliance |
| | EHRAgent [67] | Aid EHR tabular QA | No low-resource adapt |
| | AI Scribe [116] | Aid clinical documentation | No long-term test |
| | Almanac Copilot [117] | Aid EHR navigation | Hallucination risk |
| | ColaCare [118] | Aid EHR prediction | No continuous learning |
| | Reflexion [119] | Aid agent learning | Local minima risk |
| | MDAgents [120] | Aid medical decision-making | No patient interaction |
| | The work on mitigating cognitive biases [121] | Enhance diagnostic accuracy | Case report bias |
| | KG4Diagnosis [122] | Aid medical diagnosis | No rare disease cover |
| | SurgBox [123] | Support surgery decision | Poor multi-situation handling |
| | The AMC system [124] | Aid depression diagnosis | LLM role-play flaws |
| | Sunnie [125] | Aid well-being support | Short study duration |
| | PsycoLLM [126] | Enhance psychological LLM | LLM-generated data bias |
| Multi-omic analysis | GeneAgent [127] | Aid gene-set analysis | Database scale limits |
| | GenoTEX [128] | Aid gene-trait analysis | Trait preprocessing weak |
| | ChatNT [129] | Aid bio-seq analysis | RNA degradation gap |
| | BioAgents [130] | Aid bioinformatics workflows | Weak complex code generation |
| | Genesis [131] | Automate systems biology | Hardware not scaled |
| | CRISPR-GPT [132] | Aid CRISPR experiment design | Weak rare case handling |
| | CellAgent [50] | Aid scRNA-seq analysis | Narrow evaluation methods |
| | BioMaster [75] | Aid bioinformatics workflows | No new tool auto-integrate |
| | CellAgentChat [133] | Aid cell–cell interaction analysis | Computationally costly |
| | Biomni [134] | Aid biomedical research | Weak clinical judgment |
| | DrBioRight2.0 [135] | Aid cancer proteomics analysis | Weak rare cancer support |
| | AutoBA [29] | Aid multi-omic analysis | Lag in integrating latest tools |
| | BIA [136] | Aid scRNA-seq analysis | Weak dynamic workflow |
| | Agentomics-ML [137] | Aid omics analysis | Narrow task scope |
| | PromptBio [138] | Aid multi-omic analysis | No full agent autonomy |
| | CellForge [139] | Aid virtual cell modeling | High computation cost |

Drug discovery and development

Drug design and discovery

LLM agents have greatly accelerated drug discovery by integrating multi-source knowledge bases and automated reasoning. For example, DrugAgent [55] uses a multi-agent framework with drug–target interaction models, knowledge graphs, and literature search agents to predict new indications for existing drugs, achieving higher accuracy and interpretability than traditional methods. MolRL-MGPT [61] applies reinforcement learning and multi-GPT agents for molecular library generation, performing well on the GuacaMol benchmark and in SARS-CoV-2 inhibitor design. SwiftDossier [99] uses RAG and multi-agent collaboration to automate dossier generation, providing structured reports and improving content accuracy.

Current challenges include reliance on static knowledge bases, which makes it difficult to incorporate dynamic clinical data; a lack of systematic preclinical validation for generated molecules or reports; and unresolved intellectual property and ethical issues. Future work should focus on closed-loop generation-verification systems, stronger multimodal fusion, and transparent decision mechanisms to enhance clinical translation credibility.

Protein engineering

With the development of LLMs and agent frameworks, the use of LLM agents in protein engineering has become a research focus. By leveraging the reasoning and collaboration abilities of multi-agent systems, tools like ProtAgents [92] and BioDiscoveryAgent [89] automate protein attribute prediction and gene perturbation design. SAMPLE [100] and The Virtual Lab [91] couple automated experimental platforms with LLM inference to accelerate experimental validation. ProtChat [51] and TourSynbio [101] further improve protein analysis through multimodal modeling and natural language interfaces.

Challenges remain, such as LLM hallucination leading to inaccurate designs, the need for advanced reasoning and computational efficiency due to the complexity of protein design spaces, and scalability for experimental validation. Combining advanced validation, optimization algorithms, and richer domain knowledge could further enhance LLM agent applications in protein engineering.

Chemical automated synthesis

The combination of LLMs and multi-agent collaboration has advanced chemical synthesis by optimizing reaction conditions, catalyst design, and automation. Chemist-X [60] uses retrieval-augmented generation (RAG) to optimize reaction conditions, reducing chemists’ workload. CACTUS [102] automates molecular property prediction and drug similarity assessment. LLM-RDF [103] delivers end-to-end automation from literature search to experiment. ChemCrow [104] integrates 18 expert tools, allowing GPT-4 to autonomously plan organic synthesis, catalyst screening, and new chromophore discovery, lowering barriers for non-professional chemists. The Virtual Lab [91] demonstrates the design of nanobodies targeting SARS-CoV-2 variants via multi-agent collaboration.

However, issues such as LLM hallucination, multi-step reasoning requirements, long-term memory support, and scalability for real-world applications persist. Future directions include developing stronger validation mechanisms, optimizing reasoning and adaptability, and supporting broader chemical and industrial applications.

Medical diagnosis and support

Medical imaging-assisted diagnosis

Medical imaging-assisted diagnosis plays a key role in early disease detection, treatment planning, and prognosis. Recently, LLM agents have shown strong potential by integrating multimodal data and domain knowledge to aid clinicians.

For example, MMedAgent [83] enhances efficiency and accuracy in analyzing multiple imaging modalities (MRI, CT, X-ray) and tasks (visual Q&A, segmentation, report generation). IOMIDS [107] demonstrates multimodal learning in ophthalmic disease diagnosis, enabling AI chatbots to provide quality recommendations by fusing images and text. FedMRL [98] applies federated multi-agent reinforcement learning to overcome data heterogeneity, improving model performance across diverse distributions. Cancer cell removing agent [108] and MAGDA [109] further illustrate LLM agents’ application in image-guided therapy and diagnostic support.

Despite progress, challenges remain. LLM agents may generate inaccurate results due to limited data quality or domain gaps, particularly when handling complex imaging scenarios. Current systems also struggle to fully integrate multimodal data, especially across images, text, and clinical records, and their adaptability needs improvement. Clinical integration and user experience require further optimization for practical use. Future research should focus on incorporating high-quality domain knowledge graphs, creating unified multimodal frameworks for robust analysis across data types, and optimizing clinical workflows for seamless adoption.

Medical education

LLM agents are transforming medical education with immersive and efficient learning.

Systems like MEDCO [86] use multi-agent frameworks to simulate realistic clinical environments, fostering diagnostic and communication skills through collaboration among proxy patients, medical experts, and radiologists. AIPatient [110] integrates electronic health records and reasoning-augmented generation to provide advanced simulated patient systems, supporting both education and model assessment. Beyond flashcards [111] leverages LLMs for personalized study support, using contextual learning and prompt engineering to optimize exam preparation.

Nevertheless, current systems face limitations: virtual agents’ learning abilities still lag behind real students, and available datasets are often small. Broader, multimodal collaborative datasets are needed to unlock full potential. Feedback is mostly text-based; future systems should offer richer, multimodal feedback (e.g. imaging examples) to strengthen diagnostic skills. Expanding multidisciplinary collaboration and complex simulation scenarios will better reflect real medical environments. With these advances, LLM agents are poised to become comprehensive assistants for medical education, enhancing training for future clinicians.

Medical inquiry and response

Medical question answering is essential for improving healthcare quality and efficiency. With the evolution of LLMs, agents now offer increasingly accurate and efficient support for professionals and patients.

Med-PaLM2 [112] refines long-form medical answers and real-world workflows via improved base LLMs, medical fine-tuning, and retrieval-chain reasoning, scoring 86.5% on MedQA and being preferred by doctors over physician answers in eight of nine clinical axes. KGAREVION [81] combines knowledge graphs and semantic reasoning to extract relevant data and generate detailed answers. RDguru [88] focuses on rare diseases, integrating authoritative knowledge and multi-source data to provide differential diagnosis and treatment suggestions, boosting diagnostic accuracy. ASTRID [113] introduces an automated evaluation framework for clinical Q&A, ensuring response reliability. LLM-MedQA [114] demonstrates strong performance on complex queries, while Surgical-LLaVA [84] supports multimodal Q&A in surgical contexts. Active Inference Strategy [115] improves answer reliability through active reasoning, and EHRAgent [67] enables code-enhanced inference on electronic health records, boosting data processing in medical Q&A.

Challenges remain: LLMs sometimes produce inaccurate answers due to incomplete knowledge or data quality, especially for rare or complex topics. Multimodal fusion remains limited, and adaptability and scalability require further enhancement. Improving reasoning via high-quality domain knowledge graphs, unified multimodal frameworks, and smoother clinical integration will help agents better meet practical medical needs.

Clinical text processing

Clinical text processing is vital for efficient, high-quality healthcare delivery. LLM agents provide clinicians with more accurate and timely support.

For documentation, AI Scribe [116] transcribes doctor-patient conversations in real time, converting unstructured data into structured electronic health records and reducing manual workload. Almanac Copilot [117] can retrieve information, query records, and manage orders within electronic health records, streamlining clinical workflows. ColaCare [118] combines expert models and LLMs to improve modeling and prediction, especially for high-risk scenarios. EHRAgent [67] uses code interfaces for automated code generation and multi-table inference, enhancing complex clinical question answering. Agentic LLM Workflow [119] applies iterative self-reflection to refine LLM outputs, improving the quality and readability of clinical reports and reducing the need for manual editing.
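
The iterative self-reflection pattern described above can be sketched as a draft, critique, revise loop. This is a minimal illustration under stated assumptions, not the implementation of Agentic LLM Workflow [119]: `refine_report`, `toy_critique`, and `toy_revise` are hypothetical names, and the two toy functions stand in for what would be LLM calls in a real system.

```python
from typing import Callable, List, Tuple


def refine_report(
    draft: str,
    critique: Callable[[str], List[str]],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Repeat critique -> revise until no issues remain or the budget is spent.

    Returns the final text and the number of revision rounds performed.
    """
    rounds = 0
    for _ in range(max_rounds):
        issues = critique(draft)
        if not issues:  # the reviewer found nothing left to fix
            break
        draft = revise(draft, issues)
        rounds += 1
    return draft, rounds


# Toy stand-ins for LLM calls (hypothetical): flag unexpanded abbreviations.
ABBREVIATIONS = {"SOB": "shortness of breath", "HTN": "hypertension"}


def toy_critique(text: str) -> List[str]:
    words = text.split()
    return [abbr for abbr in ABBREVIATIONS if abbr in words]


def toy_revise(text: str, issues: List[str]) -> str:
    # Expand only the first flagged abbreviation in each round.
    abbr = issues[0]
    return " ".join(ABBREVIATIONS[abbr] if w == abbr else w for w in text.split())


final, n_rounds = refine_report(
    "Patient reports SOB and HTN", toy_critique, toy_revise
)
```

The `max_rounds` budget is the main design choice: it bounds cost and prevents the loop from cycling when the critic keeps raising new issues.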

Yet, challenges persist: LLMs may still generate errors in complex reports due to data or knowledge gaps, and current systems have limited ability to merge text, images, and clinical data. Future efforts should focus on integrating high-quality domain knowledge, unified multimodal frameworks, and improving adaptability for diverse real-world needs, ultimately boosting care quality and patient outcomes.

Clinical decision support

Clinical decision support is crucial for improving diagnostic accuracy, optimizing treatment plans, and enhancing patient outcomes. LLM agents have demonstrated great promise in this area, providing medical professionals with efficient and accurate decision tools.

For instance, MDAgents [120] uses a multi-agent framework to simulate real-world medical decision-making, with notable improvements in multimodal reasoning and complex tasks compared to baseline methods. The work on mitigating cognitive biases [121] shows multi-agent systems can reduce clinical biases via simulated collaborative discussions, significantly improving diagnostic accuracy. KG4Diagnosis [122] leverages knowledge graphs to enhance reasoning in complex cases, offering more reliable clinical support. Similarly, SurgBox [123] and MAGDA [109] showcase the potential of these agents in surgical assistance and collaborative diagnosis.

Future directions include introducing high-quality domain knowledge graphs to reduce errors, developing unified multimodal frameworks for robust analysis across data types, and optimizing feedback mechanisms for richer diagnostic support. With these advances, LLM agents are expected to become indispensable assistants for clinical decision support.

Psychological health support

Mental health concerns such as depression are increasingly significant in society. With artificial intelligence, agent-based assistive systems offer promising tools for psychological assessment, diagnosis, and intervention.

By simulating expert–patient interactions, intelligent agents can efficiently assess and diagnose psychological states. The AMC system [124] uses multi-level memory and dialog control to simulate psychiatrist evaluation of depression and suicide risks, improving accuracy and efficiency and achieving a performance gain of up to 6.8% on depression diagnosis in challenging test scenarios. Such systems aid both clinical and remote mental health services. Personified dialog agents like Sunnie [125] significantly enhance user acceptance, improving perceptions of relational warmth by over 18% compared to a non-anthropomorphic baseline. PsycoLLM [126] significantly improves its understanding of complex psychological states through specialized training: it is the only model to achieve over 60% average standard accuracy on the benchmark’s multiple-choice questions (MCQs), and it demonstrates superior performance on key metrics for case-based analysis.

In summary, agents are advancing mental health support across diagnosis, personalized intervention, and safety. Ongoing optimization of algorithms for clinical relevance and safety, along with ethical regulation, will further improve reliability and promote sustainable development. The integration of these technologies will enable safer, more accessible, and effective mental health services.

Multi-omic analysis

Genomic analysis

The explosion of genomic data is transforming bioinformatics. LLM agents, combining natural language understanding with domain knowledge, are changing analysis paradigms.

GeneAgent [127] integrates self-verification and domain databases, reducing hallucinations and improving gene set function discovery accuracy. GenoTEX [128] provides a benchmark for evaluating gene expression analysis, supporting AI–bioinformatics collaboration. ChatNT [129] is a multimodal agent for DNA, RNA, and protein tasks, showing strong sequence analysis capabilities. BioAgents [130] lowers barriers for non-experts via multi-agent collaboration, and Genesis [131] automates systems biology from hypothesis generation to experiment design.

Remaining challenges include LLM hallucinations or logical errors in complex genomics, insufficient multimodal fusion (especially combining gene, protein, and clinical data), and limited scalability. Improvements should focus on knowledge graph integration, unified multimodal frameworks, and reinforcement learning for better task planning, making these agents powerful accelerators for genomics research.

Gene editing

LLM agents lower the technical threshold for experimental design in gene editing by integrating natural language understanding and domain expertise. For instance, CRISPR-GPT [132] automates CRISPR-based experiment design, supporting a range of tasks (knockout, epigenetic editing, prime and base editing), and integrates with lab automation for end-to-end workflows. It recommends systems, delivery methods, designs sgRNAs, predicts off-target effects, and plans validation.

However, current applications remain limited. LLMs may still hallucinate, especially without experimental validation. CRISPR-GPT cannot yet generate complete constructs from natural language, and ethical and regulatory issues persist. Future directions include integrating knowledge graphs for better reasoning, developing robust validation mechanisms, and improving user interfaces and lab automation for broader application scenarios.

Transcriptomic analysis

The rapid development of transcriptomics is shifting RNA analysis paradigms, with LLM agents offering new tools for deciphering complex biological systems.

CellAgent [50] uses a multi-agent framework to automate single-cell RNA sequencing analysis from preprocessing to differential expression. BioMaster [75] achieves end-to-end automation in RNA-seq bioinformatics, lowering the technical barrier for users. CellAgentChat [133] enables analysis of cell signaling pathways through agent-based modeling, while recent work [140] shows LLM agents can annotate cell types without reference databases, reducing expertise and time demands. Biomni [134] integrates retrieval-augmented planning with code-level execution to autonomously perform differential expression, pathway enrichment and causal gene prioritization directly from bulk or single-cell RNA sequence data, outputting experimentally testable protocols without task-specific prompting.

Despite this progress, challenges include hallucinations or errors in complex RNA tasks, limited multimodal fusion (combining transcriptome, proteome, and clinical data), and generalizability. Integrating knowledge graphs, unified multimodal frameworks, and reinforcement learning for task optimization will further enhance the efficiency and robustness of transcriptomic analysis.

Proteomic analysis

Proteomics is also being revolutionized by LLM agents, which provide advanced tools for analyzing complex biological systems.

For example, DrBioRight2.0 [135] integrates large-scale cancer proteome data and natural language interfaces, enabling users to perform sophisticated analyses—such as heatmap generation, survival analysis, and cross-omics association—via simple instructions. AutoBA [29] demonstrates LLM agents’ capacity for multi-omics data integration and automated analytic workflows. The BioInformatics Agent (BIA) [136] streamlines complex proteomic analyses for non-experts through strong reasoning capabilities.

Future research should focus on knowledge graph integration, multimodal frameworks, and reinforcement learning to enhance reasoning, robustness, and efficiency. These improvements will help accelerate insight generation, supporting disease mechanism research and therapeutic discovery.

Impact of omic data characteristics on agent collaboration

Omic layers differ fundamentally across several critical dimensions, resulting in systemic operational conflicts when utilized in multi-agent workflows. Key sources of heterogeneity include [141–145]:

Time-scale heterogeneity. The vast discrepancy in data-collection granularity, such as static genomic data versus highly dynamic metabolomic data, causes gradient mis-synchrony in sequential decision processes.

Measurement error heterogeneity. Differences in data quality across platforms, where certain modalities exhibit higher noise levels, lead to voting bias when agents aggregate decisions.

Privacy class heterogeneity. Differences in data sensitivity and regulatory classifications introduce compliance conflicts due to non-uniform consent and patient protection requirements.

Data sparsity heterogeneity. Uneven data completeness arising from detection limits across omic layers results in imbalanced feature contributions and misaligned feature spaces.

Knowledge representation heterogeneity. Inconsistent nomenclature and ontology standards hinder knowledge sharing and cause semantic alignment conflicts among agents.
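
The voting bias caused by measurement error heterogeneity can be made concrete with a small numerical sketch. It compares a plain average of per-modality scores with an inverse-variance weighted average, one common way to down-weight noisy sources. This is an assumption chosen for illustration, not the method of any system cited here; `aggregate_scores` and the example numbers are hypothetical.

```python
from typing import Dict, Tuple


def aggregate_scores(
    estimates: Dict[str, Tuple[float, float]]
) -> Tuple[float, float]:
    """Combine per-modality scores given as (score, noise_variance).

    Returns (unweighted_mean, inverse_variance_weighted_mean).
    """
    scores = [score for score, _ in estimates.values()]
    plain = sum(scores) / len(scores)

    # Each agent's weight is the reciprocal of its noise variance.
    weights = {name: 1.0 / var for name, (_, var) in estimates.items()}
    total = sum(weights.values())
    weighted = sum(weights[name] * score for name, (score, _) in estimates.items()) / total
    return plain, weighted


# Hypothetical agent outputs: probability that a gene is disease-relevant.
estimates = {
    "genomics": (0.80, 0.01),
    "transcriptomics": (0.70, 0.04),
    "proteomics": (0.20, 0.25),  # noisier platform in this toy example
}
plain, weighted = aggregate_scores(estimates)
```

In this toy case the noisy proteomic score pulls the plain average well below the two low-noise estimates, while the weighted average stays close to them.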

The methodological foundations for addressing these multi-faceted challenges in omics data processing are rooted in traditional bioinformatics techniques. For instance, the jMF2D algorithm [146] employs a dual-network model to integrate diverse omics data through cell-type similarity networks, while the DrNMF framework [147] applies regularized nonnegative matrix factorization and temporal smoothness constraints to mitigate data heterogeneity.

Agentomics-ML [137] further adopts per-layer uncertainty weighting to prevent noisy proteomic signals from dominating genomic decisions, and PromptBio [138] and DrBioRight2.0 [135] demonstrate that modality-specialized agents connected through a unified prompt interface can assemble reproducible multi-omics pipelines. A recent review [148] concludes that embedding temporal resolution, confidence and GDPR tags as first-class metadata could transform current modality gaps into complementary strengths, paving the way for scalable, interpretable, and legally compliant agent ecosystems.
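
One way to read "first-class metadata" is that every record an agent receives carries its temporal resolution, confidence, and privacy tag alongside the data, so that routing decisions can be made mechanically. The sketch below is a hypothetical schema for illustration only: `OmicRecord`, `PrivacyClass`, and `visible_to` are invented names, and the three privacy tiers are placeholders rather than legal GDPR categories.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class PrivacyClass(IntEnum):
    """Hypothetical sensitivity tiers, ordered from least to most sensitive."""
    OPEN = 0
    PSEUDONYMIZED = 1
    IDENTIFIABLE = 2


@dataclass(frozen=True)
class OmicRecord:
    layer: str                  # e.g. "genomics", "metabolomics"
    sampling_interval_h: float  # temporal resolution in hours; inf for static data
    confidence: float           # 0..1, reported by the producing agent
    privacy: PrivacyClass


def visible_to(
    records: List[OmicRecord],
    clearance: PrivacyClass,
    min_confidence: float = 0.5,
) -> List[OmicRecord]:
    """Keep records an agent with the given clearance may see and can trust."""
    return [
        r for r in records
        if r.privacy <= clearance and r.confidence >= min_confidence
    ]


records = [
    OmicRecord("genomics", float("inf"), 0.99, PrivacyClass.IDENTIFIABLE),
    OmicRecord("metabolomics", 0.5, 0.70, PrivacyClass.PSEUDONYMIZED),
    OmicRecord("proteomics", 24.0, 0.40, PrivacyClass.PSEUDONYMIZED),
]
shared = visible_to(records, PrivacyClass.PSEUDONYMIZED)
```

Here the genomic record is withheld on privacy grounds and the proteomic record on confidence grounds, without either agent needing to inspect the underlying data.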

Challenges and future directions

The integration of LLM agents into bioinformatics and biomedicine, while transformative, brings significant technical, practical, and ethical challenges. Addressing these is essential for safe, effective, and equitable deployment.

Challenges

Data privacy and security

The widespread use of LLM agents in biomedicine raises critical concerns about data privacy and security. Biomedical data often contain sensitive information, such as gene sequences and disease histories, where leaks can seriously infringe on patient privacy. Traditional encryption and access control are insufficient against complex threats like model inversion attacks, where attackers deduce training data from outputs [149]. LLM agents can also be vulnerable to adversarial manipulation, potentially leading to erroneous results that impact clinical decisions.

Regulations such as HIPAA require strict data protection across collection, storage, and use, but many current models struggle to fully comply [150]. New threats like prompt injection attacks exploit LLM-specific vulnerabilities, further increasing risks [151].

Addressing these issues requires a multi-pronged approach: adopting technical safeguards like differential privacy and federated learning, strengthening regulatory compliance, and implementing comprehensive security monitoring and assessment in industry practices.
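
As an illustration of differential privacy, the sketch below applies the standard Laplace mechanism to a counting query: noise with scale equal to sensitivity divided by epsilon is added to the true count. The function names, the patient ages, and the choice of epsilon are assumptions for illustration, and a production system would need a vetted library and privacy budget accounting.

```python
import random
from typing import Callable, Sequence


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(
    values: Sequence[float],
    predicate: Callable[[float], bool],
    epsilon: float,
    rng: random.Random,
) -> float:
    """Return an epsilon-differentially private count of matching values."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0  # adding or removing one patient changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical cohort: how many patients are aged 65 or older?
rng = random.Random(0)
ages = [34, 51, 67, 72, 45, 80, 29, 66]
noisy = private_count(ages, lambda a: a >= 65, epsilon=1.0, rng=rng)
```

Smaller values of `epsilon` give stronger privacy but noisier answers, which is the trade-off any deployment has to tune.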

Model hallucination and interpretability

Model hallucination is a phenomenon where LLM agents generate plausible but incorrect or fabricated content, which can be especially harmful in biomedicine. For example, hallucinated diagnostic suggestions can mislead clinical decisions, and errors in drug development may waste resources and misguide research.

Fine-tuning LLMs can sometimes increase hallucinations [152], particularly when the training data are biased or incomplete [153]. This is problematic in rare diseases or controversial research areas. Solutions include self-alignment [154], real-space editing [155], black-box detection [156], contrastive decoding [157], retrieval-enhanced revision [158], and verification chains [159].

Interpretability is another major challenge. The complex, opaque architecture of LLM agents makes their decision-making process difficult to explain [160]. This undermines trust, especially in drug development or clinical applications where transparency is critical. Visualizing neural activity, developing hybrid rule-based and neural models, and improving interpretability tools can help, but further advances are needed for safe and effective application in biomedicine.

Updating and timeliness of knowledge

Biomedical knowledge evolves rapidly, requiring LLM agents to keep pace with new discoveries, treatments, and guidelines. However, updating and maintaining timely knowledge is challenging due to the complexity and diversity of multimodal biomedical data.

Integrating new information, especially across text, images, and genetic data, is difficult given varying data characteristics and update frequencies. For example, rapid advances in genetic and clinical data often require cross-modal fusion that current LLM agents struggle to achieve [48]. Timeliness is critical, as outdated recommendations can lead to misdiagnosis or mistreatment. Many current LLM agents lack feedback mechanisms and rely on resource-intensive retraining, which is slow and may cause knowledge forgetting [161].

Recent solutions include dynamic agent networks for rapid knowledge sharing [162], policy-level reflection for adaptive learning [163], and automatic learning mechanisms [164]. Future directions involve lifelong learning systems [165], enhanced multimodal processing, and improved agent collaboration. Strengthening evaluation and supervision of knowledge updates is also essential for ensuring accuracy and reliability in biomedical applications.

Limitations of agent architecture

LLM agents generally use a modular architecture, including memory, planning, and action modules. The interactions among these modules are complex [166]. For example, the memory module must share data with the planning module for reasonable action steps, and the action module must accurately execute plans. Poor interface design or weak communication can cause information flow problems, data inconsistency, and reduced system performance and stability. This complexity raises the risk of system errors and complicates maintenance and upgrades [167–169]. To address this, agents should employ clear module interfaces and communication protocols, and adopt loosely coupled designs to boost flexibility and scalability.
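
A loosely coupled design of this kind can be sketched with structural interfaces: the agent depends only on what each module can do, not on how it is built. The example below is a minimal sketch, assuming three hypothetical interfaces (`Memory`, `Planner`, `Executor`) and toy implementations; it does not reproduce the architecture of any cited framework.

```python
from typing import List, Protocol


class Memory(Protocol):
    def recall(self, query: str) -> List[str]: ...
    def store(self, item: str) -> None: ...


class Planner(Protocol):
    def plan(self, goal: str, context: List[str]) -> List[str]: ...


class Executor(Protocol):
    def run(self, step: str) -> str: ...


class Agent:
    """Wires the three modules together through their interfaces only."""

    def __init__(self, memory: Memory, planner: Planner, executor: Executor) -> None:
        self.memory = memory
        self.planner = planner
        self.executor = executor

    def solve(self, goal: str) -> List[str]:
        context = self.memory.recall(goal)          # memory -> planning
        outcomes = []
        for step in self.planner.plan(goal, context):
            outcome = self.executor.run(step)       # planning -> action
            self.memory.store(f"{step} -> {outcome}")  # action -> memory
            outcomes.append(outcome)
        return outcomes


# Toy implementations (hypothetical); any object with matching methods works.
class ListMemory:
    def __init__(self) -> None:
        self.items: List[str] = []

    def recall(self, query: str) -> List[str]:
        return list(self.items)

    def store(self, item: str) -> None:
        self.items.append(item)


class FixedPlanner:
    def plan(self, goal: str, context: List[str]) -> List[str]:
        return [f"retrieve data for {goal}", f"summarize {goal}"]


class EchoExecutor:
    def run(self, step: str) -> str:
        return f"done: {step}"


agent = Agent(ListMemory(), FixedPlanner(), EchoExecutor())
outcomes = agent.solve("variant annotation")
```

Because `Agent` never imports a concrete module, a planner or memory backend can be replaced or upgraded without touching the other two.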

In multi-agent systems, coordinating multiple agents for complex tasks is another challenge [90, 170]. For instance, in a diagnostic system, distinct agents may analyze symptoms, query medical records, or interpret images. Without effective coordination, conflicts, redundant work, or poor information sharing can arise, reducing system performance [24, 171]. Recent solutions, such as IoA [172], introduce flexible integration protocols and dynamic mechanisms to improve collaboration among heterogeneous agents. The Advantage Alignment algorithm [173] simplifies interest alignment in multi-agent reinforcement learning, improving outcomes in cooperative-competitive scenarios.

Memory mechanisms, typically split into short-term and long-term memory, also have limitations. Short-term memory can be overloaded, while long-term memory may be slow to retrieve [65]. Inspired by operating system memory hierarchies, techniques like virtual context management and retrieval-based long-term memory (e.g. MemoryBank [66]) help expand usable memory and improve retrieval. The Reflexion framework [119] offers episodic memory, using natural language summaries for feedback and future decisions, improving adaptability and memory efficiency.
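
The operating-system analogy can be illustrated with a two-tier store: a small short-term buffer whose oldest entries are paged out to a long-term store, from which relevant items are retrieved on demand. This is a simplified sketch under stated assumptions, not the algorithm of MemoryBank [66] or any other cited system; `TieredMemory` is a hypothetical name and the word-overlap score stands in for the embedding similarity a real system would use.

```python
from collections import deque
from typing import Deque, List


class TieredMemory:
    def __init__(self, short_capacity: int = 3) -> None:
        self.capacity = short_capacity
        self.short: Deque[str] = deque()
        self.long: List[str] = []

    def add(self, message: str) -> None:
        """Append to short-term memory, paging the oldest items out when full."""
        self.short.append(message)
        while len(self.short) > self.capacity:
            self.long.append(self.short.popleft())

    def retrieve(self, query: str, k: int = 1) -> List[str]:
        """Return up to k long-term items sharing the most words with the query."""
        q = set(query.lower().split())
        scored = [
            (len(q & set(m.lower().split())), i, m)
            for i, m in enumerate(self.long)
        ]
        hits = [t for t in scored if t[0] > 0]
        hits.sort(key=lambda t: (-t[0], -t[1]))  # most overlap, then most recent
        return [m for _, _, m in hits[:k]]

    def context(self, query: str, k: int = 1) -> List[str]:
        """Assemble the working context: retrieved items plus recent turns."""
        return self.retrieve(query, k) + list(self.short)


memory = TieredMemory(short_capacity=3)
for turn in [
    "patient allergic to penicillin",
    "blood pressure 120/80",
    "ordered chest x-ray",
    "x-ray shows no infiltrate",
    "discussing antibiotic options",
]:
    memory.add(turn)

ctx = memory.context("is the patient allergic to anything")
```

The short-term buffer stays bounded, so the prompt does not grow without limit, while an early but important fact (the allergy) can still be brought back when a query makes it relevant.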

Traditionally, reasoning and action are separated: the reasoning module devises strategies, while the action module executes them. This separation can slow response and reduce adaptability in dynamic environments [174]. AutoCoA integrates action generation into reasoning, allowing seamless switching between reasoning and action and reducing interaction costs. AFlow [175] uses Monte Carlo Tree Search for dynamic workflow optimization, especially in complex multi-step tasks.

Ethical and legal risks

While LLM agents offer opportunities in biomedicine, they also introduce serious ethical and legal risks. Privacy protection is paramount [176–178] because patient records and genetic data are highly sensitive. These data not only describe personal health status but also carry genetic information about family members, so a single privacy breach may affect multiple individuals. LLM agents rely on vast amounts of data, making every stage (collection, storage, and transmission) potentially risky [179]. Inadequate consent or weak security may result in privacy breaches and harm patient rights [180, 181]. For instance, failing to clearly inform patients about how LLM agents will use their data violates the ethical principle of autonomous decision-making in medicine, and resulting leaks may lead to secondary harms such as identity theft or discrimination in insurance coverage.

Algorithmic bias is another concern [182, 183], and its ethical and legal implications extend beyond unfair treatment. Bias in training data can cause unfair diagnoses and treatment recommendations, disadvantaging underrepresented groups [184–186]. A comprehensive vignette study [187] showed that simply altering the race or gender prompt of an otherwise identical case significantly changed analgesic prescribing and imaging-referral probabilities generated by GPT-4 and other clinically tuned models. Similarly, a study [188] found that GPT-4 consistently ranked stigmatizing diagnoses higher for minority men and psychiatric diagnoses higher for women when only demographic identifiers were modified. If a hospital implements an LLM agent whose biased recommendations delay or withhold appropriate care for under-represented groups, the institution may face civil litigation and regulatory penalties for providing unequal services.
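
The vignette design described above, in which only demographic identifiers are varied, suggests a simple counterfactual audit. The sketch below is an illustrative assumption rather than the protocol of the cited studies [187, 188]: `counterfactual_audit`, `max_disparity`, and the deliberately biased `biased_stub` (a stand-in for a model call) are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def counterfactual_audit(
    template: str,
    attributes: Dict[str, List[str]],
    model: Callable[[str], float],
) -> Dict[Tuple[str, ...], float]:
    """Score the same vignette under every combination of demographic attributes."""
    keys = list(attributes)
    results = {}
    for combo in product(*(attributes[k] for k in keys)):
        prompt = template.format(**dict(zip(keys, combo)))
        results[combo] = model(prompt)
    return results


def max_disparity(results: Dict[Tuple[str, ...], float]) -> float:
    """Largest gap between any two demographic variants of the same case."""
    values = list(results.values())
    return max(values) - min(values)


# Deliberately biased stand-in for a model that returns the probability
# of recommending an analgesic (hypothetical, for demonstration only).
def biased_stub(prompt: str) -> float:
    return 0.5 if " woman " in prompt else 0.7


template = "A 45-year-old {race} {gender} presents with acute abdominal pain rated 8/10."
attributes = {"race": ["Black", "White"], "gender": ["man", "woman"]}

results = counterfactual_audit(template, attributes, biased_stub)
disparity = max_disparity(results)
```

Since the clinical content is identical across variants, any nonzero disparity is attributable to the demographic terms alone, which is what makes this design useful for auditing.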

Legal responsibility is also unclear, and this ambiguity is further complicated by the “black box” nature of LLM agents. When LLM agent advice leads to harm, it is difficult to assign blame among technology developers, medical institutions, and the agents themselves [20, 189]. Developers may argue outputs are for reference only, while institutions may claim compliance with adoption procedures, and LLM agents lack legal status to bear responsibility independently. This ambiguity complicates compensation and may hinder the adoption of AI in healthcare [190].

Addressing these risks requires improved legal frameworks, stronger regulations, and better technical and ethical standards: legally, define responsibility boundaries of stakeholders; technically, promote privacy-enhancing technologies and algorithm auditing; ethically, establish multi-stakeholder consultation mechanisms. Only through such a comprehensive approach can patient rights be protected and the safe, compliant, and ethical development of LLM agents in biomedicine be ensured [191].

Future directions

Human-AI collaboration paradigms

Several human-AI collaboration models are emerging in biomedicine. In the complementary model, humans and AI each contribute distinct strengths: AI algorithms powered by LLM agents can rapidly analyze data and detect patterns, while clinicians add contextual expertise and holistic judgment [192, 193]. For example, AI may handle initial medical image screening, allowing experts to focus on final assessments, reducing workload and improving efficiency [192].

The iterative co-creation model is also gaining traction. In drug discovery, LLM agents can generate hypotheses, which researchers iteratively refine, resulting in the co-creation of new knowledge [33]. This back-and-forth enhances both creativity and accuracy.

However, challenges persist. Effective collaboration requires careful task allocation, robust communication, and trust in AI outputs [194]. User proficiency with AI tools and the capability of AI agents also significantly affect outcomes [195]. Future directions include developing more natural language interfaces and improving explainability to support trust and usability, as well as creating evaluation frameworks that consider both human and AI factors [196].

Embodied AI

The concept of embodied AI extends the capabilities of LLM agents from the purely digital or virtual realm to real-time interaction with physical or simulated environments. An embodied AI system is not only capable of abstract reasoning but can also perceive its surroundings and take action based on feedback [197, 198]. In clinical interventions, surgical robots have enabled autonomous completion of complex procedures, boosting surgical precision and safety [199, 200]. Embodied AI also supports disease diagnosis, rehabilitation, and daily care, such as medical image analysis, personalized training, and patient companionship [201–206]. In research, robotic chemists accelerate drug discovery [207, 208].

Looking ahead, embodied AI is expected to gain stronger cognitive and decision abilities and more precise operations with advances in algorithms and hardware [197, 209]. However, ethical and data security issues must be addressed [210–214]. Overall, embodied AI is poised to transform both biomedical research and clinical practice, despite ongoing challenges.

Open-source ecosystems and standardization

Open-source initiatives have accelerated the development of LLM agents in biomedicine by enabling sharing of code, models, and data. Tools like BioDiscoveryAgent [89] and BioInformatics Agent (BIA) [136] exemplify this collaborative spirit, allowing researchers to collectively improve and adapt agents for genetic experiment design and bioinformatics workflows. Open-source ecosystems lower barriers for resource-limited researchers, promote transparency via peer review, and foster rapid innovation through community contributions.

As adoption grows, standardization becomes increasingly important for quality, safety, and interoperability [215]. This includes standardizing interfaces for agent communication, developing common data formats for easier integration, and creating robust platforms for deployment and evaluation [216]. Standardized evaluation metrics are also needed for fair performance comparison [217].
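
A standardized interface for agent communication could be as simple as a versioned message envelope that every agent can serialize and validate. The schema below is a hypothetical sketch for illustration and does not correspond to any published standard; the field names and the `AgentMessage` class are assumptions.

```python
import json
from dataclasses import asdict, dataclass

REQUIRED_FIELDS = {"sender", "recipient", "intent", "payload"}


@dataclass
class AgentMessage:
    sender: str
    recipient: str
    intent: str            # e.g. "findings", "request", "error"
    payload: dict
    schema_version: str = "0.1"

    def to_json(self) -> str:
        """Serialize with sorted keys so identical messages give identical text."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "AgentMessage":
        """Parse a message and reject it if required fields are absent."""
        data = json.loads(raw)
        missing = REQUIRED_FIELDS - set(data)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        return cls(**data)


msg = AgentMessage(
    sender="imaging_agent",
    recipient="report_agent",
    intent="findings",
    payload={"modality": "CT", "nodule_mm": 6},
)
round_tripped = AgentMessage.from_json(msg.to_json())
```

Carrying an explicit `schema_version` lets agents built by different groups detect incompatibilities instead of silently misreading each other's messages.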

In summary, open-source collaboration and standardization are key to the reliable, scalable, and impactful integration of LLM agents into biomedical research and clinical care, ensuring that these technologies are transparent, interoperable, and widely accessible.

Conclusion

The integration of LLM agents into bioinformatics and biomedicine represents a transformative milestone, driving interdisciplinary innovation. This review has systematically explored their technical foundations, core technologies, collaboration models, and varied applications spanning drug discovery, protein engineering, clinical decision support, medical education, and multi-omics analysis. LLM agents have evolved from simple tools into powerful enablers, reshaping research, accelerating discovery, and optimizing clinical workflows.

However, significant challenges persist. Issues of data privacy and security, model hallucination, interpretability, knowledge update timeliness, architectural limitations, and ethical and legal risks require ongoing attention and innovation. Addressing these challenges will depend on coordinated efforts from researchers, clinicians, policymakers, and industry, along with progress in privacy-preserving techniques, explainable AI, lifelong learning, and regulatory frameworks.

Looking ahead, LLM agents hold great promise for bioinformatics and biomedicine. Advancing human-AI collaboration, developing embodied AI systems, and fostering open-source, standardized ecosystems will further unlock their potential. These efforts can accelerate precision medicine, broaden access to advanced technologies, and catalyze new scientific breakthroughs. Ultimately, LLM agents are set to become essential partners in research and healthcare, ushering in a new era of innovation and improved health outcomes.

Key Points

  • This review surveys the potential of large language model (LLM) agents in bioinformatics and biomedicine, covering their core architecture, key technologies, and collaborative modes.

  • This review explores applications of LLM agents in multi-omics, drug development, and clinical diagnosis, mapping their utility across biomedical fields.

  • This review analyzes major challenges for LLM agents: framework limitations, data privacy and security, model hallucinations and interpretability, knowledge update timeliness, and ethical–legal risks.

  • This review discusses future directions for advancing LLM agents in the domain, such as human-artificial intelligence collaboration paradigms and open-source ecosystem development.

Contributor Information

Tiantian Yang, AI for Science Interdisciplinary Research Center, School of Computer Science, Northwestern Polytechnical University, No. 1 Dongxiang Road, Xi’an, 710129, China.

Yihang Xiao, AI for Science Interdisciplinary Research Center, School of Computer Science, Northwestern Polytechnical University, No. 1 Dongxiang Road, Xi’an, 710129, China.

Zhijie Bao, AI for Science Interdisciplinary Research Center, School of Computer Science, Northwestern Polytechnical University, No. 1 Dongxiang Road, Xi’an, 710129, China; School of Data Science, Fudan University, No. 220 Handan Road, Shanghai, 200433, China.

Jianye Hao, School of Intelligence and Computing, Tianjin University, No. 92 Weijin Road, Tianjin, 300072, China.

Jiajie Peng, AI for Science Interdisciplinary Research Center, School of Computer Science, Northwestern Polytechnical University, No. 1 Dongxiang Road, Xi’an, 710129, China; Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, No. 1 Dongxiang Road, Xi’an, 710129, China.

Conflict of interest: None declared.

Funding

This paper was supported by the National Natural Science Foundation of China (grant nos. 92370106 and 62072376).

Data availability

Data sharing is not applicable to this article as no new data were created or analyzed in this study. All data and information sources reviewed are fully referenced in the corresponding sections of the manuscript.

References

  • 1. Zhao  WX, Zhou  K, Li  J. et al.  A survey of large language models. arXiv preprint. 2023. arXiv:2303.18223. 10.48550/arXiv.2303.18223 [DOI]
  • 2. Yang  J, Jin  H, Tang  R. et al.  Harnessing the power of LLMs in practice: a survey on chatgpt and beyond. ACM Trans Knowl Discov Data  2024;18:1–32. 10.1145/3649506 [DOI] [Google Scholar]
  • 3. Baktash  JA, Dawodi  M. Gpt-4: a review on advancements and opportunities in natural language processing. arXiv preprint. 2023. arXiv:2305.03195.
  • 4. Brown  TB, Mann  B, Ryder  N. et al.  Language models are few-shot learners. Adv Neural Inform Process Syst  2020;33:1877–901. [Google Scholar]
  • 5. OpenAI, Achiam  J, Adler  S. et al.  GPT-4 technical report. arXiv preprint. 2023. arXiv:2303.08774. 10.48550/arXiv.2303.08774 [DOI]
  • 6. Touvron  H, Lavril  T, Izacard  G. et al.  LLaMA: open and efficient foundation language models. arXiv preprint. 2023. arXiv:2302.13971. 10.48550/arXiv.2302.13971 [DOI]
  • 7. Touvron  H, Martin  L, Stone  K. et al.  Llama 2: open foundation and fine-tuned chat models. arXiv preprint. 2023. arXiv:2307.09288. 10.48550/arXiv.2307.09288 [DOI]
  • 8. Akter  SN, Yu  Z, Muhamed  A. et al.  An in-depth look at gemini’s language abilities. arXiv preprint. 2023. arXiv:2312.11444.
  • 9. Chowdhery  A, Narang  S, Devlin  J. et al.  PaLM: Scaling language modeling with pathways. Journal of Machine Learning Research 2023;24:1–113. 10.48550/arXiv.2204.02311 [DOI] [Google Scholar]
  • 10. Zhang  D, Yu  Y, Dong  J. et al.  MM-LLMs: recent advances in MultiModal large language models. Findings of the Association for Computational Linguistics: ACL 2024, pp. 12401–30, 2024. 10.48550/arXiv.2401.13601 [DOI]
  • 11. Zhou  H. A survey of large language models in medicine: progress, application, and challenge. arXiv preprint. 2023. arXiv:2311.05112. 10.48550/arXiv.2311.05112 [DOI]
  • 12. Zhang  K, Yang  X, Wang  Y. et al.  Artificial intelligence in drug development. Nat Med  2025;31:45–59. 10.1038/s41591-024-03434-4 [DOI] [PubMed] [Google Scholar]
  • 13. Zhang  J, Yin  F, Zhang  N. et al.  Exploring the potential of large language models in molecular tasks: an insightful evaluation with GPT-4. bioRxiv, 2023.11.28.568966, 2023. 10.1101/2023.11.28.568966 [DOI]
  • 14. Thirunavukarasu  AJ, Ting  DSJ, Elangovan  K. et al.  Large language models in medicine. Nat Med  2023;29:1930–40. 10.1038/s41591-023-02448-8 [DOI] [PubMed] [Google Scholar]
  • 15. Luo  R, Sun  L, Xia  Y. et al.  BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform  2022;23:bbac409. 10.1093/bib/bbac409 [DOI] [PubMed] [Google Scholar]
  • 16. Kraljevic  Z, Shek  A, Bean  D. et al.  MedGPT: medical concept prediction from clinical narratives. arXiv preprint. 2021. arXiv:2107.03134. 10.48550/arXiv.2107.03134 [DOI]
  • 17. Hager  P, Jungmann  F, Holland  R. et al.  Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med  2024;30:2613–22. 10.1038/s41591-024-03097-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Tamkin  A, Brundage  M, Clark  J. et al.  Understanding the capabilities, limitations, and societal impact of large language models. arXiv preprint. 2021. arXiv:2102.02503. 10.48550/arXiv.2102.02503 [DOI]
  • 19. Minaee  S, Mikolov  T, Nikzad  N. et al.  Large language models: a survey. arXiv preprint. 2024. arXiv:2402.06196. 10.48550/arXiv.2402.06196 [DOI]
  • 20. Ullah  E, Parwani  A, Baig  MM. et al.  Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology – a recent scoping review. Diagn Pathol  2024;19:43. 10.1186/s13000-024-01464-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Mirzadeh  I, Alizadeh  K, Shahrokhi  H. et al.  GSM-symbolic: understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024. 10.48550/arXiv.2410.05229 [DOI]
  • 22. Fan  W, Ding  Y, Ning  L. et al.  A survey on RAG meeting LLMs: towards retrieval-augmented large language models. Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2024, 6491–501. 10.48550/arXiv.2405.06211 [DOI]
  • 23. Xi  Z, Chen  W, Guo  X. et al.  The rise and potential of large language model based agents: a survey. Science China Information Sciences 2025;68:121101. 10.48550/arXiv.2309.07864 [DOI] [Google Scholar]
  • 24. Guo  T, Chen  X, Wang  Y. et al.  Large language model based multi-agents: a survey of progress and challenges. Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI). 2024, 8048–57. 10.48550/arXiv.2402.01680 [DOI]
  • 25. Sumers  TR, Yao  S, Narasimhan  K. et al.  Cognitive architectures for language agents. Transactions on Machine Learning Research, 2023.
  • 26. Wang  L, Ma  C, Feng  X. et al.  A survey on large language model based autonomous agents. Front Comp Sci  2024;18:186345. 10.1007/s11704-024-40231-1 [DOI] [Google Scholar]
  • 27. Dong  Z, Zhong  V, Lu  YY. Biomania: simplifying bioinformatics data analysis through conversation. bioRxiv  2023. 10.1101/2023.10.29.564479 [DOI] [Google Scholar]
  • 28. Tang  X, Zou  A, Zhang  Z. et al.  MedAgents: large language models as collaborators for zero-shot medical reasoning. Findings of the Association for Computational Linguistics: ACL 2024, vol. 1, pp. 599–621, 2024. 10.48550/arXiv.2311.10537 [DOI]
  • 29. Zhou  J, Zhang  B, Li  G. et al.  An AI agent for fully automated multi-omic analyses. Adv Sci  2024;11:e2407094. 10.1002/advs.202407094 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Luscombe  NM, Greenbaum  D, Gerstein  M. What is bioinformatics? An introduction and overview. Yearb Med Inform  2001;10:83–100. 10.1055/s-0038-1638103 [DOI] [PubMed] [Google Scholar]
  • 31. Gauthier  J, Vincent  AT, Charette  SJ. et al.  A brief history of bioinformatics. Brief Bioinform  2019;20:1981–96. 10.1093/bib/bby063 [DOI] [PubMed] [Google Scholar]
  • 32. Baxevanis  AD, Bader  GD, Wishart  DS. et al.  Bioinformatics, 4th ed. John Wiley & Sons, 2020.
  • 33. Gao  S, Fang  A, Huang  Y. et al.  Empowering biomedical discovery with AI agents. Cell  2024;187:6125–51. 10.1016/j.cell.2024.09.022 [DOI] [PubMed] [Google Scholar]
  • 34. Vaswani  A, Shazeer  N, Parmar  N. et al.  Attention is all you need. Advances in Neural Information Processing Systems 30, pp. 5998–6008, 2017. [Google Scholar]
  • 35. Naveed  H, Khan  AU, Qiu  S. et al.  A comprehensive overview of large language models. ACM Trans Intell Syst Technol  2025;:1–72. 10.1145/3744746 [DOI] [Google Scholar]
  • 36. Kaplan J, McCandlish S, Henighan T. et al. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  • 37. Wei  J, Wang  X, Schuurmans  D. et al.  Chain-of-thought prompting elicits reasoning in large language models. Adv Neural Inform Process Syst  2022;35:24824–37. [Google Scholar]
  • 38. Devlin  J, Chang  M-W, Lee  K. et al.  BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1 (Long and Short Papers), pp. 4171–86. Stroudsburg, PA, USA: Association for Computational Linguistics, 2019. 10.18653/v1/N19-1423 [DOI]
  • 39. Lee  J, Yoon  W, Kim  S. et al.  BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics  2020;36:1234–40. 10.1093/bioinformatics/btz682 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Raffel  C, Shazeer  N, Roberts  A. et al.  Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res  2020;21:1–67. 10.5555/3455716.345571734305477 [DOI] [Google Scholar]
  • 41. Lewis  M, Liu  Y, Goyal  N. et al.  BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 703–14. Stroudsburg, PA, USA: Association for Computational Linguistics, 2020. 10.18653/v1/2020.acl-main.703 [DOI]
  • 42. Xue  L, Constant  N, Roberts  A. et al.  mT5: a massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 483–98, 2021.
  • 43. Anil  R, Dai  AM, Firat  O. et al.  PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  • 44. Du  N, Huang  Y, Dai  AM. et al.  GLaM: efficient scaling of language models with mixture-of-experts. In: Proceedings of the International Conference on Machine Learning (ICML), Vol. 162 of Proceedings of Machine Learning Research (PMLR), pp. 5547–69. PMLR, 2022.
  • 45. Fedus  W, Zoph  B, Shazeer  N. et al.  Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. J Mach Learn Res  2022;23:1–39. [Google Scholar]
  • 46. Park  JS, O’Brien  JC, Cai  CJ. et al.  Generative agents: interactive simulacra of human behavior. In: Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST '23), Article 44, pp. 1–22, 2023. 10.1145/3586183.3606763 [DOI]
  • 47. Xie  J, Chen  Z, Zhang  R. et al.  Large multimodal agents: a survey. arXiv preprint arXiv:2402.15116, 2024.
  • 48. Durante  Z, Huang  Q, Wake  N. et al.  Agent AI: surveying the horizons of multimodal interaction. arXiv preprint arXiv:2401.03568, 2024.
  • 49. Cheng  Y, Zhang  C, Zhang  Z. et al.  Exploring large language model based intelligent agents: definitions, methods, and prospects. arXiv preprint arXiv:2401.03428, 2024.
  • 50. Xiao  Y, Liu  J, Zheng  Y. et al.  Cellagent: an LLM-driven multi-agent framework for automated single-cell data analysis. arXiv preprint arXiv:2407.09811, 2024.
  • 51. Huang  H, Shi  X, Lei  H. et al.  ProtChat: an AI multi-agent for automated protein analysis leveraging GPT-4 and protein language model. J Chem Inf Model  2025;65:62–70. 10.1021/acs.jcim.4c01345 [DOI] [PubMed] [Google Scholar]
  • 52. Qin  Y, Hu  S, Lin  Y. et al.  Tool learning with foundation models. ACM Comput Surv  2024;57:101:1–101:40. 10.1145/3704435 [DOI] [Google Scholar]
  • 53. Kojima  T, Gu  SS, Reid  M. et al.  Large language models are zero-shot reasoners. Adv Neural Inform Process Syst  2022;35:22199–213. [Google Scholar]
  • 54. Savelka  J, Ashley  KD, Gray  MA. et al.  Can GPT-4 support analysis of textual data in tasks requiring highly specialized domain expertise? In: Laakso M-J, Monga M, Sheard J (eds.), Proceedings of the 2023 Conference on Innovation and Technology in Computer Science Education V. 1, pp. 117–23. New York, NY, USA: ACM, 2023. 10.1145/3587102.3588792 [DOI]
  • 55. Inoue  Y, Song  T, Fu  T. DrugAgent: explainable drug repurposing agent with large language model-based reasoning. arXiv preprint arXiv:2408.13378, 2024.
  • 56. Dai  W, Li  J, Li  D. et al.  InstructBLIP: towards general-purpose vision-language models with instruction tuning. In: Advances in Neural Information Processing Systems 36 (NeurIPS 2023), pp. 49250–67, 2023.
  • 57. Peng  Z, Wang  W, Dong  L. et al.  Kosmos-2: grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
  • 58. Gong  Y, Chung  Y-a, Glass  J. AST: audio spectrogram transformer. In: Proc. Interspeech 2021, pp. 571–575, 2021. 10.21437/Interspeech.2021-698 [DOI]
  • 59. Hsu  W-N, Bolte  B, Tsai  Y-HH. et al.  HuBERT: self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans Audio, Speech and Lang Proc  2021;29:3451–60. 10.1109/TASLP.2021.3122291 [DOI] [Google Scholar]
  • 60. Chen  K, Li  J, Wang  K. et al.  Chemist-X: large language model-empowered agent for reaction condition recommendation in chemical synthesis. arXiv preprint arXiv:2311.10776, 2023. 10.48550/arXiv.2311.10776 [DOI]
  • 61. Hu  X, Liu  G, Zhao  Y. et al.  De novo drug design using reinforcement learning with multiple GPT agents. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S (eds.), Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, 7405–18. Red Hook, NY, USA: Curran Associates Inc, 2023. [Google Scholar]
  • 62. Galkin  F, Naumov  V, Pushkov  S. et al.  Precious3GPT: multimodal multi-species multi-omics multi-tissue transformer for aging research and drug discovery. bioRxiv, 2024.07.25.605062, 2024. 10.1101/2024.07.25.605062 [DOI]
  • 63. Xu  W, Mei  K, Gao  H. et al.  A-MEM: agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025. 10.48550/arXiv.2502.12110 [DOI]
  • 64. Tao  Z, Lin  T-E, Chen  X. et al.  A survey on self-evolution of large language models. arXiv preprint arXiv:2404.14387, 2024. 10.48550/arXiv.2404.14387 [DOI]
  • 65. Zhang  Z, Dai  Q, Bo  X. et al.  A survey on the memory mechanism of large language model-based agents. ACM Trans Inform Syst  2025;43:1–47. 10.1145/3748302 [DOI] [Google Scholar]
  • 66. Zhong  W, Guo  L, Gao  Q. et al.  MemoryBank: enhancing large language models with long-term memory. Proc AAAI Conf Artif Intell  2024;38:19724–31. 10.1609/aaai.v38i17.29946 [DOI] [Google Scholar]
  • 67. Shi  W, Xu  R, Zhuang  Y. et al.  EHRAgent: code empowers large language models for few-shot complex tabular reasoning on electronic health records. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Vol. 2024, pp. 22315. Miami, Florida, USA: Association for Computational Linguistics, 2024. [DOI] [PMC free article] [PubMed]
  • 68. Chen  X, Jin  Y, Mao  X. et al.  RareAgents: advancing rare disease care through LLM-empowered multi-disciplinary team. arXiv preprint arXiv:2412.12475, 2024. 10.48550/arXiv.2412.12475 [DOI]
  • 69. Wishart  DS, Knox  C, Guo  AC. et al.  DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res  2008;36:D901–6. 10.1093/nar/gkm958 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Protein Data Bank. Crystallography: Protein Data Bank. Nat New Biol  1971;233:223. 10.1038/newbio233223b0 [DOI] [Google Scholar]
  • 71. Siva  N. 1000 genomes project. (Meeting Abstract) 2008;256. [DOI] [PubMed]
  • 72. Jumper  J, Evans  R, Pritzel  A. et al.  Highly accurate protein structure prediction with AlphaFold. Nature  2021;596:583–9. 10.1038/s41586-021-03819-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73. van Kempen  M, Kim  SS, Tumescheit  C. et al.  Fast and accurate protein structure search with foldseek. Nat Biotechnol  2024;42:243–6. 10.1038/s41587-023-01773-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74. Lewis  P, Perez  E, Piktus  A. et al.  Retrieval-augmented generation for knowledge-intensive NLP tasks. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H (eds.), Advances in Neural Information Processing Systems 33, 9459–74. Red Hook, NY, USA: Curran Associates Inc., 2020. [Google Scholar]
  • 75. Su  H, Long  W, Zhang  Y. BioMaster: Multi-agent system for automated bioinformatics analysis workflow. bioRxiv, 2025. 10.1101/2025.01.23.634608 [DOI]
  • 76. Schick  T, Dwivedi-Yu  J, Dessì  R. et al.  Toolformer: language models can teach themselves to use tools. Advances in Neural Information Processing Systems 36, pp. 68539–51, 2023.
  • 77. Nakano  R, Hilton  J, Balaji  S. et al.  WebGPT: browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2022. 10.48550/arXiv.2112.09332 [DOI]
  • 78. Shen  Y, Song  K, Tan  X. et al.  HuggingGPT: solving AI tasks with ChatGPT and its friends in hugging face. Adv Neural Inform Process Syst  2023;36:38154–80. [Google Scholar]
  • 79. Yip  M, Salcudean  S, Goldberg  K. et al.  Artificial intelligence meets medical robotics. Science  2023;381:141–6. [DOI] [PubMed] [Google Scholar]
  • 80. Yue  L, Xing  S, Chen  J. et al.  Clinicalagent: clinical trial multi-agent system with large language model-based reasoning. In: Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 1–10, 2024.
  • 81. Su  X, Wang  Y, Gao  S. et al.  KGARevion: an AI agent for knowledge-intensive biomedical QA. arXiv preprint arXiv:2410.04660, 2024. 10.48550/arXiv.2410.04660 [DOI]
  • 82. Hou  X, Zhao  Y, Wang  S. et al.  Model context protocol (MCP): landscape, security threats, and future research directions. arXiv preprint arXiv:2503.23278, 2025. 10.48550/arXiv.2503.23278 [DOI]
  • 83. Li  B, Yan  T, Pan  Y. et al.  Mmedagent: learning to use medical tools with multi-modal agent. In: Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 8745–60, 2024.
  • 84. Jin  J, Jeong  CW. Surgical-LLaVA: toward surgical scenario understanding via large language and vision models. arXiv preprint arXiv:2410.09750, 2024. 10.48550/arXiv.2410.09750 [DOI]
  • 85. Shi  H, Xu  Z, Wang  H. et al.  Continual learning of large language models: a comprehensive survey. ACM Computing Surveys, vol. 57, no. 1, pp. 1–39, 2024.
  • 86. Wei  H, Qiu  J, Yu  H, Yuan  W. MEDCO: medical education copilots based on a multi-agent framework. In: Leonardis A, Ricci E (eds.), Computer Vision – ECCV 2024 Workshops, Vol. 15630 of Lecture Notes in Computer Science, pp. 129–44. Cham: Springer, 2025. 10.1007/978-3-031-91813-1_8 [DOI] [Google Scholar]
  • 87. Lu  D, Zheng  Y, Yi  X. et al.  Identifying potential risk genes for clear cell renal cell carcinoma with deep reinforcement learning. Nat  Commun  2025;16:3591. 10.1038/s41467-025-58439-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88. Yang  J, Shu  L, Duan  H. et al.  RDguru: a conversational intelligent agent for rare diseases. IEEE J Biomed Health Inform  2024;29:6366–78. 10.1109/JBHI.2024.3464555 [DOI] [PubMed] [Google Scholar]
  • 89. Roohani  Y, Lee  A, Huang  Q. et al.  Biodiscoveryagent: an AI agent for designing genetic perturbation experiments. arXiv preprint arXiv:2405.17631, 2024.
  • 90. Tran  K-T, Dao  D, Nguyen  M-D. et al.  Multi-agent collaboration mechanisms: a survey of LLMs. arXiv preprint arXiv:2501.06322, 2025. 10.48550/arXiv.2501.06322 [DOI]
  • 91. Swanson  K, Wu  W, Bulaong  NL. et al.  The virtual lab of AI agents designs new SARS-CoV-2 nanobodies. Nature  2025;646:716–23. 10.1038/s41586-025-09442-9 [DOI] [PubMed] [Google Scholar]
  • 92. Ghafarollahi  A, Buehler  MJ. Protagents: protein discovery via large language model multi-agent collaborations combining physics and machine learning. Digital Discov  2024;3:1389–409. 10.1039/D4DD00013G [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93. Li  Y, Zhang  R, Singh  P. et al.  Advancing multi-agent systems through model context protocol: architecture, implementation, and applications. arXiv preprint arXiv:2504.21030, 2025.
  • 94. Moore  DJ. A taxonomy of hierarchical multi-agent systems: design patterns, coordination mechanisms, and industrial applications. arXiv preprint arXiv:2508.12683, 2025.
  • 95. Ma  X, Gao  L, Yong  X. Eigenspaces of networks reveal the overlapping and hierarchical community structure more precisely. J Stat Mech Theory Exp  2010;2010:P08012. 10.1088/1742-5468/2010/08/P08012 [DOI] [Google Scholar]
  • 96. Ma  H, Wang  B, Liang  Y. Semi-supervised spectral algorithms for community detection in complex networks based on equivalence of clustering methods. Phys A: Stat Mech Appl  2018;490:786–802. 10.1016/j.physa.2017.08.116 [DOI] [Google Scholar]
  • 97. Singh  A, Ehtesham  A, Kumar  S. et al.  Agentic retrieval-augmented generation: a survey on agentic RAG. arXiv preprint arXiv:2501.09136, 2025. 10.48550/arXiv.2501.09136 [DOI]
  • 98. Sahoo  P, Tripathi  A, Saha  S, Mondal  S. FedMRL: data heterogeneity aware federated multi-agent deep reinforcement learning for medical imaging. In: Linguraru MG, Dou Q (eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, Vol. 15003 of Lecture Notes in Computer Science, pp. 649–59. Cham: Springer, 2024. 10.1007/978-3-031-72384-1_60 [DOI] [Google Scholar]
  • 99. Fossi  G, Boulaimen  Y, Outemzabet  L. et al.  SwiftDossier: tailored automatic dossier for drug discovery with LLMs and agents. arXiv preprint arXiv:2409.15817, 2024. 10.48550/arXiv.2409.15817 [DOI]
  • 100. Rapp  JT, Bremer  BJ, Romero  PA. Self-driving laboratories to autonomously navigate the protein fitness landscape. Nat  Chem Eng  2024;1:97–107. 10.1038/s44286-023-00002-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101. Shen  Y, Chen  Z, Mamalakis  M. et al.  Toursynbio: a multi-modal large model and agent framework to bridge text and protein sequences for protein engineering. In: Cannataro M, Zheng HJ (eds.), 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 2382–89. Lisbon, Portugal: IEEE, 2024.
  • 102. McNaughton  AD, Ramalaxmi  S, Krishna  G. et al.  CACTUS: chemistry agent connecting tool usage to science. ACS Omega  2024;9:46563–73. 10.1021/acsomega.4c08408 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 103. Ruan  Y, Lu  C, Xu  N. et al.  An automatic end-to-end chemical synthesis development platform powered by large language models. Nat Commun  2024;15:10160. 10.1038/s41467-024-54457-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104. Bran  AM, Cox  S, Schilter  O. et al.  Augmenting large language models with chemistry tools. Nat Mach Intell  2024;6:525–35. 10.1038/s42256-024-00832-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 105. Zhou  Y, Jiahui  S, Zhang  J. et al.  PRIME: a multi-agent environment for orchestrating dynamic computational workflows in protein engineering. bioRxiv preprint, 2025. 10.1101/2025.09.22.677756 [DOI]
  • 106. Li  K, Wu  Z, Wang  S. et al.  DrugPilot: LLM-based parameterized reasoning agent for drug discovery. arXiv preprint arXiv:2505.13940, 2025. 10.48550/arXiv.2505.13940 [DOI]
  • 107. Ma  R, Cheng  Q, Yao  J. et al.  Multimodal machine learning enables AI chatbot to diagnose ophthalmic diseases and provide high-quality medical responses. npj Digit Med  2025;8:64–18. 10.1038/s41746-025-01461-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108. Fard  AM. Cancer cell removing using a reinforcement learning agent. bioRxiv preprint, 2024. 10.1101/2024.09.01.610680 [DOI]
  • 109. Bani-Harouni  D, Navab  N, Keicher  M. MAGDA: multi-agent guideline-driven diagnostic assistance. In: Deng Z, Shen Y et al. (eds.), Foundation Models for General Medical AI, Vol. 15184 of Lecture Notes in Computer Science, 205–17. Cham: Springer, 2025. 10.1007/978-3-031-73471-7_17 [DOI] [Google Scholar]
  • 110. Yu  H, Zhou  J, Li  L. et al.  Simulated patient systems are intelligent when powered by large language model-based AI agents. arXiv preprint arXiv:2409.18924, 2024. 10.48550/arXiv.2409.18924 [DOI]
  • 111. Saxena  RR. Beyond flashcards: designing an intelligent assistant for USMLE mastery and virtual tutoring in medical education (A study on harnessing chatbot Technology for Personalized Step 1 prep). arXiv preprint arXiv:2409.10540, 2024. 10.48550/arXiv.2409.10540 [DOI]
  • 112. Singhal  K, Tu  T, Gottweis  J. et al.  Toward expert-level medical question answering with large language models. Nat Med  2025;31:943–50. 10.1038/s41591-024-03423-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 113. Chowdhury  M, He  YV, Joselowitz  J. et al.  ASTRID – an automated and scalable TRIaD for the evaluation of RAG-based clinical question answering systems. arXiv preprint arXiv:2501.08208, 2025. 10.48550/arXiv.2501.08208 [DOI]
  • 114. Yang  H, Chen  H, Guo  H. et al.  LLM-MedQA: enhancing medical question answering through case studies in large language models. arXiv preprint arXiv:2501.05464, 2025. 10.48550/arXiv.2501.05464 [DOI]
  • 115. Shusterman  R, Waters  AC, O’Neill  S. et al.  An active inference strategy for prompting reliable responses from large language models in medical practice. npj Digit Med  2025;8:119. 10.1038/s41746-025-01516-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 116. Lee  C, Kumar  S, Vogt  KA. et al.  Improving clinical documentation with AI: a comparative study of sporo AI scribe and GPT-4o mini. arXiv preprint arXiv:2410.15528, 2024. 10.48550/arXiv.2410.15528 [DOI]
  • 117. Zakka  C, Cho  J, Fahed  G. et al.  Almanac copilot: towards autonomous electronic health record navigation. arXiv preprint arXiv:2405.07896, 2024. 10.48550/arXiv.2405.07896 [DOI]
  • 118. Wang  Z, Zhu  Y, Zhao  H. et al.  ColaCare: enhancing electronic health record modeling through large language model-driven multi-agent collaboration. In: Lewin-Eytan L, Huang H, Yom-Tov E (eds.), Proceedings of the ACM on Web Conference 2025 (WWW ‘25). New York: ACM, 2025, 2250–61. [Google Scholar]
  • 119. Shinn  N, Labash  B, Jami  A. Reflexion: language agents with verbal reinforcement learning. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S (eds.), Advances in Neural Information Processing Systems 36 (NeurIPS 2023). Red Hook, NY: Curran Associates, Inc., 2023, 8634–52.
  • 120. Kim  Y, Li  H, Wang  Y-A. et al.  MDAgents: an adaptive collaboration of LLMs for medical decision-making. In: Globerson A, Mackey L, Belgrave D, Fan A, Paquet U, Tomczak J, Zhang C (eds.), Advances in Neural Information Processing Systems 37 (NeurIPS 2024). Red Hook, NY: Curran Associates, Inc., 2024, 79410–52.
  • 121. Ke  Y, Yang  R, Lie  SA. et al.  Mitigating cognitive biases in clinical decision-making through multi-agent conversations using large language models: simulation study. J Med Internet Res  2024;26:e59439. 10.2196/59439 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 122. Zuo  K, Jiang  Y, Mo  F. et al.  KG4Diagnosis: a hierarchical multi-agent LLM framework with knowledge graph enhancement for medical diagnosis. In: Wu J, Zhu J, Xu M, Jin Y (eds.), Proceedings of the AAAI Bridge Program on AI for Medicine and Healthcare (PMLR Volume 281). Philadelphia, PA, USA: PMLR, 2025, 195–204.
  • 123. Wu  J, Liang  X, Bai  X. et al.  SurgBox: agent-driven operating room sandbox with surgery copilot. In: Ding W, Lu C-T, Wang F, Di L, Wu K, Huan J, Nambiar R, Li J, Ilievski F, Baeza-Yates R, Hu X (eds.), 2024 IEEE International Conference on Big Data (BigData). Washington, DC, USA: IEEE, 2024, 2041–48. [Google Scholar]
  • 124. Lan  K. Depression diagnosis dialogue simulation: self-improving psychiatrist with tertiary memory. arXiv preprint, arXiv:2409.15084, 2024. 10.48550/arXiv.2409.15084 [DOI]
  • 125. Wu  S. I like sunnie more than i expected! Exploring user expectation and perception of an anthropomorphic LLM-based conversational agent for well-being support. arXiv preprint, arXiv:2405.13803, 2024. 10.48550/arXiv.2405.13803 [DOI]
  • 126. Hu  J, Dong  T, Luo  G. et al.  Psycollm: enhancing LLM for psychological understanding and evaluation. IEEE Trans Comput Soc Syst  2025;12:539–51. 10.1109/TCSS.2024.3497725 [DOI] [Google Scholar]
  • 127. Wang  Z, Jin  Q, Wei  C-H. et al.  GeneAgent: self-verification language agent for gene-set analysis using domain databases. Nat Methods  2025;22:1677–85. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 128. Liu  H, Chen  S, Zhang  Y. et al.  GenoTEX: an LLM Agent benchmark for automated gene expression data analysis. arXiv preprint arXiv:2406.15341, 2024. 10.48550/arXiv.2406.15341 [DOI]
  • 129. Richard  G, de Almeida  BP, Dalla-Torre  H. et al.  ChatNT: a multimodal conversational agent for DNA, RNA and protein tasks. Nat Mach Intell 2025;7:928–41. [Google Scholar]
  • 130. Mehandru  N, Hall  AK, Melnichenko  O. et al.  BioAgents: democratizing bioinformatics analysis with multi-agent systems. arXiv preprint arXiv:2501.06314, 2025. 10.48550/arXiv.2501.06314 [DOI] [PMC free article] [PubMed]
  • 131. Tiukova  IA. Genesis: towards the automation of systems biology research. arXiv preprint arXiv:2408.10689, 2024. 10.48550/arXiv.2408.10689 [DOI]
  • 132. Qu  Y, Huang  K, Cousins  H. et al.  CRISPR-GPT for agentic automation of gene-editing experiments. Nat Biomed Eng  2025;9:981–92. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 133. Raghavan V, Zheng Y, Li Y. et al.  Harnessing agent-based frameworks in CellAgentChat to unravel cell–cell interactions from single-cell and spatial transcriptomics. Genome Research 2025;35:gr.279771.124. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 134. Huang  K, Zhang  S, Wang  H. et al.  Biomni: a general-purpose biomedical AI agent. bioRxiv, 2025. 10.1101/2025.05.30.656746 [DOI]
  • 135. Liu  W, Li  J, Tang  Y. et al.  Drbioright 2.0: an LLM-powered bioinformatics chatbot for large-scale cancer functional proteomics analysis. Nat Commun  2025;16:2256. 10.1038/s41467-025-57430-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 136. Xin  Q, Kong  Q, Ji  H. et al.  BioInformatics agent (BIA): unleashing the power of large language models to reshape bioinformatics workflow. bioRxiv, 2024. 10.1101/2024.05.22.595240 [DOI]
  • 137. Martinek  V, Liu  W, Alexiou  P. et al.  Agentomics-ML: autonomous machine learning experimentation agent for genomic and transcriptomic data. arXiv preprint arXiv:2506.05542, 2025. 10.48550/arXiv.2506.05542 [DOI]
  • 138. Yang  X, Zhou  J, Liu  W. et al.  Promptbio: a multi-agent AI platform for bioinformatics data analysis. bioRxiv,  2025. 10.1101/2025.07.05.663295 [DOI] [Google Scholar]
  • 139. Tang  X, Yu  Z, Chen  J. et al.  Cellforge: agentic design of virtual cell models. arXiv, 2025. 10.48550/arXiv.2508.02276 [DOI]
  • 140. Huang  Y, Cohen  I, Truong  VQ-T. et al.  Reference-free cell-type annotation with LLM agents. In: Theis FJ, Regev A, Hasanzadeh A, et al. (eds.), ICLR 2025 Workshop on Machine Learning for Genomics Explorations (MLGenX). Singapore, 2025.
  • 141. Taheriyoun  AR, Ross  A, Safikhani  A. et al.  Longitudinal omics data analysis: a review on models, algorithms, and tools. arXiv preprint arXiv:2506.11161, 2025.
  • 142. Baião  AR, Cai  Z, Poulos  RC. et al.  A technical review of multi-omics data integration methods: from classical statistical to deep generative approaches. Brief Bioinform  2025;26:bbaf355. 10.1093/bib/bbaf355 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 143. Escriba-Montagut  X, Marcon  Y, Anguita-Ruiz  A. et al.  Federated privacy-protected meta-and mega-omics data analysis in multi-center studies with a fully open-source analytic platform. PLoS Comput Biol  2024;20:e1012626. 10.1371/journal.pcbi.1012626 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 144. Song  M, Greenbaum  J, Joseph Luttrell  IV. et al.  A review of integrative imputation for multi-omics datasets. Front Genet  2020;11:570255. 10.3389/fgene.2020.570255 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 145. Silva  MC, Faria  D, Pesquita  C. Matching multiple ontologies to build a knowledge graph for personalized medicine. In: Groth P, Vidal M-E, Suchanek F, et al. (eds.), The Semantic Web: 19th International Conference, ESWC 2022. Cham: Springer International Publishing, 2022, p. 461–77. 10.1007/978-3-031-06981-9_27 [DOI]
  • 146. Zha  Y, Feng  S, Gao  P. et al.  Enhancing and accelerating cell type deconvolution of large-scale spatial transcriptomics slices with dual network model. Bioinformatics  2025;41:btaf419. 10.1093/bioinformatics/btaf419 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 147. Ma  X, Sun  P, Gong  M. An integrative framework of heterogeneous genomic data for cancer dynamic modules based on matrix decomposition. IEEE/ACM Trans Comput Biol Bioinform  2022;19:305–16. 10.1109/TCBB.2020.3004808 [DOI] [PubMed] [Google Scholar]
  • 148. Lobentanzer  S, Iperi  C, Yang  X. et al.  Prompt-based bioinformatics: a new interface for multi-omics analysis. Nat Rev Genet 2025;26:1–2. 10.1038/s41576-025-00889-0 [DOI] [PubMed] [Google Scholar]
  • 149. He  F, Zhu  T, Ye  D. et al.  The emerged security and privacy of LLM agent: a survey with case studies. ACM Comput Surv 2024;56:114. [Google Scholar]
  • 150. Lv  D, Zhu  S, Xu  H. et al.  A review of big data security and privacy protection technology. In: 2018 IEEE 18th International Conference on Communication Technology (ICCT). Chongqing, China: IEEE, 2018, p. 1082–91. 10.1109/ICCT.2018.8600051 [DOI] [Google Scholar]
  • 151. Hao  D, Liu  S, Zheng  L. et al.  Privacy in fine-tuning large language models: attacks, defenses, and future directions. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, vol. 15874. Singapore: Springer Nature Singapore, 2025, p. 326–44.
  • 152. Gekhman  Z, Yona  G, Aharoni  R. et al.  Does fine-tuning LLMs on new knowledge encourage hallucinations? In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP 2024). Stroudsburg, PA: Association for Computational Linguistics (ACL), 2024, p. 7765–84. 10.48550/arXiv.2405.05904 [DOI]
  • 153. Mahaut  M, Aina  L, Czarnowska  P. et al.  Factual confidence of LLMs: on reliability and robustness of current estimators. In: Ku L-W, Martins A, Srikumar V (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, 2024, p. 4554–70. 10.18653/v1/2024.acl-long.250 [DOI]
  • 154. Zhang  X, Peng  B, Tian  Y. et al.  Self-alignment for factuality: mitigating hallucinations in LLMs via self-evaluation. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, 2024, p. 1946–65. 10.48550/arXiv.2402.09267 [DOI] [PMC free article] [PubMed]
  • 155. Zhang  S, Tian  Y, Feng  Y. TruthX: alleviating hallucinations by editing large language models in truthful space. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, 2024, p. 8908–49. 10.48550/arXiv.2402.17811 [DOI]
  • 156. Manakul  P, Liusie  A, Gales  MJF. SelfCheckGPT: zero-resource black-box hallucination detection for generative large language models. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Singapore: Association for Computational Linguistics, 2023, p. 9004–17. 10.48550/arXiv.2303.08896 [DOI]
  • 157. Chuang  Y-S, Xie  Y, Luo  H. et al.  DoLa: decoding by contrasting layers improves factuality in large language models. In: Proceedings of the International Conference on Learning Representations (ICLR). 2024. 10.48550/arXiv.2309.03883 [DOI]
  • 158. Gao  L, Dai  Z, Pasupat  P. et al.  RARR: researching and revising what language models say, using language models. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023). Toronto, Canada: Association for Computational Linguistics, 2023, p. 16477–508. 10.48550/arXiv.2210.08726 [DOI]
  • 159. Dhuliawala  S, Komeili  M, Xu  J. et al.  Chain-of-verification reduces hallucination in large language models. In: Findings of the Association for Computational Linguistics: ACL 2024. Bangkok, Thailand: Association for Computational Linguistics, 2024, p. 3563–78. 10.48550/arXiv.2309.11495 [DOI]
  • 160. Yin  Z, Sun  Q, Guo  Q. et al.  Do large language models know what they don’t know? In: Findings of the Association for Computational Linguistics: ACL 2023. Toronto, Canada: Association for Computational Linguistics, 2023, p. 8235–45. 10.48550/arXiv.2305.18153 [DOI]
  • 161. Huang  X, Cheng  S, Huang  S. et al.  Queryagent: a reliable and efficient reasoning framework with environmental feedback-based self-correction. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, 2024, p. 4882–903.
  • 162. Liu  Z, Zhang  Y, Li  P. et al.  A dynamic LLM-powered agent network for task-oriented agent collaboration. In: Proceedings of the First Conference on Language Modeling (CoLM). 2024. 10.48550/arXiv.2310.02170 [DOI]
  • 163. Zhang  W, Tang  K, Wu  H. et al.  Agent-pro: learning to evolve via policy-level reflection and optimization. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, 2024, p. 5348–75. 10.18653/v1/2024.acl-long.292 [DOI] [PMC free article] [PubMed]
  • 164. Qiao  S, Zhang  N, Fang  R. et al.  Autoact: automatic agent learning from scratch for QA via self-planning. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, 2024, p. 1827–43. 10.18653/v1/2024.acl-long.165 [DOI] [PMC free article] [PubMed]
  • 165. Zheng  J, Shi  C, Cai  X. et al.  Lifelong learning of large language model based agents: a roadmap. arXiv preprint arXiv:2501.07278. 2025. 10.48550/arXiv.2501.07278 [DOI]
  • 166. Zhang  H, Du  W, Shan  J. et al.  Building cooperative embodied agents modularly with large language models. In: International Conference on Learning Representations (ICLR). 2024.
  • 167. He  P, Lin  Y, Dong  S. et al.  Red-teaming LLM multi-agent systems via communication attacks. In: Findings of the Association for Computational Linguistics: ACL 2025. Association for Computational Linguistics, 2025, p. 6726–47.
  • 168. Wang  J, Li  Y, Hong  Y. et al.  Integrated adaptive communication in multi-agent systems: dynamic topology, frequency, and content optimization for efficient collaboration. Neurocomputing  2025;617:129068. 10.1016/j.neucom.2024.129068 [DOI] [Google Scholar]
  • 169. Liu  B, Li  X, Zhang  J. et al.  Advances and challenges in foundation agents: from brain-inspired intelligence to evolutionary, collaborative, and safe systems. arXiv preprint arXiv:2504.01990. 2025.
  • 170. Wang  J, Hong  Y, Wang  J. et al.  Cooperative and competitive multi-agent systems: from optimization to games. IEEE/CAA J Autom Sin  2022;9:763–83. 10.1109/JAS.2022.105506 [DOI] [Google Scholar]
  • 171. Cemri  M, Pan  MZ, Yang  S. et al.  Why do multi-agent LLM systems fail? arXiv preprint arXiv:2503.13657. 2025. 10.48550/arXiv.2503.13657 [DOI]
  • 172. Chen  W, You  Z, Li  R. et al.  Internet of agents: weaving a web of heterogeneous agents for collaborative intelligence. arXiv preprint arXiv:2407.07061. 2024. 10.48550/arXiv.2407.07061 [DOI]
  • 173. Duque  JA, Aghajohari  M, Cooijmans  T. et al.  Advantage alignment algorithms. In: Proceedings of the International Conference on Learning Representations (ICLR). 2025. 10.48550/arXiv.2406.14662 [DOI]
  • 174. Zhang  Y, Yang  Y, Shu  J. et al.  Agent models: internalizing chain-of-action generation into reasoning models. arXiv preprint arXiv:2503.06580. 2025. 10.48550/arXiv.2503.06580 [DOI]
  • 175. Zhang  J, Xiang  J, Yu  Z. et al.  AFlow: automating agentic workflow generation. In: Proceedings of the International Conference on Learning Representations (ICLR). 2025. 10.48550/arXiv.2410.10762 [DOI]
  • 176. Gan  Y, Yang  Y, Ma  Z. et al.  Navigating the risks: a survey of security, privacy, and ethics threats in LLM-based agents. arXiv preprint arXiv:2411.09523. 2024.
  • 177. Tang  X, Jin  Q, Zhu  K. et al.  Prioritizing safeguarding over autonomy: risks of LLM agents for science. In: ICLR 2024 Workshop on Large Language Model (LLM) Agents. 2024.
  • 178. Ong  JCL, Chang  SY-H, William  W. et al.  Ethical and regulatory challenges of large language models in medicine. Lancet Digit  Health  2024;6:e428–32. 10.1016/S2589-7500(24)00061-X [DOI] [PubMed] [Google Scholar]
  • 179. Okonji  OR, Yunusov  K, Gordon  B. Applications of generative AI in healthcare: algorithmic, ethical, legal and societal considerations. arXiv preprint arXiv:2406.10632. 2024. 10.48550/arXiv.2406.10632 [DOI]
  • 180. He  F, Zhu  T, Ye  D. et al.  The emerged security and privacy of LLM agent: a survey with case studies. ACM Computing Surveys 2025;57:1–40. 10.1145/3773080 [DOI] [Google Scholar]
  • 181. Zhang  X, Xu  H, Ba  Z. et al.  Privacyasst: safeguarding user privacy in tool-using large language model agents. IEEE Trans Depend Secure Comput  2024;21:5242–58. 10.1109/TDSC.2024.3372777 [DOI] [Google Scholar]
  • 182. Panch  T, Mattie  H, Atun  R. Artificial intelligence and algorithmic bias: implications for health systems. J Glob Health  2019;9:020318. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 183. Kirkpatrick  K. Battling algorithmic bias: how do we ensure algorithms treat us fairly?  Commun ACM  2016;59:16–7. 10.1145/2983270 [DOI] [Google Scholar]
  • 184. Wang  W, Ma  Z, Wang  Z. et al.  A survey of LLM-based agents in medicine: how far are we from Baymax?  arXiv preprint arXiv:2502.11211. 2025. 10.48550/arXiv.2502.11211 [DOI] [Google Scholar]
  • 185. Chen  RJ, Wang  JJ, Williamson  DFK. et al.  Algorithmic fairness in artificial intelligence for medicine and healthcare. Nat Biomed Eng  2023;7:719–42. 10.1038/s41551-023-01056-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 186. Xu  J, Xiao  Y, Wang  WH. et al.  Algorithmic fairness in computational medicine. EBioMedicine  2022;84:104250. 10.1016/j.ebiom.2022.104250 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 187. Poulain  R, Fayyaz  H, Beheshti  R. et al.  Bias patterns in the application of LLMs for clinical decision support: a comprehensive study. arXiv preprint arXiv:2404.15149. 2024.
  • 188. Zack  T, Lehman  E, Suzgun  M. et al.  Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit Health  2024;6:e12–22. 10.1016/S2589-7500(23)00225-X [DOI] [PubMed] [Google Scholar]
  • 189. Wachter  S, Mittelstadt  B, Russell  C. Do large language models have a legal duty to tell the truth?  R Soc Open Sci  2024;11:240197. 10.1098/rsos.240197 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 190. Bouderhem  R. Shaping the future of AI in healthcare through ethics and governance. Human Soc Sci Commun  2024;11:1–12. 10.1057/s41599-024-02894-w [DOI] [Google Scholar]
  • 191. Freyer  O. et al.  A future role for health applications of large language models depends on regulators enforcing safety standards. Lancet Digit Health  2024;6:e662–72. [DOI] [PubMed] [Google Scholar]
  • 192. Chen  M, Wang  Y, Wang  Q. et al.  Impact of human and artificial intelligence collaboration on workload reduction in medical image interpretation. npj Digit Med  2024;7:349. 10.1038/s41746-024-01328-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 193. Reverberi  C, Rigon  T, Solari  A. et al.  Experimental evidence of effective human–AI collaboration in medical decision-making. Sci Rep  2022;12:14952. 10.1038/s41598-022-18751-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 194. Jain  R, Garg  N, Khera  SN. Effective human–AI work design for collaborative decision-making. Kybernetes  2022;52:5017–40. 10.1108/K-04-2022-0548 [DOI] [Google Scholar]
  • 195. Peng  L, Li  D, Zhang  Z. et al.  Human-AI collaboration: unraveling the effects of user proficiency and AI agent capability in intelligent decision support systems. Int J Ind Ergon  2024;103:103629. 10.1016/j.ergon.2024.103629 [DOI] [Google Scholar]
  • 196. Fragiadakis  G, Diou  C, Kousiouris  G. et al.  Evaluating human-AI collaboration: a review and methodological framework. arXiv preprint arXiv:2407.19098. 2024. 10.48550/arXiv.2407.19098 [DOI]
  • 197. Liu  Y. Aligning cyber space with physical world: a comprehensive survey on embodied AI. IEEE/ASME Transactions on Mechatronics. 2025. 10.1109/TMECH.2025.3574943 [DOI]
  • 198. Liu  Y, Cao  X, Chen  T. et al.  From screens to scenes: a survey of embodied AI in healthcare. Inform Fusion  2025;119:103033. 10.1016/j.inffus.2025.103033 [DOI] [Google Scholar]
  • 199. Saeidi  H, Opfermann  JD, Kam  M. et al.  Autonomous robotic laparoscopic surgery for intestinal anastomosis. Sci  Robot  2022;7:eabj2908. 10.1126/scirobotics.abj2908 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 200. Safiejko  K, Tarkowski  R, Koselak  M. et al.  Robotic-assisted vs. standard laparoscopic surgery for rectal cancer resection: a systematic review and meta-analysis of 19,731 patients. Cancers  2021;14:180. 10.3390/cancers14010180 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 201. Kaur  C, Garg  U. Artificial intelligence techniques for cancer detection in medical image processing: a review. Mater Today Proc  2023;81:806–9. 10.1016/j.matpr.2021.04.241 [DOI] [Google Scholar]
  • 202. Lanotte  F, O’Brien  MK, Jayaraman  A. AI in rehabilitation medicine: opportunities and challenges. Ann Rehabil Med  2023;47:444–58. 10.5535/arm.23131 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 203. Park  D, Hoshi  Y, Mahajan  HP. et al.  Active robot-assisted feeding with a general-purpose mobile manipulator: design, evaluation, and lessons learned. Robot Auton Syst  2020;124:103344. 10.1016/j.robot.2019.103344 [DOI] [Google Scholar]
  • 204. Bhattacharjee  T, Gordon  EK, Scalise  R. et al.  Is more autonomy always better? Exploring preferences of users with mobility impairments in robot-assisted feeding. In: Belpaeme T, Young J, Gunes H, Riek L (eds.), Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction. Cambridge, UK: IEEE, 2020, p. 181–90.
  • 205. Ahmad  R, Siemon  D, Gnewuch  U. et al.  Designing personality-adaptive conversational agents for mental health care. Inform Syst Front  2022;24:923–43. 10.1007/s10796-022-10254-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 206. Mesquita  AC, Zamarioli  CM, Campos  E. et al.  The use of robots in nursing care practices: an exploratory descriptive study. Online Braz J Nurs  2016;15:404–13. 10.17665/1676-4285.20165395 [DOI] [Google Scholar]
  • 207. Song  T, Luo  M, Zhang  X. et al.  A multi-agent-driven robotic AI chemist enabling autonomous chemical research on demand. J Am Chem Soc 2025;147:12534–45. [DOI] [PubMed] [Google Scholar]
  • 208. Burger  B, Maffettone  PM, Gusev  VV. et al.  A mobile robotic chemist. Nature  2020;583:237–41. 10.1038/s41586-020-2442-2 [DOI] [PubMed] [Google Scholar]
  • 209. Xiang  T, Tao  T, Yi  G. et al.  Language models meet world models: embodied experiences enhance language models. Adv Neural Inform Process Syst  2024;36:75392–412. [Google Scholar]
  • 210. Elendu  C, Amaechi  DC, Elendu  TC. et al.  Ethical implications of AI and robotics in healthcare: a review. Medicine  2023;102:e36671. 10.1097/MD.0000000000036671 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 211. De Micco  F, Tambone  V, Frati  P. et al.  Disability 4.0: bioethical considerations on the use of embodied artificial intelligence. Front Med  2024;11:1437280. 10.3389/fmed.2024.1437280 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 212. Boada  JP, Maestre  BR, Genís  CT. The ethical issues of social assistive robotics: a critical literature review. Technol Soc  2021;67:101726. 10.1016/j.techsoc.2021.101726 [DOI] [Google Scholar]
  • 213. Price  WN, Cohen  IG. Privacy in the age of medical big data. Nat Med  2019;25:37–43. 10.1038/s41591-018-0272-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 214. Saheb  T, Saheb  T, Carpenter  D. Mapping research strands of ethics of artificial intelligence in healthcare: a bibliometric and content analysis. Comput Biol Med  2021;135:104660. 10.1016/j.compbiomed.2021.104660 [DOI] [PubMed] [Google Scholar]
  • 215. Al Nazi  Z, Peng  W. Large language models in healthcare and medical domain: a review. Informatics  2024;11:57. 10.3390/informatics11030057 [DOI] [Google Scholar]
  • 216. Lobentanzer  S, Feng  S, Bruderer  N. et al.  A platform for the biomedical application of large language models. Nat Biotechnol  2025;43:166–9. 10.1038/s41587-024-02534-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 217. Doneva  SE, Qin  S, Sick  B. et al.  Large language models to process, analyze, and synthesize biomedical texts: a scoping review. Discovery Artif Intell  2024;4:107. 10.1007/s44163-024-00197-2 [DOI] [Google Scholar]

Data Availability Statement

Data sharing is not applicable to this article as no new data were created or analyzed in this study. All data and information sources reviewed are fully referenced in the corresponding sections of the manuscript.


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press