Abstract
With the fast‐growing and evolving omics data, the demand for streamlined and adaptable tools to handle bioinformatics analysis continues to grow. In response to this need, Automated Bioinformatics Analysis (AutoBA) is introduced, an autonomous AI agent designed explicitly for fully automated multi‐omic analyses based on large language models (LLMs). AutoBA simplifies the analytical process by requiring minimal user input while delivering detailed step‐by‐step plans for various bioinformatics tasks. AutoBA's unique capacity to self‐design analysis processes based on input data variations further underscores its versatility. Compared with online bioinformatic services, AutoBA offers multiple LLM backends, with options for both online and local usage, prioritizing data security and user privacy. In comparison to ChatGPT and open‐source LLMs, an automated code repair (ACR) mechanism in AutoBA is designed to improve its stability in automated end‐to‐end bioinformatics analysis tasks. Moreover, different from the predefined pipeline, AutoBA has adaptability in sync with emerging bioinformatics tools. Overall, AutoBA represents an advanced and convenient tool, offering robustness and adaptability for conventional multi‐omic analyses.
Keywords: agent, bioinformatics, large language model, omics analysis
This study introduces AutoBA, an AI agent designed for fully automated multi‐omic analyses based on large language models. AutoBA simplifies bioinformatics tasks with minimal user input, offering detailed analytical plans and supporting both online and local usage for enhanced data security. AutoBA features an automated code repair mechanism and adapts to emerging tools, providing robust and versatile bioinformatics analysis.

1. Introduction
Bioinformatics is an interdisciplinary field that encompasses computational, statistical, and biological approaches to analyze, understand and interpret complex biological data.[ 1 , 2 , 3 ] With the rapid growth of gigabyte‐sized biological data generated from various high‐throughput technologies, bioinformatics has become an essential tool for researchers to make sense of these massive datasets and extract meaningful biological insights. The applications of bioinformatics typically cover diverse fields such as genome analysis,[ 4 , 5 ] structural bioinformatics,[ 6 , 7 ] systems biology,[ 8 ] data and text mining,[ 9 , 10 ] phylogenetics,[ 11 , 12 ] and population analysis,[ 13 ] which has further enabled significant advances in personalized medicine[ 14 ] and drug discovery.[ 15 ]
In broad terms, bioinformatics can be categorized into two primary domains: the development of innovative algorithms to address various biological challenges,[ 16 , 17 , 18 , 19 , 20 ] and the application of established tools to analyze extensive biological datasets,[ 21 , 22 ] especially high‐throughput sequencing data. Developing new bioinformatics software requires a substantial grasp of biology and programming expertise. Alongside the development of novel computational methods, one of the most prevalent applications of bioinformatics is the investigation of biological data using existing tools and pipelines,[ 23 , 24 ] which typically involves a sequential, flow‐based analysis of omics data, encompassing a variety of dataset types such as whole genome sequencing (WGS),[ 25 ] whole exome sequencing (WES), RNA sequencing (RNA‐seq),[ 26 ] single‐cell RNA‐seq (scRNA‐Seq),[ 27 ] assay for transposase‐accessible chromatin with sequencing (ATAC‐Seq),[ 28 ] ChIP‐seq,[ 29 ] and spatial transcriptomics.[ 30 ]
For example, the conventional analytical framework for bulk RNA‐seq involves a meticulously structured sequence of computational steps.[ 31 ] This intricate pipeline reveals its complexity through a series of carefully orchestrated stages. It begins with quality control,[ 32 ] progresses to tasks such as adapter trimming[ 33 ] and the removal of low‐quality reads, and then moves on to critical steps like genome or transcriptome alignment.[ 34 ] Furthermore, it extends to some advanced tasks, including the identification of splice junctions,[ 35 ] quantification through read counting,[ 36 ] and the rigorous examination of differential gene expression.[ 37 ] Moreover, the pipeline delves into the intricate domain of alternative splicing[ 38 ] and isoform analysis.[ 39 ] This progressive journey ultimately ends in downstream tasks like the exploration of functional enrichment,[ 40 ] providing a comprehensive range of analytical pursuits. Compared to bulk RNA‐seq, ChIP‐seq involves distinct downstream tasks, such as peak calling,[ 41 ] motif discovery,[ 42 ] and peak annotation.[ 43 ] In summary, the analysis of different types of omics data requires professional skills and a thorough understanding of the corresponding field, particularly for customized data analysis. Moreover, methods and pipelines may vary across bioinformaticians, and they may even evolve as more advanced algorithms are developed.
Meanwhile, online semi‐automatic bioinformatics analysis platforms are currently in vogue,[ 44 ] such as iDEP,[ 45 ] ICARUS[ 46 ] and STellaris.[ 47 ] However, they often necessitate the uploading of either raw data or pre‐processed statistics by users, which could potentially give rise to additional privacy concerns and data leakage risks.[ 48 ]
In the context described above, the bioinformatics community grapples with essential concerns regarding the standardization, portability, and reproducibility of analysis pipelines.[ 49 , 50 , 51 ] Moreover, achieving proficiency in utilizing these pipelines for data analysis demands additional training, posing challenges for many wet lab researchers due to its potential complexity and time‐consuming nature. Even dry‐lab researchers may find the repetitive process of running and debugging these pipelines to be quite tedious.[ 52 ] Meanwhile, bioinformatics data analysis training incurs substantial costs. The elevated expenses associated with training in bioinformatics data analysis could be attributed to the highly specialized nature of the field, the need for multi‐modal data analysis, the evolution of technologies, restricted computing resources, the expense of training materials and tools, as well as the operational costs of training institutions. These factors collectively contribute to the high cost of bioinformatics training.[ 53 ] Consequently, there is a growing anticipation within the community for the development of a more user‐friendly, low‐code, multi‐functional, automated, and natural language‐driven intelligent tool tailored for end‐to‐end bioinformatics analysis. Such a tool has the potential to generate significant excitement and benefit researchers across the field.
Over the past few months, the rapid advancement of Large Language Models (LLMs)[ 54 ] has raised substantial expectations for the enhancement of scientific research, particularly in the field of biology.[ 55 , 56 , 57 ] These advancements hold promise for applications such as disease diagnosis,[ 58 , 59 , 60 , 61 ] drug discovery,[ 62 ] and more. In the realm of bioinformatics, LLMs, such as ChatGPT, also demonstrate immense potential in tasks related to bioinformatics education[ 63 ] and code generation.[ 64 ] While researchers have found ChatGPT to be a valuable tool in facilitating bioinformatics research, such as data analysis, there remains a strong requirement for human intervention in the execution process. ChatGPT shows sensitivity to the nuances of user queries, resulting in diverse responses based on the prompts, which is why prompt engineering is receiving considerable attention.[ 65 ] Given the specialized nature of bioinformatics tools, ChatGPT is also susceptible to potential issues, such as misinterpreting parameters, errors in software utilization, and other bugs that may arise during code generation. Users may therefore need to engage with ChatGPT continuously, in a repeated cycle of inquiry, code generation, execution, and debugging, to ensure the desired performance. AutoGPT,[ 66 ] as a recently developed, advanced, and experimental open‐source autonomous AI agent, has the capacity to string together LLM‐generated “thoughts” to autonomously achieve user‐defined objectives. Nevertheless, given the intricate and specialized nature of bioinformatics tasks, such as their reliance on specialized software, the direct application of AutoGPT in this field still presents significant challenges. Notably, it faces difficulties in effectively managing the intricate software requirements of bioinformatics, encompassing tasks such as installation, software calls, and parameter settings.
In this study, we introduce Automated Bioinformatics Analysis (AutoBA), an autonomous AI agent tailored for comprehensive and conventional multi‐omic analyses, as it can be applied to the analysis of different omics datasets. AutoBA simplifies user interactions to just three inputs: data path, data description, and the final objective. This tool autonomously proposes analysis plans, generates code, executes the code, and conducts subsequent data analysis by using our well‐designed prompts. We implemented AutoBA as open‐source software that offers multiple LLM backends, with options for both online and local usage, prioritizing data security and user privacy (Figure 1 ). To show the reliability of AutoBA, we tested it in a large number of real‐world multi‐omic analysis scenarios (Figure 2 ). AutoBA, serving as an AI agent tailored for bioinformatics data analysis, could address the surging demand for streamlined multi‐omics data analysis, mitigate the financial challenges associated with bioinformatics training, and cater to diverse customization requirements. Compared with online bioinformatic services, AutoBA offers multiple LLM backends, with options for both online and local usage, prioritizing data security and user privacy (Table 1 ). In comparison to ChatGPT and open‐source LLMs, we have designed an automated code repair mechanism in AutoBA to improve its stability in automated end‐to‐end bioinformatics analysis tasks. Moreover, unlike predefined pipelines, AutoBA can adapt in step with emerging bioinformatics tools. In summary, AutoBA is the first agent of this kind and represents a significant leap in the application of LLMs and automated AI agents within the domain of bioinformatics, highlighting their potential to accelerate future research in this field.
Figure 1.

Design of AutoBA. AutoBA is the first autonomous AI agent designed for conventional multi‐omic analyses. It requires minimal user input (data path, data description, and the final objective) while delivering detailed step‐by‐step plans for various bioinformatics tasks. With these inputs, it autonomously proposes analysis plans, generates code, executes the code, and conducts subsequent data analysis by using our well‐designed prompts. AutoBA was implemented as open‐source software that offers multiple LLM backends, with options for both online and local deployment, prioritizing data security and user privacy and offering a streamlined and efficient solution for bioinformatics tasks. Step 1 and Step 3 require human intervention, while Step 2 requires none. Because the specific items are too numerous and complex to enumerate in the available space, they are represented with ellipses.
Figure 2.

Method design and evaluation of AutoBA. a) AutoBA workflow design and technical details. b) Pie chart indicates the number of all cases used for validating AutoBA.
Table 1.
Qualitative comparison of AutoBA against other methods. All methods were conceptually assessed with seven metrics, including user‐friendliness, time efficiency, diminished human intervention (degree of automation), ease of redevelopment, generality, robustness, and privacy considerations. * denotes a category of methods rather than a specific one. For instance, Online Webserver refers to platforms like iDEP and ICARUS, and Open Source LLMs includes models such as Llama2 and CodeLlama.
| Methods | Easy to Master | Save Time | Reduced Human Intervention | Redevelopment | Generalizability | Robustness | Privacy |
|---|---|---|---|---|---|---|---|
| AutoBA | ✓✓✓ | ✓✓✓ | ✓✓✓ | ✓✓✓ | ✓✓✓ | ✓✓✓ | ✓✓✓ |
| Conventional Bioinformatics Tools | ✓ | ✓ | – | ✓✓ | ✓ | ✓✓ | ✓✓✓ |
| Online Webserver ∗ | ✓✓ | ✓✓ | – | – | ✓ | ✓ | – |
| AutoGPT | ✓✓✓ | ✓✓ | ✓✓ | ✓✓ | ✓✓✓ | ✓ | ✓✓✓ |
| ChatGPT | ✓✓✓ | ✓✓ | ✓ | ✓✓ | ✓✓✓ | ✓ | ✓✓ |
| Open source LLMs ∗ | ✓ | ✓✓ | ✓ | ✓✓ | ✓✓✓ | ✓ | ✓✓✓ |
2. Experimental Section
2.1. The Overall Framework Design of AutoBA
AutoBA is the first autonomous AI agent tailor‐made for conventional multi‐omic analyses. As illustrated in Figure 1, conventional bioinformatics typically entails the use of pipelines to analyze diverse data types such as WGS, WES, RNA‐seq, single‐cell RNA‐seq, ChIP‐seq, ATAC‐seq, spatial transcriptomics, and more, all requiring the utilization of various tools. Users are traditionally tasked with selecting the appropriate tools based on their specific analysis needs. In practice, this process involves configuring the environment, installing software, writing code, and debugging, which are time‐consuming and labor‐intensive.
With the advent of AutoBA, this labor‐intensive process is revolutionized. Users are relieved from the burden of dealing with multiple software packages and need only provide three key inputs in YAML format: the data path (e.g., /data/SRR1374921.fasta.gz), data description (e.g., single‐end reads in condition A), and the ultimate analysis goal (e.g., identify differentially expressed genes). AutoBA takes over by autonomously analyzing the data, generating comprehensive step‐by‐step plans, composing code for each step, executing the generated code, and conducting in‐depth analysis. Depending on the complexity and difficulty of the tasks, users can expect AutoBA to complete the tasks within a matter of minutes to a few hours, all without the need for additional human intervention (Table 2 and Figure 2).
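As a rough illustration of the three inputs described above, the following sketch renders them as a small YAML document. The key names (`data_list`, `goal_description`) and the helper `render_config` are assumptions made for this example; they are not taken from AutoBA's actual configuration schema, which should be checked in the released code.

```python
from dataclasses import dataclass


@dataclass
class DataItem:
    path: str          # location of the input file
    description: str   # free-text description of the file


def render_config(items: list[DataItem], goal: str) -> str:
    """Render the three user inputs as hand-formatted YAML (hypothetical keys)."""
    lines = ["data_list:"]
    for item in items:
        # Each entry pairs a data path with its description.
        lines.append(f"  - '{item.path}: {item.description}'")
    lines.append(f"goal_description: '{goal}'")
    return "\n".join(lines) + "\n"


# Values taken from the examples quoted in the text above.
config_text = render_config(
    [DataItem("/data/SRR1374921.fasta.gz", "single-end reads in condition A")],
    "identify differentially expressed genes",
)
print(config_text)
```

The sketch formats the YAML by hand to stay dependency-free; values containing single quotes would need escaping, which a real YAML library handles.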
Table 2.
Summary of AutoBA application scenarios in bioinformatics multi‐omics analysis. The table displays a comprehensive list of 40 real‐world cases utilized to assess AutoBA, providing information on the class of the cases, the respective task name, and the corresponding case ID.
| Bioinformatics Pipelines | Tasks | Types of Omics | Case ID |
|---|---|---|---|
| WGS data analysis | Genome assembly | Genomics | 1.1 |
| WGS/WES data analysis | Somatic SNV+indel calling | Genomics | 2.1 |
| WGS/WES data analysis | Somatic SNV+indel calling and annotation | Genomics | 2.2 |
| WGS/WES data analysis | Structure variation identification with normal | Genomics | 2.3 |
| WGS/WES data analysis | Structure variation identification without normal | Genomics | 2.4 |
| ChIP‐seq data analysis | Peak calling | Genomics | 3.1 |
| ChIP‐seq data analysis | Motif discovery for binding sites | Genomics | 3.2 |
| ChIP‐seq data analysis | Functional enrichment of target gene | Genomics | 3.3 |
| Bisulfite‐Seq data analysis | Identifying DNA methylation | Genomics | 4.1 |
| ATAC‐seq data analysis | Identifying open chromatin regions | Genomics | 5.1 |
| DNase‐seq data analysis | Identifying DNase I hypersensitive sites | Genomics | 6.1 |
| 4C‐seq data analysis | Find genomic interactions | Genomics | 7.1 |
| Nanopore DNA sequencing data analysis | Genome assembly | Genomics | 8.1 |
| Nanopore DNA sequencing data analysis | Tandem repeats variation identification | Genomics | 8.2 |
| PacBio DNA sequencing data analysis | Genome assembly | Genomics | 9.1 |
| RNA‐Seq data analysis | Find Differentially expressed genes | Transcriptomics | 10.1 |
| RNA‐Seq data analysis | Identify the top5 downregulated genes | Transcriptomics | 10.2 |
| RNA‐Seq data analysis | Predict Fusion gene with annotation | Transcriptomics | 10.3 |
| RNA‐Seq data analysis | Isoform expression | Transcriptomics | 10.4 |
| RNA‐Seq data analysis | Splicing analysis | Transcriptomics | 10.5 |
| RNA‐Seq data analysis | APA analysis | Transcriptomics | 10.6 |
| RNA‐Seq data analysis | RNA editing | Transcriptomics | 10.7 |
| RNA‐Seq data analysis | Circular RNA identification | Transcriptomics | 10.8 |
| Small RNA sequencing data analysis | microRNA quantification | Transcriptomics | 11.1 |
| Small RNA sequencing data analysis | microRNA prediction | Transcriptomics | 11.2 |
| CAGE‐seq data analysis | TSS identification | Transcriptomics | 12.1 |
| 3’ end‐seq data analysis | PAS (polyadenylation site) identification | Transcriptomics | 13.1 |
| Nanopore RNA sequencing data analysis | Isoform expression | Transcriptomics | 14.1 |
| PacBio RNA sequencing data analysis | Isoform expression | Transcriptomics | 15.1 |
| CLIP‐seq data analysis | Identify protein‐RNA crosslink sites | Transcriptomics | 16.1 |
| RIP‐seq data analysis | Find enriched genes bound by RBP | Transcriptomics | 16.2 |
| Ribo‐seq data analysis | Identify translated ORFs | Transcriptomics | 17.1 |
| single‐cell RNA‐seq data analysis | Cell clustering from fastq data | Transcriptomics | 18.1 |
| single‐cell RNA‐seq data analysis | Find differentially expressed genes based on count matrix | Transcriptomics | 18.2 |
| single‐cell RNA‐seq data analysis | Find marker genes based on count matrix | Transcriptomics | 18.3 |
| single‐cell RNA‐seq data analysis | Cell clustering and visualization | Transcriptomics | 18.4 |
| Spatial transcriptomics | Neighborhood enrichment analysis | Transcriptomics | 19.1 |
| Spatial transcriptomics | Single‐cell mapping | Transcriptomics | 19.2 |
| Mass spectrometry data analysis | Protein expression quantification | Proteomics | 20.1 |
| Mass spectrometry data analysis | Metabolites quantification | Metabolomics | 21.1 |
2.2. Prompt Engineering of AutoBA
To initiate AutoBA, users provide three essential inputs: the data path, the data description, and the analysis objective described above. AutoBA comprises three distinct phases: the planning phase, the code generation phase, and the execution phase, as shown in Step 2 of Figure 1. During the planning phase, AutoBA outlines a comprehensive step‐by‐step analysis plan. This plan includes details such as the software name and version to be used at each step, along with guided actions and specific sub‐tasks for each stage. Subsequently, in the code generation phase, AutoBA systematically follows the plan and generates code for each sub‐task, which entails procedures like configuring the environment, installing the necessary software, and writing code. Then, in the execution phase, AutoBA executes the generated code. In light of this workflow, AutoBA incorporates two distinct prompts: one tailored for the planning phase and the other for the code generation phase. Extensive experiments have shown that these two sets of prompts are essential for the proper functioning of AutoBA in automated bioinformatics analysis tasks.
The prompts for both the planning phase and the code generation phase are displayed in the Supporting Information. In both prompt designs, the term blacklist pertains to the user's personalized list of prohibited software. The current default blacklist contains several tools that frequently caused errors during the testing processes. Meanwhile, data list encompasses the inputs necessary for AutoBA, encompassing data paths and data descriptions. The term current goal serves as the final objective during the planning phase and as the sub‐goal in the execution phase, while history summary encapsulates AutoBA's memory of previous actions and information.
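To make the role of the four fields concrete, the sketch below fills a planning prompt from a blacklist, a data list, a current goal, and a history summary. The template wording, the function `build_prompt`, and the blacklist entry `some_tool` are hypothetical; the actual prompts are those given in the Supporting Information.

```python
# Hypothetical template; the real AutoBA prompt text differs.
PLANNING_TEMPLATE = (
    "You are a bioinformatics assistant.\n"
    "Do not use any of the following software: {blacklist}.\n"
    "Input files (path: description): {data_list}.\n"
    "History: {history_summary}\n"
    "Current goal: {current_goal}\n"
    "Propose a step-by-step plan, naming the software used at each step."
)


def build_prompt(template, blacklist, data_list, current_goal, history_summary=""):
    """Substitute the four fields into a prompt template."""
    return template.format(
        blacklist=", ".join(blacklist) or "none",
        data_list="; ".join(data_list),
        history_summary=history_summary or "none",
        current_goal=current_goal,
    )


prompt = build_prompt(
    PLANNING_TEMPLATE,
    blacklist=["some_tool"],  # hypothetical blacklist entry
    data_list=["./data/SRR1374921.fastq.gz: single-end mouse RNA-seq reads, LoGlu group"],
    current_goal="identify differentially expressed genes",
)
print(prompt)
```

The same function could serve the code generation phase by passing a different template and using a sub-task as `current_goal`.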
2.3. Memory Management of AutoBA
A memory mechanism is incorporated within AutoBA to enable it to generate code more effectively by drawing from past actions, thus avoiding unnecessary repetition of certain steps. AutoBA meticulously logs the outcome of each step in a specific format, and all these historical records become part of the input for the subsequent prompt. In the planning phase, memories are structured as follows: “First, you provided input in the format ‘file path: file description’ in a list: <data list>. You devised a detailed plan to accomplish your overarching objective. Your overarching goal is <global goal>. Your plan involves <tasks>.” In the code generation phase, memories follow this format: “Then, you successfully completed the task: <task> with the corresponding code: <code>.”
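The two memory formats quoted above can be sketched as a small class that accumulates records and joins them into a history summary. The class name `Memory` and its methods are illustrative assumptions; only the sentence templates come from the text.

```python
class Memory:
    """Accumulates history records using the two formats quoted in the text."""

    def __init__(self, data_list, global_goal, tasks):
        # Planning-phase record: inputs, overarching goal, and planned tasks.
        self.records = [
            "First, you provided input in the format 'file path: file description' "
            f"in a list: {'; '.join(data_list)}. You devised a detailed plan to "
            "accomplish your overarching objective. Your overarching goal is "
            f"{global_goal}. Your plan involves {tasks}."
        ]

    def add_completed(self, task, code):
        # Code-generation record: one entry per successfully completed sub-task.
        self.records.append(
            f"Then, you successfully completed the task: {task} "
            f"with the corresponding code: {code}."
        )

    def summary(self):
        # The concatenated records become part of the next prompt.
        return " ".join(self.records)


memory = Memory(
    ["./data/SRR1374921.fastq.gz: single-end mouse RNA-seq reads"],
    "identify differentially expressed genes",
    "adapter trimming, alignment, read counting, differential expression",
)
memory.add_completed("adapter trimming", "conda install -y trimmomatic")
print(memory.summary())
```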
2.4. Automatic Code Repair of AutoBA
AutoBA incorporates an automatic code repair (ACR) module designed to streamline the debugging process and enhance the reliability of generated code. During the code execution phase, AutoBA identifies errors from the standard error (stderr) and standard output (stdout) streams. Once an error is detected, it is integrated into the prompt for code regeneration, and this cycle repeats until the generated code executes without errors.
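A minimal sketch of such a repair loop is shown below. It makes several assumptions that are not stated in the text: a non-zero exit code is used as the error signal, the number of attempts is capped by a hypothetical `max_attempts` parameter, and `toy_generator` stands in for the LLM call that regenerates the code.

```python
import subprocess
from typing import Callable, Optional


def run_with_repair(
    generate: Callable[[Optional[str]], str],
    max_attempts: int = 3,
) -> tuple[bool, str]:
    """Regenerate and rerun a bash script until it succeeds or attempts run out."""
    error: Optional[str] = None
    script = ""
    for _ in range(max_attempts):
        # The generator sees the previous error text (None on the first attempt).
        script = generate(error)
        result = subprocess.run(
            ["bash", "-c", script], capture_output=True, text=True
        )
        if result.returncode == 0:
            return True, script
        # Feed both streams back into the next generation round.
        error = (result.stderr + result.stdout).strip()
    return False, script


def toy_generator(error: Optional[str]) -> str:
    # Stand-in for an LLM call: the first draft fails, the regenerated draft succeeds.
    if error is None:
        return "echo 'genome index not found' >&2; exit 1"
    return "echo 'alignment finished'"


ok, final_script = run_with_repair(toy_generator)
print(ok, final_script)
```

Capping the attempts is a design choice for the sketch: it prevents an endless loop when the generator cannot produce working code.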
2.5. Evaluation of AutoBA
The results produced by AutoBA undergo thorough validation by bioinformatics experts. This validation process encompasses a comprehensive review of the proposed plans, the generated code, the execution of the code, and confirmation of the results for accuracy and reliability. AutoBA's development and validation are built upon a specific environment and software stack, which includes Ubuntu version 20.04, Python 3.10.0, and openai version 0.27.6. These environment and software specifications form the foundation for AutoBA's functionality in the field of bioinformatics. To further assess the usability of AutoBA, a comparative analysis involving the following methods was conducted: 1) AutoBA (w/o ACR, online with ChatGPT‐4), 2) AutoBA (with ACR, online with ChatGPT‐4), 3) AutoBA (w/o ACR, offline with CodeLlama‐34B‐Instruct), 4) AutoBA (with ACR, offline with CodeLlama‐34B‐Instruct), 5) AutoGPT, 6) ChatGPT‐3.5, 7) ChatGPT‐4 and 8) CodeLlama‐34B‐Instruct. Given that prompt engineering and workflow design are distinctive innovations of AutoBA, user behavior was emulated during the evaluation of AutoGPT, ChatGPT‐3.5, ChatGPT‐4, and CodeLlama‐34B‐Instruct by using a generalized and uniform prompt, as shown in the Supporting Information.
2.6. Online and Local LLM Backends of AutoBA
AutoBA offers several versions of LLM backends, including online backends based on ChatGPT‐3.5 and ChatGPT‐4, and local LLMs, including CodeLlama‐7B‐Instruct, CodeLlama‐13B‐Instruct, CodeLlama‐34B‐Instruct,[ 67 ] Llama‐2‐7b‐chat, Llama‐2‐13b‐chat and Llama‐2‐70b‐chat.[ 68 ]
2.7. Security and Safety of AutoBA
AutoBA incorporates a sandbox mode to establish a secure and isolated environment for conducting analyses. This mode encapsulates the analysis processes, effectively shielding the underlying system from potential threats. Meanwhile, AutoBA imposes restrictions on system commands throughout the execution phase, thereby reducing the risk of malicious commands being executed within the environment. Additionally, AutoBA leverages Docker containerization, introducing an extra layer of security to further fortify the overall system integrity. Furthermore, Docker containerization simplifies the installation process, contributing to a reduction in learning costs for users.
A workstation with 252 GB RAM, 112 CPU cores, and 1 Nvidia A100 GPU was used for all experiments. AutoBA was developed with Python 3.10 and CUDA 12.0. CUDA is a parallel computing platform and programming model developed by NVIDIA. AutoBA utilizes LLMs as its core for generating analysis plans and code, and these models benefit greatly from GPU acceleration: CUDA speeds up model computation, leading to quicker generation of results and more efficient handling of complex natural language processing tasks. A detailed list of dependencies can be found in the code availability section. The online version operates without the need for a GPU, while the offline version requires GPU support (7B: 12.55 GB, 13B: 24 GB, 34B: 63 GB, 70B: 74 GB). The sizes 7B, 13B, 34B, and 70B indicate the number of parameters in the models (B stands for billion), and the value listed next to each size is the corresponding GPU memory requirement in gigabytes (GB).
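For readers choosing a local backend, the GPU memory figures quoted above can be turned into a simple lookup. The sketch below is only an illustration: the function `largest_model_that_fits` is hypothetical, and the available-memory values in the usage lines are example inputs rather than measurements from the study.

```python
# GPU memory requirements (GB) per model size, as listed in the text.
GPU_MEMORY_GB = {"7B": 12.55, "13B": 24.0, "34B": 63.0, "70B": 74.0}


def largest_model_that_fits(available_gb: float):
    """Return the largest model size whose requirement fits, or None."""
    fitting = [
        (need, size) for size, need in GPU_MEMORY_GB.items() if need <= available_gb
    ]
    # Tuples compare by memory requirement first, so max() picks the largest model.
    return max(fitting)[1] if fitting else None


for gb in (8, 40, 80):
    print(gb, largest_model_that_fits(gb))
```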
2.8. Statistical Analysis
During the evaluation of AutoBA, all experiments were repeated at least three times in the same computational environment and on the same data. Statistical analysis was performed using Python packages.
3. Results
3.1. AutoBA Proposes Detailed Analysis Plans for Tasks
AutoBA offers a robust capability to generate a highly detailed and customized analysis plan, leveraging the user's input, which encompasses critical elements such as data paths, data descriptions, and objective descriptions.
As an example, in Figure 3 , the user supplied four RNA‐Seq samples: two from the LoGlu group (SRR1374921.fastq.gz and SRR1374922.fastq.gz, mouse pancreatic islets cultured at low ambient glucose) and two from the HiGlu group (SRR1374923.fastq.gz and SRR1374924.fastq.gz, mouse pancreatic islets cultured at high ambient glucose) from Benner et al.’s paper.[ 69 ] Additionally, the user also provided the mouse reference genome (mm39.fa) and genome annotation (mm39.ncbiRefSeq.gtf). The primary objective of this case was to identify differentially expressed genes between the two data groups. Using textual inputs only, AutoBA generated a detailed, step‐by‐step analysis plan during the planning phase, as outlined below:
Figure 3.

Example of applying AutoBA to find differentially expressed genes with RNA‐Seq data. In this case, the user provided four RNA‐seq samples, comprising two from the LoGlu group and two from the HiGlu group. The primary objective of this analysis was to identify differentially expressed genes between two groups. During the automated analysis prior to the alignment run by AutoBA, the ACR module identified an error where Hisat2 did not first index the genome. Consequently, the ACR module automatically corrected the code (highlighted in blue). This case demonstrates that AutoBA can be used for end‐to‐end bioinformatics analysis. Users only need to provide the simplest inputs, and AutoBA will deliver the final analysis results.
First, I will use the “Trimmomatic” software to trim the adapters from the single‐end mouse rna‐seq reads. The input files for this step will be “./data/SRR1374921.fastq.gz”, “./data/SRR1374922.fastq.gz”, “./data/SRR1374923.fastq.gz”, “./data/SRR1374924.fastq.gz” and the adapter file “./data/TruSeq3‐SE.fa”.
Next, I will use the “Hisat2” software to align the trimmed reads to the mouse mm39 genome. The input files for this step will be the output files from the previous step and the genome file “./data/mm39.fa”.
Then, I will use the “Samtools” software to convert the alignment output files from Hisat2 (in SAM format) to BAM format.
After that, I will use the “HTSeq” software to count the number of reads mapped to each gene. The input files for this step will be the BAM files from the previous step and the annotation file “./data/mm39.ncbiRefSeq.gtf”.
Finally, I will use the “DESeq2” software to identify the differentially expressed genes. The input files for this step will be the count files from the previous step.
Within each step of the plan outlined above, AutoBA provides precise instructions regarding the required software, including names like Trimmomatic, Hisat2, Samtools, HTSeq, and DESeq2, along with clear sub‐tasks for each analytical stage. This level of tailored planning ensures that the analysis process aligns precisely with the user's objectives, promoting both efficiency and accuracy in data processing and results generation.
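One way to picture how such a plan feeds the code generation phase is to hold it as an ordered list of steps, each of which becomes a sub-goal in turn. The `PlanStep` structure and the `sub_goals` helper below are assumptions for illustration; the step contents are paraphrased from the plan above.

```python
from dataclasses import dataclass


@dataclass
class PlanStep:
    software: str  # tool named by the plan for this step
    task: str      # sub-task the tool is used for


plan = [
    PlanStep("Trimmomatic", "trim adapters from the single-end reads"),
    PlanStep("Hisat2", "align the trimmed reads to the mouse mm39 genome"),
    PlanStep("Samtools", "convert the SAM alignments to BAM format"),
    PlanStep("HTSeq", "count the reads mapped to each gene"),
    PlanStep("DESeq2", "identify the differentially expressed genes"),
]


def sub_goals(steps):
    """Turn each plan step into a numbered sub-goal string."""
    return [
        f"Step {i}: use {step.software} to {step.task}"
        for i, step in enumerate(steps, start=1)
    ]


for goal in sub_goals(plan):
    print(goal)
```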
3.2. AutoBA Generates Precise Codes for Sub‐Tasks
During the code generation phase, AutoBA generates bash code for every sub‐task of the plan established in the planning phase. These scripts encompass environment setup, software installation, and tailored code for software utilization, with software‐specific parameters and data paths filled in. As exemplified in Figure 3, the first step of the differentially expressed genes (DEG) analysis is adapter trimming, an indispensable preprocessing step for raw RNA‐Seq data. For this step, AutoBA automatically generated code that activates the conda environment, installs the software packages, and calls the software to analyze the data, as shown below:
source activate abc_runtime
conda config --add channels conda-forge
conda config --add channels bioconda
conda install -y trimmomatic
trimmomatic SE -phred33 ./data/SRR1374921.fastq.gz ./output/SRR1374921_trimmed.fastq.gz ILLUMINACLIP:./data/TruSeq3-SE.fa:2:30:10
trimmomatic SE -phred33 ./data/SRR1374922.fastq.gz ./output/SRR1374922_trimmed.fastq.gz ILLUMINACLIP:./data/TruSeq3-SE.fa:2:30:10
trimmomatic SE -phred33 ./data/SRR1374923.fastq.gz ./output/SRR1374923_trimmed.fastq.gz ILLUMINACLIP:./data/TruSeq3-SE.fa:2:30:10
trimmomatic SE -phred33 ./data/SRR1374924.fastq.gz ./output/SRR1374924_trimmed.fastq.gz ILLUMINACLIP:./data/TruSeq3-SE.fa:2:30:10
The generated code further underwent a thorough validation process, which included manual verification and execution by experienced bioinformaticians, as documented in Table 3 and the Supporting Information. This critical validation step verifies the accuracy and reliability of the code and supports the robustness of AutoBA.
Table 3.
Summary of AutoBA (w/o ACR) generated results evaluated by bioinformatics experts. The table presents an assessment conducted by bioinformatics experts on the analysis plan proposed by AutoBA, along with the generated codes and the code execution. If the evaluation passes, it is displayed as success, while instances of failure are accompanied by detailed explanations of the specific reasons for the failure. Additionally, we provide a summary of the software tools automatically chosen by AutoBA for each case, as well as the total time taken to generate the corresponding code.
| Case ID | Propose Plans | Generate Codes | Execute Codes | Tools Used | Time Cost (without Executing Codes) in Minutes |
|---|---|---|---|---|---|
| 1.1 | Success | Success | Success | FastQC, Trimmomatic[ 76 ], SPAdes[ 77 ], QUAST[ 78 ] | 3 |
| 2.1 | Success | Success | Success | FastQC, Trimmomatic, BWA[ 79 ], Samtools[ 80 ], GATK[ 81 ] | 8 |
| 2.2 | Success | Success | Success | FastQC, Trimmomatic, BWA, Samtools, GATK, ensembl‐vep[ 82 ] | 8 |
| 2.3 | Success | Success | Success | FastQC, Trimmomatic, BWA, Samtools, GATK, manta[ 73 ] | 18 |
| 2.4 | Success | Success | Failed: pindel requires configuration file | FastQC, Trimmomatic, BWA, Samtools, pindel[ 74 ], SnpEff[ 83 ] | 6 |
| 3.1 | Success | Success | Success | FastQC, Trim Galore, Bowtie 2[ 84 ], Samtools, MACS2[ 85 ], BEDTools, IGV | 6 |
| 3.2 | Success | Success | Success | FastQC, Trim Galore, Bowtie2, MACS2, HOMER, MEME[ 86 ] | 4 |
| 3.3 | Failed: DESeq2 is not suitable for peaks identified by MACS2 | – | – | FastQC, BWA, MACS, BEDTools[ 87 ], DESeq2[ 88 ], g:Profiler[ 89 ], R[ 90 ] | 6 |
| 4.1 | Success | Success | Success | Trim Galore, Bismark[ 91 ], IGV[ 92 ] | 9 |
| 5.1 | Success | Success | Failed: (wrongly used BEDTools) | Trim Galore, BWA, Samtools, MACS2, BEDTools | 8 |
| 6.1 | Success | Success | Success | FastQC, Cutadapt, BWA, MACS2, IGV, GREAT[ 93 ] | 5 |
| 7.1 | Success | Success | Success | FastQC, BEDTools, Samtools, Bowtie 2, R | 6 |
| 8.1 | Success | Success | Failed: Racon and Medaka were called with wrong parameters | canu[ 94 ], Minimap2[ 72 ], Racon[ 95 ], Flye[ 96 ], Medaka, Bandage[ 97 ] | 7 |
| 8.2 | Failed: cannot find a correct pipeline | – | – | Minimap2, Samtools, trf[ 98 ] | 7 |
| 9.1 | Success | Failed: installed the wrong tool, pb‐falcon rather than falcon | – | Canu, FALCON[ 99 ], Quiver, MUMmer[ 100 ] | 7 |
| 10.1 | Success | Success | Success | FASTQC, Trimmomatic, HISAT2[ 71 ], htseq[ 101 ], DESeq2 | 5 |
| 10.2 | Success | Success | Success | FASTQC, Trimmomatic, HISAT2, htseq, DESeq2, gprofileR | 5 |
| 10.3 | Success | Success | Success | gunzip, HISAT2, fusioncatcher[ 102 ], gffcompare[ 103 ] | 6 |
| 10.4 | Success | Success | Success | Trim Galore, HISAT2, Samtools, StringTie | 5 |
| 10.5 | Success | Success | Success | Trimmomatic, HISAT2, Samtools, StringTie, featureCounts[ 36 ], rMATs[ 38 ] | 6 |
| 10.6 | Success | Failed: DaPars (not available in conda) | – | Trim Galore, HISAT2, StringTie, DaPars[ 104 ] | 7 |
| 10.7 | Failed: cannot find a correct pipeline | – | – | FastQC, Trimmomatic, HISAT2, Samtools, StringTie, Ballgown[ 105 ], GATK | 7 |
| 10.8 | Success | Failed: CIRI2 (not available in conda) | – | Trim Galore, HISAT2, CIRI2[ 106 ], CIRIQuant[ 107 ] | 5 |
| 11.1 | Success | Success | Success | FastQC, Cutadapt, Bowtie, Samtools, subread/featureCounts, DESeq2, edgeR[ 108 ] | 11 |
| 11.2 | Success | Success | Failed: the conda package of miRDeep2 is problematic | FastQC, Cutadapt, Bowtie, Samtools, featureCounts, miRDeep2, DESeq2, edgeR | 11 |
| 12.1 | Success | Success | Success | FastQC, Trimmomatic, HISAT2, HTSeq/htseq‐count, CAGEr[ 109 ] | 6 |
| 13.1 | Failed: cannot find a correct pipeline | – | – | Trim Galore, HISAT2, StringTie, DaPars | 5 |
| 14.1 | Success | Success | Failed: prepDE.py should not be invoked as 'python prepDE.py' | Minimap2, Samtools, StringTie, DESeq2 | 9 |
| 15.1 | Success | Success | Success | Minimap2, Samtools, StringTie, cufflinks[ 110 ] | 5 |
| 16.1 | Success | Success | Failed: the conda package of Piranha is problematic | FastQC, Cutadapt, Bowtie2, Samtools, BEDTools, Piranha | 6 |
| 16.2 | Success | Success | Success | FastQC, Trim Galore, HISAT2, htseq, DESeq2 | 4 |
| 17.1 | Success | Success | Failed: the conda package of RiboTaper is not in a regular form | FastQC, Trim Galore, HISAT2, Samtools, StringTie, RiboTaper[ 111 ] | 7 |
| 18.1 | Success | Success | Success | Cell Ranger, Seurat[ 112 ] | 5 |
| 18.2 | Success | Success | Success | Scanpy[ 113 ] | 8 |
| 18.3 | Success | Success | Success | Scanpy | 6 |
| 18.4 | Success | Success | Success | Scanpy | 5 |
| 19.1 | Success | Success | Success | Squidpy[ 114 ], AnnData | 5 |
| 19.2 | Success | Success | Success | AnnData, Scanpy, Tangram[ 115 ] | 3 |
| 20.1 | Success | Success | Success | proteowizard[ 116 ], OpenMS[ 117 ] | 15 |
| 21.1 | Success | Success | Success | pymzml[ 118 ], pandas, numpy, scipy | 13 |
| #Success | 36 | 33 | 26 | – | – |
3.3. AutoBA Adeptly Manages Similar Tasks with Robustness
In practical bioinformatics applications, analyses of the same data type, such as RNA‐Seq, often differ from one another. These differences stem mainly from the characteristics of the input data and from the distinct objectives of the analysis.
As exemplified in Case 10.1 (find differentially expressed genes), Case 10.2 (identify the top five down‐regulated genes in HiGlu group), and Case 10.3 (predict fusion genes), users performing RNA‐Seq analysis may have distinct final goals, which necessitates adjustments in software and parameter selection during the actual execution. Compared with Case 10.1, AutoBA introduces an additional step in Case 10.2 that screens the top five down‐regulated genes to fulfill the user's specific requirements, as shown in the code below:
Rscript -e "library('pheatmap'); library('DESeq2'); res <- read.csv('./examples/output/differential_expression_results.csv', row.names = 1); res_ordered <- res[order(res\$log2FoldChange),]; top5_downregulated <- head(res_ordered, 5)"
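The screening step above can also be sketched in Python. This is an illustrative equivalent rather than code generated by AutoBA: the gene names and values are made up, and the column names `gene`, `log2FoldChange`, and `padj` are assumed to follow a DESeq2‐style results table. Like the R one‐liner, it sorts by log2 fold change in ascending order and keeps the first five rows.

```python
import csv
import io

# Hypothetical DESeq2-style results; gene names and values are invented for illustration.
EXAMPLE_CSV = """gene,log2FoldChange,padj
GENE_A,-3.2,0.001
GENE_B,1.5,0.020
GENE_C,-0.4,0.300
GENE_D,-2.1,0.004
GENE_E,2.8,0.001
GENE_F,-1.7,0.010
GENE_G,-4.0,0.0005
GENE_H,0.2,0.900
"""


def top_downregulated(csv_text, n=5):
    """Return the n genes with the lowest log2 fold change, most negative first."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Skip rows whose fold change is missing (DESeq2 writes "NA" for such genes).
    usable = [r for r in rows if r["log2FoldChange"] not in ("", "NA")]
    usable.sort(key=lambda r: float(r["log2FoldChange"]))
    return [(r["gene"], float(r["log2FoldChange"])) for r in usable[:n]]


top5 = top_downregulated(EXAMPLE_CSV)
for gene, lfc in top5:
    print(f"{gene}\t{lfc}")
```

As in the R command, no significance filter is applied here; an analysis that also requires a threshold on the adjusted p‐value would add that condition before sorting.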
3.4. AutoBA Adjusts Analysis Based on Task and Input Data Variations
Alignment is an essential step in bioinformatic analysis, and multiple tools have been developed for distinct tasks. For instance, STAR[ 70 ] and HISAT2,[ 71 ] which are designed for RNA‐seq data analysis, are splice‐aware, which makes them efficient at identifying junction reads that map to two distal positions in the reference genome. Long‐read sequencing data from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) also require specialized alignment tools, of which Minimap2[ 72 ] is the most widely used. Moreover, each read from single‐cell sequencing data carries barcodes for the UMI and the cell label, which need to be handled together with the alignment; Cell Ranger is a popular software with this capacity. Bioinformatic analysis should therefore use an alignment tool appropriate to the type of task. Interestingly, we found that AutoBA has learned this knowledge and correctly selects the alignment tool for each task (Figure 4a).
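The pairing of data types and aligners described above can be written as a small lookup. This is only an illustration of the domain knowledge involved: the table restates the examples given in the text, the function name `suggest_aligners` is hypothetical, and AutoBA itself makes this choice through an LLM rather than through a fixed table.

```python
# Illustrative lookup only: the entries restate the examples given in the text.
ALIGNER_BY_DATA_TYPE = {
    "rna-seq": ["HISAT2", "STAR"],            # splice-aware short-read aligners
    "long-read": ["Minimap2"],                # PacBio / ONT reads
    "single-cell rna-seq": ["Cell Ranger"],   # handles cell barcodes and UMIs
}


def suggest_aligners(data_type):
    """Return the aligners named in the text for a given data type."""
    key = data_type.strip().lower()
    if key not in ALIGNER_BY_DATA_TYPE:
        raise KeyError(f"no aligner listed for data type: {data_type!r}")
    return ALIGNER_BY_DATA_TYPE[key]


for data_type in ("RNA-seq", "long-read", "single-cell RNA-seq"):
    print(data_type, "->", ", ".join(suggest_aligners(data_type)))
```

A fixed table like this has to be edited by hand whenever a new data type or tool appears, which is the maintenance burden that an LLM‐based planner is meant to avoid.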
Figure 4.

Results of AutoBA and the comparison with other methods. a) Heatmap illustrating options of utilizing different alignment tools for multiple tasks planned by AutoBA. b) AutoBA utilizes the tools for identifying structure variations in tumor samples with or without the matched normal samples. The highlight shows the difference between Goal 1 and Goal 2. c) Conceptual comparison of AutoBA with other methods in terms of human intervention. Orange indicates the need for human intervention, while green signifies an absence of human intervention (fully automated process). d) Evaluation of results generated by various methods by manually checking and executing codes and comparing them to standard analysis pipelines. Orange indicates a failure, and blue indicates a success.
For many bioinformatic analyses, multiple tools are available, but they place different requirements on the input. For instance, to identify structural variations from tumor WGS/WES data, the method “manta”[ 73 ] can analyze the tumor against the matched normal. In contrast, tools like “Pindel”,[ 74 ] which rely on the detection of breakpoints relative to the reference genome, analyze only the tumor samples. We found that AutoBA automatically selected “manta” when matched normal samples were provided and correctly used the parameters “--normalBam” and “--tumorBam”. When only tumor samples were provided in the input data, AutoBA selected “Pindel” for the analysis (Figure 4b). These results suggest that AutoBA has learned the requirements of different bioinformatic tools and is capable of selecting appropriate tools based on the conditions of the input data.
manta --normalBam ./output/SRR23015874.recalibrated.bam --tumorBam ./output/SRR23015876.recalibrated.bam --referenceFasta ./data/hg38.fa --runDir ./output/manta_SRR23015874
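The selection rule described above (matched normal available: manta; tumor only: Pindel) can be sketched as a short function. This is a hypothetical illustration, not AutoBA's implementation: the function name `plan_sv_calling` is invented, the executable name `manta` and its four options are copied from the command shown in the text (an actual installation may expose a different entry point), and no Pindel arguments are generated because Pindel additionally needs a configuration file (Case 2.4).

```python
def plan_sv_calling(tumor_bam, reference_fasta, run_dir, normal_bam=None):
    """Pick a structural-variant caller based on whether a matched normal exists."""
    if normal_bam is not None:
        # Matched normal available: tumor-versus-normal analysis with manta,
        # using the options shown in the command from the text.
        argv = [
            "manta",
            "--normalBam", normal_bam,
            "--tumorBam", tumor_bam,
            "--referenceFasta", reference_fasta,
            "--runDir", run_dir,
        ]
        return {"tool": "manta", "argv": argv}
    # Tumor only: Pindel is chosen; its arguments are left out on purpose because
    # it requires a separately prepared configuration file.
    return {"tool": "pindel", "argv": None}


paired = plan_sv_calling(
    tumor_bam="./output/SRR23015876.recalibrated.bam",
    reference_fasta="./data/hg38.fa",
    run_dir="./output/manta_SRR23015874",
    normal_bam="./output/SRR23015874.recalibrated.bam",
)
tumor_only = plan_sv_calling(
    tumor_bam="./output/SRR23015876.recalibrated.bam",
    reference_fasta="./data/hg38.fa",
    run_dir="./output/pindel_SRR23015876",
)
print(" ".join(paired["argv"]))
print(tumor_only["tool"])
```

Returning the argument list rather than a single string keeps each path as a separate argument, so the result can be passed to `subprocess.run` without shell quoting.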
3.5. Apply AutoBA to a Variety of Conventional Multi‐Omic Analysis Scenarios
To evaluate the robustness of AutoBA, we assessed a total of 40 cases spanning four distinct types of omics data (genomics, transcriptomics, proteomics, and metabolomics), as shown in Table 2 and the Supporting Information.
All cases underwent an independent analysis process conducted by AutoBA and were subsequently validated by experienced bioinformatics experts. The collective results, shown in Table 3, underscore the versatility and robustness of AutoBA across a spectrum of multi‐omics analysis procedures in the field of bioinformatics. AutoBA autonomously devises analysis processes adapted to the input data and analysis objectives. Without the ACR module, it achieved a success rate of 90% (36 out of 40) for proposing plans, 82.5% (33 out of 40) for generating codes to obtain and install appropriate tools, and 65% (26 out of 40) for automated end‐to‐end analysis. With the ACR module, AutoBA was more robust: the success rate for proposing plans was unchanged at 90% (36 out of 40), whereas it rose to 87.5% (35 out of 40) for generating codes to obtain and install appropriate tools and to 87.5% (35 out of 40) for automated end‐to‐end analysis. Compared to the online version, the local version showed a slight decline in performance, as shown in Figure 4d.
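The reported percentages follow directly from the case counts. The short check below recomputes them from the counts stated in the text (40 cases in total) and derives the gain in end‐to‐end success attributable to ACR; the variable and stage names are chosen for illustration.

```python
TOTAL_CASES = 40

# Successful cases per stage, as stated in the text.
successes = {
    "without ACR": {"plan": 36, "code": 33, "end_to_end": 26},
    "with ACR": {"plan": 36, "code": 35, "end_to_end": 35},
}


def rate(n_success, total=TOTAL_CASES):
    """Success rate in percent."""
    return 100.0 * n_success / total


for variant, counts in successes.items():
    summary = ", ".join(f"{stage} {rate(n):.1f}%" for stage, n in counts.items())
    print(f"{variant}: {summary}")

# Improvement in end-to-end success from adding ACR, in percentage points.
gain = rate(successes["with ACR"]["end_to_end"]) - rate(successes["without ACR"]["end_to_end"])
print(f"end-to-end gain with ACR: {gain:.1f} percentage points")
```

The gain corresponds to nine additional cases (35 versus 26) completing end to end once ACR is enabled.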
3.6. AutoBA Reduces Human Intervention and Increases Robustness Compared to Other Methods
As shown in Figure 4c, we conducted a conceptual comparison between AutoBA and alternative methods in terms of human intervention. In utilizing conventional bioinformatics tools and web servers, users are required to prepare input data and comprehend detailed analysis plans prior to execution. Throughout the execution phase, users must configure the environment, install essential dependencies, write code, and proceed with step‐by‐step debugging. In contrast, ChatGPT and other open‐source LLMs assist users in proposing step‐by‐step plans and generating code, thus mitigating human intervention. Nevertheless, users still need to manually configure the environment, execute code, and perform debugging. AutoGPT, functioning as an AI agent, aids users in executing generated code to further minimize human intervention. However, within the context of bioinformatics data analysis, AutoGPT encounters challenges in setting up the environment and debugging for users. Conversely, AutoBA significantly reduces human intervention, necessitating only the preparation of input data.
To show the robustness of AutoBA, we further conducted a comprehensive comparison of eight methods, including 1) AutoBA (w/o ACR, online with ChatGPT‐4), 2) AutoBA (with ACR, online with ChatGPT‐4), 3) AutoBA (w/o ACR, offline with CodeLlama‐34B‐Instruct), 4) AutoBA (with ACR, offline with CodeLlama‐34B‐Instruct), 5) AutoGPT, 6) ChatGPT‐3.5, 7) ChatGPT‐4 and 8) CodeLlama‐34B‐Instruct, across all 40 cases, as illustrated in Figure 4d. AutoBA showed better performance in comparison to AutoGPT (90% for proposing plans, 25% for generating codes to obtain and install appropriate tools, and 0% for automated end‐to‐end analysis), ChatGPT‐3.5 (92.5% for proposing plans, 30% for generating codes to obtain and install appropriate tools, and 2.5% for automated end‐to‐end analysis), ChatGPT‐4 (92.5% for proposing plans, 37.5% for generating codes to obtain and install appropriate tools, and 7.5% for automated end‐to‐end analysis), and CodeLlama‐34B‐Instruct (80% for proposing plans, 7.5% for generating codes to obtain and install appropriate tools, and 2.5% for automated end‐to‐end analysis).
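To make the baseline figures easier to compare, the sketch below converts the percentages reported above into case counts out of 40 and shows how many cases each baseline loses between a successful plan and a successful end‐to‐end run. Only the percentages come from the text; the helper name `to_cases` is hypothetical.

```python
TOTAL_CASES = 40

# Success rates (%) reported in the text: (plan, code generation, end-to-end).
reported = {
    "AutoGPT": (90.0, 25.0, 0.0),
    "ChatGPT-3.5": (92.5, 30.0, 2.5),
    "ChatGPT-4": (92.5, 37.5, 7.5),
    "CodeLlama-34B-Instruct": (80.0, 7.5, 2.5),
}


def to_cases(percent, total=TOTAL_CASES):
    """Convert a success rate in percent into a number of cases."""
    return round(percent * total / 100)


case_counts = {
    method: tuple(to_cases(p) for p in rates) for method, rates in reported.items()
}

for method, (plan, code, e2e) in case_counts.items():
    print(
        f"{method}: plan {plan}, code {code}, end-to-end {e2e}; "
        f"lost after planning: {plan - e2e}"
    )
```

Read this way, the baselines propose plans about as often as AutoBA does, and nearly all of their losses occur at the code generation and execution stages.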
4. Discussion
To our knowledge, AutoBA is the first autonomous AI agent tailored explicitly for conventional multi‐omic analyses. AutoBA streamlines the analytical process, requiring minimal user input while providing detailed step‐by‐step plans for various bioinformatics tasks (Video S1, Supporting Information). Our results show that AutoBA accurately handles a diverse array of omics analysis tasks, including RNA‐seq, scRNA‐seq, ChIP‐seq, and spatial transcriptomics. One of the key strengths of AutoBA is its adaptability to variations in analysis objectives. As demonstrated in the cases presented, even with similar data types, such as RNA‐Seq, users often have distinct goals, necessitating modifications in software and parameter selection during execution. AutoBA effectively accommodates these variations, allowing users to tailor their analyses to specific research needs without compromising accuracy. Furthermore, AutoBA's versatility is highlighted by its ability to self‐design new analysis processes based on differing input data. This autonomous adaptability makes AutoBA a valuable tool for bioinformaticians working on novel or unconventional research questions, as it can adjust its approach to the unique characteristics of the data.
Online bioinformatics analysis platforms are currently popular, but they often require users to upload either raw data or pre‐processed statistics, which can raise privacy concerns and data leakage risks. In contrast, AutoBA addresses these privacy issues by offering both an online version and a local version. When utilizing the online version of AutoBA with ChatGPT, data uploads are unnecessary; only descriptive information in natural language is required, as specified in our prompt design, and this information contains few private details. The local version of AutoBA provides the highest level of privacy protection, as it operates on local backends and eliminates the need to share any information with third parties. Moreover, AutoBA adapts in step with emerging bioinformatics tools, as the underlying LLMs incorporate these latest tools into their knowledge. Furthermore, AutoBA is inclined toward selecting the most popular analytical frameworks or widely applicable tools in the planning phase, underscoring its robustness. Another distinguishing feature is AutoBA's transparent and interpretable execution process. This transparency allows professional bioinformaticians to easily modify and customize AutoBA's outputs, leveraging AutoBA to expedite the data analysis process.
AutoBA is also a future‐proof AI agent designed for bioinformatics analysis, leveraging LLMs as its core. This design allows AutoBA to integrate with any existing LLM, whether online (e.g., ChatGPT, GPT‐4, GPT‐4o) or offline (e.g., LLaMA, CodeLLaMA, and DeepSeek). The LLM used in AutoBA is fully substitutable, enabling it to benefit from the continual advancements in LLM technology. As new state‐of‐the‐art LLMs are developed, AutoBA can incorporate them to enhance its performance in automatic bioinformatics analysis. Still, AutoBA has a limitation in tool selection. Current LLMs are trained on internet data, meaning that widely used methods in bioinformatics are typically well represented, while methods from specific papers may be underrepresented or absent. As a result, the best outcomes are achieved with tools that are extensively covered in the training data, which can bias tool selection. To address this, training a specialized LLM for bioinformatics that thoroughly covers all tools and methods in the field could be a solution in the future.
Given that classical bioinformatic analysis encompasses a far broader spectrum of tasks and challenges than the 40 cases studied in this work (Tables 2 and 3), more real‐world applications by potential users are needed to validate the robustness of AutoBA comprehensively. We found that a large proportion (36%, 5 out of 14) of the cases that failed to complete end‐to‐end did so because the tools in conda were problematic, were not in a regular form (e.g., ending with .sh or .pl), or required an edited config file, suggesting a demand for more standardized bioinformatics tools. Furthermore, given the cutoff of the training data used for large language models, some of the most recently proposed methods in bioinformatics may still pose challenges for automatic code generation by AutoBA. Therefore, a future endeavor to train an up‐to‐date large language model explicitly tailored for bioinformatics could significantly enhance AutoBA's ability to keep its code generation current. Nevertheless, AutoBA represents a significant advancement in the field of bioinformatics, offering a user‐friendly, efficient, and adaptable solution for a wide range of omics analysis tasks. Its capacity to handle diverse data types and analysis goals, coupled with its robustness and adaptability, positions AutoBA as a valuable asset in the pursuit of accelerating bioinformatics research. We anticipate that AutoBA will find extensive utility in the scientific community, supporting researchers in their quest to extract meaningful insights from complex biological data.
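The privacy property of the online mode can be illustrated with a sketch of prompt assembly from metadata alone. This is an assumption‐laden illustration, not AutoBA's actual prompt code: the function name `build_prompt`, the field layout, and the example paths are hypothetical. The point is that only path strings and human‐written descriptions enter the prompt, and the data files are never opened.

```python
def build_prompt(data_files, output_dir, goal):
    """Assemble a natural-language task description from metadata only."""
    lines = ["Input data:"]
    for path, description in data_files:
        # Only the path string and a human-written description are used;
        # the file itself is never opened or read.
        lines.append(f"- {path}: {description}")
    lines.append(f"Output directory: {output_dir}")
    lines.append(f"Goal: {goal}")
    return "\n".join(lines)


prompt = build_prompt(
    data_files=[
        ("./data/sample_R1.fastq.gz", "paired-end RNA-seq reads, read 1"),
        ("./data/sample_R2.fastq.gz", "paired-end RNA-seq reads, read 2"),
    ],
    output_dir="./output",
    goal="find differentially expressed genes",
)
print(prompt)
```

Because the function works even when the listed files do not exist on disk, it is clear that no sequencing data can leak through the prompt; what remains exposed is whatever the user writes into the paths and descriptions.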
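The substitutable‐LLM design mentioned above can be sketched as an agent that depends only on a narrow backend interface. All names here (`LLMBackend`, `EchoBackend`, `Agent`) are hypothetical and do not reflect AutoBA's class structure; the stand‐in backend merely echoes its input so that the example runs without any model or network access.

```python
from typing import Protocol


class LLMBackend(Protocol):
    """Minimal interface the agent relies on; any online or local model can provide it."""

    def complete(self, prompt: str) -> str:
        ...


class EchoBackend:
    """Stand-in backend for demonstration; a real one would call an API or a local model."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] plan for: {prompt}"


class Agent:
    """Agent that is unaware of which concrete model sits behind the interface."""

    def __init__(self, backend: LLMBackend) -> None:
        self.backend = backend

    def propose_plan(self, goal: str) -> str:
        return self.backend.complete(goal)


online_agent = Agent(EchoBackend("online-model"))
local_agent = Agent(EchoBackend("local-model"))
print(online_agent.propose_plan("find differentially expressed genes"))
print(local_agent.propose_plan("find differentially expressed genes"))
```

Swapping in a newer model then means writing one new backend class, with no change to the agent logic.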
Conflict of Interest
The authors declare no conflict of interest.
Author Contributions
J.Z. and B.Z. contributed equally to this work. J.Z. and X.G. performed conceptualization. J.Z., B.Z., and X.G. performed design. J.Z. performed code implementation. J.Z., B.Z., X.C., H.L., C.X., and W.H. performed application. J.Z. and B.Z. wrote the original draft of the manuscript. J.Z., B.Z., X.X., S.C., X.G., L.L., and G.L. performed critical revision of the manuscript for important intellectual content. J.Z. and X.G. performed supervision. X.G. performed funding acquisition.
Code Availability Statement
The AutoBA software is publicly available at https://github.com/JoshuaChou2018/AutoBA. The Docker version of AutoBA is available at https://hub.docker.com/r/joshuachou666/autoba.
Supporting information
Supplemental Video 1
Acknowledgements
J.Z., B.Z., X.C., H.L., X.X., S.C., W.H., C.X., and X.G. were supported in part by grants from the Office of Research Administration (ORA) at King Abdullah University of Science and Technology (KAUST) under award numbers FCC/1/1976‐44‐01, FCC/1/1976‐45‐01, REI/1/5202‐01‐01, REI/1/5234‐01‐01, REI/1/4940‐01‐01, RGC/3/4816‐01‐01, REI/1/0018‐01‐01, REI/1/5414‐01‐01, REI/1/5289‐01‐01, and REI/1/5404‐01‐01.
Zhou J., Zhang B., Li G., Chen X., Li H., Xu X., Chen S., He W., Xu C., Liu L., Gao X., An AI Agent for Fully Automated Multi‐Omic Analyses. Adv. Sci. 2024, 11, 2407094. 10.1002/advs.202407094
Contributor Information
Liwei Liu, Email: liuliwei5@huawei.com.
Xin Gao, Email: xin.gao@kaust.edu.sa.
Data Availability Statement
The RNA‐seq dataset could be downloaded from Sequence Read Archive (SRA) with IDs: SRR1374921, SRR1374922, SRR1374923, and SRR1374924. The dataset for case 1.3 could be downloaded from https://github.com/STAR‐Fusion/STAR‐Fusion‐Tutorial/wiki. The scRNA‐seq dataset could be downloaded from http://cf.10xgenomics.com/samples/cell‐exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz. The ChIP‐seq dataset could be downloaded with IDs: SRR620204, SRR620205, SRR620206, and SRR620208. The Spatial Transcriptomics dataset could be downloaded from https://doi.org/10.5281/zenodo.6334774. The CAGE‐seq dataset could be downloaded from SRA with IDs: SRR11351697, SRR11351698, SRR11351700, and SRR11351701. The 3’end‐seq dataset could be downloaded from SRA with IDs: SRR17422754, SRR17422755, SRR17422756, and SRR17422757. The CLIP‐seq dataset could be downloaded from ENCODE (https://www.encodeproject.org) with IDs: ENCLB742AYH and ENCLB770EDJ. The Ribo‐seq data could be downloaded from SRA with IDs: RR12354645 and RR12354646. The raw single‐cell RNA sequencing data could be downloaded from 10X genomics. The PacBio long‐read sequencing data could be downloaded from SRA with IDs: SRR19552218 and SRR19785215. The small RNA‐seq data could be downloaded from the previous study.[ 75 ]
References
- 1. Luscombe N. M., Greenbaum D., Gerstein M., Methods of information in medicine 2001, 40, 346. [PubMed] [Google Scholar]
- 2. Gauthier J., Vincent A. T., Charette S. J., Derome N., Briefings in bioinformatics 2019, 20, 1981. [DOI] [PubMed] [Google Scholar]
- 3. Baxevanis A. D., Bader G. D., Wishart D. S., Bioinformatics, John Wiley & Sons, NJ, USA: 2020. [Google Scholar]
- 4. Munk P., Brinch C., Møller F. D., Petersen T. N., Hendriksen R. S., Seyfarth A. M., Kjeldgaard J. S., Svendsen C. A., van Bunnik B., Berglund F., Nat. Commun. 2022, 13, 7251. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Lips E. H., Kumar T., Megalios A., Visser L. L., Sheinman M., Fortunato A., Shah V., Hoogstraat M., Sei E., Mallo D., Nat. Genet. 2022, 54, 850. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Jones D. T., Thornton J. M., Nat. Methods 2022, 19, 15. [DOI] [PubMed] [Google Scholar]
- 7. Hekkelman M. L., de Vries I., Joosten R. P., Perrakis A., Nat. Methods 2023, 20, 205. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Sapoval N., Aghazadeh A., Nute M. G., Antunes D. A., Balaji A., Baraniuk R., Barberan C., Dannenfelser R., Dun C., Edrisi M., Nat. Commun. 2022, 13, 1728. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. Gupta T., Zaki M., Krishnan N. A., Mausam, npj Comput. Mater. 2022, 8, 102. [Google Scholar]
- 10. Zeng Z., Yao Y., Liu Z., Sun M., Nat. Commun. 2022, 13, 862. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11. De Maio N., Kalaghatgi P., Turakhia Y., Corbett‐Detig R., Minh B. Q., Goldman N., Nat. Genet. 2023, 55, 746. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Chanderbali A. S., Jin L., Xu Q., Zhang Y., Zhang J., Jian S., Carroll E., Sankoff D., Albert V. A., Howarth D. G., Nat. Commun. 2022, 13, 643. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13. Rhodes J., Abdolrasouli A., Dunne K., Sewell T. R., Zhang Y., Ballard E., Brackin A. P., van Rhijn N., Chown H., Tsitsopoulou A., Nat. Microbiol. 2022, 7, 663. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Heinken A., Hertel J., Acharya G., Ravcheev D. A., Nyga M., Okpala O. E., Hogan M., Magnúsdóttir S., Martinelli F., Nap B., Nat. Biotechnol. 2023, 41, 1320. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15. Hemmerling F., Piel J., Nat. Rev. Drug Discovery 2022, 21, 359. [DOI] [PubMed] [Google Scholar]
- 16. Zhou J., Zhang B., Li H., Zhou L., Li Z., Long Y., Han W., Wang M., Cui H., Li J., Genomics, Proteomics and Bioinformatics 2022, 20, 959. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Li H., Li H., Zhou J., Gao X., Bioinformatics 2022, 38, 4878. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Zhang T., Li L., Sun H., Xu D., Wang G., Briefings in Bioinformatics 2023, 24, bbad316. [DOI] [PubMed] [Google Scholar]
- 19. Li Z., Gao E., Zhou J., Han W., Xu X., Gao X., Cell Reports Methods 2023, 3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20. Long Y., Zhang B., Tian S., Chan J. J., Zhou J., Li Z., Li Y., An Z., Liao X., Wang Y., Genome Res. 2023, 33, 644. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Bardet A. F., He Q., Zeitlinger J., Stark A., Nat. Protoc. 2012, 7, 45. [DOI] [PubMed] [Google Scholar]
- 22. Vieth B., Parekh S., Ziegenhain C., Enard W., Hellmann I., Nat. Commun. 2019, 10, 4667. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Luecken M. D., Theis F. J., Molecular systems biology 2019, 15, e8746. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Grandi F. C., Modi H., Kampman L., Corces M. R., Nat. Protoc. 2022, 17, 1518. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25. Ng P. C., Kirkness E. F., Genetic variation: Methods and protocols 2010, 628, 215. [Google Scholar]
- 26. Wang Z., Gerstein M., Snyder M., Nat. Rev. Genet. 2009, 10, 57. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Saliba A.‐E., Westermann A. J., Gorski S. A., Vogel J., Nucleic Acids Res. 2014, 42, 8845. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Buenrostro J. D., Wu B., Chang H. Y., Greenleaf W. J., Curr. Protoc. Mol. Biol. 2015, 109, 21.29.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29. Park P. J., Nat. Rev. Genet. 2009, 10, 669. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30. Burgess D. J., Nat. Rev. Genet. 2019, 20, 317. [DOI] [PubMed] [Google Scholar]
- 31. Conesa A., Madrigal P., Tarazona S., Gomez‐Cabrero D., Cervera A., McPherson A., Szcześniak M. W., Gaffney D. J., Elo L. L., Zhang X., Genome Biol. 2016, 17, 1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32. Wang L., Wang S., Li W., Bioinformatics 2012, 28, 2184. [DOI] [PubMed] [Google Scholar]
- 33. Martin M., EMBnet. journal 2011, 17, 10. [Google Scholar]
- 34. Dobin A., Gingeras T. R., Curr. Protoc. Bioinform. 2015, 51, 11.14.1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. Trapnell C., Pachter L., Salzberg S. L., Bioinformatics 2009, 25, 1105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Liao Y., Smyth G. K., Shi W., Bioinformatics 2014, 30, 923. [DOI] [PubMed] [Google Scholar]
- 37. Rapaport F., Khanin R., Liang Y., Pirun M., Krek A., Zumbo P., Mason C. E., Socci N. D., Betel D., Genome Biol. 2013, 14, R95. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Shen S., Park J. W., Lu Z.‐x., Lin L., Henry M. D., Wu Y. N., Zhou Q., Xing Y., Proc. Natl. Acad. Sci. USA 2014, 111, E5593. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39. Katz Y., Wang E. T., Airoldi E. M., Burge C. B., Nat. Methods 2010, 7, 1009. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. Wang X., Cairns M. J., presented at BMC bioinformatics , 2013.
- 41. Thomas R., Thomas S., Holloway A. K., Pollard K. S., Briefings in bioinformatics 2017, 18, 441. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Bailey T. L., Bioinformatics 2011, 27, 1653. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43. Yu G., Wang L.‐G., He Q.‐Y., Bioinformatics 2015, 31, 2382. [DOI] [PubMed] [Google Scholar]
- 44. Reska D., Czajkowski M., Jurczuk K., Boldak C., Kwedlo W., Bauer W., Koszelew J., Kretowski M., biocybernetics and biomedical engineering 2021, 41, 1646. [Google Scholar]
- 45. Ge S. X., Son E. W., Yao R., BMC Bioinformatics 2018, 19, 534. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46. Jiang A., Lehnert K., You L., Snell R. G., Nucleic Acids Res. 2022, 50, W427. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Li X., Xiao C., Qi J., Xue W., Xu X., Mu Z., Zhang J., Li C.‐Y., Ding W., Nucleic Acids Res. 2023, 51, W560. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Zhou J., Chen S., Wu Y., Li H., Zhang B., Zhou L., Hu Y., Xiang Z., Li Z., Chen N., Sci. Adv. 2024, 10, eadh8601. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49. Roy S., Coldren C., Karunamurthy A., Kip N. S., Klee E. W., Lincoln S. E., Leon A., Pullambhatla M., Temple‐Smolkin R. L., Voelkerding K. V., The Journal of Molecular Diagnostics 2018, 20, 4. [DOI] [PubMed] [Google Scholar]
- 50. Ewels P. A., Peltzer A., Fillinger S., Patel H., Alneberg J., Wilm A., Garcia M. U., Di Tommaso P., Nahnsen S., Nat. Biotechnol. 2020, 38, 276. [DOI] [PubMed] [Google Scholar]
- 51. Wratten L., Wilm A., Göke J., Nat. Methods 2021, 18, 1161. [DOI] [PubMed] [Google Scholar]
- 52. Işık E. B., Brazas M. D., Schwartz R., Gaeta B., Palagi P. M., van Gelder C. W., Suravajhala P., Singh H., Morgan S. L., Zahroh H., Nat. Biotechnol. 2023, 41, 1171. [DOI] [PubMed] [Google Scholar]
- 53. Attwood T. K., Blackford S., Brazas M. D., Davies A., Schneider M. V., Briefings in Bioinformatics 2019, 20, 398. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Wei J., Tay Y., Bommasani R., Raffel C., Zoph B., Borgeaud S., Yogatama D., Bosma M., Zhou D., Metzler D., arXiv 2022, arXiv:2206.07682.
- 55. Thirunavukarasu A. J., Ting D. S. J., Elangovan K., Gutierrez L., Tan T. F., Ting D. S. W., Nat. Med. 2023, 29, 1930. [DOI] [PubMed] [Google Scholar]
- 56. Madani A., Krause B., Greene E. R., Subramanian S., Mohr B. P., Holton J. M., Olmos J. L., Xiong C., Sun Z. Z., Socher R., Nat. Biotechnol. 2023, 41, 1099. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57. Meskó B., Topol E. J., npj digital medicine 2023, 6, 120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58. Wang S., Zhao Z., Ouyang X., Wang Q., Shen D., arXiv 2023, arXiv:2302.07257.
- 59. Zhou J., He X., Sun L., Xu J., Chen X., Chu Y., Zhou L., Liao X., Zhang B., Gao X., Pre‐trained multimodal large language model enhances dermatological diagnosis using SkinGPT‐4. Nature Communications arXiv 2024, 15, 5649. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60. Zhou J., Chen X., Gao X., medRxiv 2023, 2023.06. 23.23291802.
- 61. Tu T., Azizi S., Driess D., Schaekermann M., Amin M., Chang P.‐C., Carroll A., Lau C., Tanno R., Ktena I., NEJM AI 2024, 1, AIoa2300138. [Google Scholar]
- 62. Flam‐Shepherd D., Zhu K., Aspuru‐Guzik A., Nat. Commun. 2022, 13, 3293. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63. Shue E., Liu L., Li B., Feng Z., Li X., Hu G., Quantitative Biology 2023, 11, 105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64. Piccolo S. R., Denny P., Luxton‐Reilly A., Payne S., Ridge P. G., arXiv 2023, arXiv:2303.13528.
- 65. Giray L., Ann. Biomed. Eng. 2023, 51, 2629. [DOI] [PubMed] [Google Scholar]
- 66. Gravitas S., Auto‐GPT: An autonomous GPT‐4 experiment 2023.
- 67. Roziere B., Gehring J., Gloeckle F., Sootla S., Gat I., Tan X. E., Adi Y., Liu J., Remez T., Rapin J., arXiv 2023, arXiv:2308.12950.
- 68. Touvron H., Martin L., Stone K., Albert P., Almahairi A., Babaei Y., Bashlykov N., Batra S., Bhargava P., Bhosale S., arXiv 2023, arXiv:2307.09288.
- 69. Benner C., van der Meulen T., Cacéres E., Tigyi K., Donaldson C. J., Huising M. O., BMC Genomics 2014, 15, 620. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70. Dobin A., Davis C. A., Schlesinger F., Drenkow J., Zaleski C., Jha S., Batut P., Chaisson M., Gingeras T. R., Bioinformatics 2013, 29, 15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71. Kim D., Paggi J. M., Park C., Bennett C., Salzberg S. L., Nat. Biotechnol. 2019, 37, 907. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 72. Li H., Bioinformatics 2018, 34, 3094. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73. Chen X., Schulz‐Trieglaff O., Shaw R., Barnes B., Schlesinger F., Källberg M., Cox A. J., Kruglyak S., Saunders C. T., Bioinformatics 2016, 32, 1220. [DOI] [PubMed] [Google Scholar]
- 74. Ye K., Schulz M. H., Long Q., Apweiler R., Ning Z., Bioinformatics 2009, 25, 2865. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75. Kwok Z. H., Zhang B., Chew X. H., Chan J. J., Teh V., Yang H., Kappei D., Tay Y., Cancer Res. 2021, 81, 1308. [DOI] [PubMed] [Google Scholar]
- 76. Bolger A., Giorgi F., Bioinformatics 2014, 30, 2114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77. Bankevich A., Nurk S., Antipov D., Gurevich A. A., Dvorkin M., Kulikov A. S., Lesin V. M., Nikolenko S. I., Pham S., Prjibelski A. D., J. Comput. Biol. 2012, 19, 455. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78. Gurevich A., Saveliev V., Vyahhi N., Tesler G., Bioinformatics 2013, 29, 1072. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79. Li H., arXiv 2013, arXiv:1303.3997.
- 80. Li H., Handsaker B., Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R., Subgroup G. P. D. P., Bioinformatics 2009, 25, 2078. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81. McKenna A., Hanna M., Banks E., Sivachenko A., Cibulskis K., Kernytsky A., Garimella K., Altshuler D., Gabriel S., Daly M., Genome Res. 2010, 20, 1297. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82. McLaren W., Gil L., Hunt S. E., Riat H. S., Ritchie G. R., Thormann A., Flicek P., Cunningham F., Genome Biol. 2016, 17, 122. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83. Cingolani P., Platts A., Wang L. L., Coon M., Nguyen T., Wang L., Land S. J., Lu X., Ruden D. M., fly 2012, 6, 80. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84. Langmead B., Salzberg S. L., Nat. Methods 2012, 9, 357. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 85. Zhang Y., Liu T., Meyer C. A., Eeckhoute J., Johnson D. S., Bernstein B. E., Nusbaum C., Myers R. M., Brown M., Li W., Genome Biol. 2008, 9, 1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 86. Bailey T. L., Johnson J., Grant C. E., Noble W. S., Nucleic Acids Res. 2015, 43, W39. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 87. Quinlan A. R., Hall I. M., Bioinformatics 2010, 26, 841. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88. Love M. I., Huber W., Anders S., Genome Biol. 2014, 15, 550. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 89. Raudvere U., Kolberg L., Kuzmin I., Arak T., Adler P., Peterson H., Vilo J., Nucleic Acids Res. 2019, 47, W191. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 90. Ihaka R., Gentleman R., J. Comput. Graph. Stat. 1996, 5, 299.
- 91. Krueger F., Andrews S. R., Bioinformatics 2011, 27, 1571.
- 92. Thorvaldsdóttir H., Robinson J. T., Mesirov J. P., Briefings in Bioinformatics 2013, 14, 178.
- 93. McLean C. Y., Bristor D., Hiller M., Clarke S. L., Schaar B. T., Lowe C. B., Wenger A. M., Bejerano G., Nat. Biotechnol. 2010, 28, 495.
- 94. Koren S., Walenz B. P., Berlin K., Miller J. R., Bergman N. H., Phillippy A. M., Genome Res. 2017, 27, 722.
- 95. Vaser R., Sović I., Nagarajan N., Šikić M., Genome Res. 2017, 27, 737.
- 96. Kolmogorov M., Yuan J., Lin Y., Pevzner P. A., Nat. Biotechnol. 2019, 37, 540.
- 97. Wick R. R., Schultz M. B., Zobel J., Holt K. E., Bioinformatics 2015, 31, 3350.
- 98. Benson G., Nucleic Acids Res. 1999, 27, 573.
- 99. Chin C.‐S., Peluso P., Sedlazeck F. J., Nattestad M., Concepcion G. T., Clum A., Dunn C., O'Malley R., Figueroa‐Balderas R., Morales‐Cruz A., Nat. Methods 2016, 13, 1050.
- 100. Marçais G., Delcher A. L., Phillippy A. M., Coston R., Salzberg S. L., Zimin A., PLoS Comput. Biol. 2018, 14, e1005944.
- 101. Anders S., Pyl P. T., Huber W., Bioinformatics 2015, 31, 166.
- 102. Nicorici D., Şatalan M., Edgren H., Kangaspeska S., Murumägi A., Kallioniemi O., Virtanen S., Kilkku O., bioRxiv 2014, 011650.
- 103. Pertea G., Pertea M., F1000Research 2020, 9.
- 104. Xia Z., Donehower L. A., Cooper T. A., Neilson J. R., Wheeler D. A., Wagner E. J., Li W., Nat. Commun. 2014, 5, 5274.
- 105. Frazee A. C., Pertea G., Jaffe A. E., Langmead B., Salzberg S. L., Leek J. T., bioRxiv 2014, 1, 003665.
- 106. Gao Y., Wang J., Zhao F., Genome Biol. 2015, 16, 4.
- 107. Zhang J., Chen S., Yang J., Zhao F., Nat. Commun. 2020, 11, 90.
- 108. Robinson M. D., McCarthy D. J., Smyth G. K., Bioinformatics 2010, 26, 139.
- 109. Haberle V., Forrest A. R., Hayashizaki Y., Carninci P., Lenhard B., Nucleic Acids Res. 2015, 43, e51.
- 110. Ghosh S., Chan C.‐K. K., Plant Bioinformatics: Methods and Protocols 2016, 1374, 339.
- 111. Calviello L., Mukherjee N., Wyler E., Zauber H., Hirsekorn A., Selbach M., Landthaler M., Obermayer B., Ohler U., Nat. Methods 2016, 13, 165.
- 112. Satija R., Farrell J. A., Gennert D., Schier A. F., Regev A., Nat. Biotechnol. 2015, 33, 495.
- 113. Wolf F. A., Angerer P., Theis F. J., Genome Biol. 2018, 19, 15.
- 114. Palla G., Spitzer H., Klein M., Fischer D., Schaar A. C., Kuemmerle L. B., Rybakov S., Ibarra I. L., Holmberg O., Virshup I., Nat. Methods 2022, 19, 171.
- 115. Biancalani T., Scalia G., Buffoni L., Avasthi R., Lu Z., Sanger A., Tokcan N., Vanderburg C. R., Segerstolpe Å., Zhang M., Nat. Methods 2021, 18, 1352.
- 116. Kessner D., Chambers M., Burke R., Agus D., Mallick P., Bioinformatics 2008, 24, 2534.
- 117. Sturm M., Bertsch A., Gröpl C., Hildebrandt A., Hussong R., Lange E., Pfeifer N., Schulz‐Trieglaff O., Zerck A., Reinert K., BMC Bioinformatics 2008, 9, 163.
- 118. Bald T., Barth J., Niehues A., Specht M., Hippler M., Fufezan C., Bioinformatics 2012, 28, 1052.
Data Availability Statement
The RNA‐seq dataset can be downloaded from the Sequence Read Archive (SRA) with IDs SRR1374921, SRR1374922, SRR1374923, and SRR1374924. The dataset for case 1.3 can be downloaded from https://github.com/STAR-Fusion/STAR-Fusion-Tutorial/wiki. The scRNA‐seq dataset can be downloaded from http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz. The ChIP‐seq dataset can be downloaded from SRA with IDs SRR620204, SRR620205, SRR620206, and SRR620208. The spatial transcriptomics dataset can be downloaded from https://doi.org/10.5281/zenodo.6334774. The CAGE‐seq dataset can be downloaded from SRA with IDs SRR11351697, SRR11351698, SRR11351700, and SRR11351701. The 3′ end‐seq dataset can be downloaded from SRA with IDs SRR17422754, SRR17422755, SRR17422756, and SRR17422757. The CLIP‐seq dataset can be downloaded from ENCODE (https://www.encodeproject.org) with IDs ENCLB742AYH and ENCLB770EDJ. The Ribo‐seq data can be downloaded from SRA with IDs RR12354645 and RR12354646. The raw single‐cell RNA sequencing data can be downloaded from 10x Genomics. The PacBio long‐read sequencing data can be downloaded from SRA with IDs SRR19552218 and SRR19785215. The small RNA‐seq data can be downloaded from a previous study.[ 75 ]
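As a convenience for readers retrieving the SRA runs listed above, the following is a minimal sketch, not part of AutoBA itself, that turns a list of run accessions into SRA Toolkit command lines. It assumes the SRA Toolkit (`prefetch`, `fasterq-dump`) is installed separately. The function name `build_download_commands` and the output directory `raw_data` are hypothetical names chosen for illustration. The script only prints the commands and does not run them. Accessions that do not match the usual run-accession pattern (SRR, ERR, or DRR followed by digits) are reported back rather than corrected.

```python
import re
import shlex

# Run accessions conventionally look like SRR/ERR/DRR followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def build_download_commands(accessions, outdir="raw_data"):
    """Return (commands, skipped).

    commands: SRA Toolkit command lines for well-formed run accessions.
    skipped:  accessions that did not match the run-accession pattern.
    The name and the default outdir are illustrative assumptions.
    """
    commands, skipped = [], []
    for acc in accessions:
        if RUN_ACCESSION.match(acc):
            # prefetch fetches the run; fasterq-dump converts it to FASTQ.
            commands.append(f"prefetch {acc}")
            commands.append(
                f"fasterq-dump {acc} --split-files -O {shlex.quote(outdir)}"
            )
        else:
            skipped.append(acc)
    return commands, skipped


# RNA-seq runs from the statement above.
rna_seq_runs = ["SRR1374921", "SRR1374922", "SRR1374923", "SRR1374924"]
commands, skipped = build_download_commands(rna_seq_runs)
for command in commands:
    print(command)
for acc in skipped:
    print(f"# not a recognised run accession, check manually: {acc}")
```

Printing the commands instead of executing them keeps the sketch free of side effects; the output can be reviewed and then piped to a shell once the toolkit is configured.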
