Plant Communications. 2025 Sep 8;6(12):101494. doi: 10.1016/j.xplc.2025.101494

DeepPGDB: A novel paradigm for AI-guided interactive plant genomic database

Fangping Li 1,4, Jiaxuan Chen 1,4, Wei Luo 1, Jieying Liu 1, Guodong Chen 1, Binyu Shuai 2, Zhuangwei Hou 1, Zhenpeng Gan 1, Hongyuan Zhao 1, Penglin Zhan 1, Changwei Bi 2, Zefu Wang 2, Haifei Hu 1,3,∗, Shaokui Wang 1,∗∗
PMCID: PMC12744749  PMID: 40926409

Dear Editor,

Over the past decade, significant progress has been made in generating omics data, with genomics as a prime example. In plant sciences, high-quality chromosome-level genomes have been published for over 1000 species (Liu et al., 2024). For model plants like rice and Arabidopsis, advancements now extend to population-level genome assemblies, pushing the field into the pangenomics era. Despite this rapid accumulation of genomic data, researchers with primarily biological backgrounds often struggle to mine these datasets and conduct omics analyses (Li et al., 2021). These challenges largely stem from the need for advanced bioinformatics expertise and familiarity with complex command-line tools, programming languages, data analysis pipelines, and intricate user interfaces that require substantial computational skills.

The rapid development of generative large language models, such as ChatGPT and DeepSeek, has recently provided substantial assistance in data processing (DeepSeek-AI et al., 2025). The emergence of intelligent agents driven by artificial intelligence (AI), such as AutoGPT, has prompted consideration of these models for real-world applications (Yang et al., 2023). Initially applied to large-scale cancer functional proteomics analysis (Liu et al., 2025), AI approaches have since been used to design genome sequences with desired biological functions (Wang et al., 2025). These advances motivate placing such models at the core of genomic databases, enabling interactive access to the embedded knowledge through intuitive natural language queries. Guided by this conceptual framework, we developed DeepPGDB (https://www.deeppgdb.chat), which combines model fine-tuning with prompt engineering to build the first AI-powered plant genomics database. This paradigm of an AI-guided interactive genomic database is designed to lower technical barriers, enabling seamless analysis of complex omics data based on high-quality genomes.

Genomic and multi-omics data are large in volume and complex in format (Wörheide et al., 2021). Bioinformatics researchers often rely on specialized tools for specific formats to extract biologically meaningful information efficiently. When AI models act as the core scheduler for interactive data access, standardizing their output to produce task-specific commands for backend data retrieval and frontend visualization is an effective function-calling approach.

DeepPGDB connects bioinformatics tools to a language model through a combination of fine-tuning, prompt engineering, and retrieval-augmented generation (RAG). This framework enables an AI scheduler to accurately interpret user intent and translate it into a standardized tool invocation command. DeepPGDB categorizes tasks by both task type and data type. The AI model first interprets user input to classify the task as either a tool invocation or a textual knowledge query, then selects the appropriate model for execution. For tool invocation tasks, the fine-tuned reasoning model determines the data type from user intent (e.g., expression, genomic, or gene function data). Standardized commands are then generated from its reasoning process and the initial prompt. These commands are executed on the backend to call the appropriate tools for data retrieval and visualization, with results returned to the frontend (Figure 1A). For textual knowledge queries, which primarily involve genome information such as versions, references, and gene numbers, the model employs RAG to extract accurate information from relevant documents (Figure 1A). This combined approach functions as a simplified Model Context Protocol server designed for genomics and related data.
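The routing step described above can be illustrated with a minimal Python sketch. It assumes the model emits one JSON object per request; the field names (`task`, `tool`, `args`, `question`) and the two handlers are hypothetical and are not DeepPGDB's actual command vocabulary, which the letter does not specify.

```python
import json
from typing import Callable, Dict

# Hypothetical handlers: names and argument shapes are illustrative assumptions.
def extract_sequence(args: dict) -> str:
    return f"extract {args['species']} {args['chrom']}:{args['start']}-{args['end']}"

def gene_location(args: dict) -> str:
    return f"locate {args['species']} {args['gene_id']}"

TOOLS: Dict[str, Callable[[dict], str]] = {
    "extract_sequence": extract_sequence,
    "gene_location": gene_location,
}

def dispatch(model_output: str) -> str:
    """Route one model-emitted JSON command to a backend handler."""
    try:
        command = json.loads(model_output)
    except json.JSONDecodeError:
        return "error: model output is not valid JSON"
    if not isinstance(command, dict):
        return "error: command must be a JSON object"

    task = command.get("task")
    if task == "knowledge_query":
        # A real system would run retrieval-augmented generation here.
        return f"rag lookup: {command.get('question', '')}"
    if task == "tool_invocation":
        handler = TOOLS.get(command.get("tool"))
        if handler is None:
            return f"error: unknown tool {command.get('tool')!r}"
        try:
            return handler(command.get("args", {}))
        except (KeyError, TypeError) as exc:
            return f"error: bad arguments ({exc})"
    return f"error: unknown task type {task!r}"
```

Validating the command before execution means that a malformed or hallucinated instruction is rejected with an error message and never reaches a backend tool.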

Figure 1.

Summary and functions of DeepPGDB.

(A) Workflow of DeepPGDB.

(B) User interface of DeepPGDB.

(C–F) Examples of natural language queries and responses demonstrating basic functions of DeepPGDB using rice and Arabidopsis.

(C) Sequence extraction for specific intervals from the rice genome.

(D) Protein sequence BLAST search against Arabidopsis using a Chinese query.

(E) Chromosomal location information for specific rice genes.

(F) Visualization of gene expression patterns across rice organs.

(G and H) Examples of advanced summarization functions of DeepPGDB.

(G) Extraction of variation loci within the GW5L region (chr01:4826590–4828733) in rice populations and haplotype analysis based on output and summarization modules.

(H) Protein sequence extraction and summary of basic properties using the rice OsSPL16 gene as an example.

(I) Accuracy, response time, and token usage of various AI models in DeepPGDB.

DSQW, Deepseek-Qwen; FT, fine-tune; SP, short pre-prompt; LP, long pre-prompt.

As the first AI agent designed for direct genomic queries, DeepPGDB differs substantially from other plant science AI models such as PlantGPT and SeedLLM (Yang et al., 2025; Zhang et al., 2025). Unlike these models, which primarily acquire knowledge from research articles, DeepPGDB is trained and structured to extract targeted information directly from bioinformatics files. This approach allows the system to select suitable tools for data retrieval based on file records, closely reflecting researchers' practical workflows. Users interact with DeepPGDB through a dialogue interface, where the AI processes and delivers responses in conversational form (Figures 1A and 1B). The platform integrates more than 20 publicly available and high-quality plant genomes, enabling users to access and analyze genomic data via natural language input in multiple languages. For example, users can retrieve sequences from specific genomic regions in rice or Arabidopsis by submitting simple queries (Figure 1C; Supplemental Figure 1). BLAST-supported sequence alignment, a cornerstone of genomic data analysis, is fully embedded in DeepPGDB. Users can launch alignment tasks by entering nucleotide or protein sequences and specifying the target species in the dialogue box (Figure 1D; Supplemental Figure 2). The AI automatically detects sequence type, generates appropriate backend commands, and returns results in a conversational format.
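One plausible way to detect the sequence type and pick a matching BLAST program is sketched below. The 90% threshold and the helper names are assumptions made for illustration; the letter does not describe how DeepPGDB performs this check. The program names (`blastn`, `blastp`, `blastx`, `tblastn`) are the standard BLAST+ executables.

```python
NUCLEOTIDE_CHARS = set("ACGTUN")

def detect_sequence_type(sequence: str) -> str:
    """Guess 'nucleotide' or 'protein' from residue composition (heuristic)."""
    # Drop FASTA header lines and all whitespace before counting residues.
    lines = [ln for ln in sequence.splitlines() if not ln.startswith(">")]
    cleaned = "".join("".join(lines).split()).upper()
    if not cleaned:
        raise ValueError("empty sequence")
    fraction = sum(ch in NUCLEOTIDE_CHARS for ch in cleaned) / len(cleaned)
    # Assumed cutoff: mostly A/C/G/T/U/N means nucleotide.
    return "nucleotide" if fraction >= 0.9 else "protein"

def choose_blast_program(query_type: str, database_type: str) -> str:
    """Map query and database types to a BLAST+ program name."""
    programs = {
        ("nucleotide", "nucleotide"): "blastn",
        ("protein", "protein"): "blastp",
        ("nucleotide", "protein"): "blastx",
        ("protein", "nucleotide"): "tblastn",
    }
    try:
        return programs[(query_type, database_type)]
    except KeyError:
        raise ValueError(f"unsupported combination: {query_type}, {database_type}")
```

A composition heuristic like this can misclassify very short peptides made only of A, C, G, and T residues, so a production system would likely let the user override the guess.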

Genomic location queries and gene list retrieval based on functional categories are essential for advancing genomic and functional research. DeepPGDB integrates structural and functional annotations of plant genomes, enabling users to submit natural language queries processed by the AI scheduler. The scheduler interprets user intent and generates standardized commands, which are executed on the backend to retrieve relevant data. Results are presented in a structured tabular format on the frontend. For example, when a user requests the chromosomal location of a specific rice gene, the AI extracts the gene identifier, generates a corresponding standardized backend command, executes it, and displays the results (Figure 1E). Similarly, for gene family queries, the AI identifies relevant keywords, searches the backend, and returns a comprehensive gene list (e.g., the GRF transcription factor family in rice; Supplemental Figure 3). To support species with multiple annotation versions, DeepPGDB uses a multi-annotation correspondence table in the backend. For rice, this includes mappings across identifier systems such as the International Rice Genome Sequencing Project (e.g., Os02g0661100) and the Rice Genome Annotation Project (e.g., LOC_Os02g44230), as well as gene names (e.g., OsTPP1). When a query is submitted, the AI matches user-provided terms to these identifiers. If a single match is found, the system returns the results directly. If multiple matches are identified, DeepPGDB displays all relevant genes with annotations and prompts the user to refine the query with a specific gene ID (Supplemental Figure 4).
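The single-match versus multiple-match behavior can be sketched as a lookup over a small correspondence table. The record layout, the `resolve` function, and the second (placeholder) record are hypothetical; the three identifiers in the first record are the examples quoted above and are assumed here, for illustration, to describe one gene.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class GeneRecord:
    irgsp_id: str  # e.g. Os02g0661100
    rgap_id: str   # e.g. LOC_Os02g44230
    name: str      # e.g. OsTPP1

# Toy correspondence table; the second row is a made-up placeholder.
TABLE: List[GeneRecord] = [
    GeneRecord("Os02g0661100", "LOC_Os02g44230", "OsTPP1"),
    GeneRecord("Os00g0000001", "LOC_Os00g00001", "OsTPP_demo"),
]

def resolve(term: str, table: List[GeneRecord] = TABLE) -> dict:
    """Match a user term to gene records: exact ID or name first, then partial name."""
    query = term.strip().lower()
    if not query:
        return {"status": "not_found", "genes": []}
    exact = [r for r in table
             if query in (r.irgsp_id.lower(), r.rgap_id.lower(), r.name.lower())]
    hits = exact or [r for r in table if query in r.name.lower()]
    if len(hits) == 1:
        status = "resolved"    # return results directly
    elif hits:
        status = "ambiguous"   # list candidates and ask for a specific gene ID
    else:
        status = "not_found"
    return {"status": status, "genes": hits}
```

Here `resolve("LOC_Os02g44230")` resolves to a single record, while `resolve("TPP")` is reported as ambiguous so the frontend can ask the user to pick one gene ID.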

In addition to basic queries and tabular outputs, DeepPGDB incorporates interactive statistical visualization tools powered by ECharts, enabling users to explore data intuitively through dynamic charts. Curated gene expression profiles for the included species are integrated into the system, allowing users to visualize differential expression patterns between groups in a dataset using natural language instructions (Figure 1F). Similar visualization and analysis workflows are available for enrichment analyses based on species-specific gene lists (Supplemental Figure 5). DeepPGDB also supports population genetics, a critical area of plant research, by integrating population genomic variation data from multiple species. When a query is submitted, the AI interprets the request, invokes the PLINK tool (Purcell et al., 2007), and retrieves results from preloaded population datasets. These outputs are displayed in a structured tabular format on the frontend (Figure 1G).
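As a rough illustration of how a region query could be turned into a PLINK call, the sketch below parses an interval string and builds an argument list using PLINK 1.9 options (`--bfile`, `--chr`, `--from-bp`, `--to-bp`, `--recode vcf`, `--out`). The function name, the dataset prefix, and the assumption that chromosomes are coded numerically in the dataset are hypothetical; the letter does not state which PLINK options DeepPGDB uses.

```python
import re
from typing import List

# Accepts "chr01:4826590-4828733" with either a hyphen or an en dash.
REGION_PATTERN = re.compile(r"^(?:chr)?0*(\d+):(\d+)[-–](\d+)$", re.IGNORECASE)

def build_plink_region_command(region: str, bfile_prefix: str, out_prefix: str) -> List[str]:
    """Build (but do not run) a PLINK command that extracts variants in a region."""
    match = REGION_PATTERN.match(region.strip())
    if match is None:
        raise ValueError(f"cannot parse region: {region!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("region start must not exceed region end")
    return [
        "plink",
        "--bfile", bfile_prefix,   # hypothetical preloaded population dataset
        "--chr", chrom,            # assumes numeric chromosome codes in the dataset
        "--from-bp", str(start),
        "--to-bp", str(end),
        "--recode", "vcf",
        "--out", out_prefix,
    ]
```

Returning an argument list and not a shell string lets the backend pass it to `subprocess.run` without shell interpolation, which matters when parts of the command originate from user text.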

DeepPGDB’s interactive and user-friendly framework for omics data retrieval and visualization meets the core expectations of modern biological databases. However, contemporary large language models provide capabilities beyond basic data querying, enabling advanced reasoning and synthesis. Drawing on the RAG framework, we incorporated a summarization module into DeepPGDB to serve as a reasoning layer that infers biological significance based on user requests and prior interactions. Its inference capabilities are supported by a foundational knowledge base and scientific reasoning logic developed through fine-tuning and structured prompt engineering. For example, the module can summarize haplotypes of the GW5L genomic interval (Tian et al., 2019) across rice subspecies, revealing subspecies-specific haplotype differentiation (Figure 1G). It can also extract the sequence of the rice OsSPL16 gene (Wang et al., 2012) and calculate its protein properties (Figure 1H), demonstrating its ability to perform multi-step biological reasoning and analysis.
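The kind of table a haplotype summary draws on can be sketched as a simple count of haplotypes per subgroup. This is a toy illustration under stated assumptions: the function name, the input layout, and the example alleles are invented and are not real GW5L data or DeepPGDB's actual summarization logic.

```python
from collections import Counter, defaultdict
from typing import Dict, Tuple

def summarize_haplotypes(
    genotypes: Dict[str, Tuple[str, ...]],
    groups: Dict[str, str],
) -> Dict[str, Dict[str, int]]:
    """Count how often each haplotype occurs in each subgroup.

    genotypes maps sample -> alleles at the variant loci (in locus order);
    groups maps sample -> subgroup label (e.g. a subspecies).
    """
    summary: Dict[str, Counter] = defaultdict(Counter)
    for sample, alleles in genotypes.items():
        haplotype = "".join(alleles)
        summary[haplotype][groups.get(sample, "unknown")] += 1
    return {hap: dict(counts) for hap, counts in summary.items()}

# Invented example data, for illustration only.
toy_genotypes = {
    "s1": ("A", "G", "T"),
    "s2": ("A", "G", "T"),
    "s3": ("C", "G", "T"),
    "s4": ("C", "G", "T"),
}
toy_groups = {"s1": "indica", "s2": "indica", "s3": "japonica", "s4": "indica"}
toy_summary = summarize_haplotypes(toy_genotypes, toy_groups)
```

A per-subgroup count table of this form is compact enough to hand to a language model as context, which can then describe which haplotypes are enriched in which subgroup.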

The parameter size of generative models is closely tied to their reasoning capabilities. To balance DeepPGDB’s performance and generation efficiency for practical use, we benchmarked several candidate core models under identical quantization settings and hardware conditions (a single V100-SXM2 GPU with 32 GB VRAM combined with a dual-processor system utilizing AMD EPYC 7642 CPUs) (Figure 1I). Each model was evaluated for output accuracy and reasoning time. The 14-billion-parameter reasoning model (Deepseek-r1:14b) achieved the best performance, with approximately 90% accuracy under a long pre-prompt (about 2100 tokens; Supplemental Data 1) across various tasks while maintaining shorter response times than larger reasoning models such as Deepseek-r1:32b. By contrast, a similarly scaled general-purpose model (Gemma:12b) generated instructions faster but showed markedly lower accuracy due to limited reasoning ability. To improve efficiency, we fine-tuned the reasoning model for use with a short pre-prompt (about 400 tokens; Supplemental Data 1), reducing token consumption while preserving the high accuracy observed with the long pre-prompt. This optimization significantly improved responsiveness, making the model suitable for deployment in DeepPGDB. Smaller reasoning models (Deepseek-r1:7b/Deepseek-r1:1.5b) showed modest accuracy improvements after fine-tuning on the same corpus, but failed to meet performance requirements with frequent hallucinations and lower reasoning accuracy despite reduced resource demands. As a result, Deepseek-r1:14b was selected for deployment (Figure 1I).
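A benchmark of this kind reduces to two numbers per model: the fraction of generated commands that match the expected command, and the mean response time. The sketch below assumes exact string matching as the correctness criterion and a hypothetical record layout; the letter's actual scoring procedure is described in Supplemental Data 1 and may differ.

```python
from statistics import mean
from typing import Dict, List

def score_runs(runs: List[dict]) -> Dict[str, dict]:
    """Aggregate accuracy and mean latency per model.

    Each run is a dict with keys: model, expected, generated, seconds.
    """
    by_model: Dict[str, List[dict]] = {}
    for run in runs:
        by_model.setdefault(run["model"], []).append(run)

    report = {}
    for model, items in by_model.items():
        # Assumed criterion: a run is correct if the command matches exactly.
        correct = sum(r["generated"].strip() == r["expected"].strip() for r in items)
        report[model] = {
            "accuracy": correct / len(items),
            "mean_seconds": mean(r["seconds"] for r in items),
        }
    return report
```

Exact matching is strict; a more forgiving harness might parse both commands and compare them field by field, so that differences in whitespace or argument order are not counted as errors.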

The introduction of DeepPGDB represents a transformative advancement in plant genomics, promoting interdisciplinary collaboration and bridging gaps among computational biology, genomics, and agricultural sciences. By integrating high-quality genomic and multi-omics data with AI-driven scheduling, standardized command generation, and interactive visualization tools, DeepPGDB empowers researchers from diverse backgrounds to efficiently access and interpret complex biological information. This platform not only supports precise gene localization, functional annotation, and population genetics studies but also enhances the extraction of biologically meaningful insights through its summarization module. Despite these capabilities, DeepPGDB has considerable potential for improvement. Future updates will focus on systematically curating and incorporating high-quality plant genome data based on user feedback, ensuring that the database remains comprehensive and up to date. We also plan to enhance the framework by fully integrating the Model Context Protocol architecture, expanding multi-omics integration, and strengthening the system’s capacity for biological insight extraction. With these developments, DeepPGDB is poised to drive major breakthroughs in agricultural science, conservation, and biotechnology, ushering plant research into a new era of data-driven discovery.

Funding

This work was supported by the STI 2030-Major Project (2023ZD04069), the Major Science and Technology Research Projects of Guangdong Laboratory for Lingnan Modern Agriculture (NT2021001), the National Natural Science Foundation of China (32400512, 32472129, and U24A20392), the Guangdong Basic and Applied Basic Research Foundation (2024A1515011981), the “YouGu” Plan and “Outstanding Youth Researcher” Program of the Rice Research Institute of Guangdong Academy of Agricultural Sciences (2023YG04 & 2024YG01), and the Introduction of Young Key Talents Program of the Guangdong Academy of Agricultural Sciences (R2023YJ-QC001).

Acknowledgments

We would like to express our great gratitude to all community members for their valuable feedback and suggestions during testing. No conflict of interest declared.

Author contributions

S.W., H.H., F.L., and J.C. conceived and supervised the study and revised the manuscript. F.L. led the model design and system construction, contributed to manuscript writing, and maintains the project. J.C. led data collection, frontend interaction optimization, manuscript writing, and figure generation. S.W., J.L., W.L., Z.H., P.Z., C.B., and Z.W. tested the database. H.H., S.W., and G.C. provided hardware support. B.S., Z.G., and H.Z. contributed to data collection.

Published: September 8, 2025

Footnotes

Supplemental information is available at Plant Communications Online.

Contributor Information

Haifei Hu, Email: huhaifei@gdaas.cn.

Shaokui Wang, Email: shaokuiwang@scau.edu.cn.

Supplemental information

Document S1. Supplemental Figures 1–5
mmc1.pdf (1.6MB, pdf)
Supplemental Data 1. Supplemental method
mmc2.pdf (193.2KB, pdf)
Document S2. Article plus supplemental information
mmc3.pdf (8MB, pdf)

References

  1. DeepSeek-AI, Guo D., Yang D., Zhang H., Song J., Zhang R., Xu R., Zhu Q., Ma S., Wang P., et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Preprint at arXiv. 2025. doi: 10.48550/arXiv.2501.12948.
  2. Liu W., Li J., Tang Y., Zhao Y., Liu C., Song M., Ju Z., Kumar S.V., Lu Y., Akbani R., et al. DrBioRight 2.0: an LLM-powered bioinformatics chatbot for large-scale cancer functional proteomics analysis. Nat. Commun. 2025;16:2256. doi: 10.1038/s41467-025-57430-4.
  3. Liu Z., Zhang C., He J., Li C., Fu Y., Zhou Y., Cao R., Liu H., Song X. plantGIR: a genomic database of plants. Hortic. Res. 2024;11. doi: 10.1093/hr/uhae342.
  4. Li J., Chen H., Wang Y., Chen M.J.M., Liang H. Next-Generation Analytics for Omics Data. Cancer Cell. 2021;39:3–6. doi: 10.1016/j.ccell.2020.09.002.
  5. Purcell S., Neale B., Todd-Brown K., Thomas L., Ferreira M.A.R., Bender D., Maller J., Sklar P., de Bakker P.I.W., Daly M.J., et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007;81:559–575. doi: 10.1086/519795.
  6. Tian P., Liu J., Mou C., Shi C., Zhang H., Zhao Z., Lin Q., Wang J., Wang J., Zhang X., et al. GW5-Like, a homolog of GW5, negatively regulates grain width, weight and salt resistance in rice. J. Integr. Plant Biol. 2019;61:1171–1185. doi: 10.1111/jipb.12745.
  7. Wang S., Wu K., Yuan Q., Liu X., Liu Z., Lin X., Zeng R., Zhu H., Dong G., Qian Q., et al. Control of grain size, shape and quality by OsSPL16 in rice. Nat. Genet. 2012;44:950–954. doi: 10.1038/ng.2327.
  8. Wang J.-Y., Xie Z.-X., Cui Y.-Z., Li B.-Z., Yuan Y.-J. Artificial design of the genome: from sequences to the 3D structure of chromosomes. Trends Biotechnol. 2025;43:304–317. doi: 10.1016/j.tibtech.2024.08.012.
  9. Wörheide M.A., Krumsiek J., Kastenmüller G., Arnold M. Multi-omics integration in biomedical research – A metabolomics-centric review. Anal. Chim. Acta. 2021;1141:144–162. doi: 10.1016/j.aca.2020.10.038.
  10. Yang F., Kong H., Ying J., Chen Z., Luo T., Jiang W., Yuan Z., Wang Z., Ma Z., Wang S., et al. SeedLLM·Rice: A large language model integrated with rice biological knowledge graph. Mol. Plant. 2025;18:1118–1129. doi: 10.1016/j.molp.2025.05.013.
  11. Yang H., Yue S., He Y. Auto-GPT for Online Decision Making: Benchmarks and Additional Opinions. Preprint at arXiv. 2023. doi: 10.48550/arXiv.2306.02224.
  12. Zhang R., Wang Y., Yang W., Wen J., Liu W., Zhi S., Li G., Chai N., Huang J., Xie Y., et al. PlantGPT: An Arabidopsis-Based Intelligent Agent that Answers Questions about Plant Functional Genomics. Adv. Sci. 2025;12. doi: 10.1002/advs.202503926.
