Abstract
Reconstructing bioinformatics workflows from the literature is foundational to reproducing and reusing published scientific analyses. However, the required details—processing steps, software tools, versions, and parameter settings—are dispersed across narrative text, tables, figure captions, and supplemental files. Manual reconstruction typically takes hours per paper and is error-prone, while existing question-answering (QA) and retrieval systems focus on local passages and lack the full-text, multimodal capabilities needed to automatically rebuild complete workflows. We introduce BioWorkflow, a large language model (LLM)-based, retrieval-augmented framework that automates end-to-end workflow extraction from publications by (i) parsing PDFs and building a unified index over text, tables, and figures with chunk-level summaries and embeddings; (ii) hierarchically decomposing queries with dynamic reformulation when new entities or ambiguities emerge; (iii) performing iterative, context-aware retrieval and assembling a directed workflow that captures steps, tools, versions, and parameters; and (iv) linking each predicted element to its cited evidence and running automated consistency checks to suppress hallucinations and ensure traceability. Evaluated on 100 expert-annotated papers, BioWorkflow recovers ~80% of workflow steps (versus ~20% for existing tools), improves reproducibility, completeness, and accuracy by >20% over strong LLM baselines, and reduces curation time to 3–5 minutes per paper, enabling rapid and reliable reuse of published pipelines.
Keywords: bioinformatics, large language models, retrieval-augmented generation, workflow extraction, multimodal
Introduction
As multi-omics datasets become increasingly accessible, researchers are building ever more elaborate bioinformatics workflows to uncover biological insights and drive clinical applications [1–3]. While experimental findings are clearly highlighted in the Results section and are easy to locate, the underlying workflows—comprising software toolchains, algorithm versions, parameter settings, and pre- and post-processing steps—are often buried in narrative text, complex tables, figures, and supplementary materials [4]. For example, one paper may list only the key software tools in the main text, while the parameters, thresholds, filters, or preprocessing steps are relegated to the supplementary materials; another may describe ablations or follow-up experiments with different configurations in the Discussion. Rapid access to these workflows is critical for accelerating both basic research and clinical translation. Yet manual extraction requires hours of careful reading of full articles and supplementary materials, making it both labor-intensive and error-prone [5]. As the literature grows exponentially, this manual approach becomes untenable, leaving valuable methods buried and not readily reusable.
Early efforts to automate literature-based extraction have centered on retrieval-augmented generation (RAG) and related large language model (LLM) approaches [6, 7]. RAG mitigates LLM “hallucinations” by first retrieving relevant snippets and then conditioning the generation on that evidence—a strategy shown to be effective in open-domain question answering (QA) and document summarization [8–11]. However, standard biomedical QA benchmarks (e.g., BioASQ-QA and PubMedQA) operate only at the abstract level and target simple factoid or yes/no questions, offering no support for the full-text, multi-hop inference needed to recover detailed workflows [12–14]. Recent LLMs (e.g., GPT-4o) can accept PDF uploads but still rely on single-pass retrieval with expert-crafted prompts, yielding only high-level summaries that omit critical steps [5, 15]. In bioinformatics, BCO-RAG is the only prior attempt to auto-populate BioCompute Object (BCO) schemas from publications via static, single-pass, fixed-template retrieval; it does not integrate tables or figures and has limited publicly reported quantitative evaluations [16–23]. Rule-based parsers for consistently formatted pipelines break under format variation, and GUI-based tools (e.g., Galaxy’s BCO export) require extensive manual editing [24]. To date, no method can, starting from raw PDFs, jointly handle multimodal content, iteratively fill information gaps, validate against source evidence, and output a fully ordered, standards-compliant workflow.
Extracting full analysis workflows from published articles faces two key challenges. The first is multimodal dispersion: pipeline elements are scattered across unstructured text passages, structured tables, figures, and multiple supplementary files [5]. Single-pass, text-only retrieval often misses or incompletely captures these details. The second is context disambiguation: a single paper may use one parameter set for primary experiments and a different set for control or ablation experiments. Without preserving both local and global contexts, retrieval systems cannot determine which values belong to the core workflow rather than to auxiliary analyses, leading to incomplete or incorrect reconstructions.
Although LLMs have advanced literature QA and retrieval, they exhibit two further limitations for workflow extraction: (i) reliance on single-pass, expert-crafted prompts that demand specialized prompt engineering and deep domain expertise—barriers to bench scientists and regulators; and (ii) hallucination risk—without rigorous grounding and verification against source passages, models may fabricate steps or parameters, undermining trust. Moreover, most biomedical search tools focus on abstract-level retrieval and cannot meet the full-text, multimodal requirements of workflow extraction. These shortcomings motivate an end-to-end solution that jointly handles multimodal content, maintains contextual fidelity, supports iterative retrieval, and enforces automatic validation.
Unlike prior systems such as BCO-RAG and LLM-based literature retrieval tools that perform single-pass searches with expert-crafted prompts, BioWorkflow combines hierarchical query decomposition with iterative query refinement: a publication-level objective is decomposed into interdependent subquestions that first recover the workflow backbone (ordering and I/O) and then resolve steps, tools/versions, and parameter values; leaf queries are dynamically reformulated as intermediate answers surface new entities or ambiguities, improving step recall and preserving execution order. Beyond text-only retrieval, BioWorkflow performs multimodal integration by constructing a unified vector index over full text, structured tables, and figures, storing for each chunk both the original content and a concise context summary to enable cross-modal retrieval and increase parameter- and version-level coverage. Where many LLM-based tools return fluent but loosely grounded answers that may omit or paraphrase critical details, BioWorkflow enforces validation and traceability: every predicted step, tool, version, and parameter is linked to its cited source and checked by automated consistency rules, curbing misleading outputs. Together, these design choices address the four issues outlined above and lead directly into the components detailed in the Materials and Methods section, with empirical validation in the Results section.
Applied to 100 expert-annotated articles, BioWorkflow recovers ~80% of workflow steps (versus ~20% for BCO-RAG), improves reproducibility, completeness, and accuracy by >20% over state-of-the-art LLMs (GPT-o4-mini), and reduces curation time to 3–5 min per paper. Outputs include BCO-compliant JavaScript Object Notation (JSON) and concise, user-oriented protocols, enabling rapid and reliable reuse of published pipelines.
Materials and methods
Methods overview
Existing LLM-based RAG pipelines often struggle with low retrieval precision and hallucinations when faced with the complex, multimodal nature of bioinformatics literature. To overcome these issues, we present BioWorkflow (Fig. 1), an end-to-end automated workflow reconstruction framework. First, BioWorkflow parses each PDF into synchronized text passages, table entries, and figure captions—directly tackling the multimodal dispersion challenge (Challenge 1). An LLM then generates concise summaries for each chunk—extracting software toolkits, algorithms, protocol steps, and other entities—and embeds both summaries and originals into a FAISS vector space, enabling context-aware retrieval of all modalities (Challenge 2). Next, when a user submits a query, it is automatically decomposed into a hierarchy of interdependent subquestions. Each subquestion undergoes multi-round retrieval with dynamic reformulation—in place of single-shot, hand-engineered prompts—ensuring iterative, precise query refinement without requiring specialized prompt expertise (Challenge 3). Finally, BioWorkflow stitches together structured, citation-backed answers for each subquestion using stage-specific prompt templates and employs an automated validation agent to cross-check every generated step against its source evidence—suppressing hallucinations and guaranteeing automatic validation and traceability (Challenge 4). When applied to a standard bioinformatics corpus, BioWorkflow delivers substantial gains in retrieval accuracy and answer fidelity, enabling researchers to rapidly and reliably reuse published workflows. In the following section, we describe in detail the design of the BioWorkflow framework (see Fig. 2).
Figure 1.
Overview of the end-to-end workflow extraction pipeline.
Figure 2.
System workflow of BioWorkflow, consisting of five modules: (1) user query input, (2) hierarchical query decomposition, (3) multimodal parsing and embedding of text, tables, and figures, (4) iterative retrieval and structured workflow assembly, and (5) final answer generation with automated validation.
Literature collection and preprocessing
We begin by assembling a comprehensive corpus of bioinformatics publications in PDF format, which are then parsed into machine-readable components using a state-of-the-art toolkit (e.g., PaperMage). Each PDF is converted into plain text, tables, and figure image files, while metadata (title, authors, abstract, and journal) is extracted via Optical Character Recognition (OCR) and structured-document parsers. To support unified, multimodal search, we generate detailed natural-language descriptions for every table and figure—leveraging an LLM to translate graphical or tabular content into annotated captions linked to their original labels. This preprocessing transforms heterogeneous PDF inputs into text passages, tabular entries, image descriptions, and metadata records, all indexed by document and section. Unlike prior tools (e.g., BCO-RAG) that primarily index narrative text and flatten or ignore tables and figures, our preprocessing preserves table structure and links figure-derived descriptions back to their original locations. This design surfaces parameters, thresholds, and software versions that are frequently absent from the main text and improves subsequent retrieval of workflow details.
Next, we organize the extracted text into “chunks” of 500–1000 tokens each to preserve semantic coherence within retrieval units. Each chunk, whether narrative text, table description, or figure caption, is embedded into a high-dimensional vector space via a pretrained embedding model (e.g., text-embedding-3-small). To accelerate similarity search at scale, we index these vectors in a FAISS-based vector store, using LLM-generated summaries as compact, focused representations that reduce noise and improve retrieval precision. Concurrently, we apply LLM-driven named-entity recognition to tag key bioinformatics entities—such as software tools, parameter values, file formats, databases, and scripting languages—within each chunk. By coupling semantically rich embeddings with entity annotations, our vector database enables rapid, accurate lookup of workflow-relevant content in downstream RAG stages. Compared with text-only embedding stores used by prior RAG systems, this summary-augmented, entity-aware, and provenance-preserving index reduces retrieval drift and increases hit precision for step/tool/version/parameter queries—laying the groundwork for hierarchical query decomposition and iterative query refinement introduced in this section.
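The chunking and summary-augmented indexing described above can be sketched as follows. This is a minimal, dependency-free illustration, not BioWorkflow's actual implementation: a deterministic pseudo-random unit vector stands in for the pretrained embedding model (e.g., text-embedding-3-small), and a brute-force in-memory store stands in for the FAISS index. All names are illustrative.

```python
import math
import random

def chunk_tokens(tokens, max_len=1000):
    """Greedy split of a token stream into retrieval units of <= max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def embed(text, dim=32):
    """Stand-in embedding: a deterministic pseudo-random unit vector per text.
    (The real system uses a pretrained model such as text-embedding-3-small.)"""
    rng = random.Random(text)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

class VectorStore:
    """Brute-force stand-in for the FAISS index. Each chunk is stored twice,
    once under its raw text and once under its LLM summary, both pointing to
    the same payload, so either representation can be matched at query time."""
    def __init__(self):
        self.entries = []  # (vector, payload) pairs

    def add_chunk(self, raw_text, summary, payload):
        self.entries.append((embed(raw_text), payload))
        self.entries.append((embed(summary), payload))

    def search(self, query, k=3):
        q = embed(query)
        scored = [(sum(a * b for a, b in zip(vec, q)), payload)
                  for vec, payload in self.entries]
        scored.sort(key=lambda t: -t[0])
        return [payload for _, payload in scored[:k]]
```

In this sketch the summary embedding gives each chunk a second, low-noise entry point, which is the mechanism the text credits with improved retrieval precision.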
Question decomposition and reformulation
To focus retrieval on the most relevant evidence, we first decompose the user’s free-text query into a hierarchy of interdependent subquestions using reasoning-guided templates initialized with the article abstract. Each subquestion is embedded in the same vector space as the preprocessed document chunks, and we compute cosine similarity between the subquestion vector and all chunk vectors to identify the top-K chunks (passages/tables/figures) most germane to that query, thereby reducing the effective context from an entire article to a small set of highly relevant chunks. In contrast to prior systems (e.g., BCO-RAG or LLM-based literature retrievers) that issue a single-pass prompt over a vector store, this hierarchical decomposition targets step-level, tool/version-level, and parameter-level information explicitly, enabling fine-grained retrieval rather than broad, under-specified matches.
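The hierarchical decomposition can be illustrated with a small reasoning tree whose backbone subquestion precedes the tool/version- and parameter-level leaves. The template wording and class names below are hypothetical, chosen only to show the tree shape; they are not BioWorkflow's actual prompts or API.

```python
from dataclasses import dataclass, field

@dataclass
class SubQuestion:
    text: str
    children: list = field(default_factory=list)

def decompose(objective):
    """Template-driven decomposition (illustrative templates): recover the
    workflow backbone first, then per-step tool/version and parameter leaves."""
    root = SubQuestion(objective)
    backbone = SubQuestion(
        "What are the ordered pipeline steps and the inputs/outputs of each?")
    root.children.append(backbone)
    for aspect in ("software tool and version", "parameter values and thresholds"):
        backbone.children.append(
            SubQuestion(f"For each recovered step, what are the {aspect}?"))
    return root

def leaves(node):
    """Leaf subquestions are the ones actually embedded and sent to retrieval."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]
```

Because leaves depend on their parent, the backbone answer can later be prepended to each leaf prompt, matching the dependency structure described in the text.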
A key design choice is the use of standardized, decomposition-specific templates in place of ad hoc, hand-crafted prompts. These templates (i) guide retrieval by embedding clear cues about expected content (e.g., “list the software, version, and inputs for each pipeline step”), and (ii) align generation by specifying output formats and field names. Compared with single-shot keyword prompts, these templates narrow the search space and stabilize outputs, which is essential for step/tool/version/parameter extraction.
Once preliminary answers are obtained for each leaf, a question reformulator examines the returned content. When new entities, metrics, or ambiguities emerge, it extracts salient keywords (e.g., tool names, biomarkers, and statistical measures) from the leaf answer and injects them back into the subquestion template, yielding a more precise, reformulated query (as shown in Fig. 3). This refined query is then resubmitted to the retrieval and answer-generation pipeline, iterating through retrieval, generation, and reformulation until the branch no longer benefits or a user-specified maximum depth is reached (as shown in Fig. 4). Throughout, we enforce depth/branch constraints, perform deduplication to avoid redundant subquestions, and apply disambiguation rules so that each node addresses a unique information need. Unlike single-pass pipelines, this iterative refinement preserves both local evidence and global document context, directly addressing context disambiguation by steering each branch toward the parameter set and step ordering used in the core workflow rather than in auxiliary analyses.
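A minimal sketch of the reformulation loop is shown below. The keyword extractor here is a toy regular-expression heuristic for capitalized tool-like tokens; the actual reformulator uses an LLM to mine entities, metrics, and ambiguities from leaf answers. The stopping rule mirrors the text: iterate until no new keywords surface or a maximum number of rounds is reached.

```python
import re

def extract_keywords(answer, known):
    """Toy heuristic: capitalized, tool-like tokens not yet seen.
    (The deployed system extracts entities with an LLM instead.)"""
    candidates = re.findall(r"\b[A-Z][A-Za-z0-9_+-]{2,}\b", answer)
    return [c for c in candidates if c not in known]

def refine(question, answer_fn, max_rounds=3):
    """Iteratively reformulate a subquestion with keywords mined from answers."""
    known, q = set(), question
    for _ in range(max_rounds):
        answer = answer_fn(q)
        new = extract_keywords(answer, known)
        if not new:  # branch no longer benefits: stop refining
            break
        known.update(new)
        q = question + " (focus on: " + ", ".join(sorted(known)) + ")"
    return q, sorted(known)
```

Each reformulated query re-enters retrieval with sharper cues, which is how the loop steers a branch toward the core workflow's entities rather than auxiliary analyses.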
Figure 3.
Retrieval and question reformulator: BioWorkflow retrieves multimodal evidence and refines subquestions based on parent query answers for more precise, context-aware queries.
Figure 4.
Dynamic question reformulation: BioWorkflow iteratively refines subquestions by extracting keywords from answers and resubmitting updated queries until improvements plateau or depth limits are reached.
Vector retrieval and subquestion answering
In this stage, each dynamically reformulated subquestion is resolved by our RAG retriever against the preprocessed corpus. First, the target PDF is partitioned into the same text, table, and figure chunks used during preprocessing, and each chunk’s embedding is stored in the FAISS index. When a subquestion is ready, we first narrow the search space by matching its vector against a summary-level index to identify the most relevant sections of each paper. We then perform a k-nearest neighbors (k-NN) query over the FAISS store, configured with an inverted file (IVF) flat index of 1024 centroids, to retrieve the top-K semantically similar chunks, whether narrative passages, table entries, or figure descriptions. These retrieved snippets are concatenated with a stage-specific prompt template and fed to the LLM (e.g., GPT-4o), which generates the answer to that subquestion. By grounding every subquestion in high-relevance, multimodal evidence, the retriever maps complex bioinformatics queries—such as “Which statistical test validated differential expression?”—to the precise document fragments that address them. Finally, answers to all subquestions are aggregated and passed to the synthesis module to produce the complete, structured workflow response.
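The two-stage narrowing described above can be sketched as follows. The deployed system delegates the nearest-neighbor search to a FAISS IVF-flat index with 1024 centroids; this dependency-free sketch substitutes brute-force cosine search over unit-normalized vectors and uses hypothetical function names.

```python
def cosine(a, b):
    # Dot product; inputs are assumed unit-normalized, so this is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def knn(vectors, query, k):
    """Brute-force k-NN (the real system uses a FAISS IVF-flat index here)."""
    order = sorted(range(len(vectors)), key=lambda i: -cosine(vectors[i], query))
    return order[:k]

def two_stage_retrieve(section_vecs, chunks, query, k_sections=2, k_chunks=3):
    """Stage 1: match the subquestion against section-level summary vectors.
    Stage 2: run k-NN only over chunks belonging to the selected sections.
    `chunks` is a list of (section_id, vector, payload) triples."""
    keep = set(knn(section_vecs, query, k_sections))
    candidates = [c for c in chunks if c[0] in keep]
    top = knn([c[1] for c in candidates], query, k_chunks)
    return [candidates[i][2] for i in top]
```

Restricting stage 2 to the sections selected in stage 1 is what keeps retrieval anchored to the right experimental context (e.g., primary rather than control analyses).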
In the answer generation stage, each retrieved subquestion is converted into a structured prompt augmented with clear output specifications and grounded evidence. We use carefully crafted templates that not only outline the required response format but also embed the retrieved snippets, ensuring the LLM generates answers “with evidence,” so outputs remain accurate and traceable. When a subquestion depends on a parent node, that parent’s answer is prepended to the prompt, preserving the reasoning hierarchy and maintaining coherence across levels. In contrast to prior systems that return fluent but loosely grounded summaries, this schema-and-evidence conditioning curbs over-generalization and prevents parameters from being borrowed from the wrong context (e.g., control versus primary experiments).
We adopt a multi-turn interaction strategy. Immediately after each response, an automated evaluation agent reviews the answer for completeness, factual accuracy, and relevance to the subquestion and the original user query. If the score falls below a threshold, the subquestion is routed back to the question reformulator for dynamic rewriting—incorporating newly identified entities or clarifications—before re-entering the retrieval and answer generation pipeline. The loop repeats until the answer meets quality criteria or a maximum number of iterations is reached. Unlike static, single-pass pipelines, this closed-loop retrieval–generation–reformulation process directly addresses context disambiguation and mitigates hallucinations, yielding precise, comprehensive, and fully traceable sub-answers for downstream synthesis.
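The closed-loop retrieve, generate, score, and rewrite cycle can be expressed as a short control loop. The scoring function here is an abstract callable standing in for the automated evaluation agent (which judges completeness, factual accuracy, and relevance); the threshold value and function names are illustrative.

```python
def answer_with_validation(question, retrieve, generate, score, reformulate,
                           threshold=0.8, max_iters=3):
    """Closed-loop retrieval-generation-validation: score each answer and,
    below threshold, route the subquestion back through the reformulator."""
    q = question
    answer = None
    for _ in range(max_iters):
        evidence = retrieve(q)
        answer = generate(q, evidence)
        if score(answer, question) >= threshold:
            return answer, True   # answer meets the quality criteria
        q = reformulate(q, answer)  # rewrite with newly identified entities
    return answer, False  # iteration budget exhausted; flag for review
```

Returning the pass/fail flag alongside the answer lets downstream synthesis queue unresolved branches for another iteration, as described in the final synthesis stage.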
Final synthesis
In the final synthesis stage, the system first checks whether all leaf nodes in the reasoning tree have generated sub-answers of acceptable quality; if any branches remain unresolved, those subquestions are automatically queued for another iteration of retrieval, generation, and reformulation. Once every branch is complete, the pipeline aggregates all validated sub-answers into a unified narrative, applying a final pass of LLM-based post-editing to ensure logical coherence, terminological consistency, and stylistic fluency. Depending on the user’s specified output format—whether a detailed stepwise workflow, a tabular summary of methods, or a graphical process diagram—the synthesized content is formatted accordingly, with inline citations of original document snippets, structured tables for key parameters, and annotated figure captions as needed. This modular synthesis both preserves the hierarchical reasoning provenance and delivers a polished, ready-to-use workflow description tailored to the user’s requirements.
Results
To comprehensively evaluate BioWorkflow’s performance, we designed four complementary experiments in the following subsections: Multi-hop reasoning validation, Ablation study design and results, Expert-driven manual evaluation, and Efficiency and user feedback. Each experiment tests a different aspect of our framework—covering general multi-hop reasoning ability, module-level contributions, real-world usability in BCO extraction and workflow summarization, and end-to-end efficiency. In all of the following experiments, BioWorkflow uses GPT-4o as the LLM and the text-embedding-3-small model for embedding tasks.
Multi-hop reasoning validation
We first assessed BioWorkflow’s ability to perform deep, multi-step inference by benchmarking on HotpotQA, a widely used question-answering dataset that requires multi-hop reasoning across multiple Wikipedia passages. Although HotpotQA is not specific to bioinformatics, it stresses the same core capabilities—retrieving evidence from disparate document sections and chaining multiple inference steps—that underlie complex bioinformatics workflow extraction [25]. Following the setup of previous work [26], we randomly sampled 1000 questions that require chaining evidence from disparate document sections.
We compared BioWorkflow against seven state-of-the-art RAG variants—SiReRAG [27], Vendi-RAG [28], SearChain [29], Collab-RAG [30], naive RAG, ReaRAG [31], and Chain of Evidence (CoE) [32]—under the same retrieval and generation settings. The primary metric is exact-match accuracy. As shown in Fig. 5, BioWorkflow matches CoE’s performance and outperforms all other baselines. These results demonstrate that our iterative retrieval–reformulation strategy and hierarchical reasoning-tree architecture effectively balance retrieval diversity and contextual relevance, yielding competitive accuracy on complex, multi-hop queries.
Figure 5.
Accuracy on HotpotQA for selected RAG variants, including BioWorkflow.
Therefore, having established strong multi-hop reasoning ability, we next isolate individual components to confirm their necessity through ablation studies.
Ablation study design and results
To validate the contribution of each core component, we ran six ablation experiments—disabling or modifying one module at a time while keeping all others active. We compared each variant to the full BioWorkflow pipeline (the “Baseline”) using four published gold-standard workflows from the BioCompute DB [33] and three automated metrics from the open-source DeepEval framework: Answer Relevancy, Faithfulness, and Contextual Relevancy (CR) [34].
DeepEval leverages an LLM to identify statements in the generated output and classify each statement against the input prompt and retrieved evidence. Answer relevancy measures how directly and appropriately the output addresses the query, faithfulness assesses whether each claim aligns with source material, and CR evaluates the coherence of the response with its retrieval context. By automating these checks, we obtain rapid, repeatable insights into how each module affects overall performance.
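To make the statement-level scoring concrete, the sketch below computes a toy faithfulness score: the answer is split into statements and each statement is checked against the retrieved evidence. The word-overlap verdict is a stand-in for DeepEval's LLM judge, and the 0.5 overlap threshold is an arbitrary illustrative choice.

```python
def split_statements(text):
    """Naive sentence splitter standing in for LLM statement extraction."""
    return [s.strip() for s in text.split(".") if s.strip()]

def supported(statement, evidence, min_overlap=0.5):
    """Toy verdict: a statement counts as grounded if at least half of its
    words appear in the evidence. (DeepEval uses an LLM judge instead.)"""
    words = set(statement.lower().split())
    ev_words = set(evidence.lower().split())
    return len(words & ev_words) / max(len(words), 1) >= min_overlap

def faithfulness(answer, evidence):
    """Fraction of answer statements supported by the retrieved evidence."""
    statements = split_statements(answer)
    if not statements:
        return 0.0
    return sum(supported(s, evidence) for s in statements) / len(statements)
```

An unsupported statement lowers the score proportionally, which is the behavior that lets the ablation study detect hallucinated steps when validation is disabled.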
TextOnly: Only text chunks are indexed in FAISS; tables and figures are ignored. In BioWorkflow, text, table, and figure chunks are all indexed (Challenge 1).
NoSummary: Removing LLM-generated summaries from the index. During embedding, only raw chunk embeddings are used; no LLM-generated summaries are included. In BioWorkflow, both raw chunks and their LLM-generated summaries are jointly embedded (Challenges 2 and 3).
NoTree: Removing the reasoning-tree module. Chain-of-thought decomposition is disabled; the system issues the full user query in one shot to retrieval and generation. In BioWorkflow, queries are decomposed into a reasoning tree, with each subquestion processed sequentially (Challenges 2 and 3).
NoReform: Removing dynamic query reformulation. After initial decomposition and retrieval, no further keyword-based query rewriting is performed; each subquestion is submitted exactly once. In BioWorkflow, each subquestion undergoes iterative dynamic rewriting based on newly extracted keywords and parent-node answers (Challenges 2 and 3).
NoParentContext: Omitting parent-context injection. Each subquestion prompt omits any context from its parent subquestions; retrieval and generation rely solely on the current subquestion’s content. In BioWorkflow, parent subquestion answers are injected into each child subquestion prompt (Challenge 3).
NoValidate: Removing automated validation. The automated validation agent is turned off; the system uses the LLM’s first-round answer without any re-evaluation or refinement. In BioWorkflow, each generated sub-answer is cross-checked against retrieved evidence and, if necessary, triggers additional queries or corrections (Challenge 4).
As shown in Fig. S1(a)–(f), the results of our ablation experiments highlight the critical contributions of each core component in the BioWorkflow pipeline. Excluding multimodal content, such as tables and figures, significantly impacted contextual relevancy, with a 25-point drop, underscoring the importance of grounding in multiple modalities to ensure factual consistency. This pronounced gap reflects our evaluation setup: under DeepEval's strict mode, CR is a binary judgment of each subquestion’s retrieved context (relevant = 1; not relevant = 0). Small retrieval shifts (e.g., landing in a control section rather than in the primary experiment), therefore, flip many items from 1 to 0 and amplify between-variant differences. By contrast, answer relevancy and faithfulness are continuous [0,1] scores of the generated answer, so their deltas are naturally smoother and smaller.
The removal of LLM-generated summaries also depresses answer relevancy and CR: summaries distill verbose passages into focused, low-noise embeddings, improving nearest-neighbor retrieval and reducing drift. Disabling hierarchical question decomposition causes sharp drops—especially in CR—because flat, single-pass prompts fail to form entity-specific subqueries that recover step ordering and parameterized details. Likewise, removing iterative query reformulation and parent-context injection reduces CR by breaking discourse continuity and leaving ambiguities unresolved, which increases off-target evidence. Finally, ablations that omit automated validation and traceability degrade overall performance, indicating that consistency checks are essential for suppressing hallucinations and aligning answers with their cited sources.
Taken together, these results show that multimodal grounding, hierarchical and iterative retrieval, summary-augmented indexing, and post-generation validation each play a nonredundant role. All four are required to achieve high retrieval accuracy, answer fidelity, and end-to-end workflow reconstruction.
Expert-driven manual evaluation
In addition to automated metrics, we report two complementary evaluations because they target different use cases and notions of correctness. (i) BCO extraction assesses whether a system can reconstruct a schema-constrained, standards-compliant workflow suitable for archival and compliance settings; the gold standard is a set of expert-validated BCOs paired with their source publications. Metrics emphasize schema-level correctness (domain/field match, required-domain coverage, step ordering, and version/parameter correctness), plus provenance linkage to cited passages. (ii) Workflow summary extraction evaluates readable, concise narratives for practitioners; the gold standard is an expert-written reference for 100 papers. Quality is judged by blinded experts on completeness, accuracy, and reproducibility, scored on a 0–10 scale, reflecting how well each summary conveys the workflow backbone and step/tool/version/parameter details.
BioCompute object extraction evaluation
To assess real-world usability, we collected a set of publicly available BCOs from the BioCompute DB [33], which serves as a repository for expert-extracted BCOs. Currently, the database includes 50 published BCOs, of which nine are associated with bioinformatics papers containing extractable workflow details. We used these nine BCOs, supported by corresponding literature, as our gold standard for comparison. These BCOs document workflow steps as specified in the BCO Description Domain. We automatically generated BCO JSON from each paper using BioWorkflow and compared them against the gold-standard BCOs and the output of BCO-RAG.
Quantitative comparison
We retrieved nine published BCOs with supporting literature from the BioCompute DB and used each BCO’s Description Domain as the gold-standard reference. For each of these papers, we generated two sets of BCO JSON files:
Ours: BCOs produced by our BioWorkflow framework.
BCO-RAG (Baseline): We adopted the existing BCO-RAG method—the only approach proposed to date—which populates BCO’s Description Domain via a simple, static, single-pass retrieval. In contrast, our method performs dynamic, multi-round iterative retrieval to continuously refine and enhance the context for BCO’s Description Domain.
We focused on the pipeline steps field within the Description Domain. For each method, we computed recall, precision, and F1 scores.
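The per-paper scoring over pipeline steps can be sketched as below. For illustration the matching of a predicted step to a gold-standard step uses exact comparison after simple whitespace and case normalization; in practice, matching step descriptions may require fuzzier alignment, so treat this as a simplified stand-in.

```python
def step_prf(predicted, gold):
    """Precision, recall, and F1 over pipeline steps, using exact matching
    after lowercasing and whitespace normalization (a simplifying assumption)."""
    norm = lambda s: " ".join(s.lower().split())
    pred = {norm(s) for s in predicted}
    ref = {norm(s) for s in gold}
    tp = len(pred & ref)  # steps both extracted and present in the gold BCO
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Under this definition, missed gold steps lower recall while spurious extracted steps lower precision, matching the interpretation of the scores reported below.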
As shown in Fig. 6, BioWorkflow outperforms the BCO-RAG baseline across all three metrics—most notably, our recall exceeds 80%, compared with under 15% for the baseline, highlighting BioWorkflow’s superior ability to capture complete pipeline steps.
Figure 6.
Precision, recall, and F1 for BioWorkflow versus BCO-RAG baseline in BCO pipeline steps extraction.
Figure 7 provides a concrete example from the BCO experiment, comparing BioWorkflow and BCO-RAG. The figure includes the paper’s gold-standard workflow alongside the JSON outputs produced by both methods. Relative to the reference workflow, BCO-RAG omits several software tools and misreports version numbers. By contrast, BioWorkflow reconstructs the complete workflow consistent with the gold standard and correctly extracts the software versions from the paper.
Figure 7.
Example comparative results in the BCO experiment: BioWorkflow versus BCO-RAG.
Qualitative assessment
Because quantitative metrics measure only “step count” accuracy, we supplemented our evaluation with a manual assessment of output readability and usability. Three bioinformatics experts independently reviewed nine published BCOs and the corresponding JSON files generated by BioWorkflow and BCO-RAG. Each expert evaluated the BCOs based on four specific criteria, scoring them on a scale from 0 to 2 (0 = Poor, 1 = Average, 2 = Excellent). This scoring system is consistent with the standard used in the BCO-RAG paper, ensuring comparability and alignment with previously established benchmarks:
Completeness: Does the output include all critical pipeline steps from the gold-standard BCO?
Accuracy: Are the extracted steps described correctly (tool names, parameter settings, etc.)?
Readability: Are JSON fields and any annotations clear, self-explanatory, and aligned with the IEEE 2791–2020 schema?
Overall Usability: How satisfied would a user be with this output for direct inclusion in a manuscript?
Figure 8 shows that BioWorkflow’s expert scores are more than six times higher in completeness and five times higher in both accuracy and readability, with overall usability also improved by over fourfold compared with BCO-RAG. These multipliers underscore that BioWorkflow captures far more complete, precise pipeline steps and delivers JSON outputs that are substantially more readable and ready for integration without further editing.
Figure 8.
Radar chart comparing BioWorkflow versus BCO-RAG baseline on completeness, accuracy, readability, and overall usability.
Quantitatively, BioWorkflow delivers substantially higher recall, precision, and F1 scores than BCO-RAG, confirming its ability to extract pipeline steps more completely and accurately. At the same time, our pipeline-step outputs are far more readable and user-friendly, greatly reducing the need for manual editing before integration into manuscripts.
Bioinformatics workflow summary extraction evaluation
Beyond schema-constrained outputs, we evaluate BioWorkflow on its ability to produce user-friendly workflow summaries. Bioinformatics experts curated 100 multi-omics papers from PubMed published in the past 5 years, prioritizing leading peer-reviewed journals. We excluded reviews, case reports, and papers without extractable end-to-end workflows. Ten bioinformatics experts then screened abstracts and full texts to confirm explicit workflow content and authored reference summaries that reconstruct the complete workflow (backbone ordering/I/O, steps, tools/versions, and key parameters), each linked to supporting passages. The resulting set spans multiple omics modalities (e.g., genomics, transcriptomics, and epigenomics) and covers diverse organisms, platforms, and journals, yielding a broad, contemporary, and representative gold-standard benchmark. All experts involved in workflow extraction and evaluation had at least 5 years of bioinformatics experience and held a master’s degree or higher.
Using the identical prompt, “Summarize the workflow,” we compare BioWorkflow with GPT-o4-mini, GPT-4, and GPT-4o to cover commonly deployed capability tiers in document-centric QA. GPT-o4-mini represents a high-throughput, cost-efficient tier commonly chosen for large-scale document QA; GPT-4 is a widely adopted strong text baseline used as a de facto reference in academic evaluations; GPT-4o is a frontier multimodal model offering stronger vision-text reasoning and better latency/cost than GPT-4. Together, they form a representative triad of low/medium/high capability models with stable APIs and long-context support, all compatible with our PDF-to-chunks retrieval pipeline. For fairness, we fix prompts and decoding hyperparameters across models.
Ten bioinformatics experts, blinded to system identity, compared system outputs to the gold standard and scored them on three metrics. Each metric was rated on a 0–10 scale (0 = insufficient; 10 = fully accurate, clear, and complete), reflecting expert judgment of how well the summary conveyed each aspect of the workflow:
Completeness: The degree to which the workflow enumerates all essential modules—from sample prep and instrumentation to data processing and statistical tests.
Accuracy: The extent to which each mentioned tool, version, parameter, and analytic method matches the original description.
Reproducibility and usability: How clear, logically structured, and detailed the summary is—covering data sources, tools/versions, parameters, quality control—so another researcher can both understand and execute the full analysis.
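The per-metric aggregation behind this rating scheme can be sketched as follows; the input format (metric name mapped to per-paper scores, each already averaged over the blinded raters) is an assumption for illustration, not the study's actual data layout.

```python
from statistics import mean, pvariance

def aggregate_scores(ratings):
    """Aggregate 0-10 expert ratings per metric across papers.

    `ratings` maps metric name -> list of per-paper scores (each assumed
    to be pre-averaged over the blinded raters). Returns the mean and
    population variance per metric, as plotted in mean/variance summaries.
    """
    return {metric: {"mean": mean(scores), "variance": pvariance(scores)}
            for metric, scores in ratings.items()}
```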
Figure 9 shows the mean and variance of these metrics across the 100 papers. BioWorkflow consistently outperforms GPT-o4-mini, GPT-4, and GPT-4o, with higher averages and lower variance, indicating greater reliability and stability. The three baselines operate as single-turn, prompt-driven systems: given the same user prompt and the same retrieved snippets, they perform one-shot retrieval and generation, without hierarchical decomposition or iterative reformulation. Their differences therefore arise at the generation stage: GPT-o4-mini tends to over-compress, often missing parameters/versions and step order; GPT-4 recovers more detail but can blend auxiliary context into the primary workflow; GPT-4o best leverages text and figure evidence yet lacks explicit planning/verification, leaving occasional gaps. BioWorkflow replaces single-pass summarization with a structured, multi-round pipeline—hierarchical decomposition, iterative reformulation for fine-grained details, multimodal (text/table/figure) indexing with summary-augmented chunks, and validation/traceability—which explains the higher means and lower variance observed in Fig. 9.
Figure 9.
Boxplots of completeness, accuracy, and reproducibility for BioWorkflow versus state-of-the-art LLMs (GPT-o4-mini, GPT-4o, and GPT-4).
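The contrast between single-pass baselines and BioWorkflow's multi-round loop can be sketched in Python. Every callable here (decompose, retrieve, answer, reformulate, assemble) is a hypothetical stand-in for the corresponding LLM or retrieval component; this is an illustrative control-flow sketch, not a published BioWorkflow interface.

```python
def extract_workflow(question, decompose, retrieve, answer, reformulate,
                     assemble, max_rounds=3):
    """Sketch of the structured, multi-round pipeline described above.

    All five callables are hypothetical stand-ins for the LLM and
    retrieval components. A single-turn baseline corresponds to
    decompose returning [question] with max_rounds=1.
    """
    evidence, answers = [], []
    for sq in decompose(question):                # hierarchical decomposition
        for _ in range(max_rounds):               # iterative refinement
            chunks = retrieve(sq)                 # text/table/figure chunks
            ans, new_entities = answer(sq, chunks)
            evidence.extend(chunks)
            if not new_entities:                  # converged: no unseen tools/params
                break
            sq = reformulate(sq, new_entities)    # dynamic query reformulation
        answers.append(ans)
    return assemble(answers, evidence)            # validated, traceable workflow
```

The key structural difference from the baselines is the inner loop: when an answer surfaces new entities (an unseen tool, version, or parameter), the subquery is reformulated and retrieval repeats, rather than stopping after one pass.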
In a large-scale evaluation across 100 studies, BioWorkflow consistently delivered richer, more faithful protocol extractions than any of the tested LLMs (Fig. 10). Whereas baseline systems often omitted key steps or introduced slight inaccuracies, BioWorkflow’s outputs were uniformly comprehensive, precisely mirroring original methods, and organized in a way that researchers found immediately actionable. This leap in clarity and fidelity not only streamlines the transition from the literature to the bench but also markedly reduces the need for manual correction or reverse-engineering.
Figure 10.
Radar chart of average completeness, accuracy, and reproducibility for BioWorkflow versus state-of-the-art LLMs (GPT-o4-mini, GPT-4o, and GPT-4).
As shown in Fig. 11, pairwise Wilcoxon signed-rank tests confirm that BioWorkflow’s gains over each of the LLMs are statistically significant (P < .05 across reproducibility, completeness, and accuracy), underscoring that BioWorkflow uniquely delivers more faithful, comprehensive, and actionable workflows. These results provide robust statistical validation of its superior performance in extracting experimental protocols.
Figure 11.
Pairwise Wilcoxon P-value heatmaps for four methods (BioWorkflow versus GPT-o4-mini, GPT-4o, and GPT-4) across reproducibility, completeness, and accuracy.
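The significance testing above can be sketched as a one-sided Wilcoxon signed-rank test on paired per-paper scores. This hand-rolled version uses the large-sample normal approximation (without tie correction) purely for illustration; in practice, a library routine such as scipy.stats.wilcoxon would be the standard choice.

```python
import math

def wilcoxon_signed_rank(x, y):
    """One-sided Wilcoxon signed-rank test (normal approximation).

    Returns (W+, p) for the alternative hypothesis x > y, where x and y
    are paired per-paper scores. Zero differences are dropped; tied
    absolute differences receive average ranks. Tie correction of the
    variance is omitted for brevity.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:                                      # average ranks over tie blocks
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    mu = n * (n + 1) / 4                              # mean of W+ under H0
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    p = 0.5 * math.erfc(z / math.sqrt(2))             # upper-tail normal p-value
    return w_plus, p
```

With 100 paired papers per comparison, the normal approximation is well within its usual range of validity.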
In contrast to GPT-o4-mini, GPT-4, and GPT-4o—which generate broad, high-level outlines that routinely omit key software versions, parameter settings, and detailed data-processing steps—BioWorkflow delivers an end-to-end protocol that names every tool and its version, spells out filtering and normalization thresholds, and preserves even “non-core” actions such as QC checks. By capturing every procedural detail in a single pass, BioWorkflow eliminates the need to revisit the original text or guess missing details, dramatically accelerating bioinformatics implementation and safeguarding against errors that derail downstream analyses.
Efficiency and user feedback
Finally, we measured the end-to-end time required to generate a complete BCO Description Domain (or workflow summary). Manual curation of a single paper typically demands 1–2 hours, whereas BioWorkflow produces a fully populated Description Domain in 3–5 minutes on average. In a user study, some participants reported that integrating BioWorkflow with light manual validation would “substantially accelerate” their workflow-documentation tasks, underscoring the tool’s potential to alleviate cognitive load and streamline bioinformatics reporting.
Overall, these experiments establish that BioWorkflow is not only theoretically sound but also practically effective, addressing all four core challenges (multimodal heterogeneity, retrieval precision, hallucination mitigation, and user-friendly domain adaptation) in a unified, end-to-end framework.
Discussion and conclusion
In this work, we demonstrated that grounding LLMs in the relevant literature via our BioWorkflow framework enables fully automated extraction of bioinformatics workflows. Hierarchical chain-of-thought (CoT) decomposition and dynamic, iterative query reformulation ensure contextual precision, while stage-specific prompts coupled with automated validation effectively suppress hallucinations. On a 100-paper gold standard, BioWorkflow recovered over 80% of workflow steps (versus ~20% for BCO-RAG), outperformed leading LLMs by >20% across reproducibility, completeness, and accuracy, and cut expert curation time from over an hour to 3–5 min per paper.
Despite its gains, BioWorkflow has several constraints. First, in very long articles with complex pipelines, the presence of highly ambiguous language (e.g., “default settings,” “as previously described”) can weaken disambiguation and lower performance. Second, workflows that contain deeply nested or parallel subworkflows can exceed current depth/branch assumptions; our query decomposition is hierarchical but not yet nesting-aware, which risks under-decomposition or misordered steps. Third, ingestion is currently limited to PDFs and to tables/figures extracted as text; we do not parse other supplementary formats (e.g., spreadsheets with multilevel headers, code archives, and repositories), which caps parameter/version coverage. Future work will target nesting-aware decomposition, ambiguity-robust normalization, and multiformat supplement ingestion to address these failure modes.
Key Points
BioWorkflow presents a scalable, AI-driven framework for end-to-end extraction of published bioinformatics pipelines by synchronously parsing multimodal content (text, tables, and figures), decomposing user queries into hierarchical subquestions, and performing iterative, context-aware retrieval and refinement.
It embeds unified “chunks” of multimodal evidence into a FAISS vector index and injects prior context through reinforced prompt templates and parent-context propagation, ensuring precise workflow element retrieval while suppressing large language model (LLM) hallucinations via automatic validation and traceability.
Evaluated on 100 expert-annotated bioinformatics articles, BioWorkflow recovers over 80% of workflow steps (versus ~20% for BioCompute Object retrieval-augmented generation), outperforms state-of-the-art LLMs (e.g., GPT-o4-mini) by >20% in reproducibility, completeness, and accuracy, and reduces manual curation time from over an hour to just 3–5 min per paper.
By exporting both BioCompute Object-compliant JSON and concise, user-friendly protocols, BioWorkflow enables rapid, reliable reuse of published pipelines, significantly boosting reproducibility and throughput in bioinformatics research.
Supplementary Material
Acknowledgements
We thank the researchers at the Shaanxi Engineering Research Center of Medical and Health Big Data for their invaluable contributions to our experimental evaluations.
Contributor Information
Yidan Wang, School of Computer Science and Technology, Faculty of Electronics and Information Engineering, Xi'an Jiaotong University, No. 28 Xianning West Road, Beilin District, Xi'an, Shaanxi 710049, China; Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, No. 28 Xianning West Road, Beilin District, Xi'an 710049, China.
Jiayin Wang, School of Computer Science and Technology, Faculty of Electronics and Information Engineering, Xi'an Jiaotong University, No. 28 Xianning West Road, Beilin District, Xi'an, Shaanxi 710049, China; Shaanxi Engineering Research Center of Medical and Health Big Data, Xi'an Jiaotong University, No. 28 Xianning West Road, Beilin District, Xi'an 710049, China; The Second Affiliated Hospital of Xi'an Jiaotong University, No. 157 Xiwu Road, Beilin District, Xi'an 710003, Shaanxi, China.
Author contributions
J.W. and Y.W. conceived and designed this research; Y.W. designed the model, implemented the program and performed the experiments; J.W. and Y.W. wrote the manuscript. All authors have read and agreed to the latest version of the manuscript.
Conflict of interest: None declared.
Funding
This work was supported by the National Natural Science Foundation of China (grant nos 62402376, 62572389, 72293581, and 72274152).
References
- 1. Gauthier J, Vincent AT, Charette SJ. et al. A brief history of bioinformatics. Brief Bioinform 2019;20:1981–96. 10.1093/bib/bby063 [DOI] [PubMed] [Google Scholar]
- 2. Levin C, Dynomant E, Gonzalez BJ. et al. A data-supported history of bioinformatics tools. arXiv, 1807.06808, 2018, 10.48550/arXiv.1807.06808 [Google Scholar]
- 3. Marques VSS, Amaral LRD. Documenting bioinformatics software via reverse engineering. arXiv, 2305.04349, 2023, 10.48550/arXiv.2305.04349 [DOI]
- 4. Liu NF, Lin K, Hewitt J. et al. Lost in the middle: how language models use long contexts. arXiv, 2307.03172, 2023, 10.48550/arXiv.2307.03172 [DOI]
- 5. Wang X, Huey SL, Sheng R. et al. SciDaSynth: interactive structured knowledge extraction and synthesis from scientific literature with large language model. arXiv, 2404.13765, 2024, 10.48550/arXiv.2404.13765 [DOI] [PMC free article] [PubMed]
- 6. Gao Y, Xiong Y, Gao X. et al. Retrieval-augmented generation for large language models: a survey. arXiv, 2312.10997, 2023, 10.48550/arXiv.2312.10997 [DOI]
- 7. Kaplan J, McCandlish S, Henighan T. et al. Scaling laws for neural language models. arXiv, 2001.08361, 2020. 10.48550/arXiv.2001.08361 [DOI]
- 8. Zhou B, Geißler D, Lukowicz P. Misinforming LLMs: vulnerabilities, challenges and opportunities. arXiv, 2408.01168, 2024, 10.48550/arXiv.2408.01168 [DOI]
- 9. Lewis P, Perez E, Piktus A. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in neural information processing systems 2020;33:9459–74. [Google Scholar]
- 10. Yu W, Iter D, Wang S. et al. Generate rather than retrieve: large language models are strong context generators. arXiv, 2209.10063, 2022, 10.48550/arXiv.2209.10063 [DOI]
- 11. Tang Y, Yang Y. MultiHOP-RAG: benchmarking retrieval-augmented generation for multi-hop queries. arXiv, 2401.15391, 2024, 10.48550/arXiv.2401.15391 [DOI]
- 12. Karpukhin V, Oguz B, Min S. et al. Dense passage retrieval for open-domain question answering. In: EMNLP, pp. 6769–81, 2020.
- 13. Jin Q, Dhingra B, Liu Z. et al. PubmedQA: a dataset for biomedical research question answering. arXiv, 1909.06146, 2019, 10.48550/arXiv.1909.06146 [DOI]
- 14. Guo Z, Wang P, Wang Y. et al. Improving small language models on PubMedQA via generative data augmentation. arXiv, 2305.07804, 2023, 10.48550/arXiv.2305.07804 [DOI]
- 15. Sahoo P, Singh AK, Saha S. et al. A systematic survey of prompt engineering in large language models: techniques and applications. arXiv, 2402.07927, 2024. 10.48550/arXiv.2402.07927 [DOI]
- 16. Kim S, Mazumder R. Enhancing scientific reproducibility through automated BioCompute object creation using retrieval-augmented generation from publications. arXiv, 2409.15076, 2024, 10.48550/arXiv.2409.15076 [DOI]
- 17. Simonyan V, Goecks J, Mazumder R. Biocompute objects—a step towards evaluation and validation of biomedical scientific computations. PDA J Pharm Sci Technol 2016;71:136. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Mazumder R, Simonyan V, IEEE P2791 BioCompute Working Group. IEEE standard for bioinformatics analyses generated by high-throughput sequencing (HTS) to facilitate communication: IEEE Std 2791-2020. IEEE, 2020. [Google Scholar]
- 19. King CHS IV, Keeney J, Guimera N. et al. Communicating regulatory high-throughput sequencing data using BioCompute objects. Drug Discov Today 2022;27:1108–14. 10.1016/j.drudis.2022.01.007 [DOI] [PubMed] [Google Scholar]
- 20. Alterovitz G, Dean D, Goble C. et al. Enabling precision medicine via standard communication of HTS provenance, analysis, and results. PLoS Biol 2018;16:e3000099. 10.1371/journal.pbio.3000099 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Keeney JG, Gulzar N, Baker JB. et al. Communicating computational workflows in a regulatory environment. Drug Discov Today 2024;29:103884. 10.1016/j.drudis.2024.103884 [DOI] [PubMed] [Google Scholar]
- 22. Lyman DF, Bell A, Black A. et al. Modeling and integration of N-glycan biomarkers in a comprehensive biomarker data model. Glycobiology 2022;32:855–70. 10.1093/glycob/cwac046 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23. Edfeldt K, Edwards AM, Engkvist O. et al. A data science roadmap for open science organizations engaged in early-stage drug discovery. Nat Commun 2024;15:5640. 10.1038/s41467-024-49777-x [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24. Galaxy Project: BioCompute Object. https://galaxyproject.org/use/biocompute-object/ (September 2024, date last accessed).
- 25. Yang Z, Qi P, Zhang S. et al. HotpotQA: a dataset for diverse, explainable multi-hop question answering. arXiv, 1809.09600, 2018, 10.48550/arXiv.1809.09600 [DOI]
- 26. Gutiérrez BJ, Shu Y, Gu Y. et al. HippoRAG: neurobiologically inspired long-term memory for large language models. Advances in Neural Information Processing Systems 2024;37:59532–69.
- 27. Zhang N, Choubey PK, Fabbri A. et al. SiReRAG: indexing similar and related information for multihop reasoning. arXiv, 2412.06206, 2024, 10.48550/arXiv.2412.06206 [DOI]
- 28. Rezaei MR, Dieng AB. Vendi-RAG: adaptively trading-off diversity and quality significantly improves retrieval augmented generation with LLMs. arXiv, 2502.11228, 2025, 10.48550/arXiv.2502.11228 [DOI]
- 29. Xu S, Pang L, Shen H. et al. Search-in-the-chain: interactively enhancing large language models with search for knowledge-intensive tasks. In: Proceedings of the ACM Web Conference 2024, pp. 1362–73, 2024.
- 30. Xu R, Shi W, Zhuang Y. et al. Collab-RAG: boosting retrieval-augmented generation for complex question answering via white-box and Black-box LLM collaboration. arXiv, 2504.04915, 2025. 10.48550/arXiv.2504.04915 [DOI]
- 31. Lee Z, Cao S, Liu J. et al. ReaRAG: knowledge-guided reasoning enhances factuality of large reasoning models with iterative retrieval augmented generation. arXiv, 2503.21729, 2025, 10.48550/arXiv.2503.21729 [DOI]
- 32. Chang Z, Li M, Jia X. et al. What external knowledge is preferred by LLMs? Characterizing and exploring chain of evidence in imperfect context. arXiv, 2412.12632, 2024. 10.48550/arXiv.2412.12632 [DOI]
- 33. BioCompute Object portal. https://www.biocomputeobject.org/ (September 2024, date last accessed).
- 34. DeepEval: open-source evaluation framework for LLMs. https://docs.confident-ai.com/docs/getting-started, https://docs.confident-ai.com/docs/metrics-answer-relevancy, https://docs.confident-ai.com/docs/metrics-faithfulness (September 2024, date last accessed).