Human-augmented large language model-driven selection of glutathione peroxidase 4 as a candidate blood transcriptional biomarker for circulating erythroid cells

Bishesh Subba; Mohammed Toufiq; Fuadur Omi; Marina Yurieva; Taushif Khan; Darawan Rinchai; Karolina Palucka; Damien Chaussabel

doi:10.1038/s41598-024-73916-5

2024 Oct 5;14:23225. doi: 10.1038/s41598-024-73916-5

PMCID: PMC11455862 PMID: 39369090

Abstract

The identification of optimal candidate genes from large-scale blood transcriptomic data is crucial for developing targeted assays to monitor immune responses. Here, we introduce a novel, optimized large language model (LLM)-based approach for prioritizing candidate biomarkers from blood transcriptional modules. Focusing on module M14.51 from the BloodGen3 repertoire, we implemented a multi-step LLM-driven workflow. Initial high-throughput screening used GPT-4, Claude 3, and Claude 3.5 Sonnet to score and rank the module’s constituent genes across six criteria. Top candidates then underwent high-resolution scoring using Consensus GPT, with concurrent manual fact-checking and, when needed, iterative refinement of the scores based on user feedback. Qualitative assessment of literature-based narratives and analysis of reference transcriptome data further refined the selection process. This novel multi-tiered approach consistently identified Glutathione Peroxidase 4 (GPX4) as the top candidate gene for module M14.51. GPX4’s role in oxidative stress regulation, its potential as a future drug target, and its expression pattern across diverse cell types supported its selection. The incorporation of reference transcriptome data further validated GPX4 as the most suitable candidate for this module. This study presents an advanced LLM-driven workflow with a novel optimized scoring strategy for candidate gene prioritization, incorporating human-in-the-loop augmentation. The approach identified GPX4 as a key gene in the erythroid cell-associated module M14.51, suggesting its potential utility for biomarker discovery and targeted assay development. By combining AI-driven literature analysis with iterative human expert validation, this method leverages the strengths of both artificial and human intelligence, potentially contributing to the development of biologically relevant and clinically informative targeted assays. 
Further validation studies are needed to confirm the broader applicability of this human-augmented AI approach.

Supplementary Information

The online version contains supplementary material available at 10.1038/s41598-024-73916-5.

Keywords: Large language models, Generative artificial intelligence, Biomarkers, Blood transcriptomics, Gene prioritization, Glutathione peroxidase 4, Erythroid cells, Transcriptional modules

Subject terms: Immunology, Translational immunology

Introduction

Targeted transcriptional profiling assays enable precise, quantitative assessments of the abundance of panels comprising tens to hundreds of transcripts^1,2. These assays offer advantages in terms of cost-effectiveness, simplicity, and rapid turnaround times, making them valuable for research involving large sample volumes, longitudinal studies, and resource-constrained settings^3,4. They also hold potential for biomarker discovery, drug response evaluation, and treatment monitoring^5,6.

The selection of candidate genes for targeted assays can be guided by both data-driven and knowledge-driven approaches. In the development of targeted blood transcriptional profiling assays, data-driven approaches often rely on transcriptomics, which simultaneously measures the abundance of tens of thousands of transcript species in a given sample^7,8. These approaches aim to identify genes whose expression patterns are associated with specific biological states or clinical outcomes. In our previous work, we combined data-driven strategies with knowledge-driven approaches to develop targeted transcriptional panels for monitoring immune responses to SARS-CoV-2 infection⁹ and for monitoring women during pregnancy¹⁰. However, the vast amount of biomedical literature poses a challenge for knowledge-driven gene prioritization, as manual curation and synthesis of information become increasingly difficult with large candidate gene pools.

Large language models (LLMs) offer a solution to this bottleneck by efficiently processing and synthesizing information from the literature. In our previous work¹¹, we developed and demonstrated the utility of an LLM-driven workflow for candidate gene prioritization. These models can score and rank candidate genes, and incorporate contextual information to guide gene selection.

In the current study, we apply an optimized and significantly enhanced version of this LLM-driven workflow to prioritize and select a top candidate gene from a blood transcriptional module (M14.51) associated with erythroid cells and erythropoiesis. The erythroid cell signature captured by module M14.51 has been linked to various physiological and pathological conditions, including respiratory syncytial virus infection severity, late-stage melanoma, pregnancy, and liver transplantation^12,13. By focusing on this module from the BloodGen3 repertoire¹⁴, we aim to demonstrate the utility of our LLM-driven approach in streamlining the selection of biologically relevant, clinically informative, and mechanistically grounded candidates for targeted assay development.

Our revised stepwise approach incorporates advanced LLM capabilities, a multi-tiered scoring system, and improved integration of transcriptome data, representing a substantial refinement of our original methodology. Through this process, we ultimately identify GPX4 as the top candidate gene for module M14.51, showcasing the potential of LLMs to enhance the efficiency and scalability of knowledge-driven candidate biomarker prioritization.

Methods

BloodGen3 module repertoire

The BloodGen3 module repertoire, described in detail in our previous work¹⁴, consists of 382 co-expressed gene sets identified through the integrated analysis of 16 reference whole blood transcriptome datasets. These modules represent groups of genes demonstrating coordinated expression across diverse pathological states. The module repertoire is organized into aggregates, with each aggregate comprising multiple modules that share similar expression patterns across the reference datasets.

Large language models

To facilitate candidate gene prioritization and selection, we employed multiple advanced LLMs. Initially, we used OpenAI’s GPT-4 and Anthropic’s Claude 3 for high-throughput screening. GPT-4 is a large autoregressive LLM whose parameter count has not been publicly disclosed; it generates sophisticated natural language by leveraging patterns learned from a diverse range of internet text. Claude incorporates Constitutional AI techniques, applying additional rules-based governance frameworks to ensure that responses meet desired constraints; its parameter count has likewise not been publicly disclosed. We also utilized Claude 3.5 Sonnet, a more recent iteration of Anthropic’s Claude model, for additional high-throughput scoring. This model offers enhanced capabilities in processing larger batches of information efficiently.

For high-resolution scoring, we employed Consensus GPT, a specialized AI research assistant integrated with ChatGPT. Consensus GPT has access to over 200 million academic papers, providing a more comprehensive and potentially more accurate evaluation compared to generic LLMs. It is built on the foundation of GPT-4 and incorporates additional specialized training on scientific literature.

We devised a stepwise prompting strategy tailored to harness the capacities of these LLMs for systematic evaluation and scoring of module candidates. This method, initially described in our previous work¹¹, has been significantly expanded and refined in the current study to incorporate the strengths of each model and to allow for iterative refinement through model-human interaction.

Selection of a module for candidate gene prioritization (step 1)

The first step in our workflow involves selecting a module from the BloodGen3 repertoire for candidate gene prioritization. This selection is based on several factors, including:

The module’s association with a specific cell type or biological process of interest, as determined by our previous work (e.g.,^12,15,16).
The module’s pattern of abundance across a set of reference patient cohorts, which can provide insights into its potential clinical relevance.
The module’s link to various disease states and physiological conditions, as established by our previous studies and other published literature.

Once a module is selected, the constituent genes are identified and become the focus of the subsequent prioritization steps.

LLM scoring of module genes (step 2)

We employed two distinct LLM scoring approaches to enhance the robustness of our selection process. These approaches also reflect the evolution in LLM scoring capabilities over the year since we initiated this line of work.

Step 2a: LLM chat scoring v07/2023

We employed the scoring approach described in Toufiq et al.¹¹, using OpenAI’s GPT-4 and Anthropic’s Claude to score the genes within the selected module. Each LLM was provided with the list of module genes and specific prompts requesting them to score each gene on a scale of 0 to 10 for six criteria, accompanied by an evaluative comment and supporting references when applicable. The criteria assessed were:

Relevance to the cell type or biological process of interest.
Relevance to circulating leukocytes immune biology.
Clinical use as a biomarker.
Potential value as a blood transcriptional biomarker.
Known drug target.
Therapeutic relevance for immune-mediated diseases.

Cumulative scores were then computed, and the candidates rank-ordered accordingly. The scores and justifications provided by the LLMs for the top 5 scoring genes were consolidated and summarized. For each criterion, the scores for each gene were averaged across the two models. Genes were then ranked based on their cumulative average scores. The detailed prompts and methodology are described in¹¹.
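The averaging and ranking described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name, the criterion labels, and the gene and model names are hypothetical, the scores are placeholders rather than results from this study, and ties are broken alphabetically as an arbitrary assumption.

```python
from statistics import mean

# Hypothetical short labels for the six criteria listed above.
CRITERIA = [
    "cell_type_relevance", "leukocyte_biology", "clinical_biomarker",
    "blood_biomarker", "drug_target", "immune_disease_therapy",
]

def rank_by_cumulative_average(scores_by_model):
    """scores_by_model maps model -> gene -> criterion -> score (0 to 10).

    Returns (gene, cumulative average score) pairs, best first.
    """
    models = list(scores_by_model.values())
    # Only genes scored by every model can be averaged.
    genes = set.intersection(*(set(m) for m in models))
    cumulative = {}
    for gene in genes:
        # Average each criterion across models, then sum over the criteria.
        cumulative[gene] = sum(
            mean(m[gene][c] for m in models) for c in CRITERIA
        )
    # Highest cumulative score first; alphabetical tie-break (assumption).
    return sorted(cumulative.items(), key=lambda kv: (-kv[1], kv[0]))

# Placeholder scores for two made-up genes and two made-up models.
example = {
    "model_a": {"GENE1": dict.fromkeys(CRITERIA, 6),
                "GENE2": dict.fromkeys(CRITERIA, 3)},
    "model_b": {"GENE1": dict.fromkeys(CRITERIA, 8),
                "GENE2": dict.fromkeys(CRITERIA, 5)},
}
ranking = rank_by_cumulative_average(example)
```

With these placeholder values, GENE1 averages 7 on each criterion and GENE2 averages 4, so GENE1 ranks first.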

Step 2b: LLM high-throughput chat scoring v07/2024

To leverage more advanced capabilities available at the time of manuscript revision and to enable efficient processing of larger gene sets, we employed Claude 3.5 Sonnet for an additional high-throughput scoring approach. This method allowed us to evaluate genes in larger batches, potentially reducing bias and increasing efficiency. Genes were scored in batches of up to 10 using the following prompt:

“For the following genes: [list symbols of up to 10 genes]

Provide the gene’s official name.
Give each of the following statements a score from 0 to 10, with 0 indicating no evidence and 10 indicating very strong evidence:

The gene is associated with erythroid cell biology. Score: Based on evidence linking the gene to the development, function, or regulation of erythroid cells, including impacts on erythropoiesis, hemoglobin synthesis, erythrocyte membrane structure, or other processes specific to red blood cell biology.
The gene is relevant to circulating leukocytes immune biology. Score: Based on evidence linking the gene to the development, function, or regulation of circulating leukocytes, including impacts on leukocyte differentiation, activation, signaling, or effector functions.
The gene or its products are currently being used as a biomarker in clinical settings. Score: Based on evidence of the gene or its products’ application as biomarkers for diagnosis, prognosis, or monitoring of diseases in clinical settings, with a focus on their validated use and acceptance in medical practice.
The gene has potential value as a blood transcriptional biomarker. Score: Based on evidence supporting the gene’s expression patterns in blood cells as reflective of specific physiological or pathological states, considering both current research findings and potential for future clinical utility.
The gene is a known drug target. Score: Based on evidence of the gene or its encoded protein serving as a target for therapeutic intervention, including approved drugs targeting this gene, compounds in clinical trials, or promising preclinical studies.
The gene is therapeutically relevant for diseases involving the immune system. Score: Based on evidence linking the gene to the pathogenesis, progression, or treatment of diseases involving the immune system, including its role in immune dysregulation, or as a target for immunotherapy.

Scoring criteria: 0 - No evidence found 1–3 - Very limited evidence 4–6 - Some evidence, but needs validation or is limited to certain conditions 7–8 - Good evidence 9–10 - Strong evidence.

The output should consist in a table, with genes as rows and gene names, and scores for each statement (a-f), as columns.”

Detailed scoring criteria were provided for each statement to ensure consistency and accuracy in the scoring process. This approach allowed for a more nuanced and up-to-date evaluation of the genes, taking advantage of the enhanced capabilities of Claude 3.5 Sonnet and enabling efficient processing of larger gene sets. Scoring was run in triplicate, and scores were averaged to enhance reliability and account for potential variations in LLM outputs.
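As an illustration of the batching and triplicate averaging described above, the following Python sketch splits a gene list into batches of up to 10 and averages scores across replicate runs. It is a sketch under stated assumptions rather than the procedure actually used: the helper names are hypothetical, the replicate scores are placeholders, and the call to the LLM itself is deliberately left out.

```python
from statistics import mean

# The 13 genes of module M14.51, as listed in the Results.
MODULE_GENES = [
    "ATP5EP2", "BTF3", "EIF3B", "FAM152B", "GPX4", "MED16", "MYL6",
    "PRDX5", "SH3BGRL3", "ST3GAL1", "TM7SF2", "UNC45A", "VPS28",
]

def make_batches(genes, size=10):
    """Split the gene list into consecutive batches of at most `size` genes."""
    return [genes[i:i + size] for i in range(0, len(genes), size)]

def prompt_header(batch):
    """First line of the scoring prompt for one batch of gene symbols."""
    return "For the following genes: " + ", ".join(batch)

def average_replicates(runs):
    """Average scores across runs; each run maps gene -> statement -> score."""
    first = runs[0]
    return {
        gene: {s: mean(run[gene][s] for run in runs) for s in first[gene]}
        for gene in first
    }

gene_batches = make_batches(MODULE_GENES)

# Placeholder triplicate scores for one made-up gene and two statements.
runs = [
    {"GENE_X": {"a": 6, "b": 2}},
    {"GENE_X": {"a": 7, "b": 3}},
    {"GENE_X": {"a": 8, "b": 4}},
]
averaged = average_replicates(runs)
```

For the 13-gene module this yields one batch of 10 genes and one of 3; each batch header would be followed by the statements and scoring criteria quoted in the prompt.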

Selection of top candidates (step 3)

To identify the most promising candidate genes, we employed an approach leveraging the results from both scoring methods (2a and 2b). The process was as follows:

We first rank-ordered genes based on cumulative LLM scores, creating separate ranked lists for scores derived from Step 2a and Step 2b.
For each ranked list, we then distributed the genes into terciles. Given that our list comprised 13 genes, we assigned the first 5 genes to the first tercile, the next 4 to the second tercile, and the final 4 to the third tercile.
We selected for further analysis any gene candidates that appeared in the first tercile of either scoring method (2a or 2b). This approach allowed us to capture top-performing genes across both the established and the high-throughput scoring methods, ensuring a comprehensive selection of promising candidates.

This selection strategy aimed to balance the strengths of both scoring approaches while identifying a manageable number of top candidates for more in-depth analysis in subsequent steps.
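The tercile split and the union of first terciles can be sketched in Python as follows. The function names are hypothetical, and the rule of giving any remainder to the earliest terciles is an assumption chosen because it reproduces the 5/4/4 split described for 13 genes. The two first-tercile lists are those reported in the Results for Steps 2a and 2b.

```python
def split_terciles(ranked):
    """Split a ranked list into three groups, earliest groups taking any remainder."""
    base, extra = divmod(len(ranked), 3)
    sizes = [base + (1 if i < extra else 0) for i in range(3)]
    terciles, start = [], 0
    for size in sizes:
        terciles.append(ranked[start:start + size])
        start += size
    return terciles

def union_preserving_order(*gene_lists):
    """Union of several gene lists, keeping first-seen order."""
    seen = []
    for genes in gene_lists:
        for gene in genes:
            if gene not in seen:
                seen.append(gene)
    return seen

# Tercile sizes for a 13-item ranked list (rank positions used as stand-ins).
tercile_sizes = [len(t) for t in split_terciles(list(range(1, 14)))]

# First terciles as reported in the Results for each scoring method.
first_tercile_2a = ["GPX4", "PRDX5", "BTF3", "ST3GAL1", "EIF3B"]
first_tercile_2b = ["GPX4", "PRDX5", "ST3GAL1", "EIF3B", "SH3BGRL3"]
top_candidates = union_preserving_order(first_tercile_2a, first_tercile_2b)
```

Taking the union rather than the intersection is what allows a gene ranked highly by only one method to be carried forward.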

Prioritization of top candidates through high-resolution LLM scoring and fact-checking (step 4)

For the top candidate genes identified in Step 3, we employed a high-resolution scoring approach using Consensus GPT, a specialized AI research assistant integrated with ChatGPT. This step aimed to provide a more detailed and rigorously supported evaluation of each candidate gene.

Consensus GPT, which has access to over 200 million academic papers, was used to revisit and refine the scores for each top candidate gene. The process involved using a generic prompt for each of the six criteria, which was then customized for each gene. The generic prompt structure was as follows:

“For the gene [GENE SYMBOL]:

Provide its official name.
Provide a brief summary of its function.
Give the following statement a score from 0 to 10, with 0 indicating no evidence and 10 indicating very strong evidence:

“[SPECIFIC STATEMENT – see Step 2b]”.

Score: Based on [SCORING CRITERIA – see Step 2b].

Scoring criteria:

0 - No evidence found.

1–3 - Very limited evidence.

4–6 - Some evidence but needs validation or is limited to certain conditions.

7–8 - Good evidence, used or proposed for some clinical applications.

9–10 - Strong evidence, firmly established as a useful biomarker.

For scores of 4 or above, please provide an evaluative comment and supporting references, when applicable.

The results should be generated as a table with the following columns:

This prompt was used for each of the six criteria:

(a) Association with erythroid cells or erythropoiesis (b) Relevance to circulating leukocytes immune biology (c) Current use as a biomarker in clinical settings (d) Potential value as a blood transcriptional biomarker (e) Status as a known drug target (f) Therapeutic relevance for diseases involving the immune system.

For each prompt, Consensus GPT provided the requested information, including scores, evaluative comments, and supporting references when applicable. The high-resolution scoring process then incorporated human-in-the-loop augmentation. Human experts manually verified the relevance and completeness of backing references and critically evaluated the initial scores.

When references were found to be inadequate, irrelevant, or missing, experts prompted the model to re-evaluate its assessments, often providing additional context or pointing to overlooked literature. This iterative process, characteristic of human-in-the-loop augmentation, allowed for refinement of the AI-generated evaluations, ensuring accuracy and reliability.

This approach combined the large-scale information processing capabilities of AI with human domain expertise and critical thinking. The result was a nuanced, evidence-based evaluation of top candidate genes that leveraged the strengths of both artificial and human intelligence in the gene prioritization process.

Prioritization of top candidates based on qualitative assessment across six criteria (step 5)

Following the high-resolution scoring, we employed a two-stage qualitative assessment process to prioritize the top three candidate genes for module M14.51. This process used Claude 3.5 Sonnet. In the first stage, we used the following prompt:

“Using the data provided in the attached summary table generated via human-in-the-loop augmented scoring, please compile the justifications for the top 3 ranked genes across all six criteria: [insert gene symbols here]. Generate a comprehensive narrative for each gene, including supporting PMIDs. The narrative should synthesize the information across all criteria, highlighting key strengths and potential limitations of each gene. The narrative should neither account for nor disclose the scores indicated in the table.”

This prompt generated comprehensive narratives for the top three ranked genes based on the fact-checked justifications, synthesizing information across all six criteria without reference to numerical scores. We subsequently retrieved and compiled PubMed abstracts of all the literature cited in this document, which served as input in the next step.

In the second stage, we used another prompt to perform a qualitative assessment and rank-ordering of these three candidates:

“Based on the narratives generated for the top three genes in the previous step, and the PubMed summaries for the cited references, please provide a qualitative assessment and rank-ordering of these three candidate genes. Your assessment should consider the strength and relevance of the evidence provided for each gene across all six criteria. Please provide a clear rationale for your ranking, discussing why each gene was placed in its position. Conclude with a final recommendation for the top candidate gene for module M14.51, explaining why it stands out among the three.”

LLM ranking of top candidate genes based on expression data (step 6)

We prompted Claude 3.5 Sonnet to devise a strategy for prioritizing the top candidate genes based on transcriptional profiles available from three reference datasets. The following was proposed by Claude 3.5 Sonnet and adopted for prioritizing candidate genes solely based on transcriptional profiling data:

“To prioritize the top three candidate genes (GPX4, ST3GAL1, and EIF3B) based solely on expression data, we can employ a three-criteria ranking system using data from three distinct transcriptome profiling datasets:

Erythroid Specificity Assessment: Using the Novershtern et al. dataset (GSE24759)¹⁷, which comprises profiles of isolated leukocyte populations and hematopoietic progenitors. This step focuses on the expression levels in erythroid populations (ERY1-5) compared to other cell types. Genes with higher expression in erythroid populations are ranked higher.
Whole Blood Expression Evaluation: Using the Linsley et al. dataset (GSE60424)¹⁸, which includes profiles of whole blood and isolated circulating blood leukocytes, the average expression of each candidate gene in whole blood samples is calculated. Genes with higher overall expression in whole blood are prioritized.
Expression Variability Across Conditions: Examining the Altman, Rinchai et al. dataset (GSE100150)¹⁴, which encompasses bulk blood transcriptional profiles from 16 reference cohorts (985 transcriptional profiles in total). For this step the standard deviation of log2 fold changes is calculated across all conditions for each candidate gene. Genes exhibiting greater variability across different disease states or conditions are ranked higher.”

For each criterion, the three candidate genes were ranked from 1 to 3, with 1 being the highest rank. The final ranking was determined by the consistency of high ranks across all three criteria.

This data-driven approach aimed to identify the candidate gene with the most favorable expression characteristics for use as a blood transcriptional biomarker of circulating erythroid cells, considering erythroid specificity, detectability in whole blood, and responsiveness to various physiological and pathological conditions. For this step the input consisted of expression values (.csv files) from each of the datasets for the candidate genes of interest.
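The three-criteria ranking can be sketched in Python as shown below. Several choices here are assumptions not specified in the text: erythroid specificity is taken as mean erythroid expression minus mean expression in other populations, variability uses the sample standard deviation, and "consistency of high ranks" is approximated by the mean rank. Gene labels and expression values are placeholders, not data from the three reference datasets.

```python
from statistics import mean, stdev

def rank_desc(metric):
    """Rank genes by a metric, higher is better; returns gene -> rank (1 = best)."""
    ordered = sorted(metric, key=lambda g: (-metric[g], g))
    return {gene: i + 1 for i, gene in enumerate(ordered)}

def expression_ranking(erythroid, other, whole_blood, log2fc):
    """Each argument maps gene -> list of values from one kind of sample."""
    # Criterion 1: erythroid specificity (difference of means, an assumption).
    specificity = {g: mean(erythroid[g]) - mean(other[g]) for g in erythroid}
    # Criterion 2: average expression in whole blood.
    abundance = {g: mean(whole_blood[g]) for g in whole_blood}
    # Criterion 3: standard deviation of log2 fold changes across conditions.
    variability = {g: stdev(log2fc[g]) for g in log2fc}
    ranks = [rank_desc(m) for m in (specificity, abundance, variability)]
    # Final order by mean rank across the three criteria (an assumption).
    mean_rank = {g: mean(r[g] for r in ranks) for g in specificity}
    return sorted(mean_rank, key=lambda g: (mean_rank[g], g)), ranks

# Placeholder values for three made-up genes.
erythroid = {"GENE_A": [9, 10, 11], "GENE_B": [6, 7, 8], "GENE_C": [5, 5, 5]}
other = {"GENE_A": [4, 4, 4], "GENE_B": [5, 5, 5], "GENE_C": [5, 5, 5]}
whole_blood = {"GENE_A": [8, 9], "GENE_B": [9, 9], "GENE_C": [6, 6]}
log2fc = {"GENE_A": [-1, 0, 2], "GENE_B": [0, 0.5, 1], "GENE_C": [0, 0, 0.3]}

final_order, per_criterion_ranks = expression_ranking(
    erythroid, other, whole_blood, log2fc
)
```

In this toy example GENE_A ranks first on specificity and variability and second on whole blood abundance, so it has the best mean rank overall.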

Results

Selection and prioritization of module M14.51 (step 1)

Our approach, described in detail in the methods section and illustrated in Fig. 1, involves a stepwise selection process using the fixed BloodGen3 repertoire as a framework¹⁴. Here we focused on the BloodGen3 aggregate A36, which our previous work identified as being associated with a circulating erythroid cell signature¹². This aggregate comprises six modules, and we selected module M14.51 for further analysis based on its pattern of abundance across a set of 16 reference patient cohorts (Fig. 2). The circulating erythroid cell signature, of which module M14.51 is representative, has been linked to various disease states and physiological conditions, including RSV disease severity¹², late-stage cancer^12,19, COVID-19²⁰, and pregnancy¹³. Additionally, we recently described pronounced changes in abundance for the transcripts included in this signature following the administration of the second dose of COVID-19 mRNA vaccines¹⁶.

Fig. 2. Module M14.51 Abundance Patterns Across Reference Patient Cohorts. This heatmap represents a transcriptional fingerprint, which reflects the relative proportion of transcripts in the module (rows) that show statistically significant abundance differences between case subjects and controls across each reference dataset (columns). The heatmap values extend from +100% (solid red, indicating a significantly higher abundance of all constituent transcripts within the module) to -100% (solid blue, indicating a significantly lower abundance of all constituent transcripts within the module). The module aggregate A36 from the BloodGen3 repertoire includes six modules, among them module M14.51, which was subjected to the candidate gene prioritization and selection workflow.
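To make the heatmap scale concrete, the short Python sketch below computes a module-level percentage from per-transcript significance calls. The caption does not say how transcripts changing in opposite directions are combined, so the net percentage used here (percent higher minus percent lower) is an assumption. The function name and input encoding are hypothetical.

```python
def module_response_percent(calls):
    """Net percentage of a module's transcripts that differ significantly.

    `calls` holds one value per transcript: +1 (significantly higher in
    cases), -1 (significantly lower), or 0 (no significant difference).
    Netting higher against lower is an assumption, not taken from the caption.
    """
    if not calls:
        raise ValueError("module has no transcripts")
    higher = sum(1 for c in calls if c > 0)
    lower = sum(1 for c in calls if c < 0)
    return 100.0 * (higher - lower) / len(calls)

# A 13-transcript module in which every transcript is significantly higher.
all_higher = module_response_percent([1] * 13)
# A made-up 4-transcript module: two higher, one unchanged, one lower.
mixed = module_response_percent([1, 1, 0, -1])
```

Under this convention the two extremes of the color scale correspond to every transcript changing in the same direction.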

Module M14.51 comprises 13 genes: ATP5EP2, BTF3, EIF3B, FAM152B, GPX4, MED16, MYL6, PRDX5, SH3BGRL3, ST3GAL1, TM7SF2, UNC45A, and VPS28. Given the potential biological and clinical significance of this module, we aimed to prioritize these 13 constituent genes and select the most promising candidate for downstream characterization as a potential blood transcriptional biomarker associated with erythroid physiology.

LLM scoring of M14.51 genes employing GPT-4 and Claude (steps 2a and 2b)

To reflect the rapidly evolving capabilities of LLMs and to cross-validate our results, we employed two distinct scoring approaches to evaluate the 13 genes comprising module M14.51. This dual approach aimed to prioritize candidate genes through independent runs and models, enhancing the robustness of our selection process.

The first approach (Step 2a) utilized GPT-4 and Claude, while the second approach (Step 2b) employed Claude 3.5 in high-throughput mode (see methods for details). Figure 3A illustrates the results from both scoring methods. In the GPT-4/Claude scoring (left panel), GPX4 emerged with the highest cumulative score, followed closely by PRDX5. BTF3, ST3GAL1, and EIF3B also received notable scores. GPX4 showed particularly high scores in categories such as drug target potential and relevance to immune-mediated diseases. The Claude 3.5 high-throughput scoring (right panel) yielded similar but not identical results. GPX4 and PRDX5 maintained the top two positions with the highest cumulative scores. However, there were some differences in the subsequent rankings. ST3GAL1 received a higher cumulative score in this method, placing it third, followed by EIF3B and SH3BGRL3. Notably, BTF3, which scored third in the GPT-4/Claude method, received a lower cumulative score in the Claude 3.5 scoring.

Fig. 3. Gene Prioritization in Module M14.51 Using LLM-based Scoring Approaches. (A) Stacked bar charts presenting the cumulative scores for genes within module M14.51, based on six evaluative criteria. Left: Average scores from GPT-4 and Claude. Right: Scores from Claude 3.5 high-throughput scoring in triplicate. Asterisks (*) indicate top candidate genes for the module. Boxes outline the candidates belonging to the first tercile (see methods). (B) Consensus GPT scoring of top candidate genes. The line graph illustrates the individual scores across six criteria (Erythroid Biology, Leukocyte Biology, Clinical Biomarker, Blood Biomarker, Drug Target, and Immune Disease Relevance) for the top-ranking genes. (C) Stacked line graph showing the Consensus GPT scoring for all six criteria for the top candidate genes. The cumulative scores are represented by the height of each stacked line. Asterisks (*) identify the final three candidates that were retained for further evaluation and prioritization.

Both scoring methods demonstrated consistency in identifying GPX4 and PRDX5 as the genes with the highest cumulative scores across the six criteria. However, the variations in scores for other genes highlight the potential differences in assessment between different LLM approaches.

These results showcase the scoring distributions across the M14.51 genes using our dual LLM approach, providing a comprehensive evaluation of each gene’s relevance across multiple criteria.

Six top candidate genes are identified for module M14.51 (Step 3)

Following our dual LLM scoring approach, we implemented a selection strategy to identify the top candidate genes from module M14.51. As described in the methods, we distributed the genes into terciles based on their cumulative scores for each scoring method (2a and 2b) separately. We then selected genes that appeared in the first tercile of either scoring method.

Figure 3A illustrates this selection process. The green boxes on both panels indicate the genes falling within the first tercile for each scoring method. In the GPT-4/Claude scoring (Method 2a, left panel), the first tercile comprised the top five genes: GPX4, PRDX5, BTF3, ST3GAL1, and EIF3B. In the Claude 3.5 high-throughput scoring (Method 2b, right panel), the first tercile also included five genes: GPX4, PRDX5, ST3GAL1, EIF3B, and SH3BGRL3.

Combining the results from both scoring methods, we identified six unique genes that appeared in the first tercile of at least one method. These top candidates, marked with asterisks (*) in Fig. 3A, are: GPX4, PRDX5, BTF3, ST3GAL1, EIF3B and SH3BGRL3.

This selection approach allowed us to capture top-performing genes across both scoring methods, ensuring a comprehensive selection of promising candidates. Notably, GPX4 and PRDX5 were consistently ranked in the top two positions by both methods, underlining their strong potential as candidates for further investigation. The inclusion of SH3BGRL3, which was highly ranked in the Claude 3.5 scoring but not in the GPT-4/Claude scoring, demonstrates the value of our dual-method approach in capturing a broader range of potential candidates. These six genes were then carried forward for more detailed, high-resolution scoring in the subsequent step of our analysis.

High resolution scoring and fact-checking prioritizes top M14.51 candidates (Step 4)

While our initial LLM scoring approaches (Steps 2a and 2b) proved effective in identifying top-tier candidates, a more rigorous method was necessary to further prioritize these selections. To achieve this, we employed a high-resolution scoring and fact-checking process using Consensus GPT, a specialized generative AI model. This approach leverages both the vast knowledge base of GPT-4 and a compendium of 200 million full-text scientific articles, allowing for a more comprehensive and nuanced evaluation of each candidate gene.

The high-resolution scoring method, while not high-throughput due to its focus on individual genes and statements, offers a depth of analysis crucial for final candidate prioritization. This process is more labor-intensive, requiring human experts to verify supporting references and participate in the optimization and correction of scores. However, it provides a level of scrutiny and validation essential for confident gene selection.

The comparative scoring across three different LLM approaches - Consensus GPT, Claude 3.5 Sonnet (in triplicate), and the initial GPT-4/Claude scoring - is illustrated in Fig. 3B. This graph demonstrates varying levels of agreement between these methods across six key criteria: erythroid cell biology, leukocyte biology, clinical biomarker potential, blood biomarker potential, drug target status, and relevance to immune diseases. Notably, GPX4 consistently scored high across all three methods, particularly in erythroid cell biology and as a drug target.

A comprehensive view of the Consensus GPT scoring for the top candidate genes is provided in Fig. 3C. This stacked area chart clearly shows GPX4 leading in cumulative scores across all six criteria, followed by EIF3B and ST3GAL1. The graph visually represents the multi-faceted strengths of each candidate, with GPX4 showing promise in immune disease therapeutics and as a drug target. A detailed breakdown of scores and justifications for each gene across the six criteria, complete with supporting references and evaluative comments, is offered in Supplementary File 1.

This high-resolution scoring and fact-checking process ultimately reinforced GPX4’s position as the leading candidate from module M14.51, while also providing valuable insights into the potential of other top-ranking genes. The process demonstrates the power of combining advanced AI tools with human expertise in the intricate task of gene prioritization for targeted assay development.

Qualitative assessment prioritizes GPX4 as the top candidate for module M14.51 (step 5)

To further refine our selection of the top candidate gene for module M14.51, we conducted a two-stage qualitative assessment of the three highest-ranking genes from the previous step: GPX4, EIF3B, and ST3GAL1. This process involved generating comprehensive narratives for each gene based on the fact-checked justifications across all six criteria generated in the previous step (Supplementary File 1), followed by a comparative analysis incorporating PubMed summaries of cited references (see methods for details). The assessment revealed GPX4 as the strongest candidate, demonstrating robust evidence across all criteria. GPX4 showed critical involvement in erythropoiesis, protecting erythroid precursors from oxidative stress and participating in key processes such as reticulocyte maturation and erythroblast enucleation^21–23. It demonstrated significant relevance to immune biology, particularly in maintaining immune homeostasis and protecting regulatory T cells from ferroptosis²⁴. GPX4 also showed promising potential as a biomarker in cancer research and chronic hepatitis C^25–27, and as a therapeutic target due to its role in regulating ferroptosis^27,28. ST3GAL1 ranked second, with strong evidence particularly in immune-related processes and as a therapeutic target in cancer immunotherapy^29,30, but showed limited direct relevance to erythropoiesis. EIF3B ranked third, showing potential across several criteria, particularly as a biomarker and potential drug target in cancer^31,32, but with less comprehensive evidence compared to GPX4 and ST3GAL1. This qualitative assessment, detailed in Supplementary File 2, reinforced GPX4’s position as the top candidate from module M14.51, providing a strong rationale for its selection for further investigation and potential inclusion in a targeted blood transcriptional assay.

LLM-driven expression-based ranking prioritizes GPX4 as the top candidate for module M14.51 (step 6)

Through its analysis of expression data for the top three candidate genes (GPX4, ST3GAL1, and EIF3B) across three distinct datasets, Claude 3.5 noted clear differences in their expression patterns and variability:

“Examination of erythroid specificity using the Novershtern et al. dataset demonstrated that GPX4 exhibited the highest expression specificity in erythroid populations, particularly in the more mature erythroid stages (ERY3-5). GPX4 showed a marked increase in expression from ERY1 to ERY5, with peak expression in ERY3-4. ST3GAL1 also displayed increased expression in erythroid cells, albeit to a lesser extent than GPX4. EIF3B showed some elevation in erythroid cells, but its expression was less specific compared to the other two candidates (Fig. 4A)”.

Fig. 4. Expression Analysis of Top-Ranked Candidate Genes from M14.51 in Leukocyte Transcriptomes. (A) A stacked bar chart displays the transcript abundance for the final three candidate genes of M14.51 across various leukocyte and hematopoietic progenitor populations. Data are sourced from the Novershtern et al. contribution to the NCBI Gene Expression Omnibus (GEO) under accession number GSE24759. Abbreviated notations for this dataset are as follows: HSC1, Hematopoietic stem cell CD133+ CD34dim; HSC2, Hematopoietic stem cell CD38- CD34+; CMP, Common myeloid progenitor; MEP, Megakaryocyte/erythroid progenitor; ERY1, Erythroid CD34+ CD71+ GlyA-; ERY2, Erythroid CD34- CD71+ GlyA-; ERY3, Erythroid CD34- CD71+ GlyA+; ERY4, Erythroid CD34- CD71lo GlyA+; ERY5, Erythroid CD34- CD71- GlyA+; MEGA1, Colony Forming Unit-Megakaryocytic; MEGA2, Megakaryocyte; DENDa1, Plasmacytoid dendritic cell; DENDa2, Myeloid dendritic cell; GMP, Granulocyte/monocyte progenitor; GRAN1, Colony Forming Unit-Granulocyte; GRAN2, Granulocyte (Neutrophilic Metamyelocyte); GRAN3, Granulocyte (Neutrophil); MONO1, Colony Forming Unit-Monocyte; MONO2, Monocyte; BASO1, Basophil; EOS2, Eosinophil; Pre-BCELL2, Early B cell; Pre-BCELL3, Pro-B cell; BCELLa1, Naive B cell; BCELLa2, Mature B cell, able to class switch; BCELLa3, Mature B cell; BCELLa4, Mature B cell, class switched; NKa1, Mature NK cell CD56- CD16+ CD3-; NKa2, Mature NK cell CD56+ CD16+ CD3-; NKa3, Mature NK cell CD56- CD16- CD3-; NKa4, NKT cell; TCELL1, CD8+ effector memory RA; TCELL2, Naive CD8+ T cell; TCELL3, CD8+ effector memory cell; TCELL4, CD8+ central memory; TCELL6, Naive CD4+ T cell; TCELL7, CD4+ effector memory cell; TCELL8, CD4+ central memory. Note: NKa1-4 as well as DENDa1 and DENDa2 cells were isolated from adult peripheral blood; other cell populations were isolated from cord blood. (B) Expression profiles of top candidate genes across whole blood and leukocyte subsets. 
Left panel: Stacked bar chart showing cumulative expression of GPX4, ST3GAL1, and EIF3B across different cell types. Right panels: Individual box plots for GPX4, ST3GAL1, and EIF3B displaying the distribution of expression levels across cell types. Data derived from Speake et al., GEO accession number GSE60424. (C) The box plot depicts the relative expression of GPX4 across multiple patient cohorts and conditions. Each box represents a different condition or disease state, with individual data points shown as grey dots. The y-axis represents relative expression, with the dashed red line indicating the baseline level. The plot shows the distribution of GPX4 expression across various conditions such as melanoma, JDM (Juvenile Dermatomyositis), PTB (Pulmonary Tuberculosis), Staph (Staphylococcus infection), RSV, MS (Multiple Sclerosis), liver transplant, COPD, SLE (Systemic Lupus Erythematosus), B-cell deficiency, Kawasaki disease, SoJIA (Systemic onset Juvenile Idiopathic Arthritis), FLU (Influenza infection), sepsis, HIV, and pregnancy.

“Analysis of whole blood expression levels using the Linsley et al. dataset revealed that GPX4 had the highest average expression (121.5 normalized counts), followed closely by EIF3B (115.9 normalized counts), with ST3GAL1 showing the lowest expression (113.2 normalized counts) among the three candidates.” Notably, when prompted to examine Fig. 4B, Claude 3.5 correctly interpreted the higher expression of GPX4 in whole blood compared to individual leukocyte populations as suggesting a potential enrichment in erythrocytes or reticulocytes.

“Investigation of expression variability across different conditions using the Altman, Rinchai et al. dataset showed that GPX4 demonstrated the highest variability (standard deviation of log2 fold changes = 0.39), followed by EIF3B (0.26) and ST3GAL1 (0.24). This suggests that GPX4 may be more responsive to changes in physiological or pathological states (Fig. 4C).”

Collectively, these findings consistently positioned GPX4 as the top-ranking candidate across all three criteria. GPX4 demonstrated the highest erythroid specificity, the highest whole blood expression, and the greatest variability across conditions. ST3GAL1 was ranked second overall, primarily due to its higher erythroid specificity compared to EIF3B, which is particularly relevant for a marker of circulating erythroid cells. EIF3B, while showing robust expression in whole blood and moderate variability across conditions, ranked third due to its lower erythroid specificity. These results suggest that among the three candidates, GPX4 possesses the most favorable expression characteristics for use as a blood transcriptional biomarker of circulating erythroid cells, based on its consistent top performance across all evaluated criteria.
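The way the three expression criteria combine into an overall order can be made explicit with a small rank-aggregation sketch. The expression and variability values below are those quoted above; erythroid specificity is reported only as an ordering, so it is entered as ranks. The weights are hypothetical and are not taken from the study: they simply illustrate that an unweighted rank sum would place EIF3B second, whereas giving extra weight to erythroid specificity (the rationale stated above for a marker of circulating erythroid cells) reproduces the reported order.

```python
# Values quoted in the text; erythroid specificity is given only as an ordering.
whole_blood_mean = {"GPX4": 121.5, "EIF3B": 115.9, "ST3GAL1": 113.2}
sd_log2_fc = {"GPX4": 0.39, "EIF3B": 0.26, "ST3GAL1": 0.24}
erythroid_rank = {"GPX4": 1, "ST3GAL1": 2, "EIF3B": 3}


def ranks_from_values(values):
    """Rank genes so that the highest value receives rank 1."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {gene: position for position, gene in enumerate(ordered, start=1)}


def combined_order(criterion_ranks, weights):
    """Order genes by weighted rank sum (a lower total is better)."""
    genes = list(next(iter(criterion_ranks.values())))
    totals = {
        gene: sum(weights[name] * ranks[gene] for name, ranks in criterion_ranks.items())
        for gene in genes
    }
    return sorted(totals, key=totals.get), totals


criterion_ranks = {
    "erythroid_specificity": erythroid_rank,
    "whole_blood_expression": ranks_from_values(whole_blood_mean),
    "variability": ranks_from_values(sd_log2_fc),
}

# Hypothetical weighting schemes, for illustration only.
equal = {name: 1 for name in criterion_ranks}
erythroid_first = {**equal, "erythroid_specificity": 3}

equal_order, equal_totals = combined_order(criterion_ranks, equal)
weighted_order, weighted_totals = combined_order(criterion_ranks, erythroid_first)

print("equal weights:", equal_order, equal_totals)
print("erythroid specificity weighted x3:", weighted_order, weighted_totals)
```

GPX4 ranks first under both schemes; only the order of ST3GAL1 and EIF3B depends on how strongly erythroid specificity is weighted.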

Overall, GPX4 emerges as the top candidate gene from the M14.51 module

The multi-step prioritization process, encompassing high-resolution scoring (Step 4), literature-based qualitative assessment (Step 5), and expression profiling data analysis (Step 6), consistently identified GPX4 as the top candidate gene for module M14.51. In the high-resolution scoring phase, GPX4 achieved the highest cumulative score (40 out of 60) across the six evaluated criteria, well ahead of EIF3B (23), ST3GAL1 (21), and PRDX5 (18). GPX4 scored particularly well in erythroid biology (8), leukocyte biology (7), drug target potential (7), and immune disease relevance (7). The subsequent literature-based assessment further supported GPX4’s potential, highlighting its crucial role in cellular redox homeostasis and its emerging significance in erythroid cell biology. Finally, the expression profiling data analysis reinforced GPX4’s candidacy by demonstrating its high erythroid specificity, robust expression in whole blood, and marked variability across different physiological and pathological conditions. Notably, GPX4 consistently outperformed other candidates in all three expression-based criteria. This convergence of quantitative scoring, qualitative literature assessment, and expression data analysis provides strong, multi-faceted evidence for GPX4 as the most promising candidate gene from module M14.51 for further investigation as a potential blood transcriptional biomarker of circulating erythroid cells.
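As a minimal illustration of the high-resolution scoring outcome, the sketch below ranks the four genes by the cumulative totals given in the paragraph above (40, 23, 21, and 18 out of 60) and reports the lead of the top candidate. Only the totals come from the text; the variable names are illustrative, and per-criterion scores are not reconstructed here because they are not all reported.

```python
MAX_TOTAL = 60  # totals are reported out of 60 across six criteria

# Cumulative scores as reported in the text.
cumulative_scores = {"GPX4": 40, "EIF3B": 23, "ST3GAL1": 21, "PRDX5": 18}

# Sort genes from highest to lowest cumulative score.
ranked = sorted(cumulative_scores.items(), key=lambda item: item[1], reverse=True)
top_gene, top_score = ranked[0]
runner_up, runner_up_score = ranked[1]
margin = top_score - runner_up_score

for gene, score in ranked:
    print(f"{gene}: {score}/{MAX_TOTAL} ({score / MAX_TOTAL:.0%})")
print(f"Lead of {top_gene} over {runner_up}: {margin} points")
```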

Discussion

Building upon our previous work that demonstrated the potential of LLMs for candidate gene prioritization¹¹, we have developed and implemented an advanced, two-step LLM-driven workflow to streamline the selection of a top candidate gene from module M14.51. Our refined approach first employs a high-throughput scoring strategy, utilizing both our original method (averaging ChatGPT and Claude scores) and a new method leveraging the enhanced capabilities of Claude 3.5 Sonnet to efficiently score gene batches. This initial step identifies top-tier candidates from the module. Subsequently, we apply a novel, high-resolution Consensus GPT scoring method to these top candidates, providing a more nuanced, reference-backed evaluation. This second step involves manual fact-checking and allows for iterative refinement through model-human conversation, significantly enhancing the confidence in our final rankings. We then further refine our selection by analyzing LLM-generated functional narratives and literature summaries for the top candidates. Finally, by incorporating reference transcriptome data from the BloodGen3 repertoire, we identified GPX4 as the top candidate gene. This advanced workflow, which integrates scoring, functional analysis, and expression profiling, showcases the evolving utility of LLMs in improving the efficiency, scalability, and reliability of knowledge-driven candidate biomarker prioritization, ultimately contributing to the development of targeted assays that are biologically relevant, clinically informative, and mechanistically grounded.

As the top candidate gene for module M14.51, GPX4 will be subjected to further in-depth characterization using a separate workflow^33,34. However, it is worth discussing here the background and potential relevance of GPX4 in the context of the six statements used for scoring the candidate genes. GPX4, or glutathione peroxidase 4, is a key antioxidant enzyme that plays a crucial role in protecting cells against oxidative stress and ferroptosis, a form of regulated cell death^35,36. GPX4’s role in erythroid cells and erythropoiesis is not directly established, but its antioxidant function suggests potential relevance in these contexts³⁷. In terms of its potential as a blood transcriptional biomarker, GPX4 received a score of 5 in our LLM-based assessment. This score reflects the gene’s involvement in cellular processes that are often dysregulated in various pathological conditions, such as neurological disorders, liver diseases, and diabetes^38–40. Moreover, altered GPX4 expression has been observed in several cancer types, including hepatocellular carcinoma, colorectal cancer, and breast cancer^41–43. These findings underscore the potential utility of GPX4 as a biomarker for a range of diseases characterized by oxidative stress and ferroptosis. While GPX4 may not be widely used as a clinical biomarker at present, its emerging role as a therapeutic target, particularly in the context of cancer⁴⁴, highlights its translational potential. As our understanding of GPX4’s involvement in disease pathogenesis continues to grow, its utility as a blood transcriptional biomarker may become increasingly evident.

While our study demonstrates the utility of LLMs in candidate gene prioritization and selection, it is important to acknowledge its limitations and areas for future improvement. First, the performance of LLMs remains dependent on the quality and scope of their training data. While models like GPT-4, Claude, and Consensus GPT have been trained on vast corpora of biomedical literature, they may not capture the most recent findings or niche areas of research, potentially impacting the accuracy and completeness of the information they provide.

Second, the potential for hallucination in LLMs (generating plausible but factually incorrect information) remains a concern. Our new methodology, relying on Consensus GPT and allowing for user-guided correction of scores, significantly mitigates this risk for top-tier candidates. However, some risk of inaccuracies may persist, particularly in the initial screening phase.

Third, time constraints pose a significant challenge. While we have developed an API-based automation for high-throughput initial screening (manuscript under review), the subsequent high-resolution scoring step involves time-intensive human-in-the-loop augmentation. This manual process, crucial for ensuring accuracy and reliability, cannot be automated via API. It involves iterative human-AI interaction, where experts verify references, challenge scores, and guide the model to reconsider assessments when necessary. While this human-augmented approach provides essential nuance and domain expertise, it is time-consuming and may introduce some subjectivity. Future work should focus on optimizing the balance between efficiency and accuracy, potentially by refining automated methods and exploring the integration of structured knowledge bases to enhance the verification process. These improvements could particularly benefit biomarker discovery and gene prioritization efforts, streamlining the process while maintaining the high level of accuracy required in scientific research.

Fourth, the selection of candidate genes is based on predefined criteria and the information available in the literature. The relative importance of these criteria may vary depending on the specific research question or clinical application. In this study, we assigned equal weights to all criteria, but in practice, domain experts may need to adjust these weights based on their specific requirements.

Fifth, the reproducibility of LLM-based approaches is an important consideration. Future studies should assess the sensitivity of the stepwise prompting to variations in input, potential day-to-day variability in recommendations, and the impact of the order in which gene expression datasets are processed. While we observed consistency in our results across different LLM approaches and over time in this work and previous benchmarking experiments¹¹, a more systematic evaluation of reproducibility would strengthen the reliability of this method.
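The point about equal versus adjusted criterion weights can be illustrated with a short sketch. Everything in it is hypothetical: the gene names, the per-criterion scores, and the weights are invented for illustration and are not results from this study. It shows only that a weighted sum over six criteria can reorder candidates once a domain expert emphasizes one criterion.

```python
def weighted_total(scores, weights):
    """Weighted sum of per-criterion scores; both lists must have equal length."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must cover the same criteria")
    return sum(weight * score for weight, score in zip(weights, scores))


def rank_genes(score_table, weights):
    """Return gene names ordered from highest to lowest weighted total."""
    totals = {gene: weighted_total(scores, weights) for gene, scores in score_table.items()}
    return sorted(totals, key=totals.get, reverse=True)


# Hypothetical scores on six criteria (0-10 each); not data from the study.
score_table = {
    "gene_A": [8, 3, 5, 4, 6, 5],  # strong on the first criterion only
    "gene_B": [4, 6, 6, 6, 6, 5],  # moderate on all criteria
}

equal_weights = [1, 1, 1, 1, 1, 1]       # the equal weighting described in the text
emphasis_weights = [3, 1, 1, 1, 1, 1]    # hypothetical: first criterion counts three times

print(rank_genes(score_table, equal_weights))
print(rank_genes(score_table, emphasis_weights))
```

With equal weights the totals are 31 and 33, so gene_B leads; with the first criterion tripled they become 47 and 41, so gene_A leads.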

It is important to note that while our LLM-driven approach provides valuable insights and efficiency in candidate gene prioritization, it does not replace the need for experimental validation. We are currently conducting similar prioritization experiments with other modules of the erythroid BloodGen3 signature, as well as other BloodGen3 signatures (such as interferon and inflammation). Our plan is to evaluate these candidates simultaneously in a multiplex assay for further selection and validation. This broader approach will allow us to assess the generalizability of our method across different biological contexts and provide a more comprehensive validation of our LLM-driven prioritization strategy.

Despite these limitations, our study shows how LLMs can streamline candidate gene prioritization for targeted assay development by efficiently synthesizing information from biomedical literature. The identification of GPX4 as the top candidate for the erythroid cell-associated module M14.51 illustrates the approach’s potential in selecting biologically relevant targets. The human-in-the-loop LLM scoring augmentation strategy developed here may have applications beyond targeted assay development. This approach could potentially be adapted for broader biomarker selection or drug target identification efforts, where efficient processing of large-scale biological data and literature is crucial. Ongoing work in our group focuses on scaling these methods, expanding their application to other biological contexts, and validating selected candidates experimentally. As LLMs continue to evolve, their integration into biomarker discovery workflows may contribute to the development of targeted assays for various physiological and pathological conditions. This LLM-driven approach may accelerate the translation of systems-level insights into actionable tools for research and clinical applications, though further validation is needed to confirm its broader applicability.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1 (27.5 KB, docx)

Supplementary Material 2 (20.7 KB, docx)

Abbreviations

AI: Artificial Intelligence
ATP5EP2: ATP Synthase Membrane Subunit DAPIT
BCELLa1-4: Different mature B cell populations
BTF3: Basic Transcription Factor 3
CD: Cluster of Differentiation
CMP: Common Myeloid Progenitor
COPD: Chronic Obstructive Pulmonary Disease
DENDa1-2: Dendritic cell populations
EIF3B: Eukaryotic Translation Initiation Factor 3 Subunit B
EOS2: Eosinophil
ERY1-5: Erythroid cell populations at different stages
FAM152B: Family With Sequence Similarity 152 Member B
GEO: Gene Expression Omnibus
GMP: Granulocyte/Monocyte Progenitor
GPX4: Glutathione Peroxidase 4
GRAN1-3: Granulocyte populations
HIV: Human Immunodeficiency Virus
HSC1-2: Hematopoietic Stem Cell populations
JDM: Juvenile Dermatomyositis
LLMs: Large Language Models
MED16: Mediator Complex Subunit 16
MEGA1-2: Megakaryocyte populations
MEP: Megakaryocyte/Erythroid Progenitor
MONO1-2: Monocyte populations
MS: Multiple Sclerosis
MYL6: Myosin Light Chain 6
NCBI: National Center for Biotechnology Information
NK: Natural Killer cell
PRDX5: Peroxiredoxin 5
PRE_BCELL2-3: Early B cell populations
RSV: Respiratory Syncytial Virus
SH3BGRL3: SH3 Domain Binding Glutamate Rich Protein Like 3
SLE: Systemic Lupus Erythematosus
SoJIA: Systemic onset Juvenile Idiopathic Arthritis
ST3GAL1: ST3 Beta-Galactoside Alpha-2,3-Sialyltransferase 1
Staph: Staphylococcus
TCELL1-8: Different T cell populations
TFA: Transcriptome Fingerprinting Assay
TM7SF2: Transmembrane 7 Superfamily Member 2
UNC45A: Unc-45 Myosin Chaperone A
VPS28: VPS28 Subunit of ESCRT-I

Author contributions

BS, MT, FO, DR, and DC conceptualized the study. BS, MT, FO, MY, TK, DR and DC developed the methodology. BS, MT, MY, TK, and DR performed the investigation. BS and DC wrote the original draft. BS, MT, FO, MY, TK, DR, KP, and DC reviewed and edited the manuscript. BS and DC prepared the visualizations. KP and DC supervised the project.

Data availability

The datasets analyzed during the current study are available in the Gene Expression Omnibus (GEO) repository and can be accessed using the accession numbers GSE60424, GSE24759, and GSE100150.

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Geiss, G. K. et al. Direct multiplexed measurement of gene expression with color-coded probe pairs. Nat. Biotechnol. 26 (3), 317–325 (2008).
2. Spurgeon, S. L., Jones, R. C. & Ramakrishnan, R. High throughput gene expression measurement with real time PCR in a microfluidic dynamic array. PloS One. 3 (2), e1662 (2008).
3. Tsang, H. F. et al. NanoString, a novel digital color-coded barcode technology: current and future applications in molecular diagnostics. Expert Rev. Mol. Diagn. 17 (1), 95–103 (2017).
4. Hannouf, M. B. et al. Cost-effectiveness analysis of multigene expression profiling assays to guide adjuvant therapy decisions in women with invasive early-stage breast cancer. Pharmacogenomics J. 20 (1), 27–46 (2020).
5. Byron, S. A., Van Keuren-Jensen, K. R., Engelthaler, D. M., Carpten, J. D. & Craig, D. W. Translating RNA sequencing into clinical diagnostics: opportunities and challenges. Nat. Rev. Genet. 17 (5), 257–271 (2016).
6. Jiang, Y. et al. Construction of a set of novel and robust gene expression signatures predicting prostate cancer recurrence. Mol. Oncol. 12 (9), 1559–1578 (2018).
7. Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat. Rev. Genet. 20 (11), 631–656 (2019).
8. Wang, Z., Gerstein, M. & Snyder, M. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10 (1), 57–63 (2009).
9. Rinchai, D. et al. A modular framework for the development of targeted Covid-19 blood transcript profiling panels. J. Transl Med. 18 (1), 291 (2020).
10. Brummaier, T. et al. Design of a targeted blood transcriptional panel for monitoring immunological changes accompanying pregnancy. Front. Immunol. 15, 1319949 (2024).
11. Toufiq, M. et al. Harnessing large language models (LLMs) for candidate gene prioritization and selection. J. Transl Med. 21 (1), 728 (2023).
12. Rinchai, D. et al. Definition of erythroid cell-positive blood transcriptome phenotypes associated with severe respiratory syncytial virus infection. Clin. Transl Med. 10 (8), e244 (2020).
13. Hong, S. et al. Longitudinal profiling of human blood transcriptome in healthy and lupus pregnancy. J. Exp. Med. 216 (5), 1154–1169 (2019).
14. Altman, M. C. et al. Development of a fixed module repertoire for the analysis and interpretation of blood transcriptome data. Nat. Commun. 12 (1), 4385 (2021).
15. Rawat, A. et al. A neutrophil-driven inflammatory signature characterizes the blood transcriptome fingerprint of Psoriasis. Front. Immunol. 11, 587946 (2020).
16. Rinchai, D. et al. High–temporal resolution profiling reveals distinct immune trajectories following the first and second doses of COVID-19 mRNA vaccines. Sci. Adv. 8 (45), eabp9961 (2022).
17. Novershtern, N. et al. Densely interconnected transcriptional circuits control cell states in human hematopoiesis. Cell. 144 (2), 296–309 (2011).
18. Linsley, P. S., Speake, C., Whalen, E. & Chaussabel, D. Copy number loss of the interferon gene cluster in melanomas is linked to reduced T cell infiltrate and poor patient prognosis. PloS One. 9 (10), e109760 (2014).
19. Zhao, L. et al. Late-stage tumors induce anemia and immunosuppressive extramedullary erythroid progenitor cells. Nat. Med. 24 (10), 1536–1544 (2018).
20. Bernardes, J. P. et al. Longitudinal multi-omics analyses identify responses of Megakaryocytes, Erythroid Cells, and plasmablasts as Hallmarks of severe COVID-19. Immunity. 53 (6), 1296–1314.e9 (2020).
21. Ouled-Haddou, H. et al. A new role of glutathione peroxidase 4 during human erythroblast enucleation. Blood Adv. 4 (22), 5666–5680 (2020).
22. Rademacher, M., Kuhn, H. & Borchert, A. Expression silencing of Glutathione Peroxidase 4 in mouse erythroleukemia cells delays in Vitro Erythropoiesis. Int. J. Mol. Sci. 22 (15), 7795 (2021).
23. Altamura, S. et al. Glutathione peroxidase 4 and vitamin E control reticulocyte maturation, stress erythropoiesis and iron homeostasis. Haematologica. 105 (4), 937–950 (2020).
24. Xu, C. et al. The glutathione peroxidase Gpx4 prevents lipid peroxidation and ferroptosis to sustain Treg cell activation and suppression of antitumor immunity. Cell Rep. 35 (11), 109235 (2021).
25. Brault, C. et al. Glutathione peroxidase 4 is reversibly induced by HCV to control lipid peroxidation and to increase virion infectivity. Gut. 65 (1), 144–154 (2016).
26. Cueto-Ureña, C. et al. Glutathione peroxidase gpx1 to gpx8 genes expression in experimental brain tumors reveals gender-dependent patterns. Genes. 14 (9), 1674 (2023).
27. Sha, R. et al. Predictive and prognostic impact of ferroptosis-related genes ACSL4 and GPX4 on breast cancer treated with neoadjuvant chemotherapy. EBioMedicine. 71, 103560 (2021).
28. Weaver, K. & Skouta, R. The selenoprotein glutathione peroxidase 4: from Molecular mechanisms to Novel Therapeutic opportunities. Biomedicines. 10 (4), 891 (2022).
29. Garnham, R. et al. ST3 beta-galactoside alpha-2,3-sialyltransferase 1 (ST3Gal1) synthesis of Siglec ligands mediates anti-tumour immunity in prostate cancer. Commun. Biol. 7 (1), 276 (2024).
30. Lin, W. D. et al. Sialylation of CD55 by ST3GAL1 facilitates Immune Evasion in Cancer. Cancer Immunol. Res. 9 (1), 113–122 (2021).
31. Zang, Y. et al. Eukaryotic Translation Initiation Factor 3b is both a Promising Prognostic Biomarker and a potential therapeutic target for patients with Clear Cell Renal Cell Carcinoma. J. Cancer. 8 (15), 3049–3061 (2017).
32. Liu, C. et al. The expression of eukaryotic translation initiation factor 3B and its correlation with tumor characteristics as well as prognosis in non-small cell lung cancer patients: a retrospective study. J. BUON Off J. Balk. Union Oncol. 25 (5), 2350–2357 (2020).
33. Rinchai, D. & Chaussabel, D. Assessing the potential relevance of CEACAM6 as a blood transcriptional biomarker [Internet]. F1000Research; [cited 2024 Apr 2]. (2022). https://f1000research.com/articles/11-1294
34. Rinchai, D. & Chaussabel, D. A training curriculum for retrieving, structuring, and aggregating information derived from the biomedical literature and large-scale data repositories [Internet]. F1000Research; [cited 2024 Apr 2]. (2022). https://f1000research.com/articles/11-994
35. Ingold, I. et al. Selenium utilization by GPX4 is required to Prevent Hydroperoxide-Induced ferroptosis. Cell. 172 (3), 409–422.e21 (2018).
36. Seibt, T. M., Proneth, B. & Conrad, M. Role of GPX4 in ferroptosis and its pharmacological implication. Free Radic Biol. Med. 133, 144–152 (2019).
37. Canli, Ö. et al. Glutathione peroxidase 4 prevents necroptosis in mouse erythroid precursors. Blood. 127 (1), 139–148 (2016).
38. Hambright, W. S., Fonseca, R. S., Chen, L., Na, R. & Ran, Q. Ablation of ferroptosis regulator glutathione peroxidase 4 in forebrain neurons promotes cognitive impairment and neurodegeneration. Redox Biol. 12, 8–17 (2017).
39. Carlson, B. A. et al. Glutathione peroxidase 4 and vitamin E cooperatively prevent hepatocellular degeneration. Redox Biol. 9, 22–31 (2016).
40. Zhang, Y. et al. Imidazole Ketone Erastin Induces Ferroptosis and slows Tumor Growth in a mouse lymphoma model. Cell Chem. Biol. 26 (5), 623–633.e9 (2019).
41. Yang, W. S. et al. Regulation of ferroptotic cancer cell death by GPX4. Cell. 156 (1–2), 317–331 (2014).
42. Sui, X. et al. RSL3 drives ferroptosis through GPX4 inactivation and ROS production in Colorectal Cancer. Front. Pharmacol. 9, 1371 (2018).
43. Harder, B. et al. Molecular mechanisms of Nrf2 regulation and how these influence chemical modulation for disease intervention. Biochem. Soc. Trans. 43 (4), 680–686 (2015).
44. Hangauer, M. J. et al. Drug-tolerant persister cancer cells are vulnerable to GPX4 inhibition. Nature. 551 (7679), 247–250 (2017).


[CR28] 28.Weaver, K. & Skouta, R. The selenoprotein glutathione peroxidase 4: from Molecular mechanisms to Novel Therapeutic opportunities. Biomedicines. 10 (4), 891 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR29] 29.Garnham, R. et al. ST3 beta-galactoside alpha-2,3-sialyltransferase 1 (ST3Gal1) synthesis of Siglec ligands mediates anti-tumour immunity in prostate cancer. Commun. Biol.7 (1), 276 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR30] 30.Lin, W. D. et al. Sialylation of CD55 by ST3GAL1 facilitates Immune Evasion in Cancer. Cancer Immunol. Res.9 (1), 113–122 (2021). [DOI] [PubMed] [Google Scholar]

[CR31] 31.Zang, Y. et al. Eukaryotic Translation Initiation Factor 3b is both a Promising Prognostic Biomarker and a potential therapeutic target for patients with Clear Cell Renal Cell Carcinoma. J. Cancer. 8 (15), 3049–3061 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR32] 32.Liu, C. et al. The expression of eukaryotic translation initiation factor 3B and its correlation with tumor characteristics as well as prognosis in non-small cell lung cancer patients: a retrospective study. J. BUON Off J. Balk. Union Oncol.25 (5), 2350–2357 (2020). [PubMed] [Google Scholar]

[CR33] 33.Rinchai, D. & Chaussabel, D. Assessing the potential relevance of CEACAM6 as a blood transcriptional biomarker [Internet]. F1000Research; [cited 2024 Apr 2]. (2022). https://f1000research.com/articles/11-1294 [DOI] [PMC free article] [PubMed]

[CR34] 34.Rinchai, D. & Chaussabel, D. A training curriculum for retrieving, structuring, and aggregating information derived from the biomedical literature and large-scale data repositories. [Internet]. F1000Research; [cited 2024 Apr 2]. (2022). https://f1000research.com/articles/11-994

[CR35] 35.Ingold, I. et al. Selenium utilization by GPX4 is required to Prevent Hydroperoxide-Induced ferroptosis. Cell. 172 (3), 409–422e21 (2018). [DOI] [PubMed] [Google Scholar]

[CR36] 36.Seibt, T. M., Proneth, B. & Conrad, M. Role of GPX4 in ferroptosis and its pharmacological implication. Free Radic Biol. Med.133, 144–152 (2019). [DOI] [PubMed] [Google Scholar]

[CR37] 37.Canli, Ö. et al. Glutathione peroxidase 4 prevents necroptosis in mouse erythroid precursors. Blood. 127 (1), 139–148 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR38] 38.Hambright, W. S., Fonseca, R. S., Chen, L., Na, R. & Ran, Q. Ablation of ferroptosis regulator glutathione peroxidase 4 in forebrain neurons promotes cognitive impairment and neurodegeneration. Redox Biol.12, 8–17 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR39] 39.Carlson, B. A. et al. Glutathione peroxidase 4 and vitamin E cooperatively prevent hepatocellular degeneration. Redox Biol.9, 22–31 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR40] 40.Zhang, Y. et al. Imidazole Ketone Erastin Induces Ferroptosis and slows Tumor Growth in a mouse lymphoma model. Cell. Chem. Biol.26 (5), 623–633e9 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR41] 41.Yang, W. S. et al. Regulation of ferroptotic cancer cell death by GPX4. Cell. 156 (1–2), 317–331 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR42] 42.Sui, X. et al. RSL3 drives ferroptosis through GPX4 inactivation and ROS production in Colorectal Cancer. Front. Pharmacol.9, 1371 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR43] 43.Harder, B. et al. Molecular mechanisms of Nrf2 regulation and how these influence chemical modulation for disease intervention. Biochem. Soc. Trans.43 (4), 680–686 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR44] 44.Hangauer, M. J. et al. Drug-tolerant persister cancer cells are vulnerable to GPX4 inhibition. Nature. 551 (7679), 247–250 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
