PLOS One. 2025 Oct 27;20(10):e0333179. doi: 10.1371/journal.pone.0333179

Integrative GWAS and RNA-Seq analysis for target identification and virtual drug screening in colorectal cancer

Qinghui Liu 1,2,*, Yiyang Lei 1,2, Zixuan Liu 1, Jiale Han 1
Editor: Zhengrui Li 3
PMCID: PMC12558462  PMID: 41144378

Abstract

Background

Colorectal cancer (CRC) is a leading cause of global cancer-related mortality, necessitating the identification of novel therapeutic targets. Integrating genetic and transcriptomic data may reveal key molecular drivers of CRC progression and treatment opportunities.

Methods

We performed a multiomics analysis combining genome-wide association study (GWAS) data (p < 1e-6) and RNA-seq data from the TCGA. Differential expression analysis (Limma) identified 24 consistently dysregulated genes (17 mRNAs, 7 lncRNAs) in CRC. Survival analysis was used to evaluate their prognostic impact on overall survival (OS), relapse-free survival (RFS), and post progression survival (PPS). Drug‒gene interactions were explored via Enrichr, and virtual screening (PubChem) prioritized high-affinity compounds that target PYGL, a metabolic regulator.

Results

Integration of GWAS and RNA-seq revealed that 24 CRC-associated genes, including PYGL, SMAD7, and TCF7L2, are involved in tumor metabolism and Wnt/TCF signaling. Survival analysis revealed that five genes (CDKN2B, BOC, SCG5, PYGL, and METRNL) were significantly correlated with OS, RFS, and PPS. Ten small-molecule candidates targeting PYGL exhibited high binding affinity, suggesting their therapeutic potential.

Conclusion

This study identified CRC-linked genes through GWASs and transcriptomics, highlighting their prognostic and druggable relevance. Computational drug repurposing pinpoints PYGL inhibitors as promising candidates, offering a translational framework for CRC therapy development.

Introduction

Colorectal cancer (CRC) accounts for approximately 10% of global cancer cases and is the second leading cause of cancer-related mortality [1,2]. In China, the annual incidence rate of CRC is 9.2%, ranking fourth among all cancers. In terms of both incidence and mortality, colorectal cancer rates were significantly higher in males than in females and greater in rural areas than in urban areas. The incidence increases significantly with age, especially after 40 or 45 years of age [3].

Single nucleotide polymorphisms (SNPs) are among the most common forms of genetic variation and involve single nucleotide alterations at specific genomic positions. These variations differ among individuals and can influence various phenotypic traits. SNPs can affect gene function, making them key determinants of susceptibility, progression, and prognosis in diseases such as CRC. The functional impact of SNPs depends on their genomic location and mutation type, with those occurring in coding or regulatory regions exerting more substantial effects on gene function. The TCGA database provides RNA-seq data across various cancer types, including gene expression profiles from tumor, adjacent, and normal tissues. Analyzing these datasets allows the identification of differentially expressed genes and their encoded proteins, many of which have emerged as key therapeutic targets in disease research, such as SPAG5, MAGEA3 [4], and TOP2A [5]. Furthermore, a drug and disease prediction platform was developed on the basis of TCGA data [6,7].

However, single-method analyses of differential gene expression often have limitations, as they focus primarily on highly differentially expressed genes while potentially overlooking those with lower expression differences, thereby reducing the likelihood of identifying novel therapeutic targets or drug candidates [8]. To address this, the present study integrates GWAS and RNA-seq data to identify novel target genes and proteins for CRC treatment [9]. First, GWAS data from European and East Asian populations were analyzed to identify significant genetic variants associated with CRC. Next, RNA-seq analysis identified 17 mRNAs and 7 long noncoding RNAs (lncRNAs) with significant differential expression in CRC patients. Finally, molecular docking analysis was performed to screen potential drug candidates targeting these differentially expressed genes. By integrating multiomics approaches, this study aims to identify novel biomarkers and potential therapeutic targets for CRC, providing a theoretical foundation for future precision medicine strategies.

1. Materials and methods

1.1 Study design

The study design process is illustrated in Fig 1.

Fig 1. Schematic representation of the study design.

1.2 Data acquisition

GWAS data related to CRC were obtained from the GWAS Catalog database. Datasets were filtered on the basis of case‒control status and sample size, and datasets with small sample sizes (e.g., fewer than 5,000 cases) were excluded to ensure statistical robustness. The GCST90018808 dataset was selected (Table 1), encompassing a total of 637,693 participants, including 14,886 CRC patients and 622,807 controls. The study population was exclusively European and East Asian, and both male and female participants were included to minimize allele frequency biases caused by population stratification and linkage disequilibrium (LD). Since this study utilized publicly available summary-level data, no additional ethical approval or informed consent was needed.

Table 1. Summary of GWAS datasets for colorectal cancer analysis.

GWAS ID | Case (n) | Control (n) | Number of SNPs | Population | Ref.
GCST90018808 | 6,581 | 463,421 | 25,843,452 | European | [10]
 | 8,305 | 159,386 | | East Asian |

1.3 GWAS analysis

GWAS data analysis was conducted via the TwoSampleMR and ieugwasr packages in R (version 4.4.2). Chen et al. [11] employed a conditional p value threshold of p < 1e-6 to identify independent association signals and discovered several potential CRC susceptibility genes that had not been reported previously. SNPs with a significance threshold of p < 1e-6 were retained for further analysis, while weakly associated SNPs were filtered out. The functional annotation of the SNPs was performed via the FastTraitR package to assess the relationships among the SNPs, genes and phenotypes, facilitating the identification of CRC-associated genes. CMplot was used to construct a Manhattan plot, providing a visual representation of the positional distribution of significantly associated SNPs.

During linkage disequilibrium (LD)-based SNP screening, linked SNPs are typically removed, retaining only a single representative SNP. However, this selected SNP may have limited prior research or lack associated phenotypic data, potentially leading to the exclusion of SNPs with known phenotypic associations. To address this, we first conducted phenotypic analysis to identify SNPs with available phenotypic data, followed by LD analysis and subsequent investigation of the selected SNP loci. LD-based instrumental variable selection was carried out via the ieugwasr package, with the parameters set to r2 < 0.1 and a window size of 100,000 kb. To minimize bias from weak instrumental variables, SNPs with an F statistic < 10 were excluded [12].
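
The two numeric filters described in this section (p < 1e-6 and F statistic ≥ 10) can be sketched in Python as follows. This is a minimal illustration rather than the authors' R pipeline: the `Snp` record and the example values are hypothetical, the per-SNP approximation F ≈ (beta/SE)² is an assumption (the source gives only the threshold), and LD clumping with ieugwasr is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Snp:
    rsid: str
    beta: float   # effect size (log odds ratio)
    se: float     # standard error of beta
    pval: float   # association p value from the GWAS summary statistics


def f_statistic(snp: Snp) -> float:
    """Per-SNP F statistic, approximated as (beta / se)^2 (assumed formula)."""
    return (snp.beta / snp.se) ** 2


def select_snps(snps, p_max=1e-6, f_min=10.0):
    """Keep SNPs with p < p_max that are not weak instruments (F >= f_min)."""
    return [s for s in snps if s.pval < p_max and f_statistic(s) >= f_min]


# Hypothetical summary statistics, for illustration only (p values approximate).
snps = [
    Snp("rsA", beta=0.100, se=0.01, pval=1.5e-23),  # z = 10, kept
    Snp("rsB", beta=0.020, se=0.01, pval=4.6e-2),   # z = 2, fails p < 1e-6
    Snp("rsC", beta=0.055, se=0.01, pval=3.8e-8),   # z = 5.5, kept
]

kept = select_snps(snps)
print([s.rsid for s in kept])
```

Under this approximation F is simply the squared z score, so a SNP that passes p < 1e-6 will normally also pass F ≥ 10; the F check then acts as a safeguard rather than as the main filter.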

1.4 RNA-seq differential expression analysis

RNA-seq data for colon cancer were retrieved and analyzed via Sangerbox (http://sangerbox.com/tool.html) [13]. Gene identifiers (ENSG_ID) were converted to GeneSymbols. First, the expression matrix was obtained, and genes and samples with more than 50% missing (NA) values were removed. Next, missing values were imputed via the impute.knn function from the R package impute, with the number of neighbors set to 10,000 for data imputation, and data normalization was performed via log2(X + 1) transformation. The expression profiles of genes identified through GWAS were extracted, and differential expression analysis was conducted via the Limma package [14]. The analysis criteria were a fold-change threshold of 2.0 and a significance threshold of FDR < 0.01 (|log2FC| > 1.0 and -log10(FDR) ≥ 2.0) [15]. Volcano plots were constructed to visualize the results, highlighting genes with significantly altered expression in CRC.
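
To make the normalization and threshold step concrete, the following Python sketch applies the log2(X + 1) transformation and labels a gene as up- or downregulated using the stated cutoffs. It is only an illustration under assumptions: the expression values and FDR are hypothetical, the log fold change is a plain difference of group means, and the moderated statistics that Limma computes are not reproduced.

```python
import math


def log2_plus_one(x: float) -> float:
    """log2(X + 1) transformation used for normalization."""
    return math.log2(x + 1.0)


def log_fold_change(tumor, normal) -> float:
    """Difference of mean log2(X + 1) expression, tumor minus normal."""
    t = sum(log2_plus_one(v) for v in tumor) / len(tumor)
    n = sum(log2_plus_one(v) for v in normal) / len(normal)
    return t - n


def classify(log_fc: float, fdr: float, fc_cut=1.0, fdr_cut=0.01) -> str:
    """Apply |log2FC| > 1.0 (fold change > 2) and FDR < 0.01."""
    if fdr < fdr_cut and log_fc > fc_cut:
        return "up"
    if fdr < fdr_cut and log_fc < -fc_cut:
        return "down"
    return "not significant"


# Hypothetical expression values for one gene (not taken from TCGA).
tumor_expr = [15.0, 31.0]    # log2(x + 1) -> 4 and 5
normal_expr = [3.0, 7.0]     # log2(x + 1) -> 2 and 3

lfc = log_fold_change(tumor_expr, normal_expr)
print(lfc, classify(lfc, fdr=0.001))
```

A log2 fold change of 1.0 corresponds to a twofold change, which is why the fold-change threshold of 2.0 and the |log2FC| > 1.0 criterion describe the same cutoff.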

1.5 Survival analysis across all genes in colon cancer

Survival analysis was performed via the Kaplan‒Meier plotter platform (https://kmplot.com/analysis/) [16]. Colon cancer was selected as the screening target, and either the gene symbols or Affy IDs of the test genes were entered into the designated fields. Patients were stratified via the “auto select best cutoff” option, which evaluates all possible cutoff values between the lower and upper quartiles and selects the best-performing threshold as the cutoff value. The generated p value does not include correction for multiple hypothesis testing. All other parameters were kept at the database default settings. Separate survival analyses were conducted for OS, RFS, and PPS.
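
The idea behind a "best cutoff" scan can be sketched as follows. This is not the Kaplan‒Meier plotter implementation: the selection criterion (largest two-group log-rank chi-square), the quartile indexing, and all data are assumptions made for illustration.

```python
import math


def logrank_chi2(times, events, in_high):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    event_times = sorted({t for t, e in zip(times, events) if e})
    obs_minus_exp, var = 0.0, 0.0
    for t in event_times:
        n = sum(1 for x in times if x >= t)                       # at risk, all
        n1 = sum(1 for x, g in zip(times, in_high) if g and x >= t)  # at risk, high
        d = sum(1 for x, e in zip(times, events) if e and x == t)    # events at t
        d1 = sum(1 for x, e, g in zip(times, events, in_high)
                 if e and g and x == t)                           # events, high
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / var if var > 0 else 0.0


def best_cutoff(expr, times, events):
    """Scan cutoffs between the lower and upper quartile of expression."""
    values = sorted(expr)
    lo = values[len(values) // 4]
    hi = values[(3 * len(values)) // 4]
    best = None
    for c in sorted({v for v in values if lo <= v <= hi}):
        chi2 = logrank_chi2(times, events, [x > c for x in expr])
        if best is None or chi2 > best[1]:
            best = (c, chi2)
    return best


# Hypothetical cohort: higher expression goes with shorter survival.
expr = [1, 2, 3, 4, 5, 6, 7, 8]
times = [30, 28, 25, 22, 9, 7, 5, 3]   # months
events = [1] * 8                        # 1 = death observed

cutoff, chi2 = best_cutoff(expr, times, events)
p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 df
print(cutoff, round(chi2, 2), p_value)
```

Because the cutoff is chosen to maximize the separation between groups, the resulting p value is optimistic, which is why the absence of multiple-testing correction noted above matters when interpreting the results.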

1.6 Enrichment analysis and functional annotation

Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses were conducted to examine biological process variations via the GO biological process 2025 library and the KEGG 2021 human database on the Enrichr platform (https://maayanlab.cloud/Enrichr/) [17]. The results were filtered to display the top 10 enriched GO terms and KEGG pathways, with a focus on the biological process category.

1.7 Online drug screening

The Enrichr database (https://maayanlab.cloud/Enrichr/), specifically the DSigDB dataset, which comprises 4,026 small-molecule compounds and 19,513 genes, was used for online drug screening [17]. Significantly differentially expressed genes in CRC were queried in Enrichr to identify candidate drugs and their mechanisms of action, facilitating the selection of potential therapeutic agents for CRC.

1.8 Virtual screening

Protein structures and interaction networks of the significantly differentially expressed genes were retrieved from STRING version 12.0 (https://cn.string-db.org/) [18]. Structural data were obtained from the AlphaFold Protein Structure Database and the Protein Data Bank (PDB), with priority given to PDB entries. The selection criteria included the scientific name of the source organism (Homo sapiens), the experimental method (X-ray diffraction or electron microscopy), and a refinement resolution of ≤2.5 Å. For proteins lacking small-molecule ligands, binding sites were predicted via computational tools. If ligands were present, the ligand-binding region was defined as the active site for drug screening.

The protein encoded by the PYGL gene (PDB ID: 3DDS) was selected as the receptor, with its original ligand, 26B (molecular weight: 505.60 g/mol), used for docking. The 3DDS structure was determined by X-ray diffraction at a resolution of 1.80 Å. Only protein structures with well-defined active sites were retained for further analysis.

This study retrieved PYGL-targeting compounds from the BindingDB database (https://www.bindingdb.org/rwd/bind/index.jsp) [19]. Compounds with IC50 values of zero or nonspecific numerical data were removed. After the elimination of structurally identical compounds, 499 unique small-molecule compounds were obtained, all of which had experimentally determined IC50 values against PYGL.

A total of 231,187 small-molecule crystal structures were downloaded from the PubChem database. Lipinski’s rule of five was applied to assess drug likeness [20] with the following criteria: molecular weight (200–500 Da), hydrogen bond donors (≤5), hydrogen bond acceptors (≤10), lipophilicity (LogP: 1–5), and rotatable bonds (1–10). After filtering, 62,964 small-molecule crystal structures were retained, hydrogenated, and optimized as ligands for virtual screening.
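
The drug-likeness filter above amounts to a set of range checks on precomputed descriptors. The Python sketch below shows one way to express it; the `Descriptors` record and the example compounds are hypothetical, and descriptor calculation itself (for example with a cheminformatics toolkit) is assumed to have been done beforehand.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    cid: str            # compound identifier (hypothetical values below)
    mol_weight: float   # Da
    h_donors: int
    h_acceptors: int
    logp: float
    rot_bonds: int


def passes_filter(d: Descriptors) -> bool:
    """Apply the ranges stated in the text for each descriptor."""
    return (
        200 <= d.mol_weight <= 500
        and d.h_donors <= 5
        and d.h_acceptors <= 10
        and 1 <= d.logp <= 5
        and 1 <= d.rot_bonds <= 10
    )


# Hypothetical compounds, for illustration only.
library = [
    Descriptors("cmpd-1", 399.4, 2, 5, 3.2, 4),    # within all ranges
    Descriptors("cmpd-2", 612.7, 3, 9, 4.1, 8),    # too heavy
    Descriptors("cmpd-3", 310.0, 1, 4, 0.4, 3),    # LogP below 1
    Descriptors("cmpd-4", 250.3, 6, 4, 2.0, 2),    # too many H-bond donors
]

retained = [d.cid for d in library if passes_filter(d)]
print(retained)
```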

AutoDock Vina software was used for virtual screening [21], with protein crystal structures serving as receptors and the original ligand-binding pocket defined as the active site. Potential drug candidates were identified on the basis of binding energy and docking conformations. The molecular docking results were visualized and analyzed via PyMOL (version 3.1.0).

2. Results

2.1 Identification of SNPs significantly associated with phenotypes through GWAS data analysis

A significance threshold of p < 1e-6 was applied to filter out weakly associated SNPs from the CRC GWAS dataset derived from European and East Asian populations. Functional annotation was performed, and gene names were extracted from the MAPPED_GENE field across the datasets, resulting in the identification of 67 genes significantly associated with various phenotypes. These phenotypes include tumor-related traits, hematologic disorders, diabetes, and others. Most genes contained between 1 and 5 significant SNPs, although certain loci presented 10 or more. For example, the SMAD7 gene (chromosome 18: 48,922,435–48,927,678 bp) has more than 10 associated SNPs, including rs11874392, rs12953717, and rs12956924. The CDKN2B-AS1 locus (chromosome 9: 22,003,368–22,125,504 bp) contains more than 60 SNPs, all of which are strongly associated with gene activity.

LD-based instrumental variable selection was performed via the ieugwasr package. Integration of the GWAS datasets led to the identification of 55 SNPs and 67 candidate genes in total (S1 Table). A Manhattan plot and a QQ plot for the GWAS dataset were generated via the CMplot package. The results indicated that significant SNP loci were distributed across multiple chromosomes (Fig 2).

Fig 2. Manhattan plot and QQ plot for GWAS ID: GCST90018808.

A: Manhattan plot; B: QQ plot. The significance threshold was set at 1e-6, and SNPs significantly associated with colorectal cancer (CRC) are annotated. Multiple SNPs reached genome-wide significance (p < 5e-8), and their proximal genes exhibited significant differential expression in the RNA-seq data.

2.2 Identification of genes significantly associated with CRC via both GWAS and RNA-seq data analyses

RNA-seq data for colon adenocarcinoma (COAD) were retrieved from TCGA via Sangerbox. The data were normalized, with colon cancer patients serving as the comparison group and adjacent or normal tissues serving as the control group. Differential expression analysis was performed via the Limma package. To ensure data integrity, we removed genes with zero expression values in >50% of samples, leaving 64 phenotype-associated genes.

By applying a fold-change (FC) threshold of 2.0 and a false discovery rate (FDR) threshold of 0.01, a total of 7 upregulated and 17 downregulated genes were identified (Table 2). A volcano plot was generated to visualize the RNA-seq analysis results for colon cancer, highlighting some genes that exhibited significant differential expression in both the GWAS and RNA-seq datasets for CRC. Notably, TMEM220-AS1, TMEM238L, SOX6, SMAD7, TCF7L2, and PYGL were significantly downregulated in colon cancer, whereas LINC02257, PCAT1, CASC8, and POU5F1B were significantly upregulated in colon cancer.

Table 2. Information on selected SNPs and associated genes.

No. | SNP | Chr | Pos | GWAS p_value | Nearby gene | RNA type | RNA-seq logFC | RNA-seq P.Value
1 | rs12022676 | 1 | 222146085 | 6.40E-07 | LINC02257 | lncRNA | 2.1226 | 7.80E-15
 | | | | | LINC02474 | lncRNA | 2.1521 | 2.96E-07
2 | rs10049390 | 3 | 133701119 | 2.33E-07 | SLCO2A1 | mRNA | −2.9041 | 1.41E-35
3 | rs139372065 | 3 | 28513403 | 4.41E-07 | ZCWPW2 | mRNA | −1.4994 | 3.99E-13
4 | rs72942485 | 3 | 112999560 | 8.07E-07 | BOC | mRNA | −2.3307 | 2.33E-16
5 | rs10505477 | 8 | 128407443 | 1.40E-33 | PCAT1 | lncRNA | 1.3592 | 1.57E-10
 | | | | | CASC8 | lncRNA | 1.9539 | 5.33E-12
 | | | | | POU5F1B | mRNA | 3.1255 | 2.44E-17
6 | rs1063192 | 9 | 22003367 | 1.43E-07 | CDKN2B | mRNA | −4.0373 | 7.20E-55
 | | | | | CDKN2B-AS1 | lncRNA | −6.6891 | 9.86E-71
7 | rs7068313 | 10 | 114725926 | 8.16E-07 | TCF7L2 | mRNA | −1.2813 | 1.19E-14
8 | rs61871279 | 10 | 101343705 | 8.80E-07 | NKX2-3 | mRNA | −3.5499 | 6.30E-44
9 | rs66830472 | 11 | 16007446 | 5.42E-08 | SOX6 | mRNA | −1.8222 | 2.38E-11
10 | rs7398375 | 12 | 57540848 | 4.47E-07 | LRP1 | mRNA | −1.1805 | 4.03E-08
11 | rs28611105 | 14 | 51359658 | 1.32E-08 | PYGL | mRNA | −1.1501 | 5.20E-06
 | | | | | ABHD12B | mRNA | −1.7702 | 0.0002864
12 | rs16969681 | 15 | 32993111 | 5.47E-09 | SCG5 | mRNA | −1.1292 | 2.77E-05
13 | rs2439411 | 15 | 66983982 | 7.37E-09 | LINC01169 | lncRNA | 1.8324 | 8.07E-08
14 | rs3118233 | 16 | 68733646 | 1.60E-07 | CDH3 | mRNA | 5.8195 | 1.12E-109
15 | rs1078643 | 17 | 10707241 | 1.26E-09 | TMEM220-AS1 | lncRNA | −3.4109 | 1.02E-43
 | | | | | TMEM238L | mRNA | −2.5739 | 5.84E-28
16 | rs4986080 | 17 | 81049741 | 3.80E-07 | METRNL | mRNA | −1.4039 | 1.58E-17
17 | rs7229639 | 18 | 46450976 | 1.74E-13 | SMAD7 | mRNA | −1.5507 | 8.96E-18
18 | rs6019378 | 20 | 47309716 | 3.73E-08 | PREX1 | mRNA | −1.1888 | 1.62E-10

In addition, 7 lncRNAs, such as CDKN2B-AS1, TMEM220-AS1, PCAT1, and CASC8, were identified. Although these lncRNAs do not encode proteins, they play a regulatory role in gene expression through various molecular mechanisms. CDKN2B-AS1 and TMEM220-AS1 were significantly downregulated in colon cancer, whereas LINC02257, PCAT1, and CASC8 were significantly upregulated (Fig 3). Meng et al. [22] developed mPEG-DSPE liposomes encapsulating siRNAs targeting lncRNAs, which effectively inhibited CRC progression.

Fig 3. Volcano plot of differentially expressed genes in colon cancer RNA-seq data.

2.3 Survival analysis

The Kaplan‒Meier plotter background database is manually curated, incorporating gene expression data along with relapse-free and overall survival information sourced from GEO, EGA, and TCGA. In this study, survival analysis was performed on 18 screened genes (Table 3 and Fig 4). Five of these genes (CDKN2B, BOC, SCG5, PYGL, and METRNL) were statistically significant (p value < 0.05) for OS, RFS, and PPS, with hazard ratios (HRs) > 1, indicating that high expression of these genes is associated with increased mortality risk and shorter survival. These genes may represent promising targets for future pharmacological development. SMAD7 expression was also significantly associated (p value < 0.05) with all three survival metrics, with an HR > 1 for OS and RFS and an HR < 1 for PPS, suggesting a complex relationship between its expression levels and colorectal cancer patient outcomes that warrants further investigation.

Table 3. Genes whose expression is higher in CRC tumors and are correlated to OS, RFS, PPS.

No. | Gene symbol | Affy ID | OS P value | OS FDR | OS HR | OS cutoff value | RFS P value | RFS FDR | RFS HR | RFS cutoff value | PPS P value | PPS FDR | PPS HR | PPS cutoff value
1 | CDKN2B | 236313_at | 2.2e-5 | 0.01 | 1.66 | 296 | 3.5e-5 | 0.01 | 1.72 | 225 | 0.0026 | 0.50 | 1.74 | 526
2 | BOC | 225990_at | 0.0003 | 0.03 | 1.54 | 257 | 9.9e-5 | 0.01 | 1.61 | 186 | 0.0008 | 0.10 | 1.76 | 262
3 | SCG5 | 203889_at | 0.00056 | 0.05 | 1.43 | 212 | 0.0437 | 0.50 | 1.25 | 206 | 0.0045 | 0.20 | 1.54 | 274
4 | LRP1 | 200784_s_at | 0.0016 | 0.20 | 1.4 | 179 | 9.6e-9 | 0.01 | 2.29 | 88 | 0.1055 | 1.00 | 1.28 | 250
5 | ZCWPW2 | 243863_at | 0.0024 | 0.20 | 1.58 | 5 | 0.3117 | 1.00 | 1.13 | 14 | 0.2683 | 1.00 | 0.82 | 25
6 | PYGL | 202990_at | 0.0038 | 0.50 | 1.37 | 736 | 0.0003 | 0.03 | 1.67 | 312 | 0.0263 | 0.50 | 1.41 | 532
7 | PREX1 | 224925_at | 0.0248 | 0.50 | 1.33 | 223 | 0.0012 | 0.10 | 1.47 | 197 | 0.3142 | 1.00 | 1.19 | 205
8 | SMAD7 | 204790_at | 0.0267 | 0.50 | 1.27 | 674 | 0.0002 | 0.02 | 1.52 | 658 | 0.017 | 0.50 | 0.68 | 423
9 | SLCO2A1 | 204368_at | 0.0292 | 0.50 | 1.27 | 358 | 0.0001 | 0.01 | 1.53 | 352 | 0.1291 | 1.00 | 0.79 | 405
10 | METRNL | 225955_at | 0.0433 | 0.50 | 1.28 | 840 | 1.3e-5 | 0.01 | 1.64 | 753 | 0.0095 | 0.50 | 1.58 | 759
11 | POU5F1B | 208286_x_at | 0.0453 | 0.50 | 1.24 | 296 | 0.3299 | 1.00 | 1.13 | 123 | 0.0852 | 1.00 | 0.77 | 205
12 | ABHD12B | 237974_at | 0.0612 | 1.00 | 1.26 | 93 | 0.0079 | 0.50 | 1.39 | 14 | 0.0672 | 1.00 | 1.37 | 69
13 | TCF7L2 | 212761_at | 0.1024 | 1.00 | 0.85 | 3825 | 1.8e-5 | 0.01 | 0.63 | 3786 | 0.0208 | 0.50 | 0.7 | 3841
14 | CDH3 | 203256_at | 0.1803 | 1.00 | 1.15 | 1042 | 0.0046 | 0.50 | 1.43 | 576 | 0.0752 | 1.00 | 1.35 | 1295
15 | SOX6 | 227498_at | 0.3515 | 1.00 | 0.89 | 60 | 0.0344 | 0.50 | 0.78 | 68 | 0.0271 | 0.50 | 0.66 | 130
16 | NKX2-3 | 1553808_a_at | 0.458 | 1.00 | 0.91 | 127 | 0.1049 | 1.00 | 0.83 | 137 | 0.0689 | 1.00 | 0.73 | 114
17 | CDKN2B-AS1 | 1559884_at | 0.002 | 0.20 | 1.46 | 4 | 0.0002 | 0.02 | 1.54 | 4 | 0.4442 | 1.00 | 1.14 | 6
18 | LINC01169 | 1563132_at | 0.0595 | 1.00 | 1.28 | 47 | 0.0219 | 0.50 | 0.76 | 8 | 0.0303 | 0.50 | 0.7 | 25

OS = Overall Survival, RFS = Relapse-Free Survival, PPS = Post-Progression Survival, HR = hazard ratio, FDR = false discovery rate.

Fig 4. Most significant druggable genes associated with Overall Survival.

HR = hazard ratio. Panels: CDKN2B (A), BOC (B), SCG5 (C), LRP1 (D), ZCWPW2 (E), PYGL (F).

2.4 Enrichment analysis and functional annotation

The adjusted p values were calculated via the Benjamini‒Hochberg method to correct for multiple hypothesis testing. All genes in the human genome were used as background. Only the top 10 significant results are presented in Table 4. GO enrichment analysis revealed that the Wnt signaling pathway, fat cell differentiation and regulation of epithelial-to-mesenchymal transition were the most highly enriched biological processes. The Wnt signaling pathway is a highly conserved signaling cascade that plays crucial roles in cell fate determination, tissue development, and tumorigenesis. In the canonical Wnt pathway, β-catenin degradation is inhibited, allowing it to accumulate and interact with TCF/LEF transcription factors (e.g., TCF7L2) to activate the transcription of downstream target genes.
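
For readers unfamiliar with the statistics behind such enrichment tables, the following Python sketch computes an over-representation p value from the hypergeometric distribution and applies the Benjamini‒Hochberg adjustment. It is a generic illustration, not Enrichr's code: the background size, gene-list size, and example p values are assumptions, and Enrichr's exact background and odds-ratio calculation are not reproduced.

```python
from math import comb


def hypergeom_tail(overlap, term_size, background, list_size):
    """P(X >= overlap) when list_size genes are drawn from the background."""
    total = comb(background, list_size)
    upper = min(term_size, list_size)
    return sum(
        comb(term_size, i) * comb(background - term_size, list_size - i)
        for i in range(overlap, upper + 1)
    ) / total


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical inputs: 2 of 15 query genes fall in a 60-gene term,
# against an assumed background of 20,000 genes.
p = hypergeom_tail(overlap=2, term_size=60, background=20000, list_size=15)
adj = benjamini_hochberg([0.01, 0.04, 0.03])
print(p, adj)
```

The monotonicity step means that several terms can share the same adjusted p value even when their raw p values differ.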

Table 4. Top 10 significantly enriched GO terms.

Term Overlap P-value Adjusted P-value Odds Ratio Genes
Canonical Wnt Signaling Pathway (GO:0060070) | 2/63 | 8.67E-04 | 0.046307 | 54.43989 | TCF7L2;SMAD7
Fat Cell Differentiation (GO:0045444) | 2/65 | 9.23E-04 | 0.046307 | 52.70635 | TCF7L2;METRNL
Positive Regulation of Cell Differentiation (GO:0045597) | 3/297 | 0.001045 | 0.046307 | 18.26716 | TCF7L2;BOC;SOX6
Regulation of Epithelial to Mesenchymal Transition (GO:0010717) | 2/91 | 0.001798 | 0.046307 | 37.2603 | TCF7L2;SMAD7
Glucose Homeostasis (GO:0042593) | 2/92 | 0.001837 | 0.046307 | 36.84444 | TCF7L2;PYGL
Regulation of Epithelial Cell Proliferation (GO:0050678) | 2/93 | 0.001877 | 0.046307 | 36.43773 | TCF7L2;CDKN2B
Cellular Response to Transforming Growth Factor Beta Stimulus (GO:0071560) | 2/98 | 0.002081 | 0.046307 | 34.53125 | SOX6;SMAD7
Regulation of Transforming Growth Factor Beta Receptor Signaling Pathway (GO:0017015) | 2/120 | 0.003099 | 0.046307 | 28.06215 | CDKN2B;SMAD7
Maintenance of DNA Repeat Elements (GO:0043570) | 1/5 | 0.003495 | 0.046307 | 384.2692 | TCF7L2
Regulation of Ventricular Cardiac Muscle Cell Membrane Depolarization (GO:0060373) | 1/5 | 0.003495 | 0.046307 | 384.2692 | SMAD7

KEGG pathway analysis revealed several significant pathways, including the TGF-beta signaling pathway, gastric cancer and Cushing syndrome (Table 5), suggesting a potential role of these genes in cancer-related mechanisms and metabolic processes. The TGF-beta signaling pathway plays vital roles in regulating cell growth, differentiation, and immune function. The glucocorticoid receptor signaling pathway, which is implicated in Cushing syndrome, modulates a wide range of physiological processes, including glucose metabolism and immune responses. Starch and sucrose metabolism is central to maintaining energy homeostasis, regulating blood glucose levels, and supporting overall metabolic health. Notably, genes such as TCF7L2, CDKN2B, and PYGL are involved in multiple glucose metabolism-related pathways. We hypothesize that their downregulation in colorectal cancer tissues may disrupt glucose metabolism, thereby contributing to tumorigenesis.

Table 5. Top 10 significantly enriched KEGG pathways.

Term Overlap P-value Adjusted P-value Odds Ratio Genes
TGF-beta signaling pathway | 2/94 | 7.70E-04 | 0.022317 | 61.79814 | CDKN2B;SMAD7
Gastric cancer | 2/149 | 0.001918 | 0.022317 | 38.56948 | TCF7L2;CDKN2B
Cushing syndrome | 2/155 | 0.002073 | 0.022317 | 37.04575 | TCF7L2;CDKN2B
Hippo signaling pathway | 2/163 | 0.002289 | 0.022317 | 35.19077 | TCF7L2;SMAD7
Kaposi sarcoma-associated herpesvirus infection | 2/193 | 0.00319 | 0.024879 | 29.61855 | TCF7L2;PREX1
Starch and sucrose metabolism | 1/36 | 0.016087 | 0.081916 | 71.27143 | PYGL
Thyroid cancer | 1/37 | 0.01653 | 0.081916 | 69.28819 | TCF7L2
Malaria | 1/50 | 0.022281 | 0.081916 | 50.87245 | LRP1
Cholesterol metabolism | 1/50 | 0.022281 | 0.081916 | 50.87245 | LRP1
Pathways in cancer | 2/531 | 0.022382 | 0.081916 | 10.51148 | TCF7L2;CDKN2B

2.5 Online drug screening for CRC-associated genes

The 13 identified genes were queried in the Enrichr database, and potential therapeutic compounds were screened via DSigDB (Table 6). These compounds exert their effects by targeting specific genes or their encoded proteins, thereby influencing key signaling pathways involved in CRC progression. As a result, these compounds and their derivatives hold potential as therapeutic agents for CRC treatment [23–25].

Table 6. Potential therapeutic compounds identified through online screening.

Term Overlap P-value Adjusted P-value Odds Ratio Genes
TERT-butyl hydroperoxide CTD 00007349 | 6/1341 | 1.69E-04 | 0.044751 | 10.47809 | CDKN2B;CDH3;LRP1;SCG5;PYGL;SMAD7
Retinoic acid CTD 00006918 | 9/4258 | 6.22E-04 | 0.080978 | 6.666651 | POU5F1B;TCF7L2;PREX1;CDKN2B;CDH3;BOC;SLCO2A1;SOX6;SMAD7
VALPROIC ACID CTD 00006977 | 12/8312 | 9.17E-04 | 0.080978 | 8.447711 | POU5F1B;TCF7L2;PREX1;CDKN2B;CDH3;BOC;SCG5;PYGL;METRNL;SOX6;ABHD12B;SMAD7
indomethacin CTD 00006147 | 3/335 | 0.001478 | 0.09792 | 16.14513 | TCF7L2;SLCO2A1;SMAD7
Rifampicin CTD 00006701 | 2/133 | 0.00379 | 0.194544 | 25.26081 | CDKN2B;SLCO2A1
Decitabine CTD 00000750 | 5/1800 | 0.005882 | 0.194544 | 5.630145 | CDKN2B;CDH3;SLCO2A1;SCG5;SMAD7
trichostatin A CTD 00000660 | 7/3584 | 0.006229 | 0.194544 | 4.587364 | PREX1;CDKN2B;CDH3;BOC;PYGL;METRNL;SMAD7
Tesmilifene CTD 00001953 | 1/13 | 0.009064 | 0.194544 | 128.0385 | LRP1
4-tert-Butylphenol CTD 00000316 | 1/13 | 0.009064 | 0.194544 | 128.0385 | LRP1
alsterpaullone MCF7 UP | 2/256 | 0.013419 | 0.194544 | 12.94751 | LRP1;SCG5

2.6 Virtual screening via molecular docking analysis of the receptor–ligand complex

A total of 17 mRNAs were identified, with their encoded protein structures available in the PDB or the AlphaFold protein structure database, among which only 3 possessed PDB IDs and contained small-molecule ligands. Based on the properties and size of the active sites, PYGL (PDB ID: 3DDS) was ultimately selected for further investigation [26], with the original ligand 26B serving as the docking ligand. Molecular docking was performed via AutoDock Vina, with the docking box centered at (x = 80.27, y = −97.68, z = 124.02). The calculated binding energy between the receptor and the original ligand was −12.3 kcal/mol, indicating a strong interaction. The docking conformation of the ligand exhibited a high degree of structural overlap with the original ligand (Fig 5). Both the crystal conformation and the docking conformation of the original ligand formed hydrogen bonds with key amino acid residues (Val40, Gln71, and Arg310), confirming the accuracy and reliability of the molecular docking method in predicting the optimal binding conformation of small-molecule compounds within the receptor’s active site.
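
As an illustration of how the reported docking box center could be passed to AutoDock Vina, the Python sketch below writes a Vina-style configuration file. Only the center coordinates come from the text; the box dimensions, the exhaustiveness value, and the file names are hypothetical placeholders, since the source does not report them.

```python
def vina_config(receptor, ligand, center, size, exhaustiveness=8):
    """Build the text of an AutoDock Vina configuration file."""
    cx, cy, cz = center
    sx, sy, sz = size
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"center_x = {cx}",
        f"center_y = {cy}",
        f"center_z = {cz}",
        f"size_x = {sx}",
        f"size_y = {sy}",
        f"size_z = {sz}",
        f"exhaustiveness = {exhaustiveness}",
    ]
    return "\n".join(lines) + "\n"


config = vina_config(
    receptor="3dds_receptor.pdbqt",    # hypothetical file name
    ligand="ligand_26B.pdbqt",         # hypothetical file name
    center=(80.27, -97.68, 124.02),    # box center reported in the text
    size=(20.0, 20.0, 20.0),           # assumed box size in angstroms
)
print(config)
```

The resulting text could be saved to a file and supplied to Vina through its `--config` option; the box size in particular should be chosen to enclose the original ligand-binding pocket.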

Fig 5. Docking conformation of the original ligand at the active site.

The red structure represents the docked ligand conformation, with magenta bonds indicating hydrogen-bond interactions. The green structure denotes the crystal conformation of the protein-bound ligand, with yellow bonds indicating hydrogen-bond interactions.

A total of 499 small-molecule compounds associated with the PYGL gene were retrieved from the BindingDB database, with most exhibiting nanomolar-range IC50 values against the PYGL protein. Molecular docking analysis via AutoDock Vina identified 155 compounds with binding energies ≤ −10 kcal/mol, whereas 497 compounds (more than 99% of all analyzed compounds) had binding energies ≤ −7 kcal/mol. Only two compounds exhibited binding energies > −7 kcal/mol, both of which had molecular weights below 150 g/mol (S2 Table), suggesting that low molecular weight may have affected docking accuracy. As shown in Fig 6, a positive correlation was observed between the binding energy and the Log(IC50) value (p = 8.2e-3). These findings indicate that the screening platform based on AutoDock Vina can effectively identify potential active compounds that target the PYGL protein.
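
The kind of check summarized in Fig 6 can be sketched as a Pearson correlation between docking energies and log10-transformed IC50 values. The data below are hypothetical and the choice of Pearson's coefficient is an assumption, as the source does not state which correlation measure was used.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical compounds: docking energy (kcal/mol) and IC50 (nM).
energies = [-11.2, -10.4, -9.8, -8.9, -7.5]
ic50_nm = [12.0, 45.0, 150.0, 900.0, 5200.0]

log_ic50 = [math.log10(v) for v in ic50_nm]
r = pearson_r(energies, log_ic50)
print(round(r, 3))
```

A positive coefficient here means that more negative (stronger) predicted binding energies tend to accompany lower IC50 values, which is the direction expected if docking scores track measured potency.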

Fig 6. Correlation analysis of the binding energies and IC50 values of compounds retrieved from the BindingDB.

Small molecule compounds were retrieved from the PubChem database and filtered on the basis of Lipinski’s Rule of Five to assess drug likeness. A total of 62,964 small-molecule compounds were selected, hydrogenated, and optimized for use as ligands in virtual screening. Using the PYGL protein structure (PDB: 3DDS) as the receptor and the original ligand binding site as the target, virtual screening was conducted with AutoDock Vina. This process identified 10 high-affinity compounds (Table 7), which are potential candidates for CRC treatment.

Table 7. Top 10 high-affinity compounds identified through virtual screening.

No. Compound CID Molecular formula Molecular weight (g/mol) Affinity (kcal/mol)
1 139070664 C24H21N3O3 399.4 −11.3
2 44572643 C26H27N2O 383.5 −11.0
3 139095568 C24H27NO4 393.5 −11.0
4 139079778 C26H24N2O2 396.5 −10.9
5 15845721 C26H22N4 390.5 −10.8
6 101542638 C24H22N4 366.5 −10.5
7 139054227 C25H25N3O 383.5 −10.5
8 139195689 C22H24N4 344.5 −10.4
9 139116666 C21H22N2O3 350.4 −10.4
10 86237799 C23H21N3O2 371.4 −10.3

3. Discussion

This study provides a comprehensive molecular characterization of colorectal cancer (CRC) through integrated analysis of GWAS and transcriptomic data combined with computational drug discovery approaches. Our systematic investigation identified 24 CRC-associated genes, including key regulators such as SMAD7 (a negative modulator of TGF-β signaling) [27,28], TCF7L2 (a critical component of the Wnt/β-catenin pathway) [29], and PYGL (a metabolic enzyme involved in glycogenolysis) [30,31]. These findings not only reinforce established CRC pathways but also reveal novel molecular vulnerabilities, particularly in tumor metabolic reprogramming.

The computational drug screening pipeline identified ten high-affinity small molecules that target PYGL, suggesting potential for therapeutic repurposing. This metabolic target represents an innovative approach to disrupt cancer energy homeostasis, which is distinct from conventional kinase inhibitors. Furthermore, our analysis revealed that differentially expressed lncRNAs (e.g., CDKN2B-AS1 and TMEM220-AS1) may function as epigenetic regulators in CRC pathogenesis, opening new avenues for RNA-targeted therapies [32].

While these computational predictions require experimental validation, the robustness of our findings is supported by the following:

  1. Stringent statistical thresholds (p < 1e-6 in GWAS) [11].

  2. Cross-platform consistency between the genomic and transcriptomic data.

  3. Comprehensive molecular docking analyses.

  4. Utilization of established drug‒gene interaction databases.

Future directions should focus on the following:

  1. Functional validation via CRISPR-based gene editing and organoid models [33,34].

  2. Preclinical evaluation of PYGL inhibitors in relevant CRC models.

  3. Development of LNP-encapsulated CRISPR systems for lncRNA modulation [35–37].

  4. Investigation of combinatorial approaches targeting both metabolic and signaling pathways.

This study establishes a framework for translating multiomics discoveries into therapeutic opportunities, highlighting the value of integrative computational biology in accelerating cancer drug discovery. The identified targets and compounds provide a foundation for developing next-generation CRC therapies that address the current limitations of targeted agents, particularly in overcoming drug resistance and improving treatment personalization.

Conclusion

In summary, this study employed p < 1e-6 as the screening threshold for the colorectal cancer GWAS, combined with RNA-seq, GO, KEGG, and survival analyses. Survival analysis revealed that five genes (CDKN2B, BOC, SCG5, PYGL, and METRNL) were significantly correlated with overall survival (OS), relapse-free survival (RFS), and post-progression survival (PPS). Among these, CDKN2B, BOC, and METRNL showed p values greater than 5e-8 in the colorectal cancer GWAS data; that is, they passed the relaxed p < 1e-6 threshold but would have been missed at the conventional genome-wide significance level. Subsequently, computer-aided drug design was used to screen multiple compounds exhibiting low binding energy with the PYGL protein, providing new therapeutic targets and directions for colorectal cancer treatment.

Supporting information

S1 Table. Phenotype-associated SNPs and genes identified in the GCST90018808 dataset.

(XLSX)

pone.0333179.s001.xlsx (31.3KB, xlsx)
S2 Table. Binding energies and IC50 values of compounds derived from BindingDB that target the PYGL protein.

The R scripts used in the study are available via the following DOI: 10.5281/zenodo.15803098, v1.0.2.

(XLSX)

pone.0333179.s002.xlsx (42.3KB, xlsx)

Acknowledgments

We would like to thank Professor Baoen Shan for kindly proofreading the manuscript. We would also like to express gratitude to EditSprings (https://www.editsprings.cn) for the expert linguistic services provided.

Data Availability

The gene expression data utilized in this study were obtained from the Genomic Data Commons (GDC) Data Portal (https://portal.gdc.cancer.gov/). Specifically, RNA-sequencing (RNA-seq) gene expression counts from the TCGA-COAD project were downloaded, with open access.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1. Zhou J, Zheng R, Zhang S, Zeng H, Wang S, Chen R, et al. Colorectal cancer burden and trends: comparison between China and major burden countries in the world. Chin J Cancer Res. 2021;33(1):1–10. doi: 10.21147/j.issn.1000-9604.2021.01.01
  • 2. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68(6):394–424. doi: 10.3322/caac.21492
  • 3. Liu S, Zheng R, Zhang M, Zhang S, Sun X, Chen W. Incidence and mortality of colorectal cancer in China, 2011. Chin J Cancer Res. 2015;27(1):22–8. doi: 10.3978/j.issn.1000-9604.2015.02.01
  • 4. Li B, Severson E, Pignon J-C, Zhao H, Li T, Novak J, et al. Comprehensive analyses of tumor immunity: implications for cancer immunotherapy. Genome Biol. 2016;17(1):174. doi: 10.1186/s13059-016-1028-7
  • 5. Yu Y, Li L, Luo B, Chen D, Yin C, Jian C, et al. Predicting potential therapeutic targets and small molecule drugs for early-stage lung adenocarcinoma. Biomed Pharmacother. 2024;174:116528. doi: 10.1016/j.biopha.2024.116528
  • 6. Stathias V, Jermakowicz AM, Maloof ME, Forlin M, Walters W, Suter RK, et al. Drug and disease signature integration identifies synergistic combinations in glioblastoma. Nat Commun. 2018;9(1):5315. doi: 10.1038/s41467-018-07659-z
  • 7. Mariano JA, Shen Y, Federico MG. Network-based inference of protein activity helps functionalize the genetic landscape of cancer. Nat Genet. 2016;48(8):838–47.
  • 8. Wang W, Liu Y, Wang Z, Tan X, Jian X, Zhang Z. Exploring and validating the necroptotic gene regulation and related lncRNA mechanisms in colon adenocarcinoma based on multi-dimensional data. Sci Rep. 2024;14(1):22251. doi: 10.1038/s41598-024-73168-3
  • 9. Zhan J, Sun S, Chen Y, Xu C, Chen Q, Li M, et al. MiR-3130-5p is an intermediate modulator of 2q33 and influences the invasiveness of lung adenocarcinoma by targeting NDUFS1. Cancer Med. 2021;10(11):3700–14. doi: 10.1002/cam4.3885
  • 10. Sakaue S, Kanai M, Tanigawa Y, Karjalainen J, Kurki M, Koshiba S, et al. A cross-population atlas of genetic associations for 220 human phenotypes. Nat Genet. 2021;53(10):1415–24. doi: 10.1038/s41588-021-00931-x
  • 11. Chen Z, Guo X, Tao R, Huyghe JR, Law PJ, Fernandez-Rozadilla C, et al. Fine-mapping analysis including over 254,000 East Asian and European descendants identifies 136 putative colorectal cancer susceptibility genes. Nat Commun. 2024;15:3557.
  • 12. Burgess S, Thompson SG, CRP CHD Genetics Collaboration. Avoiding bias from weak instruments in Mendelian randomization studies. Int J Epidemiol. 2011;40(3):755–64.
  • 13. Shen W, Song Z, Zhong X, Huang M, Shen D, Gao P, et al. Sangerbox: a comprehensive, interaction-friendly clinical bioinformatics analysis platform. Imeta. 2022;1(3):e36. doi: 10.1002/imt2.36
  • 14. Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43(7):e47. doi: 10.1093/nar/gkv007
  • 15. Kaisrlikova M, Kundrat D, Koralkova P, Trsova I, Lenertova Z, Votavova H, et al. Attenuated cell cycle and DNA damage response transcriptome signatures and overrepresented cell adhesion processes imply accelerated progression in patients with lower-risk myelodysplastic neoplasms. Int J Cancer. 2024;154(9):1652–68. doi: 10.1002/ijc.34834
  • 16. Győrffy B. Integrated analysis of public datasets for the discovery and validation of survival-associated genes in solid tumors. Innovation (Camb). 2024;5(3):100625. doi: 10.1016/j.xinn.2024.100625
  • 17. Kuleshov MV, Jones MR, Rouillard AD, Fernandez NF, Duan Q, Wang Z, et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 2016;44(W1):W90-7. doi: 10.1093/nar/gkw377
  • 18. Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, Mehryary F, Hachilif R, et al. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023;51(D1):D638–46. doi: 10.1093/nar/gkac1000
  • 19. Gilson MK, Liu T, Baitaluk M, Nicola G, Hwang L, Chong J. BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res. 2016;44(D1):D1045-53. doi: 10.1093/nar/gkv1072
  • 20. Roskoski R Jr. Rule of five violations among the FDA-approved small molecule protein kinase inhibitors. Pharmacol Res. 2023;191:106774. doi: 10.1016/j.phrs.2023.106774
  • 21. Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem. 2010;31(2):455–61. doi: 10.1002/jcc.21334
  • 22. Meng Q, Wang J, Jiang B, Zhang X, Xu J, Cao Y, et al. SiRNA-based delivery nanoplatform attenuates the CRC progression via HIF1α-AS2. Nano Today. 2022;47:101667. doi: 10.1016/j.nantod.2022.101667
  • 23. Eissa SI. Synthesis, characterization and biological evaluation of some new indomethacin analogs with a colon tumor cell growth inhibitory activity. Med Chem Res. 2017;26(9):2205–20. doi: 10.1007/s00044-017-1932-8
  • 24. An N, Sun Y, Ma LG. Helveticoside exhibited p53-dependent anticancer activity against colorectal cancer. Arch Med Res. 2020;51(3):224–32.
  • 25. Guo YC, Chang CM, Hsu WL. Indomethacin inhibits cancer cell migration via attenuation of cellular calcium mobilization. Molecules. 2013;18:6584–96.
  • 26. Thomson SA, Banker P, Bickett DM, Boucheron JA, Carter HL, Clancy DC, et al. Anthranilimide based glycogen phosphorylase inhibitors for the treatment of type 2 diabetes. Part 3: X-ray crystallographic characterization, core and urea optimization and in vivo efficacy. Bioorg Med Chem Lett. 2009;19(4):1177–82. doi: 10.1016/j.bmcl.2008.12.085
  • 27. Massagué J, Blain SW, Lo RS. TGFbeta signaling in growth control, cancer, and heritable disorders. Cell. 2000;103(2):295–309. doi: 10.1016/s0092-8674(00)00121-5
  • 28. Itóh S, Landström M, Hermansson A, Itoh F, Heldin CH, Heldin NE, et al. Transforming growth factor beta1 induces nuclear export of inhibitory Smad7. J Biol Chem. 1998;273(44):29195–201. doi: 10.1074/jbc.273.44.29195
  • 29. Slattery ML, Folsom AR, Wolff R, Herrick J, Caan BJ, Potter JD. Transcription factor 7-like 2 polymorphism and colon cancer. Cancer Epidemiol Biomarkers Prev. 2008;17(4):978–82. doi: 10.1158/1055-9965.EPI-07-2687
  • 30. Cao T, Wang J. PYGL regulation of glycolysis and apoptosis in glioma cells under hypoxic conditions via HIF1α-dependent mechanisms. Transl Cancer Res. 2024;13(10):5627–48. doi: 10.21037/tcr-24-1974
  • 31. Ji Q, Li H, Cai Z, Yuan X, Pu X, Huang Y, et al. PYGL-mediated glucose metabolism reprogramming promotes EMT phenotype and metastasis of pancreatic cancer. Int J Biol Sci. 2023;19(6):1894–909. doi: 10.7150/ijbs.76756
  • 32. Ke R, Lv L, Zhang S, Zhang F, Jiang Y. Functional mechanism and clinical implications of MicroRNA-423 in human cancers. Cancer Med. 2020;9(23):9036–51. doi: 10.1002/cam4.3557
  • 33. Douglas H. Hallmarks of cancer: new dimensions. Cancer Discov. 2022;12(1):31–46.
  • 34. Morelli E, Gullà A, Amodio N, Taiana E, Neri A, Fulciniti M, et al. CRISPR interference (CRISPRi) and CRISPR activation (CRISPRa) to explore the oncogenic lncRNA network. Methods Mol Biol. 2021;2348:189–204. doi: 10.1007/978-1-0716-1581-2_13
  • 35. Holtzman L, Gersbach CA. Editing the epigenome: reshaping the genomic landscape. Annu Rev Genomics Hum Genet. 2018;19:43–71. doi: 10.1146/annurev-genom-083117-021632
  • 36. Kampmann M. CRISPRi and CRISPRa screens in mammalian cells for precision biology and medicine. ACS Chem Biol. 2018;13(2):406–16. doi: 10.1021/acschembio.7b00657
  • 37. Kon E, Ad-El N, Hazan-Halevy I, Stotsky-Oterin L, Peer D. Targeting cancer with mRNA-lipid nanoparticles: key considerations and future prospects. Nat Rev Clin Oncol. 2023;20(11):739–54. doi: 10.1038/s41571-023-00811-9

Decision Letter 0

Zhengrui Li

16 Jun 2025

Integrative GWAS and RNA-Seq Analysis for Target Identification and Virtual Drug Screening in Colorectal Cancer

PLOS ONE

Dear Dr. Liu,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jul 31 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Zhengrui Li

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Thank you for uploading your study's underlying data set. Unfortunately, the repository you have noted in your Data Availability statement does not qualify as an acceptable data repository according to PLOS's standards.

At this time, please upload the minimal data set necessary to replicate your study's findings to a stable, public repository (such as figshare or Dryad) and provide us with the relevant URLs, DOIs, or accession numbers that may be used to access these data. For a list of recommended repositories and additional information on PLOS standards for data deposition, please see https://journals.plos.org/plosone/s/recommended-repositories .


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Partly

Reviewer #2: Partly

Reviewer #3: Partly

Reviewer #4: Yes

Reviewer #5: Partly

Reviewer #6: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: No

Reviewer #6: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: Yes

Reviewer #6: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

Reviewer #4: Yes

Reviewer #5: Yes

Reviewer #6: Yes

**********

Reviewer #1: Summary: This study attempts to find potential treatment targets and drug candidates for colorectal cancer by combining previously published genome-wide association studies (GWAS) with gene expression data (RNA-Seq) and drug screening through computational docking. The authors use existing public data to identify genes commonly linked to colorectal cancer in East Asian populations and suggest a few of them, such as PYGL and SMAD7, as potential targets. They then use molecular docking to screen for drugs that might bind to these targets. The results suggest a list of small molecules that could theoretically be repurposed to treat colorectal cancer. However, all analyses were performed with previously published datasets, and there was no experimental validation.

Strengths:

1. The manuscript is clearly written and well-structured, though excessively long for the novelty it presents.

2. Figures and tables are presented with clarity, though mostly re-confirm well-established findings.

Major Comments:

1. Lack of Novelty in Data or Methodology: The manuscript relies entirely on publicly available GWAS (e.g., bbj-a-76, GCST009869) and RNA-Seq (TCGA-COAD, TCGA-READ) datasets. No new data generation or significant methodological innovation is evident.

2. Absence of Experimental Validation

3. Overstated Claims: The manuscript claims to offer a "theoretical framework for drug repurposing strategies" and makes repeated references to “promising therapeutic targets,” yet all results are entirely in silico, and the novelty is limited.

4. Redundancy With Existing Literature: The identified targets (e.g., SMAD7, TCF7L2, PYGL) have been widely studied in colorectal cancer, and their roles are well-documented. The manuscript does not provide fresh insights into these genes' functions or offer a new angle on their clinical relevance.

Reviewer #2: This manuscript presents results from a workflow that integrates GWAS and RNA-seq analyses to identify risk genes in colorectal cancer (CRC) and conduct targeted virtual drug screening. The authors report on 13 potential risk genes, with PYGL representing the most promising therapeutic target, for which ten small-molecule inhibitors were identified as potential drug candidates for CRC.

The following are my major comments:

(1) A significant portion of the Introduction discusses Mendelian Randomization (MR), a method for investigating the causal relationships between exposures and some outcome, typically disease phenotypes. And further, the analysis of archived GWAS data was done in part using the R package TwoSampleMR, with the aim of minimizing weak instrumental variables, a key step in any MR analysis. And yet, curiously, no MR analysis was actually performed in this study. This needs clarification.

(2) For the GWAS data analysis, the summary-level data from three GWAS datasets were examined. Simply, SNPs with P-values less than 1E-5 and r^2 less than 0.001 within a large 10,000 kb (i.e., 10 Mb) sliding window were retained. Presumably there is a substantial overlap in the genome-wide SNPs tested in these datasets. If so, most of the SNPs are likely to have multiple association P-values (i.e., from each of three GWAS datasets). How was this handled? In the Results, a Manhattan plot is presented (Figure 2), but for only one of the three GWAS datasets (bbj-1-107). Further, the SNPs that passed the filtering thresholds are discussed per dataset, giving the reader the impression that each GWAS dataset was indeed examined separately, with each batch of SNPs simply lumped together to comprise the final list. If so, this is problematic. Typically some sort of meta-analysis is performed to combine evidence across the three datasets (as opposed to disregarding data and cherry-picking the best result per SNP). Common methods include inverse-variance weighted fixed-effects meta-analysis or random-effects meta-analysis (e.g., METAL, GWAMA). This will yield a single P-value and effect estimate per SNP across the three datasets. Please discuss.

(3) From the GWAS analysis, 27 genes "significantly associated with CRC" were identified (NOTE: in Figure 1, this is listed as 28 genes). Of these, 13 genes (including lncRNAs) show significant differential expression in colon cancer tissue (NOTE: gene expression results for rectal cancer (READ) are not provided, despite being analyzed separately). Although each of these 13 genes harbors at least one significant GWAS SNP related to CRC, the relationship of these SNPs to cis gene expression was not examined (i.e., eQTLs). SNPs can have disease risk effects that stem from deleterious changes to an encoded protein impacting function (e.g., nonsynonymous mutations), with *no* effect on gene expression levels. Thus, going from a SNP association with CRC to differential gene expression in CRC without actually examining the relationship between the SNP and gene expression to link the two appears somewhat problematic. Please discuss.

(4) For the enrichment analysis for GO terms and KEGG pathways, it was performed on a list of only 13 genes. Except in cases of exceptional enrichment, this list is too small to yield anything informative. Looking at the top results (Tables 3 and 4), most of the GO terms and pathways involve only a single gene in the list (with a few having two genes). This doesn't provide any real insight. The point of such tests is to find wider patterns that connect various genes in their function and/or biological processes and represent an overrepresentation of such characteristics. Finding a single gene from the 13 genes that is also among the genes involved in, say, "starch and sucrose metabolism" doesn't tell you much, despite an adjusted P-value less than 0.1. This type of gene annotation can be achieved using various databases without resorting to enrichment testing. The authors may want to reconsider how to conduct this step in their study design.

(5) The gene PYGL was selected for "virtual screening via molecular docking analysis". Why this gene was selected from the list of 13 genes is not clear to this reviewer. Please explain.
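
As an illustration of the inverse-variance weighted fixed-effects meta-analysis raised in comment (2), the following is a minimal sketch. It assumes each dataset reports a per-SNP effect estimate (beta) and standard error for the same effect allele on the same scale. The function name and the example values are hypothetical; they are not taken from the manuscript, and this is not the implementation used by METAL or GWAMA.

```python
import math


def ivw_fixed_effects(betas, ses):
    """Combine per-study effect estimates for one SNP with inverse-variance weights.

    betas: per-study effect estimates (e.g., log odds ratios), same allele and scale.
    ses:   the matching standard errors.
    Returns (pooled_beta, pooled_se, z_score, two_sided_p).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and the same length")
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / (se ** 2) for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z_score = pooled_beta / pooled_se
    # Two-sided p-value under a standard normal null distribution.
    p_value = math.erfc(abs(z_score) / math.sqrt(2.0))
    return pooled_beta, pooled_se, z_score, p_value


if __name__ == "__main__":
    # Hypothetical effect estimates for one SNP in three datasets.
    beta, se, z, p = ivw_fixed_effects([0.10, 0.12, 0.08], [0.03, 0.04, 0.05])
    print(f"pooled beta={beta:.4f}, se={se:.4f}, z={z:.2f}, p={p:.3g}")
```

Applied to every SNP shared across datasets, a calculation of this kind gives one pooled estimate and one P-value per SNP instead of several per-dataset values. A fixed-effects model assumes a common true effect across studies; where heterogeneity is expected, a random-effects model would be the alternative the reviewer mentions.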

Reviewer #3: This manuscript integrates genome-wide association study (GWAS) data from East Asian populations with RNA-Seq data from TCGA to identify colorectal cancer (CRC)-related genes, followed by virtual drug screening using molecular docking. The study highlights several candidate targets including PYGL, SMAD7, and TCF7L2, and proposes ten small-molecule compounds with potential therapeutic effects.

While the concept of integrating multi-omics data for drug discovery is promising and timely, the manuscript has several methodological and interpretational weaknesses that need to be addressed before it can be considered for publication.

Major Concerns

Despite the mention of Mendelian Randomization and LD-based SNP filtering, there is no actual demonstration of SNP-gene expression links through eQTL analysis. Without this, the integration between GWAS and RNA-Seq remains superficial. eQTL or TWAS analyses are essential to strengthen the causal inference between SNPs and gene expression in CRC.

Key genes (e.g., PYGL, SMAD7, TCF7L2) are listed, but their roles in CRC progression are not deeply discussed. More mechanistic insight or network analysis is needed.

The manuscript requires significant language editing for clarity and scientific tone. Several sections are repetitive or loosely organized.

Moreover, the manuscript lacks a clear and concise conclusion that synthesizes the key findings and their biological or clinical implications. A strong closing section is needed to articulate the overall contribution of the study.

Minor Issues

The GWAS results need a Q-Q plot.

The table formatting is not consistent.

Limitations should be stated more explicitly.

Reviewer #4: In this paper, the authors systematically identified key genes and potential drug targets associated with colorectal cancer in East Asian populations by integrating GWAS and RNA-Seq data, combined with molecular docking techniques. The authors also performed differential expression analysis on RNA-Seq data from TCGA to refine the findings. Overall, the paper is well-written. Below are my comments on the paper.

• In 1.2 Data Acquisition, the authors mentioned that “Datasets were filtered based on case-control status and sample size, with smaller sample sizes excluded to ensure statistical robustness.” Could the authors be more specific about how they determine whether the sample size of a dataset is small?

• It would be better if the authors could briefly introduce the three GWAS datasets used. Is there any population substructure in each of the datasets?

• It seems to me that the authors are more interested in identifying CRC-related genes. If this is the case, what about using set-based genetic association analysis (e.g., SKAT by Wu et al. (2011)) to identify disease-related genes directly? Are there any advantages of using SNP-based GWAS compared to set-based GWAS in your study?

Ref: Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011 Jul 15;89(1):82-93. doi: 10.1016/j.ajhg.2011.05.029. Epub 2011 Jul 7. PMID: 21737059; PMCID: PMC3135811.

Reviewer #5: General Comments:

This study attempts an integrative approach combining GWAS and RNA-Seq data with molecular docking to identify therapeutic targets and drug candidates for colorectal cancer. The authors aim to bridge genetic insights with drug discovery, which is a valuable objective. However, the current methodology and presentation have several limitations that significantly impact the robustness and reliability of the findings. The conclusions drawn appear to be overstated given the foundational weaknesses.

Specific Comments:

1. The study uses three separate GWAS datasets. Given this, a formal meta-analysis of the GWAS data would significantly increase statistical power and the robustness of identified SNPs and genes. Alternatively, only retaining consistently replicated significant findings across at least two datasets could enhance accuracy and reduce false positives.

2. The significance threshold for SNP selection (p<1e−5) is relatively loose for GWAS, especially without stringent meta-analysis or replication across multiple independent cohorts. This increases the risk of including false-positive associations. Please justify this threshold more thoroughly or consider a stricter criterion.

3. The inclusion criteria for GWAS datasets mentions filtering based on "smaller sample sizes excluded to ensure statistical robustness". However, specific thresholds for exclusion are not provided. Please clarify what constitutes a "smaller sample size" in this context.

4. The linkage disequilibrium (LD) filtering criteria (r2<0.001 and a window size of 10,000 kb) seem unusually broad, potentially leading to very few independent SNPs or a lack of fine-mapping resolution. Please provide a more detailed justification for these specific parameters and their implications for instrumental variable selection.

5. The methods state that "genes and samples with more than 50% missing (NA) values were removed." This is a very lenient threshold, and keeping data with nearly 50% missingness, especially when followed by imputation, raises concerns about data quality and potential biases introduced by the imputation process. Please justify why such a high percentage of missing values was tolerated and discuss its potential impact.

6. The fold-change threshold of 2.0 (equivalent to |logFC|>1.0) for differential expression is not quite strict. Consider if the current stringency is appropriate for the downstream analyses.

7. Colorectal cancer incidence and mortality rates are noted to differ between males and females. A sex-stratified analysis could reveal important biological differences and sex-specific therapeutic targets. Please consider performing such an analysis or discussing its implications as a limitation and future direction.

8. The results show very few genes being shared across the three GWAS datasets. This lack of strong replication across datasets is a concern. The authors should explicitly discuss potential reasons for this limited overlap (e.g., population-specific effects, heterogeneity in study design, statistical power differences, or lack of truly robust signals) and its implications for the generalizability of the findings.

9. The GO and KEGG enrichment analyses appear weak, with many terms including only one or two genes. This makes it difficult to draw strong biological conclusions. The background gene set used for the enrichment analysis should be clearly indicated. Were all genes in the human genome used as background, or a more relevant subset (e.g., all genes expressed in colon/rectal tissue)? The adjusted p-values for many terms in Table 3 and Table 4 are relatively high, suggesting marginal significance despite being "top 10." The conclusions drawn from these enrichment results should be very carefully tempered.

10. Please clarify why adjusted p-values were not used for the online drug screening results. Unadjusted p-values in such a large-scale screening can lead to a high number of false positives.

11. The gene PYGL is highlighted as a key finding, but no validation is presented using external data or wet-lab experiments.

12. The most significant limitation of this study is the complete lack of experimental validation for the identified therapeutic targets and the candidate small-molecule inhibitors. The findings are entirely computational, and without in vitro or in vivo validation, their therapeutic potential remains speculative. This should be explicitly stated as a major limitation in the discussion, with a clear commitment to future experimental work. The authors mention "the incorporation of extensive positive and control datasets enhances the robustness and reliability of the findings" in the discussion of limitations. While this is a good practice for computational studies, it does not substitute for independent experimental validation. This statement should be rephrased to avoid overstating the current study's robustness in a biological context.

13. The supplementary table file names do not consistently match the table numbers inside the files.
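
Comment 10 concerns the use of unadjusted p-values in the drug screening step. One common correction is the Benjamini-Hochberg false discovery rate procedure, sketched below under the assumption that the screening output is a plain list of p-values. The function name and the example values are hypothetical, and which adjustment, if any, the authors or the screening tool apply is not stated here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical raw p-values for four drug-gene set terms.
    raw = [0.01, 0.04, 0.03, 0.005]
    for p_raw, p_adj in zip(raw, benjamini_hochberg(raw)):
        print(f"raw={p_raw:.3f} adjusted={p_adj:.3f}")
```

Terms would then be reported only if the adjusted value falls below the chosen false discovery rate, which limits the expected share of false positives among the reported hits.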

Reviewer #6: To the Author,

Thanks for this valuable work.

It is noteworthy that this article may lead to the application of the suggested targeted therapies in experimental or clinical trials, after they have been subjected to validation as well as further laboratory evaluation and adjustment. Furthermore, another limitation may be the restriction of the study to the Asian population, which may not align with the genetic predispositions in other populations.

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Reviewer #4: No

Reviewer #5: No

Reviewer #6: Yes: Luma Hassan Alwan Al Obaidy

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: TO THE EDITORS.pdf

pone.0333179.s003.pdf (103.3KB, pdf)
PLoS One. 2025 Oct 27;20(10):e0333179. doi: 10.1371/journal.pone.0333179.r002

Author response to Decision Letter 1


11 Jul 2025

Reviewer #5 provided several insightful comments on the manuscript, all of which were highly professional and offered valuable guidance for revision. These suggestions significantly improved the logical flow and academic rigor of the paper. We sincerely appreciate Reviewer #5’s constructive feedback.

Attachment

Submitted filename: Response to Reviewers.docx

pone.0333179.s005.docx (264.2KB, docx)

Decision Letter 1

Zhengrui Li

6 Aug 2025

Dear Dr. Liu,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Sep 20 2025 11:59PM. If you need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Zhengrui Li

Academic Editor

PLOS ONE

Journal Requirements:

If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise. 


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #1: (No Response)

Reviewer #3: All comments have been addressed

Reviewer #4: All comments have been addressed

Reviewer #5: (No Response)

Reviewer #6: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Partly

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: No

Reviewer #6: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: No

Reviewer #6: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?


Reviewer #1: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: Yes

Reviewer #6: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #3: Yes

Reviewer #4: Yes

Reviewer #5: Yes

Reviewer #6: Yes

**********

Reviewer #1: The prior comments still hold true regarding the lack of novelty in data or methodology for the analysis and results presented in the study.

Reviewer #3: This manuscript presents a promising integrative in silico approach combining GWAS, RNA-Seq, and molecular docking to identify therapeutic targets for colorectal cancer. The authors have addressed many of the reviewers' previous concerns, including data selection, method transparency, and overstatement of claims. However, several key issues remain unresolved:

The lack of direct SNP-expression linkage (e.g., eQTL or TWAS) weakens the connection between genetic association and expression.

Insufficient explanation for selecting PYGL for molecular docking over other candidates (e.g., SMAD7, TCF7L2).

Paper formatting and language still need to be polished.

A QQ plot is still wanted to validate the SNP associations.

Reviewer #4: The authors have successfully addressed all my previous comments and I don't have additional comments.

Reviewer #5: General Comments:

The authors of this manuscript present a computational workflow to identify potential therapeutic targets and drug candidates for colorectal cancer (CRC) by integrating data from a Genome-Wide Association Study (GWAS) with RNA-sequencing (RNA-Seq) data. The study culminates in a virtual drug screening of a selected target, PYGL. While the overall objective of integrating multi-omics data for drug discovery is a valuable endeavor, the current manuscript suffers from fundamental methodological weaknesses, lack of transparency, and a disconnect between the claims and the supporting evidence. The revisions made in response to previous reviewer comments do not adequately address these core issues, and in some cases, introduce new inconsistencies. For these reasons, the manuscript is not suitable for publication in its current form.

Specific Comments:

1. The manuscript initially mentions using three GWAS datasets but the revised version states that only one dataset, GCST90018808, was used. This change, supposedly in response to requests for replication and meta-analysis, is not a sound scientific solution. Removing datasets without clear, predefined criteria and without acknowledging the loss of statistical power and generalizability is problematic. The authors' response to Reviewers simply states that the single dataset was selected without explaining the rationale for discarding the others. A robust study would perform a meta-analysis to combine evidence from multiple cohorts or, at a minimum, demonstrate consistent findings (replication) across them. The arbitrary selection of a single dataset makes the findings highly susceptible to a single study's biases and reduces the ability to generalize results.

The authors claim to have filtered datasets based on sample size but only provide a non-specific cutoff of ncase < 5,000 as an example. The specific criteria used to select GCST90018808 over the others and what constituted a "small sample size" for the excluded datasets remain unclear in the text. This lack of transparency makes the data selection process appear arbitrary.

2. The authors mention the study population for the GWAS data but fail to provide the same demographic information for the RNA-Seq data, creating a data-context gap. This information is essential for interpreting the results and discussing generalizability.

A central claim of the study is the integration of GWAS and RNA-Seq data. However, the manuscript fails to explicitly address whether the populations in the GWAS and RNA-Seq datasets are matched. The GWAS data is from European and East Asian populations, while the RNA-Seq data is from the TCGA-COAD cohort. The TCGA is a US-based consortium, and its racial and ethnic composition may not align with the East Asian population that the authors repeatedly emphasize as a focus in their introduction and discussion. A failure to match populations can introduce significant confounding due to population stratification, undermining the validity of linking genetic variants to gene expression.

The authors mention using the TwoSampleMR R package and filtering for weak instrumental variables, but they explicitly state that they did not perform a Mendelian Randomization (MR) analysis. Furthermore, they fail to perform an eQTL analysis to directly link the identified SNPs to gene expression changes, which is a critical missing piece of the claimed "integrative analysis". Without this crucial link, the association between the selected genes and CRC is not robustly established, and the integration of GWAS and RNA-Seq data remains superficial and unconvincing.

3. The manuscript contains contradictory statements regarding data preprocessing. For RNA-Seq data, the method mentions filtering out genes and samples with more than 1% missing (NA) values, but the results section mentions excluding genes with more than 50% missing values. This is a significant discrepancy that raises serious questions about the rigor and reproducibility of the methodology.

The Abstract is poorly structured. It mixes methodological steps with results, making it difficult to follow the logical flow of the study. For instance, it lists specific numbers of identified mRNAs and lncRNAs in the "Method" section, which should be in the "Result" section. Similarly, the Conclusion section provides new examples of genes (CDKN2B, BOC, and METRNL) that are listed as potential therapeutic targets, which should be presented and discussed in the main result.

The manuscript lacks a concise and impactful conclusion. The concluding paragraphs are repetitive and contain general statements about the value of computational methods, rather than synthesizing the specific and unique contributions of this study.

4. The authors acknowledge the lack of experimental validation but their responses and the manuscript's tone do not sufficiently temper the "overstated claims" of identifying "promising therapeutic targets". The claim that "the incorporation of extensive positive and control datasets enhances the robustness and reliability of the findings" is misleading, as this does not substitute for independent biological validation and is not adequately supported by the current work.

While survival analysis is a valuable component of the study, the authors do not specify if they performed a multiple testing correction for the survival analysis of the 18 screened genes. Given the number of tests performed (at least 18 genes x 3 survival metrics = 54 tests), uncorrected p-values (p < 0.05) are highly susceptible to false positives. This should be clearly stated and, if not performed, acknowledged as a limitation.

The GO and KEGG enrichment analyses are based on a very small gene list and, as the authors acknowledge in their response to Reviewer, often include only one or two genes per term. This provides minimal insight and weakens the biological interpretation of the findings. The conclusion that these analyses are meaningful and support the findings is not well-justified.

5. The GWAS Manhattan plot (Fig 2) shows a significance threshold of p<1e−6, but the text also mentions a genome-wide significance threshold of p>5e−8, which is typically a much stricter standard. This inconsistency needs clarification.

Reviewer #6: The study attempts to find potent therapeutic targets in colorectal cancer by analyzing the genomic data of European and East Asian populations. The researchers suggested the PYGL gene as a potential therapeutic target, as it has interacting motifs with the suggested therapeutic small molecules. All work was performed by a computational prediction application, which may facilitate finding targets and therapies for cancer and reduce the time of experimentation. The major limitation of this work is the lack of validation and experimentation, as the researchers acknowledge in their work.

**********

If the peer review history is published, it will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy

Reviewer #1: No

Reviewer #3: No

Reviewer #4: No

Reviewer #5: No

Reviewer #6: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2025 Oct 27;20(10):e0333179. doi: 10.1371/journal.pone.0333179.r004

Author response to Decision Letter 2


15 Aug 2025

Reviewer #1: The prior comments still hold true regarding the lack of novelty in data or methodology for the analysis and results presented in the study.

Response: Thank you for your time and expertise.

Reviewer #3: This manuscript presents a promising integrative in silico approach combining GWAS, RNA-Seq, and molecular docking to identify therapeutic targets for colorectal cancer. The authors have addressed many of the reviewers' previous concerns, including data selection, method transparency, and overstatement of claims. However, several key issues remain unresolved:

The lack of direct SNP-expression linkage (e.g., eQTL or TWAS) weakens the connection between genetic association and expression.

Insufficient explanation for selecting PYGL for molecular docking over other candidates (e.g., SMAD7, TCF7L2).

Paper formatting and language still need to be polished.

A QQ plot is still wanted to validate the SNP associations.

Response: Thank you for your time and expertise. The manuscript formatting and language have been thoroughly edited. A QQ plot has been added to the Results section.

Section 1.8 outlines the gene screening methodology.

“Structural data were obtained from the AlphaFold Protein Structure Database and the Protein Data Bank (PDB), with priority given to PDB entries. Selection criteria included the scientific name of the source organism (Homo sapiens), experimental method (X-ray diffraction or electron microscopy), and a refinement resolution of ≤2.5 Å.

A total of 17 mRNAs were identified through screening, among which only 3 possessed PDB IDs and contained small-molecule ligands. Based on the properties and size of the active sites, PYGL was ultimately selected for further investigation.

Table 1. Structural data of the 17 mRNA-encoded proteins

No. | Gene symbol | RNA type | PDB ID | Ligand ID | AlphaFold
1 | CDKN2B | mRNA | | | P42772
2 | BOC | mRNA | 3N1G | / |
3 | SCG5 | mRNA | | | P05408
4 | LRP1 | mRNA | 1D2L | / |
5 | ZCWPW2 | mRNA | 4Z0R | EDO: C2H6O2 |
6 | PYGL | mRNA | 3DDS | 26B: C29H35N3O5 |
7 | PREX1 | mRNA | 6VSK | / |
8 | SMAD7 | mRNA | 7CD1 | / |
9 | SLCO2A1 | mRNA | | | Q92959
10 | METRNL | mRNA | | | Q641Q3
11 | POU5F1B | mRNA | | | Q06416
12 | ABHD12B | mRNA | | | Q7Z5M8
13 | TCF7L2 | mRNA | 1JDH | / |
14 | CDH3 | mRNA | 5JYL | GOL: C3H8O3 |
15 | SOX6 | mRNA | | | P35712
16 | NKX2-3 | mRNA | | | Q8TAU0
17 | TMEM238L | mRNA | | | A6NJY4

Reviewer #4: The authors have successfully addressed all my previous comments and I don't have additional comments.

Response: Thank you for your time and expertise.

Reviewer #5: General Comments:

1. The manuscript initially mentions using three GWAS datasets but the revised version states that only one dataset, GCST90018808, was used. This change, supposedly in response to requests for replication and meta-analysis, is not a sound scientific solution. Removing datasets without clear, predefined criteria and without acknowledging the loss of statistical power and generalizability is problematic. The authors' response to Reviewers simply states that the single dataset was selected without explaining the rationale for discarding the others. A robust study would perform a meta-analysis to combine evidence from multiple cohorts or, at a minimum, demonstrate consistent findings (replication) across them. The arbitrary selection of a single dataset makes the findings highly susceptible to a single study's biases and reduces the ability to generalize results.

The authors claim to have filtered datasets based on sample size but only provide a non-specific cutoff of ncase < 5,000 as an example. The specific criteria used to select GCST90018808 over the others and what constituted a "small sample size" for the excluded datasets remain unclear in the text. This lack of transparency makes the data selection process appear arbitrary.

Response: Thank you for your time and expertise.

2. The authors mention the study population for the GWAS data but fail to provide the same demographic information for the RNA-Seq data, creating a data-context gap. This information is essential for interpreting the results and discussing generalizability.

A central claim of the study is the integration of GWAS and RNA-Seq data. However, the manuscript fails to explicitly address whether the populations in the GWAS and RNA-Seq datasets are matched. The GWAS data is from European and East Asian populations, while the RNA-Seq data is from the TCGA-COAD cohort. The TCGA is a US-based consortium, and its racial and ethnic composition may not align with the East Asian population that the authors repeatedly emphasize as a focus in their introduction and discussion. A failure to match populations can introduce significant confounding due to population stratification, undermining the validity of linking genetic variants to gene expression.

The authors mention using the TwoSampleMR R package and filtering for weak instrumental variables, but they explicitly state that they did not perform a Mendelian Randomization (MR) analysis. Furthermore, they fail to perform an eQTL analysis to directly link the identified SNPs to gene expression changes, which is a critical missing piece of the claimed "integrative analysis". Without this crucial link, the association between the selected genes and CRC is not robustly established, and the integration of GWAS and RNA-Seq data remains superficial and unconvincing.

Response: Thank you for your time and expertise.

3. The manuscript contains contradictory statements regarding data preprocessing. For RNA-Seq data, the method mentions filtering out genes and samples with more than 1% missing (NA) values, but the results section mentions excluding genes with more than 50% missing values. This is a significant discrepancy that raises serious questions about the rigor and reproducibility of the methodology.

The Abstract is poorly structured. It mixes methodological steps with results, making it difficult to follow the logical flow of the study. For instance, it lists specific numbers of identified mRNAs and lncRNAs in the "Method" section, which should be in the "Result" section. Similarly, the Conclusion section provides new examples of genes (CDKN2B, BOC, and METRNL) that are listed as potential therapeutic targets, which should be presented and discussed in the main result.

The manuscript lacks a concise and impactful conclusion. The concluding paragraphs are repetitive and contain general statements about the value of computational methods, rather than synthesizing the specific and unique contributions of this study.

Response: Thank you for your time and expertise.

1% missing (NA) values: The missing value threshold stated as 1% in the original text should be corrected to 50%, which was applied as a filtering criterion during data normalization using log2(X+1) transformation.

50% missing values: For the Limma analysis, samples with zero expression values in >50% of genes were removed.

Both the Abstract and Conclusion sections have been revised.
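The two 50% thresholds described in this response can be illustrated with a minimal Python sketch. It is not the authors' R code: the function names, the use of `None` for missing values, and the order of the two steps are assumptions made only to show how a gene-level missing-value filter with a log2(X+1) transform differs from a sample-level zero-expression filter.

```python
import math

def filter_and_transform(expr, max_missing_frac=0.5):
    """Drop genes with too many missing values, then apply log2(x + 1).

    expr maps a gene name to a list of counts; None marks a missing value.
    (Hypothetical data layout, chosen for illustration only.)
    """
    kept = {}
    for gene, values in expr.items():
        missing = sum(v is None for v in values)
        if missing / len(values) > max_missing_frac:
            continue  # gene exceeds the missing-value threshold
        kept[gene] = [None if v is None else math.log2(v + 1) for v in values]
    return kept

def drop_sparse_samples(expr, max_zero_frac=0.5):
    """Remove samples (columns) with zero expression in too many genes."""
    if not expr:
        return {}
    genes = list(expr)
    n_samples = len(expr[genes[0]])
    keep = [
        j for j in range(n_samples)
        if sum(expr[g][j] == 0 for g in genes) / len(genes) <= max_zero_frac
    ]
    return {g: [expr[g][j] for j in keep] for g in genes}

# Toy counts, invented for this example.
toy = {"A": [0, 3, None, 7], "B": [None, None, None, 1], "C": [0, 1, 0, 15]}
normalized = filter_and_transform(toy)
filtered = drop_sparse_samples(normalized)
```

In this toy input, gene B is dropped by the gene-level filter and the first sample is dropped by the sample-level filter, which shows that the two thresholds act on different axes of the expression matrix.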

4. The authors acknowledge the lack of experimental validation but their responses and the manuscript's tone do not sufficiently temper the "overstated claims" of identifying "promising therapeutic targets". The claim that "the incorporation of extensive positive and control datasets enhances the robustness and reliability of the findings" is misleading, as this does not substitute for independent biological validation and is not adequately supported by the current work.

While survival analysis is a valuable component of the study, the authors do not specify if they performed a multiple testing correction for the survival analysis of the 18 screened genes. Given the number of tests performed (at least 18 genes x 3 survival metrics = 54 tests), uncorrected p-values (p < 0.05) are highly susceptible to false positives. This should be clearly stated and, if not performed, acknowledged as a limitation.

The GO and KEGG enrichment analyses are based on a very small gene list and, as the authors acknowledge in their response to Reviewer, often include only one or two genes per term. This provides minimal insight and weakens the biological interpretation of the findings. The conclusion that these analyses are meaningful and support the findings is not well-justified.

Response: The following statement has been deleted as requested: “the incorporation of extensive positive and control datasets enhances the robustness and reliability of the findings”.

The following content has been added in Section 1.5: “The generated p value does not include correction for multiple hypothesis testing,”
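To make the scale of the reviewer's multiple-testing concern concrete, the arithmetic for 18 genes and 3 survival endpoints can be sketched in Python. The Bonferroni threshold and the Benjamini-Hochberg procedure shown here are standard corrections that were not applied in the manuscript, and the p-values in the example are invented for illustration rather than taken from the study.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that controls the family-wise error rate."""
    return alpha / n_tests

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where BH rejects at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected

n_tests = 18 * 3                              # genes x survival endpoints
expected_false_positives = n_tests * 0.05     # about 2.7 if every null is true
per_test_alpha = bonferroni_threshold(0.05, n_tests)  # about 9.26e-4

# Invented p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
example_rejected = benjamini_hochberg(example_p)
```

With 54 uncorrected tests at p < 0.05, between two and three nominally significant results would be expected by chance alone if no gene were truly associated with survival, which is why the reviewer asks for the lack of correction to be stated as a limitation.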

5. The GWAS Manhattan plot (Fig 2) shows a significance threshold of p<1e-6, but the text also mentions a genome-wide significance threshold of p>5e−8, which is typically a much stricter standard. This inconsistency needs clarification.

Response: While the conventional genome-wide significance threshold is typically set at p < 5e-8, we considered single nucleotide polymorphisms (SNPs) with p < 1e-6 as potentially significant in our study. Using this less stringent threshold (p < 1e-6) for initial screening, we identified and validated several SNPs. Subsequent analysis revealed three genes, CDKN2B, BOC, and METRNL, showing significant associations, with p-values ranging between 5e-8 and 1e-6.
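The relationship between the two thresholds in this response can be shown with a short Python sketch. The tier labels and the example p-values are assumptions introduced for illustration; only the cutoffs 5e-8 and 1e-6 come from the response. On a Manhattan plot these cutoffs correspond to -log10(p) of roughly 7.3 and exactly 6.

```python
import math

GENOME_WIDE = 5e-8   # conventional genome-wide significance cutoff
SCREENING = 1e-6     # less stringent screening cutoff used in the response

def classify_snp(p):
    """Assign a p-value to a tier (labels are illustrative, not the authors')."""
    if p < GENOME_WIDE:
        return "genome-wide significant"
    if p < SCREENING:
        return "screening threshold only"
    return "not significant"

def manhattan_height(p):
    """Height of a SNP on a Manhattan plot, i.e. -log10(p)."""
    return -math.log10(p)

# Hypothetical p-values, not taken from the GCST90018808 dataset.
tiers = [classify_snp(p) for p in (3e-9, 2e-7, 4e-5)]
```

A SNP with a p-value between 5e-8 and 1e-6, as reported for the three genes named in the response, falls in the middle tier: it passes the screening cutoff but not the conventional genome-wide standard.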

Reviewer #6: The study attempts to find potent therapeutic targets in colorectal cancer by analyzing the genomic data of European and East Asian populations. The researchers suggested the PYGL gene as a potential therapeutic target, as it has interacting motifs with the suggested therapeutic small molecules. All work was performed by a computational prediction application, which may facilitate finding targets and therapies for cancer and reduce the time of experimentation. The major limitation of this work is the lack of validation and experimentation, as the researchers acknowledge in their work.

Response: Thank you for your time and expertise.

Attachment

Submitted filename: Response to Reviewers 20250815.docx

pone.0333179.s006.docx (26.1KB, docx)

Decision Letter 2

Zhengrui Li

10 Sep 2025

Integrative GWAS and RNA-Seq Analysis for Target Identification and Virtual Drug Screening in Colorectal Cancer

PONE-D-25-19455R2

Dear Dr. Liu,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Zhengrui Li

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewer #3:

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #3: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #3: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?


Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #3: Yes

**********

Reviewer #3: The authors have made some revisions in response to the reviewers’ comments; however, many important points remain insufficiently addressed. In particular, the QQ plot raises concerns about potential inflation in the GWAS results, which should have been discussed in greater depth. Overall, the manuscript does not contain major fatal flaws, but it also lacks clear novelty or significant contributions that would advance the field.
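The inflation concern raised from the QQ plot is commonly summarized by the genomic inflation factor, lambda. The Python sketch below shows one standard way to compute it from a list of GWAS p-values; it is an illustration with synthetic inputs, not an analysis of the study's data, and the function name is an assumption.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
MEDIAN_CHI2_1DF = 0.4549364

def genomic_inflation(p_values):
    """Genomic inflation factor from two-sided p-values in (0, 1]."""
    nd = NormalDist()
    # Convert each p-value to a 1-df chi-square statistic via the normal
    # quantile; using the lower tail avoids rounding 1 - p/2 up to 1.0.
    chi2 = [nd.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / MEDIAN_CHI2_1DF

# Synthetic inputs: evenly spaced p-values mimic a well-calibrated null,
# and squaring them mimics a systematically inflated test statistic.
n = 1001
null_like = [(i - 0.5) / n for i in range(1, n + 1)]
inflated = [p ** 2 for p in null_like]

lambda_null = genomic_inflation(null_like)
lambda_inflated = genomic_inflation(inflated)
```

A value of lambda close to 1 indicates that the bulk of the test statistics follows the null distribution, while values well above 1 suggest inflation, for example from population stratification, and would support the reviewer's request for further discussion.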

**********

If the peer review history is published, it will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy

Reviewer #3: No

**********

Acceptance letter

Zhengrui Li

PONE-D-25-19455R2

PLOS ONE

Dear Dr. Liu,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Zhengrui Li

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Phenotype-associated SNPs and genes identified in the GCST90018808 dataset.

    (XLSX)

    pone.0333179.s001.xlsx (31.3KB, xlsx)
    S2 Table. Binding energies and IC50 values of compounds derived from BindingDB that target the PYGL protein.

    The R scripts used in the study are available via the following DOI: 10.5281/zenodo.15803098, v1.0.2.

    (XLSX)

    pone.0333179.s002.xlsx (42.3KB, xlsx)
    Attachment

    Submitted filename: TO THE EDITORS.pdf

    pone.0333179.s003.pdf (103.3KB, pdf)
    Attachment

    Submitted filename: Response to Reviewers.docx

    pone.0333179.s005.docx (264.2KB, docx)
    Attachment

    Submitted filename: Response to Reviewers 20250815.docx

    pone.0333179.s006.docx (26.1KB, docx)

    Data Availability Statement

    The gene expression data utilized in this study were obtained from the Genomic Data Commons (GDC) Data Portal (https://portal.gdc.cancer.gov/). Specifically, open-access RNA-sequencing (RNA-seq) gene expression counts from the TCGA-COAD project were downloaded.


    Articles from PLOS One are provided here courtesy of PLOS
