Nature Communications. 2026 Jan 15;17:1377. doi: 10.1038/s41467-025-68127-z

Mixed-model and transcriptome-wide association analyses identify transcription factors and genes associated with colorectal cancer susceptibility

Zhishan Chen 1,#, Wenqiang Song 1,2,#, Qing Li 1,#, Chao Li 1, Wanqing Wen 1, Jeroen R Huyghe 3, Philip J Law 4, Ceres Fernandez-Rozadilla 5,6, Maria N Timofeeva 7,8, Minta Thomas 3, Stephanie L Schmit 9,10, Vicente Martin 11,12, Matthew Devall 13, Christopher Dampier 13, Ferran Moratalla-Navarro 11,14,15,16, Qiuyin Cai 1, Jifeng Wang 1, Jiajun Shi 1, Sun-Seog Kweon 17,18, Chizu Tanikawa 19, Wei-Hua Jia 20, Xiang Shu 21, Jirong Long 1, Jing Gao 22, Jeongseon Kim 23, Aesun Shin 24, Keitaro Matsuo 25,26, Sun Ha Jee 27, Keum Ji Jung 27, Nan Wang 28, Dong-Hyun Kim 29, Jie Ping 1, Gong Yang 1, Min-Ho Shin 17, Zefang Ren 30, Jae Hwan Oh 31, Isao Oze 25, Yoon-Ok Ahn 24, Yu-Tang Gao 32, Zhi-Zhong Pan 20, Yoichiro Kamatani 33,34, Luc Van Kaer 2, Lan Wu 2, Bingshan Li 35, Koichi Matsuda 36, Xiao-Ou Shu 1, Li Hsu 3, Malcolm G Dunlop 7, Stephen B Gruber 37, Richard Houlston 38, Ian Tomlinson 5,39, Li Li 13, Ken S Lau 40, Victor Moreno 11,14,15,16, Graham Casey 13, Ulrike Peters 3,41, Wei Zheng 1, Xingyi Guo 1,42
PMCID: PMC12877115  PMID: 41540004

Abstract

Susceptibility transcription factors (TFs) whose DNA binding is altered by genetic variants regulating colorectal cancer (CRC) risk genes remain poorly defined. Using generalized linear mixed models, we analyze 218 TF ChIP-Seq datasets alongside GWAS data from 100,204 CRC cases and 154,587 controls of East Asian and European ancestries. We identify 51 TFs and TF-cofactor interactions, including VDR-cofactors, as key regulators of CRC risk. Integrating these TF insights with transcriptome-wide association studies (TWAS), we further evaluate associations of genetically predicted gene expression, alternative splicing, and alternative polyadenylation with CRC risk, using RNA-seq data from 364 Asian-ancestry and 707 European-ancestry individuals. Multi-ancestry TWAS identify 222 risk genes, including 95 novel genes and 48 potentially druggable targets. Single-cell analysis provides additional functional evidence supporting ~45% of these genes, and experimental validation confirms oncogenic roles for RHPN2, IRS2, and TXN. Our findings elucidate key TF–gene regulatory networks and uncover novel CRC risk genes.

Subject terms: Cancer epigenetics, Genome-wide association studies


This study applies generalized linear mixed models (GLMM) and advanced transcriptome-wide association study (TWAS) methods to improve the discovery of colorectal cancer risk transcription factors and genes, including potentially druggable targets.

Introduction

Transcription factors (TFs) play a pivotal role in mediating the effects of genetic risk variants by regulating gene expression in a context-specific manner, thereby influencing disease risk1,2. A better understanding of the regulatory mechanisms by which TFs mediate the effects of these genetic variants is critical for elucidating the molecular basis of complex traits, including cancer. Previous studies have identified susceptibility TFs through enrichment analyses of risk variants identified by genome-wide association studies (GWAS) in TF binding sites profiled by chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq)3–6. However, these approaches often suffer from suboptimal statistical power, as they focus primarily on a narrow range of GWAS-identified risk variants. Recently, we developed a statistical method using a generalized linear mixed model (GLMM) to enhance power by analyzing TF ChIP-seq data in conjunction with GWAS summary statistics of variants across the genome7. This method evaluates associations between variants’ risk (e.g., Chi-squared values from GWAS) and their binding status by a TF, and has identified known susceptibility TFs, such as FOXA1, ESR1 and GATA3 for breast cancer, by showing significantly higher risk for variants occupied by these TFs compared to genome-wide controls7. However, the specific TFs involved in colorectal cancer (CRC) susceptibility remain unclear. Thus, identifying TFs that interact with CRC risk variants could provide insights into TF-mediated gene regulation and enhance our understanding of CRC susceptibility.

GWASs have successfully identified more than 250 risk loci for CRC1,2,8, and integrative analyses of expression quantitative trait loci (eQTL) and these GWAS loci have revealed numerous candidate susceptibility genes9–12. Further transcriptome-wide association studies (TWAS) have linked predicted gene expression levels with disease risk, offering higher statistical power and potentially revealing novel associations overlooked in GWAS13,14. Prior research, including our own, has conducted TWAS in CRC and identified more than 200 putative susceptibility genes8,15,16. However, conventional TWAS methods are prone to false positives due to the inclusion of non-regulatory variants17–19. To address this, we developed the sTF-TWAS approach, which refines gene expression prediction by focusing only on regulatory variants within TF-occupied cis-regulatory elements (STFCREs)7,20. This approach significantly enhances disease gene detection, outperforming conventional TWAS methods20. In addition, our recent alternative polyadenylation (APA)-WAS framework has uncovered new cancer risk genes21, but its application to CRC has yet to be explored. Moreover, TWAS has primarily been conducted in European populations, and CRC studies in East Asian populations are limited by a lack of relevant RNA-seq data from colon tissue.

In this study, we integrate 218 ChIP-seq datasets from CRC-related cells with GWAS data from 100,204 CRC cases and 154,587 controls of both European and East Asian ancestries to identify CRC susceptibility TFs. We further utilize RNA-seq data from 364 Asian-ancestry and 707 European-ancestry individuals to build ancestry-specific models for gene expression, alternative splicing (AS), and APA under the sTF-TWAS framework. Our goal is to identify novel CRC susceptibility genes through multi-ancestry TWAS analyses, providing a deeper understanding of the genetic basis of CRC.

Results

Identification of susceptible CRC-associated TFs and key TF-cofactor interactions

We integrated 218 ChIP-seq datasets for 84 TFs (Supplementary Data 1) with CRC GWAS summary statistics from European and East Asian ancestries to identify CRC susceptibility TFs (Fig. 1A). At Bonferroni-corrected P  <  0.05, we identified 51 TFs associated with CRC susceptibility. Among the top TFs were VDR, MYC, JUN, CDX2, NIPBL, MED12, and ETV5 (Fig. 1B and Supplementary Data 2, “Methods”). We conducted motif analysis for each TF-binding region derived from ChIP-seq data and observed significant enrichment for host motifs, including those for VDR (24.2%), JUN (11.1%), KLF5 (49.7%), and HNF4A (36.4%) (Fig. 1C). We further explored the impact of genetic variation in co-occupied cis-regulatory regions of these 51 TFs and observed a substantial proportion of genetic variants located in co-occupied binding sites (Fig. 1D and Supplementary Data 3). Pair-wise analyses identified 154 significant interactions at Bonferroni-corrected P < 0.05, showing that genetic variants in regions co-occupied by multiple TFs exhibited stronger associations with CRC risk than variants bound by a single TF (Fig. 1D and Supplementary Data 3). Notably, VDR emerged as a key player in these interactions, highlighting its central role in CRC susceptibility (Fig. 1D and Supplementary Data 3). In addition, we applied stratified LD Score Regression (S-LDSC)22 as an alternative approach and found that 40 of the 51 risk TFs identified by our method were also detected by S-LDSC at a Bonferroni-corrected P < 0.05. Most of the remaining 11 TFs showed nominal significance (P < 0.05) (Supplementary Fig. 1A), providing further support for the validity of our findings. We also conducted simulation analyses to compare the statistical power of our GLMM with that of S-LDSC. The GLMM consistently outperformed S-LDSC in detecting causal TFs across varying levels of heritability (Supplementary Fig. 1B). Quantile-quantile (Q-Q) plots demonstrate that Type I error rates are under control for both approaches, with overall comparable performance (Supplementary Fig. 1C–F). These simulation results are consistent with our observations in real data, supporting GLMM as a robust approach for identifying risk TFs in colorectal cancer.
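The core idea of the test, asking whether variants bound by a TF carry higher GWAS Chi-squared values than unbound variants, can be illustrated with a deliberately simplified sketch. It is not the authors' GLMM: it omits the random-effect terms of a mixed model, uses simulated data, and the function name `wald_binding_test` and all parameter values are hypothetical choices made for illustration only.

```python
import math
import random

def wald_binding_test(chi2, bound):
    """Regress per-variant GWAS chi-squared on a 0/1 TF-binding indicator.

    With one binary predictor, the least-squares slope equals the difference
    in group means. Returns (slope, z, two-sided p) from a Wald Z test.
    Simplification: fixed effects only, no random effects.
    """
    x1 = [c for c, b in zip(chi2, bound) if b]
    x0 = [c for c, b in zip(chi2, bound) if not b]
    n1, n0 = len(x1), len(x0)
    m1, m0 = sum(x1) / n1, sum(x0) / n0
    # Pooled residual variance, as in ordinary least squares.
    rss = sum((c - m1) ** 2 for c in x1) + sum((c - m0) ** 2 for c in x0)
    sigma2 = rss / (n1 + n0 - 2)
    se = math.sqrt(sigma2 * (1 / n1 + 1 / n0))
    slope = m1 - m0
    z = slope / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return slope, z, p

# Simulated (not real) data: about 5% of variants are "bound" by the TF.
random.seed(0)
n_variants = 20000
bound = [random.random() < 0.05 for _ in range(n_variants)]
# Unbound variants: chi-squared with 1 df (squared standard normal).
# Bound variants: the same, inflated by an arbitrary factor of 1.5.
chi2 = [(random.gauss(0, 1) ** 2) * (1.5 if b else 1.0) for b in bound]

slope, z, p = wald_binding_test(chi2, bound)
print(f"slope={slope:.3f}, z={z:.2f}, p={p:.2e}")
```

A positive slope with a small P-value corresponds to the situation described above, in which TF-occupied variants show higher risk than genome-wide controls.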

Fig. 1. Associations of TFs with CRC risk using generalized linear mixed models.

A A flow chart illustrating the integrative analysis of ChIP-seq data (n = 218) for 84 TFs and CRC GWAS summary statistics from 100,204 cases and 154,587 controls of European and East Asian ancestry. B A total of 51 identified TFs whose genetic variation in TF-DNA binding was significantly associated with CRC risk. P-values were determined by a two-sided Wald Z test. The dashed line represents a Bonferroni-corrected P  <  0.05. C The host motifs of identified TFs were enriched in their ChIP-seq peaks. D Analysis of co-occupied binding regions of the top 10 CRC risk-associated TFs. Venn diagrams in the upper-right triangle show the number of genetic variants (in thousands) that are occupied by a specific TF or co-occupied by both TFs in each TF pair. Bar plots in the lower-left triangle show the association strengths (regression coefficients) for the genetic variants occupied by both TFs, only the first TF, and only the second TF, from left to right. TF pairs with significant interactions at the Bonferroni threshold of P  <  3.92 × 10−5 (0.05/1,275 TF pairs from 51 TFs) are highlighted in red. P-values were determined by a two-sided Wald Z test.
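The pairwise threshold quoted in the caption follows from the number of unordered pairs among 51 TFs. The short check below reproduces that arithmetic; the variable names are illustrative.

```python
import math

n_tfs = 51
n_pairs = math.comb(n_tfs, 2)      # unordered TF pairs: 51 * 50 / 2
threshold = 0.05 / n_pairs         # Bonferroni-corrected significance level

print(n_pairs, f"{threshold:.2e}")
```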

Associations of genetically predicted gene expression with CRC risk

For the multi-ancestry TWAS under the sTF-TWAS framework, we built gene expression prediction models using RNA-seq data from 364 participants of East Asian ancestry (from the ACCC) and 707 participants of European ancestry (N = 423 from the BarcUVa-seq and N = 284 from the GTEx projects) (Methods)16. A total of 7096 (for ACCC), 14,419 (for BarcUVa-seq), and 11,361 (for GTEx) gene expression prediction models were built with a coefficient of determination R2 > 0.01 (correlation > 0.1) using the elastic net approach (Supplementary Data 4). We then applied these ancestry-specific prediction models to summary statistics of CRC GWAS from 80,774 cases and 105,298 controls of European ancestry and from 23,572 cases and 48,700 controls of East Asian ancestry. At a Bonferroni-corrected P < 0.05, we identified 138 genes whose genetically predicted expression was associated with CRC risk in a meta-analysis of TWAS results based on models built in both European and East Asian populations (Supplementary Fig. 2, “Methods”). In addition, we evaluated the associations separately in European and Asian populations. At the Bonferroni-corrected threshold, we identified two additional genes (SENP8 and PCNT) in the European population, while no additional significant genes were identified in the East Asian population (Supplementary Data 5).
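Applying a prediction model to GWAS summary statistics is commonly done with the summary-based TWAS statistic Z = w'z / sqrt(w'Rw), where w are the SNP weights of the prediction model, z the GWAS z-scores, and R the LD correlation matrix. The sketch below illustrates that general statistic only; whether sTF-TWAS uses exactly this form is described in the Methods rather than here, and the weights, z-scores, and LD values are hypothetical.

```python
import math

def twas_z(weights, gwas_z, ld):
    """Summary-based TWAS statistic: Z = w'z / sqrt(w' R w)."""
    k = len(weights)
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(weights[i] * ld[i][j] * weights[j]
                   for i in range(k) for j in range(k))
    return numerator / math.sqrt(variance)

def two_sided_p(z):
    """Two-sided P-value of a Wald Z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical three-SNP prediction model (illustrative values only).
weights = [0.30, -0.10, 0.20]     # elastic-net style SNP weights
gwas_z = [4.0, -1.0, 3.0]         # GWAS z-scores for the same SNPs
ld = [[1.0, 0.2, 0.1],
      [0.2, 1.0, 0.0],
      [0.1, 0.0, 1.0]]            # LD correlation matrix

z = twas_z(weights, gwas_z, ld)
print(f"TWAS Z = {z:.3f}, P = {two_sided_p(z):.2e}")
```

Dividing by sqrt(w'Rw) accounts for correlation among the SNPs in the model, so that Z is approximately standard normal under the null.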

Together, 140 genes showed an association between their genetically predicted expression and CRC risk (Fig. 2A). Of these, 41 genes (29% of 140) were not previously reported to be associated with CRC risk1,2,8,10,15,16 (Table 1). Of note, seven putative susceptibility genes (PPAN, DERL3, PROX1, TMEM87B, SV2A, FBXL20, and MED1) are located at loci > 2 Mb away from the closest GWAS-identified risk variants (Table 1). To assess whether the identified genes were independent of the established GWAS association signals, we conducted conditional analyses of their associations with CRC risk, adjusting for the closest lead SNP at each locus (“Methods”). Of the 99 previously reported genes located in GWAS-identified risk loci, conditional analysis showed that 39 genes remained statistically significant at a nominal P < 0.05 (Supplementary Data 6).

Fig. 2. Manhattan plots of the association results from TWAS, AS-WAS, and APA-WAS.

A The meta-analysis results for associations among European-ancestry and East Asian-ancestry participants. P-values were calculated from a two-sided Wald Z test conducted by sTF-TWAS. The blue line indicates the Bonferroni threshold: nominal P < 5.8 × 10−6 for TWAS (A), P < 1.95 × 10−6 for AS-WAS (B), and P < 8.25 × 10−6 for APA-WAS (C). Newly identified risk genes from this study are highlighted in green. Four genes showed association signals with P-values exceeding the y-axis scale: POU5F1B (P = 1.09 × 10⁻⁶⁴) in (A); COLCA2 (P = 5.49 × 10⁻⁴⁵) in (B); LAMC1 (P = 9.57 × 10⁻²⁴) and DIP2B (P = 1.25 × 10⁻²²) in (C).

Table 1.

TWAS novel genes presented in novel and known GWAS loci

Locus(a) Gene Z(b) ACAT_P(c) Nearest lead variant Distance (Mb)(d) Cond_ACAT_P(e)
Novel loci
1q21.2 SV2A − 4.98 1.30 × 10−06 rs5028523 22.98 1.40 × 10−06
1q32.3 PROX1 4.75 2.02 × 10−06 rs6691170 7.83 2.75 × 10−06
2q13 TMEM87B − 4.96 7.03 × 10−07 rs11692435 14.40 6.53 × 10−07
17q12 MED1 5.21 1.87 × 10−07 rs1078643 28.60 1.94 × 10−07
19p13.2 PPAN 5.28 2.51 × 10−07 rs10409772 4.265 5.57 × 10−07
22q11.23 DERL3 − 5.72 1.09 × 10−08 rs4616575 5.171 2.71 × 10−09
Known loci
1p32.3 LEXM − 5.43 5.53 × 10−08 rs12143541 0.02 6.22 × 10−01
1q25.3 RP11-181K3.4 9.90 4.32 × 10−23 rs8179460 0.05 4.19 × 10−01
2q33.1 AC019330.1 5.77 7.93 × 10−09 rs6434979 0.00 1.79 × 10−01
2q35 TUBA4A − 4.93 1.64 × 10−06 rs3731861 0.92 2.82 × 10−01
3p21.1 STAB1 5.65 1.64 × 10−08 rs2001732 0.32 2.50 × 10−02
3p22.1 EIF1B − 4.87 1.11 × 10−06 rs35470271 0.56 7.89 × 10−01
3q22.2 KY − 4.93 1.69 × 10−06 rs10049390 0.57 1.23 × 10−05
6p21.31 MAPK13 − 4.90 1.93 × 10−06 rs9470361 0.51 4.00 × 10−03
6p21.33 EHMT2 6.67 2.48 × 10−11 rs2516452 0.33 9.12 × 10−09
6p22.1 TRIM15 5.46 4.76 × 10−08 rs1476570 0.32 1.36 × 10−08
7p13 AC058791.1 6.95 7.43 × 10−12 rs17686932 0.00 4.10 × 10−02
7q32.3 RP11-138A9.1 6.61 3.95 × 10−11 rs17686932 0.01 9.05 × 10−01
8q24.21 TMEM75 7.28 3.43 × 10−13 rs4733767 0.38 1.19 × 10−13
9q31.3 TXN 4.92 1.69 × 10−06 rs10817106 0.64 6.82 × 10−07
11q23.1 SIK2 − 5.39 1.41 × 10−07 rs3087967 0.32 6.80 × 10−01
12q13.12 RP3-405J10.3 − 7.45 9.59 × 10−14 rs11169572 0.63 9.57 × 10−01
12q13.3 LRP1-AS 6.66 2.75 × 10−11 rs7398375 0.00 8.81 × 10−01
13q13.3 SMAD9-IT1 − 5.87 4.43 × 10−09 rs12427846 0.07 4.76 × 10−01
15q23 MYO9A 5.51 3.64 × 10−08 rs8031386 0.10 5.54 × 10−01
17p13.3 RP11-676J12.7 − 8.08 6.57 × 10−16 rs4968127 0.00 2.00 × 10−03
17q25.3 AC144831.1 8.51 1.71 × 10−17 rs35204860 0.01 5.72 × 10−01
20q13.33 RP5-908M14.9 12.12 7.92 × 10−34 rs1741640 0.03 9.42 × 10−01

a For loci with multiple identified genes, only the gene with the most significant association with CRC risk is presented. Additional genes for loci include: 17q12 (FBXL20); 2q35 (PLCD4); 6p21.33 (HCP5, DDR1, LINC00243, HLA-E); 8q24.21 (CASC8); 11q23.1 (POU2AF1); 12q13.12 (ASIC1, RP11-133N21.10); 15q23 (PKM); 20q13.33 (ZGPAT, RP5-908M14.10).

b For each gene, the Z score of the most significant association is presented.

c P-value was calculated using ACAT with the Cauchy combination test statistic through joint analysis of multi-ancestry TWAS associations.

d Distance between a gene and the closest risk variant identified in previous CRC GWAS.

e P-value was calculated using ACAT with the Cauchy combination test statistic through joint analysis of multi-ancestry TWAS associations conditioned on the closest risk variant.
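The Cauchy combination test referred to in footnotes c and e combines P-values by transforming each one to a Cauchy variate, averaging, and converting back. A minimal sketch of the general statistic follows, assuming equal weights by default; the function name `acat` and the input values are illustrative, and the authors' exact weighting is not specified in this section.

```python
import math

def acat(pvalues, weights=None):
    """Cauchy combination of P-values.

    T = sum_i w_i * tan((0.5 - p_i) * pi), with weights normalized to sum to 1;
    the combined P-value is 0.5 - arctan(T) / pi.
    """
    k = len(pvalues)
    if weights is None:
        weights = [1.0] * k
    total_w = sum(weights)
    t = 0.0
    for w, p in zip(weights, pvalues):
        if p < 1e-15:
            # For tiny p, tan((0.5 - p) * pi) is approximately 1 / (p * pi).
            t += (w / total_w) / (p * math.pi)
        else:
            t += (w / total_w) * math.tan((0.5 - p) * math.pi)
    if t > 1e15:
        # Tail approximation avoids loss of precision for very large T.
        return 1.0 / (t * math.pi)
    return 0.5 - math.atan(t) / math.pi

# Hypothetical ancestry-specific P-values for one gene (illustrative only).
print(f"{acat([1e-6, 0.5]):.2e}")
```

One property worth noting is that combining identical P-values returns that same P-value, and that a single very small P-value dominates the combined result.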

Additional risk genes identified by AS-WAS and APA-WAS

We built prediction models for AS and APA using the aforementioned RNA-seq data from colon tissues, generating 9680 AS models for ACCC, 10,025 for BarcUVa-seq, and 20,240 for GTEx at a prediction performance of R2 > 0.01 (Methods, Supplementary Data 7). In a meta-analysis of AS-WAS results from these models, we identified 82 risk genes at a Bonferroni-corrected P < 0.05 (Fig. 2B). We identified three additional genes, POLD3, PARP11, and CEP89, specific to the European population (Supplementary Data 8). In total, we identified 85 genes after combining the results from the meta-analysis and population-specific analyses. For APA, a total of 2569 (for BarcUVa-seq), 4399 (for GTEx), and 2744 (for ACCC) prediction models were built successfully with R2 > 0.01 (Supplementary Data 9). In a meta-analysis of APA-WAS results from these models, we identified 29 risk genes at a Bonferroni-corrected P < 0.05 (Fig. 2C). Similarly, we identified four additional genes (ARPC5, PHLDB2, COA4, NEIL1) specific to the European population (Supplementary Data 10). In total, we identified 33 genes after combining the results from the meta-analysis and population-specific analyses.

In total, 112 genes were identified for CRC risk by combining the results of AS-WAS and APA-WAS. Of these, 58 were not previously reported for CRC risk (Tables 2 and 3). Among the remaining 54 previously reported genes, conditional analysis showed that 24 genes remained statistically significant at a nominal P < 0.05 (Supplementary Data 11).

Table 2.

AS-WAS novel genes presented in novel and known GWAS loci

Locus(a) Gene Z(b) ACAT_P(c) Nearest lead variant Distance (Mb)(d) Cond_ACAT_P(e)
Novel loci
1p31.3 MIER1 − 5.40 1.32 × 10−07 rs3124454 3.62 4.20 × 10−08
17p11.2 MPRIP − 5.02 5.12 × 10−07 rs1078643 6.34 1.06 × 10−05
17q12 FBXL20 5.47 4.56 × 10−08 rs1078643 28.54 9.27 × 10−08
19q13.2 ACTN4 − 5.15 7.08 × 10−07 rs1800469 2.64 1.18 × 10−07
Known loci
10q24.2 SLC25A28 − 5.39 7.21 × 10−08 rs35564340 0.03 1.47 × 10−05
11q23.1 PPP2R1B − 5.32 1.03 × 10−07 rs3087967 0.45 4.00 × 10−02
11q12.2 DAGLA 6.56 5.46 × 10−11 rs174537 0.05 6.61 × 10−01
11q13.4 MRPL48 6.65 2.88 × 10−11 rs11236187 0.83 6.00 × 10−03
1q25.3 NPL − 5.42 6.09 × 10−08 rs8179460 0.26 1.87 × 10−01
12q24.13 TMEM116 − 5.11 3.15 × 10−07 rs653178 0.43 2.98 × 10−01
12p11.21 DDX11 − 5.47 8.97 × 10−08 rs77969132 0.34 2.58 × 10−06
12p13.32 PARP11 − 4.93 8.12 × 10−07 rs10774214 0.39 4.00 × 10−03
12q13.12 MCRS1 6.42 2.74 × 10−10 rs11169572 1.26 4.46 × 10−01
12q15 KCNMB4 5.33 9.79 × 10−08 rs11178634 0.69 8.91 × 10−01
13q13.3 EXOSC8 7.21 1.10 × 10−12 rs12427846 0.08 5.72 × 10−01
15q13.3 RP11-1000B6.3 6.69 4.50 × 10−11 rs1554865 0.13 1.81 × 10−04
19q13.43 MZF1 − 6.31 5.73 × 10−10 rs11670192 0.06 9.40 × 10−01
20q11.22 GSS − 5.19 2.12 × 10−07 rs6059938 0.33 1.10 × 10−02
20q13.11 SRSF6 6.19 6.02 × 10−10 rs6017248 0.57 9.90 × 10−02
20q13.13 FAM65C 5.19 2.15 × 10−07 rs6091189 0.04 6.44 × 10−05
20q13.2 NFATC2 − 5.44 5.23 × 10−08 rs6091189 0.75 2.27 × 10−08
20q13.33 ARFGAP1 8.26 1.40 × 10−16 rs2738783 0.40 4.59 × 10−17
2q35 AAMP 6.10 1.08 × 10−09 rs3731861 0.06 3.94 × 10−01
3p21.2 ACY1 − 5.34 9.22 × 10−08 rs2001732 0.86 3.40 × 10−02
3p21.1 STAB1 − 6.36 2.02 × 10−10 rs2001732 0.34 8.51 × 10−04
5p13.1 RICTOR 5.05 4.32 × 10−07 rs55810369 1.14 8.75 × 10−04
6p22.1 TRIM26 − 5.38 7.27 × 10−08 rs1476570 0.36 5.78 × 10−08
6p21.33 FLOT1 − 5.97 2.43 × 10−09 rs116353863 0.30 9.63 × 10−09
6p24.3 SSR1 5.83 5.43 × 10−09 rs9379084 0.07 7.00 × 10−03

a For loci with multiple identified genes, only the gene with the most significant association with CRC risk is presented. Additional genes for loci include: 11q13.4 (PAAF1, ARRB1); 12q13.12 (SPATS2); 20q13.13 (ZNFX1, SLC9A8, RTEL1); 3p21.1 (DNAH1); 6p21.33 (BAG6); 6p21.33 (SKIV2L, TNXB).

b For each gene, the Z score of the most significant association is presented.

c P-value was calculated using ACAT with the Cauchy combination test statistic through joint analysis of multi-ancestry AS-WAS associations.

d Distance between a gene and the closest risk variant identified in previous CRC GWAS.

e P-value was calculated using ACAT with the Cauchy combination test statistic through joint analysis of multi-ancestry AS-WAS associations conditioned on the closest risk variant.

Table 3.

APA-WAS novel genes presented in novel and known GWAS loci

Locus(a) Gene Z(b) ACAT_P(c) Nearest lead variant Distance (Mb)(d) Cond_ACAT_P(e)
Novel loci
1p22.1 EVI5 4.56 5.04 × 10−06 rs6660031 17.03 5.20 × 10−06
15q22.2 GTF2A2 4.70 2.59 × 10−06 rs3809570 7.05 5.23 × 10−05
15q24.1 SCAMP2 − 4.88 1.08 × 10−06 rs8031386 2.63 1.02 × 10−06
Known loci
2q35 TNS1 6.21 5.26 × 10−10 rs3731861 0.48 3.90 × 10−02
3p21.2 RPL29 − 4.91 1.81 × 10−06 rs2001732 0.85 3.63 × 10−01
6p21.33 LST1 5.50 3.76 × 10−08 rs2516452 0.13 2.08 × 10−04
6p22.1 TRIM26 5.78 7.65 × 10−09 rs1476570 0.34 6.56 × 10−09
8q24.11 RAD21 − 6.01 1.90 × 10−09 rs117079142 0.07 1.61 × 10−01
11q23.1 POU2AF1 − 5.65 1.60 × 10−08 rs3087967 0.07 3.10 × 10−02
12p13.31 PEX5 6.03 1.62 × 10−09 rs10849434 0.89 2.28 × 10−01
12q24.12 BRAP − 5.17 2.36 × 10−07 rs653178 0.07 3.67 × 10−03
15q22.33-q23 AAGAB 4.79 1.66 × 10−06 rs56324967 0.09 9.21 × 10−08
15q23 HEXA − 5.39 7.14 × 10−08 rs8031386 0.12 1.78 × 10−01
18q21.1 LIPG 4.96 7.10 × 10−07 rs2337113 0.64 3.60 × 10−02
20p12.3 C20orf196 − 7.23 4.80 × 10−13 rs966816 0.53 2.00 × 10−02
20q13.13 RNF114 7.25 8.23 × 10−13 rs6012915 0.42 1.29 × 10−06
20q13.33 MTG2 − 4.95 1.45 × 10−06 rs1741640 0.15 4.99 × 10−01

a For loci with multiple identified genes, only the gene with the most significant association with CRC risk is presented. Additional genes for loci include: 6p21.33 (GPANK1, PRR3, DDR1).

b For each gene, the Z score of the most significant association is presented.

c P-value was calculated using ACAT with the Cauchy combination test statistic through joint analysis of multi-ancestry APA-WAS associations.

d Distance between a gene and the closest risk variant identified in previous CRC GWAS.

e P-value was calculated using ACAT with the Cauchy combination test statistic through joint analysis of multi-ancestry APA-WAS associations conditioned on the closest risk variant.

Risk genes identified by combining TWAS, AS-WAS, and APA-WAS analyses

By integrating results from TWAS, AS-WAS, and APA-WAS, we identified 222 unique CRC risk genes (Fig. 2). Among them, four genes (COLCA2, METRNL, ARPC5, COX14) were identified by all three analyses and 28 genes were detected by any two of them (Supplementary Fig. 3). Compared with genes reported in previous CRC TWAS and other genetic studies1,2,8,10,15,16, 95 of the 222 genes had not previously been identified as CRC susceptibility genes, including 13 genes located in new loci (Tables 1–3).
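The gene counts reported for the three analyses can be reconciled by inclusion-exclusion. The check below assumes that "any two" means exactly two analyses; under that reading, the per-analysis totals (140, 85, and 33) reduce to the 222 unique genes reported.

```python
# Per-analysis totals reported in the Results.
twas, as_was, apa_was = 140, 85, 33

in_all_three = 4       # COLCA2, METRNL, ARPC5, COX14
in_exactly_two = 28    # assumption: "any two" is read as exactly two analyses

# A gene found in exactly two analyses is counted twice in the raw sum,
# and a gene found in all three is counted three times.
unique_genes = twas + as_was + apa_was - in_exactly_two - 2 * in_all_three

# Overlap between AS-WAS and APA-WAS implied by their combined total of 112.
as_apa_overlap = as_was + apa_was - 112

print(unique_genes, as_apa_overlap)
```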

Using scRNA-seq data to explore risk genes potentially contributing to CRC development

We next analyzed scRNA-seq datasets spanning CRC developmental stages, including normal colon, serrated polyp, conventional adenoma, and carcinoma tissues from the COLON MAP (ref. 23). Our scRNA-seq data analysis identified several major cell types (e.g., ABS, GOB, STM) in normal tissues, SSC in serrated polyps, and ASC in conventional adenomas, consistent with previous studies23 (Supplementary Fig. 4). Differential expression analysis of these major cell types was performed using DESeq2 (ref. 24) to evaluate their contribution to two CRC etiology pathways: the normal-serrated polyp-carcinoma pathway and the normal-adenoma-carcinoma pathway (“Methods”). At an FDR-adjusted P < 0.05, 48 genes were significantly differentially expressed in at least one of these major cell types in the normal-serrated polyp-carcinoma pathway, with directions consistent with the TWAS associations (Supplementary Data 12), while 92 genes showed significant differential expression in the normal-adenoma-carcinoma pathway (Supplementary Data 13). Combining these results, we provide additional evidence that 99 genes (44.6% of the 222 identified risk genes) are potentially involved in CRC development.
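FDR-adjusted P-values of the kind used above are typically obtained with the Benjamini-Hochberg procedure, which is the default adjustment in DESeq2. The sketch below shows that general procedure on made-up P-values; it is an illustration of the adjustment, not a reproduction of the authors' pipeline.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up P-values for four hypothetical genes.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [a < 0.05 for a in adj]
print(adj, significant)
```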

Druggable genes for CRC prevention and intervention

We comprehensively annotated the 222 identified risk genes as therapeutic targets for existing drugs, including those either approved or in clinical development (“Methods”). Our analysis identified nine CRC susceptibility genes that are druggable by 125 drugs approved or in clinical trials for CRC treatment (Fig. 3 and Supplementary Data 14). In addition, we identified 26 genes targeted by 519 drugs approved or in clinical trials for the treatment of other cancers (Supplementary Data 15). Furthermore, 40 genes were identified as targets of 585 drugs approved or in clinical trials for non-cancer indications (Supplementary Data 16). In total, 48 of the 222 CRC susceptibility genes (21.6%) were found to be druggable, with potential links to 1216 drugs, either approved or under investigation.

Fig. 3. A circular plot showing nine druggable risk genes potentially targeted by 125 drugs either approved or undergoing clinical trials for treatment of CRC.

The innermost to outermost layers represent CRC, the identified druggable risk genes, and therapeutic drugs - approved (green) or currently under clinical trials (Phase 3: yellow, Phase 2: red).

Functional assays for three putative oncogenes, RHPN2, IRS2, and TXN

To functionally examine the CRC susceptibility genes, we performed CRISPR/Cas9-mediated gene knockout. Three risk genes, one previously reported (RHPN2) and two newly identified (IRS2 and TXN), were selected for downstream experiments based on a literature review and evidence of their potential oncogenic functions (Methods). This selection was supported by the association of their predicted elevated expression with increased CRC risk, and by differential analysis of single-cell data highlighting their oncogenic roles. For each gene (RHPN2, IRS2, or TXN), we used two independent sgRNAs targeting the protein-coding region (Supplementary Data 17). We used western blotting to confirm target protein reduction, and CCK-8 proliferation and colony formation assays to assess gene function (Fig. 4). TXN protein levels were reduced by 65–98% in knockout cells relative to control cells (sgRNA targeting LacZ) in the CRC cell lines RKO, HCT116, and SW480 (Supplementary Fig. 5). TXN knockout significantly inhibited cell proliferation and colony formation in all three CRC cell lines (Fig. 4A–C). IRS2 protein levels were reduced by 66–86% in knockout cells relative to control cells in the three CRC cell lines (Supplementary Fig. 5). IRS2 knockout significantly inhibited only colony formation (Fig. 4E, F), indicating a long-term effect of knocking out IRS2. RHPN2 protein levels were reduced by 34–61% in knockout cells relative to control cells (Supplementary Fig. 5). However, RHPN2 knockout significantly inhibited cell proliferation and colony formation only in RKO cells, with a trend toward inhibition in HCT116 cells and a minimal effect in SW480 cells (Fig. 4G–I). The variable extent of RHPN2 dependency in cell proliferation and growth was consistent with the protein expression levels in these cells. Our experimental verification demonstrated the potential oncogenic functions of TXN, IRS2, and RHPN2, aligning with our TWAS analysis.

Fig. 4. The effects on cell viability and colony formation for selected genes, TXN, IRS2, and RHPN2 in three CRC cell lines (HCT116, RKO, SW480).

CRC cell lines with each gene knocked out by two sgRNAs were compared to CRC cell lines with a non-targeting sgRNA control. Cell viability was measured with the CCK-8 assay for TXN (A), IRS2 (D), and RHPN2 (G). Colony formation assays were performed with cells stained by crystal violet and were further quantified for TXN (B and C), IRS2 (E and F), and RHPN2 (H and I). The independent experiments were performed in triplicate. P-values were determined by a two-sided t test comparing knockout and control cells. “*”, P < 0.05; “**”, P < 0.01; “***”, P < 0.001. The height of each bar represents the mean value across replicates, and the error bar indicates the standard error of the mean.
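The bar heights, error bars, and significance stars described in the caption rest on three quantities: the mean, the standard error of the mean (SEM), and a two-sample t statistic. The sketch below computes them for triplicate measurements, assuming an equal-variance Student's t test (the caption does not state which variant was used) and using made-up values rather than the study's data.

```python
import math
import statistics

def mean_sem(values):
    """Mean and standard error of the mean (sample SD / sqrt(n))."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))

def student_t(a, b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2

# Made-up relative viability values for triplicate experiments.
control = [1.00, 0.95, 1.05]
knockout = [0.60, 0.55, 0.65]

ctrl_mean, ctrl_sem = mean_sem(control)
ko_mean, ko_sem = mean_sem(knockout)
t_stat, df = student_t(knockout, control)

# Two-sided critical value of Student's t at alpha = 0.05 with 4 df.
T_CRIT_DF4 = 2.776
print(f"t = {t_stat:.2f}, df = {df}, significant at 0.05: {abs(t_stat) > T_CRIT_DF4}")
```

With only three replicates per group, the test has 4 degrees of freedom, so the difference between groups must be large relative to the SEM to reach significance.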

Discussion

Key TFs are transcribed in a cell type-specific manner and co-occupy cis-regulatory elements that control gene expression programs to establish and maintain cell identity25–29. Identifying these TFs, whose DNA-binding activities are altered by risk-associated genetic variations, and understanding the genes they regulate, is crucial for advancing our knowledge of transcriptional dysregulation in human cancers30–33. In this study, we identified 51 TFs associated with CRC susceptibility, advancing our understanding of key TF–gene regulatory networks underlying CRC etiology. By incorporating prior knowledge on TF-binding sites for CRC into the sTF-TWAS framework, we demonstrated that our approach significantly improves predictive model construction and facilitates the discovery of risk genes. Compared to traditional TWAS analysis in our previous study16, the sTF-TWAS method improved the prediction of gene expression and AS, increasing the number of predicted genes by approximately 1500 and 2000, respectively. Moreover, sTF-TWAS identified 94 additional genes beyond those detected by the traditional TWAS and AS-WAS. For the APA analysis, we integrated RNA-seq data from ACCC and BarcUVa-seq, whereas our previous study relied on GTEx data alone, and identified 30 novel risk genes21. Notably, the sTF-TWAS analysis, using these diverse prediction models, complements and strengthens risk gene discovery. Among the 222 identified risk genes, only four genes (COLCA2, METRNL, ARPC5, COX14) were detected by TWAS, AS-WAS, and APA-WAS. This observation aligns with prior research showing that different molecular traits within the same gene can represent largely independent signals21,34. For these four commonly detected genes, very few variants were shared among the prediction models, indicating that genetic heritability may contribute independently to each molecular trait, while a small proportion of shared risk variants may influence expression, splicing, and APA through interconnected mechanisms under multi-layered regulatory control.

Our sTF-TWAS analysis highlighted risk TF-based transcriptional networks underlying genetic susceptibility to CRC. By examining TF occupancy information for the genetic variants included in the prediction models, we found that 117 of the 222 putative susceptibility genes (52.7%) are commonly regulated by the top five TFs (RAD21, CBX3, MAX, SP1, and POLR2A). In addition, we conducted a new TF enrichment analysis on these 222 genes using publicly available TF ChIP-seq data from Enrichr35,36. This analysis identified four CRC-associated TFs (VDR, EP300, MYC, and SOX2) significantly enriched at FDR < 0.05, and an additional set of 17 TFs (Supplementary Data 18) enriched at nominal P < 0.05. Of note, several TFs, such as MYC, PROX1, SOX9, KLF5, and RAD21, were themselves among the 222 risk genes. Most CRC-associated TFs were not detected by TWAS, likely because TWAS captures genetic heritability primarily from cis-variants. However, our GLMM analysis using genome-wide variants provided evidence of their susceptibility. Taken together, these findings underscore the role of TFs and TF-based transcriptional networks in CRC genetic susceptibility. In our prediction models, we further identified 144 lead variants, each representing the strongest association with CRC risk in its model, at P < 5 × 10⁻⁸. These variants are predicted to influence the binding affinity of CRC-associated TFs and are linked to 162 risk genes (Supplementary Data 19). These results offer valuable insights for prioritizing functional studies aimed at fine-mapping causal variants and genes for CRC through in vitro and in vivo assays. Our integration of scRNA-seq data analyses enhances causal gene discovery and improves our understanding of the cellular processes driving CRC. We acknowledge that some genes identified through differential expression analysis across developmental stages may not reflect true causal associations, particularly given the high correlation among many gene expression levels. In the trans-ancestry TWAS analysis, meta-analysis improved statistical power, enabling robust gene prediction in both East Asian and European populations. Notably, 37 CRC risk genes identified through meta-analysis exhibited consistent associations with CRC risk (P < 0.05) across both populations, providing strong evidence for these genes as potential causal factors for CRC (Supplementary Data 5, 8, 10).

Our identification of risk genes has advanced our understanding of CRC biology and etiology. Among the 222 identified genes, nine are known CRC driver genes (CDH1, ERBB2, KLF5, SOX9, SRSF6, TBX3, TET2, TRPS1, and ZNRF3), highlighting their genetic contributions to CRC development37–40. We also uncovered seven previously unreported CRC risk genes that map to six established pathways crucial for CRC development41–44. Specifically, PPP2R1B regulates both the Wnt/β-catenin and TGF-β signaling pathways, while RICTOR modulates the PI3K-AKT pathway, a key driver of cell proliferation and survival. RAD21 is involved in both the Cell Cycle and Mismatch Repair pathways, reinforcing its role in genomic maintenance. NFATC2 supports MAPK signaling and DNA mismatch repair, while ARRB1 and MAPK13 further emphasize the significance of MAPK signaling. RTEL1 strengthens the Mismatch Repair pathway. In addition to these established oncogenic pathways, we provide strong evidence for three oncogenes that advance our understanding of CRC susceptibility. RHPN2, a Rho GTPase-binding protein that regulates the cell cytoskeleton, may contribute to the malignant growth of cancer cells45. IRS2 has been shown to activate the oncogenic PI3 kinase pathway and to increase the adhesion of CRC cell lines (a signal of invasive cells)46. Moreover, over-expression of IRS2 is also observed in many other cancers47,48, and its amplification has been proposed as a biomarker for inhibitor-based cancer therapy49. TXN, a redox protein, may regulate antioxidant reactions that promote cancer cell metabolism and survival50,51. In our prior work, we validated the functionality of another five reported genes, DIP2B, SFMBT1, TRPS1, METRNL, and C14orf166, using siRNA/shRNA-mediated gene knockdown cell lines in vitro10,16.
Furthermore, public CRISPR-Cas9 data from the Cancer Dependency Map (DepMap) provided additional evidence for the essential roles of 21 risk genes in colorectal adenocarcinoma cell proliferation (Supplementary Data 20). Together, these discoveries expand the repertoire of CRC susceptibility genes and reveal previously unexplored molecular mechanisms underlying CRC development.

Through the analysis of drug-protein interaction databases52–55, we identified nine druggable proteins targeted by drugs that are either approved or in clinical trials for CRC treatment. Notably, six of the corresponding genes (DDR1, SIK2, EHMT2, MAPK13, PKM, and PARP11) are novel CRC risk genes reported in this study. Among these six novel genes, PKM2, a member of the PKM family, has been reported to mediate VEGF-induced angiogenesis in human endothelial cells. Resveratrol inhibits this process by suppressing aerobic glycolysis, potentially through modulation of PKM2 nuclear translocation56. PARP11, implicated in DNA repair, can be inhibited by approved PARP inhibitors, including Olaparib, Niraparib, and Talazoparib57. However, these observations are exploratory and require further validation. Overall, our protein–drug mapping expands the catalog of druggable genes and candidate therapeutics and supports the development of precision medicine strategies for CRC prevention and intervention.

Methods

Study populations

This study used summary statistics from a meta-analysis of 31 GWAS comprising 186,072 individuals of European ancestry (80,774 CRC cases and 105,298 controls) and 72,272 individuals of East Asian ancestry (23,572 CRC cases and 48,700 controls). The meta-analysis was conducted using the fixed-effects inverse variance weighted model implemented in METAL58. Details on sample selection and matching, sample numbers, and demographic characteristics of study participants have been described previously59–62. All participants provided written informed consent, and all study protocols were approved by the relevant research ethics committee or institutional review board.

We included RNA-seq data from tumor-adjacent normal colon tissues and genotyping data of 364 individuals of East Asian ancestry who participated in the Asia Colorectal Cancer Consortium (ACCC). We also included data from individuals of European ancestry, including RNA-seq in normal colon tissues and genotyping data of 423 individuals in the BarcUVa-Seq project and 284 individuals in the GTEx project. The sample collection for these data has been described in detail in previous studies1,63,64.

Generalized linear mixed models

We systematically analyzed TF ChIP-seq data generated in normal colorectal epithelium and CRC cell lines from the Cistrome database65. Only ChIP-seq datasets meeting Cistrome’s recommended quality control standards were included. After evaluation, we curated 218 ChIP-seq datasets corresponding to 84 TFs. To ensure robust analysis, we focused on datasets with more than 1000 peaks and a fold change greater than 10. For TFs with multiple ChIP-seq datasets, only the dataset with the highest number of peaks was retained. This process resulted in a high-quality set of ChIP-seq data for 69 TFs, which were used in downstream analyses.
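The dataset selection rule described above can be sketched in a few lines. This is only an illustration: the record layout (`ChipDataset` and its field names) is hypothetical rather than the Cistrome schema, and `fold_change` is assumed to be a single per-dataset summary value.

```python
from collections import namedtuple

# Hypothetical record layout; field names are illustrative, not Cistrome's schema.
ChipDataset = namedtuple("ChipDataset", ["dataset_id", "tf", "n_peaks", "fold_change"])

def select_datasets(datasets, min_peaks=1000, min_fold=10):
    """Keep datasets passing both thresholds, then one dataset per TF (most peaks)."""
    best = {}
    for d in datasets:
        if d.n_peaks <= min_peaks or d.fold_change <= min_fold:
            continue  # fails the "more than 1000 peaks and fold change > 10" filter
        if d.tf not in best or d.n_peaks > best[d.tf].n_peaks:
            best[d.tf] = d
    return best

# Toy input: two VDR datasets pass, MYC has too few peaks, SP1 has a low fold change.
datasets = [
    ChipDataset("d1", "VDR", 5200, 14.0),
    ChipDataset("d2", "VDR", 8100, 12.5),
    ChipDataset("d3", "MYC", 900, 30.0),
    ChipDataset("d4", "SP1", 3000, 8.0),
]
selected = select_datasets(datasets)
```

In this toy input only VDR survives, represented by the dataset with the larger peak count.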

We used our recently developed generalized linear mixed model approach7 to investigate associations of CRC risk with variations of TF-DNA binding by a single TF (Eq. (1)). In brief, we constructed an N × M matrix from all genetic variants in the GWAS summary statistics and the annotations from all available TF-DNA binding regions, where N is the number of genetic variants and M is the number of TF ChIP-seq datasets; a variant located in a ChIP-seq peak is coded as 1 and a variant outside the peaks is coded as 0. We used the Chi-squared value for each genetic variant reported in the GWAS summary data to measure its association with CRC risk. We then used generalized linear mixed models to estimate the associations between the Chi-squared values (Y) and the TF binding status of genetic variants located in binding sites of each TF, grouping variants into LD blocks (non-overlapping segments of 100 kb) to handle the dependence between genetic variants (Eq. (1)).

$$Y_{ij} = \beta_0 + \beta_1 \mathrm{TF}_{ij} + V_i + \varepsilon_{ij} \tag{1}$$

In Eq. (1), $Y_{ij}$ is the Chi-squared value for the j-th variant in the i-th LD block; $\beta_0$ is the fixed intercept, and $\beta_1$ is the fixed slope, which measures the mean difference of the Chi-squared values ($\Delta\bar{\chi}^2$) between TF binding statuses; $\mathrm{TF}_{ij}$ is the TF binding status of the j-th variant in the i-th LD block (i.e., 1 for a variant located in a TF binding site, 0 otherwise); $V_i$ is the random intercept for the i-th LD block; and $\varepsilon_{ij}$ is the error term. Based on this statistical model, we identified cis-regulatory elements occupied by TFs whose genetic variations of TF-DNA binding are associated with CRC risk at Bonferroni-corrected P < 0.05. We additionally used generalized linear mixed models to estimate the associations of the Y values of variants with TF-pair occupancy when the pair showed a significant interaction (Eq. (2)). Of note, we used genetic variants not occupied by any of the TFs investigated as the reference group for all the analyses.

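$$Y_{ij} = \beta_0 + \beta_1 \mathrm{TF1}_{ij} + \beta_2 \mathrm{TF2}_{ij} + \beta_3 \mathrm{TF1}_{ij} \times \mathrm{TF2}_{ij} + V_i + \varepsilon_{ij} \tag{2}$$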
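As a rough illustration of the inputs to Eqs. (1) and (2), the sketch below codes each variant's TF binding status as 0/1, assigns it to a non-overlapping 100 kb LD block, and adds the TF-pair product term. It covers data preparation only; fitting the random intercept would still require a mixed-model library. The function names, the input layout, and the half-open peak intervals are assumptions made for this example.

```python
from bisect import bisect_right

BLOCK_SIZE = 100_000  # non-overlapping 100 kb segments used as LD blocks

def in_any_peak(pos, peaks):
    """peaks: sorted, non-overlapping (start, end) half-open intervals (assumed layout)."""
    starts = [s for s, _ in peaks]
    i = bisect_right(starts, pos) - 1  # last peak starting at or before pos
    return i >= 0 and pos < peaks[i][1]

def build_rows(variants, tf_peaks):
    """variants: list of (chrom, pos, chi2); tf_peaks: {tf: {chrom: [(start, end), ...]}}.
    Returns the TF names and one row per variant with its LD block, Y, and 0/1 flags."""
    tfs = sorted(tf_peaks)
    rows = []
    for chrom, pos, chi2 in variants:
        row = {"block": (chrom, pos // BLOCK_SIZE), "Y": chi2}
        for tf in tfs:
            row[tf] = int(in_any_peak(pos, tf_peaks[tf].get(chrom, [])))
        rows.append(row)
    return tfs, rows

def add_interaction(rows, tf1, tf2):
    """Add the TF1 x TF2 product term of Eq. (2) to every row, in place."""
    for row in rows:
        row[tf1 + "x" + tf2] = row[tf1] * row[tf2]

# Toy data: three variants on chr1 and two hypothetical TFs.
variants = [("chr1", 150, 4.2), ("chr1", 100_500, 0.3), ("chr1", 250_000, 9.1)]
tf_peaks = {
    "TF_A": {"chr1": [(100, 200), (249_900, 250_100)]},
    "TF_B": {"chr1": [(249_000, 251_000)]},
}
tfs, rows = build_rows(variants, tf_peaks)
add_interaction(rows, "TF_A", "TF_B")
```

Rows with all TF flags equal to 0 correspond to the reference group of variants not occupied by any investigated TF.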

Simulation studies

To benchmark the statistical power of GLMM against S-LDSC, we conducted simulations under a polygenic causality scenario where genotypes influence phenotypes through TF binding sites. Specifically, we assumed that only SNPs located within TF binding sites contribute to disease risk, while SNPs outside these regions have no effect.

For each replicate, we randomly selected K causal variants (K = 500, 1000, 5000, or 10,000) from a pool of 600,000 SNPs annotated to 84 TFs in CRC cell lines (Supplementary Data 1). Under an additive architecture, phenotypes were simulated using Eq. 3:

$$Y_i^{*} = \sum_{j=1}^{K} \beta_1 \times X_{ij}^{\mathrm{TF}} + \varepsilon_i \tag{3}$$

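where $X_{ij}^{\mathrm{TF}}$ is the genotype (0, 1, or 2) of the j-th variant for subject i, taken from 489 individuals of European ancestry in the 1000 Genomes Project Phase 3 (ref. 66). $\beta_1$ was set to 1, and $\varepsilon_i$ was sampled from a normal distribution $N\left(0, \frac{1-h^2}{h^2}\,\mathrm{var}\left(\sum_{j=1}^{K} \beta_1 \times X_{ij}^{\mathrm{TF}}\right)\right)$, with heritability $h^2$ = 0.5, 0.25, 0.1, or 0.05, where $\mathrm{var}\left(\sum_{j=1}^{K} \beta_1 \times X_{ij}^{\mathrm{TF}}\right)$ is the variance of the genetic component.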

To facilitate downstream GWAS, we binarized the continuous trait into a case–control phenotype $C_i^{*}$ based on the median value of $Y_i^{*}$ across individuals using Eq. (4).

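$$C_i^{*} = \begin{cases} 1, & Y_i^{*} \ge \operatorname{median}_{i=1}^{N}\left(Y_i^{*}\right) \\ 0, & Y_i^{*} < \operatorname{median}_{i=1}^{N}\left(Y_i^{*}\right) \end{cases} \tag{4}$$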
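A minimal sketch of the phenotype simulation in Eqs. (3) and (4) follows. It assumes the noise variance is scaled as (1 − h²)/h² times the variance of the genetic component, and it draws toy genotypes at an arbitrary allele frequency instead of using 1000 Genomes data; the function name and toy sizes are illustrative only.

```python
import random
import statistics

def simulate_phenotype(genotypes, h2, beta1=1.0, rng=None):
    """genotypes: per-subject lists of 0/1/2 dosages at the K causal variants.
    Returns (Y, C): continuous trait and median-split case/control labels."""
    rng = rng or random.Random()
    genetic = [beta1 * sum(g) for g in genotypes]        # sum_j beta1 * X_ij
    var_g = statistics.pvariance(genetic)
    noise_sd = ((1.0 - h2) / h2 * var_g) ** 0.5          # targets var_g / var(Y) close to h2
    y = [g + rng.gauss(0.0, noise_sd) for g in genetic]  # Eq. (3)
    cut = statistics.median(y)
    c = [1 if v >= cut else 0 for v in y]                # Eq. (4)
    return y, c

rng = random.Random(1)
n_subjects, k_causal = 200, 50
# Toy genotypes at allele frequency 0.3 (an arbitrary choice for this example).
geno = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(k_causal)]
        for _ in range(n_subjects)]
y, c = simulate_phenotype(geno, h2=0.5, rng=rng)
```

Because the split is at the median, roughly half of the subjects are labeled as cases regardless of the chosen heritability.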

We performed GWAS using PLINK 2.0 (ref. 67) and applied both GLMM and S-LDSC to the resulting summary statistics to estimate TF–disease associations. For each K, we conducted 50 replicates, each with a different random subset of causal SNPs, ensuring reproducibility while keeping K fixed. Results were summarized as the number of true-positive TFs detected at a nominal threshold (P < 0.05) by each method.

To evaluate the type I error rate of GLMM, we conducted simulation analyses under the null hypothesis of no association between genotype and phenotype. Specifically, we randomly assigned binary phenotype values (0 or 1) to 489 individuals of European ancestry and performed logistic regression to generate GWAS summary statistics under the null. We then applied GLMM and S-LDSC using different numbers of TF binding variants (500, 1000, 5000, and 10,000) and repeated the entire procedure 1000 times. Finally, we compared the type I error rates of GLMM and S-LDSC using quantile-quantile (QQ) plots of p-values across the 1000 replicates for each binding variant set size.
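The null-simulation summary can be illustrated as follows: the empirical type I error rate is the fraction of null p-values below the threshold, and QQ points pair sorted observed p-values with their expected uniform quantiles. The (i + 0.5)/n plotting position is one common convention and is an assumption here, as are the function names.

```python
import math

def empirical_type1_error(pvalues, alpha=0.05):
    """Fraction of null replicates with p < alpha."""
    return sum(p < alpha for p in pvalues) / len(pvalues)

def qq_points(pvalues):
    """Return (expected, observed) pairs on the -log10 scale for a QQ plot.
    Assumes all p-values are strictly positive."""
    n = len(pvalues)
    observed = sorted(pvalues)
    return [(-math.log10((i + 0.5) / n), -math.log10(p))
            for i, p in enumerate(observed)]

# Toy input: perfectly uniform p-values, as expected for a well-calibrated test.
null_p = [(i + 0.5) / 1000 for i in range(1000)]
rate = empirical_type1_error(null_p)
points = qq_points(null_p)
```

For a calibrated method the points lie on the diagonal and the rate stays close to the nominal threshold; inflation shows up as observed values above the diagonal.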

Gene expression profiling and data processing

The RNA-seq library preparation, sequencing, data processing, and quality control for the ACCC1 and the BarcUVa-Seq project63 have been described previously. In brief, paired-end sequencing was conducted for mRNA obtained from the BarcUVa-Seq project using either an Illumina HiSeq 2500 or a NovaSeq 6000 instrument and for mRNA obtained from the ACCC project using a HiSeq 3000 sequencer. The raw RNA-seq data were processed following the GTEx Consortium pipeline64. BAM files containing reads aligned to the reference genome (GRCh37 for ACCC and GRCh38 for the BarcUVa-Seq and GTEx projects) were generated using STAR68 (v2.5.4). These BAM files, along with GENCODE release 19 (for ACCC) and release 26 (for BarcUVa-Seq and GTEx), were used for quantification of gene expression, AS, and APA.

Building on our previous work in processing gene expression and AS16, gene expression quantification was performed using RNA-SeQC69. Read counts and Transcripts Per Kilobase Million (TPM) values were calculated from aligned reads based on the following criteria: (1) reads were uniquely mapped; (2) aligned reads were properly paired; (3) the read alignment distance was < 6. Genes with expression of ≥ 0.1 TPM in ≥ 20% of samples and ≥ 6 reads (unnormalized) in ≥ 20% of samples were selected. Quantile normalization was applied to standardize the gene expression data within each sample to the same distribution, followed by rank-based inverse normal transformation across all samples.
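The gene filter and the rank-based inverse normal transformation can be sketched as below. Quantile normalization across samples is omitted for brevity, and the (rank − 0.5)/n offset with average ranks for ties is an assumed convention, since the offset used in the pipeline is not stated here.

```python
from statistics import NormalDist

def passes_expression_filter(tpm, counts, min_tpm=0.1, min_reads=6, min_frac=0.2):
    """Keep a gene with >= 0.1 TPM and >= 6 raw reads, each in >= 20% of samples."""
    n = len(tpm)
    frac_tpm = sum(v >= min_tpm for v in tpm) / n
    frac_reads = sum(v >= min_reads for v in counts) / n
    return frac_tpm >= min_frac and frac_reads >= min_frac

def inverse_normal_transform(values):
    """Rank-based inverse normal transform of one gene across samples."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend the group of tied values
        avg_rank = (i + j) / 2 + 1      # average 1-based rank over the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - 0.5) / n) for r in ranks]

keep = passes_expression_filter([0.0, 0.0, 0.0, 0.0, 0.5], [0, 0, 0, 0, 10])
z = inverse_normal_transform([3.0, 1.0, 2.0])
```

The transformed values preserve the sample ordering while following a standard normal shape, which limits the influence of expression outliers on model training.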

AS was quantified using LeafCutter70 and RegTools71. In brief, intron usage was quantified from aligned BAM files using the junctions extract command of RegTools with the default options. Next, the introns were clustered using the leafcutter_cluster_regtools.py script from LeafCutter, with the following options: -m 50 -l 500000. The clusters were mapped to genes using the map_clusters_to_genes.R script from the cluster_prepare_fastqtl.py workflow provided by GTEx. Introns with low counts or low complexity were filtered based on the criteria in the GTEx pipeline (e.g., cluster_prepare_fastqtl.py). The filtered counts were normalized using the prepare_phenotype_table.py script from LeafCutter.

For APA analysis, following our recent APA-WAS study21, APA was quantified using the Percentage of Distal polyA site Usage Index (PDUI) estimated from DaPars v2.0 (ref. 21). We performed quantile normalization to transform the quantified PDUI values of APA for each sample to the same distribution. We used 60 probabilistic estimation of expression residuals (PEER) factors72 to correct batch effects and experimental confounders in downstream prediction model building for gene expression, AS, and APA.

Genotyping data processing

The generation and processing of genotype data have been described in our previous work1,16. Briefly, CRC patients of East Asian ancestry from the ACCC were genotyped using the Illumina OncoArray and the expanded Illumina MEGA Array. For genotype quality control, genetic variants were excluded based on the following criteria: (1) genotype call rate < 95%; (2) ambiguous variants; (3) duplicated variants; (4) P for Hardy-Weinberg equilibrium (HWE) < 1.0 × 10⁻⁶. For quality control of samples, we excluded those with: (1) genotype call rate < 95%; (2) genetically identical or duplicated samples; (3) first- or second-degree relatives; (4) racial outliers. To estimate genetic relatedness among samples, the quality-controlled common SNPs (minor allele frequency [MAF] > 0.01) were first pruned using a window size of 1500 SNPs and pairwise r² > 0.2. Identity-by-descent (IBD) was calculated using the pruned genotype data. Quality-controlled genotypes were imputed using the 1000 Genomes Project Phase 3 mixed reference haplotypes with the Michigan Imputation Server (Minimac4 for imputation and Eagle v2.4 for phasing). Variants with MAF < 0.01 were excluded before imputation. Approximately 4.5 million variants with MAF > 0.05 and R² > 0.3 were included for prediction model building.
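The variant-level filters can be illustrated with a small sketch that works from genotype counts. The HWE test used here is the one-degree-of-freedom chi-squared approximation, which is an assumption for illustration; the study does not state which HWE test was applied, and exact tests are common in practice.

```python
import math

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-squared (1 df) HWE p-value from genotype counts; an approximation
    chosen for this sketch, not necessarily the test used in the study."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic: no departure can be assessed
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
    return math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-squared with 1 df

def variant_passes_qc(n_aa, n_ab, n_bb, n_missing,
                      min_call_rate=0.95, min_hwe_p=1e-6, min_maf=0.01):
    """Apply the call rate, MAF, and HWE thresholds to one variant."""
    called = n_aa + n_ab + n_bb
    if called == 0 or called / (called + n_missing) < min_call_rate:
        return False
    freq = (2 * n_aa + n_ab) / (2 * called)
    if min(freq, 1.0 - freq) < min_maf:
        return False
    return hwe_chi2_pvalue(n_aa, n_ab, n_bb) >= min_hwe_p

ok = variant_passes_qc(360, 480, 160, 0)              # in HWE, fully called
excess_hom = variant_passes_qc(500, 0, 500, 0)        # no heterozygotes at all
low_call = variant_passes_qc(360, 480, 160, 100)      # about 91% call rate
```

Ambiguous and duplicated variants, as well as the sample-level filters, depend on allele and identifier information and are not covered by this sketch.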

For the BarcUVa-Seq project, approximately 400,000 SNPs were genotyped from DNA isolated from peripheral blood samples, using the OncoArray BeadChip. Quality controls and genotyping data processing details were provided in a previous study63. Genotypes were aligned to the reference genome GRCh38 and imputed using the TOPMed (Version R2) reference panel63. Genetic variants with an imputation quality of R2 < 0.3 were excluded.

In the GTEx project, whole genome sequencing (WGS) was performed on DNA samples from 284 individuals of European ancestry. WGS-based genotype data for these individuals, in VCF format, were accessed from dbGaP (phs000424.v8.p2). Detailed information on WGS variant quality control is provided in the GTEx documentation64.

Conditional analyses based on adjusted GWAS association signals

To investigate whether the identified associations between genetically predicted gene expression (or AS, APA) and CRC risk are independent of the association signals identified in previous GWAS, we performed conditional analyses by adjusting for the lead variant using the GCTA-COJO method73. The lead variant in each risk locus was determined based on the strongest association from GWAS summary statistics derived from the meta-analysis of European and East Asian populations or from Europeans alone. Genotype data from either European or East Asian populations in the 1000 Genomes Project Phase 3 were used to estimate linkage disequilibrium (LD). We then conducted TWAS analyses (or AS, APA) using the adjusted GWAS summary statistics to explore the association between genetically predicted gene expression and CRC risk. In addition, joint analyses were performed to combine p-values for the European population, as well as datasets from both European and East Asian populations, for gene expression (or AS, APA).

Building genetically predicted models of gene expression, AS, and APA

To construct genetically predicted models for gene expression, AS, and APA under our sTF-TWAS framework20, we selected only genetic variants located in the binding regions of any susceptibility TF, restricted to variants with MAF > 0.05 and HWE P > 10⁻⁶. We also excluded variants missing from the CRC GWAS summary statistics of either the European or East Asian populations, which resulted in approximately 378,000 genetic variants. Prediction models were trained using the elastic net approach on the flanking variants within a ±1 Mb region, adjusting for potential confounding factors, such as the top five principal components (PCs), gender, potential batch effects, and other factors derived from the PEER method. A tenfold cross-validation approach was used to tune the model parameters. The performance of the prediction models was evaluated by calculating the squared correlation (R²) between the predicted and observed levels of gene expression, AS, and APA, respectively.
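To make the model-building step concrete, the sketch below fits an elastic net by coordinate descent and scores it with the squared correlation between predicted and observed values. It is a toy illustration under stated assumptions: inputs are centered, there is no intercept, covariate adjustment and tenfold cross-validation are omitted, and the penalty is parameterized as (1/2n)‖y − Xw‖² + λ(α‖w‖₁ + (1 − α)/2 ‖w‖²), which may differ from the software used in the study.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator used by the L1 part of the penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net(X, y, lam, alpha=0.5, n_iter=200):
    """Coordinate descent for the elastic net on centered X (rows = samples) and y."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # y - Xw, with w starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the residual that excludes its own term.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_w
    return w

def squared_correlation(a, b):
    """Squared Pearson correlation (R^2) between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return 0.0 if saa == 0.0 or sbb == 0.0 else sab * sab / (saa * sbb)

# Toy data: the outcome depends only on the first (centered) variant column.
X = [[-1.0, 1.0], [0.0, -2.0], [1.0, 1.0]]
y = [-2.0, 0.0, 2.0]
w = elastic_net(X, y, lam=0.1)
pred = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]
r2 = squared_correlation(pred, y)
w_sparse = elastic_net(X, y, lam=10.0)  # heavy penalty shrinks all weights to zero
```

The nonzero weights from such a fit play the role of the variant weights used in the association step, and a very large penalty illustrates how uninformative variants drop out of a model.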

Association analyses of genetically predicted gene expression, AS, and APA with CRC risk

Using the weight matrix and summary statistics from CRC GWAS datasets consisting of 186,072 individuals of European ancestry and 72,272 individuals of East Asian ancestry, we evaluated the association between gene expression (or AS, APA) and CRC risk under the sTF-TWAS framework20. The statistical method is outlined in Eq. (5).

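$$Z_g \approx \sum_{l \in \mathrm{Model}_g} w_{lg} \frac{\hat{\sigma}_l}{\hat{\sigma}_g} \frac{\hat{\beta}_l}{\mathrm{se}(\hat{\beta}_l)} \tag{5}$$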

Briefly, the Z-score was used to estimate the association between predicted gene expression (or AS, APA) and CRC risk. Here, $w_{lg}$ is the weight of variant l for predicting the expression of gene g (or AS, APA); $\hat{\beta}_l$ and $\mathrm{se}(\hat{\beta}_l)$ are the association regression coefficient and its standard error, respectively, for variant l in GWAS; and $\hat{\sigma}_l$ and $\hat{\sigma}_g$ are the estimated variances of variant l and the predicted expression of gene g (or AS, APA). We further conducted a joint analysis to combine TWAS p-values for each gene from the BarcUVa-Seq and GTEx datasets for the European population, as well as from datasets across European and East Asian populations (ACCC), using the aggregated Cauchy association test (ACAT) method74. Similarly, joint analyses were also performed to combine p-values for AS and APA, respectively. Statistically significant genes were determined based on Bonferroni-corrected P < 0.05, considering the total number of tests for the models of gene expression, AS, and APA, respectively.
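The two computations in this subsection can be sketched as follows: the summary-statistic Z-score of Eq. (5) and an ACAT-style combination of p-values through the Cauchy transformation. Equal weights are assumed for ACAT, the function names are illustrative, and the plain tangent form shown here loses precision for extremely small p-values, for which implementations typically use an approximation.

```python
import math

def twas_z(weights, sigma_l, sigma_g, beta, se):
    """Eq. (5): sum over model variants of w_lg * (sigma_l / sigma_g) * (beta_l / se_l)."""
    return sum(w * (s / sigma_g) * (b / e)
               for w, s, b, e in zip(weights, sigma_l, beta, se))

def acat(pvalues, weights=None):
    """Combine p-values with the Cauchy combination statistic.
    Weights must sum to 1; equal weights are used when none are given."""
    n = len(pvalues)
    weights = weights or [1.0 / n] * n
    t = sum(w * math.tan((0.5 - p) * math.pi) for w, p in zip(weights, pvalues))
    return 0.5 - math.atan(t) / math.pi

# Toy model with two variants (all numbers are made up for illustration).
z = twas_z(weights=[0.5, -0.2], sigma_l=[1.0, 2.0], sigma_g=2.0,
           beta=[0.1, 0.3], se=[0.05, 0.1])
p_joint = acat([0.001, 0.20])  # e.g. one gene tested in two reference panels
```

With a single input, the combined value equals that input p-value, which is a convenient sanity check for the transformation.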

Differential gene expression analysis from single cell data

We downloaded and analyzed single-cell RNA sequencing (scRNA-seq) data spanning CRC developmental stages, including colon normal tissues (n = 31), serrated polyps (n = 19), conventional adenomas (n = 29), and carcinomas (microsatellite stability [MSS], n = 17; and microsatellite instability-high [MSI-H], n = 15) from the Colorectal Molecular Atlas Project (COLON MAP)23. We filtered the cell count matrix for each sample by removing low-expressed genes (expressed in ≤ 3 cells) and excluding cells with high mitochondrial activity (≥ 20% of transcripts attributed to mitochondrial genes) using Seurat (version 4.2.1)75,76. We next identified the top 2000 highly variable genes for principal component analysis (PCA) and clustered cells based on the first 30 principal components using the Shared-Nearest-Neighbors (SNN) graph and the Louvain community detection algorithm. These clusters were visualized using the dimensionality reduction technique Uniform Manifold Approximation and Projection (UMAP).
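The two quality-control filters can be expressed without any single-cell library, as in the sketch below. The study performed these steps in Seurat; this plain-Python version, its input layout, and the choice to compute the mitochondrial fraction before removing genes are assumptions made for illustration.

```python
from collections import Counter

def qc_filter(counts, mito_genes, max_gene_cells=3, mito_cutoff=0.20):
    """counts: {cell_id: {gene: count}}.
    Drops genes expressed in <= 3 cells and cells with >= 20% mitochondrial counts."""
    cells_per_gene = Counter(
        g for genes in counts.values() for g, c in genes.items() if c > 0
    )
    keep_genes = {g for g, n in cells_per_gene.items() if n > max_gene_cells}
    filtered = {}
    for cell, genes in counts.items():
        total = sum(genes.values())
        mito = sum(c for g, c in genes.items() if g in mito_genes)
        if total == 0 or mito / total >= mito_cutoff:
            continue  # empty cell or high mitochondrial fraction
        filtered[cell] = {g: c for g, c in genes.items() if g in keep_genes and c > 0}
    return filtered

# Toy matrix: gene "A" is in five cells, "B" and "MT-CO1" in three cells each.
counts = {
    "c1": {"A": 5, "B": 1, "MT-CO1": 1},
    "c2": {"A": 3, "B": 2},
    "c3": {"A": 4, "B": 1},
    "c4": {"A": 6, "MT-CO1": 1},
    "c5": {"A": 2, "MT-CO1": 3},
}
filtered = qc_filter(counts, mito_genes={"MT-CO1"})
```

In this toy matrix only gene A is kept, and cell c5 is dropped because 60% of its counts are mitochondrial.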

For differential expression analysis across CRC developmental stages, we examined risk genes associated with the normal-serrated polyp-carcinoma pathway and the normal-adenoma-carcinoma pathway. We performed integrative analysis of the cell count matrix across stages, generating a pseudobulk gene count matrix for each cell type within each tissue sample. Using these pseudobulk count data, we applied DESeq2 (ref. 24) to identify differentially expressed genes between developmental stages. DESeq2 accounts for heterogeneous variability (heteroscedasticity) in gene expression, where variance depends on mean values, by modeling this relationship. This approach adjusts for differences in variability between low- and high-expressing genes, leading to more accurate differential expression analysis. Specifically, we focused on potential CRC cell origins in major cell types, including absorptive cells (ABS), goblet cells (GOB), and stem cells (STM), comparing normal tissues with serrated polyps and normal tissues with conventional adenomas. Furthermore, we analyzed differential gene expression between adenoma-specific cells (ASC) and MSS, as well as ASC and MSI-H. Similarly, differential gene expression analyses were performed between serrated-specific cells (SSC) and MSS, and SSC and MSI-H. Genes with a False Discovery Rate (FDR)-adjusted p-value < 0.05 were considered statistically significant.
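Two pieces of this workflow lend themselves to a short sketch: summing single-cell counts into pseudobulk profiles per sample and cell type, and adjusting p-values for the FDR threshold. The Benjamini-Hochberg procedure is used here as the FDR method; the input layout and function names are illustrative, and the differential expression model itself (DESeq2) is not reproduced.

```python
from collections import defaultdict

def pseudobulk(cells):
    """cells: iterable of (sample_id, cell_type, {gene: count}).
    Sums raw counts over all cells sharing the same (sample, cell type)."""
    bulk = defaultdict(lambda: defaultdict(int))
    for sample, cell_type, genes in cells:
        for gene, count in genes.items():
            bulk[(sample, cell_type)][gene] += count
    return {key: dict(genes) for key, genes in bulk.items()}

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy input: three cells from one normal sample, two cell types.
cells = [
    ("N1", "ABS", {"RHPN2": 2, "TXN": 1}),
    ("N1", "ABS", {"RHPN2": 1}),
    ("N1", "GOB", {"TXN": 4}),
]
bulk = pseudobulk(cells)
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
significant = [p < 0.05 for p in padj]
```

Aggregating to pseudobulk makes each tissue sample, rather than each cell, the unit of replication in the downstream comparison between stages.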

Identifying druggable targets for the identified risk genes

We identified potentially druggable risk genes and associated drugs by integrating drug-protein interaction data from four well-established drug databases: DrugBank52, ChEMBL53, the Therapeutic Target Database54 (TTD) and OpenTargets55. Specifically, we extracted key information for each drug, including commercial names, maximum clinical trial phases, and indications. Using this data, we annotated the identified risk genes as druggable targets and linked their corresponding drugs that are either approved or in clinical trials. In addition, we categorized therapeutic drugs based on their indications, grouping them into those targeting CRC, other types of cancer, and non-cancer diseases.

Cell culture and cell proliferation and colony formation assays

Human colorectal cancer cell lines (HCT116, RKO, and SW480) from ATCC were cultured in RPMI 1640 supplemented with 10% FBS and penicillin/streptomycin. Cells were kept in a humidified incubator at 37 °C with 5% CO₂. The CCK-8 cell proliferation assay was performed for four days. In detail, 5 × 10³ cells were seeded into a 96-well plate and cultured for four days. The CCK-8 reagent (APExBIO) was then added into each well, and the plate was read at OD 450 nm using a microplate reader (BioTek). Relative cell growth was calculated by normalizing the OD value of knockout cells to that of the control cells. To assess long-term effects, a colony formation assay was performed for 10–15 days. In detail, 200 cells (HCT116 and RKO) or 500 cells (SW480) were seeded into a 12-well plate. Cell colonies were visualized by crystal violet staining. The colonies were photographed and counted.

CRISPR/Cas9-mediated gene knockout

For gene knockout, the lentiviral vector LentiCRISPR v2 (Addgene) was used to clone the sgRNA of each target gene. Lentiviruses were amplified in HEK 293T cells. Briefly, the lentiviral vector and the packaging vectors, psPAX2 and pVSV-G, were transfected into 293T cells with Lipofectamine 2000 (Thermo Fisher Scientific). Viral supernatants were collected by centrifugation at 48 and 96 hours after transfection. CRC lines were infected with the corresponding lentiviruses for 24 h. Cells were changed to new growth medium for another 24 h and then selected with puromycin (2 μg/ml, Sigma-Aldrich) for 5 days. The CRISPR/Cas9-edited cells were verified by western blotting of the target protein before being used in the functional assays.

Western blotting

Total proteins were extracted from the corresponding cells lysed with RIPA buffer plus protease and phosphatase inhibitors (Roche). Twenty μg of total proteins were separated by SDS-PAGE, transferred to a PVDF membrane, and probed with the indicated primary antibodies. Anti-rabbit or anti-mouse secondary antibody was subsequently applied, and western blot signals were detected using enhanced chemiluminescence substrate (West Femto, Thermo Fisher Scientific). β-Tubulin was used as the loading control. Band intensity was calculated with NIH ImageJ software.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Acknowledgements

This research was primarily supported by US National Institutes of Health (NIH) grants R37CA227130, R01CA269589, and R01CA297582 (to X.G.). This study was also supported by the Spanish Association Against Cancer (AECC) Scientific Foundation grant GCTRA18022MORE (to V.M.); the Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), action Genrisk (to V.M.); and the Instituto de Salud Carlos III (ISCIII), “Programa FORTALECE del Ministerio de Ciencia e Innovación” (FORT23/00032) (to V.M.). We acknowledge Dr. Chong Wu from MD Anderson for the valuable discussions regarding simulation analysis.

Author contributions

X.G. conceived and designed the study. Z.C. and Q.L. conducted the main genetic and statistical analyses, and W.S. performed the experimental validation. Q.L., C. L., K.S.L., and X.G. carried out the single-cell RNA-seq analysis and interpretation. Z.C., W.S., Q.L., and X.G. wrote the manuscript with feedback from all other authors, including C.L., W.W., J.R.H., P.J.L., C.F.R., M.N.T., M.T., S.L.S., V.M., M.D., C.D., F.M.-N., Q.C., J.W., J.S., S.K., C.T., W.H.J., X.S., J.L., J.G., J.K., A.S., K.M., S.H.J., K.J.J., N.W., D.H.K., J.P., G.Y., M.H.S., Z.R., J.H.O., I.O., Y.A., Y.T.G., Z.Z.P., Y.K., L.V.K., L.W., B.L., K.M., X.O.S., L.H., M.G.D., S.B.G., R.H., I.T., L.L., K.S.L., V.M., G.C., U.P., and W.Z. All authors reviewed and approved the final manuscript. X.G. supervised the study and acquired funding.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Data availability

The GWAS summary statistics used in this study are publicly available from the GWAS Catalog under accession GCST90129505. The RNA-seq data and genotype data of subjects of East Asian ancestry from the ACCC were generated as part of this study and have been deposited in the NIH database of Genotypes and Phenotypes (dbGaP) under accession number phs002813.v1.p1. Access to these data is controlled to protect participant privacy in accordance with NIH policies, and the data are freely available to qualified academic researchers through dbGaP. Data access requests are reviewed by the appropriate Data Access Committee, with an expected response timeframe of approximately two months, based on estimates provided by dbGaP. The data from the GTEx (version 8) used in this study are available through dbGaP under accession phs000424.v8.p2. Transcriptome and genotype data, along with sample covariates, from the BarcUVa-Seq project are available through dbGaP under accession phs003338.v1.p1. Single-cell RNA-sequencing datasets were obtained from the Colorectal Molecular Atlas Project (COLON MAP) through the National Cancer Institute-funded Human Tumor Atlas Network (HTAN) Data Portal. The 218 TF ChIP-seq datasets were obtained from the Cistrome database. Drug and compound data were retrieved from publicly available databases, including ChEMBL: https://www.ebi.ac.uk/chembl/; Therapeutic Target Database: https://db.idrblab.net/ttd/; Open Targets: https://www.opentargets.org/; DrugBank: https://go.drugbank.com/. Summary-level results generated in this study, including all TWAS, AS-WAS, and APA-WAS findings, are provided in Supplementary Data 5, 8, and 10. Source data are provided with this paper.

Code availability

All analysis code used in this study is publicly available in the GitHub repository (https://github.com/XingyiGuo/sTF-TWAS-CRC) or a Code Ocean capsule (10.24433/CO.1946831.v1).

Competing interests

Luc Van Kaer is a member of the scientific advisory board of Isu Abxis Co., Ltd. (South Korea). Stephen B. Gruber is co-founder of Brogent International LLC and reports research grant support from AbbVie, AstraZeneca, Eisai, HaloDx, Invitae, and Johnson & Johnson (all unrelated to this work). The other authors declare no competing interests.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

These authors contributed equally: Zhishan Chen, Wenqiang Song, Qing Li.

Supplementary information

The online version contains supplementary material available at 10.1038/s41467-025-68127-z.

References

  • 1. Chen, Z. et al. Fine-mapping analysis including over 254,000 East Asian and European descendants identifies 136 putative colorectal cancer susceptibility genes. Nat. Commun. 15, 3557 (2024).
  • 2. Law, P. J. et al. Systematic prioritization of functional variants and effector genes underlying colorectal cancer risk. Nat. Genet. 56, 2104–2111 (2024).
  • 3. Cowper-Sal lari, R. et al. Breast cancer risk-associated SNPs modulate the affinity of chromatin for FOXA1 and alter gene expression. Nat. Genet. 44, 1191–1198 (2012).
  • 4. Li, Q. et al. Integrative eQTL-based analyses reveal the biology of breast cancer risk loci. Cell 152, 633–641 (2013).
  • 5. Castro, M. A. et al. Regulators of genetic risk of breast cancer identified by integrative network analysis. Nat. Genet. 48, 12–21 (2016).
  • 6. Iotchkova, V. et al. GARFIELD classifies disease-relevant genomic features through integration of functional annotations with association signals. Nat. Genet. 51, 343–353 (2019).
  • 7. Wen, W. et al. Genetic variations of DNA bindings of FOXA1 and co-factors in breast cancer susceptibility. Nat. Commun. 12, 5318 (2021).
  • 8. Fernandez-Rozadilla, C. et al. Deciphering colorectal cancer genetics through multi-omic analysis of 100,204 cases and 154,587 controls of European and east Asian ancestries. Nat. Genet. 55, 89–99 (2023).
  • 9. Chen, Z. et al. Identifying Putative Susceptibility Genes and Evaluating Their Associations with Somatic Mutations in Human Cancers. Am. J. Hum. Genet. 105, 477–492 (2019).
  • 10. Yuan, Y. et al. Multi-omics analysis to identify susceptibility genes for colorectal cancer. Hum. Mol. Genet. 30, 321–330 (2021).
  • 11. Yao, L., Tak, Y. G., Berman, B. P. & Farnham, P. J. Functional annotation of colon cancer risk SNPs. Nat. Commun. 5, 5114 (2014).
  • 12. Moreno, V. et al. Colon-specific eQTL analysis to inform on functional SNPs. Br. J. Cancer 119, 971–977 (2018).
  • 13. Gamazon, E. R. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat. Genet. 47, 1091–1098 (2015).
  • 14. Gusev, A. et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat. Genet. 48, 245–252 (2016).
  • 15. Guo, X. et al. Identifying novel susceptibility genes for colorectal cancer risk from a transcriptome-wide association study of 125,478 subjects. Gastroenterology 160, 1164–1178.e1166 (2021).
  • 16. Chen, Z. et al. Novel insights into genetic susceptibility for colorectal cancer from transcriptome-wide association and functional investigation. J. Natl. Cancer Inst. 116, 127–137 (2024).
  • 17. Mancuso, N. et al. Probabilistic fine-mapping of transcriptome-wide association studies. Nat. Genet. 51, 675–682 (2019).
  • 18. Wainberg, M. et al. Opportunities and challenges for transcriptome-wide association studies. Nat. Genet. 51, 592–599 (2019).
  • 19. Amariuta, T. et al. IMPACT: Genomic annotation of cell-state-specific regulatory elements inferred from the epigenome of bound transcription factors. Am. J. Hum. Genet. 104, 879–895 (2019).
  • 20. He, J. et al. Integrating transcription factor occupancy with transcriptome-wide association analysis identifies susceptibility genes in human cancers. Nat. Commun. 13, 7118 (2022).
  • 21. Guo, X. et al. Large-scale alternative polyadenylation-wide association studies to identify putative cancer susceptibility genes. Cancer Res. 84, 2707–2719 (2024).
  • 22. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
  • 23. Chen, B. et al. Differential pre-malignant programs and microenvironment chart distinct paths to malignancy in human colorectal polyps. Cell 184, 6262 (2021).
  • 24. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 15, 550 (2014).
  • 25. Wontakal, S. N. et al. A large gene network in immature erythroid cells is controlled by the myeloid and B cell transcriptional regulator PU.1. PLoS Genet. 7, e1001392 (2011).
  • 26. Lee, T. I. & Young, R. A. Transcriptional regulation and its misregulation in disease. Cell 152, 1237–1251 (2013).
  • 27. Saint-Andre, V. et al. Models of human core transcriptional regulatory circuitries. Genome Res. 26, 385–396 (2016).
  • 28. Vaquerizas, J. M., Kummerfeld, S. K., Teichmann, S. A. & Luscombe, N. M. A census of human transcription factors: function, expression and evolution. Nat. Rev. Genet. 10, 252–263 (2009).
  • 29. Wontakal, S. N. et al. A core erythroid transcriptional network is repressed by a master regulator of myelo-lymphoid differentiation. Proc. Natl. Acad. Sci. USA 109, 3832–3837 (2012).
  • 30. Deplancke, B., Alpern, D. & Gardeux, V. The genetics of transcription factor DNA binding variation. Cell 166, 538–554 (2016).
  • 31.Tehranchi, A. K. et al. Pooled chIP-seq links variation in transcription factor binding to complex disease risk. Cell165, 730–741 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Yan, J. et al. Systematic analysis of binding of transcription factors to noncoding variants. Nature591, 147–151 (2021). [DOI] [PMC free article] [PubMed]
  • 33.Choudhuri, A. et al. Common variants in signaling transcription-factor-binding sites drive phenotypic variability in red blood cell traits. Nat. Genet.52, 1333–1345 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Wu, Y. et al. Joint analysis of GWAS and multi-omics QTL summary statistics reveals a large fraction of GWAS signals shared with molecular phenotypes. Cell Genom.3, 100344 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Chen, E. Y. et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinform.14, 128 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Kuleshov, M. V. et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res.44, W90–W97 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Bailey, M.H. et al. Comprehensive characterization of cancer driver genes and mutations. Cell173, 371–385 (2018). [DOI] [PMC free article] [PubMed]
  • 38.Dietlein, F. et al. Identification of cancer driver genes based on nucleotide context. Nat. Genet.52, 208 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Nunes, L. et al. Prognostic genome and transcriptome signatures in colorectal cancers. Nature633, 137–146 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Cornish, A. J. et al. The genomic landscape of 2,023 colorectal cancers. Nature633, 127–136 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Ahmad, R. et al. Emerging trends in colorectal cancer: Dysregulated signaling pathways (Review). Int. J. Mol. Med.47, 10.3892/ijmm.2021.4847 (2021). [DOI] [PMC free article] [PubMed]
  • 42.Koveitypour, Z. et al. Signaling pathways involved in colorectal cancer progression. Cell Biosci.9, 97 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Patel, M., Horgan, P. G., McMillan, D. C. & Edwards, J. NF-kappaB pathways in the development and progression of colorectal cancer. Transl. Res.197, 43–56 (2018). [DOI] [PubMed] [Google Scholar]
  • 44.DeDecker, L., Coppedge, B., Avelar-Barragan, J., Karnes, W. & Whiteson, K. Microbiome distinctions between the CRC carcinogenic pathways. Gut Microbes13, 1854641 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Peck, J. W., Oberst, M., Bouker, K. B., Bowden, E. & Burbelo, P. D. The RhoA-binding protein, rhophilin-2, regulates actin cytoskeleton organization. J. Biol. Chem.277, 43924–43932 (2002). [DOI] [PubMed] [Google Scholar]
  • 46.Day, E. et al. IRS2 is a candidate driver oncogene on 13q34 in colorectal cancer. Int. J. Exp. Pathol.94, 203–211 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Gibson, S. L., Ma, Z. & Shaw, L. M. Divergent roles for IRS-1 and IRS-2 in breast cancer metastasis. Cell Cycle6, 631–637 (2007). [DOI] [PubMed] [Google Scholar]
  • 48.Porter, H. A., Perry, A., Kingsley, C., Tran, N. L. & Keegan, A. D. IRS1 is highly expressed in localized breast tumors and regulates the sensitivity of breast cancer cells to chemotherapy, while IRS2 is highly expressed in invasive breast tumors. Cancer Lett.338, 239–248 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Lee, M. S. et al. IRS2 Amplification as a predictive biomarker in response to ceritinib in small cell lung cancer. Mol. Ther. Oncolytics16, 188–196 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Liu, W. et al. Thioredoxin-1 ameliorates myosin-induced autoimmune myocarditis by suppressing chemokine expressions and leukocyte chemotaxis in mice. Circulation110, 1276–1283 (2004). [DOI] [PubMed] [Google Scholar]
  • 51.Cortes, R. et al. Enhanced reduction in oxidative stress and altered glutathione and thioredoxin system response to unsaturated fatty acid load in familial hypercholesterolemia. Clin. Biochem.47, 291–297 (2014). [DOI] [PubMed] [Google Scholar]
  • 52.Knox, C. et al. DrugBank 6.0: the DrugBank Knowledgebase for 2024. Nucleic Acids Res.52, D1265–D1275 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res.40, D1100–D1107 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Zhou, Y. et al. TTD: Therapeutic Target Database describing target druggability information. Nucleic Acids Res.52, D1465–D1477 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Ochoa, D. et al. The next-generation Open Targets Platform: reimagined, redesigned, rebuilt. Nucleic Acids Res.51, D1353–D1359 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Wu, H. et al. Resveratrol inhibits VEGF-induced angiogenesis in human endothelial cells associated with suppression of aerobic glycolysis via modulation of PKM2 nuclear translocation. Clin. Exp. Pharmacol. Physiol.45, 1265–1273 (2018). [DOI] [PubMed] [Google Scholar]
  • 57.Kim, D. S., Camacho, C. V. & Kraus, W. L. Alternate therapeutic pathways for PARP inhibitors and potential mechanisms of resistance. Exp. Mol. Med.53, 42–51 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Willer, C. J., Li, Y. & Abecasis, G. R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics26, 2190–2191 (2010). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Huyghe, J. R. et al. Discovery of common and rare genetic risk variants for colorectal cancer. Nat. Genet.51, 76–87 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Lu, Y. et al. Large-Scale Genome-Wide Association Study of East Asians Identifies Loci Associated With Risk for Colorectal Cancer. Gastroenterology156, 1455–1466 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Law, P. J. et al. Association analyses identify 31 new risk loci for colorectal cancer susceptibility. Nat. Commun.10, 2154 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Lu, Y. et al. Identification of Novel Loci and New Risk Variant in Known Loci for Colorectal Cancer Risk in East Asians. Cancer Epidemiol. Biomarkers Prev.29, 477–486 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Díez-Obrero, V. et al. Genetic effects on transcriptome profiles in colon epithelium provide functional insights for genetic risk loci. Cell. Mol. Gastroenterol. Hepatol.12, 181–197 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science369, 1318–1330 (2020). [DOI] [PMC free article] [PubMed]
  • 65.Zheng, R. et al. Cistrome Data Browser: expanded datasets and new tools for gene regulatory analysis. Nucleic Acids Res.47, D729–D735 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Genomes Project, C. et al. A global reference for human genetic variation. Nature526, 68–74 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience4, 7 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68.Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics29, 15–21 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.DeLuca, D. S. et al. RNA-SeQC: RNA-seq metrics for quality control and process optimization. Bioinformatics28, 1530–1532 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Li, Y. I. et al. Annotation-free quantification of RNA splicing using LeafCutter. Nat. Genet.50, 151–158 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Cotto, K. C. et al. Integrated analysis of genomic and transcriptomic data for the discovery of splice-associated variants in cancer. Nat. Commun.14, 1589 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc.7, 500–507 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet.48, 481–487 (2016). [DOI] [PubMed] [Google Scholar]
  • 74.Yang, J., Lee, S. H., Goddard, M. E. & Visscher, P. M. GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet.88, 76–82 (2011). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Hao, Y. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol.42, 293–304 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.Stuart, T. et al. Comprehensive integration of single-cell data. Cell177, 1888–1902 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

41467_2025_68127_MOESM2_ESM.pdf (95.1KB, pdf)

Description of Additional Supplementary Files

Supplementary Data 1-21 (6.7MB, xlsx)
Reporting Summary (110KB, pdf)
Source Data (10.2MB, xlsx)

Data Availability Statement

The GWAS summary statistics used in this study are publicly available from the GWAS Catalog under accession GCST90129505. The RNA-seq and genotype data from subjects of East Asian ancestry in the ACCC were generated as part of this study and have been deposited in the NIH database of Genotypes and Phenotypes (dbGaP) under accession number phs002813.v1.p1. Access to these data is controlled to protect participant privacy in accordance with NIH policies, and the data are freely available to qualified academic researchers through dbGaP. Data access requests are reviewed by the appropriate Data Access Committee, with an expected response time of approximately two months, based on estimates provided by dbGaP. The GTEx (version 8) data used in this study are available through dbGaP under accession phs000424.v8.p2. Transcriptome and genotype data, along with sample covariates, from the BarcUVa-Seq project are available through dbGaP under accession phs003338.v1.p1. Single-cell RNA-sequencing datasets were obtained from the Colorectal Molecular Atlas Project (COLON MAP) through the National Cancer Institute-funded Human Tumor Atlas Network (HTAN) Data Portal. The ChIP-seq data for 218 TFs were obtained from the Cistrome database. Drug and compound data were retrieved from publicly available databases, including ChEMBL (https://www.ebi.ac.uk/chembl/), the Therapeutic Target Database (https://db.idrblab.net/ttd/), Open Targets (https://www.opentargets.org/), and DrugBank (https://go.drugbank.com/). Summary-level results generated in this study, including all TWAS, AS-WAS, and APA-WAS findings, are provided in Supplementary Data 5, 8 and 10. Source data are provided with this paper.

All analysis code used in this study is publicly available in a GitHub repository (https://github.com/XingyiGuo/sTF-TWAS-CRC) and in a Code Ocean capsule (DOI: 10.24433/CO.1946831.v1).


Articles from Nature Communications are provided here courtesy of Nature Publishing Group
