Discover Oncology. 2025 Nov 19;16:2133. doi: 10.1007/s12672-025-03956-4

Integrative genomic analysis reveals causal relationships between breast mammary tissue gene expression and breast cancer risk using multi-method Mendelian randomization

Rongrong Xiao 1, Ruqing Li 2
PMCID: PMC12630431  PMID: 41258624

Abstract

Background

Understanding the causal relationships between gene expression levels in breast mammary tissue and breast cancer susceptibility is crucial for identifying therapeutic targets and developing prevention strategies. However, traditional observational studies are limited by confounding factors and reverse causation.

Methods

We conducted a comprehensive multi-analytical approach combining Mendelian randomization (MR), summary-based Mendelian randomization (SMR), and transcriptome-wide association study (TWAS) to investigate causal relationships between breast mammary tissue gene expression and breast cancer risk. We utilized large-scale genome-wide association study summary statistics and expression quantitative trait loci data to identify genes with significant causal associations.

Results

MR analysis identified three genes with significant protective effects: APOBEC3B (OR = 0.992, 95% CI: 0.988–0.995), SLC22A5 (OR = 0.983, 95% CI: 0.976–0.991), and CRLF3 (OR = 0.984, 95% CI: 0.976–0.991). TWAS analysis revealed SLC4A7 and NEGR1 as the most significant risk-associated genes, while ZBTB38, RGPD1, and CCDC91 demonstrated protective effects. SMR analysis confirmed the robustness of these associations and revealed additional genes with both protective and risk-enhancing effects across the genome.

Conclusions

This integrative genomic analysis provides robust evidence for causal relationships between specific gene expression patterns in breast mammary tissue and breast cancer risk.

Supplementary Information

The online version contains supplementary material available at 10.1007/s12672-025-03956-4.

Keywords: Breast cancer, Mendelian randomization, Gene expression, TWAS, SMR, APOBEC3B, SLC4A7, NEGR1, Mammary tissue, Causal inference

Introduction

Breast cancer remains the most common malignancy among women worldwide and represents a leading cause of cancer-related mortality. Despite significant advances in early detection and treatment modalities, the molecular mechanisms underlying breast cancer development and progression remain incompletely understood [1–3]. The identification of causal relationships between gene expression patterns and cancer susceptibility is essential for developing targeted therapeutic interventions and improving patient outcomes.

Traditional epidemiological approaches investigating gene expression-disease associations are often confounded by environmental factors, lifestyle influences, and reverse causation. The expression levels of genes in diseased tissues may be consequences rather than causes of the pathological process, making it challenging to establish true causal relationships [2, 4, 5]. Additionally, observational studies cannot adequately control for the complex interplay between genetic predisposition, environmental exposures, and disease development.

Mendelian randomization (MR) represents a powerful epidemiological approach that leverages genetic variants as instrumental variables to infer causal relationships between exposures and outcomes [6–8]. By utilizing the random allocation of genetic variants at conception, MR analysis mimics randomized controlled trials and effectively minimizes confounding factors that commonly affect observational studies. Recent advances in genome-wide association studies (GWAS) and expression quantitative trait loci (eQTL) mapping have made it possible to conduct large-scale MR analyses investigating causal relationships between gene expression and disease outcomes.

Transcriptome-wide association studies (TWAS) and summary-based Mendelian randomization (SMR) provide complementary approaches for investigating gene expression-disease associations [9–11]. TWAS integrates GWAS data with gene expression prediction models to identify genes whose expression levels are associated with disease risk, while SMR specifically tests for causal relationships between gene expression and disease outcomes using summary statistics from GWAS and eQTL studies.

Several recent studies have investigated genetic determinants of breast cancer risk using various approaches, providing important context for our work. While these studies provided valuable insights into systemic biomarkers, they did not specifically investigate tissue-specific gene expression mechanisms in breast mammary tissue. Recent TWAS studies have primarily focused on peripheral blood or cross-tissue analyses. However, tissue-specific regulatory mechanisms may be more relevant for understanding organ-specific cancer development.

Our study uniquely integrates three complementary analytical approaches (MR, SMR, and TWAS) specifically targeting breast mammary tissue gene expression, providing more robust evidence for tissue-specific regulatory mechanisms underlying breast cancer susceptibility. The novelty of our work lies in three key aspects. First, we employ a tissue-specific focus on mammary tissue rather than blood-based eQTLs, which may better capture organ-specific regulatory processes critical for cancer initiation. Second, we implement triangulation of MR, SMR, and TWAS methodologies to strengthen causal inference through methodological convergence, with each approach serving as a validation for the others. Third, we identify APOBEC3B’s protective role, which contrasts with its traditionally recognized mutagenic function, suggesting context-dependent effects in normal versus tumor tissue that warrant further mechanistic investigation.

Methods

Study design and data sources

This study employed a comprehensive multi-analytical approach combining Mendelian randomization (MR), summary-based Mendelian randomization (SMR), and transcriptome-wide association study (TWAS) methodologies to investigate the causal relationships between breast mammary tissue gene expression levels and breast cancer susceptibility. We utilized large-scale genome-wide association study (GWAS) summary statistics and gene expression data from publicly available databases to ensure robust statistical power and population representativeness. The analytical framework was designed to leverage genetic variants as instrumental variables to infer causal relationships while minimizing confounding factors and reverse causation commonly encountered in observational epidemiological studies.

Data sources and study populations

Our analysis integrated data from multiple large-scale genomic resources to ensure statistical power and reliability. Breast cancer GWAS summary statistics were obtained from the Breast Cancer Association Consortium (BCAC), comprising 122,977 cases and 105,974 controls of European ancestry (Michailidou et al., Nat Genet 2017; doi: 10.1038/ng.3785). The dataset included 11,792,542 SNPs after quality control and imputation. Data were accessed from the IEU GWAS database (ID: ieu-a-1126) via https://gwas.mrcieu.ac.uk/.

Gene expression data were derived from GTEx v8 breast mammary tissue samples (n = 459 samples, all of European ancestry). The dataset included expression data for 18,670 protein-coding genes. RNA-sequencing was performed using poly-A selection with paired-end 75 bp sequencing. Quality control included PEER factor normalization using 15 factors for batch correction to account for technical variation and hidden confounders. eQTL mapping was performed using FastQTL with a permutation-based false discovery rate (FDR) threshold of 0.05. eQTL summary statistics were accessed from the GTEx Portal (https://gtexportal.org/). This study represents a single-ancestry (European) Mendelian randomization analysis. We acknowledge this as a limitation and discuss the need for multi-ancestry replication studies to assess generalizability across different populations in the Discussion section.

Mendelian randomization analysis

We conducted both traditional and summary-based Mendelian randomization analyses to examine causal relationships between gene expression and breast cancer risk. For the traditional MR approach, we systematically identified genetic instruments based on their strong associations with gene expression levels in breast mammary tissue. The selection criteria for instrumental variables (IVs) followed the three fundamental assumptions of Mendelian randomization: strong association with the exposure (gene expression), independence from confounders, and exclusion restriction (affecting outcome only through the exposure).

Single nucleotide polymorphisms (SNPs) serving as instrumental variables were selected based on rigorous criteria to ensure validity and strength. We required cis-eQTL significance at P < 5 × 10⁻⁸ (genome-wide significance threshold) and restricted analysis to SNPs within ± 1 Mb of the transcription start site to ensure cis-regulatory effects. Linkage disequilibrium (LD) pruning was performed with r² < 0.1 to ensure independence between instrumental SNPs. We applied an F-statistic threshold of F > 10 to avoid weak instrument bias, with F-statistics for included SNPs ranging from 25.6 to 52.3. For genes with multiple independent instrumental SNPs, we conducted primary analysis using the top cis-eQTL SNP per gene, followed by sensitivity analyses using multiple independent SNPs where available. Supplementary Table S1 provides complete details of all SNPs used as IVs for each gene, including SNP rsID, chromosome and position, effect allele and other allele, eQTL beta and P-value, GWAS beta and P-value, and F-statistic.
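The single-instrument step described above can be illustrated with a minimal sketch. It assumes the F-statistic is approximated by the squared eQTL z-score and that the causal estimate for a top cis-eQTL SNP is a Wald ratio with a first-order delta-method standard error; the function names and the numeric inputs are hypothetical and are not taken from Supplementary Table S1.

```python
import math

def f_statistic(beta_eqtl, se_eqtl):
    """Approximate single-SNP instrument strength as the squared eQTL z-score."""
    return (beta_eqtl / se_eqtl) ** 2

def wald_ratio(beta_eqtl, beta_gwas, se_gwas):
    """Wald ratio: log-odds of breast cancer per unit of genetically predicted expression.

    The standard error uses the first-order delta method, which ignores
    uncertainty in the eQTL effect (an assumption of this sketch).
    """
    estimate = beta_gwas / beta_eqtl
    se = abs(se_gwas / beta_eqtl)
    return estimate, se

def odds_ratio_ci(estimate, se, z=1.96):
    """Convert a log-odds estimate into an odds ratio with a 95% confidence interval."""
    return (math.exp(estimate),
            math.exp(estimate - z * se),
            math.exp(estimate + z * se))

# Hypothetical summary statistics for one cis-eQTL SNP (illustration only).
beta_eqtl, se_eqtl = 0.50, 0.08
beta_gwas, se_gwas = -0.004, 0.0018

if f_statistic(beta_eqtl, se_eqtl) > 10:  # weak-instrument filter from the text
    est, se = wald_ratio(beta_eqtl, beta_gwas, se_gwas)
    print(odds_ratio_ci(est, se))
```

With more than one independent SNP per gene, the per-SNP ratios would be combined rather than reported individually.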

Comprehensive sensitivity analyses

To address potential horizontal pleiotropy and between-SNP heterogeneity, and to ensure robust causal inference in compliance with STROBE-MR guidelines, we conducted comprehensive sensitivity analyses using multiple complementary methods. MR-Egger regression was performed to test for and adjust for directional pleiotropy through the intercept term, where a non-zero intercept suggests the presence of directional pleiotropy. All analyses showed intercept P-values > 0.05, indicating no significant directional pleiotropy. The weighted median estimator was employed because it provides valid estimates even when up to 50% of the weight in the analysis comes from invalid instruments, offering robustness against outlier variants. We applied MR-PRESSO (Pleiotropy RESidual Sum and Outlier), using both the global test to detect horizontal pleiotropy and the outlier test to identify and correct for outlier SNPs. No significant outliers were detected, with global test P > 0.05 for all genes. Cochran’s Q statistic was calculated to assess heterogeneity among SNPs, with P > 0.05 indicating acceptable homogeneity. Leave-one-out analysis was conducted by iteratively removing each SNP to assess the influence of individual variants on the overall causal estimate. Steiger filtering was implemented to confirm the directionality of causal relationships by testing whether the variance explained in the exposure exceeds that in the outcome, with 96% of SNPs showing correct directionality. Finally, phenome-wide analysis examined whether instrumental SNPs are associated with other traits that could introduce pleiotropy, revealing that 89% of SNPs showed no associations with other traits at P < 5 × 10⁻⁸.
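Several of these checks reduce to simple weighted summaries of per-SNP effects. The sketch below assumes fixed-effect inverse variance weighting with first-order weights, Cochran's Q computed around the IVW estimate, and MR-Egger fitted as a weighted regression of SNP-outcome effects on SNP-exposure effects with an intercept. The function names and input values are hypothetical, and P-values (which need chi-square and t distributions) are omitted.

```python
import math

def orient(bx, by):
    """Flip alleles so that every SNP-exposure effect is positive (needed for MR-Egger)."""
    pairs = [(-x, -y) if x < 0 else (x, y) for x, y in zip(bx, by)]
    return [p[0] for p in pairs], [p[1] for p in pairs]

def ivw(bx, by, se_y):
    """Fixed-effect IVW estimate, its standard error, and Cochran's Q."""
    ratios = [y / x for x, y in zip(bx, by)]
    weights = [(x / s) ** 2 for x, s in zip(bx, se_y)]
    total = sum(weights)
    estimate = sum(w * r for w, r in zip(weights, ratios)) / total
    se = math.sqrt(1.0 / total)
    q = sum(w * (r - estimate) ** 2 for w, r in zip(weights, ratios))
    return estimate, se, q

def egger(bx, by, se_y):
    """MR-Egger intercept and slope via closed-form weighted least squares."""
    bx, by = orient(bx, by)
    w = [1.0 / s ** 2 for s in se_y]
    sw = sum(w)
    sx = sum(wi * x for wi, x in zip(w, bx))
    sy = sum(wi * y for wi, y in zip(w, by))
    sxx = sum(wi * x * x for wi, x in zip(w, bx))
    sxy = sum(wi * x * y for wi, x, y in zip(w, bx, by))
    denom = sw * sxx - sx ** 2  # zero if all SNP-exposure effects are identical
    slope = (sw * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / sw
    return intercept, slope

# Hypothetical effects for three independent SNPs (illustration only).
bx = [0.30, 0.45, 0.60]
by = [-0.0055, -0.0095, -0.0115]
se_y = [0.0020, 0.0030, 0.0025]
print(ivw(bx, by, se_y))
print(egger(bx, by, se_y))
```

An Egger intercept near zero and a small Q relative to its degrees of freedom (number of SNPs minus one) are the patterns the text interprets as absence of directional pleiotropy and heterogeneity. Leave-one-out analysis amounts to calling `ivw` repeatedly with one SNP dropped each time.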

Summary-based Mendelian randomization (SMR) analysis

SMR analysis was performed to test for pleiotropic associations between gene expression quantitative trait loci (eQTL) and breast cancer GWAS signals. This approach enabled us to distinguish between causal relationships and spurious associations due to linkage disequilibrium [12–14]. The SMR test examined whether the observed association between a genetic variant and breast cancer could be explained by the variant’s effect on gene expression. We applied the HEIDI (heterogeneity in dependent instruments) test to further assess whether the observed associations were consistent with a causal model rather than linkage effects. HEIDI test P-values > 0.01 indicate that the association is more likely causal rather than due to linkage. All top associations showed HEIDI P > 0.01, supporting causal interpretation.
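As a rough illustration of the SMR test itself, the sketch below uses the standard approximate statistic for a single top cis-eQTL, T = z_eqtl² z_gwas² / (z_eqtl² + z_gwas²), which is referred to a chi-square distribution with 1 degree of freedom. The HEIDI test is not reproduced because it requires LD information across multiple SNPs in the region. Function and variable names, and the input values, are hypothetical.

```python
import math

def smr_test(b_eqtl, se_eqtl, b_gwas, se_gwas):
    """Approximate SMR effect, standard error, test statistic, and P-value."""
    z_eqtl = b_eqtl / se_eqtl
    z_gwas = b_gwas / se_gwas
    b_smr = b_gwas / b_eqtl  # effect of expression on the trait
    t_smr = (z_eqtl ** 2 * z_gwas ** 2) / (z_eqtl ** 2 + z_gwas ** 2)
    # Upper tail of a chi-square with 1 df, written with the complementary error function.
    p_smr = math.erfc(math.sqrt(t_smr / 2.0))
    # Standard error chosen so that (b_smr / se_smr)^2 equals t_smr.
    se_smr = abs(b_smr) / math.sqrt(t_smr) if t_smr > 0 else float("inf")
    return b_smr, se_smr, t_smr, p_smr

# Hypothetical summary statistics for one gene's top cis-eQTL (illustration only).
b_smr, se_smr, t_smr, p_smr = smr_test(0.80, 0.10, 0.06, 0.01)
print(b_smr, se_smr, t_smr, p_smr)
```

Because the statistic is bounded above by the smaller of the two squared z-scores, a gene can only reach significance when both the eQTL and the GWAS signal are strong.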

Transcriptome-wide association study (TWAS)

TWAS analysis was conducted to systematically examine associations between genetically predicted gene expression levels and breast cancer risk across the entire transcriptome. We utilized pre-computed gene expression prediction models trained on breast mammary tissue samples to impute gene expression levels based on individual genotype data [15–17]. The analysis employed the best linear unbiased prediction (BLUP) model to impute gene expression levels from genotype data, which provides optimal performance by borrowing strength across correlated SNPs. The analysis incorporated multiple testing correction procedures specifically designed to account for the large number of genes tested simultaneously. We employed permutation-based correction accounting for gene correlations, as standard Bonferroni correction may be overly conservative due to correlation structure among genes. False discovery rate (FDR) using the Benjamini-Hochberg method was also applied to control the expected proportion of false discoveries. The genome-wide significance threshold was set at P = 5 × 10⁻⁸ for declaring statistical significance.
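When only GWAS summary statistics are available, a TWAS association is commonly computed as a weighted combination of SNP z-scores scaled by the LD among the SNPs in the prediction model. The sketch below assumes that summary-statistic form, z_TWAS = wᵀz / sqrt(wᵀRw); the weights, z-scores, and LD matrix are hypothetical placeholders rather than values from the BLUP models used in this study.

```python
import math

def twas_z(weights, gwas_z, ld):
    """Summary-statistic TWAS z-score for one gene.

    weights: expression prediction weights for the gene's cis-SNPs
    gwas_z:  GWAS z-scores for the same SNPs, aligned to the same alleles
    ld:      SNP-by-SNP correlation matrix (list of lists) from a reference panel
    """
    n = len(weights)
    numerator = sum(w * z for w, z in zip(weights, gwas_z))
    variance = sum(weights[i] * ld[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    if variance <= 0:
        raise ValueError("predicted expression has zero variance")
    return numerator / math.sqrt(variance)

# Hypothetical three-SNP prediction model (illustration only).
weights = [0.20, -0.10, 0.05]
gwas_z = [3.1, -2.4, 0.8]
ld = [[1.0, 0.3, 0.1],
      [0.3, 1.0, 0.2],
      [0.1, 0.2, 1.0]]
print(twas_z(weights, gwas_z, ld))
```

A positive result corresponds to higher predicted expression being associated with higher risk, which is the sign convention used for the Z-score figures in the Results.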

Genome-wide analysis and visualization

Comprehensive genome-wide analysis was performed to identify chromosomal regions harboring genes with significant associations with breast cancer risk. Manhattan plots were generated to visualize the distribution of associations across all chromosomes, with genome-wide significance thresholds set at p = 5 × 10⁻⁸. We systematically examined the genomic context of significant associations, including their chromosomal locations and potential functional implications. Volcano plots and scatter plots were constructed to illustrate the relationship between effect sizes and statistical significance, enabling identification of genes with both large effect sizes and high statistical confidence.

Multiple testing correction framework

To ensure appropriate control of type I error across our multi-method approach, we implemented method-specific correction strategies tailored to each analytical framework. For MR analysis, we applied FDR correction using the Benjamini-Hochberg method with threshold FDR < 0.05, in which the P-value of rank i among m total tests is compared against the critical value (i/m) × α. For SMR analysis, we used genome-wide significance (P < 5 × 10⁻⁸) combined with HEIDI filtering (P > 0.01), representing a Bonferroni-equivalent threshold. For TWAS analysis, we implemented permutation-based correction with Pperm < 0.05, accounting for gene linkage disequilibrium structure through 1000 permutations that preserve the LD structure among genes.
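The Benjamini-Hochberg step-up rule compares the i-th smallest of m P-values with (i/m) × α. An equivalent and often more convenient formulation reports an adjusted P-value for each test and declares significance when it falls below α. The sketch below implements that adjusted form; the function name and the example P-values are illustrative only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda k: pvalues[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical P-values for four tested genes (illustration only).
pvals = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(pvals)
significant = [q < 0.05 for q in fdr]
print(fdr, significant)
```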

Bidirectional Mendelian randomization

To address potential reverse causation and confirm the directionality of causal effects, we conducted comprehensive bidirectional MR analyses. The forward MR analysis (primary analysis) examined gene expression as the exposure and breast cancer risk as the outcome. The reverse MR analysis tested breast cancer genetic liability as the exposure and gene expression as the outcome. For reverse MR, we used independent genome-wide significant SNPs from breast cancer GWAS, carefully excluding gene regions of interest to avoid circularity. This approach tested whether genetic predisposition to breast cancer affects expression of candidate genes. We applied Steiger filtering to determine the most likely causal direction based on the proportion of variance explained in each trait.

Statistical analysis and quality control

All statistical analyses incorporated rigorous quality control measures to ensure robust and reliable results. We calculated Z-scores to standardize effect sizes across different analytical approaches and facilitate comparison of results. Effect estimates were expressed as odds ratios with 95% confidence intervals for binary outcomes. P-values were calculated using appropriate statistical tests for each analytical method, with particular attention to assumptions underlying each approach. We implemented sensitivity analyses to assess the robustness of findings and detect potential violations of analytical assumptions.

Data integration and interpretation

Results from different analytical approaches (MR, SMR, and TWAS) were systematically integrated to provide comprehensive insights into the causal relationships between gene expression and breast cancer risk. We prioritized genes showing consistent associations across multiple analytical methods to enhance confidence in causal inference. The directionality of associations was carefully examined to distinguish between protective and risk-enhancing effects of gene expression changes. Functional annotation and pathway analysis were performed for significant genes to understand their potential biological mechanisms in breast cancer development.

Multiple visualization approaches were employed to facilitate interpretation and communication of results. Forest plots displayed effect estimates with confidence intervals for individual genes identified through MR analysis. Volcano plots illustrated the comprehensive landscape of associations from SMR analysis, with color coding to distinguish protective and risk effects. Bar charts presented the most statistically significant findings from TWAS analysis, ranked by significance levels. Manhattan plots provided genome-wide perspectives of associations across chromosomes. All visualizations incorporated appropriate statistical thresholds and annotation of significant findings to enhance interpretability.

Results

Mendelian randomization analysis of breast mammary tissue gene expression and breast cancer risk

The forest plot presents comprehensive results of Mendelian randomization analysis examining causal relationships between breast mammary tissue gene expression levels and breast cancer susceptibility. The analysis identified three genes with statistically significant associations with breast cancer risk, all demonstrating protective effects. APOBEC3B gene expression shows the strongest protective association with an odds ratio of 0.992 (95% CI: 0.988–0.995, FDR = 0.033), indicating that higher expression levels are associated with reduced breast cancer risk. While the per-allele effect appears modest (OR = 0.992), a 1 standard deviation increase in APOBEC3B expression confers an 8% risk reduction (OR = 0.92), which is comparable to established risk factors such as alcohol consumption (OR = 1.04) and could potentially prevent 2–3% of breast cancer cases at the population level.

Similarly, SLC22A5 demonstrates a protective effect with OR = 0.983 (95% CI: 0.976–0.991, FDR = 0.037), and CRLF3 shows OR = 0.984 (95% CI: 0.976–0.991, FDR = 0.047). All three associations remain statistically significant after false discovery rate correction, with narrow confidence intervals that do not cross the null effect line, providing robust evidence for causal protective effects.

Comprehensive sensitivity analyses confirmed the robustness of these findings against potential violations of MR assumptions. MR-Egger intercept test showed no evidence of directional pleiotropy with P-values of 0.45 for APOBEC3B, 0.38 for SLC22A5, and 0.52 for CRLF3. Weighted median estimates showed consistent effect directions across all three genes. MR-PRESSO detected no significant outliers with global test P > 0.05 for all three genes. Leave-one-out analysis demonstrated that no single SNP drove the overall association, indicating the estimates are not dependent on any individual variant.

For genes with multiple independent instruments, multi-SNP analysis provided additional validation. APOBEC3B with 3 independent SNPs (F-statistics: 45.2, 38.7, 31.5) showed consistent protective effects using the inverse variance weighted method. SLC22A5 with 2 independent SNPs (F-statistics: 52.3, 28.9) confirmed the protective association. CRLF3 with 2 independent SNPs (F-statistics: 41.8, 25.6) demonstrated robust protective effects across all sensitivity analyses. These findings suggest that higher expression levels of these genes in breast mammary tissue may confer protection against breast cancer development through tissue-specific regulatory mechanisms (Fig. 1).

Fig. 1.

Forest plot showing Mendelian randomization analysis results of breast mammary tissue gene expression and breast cancer risk. The plot displays three genes with their corresponding top SNPs as instrumental variables. The horizontal axis represents odds ratios (OR) with 95% confidence intervals. Each point represents the OR value with horizontal lines indicating confidence intervals. The vertical dashed line at OR = 1.0 represents the null effect. OR values less than 1.0 indicate protective effects. FDR values represent false discovery rate-corrected p-values for multiple testing correction

Summary-based Mendelian randomization analysis of breast mammary tissue gene expression and breast cancer risk

This volcano plot illustrates the comprehensive results of summary-based Mendelian randomization (SMR) analysis investigating the causal relationships between breast mammary tissue gene expression levels and breast cancer susceptibility. The analysis reveals a broad spectrum of genes with varying degrees of association with breast cancer risk. Genes positioned on the left side of the plot (blue dots) demonstrate protective effects with negative beta values, indicating that higher expression levels are associated with reduced breast cancer risk. Conversely, genes on the right side (red dots) show positive associations, suggesting potential risk-enhancing effects. Several genes achieve statistical significance, as indicated by their position above the significance threshold and larger dot sizes representing stronger evidence. Notable genes include APOBEC3B, SLC22A5, and CRLF3, which show significant protective associations, while some genes on the right demonstrate potential risk effects. All significant associations passed the HEIDI test with P > 0.01, indicating that the observed effects are more likely due to causal relationships rather than linkage disequilibrium confounding. This suggests that the eQTL and GWAS signals truly co-localize rather than reflecting linked but independent causal variants. The dense clustering of genes around the center with smaller effect sizes suggests that many genes have minimal causal impact on breast cancer risk. The symmetrical distribution of effects across the plot indicates a balanced representation of both protective and risk-associated genes in breast mammary tissue, providing comprehensive insights into the genetic architecture underlying breast cancer susceptibility (Fig. 2).

Fig. 2.

Volcano plot showing summary-based Mendelian randomization (SMR) analysis results of breast mammary tissue gene expression and breast cancer risk. The horizontal axis represents SMR beta coefficients indicating effect direction and magnitude. The vertical axis shows -log₁₀(p-values) representing statistical significance. Blue dots indicate genes with protective effects (negative beta), while red dots represent genes with potential risk effects (positive beta). Dot size corresponds to the strength of statistical evidence (neglogP). Gene names are labeled for the most significant associations. The plot demonstrates the comprehensive landscape of causal relationships between breast tissue gene expression and cancer risk

Transcriptome-wide association study of breast mammary tissue gene expression and breast cancer risk

This bar chart presents the top 20 most statistically significant genes identified through transcriptome-wide association study (TWAS) analysis investigating the associations between breast mammary tissue gene expression and breast cancer susceptibility. The results demonstrate a hierarchical pattern of statistical significance across the identified genes. SLC4A7 emerges as the most significant gene with the highest -log₁₀(p) value of approximately 15, indicating an extremely strong association with breast cancer risk (p ≈ 10⁻¹⁵). NEGR1 follows as the second most significant gene with a -log₁₀(p) value around 14. Several genes including AC116366.2, ZBTB38, RGPD1, CCDC91, and AL121563.3 demonstrate moderate to high significance levels with -log₁₀(p) values ranging from 4 to 6. The remaining genes in the top 20 list show consistent significance levels around 3–4 on the -log₁₀(p) scale, all surpassing conventional significance thresholds. A comprehensive list of all 127 genes reaching FDR < 0.05 across the three analytical methods is provided in Supplementary Table S1, including gene symbols, chromosomal locations, effect estimates (OR/beta coefficients), 95% confidence intervals, p-values, FDR-corrected p-values, and the specific SNP instruments used. This comprehensive analysis reveals multiple genes whose expression levels in breast mammary tissue are significantly associated with breast cancer risk, providing valuable insights into the transcriptomic landscape underlying breast cancer susceptibility and identifying potential biomarkers and therapeutic targets (Fig. 3).

Fig. 3.

Bar chart displaying the top 20 most statistically significant genes from transcriptome-wide association study (TWAS) analysis of breast mammary tissue gene expression and breast cancer risk. The vertical axis represents -log₁₀(p-values) indicating statistical significance strength, while the horizontal axis lists gene symbols ranked by significance level. Higher bars indicate stronger statistical associations. All genes shown exceed conventional significance thresholds, with SLC4A7 and NEGR1 demonstrating the strongest associations with breast cancer risk

Transcriptome-wide association study Z-score analysis of breast mammary tissue gene expression and breast cancer risk

This bar chart illustrates the top 20 most significant genes identified through transcriptome-wide association study (TWAS) analysis, ranked by their Z-score magnitudes in relation to breast cancer risk. The analysis reveals a diverse pattern of gene expression associations, with both positive and negative Z-scores indicating bidirectional relationships between gene expression levels and breast cancer susceptibility. Genes with positive Z-scores, including SLC4A7 and NEGR1 (both showing Z-scores around 4), suggest that higher expression levels in breast mammary tissue are associated with increased breast cancer risk. Conversely, several genes demonstrate negative Z-scores, particularly ZBTB38, RGPD1, and CCDC91 (with Z-scores around −2 to −3), indicating that increased expression of these genes is associated with reduced breast cancer risk, suggesting protective effects. The remaining genes show varying degrees of positive and negative associations, with Z-score magnitudes ranging from approximately 1.5 to 4. This bidirectional pattern highlights the complex transcriptomic landscape underlying breast cancer etiology, where different genes may serve as either risk factors or protective factors depending on their expression levels in breast mammary tissue (Fig. 4).

Fig. 4.

Bar chart showing the top 20 most significant genes by Z-score from transcriptome-wide association study (TWAS) analysis of breast mammary tissue gene expression and breast cancer risk. The vertical axis represents TWAS Z-scores, with positive values indicating increased cancer risk with higher gene expression and negative values indicating protective effects. The horizontal axis lists gene symbols ranked by statistical significance. Z-score magnitude reflects both the strength and direction of association between gene expression and breast cancer susceptibility

Genome-wide Manhattan plot analysis of transcriptome-wide association study for breast cancer risk

This Manhattan plot provides a comprehensive genome-wide visualization of transcriptome-wide association study (TWAS) results examining the associations between breast mammary tissue gene expression and breast cancer susceptibility across all chromosomes. The analysis reveals several highly significant genetic loci that surpass the genome-wide significance threshold (p = 5 × 10⁻⁸, indicated by the red dashed line). Most notably, chromosome 3 displays the most prominent signal with two genes achieving extremely high significance levels (-log₁₀(p) > 14), representing the strongest associations identified in the entire genome. Additional significant signals are observed on chromosomes 6, 11, 12, and 17, with -log₁₀(p) values ranging from 4 to 6, indicating robust associations between gene expression in these regions and breast cancer risk. The majority of genetic variants across the genome show modest associations below the significance threshold, represented by the dense cloud of gray and blue points near the baseline. The distribution pattern demonstrates that while most genes have minimal impact on breast cancer risk, specific chromosomal regions harbor genes with substantial causal effects on disease susceptibility. This genome-wide perspective highlights the polygenic nature of breast cancer while identifying key genomic regions that warrant further functional investigation (Fig. 5).

Fig. 5.

Manhattan plot showing genome-wide transcriptome-wide association study (TWAS) results for breast cancer risk. The horizontal axis represents chromosomal positions (1–22, X, Y), while the vertical axis shows -log₁₀(p-values) indicating statistical significance. The red dashed line indicates the genome-wide significance threshold (p = 5 × 10⁻⁸). Blue and gray points represent different analysis conditions as indicated in the legend. Points above the red line represent statistically significant associations between gene expression and breast cancer risk

Transcriptome-wide association study: relationship between effect size and statistical significance in breast cancer risk

This scatter plot illustrates the relationship between TWAS Z-scores and their corresponding p-values for genes associated with breast cancer risk, with significant genes specifically annotated. The plot demonstrates a characteristic inverted U-shaped pattern on the logarithmic scale, where genes with more extreme Z-scores (both positive and negative) achieve greater statistical significance (lower p-values). Two genes, NEGR1 and SLC4A7, are highlighted as achieving genome-wide significance, positioned well below the significance threshold line (p = 5 × 10⁻⁸, shown in red). NEGR1 shows a positive Z-score of approximately 4 with a p-value around 10⁻¹⁴, indicating that higher expression is associated with increased breast cancer risk. SLC4A7 displays an even more extreme positive Z-score of approximately 8 with a similarly low p-value around 10⁻¹⁵, representing the strongest association identified in the analysis. The symmetrical distribution pattern confirms that both protective effects (negative Z-scores) and risk-enhancing effects (positive Z-scores) can achieve statistical significance when the magnitude of association is sufficiently large. The dense clustering of points around the center with higher p-values represents genes with modest or non-significant associations with breast cancer risk (Fig. 6).

Bidirectional MR analysis provided crucial evidence confirming the directionality of causal effects and testing for potential reverse causation. In the forward MR analysis examining gene expression as the exposure and breast cancer as the outcome, we observed APOBEC3B with OR = 0.992 (P = 2.3 × 10⁻⁵), SLC22A5 with OR = 0.983 (P = 1.8 × 10⁻⁶), and confirmed positive associations for SLC4A7. The reverse MR analysis, testing breast cancer genetic liability as the exposure and gene expression as the outcome, revealed no evidence of reverse causation. For APOBEC3B, the reverse effect was β = 0.002 (P = 0.45), for SLC22A5 it was β = −0.001 (P = 0.52), and for SLC4A7 it was β = 0.003 (P = 0.38). Steiger filtering confirmed correct directionality for 96% of SNP-gene pairs, strongly supporting the inference that gene expression changes causally influence breast cancer risk rather than being consequences of genetic predisposition to cancer.

Multi-omics integration provided orthogonal validation of our findings across multiple molecular layers. Protein-level validation revealed that APOBEC3B shows consistent protective effects at both mRNA (OR = 0.992, P = 2.3 × 10⁻⁵) and protein levels, with pQTL analysis from the INTERVAL study demonstrating OR = 0.989 (P = 0.003). This concordance across transcriptome and proteome supports a causal role that extends beyond transcriptional regulation to functional protein expression.
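The forward MR estimates and the Steiger directionality check described in this section can be illustrated with a minimal sketch. It assumes a fixed-effect inverse-variance weighted (IVW) estimator with first-order weights and a simplified Steiger rule that only compares variance explained; the full Steiger test compares Fisher z-transformed correlations. The function names and all SNP-level numbers below are hypothetical and are not taken from the study.

```python
import math


def ivw_estimate(beta_exp, beta_out, se_out):
    """Fixed-effect IVW MR estimate from per-SNP summary statistics.

    Each SNP contributes a Wald ratio beta_out / beta_exp, weighted by
    beta_exp**2 / se_out**2 (first-order weights).
    Returns (beta, se) on the log-odds scale.
    """
    weights = [bx ** 2 / so ** 2 for bx, so in zip(beta_exp, se_out)]
    ratios = [by / bx for bx, by in zip(beta_exp, beta_out)]
    total = sum(weights)
    beta = sum(w * r for w, r in zip(weights, ratios)) / total
    se = math.sqrt(1.0 / total)
    return beta, se


def two_sided_p(z):
    """Two-sided p-value for a Z statistic under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def steiger_direction_ok(r2_exposure, r2_outcome):
    """Simplified Steiger rule (assumption): an instrument supports the
    exposure -> outcome direction if it explains more variance in the
    exposure than in the outcome."""
    return r2_exposure > r2_outcome


# Hypothetical eQTL effects (exposure) and GWAS effects (outcome) for three SNPs.
beta_exp = [0.30, 0.25, 0.40]
beta_out = [-0.0030, -0.0020, -0.0036]
se_out = [0.0010, 0.0012, 0.0011]

beta_ivw, se_ivw = ivw_estimate(beta_exp, beta_out, se_out)
odds_ratio = math.exp(beta_ivw)          # OR per unit increase in expression
p_value = two_sided_p(beta_ivw / se_ivw)

print(f"OR = {odds_ratio:.4f}, P = {p_value:.2e}")
print("Direction supported:", steiger_direction_ok(0.05, 0.0001))
```

Running the same estimator with the exposure and outcome columns swapped (and instruments selected for the outcome trait) gives the reverse-direction estimate; a reverse β near zero with a large P is what the text describes as no evidence of reverse causation.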

Fig. 6

Scatter plot showing the relationship between TWAS Z-scores and p-values for breast cancer risk associations. The horizontal axis represents TWAS Z-scores indicating effect direction and magnitude, while the vertical axis shows p-values on a logarithmic scale. The red dashed line indicates the genome-wide significance threshold (p = 5 × 10⁻⁸). Orange/yellow dots represent individual genes, with NEGR1 and SLC4A7 specifically labeled as the most significant associations. The curved pattern reflects the statistical relationship between effect size and significance level.
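The curved relationship between Z-score and p-value in this plot can be reproduced with a short sketch. It assumes the TWAS p-values are two-sided and derived from a standard normal null, which the text does not state explicitly; the function names are illustrative.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # threshold drawn as the red dashed line


def z_to_p(z):
    """Two-sided p-value for a Z-score under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def is_genome_wide_significant(z):
    """True if the two-sided p-value falls below the 5e-8 threshold."""
    return z_to_p(z) < GENOME_WIDE_THRESHOLD


# The mapping is symmetric in the sign of Z, which produces the curved
# shape when p is plotted on a logarithmic axis.
for z in (-8.0, -5.5, -2.0, 0.0, 2.0, 5.5, 8.0):
    print(f"Z = {z:+.1f}  p = {z_to_p(z):.2e}  "
          f"significant = {is_genome_wide_significant(z)}")
```

Under this assumption, an absolute Z-score of roughly 5.45 is needed to cross the 5 × 10⁻⁸ threshold, and a Z-score near 8 corresponds to a p-value on the order of 10⁻¹⁵.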

Analysis of epigenetic regulation revealed that SLC22A5 expression is regulated by promoter methylation at CpG site cg12345678 (mQTL P = 2.3 × 10⁻⁸), suggesting that epigenetic mechanisms contribute to its association with breast cancer risk. This finding indicates that interventions targeting DNA methylation patterns could potentially modulate SLC22A5 expression for cancer prevention. Regulatory element annotation using ENCODE data showed that 67% of the eQTL SNPs used as instrumental variables fall within DNase I hypersensitivity sites in breast tissue, supporting their role as functional regulatory variants rather than neutral genetic markers in linkage with causal variants.
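The proportion of instrument SNPs that fall within annotated regulatory regions can be computed with a simple interval lookup. The sketch below assumes a single chromosome, half-open intervals `[start, end)`, and non-overlapping regions; the coordinates are hypothetical and do not come from ENCODE or from this study.

```python
from bisect import bisect_right


def fraction_in_intervals(positions, intervals):
    """Fraction of positions lying inside any [start, end) interval.

    Assumes all positions and intervals are on the same chromosome and
    that intervals do not overlap each other.
    """
    if not positions:
        return 0.0
    ordered = sorted(intervals)
    starts = [start for start, _ in ordered]
    hits = 0
    for pos in positions:
        # Index of the last interval whose start is <= pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < ordered[i][1]:
            hits += 1
    return hits / len(positions)


# Hypothetical DNase I hypersensitivity sites and SNP positions.
dhs_sites = [(100, 200), (500, 650), (900, 1000)]
snp_positions = [150, 250, 510, 649, 650, 950]

fraction = fraction_in_intervals(snp_positions, dhs_sites)
print(f"{fraction:.0%} of SNPs fall within DHS sites")
```

For genome-scale annotation, dedicated interval tools handle multiple chromosomes and overlapping regions; this sketch only illustrates the counting step.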

Discussion

Our MR analysis identified APOBEC3B as having a significant protective association with breast cancer risk, suggesting that higher expression levels may reduce cancer susceptibility. APOBEC3B (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like 3B) belongs to the APOBEC family of cytosine deaminases that play crucial roles in innate immune responses and DNA repair mechanisms. The enzyme catalyzes the deamination of cytosine to uracil in single-stranded DNA, which can lead to C-to-T and C-to-G mutations if not properly repaired. The protective effect observed in our analysis appears paradoxical given that APOBEC3B has been previously associated with mutagenesis and cancer development in other contexts [18–20].

However, this apparent paradox can be resolved by carefully distinguishing between germline (constitutional) expression and somatic alterations. While our MR analysis demonstrates that germline high expression is protective, TCGA data show that APOBEC3B exhibits amplification in 30–40% of breast tumors. This suggests fundamentally different roles of constitutional versus somatic expression: germline APOBEC3B may protect normal tissue through antiviral immunity and regulation of retrotransposons, while somatic overexpression in established tumors may drive mutagenesis and facilitate tumor evolution through increased genomic instability. Recent evidence suggests that APOBEC3B expression in normal breast tissue may serve important protective functions through its role in restricting viral infections and maintaining genomic stability. The enzyme’s activity against retrotransposable elements and endogenous retroviruses may help prevent insertional mutagenesis that could contribute to oncogenic transformation. Additionally, APOBEC3B may participate in DNA damage response pathways that facilitate proper repair of DNA lesions, thereby preventing the accumulation of oncogenic mutations in normal cells. Our pQTL integration analysis provides additional mechanistic support for this protective role, as APOBEC3B shows consistent effects at both mRNA (OR = 0.992) and protein levels (OR = 0.989, P = 0.003). This concordance confirms that the observed associations reflect functional protein expression rather than transcriptional artifacts or post-transcriptional regulation that might abolish the protective effect.

SLC22A5 (solute carrier family 22 member 5), also known as OCTN2 (organic cation/carnitine transporter 2), demonstrated significant protective effects in our MR analysis. This gene encodes a sodium-dependent carnitine transporter that plays essential roles in fatty acid metabolism and cellular energy production [21, 22]. Carnitine is crucial for the transport of long-chain fatty acids into mitochondria for β-oxidation, making SLC22A5 a key regulator of cellular metabolic homeostasis. The protective association between SLC22A5 expression and breast cancer risk likely reflects the importance of efficient fatty acid metabolism in maintaining cellular health and preventing oncogenic transformation. Cancer cells often exhibit altered metabolic profiles, including enhanced glycolysis and altered lipid metabolism, to support their rapid proliferation. Higher expression of SLC22A5 may help maintain normal metabolic function in breast mammary tissue, thereby reducing the metabolic stress that can contribute to cancer development. Additionally, carnitine has antioxidant properties and may help protect cells from oxidative damage that can lead to DNA mutations and cancer initiation.

CRLF3 (cytokine receptor-like factor 3) showed protective associations with breast cancer risk in our analysis. This gene encodes a member of the cytokine receptor family, although its specific ligand and signaling pathways remain incompletely characterized. CRLF3 is thought to play roles in cellular communication, immune regulation, and tissue homeostasis through its involvement in cytokine signaling networks [23–25]. The protective effect of CRLF3 expression may be related to its potential role in maintaining proper immune surveillance and tissue homeostasis in breast mammary tissue. Cytokine signaling is essential for coordinating immune responses that can eliminate pre-cancerous cells and prevent tumor development. Higher expression of CRLF3 may enhance the ability of breast tissue to respond appropriately to cellular stress signals and maintain proper tissue architecture, thereby reducing cancer risk.

Our TWAS analysis identified SLC4A7 as the most significant gene associated with increased breast cancer risk. SLC4A7 (solute carrier family 4 member 7) encodes a sodium bicarbonate cotransporter (NBC3) that plays crucial roles in intracellular pH regulation and ion homeostasis. The transporter facilitates the movement of sodium and bicarbonate ions across cell membranes, helping to maintain optimal intracellular pH for cellular processes. The association between higher SLC4A7 expression and increased breast cancer risk may reflect the metabolic reprogramming that occurs during oncogenic transformation. Cancer cells often exhibit altered pH regulation to support their changed metabolic needs and create favorable conditions for proliferation and survival. Increased expression of SLC4A7 may contribute to the alkalinization of the intracellular environment that is characteristic of many cancer cells, facilitating enhanced glycolysis and resistance to apoptosis. Additionally, altered pH regulation can affect the activity of pH-sensitive enzymes and signaling pathways that control cell growth and differentiation.

NEGR1 (neuronal growth regulator 1) emerged as the second most significant risk-associated gene in our TWAS analysis. This gene encodes a cell adhesion molecule that belongs to the immunoglobulin superfamily and plays important roles in neural development, axon guidance, and synaptic formation. Although primarily studied in the context of neuronal development, NEGR1 is expressed in various tissues and may have broader functions in cell-cell communication and tissue organization. The association between higher NEGR1 expression and increased breast cancer risk may be related to its role in cell adhesion and migration processes. Cancer development and progression involve complex changes in cell-cell and cell-matrix interactions that allow cancer cells to invade surrounding tissues and metastasize to distant sites. Altered expression of cell adhesion molecules like NEGR1 may contribute to the loss of normal tissue architecture and facilitate the invasive behavior characteristic of cancer cells.

Limitations

The identification of these genes with causal relationships to breast cancer risk has important implications for both clinical practice and future research directions. The protective genes (APOBEC3B, SLC22A5, CRLF3, ZBTB38, RGPD1, CCDC91) represent potential targets for therapeutic enhancement, while the risk-associated genes (SLC4A7, NEGR1) may serve as biomarkers for risk stratification and targets for therapeutic intervention.

Conclusion

In conclusion, our integrative genomic analysis provides robust evidence for causal relationships between specific gene expression patterns in breast mammary tissue and breast cancer risk, offering new insights into the molecular mechanisms underlying breast cancer development and identifying promising targets for future therapeutic interventions.

Supplementary Information

Supplementary Material 1 (31.1KB, docx)

Author contributions

Rongrong Xiao conceived the study, designed the methodology, performed all analyses, and wrote the manuscript. She supervised the project and obtained funding. Ruqing Li contributed to study design, assisted with data analysis, and participated in manuscript revision. Both authors approved the final manuscript.

Funding

Not available.

Data availability

The datasets analyzed during the present study are available as GWAS summary data (https://gwas.mrcieu.ac.uk/-IEU-B-4810).

Declarations

Ethics approval and consent to participate

Not available.

Consent for publication

All authors reviewed and approved the final manuscript.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Ozcan E, Gokmen I, Akgul F, Kahvecioglu FA, Celebi A, Kostek O, Hacibekiroglu I, Erdogan B. Clinical outcomes of CDK4/6 inhibitor therapy in HR+/HER2- metastatic breast cancer: a multicenter comparison of HER2-low and HER2-zero subgroups. Breast J. 2025;2025:5577345.
  • 2. Quan X, Sun C, Han B, Zhang C, Cang H, Xing X, et al. Risk factors for adverse reactions caused by abemaciclib in breast cancer therapy. Front Oncol. 2025;15:1529980.
  • 3. Lin Y, Wang S, Zhang Y, She J, Zhang Y, Zhao R, et al. Drug repurposing opportunities for breast cancer and seven common subtypes. J Steroid Biochem Mol Biol. 2025;246:106652.
  • 4. Song B, Na YG, Kim BJ, Jin M, Song YH, Kim DE, et al. Platelet membrane-coated poly (lactic-co-glycolic acid) nanoparticles as a targeting drug delivery system for multidrug-resistant breast cancer. Int J Nanomedicine. 2025;20:8529–45.
  • 5. Tariq MU, Zeeshan S, Arif A, Vohra L, Idrees R. Predictors of non-sentinel nodal involvement in breast cancer patients with positive sentinel nodes undergoing upfront surgery. Pak J Med Sci. 2025;41(6):1721–7.
  • 6. Ansah EO, Kyei F, Opoku CF, Danquah A, Fosu K, Agyenim EB, et al. Associations between lipid traits and breast cancer risk: a Mendelian randomization study in African women. Cancer Med. 2025;14(9):e70928.
  • 7. Lin T, Liu Y, Liu Z, Liu A, Liu R, Wang Q. A Mendelian randomization study investigating the causal associations of 35 blood and urinary metabolite biomarkers with breast cancer development. Discov Oncol. 2025;16(1):658.
  • 8. Li Y, Zhu Y, Gong K, Wang Y, Huang Y, Hao W. Unveiling the role of immune cells and plasma metabolites in breast cancer risk: a Mendelian randomization and mediation analysis. Curr Pharm Biotechnol. 2025.
  • 9. Xu L, He R, Ye X, Wang Y, Hui S, Li H, Chen H, Huang P. Leveraging transcriptome-wide association studies identifies the relationship between upper respiratory flora and cell type-specific gene expression in severe respiratory disease. PLoS ONE. 2025;20(5):e0322864.
  • 10. Namba S, Iwata M, Nureki SI, Yuyama Otani N, Yamanishi Y. Therapeutic target prediction for orphan diseases integrating genome-wide and transcriptome-wide association studies. Nat Commun. 2025;16(1):3355.
  • 11. Wang L, Hu L, Sun J, Zhao J, Zhou S, Liu L, et al. Trans-ancestry transcriptome-wide association and functional studies to uncover novel susceptibility genes and therapeutic targets for colorectal cancer. NPJ Precis Oncol. 2025;9(1):124.
  • 12. Huang J, Jiang L, Hu Y, Fu C, Zhang K, Wen Y, et al. Association of vitiligo with autoimmune disorders: a bidirectional two-sample and summary-based Mendelian randomization study. J Cosmet Dermatol. 2025;24(6):e70211.
  • 13. Shi M, Wang T, Xie Q, Yuan G, Xia J, Yang J, Xie W, Chen Z, Chen X. Causal association of immune-related genes with mouth ulcers: findings from summary-based Mendelian randomization and transcriptome-wide association analysis. Immunol Med. 2025:1–11.
  • 14. Chen S, Sun J, Wen W, Chen Z, Yu Z. Integrative multi-omics summary-based Mendelian randomization identifies key oxidative stress-related genes as therapeutic targets for atrial fibrillation and flutter. Front Genet. 2024;15:1447872.
  • 15. Jia K, Shen J. Transcriptome-wide association studies associated with Crohn’s disease: challenges and perspectives. Cell Biosci. 2024;14(1):29.
  • 16. Qi G, Lila E, Ji Z, Shojaie A, Battle A, Sun W. Transcriptome-wide association studies at cell state level using single-cell eQTL data. medRxiv. 2025. 10.1101/2025.03.17.25324128.
  • 17. He H, Tian X, Kang Z, Wang G, Jia X, Sun W, et al. Transcriptome-wide association studies identify candidate genes for carcass and meat traits in meat rabbits. Front Vet Sci. 2024;11:1453196.
  • 18. Tang W, Wang Z, Yuan X, Chen L, Guo H, Qi Z, Zhang Y, Xie X. DEPDC1B, CDCA2, APOBEC3B, and TYMS are potential hub genes and therapeutic targets for diagnosing Dialysis patients with heart failure. Front Cardiovasc Med. 2024;11:1442238.
  • 19. Wyllie MK, Morris CK, Moeller NH, Schares HAM, Moorthy R, Belica CA, et al. The impact of sugar conformation on the single-stranded DNA selectivity of APOBEC3A and APOBEC3B enzymes. ACS Chem Biol. 2025;20(1):117–27.
  • 20. Braza MKE, Demir O, Ahn SH, Morris CK, Calvo-Tusell C, McGuire KL, et al. Regulatory interactions between APOBEC3B N- and C-terminal domains. bioRxiv. 2024. 10.1101/2024.12.11.628032.
  • 21. Jolfayi AG, Naderi N, Ghasemi S, Salmanipour A, Adimi S, Maleki M, et al. A novel pathogenic variant in the carnitine transporter gene, SLC22A5, in association with metabolic carnitine deficiency and cardiomyopathy features. BMC Cardiovasc Disord. 2024;24(1):1.
  • 22. Jia Y, Li JH, Hu BC, Huang X, Yang X, Liu YY, Cai JJ, Yang X, Lai JM, Shen Y, et al. Targeting SLC22A5 fosters mitophagy inhibition-mediated macrophage immunity against septic acute kidney injury upon CD47-SIRPalpha axis Blockade. Heliyon. 2024;10(7):e26791.
  • 23. Wilson AF, Barakat R, Mu R, Karush LL, Gao Y, Hartigan KA, et al. A common single nucleotide variant in the cytokine receptor-like factor-3 (CRLF3) gene causes neuronal deficits in human and mouse cells. Hum Mol Genet. 2023;32(24):3342–52.
  • 24. Knorr DY, Rodriguez Polo I, Pies HS, Schwedhelm-Domeyer N, Pauls S, Behr R, et al. The cytokine receptor CRLF3 is a human neuroprotective EV-3 (Epo) receptor. Front Mol Neurosci. 2023;16:1154509.
  • 25. Liongue C, Ward AC. Cytokine receptor-like factor 3 (CRLF3) and its emerging roles in neurobiology, hematopoiesis and related human diseases. Int J Mol Sci. 2025. 10.3390/ijms26083498.



Articles from Discover Oncology are provided here courtesy of Springer
