Journal of Applied Statistics. 2021 Sep 11;50(3):691–702. doi: 10.1080/02664763.2021.1973387

Identification of novel genes for triple-negative breast cancer with semiparametric gene-based analysis

Xiaotong Liu a, Guoliang Tian a, Zhenqiu Liu b
PMCID: PMC9930760  PMID: 36819073

Abstract

Triple-negative breast cancer (TNBC) is generally considered an aggressive breast cancer subtype associated with poor prognostic outcomes. To date, the molecular and cellular mechanisms underlying TNBC pathology are not fully understood. In this manuscript, we propose a novel semiparametric model with a kernel for gene-based analysis of breast cancer GWAS data. The SPMGBA (semiparametric method for gene-based analysis) software, written in MATLAB, is available at GitHub (https://github.com/zliu3/SPMGBA). Genetic signatures associated with breast cancer are discovered. We further validate the prognostic power of the identified genes with a large cohort of expression data from the European Genome-Phenome Archive, and discover that SEL1L is associated with the overall survival of TNBC with a p-value of .0002. We conclude that gene SEL1L is down-regulated in TNBC and that the expression of SEL1L is positively associated with patient survival.

Keywords: Triple-negative breast cancer, gene-based analysis, semiparametric modeling, genome-wide association studies, SNPs, survival analysis

1. Introduction

Triple-negative breast cancer (TNBC) is characterized by a lack of or reduced expression of the estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) [4]. The TNBC classification is based on immunohistochemistry performed in the clinic. A large subset of TNBC is considered basal-like breast cancer because of the expression of basal epithelial cell markers, corresponding to one of the five intrinsic breast cancer subtypes (luminal A, luminal B, triple-negative, normal-like, and HER2-positive) that are distinguished by their gene expression patterns [6]. However, the overlap between the two identification systems is only approximately 80% [7], indicating that the label ‘triple-negative breast cancer’ describes a more heterogeneous subtype than other breast cancer subtypes [24]. TNBC accounts for about 15% of all breast tumor cases. Compared with other breast tumor subtypes, it displays features that contribute to the poor prognosis of patients: larger tumor size, higher grade, a higher chance of lymph node involvement, and greater aggressiveness [26]. The treatment of TNBC is a major challenge because hormone therapy and receptor-targeted therapy do not work [18]. Currently, TNBC is mainly treated with neo-adjuvant chemotherapy in the clinic, but TNBC patients receiving such treatment have a poor prognosis and a higher rate of distant relapse than patients with other breast tumor subtypes. The chances of resistance are higher, and less than 30% of women diagnosed with TNBC survive beyond 5 years [9,19]. Thus, there is an urgent need to identify biological therapeutic targets and molecular signatures for the treatment of TNBC.

Single nucleotide polymorphism (SNP) based genome-wide association studies (GWAS) have been widely used to identify disease-related SNPs. However, the power of an individual SNP test drops substantially as the number of SNPs and the correlation among them increase, especially when the effect sizes of individual SNPs are small and only their cumulative effect is associated with a disease [1,12]. Gene-based analysis may have higher power to identify the causal variants for complex disease, because it takes into account the correlations among SNPs within a single gene [30]. There have been several approaches for gene-based tests. Score-based tests such as GATES [15], TATES [27], O'Brien [29] and VEGAS [17] combine the summary statistics (p-values) of the SNPs within a gene to obtain an overall p-value for the association of the entire gene [3], while kernel-based methods such as SKAT [28] and SKAT-O [14] design the test statistics based on the kernel matrix of the SNPs within a single gene. However, linear regression (for quantitative traits) and logistic regression (for binary traits) are still the most popular methods of evaluating the overall association between a gene and a trait. Regression-based methods are straightforward and easy to understand. With such an approach, each SNP is entered as an explanatory variable, and case-control status or a quantitative trait is the response variable. A gene-based p-value is then provided by the likelihood ratio test comparing the full model with all available SNPs and the null model without any SNP. However, such simple approaches may suffer from low statistical power if many SNPs within a gene are included in the model. They are also linear, and cannot be used to detect nonlinear associations.

Therefore, in this manuscript, we develop a semiparametric method for gene-based analysis with GWAS data by combining kernel methods and logistic regression. We first construct a gene-based feature vector through nonparametric kernel regression (classification) with the SNP matrix, and then evaluate the association between the feature vector and binary traits with simple logistic regression. The method is nonlinear, and the proposed test does not have the high degrees of freedom that arise when SNPs are entered individually. We apply the proposed method to identify variations associated with breast cancer, and then evaluate the prognostic effect of the identified genes in TNBC with TNBC gene expression data. A novel gene, SEL1L, associated with TNBC survival is identified for further studies.

2. Methods

Associations between disease status such as cancer and individual SNPs have been well studied in GWAS. However, multiple SNPs in a gene need to be combined into a group to perform a unified analysis. The advantages of gene-level analysis include reducing the multiple testing burden and capturing multi-SNP effects. In this study, we developed a novel semiparametric method for investigating associations between disease status and a panel of SNPs at the gene level. The algorithm is very efficient and it handles a large number of SNPs without much difficulty. Given an $n \times m$ matrix $X$ with $n$ subjects and $m$ SNPs on a gene, and the disease status $y=[y_1,\ldots,y_n]^t$, where $y_i = 1$ or $0$, an association can be studied with a logistic function:

$$P(y=1|x)=\frac{P(y=1|x)}{P(y=1|x)+P(y=0|x)}=\frac{1}{1+\exp(-\phi(x))},$$

where $x$ is an $m \times 1$ vector representing the $m$ SNPs inside the gene for a subject, and $\phi(x)$ is a scalar feature derived from the above probability distribution. Solving for $\phi(x)$, we have

$$\phi(x)=\log\frac{P(y=1|x)}{P(y=0|x)}=\log\frac{P(x,y=1)}{P(x,y=0)}.$$

One appealing approach is to estimate the probability density with nonparametric kernels. Defining $D_1=\{i \mid y_i=1\}$ and $D_0=\{i \mid y_i=0\}$, then

$$P(y=1|x)=\frac{\sum_{i\in D_1}K(x,x_i)}{\sum_{i=1}^{n}K(x,x_i)},\quad\text{and}\quad P(y=0|x)=\frac{\sum_{i\in D_0}K(x,x_i)}{\sum_{i=1}^{n}K(x,x_i)}.$$

So that

$$\phi(x)=\log\frac{\sum_{i\in D_1}K(x,x_i)}{\sum_{i\in D_0}K(x,x_i)},$$

where the Gaussian kernel is defined as $K(x,x_i)=\exp\left(-\frac{\|x-x_i\|_2^2}{2h^2}\right)$. Note that there is only one free parameter $h$ to be determined, regardless of the number of SNPs in a gene. $h$ is a smoothing parameter related to the bandwidth of the kernel, and can be determined by leave-one-out cross-validation as stated in the next subsection. After we determine the kernel and the log-probability ratio feature, the model complexity of the logistic regression does not increase with the number of SNPs. Therefore, the proposed approach overcomes the overfitting problem automatically. To prevent the curse of dimensionality in nonparametric kernel density estimation with a large number of variations in $X$, we propose first to project the SNPs onto an orthogonal subspace with principal component analysis (PCA), and then to construct the kernel with the top principal components (PCs). The number of PCs is chosen to explain at least 75% of the total variance of $X$. In general, 5–30 PCs with the largest eigenvalues are required for the SNP data within a gene. For this specific study, we construct the kernel with only the marginally significant SNPs from the individual SNP tests, without PCA projection, as the number of SNPs within a gene is not large.
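The log-probability ratio feature defined above can be illustrated with a minimal Python sketch. This is not the authors' MATLAB implementation; the function names `gaussian_kernel` and `log_ratio_feature` and the toy genotype matrix are hypothetical and chosen only for illustration, and the PCA projection step is omitted.

```python
import math

def gaussian_kernel(x, xi, h):
    """Gaussian kernel K(x, xi) = exp(-||x - xi||^2 / (2 h^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq_dist / (2.0 * h * h))

def log_ratio_feature(x, X, y, h):
    """phi(x) = log( sum_{i in D1} K(x, xi) / sum_{i in D0} K(x, xi) )."""
    s1 = sum(gaussian_kernel(x, xi, h) for xi, yi in zip(X, y) if yi == 1)
    s0 = sum(gaussian_kernel(x, xi, h) for xi, yi in zip(X, y) if yi == 0)
    return math.log(s1 / s0)

# Toy genotypes coded 0/1/2 (hypothetical data, not from the paper).
X = [[2, 1, 0], [2, 2, 0], [1, 2, 1], [0, 0, 1], [0, 1, 2], [1, 0, 2]]
y = [1, 1, 1, 0, 0, 0]

# A subject resembling the cases gets a positive feature,
# a subject resembling the controls gets a negative one.
phi_case_like = log_ratio_feature([2, 2, 0], X, y, h=1.0)
phi_control_like = log_ratio_feature([0, 0, 2], X, y, h=1.0)
```

Whatever the number of SNP columns in `X`, the output is a single scalar per subject, which is why the downstream regression does not grow with the size of the gene.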

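2.1. Leave-one-out cross-validation for determining the smoothing parameter h

The smoothing parameter (bandwidth) $h$ is determined by maximizing the log-likelihood under leave-one-out cross-validation. The leave-one-out probability density is defined as

$$P(y=1|x_j)=\frac{\sum_{i\in D_1,\,i\neq j}K(x_j,x_i)}{\sum_{i\neq j}K(x_j,x_i)},\quad\text{and}\quad P(y=0|x_j)=\frac{\sum_{i\in D_0,\,i\neq j}K(x_j,x_i)}{\sum_{i\neq j}K(x_j,x_i)}.$$

The log-likelihood to be maximized is:

$$L(h)=\sum_{j=1}^{n}\left[y_j\log P(y=1|x_j)+(1-y_j)\log P(y=0|x_j)\right]=\sum_{j=1}^{n}\left[y_j\log\frac{\sum_{i\in D_1,\,i\neq j}K(x_j,x_i)}{\sum_{i\neq j}K(x_j,x_i)}+(1-y_j)\log\frac{\sum_{i\in D_0,\,i\neq j}K(x_j,x_i)}{\sum_{i\neq j}K(x_j,x_i)}\right].$$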
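A minimal Python sketch of the leave-one-out criterion follows. The source does not say how $L(h)$ is maximized, so the search over a small candidate grid is an assumption; the helper names, the grid values, and the toy data are hypothetical.

```python
import math

def loo_log_likelihood(X, y, h):
    """L(h): sum over j of log P(y = y_j | x_j), with sample j left out."""
    total = 0.0
    n = len(X)
    for j in range(n):
        same = 0.0   # kernel mass from other samples sharing y_j's label
        every = 0.0  # kernel mass from all other samples
        for i in range(n):
            if i == j:
                continue
            sq = sum((a - b) ** 2 for a, b in zip(X[j], X[i]))
            k = math.exp(-sq / (2.0 * h * h))
            every += k
            if y[i] == y[j]:
                same += k
        # Guard against log(0) if the kernel mass underflows for tiny h.
        total += math.log(max(same, 1e-300) / max(every, 1e-300))
    return total

def select_bandwidth(X, y, grid):
    """Pick the candidate h with the largest leave-one-out log-likelihood."""
    return max(grid, key=lambda h: loo_log_likelihood(X, y, h))

# Toy genotypes coded 0/1/2 with some class overlap (hypothetical data).
X = [[2, 1, 0], [2, 2, 0], [1, 2, 1], [0, 1, 1],
     [0, 0, 1], [0, 1, 2], [1, 0, 2], [2, 1, 1]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
grid = [0.25, 0.5, 1.0, 2.0, 4.0]  # assumed candidate bandwidths
best_h = select_bandwidth(X, y, grid)
```

Because each term is the log of a probability, $L(h)$ is never positive. For very large $h$ the kernel becomes flat and every term approaches the log of the leave-one-out class proportion.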

A feature vector $\phi(X)$ can be calculated after we find the leave-one-out optimal $h$, where $\phi(X)=[\phi(x_1),\ldots,\phi(x_j),\ldots,\phi(x_n)]^t$. The probability for a variation set at the gene level is then estimated with standard logistic regression:

$$P(y=1|\phi(x_j))=\frac{1}{1+\exp\left(-(\beta_0+\beta_1 PC_{1j}+\beta_2 PC_{2j}+\beta_3\phi(x_j))\right)},$$

where $PC_1$ and $PC_2$ are the first two principal components, included to adjust for structural differences in the genetic data. The p-value for each gene can then be found from a t-test of the hypothesis $\beta_3 \neq 0$. Note that the proposed semiparametric approach only needs to identify 5 parameters ($h$, $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$) by transforming an SNP matrix $X$ for a gene into a log-probability ratio vector $\phi(X)$, regardless of the number of SNPs in a gene. This is unlike most other popular methods, in which the number of estimated parameters increases with either the number of SNPs or the number of samples. The proposed approach prevents overfitting through leave-one-out cross-validation and simple model estimation. In addition, it is straightforward to add confounding variables and the feature vectors of multiple genes to the logistic regression model. The proposed approach is novel in that it combines nonparametric kernel classification with logistic regression. We first transform the kernel matrix into a log-probability ratio vector $\phi(X)$, and then perform the logistic regression only with that vector. As a result, we only need to estimate one gene-level parameter $\beta_3$, instead of $n$ parameters as in kernel logistic regression.
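The second stage can be sketched in Python under simplifying assumptions. To keep the example short, the two PC covariates are dropped, so the model has only an intercept and the $\phi$ coefficient (called `b1` here, playing the role of $\beta_3$). The text describes a t-test; this sketch substitutes a Wald z-statistic with a normal approximation. The function name and the toy feature values are hypothetical.

```python
import math

def fit_logistic_1d(phi, y, iters=50):
    """Newton-Raphson for P(y=1|phi) = 1 / (1 + exp(-(b0 + b1*phi)))."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for f, t in zip(phi, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * f)))
            w = p * (1.0 - p)
            g0 += t - p            # score for the intercept
            g1 += (t - p) * f      # score for the phi coefficient
            h00 += w               # information matrix entries
            h01 += w * f
            h11 += w * f * f
        det = h00 * h11 - h01 * h01
        # Newton step beta += I^{-1} g, using the closed-form 2x2 inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / det)   # sqrt of the [1][1] entry of I^{-1}
    return b0, b1, se_b1

# Hypothetical, non-separable feature values and case/control labels.
phi = [-2.0, -1.5, -1.0, -0.5, -0.2, 0.3, 0.5, 1.0, 1.5, 2.0]
y = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]

b0, b1, se_b1 = fit_logistic_1d(phi, y)
z = b1 / se_b1
p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
```

The sketch assumes the classes are not perfectly separated by $\phi$; under perfect separation the maximum-likelihood estimate does not exist and the Newton iterations diverge.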

3. Results

3.1. Datasets

3.1.1. GWAS genetic data

The genetic data are downloaded from dbGaP (www.ncbi.nlm.nih.gov/gap) of NCBI with permission. The phs000147 CGEMS breast cancer dataset was originally collected in the Cancer Genetic Markers of Susceptibility (CGEMS) breast cancer genome-wide association study (GWAS). The dataset includes genotypes of 528,173 SNPs (Illumina HumanHap550) in 1145 postmenopausal women of European ancestry with invasive breast cancer and 1142 controls from the Nurses' Health Study (NHS). The dataset was initially generated and analyzed in [8,10].

3.1.2. Gene expression datasets

The expression data are downloaded with permission from the European Genome-Phenome Archive (www.ebi.ac.uk/ega/), which is hosted by the European Bioinformatics Institute, under accession number EGAS00000000083. The original data were published by Curtis et al., 2012 [5]. The gene expression data are used for validating the genes identified with GWAS data, and for evaluating the prognostic power of the genes and pathways in TNBC patients. There are 36,107 probes with annotated genes and 997 samples in the discovery data. Among the 997 discovery samples, 134 are TNBC patient samples. Patients are from Europe and over 99% of patients are postmenopausal women. Survival and other clinical information are also available.

3.2. Simulation studies of type 1 error rate and statistical power

To evaluate the proposed approach for gene-based tests, we set the nominal type 1 error rate to α=0.05, and generate a genotype block of 30 SNPs, which are all biallelic and under Hardy–Weinberg equilibrium. We consider three types of disease models: (i) a null model where no SNP has any effect on disease risk, (ii) an additive model where one SNP in each linkage disequilibrium (LD) block has a minor allele that increases the risk ratio additively by 0.14, and (iii) a multiplicative model where one SNP in each LD block has a minor allele that increases the risk ratio multiplicatively by a factor of 1.14. We also consider three different LD structures: (i) the 30 SNPs are situated in six strong LD blocks with LD values of 0.8 or 0.9, (ii) the 30 SNPs are situated in six moderate LD blocks with LD values of 0.4 or 0.5 in each block, or (iii) the 30 SNPs are in linkage equilibrium. The minor allele frequency (MAF) is set to the range of 0.1–0.4, similar to the experiments performed by Li et al. [15]. The HapSim algorithm [21] is used to generate genotype data. We generate a population of 1,000,000 subjects for each combination of the LDs and disease models. A random sample of 1500 cases and 1500 controls is then drawn without replacement from the population. Type 1 error rates and statistical power under the different combinations are estimated as the proportion of 1000 simulated datasets in which the test is significant at each α-value. The results are reported in Table 1.
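Parts of this simulation design can be sketched in Python. The sketch covers only the linkage-equilibrium scenario, drawing each genotype as two independent allele draws (Hardy–Weinberg equilibrium); it does not reproduce HapSim or the LD blocks. The reading of the additive and multiplicative models as relative risks of $1+0.14g$ and $1.14^g$ for $g$ minor alleles is an interpretation of the text, and all function names are hypothetical.

```python
import random

def sample_genotypes(n_subjects, mafs, rng):
    """Independent SNPs under HWE: genotype = number of minor alleles (0/1/2)."""
    return [[(rng.random() < maf) + (rng.random() < maf) for maf in mafs]
            for _ in range(n_subjects)]

def relative_risk(g, model):
    """Assumed relative risk for g copies of the minor allele at a causal SNP."""
    if model == "additive":
        return 1.0 + 0.14 * g
    if model == "multiplicative":
        return 1.14 ** g
    return 1.0  # null model: no effect

def empirical_rate(p_values, alpha):
    """Share of simulated datasets with p < alpha (type 1 error or power)."""
    return sum(p < alpha for p in p_values) / len(p_values)

rng = random.Random(0)
mafs = [0.1 + 0.3 * k / 29 for k in range(30)]  # 30 SNPs, MAF in 0.1-0.4
genotypes = sample_genotypes(1000, mafs, rng)
```

Applied to p-values from datasets generated under the null model, `empirical_rate` estimates the type 1 error; applied to p-values from a disease model, it estimates power.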

Table 1.

Empirical type 1 errors and power of the proposed approaches (in percentage), where LE denotes linkage equilibrium.

| α | Disease model | LE | Moderate LD | High LD |
|---|---|---|---|---|
| 0.05 | Error rate (no disease) | 5.04 | 5.47 | 5.25 |
| 0.05 | Power (additive model) | 63.68 | 70.97 | 81.53 |
| 0.05 | Power (multiplicative model) | 92.36 | 95.85 | 98.89 |
| 0.01 | Error rate (no disease) | 0.88 | 1.05 | 1.17 |
| 0.01 | Power (additive model) | 60.42 | 68.16 | 78.90 |
| 0.01 | Power (multiplicative model) | 90.26 | 92.54 | 94.97 |
| 0.001 | Error rate (no disease) | 0.107 | 0.21 | 0.13 |
| 0.001 | Power (additive model) | 52.81 | 61.4 | 72.90 |
| 0.001 | Power (multiplicative model) | 80.38 | 83.62 | 85.15 |

Notes: The SNP set includes 30 SNPs, 6 LD blocks and one disease-susceptibility locus in each block. The nominal type-1 error rate is set to 0.05, 0.01, and 0.001, respectively.

The empirical type-1 error rates and statistical powers of SPMGBA are reported in percentage (%) in Table 1. As shown in Table 1, the empirical type-1 errors are well controlled at the different α values. For instance, with α=0.05, the empirical type-1 error rate (5.04%) is very close to the nominal error rate of 0.05 when the SNPs within a gene are independent. The empirical error rates increase slightly to 5.47% and 5.25% for moderate and high LD, respectively, but they remain close to 0.05. Similar conclusions can be drawn with α=0.01 and 0.001, indicating that the proposed approach controls the type-1 error rate well under different scenarios.

The statistical power of the proposed test is affected by the disease model, the α-value, and the LD structure. The power under the multiplicative model is higher than that under the additive model for all LD structures and α-values. In addition, the power increases as the LD coefficients become larger. For instance, with α=0.05, the additive model has powers of 63.68%, 70.97%, and 81.53%, while the multiplicative model has powers of 92.36%, 95.85%, and 98.89% with LE, moderate LD, and high LD, respectively. Although the comparison is not included in this paper, the proposed approach with α=0.05 has power comparable to the best performer of the five other software packages reported in Table 1 of [15]. Similar conclusions can be reached with α=0.01 and 0.001. Finally, the power is lower when the nominal error rate α is smaller, which is statistically reasonable.

3.3. Computational results with real GWAS data

We first recode the breast cancer genetic data into 0, 1, 2 with plink, and then read the text file into MATLAB. We then drop the SNPs with more than five missing values, and fill the remaining missing values with 0. This leaves 462,363 SNPs for further study. Then 22,887 SNPs with p<.05 are selected for gene- and pathway-based analysis using logistic regression with two PCs (glmfit.m) in MATLAB. Among the 22,887 SNPs, 17,214 SNPs are annotated into 5535 genes using our gene SNP annotation database (in MATLAB). Gene-level associations are explored with the proposed semiparametric method. With Bonferroni correction, 42 out of 5535 genes with FDR<0.03 (raw p-value <5.4e-06) are identified to be associated with breast cancer status (Table 2).
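The raw p-value cutoff quoted above is consistent with dividing the adjusted cutoff by the number of genes tested. The short check below assumes that this is how the threshold was obtained; the variable names are illustrative.

```python
n_genes = 5535       # genes tested, as stated in the text
adj_cutoff = 0.03    # adjusted significance cutoff quoted in the text

# Bonferroni: a raw p-value must fall below adj_cutoff / n_genes.
raw_threshold = adj_cutoff / n_genes
print(f"{raw_threshold:.2e}")  # prints 5.42e-06
```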

Table 2.

Gene-level variants associated with breast cancer identified with GWAS data and semiparametric modeling.

| Gene ID | Beta | p-Value | FDR | Average CADD |
|---|---|---|---|---|
| NCAM2 | 1.16 | 3.3E-11 | 1.6E-07 | 2.97 |
| PTPRD | 1.55 | 7.9E-11 | 3.8E-07 | 3.70 |
| CSMD1 | 1.64 | 6.89E-10 | 3.3E-06 | 1.66 |
| RBFOX1 | 1.62 | 3.17E-09 | 1.53E-05 | 3.42 |
| LINGO2 | 1.18 | 1.32E-08 | 6.39E-05 | 2.63 |
| DAOA | 0.91 | 2.90E-08 | 0.00014 | 1.55 |
| TENM4 | 1.37 | 3.45E-08 | 0.00016 | 3.44 |
| CDH20 | 0.84 | 3.48E-08 | 0.00017 | 3.06 |
| CTNNA3 | 1.06 | 4.81E-08 | 0.00023 | 3.85 |
| CDH13 | 1.58 | 5.57E-08 | 0.0003 | 3.28 |
| PCDH9 | 1.31 | 6.31E-08 | 0.0003 | 6.22 |
| ZP4 | 1.06 | 1.29E-07 | 0.00062 | 1.98 |
| HS3ST3A1 | 1.23 | 2.05E-07 | 0.00099 | 2.28 |
| SFSWAP | 1.08 | 2.07E-07 | 0.001 | 1.5 |
| SHANK2 | 1.06 | 2.41E-07 | 0.0011 | 2.41 |
| ATP10A | 1.03 | 4.39E-07 | 0.0021 | 2.89 |
| CDH6 | 1.18 | 5.13E-07 | 0.0025 | 4.80 |
| TENM3 | 1.35 | 6.03E-07 | 0.0029 | 3.94 |
| DLGAP1 | 1.06 | 6.32E-07 | 0.0031 | 2.73 |
| SORCS1 | 1.47 | 6.51E-07 | 0.0032 | 3.25 |
| CXADR | 1.11 | 7.70E-07 | 0.0037 | 4.17 |
| VRK1 | 1.06 | 9.48E-07 | 0.0046 | 2.45 |
| MYH9 | 0.89 | 9.55E-07 | 0.0046 | 5.13 |
| MCTP2 | 1.32 | 1.31E-06 | 0.0063 | 2.43 |
| SDK2 | 1.01 | 1.39E-06 | 0.0067 | 3.44 |
| NR2F1 | 1.15 | 1.41E-06 | 0.0068 | 4 |
| LRRC4C | 1.06 | 1.43E-06 | 0.0069 | 2.83 |
| TLE4 | 1.37 | 1.48E-06 | 0.0071 | 3.23 |
| GPR133 | 0.94 | 2.15E-06 | 0.01 | 3.22 |
| TBC1D22A | 1.28 | 2.24E-06 | 0.011 | 2.73 |
| ERCC6L2 | 0.85 | 2.83E-06 | 0.014 | 4.28 |
| TCERG1L | 1.53 | 3.64E-06 | 0.018 | 1.4 |
| CDH4 | 1.27 | 3.79E-06 | 0.018 | 1.58 |
| SEL1L | 1.06 | 4.02E-06 | 0.019 | 4 |
| SV2B | 1.04 | 4.09E-06 | 0.020 | 3.03 |
| TMEM132D | 1.29 | 4.36E-06 | 0.021 | 1.93 |
| AUTS2 | 1.29 | 4.47E-06 | 0.021 | 5.32 |
| EPS8L3 | 0.91 | 4.92E-06 | 0.024 | 2.6 |
| GNG7 | 0.99 | 4.93E-06 | 0.024 | 3 |
| ZBED4 | 1.26 | 5.01E-06 | 0.024 | 2.61 |
| STIM2 | 1.29 | 5.04E-06 | 0.024 | 2.02 |
| PCDH18 | 0.87 | 6.07E-06 | 0.029 | 2.96 |

The average CADD (Combined Annotation Dependent Depletion) scores reported in Table 2 represent the likelihood of functional, deleterious, and disease-causal variations for each gene [13]. The larger the average CADD score, the more likely the variants in that gene are causal. Enrichment analysis for the 42 genes is performed with STRING (Search Tool for the Retrieval of Interacting Genes/Proteins), a biological database and web resource for gene enrichment analysis that can be accessed at http://string-db.org/. Functional enrichment analysis of those 42 genes indicates that they are involved in several biological processes (FDR<0.001). We choose 0.001 as the threshold because all the pathways that are statistically significant in STRING have FDR<0.001. The pathways include those involving cell-cell adhesion via plasma-membrane adhesion molecules, homophilic cell adhesion via plasma membrane adhesion molecules, and cell-cell adhesion.

3.4. Overall survival associated genes in TNBC

The next step is to test which of the 42 genes are associated with overall survival of TNBC using the gene expression data. Kaplan–Meier (KM) estimates and the log-rank test are used to assess whether the survival curves of the two patient groups defined by the median cutoff differ significantly. The top four genes, SEL1L, PCDH18, PCDH9, and TLE4, with p-values <.1, are shown in Figure 1.
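The survival comparison described above can be sketched in plain Python. The sketch assumes that the median cutoff refers to splitting patients at the median expression of a gene, which the text does not state explicitly. The follow-up times, event indicators, and expression values are hypothetical, and the function names are illustrative.

```python
import math

def median_split(expr):
    """Label each sample 1 if its expression is above the median, else 0."""
    s = sorted(expr)
    m = len(s) // 2
    med = s[m] if len(s) % 2 else 0.5 * (s[m - 1] + s[m])
    return [1 if v > med else 0 for v in expr]

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (event = 1, censored = 0)."""
    curve, s = [], 1.0
    for t in sorted(set(u for u, e in zip(times, events) if e == 1)):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        s *= 1.0 - deaths / at_risk
        curve.append((t, s))
    return curve

def logrank_chi2(times, events, group):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    o_minus_e = var = 0.0
    for t in sorted(set(u for u, e in zip(times, events) if e == 1)):
        n = sum(1 for u in times if u >= t)
        n1 = sum(1 for u, g in zip(times, group) if u >= t and g == 1)
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        d1 = sum(1 for u, e, g in zip(times, events, group)
                 if u == t and e == 1 and g == 1)
        o_minus_e += d1 - d * n1 / n          # observed minus expected in group 1
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / var

# Hypothetical data: higher expression goes with longer survival.
times = [2, 3, 4, 5, 9, 10, 11, 12]
events = [1, 1, 1, 1, 1, 1, 1, 1]
expr = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

group = median_split(expr)
chi2 = logrank_chi2(times, events, group)
p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
```

The p-value uses the identity that a chi-square variable with one degree of freedom is the square of a standard normal variable.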

Figure 1.

Statistically significant genes associated with the overall survival of TNBC.

Figure 1 shows that genes PCDH18, TLE4, and PCDH9 are only marginally significant in TNBC, indicating that these three genes are not significantly prognostic for TNBC survival. SEL1L is highly significant (p=.0002) in TNBC, suggesting potential prognostic value for TNBC. More interestingly, high expression of SEL1L is associated with better overall survival, supporting a potential tumor suppressor role for SEL1L in triple-negative breast cancer.

The intragenic and intergenic SNPs for gene SEL1L (annotated with hg19) are reported in Table 3.

Table 3.

Selected individual SNPs of gene SEL1L with p<.05 from the GWAS data.

| SNP ID | Beta | p-Value | CADD |
|---|---|---|---|
| rs10498554 | 0.175 | .023 | 0.5 |
| rs1152419 | 0.179 | .013 | 2 |
| rs1152422 | 0.145 | .031 | 1 |
| rs12436488 | −0.140 | .049 | 3 |
| rs12882346 | 0.209 | .015 | 3.5 |
| rs12883722 | 0.212 | .033 | 0 |
| rs12887222 | −0.124 | .046 | 8 |
| rs1457979 | −0.184 | .023 | 4 |
| rs17588820 | −0.172 | .012 | 6.5 |
| rs1813146 | 0.408 | .0138 | 22 |
| rs2372239 | 0.149 | .039 | 2.5 |
| rs4899801 | −0.133 | .031 | 3 |
| rs718564 | 0.155 | .016 | 2 |
| rs799099 | 0.175 | .012 | 0.5 |
| rs799103 | 0.177 | .020 | 2.5 |
| rs799121 | 0.164 | .0179 | 3 |

Table 3 shows that no individual SNP has a p-value <.01. However, the 16 variations of gene SEL1L together are highly statistically significant (p=4.02E-06). Therefore, even though individual SNPs have only modest effects, the joint effect of 16 individually moderate SNPs can be important biologically. Among all the SNPs, rs1813146 has a CADD score of 22, indicating its biological significance.

SEL1L is the human orthologue of the Caenorhabditis elegans sel-1 gene. It has been shown that this gene plays a fundamental role in eukaryotic intracellular protein degradation processes [2]. Protein degradation is becoming a central theme in cancer biology. We have demonstrated that SEL1L is strongly associated with the survival of TNBC patients, and the expression of SEL1L is positively associated with patient survival in TNBC (p = .0002).

4. Discussion and conclusion

In this study, we developed a novel semiparametric method for investigating associations between disease status and a panel of SNPs at the gene level. The algorithm is very efficient and it handles a large number of SNPs without much difficulty. The SPMGBA (semiparametric method for gene-based analysis) software, written in MATLAB, is available at GitHub (https://github.com/zliu3/SPMGBA). We applied the proposed approach to real breast cancer GWAS data, and identified genes associated with the overall survival of TNBC. A novel gene, SEL1L, and multiple variations in the nearby region of SEL1L are identified with a gene-based association study. SEL1L is down-regulated and the expression of SEL1L is positively associated with patient survival in TNBC (p = .0002).

SEL1L, Suppressor/Enhancer of Lin-12-like, is a type-I transmembrane protein with a large luminal domain containing SEL1 repeats. SEL1L maps to 14q24.3-31 on the genome and exhibits tissue-specific patterns of expression. It is over-expressed in several normal tissues and cell lines, with strong expression in breast, placenta, and pancreas, as shown in GeneCards (http://www.genecards.org). SEL1L is a component of the endoplasmic reticulum (ER)-associated degradation (ERAD) pathway and plays a crucial role in selecting and transporting ERAD substrates for degradation [11]. Protein misfolding and aggregation in the ER contribute significantly to the etiology and pathogenesis of many complex diseases, including different cancers. Failure to degrade misfolded proteins in the ER may cause ER stress, activate the unfolded protein response (UPR), and lead to global changes in transcription and translation. ERAD targets misfolded secretory and membrane proteins for proteasomal degradation, and is a universal quality-control system in the cell to maintain ER homeostasis and adjust ER capacity [23]. SEL1L is an adaptor protein for gene HRD1 (the E3 ligase hydroxymethylglutaryl reductase degradation protein 1). It mediates HRD1 interactions on ERAD, and is critical for translocation of Class I major histocompatibility complex (MHC) heavy chains (HCs) [11]. Furthermore, its physiologic functions have been studied recently with a mouse model [25]. SEL1L is indispensable for ER homeostasis and cellular survival, further indicating its functional importance. In addition, it has been shown that SEL1L may play an important role in pancreatic and breast carcinoma. The expression of SEL1L in pancreatic and breast tumors has been associated with reduced tumor cell aggressiveness both in vivo and in vitro.
Clinically, SEL1L SNP rs12435998 has been reported to be significantly associated with the overall survival (OS) of glioblastoma, pancreatic ductal adenocarcinoma, and Alzheimer's disease [16,20,22]. One of the 16 SNPs reported in Table 3, rs12436488, is within 0.5 kb of SNP rs12435998, suggesting a potential prognostic value of gene SEL1L in TNBC.

Disclosure statement

No potential conflict of interest was reported by the author(s).

References

  • 1. Alonso-Gonzalez A., Calaza M., Rodriguez-Fontenla C., and Carracedo A., Novel gene-based analysis of ASD GWAS: Insight into the biological role of associated genes, Front. Genet. 10 (2019), p. 733. doi: 10.3389/fgene.2019.00733. PMID: 31447886; PMCID: PMC6696953.
  • 2. Cattaneo M., Lotti L.V., Martino S., and Alessio M., Secretion of novel SEL1L endogenous variants is promoted by ER stress/UPR via endosomes and shed vesicles in human cancer cells, PLoS One 6 (2011), p. e17206. doi: 10.1371/journal.pone.0017206. PMID: 21359144; PMCID: PMC3040770.
  • 3. Chung J., Jun G.R., Dupuis J., and Farrer L., Comparison of methods for multivariate gene-based association tests for complex diseases using common variants, Eur. J. Hum. Genet. 27 (2019), pp. 811–823. doi: 10.1038/s41431-018-0327-8.
  • 4. Costa R. and Gradishar W., Triple-negative breast cancer: Current practice and future directions, J. Oncol. Practice 13 (2017), pp. 301–303. doi: 10.1200/JOP.2017.023333.
  • 5. Curtis C., Shah S.P., Chin S.F., Turashvili G., Rueda O.M., Dunning M.J., Speed D., Lynch A.G., Samarajiwa S., Yuan Y., Gräf S., Ha G., Haffari G., Bashashati A., Russell R., McKinney S.; METABRIC Group, Langerød A., Green A., Provenzano E., Wishart G., Pinder S., Watson P., Markowetz F., Murphy L., Ellis I., Purushotham A., Børresen-Dale A.L., Brenton J.D., Tavaré S., Caldas C., and Aparicio S., The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups, Nature 486 (2012), pp. 346–352. doi: 10.1038/nature10983. PMID: 22522925; PMCID: PMC3440846.
  • 6. Dai X., Li T., Bai Z., Yang Y., Liu X., Zhan J., and Shi B., Breast cancer intrinsic subtype classification, clinical use and future trends, Am. J. Cancer Res. 5 (2015), pp. 2929–2943. PMID: 26693050; PMCID: PMC4656721.
  • 7. Garrido-Castro A.C., Lin N.U., and Polyak K., Insights into molecular classifications of triple-negative breast cancer: Improving patient selection for treatment, Cancer Discov. 9 (2019), pp. 176–198. doi: 10.1158/2159-8290.CD-18-1177.
  • 8. Haiman C.A., Chen G.K., Vachon C.M., and Canzian F., A common variant at the TERT-CLPTM1L locus is associated with estrogen receptor-negative breast cancer, Nat. Genet. 43 (2011), pp. 1210–1214. doi: 10.1038/ng.985. PMID: 22037553; PMCID: PMC3279120.
  • 9. Hirai T., Nemoto A., Ito Y., and Matsuura M., Meta-analyses on progression-free survival as a surrogate endpoint for overall survival in triple-negative breast cancer, Breast Cancer Res. Treat. 181 (2020), pp. 189–198. doi: 10.1007/s10549-020-05615-4. Epub 2020 Apr 3. PMID: 32246379.
  • 10. Hunter D.J., Kraft P., Jacobs K.B., Cox D.G., Yeager M., Hankinson S.E., Wacholder S., Wang Z., Welch R., Hutchinson A., Wang J., Yu K., Chatterjee N., Orr N., Willett W.C., Colditz G.A., Ziegler R.G., Berg C.D., Buys S.S., McCarty C.A., Feigelson H.S., Calle E.E., Thun M.J., Hayes R.B., Tucker M., Gerhard D.S., Fraumeni J.F. Jr, Hoover R.N., Thomas G., and Chanock S.J., A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer, Nat. Genet. 39 (2007), pp. 870–874. doi: 10.1038/ng2075. PMID: 17529973; PMCID: PMC3493132.
  • 11. Jeong H., Sim H.J., Song E.K., Lee H., Ha S.C., Jun Y., Park T.J., and Lee C., Crystal structure of SEL1L: Insight into the roles of SLR motifs in ERAD pathway, Sci. Rep. 6 (2016), p. 20261. doi: 10.1038/srep20261. PMID: 27064360; PMCID: PMC4746701.
  • 12. Kang G., Jiang B., and Cui Y., Gene-based genomewide association analysis: A comparison study, Curr. Genom. 14 (2013), pp. 250–255. doi: 10.2174/13892029113149990001. PMID: 24294105; PMCID: PMC3731815.
  • 13. Kircher M., Witten D.M., Jain P., O'Roak B.J., Cooper G.M., and Shendure J., A general framework for estimating the relative pathogenicity of human genetic variants, Nat. Genet. 46 (2014), pp. 310–315. doi: 10.1038/ng.2892. Epub 2014 Feb 2. PMID: 24487276; PMCID: PMC3992975.
  • 14. Lee S., Abecasis G.R., Boehnke M., and Lin X., Rare-variant association analysis: Study designs and statistical tests, Am. J. Hum. Genet. 95 (2014), pp. 5–23. doi: 10.1016/j.ajhg.2014.06.009.
  • 15. Li M.X., Gui H.S., Kwan J.S., and Sham P.C., GATES: A rapid and powerful gene-based association test using extended Simes procedure, Am. J. Hum. Genet. 88 (2011), pp. 283–293. doi: 10.1016/j.ajhg.2011.01.019. PMID: 21397060; PMCID: PMC3059433.
  • 16. Liu Q., Chen J., Mai B., Amos C., Killary A.M., Sen S., Wei C., and Frazier M.L., A single-nucleotide polymorphism in tumor suppressor gene SEL1L as a predictive and prognostic marker for pancreatic ductal adenocarcinoma in Caucasians, Mol. Carcinog. 51 (2011), pp. 433–438.
  • 17. Liu J.Z., McRae A.F., Nyholt D.R., Medland S.E., Wray N.R., Brown K.M., Investigators AMFS, Hayward N.K., Montgomery G.W., Visscher P.M., Martin N.G., and Macgregor S.A., A versatile gene-based test for genome-wide association studies, Am. J. Hum. Genet. 87 (2010), pp. 139–145. doi: 10.1016/j.ajhg.2010.06.009.
  • 18. Lotfinejad P., Jafarabadi M.A., Shadbad M.A., Kazemi T., Pashazadeh F., Shotorbani S.S., Jadidi Niaragh F., Baghbanzadeh A., Vahed N., Silvestris N., and Baradaran B., Prognostic role and clinical significance of Tumor-Infiltrating Lymphocyte (TIL) and programmed death ligand 1 (PD-L1) expression in Triple-Negative Breast Cancer (TNBC): A systematic review and meta-analysis study, Diagnostics 10 (2020), p. E704. doi: 10.3390/diagnostics10090704. PMID: 32957579.
  • 19. Luz P., Dias D., Fortuna A., Bretes L., and Gosalbez B., How shall we treat locally advanced triple negative breast cancer?, F1000Res 28 (2019), p. 1649. doi: 10.12688/f1000research.20509.2. PMID: 32802311; PMCID: PMC7411516.
  • 20. Mellai M., Cattaneo M., Storaci A.M., Annovazzi L., Cassoni P., Melcarne A., De Blasio P., Schiffer D., and Biunno I., SEL1L SNP rs12435998, a predictor of glioblastoma survival and response to radio-chemotherapy, Oncotarget 6 (2015), pp. 12452–12467. doi: 10.18632/oncotarget.3611. Erratum in: Oncotarget. 2018 Aug 24;9:32731. PMID: 25948789; PMCID: PMC4494950.
  • 21. Montana G., HapSim: A simulation tool for generating haplotype data with pre-specified allele frequencies and LD coefficients, Bioinformatics 21 (2005), pp. 4309–4311. doi: 10.1093/bioinformatics/bti689.
  • 22. Saltini G., Dominici R., Lovati C., Cattaneo M., Michelini S., Malferrari G., Caprera A., Milanesi L., Finazzi D., Bertora P., Scarpini E., Galimberti D., Venturelli E., Musicco M., Adorni F., Mariani C., and Biunno I., A novel polymorphism in SEL1L confers susceptibility to Alzheimer's disease, Neurosci. Lett. 398 (2006), pp. 53–58.
  • 23. Sha H., Sun S., Francisco A.B., Ehrhardt N., Xue Z., Liu L., Lawrence P., Mattijssen F., Guber R.D., Panhwar M.S., Brenna J.T., Shi H., Xue B., Kersten S., Bensadoun A., Péterfy M., Long Q., and Qi L., The ER-associated degradation adaptor protein Sel1L regulates LPL secretion and lipid metabolism, Cell. Metab. 20 (2014), pp. 458–470. doi: 10.1016/j.cmet.2014.06.015. PMID: 25066055; PMCID: PMC4156539.
  • 24. Sharma S., Barry M., Gallagher D.J., Kell M., and Sacchini V., An overview of triple negative breast cancer for surgical oncologists, Surg. Oncol. 24 (2015), pp. 276–283. doi: 10.1016/j.suronc.2015.06.007.
  • 25. Sun S., Shi G., Han X., Francisco A.B., Ji Y., Mendonça N., Liu X., Locasale J.W., Simpson K.W., Duhamel G.E., Kersten S., Yates J.R. 3rd, Long Q., and Qi L., Sel1L is indispensable for mammalian endoplasmic reticulum-associated degradation, endoplasmic reticulum homeostasis, and survival, Proc. Natl. Acad. Sci. USA 111 (2014), pp. E582–E591. doi: 10.1073/pnas.1318114111. PMID: 24453213; PMCID: PMC3918815.
  • 26. Umar S.M., Kashyap A., Kahol S., Mathur S.R., Gogia A., Deo S.V.S., and Prasad C.P., Prognostic and therapeutic relevance of phosphofructokinase platelet-type (PFKP) in breast cancer, Exp. Cell. Res. 396 (2020), p. 112282.
  • 27. van der Sluis S., Posthuma D., and Dolan C.V., TATES: Efficient multivariate genotype-phenotype analysis for genome-wide association studies, PLoS Genet. 9 (2013), p. e1003235. doi: 10.1371/journal.pgen.1003235.
  • 28. Wu B., Pankow J.S., and Guan W., Sequence kernel association analysis of rare variant set based on the marginal regression model for binary traits, Genet. Epidemiol. 39 (2015), pp. 399–405. doi: 10.1002/gepi.21913.
  • 29. Yang Q., Wu H., Guo C.Y., and Fox C.S., Analyze multivariate phenotypes in genetic association studies by combining univariate association tests, Genet. Epidemiol. 34 (2010), pp. 444–454. doi: 10.1002/gepi.20497. PMID: 20583287; PMCID: PMC3090041.
  • 30. Zhang L. and Kim I., Semiparametric Bayesian kernel survival model for evaluating pathway effects, Stat. Methods Med. Res. 28 (2018), pp. 3301–3317. doi: 10.1177/0962280218797360. Epub 2018 Oct 5. PMID: 30289021.

Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
