Author manuscript; available in PMC: 2020 Dec 15.
Published in final edited form as: Biometrics. 2020 Jan 6;76(4):1340–1350. doi: 10.1111/biom.13208

Using Sufficient Direction Factor Model to Analyze Latent Activities Associated with Breast Cancer Survival

Seungchul Baek 1,*, Yen-Yi Ho 2,**, Yanyuan Ma 3,***
PMCID: PMC7305041  NIHMSID: NIHMS1557994  PMID: 31860141

Summary:

High-dimensional gene expression data often exhibit intricate correlation patterns as the result of coordinated genetic regulation. In practice, however, it is difficult to directly measure these coordinated underlying activities. Analysis of breast cancer survival data with gene expressions motivates us to use a two-stage latent factor approach to estimate these unobserved coordinated biological processes.

Compared to existing approaches, our proposed procedure has several unique characteristics. In the first stage, an important distinction is that our procedure incorporates prior biological knowledge about gene-pathway membership into the analysis and explicitly models the effects of genetic pathways on the latent factors. Second, to characterize the molecular heterogeneity of breast cancer, our approach provides estimates specific to each cancer subtype. Finally, our proposed framework incorporates a sparsity condition, reflecting the fact that genetic networks are often sparse.

In the second stage, we investigate the relationship between latent factor activity levels and survival time with censoring using a general dimension reduction model in the survival analysis context. Combining the factor model and sufficient direction model provides an efficient way of analyzing high-dimensional data and reveals some interesting relations in the breast cancer gene expression data. The computing codes for our simulation studies and application are publicly available at https://github.com/sbaek306/FACTOR.

Keywords: Breast cancer, Dimension reduction, Factor model, General index model, Pathway analysis, Semiparametric model

1. Introduction

High-dimensional gene expression data often exhibit complicated correlation patterns. These correlation patterns are frequently the result of coordinated gene regulations related to the underlying data generating process. Groups of genetic pathways or processes can be co-regulated through shared biological signals in a cell. In order for cancer cells to continuously maintain disease progression, multiple sources of coordinated genetic regulations are needed for the cells to escape the mechanisms that normally govern their proliferation, survival, and migration (Feng et al., 2018). In practice, however, it is difficult to measure these underlying coordinated genetic activities directly.

In this paper, we aim to estimate the activity levels of these unobserved underlying biological processes from independent sources via a latent factor approach. Several approaches have been proposed in the literature to model latent factors, such as the supervised principal components approach of Bair et al. (2006), the surrogate variable analysis of Leek and Storey (2007), and the partial least squares methods of Nguyen and Rocke (2002) and Li and Gui (2004). In addition, Bayesian methods have been introduced, for example, by Bernardo et al. (2003), Carvalho et al. (2008), Lopes and West (2004), and Lucas et al. (2006).

Our proposed procedure has some unique characteristics for analyzing breast cancer data with gene expressions, which make it different from methodologies presented in the existing literature. An important distinction is that our procedure incorporates prior biological knowledge about gene-pathway membership into the analysis. Motivated by the fact that crosstalk between disrupted pathways occurs frequently in cancer (Wang et al., 2016), our proposed approach explicitly models the effects of gene pathways on the latent processes. As a result, our method offers pathway-based estimates and hence enhances the interpretability of the model.

The proposed framework comprises two stages. In the first stage, we use gene expression measurements and prior biological knowledge about gene-pathway membership to estimate the latent factor activities. In the second stage, we aim to study how the latent processes jointly affect the survival time of the cancer patients.

To account for the molecular heterogeneity of breast cancer in the first stage, we first specify breast cancer subtype-specific loading matrices. This allows the linear association between the latent factor activity levels and the exhibited gene expression levels to differ across breast cancer subtypes, and hence improves the interpretability and flexibility of the model. Second, we impose a sparsity assumption on the loading matrices. The purpose of this assumption is to incorporate the prior biological knowledge that the expression levels of certain genes can be crucial to the underlying latent processes while other genes are irrelevant, so that the corresponding entries in the loading matrix should be shrunk to zero. Methodologically, the sparsity assumption enters our factor model analysis through an L1 penalty; computationally, it is handled via an Alternating Direction Method of Multipliers (ADMM) optimization procedure.

Even with the successful estimation of the pathway activity levels of each patient, it is still not a straightforward task to link these latent activity levels to breast cancer survival. This is because the functional relation between a patient's survival time and his/her individual latent activity levels is unclear. We hence refrain from imposing any specific survival model and leave the functional form nonparametric. On the other hand, even when dealing with a seemingly small number of latent processes, our estimation is still subject to the curse of dimensionality due to the nonparametric nature of the relation.

In addition, besides the latent factor activity levels, certain covariates are known to be directly linked to the survival times of breast cancer patients. For example, the highest breast cancer incidence rates for white women are among the group aged 40 and above (DeSantis et al., 2014). We thus need to include the ages of patients as a clinically relevant variable and incorporate it into the model as well. As a second stage model, we therefore adopt the flexible sufficient dimension reduction approach. We assume that the pathways together with age form several linear combinations, and that these combinations jointly affect the survival time in a nonparametric way. We will use the data to determine the appropriate number of linear combinations, the combinations themselves, and the eventual functional form. To further take into account the censored nature of the survival times, we incorporate the martingale technique in combination with semiparametric treatment (Zhao et al., 2017) to provide a consistent estimator that is optimal in terms of achieving the smallest possible estimation variability.

2. Methodological Development

2.1. Data structure and modeling

To be specific, we describe the structure of the data considered in this work and our modeling strategy. Although the model is directly motivated by the breast cancer data, it is sufficiently general and can be applied in similar problems as well.

Let $(X_i, Y_i, \Delta_i, Z_i, W_i)$, $i = 1, \dots, n$, be independent and identically distributed (iid) observations measured from $n$ patients. Here, $X_i \in \mathbb{R}^p$ is the vector of gene expressions of the $i$th patient measured on $p$ genes. For the breast cancer data that we consider, $n = 978$ and $p = 19149$. Let $T_i$ be the time to event and $C_i$ be the censoring time for the $i$th patient; then $Y_i = \min(T_i, C_i)$ and $\Delta_i = I(T_i < C_i)$. It is common practice to classify breast cancer tumors into one of five subtypes (Parker et al. 2009; Cancer Genome Atlas Network 2012), and we use $Z_i$ to represent the breast cancer subtype of patient $i$. In our data, $Z_i$ takes one of five categorical values: 1 (Basal), 2 (Her2-enriched), 3 (Luminal B), 4 (Luminal A) and 5 (Normal-like). In a more general setting, we can view $Z_i$ as a stratification index with, say, $K$ different categories, and $Z_i$ can be correlated with $X_i$. Finally, $W_i$ contains additional covariates that are known to affect the time to event $T_i$. In the breast cancer study, $W_i$ contains age because of its known causal effect on survival (DeSantis et al., 2014).
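As a small illustration of the observed-data construction above, the following sketch forms $Y_i = \min(T_i, C_i)$ and $\Delta_i = I(T_i < C_i)$ from event and censoring times. The function name `observe` and the numerical values are hypothetical and are not taken from the breast cancer data.

```python
import numpy as np

def observe(event_time, censor_time):
    """Return (Y, Delta) with Y = min(T, C) and Delta = I(T < C)."""
    T = np.asarray(event_time, dtype=float)
    C = np.asarray(censor_time, dtype=float)
    Y = np.minimum(T, C)          # observed follow-up time
    Delta = (T < C).astype(int)   # 1 if the event is observed, 0 if censored
    return Y, Delta

# Hypothetical times for three patients (illustration only).
Y, Delta = observe([2.0, 5.0, 1.5], [3.0, 4.0, 1.5])
```

With the strict inequality used in the text, a tie between the event and censoring times is counted as censored.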

We first stratify gene expressions Xi based on cancer subtype Zi, and then model the relationship between Xi and the latent factor activity levels fi within each stratum using a factor model:

$$X_i = B_k f_i + U_i \quad \text{when } Z_i = k, \tag{1}$$

where $f_i \in \mathbb{R}^q$, $U_i \in \mathbb{R}^p$ and $B_k \in \mathbb{R}^{p \times q}$ is a subtype-specific loading matrix. For identifiability and to keep the interpretation simple, we assume that $\mathrm{cov}(f_i) = I_q$ and that $B_k^T B_k$ is diagonal for all $k = 1, \dots, K$. The above equation associates the independent latent biological processes ($F$) with the gene expression measurements ($X$) through the loading matrix $B$. The optimal number of latent factors ($q$) can be determined based on cross-validated mean squared error (MSE). In order to study the effects of pathways on latent activities, we incorporate prior knowledge in the form of a gene-pathway membership matrix ($G$) and set $B_k = G V_k$. Hence, the above model can be rewritten as

$$X_i = G V_k f_i + U_i \quad \text{when } Z_i = k. \tag{2}$$

Here, $G$ is known and $G \in \{-1, 0, 1\}^{p \times m}$ for $p$ genes and $m$ gene sets ($m \le p$). $G_{ij} = 1$ ($-1$) indicates that gene $i$ is part of the $j$th geneset and promotes (inhibits) geneset activity; $G_{ij} = 0$ indicates that gene $i$ is not part of the $j$th geneset. In addition, $V_k$ is a subtype-specific $m \times q$ loading matrix for the gene sets. The $j$th column of $V_k$ contains the coefficients of the pathways associated with latent variable $j$. In other words, the columns of $V_k$ are the estimated effects of pathways on the latent factors.

In the second stage, we model the impact of fi and Wi on Ti through

$$\mathrm{pr}(T_i < t \mid f_i, W_i) = h(t, \beta^T f_i + \alpha^T W_i), \tag{3}$$

where $\beta \in \mathbb{R}^{q \times d}$ with $d < q$, $\alpha \in \mathbb{R}^{\dim(W) \times d}$, and $h$ is an unspecified smooth conditional cumulative distribution function. The stage 1 model in (1) and (2) and the stage 2 model in (3) together fully describe our modeling strategy for the breast cancer data analysis.

2.2. Estimation of pathway activity levels in stage 1 model

In the first stage, we estimate individual latent activity levels based on breast cancer subtypes, which results in estimating subtype-specific loading matrices. To this end, we stratify the samples into five strata according to the $Z_i$ values, with the data in the $k$th stratum having $Z_i = k$ for all $i$. In each stratum, say stratum $k$, we first iteratively estimate the subtype-specific loading matrix $B_k$ and the latent factor $f_i$ for the $i$th patient in the $k$th stratum. After obtaining the estimates of $B_k$ and $f_i$, the estimate of the pathway loading matrix $V_k$ can be calculated as $V_k = (G^T G)^{-1} G^T B_k$. Next, we describe the iterative procedure to obtain $B_k$ and $f_i$ in detail.
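The last step above, recovering the pathway loading matrix from an estimated gene loading matrix, is an ordinary least squares projection. The sketch below illustrates it on a toy membership matrix; the dimensions, the matrix `G`, and the function name `pathway_loadings` are assumptions made for illustration, and the formula requires `G` to have full column rank.

```python
import numpy as np

def pathway_loadings(G, B):
    """Least squares solution of G V = B, i.e. V = (G^T G)^{-1} G^T B
    when G has full column rank."""
    V, *_ = np.linalg.lstsq(G, B, rcond=None)
    return V

# Toy membership matrix: 12 genes, 4 gene sets, each gene in exactly one set.
p, m, q = 12, 4, 2
G = np.zeros((p, m))
for i in range(p):
    G[i, i % m] = 1.0 if i < 8 else -1.0   # last four genes act as inhibitors

rng = np.random.default_rng(0)
V_true = rng.standard_normal((m, q))
B = G @ V_true                 # gene loadings implied by B = G V
V_hat = pathway_loadings(G, B)
```

Using a least squares solver rather than forming the inverse of $G^T G$ explicitly gives the same result here and is usually better conditioned numerically.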

For simplicity of notation, we omit the stratum index $k$ and write the analysis as if there were only one stratum. Let $X = (X_1, \dots, X_n)$, $F = (f_1, \dots, f_n)$, $U = (U_1, \dots, U_n)$. Hence the factor model (1) is summarized as $X = BF + U$. We can interpret the relationship as follows. The expression level $X_{ij}$ is linearly related to $q$ independent latent activities, plus some random noise captured in $U_{ij}$. If we do not impose any additional knowledge on $B$, then $B$ and $f_i$ can be estimated using the Singular Value Decomposition (SVD) as in the classical factor model. This is equivalent to minimizing the Frobenius norm of $X - BF$. In our context, we thus naturally extend the SVD to solving $\min_{B,F} \|X - BF\|_F^2$ subject to the constraints that $B^T B$ is diagonal and $FF^T = I_q$. Note that in the case of several strata, we only require $B^T B$ to be diagonal within each stratum. We do not impose additional constraints on the relation between loading matrices from different strata.
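The unpenalized problem described above has a closed-form solution given by the truncated SVD. The following sketch, on simulated data with arbitrary dimensions, shows one way to obtain $B$ and $F$ that satisfy both constraints; the function name `svd_factor` is hypothetical.

```python
import numpy as np

def svd_factor(X, q):
    """Rank-q least squares fit X ~ B F with F F^T = I_q and B^T B diagonal."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    F = Vt[:q, :]            # q x n, rows are orthonormal
    B = U[:, :q] * s[:q]     # p x q, columns are orthogonal
    return B, F

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 20))   # p = 30 genes, n = 20 samples (toy sizes)
B, F = svd_factor(X, q=3)
```

By the Eckart-Young theorem, the residual $\|X - BF\|_F^2$ of this fit equals the sum of the squared singular values beyond the first $q$.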

In breast cancer studies, some further prior knowledge is available. Certain breast cancer-related genes are more likely to participate in active biological processes than other genes. To incorporate such knowledge, we impose an additional sparsity condition on $B$ through a penalty. For example, using the Lasso penalty, we solve the minimization problem $\min_{B,F} n^{-1} \|X - BF\|_F^2 + \lambda \sum_{i=1}^{q} \|B_i\|_1$, where $B_i$ represents the $i$th column of $B$, excluding the components that we do not want to shrink to zero. Since the correlation among genes averaged only 0.08, we simply used the L1 penalty. Alternatively, the elastic net penalty can be used if such correlation is a concern. The detailed computational algorithm is provided in Supporting Information A.
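To make the effect of the L1 penalty concrete, the sketch below works out the loading update for a fixed $F$ with $FF^T = I_q$. In that case $\|X - BF\|_F^2 = \|B - XF^T\|_F^2 + \text{const}$, so each penalized entry of $B$ is the soft-thresholded entry of $XF^T$ with threshold $n\lambda/2$. This is a simplified illustration only: it ignores the constraint that $B^T B$ is diagonal and is not the ADMM algorithm of Supporting Information A. The names `update_loadings` and `penalized` are hypothetical.

```python
import numpy as np

def soft_threshold(A, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def objective(X, B, F, lam, penalized):
    """n^{-1} ||X - B F||_F^2 + lam * (L1 norm of the penalized rows of B)."""
    n = X.shape[1]
    return np.linalg.norm(X - B @ F) ** 2 / n + lam * np.abs(B[penalized]).sum()

def update_loadings(X, F, lam, penalized):
    """Minimize the objective over B for fixed F with F F^T = I_q.
    X: p x n, F: q x n, penalized: boolean p-vector (False = gene never shrunk)."""
    n = X.shape[1]
    A = X @ F.T                              # unpenalized least squares solution
    B = A.copy()
    B[penalized] = soft_threshold(A[penalized], n * lam / 2.0)
    return B

rng = np.random.default_rng(3)
p, n, q = 6, 40, 2
Q, _ = np.linalg.qr(rng.standard_normal((n, q)))
F = Q.T                                      # q x n with F F^T = I_q
X = rng.standard_normal((p, n))
penalized = np.array([True, True, True, True, False, False])
lam = 0.05
B = update_loadings(X, F, lam, penalized)
```

Rows flagged as unpenalized play the role of the genes excluded from shrinkage: their loadings are left at the least squares values.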

2.3. Estimation of survival model in stage 2 model

We now establish the relationship between breast cancer survival and the latent factor activity levels and other covariates that are linked to breast cancer survival. To this end, we treat $\tilde f_i = (f_i^T, W_i)^T$ as the covariates in the second stage model, where $f_i$ contains the individual latent factor activity levels and $W_i$ contains age. Because the latent factors are estimated in stage 1, this step introduces uncertainty that affects the stage 2 model. Jiang et al. (2019) take this uncertainty into account in the final variance analysis. Letting $\gamma = (\beta^T, \alpha^T)^T$, we can write (3) succinctly as

$$\mathrm{pr}(T_i < t \mid \tilde f_i) = h(t, \gamma^T \tilde f_i), \tag{4}$$

where $\gamma \in \mathbb{R}^{\{q + \dim(W_i)\} \times d}$. In the estimation of $\gamma$, all latent factors and ages from all $n$ samples are considered as covariates. Hence, we estimated $\gamma$ using all individuals together, not separately for each subtype. After estimating $\gamma$, we estimated the cumulative hazard function for each subtype in order to investigate possibly different functional dependences. Thus, we are able to obtain a specific survival function for each cancer subtype as follows:

$$\mathrm{pr}(T_i < t \mid \tilde f_i) = h_k(t, \gamma^T \tilde f_i), \quad \text{when } Z_i = k,$$

where $h_k(\cdot)$ is the conditional cumulative distribution function for the $k$th stratum, that is, one minus the corresponding survival function.

The second stage model (4) merits some remarks. First, we employ $d$ linear summaries given by the $d$ columns of $\gamma$, which allows flexibility in the association among the components of $f_i$ and $W_i$. Second, by leaving the link function $h(\cdot)$ in (4) unspecified, we can avoid problems that stem from model misspecification. In order for $\gamma$ to be identifiable, we parameterize it so that the upper $d \times d$ block of $\gamma$ is the identity matrix $I_d$. For convenience, we write $\tilde f = (\tilde f_u^T, \tilde f_l^T)^T$, where $\tilde f_u \in \mathbb{R}^d$ and $\tilde f_l \in \mathbb{R}^{q + \dim(W_i) - d}$.
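The identifiability constraint above can be illustrated as follows. Because $h$ is unspecified, $\gamma$ and $\gamma A$ describe the same model for any invertible $d \times d$ matrix $A$, so only the column space of $\gamma$ matters; fixing the upper block to $I_d$ picks one representative. The sketch assumes the upper $d \times d$ block of the input is invertible, and the function name `normalize_gamma` and the example values are hypothetical.

```python
import numpy as np

def normalize_gamma(gamma, d):
    """Return gamma @ inv(upper block), so that the upper d x d block is I_d.
    The column space of gamma is unchanged."""
    upper = gamma[:d, :]
    return np.linalg.solve(upper.T, gamma.T).T

# Hypothetical coefficient vector with d = 1.
gamma = np.array([[2.0], [-4.0], [1.0]])
gamma_id = normalize_gamma(gamma, d=1)
```

With $d = 1$ this reduces to dividing every coefficient by the first one, which matches the convention of setting the first coefficient to 1.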

Combining semiparametric treatment and martingale process, Zhao et al. (2017) proposed an estimation procedure for censored survival data that is consistent, asymptotically normal and semiparametric efficient, without imposing any assumption on the conditional distribution of the censoring time given covariates. We use this procedure to obtain the semiparametric efficient estimator by solving the following estimating equations

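$$\sum_{i=1}^{n} \Delta_i \frac{\hat\lambda_1(Y_i, \gamma^T \tilde f_i)}{\hat\lambda(Y_i, \gamma^T \tilde f_i)} \left[ \tilde f_{li} - \frac{\hat E\{\tilde f_{li} U_i(Y_i) \mid \gamma^T \tilde f_i\}}{\hat E\{U_i(Y_i) \mid \gamma^T \tilde f_i\}} \right] = 0, \tag{5}$$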

where $\hat\lambda(\cdot)$ is the estimated hazard function, $\hat\lambda_1(\cdot)$ is the derivative of $\hat\lambda(\cdot)$ with respect to $\gamma^T \tilde f$, and $U(t) = I(Y \ge t)$ is the at-risk indicator. More specifically,

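$$\hat\lambda(Y, \gamma^T \tilde f) = \sum_{i=1}^{n} K_b(Y_i - Y) \frac{\Delta_i K_h(\gamma^T \tilde f_i - \gamma^T \tilde f)}{\sum_{j=1}^{n} I(Y_j \ge Y_i) K_h(\gamma^T \tilde f_j - \gamma^T \tilde f)}, \tag{6}$$
$$\hat\lambda_1(Y, \gamma^T \tilde f) = \partial \hat\lambda(Y, \gamma^T \tilde f) / \partial (\gamma^T \tilde f), \tag{7}$$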

where $K(\cdot)$ is a kernel function, $K_h(\cdot) = K(\cdot/h)/h$ and $h$ is a bandwidth. $K_h'(v)$ is the first derivative of $K_h(v)$ with respect to $v$, and $b$ is also a bandwidth. Similarly, using kernel estimation, we let

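$$\hat E\{\tilde f_{li} U_i(Y_i) \mid \gamma^T \tilde f_i\} = \frac{\sum_{j=1}^{n} K_h(\gamma^T \tilde f_j - \gamma^T \tilde f_i) \tilde f_{lj} I(Y_j \ge Y_i)}{\sum_{j=1}^{n} K_h(\gamma^T \tilde f_j - \gamma^T \tilde f_i)}, \tag{8}$$
$$\hat E\{U_i(Y_i) \mid \gamma^T \tilde f_i\} = \frac{\sum_{j=1}^{n} K_h(\gamma^T \tilde f_j - \gamma^T \tilde f_i) I(Y_j \ge Y_i)}{\sum_{j=1}^{n} K_h(\gamma^T \tilde f_j - \gamma^T \tilde f_i)}, \tag{9}$$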

where $E\{U_i(Y_i) \mid \gamma^T \tilde f_i\} \equiv E\{U_i(t) \mid \gamma^T \tilde f_i\}|_{t = Y_i}$. Inserting (6), (7), (8) and (9) into (5), we can solve the estimating equation, and the solution is the semiparametric efficient estimator of $\gamma$. More details about the derivation of the semiparametric efficient estimator can be found in Supporting Information B.
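The kernel quantities in (6), (8) and (9) can be sketched numerically for the case $d = 1$. The code below is an illustrative reading of those formulas, not the authors' implementation: the Gaussian kernel, the bandwidth values, the simulated inputs, and the function names are all assumptions. The second function returns the ratio of (8) to (9), in which the common denominator cancels.

```python
import numpy as np

def gauss_kernel(v, bw):
    """K_bw(v) = K(v / bw) / bw with a standard normal kernel K (assumed choice)."""
    return np.exp(-0.5 * (v / bw) ** 2) / (bw * np.sqrt(2.0 * np.pi))

def hazard_hat(y, u, Y, Delta, index, b, h):
    """Kernel hazard estimate in the form of (6) at time y and index value u.
    index[i] plays the role of gamma^T f_i (scalar index, d = 1)."""
    Kh = gauss_kernel(index - u, h)              # K_h(index_j - u)
    at_risk = Y[None, :] >= Y[:, None]           # at_risk[i, j] = I(Y_j >= Y_i)
    denom = at_risk @ Kh                         # sum_j I(Y_j >= Y_i) K_h(...)
    return np.sum(gauss_kernel(Y - y, b) * Delta * Kh / denom)

def risk_set_mean(fl, Y, index, h):
    """Row i is the ratio of (8) to (9): a kernel-weighted mean of f_l over
    subjects still at risk at time Y_i."""
    Kh = gauss_kernel(index[None, :] - index[:, None], h)   # [i, j]
    W = Kh * (Y[None, :] >= Y[:, None])
    return (W @ fl) / W.sum(axis=1, keepdims=True)

# Simulated inputs (illustration only).
rng = np.random.default_rng(2)
n = 50
index = rng.standard_normal(n)
Y = rng.exponential(size=n)
Delta = rng.integers(0, 2, size=n)
fl = rng.standard_normal((n, 3))
lam_hat = hazard_hat(0.5, 0.0, Y, Delta, index, b=0.3, h=0.5)
R = risk_set_mean(fl, Y, index, h=0.5)
```

A kernel with unbounded support keeps the denominators strictly positive in exact arithmetic; with very small bandwidths the weights can underflow, so a practical implementation would need to guard against division by zero.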

The nonparametric estimators given in (6), (7), (8) and (9) require bandwidth selection. As noted in Assumption A2 in Supporting Information C, a wide range of bandwidths can be used without affecting the asymptotic properties, i.e., the final estimator $\hat\gamma$ is insensitive to the bandwidth selection. For example, we can select the bandwidths by raising the sample size $n$ to a suitable power for which Assumption A2 holds, and multiplying by a constant, such as the variance of the covariates, for proper scaling. We refer to Zhao et al. (2017) for more specific guidelines on bandwidth selection.

Zhao et al. (2017) showed the consistency and asymptotic properties of the efficient estimator obtained by solving the estimating equations (5) under Assumptions A1-A7 in Supporting Information C. In summary, $\hat\gamma - \gamma_0 = o_p(1)$ and $\sqrt{n}(\hat\gamma - \gamma_0) \to N(0, [E\{S_{\mathrm{eff}}^{\otimes 2}(\Delta, Y, \tilde f)\}]^{-1})$ in distribution, where $\gamma_0$ is the true parameter, $S_{\mathrm{eff}}$ is the efficient score with its specific form given in Supporting Information C, and $a^{\otimes 2} \equiv a a^T$ for any vector or matrix $a$. These asymptotic properties are provided through Theorems 1 and 2 in Supporting Information C for convenience. See Zhao et al. (2017) for their proofs.

3. Simulations

3.1. Simulation Settings

Before applying our modeling and estimation methods to analyze the breast cancer data, we first carry out simulations to evaluate the finite sample performance of the methods. We set up three simulation scenarios, which vary in the number of strata (for cancer subtypes, $K$), the number of variables ($p$), the variance-covariance matrix ($\Sigma$) of the $p$ random variables, the number of latent variables ($q$), the sample size ($n$), and the proportion of censoring.

Scenario 1

In scenario 1, we consider two strata ($K = 2$) for the stage 1 model. We fix $Z_i = 1$ for $i = 1, \dots, n_1$, and $Z_i = 2$ for $i = n_1 + 1, \dots, n_1 + n_2$. The sample size for each stratum is $n_1 = n_2 = 250$, resulting in $n = 500$. We let $p = 200$, $q = 4$ and $d = 1$. For each $i = 1, \dots, n$, we simulate $f_i$ from the $q$-dimensional multivariate standard normal distribution and further generate a $p \times 1$ random vector $L_i$ as follows. When $Z_i = 1$, we simulate $L_i$ from the $p$-dimensional multivariate standard normal distribution. When $Z_i = 2$, we simulate $L_i$ from a $p$-dimensional multivariate normal distribution with mean 0 and variance-covariance matrix $\Sigma_L$, where the $(i, j)$ element of $\Sigma_L$ is $0.5^{|i - j|}$ for $1 \le i, j \le p$.

To construct the factor loading matrix $B_k$ for each stratum from (1), we perform an eigendecomposition of the matrix $LL^T$, where $L = (L_1, \dots, L_n)^T$. We let $E_1$ be the $n \times q$ matrix with orthonormal columns formed by the eigenvectors corresponding to the $q$ largest eigenvalues of $LL^T$, and let $E_2$ be the $n \times q$ matrix with orthonormal columns formed by the eigenvectors corresponding to the $(q + 1)$th to the $2q$th largest eigenvalues of $LL^T$. We then form $B_k = n^{-1/2} L^T E_k$, for $k = 1, 2$. Because the eigenvectors of a symmetric matrix can be chosen orthogonal to each other, this construction ensures that $B_k^T B_k$ is diagonal.
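The loading construction for scenario 1 can be reproduced in a few lines. The sketch below follows the description above with $n_1 = n_2 = 250$, $p = 200$ and $q = 4$; the random seed and the function name `simulate_loadings` are arbitrary choices, so the numbers it produces are not those of the paper.

```python
import numpy as np

def simulate_loadings(n1, n2, p, q, rng):
    """Build B_1 and B_2 as n^{-1/2} L^T E_k from the eigenvectors of L L^T."""
    idx = np.arange(p)
    Sigma_L = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # (i, j) entry 0.5^{|i-j|}
    L1 = rng.standard_normal((n1, p))                       # stratum 1 rows
    L2 = rng.standard_normal((n2, p)) @ np.linalg.cholesky(Sigma_L).T  # stratum 2 rows
    L = np.vstack([L1, L2])                                 # n x p
    n = n1 + n2
    evals, evecs = np.linalg.eigh(L @ L.T)                  # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]                         # indices, largest first
    E1 = evecs[:, order[:q]]                                # top q eigenvectors
    E2 = evecs[:, order[q:2 * q]]                           # next q eigenvectors
    return L.T @ E1 / np.sqrt(n), L.T @ E2 / np.sqrt(n)

rng = np.random.default_rng(4)
B1, B2 = simulate_loadings(250, 250, 200, 4, rng)
```

Since $B_k^T B_k = n^{-1} E_k^T L L^T E_k$, the result is the diagonal matrix of the selected eigenvalues divided by $n$, which is why the diagonality constraint holds by construction.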

We further generate the $U_i$'s from a $p$-dimensional multivariate normal distribution with mean 0 and variance-covariance matrix $0.5 I_p$ and then form $X_i = B_k f_i + U_i$ when $Z_i = k$, for $k = 1, 2$, $i = 1, \dots, n$. To further generate data in the stage 2 model, we set $\dim(W_i) = 1$, and generate $W_i$ from a standard normal distribution. We then form a new covariate $\tilde f_i = (f_i^T, W_i)^T \in \mathbb{R}^{q+1}$. We now generate event times from

T=|γTf|2+(γTf+1.5)2+0.5ϵ,

where $\epsilon$ is uniformly distributed on (0, 1), and $\gamma = (1, 0, 0.3, -0.3, -1)^T$. We further generate the covariate-dependent censoring times using $C_i = \Phi(2 f_{i2} + 2 f_{i3}) + c_1$, where $c_1$ is used to control the proportion of censoring.

Scenario 2

In scenario 2, we examine the numerical performance under a more general setting with five strata and eight latent factors, different sizes of $(n, p)$, and $\dim(W_i) = 4$. The detailed settings for scenario 2 are relegated to Supporting Information D and E.

Scenario 3

We design simulation scenario 3 to resemble the real data in Section 4, i.e., $K = 5$, $p = 19{,}149$, $q = 8$, $\dim(W_i) = 1$, and $d = 1$. Unlike scenario 2, we treat the estimated loading matrices from the real data set in Section 4 as the true loadings. The specific setting and the performance for scenario 3 are given in Supporting Information D and E.

3.2. Simulation Results

We first evaluate the performance of the stage 1 model and compare it with two closely related methods, supervised principal components (Superpc) (Bair et al., 2006; Bair and Tibshirani, 2004) and surrogate variable analysis (SVA) (Leek and Storey, 2007), using data generated under simulation scenario 1. Because the current version of Superpc only allows $q$ to be at most 3, we set $q = 3$ in this comparison. For Superpc and SVA, we used the software provided in R/Bioconductor (Superpc and sva). To account for the $K$ cancer subtypes, for Superpc, we stratified the samples into $K$ strata and performed the analysis with default parameter values. For SVA, cancer subtypes were treated as covariates adjusted for in the model. We compare the estimation of $F$ and calculate the average absolute errors (AAE): $\|P_{\hat F} - P_F\|_1$. Here the projection matrix is calculated as $P_A = A(A^T A)^{-1} A^T$ for any generic matrix $A$. The AAEs of our proposed approach, Superpc, and SVA are presented in Table 1. The results suggest that our proposed approach based on iterative ADMM outperforms Superpc and SVA and achieves the smallest errors in estimating $F$.
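Comparing projection matrices rather than the factor estimates themselves makes the error measure invariant to invertible linear transformations of the factors, which a factor model cannot distinguish. The sketch below illustrates this; reading $\|\cdot\|_1$ as the mean absolute entry is an assumption suggested by the name "average absolute error", and the factors are stored with one row per sample (the transpose of $F$ in the text). The function names are hypothetical.

```python
import numpy as np

def projection(A):
    """P_A = A (A^T A)^{-1} A^T for a full column rank matrix A."""
    return A @ np.linalg.solve(A.T @ A, A.T)

def aae(A_hat, A):
    """Mean absolute entry of P_{A_hat} - P_A (assumed reading of the norm)."""
    return np.mean(np.abs(projection(A_hat) - projection(A)))

rng = np.random.default_rng(5)
F_true = rng.standard_normal((50, 3))            # 50 samples, q = 3 factors
R = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])                  # invertible mixing matrix
err_same_space = aae(F_true @ R, F_true)         # same column space
err_other = aae(rng.standard_normal((50, 3)), F_true)
```

An estimate that spans the same space as the truth has essentially zero error under this measure, while an unrelated estimate does not.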

Table 1:

The average absolute errors (AAE) for the estimation of the latent factors ($F$) using our proposed approach (Iterative ADMM), Superpc, and SVA using data generated from simulation scenario 1 with $q = 3$. AAE is calculated as $\|P_{\hat B_k} - P_{B_k}\|_1$. Here the projection matrix is calculated as $P_A = A(A^T A)^{-1} A^T$ for any generic matrix $A$.

Censoring Rate 0% 20% 40%
Iterative ADMM 0.0038 0.0042 0.0042
Superpc 0.3199 0.3225 0.3279
SVA 0.0106 0.0115 0.0109

In addition, we report the performance of the estimation of $F$ and $B$ by calculating the corresponding AAEs for our proposed approach in all three simulation scenarios. For the stage 2 model, we report the difference between the true and estimated indices via AAE3, which is the average of $n^{-1} \sum_{i=1}^{n} \|\hat\beta^T \hat f_i - \beta^T f_i\|_1$. See Table S.5 in Supporting Information E for the results.

Finally, we evaluate the performance of our proposed methods including both the stage 1 and stage 2 procedures, and provide the final coefficient estimates, absolute biases, and sample standard deviations in Table S.1 for scenario 1, Table S.2 for scenario 2, and Table S.3 for scenario 3, respectively. For scenarios 1 and 2, we consider three different censoring rates: no censoring, 20%, and 40% censoring. Across all simulations, the estimators have very small biases. It is not surprising that the variability of the estimators under no censoring is in general smaller than in the cases with censoring. Further, as $p$ and $n$ increase in scenario 2, the estimation performance improves in terms of the Euclidean distances $\|\hat\gamma - \gamma\|$, as shown in Figure S.1. The results in scenario 3 (Table S.3) continue to have very small bias, while the standard errors are larger due to the small sample sizes in some individual strata.

4. Application

We now apply the two-stage modeling and estimation procedure to the breast cancer survival data. We downloaded breast cancer data from The Cancer Genome Atlas (TCGA) project via the ICGC controlled data portal (https://dcc.icgc.org/releases/release26/Projects/BRCA-US). The data contain 978 female patients with information on gene expression, age, and survival/follow-up time. Following Parker et al. (2009), we categorize the patients in the study into five subtypes: Basal (185, 18.9%), Her2-enriched (107, 10.9%), Luminal B (428, 43.8%), Luminal A (248, 25.4%), and Normal-like (10, 1.0%). For the subtype classification, we employ the prediction analysis of microarray (PAM) algorithm (Parker et al., 2009). Each individual is classified into one of the five subtypes using the 50-gene classifier (PAM50). The R package 'genefu' is available to carry out this algorithm. Among the 978 patients, only 100 events are observed, while the remaining 878 are censored; hence the censoring rate is about 89.8%. There is no left- or right-truncation.
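The subtype shares and the censoring rate quoted above follow directly from the reported counts, as the short check below shows. The variable names are arbitrary; the counts are those stated in the text.

```python
# Subtype counts as reported in the text.
counts = {
    "Basal": 185,
    "Her2-enriched": 107,
    "Luminal B": 428,
    "Luminal A": 248,
    "Normal-like": 10,
}
n = sum(counts.values())                                  # total number of patients
shares = {k: round(100 * v / n, 1) for k, v in counts.items()}

events = 100                                              # observed events
censoring_rate = round(100 * (n - events) / n, 1)         # percent censored
```

The very small Normal-like stratum and the high censoring rate are worth keeping in mind when reading the subtype-specific results.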

For each patient, expression levels at 19,149 genes are measured, i.e., $p = 19{,}149$. In the following analysis, the optimal number of latent factors $q$ was estimated to be 8 based on the 10-fold cross-validated MSE per latent factor ($\|X - \hat X\|_F^2 / q$). Inspired by the fact that genetic networks (Gardner et al., 2003) are sparse, we impose a sparsity condition on $B$ using the Lasso penalty, except for 132 breast cancer-related genes reported in the KEGG database. The list of these 132 genes can be found in Table S.4 in Supporting Information E.

We incorporate prior information on gene-pathway membership into the stage 1 analysis. Using the KEGG database, 335 genesets are downloaded, and 91 genesets specific to human diseases are excluded. We use the terms geneset (a collection of functionally related genes) and pathway interchangeably throughout the text. The KEGG pathway database provides graphical diagrams indicating various interactions between genetic molecules. We specify $G \in \{-1, 0, 1\}^{p \times m}$ based on the above information. $G_{ij} = 1$ ($-1$) indicates that gene $i$ is part of the $j$th geneset and promotes (inhibits) geneset activity; $G_{ij} = 0$ indicates that gene $i$ is not part of the $j$th geneset. For each gene that does not belong to any known gene set, a new gene set that contains only that gene is constructed. Using the ICGC breast cancer data, $m = m' + m'' = 244 + 12095 = 12339$ ($m' = 244$ known KEGG genesets and $m'' = 12{,}095$ genesets for genes that do not map to any known KEGG geneset).
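The construction of $G$, including the singleton gene sets for unmapped genes, can be sketched on a toy example. The gene and pathway names below are made up, the input format is a hypothetical choice, and assigning the value $+1$ to a gene in its own singleton set is an assumption.

```python
import numpy as np

def build_membership(genes, genesets):
    """genes: list of gene names; genesets: dict name -> {gene: +1 or -1}.
    Returns G (p x m) and the column names. Genes that belong to no gene set
    each get their own singleton column (entry +1, an assumed convention)."""
    row = {g: i for i, g in enumerate(genes)}
    names = list(genesets)
    covered = {g for members in genesets.values() for g in members}
    orphans = [g for g in genes if g not in covered]
    G = np.zeros((len(genes), len(names) + len(orphans)))
    for j, name in enumerate(names):
        for g, sign in genesets[name].items():
            G[row[g], j] = sign                 # +1 promotes, -1 inhibits
    for k, g in enumerate(orphans):
        G[row[g], len(names) + k] = 1.0         # singleton gene set
    return G, names + ["singleton:" + g for g in orphans]

# Toy example: two known pathways, two unmapped genes.
genes = ["A", "B", "C", "D"]
genesets = {"P1": {"A": 1, "B": -1}, "P2": {"B": 1}}
G, columns = build_membership(genes, genesets)
```

In this toy case the number of columns is $2 + 2 = 4$, mirroring the decomposition $m = m' + m''$ in the text.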

We performed the estimation procedure following the description in Section 2. After obtaining the estimate of $B_k$, we calculate the estimate of $V_k \in \mathbb{R}^{m \times q}$ as $V_k = (G^T G)^{-1} G^T B_k$, and we refer to $V_k$ as the pathway loading matrix. To explore the relationship between known pathways and the latent factors, we examined the submatrix of $V$ that contains the loading values of the 244 known genesets on the eight latent factors. Using the data from Her2-enriched breast cancer, we remove the pathways whose loading values lie between the 5% and 95% quantiles for all latent factors and plot the factor loadings for the remaining 21 pathways in Figure 1.
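The pathway filtering step can be written compactly. The sketch below keeps a pathway if its loading falls outside the 5% to 95% quantile range for at least one latent factor; computing the quantiles separately for each factor (column) is an assumption, and the toy matrix and the function name `extreme_pathways` are for illustration only.

```python
import numpy as np

def extreme_pathways(V, lower=0.05, upper=0.95):
    """Return row indices of pathways whose loading is at or beyond the
    lower/upper quantile of at least one factor (column) of V."""
    lo = np.quantile(V, lower, axis=0)      # per-factor lower quantile
    hi = np.quantile(V, upper, axis=0)      # per-factor upper quantile
    inside = (V > lo) & (V < hi)            # strictly between the quantiles
    return np.where(~inside.all(axis=1))[0]

# Toy pathway loading matrix: 100 pathways, 2 factors.
V = np.column_stack([np.arange(100.0), np.arange(100.0)[::-1]])
kept = extreme_pathways(V)
```

Only the pathways with the most extreme loadings survive, which is what makes the resulting heatmap readable.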

Figure 1:


Heatmap of top pathways loaded on eight latent factors in patients with Her2-enriched breast cancer.

The top pathways associated with the eight latent factors are presented in Table 2. Several pathways listed in Table 2 are reported to play pivotal roles in breast cancer metastasis in the literature including: fatty acid metabolism, adherens junction, ribosome, antigen processing and presentation, ErbB and relaxin signaling pathway (Monaco, 2017; Burris, 2004; Eroles et al., 2012; Penzo et al., 2019; Zhu et al., 2014; Khoury et al., 2001; Radestock et al., 2008).

Table 2:

Top pathways associated with eight latent factors in patients with Her2-enriched breast cancer.

Latent factor Factor loading Top pathways
Factor 1 7.85 Biotin metabolism
7.24 Proximal tubule bicarbonate reclamation
−17.86 Relaxin signaling pathway
−18.58 Ribosome

Factor 2 7.56 Relaxin signaling pathway
6.20 Protein digestion and absorption
−3.24 Antigen processing and presentation
−7.99 Ribosome

Factor 3 2.43 Adherens junction
1.99 Biotin metabolism
−1.70 Fatty acid biosynthesis
−1.79 Relaxin signaling pathway

Factor 4 1.35 Adherens junction
1.32 ErbB signaling pathway
−0.83 PPAR signaling pathway
−3.65 Antigen processing and presentation

Factor 5 1.30 Endocrine and other factor-regulated calcium reabsorption
1.09 Hematopoietic cell lineage
−0.96 Neomycin, kanamycin and gentamicin biosynthesis
−1.44 Relaxin signaling pathway

Factor 6 1.24 PPAR signaling pathway
1.16 Biosynthesis of unsaturated fatty acids
−1.34 Relaxin signaling pathway
−5.06 Ribosome

Factor 7 1.60 Adherens junction
1.44 ErbB signaling pathway
−1.55 Biotin metabolism
−2.21 Hematopoietic cell lineage

Factor 8 1.52 Fatty acid biosynthesis
0.85 ErbB signaling pathway
−0.85 Antigen processing and presentation
−0.88 Biotin metabolism

For example, the results for latent factor 4 in Table 2 and Figure 1 suggest that this latent process involves pathway crosstalk between adherens junction, ErbB signaling, PPAR signaling and antigen processing and presentation. Similar collaboration patterns of deregulated pathways are also reported in a microarray gene expression study by Zhu et al. (2014). The study by Khoury et al. (2001) reported that the ErbB signaling pathway works together with cell-cell junction activity and alters cell motility. PPAR signaling transduction and immuno-inflammatory response (antigen processing and presentation) activities are advantageous for malignant cells to stay alive. In addition, the HIF1 signaling pathway indicated in the fourth column of Figure 1 has also been found to be a major regulator of ErbB2 that regulates malignant cells from anoikis and metabolic stress caused by decreased matrix adhesion (Whelan et al., 2013). Similarly, the corresponding heatmaps of top pathways in $V$ for the Basal, Luminal B, Luminal A, and Normal-like subtypes are provided in Figures S.2 to S.5 of Supporting Information E.

In the stage 2 model, we incorporated the latent factor activity levels and included age as the covariate $W$ because of its potential association with breast cancer survival. In addition, the dimension of the reduced space $d$ is selected via the validated information criterion (VIC), where the candidate dimension at which the VIC value is minimized is chosen as $d$ (Ma and Zhang, 2015). Specifically, for the breast cancer data, the VIC value at $d = 1$ is 62.05, whereas the VIC values are all greater than 124.27 when $d \ge 2$. Thus we select $d = 1$. The tuning parameters ($\lambda$ and $\rho$) in stage 1 were selected via tenfold cross-validation, where 90% of the observations were randomly selected as the training sample and the remaining 10% as the testing sample.

We estimated the variability of the estimates through 1,000 bootstrap samples and provide the final results in Table 3, showing that all eight latent activity levels are statistically significant in terms of their relation with survival time, with p-values ranging from 0.0045 to 0.0192. However, having adjusted for the latent activity levels, age is no longer a significant factor, with a p-value above 0.5. In addition, our results illustrate that latent factors 2 and 5 affect the time to event $T$ in the opposite direction to the other factors. Since we selected $d = 1$, we set $\gamma_1$ in (4) to be 1 for identifiability of $\gamma$. Relative to the first latent factor, the magnitudes of the effects of all other factors on the time to event varied from 1.7 to 2.5. For example, $\gamma_4 \approx 1.97$, meaning that the fourth latent factor has almost twice the impact on the cumulative hazard compared to the first latent factor, and it acts in the same direction as the first latent factor. Furthermore, since the coefficient of latent factor 3 in Table 3 is positive, we can expect the cumulative hazard to decrease with increasing values of latent factor 3 while all other latent factor activity levels are held fixed.

Table 3:

Analysis of the breast cancer data with censoring. Estimates (est) are obtained from the two-stage model. Sample mean of estimates (mean), sample median of estimates (median), sample standard deviation (std) and sample median absolute deviation (mad) are calculated based on 1,000 bootstrap samples. The pathway activity level indices in the second stage survival model are γi, i = 2, …, 8, and γ9 is the index corresponding to age.

statistic  γ2       γ3      γ4      γ5       γ6      γ7      γ8      γ9
est        −2.4591  2.5606  1.9657  −2.1220  2.0584  1.8624  1.7060  0.0878
mean       −2.5368  2.6165  1.9785  −2.1641  2.0389  1.8670  1.7519  0.0568
median     −2.5151  2.5929  1.9345  −2.1199  1.9645  1.8133  1.7360  0.0312
std        0.8940   0.9278  0.7769  0.8572   0.7588  0.7975  0.7378  0.6246
mad        0.8305   0.8353  0.7752  0.8088   0.8069  0.8089  0.7627  0.7770
p-value    0.0045   0.0048  0.0109  0.0116   0.0072  0.0192  0.0176  0.9275
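The text does not state how the p-values in Table 3 were computed. As a plausibility check only, the sketch below assumes a two-sided normal (Wald-type) test that uses the bootstrap mean as the point estimate and the bootstrap standard deviation as the standard error. This assumption is inferred from the tabulated numbers and is not a procedure described by the authors; the function name and variable names are hypothetical.

```python
import math

# Bootstrap summaries copied from Table 3 (columns gamma_2 ... gamma_9).
labels = ["gamma2", "gamma3", "gamma4", "gamma5",
          "gamma6", "gamma7", "gamma8", "gamma9"]
boot_mean = [-2.5368, 2.6165, 1.9785, -2.1641, 2.0389, 1.8670, 1.7519, 0.0568]
boot_std = [0.8940, 0.9278, 0.7769, 0.8572, 0.7588, 0.7975, 0.7378, 0.6246]
reported_p = [0.0045, 0.0048, 0.0109, 0.0116, 0.0072, 0.0192, 0.0176, 0.9275]


def two_sided_normal_p(estimate, std_error):
    """Two-sided p-value for H0: coefficient = 0 under a normal approximation.

    Uses the identity 2 * (1 - Phi(z)) = erfc(z / sqrt(2)) for z >= 0.
    """
    z = abs(estimate) / std_error
    return math.erfc(z / math.sqrt(2.0))


# ASSUMPTION: bootstrap mean / bootstrap std is the test statistic.
computed_p = [two_sided_normal_p(m, s) for m, s in zip(boot_mean, boot_std)]

for name, p_calc, p_tab in zip(labels, computed_p, reported_p):
    print(f"{name}: computed p = {p_calc:.4f}, Table 3 p = {p_tab:.4f}")
```

Under this assumption the computed values agree with the reported p-values to within rounding, which suggests, but does not confirm, that a normal approximation of this kind was used.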

The estimated cumulative hazard function Λ^(t,βTf+αTW) is plotted in Figure 2 for four cancer subtypes: Basal, Her2-enriched, Luminal B and Luminal A. Because no terminal events were observed in the Normal-like cancer patients, we do not provide results for this cancer subtype. The upper-left panel in Figure 2 plots Λ^ as a function of t while the index is fixed at its mean value β^Tf¯i+α^TW¯i for each cancer subtype. Compared to the other cancer subtypes, the Her2-enriched subtype shows a much higher estimated cumulative hazard at almost all times, suggesting a much higher risk of an event. The other three panels plot Λ^ as a function of the index βTf+αTW while t is fixed at 600, 1200 and 3000, respectively. These three plots illustrate that the cumulative hazard function is generally a decreasing function of the index βTf+αTW. In addition, we provide contour plots of Λ^ in Figure 3 as a function of both t and the index βTf+αTW for the four cancer subtypes. The numbers in the plots are the values of Λ^ on the contours. The contour plots also show that the Her2-enriched cancer subtype has a relatively high cumulative hazard.

Figure 2:

Plots of the estimated cumulative hazard function Λ^ for four cancer subtypes. Upper-left panel: Λ^ as a function of t while the index β^Tf+α^TW is fixed at its sample mean. Other panels: Λ^ as a function of βTf+αTW while t is fixed at 600 (upper-right), 1200 (lower-left) and 3000 (lower-right).

Figure 3:

Contour plots of Λ^ as a function of βTf+αTW and t for each cancer subtype (Basal, Her2-enriched, Luminal B and Luminal A, as labeled).

5. Discussion

We have adopted a framework for analyzing high-dimensional breast cancer survival data by combining a factor model with a flexible sufficient direction method. The framework has several features and corresponding benefits. First, for different cancer subtypes, we employ a stratified approach using a latent factor model, i.e., the factors and the loading matrix are estimated separately for each cancer subtype. Second, we incorporate prior knowledge of sparsity in the factor loading matrix through an L1 penalty, which reduces the number of nonzero elements to be estimated in the loading matrix. Third, we incorporate prior gene-pathway membership information so that the pathway-latent factor relationships can be characterized via the pathway loading matrices Vk. Finally, we employ a very flexible general index model in the second stage, which requires only minimal assumptions. This is important because a model that uses latent factors as covariates can be more susceptible to model misspecification.

In the stage one model, Vk is calculated as Vk = (GᵀG)⁻¹GᵀBk. To ensure that GᵀG is invertible, it is recommended that the pathways defined in G contain gene sets that are not linear combinations of other smaller pathways. In some special situations, a generalized inverse can be applied for estimating Vk (Glazko and Emmert-Streib, 2009).
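The calculation of Vk can be illustrated with a small numerical sketch. It assumes that G is a p × m binary gene-pathway membership matrix and Bk is a p × q loading matrix; these shapes, the toy sizes, and all variable names are illustrative assumptions rather than the authors' code. The second half shows what happens when one pathway is a linear combination of others, using the Moore-Penrose pseudoinverse as one possible generalized inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
p, m, q = 12, 3, 2  # toy sizes: genes, pathways, latent factors (assumed)

# Hypothetical membership matrix: G[j, l] = 1 if gene j belongs to pathway l.
G = np.zeros((p, m))
G[0:5, 0] = 1
G[3:9, 1] = 1
G[7:12, 2] = 1

# Stand-in for an estimated loading matrix B_k of one cancer subtype.
B_k = rng.normal(size=(p, q))

# Direct use of the formula V_k = (G^T G)^{-1} G^T B_k (requires full column rank).
V_normal = np.linalg.solve(G.T @ G, G.T @ B_k)

# Numerically preferable equivalent: least-squares solve of G V = B_k.
V_lstsq, *_ = np.linalg.lstsq(G, B_k, rcond=None)

# Rank-deficient case: a fourth pathway that is the sum of pathways 0 and 2,
# so G_dep^T G_dep is singular and the plain inverse does not exist.
G_dep = np.column_stack([G, G[:, 0] + G[:, 2]])
rank_dep = np.linalg.matrix_rank(G_dep)

# The pseudoinverse returns the minimum-norm least-squares solution.
V_pinv = np.linalg.pinv(G_dep) @ B_k

print("rank of G_dep:", rank_dep, "with", G_dep.shape[1], "columns")
print("fitted values agree:", np.allclose(G_dep @ V_pinv, G @ V_normal))
```

In the rank-deficient case the individual entries of Vk are no longer uniquely determined, although the fitted loadings G Vk are, which is one reason to avoid pathways that are linear combinations of others.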

In this study, we used the cross-validated MSE/q to choose the optimal number of latent factors (q). A large body of literature provides many estimation methods for determining the number of factors (Ahn and Horenstein 2013; Bai and Ng 2002; Lam and Yao 2012; Luo and Li 2016), which can be easily adapted into our proposed approach.
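As one illustration of such an alternative, the sketch below implements a simple eigenvalue-ratio rule in the spirit of Ahn and Horenstein (2013): choose the k that maximizes the ratio of the k-th to the (k+1)-th largest eigenvalue of the sample second-moment matrix. It is not the cross-validated MSE/q procedure used in this study; the function name, the upper bound k_max, and the simulated data are all assumptions made for the example.

```python
import numpy as np


def eigenvalue_ratio_num_factors(X, k_max):
    """Estimate the number of factors from an n x p data matrix X.

    Returns the k in 1..k_max that maximizes lambda_k / lambda_{k+1}, where
    lambda_1 >= lambda_2 >= ... are eigenvalues of X_c^T X_c / (n * p) and
    X_c is the column-centered data.
    """
    Xc = X - X.mean(axis=0)
    n, p = Xc.shape
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (n * p))[::-1]  # descending order
    ratios = eigvals[:k_max] / eigvals[1:k_max + 1]
    return int(np.argmax(ratios)) + 1


# Simulated example with strong factors (all sizes are illustrative).
rng = np.random.default_rng(1)
n, p, q_true = 200, 50, 3
F = rng.normal(size=(n, q_true))   # latent factors
B = rng.normal(size=(p, q_true))   # loadings
X = F @ B.T + rng.normal(size=(n, p))

q_hat = eigenvalue_ratio_num_factors(X, k_max=8)
print("estimated number of factors:", q_hat)
```

The bound k_max must be smaller than the number of nonzero eigenvalues, and the rule relies on a clear gap between factor and noise eigenvalues, so it may behave differently from cross-validation when the factors are weak.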

Lastly, we would like to point out that research on dimension reduction for the p > n setting is much needed. Such research would provide alternative ways of analyzing high-dimensional data by applying dimension reduction techniques to the p > n case more directly than the method introduced in this work.

Supplementary Material

supplement.pdf

Footnotes

SUPPLEMENTARY INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of the article.

References

  1. Ahn SC and Horenstein AR. (2013). Eigenvalue ratio test for the number of factors. Econometrica 81, 1203–1227.
  2. Bai J and Ng S. (2002). Determining the number of factors in approximate factor models. Econometrica 70, 191–221.
  3. Bair E, Hastie T, Paul D, and Tibshirani R. (2006). Prediction by supervised principal components. Journal of the American Statistical Association 101, 119–137.
  4. Bair E and Tibshirani R. (2004). Semi-supervised methods to predict patient survival from gene expression data. PLoS Biology 2, E108.
  5. Bernardo J, Bayarri M, Berger J, Dawid A, Heckerman D, Smith A, et al. (2003). Bayesian factor regression models in the large p, small n paradigm. Bayesian Statistics 7, 733–743.
  6. Burris HA. (2004). Dual kinase inhibition in the treatment of breast cancer: initial experience with the egfr/erbb-2 inhibitor lapatinib. The Oncologist 9 Suppl 3, 10–15.
  7. Cancer Genome Atlas Network (2012). Comprehensive molecular portraits of human breast tumors. Nature 490, 61–70.
  8. Carvalho CM, Chang J, Lucas JE, Nevins JR, Wang Q, and West M. (2008). High-dimensional sparse factor modeling: applications in gene expression genomics. Journal of the American Statistical Association 103, 1438–1456.
  9. DeSantis C, Ma J, Bryan L, and Jemal A. (2014). Breast cancer statistics, 2013. CA: A Cancer Journal for Clinicians 64, 52–62.
  10. Eroles P, Bosch A, Pérez-Fidalgo JA, and Lluch A. (2012). Molecular biology in breast cancer: intrinsic subtypes and signaling pathways. Cancer Treatment Reviews 38, 698–707.
  11. Feng Y, Spezia M, Huang S, Yuan C, Zeng Z, Zhang L, Ji X, Liu W, Huang B, Luo W, et al. (2018). Breast cancer development and progression: Risk factors, cancer stem cells, signaling pathways, genomics, and molecular pathogenesis. Genes & Diseases 5, 77–106.
  12. Gardner TS, Di Bernardo D, Lorenz D, and Collins JJ. (2003). Inferring genetic networks and identifying compound mode of action via expression profiling. Science 301, 102–105.
  13. Glazko GV and Emmert-Streib F. (2009). Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets. Bioinformatics 25, 2348–2354.
  14. Jiang F, Ma Y, and Wei Y. (2019). Sufficient direction factor model and its application to gene expression quantitative trait loci discovery. Biometrika 106, 417–432.
  15. Khoury H, Dankort DL, Sadekova S, Naujokas MA, Muller WJ, and Park M. (2001). Distinct tyrosine autophosphorylation sites mediate induction of epithelial mesenchymal like transition by an activated erbb-2/neu receptor. Oncogene 20, 788–799.
  16. Lam C and Yao Q. (2012). Factor modeling for high-dimensional time series: inference for the number of factors. The Annals of Statistics 40, 694–726.
  17. Leek JT and Storey JD. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics 3, 1724–1735.
  18. Li H and Gui J. (2004). Partial cox regression analysis for high-dimensional microarray gene expression data. Bioinformatics 20, i208–i215.
  19. Lopes HF and West M. (2004). Bayesian model assessment in factor analysis. Statistica Sinica 14, 41–67.
  20. Lucas J, Carvalho C, Wang Q, Bild A, Nevins JR, and West M. (2006). Sparse statistical modelling in gene expression genomics. In Do K-A, Müller P, and Vannucci M, editors, Bayesian inference for gene expression and proteomics, pages 155–176. Cambridge University Press.
  21. Luo W and Li B. (2016). Combining eigenvalues and variation of eigenvectors for order determination. Biometrika 103, 875–887.
  22. Ma Y and Zhang X. (2015). A validated information criterion to determine the structural dimension in dimension reduction models. Biometrika 102, 409–420.
  23. Monaco ME. (2017). Fatty acid metabolism in breast cancer subtypes. Oncotarget 8, 29487–29500.
  24. Nguyen DV and Rocke DM. (2002). Partial least squares proportional hazard regression for application to DNA microarray survival data. Bioinformatics 18, 1625–1632.
  25. Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, et al. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. Journal of Clinical Oncology 27, 1160–1167.
  26. Penzo M, Montanaro L, Treré D, and Derenzini M. (2019). The ribosome biogenesis-cancer connection. Cells 8, 55–69.
  27. Radestock Y, Hoang-Vu C, and Hombach-Klonisch S. (2008). Relaxin reduces xenograft tumour growth of human mda-mb-231 breast cancer cells. Breast Cancer Research 10, R71.
  28. Wang L, Li J, Zhao H, Hu J, Ping Y, Li F, Lan Y, Xu C, Xiao Y, and Li X. (2016). Identifying the crosstalk of dysfunctional pathways mediated by lncrnas in breast cancer subtypes. Molecular BioSystems 12, 711–720.
  29. Whelan KA, Schwab LP, Karakashev SV, Franchetti L, Johannes GJ, Seagroves TN, and Reginato MJ. (2013). The oncogene her2/neu (erbb2) requires the hypoxia-inducible factor hif-1 for mammary tumor growth and anoikis resistance. The Journal of Biological Chemistry 288, 15865–15877.
  30. Zhao G, Ma Y, and Lu W. (2017). Efficient estimation for dimension reduction with censored data. https://arxiv.org/abs/1710.05377.
  31. Zhu X, Tao L, Yao J, Sun P, Pei L, Li J, Long Z, Wang Y, and Zhang F. (2014). Identification of collaboration patterns of dysfunctional pathways in breast cancer. International Journal of Clinical and Experimental Pathology 7, 3853–3864.
