Fitting the Cox proportional hazards model to big data

Jianqiao Wang; Donglin Zeng; Dan-Yu Lin

Biometrics. 2024 Mar 18;80(1):ujae018. doi: 10.1093/biomtc/ujae018

PMCID: PMC10946235 PMID: 38497824

Abstract

The semiparametric Cox proportional hazards model, together with the partial likelihood principle, has been widely used to study the effects of potentially time-dependent covariates on a possibly censored event time. We propose a computationally efficient method for fitting the Cox model to big data involving millions of study subjects. Specifically, we perform maximum partial likelihood estimation on a small subset of the whole data and improve the initial estimator by incorporating the remaining data through one-step estimation with estimated efficient score functions. We show that the final estimator has the same asymptotic distribution as the conventional maximum partial likelihood estimator using the whole dataset but requires only a small fraction of computation time. We demonstrate the usefulness of the proposed method through extensive simulation studies and an application to the UK Biobank data.

Keywords: censoring, efficient score, one-step estimation, partial likelihood, time complexity, time-dependent covariates

1. INTRODUCTION

Large biobanks, such as the UK Biobank (Bycroft et al., 2018), provide unprecedented opportunities to explore the genetic basis of the onset and progression of complex human diseases. In particular, time-to-event analyses of the UK Biobank data have identified novel genetic variants associated with hypertension, heart disease, breast cancer, and Alzheimer’s disease (Bi et al., 2020; Dey et al., 2022). The semiparametric Cox (1972) proportional hazards model has been widely used in such time-to-event analyses. The regression parameters are commonly estimated by maximizing the partial likelihood (Cox, 1975). This maximization is time-consuming and often infeasible for big data involving a very large number of subjects, especially in the presence of time-dependent covariates.

One possible solution is to randomly divide the full dataset into several subsets, analyze each subset separately, and then combine the estimates. Two versions of this divide-and-conquer (DAC) approach have been suggested: The linearization-based DAC method calculates an initial maximum partial likelihood estimate for 1 block and iteratively updates the estimate with the score function and information matrix on each individual subset in 2 iterations (Wang et al., 2021); the weighted DAC method calculates the maximum partial likelihood estimate for each block and produces a weighted average of the estimates (Wang et al., 2022). The DAC approach is computationally efficient under the distributed computing architecture but requires separate estimation of the baseline hazard function for each subset of data, which may cause numerical instability and reduce statistical efficiency.

In this paper, we propose a computationally fast method for fitting the Cox regression model to big data without sacrificing statistical efficiency. Specifically, we divide the whole dataset into 2 blocks, with the first block being much smaller than the second block. We perform maximum partial likelihood estimation on the first block of data. We then improve this initial estimator by using one-step estimation with semiparametric efficient scores on the second block of data. Since maximum partial likelihood estimation is performed only on a small subset of the whole data, and the one-step update does not involve any iterations or calculation of the Hessian matrix, the proposed method takes only a small fraction of the computation time that is required to maximize the partial likelihood on the whole dataset. Using the counting-process martingale theory and modern empirical process theory, we show that the proposed estimator achieves the same asymptotic efficiency as the maximum partial likelihood estimator on the whole dataset.

2. METHODS

2.1. Maximum partial likelihood estimation

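The Cox (1972) proportional hazards model specifies that the hazard function of an event time T conditional on a p-vector of potentially time-dependent covariates $X(\cdot)$ takes the form

$$\lambda(t \mid X) = \lambda_0(t) \exp\{\beta^{\mathrm T} X(t)\},$$

where $\beta$ is a p-vector of unknown regression parameters, and $\lambda_0(\cdot)$ is an arbitrary baseline hazard function. When T is subject to right censoring by C, we observe $(Y, \Delta)$, where $Y = \min(T, C)$, $\Delta = I(T \le C)$, and $I(\cdot)$ is the indicator function. It is assumed that C is independent of T conditional on $X(\cdot)$.

We consider a random sample of n subjects and use the subscript i to indicate any variable associated with the ith subject. Given the data $\{Y_i, \Delta_i, X_i(\cdot)\}$ $(i = 1, \ldots, n)$, the partial likelihood for $\beta$ takes the form

$$L(\beta) = \prod_{i=1}^{n} \left[ \frac{\exp\{\beta^{\mathrm T} X_i(Y_i)\}}{\sum_{j=1}^{n} I(Y_j \ge Y_i) \exp\{\beta^{\mathrm T} X_j(Y_i)\}} \right]^{\Delta_i}.$$

The corresponding score function and information matrix are

$$U(\beta) = \sum_{i=1}^{n} \Delta_i \left\{ X_i(Y_i) - \frac{S^{(1)}(\beta, Y_i)}{S^{(0)}(\beta, Y_i)} \right\}$$

and

$$I(\beta) = \sum_{i=1}^{n} \Delta_i \left[ \frac{S^{(2)}(\beta, Y_i)}{S^{(0)}(\beta, Y_i)} - \left\{ \frac{S^{(1)}(\beta, Y_i)}{S^{(0)}(\beta, Y_i)} \right\}^{\otimes 2} \right],$$

respectively. Here and in the sequel, $S^{(k)}(\beta, t) = \sum_{j=1}^{n} I(Y_j \ge t) \exp\{\beta^{\mathrm T} X_j(t)\} X_j(t)^{\otimes k}$ $(k = 0, 1, 2)$, $a^{\otimes 0} = 1$, $a^{\otimes 1} = a$, and $a^{\otimes 2} = aa^{\mathrm T}$. The standard method is to solve the score equation $U(\beta) = 0$ via the Newton-Raphson algorithm. The resulting estimator $\hat\beta$ is consistent and asymptotically normal, with a covariance matrix that can be estimated by $I(\hat\beta)^{-1}$ (Andersen and Gill, 1982).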

The time required to obtain $\hat\beta$ is determined by the evaluation of the score function $U(\beta)$ and the information matrix $I(\beta)$ during the Newton-Raphson iterations. When $X$ is time-independent, the summations in the numerators and denominators of $U(\beta)$ and $I(\beta)$ take the form $\sum_{j=1}^{n} I(Y_j \ge Y_i) g(X_j)$ for some function $g$. Once the observed event times are sorted in descending order, each summation can be calculated as a cumulative sum of $g(X_j)$ starting from the largest observed time, and the summations can be performed for all $i = 1, \ldots, n$ in 1 loop. Thus, for a fixed number of covariates p, the computation time for evaluating $U(\beta)$ and $I(\beta)$ is linear in n. Therefore, the total time complexity for the standard method is linear in n apart from the initial sort, provided that the sorting algorithm has a time complexity of $O(n \log n)$. When $X(t)$ is a continuous function of t, however, the summation must be calculated over all the subjects in the risk set at each observed event time $Y_i$. Because each calculation is linear in n, the total time complexity for evaluating $U(\beta)$ and $I(\beta)$ becomes of order $nm$ for fixed p, where m is the total number of distinct observed event times. Consequently, the computation can be formidable when n and m are very large.
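The cumulative-sum device for time-independent covariates can be illustrated with a short sketch. This is not the authors' code: the function name `risk_set_sums` and the use of `searchsorted` to handle tied times are assumptions made for illustration. After one sort in descending order, a single cumulative sum yields the risk-set sum $\sum_j I(Y_j \ge Y_i) g(X_j)$ for every subject.

```python
import numpy as np

def risk_set_sums(times, values):
    """For every subject i, return sum_j I(times[j] >= times[i]) * values[j].

    `values` holds g(X_j) for each subject; it may be 1-D or have extra
    trailing dimensions (for example, one column per covariate).
    """
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    order = np.argsort(-times, kind="stable")      # largest observed time first
    sorted_times = times[order]
    csum = np.cumsum(values[order], axis=0)        # running sums down the sorted list
    # Number of subjects with time >= times[i]; ties are all included.
    count = np.searchsorted(-sorted_times, -times, side="right")
    return csum[count - 1]
```

Apart from the sort, the work is one pass over the data, which is the reason a single evaluation of the score function is linear in n in this setting.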

2.2. Improving computational efficiency

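We propose a method to reduce computation time while maintaining statistical efficiency. To this end, we randomly divide the entire dataset into 2 blocks indexed by $\mathcal{B}_1$ and $\mathcal{B}_2$, with $n_1$ observations in $\mathcal{B}_1$. We maximize the partial likelihood with the data from the first block to obtain an initial estimator of $\beta$, denoted by $\tilde\beta$. We then incorporate the data from the second block by using the one-step estimator from semiparametric efficiency theory (Bickel et al., 1993).

The efficient score function for $\beta$ based on a single observation $O = \{Y, \Delta, X(\cdot)\}$ is

$$l(O; \beta, \Lambda_0, e) = \Delta \{X(Y) - e(\beta, Y)\} - \int_0^{Y} \exp\{\beta^{\mathrm T} X(t)\} \{X(t) - e(\beta, t)\} \, d\Lambda_0(t),$$

where $\Lambda_0(t) = \int_0^t \lambda_0(s) \, ds$, $s^{(k)}(\beta, t) = E[I(Y \ge t) \exp\{\beta^{\mathrm T} X(t)\} X(t)^{\otimes k}]$ $(k = 0, 1)$, and $e(\beta, t) = s^{(1)}(\beta, t) / s^{(0)}(\beta, t)$. We obtain an empirical counterpart $\hat{l}(O)$ by replacing $\beta$, $\Lambda_0$, and $e$ in $l$ with $\tilde\beta$,

$$\hat\Lambda(t) = \sum_{i \in \mathcal{B}_1} \frac{\Delta_i I(Y_i \le t)}{S_1^{(0)}(\tilde\beta, Y_i)}$$

(Breslow, 1972), and

$$\hat{e}(t) = \frac{S_1^{(1)}(\tilde\beta, t)}{S_1^{(0)}(\tilde\beta, t)},$$

respectively, where $S_1^{(k)}(\beta, t) = \sum_{j \in \mathcal{B}_1} I(Y_j \ge t) \exp\{\beta^{\mathrm T} X_j(t)\} X_j(t)^{\otimes k}$. Our final estimator for $\beta$ is given by

$$\hat\beta_{\mathrm{OSES}} = \tilde\beta + \frac{n_1}{n} I_1(\tilde\beta)^{-1} \sum_{i \in \mathcal{B}_2} \hat{l}(O_i),$$

where $I_1(\tilde\beta)$ is the information matrix for the first block. We refer to $\hat\beta_{\mathrm{OSES}}$ as the Cox one-step efficient score (Cox-OSES) estimator, to reflect the fact that the one-step estimation with the efficient scores on the second block is used to boost the efficiency of the initial estimator $\tilde\beta$. We show in the next section that the Cox-OSES estimator $\hat\beta_{\mathrm{OSES}}$ has the same asymptotic distribution as the standard estimator $\hat\beta$. Thus, we estimate the covariance matrix of $\hat\beta_{\mathrm{OSES}}$ by $(n_1/n) I_1(\tilde\beta)^{-1}$.

Remark 1

A statistically more efficient covariance matrix estimator is given by $I(\hat\beta_{\mathrm{OSES}})^{-1}$, where $I(\cdot)$ is the information matrix of the whole dataset, so that this estimator makes use of the entire data. However, its computation is more demanding than scaling $I_1(\tilde\beta)^{-1}$. For statistical inference, only a consistent estimator of the covariance matrix is required, and estimating the covariance matrix of $\hat\beta_{\mathrm{OSES}}$ by $(n_1/n) I_1(\tilde\beta)^{-1}$ is both sufficient and practical.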
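The two-block procedure of this section can be sketched in a few lines for the special case of time-independent covariates. The code below is an illustrative reading of the method, not the authors' implementation: all function names are hypothetical, ties are handled in the Breslow manner, the update takes the form "initial estimate plus $(n_1/n)$ times the inverse first-block information times the summed second-block efficient scores", and rows are assumed to be in random order so that the first `n1` rows form the first block. When no first-block subject is still at risk at a second-block event time, the sketch simply drops that event term, which is also an assumption.

```python
import numpy as np

def risk_sums(times, at, values):
    """sum_j I(times[j] >= at[i]) * values[j] for every i, via one cumulative sum."""
    order = np.argsort(-times, kind="stable")
    csum = np.cumsum(values[order], axis=0)
    pad = np.concatenate([np.zeros((1,) + csum.shape[1:]), csum], axis=0)
    count = np.searchsorted(-times[order], -at, side="right")  # risk-set sizes
    return pad[count]

def score_info(beta, Y, D, X):
    """Partial-likelihood score and information for time-independent X."""
    w = np.exp(X @ beta)
    s0 = risk_sums(Y, Y, w)
    e = risk_sums(Y, Y, w[:, None] * X) / s0[:, None]
    s2 = risk_sums(Y, Y, w[:, None, None] * X[:, :, None] * X[:, None, :])
    score = (D[:, None] * (X - e)).sum(axis=0)
    outer = e[:, :, None] * e[:, None, :]
    info = (D[:, None, None] * (s2 / s0[:, None, None] - outer)).sum(axis=0)
    return score, info

def fit_cox(Y, D, X, max_iter=50, tol=1e-10):
    """Newton-Raphson maximization of the partial likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        score, info = score_info(beta, Y, D, X)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta, score_info(beta, Y, D, X)[1]

def cox_oses(Y, D, X, n1):
    """One-step update of the first-block estimate with second-block efficient scores."""
    n, p = X.shape
    Y1, D1, X1 = Y[:n1], D[:n1], X[:n1]
    Y2, D2, X2 = Y[n1:], D[n1:], X[n1:]
    beta1, info1 = fit_cox(Y1, D1, X1)
    w1, w2 = np.exp(X1 @ beta1), np.exp(X2 @ beta1)

    # Breslow jumps and e-hat at first-block times, accumulated in time order.
    s0_11 = risk_sums(Y1, Y1, w1)
    e_11 = risk_sums(Y1, Y1, w1[:, None] * X1) / s0_11[:, None]
    jump = D1 / s0_11
    asc = np.argsort(Y1, kind="stable")
    cum_lam = np.concatenate([[0.0], np.cumsum(jump[asc])])
    cum_elam = np.concatenate(
        [np.zeros((1, p)), np.cumsum((jump[:, None] * e_11)[asc], axis=0)], axis=0)
    k = np.searchsorted(Y1[asc], Y2, side="right")   # first-block times <= Y2[i]
    lam, elam = cum_lam[k], cum_elam[k]

    # e-hat evaluated at second-block observed times, using first-block risk sets.
    s0_12 = risk_sums(Y1, Y2, w1)
    s1_12 = risk_sums(Y1, Y2, w1[:, None] * X1)
    at_risk = s0_12 > 0
    e_12 = np.where(at_risk[:, None],
                    s1_12 / np.where(at_risk, s0_12, 1.0)[:, None], X2)

    # Estimated efficient scores: event term minus compensator term.
    scores = D2[:, None] * (X2 - e_12) - w2[:, None] * (X2 * lam[:, None] - elam)
    return beta1 + (n1 / n) * np.linalg.solve(info1, scores.sum(axis=0))
```

With the rows already shuffled, a call such as `cox_oses(Y, D, X, n1=len(Y) // 5)` would use 20% of the data as the first block. Only the first block goes through Newton-Raphson iterations; the second block is visited once and no Hessian is computed on it.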

According to the discussion in Section 2.1, the time complexity for calculating Inline graphic is when is time-independent and when contains continuously time-dependent covariates, where is the total number of distinct observed event times in the first block. For the second block, we need to evaluate the term , which is equal to

(1)

where Inline graphic . Note that is a step function with jumps, which is obtained from the first block analysis. Thus, the time complexity for the first term in Equation 1 is after the event times in the second block are sorted. The time complexity for calculating the second term in Equation 1 depends on the type of covariates. If Inline graphic is time-independent, this term can be written as

Given the sorted event times in the second block, the time complexity of computing this term is Inline graphic . When is continuously time-dependent, however, we need to evaluate the summation for each of the distinct event times in the first block, with a time complexity of . Combining the calculations in the first and second blocks, we conclude that the total time complexity of the proposed method is Inline graphic when covariates are all time-independent and when covariates are continuously time-dependent.

According to the discussion in Section 2.1, the total time complexity of the DAC methods with M blocks is Inline graphic when covariates are all time-independent and when covariates are continuously time-dependent. Here, we assume the DAC methods evenly divide the full dataset into M subsets and the block size is of the same magnitude as . Then the time complexity of the proposed method is of the same order as the standard method and the DAC methods when covariates are all time-independent and is much lower than both the standard method and the DAC methods when covariates are continuously time-dependent and p is large. In terms of actual computation, the proposed method is always less demanding than both the standard method and the DAC methods since the one-step estimation in the second block does not involve any iterations or calculation of the Hessian matrix.

The Cox-OSES estimator can be used for variable selection. Motivated by Wang and Leng (2007) and Zou (2006), we minimize the objective function:

where Inline graphic is a tuning parameter controlling the level of penalization, and is a pre-specified positive number. We set in our simulation studies and real-data example. This optimization can be solved by using the R package glmnet (Friedman et al., 2010). Let denote the minimizer of . As and Inline graphic , satisfies both selection consistency and the oracle property (Wang and Leng, 2007).
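A minimal sketch of the penalized step follows. The exact objective function is not reproduced here, so the code assumes a least squares approximation with adaptive lasso weights in the spirit of Wang and Leng (2007) and Zou (2006): it minimizes $(b - \hat\beta)^{\mathrm T} \hat\Sigma^{-1} (b - \hat\beta) + \lambda \sum_j |b_j| / |\hat\beta_j|^{\gamma}$, where $\hat\beta$ is the unpenalized estimate and $\hat\Sigma$ its estimated covariance matrix. The paper solves the optimization with the R package glmnet; the plain coordinate descent below, the function name `lsa_adaptive_lasso`, and the scaling of the objective are assumptions for illustration.

```python
import numpy as np

def lsa_adaptive_lasso(beta_hat, cov_hat, lam, gamma=1.0, max_iter=1000, tol=1e-10):
    """Cyclic coordinate descent for the assumed objective
    (b - beta_hat)' A (b - beta_hat) + lam * sum_j |b_j| / |beta_hat_j|**gamma,
    where A is the inverse of cov_hat. Assumes every beta_hat_j is nonzero."""
    beta_hat = np.asarray(beta_hat, dtype=float)
    A = np.linalg.inv(np.asarray(cov_hat, dtype=float))
    weights = 1.0 / np.abs(beta_hat) ** gamma      # adaptive lasso weights
    b = beta_hat.copy()
    for _ in range(max_iter):
        b_old = b.copy()
        for j in range(len(b)):
            resid = b - beta_hat
            resid[j] = 0.0                          # exclude coordinate j itself
            c = A[j, j] * beta_hat[j] - A[j] @ resid
            # Soft-threshold the coordinate-wise least squares solution.
            b[j] = np.sign(c) * max(abs(c) - lam * weights[j] / 2.0, 0.0) / A[j, j]
        if np.max(np.abs(b - b_old)) < tol:
            break
    return b
```

Because the input is only an estimate and its covariance matrix, the same routine could in principle be applied to the Cox-OSES estimate without revisiting the individual-level data.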

2.3. Asymptotic properties of the Cox-OSES estimator

We impose the following regularity conditions:

Condition 1

The vector of covariates $X(\cdot)$ has finite total variation over the time interval $[0, \tau]$, where $\tau$ denotes the end of the study.

Condition 2

The true baseline hazard function satisfies that .

Condition 3

The probability .

Condition 4

The matrix is positive definite, where , , and is the true value of .

Condition 5

The size of the first block satisfies that as .

The first 4 conditions ensure that the maximum partial likelihood estimator is consistent, asymptotically normal, and asymptotically efficient (Andersen and Gill, 1982). The last condition implies that Inline graphic is much smaller than n but is larger than . We state below the asymptotic properties of the Cox-OSES estimator.

Theorem 1

Under Conditions 1-5, converges in distribution to a zero-mean normal random vector with covariance matrix .

The theorem is proved in Web Appendix A.

3. SIMULATION STUDIES

We conducted extensive simulation studies to compare the proposed method with the standard method and the DAC methods. We refer to the linearization-based DAC method as DAC-Linearization and weighted DAC method with the block size or information matrix as the weight as DAC-Size or DAC-Information, respectively. We considered one setting involving only time-independent covariates and one setting involving time-dependent covariates. In the first setting, we generated 100 covariates from the multivariate normal distribution with mean 0 and covariance matrix Inline graphic . We set and , and we generated C from , where is a vector with all elements being a, and denotes the exponential distribution with mean . In the second setting, we included 50 time-independent covariates and 50 time-dependent covariates . We considered 5 time intervals , , , , and Inline graphic , where the time-dependent covariates are constant within each interval but may vary between intervals. We generated from the multivariate normal distribution with mean and covariance matrix . The final covariates are , where We set , , and . In both settings, approximately 70% of the event times were censored. A total sample size of Inline graphic or was considered for both settings.

For the proposed method, we set $n_1 = kn$, where $k = 0.1$, 0.2, 0.3, or 0.4. For the DAC methods, we set $M = 5$, 10, 20, 50, 100, 200, 500, or 1000. The convergence criterion for the Newton-Raphson algorithm was that the absolute relative change in the log-partial likelihood between 2 successive iterations falls below a prespecified tolerance. For each method, we recorded the total CPU time and calculated the squared error $\|\hat\beta - \beta_0\|^2$, where $\hat\beta$ denotes the estimate of the true parameter value $\beta_0$.

Figure 1 displays the simulation results based on 1000 replicates. In both settings, the Cox-OSES estimator has almost the same statistical efficiency as the standard estimator but takes much less time to compute. In the second setting with Inline graphic , the standard method took approximately 2.04 hours whereas the Cox-OSES method with took only 26.91 minutes, with only 0.07% efficiency loss. Compared with the DAC-Size estimator and DAC-Information estimator, the Cox-OSES estimator always takes less time to compute. When the Cox-OSES estimator with some k and the DAC-Linearization estimator with some M have the same squared error, the Cox-OSES estimator takes less time to compute. When the computation time is the same between the 2 methods, the Cox-OSES estimator has a smaller squared error.

FIGURE 1.

Average ratios of squared error and total CPU time for the Cox-OSES and DAC methods to the standard method. The solid curve pertains to Cox-OSES, and the numbers along it indicate the proportion of the first block to the entire dataset. The dashed curve pertains to DAC-Linearization, and the numbers along it indicate the number of blocks. The dotted-dashed and dotted curves pertain to DAC-Size and DAC-Information, respectively. The numbers along the curves indicate the corresponding numbers of blocks.

Figures S1 and S2 in Web Appendix B display the bias and standard error of the proposed estimator in the second setting with Inline graphic . Neither the bias nor the standard error is influenced by the type of covariate, the true coefficient value, or the choice of .

We further compared the proposed method with the standard method and DAC methods in the context of rare events. We adopted the first simulation setting, but with C generated from Inline graphic . Approximately 98% of the event times were censored. We considered . Tables S1 and S2 in Web Appendix B present the simulation results based on 1000 replicates. The basic conclusions are the same as those of the first 2 settings.

The simulation results suggest that with Inline graphic , the proposed estimator achieves the same statistical efficiency as the standard estimator while reducing the computation time by 80%. Nevertheless, the choice of should depend on both n and the event rate. A smaller n or a lower event rate requires a larger .

We also conducted simulation studies to evaluate the variable selection performance of the regularized estimation approach with the Cox-OSES estimator. We adopted the first 2 simulation settings to generate Inline graphic and used the same . For each setting, we considered 2 choices of to reflect the effect sizes. When only time-independent covariates are involved, we set and generated C from or set and generated C from . When time-dependent covariates are involved, we set and or set and . In all scenarios, approximately 70% of the event times were censored. We set Inline graphic .

For the proposed method and the DAC methods, we used the same block size as in the previous simulation studies. For DAC methods, we adopted the regularized estimation approach of Wang et al. (2021, 2022). For the standard method, we employed the unified Lasso estimation of Wang and Leng (2007). The tuning parameter Inline graphic was chosen by minimizing the Bayesian information criteria. For each method, we report the average true discoveries and false discoveries based on 1000 replicates. As shown in Table S3 in Web Appendix B, the regularized estimation approach with the Cox-OSES estimator performs similarly to the standard method.

4. AN EXAMPLE

We considered the UK Biobank, which is a cohort study on approximately half a million subjects (Bycroft et al., 2018). We related the age-at-onset of circulatory-system diseases to 1100 genetic variants, of which 50 were selected from each chromosome by univariate screening, with adjustment for sex and the top 10 genetic principal components representing population substructures. To construct the time-to-event outcome, we converted the International Classification of Disease code version 10 to the PheWAS code (Denny et al., 2013) and then calculated the difference between the date of diagnosis and the date of birth (Dey et al., 2022). The dataset contains a total of 487 252 subjects, with an event rate of 39%. We performed variable selection on the 1100 genetic variants using the regularized estimation approach with the Cox-OSES estimator. We randomly chose 20% of the whole dataset as the first block, and we calculated the Cox-OSES estimates and covariance matrix estimates. We repeated this process 10 times and averaged the 10 sets of estimates to adjust for the randomness introduced by data splitting. We then used the averaged estimates for variable selection among the 1100 genetic variants. The final analysis dataset contains 140 genetic variants.

We then randomly chose 10%, 20%, 30%, or 40% of the whole dataset as the first block for the proposed method and varied M from 5 to 1000 for the DAC methods. For each method, we reported the total CPU time and calculated the squared error $\|\hat\beta^{*} - \hat\beta\|^2$, where $\hat\beta^{*}$ denotes a DAC estimate or the Cox-OSES estimate and $\hat\beta$ is the standard estimate. We repeated the process 100 times. All calculations were done on the same computer.

The results are presented in Figure 2. Compared to the standard method, the average squared errors for the Cox-OSES estimates are Inline graphic , , and under , , and , respectively. Under and 50, the average squared errors are and for the DAC-Linearization estimates, and for the DAC-Size estimates, and and for the DAC-Information estimates, respectively. The standard method took about 245.45 seconds to converge. The DAC-Size and DAC-Information methods took more time to compute than the standard method for all choices of M because Newton-Raphson algorithm sometimes required more iterations to converge on small blocks than the entire UK Biobank dataset. The DAC-Linearization method took 150 and 128 seconds under Inline graphic and 50, respectively. By contrast, the Cox-OSES method took only 30, 57, and 84 seconds under , , and , respectively.

FIGURE 2.

Average squared errors and ratios of total CPU time for the Cox-OSES and DAC methods to the standard method. The solid curve pertains to Cox-OSES, and the numbers along it indicate the proportion of the first block to the entire dataset. The dashed curve pertains to DAC-Linearization, and the numbers along it indicate the number of blocks. The dotted-dashed and dotted curves pertain to DAC-Size and DAC-Information, respectively. The numbers along the curves indicate the corresponding numbers of blocks.

Figure S3 displays the estimated hazard ratios and the corresponding 95% confidence intervals for the selected 140 genetic variants. The most significant genetic variant is 20:8603950:G:C with a P-value of Inline graphic . This variant was identified by Vuckovic et al. (2020) to be associated with white blood cell count, plateletcrit, and platelet count. A second variant, 19:11526765:G:T, has a P-value of , and it was previously linked to hypertension by Liu et al. (2016). In addition, 3:124337402:G:A, with a P-value of Inline graphic , was associated with mean platelet volume and platelet count by Vuckovic et al. (2020). Finally, 4:144122227:A:G has a P-value of and was identified by Stanzick et al. (2021) as being associated with blood urea nitrogen levels and glomerular filtration rate.

We divided the study participants into 3 groups based on their polygenic risk scores (The International Schizophrenia Consortium, 2009). Group 1 consists of the lowest 1/3, Group 2 the middle 1/3, and Group 3 the highest 1/3. The polygenic risk scores were constructed by $\hat\beta^{\mathrm T} G$, where $G$ denotes the genetic variants and $\hat\beta$ is the estimate of the regression parameters. Figure S4 presents the Kaplan-Meier estimates of the disease-free probabilities over time for the 3 groups. There are clear separations of the 3 curves, indicating the predictive value of the fitted model.
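The grouping by polygenic risk score can be sketched as follows. The code assumes that the score is the linear combination of a subject's genotype vector with the estimated regression coefficients and that the groups are cut at the empirical tertiles; the function name `prs_groups` and the use of `numpy.quantile` with its default interpolation are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def prs_groups(G, beta_hat):
    """Return the polygenic risk scores and tertile group labels (1, 2, or 3).

    G is an (n_subjects, n_variants) genotype matrix and beta_hat holds the
    estimated regression coefficients for the same variants.
    """
    prs = np.asarray(G, dtype=float) @ np.asarray(beta_hat, dtype=float)
    lower, upper = np.quantile(prs, [1 / 3, 2 / 3])   # tertile cut points
    groups = 1 + (prs > lower).astype(int) + (prs > upper).astype(int)
    return prs, groups
```

The resulting labels could then be passed to any Kaplan-Meier routine to obtain one disease-free probability curve per group.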

Following Cox (1972) and Kalbfleisch and Prentice (2002), as well as the coxph function in R and phreg in SAS, we checked the proportional hazards assumption by including an interaction between variant 20:8603950:G:C and Inline graphic . In this case, the counting process format is infeasible due to the machine’s memory. This task would take more than 1 month for the standard method to complete. We randomly chose 10% and 20% of the whole dataset as the first block for the proposed method and set and 10 for the DAC methods to ensure that the size of the first block in the proposed method is the same as the block size in the DAC methods. The DAC-Size and DAC-Information took 229 and 127 hours under Inline graphic and 10, respectively, and the DAC-Linearization took 148 and 67 hours, respectively. By contrast, the proposed method with and took only 13 and 48 hours, respectively.

We repeated the process 10 times and averaged the 10 sets of estimates to adjust for the randomness introduced by data splitting. Figure S5 in Web Appendix C displays the 10 sets of estimates for the time-dependent covariate. Figure 3 shows that Cox-OSES and DAC-Linearization yielded similar estimates and similar standard errors. Table 1 summarizes the results for the coefficient of the time-dependent covariate. The proposed method and DAC methods agreed on rejecting the proportional hazards assumption.

FIGURE 3.

Parameter estimates and standard error estimates for the proposed method and the DAC-Linearization method in the analysis of the UK Biobank data with a time-dependent covariate.

TABLE 1.

Estimation results for the regression parameter of the time-dependent covariate by the proposed method and DAC methods.

Method	Estimate	Std error	P-value
Cox-OSES	-0.0719	0.02880	.013
	-0.0710	0.02879	.014
DAC-Linearization	-0.0690	0.02867	.016
	-0.0689	0.02867	.016
DAC-Size	-0.0688	0.02870	.017
	-0.0685	0.02874	.017
DAC-Information	-0.0691	0.02867	.016
	-0.0691	0.02867	.016

Abbreviation: DAC, divide-and-conquer.

We performed only 1 analysis. The gains in computational efficiency offered by the proposed method will be more important in genome-wide association studies requiring a large number of analyses.

5. REMARKS

The proposed method can incorporate more than 2 blocks of data. Suppose that there are K blocks of data indexed by $\mathcal{B}_1, \ldots, \mathcal{B}_K$, with sizes $n_1, \ldots, n_K$, respectively. We obtain the initial estimator $\tilde\beta$ using the first block of data. For the remaining blocks, we calculate

$$U_k = \sum_{i \in \mathcal{B}_k} \hat{l}(O_i), \quad k = 2, \ldots, K,$$

where $\hat{l}(O_i)$ is the estimated efficient score of Section 2.2 for the ith subject. We then define the Cox-OSES estimator as $\hat\beta_{\mathrm{OSES}} = \tilde\beta + (n_1/n) I_1(\tilde\beta)^{-1} \sum_{k=2}^{K} U_k$, where $I_1(\tilde\beta)$ is the information matrix for the first block. Mathematically, $\hat\beta_{\mathrm{OSES}}$ is independent of the choice of K, provided that $\mathcal{B}_1$ is the same, and thus Theorem 1 holds for the Cox-OSES estimator with multiple blocks. However, there are practical advantages to using more than 2 blocks. First, when the dataset is so big that the data have to be divided and stored in multiple machines, the boosting procedure can be carried out simultaneously among the $K - 1$ remaining blocks, and only the estimates from the first block analysis are communicated to the remaining blocks. Thus, there is great efficiency in both computation and communication. Second, because only estimates are shared among the K blocks, privacy can be preserved when combining data from different sources. Third, the proposed procedure can be used to incorporate new data in real time without revisiting previous data. By treating the new batch of data as the $(K+1)$th block, we can obtain $U_{K+1}$ and include it in the calculation of $\hat\beta_{\mathrm{OSES}}$.
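The bookkeeping for several blocks, including batches that arrive later, can be sketched with a small accumulator. The sketch assumes that each block contributes only the sum of its estimated efficient scores and its size, and that the final estimate is the initial estimate plus $(n_1/n)$ times the inverse first-block information applied to the total score sum. The class name `OsesAccumulator` and its methods are hypothetical, and the computation of the per-block score sums is left outside the sketch.

```python
import numpy as np

class OsesAccumulator:
    """Combine the first-block fit with score sums from later blocks."""

    def __init__(self, beta_init, info1, n1):
        self.beta_init = np.asarray(beta_init, dtype=float)  # first-block estimate
        self.info1 = np.asarray(info1, dtype=float)          # first-block information
        self.n1 = n1
        self.n = n1                                          # running total sample size
        self.score_sum = np.zeros_like(self.beta_init)

    def add_block(self, block_score_sum, block_size):
        """Add one block's summed efficient scores; blocks may arrive in any order."""
        self.score_sum += np.asarray(block_score_sum, dtype=float)
        self.n += block_size

    def estimate(self):
        """Current one-step estimate based on all blocks received so far."""
        update = np.linalg.solve(self.info1, self.score_sum)
        return self.beta_init + (self.n1 / self.n) * update
```

Because the score sums simply add up, splitting the remaining data into more blocks leaves the result unchanged, and each machine or data source needs to share only a p-vector and a count.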

When applying the proposed method to multiple data blocks, heterogeneity across blocks may emerge. We may account for heterogeneity by adopting the stratified Cox proportional hazards model. Specifically, we fit the stratified Cox model to the first data block, estimate the efficient score functions for each stratum using the remaining data, and then update the initial coefficient estimates accordingly.

There may exist outlying observations in the dataset. The proposed method is not sensitive to outlying observations due to the rank-based nature of the partial likelihood inference.

We can extend the proposed method to marginal and random-effects models for correlated event times. Suppose that there are n independent clusters, with Inline graphic subjects in the ith cluster. To calculate the Cox-OSES estimator, we randomly select clusters into . For the marginal Cox model, we replace with the sandwich variance estimator in the one-step estimation, where denotes for the jth subject of the ith cluster. The covariance matrix of the resulting estimator is estimated by Inline graphic .

Fitting random-effects models is more challenging. Direct maximization of nonparametric likelihood is unstable. EM algorithms (Zeng and Lin, 2007) provide a stable solution, but the convergence is slow. In such situations, the proposed method is even more beneficial than in the current setting because the orders of computation are higher. In general, efficient score functions do not have explicit expressions but can be obtained as the numerical solution to a Fredholm-type equation (Zeng and Lin, 2007).

We can also extend the proposed approach to the situation of high-dimensional covariates. Specifically, we perform variable selection using the first block of data. After debiasing the initial estimates, we perform a one-step update on the coefficients of the selected variables using the remaining data. The main computation lies in the analysis of the first block of data, which is always less demanding than direct analysis of the whole dataset. We anticipate that uncertainty due to variable selection will not affect the asymptotic efficiency of the Cox-OSES estimator.

Kawaguchi et al. (2021) proposed a forward-backward scan algorithm, which reduces the time complexity from quadratic to linear for fitting the proportional subdistribution hazards model to competing risks data with time-independent covariates. However, their algorithm cannot be used to further improve computational efficiency in our setting because the reduction in time complexity by that algorithm results from the same mechanism of cumulative sum as the standard method for fitting the Cox proportional hazards model with time-independent covariates, which has only linear time complexity, as detailed in Section 2.1.

Supplementary Material

Web Appendices, Tables, and Figures, referenced in Sections 2, 3, and 4, and the computation codes are available with this paper at the Biometrics website on Oxford Academic.

Acknowledgement

The authors thank the editor, associate editor, and 2 referees for their helpful comments.

Contributor Information

Jianqiao Wang, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.

Donglin Zeng, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.

Dan-Yu Lin, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.

FUNDING

This research was supported by the National Institutes of Health grants R01 HL149683 and GM124104.

CONFLICT OF INTEREST

None declared.

DATA AVAILABILITY

UK Biobank data are available from http://www.ukbiobank.ac.uk. A formal application to the UK Biobank is required to use the data.

References

Andersen P. K., Gill R. D. (1982). Cox’s regression model for counting processes: a large sample study. The Annals of Statistics, 10, 1100–1120.
Bi W., Fritsche L. G., Mukherjee B., Kim S., Lee S. (2020). A fast and accurate method for genome-wide time-to-event data analysis and its application to UK Biobank. The American Journal of Human Genetics, 107, 222–233.
Bickel P. J., Klaassen C. A. J., Ritov Y., Wellner J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press.
Breslow N. (1972). Discussion of the paper by D. R. Cox. Journal of the Royal Statistical Society, Series B, 34, 216–217.
Bycroft C., Freeman C., Petkova D., Band G., Elliott L. T., Sharp K. et al. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature, 562, 203–209.
Cox D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society, Series B, 34, 187–202.
Cox D. R. (1975). Partial likelihood. Biometrika, 62, 269–276.
Denny J. C., Bastarache L., Ritchie M. D., Carroll R. J., Zink R., Mosley J. D. et al. (2013). Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nature Biotechnology, 31, 1102–1111.
Dey R., Zhou W., Kiiskinen T., Havulinna A., Elliott A., Karjalainen J. et al. (2022). Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks. Nature Communications, 13, 5437.
Friedman J. H., Hastie T., Tibshirani R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1–22.
Kalbfleisch J. D., Prentice R. L. (2002). The Statistical Analysis of Failure Time Data. New York: John Wiley & Sons.
Kawaguchi E. S., Shen J. I., Suchard M. A., Li G. (2021). Scalable algorithms for large competing risks data. Journal of Computational and Graphical Statistics, 30, 685–693.
Liu C., Kraja A. T., Smith J. A., Brody J. A., Franceschini N., Bis J. C. et al. (2016). Meta-analysis identifies common and rare variants influencing blood pressure and overlapping with metabolic trait loci. Nature Genetics, 48, 1162–1170.
Stanzick K. J., Li Y., Schlosser P., Gorski M., Wuttke M., Thomas L. F. et al. (2021). Discovery and prioritization of variants and genes for kidney function in >1.2 million individuals. Nature Communications, 12, 4350.
The International Schizophrenia Consortium (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460, 748–752.
Vuckovic D., Bao E. L., Akbari P., Lareau C. A., Mousas A., Jiang T. et al. (2020). The polygenic and monogenic basis of blood traits and diseases. Cell, 182, 1214–1231.
Wang H., Leng C. (2007). Unified lasso estimation by least squares approximation. Journal of the American Statistical Association, 102, 1039–1048.
Wang W., Lu S.-E., Cheng J. Q., Xie M., Kostis J. B. (2022). Multivariate survival analysis in big data: a divide-and-combine approach. Biometrics, 78, 852–866.
Wang Y., Hong C., Palmer N., Di Q., Schwartz J., Kohane I. et al. (2021). A fast divide-and-conquer sparse Cox regression. Biostatistics, 22, 381–401.
Zeng D., Lin D. Y. (2007). Maximum likelihood estimation in semiparametric regression models with censored data. Journal of the Royal Statistical Society, Series B, 69, 507–564.
Zou H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.

Associated Data

Supplementary Materials

ujae018_Supplemental_Files

Web Appendices, Tables, and Figures, referenced in Sections 2, 3, and 4, and the computation codes are available with this paper at the Biometrics website on Oxford Academic.

ujae018_supplemental_files.zip (19.9MB, zip)

Data Availability Statement

UK Biobank data are available from http://www.ukbiobank.ac.uk. A formal application to the UK Biobank is required to use the data.

