Abstract
The semiparametric Cox proportional hazards model, together with the partial likelihood principle, has been widely used to study the effects of potentially time-dependent covariates on a possibly censored event time. We propose a computationally efficient method for fitting the Cox model to big data involving millions of study subjects. Specifically, we perform maximum partial likelihood estimation on a small subset of the whole data and improve the initial estimator by incorporating the remaining data through one-step estimation with estimated efficient score functions. We show that the final estimator has the same asymptotic distribution as the conventional maximum partial likelihood estimator using the whole dataset but requires only a small fraction of computation time. We demonstrate the usefulness of the proposed method through extensive simulation studies and an application to the UK Biobank data.
Keywords: censoring, efficient score, one-step estimation, partial likelihood, time complexity, time-dependent covariates
1. INTRODUCTION
Large biobanks, such as the UK Biobank (Bycroft et al., 2018), provide unprecedented opportunities to explore the genetic basis of the onset and progression of complex human diseases. In particular, time-to-event analyses of the UK Biobank data have identified novel genetic variants associated with hypertension, heart disease, breast cancer, and Alzheimer’s disease (Bi et al., 2020; Dey et al., 2022). The semiparametric Cox (1972) proportional hazards model has been widely used in such time-to-event analyses. The regression parameters are commonly estimated by maximizing the partial likelihood (Cox, 1975). This maximization is time-consuming and often infeasible for big data involving a very large number of subjects, especially in the presence of time-dependent covariates.
One possible solution is to randomly divide the full dataset into several subsets, analyze each subset separately, and then combine the estimates. Two versions of this divide-and-conquer (DAC) approach have been suggested: The linearization-based DAC method calculates an initial maximum partial likelihood estimate for 1 block and iteratively updates the estimate with the score function and information matrix on each individual subset in 2 iterations (Wang et al., 2021); the weighted DAC method calculates the maximum partial likelihood estimate for each block and produces a weighted average of the estimates (Wang et al., 2022). The DAC approach is computationally efficient under the distributed computing architecture but requires separate estimation of the baseline hazard function for each subset of data, which may cause numerical instability and reduce statistical efficiency.
In this paper, we propose a computationally fast method for fitting the Cox regression model to big data without sacrificing statistical efficiency. Specifically, we divide the whole dataset into 2 blocks, with the first block being much smaller than the second block. We perform maximum partial likelihood estimation on the first block of data. We then improve this initial estimator by using one-step estimation with semiparametric efficient scores on the second block of data. Since maximum partial likelihood estimation is performed only on a small subset of the whole data, and the one-step update does not involve any iterations or calculation of the Hessian matrix, the proposed method takes only a small fraction of the computation time that is required to maximize the partial likelihood on the whole dataset. Using the counting-process martingale theory and modern empirical process theory, we show that the proposed estimator achieves the same asymptotic efficiency as the maximum partial likelihood estimator on the whole dataset.
2. METHODS
2.1. Maximum partial likelihood estimation
The Cox (1972) proportional hazards model specifies that the hazard function of an event time $T$ conditional on a $p$-vector of potentially time-dependent covariates $X(\cdot)$ takes the form
$$\lambda(t \mid X) = \lambda_0(t) \exp\{\beta^{\mathrm{T}} X(t)\},$$
where $\beta$ is a $p$-vector of unknown regression parameters, and $\lambda_0(\cdot)$ is an arbitrary baseline hazard function. When $T$ is subject to right censoring by $C$, we observe $(\tilde{T}, \Delta, X)$, where $\tilde{T} = \min(T, C)$, $\Delta = I(T \le C)$, and $I(\cdot)$ is the indicator function. It is assumed that $C$ is independent of $T$ conditional on $X(\cdot)$.
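We consider a random sample of $n$ subjects and use the subscript $i$ to indicate any variable associated with the $i$th subject. Given the data $(\tilde{T}_i, \Delta_i, X_i)$ $(i = 1, \ldots, n)$, the partial likelihood for $\beta$ takes the form
$$L(\beta) = \prod_{i=1}^{n} \left[ \frac{\exp\{\beta^{\mathrm{T}} X_i(\tilde{T}_i)\}}{\sum_{j=1}^{n} I(\tilde{T}_j \ge \tilde{T}_i) \exp\{\beta^{\mathrm{T}} X_j(\tilde{T}_i)\}} \right]^{\Delta_i}.$$
The corresponding score function and information matrix are
$$U(\beta) = \sum_{i=1}^{n} \Delta_i \left\{ X_i(\tilde{T}_i) - \frac{S^{(1)}(\tilde{T}_i; \beta)}{S^{(0)}(\tilde{T}_i; \beta)} \right\}$$
and
$$\mathcal{I}(\beta) = \sum_{i=1}^{n} \Delta_i \left[ \frac{S^{(2)}(\tilde{T}_i; \beta)}{S^{(0)}(\tilde{T}_i; \beta)} - \left\{ \frac{S^{(1)}(\tilde{T}_i; \beta)}{S^{(0)}(\tilde{T}_i; \beta)} \right\}^{\otimes 2} \right],$$
respectively. Here and in the sequel, $S^{(k)}(t; \beta) = \sum_{j=1}^{n} I(\tilde{T}_j \ge t) \exp\{\beta^{\mathrm{T}} X_j(t)\} X_j(t)^{\otimes k}$ $(k = 0, 1, 2)$, and $a^{\otimes 0} = 1$, $a^{\otimes 1} = a$, $a^{\otimes 2} = a a^{\mathrm{T}}$. The standard method is to solve the score equation $U(\beta) = 0$ via the Newton-Raphson algorithm. The resulting estimator $\hat\beta$ is consistent and asymptotically normal, with a covariance matrix that can be estimated by $\mathcal{I}^{-1}(\hat\beta)$ (Andersen and Gill, 1982).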
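The Newton-Raphson iteration described above can be sketched in a few lines. The following is a minimal illustration only, assuming a single time-independent covariate and Breslow-type handling of ties; the function names `cox_score_info` and `fit_cox` and the step-size stopping rule are choices made here for illustration and are not taken from the authors' software.

```python
import math

def cox_score_info(beta, times, events, x):
    """Score U(beta) and information I(beta) for one time-independent covariate."""
    u, info = 0.0, 0.0
    for t_i, d_i, x_i in zip(times, events, x):
        if not d_i:
            continue  # censored subjects contribute only through risk sets
        s0 = s1 = s2 = 0.0
        for t_j, x_j in zip(times, x):
            if t_j >= t_i:  # subject j is still at risk at time t_i
                w = math.exp(beta * x_j)
                s0 += w
                s1 += w * x_j
                s2 += w * x_j * x_j
        mean = s1 / s0
        u += x_i - mean
        info += s2 / s0 - mean * mean
    return u, info

def fit_cox(times, events, x, tol=1e-10, max_iter=50):
    """Solve U(beta) = 0 by Newton-Raphson; return (estimate, variance estimate)."""
    beta = 0.0
    for _ in range(max_iter):
        u, info = cox_score_info(beta, times, events, x)
        step = u / info
        beta += step
        if abs(step) < tol:  # assumption: stop on a small step, not on the likelihood change
            break
    _, info = cox_score_info(beta, times, events, x)
    return beta, 1.0 / info

# Toy data: four uncensored subjects. Setting the score to zero by hand
# gives exp(beta)^2 + exp(beta) - 1 = 0, so beta_hat = log((sqrt(5) - 1) / 2).
toy_times = [1.0, 2.0, 3.0, 4.0]
toy_events = [1, 1, 1, 1]
toy_x = [1.0, 0.0, 0.0, 1.0]
beta_hat, var_hat = fit_cox(toy_times, toy_events, toy_x)
```

The inner loop over the risk set makes this version quadratic in the sample size; it is written for readability rather than speed.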
The time required to obtain $\hat\beta$ is determined by the evaluation of the score function $U(\beta)$ and the information matrix $\mathcal{I}(\beta)$ during the Newton-Raphson iterations. When $X$ is time-independent, the summations in the numerators and denominators of $U(\beta)$ and $\mathcal{I}(\beta)$ take the form $\sum_{j=1}^{n} I(\tilde{T}_j \ge \tilde{T}_i)\, g(X_j)$ for some function $g$. Once the observed event times are sorted in descending order, each summation can be calculated as a cumulative sum of $g(X_j)$ in reverse order, and the summations can be performed for all $i$ in 1 loop. Thus, the computation time for evaluating $U(\beta)$ and $\mathcal{I}(\beta)$
is linear in
, since the dimension of
is p. Therefore, the total time complexity for the standard method is
, provided that the sorting algorithm has a time complexity
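. When $X(t)$ is a continuous function of $t$, however, the summation $\sum_{j=1}^{n} I(\tilde{T}_j \ge \tilde{T}_i)\, g\{X_j(\tilde{T}_i)\}$ must be calculated over all the subjects in the risk set $\{j : \tilde{T}_j \ge \tilde{T}_i\}$ at each observed event time $\tilde{T}_i$. Because each calculation is linear in n, the total time complexity for evaluating $U(\beta)$ and $\mathcal{I}(\beta)$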
becomes
, where m is the total number of distinct observed event times. Consequently, the computation can be formidable when n and m are very large.
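The cumulative-sum argument for time-independent covariates can be illustrated as follows. This is a sketch under the assumption that each subject carries one precomputed value `g[j]` (for example, the exponentiated linear predictor); the function names are hypothetical. The naive version recomputes every risk-set sum from scratch, whereas the second version sorts once and reuses a running total.

```python
def risk_set_sums_naive(times, g):
    """O(n^2): for each i, sum g[j] over the risk set {j : times[j] >= times[i]}."""
    return [sum(g_j for t_j, g_j in zip(times, g) if t_j >= t_i) for t_i in times]

def risk_set_sums_cumulative(times, g):
    """One O(n log n) sort followed by a single O(n) pass with a running sum."""
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i], reverse=True)  # latest time first
    out = [0.0] * n
    running = 0.0
    start = 0
    while start < n:
        # Add a whole group of tied times before assigning, so that every
        # member of the tie sees the full tie in its risk set.
        end = start
        while end < n and times[order[end]] == times[order[start]]:
            running += g[order[end]]
            end += 1
        for k in range(start, end):
            out[order[k]] = running
        start = end
    return out

example_times = [5.0, 2.0, 7.0, 2.0, 9.0]
example_g = [1.0, 2.0, 3.0, 4.0, 5.0]
naive = risk_set_sums_naive(example_times, example_g)
fast = risk_set_sums_cumulative(example_times, example_g)
```

When the covariates change continuously with time, `g[j]` depends on the event time at which it is evaluated, so the running total cannot be reused and the naive pattern is what remains.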
2.2. Improving computational efficiency
We propose a method to reduce computation time while maintaining statistical efficiency. To this end, we randomly divide the entire dataset into 2 blocks indexed by $\mathcal{D}_1$ and $\mathcal{D}_2$, with $n_k$ observations in $\mathcal{D}_k$ $(k = 1, 2)$. We maximize the partial likelihood with the data from the first block to obtain an initial estimator of $\beta$, denoted by $\hat\beta_1$. We then incorporate the data from the second block by using the one-step estimator from semiparametric efficiency theory (Bickel et al., 1993).
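The efficient score function for $\beta$ based on a single observation $O = (\tilde{T}, \Delta, X)$ is
$$\ell_{\beta}(O) = \int_0^{\tau} \left\{ X(t) - \frac{s^{(1)}(t; \beta)}{s^{(0)}(t; \beta)} \right\} \left\{ dN(t) - Y(t) e^{\beta^{\mathrm{T}} X(t)} \, d\Lambda_0(t) \right\},$$
where $N(t) = \Delta I(\tilde{T} \le t)$, $Y(t) = I(\tilde{T} \ge t)$, $\Lambda_0(t) = \int_0^t \lambda_0(u) \, du$, $s^{(k)}(t; \beta) = E\{Y(t) e^{\beta^{\mathrm{T}} X(t)} X(t)^{\otimes k}\}$, and $\tau$ denotes the end of the study. We obtain an empirical counterpart $\hat\ell(O)$ by replacing $\beta$, $\Lambda_0(t)$, and $s^{(1)}(t; \beta)/s^{(0)}(t; \beta)$ in $\ell_{\beta}(O)$ with $\hat\beta_1$,
$$\hat\Lambda_1(t) = \sum_{i \in \mathcal{D}_1} \frac{\Delta_i I(\tilde{T}_i \le t)}{S_1^{(0)}(\tilde{T}_i; \hat\beta_1)}$$
(Breslow, 1972), and
$$\frac{S_1^{(1)}(t; \hat\beta_1)}{S_1^{(0)}(t; \hat\beta_1)}, \qquad S_1^{(0)}(t; \beta) = \sum_{j \in \mathcal{D}_1} I(\tilde{T}_j \ge t) e^{\beta^{\mathrm{T}} X_j(t)}, \quad S_1^{(1)}(t; \beta) = \sum_{j \in \mathcal{D}_1} I(\tilde{T}_j \ge t) e^{\beta^{\mathrm{T}} X_j(t)} X_j(t),$$
respectively. Our final estimator for $\beta$ is given by
$$\tilde\beta = \hat\beta_1 + \left\{ \frac{n}{n_1} \, \mathcal{I}_1(\hat\beta_1) \right\}^{-1} \sum_{i \in \mathcal{D}_2} \hat\ell(O_i),$$
where $\mathcal{I}_1(\beta)$ is the information matrix for the first block. We refer to $\tilde\beta$ as the Cox one-step efficient score (Cox-OSES) estimator, to reflect the fact that the one-step estimation with the efficient scores on the second block is used to boost the efficiency of the initial estimator $\hat\beta_1$. We show in the next section that the Cox-OSES estimator $\tilde\beta$ has the same asymptotic distribution as the standard estimator $\hat\beta$. Thus, we estimate the covariance matrix of $\tilde\beta$ by $(n_1/n) \, \mathcal{I}_1^{-1}(\hat\beta_1)$.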
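Remark 1
A statistically more efficient covariance matrix estimator is given by $\mathcal{I}^{-1}(\tilde\beta)$, the inverse of the full-data information matrix evaluated at the Cox-OSES estimator $\tilde\beta$, which makes use of the entire data. However, its computation is more demanding than scaling $\mathcal{I}_1^{-1}(\hat\beta_1)$, the inverse of the first-block information matrix. For statistical inference, only a consistent estimator of the covariance matrix is required, and estimating the covariance matrix of $\tilde\beta$ by $(n_1/n) \, \mathcal{I}_1^{-1}(\hat\beta_1)$ is both sufficient and practical.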
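The two-block procedure can be sketched end to end for a toy problem. The code below is an illustration under stated assumptions, not the authors' implementation: it handles a single time-independent covariate, uses plain loops rather than the cumulative-sum shortcuts, scales the first-block information by n/n1 in the update, and gives a second-block event no score contribution when no first-block subject is still at risk. All function names are hypothetical.

```python
import math

def risk_sums(beta, t, times, x):
    """S0, S1, S2 over subjects with observed time >= t (scalar covariate)."""
    s0 = s1 = s2 = 0.0
    for t_j, x_j in zip(times, x):
        if t_j >= t:
            w = math.exp(beta * x_j)
            s0 += w
            s1 += w * x_j
            s2 += w * x_j * x_j
    return s0, s1, s2

def score_and_info(beta, times, events, x):
    """Partial likelihood score and information on one block."""
    u = info = 0.0
    for t_i, d_i, x_i in zip(times, events, x):
        if d_i:
            s0, s1, s2 = risk_sums(beta, t_i, times, x)
            u += x_i - s1 / s0
            info += s2 / s0 - (s1 / s0) ** 2
    return u, info

def fit_first_block(times, events, x, tol=1e-10, max_iter=50):
    """Newton-Raphson maximum partial likelihood estimate on the first block."""
    beta = 0.0
    for _ in range(max_iter):
        u, info = score_and_info(beta, times, events, x)
        step = u / info
        beta += step
        if abs(step) < tol:
            break
    return beta, score_and_info(beta, times, events, x)[1]

def estimated_efficient_score(t, d, xv, beta1, times1, events1, x1):
    """Efficient score of one second-block subject, with first-block plug-ins."""
    score = 0.0
    if d:
        s0, s1, _ = risk_sums(beta1, t, times1, x1)
        if s0 > 0.0:  # assumption: skip if nobody in block 1 is still at risk
            score += xv - s1 / s0
    # Compensator: integrate against the Breslow jumps of block 1 up to time t.
    for t_l, d_l in zip(times1, events1):
        if d_l and t_l <= t:
            s0, s1, _ = risk_sums(beta1, t_l, times1, x1)
            breslow_jump = 1.0 / s0
            score -= math.exp(beta1 * xv) * (xv - s1 / s0) * breslow_jump
    return score

def cox_oses(block1, block2):
    """Return the one-step estimate and its estimated variance (scalar case)."""
    times1, events1, x1 = block1
    beta1, info1 = fit_first_block(times1, events1, x1)
    n1, n2 = len(times1), len(block2[0])
    total = sum(
        estimated_efficient_score(t, d, xv, beta1, times1, events1, x1)
        for t, d, xv in zip(*block2)
    )
    scaled_info = (n1 + n2) * info1 / n1
    return beta1 + total / scaled_info, 1.0 / scaled_info

# Each block is (observed times, event indicators, covariate values).
block1 = ([1.0, 2.0, 3.0, 4.0], [1, 1, 1, 1], [1.0, 0.0, 0.0, 1.0])
block2 = ([0.5, 1.5, 2.5, 3.5, 2.0], [1, 0, 1, 1, 0], [0.0, 1.0, 1.0, 0.0, 1.0])
beta_tilde, var_tilde = cox_oses(block1, block2)
```

A useful sanity check on any such implementation is to pass the first block again as the second block: the estimated efficient scores then sum to the first-block score at its root, which is zero, so the one-step update leaves the initial estimate unchanged.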
According to the discussion in Section 2.1, the time complexity for calculating
is
when
is time-independent and
when
contains continuously time-dependent covariates, where
is the total number of distinct observed event times in the first block. For the second block, we need to evaluate the term
, which is equal to
![]() |
(1) |
where
. Note that
is a step function with
jumps, which is obtained from the first block analysis. Thus, the time complexity for the first term in Equation 1 is
after the event times in the second block are sorted. The time complexity for calculating the second term in Equation 1 depends on the type of covariates. If
is time-independent, this term can be written as
![]() |
Given the sorted event times in the second block, the time complexity of computing this term is
. When
is continuously time-dependent, however, we need to evaluate the summation for each of the
distinct event times in the first block, with a time complexity of
. Combining the calculations in the first and second blocks, we conclude that the total time complexity of the proposed method is
when covariates are all time-independent and
when covariates are continuously time-dependent.
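The two-term structure referred to in this paragraph can be sketched as follows. This is a hedged reconstruction that assumes the estimated efficient score has the usual counting-process form with first-block plug-ins; the symbols $\bar{X}_1$, $A_1$, and $t_1 < \cdots < t_{m_1}$ (the distinct event times in the first block) are notation introduced here for illustration.

```latex
% Assumed plug-in quantities from the first block:
%   \bar{X}_1(t) = S_1^{(1)}(t;\hat\beta_1) / S_1^{(0)}(t;\hat\beta_1),
%   \hat\Lambda_1(t) = Breslow estimator, a step function with jumps at t_1,\dots,t_{m_1}.
\begin{align*}
\sum_{i \in \mathcal{D}_2} \hat\ell(O_i)
  &= \sum_{i \in \mathcal{D}_2} \Delta_i \bigl\{ X_i(\tilde{T}_i) - \bar{X}_1(\tilde{T}_i) \bigr\}
   - \sum_{i \in \mathcal{D}_2} \int_0^{\tilde{T}_i}
       e^{\hat\beta_1^{\mathrm{T}} X_i(t)} \bigl\{ X_i(t) - \bar{X}_1(t) \bigr\} \, d\hat\Lambda_1(t). \\
\intertext{If $X_i$ does not depend on $t$, the integral reduces to two step functions evaluated at $\tilde{T}_i$:}
\sum_{i \in \mathcal{D}_2} \int_0^{\tilde{T}_i}
       e^{\hat\beta_1^{\mathrm{T}} X_i} \bigl\{ X_i - \bar{X}_1(t) \bigr\} \, d\hat\Lambda_1(t)
  &= \sum_{i \in \mathcal{D}_2} e^{\hat\beta_1^{\mathrm{T}} X_i}
       \bigl\{ X_i \, \hat\Lambda_1(\tilde{T}_i) - A_1(\tilde{T}_i) \bigr\},
  \qquad
  A_1(t) = \sum_{l:\, t_l \le t} \bar{X}_1(t_l) \, \Delta\hat\Lambda_1(t_l).
\end{align*}
```

Under these assumptions, both $\hat\Lambda_1$ and $A_1$ are computed once from the first block, so with sorted second-block times each subject needs only a lookup. With continuously time-dependent covariates the integrand must instead be evaluated at every first-block event time up to $\tilde{T}_i$.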
According to the discussion in Section 2.1, the total time complexity of the DAC methods with M blocks is
when covariates are all time-independent and
when covariates are continuously time-dependent. Here, we assume the DAC methods evenly divide the full dataset into M subsets and the block size is of the same magnitude as
. Then the time complexity of the proposed method is of the same order as that of the standard method and the DAC methods when covariates are all time-independent and is much lower than that of both the standard method and the DAC methods when covariates are continuously time-dependent and p is large. In terms of actual computation, the proposed method is always less demanding than both the standard method and the DAC methods since the one-step estimation in the second block does not involve any iterations or calculation of the Hessian matrix.
The Cox-OSES estimator can be used for variable selection. Motivated by Wang and Leng (2007) and Zou (2006), we minimize the objective function:
![]() |
where
is a tuning parameter controlling the level of penalization, and
is a pre-specified positive number. We set
in our simulation studies and real-data example. This optimization can be solved by using the R package glmnet (Friedman et al., 2010). Let
denote the minimizer of
. As
and
,
satisfies both selection consistency and the oracle property (Wang and Leng, 2007).
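The variable-selection step can be illustrated with a small coordinate-descent routine. This sketch assumes the objective is a least squares approximation around the Cox-OSES estimate plus an adaptive lasso penalty, written here as 0.5 (b - bt)' A (b - bt) + lam * sum_j |b_j| / |bt_j|^gamma, where `bt` is the Cox-OSES estimate and `A` is the inverse of its estimated covariance matrix. The factor 0.5, the function names, and the use of hand-written coordinate descent in place of glmnet are assumptions made for illustration.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lsa_adaptive_lasso(beta_tilde, precision, lam, gamma=1.0, n_sweeps=200, tol=1e-12):
    """Coordinate descent for the penalized least squares approximation.

    beta_tilde : initial estimates (all nonzero, since they define the weights)
    precision  : inverse covariance matrix of beta_tilde, as a list of rows
    """
    p = len(beta_tilde)
    weights = [1.0 / abs(bt) ** gamma for bt in beta_tilde]
    b = list(beta_tilde)
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            # Contribution of the other coordinates to the gradient in b_j.
            r = sum(precision[j][k] * (b[k] - beta_tilde[k]) for k in range(p) if k != j)
            z = precision[j][j] * beta_tilde[j] - r
            new = soft_threshold(z, lam * weights[j]) / precision[j][j]
            max_change = max(max_change, abs(new - b[j]))
            b[j] = new
        if max_change < tol:
            break
    return b

# A strong and a weak coefficient with a diagonal precision matrix.
selected = lsa_adaptive_lasso([1.0, 0.1], [[4.0, 0.0], [0.0, 4.0]], lam=1.0)
```

Because the weight of each coefficient is inversely related to the size of its initial estimate, the small coefficient is penalized far more heavily than the large one and is set exactly to zero in this example.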
2.3. Asymptotic properties of the Cox-OSES estimator
We impose the following regularity conditions:
Condition 1
The vector of covariates
has finite total variation over the time interval
, where
denotes the end of the study.
Condition 2
The true baseline hazard function satisfies that
.
Condition 3
The probability
.
Condition 4
The matrix
is positive definite, where
,
, and
is the true value of
.
Condition 5
The size of the first block
satisfies that
as
.
The first 4 conditions ensure that the maximum partial likelihood estimator is consistent, asymptotically normal, and asymptotically efficient (Andersen and Gill, 1982). The last condition implies that
is much smaller than n but is larger than
. We state below the asymptotic properties of the Cox-OSES estimator.
Theorem 1
Under Conditions 1-5,
converges in distribution to a zero-mean normal random vector with covariance matrix
.
The theorem is proved in Web Appendix A.
3. SIMULATION STUDIES
We conducted extensive simulation studies to compare the proposed method with the standard method and the DAC methods. We refer to the linearization-based DAC method as DAC-Linearization and to the weighted DAC method that uses the block size or the information matrix as the weight as DAC-Size or DAC-Information, respectively. We considered one setting involving only time-independent covariates and one setting involving time-dependent covariates. In the first setting, we generated 100 covariates from the multivariate normal distribution with mean 0 and covariance matrix
. We set
and
, and we generated C from
, where
is a
vector with all elements being a, and
denotes the exponential distribution with mean
. In the second setting, we included 50 time-independent covariates
and 50 time-dependent covariates
. We considered 5 time intervals
,
,
,
, and
, where the time-dependent covariates
are constant within each interval but may vary between intervals. We generated
from the multivariate normal distribution with mean
and covariance matrix
. The final covariates are
, where
We set
,
, and
. In both settings, approximately 70% of the event times were censored. A total sample size of
or
was considered for both settings.
For the proposed method, we set $n_1 = kn$, where $k = 0.1$, 0.2, 0.3, or 0.4. For the DAC methods, we set $M = 5$
, 10, 20, 50, 100, 200, 500, or 1000. The convergence criterion for the Newton-Raphson algorithm was that the absolute relative change in the log-partial likelihood between 2 successive iterations is less than
. For each method, we recorded the total CPU time and calculated the squared error
, where
denotes the estimate of
.
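A data-generating step of this kind, together with the squared-error summary, can be sketched as follows. The specifics are assumptions made for illustration rather than the settings used in the paper: independent standard normal covariates, a unit baseline hazard, exponential censoring with a fixed rate, and the names `simulate_cox` and `squared_error`.

```python
import math
import random

def simulate_cox(n, beta, censor_rate, seed=0):
    """Draw (observed time, event indicator, covariates) under a Cox model.

    Assumes a constant baseline hazard of 1, so the event time given x is
    exponential with rate exp(beta' x); censoring is exponential as well.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in beta]
        hazard = math.exp(sum(b * v for b, v in zip(beta, x)))
        t = rng.expovariate(hazard)       # latent event time
        c = rng.expovariate(censor_rate)  # latent censoring time
        data.append((min(t, c), 1 if t <= c else 0, x))
    return data

def squared_error(estimate, truth):
    """Squared Euclidean distance between an estimate and the true parameter."""
    return sum((e - b) ** 2 for e, b in zip(estimate, truth))

sample = simulate_cox(2000, [0.5, -0.5], censor_rate=2.0, seed=1)
censored_fraction = 1.0 - sum(d for _, d, _ in sample) / len(sample)
```

Raising `censor_rate` increases the censored fraction, which is one way to mimic a rare-event setting in a sketch like this.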
Figure 1 displays the simulation results based on 1000 replicates. In both settings, the Cox-OSES estimator has almost the same statistical efficiency as the standard estimator but takes much less time to compute. In the second setting with
, the standard method took approximately 2.04 hours whereas the Cox-OSES method with
took only 26.91 minutes, with only 0.07% efficiency loss. Compared with the DAC-Size estimator and DAC-Information estimator, the Cox-OSES estimator always takes less time to compute. When the Cox-OSES estimator with some k and the DAC-Linearization estimator with some M have the same squared error, the Cox-OSES estimator takes less time to compute. When the computation time is the same between the 2 methods, the Cox-OSES estimator has a smaller squared error.
FIGURE 1.
Average ratios of squared error and total CPU time for the Cox-OSES and DAC methods to the standard method. The solid curve pertains to Cox-OSES, and the numbers along it indicate the proportion of the first block to the entire dataset. The dashed curve pertains to DAC-Linearization, and the numbers along it indicate the number of blocks. The dotted-dashed and dotted curves pertain to DAC-Size and DAC-Information, respectively. The numbers along the curves indicate the corresponding numbers of blocks.
Figures S1 and S2 in Web Appendix B display the bias and standard error of the proposed estimator in the second setting with
. Neither the bias nor the standard error is influenced by the type of covariate, the true coefficient value, or the choice of
.
We further compared the proposed method with the standard method and DAC methods in the context of rare events. We adopted the first simulation setting, but with C generated from
. Approximately 98% of the event times were censored. We considered
. Tables S1 and S2 in Web Appendix B present the simulation results based on 1000 replicates. The basic conclusions are the same as those of the first 2 settings.
The simulation results suggest that with
, the proposed estimator achieves the same statistical efficiency as the standard estimator while reducing the computation time by 80%. Nevertheless, the choice of
should depend on both n and the event rate. A smaller n or a lower event rate requires a larger
.
We also conducted simulation studies to evaluate the variable selection performance of the regularized estimation approach with the Cox-OSES estimator. We adopted the first 2 simulation settings to generate
and used the same
. For each setting, we considered 2 choices of
to reflect the effect sizes. When only time-independent covariates are involved, we set
and generated C from
or set
and generated C from
. When time-dependent covariates are involved, we set
and
or set
and
. In all scenarios, approximately 70% of the event times were censored. We set
.
For the proposed method and the DAC methods, we used the same block size as in the previous simulation studies. For DAC methods, we adopted the regularized estimation approach of Wang et al. (2021, 2022). For the standard method, we employed the unified Lasso estimation of Wang and Leng (2007). The tuning parameter
was chosen by minimizing the Bayesian information criterion. For each method, we report the average true discoveries and false discoveries based on 1000 replicates. As shown in Table S3 in Web Appendix B, the regularized estimation approach with the Cox-OSES estimator performs similarly to the standard method.
4. AN EXAMPLE
We considered the UK Biobank, which is a cohort study on approximately half a million subjects (Bycroft et al., 2018). We related the age-at-onset of circulatory-system diseases to 1100 genetic variants, of which 50 were selected from each chromosome by univariate screening, with adjustment for sex and the top 10 genetic principal components representing population substructures. To construct the time-to-event outcome, we converted the International Classification of Disease code version 10 to the PheWAS code (Denny et al., 2013) and then calculated the difference between the date of diagnosis and the date of birth (Dey et al., 2022). The dataset contains a total of 487 252 subjects, with an event rate of 39%. We performed variable selection on the 1100 genetic variants using the regularized estimation approach with the Cox-OSES estimator. We randomly chose 20% of the whole dataset as the first block, and we calculated the Cox-OSES estimates and covariance matrix estimates. We repeated this process 10 times and averaged the 10 sets of estimates to adjust for the randomness introduced by data splitting. We then used the averaged estimates for variable selection among the 1100 genetic variants. The final analysis dataset contains 140 genetic variants.
We then randomly chose 10%, 20%, 30%, or 40% of the whole dataset as the first block for the proposed method and varied M from 5 to 1000 for the DAC methods. For each method, we reported the total CPU time and calculated the squared error
, where
denotes a DAC estimate or the Cox-OSES estimate and
is the standard estimate. We repeated the process 100 times. All calculations were done on the same computer.
The results are presented in Figure 2. Compared to the standard method, the average squared errors for the Cox-OSES estimates are
,
, and
under
,
, and
, respectively. Under
and 50, the average squared errors are
and
for the DAC-Linearization estimates,
and
for the DAC-Size estimates, and
and
for the DAC-Information estimates, respectively. The standard method took about 245.45 seconds to converge. The DAC-Size and DAC-Information methods took more time to compute than the standard method for all choices of M because the Newton-Raphson algorithm sometimes required more iterations to converge on small blocks than on the entire UK Biobank dataset. The DAC-Linearization method took 150 and 128 seconds under
and 50, respectively. By contrast, the Cox-OSES method took only 30, 57, and 84 seconds under
,
, and
, respectively.
FIGURE 2.
Average squared errors and ratios of total CPU time for the Cox-OSES and DAC methods to the standard method. The solid curve pertains to Cox-OSES, and the numbers along it indicate the proportion of the first block to the entire dataset. The dashed curve pertains to the DAC-Linearization, and the numbers along it indicate the number of blocks. The dotted-dashed and dotted curves pertain to DAC-Size and DAC-Information, respectively. The numbers along the curves indicate the corresponding numbers of blocks.
Figure S3 displays the estimated hazard ratios and the corresponding 95% confidence intervals for the selected 140 genetic variants. The most significant genetic variant is 20:8603950:G:C with a P-value of
. This variant was identified by Vuckovic et al. (2020) to be associated with white blood cell count, plateletcrit, and platelet count. A second variant, 19:11526765:G:T, has a P-value of
, and it was previously linked to hypertension by Liu et al. (2016). In addition, 3:124337402:G:A, with a P-value of
, was associated with mean platelet volume and platelet count by Vuckovic et al. (2020). Finally, 4:144122227:A:G has a P-value of
and was identified by Stanzick et al. (2021) as being associated with blood urea nitrogen levels and glomerular filtration rate.
We divided the study participants into 3 groups based on their polygenic risk scores (The International Schizophrenia Consortium, 2009). Group 1 consists of the lowest 1/3, Group 2 the middle, and Group 3 the highest 1/3. The polygenic risk scores were constructed by
, where
denotes the genetic variants and
is the estimate of the regression parameters. Figure S4 presents the Kaplan-Meier estimates of the disease-free probabilities over time for the 3 groups. There are clear separations of the 3 curves, indicating the predictive value of the fitted model.
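The grouping by polygenic risk score can be sketched as follows, assuming the score is the linear combination of each participant's genotype values with the estimated regression coefficients and that groups are formed by rank. The function names and the toy genotype matrix are hypothetical.

```python
def polygenic_risk_scores(genotypes, beta_hat):
    """One score per participant: the inner product of genotypes and coefficients."""
    return [sum(b * g for b, g in zip(beta_hat, row)) for row in genotypes]

def tertile_groups(scores):
    """Assign group 1 (lowest third), 2 (middle third), or 3 (highest third) by rank."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    groups = [0] * n
    for rank, i in enumerate(order):
        groups[i] = 1 + (3 * rank) // n
    return groups

toy_genotypes = [[0, 1], [2, 0], [1, 1], [0, 0], [2, 2], [1, 0]]
toy_beta = [0.5, -0.2]
toy_scores = polygenic_risk_scores(toy_genotypes, toy_beta)
toy_groups = tertile_groups(toy_scores)
```

The resulting group labels can then be passed to any Kaplan-Meier routine to draw one curve per group.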
Following Cox (1972) and Kalbfleisch and Prentice (2002), as well as the coxph function in R and phreg in SAS, we checked the proportional hazards assumption by including an interaction between variant 20:8603950:G:C and
. In this case, the counting process format is infeasible owing to the machine's limited memory. This task would take more than 1 month for the standard method to complete. We randomly chose 10% and 20% of the whole dataset as the first block for the proposed method and set $M = 5$ and 10 for the DAC methods to ensure that the size of the first block in the proposed method is the same as the block size in the DAC methods. The DAC-Size and DAC-Information took 229 and 127 hours under $M = 5$ and 10, respectively, and the DAC-Linearization took 148 and 67 hours, respectively. By contrast, the proposed method with $k = 0.1$ and $k = 0.2$ took only 13 and 48 hours, respectively.
We repeated the process 10 times and averaged the 10 sets of estimates to adjust for the randomness introduced by data splitting. Figure S5 in Web Appendix C displays the 10 sets of estimates for the time-dependent covariate. Figure 3 shows that Cox-OSES and DAC-Linearization yielded similar estimates and similar standard errors. Table 1 summarizes the results for the coefficient of the time-dependent covariate. The proposed method and DAC methods agreed on rejecting the proportional hazards assumption.
FIGURE 3.
Parameter estimates and standard error estimates for the proposed method and DAC-Linearization methods in the analysis of the UK Biobank data with a time-dependent covariate.
TABLE 1.
Estimation results for the regression parameter of the time-dependent covariate by the proposed method and DAC methods.
| | Estimate | Std error | P-value |
|---|---|---|---|
| Cox-OSES | | | |
| | –0.0719 | 0.02880 | .013 |
| | –0.0710 | 0.02879 | .014 |
| DAC-Linearization | | | |
| | –0.0690 | 0.02867 | .016 |
| | –0.0689 | 0.02867 | .016 |
| DAC-Size | | | |
| | –0.0688 | 0.02870 | .017 |
| | –0.0685 | 0.02874 | .017 |
| DAC-Information | | | |
| | –0.0691 | 0.02867 | .016 |
| | –0.0691 | 0.02867 | .016 |
Abbreviation: DAC, divide-and-conquer.
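The P-values in Table 1 are consistent with two-sided Wald tests computed from the reported estimates and standard errors. The sketch below assumes that this is how they were obtained; the function name is hypothetical.

```python
import math

def wald_p_value(estimate, std_error):
    """Two-sided P-value for H0: coefficient = 0 under a normal approximation."""
    z = estimate / std_error
    # 2 * {1 - Phi(|z|)} equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

# First Cox-OSES row of Table 1.
p_first_row = wald_p_value(-0.0719, 0.02880)
```

For the first Cox-OSES row this gives a value of about 0.0125, in line with the tabulated .013.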
We performed only 1 analysis. The gains in computational efficiency offered by the proposed method will be more important in genome-wide association studies requiring a large number of analyses.
5. REMARKS
The proposed method can incorporate more than 2 blocks of data. Suppose that there are K blocks of data indexed by
, with sizes
, respectively. We obtain
using the first block of data. For the remaining
blocks, we calculate
![]() |
We then define the Cox-OSES estimator as
. Mathematically,
is independent of the choice of K, provided that
is the same, and thus Theorem 1 holds for the Cox-OSES estimator with multiple blocks. However, there are practical advantages to using more than 2 blocks. First, when the dataset is so big that the data have to be divided and stored in multiple machines, the boosting procedure can be carried out simultaneously among the
blocks, and only the estimates
from the first block analysis are communicated to the remaining blocks. Thus, there is great efficiency in both computation and communication. Second, because only estimates are shared among the K blocks, privacy can be preserved when combining data from different sources. Third, the proposed procedure can be used to incorporate new data in real time without revisiting previous data. By treating the new batch of data as the
th block, we can obtain
and include it in the calculation of
.
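The block-wise bookkeeping described in this paragraph can be sketched as follows. The example assumes a scalar coefficient, that each later block has already turned its observations into estimated efficient scores using the quantities shared by the first block, and that the update divides the accumulated score sum by the first-block information scaled by n/n1. The class name and its methods are hypothetical.

```python
class StreamingOneStep:
    """Accumulate efficient-score sums from later blocks and update on demand."""

    def __init__(self, beta1, info1, n1):
        self.beta1 = beta1    # initial estimate from the first block
        self.info1 = info1    # first-block information at beta1
        self.n1 = n1          # first-block sample size
        self.n = n1           # running total sample size
        self.score_sum = 0.0  # running sum of estimated efficient scores

    def add_block(self, scores):
        """Fold in one block; only its score sum and size are needed."""
        self.n += len(scores)
        self.score_sum += sum(scores)

    def estimate(self):
        scaled_info = self.n * self.info1 / self.n1
        return self.beta1 + self.score_sum / scaled_info

scores = [0.5, -0.25, 0.125, 0.75, -0.5, 0.25]

two_blocks = StreamingOneStep(beta1=0.1, info1=2.0, n1=4)
two_blocks.add_block(scores[:3])
two_blocks.add_block(scores[3:])

three_blocks = StreamingOneStep(beta1=0.1, info1=2.0, n1=4)
for start in (0, 2, 4):
    three_blocks.add_block(scores[start:start + 2])
```

Because only sums and counts are stored, the result does not depend on how the later observations are partitioned, and a new batch can be folded in without revisiting earlier blocks.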
When applying the proposed method to multiple data blocks, heterogeneity across blocks may emerge. We may account for heterogeneity by adopting the stratified Cox proportional hazards model. Specifically, we fit the stratified Cox model to the first data block, estimate the efficient score functions for each stratum using the remaining data, and then update the initial coefficient estimates accordingly.
There may exist outlying observations in the dataset. The proposed method is not sensitive to outlying observations due to the rank-based nature of the partial likelihood inference.
We can extend the proposed method to marginal and random-effects models for correlated event times. Suppose that there are n independent clusters, with
subjects in the ith cluster. To calculate the Cox-OSES estimator, we randomly select
clusters into
. For the marginal Cox model, we replace
with the sandwich variance estimator
in the one-step estimation, where
denotes
for the jth subject of the ith cluster. The covariance matrix of the resulting estimator is estimated by
.
Fitting random-effects models is more challenging. Direct maximization of nonparametric likelihood is unstable. EM algorithms (Zeng and Lin, 2007) provide a stable solution, but the convergence is slow. In such situations, the proposed method is even more beneficial than in the current setting because the orders of computation are higher. In general, efficient score functions do not have explicit expressions but can be obtained as the numerical solution to a Fredholm-type equation (Zeng and Lin, 2007).
We can also extend the proposed approach to the situation of high-dimensional covariates. Specifically, we perform variable selection using the first block of data. After debiasing the initial estimates, we perform a one-step update on the coefficients of the selected variables using the remaining data. The main computation lies in the analysis of the first block of data, which is always less demanding than direct analysis of the whole dataset. We anticipate that uncertainty due to variable selection will not affect the asymptotic efficiency of the Cox-OSES estimator.
Kawaguchi et al. (2021) proposed a forward-backward scan algorithm, which reduces the time complexity from quadratic to linear for fitting the proportional subdistribution hazards model to competing risks data with time-independent covariates. However, their algorithm cannot be used to further improve computational efficiency in our setting because the reduction in time complexity by that algorithm results from the same mechanism of cumulative sum as the standard method for fitting the Cox proportional hazards model with time-independent covariates, which has only linear time complexity, as detailed in Section 2.1.
Supplementary Material
Web Appendices, Tables, and Figures, referenced in Sections 2, 3, and 4, and the computation codes are available with this paper at the Biometrics website on Oxford Academic.
Acknowledgement
The authors thank the editor, associate editor, and 2 referees for their helpful comments.
Contributor Information
Jianqiao Wang, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.
Donglin Zeng, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.
Dan-Yu Lin, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, United States.
FUNDING
This research was supported by the National Institutes of Health grants R01 HL149683 and GM124104.
CONFLICT OF INTEREST
None declared.
DATA AVAILABILITY
UK Biobank data are available from http://www.ukbiobank.ac.uk. A formal application to the UK Biobank is required to use the data.
References
- Andersen P. K., Gill R. D. (1982). Cox’s regression model for counting processes: a large sample study. The Annals of Statistics, 10, 1100–1120.
- Bi W., Fritsche L. G., Mukherjee B., Kim S., Lee S. (2020). A fast and accurate method for genome-wide time-to-event data analysis and its application to UK biobank. The American Journal of Human Genetics, 107, 222–233.
- Bickel P. J., Klaassen C. A. J., Ritov Y., Wellner J. A. (1993). Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: Johns Hopkins University Press.
- Breslow N. (1972). Discussion of the paper by D. R. Cox. Journal of the Royal Statistical Society, Series B, 34, 216–217.
- Bycroft C., Freeman C., Petkova D., Band G., Elliott L. T., Sharp K. et al. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature, 562, 203–209.
- Cox D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society, Series B, 34, 187–202.
- Cox D. R. (1975). Partial likelihood. Biometrika, 62, 269–276.
- Denny J. C., Bastarache L., Ritchie M. D., Carroll R. J., Zink R., Mosley J. D. et al. (2013). Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nature Biotechnology, 31, 1102–1111.
- Dey R., Zhou W., Kiiskinen T., Havulinna A., Elliott A., Karjalainen J. et al. (2022). Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks. Nature Communications, 13, 5437.
- Friedman J. H., Hastie T., Tibshirani R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1–22.
- Kalbfleisch J. D., Prentice R. L. (2002). The Statistical Analysis of Failure Time Data. New York: John Wiley & Sons.
- Kawaguchi E. S., Shen J. I., Suchard M. A., Li G. (2021). Scalable algorithms for large competing risks data. Journal of Computational and Graphical Statistics, 30, 685–693.
- Liu C., Kraja A. T., Smith J. A., Brody J. A., Franceschini N., Bis J. C. et al. (2016). Meta-analysis identifies common and rare variants influencing blood pressure and overlapping with metabolic trait loci. Nature Genetics, 48, 1162–1170.
- Stanzick K. J., Li Y., Schlosser P., Gorski M., Wuttke M., Thomas L. F. et al. (2021). Discovery and prioritization of variants and genes for kidney function in ≳1.2 million individuals. Nature Communications, 12, 4350.
- The International Schizophrenia Consortium (2009). Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature, 460, 748–752.
- Vuckovic D., Bao E. L., Akbari P., Lareau C. A., Mousas A., Jiang T. et al. (2020). The polygenic and monogenic basis of blood traits and diseases. Cell, 182, 1214–1231.
- Wang H., Leng C. (2007). Unified lasso estimation by least squares approximation. Journal of the American Statistical Association, 102, 1039–1048.
- Wang W., Lu S.-E., Cheng J. Q., Xie M., Kostis J. B. (2022). Multivariate survival analysis in big data: a divide-and-combine approach. Biometrics, 78, 852–866.
- Wang Y., Hong C., Palmer N., Di Q., Schwartz J., Kohane I. et al. (2021). A fast divide-and-conquer sparse Cox regression. Biostatistics, 22, 381–401.
- Zeng D., Lin D. Y. (2007). Maximum likelihood estimation in semiparametric regression models with censored data. Journal of the Royal Statistical Society, Series B, 69, 507–564.
- Zou H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.