Skip to main content
Genetics logoLink to Genetics
. 2020 Mar 4;215(1):41–58. doi: 10.1534/genetics.119.302940

Fast Algorithms for Conducting Large-Scale GWAS of Age-at-Onset Traits Using Cox Mixed-Effects Models

Liang He 1,1, Alexander M Kulminski 1,1
PMCID: PMC7198273  PMID: 32132097

Age-at-onset traits are of paramount importance in studies of age-related diseases and behavioral genetics. Cox mixed-effects models (CMEMs) are widely used when sample dependence is taken into account. However, the application of CMEMs in genome-wide association...

Keywords: Cox mixed-effects model, GWAS, time-to-event analysis, Alzheimer’s disease, age-at-onset

Abstract

Age-at-onset is one of the critical traits in cohort studies of age-related diseases. Large-scale genome-wide association studies (GWAS) of age-at-onset traits can provide more insights into genetic effects on disease progression and transitions between stages. Moreover, proportional hazards (or Cox) regression models can achieve higher statistical power in a cohort study than a case-control trait using logistic regression. Although mixed-effects models are widely used in GWAS to correct for sample dependence, application of Cox mixed-effects models (CMEMs) to large-scale GWAS is so far hindered by intractable computational cost. In this work, we propose COXMEG, an efficient R package for conducting GWAS of age-at-onset traits using CMEMs. COXMEG introduces fast estimation algorithms for general sparse relatedness matrices including, but not limited to, block-diagonal pedigree-based matrices. COXMEG also introduces a fast and powerful score test for dense relatedness matrices, accounting for both population stratification and family structure. In addition, COXMEG generalizes existing algorithms to support positive semidefinite relatedness matrices, which are common in twin and family studies. Our simulation studies suggest that COXMEG, depending on the structure of the relatedness matrix, is orders of magnitude computationally more efficient than coxme and coxph with frailty for GWAS. We found that using sparse approximation of relatedness matrices yielded highly comparable results in controlling false-positive rate and retaining statistical power for an ethnically homogeneous family-based sample. By applying COXMEG to a study of Alzheimer’s disease (AD) with a Late-Onset Alzheimer’s Disease Family Study from the National Institute on Aging sample comprising 3456 non-Hispanic whites and 287 African Americans, we identified the APOE ε4 variant with strong statistical power (P = 1e−101), far more significant than that reported in a previous study using a transformed variable and a marginal Cox model. Furthermore, we identified novel SNP rs36051450 (P = 2e−9) near GRAMD1B, the minor allele of which significantly reduced the hazards of AD in both genders. These results demonstrated that COXMEG greatly facilitates the application of CMEMs in GWAS of age-at-onset traits.


AGE-AT-ONSET and time-to-event are among the most important traits of interest in cohort studies of age-related diseases (He et al. 2016c; Kulminski et al. 2016) and behavioral genetics (He et al. 2016a), as these traits carry information on timing of events. Genome-wide association studies (GWAS) of time-to-event traits are critical to gain insights into the genetics of disease development and progression. Specifically, for many age-related diseases, including Alzheimer’s disease (AD), pathological processes may be triggered decades before the first symptoms. Identifying genetic variants associated with age-at-onset may provide insights into potential etiologies of such diseases. Time-to-event traits are also of inherent interest in specific fields such as clinical trials. In such situations, time-to-event analysis is essential because events occur in a subset of subjects whereas information on the outcome of other subjects is mostly censored. Compared to case-control studies using logistic regression models, evidence shows that cohort studies using proportional hazards (or Cox) regression models generally increase statistical power (Green and Symons 1983; Callas et al. 1998; Staley et al. 2017). The logistic regression model may also yield inflated estimates of the effect sizes in GWAS (Staley et al. 2017).

While time-to-event analysis using a Cox model plays a central role and is often more powerful in many biological and epidemiological studies, its application to GWAS is hampered by its massive computational burden when random effects are needed to account for unexplained heterogeneity (e.g., genetic relationship). Unlike a computationally efficient linear mixed model (LMM), which has been well optimized and broadly applied in genetics (Kang et al. 2010; Lippert et al. 2011; Svishcheva et al. 2012; Zhou and Stephens 2012; Loh et al. 2015), much less attention has been paid to Cox mixed-effects models (CMEMs). To facilitate genetic association analysis in family-based studies, Therneau (2015) developed a specific data structure (bdsmatrix) and a fast algorithm implemented in C in the Cox mixed-effects R package (coxme). The coxme package is optimized for block-diagonal kinship matrices constructed based on pedigree information (Therneau 2003; Pankratz et al. 2005). The performance of the coxme package is remarkable for block-diagonal kinship matrices, reducing the central processing unit (CPU) time for one single nucleotide polymorphisms (SNP) from ∼15 min for a dense relatedness matrix to < 1 sec when analyzing a cohort comprising ∼3500 individuals [e.g., the Late-Onset Alzheimer’s Disease Family Study from the National Institute on Aging (NIA-LOADFS) (Lee et al. 2008)]. Despite this dramatic improvement, applying a CMEM (Ripatti and Palmgren 2000; Therneau 2015) to a large-scale GWAS involving millions of SNPs is still computationally challenging. Even for the NIA-LOADFS sample, it would take coxme ∼3 months using one CPU thread to run a GWAS for 10 million imputed SNPs. Moreover, it is limited to block-diagonal and positive definite relatedness matrices. For a more general genetic relationship matrix (GRM) [e.g., the GRM proposed in Yang et al. (2011)] that cannot be grouped into a block-diagonal matrix or is not positive definite due to, e.g., monozygotic (MZ) twins, coxme would take decades or even fail.

Current GWAS often include many subjects (e.g., more than tens of thousands of subjects in biobank data) and multiple studies, especially in consortia. Recent GWAS have also interrogated higher-order relationships such as SNP–environment and SNP–SNP interactions, requiring a large number of tests. Therefore, there is an urgent need to develop fast algorithms for CMEMs that can be applied to large-scale GWAS involving time-to-event traits. In this work, we introduce efficient estimation and testing algorithms for CMEMs through an R package COXMEG (https://cran.r-project.org/web/packages/coxmeg/), in which G stands for GWAS. COXMEG addressed the aforementioned computational challenges, and is well suited for large-scale GWAS. We made five major contributions in COXMEG. First, it substantially improved computational efficiency. For example, for a family-based design using a block-diagonal kinship matrix, COXMEG is often > 100-fold faster than the remarkably optimized coxme package in the same computational environment. Second, we developed a novel inexact Newton method, which is not restricted to a block-diagonal pedigree-based relatedness matrix. An arbitrary sparse positive definite matrix is able to enjoy the speed gain from COXMEG. For example, we showed in our application to NIA-LOADFS that COXMEG attained high computational performance and statistical power using a sparse GRM after thresholding. This new algorithm significantly expands the application of COXMEG in many areas including, e.g., biobank data that use sparse relatedness matrices. Third, when a fully dense relatedness matrix cannot be approximated by a sparse matrix, it is computationally demanding to obtain P-values by estimating the hazard ratios (HRs) and then performing a Wald test. We developed a fast score test for such situations, dramatically reducing the computational cost of obtaining the P-values by not estimating the effect size of each SNP. Fourth, we introduced a generalized penalized partial likelihood (GPPL) to handle positive semidefinite relatedness matrices, which are very common in GWAS. Fifth, for a large-scale (> 20,000) fully dense relatedness matrix, we developed a fast randomized algorithm based on stochastic Lanczos quadrature (SLQ), considerably improving the performance for estimating the variance component.

Feasibility of COXMEG was demonstrated by performing GWAS of age-at-onset of AD in an NIA-LOADFS sample. AD is a complex age-related disease, spanning multiple stages and probably involving different biological mechanisms [e.g., β-amyloid deposition, pathogenesis, vascular changes, activation of the immune system, and phosphorylation of tau (Finch and Kulminski 2019)] at different stages. Our hypothesis is that genetic architecture of AD is likely implicated in faster pathological progression from the initial stage. We demonstrate high computational performance of COXMEG, and report novel findings using a transethnic sample of NIA-LOADFS.

Methods

Model specification in COXMEG

We describe the estimation framework in COXMEG using a penalized partial likelihood (PPL) proposed by Breslow and Clayton (1993), and Ripatti and Palmgren (2000), based on the Laplace approximation. Other estimation strategies such as the penalized likelihood method (McGilchrist and Aisbett 1991; McGilchrist 1993), the hierarchical likelihood (h-likelihood) framework (Ha et al. 2001, 2011; Lee et al. 2006), and the expectation–maximization (EM) algorithm (Cortiñas Abrahantes and Burzykowski 2005) lead to a similar iterative procedure. Specifically, suppose that we have right-censored time-to-event observations Ti,i{1,,N} from a cohort of N subjects, and a corresponding variable di indicating whether a failure (di=1) or censoring (di=0) occurs at Ti for each subject. The semiparametric proportional hazards (or Cox) model (Cox 1972) assumes that the conditional hazard function for subject i has the form

λi(t)=λ0(t)exp(ηi)=λ0(t)exp(Xiβ+Ziγ),

where λ0(t) is an unknown baseline hazard function, X is a matrix of m predictors (βm×1) corresponding to the fixed effects, and Z is a design matrix (γ1) for the random effects, respectively. The fixed effects include both the SNP to be tested and other covariates. In a CMEM, we assume that Z is the identity matrix (often denoted as I), and γ follows a multivariate normal distribution, i.e.,

γN(0,τΣ),

where ΣN×N is expected to be a symmetric positive definite (SPD) matrix measuring the genetic distance between the subjects (e.g., a kinship matrix or a GRM), and τ is the variance of the genetic component. For simplicity, we consider here one variance component, but generalization to multiple components is straightforward. The situation where Σ is symmetric positive semidefinite (SPSD) is more complicated, and will be discussed in a separate section. The log-likelihood based on the above model is given by

l=i=1n{dilog(λ0(t)exp(Xiβ+γi))Λ0(t)exp(Xiβ+γi)}12τγTΣ1γ12log(τN|Σ|), (1)

where Λ0(t)=0tλ0(t)dt is the cumulative hazard function. The baseline hazard λ0(t) is unknown, but can be estimated from the data. Either substituting in Equation 1 with a partial likelihood (McGilchrist 1993; Ripatti and Palmgren 2000) or using maximum likelihood estimation for λ0(t) based on the h-likelihood (Ha et al. 2001) leads to the following PPL or equivalently h-likelihood

l1=i=1n{di(ηilog(j(i)exp(ηj)))}12τγTΣ1γ12Nlog(τ), (2)

where (i) is the set of all subjects at risk when an event is observed for subject i.

Because β and γ are mutually canonical [e.g., see Lee et al. (2006)], the h-likelihood l1 can be treated as an ordinary likelihood for estimating β and γ simultaneously, which we denote by θ=(β,γ), when τ is given (Ha et al. 2001; Lee et al. 2006). However, τ has to be estimated separately using a marginal likelihood or an adjusted profile likelihood. Following Ripatti and Palmgren (2000), an integrated marginal likelihood of τ using the Laplace approximation is

l2=l1(θ^,τ)12log|J(l1(θ^),τ)2π|, (3)

where J(l1,τ) is an information matrix of γ using the PPL l1 with θ^ estimated from Equation 2. A standard two-layer algorithm for estimating θ, such as coxme, is to iterate between l1 and l2. However, in GWAS it is often sensible to practically assume that τ is a constant for estimating the effect of each SNP (Kang et al. 2010; Loh et al. 2015). We leverage this assumption in COXMEG by estimating the variance component τ under a null model with only relevant covariates, and then use the estimated τ^ to analyze all SNPs. This step substantially reduces computational time compared to coxme, which estimates τ for each SNP individually. Some studies show that using a uniform variance component may lead to loss of statistical power to detect very significant SNPs (Zhou and Stephens 2012). Nevertheless, this potential issue can be amended by conducting a sensitivity analysis for those SNPs passing certain significance in the first round of the GWAS. We first focus on the maximization of l1, where an optimization algorithm finds θ^ given a known τ^. The estimation of τ under the null model will be discussed in a later section in detail.

Estimation algorithm in COXMEG-sparse

We consider in COXMEG-sparse a situation where Σ is a sparse SPD matrix. Under a large sample size, l1 is highly quadratic in adjacent of the optimal θ, the Newton–Raphson (NR) method generally converges in a small number of steps, and thus is an efficient optimization method for this high-dimensional problem. Computing the first and second derivatives of l1 with respect to θ gives the following score function and the Fisher information matrix

s=dl1dθ=(XT(dWMA1)dWMA1Σ1γτ),
V=d2l1dθ2=(XTHXXTHHXH+Σ1τ),

where

H=WBQQT=WBWMA2MTW,
W=diag{exp(η)},
A=diag{D}diag-1{MTW1},D=(d1,,dN)T,
B=diag{MA1}, and
M=RT,Rij={1TiTj0Ti>Tj,

where R is an indicator matrix for the risk set (i), in which we adopt the Breslow’s approximation (Breslow 1974) for ties. The NR method estimates θ by computing

θk=θk1+V1s

iteratively until θk converges. In terms of computational burden, the cross product QQT takes O(N2) rather than O(N3) because W and A are diagonal, and M is an indicator matrix similar, through permutation, to a triangular matrix when no ties are present. Another computational challenge is to compute V1s, which takes O(N3) if solved directly using, e.g., Cholesky decomposition (Cholesky decomposition is valid because H is always SPSD as shown in Supplemental Material, File S1 Materials A.2, and V is SPD when Σ is SPD). If there are p SNPs and on average L1 iterations are needed in the NR optimization of l1 for each SNP, the dominant term in the total computational cost for estimating θ would be O(N3pL1), which becomes immediately intractable when the sample size N becomes thousands. We notice that when Σ is block-diagonal (e.g., a pedigree-based kinship matrix), coxme computes only the block-diagonal entries of V corresponding to those in Σ. Such an approximation can theoretically reduce the computational time to O(F2NpL1), where F is the largest block (or family) size. In family-based studies, F is generally much smaller than N, and can often be assumed as a constant that does not increase with N. This approximation provides remarkable gain in computational performance. However, the drawback of this method is that this approximation depends on the structure of Σ and is limited to a block-diagonal Σ. Accordingly, more general GRMs with long-range correlations, e.g., long-distance cryptic relatedness, or other sparse correlation matrices are not efficiently handled in this framework. In addition, the block-diagonal approximation is not optimal when there are large sparse blocks in Σ. We next propose an algorithm that is efficient for an arbitrary sparse SPD matrix.

Rewriting V1 using Schur complements leads to

V1=(V11V11XTHV221V221HTXV11V221+V221HTXV11XTHV221), (4)

where

V11=(XTHXXTHV221HX)1

and

V22=H+Σ1τ.

Note that the computation of XTH does not involve explicit evaluation of QQT, and requires only multiplication and addition in O(N). Therefore, once V22 is solved, the remaining computation can be done in O(N). The obstacle of inversing V22 is that V22 is fully dense even if Σ is sparse because QQT is a fully dense matrix. To solve V22, we observe that WB is diagonal, QQT is low-rank, and Σ (or Σ−1) is sparse. Rewriting V221 using the Woodbury identity (Hager 1989) gives

V221=(WBQQT+Σ1τ)1=S1+S1Q(IQTS1Q)1QTS1, (5)

where

S1=(WB+Σ1τ)1=(WB)1(Σ+(WB)1τ)1Σ. (6)

The advantage of using Equation 5 is that it disentangles the sparse and dense parts in V221, so that the sparse term S1 in Equation 5 can be evaluated efficiently by leveraging sparse algorithms depending on which of Σ and Σ−1 has the better sparsity pattern. For example, when Σ is block-diagonal, both matrices are sparse. When Σ is a matrix with first-order autoregressive [AR(1)] correlation structure [example 4 in Liang and Zeger (1986)], only Σ−1 is sparse. However, in many cases, Σ−1 might not be as sparse as Σ. In the worst case where the preconditioned conjugate gradient (PCG) method is used, the time complexity of solving S1 is linear with respect to the number of nonzero entries in Σ. The major computational bottleneck in Equation 5 is the evaluation of (IQTS1Q)1, which is at least O(N13) where N1 is the number of noncensored subjects. So we simply ignore it in COXMEG-sparse, and propose to use instead the approximation

V221=S1+S1QQTS1. (7)

Substituting V221 by V221 in Equation 4 gives an approximated NR method. Similarly, no explicit evaluation of QQT is needed when using Equation 7. Hence, once an sparse Cholesky decomposition (SCD) of S is obtained, the evaluation of the second term in Equation 7 mostly requires only one more forward–backward substitution. In the special case where Σ is block-diagonal, the computational burden for solving V22 is O(F2N).

We show in File S1 Materials A.1 that V221 is a first-order approximation of V221. In fact, using Neumann series, V221 can be expressed as

V221=S1+n=1(S1QQT)nS1,

from which we see that S1 is a zero-order approximation of V221, equivalent to completely ignoring QQT from the information matrix V. This amounts to adding QQT, an SPSD matrix, to the information matrix. We show in File S1 Materials A.2 that COXMEG-sparse satisfies the conditions for a family of inexact Newton methods (Dembo et al. 1982; Eisenstat and Walker 1996), suggesting that it converges locally as long as a forcing term is bounded away from one (Dembo et al. 1982; Kelley 1995). Nevertheless, it no longer has the quadratic convergence rate generally (i.e., it converges in more iterations than the NR method). We prove the local convergence in more detail in File S1 Materials A.2, but in short, the convergence rate depends on multiple properties including the variance component τ and the spectrum density of Σ. In terms of computational efficiency, a higher-order approximation speeds up the convergence, which, on the other hand, requires one more forward–backward substitution. Therefore, there is a trade-off between the convergence rate and the number of forward–backward substitutions in selecting an optimal order of approximation. We propose the first-order approximation V221 because our simulation and real studies showed that the first-order approximation is often adequate to achieve nearly optimal performance in many practical applications to family-based GWAS (Figure S4 and Materials A.2 in File S1). In the cases where the algorithm converges slowly with V221, which often happens for a large τ or condition number of Σ, a higher-order approximation should be used for better computational performance (see Figure 1B and additional results in File S1 Materials A.2). Finally, replacing V221 by V221in V11 gives the following approximated variance of β^:

V11=(XTHXXTHV221HX)1.

We found that the relative error between V11 and V11 was < 0.1% under most settings considered in our simulation, except for τ=0.5 (Figure S8 in File S1), suggesting that the first-order approximation V221 is sufficiently accurate for testing β in general applications in GWAS when the variance component τ is not overly large. Otherwise, the second-order approximation should be used. The P-values are then calculated based on β^ and V11 using a Wald test.

Figure 1.

Figure 1

Comparison of computational time using COXMEG-sparse, coxph, and coxme under different sample sizes and family (block) sizes for a block-diagonal relatedness matrix. The evaluation was performed on a Linux Xeon E5-2680v2 machine. (A) Computational time [log10(second)] of analyzing 10,000 SNPs. Computational time of COXMEG-sparse (using the first-order approximation) included estimation of the variance component and SNP effects. A shared Gaussian frailty model was used in coxph. (B) Computational time (seconds) of estimating the variance component. The first-order and fourth-order approximations in the inexact Newton method in COXMEG-sparse were evaluated. CPU, central processing unit.

As aforementioned, the major step in evaluating V221 is to compute S1. Either term in Equation 6 can be used depending on which matrix, Σ or Σ−1, is sparser. The performance of SCD algorithms is also highly affected by fill-ins. In our simulations, we compared three sparse algorithms in R to solve a block-diagonal S: (i) the LDLT SCD in RcppEigen, (ii) the sparse PCG algorithm using diagonal elements in RcppEigen, and (iii) the SCD in the Matrix R package. We found in our study that the LDLT SCD in RcppEigen yielded the best performance under most scenarios, especially for very sparse Σ (Figure S5 in File S1). On the other hand, the PCG method showed computational advantages when Σ is very large or less sparse (Figure S5 in File S1). The PCG method is also much faster than SCD for a sparse Σ whose inverse is dense.

Score test for dense relatedness matrices

When both the matrices Σ and Σ−1 are dense, COXMEG-sparse can no longer leverage the sparse algorithms to speed up the evaluation of S1. Instead of the Cholesky decomposition used in coxme, we implemented a PCG method (Lidauer et al. 1999; Tsuruta et al. 2001) (termed as COXMEG-pcg) to compute each step in the NR algorithm. This alteration reduces the time complexity from O(N3pL1) to O(N2pL1). However, for a large-scale GWAS, the PCG method might still be overly time-consuming. Therefore, we further propose COXMEG-score, a fast score test for the situation in which Σ and Σ−1 cannot be approximated by a sparse matrix. In COXMEG-score, we first fit the full CMEM under the null model, in which only covariates are included in X. After the null model is fitted, we obtain the estimates of the score function of τ, H, and the inverse of the information matrix under the null hypothesis, denoted by s0, H0, and V01, respectively, where s0=dQ1. Denote by z a vector of the genotypes of a SNP to be tested. The score test statistic proposed in COXMEG-score is given by

tscore=(zTs0)2Vs,

where Vs is the variance of the score in the numerator given by

Vs=zTH0zzTH0(XIN×N)V01(XIN×N)TH0z.

Because the h-likelihood l1 can be treated as an ordinary likelihood once the variance component is given, tscore approximately follows a χ12 distribution with 1 d.f. under the null hypothesis that the SNP has no effect on the HR. Note that the computation of Vs involves only one matrix–vector multiplication and the rest can be performed in O(N). So the computational burden of the score test alone is O(N2p) regardless of the structure of the relatedness matrix. This time complexity is the same as the score tests proposed for generalized linear models (Chen et al. 2016, 2019). Our simulation results suggest that COXMEG-score is statistically as powerful as the Wald tests like COXMEG-sparse and coxme (Figures S1 and S2).

Estimation of variance components

We have described above the estimation and testing algorithms when the variance component τ is given. Here, we describe the estimation of τ. In COXMEG, τ is estimated only once under the null model, and then is used for testing all SNPs. The estimation of τ can be performed by calculating the first and second derivatives of l2 given in Equation 3, and iterating between l1 and l2 using the NR method, as suggested in Ripatti and Palmgren (2000) and Ha et al. (2011). We found that this EM-like procedure practically converges rather slowly for estimating τ. Instead, we follow the strategy implemented in coxme. Because τ is often a scalar or in a low-dimensional space, we treat l2 as a marginal likelihood of τ and directly use a derivative-free optimization. We found in our simulation study that this strategy converged faster. Thus, after l1 is optimized, the only additional computation in each iteration for optimizing l2 is the evaluation of the log-determinant of J(l1,τ)=V22. As V22 is SPD, a direct evaluation would require a Cholesky decomposition taking O(13N3L2), where L2 is the number of iterations in optimizing l2, and is often between 10 and 50 depending on the convergence. We evaluated the performance of the following three derivative-free algorithms: (i) Bobyqa (Powell 2009) implemented in the nloptr R package (Ypma 2014), (iii) Brent’s method, and (iii) the Nelder–Mead method (Nelder and Mead 1965) in the optim R function. Bobyqa was the fastest in most cases, and Brent’s method was slightly slower than Bobyqa in our simulation study.

When N is very large (e.g., > 20,000), the evaluation of the log-determinant becomes a bottleneck in COXMEG-sparse and COXMEG-score, and even intractable. In addition, a second-stage sensitivity analysis using a variant-specific variance component may be performed for those SNPs passing a predefined threshold of significance in the first round. When the number of significant SNPs is large, this second-stage analysis would be computationally intensive. So, we propose fast approximation methods to estimate τ for sparse and dense GRMs, respectively. Specifically, for a sparse GRM, we propose the following marginal likelihood instead of l2.

l2=l1(θ^)12log|W(θ^)B(θ^)diag{Q(θ^)Q(θ^)T}+Σ1τ2π|.

Compared to l2, the only difference is that l2 uses a diagonal approximation of QQT. When Σ or Σ−1 is sparse, the determinant can often be efficiently computed using SCD depending on its sparsity pattern. We found in our simulation study that the estimated τ using l2 is very close to that from coxme, which uses a block-diagonal approximation of QQT, with a relative difference < 1% in most cases (Figure S6 File S1).

When the relatedness matrix is dense or its sparsity pattern results in too many fill-ins, the above diagonal approximation does not help. Instead, we propose a method based on SLQ (Bai et al. 1996; Golub and Meurant 2009; Ubaru et al. 2017) to approximate the log-determinant, which reduces the time complexity from cubic to quadratic (O(nmnqN2L2), see File S1 Materials A.4) and achieves accurate estimation (Figure S7 in File S1). The basic idea is to first express the log-determinant using a Monte Carlo estimator of the trace of the matrix logarithm. Each term in the Monte Carlo estimator can be written as a Riemann–Stieltjes integral approximated by the Gaussian quadrature rule. The nodes and weights in the Gaussian quadrature can be elegantly computed by the eigenvalues and first elements in the eigenvectors of the tridiagonal matrix obtained from the Lanczos algorithm. We found that the accuracy of the SLQ method for approximating a log-determinant was generally higher by one order of magnitude than that using Chebyshev orthogonal polynomials (Pace and LeSage 2004; Han et al. 2016), both of which were more accurate than Martin’s Taylor expansion (Martin 1992; Barry and Kelley Pace 1999), consistent with previous reports (Han et al. 2016; Ubaru et al. 2017). More details about the SLQ approximation are described in File S1 Materials A.4.

Handling a positive semidefinite relatedness matrix

When the relatedness matrix Σ is SPSD, l1 becomes invalid because Σ−1 does not exist. So, we instead propose the following GPPL

l1=i=1n{di(ηilog(j(i)exp(ηj))}12τγTΣγ12Nlog(τ), (8)

where Σ is the pseudoinverse defined as Σ=UΛUT, where U is the eigenvectors of Σ and Λ is the pseudoinverse of the diagonal matrix consisting of the eigenvalues of Σ. In addition, N is replaced by N, the number of nonzero eigenvalues of Σ. The rationale of Equation 8 is similar to that in Kauermann and Wegener (2011) and He et al. (2016b), in which the penalty is imposed on a subspace of the coefficients. We show in File S1 Materials A.3 that when replacing l1 with the GPPL l1 for an SPSD Σ, the NR algorithms in COXMEG-sparse and COXMEG-score still work, except that the sum of elements in each row of Σ is zero (i.e., an SPSD matrix has eigenvector 1 with eigenvalue 0). In such a case, V22 becomes noninvertible. In addition, the last transformation in Equation 6 is valid only for Σ being SPD. So, if Σ but not Σ1is sparse, it would be better to make Σ SPD, as discussed in the following section The above GPPL ignores penalties with respect to zero eigenvalues in Λ, and thus it can be used when the SPSD matrix originates from biologically meaningless singularities (e.g., singularity originating from calculating a covariance matrix). However, when the singularity is biologically meaningful (e.g., for MZ twins), such zero eigenvalues would strongly penalize the difference between random effects within MZ twins and they should not be ignored. In this case, we propose a modified PPL by replacing all zero eigenvalues in Λ with a small positive number (e.g., 10−6) and keeping the other eigenvalues unchanged, which makes Σ positive definite, and preserves the strong penalties.

Construction of relatedness matrices for COXMEG

Depending on the purpose of using the CMEMs, the relatedness matrix Σ summarizes the dependence structure between individuals. Such a matrix in the context of GWAS is generally a kinship matrix or a GRM. There exist a variety of approaches defining and calculating these matrices depending on the assumption of the genetic architecture underlying the trait. Here, we focus on two frequently used types of relatedness matrices, a pedigree-based kinship matrix and a SNP-based GRM based on the infinitesimal model (Yang et al. 2011), and discuss how to construct such matrices for COXMEG.

A pedigree-based kinship matrix provides a measure of the total average genetic correlation based on the pedigree information but no genotype information, and it does not account for a specific phenotype. When the individuals in a cohort are more or less ethnically homogeneous, the CMEM with a pedigree-based kinship matrix is used primarily to account for family structure. A pedigree-based kinship matrix [for example, generated by the kinship2 R package (Sinnwell et al. 2014)], is generally block-diagonal SPD in which individuals from different families are regarded as independent. One exception is MZ twins, which lead to an SPSD kinship matrix. COXMEG-sparse is suitable for this situation because such matrices are sparse when the family size is not at the same magnitude of the sample size.

In contrast, a SNP-based GRM is more flexible in terms of genomic regions, more precise in terms of measuring genetic correlation, and can detect long-distance cryptic relatedness. For example, a GCTA GRM introduced in Yang et al. (2011) is calculated by

jk=1pi=1p(Gij2fi)(Gik2fi)2fi(1fi), (9)

where fi is the frequency of the coding allele of SNP i, and Gij is the dosage of SNP i and individual j. This type of GRM is a covariance matrix, which is generally SPSD even if p>N because 1 d.f. is lost for estimating fi from the same sample. It always has eigenvector 1 with eigenvalue 0. Note that an estimated GRM may have negative eigenvalues if the SNPs used to construct it have many missing values, which should be avoided.

For an ethnically homogeneous cohort, a GRM can often be approximated by a sparse matrix using thresholding because most elements outside of a family are almost zero. Various methods can be used to approximate the GRM, such as hard thresholding (Bickel and Levina 2008), and soft thresholding using graphical lasso (Friedman et al. 2008). Our analysis of the NIA-LOADFS data showed that simple hard thresholding, which has literally little computational cost, using a small cutoff (e.g., 0.02) could induce sparsity and often (but not always) preserve SPD (Bickel and Levina 2008). Sparsity approximation with guaranteed positive definite is also available [e.g., Rothman (2012); implemented in the PDSCE R package] albeit computationally intensive compared to a simple hard thresholding.

Simulation study

We performed simulation studies to investigate the computational and estimation performance under R 3.5. The simulation studies were conducted on a Linux server with an Intel Xeon E5-2680v2 CPU, and Windows machines with an Intel Core i5-6300HQ or i7-8700 CPU. We considered block-diagonal correlation matrices, banded AR(1) correlation matrices, and sparse GRMs by thresholding a GRM estimated from the genotype data in the NIA-LOADFS sample. All blocks in the block-diagonal correlation matrices shared the same block size and a single correlation coefficient. We evaluated multiple block sizes (5, 20, 100, and 500) and correlation coefficients (0.1, 0.5, and 0.9). The elements in the banded AR(1) correlation matrices were generated by ρ|ij| and then set zeros for all ρ|ij|<0.01, in which we evaluated ρ=0.3,0.6,0.9,. For each relatedness matrix Σ, we generated random effects based on γN(0,τΣ), in which we considered τ=0.01,0.05,0.1,0.2,0.5 to reflect different magnitudes of heritability. Genotype dosages for each SNP were generated from a binomial distribution GB(2,p), in which p, the minor allele frequency (MAF), ranged between 0.05 and 0.5. A disease time variable was generated using the following exponential distribution YExp(exp(μ+G+γ)), in which we assumed a constant baseline hazard with μ=2. For each individual, we generated a censoring time variable using the following exponential distribution CExp(c), in which we considered c=0.02,0.05,0.1,0.15 to reflect scenarios from heavy censoring c=0.15 to almost no censoring c=0.02. The time-to-event variable was generated by

T=min(Y,C),D={1,YC0,Y>C.

Data availability

This manuscript was prepared using limited access data sets obtained through dbGaP [accession numbers: phs000168.v2.p2 (LOADFS)]. Code used to generate the simulated age-at-onset data can be found in the vignettes of the coxmeg R package at https://cran.r-project.org/web/packages/coxmeg/. Supplemental material available at figshare: https://doi.org/10.25386/genetics.11862486.

Results

Overview of COXMEG

Depending on whether the relatedness matrix is sparse, we propose in COXMEG two algorithms accounting for the dependence in subjects, COXMEG-sparse for a sparse relatedness matrix, and COXMEG-score for a dense relatedness matrix. Two primary purposes to adjust for dependence structure in the context of GWAS are: (i) correcting for population stratification and (ii) accounting for family structure. We consider these two purposes separately because the rationale, and thus the relatedness matrices, used in the two situations are different. In the first case, the major concern is to avoid identifying false-positive SNPs correlated with race. This is because population stratification results in many SNPs that can almost tag the race information in the sample, and consequently are highly correlated with race. If race or relevant variables, such as principal components (PCs), are not included in the model, the estimated effects of these SNPs should be attributed to the effect of race, leading to a large number of false positives. However, in the second case, the major benefit of explicitly specifying the sample dependence is to control the overall false-positive rate (FPR) and increase statistical power due to family structure. Previous studies have shown that ignoring dependence in the Cox model leads to overestimation of the model parameters (Wei et al. 1989). We show in our analysis that COXMEG-score is well suited for the first case, and COXMEG-sparse is dedicated to the second case.

Sparse approximation of the relatedness matrix is attractive, particularly in large-scale biobank data where the full relatedness matrix is too large to be squeezed into computer memory. When a selected study population is almost ethnically homogeneous and a CMEM is adopted primarily to correct for family structure, for example, in family-based GWAS (He et al. 2016a), the relatedness matrix is often sparse or can be approximated by a sparse matrix. For a family-based study, a widely used relatedness matrix is a pedigree-based kinship matrix, which is generally block-diagonal and sparse. When a SNP-based GRM computed from genotypes [e.g., proposed in Yang et al. (2011)] is used, we notice that it often has a large proportion of near-zero elements. We show in the following analysis that, compared to a fully dense GRM, it often produces comparable results to use a sparse approximation of the GRM by, e.g., thresholding the whole matrix or extracting its entries corresponding to family members. Using a sparse approximation dramatically improves both computational speed and memory usage. Compared to coxme, which is only optimized for a block-diagonal pedigree-based kinship matrix, COXMEG-sparse achieves its speed gain by introducing a novel approximation of the Hessian matrix in the optimization procedure. As we will show in this work, such an approximation cannot only accommodate an arbitrary sparse relatedness matrix, but is also practically much faster than coxme for a block-diagonal relatedness matrix. In contrast, COXMEG-score is dedicated to dense relatedness matrices for which COXMEG-sparse cannot achieve acceleration. The core idea in COXMEG-score is a fast score test for genome-wide association analysis and an efficient stochastic algorithm for estimating the variance component.

COXMEG-sparse performs faster than coxme and coxph for family-based studies

We first investigated the computational efficiency of COXMEG-sparse for block-diagonal relatedness matrices. We compared it with the coxme R package (Therneau 2015), which is widely used to fit a CMEM and highly optimized for block-diagonal matrices. We also included in the comparison the function “coxph” in the survival R package (Therneau and Lumley 2015) with a shared Gaussian frailty model. A shared frailty model (Vaupel et al. 1979) was used as a fast, but generally less accurate, alternative to coxme in some studies (He et al. 2016c). It assumes that all members in one family share the same genetic component, which might not be a realistic presumption in many situations. In both coxme and coxph, block-diagonal approximation was used. Figure 1A showed that it took COXMEG-sparse ∼60 sec (using one CPU thread) to analyze 10,000 SNPs for a small sample of 2000 individuals with a family (or block) size of five on a Linux server with a Xeon E5-2680v2 CPU. In contrast, under the same computing environment for the same task, it took coxme ∼200 min, ∼200-fold as much as that of COXMEG-sparse. Moreover, COXMEG-sparse was ∼20-fold faster than coxph under this setting. For block-diagonal relatedness matrices, the computational burden of COXMEG-sparse rose almost linearly with increasing sample size (Figure 1A), consistent with theoretical estimates given in the Methods section. In contrast, the computational cost of coxme and coxph increased more than linearly. When the sample size was up to 10,000, COXMEG-sparse was > 1000-fold and > 100-fold faster than coxme and coxph, respectively (Figure 1A). On the other hand, the computational burden of COXMEG-sparse increased more rapidly than coxme with respect to the family size (Figure 1A). These observations are consistent with the theoretical time complexity of COXMEG-sparse, which is linear and quadratic with respect to sample size and family size, respectively, as derived in the Methods section. Interestingly, the computational time of coxme climbed modestly with increasing family size. This is probably because some heavier computation in coxme masks the time complexity of the Cholesky decomposition. On the contrary, the computational burden of coxph dropped rapidly with increasing family sizes. This is because coxph depends on the number of families, which decreased when the family size increased and the total sample size was kept unchanged.

To further understand how COXMEG-sparse achieved the speed gain, we evaluated its performance in estimating the variance component for one SNP. We found that on average ∼20 iterations for the marginal likelihood l2, described in the Methods section, are needed for COXMEG-sparse to estimate the variance component. Because COXMEG-sparse does not estimate a variant-specific variance component and thus requires only one iteration for each SNP, this gives COXMEG-sparse an ∼20-fold boost of speed. Next, when the family size was five and the sample size was 5000, COXMEG-sparse was ∼18-fold faster than coxme for estimating the variance component (Figure 1B, left). Multiplying these two factors gives COXMEG-sparse a 360-fold boost under this setting, consistent with the benchmarking in Figure 1A. The computational cost for estimating the variance component of coxme and coxph soared with increasing sample size (Figure 1B), justifying their poor performance under large sample size shown in Figure 1A. We found that the performance also depended on the operating systems. On a Windows 10 machine with a Core i5-6300HQ CPU, COXMEG-sparse was only ∼10-fold faster than coxme for estimating the variance component under the same setting (Figure S1 in File S1). We found that this difference is because RcppEigen, the program used in COXMEG for the Cholesky decomposition of a block-diagonal matrix, was much (∼3×) slower than bdsmatrix used in coxme on Windows, while they had comparable performance on Linux. This suggests that there is much room for COXMEG-sparse to optimize for Windows by using faster Cholesky decomposition programs.

On the other hand, when the family size was 100, the speed advantage of COXMEG-sparse dropped (Figure 1B, right). One cause is that the relatedness matrix with larger block size has a large condition number, and thus the first-order approximation converges slowly, as described in the Methods section. Indeed, we found that using a higher-order approximation in COXMEG-sparse reduced computational cost by ∼40% (Figure 1B). Another cause is that the performance of RcppEigen used in COXMEG-sparse dropped rapidly with decreasing sparsity (Figure S5 in File S1). Fortunately, such relatedness matrices are rarely encountered in real data analysis. For example, the average family size in the NIA-LOADFS sample is 2.16.

COXMEG-sparse accommodates general sparse matrices

Next, we examined the performance of COXMEG-sparse for general sparse relatedness matrices. We assessed a class of sparse GRMs obtained by hard-thresholding small values of a dense GRM so that individuals with long genetic distance were treated as independent. This type of GRM is of interest to us because we observed that most elements in a GRM based on an infinitesimal model are almost zero in a family-based study consisting of an ethnically homogeneous cohort. For instance, when we studied only the non-Hispanic whites in the NIA-LOADFS sample, and set all elements below a prespecified small threshold (e.g., 0.02) to zero, the resulting sparse GRM had on average < 1% nonzero elements per row, and most of them were from the family members. This type of GRM is not block-diagonal because it has sporadic nonzero elements. As shown in Figure 2A, COXMEG-sparse was dramatically more efficient to deal with this sort of approximated GRM than coxme, which was extremely slow for a matrix that is not block-diagonal. By thresholding a GRM including 4376 subjects from the NIA-LOADFS sample using different cutoffs, we examined sparse GRMs having 0.25–5% nonzero elements. Figure 2A showed that it took coxme > 3 years to estimate 10,000 SNPs, while it took COXMEG-sparse only 5 min when 0.25% of the elements were nonzero. The computational time decreased almost linearly with increasing sparsity of the relatedness matrix (Figure 2A), consistent with the theoretical time complexity derived in the Methods section.

Figure 2.

Figure 2

Comparison of computational time [log10(second)] of analyzing 10,000 SNPs using COXMEG-sparse and coxme under different sparsity patterns and sample sizes. The computational time of COXMEG-sparse included estimation of the variance component and SNP effects. The computational time of coxme was evaluated for five SNPs, and then projected by multiplying that by 2000. (A) CPU time for sparse relatedness matrices (0.25–5% nonzero elements) obtained by hard-thresholding a GRM constructed using the genotypes of 4376 Caucasian subjects from the NIA-LOADFS; The evaluation was performed on a Windows Core i5-6300HQ machine. (B) CPU time for sparse relatedness matrices obtained by banding an AR(1) correlation matrix. The CPU time of coxme under the sample size of 10,000 was projected based on O(N3). The evaluation was performed on a Linux Xeon E5-2680v2 machine. AR(1), first-order autoregressive; CPU, central processing unit; GRM, genetic relationship matrix; NIA-LOADFS, Late-Onset Alzheimer’s Disease Family Study from the National Institute on Aging.

To evaluate the performance of COXMEG-sparse for other correlation patterns, we conducted a simulation study using banded AR(1) matrices, which cannot be factorized as a block-diagonal matrix. Similarly, we found that COXMEG-sparse was substantially computationally faster than coxme (Figure 2B). The computational time increased almost linearly with respect to the sample size, much slower than coxme (Figure 2B). We did not include coxph in the above analyses because a shared frailty component is not clearly defined for a general sparse covariance matrix.

Efficient testing using COXMEG-score for dense relatedness matrices

When there is a necessity to correct for both family structure and population stratification in a family-based multiethnic study, the GRM is often dense and might not be directly approximated by a sparse matrix using naive thresholding methods (e.g., hard thresholding). For a fully dense GRM, the estimation algorithm would be extremely slow for coxme, which may even take 10 min to analyze one SNP with a small sample size of 2000. For such dense GRMs, we implemented an estimation method based on PCG, termed COXMEG-pcg, which substantially reduced computational intensity compared to the Cholesky decomposition used in coxme. As shown in Figure 3A, the computational time of COXMEG-pcg dropped drastically compared to coxme in practice. However, the efficiency of COXMEG-pcg might still not be acceptable for a large sample size when speed is a leading concern. Thus, we propose COXMEG-score, a much faster method, by conducting a score test to obtain the P-values. As compared in Figure 3A, COXMEG-score completed the analysis for 10,000 SNPs of a sample consisting of 1000 individuals within 15 sec, computationally much more efficient than COXMEG-pcg and coxme, which took 1000 and 10,000 sec, respectively. When the sample size increased up to 10,000, it took COXMEG-score and COXMEG-pcg ∼1.4 hr and ∼1.2 days, respectively, while it would take coxme ∼3 years according to our projection.

Figure 3.

Figure 3

(A) Comparison of computational time [log10(second)] of analyzing 10,000 SNPs using COXMEG-score, COXMEG-pcg, and coxme with a dense relatedness matrix of different sample sizes. The computational time of coxme was evaluated for five SNPs, and then projected by multiplying that by 2000. The evaluation was performed on a Windows Core i7-8700 machine. (B) Comparison of computational time [log10(minute)] of estimating the variance component using COXMEG-score with the exact log-determinant, COXMEG-score with an SLQ approximation, and coxme for a dense relatedness matrix of different sample sizes. The CPU time of coxme for a sample size > 5000 was projected based on O(N3). The evaluation was performed on a Linux Xeon E5-2680v2 machine. CPU, central processing unit; SLQ, stochastic Lanczos quadrature.

For a very large cohort (e.g., > 20,000), the estimation of the variance component in COXMEG-score becomes a computational bottleneck because the estimation requires an evaluation of a log-determinant of a large matrix, the time complexity of which is cubic. Therefore, for large sample size, we introduce a randomized method using SLQ (Bai et al. 1996; Golub and Meurant 2009; Ubaru et al. 2017) to approximate the log-determinant, reducing the time complexity from cubic to quadratic. As demonstrated in Figure 3B, COXMEG-score with SLQ employing 100 Monte Carlo samples and 10 steps of the Lanczos algorithm markedly lowered the computational burden for a cohort with ≥ 10,000 individuals, and maintained control of the FPR for detecting significant SNPs (File S1 Materials A.4). Our simulation results indicated that the accuracy of the SLQ approximation depended on the structure of the relatedness matrix, and an estimated variance component was particularly accurate for a large relatedness matrix with weak correlations (i.e., more accurate for a larger sample size and a smaller condition number of the relatedness matrix) (File S1 Materials A.4).

COXMEG controls FPR in real data analysis

In addition to the speed gain, we demonstrated the validity of COXMEG by assessing its statistical power and FPR through a simulation study. We found that COXMEG, including COXMEG-score and COXMEG-sparse, had nearly the same statistical power and FPR as coxme, while coxph with a shared frailty had inflated FPR in certain situations of family-based data sets (Figures S2 and S3). Detailed results about the evaluation of the statistical power and FPR using simulated data can be found in File S1 Materials A.5.

Given the validated performance of COXMEG, we then used it to carry out GWAS of age-at-onset of AD for the NIA-LOADFS data set to detect SNPs associated with the hazards of late-onset AD. We first focused on a homogeneous subpopulation by selecting 3456 non-Hispanic white subjects with available AD status from the cohort. For the subjects with missing information about the age-at-onset of AD, we treated them as censored and set the age-at-onset as the age at recruitment. The genotypes were imputed using the Michigan Imputation Server with a reference panel from HRC (Version r1.1 2016) (Das et al. 2016). A total of ∼5.42 million SNPs were interrogated in the analyses after removing SNPs having MAF < 0.05. Sex was included as a covariate in all analyses.

We investigated the performance of COXMEG in the following three aspects. First, we evaluated the performance of COXMEG-sparse, COXMEG-score, coxme, and coxph (with a shared Gaussian frailty corresponding to the families) in terms of statistical power and control of the genomic inflation factor. Second, we investigated the potential influence on the statistical power of using different approximated GRMs in COXMEG-sparse. Third, we investigated the performance of using a uniform variance component estimated from the null model for analyzing all SNPs. More specifically, we assumed in coxph that all members in each family shared the same frailty. In the other three methods, we computed a GRM using all qualified genotypes (filtered by MAF > 0.05 and missing rate < 1e−4) from a 600,000 Illumina Human610_Quadv1_B genotype array. The GRM was estimated based on the GCTA model (Yang et al. 2011) (also see Equation 9 in the Methods section) and the diagonal elements were further scaled to one using the SNPRelate R package (Zheng et al. 2012). For coxme, we used a block-diagonal GRM by first defining diagonal blocks according to the families, and then setting all the elements outside the block-diagonal parts to zero. We found that eight pairs of individuals were in complete correlation, indicating that they were MZ twins. As coxme does not handle such positive semidefinite matrices, we set those off-diagonal elements corresponding to the MZ twins to 0.99 to make the GRM positive definite. For COXMEG-sparse, we tested two GRMs: (i) the same block-diagonal GRM as used in coxme and (ii) a sparse GRM by setting all elements in the GRM that had an absolute value < 0.03 to zero. For COXMEG-score, we tested three GRMs, the original GRM, the block-diagonal GRM, and the sparse GRM. In terms of the computational efficiency, the analyses using COXMEG-sparse and COXMEG-score were completed in ∼1 day using one CPU thread, while it took coxph and coxme > 3 weeks to complete the analysis. Surprisingly, we found that coxph took longer than coxme. Note that the average family size in NIA-LOADFS is only 2.16. Combined with the trend shown in the simulation study (Figure 1), these observations suggested that coxph can be less efficient than coxme when average family size is small.

Our results further showed that COXMEG-sparse, COXMEG-score, coxme using the same block-diagonal GRM, and coxph had almost the same genomic inflation factor (λ = 1.0214, 1.0215, 1.024, and 1.0008, respectively) (Figure 4A), suggesting that all four methods were comparable in terms of controlling false positives for this analysis. Using a family-based block-diagonal GRM or assuming a shared frailty did not inflate the significance of P-values in this study. We observed that coxph produced slightly less significant P-values for top SNPs than the other methods. On the other hand, we observed that the statistical power of COXMEG-score to detect the APOE ε4 variant (rs429358) was much higher (more than four orders of magnitude) than that of the other three methods (Figure 4A). We observed little difference in the P-values for detecting top SNPs between the original GRM, the approximated sparse GRM, and the block-diagonal GRM (Figure 4B). In addition, they yielded almost the same genomic inflation factor (λ = 1.0014, 1.0215, and 1.0173, respectively), suggesting that the sparse approximations were practically viable for ethnically homogeneous family-based cohorts. Finally, we evaluated potential consequences of using a fixed variance component estimated from the null model instead of a variant-specific variance component for top SNPs. We estimated a variant-specific variance component, and recomputed the effect size and its P-value for each SNP having a P-value < 1e−5. We found that using the fixed variance component did produce slightly less significant P-values for those SNPs outside the APOE region (Figure 4C), while no evident difference was observed for those SNPs in the APOE region (Figure 4D). However, the variance component estimated from the null model including only sex was 0.26, while it rose to 0.38 when the null model included both sex and the APOE ε4 variant.

Figure 4.

Figure 4

(A) Comparison of the P-values from GWAS of age-at-onset of AD with a LOADFS non-Hispanic white sample using four different methods (coxph with a shared frailty, coxme, COXMEG-sparse, and COXMEG-score). A block-diagonal GRM capturing family structure was used for the latter three methods. Genomic inflation factors (reported by gcontrol2 in the gap R package) are shown on the top left corner. (B) Comparison of the P-values from GWAS of age-at-onset of AD with a LOADFS non-Hispanic white sample using three different GRMs [a raw GCTA GRM (original), a block-diagonal matrix obtained by extracting the block-diagonal part of the GCTA GRM capturing family structure, and a sparse matrix by setting the small entries (< 0.03) in the GCTA GRM to zero]. Genomic inflation factors are shown on the top left corner. (C and D) Comparison of P-values using a fixed variance component estimated from the null model, and a variant-specific variance component for SNPs having P < 1e−5 outside chromosome 19 (C) and on chr 19 (D). AD, Alzheimer’s disease; chr, chromosome; GRM, genetic relationship matrix; LOADFS, Late-Onset Alzheimer’s Disease Family Study.

COXMEG is statistically powerful and identifies novel SNPs associated with age-at-onset of AD

The results from the non-Hispanic whites show that the APOE ε4 variant was the strongest SNP, associated with increasing hazards of AD with a P-value of 1e−101 (Figure 5A), which was more significant than the P-value (P = 3.3e−96) reported from a previous study of age-at-onset of AD using a much larger sample size (9162 Caucasian participants) (Naj et al. 2014), in which the NIA-LOADFS sample is a subset. The previous study used log-transformed age-at-onset as the phenotype and generalized estimating equations for estimation (Naj et al. 2014). This might suggest that the CMEM is a more appropriate and powerful model than a marginal model for studying AD. In addition, we found that rs36051450, located on 11q24.1 and ∼25-kb upstream from GRAMD1B, yielded a P-value of 8e−8, narrowly below the genome-wide significance (Figure 5A). GRAMD1B is a cholesterol transporter mediating nonvesicular transport of cholesterol from the plasma membrane to the endoplasmic reticulum. In contrast, a logistic mixed-effects model produced a P-value of only 2.85e−05 for rs36051450. We also examined 23 common SNPs identified by a recent large-scale meta-analysis associated with incidence of AD (Jansen et al. 2019). We found that eight SNPs (in CR1, BIN1, CLNK, HLA-DRB1, MS4A6A, PICALM, ABI3, and CASS4) were nominally significantly (P < 0.05) associated with the hazards of AD (Table S1 in File S1), in which rs4663105 in the BIN1 region showed the strongest significance (P = 2.96e−04).

Figure 5.

Figure 5

Results of GWAS of age-at-onset of AD using (A) a non-Hispanic white sample (3456 subjects), and (B) a sample comprising non-Hispanic whites (3456 subjects) and African Americans (287 subjects) in LOADFS. GWAS were carried out using COXMEG-score with a GCTA GRM estimated from an original genotype microarray. The local plots show the P-values of significant SNPs in the APOE region and the local region of rs36051450. The red horizontal line is the genome-wide significance threshold (5e−8) and the blue horizontal line is the suggestive significance threshold (1e−5). (C) Survival curves of AD in three genotype groups of rs36051450 for men and women separately in the NIA-LOADFS sample. (D) Comparison of the P-values using three different methods (coxph, coxph with five top PCs estimated from the GCTA GRM as covariates, and COXMEG-score with the GCTA GRM). Genomic inflation factors (reported by gcontrol2 in the gap R package) are shown on the top left corner. AD, Alzheimer’s disease; chr, chromosome; GRM, genetic relationship matrix; LOADFS, Late-Onset Alzheimer’s Disease Family Study; NIA, National Institute on Aging; PC, principal component.

We further applied COXMEG-score to a sample consisting of 3456 non-Hispanic whites and 287 African Americans in NIA-LOADFS. The genotypes were imputed using the Michigan Imputation Server with the reference panel from 1000 Genomes Phase 3 (Version 5) (Das et al. 2016). A total of ∼20 million SNPs were included in the GWAS of age-at-onset of AD after removing SNPs having MAF < 0.05. A GRM was constructed using the same procedure as in the previous analyses. When applying COXMEG-score, we included sex as a covariate. To examine the consequence without adjusting for the dependence between subjects, we performed two additional GWAS using coxph without frailty. In the first analysis, we included only sex as a covariate. In the second analysis, we included sex and the top five PCs computed from the GRM to account for population stratification. We did not include coxme for comparison because it was not able to finish the analysis with a dense GRM in a reasonable time.

The top SNP identified by COXMEG-score was again the APOE ε4 variant rs429358 with a P-value of 1e−91 (Figure 5B), slightly less significant than that obtained from the non-Hispanic white sample alone, which probably indicates that the effect of APOE ε4 is lower in African Americans than non-Hispanic whites. Outside the APOE region, we identified the novel SNP rs36051450 (P = 2e−9) near GRAMD1B, the P-value of which was more significant than that from the non-Hispanic whites. We found that the minor allele of rs36051450, with an allele frequency of ∼18% in the sample, had a protective effect on AD, significantly reducing the hazards in both male and female individuals (Figure 5C). The effect in males was less significant than that in females probably because of the smaller sample size of males. Interestingly, rs735665, a nearby SNP in high linkage disequilibrium (r2 = 0.94) with rs36051450, is significantly (P-value = 4e−9–4e−39) associated with chronic lymphocytic leukemia (CLL) and follicular lymphoma (Di Bernardo et al. 2008; Conde et al. 2010; Slager et al. 2012; Berndt et al. 2013; Speedy et al. 2014). In contrast, the minor allele of rs735665 increases the risk of CLL (odds ratio = 1.45–1.81), which suggests a potential antagonistic pleiotropy of these SNPs.

Given that the censoring proportion is 55%, the heritability of the risk of AD estimated from the variance component obtained from the CMEM in the NIA-LOADFS sample was 0.43 based on Equation 10 in the Discussion section. The genomic inflation factors were 1.014, 1.063, and 1.065 for COXMEG-score, coxph, and coxph with PCs, respectively (Figure 5D), suggesting that the FPR was inflated when not accounting for the family structure. Using PCs to account for population stratification slightly improved the FPR. We observed that COXMEG-score yielded more significant P-values for the top SNPs than coxph with PCs (Figure 5D), suggesting that COXMEG-score achieved higher statistical power to detect SNPs in the APOE region.

Discussion

Age-at-onset is one of the most common and fundamental traits in genetic epidemiology. When the underlying hazard function is unknown, the Cox model (Cox 1972) is a prevalent routine for analyzing age-at-onset traits as it is more flexible than a parametric hazard model (e.g., Weibull). When subjects are correlated, a CMEM is frequently used to model the dependence. Although the CMEMs have been investigated for decades (McGilchrist and Aisbett 1991; McGilchrist 1993; Ripatti and Palmgren 2000; Ha et al. 2001; Ripatti et al. 2002; Cortiñas Abrahantes and Burzykowski 2005), its estimation algorithm, unfortunately, has no closed form in contrast to a parametric hazard frailty model, in which it is possible to express the marginal likelihood explicitly using the Laplace transform for some frailty families (Yashin et al. 1995; Wienke et al. 2003). Iterative methods are required for estimation, which are computationally intensive, and consequently prohibit its broad application in GWAS. Therneau (2003, 2015) optimized the estimation algorithm specifically for block-diagonal relatedness matrices and achieved remarkable performance, but its use has still been limited in large-scale GWAS.

This work sets out to fill in the gap in the application of the CMEMs in GWAS. It might be of particular interest to many areas in epidemiology, such as age-related diseases (He et al. 2016c; Kulminski et al. 2016), behavior genetics (He et al. 2016a), and perhaps clinical trials, in which age-at-onset is a central concern. Moreover, the application of COXMEG includes, but is not limited to, genetic data. As demonstrated, COXMEG-sparse is computationally fast for general sparse matrices. A time-to-event analysis involving sample dependence structure, e.g., geographically clustered time-to-event data, can also benefit from COXMEG.

We have proposed in COXMEG novel efficient algorithms to facilitate the application of the CMEMs to genome-wide age-at-onset or time-to-event analysis. We demonstrated in our simulation and real data analyses that COXMEG substantially improves computational efficiency, and is applicable for a broader family of relatedness matrices. Specifically, for a block-diagonal relatedness matrix, COXMEG is often hundreds to thousands of times faster than coxme, while for a dense relatedness matrix, COXMEG using a score test can be 100,000-fold faster. In addition, compared to coxme, COXMEG deals with positive semidefinite relatedness matrices, which are common in the context of GWAS. We demonstrated through a study of AD using the NIA-LOADFS sample that these improvements and generalizations in COXMEG allow fast GWAS of age-at-onset for a cohort of thousands, or tens of thousands, of individuals. For example, COXMEG completed the genome-wide analysis within 1 day for 10 million SNPs in NIA-LOADFS (∼3500 subjects) using one CPU thread.

Our results about the risk of AD show that the CMEMs may identify new SNPs that are not significant using a linear or logistic model. Interestingly, for the identified novel SNP rs36051450 near GRAMD1B, the P-value from the logistic model was less significant than that from the CMEM, suggesting that this SNP might contribute more to accelerating the disease progression than the incidence. According to GTEx (GTEx Consortium et al. 2017), GRAMD1B is abundantly expressed in brain tissues. Cell-sorting experiments of human cortex tissue have indicated that GRAMD1B is most strongly expressed in astrocytes and neurons among five major brain cell types (Zhang et al. 2016). Further research could be attempted to replicate this finding in other cohorts.

Analogous to heritability defined in a linear model, a generalized heritability for the CMEMs defined on a logarithmic scale of cumulative hazards can be obtained as proposed in Korsgaard et al. (1999), Yazdi et al. (2002), and Schneider et al. (2005) using, e.g., the following formula,

h=τ^τ^+11pc, (10)

where pc is the proportion of censored subjects and τ^ is the estimated variance component in CMEMs.

Despite the efforts made in COXMEG, it is still challenging, in terms of both computational time and memory, to handle a fully dense relatedness matrix for a cohort of > 50,000 individuals in a transethnic study. As discussed in Gianola et al. (2016), a fast and viable solution to account for sample dependence is to split the relatedness matrix into two parts based on PC analysis. As we showed in our analysis of the NIA-LOADFS sample, using a sparse approximation of the SNP-based GRM in an ethnically homogeneous study of family-based design achieved highly comparable performance and considerably alleviated computational burden. Thus, an efficient solution to a very large transethnic cohort might be to use top PCs from the GRM to correct for population stratification, and a sparse approximation to correct for family structure.

Acknowledgments

The authors thank the editors and four reviewers (including Doug Speed, two anonymous reviewers for GENETICS, and one anonymous reviewer for Bioinformatics) for their careful reading and constructive comments, and Terry Therneau for providing detailed information about the implementation of coxme and constructive discussion. This research was supported by grants from the National Institute on Aging (P01 AG043352, R01 AG047310, and R01 AG061853). The funders had no role in study design, data collection and analysis, decision to publish, or manuscript preparation. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors declare that they have no conflict of interest.

Author contributions: L.H. conceived, developed, and implemented the algorithms. L.H. designed the simulation study, and analyzed the simulated and real data. A.M.K. contributed to acquiring the real data, examining the algorithms, and discussion of final results. L.H. and A.M.K. contributed to the writing of the manuscript.

Footnotes

Supplemental material available at figshare: https://doi.org/10.25386/genetics.11862486.

Communicating editor: E. Eskin

Literature Cited

  1. Bai Z., Fahey G., and Golub G., 1996.  Some large-scale matrix computation problems. J. Comput. Appl. Math. 74: 71–89. 10.1016/0377-0427(96)00018-0 [DOI] [Google Scholar]
  2. Barry R. P., and Kelley Pace R., 1999.  Monte Carlo estimates of the log determinant of large sparse matrices. Linear Algebra Appl. 289: 41–54. 10.1016/S0024-3795(97)10009-X [DOI] [Google Scholar]
  3. Berndt S. I., Skibola C. F., Joseph V., Camp N. J., Nieters A. et al. , 2013.  Genome-wide association study identifies multiple risk loci for chronic lymphocytic leukemia. Nat. Genet. 45: 868–876. 10.1038/ng.2652 [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Bickel P. J., and Levina E., 2008.  Covariance regularization by thresholding. Ann. Stat. 36: 2577–2604. 10.1214/08-AOS600 [DOI] [Google Scholar]
  5. Breslow N., 1974.  Covariance analysis of censored survival data. Biometrics 30: 89–99. 10.2307/2529620 [DOI] [PubMed] [Google Scholar]
  6. Breslow N. E., and Clayton D. G., 1993.  Approximate inference in generalized linear mixed models. J. Am. Stat. Assoc. 88: 9–25. [Google Scholar]
  7. Callas P. W., Pastides H., and Hosmer D. W., 1998.  Empirical comparisons of proportional hazards, poisson, and logistic regression modeling of occupational cohort data. Am. J. Ind. Med. 33: 33–47. [DOI] [PubMed] [Google Scholar]
  8. Chen H., Wang C., Conomos M. P., Stilp A. M., Li Z. et al. , 2016.  Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am. J. Hum. Genet. 98: 653–666. 10.1016/j.ajhg.2016.02.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Chen H., Huffman J. E., Brody J. A., Wang C., Lee S. et al. , 2019.  Efficient variant set mixed model association tests for continuous and binary traits in large-scale whole-genome sequencing studies. Am. J. Hum. Genet. 104: 260–274. 10.1016/j.ajhg.2018.12.012 [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Conde L., Halperin E., Akers N. K., Brown K. M., Smedby K. E. et al. , 2010.  Genome-wide association study of follicular lymphoma identifies a risk locus at 6p21.32. Nat. Genet. 42: 661–664. 10.1038/ng.626 [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Cortiñas Abrahantes J., and Burzykowski T., 2005.  A version of the EM algorithm for proportional hazard model with random effects. Biom. J. J. Math. Methods Biosci. 47: 847–862. [DOI] [PubMed] [Google Scholar]
  12. Cox D. R., 1972.  Regression models and life-tables. J. R. Stat. Soc. Ser. B Methodol. 34: 187–220. [Google Scholar]
  13. Das S., Forer L., Schönherr S., Sidore C., Locke A. E. et al. , 2016.  Next-generation genotype imputation service and methods. Nat. Genet. 48: 1284–1287. 10.1038/ng.3656 [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Dembo R. S., Eisenstat S. C., and Steihaug T., 1982.  Inexact Newton methods. SIAM J. Numer. Anal. 19: 400–408. 10.1137/0719025 [DOI] [Google Scholar]
  15. Di Bernardo M. C., Crowther-Swanepoel D., Broderick P., Webb E., Sellick G. et al. , 2008.  A genome-wide association study identifies six susceptibility loci for chronic lymphocytic leukemia. Nat. Genet. 40: 1204–1210. 10.1038/ng.219 [DOI] [PubMed] [Google Scholar]
  16. Eisenstat S. C., and Walker H. F., 1996.  Choosing the forcing terms in an inexact Newton method. SIAM J. Sci. Comput. 17: 16–32. 10.1137/0917003 [DOI] [Google Scholar]
  17. Finch C. E., and Kulminski A. M., 2019.  The Alzheimer’s disease exposome. Alzheimers Dement. 15: 1123–1132. 10.1016/j.jalz.2019.06.3914 [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Friedman J., Hastie T., and Tibshirani R., 2008.  Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9: 432–441. 10.1093/biostatistics/kxm045 [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Gianola D., Fariello M. I., Naya H., and Schön C.-C., 2016.  Genome-wide association studies with a genomic relationship matrix: a case study with wheat and Arabidopsis. G3 (Bethesda) 6: 3241–3256. 10.1534/g3.116.034256 [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Golub G. H., and Meurant G., 2009.  Matrices, Moments and Quadrature with Applications. Princeton University Press, Princeton, NJ: 10.1515/9781400833887 [DOI] [Google Scholar]
  21. Green M. S., and Symons M. J., 1983.  A comparison of the logistic risk function and the proportional hazards model in prospective epidemiologic studies. J. Chronic Dis. 36: 715–723. 10.1016/0021-9681(83)90165-0 [DOI] [PubMed] [Google Scholar]
  22. GTEx Consortium; Laboratory, Data Analysis & Coordinating Center (LDACC)—Analysis Working Group; Statistical Methods groups—Analysis Working Group; Enhancing GTEx (eGTEx) groups; NIH Common Fund et al. , 2017.  Genetic effects on gene expression across human tissues. Nature 550: 204–213 [corrigenda: Nature 553: 530 (2018)]. 10.1038/nature24277 [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Ha I. D., Lee Y., and Song J., 2001.  Hierarchical likelihood approach for frailty models. Biometrika 88: 233–243. 10.1093/biomet/88.1.233 [DOI] [Google Scholar]
  24. Ha I. D., Sylvester R., Legrand C., and MacKenzie G., 2011.  Frailty modelling for survival data from multi-centre clinical trials. Stat. Med. 30: 2144–2159. 10.1002/sim.4250 [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Hager W., 1989.  Updating the inverse of a matrix. SIAM Rev. 31: 221–239. 10.1137/1031049 [DOI] [Google Scholar]
  26. Han I., Malioutov D., Avron H., and Shin J., 2016.  Approximating the spectral sums of large-scale matrices using Chebyshev approximations. SIAM J. Sci. Comput. 39: A1558–A1585. [Google Scholar]
  27. He L., Pitkäniemi J., Heikkilä K., Chou Y.-L., Madden P. A. F. et al. , 2016a Genome-wide time-to-event analysis on smoking progression stages in a family-based study. Brain Behav. 6: e00462 10.1002/brb3.462 [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. He L., Sillanpää M. J., Silventoinen K., Kaprio J., and Pitkäniemi J., 2016b Estimating modifying effect of age on genetic and environmental variance components in twin models. Genetics 202: 1313–1328. 10.1534/genetics.115.183905 [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. He L., Kernogitski Y., Kulminskaya I., Loika Y., Arbeev K. G. et al. , 2016c Pleiotropic meta-analyses of longitudinal studies discover novel genetic variants associated with age-related diseases. Front. Genet. 7: 179 [corrigenda: Front. Genet. 8: 226 (2018)]. 10.3389/fgene.2016.00179 [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Jansen I. E., Savage J. E., Watanabe K., Bryois J., Williams D. M. et al. , 2019.  Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer’s disease risk. Nat. Genet. 51: 404–413 [corrigenda: Nat. Genet. 52: 354 (2020)]. 10.1038/s41588-018-0311-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Kang H. M., Sul J. H., Service S. K., Zaitlen N. A., Kong S. et al. , 2010.  Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42: 348–354. 10.1038/ng.548 [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Kauermann G., and Wegener M., 2011.  Functional variance estimation using penalized splines with principal component analysis. Stat. Comput. 21: 159–171. 10.1007/s11222-009-9156-5 [DOI] [Google Scholar]
  33. Kelley C. T., 1995.  Iterative Methods for Linear and Nonlinear Equations. Society for Industrial and Applied Mathematics, Philadelphia, PA: 10.1137/1.9781611970944 [DOI] [Google Scholar]
  34. Korsgaard I. R., Andersen A. H., and Jensen J., 1999.  Discussion of heritability of survival traits. Interbull Bull. 21: 31–35. [Google Scholar]
  35. Kulminski A. M., He L., Culminskaya I., Loika Y., Kernogitski Y. et al. , 2016.  Pleiotropic associations of allelic variants in a 2q22 region with risks of major human diseases and mortality. PLoS Genet. 12: e1006314 10.1371/journal.pgen.1006314 [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Lee J. H., Cheng R., Graff-Radford N., Foroud T., Mayeux R.; National Institute on Aging Late-Onset Alzheimer’s Disease Family Study Group , 2008.  Analyses of the National Institute on Aging Late-Onset Alzheimer’s Disease Family Study: implication of additional loci. Arch. Neurol. 65: 1518–1526. 10.1001/archneur.65.11.1518 [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Lee Y., Nelder J. A., and Pawitan Y., 2006.  Generalized Linear Models with Random Effects: Unified Analysis via H-Likelihood. Chapman and Hall/CRC, New York: 10.1201/9781420011340 [DOI] [Google Scholar]
  38. Liang K.-Y., and Zeger S. L., 1986.  Longitudinal data analysis using generalized linear models. Biometrika 73: 13–22. 10.1093/biomet/73.1.13 [DOI] [Google Scholar]
  39. Lidauer M., Strandén I., Mäntysaari E. A., Pösö J., and Kettunen A., 1999.  Solving large test-day models by iteration on data and preconditioned conjugate gradient. J. Dairy Sci. 82: 2788–2796. 10.3168/jds.S0022-0302(99)75536-0 [DOI] [PubMed] [Google Scholar]
  40. Lippert C., Listgarten J., Liu Y., Kadie C. M., Davidson R. I. et al. , 2011.  FaST linear mixed models for genome-wide association studies. Nat. Methods 8: 833–835. 10.1038/nmeth.1681 [DOI] [PubMed] [Google Scholar]
  41. Loh P.-R., Tucker G., Bulik-Sullivan B. K., Vilhjálmsson B. J., Finucane H. K. et al. , 2015.  Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47: 284–290. 10.1038/ng.3190 [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Martin R. J., 1992.  Approximations to the determinant term in gaussian maximum likelihood estimation of some spatial models. Commun. Stat. Theory Methods 22: 189–205. 10.1080/03610929308831013 [DOI] [Google Scholar]
  43. McGilchrist C. A., 1993.  REML estimation for survival models with frailty. Biometrics 49: 221–225. 10.2307/2532615 [DOI] [PubMed] [Google Scholar]
  44. McGilchrist C. A., and Aisbett C. W., 1991.  Regression with frailty in survival analysis. Biometrics 47: 461–466. 10.2307/2532138 [DOI] [PubMed] [Google Scholar]
  45. Naj A. C., Jun G., Reitz C., Kunkle B. W., Perry W. et al. , 2014.  Effects of multiple genetic loci on age at onset in late-onset Alzheimer disease: a genome-wide association study. JAMA Neurol. 71: 1394–1404. 10.1001/jamaneurol.2014.1491 [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Nelder J. A., and Mead R., 1965.  A simplex method for function minimization. Comput. J. 7: 308–313. 10.1093/comjnl/7.4.308 [DOI] [Google Scholar]
  47. Pace R. K., and LeSage J. P., 2004.  Chebyshev approximation of log-determinants of spatial weight matrices. Comput. Stat. Data Anal. 45: 179–196. 10.1016/S0167-9473(02)00321-3 [DOI] [Google Scholar]
  48. Pankratz V. S., de Andrade M., and Therneau T. M., 2005.  Random-effects Cox proportional hazards model: general variance components methods for time-to-event data. Genet. Epidemiol. 28: 97–109. 10.1002/gepi.20043 [DOI] [PubMed] [Google Scholar]
  49. Powell, M. J., 2009 The BOBYQA algorithm for bound constrained optimization without derivatives. Cambridge NA Report NA2009/06. University of Cambridge, Cambridge, UK. pp. 26–46.
  50. Ripatti S., and Palmgren J., 2000.  Estimation of multivariate frailty models using penalized partial likelihood. Biometrics 56: 1016–1022. 10.1111/j.0006-341X.2000.01016.x [DOI] [PubMed] [Google Scholar]
  51. Ripatti S., Larsen K., and Palmgren J., 2002.  Maximum likelihood inference for multivariate frailty models using an automated Monte Carlo EM algorithm. Lifetime Data Anal. 8: 349–360. 10.1023/A:1020566821163 [DOI] [PubMed] [Google Scholar]
  52. Rothman A. J., 2012.  Positive definite estimators of large covariance matrices. Biometrika 99: 733–740. 10.1093/biomet/ass025 [DOI] [Google Scholar]
  53. Schneider M. del P., Strandberg E., Ducrocq V., and Roth A., 2005.  Survival analysis applied to genetic evaluation for female fertility in dairy cattle. J. Dairy Sci. 88: 2253–2259. 10.3168/jds.S0022-0302(05)72901-5 [DOI] [PubMed] [Google Scholar]
  54. Sinnwell J. P., Therneau T. M., and Schaid D. J., 2014.  The kinship2 R package for pedigree data. Hum. Hered. 78: 91–93. 10.1159/000363105 [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Slager S. L., Skibola C. F., Di Bernardo M. C., Conde L., Broderick P. et al. , 2012.  Common variation at 6p21.31 (BAK1) influences the risk of chronic lymphocytic leukemia. Blood 120: 843–846. 10.1182/blood-2012-03-413591 [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Speedy H. E., Di Bernardo M. C., Sava G. P., Dyer M. J. S., Holroyd A. et al. , 2014.  A genome-wide association study identifies multiple susceptibility loci for chronic lymphocytic leukemia. Nat. Genet. 46: 56–60. 10.1038/ng.2843 [DOI] [PubMed] [Google Scholar]
  57. Staley J. R., Jones E., Kaptoge S., Butterworth A. S., Sweeting M. J. et al. , 2017.  A comparison of Cox and logistic regression for use in genome-wide association studies of cohort and case-cohort design. Eur. J. Hum. Genet. EJHG 25: 854–862. 10.1038/ejhg.2017.78 [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Svishcheva G. R., Axenovich T. I., Belonogova N. M., van Duijn C. M., and Aulchenko Y. S., 2012.  Rapid variance components-based method for whole-genome association analysis. Nat. Genet. 44: 1166–1170. 10.1038/ng.2410 [DOI] [PubMed] [Google Scholar]
  59. Therneau T., 2003.  On mixed-effect Cox models, sparse matrices, and modeling data from large pedigrees. Available at: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.9.6767
  60. Therneau T., 2015.  Mixed effects Cox models. CRAN repository. https://repo.bppt.go.id/cran/web/packages/coxme/vignettes/coxme.pdf
  61. Therneau T. M., and Thomas Lumley, 2015.  “Package ‘survival’.” R Top Doc. 128: 112. https://cran.r-project.org/web/packages/survival/survival.pdf
  62. Tsuruta S., Misztal I., and Strandén I., 2001.  Use of the preconditioned conjugate gradient algorithm as a generic solver for mixed-model equations in animal breeding applications. J. Anim. Sci. 79: 1166–1172. 10.2527/2001.7951166x [DOI] [PubMed] [Google Scholar]
  63. Ubaru S., Chen J., and Saad Y., 2017.  Fast estimation of tr(f(A)) via stochastic Lanczos quadrature. SIAM J. Matrix Anal. Appl. 38: 1075–1099. 10.1137/16M1104974 [DOI] [Google Scholar]
  64. Vaupel J. W., Manton K. G., and Stallard E., 1979.  The impact of heterogeneity in individual frailty on the dynamics of mortality. Demography 16: 439–454. 10.2307/2061224 [DOI] [PubMed] [Google Scholar]
  65. Wei L.-J., Lin D. Y., and Weissfeld L., 1989.  Regression analysis of multivariate incomplete failure time data by modeling marginal distributions. J. Am. Stat. Assoc. 84: 1065–1073. 10.1080/01621459.1989.10478873 [DOI] [Google Scholar]
  66. Wienke A., Lichtenstein P., and Yashin A. I., 2003.  A bivariate frailty model with a cure fraction for modeling familial correlations in diseases. Biometrics 59: 1178–1183, discussion 1184–1185. 10.1111/j.0006-341X.2003.00135.x [DOI] [PubMed] [Google Scholar]
  67. Yang J., Lee S. H., Goddard M. E., and Visscher P. M., 2011.  GCTA: a Tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88: 76–82. 10.1016/j.ajhg.2010.11.011 [DOI] [PMC free article] [PubMed] [Google Scholar]
  68. Yashin A. I., Vaupel J. W., and Iachine I. A., 1995.  Correlated individual frailty: an advantageous approach to survival analysis of bivariate data. Math. Popul. Stud. 5: 145–159, 183. 10.1080/08898489509525394 [DOI] [PubMed] [Google Scholar]
  69. Yazdi M. H., Visscher P. M., Ducrocq V., and Thompson R., 2002.  Heritability, reliability of genetic evaluations and response to selection in proportional hazard models. J. Dairy Sci. 85: 1563–1577. 10.3168/jds.S0022-0302(02)74226-4 [DOI] [PubMed] [Google Scholar]
  70. Ypma J., 2014.  Introduction to nloptr: an R interface to NLopt. Tech. rep. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.488.6344&rep=rep1&type=pdf
  71. Zhang Y., Sloan S. A., Clarke L. E., Caneda C., Plaza C. A. et al. , 2016.  Purification and characterization of progenitor and mature human astrocytes reveals transcriptional and functional differences with mouse. Neuron 89: 37–53. 10.1016/j.neuron.2015.11.013 [DOI] [PMC free article] [PubMed] [Google Scholar]
  72. Zheng X., Levine D., Shen J., Gogarten S. M., Laurie C. et al. , 2012.  A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28: 3326–3328. 10.1093/bioinformatics/bts606 [DOI] [PMC free article] [PubMed] [Google Scholar]
  73. Zhou X., and Stephens M., 2012.  Genome-wide efficient mixed model analysis for association studies. Nat. Genet. 44: 821–824. 10.1038/ng.2310 [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

This manuscript was prepared using limited access data sets obtained through dbGaP [accession numbers: phs000168.v2.p2 (LOADFS)]. Code used to generate the simulated age-at-onset data can be found in the vignettes of the coxmeg R package at https://cran.r-project.org/web/packages/coxmeg/. Supplemental material available at figshare: https://doi.org/10.25386/genetics.11862486.


Articles from Genetics are provided here courtesy of Oxford University Press

RESOURCES