Author manuscript; available in PMC: 2016 Sep 1.
Published in final edited form as: Biometrika. 2015 Jul 14;102(3):515–532. doi: 10.1093/biomet/asv030

Efficient Estimation of Nonparametric Genetic Risk Function with Censored Data

YUANJIA WANG 1, BAOSHENG LIANG 2, XINGWEI TONG 3, KAREN MARDER 4, SUSAN BRESSMAN 5, AVI ORR-URTREGER 6, NIR GILADI 7, DONGLIN ZENG 8
PMCID: PMC4581539  NIHMSID: NIHMS686156  PMID: 26412864

Summary

With an increasing number of causal genes discovered for complex human disorders, it is crucial to assess the genetic risk of disease onset for individuals who are carriers of these causal mutations and compare the distribution of age-at-onset with that in non-carriers. In many genetic epidemiological studies aiming at estimating causal gene effect on disease, the age-at-onset of disease is subject to censoring. In addition, some individuals’ mutation carrier or non-carrier status can be unknown due to the high cost of in-person ascertainment to collect DNA samples or death in older individuals. Instead, the probability of these individuals’ mutation status can be obtained from various sources. When mutation status is missing, the available data take the form of censored mixture data. Recently, various methods have been proposed for risk estimation from such data, but none is efficient for estimating a nonparametric distribution. We propose a fully efficient sieve maximum likelihood estimation method, in which we estimate the logarithm of the hazard ratio between genetic mutation groups using B-splines, while applying nonparametric maximum likelihood estimation for the reference baseline hazard function. Our estimator can be calculated via an expectation-maximization algorithm which is much faster than existing methods. We show that our estimator is consistent and semiparametrically efficient and establish its asymptotic distribution. Simulation studies demonstrate superior performance of the proposed method, which is applied to the estimation of the distribution of the age-at-onset of Parkinson's disease for carriers of mutations in the leucine-rich repeat kinase 2 gene.

Keywords: Empirical process, Mixture distribution, Parkinson's disease, Semiparametric efficiency, Sieve maximum likelihood estimation

1. Introduction

Identification of causal genes for many genetic disorders has made personalized risk assessment and prediction of disease onset a real possibility. However, although interest lies in estimating the cumulative risk distributions of disease onset for individuals who are carriers of deleterious mutations or for those with a certain haplotype, investigators may encounter missing genotypes or missing phase information of the haplotypes in a large proportion of individuals. For instance, genotypes in family members may be missing due to the high cost of collecting blood samples from relatives, death of a relative (Wacholder et al., 1998; Marder et al., 2003; Zhang et al., 2010; Wang et al., 2012; Qin et al., 2014), or limitations in the technology to separate two homologous chromosomes in genotyping. Furthermore, disease onset information is subject to censoring due to loss to follow-up or death.

In the presence of missing genotype information, the statistical framework for estimating disease risk distribution associated with genetic mutations is essentially the analysis of censored mixture data. There is a large body of literature on inference for mixture models. See, for example, Titterington et al. (1985) and McLachlan & Basford (1988) for parametric models, and Hall & Zhou (2003) for nonparametric models. Most of these papers address non-censored outcomes. In many genetic epidemiological studies of disease risk distributions, two features distinguish the data from other censored mixture models. First, each subgroup in the mixture model is biologically meaningful and corresponds to mutation carriers or non-carriers; second, the mixing probability is usually known to the investigators or can be inferred from family pedigrees and other external sources. For example, in a case-control genetic study with valid family history information on relatives (Marder et al., 2003), the probability of a relative having a certain genotype is obtained through the relationship between relatives and probands under Mendelian assumptions (Wacholder et al., 1998; Zhang et al., 2010; Wang et al., 2012; Qin et al., 2014). In haplotype studies, the probability of a certain haplotype can be inferred from unphased genotypes under Hardy–Weinberg equilibrium (Zeng et al., 2006), from external sources such as the HapMap project, or from sequencing data (Yang et al., 2013).

One application of this paper is to a recent study on age-specific risk of Parkinson's disease associated with mutations in the leucine-rich repeat kinase 2 gene (Paisán-Ruíz et al., 2004; Healy et al., 2008). Although Parkinson's disease is traditionally considered a non-genetic disorder, recent studies have identified genetic risk factors for Parkinson's disease, especially in more genetically homogeneous sub-populations such as Ashkenazi Jews (Trinh, 2013). The goal of the current study is to estimate age-specific risk of Parkinson's disease in Ashkenazi Jews for the leucine-rich repeat kinase 2 gene mutation carriers and compare it to that of non-carriers. Since the leucine-rich repeat kinase 2 mutations have low prevalence, it is not efficient to randomly sample individuals from the Ashkenazi population. Instead, the study used the kin-cohort design (Wacholder et al., 1998), which was initially implemented to study genetic risks of breast cancer. In our study, an initial sample of individuals with Parkinson's disease, referred to as probands, were sequenced for the leucine-rich repeat kinase 2 mutations and provided age-at-onset information for their first-degree relatives. Most of the relatives were not genotyped due to limited resources and therefore had unknown leucine-rich repeat kinase 2 mutation status. In addition, for older relatives who were deceased, it was not possible to collect blood samples.

Several existing works consider estimating distribution functions for such mixture data in a parametric or semiparametric framework (e.g., Diao & Lin, 2005; Zhang et al., 2010). When concerns over model misspecification arise in practice (e.g., Langbehn et al., 2004), a nonparametric model and inference through nonparametric maximum likelihood estimation are natural. However, although the Kaplan–Meier estimator is nonparametrically efficient for censored data, nonparametric maximum likelihood estimators are either inconsistent or inefficient for mixture data (Wang et al., 2012). To account for censoring and the mixture nature of the problem while ensuring monotonicity of the estimated distribution function on the entire support, Qin et al. (2014) proposed methods based on a binomial likelihood and a sequence of nonparametric estimates, obtained by reducing censored data to current status data and implementing the expectation-maximization algorithm (Laird & Ware, 1982) with the pooled-adjacent-violators algorithm. However, this method is not guaranteed to be efficient and can be computationally intensive. Other works involving a nonparametric model based on estimating equations and weighting of Kaplan–Meier survival curves include Wacholder et al. (1998) and Fine et al. (2004).

In this work, we propose a sieve maximum likelihood estimation method to estimate disease risk associated with genetic mutations in censored mixture models. Specifically, we utilize sieve estimation based on B-splines to estimate the log-hazard ratios between the carriers and non-carriers, while the nonparametric maximum likelihood estimator is used to estimate the reference baseline hazard function. The derived estimators for the disease risk distributions are guaranteed to be asymptotically efficient. Furthermore, the calculation of the sieve maximum likelihood estimators can be easily implemented via an expectation-maximization algorithm which converges much faster than existing algorithms, due to closed form solutions in the M-step. We tackle the theoretical challenge when one functional parameter is estimated using a nonparametric maximum likelihood estimator while the other parameter is estimated using a sieve estimator. We demonstrate substantial efficiency gains of the proposed method by simulation. Finally, we apply our method to estimate the age-at-onset of Parkinson's disease for individuals with deleterious leucine-rich repeat kinase 2 mutations (Goldwurm et al., 2011).

2. Method and Inference Procedure

2·1. Data and likelihood function

Let $T_i$ be the age-at-onset of a disease, which is subject to random censoring. Let $B_i$ denote the potentially missing mutation status, with 1 indicating the carrier group, in which each individual has at least one copy of the mutation, and 2 indicating the non-carrier group. As in the Parkinson's disease study described in Section 1, the probability of being a carrier takes a finite number of values. For example, a child of a heterozygote carrier parent has a probability of 0·5 of carrying this mutation under the Mendelian assumption, so if the mutation prevalence in the general population is denoted as $f$, we have pr(B = 1) = 0·5(1 + f) for this child. For individuals with observed carrier status, pr(B = 1) equals 1 for carriers and 0 for non-carriers. We denote the finite set of values for the probability pr(B = 1) by $p_1, \ldots, p_m$. Our goal is to estimate the risk distribution of the age-at-onset in the mutation group and the no-mutation group, that is, $F_1(t) = \mathrm{pr}(T \le t \mid B = 1)$ and $F_2(t) = \mathrm{pr}(T \le t \mid B = 2)$, respectively.
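As a small numerical illustration of the mixing probability just described, the following sketch evaluates pr(B = 1) = 0·5(1 + f) for a child of a heterozygote carrier parent. The function name is hypothetical, and the prevalence f = 0·02 is the value quoted for the Ashkenazi Jewish population in Section 5.

```python
def carrier_prob_child_of_het_carrier(f: float) -> float:
    """pr(B = 1) = 0.5 * (1 + f), the formula given in the text, where f is
    the population prevalence of the mutation."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("prevalence f must lie in [0, 1]")
    return 0.5 * (1.0 + f)


# Illustrative value only: f = 0.02 as quoted in Section 5.
p_child = carrier_prob_child_of_het_carrier(0.02)
```

With f = 0·02 this gives the value 0·51 that appears among the mixing probabilities in the application.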

Due to right censoring, the observations from $n$ individuals consist of $\{Y_i = T_i \wedge C_i,\ \Delta_i = I(T_i \le C_i),\ \mathrm{pr}(B_i = 1)\}$ $(i = 1, \ldots, n)$, where $C_i$ denotes the censoring time, assumed to be independent of $T_i$. We introduce an indicator variable $G_i$ to index the $m$ distinct mixing probabilities, so $G_i = g$ indicates $\mathrm{pr}(B_i = 1) = p_g$ $(g = 1, \ldots, m)$. After grouping individuals with the same $p_g$ value together, the likelihood function can be written as

$$\prod_{i=1}^{n}\prod_{g=1}^{m}\left[\{p_g f_1(Y_i) + (1 - p_g) f_2(Y_i)\}^{\Delta_i}\{1 - p_g F_1(Y_i) - (1 - p_g) F_2(Y_i)\}^{1-\Delta_i}\right]^{I(G_i = g)},$$

where $f_k$ is the density function corresponding to $F_k$ $(k = 1, 2)$. Our interest is to estimate $F_k$.

In survival analysis, it is usually more convenient to rewrite the observed likelihood function using hazard functions instead of distribution functions. Let $\lambda_k(t)$ be the hazard function for $T$ in the group with $B = k$, and let $\Lambda_k(t)$ be the corresponding cumulative hazard function. Then the likelihood function can be re-expressed as

$$\prod_{i=1}^{n}\prod_{g=1}^{m}\left(\left\{p_g\lambda_1(Y_i)e^{-\Lambda_1(Y_i)} + (1 - p_g)\lambda_2(Y_i)e^{-\Lambda_2(Y_i)}\right\}^{\Delta_i} \times \left[1 - p_g\{1 - e^{-\Lambda_1(Y_i)}\} - (1 - p_g)\{1 - e^{-\Lambda_2(Y_i)}\}\right]^{1-\Delta_i}\right)^{I(G_i = g)}. \tag{1}$$

The goal is to maximize the likelihood function (1) to estimate $\Lambda_1(t)$ and $\Lambda_2(t)$ nonparametrically and thus to obtain the age-at-onset distributions, $F_1(t)$ and $F_2(t)$. In the likelihood function (1), $p_g$ equals 1 or 0 if an individual is observed to be a carrier or non-carrier, respectively.
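To make the structure of likelihood (1) concrete, the sketch below evaluates its logarithm for user-supplied hazard and cumulative hazard functions. It is only an illustration under assumed inputs: the function name and argument layout are hypothetical, each individual carries their own mixing probability (the value $p_{G_i}$), and the hazards are passed as ordinary Python callables rather than estimated. The censored contribution uses the identity $1 - p\{1 - e^{-\Lambda_1}\} - (1 - p)\{1 - e^{-\Lambda_2}\} = p\,e^{-\Lambda_1} + (1 - p)\,e^{-\Lambda_2}$.

```python
import math


def mixture_log_likelihood(y, delta, p, haz1, cumhaz1, haz2, cumhaz2):
    """Log of likelihood (1) for observed times y, event indicators delta,
    and per-individual carrier probabilities p (hypothetical argument names)."""
    total = 0.0
    for yi, di, pi in zip(y, delta, p):
        s1 = math.exp(-cumhaz1(yi))  # carrier survival at yi
        s2 = math.exp(-cumhaz2(yi))  # non-carrier survival at yi
        if di == 1:
            # Event: mixture of the two densities lambda_k * exp(-Lambda_k).
            total += math.log(pi * haz1(yi) * s1 + (1.0 - pi) * haz2(yi) * s2)
        else:
            # Censored: mixture of the two survival functions.
            total += math.log(pi * s1 + (1.0 - pi) * s2)
    return total


# Toy check with constant hazards (rate 1 for carriers, 0.5 for non-carriers):
# a genotyped carrier with an event at t = 1 and a genotyped non-carrier
# censored at t = 2.
example_ll = mixture_log_likelihood(
    [1.0, 2.0], [1, 0], [1.0, 0.0],
    lambda t: 1.0, lambda t: t, lambda t: 0.5, lambda t: 0.5 * t,
)
```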

2·2. Sieve maximum likelihood estimation

At first glance, to estimate $\Lambda_1$ or $\Lambda_2$ in (1), one may consider a nonparametric maximum likelihood estimator (Zeng & Lin, 2010), where $\Lambda_1$ and $\Lambda_2$ are treated as step functions with jumps at the observed event times. However, because the support points for event times are ambiguous when the mutation group membership is not observed, the nonparametric maximum likelihood estimator may not be consistent, and its bias has been observed in simulations even for very large samples (Ma & Wang, 2012; Wang et al., 2012). We therefore propose a hybrid approach, combining a nonparametric maximum likelihood estimator with a sieve maximum likelihood estimator, that leads to consistent and semiparametrically efficient estimation.

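Define $\beta(t) = \log\{\lambda_1(t)/\lambda_2(t)\}$, so $\Lambda_1(t) = \int_0^t \exp\{\beta(s)\}\,d\Lambda_2(s)$. The likelihood in (1) can be re-expressed as

$$\prod_{i=1}^{n}\prod_{g=1}^{m}\left(\lambda_2(Y_i)^{\Delta_i}\left[p_g e^{\beta(Y_i)}\exp\left\{-\int_0^{Y_i} e^{\beta(t)}\,d\Lambda_2(t)\right\} + (1 - p_g)\exp\{-\Lambda_2(Y_i)\}\right]^{\Delta_i} \times \left[p_g\exp\left\{-\int_0^{Y_i} e^{\beta(t)}\,d\Lambda_2(t)\right\} + (1 - p_g)\exp\{-\Lambda_2(Y_i)\}\right]^{1-\Delta_i}\right)^{I(G_i = g)}. \tag{2}$$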

To maximize (2), we use a nonparametric maximum likelihood estimator for the cumulative hazard function in the baseline group, $\Lambda_2(t)$, and adopt a sieve approximation for $\beta(t)$. Specifically, we assume that $\Lambda_2$ jumps at the observed $Y_i$ with $\Delta_i = 1$, and we use a sieve approximation for the log-hazard ratio, letting $\beta(t) = \sum_{j=1}^{K_n}\alpha_j\phi_j(t)$, where $\phi_1, \ldots, \phi_{K_n}$ are basis functions for the sieve approximation. The resulting estimator maximizes a partially smoothed likelihood, where the smoothing is performed on the hazard ratio function. The use of a smoothed approximation enables one to borrow information to estimate $\Lambda_1(t)$ and thus avoid specifying its ambiguous support points, as required for the nonparametric maximum likelihood estimator. In our implementation, we choose B-splines as the basis functions: we let the spline knots be $0 = t_1 = \cdots = t_l < t_{l+1} < \cdots < \tau = t_{m_n+l} = t_{m_n+l+1} = \cdots = t_{m_n+2l}$, where $\tau$ is the study duration, $m_n$ is an integer to be chosen in a data-driven fashion, and $l$ is the order of the B-splines. There is a total of $K_n = m_n + l$ B-spline basis functions, denoted as $\{\phi_j : j = 1, \ldots, K_n\}$.
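The B-spline sieve for the log-hazard ratio can be sketched with the standard Cox–de Boor recursion. This is a minimal pure-Python illustration, not the authors' implementation: the helper names are hypothetical, and the knot vector is assumed to repeat each boundary knot $l$ times around $m$ interior knots, which yields $m + l$ basis functions.

```python
def clamped_knots(interior, tau, order):
    """Knot vector with `order` copies of 0 and of tau around the interior knots."""
    return [0.0] * order + sorted(float(k) for k in interior) + [float(tau)] * order


def bspline_basis(t, knots, order):
    """Values at t of all B-spline basis functions of the given order
    (degree = order - 1), computed by the Cox-de Boor recursion."""
    n = len(knots)
    # Order-1 functions: indicators of the half-open knot intervals.
    b = [1.0 if knots[j] <= t < knots[j + 1] else 0.0 for j in range(n - 1)]
    if t == knots[-1]:
        # Close the last non-empty interval so the right endpoint is covered.
        for j in range(n - 2, -1, -1):
            if knots[j] < knots[j + 1]:
                b[j] = 1.0
                break
    for k in range(2, order + 1):
        new = []
        for j in range(n - k):
            left_den = knots[j + k - 1] - knots[j]
            right_den = knots[j + k] - knots[j + 1]
            # Terms with a zero denominator are taken to be zero.
            left = (t - knots[j]) / left_den * b[j] if left_den > 0 else 0.0
            right = (knots[j + k] - t) / right_den * b[j + 1] if right_den > 0 else 0.0
            new.append(left + right)
        b = new
    return b


def log_hazard_ratio(t, alpha, knots, order):
    """beta(t) = sum_j alpha_j * phi_j(t)."""
    return sum(a * phi for a, phi in zip(alpha, bspline_basis(t, knots, order)))


# Cubic B-splines (order 4) with two interior knots on [0, 10]: 2 + 4 = 6 functions.
demo_knots = clamped_knots([2.0, 5.0], 10.0, 4)
demo_basis = bspline_basis(3.0, demo_knots, 4)
```

Because the basis functions sum to one on $[0, \tau]$, setting all coefficients to a common value gives a constant log-hazard ratio, so a proportional hazards model is a special case of the sieve.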

Using the nonparametric maximum likelihood estimator for $\Lambda_2$ and the sieve estimate for $\beta(t)$, we aim to maximize (2) or its logarithm over all the parameters, including the jumps of $\Lambda_2$ and the spline coefficients $\alpha_1, \ldots, \alpha_{K_n}$. Direct maximization is computationally intensive and inefficient since the log-likelihood is not convex and the parameters include the potentially many jumps of $\Lambda_2$. However, if the mutation statuses $B_1, \ldots, B_n$ of all individuals are treated as missing data, the expectation-maximization algorithm attains fast numerical convergence owing to closed-form solutions in the M-step.

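Assuming that the $B_i$ were observed, the complete data log-likelihood function for $(Y_i, \Delta_i, B_i, G_i)$ $(i = 1, \ldots, n)$ is

$$\begin{aligned}
&\sum_{i=1}^{n} I(B_i = 1)\left[\Delta_i\log\delta\Lambda_2(Y_i) + \Delta_i\sum_{j=1}^{K_n}\alpha_j\phi_j(Y_i) - \sum_{Y_k \le Y_i}\delta\Lambda_2(Y_k)\exp\left\{\sum_{j=1}^{K_n}\alpha_j\phi_j(Y_k)\right\}\right] \\
&\quad + \sum_{i=1}^{n} I(B_i = 2)\{\Delta_i\log\delta\Lambda_2(Y_i) - \Lambda_2(Y_i)\} \\
&\quad + \sum_{i=1}^{n}\sum_{g=1}^{m} I(G_i = g, B_i = 1)\log p_g + \sum_{i=1}^{n}\sum_{g=1}^{m} I(G_i = g, B_i = 2)\log(1 - p_g),
\end{aligned}$$

where $\delta\Lambda_2(y)$ denotes the jump of $\Lambda_2$ at $y$. Therefore, the expectation-maximization algorithm consists of the following E- and M-steps. In the E-step, we evaluate the conditional probability of $B_i = 1$ given the data $(G_i, Y_i, \Delta_i)$,

$$q_i = \frac{p_{G_i}\exp\left[\Delta_i\sum_{j=1}^{K_n}\alpha_j\phi_j(Y_i) - \int_0^{Y_i}\exp\left\{\sum_{j=1}^{K_n}\alpha_j\phi_j(t)\right\}d\Lambda_2(t)\right]}{p_{G_i}\exp\left[\Delta_i\sum_{j=1}^{K_n}\alpha_j\phi_j(Y_i) - \int_0^{Y_i}\exp\left\{\sum_{j=1}^{K_n}\alpha_j\phi_j(t)\right\}d\Lambda_2(t)\right] + (1 - p_{G_i})\exp\{-\Lambda_2(Y_i)\}}.$$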
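The E-step can be sketched as follows, under stated assumptions: $\Lambda_2$ is represented by a list of jump sizes aligned with the observed times (zero at censored times), the log-hazard ratio is passed as a callable `beta`, and each individual's mixing probability $p_{G_i}$ is supplied directly. The function name and data layout are hypothetical and the quadratic-time sums are kept for readability rather than speed.

```python
import math


def e_step(y, delta, p, jump, beta):
    """Posterior carrier probabilities q_i given current jumps of Lambda_2
    and the current log-hazard ratio function beta."""
    q = []
    for yi, di, pi in zip(y, delta, p):
        # Lambda_1(Y_i): jumps of Lambda_2 up to Y_i, weighted by exp(beta).
        cum1 = sum(math.exp(beta(yk)) * jk for yk, jk in zip(y, jump) if yk <= yi)
        # Lambda_2(Y_i): plain sum of the jumps up to Y_i.
        cum2 = sum(jk for yk, jk in zip(y, jump) if yk <= yi)
        num = pi * math.exp(di * beta(yi) - cum1)
        den = num + (1.0 - pi) * math.exp(-cum2)
        q.append(num / den)
    return q


# Toy data: three individuals, the second one censored.
toy_y, toy_delta = [1.0, 2.0, 3.0], [1, 0, 1]
toy_p, toy_jump = [0.5, 1.0, 0.0], [0.2, 0.0, 0.3]
q_null = e_step(toy_y, toy_delta, toy_p, toy_jump, lambda t: 0.0)
q_alt = e_step(toy_y, toy_delta, toy_p, toy_jump, lambda t: 1.0)
```

When $\beta \equiv 0$ the two groups share the same hazard, so the data carry no information about carrier status and $q_i$ reduces to the prior probability $p_{G_i}$; genotyped individuals ($p_{G_i}$ equal to 0 or 1) always keep their observed status.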

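In the M-step, we maximize

$$\sum_{i=1}^{n} q_i\left[\Delta_i\log\delta\Lambda_2(Y_i) + \Delta_i\sum_{j=1}^{K_n}\alpha_j\phi_j(Y_i) - \sum_{Y_k \le Y_i}\delta\Lambda_2(Y_k)\exp\left\{\sum_{j=1}^{K_n}\alpha_j\phi_j(Y_k)\right\}\right] + \sum_{i=1}^{n}(1 - q_i)\{\Delta_i\log\delta\Lambda_2(Y_i) - \Lambda_2(Y_i)\}. \tag{3}$$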

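By differentiating (3) with respect to the jumps of $\Lambda_2$, we obtain a closed form solution

$$\delta\Lambda_2(Y_i) = \Delta_i \Big/ \sum_{k=1}^{n} I(Y_k \ge Y_i)\left[q_k\exp\left\{\sum_{j=1}^{K_n}\alpha_j\phi_j(Y_i)\right\} + (1 - q_k)\right]. \tag{4}$$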
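The closed-form update (4) is simple to code. The sketch below assumes no tied event times and uses hypothetical names; `beta` is the current log-hazard ratio as a callable and `q` holds the E-step probabilities.

```python
import math


def update_jumps(y, delta, q, beta):
    """Closed-form M-step update (4) for the jumps of Lambda_2."""
    jumps = []
    for yi, di in zip(y, delta):
        if di == 0:
            jumps.append(0.0)  # Lambda_2 jumps only at observed event times
            continue
        ratio = math.exp(beta(yi))
        # Weighted risk set: carriers contribute exp(beta), non-carriers 1.
        denom = sum(qk * ratio + (1.0 - qk) for yk, qk in zip(y, q) if yk >= yi)
        jumps.append(1.0 / denom)
    return jumps


# With beta = 0 every individual at risk has weight 1, whatever q is.
na_jumps = update_jumps([1.0, 2.0, 3.0], [1, 0, 1], [0.3, 0.9, 0.5], lambda t: 0.0)
```

When $\beta \equiv 0$ the denominator is just the number at risk, so (4) reduces to the usual Nelson–Aalen increment, which is a convenient sanity check for an implementation.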

After inserting (4) into (3) and differentiating with respect to the $\alpha$'s, we obtain $\alpha$'s that solve the estimating equation

$$\sum_{i=1}^{n}\Delta_i\left(q_i - \frac{\sum_{k=1}^{n} I(Y_k \ge Y_i)\,q_k\exp\left\{\sum_{j=1}^{K_n}\alpha_j\phi_j(Y_i)\right\}}{\sum_{k=1}^{n} I(Y_k \ge Y_i)\left[q_k\exp\left\{\sum_{j=1}^{K_n}\alpha_j\phi_j(Y_i)\right\} + (1 - q_k)\right]}\right)\begin{pmatrix}\phi_1(Y_i)\\ \vdots\\ \phi_{K_n}(Y_i)\end{pmatrix} = 0, \tag{5}$$

which is easily solved using the Newton–Raphson method. With the updated $\alpha$'s, we use (4) to update the jumps of $\Lambda_2(\cdot)$. We iterate between the E- and M-steps until convergence. We denote the final estimators by $\hat\Lambda_{2n}(t)$ and $\hat\beta_n(t) = \sum_{j=1}^{K_n}\hat\alpha_j\phi_j(t)$. Although we choose $\Lambda_2$ as the baseline group for the nonparametric maximum likelihood estimation and use sieve estimation to obtain a time-dependent log-hazard ratio of the first group versus the second group, the procedure can also be reversed by treating $\Lambda_1$ as the baseline group. In the subsequent arguments, for ease of theoretical justification, we will denote this reversely estimated $\hat\Lambda_{1n}(t)$ as an estimator of $\Lambda_1(t)$ instead of using $\int_0^t\exp\{\hat\beta_n(s)\}\,d\hat\Lambda_{2n}(s)$. Empirically, we find that these two estimators of $\Lambda_1$ are almost identical.

Our theoretical results show that $\hat\Lambda_{kn}(t)$ $(k = 1, 2)$ converges in distribution to a Gaussian process after normalization. To estimate its asymptotic variance, following results in Zeng & Lin (2010), one approach is to compute the observed information matrix for the jump sizes of $\hat\Lambda_1$ and $\hat\Lambda_2$ and use the inverse of this matrix to estimate the asymptotic covariance of $\hat\Lambda_1$ and $\hat\Lambda_2$. However, this approach may be numerically unstable due to inversion of a potentially high-dimensional information matrix. Alternatively, bootstrapping can be used to estimate their asymptotic covariance. Our numerical experience shows that 100 bootstrap samples are usually sufficient. In our algorithm, $\Lambda_2$ is updated using the closed form in (4) and the $\alpha$'s are obtained via the one-step Newton–Raphson solution to (5). Therefore, the computational burden is much lower than that of existing methods.

Finally, using the proposed nonparametric estimators for $F_k(t) \equiv 1 - \exp\{-\Lambda_k(t)\}$, that is, $\hat F_{kn}(t) = 1 - \exp\{-\hat\Lambda_{kn}(t)\}$, we can construct a variety of test statistics to compare the carrier group and the non-carrier group. One test statistic is based on the Kolmogorov–Smirnov test, $T_n = \sup_{t \in [0, \tau]}|\hat F_{1n}(t) - \hat F_{2n}(t)|$. When $T_n > c_\alpha$, we reject the null hypothesis that there is no difference between the disease risk distributions of the two groups. Here, $\alpha$ is the significance level and $c_\alpha$ is the $(1 - \alpha)$-quantile of the sampling distribution of $T_n$ under permutations, where the variables $G_i$ are permuted. Another possible test statistic is $T_n = \int_0^\tau \omega(t)\,|\hat F_{1n}(t) - \hat F_{2n}(t)|\,dt$, where $\omega(t)$ is a user-defined weight function that may focus on a specific time range.
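A permutation version of the Kolmogorov–Smirnov comparison might be organized as in the sketch below. It is only an outline under assumptions: `estimate_curves` is a hypothetical user-supplied callback that returns the two estimated distribution functions on a common time grid (in practice it would run the sieve maximum likelihood procedure), `groups` holds the labels being permuted, and the add-one p-value is a common convention rather than something specified in the text.

```python
import random


def ks_statistic(f1, f2):
    """Largest absolute gap between two curves evaluated on a common grid."""
    return max(abs(a - b) for a, b in zip(f1, f2))


def permutation_ks_test(data, groups, estimate_curves, n_perm=200, seed=0):
    """Return the observed statistic and a permutation p-value.

    estimate_curves(data, groups) is a hypothetical callback returning
    (F1_hat, F2_hat) as two equal-length sequences on a common time grid.
    """
    rng = random.Random(seed)
    observed = ks_statistic(*estimate_curves(data, groups))
    exceed = 0
    for _ in range(n_perm):
        shuffled = list(groups)
        rng.shuffle(shuffled)  # permute the group labels G_i
        if ks_statistic(*estimate_curves(data, shuffled)) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

Rejecting when the returned p-value is below $\alpha$ corresponds to rejecting when the observed statistic exceeds the permutation quantile $c_\alpha$.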

2·3. Generalization to cure rate survival data

The proposed method can be generalized to analyze cure rate survival data, in which some individuals are considered to be immune to the disease of interest. To this end, we introduce a binary indicator $Z$ to denote cure status. We assume that $\mathrm{pr}(Z = 1 \mid B = k) = r_k$ and that the disease risk function among the non-cured population is $\mathrm{pr}(T \le t \mid Z = 0, B = k) = \tilde F_k(t) = 1 - \exp\{-\tilde\Lambda_k(t)\}$ $(k = 1, 2)$. The observed data consist of $(Y_i, \Delta_i, G_i, \Delta_i Z_i)$ $(i = 1, \ldots, n)$, where $\Delta_i$ indicates either diseased or cured. That is, among non-censored individuals, we observe some, usually those who have not experienced disease after a certain age, to be cured. However, the cure status of the censored individuals is unknown. Thus, defining $\Lambda_k(t) = -\log[r_k + (1 - r_k)\exp\{-\tilde\Lambda_k(t)\}]$, the observed likelihood function becomes

$$\begin{aligned}
&\prod_{i=1}^{n}\prod_{g=1}^{m}\left[\left\{\lambda_1(Y_i)e^{-\Lambda_1(Y_i)}p_g + \lambda_2(Y_i)e^{-\Lambda_2(Y_i)}(1 - p_g)\right\}^{\Delta_i(1 - Z_i)}\left\{e^{-\Lambda_1(Y_i)}p_g + e^{-\Lambda_2(Y_i)}(1 - p_g)\right\}^{1 - \Delta_i}\right]^{I(G_i = g)} \\
&\quad\times \prod_{i=1}^{n}\prod_{g=1}^{m}\left[\{(1 - r_1)p_g + (1 - r_2)(1 - p_g)\}^{\Delta_i(1 - Z_i)}\{r_1 p_g + r_2(1 - p_g)\}^{\Delta_i Z_i}\right]^{I(G_i = g)}.
\end{aligned} \tag{6}$$

We can estimate the cure rates $r_k$ by maximizing the last part of expression (6), while we estimate $\Lambda_k(t)$ by maximizing the first part using the sieve method proposed in Section 2·2. Finally, we estimate $\tilde F_k(t)$ via $\tilde F_k(t) = \{1 - e^{-\Lambda_k(t)}\}/(1 - r_k)$ $(k = 1, 2)$.
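The final back-transformation follows in one line from the definition of $\Lambda_k$. As a worked step, assuming $\Lambda_k(t) = -\log[r_k + (1 - r_k)\exp\{-\tilde\Lambda_k(t)\}]$ with $r_k < 1$:

```latex
\begin{aligned}
e^{-\Lambda_k(t)} &= r_k + (1 - r_k)\,e^{-\tilde\Lambda_k(t)}, \\
1 - e^{-\Lambda_k(t)} &= (1 - r_k)\bigl\{1 - e^{-\tilde\Lambda_k(t)}\bigr\} = (1 - r_k)\,\tilde F_k(t), \\
\tilde F_k(t) &= \frac{1 - e^{-\Lambda_k(t)}}{1 - r_k}.
\end{aligned}
```

The middle line also shows that the population cumulative risk $F_k(t) = 1 - e^{-\Lambda_k(t)}$ equals the risk among the non-cured, $\tilde F_k(t)$, scaled by the non-cured fraction $1 - r_k$.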

3. Asymptotic Results

Let $\lambda_{k0}$ and $\Lambda_{k0}$ be the true hazard rates and the cumulative hazard functions for group $k$ $(k = 1, 2)$ under the setting of Sections 2·1 and 2·2. Then the true log-hazard ratio is $\beta_0(t) = \log\{\lambda_{10}(t)/\lambda_{20}(t)\}$. We need the following conditions:

Condition 1

Both $\lambda_{10}(t)$ and $\lambda_{20}(t)$ are $r$ times continuously differentiable in $[0, \tau]$, where $r \ge 2$. In addition, there exist $g_1$ and $g_2$ such that $p_{g_1}/p_{g_2} \ne (1 - p_{g_1})/(1 - p_{g_2})$.

Condition 2

The density of $C$ has a bounded and continuous $r$th derivative in $[0, \tau]$, and $C$ is independent of $T$ conditional on $G$.

Condition 3

The number of interior knots $m_n$ satisfies $m_n^{3/2}n^{-1/2} = O(1)$ and $n^{1/2}m_n^{-2r} \to 0$ as $n$ goes to infinity.

Conditions 1 and 2 are the regularity conditions for the underlying density functions of $T$ in both groups. The second part of Condition 1 requires that the data contain at least two distinct values of $p_g$, which ensures identifiability of the underlying distributions. In Condition 3, one particular choice for the number of interior knots is $m_n = n^v$, where $1/(4r) < v \le 1/3$. Under these conditions, our first theorem gives the uniform consistency of $\hat\Lambda_{1n}$ and $\hat\Lambda_{2n}$ in $[0, \tau]$.
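To see why the stated range of $v$ satisfies Condition 3, substitute $m_n = n^v$ into the two rate requirements:

```latex
\begin{aligned}
m_n^{3/2}\,n^{-1/2} &= n^{(3v - 1)/2} = O(1) &&\Longleftrightarrow\quad v \le 1/3, \\
n^{1/2}\,m_n^{-2r} &= n^{1/2 - 2rv} \to 0 &&\Longleftrightarrow\quad v > 1/(4r).
\end{aligned}
```

Since $r \ge 2$ in Condition 1, we have $1/(4r) \le 1/8 < 1/3$, so the interval $1/(4r) < v \le 1/3$ is never empty.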

Theorem 1

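Under Conditions 1, 2 and 3 and the setting of Sections 2·1 and 2·2,

$$\sup_{t \in [0, \tau]}|\hat\Lambda_{1n}(t) - \Lambda_{10}(t)| + \sup_{t \in [0, \tau]}|\hat\Lambda_{2n}(t) - \Lambda_{20}(t)| = o_p(1), \quad n \to \infty.$$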

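To describe the asymptotic distributions of $\hat\Lambda_{1n}$ and $\hat\Lambda_{2n}$, we first introduce the sets $\mathcal{F}_{BV} = \{f(t) : f(t)$ has a total variation bounded by 1 in $[0, \tau]\}$ and $\mathcal{F}_\beta = \{g(t) : g(t)$ has its $r$th derivative bounded by 1 in $[0, \tau]\}$. We then treat both $\hat\Lambda_{1n}$ and $\hat\Lambda_{2n}$ as bounded stochastic processes indexed by $\mathcal{F}_{BV}$ by defining $\hat\Lambda_{kn}(f) = \int_0^\tau f(s)\,d\hat\Lambda_{kn}(s)$ $(k = 1, 2)$, $f \in \mathcal{F}_{BV}$. Similarly, we treat $\hat\beta_n(t)$ as a stochastic process on $\mathcal{F}_\beta$ through $\hat\beta_n(g) = \int_0^\tau g(s)\hat\beta_n(s)\,ds$, $g \in \mathcal{F}_\beta$. The following theorem shows the weak convergence of these stochastic processes.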

Theorem 2

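Consider $\{\hat\Lambda_{1n}(t) - \Lambda_{10}(t), \hat\Lambda_{2n}(t) - \Lambda_{20}(t)\}$ as a stochastic process in $l^\infty(\mathcal{F}_{BV} \times \mathcal{F}_{BV})$. Then under Conditions 1, 2 and 3 and the setting of Sections 2·1 and 2·2, $n^{1/2}\{\hat\Lambda_{1n}(t) - \Lambda_{10}(t), \hat\Lambda_{2n}(t) - \Lambda_{20}(t)\}$ converges in distribution to a mean-zero Gaussian process in $l^\infty(\mathcal{F}_{BV} \times \mathcal{F}_{BV})$, as $n \to \infty$. Furthermore, $\hat\Lambda_{1n}$ and $\hat\Lambda_{2n}$ are semiparametrically efficient in terms of the definition in Bickel et al. (1993). In addition, as a stochastic process in $l^\infty(\mathcal{F}_\beta)$, $n^{1/2}(\hat\beta_n - \beta_0)$ converges in distribution to a mean-zero Gaussian process, as $n \to \infty$.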

Remark 1

Theorem 2 establishes that $n^{1/2}\{\hat\Lambda_{1n}(t) - \Lambda_{10}(t)\}$ and $n^{1/2}\{\hat\Lambda_{2n}(t) - \Lambda_{20}(t)\}$ converge in distribution to Gaussian processes in $l^\infty([0, \tau])$. By the delta method, this also holds for the corresponding distribution function estimators, $\hat F_{1n}(t) = 1 - \exp\{-\hat\Lambda_{1n}(t)\}$ and $\hat F_{2n}(t) = 1 - \exp\{-\hat\Lambda_{2n}(t)\}$. Thus the sieve nonparametric maximum likelihood estimators $\hat F_{1n}$ and $\hat F_{2n}$ achieve the semiparametric efficiency bound and are optimal for the censored mixture data.

Here, semiparametric efficiency is defined in the sense of Bickel et al. (1993, Chapter 6). Theorem 2 shows that $\hat\Lambda_{kn}$, as a function estimator in $BV[0, \tau]$, is semiparametrically efficient, which means that any bounded linear functional of $\hat\Lambda_{kn}$ achieves its efficiency bound asymptotically. The weak convergence in Theorem 2 ensures that we can construct a valid confidence band based on these estimators. The proofs of Theorems 1 and 2 are in the Appendix. The main technical challenge is to handle the mixed convergence rates of the infinite-dimensional parameter estimators, since $\hat\Lambda_{kn}$ has an $n^{1/2}$-convergence rate while $\hat\beta_n(t)$ has a slower convergence rate. In the proof of Theorem 2, with the derived rates for $\hat\Lambda_{kn}$ and $\hat\beta_n(t)$ under some suitable norms, the master Z-theorem in Section 3·3 of van der Vaart & Wellner (1996) is applied to derive the asymptotic distributions of the estimators. These theorems also hold for the estimators based on cure rate survival data, because the likelihood function used in the estimation has a similar form. Although the proposed method is fully efficient under the assumption of independent $T_i$ given the mutation status $B_i$, it can be easily generalized to correlated family data by maximizing

$$\prod_{i=1}^{n}\prod_{j=1}^{n_i}\prod_{g=1}^{m}\left[\{p_g f_1(Y_{ij}) + (1 - p_g) f_2(Y_{ij})\}^{\Delta_{ij}}\{1 - p_g F_1(Y_{ij}) - (1 - p_g) F_2(Y_{ij})\}^{1 - \Delta_{ij}}\right]^{I(G_{ij} = g)},$$

where i indicates the family and j indicates an individual in the family. In this case, the proposed inference procedure including the expectation-maximization algorithm and bootstrap over independent families is still valid, and Theorems 1 and 2 hold except that the derived estimators may not achieve the semiparametric efficiency bound due to the maximization of a marginal likelihood.
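Resampling whole families, rather than individuals, keeps the within-family correlation intact in each bootstrap sample. The sketch below illustrates the idea with hypothetical names: `records` is a list of `(family_id, observation)` pairs and `estimator` is any user-supplied function of a list of observations (in practice, the expectation-maximization fit). The default of 100 replicates mirrors the number reported as usually sufficient in Section 2·2.

```python
import random


def family_bootstrap(records, estimator, n_boot=100, seed=0):
    """Cluster bootstrap: draw families with replacement and re-estimate."""
    rng = random.Random(seed)
    families = {}
    for fam, obs in records:
        families.setdefault(fam, []).append(obs)
    ids = list(families)
    estimates = []
    for _ in range(n_boot):
        sample = []
        # Draw as many families as in the original data, with replacement;
        # every member of a drawn family enters the bootstrap sample.
        for fam in rng.choices(ids, k=len(ids)):
            sample.extend(families[fam])
        estimates.append(estimator(sample))
    return estimates
```

Percentiles or the standard deviation of the returned estimates can then be used for pointwise confidence intervals or standard errors.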

4. Simulation Studies

Extensive simulation studies were conducted to compare the small sample performances of the proposed and existing methods. Our first simulation study used the same distribution functions as in Qin et al. (2014). Specifically, for the carriers, F1(t) = {1 – exp(–t)}/{1 – exp(–10)}, while for the non-carriers, F2(t) = {1 – exp(–t/2·8)}/{1 – exp(–10/2·8)} for 0 ≤ t ≤ 10. The mutation probability pi was randomly chosen from either the set Case I: (1, 0·6, 0·2, 0·16) or Case II: (0·75, 0·6, 0·5, 0·16). The censoring time followed a uniform distribution to yield a censoring rate of 20% or 40%. In the second simulation study, we imitated the results from the Parkinson's disease study described in Section 5: we generated survival times for carriers and non-carriers using distributions similar to the estimated distributions in the actual data, F1 = Weibull (5·0, 102), F2 = Weibull (5·0, 125). Furthermore, the sample size was n = 2275 and the mutation probability pi was taken from (0, 0·02, 0·51, 1), as in the real example. The censoring times were generated from a uniform distribution to achieve a censoring rate of 40% or 80%.
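Data from the first simulation design can be generated by inverting the truncated exponential distribution functions given above. The sketch below is illustrative only: the function names are hypothetical, and `cens_upper`, the upper limit of the uniform censoring distribution, is an assumed tuning parameter, since the text specifies the target censoring rates but not the limits used to achieve them.

```python
import math
import random


def sample_truncated_exp(scale, rng, upper=10.0):
    """Inverse-CDF draw from F(t) = {1 - exp(-t/scale)} / {1 - exp(-upper/scale)}."""
    u = rng.random()
    return -scale * math.log(1.0 - u * (1.0 - math.exp(-upper / scale)))


def simulate_mixture(n, probs, cens_upper, seed=0):
    """Return a list of (Y, Delta, p) triples; carrier status is not recorded."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        p = rng.choice(probs)           # mixing probability for this individual
        carrier = rng.random() < p      # latent mutation status
        t = sample_truncated_exp(1.0 if carrier else 2.8, rng)
        c = rng.uniform(0.0, cens_upper)  # assumed censoring range
        rows.append((min(t, c), int(t <= c), p))
    return rows


# Case I mixing probabilities from the text; cens_upper = 12 is arbitrary.
sim_rows = simulate_mixture(200, (1, 0.6, 0.2, 0.16), cens_upper=12.0, seed=1)
```

In practice `cens_upper` would be tuned, for example by trial and error on a large sample, until the empirical censoring rate matches the 20% or 40% target.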

When implementing our method, we used cubic B-spline functions to estimate $\beta(t)$. The number of knots was set at $m_n = \lfloor n^{1/3}\rfloor - 1$, and the interior knots were placed at evenly spaced quantiles of the observed failure times. Some neighboring knots were combined if the data were found to be too sparse to stably estimate the coefficient of a particular basis function. We also experimented with $m_n/2$ or $2m_n$ interior knots, and the estimates for $\Lambda_1(t)$ and $\Lambda_2(t)$ varied very little. To avoid convergence to a local maximum in the expectation-maximization algorithm, we used different initial estimators, including the estimates from a published method such as that of Qin et al. (2014). Empirically, our algorithm converged to the same results. We used 500 bootstrap samples for variance estimation. Furthermore, we compared our method with the estimator in Qin et al. (2014), which sequentially censored the observed event times to construct a binomial likelihood and applied the pooled-adjacent-violators algorithm for estimation.
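The knot rule can be sketched as follows. The helper names are hypothetical, and linear interpolation between order statistics is an assumed quantile convention, since the text does not say which definition of the quantile was used. An integer cube root is used because a floating-point `n ** (1 / 3)` can fall just below an exact cube.

```python
import math


def integer_cube_root(n):
    """Largest integer k with k**3 <= n, robust to floating-point error."""
    k = round(n ** (1.0 / 3.0))
    while k ** 3 > n:
        k -= 1
    while (k + 1) ** 3 <= n:
        k += 1
    return k


def n_interior_knots(n):
    """m_n = floor(n^(1/3)) - 1, the rule stated in the text."""
    return integer_cube_root(n) - 1


def quantile_knots(event_times, m):
    """m interior knots at the j/(m+1) quantiles (j = 1..m) of the event times,
    using linear interpolation between order statistics (an assumed convention)."""
    ts = sorted(event_times)
    knots = []
    for j in range(1, m + 1):
        pos = j / (m + 1) * (len(ts) - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(ts) - 1)
        knots.append(ts[lo] + (pos - lo) * (ts[hi] - ts[lo]))
    return knots
```

For the sample sizes used in the simulations, the rule gives 3 interior knots for n = 100, 5 for n = 300 and 12 for n = 2275.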

The simulation results from 500 replicates for the first scenario are given in Table 1. We present the average estimated values of the cumulative distribution functions F1 and F2 at various quartiles. Table 1 suggests that both the sieve estimator and the method of Qin et al. (2014) have small bias, the variance estimate based on bootstrap agrees adequately with the empirical variability, and the coverage probabilities are close to the nominal level. The sieve estimator is more efficient than the method of Qin et al. (2014) in all simulation settings, and the efficiency gain, which can be as large as 60%, is more evident for the upper quartiles and for the higher censoring rate. A similar advantage of the sieve estimators is seen in Table 2 for the second simulation scenario. Our method performs well even under 80% censoring. The efficiency gain is up to 15%. In the Supplementary Material, we report root integrated mean squared errors and the average of the point-wise variance for the estimators of Λ's. Our estimators for Λ's have smaller estimation errors than those of Qin et al. (2014), especially for the estimation of Λ1.

Table 1.

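Summary results for the estimated distribution functions in the first simulation scenario (×10⁻²)

| n | c% | Quantity | Case I Proposed Bias | Case I Proposed SD | Case I Proposed SE | Case I Proposed CP | Case I EM-PAVA Bias | Case I EM-PAVA SD | Case I Ratio | Case II Proposed Bias | Case II Proposed SD | Case II Proposed SE | Case II Proposed CP | Case II EM-PAVA Bias | Case II EM-PAVA SD | Case II Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | 20% | F1(Q0·25) | –0·8 | 7·2 | 7·1 | 93 | –0·4 | 7·4 | 105·5 | –1·3 | 8·9 | 7·9 | 90 | –0·1 | 9·5 | 112·2 |
| 100 | 20% | F1(Q0·50) | –0·8 | 9·0 | 8·9 | 93 | –0·2 | 9·2 | 104·6 | –2·7 | 10·9 | 10·4 | 92 | 0·0 | 11·6 | 114·0 |
| 100 | 20% | F1(Q0·75) | –0·8 | 8·0 | 8·1 | 94 | –0·1 | 8·4 | 108·5 | –3·2 | 10·4 | 10·1 | 92 | 0·6 | 11·8 | 130·0 |
| 100 | 20% | F2(Q0·25) | 0·6 | 7·9 | 8·1 | 94 | 0·4 | 8·1 | 104·7 | 2·8 | 10·9 | 10·9 | 93 | 0·4 | 11·0 | 101·0 |
| 100 | 20% | F2(Q0·50) | 0·4 | 9·0 | 8·8 | 94 | 0·3 | 9·2 | 103·5 | 3·8 | 10·6 | 10·9 | 94 | 0·3 | 12·1 | 128·8 |
| 100 | 20% | F2(Q0·75) | –0·5 | 7·7 | 7·3 | 94 | –0·3 | 8·0 | 108·7 | 2·8 | 8·4 | 7·8 | 89 | 1·9 | 9·5 | 130·0 |
| 100 | 40% | F1(Q0·25) | –0·6 | 7·8 | 7·2 | 92 | –0·2 | 7·9 | 103·1 | –1·1 | 8·2 | 7·7 | 92 | 0·0 | 8·6 | 110·6 |
| 100 | 40% | F1(Q0·50) | –0·7 | 9·0 | 9·1 | 94 | –0·1 | 9·3 | 106·7 | –2·9 | 10·6 | 10·3 | 92 | –0·4 | 11·8 | 123·4 |
| 100 | 40% | F1(Q0·75) | –1·1 | 9·2 | 8·8 | 92 | –0·2 | 9·6 | 109·7 | –4·8 | 11·3 | 10·4 | 90 | –1·0 | 13·0 | 130·8 |
| 100 | 40% | F2(Q0·25) | 0·0 | 8·4 | 8·3 | 92 | –0·1 | 8·6 | 102·7 | 2·4 | 11·2 | 10·7 | 91 | 0·0 | 12·1 | 115·4 |
| 100 | 40% | F2(Q0·50) | 0·3 | 10·0 | 9·7 | 94 | 0·3 | 10·1 | 102·1 | 4·1 | 12·2 | 11·7 | 91 | 1·3 | 13·7 | 125·2 |
| 100 | 40% | F2(Q0·75) | –2·4 | 12·0 | 10·2 | 88 | 2·2 | 14·2 | 140·8 | 0·6 | 11·3 | 10·7 | 88 | 7·0 | 14·3 | 160·1 |
| 300 | 20% | F1(Q0·25) | –0·5 | 4·3 | 4·3 | 94 | –0·4 | 4·4 | 102·3 | –0·8 | 5·3 | 5·3 | 94 | –0·3 | 5·4 | 105·7 |
| 300 | 20% | F1(Q0·50) | –0·1 | 5·3 | 5·2 | 94 | 0·0 | 5·3 | 103·3 | –1·7 | 6·6 | 6·8 | 95 | –0·7 | 6·8 | 108·6 |
| 300 | 20% | F1(Q0·75) | –0·4 | 4·8 | 4·7 | 95 | –0·2 | 5·0 | 110·7 | –1·9 | 6·5 | 6·7 | 94 | –0·2 | 7·2 | 122·8 |
| 300 | 20% | F2(Q0·25) | –0·1 | 4·7 | 4·8 | 95 | –0·1 | 4·7 | 103·7 | 1·5 | 6·9 | 6·9 | 95 | 0·6 | 6·9 | 100·2 |
| 300 | 20% | F2(Q0·50) | 0·0 | 5·0 | 5·1 | 96 | 0·0 | 5·0 | 103·6 | 1·9 | 7·2 | 7·0 | 95 | 0·4 | 7·7 | 113·1 |
| 300 | 20% | F2(Q0·75) | 0·0 | 4·5 | 4·4 | 95 | 0·0 | 4·5 | 103·0 | 2·4 | 5·4 | 5·2 | 91 | 1·4 | 5·7 | 112·1 |
| 300 | 40% | F1(Q0·25) | 0·1 | 4·5 | 4·4 | 95 | 0·2 | 4·5 | 100·7 | –0·6 | 5·2 | 5·2 | 94 | 0·1 | 5·4 | 106·0 |
| 300 | 40% | F1(Q0·50) | 0·3 | 5·4 | 5·4 | 96 | 0·3 | 5·4 | 101·5 | –1·6 | 6·8 | 7·0 | 95 | –0·5 | 7·1 | 108·3 |
| 300 | 40% | F1(Q0·75) | 0·1 | 5·0 | 5·1 | 96 | 0·1 | 5·1 | 107·0 | –2·4 | 6·9 | 7·1 | 94 | –0·7 | 7·6 | 121·7 |
| 300 | 40% | F2(Q0·25) | –0·4 | 4·9 | 5·0 | 95 | –0·4 | 4·9 | 101·0 | 1·7 | 6·9 | 7·2 | 95 | 0·6 | 6·9 | 100·5 |
| 300 | 40% | F2(Q0·50) | –0·3 | 5·7 | 5·8 | 95 | –0·3 | 5·9 | 104·1 | 2·6 | 7·5 | 7·7 | 94 | 0·7 | 8·0 | 114·2 |
| 300 | 40% | F2(Q0·75) | –0·7 | 7·5 | 7·1 | 92 | 0·5 | 8·3 | 124·6 | 1·3 | 8·4 | 7·8 | 91 | 3·0 | 10·3 | 150·7 |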

EM-PAVA denotes the method of Qin et al. (2014). Q0·25, Q0·5 and Q0·75 denote the first to third quartiles of F1 and F2, respectively, and c% denotes the censoring rate. Bias is the average estimation bias over 500 replications; SD is the empirical standard deviation; SE is the average of the estimated standard errors from bootstraps; CP is the actual coverage probability corresponding to nominal 95% confidence intervals; and Ratio gives the relative efficiency ratio between the proposed method and the method of Qin et al. (2014).

Table 2.

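Summary results for the estimated distribution functions in the second simulation scenario (×10⁻²)

| Censoring | Quantity | Proposed Bias | Proposed SD | Proposed SE | Proposed CP | EM-PAVA Bias | EM-PAVA SD | Ratio |
|---|---|---|---|---|---|---|---|---|
| 40% | F1(Q0·25) | –0·1 | 3·1 | 3·2 | 95 | 0·0 | 3·3 | 110·3 |
| 40% | F1(Q0·50) | –0·1 | 3·8 | 4·1 | 96 | 0·0 | 3·8 | 103·3 |
| 40% | F1(Q0·75) | –0·4 | 3·7 | 4·1 | 96 | 0·0 | 4·0 | 115·1 |
| 40% | F2(Q0·25) | 0·1 | 1·3 | 1·3 | 94 | 0·1 | 1·3 | 103·2 |
| 40% | F2(Q0·50) | 0·1 | 1·6 | 1·5 | 94 | 0·0 | 1·6 | 100·5 |
| 40% | F2(Q0·75) | 0·2 | 1·4 | 1·3 | 93 | 0·0 | 1·4 | 103·9 |
| 80% | F1(Q0·25) | –0·4 | 4·2 | 4·1 | 94 | –0·3 | 4·3 | 102·5 |
| 80% | F1(Q0·50) | –0·7 | 5·4 | 5·8 | 94 | –0·4 | 5·6 | 104·7 |
| 80% | F1(Q0·75) | –1·1 | 6·0 | 6·5 | 95 | –0·3 | 6·4 | 112·5 |
| 80% | F2(Q0·25) | 0·1 | 1·8 | 1·8 | 95 | 0·0 | 1·8 | 100·8 |
| 80% | F2(Q0·50) | 0·2 | 2·5 | 2·6 | 95 | 0·0 | 2·5 | 101·4 |
| 80% | F2(Q0·75) | –0·2 | 4·0 | 3·7 | 93 | 1·0 | 4·2 | 107·2 |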

For footnotes see Table 1.

We performed two additional simulations with crossed distributions. The results are reported in the Supplementary Material simulations 3 and 4. The findings are similar. Finally, we also conducted simulation studies to evaluate the permutation test for the Kolmogorov–Smirnov statistic comparing the two distributions. The data generation was similar to the second simulation study, except that F1 = F2 = Weibull (5·0, 102). The empirical type I error rate is 4·6% with censoring rate 40% and 5·0% with censoring rate 80%. Both are close to the nominal significance level of 5% so the proposed permutation test appears to be valid.

5. Application

Since mutations in the leucine-rich repeat kinase 2 gene were found to be a potential cause of idiopathic Parkinson's disease (Paisán-Ruíz et al., 2004), there has been great interest in estimating the cumulative risk of Parkinson's disease for the leucine-rich repeat kinase 2 mutation carriers, especially in Ashkenazi Jews, who have an increased mutation rate (Alcalay et al., 2013). Although such risk estimates are important for genetic counseling (Goldwurm et al., 2011), results on the risk for leucine-rich repeat kinase 2 carriers in the clinical literature have been inconsistent and estimates vary widely (Goldwurm et al., 2011).

To address these concerns, we aim to estimate the age-specific cumulative risk of Parkinson's disease in the leucine-rich repeat kinase 2 carriers and non-carriers. Due to the low prevalence of leucine-rich repeat kinase 2 mutations, a kin-cohort design was used (Marder et al., 2014). To avoid bias in the ascertainment of the initial samples, our analysis units are the first-degree family members, excluding the initial probands (e.g., Wacholder et al., 1998; Wang et al., 2012). Our initial probands were recruited from the Michael J. Fox Foundation Ashkenazi Jewish leucine-rich repeat kinase 2 consortium; the details of the sample were reported elsewhere (Alcalay et al., 2013). All probands were screened for G2019S mutations in the leucine-rich repeat kinase 2 gene and common mutations in the glucocerebrosidase gene. To isolate the effect of the leucine-rich repeat kinase 2 mutations on Parkinson's disease risk, we excluded participants with other known genetic risk factors such as glucocerebrosidase mutations. A validated family history instrument (Marder et al., 2003) was administered to the probands, or to the first-degree relatives themselves if the relatives were seen by a neurologist.

The data included information from 2275 first-degree relatives of the probands in the Ashkenazi Jewish leucine-rich repeat kinase 2 consortium. There were four groups of mutation probabilities, pg ∈ {0, 0·02, 0·51, 1}, with frequencies 1·6%, 70·9%, 25·4% and 2·1%, respectively. Only 3·7% of relatives had observed genotypes, that is, a corresponding pg of either 1 or 0. The first-degree relatives, including parents, siblings or children, of non-carrier probands have pg = 0·02 under a 2% population prevalence of leucine-rich repeat kinase 2 mutations in the Ashkenazi Jewish population (Orr-Urtreger et al., 2007) and the Mendelian assumption. Similarly, the first-degree relatives of heterozygous carrier probands have pg = 0·51 under the Mendelian assumption. The censoring rate was close to 95%. Due to the high censoring rate, we analyzed the data under the cure rate model (6). Individuals who did not develop Parkinson's disease by age 95 were considered immune to the disease, since the largest documented age at onset is 94 years (Driver et al., 2009). In the implementation of the proposed sieve maximum likelihood approach, we used the Bayesian information criterion to choose the number of interior knots and the degree of the B-spline basis. The choice that minimized this criterion was two interior knots and a degree of two. To ensure valid inference, we used bootstrap resampling of families to construct pointwise confidence intervals.
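Resampling whole families, rather than individual relatives, keeps relatives from the same family together in every bootstrap sample. The sketch below illustrates this with a generic estimator. It is not the authors' implementation: the percentile interval, the number of replicates and the function name `family_bootstrap_ci` are assumptions made for illustration.

```python
import random

def family_bootstrap_ci(families, estimator, n_boot=500, level=0.95, seed=0):
    """Percentile interval obtained by resampling whole families with replacement.

    families  : list of lists; each inner list holds the records of one family
    estimator : function mapping a flat list of records to a number
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Draw as many families as in the original sample, with replacement.
        sample = [rng.choice(families) for _ in families]
        # Members of a drawn family always enter the sample together.
        records = [rec for fam in sample for rec in fam]
        stats.append(estimator(records))
    stats.sort()
    lower = stats[int((1 - level) / 2 * (n_boot - 1))]
    upper = stats[int((1 + level) / 2 * (n_boot - 1))]
    return lower, upper
```

In the application, `estimator` would refit the cure rate model and return the estimated cumulative risk at a fixed age; here it is left abstract.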

In the practice of genetic counseling, it is more useful to provide the population cumulative risks, that is, Fk(t) in model (6), regardless of the cure status. Thus we report the estimates of Fk(t) in Table 3. This shows that the cumulative risk of Parkinson's disease by age 80 for carriers can be as high as 27·4% with 95% confidence interval 17·6%–39·1%, while it is 10·4% with 95% confidence interval 7·8%–13·2% for non-carriers. The risk of Parkinson's disease in non-carriers is quite high compared to the general non-Ashkenazi Jewish population, whose risk is normally 1%, indicating that the non-carriers may have other risk mutations for Parkinson's disease. The estimated lifetime cumulative risk is consistent with some previous findings in Ashkenazi Jews for leucine-rich repeat kinase 2 mutation carriers (Wang et al., 2008), but it contrasts with some other studies, which estimate the risk of Parkinson's disease to be 100% in leucine-rich repeat kinase 2 carriers (Lesage et al., 2005). Methodological issues, including assigning individuals with unobserved leucine-rich repeat kinase 2 genotypes to carrier or non-carrier groups based on their Parkinson's disease status, may have contributed to this large difference from those studies. Figure 1 presents the estimated cumulative Parkinson's disease distributions in the two mutation groups and their pointwise confidence intervals. The carrier group shows a dramatic increase in the risk of Parkinson's disease after age 60, compared to a slower increase in the non-carrier group.

Table 3.

Estimated cumulative risk of Parkinson's disease onset in leucine-rich repeat kinase 2 carriers and non-carriers in the Ashkenazi Jewish leucine-rich repeat kinase 2 Consortium study (×10⁻²)

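| Age | Carrier F1(·): Est. | Carrier F1(·): SE | Carrier F1(·): 95% CI | Non-carrier F2(·): Est. | Non-carrier F2(·): SE | Non-carrier F2(·): 95% CI |
|---|---|---|---|---|---|---|
| 20 | 0·0 | 0·0 | (0·0, 0·1) | 0·1 | 0·1 | (0·0, 0·3) |
| 30 | 0·1 | 0·1 | (0·0, 0·3) | 0·1 | 0·1 | (0·0, 0·3) |
| 40 | 0·3 | 0·4 | (0·0, 1·4) | 0·2 | 0·1 | (0·0, 0·4) |
| 50 | 1·8 | 0·8 | (0·5, 3·4) | 0·6 | 0·2 | (0·3, 1·1) |
| 60 | 8·1 | 1·9 | (4·8, 12·5) | 2·8 | 0·6 | (1·6, 4·1) |
| 70 | 18·3 | 3·9 | (11·2, 26·2) | 6·8 | 1·1 | (4·9, 9·0) |
| 80 | 27·4 | 5·7 | (17·6, 39·1) | 10·4 | 1·4 | (7·8, 13·2) |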

95% CI, 95% confidence interval for estimated value.

Fig. 1.

Estimated cumulative risk functions for Parkinson's disease onset in leucine-rich repeat kinase 2 carriers and non-carriers. The solid curve is the estimated distribution function for carriers and the dashed curve is for non-carriers. The dotted curves are their pointwise 95% confidence intervals. The shaded regions indicate the area covered by the pointwise confidence intervals.

To compare the distributions, we used the Kolmogorov–Smirnov test to examine the maximal difference between the two groups. We computed the p-value for this test based on 1,000 permutations, where for each permutation, the grouping variable Gi was perturbed. The resulting p-value is less than 0·001. It may be of practical interest to examine some classes of parametric models for the genetic risk functions. For example, within the class of Weibull distributions, we find the estimated distribution for the carriers is adequately approximated by a Weibull distribution with shape and scale parameters 5 and 102, while the estimated distribution for the non-carriers is close to a Weibull with shape and scale parameters 5 and 125.
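As a rough check of these parametric approximations, the Weibull distribution function F(t) = 1 − exp{−(t/scale)^shape} can be evaluated at the ages listed in Table 3. The sketch below assumes this standard shape and scale parametrization, which the text does not state explicitly. The helper name `weibull_cdf` is hypothetical.

```python
import math

def weibull_cdf(t, shape, scale):
    """Weibull distribution function under the shape/scale parametrization."""
    return 1.0 - math.exp(-((t / scale) ** shape))

ages = [20, 30, 40, 50, 60, 70, 80]
# Cumulative risks in percent under the two fitted Weibull approximations.
carrier = [100 * weibull_cdf(a, 5, 102) for a in ages]
noncarrier = [100 * weibull_cdf(a, 5, 125) for a in ages]

for a, c, n in zip(ages, carrier, noncarrier):
    print(f"age {a}: carriers {c:.1f}%, non-carriers {n:.1f}%")
```

At age 80 this gives roughly 26% for carriers and 10% for non-carriers, which can be compared with the nonparametric estimates of 27·4% and 10·4% in Table 3.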

The cure rates in carriers and non-carriers were estimated to be 0·3% with 95% confidence interval 0%–19·8% and 26·6% with 95% confidence interval 17·9%–34·6%, respectively. There is a significant difference of 26·3% between the two rates, with 95% confidence interval 3·6%–34·3%. In the non-cured population, the cumulative risk of Parkinson's disease for carriers by age 80, that is, the carrier distribution function among non-cured subjects defined in Section 2·3 evaluated at age 80, was 27·5%, compared to 14·2% for the non-carrier group. The low cure rate in the carrier group suggests a high risk of Parkinson's disease had a subject lived long enough. This observation is consistent with the existing clinical literature. For example, Latourelle et al. (2008) reported a high lifetime risk of Parkinson's disease, where the median risk of disease was about 70% and the upper limit of the 95% confidence interval was about 80%.
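The population risks in Table 3 and the risks in the non-cured population can be related by a short calculation. The sketch below assumes that model (6), which is not reproduced in this section, has the usual mixture cure form, in which the population risk equals (1 − cure rate) times the risk among non-cured subjects. The function name `population_risk` is hypothetical.

```python
def population_risk(cure_rate, risk_among_noncured):
    """Population cumulative risk under an assumed mixture cure model."""
    return (1.0 - cure_rate) * risk_among_noncured

# Age-80 values quoted in the text: cure rates 0.3% and 26.6%,
# risks among non-cured subjects 27.5% and 14.2%.
carrier_risk = population_risk(0.003, 0.275)
noncarrier_risk = population_risk(0.266, 0.142)

print(f"carriers: {100 * carrier_risk:.1f}%, non-carriers: {100 * noncarrier_risk:.1f}%")
```

Under this assumption the products agree, after rounding, with the age-80 entries of Table 3 (27·4% and 10·4%).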

6. Discussion

One interesting theoretical problem is to tackle the different convergence rates of the nonparametric maximum likelihood and the sieve estimators based on B-splines. Alternatively, sieve estimation can also be applied to Λ2, as done by Cheng & Wang (2011) for a semiparametric additive transformation model with current status data. However, one advantage of using the nonparametric maximum likelihood estimator for Λ2 is that there is no need to determine the number of sieves. In addition, our nonparametric maximum likelihood estimator has an explicit solution in the M-step of the expectation-maximization algorithm, which leads to computational gain.

Using the re-parametrized likelihood function (2), the proposed method can be readily generalized to regression problems where other environmental covariates are included through a proportional hazards model in both groups (Diao & Lin, 2005). Lastly, to efficiently analyze family data, an alternative method using frailty models may be considered to account for within-family dependence through shared frailties.

ACKNOWLEDGMENTS

This work is supported by the Michael J. Fox Foundation, U.S. National Institutes of Health grants, the National Natural Science Foundation of China, and the China Scholarship Council. We thank the editor, an associate editor and referees for helpful comments that improved the paper.

Appendix

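Before proving Theorem 1 and Theorem 2, we first show that the information operator for Λ2 and β is invertible. For G = g, we define

$$
\begin{aligned}
A_1(\Delta,G,Y) &= \Xi_1\left[(1-p_g)e^{-\Lambda_2(Y)} + p_g e^{\beta(Y)}\exp\left\{-\int_0^Y e^{\beta(t)}\,d\Lambda_2(t)\right\}\right],\\
A_2(\Delta,G,Y) &= (1-p_g)e^{-\Lambda_2(Y)}(\Xi_1+\Xi_2),\\
A_3(\Delta,G,Y) &= p_g\left\{e^{\beta(Y)}\Xi_1+\Xi_2\right\}\exp\left\{-\int_0^Y e^{\beta(t)}\,d\Lambda_2(t)\right\},\\
B_1(\Delta,G,Y) &= p_g e^{\beta(Y)}\exp\left\{-\int_0^Y e^{\beta(t)}\,d\Lambda_2(t)\right\}\Xi_1,
\end{aligned}
$$

where $\Xi_1=\Delta\left(p_g e^{\beta(Y)}\exp\left[-\int_0^Y \exp\{\beta(t)\}\,d\Lambda_2(t)\right]+(1-p_g)\exp\{-\Lambda_2(Y)\}\right)^{-1}$ and $\Xi_2=(1-\Delta)\left(p_g\exp\left[-\int_0^Y \exp\{\beta(t)\}\,d\Lambda_2(t)\right]+(1-p_g)\exp\{-\Lambda_2(Y)\}\right)^{-1}$. The log-likelihood function for a single subject is

$$
\begin{aligned}
l(\Lambda_2,\beta)=\sum_{g=1}^{m} I(G=g)\Biggl(&\Delta\log\left[p_g\lambda_2(Y)e^{\beta(Y)}\exp\left\{-\int_0^Y e^{\beta(t)}\,d\Lambda_2(t)\right\}+(1-p_g)\lambda_2(Y)e^{-\Lambda_2(Y)}\right]\\
&+(1-\Delta)\log\left[p_g\exp\left\{-\int_0^Y e^{\beta(t)}\,d\Lambda_2(t)\right\}+(1-p_g)\exp\{-\Lambda_2(Y)\}\right]\Biggr).
\end{aligned}
$$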

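By differentiating $l(\Lambda_2,\beta)$ with respect to $\Lambda_2$ and $\beta$ along sub-models $d\Lambda_2(1+\epsilon h_1)$ and $\beta+\epsilon h_2$, respectively, we obtain the following score operators:

$$
\begin{aligned}
l_{\Lambda_2}(\Lambda_2,\beta)(h_1) &= A_1h_1(Y)-\int_0^Y h_1(t)\left[A_2+A_3\exp\{\beta(t)\}\right]d\Lambda_2(t),\\
l_{\beta}(\Lambda_2,\beta)(h_2) &= B_1h_2(Y)-A_3\int_0^Y h_2(t)\exp\{\beta(t)\}\,d\Lambda_2(t).
\end{aligned}
$$

Thus, if we define $\langle f_1,f_2\rangle=E(f_1f_2)$, for any $L_2(P)$-integrable functions $\{w_1(\Delta,G,Y),w_2(\Delta,G,Y)\}$, we have

$$
\left\langle\begin{pmatrix} l_{\Lambda_2}(\tilde h_1)\\ l_{\beta}(\tilde h_2)\end{pmatrix},\begin{pmatrix} w_1\\ w_2\end{pmatrix}\right\rangle
=\left\langle\begin{pmatrix} h_1\\ h_2\end{pmatrix},\begin{pmatrix} E(A_1w_1\mid Y)\\ E(B_1w_2\mid Y)\end{pmatrix}\right\rangle
+\int_0^Y\left\langle\begin{pmatrix} h_1\\ h_2\end{pmatrix},\begin{pmatrix} E\{(A_2+A_3)w_1\mid Y=t\}e^{\beta(t)}\\ E\{A_3w_2\mid Y=t\}e^{\beta(t)}\end{pmatrix}\right\rangle d\Lambda_2(t).
$$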

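Thus,

$$
\begin{aligned}
l^{*}_{\Lambda_2}l_{\Lambda_2}(h_1) ={}& E(A_1^2\mid Y)h_1(Y)-\int_0^Y E\left[A_1\{A_2+A_3e^{\beta(t)}\}\mid Y=t\right]h_1(t)\,d\Lambda_2(t)\\
&-\int_0^Y E\{(A_2+A_3)A_1\mid Y=t\}h_1(t)\,d\Lambda_2(t)\\
&+\int_0^Y\int_0^t E\left[I(G=g)(A_2+A_3)\{A_2+A_3e^{\beta(t)}\}\mid Y=s\right]d\Lambda_2(s)\,h_1(t)\,d\Lambda_2(t),\\
l^{*}_{\beta}l_{\beta}(h_2) ={}& E(B_1^2\mid Y)h_2(Y)-\int_0^Y E\{B_1A_3e^{\beta(t)}\mid Y=t\}h_2(t)\,d\Lambda_2(t)\\
&-\int_0^Y E\left\{A_3\sum_{g=1}^{m}I(G=g)B_1\mid Y=t\right\}h_2(t)\,d\Lambda_2(t)\\
&+\int_0^Y\left[\int_0^t E\{A_3A_3e^{\beta(t)}\mid Y=s\}\,d\Lambda_2(s)\right]h_2(t)\,d\Lambda_2(t),
\end{aligned}
$$

where $(l^{*}_{\Lambda_2},l^{*}_{\beta})$ is the dual operator of $(l_{\Lambda_2},l_{\beta})$. Therefore, the information operator $I(\Lambda_2,\beta)=(l^{*}_{\Lambda_2},l^{*}_{\beta})(l_{\Lambda_2},l_{\beta})$ can be expressed as a Fredholm operator of the first kind, which is the summation of an invertible operator and an integral operator when $\Lambda_2=\Lambda_{20}$ and $\beta=\beta_0$. As a result, to show that $I(\Lambda_{20},\beta_0)$ is invertible, following Rudin (1973), it suffices to show that $I(\Lambda_{20},\beta_0)$ is one-to-one. That is, we need to prove that for any $h_1$ and $h_2$, if $I(\Lambda_{20},\beta_0)(h_1,h_2)=0$, which is equivalent to $l_{\Lambda_{20}}(h_1)+l_{\beta_0}(h_2)=0$, then $h_1\equiv0$ and $h_2\equiv0$. Suppose that $l_{\Lambda_{20}}(h_1)+l_{\beta_0}(h_2)=0$. Letting $\Delta=1$ and $G=g$ and integrating $Y$ from 0 to any $t\in[0,\tau]$, we obtain $\int_0^t\left[p_g\{h_1(s)+h_2(s)\}e^{\beta_0(s)}+(1-p_g)h_2(s)\right]d\Lambda_{20}(s)=0$. Thus, $p_g\{h_1(t)+h_2(t)\}e^{\beta_0(t)}+(1-p_g)h_2(t)=0$. From Condition 1, we immediately conclude that $h_1=h_2\equiv0$. Therefore, $I(\Lambda_{20},\beta_0)$ is continuously invertible.

Furthermore, we consider a different Banach space $H=\{(h_1,h_2):h_1\in L_2[0,\tau],\,h_2\in L_2[0,\tau]\}$. Then the above arguments still hold. Hence, the invertibility of $I(\Lambda_{20},\beta_0)$ implies $\|I(\Lambda_{20},\beta_0)(h_1,h_2)\|^2_{L_2}\ge c(\|h_1\|^2_{L_2}+\|h_2\|^2_{L_2})$, where $c$ is a constant. Furthermore, if $\|\Lambda_2-\Lambda_{20}\|+\|\beta-\beta_0\|<\epsilon_0$ for a small $\epsilon_0$, the continuity of $I$ in this space gives

$$
\|I(\Lambda_2,\beta)(h_1,h_2)\|^2_{L_2}\ge\frac{c}{2}\left(\|h_1\|^2_{L_2}+\|h_2\|^2_{L_2}\right).
$$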

We will use this fact in the following consistency proof.

Proof of Theorem 1

We define a sieve space

$$
S_n=\left\{(\Lambda_2,\beta):\Lambda_2\text{ is a step function with jumps at the observed events},\ \beta(t)=\sum_{j=1}^{K_n}\alpha_j\phi_j(t),\ \text{where the }\phi_j\text{ are B-spline bases}\right\}.
$$
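To make the spline part of the sieve concrete, the sketch below evaluates a B-spline basis with the Cox–de Boor recursion and forms β(t) as a linear combination of the basis functions. It uses a degree of two and two interior knots on [0, 95], matching the configuration selected in the application, but the equally spaced knot locations, the coefficient values and the function names `bspline_basis` and `spline_value` are assumptions for illustration only.

```python
def bspline_basis(t, knots, degree):
    """Evaluate all B-spline basis functions at t by the Cox-de Boor recursion.

    knots is a non-decreasing clamped knot vector in which each boundary
    knot is repeated degree + 1 times.
    """
    # Degree-0 indicators; the last non-empty interval is closed on the right.
    last = max(i for i in range(len(knots) - 1) if knots[i] < knots[i + 1])
    b = [1.0 if (knots[i] <= t < knots[i + 1]) or (i == last and t == knots[-1]) else 0.0
         for i in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        nxt = []
        for i in range(len(knots) - d - 1):
            left = right = 0.0
            if knots[i + d] > knots[i]:
                left = (t - knots[i]) / (knots[i + d] - knots[i]) * b[i]
            if knots[i + d + 1] > knots[i + 1]:
                right = (knots[i + d + 1] - t) / (knots[i + d + 1] - knots[i + 1]) * b[i + 1]
            nxt.append(left + right)
        b = nxt
    return b  # one value per basis function

def spline_value(t, alpha, knots, degree):
    """beta(t) = sum_j alpha_j * phi_j(t)."""
    return sum(a * phi for a, phi in zip(alpha, bspline_basis(t, knots, degree)))

degree = 2
# Two hypothetical, equally spaced interior knots on [0, 95].
knots = [0.0, 0.0, 0.0, 95 / 3, 190 / 3, 95.0, 95.0, 95.0]
alpha = [0.0, 0.5, 1.0, 1.5, 2.0]  # illustrative coefficients, not estimates
print(spline_value(60.0, alpha, knots, degree))
```

With two interior knots and degree two there are 2 + 2 + 1 = 5 basis functions, so only five coefficients need to be estimated for β.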

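First, we show that there exists a local maximum of the observed data likelihood function over $S_n$ such that the proposed estimators $(\hat\Lambda_2,\hat\beta)$ converge to the true parameters in probability under the norm ∥ · ∥.

By Schumaker (2007) and Condition 1, there exists a function $\hat\beta_0(t)=\sum_{j=1}^{K_n}\alpha_{j0}\phi_j(t)$ such that $\|\hat\beta_0-\beta_0\|=O(m_n^{-r})$. Then we consider the neighborhood of $\hat\beta_0$ in the following sieve space: $N_n=\left\{\beta:\beta(t)=\sum_{j=1}^{K_n}\alpha_j\phi_j(t)\text{ with }\left(\sum_{j=1}^{K_n}|\alpha_j-\alpha_{j0}|^2\right)^{1/2}\le\epsilon_n\right\}$, where $\epsilon_n$ is to be chosen later. For each $\beta\in N_n$, we define $\hat\Lambda_{2,\beta}=\arg\max_{\Lambda_2}P_nl(\Lambda_2,\beta)$, where $\Lambda_2$ is a step function with jumps at the observed failure events. If we choose $\epsilon_n$ such that $m_n^{3/2}\epsilon_n\to0$, then for $\beta\in N_n$,

$$
\|\beta-\hat\beta_0\|_{BV}\le\sum_{j=1}^{K_n}|\alpha_j-\alpha_{j0}|\,\|\phi_j\|_{BV}\le O(m_n)\left\{\epsilon_n^2(m_n+l)\right\}^{1/2}\to0.
$$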

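Therefore, β has bounded total variation. Define

$$
\hat\Lambda_{20}(t)=\sum_{j=1}^{n}\int_0^t\frac{I(Y_j\ge s)}{\sum_{k=1}^{n}I(Y_k\ge s)\left[q_k\exp\{\beta_0(s)\}+(1-q_k)\right]}\,dN_j(s).
$$

It is easy to see that $\|\hat\Lambda_{20}-\Lambda_{20}\|_{BV}=O_p(n^{-1/2})$. Moreover, $P_nl(\hat\Lambda_{2,\beta},\beta)\ge P_nl(\hat\Lambda_{20},\beta)$, where $P_n$ denotes the empirical measure. Note that $P_nl(\hat\Lambda_{2,\beta},\beta)-P_nl(\hat\Lambda_{20},\beta)$ equals

$$
\begin{aligned}
n^{-1}\sum_{i=1}^{n}\sum_{g=1}^{m}I(G_i=g)\Biggl(&\Delta_i\log\left[\frac{\delta\hat\Lambda_{2,\beta}(Y_i)}{\delta\hat\Lambda_{20}(Y_i)}\cdot\frac{p_{ig}e^{\beta(Y_i)}\exp\left\{-\int_0^{Y_i}e^{\beta(t)}\,d\hat\Lambda_{2,\beta}(t)\right\}+(1-p_{ig})e^{-\hat\Lambda_{2,\beta}(Y_i)}}{p_{ig}e^{\beta(Y_i)}\exp\left\{-\int_0^{Y_i}e^{\beta(t)}\,d\hat\Lambda_{20}(t)\right\}+(1-p_{ig})e^{-\hat\Lambda_{20}(Y_i)}}\right]\\
&+(1-\Delta_i)\log\left[\frac{p_{ig}\exp\left\{-\int_0^{Y_i}e^{\beta(t)}\,d\hat\Lambda_{2,\beta}(t)\right\}+(1-p_{ig})e^{-\hat\Lambda_{2,\beta}(Y_i)}}{p_{ig}\exp\left\{-\int_0^{Y_i}e^{\beta(t)}\,d\hat\Lambda_{20}(t)\right\}+(1-p_{ig})e^{-\hat\Lambda_{20}(Y_i)}}\right]\Biggr).
\end{aligned}
$$

It is easy to show that $\delta\hat\Lambda_{20}(t)=O_p(1/n)$, so we conclude that there exist constants $c_1$ and $c_2$ independent of β such that

$$
\begin{aligned}
0&\le P_nl(\hat\Lambda_{2,\beta},\beta)-P_nl(\hat\Lambda_{20},\beta)\\
&\le n^{-1}\sum_{i=1}^{n}\Bigl(c_1\log\{n\,\delta\hat\Lambda_{2,\beta}(Y_i)\}\Delta_i+c_2\log\left[p_i\exp\left\{-\int_0^{\tau}e^{\beta(s)}\,d\hat\Lambda_{2,\beta}(s)\right\}+(1-p_i)e^{-\hat\Lambda_{2,\beta}(\tau)}\right]\Bigr)\\
&\le n^{-1}\sum_{i=1}^{n}\Bigl(c_1\log\{n\,\delta\hat\Lambda_{2,\beta}(Y_i)\}\Delta_i+c_2\log\left\{p_ie^{-c\hat\Lambda_{2,\beta}(\tau)}+(1-p_i)e^{-c\hat\Lambda_{2,\beta}(\tau)}\right\}\Bigr)\\
&\le n^{-1}\sum_{i=1}^{n}c_1\log\{n\,\delta\hat\Lambda_{2,\beta}(Y_i)\}\Delta_i-c_2\hat\Lambda_{2,\beta}(\tau)+O_p(1).
\end{aligned}
$$

Hence, $n^{-1}\sum_{i=1}^{n}\Delta_i\log\{n\,\delta\hat\Lambda_{2,\beta}(Y_i)\}-c_1\hat\Lambda_{2,\beta}(\tau)$ is bounded from below in probability. Since $n^{-1}\sum_{i=1}^{n}\Delta_i\log\{n\,\delta\hat\Lambda_{2,\beta}(Y_i)\}$ is less than $\log\left\{\sum_{i=1}^{n}\Delta_i\,\delta\hat\Lambda_{2,\beta}(Y_i)\right\}=\log\hat\Lambda_{2,\beta}(\tau)$, $\limsup_n\left\{\sup_{\beta\in N_n}\hat\Lambda_{2,\beta}(\tau)\right\}$ is finite with probability tending to one. As a result, $\{\hat\Lambda_{2,\beta}:\beta\in N_n\}$ consists of bounded and increasing functions.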
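The Breslow-type estimator $\hat\Lambda_{20}$ used in this argument is a step function that jumps at each observed event time by the number of events divided by a weighted count of the subjects still at risk. A minimal sketch is given below. The weights q_k are treated as given inputs because this section does not define them further, and the function name `weighted_breslow` is hypothetical.

```python
import math

def weighted_breslow(times, events, q, beta0):
    """Cumulative hazard after each distinct event time.

    times  : observed times Y_k
    events : 1 if the event was observed, 0 if censored
    q      : weights q_k in [0, 1], one per subject
    beta0  : function s -> beta_0(s)
    """
    out = []
    total = 0.0
    for s in sorted({t for t, d in zip(times, events) if d == 1}):
        n_events = sum(1 for t, d in zip(times, events) if d == 1 and t == s)
        # Weighted size of the risk set at time s.
        denom = sum(qk * math.exp(beta0(s)) + (1.0 - qk)
                    for t, qk in zip(times, q) if t >= s)
        total += n_events / denom
        out.append((s, total))
    return out

# With all q_k = 0 the weights equal one and the estimator reduces to Nelson-Aalen.
print(weighted_breslow([1, 2, 3], [1, 0, 1], [0, 0, 0], lambda s: 0.0))
```

Each jump is of order 1/n when the risk set is of order n, which is the property used in the bound above.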

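From the fact that $P_nl_{\Lambda_2}(\hat\Lambda_{2,\beta},\beta)(h_1)=0$, we obtain $(P_n-P)l_{\Lambda_2}(\hat\Lambda_{2,\beta},\beta)(h_1)=-Pl_{\Lambda_2}(\hat\Lambda_{2,\beta},\beta)(h_1)$. The left-hand side of this equation is $O_p(n^{-1/2})$, because $l_{\Lambda_2}$ is Donsker due to the fact that both $\hat\Lambda_{2,\beta}$ and β belong to BV[0, τ]. We apply the Taylor expansion at the true $(\Lambda_{20},\beta_0)$ to the right-hand side, which gives

$$
O_p(n^{-1/2})=\left\langle I_1(\Lambda_{20},\beta_0)(h_1),\,d\hat\Lambda_{2,\beta}-d\Lambda_{20}\right\rangle_{L_2}+o\left(\|\hat\Lambda_{2,\beta}-\Lambda_{20}\|_{BV}\right)+O_p\left(\|\beta-\beta_0\|_{L_2}\right),
$$

where $I_1$ is the operator in $I$ corresponding to $\Lambda_2$. Using the invertibility of $I_1$, we have $\|\hat\Lambda_{2,\beta}-\Lambda_{20}\|_{BV}=A_n(n^{-1/2}+\|\beta-\beta_0\|_{L_2})$, where $\sup_{\beta\in N_n}|A_n|$ is a bounded random variable.

We now consider $B_n\equiv P_n\{l(\hat\Lambda_{2,\beta},\beta)-l(\hat\Lambda_{20},\hat\beta_0)\}$. First, $B_n=(P_n-P)\{l(\hat\Lambda_{2,\beta},\beta)-l(\hat\Lambda_{20},\hat\beta_0)\}+P\{l(\hat\Lambda_{2,\beta},\beta)-l(\hat\Lambda_{20},\hat\beta_0)\}$. The first term on the right-hand side is equal to $c_nn^{-1/2}$, where $\sup_{\beta\in N_n}|c_n|\to0$. For the second term, we apply the expansion at the true values and obtain

$$
-\left\langle I(\Lambda_2^{*},\beta^{*})\left(\frac{d\hat\Lambda_{2,\beta}-d\hat\Lambda_{20}}{\lambda_{20}},\,\beta-\beta_0\right),\left(\frac{d\hat\Lambda_{2,\beta}-d\hat\Lambda_{20}}{\lambda_{20}},\,\beta-\beta_0\right)\right\rangle_{L_2}+o\left(\|\hat\Lambda_{20}-\Lambda_{20}\|^2+\|\hat\beta_0-\beta_0\|^2\right),
$$

where $(\Lambda_2^{*},\beta^{*})$ is between $(\hat\Lambda_{2,\beta},\beta)$ and $(\Lambda_{20},\beta_0)$. Thus, we obtain $B_n\le c_nn^{-1/2}-(c_1/2)\|\beta-\beta_0\|^2_{L_2}+b_n(n^{-1}+m_n^{-2r})$, where $\sup_{\beta\in N_n}|b_n|\to0$. Therefore, if $\beta\in\partial N_n$, the boundary of $N_n$, the result from de Boor (1978) gives $\|\beta-\beta_0\|^2_{L_2}\ge c_2\epsilon_n^2$, so that $B_n\le\{c_nn^{-1/2}+b_n(n^{-1}+m_n^{-2r})\}-c_1c_2\epsilon_n^2/2$. Hence, if we choose $\epsilon_n^2=4(c_1c_2)^{-1}\{c_nn^{-1/2}+b_n(n^{-1}+m_n^{-2r})\}$, then $B_n<0$, noting that such $\epsilon_n$ still satisfies $m_n^{3/2}\epsilon_n\to0$ due to $r\ge2$ and Condition 3. That is, there exists a local maximum $\hat\beta_n$ within this neighborhood. Consequently, $\|\hat\beta_n-\beta_0\|_{BV}\to0$ and $\|\hat\beta_n-\beta_0\|^2_{L_2}\le\|\hat\beta_n-\hat\beta_0\|^2_{L_2}+O(m_n^{-2r})\le\epsilon_n^2+m_n^{-2r}=o_p(n^{-1/2})$. From the result that $\|\hat\Lambda_{2,\beta}-\Lambda_{20}\|_{BV}=A_n(n^{-1/2}+\|\beta-\beta_0\|_{L_2})$, the corresponding $\hat\Lambda_{2,\beta}$ satisfies $\|\hat\Lambda_{2,\hat\beta_n}-\Lambda_{20}\|_{BV}=O_p(n^{-1/2})+\|\hat\beta_n-\beta_0\|_{L_2}=o_p(n^{-1/4})$. This implies $\sup_{t\in[0,\tau]}|\hat\Lambda_{2n}(t)-\Lambda_{20}(t)|+\sup_{t\in[0,\tau]}|\hat\beta_n(t)-\beta_0(t)|=o_p(1)$. By reversing the labels, the same argument implies $\sup_{t\in[0,\tau]}|\hat\Lambda_{1n}(t)-\Lambda_{10}(t)|=o_p(1)$. The proof of Theorem 1 is completed.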

Proof of Theorem 2

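For any $h_1\in BV[0,\tau]$ and any $h_2$ with bounded $r$th derivative in $[0,\tau]$, we have $P_nl_{\Lambda_2}(\hat\Lambda_{2n},\hat\beta_n)(h_1)=0$ and $P_nl_{\beta}(\hat\Lambda_{2n},\hat\beta_n)(h_{2n})=0$. Here, $h_{2n}$ is the projection of $h_2$ on $S_n$, and $\|h_{2n}-h_2\|=O(m_n^{-r})$. This gives

$$
G_n\left\{l_{\Lambda_2}(\hat\Lambda_{2n},\hat\beta_n)(h_1)+l_{\beta}(\hat\Lambda_{2n},\hat\beta_n)(h_{2n})\right\}=-n^{1/2}P\left\{l_{\Lambda_2}(\hat\Lambda_{2n},\hat\beta_n)(h_1)+l_{\beta}(\hat\Lambda_{2n},\hat\beta_n)(h_{2n})\right\},\tag{A1}
$$

where $G_n=n^{1/2}(P_n-P)$. It is straightforward to verify that $\{l_{\Lambda_2}(\hat\Lambda_{2n},\hat\beta_n)(h_1)+l_{\beta}(\hat\Lambda_{2n},\hat\beta_n)(h_2):\|h_1\|_{BV}\le1,\|h_2\|\le1\}$ is a Donsker class. Thus, the left-hand side of equation (A1) is equal to $G_n\{l_{\Lambda_2}(\Lambda_{20},\beta_0)(h_1)+l_{\beta}(\Lambda_{20},\beta_0)(h_2)\}+o_p(1)$, where $o_p(1)$ here and in the sequel refers to some random element that converges in probability to zero uniformly in $(h_1,h_2)$.

By the Taylor expansion, the right-hand side of equation (A1) equals

$$
\begin{aligned}
&n^{1/2}\{1+o_p(1)\}\left\{\int h_1^{*}\,d(\hat\Lambda_{2n}-\Lambda_{20})+\int h_2^{*}(\hat\beta_n-\beta_0)\,dt\right\}+n^{1/2}O\left(\|\hat\Lambda_{2n}-\Lambda_{20}\|^2_{BV}+\|\hat\beta_n-\beta_0\|^2_{L_2}\right)\\
&\quad=n^{1/2}\{1+o_p(1)\}\left\{\int h_1^{*}\,d(\hat\Lambda_{2n}-\Lambda_{20})+\int h_2^{*}(\hat\beta_n-\beta_0)\,dt\right\}+o_p(1),
\end{aligned}
$$

where $(h_1^{*},h_2^{*})=I(\Lambda_{20},\beta_0)(h_1,h_2)$. This yields that

$$
G_n\left\{l_{\Lambda_2}(\Lambda_{20},\beta_0)(h_1)+l_{\beta}(\Lambda_{20},\beta_0)(h_2)\right\}+o_p(1)=n^{1/2}\left\{\int h_1^{*}\,d(\hat\Lambda_{2n}-\Lambda_{20})+\int h_2^{*}(\hat\beta_n-\beta_0)\,dt\right\},
$$

where $(h_1,h_2)=I^{-1}(\Lambda_{20},\beta_0)(h_1^{*},h_2^{*})$. That is, $n^{1/2}\{\hat\Lambda_{2n}(t)-\Lambda_{20}(t),\hat\beta_n(t)-\beta_0(t)\}$ converges in distribution to a mean-zero Gaussian process in $l^{\infty}(F_{BV}\times F_{\beta})$. Finally, since $\hat\Lambda_{1n}$ is obtained using the same estimation as $\hat\Lambda_{2n}$ by reversing group labels, a similar asymptotically linear expansion holds for $n^{1/2}\int h_1\,d(\hat\Lambda_{1n}-\Lambda_{10})$. Hence, we conclude that $n^{1/2}\{\hat\Lambda_{1n}(t)-\Lambda_{10}(t),\hat\Lambda_{2n}(t)-\Lambda_{20}(t)\}$ converges in distribution to a mean-zero Gaussian process in $l^{\infty}(F_{BV}\times F_{BV})$.

From the asymptotic linear expansion of $n^{1/2}\{\hat\Lambda_{kn}(t)-\Lambda_{k0}(t)\}$, we note that for any fixed $t$, the influence function of $\hat\Lambda_{kn}$ is on the tangent space of the score functions. Therefore, the estimators are semiparametrically efficient in the metric space $l^{\infty}(F_{BV}\times F_{BV})$ according to Theorem 18.8 in Kosorok (2008). We have completed the proof of Theorem 2.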

Footnotes

SUPPLEMENTARY MATERIAL

Supplementary material available at Biometrika online includes a proof for model identifiability, additional tables for Simulations 1 and 2, and results from two additional simulation studies.

Contributor Information

YUANJIA WANG, Department of Biostatistics, Mailman School of Public Health, 722 W168th Street, New York 10032, U.S.A. yw2016@columbia.edu.

BAOSHENG LIANG, School of Mathematical Sciences, Beijing Normal University, Beijing 100875, China. liangbs@mail.bnu.edu.cn.

XINGWEI TONG, School of Mathematical Sciences, Beijing Normal University, Beijing 100875, China. xweitong@bnu.edu.cn.

KAREN MARDER, Department of Neurology and Psychiatry, College of Physicians and Surgeons, Columbia University, New York 10032, U.S.A. ksm1@columbia.edu.

SUSAN BRESSMAN, The Alan and Barbara Mirken Department of Neurology, Beth Israel Medical Center, New York, 10003, U.S.A. sbressma@chpnet.org.

AVI ORR-URTREGER, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel. aviorr@tasmc.health.gov.il.

NIR GILADI, Sackler Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel. nirg@tasmc.health.gov.il.

DONGLIN ZENG, Department of Biostatistics, CB # 7420, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599-7420, U.S.A. dzeng@bios.unc.edu.

References

  1. Alcalay RN, Mirelman A, Saunders-Pullman R, Tang M, Mejia Santana H, Raymond D, Roos E, Orbe-Reilly M, Gurevich T, Bar Shira A, et al. Parkinson's disease phenotype in Ashkenazi Jews with and without LRRK2 G2019S mutations. Mov. Disord. 2013;28:1966–1971. doi: 10.1002/mds.25647.
  2. Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and Adaptive Estimation for Semiparametric Models. Springer; New York: 1993.
  3. Cai T, Hyndman RJ, Wand W. Mixed model-based hazard estimation. Journal of Computational and Graphical Statistics. 2002;11:784–798.
  4. Cheng G, Wang X. Semiparametric additive transformation model under current status data. Electron. J. Statist. 2011;5:1735–1764.
  5. de Boor C. A Practical Guide to Splines. Springer; Wroclaw: 1978.
  6. Diao G, Lin D. Semiparametric methods for mapping quantitative trait loci with censored data. Biometrics. 2005;61:789–798. doi: 10.1111/j.1541-0420.2005.00346.x.
  7. Driver JA, Logroscino G, Gaziano JM, Kurth T. Incidence and remaining lifetime risk of Parkinson disease in advanced age. Neurology. 2009;72:432–438. doi: 10.1212/01.wnl.0000341769.50075.bb.
  8. Fine JP, Zou F, Yandell BS. Nonparametric estimation of mixture models, with application to quantitative trait loci. Biostatistics. 2004;5:501–513. doi: 10.1093/biostatistics/kxh004.
  9. Goldwurm S, Tunesi S, Tesei S, Zini M, Sironi F, Primignani P, Magnani C, Pezzoli G. Kin-cohort analysis of LRRK2-G2019S penetrance in Parkinson's disease. Mov. Disord. 2011;26:2144–2145. doi: 10.1002/mds.23807.
  10. Hall P, Zhou XH. Nonparametric estimation of component distributions in a multivariate mixture. Ann. Statist. 2003;31:201–224.
  11. Healy DG, Falchi M, O'Sullivan SS, Bonifati V, Durr A, Bressman S, Brice A, Aasly J, Zabetian CP, Goldwurm S, et al. Phenotype, genotype, and worldwide genetic penetrance of LRRK2-associated Parkinson's disease: a case-control study. Lancet Neurol. 2008;7:583–590. doi: 10.1016/S1474-4422(08)70117-0.
  12. Hentati F, Trinh J, Thompson C, Nosova E, Farrer MJ, Aasly JO. LRRK2 Parkinsonism in Tunisia and Norway: A comparative analysis of disease penetrance. Neurology. 2014;83:568–569. doi: 10.1212/WNL.0000000000000675.
  13. Kachergus J, Mata IF, Hulihan M, Taylor JP, Lincoln S, Aasly J, Gibson JM, Ross OA, Lynch T, Wiley J, et al. Identification of a novel LRRK2 mutation linked to autosomal dominant Parkinsonism: evidence of a common founder across European populations. American Journal of Human Genetics. 2005;76:672–680. doi: 10.1086/429256.
  14. Kosorok M. Introduction to Empirical Processes and Semiparametric Inference. Springer; New York: 2008.
  15. Langbehn DR, Brinkman RR, Falush D, Paulsen JS, Hayden MR. A new model for prediction of the age of onset and penetrance for Huntington's disease based on CAG length. Clinical Genetics. 2004;65:267–277. doi: 10.1111/j.1399-0004.2004.00241.x.
  16. Laird NM, Ware J. Random-effect models for longitudinal data. Biometrics. 1982;38:963–974.
  17. Latourelle JC, Sun M, Lew MF, Suchowersky O, Klein C, Golbe LI, Mark MH, Growdon JH, Wooten GF, Watts RL, et al. The Gly2019Ser mutation in LRRK2 is not fully penetrant in familial Parkinson's disease: the GenePD study. BMC Medicine. 2008;6:32. doi: 10.1186/1741-7015-6-32.
  18. Lesage S, Leutenegger AL, Ibanez P, Janin S, Lohmann E, Durr A, Brice A, French Parkinson's Disease Genetics Study Group. LRRK2 haplotype analyses in European and North African families with Parkinson disease: a common founder for the G2019S mutation dating from the 13th century. American Journal of Human Genetics. 2005;77:330. doi: 10.1086/432422.
  19. Ma Y, Wang Y. Efficient distribution estimation for data with unobserved sub-population identifiers. Electronic Journal of Statistics. 2012;6:710–737. doi: 10.1214/12-EJS690.
  20. Marder K, Levy G, Louis ED, Mejia-Santana H, Cote L, Andrews H, Harris J, Waters C, Ford B, Frucht S, Fahn S, Ottman R. Accuracy of family history data on Parkinson's disease. Neurology. 2003;61:18–23. doi: 10.1212/01.wnl.0000074784.35961.c0.
  21. Marder K, Tang M, Alcalay R, Mejia-Santana H, Raymond D, Mirelman A, Saunders-Pullman R, Clark L, Ozelius L, Orr-Urtreger A, et al. Age specific penetrance of the LRRK2 G2019S mutation in the Michael J Fox Ashkenazi Jewish (AJ) LRRK2 consortium. Neurology. 2014;82(10 Supplement):S17–002. doi: 10.1212/WNL.0000000000001708.
  22. McLachlan GJ, Basford KE. Mixture Models, Inference and Applications to Clustering. Dekker; New York: 1988.
  23. Orr-Urtreger A, Shifrin C, Rozovski U, Rosner S, Bercovich D, Gurevich T, Yagev-More H, Bar-Shira A, Giladi N. The LRRK2 G2019S mutation in Ashkenazi Jews with Parkinson's disease: Is there a gender effect? Neurology. 2007;69:1595–1602. doi: 10.1212/01.wnl.0000277637.33328.d8.
  24. Paisán-Ruíz C, Jain S, Evans EW, Gilks WP, Simón J, van der Brug M, de Munain AL, Aparicio S, Gil AM, Khan N, et al. Cloning of the gene containing mutations that cause PARK8-linked Parkinson's disease. Neuron. 2004;44:595–600. doi: 10.1016/j.neuron.2004.10.023.
  25. Qin J, Garcia T, Ma Y, Tang M, Marder K, Wang Y. Combining isotonic regression and EM algorithm to predict genetic risk under monotonicity constraint. The Annals of Applied Statistics. 2014;8:1182–1208. doi: 10.1214/14-AOAS730.
  26. Rudin W. Functional Analysis. McGraw-Hill; New York: 1973.
  27. Schumaker L. Spline Functions: Basic Theory. Cambridge University Press; Cambridge: 2007.
  28. Teodorescu B, Van Keilegom I, Cao R. Generalized time-dependent conditional linear models under left truncation and right censoring. Annals of the Institute of Statistical Mathematics. 2010;62:465–485.
  29. Titterington DM, Smith AFM, Makov UE. Statistical Analysis of Finite Mixture Distributions. Wiley; Chichester: 1985.
  30. Trinh J, Farrer M. Advances in the genetics of Parkinson disease. Nature Reviews Neurology. 2013;9:445–454. doi: 10.1038/nrneurol.2013.132.
  31. Trinh J, Amouri R, Duda JE, Morley JF, Read M, Donald A, Farrer MJ. A comparative study of Parkinson's disease and leucine-rich repeat kinase 2 p. G2019S Parkinsonism. Neurobiology of Aging. 2014;35:1125–1131. doi: 10.1016/j.neurobiolaging.2013.11.015.
  32. van der Vaart A, Wellner J. Weak Convergence and Empirical Processes. Springer; New York: 1996.
  33. Wacholder S, Hartge P, Struewing J, Pee D, McAdams M, Brody L, Tucker M. The Kin-Cohort Study for Estimating Penetrance. American Journal of Epidemiology. 1998;148:623–630. doi: 10.1093/aje/148.7.623.
  34. Wang Y, Clark LN, Louis ED, Mejia-Santana H, Harris J, Cote LJ, Waters C, Andrews D, Ford B, Frucht S. Risk of Parkinson's disease in carriers of Parkin mutations: estimation using the kin-cohort method. Arch. Neurol. 2008;65:467–474. doi: 10.1001/archneur.65.4.467.
  35. Wang Y, Garcia T, Ma Y. Nonparametric estimation for censored mixture data with application to the Cooperative Huntington's Observational Research Trial. J. Amer. Statist. Assoc. 2012;107:1324–1338. doi: 10.1080/01621459.2012.699353.
  36. Wang Q, Tong X, Sun L. Exploring the varying covariate effects in proportional odds models with censored data. Journal of Multivariate Analysis. 2012;109:168–189.
  37. Yang W, Hormozdiari F, Wang Z, He D, Pasaniuc B, Eskin E. Leveraging reads that span multiple single nucleotide polymorphisms for haplotype inference from sequencing data. Bioinformatics. 2013;29:2245–2252. doi: 10.1093/bioinformatics/btt386.
  38. Zeng D, Lin D. A general asymptotic theory for maximum likelihood estimation in semiparametric regression models with censored data. Statistica Sinica. 2010;20:871–910.
  39. Zeng D, Lin D, Avery CL, North KE. Efficient Semiparametric Estimation of Haplotype-disease Associations in Case-cohort and Nested Case-control Studies. Biostatistics. 2006;7:486–502. doi: 10.1093/biostatistics/kxj021.
  40. Zhang H, Olschwang S, Yu K. Statistical inference on the penetrances of rare genetic mutations based on a case-family design. Biostatistics. 2010;11:519–532. doi: 10.1093/biostatistics/kxq009.
