COMBINING ISOTONIC REGRESSION AND EM ALGORITHM TO PREDICT GENETIC RISK UNDER MONOTONICITY CONSTRAINT

Jing Qin; Tanya P Garcia; Yanyuan Ma; Ming-Xin Tang; Karen Marder; Yuanjia Wang

doi:10.1214/14-AOAS730

Author manuscript; available in PMC: 2015 Jan 1.

Published in final edited form as: Ann Appl Stat. 2014;8(2):1182–1208. doi: 10.1214/14-AOAS730


PMCID: PMC4231830 NIHMSID: NIHMS586180 PMID: 25404955

Abstract

In certain genetic studies, clinicians and genetic counselors are interested in estimating the cumulative risk of a disease for individuals with and without a rare deleterious mutation. Estimating the cumulative risk is difficult, however, when the estimates are based on family history data. Often, the genetic mutation status in many family members is unknown; instead, only estimated probabilities of a patient having a certain mutation status are available. Also, ages of disease-onset are subject to right censoring. Existing methods to estimate the cumulative risk using such family-based data only provide estimation at individual time points, and are not guaranteed to be monotonic, nor non-negative. In this paper, we develop a novel method that combines Expectation-Maximization and isotonic regression to estimate the cumulative risk across the entire support. Our estimator is monotonic, satisfies self-consistent estimating equations, and has high power in detecting differences between the cumulative risks of different populations. Application of our estimator to a Parkinson’s disease (PD) study provides the age-at-onset distribution of PD in PARK2 mutation carriers and non-carriers, and reveals a significant difference between the distribution in compound heterozygous carriers compared to non-carriers, but not between heterozygous carriers and non-carriers.

Keywords: Binomial likelihood, Parkinson’s disease, Pool adjacent violation algorithm, Self-consistency estimating equations

1. Introduction

In genetic epidemiology studies (Struewing et al., 1997; Marder et al., 2003; Goldwurm et al., 2011), family history data is collected to estimate the cumulative distribution function of disease onset in populations with different risk factors (e.g., genetic mutation carriers and non-carriers). Such estimates provide crucial information to assist clinicians, genetic counselors and patients in making important decisions, such as whether to undergo mastectomy (Grady et al., 2013). The family history data, however, raises serious challenges when estimating the cumulative risk. First, a family member’s exact risk factor is unknown; the only available information is the estimated probabilities that a family member has each risk factor. Second, ages of disease onset are subject to censoring due to patient drop-out or loss to follow-up. For such family history data, the cumulative risk of disease is thus a mixture of cumulative distributions for the risk factors with known mixture probabilities. While different parametric and nonparametric estimators have been proposed for estimating these mixture data distribution functions, they are not guaranteed to be monotonic or non-negative, two principal features of distribution functions. Most of these estimators also examine the mixture distributions only at individual time points, rather than at a range of time points. To overcome these challenges, we develop a novel, simultaneous estimation method which combines isotone regression (Barlow et al., 1972) with an Expectation-Maximization (EM) algorithm. Our algorithm is based on the binomial likelihood at all observations (Huang et al., 2007; Ma and Wang, 2013), and yields estimated distribution functions that are non-negative, monotone, consistent, efficient and that provide estimates of the cumulative risk over a range of time points.

Family history data is often collected when studying the risk of disease associated with rare mutations (Struewing et al., 1997; Marder et al., 2003; Wang et al., 2008; Goldwurm et al., 2011). For example, estimating the probability that Ashkenazi Jewish women with specific mutations of BRCA1 or BRCA2 will develop breast cancer (Struewing et al., 1997); estimating the survival function from relatives of Huntington’s disease probands with expanded C-A-G repeats in the huntingtin gene (Wang et al., 2012); and, in this paper, estimating age-at-onset of Parkinson’s disease in carriers of PARK2 mutations (Section 1.1).

In all these cases, a sample of (usually diseased) subjects referred to as probands are genotyped. Disease history in the probands’ first-degree relatives, including age-at-onset of the disease, is obtained through validated interviews (Marder et al., 2003). Because of practical considerations including high costs or unwillingness to undergo genetic testing, the relatives’ genotype information is not collected. Instead, the probability that the relative has the mutation or not is computed based on the relative’s relationship to the proband and the proband’s mutation status (Khoury et al., 1993, section 8.4). Thus, the distribution of the relative’s age-at-onset of a disease is a mixture of genotype-specific distributions with known, subject-specific mixing proportions.

A first attempt at estimating the mixture distribution functions was based on assuming parametric or semiparametric forms (Wu et al., 2007) for the underlying mixture densities. To avoid model misspecification, however, nonparametric estimators such as the nonparametric maximum likelihood estimator (NPMLE) were also proposed. While in many situations the NPMLEs are consistent and efficient, they are neither for the mixture model (Wang et al., 2012; Ma and Wang, 2013). As improvements over the NPMLEs, Wang et al. (2012) and Ma and Wang (2013) proposed consistent and efficient nonparametric estimators based on estimating equations. The estimators stem from casting the problem into a semiparametric theory framework and identifying the efficient estimator. The resulting estimator, however, can have computational difficulties when the data is censored as it uses inverse probability weighting (IPW) and augmented IPW to estimate the mixture distribution functions (Wang et al., 2012). The weighting function involves a Kaplan-Meier estimator which can result in unstable estimation because the weighting function can be close to zero in the right tail. There is also no guarantee that the resulting estimator is monotonic or non-negative; thus, a post-estimate adjustment was implemented to ensure monotonicity.

In this paper, we propose a novel nonparametric estimator that is neither complex, nor computationally intensive, and yields a genuine distribution for the mixture data problem under the monotonicity constraint of a distribution function. Providing nonparametric estimators for survival functions under ordered constraints has received considerable attention recently (Park et al., 2012; Barmi and McKeague, 2013), but the emphasis has been on non-mixture data. The method we propose is applicable to mixture data. Our method is motivated from a real world study on genetic epidemiology of Parkinson’s disease (see Section 1.1), and is based on maximizing a binomial likelihood simultaneously at all observations (Huang et al., 2007). Our method involves combining an EM algorithm and isotone regression (Ayer et al., 1955) so that monotonicity is ensured. We demonstrate that our estimator is consistent, satisfies self-consistent estimating equations, and yields large power in detecting differences between the distribution functions in the mixture populations. Our estimator is easy to implement, and for non-mixture data, we show that our method coincides with the NPMLE.

1.1. CORE-PD study to estimate the risk of PARK2 mutations

Parkinson’s disease (PD) is a neurodegenerative disorder of the central nervous system that results in bradykinesia, tremors, and problems with gait. PD mostly affects people aged 50 and older, but early-onset cases do occur and are hypothesized to result from genetic risk factors. Mutations in the PARK2 gene (Kitada et al., 1998; Hedrich et al., 2004) are the most common genetic risk factor for early-onset PD (Lücking et al., 2000) and may be a risk factor for late-onset PD (Oliveira et al., 2003). While mutations in the PARK2 gene are rare, genetic or acquired defects in Parkin function may have far-reaching implications for the understanding and treatment of both familial and sporadic PD.

To understand the effects of mutations in the PARK2 gene, the Consortium on Risk for Early Onset PD (CORE-PD) study was begun in 2004 (Marder et al., 2010). Experienced neurologists performed in-depth examinations (i.e., neurological, cognitive, psychiatric assessments) of proband participants, a subset of non-carriers, and some of the first-degree relatives of probands and non-carriers. For relatives who were not examined in person, their PARK2 genotypes were not available, but their age-at-onset of PD was obtained through systematic family history interviews (Marder et al., 2003). Based on this family history data, the objective then is to determine the age-specific cumulative risk of PD in PARK2 mutation carriers and non-carriers. The results will help patients interpret a positive test result both in deciding treatment options and making important life decisions such as family planning.

The remaining sections of this paper are as follows. Section 2 describes our proposed estimator which involves maximizing a binomial log-likelihood with an EM algorithm. We demonstrate that the ensuing estimator solves a self-consistent estimating equation, and is consistent for complete and right censored data. We demonstrate in Section 3 that we can re-formulate the estimator using a different EM algorithm, for which we can apply the pool adjacent violators algorithm (PAVA) from isotone regression to yield a non-negative and monotonic estimator. We demonstrate the advantages of our new estimator over current ones through extensive simulation studies in Section 4. We apply our estimator to the CORE-PD study in Section 5 and conclude the paper in Section 6. Technical details are in the Appendix, and additional numerical results are available in the Supplementary Material.

2. Binomial Likelihood Estimation

To simplify the presentation, we focus on a mixture distribution with two components; the techniques presented can be easily extended to more than two components.

For i = 1, …, n, we observe a quantitative measure S_i known to come from one of p = 2 populations with corresponding distributions F₁, F₂ and densities dF₁, dF₂. For example, in the Parkinson’s disease study, S_i is the age of disease-onset, F₁ is the distribution for the PARK2 mutation carrier group, and F₂ is for the non-carrier group. The exact population to which S_i belongs is unknown (i.e., we do not know whether a family member is a mutation carrier or non-carrier), but one can estimate the probability q_ki that S_i was generated from the kth population, k = 1, 2. We suppose the mixture probability Q_i has a discrete distribution, denoted as p_Q(q_i), with finite support u₁, …, u_m. We also suppose that q_1i + q_2i = 1, and hence, sometimes write q_1i ≡ λ_i and q_2i ≡ 1 − λ_i. In this case, instead of referring to the discrete distribution p_Q(q_i), we simply refer to the distribution of λ_i, denoted as η(λ_i). Furthermore, S_i is subject to right-censoring, so we observe X_i = min(S_i,C_i), where C_i is a random censoring time independent of S_i. We let G(·) denote the survival function of C_i and dG(·) its corresponding density. Lastly, we let Δ_i = I(S_i ≤ C_i) denote the censoring indicator.

Our objective is to use the independent, identically distributed (iid) data (Q_i = q_i, X_i = x_i, Δ_i = δ_i) to form a nonparametric estimator of F(t) = {F₁(t), F₂(t)}^T that is consistent, monotone on the support of S_i, and efficient. Identifiability of F(t) is ensured since the mixture probabilities are assumed known and the Q_i are not all the same (Wang et al., 2007). In fact, if Q_i has at least k distinct support points, then the model is identifiable. To estimate F(t), we first consider the nonparametric log-likelihood

\sum_{i = 1}^{n} log (p_{Q} (q_{i}) {q_{i}^{T} d F (x_{i}) G (x_{i})}^{δ_{i}} {[{1 - q_{i}^{T} F (x_{i})} d G (x_{i})]}^{1 - δ_{i}}) .

Because p_Q(q_i) is independent of the estimation of F(t), and the censoring times are random, the log-likelihood above simplifies to

\sum_{i = 1}^{n} log [{q_{i}^{T} d F (x_{i})}^{δ_{i}} {1 - q_{i}^{T} F (x_{i})}^{1 - δ_{i}}] .

(1)

Different maximizations of (1) result in the commonly used NPMLEs (see Appendix A.1). Unfortunately, for the mixture data problem, they turn out to be inconsistent or inefficient (Ma and Wang, 2012).

2.1. Motivation for binomial likelihood formulation

As an improvement over the NPMLEs, we consider a binomial likelihood estimator. To motivate this estimator, we first consider a non-mixture model without censoring. That is, we observe independent observations S₁, …, S_n generated from a common distribution F. Without loss of generality, we suppose S₁ ≤ S₂ ≤ ⋯ ≤ S_n (i.e., ties may occur). We demonstrate that, in this setting, the NPMLE and the binomial likelihood estimator of F are the same. Thus, because the NPMLE is most efficient in this setting, the binomial likelihood estimator is as well.

For non-mixture data without censoring, the nonparametric estimator of F maximizes

\sum_{i = 1}^{n} log d F (s_{i})

with respect to dF(s_i) subject to $\sum_{i = 1}^{n} d F (s_{i}) = 1$ and dF(s_i) ≥ 0. From first principles, the maximizer is the well-known empirical distribution function, ${F̂}_{n} (t) = n^{- 1} \sum_{i = 1}^{n} I (s_{i} \leq t)$ .

On the other hand, the empirical distribution function is also the maximizer of the following binomial log-likelihood. For distinct time points t₁ < t₂ < ⋯ < t_h and each S_i, denote a success if S_i > t_j and a failure if S_i ≤ t_j, i = 1, …, n, j = 1, …, h. The probability of a success is F̄ (t_j) ≔ 1 − F(t_j), and the probability of a failure is F(t_j). The times t₁, …, t_h can be arbitrary, but are typically chosen to span the support of the events S_i so as to estimate the cumulative distribution function over the full support.

Accounting for all possible successes and failures, the binomial log-likelihood is

\sum_{j = 1}^{h} \sum_{i = 1}^{n} {I (s_{i} \leq t_{j}) log F (t_{j}) + I (s_{i} > t_{j}) log F̄ (t_{j})} .

Maximizing the above with respect to each F(t_j) and subject to the monotonic constraint F(t₁) ≤ F(t₂) ≤ … ≤ F(t_h) gives

{F̂}_{n} (t_{j}) = n^{- 1} \sum_{i = 1}^{n} I (s_{i} \leq t_{j}), j = 1, \dots, h .

This is exactly the empirical distribution function, which by construction satisfies the monotonic constraint.

Therefore, in the non-mixture case, maximizing the nonparametric log-likelihood with respect to dF is equivalent to maximizing the binomial log-likelihood with respect to F subject to the monotonic constraint F(t₁) ≤ F(t₂) ≤ … ≤ F(t_h). Because the two estimators are equivalent and the NPMLE is known to be most efficient, the resulting binomial likelihood estimator is fully efficient. Motivated by this result, we anticipate that maximizing the binomial log-likelihood may yield highly efficient estimators in more general mixture models.
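The equivalence can be checked numerically on a toy example. The following is a minimal Python sketch, not part of the original analysis: the sample, the grid of time points and the function names are all illustrative assumptions. It evaluates the binomial log-likelihood of this section at the empirical distribution function and at a perturbed monotone alternative.

```python
import math

def ecdf_on_grid(sample, grid):
    """Empirical distribution function evaluated at each grid point t_j."""
    n = len(sample)
    return [sum(1 for s in sample if s <= t) / n for t in grid]

def binomial_loglik(sample, grid, F):
    """sum_j sum_i [I(s_i <= t_j) log F(t_j) + I(s_i > t_j) log(1 - F(t_j))]."""
    n = len(sample)
    total = 0.0
    for t, p in zip(grid, F):
        failures = sum(1 for s in sample if s <= t)   # number of s_i <= t_j
        successes = n - failures                       # number of s_i > t_j
        if failures:
            total += failures * math.log(p)
        if successes:
            total += successes * math.log(1.0 - p)
    return total

toy_sample = [1.0, 2.0, 2.0, 5.0, 7.0]   # hypothetical data, includes a tie
toy_grid = [1.5, 3.0, 6.0]               # hypothetical time points t_1 < t_2 < t_3
F_hat = ecdf_on_grid(toy_sample, toy_grid)
F_alt = [0.25, 0.55, 0.80]               # monotone, but not the empirical CDF
```

On this toy sample, `binomial_loglik(toy_sample, toy_grid, F_hat)` exceeds the value at `F_alt`, as the argument above predicts: each term in the sum over j is maximized separately by the empirical proportion.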

2.2. Binomial Likelihood Estimator for Censored Mixture Data

We now construct a binomial likelihood estimator for mixture data with censoring. Again, consider arbitrary time points t₁ < ⋯ < t_h, such that for each event time S_i, a success occurs if S_i > t_j and a failure if S_i ≤ t_j, i = 1, …, n, j = 1, …, h. As in Section 2.1, we allow for ties in the event times S_i, and choose times t₁, …, t_h to span the support of the event times.

Under censoring, we observe X_i = min(S_i, C_i), which means a success, I(S_i > t_j), is unobservable for those subjects who are lost to follow-up before t_j. A natural approach then is to view the unobserved successes as missing data and to use an EM algorithm to maximize the constructed binomial log-likelihood.

Let V_ij = I(S_i > t_j), the unobserved success. For mixture data, when V_ij is observable (i.e., non-censored data), we have that P(V_ij = 1) = λ_iF̄₁(t_j)+(1 − λ_i)F̄₂(t_j), and P(V_ij = 0) = λ_iF₁(t_j) + (1 − λ_i)F₂(t_j), where F̄_k(t_j) = 1 − F_k(t_j), k = 1, 2. Considering all time points t₁, …, t_h, and all possible successes and failures, the complete data binomial log-likelihood of {I(S_i > t_j)}, i = 1, …, n, j = 1, …, h, is

\sum_{j = 1}^{h} \sum_{i = 1}^{n} [I (s_{i} \leq t_{j}) log {λ_{i} F_{1} (t_{j}) + (1 - λ_{i}) F_{2} (t_{j})} + I (s_{i} > t_{j}) log {λ_{i} {F̄}_{1} (t_{j}) + (1 - λ_{i}) {F̄}_{2} (t_{j})}] .

If V_ij = I(S_i > t_j) were observable, we could estimate F(t_j), j = 1, …, h, by maximizing the binomial log-likelihood with respect to F₁(t_j) and F₂(t_j). However, because V_ij is unobservable, we instead use an EM algorithm for maximization. An EM algorithm at a single t_j was given in Ma and Wang (2013), but they did not pursue it further. The underlying imputation idea goes back to Efron (1967).

The EM algorithm we propose is an iterative procedure where at the bth step, the imputed V_ij is

w_{i j}^{(b)} = E {I (S_{i} > t_{j}) | x_{i}} = I (x_{i} > t_{j}) + (1 - δ_{i}) I (x_{i} \leq t_{j}) \frac{λ_{i} {F̄}_{1}^{(b)} (t_{j}) + (1 - λ_{i}) {F̄}_{2}^{(b)} (t_{j})}{λ_{i} {F̄}_{1}^{(b)} (x_{i}) + (1 - λ_{i}) {F̄}_{2}^{(b)} (x_{i})},

(2)

based on the observed data X_i = x_i. The E-step is then the imputed binomial log-likelihood

\sum_{j = 1}^{h} \sum_{i = 1}^{n} [(1 - w_{i j}^{(b)}) log {λ_{i} F_{1} (t_{j}) + (1 - λ_{i}) F_{2} (t_{j})} + w_{i j}^{(b)} log {λ_{i} {F̄}_{1} (t_{j}) + (1 - λ_{i}) {F̄}_{2} (t_{j})}] .

(3)

The M-step then maximizes the above with respect to F₁(t_j) and F₂(t_j); specifically, the M-step involves solving

- \sum_{i = 1}^{n} λ_{i} \frac{w_{i j}^{(b)} - λ_{i} {F̄}_{1} (t_{j}) - (1 - λ_{i}) {F̄}_{2} (t_{j})}{{λ_{i} F_{1} (t_{j}) + (1 - λ_{i}) F_{2} (t_{j})} {λ_{i} {F̄}_{1} (t_{j}) + (1 - λ_{i}) {F̄}_{2} (t_{j})}} = 0,
- \sum_{i = 1}^{n} (1 - λ_{i}) \frac{w_{i j}^{(b)} - λ_{i} {F̄}_{1} (t_{j}) - (1 - λ_{i}) {F̄}_{2} (t_{j})}{{λ_{i} F_{1} (t_{j}) + (1 - λ_{i}) F_{2} (t_{j})} {λ_{i} {F̄}_{1} (t_{j}) + (1 - λ_{i}) {F̄}_{2} (t_{j})}} = 0,

(4)

for j = 1, …, h. The solution to (4) leads to the new estimate $F_{1}^{(b + 1)} (t_{j})$ and $F_{2}^{(b + 1)} (t_{j})$ . Iterating the E- and M-steps until convergence leads to the binomial likelihood estimator F̂(t_j), j = 1, …, h, for censored mixture data. We now make several observations about this proposed estimator.

The estimating equations in (4) are optimally weighted (Godambe, 1960), and are, in fact, self-consistent estimating equations (Efron, 1967). The self-consistency stems from the imputation procedure of the EM algorithm, analogously to the work of Efron (1967). In the special case of right censoring but no mixture, the above approach has a closed form solution, which is the celebrated Kaplan-Meier estimator (Efron, 1967). In the general case, it can be shown that the proposed estimator F̂ is consistent. The proof is trivial if F takes finitely many discrete values. On the other hand, if F is a continuous distribution, one may use the law of large numbers and the Kullback-Leibler information inequality to prove it. Details are given in Appendix A.2. Asymptotics of F̂(t_j) are much more involved, however, and require solving a complex integral equation, which is impractical. Hence, inference is usually performed using a bootstrap approach.
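The special case of right censoring without mixture can be illustrated numerically. The sketch below is a hedged Python illustration, not the authors' code: it iterates the non-mixture version of the imputation in (2), S(t) = n⁻¹ Σ_i [I(x_i > t) + (1 − δ_i) I(x_i ≤ t) S(t)/S(x_i)], and compares the result with a direct product-limit computation. The function names, the starting values and the assumption that the largest observation is an event are choices made for this example.

```python
def kaplan_meier(x, delta, grid):
    """Product-limit survival estimate evaluated at each grid point."""
    surv = []
    for t in grid:
        s = 1.0
        event_times = sorted(set(xi for xi, di in zip(x, delta) if di == 1 and xi <= t))
        for u in event_times:
            at_risk = sum(1 for xi in x if xi >= u)
            events = sum(1 for xi, di in zip(x, delta) if xi == u and di == 1)
            s *= 1.0 - events / at_risk
        surv.append(s)
    return surv

def self_consistent_survival(x, delta, n_iter=200):
    """Fixed-point iteration of the self-consistency equation on the observed times."""
    grid = sorted(set(x))
    pos = {t: j for j, t in enumerate(grid)}
    n = len(x)
    # Assumed start: empirical survival function of the observed x_i.
    S = [sum(1 for xi in x if xi > t) / n for t in grid]
    for _ in range(n_iter):
        new_S = []
        for j, t in enumerate(grid):
            total = 0.0
            for xi, di in zip(x, delta):
                if xi > t:
                    total += 1.0                      # known to exceed t
                elif di == 0 and S[pos[xi]] > 0:
                    total += S[j] / S[pos[xi]]        # imputed P(S_i > t | S_i > x_i)
            new_S.append(total / n)
        S = new_S
    return grid, S
```

For example, with hypothetical data `x = [1, 2, 3, 4, 5]` and `delta = [1, 0, 1, 1, 1]`, both functions return survival values close to 0.8, 0.8, 0.5333, 0.2667 and 0 at the five observed times.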

Solving for F̂(t) in practice is also a computationally intensive task. No closed form solution to (4) exists, and ensuring monotonicity and non-negativity of F̂(t) would actually require solving (4) subject to the constraints F_k(t₁) ≤ F_k(t₂) ≤ … ≤ F_k(t_h), k = 1, 2, for t₁ ≤ ⋯ ≤ t_h. Such a constraint only further complicates the already demanding estimation procedure. Still, requiring monotonicity is essential when the data is censored. Without monotonicity, the imputed weights $w_{i j}^{(b)}$ may not be in the range (0,1), which could lead to non-convergence when solving (4). Thus, to ensure monotonicity and avoid the complexities of directly solving (4), we now describe another approach for obtaining the binomial likelihood estimator.
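To make the role of monotonicity concrete, the imputed weight in (2) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name and the interface (current survival estimates passed as callables) are hypothetical, and the two survival curves used in the demonstration are made up.

```python
import math

def imputed_weight(x_i, delta_i, lam_i, t_j, surv1, surv2):
    """Equation (2): E{I(S_i > t_j) | x_i} under the current survival estimates.

    surv1 and surv2 are callables returning the current estimates of
    1 - F_1 and 1 - F_2 (a hypothetical interface chosen for this sketch).
    """
    if x_i > t_j:
        return 1.0            # subject observed beyond t_j, so S_i > t_j
    if delta_i == 1:
        return 0.0            # event observed at x_i <= t_j
    num = lam_i * surv1(t_j) + (1.0 - lam_i) * surv2(t_j)
    den = lam_i * surv1(x_i) + (1.0 - lam_i) * surv2(x_i)
    return num / den          # censored at x_i <= t_j: conditional survival

# Monotone (non-increasing) survival curves give a weight in [0, 1].
w_ok = imputed_weight(1.0, 0, 0.5, 2.0,
                      lambda t: math.exp(-t), lambda t: math.exp(-t / 2.0))

# A non-monotone "survival" estimate (made up) pushes the weight above 1.
bad = lambda t: 0.2 if t < 1.5 else 0.6
w_bad = imputed_weight(1.0, 0, 0.5, 2.0, bad, bad)
```

Here `w_ok` lies strictly between 0 and 1, whereas `w_bad` is about 3, which is not a valid probability.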

3. Genuine Nonparametric Distribution Estimators

To construct a monotone and non-negative estimator F̂(t) at times t₁ < ⋯ < t_h, we maximize a binomial log-likelihood using a combined EM algorithm and pool adjacent violators algorithm (PAVA). Before describing the new method, we first provide a brief overview of PAVA.

3.1. Pool Adjacent Violator Algorithm

Isotone regression (Barlow et al., 1972) is the notion of fitting a monotone function to a set of observed points y₁, …, y_n in a plane. Formally, the problem involves finding a vector a = (a₁, …, a_n)^T that minimizes the weighted least squares

\sum_{i = 1}^{n} r_{i} {(y_{i} - a_{i})}^{2}

subject to a₁ ≤ ⋯ ≤ a_n for weights r_i > 0, i = 1, …, n. The solution to this optimization problem is the so-called max-min formula (Barlow et al., 1972):

â_{j} = max_{s \leq j} min_{t \geq j} \frac{\sum_{h = s}^{t} y_{h} r_{h}}{\sum_{h = s}^{t} r_{h}}, j = 1, \dots, n .

Rather than solving this max-min formula, the weighted least squares problem is instead solved using PAVA (Ayer et al., 1955; Barlow et al., 1972): a simple procedure that yields the solution in O(n) time (Grotzinger and Witzgall, 1984). The history of PAVA, its computational aspects, and a fast implementation in R are discussed in Leeuw et al. (2009). Variations of PAVA implementation include using up-and-down blocks (Kruskal, 1964) and recursive partitioning (Luss et al., 2010).
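For readers unfamiliar with PAVA, a minimal Python sketch is given below. It is not the R implementation of Leeuw et al. (2009); it is a plain stack-based version written for illustration, with a brute-force evaluation of the max-min formula alongside it so that the two can be compared on small inputs. Function names are assumptions of this example.

```python
def pava(y, r):
    """Weighted non-decreasing least-squares fit by pooling adjacent violators."""
    blocks = []  # each block: [weighted mean, total weight, number of points pooled]
    for yi, ri in zip(y, r):
        blocks.append([yi, ri, 1])
        # Pool backwards while the last two blocks violate a_1 <= ... <= a_n.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w, w, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit

def max_min(y, r):
    """Brute-force max-min formula; far slower than PAVA, for checking only."""
    n = len(y)

    def weighted_avg(s, t):
        idx = range(s, t + 1)
        return sum(y[k] * r[k] for k in idx) / sum(r[k] for k in idx)

    return [max(min(weighted_avg(s, t) for t in range(j, n)) for s in range(j + 1))
            for j in range(n)]
```

For instance, `pava([1, 3, 2, 4], [1, 1, 1, 1])` pools the violating pair (3, 2) into its mean and returns 1, 2.5, 2.5, 4, which agrees with `max_min` on the same input. Each point is pushed once and each pooling step removes a block, which is why the number of pooling operations grows linearly in n.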

Our idea is to apply PAVA to a variant of our binomial log-likelihood to yield a monotone estimator F̂(t). It is important to note that we cannot simply apply PAVA to the estimator solving (4). The E-step in (3) is not in the exponential family, which is a requirement of PAVA (Robertson et al., 1988). Furthermore, PAVA has been applied to maximize a binomial log-likelihood for current status data (Jewell and Kalbfleisch, 2004), but not in the context of mixture data, as we do here.

3.2. PAVA-based Binomial Likelihood Estimator for Censored Mixture Data

We now modify the construction of the binomial likelihood estimator for censored mixture data (Section 2.2) so that PAVA may be applied. In our earlier construction (Section 2.2), we viewed the event I(S_i > t_j) as the only missing data, i = 1, …, n, j = 1, …, h. Now, we also consider the unobserved population membership as missing. Let L_i denote the unobserved population membership for observation i.

Analogous to the argument in Section 2.2, we first consider the ideal situation when L_i and I(S_i > t_j) are observable. We suppose L_i = 1 when S_i is generated from F₁, and L_i = 0 when S_i is generated from F₂. In this case, P(L_i = 1) = λ_i and P(L_i = 0) = 1 − λ_i. For mixture data, the probability S_i > t_j is λ_iF̄₁(t_j) when L_i = 1, and is (1 − λ_i) F̄₂(t_j) when L_i = 0. Likewise, the probability S_i ≤ t_j is λ_iF₁(t_j) when L_i = 1 and is (1 − λ_i)F₂(t_j) when L_i = 0. Therefore, the complete data log-likelihood of {L_i, I(S_i ≤ t_j)}, i = 1, 2, …, n, j = 1, 2, …, h, is the binomial log-likelihood

ℓ_{c} = \sum_{j = 1}^{h} \sum_{i = 1}^{n} [L_{i} I (S_{i} \leq t_{j}) log {λ_{i} F_{1} (t_{j})} + L_{i} I (S_{i} > t_{j}) log {λ_{i} {F̄}_{1} (t_{j})} + (1 - L_{i}) I (S_{i} \leq t_{j}) log {(1 - λ_{i}) F_{2} (t_{j})} + (1 - L_{i}) I (S_{i} > t_{j}) log {(1 - λ_{i}) {F̄}_{2} (t_{j})}] .

However, neither the population membership L_i nor the event I(S_i > t_j) is available. Hence, these values must be imputed, and an EM algorithm will be used for maximization.

At the bth step of the EM algorithm, we compute E{L_iI(S_i ≤ t_j)|x_i} = E{L_i|S_i ≤ t_j}E{I(S_i ≤ t_j)|x_i} and E{L_iI(S_i > t_j)|x_i} = E{L_i|S_i > t_j}E{I(S_i > t_j)|x_i} based on observed data X_i = min(S_i, C_i) with X_i = x_i. We found earlier that $E {I (S_{i} > t_{j}) | x_{i}} = w_{i j}^{(b)}$ as defined in (2). Using a similar calculation, we obtain

u_{i j}^{(b)} \equiv E (L_{i} | S_{i} \leq t_{j}) = \frac{λ_{i} F_{1}^{(b)} (t_{j})}{λ_{i} F_{1}^{(b)} (t_{j}) + (1 - λ_{i}) F_{2}^{(b)} (t_{j})}, υ_{i j}^{(b)} \equiv E (L_{i} | S_{i} > t_{j}) = \frac{λ_{i} {F̄}_{1}^{(b)} (t_{j})}{λ_{i} {F̄}_{1}^{(b)} (t_{j}) + (1 - λ_{i}) {F̄}_{2}^{(b)} (t_{j})} .

Therefore, at the bth step, with observed data O^(b) = {X_i}, i = 1, …, n, the E-step is

E (ℓ_{c} | O^{(b)}) = \sum_{j = 1}^{h} \sum_{i = 1}^{n} [u_{i j}^{(b)} (1 - w_{i j}^{(b)}) log {λ_{i} F_{1} (t_{j})} + υ_{i j}^{(b)} w_{i j}^{(b)} log {λ_{i} {F̄}_{1} (t_{j})} + (1 - u_{i j}^{(b)}) (1 - w_{i j}^{(b)}) log {(1 - λ_{i}) F_{2} (t_{j})} + (1 - υ_{i j}^{(b)}) w_{i j}^{(b)} log {(1 - λ_{i}) {F̄}_{2} (t_{j})}] .

The M-step then maximizes the above expression with respect to F₁(t_j) and F₂(t_j) at each t_j. To ensure monotonicity, however, the M-step actually involves maximizing E(ℓ_c|O^(b)) subject to the monotonic constraints F_k(t₁) ≤ F_k(t₂) ≤ … ≤ F_k(t_h), k = 1, 2. Though constrained maximization is typically a challenging procedure, the task is simplified because the log-likelihood E(ℓ_c|O^(b)) belongs to the exponential family, in which case PAVA is applicable. From the theory of isotonic regression (Robertson et al., 1988), we have

\underset{F_{1} (t_{1}) \leq \dots \leq F_{1} (t_{h})}{arg max} E (ℓ_{c} | O^{(b)}) = \underset{F_{1} (t_{1}) \leq \dots \leq F_{1} (t_{h})}{arg min} \sum_{j = 1}^{h} \sum_{i = 1}^{n} r_{1 i j}^{(b)} {u_{i j}^{(b)} \frac{1 - w_{i j}^{(b)}}{r_{1 i j}^{(b)}} - F_{1} (t_{j})}^{2}, \underset{F_{2} (t_{1}) \leq \dots \leq F_{2} (t_{h})}{arg max} E (ℓ_{c} | O^{(b)}) = \underset{F_{2} (t_{1}) \leq \dots \leq F_{2} (t_{h})}{arg min} \sum_{j = 1}^{h} \sum_{i = 1}^{n} r_{2 i j}^{(b)} {(1 - u_{i j}^{(b)}) \frac{1 - w_{i j}^{(b)}}{r_{2 i j}^{(b)}} - F_{2} (t_{j})}^{2},

where $r_{1 i j}^{(b)} = u_{i j}^{(b)} (1 - w_{i j}^{(b)}) + υ_{i j}^{(b)} w_{i j}^{(b)}$ and $r_{2 i j}^{(b)} = (1 - u_{i j}^{(b)}) (1 - w_{i j}^{(b)}) + (1 - υ_{i j}^{(b)}) w_{i j}^{(b)}$ .

These formulations suggest that ${F_{1} (t_{j})}_{j = 1}^{h}$ is the weighted isotonic regression of $u_{i j}^{(b)} (1 - w_{i j}^{(b)}) / r_{1 i j}^{(b)}$ with weights $r_{1 i j}^{(b)}$ . Likewise, ${F_{2} (t_{j})}_{j = 1}^{h}$ is the weighted isotonic regression of $(1 - u_{i j}^{(b)}) (1 - w_{i j}^{(b)}) / r_{2 i j}^{(b)}$ with weights $r_{2 i j}^{(b)}$ . Thus, the max-min results of isotone regression apply and yield solutions

{F̃}_{1}^{(b + 1)} (t_{j}) = max_{s \leq j} min_{t \geq j} \frac{\sum_{h = s}^{t} \sum_{i = 1}^{n} u_{i h}^{(b)} (1 - w_{i h}^{(b)})}{\sum_{h = s}^{t} \sum_{i = 1}^{n} {u_{i h}^{(b)} (1 - w_{i h}^{(b)}) + υ_{i h}^{(b)} w_{i h}^{(b)}}}, {F̃}_{2}^{(b + 1)} (t_{j}) = max_{s \leq j} min_{t \geq j} \frac{\sum_{h = s}^{t} \sum_{i = 1}^{n} (1 - u_{i h}^{(b)}) (1 - w_{i h}^{(b)})}{\sum_{h = s}^{t} \sum_{i = 1}^{n} {(1 - u_{i h}^{(b)}) (1 - w_{i h}^{(b)}) + (1 - υ_{i h}^{(b)}) w_{i h}^{(b)}}} .

Rather than solving these max-min formulas, we instead use the PAVA algorithm implemented in R (Leeuw et al., 2009). Iterating through the E- and M-steps with PAVA leads to a genuine estimator of the mixture distributions.
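The iteration described in this section can be sketched compactly. The following Python code is a minimal illustration under several assumptions that are not in the text: the time points t_j are taken to be the distinct observed times, the starting values are an arbitrary increasing sequence in (0, 1), a fixed number of iterations replaces a convergence check, and zero denominators are handled by simple guards. It is not the R implementation used in the paper.

```python
def pava(y, r):
    """Weighted non-decreasing least-squares fit by pooling adjacent violators."""
    blocks = []  # each block: [weighted mean, total weight, points pooled]
    for yi, ri in zip(y, r):
        blocks.append([yi, ri, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            w = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w, w, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit

def safe_div(a, b, default):
    """a / b, or `default` when b is not positive (a sketch-level guard)."""
    return a / b if b > 0 else default

def em_pava(x, delta, lam, n_iter=200):
    """EM with a PAVA M-step; returns (grid, F1, F2) on the distinct observed times."""
    grid = sorted(set(x))
    h = len(grid)
    pos = {t: j for j, t in enumerate(grid)}
    F1 = [(j + 1) / (h + 1) for j in range(h)]   # assumed starting values
    F2 = list(F1)
    tiny = 1e-12                                  # keeps PAVA weights positive
    for _ in range(n_iter):
        num1, den1 = [0.0] * h, [0.0] * h
        num2, den2 = [0.0] * h, [0.0] * h
        for xi, di, li in zip(x, delta, lam):
            k = pos[xi]
            surv_x = li * (1 - F1[k]) + (1 - li) * (1 - F2[k])
            for j, t in enumerate(grid):
                mix_F = li * F1[j] + (1 - li) * F2[j]
                mix_S = li * (1 - F1[j]) + (1 - li) * (1 - F2[j])
                # E-step: w as in (2); u and v as defined in this section.
                if xi > t:
                    w = 1.0
                elif di == 1:
                    w = 0.0
                else:
                    w = min(1.0, safe_div(mix_S, surv_x, 0.0))
                u = safe_div(li * F1[j], mix_F, li)
                v = safe_div(li * (1 - F1[j]), mix_S, li)
                num1[j] += u * (1 - w)
                den1[j] += u * (1 - w) + v * w
                num2[j] += (1 - u) * (1 - w)
                den2[j] += (1 - u) * (1 - w) + (1 - v) * w
        # M-step: weighted isotonic regression of the ratios with weights r_1, r_2.
        F1 = pava([safe_div(a, b, 0.0) for a, b in zip(num1, den1)],
                  [max(b, tiny) for b in den1])
        F2 = pava([safe_div(a, b, 0.0) for a, b in zip(num2, den2)],
                  [max(b, tiny) for b in den2])
        F1 = [min(1.0, max(0.0, f)) for f in F1]  # guard against rounding
        F2 = [min(1.0, max(0.0, f)) for f in F2]
    return grid, F1, F2
```

As a sanity check, when every λ_i is 0 or 1 and there is no censoring (e.g., `em_pava([1, 2, 3, 4], [1, 1, 1, 1], [1, 0, 1, 0])`), the sketch returns the two group-wise empirical distribution functions, consistent with the claim that the method coincides with the NPMLE for non-mixture data.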

For non-censored data (i.e., δ_i = 1, i = 1, …, n), $w_{i j}^{(b)}$ in (2) simplifies to $w_{i j}^{(b)} = I (S_{i} > t_{j})$ . In this case, the proposed EM algorithm with PAVA in the M-step remains as stated but with $w_{i j}^{(b)} = I (S_{i} > t_{j})$ throughout.

Finally, the proposed EM-PAVA algorithm converges to the maximum likelihood estimate of the binomial likelihood. This follows because E(ℓ_c|O^(b)) belongs to the exponential family and is convex (Wu, 1983). Thus, the derived estimator is the unique maximizer and satisfies the monotonic property of distribution functions.

3.3. Hypothesis Testing

For a two-component mixture model, one key interest is testing for differences between the two mixture distributions; i.e., testing H₀ : F₁(t) = F₂(t) vs. H₁ : F₁(t) ≠ F₂(t) for a finite set of t values or over an entire range. To test this difference, we suggest the following permutation strategy (Churchill and Doerge, 1994). For the data set given, obtain the estimate F̃⁽⁰⁾ (t) using the EM-PAVA algorithm, and compute $s^{(0)} = {sup}_{t} | {F̃}_{1}^{(0)} (t) - {F̃}_{2}^{(0)} (t) |$ . Then, for k = 1, …, K, create a permuted sample of the data by permuting the pairs (X_i, δ_i) and coupling them with the mixture proportions q₁, …, q_n. For the kth permuted data set, compute F̃^(k) (t) and $s^{(k)} = {sup}_{t} | {F̃}_{1}^{(k)} (t) - {F̃}_{2}^{(k)} (t) |$ . Finally, the p-value associated with testing H₀ is $\sum_{k = 1}^{K} I (s^{(k)} \geq s^{(0)}) / K$ . In practice, we recommend using K = 1000 permutation data sets. We compare the power of various tests in Section 4.
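The permutation strategy can be sketched as follows. This Python sketch is illustrative only: `estimator` is a hypothetical callable standing in for any routine (such as an EM-PAVA implementation) that returns the two estimated distribution functions on a common grid, the supremum is approximated by the maximum over that grid, and the `seed` argument is added here for reproducibility.

```python
import random

def sup_distance(F1, F2):
    """Maximum absolute difference between two curves on a common grid."""
    return max(abs(a - b) for a, b in zip(F1, F2))

def permutation_pvalue(x, delta, lam, estimator, K=1000, seed=0):
    """Permutation p-value for H0: F1 = F2.

    estimator(x, delta, lam) is assumed to return (grid, F1, F2).
    """
    rng = random.Random(seed)
    _, F1, F2 = estimator(x, delta, lam)
    s0 = sup_distance(F1, F2)
    pairs = list(zip(x, delta))
    exceed = 0
    for _ in range(K):
        perm = pairs[:]
        rng.shuffle(perm)                      # permute the pairs (x_i, delta_i) jointly
        px = [p[0] for p in perm]
        pd = [p[1] for p in perm]
        _, G1, G2 = estimator(px, pd, lam)     # mixture proportions keep their order
        if sup_distance(G1, G2) >= s0:
            exceed += 1
    return exceed / K
```

Because each permutation requires a full re-estimation, the cost is roughly K times that of a single fit, so a smaller K may be useful while experimenting.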

4. Simulation Study

4.1. Simulation Design

We performed extensive simulation studies to investigate the performance of the proposed EM-PAVA algorithm. We report here the results of three experiments comparing EM-PAVA to existing estimators in the literature: the type I NPMLE, type II NPMLE (see Appendix A.1 for the forms of the NPMLEs), and the oracle efficient augmented inverse probability weighting estimator (Oracle EFFAIPW) of Wang et al. (2012, sec. 3). “Oracle” here refers to the assumption that the underlying density dF(t) is known exactly and is not estimated using nonparametric methods.

The three experiments were designed as follows:

Experiment 1: F₁(t) = {1 − exp(−t)}/{1 − exp(−10)} and F₂(t) = {1 − exp(−t/2.8)}/{1 − exp(−10/2.8)} for 0 ≤ t ≤ 10.
Experiment 2: F₁(t) = 0.8/[1 + exp{−(t − 80)/5}] for 0 ≤ t ≤ 100 and F₁(t) = 0.678 + 0.001t for 100 ≤ t ≤ 300. F₂(t) = 0.2/[1 + exp{−(t − 80)/5}] for 0 ≤ t ≤ 100 and F₂(t) = −0.205 + 0.004t for 100 ≤ t ≤ 300. Data are generated as specified; however, the estimation procedure focuses on estimates of F(t) for 0 ≤ t ≤ 100.
Experiment 3: F₁(t) = {1−exp(−t/4)}/{1−exp(−2.5)} for 0 ≤ t ≤ 10 and F₂(t) = {1 − exp(−t/2)}/{1 − exp(−2.5)} for 0 ≤ t ≤ 5.

The second experiment is designed to mimic the Parkinson’s disease data in Section 5. In all experiments, we set the random mixture proportion q_i = (λ_i, 1 − λ_i) to be one of m = 4 vector values: (1, 0)^T, (0.6, 0.4)^T, (0.2, 0.8)^T and (0.16, 0.84)^T. The four vector values had an equally likely chance of being selected. Our sample size was 500 and we generated a uniform censoring distribution to achieve 0%, 20%, and 40% censoring rates.
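As an illustration of this design, data for Experiment 1 could be generated along the following lines. This Python sketch makes several assumptions not stated in the text: event times are drawn by inverting the truncated exponential distribution functions, censoring times are Uniform(0, c_max) with c_max a hypothetical tuning parameter that would have to be calibrated to reach a target censoring rate, and the seed and function names are arbitrary.

```python
import math
import random

def F1_exp1(t):
    """True F_1 in Experiment 1 (exponential truncated to [0, 10])."""
    return (1 - math.exp(-t)) / (1 - math.exp(-10))

def F2_exp1(t):
    """True F_2 in Experiment 1."""
    return (1 - math.exp(-t / 2.8)) / (1 - math.exp(-10 / 2.8))

def F1_inv(u):
    return -math.log(1 - u * (1 - math.exp(-10)))

def F2_inv(u):
    return -2.8 * math.log(1 - u * (1 - math.exp(-10 / 2.8)))

LAMBDAS = [1.0, 0.6, 0.2, 0.16]   # the m = 4 support points of lambda_i

def simulate_exp1(n=500, c_max=None, seed=0):
    """Return a list of (x_i, delta_i, lambda_i); c_max=None means no censoring."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        lam = rng.choice(LAMBDAS)              # equally likely mixture proportions
        from_f1 = rng.random() < lam           # latent membership L_i
        u = rng.random()
        s = F1_inv(u) if from_f1 else F2_inv(u)
        c = rng.uniform(0, c_max) if c_max is not None else math.inf
        data.append((min(s, c), 1 if s <= c else 0, lam))
    return data
```

Evaluating `F1_exp1(1.3)` and `F2_exp1(1.3)` gives approximately 0.7275 and 0.3822, the true values reported for Experiment 1 in Table 1.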

The primary goal of the simulation studies is to compare the bias, efficiency and power of detecting distribution differences. Bias and efficiency were evaluated at different t values. First, we evaluated the pointwise bias, F̂(t) − F₀(t), at different t values, where F₀(t) denotes the truth. Specifically, we ran 500 Monte Carlo simulations and evaluated the pointwise bias at t = 1.3 in Experiment 1 (Table 1); at t = 85 in Experiment 2 (Table 1); and at t = 2 in Experiment 3 (Supplementary Material, Table ??).

Table 1. Results for Experiment 1 at t = 1.3 and Experiment 2 at t = 85: bias, empirical standard deviation (emp sd), average estimated standard deviation (est sd), and 95% coverage (95% cov) of estimators at different censoring rates. Results based on 500 simulations with sample size n = 500.

Experiment 1: F₁(t) = 0.7275 (first four result columns), F₂(t) = 0.3822 (last four result columns)

Estimator	F₁ bias	F₁ emp sd	F₁ est sd	F₁ 95% cov	F₂ bias	F₂ emp sd	F₂ est sd	F₂ 95% cov
Censoring rate = 0%
EM-PAVA	0.0002	0.0471	0.0440	0.9420	−0.0015	0.0438	0.0419	0.9480
Oracle EFFAIPW	0.0004	0.0461	0.0440	0.9520	−0.0014	0.0435	0.0419	0.9480
type I NPMLE	−0.0159	0.1048	0.0579	0.9120	−0.0029	0.0804	0.0627	0.9160
type II NPMLE	−0.0674	0.0588	0.0329	0.5040	0.0824	0.0473	0.0288	0.2980
Censoring rate = 20%
EM-PAVA	0.0023	0.0491	0.0456	0.9360	−0.0024	0.0445	0.0430	0.9520
Oracle EFFAIPW	0.0019	0.0488	0.0454	0.9420	0.0011	0.0447	0.0432	0.9440
type I NPMLE	−0.0089	0.0921	0.0588	0.9260	−0.0041	0.0835	0.0644	0.9180
type II NPMLE	−0.0846	0.0849	0.0440	0.5720	0.0920	0.0720	0.0393	0.3900
Censoring rate = 40%
EM-PAVA	0.0022	0.0526	0.0486	0.9420	−0.0025	0.0464	0.0456	0.9500
Oracle EFFAIPW	0.0057	0.0562	0.0486	0.9220	−0.0017	0.0508	0.0460	0.9360
type I NPMLE	−0.0103	0.0981	0.0614	0.9160	−0.0061	0.0868	0.0674	0.9120
type II NPMLE	−0.0954	0.0952	0.0453	0.5580	0.1008	0.0854	0.0395	0.3800

Experiment 2: F₁(t) = 0.5848 (first four result columns), F₂(t) = 0.1462 (last four result columns)

Estimator	F₁ bias	F₁ emp sd	F₁ est sd	F₁ 95% cov	F₂ bias	F₂ emp sd	F₂ est sd	F₂ 95% cov
Censoring rate = 0%
EM-PAVA	−0.0009	0.0482	0.0470	0.9540	−0.0037	0.0398	0.0357	0.9280
Oracle EFFAIPW	−0.0015	0.0480	0.0472	0.9600	−0.0036	0.0403	0.0368	0.9480
type I NPMLE	−0.0133	0.0890	0.0597	0.9500	−0.0034	0.0659	0.0521	0.8980
type II NPMLE	−0.0872	0.0697	0.0349	0.4520	0.1035	0.0532	0.0248	0.0520
Censoring rate = 20%
EM-PAVA	0.0002	0.0548	0.0493	0.9300	−0.0013	0.0391	0.0381	0.9540
Oracle EFFAIPW	0.0006	0.0548	0.0498	0.9340	−0.0015	0.0396	0.0389	0.9640
type I NPMLE	−0.0078	0.0908	0.0623	0.9160	−0.0030	0.0682	0.0544	0.8860
type II NPMLE	−0.0959	0.0792	0.0437	0.4800	0.1086	0.0695	0.0353	0.1160
Censoring rate = 40%
EM-PAVA	−0.0016	0.0557	0.0525	0.9320	−0.0002	0.0425	0.0401	0.9500
Oracle EFFAIPW	0.0009	0.0578	0.0525	0.9380	−0.0008	0.0434	0.0410	0.9560
type I NPMLE	−0.0111	0.0977	0.0650	0.9100	−0.0043	0.0711	0.0560	0.8760
type II NPMLE	−0.1048	0.0857	0.0454	0.4740	0.1153	0.0846	0.0361	0.1380

Second, we evaluated the estimators over the entire range of t values based on results from 500 Monte Carlo simulations; see Tables 2 and ?? (Supplementary Material). In this case, we evaluated the estimators based on the integrated absolute bias (IAB), average pointwise variance, and average pointwise 95% coverage probabilities. The integrated absolute bias (IAB) is $\int_{0}^{\infty} | {F̄}_{k} (t) - F_{k 0} (t) | d t$ , k = 1, 2, where F̄_k(t) is the average estimate over the 500 data sets, and F_k0 is the truth. In our simulation study, the integral in the IAB was computed using a Riemann sum evaluated at 50 evenly spaced time points across the entire range (i.e., over (0,10) in Experiments 1 and 3, and over (0,100) in Experiment 2). The IAB for F₂(t) in Experiment 3 was computed over (0,5) because it is only defined on this interval. The average pointwise variance and average pointwise 95% coverage probabilities were also computed over 50 time points evenly spaced across the entire range (i.e., over (0,10) in Experiments 1 and 3, and over (0,100) in Experiment 2). Specifically, for each of the 50 time points, we computed the pointwise variance and pointwise 95% coverage probabilities of the 500 data sets. Then, we reported the average of the 50 pointwise values.
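The summary measures above can be sketched as follows, assuming the estimates from all Monte Carlo data sets have already been evaluated on a common, evenly spaced grid. The function name and array layout are illustrative assumptions, and the Riemann sum uses the grid spacing as the rectangle width, which is one of several equivalent conventions.

```python
import numpy as np


def summarize_over_grid(estimates, truth, grid):
    """Integrated absolute bias and average pointwise variance (hypothetical helper).

    estimates : array of shape (n_sim, n_grid), one estimated curve per data set
    truth     : array of shape (n_grid,), the true F_k0 on the grid
    grid      : evenly spaced time points, e.g. 50 points over the range of t
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    grid = np.asarray(grid, dtype=float)
    mean_curve = estimates.mean(axis=0)                 # average estimate over data sets
    step = grid[1] - grid[0]                            # width of each Riemann rectangle
    iab = np.sum(np.abs(mean_curve - truth)) * step     # Riemann sum of |mean - truth|
    avg_var = estimates.var(axis=0, ddof=1).mean()      # pointwise variance, then average
    return iab, avg_var
```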

Table 2.

Results for Experiment 1 and 2 across a range of time points: integrated absolute bias, average pointwise variance, and average 95% coverage probabilities of estimators at different censoring rates. Results based on 500 simulations with sample size n = 500.

	Censoring rate
	0%		20%		40%
Estimator	F₁ (t)	F₂ (t)	F₁ (t)	F₂ (t)	F₁ (t)	F₂ (t)
		Experiment 1

	Integrated absolute bias^*

EM-PAVA	0.0085	0.0065	0.0190	0.0071	0.0327	0.0199
Oracle EFFAIPW	0.0040	0.0055	0.0248	0.0232	0.0967	0.0689
type I NPMLE	0.1409	0.0407	0.2276	0.1063	0.4726	0.5084
type II NPMLE	0.4290	0.2960	0.5656	0.3332	0.7127	0.3814
	Average pointwise variance^*

EM-PAVA	0.0009	0.0005	0.0012	0.0006	0.0015	0.0014
Oracle EFFAIPW	0.0009	0.0005	0.0011	0.0007	0.0016	0.0015
type I NPMLE	0.0010	0.0013	0.0013	0.0017	0.0022	0.0038
type II NPMLE	0.0006	0.0003	0.0013	0.0004	0.0024	0.0009
	Average 95% coverage probabilities^†

EM-PAVA	0.9512	0.9551	0.9530	0.9518	0.9513	0.9535
Oracle EFFAIPW	0.9498	0.9557	0.9535	0.9514	0.9519	0.9445
type I NPMLE	0.9471	0.9508	0.9378	0.9344	0.9130	0.8458
type II NPMLE	0.3756	0.5838	0.4234	0.5927	0.3890	0.6760

		Experiment 2

	Integrated absolute bias^**

EM-PAVA	0.1372	0.0342	0.1140	0.0307	0.1049	0.0261
Oracle EFFAIPW	0.0966	0.0266	0.1282	0.0729	0.2704	0.1215
type I NPMLE	0.1097	0.0467	0.0770	0.0574	0.0791	0.0557
type II NPMLE	3.7021	2.4581	3.9157	2.4937	4.4027	2.5877
	Average pointwise variance^**

EM-PAVA	0.0011	0.0003	0.0013	0.0003	0.0014	0.0003
Oracle EFFAIPW	0.0011	0.0003	0.0013	0.0003	0.0015	0.0003
type I NPMLE	0.0013	0.0007	0.0016	0.0007	0.0017	0.0008
type II NPMLE	0.0006	0.0001	0.0006	0.0001	0.0007	0.0002
	Average 95% coverage probabilities^††

EM-PAVA	0.9564	0.9495	0.9538	0.9513	0.9552	0.9530
Oracle EFFAIPW	0.9547	0.9436	0.9518	0.9475	0.9507	0.9467
type I NPMLE	0.9556	0.9479	0.9506	0.9492	0.9505	0.9481
type II NPMLE	0.5738	0.4737	0.5781	0.4740	0.5504	0.4805

^* Computed over (0,10) for F₁(t) and F₂(t).
^† Computed over (0,4) for F₁(t) and over (0,9) for F₂(t).
^** Computed over (0,100) for F₁(t) and F₂(t).
^†† Computed over (48,100) for F₁(t) and F₂(t).

Third, we evaluated the type I error rate and power in detecting differences between F₁(t) and F₂(t) over the entire range of t values. We investigated the type I error rate under H₀ : F₁(t) = F₂(t) based on 1000 simulations. In this case, we generated data so that F₂(t) was set to the form of F₁(t) in each experiment (see the description of Experiment 1, 2, 3). Everything else was left unchanged. The type I error rate was then computed using the permutation test in Section 3.3 using 1000 permutations. The power was computed based on 200 Monte Carlo simulations. That is, we tested for differences between F₁(t) and F₂(t) when F₁(t), F₂(t) were evaluated at 50 time points evenly spaced across the entire range: over (0,10) in Experiments 1 and 3, and over (0,100) in Experiment 2. To compute the empirical power under H₁ : F₁(t) ≠ F₂(t), we used the permutation test in Section 3.3 with 1000 permutations. Results are in Tables 3 and ?? (Supplementary Material).

Table 3.

Empirical rejection rates for Experiment 1 and 2. Test of F₁ (t) = F₂ (t) over the entire time range was performed using a permutation test with 1000 permutations. Results based on 1000 simulations (for test under H₀) and 200 simulations (for test under H₁), with sample size n = 500 and 40% censoring (under H₁).

	Nominal Levels
Estimator	0.01	0.05	0.10	0.20	0.01	0.05	0.10	0.20
			Experiment 1

	Under H₀ : F₁ (t) = F₂ (t)				Under H₁ : F₁ (t) ≠ F₂ (t)

EM-PAVA	0.0120	0.0560	0.0950	0.1920	0.9000	0.9800	0.9900	1.0000
Oracle EFFAIPW	0.0090	0.0500	0.0900	0.1820	0.6150	0.7950	0.8650	0.9350
type I NPMLE	0.0130	0.0550	0.1020	0.1970	0.6200	0.7650	0.8450	0.9000
type II NPMLE	0.0060	0.0490	0.1020	0.2020	0.4400	0.5150	0.5550	0.5900
			Experiment 2

	Under H₀ : F₁ (t) = F₂ (t)				Under H₁ : F₁ (t) ≠ F₂ (t)

EM-PAVA	0.0170	0.0551	0.1022	0.2094	0.9950	0.9950	0.9950	1.0000
Oracle EFFAIPW	0.0140	0.0600	0.1100	0.2050	0.9950	0.9950	0.9950	1.0000
type I NPMLE	0.0080	0.0550	0.1120	0.2100	0.9200	0.9400	0.9500	0.9600
type II NPMLE	0.0100	0.0550	0.1120	0.2150	0.7000	0.7300	0.7500	0.7700

4.2. Simulation Results

Among the four estimators considered, the type I NPMLE has the largest estimation variability and the type II NPMLE has the largest estimation bias (see Tables 2 and ?? (Supplementary Material)). In all experiments, as the censoring rate increases from 0% to 40%, the inefficiency of the type I NPMLE and the bias of the type II NPMLE worsen. This poor performance degrades the 95% coverage probabilities, especially for the type II NPMLE, whose coverage probabilities are well under the nominal level (see Table 2). The inconsistency of the type II NPMLE is most apparent in Experiments 1 and 2, where the estimated curve and 95% confidence band completely miss the true underlying distributions; see Figures 1 and 2. The type II NPMLE is also not consistent in Experiment 3, but to a lesser extent; see Figure ?? (Supplementary Material).

Fig 1 — Experiment 1. True cumulative distribution function and the mean of 500 simulations along with 95% confidence band (dotted) for the four proposed estimators. Sample size is 500, censoring rate is 40%.

Fig 2 — Experiment 2. True cumulative distribution function and the mean of 500 simulations along with 95% confidence band (dotted) for the four proposed estimators. Sample size is 500, censoring rate is 40%.

In contrast, across all experiments and censoring rates, the EM-PAVA estimator performs satisfactorily throughout the entire range of t (see Figures 1, 2 and ?? (Supplementary Material)). The EM-PAVA estimator is as efficient as the Oracle EFFAIPW, but with much smaller bias especially when censoring is present. The EM-PAVA also performs well in detecting small differences between F₁(t) and F₂(t). In Table 3, the type I error rates for all estimators adhere to their nominal levels. When F₁(t) and F₂(t) are largely different (i.e., Experiment 2), then both EM-PAVA and the Oracle EFFAIPW have similar power in detecting differences. However, when F₁(t) and F₂(t) are different but to a lesser degree (i.e., Experiment 1) then EM-PAVA has larger power in detecting the difference than all other estimators, including the Oracle EFFAIPW. The larger power of the EM-PAVA estimator is not too surprising considering that it estimates F(t) across a range of time points, unlike the point-wise estimation of the Oracle EFFAIPW.

A benefit of EM-PAVA over the Oracle EFFAIPW (and the two NPMLEs) is that EM-PAVA yields a genuine distribution function (i.e., the estimator is monotone, non-negative and takes values in [0,1]). The curves shown in Figures 1, 2 and ?? (Supplementary Material) for the Oracle EFFAIPW are the result of a post-estimation procedure that enforces monotonicity. That the Oracle EFFAIPW estimator is not a genuine distribution function, however, is evident from its 95% confidence band, which was constructed from the 2.5% and 97.5% pointwise quantiles of the 500 Monte Carlo data sets. Figure ?? (Supplementary Material) shows that the Oracle EFFAIPW estimator can have 95% confidence bands outside [0,1]; for large t in Figure ??, the upper confidence bound is larger than 1. In contrast, the EM-PAVA estimator is always guaranteed to be within [0,1], and thus its 95% confidence bands are always within this range.
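To make the idea of a monotonizing step concrete, here is a minimal sketch of the pool-adjacent-violators algorithm (PAVA) applied to a sequence of pointwise estimates, followed by clipping to [0,1]. This is one possible post-estimation procedure and is not necessarily the one used for the Oracle EFFAIPW curves; the function names are hypothetical and the weights are assumed to be strictly positive.

```python
import numpy as np


def pava(y, w=None):
    """Weighted least-squares non-decreasing fit by pooling adjacent violators."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    blocks = []  # each block: [weighted mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge the last two blocks while they violate the ordering.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, n1 + n2])
    # Expand each block back to one fitted value per input point.
    return np.concatenate([np.full(n, m) for m, _, n in blocks])


def monotone_cdf(values):
    """Isotonic fit of pointwise estimates, then restriction to [0, 1]."""
    return np.clip(pava(values), 0.0, 1.0)
```

For example, `monotone_cdf([-0.1, 0.2, 0.15, 1.05])` pools the two middle values to their average 0.175 and clips the end points to 0 and 1.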

5. Application to the CORE-PD study

5.1. CORE-PD data and mixture proportions

We applied our estimator to the CORE-PD study introduced in Section 1.1. Data from the CORE-PD study include information from first-degree relatives (i.e., parents, siblings, and children) of PARK2 probands. The probands had age-at-onset (AAO) of Parkinson’s disease (PD) less than or equal to 50 and did not carry mutations in other genes (i.e., neither LRRK2 mutations nor GBA mutations, Marder et al. (2010)). The key interest is estimating the cumulative risk of PD-onset for the first-degree relatives belonging to different populations:

PARK2 mutation carrier vs. non-carrier: We compared the estimated cumulative risk in first-degree relatives expected to carry one or more copies of a mutation in the PARK2 gene (carriers) to that in relatives expected to carry no mutation (non-carriers).
PARK2 compound heterozygous (or homozygous) mutation carrier vs. heterozygous mutation carrier vs. non-carrier: We considered first-degree relatives who have the compound heterozygous genotype (two or more different copies of the mutation) or homozygous genotype (two or more copies of the same mutation). We compared the distribution of risk in this population to that in two other populations: (a) relatives who are expected to have the heterozygous genotype (mutation on a single allele), and (b) relatives who are expected to be non-carriers (no mutation). These comparisons will bring insight into whether heterozygous PARK2 mutations alone increase the risk of PD, or if additional risk alleles play a role.

In the CORE-PD study, the ages-at-onset for the first-degree relatives are at least 90% censored. Information about the population to which a relative belongs is available through the mixture proportions. The mixture proportions are vectors (p_i, 1 − p_i), where p_i is the probability of the ith first-degree relative carrying at least one copy of a mutation. This probability was computed from the proband’s genotype and the relative’s relationship to the proband, under the assumption of Mendelian transmission. For example, a child of a heterozygous carrier proband has a probability of 0.5 of inheriting the mutated allele, and thus a probability of 0.5 of being a carrier. A child of a homozygous carrier proband has a probability of 1 of being a carrier. More details are given in Wang et al. (2007, 2008). Summary statistics for the populations and the mixture proportions are listed in Table 4.
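The two examples in the paragraph above can be written as a small calculation. This is a simplified sketch covering only children of a proband, and it assumes the child's other parent carries no mutation (an approximation for a rare mutation); the function names are hypothetical, and the full calculation for parents and siblings is described in the cited references rather than here.

```python
def child_carrier_probability(proband_mutated_alleles):
    """Probability that a child of the proband carries at least one mutated allele.

    Assumes Mendelian transmission (each parental allele is passed on with
    probability 1/2) and, as a labeled simplification, a non-carrier other parent.
    """
    if proband_mutated_alleles not in (0, 1, 2):
        raise ValueError("number of mutated alleles must be 0, 1 or 2")
    return proband_mutated_alleles / 2.0


def mixture_proportion(p):
    """Mixture-proportion vector (p, 1 - p) for carrier vs. non-carrier."""
    return (p, 1.0 - p)


# Child of a heterozygous proband: (0.5, 0.5); child of a homozygous proband: (1.0, 0.0).
heterozygous_child = mixture_proportion(child_carrier_probability(1))
homozygous_child = mixture_proportion(child_carrier_probability(2))
```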

Table 4.

Summary statistics for the CORE-PD study. Total number of first-degree relatives (n), number of parents, siblings, and children, and percentage of first-degree relatives who have the specified mixture proportion (p, 1 − p), where p is the probability of a relative carrying at least one copy of a mutation.

					Mixture proportion (%)
	n	Parents	Siblings	Children	(1,0)	(0,1)	(0.5,0.5)
Carrier vs. Non-Carrier	355	63	182	110	31.5	64.8	3.7
Compound Heterozygous Carrier or Homozygous Carrier^*	17	1	15	1	100.0	0	0
Heterozygous Carrier vs. Non-Carrier	338	62	167	109	28.1	68.1	3.8

^* Genotypes for subjects in this group are known.

5.2. Results

We estimated the cumulative risk based on the EM-PAVA estimator and compared its results with the type I NPMLE. The Oracle EFFAIPW estimator could not be used because the high censoring led to unstable estimation: the inverse weights in the estimator were close to zero. Estimates for the PARK2 compound heterozygous (or homozygous) mutation carriers were based on a Kaplan-Meier estimator because these subjects were observed to carry two or more mutations and there is no uncertainty about the relatives’ genotype status (i.e., the data is not mixture data). We report the cumulative risk estimates along with 95% confidence intervals based on 100 Bootstrap replicates.

Figure 3 (top-right) shows that by age 50, PARK2 mutation carriers have a large increase in cumulative risk of PD onset compared to non-carriers. Based on EM-PAVA, the cumulative risk (see Table 5) of PD-onset for PARK2 mutation carriers at age 50 is 17.1% (95% CI: 8.5%, 25.6%) whereas the cumulative risk for non-carriers at age 50 is 0.8% (95% CI: 0%, 2.1%). This difference between PARK2 mutation carriers and non-carriers at age 50 was formally tested using the permutation test in Section 3.3. We found that carrying a PARK2 mutation significantly increases the cumulative risk by age 50 (p-value < 0.001, Table 7), suggesting that a mutation in the PARK2 gene substantially increases the chance of early onset PD. The difference is smaller yet still significant at age 70 (p-value = 0.04, Table 7). Even across the age range (20,70), the cumulative risk for PARK2 mutation carriers was significantly different from the cumulative risk for non-carriers (p-value = 0.010, see Table 7). These findings are consistent with other clinical and biological evidence that PARK2 mutations contribute to early age onset of PD (Hedrich et al., 2004; Lücking et al., 2000).

Fig 3 — CORE-PD study. Estimated cumulative distribution function for age-at-onset of Parkinson’s disease for Parkin mutation carrier vs. non-carrier (top), and Parkin compound heterozygous or homozygous carrier vs. Parkin heterozygous carrier and non-carrier (bottom).

Table 5.

Results for Parkin mutation carriers vs. non-carriers: estimated cumulative distribution function and 95% confidence intervals (in parentheses) based on type I NPMLE and EM-PAVA.

Age	type I NPMLE	EM-PAVA	type I NPMLE	EM-PAVA
	Carrier		Non-Carrier

20	0.015 (0.000, 0.043)	0.017 (0.000, 0.048)	−0.011 (−0.009, 0.000)	0.000 (0.000, 0.000)
25	0.023 (0.007, 0.061)	0.026 (0.008, 0.068)	−0.011 (−0.013, −0.001)	0.000 (0.000, 0.000)
30	0.032 (0.008, 0.073)	0.036 (0.009, 0.083)	−0.011 (−0.016, −0.002)	0.000 (0.000, 0.000)
35	0.061 (0.026, 0.116)	0.068 (0.029, 0.134)	−0.011 (−0.026, −0.007)	0.000 (0.000, 0.000)
40	0.072 (0.030, 0.128)	0.081 (0.034, 0.143)	−0.011 (−0.030, −0.008)	0.000 (0.000, 0.000)
45	0.121 (0.058, 0.198)	0.137 (0.067, 0.217)	−0.011 (−0.044, −0.015)	0.000 (0.000, 0.000)
50	0.150 (0.074, 0.225)	0.171 (0.085, 0.256)	−0.011 (−0.053, −0.005)	0.008 (0.000, 0.021)
55	0.166 (0.091, 0.263)	0.190 (0.104, 0.299)	−0.011 (−0.057, −0.008)	0.008 (0.000, 0.021)
60	0.166 (0.086, 0.262)	0.190 (0.105, 0.299)	−0.011 (−0.053, 0.016)	0.023 (0.000, 0.053)
65	0.321 (0.117, 0.505)	0.266 (0.138, 0.400)	0.117 (−0.039, 0.250)	0.027 (0.000, 0.060)
70	0.321 (0.109, 0.495)	0.266 (0.148, 0.400)	0.170 (−0.005, 0.323)	0.094 (0.009, 0.193)

Table 7.

P-values associated with testing H₀ : F₁ (t) = F₂ (t) at different t-values for CORE-PD study. H₀ was tested using the permutation test with 1000 permutations.

	type I NPMLE	EM-PAVA
	Carrier vs. Non-Carrier

t ∈ [20, 70]	0.013	0.010
t = 50	<0.001	<0.001
t = 70	0.073	0.04
	Het. Carrier vs Non-Carrier

t ∈ [20, 70]	0.790	0.594
t = 50	0.341	0.386
t = 70	0.813	0.969
	Compound Het./Hom. Carrier vs Het. Carrier

t ∈ [20, 70]	0.013	0.006
t = 50	<0.001	<0.001
t = 70	0.013	0.017
	Compound Het/Hom. Carrier vs Non-Carrier

t ∈ [20, 70]	0.011	0.007
t = 50	<0.001	<0.001
t = 70	0.013	0.017

To further distinguish the risk of PD among compound heterozygous or homozygous carriers (with at least two copies of mutations) from that among heterozygous carriers, we separately estimated the distribution functions in these two groups and compared them to the risk in the non-carrier group. The numerical results in Table 6 and a plot of the cumulative risk in Figure 3 (bottom panel) indicate a highly elevated risk in compound heterozygous or homozygous carriers combined. In contrast, the risk for heterozygous carriers closely resembles the risk in non-carriers. The finding that heterozygous carriers have essentially the same risk as non-carriers was also observed in another study (Wang et al., 2008). Further investigation in a larger study is needed to examine whether risk differs in any subgroup. Using a permutation test, we also formally tested for differences between the distribution functions for each group. Results in Table 7 show that there is a significant difference between compound heterozygous carriers and heterozygous carriers, as well as a significant difference between compound heterozygous carriers and non-carriers, over the age range (20,70) and at the particular ages 50 and 70. Furthermore, there is no significant difference between heterozygous carriers and non-carriers. These analyses suggest a recessive mode of inheritance for PARK2 gene mutations for early onset PD.

Table 6.

Results for Parkin compound heterozygous or homozygous carrier (Compound Carrier), Parkin heterozygous carrier and non-carrier: estimated cumulative distribution function and 95% confidence intervals (in parentheses).

Age	Kaplan-Meier^*	type I NPMLE	EM-PAVA
	Compound Carrier	Heterozygous Carrier

20	0.118 (0.000, 0.258)	0.000 (0.000, 0.000)	0.000 (0.000, 0.000)
25	0.118 (0.000, 0.258)	0.009 (0.000, 0.027)	0.010 (0.000, 0.030)
30	0.186 (0.000, 0.355)	0.009 (0.000, 0.027)	0.010 (0.000, 0.030)
35	0.389 (0.087, 0.591)	0.009 (0.000, 0.027)	0.010 (0.000, 0.030)
40	0.389 (0.087, 0.591)	0.023 (0.000, 0.049)	0.026 (0.000, 0.056)
45	0.644 (0.252, 0.830)	0.037 (0.000, 0.089)	0.041 (0.000, 0.100)
50	0.822 (0.391, 0.948)	0.037 (−0.004, 0.088)	0.041 (0.000, 0.100)
55	0.822 (0.391, 0.948)	0.056 (0.000, 0.119)	0.063 (0.000, 0.130)
60	0.822 (0.391, 0.948)	0.056 (−0.001, 0.116)	0.064 (0.000, 0.131)
65	0.911 (0.432, 0.986)	0.177 (0.042, 0.304)	0.100 (0.016, 0.206)
70	0.911 (0.432, 0.986)	0.177 (0.027, 0.288)	0.100 (0.016, 0.206)
		Non-Carrier

20		−0.002 (0.000, 0.000)	0.000 (0.000, 0.000)
25		−0.002 (−0.007, 0.000)	0.000 (0.000, 0.000)
30		−0.002 (−0.007, 0.000)	0.000 (0.000, 0.000)
35		−0.002 (−0.007, 0.000)	0.000 (0.000, 0.000)
40		−0.002 (−0.014, 0.000)	0.000 (0.000, 0.000)
45		−0.002 (−0.018, 0.000)	0.000 (0.000, 0.000)
50		−0.002 (−0.015, 0.014)	0.008 (0.000, 0.022)
55		−0.002 (−0.023, 0.011)	0.008 (0.000, 0.022)
60		0.009 (−0.022, 0.044)	0.023 (0.000, 0.055)
65		0.142 (−0.006, 0.259)	0.032 (0.000, 0.076)
70		0.199 (0.009, 0.334)	0.106 (0.015, 0.181)

^* Genotypes for subjects in this group are known. When there is no mixture, both methods reduce to Kaplan-Meier.

In comparison to the EM-PAVA, the type I NPMLE had wide and non-monotone confidence intervals, which is undesirable and altered the inference conclusions (see Table 7). Moreover, the type I NPMLE gave a cumulative risk in non-carriers by age 70 (17%) that appears to be higher than reported in other epidemiological studies (e.g., Wang et al. (2008)). The poor performance of the type I NPMLE may be due to its instability and inefficiency, especially in the right tail. In contrast, EM-PAVA always provided monotone distribution function estimates, as well as monotone and narrower confidence bands. The EM-PAVA also gave a lower cumulative risk in non-carriers by age 70 (9.4%), which better reflects the population-based estimates. The increased risk in PARK2 carriers at earlier ages compared to population-based estimates may also suggest that early-onset cases of PD have genetic and environmental causes that differ from those of late-onset cases.

6. Concluding Remarks

In this work, we provide nonparametric estimation of age-specific cumulative risk for mutation carriers and non-carriers. This topic is an important issue in genetic counseling since clinicians and patients use risk estimates to guide their decisions on choices of preventive treatments and planning for the future. For example, individuals with a family history of Parkinson’s disease generally stated that if they were found to be a carrier and in their mid-thirties, they would most likely elect to not have children (McInerney-Leo et al., 2005). Or, in the instance they did choose to start a family, PARK2 mutation carriers were more inclined to undergo prenatal testing (McInerney-Leo et al., 2005).

It is well known that the NPMLE is the most robust and efficient method when there is no parametric assumption for the underlying distribution functions. Unfortunately, in the mixture model discussed in this paper, the NPMLE (type II) fails to produce consistent estimates. On the other hand, the maximum binomial likelihood method studied in this paper provides an alternative consistent estimation method. Moreover, to implement this method we have used the combination of an EM algorithm and PAVA, which leads to genuine distribution function estimates. For a non-mixture model, the proposed method coincides with the NPMLE. As a result, we expected the proposed method to have high efficiency, which was borne out by the various simulation studies. Even though we only considered two-component mixture models, in principle the proposed method can be applied to mixture models with more than two components without essential difficulty.

In some applications, it may be desirable to consider parametric or semi-parametric models (e.g., Cox proportional hazards model, proportional odds model) in future work. However, diagnosing model misspecification has received little attention in the genetics literature. Our maximum binomial likelihood method can be used as a basis to construct numerical goodness-of-fit tests. In this case, we can test whether the distributions conform to a particular parametric or semiparametric model. That is, the interest is in testing H₀ : F₁(t) = F₁(t, β₁), F₂(t) = F₂(t, β₂) for some parametric models F₁(t, β₁) and F₂(t, β₂). To perform this test, we can use the Kolmogorov-Smirnov goodness-of-fit statistic

$$\Delta = \sqrt{n} \max_{-\infty < t < \infty} \left\{ \left| \tilde{F}_{1}(t) - F_{1}(t, \hat{\beta}_{1}) \right| + \left| \tilde{F}_{2}(t) - F_{2}(t, \hat{\beta}_{2}) \right| \right\},$$

where β̂₁, β̂₂ are the parametric maximum likelihood estimates of β₁ and β₂. Moreover if one is interested in estimating other quantities of the underlying distribution functions, for example the densities, one may use the kernel method to smooth the estimated distribution functions.
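As a rough illustration of the Kolmogorov-Smirnov-type statistic above, the sketch below evaluates it from curves tabulated on a common grid of time points. Replacing the maximum over all t by a maximum over a finite grid is an approximation introduced here, and the function and argument names are hypothetical; how a null distribution for the statistic would be obtained is not addressed.

```python
import numpy as np


def ks_type_statistic(F1_tilde, F2_tilde, F1_param, F2_param, n):
    """sqrt(n) * max_t { |F1~ - F1(., beta1^)| + |F2~ - F2(., beta2^)| } on a grid.

    All four inputs are arrays of distribution-function values evaluated at the
    same time points; n is the sample size.
    """
    diff = (np.abs(np.asarray(F1_tilde, dtype=float) - np.asarray(F1_param, dtype=float))
            + np.abs(np.asarray(F2_tilde, dtype=float) - np.asarray(F2_param, dtype=float)))
    return np.sqrt(n) * diff.max()
```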

In our analysis of CORE-PD data, probands were not included due to concerns of potential ascertainment bias that may be difficult to adjust (Begg, 2002). In studies where a clear ascertainment scheme is implemented, adjustment can be made based on a retrospective likelihood. Lastly, the computational procedure of the proposed estimator is simple and efficient. An R function implementing the proposed method is available from the authors.

APPENDIX A

SKETCH OF TECHNICAL ARGUMENTS

A.1. The type I and type II NPMLEs

For the type I NPMLE, let $s_{j} (x_{i}) = u_{j}^{T} d F (x_{i})$ and $S_{j} (x_{i}) = 1 - u_{j}^{T} F (x_{i})$ , i = 1, …, n, j = 1, …, m. The type I NPMLE maximizes

$$\sum_{j=1}^{m} \sum_{i=1}^{n} \log\left\{ s_{j}(x_{i})^{\delta_{i}} S_{j}(x_{i})^{1-\delta_{i}} \right\} I(q_{i} = u_{j})$$

with respect to s_j(x_i)’s and subject to $\sum_{i = 1}^{n} s_{j} (x_{i}) I (q_{i} = u_{j}) \leq 1, s_{j} (x_{i}) \geq 0$ for j = 1, …, m. Because this is equivalent to m separate maximization problems, each concerning s_j(·) and S_j(·) only, the maximizers are the classical Kaplan-Meier estimators:

$$\hat{S}_{j}(t) = \prod_{x_{i} \leq t,\, q_{i} = u_{j}} \left\{ 1 - \frac{\delta_{i}}{\sum_{q_{k} = u_{j}} I(x_{k} \geq x_{i})} \right\},$$

with s_j(t) = S_j(t⁻) − S_j(t) for all t. With Ŝ (t) = {Ŝ₁(t), …, Ŝ_m(t)}^T, and U = (u₁, …, u_m)^T, the type I NPMLE is

$$\tilde{F}_{\text{type I}}(t) = (U^{T} U)^{-1} U^{T} \{1_{m} - \hat{S}(t)\}.$$

Let the variance-covariance matrix of Ŝ (t) be Σ, which is a diagonal matrix because each of the m components of Ŝ (t) is estimated using a distinct subset of the observations. Then, F̃_w(t) = (U^TΣ⁻¹U)⁻¹U^TΣ⁻¹{1_m − Ŝ (t)} is a weighted version of the type I NPMLE and is more efficient than the type I NPMLE.
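The unweighted type I NPMLE described above can be sketched at a single time point as follows: one Kaplan-Meier estimate per distinct mixture vector, then a least-squares solve. The function names are hypothetical, tied event times are handled in the usual grouped Kaplan-Meier way, and the distinct mixture vectors are assumed to give U full column rank. The weighted version is not implemented.

```python
import numpy as np


def kaplan_meier_survival(x, delta, t):
    """Kaplan-Meier estimate of S(t) from observed times x and event indicators delta."""
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=int)
    surv = 1.0
    for time in np.unique(x[(delta == 1) & (x <= t)]):
        at_risk = np.sum(x >= time)                   # subjects still under observation
        events = np.sum((x == time) & (delta == 1))   # events at this time
        surv *= 1.0 - events / at_risk
    return surv


def type1_npmle(x, delta, q, t):
    """(U^T U)^{-1} U^T {1_m - S_hat(t)}, with one Kaplan-Meier fit per distinct q."""
    x = np.asarray(x, dtype=float)
    delta = np.asarray(delta, dtype=int)
    q = np.asarray(q, dtype=float)
    U = np.unique(q, axis=0)                          # rows are the distinct u_j
    s_hat = []
    for u in U:
        in_group = (q == u).all(axis=1)               # I(q_i = u_j)
        s_hat.append(kaplan_meier_survival(x[in_group], delta[in_group], t))
    # Least-squares solution of U F(t) = 1_m - S_hat(t).
    F, *_ = np.linalg.lstsq(U, 1.0 - np.array(s_hat), rcond=None)
    return F
```

Because the solve is unconstrained, the returned values need not be monotone in t or lie in [0,1], which is consistent with the negative type I estimates reported for the CORE-PD data.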

The type II NPMLE has no closed form solution, and an EM algorithm is typically employed. Specifically, for k = 1, 2, we form at the bth step in the EM algorithm:

$$c_{ik}^{(b)} = \delta_{i} \frac{q_{ik}\, dF_{k}^{(b)}(x_{i})}{\sum_{k=1}^{2} q_{ik}\, dF_{k}^{(b)}(x_{i})} + (1 - \delta_{i}) \frac{q_{ik} \{1 - F_{k}^{(b)}(x_{i})\}}{\sum_{k=1}^{2} q_{ik} \{1 - F_{k}^{(b)}(x_{i})\}},$$

and update the type II NPMLE estimate as

$$1 - \check{F}_{\text{type II},k}^{(b+1)}(t) = \prod_{x_{i} \leq t,\, \delta_{i} = 1} \left\{ 1 - \frac{\sum_{j=1}^{n} I(x_{j} = x_{i}, \delta_{j} = 1)\, c_{jk}^{(b)}}{\sum_{j=1}^{n} c_{jk}^{(b)} I(x_{j} \geq x_{i})} \right\} = \prod_{x_{i} \leq t,\, \delta_{i} = 1} \left\{ 1 - \frac{c_{ik}^{(b)}}{\sum_{j=1}^{n} c_{jk}^{(b)} I(x_{j} \geq x_{i})} \right\}.$$

The procedure is iterated until convergence.
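The iteration can be sketched as below, assuming two components, no tied observation times, and a common starting value that places equal mass on every uncensored observation. These choices, the function name, and the small numerical guard `tiny` are assumptions made for illustration only. The sketch shows the mechanics of the E- and M-steps; it does not address the inconsistency of the type II NPMLE discussed in the main text.

```python
import numpy as np


def type2_npmle_em(x, delta, q, n_iter=500, tol=1e-10):
    """EM iterations for the type II NPMLE (hypothetical helper, no tied times)."""
    order = np.argsort(x)
    x = np.asarray(x, dtype=float)[order]
    delta = np.asarray(delta, dtype=float)[order]
    q = np.asarray(q, dtype=float)[order]
    # Start both components with equal jumps at the uncensored times.
    dF = np.repeat((delta / delta.sum())[:, None], 2, axis=1)
    tiny = 1e-300  # guards against division by zero
    for _ in range(n_iter):
        surv = 1.0 - np.cumsum(dF, axis=0)            # 1 - F_k(x_i)
        # E-step: c_ik uses dF_k(x_i) for events and 1 - F_k(x_i) for censored times.
        lik = q * np.where(delta[:, None] == 1, dF, np.clip(surv, 0.0, None))
        c = lik / np.maximum(lik.sum(axis=1, keepdims=True), tiny)
        # M-step: weighted product-limit update for each component.
        at_risk = np.cumsum(c[::-1], axis=0)[::-1]    # sum_j c_jk I(x_j >= x_i)
        hazard = delta[:, None] * c / np.maximum(at_risk, tiny)
        surv_new = np.cumprod(1.0 - hazard, axis=0)
        surv_before = np.vstack([np.ones((1, 2)), surv_new[:-1]])
        dF_new = surv_before * hazard                 # jump of F_k at x_i
        converged = np.max(np.abs(dF_new - dF)) < tol
        dF = dF_new
        if converged:
            break
    return x, np.cumsum(dF, axis=0)                   # sorted times and F_k at those times
```

When every q_i is (1, 0) or (0, 1), the weights c_ik equal q_ik and the update reduces to a separate Kaplan-Meier estimate in each group, which gives a quick sanity check of the code.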

A.2. Consistency of Imputed Log-Likelihood

We first demonstrate consistency for the non-censored data case. When F takes discrete finite many values, the result holds true trivially. If F is a continuous distribution function, then for non-censored data, the binomial log-likelihood is

$$\ell = \sum_{j=1}^{h} \sum_{i=1}^{n} \left\{ I(s_{i} \leq t_{j}) \log[\lambda_{i} F_{1}(t_{j}) + (1 - \lambda_{i}) F_{2}(t_{j})] + I(s_{i} > t_{j}) \log[\lambda_{i} \bar{F}_{1}(t_{j}) + (1 - \lambda_{i}) \bar{F}_{2}(t_{j})] \right\}.$$

This can be written as

$$n^{-2} \ell = \int \int \left\{ I(s \leq t) \log[\lambda F_{1}(t) + (1 - \lambda) F_{2}(t)] + I(s > t) \log[\lambda \bar{F}_{1}(t) + (1 - \lambda) \bar{F}_{2}(t)] \right\} d\eta_{n}(s, \lambda)\, d\xi_{n}(t)$$

where

$$\eta_{n}(s, \lambda) = n^{-1} \sum_{i=1}^{n} I(s_{i} \leq s, \lambda_{i} \leq \lambda), \qquad \xi_{h}(t) = h^{-1} \sum_{i=1}^{h} I(t_{i} \leq t).$$

By the Law of Large Numbers, it can be shown that

$$n^{-2} \ell \to \int \int \{\lambda F_{10}(t) + (1 - \lambda) F_{20}(t)\} \log\{\lambda F_{1}(t) + (1 - \lambda) F_{2}(t)\}\, d\eta_{0}(\lambda)\, d\xi_{0}(t) + \int \int \{\lambda \bar{F}_{10}(t) + (1 - \lambda) \bar{F}_{20}(t)\} \log\{\lambda \bar{F}_{1}(t) + (1 - \lambda) \bar{F}_{2}(t)\}\, d\eta_{0}(\lambda)\, d\xi_{0}(t) =: \Delta$$

where η₀(λ) is the marginal distribution of λ and

$$\xi_{0}(t) = \int \{\lambda F_{10}(t) + (1 - \lambda) F_{20}(t)\}\, d\eta_{0}(\lambda).$$

Here, the subscript ₀ denotes the truth. By the Kullback-Leibler information inequality, the above limiting value achieves the maximum if and only if F₁ = F₁₀ and F₂ = F₂₀. Therefore, the maximum binomial likelihood estimation is consistent.
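The Kullback-Leibler step can be spelled out pointwise. For fixed (λ, t), write p₀ = λF₁₀(t) + (1 − λ)F₂₀(t) and p = λF₁(t) + (1 − λ)F₂(t); this shorthand is introduced here only for the illustration. The integrand of the limit is then p₀ log p + (1 − p₀) log(1 − p), and

```latex
\begin{aligned}
&\{p_0 \log p_0 + (1 - p_0)\log(1 - p_0)\} - \{p_0 \log p + (1 - p_0)\log(1 - p)\} \\
&\qquad = p_0 \log\frac{p_0}{p} + (1 - p_0)\log\frac{1 - p_0}{1 - p} \;\ge\; 0,
\end{aligned}
```

with equality if and only if p = p₀, because the right-hand side is the Kullback-Leibler divergence between two Bernoulli distributions. Integrating over λ and t shows that the limit is maximized when λF₁(t) + (1 − λ)F₂(t) = λF₁₀(t) + (1 − λ)F₂₀(t) for almost all (λ, t). Passing from this equality to F₁ = F₁₀ and F₂ = F₂₀ uses the fact that λ takes at least two distinct values, as it does for the mixture proportions considered in this paper.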

For the censored data case, consistency also holds following a similar argument. The only difference in the log-likelihood is that the indicator function I(S_i ≤ t_j) is replaced by w_ij = E{I(S_i ≥ t_j)|S_i ≥ x_j}. If w_ij is replaced by an initial consistent estimate ŵ_i(t_j), then the censored binomial log-likelihood again converges to Δ.

Footnotes

^† T.P. Garcia is supported by the Huntington’s Disease Society of America, Human Biology Project Fellowship.

^‡ K. Marder is supported by NS036630 Parkinson Disease Foundation, UL1 RR024156.

^§ Y. Wang is the corresponding author and is supported by NIH grant NS073671.

SUPPLEMENTARY MATERIAL

Supplement: Additional Simulation Results (doi: 10.1214/00-AOASXXXXSUPP). The Supplementary Material contains additional simulation results.

Contributor Information

Jing Qin, Email: jingqin@niaid.nih.gov.

Tanya P. Garcia, Email: tpgarcia@srph.tamhsc.edu.

Yanyuan Ma, Email: ma@stat.tamu.edu.

Ming-Xin Tang, Email: mxt1@columbia.edu.

Karen Marder, Email: ksm1@cumc.columbia.edu.

Yuanjia Wang, Email: yw2016@columbia.edu.

REFERENCES

Alcalay RN, Caccappolo E, Mejia-Santana H, Tang MX, Rosado L, Ross BM, Verbitsky M, Kisselev S, Louis ED, Comella C, Colcher A, Jennings D, Nance MA, Bressman SB, Scott WK, Tanner C, Mickel S, Andrews H, Waters C, Fahn S, Cote L, Frucht S, Ford B, Rezak M, Novak K, Friedman JH, Pfeiffer R, Marsh L, Hiner B, Siderowf A, Ottman R, Marder K, Clark LN. Frequency of known mutations in early-onset Parkinson disease: implication for genetic counseling: the consortium on risk for early onset Parkinson disease study. Arch. Neurol. 2010;67:1116–1122. doi: 10.1001/archneurol.2010.194. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ayer M, Brunk HD, Ewing GM, Reid WT, Silverman An empirical distribution function for sampling with incomplete information. Ann. Math. Statist. 1955;26:641–647. [Google Scholar]
Barmi H, McKeague IW. Empirical likelihood-based tests for stochastic ordering. Bernoulli. 2013;19:295–307. doi: 10.3150/11-BEJ393SUPP. [DOI] [PMC free article] [PubMed] [Google Scholar]
Begg CB. On the Use of Familial Aggregation in Population-Based Case Probands for Calculating Penetrance. Journal of the National Cancer Institute. 2002;94:1221–1226. doi: 10.1093/jnci/94.16.1221. [DOI] [PubMed] [Google Scholar]
Barlow RE, Bartholomew DJ, Bremner JM, Brunk HD. Statistical Inference Under Order Restrictions. New York: John Wiley; 1972. [Google Scholar]
Churchill GA, Doerge RW. Empirical threshold values for quantitative trait mapping. Genetics. 1994;138:963–971. doi: 10.1093/genetics/138.3.963. [DOI] [PMC free article] [PubMed] [Google Scholar]
de Leeuw J, Hornik K, Mair P. Isotone optimization in R: Pool-Adjacent-Violators Algorithm (PAVA) and Active Set Methods. Journal of Statistical Software. 2009;5:1–24. [Google Scholar]
Efron B. The two sample problem with censored data. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; University of California Press; Berkeley, California. 1967. pp. 835–853. [Google Scholar]
Goldwurm S, Tunesi S, Tesei S, et al. Kin-cohort analysis of LRRK2-G2019S penetrance in Parkinson’s disease. Mov Disord. 2011:2144–2145. doi: 10.1002/mds.23807. [DOI] [PubMed] [Google Scholar]
Godambe VP. An Optimum Property of Regular Maximum Likelihood Estimation. Ann. Math. Stat. 1960;34:1208–1211. [Google Scholar]
Grady D, Parker-Pope T, Belluck P. Jolie’s disclosure of preventative mastectomy highlights dilemma. New York Times. 2013 May 15;:A1. 2013. [Google Scholar]
Grotzinger SJ, Witzgall C. Projections onto simplices. Applied Mathematics and Optimization. 1984;12:247–270. [Google Scholar]
Hedrich K, Eskelson C, Wilmot B, Marder K, Harris J, Garrels J, Mejia-Santana H, Vieregge P, Jacobs H, Bressman SB, Lang AE, Kann M, Abbruzzese G, Martinelli P, Schwinger E, Ozelius LJ, Pramstaller PP, Klein C, Kramer P. Distribution, type, and origin of Parkin mutations: review and case studies. Mov Disord. 2004;19:1146–1157. doi: 10.1002/mds.20234.
Huang CY, Qin J, Zou F. Empirical Likelihood-based Inference in a Genetic Mixture Model. Canadian Journal of Statistics. 2007;35:563–574.
Jewell NP, Kalbfleisch JD. Maximum likelihood estimation of ordered multinomial parameters. Biostatistics. 2004;5:291–306. doi: 10.1093/biostatistics/5.2.291.
Johansen S. The product limit estimator as maximum likelihood estimator. Scandinavian Journal of Statistics. 1978;5:195–199.
Kaplan EL, Meier P. Nonparametric estimation from incomplete observations. Journal of the American Statistical Association. 1958;53:457–481.
Kiefer J, Wolfowitz J. Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Ann. Math. Statist. 1956;27:887–906.
Kitada T, Asakawa S, Hattori N, Matsumine H, Yamamura Y, Minoshima S, Yokochi M, Mizuno Y, Shimizu N. Mutations in the Parkin gene cause autosomal recessive juvenile parkinsonism. Nature. 1998;392:605–608. doi: 10.1038/33416.
Khoury M, Beaty H, Cohen B. Fundamentals of Genetic Epidemiology. New York: Oxford University Press; 1993.
Kruskal JB. Nonparametric multidimensional scaling: a numerical method. Psychometrika. 1964;29:115–129.
Lücking CB, Dürr A, Bonifati V, Vaughan J, De Michele G, Gasser T, Harhangi BS, Meco G, Denèfle P, Wood NW, Agid Y, Brice A; French Parkinson’s Disease Genetics Study Group and European Consortium on Genetic Susceptibility in Parkinson’s Disease. Association between early-onset Parkinson’s disease and mutations in the Parkin gene. New England Journal of Medicine. 2000;342:1560–1567. doi: 10.1056/NEJM200005253422103.
Luss R, Rosset S, Shahar M. Isotonic recursive partitioning. 2010 Preprint, arXiv:1102.5496.
Ma Y, Wang Y. Efficient semiparametric estimation for mixture data. Electronic Journal of Statistics. 2012;6:710–737. doi: 10.1214/12-EJS690.
Ma Y, Wang Y. Estimating disease onset distribution functions in mutation carriers with censored mixture data. Journal of the Royal Statistical Society, Series C. 2013 in press.
Marder K, Levy G, Louis ED, Mejia-Santana H, Cote L, Andrews H, Harris J, Waters C, Ford B, Frucht S, Fahn S, Ottman R. Accuracy of family history data on Parkinson’s Disease. Neurology. 2003;61:18–23. doi: 10.1212/01.wnl.0000074784.35961.c0.
Marder KS, Tang MX, Mejia-Santana H, Rosado L, Louis ED, Comella CL, Colcher A, Siderowf AD, Jennings D, Nance MA, Bressman S, Scott WK, Tanner CM, Mickel SF, Andrews HF, Waters C, Fahn S, Ross BM, Cote LJ, Frucht S, Ford B, Alcalay RN, Rezak M, Novak K, Friedman JH, Pfeiffer RF, Marsh L, Hiner B, Neils GD, Verbitsky M, Kisselev S, Caccappolo E, Ottman R, Clark LN. Predictors of parkin mutations in early-onset Parkinson disease: the consortium on risk for early-onset Parkinson disease study. Arch. Neurol. 2010;67:731–738. doi: 10.1001/archneurol.2010.95.
McInerney-Leo A, Hadley DW, Gwinn-Hardy K, Hardy J. Genetic testing in Parkinson’s Disease. Movement Disorders. 2005;20:1–10. doi: 10.1002/mds.20316.
Oliveira SA, Scott WK, Martin ER, Nance MA, Watts RL, Hubble JP, Koller WC, Pahwa R, Stern MB, Hiner BC, Ondo WG, Allen FH, Jr, Scott BL, Goetz CG, Small GW, Mastaglia F, Stajich JM, Zhang F, Booze MW, Winn MP, Middleton LT, Haines JL, Pericak-Vance MA, Vance JM. Parkin mutations and susceptibility alleles in late-onset Parkinson’s disease. Ann. Neurol. 2003;53:624–629. doi: 10.1002/ana.10524.
Park Y, Taylor JM, Kalbfleisch JD. Pointwise nonparametric maximum likelihood estimator of stochastically ordered survivor functions. Biometrika. 2012;99(2):327–343. doi: 10.1093/biomet/ass006.
Robertson T, Wright FT, Dykstra RL. Order Restricted Statistical Inference. Wiley Series in Probability and Mathematical Statistics. Chichester: John Wiley & Sons Ltd; 1988.
Struewing JP, Hartge P, Wacholder S, Baker SM, Berlin M, McAdams M, Timmerman MM, Brody LC, Tucker MA. The risk of cancer associated with specific mutations of BRCA1 and BRCA2 among Ashkenazi Jews. New England Journal of Medicine. 1997;336:1401–1408. doi: 10.1056/NEJM199705153362001.
Wang Y, Garcia TP, Ma Y. Nonparametric estimation for uncensored mixture data with application to the cooperative Huntington’s observational research trial. Journal of the American Statistical Association. 2012;107:1324–1338. doi: 10.1080/01621459.2012.699353.
Wang Y, Clark LN, Marder K, Rabinowitz D. Nonparametric estimation of genotype-specific age-at-onset distributions from censored kin-cohort data. Biometrika. 2007;94:403–414.
Wang Y, Clark LN, Louis ED, Mejia-Santana H, Harris J, Cote LJ, Waters C, Andrews D, Ford B, Frucht S, Fahn S, Ottman R, Rabinowitz D, Marder K. Risk of Parkinson’s disease in carriers of Parkin mutations: estimation using the kin-cohort method. Arch Neurol. 2008;65:467–474. doi: 10.1001/archneur.65.4.467.
Wu CFJ. On the convergence properties of the EM algorithm. Annals of Statistics. 1983;11:95–103.
Wu RL, Ma CX, Casella G. Statistical Genetics of Quantitative Traits: Linkage, Maps, and QTL. New York: Springer-Verlag; 2007.

Supplementary Materials

Supplementary file: NIHMS586180-supplement-Supplementary_file.pdf (174.3 KB, PDF)

