Author manuscript; available in PMC: 2015 Apr 15.
Published in final edited form as: Stat Med. 2013 Sep 12;33(8):1369–1382. doi: 10.1002/sim.5971

Nonparametric modeling and analysis of association between Huntington’s disease onset and CAG repeats

Yanyuan Ma 1, Yuanjia Wang 1,*
PMCID: PMC3947445  NIHMSID: NIHMS521742  PMID: 24027120

Abstract

Huntington’s disease (HD) is a neurodegenerative disorder with a dominant genetic mode of inheritance caused by an expansion of CAG repeats on chromosome 4. Typically, a longer sequence of CAG repeat length is associated with increased risk of experiencing earlier onset of HD. Previous studies of the association between HD onset age and CAG length have favored a logistic model, where the CAG repeat length enters the mean and variance components of the logistic model in a complex exponential-linear form. To relax the parametric assumption of the exponential-linear association to the true HD onset distribution, we propose to leave both mean and variance functions of the CAG repeat length unspecified and perform semiparametric estimation in this context through a local kernel and backfitting procedure. Motivated by including family history of HD information available in the family members of participants in the Cooperative Huntington’s Observational Research Trial (COHORT), the methodology is developed in the context of mixture data, where some subjects have a positive probability of being risk free. We also allow censoring on the age-at-onset of disease and accommodate covariates other than the CAG length. We study the theoretical properties of the proposed estimator and derive its asymptotic distribution. Finally, the proposed methods are applied to the COHORT data to estimate the HD onset distribution using a group of study participants and the disease family history information available on their family members.

Keywords: Mixture, Varying-coefficient model, Partially linear model, Chronic disease, Age at onset

1 Motivating Study and the Existing Model

1.1 Huntington’s Disease Study

Huntington’s disease (HD) is a severe hereditary neurodegenerative disorder caused by an expansion of CAG repeats in a gene on chromosome 4 that encodes the protein huntingtin (Huntington’s Study Investigators 1993 [1]). Neurological and physical symptoms typically appear around 30–50 years of age in affected individuals, although they can develop much earlier (pre-teen) or much later in life (in the 80s; see, for example, [2]). Patients eventually die from complications such as pneumonia or heart failure, usually 15–20 years after disease onset, although the duration of the disease also varies with the onset age [3]. Clinical studies suggest that an individual with a CAG repeat length (denoted as X) smaller than 36 is risk free of HD (no risk of developing HD at any given age; [2, 4]). For an individual with CAG repeat length greater than or equal to 36, the CAG length is an important factor that is inversely correlated with the age-at-onset (AAO) of HD (denoted as T): subjects with longer CAG repeat lengths tend to have earlier onset.

To further study the association between CAG length and the onset time of HD, various large epidemiological studies on HD have been conducted worldwide. One such study is the Cooperative Huntington’s Observational Research Trial (COHORT), an observational study organized by 42 Huntington Study Group research centers in North America and Australia. In COHORT, the initial participants (probands) undergo a clinical evaluation where blood samples are sequenced for CAG repeat length [5]. Since 2005, the study has expanded to collect family members’ morbidity and mortality information (e.g., AAO of HD) through systematic family history interviews administered to the probands [5, 6]. However, due to the high cost of conducting in-person interviews of family members, blood samples of the family members were not collected. This kind of study is referred to as a kin-cohort study in [7], so the COHORT study with family history data can be classified as a kin-cohort design. A complexity arising from COHORT is that whether a relative shares the same CAG expansion status with a proband (e.g., whether a child has inherited the mutation allele with CAG expansion from a parent) is not available. Instead, we can obtain a relative’s probability of carrying a mutation allele, which is calculated through Mendelian law using the relative’s relationship with the proband and the proband’s mutation status (e.g., Section 8.4 in [8]; and [9, 10]). This calculation yields a probability p (0 ≤ p ≤ 1) that the relative shares the same mutation allele as his or her proband, so that the relative’s CAG repeat length is the same as the proband’s and he or she is at risk of HD. For example, parents, children and siblings of an at-risk proband have at-risk probabilities of p = 0.5 under the Mendelian law. Such a relative thus has a probability of 1 − p = 0.5 of sharing the normal allele with his or her proband, in which case the relative has a CAG length < 36 and is not at risk of HD. We assume that the CAG repeat length does not change in the gamete transmission process, i.e., a child inherits an expanded allele with the same repeat length from a parent. This assumption is used in the literature [6], and its implications are discussed in Section 5.

Another complexity arising from the COHORT study is that HD onset time is not observed for all study subjects, and some study subjects are censored because of loss to follow-up or death due to other causes before developing HD.

1.2 Existing Model

The functional form of the association between the onset time T and CAG repeat length X has been debated in the clinical literature and multiple parametric models have been proposed [11, 12]. Currently, the accepted model [11] captures the relation between the AAO of HD and the CAG length through a logistic link and assumes that the CAG length affects both the mean and the variance components of AAO through an exponential-linear form. Specifically, with F(t, x) defined as the conditional distribution pr(T < t | X = x), the model specifies

$$F(t,x)=\frac{1}{1+e^{-\{t-\mu(x)\}/s(x)}}, \tag{1}$$

where

$$\mu(x)=\mu_1+\exp(\mu_2-\mu_3x),\qquad s(x)=\sigma_1+\exp(\sigma_2-\sigma_3x),$$

and μ1, μ2, μ3, σ1, σ2, σ3 are six unspecified parameters to offer model flexibility and will be estimated from data.

Although (1) is the accepted model in the current clinical literature for describing the association between CAG length and HD onset, it imposes some rather strong assumptions. For example, the functional forms of both the mean and the variance are pre-specified to be exponential-linear, which is an advantage if they happen to reflect the true biological relationship between CAG length and onset time, but can be misleading otherwise. In addition, this global parametric model may not fit well for certain ranges of CAG lengths and ages (e.g., for smaller CAG length values; [11]). At the values of μi and σi (i = 1, 2, 3) fitted using data from [11], the model can be written in terms of additive and multiplicative coefficient functions of t as

$$F(t,x)=\frac{1}{1+e^{-\alpha_2(x)t-\alpha_1(x)}} \tag{2}$$

where, using the relation α1(x) = − μ(x)/s(x), α2(x) = 1/s(x), we get

$$\alpha_1(x)=-\frac{\pi}{\sqrt{3}}\{21.54+\exp(9.56-0.146x)\}\{35.55+\exp(17.72-0.327x)\}^{-1/2},\qquad \alpha_2(x)=\frac{\pi}{\sqrt{3}}\{35.55+\exp(17.72-0.327x)\}^{-1/2}.$$

Thus α1(x) is not a monotonically increasing function of x (see Figure 1, upper-left plot). As a result, for some t values, such as t ≤ 22, F(t, x) is not an increasing function of x (see Figure 1, lower-left plot). This may not agree with the clinical conjecture that greater CAG expansion length increases the risk of HD at a given age, i.e., F(t, x) is an increasing function of x at a fixed time point t. While it is still unclear whether the clinical impression is fully supported by data, the assumption that an exponential-linear functional form in both the mean and variance captures the true CAG length effect on HD onset globally in the entire range of x and t can be strong. It may be desirable to relax this parametric model assumption, by using a more flexible nonparametric or semiparametric model that is capable of fitting local changes in certain ranges of x. In addition, no covariates other than CAG length are modeled in (1).
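The non-monotonicity can be checked numerically. The following Python sketch evaluates α1(x), α2(x) and F(t, x) at the fitted values quoted above, taking the constant in front of both functions to be π/√3, the factor that converts a standard deviation into the scale of a logistic distribution. The function names are illustrative and are not taken from [11].

```python
import math

SCALE = math.pi / math.sqrt(3.0)  # assumed leading constant pi / sqrt(3)

def mu(x):
    """Mean function 21.54 + exp(9.56 - 0.146 x) at the fitted values."""
    return 21.54 + math.exp(9.56 - 0.146 * x)

def variance(x):
    """Variance function 35.55 + exp(17.72 - 0.327 x) at the fitted values."""
    return 35.55 + math.exp(17.72 - 0.327 * x)

def alpha1(x):
    """Intercept function of form (2)."""
    return -SCALE * mu(x) / math.sqrt(variance(x))

def alpha2(x):
    """Slope function of form (2)."""
    return SCALE / math.sqrt(variance(x))

def cdf(t, x):
    """F(t, x) = 1 / [1 + exp{-alpha2(x) t - alpha1(x)}]."""
    return 1.0 / (1.0 + math.exp(-(alpha2(x) * t + alpha1(x))))

if __name__ == "__main__":
    for x in (36, 40, 45, 50, 56):
        print(x, round(alpha1(x), 3), round(alpha2(x), 4), cdf(10, x), cdf(40, x))
```

Evaluated this way, α1(x) first decreases and then increases over 36 ≤ x ≤ 56, and at an early age such as t = 10 the value F(10, 40) falls below F(10, 36), whereas at t = 40 the cdf increases with x.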

Figure 1. Plots of the true (solid) and estimated (dashed) curves of the intercept α1(x) (left) and slope α2(x) (right) in simulation 1 (upper) and simulation 2 (lower). The dashed curves contain the median, 5% quantile and 95% quantile of the 1000 estimated curves.

2 Proposed Model and its Estimation Procedure

Due to restrictions of a parametric model, we propose to relax the specification of CAG length effect to nonparametric functions by leaving both α1(x) and α2(x) in (2) unspecified. Since F(t, x) is a cumulative conditional distribution function, it is required to be an increasing function of t at any value of x. To satisfy this assumption, the slope α2(x) should be positive. This can be taken into account through a reparameterization such as writing exp{α2(x)} instead of α2(x) with the aim of estimating α2(x). However, in our numerical experiments reported in Sections 3 and 4, such reparameterization does not seem necessary since α2(x) is estimated to be positive without any constraints. Thus, throughout this article, we simply consider α2(x) directly. In addition, if the clinical consensus that higher CAG length values are associated with earlier onset times is to be enforced, α1(x)+ α2(x)t should be an increasing function of x for any possible HD onset time t. However, we estimate α1(x) and α2(x) without forcing the monotonicity constraint. Thus, the resulting fitted functions under the more flexible model can serve as empirical evidence on whether or not the clinical consensus holds.

To express model (2) on the logit scale, note that

$$\operatorname{logit}\{F(t,x)\}=\alpha_1(x)+\alpha_2(x)t,$$

which is a logistic model with varying coefficients. Thus, although our problem is motivated by relaxing the parametric model in [11], it is very general and is applicable to modeling distribution of other disease onset as well. Under this varying-coefficient logistic model, other patient-specific covariates such as gender or baseline symptom severity measures can be easily introduced. Since these covariates are not of primary interest and misspecification of their functional form is less of a concern, we can simply use several linear terms to capture their effects. Collecting these additional covariates into a vector Z, we can extend model (2) to a partially linear varying-coefficient logistic model

$$F\{t,x,z,\beta,\alpha(x)\}=\frac{1}{1+e^{-\{z^{T}\beta_2+\alpha_2(x)\}t-\{z^{T}\beta_1+\alpha_1(x)\}}}, \tag{3}$$

where β = (β1^T, β2^T)^T and α(x) = {α1(x), α2(x)}^T are unknown parameters and functions to be estimated from data.

2.1 Likelihood and Estimation

To introduce the likelihood, we start by defining some notation. We use C to denote the censoring time and use fC(c, x, z) and FC(c, x, z) to denote the censoring probability density function (pdf) and cumulative distribution function (cdf) conditional on the covariates (X, Z). We assume the censoring to be conditionally independent of HD onset time given the covariates. Let Δ = I(T ≤ C) and Y = min(T, C). We denote the ith observation as (pi, Xi, Zi, Yi, Δi). Here pi is the known probability that the ith subject has an expanded CAG, calculated from the proband-relative relationship. We use pi = 1 or pi = 0 if the ith subject’s CAG expansion status is certain. Taking into consideration the uncertainty in a relative’s CAG expansion status and censoring, the likelihood is

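$$\begin{aligned}L\{\beta,\alpha(\cdot)\}&=\prod_{i=1}^{n}[p_if\{Y_i,X_i,Z_i,\beta,\alpha(X_i)\}]^{\Delta_i}[1-p_iF\{Y_i,X_i,Z_i,\beta,\alpha(X_i)\}]^{1-\Delta_i}\{1-F_C(Y_i,X_i,Z_i)\}^{\Delta_i}f_C(Y_i,X_i,Z_i)^{1-\Delta_i}f_{X,Z,p}(X_i,Z_i,p_i)\\&\propto\prod_{i=1}^{n}f\{Y_i,X_i,Z_i,\beta,\alpha(X_i)\}^{\Delta_i}[1-p_iF\{Y_i,X_i,Z_i,\beta,\alpha(X_i)\}]^{1-\Delta_i},\end{aligned}$$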

where F{y, x, z, β, α(x)} is given in (3) and

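$$f\{t,x,z,\beta,\alpha(x)\}=\frac{\partial F\{t,x,z,\beta,\alpha(x)\}}{\partial t}=\frac{e^{-\{z^{T}\beta_2+\alpha_2(x)\}t-\{z^{T}\beta_1+\alpha_1(x)\}}\{z^{T}\beta_2+\alpha_2(x)\}}{[1+e^{-\{z^{T}\beta_2+\alpha_2(x)\}t-\{z^{T}\beta_1+\alpha_1(x)\}}]^{2}}.$$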
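As a minimal sketch of how the reduced likelihood can be evaluated, the Python code below computes F and f from model (3) and sums the log-likelihood contributions Δi log f + (1 − Δi) log(1 − pi F). The record layout (p, x, z, y, delta), the function names, and the convention that `alpha` is a callable returning the pair (α1(x), α2(x)) are assumptions made for illustration only.

```python
import math

def linear_terms(x, z, beta1, beta2, alpha):
    """Intercept z'beta1 + alpha1(x) and slope z'beta2 + alpha2(x) of model (3)."""
    a1, a2 = alpha(x)
    intercept = sum(zj * bj for zj, bj in zip(z, beta1)) + a1
    slope = sum(zj * bj for zj, bj in zip(z, beta2)) + a2
    return intercept, slope

def cdf(t, x, z, beta1, beta2, alpha):
    """F{t, x, z, beta, alpha(x)} from model (3)."""
    intercept, slope = linear_terms(x, z, beta1, beta2, alpha)
    return 1.0 / (1.0 + math.exp(-(slope * t + intercept)))

def pdf(t, x, z, beta1, beta2, alpha):
    """f = dF/dt, which simplifies algebraically to slope * F * (1 - F)."""
    _, slope = linear_terms(x, z, beta1, beta2, alpha)
    F = cdf(t, x, z, beta1, beta2, alpha)
    return slope * F * (1.0 - F)

def log_likelihood(data, beta1, beta2, alpha):
    """Log of the reduced likelihood over records (p, x, z, y, delta)."""
    total = 0.0
    for p, x, z, y, delta in data:
        if delta == 1:
            # Observed onset: density contribution.
            total += math.log(pdf(y, x, z, beta1, beta2, alpha))
        else:
            # Censored: either not at risk (prob. 1 - p) or onset after y.
            total += math.log(1.0 - p * cdf(y, x, z, beta1, beta2, alpha))
    return total
```

The censoring density and the covariate density drop out because they do not involve β or α(·), which is why only these two terms appear.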

Due to the inclusion of the unspecified nonparametric functions α(x), directly maximizing the above likelihood is difficult. Thus, instead of using the maximum likelihood estimator (MLE) or nonparametric MLE, we propose the following backfitting procedure based on a local kernel smoothing estimator of α(x). Let the score function with respect to β be

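$$S_\beta\{Y_i,X_i,Z_i,\beta,\alpha(X_i)\}=\Delta_i\frac{\partial f\{Y_i,X_i,Z_i,\beta,\alpha(X_i)\}/\partial\beta}{f\{Y_i,X_i,Z_i,\beta,\alpha(X_i)\}}-(1-\Delta_i)\frac{p_i\,\partial F\{Y_i,X_i,Z_i,\beta,\alpha(X_i)\}/\partial\beta}{1-p_iF\{Y_i,X_i,Z_i,\beta,\alpha(X_i)\}}.$$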

Replace α(X) with a = (a1, a2)T locally at X = x0, and let the score function with respect to a be

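$$S_a(Y_i,X_i,Z_i,\beta,a)=\Delta_i\frac{\partial f(Y_i,X_i,Z_i,\beta,a)/\partial a}{f(Y_i,X_i,Z_i,\beta,a)}-(1-\Delta_i)\frac{p_i\,\partial F(Y_i,X_i,Z_i,\beta,a)/\partial a}{1-p_iF(Y_i,X_i,Z_i,\beta,a)}.$$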
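Both scores have closed forms under model (3). As a worked sketch, write η_i = (Z_i^T β2 + a2)Y_i + Z_i^T β1 + a1 and F_i = 1/(1 + e^{−η_i}); the shorthand η_i and F_i is introduced here for convenience and is not part of the original notation. Then f_i = (Z_i^T β2 + a2)F_i(1 − F_i), and differentiating with respect to a = (a1, a2)^T gives

```latex
\begin{aligned}
\frac{\partial F_i}{\partial a} &= F_i(1-F_i)\begin{pmatrix}1\\ Y_i\end{pmatrix},\qquad
\frac{\partial \log f_i}{\partial a} = (1-2F_i)\begin{pmatrix}1\\ Y_i\end{pmatrix}
  +\begin{pmatrix}0\\ (Z_i^{T}\beta_2+a_2)^{-1}\end{pmatrix},\\
S_a(Y_i,X_i,Z_i,\beta,a) &= \Delta_i\left\{(1-2F_i)\begin{pmatrix}1\\ Y_i\end{pmatrix}
  +\begin{pmatrix}0\\ (Z_i^{T}\beta_2+a_2)^{-1}\end{pmatrix}\right\}
  -(1-\Delta_i)\frac{p_iF_i(1-F_i)}{1-p_iF_i}\begin{pmatrix}1\\ Y_i\end{pmatrix}.
\end{aligned}
```

The score with respect to β follows in the same way, with (1, Y_i)^T replaced by (Z_i^T, Y_i Z_i^T)^T and with Z_i/(Z_i^T β2 + a2) in place of (Z_i^T β2 + a2)^{−1} in the β2 block.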

The backfitting procedure consists of iterating between the following two steps.

  1. Obtain β̃ at a fixed α̃(·) through solving
    $$0=\sum_{i=1}^{n}S_\beta\{Y_i,X_i,Z_i,\beta,\alpha(X_i)\}.$$
  2. Obtain α̃(x0) at x0 = x1, …, xn at a fixed β̃ through solving
    $$0=\sum_{i=1}^{n}K_h(X_i-x_0)S_a(Y_i,X_i,Z_i,\beta,a).$$

    Here K(·) is a symmetric kernel function, h is a bandwidth and Kh(x) = K(x/h)/h for any bandwidth h.

The two steps above can use as starting values the MLEs obtained by treating α(x) as constants, and are iterated until convergence. This type of backfitting method adopts the local constant idea for the nonparametric estimation of α1(·) and α2(·). When desirable, more sophisticated methods such as local polynomial fitting can also be used. Note that in the second step the maximization is performed separately at n different x0 values, so the computation can be quite involved.
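The second step can be illustrated with a small Python sketch. It builds the kernel-weighted log-likelihood at a point x0, whose gradient in a is the estimating function of step 2, and maximizes it over a user-supplied grid of candidate values. Several choices here are assumptions made for illustration: the Epanechnikov kernel, the record layout (p, x, z, y, delta), the function names, the toy data generator, and above all the grid search, which is a deliberately crude stand-in for a Newton-type solver.

```python
import math
import random

def epanechnikov(u):
    """Symmetric kernel K(u) = 0.75 (1 - u^2) on [-1, 1] (an assumed choice)."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kernel_h(x, h):
    """Scaled kernel K_h(x) = K(x / h) / h."""
    return epanechnikov(x / h) / h

def loglik_i(obs, beta1, beta2, a):
    """Log-likelihood contribution with alpha(X_i) replaced by a = (a1, a2)."""
    p, x, z, y, delta = obs
    intercept = sum(zj * bj for zj, bj in zip(z, beta1)) + a[0]
    slope = sum(zj * bj for zj, bj in zip(z, beta2)) + a[1]
    if slope <= 0.0:
        return -math.inf  # F must be increasing in t
    eta = slope * y + intercept
    if delta == 1:
        # log f = log(slope) + log{F (1 - F)}, written stably in |eta|.
        return math.log(slope) - abs(eta) - 2.0 * math.log1p(math.exp(-abs(eta)))
    # 1 - F = 1 / (1 + exp(eta)), evaluated without overflow.
    if eta > 0:
        one_minus_F = math.exp(-eta) / (1.0 + math.exp(-eta))
    else:
        one_minus_F = 1.0 / (1.0 + math.exp(eta))
    return math.log(max(1.0 - p + p * one_minus_F, 1e-300))

def local_objective(a, x0, data, beta1, beta2, h):
    """Kernel-weighted log-likelihood around x0 at fixed beta."""
    total = 0.0
    for obs in data:
        w = kernel_h(obs[1] - x0, h)
        if w > 0.0:
            total += w * loglik_i(obs, beta1, beta2, a)
    return total

def local_constant_fit(x0, data, beta1, beta2, h, grid):
    """Grid point maximizing the local objective (crude stand-in for a Newton solve)."""
    return max(grid, key=lambda a: local_objective(a, x0, data, beta1, beta2, h))

def simulate(n, seed=0):
    """Toy uncensored data: p = 1, no covariates, constant alpha = (-4, 0.1)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        u = rng.uniform(1e-12, 1.0 - 1e-12)
        t = (math.log(u / (1.0 - u)) + 4.0) / 0.1  # solves -4 + 0.1 t = logit(u)
        data.append((1.0, x, (), t, 1))
    return data
```

In a full backfitting loop this local fit would be repeated at every x0 = Xi and alternated with an update of β, which is where the computational cost noted above comes from.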

2.2 Asymptotic Properties

To study the asymptotic properties of the backfitting estimator, we first define some notation. Let Sββ be the partial derivative of Sβ with respect to β, Sβα be the partial derivative of Sβ with respect to α, Sαα be the partial derivative of Sα with respect to α and Sαβ be the partial derivative of Sα with respect to β. Also define Ω(X) = E[Sαα{Y, X, Z, β, α(X)}|X], αβ(X) = −Ω(X)^{−1}E[Sαβ{Y, X, Z, β, α(X)}|X] and U(X) = E[Sβα{Y, X, Z, β, α(X)}|X]Ω(X)^{−1}. Furthermore, define

$$\mathcal{F}=E[S_{\beta\beta}\{Y,X,Z,\beta,\alpha(X)\}+S_{\beta\alpha}\{Y,X,Z,\beta,\alpha(X)\}\alpha_\beta(X)].$$

Then we have the following results.

Theorem 1

Assume that the bandwidth h satisfies nh^4 → 0 and nh^2 → ∞. Then the backfitting estimator β̂ has the asymptotic expansion

$$-\mathcal{F}\,n^{1/2}(\hat\beta-\beta)=n^{-1/2}\sum_{i=1}^{n}[S_\beta\{Y_i,X_i,Z_i,\beta,\alpha(X_i)\}-U(X_i)S_\alpha\{Y_i,X_i,Z_i,\beta,\alpha(X_i)\}]+o_p(1). \tag{4}$$

Hence, n^{1/2}(β̂ − β) is asymptotically normally distributed with mean zero and covariance matrix ℱ^{−1}Σ(ℱ^{−1})^T, where Σ = cov[Sβ{Y, X, Z, β, α(X)} − U(X)Sα{Y, X, Z, β, α(X)}].

Remark 1

In Theorem 1, the requirement that nh^4 → 0 is the undersmoothing condition typically required for backfitting, and is a direct consequence of the bias of the local constant estimator. The undersmoothing requirement does not lead to difficulty in practice because we can easily rescale a selected optimal bandwidth h_opt to obtain the undersmoothed bandwidth h = h_opt^{3/5}. In addition, the estimation of β is often insensitive to the bandwidth choice. There are various methods proposed in the literature to avoid undersmoothing as well, including using a projection augmentation on Sβ or profiling; see Van Keilegom and Carroll (2007) [13] for details.

Remark 2

There are various possibilities to perform inference about β in our context. Reference [14] describes conditions under which the bootstrap is asymptotically valid for backfitting estimators. Alternatively, one can use the asymptotic results given in Theorem 1. This entails approximating ℱ and Σ by their sample versions, specifically

$$\hat{\mathcal{F}}=n^{-1}\sum_{i=1}^{n}[S_{\beta\beta}\{Y_i,X_i,Z_i,\hat\beta,\hat\alpha(X_i)\}+S_{\beta\alpha}\{Y_i,X_i,Z_i,\hat\beta,\hat\alpha(X_i)\}\hat\alpha_\beta(X_i)]$$

and

$$\hat\Sigma=n^{-1}\sum_{i=1}^{n}[S_\beta\{Y_i,X_i,Z_i,\hat\beta,\hat\alpha(X_i)\}-\hat U(X_i)S_\alpha\{Y_i,X_i,Z_i,\hat\beta,\hat\alpha(X_i)\}]^{\otimes2}.$$

Here and throughout the text, a^{⊗2} stands for aa^T for any vector or matrix a. In these calculations, α̂β(X) = −Ω̂(X)^{−1}Ê[Sαβ{Y, X, Z, β̂, α̂(X)}|X], Û(X) = Ê[Sβα{Y, X, Z, β̂, α̂(X)}|X]Ω̂(X)^{−1} and Ω̂(X) = Ê[Sαα{Y, X, Z, β̂, α̂(X)}|X], where all the conditional expectations are estimated nonparametrically.

Since our main interest is in estimating α(·), after obtaining the root-n consistent estimator β̂, we need to perform an additional nonparametric estimation step using the usual bandwidth to obtain the final estimate α̂(X). Because Theorem 1 guarantees the root-n rate for β̂, which is faster than the nonparametric rate, the final α̂(X) has the same classic bias and variance properties as the standard nonparametric estimator under a known β. We state the asymptotic property of α̂(·) in Theorem 2.

Theorem 2

Assume that the bandwidth used in the last local linear estimation step is h, and h = O(n^{−1/5}). Then α̂(x, β̂) satisfies

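$$\begin{aligned}\hat\alpha(x,\hat\beta)-\alpha(x)={}&-h^{2}E\{S_{aa}(Y,X,Z,\beta,a)\mid x\}^{-1}\frac{d^{2}[E\{S_a(Y,X,Z,\beta,a)\mid x\}f_X(x)]}{2f_X(x)\,dx^{2}}\int t^{2}K(t)\,dt\\&-E\{S_{aa}(Y,X,Z,\beta,a)\mid x\}^{-1}\frac{1}{nf_X(x)}\sum_{i=1}^{n}K_h(X_i-x)S_a\{Y_i,X_i,Z_i,\beta,\alpha(X_i)\}+o(h^{2})+o_p\{(nh)^{-1/2}\}.\end{aligned}$$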

Thus, it has bias

$$\operatorname{bias}\{\hat\alpha(x)\}=-h^{2}\Omega(x)^{-1}\frac{d^{2}\{E[S_\alpha\{Y,X,Z,\beta,\alpha(X)\}\mid X=x]f_X(x)\}}{2f_X(x)\,dx^{2}}\int t^{2}K(t)\,dt+o(h^{2})$$

and variance

$$\operatorname{var}\{\hat\alpha(x)\}=-\frac{\int K^{2}(t)\,dt}{nhf_X(x)}\Omega(x)^{-1}+o\{(nh)^{-1}\}.$$
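The rate h = O(n^{−1/5}) in Theorem 2 reflects the usual balance between squared bias and variance. As a heuristic sketch for a single component of α̂(x), write the leading bias as h²B(x) and the leading variance as V(x)/(nh), where B(x) and V(x) are shorthand introduced here for the h-free factors in the two displays above:

```latex
\begin{aligned}
\mathrm{MSE}(h) &\approx h^{4}B(x)^{2}+\frac{V(x)}{nh},\\
\frac{d\,\mathrm{MSE}}{dh} &= 4h^{3}B(x)^{2}-\frac{V(x)}{nh^{2}}=0
\;\Longrightarrow\; h=\left\{\frac{V(x)}{4B(x)^{2}}\right\}^{1/5}n^{-1/5},\\
\mathrm{MSE} &= O(n^{-4/5}) \quad\text{at this } h .
\end{aligned}
```

This is the standard nonparametric rate, slower than the root-n rate obtained for β̂ in Theorem 1.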

Remark 3

Once the estimates β̂ and α̂ are obtained, we can plug them into (3) to obtain an estimate of the distribution of the onset time, F{t, x, z, β̂, α̂(x)}. Using the delta method and considering that α̂(x) converges at a slower nonparametric rate than β̂, we obtain that F{t, x, z, β̂, α̂(x)} estimates the true distribution function F{t, x, z, β, α(x)} with leading order bias

$$\frac{\partial F\{t,x,z,\beta,\alpha(x)\}}{\partial\alpha(x)^{T}}\operatorname{bias}\{\hat\alpha(x)\}$$

and leading order variance

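$$\frac{\partial F\{t,x,z,\beta,\alpha(x)\}}{\partial\alpha(x)^{T}}\operatorname{var}\{\hat\alpha(x)\}\frac{\partial F\{t,x,z,\beta,\alpha(x)\}}{\partial\alpha(x)},$$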

where bias{α̂(x)} and var{α̂(x)} are given in Theorem 2. Thus, the distribution function is also estimated at the classical nonparametric rate as if β were known.
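Under model (3), F depends on α(x) only through the linear predictor, so its gradient is ∂F/∂α(x) = F(1 − F)(1, t)^T. The Python sketch below uses this to turn a given bias vector and 2 × 2 variance matrix of α̂(x) into the leading-order bias and variance of the plug-in cdf. The function names and argument layout are illustrative assumptions; `intercept` and `slope` stand for z^T β1 + α1(x) and z^T β2 + α2(x).

```python
import math

def cdf(t, intercept, slope):
    """F = 1 / [1 + exp{-(intercept + slope * t)}]."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * t)))

def cdf_gradient(t, intercept, slope):
    """dF/d(alpha1, alpha2) = F (1 - F) (1, t)."""
    F = cdf(t, intercept, slope)
    w = F * (1.0 - F)
    return (w, w * t)

def delta_method(t, intercept, slope, bias_alpha, var_alpha):
    """Leading-order bias g'b and variance g'Vg of the plug-in cdf."""
    g = cdf_gradient(t, intercept, slope)
    bias = g[0] * bias_alpha[0] + g[1] * bias_alpha[1]
    var = sum(g[j] * var_alpha[j][k] * g[k] for j in range(2) for k in range(2))
    return bias, var
```

Because F(1 − F) is largest near F = 0.5, the same uncertainty in α̂(x) translates into the widest intervals for the cdf around the median onset age.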

3 Simulation Study

We conducted simulation studies to investigate the finite sample performance of the proposed estimators. Two simulations were conducted with sample size n = 4000 and repeated 1000 times. Here n = 4000 is the approximate sample size in the COHORT study. We generated the standardized CAG length values X from a uniform distribution between 0 and 1. In many clinical studies of HD, additional information such as gender, inheritance of CAG expansion through mother or father, verbal fluency score, and presence of psychiatric symptoms, type of relatives (e.g., parents, siblings, and children) are also recorded [15]. Thus, in the simulations we generated four additional covariates to form Z, where two are continuous and two are discrete. Specifically, Z1 is generated from a uniform distribution between [−0.5, 0.5], Z2 is generated from a uniform distribution between [−1, 1], Z3 is generated from a Bernoulli distribution with probability 0.5 to be one, and Z4 is generated from a Bernoulli distribution with probability 0.4 to be one. Our at risk probability p is generated from a Bernoulli type distribution, where p = 1 with probability 0.3, and p = 0.5 with probability 0.7. This is approximately the distribution of the at-risk indicators in the COHORT data. We generated the HD onset time following two different models. The first model is the model in [11] presented in (2), while the second one has the true α functions being

$$\alpha_1(x)=5\log(1+x),\qquad \alpha_2(x)=\exp\{\sin(\pi x-\pi/2)+0.1\}.$$

Note that the second model thus has monotonically increasing α(x) functions. We further generated the censoring times from uniform distributions for both models, so that the censoring rate is approximately 65%, again reflecting the scenario of the COHORT study data structure.
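The data-generating mechanism for the second model can be sketched in Python as follows. The sketch takes α1(x) = 5 log(1 + x) and α2(x) = exp{sin(πx − π/2) + 0.1}, uses the simulation 2 truth for β listed in Table 1, and draws the onset time by inverting the logistic cdf, T = {logit(U) − intercept}/slope with U uniform on (0, 1). A subject carries the expansion with probability p, and non-carriers never experience onset. The censoring bounds `c_lo` and `c_hi` are hypothetical placeholders, since the bounds used in the study are not reported here, and they are not tuned to reproduce the 65% censoring rate. All function names are illustrative.

```python
import math
import random

BETA1 = (-0.5, 0.5, -1.0, 1.0)   # intercept coefficients, simulation 2
BETA2 = (0.1, -0.1, 0.2, -0.2)   # slope coefficients, simulation 2

def alpha1(x):
    return 5.0 * math.log(1.0 + x)

def alpha2(x):
    return math.exp(math.sin(math.pi * x - math.pi / 2.0) + 0.1)

def draw_subject(rng, c_lo=-5.0, c_hi=5.0):
    """Return one record (p, x, z, y, delta); c_lo and c_hi are hypothetical."""
    x = rng.random()                                   # standardized CAG length
    z = (rng.uniform(-0.5, 0.5), rng.uniform(-1.0, 1.0),
         float(rng.random() < 0.5), float(rng.random() < 0.4))
    p = 1.0 if rng.random() < 0.3 else 0.5             # at-risk probability
    c = rng.uniform(c_lo, c_hi)                        # censoring time
    t = math.inf                                       # non-carriers: no onset
    if rng.random() < p:                               # carrier with probability p
        intercept = sum(zj * bj for zj, bj in zip(z, BETA1)) + alpha1(x)
        slope = sum(zj * bj for zj, bj in zip(z, BETA2)) + alpha2(x)
        u = rng.uniform(1e-12, 1.0 - 1e-12)
        t = (math.log(u / (1.0 - u)) - intercept) / slope  # inverse of model (3)
    delta = 1 if t <= c else 0
    return p, x, z, min(t, c), delta

def simulate(n, seed=0):
    rng = random.Random(seed)
    return [draw_subject(rng) for _ in range(n)]
```

With these coefficient values the slope z^T β2 + α2(x) stays positive over the whole covariate range, since α2(x) ≥ exp(−0.9) ≈ 0.41 while z^T β2 ≥ −0.35.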

The simulation results for β and α are provided in Table 1 and Figure 1, respectively, where the bandwidth is chosen via cross-validation, i.e., by maximizing

$$\sum_{i=1}^{n}\left(\Delta_i\log[f\{Y_i,X_i,Z_i,\hat\beta,\hat\alpha_{-i}(X_i,h)\}]+(1-\Delta_i)\log[1-p_iF\{Y_i,X_i,Z_i,\hat\beta,\hat\alpha_{-i}(X_i,h)\}]\right)$$

as a function of h. Here α̂−i(Xi, h) means that we estimate α(x) at x = Xi using bandwidth h and all the data except the ith observation. From the results, we can see that in both models, the nonparametric modeling allows us to retrieve the shape of the α functions reasonably well, and the estimation of β has small bias. It is interesting to note that the estimation procedure in the second simulation model seems to perform better than in the first model, in the sense that the biases are much smaller across all parameters in β in model 2. In addition, the estimation variance is smaller in absolute value in model 2 for all β components that appear in the intercept term, and is also smaller in relative value for all β components that appear in the slope term. Furthermore, the biases for α are smaller and the confidence bands narrower in simulation model 2. The similarity between the mean and median of the estimates for β, as well as between the standard deviation, mean absolute deviation and median absolute deviation, indicates that the computation in both models is quite stable.

Table 1.

Simulation results on β. Mean, median, standard deviation (std), mean absolute deviation (Mad1) and median absolute deviation (Mad2) are reported.

| | β11 | β12 | β13 | β14 | β21 | β22 | β23 | β24 |
|---|---|---|---|---|---|---|---|---|
| Simulation 1 | | | | | | | | |
| truth | −0.5 | 0.5 | −1 | 1 | 0.02 | −0.02 | 0.04 | −0.04 |
| Mean | −0.4842 | 0.4999 | −0.9161 | 1.0636 | 0.0197 | −0.0199 | 0.0372 | −0.0420 |
| Median | −0.4693 | 0.4999 | −0.9212 | 1.0634 | 0.0193 | −0.0199 | 0.0371 | −0.0420 |
| Std | 0.3787 | 0.1921 | 0.2279 | 0.2304 | 0.0098 | 0.0050 | 0.0058 | 0.0060 |
| Mad1 | 0.3794 | 0.1939 | 0.2273 | 0.2308 | 0.0098 | 0.0051 | 0.0058 | 0.0060 |
| Mad2 | 0.3823 | 0.1980 | 0.2296 | 0.2306 | 0.0104 | 0.0054 | 0.0059 | 0.0061 |
| Simulation 2 | | | | | | | | |
| truth | −0.5 | 0.5 | −1 | 1 | 0.1 | −0.1 | 0.2 | −0.2 |
| Mean | −0.4977 | 0.5024 | −0.9928 | 1.0195 | 0.1008 | −0.1009 | 0.1964 | −0.2069 |
| Median | −0.4972 | 0.5034 | −0.9915 | 1.0181 | 0.1013 | −0.1007 | 0.1964 | −0.2071 |
| Std | 0.2221 | 0.1136 | 0.1363 | 0.1384 | 0.0278 | 0.0147 | 0.0180 | 0.0188 |
| Mad1 | 0.2205 | 0.1139 | 0.1365 | 0.1371 | 0.0276 | 0.0147 | 0.0179 | 0.0187 |
| Mad2 | 0.2123 | 0.1114 | 0.1386 | 0.1383 | 0.0277 | 0.0143 | 0.0179 | 0.0183 |

4 Application to COHORT Data

We now analyze the COHORT data which motivated this work. As introduced in Section 1, COHORT is an observational study collecting genetic (e.g., CAG repeat length) and clinical data on symptomatic and pre-manifest HD patients (probands), and clinical data on their family members and caregivers. In the COHORT study, ascertainment of probands does not depend on family history [5]. The probands include subjects clinically diagnosed with HD and subjects who pursued genetic testing prior to baseline and carry a CAG-expanded allele but did not have clinically diagnosed HD. It is known that HD is a dominant genetic disease (i.e., having one expanded allele is sufficient to cause HD) [16]. Subjects with a CAG repeat length ≥ 36 are considered to be HD mutation positive and have a highly elevated risk of developing the disease, while subjects with CAG repeat length < 36 do not develop HD [2, 17, 18]. In this analysis, each proband participant has a CAG repeat value between 41 and 56 (hence is at risk of HD with the at-risk probability p = 1).

For family members of the proband, as discussed in Section 1, no blood sample was collected. Thus for those who have not experienced HD, it is unknown whether they share the same mutation allele with the proband. Family members’ HD onset information was collected through a family history interview administered to the probands. All the first-degree relatives with available family history information are included in the analysis. These relatives are not selected based on their HD status or possible mutation carrier status, so there is little obvious ascertainment issue for including relative data. There are 34% parents, 38% siblings and 28% children. The distribution of the at risk probabilities in the whole sample is 1196 individuals having p = 1 and 2768 having p = 0.5. This yields 3964 observations. Here, we assume that inclusion of a family member in the study is independent of the family member’s risk status. Note that among the 1196 individuals, some are relatives who developed HD, hence we can obtain their CAG status under the assumption of no interference and thus they share the same repeat length as their probands. The onset times in the COHORT data range from 11 to 82, with a censoring rate about 19% in probands and 62% overall. Some of the relatives are censored if they have not experienced HD at the time of family history interview. The censoring rate in family members depends on the relative type. Since children are younger, they are more likely to have not experienced HD especially children of probands with shorter CAG repeats. We account for the covariate gender by including it in Z.

We analyzed the COHORT data using model (3) and the method described in Section 2.2, with the bandwidth 1.33, selected through a cross-validation procedure. The estimated α(x) and their confidence intervals are provided in Figure 2. We can see that the estimation of α(x) is much more reliable for CAG length value x < 48 than for CAG length value x ≥ 48. This is because the majority of the COHORT observations contain relatively small (< 48) CAG length values. Although the slope function (i.e., α2(x)) exhibits an increasing trend, it is not sufficient to confirm that it is indeed monotone especially in the large CAG length region, where the estimation variability is very high. At the onset time ranging from t = 15 to t = 80, the intercept and slope functions translate to a set of functions α1(x) + α2(x)t, which appear to show an increasing relation with x for x between 41 and 50, while they then slightly deviate from this trend for a CAG length value beyond 50. This suggests that in general the cumulative risk of HD onset by age t increases with longer sequence of CAG repeat length across different values of t.

Figure 2. Plots of the estimated (solid) curves of the intercept α1(x) (upper-left) and slope α2(x) (upper-right) and α1(x) + α2(x)t for t = 15, 20, 25, …, 80 (lower) in COHORT data. The dashed curves contain the median, 5% quantile and 95% quantile of the 1000 bootstrap estimation results.

Comparing the estimated intercept and slope functions with the plots in Figure 1 with parameters fitted in [11] suggests that the intercept and slope components can be different from what are estimated from the nonparametric method here. To better compare the parametric model of [11] and our nonparametric model, while eliminating the effect of using different data, we re-fit the exponential linear model (1) with the COHORT data stratified by gender. The fitted parametric functions are

$$\mu(x)=16.92+\exp(7.90-0.103x),\qquad s(x)=44.49+\exp(13.64-0.225x)$$

for females, and

$$\mu(x)=19.08+\exp(8.73-0.125x),\qquad s(x)=12.40+\exp(13.63-0.213x)$$

for males. In the left panel of Figure 3, we plot the estimated cdf, F̂(t, x), as a function of t at different values of CAG repeats x using both the parametric and nonparametric methods in females. The figures for males show a similar trend and are therefore omitted. Comparing the results obtained under the nonparametric model with those of the parametric model, we see that at a given CAG repeat length, the shapes of the estimated cdfs are similar, which is expected since at each value of x, model (1) belongs to the class of nonparametric/semiparametric models used here. However, we do not assume a parametric relationship of F(t, x) across different values of x, and therefore our model is less restrictive. The fitted values of the cdfs differ, especially for higher CAG length values (left curves). The largest difference appears when the CAG repeat length is 54, where the cumulative risk is estimated to be slightly higher with the nonparametric method than with the parametric method.

Figure 3. Plots of F̂(t, x) from the COHORT data analysis as a function of t at x = 42, 44, 46, …, 54 (left), and as a function of x at t = 15, 25, 35, …, 75 (right) in females. Plots for males are similar and are therefore omitted. The solid curves are estimated from the nonparametric model and the dashed curves are estimated from the parametric model.

The right panel of Figure 3 shows F̂(t, x) as a function of CAG repeats x at different values of age t. The figure shows that the CAG length has a larger influence on cumulative risk in the middle age range (e.g., between 25 and 65). By age 75, almost all subjects with a CAG length greater than 40 will have developed the disease regardless of their actual CAG repeats (the cumulative risk is close to 100%). By age 65, subjects with a CAG length greater than 45 will have developed the disease. At the ages plotted in Figure 3, the parametric model imposes a constraint of F(t, x) being an increasing function of x. Although there is such an increasing trend in general, it is not necessarily supported by the data in certain local ranges, especially for younger ages such as 15, 25 and 35 (lower three curves), as shown by fitting a more flexible semiparametric model; for certain ranges of CAG repeat values, the cumulative disease risk F(t, x) may be constant and does not necessarily increase with x. Therefore the impression that a longer sequence of CAG repeats increases the risk of disease at any given age does not necessarily hold and needs to be investigated further in future studies, especially in populations with more extreme lengths of CAG repeats.

In the right panel of Figure 3, there seems to be a plateau effect for large t. This is due to the nature of cumulative risk function F(t, x) for HD subjects with expanded CAG repeats. It is suggested that most subjects at risk of HD will develop the disease by a certain age regardless of the CAG repeats length. Therefore, when t is large, say t = 75, F(t, x) approaches one quickly for any fixed x, and creates a visual plateau effect.

Regarding the gender effect, our analysis shows an estimate of β̂1 = 0.3387, β̂2 = −0.0055, with the standard errors 0.3517 and 0.0078 respectively. This indicates that gender is not a significant risk factor for HD onset, which agrees with the current clinical literature.

5 Discussion

We have developed a flexible partially linear varying-coefficient model under the logit link function to model the onset of Huntington’s disease. Existing parametric models are parsimonious and efficient if the functional form is correctly specified. However, in practice there is usually not sufficient biological information to suggest a particular parametric model to be correct. For example, the logistic-exponential model with six parameters [6] may be somewhat arbitrary. In contrast, the nonparametric approach proposed here is more flexible and not subject to model misspecification. It is also useful for revealing the underlying functional relation and constructing goodness-of-fit test for parametric models. The proposed methods here are sufficiently general to be applied to other known link functions through a similar backfitting maximization procedure. The methods account for random censoring and take advantage of the family history of disease information reported by the study participants without requiring the mutation status of the family members to be known.

Here, we assumed Mendelian transmission of CAG repeat length without interference so that the CAG length does not change from parents to offspring. In reality, CAG lengths can vary somewhat among family members, and those with paternal inheritance have, on average, a slightly longer stretch of CAG repeats than their fathers. A possible explanation is that there are many more biological opportunities for the CAG repeat length to change in the paternal process of sperm formation than in the maternal process of egg formation [6]. Although these processes have been studied extensively [19], there are no validated population genetics models for them. Assuming the CAG length does not change from father to offspring may lead to a slightly lower estimated risk for affected fathers of probands. The transmission from mother to offspring is thought to be more stable [19].

Our methodology relies on the assumption of no ascertainment bias in recruiting probands. All the estimation and inference are developed under this assumption. The issue of ascertainment bias is best treated in the sampling design stage (e.g., selecting a random sample of probands from the population), and adjusting for potential ascertainment bias in the estimation stage needs to be treated separately. The COHORT study did not recruit probands through a positive family history, which avoided one of the major sources of ascertainment bias.

There are several reasons why the estimated cdfs obtained here differ from those in [11], beyond the fact that we do not assume an exponential-linear form of logit {F(t, x)}. The age-at-onset (AAO) for probands in COHORT is the age at diagnosis of HD, while in [11] it was the earliest age at which a clinician observed an irreversible objective sign of the illness. The latter may occur earlier than the point at which an actual diagnosis of manifest HD is given. Thus, the two versions of AAO may be slightly different. Furthermore, here we included the relatives' family history information in the analysis, whereas [11] focuses only on proband participants. In addition, the AAO for the family members in the COHORT study is the AAO of the first symptom of HD, potentially reported by a subject and not necessarily by a clinician. Although including family members’ age-at-onset data increases the sample size, a practical limitation is that relatives' data may be less reliable than the data collected directly from the probands. Thus, if additional information can be obtained to quantify the uncertainty in a relative’s age-at-onset information, further analysis incorporating such randomness can be pursued.

Lastly, we present some final remarks about the COHORT data analysis. One reason that prevents us from concluding that a larger CAG length value is associated with an increased risk of earlier HD onset across all ages is the absence of other covariates. There can be other risk factors that affect the age-specific risk of HD onset. Since data on these factors are unavailable (especially in family members), we cannot incorporate them into the model and this could distort the estimation of α(x), especially if these factors are correlated with CAG repeat length as well. Since in practice, it is often difficult to obtain these covariates especially for relatives, modeling and studying the potential association of these covariates and the CAG length values is of importance. Such knowledge will allow us to treat the relatives’ risk factors as missing covariates, and develop appropriate methods to make use of the covariate information on the proband and handle such problems in the missing covariate framework.

Although this work is motivated by the COHORT study, the likelihood-based nonparametric/semiparametric methodology developed here can be used in other studies with a similar kin-cohort design, for example, the studies reviewed in [20] on estimating the risk of Parkinson’s disease associated with LRRK2 mutations.

Acknowledgments

This work was supported by a grant from the National Institute of Neurological Disorders and Stroke (R01NS073671-01) and grants from the National Science Foundation (DMS-1000354, DMS-1206693). We thank the Huntington Study Group for performing the COHORT study and making the data available, and the Cure Huntington’s Disease Initiative (CHDI) for sponsoring COHORT.

Appendix

Proof of Theorem 1

We provide only a sketch of the proof. Precise conditions that justify our calculations and the general backfitting algorithm have been given by Claeskens and Van Keilegom (2003) [21] and Chen et al. (2003) [14].

We assume that X has compact support and that its density function is positive on the support. We also assume that α̂(x, β) has the usual properties uniformly in x in neighborhoods of {β, α(·)}, and in particular that $\hat\alpha(x, \beta) = \alpha(x) + o_p(n^{-1/4})$ uniformly in x; this follows because $nh^4 \to 0$.

Usual expansion around β yields

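$$
\begin{aligned}
0 &= n^{-1/2}\sum_{i=1}^{n} S_{\beta}\{Y_i, X_i, Z_i, \hat\beta, \hat\alpha(X_i, \hat\beta)\} \\
&= n^{-1}\sum_{i=1}^{n}\left[ S_{\beta\beta}\{Y_i, X_i, Z_i, \beta, \hat\alpha(X_i, \beta)\} + S_{\beta\alpha}\{Y_i, X_i, Z_i, \beta, \hat\alpha(X_i, \beta)\}\frac{\partial \hat\alpha(X_i, \beta)}{\partial \beta^{\mathrm T}} \right] n^{1/2}(\hat\beta - \beta) \\
&\quad + n^{-1/2}\sum_{i=1}^{n} S_{\beta}\{Y_i, X_i, Z_i, \beta, \hat\alpha(X_i, \beta)\} + o_p(1) \\
&= n^{-1}\sum_{i=1}^{n}\left[ S_{\beta\beta}\{Y_i, X_i, Z_i, \beta, \alpha(X_i, \beta)\} + S_{\beta\alpha}\{Y_i, X_i, Z_i, \beta, \alpha(X_i, \beta)\}\frac{\partial \alpha(X_i, \beta)}{\partial \beta^{\mathrm T}} \right] n^{1/2}(\hat\beta - \beta) \\
&\quad + n^{-1/2}\sum_{i=1}^{n} S_{\beta}\{Y_i, X_i, Z_i, \beta, \hat\alpha(X_i, \beta)\} + o_p(1).
\end{aligned}
\tag{A.1}
$$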

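Note that E[Sα{Y, X, Z, β*, α(X, β*)} | X] = 0 for all β*. Hence, taking the derivative with respect to β*, we have

$$
0 = E\left[ S_{\alpha\beta}\{Y, X, Z, \beta^*, \alpha(X, \beta^*)\} \mid X \right] + E\left[ S_{\alpha\alpha}\{Y, X, Z, \beta^*, \alpha(X, \beta^*)\} \mid X \right]\frac{\partial \alpha(X, \beta^*)}{\partial \beta^{*\mathrm T}}.
$$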

Letting β* = β, we have

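$$
\frac{\partial \alpha(X, \beta)}{\partial \beta^{\mathrm T}} = -E\left[ S_{\alpha\alpha}\{Y, X, Z, \beta, \alpha(X, \beta)\} \mid X \right]^{-1} E\left[ S_{\alpha\beta}\{Y, X, Z, \beta, \alpha(X, \beta)\} \mid X \right] = \alpha_{\beta}(X).
$$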

Inserting this relation in (A.1), we have

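$$
\begin{aligned}
0 &= n^{-1}\sum_{i=1}^{n}\left[ S_{\beta\beta}\{Y_i, X_i, Z_i, \beta, \alpha(X_i, \beta)\} + S_{\beta\alpha}\{Y_i, X_i, Z_i, \beta, \alpha(X_i, \beta)\}\,\alpha_{\beta}(X_i) \right] n^{1/2}(\hat\beta - \beta) \\
&\quad + n^{-1/2}\sum_{i=1}^{n} S_{\beta}\{Y_i, X_i, Z_i, \beta, \hat\alpha(X_i, \beta)\} + o_p(1) \\
&= F n^{1/2}(\hat\beta - \beta) + n^{-1/2}\sum_{i=1}^{n} S_{\beta}\{Y_i, X_i, Z_i, \beta, \alpha(X_i, \beta)\} \\
&\quad + n^{-1/2}\sum_{i=1}^{n} S_{\beta\alpha}\{Y_i, X_i, Z_i, \beta, \alpha(X_i, \beta)\}\{\hat\alpha(X_i, \beta) - \alpha(X_i, \beta)\} + o_p(1),
\end{aligned}
$$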

thus we obtain the expansion

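$$
-F n^{1/2}(\hat\beta - \beta) = n^{-1/2}\sum_{i=1}^{n}\left[ S_{\beta}\{Y_i, X_i, Z_i, \beta, \alpha(X_i, \beta)\} + S_{\beta\alpha}\{Y_i, X_i, Z_i, \beta, \alpha(X_i, \beta)\}\{\hat\alpha(X_i, \beta) - \alpha(X_i)\} \right] + o_p(1).
\tag{A.2}
$$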

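Performing a standard expansion for local constant estimation at a fixed point x0, writing a = α(x0) and â = α̂(x0, β), and taking into account that nh⁴ → 0 and nh² → ∞, we have

$$
\begin{aligned}
0 &= n^{-1}\sum_{i=1}^{n} K_h(X_i - x_0)\, S_a(Y_i, X_i, Z_i, \beta, \hat a) \\
&= n^{-1}\sum_{i=1}^{n} K_h(X_i - x_0)\, S_a(Y_i, X_i, Z_i, \beta, a) + n^{-1}\sum_{i=1}^{n} K_h(X_i - x_0)\, S_{aa}(Y_i, X_i, Z_i, \beta, a)(\hat a - a) + o_p(n^{-1/2}) \\
&= n^{-1}\sum_{i=1}^{n} K_h(X_i - x_0)\, S_a\{Y_i, X_i, Z_i, \beta, \alpha(X_i)\} \\
&\quad + n^{-1}\sum_{i=1}^{n} K_h(X_i - x_0)\left[ S_a(Y_i, X_i, Z_i, \beta, a) - S_a\{Y_i, X_i, Z_i, \beta, \alpha(X_i)\} \right] \\
&\quad + E\{S_{aa}(Y, X, Z, \beta, a) \mid X = x_0\}\, f_X(x_0)(\hat a - a) + o_p(n^{-1/2}).
\end{aligned}
\tag{A.3}
$$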
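To make the kernel-weighted score equation at the start of (A.3) concrete, the following sketch solves it by Newton's method in a toy setting. The toy model is an assumption made only for illustration: a binary outcome with logit P(Y = 1 | X = x) = α(x), so that the score in a is Y − expit(a), with no mixture structure, censoring, or additional covariates. The function names and simulated data are hypothetical.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel on [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_constant_logit(x0, x, y, h, n_iter=25):
    """Solve sum_i K_h(X_i - x0) * S_a(Y_i, a) = 0 for a by Newton's method."""
    w = epanechnikov((x - x0) / h) / h        # K_h(X_i - x0)
    a = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-a))
        score = np.sum(w * (y - p))           # kernel-weighted score in a
        hess = -np.sum(w) * p * (1.0 - p)     # kernel-weighted derivative S_aa
        a -= score / hess                     # Newton update
    return a

# Simulated toy data (assumption): alpha(x) = 2x - 1 on [0, 1].
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 2000)
alpha_true = 2.0 * x - 1.0
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-alpha_true)))
a_hat = local_constant_logit(0.5, x, y, h=0.1)
print(a_hat)
```

In this toy model the solution has a closed form, the logit of the kernel-weighted mean of the Y values, which gives a convenient check on the iteration.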

Note that

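$$
E\left[ K_h(X - x_0)\, S_a\{Y, X, Z, \beta, \alpha(X)\} \right] = E\left( K_h(X - x_0)\, E\left[ S_a\{Y, X, Z, \beta, \alpha(X)\} \mid X \right] \right) = 0,
$$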

and

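$$
\begin{aligned}
E\{K_h(X - x_0)\, S_a(Y, X, Z, \beta, a)\}
&= \int K(t)\, E\{S_a(Y, X, Z, \beta, a) \mid X = x_0 + ht\}\, f_X(x_0 + ht)\, dt \\
&= E\{S_a(Y, X, Z, \beta, a) \mid X = x_0\}\, f_X(x_0) + \left.\frac{d^2\left[ E\{S_a(Y, X, Z, \beta, a) \mid X = x\}\, f_X(x) \right]}{2\, dx^2}\right|_{x = x_0} h^2 \int t^2 K(t)\, dt + o(h^2) \\
&= \left.\frac{d^2\left[ E\{S_a(Y, X, Z, \beta, a) \mid X = x\}\, f_X(x) \right]}{2\, dx^2}\right|_{x = x_0} h^2 \int t^2 K(t)\, dt + o(h^2).
\end{aligned}
\tag{A.4}
$$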
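The second-order expansion used in (A.4) can be checked numerically. The sketch below compares the integral of K(t) g(x0 + ht) with g(x0) + (h²/2) g''(x0) ∫ t² K(t) dt for a generic smooth function g. The choice g = sin, the Epanechnikov kernel, the point x0, and the helper names are assumptions for illustration only; in (A.4), g(x) stands for E{Sa(Y, X, Z, β, a) | X = x} fX(x), which vanishes at x0.

```python
import math

def epanechnikov(t):
    """Epanechnikov kernel on [-1, 1]; symmetric, so its first moment is zero."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0

def integrate(f, lo, hi, m=20000):
    """Composite trapezoid rule on [lo, hi] with m sub-intervals."""
    step = (hi - lo) / m
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, m):
        total += f(lo + k * step)
    return total * step

def smoothed(g, x0, h):
    """Integral of K(t) g(x0 + h t) dt over the kernel support."""
    return integrate(lambda t: epanechnikov(t) * g(x0 + h * t), -1.0, 1.0)

g = math.sin                       # generic smooth function (assumption)
g2 = lambda x: -math.sin(x)        # its second derivative
x0 = 0.7
mu2 = integrate(lambda t: t * t * epanechnikov(t), -1.0, 1.0)  # second moment of K

for h in (0.2, 0.1, 0.05):
    exact = smoothed(g, x0, h)
    approx = g(x0) + 0.5 * g2(x0) * h * h * mu2
    # The last column is the remainder, which should shrink faster than h^2.
    print(h, exact - g(x0), approx - g(x0), exact - approx)
```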

In the last equality, we used the fact that Sa(Y, X, Z, β, a) = Sa{Y, X, Z, β, α(X)} at X = x0 and E[Sα{Y, X, Z, β, α(X)} | X] = 0.

In addition, we have

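$$
\begin{aligned}
&\operatorname{var}\left( K_h(X - x_0)\left[ S_a(Y, X, Z, \beta, a) - S_a\{Y, X, Z, \beta, \alpha(X)\} \right] \right) \\
&\quad\le \int K_h^2(x - x_0)\, E\left( \left[ S_a(Y, X, Z, \beta, a) - S_a\{Y, X, Z, \beta, \alpha(X)\} \right]^2 \mid X = x \right) f_X(x)\, dx \\
&\quad= h^{-1}\int K^2(t)\, E\left( \left[ S_a(Y, X, Z, \beta, a) - S_a\{Y, X, Z, \beta, \alpha(X)\} \right]^2 \mid X = x_0 + ht \right) f_X(x_0 + ht)\, dt \\
&\quad= h^{-1}\int K^2(t)\, E\left( \left[ S_a(Y, X, Z, \beta, a) - S_a\{Y, X, Z, \beta, \alpha(X)\} \right]^2 \mid X = x_0 \right) f_X(x_0)\, dt \\
&\qquad + \int t K^2(t) \left.\frac{d}{dx}\left\{ E\left( \left[ S_a(Y, X, Z, \beta, a) - S_a\{Y, X, Z, \beta, \alpha(X)\} \right]^2 \mid x \right) f_X(x) \right\}\right|_{x = x_0} dt \\
&\qquad + h\int t^2 K^2(t) \left.\frac{d^2}{2\, dx^2}\left\{ E\left( \left[ S_a(Y, X, Z, \beta, a) - S_a\{Y, X, Z, \beta, \alpha(X)\} \right]^2 \mid x \right) f_X(x) \right\}\right|_{x = x_0} dt + o(h) \\
&\quad= 0 + 0 + O(h) = O(h).
\end{aligned}
$$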

Here, in the last equality, we used Sa(Y, X, Z, β, a) = Sa{Y, X, Z, β, α(X)} at X = x0 and ∫ tK²(t)dt = 0. Thus

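$$
n^{-1}\sum_{i=1}^{n} K_h(X_i - x_0)\left[ S_a(Y_i, X_i, Z_i, \beta, a) - S_a\{Y_i, X_i, Z_i, \beta, \alpha(X_i)\} \right] = O(h^2) + O_p\{(h/n)^{1/2}\} = o_p(n^{-1/2})
$$

when nh⁴ → 0, because $n^{1/2}h^2 = (nh^4)^{1/2} \to 0$ and $n^{1/2}(h/n)^{1/2} = h^{1/2} \to 0$. Thus we have obtained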

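$$
\hat\alpha(x, \beta) - \alpha(x) = -\frac{1}{n f_X(x)}\sum_{i=1}^{n} K_h(X_i - x)\, E\{S_{aa}(Y, X, Z, \beta, a) \mid X = x\}^{-1} S_a\{Y_i, X_i, Z_i, \beta, \alpha(X_i)\} + o_p(n^{-1/2}).
\tag{A.5}
$$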

Substituting the right-hand side of (A.5) into (A.2), we have

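$$
\begin{aligned}
-F n^{1/2}(\hat\beta - \beta)
&= n^{-1/2}\sum_{i=1}^{n} S_{\beta}\{Y_i, X_i, Z_i, \beta, \alpha(X_i)\} + n^{-1/2}\sum_{j=1}^{n} S_{\beta\alpha}\{Y_j, X_j, Z_j, \beta, \alpha(X_j)\}\{\hat\alpha(X_j, \beta) - \alpha(X_j)\} + o_p(1) \\
&= n^{-1/2}\sum_{i=1}^{n} S_{\beta}\{Y_i, X_i, Z_i, \beta, \alpha(X_i)\} \\
&\quad - n^{-1/2}\sum_{i=1}^{n}\left[ \sum_{j=1}^{n} S_{\beta\alpha}\{Y_j, X_j, Z_j, \beta, \alpha(X_j)\}\frac{1}{n f_X(X_j)} E\{S_{aa}(Y, X, Z, \beta, a) \mid X = X_j\}^{-1} K_h(X_i - X_j) \right] S_a\{Y_i, X_i, Z_i, \beta, \alpha(X_i)\} + o_p(1) \\
&= n^{-1/2}\sum_{i=1}^{n} S_{\beta}\{Y_i, X_i, Z_i, \beta, \alpha(X_i)\} \\
&\quad - n^{-1/2}\sum_{i=1}^{n} E\left[ S_{\beta\alpha}\{Y, X_i, Z, \beta, \alpha(X_i)\} \mid X_i \right] E\{S_{aa}(Y, X, Z, \beta, a) \mid X = X_i\}^{-1} S_a\{Y_i, X_i, Z_i, \beta, \alpha(X_i)\} + o_p(1) \\
&= n^{-1/2}\sum_{i=1}^{n} S_{\beta}\{Y_i, X_i, Z_i, \beta, \alpha(X_i)\} - n^{-1/2}\sum_{i=1}^{n} U(X_i)\, S_a\{Y_i, X_i, Z_i, \beta, \alpha(X_i)\} + o_p(1).
\end{aligned}
$$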

This completes the proof of Theorem 1.
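The final expansion writes −F n^{1/2}(β̂ − β) as a normalized sum of the terms Sβ − U(Xi)Sa. Assuming these summands are independent with mean zero and finite variance, a plug-in covariance for n^{1/2}(β̂ − β) takes the familiar sandwich form, as in the sketch below. The function name and the layout of the input arrays are hypothetical, and the inputs (estimates of F, the scores, and U) must be supplied by the user; nothing here is taken from the authors' software.

```python
import numpy as np

def sandwich_cov(F, s_beta, U, s_a):
    """Plug-in covariance of n^{1/2}(beta_hat - beta) from the final expansion.

    F      : (p, p) matrix multiplying n^{1/2}(beta_hat - beta)
    s_beta : (n, p) per-subject scores S_beta
    U      : (n, p, q) matrices U(X_i)
    s_a    : (n, q) per-subject scores S_a
    """
    # Per-subject summand S_beta - U(X_i) S_a.
    infl = s_beta - np.einsum("ipq,iq->ip", U, s_a)
    # Empirical second moment of the (mean-zero) summands.
    V = infl.T @ infl / infl.shape[0]
    F_inv = np.linalg.inv(F)
    return F_inv @ V @ F_inv.T

# Tiny hypothetical inputs with p = q = 1 and n = 2.
F = np.array([[-2.0]])
s_beta = np.array([[1.0], [-1.0]])
U = np.full((2, 1, 1), 0.5)
s_a = np.array([[1.0], [1.0]])
print(sandwich_cov(F, s_beta, U, s_a))
```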

Proof of Theorem 2

Because β̂ has a root-n convergence rate, we replace β̂ by β inside α̂. Working through the same derivation following (A.3), while maintaining the bias term in (A.4), we can obtain a refined version of (A.5), which is exactly the expansion in Theorem 2.

Because Sa{Y, X, Z, β, α(X)} is a score function, we have

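$$
E\left[ S_a\{Y, X, Z, \beta, \alpha(X)\}\, S_a^{\mathrm T}\{Y, X, Z, \beta, \alpha(X)\} \mid x \right] = -E\left[ S_{aa}\{Y, X, Z, \beta, \alpha(X)\} \mid x \right].
$$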

This yields the variance to be

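$$
\frac{\int K^2(t)\, dt}{nh f_X(x)}\, E\left[ S_a\{Y, X, Z, \beta, \alpha(X)\}\, S_a^{\mathrm T}\{Y, X, Z, \beta, \alpha(X)\} \mid x \right]^{-1} + o\{(nh)^{-1}\} = -\frac{\int K^2(t)\, dt}{nh f_X(x)}\, \Omega(x) + o\{(nh)^{-1}\}.
$$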

This completes the proof of Theorem 2.
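The score identity used in this proof, and the resulting leading variance term, can be illustrated in a toy model. The sketch assumes a scalar a and a Bernoulli outcome with success probability expit(a), so that Sa = Y − expit(a) and Saa = −expit(a){1 − expit(a)}. The function names, the Epanechnikov value ∫K²(t)dt = 3/5, and the numerical inputs are assumptions chosen for illustration.

```python
import math

def score_moments(a):
    """Return E[S_a^2] and E[S_aa] for Y ~ Bernoulli(expit(a)), by enumerating y."""
    p = 1.0 / (1.0 + math.exp(-a))
    # S_a = y - p takes the value 1 - p with probability p and -p otherwise.
    second_moment = p * (1.0 - p) ** 2 + (1.0 - p) * (0.0 - p) ** 2
    # S_aa = -p(1 - p) does not depend on y in this toy model.
    mean_hessian = -p * (1.0 - p)
    return second_moment, mean_hessian

def leading_variance(a, n, h, f_x, roughness=0.6):
    """int K^2 / (n h f_X(x)) times the inverse of E[S_a^2 | x] (scalar case)."""
    second_moment, _ = score_moments(a)
    return roughness / (n * h * f_x) / second_moment

for a in (-1.0, 0.0, 2.0):
    m2, mh = score_moments(a)
    print(a, m2, -mh)   # the two columns agree, as the score identity requires

print(leading_variance(0.0, n=1000, h=0.1, f_x=1.0))
```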

References

  • 1. The Huntington’s Study Group Investigators. A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington’s disease chromosomes. Cell. 1993;72:971–983. doi: 10.1016/0092-8674(93)90585-e.
  • 2. Walker FO. Huntington’s disease. Lancet. 2007;369:218–228. doi: 10.1016/S0140-6736(07)60111-1.
  • 3. Foroud T, Gray J, Ivashina J, Conneally PM. Differences in duration of Huntington’s disease based on age at onset. Journal of Neurology, Neurosurgery & Psychiatry. 1999;66:52–56. doi: 10.1136/jnnp.66.1.52.
  • 4. Wexler NS, Lorimer J, Porter J, Gomez F, Moskowitz C, et al. Venezuelan kindreds reveal that genetic and environmental factors modulate Huntington’s disease age of onset. Proc Natl Acad Sci. 2004;101:3498–3503. doi: 10.1073/pnas.0308679101.
  • 5. Dorsey ER; Huntington Study Group COHORT Investigators. Characterization of a large group of individuals with Huntington disease and their relatives enrolled in the COHORT study. PLoS ONE. 2012;7(2):Article ID e29522. doi: 10.1371/journal.pone.0029522.
  • 6. Chen T, Wang Y, Ma Y, Marder K, Langbehn DR. Predicting disease onset from mutation status using proband and family data with applications to Huntington’s disease. Journal of Probability and Statistics. 2012:Article ID 375935. doi: 10.1155/2012/375935.
  • 7. Wacholder S, Hartge P, Struewing JP, Pee D, McAdams M, Brody L, Tucker M. The kin-cohort study for estimating penetrance. American Journal of Epidemiology. 1998;148:623–630. doi: 10.1093/aje/148.7.623.
  • 8. Khoury M, Beaty H, Cohen B. Fundamentals of Genetic Epidemiology. New York: Oxford University Press; 1993.
  • 9. Wang Y, Clark LN, Marder K, Rabinowitz D. Non-parametric estimation of genotype-specific age-at-onset distributions from censored kin-cohort data. Biometrika. 2007;94:403–414.
  • 10. Wang Y, Clark LN, Louis ED, Mejia-Santana H, Harris J, Cote LJ, Waters C, Andrews D, Ford B, Frucht S, Fahn S, Ottman R, Rabinowitz D, Marder K. Risk of Parkinson’s disease in carriers of Parkin mutations: estimation using the kin-cohort method. Archives of Neurology. 2008;65(4):467–474. doi: 10.1001/archneur.65.4.467.
  • 11. Langbehn DR, Brinkman RR, Falush D, Paulsen JS, Hayden MR. A new model for prediction of the age of onset and penetrance for Huntington’s disease based on CAG length. Clinical Genetics. 2004;65:267–277. doi: 10.1111/j.1399-0004.2004.00241.x.
  • 12. Langbehn DR, Hayden MR, Paulsen JS, the PREDICT-HD Investigators of the Huntington Study Group. CAG-repeat length and the age of onset in Huntington’s disease (HD): A review and validation study of statistical approaches. American Journal of Medical Genetics. 2009;153:397–408. doi: 10.1002/ajmg.b.30992.
  • 13. Van Keilegom I, Carroll RJ. Backfitting versus profiling in general criterion functions. Statistica Sinica. 2007;17:797–816.
  • 14. Chen X, Linton O, Van Keilegom I. Estimation of semiparametric models when the criterion function is not smooth. Econometrica. 2003;71:1591–1608.
  • 15. Langbehn DR, Paulsen JS, the Huntington Study Group. Predictors of diagnosis in Huntington disease. Neurology. 2007;68:1710–1717. doi: 10.1212/01.wnl.0000261918.90053.96.
  • 16. Lee JM, Ramos E, Lee JH, Gillis T, Mysore J, Hayden M, Warby S, Morrison P, Nance M, Ross C, et al. CAG repeat expansion in Huntington disease determines age at onset in a fully dominant fashion. Neurology. 2012;78(10):690–695. doi: 10.1212/WNL.0b013e318249f683.
  • 17. Rubinsztein DC, Leggo J, Coles R, Almqvist E, et al. Phenotypic characterization of individuals with 30–40 CAG repeats in the Huntington disease (HD) gene reveals HD cases with 36 repeats and apparently normal elderly individuals with 36–39 repeats. American Journal of Human Genetics. 1996;59(1):16–22.
  • 18. Nance MA, Seltzer W, Ashizawa T, Bennett R, McIntosh N, Myers R, Potter N, Shea D. ACMG/ASHG Statement: Laboratory guidelines for Huntington’s disease genetic testing. Am J Hum Genet. 1998;62:1243–1247.
  • 19. McMurray C. Mechanisms of trinucleotide repeat instability during human development. Nature Reviews Genetics. 2010;11:786–799. doi: 10.1038/nrg2828.
  • 20. Goldwurm S, Tunesi S, Tesei S, et al. LRRK2-G2019S penetrance in Parkinson’s disease. Movement Disorders. 2011;26:2144–2145. doi: 10.1002/mds.23807.
  • 21. Claeskens G, Van Keilegom I. Bootstrap confidence bands for regression functions and their derivatives. Annals of Statistics. 2003;31:1852–1884.
