Abstract
We study efficient nonparametric estimation of distribution functions of several scientifically meaningful sub-populations from data consisting of mixed samples where the sub-population identifiers are missing. Only probabilities of each observation belonging to a sub-population are available. The problem arises from several biomedical studies such as quantitative trait locus (QTL) analysis and genetic studies with ungenotyped relatives where the scientific interest lies in estimating the cumulative distribution function of a trait given a specific genotype. However, in these studies subjects’ genotypes may not be directly observed. The distribution of the trait outcome is therefore a mixture of several genotype-specific distributions. We characterize the complete class of consistent estimators which includes members such as one type of nonparametric maximum likelihood estimator (NPMLE) and least squares or weighted least squares estimators. We identify the efficient estimator in the class that reaches the semiparametric efficiency bound, and we implement it using a simple procedure that remains consistent even if several components of the estimator are mis-specified. In addition, our close inspection of two commonly used NPMLEs in these problems shows the surprising result that the NPMLE in one form is highly inefficient, while in the other form it is inconsistent. We provide simulation procedures to illustrate the theoretical results and demonstrate the proposed methods through two real data examples.
Keywords and phrases: Finite mixed samples, robustness, semiparametric efficiency, nonparametric maximum likelihood estimator (NPMLE)
1. Introduction
In many scientific studies, data arise from a mixture of scientifically meaningful distributions. For example, in a quantitative trait locus (QTL) study, the goal is to identify, map and estimate the effect of a QTL predisposing the trait. However, the genomic location of the QTL is unknown, so subjects’ genotypes at the QTL are not observed. Mixture models are widely used to map QTLs using location-known molecular markers such as single nucleotide polymorphisms (SNPs) or microsatellite markers; see Lander and Botstein (1989) and Wu et al. (2007).
Another example where a mixture model is useful is genetic studies where genotypes in relatives of an initial sample (probands) are not collected (Marder et al., 2003; Wang et al., 2008). In these studies, the scientific interest is in estimating the conditional distribution of a trait given a genotype (or penetrance, Khoury et al., 1993). Genotype information in the initial sample of probands is collected. However, it is common that, due to the high cost of administering in-person interviews in relatives, their genotype information is not collected. For example, in Wacholder et al. (1998) and Wang et al. (2007, 2008), only the probands are genotyped, and none of the first-degree relatives of the probands was genotyped. The distribution of possible genotypes of a relative, however, can easily be obtained given the relationship between the relative and the proband and the genotype in the proband. The relatives’ disease history or trait information is usually obtained by administering a systematic and reliable phone interview (Marder et al., 2003). The distribution of the trait in a relative is then a mixture of conditional distributions of the trait given the relative’s genotype, and these relatives form the main analysis sample.
A concrete example of such genetic studies is an investigation of association between the APOE gene and the LDL concentrations in young children (Shea et al., 1999). There are three common alleles at the APOE locus (ε2, ε3, ε4). The APOE ε3 is the most prevalent allele in the general population, with frequency 75% to 80%. Previous studies have suggested that the APOE ε4 allele may be associated with higher LDL cholesterol levels in adults (Davignon et al., 1988). Of interest is the association between APOE ε4 allele and LDL cholesterol distribution in children.
Subjects included in the study were recruited from a cross-sectional biomarker study of children conducted from 1994 to 1998 (Shea et al., 1999). Proband children were recruited from lists of cardiac patients generated through the Presbyterian Hospital Clinical Information System, private cardiology practices, lipid clinics and pediatric practices. Families with at least one healthy child 4 to 25 years of age were eligible for participation. Siblings of proband children were recruited to the study. The availability of the APOE genotype information of the probands and the sibling relationship enables the calculation of each sibling’s probability of carrying the ε4 allele. The cumulative distribution function of LDL concentration for carriers of ε4 allele (carrying one or two copies of ε4) and for the non-carriers (carrying zero copy of ε4) are of primary interest in this study.
Traditional statistical analysis of mixture data specifies a parametric form of the conditional distribution of an outcome given group membership (e.g., Gaussian mixture model, Wu et al., 2007) and estimates mixture probabilities and parameters in the conditional distribution by maximum likelihood through an EM algorithm (McLachlan and Peel, 2000). In this work, we provide nonparametric estimation in the sense that we do not make any distributional assumption on the conditional distributions. One common feature of the two examples introduced above is that the mixture probabilities are easily calculated without using the outcome data or are known, and the mixture populations are scientifically meaningful (e.g., subjects carrying a certain genotype). Treating these mixture probabilities as random variables, each observation in the data consists of a vector of mixture probabilities and a continuous outcome, and the observations are assumed to be independent and identically distributed (i.i.d.).
To fix ideas, let Q denote a p-dimensional vector of random mixture probabilities, and let pQ denote the probability mass function of Q, which has a finite support u1, …, um. Let S denote a random outcome, let L denote the unobserved group membership (or genotype), and let f(s) denote the p-dimensional conditional density of S given L. For simplicity, we assume that f(s) is supported on a compact interval, say [T1, T2]. For the ith subject, i = 1, …, n, we observe (qi, si), where the joint density of Q, S at Q = qi and S = si is
$$p_Q(q_i)\, q_i^T f(s_i). \qquad (1)$$
Here f(s) is a length p vector, where the jth component fj(s) represents the conditional probability density function (PDF) of s given that it belongs to the jth genotype group, j = 1, …, p. Each component of f(s), fj(s), is the PDF of a trait at time t given the gene mutation status being the jth kind in a relative (for example, j = 1 denotes carriers and j = 2 denotes non-carriers), or the PDF of a quantitative trait given the QTL genotype being the jth kind. Let F(·) denote the corresponding p-dimensional cumulative distribution function (CDF) of f(·). Our interest is in estimating F at any fixed time t. The vector qi represents probabilities that a relative carries a certain genotype given the proband’s genotype, or a vector of probabilities of a subject having a certain QTL genotype given the flanking markers. Obviously, the components of qi sum to one. The distribution of qi (i.e., pQ) depends on study design and can be easily estimated consistently from the empirical distribution of qi. For example, for a backcross QTL experiment, qi takes four different values depending on the marker genotype frequencies (e.g., Table 10.3 of Wu et al., 2007). The vector of density functions f is completely unspecified, thus f is an infinite-dimensional nuisance parameter with length p.
Here, we characterize the complete class of consistent estimators which includes Fine et al. (2004) and Chatterjee and Wacholder (2001). We show that any weighted least squares estimator is a member of this estimation class hence yields a consistent estimator. In addition, we construct a special subclass which obtains the minimum estimation variance and reaches the semiparametric efficiency bound. We inspect two types of widely used NPMLEs and report a surprising finding that they are either inefficient or even inconsistent. Although commonly applied in clinical studies (Sigurdson et al., 2004; Hauptmann et al., 2003; Webb et al., 2006a,b; Hartge et al., 2002), the inconsistency of the second type of NPMLE has not been discovered in the literature before.
The remainder of the paper is organized as follows. In Section 2, weighted least squares estimators are introduced and a complete class of consistent estimators encompassing the least squares is defined. The optimal member of the class is identified and shown to reach the semiparametric efficiency bound. In Section 3, an algorithm to implement the efficient estimator is developed and asymptotic properties of the estimator are proved. In Section 4, two types of commonly used NPMLE estimators are investigated and one type is found to be inefficient while the other is inconsistent. In Section 5, simulation experiments are conducted to investigate the finite sample performance of the developed methods, and several estimators including the efficient estimator, the least squares estimators and the NPMLEs are compared. In Section 6, the proposed methods are implemented to analyze two data examples, one from a genetic linkage study of rice plant height and the other from a study of association between plasma low-density lipoprotein (LDL) cholesterol level and the apolipoprotein-E (APOE) gene. In Section 7, possible extensions of the proposed methods are discussed.
2. Estimation procedures
2.1. A class of weighted least squares estimators
Although the traditional approach to estimating F(t) is the maximum likelihood estimator for a parametric model or the NPMLE for a nonparametric model, a very simple weighted estimator can be used if we formulate the same problem from a different angle. Observe that the model in (1) implies qT F(t) = E{I(S ≤ t)|q}, where I(·) denotes an indicator function. Therefore, viewing the qi’s as covariates and yi = I(si ≤ t) as response variables, the covariates and the responses are linked by F(t) via a familiar linear regression model
$$y_i = q_i^T F(t) + e_i,$$
where E(ei|qi) = 0, i = 1, …, n. It is straightforward that the ei’s are independent conditional on the qi’s, and have the variances $q_i^T F(t)\{1 - q_i^T F(t)\}$. Thus, a weighted least squares (WLS) method can be used to estimate F(t). Denote by M an arbitrary n × n diagonal matrix. Let A = (q1, …, qn)T ∈ R^{n×p}, Y = (y1, …, yn)T ∈ R^n, and e = (e1, …, en)T ∈ R^n. Then we obtain the general WLS estimator
$$\hat F(t) = (A^T M A)^{-1} A^T M Y.$$
The simplest estimator is the OLS where we set M = In, also derived in Fine et al. (2004) using a different formulation, while the most efficient WLS estimator is obtained when we assign M to be a diagonal matrix with the ith diagonal entry equal to $[q_i^T F(t)\{1 - q_i^T F(t)\}]^{-1}$, the inverse of the variance of ei. A standard iteratively re-weighted estimation procedure can be used to obtain this optimal WLS (OWLS) estimator. The presence of the matrix M also allows the flexibility to derive other WLS estimators to achieve desired properties such as robustness.
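The WLS construction can be illustrated with a minimal Python sketch. It assumes the estimator has the usual form (A^T M A)^{-1} A^T M Y with responses I(si ≤ t), and that the OWLS weights are the inverse variances of the regression errors. The function names, the clipping constant `eps`, the fixed number of re-weighting iterations, and the toy two-component uniform design are illustrative assumptions, not part of the paper.

```python
import numpy as np

def wls_cdf(q, s, t, weights=None):
    """WLS estimate of F(t): (A^T M A)^{-1} A^T M Y with Y_i = I(s_i <= t)."""
    A = np.asarray(q, dtype=float)            # n x p matrix of mixture probabilities
    y = (np.asarray(s) <= t).astype(float)    # indicator responses
    w = np.ones(len(y)) if weights is None else np.asarray(weights, dtype=float)
    AtM = A.T * w                             # A^T M for a diagonal M
    return np.linalg.solve(AtM @ A, AtM @ y)

def owls_cdf(q, s, t, n_iter=20, eps=1e-3):
    """Iteratively re-weighted WLS with weights 1 / [mu_i (1 - mu_i)], mu_i = q_i^T F(t)."""
    A = np.asarray(q, dtype=float)
    F = wls_cdf(A, s, t)                      # start from OLS (M = I_n)
    for _ in range(n_iter):
        # Clipping is an assumption made here to avoid division by zero.
        mu = np.clip(A @ F, eps, 1 - eps)
        F = wls_cdf(A, s, t, 1.0 / (mu * (1 - mu)))
    return F

# Toy design (hypothetical): component 1 is Uniform(0, 1), component 2 is Uniform(0, 2),
# so the true value is F(0.5) = (0.5, 0.25).
rng = np.random.default_rng(0)
n = 20000
support = np.array([[0.3, 0.7], [0.0, 1.0], [0.7, 0.3], [0.8, 0.2]])
q = support[rng.integers(len(support), size=n)]
in_group1 = rng.random(n) < q[:, 0]           # latent membership, unobserved in practice
s = np.where(in_group1, rng.uniform(0, 1, n), rng.uniform(0, 2, n))
F_ols = wls_cdf(q, s, t=0.5)
F_owls = owls_cdf(q, s, t=0.5)
```

Only the pairs (qi, si) enter the two estimators; the latent membership is used solely to generate the toy data.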
2.2. The complete class of consistent estimators
Although simple to derive and easy to implement, it is unclear whether the class of WLS is complete and whether OWLS is the optimal estimator among all consistent estimators of F(t). To answer these questions and to provide easy variance estimation for any consistent estimator, we perform a formal semiparametric analysis to characterize the complete class of consistent estimators. We derive in Appendix A.1 that the family of all influence functions is
| (2) |
where Ip is a p-dimensional identity matrix, C is an arbitrary p × p constant matrix, and 1p is a p-dimensional vector with all elements being one.
For any qualified b-function as described in SIF, an estimator for F (t) is
| (3) |
where we use Cb = ∫ b(q, s)qTpQ(q)dμ(q) − I(s ≤ t)Ip to denote the constant matrix corresponding to this b-function. For example, a convenient choice of b(q, s) is
| (4) |
where h1(q, s), h2(q, s), and h3(q) can be arbitrary functions in Rp such that ∫ h1(q, s)qT pQ(q)dμ(q) and ∫ h2(q, s)qT pQ(q)dμ(q) are invertible, and B is an arbitrary constant matrix. This characterization provides a simple construction of a very rich class of estimators.
Since SIF contains all the influence functions, any regular asymptotically linear (RAL, Newey, 1990) estimator can be written in the form of (3). For example, we show in Appendix A.2 that the influence function of any WLS estimator is
Here, w is a weight variable. For the ith individual, w = wi is the ith diagonal entry of M. We use W to denote the weight variable when it is considered as a random variable. It is easy to see that this corresponds to choosing h1 = wq, h2 = 0, and h3 = −{E(WQQT)}−1wqqT F(t) + F(t), hence any WLS is indeed a member of SIF. In addition, comparing the form of φWLS and SIF indicates that the WLS estimators are only a subset of consistent estimators that can be constructed. To further study whether the optimal WLS estimator is the most efficient among all the consistent estimators for F(t), we need to derive the efficient influence function.
2.3. The semiparametric efficient estimator
Projecting an arbitrary influence function φ onto the tangent space Λ yields an efficient influence function (Newey, 1990). In Appendix A.3, we derive the form of Λ and its orthogonal complement, which enables us to derive the following theorem.
Theorem 1
The efficient influence function is
where
and
The proof of Theorem 1 is in Appendix A.4.
It is straightforward to see that the construction of the efficient estimator requires correct specification of the nuisance parameter f(s), which is not always easy to obtain. If we unknowingly mis-specify f(s) as f*(s) and follow the same construction in Theorem 1 to obtain , then the result is no longer a valid influence function. To see this, note that , where , and K* = ∫ I(s ≤ t)A*−1 (s)ds{∫A*−1 (s)ds}−1. We can then easily verify that E(φ̌) = F(t) − K*1p, which is not necessarily zero. We thus robustify the influence function by constructing
| (5) |
Regardless of the form of f*, (5) always yields a valid influence function. In addition, φ = φeff when f*(s) = f0(s) and φ can be used to estimate F(t) via
| (6) |
Remark 1
In (6), we can replace K* by an arbitrary constant matrix. The resulting estimator remains consistent, and the corresponding φ is still a valid influence function. However, since different choices of K* correspond to different influence functions, the estimators have different variances.
In practice, since f(s) is usually either proposed or estimated so that it may be different from f0(s), it is always a safer choice to use (6) to obtain F̂(t). We will show in Section 3 that as long as f(s) is consistently estimated, the estimator (6) is guaranteed to provide an efficient estimator for F(t).
2.4. Analytic comparison between OWLS and the efficient estimator
We are now ready to assess whether the OWLS is efficient. Comparing φeff with φOWLS obtained in Appendix A.2, we find that although the OWLS is optimal among the WLS family, it does not reach the semiparametric efficiency bound. We prove this claim by contradiction. Suppose that the OWLS is efficient, then we would have φeff = φOWLS + op(1), which would imply that for all (q, s) pairs,
Denote B = E QQT/[QTF(t){1 − QTF(t)}], we then have
which leads to qTF(t)A−1(s)q = KA−1(s)q. The left hand-side is a quadratic function of q, while the right hand-side is linear, so the above equality will never hold since q cannot be a constant vector of zero.
3. Efficient estimator and its asymptotic properties
As we have pointed out, the efficient influence function derived in Theorem 1 involves unknown nuisance parameters f(s) and therefore cannot be directly used to construct an efficient estimator for F(t). Using (6) will provide a robust and locally efficient estimator, in the sense that if f*(s) = f0(s), the estimator is indeed efficient, otherwise, the estimator is still guaranteed to be consistent. We now propose a method to construct an estimator that is always efficient. This method avoids estimating the p-dimensional PDF f(s) directly, and is simple to implement.
3.1. Algorithm for implementing the efficient estimator
We propose to use the following procedure to construct the efficient estimator.
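1. Randomly split the data into two sets. The second set has size $n_2 = n^{5/6}$, and the first set has size $n_1 = n - n_2$. Assume that the first set contains (q1, s1), …, (qn1, sn1) and the second set (qn1+1, sn1+1), …, (qn, sn).
2. Obtain the empirical estimator of qTf(s) from the second set of the sample, with size n2. Recall that the random vector Q can take m different vector values u1, …, um, so for each k = 1, …, m, we can calculate a kernel estimate for $u_k^T f(s)$ as
$$\widehat{u_k^T f}(s) = \frac{\sum_{i=n_1+1}^{n} K_h(s_i - s)\, I(q_i = u_k)}{\sum_{i=n_1+1}^{n} I(q_i = u_k)}.$$
Here Kh is any kernel function with bandwidth h satisfying $(n_2 h)^{-1} = o(1)$ and $n_2 h^5 \le O(1)$ as n2 → ∞, and $K_h(\cdot) = h^{-1}K(\cdot/h)$.
3. Calculate Â(s), the estimate of A(s, qTf) obtained by inserting the kernel estimate of qTf(s) and taking the expectation EQ with respect to Q. We construct ∫ Â−1(s)ds and ∫ I(s ≤ t)Â−1(s)ds using numerical integration, and form K̂ = ∫ I(s ≤ t)Â−1(s)ds{∫ Â−1(s)ds}−1.
4. Form the estimate of the robustified influence function (5) by replacing K*, qTf*(s) and A* with K̂, the kernel estimate of qTf(s) and Â, and let the estimator be
(7)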
The estimation procedure described above is straightforward to implement. Compared with many other semiparametric problems where the efficient estimator often involves solving integral equations (Rabinowitz, 2000) and iterative procedures (Tsiatis and Ma, 2004), the estimator here is very simple. In addition, unlike most semiparametric problems where the nonparametric functions have to be estimated at a certain rate, sometimes using an under-smoothed bandwidth (Liang and Wang, 2005; Li and Liang, 2008) to reach optimality, we do not have such estimation constraints. In fact, we will show that any consistent estimation of f(s) will be as good as the true f(s) asymptotically. Since consistency can be obtained with a wide range of bandwidths, typically one does not have to go through the computationally intensive cross validation procedure to choose an optimal bandwidth. Finally, we point out that the splitting of the data is solely to facilitate the later theoretical proof and is not mandatory. In practice, one can certainly use the whole data set to estimate f(s) and to form F̂(t) in (7).
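The first two steps of the procedure (data splitting and the group-wise kernel estimate of qTf(s)) can be sketched as follows. This is only an illustration under stated assumptions: the second-set size is read as n^{5/6}, the kernel is taken to be Gaussian, the bandwidth is set proportional to n2^{-1/5} (one choice compatible with the stated bandwidth conditions), and the estimate in each group is normalized by the group size. The function names and the toy data are hypothetical.

```python
import numpy as np

def split_indices(n, rng):
    """Random split: the second set has about n^(5/6) points, the first set gets the rest."""
    n2 = int(round(n ** (5 / 6)))
    perm = rng.permutation(n)
    return perm[: n - n2], perm[n - n2:]

def group_kernel_density(q, s, grid, h):
    """Gaussian-kernel estimate of u_k^T f(s) on `grid` for each distinct q value u_k."""
    q = np.asarray(q, dtype=float)
    s = np.asarray(s, dtype=float)
    support, labels = np.unique(q, axis=0, return_inverse=True)
    labels = labels.reshape(-1)               # guard against NumPy-version shape differences
    dens = np.empty((len(support), len(grid)))
    for k in range(len(support)):
        s_k = s[labels == k]                  # outcomes observed with q_i = u_k
        z = (grid[:, None] - s_k[None, :]) / h
        dens[k] = np.exp(-0.5 * z**2).sum(axis=1) / (len(s_k) * h * np.sqrt(2 * np.pi))
    return support, dens

# Toy data (hypothetical): component 1 is Uniform(0, 1), component 2 is Uniform(0, 2).
rng = np.random.default_rng(0)
n = 3000
u = np.array([[0.3, 0.7], [0.0, 1.0], [0.7, 0.3], [0.8, 0.2]])
q = u[rng.integers(len(u), size=n)]
s = np.where(rng.random(n) < q[:, 0], rng.uniform(0, 1, n), rng.uniform(0, 2, n))

first, second = split_indices(n, rng)
h = s[second].std() * len(second) ** (-1 / 5)   # bandwidth of order n2^(-1/5), an assumption
grid = np.linspace(-1.0, 3.0, 401)
support, dens = group_kernel_density(q[second], s[second], grid, h)
```

Each row of `dens` estimates the density of S among observations sharing one q value, which is the quantity the later steps plug into the influence function.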
3.2. Asymptotics and inferences
We present the asymptotic property of the proposed efficient estimator in the following theorem:
Theorem 2
The estimator constructed in (7) achieves the semiparametric efficiency bound. Specifically, for n → ∞, $\sqrt{n}\{\hat F(t) - F(t)\} \to N(0, V)$ in distribution, where V = var(φeff) and can be consistently estimated as
Intuitively, the reason that (7) can reach the semiparametric efficiency is because it solves the estimating equation formed by summing over the robustified influence functions (5) while replacing the unspecified quantities K*, qT f*(s) and A* by their corresponding optimal choices which are, respectively, the non-parametric estimates of K, qTf(s) and A(s, qTf). The rigorous proof of Theorem 2 is in Appendix A.5.
Since we are able to construct the optimal estimators and estimate their variances, it is straightforward to make inferences based on these results. For example, we can construct a locally most powerful test for the hypothesis H0 : F1(t) − F2(t) = δ0 versus H1 : F1(t) − F2(t) ≠ δ0. Because of the explicit form of F̂(t), the Wald test is an obvious choice. Let D̂ = F̂1(t) − F̂2(t) − δ0, then the test statistic is
$$T = n\hat{D}^2 / v, \qquad (8)$$
where v = V11 − V12 − V21 + V22, and Vij is the (i, j)th element of the covariance matrix V stated in Theorem 2. It is straightforward that when n → ∞, T has a chi-square distribution with one degree of freedom under H0. Under the local alternative, say , T has a noncentral chi-square distribution with one degree of freedom and noncentrality parameter (δ − δ0)2/v.
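As a rough illustration of the Wald test, the sketch below assumes the statistic takes the form T = n D̂²/v, which is consistent with V being the variance of the influence function and with the stated noncentrality parameter. The function name and the numerical inputs are hypothetical, and in practice V would be replaced by its consistent estimate.

```python
import math

def wald_test(F_hat, V, n, delta0=0.0):
    """Wald test of H0: F1(t) - F2(t) = delta0, assuming T = n * D^2 / v."""
    D = F_hat[0] - F_hat[1] - delta0
    v = V[0][0] - V[0][1] - V[1][0] + V[1][1]
    T = n * D * D / v
    # Upper tail of a chi-square distribution with one degree of freedom:
    # P(Z^2 > T) = erfc(sqrt(T / 2)) for a standard normal Z.
    p_value = math.erfc(math.sqrt(T / 2.0))
    return T, p_value

# Hypothetical inputs, for illustration only.
F_hat = (0.93, 0.32)
V = [[0.9, -0.5], [-0.5, 1.5]]
T, p_value = wald_test(F_hat, V, n=300, delta0=0.5)
```

Using `math.erfc` keeps the example free of external dependencies; a statistics library's chi-square survival function would serve equally well.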
In some applications, one may be interested in testing whether F1(t) − F2(t) = δt at several different t values simultaneously, say at t1, …, tJ. Let a be the length J vector whose jth component is F̂1(tj) − F̂2(tj) − δtj, and let Δ0 = (δt1, …, δtJ)T. This can be written as a problem of testing H0 : a = 0 versus H1 : a ≠ 0. Under H0, a has a multivariate normal distribution with mean zero and variance-covariance matrix n−1Σ, where Σjk = (−1, 1)cov{F̂(tj), F̂(tk)}(−1, 1)T for j, k = 1, …, J. Here, cov{F̂(tj), F̂(tk)} can be estimated using
where φ̂eff,i(tj) and F̂(tj) denote φeff and F̂ evaluated at the ith observation and calculated at time tj. Thus, we can construct the test statistic
| (9) |
When n → ∞, under H0, T has a chi-square distribution with J degrees of freedom. Under a local alternative, say for some length J vector Δ, T has a noncentral chi-square distribution with noncentrality parameter ΔTΣ−1Δ.
4. Understanding the NPMLEs
For many nonparametric models, the NPMLE is a widely used estimation procedure. In the literature, two types of NPMLE have been proposed (Wacholder et al., 1998; Chatterjee and Wacholder, 2001). The first type of NPMLE treats each ujTf(s), j = 1, …, m, as an unknown PDF, while the second type treats f(s) as a p-dimensional unknown PDF. To explain these two NPMLEs in detail, group the observations in such a way that the first r1 observations form a first subset where each observation has the same q value equal to u1, the next r2 observations form a second subset with the same q value u2, and so on. Assume that the last rm observations form the mth subset and have q values equal to um. We use F̃(t) to denote the type I NPMLE of F(t), and F̌(t) the type II NPMLE.
The type I NPMLE maximizes
$$\sum_{i=1}^n \log\{q_i^T f(s_i)\}$$
with respect to the values of ujTf(si) for the ith subject in the jth subset, subject to these values being nonnegative and summing to one within each subset, for j = 1, …, m. This is essentially equivalent to performing an empirical density estimation in each of the m groups, where in each group the qi values are identical. Obviously, the resulting estimation for qTf(s) in the jth group is an empirical PDF with weights $1/r_j$ at the observed values. The procedure then uses the resulting empirical CDFs in the m groups to recover F̃(t) = (UTU)−1UTG(t), where we denote U = (u1, …, um)T, and G(t) is a length m vector with the jth component equal to $r_j^{-1}\sum_{i:\, q_i = u_j} I(s_i \le t)$. It is not difficult to see that
$$\tilde F(t) = \Big(\sum_{i=1}^n w_i q_i q_i^T\Big)^{-1} \sum_{i=1}^n w_i q_i I(s_i \le t),$$
where $w_i = 1/r_j$ if qi = uj. Thus, the type I NPMLE belongs to the family of WLS estimators (therefore a member of class (2)), where the weights are taken to be $1/r_j$, the inverse of the number of observations in the jth group with the same qi value. However, the weights of this WLS estimator are obviously non-optimal. In addition, intuitively such a choice of weights is not reasonable, because it down-weights the contributions from a larger subset. In fact, one would rather down-weight the contribution from the observations with less estimation precision, while the quality of the estimation of F(t) from each observation has no definitive link with its subset size.
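The type I NPMLE and its WLS representation can be checked numerically with a short sketch. It assumes G(t) collects the group-wise empirical CDFs and that the equivalent WLS weight for an observation in the jth group is 1/rj. The function name and the toy two-component uniform design are hypothetical.

```python
import numpy as np

def npmle_type1(q, s, t):
    """Type I NPMLE: regress the group-wise empirical CDFs G(t) on the distinct q values U."""
    q = np.asarray(q, dtype=float)
    s = np.asarray(s, dtype=float)
    U, labels = np.unique(q, axis=0, return_inverse=True)
    labels = labels.reshape(-1)
    G = np.array([np.mean(s[labels == j] <= t) for j in range(len(U))])
    # Least squares solution, equal to (U^T U)^{-1} U^T G(t) when U has full column rank.
    F_tilde, *_ = np.linalg.lstsq(U, G, rcond=None)
    return F_tilde

# Toy data (hypothetical): component 1 is Uniform(0, 1), component 2 is Uniform(0, 2).
rng = np.random.default_rng(1)
n = 5000
support = np.array([[0.3, 0.7], [0.0, 1.0], [0.7, 0.3], [0.8, 0.2]])
q = support[rng.integers(len(support), size=n)]
s = np.where(rng.random(n) < q[:, 0], rng.uniform(0, 1, n), rng.uniform(0, 2, n))
F_tilde = npmle_type1(q, s, t=0.5)

# The same estimate written as WLS with weight 1 / r_j for every observation in group j.
_, labels, counts = np.unique(q, axis=0, return_inverse=True, return_counts=True)
w = 1.0 / counts[labels.reshape(-1)]
y = (s <= 0.5).astype(float)
F_wls = np.linalg.solve((q.T * w) @ q, (q.T * w) @ y)
```

The two computations agree up to floating point error, which illustrates why larger groups receive smaller per-observation weights under this estimator.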
The type II NPMLE maximizes the same log likelihood, but with respect to f(si), subject to $\sum_{i=1}^n f(s_i) = 1_p$ and f(si) ≥ 0 component-wise. It is easy to see that the maximum is obtained when the rj values of f(si) corresponding to the same uj are the same. We denote this common f(si) value by hj, for j = 1, …, m. We thus maximize
$$\sum_{j=1}^m r_j \log(u_j^T h_j)$$
with respect to the hj’s subject to $\sum_{j=1}^m r_j h_j = 1_p$ and hj ≥ 0 component-wise. In general, no closed form solution exists for the hj’s, and the EM algorithm is often used to solve this optimization problem and to obtain the hj’s. The NPMLE then proceeds to form
$$\check F(t) = \sum_{j=1}^m h_j \sum_{i:\, q_i = u_j} I(s_i \le t).$$
The type II NPMLE is different from the type I NPMLE in that here, the term “nonparametric” refers to f(s), not to qTf(s). In the literature, the type II estimator is considered as an improvement of the type I NPMLE. However, our careful investigation reveals that the type II NPMLE is not even consistent, which is a rather counterintuitive result. In Appendix A.6, we give a detailed calculation in a concrete case to explicitly illustrate the inconsistency, and in Section 5 we demonstrate the bias of the type II NPMLE in a moderately large sample through simulations.
We now give a more general demonstration to show why the type II NPMLE is inconsistent. Suppose the solution to the constrained maximization problem is h1, …, hm, then the type II NPMLE is
$$\check F(t) = H G(t),$$
where H = (r1h1, …, rmhm), and U, G(t), F̃(t) are the same as defined before. We already know that F̃(t) is a consistent estimator of F(t), so G(t) is asymptotically UF(t). If F̌(t) is also consistent, then we would have HU → Ip when n → ∞. This is a much stronger condition than the original constraints of the maximization problem and is in general not satisfied. In fact, this condition means that the type II NPMLE is asymptotically equivalent to the type I NPMLE, which contradicts the original goal of developing a type II estimator. In other words, as a distinct estimator from the type I NPMLE, the type II NPMLE is inconsistent.
5. Simulations
To study the finite sample performance of the proposed estimators, we conducted several simulation studies. In all the simulations, the dimension of F(t) is p = 2, and the number of simulation iterations is 1000.
5.1. Three simulated examples
In the first simulation experiment, we investigate the performance of the various estimators studied in Sections 2, 3 and 4. Here, qi’s can take six different values, i.e. m = 6, while the group sizes rj, j = 1, …, m, are randomly generated. The six different qi values are respectively (0.3, 0.7)T, (0, 1)T, (0.7, 0.3)T, (0.8, 0.2)T, (0.5, 0.5)T, (0.6, 0.4)T. The two components in the true F(t) both have truncated exponential form, since exponential function is a commonly used parametric model in practice. Specifically, F1(t) = {1 − exp(−t/3)}/{1 − exp(−10/3)} and F2(t) = 1 − {1 − exp(t/3 − 10/3)}/{1 − exp(−10/3)} on the interval (0, 10).
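The truncated exponential design can be reproduced with a few lines of Python. The sketch evaluates the two CDFs at t = 6.8 and draws outcomes by inverting them. The inverse formulas follow from solving F1(t) = u and F2(t) = u. Drawing the six q values with equal probability is an assumption made here for illustration, since the text only says the group sizes are randomly generated. All function names are hypothetical.

```python
import math
import random

C = 1.0 - math.exp(-10.0 / 3.0)   # normalizing constant on (0, 10)

def F1(t):
    return (1.0 - math.exp(-t / 3.0)) / C

def F2(t):
    return 1.0 - (1.0 - math.exp(t / 3.0 - 10.0 / 3.0)) / C

def F1_inv(u):
    # Solve F1(t) = u: exp(-t/3) = 1 - u * C
    return -3.0 * math.log(1.0 - u * C)

def F2_inv(u):
    # Solve F2(t) = u: exp(t/3 - 10/3) = 1 - C * (1 - u)
    return 10.0 + 3.0 * math.log(1.0 - C * (1.0 - u))

Q_VALUES = [(0.3, 0.7), (0.0, 1.0), (0.7, 0.3), (0.8, 0.2), (0.5, 0.5), (0.6, 0.4)]

def draw_sample(n, rng):
    """Draw n pairs (q_i, s_i); the latent group label is discarded after sampling."""
    data = []
    for _ in range(n):
        q = rng.choice(Q_VALUES)              # equal group probabilities: an assumption
        inv = F1_inv if rng.random() < q[0] else F2_inv
        data.append((q, inv(rng.random())))
    return data

# F1(6.8) and F2(6.8) round to 0.9295 and 0.3199, the true values used at t = 6.8.
true_values = (F1(6.8), F2(6.8))
sample = draw_sample(300, random.Random(0))
```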
We studied eight different estimators. The efficient estimator with true f(s) inserted (hence unrealistic) is denoted ORACLE, while with the estimated f(s) inserted is denoted EFF. Thus EFF is the implemented efficient estimator. Two different kinds of robust estimators are considered, where ROB1 had the f(t) mis-specified, and ROB2 not only used a mis-specified f(t), but also had K = 0 plugged in. Specifically, in ROB1, we used the true f1(t) as the proposed model for f2(t), and used the true f2(t) as the proposed model for f1(t). In ROB2, we proposed uniform model for both f1(t) and f2(t). These two estimators are expected to be consistent hence reflecting robustness to mis-specification of the PDFs. We also investigated the proposed OWLS estimator. For comparison, we implemented the OLS, NPMLE1 and NPMLE2 estimators that are used in the literature. We implement the estimation procedures at t = 6.8. The resulting estimation mean, sample and estimated standard errors and 95% coverage of the confidence intervals are summarized in Table 1.
Table 1.
Bias, empirical standard error (emp se), average estimated standard error (est se), 95% coverage (95% cov) of Simulation 1, sample size n = 300, 1000 simulations
| | F1(t) = 0.9295 | | | | F2(t) = 0.3199 | | | |
|---|---|---|---|---|---|---|---|---|
| Estimator | bias | emp se† | est se* | 95% cov | bias | emp se† | est se* | 95% cov |
| ORACLE | 0.0003 | 0.529 | 0.538 | 94.9% | −0.0013 | 0.704 | 0.684 | 93.8% |
| EFF | 0.0007 | 0.540 | 0.549 | 94.3% | −0.0017 | 0.726 | 0.704 | 92.5% |
| ROB1 | 0.0001 | 0.545 | 0.555 | 94.5% | −0.0013 | 0.732 | 0.710 | 92.6% |
| ROB2 | 0.0001 | 0.545 | 0.555 | 94.5% | −0.0013 | 0.732 | 0.710 | 92.6% |
| OWLS | −0.0004 | 0.537 | 0.553 | 94.6% | −0.0006 | 0.727 | 0.712 | 93.6% |
| OLS | 0.0001 | 0.545 | 0.559 | 94.6% | −0.0013 | 0.732 | 0.716 | 93.0% |
| NPMLE1 | 0.0001 | 0.570 | 0.581 | 95.0% | −0.0010 | 0.753 | 0.738 | 93.4% |
| NPMLE2 | −0.179 | 0.323 | – | – | 0.2148 | 0.425 | – | – |

† Empirical standard error × 10
* Estimated standard error × 10
It can be seen that all the consistent estimators perform well in finite samples, and the estimated variances are very close to the empirical variances. This indicates that the asymptotic results are relevant for a moderate sample size of n = 300. It is very clear that the type II NPMLE yields very large bias. We emphasize here that this bias is not a reflection of small sample size because the bias persists when we increase the sample size to 1000.
We can also see that the type I NPMLE and OLS do not make a very good choice of the weights, hence the estimation standard errors are both larger than that of the OWLS. This is especially prominent for the type I NPMLE, in that it performs even worse than the simple OLS estimator. The two robust estimators (ROB1 and ROB2) perform very similarly, and both have minimal bias, reflecting the desired robustness property with respect to the PDF estimation. Finally, although in theory the efficient estimator (EFF) should outperform the OWLS estimator, the performance of OWLS is as satisfactory as EFF. This appears to be often the case in our other simulations not shown here. Thus, using either the proposed OWLS or EFF in practice is expected to be adequate.
We also studied the type I error and power of the test (8) in this situation, and present the results in Table 2. The overall performance of the proposed tests is satisfactory. From the left panel of Table 2, we see that all estimators maintain correct size. From the right panel of the same table, we see that the OLS and NPMLE1 have lower power compared to other estimators due to their larger estimation variances.
Table 2.
Type I error and power of test in Simulation 1, sample size n = 300, 1000 simulations
| Estimator | Type I error | | | | Power | | | |
|---|---|---|---|---|---|---|---|---|
| Nominal level | 0.01 | 0.05 | 0.1 | 0.2 | 0.01 | 0.05 | 0.1 | 0.2 |
| ORACLE | 0.016 | 0.062 | 0.105 | 0.194 | 0.198 | 0.424 | 0.546 | 0.700 |
| EFF | 0.017 | 0.061 | 0.118 | 0.198 | 0.177 | 0.400 | 0.523 | 0.676 |
| ROB1 | 0.018 | 0.062 | 0.117 | 0.200 | 0.167 | 0.391 | 0.529 | 0.680 |
| ROB2 | 0.018 | 0.062 | 0.117 | 0.200 | 0.167 | 0.391 | 0.529 | 0.680 |
| OWLS | 0.018 | 0.057 | 0.119 | 0.197 | 0.170 | 0.396 | 0.529 | 0.681 |
| OLS | 0.018 | 0.061 | 0.113 | 0.198 | 0.162 | 0.388 | 0.521 | 0.673 |
| NPMLE1 | 0.023 | 0.062 | 0.100 | 0.204 | 0.148 | 0.354 | 0.496 | 0.655 |
The second simulation experiment is conducted to closely mimic a QTL mapping data analyzed in Section 6.1. We generated the data from a mixture of two distributions. The first one is a uniform distribution on (3, 10), while the second one has CDF c(1 − e−t/2.5) on the interval (0, 10). The mixture probability has four different values which are (0.02, 0.98)T, (0.2, 0.8)T, (0.1, 0.9)T, (0.98, 0.02)T, and the sample size is 100. Based on the performance of the various estimators studied in the first simulation, here we used only the two best estimators, the OWLS and the efficient estimator (EFF) to estimate the two CDFs. We also implemented the type II NPMLE for comparison. We plot the true CDFs, the mean of the estimated CDFs and the 95% pointwise confidence band for each method in Figure 1. As expected, both OWLS and EFF give satisfactory results, while NPMLE2 is clearly biased. Again, we emphasize that the bias of NPMLE2 is not caused by the moderate sample size. In fact, when we increased the sample sizes to 1000, the bias became even more prominent.
Fig 1.
Simulation 2. True CDF (solid) and the mean (dashed), 95% pointwise confidence band (upper band dotted, lower band dash-dotted) of the estimated CDFs. The OWLS (left), EFF (mid) and NPMLE2 (right) are plotted. The mean and true CDFs are indistinguishable in OWLS and EFF estimators. Sample size is 100, and results are based on 1000 simulations.
Similarly, the third simulation is conducted to closely mimic the LDL data analyzed in Section 6.2. The first CDF is c1/{1 + e−(t−3)/0.5} on the interval (0, 6), and the second CDF is c2/{1 + e− (t−2.5)/0.2} on the interval (0, 7). Note that these two CDFs cross. Here, the mixture probability distribution has three different values which are (0.15, 0.85)T, (0.6, 0.4)T, (0.8, 0.2)T, and the sample size is 300. Estimations based on OWLS, EFF and NPMLE2 are computed, and the mean of the estimated CDFs, the 95% pointwise confidence band for each method are presented in Figure 2 together with the true CDFs. Similar to the second simulation, both OWLS and EFF perform well, while NPMLE2 shows large bias.
Fig 2.
Simulation 3. True CDF (solid) and the mean (dashed), 95% pointwise confidence band (upper band dotted, lower band dash-dotted) of the estimated CDFs. The OWLS (left), EFF (mid) and NPMLE2 (right) are plotted. The mean and true CDFs are indistinguishable for the OWLS and EFF estimators. Sample size is 300, and results are based on 1000 simulations.
6. Real data examples
6.1. Estimation from QTL mapping data
In QTL studies, the trait observations are assumed to be drawn from a mixture of several QTL genotype groups. The mixture probabilities of a subject having a certain QTL genotype given the flanking markers are calculated from the study design, the marker genotypes and the recombination fraction between the location-known flanking markers and the putative QTL (Wu et al., 2007). The first example that we use to illustrate our methods is a genetic linkage study designed to map QTLs for rice plant height and grain shape. The identified QTL can be used to produce taller rice plants to increase yield. In Huang et al. (1997), a doubled haploid (DH) population of rice plants was derived from two inbred lines (semi-dwarf IR64 and tall Azucena), creating 123 DH lines, each genotyped with 135 RFLP markers and 40 isozyme and RAPD markers. Several traits such as grain shape and plant height were recorded. A DH population is equivalent to a backcross population where the two marker genotypes have an approximately 1:1 distribution ratio. The mixture probabilities qi of a plant carrying a certain QTL genotype given the flanking markers are computed from the marker genotypes and the recombination fraction between the marker and the QTL. The details of the qi computation can be found in Table 10.3 of Wu et al. (2007).
Using a Gaussian mixture model, Wu et al. (2007) analyzed the plant height measured at 10 weeks after the rice was transplanted to the field and mapped a QTL for this trait to 199cM on chromosome 1 between the markers RZ730 and RZ801. Here we estimate the cumulative distribution function of the rice plant height for each of the two QTL genotypes at the same locus (199cM on chromosome 1) using the model (1).
There were 84 plant height measurements available. Table 3 presents the estimated CDFs and their standard errors for each of the two QTL genotypes at several values of the plant height. We present the efficient estimator (EFF) and the optimal WLS (OWLS). We omitted OLS and the two NPMLEs due to their respective deficiencies. The proposed OWLS and EFF lead to comparable results. The test of H0 : F1(t) = F2(t) based on the test statistic (8) was significant at the 5% level for both estimators at three typical values of t, indicating a difference in the distribution functions for the two QTL genotypes. In addition, we tested the difference between the two distributions at the three t values simultaneously using the test (9). The null distribution of the test statistic was chi-square with three degrees of freedom, and the p-value was less than 0.01, which indicates a significant difference.
Table 3.
Data example 1. Estimated CDFs of plant height and their standard errors for QTL genotypes bb (F̂1) and Bb (F̂2)
| t | Estimator | F̂1(t) | SE(F̂1) | F̂2(t) | SE(F̂2) | p value* |
|---|---|---|---|---|---|---|
| 80 | EFF | 0.132 | 0.048 | 0 | 0.006 | 0.011 |
| 80 | OWLS | 0.126 | 0.048 | 0 | 0.001 | 0.011 |
| 110 | EFF | 0.895 | 0.05 | 0.095 | 0.062 | <0.001 |
| 110 | OWLS | 0.927 | 0.043 | 0.098 | 0.062 | <0.001 |
| 140 | EFF | 0.992 | 0.024 | 0.699 | 0.083 | 0.001 |
| 140 | OWLS | 1.000 | 0.006 | 0.684 | 0.082 | 0.000 |

* p value for testing H0 : F1(t) = F2(t) based on (8)
Figure 3 presents the CDFs of rice plant heights for plants carrying each of the two QTL genotypes estimated by the efficient estimator (EFF). It can be seen that there is a large difference in the CDFs across the entire range of the plant height and carrying a risk allele increases the plant height. For example, it was estimated that 90.5% (CI: 78.3%, 100%) of the plants with Bb QTL genotype will have plant heights greater than 110, compared to 10.5% (CI: 0.7%, 20.3%) in the bb genotype group. This difference is highly significant (p < 0.001). These results are consistent with the analysis conducted in Wu et al. (2007).
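As a check on the arithmetic behind these percentages, the following sketch recomputes them from the EFF row of Table 3 at t = 110. It assumes, as an illustration only, that the intervals are normal-approximation (Wald) intervals with critical value 1.96, truncated to [0, 1]; the text does not state how the intervals were constructed, and the helper name `survival_ci` is hypothetical.

```python
def survival_ci(cdf_hat, se, z=1.96):
    """Wald interval for 1 - F(t), truncated to [0, 1] (assumed construction)."""
    p = 1.0 - cdf_hat                 # estimated proportion above t
    lower = max(0.0, p - z * se)
    upper = min(1.0, p + z * se)
    return p, lower, upper

# EFF estimates and standard errors at t = 110, taken from Table 3.
for label, cdf_hat, se in [("Bb", 0.095, 0.062), ("bb", 0.895, 0.050)]:
    p, lo, hi = survival_ci(cdf_hat, se)
    print(f"{label}: {100 * p:.1f}% (CI: {100 * lo:.1f}%, {100 * hi:.1f}%)")
```

Under this assumption the computed intervals agree with the values quoted above to one decimal place.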
Fig 3.
Data example 1. Estimated cumulative distribution function (CDF) of plant height for QTL genotype Bb (solid) and bb (dashed)
6.2. Estimation from the LDL data
In the LDL example introduced in Section 1, the association between the APOE ε4 allele and the LDL concentrations in young children is our main research interest. There were 230 subjects included in the data analyses. We show the estimated cumulative distribution function of LDL concentration for carriers of the ε4 allele (carrying one or two copies of ε4) compared to non-carriers (carrying zero copies of ε4) at several values of the LDL levels in Table 4. As in data example 1, we present the EFF and the OWLS. Both estimators yielded similar results. The difference between the CDFs of carriers and non-carriers was not significant at the 5% level at LDL = 100 or LDL = 260, but was significant at LDL = 180. Similar to the QTL analysis, we tested the difference between the two distributions at these three typical t values simultaneously by (9). The p-value was 0.29, indicating a non-significant overall difference between the two distributions at these values.
Table 4.
Data example 2. Estimated CDFs of LDL levels and their standard errors of APOE ε4 carriers (F̂1) and non-carriers (F̂2)
| t | Estimator | F̂1(t) | SE(F̂1) | F̂2(t) | SE(F̂2) | p value* |
|---|---|---|---|---|---|---|
| 100 | EFF | 0.719 | 0.108 | 0.619 | 0.054 | 0.496 |
| 100 | OWLS | 0.718 | 0.110 | 0.619 | 0.054 | 0.510 |
| 180 | EFF | 1.000 | 0.014 | 0.921 | 0.024 | 0.037 |
| 180 | OWLS | 1.000 | 0.014 | 0.922 | 0.024 | 0.035 |
| 260 | EFF | 1.000 | 0.006 | 0.984 | 0.011 | 0.364 |
| 260 | OWLS | 1.000 | 0.006 | 0.984 | 0.011 | 0.354 |

* p value for testing H0 : F1(t) = F2(t) based on (8)
Figure 4 depicts the CDF of LDL for carriers and non-carriers estimated by the efficient estimator, EFF. It can be seen that there is virtually no difference of the two CDFs in the range from 45 to 130. The CDF for carriers is elevated in the interval (130, 200) compared to non-carriers and the two functions merge again for LDL greater than 200. Previous analyses in the literature focus on the mean LDL concentration. Our analysis shows that the effect of APOE ε4 on LDL manifests in the range of 130 to 200.
Fig 4.
Data example 2. Estimated CDF of LDL levels for carriers of APOE ε4 allele (solid) and non-carriers (dashed)
7. Discussion
We have developed nonparametric estimation procedures for mixed samples where the conditional distribution of the outcome given the group membership is completely unspecified and the mixing probabilities are known or can be calculated without using the outcome data. We propose an extremely simple optimal weighted least squares estimator and derive an easy-to-compute efficient estimator which reaches the semiparametric efficiency bound. We illustrate by simulations that the OWLS estimator has good efficiency in many practical situations. We investigate the performance of two types of NPMLE and show the surprising result that neither of them is efficient and one of them is not even consistent. This is in contrast to many other semiparametric problems where the NPMLE is an efficient estimator.
Although the estimators are constructed for CDFs, it is straightforward to adapt these procedures to estimate a quantile function $F^{-1}(\tau)$. This is because all the estimators can be expressed as solutions for F(t) of an estimating equation. If we denote $t = F^{-1}(\tau)$, replace F(t) with τ in these estimating equations, and solve for t given τ instead of solving for F(t) given t, we obtain estimators of the quantile functions. For example, the efficient quantile estimator at τ can be obtained through solving for t from
where K itself is now a function of t hence we use the notation .
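The idea of swapping the roles of t and τ can be illustrated with a simple numerical stand-in. The sketch below does not solve the efficient estimating equation; it assumes that an estimated CDF for one sub-population has already been evaluated on an increasing grid of t values, and returns the smallest grid point at which the estimate reaches τ. The function name `quantile_from_cdf` and the example numbers are hypothetical.

```python
import numpy as np

def quantile_from_cdf(grid, cdf_values, tau):
    """Smallest grid point t with estimated CDF >= tau (NaN if never reached)."""
    grid = np.asarray(grid, dtype=float)
    cdf_values = np.asarray(cdf_values, dtype=float)
    reached = cdf_values >= tau
    if not reached.any():
        return float("nan")
    # np.argmax on a boolean array returns the first True position.
    return float(grid[np.argmax(reached)])

# Hypothetical estimated CDF values on a grid; note the small non-monotone dip.
grid = [1.0, 2.0, 3.0, 4.0]
cdf_values = [0.10, 0.40, 0.35, 0.90]
print(quantile_from_cdf(grid, cdf_values, 0.5))
```

Taking the first crossing keeps the definition unambiguous even when the estimated CDF is not monotone in small samples.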
The CDFs estimated by the consistent estimators may not be monotone increasing functions of t when the sample size is relatively small. In fact, the type II NPMLE was originally proposed to address this issue, but it unfortunately leads to inconsistency. One way to guarantee monotonicity is through reparametrization. For example, we could express each CDF through an unconstrained function g(u), treated as a nuisance parameter, in such a way that the resulting estimate is guaranteed to be monotone and to lie between 0 and 1. However, the additional complexity may not be worth the gain. Instead, we suggest using a post-estimation adjustment, such as the pool adjacent violators algorithm (Barlow et al., 1972), to modify the results to achieve monotonicity. For a detailed description, see Wang et al. (2007).
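A post-estimation adjustment of this kind can be sketched as follows. This is a minimal equal-weight pool adjacent violators routine followed by truncation to [0, 1], written for illustration; the weighting and other details used in Wang et al. (2007) may differ, and the function name `pava_cdf` is hypothetical. The input is assumed to be the estimated CDF of one sub-population evaluated at increasing values of t.

```python
import numpy as np

def pava_cdf(values):
    """Pool adjacent violators (equal weights), then clip to [0, 1]."""
    blocks = []  # each block stores [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Merge with the previous block while the block means are decreasing.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Expand each block back to its original length using the block mean.
    fitted = np.concatenate(
        [np.full(count, total / count) for total, count in blocks]
    )
    return np.clip(fitted, 0.0, 1.0)

# Hypothetical raw estimates with two monotonicity violations and one value > 1.
raw = [0.10, 0.30, 0.20, 0.50, 0.45, 1.02]
print(pava_cdf(raw))
```

Each violating pair is replaced by its average, so the adjusted sequence is the closest nondecreasing sequence to the raw estimates in the least squares sense before truncation.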
Finally, we point out that one needs to be cautious in interpreting the inconsistency of the type II NPMLE. The inconsistency occurs when a purely nonparametric model is used. Maximum likelihood estimators under parametric models and semiparametric models, such as the Cox proportional hazards model with a nonparametric baseline or piecewise exponential models, are likely to be consistent. An extension of the proposed methods to handle censoring, based on the full data influence functions derived here and inverse probability weighting, is underway.
Acknowledgments
The authors wish to thank Dr. Steve Shea and Dr. Rongling Wu for providing data. Ma’s research is supported by NSF grants DMS-0906341 and DMS-1206693 and NIH grant NS073671-01. Wang’s research is supported by NIH grants AG031113-01A2 and NS073671-01.
Appendix
A.1. Derivation of the complete influence function family
To perform a formal semiparametric analysis (Bickel et al., 1993; Tsiatis, 2006), we denote by θ the function that maps the nuisance parameter f(s) to the p-dimensional parameter of interest, F(t), i.e., $\theta(f) = F(t) = \int I(s \le t) f(s)\,ds$. We denote the infinite dimensional nuisance parameter f(s) as η, i.e., η = f(s).
We now derive a general class of consistent estimators through characterizing the complete influence function set. An influence function φ(q, s; θ, η) is a mean zero function that satisfies
| (A.1) |
for any parametric submodel. A parametric submodel is a model where the original unknown function f(s) is replaced by a parametric PDF model f(s; γ), and it satisfies f(s; γ0) = f0(s). Here Sγ is the score function with respect to γ evaluated at γ0,
and
The relation in (A.1) indicates that
where μ(q) is the counting measure of Q.
Given any parametric submodel of the form g(q, s; γ) = pQ(q)qT f (s; γ), where γ = (γ1, …, γp)T, and f (s; γ) = {f1(s; γ1), …, fp(s; γp)}T, the parameter of interest is
On one hand, the partial derivative of the parameter of interest with respect to γ is a block diagonal matrix of the form
On the other hand, the score vector Sγ evaluated at the truth is
Recall that (A.1) requires
for j = 1,…, p, and
for k ≠ j. Here φj is the jth component of φ. Because f (s) is completely unspecified, the function can be any function that satisfies . It then follows almost everywhere that ∫ φjqjpQ(q)dμ(q) − I(s ≤ t) is a constant and ∫ φjqkpQ(q)dμ(q) is also a constant for k ≠ j. These requirements can be written concisely as
| (A.2) |
Note that a legitimate influence function also needs to have mean zero, hence
Thus, we can write φ(q, s) as φ(q, s) = b(q, s) − F(t) − C1p, where b satisfies (A.2). This gives the desired family of influence functions described in (2).
A.2. Influence function of the WLS
Denote the ith diagonal entry in M as wi for i = 1, …, n. When we view the weight wi as a random variable, we denote it as Wi. Since our arguments are general for any i = 1, …, n, we often omit the subscript i, and use w or W for the corresponding quantities. From
we obtain
Note that , hence
So the influence function of WLS is
Specifically, for the OLS and the optimal WLS estimators, the influence functions are respectively
A.3. Derivation of Λ and Λ⊥
We denote the collection of mean zero functions orthogonal to all the elements in Λ as Λ⊥. Considering the space of tangent vectors contributed from the jth component fj(s) only, we obtain
Combining the Λj’s for j = 1, …, p, the nuisance tangent space is therefore
Furthermore, it is easy to see that
where C is a constant p × p matrix.
A.4. Proof of Theorem 1
We only need to verify that φeff given in Theorem 1 satisfies , where Π denotes an orthogonal projection.
To show this, we first point out that K1p = F (t). This is because from the definition of A(s), we have
Integrating both sides of the above equation from T1 to T2 and from T1 to t respectively, we obtain
and the result follows.
Now, letting h1(q, s) = h2(q, s) = A−1(s)q/qT f(s), h3(q) = K1p and B = −K, we can easily verify that the corresponding b(q, s) in (4) has the form
Since
its corresponding influence function is
Note that the above expression equals φeff. Thus, we have shown that φeff is a valid influence function and hence φeff ∈ Λ⊥.
Now, for any φ ∈ Λ
, we need to show
. We have
is a constant matrix. In the last equality, we used the fact that an influence function φ can be written as φ = b − F(t) − C1p, where ∫ dqTpQ(q)dμ(q) = I(s ≤ t)Ip − C. From
and follow the description of , we indeed have .
A.5. Proof of Theorem 2
First, we note that all the approximations are caused by , which is estimated using the second subset of the data. No other estimation or approximation is involved in our construction. From (7) we obtain
Note that A(s; qT f) = A(s), K(qTf) = K, and K1p = F (t), hence
From (5), we see that
is an influence function. Thus, the difference between
and ψ(q, s; qT f) is the difference between a valid influence function and its projection on Λ
, hence is orthogonal to Λ
. Specifically, we have
and it has mean zero. Consequently, the estimator F̂(t) is consistent and has variance
When n2 → ∞, the number of observations that satisfy qi = uk also goes to infinity in probability due to the randomness of the data. Thus, the kernel estimator for satisfies uniformly on any compact set of s for each k ∈ {1, …, m}. Therefore, as n → ∞. Note that ψ(q, s; qT f) is a pathwise differentiable function of qTf, it then follows that . This proves that F̂(t) is indeed an efficient estimator.
A.6. Inconsistency of the type II NPMLE
Consider a very simple and explicit case where p = m = 2, u2 = (1, 0)T, while u1 ≠ (1, 0)T and u1 ≠ (0, 1)T. This corresponds to the situation where there exist two genotypes: for the first r1 observations we know that they belong to the first group with probability u11 and to the second group with probability u12 = 1 − u11, while for the last r2 observations we know that they are from the first group. Under this special case, writing hij for the mass that the jth estimated PDF places on each observation with qi = ui, the NPMLE becomes
$$\max_{h_{11}, h_{12}, h_{21}, h_{22}} \; r_1 \log(u_{11} h_{11} + u_{12} h_{12}) + r_2 \log(h_{21})$$
subject to r1h11 + r2h21 = 1, r1h12 + r2h22 = 1, and hij ≥ 0 for i, j = 1, 2. Obviously, the maximum is attained only when $h_{12} = 1/r_1$ and h22 = 0. This can be written as $\hat f_2(s_i) = I(i \le r_1)/r_1$ for all i = 1, …, n. Hence the NPMLE2 for the PDF f2(s) puts zero weight on the observations that are known to be drawn from the first group, and puts equal weights, $1/r_1$, on the other observations. This result is equivalent to the standard empirical likelihood estimate of a PDF when we are given only the observations s1, …, sr1 as a random sample from this PDF, so the corresponding CDF estimate is consistent for the CDF from which this sample is drawn. However, s1, …, sr1 is a random sample from a mixture of two populations, where the mixture probability is u11 for the first population and u12 for the second population. In other words, the estimator F̂2(t) is a consistent estimator of u11F1(t) + u12F2(t). Obviously, u11F1(t) + u12F2(t) does not equal F2(t) in general, because u11 ≠ 0. Consequently, the type II NPMLE is not consistent for this simple case.
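The argument can be checked numerically with a short simulation. The sketch below uses hypothetical choices that are not taken from the paper: u11 = 0.3, F1 uniform on (0, 1), F2 uniform on (0, 2), and r1 = 50,000. Because the type II NPMLE of F2 in this special case is the empirical CDF of the first r1 observations, the last r2 observations receive zero weight and are not simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
u11, r1 = 0.3, 50_000                  # hypothetical values for illustration

# First r1 observations: a mixture with P(group 1) = u11.
# Hypothetical components: F1 = Uniform(0, 1), F2 = Uniform(0, 2).
in_group1 = rng.random(r1) < u11
s = np.where(in_group1, rng.uniform(0, 1, r1), rng.uniform(0, 2, r1))

t = 0.5
f1_true, f2_true = 0.5, 0.25           # F1(0.5) and F2(0.5) for these components
npmle2_f2 = np.mean(s <= t)            # equal weights 1/r1 on s_1, ..., s_r1
mixture = u11 * f1_true + (1 - u11) * f2_true

print("NPMLE2 estimate of F2(t):", npmle2_f2)
print("u11*F1(t) + u12*F2(t):   ", mixture)
print("true F2(t):              ", f2_true)
```

With a large r1 the estimate settles near the mixture value rather than near F2(t), which is the inconsistency described above.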
Contributor Information
Yanyuan Ma, Email: ma@stat.tamu.edu, Department of Statistics, Texas A&M University, College Station, TX 77845.
Yuanjia Wang, Email: yuanjia.wang@columbia.edu, Department of Biostatistics Mailman School of Public Health Columbia University 722 West 168th Street New York, NY 10032.
References
- Barlow RE, Bartholomew DJ, Bremner JM, Brunk HD. Statistical Inference Under Order Restrictions. New York: John Wiley; 1972.
- Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: The Johns Hopkins University Press; 1993.
- Chatterjee N, Wacholder S. A Marginal Likelihood Approach for Estimating Penetrance from Kin-cohort Designs. Biometrics. 2001;57:245–252. doi: 10.1111/j.0006-341x.2001.00245.x.
- Davignon J, Gregg RE, Sing CF. Apolipoprotein E Polymorphism and Atherosclerosis. Arteriosclerosis. 1988;8:1–21. doi: 10.1161/01.atv.8.1.1.
- Fine JP, Zou F, Yandell BS. Nonparametric estimation of the effects of quantitative trait loci. Biostatistics. 2004;5:501–513. doi: 10.1093/biostatistics/kxh004.
- Hartge P, Chatterjee N, Wacholder S, Brody LC, Tucker MA, Struewing JP. Breast cancer risk in Ashkenazi BRCA1/2 mutation carriers: effects of reproductive history. Epidemiology. 2002;13(3):255–261. doi: 10.1097/00001648-200205000-00004.
- Hauptmann M, Sigurdson AJ, Chatterjee N, Rutter JL, Hill DA, Doody MM, Struewing JP. Re: Population-Based, Case-Control Study of HER2 Genetic Polymorphism and Breast Cancer Risk. Journal of the National Cancer Institute. 2003;95:1251–1252. doi: 10.1093/jnci/djg032.
- Hixson JE. Apolipoprotein E Polymorphisms Affect Atherosclerosis in Young Males: Pathobiological Determinants of Atherosclerosis in Youth (PDAY) Research Group. Arterioscler Thromb. 1991;11:1237–1244. doi: 10.1161/01.atv.11.5.1237.
- Huang N, Parco A, Mew T, Magpantay G, McCouch S, Guiderdoni E, Xu J, Subudhi P, Angeles E, Khush G. RFLP Mapping of Isozymes, RAPD and QTLs for Grain Shape, Brown Planthopper Resistance in a Doubled Haploid Rice Population. Molecular Breeding. 1997;3:105–113.
- Khoury M, Beaty H, Cohen B. Fundamentals of Genetic Epidemiology. New York: Oxford University Press; 1993.
- Lander ES, Botstein D. Mapping Mendelian Factors Underlying Quantitative Traits Using RFLP Linkage Maps. Genetics. 1989;121:185–199. doi: 10.1093/genetics/121.1.185.
- Li R, Liang H. Variable selection in semiparametric regression modeling. Annals of Statistics. 2008;36:261–286. doi: 10.1214/009053607000000604.
- Liang H, Wang N. Large sample theory in a semiparametric partially linear errors-in-variables model. Statistica Sinica. 2005;15:99–117.
- Marder K, Levy G, Louis ED, Mejia-Santana H, Cote L, Andrews H, Harris J, Waters C, Ford B, Frucht S, Fahn S, Ottman R. Accuracy of family history data on Parkinson’s disease. Neurology. 2003;61:18–23. doi: 10.1212/01.wnl.0000074784.35961.c0.
- McLachlan GJ, Peel D. Finite Mixture Models. New York: Wiley; 2000.
- Newey WK. Semiparametric Efficiency Bounds. Journal of Applied Econometrics. 1990;5:99–135.
- Rabinowitz D. Computing the Efficient Score in Semi-parametric Problems. Statistica Sinica. 2000;10:265–280.
- Sigurdson AJ, Hauptmann M, Chatterjee N, Alexander BH, Doody MM, Rutter JL, Struewing JP. Kin-cohort estimates for familial breast cancer risk in relation to variants in DNA base excision repair, BRCA1 interacting and growth factor genes. BMC Cancer. 2004;4:9. doi: 10.1186/1471-2407-4-9.
- Shea S, Isasi CR, Couch S, Starc TJ, Tracy RP, Deckelbaum R, Talmud P, Berglund L, Humphries SE. Relations of Plasma Fibrinogen Level in Children to Measures of Obesity, the (G-455->A) Mutation in the Beta-Fibrinogen Promoter Gene, and Family History of Ischemic Heart Disease: the Columbia University BioMarkers Study. American Journal of Epidemiology. 1999;150:737–746. doi: 10.1093/oxfordjournals.aje.a010076.
- Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer; 2006.
- Tsiatis AA, Ma Y. Locally Efficient Semiparametric Estimators for Functional Measurement Error Models. Biometrika. 2004;91:835–848.
- Wacholder S, Hartge P, Struewing J, Pee D, McAdams M, Brody L, Tucker M. The Kin-cohort Study for Estimating Penetrance. American Journal of Epidemiology. 1998;148:623–630. doi: 10.1093/aje/148.7.623.
- Wang Y, Clark LN, Marder K, Rabinowitz D. Nonparametric Estimation of Genotype-specific Age-at-onset Distributions From Censored Kin-cohort Data. Biometrika. 2007;94:403–414.
- Wang Y, Clark LN, Louis ED, Mejia-Santana H, Harris J, Cote LJ, Waters C, Andrews D, Ford B, Frucht S, Fahn S, Ottman R, Rabinowitz D, Marder K. Risk of Parkinson’s disease in carriers of Parkin mutations: estimation using the kin-cohort method. Arch Neurol. 2008;65(4):467–474. doi: 10.1001/archneur.65.4.467.
- Webb EL, Rudd MF, Houlston RS. Case-control, kin-cohort and meta-analyses provide no support for STK15 F31I as a low penetrance colorectal cancer allele. British Journal of Cancer. 2006a;95:1047–1049. doi: 10.1038/sj.bjc.6603382.
- Webb EL, Rudd MF, Sellick GS, Galta R, Bethke L, Wood W, Fletcher O, Penegar S, Withey L, Qureshi M, Johnson N, Tomlinson I, Gray R, Peto J, Houlston RS. Search for low penetrance alleles for colorectal cancer through a scan of 1467 nonsynonymous SNPs in 2575 cases and 2707 controls with validation by kin-cohort analysis of 14 704 first-degree relatives. Hum Mol Genet. 2006b;15(21):3263–3271. doi: 10.1093/hmg/ddl401.
- Wu R, Ma C, Casella G. Statistical Genetics of Quantitative Traits: Linkage, Maps, and QTL. New York: Springer; 2007.




