Efficient distribution estimation for data with unobserved sub-population identifiers

Yanyuan Ma; Yuanjia Wang

doi:10.1214/12-EJS690

Author manuscript; available in PMC: 2013 Jun 19.

Published in final edited form as: Electron J Stat. 2012;6:710–737. doi: 10.1214/12-EJS690

PMCID: PMC3685883 NIHMSID: NIHMS470955 PMID: 23795232

Abstract

We study efficient nonparametric estimation of distribution functions of several scientifically meaningful sub-populations from data consisting of mixed samples where the sub-population identifiers are missing. Only probabilities of each observation belonging to a sub-population are available. The problem arises from several biomedical studies such as quantitative trait locus (QTL) analysis and genetic studies with ungenotyped relatives where the scientific interest lies in estimating the cumulative distribution function of a trait given a specific genotype. However, in these studies subjects’ genotypes may not be directly observed. The distribution of the trait outcome is therefore a mixture of several genotype-specific distributions. We characterize the complete class of consistent estimators which includes members such as one type of nonparametric maximum likelihood estimator (NPMLE) and least squares or weighted least squares estimators. We identify the efficient estimator in the class that reaches the semiparametric efficiency bound, and we implement it using a simple procedure that remains consistent even if several components of the estimator are mis-specified. In addition, our close inspections on two commonly used NPMLEs in these problems show the surprising results that the NPMLE in one form is highly inefficient, while in the other form is inconsistent. We provide simulation procedures to illustrate the theoretical results and demonstrate the proposed methods through two real data examples.

Keywords and phrases: Finite mixed samples, robustness, semiparametric efficiency, nonparametric maximum likelihood estimator (NPMLE)

1. Introduction

In many scientific studies, data arise from a mixture of scientifically meaningful distributions. For example, in a quantitative trait locus (QTL) study, the goal is to identify, map and estimate effect of a QTL predisposing the trait. However, the genomic location of the QTL is unknown, therefore subjects’ genotypes at the QTL are not observed. Mixture models are widely used to map QTLs using location-known molecular markers such as single nucleotide polymorphisms (SNPs) or microsatellite markers, see Lander and Botstein (1989) and Wu et al. (2007).

Another example where a mixture model is useful is genetic studies in which genotypes of the relatives of an initial sample (probands) are not collected (Marder et al., 2003; Wang et al., 2008). In these studies, the scientific interest is in estimating the conditional distribution of a trait given a genotype (or penetrance, Khoury et al., 1993). Genotype information in the initial sample of probands is collected. However, because of the high cost of administering in-person interviews to relatives, their genotype information is commonly not collected. For example, in Wacholder et al. (1998) and Wang et al. (2007, 2008), only the probands were genotyped, and none of the first-degree relatives of the probands was genotyped. The distribution of possible genotypes of a relative, however, can easily be obtained given the relationship between the relative and the proband and the genotype of the proband. The relatives’ disease history or trait information is usually obtained by administering a systematic and reliable phone interview (Marder et al., 2003). The distribution of the trait in a relative is then a mixture of the conditional distributions of the trait given the relative’s genotype, and these relatives form the main analysis sample.

A concrete example of such genetic studies is an investigation of association between the APOE gene and the LDL concentrations in young children (Shea et al., 1999). There are three common alleles at the APOE locus (ε2, ε3, ε4). The APOE ε3 is the most prevalent allele in the general population, with frequency 75% to 80%. Previous studies have suggested that the APOE ε4 allele may be associated with higher LDL cholesterol levels in adults (Davignon et al., 1988). Of interest is the association between APOE ε4 allele and LDL cholesterol distribution in children.

Subjects included in the study were recruited from a cross-sectional biomarker study of children conducted from 1994 to 1998 (Shea et al., 1999). Proband children were recruited from lists of cardiac patients generated through the Presbyterian Hospital Clinical Information System, private cardiology practices, lipid clinics and pediatric practices. Families with at least one healthy child 4 to 25 years of age were eligible for participation. Siblings of proband children were recruited to the study. The availability of the APOE genotype information of the probands and the sibling relationship enables the calculation of each sibling’s probability of carrying the ε4 allele. The cumulative distribution function of LDL concentration for carriers of ε4 allele (carrying one or two copies of ε4) and for the non-carriers (carrying zero copy of ε4) are of primary interest in this study.

Traditional statistical analysis of mixture data specifies a parametric form for the conditional distribution of an outcome given group membership (e.g., Gaussian mixture model, Wu et al., 2007) and estimates the mixture probabilities and the parameters in the conditional distribution by maximum likelihood through an EM algorithm (McLachlan and Peel, 2000). In this work, we provide nonparametric estimation in the sense that we do not make any distributional assumption on the conditional distributions. One common feature of the two examples introduced above is that the mixture probabilities are either known or easily calculated without using the outcome data, and the mixture populations are scientifically meaningful (e.g., subjects carrying a certain genotype). Treating these mixture probabilities as random variables, each observation in the data consists of a vector of mixture probabilities and a continuous outcome, and the observations are assumed to be independent and identically distributed (i.i.d.).

To fix ideas, let Q denote a p-dimensional vector of random mixture probabilities, and let p_Q denote the probability mass function of Q, which has finite support u₁, …, u_m. Let S denote a random outcome, let L denote the unobserved group membership (or genotype), and let f(s) denote the p-dimensional vector of conditional densities of S given L. For simplicity, we assume that f(s) is supported on a compact interval, say [T₁, T₂]. For the ith subject, i = 1, …, n, we observe (q_i, s_i), where the joint density of Q, S at Q = q_i and S = s_i is

g (q_{i}, s_{i}) = p_{Q} (q_{i}) q_{i}^{T} f (s_{i}) .

(1)

Here f(s) is a length p vector, where the jth component f_j(s) represents the conditional probability density function (PDF) of s given that it belongs to the jth genotype group, j = 1, …, p. Each component of f(s), f_j(s), is the PDF of a trait at time t given the gene mutation status being the jth kind in a relative (for example, j = 1 denotes carriers and j = 2 denotes non-carriers), or the PDF of a quantitative trait given the QTL genotype being the jth kind. Let F(·) denote the corresponding p-dimensional cumulative distribution function (CDF) of f(·). Our interest is in estimating F at any fixed time t. The vector q_i represents the probabilities that a relative carries a certain genotype given the proband’s genotype, or a vector of probabilities of a subject having a certain QTL genotype given the flanking markers. Obviously, $\sum_{j = 1}^{p} q_{ij} = 1$. The distribution of q_i (i.e., p_Q) depends on the study design and can easily be estimated consistently from the empirical distribution of q_i. For example, for a backcross QTL experiment, q_i takes four different values depending on the marker genotype frequencies (e.g., Table 10.3 of Wu et al., 2007). The vector of density functions f is completely unspecified, thus f is an infinite-dimensional nuisance parameter with length p.

Here, we characterize the complete class of consistent estimators which includes Fine et al. (2004) and Chatterjee and Wacholder (2001). We show that any weighted least squares estimator is a member of this estimation class hence yields a consistent estimator. In addition, we construct a special subclass which obtains the minimum estimation variance and reaches the semiparametric efficiency bound. We inspect two types of widely used NPMLEs and report a surprising finding that they are either inefficient or even inconsistent. Although commonly applied in clinical studies (Sigurdson et al., 2004; Hauptmann et al., 2003; Webb et al., 2006a,b; Hartge et al., 2002), the inconsistency of the second type of NPMLE has not been discovered in the literature before.

The remainder of the paper is organized as follows. In Section 2, weighted least squares estimators are introduced and a complete class of consistent estimators encompassing the least squares is defined. The optimal member of the class is identified and shown to reach the semiparametric efficiency bound. In Section 3, an algorithm to implement the efficient estimator is developed and asymptotic properties of the estimator are proved. In Section 4, two types of commonly used NPMLE estimators are investigated and one type is found to be inefficient while the other is inconsistent. In Section 5, simulation experiments are conducted to investigate the finite sample performance of the developed methods, and several estimators including the efficient estimator, the least squares estimators and the NPMLEs are compared. In Section 6, the proposed methods are implemented to analyze two data examples, one from a genetic linkage study of rice plant height and the other from a study of association between plasma low-density lipoprotein (LDL) cholesterol level and the apolipoprotein-E (APOE) gene. In Section 7, possible extensions of the proposed methods are discussed.

2. Estimation procedures

2.1. A class of weighted least squares estimators

Although the traditional approach to estimating F(t) is maximum likelihood estimation for a parametric model or the NPMLE for a nonparametric model, a very simple weighted estimator can be used if we formulate the same problem from a different angle. Observe that the model in (1) implies q^T F(t) = E{I(S ≤ t)|q}, where I(·) denotes an indicator function. Therefore, viewing the q_i’s as covariates and I(S_i ≤ t) as response variables, the covariates and the responses are linked by F(t) via a familiar linear regression model

Y_{i} \equiv I (S_{i} \leq t) = q_{i}^{T} F (t) + e_{i},

where E(e_i|q_i) = 0, i = 1, …, n. It is straightforward that the e_i’s are independent conditional on the q_i’s, and have variances $v_{i} = q_{i}^{T} F(t) \{1 - q_{i}^{T} F(t)\}$. Thus, a weighted least squares (WLS) based method can be used to estimate F(t). Denote by M an arbitrary n × n diagonal matrix. Let $A = (q_{1}, \dots, q_{n})^{T} \in \mathbb{R}^{n \times p}$, $Y = (y_{1}, \dots, y_{n})^{T} \in \mathbb{R}^{n}$, and $e = (e_{1}, \dots, e_{n})^{T} \in \mathbb{R}^{n}$. Then we obtain the general WLS estimator

\hat{F} (t) = {(A^{T} M A)}^{- 1} A^{T} M Y .

The simplest estimator is the ordinary least squares (OLS) estimator, where we set M = I_n, also derived in Fine et al. (2004) using a different formulation, while the most efficient WLS estimator is obtained when M is the diagonal matrix whose ith diagonal entry equals $v_{i}^{-1}$. A standard iteratively re-weighted estimation procedure can be used to obtain this optimal WLS (OWLS) estimator. The presence of the matrix M also allows the flexibility to derive other WLS estimators to achieve desired properties such as robustness.
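The WLS formula and the iteratively re-weighted procedure can be illustrated with a short sketch. The function names `wls_cdf` and `owls_cdf`, the fixed number of iterations, and the clipping constant `eps` (which keeps the estimated variances away from zero) are illustrative assumptions, not part of the original text.

```python
import numpy as np

def wls_cdf(q, s, t, weights=None):
    """WLS estimate (A^T M A)^{-1} A^T M Y with M = diag(weights)."""
    A = np.asarray(q, dtype=float)                  # n x p matrix with rows q_i^T
    y = (np.asarray(s) <= t).astype(float)          # responses I(s_i <= t)
    w = np.ones(len(y)) if weights is None else np.asarray(weights, dtype=float)
    AtM = A.T * w                                   # A^T M for a diagonal M
    return np.linalg.solve(AtM @ A, AtM @ y)

def owls_cdf(q, s, t, n_iter=20, eps=1e-3):
    """Iteratively re-weighted estimate using weights 1 / v_i."""
    q = np.asarray(q, dtype=float)
    F = wls_cdf(q, s, t)                            # start from OLS (M = I_n)
    for _ in range(n_iter):
        mu = np.clip(q @ F, eps, 1.0 - eps)         # current q_i^T F(t), clipped (assumption)
        F = wls_cdf(q, s, t, weights=1.0 / (mu * (1.0 - mu)))
    return F
```

When every q_i is a unit vector (group membership known), both functions simply return the within-group empirical CDFs at t.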

2.2. The complete class of consistent estimators

Although simple to derive and easy to implement, it is unclear whether the class of WLS is complete and whether OWLS is the optimal estimator among all consistent estimators of F(t). To answer these questions and to provide easy variance estimation for any consistent estimator, we perform a formal semiparametric analysis to characterize the complete class of consistent estimators. We derive in Appendix A.1 that the family of all influence functions is

S_{IF} = \{φ(q, s) : φ(q, s) = b(q, s) - F(t) - C 1_{p}, \text{ where } \int b(q, s) q^{T} p_{Q}(q) dμ(q) = I(s \leq t) I_{p} + C\},

(2)

where I_p is a p-dimensional identity matrix, C is an arbitrary p × p constant matrix, and 1_p is a p-dimensional vector with all elements being one.

For any qualified b-function as described in S_IF, an estimator for F (t) is

\hat{F} (t) = n^{- 1} \sum_{i = 1}^{n} b (q_{i}, s_{i}) - C_{b} 1_{p},

(3)

where we use C_b = ∫ b(q, s)q^Tp_Q(q)dμ(q) − I(s ≤ t)I_p to denote the constant matrix corresponding to this b-function. For example, a convenient choice of b(q, s) is

b (q, s) = I (s \leq t) {\int h_{1} (q, s) q^{T} p_{Q} (q) d μ (q)}^{- 1} h_{1} (q, s) + B {\int h_{2} (q, s) q^{T} p_{Q} (q) d μ (q)}^{- 1} h_{2} (q, s) + h_{3} (q),

(4)

where h₁(q, s), h₂(q, s), and h₃(q) can be arbitrary functions in R^p such that ∫ h₁(q, s)q^T p_Q(q)dμ(q) and ∫ h₂(q, s)q^T p_Q(q)dμ(q) are invertible, and B is an arbitrary constant matrix. This characterization provides a simple construction of a very rich class of estimators.

Since S_IF contains all the influence functions, any regular asymptotically linear (RAL, Newey, 1990) estimator can be written in the form of (3). For example, we show in Appendix A.2 that the influence function of any WLS estimator is

φ_{WLS} = \{E(W Q Q^{T})\}^{-1} w q \{I(s \leq t) - q^{T} F(t)\}.

Here, w is a weight variable. For the ith individual, w = w_i is the ith diagonal entry of M. We use W to denote the weight variable when it is considered as a random variable. It is easy to see that this corresponds to choosing h₁ = wq, h₂ = 0, and h₃ = −{E(WQQ^T)}⁻¹wqq^T F(t) + F(t), hence any WLS is indeed a member of S_IF. In addition, comparing the form of φ_WLS and S_IF indicates that the WLS estimators are only a subset of consistent estimators that can be constructed. To further study whether the optimal WLS estimator is the most efficient among all the consistent estimators for F(t), we need to derive the efficient influence function.

2.3. The semiparametric efficient estimator

Projecting an arbitrary influence function φ onto the tangent space Λ yields an efficient influence function (Newey, 1990). In Appendix A.3, we derive the form of Λ and its orthogonal complement, which enables us to derive the following theorem.

Theorem 1

The efficient influence function is

φ_{eff} = \frac{\{I(s \leq t) I_{p} - K\} A^{-1}(s) q}{q^{T} f(s)},

where

A(s) = \int \frac{q q^{T} p_{Q}(q)}{q^{T} f(s)} dμ(q),

and

K = \int_{T_{1}}^{T_{2}} I(s \leq t) A^{-1}(s) ds \left\{\int_{T_{1}}^{T_{2}} A^{-1}(s) ds\right\}^{-1}.

The proof of Theorem 1 is in Appendix A.4.

It is straightforward to see that the construction of the efficient estimator requires correct specification of the nuisance parameter f(s), which is not always easy to obtain. If we unknowingly mis-specify f(s) as f*(s) and follow the same construction in Theorem 1 to obtain $φ_{eff}^{*}$, then the result is no longer a valid influence function. To see this, note that $\check{φ} = \frac{\{I(s \leq t) - K^{*}\} A^{*-1}(s) q}{q^{T} f^{*}(s)}$, where $A^{*}(s) = \int \frac{q q^{T} p_{Q}(q)}{q^{T} f^{*}(s)} dμ(q)$, and $K^{*} = \int I(s \leq t) A^{*-1}(s) ds \{\int A^{*-1}(s) ds\}^{-1}$. We can then easily verify that E(φ̌) = F(t) − K*1_p, which is not necessarily zero. We thus robustify the influence function by constructing

φ = \frac{\{I(s \leq t) - K^{*}\} A^{*-1}(s) q}{q^{T} f^{*}(s)} - F(t) + K^{*} 1_{p}.

(5)

Regardless of the form of f*, (5) always yields a valid influence function. In addition, φ = φ_eff when f*(s) = f₀(s) and φ can be used to estimate F(t) via

\hat{F}(t) = n^{-1} \sum_{i = 1}^{n} \frac{\{I(s_{i} \leq t) - K^{*}\} A^{*-1}(s_{i}) q_{i}}{q_{i}^{T} f^{*}(s_{i})} + K^{*} 1_{p}.

(6)
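The expectation E(φ̌) = F(t) − K*1_p quoted above can be checked in a few lines. The following is a sketch of that calculation, assuming that the expectation is taken under the joint density (1) with f = f₀, that I(s ≤ t) inside the braces stands for I(s ≤ t)I_p, and that each component of f₀ integrates to one over [T₁, T₂]:

```latex
\begin{aligned}
E(\check{\varphi})
&= \int_{T_1}^{T_2}\!\int \frac{\{I(s \le t) I_p - K^{*}\} A^{*-1}(s)\, q}{q^{T} f^{*}(s)}\;
   p_Q(q)\, q^{T} f_0(s)\, d\mu(q)\, ds \\
&= \int_{T_1}^{T_2} \{I(s \le t) I_p - K^{*}\} A^{*-1}(s)
   \left\{ \int \frac{q q^{T} p_Q(q)}{q^{T} f^{*}(s)}\, d\mu(q) \right\} f_0(s)\, ds \\
&= \int_{T_1}^{T_2} \{I(s \le t) I_p - K^{*}\} f_0(s)\, ds
 = F(t) - K^{*} 1_p .
\end{aligned}
```

The inner integral in the second line is A*(s), which cancels A*⁻¹(s) whatever f* is. This is why subtracting F(t) − K*1_p, as in (5), restores mean zero regardless of the working model.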

Remark 1

In (6), we can replace K* by an arbitrary constant matrix. The resulting estimator remains consistent, and the corresponding φ is still a valid influence function. However, since different K* corresponds to different influence function, the estimators have different variances.

In practice, since f(s) is usually either proposed or estimated, and so may differ from f₀(s), it is always a safer choice to use (6) to obtain F̂(t). We will show in Section 3 that as long as f(s) is consistently estimated, the estimator (6) is guaranteed to provide an efficient estimator for F(t).
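A minimal sketch of the robust estimator (6) with a user-supplied working density f* is given below. The function name `robust_cdf`, the use of the empirical frequencies for p_Q, and the trapezoidal grid used to compute K* are assumptions made for illustration. `f_star` is a hypothetical callable that maps an array of s values to an array of shape (len(s), p).

```python
import numpy as np

def robust_cdf(q, s, t, f_star, T1, T2, n_grid=2001):
    """Sketch of estimator (6) under a working density f_star."""
    q = np.asarray(q, dtype=float)
    s = np.asarray(s, dtype=float)
    n, p = q.shape
    u, r = np.unique(q, axis=0, return_counts=True)   # support points u_k and group sizes
    p_Q = r / n                                       # empirical p_Q (assumption)

    def A_star_inverse(x):
        g = u @ f_star(x).T                           # g[k, j] = u_k^T f*(x_j)
        A = np.einsum("k,kj,ka,kb->jab", p_Q, 1.0 / g, u, u)
        return np.linalg.inv(A)                       # one p x p inverse per x_j

    # K* = int I(s<=t) A*^{-1}(s) ds { int A*^{-1}(s) ds }^{-1}, trapezoidal rule.
    grid = np.linspace(T1, T2, n_grid)
    w = np.full(n_grid, grid[1] - grid[0])
    w[[0, -1]] *= 0.5
    Ainv_grid = A_star_inverse(grid)
    K1 = np.einsum("j,jab->ab", w * (grid <= t), Ainv_grid)
    K2 = np.einsum("j,jab->ab", w, Ainv_grid)
    K = K1 @ np.linalg.inv(K2)

    # Average {I(s_i<=t) I_p - K*} A*^{-1}(s_i) q_i / {q_i^T f*(s_i)}, then add K* 1_p.
    denom = np.einsum("ia,ia->i", q, f_star(s))
    v = np.einsum("iab,ib->ia", A_star_inverse(s), q) / denom[:, None]
    M = (s <= t)[:, None, None] * np.eye(p) - K
    return (np.einsum("iab,ib->ia", M, v) + K @ np.ones(p)).mean(axis=0)
```

With a uniform working density and rows of q summing to one, this sketch reduces algebraically to the OLS estimator, which gives a convenient sanity check.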

2.4. Analytic comparison between OWLS and the efficient estimator

We are now ready to assess whether the OWLS is efficient. Comparing φ_eff with φ_OWLS obtained in Appendix A.2, we find that although the OWLS is optimal among the WLS family, it does not reach the semiparametric efficiency bound. We prove this claim by contradiction. Suppose that the OWLS is efficient, then we would have φ_eff = φ_OWLS + o_p(1), which would imply that for all (q, s) pairs,

\frac{\{I(s \leq t) - K\} A^{-1}(s) q}{q^{T} f(s)} = \left[E \frac{Q Q^{T}}{Q^{T} F(t) \{1 - Q^{T} F(t)\}}\right]^{-1} \frac{q \{I(s \leq t) - q^{T} F(t)\}}{q^{T} F(t) \{1 - q^{T} F(t)\}}.

Denoting B = E(QQ^T/[Q^TF(t){1 − Q^TF(t)}]) and matching the terms with and without I(s ≤ t), we then have

\frac{A^{-1}(s) q}{q^{T} f(s)} = \frac{B^{-1} q}{q^{T} F(t) \{1 - q^{T} F(t)\}} \quad \text{and} \quad \frac{K A^{-1}(s) q}{q^{T} f(s)} = \frac{B^{-1} q}{1 - q^{T} F(t)},

which leads to q^TF(t)A⁻¹(s)q = KA⁻¹(s)q. The left hand-side is a quadratic function of q, while the right hand-side is linear, so the above equality will never hold since q cannot be a constant vector of zero.

3. Efficient estimator and its asymptotic properties

As we have pointed out, the efficient influence function derived in Theorem 1 involves unknown nuisance parameters f(s) and therefore cannot be directly used to construct an efficient estimator for F(t). Using (6) will provide a robust and locally efficient estimator, in the sense that if f*(s) = f₀(s), the estimator is indeed efficient, otherwise, the estimator is still guaranteed to be consistent. We now propose a method to construct an estimator that is always efficient. This method avoids estimating the p-dimensional PDF f(s) directly, and is simple to implement.

3.1. Algorithm for implementing the efficient estimator

We propose to use the following procedure to construct the efficient estimator.

Step 1. Randomly split the data into two sets. The second set has size $n_{2} = n^{5/6}$, and the first set has size n₁ = n − n₂. Assume that the first set contains $(q_{1}, s_{1}), \dots, (q_{n_{1}}, s_{n_{1}})$ and the second set $(q_{n_{1} + 1}, s_{n_{1} + 1}), \dots, (q_{n}, s_{n})$.

Step 2. Obtain the empirical estimator of q^Tf(s), $\hat{q^{T} f (s)}$, from the second set, which has size n₂. Recall that the random vector Q can take m different vector values u₁, …, u_m, so for each k = 1, …, m, we can calculate a kernel estimate for $u_{k}^{T} f (s)$ as
$\hat{u_{k}^{T} f (s)} = \frac{\sum_{i = n_{1} + 1}^{n} I (q_{i} = u_{k}) K_{h} (s_{i} - s)}{\sum_{i = n_{1} + 1}^{n} I (q_{i} = u_{k})} .$

Here K_h is any kernel function with bandwidth h satisfying (n₂h)⁻¹ = o(1) and n₂h⁵ = O(1) as n₂ → ∞, and K_h(·) = h⁻¹K(·/h).

Step 3. Calculate
$A (s; \hat{q^{T} f}) = \int \frac{q q^{T} p_{Q} (q)}{\hat{q^{T} f (s)}} d μ (q) = E_{Q} \left\{\frac{Q Q^{T}}{\hat{Q^{T} f (s)}}\right\} = \sum_{k = 1}^{m} \frac{u_{k} u_{k}^{T} p_{Q} (u_{k})}{\hat{u_{k}^{T} f (s)}},$

where E_Q stands for expectation with respect to Q. We construct
$K_{1} (\hat{q^{T} f}) = \int_{T_{1}}^{T_{2}} I (s \leq t) A^{- 1} (s; \hat{q^{T} f}) d s, \quad K_{2} (\hat{q^{T} f}) = \int_{T_{1}}^{T_{2}} A^{- 1} (s; \hat{q^{T} f}) d s$

using numerical integration, and form $K (\hat{q^{T} f}) = K_{1} (\hat{q^{T} f}) K_{2}^{- 1} (\hat{q^{T} f})$.

Step 4. Form
$ψ (Q, S; \hat{q^{T} f}) = \frac{\{I (S \leq t) - K (\hat{q^{T} f})\} A^{- 1} (S; \hat{q^{T} f}) Q}{\hat{Q^{T} f (S)}} + K (\hat{q^{T} f}) 1_{p},$

and let the estimator be
$\hat{F} (t) = n_{1}^{- 1} \sum_{i = 1}^{n_{1}} ψ (q_{i}, s_{i}; \hat{q^{T} f}) .$ (7)

The estimation procedure described above is straightforward to implement. Comparing to many other semiparametric problems where the efficient estimator often involves solving integral equations (Rabinowitz, 2000) and iterative procedures (Tsiatis and Ma, 2004), the estimator here is very simple. In addition, unlike most semiparametric problems where the nonparametric functions have to be estimated at a certain rate, sometimes using an under-smoothed bandwidth (Liang and Wang, 2005; Li and Liang, 2008) to reach optimality, we do not have such estimation constraints. In fact, we will show that any consistent estimation of f(s) will be as good as the true f(s) asymptotically. Since consistency can be obtained with a wide range of bandwidth, typically one does not have to go through the computationally intensive cross validation procedure to choose an optimal bandwidth. Finally, we point out that the splitting of the data is solely to facilitate the later theoretical proof and is not mandatory. In reality, one can certainly use the whole data set to estimate f(s) and to form F̂(t) in (7).
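The procedure above can be sketched in code as follows. This is an illustration under stated assumptions rather than the authors' implementation: it uses the whole data set for both the density estimate and the final average (as the preceding paragraph permits), a Gaussian kernel, the empirical frequencies for p_Q, a trapezoidal rule for K₁ and K₂, and a small `floor` on the density estimate to avoid division by zero. The function name `efficient_cdf` and all argument names are hypothetical.

```python
import numpy as np

def efficient_cdf(q, s, t, T1, T2, h, n_grid=401, floor=1e-12):
    """Sketch of estimator (7), using the full sample for every step."""
    q = np.asarray(q, dtype=float)
    s = np.asarray(s, dtype=float)
    n, p = q.shape
    # Support points u_k of Q, the group label of each observation, group sizes.
    u, label, r = np.unique(q, axis=0, return_inverse=True, return_counts=True)
    label = label.ravel()
    p_Q = r / n                                       # empirical p_Q(u_k)

    def mix_density(x):
        # Gaussian-kernel estimate of u_k^T f(x) for each k; shape (m, len(x)).
        g = np.empty((len(u), len(x)))
        for k in range(len(u)):
            z = (s[label == k][:, None] - x[None, :]) / h
            g[k] = np.exp(-0.5 * z ** 2).sum(axis=0) / (r[k] * h * np.sqrt(2 * np.pi))
        return np.maximum(g, floor)                   # ad hoc safeguard (assumption)

    def A_inverse(g):
        # A(x) = sum_k u_k u_k^T p_Q(u_k) / g_k(x), inverted for each x.
        A = np.einsum("k,kj,ka,kb->jab", p_Q, 1.0 / g, u, u)
        return np.linalg.inv(A)

    # K = K1 K2^{-1}, with both integrals computed by the trapezoidal rule.
    grid = np.linspace(T1, T2, n_grid)
    w = np.full(n_grid, grid[1] - grid[0])
    w[[0, -1]] *= 0.5
    Ainv_grid = A_inverse(mix_density(grid))
    K1 = np.einsum("j,jab->ab", w * (grid <= t), Ainv_grid)
    K2 = np.einsum("j,jab->ab", w, Ainv_grid)
    K = K1 @ np.linalg.inv(K2)

    # psi_i = {I(s_i<=t) I_p - K} A^{-1}(s_i) q_i / ghat(q_i, s_i) + K 1_p.
    g_obs = mix_density(s)
    v = np.einsum("iab,ib->ia", A_inverse(g_obs), q) / g_obs[label, np.arange(n)][:, None]
    M = (s <= t)[:, None, None] * np.eye(p) - K
    psi = np.einsum("iab,ib->ia", M, v) + K @ np.ones(p)

    F_hat = psi.mean(axis=0)
    V_hat = (psi - F_hat).T @ (psi - F_hat) / n       # sample covariance of the psi_i
    return F_hat, V_hat
```

The second return value is the sample covariance of the ψ values, which can be used to form standard errors as sqrt(diag(V_hat) / n).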

3.2. Asymptotics and inferences

We present the asymptotic property of the proposed efficient estimator in the following theorem:

Theorem 2

The estimator constructed in (7) achieves the semiparametric efficiency bound. Specifically, as n → ∞, $\sqrt{n} \{\hat{F} (t) - F (t)\} \to N (0, V)$ in distribution, where V = var(φ_eff) and can be consistently estimated as

n^{-1} \sum_{i = 1}^{n} \{ψ(q_{i}, s_{i}; \hat{q^{T} f}) - \hat{F}(t)\} \{ψ(q_{i}, s_{i}; \hat{q^{T} f}) - \hat{F}(t)\}^{T}.

Intuitively, the reason that (7) can reach the semiparametric efficiency is because it solves the estimating equation formed by summing over the robustified influence functions (5) while replacing the unspecified quantities K*, q^T f*(s) and A* by their corresponding optimal choices which are, respectively, the non-parametric estimates of K, q^Tf(s) and A(s, q^Tf). The rigorous proof of Theorem 2 is in Appendix A.5.

Since we are able to construct the optimal estimators and estimate their variances, it is straightforward to make inferences based on these results. For example, we can construct a locally most powerful test for the hypothesis H₀ : F₁(t) − F₂(t) = δ₀ versus H₁ : F₁(t) − F₂(t) ≠ δ₀. Because of the explicit form of F̂(t), the Wald test is an obvious choice. Let D̂ = F̂₁(t) − F̂₂(t) − δ₀, then the test statistic is

T = n {\hat{D}}^{2} / v,

(8)

where v = V₁₁ − V₁₂ − V₂₁ + V₂₂, and V_ij is the (i, j)th element of the covariance matrix V stated in Theorem 2. It is straightforward that when n → ∞, T has a chi-square distribution with one degree of freedom under H₀. Under the local alternative, say $F_{1} (t) - F_{2} (t) = δ / \sqrt{n}$ , T has a noncentral chi-square distribution with one degree of freedom and noncentrality parameter (δ − δ₀)²/v.
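The Wald statistic (8) is simple to compute once F̂(t) and an estimate of V are available. The sketch below assumes p = 2 and uses the identity P(χ²₁ > T) = erfc(√(T/2)) for the p-value; the function name `wald_test_difference` and its arguments are hypothetical.

```python
import math
import numpy as np

def wald_test_difference(F_hat, V_hat, n, delta0=0.0):
    """Wald test of H0: F_1(t) - F_2(t) = delta0, following (8)."""
    F_hat = np.asarray(F_hat, dtype=float)
    V_hat = np.asarray(V_hat, dtype=float)
    D = F_hat[0] - F_hat[1] - delta0
    v = V_hat[0, 0] - V_hat[0, 1] - V_hat[1, 0] + V_hat[1, 1]
    T = n * D ** 2 / v
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(T / 2.0))
    return T, p_value
```

For instance, D̂ = 0.2, v = 1 and n = 100 give T = 4 and a p-value of about 0.0455.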

In some applications, one may be interested in testing whether F₁(t) − F₂(t) = δ_t at several different t values simultaneously, say at t₁, …, t_J. Let $a^{T} = (1, - 1) \{F (t_{1}), \dots, F (t_{J})\} - Δ_{0}^{T}$, where $Δ_{0} = (δ_{t_{1}}, \dots, δ_{t_{J}})^{T}$. This can be written as a problem of testing H₀ : a = 0 versus H₁ : a ≠ 0. Under H₀, the estimate of a obtained by replacing F with F̂ is asymptotically multivariate normal with mean zero and variance-covariance matrix n⁻¹Σ, where Σ_jk = (−1, 1)cov{F̂(t_j), F̂(t_k)}(−1, 1)^T for j, k = 1, …, J. Here, cov{F̂(t_j), F̂(t_k)} can be estimated using

n^{-1} \sum_{i = 1}^{n} \{ψ_{eff}(q_{i}, s_{i}; t_{j}, \hat{q^{T} f}) - \hat{F}(t_{j})\} \{ψ_{eff}(q_{i}, s_{i}; t_{k}, \hat{q^{T} f}) - \hat{F}(t_{k})\}^{T},

where $ψ_{eff} (q_{i}, s_{i}; t_{j}, \hat{q^{T} f})$ and F̂(t_j) denote ψ_eff and F̂ evaluated at the ith observation and calculated at time t_j. Thus, we can construct the test statistic

T = n a^{T} Σ^{-1} a.

(9)

When n → ∞, under H₀, T has a chi-square distribution with J degrees of freedom. Under a local alternative, say $a = Δ / \sqrt{n}$ for some length J vector Δ, T has a noncentral chi-square distribution with noncentrality parameter Δ^TΣ⁻¹Δ.

4. Understanding the NPMLEs

For many nonparametric models, the NPMLE is a widely used estimation procedure. In the literature, two types of NPMLE have been proposed (Wacholder et al., 1998; Chatterjee and Wacholder, 2001). The first type of NPMLE treats each $u_{j}^{T} f (s)$, j = 1, …, m, as an unknown PDF, while the second type treats f(s) as a p-dimensional unknown PDF. To explain these two NPMLEs in detail, group the observations in such a way that the first r₁ observations form a first subset where each observation has the same q value, equal to u₁, the next r₂ observations form a second subset with the same q value u₂, and so on. Assume that the last r_m observations form the mth subset and have q values equal to u_m. We use F̃(t) to denote the type I NPMLE of F(t), and F̌(t) the type II NPMLE.

The type I NPMLE maximizes

\sum_{i = 1}^{n} \log \{q_{i}^{T} f(s_{i})\} = \sum_{j = 1}^{m} \sum_{i = 1}^{n} \log \{q_{i}^{T} f(s_{i})\} I(q_{i} = u_{j})

with respect to $q_{i}^{T} f (s_{i})$ for the ith subject in the jth subset, subject to $q_{i}^{T} f (s_{i}) \geq 0$ and $\sum_{i = 1}^{n} q_{i}^{T} f (s_{i}) I (q_{i} = u_{j}) = 1$ for j = 1, …, m. This is essentially equivalent to performing an empirical density estimation in each of the m groups, where in each group the q_i values are identical. Obviously, the resulting estimate of q^Tf(s) in the jth group is an empirical PDF with weights $r_{j}^{- 1}$ at the observed values. The procedure then uses $u_{j}^{T} F (t) = r_{j}^{- 1} \sum_{i = 1}^{n} I (s_{i} \leq t, q_{i} = u_{j})$ for j = 1, …, m to recover F̃(t) = (U^TU)⁻¹U^TG(t), where we denote U = (u₁, …, u_m)^T, and G(t) is a length m vector with the jth component equal to $r_{j}^{- 1} \sum_{i = 1}^{n} I (s_{i} \leq t, q_{i} = u_{j})$. It is not difficult to see that

U^{T} U = \sum_{j = 1}^{m} u_{j} u_{j}^{T} = \sum_{i = 1}^{n} w_{i} q_{i} q_{i}^{T},

and

U^{T} G(t) = \sum_{j = 1}^{m} u_{j} r_{j}^{-1} \sum_{i = 1}^{n} I(s_{i} \leq t, q_{i} = u_{j}) = \sum_{i = 1}^{n} w_{i} q_{i} I(s_{i} \leq t),

where $w_{i} = r_{j}^{- 1}$ if q_i = u_j. Thus, the type I NPMLE belongs to the family of WLS estimators (therefore a member of class (2)), where the weights are taken to be $r_{j}^{- 1}$ , the inverse of the number of observations in the jth group with the same q_i value. However, the weights of this WLS estimator are obviously non-optimal. In addition, intuitively such choice of weights is not reasonable, because it down-weights the contributions from a larger subset. In fact, one would rather downweight the contribution from the observations with less estimation precision, while the quality of the estimation of F(t) from each observation has no definitive link with its subset size.
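The equivalence between the type I NPMLE and the WLS estimator with weights 1/r_j can be checked numerically. The sketch below computes the estimator both ways; the function name `npmle_type1` is hypothetical, and the support points are recovered from the distinct rows of q.

```python
import numpy as np

def npmle_type1(q, s, t):
    """Type I NPMLE computed from group CDFs and as WLS with weights 1/r_j."""
    q = np.asarray(q, dtype=float)
    y = (np.asarray(s) <= t).astype(float)
    U, label, r = np.unique(q, axis=0, return_inverse=True, return_counts=True)
    label = label.ravel()
    # G_j(t): empirical CDF at t within the group where q_i = u_j.
    G = np.bincount(label, weights=y, minlength=len(U)) / r
    F_groups = np.linalg.solve(U.T @ U, U.T @ G)      # (U^T U)^{-1} U^T G(t)
    # The same estimator written as WLS with w_i = 1 / r_j for q_i = u_j.
    qw = q.T * (1.0 / r[label])
    F_wls = np.linalg.solve(qw @ q, qw @ y)
    return F_groups, F_wls
```

The two returned vectors agree up to rounding error, whatever the group sizes.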

The type II NPMLE maximizes the same log likelihood, but with respect to f(s_i), subject to $\sum_{i = 1}^{n} f (s_{i}) = 1_{p}$ and f (s_i) ≥ 0 component-wise. It is easy to see that the maximum is obtained when the r_j values of f (s_i) corresponding to the same u_j are the same. We denote this common f (s_i) value by h_j, for j = 1, …, m. We thus maximize

\sum_{j = 1}^{m} r_{j} \log (u_{j}^{T} h_{j})

with respect to h_j’s subject to $\sum_{j = 1}^{m} r_{j} h_{j} = 1_{p}$ and h_j ≥ 0 component-wise. In general, no closed form solution exists for the h_j’s, and the EM algorithm is often used to solve this optimization problem and to obtain the h_j’s. The NPMLE then proceeds to form

\check{F}(t) = \sum_{i = 1}^{n} I(s_{i} \leq t) \hat{f}(s_{i}) = \sum_{j = 1}^{m} \sum_{i = 1}^{n} I(s_{i} \leq t, q_{i} = u_{j}) h_{j}.

The type II NPMLE is different from the type I NPMLE in that here, the term “nonparametric” refers to f(s), not to $u_{j}^{T} f (s)$. In the literature, the type II estimator is considered an improvement over the type I NPMLE. However, our careful investigation reveals that the type II NPMLE is not even consistent, which is a rather counterintuitive result. In Appendix A.6, we give a detailed calculation in a concrete case to explicitly illustrate the inconsistency, and in Section 5 we demonstrate the bias of the type II NPMLE in a moderately large sample through simulations.

We now give a more general demonstration to show why the type II NPMLE is inconsistent. Suppose the solution to the constrained maximization problem is h₁, …, h_m, then the type II NPMLE is

\check{F}(t) = \sum_{j = 1}^{m} \left\{\sum_{i = 1}^{n} I(q_{i} = u_{j}) I(s_{i} \leq t)\right\} h_{j} = \sum_{j = 1}^{m} r_{j} G_{j}(t) h_{j} = H G(t) = H U \tilde{F}(t),

where H = (r₁h₁, … r_mh_m), and U, G(t), F̃(t) are the same as defined before. We already know that F̃(t) is a consistent estimator of F(t). If F̌(t) is also consistent, then we would have HU → I_p when n → ∞. This is a much stronger condition than the original constraints of the maximization problem and is in general not satisfied. In fact, this condition means that the type II NPMLE is asymptotically equivalent to the type I NPMLE, which contradicts the original goal of developing a type II estimator. In other words, as a distinct estimator from the type I NPMLE, the type II NPMLE is inconsistent.

5. Simulations

To study the finite sample performance of the proposed estimators, we conducted several simulation studies. In all the simulations, the dimension of F(t) is p = 2, and the number of simulation iterations is 1000.

5.1. Three simulated examples

In the first simulation experiment, we investigate the performance of the various estimators studied in Sections 2, 3 and 4. Here, q_i’s can take six different values, i.e. m = 6, while the group sizes r_j, j = 1, …, m, are randomly generated. The six different q_i values are respectively (0.3, 0.7)^T, (0, 1)^T, (0.7, 0.3)^T, (0.8, 0.2)^T, (0.5, 0.5)^T, (0.6, 0.4)^T. The two components in the true F(t) both have truncated exponential form, since exponential function is a commonly used parametric model in practice. Specifically, F₁(t) = {1 − exp(−t/3)}/{1 − exp(−10/3)} and F₂(t) = 1 − {1 − exp(t/3 − 10/3)}/{1 − exp(−10/3)} on the interval (0, 10).
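A data set of this form can be generated by inversion, as in the sketch below. The text states only that the group sizes are randomly generated, so drawing each q_i uniformly from the six support points is an assumption made here for illustration; the names `U6`, `true_F` and `simulate` are likewise hypothetical. The sketch uses the fact that if X has CDF F₁ on (0, 10), then 10 − X has CDF F₂.

```python
import numpy as np

# The six support points of Q in the first simulation.
U6 = np.array([[0.3, 0.7], [0.0, 1.0], [0.7, 0.3],
               [0.8, 0.2], [0.5, 0.5], [0.6, 0.4]])
C = 1.0 - np.exp(-10.0 / 3.0)          # normalising constant of the truncation

def true_F(t):
    """True (F_1(t), F_2(t)) for the truncated exponential design."""
    F1 = (1.0 - np.exp(-t / 3.0)) / C
    F2 = 1.0 - (1.0 - np.exp(t / 3.0 - 10.0 / 3.0)) / C
    return np.array([F1, F2])

def simulate(n, rng):
    """Draw (q_i, s_i), i = 1..n, from the mixture model (1)."""
    q = U6[rng.integers(0, len(U6), size=n)]       # equal group probabilities (assumption)
    x = -3.0 * np.log(1.0 - C * rng.random(n))     # X ~ F_1 by inverting the CDF
    in_first = rng.random(n) < q[:, 0]             # latent membership, never observed
    s = np.where(in_first, x, 10.0 - x)            # 10 - X has CDF F_2
    return q, s
```

A call such as `simulate(300, np.random.default_rng(0))` returns one data set of the size used in Table 1.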

We studied eight different estimators. The efficient estimator with true f(s) inserted (hence unrealistic) is denoted ORACLE, while with the estimated f(s) inserted is denoted EFF. Thus EFF is the implemented efficient estimator. Two different kinds of robust estimators are considered, where ROB1 had the f(t) mis-specified, and ROB2 not only used a mis-specified f(t), but also had K = 0 plugged in. Specifically, in ROB1, we used the true f₁(t) as the proposed model for f₂(t), and used the true f₂(t) as the proposed model for f₁(t). In ROB2, we proposed uniform model for both f₁(t) and f₂(t). These two estimators are expected to be consistent hence reflecting robustness to mis-specification of the PDFs. We also investigated the proposed OWLS estimator. For comparison, we implemented the OLS, NPMLE1 and NPMLE2 estimators that are used in the literature. We implement the estimation procedures at t = 6.8. The resulting estimation mean, sample and estimated standard errors and 95% coverage of the confidence intervals are summarized in Table 1.

Table 1.

Bias, empirical standard error (emp se), average estimated standard error (est se), 95% coverage (95% cov) of Simulation 1, sample size n = 300, 1000 simulations

	F₁(t) = 0.9295				F₂(t) = 0.3199
Estimator	bias	emp se^†	est se^*	95% cov	bias	emp se^†	est se^*	95% cov
ORACLE	0.0003	0.529	0.538	94.9%	−0.0013	0.704	0.684	93.8%
EFF	0.0007	0.540	0.549	94.3%	−0.0017	0.726	0.704	92.5%
ROB1	0.0001	0.545	0.555	94.5%	−0.0013	0.732	0.710	92.6%
ROB2	0.0001	0.545	0.555	94.5%	−0.0013	0.732	0.710	92.6%
OWLS	−0.0004	0.537	0.553	94.6%	−0.0006	0.727	0.712	93.6%
OLS	0.0001	0.545	0.559	94.6%	−0.0013	0.732	0.716	93.0%
NPMLE1	0.0001	0.570	0.581	95.0%	−0.0010	0.753	0.738	93.4%
NPMLE2	−0.179	0.323	–	–	0.2148	0.425	–	–

^† Empirical standard error × 10.

^* Estimated standard error × 10.

It can be seen that all the consistent estimators perform well in finite samples, and the estimated variances are very close to the empirical variances. This indicates that the asymptotic results are relevant for a moderate sample size of n = 300. It is very clear that the type II NPMLE yields very large bias. We emphasize here that this bias is not a reflection of small sample size because the bias persists when we increase the sample size to 1000.

We can also see that the type I NPMLE and OLS do not make a very good choice of the weights, hence their estimation standard errors are both larger than that of the OWLS. This is especially prominent for the type I NPMLE, in that it performs even worse than the simple OLS estimator. The two robust estimators (ROB1 and ROB2) perform very similarly, and both have minimal bias, reflecting the desired robustness property with respect to the PDF estimation. Finally, although in theory the efficient estimator (EFF) should outperform the OWLS estimator, the performance of OWLS is as satisfactory as that of EFF. This appears to be often the case in our other simulations not shown here. Thus, using either the proposed OWLS or EFF in practice is expected to be adequate.

We also studied the type I error and power of the test (8) in this situation, and present the results in Table 2. The overall performance of the proposed tests is satisfactory. From the left panel of Table 2, we see that all estimators maintain correct size. From the right panel of the same table, we see that the OLS and NPMLE1 have lower power compared to other estimators due to their larger estimation variances.

Table 2.

Type I error and power of test in Simulation 1, sample size n = 300, 1000 simulations

	Type I error at nominal level				Power at nominal level
Estimator	0.01	0.05	0.1	0.2	0.01	0.05	0.1	0.2
ORACLE	0.016	0.062	0.105	0.194	0.198	0.424	0.546	0.700
EFF	0.017	0.061	0.118	0.198	0.177	0.400	0.523	0.676
ROB1	0.018	0.062	0.117	0.200	0.167	0.391	0.529	0.680
ROB2	0.018	0.062	0.117	0.200	0.167	0.391	0.529	0.680
OWLS	0.018	0.057	0.119	0.197	0.170	0.396	0.529	0.681

OLS	0.018	0.061	0.113	0.198	0.162	0.388	0.521	0.673
NPMLE1	0.023	0.062	0.100	0.204	0.148	0.354	0.496	0.655

The second simulation experiment is conducted to closely mimic the QTL mapping data analyzed in Section 6.1. We generated the data from a mixture of two distributions. The first one is a uniform distribution on (3, 10), while the second one has CDF c(1 − e^{−t/2.5}) on the interval (0, 10), where c is a normalizing constant. The mixture probability has four different values which are (0.02, 0.98)^T, (0.2, 0.8)^T, (0.1, 0.9)^T, (0.98, 0.02)^T, and the sample size is 100. Based on the performance of the various estimators studied in the first simulation, here we used only the two best estimators, the OWLS and the efficient estimator (EFF), to estimate the two CDFs. We also implemented the type II NPMLE for comparison. We plot the true CDFs, the mean of the estimated CDFs and the 95% pointwise confidence band for each method in Figure 1. As expected, both OWLS and EFF give satisfactory results, while NPMLE2 is clearly biased. Again, we emphasize that the bias of NPMLE2 is not caused by the moderate sample size. In fact, when we increased the sample size to 1000, the bias became even more prominent.

Fig 1 — Simulation 2. True CDF (solid) and the mean (dashed), 95% pointwise confidence band (upper band dotted, lower band dash-dotted) of the estimated CDFs. The OWLS (left), EFF (mid) and NPMLE2 (right) are plotted. The mean and true CDFs are undistinguishable in OWLS and EFF estimators. Sample size is 100, and results are based on 1000 simulations.

Similarly, the third simulation is conducted to closely mimic the LDL data analyzed in Section 6.2. The first CDF is c₁/{1 + e^{−(t−3)/0.5}} on the interval (0, 6), and the second CDF is c₂/{1 + e^{−(t−2.5)/0.2}} on the interval (0, 7). Note that these two CDFs cross. Here, the mixture probability distribution has three different values which are (0.15, 0.85)^T, (0.6, 0.4)^T, (0.8, 0.2)^T, and the sample size is 300. Estimates based on OWLS, EFF and NPMLE2 are computed, and the mean of the estimated CDFs and the 95% pointwise confidence band for each method are presented in Figure 2 together with the true CDFs. Similar to the second simulation, both OWLS and EFF perform well, while NPMLE2 shows large bias.

Fig 2 — Simulation 3. True CDF (solid) and the mean (dashed), 95% pointwise confidence band (upper band dotted, lower band dash-dotted) of the estimated CDFs. The OWLS (left), EFF (mid) and NPMLE2 (right) are plotted. The mean and true CDFs are undistinguishable in OWLS and EFF estimators. Sample size is 300, and results are based on 1000 simulations.

6. Real data examples

6.1. Estimation from QTL mapping data

In QTL studies, the trait observations are assumed to be drawn from a mixture of several QTL genotype groups and the mixture probabilities of a subject assuming a certain QTL genotype given flanking markers are calculated based on the study design, the marker genotypes and the recombination fraction between the location-known flanking markers and the putative QTL (Wu et al., 2007). The first example that we use to illustrate our methods is a genetic linkage study used to map QTLs for rice plant height and grain shape. The identified QTL can be used to produce taller rice plants to increase yield. In Huang et al. (1997), a doubled haploid (DH) population of rice plants was derived from two inbred lines (semi-dwarf IR64 and tall Azucena), creating 123 DH lines each genotyped with 135 RFLP markers and 40 isozyme and RAPD markers. Several traits such as grain shape and plant height were recorded. A DH population is equivalent to a backcross population where the two marker genotypes have an approximately 1:1 distribution ratio. The mixture probabilities q_i of a plant carrying a certain QTL genotype given the flanking markers are computed based on the marker genotypes and the recombination fraction between the marker and the QTL. The details of q_i computation can be found in Table 10.3 of Wu et al. (2007).

Using a Gaussian mixture model, Wu et al. (2007) analyzed the plant height measured at 10 weeks after the rice was transplanted to the field and mapped a QTL for this trait to 199cM on chromosome 1 between the markers RZ730 and RZ801. Here we estimate the cumulative distribution function of the rice plant height for each of the two QTL genotypes at the same locus (199cM on chromosome 1) using the model (1).

There were 84 plant height measurements available. Table 3 presents the estimated CDFs and their standard errors for each of the two QTL genotypes at several values of the plant height. We present the efficient estimator (EFF) and the optimal WLS (OWLS). We omitted OLS and the two NPMLEs due to their respective deficiencies. The proposed OWLS and EFF lead to comparable results. The test of H₀ : F₁(t) = F₂(t) based on the test statistic (8) was significant at the 5% level for both estimators at three typical values of t, indicating a difference in the distribution functions for the two QTL genotypes. In addition, we tested the difference between the two distributions at the three t values simultaneously by the test (9). The null distribution of the test statistic was a chi-square with three degrees of freedom, and the p-value was less than 0.01, which indicates a significant difference.

Table 3.

Data example 1. Estimated CDFs of plant height and their standard errors for QTL genotypes bb (F̂₁) and Bb (F̂₂)

t	Estimator	F̂₁(t)	SE(F̂₁)	F̂₂(t)	SE(F̂₂)	p value^*
80	EFF	0.132	0.048	0	0.006	0.011
80	OWLS	0.126	0.048	0	0.001	0.011

110	EFF	0.895	0.05	0.095	0.062	<0.001
110	OWLS	0.927	0.043	0.098	0.062	<0.001

140	EFF	0.992	0.024	0.699	0.083	0.001
140	OWLS	1.000	0.006	0.684	0.082	0.000

^* p value for testing H₀ : F₁(t) = F₂(t) based on (8)

Figure 3 presents the CDFs of rice plant heights for plants carrying each of the two QTL genotypes estimated by the efficient estimator (EFF). It can be seen that there is a large difference in the CDFs across the entire range of the plant height and carrying a risk allele increases the plant height. For example, it was estimated that 90.5% (CI: 78.3%, 100%) of the plants with Bb QTL genotype will have plant heights greater than 110, compared to 10.5% (CI: 0.7%, 20.3%) in the bb genotype group. This difference is highly significant (p < 0.001). These results are consistent with the analysis conducted in Wu et al. (2007).
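The percentages and intervals quoted in this paragraph can be recovered from the EFF rows of Table 3 at t = 110. The sketch below assumes a Wald interval, estimate ± 1.96 × SE, truncated to [0, 1]; this construction is an assumption that happens to match the reported numbers, not a procedure stated in the text, and the helper name `exceedance_with_ci` is illustrative.

```python
def exceedance_with_ci(cdf_hat, se, z=1.96):
    """Return (estimate, lower, upper) for 1 - F(t), truncated to [0, 1].

    Assumes a Wald-type interval; the standard error of 1 - F-hat equals
    the standard error of F-hat.
    """
    est = 1.0 - cdf_hat
    lower = max(0.0, est - z * se)
    upper = min(1.0, est + z * se)
    return est, lower, upper

# EFF rows of Table 3 at t = 110
bb = exceedance_with_ci(0.895, 0.050)   # genotype bb: F1-hat and its SE
Bb = exceedance_with_ci(0.095, 0.062)   # genotype Bb: F2-hat and its SE

for name, (est, lo, hi) in (("Bb", Bb), ("bb", bb)):
    print(f"{name}: {100 * est:.1f}% (CI: {100 * lo:.1f}%, {100 * hi:.1f}%)")
```

Under this assumption the upper limit for the Bb group exceeds 1 before truncation, which explains the reported upper bound of 100%.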

Fig 3 — Data example 1. Estimated cumulative distribution function (CDF) of plant height for QTL genotype Bb (solid) and bb (dashed)

6.2. Estimation from the LDL data

In the LDL example introduced in Section 1, the association between the APOE ε4 allele and the LDL concentrations in young children is our main research interest. There were 230 subjects included in the data analyses. We show the estimated cumulative distribution function of LDL concentration for carriers of the ε4 allele (carrying one or two copies of ε4) compared to non-carriers (carrying zero copies of ε4) at several values of the LDL levels in Table 4. As in data example 1, we present the EFF and the OWLS. Both estimators yielded similar results. The difference in CDFs between carriers and non-carriers was not significant at the 5% level at LDL = 100 or LDL = 260, but was significant at LDL = 180. Similar to the QTL analysis, we tested the difference between the two distributions at these three typical t values simultaneously by (9). The p-value was 0.29, indicating a non-significant overall difference between the two distributions at these values.

Table 4.

Data example 2. Estimated CDFs of LDL levels and their standard errors of APOE ε4 carriers (F̂₁) and non-carriers (F̂₂)

t	Estimator	F̂₁(t)	SE(F̂₁)	F̂₂(t)	SE(F̂₂)	p value^*
100	EFF	0.719	0.108	0.619	0.054	0.496
100	OWLS	0.718	0.110	0.619	0.054	0.510

180	EFF	1.000	0.014	0.921	0.024	0.037
180	OWLS	1.000	0.014	0.922	0.024	0.035

260	EFF	1.000	0.006	0.984	0.011	0.364
260	OWLS	1.000	0.006	0.984	0.011	0.354

^* p value for testing H₀ : F₁(t) = F₂(t) based on (8)

Figure 4 depicts the CDF of LDL for carriers and non-carriers estimated by the efficient estimator, EFF. It can be seen that there is virtually no difference of the two CDFs in the range from 45 to 130. The CDF for carriers is elevated in the interval (130, 200) compared to non-carriers and the two functions merge again for LDL greater than 200. Previous analyses in the literature focus on the mean LDL concentration. Our analysis shows that the effect of APOE ε4 on LDL manifests in the range of 130 to 200.

Fig 4 — Data example 2. Estimated CDF of LDL levels for carriers of APOE ε4 allele (solid) and non-carriers (dashed)

7. Discussion

We have developed nonparametric estimation procedures for mixed samples where the conditional distribution of the outcome given the group membership is completely unspecified and the mixing probabilities are known or can be calculated without using the outcome data. We propose an extremely simple optimal weighted least squares estimator and derive an easy-to-compute efficient estimator which reaches the semiparametric efficiency bound. We illustrate by simulations that the OWLS estimator has good efficiency in many practical situations. We investigate the performance of two types of NPMLE and show the surprising results that neither of them is efficient and one of them is not even consistent. This is in contrast to many other semiparametric problems where the NPMLE is an efficient estimator.

Although the estimators are constructed for CDFs, it is straightforward to adapt these procedures to estimate a quantile function F⁻¹(τ). This is because we can express all the estimators in terms of solving for F(t) from an estimating equation. When we denote t = F⁻¹(τ), replace F(t) with τ in these estimating equations, and solve for t from the known τ value instead of solving for F(t) from the known t value, we obtain estimators for the quantile functions. For example, the efficient quantile estimator at τ can be obtained through solving for t from

n^{-1} \sum_{i=1}^{n} \frac{\{I(s_{i} \leq t) - K(t, \hat{q^{T} f})\} A^{-1}(s_{i}; \hat{q^{T} f}) q_{i}}{\hat{q_{i}^{T} f (s_{i})}} + K(t, \hat{q^{T} f}) 1_{p} = τ,

where K itself is now a function of t, hence we use the notation $K(t, \hat{q^{T} f})$.
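The inversion idea can be illustrated with a small Python sketch. To stay self-contained it does not implement the efficient estimating equation; it substitutes the simpler OLS estimator of F(t) for p = 2 and returns the first grid point at which the estimated CDF reaches τ. The function names, the two mixing vectors, the grid and the simulated data are all hypothetical choices made for illustration.

```python
import math
import random

def ols_cdf(q, s, t):
    """OLS estimate of (F1(t), F2(t)) for p = 2 sub-populations."""
    a00 = a01 = a11 = b0 = b1 = 0.0
    for (q1, q2), si in zip(q, s):
        y = 1.0 if si <= t else 0.0
        a00 += q1 * q1
        a01 += q1 * q2
        a11 += q2 * q2
        b0 += q1 * y
        b1 += q2 * y
    det = a00 * a11 - a01 * a01
    return ((b0 * a11 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det)

def quantile(q, s, tau, j, grid):
    """First grid point t at which the j-th estimated CDF reaches tau."""
    for t in grid:
        if ols_cdf(q, s, t)[j] >= tau:
            return t
    return None  # tau is not reached on the grid

# Hypothetical toy data: two mixing vectors, truncated exponential components.
rng = random.Random(2)
c = 1.0 - math.exp(-10.0 / 3.0)
q, s = [], []
for _ in range(10000):
    qi = rng.choice([(0.2, 0.8), (0.9, 0.1)])
    u = rng.random()
    if rng.random() < qi[0]:
        si = -3.0 * math.log(1.0 - c * u)                # draw from F1
    else:
        si = 10.0 + 3.0 * math.log(1.0 - c * (1.0 - u))  # draw from F2
    q.append(qi)
    s.append(si)

grid = [0.1 * k for k in range(1, 100)]
median_1 = quantile(q, s, 0.5, 0, grid)
true_median_1 = -3.0 * math.log(1.0 - 0.5 * c)
print(median_1, round(true_median_1, 3))
```

Because the estimated CDF need not be monotone in small samples, taking the first crossing is only one possible convention; a root finder applied to a monotonized estimate would be a reasonable alternative.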

The CDFs estimated by the consistent estimators may not be monotone increasing functions of t when the sample size is relatively small. In fact, the type II NPMLE was originally proposed to address this issue, but it unfortunately leads to inconsistency. One way to guarantee monotonicity is through reparametrization. For example, we could write $f(t) = e^{g(t)} \exp\{-\int_{0}^{t} e^{g(u)} du\}$, and treat g(u) as a nuisance parameter, which guarantees that $F(t) = 1 - \exp\{-\int_{0}^{t} e^{g(u)} du\}$ is monotone and lies between 0 and 1. However, the additional complexity may not be worth the gain. Instead, we suggest using a post-estimation adjustment, such as the pool adjacent violators algorithm (Barlow et al., 1972), to modify the results to achieve monotonicity. For a detailed description, see Wang et al. (2007).
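A generic version of such a post-estimation adjustment is sketched below in Python. It is a textbook pool-adjacent-violators routine applied to CDF estimates evaluated on an increasing grid of t values, followed by truncation to [0, 1]; it is not claimed to reproduce the exact procedure of Wang et al. (2007), and the names `pava` and `monotone_cdf` are illustrative.

```python
def pava(y, w=None):
    """Weighted least-squares nondecreasing fit to y (pool adjacent violators)."""
    w = [1.0] * len(y) if w is None else list(w)
    blocks = []  # each block: [mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while the last two block means violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, n1 + n2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit

def monotone_cdf(values):
    """Isotonic fit followed by truncation to [0, 1]."""
    return [min(1.0, max(0.0, v)) for v in pava(values)]

# Hypothetical raw estimates on an increasing grid of t values.
raw = [-0.02, 0.10, 0.08, 0.60, 1.03]
print(monotone_cdf(raw))
```

In this toy input the two adjacent values 0.10 and 0.08 violate monotonicity and are pooled to their average, while the values outside [0, 1] are truncated.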

Finally, we point out that one needs to be cautious in interpreting the inconsistency of the type II NPMLE. The inconsistency occurs when a purely nonparametric model is used. Under parametric models and semiparametric models, such as the Cox proportional hazards model with a nonparametric baseline or piecewise exponential models, the estimator is likely to be consistent. An extension of the proposed methods to handle censoring, based on the full data influence functions discovered here and inverse probability weighting, is underway.

Acknowledgments

The authors wish to thank Dr. Steve Shea and Dr. Rongling Wu for providing data. Ma’s research is supported by NSF grants DMS-0906341 and DMS-1206693 and NIH grant NS073671-01. Wang’s research is supported by NIH grants AG031113-01A2 and NS073671-01.

Appendix

A.1. Derivation of the complete influence function family

To perform a formal semiparametric analysis (Bickel et al., 1993; Tsiatis, 2006), we denote by θ the function that maps the nuisance parameter f(s) to the p-dimensional parameter of interest, F(t), i.e., $θ\{f(s)\} = \int_{T_{1}}^{t} f(s) ds$. We denote the infinite dimensional nuisance parameter f(s) as η, i.e., η = f(s).

We now derive a general class of consistent estimators through characterizing the complete influence function set. An influence function φ(q, s; θ, η) is a mean zero function that satisfies

E (φ S_{γ}^{T}) = \partial θ (γ_{0}) / \partial γ^{T}

(A.1)

for any parametric submodel. A parametric submodel is a model where the original unknown function f(s) is replaced by a parametric PDF model f(s; γ), and it satisfies f(s; γ₀) = f₀(s). Here S_γ is the score function with respect to γ evaluated at γ₀,

S_{γ} = \left. \frac{\partial \log \{p_{Q}(q) q^{T} f(s; γ)\}}{\partial γ} \right|_{γ = γ_{0}}

and

θ(γ) \equiv θ\{f(s; γ)\} = \int_{T_{1}}^{t} f(s; γ) ds .

The relation in (A.1) indicates that

\int \int_{T_{1}}^{T_{2}} φ q^{T} \frac{\partial f(s; γ_{0})}{\partial γ^{T}} ds \, p_{Q}(q) dμ(q) = \int_{T_{1}}^{t} \frac{\partial f(s; γ_{0})}{\partial γ^{T}} ds,

where μ(q) is the counting measure of Q.

Given any parametric submodel of the form g(q, s; γ) = p_Q(q)q^T f (s; γ), where γ = (γ₁, …, γ_p)^T, and f (s; γ) = {f₁(s; γ₁), …, f_p(s; γ_p)}^T, the parameter of interest is

\begin{array}{l} θ\{f(s; γ)\} = \left\{ \int_{T_{1}}^{t} f_{1}(s; γ_{1}) ds, \dots, \int_{T_{1}}^{t} f_{p}(s; γ_{p}) ds \right\}^{T} \\ = \left\{ \int_{T_{1}}^{T_{2}} I(s \leq t) f_{1}(s; γ_{1}) ds, \dots, \int_{T_{1}}^{T_{2}} I(s \leq t) f_{p}(s; γ_{p}) ds \right\}^{T} . \end{array}

On one hand, the partial derivative of the parameter of interest with respect to γ is a block diagonal matrix of the form

\left. \frac{\partial θ\{f(s; γ)\}}{\partial γ^{T}} \right|_{γ = γ_{0}} = \mathrm{diag} \left\{ \int_{T_{1}}^{T_{2}} I(s \leq t) f_{1 γ_{1}}^{'}(s; γ_{10}) ds, \dots, \int_{T_{1}}^{T_{2}} I(s \leq t) f_{p γ_{p}}^{'}(s; γ_{p 0}) ds \right\} .

On the other hand, the score vector S_γ evaluated at the truth is

S_{γ} = \left\{ \frac{q_{1} f_{1 γ_{1}}^{' T}(s; γ_{10})}{q^{T} f(s)}, \dots, \frac{q_{p} f_{p γ_{p}}^{' T}(s; γ_{p 0})}{q^{T} f(s)} \right\}^{T} .

Recall that (A.1) requires

\int_{T_{1}}^{T_{2}} I(s \leq t) f_{j γ_{j}}^{'}(s; γ_{j 0}) ds = \int \int_{T_{1}}^{T_{2}} φ_{j} q_{j} f_{j γ_{j}}^{'}(s; γ_{j 0}) p_{Q}(q) ds \, dμ(q)

for j = 1,…, p, and

\int \int_{T_{1}}^{T_{2}} φ_{k} q_{j} f_{j γ_{j}}^{'}(s; γ_{j 0}) p_{Q}(q) ds \, dμ(q) = 0

for k ≠ j. Here φ_j is the jth component of φ. Because f (s) is completely unspecified, the function $f_{γ}^{'} (s; γ_{0})$ can be any function that satisfies $\int_{T_{1}}^{T_{2}} f_{γ}^{'} (s; γ_{0}) d s = 0$ . It then follows almost everywhere that ∫ φ_jq_jp_Q(q)dμ(q) − I(s ≤ t) is a constant and ∫ φ_jq_kp_Q(q)dμ(q) is also a constant for k ≠ j. These requirements can be written concisely as

\int φ (q, s) q^{T} p_{Q} (q) d μ (q) = I (s \leq t) I_{p} + C .

(A.2)

Note that a legitimate influence function also needs to have mean zero, hence

0 = E (φ) = \int_{T_{1}}^{T_{2}} \int φ (q, s) q^{T} p_{Q} (q) d μ (q) f (s) d s = F (t) + C 1_{p} .

Thus, we can write φ(q, s) as φ(q, s) = b(q, s) − F(t) − C1_p, where b satisfies (A.2). This gives the desired family of influence functions described in (2).

A.2. Influence function of the WLS

Denote the ith diagonal entry in M as w_i for i = 1, …, n. When we view the weight w_i as a random variable, we denote it as W_i. Since our arguments are general for any i = 1, …, n, we often omit the subscript i, and use w or W for the corresponding quantities. From

\hat{F} (t) = {(\frac{1}{n} \sum_{i = 1}^{n} w_{i} q_{i} q_{i}^{T})}^{- 1} \frac{1}{n} \sum_{i = 1}^{n} w_{i} q_{i} I (s_{i} \leq t),

we obtain

\begin{array}{l} \sqrt{n} \{\hat{F}(t) - F(t)\} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left\{ \left( \frac{1}{n} \sum_{i=1}^{n} w_{i} q_{i} q_{i}^{T} \right)^{-1} w_{i} q_{i} I(s_{i} \leq t) \right\} - \sqrt{n} F(t) \\ = \frac{1}{\sqrt{n}} \left( \frac{1}{n} \sum_{i=1}^{n} w_{i} q_{i} q_{i}^{T} \right)^{-1} \sum_{i=1}^{n} \{ w_{i} q_{i} I(s_{i} \leq t) - w_{i} q_{i} q_{i}^{T} F(t) \} . \end{array}

Note that $E\{W_{i} Q_{i} I(S_{i} \leq t) - W_{i} Q_{i} Q_{i}^{T} F(t)\} = 0$, hence

\sqrt{n} \{\hat{F}(t) - F(t)\} = \frac{1}{\sqrt{n}} \{E(W Q Q^{T})\}^{-1} \sum_{i=1}^{n} \{ w_{i} q_{i} I(s_{i} \leq t) - w_{i} q_{i} q_{i}^{T} F(t) \} + o_{p}(1) .

So the influence function of WLS is

φ_{WLS}(q, s) = \{E(W Q Q^{T})\}^{-1} w q \{I(s \leq t) - q^{T} F(t)\} .

Specifically, for the OLS and the optimal WLS estimators, the influence functions are respectively

\begin{array}{l} φ_{OLS}(q, s) = \{E(Q Q^{T})\}^{-1} q \{I(s \leq t) - q^{T} F(t)\}, \\ \text{and } φ_{OWLS}(q, s) = \left[ E \frac{Q Q^{T}}{Q^{T} F(t) \{1 - Q^{T} F(t)\}} \right]^{-1} \frac{q \{I(s \leq t) - q^{T} F(t)\}}{q^{T} F(t) \{1 - q^{T} F(t)\}} . \end{array}
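The WLS formula at the start of this subsection translates directly into code. The Python sketch below is restricted to p = 2 so that the linear system can be solved in closed form. The OWLS step uses a two-stage plug-in (an OLS pilot estimate of F(t) inserted into the weights 1/[q^T F(t){1 − q^T F(t)}]), with the fitted probabilities clipped away from 0 and 1; both the two-stage scheme and the clipping constant `eps` are assumptions made for illustration, as are the function names and the equal group frequencies in the toy data.

```python
import math
import random

def solve2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def wls_cdf(q, s, t, w=None):
    """WLS estimate of (F1(t), F2(t)): solves sum w q q^T F = sum w q I(s <= t)."""
    w = [1.0] * len(s) if w is None else w
    a = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for qi, si, wi in zip(q, s, w):
        y = 1.0 if si <= t else 0.0
        for j in range(2):
            b[j] += wi * qi[j] * y
            for k in range(2):
                a[j][k] += wi * qi[j] * qi[k]
    return solve2(a, b)

def owls_cdf(q, s, t, eps=1e-3):
    """Two-step OWLS: OLS pilot, then weights 1 / [mu (1 - mu)], mu = q^T F(t)."""
    pilot = wls_cdf(q, s, t)
    w = []
    for qi in q:
        mu = min(max(qi[0] * pilot[0] + qi[1] * pilot[1], eps), 1.0 - eps)
        w.append(1.0 / (mu * (1.0 - mu)))
    return wls_cdf(q, s, t, w)

def simulate(n, rng):
    """Toy data in the style of Simulation 1 (equal group frequencies assumed)."""
    c = 1.0 - math.exp(-10.0 / 3.0)
    us = [(0.3, 0.7), (0.0, 1.0), (0.7, 0.3), (0.8, 0.2), (0.5, 0.5), (0.6, 0.4)]
    q, s = [], []
    for _ in range(n):
        qi = rng.choice(us)
        u = rng.random()
        if rng.random() < qi[0]:   # latent membership in sub-population 1
            si = -3.0 * math.log(1.0 - c * u)
        else:                      # latent membership in sub-population 2
            si = 10.0 + 3.0 * math.log(1.0 - c * (1.0 - u))
        q.append(qi)
        s.append(si)
    return q, s

q, s = simulate(20000, random.Random(0))
ols_hat = wls_cdf(q, s, 6.8)
owls_hat = owls_cdf(q, s, 6.8)
print([round(v, 3) for v in ols_hat], [round(v, 3) for v in owls_hat])
```

Calling `wls_cdf` without weights gives the OLS estimator. With a large simulated sample both estimates should be close to the true values F₁(6.8) = 0.9295 and F₂(6.8) = 0.3199.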

A.3. Derivation of Λ and $Λ_{T}^{⊥}$

We denote the collection of mean zero functions orthogonal to all the elements in Λ as $Λ_{T}^{⊥}$. Considering the space of tangent vectors contributed by the jth component f_j(s) only, we obtain

Λ_{j} = \left\{ \frac{q_{j} h(s)}{q^{T} f(s)} : \int h(s) ds = 0, h \in R^{p} \right\} .

Combining the Λ_j’s for j = 1, …, p, the nuisance tangent space is therefore

Λ_{T} = \left\{ \frac{h(s) q}{q^{T} f(s)} : \int h(s) ds = 0, h \in R^{p \times p} \right\} .

Furthermore, it is easy to see that

Λ_{T}^{⊥} = \left\{ r(q, s) : \int r(q, s) q^{T} p_{Q}(q) dμ(q) = C, C 1_{p} = 0 \right\},

where C is a constant p × p matrix.

A.4. Proof of Theorem 1

We only need to verify that φ_eff given in Theorem 1 satisfies $φ_{eff} = Π (φ ∣ Λ_{T}) = φ - Π (φ ∣ Λ_{T}^{⊥})$ , where Π denotes an orthogonal projection.

To show this, we first point out that K1_p = F (t). This is because from the definition of A(s), we have

f(s) = A^{-1}(s) \int \frac{q q^{T} f(s) p_{Q}(q)}{q^{T} f(s)} dμ(q) = A^{-1}(s) \int q p_{Q}(q) dμ(q) .

Integrating both sides of the above equation from T₁ to T₂ and from T₁ to t respectively, we obtain

\begin{array}{l} 1_{p} = \int_{T_{1}}^{T_{2}} A^{-1}(s) ds \int q p_{Q}(q) dμ(q), \\ F(t) = \int_{T_{1}}^{T_{2}} I(s \leq t) A^{-1}(s) ds \int q p_{Q}(q) dμ(q), \end{array}

and the result follows.
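The two integral identities above can be checked numerically. The Python sketch below does so for the setting of Simulation 1, taking T₁ = 0, T₂ = 10 and t = 6.8, with A(s) = ∫ qq^T p_Q(q)/{q^T f(s)} dμ(q) as implied by the first display. The equal group probabilities `PQ`, the midpoint rule and all function names are assumptions made for illustration.

```python
import math

C = 1.0 - math.exp(-10.0 / 3.0)
US = [(0.3, 0.7), (0.0, 1.0), (0.7, 0.3), (0.8, 0.2), (0.5, 0.5), (0.6, 0.4)]
PQ = [1.0 / 6.0] * 6   # assumption: the six groups are equally likely

def f(s):
    """True densities (f1(s), f2(s)) of Simulation 1 on (0, 10)."""
    return (math.exp(-s / 3.0) / (3.0 * C),
            math.exp((s - 10.0) / 3.0) / (3.0 * C))

def A_inv(s):
    """Inverse of A(s) = sum_k p_k u_k u_k^T / (u_k^T f(s)) for p = 2."""
    f1, f2 = f(s)
    a00 = a01 = a11 = 0.0
    for (q1, q2), p in zip(US, PQ):
        d = q1 * f1 + q2 * f2
        a00 += p * q1 * q1 / d
        a01 += p * q1 * q2 / d
        a11 += p * q2 * q2 / d
    det = a00 * a11 - a01 * a01
    return ((a11 / det, -a01 / det), (-a01 / det, a00 / det))

def integral_times_EQ(upper, steps=4000):
    """Midpoint-rule value of int_0^upper A^{-1}(s) ds times E(Q)."""
    eq = [sum(p * u[j] for u, p in zip(US, PQ)) for j in range(2)]
    h = upper / steps
    out = [0.0, 0.0]
    for i in range(steps):
        m = A_inv((i + 0.5) * h)
        for j in range(2):
            out[j] += h * (m[j][0] * eq[0] + m[j][1] * eq[1])
    return out

full = integral_times_EQ(10.0)   # first identity: should be close to (1, 1)
part = integral_times_EQ(6.8)    # second identity: close to (F1(6.8), F2(6.8))
print([round(v, 4) for v in full], [round(v, 4) for v in part])
```

The check works for any choice of group probabilities, because A⁻¹(s) ∫ q p_Q(q) dμ(q) = f(s) holds pointwise in s.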

Now, letting h₁(q, s) = h₂(q, s) = A⁻¹(s)q/q^T f(s), h₃(q) = K1_p and B = −K, we can easily verify that the corresponding b(q, s) in (4) has the form

b_{eff} = \{I(s \leq t) I_{p} - K\} \frac{A^{-1}(s) q}{q^{T} f(s)} + K 1_{p} .

Since

\int b_{eff}(q, s) q^{T} p_{Q}(q) dμ(q) = I(s \leq t) I_{p} - K + K 1_{p} \int q^{T} p_{Q}(q) dμ(q),

its corresponding influence function is

\begin{array}{l} b_{eff}(q, s) - F(t) - \left\{ -K + K 1_{p} \int q^{T} p_{Q}(q) dμ(q) \right\} 1_{p} = b_{eff}(q, s) - F(t) + K 1_{p} - K 1_{p} \int q^{T} 1_{p} p_{Q}(q) dμ(q) \\ = b_{eff}(q, s) - F(t) . \end{array}

Note that the above expression equals φ_eff. Thus, we have shown that φ_eff is a valid influence function hence φ_eff ∈ Λ.

Now, for any φ ∈ Λ, we need to show $φ - φ_{eff} \in Λ_{T}^{⊥}$ . We have

\begin{array}{l} \int (φ - φ_{eff}) q^{T} p_{Q}(q) dμ(q) = \int \left[ φ - \{I(s \leq t) I_{p} - K\} \frac{A^{-1}(s) q}{q^{T} f(s)} \right] q^{T} p_{Q}(q) dμ(q) \\ = \int φ q^{T} p_{Q}(q) dμ(q) - \{I(s \leq t) I_{p} - K\} \\ = C - \{F(t) + C 1_{p}\} \int q^{T} p_{Q}(q) dμ(q) + K \end{array}

is a constant matrix. In the last equality, we used the fact that an influence function φ can be written as φ = b − F(t) − C1_p, where ∫ b q^T p_Q(q)dμ(q) = I(s ≤ t)I_p + C, as in (A.2). From

\left[ C - \{F(t) + C 1_{p}\} \int q^{T} p_{Q}(q) dμ(q) + K \right] 1_{p} = C 1_{p} - \{F(t) + C 1_{p}\} + K 1_{p} = 0

and the description of $Λ_{T}^{⊥}$, we indeed have $φ - φ_{eff} \in Λ_{T}^{⊥}$.

A.5. Proof of Theorem 2

First, we note that all the approximations are caused by $\hat{q^{T} f}$ , which is estimated using the second subset of the data. No other estimation or approximation is involved in our construction. From (7) we obtain

\begin{array}{l} n_{1}^{1/2} \{\hat{F}(t) - F(t)\} = n_{1}^{-1/2} \sum_{i=1}^{n_{1}} \{ψ(q_{i}, s_{i}; \hat{q^{T} f}) - F(t)\} \\ = n_{1}^{-1/2} \sum_{i=1}^{n_{1}} \{ψ(q_{i}, s_{i}; q^{T} f) - F(t)\} + n_{1}^{-1/2} \sum_{i=1}^{n_{1}} \{ψ(q_{i}, s_{i}; \hat{q^{T} f}) - ψ(q_{i}, s_{i}; q^{T} f)\} . \end{array}

Note that A(s; q^T f) = A(s), K(q^Tf) = K, and K1_p = F (t), hence

ψ(q, s; q^{T} f) - F(t) = \frac{\{I(s \leq t) - K\} A^{-1}(s) q}{q^{T} f(s)} + K 1_{p} - F(t) = φ_{eff}(q, s) .

From (5), we see that $ψ(q, s; \hat{q^{T} f}) - F(t)$ is an influence function. Thus, the difference between $ψ(q, s; \hat{q^{T} f})$ and ψ(q, s; q^T f) is the difference between a valid influence function and its projection on Λ, hence is orthogonal to Λ. Specifically, we have

ψ(q, s; \hat{q^{T} f}) - ψ(q, s; q^{T} f) = \{ψ(q, s; \hat{q^{T} f}) - F(t)\} - \{ψ(q, s; q^{T} f) - F(t)\} ⊥ Λ_{T}

and it has mean zero. Consequently, the estimator F̂(t) is consistent and has variance

\mathrm{var} [n_{1}^{1/2} \{\hat{F}(t) - F(t)\}] = \mathrm{var}(φ_{eff}) + \mathrm{var} \{ψ(q, s; \hat{q^{T} f}) - ψ(q, s; q^{T} f)\} .

When n₂ → ∞, the number of observations that satisfy q_i = u_k also goes to infinity in probability due to the randomness of the data. Thus, the kernel estimator $\hat{u_{k}^{T} f (s)}$ satisfies $\hat{u_{k}^{T} f (s)} - u_{k}^{T} f(s) = o_{p}(1)$ uniformly on any compact set of s for each k ∈ {1, …, m}. Therefore, $\hat{q^{T} f}(s) - q^{T} f(s) = o_{p}(1)$ as n → ∞. Because ψ(q, s; q^T f) is a pathwise differentiable function of q^T f, it then follows that $\mathrm{var} \{ψ(q, s; \hat{q^{T} f}) - ψ(q, s; q^{T} f)\} = o(1)$. This proves that F̂(t) is indeed an efficient estimator.

A.6. Inconsistency of the type II NPMLE

Consider a very simple and explicit case where p = m = 2, u₂ = (1, 0)^T, while u₁ ≠ (1, 0)^T and u₁ ≠ (0, 1)^T. This corresponds to the situation where there exist two genotypes, and for the first r₁ observations we know that they belong to the first group with probability u₁₁ and belong to the second group with probability u₁₂ = 1 − u₁₁; while for the last r₂ observations, we know that they are from the first group. Under this special case, the NPMLE becomes

\max_{h_{1}, h_{2}} (u_{1}^{T} h_{1})^{r_{1}} (u_{2}^{T} h_{2})^{r_{2}} = \max_{h_{1}, h_{2}} (u_{11} h_{11} + u_{12} h_{12})^{r_{1}} h_{21}^{r_{2}}

subject to r₁h₁₁ + r₂h₂₁ = 1, r₁h₁₂ + r₂h₂₂ = 1, and h_ij ≥ 0 for i, j = 1, 2. Obviously, the maximum is obtained only when $h_{12} = r_{1}^{-1}$ and h₂₂ = 0. This can be written as $\hat{f}_{2}(s_{i}) = r_{1}^{-1} I(q_{i} = u_{1})$ for all i = 1, …, n. Hence the NPMLE2 for the PDF f₂(s) puts zero weight on observations that are known to be drawn from the first group, and puts equal weight, $r_{1}^{-1}$, on the other observations. This result is equivalent to the standard empirical likelihood estimate of a PDF when we are only given observations s₁, …, s_r₁ drawn as a random sample from this PDF. Hence the corresponding CDF estimator $\hat{F}_{2}(t) = \sum_{i=1}^{n} \hat{f}_{2}(s_{i}) I(s_{i} \leq t) = r_{1}^{-1} \sum_{i=1}^{r_{1}} I(q_{i} = u_{1}) I(s_{i} \leq t)$ is a consistent estimator of the CDF from which s₁, …, s_r₁ are drawn. However, s₁, …, s_r₁ is a random sample from a mixture of two populations, where the mixture probability is u₁₁ for the first population and u₁₂ for the second population. In other words, the estimator F̂₂(t) is a consistent estimator of u₁₁F₁(t) + u₁₂F₂(t). Obviously, u₁₁F₁(t) + u₁₂F₂(t) does not equal F₂(t) unless u₁₁ = 0 or F₁(t) = F₂(t). Consequently, the type II NPMLE is not consistent in this simple case.
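The limit identified in this argument is easy to see in a small simulation. The Python sketch below uses the truncated exponential components of Simulation 1 and the closed form derived above, namely that the type II NPMLE of F₂(t) is the empirical CDF of the r₁ mixed observations. The choices u₁₁ = 0.5, t = 6.8, r₁ = 50000, the seed and the function names are illustrative assumptions.

```python
import math
import random

C = 1.0 - math.exp(-10.0 / 3.0)

def F1(t):
    """True CDF of the first sub-population."""
    return (1.0 - math.exp(-t / 3.0)) / C

def F2(t):
    """True CDF of the second sub-population."""
    return 1.0 - (1.0 - math.exp((t - 10.0) / 3.0)) / C

def draw(q1, rng):
    """Draw one observation whose latent group is 1 with probability q1."""
    u = rng.random()
    if rng.random() < q1:
        return -3.0 * math.log(1.0 - C * u)              # inverse of F1
    return 10.0 + 3.0 * math.log(1.0 - C * (1.0 - u))    # inverse of F2

def npmle2_F2(t, r1, u11, rng):
    """Empirical CDF of the r1 mixed observations (closed form derived above).

    The r2 observations known to come from group 1 receive zero weight in
    the estimate of F2, so they do not need to be simulated here.
    """
    sample = [draw(u11, rng) for _ in range(r1)]
    return sum(1 for s in sample if s <= t) / r1

t, u11 = 6.8, 0.5
estimate = npmle2_F2(t, 50000, u11, random.Random(1))
limit = u11 * F1(t) + (1.0 - u11) * F2(t)
print(f"estimate={estimate:.3f}  limit={limit:.3f}  target F2(t)={F2(t):.3f}")
```

Even with a very large r₁ the estimate stays near the mixture value u₁₁F₁(t) + u₁₂F₂(t) rather than near F₂(t), in line with the argument above.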

Contributor Information

Yanyuan Ma, Email: ma@stat.tamu.edu, Department of Statistics, Texas A&M University, College Station, TX 77845.

Yuanjia Wang, Email: yuanjia.wang@columbia.edu, Department of Biostatistics Mailman School of Public Health Columbia University 722 West 168th Street New York, NY 10032.

References

Barlow RE, Bartholomew DJ, Bremner JM, Brunk HD. Statistical Inference Under Order Restrictions. New York: John Wiley; 1972.
Bickel PJ, Klaassen CAJ, Ritov Y, Wellner JA. Efficient and Adaptive Estimation for Semiparametric Models. Baltimore: The Johns Hopkins University Press; 1993.
Chatterjee N, Wacholder S. A Marginal Likelihood Approach for Estimating Penetrance from Kin-cohort Designs. Biometrics. 2001;57:245–252. doi: 10.1111/j.0006-341x.2001.00245.x.
Davignon J, Gregg RE, Sing CF. Apolipoprotein E Polymorphism and Atherosclerosis. Arteriosclerosis. 1988;8:1–21. doi: 10.1161/01.atv.8.1.1.
Fine JP, Zou F, Yandell BS. Nonparametric estimation of the effects of quantitative trait loci. Biometrics. 2004;5:501–513. doi: 10.1093/biostatistics/kxh004.
Hartge P, Chatterjee N, Wacholder S, Brody LC, Tucker MA, Struewing JP. Breast cancer risk in Ashkenazi BRCA1/2 mutation carriers: effects of reproductive history. Epidemiology. 2002;13(3):255–261. doi: 10.1097/00001648-200205000-00004.
Hauptmann M, Sigurdson AJ, Chatterjee N, Rutter JL, Hill DA, Doody MM, Struewing JP. Re: Population-Based, Case-Control Study of HER2 Genetic Polymorphism and Breast Cancer Risk. Journal of the National Cancer Institute. 2003;95:1251–1252. doi: 10.1093/jnci/djg032.
Hixson JE. Apolipoprotein E Polymorphisms Affect Atherosclerosis in Young Males: Pathobiological Determinants of Atherosclerosis in Youth (PDAY) Research Group. Arterioscler Thromb. 1991;11:237–244. doi: 10.1161/01.atv.11.5.1237.
Huang N, Parco A, Mew T, Magpantay G, McCouch S, Gulderdoni E, Xu J, Subudhi P, Angeles E, Khush G. RFLP Mapping of Isozymes, RAPD and QTLs for Grain Shape, Brown Planthopper Resistance in a Doubled Haploid Rice Population. Molecular Breeding. 1997;3:105–113.
Khoury M, Beaty H, Cohen B. Fundamentals of Genetic Epidemiology. New York: Oxford University Press; 1993.
Lander ES, Botstein D. Mapping Mendelian Factors Underlying Quantitative Traits Using RFLP Linkage Maps. Genetics. 1989;121:743–756. doi: 10.1093/genetics/121.1.185.
Li R, Liang H. Variable selection in semiparametric regression modeling. Annals of Statistics. 2008;36:261–286. doi: 10.1214/009053607000000604.
Liang H, Wang N. Large sample theory in a semiparametric partially linear errors-in-variables model. Statistica Sinica. 2005;15:99–117.
Marder K, Levy G, Louis ED, Mejia-Santana H, Cote L, Andrews H, Harris J, Waters C, Ford B, Frucht S, Fahn S, Ottman R. Accuracy of family history data on Parkinson’s disease. Neurology. 2003;61:18–23. doi: 10.1212/01.wnl.0000074784.35961.c0.
McLachlan GJ, Peel D. Finite Mixture Models. New York: Wiley; 2000.
Newey WK. Semiparametric Efficiency Bounds. Journal of Applied Econometrics. 1990;5:99–135.
Rabinowitz D. Computing the Efficient Score in Semi-parametric Problems. Statistica Sinica. 2000;10:265–280.
Sigurdson AJ, Hauptmann M, Chatterjee N, Alexander BH, Doody MM, Rutter JL, Struewing JP. Kin-cohort estimates for familial breast cancer risk in relation to variants in DNA base excision repair, BRCA1 interacting and growth factor genes. BMC Cancer. 2004;4:9. doi: 10.1186/1471-2407-4-9.
Shea S, Isasi CR, Couch S, Starc TJ, Tracy RP, Deckelbaum R, Talmud P, Berglund L, Humphries SE. Relations of Plasma Fibrinogen Level in Children to Measures of Obesity, the (G-455->A) Mutation in the Beta-Fibrinogen Promoter Gene, and Family History of Ischemic Heart Disease: the Columbia University BioMarkers Study. American Journal of Epidemiology. 1999;150:737–46. doi: 10.1093/oxfordjournals.aje.a010076.
Tsiatis AA. Semiparametric Theory and Missing Data. New York: Springer; 2006.
Tsiatis AA, Ma Y. Locally Efficient Semiparametric Estimators for Functional Measurement Error Models. Biometrika. 2004;91:835–848.
Wacholder S, Hartge P, Struewing J, Pee D, McAdams M, Brody L, Tucker M. The Kin-cohort Study for Estimating Penetrance. American Journal of Epidemiology. 1998;148:623–630. doi: 10.1093/aje/148.7.623.
Wang Y, Clark LN, Marder K, Rabinowitz D. Nonparametric Estimation of Genotype-specific Age-at-onset Distributions From Censored Kin-cohort Data. Biometrika. 2007;94:403–414.
Wang Y, Clark LN, Louis ED, Mejia-Santana H, Harris J, Cote LJ, Waters C, Andrews D, Ford B, Frucht S, Fahn S, Ottman R, Rabinowitz D, Marder K. Risk of Parkinson’s disease in carriers of Parkin mutations: estimation using the kin-cohort method. Arch Neurol. 2008;65(4):467–474. doi: 10.1001/archneur.65.4.467.
Webb EL, Rudd MF, Houlston RS. Case-control, kin-cohort and meta-analyses provide no support for STK15 F31I as a low penetrance colorectal cancer allele. British Journal of Cancer. 2006a;95:1047–1049. doi: 10.1038/sj.bjc.6603382.
Webb EL, Rudd MF, Sellick GS, Galta R, Bethke L, Wood W, Fletcher O, Penegar S, Withey L, Qureshi M, Johnson N, Tomlinson I, Gray R, Peto J, Houlston RS. Search for low penetrance alleles for colorectal cancer through a scan of 1467 nonsynonymous SNPs in 2575 cases and 2707 controls with validation by kin-cohort analysis of 14 704 first-degree relatives. Hum Mol Genet. 2006b;15(21):3263–3271. doi: 10.1093/hmg/ddl401.
Wu R, Ma C, Casella G. Statistical Genetics of Quantitative Traits: Linkage, Maps, and QTL. New York: Springer; 2007.

