A Semiparametric Single-Index Risk Score Across Populations

Shujie Ma; Yanyuan Ma; Yanqing Wang; Eli S Kravitz; Raymond J Carroll

doi:10.1080/01621459.2016.1222944

Author manuscript; available in PMC: 2018 Mar 6.

Published in final edited form as: J Am Stat Assoc. 2017 Jul 18;112(520):1648–1662. doi: 10.1080/01621459.2016.1222944

PMCID: PMC5839692 NIHMSID: NIHMS878200 PMID: 29520120

Abstract

We consider a problem motivated by issues in nutritional epidemiology, across diseases and populations. In this area, it is becoming increasingly common for diseases to be modeled by a single diet score, such as the Healthy Eating Index, the Mediterranean Diet Score, etc. For each disease and for each population, a partially linear single-index model is fit. The partially linear aspect of the problem is allowed to differ in each population and disease. However, and crucially, the single-index itself, having to do with the diet score, is common to all diseases and populations, and the nonparametrically estimated functions of the single-index are the same up to a scale parameter. Using B-splines with an increasing number of knots, we develop a method to solve the problem, and display its asymptotic theory. An application to the NIH-AARP Study of Diet and Health is described, where we show the advantages of using multiple diseases and populations simultaneously rather than one at a time in understanding the effect of increased Milk consumption. Simulations illustrate the properties of the methods.

Keywords: Asymptotic theory, B-splines, Combining data sets, Healthy Eating Index, Logistic regression, Partially linear single-index models, Semiparametric models, Single-index models

1 Introduction

We describe a novel partially linear logistic single index model in which there are multiple populations, and multiple diseases within each population, but where the single index part of the model is shared across the populations and diseases. In the case of a single disease across independent populations, we derive B-spline based semiparametric efficient methodology. In other cases, such as multiple populations with multiple diseases, our B-spline based methods are consistent and we derive their asymptotic theory.

The problem arises from common practice in nutritional epidemiology, where the goal is to relate nutritional intakes to disease. In this area, it is increasingly common to relate the patterns of multiple dietary components, rather than an individual dietary component, to a disease. One popular way to summarize dietary intake patterns is through a dietary pattern score. While there are many flavors of dietary pattern scores, in our example we use the U.S. Department of Agriculture’s (USDA’s) Healthy Eating Index-2005 (HEI-2005, http://www.cnpp.usda.gov/HealthyEatingIndex.htm). It is based on the key recommendations of the 2005 Dietary Guidelines for Americans available at http://www.health.gov/dietaryguidelines/dga2005/document/default.htm. The HEI-2005 comprises 12 distinct component scores. Intakes of each food or nutrient, represented by one of the 12 components, are expressed as a ratio to energy (caloric) intake, assessed, and given a score. See Table 1 for a list of these components and the standards for scoring, and see Guenther et al. (2008) and Guenther et al. (2008) for details. The 12 different component scores are then summed to get a total score, ranging from 0 for a terrible diet to 100 for the best possible diet.

Table 1.

Description of the HEI-2005 scoring system. Except for saturated fat and SoFAAS, density is obtained by multiplying intake by 1000 and dividing by intake of kilo-calories. For saturated fat, density is 9 × 100 saturated fat (grams) divided by calories, i.e., the percentage of calories coming from saturated fat intake. For SoFAAS, the density is the percentage of intake that comes from intake of calories, i.e., the division of intake of SoFAAS by intake of calories. Here, “DOL” is dark green and orange vegetables and legumes. Also, “SoFAAS” is calories from solid fats, alcoholic beverages and added sugars. The total HEI-2005 score is the sum of the individual component scores.

Component	Units	HEI-2005 score calculation
Total Fruit	cups	min {5, 5 × (density/.8)}
Whole Fruit	cups	min {5, 5 × (density/.4)}
Total Vegetables	cups	min {5, 5 × (density/1.1)}
DOL	cups	min {5, 5 × (density/.4)}
Total Grains	ounces	min {5, 5 × (density/3)}
Whole Grains	ounces	min {5, 5 × (density/1.5)}
Milk	cups	min {10, 10 × (density/1.3)}
Meat and Beans	ounces	min {10, 10 × (density/2.5)}
Oil	grams	min {10, 10 × (density/12)}
Saturated Fat	% of energy	if density ≥ 15, score = 0
		else if density ≤ 7, score = 10
		else if density > 10, score = 8 − {8 × (density − 10)/5}
		else score = 10 − {2 × (density − 7)/3}
Sodium	milligrams	if density ≥ 2000, score = 0
		else if density ≤ 700, score = 10
		else if density ≥ 1100, score = 8 − {8 × (density − 1100)/(2000 − 1100)}
		else score = 10 − {2 × (density − 700)/(1100 − 700)}
SoFAAS	% of energy	if density ≥ 50, score = 0
		else if density ≤ 20, score = 20
		else score = 20 − {20 × (density − 20)/(50 − 20)}
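The scoring rules of Table 1 translate directly into code. The following Python sketch is illustrative only: the function name `hei2005_scores` and the dictionary input format are assumptions made for this example, and the densities are taken to be already computed as described in the table caption.

```python
def hei2005_scores(density):
    """Component scores and total HEI-2005 score from the densities of Table 1.

    `density` is a dict keyed by component name (a hypothetical input format).
    """
    # (maximum score, density standard) for the nine capped-linear components.
    capped = {
        "Total Fruit": (5, 0.8), "Whole Fruit": (5, 0.4),
        "Total Vegetables": (5, 1.1), "DOL": (5, 0.4),
        "Total Grains": (5, 3.0), "Whole Grains": (5, 1.5),
        "Milk": (10, 1.3), "Meat and Beans": (10, 2.5), "Oil": (10, 12.0),
    }
    scores = {name: min(cap, cap * density[name] / std)
              for name, (cap, std) in capped.items()}

    d = density["Saturated Fat"]            # percent of energy
    if d >= 15:
        scores["Saturated Fat"] = 0.0
    elif d <= 7:
        scores["Saturated Fat"] = 10.0
    elif d > 10:
        scores["Saturated Fat"] = 8 - 8 * (d - 10) / 5
    else:
        scores["Saturated Fat"] = 10 - 2 * (d - 7) / 3

    d = density["Sodium"]                   # milligrams per 1000 kilo-calories
    if d >= 2000:
        scores["Sodium"] = 0.0
    elif d <= 700:
        scores["Sodium"] = 10.0
    elif d >= 1100:
        scores["Sodium"] = 8 - 8 * (d - 1100) / (2000 - 1100)
    else:
        scores["Sodium"] = 10 - 2 * (d - 700) / (1100 - 700)

    d = density["SoFAAS"]                   # percent of energy
    if d >= 50:
        scores["SoFAAS"] = 0.0
    elif d <= 20:
        scores["SoFAAS"] = 20.0
    else:
        scores["SoFAAS"] = 20 - 20 * (d - 20) / (50 - 20)

    # The total is the sum of the 12 component scores (maximum 100).
    scores["Total"] = sum(scores.values())
    return scores
```

The component maxima add to 6 × 5 + 3 × 10 + 10 + 10 + 20 = 100, which is why the total score ranges from 0 to 100.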


The key concept here is that the total score is developed before any health outcome data are considered. Once a total score is developed, it is then used, across multiple populations, in risk models to relate any disease to the total score. As an example, Reedy et al. (2008) show that for colorectal cancer in the NIH-AARP Study of Diet and Health (Schatzkin et al., 2001), with diet assessed by a food frequency questionnaire, higher HEI-2005 total scores are statistically significantly associated with lower disease risks. They also consider three other dietary pattern scores. George et al. (2010) show that among breast cancer survivors, higher HEI-2005 total scores are associated with lower levels of chronic inflammation. Chiuve et al. (2012) show that the HEI-2005 total score and the Alternative Healthy Eating Index (AHEI) (McCullough et al., 2002) are significant predictors of chronic diseases such as coronary heart disease, diabetes, stroke and cancer, and that closer adherence to the 2005 Dietary Guidelines may lower the risk of major chronic diseases. The AHEI is also associated with all cause mortality (Akbaraly et al., 2011).

In its most general form, there are k = 1, …, K populations. Within population k, there are ℓ = 1, …, L_k diseases. As in the HEI-2005, there are j = 0, …, J dietary components. Let the J + 1 individual component scores in population k be (X_0k, …, X_Jk): in our case, J + 1 = 12. Then the current practice in nutritional epidemiology would be to form a total score $\sum_{j = 0}^{J} X_{jk}$ for population k and use it as the risk predictor for all populations/diseases. Thus, for example, in a logistic regression with a binary outcome Y_kℓ, and with H(·) being the logistic distribution function, the model for population k and disease ℓ is

pr (Y_{k ℓ} = 1 | X_{0 k}, \dots, X_{Jk}) = H (β_{0 k ℓ} + β_{1 k ℓ} \sum_{j = 0}^{J} X_{jk}) .

(1)

This is important and convenient from a public health perspective, because it enables nutritional epidemiologists to use the same predictor, namely $\sum_{j = 0}^{J} X_{jk}$ , for all diseases, and to describe the effect of that predictor through a single quantity, β_1kℓ.

Crucially, it is undesirable to try to fit different parameters for each component score. That is, instead of fitting (1), one might fit

pr (Y_{k ℓ} = 1 | X_{0 k}, \dots, X_{Jk}) = H (β_{0 k ℓ} + \sum_{j = 0}^{J} β_{jk ℓ} X_{jk}) .

(2)

The reason why (1) is preferred to (2) for public health purposes is that it is much more interpretable. Model (1) describes how a single, interpretable score, $\sum_{j = 0}^{J} X_{jk}$ , affects disease risk. Model (2) is chaotic because it requires policy makers to say things such as “if you are in population k = 1 and are worried about disease ℓ = 1 then your diet improves your risk if you eat this kind of food more and that kind of food less, but for disease ℓ = 2 you need to consider your dietary composition in another way”. Interpretability is even more complicated because the component scores have a reasonably complex pattern of correlations, see Table S.1 of the Supplementary Material. Since there are so many diseases and populations, this is not helpful practically and would not be used. Indeed, as seen above, the single HEI-2005 score is associated with colon cancer (Reedy et al., 2008), chronic inflammation, (George et al., 2010) and many chronic diseases (Chiuve et al., 2012), and its ease of interpretation is apparent.

Our goal is to develop a single interpretable score that, unlike the HEI-2005 score or other scores, is calibrated to different populations or diseases. We do this not by summing the component scores, but by weighting them and allowing a more flexible shape. Thus, for an unknown function m(·) and weights (α₀, …, α_J), we propose the single-index score $m (\sum_{j = 0}^{J} X_{jk} α_{j})$ , and model the risk as

pr (Y_{k ℓ} = 1 | X_{0 k}, \dots, X_{Jk}) = H {β_{0 k ℓ} + β_{1 k ℓ} m (\sum_{j = 0}^{J} X_{jk} α_{j})} .

(3)

Model (3), like model (1), is based upon a single interpretable score, $m (\sum_{j = 0}^{J} X_{jk} α_{j})$ , that can be used across populations and/or diseases.
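To make the relation between models (1) and (3) concrete, the following Python sketch evaluates the risk in model (3) for given weights and a given shape function. The function names and the numerical values in the usage note are hypothetical; with all weights equal to 1 and m the identity, model (3) reduces to model (1).

```python
import numpy as np

def logistic(t):
    """Logistic distribution function H(t)."""
    return 1.0 / (1.0 + np.exp(-t))

def risk_single_index(X, alpha, beta0, beta1, m=lambda u: u):
    """Risk under model (3): H{beta0 + beta1 * m(sum_j X_j * alpha_j)}.

    X     : (n, J + 1) array of component scores
    alpha : (J + 1,) weights (alpha_0, ..., alpha_J)
    m     : shape function applied to the weighted index (identity by default)
    """
    X = np.asarray(X, dtype=float)
    index = X @ np.asarray(alpha, dtype=float)
    return logistic(beta0 + beta1 * m(index))
```

For example, `risk_single_index(X, np.ones(12), beta0, beta1)` uses the unweighted total score as in model (1), whereas passing unequal weights and a nonlinear `m` gives the more flexible score of model (3).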

In Section 2, we present our model more formally. Section 3 describes how to fit the model for a single disease across independent populations, and we show that our method is semiparametric efficient in this case. In Section 4 we describe generalizations. Section 4.1 considers a single population with multiple diseases, while Section 4.2 describes the real goal of our data analysis, where there are multiple populations and multiple diseases. Section 5 gives results of the data example, while Section 6 describes simulation results. Computational and technical details are in an Appendix and in Supplementary Material.

2 Multiple Population Single-Index Model

2.1 Model and Splines

In this section, we consider a single disease across multiple different populations. There are k = 1, …, K populations. For the k^th population, there are i = 1, …, n_k individuals with binary responses Y_ik. Define X_ik = (X_ik1, …, X_ikJ)^T. In addition to the responses, for individual i in population k we observe (G_ik, X_ik0, X_ik, Z_ik, G_ikW_ik), defined as follows. The J + 1 dietary component scores are ${(X_{ik 0}, X_{ik}^{T})}^{T}$ . Covariates that are observed for all individuals are Z_ik = (Z_ik1, …, Z_ikd)^T. Further, we allow for a subset of individuals to have additional covariates W_ik = (W_ik1, …, W_ika)^T, and define G_ik to be the binary indicator that these individuals have such additional covariates. For example, Reedy et al. (2008) fit models to men and women with many common covariates such as age and levels of educational status, but for women they in addition include indicators of types of hormone replacement therapy.

For the i^th individual and the k^th outcome, we posit the marginal model

pr (Y_{ik} = 1 | G_{ik}, X_{ik 0}, X_{ik}, Z_{ik}, G_{ik} W_{ik}) = H_{ik} = H {(β_{k 1} + β_{k 2} G_{ik}) m (X_{ik 0} + X_{ik}^{T} α) + Z_{ik}^{T} (θ_{k 1} + θ_{k 2} G_{ik}) + G_{ik} W_{ik}^{T} θ_{k 3}} .

(4)

Here α ∈ ℝ^J, θ_k1 ∈ ℝ^d, θ_k2 ∈ ℝ^d, θ_k3 ∈ ℝ^a, β_k1 ∈ ℝ and β_k2 ∈ ℝ. Crucially, for use in practice, the function m(·) and the parameter α do not depend on k.

Remark 1

In model (4), the most general form of the single-index is $m (X_{ik 0} α_{0} + X_{ik}^{T} α)$ . However, because m(·) is modeled nonparametrically, such a formulation is not identifiable. There are three equivalent ways to obtain identifiability. The first common method, what we have done, is to select one variable that is known to be related to the response, which we label as X_ik0, and to set its parameter α₀ = 1. A second common method is to make the restriction that $α_{0}^{2} + α^{T} α = 1$ . In the context of our problem, there is a third way. Since common nutritional epidemiology practice is to weight each variable the same, namely = 1, the sum of the weights = J + 1. For comparison purposes, we can achieve identifiability via the restriction α₀ + (1, …, 1)α = J + 1. We use the first method in our computations, but report results for the third.
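The conversion between the first and third identifiability conventions in Remark 1 is a simple rescaling. The sketch below is a hypothetical helper (the name `rescale_weights` is not from the paper) and assumes that 1 + (1, …, 1)α is nonzero.

```python
import numpy as np

def rescale_weights(alpha):
    """Convert weights with alpha_0 = 1 into weights that sum to J + 1.

    alpha : (J,) estimated weights for X_1, ..., X_J under the alpha_0 = 1 convention
    Returns (weights, c), where weights = c * (1, alpha) and c is the scale factor.
    """
    full = np.concatenate(([1.0], np.asarray(alpha, dtype=float)))
    c = full.size / full.sum()      # (J + 1) / (1 + sum of alpha)
    return c * full, c
```

Because m(·) is nonparametric, the scale factor is absorbed by the function: if the index is multiplied by c, then m*(v) = m(v/c) gives the same risk, so the two parameterizations describe the same model.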

Model (4) generalizes the now-classical generalized partially linear single-index model (Carroll et al., 1997), with the novelty being both the context and that the same single-index $m (X_{ik 0} + X_{ik}^{T} α)$ is used across multiple outcomes.

Single-index models have been widely used as a popular tool in multivariate nonparametric regression to alleviate the “curse of dimensionality” (Bellman, 1961). For example, recently Yu and Ruppert (2002) used penalized spline least squares estimation for single-index models with independent and identically distributed observations: their number of knots was fixed, unlike in our development. Wang and Yang (2009a) proposed polynomial spline estimation and extended the results to weakly dependent response variables. Cui et al. (2011) developed a kernel estimating function method for generalized single-index models, while Ma and Zhu (2013) constructed robust and efficient estimation with high dimensional covariates. These papers are restricted to the case of one population, K = 1, and one outcome, L = 1. Here we consider multiple populations and multiple outcomes. We propose a regression spline based profile estimation procedure and establish the asymptotic properties of the estimators in model (4).

To set the main ideas clearly, it is convenient to first assume that the Y_ik are independent, see Section 4 for the more general cases of interest in the HEI-2005 problem. To ensure identifiability, we set β₁₁ = −1, and we set the first component of θ₁₁ = (θ₁₁₁, …, θ_11d) equal to zero, i.e., θ₁₁₁ = 0. Hence, our parameters are (ν, m), where m is an unspecified function that is sufficiently smooth, while

ν = {(α^{T}, β_{12}, β_{21}, β_{22}, \dots, β_{K 1}, β_{K 2}, θ_{112}, \dots, θ_{11 d}, θ_{12}^{T}, θ_{13}^{T}, θ_{21}^{T}, θ_{22}^{T}, θ_{23}^{T}, \dots, θ_{K 1}^{T}, θ_{K 2}^{T}, θ_{K 3}^{T})}^{T} .

Thus, ν has total dimension d_ν = J + 2K + 2Kd + Ka − 2.
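The count d_ν can be verified by adding up the free parameters block by block. The helper below is a small illustrative check (the function name is hypothetical).

```python
def dim_nu(J, K, d, a):
    """Number of free parameters in nu for model (4)."""
    n_alpha = J                     # alpha, with alpha_0 fixed at 1
    n_beta = 2 * K - 1              # beta_k1, beta_k2 for each k, with beta_11 fixed
    n_theta = K * (2 * d + a) - 1   # theta_k1, theta_k2, theta_k3, with theta_111 fixed
    return n_alpha + n_beta + n_theta
```

The three blocks add to J + (2K − 1) + {K(2d + a) − 1} = J + 2K + 2Kd + Ka − 2, in agreement with the expression for d_ν.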

Let $U_{ik} (α) = X_{ik 0} + X_{ik}^{T} α$ be the realizations of $U_{k} (α) = X_{k 0} + X_{k}^{T} α$ . The unknown function m(·) is estimated by polynomial splines described as follows. Without loss of generality, assume u ∈ [a, b]. Let N = N_n be the number of interior knots. Divide [a, b] into (N + 1) subintervals I_p = [ξ_p, ξ_{p+1}), p = r, r + 1, …, N + r − 1, and I_{N+r} = [ξ_{N+r}, b], where ${(ξ_{p})}_{p = r + 1}^{N + r}$ is a sequence of interior knots, given as

ξ_{1} = \dots = ξ_{r} = a < ξ_{r + 1} < \dots < ξ_{r + N} < b = ξ_{N + r + 1} = \dots = ξ_{N + 2 r} .

Define the distance between neighboring knots as h_p = ξ_p+1 − ξ_p, r ≤ p ≤ N + r, and h = max_r≤p≤N+r h_p. Let G_n be the space of B-splines of order r, so that P_n = N + r is the number of functions in G_n. For u ∈ [a, b], let G_n be the linear space spanned by the B-spline functions B_r(u) = {B_r,p(u), 1 ≤ p ≤ P_n}^T. Then m(u) can be approximated by $\tilde{m} (u) = \sum_{p = 1}^{P_{n}} B_{r, p} (u) λ_{p} = B_{r}^{T} (u) λ$ , where λ = (λ₁, …, λ_{P_n})^T. B-splines have been used frequently to estimate the nonparametric functions in nonparametric and semiparametric models because they are easy to compute with derivable asymptotic theory. See Huang (2003) and Wang and Yang (2009b) for their utility in nonparametric models, Stone (1985) and Huang and Yang (2004) in additive models, and Huang et al. (2002), Liu et al. (2011) and Wang et al. (2011) in semiparametric models.
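The B-spline basis B_r(u) described above can be evaluated with the Cox-de Boor recursion. The following NumPy sketch is one possible implementation, not the code used for the paper; the function names are hypothetical, and equally spaced interior knots are assumed purely for illustration.

```python
import numpy as np

def clamped_knots(a, b, N, r):
    """Knots xi_1, ..., xi_{N+2r}: r copies of a, N interior knots, r copies of b."""
    interior = np.linspace(a, b, N + 2)[1:-1]
    return np.concatenate((np.full(r, a), interior, np.full(r, b)))

def bspline_basis(u, knots, r):
    """Evaluate the P_n = len(knots) - r B-splines of order r at the points u."""
    u = np.atleast_1d(np.asarray(u, dtype=float))
    t = np.asarray(knots, dtype=float)
    # Order 1: indicators of the half-open knot intervals [t_p, t_{p+1}).
    B = ((u[:, None] >= t[None, :-1]) & (u[:, None] < t[None, 1:])).astype(float)
    # Close the last non-empty interval on the right so that u = b is covered.
    last = np.max(np.nonzero(t[:-1] < t[1:])[0])
    B[u == t[-1], last] = 1.0
    # Cox-de Boor recursion from order k - 1 to order k.
    for k in range(2, r + 1):
        ncol = len(t) - k
        new = np.zeros((u.size, ncol))
        for p in range(ncol):
            left_den = t[p + k - 1] - t[p]
            right_den = t[p + k] - t[p + 1]
            if left_den > 0:
                new[:, p] += (u - t[p]) / left_den * B[:, p]
            if right_den > 0:
                new[:, p] += (t[p + k] - u) / right_den * B[:, p + 1]
        B = new
    return B
```

With N = 4 interior knots and order r = 4 (cubic splines), `bspline_basis(u, clamped_knots(0.0, 1.0, 4, 4), 4)` returns a matrix with P_n = N + r = 8 columns whose rows sum to one on [a, b].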

3 Profile Estimating Procedure

Our estimation is performed through a conceptually simple profiling procedure, as described below.

Step 1

Define

{\tilde{H}}_{ik} = H [(β_{k 1} + β_{k 2} G_{ik}) B_{r}^{T} {U_{ik} (α)} λ + Z_{ik}^{T} (θ_{k 1} + θ_{k 2} G_{ik}) + G_{ik} W_{ik}^{T} θ_{k 3}] .

Treating ν as a fixed parameter, estimate m(u) by spline functions $\hat{m} (u, ν) = \sum_{p = 1}^{P_{n}} B_{r, p} (u) {\hat{λ}}_{p} (ν)$ with λ̂(ν) = {λ̂₁(ν), ⋯, λ̂_{P_n}(ν)}^T through maximizing

L_{n} (λ, ν) = \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} {Y_{ik} log ({\tilde{H}}_{ik}) + (1 - Y_{ik}) log (1 - {\tilde{H}}_{ik})} .

(5)
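For fixed ν, the objective (5) is an ordinary logistic log-likelihood that is linear in λ, with covariate vector (β_k1 + β_k2 G_ik) B_r{U_ik(α)} and a known offset, so it is concave in λ. A minimal Newton-Raphson sketch is given below; the choice of Newton-Raphson, the function name and the argument layout are assumptions for illustration and need not match the computational details in the Appendix.

```python
import numpy as np

def fit_lambda(basis, weight, offset, y, n_iter=50, tol=1e-10):
    """Maximize the log-likelihood (5) over lambda with nu held fixed.

    basis  : (n, P_n) matrix with rows B_r{U_ik(alpha)}, stacked over populations
    weight : (n,) values of beta_k1 + beta_k2 * G_ik
    offset : (n,) values of Z'(theta_k1 + theta_k2 * G) + G * W'theta_k3
    y      : (n,) binary responses
    """
    design = weight[:, None] * basis
    lam = np.zeros(basis.shape[1])
    for _ in range(n_iter):
        h = 1.0 / (1.0 + np.exp(-(design @ lam + offset)))
        score = design.T @ (y - h)                            # gradient in lambda
        info = (design * (h * (1.0 - h))[:, None]).T @ design  # minus the Hessian
        step = np.linalg.solve(info, score)
        lam = lam + step
        if np.max(np.abs(step)) < tol:
            break
    return lam
```

The same routine is called repeatedly in the profiling procedure, once for each trial value of ν.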

To prepare for the second step, we perform the following additional calculations. Let B_r−1(u) = {B_r−1,p(u) : 2 ≤ p ≤ P_n}^T be the B-spline functions of order r − 1. We estimate m′(u, ν), the first derivative of m, through $\hat{m}' (u, ν) = \sum_{p = 2}^{P_{n}} B_{r - 1, p} (u) {\hat{λ}}_{p}^{(1)} (ν)$ , where ${\hat{λ}}_{p}^{(1)} (ν) = (r - 1) {{\hat{λ}}_{p} (ν) - {\hat{λ}}_{p - 1} (ν)} / (ξ_{p + r - 1} - ξ_{p})$ , for 2 ≤ p ≤ P_n. This is because the first derivative of a spline function can be expressed in terms of a spline of one order lower; see page 116 of de Boor (2001). Let D = (d_jj′)_{1≤j,j′≤P_n−1} be a (P_n − 1) × (P_n − 1) diagonal matrix with d_jj = 1/(ξ_j+r − ξ_j+1) and d_jj′ = 0 for j ≠ j′, and let D₁₁ = (−D, 0_{P_n−1})_{(P_n−1)×P_n} and D₁₂ = (0_{P_n−1}, D)_{(P_n−1)×P_n}, where 0_{P_n−1} is the (P_n − 1)-dimensional vector with 0’s as its elements. Then $\hat{m}' (u, ν) = B_{r - 1}^{T} (u) D_{1} \hat{λ} (ν)$ , where D₁ = (r − 1) (D₁₁ + D₁₂). For u ∈ [a, b], define

{\hat{σ}}^{2} (u, ν) = B_{r}^{T} (u) {[\sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} V_{ik} (ν) {(β_{k 1} + β_{k 2} G_{ik})}^{2} B_{r} {U_{ik} (α)} B_{r} {U_{ik} (α)}^{T}]}^{- 1} B_{r} (u),

(6)

where V_ik(ν) = H_ik(ν)(1 − H_ik(ν)), and

H_{ik} (ν) = H {(β_{k 1} + β_{k 2} G_{ik}) m (X_{ik 0} + X_{ik}^{T} α) + Z_{ik}^{T} (θ_{k 1} + θ_{k 2} G_{ik}) + G_{ik} W_{ik}^{T} θ_{k 3}} .
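The matrix D₁ used in Step 1 to obtain the derivative coefficients from λ̂(ν) can be assembled exactly as in its definition. The sketch below is illustrative (the function name is hypothetical); it uses 0-based array indexing, so ξ_p corresponds to `knots[p - 1]`.

```python
import numpy as np

def derivative_matrix(knots, r):
    """Return D_1 = (r - 1)(D_11 + D_12), which maps lambda to lambda^(1)."""
    t = np.asarray(knots, dtype=float)
    P = len(t) - r                       # P_n = N + r
    j = np.arange(1, P)                  # j = 1, ..., P_n - 1
    # d_jj = 1 / (xi_{j+r} - xi_{j+1}); xi_p is t[p - 1] in 0-based indexing.
    d = 1.0 / (t[j + r - 1] - t[j])
    D = np.diag(d)
    zero = np.zeros((P - 1, 1))
    D11 = np.hstack((-D, zero))          # (-D, 0)
    D12 = np.hstack((zero, D))           # (0, D)
    return (r - 1) * (D11 + D12)
```

A convenient check uses the standard fact that a spline whose coefficients are the Greville abscissae, λ_p = (ξ_{p+1} + ⋯ + ξ_{p+r−1})/(r − 1), reproduces the identity function u. Its derivative is therefore 1, and `derivative_matrix(knots, r) @ greville` should return a vector of ones.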

Step 2

Define

{\hat{H}}_{ik} (ν) = H {(β_{k 1} + β_{k 2} G_{ik}) B_{r}^{T} (U_{ik} (α)) \hat{λ} (ν) + Z_{ik}^{T} (θ_{k 1} + θ_{k 2} G_{ik}) + G_{ik} W_{ik}^{T} θ_{k 3}} .

(7)

Estimate ν by ν̂ through maximizing

L_{n} (ν) = \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} [Y_{ik} log {{\hat{H}}_{ik} (ν)} + (1 - Y_{ik}) log {1 - {\hat{H}}_{ik} (ν)}] .
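The two steps can be illustrated on a deliberately simplified toy problem: one population (K = 1), no covariates Z or W, a scalar α, and β₁₁ = −1. In the sketch below a quadratic polynomial basis stands in for the B-spline basis and a grid search stands in for the maximization over ν; both substitutions, and all function names, are assumptions made only to keep the example short.

```python
import numpy as np

def fit_inner(design, y, n_iter=25):
    """Step 1: Newton-Raphson for the basis coefficients at a fixed alpha."""
    lam = np.zeros(design.shape[1])
    for _ in range(n_iter):
        h = 1.0 / (1.0 + np.exp(-np.clip(design @ lam, -30.0, 30.0)))
        info = (design * (h * (1.0 - h))[:, None]).T @ design
        lam = lam + np.linalg.solve(info, design.T @ (y - h))
    eta = design @ lam
    # Bernoulli log-likelihood: sum of y * eta - log(1 + exp(eta)).
    return lam, np.sum(y * eta - np.logaddexp(0.0, eta))

def profile_alpha(x0, x1, y, grid):
    """Step 2: choose the alpha on `grid` with the largest profiled log-likelihood."""
    best_alpha, best_ll = None, -np.inf
    for alpha in grid:
        u = x0 + alpha * x1
        # beta_11 = -1; the quadratic basis (1, u, u^2) stands in for B_r(u).
        design = -np.column_stack((np.ones_like(u), u, u ** 2))
        _, ll = fit_inner(design, y)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha
```

The inner fit is recomputed for every trial value of α, which is what makes the outer objective a profile likelihood; in the actual procedure the outer maximization is over the full vector ν.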

Once we obtain ν̂, we plug it into m̂ to obtain {ν̂, m̂(u, ν̂)} as the final estimator. To prepare the description of the asymptotic properties of our procedure, we define

Q_{i 1} (ν) = {(- 1 + β_{12} G_{i 1}) m' (X_{i 10} + X_{i 1}^{T} α) X_{i 1}^{T}, G_{i 1} m (X_{i 10} + X_{i 1}^{T} α), 0_{1, 2 K - 2}, Z_{i 1 2}, \dots, Z_{i 1 d}, G_{i 1} Z_{i 1}^{T}, G_{i 1} W_{i 1}^{T}, 0_{1, (K - 1) (2 d + a)}}^{T},

and for k = 2, …, K define

Q_{ik} (ν) = {(β_{k 1} + β_{k 2} G_{ik}) m' (X_{ik 0} + X_{ik}^{T} α) X_{ik}^{T}, 0_{1, 2 k - 3}, m (X_{ik 0} + X_{ik}^{T} α), G_{ik} m (X_{ik 0} + X_{ik}^{T} α), 0_{1, 2 K - 2 k + (k - 1) (2 d + a) - 1}, Z_{ik}^{T}, G_{ik} Z_{ik}^{T}, G_{ik} W_{ik}^{T}, 0_{1, (K - k) (2 d + a)}}^{T} .

Write $Q_{ik} (ν) = {(Q_{ik, ℓ} (ν))}_{ℓ = 1}^{d_{ν}}$ for the elements of Q_ik(ν). Let

ν^{0} = {(α^{0 T}, β_{12}^{0}, β_{21}^{0}, β_{22}^{0}, \dots, β_{K 1}^{0}, β_{K 2}^{0}, θ_{112}^{0}, \dots, θ_{11 d}^{0}, θ_{12}^{0 T}, θ_{13}^{0 T}, θ_{21}^{0 T}, θ_{22}^{0 T}, θ_{23}^{0 T}, \dots, θ_{K 1}^{0 T}, θ_{K 2}^{0 T}, θ_{K 3}^{0 T})}^{T}

be the collection of the true parameters. Let [a₀, b₀] be the support of $X_{k 0} + X_{k}^{T} α^{0}$ , where α⁰ is the true population parameter. Denote ‖·‖ as the L₂ norm of any square integrable function on [a₀, b₀]. For 1 ≤ ℓ ≤ d_ν, let $η_{ℓ}^{0} (\cdot)$ be the function η_ℓ(·) ∈ L₂([a₀, b₀]) that minimizes $E [\sum_{k = 1}^{K} n_{k} {Q_{ik, ℓ} (ν^{0}) - (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η_{ℓ} (U_{ik} (α^{0}))}^{2} V_{ik}]$ . Also, define $η^{0} {U_{ik} (α)} = [η_{1}^{0} {U_{ik} (α)}, \dots, η_{d_{ν}}^{0} {U_{ik} (α)}]^{T}$ . For notational simplicity, we let V_ik = V_ik(ν⁰), Q_ik = Q_ik(ν⁰) and U_ik = U_ik(α⁰). Let $n = \sum_{k = 1}^{K} n_{k}$ .

In the following three theorems, we establish the consistency, asymptotic normality and efficiency of our procedure.

Theorem 1

Under the conditions in Appendix A.2, when ν is the collection of the true parameters or a $\sqrt{n}$ -consistent estimator of ν⁰, (a) |m̂(u, ν) − m(u)| = O_p{(nh)^−1/2 + h^q} uniformly in u ∈ [a₀, b₀]; (b) |m̂′(u, ν) − m′(u)| = O_p(n^−1/2h^−3/2 + h^q−1) uniformly in u ∈ [a₀, b₀]; and (c) as n → ∞, σ̂⁻¹(u, ν⁰) {m̂(u, ν) − m(u)} → Normal(0, 1).

Theorem 2

Define n = n₁ and for k = 2, …, K, define n_k = nc_kn, where there are constants c_* > 0 and c_** < ∞ such that c_* ≤ c_k = lim_n→∞ c_kn ≤ c_**. Under the conditions in Appendix A.2, ‖ν̂ − ν⁰‖₂ = O_p (n^−1/2), and

n^{1 / 2} (\hat{ν} - ν^{0}) = {[n^{- 1} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} H_{ik} (1 - H_{ik}) {(Q_{ik} - (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η^{0} (U_{ik}))}^{\otimes 2}]}^{- 1} \times [n^{- 1 / 2} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} (Y_{ik} - H_{ik}) {Q_{ik} - (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η^{0} (U_{ik})}] + o_{p} (1),

(8)

for ν̂ in a neighborhood of ν⁰. Then as n → ∞, n^1/2(ν̂ − ν⁰) → Normal(0_{d_ν}, Σ), where

Σ = {(\sum_{k = 1}^{K} c_{k} E [V_{ik} {Q_{ik} - (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η^{0} (U_{ik})}^{\otimes 2}])}^{- 1},

and 0_{d_ν} is a d_ν-dimensional vector with “0” as its elements. Here and throughout the text, a^⊗2 ≡ aa^T for any matrix or vector a.

In practice, Σ is estimated by

\hat{Σ} = n {(\sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} [{\hat{V}}_{ik} {(Q_{ik} (\hat{ν}) - {\hat{Π}}_{n} Q_{ik} (\hat{ν}))}^{\otimes 2}])}^{- 1},

where V̂_ik = Ĥ_ik(ν̂)(1 − Ĥ_ik(ν̂)), Π̂_nQ_ik(ν̂) = (Π̂_nQ_ik,ℓ(ν̂), 1 ≤ ℓ ≤ d_ν)^T, and for 1 ≤ ℓ ≤ d_ν, ${\hat{Π}}_{n} Q_{ik, ℓ} (\hat{ν}) = ({\hat{β}}_{k 1} + {\hat{β}}_{k 2} G_{ik}) B_{r}^{T} (U_{ik} (\hat{α})) {\hat{δ}}_{ℓ}$ , where

{\hat{δ}}_{ℓ} = arg min_{δ_{ℓ} \in R^{P_{n}}} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} {Q_{ik, ℓ} (\hat{ν}) - ({\hat{β}}_{k 1} + {\hat{β}}_{k 2} G_{ik}) B_{r}^{T} (U_{ik} (\hat{α})) δ_{ℓ}}^{2} {\hat{V}}_{ik} .
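The plug-in estimator of Σ amounts to a weighted least squares projection of each column of Q_ik(ν̂) onto the weighted spline basis, followed by inversion of the weighted residual cross-product. The sketch below is illustrative; the function name and the stacked-array layout are assumptions.

```python
import numpy as np

def sigma_hat(Q, basis, weight, V):
    """Plug-in estimate of the asymptotic covariance of n^{1/2}(nu_hat - nu^0).

    Q      : (n, d_nu) rows Q_ik(nu_hat), stacked over populations
    basis  : (n, P_n) rows B_r{U_ik(alpha_hat)}
    weight : (n,) values of beta_hat_k1 + beta_hat_k2 * G_ik
    V      : (n,) values of H_hat_ik * (1 - H_hat_ik)
    Returns (Sigma_hat, residuals), where residuals = Q - Pi_hat_n Q.
    """
    n = Q.shape[0]
    design = weight[:, None] * basis
    sw = np.sqrt(V)[:, None]
    # Weighted least squares for all d_nu columns at once; delta is (P_n, d_nu).
    delta, *_ = np.linalg.lstsq(sw * design, sw * Q, rcond=None)
    resid = Q - design @ delta
    meat = (resid * V[:, None]).T @ resid
    return n * np.linalg.inv(meat), resid
```

Approximate standard errors for ν̂ are then the square roots of the diagonal of Σ̂ divided by n.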

In addition, under the assumption of independence, or conditional independence of the Y_ik given the covariates, our estimation method is semiparametric efficient. We state this as

Theorem 3

Under the conditions in Appendix A.2, profile likelihood estimation of the parameter ν reaches the semiparametric efficiency bound. The minimum variance bound for estimating ν can be further simplified to

{cov}_{opt} {n^{1 / 2} (\hat{ν} - ν^{0})} = {E (\sum_{k = 1}^{K} c_{k} V_{ik} [Q_{ik} Q_{ik}^{T} - {(β_{k 1}^{0} + β_{k 2}^{0} G_{ik})}^{2} {η^{0} (U_{ik})}^{\otimes 2}])}^{- 1} .

The proofs of the theorems are given in the Appendix.

4 Generalizations

4.1 Single Population, Multiple Diseases

In this section, we relax the assumption of independence of the Y_ik given the covariates, and consider the case of a single population with K outcomes, with a common sample size n. The response indicators remain as (Y_i1, …, Y_iK), but now the covariates are the same for each response, and are written as ℂ_i = (G_i, X_i0, X_i, Z_i, G_iW_i), and now we use $U_{i} (α) = X_{i 0} + X_{i}^{T} α$ . Ignoring this correlation and invoking a “working independence” principle, the profile likelihood procedure described in Section 3 will still provide consistent estimation. However, more efficient estimation can be generally obtained through taking into account the correlation structure.

Specifically, the derivative of Y_iklog(H̃_ik) + (1 − Y_ik)log(1 − H̃_ik) with respect to λ is (Y_ik − H̃_ik)(β_k1 + β_k2G_i)B_r(U_i(α)). Translated to the setting of this section, Step 1 in Section 3 is equivalent to solving

\sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} (Y_{ik} - {\tilde{H}}_{ik}) (β_{k 1} + β_{k 2} G_{i}) B_{r} (U_{i} (α)) = 0 .

Here, we modify this step to

Step 1d

Let $Ω_{i} = {(Ω_{i, k, k'})}_{k, k' = 1}^{K}$ represent a working covariance matrix of (Y_i1, …, Y_iK) conditional on ℂ_i. Let B_i(ν) be a K × K matrix with the (k, k′) entry B_i,k,k′ (ν) = Ω_i,k,k′ (β_k1 + β_k2G_i)(β_k′1 + β_k′2G_i). Obtain λ̂_w(ν) by solving $\sum_{i = 1}^{n} B_{r} {U_{i} (α)} {\tilde{A}}_{i} (ν) B_{i} {(ν)}^{- 1} Φ_{i} (ν) = 0$ where

Φ_{i} (ν) = {(Y_{i 1} - {\tilde{H}}_{i 1}) (β_{11} + β_{12} G_{i}), \dots, (Y_{iK} - {\tilde{H}}_{iK}) (β_{K 1} + β_{K 2} G_{i})}^{T},

{\tilde{A}}_{i} (ν) = {{\tilde{V}}_{i 1} {(β_{11} + β_{12} G_{i})}^{2}, \dots, {\tilde{V}}_{iK} {(β_{K 1} + β_{K 2} G_{i})}^{2}},

where Ṽ_ik = H̃_ik(1 − H̃_ik).

Using λ̂_w(ν), we form the corresponding estimators of m(u) and m′(u), which are ${\hat{m}}_{w} (u, ν) = B_{r}^{T} (u) {\hat{λ}}_{w} (ν)$ and ${\hat{m}}_{w}^{'} (u, ν) = {B_{r}^{'} (u)}^{T} {\hat{λ}}_{w} (ν)$ . Define A_i(ν) = {V_i1(β₁₁ + β₁₂G_i)², …, V_iK(β_K1 + β_K2G_i)²}. Let A_i = A_i(ν⁰) and B_i = B_i(ν⁰). For u ∈ [a, b], define ${\hat{σ}}_{w}^{2} (u, ν) = B_{r}^{T} (u) Π_{n}^{- 1} Ξ_{n} Π_{n}^{- 1} B_{r} (u)$ , where

Π_{n} = \sum_{i = 1}^{n} B_{r} (U_{i}) A_{i} B_{i}^{- 1} A_{i}^{T} B_{r} {(U_{i})}^{T},

Ξ_{n} = \sum_{i = 1}^{n} B_{r} (U_{i}) A_{i} B_{i}^{- 1} Q_{i} B_{i}^{- 1} A_{i}^{T} B_{r} {(U_{i})}^{T},

and where Q_i is a K × K matrix with the (k, k′) entry

Q_{i, k, k'} (ν) = {E (Y_{ik} Y_{ik'} | ℂ_{i}) - H_{ik} H_{ik'}} (β_{k 1}^{0} + β_{k 2}^{0} G_{i}) (β_{k' 1}^{0} + β_{k' 2}^{0} G_{i}) .

In the description above, Ω_i is a generic working covariance matrix. Here is how we implemented it. Let Ω_i be the conditional covariance matrix of Y_i = (Y_i1, …, Y_iK)^T given ℂ_i. Then the (k, k′) entry of Ω_i is Ω_i,k,k′ = E(Y_ikY_ik′ | ℂ_i) − H_ikH_ik′. In practice, we estimate Ω_i by ${\hat{Ω}}_{i} = {\hat{V}}_{i}^{1 / 2} \hat{R} {\hat{V}}_{i}^{1 / 2}$ , where V̂_i is a K × K diagonal matrix with k^th diagonal entry Ĥ_ik(1 − Ĥ_ik) and $\hat{R} = n^{- 1} \sum_{i = 1}^{n} {\hat{V}}_{i}^{- 1 / 2} (Y_{i} - {\hat{H}}_{i}) {(Y_{i} - {\hat{H}}_{i})}^{T} {\hat{V}}_{i}^{- 1 / 2}$ , where Ĥ_i = (Ĥ_i1, …, Ĥ_iK)^T and Ĥ_ik = Ĥ_ik(ν̂).
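The estimated working covariance is the familiar construction from Pearson residuals. The following vectorized sketch is illustrative (the function name and array layout are assumptions); it takes the n × K matrices of responses and fitted probabilities from an initial working-independence fit.

```python
import numpy as np

def working_covariance(Y, H_hat):
    """Return (R_hat, Omega_hat), with Omega_hat[i] = V_i^{1/2} R_hat V_i^{1/2}.

    Y     : (n, K) binary outcomes
    H_hat : (n, K) fitted probabilities H_hat_ik(nu_hat), strictly inside (0, 1)
    """
    Y = np.asarray(Y, dtype=float)
    H_hat = np.asarray(H_hat, dtype=float)
    sd = np.sqrt(H_hat * (1.0 - H_hat))      # diagonal of V_i^{1/2}, one row per i
    pearson = (Y - H_hat) / sd               # V_i^{-1/2}(Y_i - H_hat_i)
    R_hat = pearson.T @ pearson / Y.shape[0]
    # Scale R_hat on both sides by the subject-specific standard deviations.
    Omega_hat = sd[:, :, None] * R_hat[None, :, :] * sd[:, None, :]
    return R_hat, Omega_hat
```

The matrix R̂ is common to all individuals, while Ω̂_i varies with i through the fitted variances.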

Similarly, the derivative of the (i, k) term in (7) with respect to ν is

(Y_{ik} - {\hat{H}}_{ik} (ν)) {{\hat{Q}}_{ik} (ν) + (β_{k 1} + β_{k 2} G_{i}) {{\hat{λ}}_{w}^{'} (ν)}^{T} B_{r} (X_{i 0} + X_{i}^{T} α)},

where Ĥ_ik(ν) is the same as H̃_ik except that λ in H̃_ik is replaced by λ̂_w(ν) in Ĥ_ik(ν), and Q̂_ik(ν) is the same as Q_ik(ν) except that m(·), m′(·) in Q_ik(ν) are replaced by m̂_w(·, ν), ${\hat{m}}_{w}^{'} (\cdot, ν)$ in Q̂_ik, and ${\hat{λ}}_{w}^{'} (ν) = \partial {\hat{λ}}_{w} (ν) / \partial ν^{T}$ is the P_n × d_ν derivative matrix of λ̂_w(ν) with respect to ν. We thus modify Step 2 to

Step 2d

Let Ψ_i(ν) be the d_νK × 1 vector formed by K vectors, each of length d_ν, with the k^th, k = 1, …, K, vector being $(Y_{ik} - {\hat{H}}_{ik} (ν)) {{\hat{Q}}_{ik} (ν) + (β_{k 1} + β_{k 2} G_{i}) {{\hat{λ}}_{w}^{'} (ν)}^{T} B_{r} (X_{i 0} + X_{i}^{T} α)}$ . Obtain ν̂_w from solving $\sum_{i = 1}^{n} {\hat{C}}_{i} (ν) {\hat{D}}_{i} {(ν)}^{- 1} Ψ_{i} (ν) = 0$ , where Ĉ_i(ν) is a d_ν × d_νK matrix, with k^th block

{\hat{C}}_{i, k} (ν) = {\hat{V}}_{ik} (ν) {{\hat{Q}}_{ik} (ν) + (β_{k 1} + β_{k 2} G_{i}) {\hat{λ}}_{w}^{'} {(ν)}^{T} B_{r} (X_{i 0} + X_{i}^{T} α)}^{\otimes 2},

where V̂_ik(ν) = Ĥ_ik(ν)(1 − Ĥ_ik(ν)) and D̂_i(ν) is a d_νK × d_νK matrix, with (k, k′) block

{\hat{D}}_{i, k, k'} (ν) = Ω_{i, k, k'} {{\hat{Q}}_{ik} (ν) + (β_{k 1} + β_{k 2} G_{i}) {\hat{λ}}_{w}^{'} {(ν)}^{T} B_{r} (X_{i 0} + X_{i}^{T} α)} \times {{\hat{Q}}_{i k'} (ν) + (β_{k' 1} + β_{k' 2} G_{i}) {\hat{λ}}_{w}^{'} {(ν)}^{T} B_{r} (X_{i 0} + X_{i}^{T} α)}^{T},

and ${\hat{λ}}_{w}^{'} (ν)$ can be obtained via numerical differentiation. Let β₁+β₂G_i = {(β_k1 + β_k2G_i), 1 ≤ k ≤ K}^T, and $Θ_{i} (ν) = diag [{V_{ik} (ν) (β_{k 1} + β_{k 2} G_{i})}_{k = 1}^{K}]$ . Then A_i(ν) = (β₁ + β₂G_i)^TΘ_i(ν). Denote 1_{d_ν} as the d_ν-dimensional vector with 1’s as its elements. Let Θ_i = Θ_i(ν⁰). Let Q_i = (Q_i1, …, Q_iK)^T and let η be a vector of functions η(u) = {η₁(u), …, η_{d_ν} (u)}^T with η_ℓ(·) ∈ L₂([a, b]) that minimizes

1_{d_{ν}}^{T} E [{Q_{i} - (β_{1}^{0} + β_{2}^{0} G_{i}) η^{T} (U_{i})}^{T} Θ_{i} B_{i}^{- 1} Θ_{i} {Q_{i} - (β_{1}^{0} + β_{2}^{0} G_{i}) η^{T} (U_{i})}] 1_{d_{ν}} .

Define C_i as a d_ν × d_ν K matrix, with k^th block $C_{i, k} = V_{ik} {Q_{ik} - (β_{k 1}^{0} + β_{k 2}^{0} G_{i}) η (U_{i})}^{\otimes 2}$ . Define D_i as a d_νK × d_νK matrix, with (k, k′) block

D_{i, k, k'} = Ω_{i, k, k'} {Q_{ik} - (β_{k 1}^{0} + β_{k 2}^{0} G_{i}) η (U_{i})} {Q_{ik'} - (β_{k' 1}^{0} + β_{k' 2}^{0} G_{i}) η (U_{i})}^{T},

and define $D_{i}^{*}$ as a d_ν K × d_νK matrix, with (k, k′) block

D_{i, k, k'}^{*} = {E (Y_{ik} Y_{ik'} | ℂ_{i}) - H_{ik} H_{ik'}} {Q_{ik} - (β_{k 1}^{0} + β_{k 2}^{0} G_{i}) η (U_{i})} \times {Q_{ik'} - (β_{k' 1}^{0} + β_{k' 2}^{0} G_{i}) η (U_{i})}^{T} .

In the following two theorems, we establish the consistency and asymptotic normality of our procedure. Different from the independent disease case, without a correct specification of the correlation structure of the occurrences of different diseases, we can no longer achieve semiparametric efficiency.

Theorem 4

Under the conditions in Appendix A.2, when ν is the collection of the true parameters or a $\sqrt{n}$ -consistent estimator of ν⁰, (a) ${\hat{σ}}_{w}^{- 1} (u, ν^{0}) {{\hat{m}}_{w} (u, ν) - m (u)} \to Normal (0, 1)$ ; (b) |m̂_w(u, ν) − m(u)| = O_p{(nh)^−1/2 + h^q} uniformly in u ∈ [a₀, b₀]; and (c) $| {\hat{m}}_{w}^{'} (u, ν) - m' (u) | = O_{p} (n^{- 1 / 2} h^{- 3 / 2} + h^{q - 1})$ uniformly in u ∈ [a₀, b₀].

Let ${\hat{D}}_{i}^{*}$ be a d_νK × d_νK matrix, with (k, k′) block

{\hat{D}}_{i, k, k'}^{*} = {\hat{Ω}}_{i, k, k'} {Q_{ik} (\hat{ν}) - ({\hat{β}}_{k 1} + {\hat{β}}_{k 2} G_{i}) \hat{η} (U_{i} (\hat{ν}))} {Q_{ik'} (\hat{ν}) - ({\hat{β}}_{k' 1} + {\hat{β}}_{k' 2} G_{i}) \hat{η} (U_{i} (\hat{ν}))}^{T},

where η̂(U_i(ν̂)) = {η̂₁(U_i(ν̂)), …, η̂_{d_ν} (U_i(ν̂))}^T and ${\hat{η}}_{ℓ} (U_{i} (\hat{ν})) = B_{r}^{T} (U_{i} (\hat{ν})) {\hat{τ}}_{ℓ}$ with {τ̂_ℓ} minimizing

1_{d_{ν}}^{T} \sum_{i = 1}^{n} [{Q_{i} (\hat{ν}) - ({\hat{β}}_{1} + {\hat{β}}_{2} G_{i}) {\hat{η}}^{T} (U_{i} (\hat{ν}))}^{T} Θ_{i} (\hat{ν}) B_{i} {(\hat{ν})}^{- 1} Θ_{i} (\hat{ν}) \times {Q_{i} (\hat{ν}) - ({\hat{β}}_{1} + {\hat{β}}_{2} G_{i}) {\hat{η}}^{T} (U_{i} (\hat{ν}))}] 1_{d_{ν}} .

Theorem 5

Let (C, D, D*) be generic notation for random variables with the same distribution as (C_i, D_i, $D_{i}^{*}$ ). Under the conditions in Appendix A.2, $\sqrt{n} (\hat{ν} - ν^{0}) \to Normal (0_{d_{ν}}, Σ)$ for ν̂ in a neighborhood of ν⁰, where

Σ = {E (C D^{- 1} C^{T})}^{- 1} E (C D^{- 1} D^{*} D^{- 1} C^{T}) {E (C D^{- 1} C^{T})}^{- 1} .

Here, Σ is consistently estimated by the sandwich estimator

\hat{Σ} = {(n^{- 1} \sum_{i = 1}^{n} {\hat{C}}_{i} {\hat{D}}_{i}^{- 1} {\hat{C}}_{i}^{T})}^{- 1} (n^{- 1} \sum_{i = 1}^{n} {\hat{C}}_{i} {\hat{D}}_{i}^{- 1} {\hat{D}}_{i}^{*} {\hat{D}}_{i}^{- 1} {\hat{C}}_{i}^{T}) {(n^{- 1} \sum_{i = 1}^{n} {\hat{C}}_{i} {\hat{D}}_{i}^{- 1} {\hat{C}}_{i}^{T})}^{- 1} .

(9)
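Given per-subject arrays for Ĉ_i, D̂_i and D̂_i*, the sandwich estimator (9) is a direct computation. The sketch below is illustrative: the function name and array layout are assumptions, and the matrices D̂_i are assumed to be numerically invertible, as formula (9) requires.

```python
import numpy as np

def sandwich(C, D, D_star):
    """Sandwich estimator (9).

    C      : (n, d, dK) array holding C_hat_i
    D      : (n, dK, dK) array holding D_hat_i (assumed invertible)
    D_star : (n, dK, dK) array holding D_hat_i^*
    """
    n, d = C.shape[0], C.shape[1]
    bread = np.zeros((d, d))
    meat = np.zeros((d, d))
    for Ci, Di, Dsi in zip(C, D, D_star):
        right = np.linalg.solve(Di, Ci.T)        # D_i^{-1} C_i^T
        left = np.linalg.solve(Di.T, Ci.T).T     # C_i D_i^{-1}
        bread += Ci @ right                      # C_i D_i^{-1} C_i^T
        meat += left @ Dsi @ right               # C_i D_i^{-1} D_i^* D_i^{-1} C_i^T
    bread_inv = np.linalg.inv(bread / n)
    return bread_inv @ (meat / n) @ bread_inv
```

When the working covariance is correctly specified, so that D̂_i* = D̂_i, the middle term equals the outer term and the estimator collapses to the inverse of the outer term.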

4.2 Multiple Populations and Multiple Diseases

Finally, we consider the general case that there are k = 1, …, K independent populations, and within the k^th population, there are ℓ = 1, …, L_k diseases and i = 1, …, n_k observations. The outcomes are Y_ikℓ and the covariates are ℂ_ik = (G_ik, X_ik0, X_ik, Z_ik, G_ik W_ik). The model is

pr (Y_{ik ℓ} = 1 | ℂ_{ik}) = H_{ik ℓ} = H {(β_{k ℓ 1} + β_{k ℓ 2} G_{ik}) m (X_{ik 0} + X_{ik}^{T} α) + Z_{ik}^{T} (θ_{k ℓ 1} + θ_{k ℓ 2} G_{ik}) + G_{ik} W_{ik}^{T} θ_{k ℓ 3}} .

(10)

We make the same assumptions as in Appendix A.2. As in Theorem 2, we write n = n₁ and for k = 2, …, K, define n_k = nc_kn, where there are constants c_* > 0 and c_** < ∞ such that c_* ≤ c_k = lim_n→∞ c_kn ≤ c_**.

Make the definitions of the terms in Section 4.1 appropriate to population k = 1, …, K, e.g., Ã_ik(ν), B_ik(ν), Φ_ik(ν), Ĉ_ik(ν), D̂_ik(ν), Ψ_ik(ν), C_ik, D_ik, Π_nk, Ξ_nk, etc. Obtain λ̂_w(ν) by solving $\sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} B_{r} {U_{ik} (α)} {\tilde{A}}_{ik} (ν) B_{ik} {(ν)}^{- 1} Φ_{ik} (ν) = 0$ , and obtain ν̂_w by solving $\sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} {\hat{C}}_{ik} (ν) {\hat{D}}_{ik} {(ν)}^{- 1} Ψ_{ik} (ν) = 0$ . Define

{\hat{σ}}_{w}^{2} (u, ν) = B_{r}^{T} (u) {\sum_{k = 1}^{K} (n_{k} / n) Π_{nk}}^{- 1} {\sum_{k = 1}^{K} (n_{k} / n) Ξ_{nk}} {\sum_{k = 1}^{K} (n_{k} / n) Π_{nk}}^{- 1} B_{r} (u);

Σ = {\sum_{k = 1}^{K} c_{k} E (C_{ik} D_{ik}^{- 1} C_{ik}^{T})}^{- 1} {\sum_{k = 1}^{K} c_{k} E (C_{ik} D_{ik}^{- 1} D_{ik}^{*} D_{ik}^{- 1} C_{ik}^{T})} \times {\sum_{k = 1}^{K} c_{k} E (C_{ik} D_{ik}^{- 1} C_{ik}^{T})}^{- 1};

\hat{Σ} = {(n^{- 1} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} {\hat{C}}_{ik} {\hat{D}}_{ik}^{- 1} {\hat{C}}_{ik}^{T})}^{- 1} (n^{- 1} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} {\hat{C}}_{ik} {\hat{D}}_{ik}^{- 1} {\hat{D}}_{ik}^{*} {\hat{D}}_{ik}^{- 1} {\hat{C}}_{ik}^{T}) \times {(n^{- 1} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} {\hat{C}}_{ik} {\hat{D}}_{ik}^{- 1} {\hat{C}}_{ik}^{T})}^{- 1} .

Then Theorems 4–5 hold with these definitions, see Appendix A.8, and Σ̂ remains a sandwich estimator.

As in other problems involving correlated binary data and generalized estimating equations, the semiparametric efficiency established in Theorem 3 does not hold for the multiple populations and multiple diseases case, mainly due to the fact that the responses are correlated among different diseases and the correlation structure is unknown. Discussion of the working correlation matrix in parametric generalized estimating equation problems can be found in many papers, see for example Chaganty and Joe (2004).

Instead of embedding the problem in the generalized estimating equation framework, as we have done, there is some literature on developing a likelihood function that allows correlation among the binary responses while having the marginal probabilities be of logistic form, see for example Zhao and Prentice (1990) and Le Cessie and Van Houwelingen (1994). Our methods can be extended to this approach, but the ease of computation associated with a generalized estimating equation approach is a considerable advantage. This computational advantage is one of the reasons that generalized estimating equation methods are so widely employed in practice.

5 Data Analysis

5.1 Spline Setup

In all our implementations, we used cubic splines (r = 4) with equally spaced knots to approximate the nonparametric function m(·). We selected the number of interior knots N by minimizing a BIC criterion, where BIC(N) = −2L_n(λ̂, ν̂) + (N + p)log(n). See Xue and Yang (2006) and Ma and Yang (2011) for the properties of the BIC criterion.
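The knot selection rule above amounts to a small search over candidate values of N. The sketch below is only an illustration under stated assumptions: `fit_loglik` is a hypothetical callable that refits the model with N interior knots and returns the maximized log-likelihood L_n(λ̂, ν̂), p is supplied by the caller as the parameter count entering the penalty, and the toy log-likelihood values are invented purely to exercise the code.

```python
import math
from typing import Callable, Dict, Iterable, Tuple


def bic(loglik: float, n_interior_knots: int, p: int, n_obs: int) -> float:
    """BIC(N) = -2 L_n + (N + p) log(n), as written in Section 5.1."""
    return -2.0 * loglik + (n_interior_knots + p) * math.log(n_obs)


def select_knots(candidates: Iterable[int],
                 fit_loglik: Callable[[int], float],
                 p: int,
                 n_obs: int) -> Tuple[int, Dict[int, float]]:
    """Return the candidate N with the smallest BIC, plus all BIC values.

    fit_loglik is a placeholder for whatever routine fits the model with
    N interior knots; it is not part of the original paper's code.
    """
    scores = {N: bic(fit_loglik(N), N, p, n_obs) for N in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Toy illustration with made-up log-likelihoods (not results from the paper).
toy_loglik = {1: -520.0, 2: -505.0, 3: -503.5, 4: -503.0}
best_N, all_scores = select_knots(toy_loglik, toy_loglik.__getitem__, p=5, n_obs=1000)
```

In the toy example the gain in log-likelihood beyond N = 2 is too small to offset the log(n) penalty per additional knot, so the smaller model is selected.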

5.2 Dietary Score Example

We applied our methods to the NIH-AARP Study of Diet and Health (Schatzkin et al., 2001). The method used for assessing dietary component intakes is the National Cancer Institute’s Dietary History Questionnaire (DHQ) (Subar et al., 2001). There were 294,673 men and 199,285 women in the data set. There were also dummy variables for various groups of age, body mass index, education, ethnicity, physical activity and smoking, making up the variables Z. In addition, for women, there were two dummy variables for hormone replacement therapy, making up the variables W. The HEI-2005 score for whole grains were taken as X_ik0 and X_ik. The sum of the weights was normalized to equal J + 1 = 12 for ease of comparison with the HEI-2005 total score, all of whose weights = 1: the standard errors of these weights were obtained by the delta-method after fitting the data as described in Sections 3 and 4.

For women, the data set contains four diseases, breast cancer, ovarian cancer, colorectal cancer and lung cancer, while for men there are prostate cancer, colorectal cancer and lung cancer. See Table 2 for the numbers and percentages of cancer cases. The minimum HEI-2005 total score in the data set was x_min = 19.67, while the maximum was x_max = 96.61.

Table 2.

Summary of the NIH-AARP data.

	Men		Women
Description	# Cases	Percentages	# Cases	Percentages
Sample size	294,673		199,285
Breast cancer			7,736	3.88%
Ovarian cancer			759	0.38%
Prostate cancer	23,477	7.97%
Colorectal cancer	4,693	1.59%	2,291	1.15%
Lung cancer	6,135	2.08%	3,630	1.82%


We used $F {U (\hat{α})} = Φ ([U (\hat{α}) - E {U (\hat{α})}] / \sqrt{Var {U (\hat{α})}})$ to construct B-spline functions, where Φ(·) is the distribution function of the standard normal distribution and U (α̂) = X₀ + X^Tα̂. Thus the nonparametric function m is estimated by $\hat{m} {u (\hat{α}), ν} = \sum_{p = 1}^{P_{n}} B_{r, p} [F {u (\hat{α})}] {\hat{λ}}_{p} (ν)$ .
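A minimal sketch of this construction follows. It is illustrative rather than the authors' implementation: the sample mean and standard deviation stand in for E{U(α̂)} and Var{U(α̂)}, the knots are taken to be equally spaced on [0, 1] after the transformation with repeated boundary knots (an assumption), and the function names are hypothetical. The basis is evaluated with the standard Cox-de Boor recursion.

```python
from statistics import NormalDist, fmean, pstdev
from typing import List, Sequence


def transform_index(u_values: Sequence[float]) -> List[float]:
    """Map index values to (0, 1) via the standard normal CDF of the z-score."""
    mu, sd = fmean(u_values), pstdev(u_values)
    std_normal = NormalDist()
    return [std_normal.cdf((u - mu) / sd) for u in u_values]


def bspline_basis(x: float, n_interior: int, order: int = 4) -> List[float]:
    """All N + r B-spline basis functions of the given order at x in [0, 1]."""
    interior = [j / (n_interior + 1) for j in range(1, n_interior + 1)]
    knots = [0.0] * order + interior + [1.0] * order
    n_basis = n_interior + order
    # Order-1 (piecewise constant) functions on each knot interval.
    b = [1.0 if knots[j] <= x < knots[j + 1] else 0.0
         for j in range(len(knots) - 1)]
    if x >= 1.0:
        b[n_basis - 1] = 1.0  # close the last interval on the right
    # Cox-de Boor recursion up to the requested order; 0/0 is treated as 0.
    for k in range(2, order + 1):
        nb = []
        for j in range(len(knots) - k):
            left_den = knots[j + k - 1] - knots[j]
            right_den = knots[j + k] - knots[j + 1]
            left = (x - knots[j]) / left_den * b[j] if left_den > 0 else 0.0
            right = (knots[j + k] - x) / right_den * b[j + 1] if right_den > 0 else 0.0
            nb.append(left + right)
        b = nb
    return b


def m_hat(x: float, lam: Sequence[float], order: int = 4) -> float:
    """Spline value sum_p B_{r,p}(x) * lam_p at a transformed index x."""
    basis = bspline_basis(x, len(lam) - order, order)
    return sum(bp * lp for bp, lp in zip(basis, lam))
```

Because the basis functions sum to one at every point of [0, 1], setting all coefficients equal to a constant reproduces that constant, which gives a quick check of the evaluation.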

We performed two analyses. In the first, we took a single disease and the two independent populations of men and women, using the method in Section 3, and applied to colorectal cancer and lung cancer separately. In the second, we analyzed all the cancer outcomes, using the method in Section 4.2. The point of doing the former is to illustrate that analyzing single diseases at a time can lead to very different results than those from analyzing multiple diseases simultaneously, a point we made in Section 1.

5.3 Independent Populations, Single Disease

Our first analysis uses the setup in Section 3, where there are K = 2 independent populations, men and women, and L = 1 disease. We performed analyses separately for colorectal cancer and lung cancer, and display here the results for both. Because hormone replacement therapy occurs only for women, the right-hand side of model (4) is identifiable when the parameter subscripts do not involve k, e.g., β₁ + β₂G_ik.

Table 3 shows the estimates of the weights of the component scores, their standard errors and their p-values for testing whether the weights = 1, i.e., whether the weight equals the HEI-2005 weight.

Table 3.

Results for the analysis of Section 5.3, where lung cancer and colorectal cancer were analyzed separately, thus each analysis has one disease and 2 independent populations. The weights of the component scores were normalized so that their sum = 12, thus placing the weights on the same scale as the HEI-2005 total score, whose weights all = 1. The p-values for the test that the individual components = 1 are also displayed.

	Colorectal Cancer			Lung Cancer
	Estimate	se	p-value	Estimate	se	p-value
Total Fruit	0.27	0.87	0.40	2.18	0.34	0.00
Total Grains	2.58	0.85	0.06	2.96	0.33	0.00
Whole Grains	2.44	0.85	0.09	0.53	0.27	0.08
Total Vegetables	0.01	1.02	0.33	0.99	0.36	0.98
DOL Vegetables	1.33	0.72	0.65	0.99	0.26	0.96
Dairy	2.44	0.42	0.00	0.42	0.10	0.00
Meat and Beans	0.00	0.53	0.06	0.00	0.18	0.00
Oils	0.58	0.32	0.20	0.33	0.11	0.00
Sodium	0.80	0.45	0.65	1.12	0.16	0.45
Saturated Fat	0.53	0.31	0.13	0.94	0.13	0.65
Empty Calories	0.49	0.21	0.02	0.21	0.08	0.00


The main conclusion of Table 3 is that the weights for the HEI-2005 component scores are strikingly different for total fruit, whole grains and dairy, depending on whether one is interested in colorectal cancer or lung cancer. This is a point we made in the discussion after equation (2) about having a single score and not one for every disease. Table 3 suggests that if one is worried about colorectal cancer, one should increase consumption of whole grains and dairy products, but if one is worried about lung cancer, such consumption would have only a minor effect, but total fruit intake should be increased.

One can see in Table 3 that the weights of many of the individual component scores differ from HEI-2005’s weight of 1.0. We also tested whether the HEI-2005 weights fit the data as well, by testing H₀ : α₁ = α₂ = ⋯ = α_J = 1. To this end, we constructed the Wald chi-square statistic $χ_{W}^{2} = {(\hat{α} - 1_{J})}^{T} {\hat{V} (\hat{α})}^{- 1} (\hat{α} - 1_{J})$ , which has an asymptotic chi-square distribution with J degrees of freedom under H₀. Here, V̂ (α̂) is the estimated asymptotic variance-covariance matrix of α̂, and is calculated following Theorem 5. The p-value for this hypothesis is < 0.0001.
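The Wald statistic above can be computed as in the following sketch, which assumes that α̂ and its estimated covariance matrix V̂(α̂) (the covariance of α̂ itself, not of √n α̂) are already available. The function names are hypothetical, and the chi-square tail probability uses the standard recurrence for integer degrees of freedom so that only the Python standard library is needed.

```python
import math
from typing import List, Sequence, Tuple


def solve(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [[float(v) for v in row] + [float(bi)] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - rest) / aug[i][i]
    return x


def chi2_sf(x: float, df: int) -> float:
    """Upper tail of the chi-square distribution with integer df."""
    if x <= 0.0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)                 # df = 2
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))      # df = 1
    # Q(a + 1, y) = Q(a, y) + y^a e^{-y} / Gamma(a + 1)
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q


def wald_test(alpha_hat: Sequence[float],
              cov: Sequence[Sequence[float]],
              null_value: float = 1.0) -> Tuple[float, float]:
    """Wald statistic and p-value for H0: every component equals null_value."""
    diff = [a - null_value for a in alpha_hat]
    stat = sum(d * z for d, z in zip(diff, solve(cov, diff)))
    return stat, chi2_sf(stat, len(diff))
```

For instance, with a hypothetical α̂ = (1.5, 0.5) and covariance 0.25·I, the statistic is 2.0 on 2 degrees of freedom.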

5.4 Multiple Populations and Multiple Diseases Analysis

Our second analysis uses the setup in Section 4.2, with all the cancers available in our data set: lung, colorectal, breast and ovarian cancers for women, and lung, colorectal and prostate cancers for men. We found that the working correlations among men and women were all < 0.03 in absolute value, so we report results for the working independence estimate.

Table 4 shows the estimated weights of the component scores and their standard errors. Because we are using multiple diseases and populations, and not just colorectal or lung cancer separately, but all the cancers simultaneously, we can expect differences between Table 4 and either analysis in Table 3. One of the striking differences is the vast down-weighting of increased Milk consumption compared to the results for colorectal cancer only. In the HEI-2005 score, increase of Milk consumption results in a monotonically increasing score for Milk. In the colorectal cancer case, a person who gets the top score of 10 on Milk contributes 24.4 to the single index. After accounting for the other diseases, however, the contribution is 0.61, a vast decrease. To us, this makes perfect sense, because the value of increased Milk consumption in adults is hardly universally accepted. For example, the Alternative Healthy Eating Index (McCullough et al., 2002) does not even include dairy as part of its index, i.e., increased Milk consumption gets zero weight. The Modified Mediterranean Diet Score (Trichopoulou et al., 2005) and the MedDietScore (Panagiotakos et al., 2006) have been shown to be related to overall survival and coronary heart disease, respectively, but for these scores, increases in Milk consumption lead to decreases in the score for Milk, i.e., negative weight.

Table 4.

Results for estimated weights α̂ in the analysis of Section 5.4, with two populations (men and women), three diseases for men (lung, colorectal and prostate cancer) and four diseases for women (lung, colorectal, breast and ovarian cancer). The weights of the component scores were normalized so that their sum = 12, thus placing the weights on the same scale as the HEI-2005 total score, whose weights all = 1. The p-values for the test that the individual components = 1 are also displayed. The estimated weight for Meat and Beans was actually negative, but we have set it = 0 for nutritional purposes.

	Estimate	se	p-value
Total Fruit	1.89	0.31	0.00
Whole Fruit	1.32	0.30	0.27
Total Grains	2.94	0.30	0.00
Whole Grains	0.70	0.26	0.32
Total Vegetables	0.97	0.34	0.93
DOL Vegetables	0.93	0.24	0.81
Dairy	0.61	0.09	0.00
Meat and Beans	0.00	0.17	0.00
Oils	0.39	0.11	0.00
Sodium	1.13	0.15	0.36
Saturated Fat	0.89	0.12	0.40
Empty Calories	0.23	0.07	0.00


Table 5 shows the estimates, standard errors (se) and p-values for the coefficients β. The β coefficient for men associated with lung cancer was set = −1 for identifiability. However, when we instead set the coefficient for lung cancer for women to be = −1, the estimated coefficient for lung cancer for men was −1.06, and the p-value was very small. It is clear from the table that the real practical impact of diet here is its contribution to decreases in risk for lung and colorectal cancers, for both men and women, and that the impact is greater for lung cancers. See below for a discussion of the relative risks, displayed in Figure 2, which supports our conclusion. The estimated values for all other groups are also negative except for two groups: (a) men and prostate cancer; and (b) women and ovarian cancer, where the coefficients are very small: they have both been set = 0 under the constraint that a better diet is not a risk factor for either cancer. In the figures that we discuss, the index (x-axis) plotted is from the 3^rd to the 97^th percentiles of the actual index.

Table 5.

Results for β̂ for the analysis of Section 5.4, with two populations (men and women), three diseases for men (lung, colorectal and prostate cancer) and four diseases for women (lung, colorectal, breast and ovarian cancer). The weights of the component scores were normalized so that their sum = 12, thus placing the weights on the same scale as the HEI-2005 total score, whose weights all = 1. The p-values for the test that the individual β̂ terms = 0 are also displayed. The actual estimated coefficients for Prostate and Ovarian cancers were positive, but we have set them = 0 for nutritional purposes, with the constraint that a better diet is not a risk factor for either disease.

	Estimate	se	p-value
Men, Lung	−1.00	NA	NA
Men, Colorectal	−0.39	0.06	0.00
Men, Prostate	0.00	0.06	0.07
Women, Lung	−0.91	0.07	0.00
Women, Colorectal	−0.28	0.08	0.00
Women, Breast	−0.07	0.04	0.11
Women, Ovarian	0.00	0.14	0.90


Figure 2.

Analysis of multiple diseases as in Section 5.4. Relative risks for men and women on a grid between the 3^rd and 97^th percentiles of the index. Left panel is for men: solid blue line is the relative risk for lung cancer, while the dashed red line is for colorectal cancer. The right panel is for women: solid blue line is the relative risk for lung cancer, dashed red line is for colorectal cancer and the dot-dashed magenta line is for breast cancer.

Figure 1 shows the plot of the estimates of m(·) against the index u (α̂) along with pointwise 95% confidence intervals, without any additional monotonicity constraints on m(·). The estimated function itself is monotone as expected. Observe that the estimated function is not an exact linear function, especially when considering the pointwise confidence intervals. Indeed, from the index value of 50 to 72, the estimated function has an increasing acceleration, then it becomes flat, and it increases quickly again when the index value is greater than 82. When we refit the data with a linear link, the results, while different, are in good agreement in the estimated functions, the tables, and the analyses that are described next.

Figure 1.

Analysis of multiple diseases as in Section 5.4. The function m̂(X^Tα̂) along with its pointwise 95% confidence interval.

Figure 2 displays the estimated relative risks of the various cancers, separately for men and women. Clearly, we see that the index predicts stronger decreases in lung cancer relative risks for a better diet index score as compared to the other diseases. In Figure 2, the effect of better diet on prostate cancer in men and ovarian cancer in women is nearly null, and the effect of better diet is quite modest on breast cancer. When analyzing the HEI-2005 total score, the p-values for prostate cancer, breast cancer and ovarian cancer were 0.15, 0.09, and 0.44, respectively, roughly what is seen in Figure 2.

Figure S.1 of the Supplementary Material shows how the estimated relative risks differ between men and women for lung and colorectal cancer. In both cases, women have the lower risk in general, with the largest difference being in colorectal cancer, but even there, the differences are not great. This agrees with the marginal rate of lung cancer for men and women being 2.08% and 1.82%, respectively, while the marginal rate of colorectal cancer for men and women is 1.59% and 1.15%, respectively, see Table 2.

The hypothesis for testing that the weights all equal 1.0 is rejected with a p-value numerically very close to 0, as expected.

6 Simulation

In this section, we describe a simulation study to assess the finite-sample performance of our method in the case of two populations and multiple diseases. Section S.1 of the Supplementary Material has results for two independent populations and one disease. Here we simulated data from the logistic model with multiple populations and diseases, so that

pr (Y_{ik ℓ} = 1 | ℂ_{ik}) = H {β_{k ℓ} m (X_{ik}^{T} α) + Z_{ik}^{T} θ_{k ℓ 1} + G_{ik} W_{ik}^{T} θ_{k ℓ 3}},

for i = 1, …, n and k = 1, 2. We consider two independent populations, and within the k^th population, there are ℓ = 1, …, L_k diseases and i = 1, …, n observations. We let L₁ = 3 and L₂ = 4, so that, as in the example of Section 5, the first and second populations have three and four diseases, respectively. There were 1,000 simulated data sets.

For each simulated data set, we let n = 3,000, and set the covariates to be randomly selected, without replacement, from the real data in Section 5. We set each component of α = 1, and made the convention that the estimates should sum to 12. The true values of β_kℓ are listed in Table 7. We simulated the components in θ_kℓ1 and θ_kℓ3 from the Uniform[−0.5, 0.5] distribution, except that, for identifiability, the first component in θ_k1 is taken to be zero. The true function is m(u) = exp(u/3).

Table 7.

Simulation results for β when n = 3000 and the binary responses have correlation 0.05. Estimate is the mean, Estimated se is the mean of the estimated standard errors, Actual se is the actual standard deviation of the estimates, and Coverage is the actual coverage of a nominal 95% confidence interval. The average total number of cases across the simulation = 7,975.

	Mean # Cases	True β	Mean β̂	Estimated se	Actual se	Coverage
Population 1, Disease 1	826	−1.00	−1.00	NA	NA	NA
Population 1, Disease 2	1052	−0.60	−0.61	0.12	0.11	97.00
Population 1, Disease 3	1347	−0.20	−0.21	0.10	0.09	97.70
Population 2, Disease 1	957	−0.80	−0.80	0.14	0.14	94.40
Population 2, Disease 2	1104	−0.57	−0.57	0.12	0.13	94.50
Population 2, Disease 3	1265	−0.33	−0.34	0.11	0.11	94.00
Population 2, Disease 4	1422	−0.10	−0.11	0.10	0.09	98.40


To generate the correlated binary data, we use the following algorithm. Suppose we want to generate M binary variables with probabilities (π₁, …, π_M). Let (U₁, …, U_M) be equicorrelated standard normal random variables with correlation ρ, and for m = 1, …, M, define T_m = log{π_m/(1 − π_m)} + log[Φ(U_m)/{1 − Φ(U_m)}]. Then setting Y_m = I(T_m > 0) creates correlated binary random variables with the desired probabilities. In our setting, the correlations of the binary variables were approximately ρ/2, so for the simulation in this section we set ρ = 0.10. This resulted in correlations somewhat higher than in the real data in Section 5.
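The algorithm above gives the desired marginals because Φ(U_m) is uniform on (0, 1), so log[Φ(U_m)/{1 − Φ(U_m)}] follows a standard logistic distribution and pr(T_m > 0) = H[log{π_m/(1 − π_m)}] = π_m. A minimal sketch is given below. The one-factor construction of the equicorrelated normals (a shared component plus independent components, which requires 0 ≤ ρ ≤ 1) is an assumption made here for illustration, since the text does not say how they were generated, and the function name is hypothetical.

```python
import math
import random
from statistics import NormalDist
from typing import List, Sequence


def correlated_binary(probs: Sequence[float], rho: float,
                      rng: random.Random) -> List[int]:
    """Draw one vector (Y_1, ..., Y_M) with pr(Y_m = 1) = probs[m].

    Each probability must lie strictly between 0 and 1.
    """
    std_normal = NormalDist()
    shared = rng.gauss(0.0, 1.0)  # common factor inducing correlation rho
    ys = []
    for pi in probs:
        u = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
        phi = std_normal.cdf(u)
        # Guard against log(0) for extreme draws.
        phi = min(max(phi, 1e-15), 1.0 - 1e-15)
        t = math.log(pi / (1.0 - pi)) + math.log(phi / (1.0 - phi))
        ys.append(1 if t > 0.0 else 0)
    return ys


# Example: 20,000 draws of M = 2 outcomes with rho = 0.10.
rng = random.Random(2017)
draws = [correlated_binary([0.1, 0.3], rho=0.10, rng=rng) for _ in range(20000)]
marginal_rates = [sum(d[m] for d in draws) / len(draws) for m in range(2)]
```

The empirical rates in `marginal_rates` should be close to the target probabilities; the induced correlation between the binary outcomes is smaller than ρ and depends on the π_m.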

We also conducted simulation experiments for independent binary outcomes, for the same sample size but with correlation nearly 0.10 (ρ = 0.20), and for both correlations with n = 2000. These results, which are similar to the results in this section, are in Section S.2 of the Supplementary Material.

To give some idea of how this simulation compares with the real data, Table 7 also lists the mean number of cases by disease and by population: the mean across both is 7,975. These are many fewer than the number of cases seen in the actual data, see Table 2. Hence, since the effective sample size might be thought of as the number of cases, the simulation approximates a smaller study than the NIH-AARP data analyzed in Section 5.

Table 6 gives the results for the estimates of α, while Table 7 gives the results for the estimates of the β_kℓ. In both cases, the estimates are very nearly unbiased, the estimated standard errors very nearly equal the actual standard errors, and the coverage probabilities are close to the nominal 95%.
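The four summary columns reported in Tables 6 and 7 can be computed from the replicated estimates as sketched below. This is an illustration only: it assumes the confidence intervals are Wald intervals of the form estimate ± z·se, which the text does not state explicitly, and the function name and toy inputs are hypothetical.

```python
from statistics import NormalDist, fmean, stdev
from typing import Dict, Sequence


def summarize_replicates(estimates: Sequence[float],
                         std_errors: Sequence[float],
                         truth: float,
                         level: float = 0.95) -> Dict[str, float]:
    """Mean, mean estimated se, empirical sd, and coverage (in percent)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    covered = [abs(est - truth) <= z * se
               for est, se in zip(estimates, std_errors)]
    return {
        "mean": fmean(estimates),
        "estimated_se": fmean(std_errors),
        "actual_se": stdev(estimates),
        "coverage": 100.0 * sum(covered) / len(covered),
    }


# Toy replicates (invented numbers, only to exercise the function).
toy = summarize_replicates([0.9, 1.0, 1.1, 1.6], [0.1, 0.1, 0.1, 0.1], truth=1.0)
```

Agreement between `estimated_se` and `actual_se`, together with coverage near the nominal level, is the pattern the tables are checking for.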

Table 6.

Results of the simulation study when n = 3000, the binary responses have correlation 0.05 and where the actual values of α all = 1.00. Here Estimate is the mean of the estimates, Estimated se is the mean of the estimated standard errors, Actual se is the actual standard deviation of the estimates, and Coverage is the actual coverage of a nominal 95% confidence interval. The actual estimates α̂ were normalized to sum to 12.

	Mean	Estimated se	Actual se	Coverage
Total Fruit	1.03	0.32	0.34	93.00
Whole Fruit	1.04	0.55	0.59	95.40
Total Grains	1.00	0.55	0.58	95.00
Whole Grains	0.99	0.37	0.38	94.10
Total Vegetables	1.01	0.44	0.45	94.20
DOL Vegetables	0.99	0.38	0.38	94.90
Dairy	1.01	0.27	0.28	95.10
Meat and Beans	1.01	0.29	0.31	94.00
Oils	1.00	0.28	0.29	94.10
Sodium	0.98	0.35	0.37	94.30
Saturated Fat	0.98	0.40	0.42	94.20
Empty Calories	0.96	0.46	0.47	94.40


Figure S.2 of the Supplementary Material shows that the mean estimated function across the simulated data sets is also very nearly unbiased. Overall, the simulation suggests that our methodology leads to nearly unbiased estimates and inferences that achieve their nominal levels.

7 Discussion

Based on motivation from the current practice in nutritional epidemiology, we have developed generalized partially linear logistic single index models in the case that there are several populations and/or diseases. The novelty of the modeling is that the single-index function itself is the same across populations and diseases. In the case that the populations/diseases are independent given covariates, we developed a computable B-spline based semiparametric efficient methodology. In the case that the populations/diseases are correlated given the covariates, our method makes no assumptions about how the diseases are related.

The importance of developing a score that is constructed for health risk prediction across multiple diseases and populations, versus a different score for each population and disease, was illustrated in our work. When we analyzed men for colorectal cancer solely, increasing Milk consumption was given a very high weight in the single index. However, when we fit the model simultaneously for multiple diseases, the weight for increasing Milk consumption became much smaller. The importance of Milk consumption is a source of controversy in the nutrition literature, and our results agree with the Alternative Healthy Eating Index (AHEI), which assigns zero weight to increased Milk consumption, and with the Mediterranean diet scores, which assign negative weight to increased Milk consumption.

Our results are focused on logistic regression, so that they are readily transparent, but they are easily adapted to apply to any generalized linear model, by replacing H and its related quantities with a more general link function and its corresponding related quantities in the derivation.

We have strived to use a single overall score, which we have argued (a) avoids a different diet for each disease; and (b) is important for interpretability. It is important to use as many diseases and populations as possible, and to draw inferences and projections only to those populations and diseases. One can think of what we have done as a framework for a type of “average” version of the individual model fits across multiple diseases and populations. We do not average directly; instead, we average across the estimating functions.

Supplementary Material

Supplemental Figures and Tables

NIHMS878200-supplement.pdf^{(152.3KB, pdf)}

Matlab Code

NIHMS878200-supplement-Matlab_Code.zip^{(289.7KB, zip)}

R code

NIHMS878200-supplement-R_code.zip^{(1.2MB, zip)}

Acknowledgments

S. Ma’s research was supported by NSF grant DMS-1306972. Y. Ma’s research was supported by NSF grant DMS-1206693 and NIH grant R01-NS073671. Carroll’s research was supported by a grant from the National Cancer Institute (U01-CA057030).

Appendix

A.1 Some Simplifications and Definitions

For simplicity of notation, we work through the asymptotics in the case that there is one population with a common sample size n, and thus the covariates are the same across the populations/diseases. The statements of Theorems 1–3 are readily verified when the responses are independent with different sample sizes and different covariates across k = 1, …, K.

For any vector ζ = (ζ₁, …, ζ_s)^T ∈ R^s, denote the norm ‖ζ‖_r = (|ζ₁|^r + ⋯ + |ζ_s|^r)^{1/r}, 1 ≤ r ≤ ∞. For positive numbers a_n and b_n, n > 1, let a_n ≍ b_n denote that lim_{n→∞} a_n/b_n = c, where c is some nonzero constant. Denote the space of the q^th order smooth functions as C^(q)([a, b]) = {ϕ|ϕ^(q) ∈ C [a, b]}. For any s × s symmetric matrix A, denote its L_r norm as ${‖ A ‖}_{r} = {max}_{ς \in R^{s}, ς \neq 0} {‖ A ς ‖}_{r} {‖ ς ‖}_{r}^{- 1}$ . Let ${‖ A ‖}_{\infty} = {max}_{1 \leq i \leq s} \sum_{j = 1}^{s} | a_{ij} |$ . For a vector a, let ‖a‖_∞ = max_{1≤i≤s} |a_i|.

A.2 Regularity Conditions

(C1)
The density function f_{X₀+X^Tα}(x₀ + x^Tα) of random variable X₀ + X^Tα is bounded away from 0 on S_α and satisfies the Lipschitz condition of order 1 on S_α, where S_α = {X₀ + X^Tα, (X₀, X^T)^T ∈ S} and S is a compact support set of (X₀, X^T)^T, for α in a neighborhood of its true values α⁰.
(C2)
m(·) ∈ C^(q)([a₀, b₀]) for q ≥ 2, $η_{ℓ}^{0} \in C^{(1)} ([a_{0}, b_{0}])$ , 1 ≤ ℓ ≤ d_ν, and the spline order satisfies r ≥ q.
(C3)
There exists 0 < c < ∞, such that the distances between neighboring knots satisfy
$max_{r \leq p \leq N + r} | h_{p + 1} - h_{p} | = o (N^{- 1})$ and $h / min_{r \leq p \leq N + r} h_{p} \leq c .$

Furthermore, the number of knots satisfies N → ∞ as n → ∞, with N⁻⁴n → ∞ and Nn^{−1/(2q+2)} → ∞.
(C4)
sup_i,k |G_ik| ≤ M < ∞. The eigenvalues of $\sum_{k = 1}^{K} c_{k} E [V_{ik} {Q_{ik} - (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η^{0} (U_{i k})}^{\otimes 2}]$ are bounded below from zero. The eigenvalues of E(CD⁻¹C^T) given in Theorem 5 are bounded below from zero.

Conditions (C1)–(C3) are commonly used in the nonparametric smoothing literature; see, for example, Zhou et al. (1998) and Cui et al. (2011). Condition (C4) is needed for asymptotic normality of the parametric estimator.

A.3 Proof of Theorem 1

We first introduce two lemmas which will be used in the following proofs.

Lemma 1

For any a = (a_p : 1 ≤ p ≤ P_n), there exist constants 0 < c_B ≤ C_B < ∞, such that for n large enough,

c_{B} a^{T} ah \leq a^{T} E {B_{r} (U_{ik} (α^{0})) B_{r}^{T} (U_{ik} (α^{0}))} a \leq C_{B} a^{T} ah .

(A.1)

max_{1 \leq p, p' \leq P_{n}} | n^{- 1} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} [B_{r, p} (U_{ik} (α^{0})) B_{r, p'} (U_{ik} (α^{0})) - E {B_{r, p} (U_{ik} (α^{0})) B_{r, p'} (U_{ik} (α^{0}))}] | = O_{p} {\sqrt{h n^{- 1} log (n)}} .

(A.2)

Proof of Lemma 1

Result (A.1) follows from Theorem 5.4.2 of DeVore and Lorentz (1993), and (A.2) can be proved by Bernstein’s inequality in Bosq (1961).

Define

V_{n}^{0} (ν) = n^{- 1} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} E {V_{ik} {(β_{k 1} + β_{k 2} G_{ik})}^{2} B_{r} (U_{ik} (α)) B_{r}^{T} (U_{ik} (α))} .

(A.3)

Lemma 2

There are constants 0 < c_υ < C_υ < ∞, and 0 < C_S < ∞, such that for n large enough,

{‖ V_{n}^{0} {(ν^{0})}^{- 1} ‖}_{\infty} \leq C_{S} h^{- 1} .

(A.4)

and

c_{υ} h \leq {‖ V_{n}^{0} (ν^{0}) ‖}_{2} \leq C_{υ} h, C_{υ}^{- 1} h^{- 1} \leq {‖ V_{n}^{0} {(ν^{0})}^{- 1} ‖}_{2} \leq c_{υ}^{- 1} h^{- 1},

(A.5)

Proof of Lemma 2

Result (A.5) follows from (A.1). The result that ${‖ V_{n}^{0} {(ν^{0})}^{- 1} ‖}_{\infty} \leq C_{S} h^{- 1}$ follows from (A.5) and Theorem 13.4.3 in DeVore and Lorentz (1993).

If m ∈ C^q [a₀, b₀], there exists λ⁰ ∈ R^P_n, such that

sup_{u \in [a_{0}, b_{0}]} | m (u) - \tilde{m} (u) | = O (h^{q}),

(A.6)

where $\tilde{m} (u) = B_{r}^{T} (u) λ^{0}$ (de Boor, 2001). In the following, we prove the results for the nonparametric estimator m̂(u, ν) in Theorem 1 when ν = ν⁰. Then the results also hold when ν is a $\sqrt{n}$-consistent estimator of ν⁰, since the nonparametric convergence rate in Theorem 1 is slower than n^{−1/2}. Let $α_{n} = n^{- 1 / 2} P_{n} + P_{n}^{- q + 1 / 2}$ . We will show that for any given ε > 0, for n sufficiently large, there exists a large constant C > 0 such that

pr {sup_{{‖ τ ‖}_{2} = C} L_{n} (λ^{0} + α_{n} τ, ν^{0}) < L_{n} (λ^{0}, ν^{0})} \geq 1 - ε .

(A.7)

This implies that for n sufficiently large, with probability at least 1 − ε, there exists a local maximum for (5) in the ball {λ⁰ + α_nτ :‖τ‖₂ ≤ C}. Hence, there exists a local maximizer such that ‖λ̂(ν⁰) − λ⁰‖₂ = O_p(α_n). Since L_n(λ, ν⁰) is a concave function of λ, the local maximizer is the global maximizer of (5).

Define

{\tilde{H}}_{ik} (λ, ν) = H [(β_{k 1} + β_{k 2} G_{ik}) B_{r}^{T} {U_{ik} (α)} λ + Z_{ik}^{T} (θ_{k 1} + θ_{k 2} G_{ik}) + G_{ik} W_{ik}^{T} θ_{k 3}],

then

\partial L_{n} (λ, ν) / \partial λ = \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} {Y_{ik} - {\tilde{H}}_{ik} (λ, ν)} (β_{k 1} + β_{k 2} G_{ik}) B_{r} {U_{ik} (α)},

\partial^{2} L_{n} (λ, ν) / \partial λ \partial λ^{T} = - \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} {\tilde{H}}_{ik} (λ, ν) {1 - {\tilde{H}}_{ik} (λ, ν)} {(β_{k 1} + β_{k 2} G_{ik})}^{2} B_{r} {U_{ik} (α)} B_{r}^{T} {U_{ik} (α)} .

By Taylor’s expansion, we have

L_{n} (λ^{0} + α_{n} τ, ν^{0}) - L_{n} (λ^{0}, ν^{0}) = {\partial L_{n} (λ^{0}, ν^{0}) / \partial λ}^{T} α_{n} τ - [- 2^{- 1} {(α_{n} τ)}^{T} {\partial^{2} L_{n} (λ^{*}, ν^{0}) / \partial λ \partial λ^{T}} α_{n} τ],

(A.8)

where λ* = ϱλ + (1 − ϱ)λ⁰ for some ϱ ∈ (0, 1). Moreover,

| {\partial L_{n} (λ^{0}, ν^{0}) / \partial λ}^{T} α_{n} τ | \leq α_{n} {‖ \partial L_{n} (λ^{0}, ν^{0}) / \partial λ ‖}_{2} {‖ τ ‖}_{2} = C α_{n} {‖ \partial L_{n} (λ^{0}, ν^{0}) / \partial λ ‖}_{2},

and ∂L_n(λ⁰, ν⁰)/∂λ = Δ_n1 + Δ_n2, where

Δ_{n 1} = \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} (Y_{ik} - H_{ik} (ν^{0})) (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) B_{r} (U_{ik} (α^{0})),

Δ_{n 2} = \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} (H_{ik} (ν^{0}) - {\tilde{H}}_{ik} (λ^{0}, ν^{0})) (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) B_{r} (U_{ik} (α^{0})) .

Since E(Δ_n1) = 0, and $E {[{Y_{ik} - H_{ik} (ν^{0})} (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) B_{r, p} {U_{ik} (α^{0})}]}^{2} \leq C_{1} h$ for some constant 0 < C₁ < ∞, then $E ({‖ n^{- 1} Δ_{n 1} ‖}_{2}^{2}) \leq P_{n} {Kn}^{- 1} C_{1} h$ . By Condition (C3), we have $h \leq {cP}_{n}^{- 1}$ . Then $E ({‖ n^{- 1} Δ_{n 1} ‖}_{2}^{2}) \leq P_{n} {Kn}^{- 1} C_{1} {cP}_{n}^{- 1} = {KC}_{1} {cn}^{- 1}$ . Then for any ε > 0, by Chebyshev’s inequality, we have $pr ({‖ n^{- 1} Δ_{n 1} ‖}_{2} \geq \sqrt{n^{- 1} {KC}_{1} c ε^{- 1}}) \leq ε$ . Hence, there exists an event A_n1 with $pr (A_{n 1}^{C}) \leq ε$ , such that on A_n1 we have ${‖ Δ_{n 1} ‖}_{2} < \sqrt{{KC}_{1} c ε^{- 1}} n^{1 / 2}$ . Moreover, by (A.6), we have sup_i,k |H_ik(ν⁰) − H̃_ik(λ⁰, ν⁰)| = O(h^q). Denote

Δ_{ikp} = (H_{ik} (ν^{0}) - {\tilde{H}}_{ik} (λ^{0}, ν^{0})) (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) B_{r, p} (U_{ik} (α^{0})) .

Then, there exist constants 0 < C₂, $C_{2}^{'} < \infty$ such that

E ({‖ Δ_{n 2} ‖}_{2}) \leq P_{n}^{1 / 2} {sup_{1 \leq p \leq P_{n}} E {(\sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} Δ_{ikp})}^{2}}^{1 / 2} \leq P_{n}^{1 / 2} [{sup_{1 \leq p \leq P_{n}} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} E (Δ_{ikp}^{2})}^{1 / 2} + {sup_{1 \leq p \leq P_{n}} \sum_{(k, i) \neq (k', i')} E (Δ_{ikp} Δ_{i' k' p})}^{1 / 2}] \leq P_{n}^{1 / 2} {{(C_{2} {nh}^{2 q} h)}^{1 / 2} + {(C_{2}^{'} n^{2} h^{2 q} h^{2})}^{1 / 2}} \leq P_{n}^{1 / 2} (\sqrt{C_{2}} + \sqrt{C_{2}^{'}}) {nh}^{q + 1} \leq P_{n}^{1 / 2} C_{3} {nc}^{q + 1} P_{n}^{- (q + 1)} = C_{3} c^{q + 1} {nP}_{n}^{- q - 1 / 2},

where $C_{3} = (\sqrt{C_{2}} + \sqrt{C_{2}^{'}})$ , for n sufficiently large given that nh → ∞. Again by Chebyshev’s inequality, for any ε > 0, we have $pr ({‖ Δ_{n 2} ‖}_{2} \geq ε^{- 1 / 2} C_{3} c^{q + 1} {nP}_{n}^{- q - 1 / 2}) \leq ε$ . Hence, there exists an event A_n2 with $pr (A_{n 2}^{C}) \leq ε$ , such that on A_n2 we have ${‖ Δ_{n 2} ‖}_{2} < ε^{- 1 / 2} C_{3} c^{q + 1} {nP}_{n}^{- q - 1 / 2}$ . Therefore, by the above results, for n sufficiently large, on the event A_n1 ∩ A_n2 with pr(A_n1 ∩ A_n2) ≥ 1 − 2ε, we have

| {\partial L_{n} (λ^{0}, ν^{0}) / \partial λ}^{T} α_{n} τ | \leq C α_{n} {‖ \partial L_{n} (λ^{0}, ν^{0}) / \partial λ ‖}_{2} \leq C α_{n} ({‖ Δ_{n 1} ‖}_{2} + {‖ Δ_{n 2} ‖}_{2}) \leq C α_{n} (\sqrt{{KC}_{1} c ε^{- 1}} n^{1 / 2} + ε^{- 1 / 2} C_{3} c^{q + 1} {nP}_{n}^{- q - 1 / 2}) .

(A.9)

Moreover, by (A.1) and (A.2), we have for n sufficiently large, with probability approaching 1,

- 2^{- 1} τ^{T} {\partial^{2} L_{n} (λ^{*}, ν^{0}) / \partial λ \partial λ^{T}} τ \geq {nC}_{3} τ^{T} τ h \geq C_{4} C^{2} {nP}_{n}^{- 1} .

Thus, there exists an event A_n3 with $pr (A_{n 3}^{C}) \leq ε$ for any ε > 0, such that on A_n3,

- 2^{- 1} {(α_{n} τ)}^{T} {\partial^{2} L_{n} (λ^{*}, ν^{0}) / \partial λ \partial λ^{T}} (α_{n} τ) \geq α_{n}^{2} C_{4} C^{2} {nP}_{n}^{- 1} .

(A.10)

Therefore, by (A.8), (A.9) and (A.10), for n sufficiently large, on the event A_n1 ∩ A_n2 ∩ A_n3 with pr(A_n1 ∩ A_n2 ∩ A_n3) ≥ 1 − 3ε, we have

L_{n} (λ^{0} + α_{n} τ, ν^{0}) - L_{n} (λ^{0}, ν^{0}) \leq C α_{n} (\sqrt{{KC}_{1} c ε^{- 1}} n^{1 / 2} + ε^{- 1 / 2} C_{3} c^{q + 1} {nP}_{n}^{- q - 1 / 2}) - α_{n}^{2} C_{4} C^{2} {nP}_{n}^{- 1} = C α_{n} P_{n}^{- 1} {\sqrt{{KC}_{1} c ε^{- 1}} n^{1 / 2} P_{n} + ε^{- 1 / 2} C_{3} c^{q + 1} {nP}_{n}^{- q + 1 / 2} - {CC}_{4} n α_{n}} = C α_{n} P_{n}^{- 1} {\sqrt{{KC}_{1} c ε^{- 1}} n^{1 / 2} P_{n} + ε^{- 1 / 2} C_{3} c^{q + 1} {nP}_{n}^{- q + 1 / 2} - {CC}_{4} n^{1 / 2} P_{n} - {CC}_{4} {nP}_{n}^{- q + 1 / 2}} < 0,

when $C > max (C_{4}^{- 1} \sqrt{{KC}_{1} c ε^{- 1}}, ε^{- 1 / 2} C_{4}^{- 1} C_{3} c^{q + 1})$. This shows (A.7). Hence, we have ${‖ \hat{λ} (ν^{0}) - λ^{0} ‖}_{2} = O_{p} (α_{n}) = O_{p} (n^{- 1 / 2} P_{n} + P_{n}^{- q + 1 / 2})$. A similar strategy for proving consistency has been used in the literature when the dimension of the parameter is diverging; see, for example, the proof of Theorem 3 in Fan and Lv (2011).
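To make the rate α_n = n^{−1/2}P_n + P_n^{−q+1/2} concrete, the following sketch evaluates its two terms for growing n. The choices q = 4 and P_n = n^{1/5} are assumptions made only for illustration (they are not taken from the paper); they are picked so that P_n⁴n⁻¹ → 0.

```python
def alpha_n_terms(n, P_n, q):
    """Return the two terms of alpha_n = n^{-1/2} P_n + P_n^{-q + 1/2}."""
    estimation_term = n ** -0.5 * P_n          # n^{-1/2} P_n
    approximation_term = P_n ** (-q + 0.5)     # P_n^{-q + 1/2}
    return estimation_term, approximation_term


def alpha_n(n, P_n, q):
    """Sum of the two terms, i.e. the rate alpha_n itself."""
    return sum(alpha_n_terms(n, P_n, q))


# Hypothetical choices for illustration only: q = 4 and P_n = n^{1/5}.
q, kappa = 4, 0.2
for n in (10**3, 10**4, 10**5, 10**6):
    P_n = n ** kappa
    est, approx = alpha_n_terms(n, P_n, q)
    print(f"n={n:>8d}  P_n={P_n:6.2f}  P_n^4/n={P_n**4 / n:.4f}  "
          f"n^-1/2 P_n={est:.4f}  P_n^(-q+1/2)={approx:.6f}  "
          f"alpha_n={est + approx:.4f}")
```

Under these assumed choices both terms shrink as n grows, with the term n^{−1/2}P_n dominating; a different growth rule for P_n would change which term dominates.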

Next, let

V_{ik} = var (Y_{ik} | G_{ik}, X_{ik 0}, X_{ik}, Z_{ik}, G_{ik} W_{ik}) = H_{ik} (1 - H_{ik}),

and

V_{n} (ν) = n^{- 1} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} V_{ik} {(β_{k 1} + β_{k 2} G_{ik})}^{2} B_{r} (U_{ik} (α)) B_{r}^{T} (U_{ik} (α)) .

(A.11)
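As an illustration of (A.11), the sketch below builds V_n(ν) from a B-spline basis evaluated by the Cox-de Boor recursion. The knot placement, the simulated index values, the coefficient values β_k1 = 1 and β_k2 = 0.5, and the stand-in for H_ik are all hypothetical; this is not the authors' R or Matlab code.

```python
import math
import random


def bspline_basis(u, knots, order):
    """Values at u of all B-spline basis functions of the given order
    (degree = order - 1), computed by the Cox-de Boor recursion.
    Assumes knots[0] <= u < knots[-1]."""
    m = len(knots) - 1
    # Order-1 (piecewise-constant) basis functions.
    B = [1.0 if knots[j] <= u < knots[j + 1] else 0.0 for j in range(m)]
    for r in range(2, order + 1):
        new = []
        for j in range(m - r + 1):
            left_den = knots[j + r - 1] - knots[j]
            right_den = knots[j + r] - knots[j + 1]
            left = (u - knots[j]) / left_den * B[j] if left_den > 0 else 0.0
            right = (knots[j + r] - u) / right_den * B[j + 1] if right_den > 0 else 0.0
            new.append(left + right)
        B = new
    return B


def V_n(records, knots, order):
    """(A.11): n^{-1} sum V_ik (beta_k1 + beta_k2 G_ik)^2 B_r(U_ik) B_r(U_ik)^T,
    with records = [(U_ik, G_ik, beta_k1, beta_k2, H_ik), ...] pooled over k."""
    P = len(knots) - order
    n = len(records)
    V = [[0.0] * P for _ in range(P)]
    for u, g, b1, b2, h in records:
        weight = h * (1.0 - h) * (b1 + b2 * g) ** 2   # V_ik (beta_k1 + beta_k2 G_ik)^2
        B = bspline_basis(u, knots, order)
        for p in range(P):
            for s in range(P):
                V[p][s] += weight * B[p] * B[s] / n
    return V


# Hypothetical setup: cubic splines (order 4) on [0, 1] with 4 interior knots.
order = 4
interior = [0.2, 0.4, 0.6, 0.8]
knots = [0.0] * order + interior + [1.0] * order

random.seed(0)
records = []
for _ in range(500):
    u = random.random()                      # index value U_ik in [0, 1)
    g = random.choice([0, 1])                # binary G_ik
    h = 1.0 / (1.0 + math.exp(-(u - 0.5)))   # stand-in for H_ik
    records.append((u, g, 1.0, 0.5, h))

V = V_n(records, knots, order)
print(len(V), "x", len(V[0]), "matrix; largest entry:", max(max(row) for row in V))
```

Because each B-spline of order r overlaps only its r − 1 neighbours on either side, the resulting matrix is banded, which is the structure behind the O(h) and O_p(h⁻¹) bounds used in this proof.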

By (A.2), (A.6) and the assumption that $P_{n}^{4} n^{- 1} = o (1)$ ,

{‖ - n^{- 1} \partial^{2} L_{n} (λ^{0}, ν^{0}) / \partial λ \partial λ^{T} - V_{n} (ν^{0}) ‖}_{\infty} = O (h^{q}) {‖ n^{- 1} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} B_{r} (U_{ik} (α^{0})) B_{r}^{T} (U_{ik} (α^{0})) ‖}_{\infty} = O (h^{q}) {‖ E {B_{r} (U_{ik} (α^{0})) B_{r}^{T} (U_{ik} (α^{0}))} ‖}_{\infty} + O (h^{q}) P_{n} O_{p} {\sqrt{{hn}^{- 1} log (n)}} = O (h^{q}) O (h) + O (h^{q}) P_{n} O_{p} {\sqrt{{hn}^{- 1} log (n)}} = O_{p} (h^{q + 1}) .

By (A.2) and (A.4), we have ‖V_n(ν⁰)⁻¹‖_∞ = O_p(h⁻¹). Thus by the above results, one has

{‖ {- n^{- 1} \partial^{2} L_{n} (λ^{0}, ν^{0}) / \partial λ \partial λ^{T}}^{- 1} - V_{n} {(ν^{0})}^{- 1} ‖}_{\infty} = O_{p} (h^{q - 1}) .

Let

D_{n} (ν) = n^{- 1} \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} (Y_{ik} - H_{ik}) (β_{k 1} + β_{k 2} G_{ik}) B_{r} (U_{ik} (α)) .

Since $E {(β_{k 1}^{0} + β_{k 2}^{0} G_{i k}) B_{r, p} (U_{i k} (α^{0}))} = O (h)$ , by Bernstein’s inequality, we have

{‖ n^{- 1} \sum_{i = 1}^{n} \sum_{k = 1}^{K} (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) B_{r} (U_{ik} (α^{0})) ‖}_{\infty} = O_{p} (h) .

By the above result and (A.6),

{‖ n^{- 1} \partial L_{n} (λ^{0}, ν^{0}) / \partial λ - D_{n} (ν^{0}) ‖}_{\infty} = O (h^{q}) {‖ n^{- 1} \sum_{i = 1}^{n} \sum_{k = 1}^{K} (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) B_{r} (U_{ik} (α^{0})) ‖}_{\infty} = O_{p} (h^{q + 1}) .

Let $ℂ_{ik} = {(G_{i}, X_{ik 0}, X_{ik}^{T}, Z_{ik}^{T}, G_{i} W_{ik}^{T})}^{T}, ℂ_{i} = {(ℂ_{ik}^{T}, 1 \leq k \leq K)}^{T}$ , and ℂ = (ℂ₁, …, ℂ_n)^T. It can be proved by Bernstein’s inequality in Bosq (1961) that ${‖ D_{n} (ν^{0}) ‖}_{\infty} = O_{p} (\sqrt{{hn}^{- 1} log (n)})$ . Also, by (A.4), ‖{−n⁻¹∂²L_n(λ⁰, ν⁰)/∂λ∂λ^T}⁻¹‖_∞ = O_p (h⁻¹). Thus for a ∈ R^P_n with ‖a‖₂ = 1,

a^{T} [{- n^{- 1} \partial^{2} L_{n} (λ^{0}, ν^{0}) / \partial λ \partial λ^{T}}^{- 1} {n^{- 1} \partial L_{n} (λ^{0}, ν^{0}) / \partial λ} - V_{n} {(ν^{0})}^{- 1} D_{n} (ν^{0})] \leq {‖ a ‖}_{\infty} {‖ {- n^{- 1} \partial^{2} L_{n} (λ^{0}, ν^{0}) / \partial λ \partial λ^{T}}^{- 1} ‖}_{\infty} {‖ n^{- 1} \partial L_{n} (λ^{0}, ν^{0}) / \partial λ - D_{n} (ν^{0}) ‖}_{\infty} + {‖ a ‖}_{\infty} {‖ {- n^{- 1} \partial^{2} L_{n} (λ^{0}, ν^{0}) / \partial λ \partial λ^{T}}^{- 1} - V_{n} {(ν^{0})}^{- 1} ‖}_{\infty} {‖ D_{n} (ν^{0}) ‖}_{\infty} = O_{p} (h^{q}) + O_{p} (h^{q - 1}) O_{p} (\sqrt{{hn}^{- 1} log (n)}) .

(A.12)

Let ê = V_n(ν⁰)⁻¹D_n(ν⁰). By the Central Limit Theorem, ${[B_{r}^{T} (u) var (\hat{e} | ℂ) B_{r} (u)]}^{- 1 / 2} B_{r}^{T} (u) \hat{e} \to Normal (0, 1)$, where var(ê|ℂ) = {nV_n(ν⁰)}⁻¹ and $B_{r}^{T} (u) var (\hat{e} | ℂ) B_{r} (u) = {\hat{σ}}^{2} (u, ν^{0})$. By Lemma 2 and (A.2), there are constants $0 < c_{υ}^{'} < C_{υ}^{'} < \infty$ such that, with probability approaching 1, $c_{υ}^{'} h^{- 1} \leq {‖ V_{n} {(ν^{0})}^{- 1} ‖}_{2} \leq C_{υ}^{'} h^{- 1}$, and

{‖ V_{n} {(ν^{0})}^{- 1} - V_{n}^{0} {(ν^{0})}^{- 1} ‖}_{2} = O_{p} (h^{- 2} \sqrt{{hn}^{- 1} log (n)}) .

(A.13)

Therefore, there exist constants 0 < c_σ ≤ C_σ < ∞ such that with probability approaching 1 and for large enough n,

c_{σ} {(nh)}^{- 1 / 2} \leq inf_{u \in [a_{0}, b_{0}]} \hat{σ} (u, ν^{0}) \leq sup_{u \in [a_{0}, b_{0}]} \hat{σ} (u, ν^{0}) \leq C_{σ} {(nh)}^{- 1 / 2} .

(A.14)

Thus $B_{r}^{T} (u) \hat{e} = O_{p} {{(nh)}^{- 1 / 2}}$ uniformly in u ∈ [a₀, b₀], and

B_{r}^{T} (u) {- \partial^{2} L_{n} (λ^{0}, ν^{0}) / \partial λ \partial λ^{T}}^{- 1} {\partial L_{n} (λ^{0}, ν^{0}) / \partial λ} = O_{p} {{(nh)}^{- 1 / 2} + h^{q}}

uniformly in u ∈ [a₀, b₀]. By Taylor’s expansion,

\hat{λ} (ν^{0}) - λ^{0} = {- \partial^{2} L_{n} (λ^{0}, ν^{0}) / \partial λ \partial λ^{T}}^{- 1} {\partial L_{n} (λ^{0}, ν^{0}) / \partial λ} {1 + o_{p} (1)} .

(A.15)

Thus by (A.12), (A.14), and Condition (C3),

sup_{u \in [a_{0}, b_{0}]} | \hat{σ} {(u, ν^{0})}^{- 1} [B_{r}^{T} (u) {\hat{λ} (ν^{0}) - λ^{0}} - B_{r}^{T} (u) \hat{e}] | = O_{p} {{(nh)}^{1 / 2}} {O_{p} (h^{q}) + O_{p} (h^{q - 1}) O_{p} (\sqrt{{hn}^{- 1} log (n)})} + O_{p} {{(nh)}^{1 / 2}} o_{p} {{(nh)}^{- 1 / 2} + h^{q}} = o_{p} (1) .

Therefore, by Slutsky’s theorem, σ̂⁻¹(u, ν⁰) {m̂(u, ν⁰) − m̃(u)} → Normal(0, 1) and m̂(u, ν⁰) − m̃(u) = O_p{(nh)^{−1/2}} uniformly in u ∈ [a₀, b₀]. Since sup_{u∈[a₀,b₀]} |m(u) − m̃(u)| = o(h^q), we have |m̂(u, ν⁰) − m(u)| = O_p{(nh)^{−1/2} + h^q} uniformly in u ∈ [a₀, b₀]. By Slutsky’s theorem again, we have

{\hat{σ}}^{- 1} (u, ν^{0}) {\hat{m} (u, ν^{0}) - m (u)} \to Normal (0, 1) .

Since $\hat{m}' (u, ν) = B_{r - 1}^{T} (u) D_{1} \hat{λ} (ν)$ and B_{r−1}(u) are B-spline basis functions of one order lower than B_r(u), by the same argument as in Zhou and Wolfe (2000) and the proof for m̂(u, ν⁰), we obtain result (b) in Theorem 1. This completes the proof.

A.4 Proof of Theorem 2

Define $L_{ik} (ν) = (β_{k 1} + β_{k 2} G_{ik}) m (X_{ik 0} + X_{ik}^{T} α) + Z_{ik}^{T} (θ_{k 1} + θ_{k 2} G_{ik}) + G_{ik} W_{ik}^{T} θ_{k 3}$. It is straightforward to prove that ∂L_i1(ν)/∂ν = Q_i1(ν), and for k = 2, …, K, ∂L_ik(ν)/∂ν = Q_ik(ν). Then by (A.15) and Condition (C3), and by the same arguments as in the proof of Proposition 4.1 in Ai and Chen (2003), we have

\partial L_{n} (ν^{0}) / \partial ν = \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} [(Y_{ik} - H_{ik} (ν^{0})) \times {\partial L_{ik} (ν^{0}) / \partial ν - (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η^{0} (U_{ik} (α^{0}))}] {1 + o_{p} (1)} = \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} [(Y_{ik} - H_{ik} (ν^{0})) \times {Q_{ik} (ν^{0}) - (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η^{0} (U_{ik} (α^{0}))}] {1 + o_{p} (1)},

\partial^{2} L_{n} (ν^{0}) / \partial ν \partial ν^{T} = - \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} [H_{ik} (ν^{0}) (1 - H_{ik} (ν^{0})) \times {\partial L_{ik} (ν^{0}) / \partial ν - (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η^{0} (U_{ik} (α^{0}))}^{\otimes 2}] {1 + o_{p} (1)} = - \sum_{k = 1}^{K} \sum_{i = 1}^{n_{k}} [V_{ik} (ν^{0}) {Q_{ik} (ν^{0}) - (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η^{0} (U_{ik} (α^{0}))}^{\otimes 2}] {1 + o_{p} (1)} .

By Taylor’s expansion, we have

\hat{ν} - ν^{0} = - {\partial^{2} L_{n} (ν^{0}) / \partial ν \partial ν^{T}}^{- 1} {\partial L_{n} (ν^{0}) / \partial ν} {1 + o_{p} (1)} .

By the above result, we have (8). Then the asymptotic normality in Theorem 2 follows from the Central Limit Theorem and (8).
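The Taylor expansion above is the usual one-step (Newton) linearization of a likelihood estimator. As a minimal sketch, consider a scalar logistic model with a single hypothetical covariate x and true parameter ν⁰ = 1. This toy model is an assumption for illustration and is not the profile likelihood of the paper. The code compares the maximizer with the one-step approximation ν⁰ − {∂²L_n(ν⁰)}⁻¹ ∂L_n(ν⁰).

```python
import math
import random


def H(t):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-t))


def score_and_hessian(nu, xs, ys):
    """First and second derivatives of the logistic log-likelihood in nu."""
    score = sum((y - H(nu * x)) * x for x, y in zip(xs, ys))
    hessian = -sum(H(nu * x) * (1.0 - H(nu * x)) * x * x for x in xs)
    return score, hessian


def newton(xs, ys, nu=0.0, tol=1e-10, max_iter=50):
    """Maximize the log-likelihood by Newton-Raphson iterations."""
    for _ in range(max_iter):
        s, h = score_and_hessian(nu, xs, ys)
        step = -s / h
        nu += step
        if abs(step) < tol:
            break
    return nu


# Simulated data from the toy model (all values hypothetical).
random.seed(0)
nu0 = 1.0
xs = [random.uniform(-2.0, 2.0) for _ in range(2000)]
ys = [1 if random.random() < H(nu0 * x) else 0 for x in xs]

nu_hat = newton(xs, ys)
s0, h0 = score_and_hessian(nu0, xs, ys)
one_step = nu0 - s0 / h0   # nu0 - {d^2 L / d nu^2}^{-1} {d L / d nu}
print("nu_hat =", nu_hat, " one-step approximation =", one_step)
```

The two printed values should be close, which is the content of the {1 + o_p(1)} factor in the expansion.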

A.5 Proof of Theorem 3

Here we show that our method for estimating ν is semiparametric efficient when (Y_i1, …, Y_iK) are independent given ℂ_i. We have that

log {pr (Y_{i} = y_{i} | ℂ_{i})} = \sum_{k = 1}^{K} {y_{ik} log (H_{ik}) + (1 - y_{ik}) log (1 - H_{ik})} .
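A small numerical check of this factorized log-likelihood may help. The sketch below uses three hypothetical linear predictors (K = 3) and verifies that the implied probabilities over all outcome vectors sum to one, and that the derivative with respect to the kth linear predictor is y_k − H_k, the residual that drives the score.

```python
import itertools
import math


def H(t):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-t))


def log_lik(y, lin_pred):
    """sum_k { y_k log H_k + (1 - y_k) log(1 - H_k) } with H_k = H(lin_pred[k])."""
    return sum(yk * math.log(H(t)) + (1 - yk) * math.log(1.0 - H(t))
               for yk, t in zip(y, lin_pred))


def d_log_lik(y, lin_pred, k, eps=1e-6):
    """Central finite-difference derivative in the k-th linear predictor."""
    up = list(lin_pred)
    dn = list(lin_pred)
    up[k] += eps
    dn[k] -= eps
    return (log_lik(y, up) - log_lik(y, dn)) / (2.0 * eps)


lin_pred = [-0.4, 0.3, 1.1]   # hypothetical linear predictors, K = 3

# Probabilities of all 2^K outcome vectors sum to one under independence.
total = sum(math.exp(log_lik(y, lin_pred))
            for y in itertools.product([0, 1], repeat=len(lin_pred)))
print("total probability:", total)

y_obs = (1, 0, 1)
for k in range(len(lin_pred)):
    print(k, d_log_lik(y_obs, lin_pred, k), y_obs[k] - H(lin_pred[k]))
```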

The i^th score with respect to ν is $S_{ν i} = \sum_{k = 1}^{K} (Y_{ik} - H_{ik}) Q_{ik}$ . The nuisance tangent space is

Λ = {\sum_{k = 1}^{K} (Y_{ik} - H_{ik}) (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η (X_{ik 0} + X_{ik}^{T} α^{0}) : η (\cdot) \in ℝ^{J + 2 K + 2 K d + K a - 2}} .

We decompose S_νi as S_νi = S_eff,i + S_1i, where

S_{eff, i} = \sum_{k = 1}^{K} (Y_{ik} - H_{ik}) {Q_{ik} - (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η_{0 i}},

S_{1 i} = \sum_{k = 1}^{K} (Y_{ik} - H_{ik}) (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η_{0 i},

η_{0 i} = \frac{E {\sum_{k = 1}^{K} V_{ik} Q_{ik} (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) | X_{ik 0} + X_{ik}^{T} α^{0}}}{E {\sum_{k = 1}^{K} V_{ik} {(β_{k 1}^{0} + β_{k 2}^{0} G_{ik})}^{2} | X_{ik 0} + X_{ik}^{T} α^{0}}} .

Obviously, S_1i ∈ Λ. For any element S_i ∈ Λ, say $S_{i} = \sum_{k = 1}^{K} (Y_{ik} - H_{ik}) (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η (X_{ik 0} + X_{ik}^{T} α^{0})$ , we can easily verify that $E (S_{eff, i}^{T} S_{i}) = 0$ .

Thus, S_eff,i is the residual of the orthogonal projection of S_νi onto Λ, hence it is the efficient score. The minimum variance bound for estimating ν is therefore

{cov}_{opt} {n^{1 / 2} (\hat{ν} - ν)} = {E (S_{eff, i} S_{eff, i}^{T})}^{- 1} = {[E \sum_{k = 1}^{K} V_{ik} {Q_{ik} - (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η_{0 i}}^{\otimes 2}]}^{- 1} .

Since S_1i is the orthogonal projection of S_νi onto Λ, it minimizes the covariance matrix of S_νi − S_i among all the functions S_i ∈ Λ, i.e., η_0i minimizes

cov [\sum_{k = 1}^{K} (Y_{ik} - H_{ik}) {Q_{ik} - (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η (X_{ik 0} + X_{ik}^{T} α^{0})}] = [E \sum_{k = 1}^{K} V_{ik} {Q_{ik} - (β_{k 1}^{0} + β_{k 2}^{0} G_{ik}) η (X_{ik 0} + X_{ik}^{T} α^{0})}^{\otimes 2}]

among all possible $η (X_{ik 0} + X_{ik}^{T} α^{0}) \in ℝ^{J + 2 K + 2 K d + K a - 2}$ . This shows that Σ in Theorem 2 reaches the semiparametric efficiency bound, as claimed.
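The orthogonality and minimization properties used in this proof can be checked numerically in a simplified setting. The sketch below assumes K = 1, a scalar stand-in for Q_ik, and an index taking only three values, so that conditional expectations given the index reduce to group averages. All numerical values are hypothetical.

```python
import math
import random

random.seed(1)
units = []
for _ in range(300):
    u = random.choice([0, 1, 2])        # discrete index value
    g = random.choice([0, 1])           # binary G
    b = 1.0 + 0.5 * g                   # hypothetical beta_1 + beta_2 G
    q = random.gauss(u, 1.0)            # scalar stand-in for Q, correlated with the index
    h = 1.0 / (1.0 + math.exp(-(0.3 * u - 0.2 * g)))
    units.append((u, b, q, h * (1.0 - h)))   # last entry is V = H (1 - H)


def eta0(units):
    """eta_0(u) = E{V Q b | U = u} / E{V b^2 | U = u}, by group sums."""
    num, den = {}, {}
    for u, b, q, v in units:
        num[u] = num.get(u, 0.0) + v * q * b
        den[u] = den.get(u, 0.0) + v * b * b
    return {u: num[u] / den[u] for u in num}


def cross_moment(units, e0, eta):
    """Empirical E(S_eff S) for S = (Y - H) b eta(U), using E{(Y - H)^2 | C} = V."""
    return sum(v * (q - b * e0[u]) * b * eta[u] for u, b, q, v in units) / len(units)


def variance(units, eta):
    """Empirical E[V {Q - b eta(U)}^2], the quantity minimized by eta_0."""
    return sum(v * (q - b * eta[u]) ** 2 for u, b, q, v in units) / len(units)


e0 = eta0(units)
arbitrary = {0: 2.0, 1: -1.0, 2: 0.7}          # any other function of the index
perturbed = {u: e0[u] + 0.1 for u in e0}
print("E(S_eff S) =", cross_moment(units, e0, arbitrary))
print("variance at eta_0:", variance(units, e0), " perturbed:", variance(units, perturbed))
```

The cross moment is zero up to rounding for any choice of η, and perturbing η₀ increases the variance, mirroring the projection argument.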

A.6 Proof of Theorem 4

Let $ε_{ik} = (β_{k 1}^{0} + β_{k 2}^{0} G_{i}) (Y_{ik} - H_{ik})$ and ε_i = (ε_i1, …, ε_iK)^T. Following the same procedure as the proof of Theorem 1, we have that ${‖ {\hat{λ}}_{w} (ν^{0}) - λ^{0} ‖}_{2} = O_{p} (n^{- 1 / 2} P_{n} + P_{n}^{- q + 1 / 2})$ . By this result and Taylor’s expansion, we have

0 = \sum_{i = 1}^{n} B_{r} (U_{i}) A_{i} B_{i}^{- 1} A_{i}^{T} B_{r}^{T} (U_{i}) {λ^{0} - {\hat{λ}}_{w} (ν^{0})} + \sum_{i = 1}^{n} B_{r} (U_{i}) A_{i} B_{i}^{- 1} ε_{i} + o_{p} (n^{1 / 2}) .

Thus

{\hat{λ}}_{w} (ν^{0}) - λ^{0} = {n^{- 1} \sum_{i = 1}^{n} B_{r} (U_{i}) A_{i} B_{i}^{- 1} A_{i}^{T} B_{r}^{T} (U_{i})}^{- 1} \times {n^{- 1} \sum_{i = 1}^{n} B_{r} (U_{i}) A_{i} B_{i}^{- 1} ε_{i}} {1 + o_{p} (1)} .

(A.16)

Then with probability approaching 1, var{λ̂_w(ν⁰) − λ⁰|ℂ_i} approaches $Π_{n}^{- 1} Ξ_{n} Π_{n}^{- 1}$. Theorem 4 can then be proved by the same method as in the proof of Theorem 1.

A.7 Proof of Theorem 5

Let ζ_i be the d_νK × 1 vector formed by K length d_ν vectors. The kth, k = 1, …, K vector component is $(Y_{ik} - H_{ik} (ν)) {{\hat{Q}}_{ik} (ν) + (β_{k 1} + β_{k 2} G_{i}) {{\hat{λ}}_{w}^{'} (ν)}^{T} B_{r} (X_{i 0} + X_{i}^{T} α)}$ . Following the same outline as the proof of Theorem 2, it can be proved that

\sqrt{n} ({\hat{ν}}_{w} - ν^{0}) = \sqrt{n} {(\sum_{i = 1}^{n} C_{i} D_{i}^{- 1} C_{i}^{T})}^{- 1} (\sum_{i = 1}^{n} C_{i} D_{i}^{- 1} ζ_{i}) + o_{p} (1) .

Therefore,

var (\sqrt{n} ({\hat{ν}}_{w} - ν^{0}) | ℂ_{i}) = n {(\sum_{i = 1}^{n} C_{i} D_{i}^{- 1} C_{i}^{T})}^{- 1} (\sum_{i = 1}^{n} C_{i} D_{i}^{- 1} D_{i}^{*} D_{i}^{- 1} C_{i}^{T}) \times {(\sum_{i = 1}^{n} C_{i} D_{i}^{- 1} C_{i}^{T})}^{- 1} + o_{p} (1),

and the asymptotic normality of $\sqrt{n} ({\hat{ν}}_{w} - ν^{0})$ given in Theorem 5 follows from the Central Limit Theorem.
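The variance expression above has the familiar sandwich form. The following sketch evaluates it for a scalar parameter and K = 2, reading D_i as a working covariance and D_i^* as the covariance of ζ_i; that reading, and every numerical value, is an assumption made for illustration only.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Product of a 2 x 2 matrix and a 2-vector."""
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def bread(C, D):
    """sum_i C_i D_i^{-1} C_i^T for a scalar parameter (C_i is 1 x 2)."""
    return sum(dot(c, matvec(inv2(d), c)) for c, d in zip(C, D))


def sandwich(C, D, D_star):
    """n (sum C D^{-1} C^T)^{-1} (sum C D^{-1} D* D^{-1} C^T) (sum C D^{-1} C^T)^{-1}."""
    n = len(C)
    meat = 0.0
    for c, d, ds in zip(C, D, D_star):
        w = matvec(inv2(d), c)          # D_i^{-1} C_i^T (D_i symmetric)
        meat += dot(w, matvec(ds, w))   # C_i D_i^{-1} D_i^* D_i^{-1} C_i^T
    return n * meat / bread(C, D) ** 2


# Hypothetical inputs: four subjects, two outcomes each.
C = [[1.0, 0.5], [0.3, 1.2], [0.8, 0.8], [1.5, 0.2]]
D_true = [[[0.25, 0.10], [0.10, 0.20]] for _ in C]   # covariance with correlation
D_work = [[[0.25, 0.00], [0.00, 0.20]] for _ in C]   # working independence

print("working independence:", sandwich(C, D_work, D_true))
print("correctly specified: ", sandwich(C, D_true, D_true))
```

When D_i = D_i^*, the middle term cancels against one outer factor and the expression reduces to n(Σ C_i D_i⁻¹ C_i^T)⁻¹, which is never larger than the value obtained with a misspecified working covariance.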

A.8 Extending to Multiple Study Centers

Here we briefly indicate the changes needed if there are multiple study centers and multiple dependent disease outcomes within each study center. Suppose that there are k = 1, …, K study centers, with ℓ = 1, …, L_k binary disease outcomes in each center, and with i = 1, …, n_k observations at the k^th center. Write the outcomes as Y_ik = (Y_ik1, …, Y_{ikL_k}), and write the covariates as ℂ_ik = (G_ik, X_ik0, X_ik, Z_ik, G_ikW_ik). The model is

pr (Y_{ik ℓ} = 1 | ℂ_{ik}) = H_{ik ℓ} = H {(β_{k ℓ 1} + β_{k ℓ 2} G_{ik}) m (X_{ik 0} + X_{ik}^{T} α) + Z_{ik}^{T} (θ_{k ℓ 1} + θ_{k ℓ 2} G_{ik}) + G_{ik} W_{ik}^{T} θ_{k ℓ 3}} .

(A.17)

We make the same assumptions as in Section A.2, but in addition we assume that lim_{n₁, …, n_K→∞}(max n_k/ min n_k) = c with 0 < c < ∞.

From the above model, we can see that in different centers, because different physical populations are studied, the same disease occurrence is modeled with different parameters. Thus, we can simply view the L_k diseases in the k = 1, …, K centers as $\sum_{k = 1}^{K} L_{k}$ different diseases from a single center, and all our analyses formulated for data from one center apply.
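The re-indexing described above amounts to flattening the pairs (k, ℓ) into a single disease index. A minimal sketch follows, with a hypothetical helper name and hypothetical values of L_k.

```python
def flatten_outcomes(L):
    """Map each (center k, disease l) pair, both 1-based, to a single
    0-based 'disease' index running from 0 to sum(L) - 1."""
    index = {}
    for k, L_k in enumerate(L, start=1):
        for l in range(1, L_k + 1):
            index[(k, l)] = len(index)
    return index


# Hypothetical example: K = 3 centers with L_1 = 2, L_2 = 3, L_3 = 1 diseases.
L = [2, 3, 1]
index = flatten_outcomes(L)
for (k, l), j in index.items():
    print(f"center {k}, disease {l} -> pooled disease {j}")
```

Each pooled index then carries its own parameters (β_{kℓ1}, β_{kℓ2}, θ_{kℓ1}, θ_{kℓ2}, θ_{kℓ3}), while m(·) and α remain shared.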

Footnotes

Supplementary Material

The Supplementary Material contains results of additional simulations, and R and Matlab programs to run the analysis. The NIH-AARP data used in the data analysis are available from the NIH via a data transfer agreement (www.http://dietandhealth.cancer.gov/), but we are not allowed to distribute them. The program files include simulated data as described in Section 6.

Contributor Information

Shujie Ma, Department of Statistics, University of California at Riverside, Riverside, CA 92521.

Yanyuan Ma, Department of Statistics, University of South Carolina, Columbia, SC 29208.

Yanqing Wang, Fred Hutchinson Cancer Research Center, Seattle, WA 98109.

Eli S. Kravitz, Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX

Raymond J. Carroll, Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843-3143, and School of Mathematical Sciences, University of Technology Sydney, Broadway NSW 2007

References

Ai C, Chen X. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica. 2003;71:1795–1843.
Akbaraly TN, Ferrie JE, Berr C, Brunner EJ, Head J, Marmot MG, Singh-Manoux A, Ritchie K, Shipley MJ, Kivimaki M. Alternative Healthy Eating Index and mortality over 18 y of follow-up: results from the Whitehall II cohort. American Journal of Clinical Nutrition. 2011;94:247–253. doi: 10.3945/ajcn.111.013128.
Bellman RE. Adaptive Control Processes. Princeton University Press; Princeton: 1961.
Bosq D. Nonparametric Statistics for Stochastic Processes. Springer; New York: 1961.
Carroll RJ, Fan J, Gijbels I, Wand MP. Generalized partially linear single-index models. Journal of the American Statistical Association. 1997;92:477–489.
Chaganty NR, Joe H. Efficiency of generalized estimating equations for binary responses. Journal of the Royal Statistical Society: Series B. 2004;66:851–860.
Chiuve SE, Fung TT, Rimm EB, Hu FB, McCullough ML, Wang M, Stampfer MJ, Willett WC. Alternative dietary indices both strongly predict risk of chronic disease. Journal of Nutrition. 2012;142:1009–1018. doi: 10.3945/jn.111.157222.
Cui X, Härdle WK, Zhu L, et al. The efm approach for single-index models. Annals of Statistics. 2011;39:1658–1688.
de Boor C. A Practical Guide to Splines. Springer; New York: 2001.
DeVore RA, Lorentz GG. Constructive Approximation. Springer; Berlin: 1993.
Fan J, Lv J. Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory. 2011;57:5467–5484. doi: 10.1109/TIT.2011.2158486.
George SM, Neuhouser ML, Mayne ST, Irwin ML, Albanes D, Gail MH, Alfano CM, Bernstein L, McTiernan A, Reedy J, Smith AW, Ulrich CM, Ballard-Barbash R. Postdiagnosis diet quality is inversely related to a biomarker of inflammation among breast cancer survivors. Cancer Epidemiology, Biomarkers & Prevention. 2010;19:2220–2228. doi: 10.1158/1055-9965.EPI-10-0464.
Guenther PM, Reedy J, Krebs-Smith SM. Development of the Healthy Eating Index-2005. Journal of the American Dietetic Association. 2008;108:1896–1901. doi: 10.1016/j.jada.2008.08.016.
Guenther PM, Reedy J, Krebs-Smith SM, Reeve BB. Evaluation of the Healthy Eating Index-2005. Journal of the American Dietetic Association. 2008;108:1854–1864. doi: 10.1016/j.jada.2008.08.011.
Huang JZ. Local asymptotics for polynomial spline regression. Annals of Statistics. 2003;31:1600–1635.
Huang JZ, Wu CO, Zhou L. Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika. 2002;89:111–128.
Huang JZ, Yang L. Identification of non-linear additive autoregressive models. Journal of the Royal Statistical Society: Series B. 2004;66:463–477.
Le Cessie S, Van Houwelingen J. Logistic regression for correlated binary data. Applied Statistics. 1994;43:95–108.
Liu X, Wang L, Liang H. Estimation and variable selection for semiparametric additive partial linear models. Statistica Sinica. 2011;21:1225. doi: 10.5705/ss.2009.140.
Ma S, Yang L. A jump-detecting procedure based on spline estimation. Journal of Nonparametric Statistics. 2011;23:67–81.
Ma Y, Zhu L. Doubly robust and efficient estimators for heteroscedastic partially linear single-index models allowing high dimensional covariates. Journal of the Royal Statistical Society: Series B. 2013;75:305–322. doi: 10.1111/j.1467-9868.2012.01040.x.
McCullough ML, Feskanich D, Stampfer MJ, Giovannucci EL, Rimm EB, Hu FB, Spiegelman D, Hunter DJ, Colditz GA, Willett WC. Diet quality and major chronic disease risk in men and women: moving toward improved dietary guidance. American Journal of Clinical Nutrition. 2002;76:1261–1271. doi: 10.1093/ajcn/76.6.1261.
Panagiotakos DB, Pitsavos C, Stefanadis C. Dietary patterns: a mediterranean diet score and its relation to clinical and biological markers of cardiovascular disease risk. Nutrition, Metabolism and Cardiovascular Diseases. 2006;16:559–568. doi: 10.1016/j.numecd.2005.08.006.
Reedy JR, Mitrou PN, Krebs-Smith SM, Wirfält E, Flood AV, Kipnis V, Leitzmann M, Mouw T, Hollenbeck A, Schatzkin A, Subar AF. Index-based dietary patterns and risk of colorectal cancer: the nih-aarp diet and health study. American Journal of Epidemiology. 2008;168:38–48. doi: 10.1093/aje/kwn097.
Schatzkin A, Subar AF, Thompson FE, Harlan LC, Tangrea J, Hollenbeck AR, Hurwitz PE, Coyle L, Schussler N, Michaud DS, Freedman LS, Brown CC, Midthune D, Kipnis V. Design and serendipity in establishing a large cohort with wide dietary intake distributions: the national institutes of health-aarp diet and health study. American Journal of Epidemiology. 2001;154:1119–1125. doi: 10.1093/aje/154.12.1119.
Stone CJ. Additive regression and other nonparametric models. Annals of Statistics. 1985:689–705.
Subar AF, Thompson FE, Kipnis V, Midthune D, Hurwitz P, McNutt S, McIntosh A, Rosenfeld S. Comparative validation of the block, willett, and national cancer institute food frequency questionnaires: The Eating at America’s Table Study. American Journal of Epidemiology. 2001;154:1089–1099. doi: 10.1093/aje/154.12.1089.
Trichopoulou A, Orfanos P, Norat T, Bueno-de Mesquita B, Ocké MC, Peeters PH, van der Schouw YT, Boeing H, Hoffmann K, Boffetta P, et al. Modified mediterranean diet and survival: Epic-elderly prospective cohort study. British Medical Journal. 2005;330:991. doi: 10.1136/bmj.38415.644155.8F.
Wang J, Yang L. Polynomial spline confidence bands for regression curves. Statistica Sinica. 2009a;19:325.
Wang L, Liu X, Liang H, Carroll RJ. Estimation and variable selection for generalized additive partial linear models. Annals of Statistics. 2011;39:1827. doi: 10.1214/11-AOS885SUPP.
Wang L, Yang L. Spline estimation of single-index models. Statistica Sinica. 2009b;19:765.
Xue L, Yang L. Additive coefficient modeling via polynomial spline. Statistica Sinica. 2006;16:1423.
Yu Y, Ruppert D. Penalized spline estimation for partially linear single-index models. Journal of the American Statistical Association. 2002;97:1042–1054.
Zhao LP, Prentice RL. Correlated binary regression using a quadratic exponential model. Biometrika. 1990;77:642–648.
Zhou S, Shen X, Wolfe DA. Local asymptotics for regression splines and confidence regions. The Annals of Statistics. 1998:1760–1782.
Zhou S, Wolfe DA. On derivative estimation in spline regression. Statistica Sinica. 2000:93–108.


