Abstract
We consider a problem motivated by issues in nutritional epidemiology, across diseases and populations. In this area, it is becoming increasingly common for diseases to be modeled by a single diet score, such as the Healthy Eating Index, the Mediterranean Diet Score, etc. For each disease and for each population, a partially linear single-index model is fit. The partially linear aspect of the problem is allowed to differ in each population and disease. However, and crucially, the single-index itself, having to do with the diet score, is common to all diseases and populations, and the nonparametrically estimated functions of the single-index are the same up to a scale parameter. Using B-splines with an increasing number of knots, we develop a method to solve the problem, and display its asymptotic theory. An application to the NIH-AARP Study of Diet and Health is described, where we show the advantages of using multiple diseases and populations simultaneously rather than one at a time in understanding the effect of increased Milk consumption. Simulations illustrate the properties of the methods.
Keywords: Asymptotic theory, B-splines, Combining data sets, Healthy Eating Index, Logistic regression, Partially linear single-index models, Semiparametric models, Single-index models
1 Introduction
We describe a novel partially linear logistic single index model in which there are multiple populations, and multiple diseases within each population, but where the single index part of the model is shared across the populations and diseases. In the case of a single disease across independent populations, we derive B-spline based semiparametric efficient methodology. In other cases, such as multiple populations with multiple diseases, our B-spline based methods are consistent and we derive their asymptotic theory.
The problem arises from common practice in nutritional epidemiology, where the goal is to relate nutritional intakes to disease. In this area, it is increasingly common to relate the patterns of multiple dietary components, rather than an individual dietary component, to a disease. One popular way to summarize dietary intake patterns is through a dietary pattern score. While there are many flavors of dietary pattern scores, in our example we use the U.S. Department of Agriculture’s (USDA’s) Healthy Eating Index-2005 (HEI-2005, http://www.cnpp.usda.gov/HealthyEatingIndex.htm). It is based on the key recommendations of the 2005 Dietary Guidelines for Americans available at http://www.health.gov/dietaryguidelines/dga2005/document/default.htm. The HEI-2005 comprises 12 distinct component scores. Intakes of each food or nutrient, represented by one of the 12 components, are expressed as a ratio to energy (caloric) intake, assessed, and given a score. See Table 1 for a list of these components and the standards for scoring, and see Guenther et al. (2008) and Guenther et al. (2008) for details. The 12 different component scores are then summed to get a total score, ranging from 0 for a terrible diet to 100 for the best possible diet.
Table 1.
Component | Units | HEI-2005 score calculation |
---|---|---|
Total Fruit | cups | min {5, 5 × (density/.8)} |
Whole Fruit | cups | min {5, 5 × (density/.4)} |
Total Vegetables | cups | min {5, 5 × (density/1.1)} |
DOL | cups | min {5, 5 × (density/.4)} |
Total Grains | ounces | min {5, 5 × (density/3)} |
Whole Grains | ounces | min {5, 5 × (density/1.5)} |
Milk | cups | min {10, 10 × (density/1.3)} |
Meat and Beans | ounces | min {10, 10 × (density/2.5)} |
Oil | grams | min {10, 10 × (density/12)} |
Saturated Fat | % of energy | if density ≥ 15, score = 0; else if density ≤ 7, score = 10; else if density > 10, score = 8 − {8 × (density − 10)/5}; else score = 10 − {2 × (density − 7)/3} |
Sodium | milligrams | if density ≥ 2000, score = 0; else if density ≤ 700, score = 10; else if density ≥ 1100, score = 8 − {8 × (density − 1100)/(2000 − 1100)}; else score = 10 − {2 × (density − 700)/(1100 − 700)} |
SoFAAS | % of energy | if density ≥ 50, score = 0; else if density ≤ 20, score = 20; else score = 20 − {20 × (density − 20)/(50 − 20)} |
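The scoring rules in Table 1 can be written out directly. The following is a minimal sketch in Python; the function names are hypothetical and chosen only for illustration, and the densities are assumed to be supplied in the units listed in the table.

```python
def capped_ratio(density, standard, max_score):
    """Adequacy components: min{max_score, max_score * (density / standard)}."""
    return min(max_score, max_score * density / standard)


def saturated_fat_score(density):
    """Saturated fat, as % of energy, scored on a 0-10 scale."""
    if density >= 15:
        return 0.0
    if density <= 7:
        return 10.0
    if density > 10:
        return 8 - 8 * (density - 10) / 5
    return 10 - 2 * (density - 7) / 3


def sodium_score(density):
    """Sodium, in milligrams, scored on a 0-10 scale."""
    if density >= 2000:
        return 0.0
    if density <= 700:
        return 10.0
    if density >= 1100:
        return 8 - 8 * (density - 1100) / (2000 - 1100)
    return 10 - 2 * (density - 700) / (1100 - 700)


def sofaas_score(density):
    """SoFAAS, as % of energy, scored on a 0-20 scale."""
    if density >= 50:
        return 0.0
    if density <= 20:
        return 20.0
    return 20 - 20 * (density - 20) / (50 - 20)
```

For example, `capped_ratio(0.4, 0.8, 5)` gives the Total Fruit score for a density of 0.4 cups, and `capped_ratio(density, 1.3, 10)` gives the Milk score.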
The key concept here is that the total score is developed before any health outcome data are considered. Once a total score is developed, it is then used, across multiple populations, in risk models to relate any disease to the total score. As an example, Reedy et al. (2008) show that for colorectal cancer in the NIH-AARP Study of Diet and Health (Schatzkin et al., 2001), with diet assessed by a food frequency questionnaire, higher HEI-2005 total scores are statistically significantly associated with lower disease risks. They also consider three other dietary pattern scores. George et al. (2010) show that among breast cancer survivors, higher HEI-2005 total scores are associated with lower levels of chronic inflammation. Chiuve et al. (2012) show that the HEI-2005 total score and the Alternative Healthy Eating Index (AHEI) (McCullough et al., 2002) are significant predictors of chronic diseases such as coronary heart disease, diabetes, stroke and cancer, and that closer adherence to the 2005 Dietary Guidelines may lower the risk of major chronic diseases. The AHEI is also associated with all cause mortality (Akbaraly et al., 2011).
In its most general form, there are k = 1, …, K populations. Within population k, there are ℓ = 1, …, Lk diseases. As in the HEI-2005, there are j = 0, …, J dietary components. Let the J + 1 individual component scores in population k be (X0k, …, XJk): in our case, J + 1 = 12. Then the current practice in nutritional epidemiology would be to form a total score $\sum_{j=0}^{J} X_{jk}$ for population k and use it as the risk predictor for all populations/diseases. Thus, for example, in a logistic regression with a binary outcome Ykℓ, and with H(·) being the logistic distribution function, the model for population k and disease ℓ is
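$$\mathrm{pr}(Y_{k\ell} = 1 \mid X_{0k}, \ldots, X_{Jk}) = H\Big(\beta_{0k\ell} + \beta_{1k\ell} \sum_{j=0}^{J} X_{jk}\Big). \tag{1}$$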
This is important and convenient from a public health perspective, because it enables nutritional epidemiologists to use the same predictor, namely $\sum_{j=0}^{J} X_{jk}$, for all diseases, and to describe the effect of that predictor through a single quantity, β1kℓ.
Crucially, it is undesirable to try to fit different parameters for each component score. That is, instead of fitting (1), one might fit
(2) |
The reason why (1) is preferred to (2) for public health purposes is that it is much more interpretable. Model (1) describes how a single, interpretable score, $\sum_{j=0}^{J} X_{jk}$, affects disease risk. Model (2) is chaotic because it requires policy makers to say things such as “if you are in population k = 1 and are worried about disease ℓ = 1 then your diet improves your risk if you eat this kind of food more and that kind of food less, but for disease ℓ = 2 you need to consider your dietary composition in another way”. Interpretability is even more complicated because the component scores have a reasonably complex pattern of correlations, see Table S.1 of the Supplementary Material. Since there are so many diseases and populations, this is not helpful practically and would not be used. Indeed, as seen above, the single HEI-2005 score is associated with colon cancer (Reedy et al., 2008), chronic inflammation (George et al., 2010) and many chronic diseases (Chiuve et al., 2012), and its ease of interpretation is apparent.
Our goal is to develop a single interpretable score that, unlike the HEI-2005 score or other scores, is calibrated to different populations or diseases. We do this not by summing the component scores, but by weighting them and allowing a more flexible shape. Thus, for an unknown function m(·) and weights (α0, …, αJ), we propose the single-index score $m(\sum_{j=0}^{J} \alpha_j X_{jk})$, and model the risk as
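$$\mathrm{pr}(Y_{k\ell} = 1 \mid X_{0k}, \ldots, X_{Jk}) = H\Big\{\beta_{0k\ell} + \beta_{1k\ell}\, m\Big(\sum_{j=0}^{J} \alpha_j X_{jk}\Big)\Big\}. \tag{3}$$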
Model (3), like model (1), is based upon a single interpretable score, $m(\sum_{j=0}^{J} \alpha_j X_{jk})$, that can be used across populations and/or diseases.
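To make the contrast between models (1) and (3) concrete, the sketch below evaluates both risks in Python. It assumes that model (1) has the form H(β0 + β1 Σj Xj) and that model (3) has the form H{β0 + β1 m(Σj αj Xj)}; the intercept `beta0`, the function names and all numerical values are hypothetical and used only for illustration.

```python
import math


def logistic(x):
    """Logistic distribution function H(x), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def risk_total_score(x, beta0, beta1):
    """Risk based on the unweighted total score, as in model (1)."""
    return logistic(beta0 + beta1 * sum(x))


def risk_single_index(x, alpha, m, beta0, beta1):
    """Risk based on a weighted index passed through a function m, as in model (3)."""
    index = sum(a * xj for a, xj in zip(alpha, x))
    return logistic(beta0 + beta1 * m(index))
```

Under these assumptions, setting every weight to 1 and taking m to be the identity recovers the total-score risk, so the total-score model is a special case of the single-index model.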
In Section 2, we present our model more formally. Section 3 describes how to fit the model for a single disease across independent populations, and we show that our method is semiparametric efficient in this case. In Section 4 we describe generalizations. Section 4.1 considers a single population with multiple diseases, while Section 4.2 describes the real goal of our data analysis, where there are multiple populations and multiple diseases. Section 5 gives results of the data example, while Section 6 describes simulation results. Computational and technical details are in an Appendix and in Supplementary Material.
2 Multiple Population Single-Index Model
2.1 Model and Splines
In this section, we consider a single disease across multiple different populations. There are k = 1, …, K populations. For the kth population, there are i = 1, …, nk individuals with binary responses Yik. Define Xik = (Xik1, …, XikJ)T. In addition to the responses, for individual i in population k we observe (Gik, Xik0, Xik, Zik, GikWik), defined as follows. The J + 1 dietary component scores are $(X_{ik0}, X_{ik}^{T})^{T}$. Covariates that are observed for all individuals are Zik = (Zik1, …, Zikd)T. Further, we allow for a subset of individuals to have additional covariates Wik = (Wik1, …, Wika)T, and define Gik to be the binary indicator that these individuals have such additional covariates. For example, Reedy et al. (2008) fit models to men and women with many common covariates such as age and levels of educational status, but for women they in addition include indicators of types of hormone replacement therapy.
For the ith individual and the kth outcome, we posit the marginal model
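$$\mathrm{pr}(Y_{ik} = 1 \mid G_{ik}, X_{ik0}, X_{ik}, Z_{ik}, G_{ik}W_{ik}) = H\{Z_{ik}^{T}\theta_{k1} + G_{ik}Z_{ik}^{T}\theta_{k2} + G_{ik}W_{ik}^{T}\theta_{k3} + (\beta_{k1} + \beta_{k2}G_{ik})\, m(X_{ik0} + X_{ik}^{T}\alpha)\}. \tag{4}$$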
Here α ∈ ℝJ, θk1 ∈ ℝd, θk2 ∈ ℝd, θk3 ∈ ℝa, βk1 ∈ ℝ and βk2 ∈ ℝ. Crucially, for use in practice, the function m(·) and the parameter α do not depend on k.
Remark 1
In model (4), the most general form of the single-index is $\alpha_0 X_{ik0} + X_{ik}^{T}\alpha$. However, because m(·) is modeled nonparametrically, such a formulation is not identifiable. There are three equivalent ways to obtain identifiability. The first common method, what we have done, is to select one variable that is known to be related to the response, which we label as Xik0, and to set its parameter α0 = 1. A second common method is to make the restriction that $\alpha_0^2 + \alpha^{T}\alpha = 1$. In the context of our problem, there is a third way. Since common nutritional epidemiology practice is to weight each variable the same, namely α0 = α1 = ⋯ = αJ = 1, the sum of the weights = J + 1. For comparison purposes, we can achieve identifiability via the restriction α0 + (1, …, 1)α = J + 1. We use the first method in our computations, but report results for the third.
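Moving from the first identifiability convention (α0 = 1) to the third (weights summing to J + 1) amounts to multiplying all weights by a common constant, since a change of scale in the index is absorbed by the nonparametric function m(·). A minimal Python sketch follows; the function name is hypothetical, and the sketch assumes the weights under the first convention have a positive sum.

```python
def rescale_weights(alpha):
    """Convert weights identified by alpha_0 = 1 into weights that sum to J + 1.

    `alpha` holds (alpha_1, ..., alpha_J); alpha_0 = 1 is prepended.
    """
    full = [1.0] + list(alpha)      # (alpha_0, alpha_1, ..., alpha_J)
    target = len(full)              # J + 1
    total = sum(full)
    if total <= 0:
        raise ValueError("weights must have a positive sum to be rescaled")
    scale = target / total
    return [scale * a for a in full]
```

For instance, `rescale_weights([3.0])` returns `[0.5, 1.5]`: the ratio between the two weights is unchanged, and they now sum to J + 1 = 2.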
Model (4) generalizes the now-classical generalized partially linear single-index model (Carroll et al., 1997), with the novelty being both the context and that the same single-index is used across multiple outcomes.
Single-index models have been widely used as a popular tool in multivariate nonparametric regression to alleviate the “curse of dimensionality” (Bellman, 1961). For example, recently Yu and Ruppert (2002) used penalized spline least squares estimation for single-index models with independent and identically distributed observations: their number of knots was fixed, unlike in our development. Wang and Yang (2009a) proposed polynomial spline estimation and extended the results to weakly dependent response variables. Cui et al. (2011) developed a kernel estimating function method for generalized single-index models, while Ma and Zhu (2013) constructed robust and efficient estimation with high dimensional covariates. These papers are restricted to the case of one population, K = 1, and one outcome, L = 1. Here we consider multiple populations and multiple outcomes. We propose a regression spline based profile estimation procedure and establish the asymptotic properties of the estimators in model (4).
To set the main ideas clearly, it is convenient to first assume that the Yik are independent, see Section 4 for the more general cases of interest in the HEI-2005 problem. To ensure identifiability, we set β11 = −1, and we set the first component of θ11 = (θ111, …, θ11d) equal to zero, i.e., θ111 = 0. Hence, our parameters are (ν, m), where m is an unspecified function that is sufficiently smooth, while
Thus, ν has total dimension dν = J + 2K + 2Kd + Ka − 2.
Let be the realizations of . The unknown function m(·) is estimated by polynomial splines described as follows. Without loss of generality, assume u ∈ [a, b]. Let N = Nn be the number of interior knots. Divide [a, b] into (N + 1) subintervals Ip = {(ξp, ξp+1), p = r, r + 1, …, N + r − 1}, IN = (ξN+r, 1), where is a sequence of interior knots, given as
Define the distance between neighboring knots as hp = ξp+1 − ξp, r ≤ p ≤ N + r, and h = maxr≤p≤N+r hp. Let Gn be the space of B-splines of order r, so that Pn = N + r is the number of functions in Gn. For u ∈ [a, b], let Gn be the linear space spanned by the B-spline functions Br(u) = {Br,p(u), 1 ≤ p ≤ Pn}T. Then m(u) can be approximated by $B_r(u)^{T}\lambda$, where λ = (λ1, …, λPn)T. B-splines have been used frequently to estimate the nonparametric functions in nonparametric and semiparametric models because they are easy to compute with derivable asymptotic theory. See Huang (2003) and Wang and Yang (2009b) for their utility in nonparametric models, Stone (1985) and Huang and Yang (2004) in additive models, and Huang et al. (2002), Liu et al. (2011) and Wang et al. (2011) in semiparametric models.
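As an illustration of the spline space described above, the following Python sketch builds an order-r B-spline basis with N interior knots using the standard Cox-de Boor recursion, so that the number of basis functions is Pn = N + r. The equally spaced interior knots, the zero-based indexing of the extended knot vector and all function names are assumptions of this sketch, not a transcription of the authors' implementation.

```python
def make_knots(a, b, n_interior, order):
    """Extended knot vector: each boundary knot repeated `order` times,
    with equally spaced interior knots (an assumption of this sketch)."""
    step = (b - a) / (n_interior + 1)
    interior = [a + step * (i + 1) for i in range(n_interior)]
    return [a] * order + interior + [b] * order


def bspline_basis(u, knots, order):
    """Values at u of all B-spline basis functions of the given order."""
    spans = len(knots) - 1
    # Order 1: indicator of the knot span that contains u.
    basis = [1.0 if knots[i] <= u < knots[i + 1] else 0.0 for i in range(spans)]
    if u == knots[-1]:
        # Close the last non-degenerate span on the right.
        for i in range(spans - 1, -1, -1):
            if knots[i] < knots[i + 1]:
                basis[i] = 1.0
                break
    # Raise the order one step at a time (Cox-de Boor recursion).
    for k in range(2, order + 1):
        raised = []
        for i in range(spans - k + 1):
            value = 0.0
            if knots[i + k - 1] > knots[i]:
                value += (u - knots[i]) / (knots[i + k - 1] - knots[i]) * basis[i]
            if knots[i + k] > knots[i + 1]:
                value += (knots[i + k] - u) / (knots[i + k] - knots[i + 1]) * basis[i + 1]
            raised.append(value)
        basis = raised
    return basis


def spline_value(u, lam, knots, order):
    """Spline approximation of m(u): inner product of the basis with lambda."""
    return sum(l * b for l, b in zip(lam, bspline_basis(u, knots, order)))
```

With `knots = make_knots(0.0, 1.0, 3, 4)`, the call `bspline_basis(0.37, knots, 4)` returns N + r = 7 nonnegative values that sum to one.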
3 Profile Estimating Procedure
Our estimation is performed through a conceptually simple profiling procedure, as described below.
Step 1
Define
Treating ν as a fixed parameter, estimate m(u) by spline functions with λ̂(ν) = {λ̂1(ν), ⋯, λ̂Pn(ν)}T through maximizing
(5) |
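A sketch of the objective in Step 1 may help fix ideas. It assumes that the criterion maximized over λ is the logistic log-likelihood, with terms Yik log(H̃ik) + (1 − Yik) log(1 − H̃ik), and that the linear predictor is an offset collecting the Z and W terms plus (βk1 + βk2Gik) times the spline Br(U)Tλ. Under these assumptions the gradient with respect to λ sums (Yik − H̃ik)(βk1 + βk2Gik)Br(U) over observations. The data layout and function names below are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def logistic(x):
    """Logistic distribution function H(x), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def loglik_and_grad(lam, rows):
    """Log-likelihood in lambda for fixed nu, and its gradient.

    Each row is (y, offset, slope, basis), where offset collects the
    partially linear terms, slope plays the role of beta_k1 + beta_k2 * G,
    and basis is the vector of B-spline values at the index U.
    """
    loglik = 0.0
    grad = [0.0] * len(lam)
    for y, offset, slope, basis in rows:
        m_u = sum(l * b for l, b in zip(lam, basis))
        eta = offset + slope * m_u
        # log H(eta) = -softplus(-eta); log{1 - H(eta)} = -softplus(eta)
        loglik += -y * softplus(-eta) - (1 - y) * softplus(eta)
        resid = y - logistic(eta)
        for p, b in enumerate(basis):
            grad[p] += resid * slope * b
    return loglik, grad
```

The gradient can be handed to any standard optimizer, such as a Newton or quasi-Newton routine, to obtain λ̂(ν) for a given ν.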
To prepare for the second step, we perform the following additional calculations. Let Br−1(u) = {Br−1,p(u) : 2 ≤ p ≤ Pn}T be the B-spline functions of order r − 1. We estimate m′(u, ν), the first derivative of m, through , where , for 2 ≤ p ≤ Pn. This is because the first derivative of a spline function can be expressed in terms of a spline of one order lower; see page 116 of de Boor (2001). Let D = (djj′)1≤j,j′≤Pn−1 be a (Pn − 1) × (Pn − 1) diagonal matrix with djj = 1/(ξj+r − ξj+1) and djj′ = 0 for j ≠ j′, and let D11 = (−D, 0Pn−1)(Pn−1)×Pn and D12 = (0Pn−1, D)(Pn−1)×Pn, where 0Pn−1 is the (Pn − 1)-dimensional vector with 0’s as its elements. Then $\hat m'(u, \nu) = B_{r-1}(u)^{T} D_1 \hat\lambda(\nu)$, where D1 = (r − 1) (D11 + D12). For u ∈ [a, b], define
(6) |
where Vik(ν) = Hik(ν)(1 − Hik(ν)), and
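The derivative relation used in Step 1, namely that the derivative of an order-r spline is an order-(r − 1) spline whose coefficients are scaled differences of neighboring coefficients, can be checked numerically. The Python sketch below uses zero-based indexing on an extended knot vector with repeated boundary knots, which is an assumption of the sketch and differs from the indexing in the text; the function names are hypothetical.

```python
def bspline_basis(u, knots, order):
    """Values at u of all B-spline basis functions (Cox-de Boor recursion)."""
    spans = len(knots) - 1
    basis = [1.0 if knots[i] <= u < knots[i + 1] else 0.0 for i in range(spans)]
    if u == knots[-1]:
        for i in range(spans - 1, -1, -1):
            if knots[i] < knots[i + 1]:
                basis[i] = 1.0
                break
    for k in range(2, order + 1):
        raised = []
        for i in range(spans - k + 1):
            value = 0.0
            if knots[i + k - 1] > knots[i]:
                value += (u - knots[i]) / (knots[i + k - 1] - knots[i]) * basis[i]
            if knots[i + k] > knots[i + 1]:
                value += (knots[i + k] - u) / (knots[i + k] - knots[i + 1]) * basis[i + 1]
            raised.append(value)
        basis = raised
    return basis


def spline_value(u, lam, knots, order):
    return sum(l * b for l, b in zip(lam, bspline_basis(u, knots, order)))


def spline_derivative(u, lam, knots, order):
    """Derivative of the spline: an order-(r - 1) spline with coefficients
    (r - 1) * (lam[p] - lam[p - 1]) / (knots[p + r - 1] - knots[p])."""
    r = order
    diffs = [
        (r - 1) * (lam[p] - lam[p - 1]) / (knots[p + r - 1] - knots[p])
        for p in range(1, len(lam))
    ]
    # Dropping one boundary knot at each end gives the order-(r - 1) basis.
    lower = bspline_basis(u, knots[1:-1], r - 1)
    return sum(d * b for d, b in zip(diffs, lower))
```

Comparing `spline_derivative` with a central finite difference of `spline_value` at an interior point gives agreement up to numerical error, which is a convenient check on the indexing of the difference matrix.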
Step 2
Define
(7) |
Estimate ν by ν̂ through maximizing
Once we obtain ν̂, we plug it into m̂ and take {ν̂, m̂(u, ν̂)} as the final estimator. To prepare the description of the asymptotic properties of our procedure, we define
and for k = 2, …, K define
Denote the elements of . Let
be the collection of the true parameters. Let [a0, b0] be the support of , where α0 is the true population parameter. Denote ‖·‖ as the L2 norm of any square integrable function on [a0, b0]. For 1 ≤ ℓ ≤ dν, let be the function ηℓ(·) ∈L2([a0, b0]) that minimizes . Also, define . For simplicity of notations, we let Vik = Vik(ν0), Qik = Qik(ν0) and Uik = Uik(α0). Let .
In the following three theorems, we establish the consistency, asymptotic normality and efficiency of our procedure.
Theorem 1
Under the conditions in Appendix A.2, when ν is the collection of the true parameters or a $\sqrt{n}$-consistent estimator of ν0, (a) $|\hat m(u, \nu) - m(u)| = O_p\{(nh)^{-1/2} + h^{q}\}$ uniformly in u ∈ [a0, b0]; (b) $|\hat m'(u, \nu) - m'(u)| = O_p(n^{-1/2}h^{-3/2} + h^{q-1})$ uniformly in u ∈ [a0, b0]; and (c) as n → ∞, $\hat\sigma^{-1}(u, \nu_0)\{\hat m(u, \nu) - m(u)\} \to \mathrm{Normal}(0, 1)$.
Theorem 2
Define n = n1 and for k = 2, …, K, define nk = nckn, where there are constants c* > 0 and c** < ∞ such that c* ≤ ck = limn→∞ ckn ≤ c**. Under the conditions in Appendix A.2, $\|\hat\nu - \nu_0\|_2 = O_p(n^{-1/2})$, and
(8) |
for ν̂ in a neighborhood of ν0. Then as n → ∞, $n^{1/2}(\hat\nu - \nu_0) \to \mathrm{Normal}(0_{d_\nu}, \Sigma)$, where
and 0dν is a dν-dimensional vector with “0” as its elements. Here and throughout the text, a⊗2 ≡ aaT for any matrix or vector a.
In practice, Σ is estimated by
where V̂ik = Ĥik(ν̂)(1 − Ĥik(ν̂)), Π̂nQik(ν̂) = (Π̂nQik,ℓ(ν̂), 1 ≤ ℓ ≤ dν)T, and for 1 ≤ l ≤ dν, , where
In addition, under the assumption of independence, or conditional independence of the Yik given the covariates, our estimation method is semiparametric efficient. We state this as
Theorem 3
Under the conditions in Appendix A.2, profile likelihood estimation of the parameter ν reaches the semiparametric efficiency bound. The minimum variance bound for estimating ν can be further simplified to
The proofs of the theorems are given in the Appendix.
4 Generalizations
4.1 Single Population, Multiple Diseases
In this section, we relax the assumption of independence of the Yik given the covariates, and consider the case of a single population with K outcomes, with a common sample size n. The response indicators remain as (Yi1, …, YiK), but now the covariates are the same for each response, and are written as ℂi = (Gi, Xi0, Xi, Zi, GiWi), and now we use . Ignoring this correlation and invoking a “working independence” principle, the profile likelihood procedure described in Section 3 will still provide consistent estimation. However, more efficient estimation can be generally obtained through taking into account the correlation structure.
Specifically, the derivative of Yiklog(H̃ik) + (1 − Yik)log(1 − H̃ik) with respect to λ is (Yik − H̃ik)(βk1 + βk2Gi)Br(Ui(α)). Translated to the setting of this section, Step 1 in Section 3 is equivalent to solving
Here, we modify this step to
Step 1d
Let Ωi represent a working covariance matrix of (Yi1, …, YiK) conditional on ℂi. Let Bi(ν) be a K × K matrix with the (k, k′) entry Bi,k,k′ (ν) = Ωi,k,k′ (βk1 + βk2Gi)(βk′1 + βk′2Gi). Obtain λ̂w(ν) by solving where
where Ṽik = H̃ik(1 − H̃ik).
Using λ̂w(ν), we form the corresponding estimators of m(u) and m′(u), which are and . Define Ai(ν) = {Vi1(β11 + β12Gi)2, …, Vik(βK1 + βK2Gi)2}. Let Ai = Ai(ν0) and Bi = Bi(ν0). For u ∈ [a, b], define , where
and where Qi is a K × K matrix with the (k, k′) entry
In the description above, Ωi is a generic working covariance matrix. Here is how we implemented it. Let Ωi be the conditional covariance matrix of Yi = (Yi1, …, YiK)T given ℂi. Then the (k, k′) entry of Ωi is Ωi,k,k′ = E(YikYik′ | ℂi) − HikHik′. In practice, we estimate Ωi by , where V̂i is a K × K diagonal matrix with the kth diagonal as Ĥik(1 − Ĥik) and , where Ĥi = (Ĥi1, …, ĤiK)T and Ĥik = Ĥik(ν̂).
Similarly, the derivative of the (i, k) term in (7) with respect to ν is
where Ĥik(ν) is the same as H̃ik except that λ in H̃ik is replaced by λ̂w(ν) in Ĥik(ν), and Q̂ik(ν) is the same as Qik(ν) except that m(·), m′(·) in Qik(ν) are replaced by m̂w(·, ν), in Q̂ik, and is the Pn × dν derivative matrix of λ̂w(ν) with respect to ν. We thus modify Step 2 to
Step 2d
Let Ψi(ν) be the dνK × 1 vector formed by K vectors, each of length dν, with the kth, k = 1, …, K, vector being . Obtain ν̂w from solving , where Ĉi(ν) is a dν × dνK matrix, with kth block
where V̂ik(ν) = Ĥik(ν)(1 − Ĥik(ν)) and D̂i(ν) is a dνK × dνK matrix, with (k, k′) block
and can be obtained via numerical differentiation. Let β1+β2Gi = {(βk1 + βk2Gi), 1 ≤ k ≤ K}T, and . Then Ai(ν) = (β1 + β2Gi)TΘi(ν). Denote 1dν as the dν-dimensional vector with 1’s as its elements. Let Θi = Θi(ν0). Let Qi = (Qi1, …, QiK)T and let η be a vector of functions η(u) = {η1(u), …, ηdν (u)}T with ηℓ(·) ∈L2([a, b]) that minimizes
Define Ci as a dν × dν K matrix, with kth block . Define Di as a dνK × dνK matrix, with (k, k′) block
and define as a dν K × dνK matrix, with (k, k′) block
In the following two theorems, we establish the consistency and asymptotic normality of our procedure. Different from the independent disease case, without a correct specification of the correlation structure of the occurrences of different diseases, we can no longer achieve semiparametric efficiency.
Theorem 4
Under the conditions in Appendix A.2, when ν is the collection of the true parameters or a $\sqrt{n}$-consistent estimator of ν0, (a) ; (b) $|\hat m_w(u, \nu) - m(u)| = O_p\{(nh)^{-1/2} + h^{q}\}$ uniformly in u ∈ [a0, b0]; and (c) uniformly in u ∈ [a0, b0].
Let be a dνK × dνK matrix, with (k, k′) block
where η̂(Ui(ν̂)) = {η̂1(Ui(ν̂)), …, η̂dν (Ui(ν̂))}T and with {τ̂ℓ} minimizing
Theorem 5
Let (C, D, D*) be generic notation for random variables with the same distribution as (Ci, Di, ). Under the conditions in Appendix A.2, for ν̂ in a neighborhood of ν0, where
Here, Σ is consistently estimated by the sandwich estimator
(9) |
4.2 Multiple Populations and Multiple Diseases
Finally, we consider the general case that there are k = 1, …, K independent populations, and within the kth population, there are ℓ = 1, …, Lk diseases and i = 1, …, nk observations. The outcomes are Yikℓ and the covariates are ℂik = (Gik, Xik0, Xik, Zik, Gik Wik). The model is
(10) |
We make the same assumptions as in Appendix A.2. As in Theorem 2, we write n = n1 and for k = 2, …, K, define nk = nckn, where there are constants c* > 0 and c** < ∞ such that c* ≤ ck = limn→∞ ckn ≤ c**.
Make the definitions of the terms in Section 4.1 appropriate to population k = 1, …, K, e.g., Ãik(ν), Bik(ν), Φik(ν), Ĉik(ν), D̂ik(ν), Ψik(ν), Cik, Dik, Πnk, Ξnk, etc. Obtain λ̂w(ν) by solving , and obtain ν̂w by solving . Define
Then Theorems 4–5 hold with these definitions, see Appendix A.8, and Σ̂ remains a sandwich estimator.
As in other problems involving correlated binary data and generalized estimating equations, the semiparametric efficiency established in Theorem 3 does not hold for the multiple populations and multiple diseases case, mainly due to the fact that the responses are correlated among different diseases and the correlation structure is unknown. Discussion of the working correlation matrix in parametric generalized estimating equation problems can be found in many papers, see for example Chaganty and Joe (2004).
Instead of embedding the problem in the generalized estimating equation framework, as we have done, there is some literature on developing a likelihood function that allows correlation among the binary responses while having the marginal probabilities be of logistic form, see for example Zhao and Prentice (1990) and Le Cessie and Van Houwelingen (1994). Our methods can be extended to this approach, but the ease of computation associated with a generalized estimating equation approach is a considerable advantage. This computational advantage is one of the reasons that generalized estimating equation methods are so widely employed in practice.
5 Data Analysis
5.1 Spline Setup
In all our implementations, we used cubic splines (r = 4) with equally spaced knots to approximate the nonparametric function m(·). We selected the number of interior knots N by minimizing a BIC criterion, where BIC(N) = −2Ln(λ̂, ν̂) + (N + p)log(n). See Xue and Yang (2006) and Ma and Yang (2011) for the properties of the BIC criterion.
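The knot selection rule above is simple to express in code. The Python sketch below takes the criterion exactly as displayed, BIC(N) = −2Ln + (N + p)log(n), with p and n supplied by the user; the function names and the dictionary of maximized log-likelihoods are hypothetical, and fitting the model for each candidate N is assumed to have been done elsewhere.

```python
import math


def bic(loglik, n_interior, p, n):
    """BIC(N) = -2 * L_n + (N + p) * log(n)."""
    return -2.0 * loglik + (n_interior + p) * math.log(n)


def select_interior_knots(logliks, p, n):
    """Return the candidate N with the smallest BIC.

    `logliks` maps each candidate number of interior knots N to the
    maximized log-likelihood obtained with that N.
    """
    return min(logliks, key=lambda n_interior: bic(logliks[n_interior], n_interior, p, n))
```

For example, with `logliks = {1: -1000.0, 2: -990.0, 3: -989.5}`, `p = 5` and `n = 1000`, the small gain in log-likelihood from N = 2 to N = 3 does not offset the log(n) penalty, so N = 2 is selected.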
5.2 Dietary Score Example
We applied our methods to the NIH-AARP Study of Diet and Health (Schatzkin et al., 2001). The method used for assessing dietary component intakes is the National Cancer Institute’s Dietary History Questionnaire (DHQ) (Subar et al., 2001). There were 294,673 men and 199,285 women in the data set. There were also dummy variables for various groups of age, body mass index, education, ethnicity, physical activity and smoking, making up the variables Z. In addition, for women, there were two dummy variables for hormone replacement therapy, making up the variables W. The HEI-2005 score for whole grains was taken as Xik0, and the remaining component scores as Xik. The sum of the weights was normalized to equal J + 1 = 12 for ease of comparison with the HEI-2005 total score, all of whose weights = 1: the standard errors of these weights were obtained by the delta-method after fitting the data as described in Sections 3 and 4.
For women, the data set contains four diseases, breast cancer, ovarian cancer, colorectal cancer and lung cancer, while for men there are prostate cancer, colorectal cancer and lung cancer. See Table 2 for the numbers and percentages of cancer cases. The minimum HEI-2005 total score in the data set was xmin = 19.67, while the maximum was xmax = 96.61.
Table 2.
Men | Women | |||
---|---|---|---|---|
Description | # Cases | Percentages | # Cases | Percentages |
Sample size | 294,673 | 199,285 | ||
Breast cancer | 7,736 | 3.88% | ||
Ovarian cancer | 759 | 0.38% | ||
Prostate cancer | 23,477 | 7.97% | ||
Colorectal cancer | 4,693 | 1.59% | 2,291 | 1.15% |
Lung cancer | 6,135 | 2.08% | 3,630 | 1.82% |
We used to construct B-spline functions, where Φ(·) is the distribution function of the standard normal distribution and U (α̂) = X0 + XTα̂. Thus the nonparametric function m is estimated by .
We performed two analyses. In the first, we took a single disease and the two independent populations of men and women, using the method in Section 3, and applied to colorectal cancer and lung cancer separately. In the second, we analyzed all the cancer outcomes, using the method in Section 4.2. The point of doing the former is to illustrate that analyzing single diseases at a time can lead to very different results than those from analyzing multiple diseases simultaneously, a point we made in Section 1.
5.3 Independent Populations, Single Disease
Our first analysis uses the setup in Section 3, where there are K = 2 independent populations, men and women, and L = 1 disease. We performed analyses separately for colorectal cancer and lung cancer, and display here the results for both. Because hormone replacement therapy occurs only for women, the right-hand side of model (4) is identifiable when the parameter subscripts do not involve k, e.g., β1 + β2Gik.
Table 3 shows the estimates of the weights of the component scores, their standard errors and their p-values for testing whether the weights = 1, i.e., whether the weight equals the HEI-2005 weight.
Table 3.
Colorectal Cancer | Lung Cancer | |||||
---|---|---|---|---|---|---|
Estimate | se | p-value | Estimate | se | p-value | |
Total Fruit | 0.27 | 0.87 | 0.40 | 2.18 | 0.34 | 0.00 |
Total Grains | 2.58 | 0.85 | 0.06 | 2.96 | 0.33 | 0.00 |
Whole Grains | 2.44 | 0.85 | 0.09 | 0.53 | 0.27 | 0.08 |
Total Vegetables | 0.01 | 1.02 | 0.33 | 0.99 | 0.36 | 0.98 |
DOL Vegetables | 1.33 | 0.72 | 0.65 | 0.99 | 0.26 | 0.96 |
Dairy | 2.44 | 0.42 | 0.00 | 0.42 | 0.10 | 0.00 |
Meat and Beans | 0.00 | 0.53 | 0.06 | 0.00 | 0.18 | 0.00 |
Oils | 0.58 | 0.32 | 0.20 | 0.33 | 0.11 | 0.00 |
Sodium | 0.80 | 0.45 | 0.65 | 1.12 | 0.16 | 0.45 |
Saturated Fat | 0.53 | 0.31 | 0.13 | 0.94 | 0.13 | 0.65 |
Empty Calories | 0.49 | 0.21 | 0.02 | 0.21 | 0.08 | 0.00 |
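The p-values in Table 3 test whether each weight equals the HEI-2005 weight of 1. The text does not state the exact test used; the Python sketch below assumes a two-sided Wald test based on the normal approximation, an assumption that is consistent with the tabulated values. The function name is hypothetical.

```python
import math


def wald_p_value(estimate, se, null_value=1.0):
    """Two-sided p-value for H0: weight = null_value, normal approximation."""
    z = (estimate - null_value) / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))
```

For the colorectal cancer column, `wald_p_value(0.27, 0.87)` rounds to the 0.40 reported for Total Fruit, and `wald_p_value(2.44, 0.42)` is below 0.001, in line with the 0.00 reported for Dairy.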
The main conclusion of Table 3 is that the weights for the HEI-2005 component scores are strikingly different for total fruit, whole grains and dairy, depending on whether one is interested in colorectal cancer or lung cancer. This is a point we made in the discussion after equation (2) about having a single score and not one for every disease. Table 3 suggests that if one is worried about colorectal cancer, one should increase consumption of whole grains and dairy products, but if one is worried about lung cancer, such consumption would have only a minor effect, but total fruit intake should be increased.
One can see in Table 3 that the weights of many of the individual component scores differ from HEI-2005’s weight of 1.0. We also tested whether the HEI-2005 weights fit the data as well, by testing H0 : α1 = α2 = ⋯ = αJ = 1. To this end, we constructed the Wald chi-square statistic $(\hat\alpha - 1_J)^{T}\hat V(\hat\alpha)^{-1}(\hat\alpha - 1_J)$, which has an asymptotic chi-square distribution with J degrees of freedom under H0. Here, 1J is the J-dimensional vector of ones, V̂ (α̂) is the estimated asymptotic variance-covariance matrix of α̂, and is calculated following Theorem 5. The p-value for this hypothesis is < 0.0001.
5.4 Multiple Populations and Multiple Diseases Analysis
Our second analysis uses the setup in Section 4.2, with all the cancers available in our data set: lung, colorectal, breast and ovarian cancers for women, and lung, colorectal and prostate cancers for men. We found that the working correlations among men and women were all < 0.03 in absolute value, so we report results for the working independence estimate.
Table 4 shows the estimated weights of the component scores and their standard errors. Because we are using multiple diseases and populations, and not just colorectal or lung cancer separately, but all the cancers simultaneously, we can expect differences between Table 4 and either analysis in Table 3. One of the striking differences is the vast down-weighting of increased Milk consumption compared to the results for colorectal cancer only. In the HEI-2005 score, increase of Milk consumption results in a monotonically increasing score for Milk. In the colorectal cancer case, a person who gets the top score of 10 on Milk contributes 24.4 to the single index. After accounting for the other diseases, however, the contribution is 6.1, a vast decrease. To us, this makes perfect sense, because the value of increased Milk consumption in adults is hardly universally accepted. For example, the Alternative Healthy Eating Index (McCullough et al., 2002) does not even include dairy as part of its index, i.e., increased Milk consumption gets zero weight. The Modified Mediterranean Diet Score (Trichopoulou et al., 2005) and the MedDietScore (Panagiotakos et al., 2006) have been shown to be related to overall survival and coronary heart disease, respectively, but for these scores, increases in Milk consumption lead to decreases in the score for Milk, i.e., negative weight.
Table 4.
Estimate | se | p-value | |
---|---|---|---|
Total Fruit | 1.89 | 0.31 | 0.00 |
Whole Fruit | 1.32 | 0.30 | 0.27 |
Total Grains | 2.94 | 0.30 | 0.00 |
Whole Grains | 0.70 | 0.26 | 0.32 |
Total Vegetables | 0.97 | 0.34 | 0.93 |
DOL Vegetables | 0.93 | 0.24 | 0.81 |
Dairy | 0.61 | 0.09 | 0.00 |
Meat and Beans | 0.00 | 0.17 | 0.00 |
Oils | 0.39 | 0.11 | 0.00 |
Sodium | 1.13 | 0.15 | 0.36 |
Saturated Fat | 0.89 | 0.12 | 0.40 |
Empty Calories | 0.23 | 0.07 | 0.00 |
Table 5 shows the estimates, standard errors (se) and p-values for the coefficients β. The β coefficient for men associated with lung cancer was set to −1 for identifiability. However, when we instead set the coefficient for lung cancer for women to −1, the estimated coefficient for lung cancer for men was −1.06, and the p-value was very small. It is clear from the table that the real practical impact of diet here is its contribution to decreases in risk for lung and colorectal cancers, for both men and women, and that the impact is greater for lung cancers. See below for a discussion of the relative risks, displayed in Figure 2, which supports our conclusion. The estimated values for all other groups are also negative except for two groups: (a) men and prostate cancer; and (b) women and ovarian cancer, where the coefficients are very small: they have both been set to 0 under the constraint that a better diet is not a risk factor for either cancer. In the figures that we discuss, the index (x-axis) plotted is from the 3rd to the 97th percentiles of the actual index.
Table 5.
Estimate | se | p-value | |
---|---|---|---|
Men, Lung | −1.00 | NA | NA |
Men, Colorectal | −0.39 | 0.06 | 0.00 |
Men, Prostate | 0.00 | 0.06 | 0.07 |
Women, Lung | −0.91 | 0.07 | 0.00 |
Women, Colorectal | −0.28 | 0.08 | 0.00 |
Women, Breast | −0.07 | 0.04 | 0.11 |
Women, Ovarian | 0.00 | 0.14 | 0.90 |
Figure 1 shows the plot of the estimates of m(·) against the index u (α̂) along with pointwise 95% confidence intervals, without any additional monotonicity constraints on m(·). The estimated function itself is monotone as expected. Observe that the estimated function is not an exact linear function, especially when considering the pointwise confidence intervals. Indeed, from the index value of 50 to 72, the estimated function has an increasing acceleration, then it becomes flat, and it increases quickly again when the index value is greater than 82. When we refit the data with a linear link, the results, while different, are in good agreement, both in the estimated functions, the tables, and the analyses that are described next.
Figure 2 displays the estimated relative risks of the various cancers, separately for men and women. Clearly, the index predicts stronger decreases in relative risk with a better diet index score for lung cancer than for the other diseases. In Figure 2, the effect of better diet on prostate cancer in men and ovarian cancer in women is nearly null, and the effect of better diet on breast cancer is quite modest. When analyzing the HEI-2005 total score, the p-values for prostate cancer, breast cancer and ovarian cancer were 0.15, 0.09, and 0.44, respectively, roughly what is seen in Figure 2.
Figure S.1 of the Supplementary Material shows how the estimated relative risks differ between men and women for lung and colorectal cancer. In both cases, women have the lower risk in general, with the largest difference being in colorectal cancer, but even there, the differences are not great. This agrees with the marginal rate of lung cancer for men and women being 2.08% and 1.82%, respectively, and the marginal rate of colorectal cancer for men and women being 1.59% and 1.15%, respectively; see Table 2.
The hypothesis that the weights all equal 1.0 is rejected with a p-value numerically very close to 0, as expected.
6 Simulation
In this section, we describe a simulation study to assess the finite-sample performance of our method in the case of two populations and multiple diseases. Section S.1 of the Supplementary Material has results for two independent populations and one disease. Here we simulated data from the logistic model with multiple populations and diseases, so that
for i = 1, …, n and k = 1, 2. We consider two independent populations, and within the kth population, there are ℓ = 1, …, Lk diseases and i = 1, …, n observations. We let L1 = 3 and L2 = 4, so that, as in the example of Section 5, the first and second populations have three and four diseases, respectively. There were 1,000 simulated data sets.
For each simulated data set, we let n = 3,000, and set the covariates to be randomly selected, without replacement, from the real data in Section 5. We set each component of α = 1, and adopted the convention that the estimates should sum to 12. The true values of βkℓ are listed in Table 7. We simulated the components in θkℓ1 and θkℓ3 from the Uniform[−0.5, 0.5] distribution, except that, for identifiability, the first component in θk1 is taken to be zero. The true function is m(u) = exp(u/3).
Table 7.
Population, Disease | Mean # Cases | True β | Mean β̂ | Estimated se | Actual se | Coverage |
---|---|---|---|---|---|---|
Population 1, Disease 1 | 826 | −1.00 | −1.00 | NA | NA | NA |
Population 1, Disease 2 | 1052 | −0.60 | −0.61 | 0.12 | 0.11 | 97.00 |
Population 1, Disease 3 | 1347 | −0.20 | −0.21 | 0.10 | 0.09 | 97.70 |
Population 2, Disease 1 | 957 | −0.80 | −0.80 | 0.14 | 0.14 | 94.40 |
Population 2, Disease 2 | 1104 | −0.57 | −0.57 | 0.12 | 0.13 | 94.50 |
Population 2, Disease 3 | 1265 | −0.33 | −0.34 | 0.11 | 0.11 | 94.00 |
Population 2, Disease 4 | 1422 | −0.10 | −0.11 | 0.10 | 0.09 | 98.40 |
To generate the correlated binary data, we use the following algorithm. Suppose we want to generate M binary variables with probabilities (π1, …, πM). Let (U1, …, UM) be equicorrelated standard normal random variables with correlation ρ, and for m = 1, …, M, define Tm = log{πm/(1 − πm)} + log[Φ(Um)/{1 − Φ(Um)}]. Then setting Ym = I(Tm > 0) creates correlated binary random variables with the desired probabilities. In our setting, the correlations of the binary variables were approximately ρ/2, so for the simulation in this section we set ρ = 0.10. This resulted in correlations somewhat higher than in the real data in Section 5.
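The generation step above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' R or Matlab code: the function name `correlated_binary`, the shared-factor construction of the equicorrelated normals (which assumes 0 ≤ ρ ≤ 1), and the small clipping constant are all assumptions made here. The marginals come out right because Φ(Um) is Uniform(0, 1), so log[Φ(Um)/{1 − Φ(Um)}] has a standard logistic distribution and pr(Tm > 0) = πm.

```python
import math
import numpy as np

def correlated_binary(probs, rho, n, rng):
    """Draw n rows of M correlated binary variables with marginal probabilities probs.

    probs may have shape (M,) or (n, M); rho is the correlation of the latent
    equicorrelated standard normals (assumed here to satisfy 0 <= rho <= 1).
    """
    probs = np.asarray(probs, dtype=float)
    M = probs.shape[-1]
    # Equicorrelated standard normals: one shared factor plus independent noise.
    shared = rng.standard_normal((n, 1))
    own = rng.standard_normal((n, M))
    U = math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * own
    # Standard normal CDF Phi(U) computed from the error function.
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(U / math.sqrt(2.0)))
    # Guard against log(0); the clipping constant is an arbitrary choice.
    Phi = np.clip(Phi, 1e-12, 1.0 - 1e-12)
    # T_m = logit(pi_m) + logit(Phi(U_m)); Y_m = I(T_m > 0).
    T = np.log(probs / (1.0 - probs)) + np.log(Phi / (1.0 - Phi))
    return (T > 0).astype(int)
```

For example, `correlated_binary([0.2, 0.5, 0.7], rho=0.10, n=100000, rng=np.random.default_rng(0))` returns a 100000 × 3 array of zeros and ones whose column means should be close to the requested probabilities. In the simulation itself the probabilities vary by subject, which corresponds to passing an (n, M) array for `probs`.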
We also conducted simulation experiments for independent binary outcomes, for the same sample size but with correlation nearly 0.10 (ρ = 0.20), and for both correlations with n = 2,000. These results, which are similar to the results in this section, are in Section S.2 of the Supplementary Material.
To give some idea of how this simulation compares with the real data, Table 7 also lists the mean number of cases by disease and by population: the mean total across both populations is 7,975. This is far fewer than the number of cases seen in the actual data; see Table 2. Hence, since the effective sample size might be thought of as the number of cases, the simulation approximates a smaller study than the NIH-AARP data analyzed in Section 5.
Table 6 gives the results for the estimates of α, while Table 7 gives the results for the estimates of the βkℓ. In both cases, the estimates are very nearly unbiased, the estimated standard errors very nearly equal the actual standard errors, and the coverage probabilities are close to the nominal 95%.
Table 6.
Component | Mean | Estimated se | Actual se | Coverage |
---|---|---|---|---|
Total Fruit | 1.03 | 0.32 | 0.34 | 93.00 |
Whole Fruit | 1.04 | 0.55 | 0.59 | 95.40 |
Total Grains | 1.00 | 0.55 | 0.58 | 95.00 |
Whole Grains | 0.99 | 0.37 | 0.38 | 94.10 |
Total Vegetables | 1.01 | 0.44 | 0.45 | 94.20 |
DOL Vegetables | 0.99 | 0.38 | 0.38 | 94.90 |
Dairy | 1.01 | 0.27 | 0.28 | 95.10 |
Meat and Beans | 1.01 | 0.29 | 0.31 | 94.00 |
Oils | 1.00 | 0.28 | 0.29 | 94.10 |
Sodium | 0.98 | 0.35 | 0.37 | 94.30 |
Saturated Fat | 0.98 | 0.40 | 0.42 | 94.20 |
Empty Calories | 0.96 | 0.46 | 0.47 | 94.40 |
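The columns of Tables 6 and 7 can be computed from the simulation replicates along the following lines. This is a hedged sketch under stated assumptions rather than the authors' code: it assumes that "Estimated se" is the average of the per-replicate standard error estimates, that "Actual se" is the empirical standard deviation of the estimates across replicates, and that coverage is based on Wald intervals (estimate ± 1.96 se). The function name `summarize_replicates` is hypothetical.

```python
import numpy as np

def summarize_replicates(estimates, std_errors, truth, z=1.96):
    """Summarize one parameter across simulated data sets.

    estimates, std_errors: 1-d arrays with one entry per simulated data set.
    truth: the true parameter value used to generate the data.
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    # Wald interval covers the truth when |estimate - truth| <= z * se.
    covered = np.abs(estimates - truth) <= z * std_errors
    return {
        "mean": estimates.mean(),              # compare with truth for bias
        "estimated_se": std_errors.mean(),     # average model-based se
        "actual_se": estimates.std(ddof=1),    # Monte Carlo sd of estimates
        "coverage": 100.0 * covered.mean(),    # percent of intervals covering
    }
```

With 1,000 simulated data sets, the Monte Carlo standard error of an estimated 95% coverage rate is roughly 0.7 percentage points, which is useful context when reading the coverage columns.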
Figure S.2 of the Supplementary Material shows that the mean estimated function across the simulated data sets is also very nearly unbiased. Overall, the simulation suggests that our methodology leads to nearly unbiased estimates and inferences that achieve their nominal levels.
7 Discussion
Based on motivation from the current practice in nutritional epidemiology, we have developed generalized partially linear logistic single index models in the case that there are several populations and/or diseases. The novelty of the modeling is that the single-index function itself is the same across populations and diseases. In the case that the populations/diseases are independent given covariates, we developed a computable B-spline based semiparametric efficient methodology. In the case that the populations/diseases are correlated given the covariates, our method makes no assumptions about how the diseases are related.
The importance of developing a score that is constructed for health risk prediction across multiple diseases and populations, versus a different score for each population and disease, was illustrated in our work. When we analyzed men for colorectal cancer alone, increasing Milk consumption was given a very high weight in the single index. However, when we fit the model simultaneously for multiple diseases, the weight for increasing Milk consumption became much smaller. The importance of Milk consumption is a source of controversy in the nutrition literature, and our results agree with the Alternative Healthy Eating Index (AHEI), which assigns zero weight to increased Milk consumption, and with the Mediterranean diet scores, which assign negative weight to increased Milk consumption.
Our results are focused on logistic regression, so that they are readily transparent, but they are easily adapted to apply to any generalized linear model, by replacing H and its related quantities with a more general link function and its corresponding related quantities in the derivation.
We have striven to use a single overall score, which we have argued (a) avoids a different diet for each disease; and (b) is important for interpretability. It is important to use as many diseases and populations as possible, and to draw inferences and projections only to those populations and diseases. One can think of what we have done as a framework for a type of “average” version of the individual model fits across multiple diseases and populations. We do not average directly; instead, we average across the estimating functions.
Supplementary Material
Acknowledgments
S. Ma’s research was supported by NSF grant DMS-1306972. Y. Ma’s research was supported by NSF grant DMS-1206693 and NIH grant R01-NS073671. Carroll’s research was supported by a grant from the National Cancer Institute (U01-CA057030).
Appendix
A.1 Some Simplifications and Definitions
For simplicity of notation, we work through the asymptotics in the case that there is one population with a common sample size n, and thus the covariates are the same across the populations/diseases. The statements of Theorems 1–3 are readily verified when the responses are independent with different sample sizes and different covariates across k = 1, …, K.
For any vector ζ = (ζ1, …, ζs)^T ∈ R^s, denote the norm ‖ζ‖r = (|ζ1|^r + ⋯ + |ζs|^r)^{1/r}, 1 ≤ r ≤ ∞. For positive numbers an and bn, n > 1, let an ≍ bn denote that lim n→∞ an/bn = c, where c is some nonzero constant. Denote the space of the qth order smooth functions as C^(q)([a, b]) = {ϕ | ϕ^(q) ∈ C[a, b]}. For any s × s symmetric matrix A, denote its Lr norm as . Let . For a vector a, let ‖a‖∞ = max 1≤i≤s |ai|.
A.2 Regularity Conditions
- (C1) The density function fX0+X^Tα(x0 + x^Tα) of the random variable X0 + X^Tα is bounded away from 0 on Sα and satisfies the Lipschitz condition of order 1 on Sα, where Sα = {X0 + X^Tα, (X0, X^T)^T ∈ S} and S is a compact support set of (X0, X^T)^T, for α in a neighborhood of its true value α0.
- (C2) m(·) ∈ C^(q)([a0, b0]) for q ≥ 2, , 1 ≤ l ≤ dν, and the spline order satisfies r ≥ q.
- (C3) There exists 0 < c < ∞ such that the distances between neighboring knots satisfy
Furthermore, the number of knots satisfies N → ∞ as n → ∞, N^{−4}n → ∞ and Nn^{−1/(2q+2)} → ∞.
- (C4) sup i,k |Gik| ≤ M < ∞. The eigenvalues of are bounded below from zero. The eigenvalues of E(CD^{−1}C^T) given in Theorem 5 are bounded below from zero.
Conditions (C1)–(C3) are commonly used in the nonparametric smoothing literature; see, for example, Zhou et al. (1998) and Cui et al. (2011). Condition (C4) is needed for asymptotic normality of the parametric estimator.
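To give a concrete feel for why the number of knots is allowed to grow, the following sketch fits a least-squares B-spline to the simulation function m(u) = exp(u/3) and reports the maximum approximation error for a given number of interior knots. It is an illustration under assumptions made here, not the estimation procedure of the paper: the interval [0, 3], the equally spaced knots, the evaluation grid, and the function names `bspline_basis` and `max_spline_error` are all hypothetical, and the fit is to the noiseless function rather than to binary outcomes. The basis is built with the standard Cox–de Boor recursion.

```python
import numpy as np

def bspline_basis(u, interior_knots, a, b, order):
    """Evaluate all B-spline basis functions of the given order (degree = order - 1)."""
    degree = order - 1
    u = np.asarray(u, dtype=float)
    # Knot vector with boundary knots repeated `order` times.
    t = np.concatenate([np.full(order, a),
                        np.asarray(interior_knots, dtype=float),
                        np.full(order, b)])
    # Order-1 (piecewise constant) basis on each nonempty knot interval.
    B = np.zeros((u.size, len(t) - 1))
    for j in range(len(t) - 1):
        if t[j] < t[j + 1]:
            B[:, j] = (u >= t[j]) & (u < t[j + 1])
    # Assign the right endpoint b to the last nonempty interval.
    last = np.max(np.nonzero(t[:-1] < t[1:])[0])
    B[u == b, last] = 1.0
    # Cox-de Boor recursion, raising the degree one step at a time.
    for d in range(1, degree + 1):
        B_new = np.zeros((u.size, len(t) - 1 - d))
        for j in range(len(t) - 1 - d):
            left_den = t[j + d] - t[j]
            right_den = t[j + d + 1] - t[j + 1]
            if left_den > 0:
                B_new[:, j] += (u - t[j]) / left_den * B[:, j]
            if right_den > 0:
                B_new[:, j] += (t[j + d + 1] - u) / right_den * B[:, j + 1]
        B = B_new
    return B  # shape (len(u), number of interior knots + order)

def max_spline_error(n_interior, order=4, a=0.0, b=3.0):
    """Max error of the least-squares spline fit to m(u) = exp(u/3) on a grid."""
    u = np.linspace(a, b, 400)
    m = np.exp(u / 3.0)
    knots = np.linspace(a, b, n_interior + 2)[1:-1]  # equally spaced interior knots
    B = bspline_basis(u, knots, a, b, order)
    lam, *_ = np.linalg.lstsq(B, m, rcond=None)
    return np.max(np.abs(B @ lam - m))
```

Calling `max_spline_error` with increasing values of `n_interior` shows the approximation error shrinking as the knot spacing h decreases, which is the bias term that the rate conditions on N in (C3) are designed to control.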
A.3 Proof of Theorem 1
We first introduce two lemmas which will be used in the following proofs.
Lemma 1
For any a = (ap : 1 ≤ p ≤ Pn), there exist constants 0 < cB ≤ CB < ∞, such that for n large enough,
(A.1) |
(A.2) |
Proof of Lemma 1
Result (A.1) follows from Theorem 5.4.2 of DeVore and Lorentz (1993), and (A.2) can be proved by Bernstein’s inequality in Bosq (1961).
Define
(A.3) |
Lemma 2
There are constants 0 < cυ < Cυ < ∞, and 0 < CS < ∞, such that for n large enough,
(A.4) |
and
(A.5) |
Proof of Lemma 2
Result (A.5) follows from (A.1). The result that follows from (A.5) and Theorem 13.4.3 in DeVore and Lorentz (1993).
If m ∈ Cq [a0, b0], there exists λ0 ∈ RPn, such that
(A.6) |
where (de Boor, 2001). In the following, we prove the results for the nonparametric estimator m̂(u, ν) in Theorem 1 when ν = ν0. Then the results also hold when ν is a consistent estimator of ν0, since the nonparametric convergence rate in Theorem 1 is slower than n−1/2. Let . We will show that for any given ε > 0, for n sufficiently large, there exists a large constant C > 0 such that
(A.7) |
This implies that for n sufficiently large, with probability at least 1 − ε, there exists a local maximum for (5) in the ball {λ0 + αnτ :‖τ‖2 ≤ C}. Hence, there exists a local maximizer such that ‖λ̂(ν0) − λ0‖2 = Op(αn). Since Ln(λ, ν0) is a concave function of λ, the local maximizer is the global maximizer of (5).
Define
then
By Taylor’s expansion, we have
(A.8) |
where λ* = ϱλ + (1 − ϱ)λ0 for some ϱ ∈ (0, 1). Moreover,
and ∂Ln(λ0, ν0)/∂λ = Δn1 + Δn2, where
Since E(Δn1) = 0, and for some constant 0 < C1 < ∞, then . By Condition (C3), we have . Then . Then for any ε > 0, by Chebyshev’s inequality, we have . Hence, there exists an event An1 with , such that on An1 we have . Moreover, by (A.6), we have supi,k |Hik(ν0) − H̃ik(λ0, ν0)| = O(hq). Denote
Then, there exist constants 0 < C2, such that
where , for n sufficiently large given that nh → ∞. Again by Chebyshev’s inequality, for any ε > 0, we have . Hence, there exists an event An2 with , such that on An2 we have . Therefore, by the above results, we have for n sufficiently large, on the event An1 ∩ An2 with pr(An1 ∩ An2) ≥ 1 − 2ε, such that
(A.9) |
Moreover, by (A.1) and (A.2), we have for n sufficiently large, with probability approaching 1,
Thus, there exists an event An3 with for any ε > 0, such that on An3,
(A.10) |
Therefore, by (A.8), (A.9) and (A.10), for n sufficiently large, on the event An1 ∩ An2 ∩ An3 with pr(An1 ∩ An2 ∩ An3) ≥ 1 − 3ε, we have
when . This shows (A.7). Hence, we have . A similar strategy for proving consistency has been used in the literature when the dimension of the parameter is diverging, see for example the proof of Theorem 3 in Fan and Lv (2011).
Next, let
and
(A.11) |
By (A.2), (A.6) and the assumption that ,
By (A.2) and (A.4), we have ‖Vn(ν0)−1‖∞ = Op(h−1). Thus by the above results, one has
Let
Since , by Bernstein’s inequality, we have
By the above result and (A.6),
Let , and ℂ = (ℂ1, …, ℂn)T. It can be proved by Bernstein’s inequality in Bosq (1961) that . Also, by (A.4), ‖{−n−1∂2Ln(λ0, ν0)/∂λ∂λT}−1‖∞ = Op (h−1). Thus for a ∈ RPn with ‖a‖2 = 1,
(A.12) |
Let ê = Vn(ν0)−1Dn(ν0). By Central Limit Theorem, , where var(ê|ℂ) = {nVn(ν0)}−1 and . By Lemma 2 and (A.2), there are constants , such that with probability approaching 1, , and
(A.13) |
Therefore, there exist constants 0 < cσ ≤ Cσ < ∞ such that with probability approaching 1 and for large enough n,
(A.14) |
Thus uniformly in u ∈ [a0, b0], and
uniformly in u ∈ [a0, b0]. By Taylor’s expansion,
(A.15) |
Thus by (A.12), (A.14), and Condition (C3),
Therefore, by Slutsky’s theorem, σ̂^{−1}(u, ν0){m̂(u, ν0) − m̃(u)} → Normal(0, 1) and m̂(u, ν0) − m̃(u) = Op{(nh)^{−1/2}} uniformly in u ∈ [a0, b0]. By sup u∈[a0,b0] |m(u) − m̃(u)| = o(h^q), we have |m̂(u, ν0) − m(u)| = Op{(nh)^{−1/2} + h^q} uniformly in u ∈ [a0, b0]. By Slutsky’s theorem, we have
Since and Br−1(u) are B-spline basis functions with one order lower than Br(u), by the same argument as in Zhou and Wolfe (2000) and the proof for m̂(u, ν0), we have the result (b) in Theorem 1. Then the proof is complete.
A.4 Proof of Theorem 2
Define . It is straightforward to prove that ∂Li1(ν)/∂ν = Qi1(ν), and for k = 2, …, K, ∂Lik(ν)/∂ν = Qik(ν). Then by (A.15) and Condition (C3), and by the same arguments as in the proof of Proposition 4.1 in Ai and Chen (2003), we have
By Taylor’s expansion, we have
By the above result, we have (8). Then the asymptotic normality in Theorem 2 follows from the Central Limit Theorem and (8).
A.5 Proof of Theorem 3
Here we show that our method for estimating ν is semiparametric efficient when (Yi1, …, YiK) are independent given ℂi. We have that
The ith score with respect to ν is . The nuisance tangent space is
We decompose Sνi as Sνi = Seff,i + S1i, where
Obviously, S1i ∈ Λ. For any element Si ∈ Λ, say , we can easily verify that .
Thus, Seff,i is the residual of the orthogonal projection of Sνi onto Λ, hence it is the efficient score. The minimum variance bound for estimating ν is therefore
Since S1i is the orthogonal projection of Sνi onto Λ, it minimizes the covariance matrix of Sνi − Si among all the functions Si ∈ Λ, i.e., η0i minimizes
among all possible . This shows that Σ in Theorem 2 reaches the semiparametric efficiency bound, as claimed.
A.6 Proof of Theorem 4
Let and εi = (εi1, …, εiK)T. Following the same procedure as the proof of Theorem 1, we have that . By this result and Taylor’s expansion, we have
Thus
(A.16) |
Then with probability approaching 1, var{λ̂w(ν0) − λ0|ℂi} approaches . Theorem 4 can be proved following the same methods as in the proof of Theorem 1.
A.7 Proof of Theorem 5
Let ζi be the dνK × 1 vector formed by K length dν vectors. The kth, k = 1, …, K vector component is . Following the same outline as the proof of Theorem 2, it can be proved that
Therefore,
and the asymptotic normality of given in Theorem 5 follows from the Central Limit Theorem.
A.8 Extending to Multiple Study Centers
Here we indicate briefly the changes needed if there are multiple study centers, and multiple dependent disease outcomes within each study center. Suppose that there are k = 1, …, K study centers, with ℓ = 1, …, Lk binary disease outcomes in each center, and with i = 1, …, nk observations at the kth center. Write the outcomes as Yik = (Yik1, …, YikLk), and write the covariates as ℂik = (Gik, Xik0, Xik, Zik, GikWik). The model is
(A.17) |
We make the same assumptions as in Section A.2, but in addition we assume that limn1, …, nK→∞(max nk/ min nk) = c with 0 < c < ∞.
From the above model, we can see that in different centers, because different physical populations are studied, the same disease occurrence is modeled with different parameters. Thus, we can simply view the Lk diseases in k = 1, …, K centers as different diseases from a single center, and all our analyses formulated for data from one center apply.
Footnotes
The Supplementary Material contains results of additional simulations, and R and Matlab programs to run the analysis. The NIH-AARP data used in the data analysis are available from the NIH via a data transfer agreement (www.http://dietandhealth.cancer.gov/) but we are not allowed to distribute them. The program files include simulated data as described in Section 6.
Contributor Information
Shujie Ma, Department of Statistics, University of California at Riverside, Riverside, CA 92521.
Yanyuan Ma, Department of Statistics, University of South Carolina, Columbia, SC 29208.
Yanqing Wang, Fred Hutchinson Cancer Research Center, Seattle, WA 98109.
Eli S. Kravitz, Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX
Raymond J. Carroll, Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843-3143, and School of Mathematical Sciences, University of Technology Sydney, Broadway NSW 2007
References
- Ai C, Chen X. Efficient estimation of models with conditional moment restrictions containing unknown functions. Econometrica. 2003;71:1795–1843.
- Akbaraly TN, Ferrie JE, Berr C, Brunner EJ, Head J, Marmot MG, Singh-Manoux A, Ritchie K, Shipley MJ, Kivimaki M. Alternative Healthy Eating Index and mortality over 18 y of follow-up: results from the Whitehall II cohort. American Journal of Clinical Nutrition. 2011;194:247–253. doi: 10.3945/ajcn.111.013128.
- Bellman RE. Adaptive Control Processes. Princeton University Press; Princeton: 1961.
- Bosq D. Nonparametric Statistics for Stochastic Processes. Springer; New York: 1961.
- Carroll RJ, Fan J, Gijbels I, Wand MP. Generalized partially linear single-index models. Journal of the American Statistical Association. 1997;92:477–489.
- Chaganty NR, Joe H. Efficiency of generalized estimating equations for binary responses. Journal of the Royal Statistical Society: Series B. 2004;66:851–860.
- Chiuve SE, Fung TT, Rimmand EB, Hu FB, McCullough ML, Wang M, Stampfer MJ, Willett WC. Alternative dietary indices both strongly predict risk of chronic disease. Journal of Nutrition. 2012;142:1009–1018. doi: 10.3945/jn.111.157222.
- Cui X, Härdle WK, Zhu L, et al. The efm approach for single-index models. Annals of Statistics. 2011;39:1658–1688.
- de Boor C. A Practical Guide to Splines. Springer; New York: 2001.
- DeVore RA, Lorentz GG. Constructive Approximation. Springer; Berlin: 1993.
- Fan J, Lv J. Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory. 2011;57:5467–5484. doi: 10.1109/TIT.2011.2158486.
- George SM, Neuhouser ML, Mayne ST, Irwin ML, Albanes D, Gail MH, Alfano CM, Bernstein L, McTiernan A, Reedy J, Smith AW, Ulrich CM, Ballard-Barbash R. Postdiagnosis diet quality is inversely related to a biomarker of inflammation among breast cancer survivors. Cancer Epidemiology, Biomarkers & Prevention. 2010;19:2220–2228. doi: 10.1158/1055-9965.EPI-10-0464.
- Guenther PM, Reedy J, Krebs-Smith SM. Development of the Healthy Eating Index-2005. Journal of the American Dietetic Association. 2008;108:1896–1901. doi: 10.1016/j.jada.2008.08.016.
- Guenther PM, Reedy J, Krebs-Smith SM, Reeve BB. Evaluation of the Healthy Eating Index-2005. Journal of the American Dietetic Association. 2008;108:1854–1864. doi: 10.1016/j.jada.2008.08.011.
- Huang JZ. Local asymptotics for polynomial spline regression. Annals of Statistics. 2003;31:1600–1635.
- Huang JZ, Wu CO, Zhou L. Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika. 2002;89:111–128.
- Huang JZ, Yang L. Identification of non-linear additive autoregressive models. Journal of the Royal Statistical Society: Series B. 2004;66:463–477.
- Le Cessie S, Van Houwelingen J. Logistic regression for correlated binary data. Applied Statistics. 1994;43:95–108.
- Liu X, Wang L, Liang H. Estimation and variable selection for semiparametric additive partial linear models. Statistica Sinica. 2011;21:1225. doi: 10.5705/ss.2009.140.
- Ma S, Yang L. A jump-detecting procedure based on spline estimation. Journal of Nonparametric Statistics. 2011;23:67–81.
- Ma Y, Zhu L. Doubly robust and efficient estimators for heteroscedastic partially linear single-index models allowing high dimensional covariates. Journal of the Royal Statistical Society: Series B. 2013;75:305–322. doi: 10.1111/j.1467-9868.2012.01040.x.
- McCullough ML, Feskanich D, Stampfer MJ, Giovannucci EL, Rimm EB, Hu FB, Spiegelman D, Hunter DJ, Colditz GA, Willett WC. Diet quality and major chronic disease risk in men and women: moving toward improved dietary guidance. American Journal of Clinical Nutrition. 2002;76:1261–1271. doi: 10.1093/ajcn/76.6.1261.
- Panagiotakos DB, Pitsavos C, Stefanadis C. Dietary patterns: a mediterranean diet score and its relation to clinical and biological markers of cardiovascular disease risk. Nutrition, Metabolism and Cardiovascular Diseases. 2006;16:559–568. doi: 10.1016/j.numecd.2005.08.006.
- Reedy JR, Mitrou PN, Krebs-Smith SM, Wirfält E, Flood AV, Kipnis V, Leitzmann M, Mouwand T, Hollenbeck A, Schatzkin A, Subar AF. Index-based dietary patterns and risk of colorectal cancer: the nih-aarp diet and health study. American Journal of Epidemiology. 2008;168:38–48. doi: 10.1093/aje/kwn097.
- Schatzkin A, Subar AF, Thompson FE, Harlan LC, Tangrea J, Hollenbeck AR, Hurwitz PE, Coyle L, Schussler N, Michaud DS, Freedman LS, Brown CC, Midthune D, Kipnis V. Design and serendipity in establishing a large cohort with wide dietary intake distributions: the national institutes of health-aarp diet and health study. American Journal of Epidemiology. 2001;154:1119–1125. doi: 10.1093/aje/154.12.1119.
- Stone CJ. Additive regression and other nonparametric models. Annals of Statistics. 1985:689–705.
- Subar AF, Thompson FE, Kipnis V, Mithune D, Hurwitz P, McNutt S, McIntosh A, Rosenfeld S. Comparative validation of the block, willett, and national cancer institute food frequency questionnaires: The Eating at America’s Table Study. American Journal of Epidemiology. 2001;154:1089–1099. doi: 10.1093/aje/154.12.1089.
- Trichopoulou A, Orfanos P, Norat T, Bueno-de Mesquita B, Ocké MC, Peeters PH, van der Schouw YT, Boeing H, Hoffmann K, Boffetta P, et al. Modified mediterranean diet and survival: Epic-elderly prospective cohort study. British Medical Journal. 2005;330:991. doi: 10.1136/bmj.38415.644155.8F.
- Wang J, Yang L. Polynomial spline confidence bands for regression curves. Statistica Sinica. 2009a;19:325.
- Wang L, Liu X, Liang H, Carroll RJ. Estimation and variable selection for generalized additive partial linear models. Annals of Statistics. 2011;39:1827. doi: 10.1214/11-AOS885SUPP.
- Wang L, Yang L. Spline estimation of single-index models. Statistica Sinica. 2009b;19:765.
- Xue L, Yang L. Additive coefficient modeling via polynomial spline. Statistica Sinica. 2006;16:1423.
- Yu Y, Ruppert D. Penalized spline estimation for partially linear single-index models. Journal of the American Statistical Association. 2002;97:1042–1054.
- Zhao LP, Prentice RL. Correlated binary regression using a quadratic exponential model. Biometrika. 1990;77:642–648.
- Zhou S, Shen X, Wolfe DA. Local asymptotics for regression splines and confidence regions. The Annals of Statistics. 1998:1760–1782.
- Zhou S, Wolfe DA. On derivative estimation in spline regression. Statistica Sinica. 2000:93–108.