Published in final edited form as: Stat Theory Relat Fields. 2018 May 17;2(1):2–10. doi: 10.1080/24754269.2018.1466098

Nutritional Epidemiology Methods and Related Statistical Challenges and Opportunities

Ross L Prentice 1, Ying Huang 1
PMCID: PMC6377194  NIHMSID: NIHMS1502755  PMID: 30778402

Abstract

The public health importance of nutritional epidemiology research is discussed, along with methodologic challenges to obtaining reliable information on dietary approaches to chronic disease prevention. Measurement issues in assessing dietary intake need to be addressed to obtain reliable disease association information. Self-reported dietary data typically incorporate major random and systematic biases. Intake biomarkers offer potential for more reliable analyses, but biomarkers have been established only for a few dietary variables, and these may be too expensive to apply to all participants in large epidemiologic cohorts. A possible way forward involves additional nutritional biomarker development using high-dimensional metabolomic profiling of blood and urine specimens, in conjunction with further development of statistical approaches for accommodating measurement error with failure time response data. Statisticians have the opportunity to contribute greatly to worldwide public health through the development of statistical methods to address these nutritional epidemiology research challenges, as is elaborated in this contribution.

Keywords: Chronic disease, epidemiology, failure time data, hazard ratio, measurement error, metabolomics, nutritional biomarker, regression calibration

1. Introduction

Chronic diseases constitute the major cause of morbidity and mortality in many countries worldwide, especially in countries that are more economically developed. In fact, the incidence of cardiovascular diseases, major cancers and diabetes tends to be several times higher in economically developed populations than in other parts of the world (e.g., Cancer Incidence in Five Continents, 2014). Much of this elevated incidence appears to be driven by modifiable exposures, since migrant populations tend to develop disease rates similar to those in their new environment within a generation or two of migration, even though the acculturation process may span some decades. However, the primary drivers of the observed risk elevations for specific chronic diseases are not well understood.

Diet and physical activity patterns over the lifespan provide natural candidate exposures to explain chronic disease risk variations among populations, as well as chronic disease risk changes over time in specific populations. However, when expert committees have been assembled to review the analytic epidemiology literature on these patterns and exposures, they have mostly concluded that there are few nutrition and chronic disease associations that can be viewed as established, or even as probable (World Cancer Research Fund and American Institute for Cancer Research, 1997, 2007; World Health Organization, 2003). In contrast, ecologic analyses tend to exhibit strong correlations between national incidence rates and per capita food ‘disappearance’ measures, especially for such food components as total energy and total fat (Armstrong & Doll, 1975; Prentice & Sheppard, 1990).

Much of the explanation for these apparently discrepant findings likely resides in the properties and quality of available dietary data. Analytic epidemiology studies mostly rely on self-reported dietary intake data, with prominent assessment methodologies involving food frequencies, food records or dietary recalls. At best these measurement approaches yield noisy estimates of targeted intakes, which are usually expressed as daily average intakes over a short period of a few days to a few months. The noise feature alone typically leads to greatly attenuated associations, necessitating studies having a large number of incident cases of disease for associations to be evident. A larger issue is systematic bias in the self-report assessments, corresponding to differential reporting by study subjects according to such personal characteristics as body mass index (BMI), defined as weight in kilograms divided by the square of height in meters, age and ethnicity. These random and systematic biases may combine to thoroughly distort, or possibly even reverse, disease association estimates. In comparison, ecologic analyses that compare dietary intakes among population groups (e.g., countries) can be expected to be relatively free from influences due to the noise component of measurement error, even if based on individual self-report data, but systematic bias, along with potential ecologic confounding, precludes a strong reliance on related disease association analyses.
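For readers less familiar with measurement error effects, the attenuation produced by pure noise can be made concrete with a few lines of simulation. The sketch below is illustrative only, with made-up variances, and simply exhibits the classical regression-dilution factor var(z)/{var(z) + var(e)}.

```python
# Illustrative sketch (not from this paper): pure noise in a self-report assessment
# attenuates an estimated diet-outcome association. All numbers are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(0.0, 1.0, n)             # true (log) intake
q = z + rng.normal(0.0, 1.0, n)         # self-report = truth + noise, var(e) = var(z)
y = 0.3 * z + rng.normal(0.0, 1.0, n)   # outcome linearly related to true intake

slope_true = np.polyfit(z, y, 1)[0]
slope_noisy = np.polyfit(q, y, 1)[0]
# Expect roughly 0.30 versus 0.15, i.e., attenuation by var(z)/{var(z) + var(e)} = 0.5
print(round(slope_true, 2), round(slope_noisy, 2))
```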

What then are the study designs that can help to develop reliable information on dietary intakes and patterns for chronic disease risk reduction? Certainly randomized, controlled dietary intervention trials have the potential to be informative. However, this research strategy also involves substantial challenges: such trials typically need to be quite large for change to a new dietary pattern to appreciably offset preceding years or decades of the study participant’s usual diet, and usually need to be of long duration for the same reason. Hence, randomized dietary intervention trials with chronic disease outcomes can be quite expensive and logistically challenging, while only evaluating one, or a few, specific dietary patterns. Furthermore, the long trial duration can pose challenges for adherence to the assigned dietary intervention, and may open up the possibility of post-randomization confounding if participants adopt medications and other approaches to chronic disease risk reduction in a differential manner among randomized groups.

In comparison, the use of nutritional biomarkers provides a practical and potentially comprehensive approach to strengthening nutritional epidemiology observational research. If such biomarkers can be obtained from biospecimens, typically blood or urine specimens, stored for the members of large epidemiology cohorts, then these objective intake measures can be directly associated with subsequent disease risk, perhaps using nested case-control (Prentice & Breslow, 1978; Thomas, 1977) or case-cohort (Prentice, 1986; Self & Prentice, 1988) sampling within study cohorts.

Otherwise, biomarker determinations can be made in a cohort subsample, and used to correct self-report data for random and systematic biases, with corrected estimates subsequently associated with disease risk. However, only a few dietary components have established intake biomarkers, and there is a strong research need for the development of additional biomarkers. To be useful, such biomarkers should plausibly adhere to a classical measurement model. Even when such biomarkers can be identified there is a need for further development of statistical methods and theory for estimating key disease association parameters, such as parameters relating disease hazard ratios to preceding (unobserved) dietary intake histories (Carroll, Ruppert, Stefanski, & Crainiceanu, 2006; Prentice, 1982), in cohort study contexts.

These nutritional epidemiology methodology needs and opportunities have a strong statistical component. In fact, statistical input in this multidisciplinary nutritional epidemiology research area is as crucial to the development of useful and interpretable disease prevention information as is input from any other disciplinary group. Furthermore, the needed research includes highly interesting statistical methodology issues, including issues in the use of high-dimensional metabolomic data for nutritional biomarker development, and issues related to estimating disease association parameters in non-linear models when predictor variables include considerable measurement error.

These issues will be elaborated below, in an attempt to encourage additional statistical theoreticians to consider research goals in this important public health research area, especially during this time of national and international crises in diabetes, obesity, major cancers and cardiovascular diseases, in affluent populations.

2. Nutritional Biomarker Development Methods

The principal requirement for a useful nutritional biomarker, w, is adherence to a classical measurement model

w = z + e    (1)

where z is the targeted nutritional variable, and e is a pure noise error component that is independent of z and of other study subject characteristics (e.g., age, ethnicity, BMI, . . . ) that may be pertinent to the risk of the chronic disease under study. The hallmark of the biomarker is then freedom from systematic bias relative to the targeted dietary variable and relative to risk factors for the outcome under study. Additionally, the variance of e should not be too large compared to the variance of z, so that w provides an efficient biomarker.

Usually z is defined as log-transformed usual daily intake over a certain time period, such as a few weeks or months, while w typically arises as a corresponding log-transformed intake assessment from biospecimens collected at a single point, or a few points, in time. Prominent examples of nutritional biomarkers include a doubly-labeled water (DLW) biomarker of energy intake (Schoeller, 1999), a urinary nitrogen biomarker of protein intake (Bingham, 2003), and 24-hour urine-based measures of sodium and potassium intake (Luft, Fineberg, & Sloan, 1982; Rakova et al., 2013). Additionally, our research group, using data from a 153-woman human feeding study, has recently proposed novel biomarkers for the intake of several carotenoids and tocopherols using blood serum concentration measurements (Lampe et al., 2017), and a carbohydrate biomarker using plasma fatty acid profiles (Song et al., 2017). These latter biomarkers required the inclusion of certain study subject variables for (1) to be plausible.
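In a feeding study of this kind, where the consumed intake z is known, one natural check of model (1) is to regress the candidate biomarker error w − z on study subject characteristics and look for coefficients near zero. The sketch below is a hypothetical illustration of that check with simulated data; the variable names and numbers are ours, not those of Lampe et al. (2017).

```python
# Hypothetical sketch of a feeding-study check of model (1): with intake z known from
# the feeding protocol, regress the candidate biomarker error (w - z) on subject
# characteristics; coefficients near zero support freedom from systematic bias.
# Data and variable names are illustrative, not from Lampe et al. (2017).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 153
z = rng.normal(5.0, 0.4, n)              # known log intake during the feeding period
bmi = rng.normal(28.0, 5.0, n)           # subject characteristics that could induce bias
age = rng.normal(62.0, 7.0, n)
w = z + rng.normal(0.0, 0.3, n)          # candidate biomarker generated to satisfy (1)

fit = sm.OLS(w - z, sm.add_constant(np.column_stack([bmi, age]))).fit()
print(fit.params)     # intercept, BMI and age coefficients all near zero under model (1)
print(fit.pvalues)
```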

Only a few research groups have engaged in nutritional biomarker identification and development, and the brevity of the nutritional biomarker list described above strongly suggests that nutrient metabolite recovery in urine, along with blood nutrient concentrations, will not provide sufficiently comprehensive sources of data for biomarker development. However, urine and blood metabolomic profiling (i.e., studies of small-molecule concentrations) provides an intriguing possibility for additional nutritional biomarker development.

Over the past 15 years high-dimensional genotype data for disease association analyses, and for other purposes, have provided a considerable stimulus to statistical theory development, with methods based on the notion of only a few real associations among many examined, or sparsity, coming to play an influential role (Hastie, Tibshirani, & Wainwright, 2015). These studies have generated lengthy lists of chronic disease-associated genetic variants for many chronic diseases and conditions. Most such associations, however, are very weak and collectively may not explain as much response variation as do simply collected data on family history for the outcomes in question. The difference between the outcome variation explained by family history compared to that explained by measured genetic variates is sometimes referred to as the ‘missing heritability.’ Another explanation, however, is that much of the observed familial association is attributable to shared environment, including similar diet and activity patterns among family members, rather than to shared genotype.

High-dimensional exposure history data are more complicated to model and analyze than are high-dimensional genotype data for at least two reasons: Unlike time-invariant germline genetic variants that can be assessed or imputed with great precision, environmental exposure data often are assessed with substantial measurement error, as with dietary and activity pattern assessments. Secondly, exposure patterns for individuals may change in a noteworthy fashion over the years and decades that are relevant to chronic disease risk. Hence, the statistical challenges in using high-dimensional exposure data are substantial in the nutritional epidemiology area, and require the input of theoreticians who are knowledgeable in the application of both high-dimensional data and exposure measurement error methodologies.

The two ‘exposome’ complexities just mentioned are separable to some extent. Blood and urine metabolomic profiles typically provide measurements that are responsive to recent dietary exposures, for example over the past few days. In that the diets of free-living individuals tend to track over time, much may be learned by studying disease risk in relation to dietary exposures over short preceding time periods (e.g., the most recent year). The incorporation of dietary changes over an extended period of time may be accomplished by obtaining biospecimens periodically during a lengthy cohort follow-up period, and by relating disease risk at specific follow-up times to a preceding biomarker-based dietary intake history.

Our research group has been developing metabolomic profile data in the context of the human feeding study mentioned above (Lampe et al., 2017) among 153 participants in the U.S. Women’s Health Initiative. Profiles developed in the laboratory of Dr. Dan Raftery involve both targeted platforms, typically with 100–200 pre-specified metabolites, and global platforms with a much larger number of metabolites, many of which lack biological identification. The global platforms especially, which require peak identification in mass spectra (e.g., liquid chromatography/mass spectrometry (LC/MS) or gas chromatography/mass spectrometry (GC/MS)), include complex missing data features and a non-ignorable noise component for quantitative measurements. Higher-dimensional statistical methods that have proven to be successful in genetic association applications need to be extended to allow for the measurement properties of these types of metabolomic profile data. Without such extension it seems likely that global platform measurements will be systematically excluded from potential biomarker specifications based on their weak performance in cross-validation components of model building activities, even if the underlying metabolites are highly relevant to the targeted intake.
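To make the model building step concrete, the sketch below shows the kind of cross-validated sparse regression (here a lasso) that might be used to relate a known feeding-study intake to a panel of metabolite features. It is a minimal illustration with simulated, complete, noise-free features; real global-platform data would additionally require handling of the missing-peak and measurement-noise features noted above, which this sketch does not address.

```python
# Minimal sketch of sparsity-based biomarker development: predict a known feeding-study
# intake z from (simulated) metabolite features using a cross-validated lasso.
# Real global-platform data would also require missing-peak and measurement-noise
# handling, which is not addressed here.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n, p = 153, 500                                     # subjects, metabolite features
X = rng.normal(size=(n, p))
z = X[:, :5] @ np.array([0.6, 0.5, 0.4, 0.3, 0.2]) + rng.normal(0.0, 1.0, n)

lasso = LassoCV(cv=5).fit(X, z)
cv_r2 = cross_val_score(LassoCV(cv=5), X, z, cv=5, scoring="r2")
print((lasso.coef_ != 0).sum(), "features selected; cross-validated R^2 =", cv_r2.mean().round(2))
```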

3. Disease Association Analysis Methods Using Nutritional Biomarkers

Suppose now that data from a study cohort are available as S = T ∧ C, δ = I[S = T] and Z(S) = {z(u); 0 ≤ u < S}, where S is the smaller of the time from cohort enrollment to chronic disease diagnosis (T) or to right censoring (C), δ is a non-censoring indicator, and Z(S) is the history of actual dietary intakes, as well as dietary self-report and potential confounding factors, for the study subject up to time S. Cox regression (Cox, 1972, 1975; Kalbfleisch & Prentice, 2002) provides a major tool for studying the association between Z and disease risk, under the usual assumption that the hazard rate for T at follow-up time t does not depend on censoring conditional on Z(t), for any t > 0. Under the Cox model the hazard rate

λ{t; Z(t)} = lim_{Δt↓0} pr{t ≤ T < t + Δt; T ≥ t, Z(t)}/Δt

is modeled as

λ{t; Z(t)} = λ_{0s}(t) exp{x(t)β},    (2)

where x(t) = {x_1(t), …, x_p(t)} is a data-analyst-defined regression vector formed from {Z(t), t} with corresponding hazard ratio parameter β = (β_1, …, β_p)′ to be estimated, while λ_{0s} is an unspecified ‘baseline’ hazard rate function at x(t) ≡ 0 in stratum s, where the stratification s = s{t; Z(t)} ∈ {1, 2, …} is also defined by the data analyst. Estimation of the association parameter β is based on applying usual maximum likelihood formulae to the ‘partial likelihood’ function (Cox, 1975)

L(β) = ∏_{s>0} ∏_{i=1}^{d_s} { ∏_{k∈D_s(t_{si})} e^{x_k(t_{si})β} / ∑_{l∈R_s(t_{si})} e^{x_l(t_{si})β} },    (3)

where t_{s1} < t_{s2} < ⋯ < t_{s d_s} are the uncensored disease incidence times in stratum s, D_s(t_{si}) is the set of individuals failing at time t_{si} in stratum s, and R_s(t_{si}) is the set of study subjects ‘at risk’ (i.e., without prior disease diagnosis or censoring) for disease occurrence in stratum s at time t_{si}. The Cox model incorporates substantial flexibility as a result of its nonparametric baseline disease rates and its stratification features, and it is well suited to estimation problems for exposures that may vary over time, and for confounding factors that may also need to be allowed to vary over study follow-up time for an independent censoring assumption to be plausible. Note that the hazard ratio interpretation for β is natural and convenient in many biomedical research contexts, including nutritional epidemiology studies.
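For readers who prefer to see (3) operationally, the sketch below maximizes the log partial likelihood numerically for a simulated single-stratum cohort with no tied event times (assumptions made only for brevity); in practice standard survival analysis software would be used.

```python
# Sketch: numerical maximization of the Cox log partial likelihood (3) for a single
# stratum with no tied event times (for brevity); standard survival software would
# normally be used instead.
import numpy as np
from scipy.optimize import minimize

def neg_log_partial_likelihood(beta, time, event, x):
    beta = np.atleast_1d(beta)
    eta = x @ beta                               # linear predictor x(t)beta
    order = np.argsort(-time)                    # sort by decreasing follow-up time
    eta, event = eta[order], event[order]
    log_risk_sum = np.logaddexp.accumulate(eta)  # log of running risk-set sums
    return -np.sum((eta - log_risk_sum)[event == 1])

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=(n, 1))
t = rng.exponential(np.exp(-0.4 * x[:, 0]))      # exponential failure times, true beta = 0.4
c = rng.exponential(1.5, n)                      # independent censoring times
time, event = np.minimum(t, c), (t <= c).astype(int)

fit = minimize(neg_log_partial_likelihood, x0=np.zeros(1), args=(time, event, x))
print(fit.x)                                     # estimate close to 0.4
```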

Expression (3) also provides a basis for the estimation of β in (2) when data are available only for cases developing disease during cohort follow-up and time-matched ‘controls’ without disease at the time of the corresponding case occurrence, simply by regarding each matched case-control set as a distinct stratum (Prentice & Breslow, 1978; Thomas, 1977). Similarly, maximization of (3) is also appropriate if data are available only on cases and a random sample, or stratified random sample, of the study cohort, with R_s(t_{si}) redefined to include only cases occurring in stratum s at time t_{si} and subcohort controls at risk in stratum s at that time. Note that a variance estimator more complex than that from the negative second derivative of log L(β) is required with case-cohort sampling (Prentice, 1986; Self & Prentice, 1988). These ad hoc sampling designs do not have established optimality properties, though efficiency can be expected to be good if case and comparison groups are well matched on potential confounding variables. The corresponding hazard ratio parameter estimates cited above are also sub-optimal, with efficiency loss that tends to be larger when some of the covariate components are available for all cohort members. Estimation efficiency can be improved by including inverse probability weights in these estimating equations (e.g., Breslow, McNeney, Wellner, et al., 2003; Breslow & Wellner, 2007), but the resulting estimators have not been shown to be semiparametric efficient.
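The risk-set sampling behind the nested case-control design can be sketched in a few lines: for each case, a handful of controls still at risk at the case’s event time are drawn, and each matched set is then treated as its own stratum in (3). The sketch below shows only this sampling step, under our own simplified setup; it does not include the case-cohort variance estimator or other design details.

```python
# Sketch of risk-set (nested case-control) sampling: for each case, draw m controls
# still at risk at the case's event time; each matched set would then be analysed
# as its own stratum in (3). Simplified illustration only.
import numpy as np

def nested_case_control_sets(time, event, m, seed=0):
    rng = np.random.default_rng(seed)
    matched_sets = []
    for i in np.flatnonzero(event == 1):
        at_risk = np.flatnonzero(time >= time[i])          # subjects still at risk at time[i]
        eligible = at_risk[at_risk != i]                   # exclude the case itself
        controls = rng.choice(eligible, size=min(m, eligible.size), replace=False)
        matched_sets.append((i, controls))
    return matched_sets

# Example usage with the simulated (time, event) data from the previous sketch:
# matched_sets = nested_case_control_sets(time, event, m=5)
```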

Now consider estimation of the hazard ratio parameter in (2) when the ‘covariate history’ Z(t) incorporates measurement error. More specifically suppose that the targeted x(t) in (2) can be written as

x(t) = x̃(t) + ẽ(t)    (4)

where x̃(t) values are obtained from available measurements, and ẽ(t) is a measurement error component that is independent of x̃(t) and potential confounding factors. Also suppose that the stratification variable s = s{t; Z(t)} relies only on elements of Z that are free of measurement error. The induced hazard rate model that specifies disease risk given measured data only, at each follow-up time t, can be written (Prentice, 1982) as

λ_{0s}(t) E{e^{x(t)β}; T ≥ t, X̃(t)}

where X̃(t) = {x̃(u); 0 ≤ u < t} and E denotes expectation. In general these induced hazard rates involve an expectation factor that is a complicated function of the baseline hazard rates in (2). However, if the disease outcome is rare during the cohort follow-up period then the conditioning event T ≥ t can be ignored to a good approximation. Doing so leads to a hazard rate model, under the specialized Berkson-type measurement model (4), of

λ_{0s}(t) E{e^{x(t)β}; X̃(t)} = λ_{0s}(t) e^{x̃(t)β} E{e^{ẽ(t)β}; X̃(t)} = λ̃_{0s}(t) e^{x̃(t)β}

with the last equality following from the independence of ẽ(t) and x̃(t). Hence, if one could identify a data construct x̃(t) that adheres to (4), one could regress the hazard rate on x̃(t) in a standard Cox model fashion to estimate the regression coefficient β in (2).

Suppose that an assessment q(t) of x(t) is available for all members of a study cohort, while a biomarker assessment w(t) of x(t) is also available on a random sample from the same population, at all follow-up times t ≥ 0. If q(t) is a self-report assessment of x(t) then a measurement model

q(t) = a_0 + a_1 x(t) + a_2 v(t) + ε(t)    (5)

may be appropriate, where a_0, a_1 and a_2 = (a_{21}, a_{22}, …)′ are constants, v(t)′ = (v_1(t), v_2(t), …) are study subject characteristics that may be associated with the measurement properties of q(t) or that may be needed to control confounding in (2), and ε(t) is a random noise component that is independent of x(t), given v(t). In the biomarker sample one will have measurements

w(t)=x(t)+e(t)

where the error e(t) is independent of x(t) and is also independent of study subject characteristics that determine v(t), an assumption that will often be plausible if Z(t) incorporates dietary intake data over a short time period (e.g., a few months) prior to t. Also, importantly, suppose that the error terms e(t) and ε(t) are independent given v(t). Then under joint normality assumptions for {x(t),ε(t)} given v(t) one has

E{x(t); q(t), v(t)} = b_0 + b_1 q(t) + b_2 v(t)

for some constants b_0, b_1 and b_2, and x̃(t) = Ê{w(t); q(t), v(t)} satisfies (4), where Ê{w(t); q(t), v(t)} denotes an estimator of x(t) arising from linear regression of w(t) on q(t) and v(t) in the biomarker sample. In this context x̃(t) is referred to as a biomarker-calibrated estimate of x(t). Values of x̃(t) can be calculated for each of the members of the study cohort, and the regression parameter β can be estimated by standard Cox regression (Cox, 1975) of the disease outcome data on x̃(t). A non-standard variance estimator is required for the regression parameter estimator to acknowledge the randomness in the calibration equation coefficient estimates. A bootstrap procedure typically works well for variance estimation. The estimation procedure just described simply generalizes the regression calibration procedure for failure time data (Carroll et al., 2006; Prentice, 1982) to a broader class of measurement error models.
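A schematic of this calibration step, under the stated normality and rare-disease assumptions and for a time-independent exposure (for brevity), is given below. The function names and the Cox-fitting step are ours; in practice the bootstrap would repeat the calibration and Cox-fitting steps together to obtain the variance estimator mentioned above.

```python
# Sketch of biomarker regression calibration (time-independent exposure, for brevity):
# (i) in the biomarker subsample, regress w on (q, v); (ii) compute calibrated exposures
# x_tilde for all cohort members; (iii) fit the hazard model to x_tilde. A bootstrap over
# steps (i)-(iii) would supply the non-standard variance estimator discussed above.
import numpy as np
import statsmodels.api as sm

def calibrate(q_sub, v_sub, w_sub, q_all, v_all):
    """Linear calibration: regress biomarker w on self-report q and covariates v."""
    design_sub = sm.add_constant(np.column_stack([q_sub, v_sub]))
    coef = sm.OLS(w_sub, design_sub).fit().params
    design_all = sm.add_constant(np.column_stack([q_all, v_all]))
    return design_all @ coef                    # calibrated exposure x_tilde

# Usage outline (variables as in the earlier sketches):
# x_tilde = calibrate(q[biomarker_idx], v[biomarker_idx], w[biomarker_idx], q, v)
# beta_hat = minimize(neg_log_partial_likelihood, np.zeros(1),
#                     args=(time, event, x_tilde[:, None])).x
```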

In some applications q(t) may also be a biomarker measurement that is available on the entire cohort, or on a suitable set of cases and controls drawn from the cohort. The calibration procedures may be applied as above, though the v(t) term can then be dropped from the calibration equation. Note that the error terms for the two biomarker assessments in the biomarker sample need to be statistically independent in this context, with implications for the exposure time period used in the definition of Z(t) in (2).

The above procedures depend on the biomarker adhering to a classical measurement model, the disease under study being infrequent (e.g., < 10%) during cohort follow-up, and the so-called instrumental variable q(t) adhering to (5) with error term ε(t) that is independent of the error term e(t) for the biomarker given v(t). These assumptions will often be appropriate in nutritional epidemiology contexts for dietary exposure variables having an established biomarker. The regression calibration procedure outlined above also assumed the log-hazard rate to depend linearly on the modeled exposure variable x(t). Additional hazard ratio regression modeling choices will also be of interest for the exploration and presentation of nutritional epidemiology data. However, estimation procedures for such other modeling choices have received little attention to date, when the measured exposure variables incorporate substantial measurement error.

For example, it is common to display epidemiologic data by showing estimated hazard ratios, or closely related odds ratios, across quartiles or quintiles of the modeled exposure variable. One possibility for the estimation of hazard ratios across such quantiles, assuming model (2), is to calibrate the exposure variable and then estimate hazard ratios based on quantiles of the calibrated exposure. Another possibility is to define x(t) in (2) to be a set of quantile indicator variables, typically taking the smallest or largest quantile as the base value for hazard ratio comparison. One can then consider a regression calibration procedure of the type outlined above with x̃(t) defined as a set of calibrated quantile indicators for each quantile except the comparator. Simulation studies described in the next section show, perhaps surprisingly, that the second approach has better performance than the first and even enjoys some robustness to departure from the rare disease assumption used in the calibration procedure. The main point here, however, is that hazard ratio estimation procedures are needed to handle a variety of regression model forms in (2), as an integral component of nutritional epidemiology association analysis methods when biomarker data are available in a study cohort, or in appropriate subsamples thereof.

4. Hazard Ratio Estimation for Exposure Quantiles

It is commonplace in epidemiologic reporting to show estimated hazard ratios across quantiles of key univariate exposure variables. The regression calibration approach outlined above has not previously been adapted to this estimation problem.

To do so, consider a time-independent targeted variable x* = I{x ∈ (x_0, x_1)} for some fixed x_0 and x_1 values, where I again denotes an indicator function, and suppose that

λ(t; x, q, v) = λ_0(t) exp{β_1 x* + β_2 v}.

Under a rare outcome specification the induced hazard rate given observable variates is to a good approximation

λ(t; q, v) = λ̃_0(t) exp{β_1 E(x*; q, v) + β_2 v}.

Under the multivariate normality assumptions of the previous section, x given (q, v) is normally distributed with mean that can be estimated by regressing biomarker values w on q and v, and with variance that can be estimated using repeat biomarker determinations in a biomarker substudy. From these estimators one can compute a corresponding estimator of the expectation of x* given q and v by integrating this estimated normal density from x0 to x1. Simultaneous calibrated hazard ratio estimators can be calculated by corresponding integration over the elements of a partition formed by quantile cutpoints of this same estimated normal distribution for x*.
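The integration step just described amounts to evaluating differences of normal distribution functions at the quantile cutpoints, using the conditional mean from the regression of w on (q, v) and the conditional variance from the repeat-biomarker substudy. The sketch below illustrates this calculation with made-up inputs.

```python
# Sketch of the calibrated quantile indicator: given the estimated conditional mean and
# standard deviation of x given (q, v), integrate the normal density between cutpoints
# (x0, x1) to estimate E{x*; q, v}. The conditional mean would come from regressing w on
# (q, v), and the conditional variance from the repeat-biomarker substudy; inputs here
# are made up.
from scipy.stats import norm

def calibrated_indicator(cond_mean, cond_sd, x0, x1):
    """Estimated P(x0 < x <= x1 | q, v) under the fitted conditional normal model."""
    return norm.cdf(x1, loc=cond_mean, scale=cond_sd) - norm.cdf(x0, loc=cond_mean, scale=cond_sd)

# Tertile cutpoints of a standard normal exposure, as in the simulation study below:
cut1, cut2 = norm.ppf(1/3), norm.ppf(2/3)
print(calibrated_indicator(cond_mean=0.4, cond_sd=0.6, x0=cut1, x1=cut2))   # middle tertile
print(1.0 - norm.cdf(cut2, loc=0.4, scale=0.6))                             # upper tertile
```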

To test this approach we simulated data from a hazard rate model

λ(t; x, q, v) = λ_0(t) exp{β_1 x_1* + β_2 x_2* + β_3 v}

where x_1* and x_2* are indicators corresponding to the second and third tertiles of x, which followed a standard normal distribution. Also, the univariate covariate v was taken to be independent of x and to adhere to a standard normal distribution, while the sampling errors e and ε were also normally distributed with mean zero and variance 0.5, and were independent of each other and of the other modeled variates, while the measured exposure q was derived from q = 0.8x + 0.5v + ε. Also, terminal censoring was imposed at a fixed value c. Data were generated from cohorts of size 2000 with (q, v) measurements, along with an external biomarker sample of size 500 with both w and q values available and a 20% reliability subsample with a second w value having measurement error that is independent of the first.
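The sketch below reproduces this data-generating mechanism for a single cohort, as we read it, with the v coefficient β_3 set to zero as Table 1 indicates; the seed and coding details are ours, and the full simulation additionally applies the calibration and estimation procedures summarized in Table 1.

```python
# Sketch of the simulation's data-generating mechanism for one cohort (seed and coding
# details are ours): x, v standard normal; q = 0.8x + 0.5v + eps with var(eps) = 0.5;
# biomarker w = x + e with var(e) = 0.5; exponential failure times from the tertile-
# indicator hazard model with lambda0 = 0.7 and beta3 = 0; terminal censoring at c = 1.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2018)
n, lam0, c = 2000, 0.7, 1.0
beta1, beta2, beta3 = np.log(1.5), 2 * np.log(1.5), 0.0

x, v = rng.normal(size=n), rng.normal(size=n)
q = 0.8 * x + 0.5 * v + rng.normal(0.0, np.sqrt(0.5), n)

cut1, cut2 = norm.ppf(1/3), norm.ppf(2/3)
x1_star, x2_star = (x > cut1) & (x <= cut2), x > cut2
hazard = lam0 * np.exp(beta1 * x1_star + beta2 * x2_star + beta3 * v)
t = rng.exponential(1.0 / hazard)
time, event = np.minimum(t, c), (t <= c).astype(int)

# External biomarker sample of size 500 (w and q observed), with a 20% reliability
# subsample in which a second, independent biomarker measurement is taken:
xb, vb = rng.normal(size=500), rng.normal(size=500)
qb = 0.8 * xb + 0.5 * vb + rng.normal(0.0, np.sqrt(0.5), 500)
w1 = xb + rng.normal(0.0, np.sqrt(0.5), 500)
w2 = xb[:100] + rng.normal(0.0, np.sqrt(0.5), 100)

print(round(1 - event.mean(), 2))    # censoring fraction, roughly 0.35 as reported
```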

Multiple simulation scenarios were considered, each giving very similar results. Table 1 shows summary statistics from 5000 generated cohort samples with λ_0(t) ≡ 0.7, β_1 = log(1.5), β_2 = 2 log(1.5) and c = 1, giving a censoring probability of about 35%. Even though one doesn’t expect a rare disease approximation to be accurate with censoring rates as low as 35%, the calibrated hazard ratio estimators (RC1) for the second and third tertiles show very little bias relative to their generating values. Sample standard deviation estimates and coverage rates for estimated 95% confidence intervals are also shown, the latter being close to nominal values. Also shown in Table 1 are corresponding summary statistics (i) if one had available the actual generated x-values and used these in standard Cox regression (True); (ii) if one used tertiles from the measured q-values in Cox regression (Naive); and (iii) if one used tertiles of the calibrated x-values (RC2). Clearly the Naive and RC2 ‘estimators’ do not perform adequately in this simulation setting.

Table 1. Simulation^a summary statistics for regression calibration estimates of tertile hazard ratios.

Estimator^b   Statistic                    β_1 = 0.405   β_2 = 0.811   β_3 = 0
RC1           Sample mean                  0.416         0.805         0
              Sample standard deviation    0.217         0.111         0.031
              95% CI coverage (%)          96.6          94.9          95.0
True          Sample mean                  0.406         0.813         0
              Sample standard deviation    0.072         0.069         0.028
              95% CI coverage (%)          95.1          95.5          95.0
Naive         Sample mean                  0.259         0.510         −0.081
              Sample standard deviation    0.074         0.072         0.03
              95% CI coverage (%)          49.3          1.3           22.7
RC2           Sample mean                  0.281         0.553         0
              Sample standard deviation    0.071         0.075         0.03
              95% CI coverage (%)          62.4          9.1           94.8

^a Simulation based on 5000 cohorts each of size 2000, with an external biomarker subsample of size 500 in which both the biomarker (w) and self-report (q) are measured, along with a 20% random subsample in which a second biomarker measurement (w) is available.

^b RC1 is the proposed regression calibration procedure; True is from Cox regression using x-values without measurement error; Naive is based on tertiles of the measured q-values; and RC2 arises from forming tertiles of calibrated x-values.

Our proposed tertile hazard ratio estimators (RC1) seem eminently usable, though, of course, they incorporate considerable additional random variation compared to analyses based on true x-values, as is to be expected with this amount of measurement error contamination.

5. Example of Sodium Intake and Cardiovascular Disease Risk

To further illustrate the importance of the needed hazard ratio estimation developments, consider the association between dietary sodium and cardiovascular disease risk. Even though a high intake of sodium, or a high intake ratio of sodium to potassium, is associated with elevated blood pressure in observational studies and randomized trials (Stamler et al., 1988; Tzoulaki et al., 2012; Whelton et al., 1997), evidence for these dietary associations with cardiovascular diseases has been inconclusive (Bibbins-Domingo et al., 2010; Strazzullo, D’Elia, Kandala, & Cappuccio, 2009; Yang et al., 2011), in spite of considerable public health interest and importance (Mozaffarian et al., 2014; Oria, Yaktine, Strom, et al., 2013). Uncertainty concerning these associations was enhanced when the large international Prospective Urban Rural Epidemiology (PURE) study reported a J-shaped relationship between sodium excretion and major cardiovascular disease outcomes, with higher disease risk at intakes that were relatively low as well as relatively high (O’Donnell et al., 2014), and with risk elevations at the low end occurring at values well below recommended maximal intakes (US Department of Health and Human Services et al., 2015). This led to questions concerning the wisdom of sodium reduction as an isolated public health recommendation (e.g., Oparil, 2014).

While most reports of sodium intake in relation to chronic disease outcomes have relied on dietary self-report, the PURE study can be commended for using a biomarker assessment of sodium intake. Specifically, morning spot urine sodium excretion was adjusted using a formula (Kawasaki, Itoh, Uezono, & Sasaki, 1993) to provide an estimate of 24-hour urinary excretion. However, in other studies spot urine excretion has been found not to correlate well with 24-hour sodium excretion (Cogswell et al., 2013; Ji, Miller, Venezia, Strazzullo, & Cappuccio, 2014; Ji et al., 2012), implying that even if the adjusted intake estimates adhere to (1), the error variance may be quite large relative to the variance of the targeted intake z. This suggests that the spot urine-derived intake estimates may be inefficient, at best, as a biomarker of usual daily sodium intake. Even sodium excretion from 24-hour urine specimens is somewhat noisy as a usual intake biomarker, with average excretion over multiple days able to usefully reduce the measurement error variance in (1).
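A small numerical illustration of this last point, with made-up variance components, is given below: averaging k days of 24-hour excretion reduces the error variance in (1) by roughly a factor of k, and correspondingly improves the attenuation factor that governs hazard ratio bias.

```python
# Illustration (made-up variance components) of why averaging k repeat 24-hour urine
# collections reduces the error variance in model (1) to var(e)/k, improving the
# attenuation factor var(z) / {var(z) + var(e)/k}.
var_z, var_e = 0.10, 0.20     # between-person and day-to-day variances on the log scale (illustrative)
for k in (1, 2, 4, 7):
    attenuation = var_z / (var_z + var_e / k)
    print(f"k = {k} collection days: attenuation factor = {attenuation:.2f}")
```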

The PURE study authors presented associations between estimated usual sodium intake and cardiovascular disease hazard ratios by fitting a cubic spline model in (2), without making any provision for measurement error in their sodium intake estimates. Methods for fitting this type of model, while allowing the measurement error in (1) to constitute a major fraction of the biomarker variation, are needed to interpret, and to correct, the PURE study associations for measurement error.

Recently the authors have used the regression calibration procedure described above with 24-hour sodium excretion as a biomarker, in conjunction with food frequency-estimated sodium intake and a range of study subject characteristics, to develop calibration equations to estimate short-term sodium intake (Huang et al., 2014). These developments used data from a biomarker substudy of the Women’s Health Initiative (WHI) cohorts (Prentice et al., 2011). The calibration equations were used to produce usual daily intake estimates for individuals in WHI cohorts of postmenopausal women in the United States. Calibrated estimates of log-sodium intake were then associated with hazard ratios over cohort follow-up for various cardiovascular disease outcomes. Positive associations were found between calibrated sodium intake, and the calibrated ratio of sodium to potassium intake, and major cardiovascular diseases, including coronary heart disease and heart failure (Prentice et al., 2017). In contrast to the PURE study, these analyses do not suggest higher risk for these major cardiovascular disease outcomes at relatively low sodium intakes, but a careful study of hazard ratio function shape, while allowing for measurement error in intake estimates, would require the ability to fit hazard ratio models more general than the linear model in log-intake applied in these analyses.

In some applications an additive model of the form (1) may be plausible, but the classical measurement model assumption may not hold because of dependence of the variance of the error term e on the value of the targeted nutritional variable x. If the error variance is large compared to the variance of x, then even modest dependencies of the error variance on x could have important implications for the estimated shape of the hazard ratio function, especially if complex hazard ratio dependencies, such as cubic spline models, are entertained. Hence, additional statistical methods and theory development is strongly needed for this important public health question to be addressed using dietary biomarker and self-report data. Such developments are needed not only for full-cohort data analyses but also for the major cohort subsampling designs, including nested case-control and case-cohort sampling.

In summary, even though sodium over-consumption is projected to be responsible for very substantial morbidity and mortality worldwide (Mozaffarian et al., 2014), issues related to sodium intake assessment have prevented definitive quantitative results from emerging on the associations between sodium intake over the lifespan and the incidence and mortality of specific cardiovascular diseases. The further development of statistical methods and theory is a crucial component of the related needed research.

6. Summary and Conclusions

There have been many important statistical developments over the past 15–20 years as reliable, high-dimensional genotype data on individual study subjects became available. During the same time period, high-dimensional data on gene, protein, and metabolite expression profiles, using blood and urine specimens, as well as high-dimensional data from various types of imaging techniques, have been ascertained in a variety of contexts. These latter data types typically target quantities that vary over the lifespan of the study subject, and the ability of assessment platforms to be comprehensive in terms of the analytes measured may be a challenge (e.g., for mass spectrometry-based proteomic or metabolomic platforms).

In public health contexts gene, protein and metabolite profiles may reflect both genotype and prior exposure history, including such exposures as diet and physical activity patterns over the preceding months or years. If these exposure patterns could be well-measured by self-report then the high-dimensional data just mentioned could be used to explain biologic pathways and processes whereby these commonplace activities affect chronic disease risk. However, after several decades of development and application of self-report data for these exposures it is evident that they are not sufficiently reliable for many nutritional epidemiology purposes, most notably for the study of associations with total energy intake, or with the absolute intake of the components of energy.

To the extent that measures in urine and blood, including metabolomic platform measurements, directly reflect dietary intake patterns, these measures may be able to provide an objective assessment of the intake of foods and nutrients over the recent past. Repeated application of such objective assessments over cohort follow-up periods may then allow objective dietary exposure histories to be developed, with enhancement of the reliability of related nutritional epidemiology association analyses.

While this biomarker approach to nutritional epidemiology study has considerable potential, there is a need for an intensive research effort to develop biomarkers for many additional nutritional variables, and an equal need to develop flexible statistical measurement error methods for applying such objective exposure assessments. The latter need arises because the biomarker strategy may be able to yield objective exposure assessments, but these assessments are likely to incorporate noise components that cannot be ignored in analyses to relate dietary exposures to chronic disease risk.

This article is written with a goal of enlisting additional strong statistical methodologists and theorists in this important public health research.

Acknowledgments

Funding

This manuscript was written with partial support from National Institutes of Health grants R01 CA210921 and R01 CA119171.

References

1. Armstrong B, & Doll R (1975). Environmental factors and cancer incidence and mortality in different countries, with special reference to dietary practices. International Journal of Cancer, 15(4), 617–631.
2. Bibbins-Domingo K, Chertow GM, Coxson PG, Moran A, Lightwood JM, Pletcher MJ, & Goldman L (2010). Projected effect of dietary salt reductions on future cardiovascular disease. New England Journal of Medicine, 362(7), 590–599.
3. Bingham SA (2003). Urine nitrogen as a biomarker for the validation of dietary protein intake. The Journal of Nutrition, 133(3), 921S–924S.
4. Breslow N, McNeney B, Wellner JA, et al. (2003). Large sample theory for semiparametric regression models with two-phase, outcome dependent sampling. The Annals of Statistics, 31(4), 1110–1139.
5. Breslow N, & Wellner JA (2007). Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scandinavian Journal of Statistics, 34(1), 86–102.
6. Cancer Incidence in Five Continents, Vol. X (2014). (Forman D et al., Eds.) (No. 164). Lyon, France: International Agency for Research on Cancer.
7. Carroll RJ, Ruppert D, Stefanski LA, & Crainiceanu CM (2006). Measurement Error in Nonlinear Models: A Modern Perspective (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.
8. Cogswell ME, Wang C-Y, Chen T-C, Pfeiffer CM, Elliott P, Gillespie CD, ... others (2013). Validity of predictive equations for 24-h urinary sodium excretion in adults aged 18–39 y. The American Journal of Clinical Nutrition, 98(6), 1502–1513.
9. Cox DR (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B (Methodological), 34(2), 187–220.
10. Cox DR (1975). Partial likelihood. Biometrika, 62(2), 269–276.
11. Hastie T, Tibshirani R, & Wainwright M (2015). Statistical Learning with Sparsity: The Lasso and Generalizations (No. 143). Chapman and Hall/CRC Press.
12. Huang Y, Van Horn L, Tinker LF, Neuhouser ML, Carbone L, Mossavar-Rahmani Y, ... Prentice RL (2014). Measurement error corrected sodium and potassium intake estimation using 24-hour urinary excretion. Hypertension, 63(2), 238–244.
13. Ji C, Miller M, Venezia A, Strazzullo P, & Cappuccio F (2014). Comparisons of spot vs 24-h urine samples for estimating population salt intake: Validation study in two independent samples of adults in Britain and Italy. Nutrition, Metabolism and Cardiovascular Diseases, 24(2), 140–147.
14. Ji C, Sykes L, Paul C, Dary O, Legetic B, Campbell NR, & Cappuccio FP (2012). Systematic review of studies comparing 24-hour and spot urine collections for estimating population salt intake. Revista Panamericana de Salud Pública, 32(4), 307–315.
15. Kalbfleisch JD, & Prentice RL (2002). The Statistical Analysis of Failure Time Data (2nd ed.). New York: Wiley and Sons.
16. Kawasaki T, Itoh K, Uezono K, & Sasaki H (1993). A simple method for estimating 24 h urinary sodium and potassium excretion from second morning voiding urine specimen in adults. Clinical and Experimental Pharmacology and Physiology, 20(1), 7–14.
17. Lampe JW, Huang Y, Neuhouser ML, Tinker LF, Song X, Schoeller DA, ... others (2017). Dietary biomarker evaluation in a controlled feeding study in women from the Women’s Health Initiative cohort. The American Journal of Clinical Nutrition, 105(2), 466–475.
18. Luft F, Fineberg N, & Sloan R (1982). Estimating dietary sodium intake in individuals receiving a randomly fluctuating intake. Hypertension, 4(6), 805–808.
19. Mozaffarian D, Fahimi S, Singh GM, Micha R, Khatibzadeh S, Engell RE, ... Powles J (2014). Global sodium consumption and death from cardiovascular causes. New England Journal of Medicine, 371(7), 624–634.
20. O’Donnell M, Mente A, Rangarajan S, McQueen MJ, Wang X, Liu L, ... others (2014). Urinary sodium and potassium excretion, mortality, and cardiovascular events. New England Journal of Medicine, 371, 612–623.
21. Oparil S (2014). Low sodium intake: cardiovascular health benefit or risk? New England Journal of Medicine, 371(7), 677–679.
22. Oria M, Yaktine AL, Strom BL, et al. (2013). Sodium intake in populations: Assessment of evidence. National Academies Press.
23. Prentice RL (1982). Covariate measurement errors and parameter estimation in a failure time regression model. Biometrika, 69(2), 331–342.
24. Prentice RL (1986). A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika, 73(1), 1–11.
25. Prentice RL, & Breslow NE (1978). Retrospective studies and failure time models. Biometrika, 65(1), 153–158.
26. Prentice RL, Huang Y, Neuhouser ML, Manson JE, Mossavar-Rahmani Y, Thomas F, ... others (2017). Biomarker calibrated sodium and potassium intake and cardiovascular disease risk among postmenopausal women. American Journal of Epidemiology, 186(9), 1035–1043.
27. Prentice RL, Mossavar-Rahmani Y, Huang Y, Van Horn L, Beresford SA, Caan B, ... others (2011). Evaluation and comparison of food records, recalls, and frequencies for energy and protein assessment by using recovery biomarkers. American Journal of Epidemiology, 174(5), 591–603.
28. Prentice RL, & Sheppard L (1990). Dietary fat and cancer: Consistency of the epidemiologic data, and disease prevention that may follow from a practical reduction in fat consumption. Cancer Causes and Control, 1(1), 81–97.
29. Rakova N, Jüttner K, Dahlmann A, Schröder A, Linz P, Kopp C, ... others (2013). Long-term space flight simulation reveals infradian rhythmicity in human Na+ balance. Cell Metabolism, 17(1), 125–131.
30. Schoeller DA (1999). Recent advances from application of doubly labeled water to measurement of human energy expenditure. The Journal of Nutrition, 129(10), 1765–1768.
31. Self SG, & Prentice RL (1988). Asymptotic distribution theory and efficiency results for case-cohort studies. The Annals of Statistics, 16(1), 64–81.
32. Song X, Huang Y, Neuhouser ML, Tinker LF, Vitolins MZ, Prentice RL, & Lampe JW (2017). Dietary long-chain fatty acids and carbohydrate biomarker evaluation in a controlled feeding study in participants from the Women’s Health Initiative cohort. The American Journal of Clinical Nutrition, 105(6), 1272–1282.
33. Stamler J, Rose G, Stamler R, Elliott P, Marmot M, Pyorala K, ... others (1988). INTERSALT: An international study of electrolyte excretion and blood pressure. Results for 24 hour urinary sodium and potassium excretion. British Medical Journal, 297(6644), 319–328.
34. Strazzullo P, D’Elia L, Kandala N-B, & Cappuccio FP (2009). Salt intake, stroke, and cardiovascular disease: Meta-analysis of prospective studies. British Medical Journal, 339, b4567.
35. Thomas DC (1977). Addendum to ‘Methods for cohort analysis: Appraisal by application to asbestos mining’ by F. D. K. Liddell, J. C. McDonald and D. C. Thomas. Journal of the Royal Statistical Society, Series A, 140, 469–491.
36. Tzoulaki I, Patel CJ, Okamura T, Chan Q, Brown IJ, Miura K, ... others (2012). A nutrient-wide association study on blood pressure. Circulation, 126(21), 2456–2464.
37. US Department of Health and Human Services, et al. (2015). 2015–2020 Dietary Guidelines for Americans. Washington, DC: USDA.
38. Whelton PK, He J, Cutler JA, Brancati FL, Appel LJ, Follmann D, & Klag MJ (1997). Effects of oral potassium on blood pressure: Meta-analysis of randomized controlled clinical trials. Journal of the American Medical Association, 277(20), 1624–1632.
39. World Cancer Research Fund and American Institute for Cancer Research. (1997). Food, Nutrition and the Prevention of Cancer: A Global Perspective (Tech. Rep.). Washington, DC.
40. World Cancer Research Fund and American Institute for Cancer Research. (2007). Food, Nutrition and the Prevention of Cancer: A Global Perspective (Tech. Rep.). Washington, DC.
41. World Health Organization (2003). Diet, Nutrition and the Prevention of Chronic Diseases: Report of a Joint WHO/FAO Expert Consultation (Tech. Rep. No. 916). Geneva.
42. Yang Q, Liu T, Kuklina EV, Flanders WD, Hong Y, Gillespie C, ... others (2011). Sodium and potassium intake and mortality among US adults: Prospective data from the Third National Health and Nutrition Examination Survey. Archives of Internal Medicine, 171(13), 1183–1191.
