ABSTRACT
Background
Principal components analysis (PCA) has been the most widely used method for deriving dietary patterns to date. However, PCA requires arbitrary ad hoc decisions for selecting food variables in interpreting dietary patterns and does not easily accommodate covariates. Sparse latent factor models can be utilized to address these issues.
Objective
The objective of this study was to compare Bayesian sparse latent factor models with PCA for identifying dietary patterns among young adults.
Methods
Habitual food intake was estimated in 2730 sedentary young adults from the Training Interventions and Genetics of Exercise Response (TIGER) Study [aged 18–35 y; body mass index (BMI; in kg/m2): 26.5 ± 6.1] who exercised <30 min/wk during the previous 30 d without restricting caloric intake before study enrollment. A food-frequency questionnaire was used to generate the frequency intakes of 102 food items. Sparse latent factor modeling was applied to the standardized food intakes to derive dietary patterns, incorporating additional covariates (sex, race/ethnicity, and BMI). The identified dietary patterns via sparse latent factor modeling were compared with the PCA derived dietary patterns.
Results
Seven dietary patterns were identified in both PCA and sparse latent factor analysis. In contrast to PCA, the sparse latent factor analysis allowed the covariate information to be jointly accounted for in the estimation of dietary patterns in the model and offered probabilistic criteria to determine the foods relevant to each dietary pattern. The derived patterns from both methods generally described common dietary behaviors. Dietary patterns 1–4 had similar food subsets using both statistical approaches, but PCA had smaller sets of foods with more cross-loading elements between the 2 factors. Overall, the sparse latent factor analysis produced more interpretable dietary patterns, with fewer of the food items excluded from all patterns.
Conclusion
Sparse latent factor models can be useful in future studies of dietary patterns by reducing the intrinsic arbitrariness involving the choice of food variables in interpreting dietary patterns and incorporating covariates in the assessment of dietary patterns.
Keywords: dietary patterns, sparse latent factor models, principal components analysis, young adults, nutritional epidemiology, Bayesian modeling
Introduction
Dietary pattern studies have been undertaken in nutritional epidemiology to examine potential cumulative and interactive effects of individual components of an overall diet, in which foods are consumed in combination (1). In previous research, dietary patterns have been inferred theoretically on the basis of qualitative assessments [e.g., the Healthy Eating Index (2)] or empirically by using statistical methods to extract information about dietary patterns in data (3). Principal components analysis (PCA) is the most commonly used method to empirically derive dietary patterns (4, 5). In a dietary pattern analysis, PCA assumes that each dietary observation is characterized by a small number of latent factors (i.e., dietary patterns). Each factor is assumed to represent a multifaceted picture of dietary consumption and to have a natural interpretation as an unobserved dietary characteristic.
Despite its popularity in dietary pattern studies, PCA has some shortcomings. The first is that it is often difficult to interpret the derived dietary patterns because they are, in general, linear combinations of all food variables in the observed data (4). It is a common practice to ignore foods that are weakly associated with a given pattern to simplify the interpretation. However, setting an appropriate inclusion threshold is often based on intuition rather than on well-defined criteria (6, 7).
Another shortcoming is that it is difficult to extend the scope of PCA to incorporate covariates that may confound measurements of dietary intake. In practice, PCA generally characterizes dietary patterns solely on the basis of reported dietary intake and does not account for additional covariates influencing individuals’ dietary practices, such as age, sex, and sociocultural factors, or other types of biological data. Lack of inclusion of these covariates may lead to improper identification of latent pattern structures, because PCA attempts to find dietary patterns that explain as much of the total variation in the dietary data as possible, ignoring any covariate-based trends that might contribute to this variation. To deal with nonfood factors associated with dietary intake, some studies have performed stratified analyses by sex (8–15) and less often by ethnicity (16–18) or age (13, 19), and then searched for dietary patterns therein. This strategy may not be appropriate when global patterns in a mixed population are of interest (20), and it can generally accommodate only a few categorical covariates, each with a small number of levels. Even if the primary interest is in the local patterns of a subpopulation, there could be unwanted sources of variance that need to be accounted for in the assessment of dietary patterns. However, simultaneous modeling of different sources of variation with the use of both covariates and latent patterns is not a common practice in dietary pattern analysis.
Sparse latent factor models (21–23) can be utilized to address these methodologic shortcomings of PCA. West (21) initiated the idea of sparse factor modeling on the basis of a Bayesian modeling framework, which aims to provide parsimonious relations between high-dimensional variables and latent factors by forcing less influential associations to be exactly zero in the model. This modeling framework has been used in gene expression studies and other complex data sets, and its approach is generic and broadly applicable in other fields (21–24). Unlike PCA, sparse latent factor models naturally produce latent factors that include only a subset of foods. Sparsity modeling is appealing in dietary pattern analysis because each pattern is typically described by a small subset of the food variables in the data. A Bayesian approach provides a quantitative, probabilistic basis for identifying the members of each subset, rather than ad hoc threshold-based decisions that may obscure some relations in the data. In addition, sparse latent factor models provide a way to address the potential influence of covariates. Lucas et al. (22) extended a sparse latent factor model by coupling it with a sparse regression model framework, in which observed responses were expressed as a linear superposition of latent factors and other regressors on the basis of covariates. This approach allows the contribution of each source of variance to be identified jointly, taking the other components into account.
In this study, we applied a sparse latent factor model to dietary pattern analysis with the use of data from the Training Interventions and Genetics of Exercise Response (TIGER) Study. Specific characteristics of participants, including sex, race/ethnicity, and BMI, were hypothesized to account for a portion of the variation observed in dietary data and were jointly modeled to isolate their effects from the assessment of underlying patterns. The derived dietary patterns from sparse latent factor analysis were compared with those obtained from the classical PCA procedure.
Methods
Study population
The TIGER study is a prospective cohort study with the primary goal of identifying genetic factors that influence physiologic responses to a 15-wk, 3-d/wk aerobic exercise training protocol. Participant enrollment in the TIGER study was initiated in 2003 and was completed in 2015. Participants were drawn from the University of Houston (2003–2008) and the University of Alabama at Birmingham (2010–2015) in 15 cohorts ascertained each fall or spring semester. The target participant for this study was a sedentary individual aged 18–35 y who exercised <30 min/wk during the previous 30 d without restricting caloric intake before enrollment. The exclusion criteria for enrollment included a physical contraindication to exercise (e.g., cardiomyopathy), a metabolic disorder known to alter body composition (e.g., lipodystrophy), or pregnancy. The TIGER study was approved by the respective institutional review boards, and informed consent was obtained from all participants. Details on the study protocol have been described elsewhere (25). The present study focused on the subset of participants (n = 2828) who reported their average intake of different food and beverage items with the use of a self-administered Block FFQ (NutritionQuest) at the baseline assessment. The FFQ was used to characterize habitual intakes of 102 food items by study participants. Food intake was measured by using 9 frequency categories, ranging from “never” to “every day,” which were converted to a frequency-per-day scale for each item. Participants were excluded from the analyses if they did not complete ≥90% of the food item questions or were missing anthropometric or demographic data. After all exclusions, 2730 participants were available for analysis.
Participants included 1769 women and 961 men from diverse racial/ethnic groups, including non-Hispanic whites (n = 1150), African Americans (n = 788), Hispanics (n = 382), Asians (n = 166), Asian Indians (n = 81), Native Americans (n = 12), and others (n = 151).
Bayesian formulation of sparse latent factor models
Basic model form
Latent factor models aim to explain observed data in terms of a linear superposition of latent factors. For individual $i$, the observed food intake $x_i$ with $p$ food items is then expressed as follows:
$$x_i = \Lambda f_i + \varepsilon_i \tag{1}$$
where $f_i$ is the $k$-vector of latent factor scores ($k < p$), $\varepsilon_i$ is the $p$-vector of noise, and $\Lambda$ is the $p \times k$ factor loading matrix. Within a dietary pattern context, the factor loading matrix expresses how each dietary pattern is associated with the original food intake and factor scores represent an individual's relative position on each dietary pattern. Nonprobabilistic models such as PCA do not explicitly model the noise term, but in a Bayesian context, all of the entries are usually assumed to be independent and normally distributed. Explicitly modeling the confounding of noise allows us to decouple variation that is unique to each food from the structure of the dietary patterns.
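To make Equation 1 concrete, the following minimal Python sketch simulates data from a latent factor model with independent normal noise. The dimensions, seed, loading values, and names (`simulate_factor_data`, `loadings`, `scores`) are illustrative assumptions and are not taken from the TIGER analysis.

```python
import numpy as np

def simulate_factor_data(n, loadings, noise_sd, rng):
    """Draw n observations of the form loadings @ scores_i + noise_i (illustrative only)."""
    p, k = loadings.shape
    scores = rng.standard_normal((n, k))             # factor scores, one row per person
    noise = rng.standard_normal((n, p)) * noise_sd   # independent normal noise per food
    return scores @ loadings.T + noise, scores

rng = np.random.default_rng(0)
# Hypothetical sparse loading matrix: 6 foods, 2 patterns, each food on one pattern.
loadings = np.array([[0.9, 0.0],
                     [0.8, 0.0],
                     [0.7, 0.0],
                     [0.0, 0.9],
                     [0.0, 0.8],
                     [0.0, 0.7]])
x, scores = simulate_factor_data(n=500, loadings=loadings, noise_sd=0.5, rng=rng)
corr = np.corrcoef(x, rowvar=False)   # foods sharing a pattern should be correlated
```

In this toy setting, foods that load on the same pattern are correlated with one another, whereas foods on different patterns are approximately uncorrelated, which is the structure a factor model tries to recover.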
To learn the model parameters in a Bayesian context, we place prior distributions over those parameters and update these distributions with the observed data, following Bayes’ law. The resulting posterior distribution then appropriately captures our beliefs and uncertainties about the parameters and can be used to make predictions. The model properties follow from the choice of prior distribution.
Sparse factor loadings
When modeling dietary patterns, we typically want our factor loadings to be sparse to facilitate interpretation: if the factor loadings are sparse, each factor or dietary pattern contains only a relatively small number of significant foods. A Bayesian approach for imposing explicit sparseness on factor loadings is to assume that each loading may be zero or have a nonzero value with a prior probability (i.e., the probability of a factor loading being nonzero) (21–23). More specifically, our prior distributions over individual loadings are chosen to have substantial probability mass at zero to induce shrinkage of negligible loadings to zero. However, the priors also ensure that the probability mass is spread over a wide range of plausible values so that important loadings escape shrinkage and take nonzero values. In our case, if a food is associated with a specific dietary pattern or patterns, the nonzero values of the factor loadings are modeled as arising from a normal distribution. If this is not the case, the loading will be exactly zero. This reflects the fundamental view behind dietary pattern analysis that each food is expected to be largely associated with 1 or, at most, 2 patterns but unrelated to the other patterns. From a modeling perspective, the loading matrix will have many zeros because each row has very few salient loadings. PCA achieves this structure in a post hoc manner with the use of various factor rotation techniques and loading truncations in which the factor loadings with absolute values smaller than an arbitrary threshold are ignored. Sparse latent factor models, on the other hand, exploit sparsity-inducing priors as an integral part of the identification of dietary patterns, as described above. On the basis of this specification and our data, we can calculate the posterior distribution over the inclusion probabilities that are central to defining a set of foods showing significant association with specific dietary patterns.
More details on sparsity-inducing priors are available in the Supplemental Methods.
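As a rough illustration of the zero-or-normal prior described above, the sketch below draws a loading matrix in which each entry is exactly zero with some probability and otherwise comes from a normal distribution. The inclusion probability (0.1), the slab standard deviation, and the function name are assumptions chosen for illustration; the priors in the cited references may include additional hierarchical structure that is omitted here.

```python
import numpy as np

def sample_spike_and_slab(p, k, inclusion_prob, slab_sd, rng):
    """Draw a p x k loading matrix: exact zeros (spike) or normal values (slab)."""
    included = rng.random((p, k)) < inclusion_prob   # True where the loading is nonzero
    slab = rng.standard_normal((p, k)) * slab_sd     # values used only where included
    return np.where(included, slab, 0.0), included

rng = np.random.default_rng(1)
# 102 foods and 7 patterns match the study dimensions; the other settings are hypothetical.
loadings, included = sample_spike_and_slab(p=102, k=7, inclusion_prob=0.1,
                                           slab_sd=1.0, rng=rng)
sparsity = np.mean(loadings == 0.0)   # fraction of exact zeros in this prior draw
```

A draw from such a prior is mostly zeros, with a minority of entries free to take values across a wide range.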
Nonnormality in food intake and flexible factor scores
To specify our model, we must place a prior distribution on the factor scores. This prior distribution needs to account for the arbitrary nonnormal structure in dietary data that may arise from strong positive skewness or large proportions of zeros for many food items. A Dirichlet Process mixture model was used for the modeling of the factor score distributions to respond to observed nonnormal structure in data (23). A Dirichlet Process mixture model can be thought of as a mixture of infinitely many normal distributions and can capture any continuous distribution. The number of component distributions used is finite and determined by the data, allowing for flexible adaptation to the arbitrary nonnormal structure without overfitting. It also has the feature of cutting back to normality if the data suggest that is appropriate. Full details of this approach are available in Carvalho et al. (23).
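The idea of a mixture with potentially many normal components, of which only a few are actually used, can be sketched with a truncated stick-breaking construction. This is a generic textbook device and not necessarily the sampler used in reference 23; the concentration parameter, truncation level, and component means and spreads below are all hypothetical.

```python
import numpy as np

def stick_breaking_weights(alpha, n_components, rng):
    """Truncated stick-breaking weights for a Dirichlet process with concentration alpha."""
    v = rng.beta(1.0, alpha, size=n_components)
    v[-1] = 1.0                                   # truncation: last stick takes the remainder
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

def sample_mixture_scores(n, weights, means, sds, rng):
    """Draw n scalar factor scores from a finite mixture of normals."""
    component = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(means[component], sds[component]), component

rng = np.random.default_rng(2)
weights = stick_breaking_weights(alpha=1.0, n_components=20, rng=rng)
means = rng.normal(0.0, 2.0, size=20)    # hypothetical component means
sds = np.full(20, 0.5)                   # hypothetical common component spread
scores, component = sample_mixture_scores(1000, weights, means, sds, rng)
n_used = len(np.unique(component))       # components that actually generated data
```

Because the weights decay quickly, most draws come from a handful of components, and a single dominant component corresponds to the approximately normal case.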
Incorporation of covariates
To incorporate covariate information, sparse latent factor models can be extended as described in references 22 and 23, as follows:
$$x_i = B h_i + \Lambda f_i + \varepsilon_i \tag{2}$$
where $h_i$ is an $r$-vector of regressors based on covariates, $B$ is the $p \times r$ matrix of regression coefficients, and the latent factor term $\Lambda f_i$ and noise term $\varepsilon_i$ are as in Equation 1. The prior structure on the regression coefficients has the same form as that on the factor loadings and thus imposes sparsity on food-covariate associations.
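One way the regressor vector for each participant might be assembled is sketched below, assuming dummy coding with a reference group for each categorical covariate. The inclusion of an intercept, the standardization of BMI, the level labels, and the function name are assumptions made for illustration and are not details confirmed by the study.

```python
import numpy as np

def build_regressors(sex, race, bmi, race_levels, reference_race):
    """Assemble one row of regressors per participant: intercept, dummies, standardized BMI."""
    sex = np.asarray(sex)
    race = np.asarray(race)
    bmi = np.asarray(bmi, dtype=float)
    columns = [np.ones(len(sex)), (sex == "F").astype(float)]   # intercept; female vs male
    names = ["intercept", "female"]
    for level in race_levels:
        if level != reference_race:                             # reference group gets no column
            columns.append((race == level).astype(float))
            names.append(f"race_{level}")
    columns.append((bmi - bmi.mean()) / bmi.std())              # BMI on a standardized scale
    names.append("bmi_z")
    return np.column_stack(columns), names

# Hypothetical toy records, not TIGER data.
h, names = build_regressors(
    sex=["F", "M", "F", "M"],
    race=["NHW", "AA", "H", "NHW"],
    bmi=[22.0, 27.5, 31.0, 24.5],
    race_levels=["NHW", "AA", "H"],
    reference_race="NHW",
)
```

Each row of `h` then plays the role of the regressor vector in Equation 2, and each regression coefficient is read as a contrast with the reference group.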
Inference and missing value imputations
Direct analysis of the posterior distributions over the factor loading matrix and factor scores is not feasible. Instead, we use Markov Chain Monte Carlo (MCMC) methods. MCMC is a class of algorithms that generate samples from the posterior distribution over parameters. We can use these samples to make inferences on the basis of posterior means and characterize the uncertainty of model parameters. The full details of MCMC analysis for the model described in this article are available in Carvalho et al. (23).
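Given posterior samples, summaries such as the posterior inclusion probability and the posterior mean of each loading reduce to simple averages over draws. The sketch below assumes the draws are stored in an array of shape (number of draws, foods, patterns); this layout and the function name are hypothetical and do not represent the output format of any particular software.

```python
import numpy as np

def summarize_loading_draws(draws, threshold=0.95):
    """Summarize MCMC draws of a loading matrix with shape (n_draws, p, k)."""
    inclusion_prob = np.mean(draws != 0.0, axis=0)   # share of draws with a nonzero loading
    posterior_mean = draws.mean(axis=0)              # averages over zero and nonzero draws
    selected = inclusion_prob > threshold            # food-pattern pairs kept for interpretation
    return inclusion_prob, posterior_mean, selected

# Hypothetical draws for 3 foods and 1 pattern: food 0 always included,
# food 1 included in half of the draws, food 2 never included.
draws = np.zeros((100, 3, 1))
draws[:, 0, 0] = 0.8
draws[:50, 1, 0] = 0.3
inclusion_prob, posterior_mean, selected = summarize_loading_draws(draws)
```

With a threshold of 0.95, only the first food would be retained for this pattern in the toy example.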
The Bayesian framework described in this article provides a model-based solution for dealing with missing data. If the intake of some foods is missing, the missing values are treated as additional unknown parameters to be estimated. We can either integrate these parameters out, or explicitly sample them as part of the MCMC algorithm.
Statistical analysis
Both PCA and the sparse latent factor model were applied to the standardized intakes of 102 food items. For PCA, the missing values (0.6% of the total observations) were replaced with the median of each food item in the remaining observations. The number of components to retain was chosen on the basis of Velicer's (26) minimum average partial test, yielding 7 dietary patterns. Varimax rotation (27) was applied to the component loadings to increase interpretability. Food items with an absolute loading >0.3 on a factor were considered to be meaningfully associated with that dietary pattern. PCA was performed using R version 3.3.0 (28).
For the sparse latent factor analysis, the number of factors to retain was chosen by restricting any pattern to have ≥3 foods showing a significant association to make the patterns interpretable. This yielded 7 dietary patterns. The factor loading matrix was constrained to be a lower triangular matrix with positive diagonal elements to uniquely determine its structure (22, 23, 29, 30). Sex, race/ethnicity, and BMI were jointly modeled to account for their contributions to the dietary measures. Cohort information was included in the model to control for subtle confounding from data collection effects. The BFRM software (31) was used to fit the sparse latent factor model.
Results
Figure 1A shows the absolute values of the rotated loadings from PCA with Varimax rotation that utilizes the loading-based selection of food subsets for dietary patterns. These values ranged from 0 to 0.7 without clear separation, providing little information for distinguishing which food items meaningfully contributed to interpreting dietary patterns. Figure 1B shows the posterior inclusion probability that a given food is included in a specific dietary pattern using sparse latent factor modeling. The bimodal shape of the histogram provided a strong basis for interpreting dietary patterns by screening out insignificant food-pattern pairs while highlighting those having meaningful associations.
FIGURE 1.
The standard interpretation of dietary patterns using the PCA procedure is performed by setting arbitrary thresholds, whereas the sparse latent factor model uses the posterior inclusion probability that a food and a specific dietary pattern have a nonzero association. (A) Histogram of absolute values of the rotated PCA loadings obtained from PCA with Varimax rotation. (B) Histogram of posterior inclusion probabilities obtained from the sparse latent factor analysis. Compared with panel A, the stronger contrast observed in panel B makes it easier to separate meaningful food-pattern associations from negligible ones. PCA, principal components analysis.
Figure 2 provides a visual summary of the underlying latent factors (dietary patterns) that were ordered to increase comparability between PCA (Figure 2A) and the sparse latent factor analysis (Figure 2B). In Figure 2B, we show foods that have a posterior inclusion probability >0.95 (i.e., that are relevant to a given dietary pattern with high probabilities). The posterior inclusion probability and the posterior mean of factor loadings with their uncertainty intervals are available in Supplemental Figure 1, which provides an image of the manner in which food items are distinctly clustered into dietary patterns. It should be noted that the resulting loadings between PCA and the sparse latent factor model are not directly comparable because their scales are different. PCA loadings are scaled to lie between –1 and 1, but this is not the case in our approach in which each factor has its own scale that is learned from the data. In both approaches, factor 1 showed meaningful associations with fruit, nuts, and several vegetables. Additional foods such as fish and poultry were contained in the sparse latent factor solution, but overall, both identified a set of foods considered generally healthy. In the sparse factor analysis, factor 2 pertained to red and processed meat, potatoes, fried foods, pasta dishes, and breads, and factor 3 was associated with a cluster of snacks and sweets. The PCA factors 2 and 3 also provided similar food subsets but had smaller sets of foods with more cross-loading elements between the 2 factors. The structure of factor 4 was comparable between the sparse latent factor analysis and PCA, because both factors included a group of Hispanic foods, but the sparse latent factor solution formed a more concise and uncluttered cluster of foods. In PCA, it was not clear how to interpret the remaining factors. For example, factor 5 is composed of proteins, factor 6 consists of beverages, and factor 7 is a composite of somewhat unrelated foods. 
None of these factors would be considered as patterns of eating behavior. Conversely, the sparse latent factor model provided compact clusters of foods with clear interpretations, characterizing meat and dairy alternatives, alcoholic beverages, and cereal with milk, respectively. In addition, slight changes in the choice of the PCA loading cutoff had a nonnegligible influence on the final pattern structures (Supplemental Figure 2). A small decrease in the cutoff produced a more complex solution, creating many new cross-loading elements, whereas a small increase in the cutoff provided 7 clearly distinct blocks but left many foods unrelated to any pattern. The qualitative comparison of the food subsets between the 2 approaches is shown in Supplemental Figure 3.
FIGURE 2.
The structure of latent factors is visualized by using a heatmap in which rows correspond to food items and columns correspond to dietary patterns. Dietary patterns were identified on the basis of the loadings for each food item using the dietary data obtained from young adults (n = 2730) in the TIGER study. The food items were ordered to clarify the structure of dietary patterns. (A) The dietary patterns derived by PCA with Varimax rotation. Foods with absolute values of the rotated loading >0.30 were used to define dietary patterns. (B) The dietary patterns derived by the sparse latent factor model. Posterior means of the loadings are shown for foods with a posterior inclusion probability >0.95, which were used to define dietary patterns. Bullet symbols (●) indicate negative values of loadings. PCA, principal components analysis; TIGER, Training Interventions and Genetics of Exercise Response.
Figure 3, which is presented in the same order as Figure 2B, shows the food subsets significantly associated with each of the covariates. The food-covariate associations were simultaneously identified through the sparse latent factor analysis, providing insight into the covariate-specific trends in food intake. Men consumed eggs, meat, and meat products more frequently and ate vegetables, salad, yogurt, coffee, and chocolate less frequently than did women. Race/ethnicity also appeared to contribute to the variation in food consumption. African Americans consumed meat and processed foods more frequently than did non-Hispanic whites, while consuming low-fat dairy products, fruit, and vegetables less frequently. Hispanics tended to eat traditional Hispanic foods (e.g., tortillas, beans, and tacos) and rice more frequently than did non-Hispanic whites. In Asian and Asian-Indian groups, rice was the most notable food that distinguished both groups from the non-Hispanic white group. Asian participants consumed seafood and some vegetables more frequently than non-Hispanic whites, but this contrast was not apparent between Asian Indians and non-Hispanic whites. BMI showed little association with the frequencies of food intake.
FIGURE 3.

The influences of sex, race/ethnicity, and BMI were simultaneously quantified via sparse latent factor analysis. Sex and race/ethnicity were dummy coded such that the reference groups are male participants for the sex category and non-Hispanic white participants for the race/ethnicity category. A coefficient was set to zero if the posterior inclusion probability was <0.95. Bullet symbols (●) indicate negative values of coefficients. A, Asian; AA, African American; AI, Asian Indian; F, female; H, Hispanic; N, Native American; O, other racial/ethnic groups.
Discussion
We performed a dietary pattern analysis with the use of a sparse latent factor model that imposes sparsity on factor loadings to produce distinct factors composed of only a subset of foods, while allowing for the incorporation of covariates. This sparse latent factor model was implemented in a Bayesian modeling framework that uses probability theory to represent all forms of uncertainty in the model, which, in turn, allows a probabilistic assessment of which food variables to use in interpreting dietary patterns. The sparsity-inducing prior provides strong shrinkage toward zero for foods showing trivial associations, effectively separating them from nonzero associations.
In PCA, interpretation of factor loadings is typically enforced in a post hoc manner by choosing a cutoff and ignoring the values below that point to define meaningful food-pattern associations. In the present study, we observed that this simple loading truncation approach may be misleading. Due to the observed weak contrast of the factor loadings in PCA, it was difficult to choose the optimal cutoff to differentiate between significant and negligible loadings, and small changes in the loading cutoff led to notable changes in the final pattern structures. In practice, the cutoff value for PCA is often determined by the ease of interpretation, but there are no objective criteria that explicitly define interpretability.
In general, nutritional epidemiologic studies produce dietary data in addition to other types of data, such as demographic, anthropometric, or study design factors that may provide important information when exploring dietary patterns. In contrast to PCA, one important advantage of sparse latent factor models is their modularity that links several submodels to address a more complex setting. A sparse latent factor model can be viewed as a multivariate regression model through a linear combination of latent factors in which regressors are themselves uncertain. Covariate information is directly integrated in this regression framework as additional predictors. Importantly, the shrinkage of our Bayesian analysis automatically takes care of the implicit multiple tests (22, 32) arising from the simultaneous inferences of many food-covariate associations.
In the present study, a comparison of the resulting outputs is rather complex because the 2 methods differ in their ability to incorporate phenotypic information, but some of the factors are similar and comparable to what has been shown in the literature. In both solutions, factor 1 was similar to a prudent dietary pattern (3, 33), whereas factors 2 and 3 were related to a Western dietary pattern (3, 33). A cluster of Hispanic foods was also identified in both outputs, which may reflect their popularity among college students in the study cohorts (southern United States). The sparse latent factor analysis produced the clusters of dairy alternatives, alcoholic beverages, and cereal with milk, but the PCA output was not as clearly interpretable. PCA may have produced spurious associations that do not adequately reflect underlying dietary patterns, because it extracts latent factors by decomposing the total variation in the dietary intake data without any treatment for covariate effects or measurement error.
In the sparse latent factor analysis, the identified contributions of covariates generally agree with the findings in the literature. In the present study, women tended to make healthier dietary choices, which has been reported in several studies (34–39). Comprehensive research on racial/ethnic disparities in dietary intake is still needed, but some studies have reported that non-Hispanic blacks had lower Healthy Eating Index scores (40) and consumed fewer daily servings of fruits and vegetables than did non-Hispanic whites (41). A higher consumption of rice in Asians than in other racial/ethnic groups in the United States has also been reported (42). Previous studies have yielded mixed results with regard to the associations between dietary intake and BMI (43–46). Here, we observed little association between the intake frequencies of specific foods and BMI.
Dietary patterns are sometimes viewed as broader lifestyle patterns rather than simple diets (47), and hence, the influence of covariates is expected to be absorbed in latent pattern structures. Some researchers may prefer to derive dietary patterns for categorical covariates, such as sex or race/ethnicity, separately for each level, instead of finding global patterns of the whole study population. The advantage of sparse latent factor analysis is in its ability to robustly derive dietary patterns while simultaneously controlling for potential interaction with other variables. A 2-step covariate-adjustment approach has often been used in dietary pattern analysis in which the intake of each food is regressed on the predefined covariate or covariates and the residuals are then used as new input variables (20, 48, 49). Alternatively, sparse latent factor models can provide a single-step approach to adjust for covariates by directly including them as additional regressors. The practical advantage of this approach is that it not only controls for the influence of covariates but also uses their information jointly to derive dietary patterns. Here, we provide evidence that sparse latent factor analysis can derive dietary patterns in a valid manner, while incorporating factors (e.g., sex, race/ethnicity, BMI) reported to influence dietary intake. A more exciting possibility for this approach will be in the incorporation of more complex data sets (e.g., genotype, dietary questionnaires, etc.) to identify biological interactions that ultimately influence how people eat.
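For contrast with the single-step approach, the first stage of the 2-step covariate adjustment mentioned above can be sketched as an ordinary least-squares residualization of each food on the covariates. The simulated data, the single binary covariate, and the function name are hypothetical, and the cited studies may differ in their exact regression specification.

```python
import numpy as np

def residualize(foods, covariates):
    """Regress every food column on the covariates (with intercept); return residuals."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(design, foods, rcond=None)
    return foods - design @ coef

rng = np.random.default_rng(4)
female = rng.integers(0, 2, size=300).astype(float)             # hypothetical 0/1 covariate
foods = rng.standard_normal((300, 5)) + 0.8 * female[:, None]   # intakes shifted by sex
residuals = residualize(foods, female[:, None])
```

The residuals are uncorrelated with the covariate by construction and would then be passed to PCA in the second step. In the single-step model, by comparison, the regression coefficients and the factor loadings are estimated together.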
In dietary pattern studies, one practical drawback of PCA is its inability to appropriately deal with missing values (50). Because standard PCA cannot be performed with missing values, some studies have replaced missing values with the median intake of each food or with zero before analysis (5, 14, 51). However, this strategy does not account for multivariate relations in the data. In contrast, the Bayesian aspect of sparse latent factor models considers missing values as unknown parameters that can be sampled in an MCMC simulation, preserving their potential dependence on other food variables and observed covariates. The uncertainty in the imputation of missing items is propagated to other parts of the model and taken into account in the estimation of model parameters.
A common issue in modeling dietary data is that many dietary variables are often positively skewed (52). Log transformation is typically used to reduce the skewness in data, but the presence of zeros in dietary observations often requires adding an arbitrary constant to each food intake item [e.g., log(g/d intake +1)]. One other key feature of sparse latent factor models is their ability to adapt to arbitrary non-Gaussian structure in data via a Dirichlet process mixture model for the distributions of factor scores, which is an alternative approach to dealing with nonnormal distributions when using Bayesian statistics (23). The model accommodates nonnormal distributions of factor scores when nonnormality is evident in the data, whereas it reverts to normality if the data are consistent with normality.
In practice, of key interest is the examination of associations between dietary patterns and health outcomes. In PCA, component scores are calculated in a separate step from the rotated loadings and are then used as explanatory variables in regression models with the health outcome of interest as the response variable. Sparse latent factor models also produce factor scores, but these are estimated simultaneously with the factor loadings. A health outcome can then be regressed on these factor scores. Other clinical predictors that are known a priori to be associated with a health outcome or that have a strong clinical rationale can also be bundled with factor scores in this regression model, which is consistent with the conventional PCA-based approach. The ability to accommodate covariates in sparse latent factor models may offer an advantage in future health outcome analyses. The reason for incorporating covariates into sparse latent factor analysis is to derive dietary patterns that are independent of those covariates. For example, including sex in sparse latent factor analysis ensures that none of the dietary patterns are highly correlated with sex. When these dietary patterns are used in conjunction with sex as regressors in the examination of relations between dietary patterns and a health outcome, this health outcome analysis is less likely to experience a multicollinearity problem. The inclusion of sex in sparse latent factor analysis does not preclude the use of sex in health outcome analyses, because sex may have its own influence on health outcomes, independent of its relations with dietary patterns. Moreover, as a probabilistic module, these models have the potential to be further extended to include health outcome variables so that estimations of factor loadings and scores and examination of pattern-health relations are jointly performed in one coherent procedure.
Future research coupling predictive regression components for different types of health outcomes (e.g., binary, categorical, and censored) to dietary pattern modeling is warranted.
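The second-stage analysis described above, in which a health outcome is regressed on pattern scores together with a covariate such as sex, can be sketched as follows. This is only an illustration under stated assumptions: the scores, the sex indicator, and the outcome are hypothetical, the outcome is constructed from known coefficients so that the fit can be checked, and the `ols` helper is a plain least-squares solver written for this example rather than the software used in the study.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Hypothetical inputs: one dietary pattern score per participant and sex coded 0/1.
scores = [-1.2, -0.5, 0.0, 0.3, 0.9, 1.4, -0.8, 0.6]
sex = [0, 1, 0, 1, 0, 1, 1, 0]

# Hypothetical outcome built as 25 + 2.0*score + 1.5*sex (no noise, for checking).
outcome = [25 + 2.0 * s + 1.5 * g for s, g in zip(scores, sex)]

# Design matrix: intercept, pattern score, sex.
X = [[1.0, s, float(g)] for s, g in zip(scores, sex)]
intercept, beta_pattern, beta_sex = ols(X, outcome)
```

Because sex enters the model as its own regressor, its direct association with the outcome is estimated separately from the association carried by the dietary pattern score.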
Although we showed that sparse latent factor models may offer several advantages over PCA, there is still an ongoing debate as to whether the use of sparsity is appropriate in dietary pattern analysis (4, 53). Imamura and Jacques (53) argued that sparsity may be less useful than other methods because the cumulative effect of all foods is important when considering outcomes and risk associated with dietary intake. Nevertheless, interpretable pattern structures are obtained via selection of food variables in practice. This is the rationale behind the use of loading truncations on the dietary patterns produced by PCA with subsequent rotation. Sparsity can be thought of as an alternative way of regularizing dietary pattern modeling in an attempt to better explain the variation structure in data, not to make a statement that associations are truly zero. The sparsity approach helps to pull out the foods with dominant effects by allowing many others to be zero if their contributions are small enough.
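The contrast between loading truncation and a probabilistic selection rule can be made concrete with a small sketch. All food names, loadings, probabilities, and thresholds below are hypothetical and chosen only for illustration; they are not results or settings from this study.

```python
# Hypothetical rotated PCA loadings for one dietary pattern.
pca_loadings = {"pizza": 0.41, "soda": 0.33, "broccoli": -0.12, "rice": 0.19}

# Hypothetical posterior probabilities that each loading is nonzero.
inclusion_prob = {"pizza": 0.98, "soda": 0.91, "broccoli": 0.07, "rice": 0.62}

def truncate(loadings, cutoff):
    """PCA-style selection: keep foods whose absolute loading meets the cutoff."""
    return sorted(food for food, value in loadings.items() if abs(value) >= cutoff)

def select(probs, threshold=0.5):
    """Sparse-model-style selection: keep foods with inclusion probability above a threshold."""
    return sorted(food for food, p in probs.items() if p > threshold)

# The PCA food subset depends on the investigator's choice of cutoff.
foods_cutoff_020 = truncate(pca_loadings, 0.20)
foods_cutoff_015 = truncate(pca_loadings, 0.15)

# The sparse model attaches a probability to each food's membership in the pattern.
foods_sparse = select(inclusion_prob)
```

In this toy example, moving the truncation cutoff from 0.20 to 0.15 changes which foods define the pattern, whereas the probability-based rule states the selection criterion in terms of model-based uncertainty.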
The current study has limitations. The present analysis used single food items as input variables for dietary pattern analysis, whereas food items are commonly collapsed into a smaller number of food groups before using the PCA approach. Moeller et al. (49) pointed out that using too many food variables may lead to odd combinations with undue influence. However, an investigator-driven food grouping strategy can be subject to arbitrary choices, because there are numerous ways of aggregating food items into different numbers of food groups (47, 54). Indeed, the Nutrition Evidence Library review (55) reported that there were variations in the number and type of food groupings across dietary pattern studies. Future studies designed to evaluate the performance of the 2 methods in different input variable settings and to improve standardization of food grouping schemes would greatly enrich our understanding of dietary pattern analysis. It should also be noted that the present study relied on self-reported FFQs to assess food intake, which are subject to measurement error and self-report bias.
A potential barrier to the use of sparse latent factor models could be their mathematical complexity in comparison to PCA. It is true that PCA is relatively simple and works in many settings. However, it is also clearly desirable to have a flexible, extensible approach that can be adapted to specific study situations and that formalizes our intuition as well-defined criteria. In this respect, sparse latent factor models can be complementary to PCA by allowing investigators to move closer to modeling complex situations when necessary.
The proper characterization of dietary patterns is important in accounting for inherent interactions among foods and other factors that may have synergistic and cumulative effects on health outcomes, describing common characteristics of eating behaviors and identifying features belonging to a healthy diet to provide guidance for nutrition intervention. The classical PCA procedure in dietary pattern analysis does not adequately handle the intrinsic arbitrariness inherent in the choice of food variables for characterizing dietary patterns, the contribution of covariates to variation in data, and other sources of uncertainty arising from noise and missing values in the measurement process. Progress can be made in understanding patterns of dietary intake if all reasonable sources of uncertainty and variation are incorporated in the evaluation of dietary patterns and eating behavior. The sparse latent factor models can be a useful addition to dietary pattern modeling by addressing the practical issues in PCA and extending the analysis of dietary intake beyond food items and categorical covariates.
Acknowledgments
The authors’ responsibilities were as follows—MSB: was responsible for the acquisition of data and supervised the conduct of research and provided essential materials; JJ, SAW, and MSB: were involved in developing the overall analysis plan and analytic strategy; JJ: performed the statistical analyses and drafted the manuscript; SAW: provided support with statistical analyses; SAW, AIV, JRF, and MSB: contributed to the interpretation of results and critically reviewed the manuscript; JJ and MSB: had primary responsibility for the final content; and all authors: read and approved the final manuscript.
Notes
Supported by award R01DK0642148 from the National Institute of Diabetes and Digestive and Kidney Diseases.
Author disclosures: JJ, SAW, AIV, JRF, and MSB, no conflicts of interest.
Supplemental Methods and Supplemental Figures 1–3 are available from the “Supplementary data” link in the online posting of the article and from the same link in the online table of contents at https://academic.oup.com/jn/.
Abbreviations used:
- MCMC: Markov Chain Monte Carlo
- PCA: principal components analysis
- TIGER: Training Interventions and Genetics of Exercise Response
References
- 1. Hu FB. Dietary pattern analysis: a new direction in nutritional epidemiology. Curr Opin Lipidol 2002;13:3–9.
- 2. Guenther PM, Casavale KO, Kirkpatrick SI, Reedy J, Hiza HAB, Kuczynski KJ, Kahle LL, Krebs-Smith SM. Update of the Healthy Eating Index: HEI-2010. J Acad Nutr Diet 2013;113(4):569–80.
- 3. Newby PK, Tucker KL. Empirically derived eating patterns using factor or cluster analysis: a review. Nutr Rev 2004;62:177–203.
- 4. Gorst-Rasmussen A, Dahm CC, Dethlefsen C, Scheike T, Overvad K. Exploring dietary patterns by using the Treelet transform. Am J Epidemiol 2011;173:1097–104.
- 5. Varraso R, Garcia-Aymerich J, Monier F, Moual NL, Batlle JD, Miranda G, Pison C, Romieu I, Kauffmann F, Maccario J. Assessment of dietary patterns in nutritional epidemiology: principal component analysis compared with confirmatory factor analysis. Am J Clin Nutr 2012;96:1079–92.
- 6. Cadima JF, Jolliffe IT. Variable selection and the interpretation of principal subspaces. J Agric Biol Environ Stat 2001;6:62–79.
- 7. Zou H, Hastie T, Tibshirani R. Sparse principal component analysis. J Comput Graph Stat 2006;15:265–86.
- 8. Slattery ML, Boucher KM, Caan BJ, Potter JD, Ma K-N. Eating patterns and risk of colon cancer. Am J Epidemiol 1998;148:4–16.
- 9. Schulze MB, Hoffmann K, Kroke A, Boeing H. Dietary patterns and their association with food and nutrient intake in the European Prospective Investigation into Cancer and Nutrition (EPIC)–Potsdam study. Br J Nutr 2001;85:363.
- 10. Handa K, Kreiger N. Diet patterns and the risk of renal cell carcinoma. Public Health Nutr 2002;5:757–67.
- 11. Balder HF, Virtanen M, Brants HAM, Krogh V, Dixon LB, Tan F, Mannisto S, Bellocco R, Pietinen P, Wolk A et al. Common and country-specific dietary patterns in four European cohort studies. J Nutr 2003;133:4246–51.
- 12. Flood A, Rastogi T, Wirfält E, Mitrou PN, Reedy J, Subar AF, Kipnis V, Mouw T, Hollenbeck AR, Leitzmann M et al. Dietary patterns as identified by factor analysis and colorectal cancer among middle-aged Americans. Am J Clin Nutr 2008;88:176–84.
- 13. Cutler GJ, Flood A, Hannan PJ, Slavin JL, Neumark-Sztainer D. Association between major patterns of dietary intake and weight status in adolescents. Br J Nutr 2012;108:349–56.
- 14. Thorpe MG, Milte CM, Crawford D, McNaughton SA. A comparison of the dietary patterns derived by principal component analysis and cluster analysis in older Australians. Int J Behav Nutr Phys Act 2016;13:30.
- 15. Nanri A, Mizoue T, Shimazu T, Ishihara J, Takachi R, Noda M, Iso H, Sasazuki S, Sawada N, Tsugane S et al. Dietary patterns and all-cause, cancer, and cardiovascular disease mortality in Japanese men and women: the Japan Public Health Center-Based Prospective Study. PLoS One 2017;12:e0174848.
- 16. Nettleton JA, Steffen LM, Ni H, Liu K, Jacobs DR. Dietary patterns and risk of incident type 2 diabetes in the Multi-Ethnic Study of Atherosclerosis (MESA). Diabetes Care 2008;31:1777–82.
- 17. Williams CD, Satia JA, Adair LS, Stevens J, Galanko J, Keku TO, Sandler RS. Dietary patterns, food groups, and rectal cancer risk in whites and African Americans. Cancer Epidemiol Biomarkers Prev 2009;18:1552.
- 18. Dekker LH, van Dam RM, Snijder MB, Peters RJ, Dekker JM, de Vries JH, de Boer EJ, Schulze MB, Stronks K, Nicolaou M. Comparable dietary patterns describe dietary behavior across ethnic groups in the Netherlands, but different elements in the diet are associated with glycated hemoglobin and fasting glucose concentrations. J Nutr 2015;145:1884–91.
- 19. Gallagher ML, Farrior E, Broadhead L, Gillette LS, Rowe ML, Somes G, West P, Kolasa KM. Development and testing of a food frequency recall instrument for describing dietary patterns in adults and teenagers. Nutr Res 1993;13:177–88.
- 20. Fahey MT, Thane CW, Bramwell GD, Coward WA. Conditional Gaussian mixture modelling for dietary pattern analysis. J R Stat Soc Ser A Stat Soc 2007;170:149–66.
- 21. West M. Bayesian factor regression models in the “Large p, Small n” paradigm. Bayesian statistics 7: 723–32.
- 22. Lucas J, Carvalho C, Wang Q, Bild A, Nevins JR, West M. Sparse statistical modelling in gene expression genomics. In: Do KA, Muller P, Vannucci M, editors. Bayesian inference for gene expression and proteomics. Cambridge (United Kingdom): Cambridge University Press; 2006. p. 155–76.
- 23. Carvalho CM, Chang J, Lucas JE, Nevins JR, Wang Q, West M. High-dimensional sparse factor modeling: applications in gene expression genomics. J Am Stat Assoc 2008;103:1438–56.
- 24. Seo DM, Goldschmidt-Clermont PJ, West M. Of mice and men: sparse statistical modeling in cardiovascular genomics. Ann Appl Stat 2007;1:152–78.
- 25. Sailors MH, Jackson AS, McFarlin BK, Turpin I, Ellis KJ, Foreyt JP, Hoelscher DM, Bray MS. Exposing college students to exercise: the Training Interventions and Genetics of Exercise Response (TIGER) Study. J Am Coll Health 2010;59:13–20.
- 26. Velicer WF. Determining the number of components from the matrix of partial correlations. Psychometrika 1976;41:321–7.
- 27. Kaiser HF. The Varimax criterion for analytic rotation in factor analysis. Psychometrika 1958;23:187–200.
- 28. R Core Team. R: a language and environment for statistical computing. Vienna (Austria): R Foundation for Statistical Computing; 2017.
- 29. Geweke J, Zhou G. Measuring the pricing error of the arbitrage pricing theory. Rev Financ Stud 1996;9:557–87.
- 30. Aguilar O, West M. Bayesian dynamic factor models and portfolio allocation. J Bus Econ Stat 2000;18:338–57.
- 31. Wang Q, Carvalho CM, Lucas J, West M. BFRM: software for Bayesian factor regression models. Bull Int Soc Bayesian Anal 2007;14:4–5.
- 32. Scott JG, Berger JO. An exploration of aspects of Bayesian multiple testing. J Stat Plan Inference 2006;136:2144–62.
- 33. Tucker KL. Dietary patterns, approaches, and multicultural perspective: can we identify culture-specific healthful dietary patterns among diverse populations undergoing nutrition transition? Appl Physiol Nutr Metab 2010;35:211–8.
- 34. Fraser GE, Welch A, Luben R, Bingham SA, Day NE. The effect of age, sex, and education on food consumption of a middle-aged English cohort—EPIC in East Anglia. Prev Med 2000;30:26–34.
- 35. Baker AH, Wardle J. Sex differences in fruit and vegetable intake in older adults. Appetite 2003;40:269–75.
- 36. Liebman M, Propst K, Moore SA, Pelican S, Holmes B, Wardlaw MK, Melcher LM, Harker JC, Dennee PM, Dunnagan T. Gender differences in selected dietary intakes and eating behaviors in rural communities in Wyoming, Montana, and Idaho. Nutr Res 2003;23:991–1002.
- 37. Wardle J, Haase AM, Steptoe A, Nillapun M, Jonwutiwes K, Bellisie F. Gender differences in food choice: the contribution of health beliefs and dieting. Ann Behav Med 2004;27:107–16.
- 38. Prättälä R, Paalanen L, Grinberga D, Helasoja V, Kasmel A, Petkeviciene J. Gender differences in the consumption of meat, fruit and vegetables are similar in Finland and the Baltic countries. Eur J Public Health 2007;17:520–5.
- 39. Räty R, Carlsson-Kanyama A. Energy consumption by gender in some European countries. Energy Policy 2010;38:646–9.
- 40. Wang Y, Chen X. How much of racial/ethnic disparities in dietary intakes, exercise, and weight status can be explained by nutrition- and health-related psychosocial factors and socioeconomic status among US adults? J Am Diet Assoc 2011;111:1904–11.
- 41. Dubowitz T, Heron M, Bird CE, Lurie N, Finch BK, Basurto-Dávila R, Hale L, Escarce JJ. Neighborhood socioeconomic status and fruit and vegetable intake among whites, blacks, and Mexican-Americans in the United States. Am J Clin Nutr 2008;87:1883–91.
- 42. Batres-Marquez SP, Jensen HH, Upton J. Rice consumption in the United States: recent evidence from food consumption surveys. J Am Diet Assoc 2009;109:1719–27.
- 43. Togo P, Osler M, Sørensen TI, Heitmann BL. Food intake patterns and body mass index in observational studies. Int J Obes Relat Metab Disord 2001;25:1741.
- 44. Field AE, Gillman MW, Rosner B, Rockett HR, Colditz GA. Association between fruit and vegetable intake and change in body mass index among a large sample of children and adolescents in the United States. Int J Obes Relat Metab Disord 2003;27:821–6.
- 45. Charlton K, Kowal P, Soriano MM, Williams S, Banks E, Vo K, Byles J. Fruit and vegetable intake and body mass index in a large sample of middle-aged Australian men and women. Nutrients 2014;6:2305–19.
- 46. Sachithananthan V, Gad N. A study on the frequency of food consumption and its relationship to BMI in school children and adolescents in Abha City, KSA. Curr Res Nutr Food Sci J 2016;4:203–8.
- 47. Martinez ME, Marshall JR, Sechrest L. Invited commentary: factor analysis and the search for objectivity. Am J Epidemiol 1998;148:17–9.
- 48. Willett WC, Howe GR, Kushi LH. Adjustment for total energy intake in epidemiologic studies. Am J Clin Nutr 1997;65(Suppl):1220S–8S.
- 49. Moeller SM, Reedy J, Millen AE, Dixon LB, Newby PK, Tucker KL, Krebs-Smith SM, Guenther PM. Dietary patterns: challenges and opportunities in dietary patterns research. J Am Diet Assoc 2007;107:1233–9.
- 50. Roweis ST. EM algorithms for PCA and SPCA. In: Jordan MI, Kearns MJ, Solla SA, editors. Advances in Neural Information Processing Systems 10. NIPS '97: Proceedings of the Neural Information Processing Systems conference; 1997 December 1-6; Denver, Colorado. Cambridge, MA: MIT Press; 1998. p 626–32.
- 51. Northstone K, Ness A, Emmett P, Rogers I. Adjusting for energy intake in dietary pattern investigations using principal components analysis. Eur J Clin Nutr 2008;62:931–8.
- 52. Hu FB, Stampfer MJ, Rimm E, Ascherio A, Rosner BA, Spiegelman D, Willett WC. Dietary fat and coronary heart disease: a comparison of approaches for adjusting for total energy intake and modeling repeated dietary measurements. Am J Epidemiol 1999;149:531–40.
- 53. Imamura F, Jacques PF. Invited commentary: dietary pattern analysis. Am J Epidemiol 2011;173:1105–8.
- 54. Dietary Guidelines Advisory Committee. Scientific report of the 2015 Dietary Guidelines Advisory Committee [Internet]. Washington (DC): Department of Health and Human Services, USDA; 2015 [cited 2018 Apr 7]. Available from: https://health.gov/dietaryguidelines/2015-scientific-report/PDFs/Scientific-Report-of-the-2015-Dietary-Guidelines-Advisory-Committee.pdf.
- 55. USDA, Center for Nutrition Policy and Promotion. A series of systematic reviews on the relationship between dietary patterns and health outcomes [Internet]. Alexandria (VA): USDA, Center for Nutrition Policy and Promotion, Evidence Analysis Library Division; 2014. Available from: https://www.cnpp.usda.gov/sites/default/files/usda_nutrition_evidence_flbrary/DietaryPatternsExecutiveSummary.pdf.