Table 4.
Metabolites selected for predicting breast cancer and colorectal cancer in pooled analysis 1.
Metabolites Selected | Proportion of Explained Variation 2 | Direction of Coefficient for Metabolites 3 |
---|---|---|
Breast Cancer | All covariates + metabolites: 0.27 | |
Serum | ||
LC-MS | ||
Cysteinyl-glycine | 0.22 | − |
Ethanolamine | 0.21 | + |
Sucrose | 0.22 | − |
Lipidyzer 4 | ||
Free fatty acid (FFA 20:2) | 0.22 | + |
Phosphatidylcholine (PC 16:0/18:2) | 0.23 | + |
Triacylglyceride (TAG 48:4/18:2) | 0.22 | − |
Triacylglyceride (TAG 50:5/18:3) | 0.22 | − |
Triacylglyceride (TAG 52:2/18:2) | 0.22 | + |
Urine | ||
NMR | ||
Uracil | 0.22 | − |
2-Hydroxyisobutyrate | 0.22 | + |
GC-MS | ||
Unknown 73.0 14.49 5 | 0.21 | + |
Unknown 73.0 12.10 5 | 0.22 | − |
Colorectal Cancer | All covariates + metabolites: 0.33 | |
Serum | ||
LC-MS | ||
Adenosine | 0.20 | − |
Leucic acid | 0.18 | + |
Glycerate | 0.23 | + |
Hypoxanthine | 0.18 | + |
Myoinositol | 0.19 | + |
N-Acetylneuraminate | 0.19 | − |
2-Hydroxyglutarate | 0.19 | + |
7-Methylguanine | 0.19 | + |
Lipidyzer | ||
CE (FA20) | 0.17 | − |
TAG (53:2/18:1) | 0.18 | + |
Urine | ||
NMR | ||
Histidine | 0.18 | − |
Taurine | 0.19 | + |
Threonine | 0.17 | + |
GC-MS | ||
Unknown 103 17.03 5 | 0.18 | − |
Unknown 285 22.41 5 | 0.18 | + |
Unknown 57 9.58 5 | 0.18 | + |
Unknown 73 10.76 5 | 0.18 | − |
Unknown 73 17.27 5 | 0.17 | + |
1 All variables listed in the table were selected using the lasso, with all four platforms pooled together prior to variable selection. The base covariates (forced into all models) are age, WHI enrollment date, and self-reported race or ethnicity. Selected covariates for breast cancer: age, self-reported race/ethnicity, income, Gail 5-year risk score, waist circumference, sample draw visit, and randomization to the CaD control arm. Selected covariates for colorectal cancer: age, self-reported race/ethnicity, income, education, waist circumference, and sample draw visit.
2 The proportion of explained variation (PEV) was estimated by first creating a dataset with only the selected metabolites and covariates for each outcome. We then used cross-validation to fit a logistic regression on each set of training data and predict on the corresponding test data; the PEV is defined as the correlation between the observed outcomes and the predictions.
3 A positive direction of the estimated coefficient from the multiple logistic regression model implies higher odds of being a case; a negative direction implies lower odds of being a case.
4 FFA: free fatty acid; FA: fatty acid; TAG: triacylglyceride; PC: phosphatidylcholine; CE: cholesterol ester.
5 Values represent the mass and retention time of the unknown metabolites, e.g., 73 12.10 indicates a mass of 73 at 12.10 min.

In post hoc analyses excluding women using HT, prediction performance for CRC was modestly improved in the subpopulation compared to the full population (CV-AUCs ranged from 0.622–0.637 in the subpopulation and from 0.589–0.608 in the full population). Prediction performance for BC was slightly decreased in the subpopulation compared to the full population (CV-AUCs ranged from 0.535–0.554 in the subpopulation and from 0.559–0.563 in the full population).
Several LC-MS metabolites selected in the subgroup analysis were also selected in the whole-cohort analysis: cysteinyl glycine, N-isovaleryl glycine, and valine (BC); and adenosine, leucic acid, glycerate, hydroxyproline, and 2-hydroxyglutarate (CRC). Additional metabolites selected in the subgroup analysis for CRC included adipic and 3-hydroxybutyric acids, which are involved in fatty acid metabolism; betaine, a marker of whole-grain intake; glucuronate, found in gums and fermented beverages; and trigonelline, found in coffee.
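As an illustration of the cross-validated PEV described in footnote 2 of Table 4, the following is a minimal sketch in plain Python, not the analysis code used in the study. It rests on several assumptions that the text does not specify: five folds, out-of-fold predictions pooled across folds before computing a single Pearson correlation, and an unpenalized logistic regression fitted by simple gradient descent. The function names (`fit_logistic`, `predict_one`, `pearson`, `cross_validated_pev`) and the synthetic data are hypothetical and exist only for this example.

```python
import math
import random


def predict_one(w, x):
    """Predicted probability of being a case; w = [intercept, coef_1, ...]."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent (illustrative only)."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = predict_one(w, xi) - yi
            grad[0] += err
            for j in range(p):
                grad[j + 1] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def cross_validated_pev(X, y, k=5, seed=0):
    """Correlation between observed outcomes and out-of-fold predictions.

    Assumption: predictions from all k test folds are pooled before the
    correlation is computed; the source does not say how folds are combined.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds = [0.0] * len(X)
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        w = fit_logistic([X[i] for i in train], [y[i] for i in train])
        for i in test:
            preds[i] = predict_one(w, X[i])
    return pearson(y, preds)


# Synthetic stand-in for "selected metabolites and covariates" (made-up data).
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [1 if rng.random() < 1 / (1 + math.exp(-(1.5 * x[0] - 1.0 * x[1]))) else 0
     for x in X]
pev = cross_validated_pev(X, y)
print(f"cross-validated PEV: {pev:.2f}")
```

Because every prediction comes from a model that never saw the corresponding observation, the resulting correlation is less optimistic than one computed on the data used for fitting. In practice an established statistics package would replace the hand-written fitting routine.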