Abstract
Birth weight is a key consequence of environmental exposures and metabolic alterations and can influence lifelong health. While a number of methods have been used to examine associations of trace element (including essential nutrients and toxic metals) or metabolite concentrations with birth weight, studies evaluating how the coexistence of these factors impacts birth weight are extremely limited. Here, we present a novel algorithm, NET-C (prediction of outcome using NETwork Clusters), that improves outcome prediction by considering the interactions of features in a network, and we apply this method to predict birth weight by jointly modelling trace element and cord blood metabolite data. Specifically, using trace element and/or metabolite subnetworks as groups, we apply group lasso to estimate birth weight. We conducted statistical simulation studies to examine how both sample size and the correlations between grouped features and the outcome affect prediction performance. In terms of prediction error, our proposed method outperformed alternatives such as (a) group lasso with groups defined by hierarchical clustering, (b) random forest regression and (c) neural networks. We applied our method to data on trace elements, metabolites and birth outcomes ascertained as part of the New Hampshire Birth Cohort Study, adjusting for covariates such as maternal body mass index (BMI) and enrollment age. Our proposed method can be applied to a variety of similarly structured high-dimensional datasets to predict health outcomes.
Keywords: Outcome prediction, Gaussian graphical model, lasso, dimensionality reduction, trace element exposures, metabolic network
1. Introduction
Detecting associations between inter-related high-dimensional exposures and intermediates and health outcomes is an emerging challenge in biomedical research. As the complexity of interactions among high-dimensional factors increases, predicting the response variable becomes difficult: the network structure cannot be easily measured, and the effects of high-dimensional data elements cannot be precisely estimated using traditional multiple linear regression techniques. In addition to prediction models using regression methods [1], approaches involving subgroups of highly correlated variables [2];[3] with connectivity-based clustering, hierarchical clustering [4];[5], and random forest regression using tree bagging [6];[7] have been developed in the literature. Hierarchical clustering fails to prioritize more important clusters over less significant groups of factors or disconnected factors because of its heuristic properties [8];[9]. Random forest regression requires more time to construct decision trees of high complexity [10] and does not account for highly correlated variables. While network-based feature selection has been used to predict cancer prognosis [11], its application is limited to the classification of categorical outcomes rather than the prediction of a continuous outcome. To help overcome these limitations, we developed a statistical method to predict health outcomes using well-defined clusters that capture the complex relations between high-dimensional biological factors such as environmental exposures and metabolites.
Specifically, we present a method named NET-C (prediction of outcome using NETwork Clusters) designed to improve regression-based outcome prediction by utilizing network analysis. We hypothesize that analyzing dense subnetworks of biological factors relevant to a particular health outcome can strengthen prediction models. The method groups interacting biological factors using lasso and modularity maximization, and then models their association with the health outcome using regularized regression for grouped predictors. Model-based clustering implemented by lasso detects and prioritizes subgroups of strongly connected factors through shrinkage and selection. These grouped factors are used in regression models with covariates [12] to predict a health outcome. We tested the proposed method in simulation studies under various parameter settings (e.g., sample size and the association of features with the outcome) and compared its prediction performance with group lasso using groups determined by hierarchical clustering, random forest regression, and neural networks. We chose the example of predicting birth weight as an outcome influenced by maternal exogenous and endogenous factors [13]. Specifically, we applied our method to data on placental trace element concentrations, cord-blood metabolite concentrations, birth weight and other covariates collected as part of the New Hampshire Birth Cohort Study (NHBCS).
Given the intricate biological pathways from environmental exposures such as trace elements and from cord-blood metabolites to human health, we expect that our proposed method can be applied to a range of health outcomes. Applying it to birth weight illustrates how high-dimensional exposures and metabolites can affect a specific outcome.
2. Methods
2.1. Overall Procedure
NET-C can be summarized in the following three main steps:
Step 1: Apply lasso to all features to generate a network.
Step 2: Identify densely connected subnetworks in the estimated network and classify them as distinct groups using a modularity maximization algorithm such as Louvain.
Step 3: Fit a group lasso model for the outcome using the feature groups defined in Step 2.
We also compared the predictive performance of NET-C with alternative outcome prediction methods, including random forest regression, neural networks, and group lasso with groups defined by hierarchical clustering.
2.2. Subnetwork generation and group assignment: Lasso and Louvain
We began by identifying the network of interacting features using lasso. Let the N × p data matrix X represent N observations of a p-dimensional random vector X = (X1, X2, …, Xp), conventionally called features, and let x1, x2, …, xN denote the rows of X. We assume that X ~ Np(0, Σ), with Σ a positive definite covariance matrix, and let V = {1, 2, …, p} be the index set of the nodes corresponding to the p features. We further define the precision matrix of X as Ω = Σ−1, which characterizes the conditional dependence between the features. The network G = (V, E) among the features can then be defined based on Ω: if ωij ≠ 0, we place an edge between node i and node j; otherwise, there is no edge between them. A penalized linear regression [14] using each feature as the outcome and the rest as predictors is used to determine whether there is an edge between two given features. Since normality is assumed for X, the penalized log-likelihood is defined as follows,
\[
\ell_\lambda(\beta_j) \;=\; \frac{1}{2N}\sum_{i=1}^{N}\Big(x_{ij} - \sum_{k \neq j}\beta_{jk}\,x_{ik}\Big)^{2} \;+\; \lambda \sum_{k \neq j}\lvert \beta_{jk} \rvert \quad (1)
\]
with βj = (βj1, …, βj(j−1), 0, βj(j+1), …, βjp). Here we use an L1 penalty on βj to obtain a sparse network estimate. Such an L1 penalty removes zero or very weak conditional correlations among the features so that we can focus on the significant relationships among X1, …, Xp. After applying lasso to detect the features significantly associated with each column of the data, a p × p symmetric matrix representing the partial correlations of the p features is formed from βj, j = 1, …, p. This matrix corresponds to a network of the significant interactions among the p features. The tuning parameter λ in (1) is obtained by minimizing Akaike’s Information Criterion (AIC) [15], which is defined as
\[
\mathrm{AIC}(G) \;=\; -2\,\ell_N\big(\hat{\Omega}_G\big) + 2k \quad (2)
\]
where lN(Ω̂G) represents the log-likelihood function corresponding to a given network G and Ω̂G is the maximum likelihood estimate of Ω given the network G. Because we have assumed a normal distribution for the features, lN(·) has the following expression,
\[
\ell_N(\Omega) \;=\; \frac{N}{2}\Big[\log\det(\Omega) - \mathrm{tr}(S\,\Omega)\Big] \quad (3)
\]
where S is the empirical covariance matrix.
The negative Gaussian log-likelihood in formula (2) measures how well the model fits and indicates the difference in the negative Gaussian log-likelihood between the current model and a null model [16], and k is the number of estimated parameters that are non-zero. We choose the λ that minimizes AIC as the value of the tuning parameter. Given this choice of λ, we then minimize objective function (1) with respect to βj, j = 1, …, p. After this step, the set of estimated non-zero parameters, which indicates the interactions among features, is obtained.
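For illustration, the per-feature lasso fits and the AIC-guided choice of λ might look as follows in R with the glmnet package. This is a sketch under our own simplifications: the AIC used here is the usual Gaussian approximation for each individual regression (n·log(RSS/n) + 2·df) rather than the joint criterion (2), and the function name `estimate_network` is ours, not from the original implementation.

```r
# Sketch: estimate a sparse feature network by regressing each feature on the
# rest with the lasso and keeping the AIC-selected non-zero coefficients.
library(glmnet)

estimate_network <- function(X) {
  p <- ncol(X)
  A <- matrix(0, p, p, dimnames = list(colnames(X), colnames(X)))
  for (j in seq_len(p)) {
    fit <- glmnet(X[, -j], X[, j], family = "gaussian", alpha = 1)
    # Gaussian AIC approximation over the whole lambda path
    rss <- colSums((X[, j] - predict(fit, X[, -j]))^2)
    aic <- nrow(X) * log(rss / nrow(X)) + 2 * fit$df
    # coefficients at the AIC-minimizing lambda (drop the intercept)
    beta <- as.matrix(coef(fit, s = fit$lambda[which.min(aic)]))[-1, 1]
    A[j, -j] <- beta
  }
  (A + t(A)) / 2   # symmetrize; non-zero entries define the edges
}
```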
To determine the final groups of features forming dense subnetworks, a community detection algorithm, Louvain, based on maximizing the modularity score [17], is applied to the feature network estimated by lasso to detect and classify highly dense subnetworks of features.
Given the network of nodes, the node interactions estimated by lasso and the estimated weights on the edges between nodes, each node is initially assigned to its own community. Each node is then removed from its community and placed into another community so as to maximize the gain in modularity. For a network with edge weights given by correlations between variables, the modularity M is computed as follows [18]:
\[
M \;=\; \frac{1}{2E}\sum_{i,j}\Big(W_{ij} - \frac{d_i d_j}{2E}\Big)\, f(g_i, g_j) \quad (4)
\]
where gi is the community to which vertex i belongs, the function f(x, y) equals 1 if x = y and 0 otherwise, di is the degree of vertex i, E is the number of edges in the network, and Wij is the partial correlation on the edge between i and j. The gain in modularity obtained by moving a node i into a different community C increases when the sum of edge weights between node i and the members of C increases. This process continues until no further positive gain in modularity is possible, after which each node remains in its final community. Each node belongs to exactly one community throughout the process. Each finalized community is considered a subnetwork of node features [17]. After applying the Louvain algorithm to the estimated feature interactions, the final dense subnetworks of interacting features are identified.
Algorithm 1:
Generating Clusters with Lasso and Louvain
Input: An N × p input data matrix X.
Output: Community (group) index for every feature variable in the data matrix.
1. For each feature j, fit the lasso regression (1) of Xj on the remaining features, with λ selected by AIC (2).
2. Form the symmetric p × p matrix of estimated partial correlations from the non-zero coefficients; the non-zero entries define the edges of the feature network.
3. Apply the Louvain algorithm to this weighted network and return the community index of each feature.
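The community-detection step can then be run with the Louvain implementation in the igraph package, using the absolute values of the symmetrized coefficients from the previous sketch as edge weights. The function name `network_groups` is illustrative, not from the original implementation.

```r
# Sketch: turn the estimated weighted network into feature groups with Louvain.
library(igraph)

network_groups <- function(A) {
  g <- graph_from_adjacency_matrix(abs(A), mode = "undirected",
                                   weighted = TRUE, diag = FALSE)
  comm <- cluster_louvain(g)                       # modularity maximization, cf. (4)
  message("modularity: ", round(modularity(comm), 3))
  membership(comm)                                 # community (group) index per feature
}

# Example: groups <- network_groups(estimate_network(X))
```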
2.3. Alternative Methods
We considered four alternative methods: group lasso with all distinct groups (each feature forming its own group), group lasso with groups determined by hierarchical clustering, random forest regression, and a single-hidden-layer neural network. The simplest way to group features is to place each feature in its own distinct group. Hierarchical clustering organizes objects into a dendrogram whose branches are the target clusters, based on dissimilarities computed over samples, with clusters obtained by branch cutting or pruning [19]. To generate the dendrogram, the distance between features was measured with the Euclidean distance [20]:
\[
d(x, y) \;=\; \sqrt{\sum_{i=1}^{f}(x_i - y_i)^2} \quad (5)
\]
where f is the number of attributes. We used the complete linkage rule for hierarchical clustering, which defines the distance between two clusters as the maximum distance between a pair of features in the two clusters. The number of clusters generated by hierarchical clustering was set equal to the number of groups detected by NET-C.
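A sketch of this grouping with base R functions, assuming the number of groups k found by NET-C is supplied:

```r
# Sketch: group features by complete-linkage hierarchical clustering on
# Euclidean distances (equation (5)), cut into k groups (k = number of NET-C groups).
hc_groups <- function(X, k) {
  d  <- dist(t(X), method = "euclidean")   # distances between features (columns of X)
  hc <- hclust(d, method = "complete")     # complete linkage
  cutree(hc, k = k)                        # group index for each feature
}
```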
Random forest regression computes the importance of each feature and predicts a continuous outcome, with the training dataset used to fit the models and the testing dataset used to evaluate them [10];[21]. In the random forest regression algorithm, with a fixed number of trees, each regression sub-tree is grown on a bootstrap sample of the training data. At each node of a tree, the split that minimizes the residual sum of squares along the left and right branches is selected [22]. In our statistical simulations, the number of trees was 500, the number of features randomly sampled as candidates at each split was 40, and the minimum size of terminal nodes was 5. Variable importance is assigned to each feature by measuring how much the prediction error increases when that feature is permuted.
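With the settings stated above, a random forest fit could be obtained with the randomForest package as sketched below; this is one plausible implementation rather than the authors' exact code.

```r
# Sketch: random forest regression with the simulation settings described above.
library(randomForest)

rf_fit <- function(X_train, y_train, X_test) {
  rf <- randomForest(x = X_train, y = y_train,
                     ntree = 500,        # number of trees
                     mtry = 40,          # features sampled at each split
                     nodesize = 5,       # minimum size of terminal nodes
                     importance = TRUE)  # permutation-based variable importance
  list(pred = predict(rf, X_test), importance = importance(rf))
}
```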
Input feature variables and the target outcome for supervised training were used to fit a single-hidden-layer neural network with 120 hidden neurons. Input weights for each pair of input and hidden neurons and output weights for each pair of hidden and output neurons are estimated [23].
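One way to fit such a network is with the nnet package; with 120 hidden units and a few hundred inputs the default weight limit must be raised, and the standardization, decay and iteration settings below are our own illustrative choices rather than details from the paper.

```r
# Sketch: single-hidden-layer neural network (120 hidden neurons) for a
# continuous outcome; linout = TRUE gives a linear output unit for regression.
library(nnet)

nn_fit <- function(X_train, y_train, X_test) {
  Xs  <- scale(X_train)                                  # standardize inputs
  fit <- nnet(Xs, y_train, size = 120, linout = TRUE,
              decay = 0.01, maxit = 500,
              MaxNWts = 100000)                          # allow the large weight count
  Xt  <- scale(X_test, center = attr(Xs, "scaled:center"),
               scale = attr(Xs, "scaled:scale"))
  predict(fit, Xt)
}
```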
2.4. Parameter estimation by group lasso
After the features are grouped using lasso and modularity maximization, or by one of the alternative grouping methods, we use group lasso to associate the outcome with all features. A group index is assigned to each feature according to the group it belongs to. The group lasso objective is as follows:
\[
\min_{\beta_0,\,\beta}\; \frac{1}{2}\Big\| y - \beta_0\mathbf{1} - \sum_{g=1}^{G} X^{(g)}\beta^{(g)} \Big\|_2^2 \;+\; \lambda \sum_{g=1}^{G} \sqrt{p_g}\,\big\|\beta^{(g)}\big\|_2 \quad (6)
\]
where X(g) denotes the columns of X belonging to group g, β(g) the corresponding coefficient vector, G the number of groups, and pg the size of group g. A lasso-type penalty is applied to the Euclidean (L2) norm of the coefficients in each group, leading to group-level variable selection: if a group is selected, all elements in the group have non-zero coefficient estimates; otherwise, the coefficients of all elements in the group shrink to zero. Our proposed method, NET-C, groups features with lasso and modularity maximization and predicts an outcome with these grouped features.
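As an illustration, the group lasso in (6) can be fit with the grpreg package, which implements the group descent algorithm of [12]; the cross-validated choice of λ and the helper name `fit_group_lasso` are our assumptions, not details from the paper.

```r
# Sketch: fit the group lasso (6) with feature groups from NET-C (or another
# grouping) and predict the outcome for new observations.
library(grpreg)

fit_group_lasso <- function(X_train, y_train, groups, X_test) {
  ord   <- order(groups)                               # order columns by group
  cvfit <- cv.grpreg(X_train[, ord], y_train,
                     group = groups[ord],
                     penalty = "grLasso")              # cross-validated group lasso
  list(coef = coef(cvfit),                             # group-sparse coefficients
       pred = predict(cvfit, X_test[, ord]))           # predictions at lambda.min
}
```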
2.5. Datasets and additional pre-processing and validation steps for the real data application
We applied NET-C to training and testing datasets from a large human subjects study, the NHBCS, to predict birth weight using placental trace elements and cord blood metabolites. The NHBCS is a longitudinal pregnancy cohort of mother–infant pairs that has been described previously [24]. Briefly, pregnant women seeking prenatal care at a participating study clinic were eligible for enrollment if they were 18 to 45 years old, used a private, unregulated well for drinking water, and were not planning to move prior to delivery. All study participants provided written informed consent prior to participation according to the guidelines of the Committee for the Protection of Human Subjects at Dartmouth.
Placental biopsies were collected from infants born to NHBCS-enrolled mothers at the time of delivery. Following removal of any decidua tissue and avoiding calcium deposits and connective tissue, biopsies 1 cm deep and 1–2 cm wide were taken at the base of the cord insertion on the fetal side of the placenta to minimize heterogeneity. Placental biopsies were stored in trace element-free tubes, labeled with a sample barcode ID, and kept at −80°C until analysis. Cord blood was collected at the time of delivery by clinical staff and processed and stored at −80°C until analysis.
Mass spectrometry (MS) was used to quantify the metallome of nutrient and toxic elements in placental biopsies, as well as the metabolome in cord blood plasma [25];[26]. A total of 220 cord blood metabolites and computed metabolite ratios were measured using the Biocrates AbsoluteIDQ™ p180 kit (Biocrates Life Sciences AG, Innsbruck, Austria). In placental biopsies, 24 trace elements were quantified by inductively coupled plasma mass spectrometry (7700x, Agilent, Santa Clara, CA), following the quality control procedures described in EPA 6020a [27]. The NHBCS dataset also included relevant covariate information such as maternal enrollment age and body mass index (BMI). Covariate selection followed the approach described previously [27]. When generating the dense subnetworks of grouped trace elements or metabolites, natural log (base e) transformation was also considered to improve model fit and normalize residuals.
3. Simulation Study
3.1. Statistical Simulation
Consider the regression equation Y (outcome) = β0 + Xβ + ε, in which the N × p matrix X contains N observations of p features; ε = (ε1, …, εN) is a noise vector with independent components drawn from the uniform distribution U[−20, 20]; and β is the coefficient vector for the p features. The features were generated as continuous, normally distributed variables. The sample sizes N considered were 150, 180, 210 and 240, and the number of features was fixed at 120. We assumed 9 clusters of highly connected features: 3 large clusters of size 9, 3 medium clusters of size 6 and 3 small clusters of size 3, so that 54 features were involved in clusters. The correlations between the 120 features are encoded in a correlation matrix C. The three large clusters correspond to the diagonal blocks C[i, j] with i, j ∈ {1, …, 9}, {10, …, 18} and {19, …, 27}; the three medium clusters to the blocks with i, j ∈ {28, …, 33}, {34, …, 39} and {40, …, 45}; and the three small clusters to the blocks with i, j ∈ {46, …, 48}, {49, …, 51} and {52, …, 54} (Figure 1). The strength of feature interactions within each block was randomly selected between 0.1 and 0.7 or between −0.7 and −0.1, while the strength of interactions outside the blocks was selected between 0.01 and 0.012 or between −0.012 and −0.01. The elements of C below the diagonal were then set equal to the corresponding elements above the diagonal, and the diagonal entries were set to 1 so that C can be used as a precision matrix. We used the mvrnorm function in the R package “MASS” to generate an N × p continuous matrix X based on C.
Figure 1:
Matrix representation of the three large, three medium and three small clusters. The strength of feature interactions within each diagonal block C[i, j] is randomly chosen between 0.1 and 0.7 or between −0.7 and −0.1, and the strength of interactions outside the blocks between 0.01 and 0.012 or between −0.012 and −0.01. The blocks with i, j ∈ {1, …, 9}, {10, …, 18} and {19, …, 27} represent the three large clusters; the blocks with i, j ∈ {28, …, 33}, {34, …, 39} and {40, …, 45} represent the three medium clusters; and the blocks with i, j ∈ {46, …, 48}, {49, …, 51} and {52, …, 54} represent the three small clusters.
Since we have 120 features, the coefficient vector β contains positive or negative values. The first 54 entries of β form βc, the coefficients of the features belonging to one of the 9 defined clusters, and the remaining 66 entries form βn, the coefficients of the features not belonging to any cluster. We define three ranges of coefficients, L, M, and S, whose values are randomly drawn from uniform distributions on [10, 20], [3, 9], and [0, 2], respectively. Using L, M, and S, βc is defined as follows.
Case 1: βc = a vector of 54 values drawn from S
Case 2: βc = a vector of 54 values drawn from M
Case 3: βc = a vector of 18 values drawn from each of L, M, and S
Case 4: βc = a vector of 27 values drawn from L for the clusters of size 9, and 9 values drawn from each of L, M, and S for the clusters of size 6 and 3
Case 5: βc = a vector of 45 values drawn from L for the clusters of size 9 and 6, and 3 values drawn from each of L, M, and S for the clusters of size 3
βn is a vector of 66 values randomly chosen between 0 and 20. Positive and negative signs are randomly assigned to the entries of βc and βn, and β = [βc, βn]. In the regression formula we set β0 to 20, so the outcome is generated as Y = 20 + Xβ + ε with ε drawn from U(−20, 20).
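The simulated design described above can be reproduced roughly as follows; the block positions, value ranges and outcome model follow the text, while passing C directly as the covariance argument of mvrnorm and the positive-definiteness ridge are our simplifying assumptions (only the Case 1 coefficients are shown).

```r
# Sketch of the simulation design: 120 features, 9 diagonal blocks (3 of size 9,
# 3 of size 6, 3 of size 3) with strong within-block dependence and near-zero
# dependence elsewhere. The diagonal ridge is a practical safeguard to keep C
# positive definite and is not part of the original description.
library(MASS)

simulate_data <- function(N, p = 120) {
  sizes  <- c(rep(9, 3), rep(6, 3), rep(3, 3))
  ends   <- cumsum(sizes); starts <- ends - sizes + 1
  rcorr  <- function(n) sample(c(-1, 1), n, TRUE) * runif(n, 0.1, 0.7)
  rweak  <- function(n) sample(c(-1, 1), n, TRUE) * runif(n, 0.01, 0.012)

  C <- matrix(rweak(p * p), p, p)
  for (b in seq_along(sizes)) {
    idx <- starts[b]:ends[b]
    C[idx, idx] <- rcorr(sizes[b]^2)
  }
  C[lower.tri(C)] <- t(C)[lower.tri(C)]          # symmetrize using the upper triangle
  diag(C) <- 1
  while (min(eigen(C, symmetric = TRUE, only.values = TRUE)$values) <= 0)
    diag(C) <- diag(C) + 0.1                     # ridge until positive definite

  X <- mvrnorm(N, mu = rep(0, p), Sigma = C)     # C used directly as Sigma here

  # Coefficients: Case 1 shown (clustered features from range S); the remaining
  # 66 features get values in [0, 20], with random signs for both parts.
  beta_c <- sample(c(-1, 1), 54, TRUE) * runif(54, 0, 2)
  beta_n <- sample(c(-1, 1), p - 54, TRUE) * runif(p - 54, 0, 20)
  beta   <- c(beta_c, beta_n)
  y <- 20 + X %*% beta + runif(N, -20, 20)
  list(X = X, y = as.numeric(y))
}
```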
3.2. Validation – definition of prediction error
To evaluate how our proposed model performs on new data, we compared the prediction error of five methods: NET-C, group lasso with groups defined by hierarchical clustering, group lasso with all distinct groups, neural networks, and random forest regression. In each generated dataset, samples were randomly reordered with replacement; 70% of the reordered samples were allotted to the training set and 30% to the testing set. In the real data application, 267 samples were assigned to the training set and 114 samples to the testing set. The overall prediction error was computed by averaging the prediction error over the distinct training and testing datasets as follows:
\[
\mathrm{PE} \;=\; \frac{1}{N}\sum_{d=1}^{N}\;\frac{1}{n_d}\sum_{i=1}^{n_d}\big(y_i^{(d)} - \hat{y}_i^{(d)}\big)^2 \quad (7)
\]
where N is the number of distinct datasets, nd is the size of the testing set of dataset d, y(d)i are the observed outcome values in that testing set, and ŷ(d)i are the corresponding predicted values. This estimate of prediction error can be viewed as a “hold-out” validation. Distinct training and testing datasets were generated 100 times for the simulations and 1000 times for the real data application. In total, 20 simulation settings were generated, combining four sample sizes with five cases of feature coefficients representing the effect of the features on the outcome.
In our real data application, to estimate the positive or negative coefficients of trace elements, metabolites, or covariates affecting birth weight, we used a 95% percentile bootstrap confidence interval. Over the 1000 datasets, we obtained 1000 coefficient values for each trace element, metabolite, and covariate. For each variable, after sorting the 1000 coefficients in increasing order, the lower bound of the confidence interval was taken as the 25th smallest coefficient and the upper bound as the 975th smallest coefficient.
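A sketch of the repeated hold-out evaluation (7) and the percentile bootstrap intervals; `fit_and_predict` is a placeholder for any of the methods above and is assumed to return a list with elements `pred` and `coef`.

```r
# Sketch: repeated 70/30 hold-out evaluation (equation (7)) and 95% percentile
# bootstrap intervals for the fitted coefficients.
evaluate <- function(X, y, fit_and_predict, n_rep = 100) {
  errs  <- numeric(n_rep)
  coefs <- vector("list", n_rep)
  for (r in seq_len(n_rep)) {
    idx   <- sample(nrow(X), replace = TRUE)          # reorder samples with replacement
    train <- idx[1:floor(0.7 * length(idx))]
    test  <- idx[(floor(0.7 * length(idx)) + 1):length(idx)]
    out   <- fit_and_predict(X[train, ], y[train], X[test, ])
    errs[r]    <- mean((y[test] - out$pred)^2)        # test-set MSE for this split
    coefs[[r]] <- out$coef
  }
  ci <- apply(do.call(cbind, coefs), 1,
              quantile, probs = c(0.025, 0.975))      # percentile bootstrap 95% CI
  list(mean_error = mean(errs), coef_ci = t(ci))
}
```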
4. Results
4.1. Comparisons of clustering approaches over different parameters for simulated data
In the 20 settings defined by different pairs of sample size and feature coefficients, NET-C consistently outperformed the prediction method with all distinct groups by at least 6%, as shown in Table 1. Increasing the sample size decreased the prediction error overall: for all five cases, as the sample size increased from 150 to 240, the overall error in estimating the outcome decreased for all prediction methods. The error difference between the prediction method with all distinct groups and our proposed method was smaller at sample size 240 than at sample size 150; this trend held in all cases except Case 5.
Table 1:
Prediction errors (mean squared error) for sample sizes of 150, 180, 210 and 240 with 120 features and 5 cases of diverse correlations between features and outcome. The values in brackets are 95% confidence intervals and those in parentheses indicate the error difference between each alternative and the proposed method: error difference = ((error_alternative − error_proposed)/error_proposed) × 100%.
Sample Size | Method | Case 1 | Case 2 | Case 3 | Case 4 | Case 5
---|---|---|---|---|---|---
150 | All distinct groups | 18568.68 [5974.73, 41126.96] (25.39%) | 62833.2 [19056.68, 115757.3] (19.75%) | 66290.04 [17788.26, 126248.01] (16.31%) | 85915.94 [22199.13, 188909.02] (23.15%) | 109567.9 [29177.38, 220182.34] (13.98%)
150 | Random forest | 80527.18 [77439.97, 82922.59] (443.77%) | 146612.4 [142971.27, 150083.17] (179.42%) | 176946.9 [172482.82, 182477.75] (210.47%) | 320221.8 [307611.43, 331080.32] (359.01%) | 462729.4 [443418.79, 482350.36] (381.38%)
150 | Hierarchical clustering | 23115.31 [5924.6, 58431.84] (56.09%) | 67426.33 [23289.28, 141161.24] (28.51%) | 69058.95 [20304.65, 127986.74] (21.17%) | 83990.03 [26727.68, 156791.92] (20.39%) | 111273.5 [40909.36, 204437.1] (15.76%)
150 | Neural network | 82671.22 [65224.87, 98158.46] (458.25%) | 126666.3 [96673.71, 152119.01] (141.41%) | 132216.4 [88408.62, 164300.7] (131.99%) | 225135.3 [144572.34, 290927.45] (222.71%) | 212898.7 [153445.03, 290296.66] (121.48%)
150 | NET-C | 14809.08 [4253.95, 26546.37] (−) | 52469.51 [16635.74, 96948.04] (−) | 56993.2 [17144.97, 100261.95] (−) | 69763.29 [22408.49, 153802.14] (−) | 96125.39 [29092.24, 199140.78] (−)
180 | All distinct groups | 14400.58 [4263.61, 28411.74] (13.01%) | 44194.68 [15944.62, 90938.87] (26.14%) | 47357.77 [12020.73, 92841.86] (24.92%) | 56164.97 [20072.7, 116324.99] (21.58%) | 67699.63 [20912.07, 138292.09] (15.48%)
180 | Random forest | 89945.03 [86842.7, 92954.87] (605.87%) | 160179.1 [156310.96, 163649.34] (357.17%) | 240391.9 [233776.42, 246768.04] (534.10%) | 656608.1 [638741.55, 673103.46] (1321.32%) | 808929.4 [783126.61, 831468.29] (1279.88%)
180 | Hierarchical clustering | 17565.72 [7402.82, 40751.04] (37.85%) | 48398.8 [16635.2, 87704.28] (38.13%) | 50935.02 [12385.62, 97952.97] (34.36%) | 61213.54 [20435.06, 119424.77] (32.51%) | 75531.55 [22399.73, 144443.37] (28.84%)
180 | Neural network | 70776.80 [57653.27, 80329.09] (455.44%) | 97271.28 [78847.95, 111702.15] (177.62%) | 133749.2 [113283.54, 153510.06] (252.80%) | 407915.9 [341324.53, 456098.73] (782.99%) | 426748.5 [358249.97, 500890.16] (627.95%)
180 | NET-C | 12742.49 [3649.06, 28832.63] (−) | 35037.4 [9660.61, 68941.24] (−) | 37910.71 [8131.88, 79338.32] (−) | 46196.19 [14446.62, 105897.32] (−) | 58623.18 [19284.89, 134004.98] (−)
210 | All distinct groups | 13576.77 [7159.25, 24482.09] (10.74%) | 30503.26 [11458.59, 58262.57] (24.93%) | 31200.36 [12715.6, 70858.52] (17.27%) | 35472.79 [10617.84, 76525.08] (30.65%) | 43164.89 [9488.41, 102281.12] (39.14%)
210 | Random forest | 107248 [104697.1, 109720.89] (774.78%) | 188861.7 [183232.04, 191897.38] (673.50%) | 265437.5 [257855.88, 270791.96] (897.65%) | 498784.9 [483101.93, 512250.22] (1737.13%) | 611100.4 [590475.07, 625845.84] (1869.85%)
210 | Hierarchical clustering | 15746.16 [8567.33, 25181.28] (28.44%) | 33881.52 [13252.88, 64691.01] (38.77%) | 35271.82 [13488.29, 65607.85] (32.57%) | 38694.58 [12699.72, 80869.13] (42.52%) | 46643.94 [12338.08, 107048.6] (50.35%)
210 | Neural network | 84586.79 [68602.39, 102087.15] (589.94%) | 97091.62 [79892.35, 112219.07] (297.65%) | 99468.98 [81846.43, 115955.34] (273.86%) | 163776.1 [125533.18, 204886.74] (503.22%) | 186608.2 [145036.57, 227575.46] (501.52%)
210 | NET-C | 12260.00 [6348.56, 20002.31] (−) | 24416.41 [11265.1, 48331.15] (−) | 26606.17 [9878.37, 48504.34] (−) | 27150.23 [7922.28, 57580.08] (−) | 31022.72 [8777.89, 65977.65] (−)
240 | All distinct groups | 14130.16 [7023.12, 22189.39] (6.09%) | 24787.49 [11499.41, 48052.27] (14.32%) | 23559.76 [11217.06, 40356.36] (8.31%) | 27884.84 [12605.56, 49574.59] (14.09%) | 30942.51 [12917.27, 51819.06] (18.92%)
240 | Random forest | 131145.7 [127364.01, 135141.95] (884.66%) | 212061.5 [207104.7, 216528.84] (878.07%) | 300151.2 [292243.98, 308762.77] (1279.86%) | 610733.3 [588138.18, 626159.55] (2398.87%) | 829610 [803871.9, 858056.09] (3088.44%)
240 | Hierarchical clustering | 16069.76 [7600.94, 26281.03] (20.65%) | 27315.92 [11796.44, 48638.82] (25.99%) | 26142.85 [11500.4, 47778.4] (20.18%) | 30165.19 [15083.98, 52850.44] (23.42%) | 33547.22 [13671.76, 58795.88] (28.93%)
240 | Neural network | 83712.84 [71489.45, 96757.48] (528.53%) | 87518.12 [69688.57, 104204.54] (303.65%) | 103910.4 [84451.76, 127396.82] (377.70%) | 170889.4 [118336.48, 212808.13] (599.21%) | 238134.7 [173360.46, 320509.51] (815.22%)
240 | NET-C | 13318.82 [6068.15, 22066.34] (−) | 21681.7 [9951.26, 42505.94] (−) | 21752.24 [11038.68, 36481.85] (−) | 24440.43 [12845.34, 39898.85] (−) | 26019.32 [12621.36, 46810.1] (−)
The next step was to assess how the coefficients of the features in the subnetworks affect the accuracy of predicting the outcome. To compare prediction error across methods, we examined the difference between the prediction error of NET-C and those of the other methods in the two extreme cases, Case 1 and Case 5. From Case 1 to Case 5 there was no clear increasing or decreasing trend in the error difference. However, our proposed method gave the lowest prediction error in all settings, and the prediction method with all distinct groups outperformed the one with groups defined by hierarchical clustering. Random forest regression and neural networks gave much higher prediction errors than the other methods. The accuracy of recovering the true clusters using hierarchical clustering and NET-C is shown in Appendix B-1.
4.2. Application to placental metallomics and infant cord plasma metabolomics with covariates in NHBCS
We applied the proposed method to predict birth weight using 220 cord blood metabolites, 24 trace elements, and maternal characteristics (enrollment age and BMI prior to pregnancy) from 381 participants in the NHBCS. The first step of NET-C, which groups highly correlated variables, was applied to the placental trace element and cord-blood metabolite data to find clusters of highly correlated trace elements and metabolites and to remove weak or insignificant interactions. Trace elements clustered into 5 groups. Group 1 included Mg, P, and Zn. Group 2 included K, Ca, and Pb. Group 3 included Al, Cr, Co, Ni, As, Mo, Ag, Cd, Sb, and Hg. Group 4 included Sr and Ba. Group 5 included Na, Si, Mn, Cu, Se, and Tl. Cord-blood metabolites clustered into 10 groups, as shown in Appendix B-2.
The next step of the proposed method used group lasso to predict birth weight with the clustered trace elements and cord-blood metabolites, along with maternal enrollment age and maternal BMI as covariates. In this real data application, NET-C and the other prediction methods, such as group lasso with all distinct groups and group lasso with groups defined by hierarchical clustering, gave similar prediction accuracy. A summary of our findings appears in Table 2, Table 3, Appendix B-2, and Figure 2. In Tables 2 and 3, all coefficients were estimated by group lasso, and 95% confidence intervals were calculated from 1000 replications of the training and testing procedure. Trace elements, metabolites, and covariates with consistently nonnegative coefficients are listed in Table 2, and those with consistently nonpositive coefficients in Table 3. The meaning of each metabolite symbol in Tables 2 and 3 is explained in Appendix B-3 [28]. Lysophosphatidylcholine metabolites were consistently positively associated with birth weight, and isoleucine (ile) and several other metabolites were possibly positively associated with birth weight. As shown in Figure 2, five lysophosphatidylcholine metabolites that were positively associated with birth weight also formed a closely interconnected subnetwork. Table 3 shows one trace element, Cu, and several metabolites, including c5, which was consistently negatively associated with birth weight and formed a subnetwork with c3; Cu and the other metabolites were possibly negatively associated with birth weight.
Table 2:
Covariates (e.g., maternal BMI) and metabolites nonnegatively correlated with birth weight. Metabolites with a lower confidence bound above zero are colored green.
Variable Name | Mean | Lower C.I. | Upper C.I. |
---|---|---|---|
Maternal BMI | 9.883713 | 0.000000 | 19.746845 |
Ile | 0.001994 | 0.000000 | 0.005080 |
lysopc_a_c16_1 | 0.034728 | 0.006077 | 0.066917 |
lysopc_a_c18_1 | 0.015001 | 0.004081 | 0.026512 |
lysopc_a_c18_2 | 0.010608 | 0.000857 | 0.022875 |
lysopc_a_c20_3 | 0.051269 | 0.012737 | 0.091663 |
lysopc_a_c20_4 | 0.010309 | 0.000260 | 0.022651 |
pc_aa_c38_3 | 0.001703 | 0.000000 | 0.005539 |
pc_aa_c40_5 | 0.013531 | 0.000000 | 0.033373 |
pc_ae_c40_1 | 0.195712 | 0.000000 | 0.516420 |
pc_ae_c40_4 | 0.066743 | 0.000000 | 0.157386 |
pc_ae_c42_3 | 0.313017 | 0.000000 | 0.800545 |
sm_oh_c22_1 | 0.015206 | 0.000000 | 0.035723 |
Table 3:
Trace elements or metabolites nonpositively correlated with birth weight. Metabolites with an upper confidence bound below zero are colored red.
Variable Name | Mean | Lower C.I. | Upper C.I. |
---|---|---|---|
Cu | −0.177953 | −0.394365 | 0.000000 |
c5 | −0.677693 | −1.408347 | −0.012290 |
c14_2 | −6.780522 | −16.859072 | 0.000000 |
pc_aa_c34_1 | −0.000687 | −0.001734 | 0.000000 |
pc_aa_c42_1 | −0.307263 | −0.823219 | 0.000000 |
pc_aa_c42_4 | −0.408022 | −1.064783 | 0.000000 |
pc_ae_c32_2 | −0.229497 | −0.673711 | 0.000000 |
sm_oh_c16_1 | −0.047273 | −0.118574 | 0.000000 |
sm_c22_3 | −0.528129 | −1.312262 | 0.000000 |
sm_c26_1 | −0.475565 | −1.117796 | 0.000000 |
Figure 2:
Metabolites in this subnetwork, including lysophosphatidylcholines, were nonnegatively associated with birth weight. The metabolite nodes colored green, such as the lysophosphatidylcholines, were positively associated with birth weight.
5. Discussion
We proposed a statistical method, NET-C, a regularized regression approach using grouped features determined by lasso and Louvain network community detection [29]. In simulations, NET-C outperformed other strategies such as group lasso with hierarchical clustering and random forest regression. Conditions such as sample size and the correlations of features with the outcome influenced prediction accuracy. Our network approach considers the partial correlations of all pairs of features and selects the significant interactions in the network. While many statistical methods can compute correlations between pairs of features, the network approach extracts and visualizes subnetworks of interacting features rather than single pairwise interactions. Thus, this dimensionality reduction step allows a more efficient computation and representation of interactions than other regression methods.
In the statistical simulations, we found that NET-C, the group lasso method with lasso-based grouping, outperformed group lasso predictions using other grouping strategies. Hierarchical clustering did not perform well because features within the defined clusters can be either positively or negatively correlated, so hierarchical clustering cannot always group them accurately. In the high-dimensional setting, random forest regression cannot efficiently estimate a continuous outcome because it struggles to select the most important features among a vast set of features. However, as sample size increases, the error difference between NET-C and the other methods decreases. A future strategy may be to use prior knowledge of the correlations between predictor features and the outcome to improve prediction: generating subnetworks of features known a priori to be strongly correlated with an outcome would likely yield better prediction accuracy.
In our real data application, NET-C slightly outperformed the prediction methods using group lasso with all distinct groups or with groups defined by hierarchical clustering. As shown in our statistical simulations, the larger the sample size, the smaller the difference between the prediction error of NET-C and those of the alternative methods. In this real data application, the sample size of 381 maternal–infant dyads is greater than the number of metabolites, 220, which implies that the advantage of NET-C over the alternative methods is reduced. Further, the trace elements and metabolites were mostly positively correlated, and hierarchical clustering tends to group positively correlated variables more accurately than a mix of positively and negatively correlated variables. Because of these properties of our data, the performance of group lasso with all distinct groups or with groups defined by hierarchical clustering improved. Thus, in the real data application, our proposed method did not greatly outperform the alternative methods in prediction accuracy, but it predicted birth weight at least as accurately while also identifying subnetworks of metabolites strongly associated with birth weight.
The associations found in our analysis were, for the most part, aligned with known biology [30];[31];[32];[33]. In a previous study, higher Cu concentrations in both maternal and cord blood were associated with fetal growth restriction, and maternal Cu concentration was related to reduced birth weight [34]. The concentrations of certain maternal amino acids and proteins, such as isoleucine, in the third trimester of pregnancy have also been positively correlated with fetal growth and neurodevelopment [35]. Cord-blood lysophosphatidylcholine metabolites 16:1 and 18:1 in particular were positively correlated with birth weight [36]. In contrast, concentrations of C3, C3DC, C5 and C5OH were higher in low birth weight infants than in normal birth weight infants [37]. While consistent with the available literature, these associations will need to be confirmed or refuted by additional studies.
In summary, we found that NET-C identified networks of features that predict birth weight and may predict the outcome more accurately than alternative clustering methods. Nonetheless, the generalizability of this method will need to be studied further to understand its ability to predict health outcomes under a broader set of features characteristic of real-world data.
Supplementary Material
Acknowledgment
The study is funded in part by the following grants: R01LM012012 and R01LM012723 from the National Library of Medicine, P20GM104416 from the National Institute of General Medical Sciences, P42ES007373 and P01ES022832 from the National Institute of Environmental Health Sciences, RD83544201 from the Environmental Protection Agency, and R25CA134286 from the National Cancer Institute.
Disclosures and Ethics
As a requirement of publication author(s) have provided to the publisher signed confirmation of compliance with legal and ethical obligations including but not limited to the following: authorship and contributorship, conflicts of interest, privacy and confidentiality and (where applicable) protection of human and animal research subjects. The authors have read and confirmed their agreement with the ICMJE authorship and conflict of interest criteria. The authors have also confirmed that this article is unique and not under consideration or published in any other publication, and that they have permission from rights holders to reproduce any copyrighted material. Any disclosures are made in this section.
References
- [1]. Ramsay Ian S., Ma Sisi, Fisher Melissa, Loewy Rachel L., Ragland J. Daniel, Niendam Tara, Carter Cameron S., Vinogradov Sophia. Model selection and prediction of outcomes in recent onset schizophrenia patients who undergo cognitive training. Schizophrenia Research: Cognition, Volume 11, March 2018, Pages 1–5. 10.1016/j.scog.2017.10.001
- [2]. Bremner I. (1988) Trace Elements in Man and Animals 6, 1988. ISBN: 978-1-4612-8050-7
- [3]. Piazza I, Kochanowski K, Cappelletti V, Fuhrer T, Noor E, Sauer U, Picotti P. A Map of Protein-Metabolite Interactions Reveals Principles of Chemical Communication. Cell. 2018 January 11;172(1–2):358–372.e23. doi: 10.1016/j.cell.2017.12.006
- [4]. Havens TC, Bezdek JC and Palaniswami M, “Scalable single linkage hierarchical clustering for big data,” 2013 IEEE Eighth International Conference on Intelligent Sensors, Sensor Networks and Information Processing, Melbourne, VIC, 2013, pp. 396–401. doi: 10.1109/ISSNIP.2013.6529823
- [5]. Miyagi A, Takahashi H, Takahara K et al. Principal component and hierarchical clustering analysis of metabolites in destructive weeds; polygonaceous plants. Metabolomics 6, 146–155 (2010). 10.1007/s11306-009-0186-y
- [6]. Shaikhina Torgyn, Lowe Dave, Daga Sunil, Briggs David, Higgins Robert, Khovanova Natasha. Decision tree and random forest models for outcome prediction in antibody incompatible kidney transplantation. Biomedical Signal Processing and Control, Volume 52, July 2019, Pages 456–462. 10.1016/j.bspc.2017.01.012
- [7]. Peterek T, Dohnalek P, Gajdos P and Smondrk M. Performance evaluation of random forest regression model in tracking Parkinson’s disease progress. In Hybrid Intelligent Systems (HIS), 2013 13th International Conference on, 83–87 (IEEE, 2013). 10.1109/HIS.2013.6920459
- [8]. Fan W, Wang C, Chen Y and Lai J, “HDenDist: Nonlinear Hierarchical Clustering Based on Density and Min-distance,” 2015 IEEE Fifth International Conference on Big Data and Cloud Computing, Dalian, 2015, pp. 45–50. 10.1109/BDCloud.2015.16
- [9]. Zhang Z, Murtagh F, Van Poucke S, Lin S, and Lan P (2017). Hierarchical cluster analysis in clinical research with heterogeneous study population: highlighting its visualization with R. Annals of Translational Medicine, 5(4), 75. doi: 10.21037/atm.2017.02.05
- [10]. Dimopoulos T, Tyralis H, Bakas N, and Hadjimitsis D (2018). Accuracy measurement of Random Forests and Linear Regression for mass appraisal models that estimate the prices of residential apartments in Nicosia, Cyprus. Adv. Geosci., 45, 377–382. 10.5194/adgeo-45-377-2018
- [11]. Shao B, Bjaanæs MM, Helland Å, Schütte C, Conrad T (2019). EMT network-based feature selection improves prognosis prediction in lung adenocarcinoma. PLOS ONE 14(1): e0204186. 10.1371/journal.pone.0204186
- [12]. Breheny P, Huang J. Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Stat Comput. 2015 March;25(2):173–187. doi: 10.1007/s11222-013-9424-2
- [13]. Robinson O, Keski-Rahkonen P, Chatzi L, Kogevinas M, Nawrot T, Pizzi C, Plusquin M, Richiardi L, Robinot N, Sunyer J, Vermeulen R, Vrijheid M, Vineis P, Scalbert A, Chadeau-Hyam M. Cord Blood Metabolic Signatures of Birth Weight: A Population-Based Study. J Proteome Res. 2018 March 2;17(3):1235–1247. doi: 10.1021/acs.jproteome.7b00846
- [14]. Tibshirani Robert. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological), vol. 58, no. 1, 1996, pp. 267–288. www.jstor.org/stable/2346178
- [15]. Göl CS, Bozkurt L, Tura A, Pacini G, Kautzky-Willer A, et al. (2015) Application of Penalized Regression Techniques in Modelling Insulin Sensitivity by Correlated Metabolic Parameters. PLOS ONE 10(11): e0141524. 10.1371/journal.pone.0141524
- [16]. Park Mee Young and Hastie Trevor (2008). Penalized logistic regression for detecting gene interactions. Biostatistics, 9(1), pp. 30–50. doi: 10.1093/biostatistics/kxm010
- [17]. Blondel Vincent D., Guillaume Jean-Loup, Lambiotte Renaud, Lefebvre Etienne. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, IOP Publishing, 2008, P10008, pp. 1–12. 10.1088/1742-5468/2008/10/P10008
- [18]. Newman MEJ. Analysis of Weighted Networks. Physical Review E 70, 056131 (2004). doi: 10.1103/PhysRevE.70.056131
- [19]. Langfelder Peter, Zhang Bin, Horvath Steve. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics, Volume 24, Issue 5, 1 March 2008, Pages 719–720. 10.1093/bioinformatics/btm563
- [20]. Zhang Z, Murtagh F, Van Poucke S, Lin S, Lan P. Hierarchical cluster analysis in clinical research with heterogeneous study population: highlighting its visualization with R. Ann Transl Med. 2017;5(4):75. doi: 10.21037/atm.2017.02.05
- [21]. MacQueen J. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, 281–297, University of California Press, Berkeley, Calif., 1967. https://projecteuclid.org/euclid.bsmsp/1200512992
- [22]. Ceh Marjan, Kilibarda Milan, Lisec Anka and Bajat Branislav. Estimating the Performance of Random Forest versus Multiple Regression for Predicting Prices of the Apartments. ISPRS Int. J. Geo-Inf. 2018, 7, 168. doi: 10.3390/ijgi7050168
- [23]. Lolli F, Gamberini R, Regattieri A, Balugani E, Gatos T, Gucci S. Single-hidden layer neural networks for forecasting intermittent demand. International Journal of Production Economics, Volume 183, Part A, 2017, Pages 116–128. ISSN 0925-5273. 10.1016/j.ijpe.2016.10.021
- [24]. Gilbert-Diamond Diane, Cottingham Kathryn L., Gruber Joann F., Punshon Tracy, Sayarath Vicki, Gandolfi A. Jay, Baker Emily R., Jackson Brian P., Folt Carol L., Karagas Margaret R. Proceedings of the National Academy of Sciences, December 2011, 108 (51) 20656–20660. DOI: 10.1073/pnas.1109127108
- [25]. Ren J, Zhang A, Kong L, and Wang X. Advances in mass spectrometry based metabolomics for investigation of metabolites. Royal Society of Chemistry Advances, 8 (2018), pp. 22335–22350. 10.1039/c8ra01574k
- [26]. Halliday Alex N., Lee Der-Chuen, Christensen John N., Walder Andrew J., Freedman Philip A., Jones Charles E., Hall Chris M., Yi Wen, Teagle Damon. Recent developments in inductively coupled plasma magnetic sector multiple collector mass spectrometry. International Journal of Mass Spectrometry and Ion Processes, Volumes 146–147 (1995), pp. 21–33. 10.1016/0168-1176(95)04200-5
- [27]. Punshon T, Li Z, Marsit CJ, Jackson BP, Baker ER, Karagas MR. Placental Metal Concentrations in Relation to Maternal and Infant Toenails in a U.S. Cohort. Environ Sci Technol. 2016 February 2;50(3):1587–94. doi: 10.1021/acs.est.5b05316
- [28]. Kim YJ, Lee HS, Kim YK, Park S, Kim JM, et al. (2016) Association of Metabolites with Obesity and Type 2 Diabetes Based on FTO Genotype. PLOS ONE 11(6): e0156612. 10.1371/journal.pone.0156612
- [29]. Blondel VD, Guillaume J-L, Lambiotte R, and Lefebvre E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, vol. 2008, no. 10, p. P10008, 2008. 10.1088/1742-5468/2008/10/P10008
- [30]. Bermúdez L, García-Vicent C, López J, Torró MI, and Lurbe E (2015). Assessment of ten trace elements in umbilical cord blood and maternal blood: association with birth weight. Journal of Translational Medicine, 13, 291. 10.1186/s12967-015-0654-2
- [31]. Hellmuth C, Uhl O, Standl M, Demmelmair H, Heinrich J, Koletzko B, and Thiering E (2017). Cord Blood Metabolome Is Highly Associated with Birth Weight, but Less Predictive for Later Weight Development. Obesity Facts, 10(2), 85–100. 10.1159/000453001
- [32]. Yazdani S, Yosofniyapasha Y, Nasab BH, Mojaveri MH, and Bouzari Z (2012). Effect of maternal body mass index on pregnancy outcome and newborn weight. BMC Research Notes, 5, 34. doi: 10.1186/1756-0500-5-34
- [33]. Dennis JA, and Mollborn S (2013). Young maternal age and low birth weight risk: An exploration of racial/ethnic disparities in the birth outcomes of mothers in the United States. The Social Science Journal, 50(4), 625–634. doi: 10.1016/j.soscij.2013.09.008
- [34]. Ozdemir U, Gulturk S, Aker A, Guvenal T, Imir G, Erselcan T. Correlation between birth weight, leptin, zinc and copper levels in maternal and cord blood. J Physiol Biochem. 2007 June;63(2):121–8. doi: 10.1007/BF03168223
- [35]. Moghissi Kamran S., Churchill John A., Kurrie Dorothy. Relationship of maternal amino acids and proteins to fetal growth and mental development. American Journal of Obstetrics and Gynecology, Volume 123, Issue 4, 1975, Pages 398–410. 10.1016/S0002-9378(16)33441-X
- [36]. Lu YP, Reichetzeder C, Prehn C, Yin LH, Yun C, Zeng S, Chu C, Adamski J, Hocher B. Cord Blood Lysophosphatidylcholine 16:1 is Positively Associated with Birth Weight. Cell Physiol Biochem. 2018;45(2):614–624. doi: 10.1159/000487118
- [37]. Yang L, Zhang Y, Yang J, Huang X. Effects of birth weight on profiles of dried blood amino-acids and acylcarnitines. Ann Clin Biochem. 2018 January;55(1):92–99. doi: 10.1177/0004563216688038