eLife. 2022 Jun 23;11:e73870. doi: 10.7554/eLife.73870

Figure 3. LSTM-guided design and interpretability of community-level metabolite production profiles.

(a) Schematic of model training and design of communities with tailored metabolite outputs.

(b) Heatmap of butyrate and lactate concentrations of all possible communities predicted by the LSTM model M1. Grey points indicate communities chosen via k-means clustering to span metabolite design space. Colored boxes indicate ‘corner’ regions defined by 95th percentile values on each axis, with points of the corresponding color indicating designed communities within that ‘corner’. Insets show heatmaps of acetate and succinate concentrations for all communities within the corresponding boxes on the main figure. Boxes on the insets indicate ‘corners’ defined by 95th percentile values on each axis, with colored points corresponding to the same points indicated on the main plot. (c) Cross-validation accuracy of the LSTM model trained and validated on a random 90/10 split of all community observations (model M2), evaluated as the squared Pearson correlation (R²) between predicted and measured values for each variable (all p-values < 0.05; N and p-value for each test are reported in Supplementary file 1). Dashed line indicates R²=0.5, which is used as a cutoff for including a variable in the subsequent network diagrams. (d) and (e) Network representation of median LIME explanations of the LSTM model M2 from (c) for prediction of each metabolite concentration (d) or species abundance (e) by the presence of each species. Edge widths are proportional to the median LIME explanation across all communities from (b) used to train the model, in units of concentration (for (d)) or normalized to the species’ self-impact (for (e)). Only explanations for those variables where the cross-validated predictions had R²>0.5 are shown. Networks were simplified by using lower thresholds for edge width (5 mM for (d), 0.2 for (e)). Red and blue edges indicate positive and negative contributions, respectively.
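The R² cutoff described in (c) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy numbers are hypothetical, and only the squared Pearson correlation and the 0.5 cutoff come from the caption.

```python
def pearson_r2(predicted, measured):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov ** 2 / (var_p * var_m)


def variables_above_cutoff(results, cutoff=0.5):
    """Keep the variables whose cross-validated R2 exceeds the cutoff."""
    return [
        name
        for name, (predicted, measured) in results.items()
        if pearson_r2(predicted, measured) > cutoff
    ]


# Made-up illustrative values, not data from the study.
toy_results = {
    "butyrate": ([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.9]),
    "succinate": ([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 2.5, 1.5]),
}
kept = variables_above_cutoff(toy_results)
```

In this toy input, only the variable whose predictions track the measurements passes the cutoff and would appear in a network diagram.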

Figure 3—figure supplement 1. Cross-validation of LSTM model M1 predictions of species abundance and metabolite concentration.

Each plot compares predicted versus measured species abundance (a), butyrate concentration (b), acetate concentration (c), lactate concentration (d), or succinate concentration (e) for cross-validation of model M1 predictions of the validation communities from Clark et al., 2021 (model trained on 110 pairwise communities, 156 communities with 3–5 species, and 124 communities with 11–17 species; cross-validation shown is prediction of a different set of 124 communities with 11–17 species, including 82 communities with all 5 butyrate producers (AC, ER, FP, CC, RI) and 42 communities with the 4 butyrate producers other than AC). Each data point indicates the average of biological replicates of a single community. Black lines indicate linear regressions with slope (m) and R² indicated in the legends. Dashed blue line indicates x=y.
Figure 3—figure supplement 2. Predicted total carbon in fermentation products.

Histogram of the model M1 predicted total carbon concentration in butyrate, acetate, lactate, and succinate for all possible communities with >10 species (26,434,916 communities).
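The community count quoted above can be reproduced by counting the subsets of the 25 species that contain more than 10 members. The sketch below assumes each community is defined purely by which species are present; the variable names are illustrative.

```python
from math import comb

N_SPECIES = 25  # size of the full synthetic community

# Number of presence/absence combinations with 11 to 25 species.
n_communities = sum(comb(N_SPECIES, k) for k in range(11, N_SPECIES + 1))
print(n_communities)  # 26434916
```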
Figure 3—figure supplement 3. Prediction and classification statistics for model M1 predictions of designed community sets.

(a) Scatter plot of prediction accuracy (correlation of predicted versus measured) of each variable (25 species abundances, 4 metabolite concentrations) by the LSTM model M1 versus the composite model based on the method from Clark et al., 2021. For metabolites, prediction accuracy is also included where the regression model from the composite model is replaced with a Random Forest Regressor (triangles) or a Feed Forward Network (plus signs). Pearson correlation R², p-values, and N are reported in Supplementary file 1. (b–e) Prediction accuracy of model M1 for the indicated metabolites. Dashed line indicates the linear regression for all data points. Legends indicate the Pearson correlation R² (including p-values; N=80 for Corner, 100 for Distributed) and RMSE for communities from the ‘corner’ set (red) or ‘distributed’ set (blue) for each variable. Solid black lines indicate x=y. (f) Confusion matrix for classification of the ‘corner’ communities into their specified classes (shown in Figure 3b). Values indicate the fraction of communities from each predicted class whose metabolite concentrations were closest (Euclidean distance) to the centroid of each class (Measured Class). Colored boxes indicate ‘sub-classes’ that fall within the four major classes determined in the lactate and butyrate concentration space as shown in Figure 3b. (g) Scatter plot of misclassification rate between each pair of classes (values from (f), fraction of communities misclassified from one class to the other) versus the Euclidean distance between the centroids of that pair of classes. Black data points indicate pairs of classes that fall within the same major class defined by the colored boxes in (f), and red data points indicate pairs of classes that do not fall within the same major class.
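The nearest-centroid classification behind the confusion matrix in (f) can be sketched as below. This is an illustrative reading of the caption, not the authors' implementation: the function names, the way centroids are supplied, and the toy coordinates are all assumptions.

```python
import math


def nearest_class(point, centroids):
    """Return the label of the centroid closest to `point` (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


def confusion_fractions(measured, predicted_labels, centroids):
    """Rows are predicted classes, columns are measured classes.

    Each value is the fraction of communities in a predicted class whose
    measured concentrations lie closest to the given class centroid.
    """
    labels = list(centroids)
    counts = {p: {m: 0 for m in labels} for p in labels}
    for point, predicted in zip(measured, predicted_labels):
        counts[predicted][nearest_class(point, centroids)] += 1
    fractions = {}
    for p in labels:
        total = sum(counts[p].values())
        fractions[p] = {m: (counts[p][m] / total if total else 0.0) for m in labels}
    return fractions


# Made-up two-class example in a two-metabolite space (hypothetical values).
toy_centroids = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
toy_measured = [(1.0, 0.0), (2.0, 1.0), (9.0, 0.0), (6.0, 0.0)]
toy_predicted = ["A", "A", "B", "A"]
matrix = confusion_fractions(toy_measured, toy_predicted, toy_centroids)
```

The off-diagonal entries of such a matrix are the pairwise misclassification rates that panel (g) plots against the distance between class centroids.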
Figure 3—figure supplement 4. Metabolite production of each species grown in monoculture.

Bars show the mean net production or consumption of each metabolite for monocultures of each species (bar color indicates species as specified in the legend).

Error bars indicate the bootstrapped 95% confidence interval on the mean of between 3 and 22 biological replicates. The dashed lines indicate ±10 mM, and the numbers on the plot indicate the number of species with mean net production of that metabolite outside the ±10 mM range.
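A bootstrapped confidence interval on a mean can be sketched as follows. The caption does not say which bootstrap variant was used, so this example assumes a simple percentile bootstrap; the function name, resample count, seed, and replicate values are hypothetical.

```python
import random
import statistics


def bootstrap_ci_mean(values, n_resamples=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Made-up replicate measurements in mM, not data from the study.
replicates = [8.0, 9.5, 10.0, 11.5, 12.0]
ci_low, ci_high = bootstrap_ci_mean(replicates)
```

With as few as 3 replicates the resampled means take only a handful of distinct values, so intervals computed this way are coarse for the smallest groups.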
Figure 3—figure supplement 5. Metabolite-species LIME explanations computed over a 20-fold partitioning of the data set.

Box plots of LIME explanations for the metabolites acetate, butyrate, lactate, and succinate. LIME analysis to compute the impact of initial species abundances on metabolite predictions in the 25-member (full) community was performed after training on each subset of data that resulted from a 20-fold partitioning. In each box, the black horizontal line shows the median LIME explanation over the 20 samples, each box encloses the first (Q1) and third (Q3) quartiles, and whiskers extend to the farthest data points within the range 1.5*(Q3 - Q1). Data points that exceed this range are considered outliers, which are shown as black circles.
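The box-plot quantities described above can be computed as in the sketch below. It assumes the common convention that the whisker range is measured from the box edges (Q1 - 1.5*(Q3 - Q1) to Q3 + 1.5*(Q3 - Q1)) and uses one particular quartile interpolation; plotting libraries may differ slightly on both points. The function name and sample values are hypothetical.

```python
import statistics


def box_stats(values):
    """Median, quartiles, whisker ends, and outliers for one box."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        # Whiskers end at the farthest data points still inside the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Made-up explanation values, not data from the study.
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```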
Figure 3—figure supplement 6. Microbe-microbe LIME explanations computed over a 20-fold partitioning of the data set.

Box plots of LIME explanations for each species in the 25-member synthetic human gut community. LIME analysis to compute the impact of initial species abundances on the end-point species abundance predictions in the 25-member (full) community was performed after training on each subset of data that resulted from a 20-fold partitioning. In each box, the black horizontal line shows the median LIME explanation over the 20 samples, each box encloses the first (Q1) and third (Q3) quartiles, and whiskers extend to the farthest data points within the range 1.5*(Q3 - Q1). Data points that exceed this range are considered outliers, which are shown as black circles. LIME explanations from each fold are normalized to the given species’ self-impact such that the LIME explanation of a species to predict its own abundance is equal to one.
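The self-impact normalization can be sketched as below. The matrix layout is an assumption made for illustration: row i holds the explanations for predicting species i, column j is the species whose initial abundance is perturbed, and each row is divided by its diagonal entry. The function name and values are hypothetical.

```python
def normalize_by_self_impact(lime):
    """Scale each row so that a species' explanation of itself equals one."""
    normalized = []
    for i, row in enumerate(lime):
        self_impact = row[i]
        if self_impact == 0:
            raise ValueError(f"species {i} has zero self-impact")
        normalized.append([value / self_impact for value in row])
    return normalized


# Made-up 2-species explanation matrix, not data from the study.
toy_lime = [
    [2.0, -1.0],
    [0.5, 4.0],
]
toy_normalized = normalize_by_self_impact(toy_lime)
```

After this scaling, an off-diagonal value of 0.2 means the other species' impact is one fifth of the predicted species' own impact, which makes a single edge threshold comparable across species.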
Figure 3—figure supplement 7. Comparison of LIME explanations of LSTM to gLV parameters.

(a) Scatter plot of LIME explanations of each species’ impact on each other species in model M2 versus the corresponding interspecies interaction parameter (aij) from the gLV model from Clark et al., 2021. Dashed line indicates the linear regression, with the regression parameters shown in the legend (N=600). (b) Heatmap representation of agreement or disagreement between specific interactions for the same comparison as in (a). The legend describes what each color represents.
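One way to tabulate agreement between the two interaction estimates is to compare their signs pair by pair, as sketched below. The categories used in the published heatmap legend are not given here, so the two labels are an assumption, as are the function names and toy matrices. The last line notes that N=600 is consistent with the number of ordered pairs of distinct species in a 25-member community, which is an inference rather than something stated in the caption.

```python
def sign(x):
    """Return 1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_agreement(lime, glv):
    """Label each ordered pair (i, j), i != j, by whether the signs match."""
    n = len(lime)
    table = {}
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # self-interactions are excluded from the comparison
            same = sign(lime[i][j]) == sign(glv[i][j])
            table[(i, j)] = "agree" if same else "disagree"
    return table


# Made-up 3-species matrices, not data from the study.
toy_lime = [[1.0, -0.2, 0.3], [0.1, 1.0, -0.4], [-0.5, 0.2, 1.0]]
toy_glv = [[-1.0, -0.1, -0.3], [0.2, -1.0, -0.1], [-0.2, -0.3, -1.0]]
agreement = sign_agreement(toy_lime, toy_glv)

n_pairs_full_community = 25 * 24  # ordered pairs of distinct species
```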