2022 Jun 23;11:e73870. doi: 10.7554/eLife.73870

Figure 4. Hold-out prediction performance on sub-communities provides information about poorly understood species and interactions between species.

(a) Sensitivity of metabolite prediction performance (R²) to the amount of training data. Training datasets were randomly subsampled 30 times using 50–100% of the total dataset in increments of 10%. Each subsampled training set was subject to 20-fold cross-validation to assess prediction performance. The line plot shows the mean prediction performance over the 30 trials for each percentage of the data. Error bars denote 1 s.d. from the mean. (b) Schematic scatter plot representing how communities containing species A and B define a poorly predicted subsample of the full sample set. (c) Heatmap of prediction performance (R²) of acetate for each subset of communities containing a given species (diagonal elements) or pair of species (off-diagonal elements). (d) Heatmap of prediction performance for acetate, butyrate, lactate, and succinate. A sample subset containing a given species or pair of species included all communities in which the species were initially present. Predictions for each community were determined using 20-fold cross-validation, so that for each model the predicted samples were excluded from the training samples. N and p-values are reported in Supplementary file 1.
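
As a rough illustration of how the per-species and per-pair scores in panels (c) and (d) could be tabulated from held-out predictions, the Python sketch below groups communities by the species initially present and scores each subset. The function names, the data layout (one set of initially present species per community), and the choice of squared Pearson correlation as R² are assumptions made for illustration; the caption does not specify them.

```python
from itertools import combinations


def r_squared(y_true, y_pred):
    """Squared Pearson correlation (an assumed definition of R2).

    Returns None when the score is undefined (fewer than two samples
    or zero variance in either vector).
    """
    n = len(y_true)
    if n < 2:
        return None
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        return None
    return cov * cov / (var_t * var_p)


def subset_performance(communities, y_true, y_pred, species):
    """Score held-out predictions on every single-species and pair subset.

    communities: list of sets, the species initially present in each sample.
    y_true, y_pred: measured and held-out predicted values, same order.
    Keys (s, s) correspond to diagonal heatmap cells, (a, b) to off-diagonal.
    """
    keys = [(s, s) for s in species] + list(combinations(species, 2))
    scores = {}
    for a, b in keys:
        # A sample belongs to the subset if all named species were present.
        idx = [i for i, c in enumerate(communities) if a in c and b in c]
        scores[(a, b)] = r_squared(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return scores
```

Here `y_pred` is meant to hold predictions made while each sample was excluded from training, as in the 20-fold cross-validation described above; a low score for a given key would flag the corresponding species or pair as poorly predicted.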

Figure 4—figure supplement 1. Sensitivity of species abundance prediction performance (R²) to the size of the training dataset.

Training datasets were randomly subsampled 30 times using 50–100% of the total dataset in increments of 10%. Each subsampled training set was subject to 20-fold cross-validation to assess prediction performance. Subplots show the mean prediction performance (±1 s.d.) over the 30 trials for each percentage of the dataset. Subplots were sorted according to the variance in species abundance taken over the total dataset. In general, prediction performance for low-variance species was less likely to improve in response to more training data. N and p-values are reported in Supplementary file 1.
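
The subsampling procedure described here can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the study's code: a one-variable least-squares line stands in for the actual predictive model, R² is taken as the squared Pearson correlation, the spread is the sample standard deviation, and all function names are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares line; a stand-in model, not the one used in the study."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def r_squared(y_true, y_pred):
    """Squared Pearson correlation (assumed definition); nan if undefined."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    vt = sum((t - mt) ** 2 for t in y_true)
    vp = sum((p - mp) ** 2 for p in y_pred)
    return cov * cov / (vt * vp) if vt and vp else float("nan")


def cross_val_predict(xs, ys, k, rng):
    """k-fold CV: each sample is predicted by a model that never saw it."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    preds = [None] * len(xs)
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:
            preds[i] = slope * xs[i] + intercept
    return preds


def sensitivity(xs, ys, fractions=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0),
                trials=30, k=20, seed=0):
    """Mean and s.d. of cross-validated R2 for each training-set fraction."""
    rng = random.Random(seed)
    summary = {}
    for frac in fractions:
        size = round(frac * len(xs))  # must be at least k samples
        scores = []
        for _ in range(trials):
            sub = rng.sample(range(len(xs)), size)  # subsample w/o replacement
            sx, sy = [xs[i] for i in sub], [ys[i] for i in sub]
            scores.append(r_squared(sy, cross_val_predict(sx, sy, k, rng)))
        summary[frac] = (statistics.mean(scores), statistics.stdev(scores))
    return summary
```

Calling `sensitivity(xs, ys)` returns one `(mean, s.d.)` pair per fraction, which is the quantity each subplot displays. At 100% of the data every trial uses the same samples, so any spread comes only from how the folds are drawn.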