2017 Dec 27;6:e31035. doi: 10.7554/eLife.31035

Figure 3. Prediction of growth-defect phenotypes in the E. coli strain collection.

(A) Schematic representation of how the prediction score is computed and evaluated; for each condition, the predicted score is computed using the disruption score of the conditionally essential genes. The score is then evaluated against the actual phenotypes through a Precision-Recall curve. (B) Higher predictive power for conditions with a higher proportion of growth phenotypes. For each condition set, the PR-AUC value for each condition is reported, together with the median and mean values. (C) Significance of the PR-AUC value reported for the condition ‘Clindamycin 3 μg/ml’, against the distribution of three randomization strategies. ‘Shuffled strains’ indicates a prediction in which the actual strains’ phenotypes have been shuffled; ‘shuffled sets’ indicates a prediction where the conditionally essential genes of a different condition have been used; and ‘random set’ indicates a prediction where a random gene set has been used as conditionally essential genes. For all three randomizations we report a significant difference between the actual prediction and the distribution of the randomizations (q-values of 1E-30, 0.05 and 1E-22, respectively). See Figure 3—figure supplement 2 for the other conditions. (D) Genome-wide gene associations are in agreement with the predictive score; the enrichment of conditionally essential genes in the results of the gene association analysis is significantly higher in conditions with higher PR-AUC.
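The evaluation in panels A and C can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that a higher predicted score means a strain is more likely to be sick, it approximates the PR-AUC by average precision, and it uses an empirical one-sided p-value for the ‘shuffled strains’ comparison, since the caption does not state which test produced the reported q-values. The function names `pr_auc`, `shuffled_strain_null` and `empirical_p` are hypothetical.

```python
import numpy as np


def pr_auc(is_sick, score):
    """Area under the precision-recall curve, approximated as average precision.

    is_sick: boolean phenotype per strain (True = growth defect).
    score:   predicted score per strain (assumed: higher = more likely sick).
    Tied scores are ranked in input order (stable sort).
    """
    is_sick = np.asarray(is_sick, dtype=bool)
    score = np.asarray(score, dtype=float)
    order = np.argsort(-score, kind="stable")    # best-scored strains first
    hits = is_sick[order]
    n_pos = hits.sum()
    if n_pos == 0:
        return float("nan")                      # undefined without sick strains
    true_pos = np.cumsum(hits)
    precision = true_pos / np.arange(1, hits.size + 1)
    # Average the precision at the rank of every sick strain.
    return float(precision[hits].sum() / n_pos)


def shuffled_strain_null(is_sick, score, n_iter=1000, seed=0):
    """PR-AUC values obtained after shuffling phenotypes across strains."""
    rng = np.random.default_rng(seed)
    is_sick = np.asarray(is_sick, dtype=bool)
    return np.array([pr_auc(rng.permutation(is_sick), score)
                     for _ in range(n_iter)])


def empirical_p(observed, null):
    """One-sided empirical p-value (assumption: not necessarily the paper's test)."""
    null = np.asarray(null, dtype=float)
    return float((1 + np.sum(null >= observed)) / (1 + null.size))
```

The ‘shuffled sets’ and ‘random set’ strategies would follow the same pattern, except that the score itself is recomputed from a different gene set at each iteration while the phenotypes stay fixed.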

Figure 3—figure supplement 1. Detailed view of the predicted growth score and its properties.

(A) Influence of the different predictors of the impact of mutations on the PR-AUC of each condition. ‘SNPs’ indicates single nucleotide variants only, ‘Accessory genome’ indicates gene presence-absence patterns, and ‘All predictors’ indicates the combination of both, plus stop codon substitutions. (B–C) Proportion of well-predicted conditions (PR-AUC >= 0.1 and Pearson’s FDR-corrected p-value <= 0.01) over the total number of conditions with at least 1% and 5% sick strains. Marker size is proportional to the log10 of the FDR-corrected p-value. (E) Conditions with higher predictive power (measured as PR-AUC) also have an enrichment of sick strains at the top of the predicted score, as measured by Gene Set Enrichment Analysis (GSEA); sick strains are used as ‘gene sets’. A pseudocount of 10⁻⁴ has been added to the GSEA p-values. (F) Prediction performance improves when using the weighting scheme to account for conservation of gene essentiality, especially for well-predicted conditions.
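The ‘well-predicted’ criterion in panels B–C can be written out as a short Python sketch. The thresholds (PR-AUC >= 0.1, FDR-corrected p-value <= 0.01, minimum fraction of sick strains) come from the caption; the `ConditionResult` record, the function name and the example values are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass
class ConditionResult:
    """Hypothetical per-condition summary; field names are assumptions."""
    name: str
    pr_auc: float
    q_value: float        # FDR-corrected p-value
    sick_fraction: float  # fraction of tested strains with a sick phenotype


def well_predicted_fraction(results, min_sick=0.05, min_auc=0.1, max_q=0.01):
    """Fraction of eligible conditions that pass both thresholds."""
    eligible = [r for r in results if r.sick_fraction >= min_sick]
    if not eligible:
        return float("nan")
    good = [r for r in eligible
            if r.pr_auc >= min_auc and r.q_value <= max_q]
    return len(good) / len(eligible)


# Made-up values, for illustration only.
example = [
    ConditionResult("condition_a", pr_auc=0.30, q_value=1e-5, sick_fraction=0.12),
    ConditionResult("condition_b", pr_auc=0.05, q_value=0.20, sick_fraction=0.08),
    ConditionResult("condition_c", pr_auc=0.25, q_value=1e-3, sick_fraction=0.02),
]
```

With `min_sick=0.05` only the first two example conditions are eligible, so the fraction is 0.5; lowering `min_sick` to 0.01 makes all three eligible.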
Figure 3—figure supplement 2. Significance of the PR-AUC value reported for all conditions with at least 5% of the tested strains showing a sick phenotype, against the distribution of three randomization strategies.

‘Shuffled strains’ indicates a prediction in which the actual strains’ phenotypes have been shuffled; ‘shuffled sets’ indicates a prediction where the conditionally essential genes of a different condition have been used; and ‘random set’ indicates a prediction where a random gene set has been used as conditionally essential genes. Conditions are ordered by the significance of the difference between the actual prediction and the ‘Random genes’ bootstrap. The proportions of conditions where the actual prediction is significantly different from the randomizations (q-value < 0.05) are 52%, 21% and 33.3% for the ‘shuffled strains’, ‘shuffled sets’ and ‘random genes’ randomizations, respectively.
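The proportion of significant conditions can be sketched as follows, assuming one p-value per condition and a Benjamini-Hochberg correction. The caption does not say which procedure produced the q-values, so the choice of Benjamini-Hochberg and the function names are assumptions.

```python
import numpy as np


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.clip(scaled, 0.0, 1.0)
    return q


def fraction_significant(pvalues, alpha=0.05):
    """Share of conditions whose q-value falls below alpha."""
    return float(np.mean(bh_qvalues(pvalues) < alpha))
```

Applied separately to the p-values from each randomization strategy, `fraction_significant` yields one proportion per strategy, which is the kind of summary reported above.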