Using 20 media conditions, we generated 21 GENREs, where each GENRE was gap-filled in one of three ways: with a different ordering of the same input media conditions (“Order Only”); with random weighting of reactions in the database and random subsets of reactions from the draft (“Diverse”); or as a diverse ensemble that also incorporated negative growth data through a trimming step (“Negative Growth Data”). We evaluated the accuracy, precision, and recall of every individual GENRE and of the ensembles by predicting growth on 17 positive and 17 negative media conditions that were not used during gap filling. The average across the individual GENREs is shown as black points, with black lines extending up to the maxima and down to the minima. The ensemble predictions under the three voting thresholds are shown as red circles (“any”), green triangles (“majority”), and blue squares (“consensus”). Note that there is less ensemble diversity when differences result only from media condition ordering (compare the maxima/minima of “Order Only” with those of “Diverse” or “Negative Growth Data”). Adding diversity yields GENREs with both higher and lower accuracy than the best and worst of “Order Only”. Adding the trimming step (“Negative Growth Data”) improves overall accuracy and precision by ~15%. Among the ensemble thresholds, “majority” tends to perform similarly to the average of the individual GENREs. The “any” threshold achieves recall as good as or better than that of the best individual GENREs. The “consensus” threshold performs consistently well in terms of accuracy and precision when there is very little diversity in the ensemble (“Order Only”).
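
The three voting thresholds and the three evaluation metrics can be illustrated with a minimal Python sketch. This is not the implementation used to produce the figure: the function names `ensemble_vote` and `score` are hypothetical, "majority" is assumed here to mean strictly more than half of the GENREs predicting growth, and the toy predictions are invented purely for illustration and are not results.

```python
from typing import Sequence, Tuple


def ensemble_vote(predictions: Sequence[bool], threshold: str) -> bool:
    """Combine per-GENRE growth predictions for a single media condition."""
    n_growth = sum(predictions)
    if threshold == "any":        # at least one GENRE predicts growth
        return n_growth >= 1
    if threshold == "majority":   # assumption: strictly more than half
        return n_growth > len(predictions) / 2
    if threshold == "consensus":  # every GENRE predicts growth
        return n_growth == len(predictions)
    raise ValueError(f"unknown threshold: {threshold}")


def score(predicted: Sequence[bool], observed: Sequence[bool]) -> Tuple[float, float, float]:
    """Return (accuracy, precision, recall) for growth / no-growth calls."""
    pairs = list(zip(predicted, observed))
    tp = sum(p and o for p, o in pairs)
    tn = sum((not p) and (not o) for p, o in pairs)
    fp = sum(p and (not o) for p, o in pairs)
    fn = sum((not p) and o for p, o in pairs)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, precision, recall


# Toy data (illustrative only): rows are media conditions,
# columns are the predictions of three hypothetical GENREs.
toy_predictions = [
    [True, True, True],
    [True, False, False],
    [True, True, False],
    [False, False, False],
]
toy_observed = [True, True, False, False]  # two positive, two negative conditions

results = {
    t: score([ensemble_vote(row, t) for row in toy_predictions], toy_observed)
    for t in ("any", "majority", "consensus")
}
for name, (acc, prec, rec) in results.items():
    print(f"{name:>9}: accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
```

In this toy case the "any" threshold recovers every positive condition at the cost of a false positive, while "consensus" makes no false-positive calls but misses one positive condition, which mirrors the recall versus precision trade-off described in the caption.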