We explored several regression models, ranging from simple linear models with a single input variable (genome GC content) to more complex models built by progressively increasing the number of terms and using two input variables (read GC content and genome GC content). Although adding terms initially yielded models with lower RMSE, it eventually led to overfitting and a significant increase in RMSE (the fourth-degree polynomial model). In contrast, non-linear regression with a Gaussian exponential term significantly improved RMSE (last model). Complete results of model testing, with abundance estimates for each bacterium in the validation sets, are provided in S2 Table. The R output with statistics for the tested models is included in S2 File.
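The model-selection strategy described above (add terms, then compare RMSE on held-out data) can be illustrated with a minimal sketch. It is written in Python rather than R, is not the code behind S2 File, and makes several simplifying assumptions: a single input variable, plain polynomial terms only (no Gaussian exponential term), and synthetic data generated purely for illustration. The function names `rmse`, `fit_polynomial`, `predict`, and `validation_rmse_by_degree` are hypothetical.

```python
import math
import random


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def fit_polynomial(x, y, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients [c0, c1, ..., c_degree] for c0 + c1*x + ... .
    """
    m = degree + 1
    # Augmented normal-equation matrix [X^T X | X^T y].
    a = [
        [sum(xi ** (i + j) for xi in x) for j in range(m)]
        + [sum(yi * xi ** i for xi, yi in zip(x, y))]
        for i in range(m)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    coef = [0.0] * m
    for i in range(m - 1, -1, -1):
        tail = sum(a[i][j] * coef[j] for j in range(i + 1, m))
        coef[i] = (a[i][m] - tail) / a[i][i]
    return coef


def predict(coef, x):
    """Evaluate the fitted polynomial at each value in x."""
    return [sum(c * xi ** k for k, c in enumerate(coef)) for xi in x]


def validation_rmse_by_degree(train, valid, degrees):
    """Fit each degree on `train` and score it on `valid`; both are (x, y) pairs."""
    x_tr, y_tr = train
    x_va, y_va = valid
    return {
        d: rmse(y_va, predict(fit_polynomial(x_tr, y_tr, d), x_va))
        for d in degrees
    }


if __name__ == "__main__":
    # Synthetic data (an assumption for illustration, not the study's data):
    # GC fractions in [0.3, 0.7] and a noisy quadratic response.
    rng = random.Random(0)
    gc = [rng.uniform(0.3, 0.7) for _ in range(60)]
    x = [g - 0.5 for g in gc]  # centring keeps the normal equations better conditioned
    y = [1.0 + 2.0 * xi - 8.0 * xi ** 2 + rng.gauss(0, 0.05) for xi in x]
    scores = validation_rmse_by_degree((x[:40], y[:40]), (x[40:], y[40:]), range(1, 5))
    for degree, score in scores.items():
        print(f"degree {degree}: validation RMSE = {score:.4f}")
```

Scoring each candidate on data that was not used for fitting is what exposes overfitting: training error can only decrease as terms are added, whereas validation RMSE typically stops improving, or worsens, once the extra terms start fitting noise.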