2016 Oct 19;11(10):e0165015. doi: 10.1371/journal.pone.0165015

Table 1. Regression models considered for GC normalization.

We explored several regression models, from simple linear models using only one input variable (genome GC content) to more complex ones, progressively increasing the number of terms and using two input variables (read GC content and genome GC content). While this strategy helped us find models with lower RMSE, it eventually led to overfitting and a significant increase in RMSE (the fourth-degree polynomial model). However, using non-linear regression with a Gaussian exponential term significantly improved RMSE (last model). Complete results of model testing, with estimates of the abundance of each bacterium in the validation sets, are provided in S2 Table. R output with statistics for the tested models is included in S2 File.

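| Model of normalized coverage | Unique solution found | Residual standard error | Degrees of freedom | RMSE (Experiment 1, 250 ng bacterial DNA), % | RMSE (Experiment 2, 20 ng bacterial DNA), % |
|---|---|---|---|---|---|
| $\theta_1 + \theta_2 \log(GC_G)$ | Yes | 0.6519 | 340 | 8.35 | 8.51 |
| $\theta_1 + \theta_2 GC_G + \theta_3 GC_G^2$ | Yes | 0.6515 | 339 | 8.09 | 8.26 |
| $\theta_1 + \theta_2 GC_R + \theta_3 GC_R^2 + \theta_4 \log(GC_G)$ | Yes | 0.4414 | 338 | 8.41 | 8.69 |
| $\theta_1 + \theta_2 GC_R + \theta_3 GC_R^2 + \theta_4 GC_R^3 + \theta_5 \log(GC_G)$ | Yes | 0.373 | 337 | 6.24 | 6.53 |
| $\theta_1 + \theta_2 GC_R + \theta_3 GC_R^2 + \theta_4 GC_R^3 + \theta_5 GC_R^4 + \theta_6 \log(GC_G)$ | Yes | 0.3441 | 336 | 11.7 | 11.92 |
| $\theta_1 e^{-0.5\left(\frac{GC_R - \theta_2}{\theta_3}\right)^2} + \theta_4 + \theta_5 \log(GC_G)$ | No | NA | NA | NA | NA |
| $\theta_1 e^{-0.5\left(\frac{GC_R - \theta_2}{\theta_3}\right)^2} + \theta_4 + \theta_5 GC_R + \theta_6 GC_R^2$ | No | NA | NA | NA | NA |
| $\theta_1 e^{-0.5\left(\frac{GC_R - \theta_2}{\theta_3}\right)^2} + \theta_4 + \theta_5 GC_R + \theta_6 GC_R^2 + \theta_7 \log(GC_G)$ | No | NA | NA | NA | NA |
| $\theta_1 e^{-0.5\left(\frac{GC_R - \theta_2}{\theta_3}\right)^2} + \theta_4 + \theta_5 GC_R + \theta_6 GC_R^2 + \theta_7 GC_R^3 + \theta_8 \log(GC_G)$ | Yes | 0.3283 | 334 | 2.87 | 2.91 |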
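
As an illustration only, the last model in Table 1 can be written as a plain function of read GC content and genome GC content. The sketch below is in Python with NumPy, not the R code used for the study (see S2 File). It assumes the standard Gaussian form exp(-0.5((GC_R - θ2)/θ3)^2) for the exponential term and GC content expressed as a fraction between 0 and 1. The function names, the parameter values, and the generic `rmse` helper are hypothetical, chosen for demonstration rather than taken from the fitted model.

```python
import numpy as np


def gaussian_cubic_model(gc_r, gc_g, theta):
    """Evaluate the last model in Table 1 (assumed form).

    gc_r  : read GC content (assumed to be a fraction in (0, 1))
    gc_g  : genome GC content (assumed to be a fraction in (0, 1))
    theta : sequence of 8 parameters, theta[0] is theta_1, and so on
    """
    t1, t2, t3, t4, t5, t6, t7, t8 = theta
    gc_r = np.asarray(gc_r, dtype=float)
    gc_g = np.asarray(gc_g, dtype=float)
    # Gaussian exponential term centred at t2 with width t3
    gauss = t1 * np.exp(-0.5 * ((gc_r - t2) / t3) ** 2)
    # Cubic polynomial in read GC content
    poly = t4 + t5 * gc_r + t6 * gc_r**2 + t7 * gc_r**3
    # Logarithmic term in genome GC content
    return gauss + poly + t8 * np.log(gc_g)


def rmse(predicted, observed):
    """Root-mean-square error between two equally sized arrays."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean(diff**2)))


# Hypothetical parameter values, for demonstration only
theta_demo = (2.0, 0.5, 0.1, 1.0, 0.0, 0.0, 0.0, 0.0)
coverage = gaussian_cubic_model([0.3, 0.5, 0.7], [0.4, 0.5, 0.6], theta_demo)
```

With these demonstration parameters the polynomial and logarithmic coefficients are zero apart from the constant, so the predicted normalized coverage peaks where read GC content equals θ2. Estimating the eight parameters from data requires a non-linear least-squares routine, and such a routine can fail to converge to a unique solution, as the "No" rows in the table indicate for three of the simpler Gaussian models.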