Model behavior under maximum-likelihood estimation. (A) Relationship between the input-layer size, Lx, and the optimal hidden-layer size at a fixed sample size. Gray lines are found by optimizing Eq. 12 with respect to Lh; dashed lines are the asymptotic expression derived in SI Appendix, section 5.1. (B) Optimal hidden-layer size as a function of the input-layer size, Lx, and the sample size, N, from Eq. 12. (C) Scaling at . Gray line is theory; black points are from simulations; colored circles are the experimental data from Fig. 1A. Simulations were done only for low Lx because of their computational cost when Lx is large. (D) Relationship between the hidden-layer size, Lh, and the generalization error, ϵgen, under the logistic activation function (black) and ReLU (gray), at Lx = 50 and . Lines are theory; bars are from simulations. Vertical lines mark the minima (solid, theory; dashed, simulations). Error bars are the SD over 10 simulations. (E) Scaling for the logistic activation function with . Gray line is theory; black points are from simulations; colored circles are the experimental data from Fig. 1A. As in C, simulations were done only for low Lx. (F) Analytical estimation of the scaling of the optimal hidden-layer size with Lx versus the Lx-N scaling (y axis, coefficient β of the former scaling; x axis, coefficient γ of the latter; see SI Appendix, section 5.1 for details). The gray horizontal line is the 3/2 scaling from Fig. 1A. As in Fig. 3, the teacher network had a hidden-layer size of 500, with a ReLU nonlinearity, and the noise was set to .