Proceedings of the National Academy of Sciences of the United States of America
2020 Aug 25;117(36):21857–21864. doi: 10.1073/pnas.1919995117

Archetypal landscapes for deep neural networks

Philipp C Verpoort a,1, Alpha A Lee a, David J Wales b
PMCID: PMC7486703  PMID: 32843349

Significance

Deep neural networks have reached impressive predictive capability for many challenging tasks, yet it remains unclear why they work. Training neural networks involves minimizing a complex, high-dimensional, nonconvex loss function, yet, empirically, it proves possible to produce useful models without rigorous global optimization. To provide insight into this observation, we analyze the structure of the loss-function landscape of deep neural networks and show that it features either a single funnel or low barriers between minima. Such landscapes are relatively easy to optimize and are qualitatively different from the energy landscape of a structural glass. More generally, our results demonstrate how the methodology developed for exploring molecular energy landscapes can be exploited to extend our understanding of machine learning.

Keywords: neural networks, deep learning, energy landscapes, statistical mechanics, optimization

Abstract

The predictive capabilities of deep neural networks (DNNs) continue to evolve to increasingly impressive levels. However, it is still unclear how training procedures for DNNs succeed in finding parameters that produce good results for such high-dimensional and nonconvex loss functions. In particular, we wish to understand why simple optimization schemes, such as stochastic gradient descent, do not end up trapped in local minima with high loss values that would not yield useful predictions. We explain the optimizability of DNNs by characterizing the local minima and transition states of the loss-function landscape (LFL) along with their connectivity. We show that the LFL of a DNN in the shallow network or data-abundant limit is funneled, and thus easy to optimize. Crucially, in the opposite low-data/deep limit, although the number of minima increases, the landscape is characterized by many minima with similar loss values separated by low barriers. This organization is different from the hierarchical landscapes of structural glass formers and explains why minimization procedures commonly employed by the machine-learning community can navigate the LFL successfully and reach low-lying solutions.

Introduction

The performance of deep neural networks (DNNs) continues to evolve to increasingly impressive levels (1), surpassing skilled human operators in many challenging tasks, such as gameplay. However, it is not clear why DNNs work so well. DNNs include a huge number of tunable parameters, which, together with the loss function, define the loss-function landscape (LFL). This landscape is high-dimensional and nonconvex, and, thus, one might expect global optimization to be a challenging task. However, empirically, even simple gradient descent with stochastic noise manages to avoid high-loss local minima and locate useful models. Resolving this optimizability puzzle will suggest strategies for designing more robust machine-learning architectures, as well as scalable optimization algorithms.

Previous investigations have developed simplified analytical models or performed numerical experiments to understand this optimizability puzzle. On the analytical front, ref. 2 examined the LFL of a single hidden-layer neural network, and ref. 3 mapped DNNs to spin-glass models, which required a series of significant approximations. On the numerical front, earlier works focused on local geometric properties, such as the width of minima (4–7). Those insights led to modified optimization algorithms that bias the search toward solutions with favorable geometric properties (8, 9). Going beyond individual minima, more recent work focused on the connectivity of the landscape. In particular, we have analyzed the landscape for single hidden-layer neural networks using the disconnectivity graph (10, 11) formalism developed for molecular energy landscapes (12–14). Beyond single hidden-layer networks, ref. 15 visualized the LFL of DNNs by projecting onto three dimensions and suggested that architectural features such as skip connections lead to smoother LFLs, while ref. 16 considered pairs of minima in DNNs and showed that the barrier along the minimum energy path is much lower than along the linear path in weight space that connects the two minima. While these results are intriguing, they lack a global view of the LFL as a function of network depth and number of training data.

In this paper, we examine the number of minima and transition states, the connectivity of minima, and the relationship between minima width and generalization error, all as a function of the number of hidden layers and the amount of training data. Our approach is to benchmark small networks, where local minima and transition states can be characterized accurately using computational energy landscapes techniques (17). This perspective reveals the structure of the underlying solution space, in terms of the minimal barriers that any optimization procedure, including stochastic gradient descent, must navigate to pass between the catchment basins of different local minima (17, 18) and reach low-lying minima. Our most important conclusion is that the LFLs generally exhibit low barriers between local minima in terms of both local and global organization, which explains why DNNs can be optimized sufficiently to produce useful predictions.

Methods

DNNs.

We briefly define the DNNs considered in this paper (Fig. 1). We label layers using $l \in \{0, \ldots, H+1\}$, with $H$ being the number of hidden layers, $l = 0$ the input layer, $l = H+1$ the output layer, and $n_l$ the number of nodes in layer $l$. The activations $a_i^{(l)}$ are obtained from the signals $z_j^{(l-1)}$ using:

$$a_i^{(l)} = \sum_{j=1}^{n_{l-1}} w_{ij}^{(l)} z_j^{(l-1)} + \theta_i^{(l)} = \sum_{j=0}^{n_{l-1}} w_{ij}^{(l)} z_j^{(l-1)} \quad \text{for } 1 \le i \le n_l, \tag{1}$$

where in the last step we have absorbed the bias weights into the link weight matrix by setting $w_{i0}^{(l)} = \theta_i^{(l)}$ and $z_0^{(l-1)} = 1$ for all $l$, giving these matrices the shape $n_l \times (n_{l-1} + 1)$. The number of variables is then given by $v = \sum_{l=1}^{H+1} n_l \times (n_{l-1} + 1)$. The activations are turned into signals through $z_i^{(l)} = \phi_l(a_i^{(l)})$ with activation functions $\phi_l$. Here, we set $\phi_l = \tanh$ for all $l$.
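As a concrete illustration, the forward pass of Eq. 1 can be sketched in a few lines of NumPy. This is not the authors' code; the random initialization scale is an illustrative assumption, and the layer sizes are chosen to match the H = 1 LJAT19 architecture described later (n0 = 2, n1 = 10, nH+1 = 4):

```python
import numpy as np

def init_layers(sizes, rng):
    """One weight matrix per layer, of shape n_l x (n_{l-1} + 1);
    column 0 holds the absorbed bias weights w_{i0} = theta_i."""
    return [rng.standard_normal((n, m + 1)) * 0.1
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(weights, x):
    """Propagate input x through all layers with tanh activations,
    returning the output-layer activations a^{(H+1)} (pre-softmax)."""
    z = np.asarray(x, dtype=float)
    for W in weights:
        z_aug = np.concatenate(([1.0], z))  # prepend z_0 = 1 for the bias
        a = W @ z_aug                       # Eq. 1
        z = np.tanh(a)
    return a

sizes = [2, 10, 4]                 # n_0 = 2 inputs, one hidden layer, 4 outputs
weights = init_layers(sizes, np.random.default_rng(0))
v = sum(W.size for W in weights)   # number of variables
print(v)                           # 10*(2+1) + 4*(10+1) = 74, matching the text
```

The variable count reproduces v = 74 quoted in Results for H = 1, which is a useful consistency check on the bias-absorption convention.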

Fig. 1.


A DNN. Blue, input; red, output; green, hidden nodes.

We define the loss function (the analogue of the energy in a potential energy landscape [PEL]) from the softmax classification probabilities:

$$L = -\sum_{d=1}^{N_\text{data}} \ln\!\left(p_{c(d)}\right) + L_\text{reg} \quad \text{with} \quad p_c = \frac{e^{a_c^{(H+1)}}}{\sum_{a=0}^{3} e^{a_a^{(H+1)}}}, \tag{2}$$

where $c(d)$ is the known outcome for data item $d$, $N_\text{data}$ is the number of training data, $L_\text{reg} = \lambda \lVert \mathbf{w} \rVert^2$ is an L2 regularization term with regularization parameter $\lambda$, and $\mathbf{w} = (\mathbf{w}^{(1)}, \ldots, \mathbf{w}^{(H+1)})$ is the vector of all weights. Unless stated otherwise, we use $\lambda = 10^{-4}$.
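A minimal sketch of the loss in Eq. 2, assuming output-layer activations and integer class labels as inputs; the log-sum-exp shift for numerical stability is our addition, not part of the paper's definition:

```python
import numpy as np

def loss(outputs, labels, w_flat, lam=1e-4):
    """Cross-entropy of the softmax class probabilities (Eq. 2) plus the
    L2 regularization term L_reg = lambda * ||w||^2.
    outputs: (N_data, 4) array of output-layer activations a^{(H+1)}
    labels:  (N_data,) array of known class indices c(d)
    w_flat:  flat vector of all weights."""
    a = outputs - outputs.max(axis=1, keepdims=True)  # stabilize exponentials
    log_p = a - np.log(np.exp(a).sum(axis=1, keepdims=True))
    nll = -log_p[np.arange(len(labels)), labels].sum()
    return nll + lam * np.dot(w_flat, w_flat)
```

As a sanity check, uniform outputs over the four classes give a loss of N_data * ln(4) when the regularization is switched off.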

Training Datasets.

We analyze the LFLs for three datasets: LJAT19 (SI Appendix), OPTDIG (19), and WINE (20), as summarized below.

The LJAT19 dataset was generated specifically for the assessment of DNN loss landscapes. It is based on the outcome predictions of geometry optimization of the LJAT3 problem and was used in previous work (13, 21, 22). Details of its creation are reported in the SI Appendix. Readers primarily interested in LFLs need simply note that this is a fourfold classification problem with three inputs. Here, we only employ two of these inputs for training and testing to make the problem harder. This benchmark is appealing because we can generate arbitrary amounts of training and testing data, and it has practical importance in chemistry, where calculations of molecular configuration volumes and densities of states are of interest.

OPTDIG is a set of optical data for handwritten digits, with the target being digit recognition/classification. This resource is similar to the widely known MNIST (Modified National Institute of Standards and Technology) dataset, but with inputs of size 8 × 8. The WINE dataset is a list of red and white wines, with the target being the classification of their quality as “good,” “medium,” or “bad” based on 11 physicochemical tests (such as acidity, sulfates, alcohol, etc.). Both OPTDIG and WINE are “real-life” data and were obtained from the University of California, Irvine, Machine Learning Repository (23).

We primarily focus our studies on the LJAT19 dataset because we can generate as much data as is needed for benchmarking, which allows us to produce disconnectivity graphs with up to Ndata=100,000 training data. In contrast, the WINE and OPTDIG datasets only have up to Ndata=5,000 entries, so we use them to validate and confirm our principal conclusions. The computational cost of our methods is limited by the dimensionality of the LFL, and, hence, the datasets we study are a compromise between complexity and feasibility.

The performance of any point in the LFL, defined by a set of parameters w, can be quantified using standard area under the curve (AUC) receiver operating characteristic metrics. AUC values range between 0 and 1, where random classification yields an AUC value of 0.5 and a perfect prediction yields 1. In addition, we visualize the prediction outcomes of points in the landscape of the LJAT19 dataset by coloring a representative subspace of the inputs according to the classification indices (Figs. 2 and 3).
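To make the metric concrete, here is a toy implementation of the AUC for a binary split, using the rank-based (Mann–Whitney) formulation; this code is ours, and the way the paper combines scores for its fourfold classification is an assumption on our part (one-vs-rest averaging is a common choice):

```python
def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC for a binary split: the probability
    that a randomly chosen positive outscores a randomly chosen negative,
    with ties counting one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0: perfect ranking
print(auc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0]))  # 0.5: an uninformative classifier
```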

Fig. 2.


Disconnectivity graphs for Ndata ∈ {100, 1,000, 2,000} training data (from top to bottom) for the DNNs with H ∈ {1, 2, 3} hidden layers (from left to right), labeled as “DATASET-#HL-#” at the top of each panel, where the two numbers indicate, respectively, the number of hidden layers, H, and the amount of training data, Ndata. Only the lowest 2,000 minima (or all, if fewer than 2,000 were identified) are shown, and the vertical scale has been adjusted to span the range of loss-function values within this set. Included as Insets below each disconnectivity graph are graphical visualizations of the performance of the global minimum (see SI Appendix, section S2 for details), as well as a plot of the training (horizontal axis) versus testing (vertical axis) loss values of all minima. It is apparent from these graphs that, in each case, the structure of the LFL is either funneled or comprises many minima with similar loss values connected by low barriers.

Fig. 3.


Continuation of Fig. 2 with Ndata ∈ {10,000, 100,000}.

Exploring the LFL.

The LFL (12–14, 24) is the analogue of the PEL in molecular science, where the local minima correspond to isomers, and the pathways between them are atomic rearrangements. While the PEL is defined by the energy as a function of the atomic coordinates, the LFL is defined by the loss as a function of the edge weights between nodes, which are adjusted in training. We explore and visualize the LFL using geometry optimization tools developed for the analysis of PELs in molecular science; details are available in several reviews (17, 22, 25–27), and a brief overview is given here.

For each LFL, we first employ basin-hopping (BH) global optimization (28–30) to locate the putative global minimum. This technique is very effective in navigating the loss function and is not subject to the exponential slow-down suggested by spin-glass models (3). Local minima in themselves do not define a landscape; to determine the connectivity and barriers between solutions, we need to locate transition states, defined as stationary points with precisely one negative Hessian eigenvalue (31), which connect pairs of minima via steepest-descent paths (17, 18). Databases of minima and transition states were created and expanded using the doubly-nudged (32, 33) elastic band (34, 35) (DNEB) approach to identify points for accurate transition state refinement by hybrid eigenvector-following (36, 37). The connectivity between minima was established by calculating approximate steepest-descent paths for each transition state using a custom limited-memory quasi-Newton Broyden–Fletcher–Goldfarb–Shanno (38–41) (LBFGS) routine. The resulting database of stationary points is the analogue of a kinetic transition network for a PEL (42–45), which is employed to analyze structure, dynamics, and thermodynamics in molecular systems. This computational energy landscapes framework has been applied to a wide variety of problems, and most of the standard procedures for expanding stationary point databases (17, 22, 25–27) carry over directly to the landscapes considered in the present contribution. Additional care is needed in tightening convergence criteria and in characterizing highly asymmetric pathways (46).
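The BH step can be sketched on a toy one-dimensional double-well surface: perturb the coordinates, quench to the nearest minimum, and accept or reject the move with a Metropolis criterion on the quenched values. This is only a schematic of the algorithm, not the GMIN implementation used in the paper; the surface, step size, temperature, and quench settings are all illustrative assumptions:

```python
import numpy as np

def f(x):  # double-well test surface; global minimum near x = -1.02
    return (x**2 - 1.0)**2 + 0.2 * x

def quench(x, lr=0.01, steps=500):
    """Crude steepest-descent local minimization via finite differences."""
    for _ in range(steps):
        g = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6
        x -= lr * g
    return x

def basin_hopping(x0, n_hops=50, step=2.0, T=0.5, seed=0):
    """Perturb, quench, then accept or reject the quenched trial with a
    Metropolis criterion; track the lowest minimum found."""
    rng = np.random.default_rng(seed)
    x = quench(x0)
    best = x
    for _ in range(n_hops):
        trial = quench(x + rng.uniform(-step, step))
        if f(trial) < f(x) or rng.random() < np.exp((f(x) - f(trial)) / T):
            x = trial
        if f(x) < f(best):
            best = x
    return best

x_star = basin_hopping(1.0)  # started in the higher well, hops to the lower one
```

Stepping between quenched minima in this way is what lets BH ignore the downhill barriers that trap plain gradient descent.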

We note that ref. 16 employed nudged elastic band (NEB) interpolations to investigate barrier heights in neural network LFLs. However, in that study, the transition states were not refined accurately, and the connectivity between stationary points was not established. The discrete images of an NEB or DNEB interpolation can straddle barriers and intervening stationary points. To establish the true topology, it is important to refine transition states accurately using hybrid eigenvector-following (36, 37) and identify connected minima via approximate steepest-descent pathways. Gaps in the profile between two target minima can then be addressed using the missing connection algorithm (47) to identify the best candidates for additional double-ended searches, as in the present work.

We visualized the LFL using disconnectivity graphs (10, 11), which provide a faithful representation of the barrier heights, coarse-grained over a regular sequence of energy thresholds. The loss increases on the vertical axis, and the bottom of every branch corresponds to the value of the loss for a particular local training minimum. At each threshold, a superbasin analysis is performed to determine which groups of minima can interconvert without exceeding the threshold. Superbasins and the corresponding branches merge together as the loss threshold increases. We chose the position of the branches on the horizontal axis to avoid crossings and to produce a clear representation of the solution space.
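The superbasin analysis at each threshold reduces to a union-find pass over the transition-state database. Below is a minimal sketch on a hypothetical three-minimum database of our own invention; the minima losses would set the branch endpoints in the graph, while the grouping itself depends only on the transition-state losses:

```python
def superbasins(minima, transition_states, threshold):
    """Group minima that can interconvert without exceeding `threshold`,
    i.e. one level of the superbasin analysis for a disconnectivity graph.
    minima: dict name -> loss value (sets the bottom of each branch)
    transition_states: list of (min_a, min_b, ts_loss) triples."""
    parent = {m: m for m in minima}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, ts_loss in transition_states:
        if ts_loss <= threshold:           # minima interconvert below threshold
            parent[find(a)] = find(b)

    groups = {}
    for m in minima:
        groups.setdefault(find(m), set()).add(m)
    return list(groups.values())

# Hypothetical database: a low barrier links A and B; a high one isolates C.
minima = {"A": 0.10, "B": 0.12, "C": 0.11}
ts = [("A", "B", 0.20), ("B", "C", 0.90)]
low = superbasins(minima, ts, threshold=0.5)   # {A, B} merge; C stays separate
high = superbasins(minima, ts, threshold=1.0)  # all merge into one superbasin
```

Repeating this over a regular sequence of thresholds, and joining the resulting groups from the bottom up, yields the branching structure drawn in the disconnectivity graphs.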

Results

First, we studied the LFL of DNNs using the LJAT19 dataset. We characterized the landscapes as a function of training data (Ndata ∈ {100, 1,000, 2,000, 10,000, 100,000}) and number of hidden layers (H ∈ {1, 2, 3}). We chose n_l = 10, 5, and 4 nodes per hidden layer for H = 1, 2, and 3, respectively, which yielded an approximately equal number of weight variables (v = 74, 69, 72). The input and output dimensions were n_0 = 2 and n_{H+1} = 4, respectively. In each case, we calculated databases of local minima and the transition states that connect them. We report the number of minima and transition states, visualize the LFL using disconnectivity graphs, and analyze the correlation between train error, test error, and minima width.

The above values of v are small compared to many of the DNNs in typical applications, but they support loss functions with many local minima and allow us to identify clear trends where all of the optimization procedures can be converged accurately. We note that the lowest minimum discovered from BH remained the lowest in all cases on expanding the database, except for H=3 and Ndata=100. Our results therefore enabled us to focus on the fundamental organization of the landscape, independent of any optimization protocol.

After our study of the LJAT19 dataset, we confirmed that our key results also applied to the OPTDIG (Ndata ∈ {1,500, 5,000}, H ∈ {1, 3}) and WINE (Ndata = 1,500, H ∈ {1, 3}) datasets, where we simply report the disconnectivity graphs for comparison.

All of the results are labeled as “DATASET-#HL-#,” where the two numbers indicate, respectively, the number of hidden layers, H, and the amount of training data, Ndata.

Number of Minima.

Two key observations for LJAT19 are easily summarized (Table 1): The number of local minima, nmin, and the number of transition states, nts, decreased rapidly with increasing Ndata. All of the values reported in Table 1 are lower bounds; we expect the largest databases for Ndata = 100 with H = 2, 3 to be the least complete, but the low-lying regions of the LFL should be well sampled. The ranges of loss values of all located minima followed the same pattern (note the scale bars in Figs. 2 and 3). The number of minima for given Ndata also increased rapidly with H. For H = 1, which has the fewest stationary points, we also compared the landscapes for Ndata ∈ {250, 500, 3,000, 4,000, 5,000, 20,000, 30,000, 40,000, 50,000, 60,000, 70,000, 80,000, 90,000} to confirm the observed trends. Note that the disconnectivity graphs in Figs. 2 and 3 show only the lowest 2,000 minima for clarity.

Table 1.

Number of local minima and transition states for DNNs with varying number of hidden layers, H, and training data, Ndata

           H = 1               H = 2                   H = 3
Ndata      nmin      nts       nmin       nts          nmin       nts
100        649       10,426    299,484    508,471      585,730    1,126,877
1,000      7         33        13,336     41,837       85,150     263,615
2,000      5         21        487        1,583        3,027      35,363
10,000     7         43        49         393          150        804
100,000    5         11        33         200          23         96

The number of local minima was further examined for Ndata ∈ {100, 1,000} and networks with H = 1 containing n1 ∈ {3, 4, 5, 6, 10, 15} nodes. We found five or fewer minima in each case, with AUC values for both testing and training data between 0.79 and 0.80. Hence, it does not require many hidden nodes to reduce the solution space to a few minima with similar accuracy when we provide enough training data.

Connectivity of Minima.

Based on the disconnectivity graphs for the LJAT19 dataset (Figs. 2 and 3), it is evident that for sufficiently large Ndata, the loss function has a relatively simple structure corresponding to a funneled landscape, where finding the global minimum is straightforward (17). This “palm tree” (11) organization is especially clear in panels 1HL-100, 2HL-2,000, and 3HL-10,000 (and those with higher Ndata): Pathways from all of the minima with higher loss to the global minimum encountered relatively low downhill barriers. The structure for small Ndata with H = 2 or H = 3 is rather different: We see many competing low-lying minima with similar loss values. The observation of low barriers between the minima agrees with previous calculations (16). The PELs visualized for structural glasses support many amorphous structures, corresponding to minima with similar potential energy (48–52). However, these minima are separated by large barriers compared to the relevant thermal energy, producing a landscape (11) with a hierarchical structure, where metabasins can be defined in terms of cage-breaking rearrangements (48–52). In contrast, the barriers between the low-lying minima for 2HL-100 and 3HL-100 were relatively low; the structure at the bottom of these landscapes is more reminiscent of a mangrove swamp, and we do not see the hierarchical organization characteristic of structural glasses (48–52). Instead, the overall funneled structure probably agrees with the conclusion that local optimization is likely to be guided downhill toward low-lying minima (53).

Visualizations of the predictions of the DNN global minimum are included in Figs. 2 and 3; a description is provided below the figure with further details in SI Appendix. We see how the best solution converges to the same pattern as sufficient training data are supplied. The best AUC value here is around 0.8 and corresponds to convergence of the relative probabilities for predictions in the subspace of one missing input variable. Note that the class index colored in red is entirely absent in these visualizations, reflecting the difficulty of distinguishing the red from the gray class index in the LJAT3 problem without knowledge of all three inputs.

The same optimal solution was obtained for H ∈ {1, 2, 3}, but more training data are required for larger H (even though the number of variable weights that are optimized is very similar in each case). The enhanced expressibility of the DNNs with higher H may be reflected in the increased complexity of the patterns in the visualizations (see especially Ndata ∈ {100, 1,000}).

In Figs. 2 and 3, we also plot the training versus testing loss of all minima. The graphs for H ∈ {2, 3} reveal the following trends: There is a weak anticorrelation between the train-loss and test-loss values of low-lying minima when little training data are available (Ndata ≤ 1,000). This result suggests overfitting: Minima with loss values lower than the optimal value obtained in the large training-data limit must gain their advantage in a way that does not generalize to testing data. In this regime, the corresponding DNN models are likely to learn poorly when trained, because the lack of train-test loss correlation means poor generalizability, even though the LFL is easy to optimize and low-lying minima are readily identified. For a medium amount of training data (Ndata = 2,000), there is no clear correlation in the graphs, while for large amounts of training data (Ndata ≥ 10,000), there is a positive correlation. Interestingly, the lack of bare train-test loss correlation for medium (Ndata = 2,000), but not small (Ndata = 100), amounts of training data can be overcome by considering the correlation with the geometrical properties of the minimum, as discussed in the next section.

Relating Test Error and Basin Geometry.

To further investigate the correlation of train and test loss, Ltrain and Ltest, as well as how the loss of each minimum correlates with local curvatures, we performed a fit of the following two functions to the data:

$$L_\text{test}^{(1)}(L_\text{train}) = a_1 + b_1 L_\text{train}, \tag{3}$$
$$L_\text{test}^{(2)}(L_\text{train}, S) = a_2 + b_2 L_\text{train} + c_2 S, \tag{4}$$

where S is the log product of all Hessian eigenvalues of the respective minimum and is defined analogously to the vibrational entropy in molecular systems. The optimal fit parameters and values of the adjusted coefficient of determination, r², are reported in Table 2 (where results for H = 1 and Ndata ≥ 1,000 are omitted because nmin ≤ 7, so that proper statistical analysis is impossible). First, the trend of negative to positive correlation between Ltrain and Ltest with increasing Ndata is confirmed by the values of b1 and b2. Second, while adding the term proportional to S seems to be irrelevant in the case of Ndata ≥ 10,000 (as c2 is small, b2 is similar to b1, and r² changes only slightly), the results are very different for small Ndata: The parameter c2 increases by up to three orders of magnitude with decreasing Ndata; the optimal values of b1 and b2 differ drastically; and the r² value increases significantly between the two fits. The change between the fits is most pronounced for Ndata ∈ {1,000, 2,000}. We even find that for H = 2 and Ndata = 2,000, the correlation between train and test loss changes from negative (b1 < 0) to positive (b2 > 0), and r² increases from 4.7 × 10^-3 to 4.0 × 10^-1.
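Both fits of Eqs. 3 and 4 are ordinary linear least squares. The sketch below uses synthetic stand-in data, not the paper's minima; the coefficients 0.8 and 0.05 and the noise level are arbitrary assumptions, and the helper computing S from the Hessian eigenvalues follows the definition in the text:

```python
import numpy as np

def entropy_S(hessian):
    """Log product of the Hessian eigenvalues at a minimum (all positive
    there), the vibrational-entropy analogue S used in Eq. 4."""
    return float(np.sum(np.log(np.linalg.eigvalsh(hessian))))

rng = np.random.default_rng(1)

# Synthetic stand-in for the minima database: test loss depends on both
# the train loss and the curvature measure S, plus a little noise.
n = 200
L_train = rng.uniform(0.1, 1.0, n)
S = rng.uniform(-5.0, 5.0, n)
L_test = 0.2 + 0.8 * L_train + 0.05 * S + 0.01 * rng.standard_normal(n)

# Eq. 3: L_test = a1 + b1 * L_train
(a1, b1), *_ = np.linalg.lstsq(np.column_stack([np.ones(n), L_train]),
                               L_test, rcond=None)

# Eq. 4: L_test = a2 + b2 * L_train + c2 * S
(a2, b2, c2), *_ = np.linalg.lstsq(np.column_stack([np.ones(n), L_train, S]),
                                   L_test, rcond=None)
```

When the data truly carry an S-dependence, as here, only the second fit recovers the generating coefficients; comparing r² between the two fits then plays the same diagnostic role as in Table 2.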

Table 2.

Fitting parameters for correlation between training and testing loss of minima

Ndata      b1              r² (Eq. 3)   b2              c2               r² (Eq. 4)
H = 1
 100       -7.8 ± 1.1      0.068        4.7 ± 1.1       8.39 ± 0.88      0.18
H = 2
 100       -5.5 ± 0.027    0.097        4.858 ± 0.027   5.57 ± 0.05      0.12
 1,000     -2.529 ± 0.038  0.25         0.308 ± 0.033   0.4495 ± 0.0038  0.63
 2,000     -0.155 ± 0.085  0.0047       0.231 ± 0.07    0.151 ± 0.0085   0.4
 10,000    1.07 ± 0.1      0.68         0.9 ± 0.13      0.035 ± 0.016    0.71
 100,000   0.94 ± 0.057    0.89         0.944 ± 0.058   0.0034 ± 0.0047  0.89
H = 3
 100       -4.05 ± 0.017   0.09         4.678 ± 0.016   6.736 ± 0.03     0.16
 1,000     -2.998 ± 0.011  0.48         1.083 ± 0.013   0.5131 ± 0.0025  0.65
 2,000     -1.059 ± 0.045  0.15         0.063 ± 0.035   0.2122 ± 0.0037  0.6
 10,000    0.673 ± 0.075   0.35         0.672 ± 0.071   0.0267 ± 0.0069  0.41
 100,000   0.945 ± 0.023   0.99         0.939 ± 0.025   0.0019 ± 0.0027  0.99

This result suggests that knowledge of the curvature of a minimum (encoded in the entropy parameter S) may enhance the prediction of the performance of training LFL minima on testing data. However, we note that information on the curvature is only helpful when the model has the potential to overfit the data, which explains why the additional fitting parameter delivers little benefit for Ndata ≥ 10,000. Moreover, we note that S summarizes the information contained in the eigenvalues of all v (= 74, 69, 72 for H = 1, 2, 3) dimensions of the Hessian matrix. In future work, it would be interesting to investigate the distribution of those eigenvalues (55), which play a key role in determining the structure of the heat capacity analogue for the LFL (13, 14).

Performance of Minima.

The performance of all of the LJAT19 training minima can be analyzed for the accompanying test data in terms of the loss function and AUC. For consistency, we summarize results for training and testing datasets of equal size (Table 3); our general conclusions are not sensitive to which test data are chosen. The average barrier heights for individual transition states and for the lowest barrier pathway to the global minimum are collected in SI Appendix, Tables S1 and S2.

Table 3.

Average loss function and AUC values for train and test data of all minima found in training for databases listed in Table 1

         H = 1                            H = 2                            H = 3
         Train          Test             Train          Test              Train          Test
Ndata    Loss    AUC    Loss    AUC      Loss    AUC    Loss    AUC       Loss    AUC    Loss    AUC
100      0.297   0.953  1.453   0.547    0.166   0.995  2.344   0.557     0.121   0.994  2.424   0.564
1,000    0.519   0.810  0.552   0.796    0.511   0.823  0.572   0.787     0.502   0.829  0.596   0.779
2,000    0.539   0.806  0.548   0.795    0.534   0.809  0.550   0.792     0.531   0.812  0.556   0.788
10,000   0.546   0.801  0.559   0.801    0.542   0.802  0.557   0.801     0.542   0.803  0.557   0.801
100,000  0.547   0.797  0.551   0.796    0.543   0.798  0.548   0.797     0.543   0.798  0.548   0.797

Comparison with Other Datasets.

To check that our results are transferable to very different prediction problems, we analyzed the LFLs for the OPTDIG and WINE datasets and present their disconnectivity graphs in Fig. 4. The number of weight variables for OPTDIG, with n0 = 64, nH+1 = 10, H = 1, 3, and nl = 5, 4 per hidden layer, is v = 385, 350, and for WINE, with n0 = 11, nH+1 = 3, H = 1, 3, and nl = 5, 3 per hidden layer, is v = 78, 72. The structure of these landscapes appears to be qualitatively similar to LJAT19: There are no large barriers separating low-lying local minima, which would hinder efficient optimization.

Fig. 4.


Disconnectivity graphs for the training datasets OPTDIG (with Ndata ∈ {1,500, 5,000} and H ∈ {1, 3}) and WINE (with Ndata = 1,500 and H ∈ {1, 3}). Only the lowest 2,000 minima (or all of them, if fewer than 2,000 were found) are shown. The vertical scale is adjusted to span the range of loss-function values within this set.

Conclusions

The main result of this study is that the LFLs of DNNs have archetypal structures, depending on the network depth and the amount of data employed. The single-layer networks generally feature single-funneled landscapes, and in the high-Ndata regime, the number of minima is very low (fewer than 10 in some cases). Such networks are easy to optimize.

This situation changes significantly when we move to a multilayer system (while keeping the number of variables approximately fixed), where the number of minima grows rapidly with increased depth. In the high-Ndata limit, the LFL remains funneled with low barriers, and it is still easy to locate low-lying solutions. The landscapes are rather different when few data points are provided. There exist many minima with similar loss values, but they are connected by low barriers. This organization suggests that a reasonable optimization procedure will still reach low-lying minima, since we do not expect to see the slow dynamics characteristic of structural glasses, where the PEL has a hierarchical organization (48–52). The structure that our analysis reveals could be interpreted as a somewhat perturbed convex function, which corresponds to a funneled landscape (17).

We stress that these results are relevant for the optimizability of DNNs. One might expect that training of DNNs, which involves the optimization of complex, high-dimensional, nonconvex loss functions, could suffer from trapping in local minima. However, our study shows that the absence of high barriers between minima allows techniques such as stochastic gradient descent to reach low-lying solutions in the LFL.

Unsurprisingly, when few data are supplied, the large number of minima in the training LFL of DNNs spreads across a wide range of loss values, and the same is true for testing data. In this limit, the system is prone to overfitting, which is manifested in anticorrelation between training and testing loss values. We find a strong correlation between the local curvatures (represented by the Hessian eigenvalues and the corresponding entropy) and the generalizability of minima in the deep, medium-Ndata regime. Hence, we conclude that entropic contributions to the optimization dynamics may guide the system to minima that generalize well.

Extending our study to larger networks (3) is not straightforward, because the expected exponential growth in the number of stationary points with the number of variables (55, 56) makes enumeration impossible. However, the trends for small networks observed in this contribution, especially the arrangement of low-lying minima separated by small barriers, appear to be consistent with theory based on spin-glass models (3).

Previous studies by Baity-Jesi et al. (57) of DNN training dynamics were used to infer properties of the LFL, whereas our work directly interrogates the organization of the landscape. Our results are consistent with ref. 57, confirming that the dynamics of DNNs, unlike structural glasses, do not suffer from high loss barriers. However, our “microscopic” view of the energy landscape revises the hypothesis of flat basins and, instead, would attribute the observed slow dynamics to the proliferation of multiple shallow minima.

There still remain many interesting open questions concerning the properties of DNNs, and the various theoretical and computational tools developed to understand and explore PELs in chemical physics (17, 22, 25–27) could provide new insights. For example, the number of local minima and transition states in an atomic system are expected to grow exponentially with the number of degrees of freedom, D, as exp(γD) and D exp(γD), respectively (55, 56). The exponential factor γ is larger for short-range interatomic potentials (58–61), because more local minima are possible when the effect of distant atoms is smaller. The number of stationary points may increase with the number of hidden layers for the same reason: Nodes in nonadjacent layers are not directly connected, so weights that correspond to local stability may combine almost combinatorially. A more quantitative analysis of this effect has been presented for atomic systems using catastrophe theory (60), and the same ideas should apply to the properties of multilayer neural networks. We will present further analysis of this effect elsewhere.


Acknowledgments

P.C.V. thanks Dr. John Morgan for assistance with the GMIN, OPTIM, and PATHSAMPLE programs, as well as Victor A. T. Jouffrey and Annika Schubert for useful discussions. A.A.L. was supported by the Winton Program for the Physics of Sustainability. P.C.V. and D.J.W. were supported by the Engineering and Physical Sciences Research Council.

Footnotes

The authors declare no competing interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at https://www.pnas.org/lookup/suppl/doi:10.1073/pnas.1919995117/-/DCSupplemental.

Data Availability.

All of the energy-landscapes exploration software, documentation, examples, and further resources are available for download under the GNU General Public License from the Cambridge Landscape Database (62). The input data and program instructions used to produce the results for the present work have been made available on the Apollo repository, https://doi.org/10.17863/CAM.55772 (63).

References

  • 1.Silver D., et al. , A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362, 1140–1144, (2018). [DOI] [PubMed] [Google Scholar]
  • 2.Mei S., Montanari A., Nguyen P.-M., A mean field view of the landscape of two-layer neural networks. Proc. Natl. Acad. Sci. U.S.A. 115, 7665–7671 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Choromanska A., Henaff M. B., Mathieu M., Ben Arous G., LeCun Y., “The loss surfaces of multilayer networks” in Proceedings of Machine Learning Research (PMLR), Lebanon G., Vishwanathan S. V. N., Eds. (PMLR, Cambridge, MA, 2015) vol. 38, pp. 192–204. [Google Scholar]
  • 4.Hochreiter S., Schmidhuber J., “Simplifying neural nets by discovering flat minima” in NIPS’94: Proceedings of the 7th International Conference on Neural Information Processing Systems, Tesauro G., Touretzky D. S., Leen T. K., Eds. (MIT Press, Cambridge, MA, 1995), pp. 529–536. [Google Scholar]
  • 5.Hochreiter S., Schmidhuber J., Flat minima. Neural Comput. 9, 1–42 (1997). [DOI] [PubMed] [Google Scholar]
  • 6.Shirish Keskar N., Mudigere D., Nocedal J., Smelyanskiy M., Tang P. T. P., On large-batch training for deep learning: Generalization gap and sharp minima. arXiv:1609.04836v2 (9 February 2017).
  • 7.Jastrzebski S., et al. , “Width of minima reached by stochastic gradient descent is influenced by learning rate to batch size ratio,” in Artificial Neural Networks and Machine Learning – ICANN, Kůrková V., Manolopoulos Y., Hammer B., Iliadis L., Maglogiannis I., Eds. (Springer, Cham, Switzerland, 2018), vol. 11141. [Google Scholar]
  • 8.Baldassi C., et al. , Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proc. Natl. Acad. Sci. U.S.A. 113, 7655–7662 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Chaudhari P., et al. , Entropy-SGD: Biasing gradient descent into wide valleys. J. Stat. Mech. Theory Exp. 2019, 124019 (2019). [Google Scholar]
  • 10.Becker O. M., Karplus M., The topology of multidimensional potential energy surfaces: Theory and application to peptide structure and kinetics. J. Chem. Phys. 106, 1495–1517 (1997). [Google Scholar]
  • 11.Wales D. J., Miller M. A., Walsh T. R., Archetypal energy landscapes. Nature 394, 758–760 (1998). [Google Scholar]
  • 12.Pavlovskaia M., Tu K., Zhu S.-C., Mapping energy landscapes of non-convex learning problems. arXiv:1410.0576 (2 October 2014).
  • 13.Ballard A. J., Stevenson J. D., Das R., Wales D. J., Energy landscapes for a machine learning application to series data. J. Chem. Phys. 144, 124119 (2016). [DOI] [PubMed] [Google Scholar]
  • 14.Das R., Wales D. J., Energy landscapes for a machine-learning prediction of patient discharge. Phys. Rev. E 93, 063310 (2016). [DOI] [PubMed] [Google Scholar]
  • 15.Li H., Xu Z., Taylor G., Studer C., Goldstein T., “Visualizing the loss landscape of neural nets” in Advances in Neural Information Processing Systems, Bengio S., et al., Eds. (NIPS, 2018). [Google Scholar]
  • 16.Draxler F., Veschgini K., Salmhofer M., Hamprecht F. A., “Essentially no barriers in neural network energy landscape” in Proceedings of the 35th International Conference on Machine Learning, Dy J., Krause A., Eds. (PMLR, Cambridge, MA, 2018). [Google Scholar]
  • 17.Wales D. J., Energy Landscapes (Cambridge University Press, Cambridge, UK, 2003). [Google Scholar]
  • 18.Mezey P. G., Potential Energy Hypersurfaces (Elsevier, Amsterdam, Netherlands, 1987). [Google Scholar]
  • 19.Alpaydin E., Kaynak C., Cascading classifiers. Kybernetika 34, 369–374 (1998). [Google Scholar]
  • 20.Cortez P., Cerdeira A., Almeida F., Matos T., Reis J., Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst. 47, 547–553 (2009). [Google Scholar]
  • 21.Das R., Wales D. J., Machine learning prediction for classification of outcomes in local minimisation. Chem. Phys. Lett. 667, 158–164 (2017). [Google Scholar]
  • 22.Wales D. J., Exploring energy landscapes. Annu. Rev. Phys. Chem. 69, 401–425 (2018). [DOI] [PubMed] [Google Scholar]
  • 23.Dua D., Graff C., UCI machine learning repository. (2017). http://archive.ics.uci.edu/ml. Accessed 24 April 2020.
  • 24.Wang T. E., Gu Y., Mehta D., Zhao X., Bernal E. A., Towards robust deep neural networks. arXiv:1810.11726v2 (4 December 2018). [Google Scholar]
  • 25.Wales D. J., Decoding the energy landscape: Extracting structure, dynamics and thermodynamics. Phil. Trans. Roy. Soc. A 370, 2877–2899 (2012). [DOI] [PubMed] [Google Scholar]
  • 26.Ballard A. J., et al. , Energy landscapes for machine learning. Phys. Chem. Chem. Phys. 19, 12585–12603 (2017). [DOI] [PubMed] [Google Scholar]
  • 27.Joseph J. A., Röder K., Chakraborty D., Mantell R. G., Wales D. J., Exploring biomolecular energy landscapes. Chem. Commun. 53, 6974–6988 (2017). [DOI] [PubMed] [Google Scholar]
  • 28.Li Z., Scheraga H. A., Monte Carlo-minimization approach to the multiple-minima problem in protein folding. Proc. Natl. Acad. Sci. U.S.A. 84, 6611–6615 (1987). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Li Z., Scheraga H. A., Structure and free energy of complex thermodynamic systems. J. Mol. Struct. 179, 333–352 (1988). [Google Scholar]
  • 30.Wales D. J., Doye J. P. K., Global optimization by basin-hopping and the lowest energy structures of Lennard-Jones clusters containing up to 110 atoms. J. Phys. Chem. 101, 5111–5116 (1997). [Google Scholar]
  • 31.Murrell J. N., Laidler K. J., Symmetries of activated complexes. Trans. Faraday Soc. 64, 371–377 (1968). [Google Scholar]
  • 32.Trygubenko S. A., Wales D. J., A doubly nudged elastic band method for finding transition states. J. Chem. Phys. 120, 2082–2094 (2004). [DOI] [PubMed] [Google Scholar]
  • 33.Trygubenko S. A., Wales D. J., Analysis of cooperativity and localization for atomic rearrangements. J. Chem. Phys. 121, 6689–6697 (2004). [DOI] [PubMed] [Google Scholar]
  • 34.Henkelman G., Uberuaga B. P., Jónsson H., A climbing image nudged elastic band method for finding saddle points and minimum energy paths. J. Chem. Phys. 113, 9901–9904 (2000). [Google Scholar]
  • 35.Henkelman G., Jónsson H., Improved tangent estimate in the nudged elastic band method for finding minimum energy paths and saddle points. J. Chem. Phys. 113, 9978–9985 (2000). [Google Scholar]
  • 36.Munro L. J., Wales D. J., Defect migration in crystalline silicon. Phys. Rev. B 59, 3969–3980 (1999). [Google Scholar]
  • 37.Zeng Y., Xiao P., Henkelman G., Unification of algorithms for minimum mode optimization. J. Chem. Phys. 140, 044115 (2014). [DOI] [PubMed] [Google Scholar]
  • 38.Broyden C. G., The convergence of a class of double-rank minimization algorithms 1. general considerations. J. Inst. Math. Appl. 6, 76–90 (1970). [Google Scholar]
  • 39.Fletcher R., A new approach to variable metric algorithms. Comput. J. 13, 317–322 (1970). [Google Scholar]
  • 40.Goldfarb D., A family of variable-metric methods derived by variational means. Math. Comput. 24, 23–26 (1970). [Google Scholar]
  • 41.Shanno D. F., Conditioning of quasi-Newton methods for function minimization. Math. Comput. 24, 647–656 (1970). [Google Scholar]
  • 42.Rao F., Caflisch A., The protein folding network. J. Mol. Biol. 342, 299–306 (2004). [DOI] [PubMed] [Google Scholar]
  • 43.Noé F., Fischer S., Transition networks for modeling the kinetics of conformational change in macromolecules. Curr. Opin. Struct. Biol. 18, 154–162 (2008). [DOI] [PubMed] [Google Scholar]
  • 44.Prada-Gracia D., Gómez-Gardenes J., Echenique P., Falo F., Exploring the free energy landscape: From dynamics to networks and back. PLoS Comput. Biol. 5, e1000415 (2009). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Wales D. J., Energy landscapes: Some new horizons. Curr. Opin. Struct. Biol. 20, 3–10 (2010). [DOI] [PubMed] [Google Scholar]
  • 46.Mehta D., Zhao X., Bernal E. A., Wales D. J., Loss surface of XOR artificial neural networks. Phys. Rev. E 97, 052307 (2018). [DOI] [PubMed] [Google Scholar]
  • 47.Carr J. M., Trygubenko S. A., Wales D. J., Finding pathways between distant local minima. J. Chem. Phys. 122, 234903 (2005). [DOI] [PubMed] [Google Scholar]
  • 48.de Souza V. K., Wales D. J., Energy landscapes for diffusion: Analysis of cage-breaking processes. J. Chem. Phys. 129, 164507 (2008). [DOI] [PubMed] [Google Scholar]
  • 49.de Souza V. K., Wales D. J., Connectivity in the potential energy landscape for binary Lennard-Jones systems. J. Chem. Phys. 130, 194508 (2009). [DOI] [PubMed] [Google Scholar]
  • 50.de Souza V. K., Stevenson J. D., Niblett S. P., Farrell J. D., Wales D. J., Defining and quantifying frustration in the energy landscape: Applications to atomic and molecular clusters, biomolecules, jammed and glassy systems. J. Chem. Phys. 146, 124103 (2017). [DOI] [PubMed] [Google Scholar]
  • 51.Niblett S. P., de Souza V. K., Stevenson J. D., Wales D. J., Dynamics of a molecular glass former: Energy landscapes for diffusion in ortho-terphenyl. J. Chem. Phys. 145, 024505 (2016). [DOI] [PubMed] [Google Scholar]
  • 52.Niblett S. P., Biedermann M., Wales D. J., de Souza V. K., Pathways for diffusion in the potential energy landscape of the network glass former SiO2. J. Chem. Phys. 147, 152726 (2017). [DOI] [PubMed] [Google Scholar]
  • 53.Wu L., Zhu Z., E W., Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv:1706.10239v2 (28 November 2017).
  • 54.Sagun L., Bottou L., LeCun Y., Eigenvalues of the Hessian in deep learning: Singularity and beyond. arXiv:1611.07476 (22 November 2016).
  • 55.Stillinger F. H., Weber T. A., Packing structures and transitions in liquids and solids. Science 225, 983–989 (1984). [DOI] [PubMed] [Google Scholar]
  • 56.Wales D. J., Doye J. P. K., Stationary points and dynamics in high-dimensional systems. J. Chem. Phys. 119, 12409–12416 (2003). [Google Scholar]
  • 57.Baity-Jesi M., et al. , Comparing dynamics: Deep neural networks versus glassy systems. J. Stat. Mech. Theor. Exp. 2019, 124013 (2019). [Google Scholar]
  • 58.Braier P. A., Berry R. S., Wales D. J., How the range of pair interactions governs features of multidimensional potentials. J. Chem. Phys. 93, 8745 (1990). [Google Scholar]
  • 59.Doye J. P. K., Wales D. J., The structure and stability of atomic liquids—from clusters to bulk. Science 271, 484–487 (1996). [Google Scholar]
  • 60.Wales D. J., A microscopic basis for the global appearance of energy landscapes. Science 293, 2067–2069 (2001). [DOI] [PubMed] [Google Scholar]
  • 61.Wales D. J., Energy landscapes of clusters bound by short-ranged potentials. ChemPhysChem 11, 2491–2494 (2010). [DOI] [PubMed] [Google Scholar]
  • 62.Wales D. J., et al. , The Cambridge Energy Landscape Database. http://www-wales.ch.cam.ac.uk/CCD.html. Accessed 23 July 2020. [Google Scholar]
  • 63.Verpoort P. C., Lee A. A., Wales D. J., Research data supporting “Archetypal landscapes for deep neural networks”. Apollo–University of Cambridge Repository 10.17863/CAM.55772. Accessed 6 August 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
