Abstract
Occam’s razor is the principle that, all else being equal, simpler explanations should be preferred over more complex ones. This principle is thought to guide human decision-making, but the nature of this guidance is not known. Here we used preregistered behavioral experiments to show that people tend to prefer the simpler of two alternative explanations for uncertain data. These preferences match predictions of formal theories of model selection that penalize excessive flexibility. These penalties emerge when considering not just the best explanation but the integral over all possible, relevant explanations. We further show that these simplicity preferences persist in humans, but not in certain artificial neural networks, even when they are maladaptive. Our results imply that principled notions of statistical model selection, including integrating over possible, latent causes to avoid overfitting to noisy observations, may play a central role in human decision-making.
Occam’s razor formalized as model selection
To make decisions in the real world, we must often choose between multiple, plausible explanations for noisy, sparse data. When evaluating competing explanations, Occam’s razor says that we should consider not just how well they account for data that have been observed, but also their potentially excessive flexibility in describing alternative, and potentially irrelevant, data that have not been observed (Baker, 2022) (e.g., “a ghost did it!”, Figure 1a,b). This kind of simplicity preference has long been proposed as an organizing principle for mental function (Feldman, 2016), such as in the early concept of Prägnanz in Gestalt psychology (Koffka, 2014; Wagemans et al., 2012), a number of “minimum principles” for vision (Hatfield, 1985), and theories that posit a central role for data compression in cognition (Chater & Vitányi, 2003). However, despite evidence that human decision-makers can exhibit simplicity preferences under certain task conditions (Genewein & Braun, 2014; Gershman & Niv, 2013; Johnson et al., 2014; Little & Shiffrin, 2009; Pothos & Chater, 2002), we lack a principled understanding of what, exactly, constitutes the “simplicity” that is favored (or, equivalently, “complexity” that is disfavored) and how we balance that preference with the evidence provided by the observed data when we make decisions.
Figure 1. Formalizing Occam’s razor as Bayesian model selection to understand simplicity preferences in human decision-making.
a: Occam’s razor prescribes an aversion to complex explanations (models). In Bayesian model selection, model complexity quantifies the flexibility of a model, or its capacity to account for a broad range of empirical observations. In this example, we observe an apple falling from a tree (left) and compare two possible explanations: 1) classical mechanics, and 2) the intervention of a ghost. b: Schematic comparison of the evidence of the two models in a. Classical mechanics (pink) explains a narrower range of observations than the ghost (green), which is a valid explanation for essentially any conceivable phenomenon (e.g., both a falling and spinning-upward trajectory, as in the insets). Absent further evidence and given equal prior probabilities, Occam’s razor posits that the simpler model (classical mechanics) is preferred, because its hypothesis space is more concentrated around the sparse, noisy data and thus avoids “overfitting” to noise. c: A geometrical view of the model-selection problem. Two alternative models are represented as geometrical manifolds, and the maximum-likelihood point for each model is represented as the projection of the data (red star) onto the manifolds. d: Systematic expansion of the log evidence of a model (see previous work by Balasubramanian (1997) and Methods section A.2). $\hat{\vartheta}$ is the maximum-likelihood point on model $\mathcal{M}$ for data $X$, $N$ is the number of observations, $d$ is the number of parameters of the model, $\nabla$ is the likelihood gradient evaluated at $\hat{\vartheta}$, $h$ is the observed Fisher information matrix, and $g$ is the expected Fisher information matrix (see Methods). $g$ captures how distinguishable elements of $\mathcal{M}$ are in the neighborhood of $\hat{\vartheta}$ (see Methods section A.2 and previous work by Balasubramanian (1997)). When $\mathcal{M}$ is the true source of the data, $h$ can be seen as a noisy version of $g$, estimated from limited data (Balasubramanian, 1997). $\hat{g}$ is a shorthand for $g(\hat{\vartheta})$, and $\lVert\nabla\rVert_{\hat{g}}$ is the length of $\nabla$ measured in the metric defined by $\hat{g}$. The ellipsis collects terms that decrease as $N$ grows. Each term of the expansion represents a distinct geometrical feature of the model (Balasubramanian, 1997): dimensionality penalizes models with many parameters; boundary (a novel contribution of this work) penalizes models for which $\hat{\vartheta}$ is on the boundary; volume counts the number of distinguishable probability distributions contained in $\mathcal{M}$; and robustness captures the shape (curvature) of $\mathcal{M}$ near $\hat{\vartheta}$ (see Methods section A.2 and previous work by Balasubramanian (1997)). e: Psychophysical task with variants designed to probe each geometrical feature in d. For each trial, a random location on one model was selected (gray star), and data (red dots) were sampled from a Gaussian centered around that point (gray shading). The red star represents the empirical centroid of the data, by analogy with c. The maximum-likelihood point can be found by projecting the empirical centroid onto one of the models. Participants saw the models (black lines) and data (red dots) only and were required to choose which model was best for the data. Insets: task performance for the given task variant, for a set of 100 simulated ideal Bayesian observers (orange) versus a set of 100 simulated maximum-likelihood observers (i.e., choosing based only on whichever model was the closest to the empirical centroid of the data on a given trial; cyan).
To address this issue quantitatively, we formalize decision-making as a model-selection problem: given a set of observations $X$ and a set of possible parametric statistical models $\{\mathcal{M}_1, \mathcal{M}_2, \ldots\}$, we seek the model $\mathcal{M}$ that in some sense is the best description of the data $X$. In this context, Occam’s razor can be interpreted as requiring the goodness-of-fit of a model to be penalized by some measure of its flexibility, or complexity, when comparing it against other models. Bayesian statistics offers a natural characterization of such a measure of complexity and specifies the way in which it should be traded off against goodness-of-fit to maximize decision accuracy, typically because the increased flexibility provided by increased complexity tends to cause errors by overfitting to noise in the observations (Balasubramanian, 1997; Good, 1968; Gull, 1988; Jaynes, 2003; H. Jeffreys, 1939; W. Jeffreys & Berger, 1991; MacKay, 1992; Smith & Spiegelhalter, 1980). According to the Bayesian framework, models should be compared based on their evidence or marginal likelihood $p(X \mid \mathcal{M}) = \int \mathrm{d}\vartheta \, p(X \mid \vartheta, \mathcal{M}) \, p(\vartheta \mid \mathcal{M})$, where $\vartheta$ represents the model parameters and $p(\vartheta \mid \mathcal{M})$ their associated prior. By varying the parameters, we can explore instances of a model and sweep out a manifold of possible descriptions of the data. Two such manifolds are visualized in Figure 1c, along with the maximum-likelihood parameters that assign the highest probability to the observed data.
Under mild regularity assumptions and with sufficient data, the (log) evidence can be written as the sum of the maximum log likelihood of $\mathcal{M}$ and several penalty factors (Figure 1d). These penalty factors, which are found even when the prior (data-independent) probabilities of the models under consideration are equal, can be interpreted as providing quantitatively defined preferences against certain models according to specific forms of complexity that they embody (Balasubramanian, 1997; MacKay, 1992). This approach, which we call the Fisher Information Approximation (FIA), has been used to identify worse-fitting, but better-generalizing, psychophysical models describing the relationship between physical variables (e.g., light intensity) and their psychological counterparts (e.g., brightness) (Myung et al., 2000). It is related to similar quantitative definitions of statistical model complexity, such as the Minimum Description Length (Grünwald, 2007; Lanterman, 2001; Rissanen, 1996), Minimum Message Length (Wallace, 2005), and Predictive Information (Bialek et al., 2001) frameworks. A key feature of the FIA is that if the prior over parameters is taken to be uninformative (Jaynes, 2003), each penalty factor can be shown to capture a distinct geometric property of the model (Balasubramanian, 1997). These properties include not just the model’s dimensionality (number of parameters), whose penalty corresponds to the well-known Bayesian Information Criterion (BIC) for model selection (Neath & Cavanaugh, 2012; Schwarz, 1978), but also its boundary (a novel term, detailed in Methods section A.2; see also the Discussion), volume, and shape (Figure 1c,d).
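For reference, the classic form of this expansion (Balasubramanian, 1997; Myung et al., 2000) is sketched below in the notation of Figure 1d; this is the standard three-penalty version and omits the boundary term that is newly derived in Methods section A.2:

```latex
% Standard Fisher Information Approximation (without the novel boundary term):
% goodness of fit, then dimensionality, volume, and robustness penalties.
-\ln p(X \mid \mathcal{M}) \;\simeq\;
  -\ln p(X \mid \hat{\vartheta})
  \;+\; \frac{d}{2} \ln \frac{N}{2\pi}
  \;+\; \ln \int_{\Theta} \mathrm{d}\vartheta \, \sqrt{\det g(\vartheta)}
  \;+\; \frac{1}{2} \ln \frac{\det h(\hat{\vartheta})}{\det g(\hat{\vartheta})}
  \;+\; \cdots
```

Here $\hat{\vartheta}$ is the maximum-likelihood parameter, $d$ the number of parameters, $N$ the number of observations, $g$ the expected and $h$ the observed Fisher information, and the ellipsis collects terms that shrink as $N$ grows.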
The complexity penalties depicted in Figure 1 emerge because the Bayesian framework marginalizes over the model parameters. In applying this framework to human decision-making, we interpret this marginalization as an integration over latent causes: to evaluate a particular explanation (or “model”) for a given set of observed data, one considers how likely the data are under that explanation, on average over all possible configurations of that explanation. Intuitively, flexible explanations are penalized by the averaging because many of their configurations have nothing to do with the observed state of the world and thus possess a vanishingly small likelihood $p(X \mid \vartheta, \mathcal{M})$. Consider the following example, in which the data are represented by a point $x$ on a plane (Figure 2, top left). The problem is to decide between two alternative explanations (models) for the data: 1) $\mathcal{M}_1$, a Gaussian distribution centered at (0,0) with unit, isotropic variance; and 2) $\mathcal{M}_2$, a parametric family of Gaussians, also with unit variance, but with centers located anywhere along the straight line connecting (−1/2,1) and (1/2,1). It is clear that $\mathcal{M}_2$ can explain a wider range of data, just like the ghost in Figure 1a, and is therefore more complex. For data that are equidistant from the two models, Occam’s razor prescribes that we should choose $\mathcal{M}_1$. In other words, the decision boundary separating the area where one or the other model should be preferred is closer to $\mathcal{M}_2$ (the more complex model) than to $\mathcal{M}_1$ (the simpler one). This simplicity bias is specific to a decision-maker that integrates over the latent causes (model configurations) and does not result from sampling multiple possible explanations via other, less systematic means, for example by adding sensory and/or choice noise (Figure 2; see also Supplementary Figure B.9).
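To make this intuition concrete, the example above can be reproduced with a short numerical sketch (illustrative only, not part of the study’s analysis code; the grid used to discretize the latent center of $\mathcal{M}_2$ and the test datapoint are arbitrary choices). The sketch computes the marginal likelihood of $\mathcal{M}_2$ by averaging the likelihood over candidate centers along the segment and compares it with the likelihood under $\mathcal{M}_1$:

```python
import numpy as np
from scipy.stats import multivariate_normal

def likelihood(x, center):
    """Likelihood of datapoint x under an isotropic Gaussian with unit variance."""
    return multivariate_normal.pdf(x, mean=center, cov=np.eye(2))

def evidence_m1(x):
    # M1: a single Gaussian centered at the origin (no free parameters).
    return likelihood(x, np.array([0.0, 0.0]))

def evidence_m2(x, n_grid=501):
    # M2: Gaussians centered anywhere on the segment from (-1/2, 1) to (1/2, 1);
    # the evidence marginalizes over the latent center with a uniform prior.
    centers = np.stack([np.linspace(-0.5, 0.5, n_grid), np.ones(n_grid)], axis=1)
    return np.mean([likelihood(x, c) for c in centers])

# A datapoint equidistant (distance 1/2) from the dot at (0,0) and from the bar at y=1:
x = np.array([0.0, 0.5])
print(f"evidence M1 = {evidence_m1(x):.4f}, evidence M2 = {evidence_m2(x):.4f}")
```

For this equidistant datapoint the marginal likelihood of $\mathcal{M}_1$ exceeds that of $\mathcal{M}_2$: averaging over the latent center dilutes the evidence for the flexible model, which is exactly the shift of the decision boundary toward $\mathcal{M}_2$ illustrated in Figure 2f.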
Figure 2. Integration over latent causes leads to Occam’s razor.
a: Schematic of a simple decision-making scenario. A single datapoint (star) is sampled from one of two models (pink dot, green bar). One of the models ($\mathcal{M}_1$) is a Gaussian with known variance, centered at the location of the pink dot. The other model ($\mathcal{M}_2$) is a parametric family of Gaussians, with known and fixed variance and center located at a latent location along the green bar. Cyan line: boundary indicating locations in data space that are equidistant from $\mathcal{M}_1$ and $\mathcal{M}_2$. b−d: Potential components of a decision-making observer for this scenario, which we call the Noise-Integration-Noise observer (see Methods section A.1 and Supplementary Information section B.7 for further details). b: Sensory noise: the observer does not have access to the true data (location of the star), but a noisy version of it corrupted by Gaussian noise with variance ρ. c: Integration over latent causes: the observer can consider possible positions of the center of the Gaussian in model $\mathcal{M}_2$. d: Choice noise: after forming an internal estimate of the relative likelihood of $\mathcal{M}_1$ and $\mathcal{M}_2$, the observer can choose a model based on a deterministic process (for instance, always pick the most likely one) or a stochastic one, where the probability of sampling one model is related to its likelihood. e−h: Behavior of the observer as a function of the location of the datapoint, within the zoomed-in region highlighted in a, and of the presence of the mechanisms illustrated in b−d. e: the observer’s choice probability (for reporting $\mathcal{M}_1$ versus $\mathcal{M}_2$) as a function of the location of the datapoint, when sensory and choice noise are low and in the absence of integration over latent causes. f: same as e, but in the presence of integration over latent causes. The decision boundary of the observer (white area) is shifted towards the more complex model ($\mathcal{M}_2$) compared to e. This shift means that, when the data are equidistant from $\mathcal{M}_1$ and $\mathcal{M}_2$, the observer prefers the simpler model ($\mathcal{M}_1$). g: same as e, but with strong sensory noise. The decision boundary of the observer is shifted in the opposite direction as in f. h: same as e, but with strong choice noise. Choice noise has no effect on the location of the decision boundary.
Humans exhibit theoretically grounded simplicity preferences
To relate the FIA complexity terms to the potential preferences exhibited by both human and artificial decision-makers, we designed a simple decision-making task. For each trial, a set of simultaneously presented, noisy observations (red dots in Figure 1e) was sampled from a 2D Normal (“generative”) distribution centered somewhere within one of two possible shapes (black shapes in Figure 1e). The identity of the shape generating the data (top versus bottom) was chosen at random with equal probability. Likewise, the location of the center of the Normal distribution within the selected shape was sampled uniformly at random, in a way that did not depend on the model parameterization, by using the Jeffreys prior (Jaynes, 2003). Given the observations, the decision-maker chose the shape (model) that was more likely to contain the center of the generative distribution. We used four task variants, each designed to probe primarily one of the distinct geometrical features that are penalized in Bayesian model selection (i.e., a Bayesian observer is expected to have a particular, quantitative preference away from the more-complex alternative in each pair; Figure 1c and d). For this task, the FIA provided a good approximation of the exact Bayesian posterior (Supplementary Information section B.1). Accordingly, simulated observers that increasingly integrated over latent causes, like the Bayesian observer, exhibited increasing FIA-like biases. These biases were distinguishable from (and degraded by) effects of increasing sensory and/or choice noise (Supplementary Information B.7.1 and Supplementary Figure B.10).
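The generative process of a trial, and the two reference observers shown in the insets of Figure 1e, can be sketched as follows (a simplified illustration, not the study’s task code: the shapes loosely mimic the dimensionality variant, and the noise level, number of dots, number of trials, and grid resolution are assumed values chosen for the example):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
SIGMA = 1.0    # assumed standard deviation of the generative distribution (illustrative)
N_DOTS = 10    # assumed number of dots per trial (illustrative)

# Two "shapes" loosely mimicking the dimensionality variant: a point and a segment.
M_SIMPLE = np.array([[0.0, -0.5]])
M_COMPLEX = np.stack([np.linspace(-1.0, 1.0, 101), np.full(101, 0.5)], axis=1)

def sample_trial():
    """Pick a shape at random, pick a latent center on it, and sample noisy dots."""
    true_model = rng.integers(2)
    shape = (M_SIMPLE, M_COMPLEX)[true_model]
    center = shape[rng.integers(len(shape))]
    return true_model, center + SIGMA * rng.standard_normal((N_DOTS, 2))

def log_evidence(dots, shape):
    """Log marginal likelihood of the dots, averaging over candidate centers on the shape."""
    logps = np.array([multivariate_normal.logpdf(dots, mean=c, cov=SIGMA**2 * np.eye(2)).sum()
                      for c in shape])
    m = logps.max()
    return m + np.log(np.mean(np.exp(logps - m)))

def choose_bayes(dots):
    """Ideal Bayesian observer: integrates over latent centers before choosing."""
    return int(log_evidence(dots, M_COMPLEX) > log_evidence(dots, M_SIMPLE))

def choose_max_likelihood(dots):
    """Maximum-likelihood observer: picks the shape closest to the empirical centroid."""
    centroid = dots.mean(axis=0)
    d_simple = np.linalg.norm(centroid - M_SIMPLE[0])
    d_complex = np.min(np.linalg.norm(M_COMPLEX - centroid, axis=1))
    return int(d_complex < d_simple)

trials = [sample_trial() for _ in range(1000)]
acc_bayes = np.mean([choose_bayes(d) == m for m, d in trials])
acc_ml = np.mean([choose_max_likelihood(d) == m for m, d in trials])
print(f"Bayesian observer accuracy: {acc_bayes:.3f}; maximum-likelihood observer: {acc_ml:.3f}")
```

Note that the shapes and noise level here are not the task’s actual parameters; in the experiments, the analogous performance gap between these two observers was only ∼1% (see below).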
For our human studies, we used the on-line research platform Pavlovia to implement the task, and Prolific to recruit and test participants. Following our preregistered approaches (Piasini et al., 2020, 2021, 2022), we collected data from 202 participants, divided into four groups that each performed one of the four separate versions of the task depicted in Figure 1e (each group comprised ∼50 participants). We provided instructions that used the analogy of seeds from a flower located in one of two flower beds, to provide an intuitive framing of the key concepts of noisy data generated by a particular instance of a parametric model from one of two model families. To minimize the possibility that participants would simply learn from implicit or explicit feedback over the course of each session to make more optimal (i.e., simplicity-preferring) choices of flower beds, we: 1) used conditions for which the difference in performance between ideal observers that penalized model complexity according to the FIA and simulated observers that used only model likelihood was ∼1% (depending on the task type; Figure 1e, insets), which translates to ∼5 additional correct trials over the course of an entire experiment; and 2) provided feedback only at the end of each block of 100 trials, not each trial. We used hierarchical (Bayesian) logistic regression to measure the degree to which each participant’s choices were affected by model likelihood (distance from the data to a given model) and each of the FIA features (see Methods section A.6). We defined each participant’s sensitivity to each FIA term as a normalized quantity, relative to their likelihood sensitivity (i.e., by dividing the logistic coefficient associated with a given FIA term by the logistic coefficient associated with the likelihood).
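In notation introduced here for illustration (the full hierarchical specification, priors, and the lapse-rate extension are given in Methods section A.6), the participant-level part of this regression can be summarized as:

```latex
% Sketch of the participant-level choice model: the log-odds of choosing one model over
% the other is a weighted sum of the trial-wise differences in maximum log likelihood
% and in the four FIA penalty terms between the two models.
P(\text{choose model 1 on trial } t)
  = \sigma\!\left( \beta_0
      + \beta_L \,\Delta L_t
      + \beta_D \,\Delta D_t
      + \beta_B \,\Delta B_t
      + \beta_V \,\Delta V_t
      + \beta_R \,\Delta R_t \right),
\qquad
\text{relative sensitivity to term } F = \beta_F / \beta_L, \quad F \in \{D, B, V, R\}
```

where $\sigma$ is the logistic function, $\beta_0$ captures an up/down choice bias, $\Delta L_t$ is the difference in maximum log likelihood between the two models on trial $t$, and $\Delta D_t$, $\Delta B_t$, $\Delta V_t$, $\Delta R_t$ are the corresponding differences in the dimensionality, boundary, volume, and robustness terms; participant-level coefficients are in turn modeled as draws from a population-level distribution.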
The human participants were sensitive to all four forms of model complexity (Figure 3). Specifically, the estimated normalized population-level sensitivity (posterior mean ± st. dev., where zero implies no sensitivity and one implies Bayes-optimal sensitivity) was 4.66±0.96 for dimensionality, 1.12±0.10 for boundary, 0.23±0.12 for volume, and 2.21±0.12 for robustness (note that, following our preregistered plan, we emphasize parameter estimation using Bayesian approaches (Gelman et al., 2014; Kruschke, 2015; McElreath, 2016) here and throughout the main text, and we provide complementary null hypothesis significance testing in the Supplementary Information section B.6 and Table B.8). Formal model comparison (WAIC; see Supplementary Information section B.6.1 and Tables B.6 and B.7) confirmed that their behavior was better described by taking into account the geometric penalties defined by the theory of Bayesian model selection, rather than by relying only on the minimum distance between model and data (i.e., the maximum-likelihood solution). Consistent with these analyses, their decisions reflected processes that tended to integrate over latent causes (and tended to exhibit moderate levels of sensory noise and low choice noise; Supplementary Information B.7 and Supplementary Figures B.11 and B.12). Overall, our data indicate that people tend to integrate over latent causes in a way that gives rise to Occam’s razor, manifesting as sensitivity to the geometrical features in Bayesian model selection that characterize model complexity.
Figure 3. Humans exhibit theoretically grounded simplicity preferences.
a: Summary of human behavior. Hue (pink/green): k-nearest-neighbor interpolation of the model choice, as a function of the empirical centroid of the data. Color gradient (light/dark): marginal density of empirical data centroids for the given model pair, showing the region of space where data were more likely to fall. Cyan solid line: decision boundary for an observer that always chooses the model with highest maximum likelihood. Orange dashed line: decision boundary for an ideal Bayesian observer. The participants’ choices tended to reflect a preference for the simpler model, particularly near the center of the screen, where the evidence for the alternatives was weak. For instance, in the left panel there is a region where data were closer to the line than to the dot, but participants chose the dot (the simpler, lower-dimensional “model”) more often than the line. b: Participant sensitivity to each geometrical feature characterizing model complexity was estimated via hierarchical logistic regression (see Methods section A.6 and Supplementary Information section B.2), using as predictors a constant to account for an up/down choice bias, the difference in likelihoods for the two models, and the difference in each FIA term for the two models (dimensionality, boundary, volume, and robustness). Following a hierarchical regression scheme, the participant-level sensitivities were in turn modeled as being sampled from a population-level distribution. The mean of this distribution is our population-level estimate for the sensitivity. c: Overall accuracy versus estimated relative FIA sensitivity for each task condition, as indicated. Points are data from individual participants. Each fitted FIA coefficient was normalized to the likelihood coefficient and thus could be interpreted as a relative sensitivity to the associated FIA term. For each term, an ideal Bayesian observer would have a relative sensitivity of one (dashed orange lines), whereas an observer that relied on only maximum-likelihood estimation (i.e., choosing “up” or “down” based on only the model that was the closest to the data) would have a relative sensitivity of zero (solid cyan lines). Top, gray: Population-level estimates (posterior distribution of population-level relative sensitivity given the experimental observations). Bottom: each gray dot represents the task accuracy of one participant (y axis) versus the posterior mean estimate of the relative sensitivity for that participant (x axis). Intuitively, the population-level posterior can be interpreted as an estimate of the location of the center of the cloud of dots representing individual subjects in the panel below. See Methods section A.6 for further details on statistical inference and the relationship between population-level and participant-level estimates. Purple: relative sensitivity of an ideal observer that samples from the exact Bayesian posterior (not the approximated one provided by the FIA). Shading: posterior mean ± 1 or 2 stdev., estimated by simulating 50 such observers.
The participants exhibited substantial individual variability in performance that included ranges of sensitivities to each FIA term that spanned optimal and suboptimal values. This variability was large compared to the uncertainty associated with participant-level sensitivity estimates (Supplementary Information B.4) and impacted performance in a manner that highlighted the usefulness of appropriately tuned (i.e., close to Bayes optimal) simplicity preferences: accuracy tended to decline for participants with FIA sensitivities further away from the theoretical predictions (Figure 3c; posterior mean ± st. dev. of Spearman’s rho between accuracy and the absolute deviation of the relative sensitivity from its optimal value of one: dimensionality, −0.69±0.05; boundary, −0.21±0.11; volume, −0.10±0.10; robustness, −0.54±0.10). The sub-optimal sensitivities exhibited by many participants did not appear to result simply from a lack of task engagement, because FIA sensitivity did not correlate with errors on easy trials (posterior mean ± st. dev. of Spearman’s rho between lapse rate, estimated with an extended regression model detailed in Methods section A.6.1, and the absolute difference from optimal sensitivity for: dimensionality, 0.08±0.12; boundary, 0.15±0.12; volume, −0.04±0.13; robustness, 0.15±0.14; see Supplementary Information section B.5). Likewise, sub-optimal FIA sensitivity did not correlate with weaker likelihood sensitivity for the boundary (rho=−0.13±0.11) and volume (−0.06±0.11) terms, although stronger, negative relationships with the dimensionality (−0.35±0.07) and robustness terms (−0.56±0.10) suggest that the more extreme and variable simplicity preferences under those conditions (and lower performance, on average; see Figure 3c) reflected a more general difficulty in performing those versions of the task.
Artificial Neural Networks learn optimal simplicity preferences
To better understand the optimality, variability, and generality of the simplicity preferences exhibited by our human participants, we compared their performance to that of artificial neural networks (ANNs) trained to optimize performance on this task. We used a novel ANN architecture that we designed to perform statistical model selection, in a form applicable to the task described above (Figure 4a,b). On each trial, the network took as input two images representing the models to be compared, and a set of coordinates representing the observations on that trial. The output of the network was a decision between the two models, encoded as a softmax vector. We analyzed 50 instances of the ANN that differed only in the random initialization of their weights and in the examples seen during training, using the same logistic-regression approach we used for the human participants.
Figure 4. Artificial neural networks exhibit theoretically grounded simplicity preferences.
a: A novel deep neural-network architecture for statistical model selection. The network (see text and Methods for details) takes two images as input, each representing a model, and a set of 2D coordinates, each representing a datapoint. The output is a softmax-encoded choice between the two models. b: Each network was trained on multiple variants of the task, including systematically varied model length or curvature, then tested using the same configurations as for the human studies. c: Summary of network behavior, like Figure 3a. Hue (pink/green): k-nearest-neighbor interpolation of the model choice, as a function of the empirical centroid of the data. Color gradient (light/dark): marginal density of empirical data centroids for the given model pair, showing the region of space where data were more likely to fall. Cyan solid line: decision boundary for an observer that always chooses the model with highest maximum likelihood. Orange dashed line: decision boundary for an ideal Bayesian observer. d: Estimated relative sensitivity to geometrical features characterizing model complexity. As for the human participants, each fitted FIA coefficient was normalized to the likelihood coefficient and thus can be interpreted as a relative sensitivity to the associated FIA term. For each term, an ideal Bayesian observer would have a relative sensitivity of one (dashed orange lines), whereas an observer that relied on only maximum-likelihood estimation (i.e., choosing “up” or “down” based on only the model that was the closest to the data) would have a relative sensitivity of zero (solid cyan lines). Top: population-level estimate (posterior distribution of population-level relative sensitivity given the experimental observations; see Methods section A.6 for details). Bottom: each gray dot represents the task accuracy of one of 50 trained networks (y axis) versus the posterior mean estimate of the relative sensitivity for that network (x axis). Intuitively, the population-level posterior can be interpreted as an estimate of the location of the center of the cloud of dots representing individual subjects in the panel below. See Methods section A.6 for further details on statistical inference and the relationship between population-level and participant-level estimates. Purple: relative sensitivity of an ideal observer that samples from the exact Bayesian posterior (not the approximated one provided by the FIA). Shading: posterior mean ± 1 or 2 stdev., estimated by simulating 50 such observers.
The ANN was designed as follows (see Methods for more details). The input stage consisted of two pretrained VGG16 convolutional neural networks (CNNs), each of which took in a pictorial representation of one of the two models under consideration. VGG was chosen as a popular architecture that is often taken as a benchmark for comparisons with the human visual system (Muratore et al., 2022; Schrimpf et al., 2020). The CNNs were composed of a stack of convolutional layers whose weights were kept frozen at their pretrained values, followed by three fully-connected layers whose weights were allowed to change during training. The outputs of the CNNs were each fed into a multilayer perceptron (MLP) consisting of linear, rectified-linear (ReLU), and batch-normalization layers. The MLP outputs were then concatenated and fed into an equivariant MLP, which enforces equivariance of the network output under position swap of the two models through a custom parameter-sharing scheme (Ravanbakhsh et al., 2017). The network also contained two conditional variational autoencoder (C-VAE) structures, which sought to replicate the data-generation process conditioned on each model and therefore encouraged the fully connected layers upstream to learn model representations that captured task-relevant features.
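A condensed PyTorch sketch of this architecture is shown below. It is an illustrative simplification under stated assumptions, not the trained model itself: layer sizes and the number of datapoints are placeholders, the two image branches share weights here, swap-equivariance is obtained simply by scoring each (model image, data) pair with the same readout rather than via the parameter-sharing scheme of Ravanbakhsh et al. (2017), and the C-VAE branches used during training are omitted (the full implementation is in the code repository listed under Code availability):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class ModelBranch(nn.Module):
    """Encodes one model image: VGG16 convolutional stack (frozen) + trainable MLP head."""
    def __init__(self, embed_dim=64):
        super().__init__()
        backbone = vgg16(weights=None)  # the study used pretrained weights; pass e.g.
                                        # weights="IMAGENET1K_V1" to load them
        self.features = backbone.features
        for p in self.features.parameters():
            p.requires_grad = False     # convolutional weights stay frozen
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 256), nn.ReLU(), nn.BatchNorm1d(256),
            nn.Linear(256, 128), nn.ReLU(), nn.BatchNorm1d(128),
            nn.Linear(128, embed_dim),
        )

    def forward(self, img):
        return self.head(self.features(img))

class ModelSelectionNet(nn.Module):
    """Two model-image branches + a datapoint encoder + a swap-equivariant readout."""
    def __init__(self, n_dots=10, embed_dim=64):
        super().__init__()
        self.branch = ModelBranch(embed_dim)   # shared between the two model images
        self.data_mlp = nn.Sequential(
            nn.Linear(2 * n_dots, 128), nn.ReLU(), nn.BatchNorm1d(128),
            nn.Linear(128, embed_dim),
        )
        # Scoring each (model, data) pair with the same network makes the two output
        # logits swap when the two model images are swapped.
        self.score = nn.Sequential(nn.Linear(2 * embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, img_a, img_b, dots):
        d = self.data_mlp(dots.flatten(start_dim=1))
        za, zb = self.branch(img_a), self.branch(img_b)
        logits = torch.cat([self.score(torch.cat([za, d], dim=1)),
                            self.score(torch.cat([zb, d], dim=1))], dim=1)
        return logits.softmax(dim=1)           # choice probabilities for the two models

# Shape check with dummy inputs (224x224 RGB model images, 10 datapoints per trial):
net = ModelSelectionNet()
probs = net(torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224), torch.randn(4, 10, 2))
print(probs.shape)  # torch.Size([4, 2])
```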
After training, the ANNs performed the task substantially better than the human participants, with higher overall accuracies that included higher likelihood sensitivities (Supplementary Information section B.3) and simplicity preferences that more closely matched the theoretically optimal values (Figure 4c,d). These simplicity preferences were closer to those expected from simulated observers that use the exact Bayesian model posterior rather than the FIA-approximated one, consistent with the fact that the FIA provides an imperfect approximation to the exact Bayesian posterior. These simplicity preferences varied slightly in magnitude across the different networks, but unlike for the human participants this variability was relatively small (compare ranges of values in Figures 3c and 4d, plotted on different x-axis scales) and it was not an indication of suboptimal network behavior, because it was not related systematically to any differences in the generally high accuracy rates for each condition (Figure 4d; posterior mean ± st. dev. of Spearman’s rho between accuracy and the absolute deviation of the relative sensitivity from its optimal value of one: dimensionality, −0.14±0.10; boundary, 0.08±0.11; volume, −0.12±0.11; robustness, −0.08±0.11). These results imply that the stochastic nature of the task gives rise to some variability in simplicity biases even after extensive training to optimize performance accuracy, but this source of variability cannot by itself account for the range of sensitivities (and suboptimalities) exhibited by the human participants.
Humans, unlike ANNs, maintain suboptimal simplicity preferences
Our results, combined with the fact that we did not provide trial-by-trial feedback to the participants while they performed the task, suggest that the human simplicity preferences we measured were not simply learned optimizations for these particular task conditions but rather are a more inherent (and variable) part of how we make decisions under uncertainty. However, because we provided each participant with instructions that echoed Bayesian-like reasoning (see Methods) and a brief training set with feedback before their testing session, we cannot rule out from the data presented in Figure 3 alone that at least some aspects of the simplicity preferences we measured from the human participants depended on those specific instructions and training conditions. We therefore ran a second experiment to rule out this possibility. For this experiment, we used the same task variants as above but a different set of instructions and training, designed to encourage participants to pick the model with the maximum likelihood (i.e., not integrate over latent causes but instead just consider the single cause that best matches the observed data), thus disregarding model complexity. Specifically, the visual cues were the same as in the original experiment, but the participants were asked to report which of the two shapes on the screen was closest to the center-of-mass of the dot cloud. We ensured that the participants recruited for this “maximum-likelihood” task had not participated in the original, “generative” task. We also trained and tested ANNs on this version of the task, using the maximum-likelihood solution as the correct answer.
Despite this major difference in instructions and training, the human participants exhibited similar simplicity preferences on the generative and maximum-likelihood tasks, suggesting that humans have a general predilection for simplicity even without relevant instructions or incentives (Figure 5, left). Specifically, despite some quantitative differences, the distributions of relative sensitivities showed the same basic patterns for both tasks, with a general increase of relative sensitivity from volume (0.19±0.08 for the maximum-likelihood task; compare to values above), to boundary (0.89±0.10), to robustness (2.27±0.15), to dimensionality (2.29±0.41). In stark contrast to the human data and to ANNs trained on the true generative task, ANN sensitivity to model complexity on the maximum-likelihood task was close to zero for all four terms (Figure 5, right). To summarize the similarities and differences between how humans and ANNs used simplicity biases to guide their decision-making behaviors for these tasks, and their implications for performance, Figure 6 shows overall accuracy for each set of conditions we tested. Specifically, for each participant or ANN, task configuration, and instruction set, we computed the percentage of correct responses with respect to both the generative task (i.e., for which theoretically optimal performance depends on simplicity biases) and the maximum-likelihood task (i.e., for which theoretically optimal performance does not depend on simplicity biases). Because the maximum-likelihood solutions are deterministic (they depend only on which model the data centroid is closest to, and thus there exists an optimal, sharp decision boundary that leads to perfect performance) and the generative solutions are not (they depend probabilistically on the likelihood and bias terms, so it is generally impossible to achieve perfect performance), performance on the former is expected to be higher than on the latter. Accordingly, both ANNs and (to a lesser extent) humans tended to perform better when assessed relative to maximum-likelihood solutions. Moreover, the ANNs tended to exhibit behavior that was consistent with optimization to the given task conditions: networks trained to find maximum-likelihood solutions did better than networks trained to find generative solutions for the maximum-likelihood task, and networks trained to find generative solutions did better than networks trained to find maximum-likelihood solutions for the generative task. In contrast, humans tended to adopt similar strategies regardless of the task conditions, in all cases using Bayesian-like simplicity biases.
Figure 5. Humans, but not artificial networks, exhibit simplicity preferences even when they are suboptimal.
a: Relative sensitivity of human participants to the geometric complexity terms (population-level estimates, as in Figure 3c, top) for two task conditions: 1) the original, “generative” task where participants were implicitly instructed to solve a model-selection problem (same data as in Figure 3c, top; orange); and 2) a “maximum-likelihood” task variant, where participants were instructed to report which of two models has the highest likelihood (shortest distance from the data; cyan). The two task variants were tested on distinct participant pools of roughly the same size (202 participants for the generative task, 201 for the maximum-likelihood task, in both cases divided into four groups of roughly 50 participants each). Solid cyan lines: relative sensitivity of a maximum-likelihood observer. Orange dashed lines: relative sensitivity of an ideal Bayesian observer. b: Same comparison and format (note the different x-axis scaling), but for two distinct populations of 50 deep neural networks trained on the two variants of the task (orange is the same data as in Figure 4d, top).
Figure 6. Humans and artificial neural networks have different patterns of accuracy reflecting their different use of simplicity preferences.
Each panel shows accuracy with respect to maximum-likelihood solutions (i.e., the model closest to the centroid of the data; ordinate) versus with respect to generative solutions (i.e., the model that generated the data; abscissa). The gray line is the identity. Columns correspond to the four task variants associated with the four geometric complexity terms, as indicated. a: Data from individual human participants (points), instructed to find the generative (orange) or maximum-likelihood (cyan) solution. Human performance was higher when evaluated against maximum-likelihood solutions than it was when evaluated against generative solutions, for all groups of participants (two-tailed paired t-test, generative task participants: dimensionality, t-statistic 2.21, p-value 0.03; boundary, 6.21, 1e-7; volume, 9.57, 8e-13; robustness, 10.6, 2e-14. Maximum-likelihood task participants: dimensionality, 5.75, 5e-7; boundary, 4.79, 2e-6; volume, 10.8, 2e-14; robustness, 12.2, 2e-16). b: Data from individual ANNs (points), trained on the generative (orange) or maximum-likelihood (cyan) task. Network performance was always highest when evaluated against maximum-likelihood solutions, compared to generative solutions (all dots are above the identity line).
Put briefly, ANNs exhibited simplicity preferences only when trained to do so, whereas human participants exhibited them regardless of their instructions and training.
Discussion
Simplicity has long been regarded as a key element of effective reasoning and rational decision-making. It has been proposed as a foundational principle in philosophy (Baker, 2022), psychology (Chater & Vitányi, 2003; Feldman, 2016), statistical inference (Balasubramanian, 1997; de Mulatier & Marsili, 2024; Grünwald, 2007; Gull, 1988; H. Jeffreys, 1939; MacKay, 1992; Wallace, 2005; Xie & Marsili, 2024), and more recently machine learning (Chaudhari et al., 2019; De Palma et al., 2019; Valle-Perez et al., 2019; Yang et al., 2022). Accordingly, various forms of simplicity preferences have been identified in human cognition (Gershman & Niv, 2013; Little & Shiffrin, 2009; Pothos & Chater, 2002), such as a tendency to prefer smoother (simpler) curves as the inferred, latent source of noisy observed data (Genewein & Braun, 2014; Johnson et al., 2014), and visual perception related to grouping, contour detection, and shape identification (Feldman & Singh, 2006; Froyen et al., 2015; Wilder et al., 2016). However, despite the solid theoretical grounding of these works, none of them attempted to define a quantitative notion of simplicity bias that could be measured (as opposed to simply detected) in human perception and behavior. In this work, we showed that simplicity preferences are closely related to a specific mathematical formulation of Occam’s razor, situated at the convergence of Bayesian model selection and information theory (Balasubramanian, 1997). This formulation enabled us to go beyond the mere detection of a preference for simple explanations for data and to measure precisely the strength of this preference in artificial and human participants under a variety of theoretically motivated conditions.
Our study makes several novel contributions. The first is theoretical: we derived a new term of the Fisher Information Approximation (FIA) in Bayesian model selection that accounts for the possibility that the best model is on the boundary of the model family (see details in Methods section A.2). This boundary term is important because it can account for the possibility that, because of the noise in the data, the best value of one parameter (or of a combination of parameters) takes on an extreme value. This condition is related to the phenomenon of “parameter evaporation” that is common in real-world models for data (Transtrum et al., 2015). Moreover, boundaries for parameters are particularly important for studies of perceptual decision-making, in which sensory stimuli are limited by the physical constraints of the experimental setup and thus reasoning about unbounded parameters would be problematic for observers. For example, imagine designing an experiment that requires participants to report the location of a visual stimulus. In this case, an unbounded set of possible locations (e.g., along a line that stretches infinitely far in the distance to the left and to the right) is clearly untenable. Our “boundary” term formalizes the impact of considering the set of possibilities as having boundaries, which tend to increase local complexity because they tend to reduce the number of local hypotheses close to the data (see Figure 1b).
The second contribution of this work relates to Artificial Neural Networks: we showed that these networks can learn to use or ignore simplicity preferences in an optimal way (i.e., according to the magnitudes prescribed by theory), depending on how they are trained. These results are different from, and complementary to, recent work that has focused on the idea that implementation of simple functions could be key to generalization in deep neural networks (Chaudhari et al., 2019; De Palma et al., 2019; Valle-Perez et al., 2019; Yang et al., 2022). Here we have shown that effective learning can take into account the complexity of the hypothesis space, rather than that of the decision function, in producing normative simplicity preferences. On the one hand, these results do not seem surprising, because ANNs, and deep networks in particular, are powerful function approximators that perform well in practice on a vast range of inference tasks (Bengio et al., 2021). Accordingly, our ANNs trained with respect to the true generative solutions were able to make effective decisions, including simplicity preferences, about the generative source of a given set of observations. Likewise, our ANNs trained with respect to maximum-likelihood solutions were able to make effective decisions, without simplicity preferences, about the maximum-likelihood match for a given set of observations. On the other hand, these results also provide new insights into how ANNs might be analyzed to better understand the kinds of solutions they produce for particular problems. In particular, assessing the presence or absence of the kinds of simplicity preferences that we observed might help identify if and/or how well an ANN is likely to avoid overfitting to training data and provide more generalizable solutions.
The third, and most important, contribution of this work relates to human behavior: people tend to use simplicity preferences when making decisions, and unlike ANNs these preferences do not seem to be simply consequences of learning specific tasks but rather an inherent part of how we interpret uncertain information. This tendency has important implications for the kinds of computations our brains must use to solve these kinds of tasks and how those computations appear to differ from those implemented by the ANNs we used. From a theoretical perspective, the difference between a Bayesian solution (i.e., one that includes the simplicity preferences) and a maximum-likelihood solution (i.e., one that does not include the simplicity preferences) to these tasks is that the latter considers only the single best-fitting model from each family, whereas the former integrates over all possible models in each family. This integration process is what gives rise to the simplicity bias, which, crucially, cannot emerge from simpler mechanisms such as sampling over different possible causes of the stimulus due to an unreliable sensory representation or a stochastic choice process (see Figure 2). Our finding that ANNs can converge on either strategy when trained appropriately indicates that both are, in principle, learnable. However, our finding that people tend to use the Bayesian solution even when instructed to use the maximum-likelihood solution suggests that we naturally do not make decisions based on the single best or archetypical instance within a family of possibilities but rather integrate across that family. Put more concretely in terms of our task, when told to identify the shape closest to the data points, participants were likely uncertain about which exact location on each shape was closest. They therefore naturally integrated over the possibilities, which induces the simplicity preferences as prescribed by the Bayesian solution. These findings will help motivate and inform future studies to identify where and how the brain implements and stores these integrated solutions to relevant decision problems.
Another key feature of our findings that merits further study is the magnitude and variability of preferences exhibited by the human participants. On average, human sensitivity to each geometrical model feature was: 1) larger than zero, 2) at least slightly different from the optimal value (e.g., larger for dimensionality and robustness, smaller for volume), 3) different for distinct features and different participants, and 4) relatively insensitive to instructions and training. What is the source of this diversity? One hypothesis is that people may weigh more heavily the model features that are easier or cheaper to compute. In our experiments, the most heavily weighted feature was model dimensionality. In our mathematical framework, this feature corresponds to the number of degrees of freedom of a possible explanation for the observed data and thus can be relatively easy to assess. By contrast, the least heavily weighted feature was model volume. This feature involves integrating over the whole model family (to count how many distinct states of the world can be explained by a certain hypothesis, one needs to enumerate them) and thus can be very difficult to compute. The other two terms, boundary and robustness, are intermediate in terms of human weighting and computational difficulty: they are harder to compute than dimensionality, because they depend on the data and on the properties of the model at the maximum likelihood location, but are also simpler than the volume term, because they are local quantities that do not require integration over the whole model manifold. This intuition leads to new questions about the relationship between the complexity of explanations being compared and the complexity of the decision-making process itself, calling into question notions of bounded rationality and diminishing returns in optimal inference (Tavoni et al., 2019, 2022). Answering such questions is beyond the scope of the present work but merits further study.
A different, intriguing future direction is a comparison with other formal approaches to the emergence of simplicity that can lead to different predictions. Recent studies have argued that the Jeffreys prior (upon which our geometric approach is based) could give an incomplete picture of the complexity of a class of models that occur commonly in the natural sciences, which contain many combinations of parameters that do not affect model behavior, and proposed instead the use of data-dependent priors (Mattingly et al., 2018; Quinn et al., 2022). The two methods lead to different results, especially in the data-limited regime (Abbott & Machta, 2023). It would be useful to understand the relevance of these differences to human and machine decision-making. Finally, our task design and analyses were constrained to conditions in which the FIA for the models involved could be computed analytically. Generalizing our approach to other conditions is another important future direction.
In summary, our work demonstrates the direct, quantitative relevance of formal notions of model complexity to human behavior. By relying on a combination of theoretical advances, computational modeling, and behavioral experiments, we have established a novel set of normative reference points for decision-making under uncertainty. These findings open up a new arena in which human cognition could be measured against optimal inferential processes, potentially leading to new insights into the constraints affecting information processing in the brain.
Supplementary Material
Acknowledgements
We thank Kamesh Krishnamurthy for discussions and acknowledge the financial support of R01 NS113241 (EP), R01 EB026945 (JIG and VB), IIS-2145164 (PC), CCF-2212519 (PC) as well as a hardware grant from the NVIDIA Corporation (EP). The HPC Collaboration Agreement between SISSA and CINECA granted access to the Marconi100 and Leonardo clusters. VB was supported in part by the Eastman Professorship at Balliol College, University of Oxford.
Footnotes
Code availability
All data and code needed to reproduce the experiments (including running the online psychophysics tasks and training and testing the neural networks), and to analyze the data and produce all figures, are available at doi:10.17605/OSF.IO/R6D8N.
Ethics
Human participant protocols were approved and determined to be Exempt by the University of Pennsylvania Internal Review Board (IRB protocol 844474). Participants provided consent on-line before they began the task.
The authors have no known conflicts of interest to disclose. Preprint: doi:10.1101/2023.01.10.523479
Data availability
All experimental data collected in this work is available at doi:10.17605/OSF.IO/R6D8N.
References
- Abbott M. C., & Machta B. B. (2023). Far from Asymptopia: Unbiased High-Dimensional Inference Cannot Assume Unlimited Data. Entropy, 25(3), 434. 10.3390/e25030434
- Baker A. (2022). Simplicity. In Zalta E. N. (Ed.), The Stanford Encyclopedia of Philosophy (Summer 2022). Metaphysics Research Lab, Stanford University. Retrieved August 9, 2022, from https://plato.stanford.edu/archives/sum2022/entries/simplicity/
- Balasubramanian V. (1997). Statistical Inference, Occam’s Razor, and Statistical Mechanics on the Space of Probability Distributions. Neural Computation, 9(2), 349–368. 10.1162/neco.1997.9.2.349
- Bengio Y., Lecun Y., & Hinton G. (2021). Deep learning for AI. Communications of the ACM, 64(7), 58–65. 10.1145/3448250
- Bialek W., Nemenman I., & Tishby N. (2001). Predictability, Complexity and Learning. Neural Computation, 13(11), 2409–2463. 10.1162/089976601753195969
- Chater N., & Vitányi P. (2003). Simplicity: A unifying principle in cognitive science? Trends in Cognitive Sciences, 7(1), 19–22. 10.1016/S1364-6613(02)00005-0
- Chaudhari P., Choromanska A., Soatto S., LeCun Y., Baldassi C., Borgs C., Chayes J., Sagun L., & Zecchina R. (2019). Entropy-SGD: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12), 124018. 10.1088/1742-5468/ab39d9
- de Mulatier C., & Marsili M. (2024, August 27). Bayesian Inference of Minimally Complex Models with Interactions of Arbitrary Order. arXiv: 2008.00520 [cs]. 10.48550/arXiv.2008.00520
- De Palma G., Kiani B., & Lloyd S. (2019). Random deep neural networks are biased towards simple functions. In Wallach H., Larochelle H., Beygelzimer A., d’Alché-Buc F., Fox E., & Garnett R. (Eds.), Advances in Neural Information Processing Systems (Vol. 32). Curran Associates, Inc. https://proceedings.neurips.cc/paper/2019/file/feab05aa91085b7a8012516bc3533958-Paper.pdf
- Feldman J. (2016). The simplicity principle in perception and cognition. Wiley Interdisciplinary Reviews: Cognitive Science, 7(5), 330–340. 10.1002/wcs.1406
- Feldman J., & Singh M. (2006). Bayesian estimation of the shape skeleton. Proceedings of the National Academy of Sciences, 103(47), 18014–18019. 10.1073/pnas.0608811103
- Froyen V., Feldman J., & Singh M. (2015). Bayesian hierarchical grouping: Perceptual grouping as mixture estimation. Psychological Review, 122(4), 575–597. 10.1037/a0039540
- Gelman A., Carlin J. B., Stern H. S., Dunson D. B., Vehtari A., & Rubin D. B. (2014). Bayesian Data Analysis (3rd ed.). CRC Press.
- Genewein T., & Braun D. A. (2014). Occam’s Razor in sensorimotor learning. Proceedings of the Royal Society B: Biological Sciences, 281(1783), 20132952. 10.1098/rspb.2013.2952
- Gershman S. J., & Niv Y. (2013). Perceptual estimation obeys Occam’s razor. Frontiers in Psychology, 4. 10.3389/fpsyg.2013.00623
- Good I. J. (1968). Corroboration, Explanation, Evolving Probability, Simplicity and a Sharpened Razor. The British Journal for the Philosophy of Science, 19(2), 123–143. Retrieved December 16, 2024, from http://www.jstor.org/stable/686791
- Grünwald P. D. (2007). The Minimum Description Length Principle. MIT Press.
- Gull S. F. (1988). Bayesian Inductive Inference and Maximum Entropy. In Erickson G. J. & Smith C. R. (Eds.), Maximum-Entropy and Bayesian Methods in Science and Engineering: Foundations (pp. 53–74). Springer Netherlands. 10.1007/978-94-009-3049-0_4
- Hatfield G. (1985). The status of the minimum principle in the theoretical analysis of visual perception. Psychological Bulletin, 97(2), 155. 10.1037/0033-2909.97.2.155
- Jaynes E. T. (2003, April 1). Probability Theory: The Logic of Science. Cambridge University Press.
- Jeffreys H. (1939). Theory of probability. Clarendon Press.
- Jeffreys W., & Berger J. (1991, August). Sharpening Ockham’s razor on a Bayesian strop (No. 91–44C). Department of Statistics, Purdue University.
- Johnson S., Jin A., & Keil F. (2014). Simplicity and Goodness-of-Fit in Explanation: The Case of Intuitive Curve-Fitting. Proceedings of the Annual Meeting of the Cognitive Science Society, 36(36).
- Koffka K. (2014). Principles of Gestalt psychology. Mimesis International.
- Kruschke J. K. (2015). Doing Bayesian Data Analysis (2nd ed.). Academic Press.
- Lanterman A. D. (2001). Schwarz, Wallace, and Rissanen: Intertwining Themes in Theories of Model Selection. International Statistical Review, 69(2), 185–212. 10.1111/j.1751-5823.2001.tb00456.x
- Little D. R. B., & Shiffrin R. (2009). Simplicity Bias in the Estimation of Causal Functions. Proceedings of the Annual Meeting of the Cognitive Science Society, 31(31). Retrieved August 10, 2022, from https://escholarship.org/uc/item/3d85q7zt
- MacKay D. J. C. (1992). Bayesian Interpolation. Neural Computation, 4(3), 415–447. 10.1162/neco.1992.4.3.415
- Mattingly H. H., Transtrum M. K., Abbott M. C., & Machta B. B. (2018). Maximizing the information learned from finite data selects a simple model. Proceedings of the National Academy of Sciences of the United States of America, 115(8), 1760–1765. 10.1073/pnas.1715306115
- McElreath R. (2016). Statistical Rethinking. CRC Press.
- Muratore P., Tafazoli S., Piasini E., Laio A., & Zoccolan D. (2022). Prune and distill: Similar reformatting of image information along rat visual cortex and deep neural networks. Advances in Neural Information Processing Systems, 35, 30206–30218. Retrieved May 8, 2023, from https://proceedings.neurips.cc/paper_files/paper/2022/hash/c2d82a425af4c18a35049899fea5ee82-Abstract-Conference.html
- Myung I. J., Balasubramanian V., & Pitt M. A. (2000). Counting probability distributions: Differential geometry and model selection. Proceedings of the National Academy of Sciences, 97(21), 11170–11175. 10.1073/pnas.170283897
- Neath A. A., & Cavanaugh J. E. (2012). The Bayesian information criterion: Background, derivation, and applications. WIREs Computational Statistics, 4(2), 199–203. 10.1002/wics.199
- Piasini E., Balasubramanian V., & Gold J. I. (2020). Preregistration document. 10.17605/OSF.IO/2X9H6
- Piasini E., Balasubramanian V., & Gold J. I. (2021). Preregistration document addendum. 10.17605/OSF.IO/5HDQZ
- Piasini E., Liu S., Balasubramanian V., & Gold J. I. (2022). Preregistration document addendum. 10.17605/OSF.IO/826JV
- Pothos E. M., & Chater N. (2002). A simplicity principle in unsupervised human categorization. Cognitive Science, 26(3), 303–343. 10.1207/s15516709cog2603_6
- Quinn K. N., Abbott M. C., Transtrum M. K., Machta B. B., & Sethna J. P. (2022). Information geometry for multiparameter models: New perspectives on the origin of simplicity. Reports on Progress in Physics, 86(3), 035901. 10.1088/1361-6633/aca6f8
- Ravanbakhsh S., Schneider J., & Póczos B. (2017). Equivariance Through Parameter-Sharing. Proceedings of the 34th International Conference on Machine Learning, 2892–2901. Retrieved January 5, 2023, from https://proceedings.mlr.press/v70/ravanbakhsh17a.html
- Rissanen J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1), 40–47. 10.1109/18.481776
- Schrimpf M., Kubilius J., Hong H., Majaj N. J., Rajalingham R., Issa E. B., Kar K., Bashivan P., Prescott-Roy J., Geiger F., Schmidt K., Yamins D. L. K., & DiCarlo J. J. (2020). BrainScore: Which Artificial Neural Network for Object Recognition is most Brain-Like? bioRxiv, 407007. 10.1101/407007
- Schwarz G. (1978). Estimating the Dimension of a Model. The Annals of Statistics, 6(2), 461–464. 10.1214/aos/1176344136
- Smith A. F. M., & Spiegelhalter D. J. (1980). Bayes Factors and Choice Criteria for Linear Models. Journal of the Royal Statistical Society: Series B (Methodological), 42(2), 213–220. 10.1111/j.2517-6161.1980.tb01122.x
- Tavoni G., Balasubramanian V., & Gold J. I. (2019). What is optimal in optimal inference? Current Opinion in Behavioral Sciences, 29, 117–126. 10.1016/j.cobeha.2019.07.008
- Tavoni G., Doi T., Pizzica C., Balasubramanian V., & Gold J. I. (2022). Human inference reflects a normative balance of complexity and accuracy. Nature Human Behaviour, 6(8), 1153–1168. 10.1038/s41562-022-01357-z
- Transtrum M. K., Machta B. B., Brown K. S., Daniels B. C., Myers C. R., & Sethna J. P. (2015). Perspective: Sloppiness and emergent theories in physics, biology, and beyond. The Journal of Chemical Physics, 143(1), 010901. 10.1063/1.4923066
- Valle-Perez G., Camargo C. Q., & Louis A. A. (2019). Deep learning generalizes because the parameter-function map is biased towards simple functions. International Conference on Learning Representations.
- Wagemans J., Elder J. H., Kubovy M., Palmer S. E., Peterson M. A., Singh M., & von der Heydt R. (2012). A century of Gestalt psychology in visual perception: I. Perceptual grouping and figure–ground organization. Psychological Bulletin, 138(6), 1172–1217. 10.1037/a0029333
- Wallace C. S. (2005). Statistical and inductive inference by minimum message length. Springer.
- Wilder J., Feldman J., & Singh M. (2016). The role of shape complexity in the detection of closed contours. Vision Research, 126, 220–231. 10.1016/j.visres.2015.10.011
- Xie R., & Marsili M. (2024). A simple probabilistic neural network for machine understanding. Journal of Statistical Mechanics: Theory and Experiment, 2024(2), 023403. 10.1088/1742-5468/ad0a8c
- Yang R., Mao J., & Chaudhari P. (2022). Does the Data Induce Capacity Control in Deep Learning? Proceedings of the 39th International Conference on Machine Learning, 25166–25197. Retrieved January 5, 2023, from https://proceedings.mlr.press/v162/yang22k.html