Author manuscript; available in PMC: 2020 Feb 19.
Published in final edited form as: J Econ Behav Organ. 2019 Jun 6;164:148–165. doi: 10.1016/j.jebo.2019.05.026

Choice-theoretic foundations of the divisive normalization model

Kai Steverson a,*, Adam Brandenburger b, Paul Glimcher c
PMCID: PMC7029780  NIHMSID: NIHMS1036462  PMID: 32076358

Abstract

Recent advances in neuroscience suggest that a utility-like calculation is involved in how the brain makes choices, and that this calculation may use a computation known as divisive normalization. While this tells us how the brain makes choices, it is not immediately evident why the brain uses this computation or exactly what behavior is consistent with it. In this paper, we address both of these questions by proving a three-way equivalence theorem between the normalization model, an information-processing model, and an axiomatic characterization. The information-processing model views behavior as optimally balancing the expected value of the chosen object against the entropic cost of reducing stochasticity in choice. This provides an optimality rationale for why the brain may have evolved to use normalization-type models. The axiomatic characterization gives a set of testable behavioral statements equivalent to the normalization model. This answers what behavior arises from normalization. Our equivalence result unifies these three models into a single theory that answers the “how”, “why”, and “what” of choice behavior.

1. Introduction

Choice is often modeled as behavior that seeks to maximize a utility function. Advances in neuroscience over the past few decades have pointed to a discrete set of brain areas apparently dedicated to representing a quantity that functions much like a utility representation (Fehr and Rangel, 2011; Glimcher, 2011; Knutson et al., 2001; Platt and Glimcher, 1999). These brain areas produce different levels of neural activity for different choice alternatives, where higher associated activity indicates a higher probability that the alternative in question is chosen. This “utility-like” process of the brain is referred to by neuroscientists as the subjective value function. Its stochastic relationship with choice can be modeled by a utility function with an additive noise term (Webb et al., 2013), which adapts some of the building blocks of classic stochastic choice theory (Luce, 1959; McFadden, 1973) to a modern neuroscientifically-motivated setting.

Another important finding from neuroscience about the subjective value function is that it appears to employ a computation known as divisive normalization (Louie et al., 2011). Originally identified in the visual domain, divisive normalization has been argued to be a canonical neural computation (Carandini and Heeger, 2012) in which the neural activity level generated to represent a particular “stimulus” (whether visual or other) is re-scaled in a way that depends on neighboring stimuli, that is, on the context. In choice behavior, divisive normalization works by re-scaling the value of each option by a function that depends on the values of all available alternatives (Louie et al., 2015). While this re-scaling function can take different forms, it is typically assumed to be proportional to the sum of the values of the items currently in the choice set plus a constant, which is the form we adopt throughout this paper.

More recently, divisive normalization has been successfully used to explain human choice behavior. Louie et al. (2013), Itthipuripat et al. (2015) and Khaw et al. (2017) conduct choice experiments that confirm different predictions of the divisive normalization model regarding context effects in choice. Webb et al. (2014) show that divisive normalization explains choice involving departures from the classic Luce rule (which is equivalent to the Independence of Irrelevant Alternatives) better than other proposals such as Multinomial Probit. Glimcher and Tymula (2018) document how divisive normalization can reproduce many of the behaviors typically associated with prospect theory. Landry and Webb (2017) show how a variant of the divisive normalization model can accommodate a range of context effects. For example, divisive normalization can accommodate the violation of the Regularity Property often seen in the well-known Attraction Effect (Huber et al., 1982; Simonson, 1989).1 Its ability to violate Regularity (which has been previously discussed in Louie et al. (2013) and Webb et al. (2014)) demonstrates that divisive normalization is not a member of the “random utility” class of models (Block and Marschak, 1960; Thurstone, 1927), since these latter obey this condition.

While empirically observed stochastic choice modeled with divisive normalization tells us how the brain makes choices, it is not immediately evident why the brain would use this particular computation or exactly what behavior is consistent with it. In this paper, we address both of these questions by proving a three-way equivalence theorem between the divisive normalization model, an information-processing model, and an axiomatic characterization. The information-processing model views behavior as optimally balancing the expected value of the chosen object against the entropic cost of reducing stochasticity in choice. This provides an optimality rationale for why the brain may have evolved to use divisive normalization. The axiomatic characterization gives a set of testable behavioral statements equivalent to the divisive normalization model. This gives a precise answer to the question of what behavior arises from divisive normalization. Our equivalence result unifies the three models into a single theory that can simultaneously address the how, why, and what of choice behavior.

Divisive normalization, as a functional form, has been used by neuroscientists for over two decades to model how neural data, and, in particular, the neurally measurable subjective value function, determines choices. However, whether neural data can be precisely predicted (identified cardinally) from choices has not yet been addressed. For neuroscientists, this has proven to be a stark limitation, since it has meant neural observables have only been inferred from choice via the fitting of arbitrary functional forms to choice data. In this paper, we prove a result showing that choice data alone can be used to uniquely specify neural observables within the framework of the normalization model. In other words, we show there is a one-to-one relationship between neural and choice observables.

This result is important to empirical neuroscientists, since it allows for novel and precise predictions which link neural and choice data. We believe that this result is also important for economists working only with choice data, since it allows the application of insights from neural analysis without neural data. For example, it has been suggested that the subjective value function can be used as a direct measure of welfare and happiness (Glimcher, 2011; Loewenstein et al., 2008). Without taking any stance on this controversial question, we merely point out that the one-to-one relationship we establish means that this measurement can be made from choice data alone – and it suggests that inter-individual comparisons may also be possible. We prove the one-to-one relationship as a corollary to our more general uniqueness result that shows that the parameters of the divisive normalization model are behaviorally identified up to a multiplicative scaling. This level of identification is similar to other theories of stochastic choice, such as the Luce rule, and is enough to rank the alternatives by their values and to rank the choice sets by the expected value the agent receives when facing that set.

We next use our equivalence result to study how the divisive normalization model handles context effects. We use a standard notion of context effects as departures from the Independence of Irrelevant Alternatives (IIA) property developed by Luce (1959). Violations of IIA have been widely documented empirically, including in the well-known Compromise and Attraction Effects (Huber et al., 1982; Simonson, 1989) as well as in a wide variety of other settings. (For a survey, see Rieskamp et al., 2006.) We apply our equivalence result to find analogs to IIA violations in our divisive normalization and information-processing models. Specifically, we show that IIA violations are equivalent to differences in the divisive factor (in the divisive normalization model) and in the marginal cost of reducing stochasticity (in the information-processing model). However, divisive normalization also places limits on context effects. In the forms studied here, it does not allow (stochastic) preference reversals: if one alternative is chosen more often than a second in one choice set, then it is chosen more often than the second in every choice set. We also provide a novel testable restriction by showing that any choice sets that share at least two items can be unambiguously ranked in terms of the stochasticity of the choices made on those sets.

We now offer more detail on each of the three models in our equivalence result. The divisive normalization model works by first assigning a set-independent value to each possible choice alternative. Within a particular choice set, these values are then re-scaled by a factor that depends on the available alternatives. Specifically, this set-dependent factor is equal to the sum of the values of the items in the choice set plus a constant. These re-scaled values are then multiplied by a constant that serves to set upper and lower bounds on the set of possible choice probabilities. Both constants are assumed to be set-independent. A random error term is added to each re-scaled value and the alternative with the highest sum of re-scaled value plus error is then chosen. The error term accounts for observed neurobiological stochasticity in the representation of value.

Our information-processing model views choices as optimally balancing the cost and benefit of decreasing stochasticity in choice. Decreasing stochasticity in choices decreases information entropy, which, on physical grounds, must be costly (Landauer, 1961). We model this cost as an increasing function of the associated decrease in Shannon entropy (Shannon, 1948), where the specific functional form can depend on the choice set faced. Decreasing stochasticity in choices benefits the decision maker by increasing the chance of selecting the highest valued alternative. Therefore, a decision maker faces a trade-off between the cost and benefit of reducing stochasticity, which our information-processing model optimally balances. (Other work in decision theory models this feature as a preference for hedging or stochasticity; see, for example, Machina, 1989, or Agranov and Ortoleva, 2017.) Our general information-processing analysis identifies a broad family of divisive normalization models. We also identify the specific functional form for the cost of reducing entropy that corresponds to the additive normalization factor which is most often used in the empirical neuroscience literature, thereby providing an information-processing foundation for the latter.

Our axiomatic characterization consists of a nested set of six behaviorally testable axioms that together are equivalent to the divisive normalization model. The axioms are layered in a way that allows for different versions of the model to be independently tested. The first two axioms by themselves characterize a generalized version of the divisive normalization model where the divisive term can be any strictly positive choice set-dependent factor. Adding in the next two axioms restricts the divisive term to being equal to the sum of the values of the items in the choice set plus a constant. The final two axioms restrict the values of the two constants in the model to being strictly positive.

Returning to our equivalence result, we think of its three components as answering, respectively, the “how,” the “why,” and the “what” of choice behavior. The divisive normalization model explains how the brain makes choices, namely through the normalization computation. The information-processing formulation provides some insight into why that computation is used. The axiomatic characterization outlines exactly what behavior arises. In this way, our theory provides a unified answer to the “how,” the “why,” and the “what” of choice behavior.

2. Literature review

In addition to the divisive normalization literature reviewed in the introduction, our paper relates to the literature on random choice following the Luce rule (Luce, 1959) as well as the literature that employs Shannon entropy in models of decision making.

Our use of the Gumbel error term connects the divisive normalization model to the Luce rule: without the divisive re-scaling step, our model would be equivalent to the Luce rule. The connection between the Gumbel error and the Luce rule is well known (see Luce and Suppes, 1965, who attribute the result to Holman and Marley). However, our model does not contain the Luce rule as a special case since our divisive re-scaling factor cannot be constant across sets. The divisive normalization model with Gumbel error is an instance of the more general Set-Dependent-Luce model (Marley et al., 2008), which is equivalent to a version of our model where the re-scaling factor is allowed to be any strictly positive set-dependent term. We also relate to the larger literature generalizing the Luce rule to accommodate a wider range of empirical phenomena (e.g., Echenique and Saito, 2015; Echenique et al., 2014; Gul et al., 2014; Ravid, 2015; Tserenjigmid, 2016). Our paper is similar to these in that we can also accommodate a wider range of behavior, but we differ in that we use a neurobiologically motivated functional form.

Our paper connects to previous work that employs Shannon entropy in models of decision making. Some of these previous papers also use entropy to model the cost of reducing stochasticity (Fudenberg et al., 2015; Mattsson and Weibull, 2002), while others have used entropy to model a taste for variety (Anderson et al., 1992; Swait and Marley, 2013). Several of these papers trace out a similar mathematical connection between entropy and the probability formulas as we do. In fact, this mathematical connection goes back much further to the physics connecting Helmholtz free energy to the Boltzmann distribution (see Mandl, 1988). Of particular note is Swait and Marley (2013), who studied a model equivalent to the Set-Dependent Luce model discussed above. Thus, the model of Swait and Marley (2013) is equivalent to a divisive normalization model with any strictly positive divisive factor.

The use of Shannon entropy in our information-processing formulation is also a point of connection with the rational-inattention literature initiated by Sims (1998, 2003). Recently, the rational-inattention notion has been applied to stochastic choice settings with uncertain values for the alternatives (Caplin and Dean, 2015; Matějka and McKay, 2014). This leads to an information-processing task: Determine the optimal cost to incur in learning about these uncertain values. By contrast, the information-processing task we consider is an efficient reduction in the intrinsic stochasticity of choosing amongst alternatives.

Finally, our paper relates to the efficient coding literature from neuroscience, which argues that neural processes should be efficiency-promoting (Attneave, 1954; Barlow, 1961). Within this literature, a number of papers advance efficiency arguments for the divisive normalization computation (see Carandini and Heeger, 2012 for a review). However, this literature differs from our paper by being concerned with applying divisive normalization to sensory processing and how the brain can efficiently store and represent sensory information. For example, one thread of this literature shows how divisive normalization can be used to de-correlate the activity of different neurons to reduce redundancy in how sensory input is represented (see for example, Schwartz and Simoncelli, 2001). We see our information-processing model as building a parallel efficiency argument at a more foundational level, with a focus on the choice domain.

3. Foundations of the normalization model

We now provide two foundations for the divisive normalization model: an information-processing model and an axiomatic characterization. The information-processing model provides insight into why the normalization computation is used — namely, because it optimally balances the costs and benefits of reducing stochasticity. The axiomatic characterization pins down what behavior the normalization model allows by providing a set of testable restrictions. The main result of this section is an equivalence theorem uniting all three models in terms of the behavior they imply. For example, any behavior that arises from the normalization model optimally solves the information-processing model and obeys our axioms. The equivalence works in all directions.

We begin with the formal framework, which we maintain throughout. Let $X$ be a finite set consisting of all the alternatives from which the decision maker may be able to choose, which we assume contains at least four items. Let $\mathcal{A} = 2^X \setminus \{\emptyset\}$ be the collection of all non-empty subsets of $X$, to be thought of as the possible choice sets the decision maker may face. As a convenient shorthand, for any function $f: X \to \mathbb{R}$ and $A \in \mathcal{A}$, we set $f(A) = \sum_{x \in A} f(x)$.

The choice behavior of the decision maker is described by a random choice rule $\rho$ that assigns a full-support probability measure to every choice set $A \in \mathcal{A}$. We limit attention to full-support measures because the divisive normalization model uses a full-support error term that implies all available options have a non-zero chance of being chosen. Formally, for any choice set $A$, define

$$\Delta A := \Big\{ p: A \to [0,1] \;\Big|\; \sum_{x \in A} p(x) = 1 \Big\},$$

which is the set of all probability measures on $A$. A random choice rule is then any function $\rho: X \times \mathcal{A} \to [0,1]$ such that $\rho(x,A) > 0 \iff x \in A$ and $\rho(\cdot,A) \in \Delta A$ for each $A \in \mathcal{A}$.2 The interpretation is that $\rho(x,A)$ is the probability that the decision maker chooses alternative $x$ when faced with choice set $A$. To avoid degenerate cases, we assume $\rho$ assigns at least three distinct probabilities in choice set $X$. In other words, there exist $x, y, z \in X$ such that $\rho(x,X)$, $\rho(y,X)$, and $\rho(z,X)$ are all distinct.

We now define the divisive normalization model using a standard functional form widely employed in the neuroscience literature (Carandini and Heeger, 2012). The normalization model generates a stochastic and set-dependent utility for each alternative, and the highest utility alternative is then chosen. The stochastic utility of alternative x in set A is

$$\gamma \frac{v(x)}{\sigma + v(A)} + \varepsilon_x.$$

The function $v$ provides a set-independent value of each alternative. These set-independent values are then divisively re-scaled by the factor $\sigma + v(A)$, which is a constant plus the sum of values of items in the choice set. The term $\gamma$ is a strictly positive constant that sets an upper and lower bound on achievable choice probabilities, with a larger $\gamma$ corresponding to more relaxed bounds. We will make these claims about $\gamma$ more precise at the end of this section. Lastly, $\varepsilon_x$ is a random noise term that is i.i.d. across the alternatives and that we assume follows a Gumbel distribution with location 0 and scale 1.

We define the normalization model as the set of choice probabilities that can be generated using this utility form. We state this more formally as follows.

Definition 1.

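A random choice rule $\rho$ has a divisive normalization representation if there exist $v: X \to \mathbb{R}_{++}$, $\sigma > 0$, and $\gamma > 0$ such that for any $A \in \mathcal{A}$ and $x \in A$

$$\rho(x,A) = \Pr\Big( x \in \arg\max_{y \in A} \; \gamma \frac{v(y)}{\sigma + v(A)} + \varepsilon_y \Big),$$

where $\varepsilon_y$ is distributed i.i.d. Gumbel(0, 1).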

The divisive normalization model is distinct from random utility. On the one hand, the presence of the divisive factor $\sigma + v(A)$ allows a set dependence absent in a standard random utility model. On the other hand, random utility allows for more general assumptions on the error term. The Gumbel distribution we assume does arise in a number of settings, in particular as the asymptotic distribution of the maximum of a sequence of i.i.d. normal random variables; see, e.g., David and Nagaraja (2003). Also note that the continuous nature of the Gumbel distribution ensures that, with probability one, there is a strict utility-maximizing element, so we do not have to worry about ties.
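As an illustration of Definition 1, the following minimal sketch simulates the noisy argmax directly and compares the simulated choice frequencies with the logit probabilities that the Gumbel assumption is known to imply (the closed form the paper later states as Eq. (3)). The values in `values`, the parameter settings, and the helper names are hypothetical and chosen only for illustration.

```python
import math
import random

def dn_probs(values, sigma, gamma):
    """Closed-form logit probabilities for the divisively re-scaled values."""
    scale = gamma / (sigma + sum(values.values()))
    weights = {x: math.exp(scale * val) for x, val in values.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def gumbel(rng):
    """Inverse-CDF draw from a Gumbel distribution with location 0, scale 1."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, where the log is undefined
        u = rng.random()
    return -math.log(-math.log(u))

def simulate_choices(values, sigma, gamma, n_draws, seed=0):
    """Choose the argmax of re-scaled value plus Gumbel noise, n_draws times."""
    rng = random.Random(seed)
    scale = gamma / (sigma + sum(values.values()))
    counts = {x: 0 for x in values}
    for _ in range(n_draws):
        best, best_utility = None, -math.inf
        for x, val in values.items():
            utility = scale * val + gumbel(rng)
            if utility > best_utility:
                best, best_utility = x, utility
        counts[best] += 1
    return {x: c / n_draws for x, c in counts.items()}

values = {"a": 1.0, "b": 2.0, "c": 4.0}  # hypothetical v(x) > 0
exact = dn_probs(values, sigma=1.0, gamma=5.0)
approx = simulate_choices(values, sigma=1.0, gamma=5.0, n_draws=100_000)
for x in values:
    print(x, round(exact[x], 3), round(approx[x], 3))
```

With enough draws the two columns should agree up to sampling error, which is one way to see that the stochastic-utility definition and the logit form describe the same behavior.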

As we discussed in the literature review, the use of the Gumbel error in the divisive normalization model creates a strong connection to the standard Luce rule. If we removed the divisive step, the stochastic utility would simply be $v(x) + \varepsilon_x$, which (given that the error is Gumbel) is one way to formulate the Luce model. Despite this, the divisive normalization model does not contain the Luce model as a special case. In other words, there is no combination of parameters that makes the divisive step irrelevant. However, we can approximate any Luce model as a limiting case.3 Specifically, if we take $\sigma$ and $\gamma$ to infinity at the same rate, then the ratio $\gamma/(\sigma + v(A))$ approaches 1, which leaves us with the standard Luce model $v(x) + \varepsilon_x$.
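The limiting argument can be checked numerically. The sketch below sets $\sigma = \gamma = t$ for increasing $t$ and reports the largest gap between the normalization probabilities (computed with the logit closed form implied by the Gumbel error) and the Luce probabilities from $v(x) + \varepsilon_x$. The values and the grid of $t$ are hypothetical.

```python
import math

def luce_probs(values):
    """Logit probabilities from utilities v(x) + Gumbel noise (no divisive step)."""
    weights = {x: math.exp(val) for x, val in values.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def dn_probs(values, sigma, gamma):
    """Logit probabilities after re-scaling each value by gamma / (sigma + v(A))."""
    scale = gamma / (sigma + sum(values.values()))
    weights = {x: math.exp(scale * val) for x, val in values.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

values = {"a": 0.5, "b": 1.0, "c": 1.5}  # hypothetical v(x) > 0
luce = luce_probs(values)
gaps = []
for t in (1.0, 10.0, 1e3, 1e6):
    dn = dn_probs(values, sigma=t, gamma=t)
    gap = max(abs(dn[x] - luce[x]) for x in values)
    gaps.append(gap)
    print(f"sigma = gamma = {t:g}: max gap to Luce = {gap:.2e}")
```

The gap shrinks as $t$ grows but is never exactly zero for finite parameters, consistent with the Luce rule being a limit rather than a special case.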

3.1. Information-processing model

In our information-processing model, the decision maker balances the expected utility of a given choice rule against the cost involved in reducing stochasticity in choices. The value of alternative x is given by v(x) and the expected utility of choice rule ρ on choice set A is

$$\sum_{x \in A} \rho(x,A)\, v(x).$$

The information-processing costs of a particular rule $\rho$ come from the reduction in Shannon entropy (Shannon, 1948) relative to the fully stochastic case. Shannon entropy measures the degree of stochasticity in behavior, where a higher degree of stochasticity implies higher entropy. We will find it useful to define entropy generally for any function $f: A \to \mathbb{R}_{+}$, as follows:

$$H(f) := -\sum_{x \in A} \frac{f(x)}{f(A)} \ln \frac{f(x)}{f(A)},$$

where $0 \ln 0$ is understood to equal zero. For a probability measure $p \in \Delta A$, $H(p)$ equals the associated Shannon entropy of $p$ on $A$. The maximum entropy of any function defined on set $A$ is $\ln|A|$, which is achieved by the uniform measure that assigns the same probability to each alternative in $A$. Therefore, the entropy reduction achieved by any function $f: A \to \mathbb{R}_{+}$ is

$$\Delta H(f) := \ln|A| - H(f).$$

The total cost of random choice rule ρ on choice set A will be a strictly increasing function of the entropy reduction achieved by ρ on A, where the shape of the function can depend on the choice set faced. We impose standard regularity conditions that the function is continuously differentiable and convex. Combining this with the expected value of a choice rule yields our definition of optimal behavior with costly information processing.
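The entropy and entropy-reduction definitions translate directly into code. In this minimal sketch a function on $A$ is represented as a dictionary from alternatives to non-negative numbers; the function names and example inputs are illustrative only.

```python
import math

def entropy(f):
    """H(f): Shannon entropy of f after normalizing by its total f(A)."""
    total = sum(f.values())
    h = 0.0
    for val in f.values():
        if val > 0:  # 0 ln 0 is taken to be 0
            share = val / total
            h -= share * math.log(share)
    return h

def entropy_reduction(f):
    """Delta H(f) = ln|A| - H(f), the reduction relative to uniform choice."""
    return math.log(len(f)) - entropy(f)

uniform = {"a": 1.0, "b": 1.0, "c": 1.0}     # need not sum to one
skewed = {"a": 0.7, "b": 0.2, "c": 0.1}
degenerate = {"a": 1.0, "b": 0.0}

print(entropy_reduction(uniform))
print(entropy_reduction(skewed))
print(entropy_reduction(degenerate))
```

Uniform choice gives an entropy reduction of zero (up to rounding), while a rule that always picks one alternative attains the maximum reduction $\ln|A|$.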

Definition 2.

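A random choice rule $\rho$ has an information-processing representation if there exists a function $v: X \to \mathbb{R}_{++}$ and, for each $A \in \mathcal{A}$, a strictly increasing, convex, and continuously differentiable function $C_A: \mathbb{R}_{+} \to \mathbb{R}$ such that for any $A \in \mathcal{A}$

$$\rho(\cdot,A) \in \arg\max_{p \in \Delta A} \; \sum_{x \in A} p(x)\, v(x) - C_A\big(\Delta H(p)\big). \tag{1}$$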

A choice rule $\rho$ having an information-processing representation is actually equivalent to a more general normalization computation where the divisive factor can be any strictly positive set-dependent function.4 Hence, the information-processing framework addresses why the brain might use the general form of divisive normalization without the specific additive functional form for the divisive factor that is observed empirically. However, since the additive form is consistent with both behavioral and neurobiological evidence, we think it is interesting to ask what specific restrictions on the information-processing framework might lead to it. We will develop what we call the Marginal Cost Condition (MCC), which restricts the marginal cost of the family of functions $\{C_A\}_{A \in \mathcal{A}}$. We prove that the information-processing model along with the MCC is equivalent to the divisive normalization model with the additive functional form. Hence, the MCC can be seen as an empirically motivated functional form restriction on the information-processing framework. Our analysis does not provide an independent motivation for the form of the MCC, but since it is equivalent to a neurobiologically motivated functional form in the normalization model, we posit that such a rationale may exist, and we leave this point as an interesting challenge for future empirical and theoretical work.
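A heuristic way to see where the generalized normalization form comes from is to write out the first-order condition of problem (1). The sketch below is not the paper's formal argument: it assumes an interior optimum with $C_A' > 0$ there, and the multiplier $\lambda$ and the shorthand $c_A$ are notation introduced here for illustration.

```latex
\begin{align*}
\mathcal{L}(p,\lambda)
  &= \sum_{x\in A} p(x)\,v(x)
     - C_A\Big(\ln|A| + \sum_{x\in A} p(x)\ln p(x)\Big)
     - \lambda\Big(\sum_{x\in A} p(x) - 1\Big),\\
0 = \frac{\partial \mathcal{L}}{\partial p(x)}
  &= v(x) - c_A\big(\ln p(x) + 1\big) - \lambda,
     \qquad c_A := C_A'\big(\Delta H(p)\big),\\
p(x)
  &= \frac{\exp\big(v(x)/c_A\big)}{\sum_{y\in A}\exp\big(v(y)/c_A\big)}.
\end{align*}
```

Read this way, the marginal cost $c_A$ plays the role of a set-dependent divisive factor applied to the values. Because $c_A$ is evaluated at the optimal $p$, the last line is a fixed-point condition rather than an explicit formula.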

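To define the MCC, first, for any $A \in \mathcal{A}$, $v: X \to \mathbb{R}_{++}$, $\sigma > 0$, and $\gamma > 0$, set

$$\delta(A; v, \sigma, \gamma) := \Delta H\Big( \exp\Big( \gamma \frac{v(x)}{\sigma + v(A)} \Big) \Big).$$

In words, $\delta(A; v, \sigma, \gamma)$ equals the entropy reduction achieved by the function that maps each $x \in A$ to $\exp\big(\gamma v(x)/(\sigma + v(A))\big)$. This turns out to be equal to the entropy reduction achieved on $A$ by the normalization computation using $(v, \sigma, \gamma)$.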

We say an information-processing representation $(v, \{C_A\}_{A \in \mathcal{A}})$ obeys the Marginal Cost Condition (MCC) if there exist $\sigma > 0$ and $\gamma > 0$ such that

$$C_A'\big(\delta(A; v, \sigma, \gamma)\big) = \frac{\sigma + v(A)}{\gamma}$$

for each $A \in \mathcal{A}$. The MCC places a restriction on the marginal cost of entropy reduction at the choice probabilities generated by the divisive normalization model. Specifically, the marginal cost has to vary linearly with the total value of items in the set. This places a neurobiologically motivated restriction on the functional forms in the information-processing model.
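To make the MCC concrete, the sketch below uses one particularly simple cost family that satisfies it: a linear cost $C_A(t) = c_A t$ with slope $c_A = (\sigma + v(A))/\gamma$. The linear form is an assumption made here for illustration, not the paper's specification; the MCC only pins down the marginal cost at one point. Under this assumption, the normalization probabilities should attain a higher value of objective (1) than any other distribution, which the code checks against randomly drawn rivals.

```python
import math
import random

def entropy_reduction(p):
    """ln|A| - H(p) for a strictly positive probability vector p (a dict)."""
    return math.log(len(p)) + sum(q * math.log(q) for q in p.values())

def objective(p, v, marginal_cost):
    """Expected value minus the assumed linear cost marginal_cost * Delta H(p)."""
    expected_value = sum(p[x] * v[x] for x in p)
    return expected_value - marginal_cost * entropy_reduction(p)

def dn_probs(v, sigma, gamma):
    """Normalization probabilities in logit form."""
    scale = gamma / (sigma + sum(v.values()))
    weights = {x: math.exp(scale * val) for x, val in v.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

v = {"a": 1.0, "b": 2.0, "c": 4.0}  # hypothetical values
sigma, gamma = 1.0, 5.0
marginal_cost = (sigma + sum(v.values())) / gamma  # slope required by the MCC

p_star = dn_probs(v, sigma, gamma)
star_value = objective(p_star, v, marginal_cost)

rng = random.Random(1)
best_rival = -math.inf
for _ in range(2000):
    raw = {x: rng.random() + 1e-9 for x in v}
    total = sum(raw.values())
    rival = {x: r / total for x, r in raw.items()}
    best_rival = max(best_rival, objective(rival, v, marginal_cost))

print(star_value, best_rival)
```

A random search is of course not a proof; it only illustrates that no sampled distribution beats the normalization probabilities under the assumed cost.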

3.2. Axiomatic characterization

Our axiomatic characterization gives six testable behavioral restrictions, arranged into three nested groups, which are jointly equivalent to the full divisive normalization model. The axioms are layered in a way that allows for different versions and aspects of the normalization model to be independently tested.

To state the axioms compactly, it will be useful to define a few terms. We say the pair $(x, y)$ is distinguishable in $A$ if $x, y \in A$ and $\rho(x,A) \neq \rho(y,A)$. For any $(x, y)$ distinguishable in $A$, we define

$$R_{xy}(A) := \Big( \ln \frac{\rho(x,A)}{\rho(y,A)} \Big)^{-1} \ln \frac{\rho(x,X)}{\rho(y,X)}.$$

The number $R_{xy}(A)$ measures how the choice probability ratio between $x$ and $y$ differs across choice sets $A$ and $X$. Larger $R_{xy}(A)$ means that this ratio is closer to 1 in $A$ than in $X$. This suggests $R_{xy}(A)$ is related to the divisive factor, since a larger divisive factor pushes the re-scaled values, and hence the choice probabilities, closer together. In fact, if $\rho$ has a divisive normalization representation $(v, \sigma, \gamma)$, and $(x, y)$ is distinguishable in $A$, then

$$R_{xy}(A) = \frac{\sigma + v(A)}{\sigma + v(X)}. \tag{2}$$

We will discuss the proof of this fact in our discussion around Eq. (3) below.
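Eq. (2) can be checked numerically by computing $R_{xy}(A)$ from choice probabilities alone and comparing it with the ratio of divisive factors. The sketch below generates the probabilities from the paper's Eq. (3); the parameter values and helper names are hypothetical.

```python
import math

def dn_probs(v, A, sigma, gamma):
    """Choice probabilities on set A from the paper's Eq. (3)."""
    scale = gamma / (sigma + sum(v[x] for x in A))
    weights = {x: math.exp(scale * v[x]) for x in A}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def r_xy(rho, x, y, A, X):
    """R_xy(A): log-odds of x over y in X divided by the same log-odds in A."""
    p_A, p_X = rho(A), rho(X)
    return math.log(p_X[x] / p_X[y]) / math.log(p_A[x] / p_A[y])

# Hypothetical parameters, chosen only for illustration.
v = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.5}
sigma, gamma = 0.5, 3.0
X = set(v)
A = {"a", "b", "c"}

def rho(S):
    return dn_probs(v, S, sigma, gamma)

from_choices = r_xy(rho, "a", "b", A, X)
from_params = (sigma + sum(v[x] for x in A)) / (sigma + sum(v[x] for x in X))
other_pair = r_xy(rho, "b", "c", A, X)
print(from_choices, from_params, other_pair)
```

All three printed numbers should coincide up to rounding: the right-hand side of Eq. (2) involves only the choice sets, so the value cannot depend on which distinguishable pair is used to compute it.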

We are now ready to state our axioms.

Axiom 1

(Order). Let $A, B \in \mathcal{A}$ and $x, y \in A \cap B$. Then $\rho(x,A) \geq \rho(y,A)$ if and only if $\rho(x,B) \geq \rho(y,B)$.

Our first axiom requires a set-independent ordinal ranking of the alternatives, in the sense that whether x is chosen more often than y is consistent across all choice sets. In the divisive normalization model, this ordinal ranking follows the ranking given by v, a fact we explore further in the next section.

Axiom 2

(Divisive Factoring). If $(x, y)$ and $(x', y')$ are distinguishable pairs in $A$, then

$$R_{xy}(A) = R_{x'y'}(A).$$

Our second axiom states that the value of $R_{xy}(A)$ does not depend on the specific $x, y$ pair used. This is an immediate implication of Eq. (2) and captures the fact that the divisive factor in the normalization model depends only on the choice set and not on the particular item being re-scaled.

Together, our first two axioms characterize a basic version of the normalization model where the divisive factor is allowed to be any strictly positive set-dependent function.5

Axiom 3

(Additive Separability). For any $z \in A$

$$R_{xy}(A) - R_{xy}(A \setminus \{z\}) = R_{xy}(X) - R_{xy}(X \setminus \{z\}),$$

where (x, y) is distinguishable in all four sets used in the above equation.

Our third axiom says that the effect of removing an item on the divisive factor does not depend on the other alternatives in the choice set. This captures the additive separability of the divisive factor across the items.

Axiom 4

(Separability by Values). Suppose the pairs $(x, z)$, $(x', z)$, $(y, z)$, and $(y', z)$ are each distinguishable in the set that contains only that pair and that $(x', y')$ is distinguishable in $X$. Then

$$\frac{R_{xz}(\{x,z\}) - R_{yz}(\{y,z\})}{R_{x'z}(\{x',z\}) - R_{y'z}(\{y',z\})} = \ln \frac{\rho(x,X)}{\rho(y,X)} \Big( \ln \frac{\rho(x',X)}{\rho(y',X)} \Big)^{-1}.$$

Implicit in Axiom 4 is that both sides of the equation are well defined (do not involve division by zero) when the distinguishability conditions stated are met. We can interpret the ratio $\rho(x,X)/\rho(y,X)$ as providing a measure of $x$’s value relative to $y$’s, since we expect more valuable items to be chosen more often. Under this interpretation, the fourth axiom relates the relative values of $x, y$ and $x', y'$ to the divisive factor involving those alternatives. This ensures the divisive factor is additively separable using the values of the alternatives.
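Axioms 3 and 4 can be illustrated on simulated data. The sketch below generates choice probabilities from the paper's Eq. (3) with hypothetical parameters and evaluates both sides of each axiom using choice probabilities only. In the Axiom 4 check, the role assignment (which alternative plays $x$, $y$, $x'$, $y'$, $z$) is one arbitrary choice, read as the ratio of differences in $R$ equated to the ratio of log-odds in $X$.

```python
import math

def dn_probs(v, A, sigma, gamma):
    """Choice probabilities on set A from the paper's Eq. (3)."""
    scale = gamma / (sigma + sum(v[x] for x in A))
    weights = {x: math.exp(scale * v[x]) for x in A}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

# Hypothetical parameters with distinct values, so every pair is distinguishable.
v = {"a": 1.0, "b": 2.0, "c": 3.5, "d": 5.0, "e": 0.5}
sigma, gamma = 0.5, 3.0
X = set(v)
p_X = dn_probs(v, X, sigma, gamma)

def R(x, y, A):
    """R_xy(A), computed from choice probabilities in A and in X."""
    p_A = dn_probs(v, A, sigma, gamma)
    return math.log(p_X[x] / p_X[y]) / math.log(p_A[x] / p_A[y])

# Axiom 3: removing z shifts R_xy by the same amount in A and in X.
A, z = {"a", "b", "c", "e"}, "e"
ax3_left = R("a", "b", A) - R("a", "b", A - {z})
ax3_right = R("a", "b", X) - R("a", "b", X - {z})

# Axiom 4 with x = a, y = b, x' = c, y' = d, z = e.
ax4_left = (R("a", "e", {"a", "e"}) - R("b", "e", {"b", "e"})) / (
    R("c", "e", {"c", "e"}) - R("d", "e", {"d", "e"})
)
ax4_right = math.log(p_X["a"] / p_X["b"]) / math.log(p_X["c"] / p_X["d"])

print(ax3_left, ax3_right)
print(ax4_left, ax4_right)
```

Each printed pair should agree up to rounding, as expected for data generated by the additive form of the divisive factor.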

Our first four axioms together characterize a version of normalization where v, σ, and γ are not necessarily positive.6 For this, we require two additional axioms.

Axiom 5

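(Strictly Positive $v$ and $\gamma$.) Suppose $A \in \mathcal{A}$ contains distinguishable pair $(x, y)$, and let $z, z' \in A \setminus \{x, y\}$ be such that $\rho(z,X) > \rho(z',X)$. Then

$$R_{xy}(A) > R_{xy}(A \setminus \{z'\}) > R_{xy}(A \setminus \{z\}).$$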

Our fifth axiom implies two facts: (1) the divisive factor strictly increases when adding alternatives to the choice set, and (2) the divisive factor increases by more when adding alternatives with a higher choice probability. From Eq. (2), we can easily see that the first fact corresponds to v > 0. The second follows from Eq. (2) under the assumption that alternatives with a higher choice probability have a higher value for v. This assumption requires γ > 0, since γ < 0 would allow the re-scaled value γv(x)/(σ+v(A)) to decrease in v(x). Therefore, Axiom 5 corresponds to v and γ being strictly positive.

Axiom 6

(Strictly Positive $\sigma$.) If $(x, y)$ is distinguishable in both $A$ and $A \cup B$, and $(x', y')$ is distinguishable in $B$, then

$$R_{xy}(A) + R_{x'y'}(B) > R_{xy}(A \cup B).$$

Our final axiom imposes strict subadditivity on the $R_{xy}$ function, which captures the fact that $\sigma > 0$. To see why, note that, applying Eq. (2), the difference between the left and right-hand sides of the inequality equals

$$\frac{\sigma + v(A \cap B)}{\sigma + v(X)},$$

which must be positive because v and σ are strictly positive.
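The last two axioms can be checked the same way. The sketch below uses hypothetical parameters and choice probabilities generated from the paper's Eq. (3). It verifies the strict chain in Axiom 5 and compares the gap in Axiom 6 with the closed-form expression obtained from Eq. (2), taking the sum of values over an empty intersection to be zero.

```python
import math

def dn_probs(v, A, sigma, gamma):
    """Choice probabilities on set A from the paper's Eq. (3)."""
    scale = gamma / (sigma + sum(v[x] for x in A))
    weights = {x: math.exp(scale * v[x]) for x in A}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

# Hypothetical parameters with distinct values.
v = {"a": 1.0, "b": 2.0, "c": 3.5, "d": 5.0, "e": 0.5}
sigma, gamma = 0.5, 3.0
X = set(v)
p_X = dn_probs(v, X, sigma, gamma)

def R(x, y, A):
    """R_xy(A), computed from choice probabilities in A and in X."""
    p_A = dn_probs(v, A, sigma, gamma)
    return math.log(p_X[x] / p_X[y]) / math.log(p_A[x] / p_A[y])

# Axiom 5 with (x, y) = (a, b), z = d, z' = c, where d is chosen more often in X.
A5 = {"a", "b", "c", "d"}
z_more_likely = p_X["d"] > p_X["c"]
chain = (R("a", "b", A5), R("a", "b", A5 - {"c"}), R("a", "b", A5 - {"d"}))

# Axiom 6 with (x, y) = (a, b) in A and (x', y') = (d, e) in B.
A6, B6 = {"a", "b", "c"}, {"c", "d", "e"}
gap = R("a", "b", A6) + R("d", "e", B6) - R("a", "b", A6 | B6)
closed_form_gap = (sigma + sum(v[x] for x in A6 & B6)) / (sigma + sum(v.values()))

print(z_more_likely, chain)
print(gap, closed_form_gap)
```

The chain should be strictly decreasing, and the Axiom 6 gap should be strictly positive and match the closed form.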

Using the axioms, we can see exactly which behavioral restrictions prevent the Luce rule from being a special case of the divisive normalization model. Under the Luce rule, $R_{xy}(A)$ always equals 1, from which we can see that Axioms 1–3 and 6 are consistent with the Luce rule while Axioms 4–5 are not. (Axiom 4 is inconsistent with the Luce rule because of the implicit requirement that $R_{xz}(\{x,z\}) - R_{yz}(\{y,z\}) \neq 0$ whenever all pairs from $x, y, z$ are distinguishable in $X$.) Axioms 4–5 prevent the Luce rule from being a special case because they link the value of an alternative with its impact on the divisive factor. Through this mechanism, alternatives with strictly positive value create a context effect inconsistent with the Luce rule. Without these axioms, we could construct a Luce rule by setting each alternative’s contribution to the divisive factor to be 0 while setting its value independently.

3.3. Equivalence result

The main result of this paper establishes a three-way equivalence uniting the divisive normalization model, our information-processing model, and our axiomatic characterization. The unification of the three models works on the level of behavior. Any choice probabilities that fit into one of the three models necessarily must fit into all three.

Theorem 1.

For any random choice rule ρ the following are equivalent:

  1. ρ has a divisive normalization representation,

  2. ρ has an information-processing representation that obeys the MCC,

  3. ρ obeys Axioms 1–6.

Proof.

See the Appendix. □

We think of the three models in our equivalence result as answering, respectively, the “how,” the “why,” and the “what” of choice behavior. According to existing work in neuroscience, the divisive normalization model explains how the brain makes choices, namely through the normalization computation. The information-processing formulation provides some insight into why that computation is used. The axiomatic characterization outlines exactly what behavior arises.

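The proof of Theorem 1 proceeds by establishing that all three parts are equivalent to the statement that there exists a function $v: X \to \mathbb{R}_{++}$ and constants $\sigma > 0$ and $\gamma > 0$ such that

$$\rho(x,A) = \frac{\exp\left(\gamma \frac{v(x)}{\sigma+v(A)}\right)}{\sum_{y \in A} \exp\left(\gamma \frac{v(y)}{\sigma+v(A)}\right)} \tag{3}$$

for all $A \in \mathcal{A}$ and $x \in A$.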
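As a minimal numerical sketch of Eq. (3), the following Python computes the choice probabilities on a single choice set. The function name `dn_choice_probs` and the example values are illustrative assumptions, not part of the paper.

```python
import math

def dn_choice_probs(values, sigma, gamma):
    """Choice probabilities of Eq. (3) on one choice set A.

    values: dict mapping each alternative x in A to v(x) > 0.
    sigma, gamma: strictly positive constants.
    """
    divisor = sigma + sum(values.values())             # sigma + v(A)
    rescaled = {x: gamma * v / divisor for x, v in values.items()}
    top = max(rescaled.values())                       # subtract the max for numerical stability
    weights = {x: math.exp(r - top) for x, r in rescaled.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

# Hypothetical example values, assumed for illustration only.
probs = dn_choice_probs({"a": 1.0, "b": 2.0, "c": 4.0}, sigma=1.0, gamma=3.0)
print(probs)
```

Alternatives with higher value receive higher probability, and the log-odds between any two alternatives equal the difference of their re-scaled values.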

We can use Eq. (3) to prove the claims we made about the role of γ. First, rewrite Eq. (3) as

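$$\rho(x,A) = \frac{1}{1 + \sum_{y \in A\setminus\{x\}} \exp\left(\gamma \frac{v(y)-v(x)}{\sigma+v(A)}\right)}.$$

Since all the parameters are strictly positive, whenever $x, y \in A$,

$$-\gamma \leq \gamma \frac{v(x)-v(y)}{\sigma+v(A)} \leq \gamma.$$

Combining these inequalities with our rewritten version of Eq. (3) yields

$$\frac{1}{1+(|A|-1)\exp(\gamma)} \leq \rho(x,A) \leq \frac{1}{1+(|A|-1)\exp(-\gamma)}.$$

These inequalities confirm that γ determines an upper and lower bound on the possible choice probabilities. As $\gamma \to \infty$ these inequalities give only the trivial statement $\rho(x,A) \in [0,1]$, and as $\gamma \to 0$ they force $\rho(x,A) = \frac{1}{|A|}$ for all $x$, $A$.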

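We can also use Eq. (3) to establish our claim regarding $R_{xy}(A)$ in Eq. (2). Eq. (3) implies that, for all $x, y \in A$,

$$\frac{\rho(x,A)}{\rho(y,A)} = \exp\left(\gamma \frac{v(x)-v(y)}{\sigma+v(A)}\right),$$

from which,

$$R_{xy}(A) = \left(\gamma \frac{v(x)-v(y)}{\sigma+v(A)}\right)^{-1}\left(\gamma \frac{v(x)-v(y)}{\sigma+v(X)}\right),$$

which simplifies to Eq. (2).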
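The identity can be checked numerically. The sketch below computes $R_{xy}(A)$ from choice probabilities alone and compares it with $(\sigma+v(A))/(\sigma+v(X))$; the helper names and the four-item example are assumptions made for illustration.

```python
import math

def dn_choice_probs(values, sigma, gamma):
    """Choice probabilities of Eq. (3) on one choice set."""
    divisor = sigma + sum(values.values())
    weights = {x: math.exp(gamma * v / divisor) for x, v in values.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def r_xy(x, y, probs_A, probs_X):
    """R_xy(A): log-odds of x over y on X divided by the log-odds on A."""
    return math.log(probs_X[x] / probs_X[y]) / math.log(probs_A[x] / probs_A[y])

# Hypothetical grand set X and subset A.
v = {"a": 1.0, "b": 2.0, "c": 4.0, "d": 0.5}
sigma, gamma = 1.0, 3.0
A = {k: v[k] for k in ("a", "b", "c")}

rho_X = dn_choice_probs(v, sigma, gamma)
rho_A = dn_choice_probs(A, sigma, gamma)

from_choices = r_xy("a", "c", rho_A, rho_X)
from_params = (sigma + sum(A.values())) / (sigma + sum(v.values()))
print(from_choices, from_params)
```

The same number is obtained from any other distinguishable pair in $A$, since the ratio does not depend on which pair is used.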

4. Identifying neural and behavioral parameters

Divisive normalization has been used by neuroscientists to model how neural data determines choices. Whether neural data can be identified from choices has not yet been addressed. In this section we prove a result showing that choice alone can be used to uniquely specify neural observables within the framework of the normalization model. In more detail, the re-scaled values in the divisive normalization model are used to match the neurally observable subjective value function, experimentally measured as the number of action potentials per second (or “firing rate”) of individual neurons. We prove that the re-scaled values are fully identified from choice behavior alone, and conversely, that the re-scaled values fully determine the (stochastic) choice behavior. In other words, there exists a one-to-one relationship between the neurally measurable subjective value function and the behavior it generates. This result is important to empirical neuroscientists. It allows novel and precise predictions linking neural and choice data. This result is also important for economists, since it allows the application of insights from neural analysis without neural data. We also note that this identity may have welfare implications.

We prove the one-to-one relationship as a corollary to our more general uniqueness result on the parameters of the divisive normalization model. We start this section by presenting this more general identification result. We then discuss its implications for neural and choice parameters in the form of two corollaries. We end by providing the proof of the identification result, which builds on the notation and logic (notably Eq. (2)) from the axiomatic characterization in the previous section.

Proposition 1.

Suppose ρ has divisive normalization representation $(v, \sigma, \gamma)$. Then $(v', \sigma', \gamma')$ is also a divisive normalization representation of ρ if and only if $\gamma' = \gamma$ and there exists $\alpha > 0$ such that $(v', \sigma') = \alpha (v, \sigma)$.

Proposition 1 establishes that the v and σ parameters in the divisive normalization model are jointly unique up to a strictly positive multiplicative constant, while γ is fully unique. The parameter γ can be identified because of the divisive step in the normalization procedure. Without this step the model reduces to $\gamma v(x) + \varepsilon_x$ (which is equivalent to the Luce rule as discussed earlier), where γ is superfluous since it can be freely absorbed into v. In the normalization model, absorbing γ into v also impacts the divisive step, and so cannot be done freely. Interestingly, this constraint not only necessitates the γ parameter, but also allows for γ to be fully identified.
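A short numerical illustration of this uniqueness claim: jointly scaling $(v, \sigma)$ leaves the choice probabilities unchanged, whereas absorbing γ into v does not. The helper name and the numbers are hypothetical.

```python
import math

def dn_choice_probs(values, sigma, gamma):
    """Choice probabilities of Eq. (3) on one choice set."""
    divisor = sigma + sum(values.values())
    weights = {x: math.exp(gamma * v / divisor) for x, v in values.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

v = {"a": 1.0, "b": 2.0, "c": 4.0}
sigma, gamma = 1.0, 3.0
base = dn_choice_probs(v, sigma, gamma)

# Scale (v, sigma) jointly by alpha > 0 and keep gamma fixed.
alpha = 2.5
scaled = dn_choice_probs({x: alpha * val for x, val in v.items()}, alpha * sigma, gamma)

# Try to absorb gamma into v instead: use (gamma * v, sigma, 1).
absorbed = dn_choice_probs({x: gamma * val for x, val in v.items()}, sigma, 1.0)

max_gap_scaled = max(abs(base[x] - scaled[x]) for x in v)
max_gap_absorbed = max(abs(base[x] - absorbed[x]) for x in v)
print(max_gap_scaled, max_gap_absorbed)
```

The first gap is zero up to rounding, while the second is not, because multiplying v by γ also enlarges the divisive factor.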

A transformation of the parameters of particular interest is the re-scaled value of each alternative x in choice set A, that is

$$\gamma \frac{v(x)}{\sigma+v(A)}.$$

As discussed above, these re-scaled values model the neurally measurable subjective value function in the divisive normalization framework. An immediate corollary of Proposition 1 is that these re-scaled values are fully identified from choice behavior. Conversely, the definition of the divisive normalization model immediately implies that these re-scaled values fully determine the choice probabilities.

Corollary 1.

Suppose ρ has divisive normalization representation $(v, \sigma, \gamma)$. Then $(v', \sigma', \gamma')$ is also a divisive normalization representation of ρ if and only if for every $(x, A)$ in $X \times \mathcal{A}$:

$$\gamma' \frac{v'(x)}{\sigma'+v'(A)} = \gamma \frac{v(x)}{\sigma+v(A)}.$$

Corollary 1 establishes two facts. First, if $(v, \sigma, \gamma)$ and $(v', \sigma', \gamma')$ have the same re-scaled values, then they represent the same behavior. Second, if $(v, \sigma, \gamma)$ and $(v', \sigma', \gamma')$ represent the same behavior, then they must have the same re-scaled values. This creates an exact one-to-one relationship between choice behavior and the neurally observable subjective value function, within the divisive normalization model. This allows for precise predictions on neural data from choice data alone, enabling new types of experimental hypotheses for empirical neuroscientists. This result is also relevant for data sets containing choices alone, since it justifies the application of insights based on neural analysis without neural data. For example, some researchers have suggested that the neurally measured subjective value function is the correct value level for welfare analysis (Loewenstein et al., 2008), and our result suggests this analysis can be performed with choice data alone.

Corollary 1 works because the re-scaling step uses the values of the alternatives. We can measure v(x) by how much the choice probabilities of other alternatives change when x is added to the set. A larger v(x) causes more re-scaling which pushes the values and choice probabilities closer together. If, instead, the re-scaling was done via a set-dependent factor that did not depend on the values then the re-scaled values would not be unique.

For example, suppose we assigned stochastic utility to alternative x in set A of

$$\frac{v(x)}{F(A)} + \varepsilon_x,$$

where $F(A)$ is any strictly positive set-dependent function and $\varepsilon_x$ is an i.i.d. random variable. Now define $v'(x) := v(x) + \alpha$ for some constant $\alpha \neq 0$. Then for any choice set $A$ and $x, y \in A$,

$$\frac{v'(x)}{F(A)} - \frac{v'(y)}{F(A)} = \frac{v(x)}{F(A)} - \frac{v(y)}{F(A)}.$$

This is enough to ensure that $(v, F)$ and $(v', F)$ deliver the same choice probabilities, while having different re-scaled values. By contrast, in the divisive normalization model, changing from $v$ to $v'$ also changes the re-scaling factor, which impacts the choice probabilities.
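The contrast can be sketched numerically. The code below assumes Gumbel noise, so that choice probabilities take the logit form in the re-scaled values; that distributional choice, the value of $F(A)$, and the helper names are assumptions for illustration only.

```python
import math

def logit_probs(rescaled):
    """Logit choice probabilities from re-scaled values (assumes Gumbel noise)."""
    top = max(rescaled.values())
    weights = {x: math.exp(r - top) for x, r in rescaled.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def dn_rescaled(values, sigma, gamma):
    """Re-scaled values gamma * v(x) / (sigma + v(A)) of the normalization model."""
    divisor = sigma + sum(values.values())
    return {x: gamma * val / divisor for x, val in values.items()}

v = {"a": 1.0, "b": 2.0, "c": 4.0}
alpha = 0.7                                   # hypothetical shift
v_shift = {x: val + alpha for x, val in v.items()}

# Value-independent re-scaling: F(A) is a fixed number for this set.
F_A = 2.0
res = {x: val / F_A for x, val in v.items()}
res_shift = {x: val / F_A for x, val in v_shift.items()}
p, p_shift = logit_probs(res), logit_probs(res_shift)
gap_fixed_factor = max(abs(p[x] - p_shift[x]) for x in v)

# Divisive normalization: the same shift also changes sigma + v(A).
dn = logit_probs(dn_rescaled(v, 1.0, 3.0))
dn_shift = logit_probs(dn_rescaled(v_shift, 1.0, 3.0))
gap_normalization = max(abs(dn[x] - dn_shift[x]) for x in v)

print(gap_fixed_factor, gap_normalization)
```

With a value-independent factor the shift is invisible in behavior even though every re-scaled value changes; under divisive normalization the shift changes the choice probabilities.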

The second set of parameters we are interested in identifying consists of the untransformed values without re-scaling. Proposition 1 shows that these are unique only up to a multiplicative constant, as is the case in other stochastic choice models, such as the Luce rule. This degree of uniqueness is enough to determine a unique ordinal ranking over the choice alternatives and choice sets. Define

$$E_A[v(x) \mid \rho] := \sum_{x \in A} \rho(x,A)\, v(x),$$

which is the expected value from choice rule ρ on set A using values v.

Corollary 2.

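Suppose ρ has two divisive normalization representations $(v, \sigma, \gamma)$ and $(v', \sigma', \gamma')$. Then:

  1. $v(x) \geq v(y)$ if and only if $v'(x) \geq v'(y)$ for all $x, y \in X$.

  2. $E_A[v(x) \mid \rho] \geq E_B[v(x) \mid \rho]$ if and only if $E_A[v'(x) \mid \rho] \geq E_B[v'(x) \mid \rho]$.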

Corollary 2 says that the divisive normalization model uniquely ranks the alternatives by their values and choice sets by their expected values. In this sense, the divisive normalization model provides a well-defined ordinal preference over alternatives and choice sets.

Proof of Proposition 1.

The “if” direction is obvious. For the other direction, suppose that $(v, \sigma, \gamma)$ and $(v', \sigma', \gamma')$ are both divisive normalization representations of ρ. If choice set $A$ contains a distinguishable pair $(x, y)$, then we know

$$\frac{\sigma+v(A)}{\sigma+v(X)} = \frac{\sigma'+v'(A)}{\sigma'+v'(X)} \tag{4}$$

since, using Eq. (2), both sides of the equation equal $R_{xy}(A)$.

Now define

$$\alpha := \frac{\sigma'+v'(X)}{\sigma+v(X)}.$$

It is clear that $\alpha > 0$ since all the terms are strictly positive. By our assumptions on ρ, we can find $x, y, w \in X$ such that $\rho(x,X)$, $\rho(y,X)$, and $\rho(w,X)$ are all distinct. By Theorem 1, we know ρ obeys Axiom 1, which implies all pairs from $\{x, y, w\}$ are distinguishable in every set that contains them. Hence, whenever $x, y \in A$, we can rearrange Eq. (4) to get

$$v'(A) = \alpha v(A) + \alpha\sigma - \sigma'.$$

Therefore, whenever $x, y \in A \cap B$, we get

$$v'(A) - v'(B) = \alpha\,(v(A) - v(B)).$$

Hence, for any $z \in X\setminus\{x,y\}$, we have that

$$v'(z) = v'(\{x,y,z\}) - v'(\{x,y\}) = \alpha\,\big(v(\{x,y,z\}) - v(\{x,y\})\big) = \alpha v(z).$$

We can apply the same logic with $\{x, w\}$ taking the role of $\{x, y\}$ to prove $v'(y) = \alpha v(y)$. We can also let $\{w, y\}$ take the role of $\{x, y\}$ to prove $v'(x) = \alpha v(x)$. Therefore, we have shown $v'(z) = \alpha v(z)$ for all $z \in X$. Combining $v'(X) = \alpha v(X)$ with the definition of α, it follows that $\sigma' = \alpha\sigma$.

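To prove $\gamma' = \gamma$, we can use Eq. (3) to get

$$\exp\left(\gamma \frac{v(x)-v(y)}{\sigma+v(X)}\right) = \exp\left(\gamma' \frac{v'(x)-v'(y)}{\sigma'+v'(X)}\right),$$

since both sides equal $\rho(x,X)/\rho(y,X)$. Since $(x, y)$ is distinguishable, we know that neither side of the equation is equal to 1, so that $v(x) - v(y) \neq 0$. Using $(v', \sigma') = \alpha (v, \sigma)$ and taking natural logs of both sides gives

$$\gamma \frac{v(x)-v(y)}{\sigma+v(X)} = \gamma' \frac{v(x)-v(y)}{\sigma+v(X)}.$$

And the desired result follows using $v(x) - v(y) \neq 0$. □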

5. Context effects

In this section, we use our equivalence result to study how the divisive normalization model handles context effects.7 We use a standard notion of a context effect as a departure from the Independence of Irrelevant Alternatives (IIA) property developed by Luce (1959). Following the logic of our equivalence result, we find the precise analogs of IIA violations in our information-processing and divisive normalization models. In the normalization model, IIA violations are equivalent to differences across choice sets in the divisive factor. In the information-processing model, IIA violations are equivalent to changes in the marginal cost of reducing stochasticity across different sets. We also show how and the extent to which the divisive normalization model can accommodate well-known choice phenomena such as choice overload, the attraction effect, and the compromise effect. Last, we discuss the limitations on context effects implied by the divisive normalization model, which suggest a new testable prediction.

The IIA property requires that

$$\frac{\rho(x,A)}{\rho(y,A)} = \frac{\rho(x,B)}{\rho(y,B)},$$

whenever $x, y \in A \cap B$. In words, this says the relative choice probability between two alternatives is independent of the other alternatives in the set. For expositional purposes, we only consider probability ratios where $\rho(x,A)/\rho(y,A) \geq 1$. This allows us to interpret larger ratios as being further from the equal-probability case and will simplify the statements of results. This simplification is without loss of generality since we can just invert any ratio that is smaller than 1.

Proposition 2.

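Suppose ρ has normalization representation $(v, \sigma, \gamma)$. Consider $A, B \in \mathcal{A}$ and $x, y \in A \cap B$ such that $\rho(x,B)/\rho(y,B) > 1$. Then the following are equivalent:

  1. $\frac{\rho(x,A)}{\rho(y,A)} > \frac{\rho(x,B)}{\rho(y,B)}$,

  2. $v(A) < v(B)$,

  3. ρ has an information-processing representation $(v, \{C_A\}_{A \in \mathcal{A}})$, where $\delta(A) < \delta(B)$.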

Proof.

The proof of Theorem 1 establishes that we can use the same parameters (v, σ, γ) in the normalization representation as in the information-processing representation and associated MCC. This, along with Theorem 1, immediately establishes the equivalence of Parts 2 and 3.

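For the equivalence of Parts 1 and 2, let $x, y \in A \cap B$ be such that $v(x) > v(y)$. From Eq. (3), this implies $\rho(x,C) > \rho(y,C)$ whenever $x, y \in C$. By the definition of $R_{xy}$, we then know that

$$R_{xy}(A) < R_{xy}(B) \iff \frac{\rho(x,A)}{\rho(y,A)} > \frac{\rho(x,B)}{\rho(y,B)}.$$

By Eq. (2), this can be written as

$$\frac{\sigma+v(A)}{\sigma+v(X)} < \frac{\sigma+v(B)}{\sigma+v(X)} \iff \frac{\rho(x,A)}{\rho(y,A)} > \frac{\rho(x,B)}{\rho(y,B)},$$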

which gives us the desired result. □

The equivalence between Parts 1 and 2 in Proposition 2 establishes that a larger divisive factor is equivalent to IIA violations that move the probability ratios closer to equal probability. This demonstrates that the normalization model captures context effects through the divisive factor. Previous papers have noted this relationship between the divisive factor and IIA violations in more limited contexts. For example, Louie et al. (2013) studied this feature of the normalization model in three-item choice sets, while our result works for a choice set of any size. The equivalence between Parts 1 and 3 in Proposition 2 also shows that IIA violations correspond to changes in δ(A) in the information-processing model. Recall that δ(A) equals the marginal cost of reducing stochasticity on set A.
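The link between the divisive factor and IIA violations can be illustrated with a small example. In the sketch below, the sets $A$ and $B$ share the pair $(a, b)$ and differ only in their third element; all names and numbers are hypothetical.

```python
import math

def dn_choice_probs(values, sigma, gamma):
    """Choice probabilities of Eq. (3) on one choice set."""
    divisor = sigma + sum(values.values())
    weights = {x: math.exp(gamma * v / divisor) for x, v in values.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

v = {"a": 1.0, "b": 2.0, "c": 0.5, "d": 4.0}
sigma, gamma = 1.0, 3.0
A = {k: v[k] for k in ("a", "b", "c")}    # v(A) = 3.5
B = {k: v[k] for k in ("a", "b", "d")}    # v(B) = 7.0, a larger divisive factor

rho_A = dn_choice_probs(A, sigma, gamma)
rho_B = dn_choice_probs(B, sigma, gamma)

# v(b) > v(a), so both ratios exceed 1; the ratio is closer to 1 on B.
ratio_A = rho_A["b"] / rho_A["a"]
ratio_B = rho_B["b"] / rho_B["a"]
print(ratio_A, ratio_B)
```

Because $v(A) < v(B)$, the odds of $b$ over $a$ are larger on $A$ than on $B$, which is the IIA violation described by Parts 1 and 2.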

We can use Proposition 2 to better understand how and why the divisive normalization model accounts for previously studied context effects. For example, the well-known Compromise and Attraction Effects create IIA violations by adding a third alternative to a two-alternative choice set (Huber et al., 1982; Simonson, 1989). Louie et al. (2013) found IIA violations when they replaced the worst alternative in a three-alternative choice set with a slightly improved option. Specifically, they found this increased choice stochasticity between the two unchanged alternatives, in the sense of pushing the probability ratio closer to 1. Proposition 2 shows exactly how divisive normalization can accommodate these IIA violations. It does so because adding an alternative or raising the value of an alternative both change the divisive factor that drives context effects. Proposition 2 also suggests why these context effects occur, namely, because of changes in the marginal cost of reducing stochasticity across choice sets. This interpretation lines up particularly nicely with the result in Louie et al. (2013), since, under Proposition 2, raising the value of the worst alternative increases the marginal cost of reducing stochasticity, which naturally leads to more stochastic choices.

Choice overload is another context effect naturally captured by divisive normalization. This effect was found in a series of studies challenging the idea that adding options must weakly improve choice outcomes (Chernev, 2004; Iyengar et al., 2003; Iyengar and Lepper, 2000; Shah and Wolford, 2007). Instead, these papers make the argument that having more alternatives can lead to “worse” choices or even declining to choose at all. Our framework does not allow the possibility of choosing nothing, so we cannot capture that aspect of choice overload. However, the divisive normalization model does, very naturally, capture the idea that more options can lead to worse choices.8 More precisely, suppose we have a choice rule ρ with a divisive normalization representation $(v, \sigma, \gamma)$. Let $q \in \Delta A$ be the choice distribution calculated from $\rho(\cdot, A \cup \{x\})$, conditional on not choosing $x$. We show that $q$ is first-order stochastically dominated by $\rho(\cdot, A)$, when using $v$ to rank the alternatives. In other words, adding $x$ to the set $A$ unambiguously worsens the choices made among the items in $A$. We formally state and prove this in the following proposition.

Proposition 3.

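Suppose ρ has divisive normalization representation $(v, \sigma, \gamma)$. Choose $A \in \mathcal{A}$ such that $v$ is not constant over $A$. Let $x \notin A$. Define $q \in \Delta A$ by

$$q(y) = \rho(y, A \cup \{x\}) \Big/ \Big(1 - \sum_{z \notin A} \rho(z, A \cup \{x\})\Big).$$

Then $\rho(\cdot, A)$ first-order stochastically dominates $q$ when ordering the alternatives according to $v$.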

Proof.

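Order the elements of $A$ as $z_1, \ldots, z_n$, where $v(z_i) < v(z_{i+1})$ for each $i < n$. We want to show that for each $m = 1, \ldots, n$,

$$\sum_{i \leq m} \big(q(z_i) - \rho(z_i, A)\big) \geq 0,$$

with the inequality strict for at least one $m$. First note that whenever $j \geq i$, we have

$$\frac{q(z_j)}{q(z_i)} = \exp\left(\gamma \frac{v(z_j)-v(z_i)}{\sigma+v(A \cup \{x\})}\right) \leq \exp\left(\gamma \frac{v(z_j)-v(z_i)}{\sigma+v(A)}\right) = \frac{\rho(z_j,A)}{\rho(z_i,A)}.$$

We will use this inequality repeatedly in what follows. Moreover, the inequality must hold strictly whenever $v(z_j) \neq v(z_i)$.

Next, note that

$$q(z_1) = \left(\sum_{i=1}^{n} \frac{q(z_i)}{q(z_1)}\right)^{-1} > \left(\sum_{i=1}^{n} \frac{\rho(z_i,A)}{\rho(z_1,A)}\right)^{-1} = \rho(z_1, A).$$

We get the strict inequality because $v$ is not constant over $A$, so that $q(z_i)/q(z_1) < \rho(z_i,A)/\rho(z_1,A)$ for at least one $i$.

We have now shown the desired inequality holds strictly for $m = 1$. Now suppose the inequality holds at $m$; we will show it holds at $m + 1$. If $m + 1 = n$, the inequality holds with equality because probabilities sum to 1. Suppose $m + 1 < n$. If $q(z_{m+1}) \geq \rho(z_{m+1}, A)$, then the desired result is obtained by simply adding $q(z_{m+1}) - \rho(z_{m+1}, A)$ to the left-hand side of the inequality stated at $m$. Otherwise, if $q(z_{m+1}) < \rho(z_{m+1}, A)$, then

$$\begin{aligned}
\sum_{i \leq m+1} \big(q(z_i) - \rho(z_i,A)\big) &= -\sum_{i > m+1} \big(q(z_i) - \rho(z_i,A)\big) \\
&= -\sum_{i > m+1} \left(q(z_{m+1}) \frac{q(z_i)}{q(z_{m+1})} - \rho(z_{m+1},A) \frac{\rho(z_i,A)}{\rho(z_{m+1},A)}\right) \\
&\geq -\sum_{i > m+1} \left(\frac{\rho(z_i,A)}{\rho(z_{m+1},A)} \big(q(z_{m+1}) - \rho(z_{m+1},A)\big)\right) \geq 0,
\end{aligned}$$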

as desired. □
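Proposition 3 can be illustrated numerically. The sketch below adds a hypothetical alternative `x` to a three-item set, conditions on not choosing `x`, and compares cumulative probabilities in increasing order of value; names and numbers are assumptions for illustration.

```python
import math

def dn_choice_probs(values, sigma, gamma):
    """Choice probabilities of Eq. (3) on one choice set."""
    divisor = sigma + sum(values.values())
    weights = {x: math.exp(gamma * v / divisor) for x, v in values.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

sigma, gamma = 1.0, 3.0
A_vals = {"z1": 1.0, "z2": 2.0, "z3": 4.0}    # v is not constant over A
x_val = 3.0                                   # value of the added alternative x

rho_A = dn_choice_probs(A_vals, sigma, gamma)
rho_Ax = dn_choice_probs({**A_vals, "x": x_val}, sigma, gamma)

# q: distribution on A from rho(., A u {x}), conditional on not choosing x.
mass_on_A = sum(rho_Ax[z] for z in A_vals)
q = {z: rho_Ax[z] / mass_on_A for z in A_vals}

# Cumulative gaps sum_{i <= m} (q(z_i) - rho(z_i, A)), ordered by increasing v.
order = sorted(A_vals, key=A_vals.get)
cum_gaps = []
running = 0.0
for z in order:
    running += q[z] - rho_A[z]
    cum_gaps.append(running)
print(cum_gaps)
```

Every cumulative gap is nonnegative and the first is strictly positive, so $q$ places more mass on low-value items than $\rho(\cdot, A)$ does.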

It is also important to note limitations on the types of context effects which divisive normalization can accommodate. For example, the Attraction Effect is often associated with (stochastic) preference reversals, where an alternative $x$ is chosen more often than $y$ in one choice set but not in another. The divisive normalization model can never achieve these reversals, which is an immediate implication of Axiom 1.9 Instead, the values of the normalization model will create an ordinal ranking of the alternatives, where higher alternatives are always chosen more often. This also implies that the divisive normalization model obeys Weak Stochastic Transitivity (Block and Marschak, 1960), which requires that if $\rho(x,\{x,y\}) \geq \frac{1}{2}$ and $\rho(y,\{y,z\}) \geq \frac{1}{2}$, then $\rho(x,\{x,z\}) \geq \frac{1}{2}$.

The divisive normalization model obeys an even stronger notion of ordering known as strong stochastic transitivity (SST). This is the requirement that if $\rho(x,\{x,y\}) > \frac{1}{2}$ and $\rho(y,\{y,z\}) \geq \frac{1}{2}$, then $\rho(x,\{x,z\}) \geq \max\{\rho(y,\{y,z\}), \rho(x,\{x,y\})\}$. In words, if $x$ beats $y$ and $y$ beats $z$, then not only must $x$ beat $z$ but the gap between $x$ and $z$ must be at least as large as either of the previous two comparisons. To see that SST holds, note that given a divisive normalization model $(v, \sigma, \gamma)$, $\rho(x,\{x,y\}) > \frac{1}{2}$ and $\rho(y,\{y,z\}) \geq \frac{1}{2}$ imply that $v(x) > v(y) \geq v(z)$. Simple algebra then yields

$$\gamma \frac{v(x)-v(z)}{\sigma+v(x)+v(z)} \geq \max\left\{\gamma \frac{v(x)-v(y)}{\sigma+v(x)+v(y)},\; \gamma \frac{v(y)-v(z)}{\sigma+v(y)+v(z)}\right\},$$

with strict inequality when $v(y) > v(z)$. From Eq. (3), this implies that $\rho(x,\{x,z\}) \geq \max\{\rho(y,\{y,z\}), \rho(x,\{x,y\})\}$.
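SST can be spot-checked on a grid of values. The sketch below uses the binary form of Eq. (3); the grid, parameter values, and function name are hypothetical choices for illustration.

```python
import math
from itertools import combinations

def pair_prob(vx, vy, sigma, gamma):
    """rho(x, {x, y}) under Eq. (3) for a two-item choice set."""
    return 1.0 / (1.0 + math.exp(-gamma * (vx - vy) / (sigma + vx + vy)))

sigma, gamma = 1.0, 3.0
grid = [0.5, 1.0, 2.0, 4.0, 8.0]              # increasing, strictly positive values

violations = []
for vz, vy, vx in combinations(grid, 3):      # combinations keeps order, so vz < vy < vx
    p_xy = pair_prob(vx, vy, sigma, gamma)
    p_yz = pair_prob(vy, vz, sigma, gamma)
    p_xz = pair_prob(vx, vz, sigma, gamma)
    if p_xz < max(p_xy, p_yz):
        violations.append((vx, vy, vz))
print(violations)
```

No triple on the grid violates SST, in line with the algebraic argument above.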

Another restriction on context effects is that the direction of the IIA violation must be consistent across all pairs when moving across choice sets. In other words, if one pair of items violates IIA by being further from the equal-probability case in set A versus set B, then the same must be true of all pairs that appear in both sets. We state this formally as:

Corollary 3.

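Suppose ρ has divisive normalization representation $(v, \sigma, \gamma)$. Let $x, y, x', y' \in A \cap B$ be such that $v(x) > v(y)$ and $v(x') > v(y')$. Then

$$\frac{\rho(x,A)}{\rho(y,A)} > \frac{\rho(x,B)}{\rho(y,B)} \iff \frac{\rho(x',A)}{\rho(y',A)} > \frac{\rho(x',B)}{\rho(y',B)}.$$

Corollary 3 follows immediately from Proposition 2, because the equivalence between Parts 1 and 2 implies that whether

$$\frac{\rho(x,A)}{\rho(y,A)} > \frac{\rho(x,B)}{\rho(y,B)}$$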

holds for any particular pair can be determined by an inequality that depends on the sets as a whole.

Another way to interpret Corollary 3 is in terms of the relative stochasticity of the choice sets. The choice between x and y is more stochastic when the probability ratio between x and y is closer to 1. Therefore, Corollary 3 says any two sets (that share at least two items) can be unambiguously ranked by how stochastic the choices are on them. This provides a novel testable restriction on the divisive normalization model.

6. Concluding remarks

In this paper, we studied three different models that each presented a different perspective on choice behavior. The divisive normalization model says how the brain makes choices, namely, via the neurobiologically-motivated normalization computation. The information-processing formulation provides some insight into why that computation is used, namely, because it optimally balances the benefits and costs of reducing stochasticity. The axiomatic characterization pinpoints exactly what behavior arises by providing a set of testable behavioral predictions. Our main result proves an equivalence between these three models, uniting them into a single theory that can simultaneously address the “how,” the “why,” and the “what” of choice behavior.

We also explore how the parameters of the divisive normalization model can be identified from behavior, and what that tells us about the link between observable choice and observable neural quantities. We prove that, in the divisive normalization model, there is a one-to-one relationship between the neurally measurable subjective value function and the behavior it generates. This creates a theoretical foundation for work that links neural and behavioral data, and indicates that inference about neural variables can be made from choice behavior alone.

Lastly, we use our equivalence result to study how the divisive normalization model handles context effects. The divisive normalization model allows for context effects through changes in the divisive factor. When the divisive factor is equal across two choice sets, the choices on those sets will be context independent in the sense of obeying the Independence of Irrelevant Alternatives (IIA). We use our equivalence result to provide behavioral and information-processing analogs to the changes in divisive factor that drive context effects. We then apply these analogs to shed new light on existing empirical work, and to provide a novel testable prediction on the type of context effects allowed in the divisive normalization model.

We conclude by commenting on one of the more unusual aspects of our paper relative to the economics literature, namely our inclusion of a neurobiologically-motivated functional form in a choice model. The inclusion of this aspect is motivated, in part, by the argument due to Simon (1955, p. 99) that a theory of decision making should be consistent “with the access to information and the computational capacities that are actually possessed by the organism.” At the time of Simon’s writing, the development of such a theory was hindered by a lack of empirical knowledge about precisely such information and computational capacities, a fact which Simon himself noted (Simon 1955, p. 100).

In the decades since then, advances in neuroscience have taught us much about the actual decision processes of various organisms, humans included. By capitalizing on these advances, we have been able to build a theory of decision-making consistent with how the human brain actually makes choices, and, in this way, advance Simon’s argument. With this, we hope to have taken a step towards reconciling traditional approaches to decision-making with the fact that all decision-making must, ultimately, have a physical implementation.

Acknowledgments

Financial support from NIH grant R01DA038063, NYU Stern School of Business, NYU Shanghai, and J.P. Valles is gratefully acknowledged. We have greatly benefited from discussions with Andrew Caplin, Faruk Gul, Kenway Louie, Anthony Marley, Paula Miret, Wolfgang Pesendorfer, Doron Ravid, Shellwyn Weston, Jan Zimmerman, members of the Glimcher Lab at NYU, and seminar audiences at Columbia, HKUST, NYU Shanghai, PHBS Shenzhen, Princeton, Yale, and Zurich University. We are also grateful to two anonymous referees for very helpful comments. A previous version of this paper was entitled “Rational Imprecision: Information-Processing, Neural, and Choice-Rule Perspectives”.

Appendix A. Proof of Theorem 1

We will show that all three parts of Theorem 1 are equivalent to Eq. (3). Additionally, while not strictly required for Theorem 1, our proof will show that the same set of parameters (v, σ, γ) are used in Eq. (3), the divisive normalization model, and the information-processing model. In other words, our proof will also show the following three statements are equivalent:

  1. (v, σ, γ) satisfies Eq. (3),

  2. (v, σ, γ) is a divisive normalization representation of ρ,

  3. (v, σ, γ) is an information-processing representation of ρ that obeys the MCC.

A1. Equivalence with divisive normalization

Proving the equivalence of the divisive normalization model and Eq. (3) follows the lines of well-known arguments (Luce and Suppes, 1965; McFadden, 1978). To begin, suppose ρ has a divisive normalization representation (v, σ, γ), so that

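$$\rho(x,A) = \Pr\left(x = \arg\max_{y \in A}\; \gamma \frac{v(y)}{\sigma+v(A)} + \varepsilon_y\right),$$

where the $\varepsilon_y$’s are i.i.d. and Gumbel with location 0 and scale 1. Let $g(t) = \exp(-t - \exp(-t))$ and $G(t) = \exp(-\exp(-t))$ be the pdf and cdf of a Gumbel (0, 1) random variable. We then have

$$\begin{aligned}
\rho(x,A) &= \int_{-\infty}^{+\infty} \left\{\prod_{y \in A\setminus\{x\}} G\left(\gamma \frac{v(x)-v(y)}{\sigma+v(A)} + t\right)\right\} g(t)\,dt \\
&= \int_{-\infty}^{+\infty} \left\{\prod_{y \in A\setminus\{x\}} \exp\left(-\exp\left(-\gamma \frac{v(x)-v(y)}{\sigma+v(A)} - t\right)\right)\right\} \exp(-t - \exp(-t))\,dt,
\end{aligned}$$

which we can rearrange to give

$$\rho(x,A) = \int_{-\infty}^{+\infty} \left\{\exp\left(-\exp(-t)\left(1 + \sum_{y \in A\setminus\{x\}} \exp\left(-\gamma \frac{v(x)-v(y)}{\sigma+v(A)}\right)\right)\right)\right\} \exp(-t)\,dt,$$

which can be integrated to obtain

$$\rho(x,A) = \frac{1}{1 + \sum_{y \in A\setminus\{x\}} \exp\left(-\gamma \frac{v(x)-v(y)}{\sigma+v(A)}\right)} \times \exp\left(-\exp(-t)\left(1 + \sum_{y \in A\setminus\{x\}} \exp\left(-\gamma \frac{v(x)-v(y)}{\sigma+v(A)}\right)\right)\right)\Bigg|_{t=-\infty}^{t=+\infty}.$$

Evaluating at the limits yields

$$\rho(x,A) = \frac{1}{1 + \sum_{y \in A\setminus\{x\}} \exp\left(-\gamma \frac{v(x)-v(y)}{\sigma+v(A)}\right)}\,(1 - 0),$$

which can be rearranged to give

$$\rho(x,A) = \frac{\exp\left(\gamma \frac{v(x)}{\sigma+v(A)}\right)}{\sum_{y \in A} \exp\left(\gamma \frac{v(y)}{\sigma+v(A)}\right)},$$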

as desired. This argument can be run backwards to prove the reverse implication.
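The equivalence can also be checked by simulation. The sketch below draws Gumbel(0, 1) noise by inverse-transform sampling, picks the alternative with the highest noisy re-scaled value, and compares empirical frequencies with Eq. (3). The function names, seed, sample size, and parameter values are assumptions for illustration.

```python
import math
import random

def dn_choice_probs(values, sigma, gamma):
    """Closed-form choice probabilities of Eq. (3)."""
    divisor = sigma + sum(values.values())
    weights = {x: math.exp(gamma * v / divisor) for x, v in values.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def gumbel(rng):
    """One Gumbel(0, 1) draw via the inverse cdf, G^{-1}(u) = -ln(-ln(u))."""
    u = rng.random()
    while u == 0.0:                 # random() can return 0.0; avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def simulate_choice_freqs(values, sigma, gamma, draws, seed):
    """Empirical choice frequencies from the random-utility form of the model."""
    rng = random.Random(seed)
    divisor = sigma + sum(values.values())
    counts = {x: 0 for x in values}
    for _ in range(draws):
        # Choose the alternative with the largest re-scaled value plus noise.
        best = max(values, key=lambda x: gamma * values[x] / divisor + gumbel(rng))
        counts[best] += 1
    return {x: c / draws for x, c in counts.items()}

v = {"a": 1.0, "b": 2.0, "c": 4.0}
exact = dn_choice_probs(v, sigma=1.0, gamma=3.0)
freqs = simulate_choice_freqs(v, sigma=1.0, gamma=3.0, draws=20000, seed=0)
max_gap = max(abs(exact[x] - freqs[x]) for x in v)
print(exact, freqs)
```

With this many draws the simulated frequencies should sit within sampling error of the closed-form probabilities.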

A2. Equivalence with information processing model

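Fix $v: X \to \mathbb{R}_{++}$, $\gamma > 0$, and $\sigma > 0$. Let $\{C_A\}_{A \in \mathcal{A}}$ be any family of strictly increasing, convex, continuously differentiable functions that obey the MCC using $(v, \sigma, \gamma)$. Note that such a family always exists. For example,

$$C_A(c) = \frac{\sigma+v(A)}{\gamma}\, c.$$

Recall, we defined ρ to have information-processing representation $(v, \{C_A\}_{A \in \mathcal{A}})$ if, for each $A \in \mathcal{A}$, $\rho(\cdot, A)$ is a solution to the following maximization problem:

$$\max_{p \in \Delta A} \sum_{x \in A} p(x)\, v(x) - C_A(\Delta H(p)). \tag{5}$$

To prove the equivalence, it suffices to show that, for each $A \in \mathcal{A}$, the unique solution to this maximization problem is given by the measure defined by Eq. (3) using $(v, \sigma, \gamma)$. Fix $A \in \mathcal{A}$, and define $p^* \in \Delta A$ to be that measure. That is, for each $x \in A$:

$$p^*(x) = \frac{\exp\left(\gamma \frac{v(x)}{\sigma+v(A)}\right)}{\sum_{y \in A} \exp\left(\gamma \frac{v(y)}{\sigma+v(A)}\right)}.$$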

Next, note that any function $p: A \to [0,1]$ can be viewed as a point in $\mathbb{R}^{|A|}$. Under this interpretation, $\Delta A$ forms a compact subset of $\mathbb{R}^{|A|}$ defined by affine constraints. Also note that, by standard properties of entropy, $H(\cdot)$ is strictly concave, which means $\Delta H(\cdot)$ is strictly convex. Using that $C_A(\cdot)$ is convex and strictly increasing, it follows that $C_A(\Delta H(\cdot))$ is strictly convex, and hence $-C_A(\Delta H(\cdot))$ is strictly concave. Therefore, the objective function of the maximization problem in (5) is strictly concave since the only other term is linear. Affine constraints and a strictly concave objective function mean that the Karush-Kuhn-Tucker conditions are both necessary and sufficient for a feasible point to be a solution. Those conditions say

$$v(x) - C_A'(\Delta H(p))\,\big(\ln(p(x)) + 1\big) + \lambda + \mu_x = 0, \tag{6}$$

for some $\lambda$ and $\mu_x$, with the complementary slackness condition that $p(x) \neq 0 \Rightarrow \mu_x = 0$. We now show that $p^*$ satisfies those conditions. Since $p^*(x) > 0$ for all $x$, we set $\mu_x = 0$. Define

$$\lambda := -\frac{\sigma+v(A)}{\gamma}\left(\ln\left(\sum_{y \in A} \exp\left(\gamma \frac{v(y)}{\sigma+v(A)}\right)\right) - 1\right).$$

We need to show that

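$$v(x) - C_A'(\Delta H(p^*))\,\big(\ln(p^*(x)) + 1\big) + \lambda = 0.$$

By definition, $\delta(A, v, \sigma, \gamma) = \Delta H(p^*)$. Hence we can apply the MCC to transform the above equation into

$$v(x) - \frac{\sigma+v(A)}{\gamma}\,\big(\ln(p^*(x)) + 1\big) + \lambda = 0.$$

Using our definition of $p^*$ this becomes

$$v(x) - \frac{\sigma+v(A)}{\gamma}\left(\gamma \frac{v(x)}{\sigma+v(A)} - \ln\left(\sum_{y \in A} \exp\left(\gamma \frac{v(y)}{\sigma+v(A)}\right)\right) + 1\right) + \lambda = 0,$$

which simplifies to

$$\frac{\sigma+v(A)}{\gamma}\left(\ln\left(\sum_{y \in A} \exp\left(\gamma \frac{v(y)}{\sigma+v(A)}\right)\right) - 1\right) + \lambda = 0,$$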

which holds by definition of λ.

We have now proved that $p^*$ is a solution to the maximization problem. Next suppose $q^*$ also solves the maximization problem. Since $\Delta A$ is closed under convex combinations, we can define a feasible $p \in \Delta A$ by

$$p(x) := \tfrac{1}{2} q^*(x) + \tfrac{1}{2} p^*(x),$$

for each $x \in A$. Since the objective function is strictly concave, if $p^* \neq q^*$, then $p$ would strictly improve on the optimal payoff, which is not possible. Hence $p^*$ must be the unique maximizer, as desired.
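The optimality of $p^*$ can be spot-checked for the linear cost function given as an example above. The sketch assumes that $\Delta H(p)$ is the entropy reduction relative to the uniform distribution, $\ln|A| - H(p)$; that reading, together with the function names, seed, and parameter values, is an assumption made for illustration. With a linear cost, an additive constant in $\Delta H$ would not change the maximizer.

```python
import math
import random

def objective(p, values, k):
    """Expected value minus the linear cost k * (ln|A| - H(p)).

    Assumes Delta H(p) = ln|A| - H(p) and C_A(c) = k * c.
    """
    neg_entropy = sum(pi * math.log(pi) for pi in p if pi > 0.0)   # equals -H(p)
    entropy_reduction = math.log(len(values)) + neg_entropy
    expected_value = sum(pi * vi for pi, vi in zip(p, values))
    return expected_value - k * entropy_reduction

values = [1.0, 2.0, 4.0]
sigma, gamma = 1.0, 3.0
k = (sigma + sum(values)) / gamma             # marginal cost (sigma + v(A)) / gamma

# Candidate maximizer from Eq. (3): p*(x) proportional to exp(v(x) / k).
weights = [math.exp(val / k) for val in values]
p_star = [w / sum(weights) for w in weights]
best = objective(p_star, values, k)

# Compare against random points on the simplex (normalized exponential draws).
rng = random.Random(0)
best_random = -math.inf
for _ in range(5000):
    draws = [rng.expovariate(1.0) for _ in values]
    total = sum(draws)
    if total == 0.0:
        continue
    p = [d / total for d in draws]
    best_random = max(best_random, objective(p, values, k))
print(best, best_random)
```

No sampled distribution attains a higher payoff than $p^*$, consistent with the uniqueness argument.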

A3. Equivalence with axioms

Eq. (3) Implies the axioms

Suppose ρ obeys Eq. (3) using $(v, \sigma, \gamma)$. We will show ρ obeys all six axioms. Axiom 1 follows immediately from the fact that, under Eq. (3), $\rho(x,A) \geq \rho(y,A) \iff v(x) \geq v(y)$, since $\gamma > 0$ and $\sigma + v(A) > 0$ for all $A \in \mathcal{A}$. To show the necessity of the rest of the axioms, we use the following lemma.

Lemma 1.

If ρ obeys Eq. (3) using (v, σ, γ), then

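$$R_{xy}(A) = \frac{\sigma+v(A)}{\sigma+v(X)}$$

whenever $(x, y)$ is distinguishable in $A$.

Proof.

By definition,

$$R_{xy}(A) = \left(\ln\frac{\rho(x,A)}{\rho(y,A)}\right)^{-1}\left(\ln\frac{\rho(x,X)}{\rho(y,X)}\right).$$

Applying Eq. (3) to the right-hand side gives

$$R_{xy}(A) = \left(\gamma \frac{v(x)-v(y)}{\sigma+v(A)}\right)^{-1}\left(\gamma \frac{v(x)-v(y)}{\sigma+v(X)}\right),$$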

and the desired conclusion follows. □

Axiom 2 follows immediately from Lemma 1. Also, Lemma 1 allows us to rewrite the equation in Axiom 3 as

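$$\frac{\sigma+v(A)}{\sigma+v(X)} - \frac{\sigma+v(A\setminus\{z\})}{\sigma+v(X)} = \frac{\sigma+v(B)}{\sigma+v(X)} - \frac{\sigma+v(B\setminus\{z\})}{\sigma+v(X)},$$

which holds since both sides are equal to $v(z)/(\sigma+v(X))$.

Using Lemma 1, the equation in Axiom 4 is equivalent to

$$\frac{v(x)-v(y)}{\sigma+v(X)}\left(\frac{v(x')-v(y')}{\sigma+v(X)}\right)^{-1} = \ln\frac{\rho(x,X)}{\rho(y,X)}\left(\ln\frac{\rho(x',X)}{\rho(y',X)}\right)^{-1},$$

which can be verified by applying Eq. (3) to the right-hand side.

Now suppose that $A \in \mathcal{A}$ contains a distinguishable pair $(x, y)$ and $z, z' \in A\setminus\{x,y\}$ such that $\rho(z,X) > \rho(z',X)$. By Eq. (3), $v(z) > v(z')$. Using this and $v, \sigma > 0$, it follows that

$$\frac{\sigma+v(A)}{\sigma+v(X)} > \frac{\sigma+v(A\setminus\{z'\})}{\sigma+v(X)} > \frac{\sigma+v(A\setminus\{z\})}{\sigma+v(X)},$$

which, via Lemma 1, proves Axiom 5.

Finally, let the sets $A$, $B$ contain a distinguishable pair. Then Axiom 6 is equivalent to

$$\frac{\sigma+v(A)}{\sigma+v(X)} + \frac{\sigma+v(B)}{\sigma+v(X)} > \frac{\sigma+v(A \cup B)}{\sigma+v(X)}.$$

Since the denominators are all positive and $v(A) + v(B) = v(A \cup B) + v(A \cap B)$, this is equivalent to

$$\sigma + v(A \cap B) > 0,$$

which holds because $\sigma, v > 0$.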

A3.1. The axioms imply equation (3)

Now assume Axioms 1–6 hold. We will find $(v, \sigma, \gamma)$ such that Eq. (3) holds. By Axiom 1, if $(x, y)$ is distinguishable in $A$, then $(x, y)$ is distinguishable in all sets that contain this pair. So, we will simply say $(x, y)$ is distinguishable to indicate that $\rho(x, A) \neq \rho(y, A)$ whenever $x, y \in A$. By Axiom 2, for any $A \in \mathcal{A}$, we can set $R(A) = R_{xy}(A)$ for all distinguishable $(x, y)$ in $A$. If $A$ does not contain any distinguishable pairs, set $R(A) = 1$. Also, set $R(\emptyset) = 0$.

Lemma 2.

There exists a distinguishable pair $(x^\star, y^\star)$ such that $X\setminus\{x^\star, y^\star\}$ contains a distinguishable pair.

Proof.

By assumption, $\rho(\cdot, X)$ contains at least three distinct choice probabilities. Let $\{x, y, z\}$ denote the three items generating the distinct probabilities. Since $|X| \geq 4$, there exists $w \in X\setminus\{x,y,z\}$. It must be that either $\rho(w,X) \neq \rho(x,X)$ or $\rho(w,X) \neq \rho(y,X)$. In the first case, set $(x^\star, y^\star) = (y, z)$, and, in the second case, set $(x^\star, y^\star) = (x, z)$. Either way we get the desired result. □

From here on, let $(x^\star, y^\star)$ be a distinguishable pair such that $X\setminus\{x^\star, y^\star\}$ contains a distinguishable pair. For any $x \in X$, note that either $X\setminus\{x\} \supseteq X\setminus\{x^\star, y^\star\}$ or $\{x^\star, y^\star\} \subseteq X\setminus\{x\}$. Hence, $X\setminus\{x\}$ always contains a distinguishable pair.

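Now, for any $x \in X$, define

$$v(x) := R(X) - R(X\setminus\{x\}).$$

Also define

$$\sigma := R(\{x^\star, y^\star\}) - v(x^\star) - v(y^\star).$$

Lemma 3.

If $A \in \mathcal{A}$ contains distinguishable pair $(x, y)$, then

$$R(A) = R(\{x,y\}) + v(A\setminus\{x,y\}).$$

Proof.

Let $z \in A\setminus\{x, y\}$. By Axiom 3,

$$R(X) - R(X\setminus\{z\}) = R(A) - R(A\setminus\{z\}).$$

Combining the above with the definition of $v$ yields

$$R(A) = R(A\setminus\{z\}) + v(z).$$

For any $z' \in A\setminus\{z, y, x\}$, the same logic yields

$$R(A) = R(A\setminus\{z, z'\}) + v(z) + v(z').$$

We repeatedly apply these steps to get the desired result. □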

Lemma 4.

For any $A \in \mathcal{A}$ that contains a distinguishable pair,

$$R(A) = \sigma + v(A).$$

Proof.

Using Lemma 3, it suffices to show that, for any distinguishable pair $(x, y)$,

$$R(\{x,y\}) = \sigma + v(x) + v(y).$$

First, we can apply Lemma 3 and the definition of σ to get

$$R(\{x^\star, y^\star, x\}) = R(\{x^\star, y^\star\}) + v(x) = \sigma + v(x^\star) + v(y^\star) + v(x).$$

Since $\rho(x^\star, X) \neq \rho(y^\star, X)$, it must be that either $(x^\star, x)$ is distinguishable or $(y^\star, x)$ is distinguishable. We will treat the first case, since the proof for the other case is similar. By Lemma 3, we have

$$R(\{x^\star, y^\star, x\}) = R(\{x^\star, x\}) + v(y^\star).$$

Combining the previous two equations gives

$$R(\{x^\star, x\}) = \sigma + v(x^\star) + v(x).$$

Since $(x^\star, x)$ and $(x, y)$ are distinguishable pairs, we can use Lemma 3 to get

$$R(\{x^\star, x, y\}) = R(\{x^\star, x\}) + v(y) = R(\{x, y\}) + v(x^\star).$$

Combining this with the previous display equation yields

$$R(\{x,y\}) = \sigma + v(\{x\}) + v(\{y\}),$$

as desired. □

Since ρ(,X) contains at least three distinct choice probabilities, there exists zX such that all pairs in {x,y,z} are distinguishable. Define

γ:=lnρ(x,X)ρ(y,X)R({x,Z})R({y,Z}).

That γ is well-defined (i.e., that the denominator does not equal 0) is implicit in Axiom 4, where we apply the axiom to the case that (x,y,z)=(x,y,z)=(x,y,z). That (x,y) is distinguishable ensures that γ0.

Now fix any AA, and we claim that

ρ(x,A)ρ(y,A)=exp(γv(x)v(y)σ+v(A)), (7)

for all x,yA. Chose any pair x,yA. Since ρ(,X) contains at least three distinct choice probabilities, there must exist z ∈ X such that (x, z) and (y, z) form distinguishable pairs. Applying Axiom 4, we then get

R({x,Z})R({y,Z})R({x,Z})R({y,Z})=lnρ(x,X)ρ(y,X)(lnρ(x,X)ρ(y,X))1.

Applying the definition of γ and Lemma 4, we have

γ(v(x)v(y))=lnρ(x,X)ρ(y,X).

If (x, y) is not distinguishable, then the above equation implies v(x)=v(y). Therefore, Eq. (7) holds since both sides are equal to 1. If (x, y) is distinguishable, then we can use our definition of R(A) to get

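$$\gamma \, \frac{v(x) - v(y)}{R(A)} = \ln \frac{\rho(x, A)}{\rho(y, A)}.$$

Since we are assuming $(x, y)$ is distinguishable, we can apply Lemma 4 and exponentiate both sides to get

$$\frac{\rho(x, A)}{\rho(y, A)} = \exp\left(\gamma \, \frac{v(x) - v(y)}{\sigma + v(A)}\right).$$

Hence we have proved Eq. (7). Using the fact that choice probabilities must sum to 1, we get that, for each $A \in \mathcal{A}$ and $x \in A$,

$$\rho(x, A) = \frac{\exp\left(\dfrac{\gamma \, v(x)}{\sigma + v(A)}\right)}{\sum_{y \in A} \exp\left(\dfrac{\gamma \, v(y)}{\sigma + v(A)}\right)}.$$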
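As a numerical illustration, the closed-form expression for ρ(x, A) above can be evaluated directly. The sketch below is not part of the proof: the function name `choice_probs`, the alternative labels, and the parameter values are hypothetical and chosen only for illustration. It assumes that v(A) is the sum of v(x) over x in A. It checks that the resulting probabilities sum to 1 and that their ratio matches Eq. (7).

```python
import math

def choice_probs(v, sigma, gamma, A):
    """Return {x: rho(x, A)} for the normalization formula above.

    v     -- dict mapping each alternative to its value v(x)
    sigma -- the constant sigma
    gamma -- the scaling parameter gamma
    A     -- iterable of alternatives forming the choice set
    """
    A = list(A)
    denom = sigma + sum(v[x] for x in A)  # sigma + v(A), assuming v(A) is a sum
    weights = {x: math.exp(gamma * v[x] / denom) for x in A}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

# Hypothetical parameter values, chosen only for illustration.
v = {"a": 0.2, "b": 0.3, "c": 0.4}
sigma, gamma = 0.1, 2.0

A = ["a", "b", "c"]
rho = choice_probs(v, sigma, gamma, A)

# Eq. (7): the odds ratio depends only on v(x) - v(y) and sigma + v(A).
lhs = rho["a"] / rho["b"]
rhs = math.exp(gamma * (v["a"] - v["b"]) / (sigma + sum(v[x] for x in A)))
print(math.isclose(lhs, rhs), math.isclose(sum(rho.values()), 1.0))
```

Because the denominator σ + v(A) changes with the choice set, evaluating `choice_probs` on a subset of `A` generally changes the odds ratio between two fixed alternatives, which is how the formula departs from the standard logit form.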

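All that remains is to show that $v$, $\sigma$, and $\gamma$ are strictly positive. Recall that $X \setminus \{x\}$ contains a distinguishable pair for all $x \in X$. Hence, by Axiom 5,

$$v(x) = R(X) - R(X \setminus \{x\}) > 0,$$

for all $x \in X$. Next, note that, using Lemma 4, we get

$$R(\{x^\star, z^\star\}) - R(\{y^\star, z^\star\}) = R(X \setminus \{y^\star\}) - R(X \setminus \{x^\star\}).$$

Therefore, by Axiom 5, $R(\{x^\star, z^\star\}) - R(\{y^\star, z^\star\})$ and $\ln(\rho(x^\star, X)/\rho(y^\star, X))$ have the same sign. Since $\gamma$ is defined as the ratio of these two terms, and using the fact that we already showed $\gamma$ is well-defined and non-zero, we get that $\gamma > 0$. Finally, recall that, by construction, $\{x^\star, y^\star\}$ and $X \setminus \{x^\star, y^\star\}$ both contain a distinguishable pair. Hence we can apply Lemma 4 to get

$$R(\{x^\star, y^\star\}) + R(X \setminus \{x^\star, y^\star\}) - R(X) = \sigma.$$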

By Axiom 6, the left-hand side of that equation is strictly positive, and it follows that σ > 0.
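The two applications of Lemma 4 in this final step can be checked term by term. The sketch below assumes that $v(A)$ denotes the sum of $v(x)$ over $x \in A$, and writes $x^\star, y^\star, z^\star$ for the reference elements used to define $\gamma$ and $\sigma$; each line substitutes $R(A) = \sigma + v(A)$ and cancels.

```latex
\begin{align*}
R(\{x^\star,z^\star\}) - R(\{y^\star,z^\star\})
  &= \bigl(\sigma + v(x^\star) + v(z^\star)\bigr) - \bigl(\sigma + v(y^\star) + v(z^\star)\bigr) \\
  &= v(x^\star) - v(y^\star), \\[4pt]
R(X \setminus \{y^\star\}) - R(X \setminus \{x^\star\})
  &= \bigl(\sigma + v(X) - v(y^\star)\bigr) - \bigl(\sigma + v(X) - v(x^\star)\bigr) \\
  &= v(x^\star) - v(y^\star), \\[4pt]
R(\{x^\star,y^\star\}) + R(X \setminus \{x^\star,y^\star\}) - R(X)
  &= \bigl(\sigma + v(x^\star) + v(y^\star)\bigr)
   + \bigl(\sigma + v(X) - v(x^\star) - v(y^\star)\bigr)
   - \bigl(\sigma + v(X)\bigr) \\
  &= \sigma.
\end{align*}
```

The first two lines agree, which gives the identity used to sign $\gamma$; the third is the expression whose sign Axiom 6 pins down.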

Footnotes

1

The Attraction Effect has drawn interest in part because it is a single empirical phenomenon that violates a number of well-known choice principles. Here we use it as an example of Regularity violations, but it also violates the Independence of Irrelevant Alternatives (IIA), and can create (stochastic) preference reversals. Below, we will discuss each of these choice principles in more detail and use the Attraction Effect as an example of a phenomenon where they fail.

2

More precisely, by $\rho(\cdot, A) \in \Delta A$ we mean that the restriction of $\rho(\cdot, A)$ to $A$ is in $\Delta A$. Throughout, we will often find it convenient to treat $\rho(\cdot, A)$ as a function from $A$ to $[0, 1]$, since the values of $\rho(\cdot, A)$ outside of $A$ are always zero.

3

We thank a referee for this observation.

4

For details, see the online appendix at http://www.adambrandenburger.com/articles/papers

5

See the online appendix for the proof.

6

The sum $\sigma + v(A)$ would have to have the same sign for all $|A| \geq 2$ to avoid violating Axiom 1.

7

As we noted in the introduction, it has previously been shown that the divisive normalization model can handle contexts (e.g., Louie et al., 2013; Webb et al., 2014). What we add here is the use of our equivalence result to shed further light on how and why the normalization model accomplishes this.

8

We thank a referee for making this observation.

9

For an alternative formulation of the divisive normalization model that can achieve choice reversals, see Zimmermann et al. (2018).

References

  1. Agranov M, Ortoleva P, 2017. Stochastic choice and preferences for randomization. J. Polit. Econ. 125.
  2. Anderson S, de Palma A, Thisse J-F, 1992. Discrete Choice Theory of Product Differentiation. MIT Press, Cambridge.
  3. Attneave F, 1954. Some informational aspects of visual perception. Psychol. Rev. 61 (3), 183–193. doi: 10.1037/h0054663.
  4. Barlow H, 1961. Possible principles underlying the transformation of sensory messages. In: Rosenblith WA (Ed.), Sensory Communication. M.I.T. Press, Cambridge, pp. 217–234.
  5. Block HD, Marschak J, 1960. Random orderings and stochastic theories of responses. In: Olkin I (Ed.), Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. Stanford University Press, Stanford, pp. 97–132.
  6. Caplin A, Dean M, 2015. Revealed preference, rational inattention, and costly information acquisition. Am. Econ. Rev. 105 (7).
  7. Carandini M, Heeger D, 2012. Normalization as a canonical neural computation. Nature Rev. Neurosci. (November) 1–12. doi: 10.1038/nrn3136.
  8. Chernev A, 2004. When more is less and less is more: the role of ideal point availability and assortment in consumer choice. J. Consumer Res. 30, 180–183.
  9. David H, Nagaraja H, 2003. Order Statistics, 3rd ed. Wiley, New York. doi: 10.1016/j.bpj.2010.07.012.
  10. Echenique F, Saito K, 2019. General Luce model. Econ. Theory, forthcoming.
  11. Echenique F, Saito K, Tserenjigmid G, 2018. The perception-adjusted Luce model. Math. Soc. Sci. 93 (C), 67–76.
  12. Fehr E, Rangel A, 2011. Neuroeconomic foundations of economic choice-recent advances. J. Econ. Perspect. 25 (4), 3–30. doi: 10.1257/jep.25.4.3.
  13. Fudenberg D, Iijima R, Strzalecki T, 2015. Stochastic choice and revealed perturbed utility. Econometrica 83 (6), 2371–2409. doi: 10.3982/ECTA12660.
  14. Glimcher P, 2011. Foundations of Neuroeconomic Analysis. OUP.
  15. Glimcher PW, Tymula AA, 2018. Expected Subjective Value Theory (ESVT): A Representation of Decision under Risk and Certainty. Working Paper, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2783638.
  16. Gul F, Natenzon P, Pesendorfer W, 2014. Random choice as behavioral optimization. Econometrica 82 (5), 1873–1912. doi: 10.3982/ECTA10621.
  17. Huber J, Payne JW, Puto C, 1982. Adding asymmetrically dominated alternatives: violations of regularity and the similarity hypothesis. J. Consumer Res. 9 (1), 90–98. doi: 10.1086/208899.
  18. Itthipuripat S, Cha K, Rangsipat N, Serences JT, 2015. Value-based attentional capture influences context-dependent decision-making. J. Neurophysiol. 114 (1), 560–569. doi: 10.1152/jn.00343.2015.
  19. Iyengar SS, Jiang W, Huberman G, 2003. How much choice is too much? Contributions to 401(k) retirement plans. In: Mitchell O, Utkus S (Eds.), Pension Design and Structure: New Lessons from Behavioral Finance. Oxford University Press, Oxford, pp. 83–96.
  20. Iyengar SS, Lepper MR, 2000. When choice is demotivating: can one desire too much of a good thing? J. Personal. Soc. Psychol. 79 (6), 995–1006. doi: 10.1037/0022-3514.79.6.995.
  21. Khaw MW, Glimcher PW, Louie K, 2017. Normalized value coding explains dynamic adaptation in the human valuation process. Proc. Natl. Acad. Sci. 114 (48), 201715293. doi: 10.1073/pnas.1715293114.
  22. Knutson B, Adams CM, Fong GW, Hommer D, 2001. Anticipation of increasing monetary reward selectively recruits nucleus accumbens. J. Neurosci. 21 (16), RC159.
  23. Landauer R, 1961. Irreversibility and heat generation in the computing process. IBM J. Res. Devel. 5 (July), 183–191. doi: 10.1147/rd.53.0183.
  24. Landry P, Webb R, 2017. Pairwise normalization: a neuroeconomic theory of multi-attribute choice. SSRN. doi: 10.2139/ssrn.2963863.
  25. Loewenstein G, Rick S, Cohen J, 2008. Neuroeconomics. Ann. Rev. Psychol. 59, 647–672. doi: 10.1017/9781316676349.022.
  26. Louie K, Glimcher PW, Webb R, 2015. Adaptive neural coding: from biological to behavioral decision-making. Current Opin. Behav. Sci. 5, 91–99. doi: 10.1016/j.cobeha.2015.08.008.
  27. Louie K, Grattan LE, Glimcher PW, 2011. Reward value-based gain control: divisive normalization in parietal cortex. J. Neurosci. 31 (29), 10627–10639. doi: 10.1523/JNEUROSCI.1237-11.2011.
  28. Louie K, Khaw MW, Glimcher PW, 2013. Normalization is a general neural mechanism for context-dependent decision making. Proc. Natl. Acad. Sci. USA 110 (15), 6139–6144. doi: 10.1073/pnas.1217854110.
  29. Luce RD, 1959. Individual Choice Behavior: A Theoretical Analysis. Wiley, New York.
  30. Luce RD, Suppes P, 1965. Preference, utility, and subjective probability. In: Luce RD, Bush R, Galanter E (Eds.), Handbook of Mathematical Psychology. Wiley, New York, pp. 249–410.
  31. Machina MJ, 1989. Dynamic consistency and non-expected utility models of choice under uncertainty. J. Econ. Literat. 27 (4), 1622–1668.
  32. Mandl F, 1988. Statistical Physics, 2nd ed. John Wiley & Sons.
  33. Marley AAJ, Flynn TN, Louviere JJ, 2008. Probabilistic models of set-dependent and attribute-level best-worst choice. J. Math. Psychol. 52 (5), 281–296. doi: 10.1016/j.jmp.2008.02.002.
  34. Matějka F, McKay A, 2014. Rational inattention to discrete choices: a new foundation for the multinomial logit model. Am. Econ. Rev. 105 (1), 1–55. doi: 10.1257/aer.20130047.
  35. Mattsson L-G, Weibull JW, 2002. Probabilistic choice and procedurally bounded rationality. Games Econ. Behav. 41 (1), 61–78. doi: 10.1016/S0899-8256(02)00014-3.
  36. McFadden D, 1973. Conditional logit analysis of qualitative choice behavior. In: Zarembka P (Ed.), Frontiers in Econometrics. Wiley, New York, pp. 105–142.
  37. McFadden D, 1978. Modelling the choice of residential location. In: Karlqvist S, Lundqvist L, Snickars F, Weibull J (Eds.), Spatial Interaction Theory and Planning Models, 673. North-Holland, Amsterdam, pp. 75–96.
  38. Platt ML, Glimcher PW, 1999. Neural correlates of decision variables in parietal cortex. Nature 400 (6741), 233–238. doi: 10.1038/22268.
  39. Ravid D, 2015. Focus, then compare. Work. Pap. 1–40.
  40. Rieskamp J, Busemeyer JR, Mellers BA, 2006. Extending the bounds of rationality: evidence and theories of preferential choice. J. Econ. Literat. 44 (3), 631–661. doi: 10.1257/jel.44.3.631.
  41. Schwartz O, Simoncelli EP, 2001. Natural signal statistics and sensory gain control. Nature Neurosci. 4 (8), 819–825. doi: 10.1038/90526.
  42. Shah AM, Wolford G, 2007. Buying behavior as a function of parametric variation of number of choices. Psychol. Sci. 18 (5), 369–370.
  43. Shannon CE, 1948. A mathematical theory of communication. Bell Syst. Tech. J. 27 (3), 379–423.
  44. Simon HA, 1955. A behavioral model of rational choice. Quart. J. Econ. 69 (1), 99–118. doi: 10.2307/1884852.
  45. Simonson I, 1989. Choice based on reasons: the case of attraction and compromise effects. J. Consumer Res. 16 (2), 158–174. doi: 10.1086/209205.
  46. Sims C, 1998. Stickiness. Proceedings of the Carnegie-Rochester Conference Series on Public Policy 49, 317–356. doi: 10.1016/S0167-2231(99)00013-5.
  47. Sims CA, 2003. Implications of rational inattention. J. Monet. Econ. 50 (3), 665–690. doi: 10.1016/S0304-3932(03)00029-1.
  48. Swait J, Marley AAJ, 2013. Probabilistic choice (models) as a result of balancing multiple goals. J. Math. Psychol. 57 (1–2), 1–14. doi: 10.1016/j.jmp.2013.03.003.
  49. Thurstone L, 1927. A law of comparative judgment. Psychol. Rev. 34 (4), 273–286.
  50. Tserenjigmid G, 2016. The Order-Dependent Luce Model, 1–20.
  51. Webb R, Glimcher P, Levy I, 2013. Neural Random Utility and Measured Value. SSRN, pp. 1–36.
  52. Webb R, Glimcher PW, Louie K, 2014. Rationalizing context-dependent preferences: divisive normalization and neurobiological constraints on choice. Work. Pap. 1–56.
  53. Zimmermann J, Glimcher PW, Louie K, 2018. Multiple timescales of normalized value coding underlie adaptive choice behavior. Nature Commun. 9 (1), 3206. doi: 10.1038/s41467-018-05507-8.
