Abstract
This article establishes a general equivalence between discrete choice and rational inattention models. Matějka and McKay (2015) showed that when information costs are modeled using the Shannon entropy, the choice probabilities in the rational inattention (RI) model take the multinomial logit form. We show that, for one given prior over states, RI choice probabilities may take the form of any additive random utility discrete choice model (ARUM) when the information cost is a Bregman information, a class defined in this article. The prior information of the rationally inattentive agent is summarized in a constant vector of utilities in the corresponding ARUM.
1. INTRODUCTION
In many situations where agents make decisions under uncertainty, information acquisition is costly (involving pecuniary, time, or psychological costs); therefore, agents may rationally choose to remain imperfectly informed about the available options. This idea underlies the theory of rational inattention (RI), which has become an important paradigm for modeling boundedly rational behavior in many areas of economics (Sims, 2003, 2010). Our main contribution in this article is to establish a general equivalence between additive random utility discrete choice and RI models. Matějka and McKay (2015) showed that when information costs are modeled using the Shannon mutual information between actions and states, the resulting choice probabilities in the RI model take the familiar multinomial logit form, leading to the “RI‐logit” model, as we will refer to it below. This is a very appealing result, providing a microfoundation as well as an alternative interpretation for the multinomial logit model.
However, the RI‐logit model has the “independence of irrelevant alternatives” (IIA) property, which states that, in a given state, the ratio of the choice probabilities of two alternatives does not depend on the utility of a third (irrelevant) alternative.2 In many empirical contexts, the IIA property implies restrictive and unrealistic substitution patterns among the choice options, as illustrated in the following example:
Example 1
Consider a rationally inattentive consumer facing a choice between pineapple (good 1), mango (good 2), and cheesecake (good 3). A priori, the consumer does not know the value associated with each good but has fixed prior beliefs about the possible realizations of the valuation vector . Assume that has four equally likely possible states:
and assume that we observe corresponding choice probabilities and for the first two states.3 These probabilities reflect a situation where an increase (from 0 to 0.1) in the value of good 1, the pineapple, causes consumers to substitute disproportionately from the mango instead of the cheesecake. However, such outcomes violate the IIA property, as the choice probability for mango decreases by 10%, whereas the choice probability for cheesecake decreases only by 4%; hence, they cannot arise from the RI‐logit model.
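The IIA restriction invoked in this example can be checked numerically. The following is a minimal sketch for a plain multinomial logit; in the RI‐logit the payoffs are additionally shifted by a vector that does not vary across states, so the same argument applies. The valuations below are hypothetical (only the rise in the value of good 1 from 0 to 0.1 comes from the example), and `logit_probs` is an illustrative helper, not notation from the article.

```python
import math

def logit_probs(v):
    """Multinomial logit choice probabilities for a payoff vector v."""
    m = max(v)  # subtract the maximum for numerical stability
    weights = [math.exp(x - m) for x in v]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical valuations for (pineapple, mango, cheesecake); only the rise
# in the first coordinate from 0 to 0.1 is taken from the example.
v_low = [0.0, 0.2, 0.5]
v_high = [0.1, 0.2, 0.5]
p_low, p_high = logit_probs(v_low), logit_probs(v_high)

# Odds of mango versus cheesecake before and after the change.
odds_low = p_low[1] / p_low[2]
odds_high = p_high[1] / p_high[2]

# Proportional fall in each choice probability.
drop_mango = 1.0 - p_high[1] / p_low[1]
drop_cheesecake = 1.0 - p_high[2] / p_low[2]
print(odds_low, odds_high, drop_mango, drop_cheesecake)
```

Under logit the two odds coincide, so mango and cheesecake lose the same proportion of their choice probability. A pattern in which mango falls by 10% and cheesecake by 4% is therefore out of reach.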
The root of the problem in the previous example is that the Shannon mutual information embodies an important and strong assumption of symmetry: the Shannon entropy is invariant to permutation in its arguments; therefore, reordering the choice options does not affect the information cost. This makes the cost of processing information context independent (Hobson, 1969), and hence, it cannot take into account that some choice options are more similar than other options.4
In this article, we introduce a new class of generalized entropies that allows us to define cost functions that embody information related to the identity of alternatives. Formally, our generalized entropy is defined as the negative convex conjugate of the surplus function of an additive random utility model (ARUM), and allows patterns such as those in the example above to be accommodated in an RI model. The new generalized entropies are not required to be symmetric. This contributes to making RI models empirically relevant. In fact, we show that, depending on the choice of information cost, an RI model can yield the same choice probability system as any ARUM; this includes specifications such as nested logit, multinomial probit, and so on, which are often employed in empirical work.
Based on our definition of generalized entropy, we introduce a class of generalized RI models. In particular, we define a general class of information cost functions where the Shannon entropy is replaced by our generalized entropy. This generalizes the Shannon mutual information since this cost function arises when the ARUM is the multinomial logit model. Because of this connection with the ARUM class, we label our general RI model as “RI‐ARUM.” As we will show, an RI‐ARUM model exists corresponding to any ARUM, implying that rationally inattentive behavior can lead to choice probabilities that violate the IIA property, as in the above example.
1.1. Related Literature
Besides the papers already mentioned above, the main equivalence result in this article is related to several strands of the literature. This article contributes to the growing literature on RI with more general cost functions. Hébert and Woodford (2017) provide a foundation for the RI model based on a dynamic information accumulation process. In particular, they introduce a class of “neighborhood‐cost” functions, which allows them to reflect varying similarity of states to one another. Morris and Yang (2019) use ideas from global games to develop an RI framework, in which it is more difficult for players to distinguish between nearby states. Our results complement these, but instead of allowing cost functions to reflect that some states are more similar than others, we introduce cost functions that may reflect that some choice options are more similar than others, a feature that, as illustrated in the example above, is relevant for the empirical implications of the RI model.
Caplin et al. (2017) and Frankel and Kamenica (2019) study properties that may be required of information cost functions. We propose to use Bregman information (Banerjee et al., 2005) based on generalized entropy.
Our results also relate to the literature on perturbed utility models. Anderson et al. (1988) derived the representative consumer model underlying the logit model, showing that the direct utility has an entropy form. This observation was generalized by Hofbauer and Sandholm (2002), who showed that the choice probabilities generated by any ARUM can be derived from a deterministic model based on payoff perturbations that depend nonlinearly on the vector of choice probabilities. Fosgerau and McFadden (2012) provide a foundation for applications of consumer theory to perturbed utility problems with nonlinear budget constraints. Fudenberg et al. (2015) provide an axiomatic characterization of a class of perturbed random utility models. Allen and Rehbeck (2019) consider identification. Fosgerau et al. (2019) construct a class of inverse demand models that are useful for estimating demand for differentiated products using Berry's (1994) method. We contribute to that literature in two ways. First, we provide an explicit characterization of the perturbation term corresponding to general ARUM. Second, our equivalence result allows us to interpret the class of perturbed random utility models in terms of RI arguments.
Finally, the RI framework has also inspired some recent empirical work; see Caplin et al. (2016), Joo (2019), Brown and Jeon (2019), and Porcher (2019). These papers primarily utilize the Shannon/multinomial logit framework. The results in this article may enable researchers to apply RI models far more general than the multinomial logit model, as they show that choice behavior emerging from any ARUM model may emerge from rationally inattentive behavior.
1.2. Layout
Section 2 introduces the RI model. Section 3 introduces the ARUM framework, and uses convex analysis to generate some insights into the fundamental structure of these models. Using this structure, we introduce a class of generalized entropies and present a few key results. Section 4 shows how generalized entropy can be used to define the information cost in the RI model, leading to the class of RI‐ARUM models. Then we present the key result from this article, which establishes the equivalence between choice probabilities emerging from the discrete choice model, and those emerging from RI‐ARUM models. Section 5 discusses the specific case of the nested logit model, which has proven useful in many empirical models. We show how rationally inattentive behavior can generate choice probabilities with substitution patterns that violate IIA, as in the above example. Two examples demonstrate some properties of the RI‐nested logit. Section 6 concludes. All proofs are in the Appendix.
1.3. Notation
Throughout this article, for vectors and , denotes the vector scalar product , such that, for example, . A univariate function applied to a vector is understood as coordinate‐wise application of the function, for example, . Consequently, for scalar a. The gradient with respect to a vector is ; for example, for , . The unit simplex in is Δ.
2. RATIONAL INATTENTION
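We introduce the RI model. A decision maker (DM) is presented with a group of N options, from which he must choose one. Each option $i$ has an associated payoff $v_i$, but the vector of payoffs $\mathbf{v} = (v_1, \dots, v_N)$ is unobserved by the DM. Instead, the DM considers the payoff vector $\mathbf{V}$ to be random, taking values in a set $\mathcal{V} \subset \mathbb{R}^N$; for simplicity, we take $\mathcal{V}$ to be finite. The DM possesses some prior knowledge about the available options, given by a probability measure μ, where $\mu(\mathbf{v}) = P(\mathbf{V} = \mathbf{v})$ for all $\mathbf{v} \in \mathcal{V}$.
The DM's choice is an action $i \in \{1, \dots, N\}$ and we write $p_i(\mathbf{v})$ as shorthand for the conditional probability that the action is i in state $\mathbf{v}$. The payoff resulting from action i is $v_i$. The vector of choice probabilities conditional on $\mathbf{V} = \mathbf{v}$ is then $\mathbf{p}(\mathbf{v}) = (p_1(\mathbf{v}), \dots, p_N(\mathbf{v}))$, and $\mathbf{p}(\cdot) = \{\mathbf{p}(\mathbf{v})\}_{\mathbf{v} \in \mathcal{V}}$ is the collection of conditional probabilities. Given the conditional probabilities and the prior μ, it is convenient to have notation for the unconditional choice probabilities and we let $p_i^0 = \mathbb{E}\, p_i(\mathbf{V})$ and $\mathbf{p}^0 = (p_1^0, \dots, p_N^0)$.
The problem of the rationally inattentive DM is to choose the conditional distribution $\mathbf{p}(\cdot)$, balancing the expected payoff against the cost of information. The DM's strategy is a solution to the following problem:
$$\max_{\mathbf{p}(\cdot)} \Big\{ \mathbb{E}\big[\mathbf{V} \cdot \mathbf{p}(\mathbf{V})\big] - \text{information cost} \Big\}. \tag{1}$$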
2.1. The RI‐Logit Model: The Matějka and McKay (2015) Result
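The key element in program (1) is the choice of the information cost function. In specifying information costs, researchers have used concepts from information theory (Cover and Thomas, 2006). Specifically, much of the existing literature (Sims, 2003; Matějka and McKay, 2015) has utilized the mutual Shannon information between payoffs and actions to measure the information costs;5 that is, letting $\Omega(\mathbf{p}) = -\mathbf{p} \cdot \log \mathbf{p}$ denote the Shannon entropy, the information cost is specified as
$$\kappa(\mathbf{p}(\cdot), \mu) = \Omega\big(\mathbb{E}\, \mathbf{p}(\mathbf{V})\big) - \mathbb{E}\, \Omega\big(\mathbf{p}(\mathbf{V})\big). \tag{2}$$
Plugging this into (1), the rationally inattentive DM chooses the system of conditional probabilities $\mathbf{p}(\cdot)$ to optimize6
$$\max_{\mathbf{p}(\cdot)} \Big\{ \mathbb{E}\big[\mathbf{V} \cdot \mathbf{p}(\mathbf{V})\big] - \kappa(\mathbf{p}(\cdot), \mu) \Big\} \tag{3}$$
subject to
$$p_i(\mathbf{v}) \geq 0 \ \text{ for all } i, \qquad \sum_{i=1}^{N} p_i(\mathbf{v}) = 1, \qquad \text{for every state } \mathbf{v}. \tag{4}$$
Solving this, the DM finds conditional choice probabilities
$$p_i(\mathbf{v}) = \frac{p_i^0 e^{v_i}}{\sum_{j=1}^{N} p_j^0 e^{v_j}}, \tag{5}$$
which satisfy $p_i^0 = \mathbb{E}\, p_i(\mathbf{V})$. We may rewrite (5) as
$$p_i(\mathbf{v}) = \frac{e^{v_i + \log p_i^0}}{\sum_{j=1}^{N} e^{v_j + \log p_j^0}}, \tag{6}$$
where $\log \mathbf{p}^0 = (\log p_1^0, \dots, \log p_N^0)$. This may be recognized as a multinomial logit model in which the payoff vector $\mathbf{v}$ is shifted by $\log \mathbf{p}^0$. Remarkably, the influence of the prior information μ is completely captured by this shift vector $\log \mathbf{p}^0$.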
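The fixed-point structure of this solution (conditional probabilities in shifted-logit form, with unconditional probabilities equal to their expectation) suggests a simple numerical scheme. The sketch below iterates the map from unconditional to expected conditional probabilities, in the style of the Blahut–Arimoto algorithm from information theory; that algorithm is a swapped-in numerical technique, not a procedure described in this article. The states, prior, and starting point are hypothetical, and all function names are illustrative.

```python
import math

def conditional(p0, v):
    """Conditional choice probabilities p_i(v) proportional to p0_i * exp(v_i)."""
    w = [p * math.exp(x) for p, x in zip(p0, v)]
    z = sum(w)
    return [wi / z for wi in w]

def solve_ri_logit(states, prior, p0, iters=500):
    """Iterate the map p0 -> E[p(V)] starting from the given p0."""
    n = len(p0)
    p0 = list(p0)
    for _ in range(iters):
        cond = [conditional(p0, v) for v in states]
        p0 = [sum(mu * c[i] for mu, c in zip(prior, cond)) for i in range(n)]
    return p0

def shifted_logit(p0, v):
    """Logit probabilities with payoffs shifted by log p0 (requires interior p0)."""
    u = [x + math.log(p) for p, x in zip(p0, v)]
    m = max(u)
    w = [math.exp(x - m) for x in u]
    z = sum(w)
    return [wi / z for wi in w]

# Hypothetical symmetric problem: three options, three equally likely states,
# each state favouring a different option.
states = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
prior = [1.0 / 3.0] * 3
p0 = solve_ri_logit(states, prior, p0=[0.6, 0.3, 0.1])
print(p0)
```

In this symmetric example the iteration settles on equal unconditional probabilities, and `conditional` and `shifted_logit` return the same vector, which is the rewriting of the conditional probabilities as a logit with shifted payoffs.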
Below, we show that this equivalence between the RI model and the logit discrete choice model can be extended to the entire class of additive random utility discrete choice models, by suitably generalizing the information cost function Ω(). We will also find a location shift vector that completely captures the influence of the prior information. Before turning to these results, we briefly review some properties of the ARUM class and establish some new results that are useful for working with generalized entropies.
3. RANDOM UTILITY MODELS AND GENERALIZED ENTROPY
Consider a DM making a utility maximizing discrete choice among a set of options. The utility of option i is
$$u_i = v_i + \varepsilon_i, \tag{7}$$
where $\mathbf{v} = (v_1, \dots, v_N)$ is deterministic and $\boldsymbol{\varepsilon} = (\varepsilon_1, \dots, \varepsilon_N)$ is a vector of random utility shocks. This is the classic ARUM framework pioneered by McFadden (1978). Our presentation of the ARUM framework here will emphasize convex‐analytic properties that will be important in drawing connections with the RI model in what follows.
Assumption 1
The random vector ε follows a joint distribution with finite means that is absolutely continuous, independent of , and fully supported on .
Assumption 1 leaves the distribution of the ε's unspecified, thus allowing for a wide range of choice probability systems far beyond the often used logit model. Importantly, it accommodates arbitrary correlation in the 's across options, which is reasonable and realistic in applications.
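The DM then has choice probabilities
$$q_i(\mathbf{v}) = P\Big(v_i + \varepsilon_i = \max_{j} \{v_j + \varepsilon_j\}\Big), \qquad i = 1, \dots, N.$$
An important concept in this article is the surplus function of the discrete choice model (so named by McFadden, 1981), defined as
$$W(\mathbf{v}) = \mathbb{E}\Big[\max_{j} \{v_j + \varepsilon_j\}\Big]. \tag{8}$$
Under Assumption 1, $W$ is convex and differentiable and the choice probabilities coincide with the derivatives of $W$:7
$$\frac{\partial W(\mathbf{v})}{\partial v_i} = q_i(\mathbf{v}), \qquad i = 1, \dots, N,$$
or, using vector notation, $\mathbf{q}(\mathbf{v}) = \nabla W(\mathbf{v})$. This is the Williams–Daly–Zachary theorem, famous in the discrete choice literature (McFadden, 1978, 1981).
From the differentiability of W and the Williams–Daly–Zachary theorem, it follows that the choice probabilities emerging from any random utility discrete choice model can be expressed in closed‐form as8:
$$q_i(\mathbf{v}) = \frac{T_i(e^{\mathbf{v}})}{\sum_{j=1}^{N} T_j(e^{\mathbf{v}})}, \tag{9}$$
where the vector‐valued function $\mathbf{T} = (T_1, \dots, T_N)$ is defined as the gradient of the exponentiated surplus, that is,
$$\mathbf{T}(e^{\mathbf{v}}) = \nabla_{\mathbf{v}}\, e^{W(\mathbf{v})}. \tag{10}$$
For the specific case of multinomial logit, the $\varepsilon_i$'s are i.i.d. across options i with a type 1 extreme value distribution, the surplus function is $W(\mathbf{v}) = \log \sum_{j=1}^{N} e^{v_j}$, implying that $\mathbf{T}(e^{\mathbf{v}}) = e^{\mathbf{v}}$. Thus, Equation (9) becomes the familiar multinomial logit choice formula: $q_i(\mathbf{v}) = e^{v_i} / \sum_{j=1}^{N} e^{v_j}$.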
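The Williams–Daly–Zachary relation can be illustrated numerically for the logit case. The sketch below differentiates the logit surplus by central finite differences and compares the result with the closed-form logit probabilities; it also differentiates the exponentiated surplus, which for logit should return the exponentiated payoffs. The payoff vector and the helper names are hypothetical, and the surplus is written with its additive constant normalized to zero, which is an assumption of this sketch.

```python
import math

def surplus_logit(v):
    """Logit surplus W(v) = log sum_j exp(v_j), additive constant normalized to zero."""
    m = max(v)
    return m + math.log(sum(math.exp(x - m) for x in v))

def logit_probs(v):
    """Closed-form multinomial logit choice probabilities."""
    w = surplus_logit(v)
    return [math.exp(x - w) for x in v]

def gradient(fun, v, h=1e-6):
    """Central finite-difference gradient of fun at the point v."""
    grad = []
    for i in range(len(v)):
        up, down = list(v), list(v)
        up[i] += h
        down[i] -= h
        grad.append((fun(up) - fun(down)) / (2.0 * h))
    return grad

v = [0.3, -0.2, 1.0]  # hypothetical payoff vector

# Choice probabilities as the gradient of the surplus.
choice_from_gradient = gradient(surplus_logit, v)
choice_closed_form = logit_probs(v)

# Scaled demand as the gradient of the exponentiated surplus.
scaled_demand = gradient(lambda u: math.exp(surplus_logit(u)), v)
print(choice_from_gradient, choice_closed_form, scaled_demand)
```

Finite differences are used only to keep the example free of dependencies; the comparison holds up to the discretization error of the difference quotient.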
Based on (9), we may refer to as the scaled demand mapping. We will use the inverse of the scaled demand mapping to construct an information cost. The inverse must allow the zero demands that arise in the RI model. Existence of such an inverse is established by the following proposition:
Proposition 1
(Invertibility) Let Assumption 1 hold. Then the function has a continuous extension to that is surjective, injective, and hence globally invertible. Moreover, the function defined as satisfies iff for .
In what follows we refer to $\mathbf{S} = \mathbf{T}^{-1}$ as the inverse scaled demand mapping. For any discrete choice model, there is a close relationship between the corresponding inverse scaled demand ($\mathbf{S}$) and surplus (W) functions. They are related in terms of convex conjugate duality. Since the social surplus function W for any ARUM is convex, we know that there exists a convex conjugate function $W^*$ satisfying the problem9
$$W(\mathbf{v}) = \max_{\mathbf{q} \in \Delta} \big\{ \mathbf{q} \cdot \mathbf{v} - W^*(\mathbf{q}) \big\}, \tag{11}$$
where the maximum on the right‐hand side is attained at $\mathbf{q} = \nabla W(\mathbf{v})$.
The next proposition establishes a specific structure of the surplus function W and its convex conjugate .10
Proposition 2
(Generalized entropy functions) Consider an ARUM discrete choice model satisfying Assumption 1, with surplus function . Then
Remark 1
(RI‐logit revisited.) To see how this works in a special case, let us consider the multinomial logit model. In this case, the scaled demand $\mathbf{T}$ is the identity, implying that its inverse, $\mathbf{S}$, is also just the identity. Then by Proposition 2(ii), the negative convex conjugate of the surplus function is $-W^*(\mathbf{q}) = -\mathbf{q} \cdot \log \mathbf{q}$, which is just the Shannon (1948) entropy.
Moreover, we see that Equation (13) implies that the RI‐logit optimization problem (3), written as
has the multinomial logit choice probabilities in Equation (6) as solution.
Proposition 2(i) generalizes the “logsum” formula for the multinomial logit model to the entire class ARUM. Similarly, generalizing from the logit case, , the negative convex conjugate of the surplus W of any ARUM may be viewed as a generalized entropy. In particular, Proposition 2(ii) shows how the generalized entropy may be expressed in terms of the inverse scaled demand as .
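For the logit case, the duality between surplus and entropy can be checked directly. The sketch below verifies that payoff plus Shannon entropy, evaluated at the logit probabilities, equals the logsum, and that randomly drawn points of the simplex never do better. The payoff vector, the random draws, and the helper names are hypothetical; the check covers only the logit special case, not the general ARUM statement.

```python
import math
import random

def shannon_entropy(q):
    """Shannon entropy -sum_i q_i log q_i, with the convention 0 log 0 = 0."""
    return -sum(x * math.log(x) for x in q if x > 0)

def logsum(v):
    """Logit surplus log sum_j exp(v_j)."""
    m = max(v)
    return m + math.log(sum(math.exp(x - m) for x in v))

def logit_probs(v):
    w = logsum(v)
    return [math.exp(x - w) for x in v]

def perturbed_payoff(q, v):
    """Objective q . v + entropy(q) of the perturbed utility problem."""
    return sum(a * b for a, b in zip(q, v)) + shannon_entropy(q)

v = [0.3, -0.2, 1.0]  # hypothetical payoff vector
value_at_logit = perturbed_payoff(logit_probs(v), v)

# Compare against random points of the simplex (seeded for reproducibility).
rng = random.Random(0)
best_random = -float("inf")
for _ in range(1000):
    draws = [rng.random() for _ in v]
    total = sum(draws)
    q = [d / total for d in draws]
    best_random = max(best_random, perturbed_payoff(q, v))

print(value_at_logit, logsum(v), best_random)
```

Replacing `shannon_entropy` by another generalized entropy would, by the same logic, make the maximizer the choice probabilities of the corresponding ARUM.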
To aid further interpretation of the generalized entropy function, note that Equation (8) implies that the surplus function can be written as
Combining this with (11), we obtain an alternative expression for the generalized entropy, as a choice probability‐weighted sum of expectations of the utility shocks :11
In this way, different distributions for the utility shocks in the random utility model will imply different generalized entropies.
We may also interpret as follows: Given in the interior of the simplex, there exists such that satisfy (9), see Norets and Takahashi (2013). Then, using (12),12
which means that is the expected utility gain from the discrete choice relative to the deterministic utility component of option j. This coincides with the mapping introduced in Lemma 1 of Arcidiacono and Miller (2011), which is a key component for the estimation procedures developed in that article.
Proposition A1 in the Appendix contains important mathematical properties of the class of inverse scaled demand functions , which are used in proving the key propositions in the remainder of the article.
4. GENERALIZING THE RI‐LOGIT MODEL: RI‐ARUM MODELS WITH BREGMAN INFORMATION COST
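In this section, we generalize the equivalence result between RI and multinomial logit. We begin by generalizing the RI framework described in Section 2, using generalized entropy in place of the Shannon entropy. Specifically, we let $\mathbf{S}$ be the inverse scaled demand corresponding to some ARUM satisfying Assumption 1, and define $\Omega_{\mathbf{S}}(\mathbf{p}) = -\mathbf{p} \cdot \log \mathbf{S}(\mathbf{p})$ as the corresponding generalized entropy.
For a strictly convex function f, the Bregman (1967) divergence associated with f is a function of probability vectors to the real line defined by $D_f(\mathbf{p} \,\|\, \mathbf{q}) = f(\mathbf{p}) - f(\mathbf{q}) - \nabla f(\mathbf{q}) \cdot (\mathbf{p} - \mathbf{q})$. It measures the vertical distance from $f(\mathbf{p})$ to the tangent hyperplane to f at the point $\mathbf{q}$. By convexity of f, this distance is nonnegative and increasing away from $\mathbf{q}$. When $\mathbf{p}$ is random, we can consider the expected Bregman divergence $\mathbb{E}\, D_f(\mathbf{p} \,\|\, \mathbf{q})$, which measures the expected divergence of $\mathbf{p}$ from $\mathbf{q}$. Banerjee et al. (2005) show that $\mathbb{E}\, D_f(\mathbf{p} \,\|\, \mathbf{q})$ is minimized over $\mathbf{q}$ at $\mathbf{q} = \mathbb{E}\, \mathbf{p}$ for any choice of f. Banerjee et al. (2005) use this observation to define the Bregman information of $\mathbf{p}$ as $\mathbb{E}\, D_f(\mathbf{p} \,\|\, \mathbb{E}\, \mathbf{p})$, noting that this is the expected distortion, as measured by the Bregman divergence, when replacing a random $\mathbf{p}$ by the optimal constant vector $\mathbb{E}\, \mathbf{p}$.
We define accordingly an information cost as the Bregman information associated with the negative of the generalized entropy $-\Omega_{\mathbf{S}}$, that is, $\kappa_{\mathbf{S}}(\mathbf{p}(\cdot), \mu) = \mathbb{E}\, D_{-\Omega_{\mathbf{S}}}\big(\mathbf{p}(\mathbf{V}) \,\|\, \mathbf{p}^0\big)$, where $\mathbf{p}^0 = \mathbb{E}\, \mathbf{p}(\mathbf{V})$.13 Proposition 3 below establishes that
$$\kappa_{\mathbf{S}}(\mathbf{p}(\cdot), \mu) = \Omega_{\mathbf{S}}(\mathbf{p}^0) - \mathbb{E}\, \Omega_{\mathbf{S}}\big(\mathbf{p}(\mathbf{V})\big). \tag{14}$$
In particular, the Shannon information cost in Equation (2) is equal to the Bregman information associated with the (negative) Shannon entropy, which is well known as the mutual (Shannon) information. The interpretation of our information cost is analogous to the information cost for the RI‐logit model, in Equation (2) above. It measures consumers' action adjustment costs associated with shifting behavior from the state‐independent unconditional choice probabilities $\mathbf{p}^0$ to the state‐dependent conditional choice probabilities $\mathbf{p}(\mathbf{v})$.14 Taking information cost as Bregman information means that the information cost inherits properties of the Bregman divergence, as stated in the next proposition.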
Proposition 3
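For any generalized entropy function $\Omega_{\mathbf{S}}$, the information cost in (14) is the expectation of the Bregman divergence associated with $-\Omega_{\mathbf{S}}$ of the conditional choice probabilities $\mathbf{p}(\mathbf{V})$ and the unconditional choice probabilities $\mathbf{p}^0$. Hence, it is convex in $\mathbf{p}(\cdot)$ when holding $\mathbf{p}^0$ constant, and it equals zero if action and state are independent.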
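In the Shannon special case, the identity between Bregman information, the entropy difference, and mutual information can be verified on a small example. The sketch below does this for hypothetical conditional choice probabilities and a hypothetical two-state prior; all function names are illustrative, and the code covers only the negative Shannon entropy, not a general generalized entropy.

```python
import math

def neg_entropy(p):
    """f(p) = sum_i p_i log p_i, the negative Shannon entropy (0 log 0 = 0)."""
    return sum(x * math.log(x) for x in p if x > 0)

def bregman(p, q):
    """Bregman divergence f(p) - f(q) - grad f(q) . (p - q) for f = neg_entropy, q interior."""
    grad_q = [math.log(x) + 1.0 for x in q]
    return neg_entropy(p) - neg_entropy(q) - sum(
        g * (a - b) for g, a, b in zip(grad_q, p, q))

def unconditional(prior, cond):
    """Unconditional choice probabilities: the prior-weighted average of the rows of cond."""
    n = len(cond[0])
    return [sum(mu * c[i] for mu, c in zip(prior, cond)) for i in range(n)]

def bregman_information(prior, cond):
    """Expected Bregman divergence of the conditional from the unconditional probabilities."""
    p0 = unconditional(prior, cond)
    return sum(mu * bregman(c, p0) for mu, c in zip(prior, cond))

def entropy_gap(prior, cond):
    """Entropy of the unconditional minus expected entropy of the conditional probabilities."""
    p0 = unconditional(prior, cond)
    return -neg_entropy(p0) + sum(mu * neg_entropy(c) for mu, c in zip(prior, cond))

def mutual_information(prior, cond):
    """Mutual information between action and state computed from the joint distribution."""
    p0 = unconditional(prior, cond)
    return sum(mu * c[i] * math.log(c[i] / p0[i])
               for mu, c in zip(prior, cond) for i in range(len(p0)) if c[i] > 0)

prior = [0.5, 0.5]                                  # hypothetical prior over two states
cond = [[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]]           # hypothetical conditional probabilities
independent = [[0.3, 0.3, 0.4], [0.3, 0.3, 0.4]]    # action independent of state

print(bregman_information(prior, cond), entropy_gap(prior, cond),
      mutual_information(prior, cond), bregman_information(prior, independent))
```

The three measures agree on `cond`, and the cost vanishes on `independent`, where the conditional probabilities do not vary with the state.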
The information cost takes context into account by construction; that is, going back to Example 1, exchanging the labels of good 1 (pineapple) and good 3 (cheesecake) affects information costs and therefore choices by design. Allowing the information cost function to depend on context in this way entails some loss of generality, as it need not satisfy Blackwell's information ordering. In particular, is not invariant with respect to permutation in its arguments. Example 5.2 below illustrates this for the case of corresponding to a nested logit model.
Using the generalized cost function just introduced, we now define a new RI model describing a DM who chooses the collection of conditional probabilities $\mathbf{p}(\cdot)$ to maximize his expected payoff less the general information cost $\kappa_{\mathbf{S}}$ in (14):
$$\max_{\mathbf{p}(\cdot)} \Big\{ \mathbb{E}\big[\mathbf{V} \cdot \mathbf{p}(\mathbf{V})\big] - \kappa_{\mathbf{S}}(\mathbf{p}(\cdot), \mu) \Big\}. \tag{15}$$
We refer to this model as RI‐ARUM to make explicit the fact that the cost function (14) is defined in terms of the generalized entropy that is derived from an ARUM. The maximization problem in Equation (15) is an extension of the maximization problem in Equation (11) that is representative for the ARUM. In fact, holding fixed $\mathbf{p}^0$, the RI‐ARUM objective function (15) above coincides, pointwise in $\mathbf{v}$, with the problem (11), where the generalized entropy function is as stated in Proposition 2. This connection underlies the finding, elaborated in the following proposition, that the optimal conditional choice probabilities for any RI‐ARUM model have the logit‐like closed form from Equation (9) above.
Proposition 4
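Let $\mathbf{T}$ be the scaled demand of an ARUM and let $\mathbf{S} = \mathbf{T}^{-1}$ be the inverse scaled demand. Let $\mathbf{p}(\cdot)$ be the solution to the corresponding RI‐ARUM model and $\mathbf{p}^0 = \mathbb{E}\, \mathbf{p}(\mathbf{V})$. Then
- (i) The unconditional probabilities satisfy the fixed point equation
$$\mathbf{p}^0 = \mathbb{E}\left[ \frac{\mathbf{T}\big(e^{\mathbf{V} + \log \mathbf{S}(\mathbf{p}^0)}\big)}{\sum_{j=1}^{N} T_j\big(e^{\mathbf{V} + \log \mathbf{S}(\mathbf{p}^0)}\big)} \right]. \tag{16}$$
- (ii) The conditional probabilities are given in terms of the unconditional probabilities by
$$p_i(\mathbf{v}) = \frac{T_i\big(e^{\mathbf{v} + \log \mathbf{S}(\mathbf{p}^0)}\big)}{\sum_{j=1}^{N} T_j\big(e^{\mathbf{v} + \log \mathbf{S}(\mathbf{p}^0)}\big)}. \tag{17}$$
- (iii) The optimized value of (15) is
$$\mathbb{E}\, W\big(\mathbf{V} + \log \mathbf{S}(\mathbf{p}^0)\big). \tag{18}$$
The unconditional and conditional choice probabilities in (16) and (17) generalize the corresponding expressions for the RI‐logit model in a straightforward way. We note in particular that the influence of the prior information is captured completely by the vector $\log \mathbf{S}(\mathbf{p}^0)$. The implications of the prior for the behavior of an RI‐ARUM agent can be summarized by a vector in $\mathbb{R}^N$, where N is the number of choice options. This is true regardless of the form of the prior beliefs.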
Part (i) of the proposition shows that the solution of the RI‐ARUM model involves a fixed point problem; in what follows, we assume that a solution exists. In general, the uniqueness of a solution to RI‐ARUM is not guaranteed. By Cover and Thomas (2006, Theorem 2.7.4), the mutual (Shannon) information is convex as a function of the conditional probability , and strictly convex when the conditional probabilities differ from the unconditional probability . In this case, the equations in Proposition 4 uniquely identify the solution to the RI‐ARUM model. By extension, as we show in Proposition A2 in the Appendix, this conclusion also applies to the nested logit model. Whether it applies to all RI‐ARUM remains an open question.
Proposition 4(i) also implies that an important feature of the RI‐ARUM model is that some may be 0, in which case the corresponding are also 0. To see this, consider Equation (17). When , then Proposition 1(ii) implies that and hence . Following the literature, we refer to the set of options chosen with positive probability as the consideration set (e.g., Caplin et al., 2019).
Proposition 4(iii) indicates an alternative way for calculating the value attained by optimal rationally inattentive behavior. In particular, Equation (18) shows that the optimal value of program (15) can be computed as the expected surplus function of the appropriately shifted ARUM. This generalizes the corresponding Lemma 2 in Matějka and McKay (2015).
Although Proposition 4 does not explicitly characterize the consideration set emerging from an RI‐ARUM problem, Corollary A1 in the Appendix describes one important feature that it has, namely that it excludes options that offer the lowest utility in all states of the world.
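Both the consideration set and the value formula can be illustrated in the RI‐logit special case. The sketch below solves a hypothetical problem in which the third option pays less than the others in every state, then compares the objective (expected payoff minus mutual information) with the expected logsum of the shifted payoffs. The fixed-point iteration is a numerical device assumed for this sketch, the payoffs and the threshold used to define the consideration set are hypothetical, and the function names are illustrative.

```python
import math

def conditional(p0, v):
    """Conditional choice probabilities p_i(v) proportional to p0_i * exp(v_i)."""
    w = [p * math.exp(x) for p, x in zip(p0, v)]
    z = sum(w)
    return [wi / z for wi in w]

def solve_ri_logit(states, prior, iters=300):
    """Iterate p0 -> E[p(V)] from uniform unconditional probabilities."""
    n = len(states[0])
    p0 = [1.0 / n] * n
    for _ in range(iters):
        cond = [conditional(p0, v) for v in states]
        p0 = [sum(mu * c[i] for mu, c in zip(prior, cond)) for i in range(n)]
    return p0

# Hypothetical payoffs: option 3 (index 2) is the worst option in both states.
states = [(1.0, 0.0, -1.0), (0.0, 1.0, -1.0)]
prior = [0.5, 0.5]

p0 = solve_ri_logit(states, prior)
cond = [conditional(p0, v) for v in states]

# Options chosen with non-negligible probability (the threshold is arbitrary).
consideration_set = [i for i, p in enumerate(p0) if p > 1e-9]

# Objective: expected payoff minus mutual information between action and state.
marginal = [sum(mu * c[i] for mu, c in zip(prior, cond)) for i in range(3)]
expected_payoff = sum(mu * sum(ci * vi for ci, vi in zip(c, v))
                      for mu, c, v in zip(prior, cond, states))
information = sum(mu * c[i] * math.log(c[i] / marginal[i])
                  for mu, c in zip(prior, cond) for i in range(3) if c[i] > 0)
objective = expected_payoff - information

# Expected logit surplus of payoffs shifted by log p0: E log sum_j p0_j exp(v_j).
shifted_surplus = sum(mu * math.log(sum(p * math.exp(x) for p, x in zip(p0, v)))
                      for mu, v in zip(prior, states))
print(consideration_set, objective, shifted_surplus)
```

The dominated option drops out of the consideration set, and the two ways of computing the value agree up to numerical error.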
4.1. Equivalence between Discrete Choice (ARUM) and Rational Inattention
We now establish the central result of this article, namely, the equivalence between additive random utility discrete choice models and RI models. In particular, we show that an RI‐ARUM generates the same choice probabilities as a corresponding ARUM, and vice versa. Comparing the expressions for the choice probabilities in the RI‐ARUM model in (17) to those in an ARUM in (9) shows why such a result is available: the expressions are identical except for the location shift of the deterministic utility components by the vector that captures the prior information in the RI‐ARUM model.
Proposition 5
For every RI‐ARUM with prior , inverse scaled demand and choice probabilities there is an ARUM defined on the consideration set of the RI‐ARUM that yields the same choice probabilities for all .
Conversely, for every ARUM with choice probabilities and inverse scaled demand and given a prior such that the corresponding information cost is strictly convex, there is a location shift vector such that the RI‐ARUM with prior and inverse scaled demand has choice probabilities that satisfy for all .
This proposition implies a new interpretation of ARUM models as describing boundedly rational behavior, which suggests that to apply an ARUM, one need not assume that DMs are completely aware of the valuations of all the available options. This is important, as it is clearly unrealistic to expect DMs to be aware of all options when the number is large.
At the same time, despite the formal equivalence in the choice probabilities in ARUM and RI‐ARUM models under the conditions of Proposition 5, there are several important differences between them. First, the RI‐ARUM model allows some options to have zero unconditional choice probabilities. Since choice probabilities are necessarily positive in the ARUM under Assumption 1, the equivalence is defined only for the options that are in the consideration set of the RI‐ARUM model, as is clear from Proposition 5.
Moreover, the equivalence requires fixing the prior over states ; a prior is part of the RI‐ARUM model but needs to be added to the random utility model. When we consider two choice scenarios with different priors, the subsequent choices in the RI‐ARUM and ARUM models can deviate considerably. In particular, the RI‐ARUM class allows options in the choice set to be complements, in the sense that increasing the payoff of an option in some state may lead to an increase in the choice probability of some other options. Such complementarities are explicitly ruled out by ARUM discrete choice models.15
Additionally, we have necessary and sufficient conditions for a system of choice probabilities to be consistent with an ARUM (Fosgerau et al., 2013). By Proposition 5, the same conditions are then necessary for a system of choice probabilities that derives from an RI‐ARUM and some fixed prior.
Starting from a given ARUM model, proceeding to the corresponding RI‐ARUM model requires deriving the convex conjugate function corresponding to the social surplus function of the given ARUM. Explicit closed forms for the convex conjugate functions are available, as far as we are aware, only for the multinomial and nested logit models. However, in general, the mass transport approach in Chiong et al. (2016) can be used to simulate the convex conjugate function for any ARUM, via computationally straightforward linear programming algorithms.
In what follows, we illustrate these features for a specific example; namely, we study an RI‐ARUM model in which the choice probabilities are equivalent to those from a nested logit discrete choice model, a frequently used model in empirical applications.
5. THE RI‐NESTED LOGIT MODEL
From an applied point of view, an important implication of Proposition 5 is that it allows us to formulate RI models that have complex substitution patterns, going beyond the multinomial logit case. In this section, we consider an RI‐ARUM model with information cost derived from a nested logit model. The nested logit choice probabilities are consistent with a discrete choice model in which the utility shocks have a certain generalized extreme value joint distribution. Among applied researchers, the nested logit model is often preferred over the multinomial logit model because it allows some products to be closer substitutes than others, thus avoiding the restrictions implied by the IIA property.16
We partition the set of options into mutually exclusive nests, and let denote the nest containing option i. Let be nest‐specific parameters. For a valuation vector , the nested logit choice probabilities are given by
(19) |
The inverse scaled demand corresponding to a nested logit model is
(20) |
Applying Proposition 5, the nested logit choice probabilities (19) are the same as those from an RI‐ARUM model with valuations
(21) |
The inverse scaled demand for the nested logit model in Equation (20) has several interesting features, relative to the Shannon entropy. First, Equation (20) allows us to write the generalized entropy as
(22) |
The first term in Equation (22) captures the Shannon entropy within nests, whereas the second term captures the information between nests. According to this, we may interpret Equation (22) as an augmented version of the Shannon entropy. It is also apparent from (22) that is not invariant to reordering of the choice probabilities, due to the second term.
Second, when the nesting parameters , then is the identity ( for all j), corresponding to the Shannon entropy. When , then ; here, behaves as a probability weighting function that tends to overweight options j belonging to larger nests. At the extreme , all options within the same nest effectively collapse into one aggregate option and become perfect substitutes.
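The nested logit objects discussed above can be sketched in code. The parametrization below is one common textbook form of the nested logit (nest parameters between 0 and 1, with 1 giving the multinomial logit) and is an assumption of this sketch, as are the payoff values, the nest parameter 0.5, and all function names; the article's own parametrization may differ. The code computes choice probabilities, checks that the log inverse scaled demand recovers payoffs net of the surplus, and shows a within-nest substitution pattern that violates IIA.

```python
import math

def nested_logit_probs(v, nests, lam):
    """Nested logit choice probabilities.

    nests: partition of option indices; lam: one parameter per nest in (0, 1].
    """
    inclusive = [sum(math.exp(v[i] / l) for i in nest) for nest, l in zip(nests, lam)]
    denom = sum(d ** l for d, l in zip(inclusive, lam))
    probs = [0.0] * len(v)
    for nest, l, d in zip(nests, lam, inclusive):
        for i in nest:
            probs[i] = math.exp(v[i] / l) * d ** (l - 1.0) / denom
    return probs

def nested_logit_surplus(v, nests, lam):
    """Surplus W(v) = log sum_g (sum_{i in g} exp(v_i / lam_g)) ** lam_g."""
    return math.log(sum(sum(math.exp(v[i] / l) for i in nest) ** l
                        for nest, l in zip(nests, lam)))

def log_inverse_scaled_demand(q, nests, lam):
    """log S_i(q) = lam_g * log q_i + (1 - lam_g) * log(total probability of i's nest)."""
    out = [0.0] * len(q)
    for nest, l in zip(nests, lam):
        q_nest = sum(q[i] for i in nest)
        for i in nest:
            out[i] = l * math.log(q[i]) + (1.0 - l) * math.log(q_nest)
    return out

# Pineapple and mango share a nest; cheesecake is alone (hypothetical numbers).
nests = [[0, 1], [2]]
lam = [0.5, 1.0]
v_low = [0.0, 0.2, 0.5]
v_high = [0.1, 0.2, 0.5]

q_low = nested_logit_probs(v_low, nests, lam)
q_high = nested_logit_probs(v_high, nests, lam)
w_low = nested_logit_surplus(v_low, nests, lam)
log_s = log_inverse_scaled_demand(q_low, nests, lam)

drop_mango = 1.0 - q_high[1] / q_low[1]
drop_cheesecake = 1.0 - q_high[2] / q_low[2]
print(q_low, q_high, drop_mango, drop_cheesecake)
```

When the value of pineapple rises, mango loses proportionally more than cheesecake, which is the kind of pattern the logit model cannot produce.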
We denote this model as RI‐nested logit (hereafter RI‐NL). Using this model, we consider two examples, emphasizing both differences and similarities of the RI‐NL vis‐a‐vis the RI‐logit model.
5.1. Example 1: Mango–Pineapple–Cheesecake Continued
We return to the earlier pineapple–mango–cheesecake example from Section 1. For these three products, we consider a model with two nests, in which the tropical fruits pineapple (good 1) and mango (good 2) are placed in one nest $g_1$, whereas cheesecake (good 3) is placed by itself in a second nest $g_2$. For the nesting parameters, we choose . The value of is irrelevant since nest $g_2$ comprises just one alternative. Recall that there are four equally likely possible states:
(23) |
Solving this RI‐NL model leads to that is a constant vector plus . Hence, the nested logit model with payoffs shifted by this vector produces the same choice probabilities as the RI‐NL. The RI‐NL conditional choice probabilities,
do not satisfy IIA and are hence not compatible with the RI‐logit model.17
Conversely, we can start with the nested logit model with the payoffs given in (23) above and the same nest parameters as before. The conditional choice probability vectors for this model are
and the unconditional choice probability vector is under the uniform prior. The corresponding location shift vector is up to a constant and shifting the payoffs by this amount produces RI‐NL conditional choice probabilities that are equal to the nested logit choice probabilities.
For comparison, we also compute the case where the nesting parameter has been set to , which makes the alternatives pineapple and mango closer substitutes than before. The RI‐NL unconditional choice probability vector becomes and the RI‐NL conditional choice probabilities become
The changes across states deviate more from IIA than when . It appears that, as goods 1 and 2 become more substitutable, the DM is able to make better choices in states where goods 1 or 2 are optimal. This comparative statics suggests that increasing the substitutability between a set of goods corresponds to shifts in the information structure toward signals that allow the DM to better distinguish between states in which these goods are optimal.
5.2. Example 2: Swapping Alternatives Can Lead to Increased Information Cost
As we have stated throughout the article, our cost functions embody information related to the identity of alternatives. To see how this feature works in practice, consider a nested logit with four choice options, nests formed by options 1–2 and 3–4, and with two states and . Define , where is the nest that contains option i. A garbling that swaps alternatives 2 and 3 will move probability mass across nests in state 2 but not in state 1. Then the probability distribution across nests is independent of the state without garbling but not with garbling and therefore this garbling increases the information cost.
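The effect of swapping alternatives can be reproduced with hypothetical numbers. The sketch below assumes a nested logit generalized entropy with a common nest parameter, written as a weighted sum of the Shannon entropy over options and the Shannon entropy over nests; that form, the parameter value 0.5, the conditional probabilities, and the function names are all assumptions of this sketch rather than values from the article. The conditional probabilities are chosen so that swapping options 2 and 3 moves mass across nests in the second state only.

```python
import math

def entropy(p):
    """Shannon entropy with the convention 0 log 0 = 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

def nested_entropy(q, nests, lam):
    """lam * entropy over options + (1 - lam) * entropy over nest totals."""
    nest_totals = [sum(q[i] for i in nest) for nest in nests]
    return lam * entropy(q) + (1.0 - lam) * entropy(nest_totals)

def information_cost(prior, cond, nests, lam):
    """Generalized entropy of the unconditional minus expected entropy of the conditionals."""
    n = len(cond[0])
    p0 = [sum(mu * c[i] for mu, c in zip(prior, cond)) for i in range(n)]
    return nested_entropy(p0, nests, lam) - sum(
        mu * nested_entropy(c, nests, lam) for mu, c in zip(prior, cond))

def swap(row, i, j):
    """Relabel alternatives i and j in a probability vector."""
    out = list(row)
    out[i], out[j] = out[j], out[i]
    return out

nests = [[0, 1], [2, 3]]
prior = [0.5, 0.5]
# Hypothetical conditional probabilities; nest totals are (0.5, 0.5) in both states.
cond = [[0.3, 0.2, 0.2, 0.3], [0.1, 0.4, 0.1, 0.4]]
swapped = [swap(row, 1, 2) for row in cond]   # swap alternatives 2 and 3

cost_before = information_cost(prior, cond, nests, lam=0.5)
cost_after = information_cost(prior, swapped, nests, lam=0.5)
shannon_before = information_cost(prior, cond, nests, lam=1.0)
shannon_after = information_cost(prior, swapped, nests, lam=1.0)
print(cost_before, cost_after, shannon_before, shannon_after)
```

With the nest parameter set to 1 the cost reduces to the Shannon mutual information, which is unaffected by the relabeling; with the nested entropy the swap raises the cost because nest membership becomes informative about the state.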
6. CONCLUSIONS
The central result in this article is the equivalence between an additive random utility discrete choice model and a corresponding RI‐ARUM. Thus, any additive random utility discrete choice model can be cast as a model of rationally inattentive behavior, and vice versa for any RI‐ARUM; we can go back and forth between the two paradigms.18 Then, to apply an ARUM, it is no longer necessary to assume that DMs are completely aware of the valuations of all the available options. This is important, as it is clearly unrealistic to expect DMs to be aware of all options when the number is large.
Our equivalence result is at the individual level; hence, it also holds for ARUM with random parameters, including the mixed logit or random coefficient logit models that have been popular in applied work.19
Our equivalence result generalizes to perturbed random utility models (e.g., Hofbauer and Sandholm, 2002; Fudenberg et al., 2015) where the information cost is the Bregman information associated with a (negative) generalized entropy (Fosgerau et al., 2019). We are also exploring connections between our results and those in the decision theory literature. Gul et al. (2014), for instance, show an equivalence between random utility and an “attribute rule” model of stochastic choice, and we conjecture that our results may be useful in showing similar results for other decision‐theoretic models.
Finally, there are RI models outside the RI‐ARUM framework; that is, RI models with information costs outside the class of generalized entropies introduced in this article.20 Obviously, choice probabilities from these non‐RI‐ARUM models would not be equivalent to those that can be generated from ARUM models; it will be interesting to examine the empirical distinctions that non‐RI‐ARUM choice probabilities would have.
The properties in Proposition A1 are satisfied by any inverse scaled demand corresponding to an ARUM but do not characterize ARUM. In fact, the properties may be used to define a class of generalized entropies that is strictly larger than the class consisting of those generalized entropies corresponding to ARUM (see Fosgerau et al., 2019). We have not found direct conditions that characterize those generalized entropies that correspond to ARUM. We have chosen in this article to work with the generalized entropies that derive from ARUM to emphasize the main point of the article: the connection between RI with our information cost and ARUM. However, the conclusions of Proposition 6 extend without change to the case when is an inverse scaled demand that satisfies the conclusions of Proposition 8 but does not necessarily correspond to an ARUM. For applications, it is then possible to work with such generalized entropies without needing to check that they correspond to an ARUM.
A.1. Proofs of Results in Main Text
Proof of Proposition 1
Note first that we may write
The probabilities in are never 0 since the random utility shocks have full support. Define for convenience . The results in Norets and Takahashi (2013) apply to the mapping : Hence, is a bijection between X and the interior of the unit simplex Δ.
To obtain injectivity of on , suppose that and aim to show that . Since and , we may sum to find that and hence that . Then by the Norets and Takahashi (2013) result, that leads to , and hence, .
Consider next surjectivity and let be an arbitrary point. We aim to solve the equation . By Norets and Takahashi, there exists such that . Let . Then
which establishes that is a surjection from to .
The next point is to extend to . For on the boundary of , let index the nonzero components of . If , then we let . For , consider the discrete choice model (7) with choice restricted to z. Let be the choice probabilities from this restricted model and let for . Similarly, let be the expected maximum utility for the restricted model. Define then .
The argument that is a bijection from to may be repeated for each combination of zeros reflected in the set z. Hence, the extended function is a bijection from to .
It remains to show that is continuous. We will do this by establishing that the values of on the boundary of are limits of values from sequences in the interior. A limit point of a continuous function is unique; hence, for each boundary point, we need just consider one sequence converging to that point.
Consider first a sequence with . As the limit is unique if it exists, consider for some . Note that . Then since are bounded between 0 and 1, as required.
Consider then , let be nonempty and define for and for . Let F be the cumulative distribution function of the vector of random utility shocks and let be its partial derivatives. Then choice probabilities may be written as
(A.1) As above, let be the choice probabilities when choice is restricted to z. At no loss of generality, let , where . For , use the dominated convergence theorem together with (A.1) to see that
These probabilities sum to 1. Hence, for .
By dominated convergence,
Combining these results, find that as required.
Finally, defining the conclusion follows at once.
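The bijection between normalized utilities and interior choice probabilities can be made concrete in the multinomial logit special case, where the inverse is available in closed form. The sketch below is only an illustration under two assumptions that are not taken from the proof: the shocks are i.i.d. extreme value (so choice probabilities are softmax), and utilities are normalized to sum to zero. It maps an interior probability vector to the unique normalized utility vector that generates it and back.

```python
import math

def logit_probs(v):
    """Multinomial logit choice probabilities (numerically stable softmax)."""
    m = max(v)
    w = [math.exp(x - m) for x in v]
    s = sum(w)
    return [x / s for x in w]

def invert_logit(q):
    """Utilities, normalized to sum to zero, that generate interior probs q."""
    logs = [math.log(x) for x in q]
    mean = sum(logs) / len(logs)
    return [x - mean for x in logs]

q = [0.5, 0.3, 0.2]        # hypothetical interior point of the simplex
v = invert_logit(q)        # the unique sum-zero utility vector
q_back = logit_probs(v)    # recovers q up to floating-point error
```

Adding a constant to every component of `v` leaves `logit_probs(v)` unchanged, which is why a normalization is needed for the map to be one-to-one.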
Proof of Proposition 2
We first evaluate . If , then
which can be made arbitrarily large by changing γ and hence . Next, consider with some . decreases toward a lower bound as . Then increases toward , and hence, is outside the unit simplex Δ.
For , we solve the maximization problem
(A.2) Note that for any constant k, we have , so that we may normalize . Maximize then the Lagrangian with first‐order conditions , which lead to . Then
Inserting this into (A.2) leads to the desired result.
For part (i), let be a solution to problem (11). Then, by the homogeneity of , we have , where . Then, by the definition of , it follows that . Replacing the latter expression in Equation (11), we get
Proof of Proposition 3
Let be a generalized entropy. Then, using Proposition A1, the associated Bregman divergence becomes
where we have used that .
Convexity of in follows from Proposition A1 or from the fact that it is a Bregman divergence. Clearly, .
Our information cost is by definition an expected Bregman divergence. We therefore immediately obtain that it is convex in holding constant and that if action and state are independent, since in that case, .
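As a check on the Shannon special case, the following sketch evaluates the Bregman divergence generated by the negative Shannon entropy and compares it with the Kullback–Leibler divergence. The probability vectors are hypothetical, and the function names are introduced here for illustration only. The divergence is zero when the two vectors coincide, which mirrors the statement that the information cost vanishes when action and state are independent.

```python
import math

def neg_entropy(q):
    """Negative Shannon entropy, a convex function on the simplex."""
    return sum(x * math.log(x) for x in q if x > 0.0)

def neg_entropy_grad(q):
    """Gradient of the negative Shannon entropy at an interior point q."""
    return [math.log(x) + 1.0 for x in q]

def bregman(f, grad_f, p, q):
    """Bregman divergence D_f(p || q) = f(p) - f(q) - <grad f(q), p - q>."""
    g = grad_f(q)
    return f(p) - f(q) - sum(gi * (pi - qi) for gi, pi, qi in zip(g, p, q))

def kl(p, q):
    """Kullback-Leibler divergence between probability vectors p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

p = [0.7, 0.2, 0.1]   # hypothetical conditional choice probabilities
q = [0.4, 0.4, 0.2]   # hypothetical unconditional choice probabilities

d_pq = bregman(neg_entropy, neg_entropy_grad, p, q)
d_qq = bregman(neg_entropy, neg_entropy_grad, q, q)
```

Averaging `d_pq` over states with the prior as weights gives the Shannon mutual information, provided `q` is the prior-weighted average of the conditional vectors.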
Proof of Proposition 4
The Lagrangian for the DM's problem is
where and are Lagrange multipliers corresponding to condition (4).
Before we derive the first‐order conditions for , it is useful to note that we may regard the terms and in the information cost as a constant, since their derivatives cancel out by Proposition A1(iii). Define and . Then the first‐order condition for is easily found to be
(A.3) This fixes as a function of since then
(A.4) If some , then we must have , which implies that and the value of is irrelevant. If , then . We may then simplify by setting for all at no loss of generality, which means that .
Using that probabilities sum to 1 leads to
and hence, (i) follows. Item (ii) then follows immediately.
Now substitute (17) back into the objective, using , to find that it reduces to
(A.5) We may then use (A.5) to determine . Now apply Equation (12) to establish part (iii) of the proposition.
Proof of Proposition 5
Consider an RI‐ARUM model with prior , scaled demand and choice probabilities and let be its consideration set. For , we have and by Proposition 1. Let . Then for , we have
which is an ARUM on .
To prove the converse, let , and consider the RI‐ARUM with prior and scaled demand . The RI‐ARUM conditional choice probabilities satisfy the first‐order condition
with . By strict convexity of the information cost, the first‐order condition uniquely identifies the optimal RI‐ARUM conditional choice probabilities. Then solves the RI‐ARUM maximization problem.
A.2. Additional Results
Proposition A1
(Properties of the inverse scaled demand function) For any ARUM discrete choice model satisfying Assumption 1, the corresponding inverse scaled demand satisfies:
- (i) is continuous and homogeneous of degree 1.
- (ii) is convex and strictly convex on the interior of Δ.
- (iii) is differentiable with , where is a probability vector with for all i.
Proof of Proposition A1
Continuity of follows from continuity of the partial derivatives of W, which is immediate from the definition. Homogeneity of is equivalent to homogeneity of . Using the homogeneity property of W
which shows that , and hence, are homogeneous of degree 1.
The requirement that in the relative interior of the unit simplex Δ may be expressed in matrix notation as
where
is the Jacobian of .
Defining , we have and hence by Proposition 2. Noting that the requirement in part (ii) is equivalent to
Now, use the Williams–Daly–Zachary theorem to find that
as required.
Convexity of follows from Proposition 2(ii). To show that is strictly convex for , note first that has positive definite Hessian on the set (Hofbauer and Sandholm, 2002). This Hessian is equal to the Jacobian of , which is then positive definite. The inverse of is , which then also has a positive definite Jacobian. But the Hessian of is .
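The Williams–Daly–Zachary step used in this proof can be checked numerically in the logit special case. The sketch below assumes i.i.d. extreme value shocks, so that the expected maximum utility is the log-sum-exp function up to an additive constant that does not affect derivatives. It compares a central finite-difference gradient of that function with the logit choice probabilities. The utility vector and step size are hypothetical.

```python
import math

def expected_max_logit(v):
    """Log-sum-exp: expected maximum utility under logit, up to a constant."""
    m = max(v)
    return m + math.log(sum(math.exp(x - m) for x in v))

def logit_probs(v):
    """Multinomial logit choice probabilities."""
    m = max(v)
    w = [math.exp(x - m) for x in v]
    s = sum(w)
    return [x / s for x in w]

def numerical_gradient(f, v, h=1e-6):
    """Central finite-difference approximation of the gradient of f at v."""
    grad = []
    for i in range(len(v)):
        up = list(v)
        down = list(v)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2.0 * h))
    return grad

v = [1.0, 0.5, -0.3]                       # hypothetical utilities
grad_w = numerical_gradient(expected_max_logit, v)
probs = logit_probs(v)                     # should match grad_w closely
```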
Corollary A1
For some option j, and for all , let for all , and assume that the inequality is strict with positive probability. Then (i.e., option j is not in the consideration set).
Proof of Corollary A1
Let ○ denote the Hadamard product, that is, . Assume, toward a contradiction, that . It follows from cyclic monotonicity (Rockafellar, 1970, Thm. 23.5) that increases as the utility of other options decreases. Then,
(A.6)
(A.7)
(A.8) This is a contradiction as desired.
Proposition A2
(Convexity of the Bregman divergence) The Bregman information associated with a nested logit model is convex. It is strictly convex when the conditional probabilities differ from the unconditional probability p^0.
Proof of Proposition A2
When Ω is the Shannon entropy, that is, when the associated ARUM is a multinomial logit model, then the Bregman information is the mutual (Shannon) information. By Cover and Thomas (2006, Thm. 2.7.4), the mutual (Shannon) information is convex as a function of the conditional probability , and strictly convex when the conditional probabilities differ from the unconditional probability p^0.
Consider now a nested logit model and note that the corresponding generalized entropy may be written , where Γ is a matrix whose columns are linearly independent probability vectors. For example, the inverse scaled demand (20) of the two‐level nested logit model may be written as
Then, the NL generalized entropy is the composition of a linear function and a concave function, plus a linear function. The Bregman information may then be written
which is a composition of a linear function and the convex mutual Shannon information, whereas the linear terms cancel out. Hence, it is convex and strictly convex when the conditional probabilities differ from the unconditional probability p^0.
Footnotes
In the context of the RI model, we interpret IIA as a comparison across states, holding the decision maker's (DM) prior fixed. For further details about IIA, we refer the reader to Maddala (1986, 3.2) and Anderson et al. (1992).
The choice probabilities here are generated, not by an RI‐logit model, but by an RI‐nested logit model, introduced in this article.
This example considers how choices vary across different states of the world. This is distinct from the exercises in Matějka and McKay (2015), who consider how choices vary as priors or choice sets change. For empirical analysis, the change in choices across states is often most relevant: for instance, in demand analysis, researchers wish to uncover how consumer choices depend on changes in product prices, which can be considered as a change from one state to another.
More formally, Matějka and McKay (2015) study the problem where agents first choose an information structure (a mapping from states of the world to information signals) and then, based on the signals, choose optimal actions.
Our presentation of the RI paradigm here follows Sims (2003, 2010), in which agents are modeled as choosing directly their conditional choice probabilities , taking the prior distribution as given.
The convexity of follows from the convexity of the max function. Differentiability follows from the absolute continuity of . See Shi et al. (2018), Chiong and Shum (2019), and Melo et al. (2019) for semiparametric econometric approaches based on these convex‐analytic properties of discrete choice models.
By direct differentiation of , and applying the Williams–Daly–Zachary theorem, we have for all i. Imposing , we have .
For details, see Rockafellar (1970, ch. 12). Briefly, for a convex function , its convex conjugate function is defined as , which is also convex. Fenchel's theorem then establishes that . When and are scalar and is differentiable, then and are inverse mappings to each other. Vohra (2011) applies these ideas to the mechanism design setting.
To the best of our knowledge, this result is new in the literature on random utility models, and may be of independent interest. In particular, this result is related to the literature on perturbed random utility models, which has been focused on characterizing choice probabilities as the solution of a deterministic optimization problem (Hofbauer and Sandholm, 2002; Fosgerau and McFadden, 2012; Fudenberg et al., 2015).
See Chiong et al. (2016).
We have , where we have used the fact that is homogeneous of degree 1.
Strict concavity of is established in Proposition A1 in the Appendix.
Using Bayes' rule, such a shift in choice probabilities corresponds to a change in beliefs from the prior μ to a posterior , which Caplin and Dean (2015) and Chambers et al. (2018) refer to as “revealed posterior” distributions.
Indeed, in empirical papers utilizing discrete‐choice demand models, complementarities between choices can typically be accommodated only by modeling consumers as choosing “bundles” of options in the choice set; see, for example, Gentzkow (2007) and Fox and Lazzati (2017). Such an approach may become intractable as the dimensionality of the choice set increases. In contrast, complementarities can arise in the RI model both from correlation in the priors (as pointed out by Matějka and McKay, 2015) and from the form of the generalized entropy information cost functions considered in this article.
To be consistent with the definition of IIA, when applied to RI models, we assume that the DM's prior is fixed. This assumption allows us to keep the choice set constant so that we can focus on changes in the utilities associated to alternatives. For further details about IIA, see Maddala (1986, Chap. 2) and Anderson et al. (1992).
But even with the logit specification, the IIA property can break down if the DM is able to consume more than one product, that is, bundles of goods (cf. Gentzkow, 2007).
In a similar vein, Webb (2019) demonstrates an equivalence between random utility models and bounded‐accumulation or drift‐diffusion models of choice and reaction times used in the neuroeconomics and psychology literature.
As an example, the function is not a generalized entropy function; thus, an RI model using this as an information cost function would lie outside the RI‐ARUM framework.
REFERENCES
- Allen, R., and Rehbeck J., “Identification with Additively Separable Heterogeneity,” Econometrica 87 (2019), 1021–54.
- Anderson, S. P., De Palma A., and Thisse J.‐F., “A Representative Consumer Theory of the Logit Model,” International Economic Review 29 (1988), 461–66.
- Anderson, S. P., De Palma A., and Thisse J.‐F., Discrete Choice Theory of Product Differentiation (Cambridge, MA: MIT Press, 1992).
- Arcidiacono, P., and Miller R. A., “Conditional Choice Probability Estimation of Dynamic Discrete Choice Models with Unobserved Heterogeneity,” Econometrica 79 (2011), 1823–67.
- Banerjee, A., Merugu S., Dhillon I. S., and Ghosh J., “Clustering with Bregman Divergences,” Journal of Machine Learning Research 6 (2005), 1705–49.
- Berry, S., “Estimating Discrete‐Choice Models of Market Equilibrium,” RAND Journal of Economics 25 (1994), 242–62.
- Berry, S. T., Levinsohn J., and Pakes A., “Automobile Prices in Market Equilibrium,” Econometrica 63 (1995), 841–90.
- Bregman, L., “The Relaxation Method of Finding the Common Point of Convex Sets and Its Application to the Solution of Problems in Convex Programming,” USSR Computational Mathematics and Mathematical Physics 7 (1967), 200–17.
- Brown, Z. Y., and Jeon J., “Endogenous Information Acquisition and Insurance Choice,” Working paper, Michigan, University of Michigan, 2019.
- Caplin, A., and Dean M., “Revealed Preference, Rational Inattention, and Costly Information Acquisition,” American Economic Review 105 (2015), 2183–203.
- Caplin, A., Dean M., and Leahy J., “Rationally Inattentive Behavior: Characterizing and Generalizing Shannon Entropy,” Technical Report, National Bureau of Economic Research, Cambridge, MA, 2017.
- Caplin, A., Dean M., and Leahy J., “Rational Inattention, Optimal Consideration Sets, and Stochastic Choice,” Review of Economic Studies 86 (2019), 1061–94.
- Caplin, A., Leahy J., and Matějka F., “Rational Inattention and Inference from Market Share Data,” CERGE‐EI Working Paper, 2016.
- Chambers, C. P., Liu C., and Rehbeck J., “Costly Information Acquisition,” UCSD, 2018.
- Chiong, K. X., Galichon A., and Shum M., “Duality in Dynamic Discrete‐Choice Models,” Quantitative Economics 7 (2016), 83–115.
- Chiong, K. X., and Shum M., “Random Projection Estimation of Discrete‐Choice Models with Large Choice Sets,” Management Science 65 (2019), 256–71.
- Cover, T. M., and Thomas J. A., Elements of Information Theory, 2nd edition (Somerset: Wiley‐Interscience, 2006).
- Fosgerau, M., Monardo J., and de Palma A., “The Inverse Product Differentiation Logit Model,” July 13, 2019. Available at SSRN: https://ssrn.com/abstract=3141041 or 10.2139/ssrn.3141041.
- Fosgerau, M., and McFadden D. L., “A Theory of the Perturbed Consumer with General Budgets,” NBER Working Paper, 2012.
- Fosgerau, M., McFadden D., and Bierlaire M., “Choice Probability Generating Functions,” Journal of Choice Modelling 8 (2013), 1–18. 10.1016/j.jocm.2013.05.002.
- Fox, J. T., Kim K. I., Ryan S. P., and Bajari P., “The Random Coefficients Logit Model Is Identified,” Journal of Econometrics 166 (2012), 204–12.
- Fox, J. T., and Lazzati N., “A Note on Identification of Discrete Choice Models for Bundles and Binary Games,” Quantitative Economics 8 (2017), 1021–36.
- Frankel, A., and Kamenica E., “Quantifying Information and Uncertainty,” American Economic Review 109 (2019), 3650–80.
- Fudenberg, D., Iijima R., and Strzalecki T., “Stochastic Choice and Revealed Perturbed Utility,” Econometrica 83 (2015), 2371–2409.
- Gentzkow, M., “Valuing New Goods in a Model with Complementarity: Online Newspapers,” American Economic Review 97 (2007), 713–44.
- Gul, F., Natenzon P., and Pesendorfer W., “Random Choice as Behavioral Optimization,” Econometrica 82 (2014), 1873–1912.
- Hébert, B., and Woodford M., “Rational Inattention with Sequential Information Sampling,” NBER Working Paper, 2017.
- Hobson, A., “A New Theorem of Information Theory,” Journal of Statistical Physics 1 (1969), 383–91.
- Hofbauer, J., and Sandholm W. H., “On the Global Convergence of Stochastic Fictitious Play,” Econometrica 70 (2002), 2265–94.
- Joo, J., “Rational Inattention as an Empirical Framework – With an Application to the Welfare Effects of New Product Introduction and Endogenous Promotion,” Working Paper, UT Dallas, 2019.
- Maddala, G. S., Limited‐Dependent and Qualitative Variables in Econometrics (Cambridge: Cambridge University Press, 1986).
- Matějka, F., and McKay A., “Rational Inattention to Discrete Choices: A New Foundation for the Multinomial Logit Model,” American Economic Review 105 (2015), 272–98. 10.1257/aer.20130047.
- McFadden, D., “Modelling the Choice of Residential Location,” in Karlqvist A., Snickars F., and Weibull J. W., eds., Spatial Interaction Theory and Planning Models, Volume 673 (Amsterdam: North Holland, 1978), 75–96.
- McFadden, D., “Econometric Models of Probabilistic Choice,” in Manski C. and McFadden D., eds., Structural Analysis of Discrete Data with Econometric Applications (Cambridge, MA: MIT Press, 1981), 198–272.
- McFadden, D., and Train K., “Mixed MNL Models for Discrete Response,” Journal of Applied Econometrics 15 (2000), 447–70.
- Melo, E., Pogorelskiy K., and Shum M., “Testing the Quantal Response Hypothesis,” International Economic Review 60 (2019), 53–74.
- Morris, S., and Yang M., “Coordination and Continuous Choice,” Working Paper, 2019.
- Norets, A., and Takahashi S., “On the Surjectivity of the Mapping between Utilities and Choice Probabilities,” Quantitative Economics 4 (2013), 149–55.
- Porcher, C., “Migration with Costly Information,” Working Paper, Princeton, 2019.
- Rockafellar, R. T., Convex Analysis (Princeton, NJ: Princeton University Press, 1970).
- Shannon, C. E., “A Mathematical Theory of Communication,” Bell System Technical Journal 27 (1948), 379–423.
- Shi, X., Shum M., and Song W., “Estimating Semi‐Parametric Panel Multinomial Choice Models Using Cyclic Monotonicity,” Econometrica 86 (2018), 737–61.
- Sims, C. A., “Implications of Rational Inattention,” Journal of Monetary Economics 50 (2003), 665–90.
- Sims, C. A., “Rational Inattention and Monetary Economics,” in Woodford M. and Friedman B. M., eds., Handbook of Monetary Economics, Volume 3, Chapter 4 (Amsterdam: Elsevier, 2010), 155–81.
- Vohra, R., Mechanism Design: A Linear Programming Approach (Cambridge: Cambridge University Press, 2011).
- Webb, R., “The (Neural) Dynamics of Stochastic Choice,” Management Science 65 (2019), 230–55.