Abstract
Delay discounting reflects the rate at which a reward loses its subjective value as a function of delay to that reward. Many models have been proposed to measure delay discounting, and many comparisons have been made among these models. We highlight the two‐parameter delay discounting model popularized by Howard Rachlin by demonstrating two key practical features of the Rachlin model. The first feature is flexibility; the Rachlin model fits empirical discounting data closely. Second, when compared with other available two‐parameter discounting models, the Rachlin model has the advantage that unique best estimates for parameters are easy to obtain across a wide variety of potential discounting patterns. We focus this work on this second feature in the context of maximum likelihood, showing the relative ease with which the Rachlin model can be utilized compared with the extreme care that must be used with other models for discounting data, focusing on two illustrative cases that pass checks for data validity. Both of these features are demonstrated via a reanalysis of discounting data the authors have previously used for model selection purposes.
Keywords: model fitting, intertemporal choice, discounted value, optimization, maximum likelihood
Delay discounting refers to the ubiquitous phenomenon that organisms, humans and nonhumans alike, tend to devalue outcomes that occur in the future compared to outcomes that are readily available in the here and now. Over the years, researchers and theorists have debated the precise shape of the discounting curve, as this shape has implications for the behavioral phenomena observed in experiments. For example, the existence of preference reversals, or the tendency for an organism to ‘lock in’ a self‐controlled choice made distant to the event even if the impulsive choice will be preferred nearer to the event, suggests that value is discounted hyperbolically as a function of delay (Ainslie, 1975). These observations and subsequent experiments led to the validation and popularization of a single‐parameter nonlinear hyperbolic function to describe the rate of decline (e.g., Lancaster, 1963; Mazur, 1987). Subsequent work explored adding a second free parameter to this model to better describe a wider range of discounting functions observed in experiments (e.g., Myerson & Green, 1995; Rachlin, 2006; Rodriguez & Logue, 1988). Each of these models attempts to characterize how organisms behave when presented with outcomes that differ in their delay to receipt, and each has important implications from both a theoretical perspective (Rachlin, 2006) and a practical perspective, as a tool to accurately describe data and aid in data analyses (Franck et al., 2015).
Rachlin (2006) viewed discounting the value of an outcome delayed in time (and, in fact, many other patterns of behavior) to be an instance of molar choice, or a series of seemingly discrete behaviors governed by one overarching process that the organism decides upon as a group. Importantly, Rachlin theorized that the Stevens (1957) power law, where the psychological effect of a physical object is governed by a power function, is a fundamental and individualized part of the perception of time in delay discounting. This is pertinent to addiction because an individual may allocate choices towards a drug as it increases short‐term benefits (Rachlin, 2007). A connection between excessive drug use and impulsive choices based on delay discounting has been identified (Amlung et al., 2017). This ties in well with Rachlin's perspective on preference reversals and how the value of a commodity decreases hyperbolically as a function of time (Rachlin et al., 1986), as short‐term maximization of utility may overtake long‐term maximization of utility (i.e., maladaptive long‐term outcomes of drug use are undervalued relative to short‐term benefits; Rachlin, 2007).
Rachlin has been widely recognized as popularizing and validating the form of hyperbolic discounting that incorporates this observation as an exponent free parameter appended to the delay term of the hyperbolic discounting function. While recognizing the important philosophical and theoretical implications of various forms of discounting functions, a branch of recent work (Franck et al., 2015; Gilroy et al., 2017; Gilroy & Hantula, 2018) has been focused on some of the more practical implications of these functions for data analysis pipelines. The goal of this work is to highlight the reliable and flexible performance of the discounting function popularized by Rachlin (2006). To accomplish this, we first provide an overview of some popular one‐ and two‐parameter discounting functions. Then, we provide an overview of model fitting via maximum likelihood. Finally, we demonstrate how obtaining valid parameter estimates can be inherently challenging for some models in cases where discounting data appear orderly. We show that the Rachlin model does not seem to face certain difficulties with estimation that can occasionally bedevil other models.
Particularly important is the descriptive utility of Rachlin's model to data on choice. Consider the scatterplot of delay discounting data in Figure 1. These data were previously analyzed (Koffarnus & Bickel, 2014; Franck et al., 2015) and arise from a hypothetical delay discounting questionnaire. The questionnaire asked each participant to answer a series of titrating preference questions at seven delays in order to establish the indifference point, that is, the proportion of the larger later reward that a participant would just as readily accept in order to have the reward immediately. For example, an indifference point of 0.9 occurring at a delay of 7 days means the participant has no preference between nine hundred dollars now and one thousand dollars in a week. The indifference points in Figure 1 decrease systematically as delay increases, which is the expected pattern for indifference points as a function of delay. The right panel of Figure 1 shows delay in the natural log of days, which aids visualization for short delays.
Figure 1.

Discounting Data from Subject 7 (Koffarnus & Bickel, 2014). Note. Horizontal axes are delay (left panel) and delay in the natural log scale (right panel). Vertical axis is indifference point. The orange line corresponds to the two‐parameter Rachlin model in Equation 1. The red line corresponds to the one‐parameter Mazur model in Equation 2. The estimated error variance from the Rachlin model (0.025) is smaller than the error variance estimated from the Mazur model (0.084), reflecting a closer fit to the observed data for the Rachlin model.
The notion that indifference points should decrease as delay increases is captured by the many available competing discounting models. In keeping with the focus of the special issue, we have overlaid the model fit in Figure 1 for the two‐parameter model popularized by Rachlin (2006):
$$E(Y) = \frac{A}{1 + kD^{s}} \tag{1}$$
where Y is the indifference point and E(Y) denotes the expected indifference point at a given delay D (i.e., the curvilinear orange regression line). The parameters k and s govern the shape of the curve, where larger values of k correspond to steeper decline in the valuation of rewards as delay increases, and s is the sensitivity parameter. As described above, this parameter, adapted from Stevens's power law (1957), characterizes the psychophysical sensitivity specific to delay. The constant A is the known value of the larger later reward in the task, and the indifference point data analyzed in this work are normalized such that A = 1. Commonly, the natural log ln(k) is used for analysis, as collections of estimated k values from study participants are better suited to the assumption of a normal distribution when k is logged. It should be noted that the first instance of Equation 1 being applied to discounting data was by Rodriguez & Logue (1988). Rachlin states this in his “Notes on discounting” paper (Rachlin, 2006). However, it is important to realize the substantial conceptual contribution Rachlin made to Equation 1, such as connecting the s parameter to Stevens's power law and extending the equation to other molar phenomena, such as matching and maximization (Rachlin, 2006). While the equation is not his per se, his scholarly effort in extending it to behavioral processes has played a key role in the equation's rise to prominence.
When s is fixed at s = 1, the Rachlin model simplifies to Mazur's one‐parameter hyperbolic discounting model (Mazur, 1987):
$$E(Y) = \frac{A}{1 + kD} \tag{2}$$
Compared with the two‐parameter Rachlin model, the one‐parameter Mazur model (red regression line in Fig. 1) has the advantage of simplicity. However, it is naturally less flexible, meaning that the regression lines it can produce are not necessarily able to get as close to the empirical data as the two‐parameter Rachlin model can. This illustrates that the first advantage of a two‐parameter model is flexibility to move closer to the observed data, in addition to the philosophical justifications provided for s as a parameter to model sensitivity.
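As a minimal sketch of how Equations 1 and 2 relate, the following Python snippet codes both mean functions and checks numerically that the Rachlin curve with s = 1 coincides with the Mazur curve. Python is used here for illustration only (the analyses in this paper were run in R); the function names, the delays, and the value of k are hypothetical choices and are not taken from the study.

```python
def rachlin(delay, k, s, amount=1.0):
    """Expected indifference point under Equation 1: A / (1 + k * D**s)."""
    return amount / (1.0 + k * delay ** s)


def mazur(delay, k, amount=1.0):
    """Expected indifference point under Equation 2: A / (1 + k * D)."""
    return amount / (1.0 + k * delay)


# Hypothetical delays (in days) and discount rate, for illustration only.
delays = [1, 7, 30, 90, 365, 1825, 9125]
k = 0.01

for d in delays:
    # With s = 1 the two models give the same expected indifference point.
    same = abs(rachlin(d, k, 1.0) - mazur(d, k)) < 1e-12
    print(d, round(mazur(d, k), 4), round(rachlin(d, k, 0.7), 4), same)
```

Letting s move away from 1 (0.7 in the printed column) changes the curvature of the Rachlin curve, which is the extra flexibility described above.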
However, we expect any two‐parameter model to have more flexibility than a one‐parameter model. The second major practical advantage of the Rachlin model is the ease with which one can reliably conduct statistical estimation. In statistical parlance, the parameter values k and s are unknown and they must be estimated using available data. We establish that the regression lines in Figure 1 are the best lines for their respective models by optimizing a certain statistical metric. In this paper we estimate parameters via maximum likelihood, which is a statistical approach that estimates parameter values that make the observed data the “most likely” to have occurred. We chose to use maximum likelihood because it is a more general approach than least squares for parameter estimation since it covers a much wider array of parameter estimation problems (e.g., models for categorical data). Specific to behavioral economics, maximum likelihood is essential for mixed effects modeling in discounting (Young, 2017) and demand (Kaplan et al., 2021). Thus, the purpose of this study was to determine some of the reasons why the Rachlin model has been so performant from a statistical perspective. We illustrate these features with two specific case studies below.
Method
We expound on the virtues of the Rachlin model by (i) describing other two‐parameter discounting models, including the Myerson and Green model (Myerson & Green, 1995) and the Laibson beta‐delta model (Laibson, 1997), (ii) explaining the maximum likelihood approach to statistical estimation in the context of discounting (using the Mazur model to illustrate basic ideas), and (iii) comparing the success and ease with which maximum likelihood estimates can be obtained using each of the three models for two valid discounting data sets.
Additional Two‐Parameter Discounting Models
The two‐parameter discounting models chosen for this study have been previously discussed and compared (see, e.g., Franck et al., 2015; Gilroy & Hantula, 2018; McKerchar et al., 2009). The two‐parameter model popularized by Myerson and Green (1995) is:
$$E(Y) = \frac{A}{(1 + kD)^{s}} \tag{3}$$
This model also has both k and s parameters. For this model, the s parameter exponentiates the entire denominator of the equation and also reflects an individual sensitivity constant.
The final two‐parameter model we consider is the Laibson beta‐delta model (Laibson, 1997):
$$E(Y) = A\,\beta\,\delta^{D} \tag{4}$$
The beta‐delta model has two parameters, β and δ. Both of these parameters are between zero and one. Values of δ below one prescribe the rate of decay of this function, and δ = 1 corresponds to an individual who does not devalue rewards as a function of delay and thus has a constant, nondecreasing average trend in their indifference points. The parameter β essentially controls the y‐intercept of the function. It is presumed that the expected indifference point at delay D = 0 is A, that is, an individual will not devalue the larger later reward at zero delay even though the function in Equation 4 might prescribe this. The motivation for the development of the model was to separately model the hypothesized dual‐process neurobiological model of delay discounting, which posits a push‐and‐pull interplay between a valuation process that values immediate outcomes and a separate valuation process that values delayed outcomes (McClure et al., 2004).
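The two additional mean functions can be sketched in the same way. The snippet below assumes the usual forms of these models, A / (1 + kD)^s for Myerson‐Green (the exponent applied to the whole denominator, as described above) and Aβδ^D for the beta‐delta model, together with the stated convention that the expected indifference point at zero delay is A. Function names and the example values are hypothetical.

```python
def myerson_green(delay, k, s, amount=1.0):
    """Assumed form of Equation 3: A / (1 + k * D)**s."""
    return amount / (1.0 + k * delay) ** s


def laibson(delay, beta, delta, amount=1.0):
    """Assumed form of Equation 4: A * beta * delta**D, with E(Y) = A at D = 0."""
    if delay == 0:
        # Convention from the text: no devaluation at zero delay.
        return amount
    return amount * beta * delta ** delay


# Hypothetical parameter values, for illustration only.
for d in [0, 1, 7, 30, 365]:
    print(d, round(myerson_green(d, 0.01, 0.8), 4), round(laibson(d, 0.9, 0.999), 4))

# delta = 1 gives a flat trend at height A * beta for every positive delay.
print([laibson(d, 0.9, 1.0) for d in [1, 30, 365]])
```

The jump from A at zero delay to roughly Aβ at the shortest positive delay is what the text means by β controlling the y‐intercept.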
We note that other models have been proposed for delay discounting (see, e.g., Ebert & Prelec, 2007). Rather than strive for a comprehensive review and comparison among all available discounting models, we focus on explaining the “under‐the‐hood” workings of model fitting via maximum likelihood in the context of a few well‐known models with an emphasis on the Rachlin model. Thus, a noted limitation of this work is that we illustrate the concepts only on the Mazur, Myerson‐Green, Rachlin, and Laibson models. Of course, these concepts extend to model fitting more generally.
Overview of Maximum Likelihood
As mentioned above, the values of behavioral economic parameters such as k, s, β, and δ are considered unknown and must be estimated from observed data. Maximum likelihood estimation is a procedure that determines which values of the parameters make the observed data the “most likely” to have occurred, and those values serve as estimates for the parameters. The term “most likely” is intuitive, but more formally, the likelihood function has a specific definition. The likelihood function associates the observed data with the unknown parameters in a probabilistic framework. Values of the parameters are searched until the set that maximizes this likelihood is discovered. In other words, the goal is to figure out which values of the parameters maximize the likelihood function.
Determining which parameter values maximize a function requires mathematical optimization. Mathematical optimization is a discipline of its own, one that is deeply rooted in mathematical and computational theory. In an effort to make some of the basic ideas concrete we resort to a series of analogies, but we hope these next several paragraphs do not belie the underlying subtlety and beauty of mathematical optimization. A good introductory overview to optimization is Nocedal & Wright (2006).
One way to think of optimization is to imagine we are searching for the top of a hill. The likelihood function is like a topographical map of the altitude. On earth we would search across latitudes and longitudes to find the location where the top of the hill occurs, but in our optimization approach, we search the parameter spaces of k and s for the Rachlin and Myerson‐Green models (β and δ for the Laibson model) to find the values of the parameters where the top of the hill occurs. Since this hilltop is the maximum of the likelihood function, the parameter values corresponding to the hilltop are the maximum likelihood estimates.
The likelihood function of interest tends to be highly peaked. You can think of it as a mountain which is easy to see, but not necessarily easy to climb. To make the search for the optimum easier (both mathematically and computationally), it is frequently a better choice to optimize the log‐likelihood function rather than the likelihood function directly. Since the logarithm function is monotone, the parameter values where the max occurs in the likelihood function also correspond to the max of the log‐likelihood. Figure 2 shows this is the case using the Mazur model and data from Subject 7. With respect to the fitting of discounting curves, the Mazur model is a simpler model, and it is nested in the Myerson‐Green and Rachlin models (in the sense that these models are equivalent to the Mazur model if s = 1). The Mazur model is elegant in its simplicity, as the one‐parameter hyperbolic discounting curve it produces provides a decreasing curvilinear trend that effectively characterizes many observed discounting patterns (particularly when s is close to 1).
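The monotonicity argument can be checked numerically. The sketch below evaluates a normal‐error likelihood and log‐likelihood for the Mazur model over a grid of ln(k) values and confirms that both peak at the same grid point. The data are hypothetical illustration values (not the Subject 7 data), the error variance is held at an assumed fixed value to keep the picture one‐dimensional, and all names are illustrative.

```python
import math

# Hypothetical illustration data (NOT the Subject 7 data from the study).
delays = [1, 7, 30, 90, 365, 1825, 9125]             # delay in days
indiff = [0.95, 0.90, 0.75, 0.60, 0.40, 0.20, 0.05]  # normalized so that A = 1


def mazur_log_likelihood(ln_k, sigma2):
    """Normal log-likelihood for the Mazur model E(Y) = 1 / (1 + k * D)."""
    k = math.exp(ln_k)
    n = len(delays)
    sse = sum((y - 1.0 / (1.0 + k * d)) ** 2 for d, y in zip(delays, indiff))
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - sse / (2.0 * sigma2)


sigma2 = 0.01  # assumed fixed error variance, for illustration only
grid = [-10.0 + 0.01 * i for i in range(1001)]   # ln(k) from -10 to 0
log_lik = [mazur_log_likelihood(g, sigma2) for g in grid]
lik = [math.exp(v) for v in log_lik]             # likelihood = exp(log-likelihood)

best_log = grid[log_lik.index(max(log_lik))]
best_lik = grid[lik.index(max(lik))]
print(best_log, best_lik, best_log == best_lik)
```

The likelihood values span many orders of magnitude across the grid while the log‐likelihood values change gradually, which is the practical reason the log scale is preferred.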
Figure 2.

Mazur Model Likelihood (Solid Red Line) and Log‐Likelihood (Dashed Black Line) Functions Across Varying Values of ln(k) for Data from Subject 7 (Koffarnus & Bickel, 2014). Note. The value of ln(k) that maximizes the likelihood function also maximizes the log‐likelihood function and is the maximum likelihood estimator. For these data, the maximum likelihood estimator for is . A user can optimize either the likelihood or log‐likelihood to get the maximum likelihood estimator, and it is usually much easier to optimize the log‐likelihood.
Further, practical experience suggests that difficulties obtaining maximum likelihood estimates (e.g., using the optim() function in R as we do in this report) are exceedingly rare for this one‐parameter model. While optimization is easy for the Mazur model, this model is less flexible than two‐parameter models and thus cannot generally get as close to the data as two‐parameter models can. Previous work using the broader data from Koffarnus & Bickel (2014) that focused on parsimonious model selection (Franck et al., 2015) found that the Rachlin model was favored in a plurality of cases among those models considered in this paper, due to its favorable tradeoff between model fit and complexity. Since the major goals of this paper are to characterize the two‐parameter models in terms of model fit and optimization, we use the Mazur model to introduce the preliminary ideas used in the comparison of the other three models.
Figure 2 is a practical demonstration of the universal fact that the maximum likelihood estimate for k is the same whether the likelihood or the log‐likelihood function is optimized. You can also see that while the likelihood has the shape of a highly peaked mountain, the log‐likelihood is a more gently sloping hill. These observations hold more generally. In this paper and in most other work across disciplines, the log of the likelihood function, or log‐likelihood, is optimized to obtain maximum likelihood estimates. Because the peak of the likelihood function is easier to see visually, we plot the likelihood surface in several figures in this paper to aid in the visual detection of the max. For some simple problems, calculus can be used to obtain a mathematical expression for the maximum likelihood estimators. For the nonlinear discounting models in this paper, a numerical computer algorithm must be used. Generally speaking, the algorithm is given a starting point, and it uses the curvature of the function to move uphill until the curve flattens, at which point the algorithm stops. In general, when an optimization approach identifies a maximum, we say it converged.
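To make the idea of a numerical search concrete, here is a minimal sketch that maximizes the Mazur log‐likelihood over ln(k) with a golden‐section search. This is a deliberately simple stand‐in and not the algorithm used by optim() in R; it assumes the log‐likelihood has a single peak on the search interval. The data are hypothetical, the error variance is replaced by its closed‐form estimate (sum of squared residuals divided by n) for each candidate ln(k), and all names are illustrative.

```python
import math

# Hypothetical illustration data (NOT taken from the study).
delays = [1, 7, 30, 90, 365, 1825, 9125]
indiff = [0.95, 0.90, 0.75, 0.60, 0.40, 0.20, 0.05]


def profile_log_likelihood(ln_k):
    """Mazur log-likelihood with sigma^2 set to its estimate SSE / n."""
    k = math.exp(ln_k)
    n = len(delays)
    sse = sum((y - 1.0 / (1.0 + k * d)) ** 2 for d, y in zip(delays, indiff))
    return -0.5 * n * (math.log(2.0 * math.pi * sse / n) + 1.0)


def golden_section_max(f, lo, hi, tol=1e-8):
    """Maximize a single-peaked function f on [lo, hi] by interval shrinking."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            # The peak lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # The peak lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0


ln_k_hat = golden_section_max(profile_log_likelihood, -10.0, 0.0)
print(ln_k_hat, profile_log_likelihood(ln_k_hat))
```

A search like this stops when its interval is small, not when it has proven that no higher point exists, which is why a separate check of the result is worthwhile for the two‐parameter models.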
Unfortunately, for the two‐parameter models under consideration, there is no ironclad mathematical guarantee that a hilltop exists. We will see that certain combinations of data sets and models do not seem to lead to valid maximum likelihood estimates. Specifically, when certain models are applied to certain data sets, it appears that the likelihood function grows towards infinity and no maximal value (i.e., hilltop) exists. It is also possible for optimization difficulties to occur if the log‐likelihood function exhibits many local maxima. A major differentiating factor among models is the tendency for some models to have valid maximum likelihood estimates across a multitude of data sets, where other models seem unable to produce maximum likelihood estimates, even in certain cases where data conform to the expected behavior outlined in Johnson & Bickel (2008). The ability of some models to produce valid statistical estimates for parameters more readily and reliably than others is surely an important consideration when deciding which delay discounting function to use in practice.
The analogy of hills, mountains, topographical maps, and latitude–longitude coordinates corresponding to parameters holds up very well for problems with two inputs. Fortunately, maximum likelihood is useful in higher dimensions as well. In fact, the error variance, σ², is also a parameter that must be considered when obtaining maximum likelihood estimates. Error variance quantifies the discrepancy between the observed data and the regression line, with larger values corresponding to cases where data are further from the line. Parameters that are not of central interest to the study but must be accounted for in the analysis are sometimes called nuisance parameters. The error variance was estimated simultaneously with other parameters in this study, but each likelihood surface depicted graphically is at the maximum likelihood estimate for σ², if available, or at the level for which an optimization algorithm converged spuriously if no maximum likelihood estimate is available. We assume departures from the regression line, namely residuals, follow a normal distribution with mean zero and variance σ².
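For concreteness, the normal‐error assumption just described implies a log‐likelihood of the following standard form. The notation is introduced here for illustration and is an assumption, not the source's: n is the number of indifference points, y_i and D_i are the i‐th indifference point and delay, and f(D_i; θ) is the expected indifference point from the chosen model (for example, Equation 1 with θ = (k, s)).

$$\ell(\theta, \sigma^{2}) = -\frac{n}{2}\ln\left(2\pi\sigma^{2}\right) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\bigl(y_i - f(D_i;\theta)\bigr)^{2}$$

For any fixed θ, this expression is largest at σ² = SSE(θ)/n, where SSE(θ) is the sum of squared residuals in the last term. Substituting that value back in gives

$$\ell\bigl(\theta, \hat{\sigma}^{2}(\theta)\bigr) = -\frac{n}{2}\left[\ln\left(\frac{2\pi\,\mathrm{SSE}(\theta)}{n}\right) + 1\right],$$

so under this error model the discounting parameters that maximize the likelihood are the ones that minimize the sum of squared residuals.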
For the two‐parameter models considered in this paper, the likelihood function can be visualized as a topographical map where each axis corresponds to a free parameter, and the “altitude” is the value of the log‐likelihood. A computer optimization algorithm is used to determine the highest peak on this map, and just as in the one‐dimensional representation in Figure 2, the values of the parameters that correspond to the peak are the maximum likelihood estimates. Technically, the optimization happens in three input dimensions since the error variance also needs to be estimated. For clarity of exposition, we visualize likelihood surfaces with the value of σ² fixed at its maximum likelihood value to showcase the behavioral economic parameters k, s, β, and δ. When the optimization algorithm does not identify the highest peak of the log‐likelihood, then it cannot establish the maximum likelihood estimates. In a case such as this, we say the model failed to optimize. We will see shortly that there are no guarantees that maximum likelihood estimates exist for two‐parameter discounting models. To further illustrate the ideas presented here, we show the indifference points, Mazur model fits, and log‐likelihood surfaces as a function of ln(k) for data from two exemplar participants used in this study.
Participants
This study examined two specific cases where optimization is challenging for certain models using data that appear to be orderly. We provide an “under‐the‐hood” view of the optimization approach and potential challenges in analyzing discounting data. The participant pool for this study comes from data previously collected (Koffarnus & Bickel, 2014). We consider the titrating delay discounting questionnaire from this study. The 111 participants' data from the previous study have been analyzed (Franck et al., 2015) to develop approximate Bayesian model selection and effective delay 50 (Yoon & Higgins, 2008) for the models described here. In that study, the Rachlin model successfully optimized for all 111 cases, the Laibson beta‐delta model optimized for 108 cases, the Myerson‐Green model optimized for 82 of the 111 cases, and the one‐parameter Mazur model optimized in all 111 cases. Furthermore, each candidate model was selected as the best (according to the criteria in that paper) for a subset of participants, indicating that each of these models is a valuable contribution to the literature. Nonetheless, the tendency for two‐parameter models to occasionally fail to optimize justifies the need to further examine these cases. To do this, we have deliberately chosen two participants' discounting data to highlight potential difficulties with model fitting. Specifically, we selected Subjects 7 and 15 from the previous analysis as we encountered some optimization difficulties for these cases, yet the patterns of indifference points for both subjects are valid since they do not violate well‐established rules to identify nonsystematic discounting patterns (Johnson & Bickel, 2008). Sample size was set to n = 2 for the purpose of illustrating the model‐fitting process. The data are publicly available in the supplementary materials in Franck et al. (2019).
Analytic Approach
The analytic approach we take in this paper is straightforward. First, we used the optim() function in R in an attempt to obtain maximum likelihood estimates for the parameters of interest in the Rachlin, Myerson‐Green, and Laibson models using data from Subjects 7 and 15. The optim() function takes the participant data and the form of the model (e.g., Mazur, Rachlin, Laibson, or Myerson‐Green). Then, beginning with a user‐provided starting value, the optim() function moves “up the log‐likelihood hill” (shown in Fig. 2 and the bottom row of Fig. 3) until the maximum is reached. The value of the parameter that corresponds to the maximum is the maximum likelihood estimator.
Figure 3.

Scatterplots and Mazur Fits for Subjects 7 and 15 (top row), and Log‐Likelihood Function Values for Subjects 7 and 15 (bottom row) (Koffarnus & Bickel, 2014). Note. The value of ln(k) corresponding to the highest value of log‐likelihood (denoted with a blue circle in the bottom row) is the maximum likelihood estimate.
To further elucidate the likelihood surface, and also to examine whether the optim() function successfully identifies the true maximum for the different models (i.e., Mazur, Rachlin, Laibson, or Myerson‐Green), we employed a grid search. To employ our grid search, we assessed the log‐likelihood function across a wide grid of values of ln(k) and s for the Rachlin and Myerson‐Green models, and of β and δ for the Laibson model, at the values of σ² obtained by the optim() function. The grid search enables (i) graphical depiction of the likelihood surface, and (ii) a clear way to assess whether the putative optimal value identified by optim() is correct. Topographical plots of likelihood surfaces make optimization challenges visible and give an empirical assessment of which models produce surfaces that are hard to optimize, and why. In cases where there is a readily discoverable higher likelihood value compared with the results of optim(), we conclude that the optimizer stopped at the incorrect location and failed to find the actual maximum likelihood estimates. A readily discoverable higher‐likelihood value occurs when a grid search reveals there is at least one grid point with higher likelihood than the value prescribed by the optimizer function. Please note that models with likelihood surfaces that are easy to optimize (i.e., optim() and grid search agree) are more convenient to work with when analyzing data.
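The grid check described above can be sketched in a few lines. The Python below is an illustration under stated assumptions, not the study's R code: the data are hypothetical, the grid ranges and spacing are arbitrary, the error variance is profiled out (set to SSE / n) instead of being held at an optimizer's value, and the "putative" optimum is an invented point standing in for an optimizer that stopped early.

```python
import math

# Hypothetical illustration data (NOT taken from the study).
delays = [1, 7, 30, 90, 365, 1825, 9125]
indiff = [0.95, 0.90, 0.75, 0.60, 0.40, 0.20, 0.05]


def rachlin_profile_ll(ln_k, s):
    """Rachlin log-likelihood (Equation 1, A = 1) with sigma^2 = SSE / n."""
    k = math.exp(ln_k)
    n = len(delays)
    sse = sum((y - 1.0 / (1.0 + k * d ** s)) ** 2 for d, y in zip(delays, indiff))
    return -0.5 * n * (math.log(2.0 * math.pi * sse / n) + 1.0)


def grid_search(ll, ln_k_values, s_values):
    """Return (log-likelihood, ln_k, s) at the best grid point."""
    return max((ll(a, b), a, b) for a in ln_k_values for b in s_values)


def optimizer_matches_grid(ll, putative, ln_k_values, s_values, tol=1e-3):
    """True if no grid point beats the putative optimum by more than tol."""
    grid_ll, _, _ = grid_search(ll, ln_k_values, s_values)
    return grid_ll <= ll(*putative) + tol


ln_k_values = [-10.0 + 0.1 * i for i in range(101)]  # ln(k) from -10 to 0
s_values = [0.1 + 0.05 * i for i in range(59)]       # s from 0.1 to 3.0

best_ll, best_ln_k, best_s = grid_search(rachlin_profile_ll, ln_k_values, s_values)
print(best_ll, best_ln_k, best_s)

# An invented "optimizer result" at s = 1; the grid finds higher ground.
print(optimizer_matches_grid(rachlin_profile_ll, (-5.0, 1.0), ln_k_values, s_values))
```

The tolerance reflects the caveat in the Table 1 note: agreement can only be judged up to the density of the grid that was searched.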
For both subjects' data, the optim() function met its default convergence criteria and the algorithm stopped, although Table 1 indicates that for these subjects, the optimizer stopped somewhere besides the global maximum for the Myerson‐Green and Laibson models.
Table 1.
Discounting Parameters, Error Variance, and Log‐Likelihood from Different Discounting Models
| Model | Parameter | Subject 7: optim() | Subject 7: Grid Search | Subject 7: Match | Subject 15: optim() | Subject 15: Grid Search | Subject 15: Match |
|---|---|---|---|---|---|---|---|
| Rachlin | ln(k) | -15.303 | -15.3 | ✓ | -27.524 | -27.512 | ✓ |
| | s | 2.783 | 2.782 | | 3.022 | 3.021 | |
| | σ² | 0.026 | - | | 0.007 | - | |
| | log‐likelihood | 15.703 | 15.703 | | 25.222 | 25.222 | |
| Myerson‐Green | ln(k) | -11.185 | -12.05 | × | -13.33 | -14.17 | × |
| | s | 209.404 | 497.11 | | 42.990 | 99.26 | |
| | σ² | 0.062 | - | | 0.046 | - | |
| | log‐likelihood | 9.513 | 9.518 | | 11.693 | 11.715 | |
| Laibson | β | 0.99991 | 1 | × | 0.99993 | 1 | × |
| | δ | 0.997 | 0.997 | | 0.99997 | 0.99993 | |
| | σ² | 0.05 | - | | 0.328 | - | |
| | log‐likelihood | 9.139 | 9.151 | | 1.087 | 1.297 | |
Note: Comparisons between the optim() function and grid searches of parameter estimates and log likelihoods of different discounting models. Match indicates whether the optim() and grid search parameter and log‐likelihood estimates are indistinguishable based on the density of the grid that was searched.
Results
The optim() function's stopping criteria were met for all three models and both data sets in this study, although in some cases the function failed to identify the global maximum and thus establish valid estimates for parameters. Table 1 shows that the optimization routine and grid search agree for the Rachlin model using both subjects' data. For the Myerson‐Green model, the optimizer converges at a value of the log‐likelihood that can be readily shown to be nonoptimal via grid search. The Laibson surface has a clear maximum and the optim() function came close, but the grid search revealed that the maximum likelihood estimates are on the boundary of the parameter space.
Figure 4 shows the likelihood surface for each combination of model and data set. Green circles correspond to maximum likelihood estimates, and red X marks correspond to incorrect results where stopping criteria for optim() are met at a location that is demonstrably not the maximum. For the Rachlin model, it is readily apparent that the optimizer has correctly identified the peak of the likelihood surface, and thus maximum likelihood estimators do exist and have been identified for these data using this model. These surfaces have a clear and obvious maximum that was located by the optimizer. The range and density of the grid points in Figure 4 correspond to those used in the grid search. Specific values of each grid point search can be found in the comments of the R code included in the supplementary materials.
Figure 4.

Indifference Points, Delays and Model Fits for Subjects 7 and 15 (Koffarnus & Bickel, 2014). Note. Scatterplots of indifference points and delays are shown for Subject 7 (top left) and Subject 15 (top right). Second row shows the likelihood surface for the Rachlin model for these subjects. Third row shows the likelihood surface for the Myerson‐Green model for these subjects. Fourth row shows the likelihood surface for the Laibson model for these subjects. Green circles correspond to the maximum likelihood estimates. Red X symbols correspond to values of the parameters where an optimizer converges spuriously (i.e., other readily available values of the likelihood are higher than what the optimizer suggests). All likelihood surfaces are visualized at the value of σ² recommended by the optimization algorithm even if that algorithm converged to something other than the maximum likelihood estimate.
By contrast, the likelihood surface for the Myerson‐Green model shows that there exist higher log‐likelihood values for combinations of ln(k) and s than those identified by the optimizer. It appears as though the likelihood surface continues growing in an unbounded fashion, and our further attempts to identify a global maximum outside of the plotted range revealed the same trend. Of course, it is possible that a global maximum does exist for an extremely large value of s. That is, one can always keep testing larger and larger values, never ruling out the possibility that the peak exists just beyond the current horizon. But the optimizer readily converged to a nonmaximal, spurious value, and an effort‐intensive approach to graphically identify a global maximum failed (and would need to be attempted for every subject in the study, by the way). Together, these facts reveal that, from a practical model‐fitting standpoint, (i) the Myerson‐Green model must be monitored to an almost excessive degree to check the validity of estimates, and (ii) it seems possible that the Myerson‐Green model will not always be able to produce credible estimates, even for discounting data that meet the conditions for validity (Johnson & Bickel, 2008).
The parameters in the Rachlin and Myerson‐Green models share a relationship. Specifically, for both of these models, similar model fits can be achieved if, for example, ln(k) is increased while s is decreased. This is visually apparent in the second and third rows of Figure 4 that show that the highest likelihood ridge corresponds to combinations of higher levels of ln(k) with lower levels of s. While the overall optimal point is readily apparent for the Rachlin model, no such maximum is apparent for the Myerson‐Green model for these two data sets.
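One way to see why such a ridge can arise for the Myerson‐Green model is offered here as an observation, not as a result stated in the source. Assume A = 1 and let k shrink as s grows so that the product c = ks stays fixed. The Myerson‐Green curve then approaches an exponential decay:

$$\lim_{s \to \infty}\left(1 + \frac{c}{s}\,D\right)^{-s} = e^{-cD}.$$

Along this path, ln(k) = ln(c) − ln(s) keeps falling as s rises, which is the trade‐off described above. The Myerson‐Green entries for Subject 7 in Table 1 are consistent with this reading: exp(−11.185) × 209.404 ≈ 0.0029 at the optimizer's stopping point and exp(−12.05) × 497.11 ≈ 0.0029 at the best grid point, so the product ks is nearly unchanged while s more than doubles. If an exponential curve fits a data set better than any Myerson‐Green curve with finite s, the likelihood would keep creeping upward along the ridge without reaching a peak at finite parameter values.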
The Laibson model's optimization faced challenges that differ in kind from those of the Myerson‐Green model. The primary issue is that the global maximum occurred on the boundary of the parameter space in both cases (row 4, Fig. 4). The optim() function used a box‐constrained quasi‐Newton method, which efficiently searches for local maxima, but in general algorithms such as this do not reliably search the boundary. The grid search revealed these boundary optima, and thus it is important for users of the Laibson model to manually assess the boundaries of the likelihood surface in order to determine whether the likelihood there is higher than in the interior of the parameter space, which can be searched with a readily available optimizer.
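A manual boundary check of the kind recommended above might look like the following Python sketch. It is not the authors' R code. It assumes the usual quasi‐hyperbolic form V = βδ^D with both parameters restricted to [0, 1], normally distributed errors with the variance profiled out, and made-up data. The helper names are hypothetical.

```python
import math

def laibson(delay, beta, delta):
    """Assumed quasi-hyperbolic (beta-delta) value of a delayed reward scaled to 1."""
    return beta * delta ** delay

def profile_loglik(delays, indiff, beta, delta):
    """Normal log-likelihood with the error variance replaced by its MLE, SSE / n."""
    n = len(delays)
    sse = sum((y - laibson(d, beta, delta)) ** 2 for d, y in zip(delays, indiff))
    return -0.5 * n * (math.log(2.0 * math.pi * sse / n) + 1.0)

# Hypothetical data: delays in months, indifference points as proportions.
delays = [0.25, 1, 3, 12, 60]
indiff = [0.99, 0.97, 0.90, 0.65, 0.12]

steps = [i / 100 for i in range(101)]  # 0.00, 0.01, ..., 1.00
edge = (0.0, 1.0)

# Points with at least one parameter at a bound, and points strictly inside the box.
boundary = [(b, d) for b in steps for d in steps if b in edge or d in edge]
interior = [(b, d) for b in steps[1:-1] for d in steps[1:-1]]

def best(points):
    """Return (log-likelihood, beta, delta) for the best point in the list."""
    return max((profile_loglik(delays, indiff, b, d), b, d) for b, d in points)

best_boundary = best(boundary)
best_interior = best(interior)
print("best on boundary:", best_boundary)
print("best in interior:", best_interior)
```

When the boundary value exceeds the interior value, an optimizer that reports an interior point has not found the maximum over the closed parameter space. The grid spacing of 0.01 is coarse, so a finer grid near the bounds may be needed before drawing conclusions.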
The top panel of Figure 4 shows indifference points and model fits. Solid lines correspond to models where the maximum likelihood estimate can be established via numerical optimization or grid search. Dashed lines correspond to model fits based on optimizations when convergence criteria are met, but grid search definitively shows that higher log likelihood values are attainable. Thus, dashed lines (e.g., due to spurious convergence in the Myerson‐Green model or from bad starting values in the Laibson model) do not correspond to valid model fits but would be erroneously adopted if the user fails to recognize the shortcoming of the optimizer. Data and R scripts to fully reproduce all analyses in this paper are included in the online supplementary materials.
Discussion
As a molar choice model, hyperbolic discounting describes an extended pattern of behaviors as being guided by a single identifiable function. Rachlin validated and popularized the idea that molar choice as it relates to delayed outcomes consists of a combination of processes: (1) the hyperbolic discounting of rewards that describes the process by which rewards are devalued by some constraint, and (2) the psychophysical scaling of an individual's perception of the value of that constraint. His work in this sphere focused on the validation of this model with arguments focusing on goodness of fits to data and theoretical implications of fitted parameters.
Here, we have explored the more practical aspects of model fitting as it pertains to discounting data. First, the one‐parameter Mazur model was used to illustrate the concept of maximum likelihood, where the estimates of unknown parameters take the values that make the observed data most likely to occur. We briefly described and made some analogies regarding the optimization algorithms that are used to perform maximum likelihood estimation. We then selected two subject‐level discounting data sets because (i) they consist of valid data, and (ii) two‐parameter model fitting is nonetheless challenging for some models.
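As a reminder of what maximum likelihood amounts to in the one‐parameter case, the Python sketch below evaluates a log‐likelihood over a grid of ln(k) values and keeps the value that makes the data most likely. It is a minimal illustration rather than the authors' R code. It assumes the Mazur hyperbola V = 1/(1 + kD) for indifference points expressed as proportions, normally distributed errors with the variance profiled out, and made-up data.

```python
import math

def mazur(delay, ln_k):
    """Mazur hyperbola for a reward scaled to 1: V = 1 / (1 + k * D)."""
    return 1.0 / (1.0 + math.exp(ln_k) * delay)

def profile_loglik(delays, indiff, ln_k):
    """Normal log-likelihood with the error variance replaced by its MLE, SSE / n."""
    n = len(delays)
    sse = sum((y - mazur(d, ln_k)) ** 2 for d, y in zip(delays, indiff))
    return -0.5 * n * (math.log(2.0 * math.pi * sse / n) + 1.0)

# Hypothetical data: delays in days, indifference points as proportions.
delays = [1, 7, 30, 90, 365, 1825]
indiff = [0.95, 0.80, 0.55, 0.35, 0.15, 0.05]

# Candidate ln(k) values from -12 to 0 in steps of 0.01.
grid = [-12 + 0.01 * i for i in range(1201)]

# The grid-based estimate is the candidate with the highest log-likelihood.
ln_k_hat = max(grid, key=lambda v: profile_loglik(delays, indiff, v))
print("ln(k) estimate:", round(ln_k_hat, 2))
print("log-likelihood:", round(profile_loglik(delays, indiff, ln_k_hat), 3))
```

With a single parameter the whole likelihood curve can be inspected this way, which is what makes the one‐parameter case a convenient teaching example.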
Among the three two‐parameter discounting models we compared, the Rachlin model stands alone as the model that optimizes easily and produces flexible model fits that characterize discounting data well. The Myerson‐Green model encountered spurious convergence to incorrect values: graphical and grid‐based analyses readily show that the log‐likelihood surface continues growing past the point where the optimizer stops. It is not possible to show definitively that no maximum likelihood estimate exists (because one can always expand the search region in the unbounded parameter spaces for k and s). Nonetheless, we are satisfied that the level of effort devoted to searching for maximum likelihood estimates for the Myerson‐Green model for these data sets reflects the challenges most behavioral data analysts will face when attempting to use the Myerson‐Green model, challenges that are compounded by the fact that the data at hand pass well‐established rules for systematic discounting (Johnson & Bickel, 2008). There seems to be a large risk of selecting spurious estimates, and the level of effort needed to assess successful optimization (for every subject) seems excessive.
The Laibson model also faced some implementation challenges to optimization. For this model, the upper bounds of 1 for both β and δ make the parameter space easier to canvass with a grid search compared with the k and s parameters of the Rachlin and Myerson‐Green models, which are unbounded. But the user should beware that optimization algorithms might not effectively search the boundaries, so we recommend manually assessing the log‐likelihood on the boundary. In the analyses presented in this paper, the optimizer came close to the true maximum for Subject 7. For Subject 15, we noticed a tendency toward incorrect convergence for starting values that one might hope would work; for example, one choice of a common starting value for both parameters leads to convergence at δ = 1 with β at the sample mean of the indifference points. When δ = 1, the Laibson model does not decrease as a function of delay, and the maximum likelihood estimate of β is set equal to the average indifference point (see dashed line for Laibson model in top right of Fig. 4).
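The flat fit described above follows from a one‐line calculation. As a sketch, assume the Laibson model is written in its usual quasi‐hyperbolic form, with predicted indifference point βδ^D for indifference points y_1, ..., y_n expressed as proportions, and assume normally distributed errors so that maximizing the likelihood in β is equivalent to minimizing the sum of squared errors. Setting δ = 1 gives:

```latex
\hat{V}(D_i) = \beta \cdot 1^{D_i} = \beta, \qquad
\frac{d}{d\beta} \sum_{i=1}^{n} (y_i - \beta)^2
  = -2 \sum_{i=1}^{n} (y_i - \beta) = 0
\;\Longrightarrow\;
\hat{\beta} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}
```

The fitted curve is then a horizontal line at the mean indifference point, which is why such a fit carries no information about discounting.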
Cases where maximum likelihood estimates don't exist, or exist but elude the researcher, lead to the following problems. First, the researcher misunderstands whether reasonable estimates even exist for model parameters, or at least has a poor guess of what reasonable estimates might be. Further, likelihood‐based methods for hypothesis testing (i.e., likelihood ratio tests) and interval estimation (i.e., profile likelihood intervals) are not available, as these rely on log‐likelihood functions that can be optimized to determine the global maximum.
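To make the dependence on a global maximum concrete, the Python sketch below computes a likelihood ratio test of the Mazur model (s = 1) against the Rachlin model. It is illustrative only and is not the authors' R code. It assumes the Rachlin form V = 1/(1 + kD^s), normally distributed errors with the variance profiled out, grid‐based approximations to the two maximized log‐likelihoods, made-up data, and the usual chi‐square reference distribution with one degree of freedom, which is only a rough guide with so few observations.

```python
import math

def rachlin(delay, ln_k, s):
    """Rachlin form V = 1 / (1 + k * D**s); s = 1 gives the Mazur hyperbola."""
    return 1.0 / (1.0 + math.exp(ln_k) * delay ** s)

def profile_loglik(delays, indiff, ln_k, s):
    """Normal log-likelihood with the error variance replaced by its MLE, SSE / n."""
    n = len(delays)
    sse = sum((y - rachlin(d, ln_k, s)) ** 2 for d, y in zip(delays, indiff))
    return -0.5 * n * (math.log(2.0 * math.pi * sse / n) + 1.0)

# Hypothetical data: delays in days, indifference points as proportions.
delays = [1, 7, 30, 90, 365, 1825]
indiff = [0.95, 0.80, 0.55, 0.35, 0.15, 0.05]

ln_k_grid = [i / 20 for i in range(-240, 1)]  # -12.00 to 0.00 in steps of 0.05
s_grid = [j / 20 for j in range(1, 61)]       # 0.05 to 3.00; contains s = 1.0 exactly

# Maximized log-likelihood under the restricted (Mazur) and full (Rachlin) models.
ll_mazur = max(profile_loglik(delays, indiff, lk, 1.0) for lk in ln_k_grid)
ll_rachlin = max(profile_loglik(delays, indiff, lk, s)
                 for lk in ln_k_grid for s in s_grid)

# Likelihood ratio statistic and chi-square (1 df) tail probability.
lr_stat = 2.0 * (ll_rachlin - ll_mazur)
p_value = math.erfc(math.sqrt(lr_stat / 2.0))
print("LR statistic:", round(lr_stat, 3), "p-value:", round(p_value, 4))
```

Both maximized log‐likelihoods must be genuine global maxima for the statistic to mean anything. If the surface for the larger model has no maximum, or the optimizer stops at a spurious point, the statistic and its p‐value are not interpretable.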
This study is not without limitations. First, we used optim() at default settings for convergence criteria except where noted as part of the investigation, and we relied heavily on our visualizations to study the likelihood surface. When convergence issues arise in practice, analysts frequently need to adjust optimizer settings for step size, tolerances, starting values, and iteration limits to carefully search for potential optima. We note, however, that Figure 4 graphically suggests that the issue with the Myerson‐Green model has more to do with an ever‐increasing function for which a global maximum may not exist, in which case no optimizer can function properly.
Second, this study focuses almost exclusively on practical issues related to fitting discounting models and on providing an overview of maximum likelihood. While understanding the statistical implications of choosing different models is important, these issues must be weighed alongside considerations of which model best portrays the psychological processes underlying delay discounting. Ultimately, model selection and comparison involve subjectivity. No single metric or consideration is objectively “best” to assess models from a statistical perspective (e.g., many competing information criteria and testing approaches, valuation of model interpretability vs. out‐of‐sample predictive performance, and ease‐of‐optimization considerations as described here). Likewise, there is plenty of debate over models' relative merits from a theoretical perspective. This work is focused specifically on exploring optimization and related challenges for fitting delay discounting models to observed data.
Another point to emphasize is that the data in this study were specifically chosen as challenging cases for optimization on the basis of a previous study (Franck et al., 2015). This earlier study noted convergence failure based on least squares in 26% of cases for the Myerson‐Green model, but did not explore the likelihood surfaces on a by‐subject basis for all 111 participants. Our team's experience has been that it seems to be generally true across many discounting analyses that the Rachlin model optimizes with relative ease, and varying levels of additional care are needed for other two‐parameter models. However, the possibility surely exists that there are observable discounting patterns for which the Rachlin model might struggle, and/or other models optimize more easily. If such a case is encountered, the strategies to explore likelihood functions we describe here could be useful to gain similar insights in those contexts.
A final limitation is that this paper focuses exclusively on estimation in the context of maximum likelihood. The Bayesian approach to statistical inference addresses estimation from a different perspective that focuses on obtaining the distribution of unknown parameters given observed data and thus does not rely so heavily on optimization. Broadly speaking, Bayesian methods are more intuitive to interpret but more difficult to implement compared with classical statistical methods. Further, the underlying problem of statistical association between estimates of k and s in the Myerson‐Green model is not automatically overcome merely by adopting a Bayesian approach. For more on Bayesian reasoning in the context of delay discounting, see Franck et al. (2019).
With the present analysis at hand, we show why the Rachlin model is favorable on the basis of convenience with which it can be fitted and the flexibility it imparts to modeling discounting trends. The optimization algorithm and visual depiction conclude that the Rachlin model provides an easy‐to‐identify, unique maximum, and model fits that correspond closely with the observed discounting data. In addition, the Rachlin model is conceptually parsimonious assuming that the psychophysical scaling constant, s, operates on the delay variable rather than the entire denominator. As described by Rachlin (2006; pg. 429), “It is at least conceivable that the frequently found deviations from s = 1.0 in Equation 8 eventually will be explicable in terms of other factors than simple time and distance judgements—for example, the expectation of performing anticipatory activities during the delay period, differences of constraints on reward consumption [Raineri & Rachlin, 1993], or differences in valence or quality of outcome…” Understanding the statistical properties of this two‐parameter discounting equation will help supplement the conceptual progress of identifying underlying factors contributing to deviations in s. Rachlin's model has been extremely useful in describing data related to impulsive choice and helping to elucidate potential underlying processes in addiction (e.g., relations between temporal perspective and substance use). We anticipate this model will be in use for a long time, especially given its utility in describing discounting data, and reflects one of the indelible marks Howard Rachlin has left on the discipline.
Supporting information
Data S1 Supporting Information
Data are publicly available in the supplementary materials of Franck et al. (2019).
One hundred percent of this research was supported by federal or state money with no financial or nonfinancial support from nongovernmental sources. Code will be made publicly available upon acceptance of the article. This study was not preregistered.
The authors declare no conflict of interest.
Footnotes
Editor‐in‐Chief: Mark Galizio
Associate Editor: Suzanne Mitchell
References
- Ainslie, G. (1975). Specious reward: A behavioral theory of impulsiveness and impulse control. Psychological Bulletin, 82(4), 463–496. https://doi.org/10.1037/h0076860
- Amlung, M., Vedelago, L., Acker, J., Balodis, I., & MacKillop, J. (2017). Steep delay discounting and addictive behavior: A meta‐analysis of continuous associations. Addiction, 112(1), 51–62. https://doi.org/10.1111/add.13535
- Ebert, J. E. J., & Prelec, D. (2007). The fragility of time: Time‐insensitivity and valuation of the near and far future. Management Science, 53(9), 1423–1438. https://www.jstor.org/stable/20122300
- Franck, C. T., Koffarnus, M. N., House, L. L., & Bickel, W. K. (2015). Accurate characterization of delay discounting: A multiple model approach using approximate Bayesian model selection and a unified discounting measure. Journal of the Experimental Analysis of Behavior, 103(1), 218–233. https://doi.org/10.1002/jeab.128
- Franck, C. T., Koffarnus, M. N., McKerchar, T. L., & Bickel, W. K. (2019). An overview of Bayesian reasoning in the analysis of delay‐discounting data. Journal of the Experimental Analysis of Behavior, 111(2), 239–251. https://doi.org/10.1002/jeab.504
- Gilroy, S. P., Franck, C. T., & Hantula, D. A. (2017). The discounting model selector: Statistical software for delay discounting applications. Journal of the Experimental Analysis of Behavior, 107(3), 388–401. https://doi.org/10.1002/jeab.257
- Gilroy, S. P., & Hantula, D. A. (2018). Discounting model selection with area‐based measures: A case for numerical integration. Journal of the Experimental Analysis of Behavior, 109(2), 433–449. https://doi.org/10.1002/jeab.318
- Johnson, M. W., & Bickel, W. K. (2008). An algorithm for identifying nonsystematic delay‐discounting data. Experimental and Clinical Psychopharmacology, 16(3), 264–274. https://doi.org/10.1037/1064-1297.16.3.264
- Kaplan, B. A., Franck, C. T., McKee, K., Gilroy, S. P., & Koffarnus, M. N. (2021). Applying mixed‐effects modeling to behavioral economic demand: An introduction. Perspectives on Behavior Science, 44(2), 333–358. https://doi.org/10.1007/s40614-021-00299-7
- Koffarnus, M. N., & Bickel, W. K. (2014). A 5‐trial adjusting delay discounting task: Accurate discount rates in less than one minute. Experimental and Clinical Psychopharmacology, 22(3), 222–228. https://doi.org/10.1037/a0035973
- Laibson, D. (1997). Golden eggs and hyperbolic discounting. The Quarterly Journal of Economics, 112(2), 443–478. https://doi.org/10.1162/003355397555253
- Lancaster, K. (1963). An axiomatic theory of consumer time preference. International Economic Review, 4(2), 221–231. https://doi.org/10.2307/2525488
- Mazur, J. E. (1987). An adjusting procedure for studying delayed reinforcement. In M. L. Commons, J. E. Mazur, J. A. Nevin, & H. Rachlin (Eds.), The effect of delay and of intervening events on reinforcement value (pp. 55–73). Lawrence Erlbaum Associates.
- McClure, S. M., Laibson, D. I., Loewenstein, G., & Cohen, J. D. (2004). Separate neural systems value immediate and delayed monetary rewards. Science, 306(5695), 503–507. https://doi.org/10.1126/science.1100907
- McKerchar, T. L., Green, L., Myerson, J., Pickford, T. S., Hill, J. C., & Stout, S. C. (2009). A comparison of four models of delay discounting in humans. Behavioural Processes, 81(2), 256–259. https://doi.org/10.1016/j.beproc.2008.12.017
- Myerson, J., & Green, L. (1995). Discounting of delayed rewards: Models of individual choice. Journal of the Experimental Analysis of Behavior, 64(3), 263–276. https://doi.org/10.1901/jeab.1995.64-263
- Nocedal, J., & Wright, S. (2006). Numerical optimization. Springer. https://books.google.com/books?id=VbHYoSyelFcC
- Rachlin, H. (2006). Notes on discounting. Journal of the Experimental Analysis of Behavior, 85(3), 425–435. https://doi.org/10.1901/jeab.2006.85-05
- Rachlin, H. (2007). In what sense are addicts irrational? Drug and Alcohol Dependence, 90(Suppl. 1), S92–S99. https://doi.org/10.1016/j.drugalcdep.2006.07.011
- Rachlin, H., Logue, A. W., Gibbon, J., & Frankel, M. (1986). Cognition and behavior in studies of choice. Psychological Review, 93(1), 33–45. https://doi.org/10.1037/0033-295X.93.1.33
- Rodriguez, M. L., & Logue, A. W. (1988). Adjusting delay to reinforcement: Comparing choice in pigeons and humans. Journal of Experimental Psychology: Animal Behavior Processes, 14(1), 105–117. PMID: 3351438.
- Stevens, S. S. (1957). On the psychophysical law. Psychological Review, 64(3), 153–181. https://doi.org/10.1037/h0046162
- Yoon, J. H., & Higgins, S. T. (2008). Turning k on its head: Comments on use of an ED50 in delay discounting research. Drug and Alcohol Dependence, 95(1–2), 169–172. https://doi.org/10.1016/j.drugalcdep.2007.12.011
- Young, M. E. (2017). Discounting: A practical guide to multilevel analysis of indifference data. Journal of the Experimental Analysis of Behavior, 108(1), 97–112. https://doi.org/10.1002/jeab.265
