Published in final edited form as: J Am Stat Assoc. 2023 Nov 22;119(545):27–38. doi: 10.1080/01621459.2023.2261184

A Semiparametric Inverse Reinforcement Learning Approach to Characterize Decision Making for Mental Disorders

Xingche Guo a, Donglin Zeng b, Yuanjia Wang c,*

Abstract

Major depressive disorder (MDD) is one of the leading causes of disability-adjusted life years. Emerging evidence indicates the presence of reward processing abnormalities in MDD. An important scientific question is whether the abnormalities are due to reduced sensitivity to received rewards or reduced learning ability. Motivated by the probabilistic reward task (PRT) experiment in the EMBARC study, we propose a semiparametric inverse reinforcement learning (RL) approach to characterize the reward-based decision-making of MDD patients. The model assumes that a subject’s decision-making process is updated based on a reward prediction error weighted by the subject-specific learning rate. To account for the fact that one favors a decision leading to a potentially high reward, but this decision process is not necessarily linear, we model reward sensitivity with a non-decreasing and nonlinear function. For inference, we estimate the latter via approximation by I-splines and then maximize the joint conditional log-likelihood. We show that the resulting estimators are consistent and asymptotically normal. Through extensive simulation studies, we demonstrate that under different reward-generating distributions, the semiparametric inverse RL outperforms the parametric inverse RL. We apply the proposed method to EMBARC and find that MDD and control groups have similar learning rates but different reward sensitivity functions. There is strong statistical evidence that reward sensitivity functions have nonlinear forms. Using additional brain imaging data in the same study, we find that both reward sensitivity and learning rate are associated with brain activities in the negative affect circuitry under an emotional conflict task.

Keywords: semiparametric model, monotone splines, reward learning, learning ability, behavioral phenotyping, major depression

1. Introduction

Mental disorders cause immense disability, accounting for 183.9 million disability-adjusted life-years worldwide (Whiteford et al., 2013). However, despite intense clinical and research efforts devoted to developing pharmacological and behavioral treatments for mental disorders over the years, effective treatment strategies remain elusive, and treatment responses are far from adequate across mental disorders (e.g., 30%−50%). Recently, there has been a fast-growing trend to utilize discovered novel biomarkers and behavioral markers, accompanied by sophisticated computational methods (e.g., machine learning or reinforcement learning), to provide a more accurate characterization of mental disorders and more precise prediction of treatment responses than traditional approaches (Passos et al., 2019).

To address substantial between-patient heterogeneity and limitations in the traditional symptom-based diagnosis, the paradigm shift put forward by the NIMH strategic plan on research domain criteria (RDoC; Insel et al., 2010) initiative calls for re-defining mental disorders by latent constructs identified from biological and behavioral measures across different domains of functioning (e.g., positive/negative affect, social processing, decision making) at different levels (e.g., genes, cells, and circuits), instead of solely relying on clinical symptoms or the diagnostic and statistical manual of mental disorders (APA, 2013). The vision on precision psychiatry (Insel et al., 2010; Williams, 2016) is that research needs to account for extensive diagnostic heterogeneity and substantial between-patient variation in biological, behavioral, and clinical manifestations of disease, examine shared latent constructs across disorders, and match treatments with the underlying pathophysiology.

Learning and decision-making are considered important in the etiology of mental function (Kendler et al., 1999). An individual's learning ability and decision-making process may be altered by mental disorders (Pizzagalli et al., 2005). In turn, poor decision-making may lead individuals into high-risk scenarios and to experience more mental illness (Kendler et al., 1999). For example, impairment in processing rewards is implicated in various mental disorders. Anhedonia (i.e., the inability to experience pleasure) is one of the core constructs of major depression. Pizzagalli et al. (2005) showed that anhedonia may alter how MDD patients process reward in a PRT experiment. Briefly, subjects were presented with two types of stimuli in a computer-based experiment (the probabilistic reward task), and one stimulus was rewarded more frequently than the other (details in Section 1.1). Over time, healthy subjects learn which stimulus has a higher reward value and choose the correct stimulus with a higher probability. It is hypothesized that anhedonia may reduce a patient's ability to learn which reward is more valuable or reduce a subject's tendency to choose the richer reward. Decision-making models have been proposed to evaluate whether these hypotheses are supported (Huys et al., 2013). As another example, restrictive eating behavior in anorexia nervosa can be measured via the Food Choice Task, a computer-based assessment of ratings and choices of food items, and patients' choices in the task predict future caloric intake (Foerde et al., 2015).

A useful computational model for describing a patient's learning behavior and decision-making ability is reinforcement learning (RL; Sutton and Barto, 2018). To use RL to describe behaviors or actions generated from a probabilistic reward task, a simple prediction-error learning rule is invoked to characterize the value of different choices in terms of key behavioral phenotype parameters, including reward sensitivity and learning rate. Various empirical studies and a meta-analysis have shown that objective, quantitative measures derived from behavioral data are associated with multiple mental disorders (Pike and Robinson, 2022), making these measures ideal candidate behavioral phenotypes under the RDoC framework. In contrast to regular RL, where one aims to estimate an optimal policy that maximizes the long-term reward, the goal of analysis in behavioral phenotyping in mental health is behavioral cloning or imitation learning (Ross and Bagnell, 2010). That is, given patients' observed choices on probabilistic reward tasks, the aim is to infer their reward learning ability and value computation process, and to understand how these abilities and processes relate to MDD and, ultimately, how they are affected by MDD treatments.

There is a gap in using existing imitation learning or inverse RL methods to address the unique challenges presented by mental disorders. For example, the models proposed in Huys et al. (2013) assume that a patient’s decision-making process relies on a function with restrictive parametric linear form, which may oversimplify the complex decision process and interactions between core mental functions. For example, similar to other psychometric models (Gravetter and Forzano, 2018) and as we demonstrate in Section 4, there is often a floor and ceiling effect on the relationship between action and expected reward, which is not captured by a linear model. Some existing imitation learning and inverse RL methods are more flexible to allow nonlinear models (e.g., Arora and Doshi, 2021), but they do not provide important behavioral phenotypes such as learning rate and reward sensitivity and do not borrow information from multiple subjects under a random effects model.

In this work, motivated by the experiments in Pizzagalli et al. (2005) and Trivedi et al. (2016), we propose a semiparametric inverse reinforcement learning approach to characterize reward-based decision-making for patients with mental disorders. Our method incorporates subject-specific learning rates and reward sensitivity as random effects to address between-patient heterogeneity. To borrow information while remaining flexible, we assume that the function that links action to the contrast in expected reward is nonparametric but shared across the patients. We describe the behavioral phenotyping experiment and motivating study in the next section.

1.1. Probabilistic Reward Task and the Motivating Study

The probabilistic reward task (PRT; Pizzagalli et al., 2005) is a computer-based experiment that measures the subject's ability to modify behavior in response to rewards. On each trial of the task, the subject sees a cartoon face with a short or long mouth (two stimuli). The task is to indicate whether a short or long mouth was presented by pressing button 'C' or 'M' on the keyboard. Critically, correct responses to the short mouth are rewarded more frequently than correct responses to the long mouth (i.e., the short mouth is the rich stimulus), and the difference in length between the short and long mouths is minimal. Participants were given verbal instructions about the task and told that the goal of the task was to win as much money as possible. It is critical that the subject understands that not all correct responses will be rewarded. To maximize the reward, participants should press the correct button, regardless of which face is associated with the higher reward. However, because the difference in size between the short and long mouths is designed to be small, participants frequently have difficulty accurately perceiving the presented state, which results in a tendency to choose the response associated with the higher reward rather than the response that is accurate. Thus, the PRT experiment probes a subject's ability to learn from rich versus lean rewards.

Emerging evidence indicates the presence of reward processing abnormalities in MDD (Whitton et al., 2015). MDD patients are more likely to have lower reward learning ability than patients not diagnosed with MDD (Vrieze et al., 2013). There has been significant interest in learning how MDD affects patients’ decision-making in reward learning tasks. In our motivating study, the Establishing Moderators and Biosignatures of Antidepressant Response for Clinical Care (EMBARC) trial, each PRT session consisted of 200 trials, divided into two blocks of 100 trials, with blocks separated by a 30-second break. For each block, 40 correct trials were followed by reward feedback. Correct identification of the short mouth was associated with three times more positive feedback (30 out of 40) than correct identification of the long mouth (10 out of 40). However, subjects were not told about the different frequencies of rewards across the two response types. Figure 1(a) shows the PRT schematic diagram.

Fig. 1. Schematic of a PRT experiment and EMBARC study design.

The EMBARC study is designed to identify differences in reward learning abilities between patients with MDD and a control group (not diagnosed with MDD), and, among patients with MDD, between the SSRI antidepressant sertraline (SERT) and placebo (PBO) in a randomized trial. Figure 1(b) shows the experimental design diagram. In the PRT experiments, there were 40 subjects from the control group and 168 subjects from the MDD group; within the MDD group, 82 participants were randomly assigned to the SERT group, and the other 86 participants were assigned to the PBO group. Each subject had two PRT sessions, one at baseline before treatment (week 0) and one after one week of treatment (week 1). Subjects in the control group also completed two sessions, at week 0 and week 1. In week 1, five sessions were missing for the MDD group and one for the control group.

2. Methods

2.1. Data from a behavioral task

Consider $n$ subjects from a population. The data consist of a time series of each subject's decision-making in a behavioral task (e.g., the probabilistic reward task), i.e., $\{(S_{it}, A_{it}, R_{it})\}_{t=1,\ldots,T;\ i=1,\ldots,n}$, where $S_{it}$, $A_{it}$, and $R_{it}$ are the state, action, and reward at time $t$ for the $i$-th subject. In this paper, we assume that there are $m$ states in the state space (i.e., $S_{it} \in \{0, \ldots, m-1\}$) and two actions in the action space (i.e., $A_{it} \in \{0, 1\}$). Denote $\mathcal{H}_{it} = \{(S_{ij}, A_{ij}, R_{ij})\}_{j=1,\ldots,t-1}$ as the observed history for the $i$-th subject by time $t$. The reward $R_{it}$ is generated based on $S_{it}$, $A_{it}$, and $\mathcal{H}_{it}$ (in some cases, only on $S_{it}$ and $A_{it}$). $R_{it}$ is assumed to be bounded (either discrete or continuous), and without loss of generality, we assume $R_{it}$ lies between 0 and 1. The probabilistic reward task introduced in Section 1.1 is a special case with two possible states. We let $S_{it} = 0$ correspond to the state with lean rewards and $S_{it} = 1$ correspond to the state with rich rewards. The reward-generating distributions conditional on $S_{it} = A_{it} = 0$ and $S_{it} = A_{it} = 1$ (and history $\mathcal{H}_{it}$) are Bernoulli distributions with probabilities $0 < P_{00} < P_{11} < 1$. No reward is provided if the $i$-th patient does not correctly respond to the state at time $t$.

2.2. Semiparametric inverse RL model

RL is a computational tool used to model patients' responses to stimuli in behavioral tasks such as the PRT. A core concept in RL modeling of the PRT is the reward prediction error, the difference between the received reward and the expected reward. Mathematically, the reward prediction error is $R_{it} - Q_{it}(a,s)$, where $Q_{it}(a,s)$ is the expected reward, or value, of taking action $a$ at state $s$ for the $i$-th subject at the $t$-th trial. This is the foundation for building more complex models to explain patients' behavioral responses. Specifically, for MDD patients, a failure to adequately process reward is hypothesized to occur via two distinct mechanisms: a reduced sensitivity to received rewards or a reduced ability to learn (Huys et al., 2013). The former mechanism is assessed by a parameter referred to as reward sensitivity (denoted by $\rho_i$), and the latter is assessed by the learning rate (denoted by $\beta_i$). When accounting for reward sensitivity, a subject's reward prediction error is

$$\delta_{it} = \rho_i R_{it} - Q_{it}(a,s),$$

where the obtained reward $R_{it}$ is discounted by a factor of $\rho_i$. Each patient's value (denoted as $Q_{i,t+1}(a,s)$) naturally evolves over time depending on the past history of values and is updated based on a weighted reward prediction error as

$$Q_{i,t+1}(a,s) = Q_{it}(a,s) + \beta_i \delta_{it} I_{it}(a,s), \tag{1}$$

where $\beta_i \in (0,1)$ is a subject-specific learning rate that describes how fast the value of a state-action pair is updated, and $I_{it}(a,s) = I\{A_{it}=a, S_{it}=s\}$ is the indicator of the event $A_{it}=a$ and $S_{it}=s$. The larger $\rho_i$ is, the more a subject's action will depend on rewards. Equivalently, the expected reward at $t+1$ is a weighted sum of the obtained reward and the expected reward at $t$, i.e., $Q_{i,t+1}(a,s) = \{1-\beta_i I_{it}(a,s)\} Q_{it}(a,s) + \beta_i \rho_i R_{it} I_{it}(a,s)$. As the learning rate $\beta_i$ approaches one, learning is so fast that the $Q$ values are simply the last experienced outcome. As a note, (1) is also known as the Rescorla-Wagner equation (Rescorla, 1972) in classical conditioning. Figure 2 shows a graphical representation of the model, where the green paths show the RL component of updating $Q_{it}$ over trials.
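To make the update concrete, the following minimal R sketch applies equation (1) for one trial; the function name and matrix layout are our own illustration, not code from the paper.

```r
# Minimal sketch of the value update in (1), assuming Q is stored as an
# m x 2 matrix with rows = states (0, ..., m-1) and columns = actions (0, 1).
update_Q <- function(Q, s, a, r, beta, rho) {
  delta <- rho * r - Q[s + 1, a + 1]                 # reward prediction error, scaled by rho
  Q[s + 1, a + 1] <- Q[s + 1, a + 1] + beta * delta  # update only the visited (a, s) pair
  Q
}

# Example: two states, initial values alpha_00 = alpha_11 = 2, cross terms 0
Q <- matrix(0, 2, 2); Q[1, 1] <- Q[2, 2] <- 2
Q <- update_Q(Q, s = 1, a = 1, r = 1, beta = 0.1, rho = 2.7)
```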

Fig. 2. Diagram of the semiparametric inverse RL model over trials $t$, where $f$ is the reward sensitivity function, $\beta_i$ is the learning rate, and $\rho_i$ is related to the relative reward sensitivity.

At time $t=1$, we assume $Q_{i1}(a,s) = \alpha_{as}$. Define $\Delta_{it} = \Delta_{it}(a,s)$ as the total number of times that subject $i$ takes action $a$ at state $s$ before time $t$, and define $\tau_{ik} = \tau_{ik}(a,s)$ as the $k$-th time at which subject $i$ takes action $a$ at state $s$. According to (1), we can express $Q_{it}(a,s)$ as

$$Q_{it}(a,s) = (1-\beta_i)^{\Delta_{it}} \alpha_{as} + \beta_i \rho_i \sum_{k=1}^{\Delta_{it}} (1-\beta_i)^{\Delta_{it}-k} R_{i,\tau_{ik}}. \tag{2}$$

Clearly, $0 \le Q_{it}(a,s) \le \max(\alpha_{as}, \rho_i)$, and $Q_{it}(a,s)$ is also a function of $\beta_i$, $\rho_i$, $\alpha_{as}$, and the history $\mathcal{H}_{it}$. Furthermore, $Q_{it}(a,s,\beta_i, c\rho_i, c\alpha_{as}; \mathcal{H}_{it}) = c\, Q_{it}(a,s,\beta_i, \rho_i, \alpha_{as}; \mathcal{H}_{it})$. By trial $t$, participants have the updated reward value $Q_{it}(a,s)$ based on $(t-1)$ trials. However, because participants cannot be sure which stimulus will be presented at trial $t$, their action at trial $t$ is based on a "belief" about the expected reward, $W_{it}(a,s)$, defined as a mixture of $\{Q_{it}(a,s)\}_{s=0,\ldots,m-1}$:

$$W_{it}(a,s) = \omega_{ss} Q_{it}(a,s) + \sum_{r \ne s} \omega_{sr} Q_{it}(a,r), \tag{3}$$

where the parameters $\omega_{sr} \in [0,1]$ satisfy $\sum_{r=0}^{m-1} \omega_{sr} = 1$ and reflect the weights placed on the $Q$-values from states other than $s$.

It is worth noting that the formulations of $Q_{it}(a,s)$ and $W_{it}(a,s)$ are motivated by Huys et al. (2013). We generalize Huys et al. (2013) to allow a nonlinear reward sensitivity function and more than two states (i.e., $S_{it} \in \{0,\ldots,m-1\}$), and we use a random effects model to borrow information across subjects to estimate reward sensitivities and learning rates. Specifically, given the believed reward $W_{it}(a,s)$ and the actual stimulus $S_{it}$ at trial $t$, we define a contrast between the two actions as

$$Z_{it} = W_{it}(1, S_{it}) - W_{it}(0, S_{it}), \tag{4}$$

and we assume that the action $A_{it}$, conditional on the history $\mathcal{H}_{it}$ and $S_{it}$, follows the model

$$\operatorname{logit} P(A_{it} = 1 \mid S_{it}, \mathcal{H}_{it}) = f(Z_{it}), \tag{5}$$

where $f(\cdot)$ is an unknown non-decreasing function satisfying $f(0)=0$. In other words, the participant is more likely to choose the action whose expected reward is higher, but this probability can be a nonlinear function of the contrast. As a result, we refer to $f(\cdot)$ as a reward sensitivity function. The nonlinear modeling of the decision process may more accurately reveal when subjects are more sensitive to reward under a probabilistic reward task. Finally, to incorporate the heterogeneity among participants, we assume $v_i = \log\{\beta_i/(1-\beta_i)\}$ and $\gamma_i = \log \rho_i$, where $v_i$ and $\gamma_i$ are random effects generated from a bivariate normal distribution, i.e., $(v_i, \gamma_i) \overset{iid}{\sim} N(\mu, \Sigma)$ with $\mu = (\mu_v, \mu_\gamma)$. Note that since $Q_{it}$ is invariant if its initial value and $\rho_i$ are scaled by a constant factor $c$, $f(x)$ and $f(x/c)$ yield the same action model. Thus, to ensure identifiability, we fix $\mu_\gamma = 1$. With this normalization, $\gamma_i$ serves as an indicator of the relative sensitivity of the $i$-th subject, i.e., $(\gamma_i - \mu_\gamma)/\mu_\gamma$. We refer to $\gamma_i$ as the relative sensitivity in the rest of the paper. The decision-making model (5) is represented by the blue paths in Figure 2. The rewards are generated based on a participant's state, choice, and history, depending on the experimental paradigm (pink paths in Figure 2).
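The decision rule in (3)-(5) can be sketched in R as follows for the two-state case with $\omega_{00}=\omega_{11}=\omega$; `choose_action` and `f_true` are illustrative names introduced here, and `f_true` is the nonlinear function later used in the simulations.

```r
# Sketch of the belief mixture (3), contrast (4), and logistic action model (5)
# for m = 2 states; Q is the 2 x 2 value matrix from the update sketch above.
choose_action <- function(Q, s, omega, f) {
  W <- function(a) omega * Q[s + 1, a + 1] + (1 - omega) * Q[2 - s, a + 1]  # belief W(a, s)
  Z <- W(1) - W(0)              # contrast between the two actions, equation (4)
  p1 <- plogis(f(Z))            # P(A = 1 | S = s, history), equation (5)
  rbinom(1, 1, p1)              # sample the action
}

f_true <- function(x) sign(x) * log(2 * abs(x) + 1)  # a non-decreasing f with f(0) = 0
a <- choose_action(Q, s = 1, omega = 0.8, f = f_true)
```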

2.3. Implementation

Under the proposed semiparametric inverse RL/imitation learning model, since $R_{it}$ and $S_{it}$ given $\mathcal{H}_{it}$ are independent of the random effects, the log-likelihood function for subject $i$ is, up to a constant,

$$\ell_i(\mu, \Sigma, \alpha, \Omega, f) = \log \int \prod_{t=1}^{T} P_{it}(v_i, \gamma_i, \alpha, \Omega, f; \mathcal{H}_{it})\, \phi(v_i, \gamma_i \mid \mu, \Sigma)\, dv_i\, d\gamma_i, \tag{6}$$

where $\phi(\cdot, \cdot \mid \mu, \Sigma)$ denotes the density of $N(\mu, \Sigma)$,

$$\alpha = \{\alpha_{as}\}_{a=0,1;\ s=0,\ldots,m-1}, \qquad \Omega = \{\omega_{sr}\}_{s,r=0,\ldots,m-1}, \qquad P_{it}(v_i, \gamma_i, \alpha, \Omega, f; \mathcal{H}_{it}) = \frac{\exp\{A_{it} f(Z_{it})\}}{1 + \exp\{f(Z_{it})\}}.$$

We maximize

$$\ell(\mu, \Sigma, \alpha, \Omega, f) = \sum_{i=1}^{n} \ell_i(\mu, \Sigma, \alpha, \Omega, f) \tag{7}$$

to estimate all the parameters. To fit a non-decreasing function $f(x)$ for $x \in [L, U]$ ($L<0$, $U>0$), where $L$ and $U$ are two pre-specified lower and upper bounds for the contrast, we approximate $f$ using a family of monotone splines called I-splines (Ramsay, 1988). Denote $M = (M_1, \ldots, M_K)$ as a set of M-splines and $I = (I_1, \ldots, I_K)$ as the corresponding I-splines, where each $I_k(x)$ is the integral of $M_k(\cdot)$ from $L$ to $x$. Because $M_k(\cdot)$ is non-negative for all $k = 1, \ldots, K$, $I_k(\cdot)$ is guaranteed to be non-decreasing. More details on the representation of M-splines and I-splines can be found in Appendix A.1. We approximate $f(\cdot)$ by $\tilde{f}(x) = a + I(x) b$. Restricting the coefficients $b_k \ge 0$ for $k = 1, \ldots, K$ forces $\tilde{f}(\cdot)$ to be non-decreasing, and we set $a = -I(0) b$ to ensure $\tilde{f}(0) = 0$. Moreover, the first-order derivative of $\tilde{f}$ is $\tilde{f}'(x) = M(x) b$.
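As an illustration of this approximation, the sketch below builds $\tilde{f}(x) = I(x)b - I(0)b$ with non-negative coefficients, assuming the R package `splines2` (its `iSpline` basis); the knots, spline degree, and coefficient values are placeholders, and the mapping between `degree` and the paper's order $r$ is an assumption.

```r
# Sketch of the monotone I-spline approximation of f, assuming the splines2 package.
library(splines2)

int_knots <- c(-2.4, -1.2, -0.6, 0, 0.6, 1.2, 2.4)  # interior knots (placeholder values)
bknots    <- c(-6, 6)                               # boundary knots [L, U]

f_tilde <- function(x, b) {
  # I(x) %*% b - I(0) %*% b enforces f_tilde(0) = 0; b_k >= 0 makes f_tilde non-decreasing
  Ix <- iSpline(x, knots = int_knots, degree = 2, Boundary.knots = bknots, intercept = TRUE)
  I0 <- iSpline(0, knots = int_knots, degree = 2, Boundary.knots = bknots, intercept = TRUE)
  drop(Ix %*% b) - drop(I0 %*% b)
}

K <- ncol(iSpline(0, knots = int_knots, degree = 2, Boundary.knots = bknots, intercept = TRUE))
b <- rep(0.2, K)                          # non-negative coefficients => non-decreasing f_tilde
curve(f_tilde(x, b), from = -6, to = 6)   # visual check of monotonicity and f_tilde(0) = 0
```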

We evaluate the double integral using bivariate Gauss-Hermite quadrature (Jäckel, 2005). In the optimization step, because $\alpha$ and $\Omega$ are linearly constrained and the coefficients of the I-splines are non-negative, constrained optimization (Lange et al., 1999) is used to maximize (7). In our numerical studies, the L-BFGS-B algorithm implemented by the "optim" function in the R package "stats" is used for the case $m=2$ (i.e., two states); when $m>2$, an adaptive barrier algorithm implemented by the "constrOptim" function in the R package "stats" can be used instead. Algorithm 1 presents a detailed version of the semiparametric RL algorithm for optimizing the parameters and the nonparametric function. Note that evaluating the joint conditional log-likelihood function (7) is time-consuming; to accelerate the computation, we evaluate the log-likelihood function (6) for each subject in parallel and sum the parts to obtain (7). The maximum likelihood estimate is obtained when the L-BFGS-B algorithm converges.
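The marginal likelihood (6) can be approximated as in the following sketch, assuming the `statmod` package for one-dimensional Gauss-Hermite nodes; `per_subject_lik` is a hypothetical function returning $\prod_t P_{it}(v,\gamma,\alpha,\Omega,f;\mathcal{H}_{it})$ for one subject and is not part of any published code.

```r
# Sketch: 3 x 3 bivariate Gauss-Hermite approximation of the integral in (6).
library(statmod)

loglik_i <- function(data_i, mu, Sigma, per_subject_lik) {
  gh    <- gauss.quad(3, kind = "hermite")   # 1-D nodes/weights for weight exp(-x^2)
  Lchol <- t(chol(Sigma))                    # Sigma = Lchol %*% t(Lchol)
  val <- 0
  for (j in 1:3) for (k in 1:3) {
    z <- mu + sqrt(2) * drop(Lchol %*% c(gh$nodes[j], gh$nodes[k]))  # map nodes to N(mu, Sigma)
    w <- gh$weights[j] * gh$weights[k] / pi                          # bivariate product weight
    val <- val + w * per_subject_lik(data_i, v = z[1], gamma = z[2])
  }
  log(val)
}
```

Summing `loglik_i` over subjects (possibly in parallel) gives (7), which can then be passed to `optim(..., method = "L-BFGS-B")` with box constraints keeping the I-spline coefficients non-negative.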

Furthermore, denote $\hat{\Lambda} = (\hat{\mu}, \hat{\Sigma}, \hat{\alpha}, \hat{\Omega}, \hat{f})$ as the maximizer of (7). The subject-specific learning rate and relative sensitivity can be estimated by plugging $\hat{\Lambda}$ into the posterior mean of $(v_i, \gamma_i)$:

$$E\left[(v_i, \gamma_i) \mid \hat{\Lambda}; \mathcal{H}_{i,T+1}\right] = \int (v_i, \gamma_i) \prod_{t=1}^{T} P_{it}(v_i, \gamma_i, \hat{\alpha}, \hat{\Omega}, \hat{f}; \mathcal{H}_{it})\, \phi(v_i, \gamma_i \mid \hat{\mu}, \hat{\Sigma})\, dv_i\, d\gamma_i \times \exp\{-\ell_i(\hat{\mu}, \hat{\Sigma}, \hat{\alpha}, \hat{\Omega}, \hat{f})\}. \tag{8}$$
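A plug-in evaluation of (8) can reuse the same quadrature rule, as in this sketch (again with the hypothetical `per_subject_lik` and the `statmod` nodes).

```r
# Sketch: posterior mean of (v_i, gamma_i) in (8) by 3 x 3 Gauss-Hermite quadrature,
# evaluated at the plugged-in estimates mu_hat and Sigma_hat.
posterior_vg_i <- function(data_i, mu_hat, Sigma_hat, per_subject_lik) {
  gh    <- statmod::gauss.quad(3, kind = "hermite")
  Lchol <- t(chol(Sigma_hat))
  num <- c(0, 0); den <- 0
  for (j in 1:3) for (k in 1:3) {
    z   <- mu_hat + sqrt(2) * drop(Lchol %*% c(gh$nodes[j], gh$nodes[k]))
    w   <- gh$weights[j] * gh$weights[k] / pi
    lik <- per_subject_lik(data_i, v = z[1], gamma = z[2])
    num <- num + w * lik * z      # numerator of the posterior mean
    den <- den + w * lik          # denominator equals exp(loglik_i), as in (8)
  }
  num / den                       # estimated (v_i, gamma_i)
}
```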

In the Appendix, we study the theoretical properties of the MLEs of the parameters and the monotone function $f(\cdot)$ and show consistency and asymptotic normality. Furthermore, a nonparametric bootstrap is used for inference. Because asymptotic normality holds, as we show in Appendix A.2, we construct 95% bootstrap confidence intervals under the normal approximation (i.e., MLE $\pm\, 1.96 \times$ bootstrap SE) in the simulation studies and the data application. Because the distributions of $\hat{\sigma}_{v,v}^2$ and $\hat{\sigma}_{\gamma,\gamma}^2$ are highly right-skewed, we apply a logarithmic transform before using the normal approximation to construct their confidence intervals.
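The interval construction described here amounts to the following sketch; `boot_ci` is an illustrative helper, with the log transform applied to the skewed variance components.

```r
# Sketch of the normal-approximation bootstrap interval (MLE +/- 1.96 * bootstrap SE).
boot_ci <- function(est, boot_est, level = 0.95, log_scale = FALSE) {
  z <- qnorm(1 - (1 - level) / 2)
  if (log_scale) {                       # for right-skewed variance components
    se <- sd(log(boot_est))
    exp(log(est) + c(-1, 1) * z * se)    # back-transform to the original scale
  } else {
    se <- sd(boot_est)
    est + c(-1, 1) * z * se
  }
}
```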


3. Simulation Studies

We conducted extensive simulation studies to assess the finite-sample performance of the proposed method. To mimic the real study, denote the distribution for generating the reward $R_{it}$ conditional on $A_{it}=a$ and $S_{it}=s$ as $p_{as}(R_{it})$. We considered $m=2$ (two states). We further let $\alpha_{00} = \alpha_{11} := \alpha$ and $\alpha_{01} = \alpha_{10} := 0$, and let $\Omega$ satisfy $\omega_{00} = \omega_{11} := \omega$. As a result, (4) becomes

$$Z_{it} = \begin{cases} \omega Q_{it}(1,1) - (1-\omega) Q_{it}(0,0) & \text{if } S_{it} = 1, \\ -\omega Q_{it}(0,0) + (1-\omega) Q_{it}(1,1) & \text{if } S_{it} = 0. \end{cases} \tag{9}$$

Furthermore, we considered two reward-generating distributions: (Case I, Bernoulli distribution) $p_{11} = \mathrm{Bernoulli}(0.75)$ (rich reward) and $p_{00} = \mathrm{Bernoulli}(0.3)$ (lean reward); (Case II, Beta distribution) $p_{11} = \mathrm{Beta}(3,1)$ (rich reward) and $p_{00} = \mathrm{Beta}(1,3)$ (lean reward). The rewards at different time points were assigned independently. The states were generated by flipping a fair coin. For the $i$-th subject, the transformed learning rate $v_i$ and reward sensitivity $\gamma_i$ were generated from a bivariate normal distribution with $\mu_v = -2.5$, $\mu_\gamma = 1$, $\Sigma_{v,v} = 0.25$, $\Sigma_{\gamma,\gamma} = 0.16$, and $\Sigma_{v,\gamma} = -0.1$. We set $\omega = 0.8$ and $\alpha = 2$. The nonparametric function was $f(x) = \mathrm{sgn}(x) \log(2|x|+1)$. I-splines of order $r = 2$ with internal knots $\{0, \pm 0.6, \pm 1.2, \pm 2.4\}$ and boundary knots $\pm 6$ were used in all scenarios.
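For concreteness, the following sketch generates one subject's trajectory under Case I, reusing the `update_Q` and `choose_action` sketches from Section 2.2; the parameter values mirror the setup above, and the function name is our own illustration.

```r
# Illustrative data-generating sketch for Case I (Bernoulli rewards, two states).
simulate_subject <- function(n_trials, mu, Sigma, omega = 0.8, alpha = 2,
                             f = function(x) sign(x) * log(2 * abs(x) + 1)) {
  vg   <- MASS::mvrnorm(1, mu, Sigma)
  beta <- plogis(vg[1]); rho <- exp(vg[2])              # v = logit(beta), gamma = log(rho)
  Q    <- matrix(0, 2, 2); Q[1, 1] <- Q[2, 2] <- alpha  # alpha_00 = alpha_11 = alpha, cross terms 0
  out  <- data.frame(S = integer(n_trials), A = integer(n_trials), R = numeric(n_trials))
  for (t in 1:n_trials) {
    s <- rbinom(1, 1, 0.5)                              # states from a fair coin
    a <- choose_action(Q, s, omega, f)                  # decision sketch from Section 2.2
    r <- if (a == s) rbinom(1, 1, if (s == 1) 0.75 else 0.3) else 0  # Case I reward rule
    Q <- update_Q(Q, s, a, r, beta, rho)                # value update sketch from Section 2.2
    out[t, ] <- c(s, a, r)
  }
  out
}

mu <- c(-2.5, 1); Sigma <- matrix(c(0.25, -0.1, -0.1, 0.16), 2, 2)
dat <- simulate_subject(100, mu, Sigma)
```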

We generated 200 replicates with sample sizes $n \in \{100, 250\}$ and numbers of trials $T \in \{100, 500\}$. We maximized the proposed joint log-likelihood function (7) using a 3 × 3 bivariate Gauss-Hermite quadrature and used 50 bootstrap samples for inference. In the Supplementary Materials Section S.1.2, we show that a 3 × 3 Gauss-Hermite quadrature and 50 bootstrap samples are sufficient to provide accurate point estimates and confidence intervals. According to Figure S1 in the Supplementary Materials Section S.1.3, most $Z_{it}$ ($i=1,\ldots,n$; $t=1,\ldots,T$) lie in (−1.5, 2.5) for $T=100$; hence we evaluate the estimation performance of $f(\cdot)$ at $\{\pm 0.5, \pm 1, 1.5, 2\}$.

The simulation results for the parameter estimates from 200 replicates are given in Table 1, and the results for the nonparametric function are given in Table 2. For both tables, the relative bias (RB), standard deviation (SD), average bootstrap standard error (SE), and coverage probability of the 95% bootstrap confidence intervals (CP) are reported. Furthermore, we compare our method (Semiparametric RL) with the linear model $f(x)=cx$ (Linear RL) in Tables 1 and 2. Table 1 shows that the estimates of the group mean $\mu_v$ for both the Semiparametric RL and Linear RL models have small relative biases, suggesting that we can estimate the group learning rate with high precision even when the reward sensitivity function $f(\cdot)$ is incorrectly specified. However, the estimates of the covariance matrix $\Sigma$ have much larger relative biases when $f(\cdot)$ is incorrectly specified. Meanwhile, for the Semiparametric RL model, the relative bias of all parameters decreases as $T$ and $n$ increase; in contrast, for the Linear RL model, the relative bias of all parameters except the group learning rate changes little as $T$ and $n$ increase. Table 2 shows that the Semiparametric RL estimates of the nonparametric function have much smaller relative bias and larger standard deviation than the Linear RL estimates. Note that the coverage probabilities of the Semiparametric RL estimates are close to the nominal level (95%), whereas the coverage probabilities of the Linear RL estimates would be much lower than the nominal level. Hence, we conclude that the Linear RL model cannot provide reliable statistical inference on $f(\cdot)$ when the underlying true reward sensitivity function is nonlinear. In the Supplementary Materials Section S.1.1, we also investigate the case in which the underlying true reward sensitivity function is linear; the two methods provide estimates with similar relative bias, and our Semiparametric RL has a somewhat larger standard deviation than Linear RL.

Table 1.

Summary of the parameter estimates in 200 simulations with Bernoulli (Case I) and Beta (Case II) reward distribution.

Case I Case II
Semiparametric Linear Semiparametric Linear
T n Parameter RB SD SE CP RB SD RB SD SE CP RB SD
100 100 μv 0.014 0.316 0.346 97 0.103 0.301 0.055 0.501 0.516 97 0.139 0.385
σv,v2 −0.119 0.208 0.301 98 −0.577 0.235 −0.048 0.205 0.390 98 −0.245 0.225
σγ,γ2 −0.154 0.132 0.132 98 0.533 0.037 −0.100 0.161 0.150 97 0.581 0.037
σv,γ2 0.163 0.119 0.135 98 −0.251 0.070 0.046 0.137 0.149 97 −0.259 0.069
α −0.053 0.454 0.445 95 −0.055 0.233 −0.077 0.570 0.529 97 −0.085 0.214
ω −0.011 0.052 0.057 96 −0.062 0.061 −0.017 0.077 0.077 97 −0.036 0.080
100 250 μv 0.002 0.169 0.202 97 0.091 0.162 0.014 0.273 0.324 95 0.128 0.240
σv,v2 −0.009 0.111 0.126 99 −0.509 0.144 0.077 0.129 0.158 99 −0.166 0.132
σγ,γ2 −0.070 0.071 0.077 96 0.577 0.019 −0.052 0.106 0.086 91 0.561 0.025
σv,γ2 0.014 0.061 0.072 98 −0.333 0.040 −0.115 0.075 0.077 97 −0.267 0.041
α −0.009 0.236 0.279 98 −0.044 0.133 0.004 0.269 0.316 96 −0.061 0.143
ω −0.009 0.035 0.036 97 −0.064 0.036 −0.022 0.053 0.055 98 −0.044 0.052
500 100 μv 0.001 0.091 0.102 97 −0.002 0.109 −0.003 0.114 0.128 96 0.042 0.114
σv,v2 0.100 0.070 0.076 98 −0.283 0.133 0.139 0.101 0.115 99 −0.168 0.153
σγ,γ2 024 0.045 0.050 95 0.685 0.009 0.096 0.055 0.058 94 0.689 0.010
σv,γ2 0.059 0.041 0.050 95 −0.382 0.034 −0.028 0.049 0.053 96 −0.343 0.035
α 0.002 0.220 0.232 95 −0.075 0.135 −0.024 0.243 0.281 98 −0.092 0.133
ω 0.000 0.007 0.007 95 −0.081 0.011 −0.004 0.014 0.016 99 −0.067 0.010
500 250 μv −0.001 0.062 0.063 95 −0.003 0.070 −0.003 0.070 0.079 95 0.044 0.080
σv,v2 0.133 0.042 0.045 93 −0.268 0.076 0.113 0.069 0.070 96 −0.238 0.110
σγ,γ2 0.088 0.021 0.027 91 0.682 0.005 0.109 0.034 0.033 90 0.694 0.006
σv,γ2 −0.023 0.019 0.025 97 −0.336 0.016 −0.0 0.028 0.030 94 −0.338 0.026
α 0.000 0.131 0.136 94 −0.075 0.079 −0.003 0.144 0.176 96 −0.090 0.078
ω −0.001 0.004 0.004 97 −0.081 0.006 −0.002 0.005 0.007 97 −0.066 0.007

(RB): relative bias; (SD): standard deviation; (SE): average bootstrap standard error; (CP): coverage probability of the 95% bootstrap confidence intervals; (Semiparametric): our proposed RL method; (Linear): RL model with f(x)=cx.

Table 2.

Summary of the estimated nonparametric reward sensitivity function in 200 simulations with Bernoulli (Case I) and Beta (Case II) reward distribution.

Case I Case II
Semiparametric Linear Semiparametric Linear
T n f(x) RB SD SE CP RB SD RB SD SE CP RB SD
100 100 f(-1.0) −0.021 0.188 0.221 98 −0.191 0.153 −0.024 0.312 0.308 99 −0.148 0.213
f(-0.5) −0.028 0.171 0.177 97 −0.341 0.077 −0.094 0.236 0.229 98 −0.308 0.107
f(0.5) −0.001 0.188 0.185 97 0.341 0.077 0.056 0.258 0.246 99 0.308 0.107
f(1.0) 0.006 0.167 0.181 96 0.191 0.153 0.022 0.260 0.282 99 0.148 0.213
f(1.5) 0.010 0.179 0.194 98 0.047 0.230 0.022 0.208 0.306 97 0.005 0.320
f(2.0) −0.025 0.277 0.284 97 −0.090 0.307 −0.030 0.293 0.427 98 −0.149 0.426
100 250 f(-1.0) −0.013 0.110 0.131 99 −0.206 0.078 −0.016 0.179 0.190 97 −0.177 0.120
f(-0.5) −0.021 0.124 0.122 95 −0.353 0.039 −0.075 0.178 0.169 96 −0.331 0.060
f(0.5) 0.002 0.136 0.136 93 0.353 0.039 0.067 0.197 0.179 95 0.331 0.060
f(1.0) −0.001 0.100 0.111 94 0.206 0.078 0.033 0.173 0.182 99 0.177 0.120
f(1.5) 0.008 0.088 0.089 97 0.065 0.117 0.020 0.125 0.143 96 0.031 0.181
f(2.0) 0.000 0.126 0.136 98 −0.068 0.156 −0.014 0.195 0.199 98 −0.109 0.241
500 100 f(-1.0) 0.007 0.083 0.082 98 −0.235 0.030 −0.013 0.112 0.123 96 −0.228 0.027
f(-0.5) 0.018 0.040 0.047 97 −0.375 0.015 −0.008 0.066 0.082 99 −0.370 0.013
f(0.5) −0.023 0.047 0.054 98 0.375 0.015 0.002 0.080 0.097 99 0.370 0.013
f(1.0) −0.001 0.057 0.060 96 0.235 0.030 0.026 0.084 0.100 95 0.228 0.027
f(1.5) −0.006 0.048 0.051 96 0.100 0.044 0.001 0.055 0.060 98 0.091 0.040
f(2.0) −0.006 0.060 0.066 97 −0.028 0.059 −0.010 0.068 0.084 96 −0.039 0.053
500 250 f(-1.0) −0.002 0.046 0.049 97 −0.237 0.018 −0.003 0.064 0.071 95 −0.226 0.018
f(-0.5) 0.011 0.026 0.028 94 −0.377 0.009 0.008 0.034 0.040 95 −0.368 0.009
f(0.5) −0.010 0.029 0.032 94 0.377 0.009 −0.003 0.040 0.048 97 0.368 0.009
f(1.0) 0.009 0.031 0.036 96 0.237 0.018 0.017 0.044 0.050 94 0.226 0.018
f(1.5) −0.004 0.028 0.031 94 0.102 0.027 −0.003 0.028 0.033 97 0.089 0.027
f(2.0) −0.007 0.036 0.038 95 −0.026 0.037 −0.011 0.041 0.047 95 −0.041 0.036

(RB): relative bias; (SD): standard deviation; (SE): average bootstrap standard error; (CP): coverage probability of the 95% bootstrap confidence intervals; (Semiparametric): our proposed RL method; (Linear): RL model with f(x)=cx.

Comparing the Semiparametric RL estimation results among the four data-size scenarios for both reward-generating mechanisms, we found that the estimation bias and standard deviation decrease significantly when the number of trials increases from 100 to 500. The estimation bias and standard deviation also decrease when the number of subjects increases from 100 to 250; this reduction is relatively large for $T=100$ compared with $T=500$. The remaining bias comes from approximating (6) with the 3 × 3 bivariate Gauss-Hermite quadrature and from using a numerical optimizer to maximize (7). Comparing the reward-generating distributions in Cases I and II, we found that the distribution of the reward only slightly affects the estimation performance: Case I has a smaller relative bias than Case II. In the Supplementary Materials Section S.1.5, we perform an additional simulation study involving three states (i.e., $m=3$); the results indicate that the Semiparametric RL model with three states also produces accurate estimates.

4. Application to EMBARC Study

We now apply the proposed methodology to our motivating PRT data. A preliminary analysis suggested that the learning pattern might change from the first block to the second block. To avoid bias, we focus only on the PRT data from the first block (the first 100 trials) in this paper.

First, we compare reward learning abilities between patients with MDD and the healthy control group. We fitted our proposed semiparametric reward learning model separately for the 'MDD' group, which contains subjects diagnosed with MDD at baseline (pre-treatment), and the 'Control' group, which contains the healthy subjects with two repeated measurements (no treatment at either time). We examined numbers of interior knots from 3 to 10, with boundary knots {−4, 6} and order $r=2$ for the I-splines. By the AIC criterion, we selected the model with 6 interior knots. The parameters of interest are the transformed group learning rate $\mu_v$ and the group reward sensitivity function $f(\cdot)$. Table 3 shows the estimation results for $\mu_v$ and the elements of $\Sigma$. For inference, a nonparametric bootstrap was applied to generate 200 resampling sets for each group; bootstrap standard errors and 95% bootstrap confidence intervals are shown in Table 3. The results show that the learning rates for the two groups have similar values. We also constructed the 95% confidence interval for the contrast of $\mu_v$ between the MDD and control groups; because this interval, (−0.94, 1.93), covers zero, we conclude that the difference in learning rate between the MDD and control groups is not significant. Another interesting finding is that the learning rate is negatively correlated with reward sensitivity across subjects.
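The knot selection described above can be organized as a simple grid search; in this sketch, `fit_semipar_rl` is a placeholder for the estimation routine of Section 2.3 and is not an existing function.

```r
# Hypothetical sketch: refit the model over a grid of interior-knot counts and keep
# the fit with the smallest AIC, as described for the EMBARC analysis.
select_by_aic <- function(data, n_knots_grid = 3:10, boundary = c(-4, 6), order = 2) {
  fits <- lapply(n_knots_grid, function(K)
    fit_semipar_rl(data, n_knots = K, boundary = boundary, order = order))
  aics <- sapply(fits, function(fit) 2 * fit$n_par - 2 * fit$loglik)  # AIC = 2k - 2 log L
  fits[[which.min(aics)]]
}
```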

Table 3.

Estimation of the parameters in EMBARC study under the proposed method for MDD and Control group.

MDD Control
Parameters Est. Err. 95% CI Est. Err. 95% CI
μv −2.57 0.46 (−3.47, −1.67) −3.07 0.67 (−4.38, −1.75)
σv,v2 10.68 3.75 (4.41, 25.86) 9.08 4.27 (3.06, 26.95)
σγ,γ2 2.66 0.96 (1.50, 4.74) 3.15 2.30 (0.92, 10.86)
σv,γ2 −1.87 0.86 (−3.55, −0.19) −2.92 0.99 (−4.85, −0.98)

Figure 3 presents the nonparametric estimates of the group reward sensitivity function $f(\cdot)$. Figure 3(a) compares the nonparametric estimates of $f(\cdot)$ between the MDD and Control groups using the proposed method. The fitted reward sensitivity functions for both groups are clearly nonlinear. The functions flatten when $x > 2$ or $x < -2$, suggesting that subjects' probability of taking correct actions at given states does not rise indefinitely even if they receive enough rewards. Meanwhile, the reward sensitivity functions for the two groups have similar shapes when $x < 1.5$; $f(x)$ for the control group increases to a higher level than $f(x)$ for the MDD group when $x > 2$. This suggests that the control group has a larger probability of taking correct actions at rich reward states than the MDD group when subjects in both groups receive adequate rewards in rich reward states. Figure 3(b) presents the 95% pointwise confidence band (PCB) and the 95% simultaneous confidence band (SCB) for the contrast of $f(\cdot)$ between the two groups. We evaluate $f(x)$ for $-3 < x < 4$ since few $Z_{it}$ values fall outside this interval. The construction of the PCB uses the bootstrap and a normal approximation, and the construction of the SCB uses a bootstrap method that mimics Hall and Horowitz (2013). For further details regarding the SCB and its coverage rates in simulation studies, see Supplementary Materials Section S.1.4. Because the 95% SCB covers zero over the entire range of $x$, we lack statistical evidence to conclude a significant difference in the reward sensitivity function between the Control and MDD groups. However, the 95% PCB suggests that the Control group may exhibit higher reward sensitivity than the MDD group, particularly when the expected difference in reward between the two stimuli is large (e.g., $x > 2$). As an exploratory analysis, we constructed a 95% SCB for the contrast of $f(\cdot)$ between the two groups restricted to $x > 2$ and observe that the entire band falls below zero; the result is shown in Figure S3 in the Supplementary Materials. This finding provides statistical evidence that the Control group exhibits greater reward sensitivity than the MDD group when the expected difference in reward (rich reward minus lean reward) between the two stimuli exceeds 2.

Fig. 3. (a) The estimated reward sensitivity functions for the MDD and Control groups; (b) the estimate (red curve), 95% pointwise confidence band (shaded area in gray), and 95% simultaneous confidence band (two dash-dotted lines in blue) for the contrast of $f(\cdot)$ between the MDD and Control groups.

Moreover, we applied the proposed model with a linear reward sensitivity function represented by f(x)=cx to both the MDD and Control groups. Figure S4 in the Supplementary Materials displays the estimated linear reward sensitivity functions. Compared to Figure 3 obtained from the semiparametric model, the linear model yields less precise reward sensitivity patterns and a higher AIC. Furthermore, the linear model fails to capture the substantial difference in the reward sensitivity function between the two groups beyond x>2.

The comparison of MDD versus Control examines whether reward sensitivity is a characteristic of MDD that differs between patients and controls. Our results suggest that reward sensitivity obtained from the probabilistic reward task may be considered a behavioral marker or a phenotype of depression. This analysis does not establish a causal relationship between depression and reduced reward sensitivity due to confounding.

Next, we compared reward learning abilities for MDD patients between the sertraline (an antidepressant; SERT) and placebo (PBO) groups. We fitted our proposed semiparametric reward learning models for subjects in the PBO and SERT groups at week 0 and week 1, respectively, using the same knots and order for the I-splines as in the analysis above. Table 4 shows the estimated group learning rate and variance components in the SERT and placebo groups pre- and post-treatment. To investigate whether there is a significant difference in the group learning rate $\mu_v$ between MDD patients who received SERT versus PBO, we constructed a 95% bootstrap confidence interval for the between-arm difference in the change of $\mu_v$ from week 0 to week 1. The resulting interval, (−1.01, 3.10), indicates that the one-week changes in learning rate are not significantly different between the PBO and SERT groups. A recent meta-analysis of the PRT reached the same conclusion of a non-significant group difference in learning rate (Pike and Robinson, 2022).

Table 4.

Estimation of the parameters in EMBARC study for treatment (SERT) and placebo (PBO) group at pre- and post-treatment.

SERT (pre-treatment) SERT (post-treatment)
Parameters Est. Err. 95% CI Est. Err. 95% CI
μv −1.84 0.77 (−3.35, −0.33) −3.17 0.41 (−3.96, −2.37)
σv,v2 6.45 4.03 (1.60, 25.96) 8.90 4.11 (3.15, 25.12)
σγ,γ2 2.92 1.40 (1.12, 7.65) 4.44 1.20 (1.99, 9.89)
σv,γ2 −2.14 1.31 (−4.70, 0.43) −3.62 1.61 (−6.77, −0.46)
PBO (pre-treatment) PBO (post-treatment)
Parameters Est. Err. 95% CI Est. Err. 95% CI
μv −3.03 0.42 (−3.85, −2.22) −3.32 0.53 (−4.35, −2.28)
σv,v2 13.30 3.75 (6.32, 27.97) 5.24 2.22 (2.11, 13.04)
σγ,γ2 4.16 1.50 (1.99, 8.70) 3.09 1.42 (1.42, 6.75)
σv,γ2 −2.46 0.99 (−4.41, −0.52) −1.19 0.63 (−2.43, 0.06)

Figures 4(a) and 4(b) show the estimated reward sensitivity functions $f(\cdot)$ for the SERT and PBO groups pre- and post-treatment. We find that, for the placebo group, the pre-treatment (week 0) and post-treatment (week 1) reward sensitivity functions are similar across the entire support, whereas for the SERT group, $f(x)$ increases to a higher level after treatment when $x > 1.5$. This suggests that the SERT treatment increases MDD patients' probability of taking correct actions at rich reward states when adequate rewards are received. We also constructed the 95% PCB and SCB of the difference between the two groups in the change of $f(\cdot)$ from pre- to post-treatment. Figure 4(c) does not show sufficient statistical evidence that the pre-post change in the reward sensitivity function differs between the two treatment groups. However, the similar pattern in the comparisons of reward sensitivity between Figure 3(b) and Figure 4(c) suggests that sertraline may have a positive impact on MDD patients, potentially bringing their reward learning sensitivity closer to that of healthy individuals at the rich state, which is worth future investigation. Regarding the timing of the post-treatment measurement, the antidepressant is expected to take full effect in about four weeks; the rationale for measuring PRT reward sensitivity one week post-treatment is to detect early responses. With a longer period of treatment, the difference may be greater.

Fig. 4. (a) The estimated reward sensitivity function for the SERT group pre- and post-treatment; (b) the estimated reward sensitivity function for the PBO group; (c) the estimate (red curve), 95% PCB (shaded area in gray), and 95% SCB (two dash-dotted lines in blue) of the difference between the PBO and SERT groups in the pre-post change of the reward sensitivity function $f(\cdot)$ (difference in differences).

To investigate whether dysfunctions in the human brain circuitry are associated with decision-making and learning ability, we examined correlations between subjects’ learning rate, relative sensitivity, and task functional magnetic resonance imaging (fMRI) measures of brain activation in an emotional conflict task assessing amygdala-anterior cingulate (ACC) circuitry (Etkin et al., 2006). We find that both reward sensitivity and learning rate are associated with brain activities in negative affect circuitry evoked by sad stimuli. Detailed analysis can be found in Supplementary Materials Section S.2.1.

5. Discussion

In this paper, we propose a semiparametric inverse reinforcement learning framework to characterize reward-based decision-making, with an application to the probabilistic reward task in the EMBARC study. We assume that a patient's decision-making process relies on two subject-specific factors, the learning rate and reward sensitivity, and on a non-decreasing function of nonparametric form shared across patients. Extensive simulation studies showed that our proposed method performs satisfactorily for large sample sizes under different reward-generating distributions. They also show the advantage of the semiparametric structure when the true underlying function lies beyond a restrictive parametric form such as a sigmoid function. In the application, we find that MDD patients and controls, as well as patients in the SERT and PBO groups, have similar learning rates; however, the groups have different reward sensitivity functions. The results in this paper are consistent with the findings in Huys et al. (2013), but the proposed model provides a more detailed description of how reward sensitivity differs. We also find that the behavioral phenotypes, including the learning rate and reward sensitivity, are associated with human brain activity in the negative affect circuitry.

Estimation is carried out by maximizing the joint conditional log-likelihood over all patients and all trials, where the non-decreasing function and its first-order derivative are characterized by I-splines and M-splines. Because of the close relationship between M-splines and B-splines, M-splines and I-splines share the same good approximation power as B-splines. We studied asymptotic consistency and asymptotic distributions for the parameters in the proposed model. Note that in the theoretical setting we assume that $n$ (the number of patients) goes to infinity while $T$ (the number of trials) is fixed; it is much more challenging to allow both $n$ and $T$ to go to infinity, and we will study this setting in future work. The proposed model can also be regarded as a time-varying model in which the reward sensitivity is allowed to change over time. One future direction is to allow the learning rate to evolve gradually as well.

Our proposed work aligns with imitation learning or behavioral cloning in the RL literature. A related line of research is inverse RL (Ng et al., 2000; Abbeel and Ng, 2004; Arora and Doshi, 2021; Luckett et al., 2021), which assumes an agent’s behavior follows an optimal policy and seeks to learn the corresponding reward function from the agent’s observed behavior and decision-making process. In contrast, the problem we study here aims to recover the agent’s reward sensitivity function and learning rate without assuming the agent’s behavior follows an optimal policy.

Our current work does not accommodate heterogeneity in $Q_{it}$ beyond the learning rate and reward sensitivity. An extension is to model additional components within a mixed effects model framework, at the cost of more complex modeling and additional assumptions. Similarly, covariates can be introduced to model systematic heterogeneity in learning rates, reward sensitivity, and other parameters. Some recent work shows that the perceptual decision-making process can alternate between multiple interleaved strategies. Ashwood et al. (2022) provide an example in which the decision process is a mixture model of "engaged" and "lapse" trials: when the decision-making strategy is "engaged", a subject chooses according to the RL model; when the strategy is "lapse", the subject ignores the stimulus and chooses based on a fixed probability. The strategy switching is characterized by a hidden Markov model. An extension to a semiparametric inverse reinforcement learning framework that allows strategy switching is of interest. Another direction worth pursuing is a broader reinforcement learning context where states are not assigned randomly, so that state distributions depend on past actions. Lastly, the current model is not restricted to characterizing decision-making in MDD patients. Our flexible framework can be readily applied to model other mental disorders, such as schizophrenia, and translates neuroscience and behavioral science constructs to clinical applications (Huys et al., 2016).


Acknowledgements

This research was supported by the National Institutes of Health funding MH123487, NS073671, and GM124104. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors report there are no competing interests to declare. Data and/or research tools used in the preparation of this manuscript were obtained from the National Institute of Mental Health (NIMH) Data Archive (NDA). NDA is a collaborative informatics system created by the National Institutes of Health to provide a national resource to support and accelerate research in mental health. Dataset identifier(s): 2199. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or of the Submitters submitting original data to NDA.

Appendix

A.1. Representation of M-splines and I-splines

According to Ramsay (1988), for a set of M-splines on $[L,U]$ with $K$ basis functions and order $r$, denote the $K$ basis functions as $M_k(\cdot) := M_k(\cdot \mid r)$, $k=1,\ldots,K$. Denote the knot sequence of the M-splines as $\tau_1, \tau_2, \ldots, \tau_{K+r}$, where the boundary knots satisfy $\tau_1 = \cdots = \tau_r = L$ and $\tau_{K+1} = \cdots = \tau_{K+r} = U$, and the interior knots satisfy $L < \tau_{r+1} < \cdots < \tau_K < U$. (Please refer to Ramsay (1988) for a more general case that allows ties among the interior knots.) The M-splines can be constructed by the recursion

$$M_k(x \mid 1) = \frac{1}{\tau_{k+1} - \tau_k}\, I(\tau_k \le x < \tau_{k+1}), \qquad M_k(x \mid r) = \frac{r\left[(x-\tau_k)\, M_k(x \mid r-1) + (\tau_{k+r}-x)\, M_{k+1}(x \mid r-1)\right]}{(r-1)(\tau_{k+r}-\tau_k)}.$$

It can be shown that $M_k(x \mid r) > 0$ only when $\tau_k \le x < \tau_{k+r}$. Also, $\int M_k(x \mid r)\, dx = 1$ for all $k$. Define a set of I-splines as the integrals of the corresponding set of M-splines, i.e.,

$$I_k(x \mid r) = \int_L^x M_k(u \mid r)\, du, \qquad k = 1, \ldots, K;$$

$I_k(x \mid r)$ is guaranteed to be non-decreasing. A function represented by I-splines of order $r$ is $r-1$ times continuously differentiable, and its $r$-th derivative is bounded on $[L,U]$; a function represented by M-splines of order $r$ is $r-2$ times continuously differentiable, and its $(r-1)$-th derivative is bounded on $[L,U]$. Finally, a function represented by M-splines can also be represented by B-splines; more specifically, a set of B-splines can be expressed in terms of a set of M-splines with the same knots and order, i.e., $B_k(x \mid r) = (\tau_{k+r} - \tau_k) M_k(x \mid r)/r$, hence M-splines and I-splines maintain the same good approximation power as B-splines (Schumaker, 2007). In practice, AIC can be used as the model selection criterion: a grid search over spline orders and interior knots can be conducted to select the I-splines with the smallest AIC value.
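As a check on the recursion, the following R sketch evaluates a single M-spline basis function directly from the definition above; it is written for clarity rather than speed, and terms with zero denominators (arising from repeated boundary knots) are dropped by convention.

```r
# Sketch: evaluate M_k(x | r) by the recursion above, for a knot sequence tau of length K + r.
msp <- function(x, k, r, tau) {
  if (r == 1) {
    if (tau[k + 1] > tau[k] && x >= tau[k] && x < tau[k + 1]) 1 / (tau[k + 1] - tau[k]) else 0
  } else {
    denom <- (r - 1) * (tau[k + r] - tau[k])
    if (denom == 0) return(0)                      # convention for repeated knots
    r * ((x - tau[k]) * msp(x, k, r - 1, tau) +
         (tau[k + r] - x) * msp(x, k + 1, r - 1, tau)) / denom
  }
}

# I_k(x | r) is the integral of M_k from L = tau[1] to x, e.g. by numerical integration:
isp <- function(x, k, r, tau) integrate(Vectorize(function(u) msp(u, k, r, tau)), tau[1], x)$value
```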

A.2. Asymptotic Theory

To simplify the notation, denote $\theta = (\alpha, \Omega, \mu_v, \Sigma) \in \mathbb{R}^d$. Let $\theta_0$ and $f_0(\cdot)$ denote the true parameters and the true nonparametric function. The log-likelihood function from a single subject is

$$\ell(\theta, f) = \log \int q(a, \theta, f)\, da, \tag{10}$$

where $a = (v, \gamma)$ and

$$q(a, \theta, f) = \prod_{t=1}^{T} \frac{\exp\{A_t f(Z_t)\}}{1 + \exp\{f(Z_t)\}}\, \phi(a; \theta). \tag{11}$$

For any vector $b \in \mathbb{R}^d$ and function $h \in L^2[L,U]$ with bounded $r$-th derivatives, we define the score operators of $\ell(\theta, f)$, $l_\theta(\theta,f)[b]$ and $l_f(\theta,f)[h]$, by differentiating the log-likelihood function with respect to $\theta$ and $f$ along the submodels $\theta + \epsilon b$ and $f + \epsilon h$. Define the information operator of $\ell(\theta, f)$ as $\mathcal{I}(\theta, f) = E\{(l_\theta, l_f)^{*}(l_\theta, l_f)\}$, where $(l_\theta, l_f)^{*}$ is the dual operator of $(l_\theta, l_f)$. $\mathcal{I}(\theta, f)$ is a self-adjoint operator from

$$\mathcal{H} = \{(b, h) : b \in \mathbb{R}^d,\ h \in L^2[L,U] \text{ with bounded } r\text{-th derivatives}\}$$

to itself. Define $\mathcal{I}^{1/2}(\theta, f)$ as the square-root operator of $\mathcal{I}(\theta, f)$ such that $\mathcal{I} = \mathcal{I}^{1/2} \circ \mathcal{I}^{1/2}$, where $\circ$ is the operator composition. To maximize (7), we only consider nonparametric functions in

$$\mathcal{S}_n = \left\{ \tilde{f} : \tilde{f}(x) = \sum_{k=1}^{K_n} \tilde{\eta}_k \{ I_k(x) - I_k(0) \},\ \tilde{\eta}_k \in \mathbb{R} \right\}.$$

The following conditions are needed for the theorems in this paper.

Condition 1. The true value $\theta_0$ lies in the interior of a known compact set in $\mathbb{R}^d$. The true function $f_0(\cdot)$ is increasing with $f_0(0) = 0$ on a known bounded interval $[L, U]$, where $L < 0$ and $U > 0$; $f_0(x) = f_0(U)$ for $x > U$ and $f_0(x) = f_0(L)$ for $x < L$; and $f_0(\cdot)$ is $r-1$ times continuously differentiable with its $r$-th derivative bounded on $[L, U]$, where $r \ge 2$.

Condition 2. Let $\tilde{f}_0$ be the projection of $f_0$ onto $\mathcal{S}_n$ under the $L^2$ norm. We assume that there exists a constant $C > 0$ such that $\tilde{f}_0$ belongs to the subspace of $\mathcal{S}_n$:

$$\mathcal{S}_n^{*}(C) = \left\{ \tilde{f} : \tilde{f}(x) = \sum_{k=1}^{K_n} \tilde{\eta}_k \{ I_k(x) - I_k(0) \},\ 0 \le \tilde{\eta}_k < C K_n^{-1} \right\}.$$

Condition 3. For any two different sets of parameters $(\theta_1, f_1)$ and $(\theta_2, f_2)$, $\ell(\theta_1, f_1) \ne \ell(\theta_2, f_2)$ with positive probability. The information operator $\mathcal{I}(\theta, f)$ of the log-likelihood function $\ell(\theta, f)$ at the true parameter values $(\theta_0, f_0)$ is invertible and satisfies

$$C_{\min}\left( \|b\|_2^2 + \|h\|_2^2 \right) \le \left\| \mathcal{I}^{1/2}(\theta_0, f_0)[b, h] \right\|_{P_2}^2 \le C_{\max}\left( \|b\|_2^2 + \|h\|_2^2 \right), \qquad \forall (b, h) \in \mathcal{H},\ 0 < C_{\min} < C_{\max} < \infty.$$

Condition 4. The number of I-spline basis functions $K_n$ satisfies $K_n^{2r-2} n^{-1/2} \to \infty$ and $K_n^{1/2} n^{-1/2} \to 0$. The distance between adjacent interior knots is between $c^{-1} K_n^{-1}$ and $c K_n^{-1}$ for some constant $c > 1$.

Condition 1 ensures that the function $f_0$ is bounded and smooth. Conditions 2 and 4 guarantee that $\tilde{f}$ and $\tilde{f}'$ are uniformly bounded for all $\tilde{f} \in \mathcal{S}_n^{*}(C)$, where $\tilde{f}'$ is the first-order derivative of $\tilde{f}$. The first part of Condition 3 is the identifiability condition, and the second part implies that the information operator is invertible in the $L^2$ space. In particular, the identifiability condition implies that the latent variable $Z_{it}$ defined in equation (4) does not degenerate and has a continuous density with support containing $[L, U]$, the domain of $f$. In Condition 4, we may set $K_n = n^{\delta}$, where $1/\{4(r-1)\} < \delta < 1$.

We state the consistency and asymptotic distribution of the estimators for the model parameters in the following two theorems, whose proofs are given in the supplementary materials Section S.3.

Theorem 1. Under Conditions 1-4, $\|\hat{\theta} - \theta_0\|_2 \to 0$ and $\|\hat{f} - f_0\|_2 \to 0$ in probability. Furthermore, $\|\hat{\theta} - \theta_0\|_2^2 + \|\hat{f} - f_0\|_2^2 = o_p(n^{-1/2})$.

To describe the asymptotic distributions of $\hat{f}$ and $\hat{\theta}$, we first introduce the set $\mathcal{F} = \{h(x) : h(x) \text{ has } r\text{-th derivatives bounded by } 1 \text{ on } [L, U]\}$. Define $\mathcal{O}_\theta$ to be the unit ball in $\mathbb{R}^d$. We then treat $(\hat{\theta} - \theta_0, \hat{f} - f_0)$ as a stochastic process in $l^{\infty}(\mathcal{O}_\theta \times \mathcal{F})$ whose value at $(b^{*}, h^{*})$ is defined as

$$b^{*\top}(\hat{\theta} - \theta_0) + \int h^{*}(u)\{\hat{f}(u) - f_0(u)\}\, du,$$

where $b \in \mathcal{O}_\theta$, $h \in \mathcal{F}$, and $(b^{*}, h^{*}) = \mathcal{I}(\theta_0, f_0)[b, h]$. The following theorem shows the weak convergence of this stochastic process.

Theorem 2. Under Conditions 1-4, $n^{1/2}(\hat{\theta} - \theta_0, \hat{f} - f_0)$ converges in distribution to a zero-mean, tight Gaussian process in the metric space $l^{\infty}(\mathcal{O}_\theta \times \mathcal{F})$ as $n \to \infty$.

The proof of Theorem 2 also implies the asymptotic normality of $\hat{\theta}$ and of smooth functionals of $\hat{f}$. However, it does not imply pointwise normality for $\hat{f}$, although the simulation studies suggest that normality is still plausible. Also note that the convergence rate given in Theorem 1 is an intermediate result needed for proving the faster rate and asymptotic normality of $\hat{\theta}$ in Theorem 2. On the other hand, the convergence rates for $\hat{f}$ in the two theorems are not directly comparable: Theorem 1 gives the convergence rate of $\hat{f}$ in the $L^2$ metric, while the convergence rate and normality for $\hat{f}$ in Theorem 2 refer to the functionals $\int h^{*}(u) \hat{f}(u)\, du$.

References

1. Abbeel P and Ng AY (2004). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1.
2. APA (2013). Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition. American Psychiatric Association.
3. Arora S and Doshi P (2021). A survey of inverse reinforcement learning: Challenges, methods and progress. Artificial Intelligence, 297:103500.
4. Ashwood ZC, Roy NA, Stone IR, Urai AE, Churchland AK, Pouget A, and Pillow JW (2022). Mice alternate between discrete strategies during perceptual decision-making. Nature Neuroscience, 25(2):201-212.
5. Etkin A, Egner T, Peraza DM, Kandel ER, and Hirsch J (2006). Resolving emotional conflict: A role for the rostral anterior cingulate cortex in modulating activity in the amygdala. Neuron, 51(6):871-882.
6. Foerde K, Steinglass JE, Shohamy D, and Walsh BT (2015). Neural mechanisms supporting maladaptive food choices in anorexia nervosa. Nature Neuroscience, 18(11):1571-1573.
7. Gravetter FJ and Forzano L-AB (2018). Research Methods for the Behavioral Sciences. Cengage Learning.
8. Hall P and Horowitz J (2013). A simple bootstrap method for constructing nonparametric confidence bands for functions. The Annals of Statistics, pages 1892-1921.
9. Huys QJ, Maia TV, and Frank MJ (2016). Computational psychiatry as a bridge from neuroscience to clinical applications. Nature Neuroscience, 19(3):404-413.
10. Huys QJ, Pizzagalli DA, Bogdan R, and Dayan P (2013). Mapping anhedonia onto reinforcement learning: A behavioural meta-analysis. Biology of Mood & Anxiety Disorders, 3(1):1-16.
11. Insel T, Cuthbert B, Garvey M, Heinssen R, Pine DS, Quinn K, Sanislow C, and Wang P (2010). Research domain criteria (RDoC): Toward a new classification framework for research on mental disorders. American Journal of Psychiatry, 167(7):748-751.
12. Jäckel P (2005). A note on multivariate Gauss-Hermite quadrature. London: ABN-Amro.
13. Kendler KS, Karkowski LM, and Prescott CA (1999). Causal relationship between stressful life events and the onset of major depression. American Journal of Psychiatry, 156(6):837-841.
14. Lange K, Chambers J, and Eddy W (1999). Numerical Analysis for Statisticians, volume 2. Springer.
15. Luckett DJ, Laber EB, Kim S, and Kosorok MR (2021). Estimation and optimization of composite outcomes. The Journal of Machine Learning Research, 22(1):7558-7597.
16. Ng AY, Russell S, et al. (2000). Algorithms for inverse reinforcement learning. In ICML, volume 1, page 2.
17. Passos IC, Mwangi B, and Kapczinski F (2019). Personalized Psychiatry: Big Data Analytics in Mental Health. Springer.
18. Pike AC and Robinson OJ (2022). Reinforcement learning in patients with mood and anxiety disorders vs control individuals: A systematic review and meta-analysis. JAMA Psychiatry.
19. Pizzagalli DA, Jahn AL, and O'Shea JP (2005). Toward an objective characterization of an anhedonic phenotype: A signal-detection approach. Biological Psychiatry, 57(4):319-327.
20. Ramsay JO (1988). Monotone regression splines in action. Statistical Science, pages 425-441.
21. Rescorla RA (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. Current Research and Theory, pages 64-99.
22. Ross S and Bagnell D (2010). Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661-668. JMLR Workshop and Conference Proceedings.
23. Schumaker L (2007). Spline Functions: Basic Theory. Cambridge University Press.
24. Sutton RS and Barto AG (2018). Reinforcement Learning: An Introduction. MIT Press.
25. Trivedi MH, McGrath PJ, Fava M, Parsey RV, Kurian BT, Phillips ML, Oquendo MA, Bruder G, Pizzagalli D, Toups M, et al. (2016). Establishing moderators and biosignatures of antidepressant response in clinical care (EMBARC): Rationale and design. Journal of Psychiatric Research, 78:11-23.
26. Vrieze E, Pizzagalli DA, Demyttenaere K, Hompes T, Sienaert P, de Boer P, Schmidt M, and Claes S (2013). Reduced reward learning predicts outcome in major depressive disorder. Biological Psychiatry, 73(7):639-645.
27. Whiteford HA, Degenhardt L, Rehm J, Baxter AJ, Ferrari AJ, Erskine HE, Charlson FJ, Norman RE, Flaxman AD, Johns N, et al. (2013). Global burden of disease attributable to mental and substance use disorders: Findings from the Global Burden of Disease Study 2010. The Lancet, 382(9904):1575-1586.
28. Whitton AE, Treadway MT, and Pizzagalli DA (2015). Reward processing dysfunction in major depression, bipolar disorder and schizophrenia. Current Opinion in Psychiatry, 28(1):7.
29. Williams LM (2016). Precision psychiatry: A neural circuit taxonomy for depression and anxiety. The Lancet Psychiatry, 3(5):472-480.
