Published in final edited form as: J Am Stat Assoc. 2023 Nov 22;119(545):27–38. doi: 10.1080/01621459.2023.2261184

A Semiparametric Inverse Reinforcement Learning Approach to Characterize Decision Making for Mental Disorders

Xingche Guo a, Donglin Zeng b, Yuanjia Wang c,*

Abstract

Major depressive disorder (MDD) is one of the leading causes of disability-adjusted life years. Emerging evidence indicates the presence of reward processing abnormalities in MDD. An important scientific question is whether the abnormalities are due to reduced sensitivity to received rewards or reduced learning ability. Motivated by the probabilistic reward task (PRT) experiment in the EMBARC study, we propose a semiparametric inverse reinforcement learning (RL) approach to characterize the reward-based decision-making of MDD patients. The model assumes that a subject’s decision-making process is updated based on a reward prediction error weighted by the subject-specific learning rate. To account for the fact that one favors a decision leading to a potentially high reward, but this decision process is not necessarily linear, we model reward sensitivity with a non-decreasing and nonlinear function. For inference, we estimate the latter via approximation by I-splines and then maximize the joint conditional log-likelihood. We show that the resulting estimators are consistent and asymptotically normal. Through extensive simulation studies, we demonstrate that under different reward-generating distributions, the semiparametric inverse RL outperforms the parametric inverse RL. We apply the proposed method to EMBARC and find that MDD and control groups have similar learning rates but different reward sensitivity functions. There is strong statistical evidence that reward sensitivity functions have nonlinear forms. Using additional brain imaging data in the same study, we find that both reward sensitivity and learning rate are associated with brain activities in the negative affect circuitry under an emotional conflict task.

Keywords: semiparametric model, monotone splines, reward learning, learning ability, behavioral phenotyping, major depression

1. Introduction

Mental disorders cause immense disability, accounting for 183.9 million disability-adjusted life-years worldwide (Whiteford et al., 2013). However, despite intense clinical and research efforts devoted to developing pharmacological and behavioral treatments for mental disorders over the years, effective treatment strategies remain elusive, and treatment responses are far from adequate across mental disorders (e.g., 30%−50%). Recently, there has been a fast-growing trend to utilize discovered novel biomarkers and behavioral markers, accompanied by sophisticated computational methods (e.g., machine learning or reinforcement learning), to provide a more accurate characterization of mental disorders and more precise prediction of treatment responses than traditional approaches (Passos et al., 2019).

To address substantial between-patient heterogeneity and limitations in the traditional symptom-based diagnosis, the paradigm shift put forward by the NIMH strategic plan on research domain criteria (RDoC; Insel et al., 2010) initiative calls for re-defining mental disorders by latent constructs identified from biological and behavioral measures across different domains of functioning (e.g., positive/negative affect, social processing, decision making) at different levels (e.g., genes, cells, and circuits), instead of solely relying on clinical symptoms or the diagnostic and statistical manual of mental disorders (APA, 2013). The vision on precision psychiatry (Insel et al., 2010; Williams, 2016) is that research needs to account for extensive diagnostic heterogeneity and substantial between-patient variation in biological, behavioral, and clinical manifestations of disease, examine shared latent constructs across disorders, and match treatments with the underlying pathophysiology.

Learning and decision-making are considered important in the etiology of mental function (Kendler et al., 1999). An individual's learning ability and decision-making process may be altered by mental disorders (Pizzagalli et al., 2005). In turn, poor decision-making may lead individuals into high-risk scenarios and to experience more mental illness (Kendler et al., 1999). For example, impairment in processing rewards is implicated in various mental disorders. Anhedonia (i.e., the inability to experience pleasure) is one of the core constructs of major depression. Pizzagalli et al. (2005) showed that anhedonia may alter how MDD patients process reward in a PRT experiment. Briefly, subjects were presented with two types of stimuli in a computer-based experiment (the probabilistic reward task), and one stimulus was rewarded more frequently than the other (details in Section 1.1). Over time, healthy subjects learn which stimulus has a higher reward value and choose the correct stimulus with a higher probability. It is hypothesized that anhedonia may reduce a patient's ability to learn which reward is more valuable or reduce a subject's tendency to choose the richer reward. Decision-making models have been proposed to evaluate whether these hypotheses are supported (Huys et al., 2013). As another example, restrictive eating behavior in anorexia nervosa can be measured via the Food Choice Task, a computer-based assessment of ratings and choices of food items, and patients' choices in the task predict future caloric intake (Foerde et al., 2015).

A useful computational model for describing a patient's learning behavior and decision-making ability is reinforcement learning (RL; Sutton and Barto, 2018). To use RL to describe behaviors or actions generated from a probabilistic reward task, a simple prediction-error learning rule is invoked to characterize the value of different choices in terms of key behavioral phenotype parameters, including reward sensitivity and learning rate. Various empirical studies and a meta-analysis have shown that objective, quantitative measures derived from behavioral data are associated with multiple mental disorders (Pike and Robinson, 2022), making these measures ideal candidate behavioral phenotypes under the RDoC framework. In contrast to regular RL, where one aims to estimate an optimal policy that maximizes the long-term reward, the goal of analysis in behavioral phenotyping in mental health is behavioral cloning or imitation learning (Ross and Bagnell, 2010). That is, given patients' observed choices on probabilistic reward tasks, the aim is to infer their reward learning ability and value computation process, and to understand how these abilities and processes relate to MDD and, ultimately, how they are affected by MDD treatments.

There is a gap in using existing imitation learning or inverse RL methods to address the unique challenges presented by mental disorders. For example, the models proposed in Huys et al. (2013) assume that a patient’s decision-making process relies on a function with restrictive parametric linear form, which may oversimplify the complex decision process and interactions between core mental functions. For example, similar to other psychometric models (Gravetter and Forzano, 2018) and as we demonstrate in Section 4, there is often a floor and ceiling effect on the relationship between action and expected reward, which is not captured by a linear model. Some existing imitation learning and inverse RL methods are more flexible to allow nonlinear models (e.g., Arora and Doshi, 2021), but they do not provide important behavioral phenotypes such as learning rate and reward sensitivity and do not borrow information from multiple subjects under a random effects model.

In this work, motivated by the experiments in Pizzagalli et al. (2005) and Trivedi et al. (2016), we propose a semiparametric inverse reinforcement learning approach to characterize reward-based decision-making for patients with mental disorders. Our method incorporates subject-specific learning rates and reward sensitivity as random effects to address between-patient heterogeneity. To borrow information while remaining flexible, we assume that the function that links action to the contrast in expected reward is nonparametric but shared across the patients. We describe the behavioral phenotyping experiment and motivating study in the next section.

1.1. Probabilistic Reward Task and the Motivating Study

The probabilistic reward task (PRT; Pizzagalli et al., 2005) is a computer-based experiment that measures the subject's ability to modify behavior in response to rewards. On each trial of the task, the subject sees a cartoon face with a short or long mouth (two stimuli). The task is to indicate whether a short or long mouth was presented by pressing button 'C' or 'M' on the keyboard. Critically, correct responses to the short mouth are rewarded more frequently than correct responses to the long mouth (i.e., the short mouth is the rich stimulus), and the difference in length between the short and long mouths is minimal. Participants were given verbal instructions about the task and told that the goal of the task was to win as much money as possible. It is critical that the subject understands that not all correct responses will be rewarded. To maximize the reward, participants should press the correct button, regardless of which face is associated with the higher reward. However, because the difference in size between the short and long mouths is designed to be small, participants frequently have difficulty accurately perceiving the presented state, which results in a tendency to choose the response associated with the higher reward rather than the response that is accurate. Thus, the PRT experiment probes a subject's ability to learn from rich versus lean rewards.

Emerging evidence indicates the presence of reward processing abnormalities in MDD (Whitton et al., 2015). MDD patients are more likely to have lower reward learning ability than patients not diagnosed with MDD (Vrieze et al., 2013). There has been significant interest in learning how MDD affects patients’ decision-making in reward learning tasks. In our motivating study, the Establishing Moderators and Biosignatures of Antidepressant Response for Clinical Care (EMBARC) trial, each PRT session consisted of 200 trials, divided into two blocks of 100 trials, with blocks separated by a 30-second break. For each block, 40 correct trials were followed by reward feedback. Correct identification of the short mouth was associated with three times more positive feedback (30 out of 40) than correct identification of the long mouth (10 out of 40). However, subjects were not told about the different frequencies of rewards across the two response types. Figure 1(a) shows the PRT schematic diagram.

Fig. 1. Schematic of a PRT experiment and EMBARC study design.

The EMBARC study is designed to identify differences in reward learning abilities between patients with MDD and a control group (not diagnosed with MDD), and, among patients with MDD, between the SSRI antidepressant sertraline (SERT) and placebo (PBO) in a randomized trial. Figure 1(b) shows the experimental design diagram. In the PRT experiments, there were 40 subjects from the control group and 168 subjects from the MDD group; within the MDD group, 82 participants were randomly assigned to the SERT group, and the other 86 participants were assigned to the PBO group. Each subject had two PRT sessions, one at baseline before treatment (week 0) and one after one week of treatment (week 1). Subjects in the control group also completed two sessions, at week 0 and week 1. In week 1, five sessions were missing for the MDD group and one for the control group.

2. Methods

2.1. Data from a behavioral task

Consider $n$ subjects from a population. The data consist of a time series of each subject's decision-making in a behavioral task (e.g., the probabilistic reward task), i.e., $\{(S_{it}, A_{it}, R_{it})\}_{t=1,\ldots,T;\ i=1,\ldots,n}$, where $S_{it}$, $A_{it}$, and $R_{it}$ are the state, action, and reward at time $t$ for the $i$-th subject. In this paper, we assume that there are $m$ states in the state space (i.e., $S_{it} \in \{0, \ldots, m-1\}$) and two actions in the action space (i.e., $A_{it} \in \{0, 1\}$). Denote $\mathcal{H}_{it} = \{(S_{ij}, A_{ij}, R_{ij})\}_{j=1,\ldots,t-1}$ as the observed history for the $i$-th subject by time $t$. The reward $R_{it}$ is generated based on $S_{it}$, $A_{it}$, and $\mathcal{H}_{it}$ (in some cases, only on $S_{it}$ and $A_{it}$). $R_{it}$ is assumed to be bounded (either discrete or continuous), and without loss of generality, we assume $R_{it}$ lies between 0 and 1. The probabilistic reward task introduced in Section 1.1 is a special case with two possible states. We let $S_{it} = 0$ correspond to the state with lean rewards and $S_{it} = 1$ correspond to the state with rich rewards. The reward-generating distributions conditional on $S_{it} = A_{it} = 0$ and $S_{it} = A_{it} = 1$ (and history $\mathcal{H}_{it}$) are Bernoulli distributions with probabilities $0 < P_{00} < P_{11} < 1$. No reward is provided if the $i$-th patient does not correctly respond to the state at time $t$.

2.2. Semiparametric inverse RL model

RL is a computational tool used to model patients' responses to stimuli in behavioral tasks such as the PRT. A core concept in RL modeling of the PRT is the reward prediction error, the difference between the received reward and the expected reward. Mathematically, the reward prediction error is $R_{it} - Q_{it}(a,s)$, where $Q_{it}(a,s)$ is the expected reward, or value, of taking action $a$ at state $s$ for the $i$-th subject at the $t$-th trial. This is the foundation for building more complex models to explain patients' behavioral responses. Specifically, for MDD patients, a failure to adequately process reward is hypothesized to occur via two distinct mechanisms: a reduced sensitivity to received rewards or a reduced ability to learn (Huys et al., 2013). The former mechanism is assessed by a parameter referred to as reward sensitivity (denoted by $\rho_i$), and the latter is assessed by the learning rate (denoted by $\beta_i$). When accounting for reward sensitivity, a subject's reward prediction error is

$$\delta_{it} = \rho_i R_{it} - Q_{it}(a,s),$$

where the obtained reward $R_{it}$ is discounted by a factor of $\rho_i$. Each patient's value (denoted as $Q_{i,t+1}(a,s)$) naturally evolves over time depending on the past history of values and is updated based on a weighted reward prediction error as

$$Q_{i,t+1}(a,s) = Q_{it}(a,s) + \beta_i \delta_{it} I_{it}(a,s), \tag{1}$$

where $\beta_i \in (0,1)$ is a subject-specific learning rate that describes how fast the value of a state-action pair is updated, and $I_{it}(a,s) = I\{A_{it}=a, S_{it}=s\}$ is the indicator of the event $A_{it}=a$ and $S_{it}=s$. The larger $\rho_i$ is, the more a subject's action will depend on rewards. Equivalently, the expected reward at $t+1$ is a weighted sum of the obtained reward and the expected reward at $t$, i.e., $Q_{i,t+1}(a,s) = \{1-\beta_i I_{it}(a,s)\} Q_{it}(a,s) + \beta_i \rho_i R_{it} I_{it}(a,s)$. As the learning rate $\beta_i$ approaches one, learning is so fast that the $Q$ values are simply the last experienced outcome. As a note, (1) is also known as the Rescorla-Wagner equation (Rescorla, 1972) in classical conditioning. Figure 2 shows a graphical representation of the model, where the green paths show the RL component of updating $Q_{it}$ over trials.
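To make the update concrete, the following minimal R sketch applies equation (1) for one trial; the function name and matrix layout are our own illustration, not code from the paper.

```r
# Minimal sketch of the value update in (1), assuming Q is stored as an
# m x 2 matrix with rows = states (0, ..., m-1) and columns = actions (0, 1).
update_Q <- function(Q, s, a, r, beta, rho) {
  delta <- rho * r - Q[s + 1, a + 1]                 # reward prediction error, scaled by rho
  Q[s + 1, a + 1] <- Q[s + 1, a + 1] + beta * delta  # update only the visited (a, s) pair
  Q
}

# Example: two states, initial values alpha_00 = alpha_11 = 2, cross terms 0
Q <- matrix(0, 2, 2); Q[1, 1] <- Q[2, 2] <- 2
Q <- update_Q(Q, s = 1, a = 1, r = 1, beta = 0.1, rho = 2.7)
```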

Fig. 2. Diagram of the semiparametric inverse RL model over trials $t$, where $f$ is the reward sensitivity function, $\beta_i$ is the learning rate, and $\rho_i$ is related to the relative reward sensitivity.

At time $t=1$, we assume $Q_{i1}(a,s) = \alpha_{as}$. Define $\Delta_{it} = \Delta_{it}(a,s)$ as the total number of times that subject $i$ takes action $a$ at state $s$ before time $t$, and define $\tau_{ik} = \tau_{ik}(a,s)$ as the $k$-th time at which subject $i$ takes action $a$ at state $s$. According to (1), we can express $Q_{it}(a,s)$ as

$$Q_{it}(a,s) = (1-\beta_i)^{\Delta_{it}} \alpha_{as} + \beta_i \rho_i \sum_{k=1}^{\Delta_{it}} (1-\beta_i)^{\Delta_{it}-k} R_{i,\tau_{ik}}. \tag{2}$$

Clearly, $0 \le Q_{it}(a,s) \le \max(\alpha_{as}, \rho_i)$, and $Q_{it}(a,s)$ is also a function of $\beta_i$, $\rho_i$, $\alpha_{as}$, and the history $\mathcal{H}_{it}$. Furthermore, $Q_{it}(a,s,\beta_i, c\rho_i, c\alpha_{as}; \mathcal{H}_{it}) = c\, Q_{it}(a,s,\beta_i, \rho_i, \alpha_{as}; \mathcal{H}_{it})$. By trial $t$, participants have the updated reward value $Q_{it}(a,s)$ based on $(t-1)$ trials. However, because participants cannot be sure which stimulus will be presented at trial $t$, their action at trial $t$ is based on a "belief" about the expected reward, $W_{it}(a,s)$, defined as a mixture of $\{Q_{it}(a,s)\}_{s=0,\ldots,m-1}$:

$$W_{it}(a,s) = \omega_{ss} Q_{it}(a,s) + \sum_{r \ne s} \omega_{sr} Q_{it}(a,r), \tag{3}$$

where the parameters $\omega_{sr} \in [0,1]$ satisfy $\sum_{r=0}^{m-1} \omega_{sr} = 1$ and reflect the weights placed on the $Q$-values from states other than $s$.

It is worth noting that the formulations of $Q_{it}(a,s)$ and $W_{it}(a,s)$ are motivated by Huys et al. (2013). We generalize Huys et al. (2013) to allow a nonlinear reward sensitivity function and more than two states (i.e., $S_{it} \in \{0,\ldots,m-1\}$), and we use a random effects model to borrow information across subjects to estimate reward sensitivities and learning rates. Specifically, given the believed reward $W_{it}(a,s)$ and the actual stimulus $S_{it}$ at trial $t$, we define a contrast between the two actions as

$$Z_{it} = W_{it}(1, S_{it}) - W_{it}(0, S_{it}), \tag{4}$$

and we assume that the action $A_{it}$, conditional on the history $\mathcal{H}_{it}$ and $S_{it}$, follows the model

$$\operatorname{logit} P(A_{it} = 1 \mid S_{it}, \mathcal{H}_{it}) = f(Z_{it}), \tag{5}$$

where $f(\cdot)$ is an unknown non-decreasing function satisfying $f(0)=0$. In other words, the participant is more likely to choose the action whose expected reward is higher, but this probability can be a nonlinear function of the contrast. As a result, we refer to $f(\cdot)$ as a reward sensitivity function. The nonlinear modeling of the decision process may more accurately reveal when subjects are more sensitive to reward under a probabilistic reward task. Finally, to incorporate the heterogeneity among participants, we assume $v_i = \log\{\beta_i/(1-\beta_i)\}$ and $\gamma_i = \log \rho_i$, where $v_i$ and $\gamma_i$ are random effects generated from a bivariate normal distribution, i.e., $(v_i, \gamma_i) \overset{iid}{\sim} N(\mu, \Sigma)$ with $\mu = (\mu_v, \mu_\gamma)$. Note that since $Q_{it}$ is invariant if its initial value and $\rho_i$ are scaled by a constant factor $c$, $f(x)$ and $f(x/c)$ yield the same action model. Thus, to ensure identifiability, we fix $\mu_\gamma = 1$. With this normalization, $\gamma_i$ serves as an indicator of the relative sensitivity of the $i$-th subject, i.e., $(\gamma_i - \mu_\gamma)/\mu_\gamma$. We refer to $\gamma_i$ as the relative sensitivity in the rest of the paper. The decision-making model (5) is represented by the blue paths in Figure 2. The rewards are generated based on a participant's state, choice, and history, depending on the experimental paradigm (pink paths in Figure 2).
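The decision rule in (3)-(5) can be sketched in R as follows for the two-state case with $\omega_{00}=\omega_{11}=\omega$; `choose_action` and `f_true` are illustrative names introduced here, and `f_true` is the nonlinear function later used in the simulations.

```r
# Sketch of the belief mixture (3), contrast (4), and logistic action model (5)
# for m = 2 states; Q is the 2 x 2 value matrix from the update sketch above.
choose_action <- function(Q, s, omega, f) {
  W <- function(a) omega * Q[s + 1, a + 1] + (1 - omega) * Q[2 - s, a + 1]  # belief W(a, s)
  Z <- W(1) - W(0)              # contrast between the two actions, equation (4)
  p1 <- plogis(f(Z))            # P(A = 1 | S = s, history), equation (5)
  rbinom(1, 1, p1)              # sample the action
}

f_true <- function(x) sign(x) * log(2 * abs(x) + 1)  # a non-decreasing f with f(0) = 0
a <- choose_action(Q, s = 1, omega = 0.8, f = f_true)
```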

2.3. Implementation

Under the proposed semiparametric inverse RL/imitation learning model, since $R_{it}$ and $S_{it}$ given $\mathcal{H}_{it}$ are independent of the random effects, the log-likelihood function for subject $i$ is, up to a constant,

$$\ell_i(\mu, \Sigma, \alpha, \Omega, f) = \log \int \prod_{t=1}^{T} P_{it}(v_i, \gamma_i, \alpha, \Omega, f; \mathcal{H}_{it})\, \phi(v_i, \gamma_i \mid \mu, \Sigma)\, dv_i\, d\gamma_i, \tag{6}$$

where $\phi(\cdot, \cdot \mid \mu, \Sigma)$ denotes the density of $N(\mu, \Sigma)$,

$$\alpha = \{\alpha_{as}\}_{a=0,1;\ s=0,\ldots,m-1}, \qquad \Omega = \{\omega_{sr}\}_{s,r=0,\ldots,m-1}, \qquad P_{it}(v_i, \gamma_i, \alpha, \Omega, f; \mathcal{H}_{it}) = \frac{\exp\{A_{it} f(Z_{it})\}}{1 + \exp\{f(Z_{it})\}}.$$

We maximize

$$\ell(\mu, \Sigma, \alpha, \Omega, f) = \sum_{i=1}^{n} \ell_i(\mu, \Sigma, \alpha, \Omega, f) \tag{7}$$

to estimate all the parameters. To fit a non-decreasing function $f(x)$ for $x \in [L, U]$ ($L<0$, $U>0$), where $L$ and $U$ are two pre-specified lower and upper bounds for the contrast, we approximate $f$ using a family of monotone splines called I-splines (Ramsay, 1988). Denote $M = (M_1, \ldots, M_K)$ as a set of M-splines and $I = (I_1, \ldots, I_K)$ as the corresponding I-splines, where each $I_k(x)$ is the integral of $M_k(\cdot)$ from $L$ to $x$. Because $M_k(\cdot)$ is non-negative for all $k = 1, \ldots, K$, $I_k(\cdot)$ is guaranteed to be non-decreasing. More details on the representation of M-splines and I-splines can be found in Appendix A.1. We approximate $f(\cdot)$ by $\tilde{f}(x) = a + I(x) b$. Restricting the coefficients $b_k \ge 0$ for $k = 1, \ldots, K$ forces $\tilde{f}(\cdot)$ to be non-decreasing, and we set $a = -I(0) b$ to ensure $\tilde{f}(0) = 0$. Moreover, the first-order derivative of $\tilde{f}$ is $\tilde{f}'(x) = M(x) b$.
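As an illustration of this approximation, the sketch below builds $\tilde{f}(x) = I(x)b - I(0)b$ with non-negative coefficients, assuming the R package `splines2` (its `iSpline` basis); the knots, spline degree, and coefficient values are placeholders, and the mapping between `degree` and the paper's order $r$ is an assumption.

```r
# Sketch of the monotone I-spline approximation of f, assuming the splines2 package.
library(splines2)

int_knots <- c(-2.4, -1.2, -0.6, 0, 0.6, 1.2, 2.4)  # interior knots (placeholder values)
bknots    <- c(-6, 6)                               # boundary knots [L, U]

f_tilde <- function(x, b) {
  # I(x) %*% b - I(0) %*% b enforces f_tilde(0) = 0; b_k >= 0 makes f_tilde non-decreasing
  Ix <- iSpline(x, knots = int_knots, degree = 2, Boundary.knots = bknots, intercept = TRUE)
  I0 <- iSpline(0, knots = int_knots, degree = 2, Boundary.knots = bknots, intercept = TRUE)
  drop(Ix %*% b) - drop(I0 %*% b)
}

K <- ncol(iSpline(0, knots = int_knots, degree = 2, Boundary.knots = bknots, intercept = TRUE))
b <- rep(0.2, K)                          # non-negative coefficients => non-decreasing f_tilde
curve(f_tilde(x, b), from = -6, to = 6)   # visual check of monotonicity and f_tilde(0) = 0
```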

We evaluate the double integral using bivariate Gauss-Hermite quadrature (Jäckel, 2005). In the optimization step, because $\alpha$ and $\Omega$ are linearly constrained and the coefficients of the I-splines are non-negative, constrained optimization (Lange et al., 1999) is used to maximize (7). In our numerical studies, the L-BFGS-B algorithm implemented by the "optim" function in the R package "stats" is used for the case $m=2$ (i.e., two states); when $m>2$, an adaptive barrier algorithm implemented by the "constrOptim" function in the R package "stats" can be used instead. Algorithm 1 presents a detailed version of the semiparametric RL algorithm for optimizing the parameters and the nonparametric function. Note that evaluating the joint conditional log-likelihood function (7) is time-consuming; to accelerate the computation, we evaluate the log-likelihood function (6) for each subject in parallel and sum the parts to obtain (7). The maximum likelihood estimate is obtained when the L-BFGS-B algorithm converges.
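The marginal likelihood (6) can be approximated as in the following sketch, assuming the `statmod` package for one-dimensional Gauss-Hermite nodes; `per_subject_lik` is a hypothetical function returning $\prod_t P_{it}(v,\gamma,\alpha,\Omega,f;\mathcal{H}_{it})$ for one subject and is not part of any published code.

```r
# Sketch: 3 x 3 bivariate Gauss-Hermite approximation of the integral in (6).
library(statmod)

loglik_i <- function(data_i, mu, Sigma, per_subject_lik) {
  gh    <- gauss.quad(3, kind = "hermite")   # 1-D nodes/weights for weight exp(-x^2)
  Lchol <- t(chol(Sigma))                    # Sigma = Lchol %*% t(Lchol)
  val <- 0
  for (j in 1:3) for (k in 1:3) {
    z <- mu + sqrt(2) * drop(Lchol %*% c(gh$nodes[j], gh$nodes[k]))  # map nodes to N(mu, Sigma)
    w <- gh$weights[j] * gh$weights[k] / pi                          # bivariate product weight
    val <- val + w * per_subject_lik(data_i, v = z[1], gamma = z[2])
  }
  log(val)
}
```

Summing `loglik_i` over subjects (possibly in parallel) gives (7), which can then be passed to `optim(..., method = "L-BFGS-B")` with box constraints keeping the I-spline coefficients non-negative.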

Furthermore, denote $\hat{\Lambda} = (\hat{\mu}, \hat{\Sigma}, \hat{\alpha}, \hat{\Omega}, \hat{f})$ as the maximizer of (7). The subject-specific learning rate and relative sensitivity can be estimated by plugging $\hat{\Lambda}$ into the posterior mean of $(v_i, \gamma_i)$:

$$E\left[(v_i, \gamma_i) \mid \hat{\Lambda}; \mathcal{H}_{i,T+1}\right] = \int (v_i, \gamma_i) \prod_{t=1}^{T} P_{it}(v_i, \gamma_i, \hat{\alpha}, \hat{\Omega}, \hat{f}; \mathcal{H}_{it})\, \phi(v_i, \gamma_i \mid \hat{\mu}, \hat{\Sigma})\, dv_i\, d\gamma_i \times \exp\{-\ell_i(\hat{\mu}, \hat{\Sigma}, \hat{\alpha}, \hat{\Omega}, \hat{f})\}. \tag{8}$$
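A plug-in evaluation of (8) can reuse the same quadrature rule, as in this sketch (again with the hypothetical `per_subject_lik` and the `statmod` nodes).

```r
# Sketch: posterior mean of (v_i, gamma_i) in (8) by 3 x 3 Gauss-Hermite quadrature,
# evaluated at the plugged-in estimates mu_hat and Sigma_hat.
posterior_vg_i <- function(data_i, mu_hat, Sigma_hat, per_subject_lik) {
  gh    <- statmod::gauss.quad(3, kind = "hermite")
  Lchol <- t(chol(Sigma_hat))
  num <- c(0, 0); den <- 0
  for (j in 1:3) for (k in 1:3) {
    z   <- mu_hat + sqrt(2) * drop(Lchol %*% c(gh$nodes[j], gh$nodes[k]))
    w   <- gh$weights[j] * gh$weights[k] / pi
    lik <- per_subject_lik(data_i, v = z[1], gamma = z[2])
    num <- num + w * lik * z      # numerator of the posterior mean
    den <- den + w * lik          # denominator equals exp(loglik_i), as in (8)
  }
  num / den                       # estimated (v_i, gamma_i)
}
```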

In the Appendix, we study the theoretical properties of the MLEs of the parameters and the monotone function $f(\cdot)$ and show consistency and asymptotic normality. Furthermore, a nonparametric bootstrap is used for inference. Because asymptotic normality holds, as we show in Appendix A.2, we construct 95% bootstrap confidence intervals under the normal approximation (i.e., MLE $\pm\, 1.96 \times$ bootstrap SE) in the simulation studies and the data application. Because the distributions of $\hat{\sigma}_{v,v}^2$ and $\hat{\sigma}_{\gamma,\gamma}^2$ are highly right-skewed, we apply a logarithmic transform before using the normal approximation to construct their confidence intervals.
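The interval construction described here amounts to the following sketch; `boot_ci` is an illustrative helper, with the log transform applied to the skewed variance components.

```r
# Sketch of the normal-approximation bootstrap interval (MLE +/- 1.96 * bootstrap SE).
boot_ci <- function(est, boot_est, level = 0.95, log_scale = FALSE) {
  z <- qnorm(1 - (1 - level) / 2)
  if (log_scale) {                       # for right-skewed variance components
    se <- sd(log(boot_est))
    exp(log(est) + c(-1, 1) * z * se)    # back-transform to the original scale
  } else {
    se <- sd(boot_est)
    est + c(-1, 1) * z * se
  }
}
```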


3. Simulation Studies

We conducted extensive simulation studies to assess the finite-sample performance of the proposed method. To mimic the real study, denote the distribution for generating the reward $R_{it}$ conditional on $A_{it}=a$ and $S_{it}=s$ as $p_{as}(R_{it})$. We considered $m=2$ (two states). We further let $\alpha_{00} = \alpha_{11} := \alpha$ and $\alpha_{01} = \alpha_{10} := 0$, and let $\Omega$ satisfy $\omega_{00} = \omega_{11} := \omega$. As a result, (4) becomes

$$Z_{it} = \begin{cases} \omega Q_{it}(1,1) - (1-\omega) Q_{it}(0,0) & \text{if } S_{it} = 1, \\ -\omega Q_{it}(0,0) + (1-\omega) Q_{it}(1,1) & \text{if } S_{it} = 0. \end{cases} \tag{9}$$

Furthermore, we considered two reward-generating distributions: (Case I, Bernoulli distribution) $p_{11} = \mathrm{Bernoulli}(0.75)$ (rich reward) and $p_{00} = \mathrm{Bernoulli}(0.3)$ (lean reward); (Case II, Beta distribution) $p_{11} = \mathrm{Beta}(3,1)$ (rich reward) and $p_{00} = \mathrm{Beta}(1,3)$ (lean reward). The rewards at different time points were assigned independently. The states were generated by flipping a fair coin. For the $i$-th subject, the transformed learning rate $v_i$ and reward sensitivity $\gamma_i$ were generated from a bivariate normal distribution with $\mu_v = -2.5$, $\mu_\gamma = 1$, $\Sigma_{v,v} = 0.25$, $\Sigma_{\gamma,\gamma} = 0.16$, and $\Sigma_{v,\gamma} = -0.1$. We set $\omega = 0.8$ and $\alpha = 2$. The nonparametric function was $f(x) = \mathrm{sgn}(x) \log(2|x|+1)$. I-splines of order $r = 2$ with internal knots $\{0, \pm 0.6, \pm 1.2, \pm 2.4\}$ and boundary knots $\pm 6$ were used in all scenarios.
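For concreteness, the following sketch generates one subject's trajectory under Case I, reusing the `update_Q` and `choose_action` sketches from Section 2.2; the parameter values mirror the setup above, and the function name is our own illustration.

```r
# Illustrative data-generating sketch for Case I (Bernoulli rewards, two states).
simulate_subject <- function(n_trials, mu, Sigma, omega = 0.8, alpha = 2,
                             f = function(x) sign(x) * log(2 * abs(x) + 1)) {
  vg   <- MASS::mvrnorm(1, mu, Sigma)
  beta <- plogis(vg[1]); rho <- exp(vg[2])              # v = logit(beta), gamma = log(rho)
  Q    <- matrix(0, 2, 2); Q[1, 1] <- Q[2, 2] <- alpha  # alpha_00 = alpha_11 = alpha, cross terms 0
  out  <- data.frame(S = integer(n_trials), A = integer(n_trials), R = numeric(n_trials))
  for (t in 1:n_trials) {
    s <- rbinom(1, 1, 0.5)                              # states from a fair coin
    a <- choose_action(Q, s, omega, f)                  # decision sketch from Section 2.2
    r <- if (a == s) rbinom(1, 1, if (s == 1) 0.75 else 0.3) else 0  # Case I reward rule
    Q <- update_Q(Q, s, a, r, beta, rho)                # value update sketch from Section 2.2
    out[t, ] <- c(s, a, r)
  }
  out
}

mu <- c(-2.5, 1); Sigma <- matrix(c(0.25, -0.1, -0.1, 0.16), 2, 2)
dat <- simulate_subject(100, mu, Sigma)
```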

We generated 200 replicates with sample sizes $n \in \{100, 250\}$ and numbers of trials $T \in \{100, 500\}$. We maximized the proposed joint log-likelihood function (7) using a 3 × 3 bivariate Gauss-Hermite quadrature and used 50 bootstrap samples for inference. In the Supplementary Materials Section S.1.2, we show that a 3 × 3 Gauss-Hermite quadrature and 50 bootstrap samples are sufficient to provide accurate point estimates and confidence intervals. According to Figure S1 in the Supplementary Materials Section S.1.3, most $Z_{it}$ ($i=1,\ldots,n$; $t=1,\ldots,T$) lie in (−1.5, 2.5) for $T=100$; hence we evaluate the estimation performance of $f(\cdot)$ at $\{\pm 0.5, \pm 1, 1.5, 2\}$.

The simulation results for the parameter estimates from 200 replicates are given in Table 1, and the results for the nonparametric function are given in Table 2. For both tables, the relative bias (RB), standard deviation (SD), average bootstrap standard error (SE), and coverage probability of the 95% bootstrap confidence intervals (CP) are reported. Furthermore, we compare our method (Semiparametric RL) with the linear model $f(x)=cx$ (Linear RL) in Tables 1 and 2. Table 1 shows that the estimates of the group mean $\mu_v$ for both the Semiparametric RL and Linear RL models have small relative biases, suggesting that we can estimate the group learning rate with high precision even when the reward sensitivity function $f(\cdot)$ is incorrectly specified. However, the estimates of the covariance matrix $\Sigma$ have much larger relative biases when $f(\cdot)$ is incorrectly specified. Meanwhile, for the Semiparametric RL model, the relative bias of all parameters decreases as $T$ and $n$ increase; in contrast, for the Linear RL model, the relative bias of all parameters except the group learning rate changes little as $T$ and $n$ increase. Table 2 shows that the Semiparametric RL estimates of the nonparametric function have much smaller relative bias and larger standard deviation than the Linear RL estimates. Note that the coverage probabilities of the Semiparametric RL estimates are close to the nominal level (95%), whereas the coverage probabilities of the Linear RL estimates would be much lower than the nominal level. Hence, we conclude that the Linear RL model cannot provide reliable statistical inference on $f(\cdot)$ when the underlying true reward sensitivity function is nonlinear. In the Supplementary Materials Section S.1.1, we also investigate the case in which the underlying true reward sensitivity function is linear; the two methods provide estimates with similar relative bias, and our Semiparametric RL has a somewhat larger standard deviation than Linear RL.

Table 1.

Summary of the parameter estimates in 200 simulations with Bernoulli (Case I) and Beta (Case II) reward distribution.

Case I Case II
Semiparametric Linear Semiparametric Linear
T n Parameter RB SD SE CP RB SD RB SD SE CP RB SD
100 100 μv 0.014 0.316 0.346 97 0.103 0.301 0.055 0.501 0.516 97 0.139 0.385
σv,v2 −0.119 0.208 0.301 98 −0.577 0.235 −0.048 0.205 0.390 98 −0.245 0.225
σγ,γ2 −0.154 0.132 0.132 98 0.533 0.037 −0.100 0.161 0.150 97 0.581 0.037
σv,γ2 0.163 0.119 0.135 98 −0.251 0.070 0.046 0.137 0.149 97 −0.259 0.069
α −0.053 0.454 0.445 95 −0.055 0.233 −0.077 0.570 0.529 97 −0.085 0.214
ω −0.011 0.052 0.057 96 −0.062 0.061 −0.017 0.077 0.077 97 −0.036 0.080
100 250 μv 0.002 0.169 0.202 97 0.091 0.162 0.014 0.273 0.324 95 0.128 0.240
σv,v2 −0.009 0.111 0.126 99 −0.509 0.144 0.077 0.129 0.158 99 −0.166 0.132
σγ,γ2 −0.070 0.071 0.077 96 0.577 0.019 −0.052 0.106 0.086 91 0.561 0.025
σv,γ2 0.014 0.061 0.072 98 −0.333 0.040 −0.115 0.075 0.077 97 −0.267 0.041
α −0.009 0.236 0.279 98 −0.044 0.133 0.004 0.269 0.316 96 −0.061 0.143
ω −0.009 0.035 0.036 97 −0.064 0.036 −0.022 0.053 0.055 98 −0.044 0.052
500 100 μv 0.001 0.091 0.102 97 −0.002 0.109 −0.003 0.114 0.128 96 0.042 0.114
σv,v2 0.100 0.070 0.076 98 −0.283 0.133 0.139 0.101 0.115 99 −0.168 0.153
σγ,γ2 024 0.045 0.050 95 0.685 0.009 0.096 0.055 0.058 94 0.689 0.010
σv,γ2 0.059 0.041 0.050 95 −0.382 0.034 −0.028 0.049 0.053 96 −0.343 0.035
α 0.002 0.220 0.232 95 −0.075 0.135 −0.024 0.243 0.281 98 −0.092 0.133
ω 0.000 0.007 0.007 95 −0.081 0.011 −0.004 0.014 0.016 99 −0.067 0.010
500 250 μv −0.001 0.062 0.063 95 −0.003 0.070 −0.003 0.070 0.079 95 0.044 0.080
σv,v2 0.133 0.042 0.045 93 −0.268 0.076 0.113 0.069 0.070 96 −0.238 0.110
σγ,γ2 0.088 0.021 0.027 91 0.682 0.005 0.109 0.034 0.033 90 0.694 0.006
σv,γ2 −0.023 0.019 0.025 97 −0.336 0.016 −0.0 0.028 0.030 94 −0.338 0.026
α 0.000 0.131 0.136 94 −0.075 0.079 −0.003 0.144 0.176 96 −0.090 0.078
ω −0.001 0.004 0.004 97 −0.081 0.006 −0.002 0.005 0.007 97 −0.066 0.007

(RB): relative bias; (SD): standard deviation; (SE): average bootstrap standard error; (CP): coverage probability of the 95% bootstrap confidence intervals; (Semiparametric): our proposed RL method; (Linear): RL model with f(x)=cx.

Table 2.

Summary of the estimated nonparametric reward sensitivity function in 200 simulations with Bernoulli (Case I) and Beta (Case II) reward distribution.

Case I Case II
Semiparametric Linear Semiparametric Linear
T n f(x) RB SD SE CP RB SD RB SD SE CP RB SD
100 100 f(-1.0) −0.021 0.188 0.221 98 −0.191 0.153 −0.024 0.312 0.308 99 −0.148 0.213
f(-0.5) −0.028 0.171 0.177 97 −0.341 0.077 −0.094 0.236 0.229 98 −0.308 0.107
f(0.5) −0.001 0.188 0.185 97 0.341 0.077 0.056 0.258 0.246 99 0.308 0.107
f(1.0) 0.006 0.167 0.181 96 0.191 0.153 0.022 0.260 0.282 99 0.148 0.213
f(1.5) 0.010 0.179 0.194 98 0.047 0.230 0.022 0.208 0.306 97 0.005 0.320
f(2.0) −0.025 0.277 0.284 97 −0.090 0.307 −0.030 0.293 0.427 98 −0.149 0.426
100 250 f(-1.0) −0.013 0.110 0.131 99 −0.206 0.078 −0.016 0.179 0.190 97 −0.177 0.120
f(-0.5) −0.021 0.124 0.122 95 −0.353 0.039 −0.075 0.178 0.169 96 −0.331 0.060
f(0.5) 0.002 0.136 0.136 93 0.353 0.039 0.067 0.197 0.179 95 0.331 0.060
f(1.0) −0.001 0.100 0.111 94 0.206 0.078 0.033 0.173 0.182 99 0.177 0.120
f(1.5) 0.008 0.088 0.089 97 0.065 0.117 0.020 0.125 0.143 96 0.031 0.181
f(2.0) 0.000 0.126 0.136 98 −0.068 0.156 −0.014 0.195 0.199 98 −0.109 0.241
500 100 f(-1.0) 0.007 0.083 0.082 98 −0.235 0.030 −0.013 0.112 0.123 96 −0.228 0.027
f(-0.5) 0.018 0.040 0.047 97 −0.375 0.015 −0.008 0.066 0.082 99 −0.370 0.013
f(0.5) −0.023 0.047 0.054 98 0.375 0.015 0.002 0.080 0.097 99 0.370 0.013
f(1.0) −0.001 0.057 0.060 96 0.235 0.030 0.026 0.084 0.100 95 0.228 0.027
f(1.5) −0.006 0.048 0.051 96 0.100 0.044 0.001 0.055 0.060 98 0.091 0.040
f(2.0) −0.006 0.060 0.066 97 −0.028 0.059 −0.010 0.068 0.084 96 −0.039 0.053
500 250 f(-1.0) −0.002 0.046 0.049 97 −0.237 0.018 −0.003 0.064 0.071 95 −0.226 0.018
f(-0.5) 0.011 0.026 0.028 94 −0.377 0.009 0.008 0.034 0.040 95 −0.368 0.009
f(0.5) −0.010 0.029 0.032 94 0.377 0.009 −0.003 0.040 0.048 97 0.368 0.009
f(1.0) 0.009 0.031 0.036 96 0.237 0.018 0.017 0.044 0.050 94 0.226 0.018
f(1.5) −0.004 0.028 0.031 94 0.102 0.027 −0.003 0.028 0.033 97 0.089 0.027
f(2.0) −0.007 0.036 0.038 95 −0.026 0.037 −0.011 0.041 0.047 95 −0.041 0.036

(RB): relative bias; (SD): standard deviation; (SE): average bootstrap standard error; (CP): coverage probability of the 95% bootstrap confidence intervals; (Semiparametric): our proposed RL method; (Linear): RL model with f(x)=cx.

Comparing the Semiparametric RL estimation results among the four data-size scenarios for both reward-generating mechanisms, we found that the estimation bias and standard deviation decrease significantly when the number of trials increases from 100 to 500. The estimation bias and standard deviation also decrease when the number of subjects increases from 100 to 250; this reduction is relatively large for $T=100$ compared with $T=500$. The remaining bias comes from approximating (6) with the 3 × 3 bivariate Gauss-Hermite quadrature and from using a numerical optimizer to maximize (7). Comparing the reward-generating distributions in Cases I and II, we found that the distribution of the reward only slightly affects the estimation performance: Case I has a smaller relative bias than Case II. In the Supplementary Materials Section S.1.5, we perform an additional simulation study involving three states (i.e., $m=3$); the results indicate that the Semiparametric RL model with three states also produces accurate estimates.

4. Application to EMBARC Study

We now apply the proposed methodology to our motivating PRT data. A preliminary analysis suggested that the learning pattern might change from the first block to the second block. To avoid bias, we focus only on the PRT data from the first block (the first 100 trials) in this paper.

First, we compare reward learning abilities between patients with MDD and the healthy control group. We fitted our proposed semiparametric reward learning model separately for the 'MDD' group, which contains subjects diagnosed with MDD at baseline (pre-treatment), and the 'Control' group, which contains the healthy subjects with two repeated measurements (no treatment at either time). We examined numbers of interior knots from 3 to 10, with boundary knots {−4, 6} and order $r=2$ for the I-splines. By the AIC criterion, we selected the model with 6 interior knots. The parameters of interest are the transformed group learning rate $\mu_v$ and the group reward sensitivity function $f(\cdot)$. Table 3 shows the estimation results for $\mu_v$ and the elements of $\Sigma$. For inference, a nonparametric bootstrap was applied to generate 200 resampling sets for each group; bootstrap standard errors and 95% bootstrap confidence intervals are shown in Table 3. The results show that the learning rates for the two groups have similar values. We also constructed the 95% confidence interval for the contrast of $\mu_v$ between the MDD and control groups; because this interval, (−0.94, 1.93), covers zero, we conclude that the difference in learning rate between the MDD and control groups is not significant. Another interesting finding is that the learning rate is negatively correlated with reward sensitivity across subjects.
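The knot selection described above can be organized as a simple grid search; in this sketch, `fit_semipar_rl` is a placeholder for the estimation routine of Section 2.3 and is not an existing function.

```r
# Hypothetical sketch: refit the model over a grid of interior-knot counts and keep
# the fit with the smallest AIC, as described for the EMBARC analysis.
select_by_aic <- function(data, n_knots_grid = 3:10, boundary = c(-4, 6), order = 2) {
  fits <- lapply(n_knots_grid, function(K)
    fit_semipar_rl(data, n_knots = K, boundary = boundary, order = order))
  aics <- sapply(fits, function(fit) 2 * fit$n_par - 2 * fit$loglik)  # AIC = 2k - 2 log L
  fits[[which.min(aics)]]
}
```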

Table 3.

Estimation of the parameters in EMBARC study under the proposed method for MDD and Control group.

MDD Control
Parameters Est. Err. 95% CI Est. Err. 95% CI
μv −2.57 0.46 (−3.47, −1.67) −3.07 0.67 (−4.38, −1.75)
σv,v2 10.68 3.75 (4.41, 25.86) 9.08 4.27 (3.06, 26.95)
σγ,γ2 2.66 0.96 (1.50, 4.74) 3.15 2.30 (0.92, 10.86)
σv,γ2 −1.87 0.86 (−3.55, −0.19) −2.92 0.99 (−4.85, −0.98)

Figure 3 presents the nonparametric estimates of the group reward sensitivity function $f(\cdot)$. Figure 3(a) compares the nonparametric estimates of $f(\cdot)$ between the MDD and Control groups using the proposed method. The fitted reward sensitivity functions for both groups are clearly nonlinear. The functions flatten when $x > 2$ or $x < -2$, suggesting that subjects' probability of taking correct actions at given states does not rise indefinitely even if they receive enough rewards. Meanwhile, the reward sensitivity functions for the two groups have similar shapes when $x < 1.5$; $f(x)$ for the control group increases to a higher level than $f(x)$ for the MDD group when $x > 2$. This suggests that the control group has a larger probability of taking correct actions at rich reward states than the MDD group when subjects in both groups receive adequate rewards in rich reward states. Figure 3(b) presents the 95% pointwise confidence band (PCB) and the 95% simultaneous confidence band (SCB) for the contrast of $f(\cdot)$ between the two groups. We evaluate $f(x)$ for $-3 < x < 4$ since few $Z_{it}$ values fall outside this interval. The construction of the PCB uses the bootstrap and a normal approximation, and the construction of the SCB uses a bootstrap method that mimics Hall and Horowitz (2013). For further details regarding the SCB and its coverage rates in simulation studies, see Supplementary Materials Section S.1.4. Because the 95% SCB covers zero over the entire range of $x$, we lack statistical evidence to conclude a significant difference in the reward sensitivity function between the Control and MDD groups. However, the 95% PCB suggests that the Control group may exhibit higher reward sensitivity than the MDD group, particularly when the expected difference in reward between the two stimuli is large (e.g., $x > 2$). As an exploratory analysis, we constructed a 95% SCB for the contrast of $f(\cdot)$ between the two groups restricted to $x > 2$ and observe that the entire band falls below zero; the result is shown in Figure S3 in the Supplementary Materials. This finding provides statistical evidence that the Control group exhibits greater reward sensitivity than the MDD group when the expected difference in reward (rich reward minus lean reward) between the two stimuli exceeds 2.

Fig. 3. (a) The estimated reward sensitivity functions for the MDD and Control groups; (b) the estimate (red curve), 95% pointwise confidence band (shaded area in gray), and 95% simultaneous confidence band (two dash-dotted lines in blue) for the contrast of $f(\cdot)$ between the MDD and Control groups.

Moreover, we applied the proposed model with a linear reward sensitivity function represented by f(x)=cx to both the MDD and Control groups. Figure S4 in the Supplementary Materials displays the estimated linear reward sensitivity functions. Compared to Figure 3 obtained from the semiparametric model, the linear model yields less precise reward sensitivity patterns and a higher AIC. Furthermore, the linear model fails to capture the substantial difference in the reward sensitivity function between the two groups beyond x>2.

The comparison of MDD versus Control examines whether reward sensitivity is a characteristic of MDD that differs between patients and controls. Our results suggest that reward sensitivity obtained from the probabilistic reward task may be considered a behavioral marker or a phenotype of depression. This analysis does not establish a causal relationship between depression and reduced reward sensitivity due to confounding.

Next, we compared reward learning abilities for MDD patients between the sertraline (an antidepressant; SERT) and placebo (PBO) groups. We fitted our proposed semiparametric reward learning models for subjects in the PBO and SERT groups at week 0 and week 1, respectively, using the same knots and order for the I-splines as in the analysis above. Table 4 shows the estimated group learning rate and variance components in the SERT and placebo groups pre- and post-treatment. To investigate whether there is a significant difference in the group learning rate $\mu_v$ between MDD patients who received SERT versus PBO, we constructed a 95% bootstrap confidence interval for the between-arm difference in the change of $\mu_v$ from week 0 to week 1. The resulting interval, (−1.01, 3.10), indicates that the one-week changes in learning rate are not significantly different between the PBO and SERT groups. A recent meta-analysis of the PRT reached the same conclusion of a non-significant group difference in learning rate (Pike and Robinson, 2022).

Table 4.

Estimation of the parameters in EMBARC study for treatment (SERT) and placebo (PBO) group at pre- and post-treatment.

SERT (pre-treatment) SERT (post-treatment)
Parameters Est. Err. 95% CI Est. Err. 95% CI
μv −1.84 0.77 (−3.35, −0.33) −3.17 0.41 (−3.96, −2.37)
σv,v2 6.45 4.03 (1.60, 25.96) 8.90 4.11 (3.15, 25.12)
σγ,γ2 2.92 1.40 (1.12, 7.65) 4.44 1.20 (1.99, 9.89)
σv,γ2 −2.14 1.31 (−4.70, 0.43) −3.62 1.61 (−6.77, −0.46)
PBO (pre-treatment) PBO (post-treatment)
Parameters Est. Err. 95% CI Est. Err. 95% CI
μv −3.03 0.42 (−3.85, −2.22) −3.32 0.53 (−4.35, −2.28)
σv,v2 13.30 3.75 (6.32, 27.97) 5.24 2.22 (2.11, 13.04)
σγ,γ2 4.16 1.50 (1.99, 8.70) 3.09 1.42 (1.42, 6.75)
σv,γ2 −2.46 0.99 (−4.41, −0.52) −1.19 0.63 (−2.43, 0.06)

Figures 4(a) and 4(b) show the estimated reward sensitivity functions $f(\cdot)$ for the SERT and PBO groups pre- and post-treatment. We find that, for the placebo group, the pre-treatment (week 0) and post-treatment (week 1) reward sensitivity functions are similar across the entire support, whereas for the SERT group, $f(x)$ increases to a higher level after treatment when $x > 1.5$. This suggests that the SERT treatment increases MDD patients' probability of taking correct actions at rich reward states when adequate rewards are received. We also constructed the 95% PCB and SCB of the difference between the two groups in the change of $f(\cdot)$ from pre- to post-treatment. Figure 4(c) does not show sufficient statistical evidence that the pre-post change in the reward sensitivity function differs between the two treatment groups. However, the similar pattern in the comparisons of reward sensitivity between Figure 3(b) and Figure 4(c) suggests that sertraline may have a positive impact on MDD patients, potentially bringing their reward learning sensitivity closer to that of healthy individuals at the rich state, which is worth future investigation. Regarding the timing of the post-treatment measurement, the antidepressant is expected to take full effect in about four weeks; the rationale for measuring PRT reward sensitivity one week post-treatment is to detect early responses. With a longer period of treatment, the difference may be greater.

Fig. 4. (a) The estimated reward sensitivity function for the SERT group pre- and post-treatment; (b) the estimated reward sensitivity function for the PBO group; (c) the estimate (red curve), 95% PCB (shaded area in gray), and 95% SCB (two dash-dotted lines in blue) of the difference between the PBO and SERT groups in the pre-post change of the reward sensitivity function $f(\cdot)$ (difference in differences).

To investigate whether dysfunctions in the human brain circuitry are associated with decision-making and learning ability, we examined correlations between subjects’ learning rate, relative sensitivity, and task functional magnetic resonance imaging (fMRI) measures of brain activation in an emotional conflict task assessing amygdala-anterior cingulate (ACC) circuitry (Etkin et al., 2006). We find that both reward sensitivity and learning rate are associated with brain activities in negative affect circuitry evoked by sad stimuli. Detailed analysis can be found in Supplementary Materials Section S.2.1.

5. Discussion

In this paper, we propose a semiparametric inverse reinforcement learning framework to characterize reward-based decision-making, with an application to the probabilistic reward task in the EMBARC study. We assume that a patient's decision-making process relies on two subject-specific factors, the learning rate and reward sensitivity, and on a non-decreasing function of nonparametric form shared across patients. Extensive simulation studies showed that our proposed method performs satisfactorily for large sample sizes under different reward-generating distributions. They also show the advantage of the semiparametric structure when the true underlying function lies beyond a restrictive parametric form such as a sigmoid function. In the application, we find that MDD patients and controls, as well as patients in the SERT and PBO groups, have similar learning rates; however, the groups have different reward sensitivity functions. The results in this paper are consistent with the findings in Huys et al. (2013), but the proposed model provides a more detailed description of how reward sensitivity differs. We also find that the behavioral phenotypes, including the learning rate and reward sensitivity, are associated with human brain activity in the negative affect circuitry.

Estimation is carried out by maximizing the joint conditional log-likelihood over all patients and all trials, where the non-decreasing function and its first-order derivative are characterized by I-splines and M-splines. Because of the close relationship between M-splines and B-splines, M-splines and I-splines share the same good approximation power as B-splines. We studied asymptotic consistency and asymptotic distributions for the parameters in the proposed model. Note that in the theoretical setting we assume that $n$ (the number of patients) goes to infinity while $T$ (the number of trials) is fixed; it is much more challenging to allow both $n$ and $T$ to go to infinity, and we will study this setting in future work. The proposed model can also be regarded as a time-varying model in which the reward sensitivity is allowed to change over time. One future direction is to allow the learning rate to evolve gradually as well.

Our proposed work aligns with imitation learning or behavioral cloning in the RL literature. A related line of research is inverse RL (Ng et al., 2000; Abbeel and Ng, 2004; Arora and Doshi, 2021; Luckett et al., 2021), which assumes an agent’s behavior follows an optimal policy and seeks to learn the corresponding reward function from the agent’s observed behavior and decision-making process. In contrast, the problem we study here aims to recover the agent’s reward sensitivity function and learning rate without assuming the agent’s behavior follows an optimal policy.

Our current work does not accommodate heterogeneity in $Q_{it}$ beyond the learning rate and reward sensitivity. An extension is to model additional components within a mixed effects model framework, at the cost of more complex modeling and additional assumptions. Similarly, covariates can be introduced to model systematic heterogeneity in learning rates, reward sensitivity, and other parameters. Some recent work shows that the perceptual decision-making process can alternate between multiple interleaved strategies. Ashwood et al. (2022) provide an example in which the decision process is a mixture model of "engaged" and "lapse" trials: when the decision-making strategy is "engaged", a subject chooses according to the RL model; when the strategy is "lapse", the subject ignores the stimulus and chooses based on a fixed probability. The strategy switching is characterized by a hidden Markov model. An extension to a semiparametric inverse reinforcement learning framework that allows strategy switching is of interest. Another direction worth pursuing is a broader reinforcement learning context where states are not assigned randomly, so that state distributions depend on past actions. Lastly, the current model is not restricted to characterizing decision-making in MDD patients. Our flexible framework can be readily applied to model other mental disorders, such as schizophrenia, and translates neuroscience and behavioral science constructs to clinical applications (Huys et al., 2016).


Acknowledgements

This research was supported by the National Institutes of Health funding MH123487, NS073671, and GM124104. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The authors report there are no competing interests to declare. Data and/or research tools used in the preparation of this manuscript were obtained from the National Institute of Mental Health (NIMH) Data Archive (NDA). NDA is a collaborative informatics system created by the National Institutes of Health to provide a national resource to support and accelerate research in mental health. Dataset identifier(s): 2199. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or of the Submitters submitting original data to NDA.

Appendix

A.1. Representation of M-splines and I-splines

According to Ramsay (1988), for a set of M-splines on $[L,U]$ with $K$ basis functions and order $r$, denote the $K$ basis functions as $M_k(\cdot) := M_k(\cdot \mid r)$, $k=1,\ldots,K$. Denote the knot sequence of the M-splines as $\tau_1, \tau_2, \ldots, \tau_{K+r}$, where the boundary knots satisfy $\tau_1 = \cdots = \tau_r = L$ and $\tau_{K+1} = \cdots = \tau_{K+r} = U$, and the interior knots satisfy $L < \tau_{r+1} < \cdots < \tau_K < U$. (Please refer to Ramsay (1988) for a more general case that allows ties among the interior knots.) The M-splines can be constructed by the recursion

$$M_k(x \mid 1) = \frac{1}{\tau_{k+1} - \tau_k}\, I(\tau_k \le x < \tau_{k+1}), \qquad M_k(x \mid r) = \frac{r\left[(x-\tau_k)\, M_k(x \mid r-1) + (\tau_{k+r}-x)\, M_{k+1}(x \mid r-1)\right]}{(r-1)(\tau_{k+r}-\tau_k)}.$$

It can be shown that $M_k(x \mid r) > 0$ only when $\tau_k \le x < \tau_{k+r}$. Also, $\int M_k(x \mid r)\, dx = 1$ for all $k$. Define a set of I-splines as the integrals of the corresponding set of M-splines, i.e.,

$$I_k(x \mid r) = \int_L^x M_k(u \mid r)\, du, \qquad k = 1, \ldots, K;$$

$I_k(x \mid r)$ is guaranteed to be non-decreasing. A function represented by I-splines of order $r$ is $r-1$ times continuously differentiable, and its $r$-th derivative is bounded on $[L,U]$; a function represented by M-splines of order $r$ is $r-2$ times continuously differentiable, and its $(r-1)$-th derivative is bounded on $[L,U]$. Finally, a function represented by M-splines can also be represented by B-splines; more specifically, a set of B-splines can be expressed in terms of a set of M-splines with the same knots and order, i.e., $B_k(x \mid r) = (\tau_{k+r} - \tau_k) M_k(x \mid r)/r$, hence M-splines and I-splines maintain the same good approximation power as B-splines (Schumaker, 2007). In practice, AIC can be used as the model selection criterion: a grid search over spline orders and interior knots can be conducted to select the I-splines with the smallest AIC value.
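As a check on the recursion, the following R sketch evaluates a single M-spline basis function directly from the definition above; it is written for clarity rather than speed, and terms with zero denominators (arising from repeated boundary knots) are dropped by convention.

```r
# Sketch: evaluate M_k(x | r) by the recursion above, for a knot sequence tau of length K + r.
msp <- function(x, k, r, tau) {
  if (r == 1) {
    if (tau[k + 1] > tau[k] && x >= tau[k] && x < tau[k + 1]) 1 / (tau[k + 1] - tau[k]) else 0
  } else {
    denom <- (r - 1) * (tau[k + r] - tau[k])
    if (denom == 0) return(0)                      # convention for repeated knots
    r * ((x - tau[k]) * msp(x, k, r - 1, tau) +
         (tau[k + r] - x) * msp(x, k + 1, r - 1, tau)) / denom
  }
}

# I_k(x | r) is the integral of M_k from L = tau[1] to x, e.g. by numerical integration:
isp <- function(x, k, r, tau) integrate(Vectorize(function(u) msp(u, k, r, tau)), tau[1], x)$value
```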

A.2. Asymptotic Theory

To simplify the notation, denote $\theta = (\alpha, \Omega, \mu_v, \Sigma) \in \mathbb{R}^d$. Let $\theta_0$ and $f_0(\cdot)$ denote the true parameters and the true nonparametric function. The log-likelihood function from a single subject is

$$\ell(\theta, f) = \log \int q(a, \theta, f)\, da, \tag{10}$$

where $a = (v, \gamma)$ and

$$q(a, \theta, f) = \prod_{t=1}^{T} \frac{\exp\{A_t f(Z_t)\}}{1 + \exp\{f(Z_t)\}}\, \phi(a; \theta). \tag{11}$$

For any vector $b \in \mathbb{R}^d$ and function $h \in L^2[L,U]$ with bounded $r$-th derivatives, we define the score operators of $\ell(\theta, f)$, $l_\theta(\theta,f)[b]$ and $l_f(\theta,f)[h]$, by differentiating the log-likelihood function with respect to $\theta$ and $f$ along the submodels $\theta + \epsilon b$ and $f + \epsilon h$. Define the information operator of $\ell(\theta, f)$ as $\mathcal{I}(\theta, f) = E\{(l_\theta, l_f)^{*}(l_\theta, l_f)\}$, where $(l_\theta, l_f)^{*}$ is the dual operator of $(l_\theta, l_f)$. $\mathcal{I}(\theta, f)$ is a self-adjoint operator from

$$\mathcal{H} = \{(b, h) : b \in \mathbb{R}^d,\ h \in L^2[L,U] \text{ with bounded } r\text{-th derivatives}\}$$

to itself. Define $\mathcal{I}^{1/2}(\theta, f)$ as the square-root operator of $\mathcal{I}(\theta, f)$ such that $\mathcal{I} = \mathcal{I}^{1/2} \circ \mathcal{I}^{1/2}$, where $\circ$ is the operator composition. To maximize (7), we only consider nonparametric functions in

$$\mathcal{S}_n = \left\{ \tilde{f} : \tilde{f}(x) = \sum_{k=1}^{K_n} \tilde{\eta}_k \{ I_k(x) - I_k(0) \},\ \tilde{\eta}_k \in \mathbb{R} \right\}.$$

The following conditions are needed for the theorems in this paper.

Condition 1. The true value $\theta_0$ lies in the interior of a known compact set in $\mathbb{R}^d$. The true function $f_0(\cdot)$ is increasing with $f_0(0) = 0$ on a known bounded interval $[L, U]$, where $L < 0$ and $U > 0$; $f_0(x) = f_0(U)$ for $x > U$ and $f_0(x) = f_0(L)$ for $x < L$; and $f_0(\cdot)$ is $r-1$ times continuously differentiable with its $r$-th derivative bounded on $[L, U]$, where $r \ge 2$.

Condition 2. Let $\tilde{f}_0$ be the projection of $f_0$ onto $\mathcal{S}_n$ under the $L^2$ norm. We assume that there exists a constant $C > 0$ such that $\tilde{f}_0$ belongs to the subspace of $\mathcal{S}_n$:

$$\mathcal{S}_n^{*}(C) = \left\{ \tilde{f} : \tilde{f}(x) = \sum_{k=1}^{K_n} \tilde{\eta}_k \{ I_k(x) - I_k(0) \},\ 0 \le \tilde{\eta}_k < C K_n^{-1} \right\}.$$

Condition 3. For any two different sets of parameters $(\theta_1, f_1)$ and $(\theta_2, f_2)$, $\ell(\theta_1, f_1) \ne \ell(\theta_2, f_2)$ with positive probability. The information operator $\mathcal{I}(\theta, f)$ of the log-likelihood function $\ell(\theta, f)$ at the true parameter values $(\theta_0, f_0)$ is invertible and satisfies

$$C_{\min}\left( \|b\|_2^2 + \|h\|_2^2 \right) \le \left\| \mathcal{I}^{1/2}(\theta_0, f_0)[b, h] \right\|_{P_2}^2 \le C_{\max}\left( \|b\|_2^2 + \|h\|_2^2 \right), \qquad \forall (b, h) \in \mathcal{H},\ 0 < C_{\min} < C_{\max} < \infty.$$

Condition 4. The number of I-spline basis functions $K_n$ satisfies $K_n^{2r-2} n^{-1/2} \to \infty$ and $K_n^{1/2} n^{-1/2} \to 0$. The distance between adjacent interior knots is between $c^{-1} K_n^{-1}$ and $c K_n^{-1}$ for some constant $c > 1$.

Condition 1 ensures that the function $f_0$ is bounded and smooth. Conditions 2 and 4 guarantee that $\tilde{f}$ and $\tilde{f}'$ are uniformly bounded for all $\tilde{f} \in \mathcal{S}_n^{*}(C)$, where $\tilde{f}'$ is the first-order derivative of $\tilde{f}$. The first part of Condition 3 is the identifiability condition, and the second part implies that the information operator is invertible in the $L^2$ space. In particular, the identifiability condition implies that the latent variable $Z_{it}$ defined in equation (4) does not degenerate and has a continuous density with support containing $[L, U]$, the domain of $f$. In Condition 4, we may set $K_n = n^{\delta}$, where $1/\{4(r-1)\} < \delta < 1$.

We state the consistency and asymptotic distribution of the estimators for the model parameters in the following two theorems, whose proofs are given in the supplementary materials Section S.3.

Theorem 1. Under Conditions 1-4, $\|\hat{\theta} - \theta_0\|_2 \to 0$ and $\|\hat{f} - f_0\|_2 \to 0$ in probability. Furthermore, $\|\hat{\theta} - \theta_0\|_2^2 + \|\hat{f} - f_0\|_2^2 = o_p(n^{-1/2})$.

To describe the asymptotic distributions of $\hat{f}$ and $\hat{\theta}$, we first introduce the set $\mathcal{F} = \{h(x) : h(x) \text{ has } r\text{-th derivatives bounded by } 1 \text{ on } [L, U]\}$. Define $\mathcal{O}_\theta$ to be the unit ball in $\mathbb{R}^d$. We then treat $(\hat{\theta} - \theta_0, \hat{f} - f_0)$ as a stochastic process in $l^{\infty}(\mathcal{O}_\theta \times \mathcal{F})$ whose value at $(b^{*}, h^{*})$ is defined as

$$b^{*\top}(\hat{\theta} - \theta_0) + \int h^{*}(u)\{\hat{f}(u) - f_0(u)\}\, du,$$

where $b \in \mathcal{O}_\theta$, $h \in \mathcal{F}$, and $(b^{*}, h^{*}) = \mathcal{I}(\theta_0, f_0)[b, h]$. The following theorem shows the weak convergence of this stochastic process.

Theorem 2. Under Conditions 1-4, $n^{1/2}(\hat{\theta} - \theta_0, \hat{f} - f_0)$ converges in distribution to a zero-mean, tight Gaussian process in the metric space $l^{\infty}(\mathcal{O}_\theta \times \mathcal{F})$ as $n \to \infty$.

The proof of Theorem 2 also implies the asymptotic normality of $\hat{\theta}$ and of smooth functionals of $\hat{f}$. However, it does not imply pointwise normality for $\hat{f}$, although the simulation studies suggest that normality is still plausible. Also note that the convergence rate given in Theorem 1 is an intermediate result needed for proving the faster rate and asymptotic normality of $\hat{\theta}$ in Theorem 2. On the other hand, the convergence rates for $\hat{f}$ in the two theorems are not directly comparable: Theorem 1 gives the convergence rate of $\hat{f}$ in the $L^2$ metric, while the convergence rate and normality for $\hat{f}$ in Theorem 2 refer to the functionals $\int h^{*}(u) \hat{f}(u)\, du$.

References

1. Abbeel P and Ng AY (2004). Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, page 1.
2. APA (2013). Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition. American Psychiatric Association.
3. Arora S and Doshi P (2021). A survey of inverse reinforcement learning: Challenges, methods and progress. Artificial Intelligence, 297:103500.
4. Ashwood ZC, Roy NA, Stone IR, Urai AE, Churchland AK, Pouget A, and Pillow JW (2022). Mice alternate between discrete strategies during perceptual decision-making. Nature Neuroscience, 25(2):201-212.
5. Etkin A, Egner T, Peraza DM, Kandel ER, and Hirsch J (2006). Resolving emotional conflict: A role for the rostral anterior cingulate cortex in modulating activity in the amygdala. Neuron, 51(6):871-882.
6. Foerde K, Steinglass JE, Shohamy D, and Walsh BT (2015). Neural mechanisms supporting maladaptive food choices in anorexia nervosa. Nature Neuroscience, 18(11):1571-1573.
7. Gravetter FJ and Forzano L-AB (2018). Research Methods for the Behavioral Sciences. Cengage Learning.
8. Hall P and Horowitz J (2013). A simple bootstrap method for constructing nonparametric confidence bands for functions. The Annals of Statistics, pages 1892-1921.
9. Huys QJ, Maia TV, and Frank MJ (2016). Computational psychiatry as a bridge from neuroscience to clinical applications. Nature Neuroscience, 19(3):404-413.
10. Huys QJ, Pizzagalli DA, Bogdan R, and Dayan P (2013). Mapping anhedonia onto reinforcement learning: A behavioural meta-analysis. Biology of Mood & Anxiety Disorders, 3(1):1-16.
11. Insel T, Cuthbert B, Garvey M, Heinssen R, Pine DS, Quinn K, Sanislow C, and Wang P (2010). Research domain criteria (RDoC): Toward a new classification framework for research on mental disorders. American Journal of Psychiatry, 167(7):748-751.
12. Jäckel P (2005). A note on multivariate Gauss-Hermite quadrature. London: ABN-Amro.
13. Kendler KS, Karkowski LM, and Prescott CA (1999). Causal relationship between stressful life events and the onset of major depression. American Journal of Psychiatry, 156(6):837-841.
14. Lange K, Chambers J, and Eddy W (1999). Numerical Analysis for Statisticians, volume 2. Springer.
15. Luckett DJ, Laber EB, Kim S, and Kosorok MR (2021). Estimation and optimization of composite outcomes. The Journal of Machine Learning Research, 22(1):7558-7597.
16. Ng AY, Russell S, et al. (2000). Algorithms for inverse reinforcement learning. In ICML, volume 1, page 2.
17. Passos IC, Mwangi B, and Kapczinski F (2019). Personalized Psychiatry: Big Data Analytics in Mental Health. Springer.
18. Pike AC and Robinson OJ (2022). Reinforcement learning in patients with mood and anxiety disorders vs control individuals: A systematic review and meta-analysis. JAMA Psychiatry.
19. Pizzagalli DA, Jahn AL, and O'Shea JP (2005). Toward an objective characterization of an anhedonic phenotype: A signal-detection approach. Biological Psychiatry, 57(4):319-327.
20. Ramsay JO (1988). Monotone regression splines in action. Statistical Science, pages 425-441.
21. Rescorla RA (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. Current Research and Theory, pages 64-99.
22. Ross S and Bagnell D (2010). Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661-668. JMLR Workshop and Conference Proceedings.
23. Schumaker L (2007). Spline Functions: Basic Theory. Cambridge University Press.
24. Sutton RS and Barto AG (2018). Reinforcement Learning: An Introduction. MIT Press.
25. Trivedi MH, McGrath PJ, Fava M, Parsey RV, Kurian BT, Phillips ML, Oquendo MA, Bruder G, Pizzagalli D, Toups M, et al. (2016). Establishing moderators and biosignatures of antidepressant response in clinical care (EMBARC): Rationale and design. Journal of Psychiatric Research, 78:11-23.
26. Vrieze E, Pizzagalli DA, Demyttenaere K, Hompes T, Sienaert P, de Boer P, Schmidt M, and Claes S (2013). Reduced reward learning predicts outcome in major depressive disorder. Biological Psychiatry, 73(7):639-645.
27. Whiteford HA, Degenhardt L, Rehm J, Baxter AJ, Ferrari AJ, Erskine HE, Charlson FJ, Norman RE, Flaxman AD, Johns N, et al. (2013). Global burden of disease attributable to mental and substance use disorders: Findings from the Global Burden of Disease Study 2010. The Lancet, 382(9904):1575-1586.
28. Whitton AE, Treadway MT, and Pizzagalli DA (2015). Reward processing dysfunction in major depression, bipolar disorder and schizophrenia. Current Opinion in Psychiatry, 28(1):7.
29. Williams LM (2016). Precision psychiatry: A neural circuit taxonomy for depression and anxiety. The Lancet Psychiatry, 3(5):472-480.
