PLOS One. 2020 Nov 12;15(11):e0242201. doi: 10.1371/journal.pone.0242201

Simulating bout-and-pause patterns with reinforcement learning

Kota Yamada 1,2,*, Atsunori Kanemura 2
Editor: Gennady Cymbalyuk
PMCID: PMC7660465  PMID: 33180864

Abstract

Animal responses occur according to a specific temporal structure composed of two states, where a bout is followed by a long pause until the next bout. Such a bout-and-pause pattern has three components: the bout length, the within-bout response rate, and the bout initiation rate. Previous studies have investigated how these three components are affected by experimental manipulations. However, it remains unknown what underlying mechanisms cause bout-and-pause patterns. In this article, we propose two mechanisms and examine computational models developed based on reinforcement learning. The model is characterized by two mechanisms. The first mechanism is choice: an agent makes a choice between operant and other behaviors. The second mechanism is cost: a cost is associated with the changeover of behaviors. These two mechanisms are extracted from past experimental findings. Simulation results suggested that both the choice and cost mechanisms are required to generate bout-and-pause patterns; if either of them is knocked out, the model does not generate bout-and-pause patterns. We further analyzed the proposed model and found that it reproduces the relationships between experimental manipulations and the three components that have been reported by previous studies. In addition, we showed that alternative models can generate bout-and-pause patterns as long as they implement the two mechanisms.

Introduction

Animals engage in various activities in their daily lives. For humans, they may be working, studying, practicing sports, or playing video games. For rats, they may be grooming, foraging, or escaping from a predator. Although specific activities are different between different species, common behavioral features are often observed.

Bout-and-pause patterns are one of the behavioral features commonly observed in many species. Activities in which an animal engages do not occur uniformly through time but often include short periods in which a burst of responses is observed. For example, in an operant conditioning experiment, a rat presses a lever repeatedly for a short period and then stops lever pressing. After a moment, the rat starts lever pressing again. The rat switches between pressing the lever and not pressing it again and again throughout the experiment. Such a temporal structure, comprising short-period response bursts and long pauses, is observed in various species and activities; for example, email and letter communication by humans [1], foraging by cows [2], and walking by Drosophila [3].

Shull et al. [4] showed that bout-and-pause patterns, observed under an environment where rewards are available probabilistically at a constant rate (variable interval (VI) schedule), can be described with a broken-stick shape in the log-survivor plot of interresponse times (IRTs), which are characterized by a bi-exponential probability model. If IRTs follow a single exponential distribution, then the log-survivor plot shows a straight line. If IRTs follow a mixture exponential distribution called a bi-exponential model, the log-survivor plot shows a broken-stick shape composed of two straight lines that have different slopes. Killeen et al. [5] found that lever pressing by rats is well described with a bi-exponential model, suggesting that this behavior has a bout-and-pause pattern. If IRTs follow a bi-exponential distribution, there are two different types of responses: within-bout responses, which have short IRTs, and between-bout responses, which have long IRTs. Each response type has its own exponential distribution in a bi-exponential model. Killeen et al. [5] formulated the bi-exponential model as follows:

p(IRT = τ) = (1 − q) ω e^{−ωτ} + q b e^{−bτ},  (1)

where the first term describes IRTs of within-bout responses and the second term describes IRTs of between-bout responses. This model has three free parameters: q, ω, and b, each of which corresponds to a different component in bout-and-pause patterns. First, q denotes the mixture ratio of the two exponential distributions in the model and it corresponds to the mean length of a bout. The bout length is the number of responses contained in one bout. Second, ω denotes the rate parameter for the exponential distribution of within-bout IRTs and it corresponds to the within-bout response rate. Finally, b denotes the rate parameter for the exponential distribution of between-bout IRTs and it corresponds to the bout initiation rate. These three model parameters define the overall response rate. They are also called bout components.
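As a concrete illustration, the following minimal Julia sketch samples IRTs from Eq (1) and computes the empirical log-survivor function; the function names and parameter values are ours and purely illustrative, not taken from the simulations reported below.

```julia
# Draw one IRT from the bi-exponential model in Eq (1): with probability q
# the IRT is a between-bout interval (rate b), otherwise a within-bout
# interval (rate ω). Exponential sampling via the inverse CDF.
function sample_irt(q, ω, b)
    rate = rand() < q ? b : ω
    return -log(1 - rand()) / rate
end

# Empirical log-survivor value at τ: log of the proportion of IRTs > τ.
log_survivor(irts, τ) = log(count(x -> x > τ, irts) / length(irts))

irts = [sample_irt(0.25, 3.0, 0.1) for _ in 1:10_000]  # arbitrary q, ω, b
# Plotting log_survivor against a grid of τ values yields the
# broken-stick shape described above.
```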

The bout length, the within-bout response rate, and the bout initiation rate are affected by motivational and schedule-type manipulations [4, 6–11]. Motivational manipulations include the reinforcement rate, the response-reinforcement contingency, and the deprivation level. An example of a schedule-type manipulation is adding a small variable ratio (VR) schedule in tandem to a variable interval (VI) schedule.

Table 1 summarizes existing findings on the relationships between experimental manipulations and two of the bout components. The bout length was reported to be affected by manipulations as follows:

Table 1. Previous findings from animal experiments on the relationships between manipulations and bout components.

                       Motivational                                       Schedule type
                       Reinforcement rate  Deprivation level  Extinction  Tandem VR
Bout length            ↗ or → ?            ↗ or → ?           ↘ or → ?    ↗
Bout initiation rate   ↗                   ↗                  ↘           ↘ or → ?

The cells marked with “?” do not have agreement within the previous reports.

  • It increases or stays the same as the reinforcement rate increases [4, 6].

  • It increases or stays the same as the deprivation level increases [4, 7, 8].

  • It decreases or stays the same under extinction [10, 12].

  • It increases under tandem VR [4, 6, 13]. When a VI schedule is followed by a small VR (tandem VI VR), an animal stays in a bout longer and emits more responses in each bout.

The bout initiation rate was reported to behave as follows:

  • It increases as the reinforcement rate increases [4, 6, 14, 15].

  • It increases as the deprivation level increases [4, 7, 8].

  • It decreases by extinction [10, 12, 16].

  • It decreases or stays the same under tandem VR [10]. Brackney et al. [10] showed that if a small VR schedule is added in tandem to a VI schedule, the bout initiation rate decreases slightly.

Although previous studies have investigated the relationships between some experimental manipulations and the bout components, we still do not know how to construct a model that generates bout-and-pause patterns based on the experimental findings. Smith et al. [17] showed experimentally that choice and cost play important roles in organizing responses into bout-and-pause patterns. When pigeons were trained under a single schedule, the log-survivor plot did not show a broken-stick shape [18, 19]. Smith et al. [17] trained pigeons under a concurrent VI VI schedule with and without a changeover delay (COD). When pigeons were trained under the concurrent VI VI schedule without a COD, the log-survivor plot still did not show a broken stick, resulting in a straight line. However, under the concurrent VI VI schedule with a COD, the log-survivor plot showed a broken stick, indicating that bout-and-pause patterns were clearly observed. Similar observations have been made for rats, assuming that they engage in alternative behaviors during conditioning [20]. From these experimental observations, we extracted the following three facts. 1) When animals engage in only one response in a given situation, bout-and-pause patterns are not observed. 2) If animals can choose responses from two alternatives without a COD, bout-and-pause patterns are still not observed. 3) Considering 1) and 2), we conclude that bout-and-pause patterns are organized only when animals have two (or more) possible alternatives in a given situation (i.e., choice is available) and there is a COD between the start of engagement and a reinforcement (i.e., a cost is associated with a changeover). These facts are interesting, but they remain inductive and we still lack a constructive explanation that generates bout-and-pause patterns. Existing studies on bout-and-pause patterns have aimed to describe the phenomena rather than to provide constructive models. Although many models have been proposed [5, 9, 21], they are descriptive and did not answer the question of what mechanisms shape responses into bout-and-pause patterns.

Kulubekova and McDowell [22] examined a computational model aimed at reproducing bout-and-pause patterns based on the principle of selection by consequences developed by McDowell [23], but they did not test which mechanisms are behind bout-and-pause patterns. In other words, they showed that a computational model of selection by consequences could reproduce bout-and-pause patterns but did not show the minimal requirements for reproducing them.

In this article, we propose a computational model based on reinforcement learning that accounts for a constructive mechanism of bout-and-pause patterns. We assume that bout-and-pause patterns are generated by two mechanisms: a choice between the operant and other behaviors, and a cost required to make a transition from one behavior to another. We suppose that motivational manipulations affect only the choice mechanism and schedule-type manipulations affect the cost mechanism. To incorporate these two mechanisms, we design a three-state Markov transition model, which has an extra state in addition to the bout and pause states. We perform three simulation studies to analyze the proposed model. In Simulation 1, we introduce our model on the basis of the two mechanisms, choice and cost. We show that the proposed model can reproduce bout-and-pause patterns by confirming that the log-survivor plot shows a broken-stick shape. We compare three models: a dual model, a no cost model, and a no choice model. The dual model is composed of both the choice and cost mechanisms. The no cost model has only the choice mechanism and the no choice model has only the cost mechanism. Simulation results demonstrate that the dual model can reproduce bout-and-pause patterns, whereas the other two models fail to reproduce them. This implies that both choice and cost are required for animal responses to be organized into bout-and-pause patterns. In Simulation 2, we analyze the dual model in depth and report its behavior under various experimental settings, testing whether the dual model can reproduce the relationships between the experimental manipulations and the bout components discovered so far. Simulation results suggest that the dual model can reproduce them not only qualitatively but also quantitatively. In Simulation 3, we show that a two-state model can also reproduce bout-and-pause patterns even without the third state, because it incorporates the two mechanisms. However, having the third state is useful for separating the effects of the choice and cost mechanisms. We speculate that real animals might have mechanisms similar to the dual model for generating bout-and-pause patterns, and the model can be a useful computational tool for studying animal behavior.

1 Simulation 1

1.1 Material and method

Model

Our model is based on reinforcement learning [24]. We designed a three-state Markov process for modeling bout-and-pause patterns (Fig 1(a)). Two of the three states are “Operant” and “Others,” in which the agent engages in the operant behavior or performs other behaviors, respectively. We call them Operant and Others instead of engagement (or visit) and disengagement (or pause), thereby emphasizing that bout-and-pause patterns result from a choice between the operant and other behaviors. In the third “Choice” state, the agent makes a decision between the operant and other behaviors. The Choice state incorporates the first piece of experimental knowledge: animals can choose their behavior from available options (e.g., grooming, exploration, and excretion) when they move freely during an experiment. The second piece of knowledge is that a cost is required to make a transition from one behavior to another. Animals must decide whether to keep doing the same behavior or to make a transition, because fast switching is not optimal if a transition incurs a cost. Fig 1(b) and 1(c) show two knockout models, the no choice model and the no cost model, respectively. In each model, one of the two mechanisms of the dual model is removed. In the no choice model, the agent can choose only the operant behavior in a given situation. In the no cost model, no cost is required when a transition is made.

Fig 1. Model schemes of the dual model, the no choice model and the no cost model.


(a) The model scheme of the dual model. The upper node, the bottom left node, and the bottom right node correspond to the Choice state, the Operant state, and the Others state, respectively. Each arrow denotes a transition from one state to another. (b) The model scheme of the no choice model. In this model, the Others state is omitted. (c) The model scheme of the no cost model. In this model, the self-transitions in the Operant and Others states are omitted.

Here is how the agent travels through the proposed model. In the Choice state, the agent chooses either the operant or other behaviors. As a result of the choice, it moves from the Choice state to the Operant or Others state. It makes the choice based on the preference for each behavior, denoted by Qpref; we explain how to calculate Qpref in the next paragraph. In the Operant state, the agent engages in the operant behavior and, after every response, decides whether to stay in the Operant state or to move back to the Choice state. It decides to stay or move based on Qcost, which represents the cost of a transition to the Choice state and whose mathematical definition is given later in this Model section. The Others state is the same as the Operant state except that the agent performs other behaviors.

The preference Qpref is a function that compares the operant and other behaviors when the agent makes a choice between them. The Qpref function changes over time since it is updated based on the presence (or absence) of a reinforcer per bout. The following equation describes the updating rule for Qpref:

Qpref^(i)(t+1) = Qpref^(i)(t) + αrft (r^(i)(t) − Qpref^(i)(t)),  if a reinforcer is presented,  (2a)
Qpref^(i)(t+1) = Qpref^(i)(t) + αext (0 − Qpref^(i)(t)),  otherwise,  (2b)

where t denotes time in the session; αrft and αext denote the learning rates for reinforcement and extinction, respectively; r denotes the reinforcer value, where r > 0 when a reinforcer is present and r = 0 when it is absent; and i ∈ {Operant, Others} denotes each option, that is, i = Operant if the operant behavior is chosen and i = Others if other behaviors are chosen. We omit the superscript (i) and write Qpref when it can be either i = Operant or Others.
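For concreteness, here is a minimal Julia sketch of this delta-rule update (Eqs 2a and 2b); the function name and the default learning rates are ours, not from the published code:

```julia
# Update the preference Q_pref for the chosen option i (Eqs 2a, 2b).
# When a reinforcer is presented, Q_pref moves toward the reinforcer
# value r; otherwise it decays toward 0 at the extinction rate.
function update_pref(Q_pref, r, reinforced; α_rft = 0.05, α_ext = 0.01)
    if reinforced
        return Q_pref + α_rft * (r - Q_pref)   # Eq (2a)
    else
        return Q_pref + α_ext * (0 - Q_pref)   # Eq (2b)
    end
end
```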

In the Choice state, the agent chooses either the Operant or Others state according to the probability distribution calculated from the preferences for the two behaviors. The probability of a transition to option i ∈ {Operant, Others} is defined as follows:

p_i = exp{β Qpref^(i)(t)} / Σ_{i′ ∈ {Operant, Others}} exp{β Qpref^(i′)(t)},  (3)

where the softmax inverse temperature parameter β represents the degree to which a choice is focused on the highest-value option.
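The choice rule in Eq (3) is an ordinary softmax; a minimal sketch follows (names and the default β are illustrative):

```julia
# Softmax over the two preferences (Eq 3). Larger β concentrates
# choice on the option with the higher Q_pref.
function choice_probs(Q_operant, Q_others; β = 12.5)
    w = exp.(β .* [Q_operant, Q_others])
    return w ./ sum(w)   # [p(Operant), p(Others)]
end
```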

The cost Qcost is a function that defines a barrier to making a transition from the currently performed behavior to the Choice state. We assumed that the cost is independent of the preference and depends only on the number of responses emitted to obtain a reinforcer from a bout initiation. When a reinforcer is presented, the cost function Qcost is updated according to

Qcost^(i)(t+1) = Qcost^(i)(t) + αrft (log x^(i)(t) − Qcost^(i)(t)),  (4)

where x denotes the number of responses emitted to obtain a reinforcer in a bout. x is reset to 1 when the agent receives a reinforcer or comes back to the Choice state without a reinforcer. The other parameters are the same as in Eqs (2a) and (2b), and the same (i)-omitting rule applies to Qcost. In Eq (4), x is attenuated by taking its logarithm. Without this attenuation, the barrier defined by Qcost would become too high and the agent would keep staying in the performed state; to avoid this, we employed Fechner's law [25] to make the performed state less attractive.
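A corresponding sketch of Eq (4), with the logarithmic attenuation of x made explicit (again, the name and default learning rate are ours):

```julia
# Update the changeover cost after a reinforcer (Eq 4). The response
# count x is attenuated by log, following Fechner's law, so that the
# barrier does not grow without bound.
update_cost(Q_cost, x; α_rft = 0.05) = Q_cost + α_rft * (log(x) - Q_cost)
```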

If the agent is in either the Operant or Others state, it decides whether to stay in the same state or to go back to the Choice state. The decision is made according to the probability of staying in the same state, calculated from the cost and the preference for the state, defined as follows:

pstay^(i) = exp{ −1 / (wpref Qpref^(i)(t) + wcost Qcost^(i)(t)) },  (5)

where wpref and wcost are positive weighting parameters for Qpref and Qcost, respectively. We assumed wcost > wpref because schedule-type manipulations have stronger effects on the bout length than motivational manipulations. When Qpref or Qcost increases, pstay increases too.
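A sketch of Eq (5) under our reading of the formula, namely pstay = exp(−1 / (wpref·Qpref + wcost·Qcost)); this reading matches the stated property that pstay increases with either value, but the flattened typography of the source leaves some ambiguity:

```julia
# Stay probability (Eq 5): approaches 1 as the weighted sum of
# preference and cost grows; equals 0 when both values are 0
# (in floating point, -1/0.0 = -Inf and exp(-Inf) = 0.0).
p_stay(Q_pref, Q_cost; w_pref = 1.0, w_cost = 3.5) =
    exp(-1 / (w_pref * Q_pref + w_cost * Q_cost))
```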

Simulation

In Simulation 1, we compared three models: the dual model, the no choice model, and the no cost model. The dual model (Fig 1(a)) includes both the choice and cost mechanisms, as described in the Model section. The second model was the no choice model (Fig 1(b)), which has only the cost mechanism; it can be thought of as the dual model with the choice mechanism removed. In the no choice model, the agent engages only in the operant behavior; in other words, this model chooses only the operant behavior in the Choice state. The third model was the no cost model, which has only the choice mechanism without the cost mechanism. The no cost model chooses either the operant or other behavior independently of the previous behavior; that is, the agent does not remain in the same state and comes back to the Choice state after each response. In the no cost model, the self-transition paths were removed because pstay is very low without Qcost in Eq (5).

Simulation conditions were as follows. The schedule for the operant behavior was VI 120 s (0.5 reinforcers per min) without an inter-trial interval, and the schedule for other behaviors was FR 1. The maximum number of reinforcers in the Operant state was 1,000; that is, when the number of reinforcers reached 1,000, the simulation was terminated. The value of a reinforcer obtained by the operant behavior was r^(Operant) = 1.0 and that obtained by other behaviors was r^(Others) = 0.5. The model parameters were αrft, αext, β, wpref, and wcost. We set αrft = 0.05, αext = 0.01, β = 12.5, wpref = 1.0, and wcost = 3.5. The response probabilities in the Operant and Others states were fixed at 1/3 in each time step. These parameters were designed based on knowledge of experimental conditions; for example, the reinforcer for the operant behavior should be larger than that for other behaviors, implying r^(Operant) > r^(Others). Before the start of the simulation, we initialized the agent and the experimental environment. The initial values of Qpref^(i) and Qcost^(i) were both 0, and we created a VI table according to Fleshler and Hoffman [26]. We set the time step of the simulation to 0.1 s.
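For reference, here is a sketch of the Fleshler and Hoffman [26] progression in its commonly cited form; this is our restatement of the standard formula (with the convention 0·log 0 = 0 for the last interval), and the exact implementation used in the simulations is in the repository linked below:

```julia
# Fleshler–Hoffman intervals for a VI T-second schedule with N intervals.
# The terms telescope so that the mean of the intervals is exactly T.
function fh_intervals(T, N)
    term(k) = k == 0 ? 0.0 : k * log(k)
    return [T * (1 + log(N) + term(N - n) - term(N - n + 1)) for n in 1:N]
end

intervals = fh_intervals(120.0, 20)   # mean ≈ 120 s, as in VI 120 s
```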

We show pseudocode of the model and simulation in Algorithm 1, where NumResponses corresponds to x in Eq (4) and the three Behavior() functions are defined in Algorithms 2, 3, and 4. We implemented the algorithm in Julia 1.0 and ran simulations on a computer with a 1.80 GHz Intel i7-8565 processor, 16 GB of RAM, and a 1 TB SSD, running Ubuntu 18.04 LTS. The same configuration was also used for Simulations 2 and 3. The Julia code is available at: https://github.com/echo0yasum1/simulating_bout_and_pause_pattern.

Algorithm 1 Pseudocode of simulation

t ← 0, NumRewards ← 0, ResponseTimes ← {}, i ← Choice

while NumRewards < 1000 do

tt + 0.1

if i = Choice then

  ChoiceBehavior()

end if

if i = Operant and uniform(0, 1)≤1/3 then

  OperantBehavior()

end if

if i = Others and uniform(0, 1)≤1/3 then

  OthersBehavior()

end if

end while

Algorithm 2 Definition of ChoiceBehavior()

Select a state i ∈ {Operant, Others} with probability defined by Eq (3)

NumResponses ← 1

Algorithm 3 Definition of OperantBehavior()

Append t to ResponseTimes

NumResponsesNumResponses + 1

Select a state i ∈ {Operant, Choice} with probability defined by Eq (5)

if reward is presented then

 Update Qpref(Operant)(t) according to Eq (2a)

 Update Qcost(Operant)(t) according to Eq (4)

NumRewardsNumRewards + 1

NumResponses ← 1

end if

if reward is absent then

if i = Choice then

  Update Qpref(Operant)(t) according to Eq (2b)

end if

end if

Algorithm 4 Definition of OthersBehavior()

NumResponsesNumResponses + 1

Update Qpref(Others)(t) according to Eq (2a)

Update Qcost(Others)(t) according to Eq (4)

reward is presented according to FR 1

NumResponses ← 1

Select a state i ∈ {Others, Choice} with probability defined by Eq (5)

1.2 Results: Simulation 1

Fig 2(a) shows event records of responses generated by each model and Fig 3 shows the model schemes with transition probabilities. The top panel of Fig 2(a) shows that the no choice model generated a dense repetition of only the operant behavior at a high rate without long pauses. From Fig 3, the empirical probability that the agent stayed in the Operant state was 0.95. In the middle panel of Fig 2(a), the response rate under the no cost model was low and each response was separated by long pauses. From Fig 3, the empirical probability of the agent choosing to transition to the Operant state was 0.06, and the agent returned to the Choice state immediately after it responded. In the bottom panel of Fig 2(a), the agent with the dual model generated a repetitive pattern of responses at a high rate in a short period followed by a long pause. From Fig 3, the agent in the Choice state made a transition to the Operant state with a 0.12 probability and stayed in the Operant state with a 0.71 probability.

Fig 2.


(a) Response event records in the (top) no choice, (middle) no cost, and (bottom) dual models in the 50 s period just after 500 reinforcers were presented (event records were stable after 500 reinforcer presentations). Each vertical line denotes one response. (b) Log-survivor plots of the three models drawn by using all the IRTs after 500 reinforcers.

Fig 3. The transition probabilities between the three states that were calculated from the simulation data after the agent obtained 500 reinforcers.


Fig 2(b) shows log-survivor plots that reveal whether each model produces a straight line or a broken stick. We used the IRTs from after the agent obtained 500 reinforcers to the end of the simulation. The log-survivor plots of the no choice model and the no cost model were described by a single straight line, whereas that of the dual model was described by a broken-stick shape. The no choice model had a steeper slope than the no cost model and was tangential to the curve of the dual model at the leftmost position. The slope of the no cost model was slightly steeper than that of the dual model at the right side.

1.3 Discussion: Simulation 1

Both the event records and the log-survivor plots in Fig 2 imply that only the dual model generated bout-and-pause patterns, while the other two models failed to reproduce them. The event records in Fig 2(a) suggest that only the dual model exhibits bout-and-pause patterns. The log-survivor plot of the dual model in Fig 2(b) showed not a straight line but a broken-stick shape, which is evidence that the underlying IRTs follow a bi-exponential distribution. Thus, only the dual model reproduced bout-and-pause patterns.

We posit that both the choice and cost mechanisms are necessary to organize responses into bout-and-pause patterns. The no choice model failed because it lacks the choice mechanism. Without the choice mechanism, the agent almost always stayed in the Operant state and responded at a high rate without pauses. The reason behind the failure of the no cost model was the knockout of the cost mechanism. When the cost of a changeover is zero, the agent easily returns to the Choice state, resulting in sporadic operant responses followed by long pauses. Similar behaviors were observed in pigeons under a concurrent VI VI schedule without a COD [17]. The choice and cost mechanisms contribute differently to generating bout-and-pause patterns: the choice mechanism generates pauses and the cost mechanism produces response bursts. Since the dual model has both mechanisms, it reproduced bout-and-pause patterns.

Since we have full control of the simulation environment and the agent in it, we can exclude the possibility of contamination by other factors. Smith et al. [17]’s results implied that choice and cost are behind bout-and-pause patterns, but it was not clear whether other factors influence the formation of bout-and-pause patterns; this is an inherent limitation of experimental studies. It was not straightforward to draw conclusions like “these mechanisms are enough to generate bout-and-pause patterns” from the experimental finding that IRT distributions observed in pigeons followed a bi-exponential distribution under concurrent VI VI schedules with a COD. In contrast, our constructive approach makes it clear that the two mechanisms are sufficient to reproduce bout-and-pause patterns, a conclusion that is hard to draw from the experimental findings of [17] alone.

We suggest that what is important for generating bout-and-pause patterns is not the specific architecture of our model but the choice and cost mechanisms. Our model is composed of three states and five equations, and those equations come from one of the most popular reinforcement learning algorithms, Q-learning. Even if this architecture and algorithm are substituted with others, the new model will still reproduce bout-and-pause patterns if it involves choice and cost. The specific equation forms, such as the softmax function in Eq (3) or the logarithm in Eq (4), are also replaceable with other forms. We do not reject other possible forms that implement the two mechanisms.

We also do not claim the uniqueness of our experimental settings. Although we employed an FR 1 schedule for the other behaviors, other schedules including VI should produce similar results.

2 Simulation 2

Having demonstrated in Simulation 1 that the dual model successfully reproduced bout-and-pause patterns, in Simulation 2 we analyzed this model under various environments. Previous studies [4, 6–10, 27] have applied various experimental manipulations to animals to understand bout-and-pause patterns, as summarized in Table 1. We applied manipulations to the agent in the model by changing environmental settings.

2.1 Method: Simulation 2

Using the dual model, we performed four experiments, in each manipulating only one of four variables while keeping the other three variables the same as in Simulation 1. The simulation procedure was also the same as in Simulation 1.

The four experimental manipulations were applied independently, one to each of four variables: 1) the rate of reinforcement, 2) the deprivation level, 3) the presence of extinction, and 4) the schedule type. 1) We manipulated the rate of reinforcement by varying the mean interval of the VI schedule. The mean intervals used in this simulation were VI 30 s, 120 s, and 480 s (2.0, 0.5, and 0.125 reinforcers per min). 2) We varied the reward value obtained in the Operant state to control the deprivation level of the agent. The values were 0.5, 1.0, and 1.5, inducing low deprivation, baseline, and high deprivation levels, respectively. The reward value that the agent received by taking other behaviors was the same as in Simulation 1 throughout all the simulations. 3) To attenuate engagement in the operant response, we switched the schedule from VI 120 s (0.5 reinforcers per min) to extinction after the agent obtained 1,000 reinforcers. The extinction phase finished when 3,600 s (36,000 time steps) had elapsed. 4) We manipulated the schedule type by adding a small VR schedule in tandem to a variable time (VT) schedule. The mean interval of the VT schedule was fixed at 120 s and the VR values were 0, 4, and 8.

When we analyzed the IRT data from the extinction simulation, we used a dynamic bi-exponential model [10], in which the model parameters, q, ω, and b, are time-dependent and Eq (1) is rewritten as follows:

p(IRT = τ) = (1 − q_t) ω_t e^{−ω_t τ} + q_t b_t e^{−b_t τ}.  (6)

Extinction causes exponential decay of the model parameters according to the following equations:

1 − q_t = (1 − q_0) e^{−γt},  (7)
b_t = b_0 e^{−δt},  (8)

where the parameters γ and δ denote the decay rates of q and b, respectively. Since the decay of any of the three model parameters q, b, and ω can cause extinction, we need to identify which of these parameters actually decayed during the extinction simulation. We excluded ω because it was fixed at 1/3 during the simulation. To identify whether one or both of the q and b parameters decayed, we compared three models: the qb-decay, q-decay, and b-decay models. We calculated the WAIC (widely applicable information criterion [28]) for each model. We used Markov chain Monte Carlo (MCMC) with Stan [29] to estimate the posterior distributions and used the MCMC samples to calculate the WAIC. The same computational configuration as in Simulation 1 was used in Simulation 2.
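To make the fitted model concrete, here is a sketch of the dynamic bi-exponential density in Eqs (6)–(8); the decay rates γ and δ and the initial values are placeholders, not estimates from the simulation:

```julia
# Time-varying mixture weight and bout-initiation rate (Eqs 7, 8).
q_at(q0, γ, t) = 1 - (1 - q0) * exp(-γ * t)
b_at(b0, δ, t) = b0 * exp(-δ * t)

# Density of an IRT of length τ at session time t (Eq 6); ω is held
# fixed at 1/3, as in the simulation.
function dyn_biexp_pdf(τ, t; q0 = 0.25, ω = 1/3, b0 = 0.1, γ = 1e-3, δ = 1e-3)
    q = q_at(q0, γ, t)
    b = b_at(b0, δ, t)
    return (1 - q) * ω * exp(-ω * τ) + q * b * exp(-b * τ)
end
```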

To examine the molar relationship between the reinforcement rate and response rate, we fitted Herrnstein’s hyperbola [30] to the simulated data. We used its modern version [31],

R = k r^a / (r^a + r_e^a / c),  (9)

where R is the response rate, r is the reinforcement rate, r_e is the external reinforcement rate, k is the total amount of behavior, and a and c are the exponent and bias parameters, respectively. Since the parametrization of the term r_e^a/c is redundant, we did not fit r_e and c separately and estimated only r_e^a/c.
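As a sketch, Eq (9) with the redundant term collapsed into a single parameter, using the values fitted below (k = 187.41, a = 2.25, r_e^a/c = 2.65) as defaults:

```julia
# Modern matching-law hyperbola (Eq 9), parameterized with the
# combined term r_e^a / c as a single free parameter.
herrnstein(r; k = 187.41, a = 2.25, re_a_over_c = 2.65) =
    k * r^a / (r^a + re_a_over_c)
```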

2.2 Results: Simulation 2

Fig 4 shows the log-survivor plots of IRTs from each of the four simulations. Fig 4(a) and 4(b) show that manipulating the rate of reinforcement or the deprivation level changed the slope and intercept of the right limb. As the rate of reinforcement or the deprivation level increased, the slope of the right limb became steeper, indicating that the bout initiation rate became larger. The broken sticks in Fig 4(c) have different slopes and y-axis intercepts, suggesting that both the bout initiation rate and the bout length were changed. Fig 4(d) shows that adding the tandem VR schedule to the VT schedule affected only the y-axis intercept of the right limb without changing its slope. As the response requirement increased from the baseline to VR 4 or VR 8, the bout length became larger. However, the right limbs were not stable, and we performed the fitting analysis described below.

Fig 4. The log-survivor plots of IRTs generated under the manipulation of (a) the rate of reinforcement, (b) the deprivation level, (c) the presence of extinction, and (d) the schedule type, drawn using all data after the agent obtained 500 reinforcers.


Table 2 shows the estimated parameters of the bi-exponential model, q, ω, and b, in the three simulations other than extinction. Parameter q decreased as the reinforcement rate, the deprivation level, and the number of required responses increased. Parameter ω did not change under any manipulation. Parameter b increased as the rate of reinforcement and the deprivation level increased.

Table 2. Estimated parameters of the bi-exponential model in simulations.

Manipulation             Condition                 ω      b      q
Rate of reinforcement    VI 30 (2.0 per min)       3.08   0.23   0.17
                         VI 120 (0.5 per min)      3.06   0.09   0.27
                         VI 480 (0.125 per min)    3.19   0.03   0.31
Deprivation level        High deprivation          3.04   0.24   0.17
                         Baseline                  3.15   0.08   0.26
                         Low deprivation           3.17   0.03   0.31
Tandem VT 120 VR x       VR 0                      3.07   0.09   0.26
                         VR 4                      3.23   0.08   0.19
                         VR 8                      3.11   0.08   0.16

In Fig 4(c), the total number of IRTs during the extinction phase was insufficient to reliably estimate the right limb. We therefore analyzed the dynamic bi-exponential model fitted to the IRTs during extinction. Table 3 shows the WAIC values for the three models. The smallest WAIC was attained by the qb-decay model, but the differences from the other models are not large, and it is not conclusive whether the bout initiation rate, the bout length, or both decayed during extinction.

Table 3. Parameter selection for the dynamic bi-exponential model with WAIC.

Model       WAIC
qb-decay    1.936
b-decay     1.940
q-decay     1.980

The lower the WAIC, the better the model.

Fig 5 shows boxplots of Qpref and Qcost in the three simulations other than extinction, which we use to assess how the changes in the bout components are mediated. We excluded the extinction simulation because we already know that Qpref causes the change in the bout components there, since Qcost is fixed during the extinction phase. The top panel shows that Qpref and Qcost increased as the rate of reinforcement increased. The middle panel indicates that increasing the deprivation level moved Qpref and Qcost upward. From the bottom panel, we can see that adding the tandem VR schedule increased Qcost without affecting Qpref. Table 5 summarizes the dependency of Qpref and Qcost on the experimental manipulations. Comparing Tables 1 and 5, Qpref and Qcost correspond to the bout initiation rate and the bout length, respectively.

Fig 5. Boxplots of Qpref and Qcost in each simulation.


The top, middle, and bottom rows correspond to the reinforcement rate, the deprivation level, and the tandem VT VR simulations, respectively, and the left and right columns show Qpref and Qcost, respectively.

Table 5. The dependency of Qpref and Qcost on experimental manipulations in the dual model.

         Motivational                                       Schedule type
         Reinforcement rate  Deprivation level  Extinction  Tandem VR
Qcost    ↗                   ↗                  →           ↗
Qpref    ↗                   ↗                  ↘           →

Fig 6 shows the relationship between the reinforcement rate and the response rate in our model. The response rate increased with a diminishing gradient, converging to k = 187.41. The other parameters were fitted to be a = 2.25 and r_e^a/c = 2.65. The percentage of variance accounted for (%VAF) was 99.3, and a = 2.25 implies that our model showed overmatching. In our model, β in Eq (3) controls the sensitivity to the difference between the values of the Operant and Others behaviors, and we can change overmatching to strict matching by lowering the value of β.

Fig 6. The response rate as a function of the reinforcement rate.


The dots are from the simulation and the line is the modern version of Herrnstein’s hyperbola (the generalized matching law) fitted to the data.

2.3 Discussion: Simulation 2

In Simulation 2, we tested whether the dual model has the same characteristics as animals reported by the previous studies. We analyzed the model with four experimental manipulations: the rate of reinforcement, the deprivation level, the presence of extinction, and the schedule type. The rate of reinforcement, the deprivation level, and the presence of extinction affected the bout initiation rate and the bout length and adding the tandem VR schedule to the VT schedule affected only the bout length.

Table 4 summarizes the relationships between the experimental manipulations and the bout components observed in the dual model, which suggests that the behaviors of the dual model are consistent with the existing knowledge on animal behaviors. Furthermore, we made stable predictions for the cells with the question marks in Table 1. Our predictions are stable because our results can be easily reproduced and tested using the same simulation code; in contrast, experimental studies with animals can report differing conclusions. Although our model does not implement Herrnstein's hyperbola a priori, the molar relationship between the reinforcement rate and response rate is well described by the modern matching theory (Fig 6). Cheung et al. [12] and Brackney et al. [32] showed that the bout initiation rate and the bout length decayed during extinction. Table 3 shows the parameter selection for the dynamic bi-exponential model with WAIC; the differences between the models are small, but the model with the lowest WAIC is consistent with these previous studies. Therefore, the dual model satisfies at least the necessary conditions for a model to be analyzed for the generation mechanism of bout-and-pause patterns.

Table 4. The behavior of the dual model.

                       Motivational                                       Schedule type
                       Reinforcement rate  Deprivation level  Extinction  Tandem VR
Bout length            ↗ *†                ↗ *†               ↘ *†        ↗ *
Bout initiation rate   ↗ *                 ↗ *                ↘ *         → *†

The cells marked with “*” indicate consistency with the animal findings shown in Table 1. The cells marked with “†” are the cells with “?” in Table 1.

Table 2 shows the estimated parameters of the bi-exponential model in each simulation; they are consistent with parameters reported in previous studies with real animals.

The dependency of Qpref and Qcost on experimental manipulations, shown in Table 5, can be understood according to the categorization of motivational and schedule-type manipulations proposed by Shull et al. [4]. In our simulations, manipulating any of the three motivational variables, i.e., the rate of reinforcement, the deprivation level, or extinction, changed Qpref and Qcost. The change in Qcost was not a primary but a secondary effect, because Qcost changed as a result of the increased Qpref; with a higher Qpref, the agent emits more responses. The schedule-type manipulation affected only Qcost. These changes in Qpref and Qcost are consistent with what was proposed by Shull et al. [4].

The dual model reproduces only some of the previous findings. Here are three examples of its limitations. First, our model is not designed for analyzing the addition of a tandem VR schedule to a VT schedule, for which Tanno [9] and Matsui et al. [21] found a change in the within-bout response rate, which was fixed in our model. Second, the value of a reinforcer and the delay between a response and a reinforcer were fixed in our model. Brackney et al. [10] and Podlesnik et al. [8] considered that a delayed reinforcement from a bout initiation causes the inverse correlation between the bout initiation rate and the bout length. This result can be reproduced only if the response requirement of the tandem VR is very high (more than 32). Third, sometimes the bout length does not decrease during extinction [10]. Our dual model could not reproduce this result even when we changed the model parameters.

3 Simulation 3

In Simulation 3, we examined a two-state model that incorporates the choice and cost mechanisms, to explore the possibility of alternative models, particularly a simpler one. We built a two-state model without the Choice state and ran simulations with it.

3.1 Method: Simulation 3

Fig 7 shows the two-state model, comprising the Operant and Others states. Although it does not have the Choice state, the choice mechanism is implemented as the transitions between the Operant and Others states. The probability of staying in the same state is defined as follows:

p(s_{t+1} = i | s_t = i) = exp{wpref Qpref^(i)(t) + wcost Qcost^(i)(t)} / Σ_{i′} exp{wpref Qpref^(i′)(t) + wcost Qcost^(i′)(t)},  (10)

where wpref and wcost are positive weights for Qpref and Qcost, respectively. The updating rules for Qpref and Qcost are the same as in Eqs (2a), (2b), and (4). The parameters of the two-state model were sought in the ranges shown in Table 6, which include the parameter values used for the three-state dual model. The following parameter settings were selected from the range: αrft = 0.01, αext = 0.01, wpref = 4.0, wcost = 3.5, r^(Operant) = 1.0, and r^(Others) = 0.5.
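Here is a sketch of Eq (10), with the states indexed 1 = Operant and 2 = Others and the vectors holding one value per state (an illustrative reading, not the published code):

```julia
# Two-state model (Eq 10): the probability of staying in state i is a
# softmax over the weighted values of the two states.
function p_stay_two_state(i, Q_pref, Q_cost; w_pref = 4.0, w_cost = 3.5)
    v = w_pref .* Q_pref .+ w_cost .* Q_cost   # one value per state
    w = exp.(v)
    return w[i] / sum(w)
end

p = p_stay_two_state(1, [0.8, 0.4], [1.2, 0.3])  # probability of staying in Operant
```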

Fig 7. Two-state model.


Table 6. Parameter range of the two-state model.

Parameter Min Max Step
αrft 0.01 0.2 0.01
αext 0.01 0.2 0.01
wpref 1.0 6.0 0.1
wcost 1.0 6.0 0.1

To examine whether the two-state model could generate bout-and-pause patterns and whether it could be used for simulations with experimental manipulations, we performed a simulation analysis. We varied the reinforcement rate across VI 30 s, VI 120 s, and VI 480 s, the same values used in Simulation 2.

3.2 Results: Simulation 3

Fig 8 shows the log-survivor plots of IRTs from the simulation of the two-state model with different values of the reinforcement rate. The plots showed broken-stick shapes, and the slopes and intercepts of the right limbs decreased as the reinforcement rate decreased.

Fig 8. Log-survivor plots of IRTs generated by the two-state model under VI 30 s, VI 120 s, and VI 480 s schedules.


3.3 Discussion: Simulation 3

Since the log-survivor plots of IRTs generated by the two-state model showed broken-stick curves, bout-and-pause patterns were reproduced. In addition, the change in the log-survivor plots of the two-state model was consistent with experimental findings. Therefore, alternative models can be constructed even without the explicit third state, as long as, like the two-state model, they implement the two mechanisms through Eq (10).

We nevertheless consider that the three-state dual model has advantages in modeling and analyzing bout-and-pause patterns.

In the three-state dual model, the effects of choice and cost are separated. It is clear in the dual model shown in Fig 1(a) that the choice between the operant and other behaviors is made in the Choice state, and whether the agent continues to stay in the same state is moderated by the cost mechanism in each of the Operant and Others states. This separation can be seen in Eq (3), which describes only the choice rule, and Eq (5), which calculates the stay probability based on the cost mechanism. In the two-state model, however, choice and stay are not well separated; in Eq (10), choice and cost are mixed, and the behavior of the agent cannot be explained by either of them alone.

4 General discussion

In this paper, we have developed a computational model based on reinforcement learning. The model was meant to explain how bout-and-pause patterns can be generated, and we examined its validity by comparing computer simulations with experimental findings. We hypothesized that two independent mechanisms, the choice between the operant and other behaviors and the cost of a changeover between behaviors, are necessary to organize responses into bout-and-pause patterns. We demonstrated in Simulation 1 that the dual model reproduced bout-and-pause patterns under a VI schedule. Simulation 2 showed that the relationships between various experimental manipulations and the bout components in our model were consistent with previous experimental findings. Simulation 3 showed that a two-state model incorporating the two mechanisms can also reproduce bout-and-pause patterns; however, the third state has advantages in analyzing the agent's behavior because it separates the effects of the choice and cost mechanisms. These results support our hypothesis that an agent transitioning between the three states, driven by the choice and cost mechanisms, organizes its responses into bout-and-pause patterns. This is our answer to the question of why bout-and-pause patterns are organized.

Our constructive model reproduced the descriptive results reported by [4, 6, 7, 27]. Although our dual model does not explicitly include the bi-exponential model in Eq (1), IRTs generated by the dual model followed the bi-exponential model.

The fundamental difference between our model based on reinforcement learning and Kulubekova and McDowell [22]'s model based on selection by consequences is that our model explicitly has the choice and cost mechanisms, whereas theirs is unclear about them. Their model did not generate a clear distinction between a burst of responses in a short period and the long pauses that separate bursts, resulting in a dull bend of the log-survivor plot. Kulubekova and McDowell [22] discussed that this divergence from live animals might be due to the lack of CODs in their model. Our model reproduced a clear distinction between bursts and pauses (Fig 2) because it can change CODs through the cost mechanism. Another advantage of our study over Kulubekova and McDowell [22] is that they did not compare their model with alternatives, whereas we tested our hypotheses about the choice and cost mechanisms by the knockout analysis in Simulation 1.

Our model has at least two shortcomings: the range of its parameters and the redundancy of the model. First, the parameters αrft, αext, β, wpref, and wcost in our model have not been optimized to fit behavioral data from real animals. The evidence that supports our parameter selection is that our model quantitatively reproduced bout-and-pause patterns. Second, although our model has five parameters, fewer parameters may suffice to reproduce bout-and-pause patterns. To verify our model on these two points, it would be useful to compare empirical data from real animals with our computational model.

Standing on the model proposed in this paper, we can extend our research in many directions to explain more aspects of bout-and-pause patterns. Here we discuss four of them in the following paragraphs.

First, our results were retrospective with respect to data from previous behavioral experiments, and the proposed model was not tested on its ability to predict unseen data. Our model can suggest a new experiment that could add new knowledge about how manipulating CODs affects animals' behavior under a concurrent VI VI schedule. Smith et al. [17] pointed out that employing asymmetrical CODs in a concurrent VI VI schedule could produce behaviors like those under a single response schedule. Our modeling is consistent with what Smith et al. [17] pointed out but approaches it from a different direction. We consider that, even if an animal is under a single schedule, it makes choices between the operant behavior and other behaviors; this is implemented in our simulation as concurrent VI FR 1. We used an FR 1 schedule for other behaviors in our simulations, but we can change it from FR 1 to a VI schedule so that the whole schedule becomes a concurrent VI VI schedule. In our model, the cost for the operant behavior, defined by Qcost^(Operant), affects the actions of the agent only in the Operant state, without affecting those in the Others state; similarly, the cost for other behaviors influences the agent's actions only in the Others state. Therefore, according to our model, it is expected that, in a concurrent VI VI schedule, if the experimenter varies the COD for one schedule, the behavior of the animal changes only for the varied schedule without affecting the behavior for the other schedule. It will be interesting to conduct such experiments with real animals to reveal the actual effects of CODs on behavior under concurrent VI VI schedules. In this way, our model can bridge animal behaviors observed in concurrent schedules and single schedules by offering a unified framework.

The second direction is verification based on neuroscientific knowledge. Even if the model can correctly predict unseen data from behavioral experiments, it is not guaranteed that animals employ the same model. To explore the real mechanisms that animals implement, it would be effective to compare the internal variables of the model with neural activities measured from real animals during behavioral experiments. Possible experiments are to perform knockout experiments by inducing lesions in specific areas of the brain that should be active during the experiments, or to activate or deactivate specific neurons during the experiment.

Third, we can assess the plausibility of our model in more detail by conducting simulations under new experimental manipulations, including disruptors, or by analyzing measures that we did not analyze. For example, recent studies showed that the distribution of bout lengths is sensitive to experimental manipulations [13, 33, 34]. Sanabria et al. [35] proposed a computational formulation of behavior systems [36], and their descriptive model described bout-and-pause patterns well, including the distribution of bout lengths.

Fourth, we can design models that are not Markov transition models. The bout-and-pause response patterns shown in Fig 2 can be generated by a Markov transition model whose transition matrix is given a priori without reinforcement learning. We argue that the statistical description of the Markov model (i.e., the transition matrix defined by the transition probabilities shown in Fig 3) is not the source of the reproducibility of bout-and-pause patterns. There may be other models that are not formulated by Markov transition, such as the model proposed by McDowell [23]. We can introduce the choice and cost mechanisms to such models.

Reinforcement learning can be employed to model and explain animal behaviors other than bout-and-pause patterns, since it is a general framework in which an agent learns optimal behaviors in a given environment through trial and error [24]. Such a framework agrees well with the three-term contingency in behavior analysis. There are three essential elements in reinforcement learning: a state, an action, and a reward. The state is what the agent observes and is information about the environment. The action is a behavior that the agent takes in a given state. The reward is what the agent obtains as the result of the action. These three elements are similar to a discriminative stimulus, a response, and an outcome. This similarity would allow behavior analysts to employ reinforcement learning in their research. For example, Sakai and Fukai [37] employed actor-critic reinforcement learning to model the matching law. We hope more computational studies will be performed to expand the methods of behavioral science.

Data Availability

Simulation programs are available from GitHub: https://github.com/echo0yasum1/simulating_bout_and_pause_pattern.

Funding Statement

This study was supported in part by a Grant-in-Aid for JSPS Fellows (20J21568) to KY from the Japan Society for the Promotion of Science (http://www.jsps.go.jp/english/e-grants). The funder had no role in study design, data collection, data analysis, or preparation of the manuscript. KY and AK are employed by and receive salaries from LeapMind Inc. (https://leapmind.io/en/), and both authors played roles in the study design, data collection and analysis, decision to publish, and preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section. There was no additional external funding received for this study.

References

  • 1. Barabasi AL. The origin of bursts and heavy tails in human dynamics. Nature. 2005;435(7039):207. doi: 10.1038/nature03459
  • 2. Tolkamp BJ, Kyriazakis I. To split behaviour into bouts, log-transform the intervals. Animal Behaviour. 1999;57(4):807–817. doi: 10.1006/anbe.1998.1022
  • 3. Sorribes A, Armendariz BG, Lopez-Pigozzi D, Murga C, de Polavieja GG. The origin of behavioral bursts in decision-making circuitry. PLoS Computational Biology. 2011;7(6):e1002075. doi: 10.1371/journal.pcbi.1002075
  • 4. Shull RL, Gaynor ST, Grimes JA. Response rate viewed as engagement bouts: Effects of relative reinforcement and schedule type. Journal of the Experimental Analysis of Behavior. 2001;75(3):247–274. doi: 10.1901/jeab.2001.75-247
  • 5. Killeen PR, Hall SS, Reilly MP, Kettle LC. Molecular analyses of the principal components of response strength. Journal of the Experimental Analysis of Behavior. 2002;78(2):127–160. doi: 10.1901/jeab.2002.78-127
  • 6. Shull RL, Grimes JA, Bennett JA. Bouts of responding: The relation between bout rate and the rate of variable-interval reinforcement. Journal of the Experimental Analysis of Behavior. 2004;81(1):65–83. doi: 10.1901/jeab.2004.81-65
  • 7. Shull RL. Bouts of responding on variable-interval schedules: Effects of deprivation level. Journal of the Experimental Analysis of Behavior. 2004;81(2):155–167. doi: 10.1901/jeab.2004.81-155
  • 8. Podlesnik CA, Jimenez-Gomez C, Ward RD, Shahan TA. Resistance to change of responding maintained by unsignaled delays to reinforcement: A response-bout analysis. Journal of the Experimental Analysis of Behavior. 2006;85(3):329–347. doi: 10.1901/jeab.2006.47-05
  • 9. Tanno T. Response-bout analysis of interresponse times in variable-ratio and variable-interval schedules. Behavioural Processes. 2016;132:12–21. doi: 10.1016/j.beproc.2016.09.001
  • 10. Brackney RJ, Cheung TH, Neisewander JL, Sanabria F. The isolation of motivational, motoric, and schedule effects on operant performance: A modeling approach. Journal of the Experimental Analysis of Behavior. 2011;96(1):17–38. doi: 10.1901/jeab.2011.96-17
  • 11. Chen X, Reed P. Factors controlling the micro-structure of human free-operant behaviour: Bout-initiation and within-bout responses are effected by different aspects of the schedule. Behavioural Processes. 2020;104106. doi: 10.1016/j.beproc.2020.104106
  • 12. Cheung TH, Neisewander JL, Sanabria F. Extinction under a behavioral microscope: Isolating the sources of decline in operant response rate. Behavioural Processes. 2012;90(1):111–123. doi: 10.1016/j.beproc.2012.02.012
  • 13. Brackney RJ, Sanabria F. The distribution of response bout lengths and its sensitivity to differential reinforcement. Journal of the Experimental Analysis of Behavior. 2015;104(2):167–185. doi: 10.1002/jeab.168
  • 14. Reed P. The structure of random ratio responding in humans. Journal of Experimental Psychology: Animal Learning and Cognition. 2015;41(4):419.
  • 15. Reed P, Smale D, Owens D, Freegard G. Human performance on random interval schedules. Journal of Experimental Psychology: Animal Learning and Cognition. 2018;44(3):309. doi: 10.1037/xan0000172
  • 16. Brackney RJ, Cheung TH, Sanabria F. A bout analysis of operant response disruption. Behavioural Processes. 2017;141:42–49. doi: 10.1016/j.beproc.2017.04.008
  • 17. Smith TT, McLean AP, Shull RL, Hughes CE, Pitts RC. Concurrent performance as bouts of behavior. Journal of the Experimental Analysis of Behavior. 2014;102(1):102–125. doi: 10.1002/jeab.90
  • 18. Bennett JA, Hughes CE, Pitts RC. Effects of methamphetamine on response rate: A microstructural analysis. Behavioural Processes. 2007;75(2):199–205. doi: 10.1016/j.beproc.2007.02.013
  • 19. Bowers MT, Hill J, Palya WL. Interresponse time structures in variable-ratio and variable-interval schedules. Journal of the Experimental Analysis of Behavior. 2008;90(3):345–362. doi: 10.1901/jeab.2008.90-345
  • 20. Wallace M, Singer G. Schedule induced behavior: A review of its generality, determinants and pharmacological data. Pharmacology Biochemistry and Behavior. 1976;5(4):483–490. doi: 10.1016/0091-3057(76)90114-3
  • 21. Matsui H, Yamada K, Sakagami T, Tanno T. Modeling bout–pause response patterns in variable-ratio and variable-interval schedules using hierarchical Bayesian methodology. Behavioural Processes. 2018;157:346–353. doi: 10.1016/j.beproc.2018.07.014
  • 22. Kulubekova S, McDowell JJ. A computational model of selection by consequences: Log survivor plots. Behavioural Processes. 2008;78(2):291–296. doi: 10.1016/j.beproc.2007.12.005
  • 23. McDowell JJ. A computational model of selection by consequences. Journal of the Experimental Analysis of Behavior. 2004;81(3):297–317. doi: 10.1901/jeab.2004.81-297
  • 24. Sutton RS, Barto AG. Reinforcement learning: An introduction. 2nd ed. MIT Press; 2018.
  • 25. Nieder A. Counting on neurons: The neurobiology of numerical competence. Nature Reviews Neuroscience. 2005;6(3):177–190. doi: 10.1038/nrn1626
  • 26. Fleshler M, Hoffman HS. A progression for generating variable-interval schedules. Journal of the Experimental Analysis of Behavior. 1962;5(4):529–530. doi: 10.1901/jeab.1962.5-529
  • 27. Shull RL, Gaynor ST, Grimes JA. Response rate viewed as engagement bouts: Resistance to extinction. Journal of the Experimental Analysis of Behavior. 2002;77(3):211–231. doi: 10.1901/jeab.2002.77-211
  • 28. Watanabe S. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research. 2010;11(Dec):3571–3594.
  • 29. Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, et al. Stan: A probabilistic programming language. Journal of Statistical Software. 2017;76(1). doi: 10.18637/jss.v076.i01
  • 30. Herrnstein RJ. On the law of effect. Journal of the Experimental Analysis of Behavior. 1970;13(2):243–266. doi: 10.1901/jeab.1970.13-243
  • 31. McDowell JJ. On the classic and modern theories of matching. Journal of the Experimental Analysis of Behavior. 2005;84(1):111–127. doi: 10.1901/jeab.2005.59-04
  • 32. Brackney RJ, Cheung TH, Herbst K, Hill JC, Sanabria F. Extinction learning deficit in a rodent model of attention-deficit hyperactivity disorder. Behavioral and Brain Functions. 2012;8(1):59. doi: 10.1186/1744-9081-8-59
  • 33. Jiménez ÁA, Sanabria F, Cabrera F. The effect of lever height on the microstructure of operant behavior. Behavioural Processes. 2017;140:181–189. doi: 10.1016/j.beproc.2017.05.002
  • 34. Daniels CW, Sanabria F. About bouts: A heterogeneous tandem schedule of reinforcement reveals dissociable components of operant behavior in Fischer rats. Journal of Experimental Psychology: Animal Learning and Cognition. 2017;43(3):280. doi: 10.1037/xan0000144
  • 35. Sanabria F, Daniels CW, Gupta T, Santos C. A computational formulation of the behavior systems account of the temporal organization of motivated behavior. Behavioural Processes. 2019;169:103952. doi: 10.1016/j.beproc.2019.103952
  • 36. Timberlake W. Behavior systems and reinforcement: An integrative approach. Journal of the Experimental Analysis of Behavior. 1993;60(1):105–128. doi: 10.1901/jeab.1993.60-105
  • 37. Sakai Y, Fukai T. The actor-critic learning is behind the matching law: Matching versus optimal behaviors. Neural Computation. 2008;20(1):227–251. doi: 10.1162/neco.2008.20.1.227

Decision Letter 0

Gennady Cymbalyuk

1 Sep 2020

PONE-D-20-18275

Simulating bout-and-pause patterns with reinforcement learning

PLOS ONE

Dear Dr. Yamada,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

The manuscript should be revised for clarity of presentation, and the relevant references should be discussed and appropriately acknowledged.

Please provide a statistical description of the three-state Markov model. The transition matrix could be analytically computed in at least some limit cases. That could give the authors some indications regarding the limitations of their model (besides the limitations mentioned in the general discussion of their manuscript).

Reviewers raised concerns about embedded (hidden) assumptions in the computational model that lead to “emergent” properties. For example, the authors mentioned that “Although our dual model does not explicitly include the bi-exponential model in Eq. (1), IRTs generated by the dual model followed the bi-exponential model.” Please discuss whether the “natural” assumption of a Boltzmann factor in Eq. (3) of the model or the logarithmic formula in Eq. (4) might itself lead to the “emergent” behavior, such as the bi-exponential model from Eq. (1). How do the authors know that this simple assumption is not the root of the observed bi-exponential and other exciting features?

Please discuss the potential limitations of the model.

Some important recent papers are missing from the discussion and should be included in the revision.

Brackney, R. J., Cheung, T. H. C., & Sanabria, F. (2017). A bout analysis of operant response disruption. Behavioural Processes, 141(Part 1). https://doi.org/10.1016/j.beproc.2017.04.008

Brackney, R. J., & Sanabria, F. (2015). The distribution of response bout lengths and its sensitivity to differential reinforcement. Journal of the Experimental Analysis of Behavior, 104(2), 167–185. https://doi.org/10.1002/jeab.168

Chen, X., & Reed, P. (2020). Factors controlling the micro-structure of human free-operant behaviour: Bout-initiation and within-bout responses are effected by different aspects of the schedule. Behavioural Processes, 175(March), 104106. https://doi.org/10.1016/j.beproc.2020.104106

Daniels, C. W., & Sanabria, F. (2017). About bouts: A heterogeneous tandem schedule of reinforcement reveals dissociable components of operant behavior in Fischer rats. Journal of Experimental Psychology: Animal Learning and Cognition, 43(3), 280–294. https://doi.org/10.1037/xan0000144

Jiménez, Á. A., Sanabria, F., & Cabrera, F. (2017). The effect of lever height on the microstructure of operant behavior. Behavioural Processes, 140, 181–189. https://doi.org/10.1016/j.beproc.2017.05.002

Reed, P. (2015). The structure of random ratio responding in humans. Journal of Experimental Psychology: Animal Learning and Cognition, 41(4), 419–431.

Reed, P., Smale, D., Owens, D., & Freegard, G. (2018). Human performance on random interval schedules. Journal of Experimental Psychology: Animal Learning and Cognition, 44(3), 309–321.

Sanabria, F., Daniels, C. W., Gupta, T., & Santos, C. (2019). A computational formulation of the behavior systems account of the temporal organization of motivated behavior. Behavioural Processes, 169, 103952. https://doi.org/10.1016/j.beproc.2019.103952

Brackney et al. (2017), for instance, report on the effect of various disruptors, including extinction, on bout-organized behavior. Although the distribution of bout lengths is not assessed in the proposed model, it may be important to note that research on that front has been conducted (Brackney & Sanabria, 2015; Jiménez et al., 2017). Also, the proposed model is, in some aspects, comparable to the partially hidden Markov model proposed by Sanabria et al. (2019)—the latter is not a learning model, but accounts for stable-state bi-exponential distribution of IRTs without building that distribution in the model itself.

In page 2, it should be pointed out that the bi-exponential distribution of IRTs has been demonstrated in VI schedules, where reinforcement is available probabilistically at a constant rate. Later in the manuscript the authors make reference to the schedules of reinforcement without explaining them. Also, q does not correspond to the length of a bout but to the *mean* length of a bout.

In page 3, the authors generalize the results from pigeons in Smith et al. (2014) to all animals, when rats actually show a very different pattern. The conclusion they reach is reasonable, assuming that rats engage in alternative behaviors during conditioning.

Page 4, line 5: “both of” should be “both”

Figure 1: Please use a larger font size.

Line 129: “knowledge that is observed” is a strange, ambiguous expression.

Line 170: Do you mean “Fechner’s law”, which implies a representation of magnitude (here, number of lever presses) in logarithmic space. Weber’s law does not imply such representation.

Line 242: “We posit both…” should be “We posit that both…”

Equation 9: Its description includes a parameter b that is not included in the equation.

Line 487: “real animals may have fewer parameters” is a strange expression.

Please submit your revised manuscript by Oct 16 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Gennady Cymbalyuk, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for including your competing interests statement: "The authors have no competing interests."

We note that one or more of the authors are employed by a commercial company: LeapMind Inc.

  1. Please provide an amended Funding Statement declaring this commercial affiliation, as well as a statement regarding the Role of Funders in your study. If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study. You can update author roles in the Author Contributions section of the online submission form.

Please also include the following statement within your amended Funding Statement.

“The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.”

If your commercial affiliation did play a role in your study, please state and explain this role within your updated Funding Statement.

2. Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc.  

Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials." (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests). If this adherence statement is not accurate and there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include both an updated Funding Statement and Competing Interests Statement in your cover letter. We will change the online submission form on your behalf.

Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests

3. We note you have included a table to which you do not refer in the text of your manuscript. Please ensure that you refer to Table 5 in your text; if accepted, production will need this reference to link the reader to the Table.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Yamada and Kanemura propose a parsimonious yet insightful reinforcement learning model that successfully reproduces the bout-like temporal organization of instrumental behavior. Moreover, the model reasonably links key experimental manipulations to parameters of the model. In its most complex version, the model is a 3-state (choice, operant, other) Markov chain with updatable transition probabilities. The authors judiciously attempt to simplify the model further, showing that such simplifications come at great heuristic cost. It is particularly commendable that the authors acknowledge the potential limitations of the model.

I only have two relatively minor concerns regarding the manuscript. First, although the authors provide a useful synthesis of the literature on the microstructure of instrumental behavior, many important recent papers are missing from that synthesis, which makes it appear outdated. Below I have listed several recent papers that the authors omitted, and that I believe would inform the assessment of their model. Because I am co-author in many of these papers, I am disclosing my name in the signature, and my recommendation will not change whether or not the authors choose to include any of them.

Brackney, R. J., Cheung, T. H. C., & Sanabria, F. (2017). A bout analysis of operant response disruption. Behavioural Processes, 141(Part 1). https://doi.org/10.1016/j.beproc.2017.04.008

Brackney, R. J., & Sanabria, F. (2015). The distribution of response bout lengths and its sensitivity to differential reinforcement. Journal of the Experimental Analysis of Behavior, 104(2), 167–185. https://doi.org/10.1002/jeab.168

Chen, X., & Reed, P. (2020). Factors controlling the micro-structure of human free-operant behaviour: Bout-initiation and within-bout responses are effected by different aspects of the schedule. Behavioural Processes, 175(March), 104106. https://doi.org/10.1016/j.beproc.2020.104106

Daniels, C. W., & Sanabria, F. (2017). About bouts: A heterogeneous tandem schedule of reinforcement reveals dissociable components of operant behavior in Fischer rats. Journal of Experimental Psychology: Animal Learning and Cognition, 43(3), 280–294. https://doi.org/10.1037/xan0000144

Jiménez, Á. A., Sanabria, F., & Cabrera, F. (2017). The effect of lever height on the microstructure of operant behavior. Behavioural Processes, 140, 181–189. https://doi.org/10.1016/j.beproc.2017.05.002

Reed, P. (2015). The structure of random ratio responding in humans. Journal of Experimental Psychology: Animal Learning and Cognition, 41(4), 419–431.

Reed, P., Smale, D., Owens, D., & Freegard, G. (2018). Human performance on random interval schedules. Journal of Experimental Psychology: Animal Learning and Cognition, 44(3), 309–321.

Sanabria, F., Daniels, C. W., Gupta, T., & Santos, C. (2019). A computational formulation of the behavior systems account of the temporal organization of motivated behavior. Behavioural Processes, 169, 103952. https://doi.org/10.1016/j.beproc.2019.103952

Brackney et al. (2017), for instance, report on the effect of various disruptors, including extinction, on bout-organized behavior. Although the distribution of bout lengths is not assessed in the proposed model, it may be important to note that research on that front has been conducted (Brackney & Sanabria, 2015; Jiménez et al., 2017). Also, the proposed model is, in some aspects, comparable to the partially hidden Markov model proposed by Sanabria et al. (2019)—the latter is not a learning model, but accounts for stable-state bi-exponential distribution of IRTs without building that distribution in the model itself.

The second concern is about style—not nearly as important as content, which is excellent in this paper, but it is important nonetheless. In various parts, the manuscript would benefit from economy of expression, precision, clearer organization of key claims in separate paragraphs, and a more deliberately logical connection between ideas. Below I just point at some salient examples:

In page 2, it should be pointed out that the bi-exponential distribution of IRTs has been demonstrated in VI schedules, where reinforcement is available probabilistically at a constant rate. Later in the manuscript the authors make reference to the schedules of reinforcement without explaining them. Also, q does not correspond to the length of a bout but to the *mean* length of a bout.

In page 3, the authors generalize the results from pigeons in Smith et al. (2014) to all animals, when rats actually show a very different pattern. The conclusion they reach is reasonable, assuming that rats engage in alternative behaviors during conditioning.

Page 4, line 5: “both of” should be “both”

Figure 1: Please use a larger font size.

Line 129: “knowledge that is observed” is a strange, ambiguous expression.

Line 170: I believe the authors mean “Fechner’s law”, which implies a representation of magnitude (here, number of lever presses) in logarithmic space. Weber’s law does not imply such representation.

Line 242: “We posit both…” should be “We posit that both…”

Equation 9: Its description includes a parameter b that is not included in the equation.

Line 487: “real animals may have fewer parameters” is a strange expression.

Federico Sanabria

Associate Professor of Psychology

Arizona State University

Reviewer #2: The manuscript expands on the previous work of Kota Yamada (see reference 15, where they analyzed the statistics of within–bout and bout-initiation). This work, in particular, is inspired by the research done in McDowell’s lab at Emory.

Briefly, the beauty of the model is its parsimony. The authors considered that the bout-and-pause patterns could be captured by a three-state Markov model controlled by two independent mechanisms: (1) the choice between Operant and Others, and (2) the cost in the changeover of behaviors.

At the same time, a three-state Markov model is amenable to at least a basic statistical description, and the authors did not attempt that. The transition matrix could be analytically computed in at least some limit cases. That could give the authors some indications regarding the limitations of their model (besides the limitations mentioned in the general discussion of their manuscript).
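The computation behind this suggestion is short. A minimal sketch, assuming a hypothetical transition matrix over the three states (Choice, Operant, Other) with placeholder probabilities that are not taken from the manuscript:

    import numpy as np

    # Hypothetical transition matrix over the three states
    # (Choice, Operant, Other); rows sum to 1. The values are
    # illustrative placeholders, not taken from the manuscript.
    P = np.array([
        [0.0, 0.7, 0.3],  # from Choice: select Operant or Other
        [0.9, 0.1, 0.0],  # from Operant: mostly return to Choice
        [0.9, 0.0, 0.1],  # from Other: mostly return to Choice
    ])

    # The stationary distribution pi satisfies pi = pi P, i.e., it is
    # the left eigenvector of P associated with eigenvalue 1.
    eigenvalues, eigenvectors = np.linalg.eig(P.T)
    pi = np.real(eigenvectors[:, np.argmin(np.abs(eigenvalues - 1.0))])
    pi = pi / pi.sum()
    print(dict(zip(["Choice", "Operant", "Other"], np.round(pi, 3))))

In a limit case where, for instance, the changeover cost grows large and the probability of leaving the current behavior shrinks toward zero, the chain locks into one state; working out such cases analytically is one way the suggested analysis could expose limitations of the model.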

The second concern I have is about the embedded (hidden) assumptions in the computational model that lead to “emergent” properties. For example, the authors mentioned that “Although our dual model does not explicitly include the bi-exponential model in Eq. (1), IRTs generated by the dual model followed the bi-exponential model.” My question is: how do they know that the “natural” assumption of a Boltzmann factor in Eq. 3 of the model or the logarithmic formula in Eq. 4 does not lead to the “emergent” behavior, such as the bi-exponential model from Eq. 1? I understand that everybody uses Boltzmann’s factor in every field of science, but still, how do the authors know that this simple assumption is not the root of the observed bi-exponential and other exciting features?

I also understand that both of my concerns are hard to address, but maybe the authors could at least comment on how they would address them.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Federico Sanabria

Reviewer #2: Yes: Sorinel Oprisan

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Nov 12;15(11):e0242201. doi: 10.1371/journal.pone.0242201.r002

Author response to Decision Letter 0


4 Oct 2020

Many thanks for considering our manuscript for possible publication in PLOS ONE. We have revised our manuscript based on the editor's and the reviewers' comments. We believe our manuscript has been significantly improved thanks to the comments from the editor and the reviewers.

Editor's Comment 1: Please, provide a statistical description of the three-state Markov model. The transition matrix could be analytically computed in at least some limit cases. That could give the authors some indications regarding the limitations of their model (besides the limitations mentioned in the general discussion of their manuscript).

In the General Discussion section, we have added a paragraph discussing that the statistical description (i.e., the transition probabilities shown in Figure 3) of the three-state Markov model gives an indication on the limitations of our model. The second paragraph from the last of the revised manuscript reads:

"​ Fourth, we can design models that are not Markov transition models. The bout-and-pause response patterns shown in Fig. 2 can be generated by a Markov transition model whose transition matrix is given a priori without reinforcement learning. We argue that the statistical description of the Markov model (i.e., the transition matrix defined by the transition probabilities shown in Fig. 3) is not the source of the reproducibility of bout-and-pause

patterns. There may be other models that are not formulated by Markov transition, such as

the model proposed by McDowell [23]. We can introduce the choice and cost mechanismssuch models.​ "

Editor's Comment 2: Reviewers raised concerns about the embedded (hidden) assumptions in the computational model that lead to “emergent” properties. For example, the authors mentioned that “Although our dual model does not explicitly include the bi-exponential model in Eq. (1), IRTs generated by the dual model followed the bi-exponential model.” Please, discuss a possibility that the “natural” assumption of a Boltzmann factor in Eq.3 of the model or the logarithmic formula in Eq. 4 does not lead to the “emergent” behavior, such as the bi-exponential model form Eq. 1? How do the authors know that this simple assumption is not the root of the observed bi-exponential and other exciting features?

Thank you for raising the question of how we know that the specific equation forms used in our model are not the cause of bout-and-pause patterns. To answer this question, we conducted a simulation with a modified model, where the Boltzmann factor and the logarithmic formula were replaced as follows.

● Use the matching law p_i = Q_i / ∑_j Q_j instead of the Boltzmann-type softmax function.

● Use the square root instead of the logarithm.

The result is shown in Fig. R1 below, which looks similar to Fig. 4(a). It implies that specific equation forms such as the Boltzmann factor in Eq. (3) and the logarithm in Eq. (4) are not the cause of bout-and-pause patterns. The second-to-last paragraph of the "Discussion of Simulation 1" section now has a new sentence:

"​ The specific equation forms such as the softmax function Eq. (3) or the logarithm in Eq. (4) can also be replaceable with other forms."

Editor's Comment 3: Please discuss the potential limitations of the model.

As described in our response to Editor's Comment 1, we have added discussion on the limitation and extendability of the model. The concern raised in Editor's Comment 2 on specific equation forms was found not to be a fundamental limitation of the model since replacing the specific forms did not change the simulation results.

Editor's Comment 4: Some important recent papers are missing from discussion and should be included in the revision.

Thank you for enumerating the recent important papers we missed in the previous manuscript. We have referred to all of them at appropriate locations in the revised manuscript.

Editor's Comment 5: Brackney et al. (2017), for instance, report on the effect of various disruptors, including extinction, on bout-organized behavior. Although the distribution of bout lengths is not assessed in the proposed model, it may be important to note that research on that front has been conducted (Brackney & Sanabria, 2015; Jiménez et al., 2017). Also, the proposed model is, in some aspects, comparable to the partially hidden Markov model proposed by Sanabria et al. (2019)—the latter is not a learning model, but accounts for stable-state bi-exponential distribution of IRTs without building that distribution in the model itself.

Thank you for pointing out the importance of these research-front issues. We have added a discussion on this point to the General Discussion in our revised manuscript. The second-to-last paragraph of the General Discussion of the revised manuscript reads: "Third, we can assess the plausibility of our model in more detail by conducting simulations under new experimental manipulations including disruptors or by analyzing measures that we did not analyze. For example, recent studies showed that the distribution of bout lengths is sensitive to experimental manipulations [13, 33, 34]. Sanabria et al. [35] have proposed a computational formulation of behavior systems [36] and their descriptive model well described bout-and-pause patterns including the distribution of bout lengths."

Editor’s Comment 6: In page 2, it should be pointed out that the bi-exponential distribution of IRTs has been demonstrated in VI schedules, where reinforcement is available probabilistically at a constant rate. Later in the manuscript the authors make reference to the schedules of reinforcement without explaining them. Also, q does not correspond to the length of a bout but to the *mean* length of a bout.

Thank you for pointing out the insufficiency of our explanation of VI schedules. In the third paragraph of the Introduction, we added a brief description of VI schedules and specified that bout-and-pause patterns are observed under this schedule. Also, we have inserted "mean" into the description of q.

Editor's Comment 7: In page 3, the authors generalize the results from pigeons in Smith et al. (2014) to all animals, when rats actually show a very different pattern. The conclusion they reach is reasonable, assuming that rats engage in alternative behaviors during conditioning.

In the sixth paragraph of the Introduction, we specified that rats also engage in alternative behaviors (i.e., schedule-induced behavior, interim behavior, or adjunctive behavior) during conditioning. The added sentence is: "Similar observations have been made for rats, assuming that they engage in alternative behaviors during conditioning [20]."

Editor’s Comment 8:

Page 4, line 5: “both of” should be “both”

Figure 1: Please use a larger font size.

Line 129: “knowledge that is observed” is a strange, ambiguous expression.

Line 170: Do you mean “Fechner’s law”, which implies a representation of magnitude (here, number of lever presses) in logarithmic space. Weber’s law does not imply such representation.

Line 242: “We posit both...” should be “We posit that both...”

Equation 9: Its description includes a parameter b that is not included in the equation.

Line 487: “real animals may have fewer parameters” is a strange expression.

We have fixed all of these points.

Attachment

Submitted filename: Response_to_Reviwers.pdf

Decision Letter 1

Gennady Cymbalyuk

29 Oct 2020

Simulating bout-and-pause patterns with reinforcement learning

PONE-D-20-18275R1

Dear Dr. Yamada,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Gennady Cymbalyuk, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The revised version of the manuscript addresses all concerns raised in the previous review. I have no further comments.

Reviewer #2: The authors attempted to answer my questions as best they could. I understand that a more detailed answer than the few phrases they provided would actually mean adding a new section to the paper, which they probably don't want at this stage.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Federico Sanabria

Reviewer #2: Yes: Sorinel A Oprisan

Acceptance letter

Gennady Cymbalyuk

3 Nov 2020

PONE-D-20-18275R1

Simulating bout-and-pause patterns with reinforcement learning

Dear Dr. Yamada:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Gennady Cymbalyuk

Academic Editor

PLOS ONE

