Abstract
Goal-directed actions are instrumental behaviors whose performance depends on the organism’s knowledge of the reinforcing outcome’s value. In contrast, habits are instrumental behaviors that are insensitive to the outcome’s current value. Although habits in everyday life are typically controlled by the stimuli that occasion them, most research has studied habits using free-operant procedures in which no discrete stimuli are present to occasion the response. We therefore studied habit learning when rats were reinforced for lever pressing on a random interval 30-s schedule in the presence of a discriminative stimulus (S), but not in its absence. In Experiment 1, devaluing the reinforcer with taste aversion conditioning weakened instrumental responding in a 30-s S after 4, 22, and 66 sessions of instrumental training. Even extensive practice thus produced goal-directed action, not habit. In contrast, Experiments 2 and 3 found habit when the duration of S was increased from 30 s to 8 min. Experiment 4 then found habit with the 30-s S when it always contained a reinforcer; goal-directed action was maintained when reinforcers were earned at the same rate but occurred in only 50% of Ss (as in the previous experiments). The results challenge the view that habits are an inevitable consequence of repeated reinforcement (as in the Law of Effect) and instead suggest that discriminated habits develop when the reinforcer becomes predictable. Under those conditions, organisms may pay less attention to their behavior, much as they pay less attention to signals associated with predicted reinforcers in Pavlovian conditioning.
Keywords: stimulus control, habit, goal-directed action, attention
Recent theories of instrumental learning have emphasized the idea that instrumental behavior can come in two varieties: goal-directed actions and habits (Daw, Niv, & Dayan, 2005; Dickinson, 1985, 1994, 2012). Actions are goal directed in the sense that they are sensitive to the current motivational value of the reinforcer. In contrast, habits are relatively automatic and apparently insensitive to changes in the motivational value of the reinforcer. Goal-directed actions and habits are usually distinguished with reinforcer revaluation procedures. In those procedures, the instrumental response is tested in extinction after a separate treatment has changed the value of its reinforcer. Such revaluation is usually accomplished by pairing the reinforcer with illness (and hence conditioning a taste aversion to it) or by allowing the animal to consume it to a point of satiety. If the instrumental response is a goal-directed action, reinforcer devaluation depresses its performance. In contrast, if the response is a habit, reinforcer devaluation has no effect (Adams, 1982; Balleine & Dickinson, 1998). Goal-directed actions tend to convert to habits with repeated training or practice (e.g., Dickinson, Balleine, Watt, Gonzales, & Boakes, 1995; Holland, 2004).
In everyday life, habits are usually performed in the presence of specific cues or circumstances; that is, they are under stimulus control. For example, one might overeat popcorn when the lights go down at the movie theater or light a cigarette when one starts driving a car. In contrast, in laboratory experiments, habits are usually studied under free-operant conditions in which the instrumental response can be made repeatedly without a discriminative stimulus (Adams, 1982; Adams & Dickinson, 1981; Dickinson et al., 1995; Dickinson, Nicholas, & Adams, 1983; Nelson & Killcross, 2006; Thrailkill & Bouton, 2015; Yin, Knowlton, & Balleine, 2003). There is evidence that background contextual stimuli can control the evocation of habit in such free-operant situations (Thrailkill & Bouton, 2015). However, there is less evidence of the development of habit when the response is controlled by a discrete discriminative stimulus (see the General Discussion for further discussion). One goal of the present series of experiments was thus to help fill this gap in the literature.
Surprisingly little is also known about the behavioral processes that underlie the development of habit more generally. It is often assumed that habits merely develop through Thorndike’s (1911) Law of Effect (e.g., de Wit & Dickinson, 2009; Wood & Rünger, 2016). That is, with repetition of a “trigger-action-reward” sequence, reinforcement gradually strengthens a direct connection between the trigger (S) and the response. On this view, a habit should develop naturally in any discriminated operant situation as the response is repeatedly reinforced in the presence of S. A second perspective is that habits develop when the organism experiences little correlation between the rate of its behavior and the rate of reinforcement (Dickinson, 1985; Dickinson & Perez, 2018). For example, habit develops when the behavior is reinforced on a random-interval (RI) reinforcement schedule, because there is a weak correlation between momentary response rate and reinforcement rate (e.g., Dickinson et al., 1983). In contrast, goal-directed action is maintained when the behavior is reinforced on a random-ratio schedule, where there is a stronger moment-to-moment correlation between response rate and reinforcement rate (e.g., Dickinson et al., 1983). This “correlational” perspective suggests that habit might be difficult to produce in many discriminated operant procedures, because they could create a strong local correlation between response rate and reinforcement rate inside vs. outside the S. In the present article, we suggest a third perspective. We note that automatic, habitual behavior might develop under conditions in which the organism is encouraged to pay less attention to its behavior. In Pavlovian learning, attention to a conditioned stimulus (CS), as indexed by orienting responses, can decrease when the CS is consistently and repeatedly paired with a reinforcer (e.g., Beesley, Nguyen, Pearson, & Le Pelley, 2015; Hall & Pearce, 1979; Hogarth, Dickinson, Austin, Brown, & Duka, 2008; Kaye & Pearce, 1984). According to the Pearce-Hall model (Pearce & Hall, 1980), attention to the CS decreases as the reinforcer becomes less surprising and better predicted; conditioned responding therefore becomes automatic. If there is a parallel in instrumental learning, the organism might likewise pay less and less attention to its behavior (and perform it automatically) as the reinforcer becomes well predicted. In summary, although several perspectives are available to explain the development of habit, there are few data that can help distinguish among them.
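To make the attentional account concrete, the core of the Pearce-Hall associability rule can be stated schematically as follows (our notation; a simplified one-cue form rather than a quotation of the full model):

$$\alpha_{n} = \left|\,\lambda_{n-1} - V_{n-1}\,\right|,$$

where $\alpha_{n}$ is the attention paid to (associability of) the CS on trial $n$, $\lambda$ is the magnitude of the reinforcer, and $V$ is the current prediction of the reinforcer. When the reinforcer is well predicted ($V \approx \lambda$), $\alpha$ approaches zero and processing becomes automatic; under partial reinforcement, the error term $|\lambda - V|$ remains large on many trials, so attention is maintained.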
The goal of the present experiments was thus (1) to ask whether habits can develop in discriminated operant procedures and (2) to begin to understand the conditions that lead to them. Each experiment used the three-phase method illustrated in Table 1. In the first phase, an operant response was reinforced in the presence of an S but not in its absence using standard discriminated operant methods. In the second phase, the reinforcer was devalued with a taste aversion procedure in which the reinforcer was separately paired (or not paired) with lithium chloride-induced illness. The response was then tested (with and without S) in a third, extinction phase. If reinforcer devaluation depressed responding in the test, the training procedure could be said to have produced a goal-directed action. If there was no depression of responding, the evidence would instead suggest the acquisition of a habit (Adams & Dickinson, 1981). We found that habit does not develop with all discriminated operant procedures. The results are most consistent with the hypothesis that habit develops when the animal pays less attention to the response as the reinforcer becomes highly predictable.
Table 1.
General Experimental Design
| Training | Devaluation | Test |
|---|---|---|
| S:R+ | Pellets-LiCl (Paired) or Pellets, LiCl (Unpaired) | S:R− |
Note. “S” is discriminative stimulus, “R” is response, “LiCl” is lithium chloride. “+” is reinforced, “-” is nonreinforced (extinction).
Experiment 1
Experiment 1 asked whether a discriminated response becomes habitual after different amounts of discrimination training. Rats were trained on a discriminated operant procedure in which presentations of a 30-s S signaled that a lever-pressing response would be reinforced on an RI 30-s schedule. The lever remained in the chamber during the ITI (which was variable, averaging 90 s), but presses during the ITI had no consequence (extinction). This procedure has been used extensively in our laboratory and others, and it supports robust discriminated operant responding (e.g., Bouton, Todd, & León, 2013; Bouton, Trask, & Carranza-Jasso, 2016; Todd, Vurbic, & Bouton, 2014; see also Colwill & Rescorla, 1988, 1990). In Experiment 1a, groups received either brief or extended training before the reinforcer was devalued (see Table 1). In Experiment 1b, rats received three times the amount of instrumental training that the extended group received in Experiment 1a. If discriminated operant behavior is like free-operant behavior, then the effects of outcome devaluation observed in the final test should decrease as the amount of instrumental training increases (e.g., Dickinson et al., 1995; Holland, 2004).
Method
Subjects
Experiment 1a.
Thirty-two naïve female Wistar rats (four groups with n = 8) were purchased from Charles River (St. Constant, Canada). The rats were 75 to 90 days of age at the start of the experiment and were individually housed in a climate-controlled room with a 16:8 light-dark cycle. Experimental sessions were conducted during the light portion of the cycle at approximately the same time each day. Rats were food deprived and maintained at 80% of their free-feeding weights for the duration of the experiment. Rats had unlimited access to water in their homecages and were given supplementary feeding when necessary approximately 2 hr after each session.
Experiment 1b.
Sixteen naïve female Wistar rats (two groups with n = 8) were obtained from the same supplier and were housed and maintained as in Experiment 1a.
Apparatus
The apparatus in both experiments consisted of two distinct sets of four conditioning chambers (model ENV-008-VP; Med Associates, Fairfax, VT) housed in separate rooms of the laboratory. Each chamber was in its own sound attenuation chamber. All boxes measured 30.5 × 24.1 × 23.5 cm (length × width × height). The side walls and ceiling were made of clear acrylic plastic, while the front and rear walls were made of brushed aluminum. A recessed food cup measured 5.1 × 5.1 cm and was centered on the front wall approximately 2.5 cm above the grid floor. Two retractable levers (model ENV-112CM, Med Associates) were located on the front wall on either side of the food cup. The levers were 4.8 cm long, were positioned 6.3 cm above the grid floor, and protruded 1.9 cm from the front wall when extended. The right lever was never used and remained retracted throughout all experimental sessions. The chambers were illuminated by a 7.5-W incandescent bulb mounted to the ceiling of the sound attenuation chamber, 34.9 cm from the grid floor. Ventilation fans provided background noise of 65 dBA. The two sets of boxes had unique features that allowed them to serve as different contexts, but they were not used for that purpose here. In one set of boxes, the grids of the floor were spaced 1.6 cm apart (center-to-center). In the other set of boxes, the floor consisted of alternating stainless steel grids with different diameters (0.5 and 1.3 cm, spaced 1.6 cm apart). There were no other distinctive features between the two sets of chambers. Reinforcement consisted of the delivery of a 45-mg food pellet into the food cup (MLab Rodent Tablets; TestDiet, Richmond, IN). A 3000-Hz tone (80 dB) could be delivered through a 7.6-cm speaker mounted to the ceiling of the sound attenuation chamber. The apparatus was controlled by computer equipment in an adjacent room.
Procedure
Experiment 1a
Food restriction began one week prior to the beginning of training. Sessions were then conducted 7 days a week. Rats were divided into two groups (n = 16). On Day 1, the group that received extensive training began its daily training sessions. The other group (n = 16) experienced equivalent handling and food deprivation, but did not begin training until Day 10. The staggered start allowed the groups to be tested on the same day, thus controlling for handling and days on the food deprivation schedule prior to the final test session.
Magazine training.
On the first day, rats received a single 30-min session of magazine training in the conditioning chambers. In this session, food pellets were delivered freely on a random time (RT) 60-s schedule. The schedule delivered a pellet in any given second with a probability of 1/60. The response manipulandum was not present during this session.
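For concreteness, the delivery rule just described can be simulated in a few lines. The following is a minimal sketch (our illustration, not the authors’ control code; Python is used throughout for such sketches):

```python
import random

def rt_60_delivery_times(session_s=1800, p_per_s=1/60):
    """Random-time (RT) 60-s schedule: each second, deliver a free pellet
    with probability 1/60, independent of the animal's behavior."""
    return [t for t in range(session_s) if random.random() < p_per_s]

# A 30-min (1800-s) session yields about 30 noncontingent pellets on average.
print(len(rt_60_delivery_times()))
```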
Response training.
On the next day, the rats received a 30-min session in which lever pressing was reinforced on an RI 30-s schedule. As is typical with our use of this method, the animals learned to respond without shaping by hand.
Discriminative operant training.
Acquisition of discriminated operant responding then followed a standard procedure used in this laboratory (e.g., Bouton et al., 2016; Todd et al., 2014). Rats received two daily 32.5-min sessions in which lever pressing was reinforced only during 30-s presentations of the tone discriminative stimulus (S). During each session, there were 16 presentations of S; the response was reinforced on the RI 30-s schedule during S, but not during the intervals between S presentations (the intertrial interval, or ITI). The ITI was increased over the first three sessions of training. During the first session, the ITI was 30 s. During the second session, the ITI was variable and increased to a mean of 60 s (range: 30–90 s), and during the third and all subsequent sessions the ITI averaged 90 s (range: 30–120 s). Groups Brief and Extended received a total of 4 and 22 sessions during the discriminative operant training phase, respectively.
Reinforcer devaluation.
Starting the next day, the rats were matched on discriminated operant performance and assigned to groups that now received the pellet reinforcer paired or unpaired with lithium chloride (LiCl) injection. Aversion conditioning proceeded in the operant chambers over the next 12 days and followed a procedure previously used in this laboratory (Thrailkill & Bouton, 2015). Conditioning trials occurred every other day and were separated by a context exposure session on the intervening days. The lever was not present during any session. On the first day of each 2-day cycle, paired rats received 50 noncontingent pellets delivered on an RT 60-s schedule. They were then removed from the chamber, given an immediate intraperitoneal (ip) injection of 20 ml/kg LiCl (0.15 M), and placed in the transport box prior to being returned to the home cage. Unpaired rats received the same exposure to the chamber and an immediate LiCl injection, but did not receive any pellets. On Day 2 of each cycle, all rats were placed in the chamber and later removed and transported to their home cages without an injection. On this day, unpaired rats received 50 noncontingent pellets delivered on the RT 60-s schedule, whereas paired rats received exposure to the chamber for the same amount of time as in the preceding pellet session. There were six 2-day conditioning cycles. In order to maintain equivalent pellet exposure during aversion conditioning, the unpaired groups were given only the average number of pellets eaten by the respective paired group on the preceding trial.
Testing.
Testing was then conducted on the next 3 days. On the first day of testing, all rats received one 32.5-min session that contained 16 presentations of the 30-s S. The lever was available during the session, but presses had no programmed consequences (extinction). On the next day, the rats were given a test of pellet consumption in order to assess the strength of the aversion to the pellets. Each rat was placed in the chamber with the lever removed, and 10 food pellets were delivered on an RT 60-s schedule. The test session lasted approximately 10 min. Magazine entries and the number of pellets consumed were recorded. Finally, on the last day, the rats were allowed to press the lever to earn food pellets according to the discriminated operant training procedure (reacquisition).
Experiment 1b
Experiment 1b used the same procedure as Experiment 1a except as noted. During discriminated operant training, there were three sessions each day, separated by an interval of approximately 45 min. Each session contained 16 presentations of the 30-s tone S separated by a variable 90-s ITI. (The ITI duration was increased over the first sessions of training in the same manner as described for Experiment 1a.) There was a total of 66 sessions of training. Paired and Unpaired groups (ns = 8) then received aversion conditioning with the pellet reinforcer, followed by testing, according to the procedure used in Experiment 1a.
Data analysis
The computer recorded the number of responses made during each 30-s presentation of S as well as during the 30-s period just prior to the S (the pre-S period). Responding during both periods was of interest, and both are reported here. For the test sessions, the analyses compared responses averaged over blocks of four trials. Analyses of variance (ANOVAs) were used to assess differences based on between-group (e.g., devaluation) and within-subject factors (e.g., stimulus period, session). The rejection criterion was set at .05 for all statistical tests. When relevant to the hypotheses, we report effect sizes with confidence intervals calculated according to the method suggested by Steiger (2004).
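For readers who wish to check the effect-size calculations, partial eta-squared and its confidence interval can be recovered from an F statistic and its degrees of freedom alone. The following is a minimal sketch of the noncentral-F inversion described by Steiger (2004) (our code, not the authors’; function names are ours):

```python
from scipy.optimize import brentq   # root finder
from scipy.stats import ncf         # noncentral F distribution

def partial_eta_sq(F, df1, df2):
    # eta_p^2 = F * df1 / (F * df1 + df2)
    return F * df1 / (F * df1 + df2)

def partial_eta_sq_ci(F, df1, df2, conf=0.95):
    """CI for partial eta-squared via inversion of the noncentral F CDF."""
    def nc_at(p):
        # Noncentrality that places the observed F at cumulative probability p.
        g = lambda nc: ncf.cdf(F, df1, df2, nc) - p
        return 0.0 if g(0.0) <= 0 else brentq(g, 0.0, 10_000.0)
    nc_lo = nc_at(1 - (1 - conf) / 2)   # lower confidence limit
    nc_hi = nc_at((1 - conf) / 2)       # upper confidence limit
    eta = lambda nc: nc / (nc + df1 + df2 + 1)
    return eta(nc_lo), eta(nc_hi)

# Example, using a value reported below for Experiment 1a: F(1, 26) = 22.42.
print(partial_eta_sq(22.42, 1, 26))      # ~.46
print(partial_eta_sq_ci(22.42, 1, 26))   # ~(.17, .64)
```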
Results
Experiment 1a
Acquisition.
Two rats, one in Group Brief Paired and one in Group Extended Unpaired, failed to acquire operant responding by the end of training and were excluded from analysis. The remaining rats acquired the discriminated operant response without incident; that is, they learned to make more lever presses during the S than during the pre-S period. This is evident in Figure 1a and b, which plots responses in the S and the pre-S period from each group over each acquisition session. Separate ANOVAs for the Brief and Extended conditions compared the number of responses recorded during S and pre-S periods across sessions of training in the Paired and Unpaired groups. For animals in the Brief condition, a Devaluation (Paired, Unpaired) by Stimulus Period (S, pre S) by Session (4) ANOVA found reliable effects of Stimulus Period, F(1, 13) = 35.68, MSE = 11.33, p < .001, Session, F(3, 39) = 15.39, MSE = 8.36, p < .001, and a Stimulus Period by Session interaction, F(3, 39) = 26.49, MSE = 2.79, p < .001. No effect or interaction involving Devaluation, a dummy variable at this point, approached significance, largest F(3, 39) = 1.26. A similar pattern emerged in the Extended condition, where a Devaluation (Paired, Unpaired) by Stimulus Period (S, pre S) by Session (22) ANOVA found reliable effects of Stimulus Period, F(1, 13) = 114.09, MSE = 49.20, p < .001, Session, F(21, 273) = 3.00, MSE = 18.37, p < .001, and a Stimulus Period by Session interaction, F(21, 273) = 21.67, MSE = 3.57, p < .001. No effect or interaction involving Devaluation approached significance, Fs < 1.
Figure 1.
Results of Experiments 1a and 1b. Mean number of responses during each stimulus (S) and the 30-s period preceding the S (pre-S) across sessions (Left) and during 4-trial blocks of the test (Right) for Groups Brief (a) and Extended (b) from Experiment 1a, and from Experiment 1b (c). Error bars are the standard error of the mean.
Testing.
Devaluation proceeded smoothly. All rats in the Paired groups came to reject the pellets: On the last trial of the devaluation phase, the Paired rats in Groups Brief and Extended ate a mean of 0.4 and 0.4 pellets, respectively. Results of the crucial extinction test of operant responding are shown at right in Figure 1a and b. In both the Brief and Extended training conditions, the Paired group made fewer responses during the S and pre-S periods than the Unpaired group. These observations were supported by a Training (Brief, Extended) by Devaluation by Stimulus Period by Block ANOVA. There was a significant effect of Devaluation, F(1, 26) = 22.42, MSE = 227.93, p < .001, η2p = .46, 95% CI [.17, .64], Stimulus Period, F(1, 26) = 31.85, MSE = 139.92, p < .001, and Block, F(3, 78) = 30.37, MSE = 58.24, p < .001. There were several reliable interactions, some including the Devaluation factor: Devaluation by Block, F(3, 78) = 7.25, MSE = 58.24, p < .001, Training by Devaluation by Block F(3, 78) = 3.15, MSE = 58.24, p = .030, and Devaluation by Stimulus Period by Block, F(3, 78) = 3.17, MSE = 53.97, p = .029. Other interactions that did not include Devaluation also reached significance: Training by Stimulus Period, F(1, 26) = 10.66, p = .003, Stimulus Period by Block, F(3, 78) = 4.73, MSE = 53.97, p = .004.
To decompose the interactions, responding in the Brief and Extended groups was analyzed with separate Devaluation by Stimulus Period by Block ANOVAs. For Group Brief, there were significant effects of Devaluation, F(1, 13) = 10.17, MSE = 176.93, p = .007, η2p = .44, 95% CI [.04, .66], Stimulus Period, F(1, 13) = 7.26, MSE = 157.58, p = .018, and Block, F(3, 39) = 22.99, MSE = 36.42, p < .001. None of the interactions were significant, largest F(1, 13) = 2.24, p = .16. For Group Extended, there were significant effects of Devaluation, F(1, 13) = 12.34, p = .004, MSE = 278.93, η2p = .49, 95% CI [.07, .69], Stimulus Period, F(1, 13) = 30.03, MSE = 122.26, p < .001, and Block, F(3, 39) = 13.31, MSE = 81.06, p < .001. There was a significant Devaluation by Stimulus Period interaction, indicating that aversion conditioning influenced the difference in responding in the S and pre-S periods, F(1, 13) = 10.50, MSE = 122.26, p = .006, η2p = .45, 95% CI [.05, .67]. The interactions involving Block were also significant: Stimulus Period by Block, F(3, 39) = 4.75, MSE = 64.34, p = .006, Devaluation by Block, F(3, 39) = 6.63, p = .001, η2p = .34, 95% CI [.08, .49], and Devaluation by Stimulus Period by Block, F(3, 39) = 3.68, p = .020, η2p = .22, 95% CI [.00, .38]. This pattern suggests that, in the extended training condition, the difference between S and pre-S responding in the Paired and Unpaired groups decreased across the trial blocks of the test session. Most important, there was a clear reinforcer devaluation effect in both the Brief and Extended training conditions.
The effectiveness of the devaluation procedure was confirmed by the pellet consumption test on the following day. Paired rats in Groups Extended and Brief ate a mean of 0.3 and 0.4 out of the 10 pellets offered, whereas Unpaired rats ate all 10. Finally, a reacquisition test further confirmed that the pellet no longer served to support lever pressing in the Paired groups (data not shown).
Experiment 1b
Acquisition.
All rats acquired the discriminated operant response, and discrimination training is summarized in the left panel of Figure 1c. A Group (Paired, Unpaired) by Stimulus Period (S, pre S) by Session ANOVA compared the responses in the S and pre-S period over sessions. There were significant effects of Stimulus Period, F(1, 14) = 116.99, MSE = 673.42, p < .001, Session, F(65, 910) = 9.52, MSE = 16.08, p < .001, and a Stimulus Period by Session interaction, F(65, 910) = 21.47, MSE = 9.61, p < .001. No effect or interaction involving Group approached significance, Fs < 1.
Testing.
Devaluation again proceeded smoothly. On the last trial of the devaluation phase, the rats in Group Paired ate a mean of 2 pellets; Group Unpaired ate all of them. Results of the crucial lever-pressing test are shown in the right panel of Figure 1c. The Paired group again made fewer responses during the S and pre-S periods than the Unpaired group. A Devaluation by Stimulus Period by Block ANOVA found significant effects of Devaluation, F(1, 14) = 4.61, MSE = 264.22, p = .050, η2p = .25, 95% CI [.00, .53], Stimulus Period, F(1, 14) = 32.53, MSE = 171.51, p < .001, Block, F(3, 42) = 19.17, MSE = 74.63, p < .001, and a Stimulus Period by Block interaction, F(3, 42) = 13.54, MSE = 60.15, p < .001. The Devaluation by Stimulus Period interaction approached significance, F(1, 14) = 3.91, p = .068, η2p = .22, 95% CI [.00, .50]. No interaction involving Block was significant, largest F(3, 42) = 2.04.
Devaluation was confirmed by the pellet consumption test on the following day, where Paired rats ate a mean of 1.4 pellets and the Unpaired group ate all 10 of them. Finally, a reacquisition test further confirmed that the pellet no longer served to support lever pressing in the Paired groups (data not shown).
Discussion
The rats in both experiments readily acquired the discriminated operant response. Consistent with findings from free-operant procedures, discriminative responding was sensitive to devaluation after a minimal amount of discriminated operant training. More surprisingly, reinforcer devaluation still reduced responding after 22 and even 66 sessions of discriminative training. Sixty-six sessions of training involved approximately 1,054 occasions on which the response was paired with the reinforcer, a number that far exceeds the number of response-reinforcer pairings often thought to yield habit in free-operant procedures (Dickinson et al., 1995; Thrailkill & Bouton, 2015). The results thus suggest that discriminated operant training can produce goal-directed behavior that is difficult to convert to habit.
Experiment 2
Dickinson (e.g., 1985, 1989; Dickinson & Perez, 2018) has argued that procedures that arrange a strong local correlation between response rate and reinforcement rate will maintain goal-directed behavior and prevent habit formation. In contrast, habits develop when there is a weak moment-to-moment correlation between response rate and reinforcement rate. As we noted in the Introduction, from this point of view habit may be difficult to produce in many discriminated operant procedures, because they can maintain a local correlation between response rate and reinforcement rate inside versus outside the S. That was potentially the case with the 30-s S used in Experiment 1. In Experiment 2, we therefore studied a discriminated operant procedure that weakened the moment-to-moment correlation by extending the duration of S. Rats acquired a discriminated operant response under conditions that arranged the same amount of total stimulus exposure, reinforced responses, and ITI time during training as the 66-session procedure in Experiment 1b. However, the duration of S was increased sixteen-fold, from 30 s to 8 min. We reasoned that a long S might allow the rat to experience extended periods of time in which response rate and reinforcement rate were only weakly correlated, as in a free-operant (unsignaled) RI schedule. Thus, on the correlational view, training with a long-duration S might increase the likelihood that habit would develop and decrease the behavior’s sensitivity to reinforcer devaluation.
Method
Subjects and apparatus.
The subjects were 16 female Wistar rats purchased from the same vendor as those in Experiment 1 and maintained under the same conditions. The apparatus was also the same.
Procedure
Acquisition.
Magazine training and response training were conducted as described in Experiments 1a and 1b. Rats then received three training sessions each day, each containing 2 presentations of an 8-min tone S separated by a variable ITI. The time before the first S and the time after the second S were equal within a given session. The ITI duration averaged 16 min, and the pre-first-S and post-second-S durations averaged 14.8 min. Session duration averaged 63.8 min. The response was reinforced on the RI 30-s schedule during the S, but not during the ITI or in the pre-/post-S periods.
Aversion conditioning.
Aversion conditioning was conducted in the same manner as in Experiments 1a and 1b. There were six cycles.
Testing.
Rats received a test session consisting of a 6-min ITI followed by a single presentation of the 8-min S. No reinforcers were presented. The test session was followed the next day by a pellet consumption test following the usual procedure, and a reacquisition test under the acquisition training conditions on the final day.
Results
Acquisition.
All rats successfully acquired the discriminated operant response. Discrimination learning is summarized in the left panel of Figure 2, which depicts response rate in S prior to the first reinforcer and response rate in the 30-s pre-S period across blocks of 6 trials (3 sessions) in Groups Paired and Unpaired. (We isolated the period before the first reinforcer in S to unequivocally evaluate discriminative control by S rather than by the reinforcer.) A Group (Paired, Unpaired) by Stimulus Period (S, pre S) by Block (11) ANOVA compared the mean number of responses per S and pre-S period across session blocks in the Paired and Unpaired groups. There were significant effects of Stimulus Period, F(1, 14) = 77.09, MSE = 199.16, p < .001, Block, F(10, 140) = 10.93, MSE = 28.33, p < .001, and a Stimulus Period by Block interaction, F(10, 140) = 14.28, MSE = 18.10, p < .001. No effect or interaction involving Group approached significance, Fs < 1.
Figure 2.
Results of Experiment 2. Mean number of responses in the period between onset of the stimulus and the delivery of the first reinforcer across blocks of 6 trials (Left), and in each 1-min block of the test session (Right). “S” and “pre-S” refer to responses recorded during the stimulus and during the 30 s that preceded onset of the stimulus. “On” and “Off” refer to the onset and offset of the discriminative stimulus. Error bars are the standard error of the mean.
Testing.
Devaluation proceeded smoothly. On the last trial of the devaluation phase, rats in Group Paired ate a mean of 0 pellets. Results of the test are shown in the right panel of Figure 2. Each group increased the number of lever presses beginning with the onset of the stimulus in the 6th minute of the test session. Because testing was conducted in extinction, responding then decreased during the S and remained low following the offset of the stimulus in the 14th minute. Most important, the Paired and Unpaired groups showed similar response rates throughout the test. These observations were supported by a Devaluation by 1-min Bin ANOVA, which found a significant effect of Bin, F(19, 266) = 8.19, MSE = 34.35, p < .001, η2p = .37, 95% CI [.24, .41], and no effect of Devaluation or interaction, Fs < 1. Devaluation was confirmed by a consumption test on the following day. Paired rats ate a mean of 0.4 out of the 10 pellets offered, whereas Unpaired rats ate all 10. Finally, a reacquisition test further confirmed that the pellet no longer served to support lever pressing in the Paired group (data not shown).
Discussion
The results suggest that a discriminated habit can develop using a method in which rats learned to respond in a longer-duration (8-min) S. It may be worth noting that when a habit develops in a traditional free-operant procedure, the rat likewise spends an extended period of time earning reinforcers in the contextual S (Thrailkill & Bouton, 2015).
Experiment 3
Experiments 1 and 2 provide evidence that habitual discriminated operant responding develops only when the S duration is long. However, there were procedural differences between Experiments 1 and 2. For instance, Experiment 2 allowed the rats to respond for reinforcement for a total of 16 min per session, whereas Experiments 1a and 1b allowed rats to respond for reinforcement for only 8 min per session. Perhaps longer amounts of time spent lever pressing in each session encouraged the development of habit. Thus, in addition to seeking to replicate the results of Experiment 2, Experiment 3 manipulated the duration of S experimentally. Two groups of rats received training in which responding was reinforced on the RI 30-s schedule in either a 30-s S or an 8-min S. Rats trained with the 8-min S received 2 such trials per session, as in Experiment 2. Rats trained with the 30-s S received 32 trials per session, so that the groups received equivalent total time in the S and the ITI each session. After the usual reinforcer devaluation treatment, all rats were tested for discriminated responding with a common S duration at the geometric mean of the two trained durations (2 min), which would arguably support equivalent generalization between training and testing in the two conditions (Church & Deluty, 1977). They then received additional extinction tests with S at the duration that had been used during original discrimination training (i.e., 30 s or 8 min).
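For reference, the common test duration follows directly from the two trained durations: $\sqrt{30\ \mathrm{s} \times 480\ \mathrm{s}} = 120\ \mathrm{s} = 2\ \mathrm{min}$.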
Method
Subjects and Apparatus.
The subjects were 32 female Wistar rats purchased from the same vendor as those in the previous experiments and maintained under the same conditions. The apparatus was also the same.
Procedure
Acquisition.
Magazine training and response training were conducted as in Experiment 1. Sessions were conducted once daily. Rats were then divided into two groups (n = 16) that received discriminated operant training on the RI 30-s schedule with either the 30-s S (Group Short) or the 8-min S (Group Long). Group Short received 32 trials each session separated by a variable 90-s ITI, and Group Long received 2 trials per session that were preceded, followed, and separated by a variable 15.8-min ITI. Rats in both groups thus received 63.5-min sessions containing identical amounts of reinforced S time (with the RI 30-s schedule operating) and nonreinforced ITI time. For Group Short, the ITI duration was increased over the first sessions of training in the manner described in Experiment 1.
Reinforcer devaluation.
Aversion conditioning with the pellets was conducted in the same manner as in the preceding experiments. As usual, there were 6 cycles.
Testing.
There were two tests of nonreinforced operant responding. In the first (common) test, all rats received a session that contained four presentations of a 2-min S separated by a 4-min ITI. (Two min is the geometric mean of the 0.5- and 8-min trained S durations, and was therefore expected to support similar generalization decrement between training and testing.) On the following day, the Long and Short groups respectively received a test consisting of either a single 8-min S or sixteen 30-s Ss in extinction (training duration test). On the final two days, rats were tested for pellet consumption and received a reacquisition test under the acquisition training conditions.
Results
Acquisition.
Five rats failed to acquire reliable instrumental responding and were excluded from the experiment. As a result, there were 7 rats in Group Short Paired, 6 rats in Group Short Unpaired, 8 rats in Group Long Paired, and 6 rats in Group Long Unpaired. The left panels of Figure 3 show responding in the training phase for Groups Short (a) and Long (b). Rats increased responding in S across the training phase. Responding in the pre-S period was lower in the Long groups, presumably reflecting the longer period of continuous extinction in the ITI that preceded each presentation of S. A Stimulus Duration (Long, Short) by Stimulus Period (S, pre S) by Devaluation (Paired, Unpaired) by Block/Trial (32) ANOVA found significant effects of Stimulus Duration, F(1, 23) = 8.26, MSE = 996.39, p = .009, Stimulus Period, F(1, 23) = 154.23, MSE = 664.03, p < .001, Block/Trial, F(31, 713) = 12.19, MSE = 31.42, p < .001, and a Stimulus Period by Block/Trial interaction, F(31, 713) = 23.94, MSE = 22.51, p < .001. No other effects or interactions approached significance, largest F(31, 713) = 1.15. The effect of Stimulus Duration was driven by the lower pre-S responding in Group Long. A Stimulus Duration by Devaluation by Block ANOVA comparing responses in the S periods found no effect of Stimulus Duration, F(1, 23) = 1.73, MSE = 1612.46, p = .202. The same analysis found a significant effect of Stimulus Duration on pre-S responding, F(1, 23) = 118.89, MSE = 47.96, p < .001.
Figure 3.
Results of Experiment 3. Mean response rates (responses per minute) during training (Left), per trial during the geometric mean test (Center), and across blocks of trials and minutes in the training duration test (Right) in Groups Short (a) and Long (b). “S” and “pre-S” refer to responses recorded during the stimulus and during the 30 s that preceded onset of the stimulus. “On” and “Off” refer to the onset and offset of the discriminative stimulus; see text for details. Error bars are the standard error of the mean.
Testing.
Aversion conditioning occurred without incident. On the last trial of the devaluation phase, the rats in Groups Paired Short and Paired Long ate a mean of 0.5 and 0.6 pellets, respectively. Results of the geometric mean and training duration tests are shown in the middle and right panels of Figure 3. Data from the geometric mean test were submitted to a Group (Short, Long) by Devaluation (Paired, Unpaired) by Stimulus Period (S, pre S) by Trial Block (4) ANOVA. It found significant effects of Group, F(1, 23) = 4.68, MSE = 121.81, p = .041, η2p = .17, 95% CI [.00, .41], Stimulus Period, F(1, 23) = 78.37, MSE = 96.36, p < .001, η2p = .77, 95% CI [.56, .85], and Block, F(3, 69) = 7.33, MSE = 20.67, p < .001, η2p = .24, 95% CI [.01, .48], as well as a Stimulus Period by Block interaction, F(3, 69) = 5.02, MSE = 18.28, p = .003. Importantly, the devaluation effect differed between the Short and Long groups; there were significant Group by Devaluation by Block, F(3, 69) = 6.26, MSE = 20.67, p = .001, η2p = .21, 95% CI [.05, .35], and Group by Devaluation by Stimulus Period by Block interactions, F(3, 69) = 4.37, MSE = 18.28, p = .007, η2p = .16, 95% CI [.01, .29].
To analyze the interactions, geometric mean test data for Groups Long and Short were analyzed separately. For Group Long, there was no evidence that devaluation had an effect on responding in either test, which replicates and extends the results of Experiment 2. A Devaluation (Paired, Unpaired) by Stimulus Period (S, pre S) by Trial (4) ANOVA found a significant effect of Stimulus Period, F(1, 12) = 30.52, MSE = 79.44, p < .001, Trial, F(3, 36) = 6.17, MSE = 19.46, p = .002, and a Stimulus Period by Trial interaction, F(3, 36) = 5.15, MSE = 17.23, p = .005. None of the effects or interactions involving Devaluation approached significance, largest F(3, 36) = 2.36, MSE = 19.46.
For Group Short, the results of the geometric mean test were mixed. A Devaluation by Stimulus Period by Trial ANOVA found a significant effect of Stimulus Period, F(1, 11) = 46.82, MSE = 114.83, p < .001, η2p = .81, 95% CI [.45, .89], as well as significant interactions between Devaluation and Trial, F(3, 33) = 6.15, MSE = 21.99, p = .002, η2p = .36, 95% CI [.07, .52], and Devaluation, Stimulus Period, and Trial, F(3, 33) = 4.68, MSE = 19.42, p = .008, η2p = .30, 95% CI [.03, .47]. For responding during the S, a Devaluation by Trial ANOVA found a significant Devaluation by Trial interaction, F(3, 33) = 5.86, MSE = 38.03, p = .003, η2p = .35, 95% CI [.06, .51]; other Fs < 1.55. For pre-S responding, the same analysis found no significant effects, largest F = 2.66, MSE = 3.39. Unexpectedly, responding during the S was significantly greater in the Paired than in the Unpaired group on the first test trial, F(1, 11) = 5.37, MSE = 63.00, p = .041, η2p = .33, 95% CI [.00, .60]. By the final trial, however, the Paired group made significantly fewer responses than the Unpaired group in the S, F(1, 11) = 4.97, MSE = 54.02, p = .048, η2p = .31, 95% CI [.00, .59].
In the training duration test, the results for Group Long were very similar to those of Experiment 2. Both the Paired and Unpaired groups increased responding at the onset of the stimulus in the 6th minute, and responding then decreased during the S and remained low following the offset of the stimulus in the 14th minute. These observations were supported by a Devaluation by Block ANOVA, which found a significant effect of Block, F(19, 228) = 7.71, MSE = 11.56, p < .001, η2p = .39, 95% CI [.24, .43], and, crucially, no effect of Devaluation or interaction, Fs < 1. For Group Short, the training duration test suggested weaker responding during S in the Paired group, again consistent with the maintenance of goal-directed action with the 30-s S. A Devaluation by Stimulus Period by Block ANOVA found significant effects of Stimulus Period, F(1, 11) = 23.63, MSE = 223.12, p = .001, Block, F(3, 33) = 20.35, MSE = 42.13, p < .001, and a Stimulus Period by Block interaction, F(3, 33) = 10.93, MSE = 38.89, p < .001. Despite the visual trend, none of the effects involving Devaluation were significant, largest F(3, 33) = 1.68, MSE = 42.13.
Devaluation was confirmed by a consumption test on the following day. Paired rats in Groups Short and Long ate a mean of 0.7 and 0.5 out of the 10 pellets offered, whereas Unpaired rats ate all 10. Finally, a reacquisition test further confirmed that the pellet no longer served to support lever pressing in the Paired group (data not shown).
Discussion
The results with Group Long replicated the results of Experiment 2: Discriminated responding again showed a hallmark of habit with the longer (8-min) discriminative stimulus. The results of Group Short were less clear. Although Experiments 1a and 1b had established that training with the 30-s S supported goal-directed action rather than habit, the corresponding trends did not reach statistical significance here. As noted earlier, Experiment 3 involved 32 trials per session, whereas Experiments 1a and 1b involved 16. Perhaps a longer total amount of responding in each session did encourage habit formation. However, in order to provide a more complete assessment of the effect of reinforcer devaluation on discriminated operant responding when there are 32 30-s S presentations per session, we combined the test data from Experiment 3 with test data from three additional experiments that also involved training and devaluation with the 32-trials-per-session method used with Group Short here. The additional experiments were run to pursue the possibility, perhaps suggested here, that we had obtained habit with the present 32-trials-per-session procedure. They used the same 32-trial procedure that was used with Group Short, but compared it to other procedures. The pooled data from the 32-trial procedures (Figure 4) involved 30 rats that received the Paired treatment and 30 rats that received the Unpaired treatment. The results leave little doubt that the 32-trials-per-session procedure still supports goal-directed action rather than habit. A Devaluation (Paired, Unpaired) by Experiment (4) by Stimulus Period (S, pre) by Block (4) ANOVA found significant effects of Devaluation, F(1, 52) = 10.21, MSE = 477.51, p = .002, η2p = .16, 95% CI [.02, .34], Stimulus Period, F(1, 52) = 138.09, MSE = 247.68, p < .001, and Block, F(3, 156) = 53.86, MSE = 122.36, p < .001. There were significant Devaluation by Stimulus Period, F(1, 52) = 6.19, MSE = 247.68, p = .016, η2p = .21, 95% CI [.05, .41], and Stimulus Period by Block, F(3, 156) = 25.61, MSE = 66.70, p < .001, interactions. No other effects or interactions, including those involving the Experiment factor, reached significance, largest F(3, 52) = 2.57, MSE = 477.51.
Figure 4.
Summary of test results from Experiment 3 and three replications. Data from four groups of rats trained with a procedure identical to that of Group Short in Experiment 3 (30 Paired, 30 Unpaired; see text for details). “S” and “pre-S” refer to responses recorded during the stimulus and during the 30 s that preceded onset of the stimulus. Error bars are the standard error of the mean.
The results thus confirm that the 8-min S produces habit, whereas procedures that contain the same amount of time in S and the ITI in each session, but involve a 30-s S, produce goal-directed action.
Experiment 4
Experiments 1–3 suggest a role for S duration in the development of a discriminated habit. As noted earlier, the pattern is consistent with the local correlation view (Dickinson, 1985, 1989; Dickinson & Perez, 2018). However, another possibility is raised by the fact that training with a 30-s S and the RI 30-s reinforcement schedule uniquely produced a mixture of trials that contained a reinforcer and trials that did not, with roughly equal probability. That is, when the RI 30-s schedule is active during the S, the computer decides whether the next response will be reinforced by querying a 1/30 probability each second. This process is functionally equivalent to the computer randomly selecting intervals from a distribution of intervals that were about equally likely to be longer or shorter than 30 s. Animals trained with the 30-s S therefore experienced a mixture of trials that involved earning a single reinforcer, no reinforcer, or (occasionally) multiple reinforcers. In contrast, groups trained with the 8-min S in Experiments 2 and 3 earned at least one reinforcer (and typically many more) on every trial.
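A simple simulation makes the trial-level consequence concrete. The sketch below (ours, not the authors’ control code) asks how often the RI 30-s schedule arms at least one reinforcer during a 30-s S; actual delivery additionally requires a response after arming, which the sketch ignores:

```python
import random

def s_trial_arms(duration_s=30, p_per_s=1/30):
    """One 30-s S presentation: does the RI 30-s schedule arm
    at least one reinforcer during the stimulus?"""
    return any(random.random() < p_per_s for _ in range(duration_s))

n = 100_000
print(sum(s_trial_arms() for _ in range(n)) / n)
# ~0.64, i.e., 1 - (29/30)**30. Because delivery also requires a response,
# the obtained proportion of reinforced trials is nearer one half (see Figure 6).
```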
One consequence of receiving a mixture of reinforced and nonreinforced trials is that it reduced the overall predictability of the reinforcer during training. In experiments on classical conditioning, reinforcer predictability has an important influence on attentional processes. For example, partial reinforcement maintains orienting responses to the CS (Kaye & Pearce, 1984; see also Beesley et al., 2015; Hogarth et al., 2008) as well as the CS’s associability (e.g., Kaye & Pearce, 1984). It would be natural to assume that partial reinforcement has the same effect in discriminated operant learning and should therefore maintain attention to the S. For the same reason, however, it might also maintain attention to the response, thereby preventing the conversion of goal-directed action into habit. Applying the Pearce-Hall attention rule (Pearce & Hall, 1980) to discriminated instrumental behavior, if every S predictably signals a reinforcer, attention might decrease to both S and R, potentially making the behavior more automatic. (When a Pavlovian CS is repeatedly paired with a predictable reinforcer, the conditioned response it elicits is also thought to become automatic.) In contrast, if only 50% of Ss contain a reinforcer, attention would be maintained to both S and R, perhaps preventing the development of habit.
Experiment 4 therefore asked whether partial reinforcement was the feature of Experiments 1 and 3 that maintained goal-directed action in the 30-s S. Two groups received discriminated training with a 30-s S. Group Partial Reinforcement (PRF) received training exactly like that of the Short groups in Experiments 1 and 3; roughly half the Ss contained reinforcers, and half did not. In contrast, Group Continuous Reinforcement (CRF) received training under similar conditions, except that the RI 30-s schedule was modified so that a reinforcer was available in every S. Specifically, at the outset of every trial the computer sampled from a distribution of intervals ranging from 2 to 26 s; the first response emitted after the selected interval had elapsed was then reinforced. (Only one reinforcer was delivered on a given trial.) The method maintained the average reinforcement rate experienced by the PRF group, but guaranteed the availability of a reinforcer on every trial. If reinforcer predictability determines the development of habit, then there should be no reinforcer devaluation effect in Group CRF, despite the use of the 30-s S. Goal-directed action should be maintained, as usual, in Group PRF.
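The contrast between the two schedules can be sketched as follows (our illustration; the exact interval list and its sampling distribution, beyond the stated 2–26 s range, are assumptions):

```python
import random

def prf_arming_times(duration_s=30, p_per_s=1/30):
    """Standard RI 30-s trial (Group PRF): the schedule may arm zero, one,
    or occasionally several reinforcers during the 30-s S."""
    return [t for t in range(duration_s) if random.random() < p_per_s]

def crf_arming_time(intervals=range(2, 27)):
    """Modified schedule (Group CRF): one interval is drawn at trial outset;
    the first response after it is reinforced, so a reinforcer is available
    on every trial (and at most one is delivered)."""
    return random.choice(list(intervals))
```

On both schedules, the first response after an arming time produces a pellet; the schedules differ mainly in whether a trial is guaranteed to arm at all.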
Method
Subjects and apparatus.
The subjects were 32 female Wistar rats purchased from the same vendor as those in the previous experiments and maintained under the same conditions. The apparatus was also the same.
Procedure
Acquisition.
Magazine training and response training were conducted as in Experiment 1 except where noted. Rats were divided into two groups (n = 16) that received discriminated operant training with the 30-s S. One session was conducted each day. All rats received sessions consisting of 32 30-s trials separated by a variable 90-s ITI, for a total of 64 min of session time. All rats thus experienced identical amounts of time in S (with the RI 30-s schedule operating) and nonreinforced ITI time each day. For Group PRF, reinforcers in S were scheduled according to the usual RI 30-s schedule used in Experiments 1 and 3. For Group CRF, the reinforcement schedule was modified such that an interval was randomly selected from a list of possible durations (range 2–26 s) at the beginning of each trial. A pellet was then delivered when the rat made the first response after the selected interval. In the first two sessions, the delivery of the first pellet started an RI 30-s schedule that ran for the remainder of the S presentation. This allowed the animals to earn more than one pellet per trial. However, in order to arrange a similar number of reinforced responses in the two groups, for the remaining 14 sessions of training the schedule was modified so that only one reinforcer was delivered on any given trial.
Aversion conditioning.
Aversion conditioning was then conducted following the usual procedure. Aversion conditioning proceeded for 6 cycles.
Testing.
Rats were tested for discriminated operant responding and pellet consumption, and were then allowed to reacquire responding in the S, following the 3-day procedure used in Experiment 1.
Results
Acquisition.
Acquisition proceeded without incident, except that a computer error resulted in the loss of data from 8 rats (2 from each group) on the 15th session of training (trial blocks 29 and 30). In addition, one rat in Group CRF Paired made a total of only 3 responses during the entire test session. This data point was identified as an outlier because it fell more than 1.5 interquartile ranges below the group median (Tukey, 1977). We therefore excluded all data from this rat from the analysis, but note that including the data did not influence the outcome of any statistical test. The remaining rats successfully acquired the discriminated operant response during training, as suggested by Figures 5a and 5b. In each group, the separation of S and pre-S responding developed similarly across the training phase. A Group (PRF, CRF) by Devaluation (Paired, Unpaired) by Stimulus Period (S, pre S) by Block (32) ANOVA supported these observations. There were significant effects of Stimulus Period, F(1, 19) = 86.51, MSE = 180.81, p < .001, Block, F(31, 589) = 13.33, MSE = 8.14, p < .001, and a Stimulus Period by Block interaction, F(31, 589) = 29.16, MSE = 4.38, p < .001. In addition, there was a reliable Group by Block interaction, F(31, 589) = 2.27, MSE = 4.38, p < .001, other Fs < 1.38. This result suggests that responding differed across the acquisition phase as a function of whether the rats were trained with PRF or CRF. This was true only early in training; responding in the two groups did not differ in the final 16-trial block, where a Group by Devaluation by Stimulus Period ANOVA revealed an effect of Stimulus Period, F(1, 27) = 109.92, MSE = 16.13, p < .001, and no other effects or interactions, largest F(1, 27) = 1.66, MSE = 20.90. We also note that the groups did not differ in the number of reinforcers earned over the final ten 16-trial blocks of training: Group PRF earned a mean of 14.3 pellets [standard deviation (SD) = 2.1], and Group CRF earned a mean of 15.8 pellets (SD = 0.3). A Group (PRF, CRF) by Devaluation (Paired, Unpaired) by Block (10) ANOVA did not find any reliable effects or interactions, largest F(1, 19) = 3.66, MSE = 24.27.
Figure 5.
Results of Experiment 4. Mean number of responses per trial across blocks of 16 trials in acquisition (Left) and per 4-trial block in the test (Right). “S” and “pre-S” refer to responses recorded during the stimulus and during the 30 s that preceded onset of the stimulus. Error bars are the standard error of the mean.
Figure 6 confirms that training with the RI 30-s schedule resulted in at least one reinforcer in slightly more than half the trials in Group PRF. In contrast, the modified RI schedule given to Group CRF resulted in roughly every trial containing a reinforced response. A Group (PRF, CRF) by Devaluation (Paired, Unpaired) by Block (32) ANOVA on the data shown in the figure found significant effects of Group, F(1, 19) = 293.58, MSE = 0.08, p < .001, η2p = .94, 95% CI [.86, .96], and Block, F(31, 589) = 12.88, MSE = 0.02, p < .001. The other effects and interactions did not reach significance, largest F(1, 19) = 4.06.
Figure 6.
Reinforcers during training in Experiment 4. Mean proportion of trials with at least one reinforcer plotted in blocks of 16 trials for Groups PRF and CRF. Error bars are the standard error of the mean.
Test.
Devaluation proceeded smoothly; the Paired rats in Groups PRF and CRF ate a mean of 2.9 and 1.6 pellets, respectively, on the last trial of the devaluation phase. Results of the extinction test are shown in the right panels of Figure 5, which summarize test responding over blocks of 4 trials. Reinforcer devaluation weakened responding in Group PRF, but not in Group CRF. Statistical analysis supported this conclusion. A Group (PRF, CRF) by Devaluation (Paired, Unpaired) by Stimulus Period (S, pre S) by 4-trial Block (4) ANOVA found significant effects of Stimulus Period, F(1, 27) = 65.35, MSE = 416.40, p < .001, and Block, F(3, 81) = 40.20, MSE = 177.51, p < .001. There were also significant interactions between Devaluation and Stimulus Period, F(1, 27) = 6.64, p = .016, η2p = .20, 95% CI [.01, .43], Group and Block, F(3, 81) = 6.53, p = .001, Stimulus Period and Block, F(3, 81) = 16.14, MSE = 116.02, p < .001, and a three-way interaction between Group, Stimulus Period, and Block, F(3, 81) = 4.21, p = .008, η2p = .13, 95% CI [.01, .25]. The other effects and interactions did not reach significance, largest F(1, 27) = 2.66, MSE = 725.44.
In order to understand the interactions, the PRF and CRF groups were analyzed in separate Stimulus Period by Devaluation by Block ANOVAs. For Group CRF, there were significant effects of Stimulus Period, F(1, 13) = 33.52, MSE = 489.50, p < .001, Block, F(3, 39) = 27.04, MSE = 250.65, p < .001, and a Stimulus Period by Block interaction, F(3, 39) = 11.71, MSE = 176.38, p < .001. Effects involving Devaluation did not approach significance, largest F(1, 13) = 1.34, MSE = 489.50. For Group PRF, the same analysis similarly found significant effects of Stimulus Period, F(1, 14) = 31.53, MSE = 348.53, p < .001, Block, F(3, 42) = 12.13, MSE = 109.59, p < .001, and a Stimulus Period by Block interaction, F(3, 42) = 3.88, MSE = 59.98, p = .016. Importantly, while the main effect of Devaluation approached significance, F(1, 14) = 4.31, MSE = 505.07, p = .057, η2p = .24, 95% CI [.00, .52], there was a significant Stimulus Period by Devaluation interaction, F(1, 14) = 6.93, MSE = 348.53, p = .020, η2p = .33, 95% CI [.01, .59]. Further analysis of responding in each stimulus period revealed an effect of devaluation in the S, F(1, 14) = 5.73, MSE = 800.99, p = .031, η2p = .29, 95% CI [.00, .56], but not in the pre-S period, F < 1. The analyses of the PRF and CRF groups are consistent with the hypothesis that habit developed in the CRF condition, but not in the usual PRF condition.
A conditioned aversion to the pellets was confirmed by a consumption test on the following day. Paired rats in Groups PRF and CRF ate a mean of 2.7 and 0.4 out of the 10 pellets offered, whereas Unpaired rats ate all 10. Finally, a reacquisition test further confirmed that the pellet no longer served to support lever pressing in the Paired group (data not shown).
Discussion
Rats acquired robust discriminated operant responding with both training procedures. However, the CRF procedure resulted in a response that was not sensitive to reinforcer devaluation, whereas the PRF procedure resulted in a response that was still sensitive to devaluation (as in Experiments 1 and 3). The findings are thus consistent with the view that a predictable reinforcer (Group CRF) allows habit to develop, and a less predictable reinforcer (Group PRF) maintains goal-directed action. Such findings are consistent with an attentional account of habit formation.
It is interesting to note that responding in Group PRF was slower to extinguish during testing than responding in Group CRF. This example of the partial-reinforcement extinction effect may not be surprising from the view of traditional theories of extinction. For example, because the PRF group had received nonreinforced trials with S during acquisition, responding would have generalized more from acquisition to extinction (e.g., Capaldi, 1967, 1994). Notably, however, the habit in Group CRF was faster to extinguish than the goal-directed action in Group PRF. To our knowledge, this is the first comparison of behavior change in habits and goal-directed actions with a similar number of previous response-outcome pairings and similar response rates entering testing. Based on intuition about habits and goal-directed actions, one might have expected habitual responding in the CRF condition to be more persistent. However, the literature on training and extinction does not necessarily support this intuition; for example, overtraining an instrumental response, which might be expected to yield a strong habit, can actually result in more rapid extinction than less training (e.g., Ison, 1962; Tombaugh, 1967).
General Discussion
The present results may provide new insight into how discriminated operants can become goal-directed actions or habits. Using a standard discriminated operant procedure in which a 30-s S set the occasion for responding reinforced on an RI 30-s schedule (Bouton et al., 2016; Todd et al., 2014), Experiment 1 found evidence of sensitivity to reinforcer devaluation after 4, 22, and 66 sessions of discriminated operant training. The 66-session procedure allowed approximately 1,054 conjunctions between the response and the reinforcer, a number that far exceeds what is sufficient to yield habit in free-operant procedures. Habit was observed, however, when the duration of the S was extended to 8 min in Experiment 2. Importantly, the animals in Experiment 2 received the same number of reinforced responses and the same total exposure to S and the ITI as the 66-session group in Experiment 1b. And when Experiment 3 compared the effects of training with 30-s and 8-min Ss experimentally, it again found habit with the 8-min S, and goal-directed action with the 30-s S, when the procedure controlled the amount of per-session time in S and the ITI (Figure 4). Experiment 4 then manipulated the percentage of 30-s Ss that contained an earned reinforcer; Group CRF could earn a reinforcer in every S, whereas Group PRF earned one on approximately 50% of the trials. Based on the results of Experiments 1–3, if habits develop only when the S duration is long, then responding should have remained goal-directed in both groups. Yet the CRF condition (and not the PRF condition) produced a response that was insensitive to reinforcer devaluation, and thus had the hallmark of habit. Overall, the results suggest that habit can develop during discriminated operant training when the procedure ensures that at least one response is reinforced on every trial.
The present results may help discriminate between available theories of habit formation. According to the Law of Effect (Thorndike, 1911), which some authors have suggested might control the development of habit (e.g., de Wit & Dickinson, 2009; Wood & Rünger, 2016), the acquisition of stimulus control is essentially synonymous with the development of habit, because reinforcement is assumed to strengthen the S-R (habit) connection every time it coincides with S and R. Our repeated failure to observe habit with discriminated operant procedures (Experiments 1, 3, and 4), even after training was extensive and yielded clear-cut discriminative performance, suggests that repeated exposure to the “trigger-action-reward sequence” is not sufficient to generate habit. Habit formation is thus not inevitable with discriminative control, and the results strongly challenge the view that habit develops during instrumental learning because the reinforcer automatically strengthens an S-R connection, as in Thorndike’s Law of Effect.
The results may also question a more contemporary account of habit formation (Dickinson, 1985, 1994) that restates the Law of Effect in terms of the local response-reinforcer correlations arranged by different instrumental contingencies (Baum, 1973, 2012). Ratio reinforcement schedules create strong correlations between local response rate and reinforcement rate, whereas interval schedules arrange a weaker correlation because, above a certain response rate, further increases in response rate cause no further change in reinforcement rate. Consistent with this perspective, Dickinson et al. (1983) found evidence that habits develop more easily with interval schedules. And as we noted earlier, a correlation account can explain the results of Experiments 1–3. Specifically, training with the 30-s S arranged a strong local correlation between the response and the reinforcer, because there were marked increases and decreases in both rates at the onset and offset of S; goal-directed action might therefore be expected. In contrast, the 8-min S procedure yielded a weaker local correlation, because the rats responded on the interval reinforcement schedule for a more extended period of time; that procedure was therefore more likely to create the possibility of habit. However, the correlation account has difficulty explaining the results of Experiment 4, where Groups CRF and PRF both experienced similar changes in response rate and reinforcement rate at the onset and offset of the 30-s S. (Group CRF arguably had the stronger local correlation, because there was an increase in both response rate and reinforcement rate in nearly every S.) Yet Group CRF developed a habit (a response that was less sensitive to reinforcer devaluation) and Group PRF did not. That difference would not be predicted from knowledge of the local correlation between response rate and reinforcement rate.
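To make the schedule argument concrete, the toy calculation below (our illustration, with arbitrary parameter values; responding is idealized as random in time) contrasts the two feedback functions: on a ratio schedule the expected reinforcement rate grows linearly with response rate, whereas on a random interval schedule it flattens once responding is fast relative to the arming interval, which is precisely the weak local correlation the account appeals to.

```python
# Toy illustration (ours, with arbitrary parameters): expected reinforcement
# rate as a function of response rate on ratio vs. random-interval schedules.
def ratio_rate(resp_rate, n=10):
    """Ratio n: every nth response is reinforced, so reinforcement rate
    grows linearly with response rate."""
    return resp_rate / n

def random_interval_rate(resp_rate, mean_interval=30.0):
    """Random interval (approximation): a reinforcer is armed on average
    every mean_interval s and held until the next response collects it,
    so the expected time per reinforcer is arming time plus response wait."""
    return 1.0 / (mean_interval + 1.0 / resp_rate)

for r in (0.05, 0.2, 1.0, 5.0):  # responses per second
    print(f"r = {r:4}: ratio {ratio_rate(r):.3f}/s, "
          f"RI 30 s {random_interval_rate(r):.4f}/s")
```

Above roughly one response per arming interval, further increases in response rate barely change the RI reinforcement rate, so the local correlation between the two rates is weak.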
A third potential account suggests that a habit may develop as the animal learns to pay less and less attention to the instrumental response. According to one successful theory of attention in Pavlovian learning (Pearce & Hall, 1980), attention to a CS decreases during conditioning as the US becomes better and better predicted. In instrumental learning, a similar process would cause attention to the response to decrease as the reinforcer becomes well predicted. In either case, responding may become more automatic; in instrumental learning, such automatization could be synonymous with habit. The present results are consistent with such an account. Habit developed in procedures where the stimulus and response contingencies made the reinforcer most predictable: The 8-min S, which permitted habit acquisition in Experiments 2 and 3, predicted a reinforcer on every presentation, and a 30-s S that always contained a reinforcer also consistently predicted one (Experiment 4). In contrast, when S was paired with the reinforcer only 50% of the time (Experiments 1, 3, and 4), the response did not become a habit. As noted earlier, previous work in Pavlovian learning indicates that when a CS is reinforced on 100% of its presentations, attention and orienting responses decline, whereas when the CS is reinforced on 50% of its presentations, attention and orienting responses remain directed toward the CS (e.g., Beesley et al., 2015; Hogarth et al., 2008; Kaye & Pearce, 1984). According to the Pearce-Hall model, such an effect operating here could have sustained attention to both S and the response in the 50% procedures, preventing the acquisition of habit.
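The attentional argument can be made concrete with a toy simulation. The sketch below is our own illustration, not a fitted model: it uses a common smoothed variant of the Pearce-Hall associability rule in which α tracks a running average of the absolute prediction error |λ − V| (the learning rate and decay parameter are arbitrary). Under 100% reinforcement the error, and therefore α, decays toward zero; under 50% reinforcement the error remains large and α stays high, paralleling the sustained attention observed with partially reinforced CSs.

```python
# Toy Pearce-Hall sketch (our illustration; parameters are arbitrary).
# alpha tracks a running average of |lambda - V|; V follows a delta rule.
import random

def ph_associability(p_reinforced, trials=500, lr=0.1, gamma=0.1, seed=0):
    rng = random.Random(seed)
    V, alpha = 0.0, 1.0
    for _ in range(trials):
        lam = 1.0 if rng.random() < p_reinforced else 0.0  # reinforced trial?
        alpha = gamma * abs(lam - V) + (1.0 - gamma) * alpha
        V += lr * alpha * (lam - V)
    return alpha

print(f"alpha after 100% reinforcement: {ph_associability(1.0):.2f}")  # near 0
print(f"alpha after  50% reinforcement: {ph_associability(0.5):.2f}")  # near .5
```

On the analogy developed above, the response plays the role of the predictive cue: when the reinforcer is fully predicted, attention to the response can decline and performance can proceed habitually.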
Mackintosh’s (1975) model of attention and classical conditioning is also relevant. In that model, co-occurring conditioned stimuli compete for attention, and a strong predictor of the reinforcer (e.g., an S reinforced 100% of the time) can cause attention to other stimuli to decline. By extension, in instrumental learning, when S is a strong predictor of the reinforcer, it might similarly cause a decrease in attention to the response. Partial reinforcement would weaken S’s association with the reinforcer relative to continuous reinforcement, and would thus produce less competition for attention to the response.
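The same point can be sketched for this competitive mechanism. The code below is our own extrapolation of Mackintosh’s rule to the instrumental case, not part of the original model or the present experiments: a hypothetical response-produced cue R competes with S for attention, attention to a cue rises when it is the better predictor of the trial outcome, and S is assumed to be the more salient cue. Under continuous reinforcement S dominates and attention to R collapses; under partial reinforcement neither cue is consistently the better predictor, so attention to R survives.

```python
# Toy Mackintosh-style sketch (our extrapolation; parameters arbitrary).
# Two cues compete for attention: the discriminative stimulus S and a
# hypothetical response-produced cue R. A cue's attention (alpha) rises
# when it predicts the trial outcome better than its competitor.
import random

def mackintosh_alpha_R(p_reinforced, trials=300, lr=0.2, step=0.02, seed=0):
    rng = random.Random(seed)
    V = {"S": 0.0, "R": 0.0}
    alpha = {"S": 0.6, "R": 0.6}
    salience = {"S": 1.0, "R": 0.5}      # S assumed more salient than R
    for _ in range(trials):
        lam = 1.0 if rng.random() < p_reinforced else 0.0
        for cue, other in (("S", "R"), ("R", "S")):
            if abs(lam - V[cue]) < abs(lam - V[other]):
                alpha[cue] = min(1.0, alpha[cue] + step)   # better predictor
            else:
                alpha[cue] = max(0.05, alpha[cue] - step)  # worse predictor
            V[cue] += lr * salience[cue] * alpha[cue] * (lam - V[cue])
    return alpha["R"]

print(f"alpha_R, 100% reinforcement: {mackintosh_alpha_R(1.0):.2f}")  # low
print(f"alpha_R,  50% reinforcement: {mackintosh_alpha_R(0.5):.2f}")  # higher
```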
It is worth noting that some habit arguably developed even in the procedures that left the response sensitive to reinforcer devaluation. That is, the animals that received training with the (partially reinforced) 30-s S in Experiments 1, 3, and 4 often still performed the response to some extent after reinforcer devaluation had depressed it. Residual responding of this sort can be interpreted as the development of some habit (e.g., Dickinson et al., 1995; Thrailkill & Bouton, 2015), although it also has other interpretations (Colwill & Rescorla, 1985). Some incomplete habit learning with partial reinforcement is consistent with an attentional view, because attentional models could predict some loss of attention to the response, for example, if the S were sufficiently salient.
As mentioned in the Introduction, other experiments have previously studied habits and goal-directed actions using discriminated operant procedures. For instance, Vandaele, Pribut, and Janak (2017) recently examined habit acquisition when lever-press responding was occasioned by the insertion of the lever into the operant chamber; reinforcement was provided after the 5th response, and the lever was then withdrawn. Vandaele et al. found mixed results regarding habit when responding was assessed following reinforcer devaluation accomplished with either specific satiety or taste aversion conditioning in the home cage (which may yield incomplete transfer of the taste aversion to the operant chamber; see Kosaki & Dickinson, 2010). The results depended to some extent on whether the reinforcer was sucrose liquid or grain pellets. Nevertheless, the training method has promise for studying habits. According to an attentional account, the lever-insertion S would be a strong predictor of the reinforcer, and would thus cause attention to the response to decline, as in the present experiments.
Other investigators have also reported mixed evidence of habit development with discriminated operant methods (Callu, Puget, Faure, Guegan, & El Massioui, 2007; Faure, Haberland, Condé, & El Massioui, 2005; Faure, LeBlanc-Veyrac, & El Massioui, 2010). Callu et al. (2007) found evidence consistent with habit following training in which a reinforcer was contingent on the first lever-press response emitted in the presence of a 10-s tone. That result is consistent with the results of the CRF condition in the present Experiment 4. Unfortunately, the experiment did not provide evidence that the taste-aversion devaluation procedure could depress responding with the method, which is a necessary condition for interpreting a null effect of devaluation on the response (the evidence of habit). The point is crucial because, after showing insensitivity to devaluation in the extinction test, rats with an aversion conditioned to the reinforcer continued to make the instrumental response when it was paired with the reinforcer again. Insensitivity to devaluation could therefore have resulted from weak transfer of the taste aversion conditioning to the test.
The two other studies (Faure et al., 2005, 2010) also found mixed evidence of habit with discriminated operant responses. In both experiments, two unique stimulus-response-outcome combinations (e.g., tone-lever-sucrose pellet and light-chain-food pellet) were trained in separate sessions before the rats were sated on one of the outcomes and both stimulus-response combinations were tested in a single session. (The schedule of reinforcement during training was not specified.) Both experiments studied the effects of brain (striatal) lesions; the behavior of the sham controls is most relevant here. After extensive training of both chain and lever responding, Faure et al. (2005) found evidence of goal-directed action on the chain and possible evidence of habit on the lever (a strong trend actually suggested action, but the comparison of the valued and devalued reinforcer conditions was nonsignificant, F(1, 5) = 3.41). The possible difference between chain and lever was not explained. In the other experiment, also after extensive training, Faure et al. (2010) found inexplicably low levels of both the valued and the devalued response during extinction testing. Although this result can be interpreted as evidence of habit, it is possible that any sensitivity to devaluation was obscured by a general demotivating effect of satiety. And there was no difference in responding for valued and devalued reinforcers during subsequent sessions in which the two behaviors were reinforced again, which may again suggest weak transfer of the devaluation treatment to the test of instrumental responding.
The present results suggest that habits may develop most readily when reinforcers are well predicted by the discriminative stimulus. As such, they are most germane to studies of habits in discriminated operant procedures. Nevertheless, prior work with free operants is worth noting. Recall that, based on evidence from random ratio schedules, Dickinson and others (e.g., Dickinson et al., 1983) suggested that ratio schedules protect against habit formation because they arrange a strong response-reinforcer correlation. It is not yet clear how an attention-based account would address the ratio/interval difference, although it might be noted that behaviors reinforced on ratio schedules do eventually convert to habit with training (Corbit, Nie, & Janak, 2012, 2014). An especially notable case is free-operant continuous reinforcement (fixed ratio [FR] 1). On FR 1, a reinforcer is delivered contingent on the organism emitting a single response; there is thus a very strong (1:1) response-reinforcer correlation, and a correlation-based account would therefore not predict habit. Yet some of the earliest evidence of habit formation with free-operant procedures found clear evidence of habit after training on FR 1 (Adams, 1982). Although the present results specifically suggest that habit develops when the discriminative stimulus is consistently paired with the reinforcer, it is notable that FR 1 is the ratio schedule in which the outcome is most strongly predicted by the response.
In another free-operant study, DeRusso, Fan, Gupta, Shelest, Costa, and Yin (2010) investigated a role for reinforcer “uncertainty” in habit formation. In their experiment, groups of mice were trained on free-operant reinforcement schedules that varied the probability of reinforcement following intervals that averaged 60 s. One group received a simple fixed interval (FI) 60-s schedule, and a group with high reinforcer uncertainty received a schedule in which, every 6 s, a pellet was made available with probability 1/10 (an average programmed interval of 60 s) and the next response was then reinforced. In potential contrast to the present results, the group for which reinforcer uncertainty was high showed greater insensitivity to reinforcer devaluation (through sensory-specific satiety), suggesting habit. It is not clear how habit’s apparent connection with uncertain reinforcers squares with habits in everyday life, where reinforcers seem predictable and certain. DeRusso et al. suggested that, while their RI and FI schedules arranged similar response-reinforcer correlations (cf. Dickinson, 1989), the two schedules maintained different patterns of behavior that created differences in how closely responding was associated in time with the next reinforcer. More research will be required to understand the conditions under which reinforcer predictability influences habit formation in free-operant procedures.
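As we read DeRusso et al.’s procedural description, the uncertain schedule can be paraphrased in a few lines. The sketch below is our paraphrase of that arming rule, not their code; it simply verifies that querying a 1/10 probability every 6 s yields an average programmed interval of 60 s.

```python
# Our paraphrase of the uncertain schedule described above (not DeRusso
# et al.'s code): every 6 s, a pellet is made available with probability
# 1/10, giving an average programmed interval of 6 / 0.1 = 60 s.
import random

rng = random.Random(1)
gaps, last_arming = [], 0
for t in range(6, 360006, 6):        # 100 hours of 6-s bins
    if rng.random() < 0.1:           # the 1-in-10 arming query
        gaps.append(t - last_arming)
        last_arming = t
print(f"mean programmed interval: {sum(gaps) / len(gaps):.1f} s")  # ~60 s
```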
To summarize, the present experiments explored the conditions that result in the development of discriminated habits. Contrary to the Law of Effect, habit formation was far from inevitable as a result of discriminated operant training. Habits generally developed when the discriminative stimulus reliably predicted reinforcers. The results challenge contemporary theories of habit formation and begin to suggest an alternative approach emphasizing attention to the response. Perhaps consistent with theories of attention and learning (Mackintosh, 1975; Pearce & Hall, 1980), conditions that make the reinforcer predictable may encourage the animal to pay less attention to its behavior, thus encouraging the development of habit.
Acknowledgments
The research was supported by NIH Grant R01 DA033123. Correspondence should be addressed to either EAT or MEB at the Department of Psychological Science, University of Vermont, Burlington, VT 05405-0134.
References
- Adams CD (1982). Variations in the sensitivity of instrumental responding to reinforcer devaluation. Quarterly Journal of Experimental Psychology, 34B, 77–98.
- Adams CD, & Dickinson A (1981). Instrumental responding following reinforcer devaluation. Quarterly Journal of Experimental Psychology, 33B, 109–122.
- Balleine BW, & Dickinson A (1998). Goal-directed instrumental action: Contingency and incentive learning and their cortical substrates. Neuropharmacology, 37, 407–419.
- Baum WM (1973). The correlation-based law of effect. Journal of the Experimental Analysis of Behavior, 20, 137–153.
- Baum WM (2012). Rethinking reinforcement: Allocation, induction, and contingency. Journal of the Experimental Analysis of Behavior, 97, 101–124.
- Beesley T, Nguyen KP, Pearson D, & Le Pelley ME (2015). Uncertainty and predictiveness determine attention to cues during human associative learning. Quarterly Journal of Experimental Psychology, 68, 2175–2199.
- Bouton ME, Todd TP, & León SP (2014). Contextual control of discriminated operant behavior. Journal of Experimental Psychology: Animal Learning and Cognition, 40, 92–105.
- Bouton ME, Trask S, & Carranza-Jasso R (2016). Learning not to make the response during instrumental (operant) extinction. Journal of Experimental Psychology: Animal Learning and Cognition, 42, 246–258.
- Callu D, Puget S, Faure A, Guegan M, & El Massioui N (2007). Habit learning dissociation in rats with lesions to the vermis and the interpositus of the cerebellum. Neurobiology of Disease, 27, 228–237.
- Capaldi EJ (1966). Partial reinforcement: A hypothesis of sequential effects. Psychological Review, 73, 459–477.
- Capaldi EJ (1994). The sequential view: From rapidly fading stimulus traces to the organization of memory and the abstract concept of number. Psychonomic Bulletin & Review, 1, 156–181.
- Church RM, & Deluty MZ (1977). Bisection of temporal intervals. Journal of Experimental Psychology: Animal Behavior Processes, 3, 216–228.
- Colwill RM, & Rescorla RA (1985). Instrumental responding remains sensitive to reinforcer devaluation after extensive training. Journal of Experimental Psychology: Animal Behavior Processes, 11, 520–536.
- Colwill RM, & Rescorla RA (1988). Associations between the discriminative stimulus and the reinforcer in instrumental learning. Journal of Experimental Psychology: Animal Behavior Processes, 14, 155–164.
- Colwill RM, & Rescorla RA (1990). Effect of reinforcer devaluation on discriminative control of instrumental behavior. Journal of Experimental Psychology: Animal Behavior Processes, 16, 40–47.
- Corbit LH, Nie H, & Janak PH (2012). Habitual alcohol seeking: Time course and the contribution of subregions of the dorsal striatum. Biological Psychiatry, 72, 389–395.
- Corbit LH, Nie H, & Janak PH (2014). Habitual responding for alcohol depends upon both AMPA and D2 receptor signaling in the dorsolateral striatum. Frontiers in Behavioral Neuroscience, 8, 301.
- Daw ND, Niv Y, & Dayan P (2005). Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience, 8, 1704–1711.
- DeRusso AL, Fan D, Gupta J, Shelest O, Costa RM, & Yin HH (2010). Instrumental uncertainty as a determinant of behavior under interval schedules of reinforcement. Frontiers in Integrative Neuroscience, 4, 17.
- de Wit S, & Dickinson A (2009). Associative theories of goal-directed behavior: A case for animal-human translational models. Psychological Research, 73, 463–476.
- Dickinson A (1985). Actions and habits: The development of behavioral autonomy. Philosophical Transactions of the Royal Society of London, Series B, Biological Sciences, 308, 67–78.
- Dickinson A (1989). Expectancy theory in instrumental conditioning. In Klein SB & Mowrer RR (Eds.), Contemporary learning theories: Pavlovian conditioning and the status of traditional learning theories (pp. 279–308). Hillsdale, NJ: Erlbaum.
- Dickinson A (1994). Instrumental conditioning. In Mackintosh NJ (Ed.), Animal learning and cognition: Handbook of perception and cognition series (2nd ed., pp. 45–79). San Diego, CA: Academic Press.
- Dickinson A (2012). Associative learning and animal cognition. Philosophical Transactions of the Royal Society B, 367, 2733–2742.
- Dickinson A, Balleine B, Watt A, Gonzalez F, & Boakes RA (1995). Motivational control after extended instrumental training. Animal Learning & Behavior, 23, 197–206.
- Dickinson A, Nicholas DJ, & Adams CD (1983). The effect of instrumental training contingency on susceptibility to reinforcer devaluation. Quarterly Journal of Experimental Psychology, 35B, 35–51.
- Dickinson A, & Perez OD (2018). Actions and habits: Psychological issues in dual-system theory. In Morris RW, Bornstein AM, & Shenhav A (Eds.), Goal-directed decision making: Computations and neural circuits (pp. 1–37). Elsevier.
- Hall G, & Pearce JM (1979). Latent inhibition of a CS during CS–US pairings. Journal of Experimental Psychology: Animal Behavior Processes, 5, 31–42.
- Hogarth L, Dickinson A, Austin A, Brown C, & Duka T (2008). Attention and expectation in human predictive learning: The role of uncertainty. Quarterly Journal of Experimental Psychology, 61, 1658–1668.
- Holland PC (2004). Relations between Pavlovian-instrumental transfer and reinforcer devaluation. Journal of Experimental Psychology: Animal Behavior Processes, 30, 104–117.
- Ison JR (1962). Experimental extinction as a function of number of reinforcements. Journal of Experimental Psychology, 64, 314–317.
- Kaye H, & Pearce JM (1984). The strength of the orienting response during Pavlovian conditioning. Journal of Experimental Psychology: Animal Behavior Processes, 10, 90–109.
- Kosaki Y, & Dickinson A (2010). Choice and contingency in the development of behavioral autonomy during instrumental conditioning. Journal of Experimental Psychology: Animal Behavior Processes, 36, 334–342.
- Mackintosh NJ (1975). A theory of attention: Variations in the associability of stimuli with reinforcement. Psychological Review, 82, 276–298.
- Nelson A, & Killcross S (2006). Amphetamine exposure enhances habit formation. Journal of Neuroscience, 26, 3805–3812.
- Pearce JM, & Hall G (1980). A model for Pavlovian learning: Variations in the effectiveness of conditioned but not of unconditioned stimuli. Psychological Review, 87, 532–552.
- Steiger JH (2004). Beyond the F test: Effect size confidence intervals and tests of close fit in the analysis of variance and contrast analysis. Psychological Methods, 9, 164–182.
- Thorndike EL (1911). Animal intelligence: Experimental studies. New York: Macmillan.
- Thrailkill EA, & Bouton ME (2015). Contextual control of instrumental actions and habits. Journal of Experimental Psychology: Animal Learning and Cognition, 41, 69–80.
- Todd TP, Vurbic D, & Bouton ME (2014). Mechanisms of renewal after the extinction of discriminated operant behavior. Journal of Experimental Psychology: Animal Learning and Cognition, 40, 355–368.
- Tombaugh TN (1967). The overtraining extinction effect with a discrete-trial bar-press procedure. Journal of Experimental Psychology, 73, 632–634.
- Tukey JW (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
- Vandaele Y, Pribut HJ, & Janak PH (2017). Lever insertion as a salient stimulus promoting insensitivity to outcome devaluation. Frontiers in Integrative Neuroscience, 11, 23.
- Wood W, & Rünger D (2016). Psychology of habit. Annual Review of Psychology, 67, 289–314.
- Yin HH, Knowlton BJ, & Balleine BW (2004). Lesions of dorsolateral striatum preserve outcome expectancy but disrupt habit formation in instrumental learning. European Journal of Neuroscience, 19, 181–189.