Perspectives on Behavior Science. 2020 Aug 12;43(4):655–675. doi: 10.1007/s40614-020-00264-w

The Evolution of Behavior Analysis: Toward a Replication Crisis?

Matthew L. Locey

Abstract

The Open Science Collaboration (Science, 349(6251), 1–8, 2015) produced a massive failure to replicate previous research in psychology—what has been called a “replication crisis in psychology.” An important question for behavior scientists is: To what extent is behavior science vulnerable to this type of massive replication failure? That question is addressed by considering the features of a traditional approach to behavior science. Behavior science in its infancy was a natural science, inductive, within-subject approach that encouraged both direct and systematic replication. Each of these features of behavior science increased its resistance to three factors identified as responsible for the alleged replication crisis: (1) failures to replicate procedures, (2) low-power designs, and (3) publication bias toward positive results. As behavior science has evolved, the features of the traditional approach have become less ubiquitous. And if the science continues to evolve as it has, it will likely become more vulnerable to a massive replication failure like that reported by the Open Science Collaboration (Science, 349(6251), 1–8, 2015).

Keywords: Natural science, Inductive science, Within-subject design, Replication, Direct replication, Systematic replication


Behavior science, like any science, is not immune to the risk of replication failure. Is that risk sufficient to warrant a more careful consideration of current practices within behavior science, as has been suggested for other branches of psychology? The Open Science Collaboration (OSC, 2015) recently sought to replicate the results of 97 experimental and correlational studies from 2008 that had originally yielded statistically significant effects (p < 0.05). Only 35 of those replication attempts yielded such an effect. Whether or not these results indicate a "replication crisis" in psychology, as many have suggested (e.g., Maxwell, Lau, & Howard, 2015), they do serve as a jarring reminder of the importance of replicability. It would be ideal if a similar project could be conducted within behavior science to ascertain the replicability of results reported in leading behavior science journals such as the Journal of Applied Behavior Analysis (JABA) and the Journal of the Experimental Analysis of Behavior (JEAB). But even in the absence of such a project there may be merit in evaluating how various features of behavior science influence its susceptibility to replication failures like those reported by OSC (2015). Such an evaluation is complicated by the fact that behavior science has evolved over the past 65 years. Although there are numerous advantages to that evolution, the evaluation that follows will suggest that one disadvantage is increased susceptibility to replication crisis within behavior science.

Behavior Science: From Infancy to Today

Sidman (1960), once considered the “Bible of Operant Methodology” (Morgan & Morgan, 2001), describes what could reasonably be identified as the traditional approach in behavior science (TABS). Although all behavior science of today might have been influenced by TABS, TABS should not be mistaken as being identical to the current field. The origins of behavior science (TABS) consisted of laboratory experiments with nonhuman animals. This approach was a natural science, inductive, within-subject approach that encouraged both direct and systematic replication. The sections that follow will briefly elaborate on these features as a prelude to (1) considering the extent to which current behavior science has retained these features and (2) exploring how each of these features (or lack thereof) influences the resistance (or susceptibility) of behavior science to a large-scale replication failure like that recently identified for other areas of psychology (OSC, 2015).

Due to the expansion of behavior science into a host of new domains, addressing that first point (the retention of TABS features within behavior science) is extremely complicated. Although the Journal of the Experimental Analysis of Behavior continues to publish research, behavior science is now published in a variety of journals: some dedicated to behavior analysis in particular (e.g., The Analysis of Verbal Behavior, Behavior Analysis: Research and Practice, Journal of Applied Behavior Analysis) and others not (e.g., Behavioural Pharmacology, Behavioural Processes, Journal of Behavioral Decision Making). To further complicate the issue, the widespread success of applied behavior analysis has led to a booming behavior analysis industry—an industry in which most applications go undocumented in the published literature. As such, in the discussion that follows, speculation about the current state of behavior science will largely replace definitive data. I leave it to individual researchers and practitioners to determine the extent to which their behavior science has drifted from the traditional approach.

TABS as a Natural Science

Within the empirical sciences, particular sciences are frequently categorized as either natural or social sciences. However, Ledoux (2002) argues that any science that deals only with natural events is a natural science and that any science that deals with “people issues” is a social science. His conclusion is that some sciences are both natural and social sciences (those that deal with people issues and only with natural events). Whether or not natural and social science are mutually exclusive categories, there are particular practices that tend to be more common within sciences that are labeled “natural” than those labeled “social.” Rather than focusing on how to draw precise lines that perfectly delineate the two types of science in all instances, it might be more fruitful to consider differential tendencies. For example, natural science is more likely (than social science) to rely on experimentation over correlation, direct over indirect measurement, and subject matter that is not specific to humans.

The traditional approach in behavior science involved total reliance upon experimentation over correlation. In fact, TABS was identical to the experimental analysis of behavior. And in the first issue of the Journal of the Experimental Analysis of Behavior (JEAB, 1958), there were 11 experimental designs and zero correlational studies. The foundation of TABS was to determine how a particular environmental manipulation influenced behavior by experimentally introducing (and removing) that manipulation. Likewise, TABS relied heavily upon direct over indirect measurement. Those 11 initial JEAB studies used direct, automated measures of behavior (and automated control of most experimental events). And although the subject matter of TABS was generally applicable to humans, the subject matter was rarely human-specific—that is, the subject matter was rarely only applicable to humans. Of the 11 experiments in the first issue of JEAB, only 2 used human subjects (Bijou, 1958; Keller, 1958). This is not to suggest that studies using human subjects are necessarily human-specific. But nonhuman studies are not human-specific.

Current Behavior Science as a Natural Science

In 2008 (the year of publication for the original studies for which OSC [2015] attempted replications), JEAB still relied upon a natural science approach. The final 11 JEAB studies of 2008, like the first 11 JEAB studies of 1958, were all experiments (rather than correlational designs). All 11 also relied upon direct, automated recording of behavior. Only 1 of these studies used humans as subjects (Fields & Moss, 2008)—comparable to the 2 out of 11 that used human subjects in the first issue of JEAB. But what about behavior science beyond JEAB? Research in applied behavior analysis typically involves experiments that rely on direct observation. Usually such experiments use human subjects. The use of human subjects does not preclude a natural science approach, but insofar as any of those studies focus on human-specific phenomena, those studies deviate from TABS and forgo the benefits, described later, of focusing on phenomena that are not human-specific. Likewise, applied behavior analysis frequently uses nonautomated measurement, introducing a potential source of error into the data. Finally, even though most behavior science research might involve experiments, many behavior scientists have also conducted correlational studies—some relying upon indirect measurement (e.g., delay discounting questionnaire studies). So, behavior science as a whole might remain a natural science, but other approaches have become more common as behavior science has evolved.

TABS as an Inductive Science

Skinner (1950) offered an alternative to the widely cherished notion that all of science is conducted to evaluate hypotheses. About 95% of the first chapter of Sidman (1960) is dedicated to answering the question, “Why perform experiments?” (p. 4). Evaluating a hypothesis is given as only one of many answers to that question (with less than 10% of the chapter dedicated to that particular answer). Within the first issue of JEAB, the word “hypothesis” only appears once (Bijou, 1958)—and then it is in the last sentence of the manuscript, as part of a suggestion for future research, not as a part of the reported study.

Instead of the (hypothetico-)deductive model of typical research in psychology, TABS was an inductive approach. The deductive approach involves forming a hypothesis to be supported or discredited by the results of a study (whether experimental or correlational). For example, one might perform an experiment to determine if the procedure of extinction decreases behavior. If behavior decreases under extinction, the hypothesis is supported (the null hypothesis is rejected). If behavior does not decrease, the hypothesis is discredited (the null hypothesis is not rejected). In a pure inductive approach, not only is there no binary (yes/no) hypothesis to address, there is no hypothesis at all driving the research. The pure inductive approach simply asks, "what happens to x under condition y?"—where x, in the case of behavior science, is typically just "behavior." For example, when Ferster and Skinner (1957) were assessing performance on a fixed ratio schedule, there was no hypothesis. There was only the question, "What happens to behavior when every zth response produces food?"

Other important distinctions exist between deductive and inductive research—some of which are particularly relevant to this issue of susceptibility to replication failure. For example, a deductive approach typically requires that a study be executed entirely as it was planned prior to data collection (see Hales, Wesselmann, & Hilgard, 2019, for a discussion of preregistration), whereas an inductive approach allows experimenters to make procedural changes as data are collected. This includes, but is not limited to, collecting data within a given condition until behavior is stable—essentially waiting until the answer to “what happens to x under condition y?” is answered before changing condition y. Indeed, these two approaches have sometimes been referred to as “theory before data” (deductive) and “data before theory” (inductive; Sidman, 1960). Stated in these terms, it should be apparent that any particular research project might fall somewhere within these two extremes, making the deductive versus inductive distinction more of a continuum than a binary difference between two discrete approaches. But whereas present-day behavior science studies might fall along all points of this continuum, TABS was consistently closer to the pure inductive approach.

Current Behavior Science as an Inductive Science

In the beginning, TABS might have been a purely inductive science. But it seems unlikely that any science could proceed entirely through such an approach. Asking "what will happen to behavior under condition y" might be an effective way to address some questions within a science, but reasonable binary questions will eventually arise. After identifying what happens to behavior under condition y, it becomes reasonable to ask whether or not behavior under condition y differs from behavior under condition w. So, as behavior science has evolved, many of its questions have become similar to the questions asked in a deductive approach. Within the last 11 JEAB studies of 2008, only 1 included a hypothesis test using inferential statistics (Elliffe, Davison, & Landon, 2008), but such tests are still more common within behavior science today than 60 years ago (Zimmerman, Watkins, & Poling, 2015). And even among the many studies that do not include such tests, in many cases such tests could reasonably be conducted—in particular because the approaches are now less inductive than those of TABS.

TABS as a Within-Subject Approach with Replication

TABS was built upon reversal designs in which each subject experienced within-session replications (from reinforcer to reinforcer, whether trial-based or free operant), session replications (of daily contingencies within a particular condition) until behavior stabilized, and condition replications (e.g., Ferster & Skinner, 1957). Although frequently referred to as single-subject designs, the entire experiment was also typically replicated across several subjects (this was true for 8 of the 11 studies in the first issue of JEAB). Related to this emphasis on replication is the encouragement of systematic replication as fundamental to any research program (Sidman, 1960). Systematic replications are those that are not direct, because one or more variables are altered from one implementation to the next. In a technical sense, all replications are systematic—because it is impossible to exactly duplicate all conditions (e.g., experiential differences in subjects whether in an across-subject replication or within-subject). But there is a continuum of deviation from original conditions along which those with relatively few deviations are identified as “direct” rather than “systematic.”

Current Behavior Science as a Within-Subject Approach with Systematic Replication

Both basic and applied research in behavior science have largely retained the within-subject approach of TABS. However, group designs have become more common—at least among studies with human subjects. Shorter studies are also more common. TABS-style designs with hundreds of daily sessions (e.g., Dews, 1958) are not difficult to find (e.g., Weaver & Branch, 2008), but neither are much shorter studies, including single-session designs (e.g., Fields & Moss, 2008). Related to this relative decrease in within-subject replication and systematic replication is the rise of the Least Publishable Unit (LPU) across scientific disciplines (Broad, 1981). If publication contingencies increasingly support smaller units (e.g., single experiments instead of multiexperiment studies), a reduction in multiexperiment systematic replications should not be surprising.

Factors Contributing to Replication Failures in Psychology

OSC (2015) was an ambitious, data-rich project with highly suggestive results. Only 36% of attempted replications yielded the same statistically significant effects as the 97 original studies. The mean effect size was only 40% of that reported in original studies. And for only 39% of the studies did the replicating investigators claim to have effectively replicated the original results (based on their subjective assessment rather than any standardized criteria). OSC (2015) offered three factors as likely responsible for these replication failures: failure to replicate procedures, low-power designs, and publication bias for positive results. The following sections describe each factor and then assess that factor in the context of the OSC (2015) studies, the traditional approach to behavior science (TABS), and the behavior science of today.

Failure to Replicate Procedures

The term "replication" has been used in two ways: the duplication of a procedure or the duplication of results (Goodman, Fanelli, & Ioannidis, 2016). Most of the evidence cited in support of a "replication crisis" consists of failures to replicate results, but such failures could be due to failures to replicate procedures (which includes participants). As mentioned above in the context of direct versus systematic replication, it would be impossible for any procedure to perfectly replicate another. For example, none of the OSC (2015) studies made any effort to recruit the exact same participants that were recruited for the original studies. The real issue, therefore, is not whether failures to replicate procedures exist, but rather to what extent the failures to replicate procedures were responsible for the failures to replicate results.

Failure to Replicate Procedures in OSC (2015)

Gilbert, King, Pettigrew, and Wilson (2016) described eight examples of procedural failures to replicate in OSC (2015) in which the changes were substantial enough to suggest—intuitively—that the procedural failures to replicate were likely responsible for the corresponding failures to replicate results. For example: “An original study that asked Israelis to imagine the consequences of military service was replicated by asking Americans to imagine the consequences of a honeymoon.” But those are just the extreme and obvious procedural-replication failures. By considering differences in the behavioral contingencies operating for the original investigations compared to the replications, it seems likely that the procedural-replication failures numbered substantially more than eight.

In a typical study, the experimenters have a vested interest in finding an effect (this includes, but is not limited to, the publication bias discussed later). But for the OSC (2015) project, an opposite phenomenon might have influenced the results. To illustrate this point, consider an ongoing study in our rat lab that involves choices between one and three food pellets. The rats showed no preference. This was a clear failure to replicate previous studies (e.g., Locey & Dallery, 2009; Mazur, 1987). But rather than immediately submitting this revolutionary result to JEAB, we tweaked the procedure to determine what was responsible for this replication failure. This is a standard type of activity within inductive research (Sidman, 1960). A demonstration of how some manipulation influences preference would be both less likely and less compelling if built upon a foundation of indifference that defied both intuition and previous research results (less food being equally preferred to more food). Under such conditions, both long-term (publication) and shorter-term (a meaningful determination of how some manipulation influences choice) contingencies would support efforts to identify and correct any procedural flaws (e.g., lever inequities) responsible for the anomalous results. But for a scientist operating under the contingencies of the OSC (2015) project, there would likely be less motivation to fix a procedure that initially yields indifference between one and three food pellets. The replication attempt will be published regardless of the results, so why bother modifying the procedure to ensure everything is working properly? Even worse, the contingencies might be reversed: adjusting the procedure might be viewed as a violation of protocol (previous studies that reported preference for three pellets over one pellet did not report any procedural modifications to produce that result, so adjusting anything to produce that initial preference might be avoided in the attempt at direct replication).

How significant is this potential problem? Although Gilbert et al. (2016) identified eight studies with extreme procedural differences, we have no definitive data on how many replications failed due to a lack of the minor tweaking necessary to replicate the original's results. The closest data we have are the 31% of original investigators who did not "endorse" the methodology proposed for replicating their studies (Gilbert et al., 2016). In other words, almost a third of the original investigators reported that the planned replication procedure was not a faithful procedural replication of the original. It is not surprising that these unendorsed protocols were four times as likely to yield results-replication failures as the endorsed protocols. Note that these endorsements—or lack thereof—by the original investigators occurred prior to replication, without knowledge of any deviations that might arise in the course of actually implementing the proposed protocol. Although such data are not conclusive, it does seem likely that in many cases, failures to replicate procedures were responsible for the failures to replicate results.

Failure to Replicate Procedures in TABS

A failure to adequately replicate procedures has been identified as the first significant risk factor responsible for the “replication crisis in psychology.” How did the various features of TABS influence its susceptibility to this particular risk factor? The sections below address this question by considering each of the previously identified features of TABS in the context of this particular risk factor.

Failure to Replicate Procedures in a Natural Science TABS

All three tendencies of natural sciences will typically improve experimental control relative to the corresponding tendencies of social sciences. A participant's preexperiment experiences (particularly in nonhuman animal experiments) and within-experiment experiences can typically be more tightly controlled in an experiment than in a typical correlational design. This facilitates the high-fidelity reproduction of those experiences at a later time and/or with other participants. In contrast, with a correlational design, a researcher typically lacks control of participant experiences—making it more difficult to describe those experiences. Furthermore, a high level of experimental control also allows a more accurate reproduction of previous procedures (e.g., it is easier to reproduce a particular diet when an experimenter has complete control over a subject's access to food). With a correlational study, a high-fidelity replication would typically require seeking out participants whose histories closely match those of the to-be-replicated study (generally without any method for accurately identifying such histories). With a well-controlled experiment, those matching histories are a built-in aspect of the procedural replication.

Likewise, direct measurement contributes to high-fidelity replication of procedures insofar as direct measurement enhances experimental control. Consider an hour-long session with thirty 15-s presentations of a green light and thirty 15-s presentations of a red light. Pressing a G key in the presence of the green light results in a 10% chance of three points and pressing an R key in the presence of the red light results in a 20% chance of two points. If key presses were directly recorded, the procedure would be more easily replicated than if a research assistant asked the participant at the end of each 10 min to self-report the exact moment in time that each response and point-award had occurred relative to each light presentation during the last 10 min. Likewise, the procedure would be more easily replicated if a computer automatically recorded the timing of each experimental event rather than relying on a research assistant to observe and manually record each event. The direct, automated recording of TABS avoided the potential measurement error introduced by indirect measurement and potentially faulty training or performance of observers. Failures to replicate procedures can occur by failing to replicate measurement practices. The potential for such failures can be greatly reduced, if not eliminated entirely, if all measurement practices are automated.

The preceding example should also illustrate the advantages of automated control of experimental events. A high-fidelity replication of the red-light/green-light procedure would be much easier if a computer controlled the presentation of lights and delivery of points (in addition to recording behavior). The alternative would be a human research assistant flipping the correct light switch twice per min, rolling a 10-sided die each time she judged the correct key to be adequately pressed, and delivering the correct number of points depending upon the roll of the die. Even if multiple highly trained assistants were involved, it is obvious that the manual procedure should be more difficult to perfectly replicate than the automated procedure.
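
To make the contrast concrete, the hypothetical red-light/green-light procedure above could be automated in a few dozen lines of code. The sketch below is offered only as an illustration (it is not drawn from any published study, and the simulated key presses merely stand in for hardware input); it shows how a single program can both control the experimental events and time-stamp every event as it occurs:

```python
import random
import time

def run_session(presentations=60, seed=0):
    """Sketch of an automated session for the hypothetical red/green-light
    procedure: the program alternates green- and red-light presentations,
    applies the point contingencies (10% chance of 3 points for G presses
    in green; 20% chance of 2 points for R presses in red), and time-stamps
    every event, so no human observer or record-keeper is needed."""
    random.seed(seed)
    log = []                      # (seconds into session, event description)
    start = time.time()
    for trial in range(presentations):
        light = "green" if trial % 2 == 0 else "red"
        log.append((time.time() - start, f"{light} light on"))
        # A real apparatus would read key presses from hardware; here a
        # simulated participant presses the correct key on 80% of trials.
        if random.random() < 0.8:
            key = "G" if light == "green" else "R"
            log.append((time.time() - start, f"{key} key pressed"))
            chance, points = (0.10, 3) if light == "green" else (0.20, 2)
            if random.random() < chance:
                log.append((time.time() - start, f"{points} points delivered"))
        # time.sleep(15)  # 15-s presentation; omitted so the sketch runs instantly
        log.append((time.time() - start, f"{light} light off"))
    return log

for seconds, event in run_session()[:8]:
    print(f"{seconds:7.3f} s  {event}")
```

Because every stimulus change, response, and point delivery is generated and recorded by the same program, a later replication requires little more than the program and its parameter values.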

Finally, studies with human-specific subject matter are also more vulnerable to failures to replicate procedures. The primary advantage of using nonhuman animals in an experiment is the greater degree of experimental control that is ethically permissible with animals over humans. In a rat experiment, an experimenter can have almost total control over that animal’s experiences prior to and throughout an experiment. So, just as well-controlled and well-described experiments can be more closely replicated than most correlational studies, so too can a well-controlled nonhuman animal study be more closely replicated than a typical human study. Also worth considering is that a cross-species phenomenon is likely to be a more robust phenomenon than a human-specific phenomenon. If a rat and a worm show the same relationship between the intensity of an unconditioned stimulus and the magnitude of an unconditioned response, then perhaps we are dealing with a general behavioral process that will be more easily replicated than a phenomenon that is only found in humans (e.g., a particular verbal response in the presence of a particular verbal stimulus). The cross-species phenomenon is less likely to be dependent upon some (less replicable) extraexperimental idiosyncratic histories than the human-specific phenomenon.

Failure to Replicate Procedures in an Inductive Science TABS

The inductive feature of TABS might easily be interpreted as an obstacle to high-fidelity procedural replication. Inductive procedures are typically adjusted over the course of an experiment based on the behavior of the participants. Although an experienced investigator can likely describe the specific variables responsible for such adjustments, the net result is a final procedure that would be more difficult to perfectly duplicate than a typical deductive design. A simple example would be the number of sessions in each condition. In a deductive approach, that number would likely be fixed—perhaps five sessions of each condition. In an inductive approach, that number would likely be variable—with a condition ending based on some behavioral stability criteria.
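
As a concrete illustration of such a criterion (a minimal sketch written for this discussion, not a published rule), a condition might end only when the most recent sessions vary little around their own mean:

```python
import random

def is_stable(session_rates, window=6, tolerance=0.10):
    """A simple, hypothetical stability criterion: the condition ends only
    when each of the last `window` session response rates falls within
    `tolerance` (as a proportion) of the mean of those sessions."""
    if len(session_rates) < window:
        return False
    recent = session_rates[-window:]
    mean = sum(recent) / window
    return all(abs(rate - mean) <= tolerance * mean for rate in recent)

# Simulated condition: response rate climbs toward an asymptote and then
# levels off; the condition runs until the criterion is met, so the number
# of sessions is determined by the data rather than fixed in advance.
random.seed(0)
rates = []
session = 0
while not is_stable(rates):
    session += 1
    rates.append(60 * (1 - 0.8 ** session) + random.gauss(0, 2))
print(f"Stability criterion met after {session} sessions")
```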

But whereas an inductive approach might be an obstacle to structural similarity in a procedural replication, it could be a boon to functional similarity. For example, individual differences may result in behavior changing after three sessions for participant A and six sessions for participant B. If the original study found a change in behavior for participant A within the five sessions allotted for that condition, a deductive approach would produce a failure to replicate results with participant B. In contrast, an inductive approach would likely extend the condition long enough to observe that behavior change. This feature of procedural flexibility might therefore increase the difficulty of procedural replication while decreasing the likelihood that such procedural replication “failures” would produce results-replication failures.

Related to this, the criteria for judging a replication failure would be more complicated for a TABS study than a typical deductive study in psychology. Identifying a replication failure from the standard deductive approach of OSC (2015) required that the first study find an effect and the second study not find that effect. And the determination of an effect within that methodology required rejecting the null hypothesis (due to p < 0.05) and failing to replicate required failing to reject the null hypothesis (due to p > 0.05). But if there is no null hypothesis, as is the case with inductive research, then there is no possibility of rejecting or failing to reject such a hypothesis. As such, more complex—and likely flexible—criteria would be needed across inductive studies to determine whether or not different results across those studies should be considered replication successes or failures.

Failure to Replicate Procedures in a Within-Subject TABS with Direct and Systematic Replication

A within-subject approach might not necessarily be more difficult to procedurally replicate with high fidelity, but in practice, it generally will be. For a between-groups design, each participant in each group is typically exposed (or not exposed) to some manipulation for a relatively brief period of time (e.g., a 1-hour session). In the within-subject approach of TABS, each subject is exposed (or not exposed) to some manipulation day after day—often for many months (e.g., Dews, 1958). If a few participants fail to show up in a between-groups design, those participants might be easily replaced with others. If a few subjects fail to show up in a within-subject design, the study can be ruined. This was one of the benefits of using nonhuman animals in TABS—because in general it was easier to ensure the daily return of nonhuman animals than humans.

The disadvantages inherent in the difficulty of perfect procedural replication for within-subject TABS designs are more than offset by the advantages of the sheer volume of replication inherent in such designs. Given the repeated within-session replications (e.g., trial replications), across-session replications, condition replications, and across-subjects replications of TABS, perfect procedural replication becomes less important. If a rat is handled slightly differently from day to day but the condition continues until behavior is stable despite such minor procedural fluctuations, this systematic replication would indicate that such a procedural replication failure is irrelevant with respect to results-replication. Furthermore, each replication across subjects in a typical TABS design constitutes a complete experiment replication—something that has no equivalent in a standard psychology group design. It is certainly not the equivalent of adding more participants to each group in such a design. In fact, the only equivalent would be someone completely replicating the original group design, such as the replication attempts of OSC (2015). So, when the introduction of this manuscript suggested that a behavior science analog of the OSC (2015) study would be a valuable contribution, the truth is that such a replication project has, in effect, already been conducted. Every subject beyond the first in a typical TABS design constitutes a procedural replication of the same study with that first subject. So, the question of results-replicability in those studies has already been answered.

Failure to Replicate Procedures in the Behavior Science of Today

Whereas TABS held clear advantages over typical psychological science with respect to experimental control, such control is far less common throughout behavior science today. Nonhuman animal experiments with automated measurement (and control of experimental events) are no longer the norm. Perhaps behavior science has expanded from a purely natural science to a natural-social science. Although there is likely merit in such an expansion (e.g., greater focus on application), it also suggests greater risk with respect to high-fidelity procedural replication.

Likewise, much of behavior science today includes far less within-subject replication than was typical of TABS. Consider the popular multiple-baseline-across-subjects design, like that used by DeVries, Burnette, and Redmon (1991), compared to a TABS study like Dews (1958). Each session in the former study included only four to six behavioral observations, compared to thousands per session in the latter. The 1991 study included 15–16 sessions per participant compared to over 150 sessions in the 1958 study. The multiple-baseline-across-subjects design also includes no within-subject condition replications. As such, unlike a single-subject design (like that of Dews, 1958), the across-subject replications in a multiple baseline design cannot be considered full-experiment replications. In contrast, Dews (1958) replicated exhaustively within each subject and then replicated the entire experiment across three additional subjects (four total). For Dews (1958), any concerns about procedural replication difficulties can be laid to rest due to the exhaustive replications already published. The same cannot be said for most current studies in behavior science.

It is hoped that some of this increased risk with respect to high-fidelity procedural replications has been offset by Baer, Wolf, and Risley's (1968) framework for applied behavior analysis. In that seminal work, the authors laid out seven "dimensions" that were required for an applied behavior analysis study. One of those requirements was that the study be "technological," meaning that the methods used must be described with such detail as to allow direct replication. With recent developments in online data storage, this technological dimension could be fully realized across behavior science (Hales et al., 2019; Hantula, 2019). So, insofar as behavior scientists of today adhere to Baer et al.'s (1968) framework, the risk from procedural replication failure should be greatly reduced (but see causes for concern in the discussion on construct validity in Laraway, Snycerski, Pradhan, & Huitema, 2019).

Low-Power Designs

The statistical power of a study is its probability of detecting an effect, if there is one. For a behavioral study, power is largely determined by the number of observations relative to behavioral variability. For example, a group design with only a few participants in each group, or one with many participants but excessive behavioral variability, would be a low-power design. The corresponding problem for a within-subject design would be only a few observations of behavior (e.g., one session of baseline compared to one session under some manipulation) or many observations in each condition but substantial behavioral variability. Such designs would be less likely to provide enough relevant data to determine the presence of an effect. This means that a low-power design is more vulnerable to Type II errors, i.e., false negatives—determining the absence of an effect when there actually was an effect to be found.

Statistical power can be increased in three general ways: by raising the alpha (α) level (i.e., the requisite p-value), for example from the typical 0.05 to 0.10; by increasing the number of observations (in psychology, this would typically mean increasing the number of participants); or by reducing the variability of the subject matter (e.g., by improving measurement or experimental control). Raising the alpha level (p-value) is akin to loosening the criterion for judging whether a manipulation had an effect. Although this would make a Type II error (false negative) less likely, it would also make a Type I error (false positive) more likely. In other words, by loosening the criterion, one would be more likely to judge a manipulation as effective when it was not actually effective. Within psychology, statistical power is typically increased by increasing the number of participants in each comparison group (which would correspond to increasing the number of sessions in a within-subject design). As an alternative, statistical power could be increased by reducing variability of the subject matter. For example, a questionnaire study might reword questions such that fewer members of Group A deviated from answer A and fewer members of Group B deviated from answer B. For a behavioral study, improving measurement or experimental control can reduce bounce in the data such that if an effect is present, it will be found—reducing Type II errors (false negatives) without increasing Type I errors (false positives from increasing α) or reducing practical significance (from increasing the sample size).
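
The trade-offs among these options can be illustrated with a short simulation, offered here only as a sketch (it is not an analysis from OSC, 2015, and all parameter values are arbitrary): the estimated power of a simple two-group comparison rises as observations are added and falls as variability grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def estimated_power(n_per_group, sd, effect=1.0, alpha=0.05, reps=2000):
    """Estimate power by simulation: the proportion of simulated two-group
    experiments in which a true effect of size `effect` is detected by a
    t-test at the given alpha level."""
    detections = 0
    for _ in range(reps):
        control = rng.normal(0.0, sd, n_per_group)
        treatment = rng.normal(effect, sd, n_per_group)
        if stats.ttest_ind(treatment, control).pvalue < alpha:
            detections += 1
    return detections / reps

# More observations raise power; more variability lowers it.
for n, sd in [(10, 1.0), (50, 1.0), (10, 3.0), (50, 3.0)]:
    print(f"n = {n:2d} per group, SD = {sd}: power ~ {estimated_power(n, sd):.2f}")
```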

Low-Power Designs in OSC (2015)

According to OSC (2015), many of the original (to be replicated) studies used low-power designs (e.g., few participants in each group). If this is true, then many replication attempts would also use low-power designs—which would be expected to fail at a high rate—by producing false negatives. In other words, many results-replication failures were likely due to the replicating studies not collecting enough relevant data to find the effect that was present. As an extreme example, if a replication attempt involves only one observation of behavior before and after a manipulation—we should not be surprised when that replication fails to find the same effect as the original study.

Low-Power Designs in TABS

Low-power designs have been implicated as the second significant risk factor (OSC, 2015) responsible for the “replication crisis in psychology.” The following sections address how the various features of TABS influenced the statistical power of behavioral studies.

Low-Power Designs in a Natural Science TABS

With a correlational study, increasing power will typically require either raising the alpha-level (and thereby increasing Type I errors) or increasing the number of observations (the number of participants in a typical correlational design within psychology). Depending upon the nature of the study, it might not be practical or possible to increase the number of participants necessary for a high-powered correlational study. This impracticality could be due to stringent participant inclusion criteria (e.g., difficulty in finding enough teenagers who admit to frequent use of multiple illicit drugs), a small budget for paying participants (making it impossible to afford large numbers of participants), and/or a small effect size that therefore requires thousands of participants for a statistically significant effect. And even when large samples are feasible, statistically significant effects do not indicate effects of clinical or practical significance (Branch, 2019; Killeen, 2019). As such, reducing the variability of data will typically be preferable to other methods of increasing design power.

One of the main themes of Sidman (1960) was the reduction of variability by enhancing experimental control. For example, in a TABS experiment, variability of data could be reduced by reducing environmental variation within and across participants. Likewise, automated direct measurement and the use of nonhuman animals also enhanced experimental control (e.g., eliminating the unreliability of self-reports and the unpredictability of a human’s extralaboratory experiences). Indeed, one of the great contributions of TABS was the operant chamber: an apparatus through which a nonhuman animal’s environment could be tightly controlled (even more so than mazes) while automatically recording behavior (something no mazes at that time could accomplish). Even with an operant chamber, behavioral measures within a condition will usually show some variability. But when that variability can be reduced to the point of virtually no overlap in the behavioral measure of interest from one condition to the next, that design will be much higher-powered than a similar design with greater overlap (e.g., from less experimental control).

Low-Power Designs in an Inductive Science TABS

Evaluating the statistical power of a purely inductive study presents a conundrum—if not an unsolvable paradox. This is because statistical power is normally defined in the context of hypotheses: the probability that the results of a study correctly confirm a hypothesis, given that the hypothesis is true. Likewise, one of the previously mentioned methods for improving a study's power would be to increase the study's alpha level—that is, loosening the criterion for confirming that a hypothesis is true. But a purely inductive study has no hypotheses to confirm or reject. It has no alpha levels or p-values. In a real sense, Type I (false positive) and Type II (false negative) errors have no meaning in the absence of hypotheses (because it is a hypothesis that is accurately or inaccurately confirmed or rejected). In Sidman's (1960) words: "Data can be negative only in terms of a prediction. When one simply asks a question of nature, the answer is always positive" (p. 9).

This conundrum can be somewhat lessened by recognizing that even though a purely inductive study lacks a hypothesis, any study aimed at determining the replicability of that initial inductive study would necessarily be at least somewhat deductive. In other words, the replicating study would formulate testable hypotheses with respect to the original study's effects. This would still be an important difference from OSC (2015) methodology—in which study a claimed an effect based on criterion c and then study b tested that same effect based on the exact same criterion c. Replicating a TABS study would instead involve study a perhaps claiming (or perhaps not claiming) an effect based on criterion c and then study b testing that same effect based on some different criterion d. Ideally, the original study's data would be available so that criterion d could also be applied post hoc to show whether the original study had an effect. This would make the deductive replication of an inductive study analogous to a deductive replication of a deductive study—but not identical.

Ignoring the difficulty in measuring a purely inductive study’s statistical power, the flexibility of such a study should increase that power. By waiting for behavior to stabilize in each condition, the typical TABS condition included many sessions (observations) and minimal variability in the data at the end of each condition. Both of these features increased statistical power—increasing the likelihood of accurately detecting any effect on behavior.

Low-Power Designs in a Within-Subject TABS with Direct and Systematic Replication

In addition to the benefits of repeated sessions (procedural replication) within each condition, TABS designs also typically replicated entire conditions within-subject. When combined with systematic replications, including parametric manipulations (e.g., the dose-effect curves of Dews, 1958) that procedurally replicate across quantities of an independent variable, TABS designs were extremely high-powered. In other words, if a TABS manipulation had an effect on behavior it was exceedingly unlikely to be missed. In sum, an experiment that includes many successful replications (across sessions and across conditions) is in general more likely to be replicable than an experiment that includes none. Although OSC (2015) did not attempt replications from behavior analytic journals, they did suggest that a key reason for cognitive experiments successfully replicating more than social psychology experiments (50% vs. 25%) was that cognitive psychology was more likely to use higher-powered within-subject designs. For TABS, such designs were the norm.

Low-Power Designs in the Behavior Science of Today

Major contributors to the high power of TABS designs were the sheer number of observations, the reduced variability in data from prolonged exposure to contingencies, and the direct replications of conditions for each subject. Researchers should recognize that shorter study designs—including single-session designs, multiple baseline designs, and many other studies with reduced access to each participant (relative to TABS nonhuman animal studies)—will typically require more participants to retain the high-power of TABS designs.

Another noteworthy difficulty arises as scientists trained in an inductive approach adopt more deductive approaches with increasing regularity. In an inductive approach, changes to an experiment can be made throughout the experiment—adjusting the experiment based on the data obtained. In a deductive approach, the experiment must typically be fully planned in advance. Failing to do so is likely to result in multiple comparisons that yield some false positives (see Hales et al., 2019, and Smith & Ebrahim, 2002, for elaboration). So, as behavior science has evolved to incorporate both inductive and deductive approaches (among others; see, e.g., Young, 2019), the science is at risk for an increased likelihood of Type I errors (and nonreplicable results).

Publication Bias for Positive Results

The contingencies for research within psychology are such that finding an effect—usually in the form of a significant difference between two groups—greatly increases the chances of publication (Kühberger, Fritz, & Scherndl, 2014). These contingencies suppress the publication of negative results both directly (i.e., rejection of manuscripts with negative results) and indirectly (e.g., by decreasing the likelihood of attempts to produce or report such results). This so-called “file drawer problem” makes it extremely difficult to refute Type I errors (false positives) because the contradictory failures to replicate remain “in the file drawer” instead of being published. Given these contingencies that support the publication of false positives, it should not be surprising that a substantial portion of published research is not replicable.

This proliferation of false positive results through the publication bias is exacerbated by low-power studies. Although the two phenomena might seem to produce opposite effects (low-power designs increasing Type II errors and the publication bias increasing Type I errors), low-power designs also tend to produce greater variability in results. With only six observations, for example, it is more likely that half of those observations will be extreme outliers (relative to the population norms) than with 60 observations. As an analogy, it is far more likely that half of your dice rolls will be sixes if you roll 6 dice than if you roll 60 dice. When combined with a publication bias that favors positive results, the extreme outcomes that are more likely to be published are those with inflated effect sizes and false positives (see Button et al., 2013). So, if 50 researchers conduct studies on whether eating popcorn improves dice rolling, the 6-roll study that happens to yield low rolls before popcorn and high rolls after popcorn will be published over the 60-roll studies, whose averages will fall closer to 3.5 per die in both conditions.
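
The dice analogy can be made concrete with a brief simulation (again, only an illustrative sketch) showing that small samples of fair dice yield extreme, publishable-looking average rolls far more often than large samples do:

```python
import numpy as np

rng = np.random.default_rng(1)

def extreme_mean_rate(n_rolls, threshold=4.5, reps=100_000):
    """Proportion of simulated 'studies' (each a set of n_rolls fair dice)
    whose mean roll is at least `threshold`, i.e., an extreme sample mean
    well above the true expected value of 3.5."""
    rolls = rng.integers(1, 7, size=(reps, n_rolls))
    return float(np.mean(rolls.mean(axis=1) >= threshold))

for n in (6, 60):
    print(f"{n:2d} rolls: P(mean >= 4.5) ~ {extreme_mean_rate(n):.4f}")
```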

Publication Bias in OSC (2015)

Note that for OSC (2015), the contingencies that normally support a publication bias for positive results were not only eliminated—the contingencies were actually reversed. This project was an essentially unique opportunity to publish some of those negative results sitting in the “file drawer.” Although the OSC (2015) project was not designed to be an outlet for preexisting failed replications, by eliminating the publication bias for positive results, the project actually established contingencies supporting a reversed publication bias—a bias for failed replications. So, the standard publication bias likely increased false positives in the original studies, and this reversed “file drawer” problem for the replications likely increased false negatives. One could reasonably expect that the net result would be a substantial number of replication failures—which is exactly what happened.

Publication Bias in TABS

The publication bias for positive results was the third proposed culprit (OSC, 2015) responsible for the replication “crisis” in psychology. The following sections address how our three key features of TABS influenced this publication bias.

Publication Bias in a Natural Science TABS

The publication bias for positive results only inflates Type I errors (false positives) in the event that the same procedure produces results that are sometimes positive and sometimes negative. As described previously, a well-controlled experiment (one with nonhuman animal subjects, direct and automated measurement, and automated control of experimental events) is more likely to support both high-fidelity procedural replication and a high-power design (without raising the alpha level), and it is therefore far less likely to produce inconsistent results when replicated. As such, natural science studies like those within TABS should be less vulnerable to the publication bias for positive results.

Publication Bias in an Inductive Science TABS

A pure inductive approach has both advantages and disadvantages with respect to the publication bias for positive results. If a study is designed such that any effect, or lack thereof, is meaningful, then null findings are not possible. Regardless of what pattern of behavior emerged under fixed ratio schedules, the results would be equally meaningful. This stands in stark contrast to a study designed to test the hypothesis that children will eat more in a blue room than in a red room. The former study would have been publishable regardless of the results. The latter study would only be publishable if the hypothesis were supported (if there was a significant increase in eating in the blue room). So, because the pure inductive approach lacks positive or negative results, that approach (at least in its purest form) should be essentially immune to the publication bias.

Unfortunately, this is an oversimplification of the inductive research process. Most TABS studies were not purely inductive. To return to a previous example, if chlorpromazine and promazine both had no effect whatsoever (at any dose) on pigeon behavior, it seems unlikely that Dews (1958) would have been published. Also, because the inductive approach allows for procedural modifications throughout the process of collecting data, data collected prior to such modifications tend to be deemphasized or altogether unreported. Although not identical to the unpublished studies in deductive research, these ignored data must be considered analogous to the "file drawer" problem within the deductive approach.

Publication Bias in a Within-Subject TABS with Direct and Systematic Replication

A well-controlled within-subject experiment requiring steady-state performance in each condition typically required a substantial investment of time and resources. A study that requires many months for each participant is less likely to be subjected to as many replication attempts as a study that can be conducted in a single day. The fewer the attempts at replication, the less likely it is that one of them will produce a false positive. Without such a false positive, the publication bias for positive results is a nonissue.

Perhaps more important to consider is the impact of systematic replication on this publication bias for positive results. Any successful replication—direct or systematic—increases the likelihood of a study’s replicability. But when a replication fails, only systematic replication allows us to determine why. Sidman (1960) discussed various uses of systematic replications but probably the most common use is to establish the generality of a phenomenon. A finding in the animal laboratory might be extended to humans through systematic replication. But what if a finding has failed to generalize? What if animals prefer variable to fixed delays but humans do not? Or what if nicotine increases delay discounting in one rat experiment but not in another? Such failures to generalize are effectively failures to replicate—but also negative results (and therefore subject to the publication bias against negative results). Systematic replication can be just as useful in explaining such replication failures as it can be in extending generalities.

The publication bias for positive results—and the corresponding file drawer problem—could be effectively eliminated if every negative result was followed by systematic replications that determined why one study yielded positive results and another similar study yielded negative results (for more on this point, see Perone, 2019). For example, Dallery and Locey (2005) found that nicotine increased impulsive choice (preference for an immediate food pellet over three delayed food pellets) in rats (i.e., a positive result). Based on these results, Odum and Baumann (2010) astutely noted that “nicotine appear(s) to increase delay discounting in rats, but these findings have yet to be replicated” (p. 46). Locey and Dallery (2009) conducted a systematic replication examining nicotine effects on delay-based risky choice (preference for a variable over a fixed delay to reinforcement). If nicotine increased delay discounting (the extent to which delay decreases reinforcement value) as suggested by Dallery and Locey (2005), nicotine should have increased delay-based risky choice (as both impulsive choice and delay-based risky choice are predicted by delay discounting). But in Experiment 1 of Locey and Dallery (2009), nicotine did not increase risky choice. These negative results were likely unpublishable on their own. However, in a follow-up experiment (Experiment 2), Locey and Dallery (2009) found that nicotine increased risky choice when the alternatives differed in reinforcer magnitude (number of food pellets). Through systematic replication, the apparently discrepant results of Dallery and Locey (2005; positive results) and Experiment 1 of Locey and Dallery (2009; negative results) were explained. All three experiments were consistent with nicotine decreasing sensitivity to reinforcer magnitude (rather than any effect on delay discounting).

The preceding example is not unique (for similar examples, see Perone, 2019). Negative results are commonplace in science but if such results are rarely publishable on their own, how are they to be admitted into the scientific literature? The best answer might come from considering the TABS reliance on systematic replication. Such a reliance may be critical to establishing the importance of negative results, overcoming the publication bias for positive results, and explaining replication failures.

Publication Bias in the Behavior Science of Today

As behavior science has matured, purely inductive research seems less likely. So, once we start asking "does x affect y?," it becomes possible that the answer will be no. And even though "no effect" might be a meaningful result, in general it is less publishable. Tincani and Travers (2019) provide an in-depth discussion of the very real risk posed by the publication bias—in particular within the applied behavior analysis of today. For example, a participant who fails to show a clear treatment effect might be excluded from a manuscript to avoid distracting from the two remaining participants who did show a clear effect. Such a practice is likely to mislead readers with respect to the generality of the treatment's effectiveness and slow both research and clinical applications that might otherwise clarify the conditions under which that treatment is likely to be effective (or not).

On the more basic science side, it also becomes important to carefully consider how we are designing our studies—to be sure that any effect can be captured. For example, if we want to examine how nicotine affects impulsive choice, arranging a baseline of 50% choices for the smaller-sooner reinforcer would allow detection of either an increase or a decrease in impulsive choice (whereas a baseline of near-exclusive preference would only allow detection of an effect in one direction). It also becomes important to maximize experimental control and replicate whenever it is practical to do so—to again ensure that any effect is accurately detected. And when there is no effect, it becomes important to find a way to publish those data. It would be ideal if this were accomplished through systematic replications that explain why there was no effect (for other approaches—such as including no-effect data within systematic reviews—see Tincani & Travers, 2019).
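
One common way to arrange such a 50% baseline is an adjusting (titrating) procedure in the spirit of Mazur (1987), in which the delay to the larger reinforcer is lengthened after each larger-later choice and shortened after each smaller-sooner choice until choice stabilizes near indifference. The sketch below uses a hypothetical choice rule and arbitrary parameter values; it is meant only to show the logic of such a procedure:

```python
import random

def titrate_indifference(p_choose_larger, start_delay=5.0, step=1.0,
                         n_trials=300, seed=0):
    """Adjust the delay to the larger-later reinforcer trial by trial:
    choosing the larger-later option lengthens its delay; choosing the
    smaller-sooner option shortens it. Choice therefore settles near 50%.
    `p_choose_larger(delay)` gives the probability of a larger-later
    choice at a given delay (simulated here; observed in a real study)."""
    random.seed(seed)
    delay, delays = start_delay, []
    for _ in range(n_trials):
        if random.random() < p_choose_larger(delay):
            delay += step
        else:
            delay = max(0.0, delay - step)
        delays.append(delay)
    return delays

# Hypothetical chooser: hyperbolic discounting, V = A / (1 + k * delay),
# comparing 3 delayed pellets against 1 immediate pellet.
def chooser(delay, k=0.2):
    v_larger, v_smaller = 3.0 / (1.0 + k * delay), 1.0
    return v_larger / (v_larger + v_smaller)

delays = titrate_indifference(chooser)
print(f"Adjusted delay settles near ~{sum(delays[-50:]) / 50:.1f} s (indifference)")
```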

Finally, given the importance of systematic replication in explaining noneffects, it becomes critical that we arrange contingencies to support the publication of multiexperiment systematic replications—including those that contain noneffect experiments. Instead, contingencies are frequently arranged to support the least publishable unit (LPU; see Broad, 1981). For example, if the two risky choice experiments of Locey and Dallery (2009) had both shown positive results, those experiments could have been published separately—counting as twice as many publications for the authors—which often matters to academic administrators (De Rond & Miller, 2005). Although such contingencies might often be harmless or even advantageous (Refinetti, 1990), they can prove detrimental to the publication of negative results within systematic replications (multiexperiment studies) that could explain those results. For example, Locey, Pietras, and Hackenberg (2009) conducted a systematic replication of delay-based risky choice—extending results with nonhuman animals to humans. But that study was originally submitted for publication as a two-experiment study. Experiment 1 showed a lack of delay-sensitivity with tokens exchangeable for money, and Experiment 2 showed extreme delay-sensitivity with video reinforcers. Consistent with the publication bias for positive results, Experiment 1 (the no-effect experiment) was removed during the publication process in favor of the LPU—Experiment 2. The argument here is not that negative results should always be published on their own. But when negative results are included within a systematic replication that can explain those negative results, something valuable is lost by the exclusion of those data. Allowing LPU contingencies to influence our publishing practices further exposes our science to potential replication failure.

Other Factors in the Replication “Crisis”

The preceding analysis addressed the three factors identified by OSC (2015) as responsible for their high rate of replication failures. Other factors almost certainly played a role, either independently or in tandem with those three (for a detailed, if not comprehensive, list, see Laraway et al., 2019). For example, Bakker and Wicherts (2011) identified numerous errors in published statistical test results and suggested that researchers were more likely to report erroneous results when the errors supported their hypotheses (and/or more likely to correct errors that failed to support their hypotheses). In more extreme cases, researchers might purposefully exclude data (e.g., the Tincani & Travers [2019] example described in the previous section) or even alter data to support their hypotheses. Such scientific misconduct would interact with the publication bias for positive results, increasing the chances that erroneous or falsified data are published.
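
As a rough illustration of how the kinds of reporting errors documented by Bakker and Wicherts (2011) can be caught after the fact, the sketch below recomputes the two-tailed p-value implied by a reported t statistic and its degrees of freedom and flags any mismatch with the reported p. The “reported” values here are hypothetical, and the approach is only a simplified analogue of automated consistency checks such as statcheck.

```python
# Minimal sketch of a statistical-reporting consistency check.
# The "reported" values below are hypothetical and used only for illustration.
from scipy import stats

def check_t_report(t_value, df, reported_p, tolerance=0.005):
    """Recompute the two-tailed p-value implied by t(df) and compare it
    with the p-value reported in the manuscript."""
    recomputed_p = 2 * stats.t.sf(abs(t_value), df)
    consistent = abs(recomputed_p - reported_p) <= tolerance
    return recomputed_p, consistent

# Hypothetical report: "t(28) = 2.10, p = .02"
recomputed_p, consistent = check_t_report(t_value=2.10, df=28, reported_p=0.02)
print(f"recomputed p = {recomputed_p:.3f}; consistent with reported p: {consistent}")
# The recomputed p (about .045) does not match the reported .02,
# so this result would be flagged for closer inspection.
```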

Other factors largely ignored in the preceding analyses are reinforcement contingencies on publication practices (for an important point about scientist reputation, see Branch, 2019; for an excellent discussion of “grant culture” contingencies, see Lilienfeld, 2017; for an insightful analysis of the natural selection of questionable research practices, see Smaldino & McElreath, 2016; and for a brief but citation-rich review of the detrimental effects of “countability” contingencies, see Hantula, 2019). One set of contingencies addressed briefly above, LPU contingencies, was identified as potentially suppressing both systematic replications and the reporting of negative results. Unfortunately, such contingencies, part of what Hantula (2019) refers to as “countability” contingencies, have become far more pervasive and pernicious. For example, in 1969 fewer than 20% of interviewed university faculty agreed with the statement that “it is difficult for a person to receive tenure/promotion if he/she does not publish” (Youn & Price, 2009). In 2020, tenure or promotion would likely require many publications. And when an academic’s career acquisition, maintenance, and advancement are made contingent upon the quantity of publications, that quantity is likely to increase. However, increasing the quantity of publications is not likely to increase their quality. Instead, we are arranging contingencies that support sloppy science and questionable research practices (John, Loewenstein, & Prelec, 2012; Lilienfeld, 2017; Smaldino & McElreath, 2016). Given such a context, a massive replication failure like that of OSC (2015) is hardly surprising.

Behavior Science Tomorrow

The majority of the analysis above focused on features of the traditional approach to behavior science and how those features increased its resistance to study-level replication failures. The rationale behind this analysis was not to argue for a return to some golden age when science was pure and infallible. Behavior science has evolved over the past 65 years partly because the traditional approach was inadequate for some of the important issues that should be addressed within a behavioral framework. Yes, experiments with automated control and measurement of nonhuman animal behavior can typically achieve a higher level of experimental control than other types of studies. But some important questions can only be adequately (or ethically) addressed with correlational designs. Some important observations cannot currently be obtained through automation, or even through direct observation. And it would certainly be foolish to restrict behavior science to nonhuman animal research. Perhaps, as Ledoux (2002) argues for “behaviorology,” behavior science should include both natural and social science.

Likewise, the inductive approach of TABS will not always be the ideal approach, nor should a within-subject approach always be preferred. We are probably better off determining our research designs based on our research questions rather than the reverse, even though certain types of designs will be more vulnerable to replication failure. But as our designs inevitably deviate from those of the past, we should be aware of the tradeoffs that come with those deviations, including greater vulnerability to replication failure.

The analysis above should also not be taken to mean that experiments from the traditional approach in behavior science (i.e., a natural science, inductive, within-subject approach that encouraged both direct and systematic replication) are immune to replication failure. Dworkin and Miller (1986), for example, published an important article documenting a massive failure to replicate a series of experiments on autonomic operant conditioning. Those studies probably failed to replicate for reasons other than the three replication-failure factors addressed above. It is therefore important to recognize that scientific studies, however well designed, will always be subject to some form of human error. For example, even a high-powered and well-controlled experiment can produce a false positive if an experimenter chooses to falsify data. So, even for studies that include all of the TABS features above, replication should always be encouraged.

In that vein, it is ultimately an empirical question whether research findings in behavior science, from 1958 or from 2008, are replicable. We would likely benefit from a study comparable to that of OSC (2015) but focused entirely on research within behavior science. A major obstacle to conducting such a large-scale replication effort within basic behavior science (i.e., research from the Journal of the Experimental Analysis of Behavior) is the prevalence of long-duration nonhuman animal studies. The OSC (2015) methodology resulted in replications of the cheaper, shorter duration psychology studies of 2008, whereas studies requiring more resources or longer durations were deemed infeasible to replicate. The majority of JEAB studies would likely be similarly infeasible to replicate. However, systematic replication might again be the ideal solution. If experimental findings can be replicated and simultaneously extended (as is common practice within JEAB), the science could progress while responsibly verifying its foundation.

Although the replicability of particular studies is an empirical question, that does not mean the analysis above could simply be replaced by replication attempts. Determining whether a particular experiment will replicate is not the same as identifying the features of a study that make it more or less likely to replicate. So, as behavior science continues to evolve, it is imperative that we recognize how the changing features of the science affect its replicability.

Author Note

Thanks to Thomas Critchfield for comments on early drafts of this manuscript.


References

1. Baer DM, Wolf MM, Risley TR. Some current dimensions of applied behavior analysis. Journal of Applied Behavior Analysis. 1968;1:91–97. doi: 10.1901/jaba.1968.1-91.
2. Bakker M, Wicherts JM. The (mis)reporting of statistical results in psychology journals. Behavior Research Methods. 2011;43(3):666–678. doi: 10.3758/s13428-011-0089-5.
3. Bijou SW. Operant extinction after fixed-interval schedules with young children. Journal of the Experimental Analysis of Behavior. 1958;1:25–29. doi: 10.1901/jeab.1958.1-25.
4. Branch MN. The “Reproducibility Crisis:” Might the methods used frequently in behavior-analysis research help? Perspectives on Behavior Science. 2019;42(1):77–89. doi: 10.1007/s40614-018-0158-5.
5. Broad WJ. The publishing game: Getting more for less. Science. 1981;211(4487):1137–1139. doi: 10.1126/science.7008199.
6. Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, Munafò MR. Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience. 2013;14(5):365–376. doi: 10.1038/nrn3475.
7. Dallery J, Locey ML. Effects of acute and chronic nicotine on impulsive choice in rats. Behavioural Pharmacology. 2005;16:15–23. doi: 10.1097/00008877-200502000-00002.
8. De Rond M, Miller AN. Publish or perish: Bane or boon of academic life? Journal of Management Inquiry. 2005;14(4):321–329.
9. DeVries JE, Burnette MM, Redmon WK. AIDS prevention: Improving nurses' compliance with glove wearing through performance feedback. Journal of Applied Behavior Analysis. 1991;24(4):705–711. doi: 10.1901/jaba.1991.24-705.
10. Dews PB. Effects of chlorpromazine and promazine on performance on a mixed schedule of reinforcement. Journal of the Experimental Analysis of Behavior. 1958;1:73–82. doi: 10.1901/jeab.1958.1-73.
11. Dworkin BR, Miller NE. Failure to replicate visceral learning in the acute curarized rat preparation. Behavioral Neuroscience. 1986;100(3):299. doi: 10.1037//0735-7044.100.3.299.
12. Elliffe D, Davison M, Landon J. Relative reinforcer rates and magnitudes do not control concurrent choice independently. Journal of the Experimental Analysis of Behavior. 2008;90(2):169–185. doi: 10.1901/jeab.2008.90-169.
13. Ferster CB, Skinner BF. Schedules of reinforcement. New York, NY: Appleton-Century-Crofts; 1957.
14. Fields L, Moss P. Formation of partially and fully elaborated generalized equivalence classes. Journal of the Experimental Analysis of Behavior. 2008;90(2):135–168. doi: 10.1901/jeab.2008.90-135.
15. Gilbert DT, King G, Pettigrew S, Wilson TD. Comment on “Estimating the reproducibility of psychological science.” Science. 2016;351(6277):1037. doi: 10.1126/science.aad7243.
16. Goodman SN, Fanelli D, Ioannidis JP. What does research reproducibility mean? Science Translational Medicine. 2016;8(341):1–6. doi: 10.1126/scitranslmed.aaf5027.
17. Hales AH, Wesselmann ED, Hilgard J. Improving psychological science through transparency and openness: An overview. Perspectives on Behavior Science. 2019;42(1):13–31. doi: 10.1007/s40614-018-00186-8.
18. Hantula DA. Replication and reliability in behavior science and behavior analysis: A call for a conversation. Perspectives on Behavior Science. 2019;42(1):1–11. doi: 10.1007/s40614-019-00194-2.
19. John LK, Loewenstein G, Prelec D. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science. 2012;23(5):524–532. doi: 10.1177/0956797611430953.
20. Keller FS. The phantom plateau. Journal of the Experimental Analysis of Behavior. 1958;1:1–13. doi: 10.1901/jeab.1958.1-1.
21. Killeen PR. Predict, control, and replicate to understand: How statistics can foster the fundamental goals of science. Perspectives on Behavior Science. 2019;42(1):109–132. doi: 10.1007/s40614-018-0171-8.
22. Kühberger A, Fritz A, Scherndl T. Publication bias in psychology: A diagnosis based on the correlation between effect size and sample size. PLoS One. 2014;9(9):1–8.
23. Laraway S, Snycerski S, Pradhan S, Huitema BE. An overview of scientific reproducibility: Consideration of relevant issues for behavior science/analysis. Perspectives on Behavior Science. 2019;42(1):33–57. doi: 10.1007/s40614-019-00193-3.
24. Ledoux SF. Defining natural sciences. Behaviorology Today. 2002;5(1):34–36.
25. Lilienfeld SO. Psychology’s replication crisis and the grant culture: Righting the ship. Perspectives on Psychological Science. 2017;12(4):660–664. doi: 10.1177/1745691616687745.
26. Locey ML, Dallery J. Isolating behavioral mechanisms of intertemporal choice: Nicotine effect on delay discounting and amount sensitivity. Journal of the Experimental Analysis of Behavior. 2009;91(2):213–223. doi: 10.1901/jeab.2009.91-213.
27. Locey ML, Pietras CJ, Hackenberg TD. Human risky choice: Delay sensitivity depends on reinforcer type. Journal of Experimental Psychology: Animal Behavior Processes. 2009;35(1):15. doi: 10.1037/a0012378.
28. Maxwell SE, Lau MY, Howard GS. Is psychology suffering from a replication crisis? What does “failure to replicate” really mean? American Psychologist. 2015;70:487–498. doi: 10.1037/a0039400.
29. Mazur JE. An adjusting procedure for studying delayed reinforcement. In: Commons ML, Mazur JE, Nevin JA, Rachlin H, editors. Quantitative analyses of behavior: The effects of delay and of intervening events on reinforcement value. Hillsdale, NJ: Lawrence Erlbaum Associates; 1987. pp. 55–73.
30. Morgan DL, Morgan RK. Single-participant research design: Bringing science to managed care. American Psychologist. 2001;56(2):119.
31. Odum AL, Baumann AA. Delay discounting: State and trait variable. In: Madden GJ, Bickel WK, editors. Impulsivity: The behavioral and neurological science of discounting. Washington, DC: American Psychological Association; 2010. pp. 39–65.
32. Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):1–8. doi: 10.1126/science.aac4716.
33. Perone M. How I learned to stop worrying and love replication failures. Perspectives on Behavior Science. 2019;42(1):91–108. doi: 10.1007/s40614-018-0153-x.
34. Refinetti R. In defense of the least publishable unit. Journal of the Federation of American Societies for Experimental Biology. 1990;4(1):128–129. doi: 10.1096/fasebj.4.1.2295373.
35. Sidman M. Tactics of scientific research. New York, NY: Basic Books; 1960.
36. Skinner BF. Are theories of learning necessary? Psychological Review. 1950;57(4):193–216. doi: 10.1037/h0054367.
37. Smaldino PE, McElreath R. The natural selection of bad science. Royal Society Open Science. 2016;3(9):160384. doi: 10.1098/rsos.160384.
38. Smith GD, Ebrahim S. Data dredging, bias, or confounding: They can all get you into the BMJ and the Friday papers. British Medical Journal. 2002;325(7378):1437. doi: 10.1136/bmj.325.7378.1437.
39. Tincani M, Travers J. Replication research, publication bias, and applied behavior analysis. Perspectives on Behavior Science. 2019;42(1):59–75. doi: 10.1007/s40614-019-00191-5.
40. Weaver MT, Branch MN. Tolerance to effects of cocaine on behavior under a response-initiated fixed-interval schedule. Journal of the Experimental Analysis of Behavior. 2008;90(2):207–218. doi: 10.1901/jeab.2008.90-207.
41. Youn TI, Price TM. Learning from the experience of others: The evolution of faculty tenure and promotion rules in comprehensive institutions. Journal of Higher Education. 2009;80(2):204–237.
42. Young ME. Bayesian data analysis as a tool for behavior analysts. Journal of the Experimental Analysis of Behavior. 2019;111(2):225–238. doi: 10.1002/jeab.512.
43. Zimmermann ZJ, Watkins EE, Poling A. JEAB research over time: Species used, experimental designs, statistical analyses, and sex of subjects. The Behavior Analyst. 2015;38(2):203–218. doi: 10.1007/s40614-015-0034-5.
