Author manuscript; available in PMC: 2023 Nov 1.
Published in final edited form as: Behav Anal (Wash D C). 2022 Apr 21;22(4):389–403. doi: 10.1037/bar0000244

Validating Human-Operant Software: A Case Example

Sean W Smith 1,2, Brian D Greer 2,3
PMCID: PMC9718443  NIHMSID: NIHMS1781558  PMID: 36467429

Abstract

Human-operant experiments conducted with computer software facilitate translational research by assessing the generality of basic research findings and exploring previously untested predictions about behavior in a cost-effective and efficient manner. However, previous human-operant research with computer-based tasks has included little or no description of rigorous validation procedures for the experimental apparatus (i.e., the software used in the experiment). This omission, combined with a general lack of guidance regarding how to thoroughly validate experimental software, introduces the possibility that nascent researchers may insufficiently validate their computer-based apparatus. In this paper, we provide a case example to demonstrate the rigor required to validate experimental software by describing the procedures we used to validate the apparatus reported by Smith and Greer (2021) to assess relapse via a crowdsourcing platform. The validation procedures identified several issues with early iterations of the software, demonstrating how failing to validate human-operant software can introduce confounds into similar experiments. We describe our validation procedures in detail so that others exploring similar computer-based research may have an exemplar for the rigorous testing needed to validate computer software to ensure precision and reliability in computer-based, human-operant experiments.

Keywords: apparatus validation, computer software, human operant, translational research


Human-operant experiments conducted with computer software facilitate translational research by assessing the generality of basic research findings and exploring previously untested predictions about behavior in a cost-effective and efficient manner. However, previous human-operant research with computer-based tasks has included little or no description of rigorous validation procedures for the experimental apparatus (i.e., the software used in the experiment). This omission, combined with a general lack of guidance regarding how to thoroughly validate experimental software, introduces the possibility that nascent researchers may insufficiently validate their computer-based apparatus.

We believe these concerns are particularly relevant to computer-based research conducted via crowdsourcing websites (e.g., Amazon’s Mechanical Turk [MTurk]). Recruiting participants through crowdsourcing websites allows researchers to acquire participants rapidly while paying little for participation; however, conducting experiments via the internet may increase the likelihood of compromising the integrity of the experiment. For example, participants will have different internet-connection speeds, which may cause the experimental software to lag for some participants but not others. Further, participants are likely to have different hardware, so one participant may see the experimental interface unencumbered on a large monitor, whereas others may have difficulty seeing the interface on smaller displays. Moreover, extra-experimental software necessary to complete the experiment (e.g., operating system type, web browser preference) and display settings (e.g., screen resolution) will vary, potentially rendering the user interface of the experimental software differently across participants.

To decrease the likelihood that such confounding variables affect computer-based experiments, investigators must validate the integrity of their experimental software. We also encourage researchers to report their validation procedures and outcomes when presenting or publishing their findings. Toward this end, we describe the procedures used to validate the experimental software described recently by Smith and Greer (2021) to assess relapse via a crowdsourcing platform. We evaluated the performance of the experimental software to ensure that the basic functionality of the software remained intact, data collection remained accurate (i.e., dependent-variable validation), and procedural fidelity remained high (i.e., independent-variable validation) under a variety of conditions, all aimed at improving the internal validity of the planned experiment. We describe our validation procedures in detail, along with examples of issues the process identified and the steps we took to remediate them, with the dual purpose of demonstrating the need for conducting and reporting rigorous validation procedures and providing an exemplar for how future researchers exploring similar computer-based research may accomplish this.

Validation Procedures

When we refer to software validation, we are referring to the final steps a researcher must take to ensure their completed software is functioning appropriately. Many stages of software development must occur prior to validation, and each stage is complex and requires testing in its own right. We limit the scope of this article to the final step of validation, or “black box testing,” which ensures that the final version of the software produces the appropriate outputs (e.g., consequences and data collection) based on the inputs (e.g., user responses) into the software.

Notably, even this final step of software validation can be an iterative process. It is typical for there to be “bugs” in the early iterations of any software validation process, so identifying and correcting issues with software should not be considered a shortcoming or a reason to question the reliability of the finished product. If a researcher does identify an issue during validation, they should attempt to resolve the issue before proceeding to the next step in their validation process, as depicted in Figure 1. Computers reliably execute software precisely as written, so if an error is present, the error will occur reliably under similar testing conditions. Therefore, deviations from perfect execution (i.e., <100% accuracy) during the validation process must be remedied prior to progressing to the next validation step. After the researcher addresses the problem, they should resume testing at the same step to ensure they resolved the issue successfully and that successful resolution of one issue did not introduce other, unintended problems elsewhere in the software’s operation.

Relatedly, it is important to note that some components of the software will depend on the correct functionality of other components (e.g., inaccurate timer operation affecting the implementation of independent variables and the recording of dependent variables). It is important to validate the basic functionality of the software before testing other components of the software that depend upon those basic functions. Also, if the experiment will operate via a server (i.e., not on a local device), validating the software operating on a local device first can help identify server-specific issues when eventually hosted on the server. Considering the process flow of the software and structuring the validation procedures in accordance with this process flow can help to ensure that potential fixes to one problem do not introduce errors to components that have already been validated.
The order of the procedures we outline below, as depicted in Figure 2, demonstrates how we structured our validation to minimize the need to repeat extensive validation of previously tested components.

Figure 1. Iterative Parameter Testing Flow Chart

Figure 2. General Validation Procedure Flow Chart

Purpose of the Software

The steps for validating software are not standard. A researcher must first consider the purpose of the software, which will allow the researcher to develop validation procedures that test its relevant parameters. For example, the primary purpose of the software we developed and validated was to conduct experiments pertaining to the relapse of operant behavior, so we created our software to implement relapse preparations similar to those conducted in animal laboratories. Basic researchers often use three-phase relapse preparations to evaluate various relapse phenomena such as resurgence, renewal, reinstatement, rapid reacquisition, and combinations thereof. We developed our software to present participants with a three-phase, human-operant preparation of each of these common relapse preparations while simultaneously being able to manipulate a variety of other independent variables (e.g., reinforcement schedule, Sweeney & Shahan, 2013; magnitude, Craig et al., 2017; delay, Jarmolowicz & Lattal, 2014) that are germane to relapse research. Our validation procedures evaluated the accuracy of numerous independent-variable manipulations within these types of arrangements specifically because we wanted our software to be capable of conducting a wide variety of three-phase relapse preparations. Thus, we deemed it necessary to validate the software across each of these potential preparations, and our validation procedures are extensive because they were designed to ensure fidelity across numerous planned experiments. Other researchers will need to adjust their validation procedures to match the purpose of their specific software and research question(s). Each of the components we validated can serve as an example of how future researchers may similarly validate different aspects of their software.

General Apparatus Specifications

We describe general information about our experimental software below to provide context for our subsequent validation procedures. Similar to Robinson and Kelley (2020), we created our experimental software with Axure RP 9, which is a computer program for constructing websites. We used Axure RP 9 to export HyperText Markup Language (HTML) files, and we hosted these files on a university server that would be used to run the software in subsequent experiments. The experimental software also included JavaScript and PHP: Hypertext Preprocessor (PHP) code to collect data automatically on participants’ responses throughout the apparatus validation and during subsequent experiments.

Because our software was designed to conduct research on relapse, our experimental interface displayed one target response button and, in some phases, one alternative response button, analogous to the two response options (i.e., a target [destructive] behavior and an alternative [appropriate] response) typically available when treating destructive behavior using differential reinforcement of alternative behavior. For our software, each response option served as the first response of a response chain (see Smith & Greer [2021] for the rationale for using response chains). Following completion of a response chain, the target and alternative response buttons reappeared in a random location on the left and right half of the display, respectively. Clicking either button displayed a single-digit, two-term arithmetic problem (e.g., 3 + 9) on a background matching the color of the button in the initial link of the chain (e.g., clicking the blue target button led to a display with an arithmetic problem on a blue background). Participants could then click within a text box beside the instructions, “Type answer here,” type their response, then click a “Submit” button to complete the response chain, which made the arithmetic problem and background color disappear. We programmed the software to produce reinforcement in the form of point deliveries following “Submit” button clicks when the typed response correctly answered the arithmetic problem, provided that reinforcement was currently available (i.e., a reinforcer had set up for the completed response chain). A running total of the points earned remained visible in the top middle of the display at all times. When the software delivered points, the color of the button clicked in the initial link of the response chain flashed behind the point total.
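As an illustration, the terminal-link contingency just described can be sketched as follows. This is a minimal Python sketch, not the authors’ implementation (the actual apparatus was built in Axure RP 9 with JavaScript and PHP), and the function and variable names are hypothetical:

```python
def on_submit(typed_answer, correct_answer, reinforcer_available, points):
    """Sketch of the terminal-link contingency: add points only when the
    typed answer is correct AND a reinforcer has set up for this chain."""
    reinforcer_magnitude = 10  # assumed point value per delivery
    if typed_answer == correct_answer and reinforcer_available:
        points += reinforcer_magnitude   # point total increments; button color flashes
        reinforcer_available = False     # the set-up reinforcer is consumed
    return points, reinforcer_available
```

In the actual software, an incorrect answer or an unmet schedule requirement simply ended the chain without a point delivery, which the sketch mirrors by returning the point total unchanged.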

Basic Functionality Checks

First, a researcher must conduct basic functionality checks on the experimental software (Figure 2, Step 1). For the validation of our software, we identified the necessary components of our specific software, created a checklist outlining these components, and then used the checklist to evaluate whether our software implemented each component accurately. As shown in Appendix A, our Basic Functionality Checklist included the following components: (a) the point total remained visible in the top middle of the display at all times; (b) the target button was the correct color and remained visible across all phases; (c) the alternative button was the correct color and was present only in Phases 2 and 3, which is typical of most resurgence evaluations like the one described by Smith and Greer (2021); (d) the target and alternative response buttons moved accurately and semi-randomly after each response; (e) the background colors appeared as programmed; (f) the software added points to the point total following correct responses meeting the schedule requirement; (g) the software did not add points to the point total following incorrect responses or responses not meeting the schedule requirement (including responses placed on extinction); (h) the correct background colors flashed when points were delivered; (i) the arithmetic problems appeared according to the correct specifications; (j) the response box used to input the answers to arithmetic problems was present and said, “Type answer here”; and (k) a “Submit” button appeared below each arithmetic problem. Such checklists help to ensure that no component of the software is overlooked at this stage of the validation process. Testing the basic functionality of the software is considered complete when all parameters are operating correctly across all phases and conditions that would be arranged in subsequent experiments.

Because our software would be used by participants recruited through a crowdsourcing website, we also needed to ensure that our software functioned properly on a wide variety of commercially available, personal-computing platforms. Thus, we used our Basic Functionality Checklist to evaluate whether the software operated as programmed across different hardware (i.e., different laptops), operating systems (i.e., Windows, macOS), and web browsers (i.e., Internet Explorer, Microsoft Edge, Google Chrome, Firefox, Safari). Notably, researchers could also design their software such that participants can access the experimental task only using specific hardware, operating systems, and web browsers to mitigate the possibility of unprogrammed variations in the experimental software due to differences in the participants’ personal-computing platforms (e.g., Ritchey et al., 2021). If researchers use this strategy, they must still conduct similar functionality checks to ensure that the experimental software is indeed inaccessible from other platforms.

We also tested our software using various internet connection speeds. We had one confederate use the software with a wired internet connection (i.e., low traffic) and did not observe differences compared to when we had nine other confederate participants use the software simultaneously (i.e., high traffic) on a wireless network that often failed to support video-conference software. If researchers plan to implement their experiment via the internet, they too should complete these additional tests to ensure that the basic functionality of the software remains intact when run on various hardware, operating systems, web browsers, and internet connection speeds (see Appendix B).

Independent- and Dependent-Variable Validation

After establishing the basic functionality of the software, more rigorous testing of the independent and dependent variables should occur. To do this, researchers should video-record their software to allow precise validation of all independent and dependent variables that will be relevant to subsequent experiments (Figure 2, Step 2). For this to be successful, the researcher may need to display components of their software on the experimental interface to serve as a point of comparison even though the components may not be visible during subsequent experiments. To do this in Axure, a researcher can add an “object” to the display and create an “interaction” for the object so that the object will display the variable(s) in the software that are used to produce each aspect of the experimental preparation. For example, in our software, we had a variable that determined the current schedule requirement, and during validation of our software’s implementation of the programmed reinforcement schedule, we created an object that displayed this value on the user interface while video recording. Then, once it had been validated, we set the object to “hidden” so that users could not see or interact with the object in the final version of the software.

When we validated independent and dependent variables for our software, we completed one or two 5-min sessions to evaluate each experimental parameter described in each subheading below. For each session, we recorded three videos simultaneously: Video 1 depicting the experimental display; Video 2 depicting the participant’s trackpad; and Video 3 depicting the session timer from BDataPro, a previously validated data-collection program (Bullock et al., 2017), which allowed us to compare the timestamps of various programmed events (e.g., target and alternative responses, reinforcer deliveries) with the timer running on BDataPro. We recorded the videos using Panopto, which allowed us to record the videos synchronously with a separate laptop computer connected to three different webcams. Across all validation procedures, we compared Video 1 (i.e., experimental software) to Video 2 (i.e., participant’s trackpad) to assess whether the actions programmed to occur in the experimental interface occurred within .5 s of the participant’s physical responses (e.g., clicking the trackpad when the cursor was over a button triggered the action of the button within .5 s; Figure 2, Step 3). Use of Video 3 is described in the next section.

We did not directly compare the timestamped data from the experimental software to data collected using BDataPro. BDataPro generates data based on human observers pressing keys as they observe events, so error can be introduced into data generated by BDataPro due to human-input errors (e.g., slow reaction time, failing to observe or record a response). Thus, data output from BDataPro would not serve as an appropriate point of comparison. Instead, we compared whether (a) the confederate’s physical responses matched what occurred in the software (i.e., comparison of Video 1 and Video 2; Figure 2, Step 3), (b) the software’s timers were synchronous (i.e., comparison of Video 1 and Video 3; Figure 2, Step 4), and (c) the timestamped data matched what occurred in the software (Video 1 and experimental software’s data output; Figure 2, Step 5). These comparisons were capable of assessing (a) the accuracy of the experimental timer and (b) the degree to which the experimental software accurately reported the user’s inputted responses.

Timer Validation

We validated the accuracy of the timer (Figure 2, Step 4) relatively early in our validation procedures because other variables (i.e., dependent variables, schedules of reinforcement, phase duration) depended on the accuracy of the software’s timer. To do this, we recorded one video with a confederate participant emitting a high response rate (resulting in 15.2 responses per min) and another video with the confederate participant emitting a low response rate (resulting in 7.2 responses per min) to evaluate whether different response rates affected the integrity of the software. We conducted a timer synchrony check upon each user response by comparing the timer in Video 1 (i.e., the experimental software) to the timer in Video 3 (i.e., BDataPro) using a frame-by-frame comparison to evaluate whether the timers remained synchronous throughout the experiment. For each session, we counted the number of synchrony checks when the timers were within .5 s of each other and divided this by the total number of synchrony checks. We multiplied this number by 100 to obtain the percentage of synchrony checks in which the timers matched.
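The synchrony calculation just described reduces to a simple proportion. A minimal Python sketch (the function name and data format are hypothetical; times are in seconds):

```python
def percent_synchronous(software_times, bdatapro_times, tolerance=0.5):
    """Percentage of synchrony checks in which the two timers read
    within `tolerance` seconds of each other (0.5 s in the procedure)."""
    checks = list(zip(software_times, bdatapro_times))
    matches = sum(1 for s, b in checks if abs(s - b) <= tolerance)
    return 100.0 * matches / len(checks)
```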

Dependent-Variable Validation

After validating the timer, researchers should evaluate the accuracy of their software’s data collection (Figure 2, Step 5). For our software, we evaluated data-collection accuracy for correct target responses and correct alternative responses. Throughout validation of the dependent variables, we refer to responses observed in the videos by the confederate participant as true responses, and we refer to the responses in the experimental software’s timestamped data output as recorded responses. Recorded responses included both correct target responses and correct alternative responses. A correct target response was a response chain defined as (a) clicking the target response button with the cursor, (b) typing a number that correctly solves the arithmetic problem, and (c) clicking the “Submit” button. A correct alternative response was identical to a correct target response, except the initial link in the chain was clicking the alternative response button instead of the target response button. Below, we describe our procedures for determining the accuracy of the software’s data collection by comparing correct target and alternative responses (i.e., recorded responses) to video-recorded responses (i.e., true responses) across three measures.

First, we evaluated the accuracy of the timestamp for each recorded response, which corresponded to the terminal link of the correctly completed chain (i.e., submitting a correct solution to the arithmetic problem). For each session, we counted the number of recorded responses that occurred within .5 s of a true response and divided this by the total number of recorded responses. We multiplied this number by 100 to obtain the percentage of scored responses that occurred within .5 s of true responses. Although this measure provided a percentage related to the accuracy of data collection, it could not indicate whether data-recording accuracy deviated from true responses systematically. To evaluate this possibility more closely, we included a second and third measure. Our second measure was the frequency of true responses that did not have a corresponding recorded response within .5 s of the true response (i.e., a false negative). Our third measure was the frequency of recorded responses that did not have a corresponding true response within .5 s of the recorded response (i.e., a false positive). We evaluated this across both high- and low-rate response patterns to evaluate whether data-collection accuracy was affected by different response patterns.
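The three measures can be computed from two lists of timestamps, as in the following Python sketch (hypothetical names; for simplicity it matches any response within the window rather than enforcing one-to-one pairing, which suffices when successive responses are spaced more than 1 s apart):

```python
def classify_responses(true_ts, recorded_ts, tolerance=0.5):
    """Return (% of recorded responses within 0.5 s of a true response,
    false negatives, false positives) from two timestamp lists (seconds)."""
    matched = sum(1 for r in recorded_ts
                  if any(abs(r - t) <= tolerance for t in true_ts))
    false_negatives = sum(1 for t in true_ts
                          if not any(abs(r - t) <= tolerance for r in recorded_ts))
    false_positives = len(recorded_ts) - matched
    pct = 100.0 * matched / len(recorded_ts) if recorded_ts else 0.0
    return pct, false_negatives, false_positives
```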

Independent-Variable Validation

After establishing that the experimental software is recording the dependent variables accurately, researchers should evaluate the integrity of the independent variables they intend to program (Figure 2, Step 6). Our experimental software was intended to run a wide variety of experiments, so it included numerous parameters that could be manipulated as independent variables commonly evaluated in relapse studies. We evaluated the accuracy with which our software implemented each parameter manipulation within a single 5-min session, unless indicated otherwise.

Reinforcement Schedule.

We evaluated whether the software implemented various reinforcement schedules accurately. We programmed the software to implement each of the schedules of reinforcement indicated in Table 1. The software generated these schedules by randomly selecting a number within a specified range. For example, for a variable interval (VI) 10-s schedule with a target interval range between 5 s and 15 s, we programmed the software to generate a random number between 5 s and 15 s following each reinforcer delivery to set up the next reinforcer. We used three measures to evaluate whether the schedules were implemented accurately. First, we measured the percentage of variable-schedule requirements (i.e., VI, variable time [VT], and variable ratio [VR]) the software generated within a specified (i.e., target) range that we deemed acceptable. We calculated this by counting the number of schedule requirements that the software generated in the session within the target range and divided this by the total number of schedule requirements the software generated. For example, for a VI 10-s schedule with a target interval range between 5 s and 15 s, we counted the number of intervals that the software generated within the 5–15-s range and divided this by the total number of intervals the software generated for that session. We multiplied the quotient by 100 to produce the percentage of variable-schedule requirements the software generated within the targeted range.

Table 1.

Reinforcement-Schedule Validation

                Obtained Schedules
Condition       Mean      Standard Deviation   Percentage within range
VI 5 s          4.97 s    2.66 s               100
VI 10 s         10.93 s   3.23 s               100
VR 2            2.29      0.68                 100
VR 5            6.24      1.72                 100
FR 1            1         0                    100
FR 5            5         0                    100
VT 5 s          5.03 s    2.52 s               100
VT 10 s         9.31 s    2.91 s               100
Reinstatement   –         –                    100
Extinction      –         –                    100

Note. VI = variable interval; VR = variable ratio; FR = fixed ratio; VT = variable time. – indicates not applicable.
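The interval-generation rule and the first measure described above can be sketched in Python (hypothetical names; the software itself generated these values in JavaScript):

```python
import random

def generate_vi_intervals(n, target_mean=10.0, spread=5.0):
    """Draw n interreinforcer intervals uniformly around the target mean,
    e.g., VI 10 s sampled from the 5-15 s range in the example above."""
    return [random.uniform(target_mean - spread, target_mean + spread)
            for _ in range(n)]

def percent_within_range(values, low, high):
    """First measure: % of generated schedule requirements in the target range."""
    in_range = sum(1 for v in values if low <= v <= high)
    return 100.0 * in_range / len(values)
```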

Second, we measured the mean and standard deviation of the variable-schedule requirements that the software generated for each session to evaluate whether the schedule requirements conformed to the programmed schedule. Due to the possibility of sampling bias related to analyzing a relatively small sample of randomly generated schedule requirements, we considered the schedule accurate if the obtained mean was within one standard deviation of the targeted mean.
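The within-one-standard-deviation criterion can be expressed directly, as in this Python sketch (hypothetical names; uses the sample standard deviation):

```python
from statistics import mean, stdev

def schedule_accurate(obtained_requirements, target_mean):
    """Accept the schedule if the obtained mean falls within one (sample)
    standard deviation of the targeted mean, per the criterion above."""
    return abs(mean(obtained_requirements) - target_mean) <= stdev(obtained_requirements)
```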

Third, we evaluated whether the software delivered consequences as scheduled. We tested this first during our Basic Functionality Checks. During this initial test, we had a confederate participant vary their response patterns in a number of ways to try to catch potential bugs within the software. For example, to test the consequences for correct vs. incorrect responses, we instructed the confederate to produce only correct responses, only incorrect responses, or a combination of both. To test the changeover ratio, we instructed the confederate to produce response patterns that strictly alternated across options, never alternated (i.e., strict allocation of responding to one option), or alternated unsystematically. When we tested interval schedules, we instructed the confederate to complete the response chains immediately prior to the interval elapsing, immediately after the interval elapsed, or unsystematically throughout the interval to ensure consequences were delivered as programmed.

We also evaluated the latency of the consequences by calculating the percentage of consequences delivered correctly within .5 s of the scheduled time. In these tests, we instructed the confederate to respond as quickly as possible or at a relatively slow rate. When we tested response-independent schedules, correct delivery was defined as reinforcers delivered within .5 s of the scheduled time. For response-dependent schedules, a correct delivery was defined as (a) reinforcers delivered following the first correct response (i.e., a click on the “Submit” button when the typed number correctly answered the arithmetic problem) meeting the currently programmed schedule requirement and (b) reinforcers withheld following responses that did not meet this schedule requirement. Any other reinforcer delivery was considered an incorrect delivery. For example, in the VR 5 session, if the program generated a schedule requirement of seven responses, then withholding reinforcement following the sixth correct response and delivering a reinforcer following the seventh correct response both would have been scored as a correct delivery, whereas delivering a reinforcer after the sixth correct response and withholding the reinforcer following the seventh correct response both would have been considered incorrect. We calculated the percentage of consequences delivered as scheduled by dividing the number of correct deliveries by the sum of correct and incorrect deliveries and multiplying the quotient by 100.
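For ratio schedules, the correct/incorrect delivery scoring reduces to a single comparison, sketched here in Python (hypothetical names):

```python
def delivery_correct(correct_responses_since_last_sr, requirement, delivered):
    """A consequence is scored correct when the software delivered a reinforcer
    exactly when the current ratio requirement was met and withheld it otherwise."""
    should_deliver = correct_responses_since_last_sr >= requirement
    return delivered == should_deliver
```

For the VR 5 example above with a generated requirement of seven responses, withholding reinforcement after the sixth correct response and delivering it after the seventh are both scored as correct deliveries.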

Reinforcer Magnitude.

We evaluated whether the experimental software delivered different reinforcer magnitudes accurately for each of the two response options. Because reinforcers were delivered in the form of point deliveries, we programmed the software to deliver different ratios of point values for target and alternative responses (i.e., 10:10, 10:20, 20:10). For each session, we counted the number of reinforcers delivered at the correct point value, divided this by the total number of reinforcer deliveries, and multiplied the quotient by 100 to calculate the accuracy that reinforcers were delivered according to their programmed magnitude.

Reinforcer Delay.

We evaluated whether the experimental software accurately delivered reinforcers following different delays after responses that met the current schedule requirement. We counted the number of reinforcers delivered within .5 s of the scheduled delay, divided this by the total number of reinforcers delivered in the session, and multiplied by 100 to calculate the percentage of reinforcers delivered within .5 s of the scheduled delay.

Reinforcer Quality.

We evaluated whether the experimental software accurately delivered reinforcers of different qualities. For this session only, we had the software display points on two counters. Each point counter was labelled with a qualitatively different reinforcer. For example, one was labelled “cash,” and the other was labelled “gift card.” Such a manipulation would allow future experiments to include instructions specifying that points on each counter would be associated with qualitatively different backup reinforcers.

To evaluate the accuracy of this parameter, we programmed the software to deliver points to one counter for responding to one response option (e.g., points were added to the point counter labelled “cash” following target responses), whereas the software delivered points to the other counter for responding to the other response option (e.g., points were added to the point counter labelled “gift card” following alternative responses). A qualitatively correct reinforcer delivery was defined as any occurrence of points being delivered to the correct point counter following the response programmed to produce reinforcers of that quality (e.g., points were delivered to the cash point counter following a target response). We counted the total number of reinforcer deliveries of the correct quality across both response options, divided this by the total number of reinforcer deliveries in the session, and multiplied the quotient by 100 to obtain the percentage of reinforcer deliveries of the correct quality.

Response Effort.

We evaluated whether the experimental software accurately implemented response-effort manipulations. We manipulated response effort across response options by altering the number of terms in the arithmetic problems presented within the target or alternative response chains. For example, in one session, the arithmetic problems in the target response chain were two-term, single-digit addition problems, whereas the problems in the alternative response chain were three-term, single-digit addition problems. In each session, we counted the number of arithmetic problems presented across both response options that followed the programmed specifications, divided this by the total number of arithmetic problems presented, and multiplied the quotient by 100 to calculate the percentage of arithmetic problems presented with the correct difficulty.

Phase Duration.

Next, we evaluated the accuracy of implementing different phase durations. We divided 5-min sessions into three phases to reflect a typical three-phase relapse experiment. We manipulated the duration of those three phases to represent various ratios of the total 5-min session duration (i.e., 1:1:1, 1:2:1, 2:1:1, 2:2:1). Due to the small number of phases within a single session, we calculated a single percentage from all of the sessions evaluating the phase-duration parameter. Specifically, we counted the total number of phases with a duration that was within .5 s of the programmed phase duration across these sessions and divided this number by the total number of phases. We multiplied this quotient by 100 to obtain a percentage.
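As a worked example of the ratio manipulation, the expected phase durations can be derived by splitting the 300-s session proportionally. The helper below is a hypothetical sketch, not code from the published software:

```javascript
// Split a total session duration (in seconds) into phases
// according to a ratio such as [1, 2, 1].
function phaseDurations(totalSeconds, ratio) {
  const parts = ratio.reduce((sum, r) => sum + r, 0);
  return ratio.map((r) => (totalSeconds * r) / parts);
}

phaseDurations(300, [1, 1, 1]); // each phase lasts 100 s
phaseDurations(300, [1, 2, 1]); // 75 s, 150 s, 75 s
phaseDurations(300, [2, 2, 1]); // 120 s, 120 s, 60 s
```

During validation, each observed phase duration would then be compared against these expected values within the .5-s tolerance.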

Context Change.

We evaluated whether the experimental software displayed different contexts accurately across phases for the purposes of conducting renewal tests. We programmed the software to change contexts by changing the background color of the display. We manipulated the contexts in sequences commonly used in renewal preparations (i.e., AAA, ABA, AAB, ABC). Due to the relatively small number of contexts programmed within a single session, we calculated the percentage of accurate contexts displayed by counting the total number of contexts that were displayed correctly across all sessions evaluating this parameter, dividing this by the total number of contexts displayed, and multiplying by 100.

Other Considerations

It is important to validate software under conditions that approximate the experimental context as closely as possible prior to launching the experiment. For example, if the experiment will be implemented via the internet, then the validation procedures should eventually be conducted over the internet. Internet-connectivity issues may slow the timing of events programmed and recorded by the software, and server specifications may differ from those of the local device, either of which can introduce errors that would not be apparent if validation occurs only while the software runs on the local device. Also, if participants will access the software and receive compensation through a recruitment platform (e.g., MTurk), linking the software to the recruitment platform and testing its critical components (e.g., obtaining informed consent electronically, preventing repeat participation, generating unique compensation codes, ensuring accurate and timely payment) will be necessary. Whenever possible, the software should be tested exactly how it will be implemented during the experiment.

Identifying and Resolving Issues

As noted previously, computers reliably execute software precisely as written, so deviations from perfect execution (i.e., <100% accuracy) during the validation process must be remedied. We identified two notable issues during the validation of our software, which we describe here to clarify how a rigorous validation process helps to identify errors and facilitate their resolution. First, during the basic functionality check of our software, all parameters were implemented accurately across each user’s laptop, across operating systems, and across web browsers with one exception—some users had different screen-resolution settings on their personal devices, resulting in some users being unable to see the entire experimental interface without scrolling during the experiment. To address this issue, we programmed the experimental interface to prompt participants to adjust the zoom on their internet browser so they could simultaneously see four boxes located in the four corners of the experimental interface, and participants were required to click each box prior to advancing to the experiment. The addition of this feature resolved the issue for all users. No other issues were identified during basic functionality checks.

Second, during our initial timer validation, we identified that the timer built into the experiment lagged. Specifically, the timer in the experimental software slowed compared to the BDataPro timer, and the difference between the timers increased as the session progressed. The validation procedures identified this issue; however, the cause of the issue remained unclear. Again, it is normal to identify issues with software and to conduct follow-up tests to identify the cause of an issue before attempting to fix it. For this specific issue, we conducted sessions with longer durations, which yielded larger differences between the timers. We also conducted sessions with varying response rates, and the amount of lag in the timer of the experimental software varied across these sessions, such that higher response rates exacerbated the timer lag. Finally, we conducted additional validation sessions with the experimental software saved and running locally on a computer (as opposed to on the server), which corrected the issue but did not provide a practical solution for the planned experiment that was to follow. These additional tests helped identify the underlying issue, which allowed us to fix it.

More specifically, the issue occurred because our software had to execute code following each response (i.e., store data on the response emitted, deliver reinforcement, make the arithmetic problem disappear, update the timer). When the code was stored on the server, we hypothesize that each execution required transferring information via the internet, which introduced a delay. Executing the code stored on the local device produced no detectable delay because no information had to be transferred via the internet. Although executing the code via the internet may have delayed the timer by only a few milliseconds after each response, this small delay likely accumulated with each consecutive response, causing a progressively increasing difference between the experimental software’s timer and the previously validated timer in BDataPro. Thus, high-rate responding increased lag more than low-rate responding because information had to be communicated via the internet more frequently.

Importantly, these delays would have varied depending on each participant’s internet-connection speed and response rate, which would have introduced a confound into the experiment. We addressed this issue by changing the programming for the experimental timer. Rather than creating a timer within the software that could lag, we adjusted the software to use the timing mechanism built into the server, which progressed independently of interactions within the software. To do this, the software executed the “.getTime()” JavaScript method to timestamp the initiation of the experiment. Then, every 50 ms, the software executed this code again to get the current time and subtracted the initiation timestamp from the current time, yielding the amount of time elapsed since the experiment had started. Executing the code every 50 ms produced sufficiently precise timing without requiring excessive computing power, and programming the timer this way avoided the possibility that user interactions could cause the timer to lag.
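The timestamp-based approach can be sketched as follows. This is a minimal illustration of the logic described above, assuming a JavaScript environment with standard timers; the `updateDisplay` hook is hypothetical, and the published software’s actual code is not reproduced here:

```javascript
// Record the moment the experiment starts.
const startTime = new Date().getTime();

// Pure helper: elapsed milliseconds given start and current timestamps.
function elapsedMs(start, now) {
  return now - start;
}

// Illustrative UI hook (hypothetical; the experimental interface's
// display code is not shown here).
function updateDisplay(ms) {
  // e.g., render ms / 1000 as elapsed seconds on the interface
}

// Every 50 ms, recompute the elapsed time from fresh timestamps.
// Even if an update fires late (e.g., during a burst of high-rate
// responding), the next reading remains accurate because it is
// derived from the clock rather than from counting ticks.
const timerId = setInterval(() => {
  updateDisplay(elapsedMs(startTime, new Date().getTime()));
}, 50);
```

The design choice that matters is that elapsed time is always recomputed from timestamps rather than incremented, so a delayed or dropped update cannot accumulate into the progressive lag observed in the first iteration.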

As noted previously, software validation is an iterative process. If an issue is identified during validation, the researcher should resolve the issue and begin validation from that step again. Thus, we will only report the final performance of our validated software, which was 100% accuracy across all validation checks. Specifically, the timer in the experimental software remained within .5 s of the BDataPro timer at each synchrony check. When we compared the videos of the user’s physical responses on the trackpad to the video of the experimental software’s user interface, all physical responses produced the correct change to the user interface within .5 s. Further, the experimental software successfully recorded each response that occurred (i.e., 0 false negatives) and did not record any responses that did not occur (i.e., 0 false positives). All recorded responses were within .5 s of true responses. These results demonstrated the accuracy, sensitivity, and reliability of the experimental software’s timer and data collection.

Table 1 displays data obtained during reinforcement-schedule validation, demonstrating that the experimental software consistently generated reinforcement schedules according to the programmed parameters. When we evaluated whether the experimental software delivered consequences correctly, we obtained 100% accuracy across all consequence deliveries. When we evaluated all other independent variables (i.e., reinforcer magnitude, delay, and quality; response effort; phase duration; context changes), the software implemented each independent variable manipulation with 100% accuracy. Thus, the experimental software implemented all parameter manipulations as intended. Based on these evaluations, we determined that the critical features of our experimental software (i.e., basic functionality, data collection, independent-variable implementation) would remain intact and be implemented accurately via the internet.

The Importance of Validation

Conducting these evaluations highlighted the need for all experimental software to undergo rigorous validation procedures to rule out potential confounds. We identified two issues with our experimental software through our validation efforts. First, part of the experimental interface was not visible for certain users depending on their hardware (e.g., display size) and their screen-resolution settings. We identified this issue when we assessed the basic functionality of the experiment on a wide variety of hardware and software, and we addressed this issue by instructing participants to adjust their settings to ensure they could see the entire interface. Notably, this confound was not due to a software malfunction; rather, it was due to variations in the settings of each user’s personal device. If we had not conducted our apparatus validation, it is unlikely that we would have addressed this issue, and we may have obtained unusual response patterns from participants who could not see the whole interface. This would have been a clear confound of the experiment, but it would have been difficult to discern the cause of these unusual response patterns after the fact.

Second, we noticed that the first iteration of the timer built into the experimental software lagged. Again, we addressed this issue by reprogramming the basic mechanism by which the software measured time, but it is unlikely that this issue would have been detected without rigorous validation efforts. In the final version of the software, the timer built into the software is not visible to the user, so it would have been difficult to determine whether the timer remained accurate had we not explicitly evaluated it. Further, this issue only occurred when the software was implemented via the internet, highlighting how experiments conducted via the internet may be especially prone to confounds and how researchers must validate their software under conditions that approximate the experimental context as closely as possible. Failing to detect this error could have impacted the accuracy of data collection, time- and interval-based reinforcement schedules, and phase durations.

Notably, it is unlikely that we would have identified either error if we had not conducted rigorous validation procedures, yet missing either of these issues could have seriously impacted our experimental control. Our identification of these two issues suggests a need to conduct rigorous validation procedures to ensure the integrity of experiments conducted with computer-based software.

Generalized Validation Procedures

Beyond demonstrating the need to validate experimental software, the primary purpose of this article is to provide researchers with an example of how they could validate their own experimental software. Of course, the validation procedures and aspects of the software that need validation may vary based on the critical features necessary for any given experiment or software; however, a general method similar to ours could likely be applied as follows (as depicted in Figure 2).

First, create a checklist of the basic features of your experimental software and ensure they function as intended across a wide variety of hardware and software (Figure 2, Step 1). In doing so, consider applying different user settings (e.g., display resolutions) on devices, as well. It is also important to test these functions (and all subsequent functions) under conditions closely approximating the experimental conditions, including executing the software via the internet, if applicable. It is important to note that the exact parameters of the experimental interface may vary based on the configuration of each participant’s personal computer. For example, the size, resolution, and color configuration of the display are all likely to vary across participants when human-operant experiments are conducted over the internet using a crowdsourcing recruitment platform. Even though the exact appearance of the experimental interface may vary slightly, the purpose of this first step is to ensure that the critical features of the experiment are unaffected before the researchers continue with the remaining steps of the validation process.

Second, record a synchronous video of the software’s interface, a previously validated timer (if applicable), and the hardware (e.g., keyboard, trackpad, mouse, touchscreen) the participants will use to interact with the software’s interface (Figure 2, Step 2). In this step, it may be necessary to make things visible on the software’s interface that will not be visible to participants simply for the purposes of validating the relevant parameters. For example, we had the timer in the experimental software visible for validating this parameter even though it was not visible to participants in subsequent experiments.

Third, ensure the software responds to user interactions as intended by comparing the video recordings of the software and the physical responses of a user (Figure 2, Step 3). Conduct frame-by-frame analyses as necessary to evaluate the responsiveness of the software.

Fourth, validate the timing mechanism (if applicable) of your software by comparing the video-recording of the software with a previously validated timer (e.g., BDataPro; Figure 2, Step 4). Conduct frame-by-frame analyses at multiple timepoints to ensure synchrony throughout the experiment. In these tests, evaluating the effects of different response patterns may help identify potential issues and their causes.

Fifth, validate the software’s data-collection mechanism by comparing the video-recording of the software’s interface and the output of the data collected by the software (Figure 2, Step 5). For each user response and each recorded response, conduct a frame-by-frame analysis to ensure accurate timestamps and identify false positives and false negatives.
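One way to operationalize this step is to match each recorded timestamp to a video-scored (“true”) timestamp within the .5-s tolerance; unmatched recorded events count as false positives, and unmatched true events count as false negatives. The sketch below is a hypothetical illustration of that matching logic, not the authors’ analysis code:

```javascript
// Compare recorded response times against video-scored true times
// (both in seconds). Each true response may absorb at most one record.
function compareResponses(trueTimes, recordedTimes, tolerance = 0.5) {
  const unmatchedTrue = [...trueTimes];
  let falsePositives = 0;
  for (const t of recordedTimes) {
    const i = unmatchedTrue.findIndex((u) => Math.abs(u - t) <= tolerance);
    if (i === -1) {
      falsePositives += 1;        // recorded, but no true response nearby
    } else {
      unmatchedTrue.splice(i, 1); // matched within tolerance
    }
  }
  return { falsePositives, falseNegatives: unmatchedTrue.length };
}

// Three true responses; the software missed the last one and
// logged one spurious event.
compareResponses([1.0, 2.0, 3.0], [1.1, 2.05, 5.0]);
// → { falsePositives: 1, falseNegatives: 1 }
```

A fully accurate session, like those reported by the authors, would return zero for both counts.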

Sixth, validate the software’s implementation of each programmed independent variable parameter (Figure 2, Step 6). Conduct frame-by-frame analyses of the video of the experimental interface to evaluate accuracy of the intended manipulations. Again, it may be necessary to make things visible on the software’s interface that will not be visible to participants. For example, we had the parameters of the reinforcement schedule and the current schedule requirement visible to validate their accuracy even though they were not visible to participants in subsequent experiments.

Finally, researchers should consider recruiting participants in small increments and evaluating incoming data iteratively to identify potential issues as they arise (Figure 2, Step 7). Unusual patterns in the data may suggest issues with the experimental software. Identifying and addressing such issues early on has many advantages, especially when resources are limited and the experiment involves participant compensation.

Reporting on Validation Procedures

Our evaluation also highlights the importance of reporting that the experimental apparatus has been validated. Although previous computer-based, human-operant experiments have rarely reported specific validation procedures, we are not suggesting that other researchers did not conduct rigorous validation procedures. We are suggesting, however, that future researchers should report how they validated their software, which may help demonstrate how nascent researchers can and should validate their own software. Further, we are not suggesting that researchers report each individual validation procedure with the same level of detail we provide in this paper.

Rather, researchers should consider summarizing the following points. If the experimental software will be implemented via the internet on participants’ personal devices, they should report that they tested the software on a wide array of common hardware and software to ensure the basic functions of the software remain intact. They should report that they evaluated whether the software responded to human responses as intended, maintained synchrony with a validated timer (if applicable), collected data accurately, and implemented independent-variable manipulations with fidelity. They should also report the general procedures used to evaluate each of these parameters (e.g., synchronous video recordings with frame-by-frame comparisons). We also suggest that once experimental software has been validated in a previous study evaluating the same variables, researchers can simply reference the validation conducted in that previous study and explain how they validated any new variables pertinent to their study. Finally, researchers may consider making their experimental software available to others to allow for independent validation and to facilitate more precise replication of their experimental procedures. This should increase the reader’s confidence that extraneous variables did not affect the results of subsequent experiments using this software.

Such procedural detail is important given the promising preliminary results of experiments using human-operant preparations presented on the internet via crowdsourcing websites. For example, Robinson and Kelley (2020) demonstrated response patterns consistent with resurgence and ABA renewal when they presented participants with their experimental software through MTurk, and both of these effects were replicated by Ritchey et al. (2021). Using similar methods, Smith and Greer (2021) demonstrated that phase duration has a predictable effect on the amount of resurgence, showing that researchers can conduct human-operant experiments via the internet to reveal novel functional relations. Further, Kuroda et al. (2021) provide a tutorial on developing software for conducting human-operant experiments with MTurk, extending the accessibility of this methodology to researchers with minimal computer-programming experience. As these methods become more common and accessible to nascent researchers, it is necessary to establish standards for validating and reporting on human-operant software to ensure continued experimental rigor.

Reporting validation procedures, and the results thereof, in studies like these will help ensure that these translational evaluations are tightly controlled, minimize the potential influence of extraneous variables, and demonstrate a clear expectation to nascent researchers that validation is necessary. In the best-case scenario, following the validation steps we outlined above could improve researchers’ ability to demonstrate experimental control, decreasing the likelihood of false-positive and -negative results due to confounds. In the very least, reporting the results of such validation efforts will increase the reader’s confidence that any detected effects were due to the experimental manipulations and not extraneous variables. Together, this may facilitate more rapid translation of findings from basic laboratories with nonhuman animals to human participants, provide preliminary data on unexplored behavioral phenomena, and evaluate the feasibility of novel procedures.

Acknowledgments

This research fulfilled partial requirements for the first author’s doctoral degree from the University of Nebraska Medical Center and was funded in part by grants 2R01HD079113, 5R01HD083214, and 5R01HD093734 from the National Institute of Child Health and Human Development.

Appendix A. Basic Functionality Checklist

For each parameter, circle “Yes” or “No” to indicate correct functionality:

Point total visible: Yes / No
Target button visible: Yes / No
Alternative button visible: Yes / No
Response buttons movement: Yes / No
Background colors: Yes / No
Points add correctly: Yes / No
Points withheld correctly: Yes / No
Colors flash with point delivery: Yes / No
Arithmetic problems correct: Yes / No
Response input box correct: Yes / No
Submit button visible: Yes / No

Note. Select “Yes” only when the software implements each parameter with 100% accuracy throughout the entire experiment. Use the Basic Functionality Checklist to evaluate each variable listed in Appendix B.

Appendix B. Variables to Evaluate When Testing Basic Functionality

Hardware: Different laptops
Operating Systems: Windows, macOS
Web Browsers: Internet Explorer, Microsoft Edge, Google Chrome, Firefox, Safari
Stress Test (Internet Speed): High-speed, Low-speed
Stress Test (Internet Traffic): High-traffic, Low-traffic

Note. Check the box next to a variable when the Basic Functionality Checklist is scored 100% “Yes” for that variable.

Footnotes

1. The same procedures can be used to determine the accuracy of incorrect responses.

References

1. Bullock CE, Fisher WW, & Hagopian LP (2017). Description and validation of a computerized behavioral data program: “BDataPro”. The Behavior Analyst, 40(1), 275–285.
2. Craig AR, Browning KO, Nall RW, Marshall CM, & Shahan TA (2017). Resurgence and alternative-reinforcer magnitude. Journal of the Experimental Analysis of Behavior, 107(2), 218–233. 10.1002/jeab.245
3. Jarmolowicz DP, & Lattal KA (2014). Resurgence under delayed reinforcement. The Psychological Record, 64(2), 189–193. 10.1007/s40732-014-0040-0
4. Kuroda T, Ritchey CM, & Podlesnik CA (2021, May). A tutorial for developing customizable systems for conducting operant experiments online with Amazon Mechanical Turk. ResearchGate. https://www.researchgate.net/publication/351942398_A_Tutorial_for_Developing_Customizable_Systems_for_Conducting_Operant_Experiments_Online_with_Amazon_Mechanical_Turk
5. Ritchey CM, Kuroda T, Rung JM, & Podlesnik CA (2021). Evaluating extinction, renewal, and resurgence of operant behavior in humans with Amazon Mechanical Turk. Learning and Motivation, 74, 101728. 10.1016/j.lmot.2021.101728
6. Robinson TP, & Kelley ME (2020). Renewal and resurgence phenomena generalize to Amazon’s Mechanical Turk. Journal of the Experimental Analysis of Behavior, 113(1), 206–213. 10.1002/jeab.576
7. Smith SW, & Greer BD (2021). Phase duration and resurgence. Journal of the Experimental Analysis of Behavior. Advance online publication. 10.1002/jeab.725
8. Sweeney MM, & Shahan TA (2013). Effects of high, low, and thinning rates of alternative reinforcement on response elimination and resurgence. Journal of the Experimental Analysis of Behavior, 100(1), 102–116. 10.1002/jeab.26
