Perspectives on Behavior Science. 2023 Sep 11;47(1):225–250. doi: 10.1007/s40614-023-00388-9

Understanding Individual Subject Differences through Large Behavioral Datasets: Analytical and Statistical Considerations

Michelle A Frankot 1,2, Michael E Young 3, Cole Vonder Haar 1,2
PMCID: PMC11035513  PMID: 38660505

Abstract

A core feature of behavior analysis is the single-subject design, in which each subject serves as its own control. This approach is powerful for identifying manipulations that are causal to behavioral changes but often fails to account for individual differences, particularly when coupled with a small sample size. It is more common for other subfields of psychology to use larger-N approaches; however, these designs also often fail to account for the individual by focusing on aggregate-level data only. Moving forward, it is important to study individual differences to identify subgroups of the population that may respond differently to interventions and to improve the generalizability and reproducibility of behavioral science. We propose that large-N datasets should be used in behavior analysis to better understand individual subject variability. First, we describe how individual differences have been historically treated and then outline practical reasons to study individual subject variability. Then, we describe various methods for analyzing large-N datasets while accounting for the individual, including correlational analyses, machine learning, mixed-effects models, clustering, and simulation. We provide relevant examples of these techniques from published behavioral literature and from a publicly available dataset compiled from five different rat experiments, which illustrates both group-level effects and heterogeneity across individual subjects. We encourage other behavior analysts to make use of the substantial advancements in online data sharing to compile large-N datasets and use statistical approaches to explore individual differences.

Keywords: Individual differences, Multilevel modeling, Monte Carlo simulation, Open Science, Big data, Rat

Introduction

The science and practice of behavior analysis have historically concerned themselves with the single-subject design, in which a subject serves as its own control. First brought to prominence in B. F. Skinner’s book, The Behavior of Organisms (Skinner, 1938), and still held as a core methodology, the single-subject design provides a means for understanding and manipulating behavior, including tailoring individualized interventions. Despite these advantages, the design has several inherent disadvantages, particularly the problem of how to address individual subject variability (also referred to as between-subject variance or individual differences). Given its heavy use of the single-subject design, behavior analysis is putatively focused on the individual; in practice, however, between-subject variance is often ignored or mitigated by training subjects to similar states (e.g., titrating task parameters to obtain a stable function such as response rate or discounting rate). Yet individual differences are crucial to understanding behavior, and ignoring them creates challenges in extrapolating from individuals to the population. This disconnect between individuals and population-level parameters has generated what might be described as a broad aversion to inferential statistics and the generalizations that often accompany them (DeHart & Kaplan, 2019; Michael, 1974; Young, 2017a).

How and whether inferential statistics should be integrated into behavior analysis has been a recurring topic, with arguments against statistics formalized in Sidman’s early work (Sidman, 1960) and pushed into debate in the Journal of the Experimental Analysis of Behavior (JEAB; Revusky, 1967). A change in zeitgeist (or at the very least, in expectations for publication) is evident in a steady increase in the use of inferential statistics in JEAB over time, with over 50% of publications from 2005 to 2015 reporting test statistics or null hypothesis significance testing (Kyonka et al., 2019; Zimmermann et al., 2015). More recently, researchers have argued that statistical applications in behavior analysis should focus on how individual data may be preserved when analyzing the aggregate (e.g., through multilevel or mixed-effects models; DeHart & Kaplan, 2019; Young, 2017b). However, relatively little attention has been given to the use of large samples to better understand individual subject variability. Individual subject variability arises from characteristics inherent to an individual that influence outcome measures and interact with manipulations. These differences could be reflected in between-subject variation in average performance or in subjects’ sensitivity to within-subject manipulations.

The goal of the current article is to provide guidance on why and how behaviorists can use large-N datasets to describe and quantify patterns in behavior that differ across subjects (i.e., to account for individual-subject variability) while still preserving core tenets of behaviorism. To accomplish this, we describe historical perspectives on individual differences, outline benefits of individual-focused research, and give examples of relevant techniques, including correlations, machine learning, mixed-effects modeling, cluster analyses, latent variable techniques, and simulations. We provide examples from the literature and make use of a publicly available dataset as a demonstration of how to analyze individual differences in maladaptive choice. This dataset (Vonder Haar, Martens, et al., 2022b) includes 151 rats tested on a concurrent four-choice paradigm. This paradigm generates substantial heterogeneity in choice preferences and can be profoundly altered by physiological (e.g., brain injury) or environmental (e.g., inclusion of audiovisual cues) manipulations. Our analysis of this dataset illustrates how behavior analytic paradigms can provide rich insight into individual-subject variability when the sample size is sufficiently large and appropriate statistical techniques are applied.

Historical Perspectives on Individual Subject Variability

In early behavioral research, individual differences were a topic of interest. For example, Pavlov noted differences across dogs in his historic classical conditioning experiments, describing some as weak or strong, passive or impressionable, and modest or greedy (Todes, 2014). Psychologist Hans Eysenck argued that behavioral scientists abandoned this early interest in individual differences in a pushback against Freudian psychology (Eysenck, 1984). Moreover, the removal of the organism from behavior analytic language to focus on the behaviors themselves (Baer, 1976) may have unintentionally exacerbated this shift by reducing emphasis on individual characteristics (e.g., genetic and behavioral history). As time progressed, behavioral scientists generally dealt with individual-subject variability by mitigating its effects in two primary ways: (1) exclusive focus on the individual—the idiographic approach (e.g., case-study, single-subject design); or (2) a focus on group-level traits—the nomothetic approach. Each approach has advantages and disadvantages, but both attempt to reduce the influence of individual-subject variability, often without providing a means to understand those differences. We advocate for “blended” approaches, where a large sample size coupled with a strong causal design allows for a focus on both the individual and group-level effects.

Idiographic Approach

The goal of the idiographic approach is to make predictions and/or inferences about individuals (Molenaar, 2004). A common application in behavioral science is the single-subject design, where reliability (i.e., direct replication) is assessed by repeating a manipulation either within- or between-subjects, and generalizability (i.e., systematic replication) is established by manipulating the controlling variables under novel conditions (e.g., different subjects, contexts, stimuli). This approach can produce rich, longitudinal data with visually apparent effects. Tight experimental control mitigates the confounding influence of individual differences, strengthens internal validity, and allows for strong causal conclusions without advanced statistical applications. Moreover, as data accumulate across subjects, experiments, or replications, evidence for universal principles is likely to emerge, akin to nomothetic approaches. The universality of delay discounting and of matching behavior is a prime example of such effects (Sutton et al., 2008; Vanderveldt et al., 2016).

Many behaviorists have long advocated for the single-subject approach, and indeed, visually apparent effects that are reversible and repeatable are undeniably attractive. However, several limitations may affect the translation of findings. Perhaps the most obvious is the case of irreversible conditions (e.g., brain injury, aging), in which it is impossible to return to baseline within a subject. Another set of problems relates to throughput; single-subject experiments often last months or years, which makes it difficult to rapidly build on findings or evaluate multiple manipulations concurrently. The same issues are apparent in applied settings, where the number of clients is limited by the time constraints of the behavior analyst. Indeed, authors have argued that more behavior analysts are needed to serve the autism population (Dorsey et al., 2009) and to bring behavior analysis to other psychological conditions (Normand & Kohn, 2013), underscoring the issue of throughput. Furthermore, when as many sources of variance as possible are controlled, the results of a single-subject design may be more difficult to generalize beyond the conditions and individuals tested (Richter et al., 2010).

Nomothetic Approach

As psychological science became dominated by other subfields (e.g., cognitive, social, health), the nomothetic approach became a popular alternative to the idiographic approach with the goal of making broad inferences about the population (Molenaar, 2004). Rather than eliminating individual-subject variability through experimental control, the nomothetic approach averages out individual differences by analyzing group-level data with large samples. For example, group-level effects might be evaluated using a t-test or ANOVA to make inferences about the population. As sample size increases, the sample statistic will more closely approximate the population statistic, and conclusions about the population can be made with more certainty. Mathematically, this occurs because a larger sample size reduces the standard error of a test statistic (e.g., the error of a sample mean or probability). Thus, a nomothetic approach can be used to identify universal principles that generally apply to the population and provide a means to control for heterogeneity in clinical populations where tight control is unrealistic. As inferential statistics become more common in behavior analysis, they are often applied in this framework.

However, this approach exemplifies a classic problem of extrapolating from a population-level effect to a given individual in that population (Fisher et al., 2018). Indeed, many foundational papers demonstrate inferential mistakes due to averaging across individuals (Guthrie & Horton, 1946; Robinson, 1950; Wixted & Ebbesen, 1997) with an intuitive example described by Hamaker et al. (2005): at the population level, there is a negative correlation between typing speed and number of typographical errors; faster typists tend to be more experienced and make fewer errors. However, at the individual level, there is a positive correlation between typing speed and number of errors because individuals make more mistakes when typing faster than their average speed (Hamaker et al., 2005).

Furthermore, an effect may be smaller, larger, or in the opposite direction for some subgroups. For example, some drugs work differently for certain populations (e.g., many drugs cause more severe adverse reactions in females compared to males; Nakagawa & Kajiwara, 2015). These factors can be accounted for in behavioral science by adopting methods that identify relevant subgroups and/or moderating variables while still making inferences about the population at large (Hagopian et al., 2015). Nomothetic approaches heavily rely on statistical inference, which can result in errors when misapplied (e.g., using tests when core assumptions are violated). These types of errors may be particularly prevalent in psychology (Burke et al., 2013; Hoekstra et al., 2012) with some suggesting that the propensity for false positives in between-subject psychological research may be a major contributor to the replication crisis (DeHart & Kaplan, 2019). However, it should be noted that the nomothetic approach may offer many advantages to behavioral researchers working in policy or public health that deal with the population more broadly as opposed to isolating and understanding individual behavioral principles.

Blended Approaches

Both the idiographic and nomothetic approaches are certainly valuable and play important roles in psychological research. However, both approaches also have limitations, and a blending of idiographic and nomothetic approaches may better serve to propel behavioral science forward. In particular, neither approach provides robust insight into the causes of heterogeneity or a means to understand why some interventions affect subjects differently. Failure to describe individual differences will inevitably result in the neglect of certain individuals and subgroups. Indeed, it is well-documented through nomothetic studies of large patient samples in the United States that there are differences across race in the metabolism, effectiveness, and occurrence of side effects of various prescription drugs (Burroughs et al., 2002) that may contribute to racial disparities in health outcomes. Some clinicians have proposed that combining the idiographic and nomothetic approaches provides the most powerful lens to understand heterogeneous data and to treat diverse populations (Beltz et al., 2016). The National Institutes of Health (NIH) recognized this problem in a 2015 report on personalized medicine (NIH, 2015) and has since begun building a 1-million-person cohort to track health outcomes (NIH, 2022). In behavioral research, “blended” approaches could look very different depending on the research question, but we are broadly advocating for the use of detailed subject-level data in combination with larger sample sizes and statistical techniques focused on the individual as a means to enhance reproducibility and generalizability.

Individual Differences-Focused Research

A “blended” approach that focuses on individual differences can be accomplished by combining the strengths of the idiographic and nomothetic approaches and examining individuals in a large-N dataset. Such approaches allow for the use of advanced statistical techniques that require a larger sample size (e.g., clustering techniques) to answer questions about individual differences. We propose blending behavior analytic designs (i.e., longitudinal case study data) with larger sample sizes to understand individual differences in basic science and to best serve heterogeneous populations in applied science.

Logistics

To achieve the sample size needed for a blended approach, data may be pooled across multiple experiments, which also accommodates practical constraints (e.g., funding, equipment). Data pooling has been used in preclinical work to fit the generalized matching law to behavioral data and inform model selection for future experiments (Baum, 1979; Sutton et al., 2008). In clinical research, pooling can improve generalizability, especially for distinct subgroups (Bangdiwala et al., 2016). Data pooling is also common in meta-analyses (Sutton et al., 2008). It should be noted that a sufficient number of both subjects and observations is key to performing such analyses.

There are several resources available for individual laboratories interested in pooling data. There are field-specific repositories and the more general Open Science Framework (OSF) repository. Funding agencies are increasingly pushing for standardization (e.g., NIH’s Common Data Elements) and open access to data (e.g., NIH’s new data sharing policy), so opportunities for data pooling are likely to expand rapidly in the future. The easiest route is likely to begin with historical data from one’s own laboratory or colleagues who collect similar data. A recent review provided an accessible overview of how to prepare data using GitHub (Gilroy & Kaplan, 2019), and other papers detail FAIR (Findable, Accessible, Interoperable, Reusable) principles for preparing data for reuse (Wilkinson et al., 2016). Implementing a standardized data format, preferably using tidy data principles (i.e., columns as variables, rows as observations, and single values per cell; Ellis & Leek, 2018; Wickham, 2014), allows for rapid integration of new and old data with reusable scripts for data processing and analysis.
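As a starting point, a minimal sketch of what pooling into a tidy format might look like in R is shown below. The file names, column names, and experiment labels are illustrative assumptions rather than a prescribed schema.

```r
# Minimal sketch: pooling two hypothetical experiment files into one tidy
# table (columns = variables, rows = observations, one value per cell).
# File and column names are assumptions for illustration only.
library(dplyr)

exp1 <- read.csv("experiment_2019.csv")   # hypothetical historical dataset
exp2 <- read.csv("experiment_2021.csv")   # hypothetical new dataset

pooled <- bind_rows(
  mutate(exp1, experiment = "2019"),
  mutate(exp2, experiment = "2021")
) %>%
  select(experiment, subject, session, trial, choice)  # one row per observation
```

A reusable script like this can be rerun whenever new experiments are added, which is the main payoff of committing to a standardized format.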

For analysis of individual differences, it may be more practical to obtain sufficient sample size by pooling baseline and control data rather than experimental conditions or manipulations, unless they are well-established and highly reliable. We used this approach to identify decision-making patterns in healthy control rats (described in further detail in the “Clustering” section; Vonder Haar, Frankot, et al., 2022a). It is important to remember that pooling data is not sufficient for understanding individual differences. We recommend data pooling as a practical technique for gaining a sufficient sample size to apply analysis techniques that can be used to study individual differences. Such approaches may be necessary to mitigate the replication crisis that has been recognized as problematic in behavioral science for several years (Nuzzo, 2015; Open Science Collaboration, 2015).

Reproducibility

Increased individual-subject variability may improve reproducibility. In an experiment examining mouse strain differences in untrained behaviors, results were more consistent across multiple preclinical experiments when experimental factors (i.e., age of animals, environmental enrichment, cage size, lighting, test time, sound level, experimenter, and pretest handling) were systematically varied than under more standardized designs (Richter et al., 2010). Notably, this consistency was reflected in a decreased likelihood of significant strain differences, likely a reduction of false positive results. Thus, tight control at the cost of heterogeneity does not necessarily enhance reproducibility or causal inference. These findings call into question other reports that different strains of mice respond differently to various drugs, including opiates (Dockstader & van der Kooy, 2001), phencyclidine (PCP; Mouri et al., 2012), and the antidepressant fluoxetine (Gosselin et al., 2017). A focus on limiting all sources of variance may ultimately undermine the goal of reproducing and generalizing to other populations compared to large-N approaches with a more heterogeneous data pool.

Generalizability

Generalizability (both across subjects and between preclinical and clinical research) is a major barrier in behavioral science (Garner, 2014). In behavior analysis, small-N designs may not capture enough of the population to generalize and would routinely fail to capture the behavior of unusual subjects due to their rarity. Behavior analytic methods provide a robust means to assess functional outcomes, which is highly useful for translational science. Some have argued that behavior analysis needs to adopt a stronger translational pipeline (Kyonka & Subramaniam, 2018), which will require considerations for the biases that emerge from the single-subject approach. In nomothetic designs, issues can occur when generalizing from population-level effects to individuals (Fisher et al., 2018). A blended approach using large-N datasets addresses the limitations of both approaches. When coupled with operant methods, a large-N dataset can improve generalization by sampling enough of the population to reflect real-world heterogeneity while still capturing rich within-subject information that is necessary to determine how well population-level effects generalize to the individual.

Applications

Improving reproducibility and generalizability through large-N designs and pooled datasets may address crucial pitfalls in behavioral science. In addition to providing information about fundamental behavior at the individual subject level, generalizable and reproducible results are necessary to treat maladaptive behaviors, as seen in various psychiatric conditions and disease states. For example, in the preclinical literature, subjects can be classified as either sign trackers or goal trackers based on their response to environmental cues. These phenotypes are important for understanding addiction vulnerability and can be recapitulated in human subjects (Colaizzi et al., 2020; Garofalo & di Pellegrino, 2015), further highlighting the need to understand and treat brain-behavior disruptions, such as addiction, at a personalized level. In the field of brain injury, a preclinical consortium assessed various putative therapeutics in three species (mouse, rat, pig) with a set of harmonized behavioral and biological outcomes (Kochanek et al., 2018) at multiple research sites. This approach introduced systematic variability across species and environments and increased confidence in the few drug candidates determined to be effective in multiple models. The approach also generated a large dataset that could be probed to further understand individual variance in response to treatment (Radabaugh et al., 2021). Another relevant application is precision medicine, a medical model that examines a variety of genetic and environmental factors on an individual basis to develop personalized treatments (Carrasco-Ramiro et al., 2017). Precision medicine has identified a number of predictors of poor outcomes yet has yielded minimal treatment successes (Sisodiya, 2021). In contrast, applied behavior analysis already provides highly individualized treatment plans, but these may not generalize to a broader population or a different subpopulation. To address this limitation, applied behavior analysts recently compiled multiple datasets to manually score and identify subtypes of self-injurious behavior (Hagopian et al., 2015) and then evaluated individual predictors of responsiveness to treatment (Hagopian et al., 2018). Other groups have used machine learning techniques to identify responsiveness to behavioral therapy for autism in children (Préfontaine et al., 2022). To further this goal of generalization, consecutive controlled case series designs may be used in applied settings in which standardized procedures are applied or evaluated (Hagopian, 2020). Such blended designs preserve individual subject information and standardize manipulations for larger-scale evaluations.

Analytical Considerations for Individual Subject Variability

The above sections described the historical context and motivation for selecting certain experimental designs, but it is equally critical to consider how to analyze individual differences in a large-N dataset. There are often several options for statistical analysis, and careful test selection is necessary to provide critical information about individual characteristics in addition to population-level differences. Careful selection is also critical to reduce statistical errors that have plagued the fields of behavioral and biomedical science (Burke et al., 2013; Veldkamp et al., 2014). Below, we review several statistical techniques that could be applied in a “blended” design that combines the strengths of the idiographic and nomothetic approaches to account for individual differences.

Correlational Analyses

Perhaps the simplest way to examine individual differences is to correlate predictors (e.g., biomarkers, baseline characteristics) with behavioral performance. In the case of group-level analyses (i.e., means), different underlying relationships within those groups may not be detected if the means are similar. For example, if a biological process affects a behavioral outcome in a positive linear direction for one group, but a negative linear direction for a second group, the means may be equivalent even when the underlying process is fundamentally different. By contrast, correlational analyses can be used to examine experimentally driven manipulations and continuous biological variables. For example, brain injury can produce psychiatric symptoms, such as impulsivity, through various underlying mechanisms. In a rodent model, brain injury increased both inflammatory markers and impulsive choice on the delay discounting task (Vonder Haar et al., 2017), and inflammatory markers positively correlated with impulsive choice, providing potential evidence for the role of inflammation in impulsivity (Fig. 1).

Fig. 1.

Brain Cytokine Levels (via Enzyme-Linked Immunosorbent Assay Analysis) at 10 Weeks after a Traumatic Brain Injury and Impulsivity (via the Delay Discounting Task) at Multiple Time Points were Correlated. Note. Principal component analysis was performed on the cytokine data, and the second component (PC2), which was dominated by the inflammatory marker interleukin-12 (IL-12), was correlated with impulsivity. There was no relationship between inflammation and impulsivity before injury (Panel A), but there was a strong relationship at 2 weeks post-injury (Panel B), which gradually weakened at 4 (Panel C) and 8 (Panel D) weeks post-injury until it was clearly driven by a subset of rats with high levels of PC2 and impulsivity. This provides correlational evidence of a role for inflammation in TBI-induced impulsivity, particularly for a subset of rats. Adapted with permission from Vonder Haar et al. (2017) in the Journal of Neurotrauma

However, simple correlation (e.g., Pearson’s r) cannot model nonlinear relationships or moderating effects between variables, which would require regression analyses. Correlations can also be susceptible to outliers and spurious relationships, particularly when examining two variables that have only been measured but not manipulated. “Big data” (i.e., high numbers of variables and/or observations) may be especially vulnerable to spurious correlations that arise when analyzing massive numbers of measured variables. Several precautions must be taken to avoid spurious positive findings when analyzing a large dataset: cross-validation, dimension reduction, restriction to theoretically motivated relationships, and other methodological considerations (e.g., penalized models; Calude & Longo, 2017; Lamata, 2020). With proper controls, correlational analyses can identify potentially important variables and even be applied to make predictions about individuals using machine learning techniques.
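As a concrete illustration of these precautions, the sketch below computes a simple Pearson correlation on toy data and adds a split-half check that the relationship holds out-of-sample. The variable names echo the inflammation example above, but the data are simulated, not the published values.

```r
# Toy correlational analysis with a split-half cross-validation check.
set.seed(1)
n <- 100
biomarker   <- rnorm(n)
impulsivity <- 0.5 * biomarker + rnorm(n)   # simulated relationship

cor.test(biomarker, impulsivity)            # Pearson's r with CI and p-value

half <- sample(n, n / 2)                    # does r replicate across halves?
cor(biomarker[half],  impulsivity[half])
cor(biomarker[-half], impulsivity[-half])
```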

Machine Learning

Machine learning (ML) is a broader set of tools that encompasses standard statistical approaches like correlation, regression, and splines as well as techniques like data reduction and clustering (see more below). ML algorithms identify patterns in data while allowing flexibility in the relationships observed (e.g., nonlinearities). Because of their flexibility, it is necessary to ensure that ML results are not specific to the sample and will generalize out-of-sample. To establish generalization, a ML algorithm should be trained using a “training set” and then tested for performance on a novel “testing set” (this form of cross-validation is also good practice for any complex statistical model). ML has recently become ubiquitous, with particular relevance for modeling individual differences and making predictions based on those data. For instance, entertainment platforms like Netflix and TikTok use ML algorithms to tailor content to individuals. ML algorithms can improve diagnostic tools for various clinical disorders, including diabetes and cardiovascular disease (Dinh et al., 2019) and are growing rapidly in preclinical use, primarily for tracking behavior (e.g., Forced Swim Task, Open Field Test, Elevated Plus Maze) with equal accuracy to human raters (Sturman et al., 2020). The rich trial-level data produced by behavior analytic experiments may be attractive to ML researchers and could generate algorithms that predict functional outcomes on an individual basis.

One drawback of ML is that algorithms can perpetuate bias when systematic tendencies in the model generate unequal outcomes across subgroups (Friedman & Nissenbaum, 1996). In biomedical and behavioral science, these flaws could endanger patients and perpetuate racial and other forms of bias (Kostick-Quenet et al., 2022), which further highlights the importance of training models on diverse subject pools. In preclinical work, this might apply to species selection (e.g., rat, mouse, pigeon, human), sex (e.g., male vs. female), or individual laboratory idiosyncrasies (e.g., light/dark cycle, extraneous noise). It is important not to overstate the utility of ML: a model is only as useful as the collected data and the care with which the data are modeled. Finally, ML requires sufficient data to effectively train and test a model and may be primarily limited by the amount of data generated. That said, similar concerns arise for tools like regression when they contain many predictors, interactions, or polynomial terms. As with correlational analyses of big data, several precautions must be taken to avoid spurious findings, and cross-validation is a crucial component of ML pipelines for checking model generalization.
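The train/test logic described above can be sketched in a few lines of base R. Logistic regression stands in here for any ML algorithm, and the data are simulated for illustration.

```r
# Train/test split to check out-of-sample generalization of a simple model.
set.seed(2)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$outcome <- rbinom(n, 1, plogis(1.5 * dat$x1 - dat$x2))  # simulated truth

train_idx <- sample(n, 0.8 * n)       # 80% of rows form the training set
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

fit  <- glm(outcome ~ x1 + x2, family = binomial, data = train)
pred <- predict(fit, newdata = test, type = "response") > 0.5
mean(pred == test$outcome)            # out-of-sample (test set) accuracy
```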

Mixed-Effects or Multilevel Analyses

Another method that accounts for individual-subject variability, particularly in a single-subject design, is mixed-effects modeling (also called multilevel, hierarchical, or random effects modeling). These models are naturally attractive because they blend idiographic and nomothetic analyses. In particular, they incorporate “fixed” effects, which have a systematic impact on the outcome variable that is assumed constant across individuals (e.g., group-level effects of an independent variable), and “random” effects, which occur when a variable is randomly sampled from a larger population (e.g., patients, testing sites, stimuli). In behavior analytic research, repeated-measures data (e.g., across trials or sessions) are usually nested within subjects, and mixed models allow regression parameters to vary by subject. A random intercept term allows each subject to differ at baseline (i.e., the value of the dependent variable when the independent variable equals zero), and a random slope term allows each within-subject variable to have different effects for each subject (e.g., the rate of behavioral change across trials, stimulus dimensions, or drug doses). Mixed-effects analyses also mitigate Type I (false positive) and Type II (false negative) errors by handling individual-subject effects within one analysis, thereby reducing the number of tests that must be conducted.

The use of mixed models in behavior analysis is an excellent application of inferential statistics. However, it is common to use mixed models only to control for individual subject variability, rather than to understand it, as demonstrated by low reporting of random effects in published papers (Bono et al., 2021). To better understand natural individual variance, researchers should report their random effects and consider how much variance individual differences account for (e.g., by comparing marginal [fixed effects only] versus conditional [fixed + random effects] variance). An examination of random effects could be used to identify patterns or subgroups and to refine understanding of how individuals vary within a subgroup as opposed to the population. One complication for traditional behavior analysis is that conservative estimates suggest a mixed model requires 20 clusters (e.g., subjects) with 20 observations per cluster (Austin, 2010). The best practice for behavior analysis may often be to use mixed models as a complement to visual inspection (DeHart & Kaplan, 2019) and to report when individual differences have effects that are large or comparable to experimental manipulations.
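A minimal lme4 sketch of this recommendation follows. The data frame `behavior_df` and its columns are assumptions, and the marginal/conditional R2 comparison uses the MuMIn implementation of the Nakagawa and Schielzeth (2013) method cited later in this article.

```r
# Fit a random slope + intercept model and report, not just control for,
# the random effects. "behavior_df" with columns response, dose, and
# subject is a hypothetical dataset.
library(lme4)
library(MuMIn)

m <- lmer(response ~ dose + (dose | subject), data = behavior_df)

ranef(m)            # per-subject deviations in baseline and dose sensitivity
VarCorr(m)          # variance attributable to individual differences
r.squaredGLMM(m)    # R2m (fixed effects only) vs. R2c (fixed + random)
```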

Clustering Techniques

Clustering techniques can ascertain patterns across subjects that might not be apparent using classic inferential statistics. Researchers have used clustering to identify subgroups in the context of various psychiatric conditions, including autism (Scheerer et al., 2021) and depression (Liang et al., 2020), which may help identify groups that present unique symptoms and respond differently to treatment.

K-Means Clustering

K-means clustering uses a distance-based algorithm to partition the data such that each observation belongs to the cluster with the nearest “centroid” (i.e., cluster mean). In autism research, k-means clustering was used to identify five unique sensory phenotypes in 599 autistic children (Fig. 2), which is important for tailoring effective interventions (Scheerer et al., 2021). In a small-N design, it would be impossible to reliably identify these unique subtypes, whereas large-N studies provide avenues for understanding individual differences in risk and resilience factors and response to treatment. However, k-means clustering differentiates between subjects based on mean values alone, which neglects the fact that some populations might be more variable than others. By contrast, latent variable techniques may more effectively capture individual differences by accounting for variance as well.
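A base-R sketch of this procedure is shown below, assuming a hypothetical matrix `profiles` with one row per subject and one column per measure (e.g., percent choice of each option). The within-cluster sum-of-squares plot guides the subjective choice of k noted later in this section.

```r
# K-means clustering on per-subject profiles; "profiles" is a hypothetical
# subjects-by-measures numeric matrix.
scaled <- scale(profiles)    # z-score measures so no one variable dominates

# Inspect within-cluster sum of squares across candidate k ("elbow" method).
wss <- sapply(1:8, function(k)
  kmeans(scaled, centers = k, nstart = 25)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "k", ylab = "Within-cluster SS")

km <- kmeans(scaled, centers = 5, nstart = 25)
table(km$cluster)            # number of subjects assigned to each cluster
```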

Fig. 2.

K-Means Clustering on Data from the Short Sensory Profile, a Well-Validated, 38-Item Measure of Behavioral Responses to Sensory Information, Collected from 599 Autistic Children. Note. The cluster number of five was selected to minimize error variance, which resulted in five unique sensory profiles: (1) sensory adaptive; (2) generalized sensory differences; (3) taste and smell sensitive; (4) underresponsive and sensory seeking; and (5) movement difficulties with low energy. Adapted with permission from Scheerer et al. (2021) in Molecular Autism

Latent Variable Techniques

Latent variable techniques assume that a construct not directly measured explains variance in the data (e.g., stress as a latent variable extrapolated from cortisol, heart rate, and self-report). Mixture modeling (not to be confused with mixed-effects modeling) is a type of discrete latent variable model that creates subgroups within the population while modeling both the mean and variance within a cluster or “class” (Oberski, 2016). Thus, these discrete latent variable techniques bear resemblance to k-means clustering; however, they assume probabilistic membership in each cluster as defined by the cluster distribution (e.g., a normal distribution of a latent trait with a mean and variance for each cluster). This approach results in each subject having a probability profile for membership in each cluster, in contrast to the absolute cluster boundaries derived from k-means and other classic clustering approaches. Different types of mixture models are defined by the variables describing each subject. For example, a latent profile analysis (LPA) assumes that all observed variables are continuous, whereas a latent class analysis (LCA) assumes that all observed variables are categorical (Oberski, 2016). The major advantage of these techniques is that they account for individual differences by assuming that multiple, distinct distributions exist within the population, rather than assuming uniformity across subjects. A recent special issue of Learning and Individual Differences provides a nice set of examples, mostly involving children in school settings (Bray & Dziak, 2018).
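A short sketch of a latent profile analysis via Gaussian mixture modeling is given below, using the mclust package as one common R implementation. The `profiles` matrix is the same hypothetical subjects-by-measures layout as in the k-means example; the choice of mclust (rather than, e.g., a dedicated LPA wrapper) is an assumption of convenience.

```r
# Latent profile analysis as a Gaussian mixture model: each class has its
# own mean and variance, and class membership is probabilistic.
library(mclust)

lpa <- Mclust(profiles, G = 1:6)   # compare 1-6 latent classes by BIC
summary(lpa)                       # best-fitting number of classes
head(lpa$z)                        # per-subject class membership probabilities
```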

Latent variable techniques can identify unique behavioral phenotypes in preclinical behavior analysis. For example, LCA was used to identify two unique profiles on a behavioral model of reward devaluation with 53 subjects. Both classes decreased time spent consuming a sucrose solution when the sucrose concentration was decreased, but one class recovered quickly and the other more slowly (Annicchiarico & Cuenya, 2018). LCA was also used to identify six to seven subtypes of decision-makers from 1,198 adults on the delay discounting task (Gilroy et al., 2022). Both studies highlight the idea that individual-subject variability must be considered to identify unique subgroups. However, one limitation of both clustering and latent-variable techniques is that cluster labels must be generated by the researcher, making these techniques susceptible to a “naming fallacy” where the class name is not an accurate reflection of the class properties (Weller et al., 2020). Moreover, some subjectivity is introduced when deciding the ideal number of clusters, and researchers must be cautious in how that choice is made.

Simulation

The techniques discussed thus far provide methods for assessing individual subject variability in observed data. An alternate goal might be to create individual subject variability under hypothetical conditions using simulations. In economics, simulations are used to predict the risk and reward of financial decisions under hypothetical market conditions (Boyle, 1977). In engineering, simulations allow the assessment of performance under various conditions without running long and expensive experiments (Fishman, 1996). In psychology, simulation has been used extensively to evaluate the accuracy of various statistical tests as well as to predict the effects of different mechanisms on behavior (Busemeyer & Diederich, 2010). In a Monte Carlo simulation to evaluate data analytic techniques, data are generated through repeated sampling from a known probability distribution. This method can be used to evaluate analysis accuracy because the researcher knows the “truth” of the data (i.e., whether the data were sampled from equal or unequal populations with specific magnitudes of difference across populations), which allows for direct assessment of parameter estimation error and, when a decision about equality is necessary, false positive and negative rates across various sample sizes and effect sizes.
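The sketch below illustrates the core Monte Carlo logic with the simplest possible case: two groups sampled from identical populations, so any "significant" t-test is a false positive and the long-run rate should sit near the nominal 5%.

```r
# Monte Carlo estimate of a t-test's false positive rate when the null is true.
set.seed(3)
p_values <- replicate(1000, {
  g1 <- rnorm(20, mean = 0)   # truth: both groups from the same population
  g2 <- rnorm(20, mean = 0)
  t.test(g1, g2)$p.value
})
mean(p_values < .05)          # empirical false positive rate, expected ~ .05
```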

Simulations can provide unique insights into the analysis of behavioral tasks. The Morris Water Maze (MWM) is a measure of spatial memory that presents unique challenges for analysis. The task measures the latency for a rodent to navigate to a hidden platform in an opaque tank of water. Animals are often tested repeatedly, giving the data a nested structure, and there may also be a nonlinear relationship between task experience and latency to find the platform. Subjects may also be removed from the tank if the platform is not located within some specified period of time (e.g., 60 or 90 s), which creates a ceiling effect (more precisely, “censoring”). Thus, there are several potential pitfalls for data analysis. Simulations of MWM data demonstrated that both linear and nonlinear mixed models were more accurate than ANOVA (Young et al., 2009) and that a censored model outperformed a linear model (Young & Hoane, 2021).
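The censoring pitfall is easy to demonstrate in simulation, as sketched below. The group means and cutoff are illustrative assumptions, not parameters from the cited studies.

```r
# How a 60-s cutoff ("censoring") attenuates an apparent group difference.
set.seed(4)
true_ctrl <- rnorm(50, mean = 40, sd = 15)   # hypothetical control latencies
true_tbi  <- rnorm(50, mean = 70, sd = 15)   # hypothetical injured latencies

obs_ctrl <- pmin(true_ctrl, 60)              # latencies capped at the cutoff
obs_tbi  <- pmin(true_tbi, 60)

mean(true_tbi) - mean(true_ctrl)             # true effect (~30 s)
mean(obs_tbi)  - mean(obs_ctrl)              # attenuated effect under censoring
```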

Analytical Applications Using a Concurrent Four-Choice Task

As an example of using a large behavioral dataset in a “blended” approach to address a scientific question, this section examines the unique analytic challenges of the Rodent Gambling Task (RGT), a concurrent four-choice paradigm that measures decision making. Like the MWM, the data have a nested structure and can be analyzed in several different ways (e.g., at the trial level using logistic regression versus at the aggregate level using linear regression). The Vonder Haar lab compiled a publicly available, large-N dataset from five RGT experiments and used it to inspect individual-subject variability through cluster analyses (Vonder Haar, Frankot, et al., 2022a). Thus, we will use this as a practical application to illustrate many of the techniques discussed above. The datasets analyzed during the current study are available on the Open Data Commons for TBI (https://odc-tbi.org/data/703), and code to accompany these analyses can be found at https://github.com/mfrankz/RGT-PoBS and will be mirrored to the Vonder Haar lab GitHub (https://github.com/VonderHaarLab). All analyses were performed using R Statistical Software (https://www.r-project.org/), and graphs were made using the ggplot2 package (Wickham, 2016) in R.

Rodent Gambling Task Background

The RGT is a choice paradigm that parallels a clinical assessment called the Iowa Gambling Task (Bechara et al., 1994) where participants receive monetary or point gains and losses by choosing between four different decks of cards. In the RGT, rats can choose from four nosepoke holes in an operant chamber. Each hole is associated with a different probability and magnitude of reinforcement (gains in the form of sucrose pellets) and punishment (losses in the form of timeout), making a distinct distribution of risk and reward (Fig. 3A). There are four choice outcomes of interest: one optimal choice, two risky choices (with differential reinforcement and punishment), and one suboptimal but relatively safe choice. At the aggregate level, control rats learn to primarily choose the optimal hole (Zeeb & Winstanley, 2013), but TBI rats demonstrate persistent reductions in optimal choice compared to rats with a control surgery (“Sham”; Shaver et al., 2019).

Fig. 3.

Panel A Provides a Schematic of the Rodent Gambling Task (RGT). Note. After initiating a trial, rats chose from any of the four holes. Each hole was associated with a different probability and magnitude of reinforcement and punishment. As a result of varying reinforcement rates, the 1-pellet option (P1) was suboptimal, the 2-pellet option (P2) was optimal, and the 3- and 4-pellet options (P3, P4) were risky. Panel B shows the cluster analysis of RGT data reflecting the distinct choice profiles of each of the five phenotypes and is adapted from Vonder Haar et al. (2022a) in Frontiers in Behavioral Neuroscience. The x- and y-axes show the z-scores for the average choice within a phenotype to distinguish between optimal (green circles), exploratory (blue diamonds), risky (3-pellet option; red triangles), risky (4-pellet option; burgundy inverted triangles), and suboptimal (yellow squares) rats. Panel C shows the prevalence of these behavioral phenotypes across sham rats (i.e., intact) and rats with a traumatic brain injury (TBI) and comes from the same Vonder Haar et al. (2022a) publication. Adapted with permission from Shaver et al. (2019) in Brain Research

Previously Published Analyses

A dataset was compiled (Vonder Haar, Frankot, et al., 2022a) containing pre- and post-injury RGT data at the trial level from five experiments with 151 adult male subjects (n = 71 for TBI; n = 80 for Sham). In the published analysis, and for the current example, pre-injury sessions and data involving experimental manipulations (e.g., drugs) other than brain injury were excluded, which resulted in 109 subjects (n = 58 for TBI; n = 51 for Sham). Only stable data from these subjects (i.e., data collected outside of the initial task-learning phase and acute injury phase) were considered. Individual subjects had vastly different choice profiles, even among intact rats. In a published analysis, k-means clustering identified five decision-making phenotypes (Vonder Haar, Frankot, et al., 2022a) where some rats could be classified as optimizers, whereas others were more exploratory, risk-preferring, or suboptimal (Fig. 3B). This analysis provided both novel and unexpected information: rather than TBI globally reducing optimal choice or increasing risk preference, injured rats were more likely to belong to a nonoptimal category of decision-making phenotype (Fig. 3C). Without this large dataset and a question about individual subjects, this information may have gone unrealized. Other published articles with smaller samples only parsed apart an optimal and suboptimal group (e.g., Di Ciano et al., 2018). Here we present some additional analyses focusing on individual differences.

Mixed-Effects Regression

Based on the clustering data above (Fig. 3B-C), a typical fixed effects regression is not appropriate because it does not model the within-subject dependencies. Furthermore, a fixed effects approach to clustering in which subject and subject-by-predictor interactions are treated as fixed effects (Cohen et al., 2002) introduces a large number of degrees of freedom and undermines the generalizability of the results. Here, we illustrate how random effects can better model these data. A typical fixed effects-only model evaluating percent choice as a function of the four choice options and injury was compared against a mixed model incorporating subject variability in the choice profile (random option slopes + intercept model). This mixed-effects model was chosen because the choice data were fully interdependent (i.e., an increase in percent choice of any one option necessitates a decrease across the others). We did not include a random-intercept-only model because allowing only a single choice (i.e., the reference choice when dummy coding) to vary by subject produces implausible predicted values (e.g., total choices summing to greater than 100%). In the lme4 package of R, the formulas for the two models would look similar to the following (a fitting sketch appears after the formulas):

  • Fixed-Effects Model: Choice~Option*Injury;

  • Mixed-Effects Model: Choice~Option*Injury + (Option|Subject).
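A hedged sketch of fitting and comparing these two models is given below; the data frame `rgt` and its column names are assumptions about the layout of the public dataset.

```r
# Fit the fixed-effects-only and mixed-effects models and compare them with
# a likelihood-ratio test (the chi-square statistic reported below).
library(lme4)

m_fixed <- lm(Choice ~ Option * Injury, data = rgt)
m_mixed <- lmer(Choice ~ Option * Injury + (Option | Subject), data = rgt)

anova(m_mixed, m_fixed)   # lme4 refits with ML for a valid comparison
```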

Figure 4 depicts the individual variance in the data and the individual variance captured by each model. The mixed-effects model was superior to the fixed model (χ2 = 984.52, p < 0.001). Table 1 shows the AIC and the marginal and conditional R2 values for each model. We have included R2 values because, as an absolute goodness-of-fit metric, R2 can be more accessible than relative metrics like AIC and BIC. R2 is less frequently applied to mixed models because it confounds the variance predicted by the predictors with subject variance. However, there is a simple method for dissociating these types of variance by comparing marginal and conditional R2 (Nakagawa & Schielzeth, 2013). The marginal R2, which represents variance explained by the fixed effects only, is roughly equivalent across the two models we have presented. The conditional R2, which represents variance explained by both fixed and random effects, is far superior in the mixed-effects model, which accounted for 90% of the total variance in the data. The fixed effect model predicts behavior well in the aggregate but fails to account for individual subject variability.

Fig. 4.

Raw Data Choice Profiles (Panel A) From Our Large-N Dataset Described in Fig. 3 Compared against Two Predictive Models: A Fixed Effect-Only Model (Panel B) and Random Slope + Intercept Mixed-Effects Model (Panel C). Note. Each line represents an individual subject in Panels A and C. Panel B has a single line because the fixed-effect model predicts the same regression parameters for each subject within a group. Note that line graphs are used to more easily identify individual subjects across categories

Table 1.

Fit metrics for fixed versus random effects modeling of RGT behavior

Model   AIC       R2 (Marginal)   R2 (Conditional)
Fixed   1681.21   0.61            0.61
Mixed   778.87    0.59            0.90

The Akaike information criterion (AIC) represents the relative fit of statistical models for a given set of data. By contrast, R2 is an absolute measure: marginal R2 represents variance explained by the fixed effects only, whereas conditional R2 represents variance explained by both fixed and random effects

Simulation

Because the dataset captured considerable information about subgroups of rats (Fig. 3B-C), it could be used to generate accurate simulations of choice data. These simulations can then inform several aspects of the research process. Statistical considerations (e.g., how powerful a test is) are perhaps the most obvious, given prior studies using similar approaches (Young et al., 2009; Young & Hoane, 2021), but because we can manipulate core variables, we can also generate testable questions for the collection of new data or comparison of existing data. The RGT dataset was used to identify likely population subgroups (or phenotypes). To accurately generate subject-level data, several actions were taken. First, we determined that choice probability would be well described by fitting the two-parameter softmax function to each phenotype. The softmax function, commonly used in machine learning (Sutton & Barto, 1998), takes a vector of input values and translates it into a vector of probabilities (see Eq. 1). In the case of the current data, the inputs are the inherent preferences for the choices within a cluster or individual (“weights”) and a parameter (θ) representing the strength of the preference for the option with the highest weight, where a value of zero indicates equal preference for each option (25%) regardless of the learned weights.

$$P(\mathrm{choice}_j) = \frac{e^{\theta \cdot \mathrm{weight}_j}}{\sum_{i=1}^{n} e^{\theta \cdot \mathrm{weight}_i}} \tag{1}$$

Thus, for each phenotype, we calculated a set of weights (phenotype-level means) for the four choice options (i.e., n = 4 in Eq. 1), and used the nls() function in R to solve for θ. By fitting this to individual subjects, we obtained estimates for between-subject variance (standard deviation) of θ and the weights. By fitting this to individual sessions for subjects, we obtained estimates of within-subject variance of θ and the weights.
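As a sketch of that fitting step, the code below solves for θ with nls() given a fixed set of weights. The data frame has one row per choice option, and the numeric values are illustrative rather than the published phenotype estimates.

```r
# Solve for theta (strength of preference) given phenotype-level weights.
phen <- data.frame(
  weight = c(0.10, 0.60, 0.15, 0.15),   # hypothetical phenotype mean weights
  prop   = c(0.08, 0.70, 0.12, 0.10)    # observed choice proportions
)

fit <- nls(prop ~ exp(theta * weight) / sum(exp(theta * weight)),
           data = phen, start = list(theta = 1))
coef(fit)                               # estimated theta
```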

With these parameters, data were simulated. The output probabilities from the softmax function were used to select from a multinomial distribution and generate a choice on a given trial. Variance was introduced between subjects and between sessions by sampling from the θ and weight values calculated above. Figure 5 provides an example of pseudo-code and shows how unique choice profiles can be generated by changing softmax parameters. To simulate Sham (i.e., intact or control) versus TBI rats, the probability of a simulated rat belonging to a given phenotype was manipulated (according to real data as shown in Fig. 3C). A simulated dataset from Sham and TBI subjects (n = 60 per group, 10 sessions) shows similar aggregate effects (Fig. 6A-D), and the observed individual variability is effectively captured (Fig. 6E-F).
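A compact sketch of that trial-generation step follows. The softmax helper and parameter values are illustrative assumptions meant to mirror the structure in Fig. 5, not the exact published simulation code.

```r
# Generate one simulated session: softmax probabilities feed a categorical
# (multinomial) draw per trial, with subject-level noise on theta and weights.
set.seed(5)
softmax <- function(theta, w) exp(theta * w) / sum(exp(theta * w))

weights <- c(0.10, 0.60, 0.15, 0.15)       # hypothetical phenotype-level means
theta   <- rnorm(1, mean = 3, sd = 0.5)    # between-subject variability
w_subj  <- weights + rnorm(4, sd = 0.02)   # subject-level weight noise

p <- softmax(theta, w_subj)
choices <- sample(c("P1", "P2", "P3", "P4"), size = 250,
                  replace = TRUE, prob = p) # ~250 trials in one session
table(choices) / length(choices)           # realized choice proportions
```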

Fig. 5.

Panel A Shows the Code Structure for Automation of Subject, Session, and Trial Generation. Note. Subjects were assigned an ID and a phenotype. Then, population-level weights and θ (green text) were assigned to that subject based on their phenotype. The weights and θ were varied for each subject (red text) and for each session (blue text). On any given trial, the weights and θ were passed through the softmax function, which generated probabilities to define a multinomial distribution. Sampling from within that distribution allowed the generation of a discrete choice (either P1, P2, P3, or P4) for each trial. This was repeated for the assigned number of trials, and then a new session was generated until 10 sessions were complete. Panel B shows how changes to softmax parameters were used to generate unique choice profiles. For example, when θ is decreased from 3 to 1, the simulated rat becomes more exploratory (i.e., has a flatter choice profile or a lower strength of preference) despite having the same preference values (i.e., weights) among choices

Simulation Insights and Remaining Questions

This simulation is an example of how a large-N dataset can be used to model complex operant behaviors. The large sample was necessary to capture and recapitulate individual differences. We plotted individual choice profiles (Fig. 6E-F) and reported random effects for a linear mixed model (Fig. 6G-H), which demonstrate the same pattern as the observed data: choice profiles vary by subject, which heavily influences statistical analysis. Although our simulation generally recapitulated observed phenotypes, one remaining question is why, within a phenotype, TBI and Sham rats behave slightly differently and show more variance in our observed data. In the observed data (Fig. 6C), there is more between-subject variability compared to simulated data (Fig. 6D). Thus, some aspect of the injury effect remains unknown. With additional data collection and techniques such as these, we may come to understand those nuances as well.

Fig. 6.

Simulated RGT Data Compared against Observed RGT Data. Panel A Shows Observed Choice of Each Option (P1, P2, P3, and P4) for Sham (i.e., Control; Black) versus TBI (Red) Rats. Note. Panel B shows simulated choice of each option for Sham versus TBI rats. Data shown in Panels A and B are mean + SEM. Panel C and Panel D show individual (points) and average (lines) choice of each option faceted by phenotype for observed versus simulated data, respectively. Simulated TBI rats are slightly less variable than real data. Panel E and Panel F show data at the individual subject level within the exploratory phenotype. These subjects all displayed some degree of preference for the optimal choice (option 2 on the x-axis) but varied in the degree to which they explored among the other options. Subjects are color-coded to represent subtypes within the phenotype that were visually identified in four exemplar subjects in the observed data (Panel E) and recapitulated in the simulation (Panel F). The final panels show the random effects for a linear mixed-effects regression predicting percent choice of the four outcomes on the RGT for a subset of observed (Panel G) and simulated data (Panel H). Violin plots for Sham (i.e., intact) rats are depicted in grey and TBI rats in red. Points represent individual subjects, and the y-axis represents the standardized value of the regression parameter for each subject

We recently used this simulated RGT data to evaluate statistical methodology by generating many datasets for assessment (i.e., 1,000 datasets at various sample sizes and effect sizes). Using this method, we found that traditional linear modeling drastically inflated false positives for these interdependent data and that accounting for dependencies (through random effect structures or generalized linear models) corrected false positive rates but left the analyses underpowered. We were ultimately able to bolster statistical power without exacerbating false positives by analyzing the data with a Bayesian mixed model (Frankot et al., 2023).

In addition to evaluating the accuracy of statistical techniques, simulation parameters can also be manipulated to study other biological and/or environmental conditions beyond brain injury that generalize more broadly to behavioral science (Fig. 7). For example, we simulated the addition of complex audiovisual cues paired with “winning” RGT options, an environmental manipulation that, in rodents, shifted preference toward risk in the aggregate (Barrus & Winstanley, 2016). Using only aggregate data (rather than raw subject data), we evaluated whether changes in individual phenotypes could recapitulate the effect of such cues. By shifting the distribution of behavioral phenotypes in our simulation away from the optimal phenotype and toward the risky ones, we successfully modeled this effect in the aggregate (Fig. 7B). These results suggest that adjustments in the proportions of choice phenotypes can model behavioral manipulations in addition to physiological manipulations (e.g., brain injury). These data derived from the aggregate closely map onto data derived from individual subjects reported in our recent article (Frankot et al., 2023). It should be noted that these insights into RGT behavior were made possible by pooling operant data from multiple experiments in a “blended” approach and applying statistical techniques that provide insight into subject-level tendencies.

Fig. 7.

Choice Profiles from Simulated Datasets with N = 60 Subjects in each Condition. The Bars are Color-Coded to Represent the Choice Options on the RGT. Note. Panel A shows a comparison of simulated control data versus other magnitudes of deficits on the RGT. The first condition (control) on the x-axis was generated by recapitulating the prevalence of k-means phenotypes from our data on intact (i.e., Sham) rats. The next condition (TBI) modeled the observed injury deficit (and is more fully described in Fig. 6), where there was a reduction in the optimal phenotype and redistribution to other phenotypes. The next two conditions show that we can model the effects of a smaller and/or larger TBI effect. Panel B shows real (Cued, Uncued groups on the x-axis, adapted from Barrus and Winstanley (2016) in the Journal of Neuroscience) and simulated (Uncued-SIM, Cued-SIM) data for an environmental manipulation that shifts preference exclusively toward risk, rather than simply reducing the prevalence of the optimal phenotype and redistributing evenly across the other phenotypes. Our simulated data recapitulate the observed effect of cues

Conclusions

The single-subject design allows for thorough examination of behavioral changes within an individual but is less conducive to understanding differences across individuals, which may limit the utility of behavior-analytic work. In particular, small-N designs lack information about individual subject variability, which is crucial for reproducibility and generalizability. Analysis of large-N datasets through techniques including, but not limited to, correlational analysis, machine learning, mixed-effects modeling, clustering or mixture modeling, and simulation can help address this gap in behavior-analytic research. We believe these techniques can provide incredibly valuable information about individual-subject variability when applied to the rich, longitudinal datasets that are typical of behavior analysis.

Although practical constraints often limit sample size, larger samples can often be achieved by pooling data across multiple experiments. Some labs may have the requisite historical data to accomplish this on their own. However, a stronger solution is to make data publicly available, which allows smaller or newer labs to contribute and contextualize their research. Data must be compiled carefully to maximize accessibility using FAIR principles (see Logistics, above). To date, there is no behavior analysis-specific repository, which may impede adoption; this is an area where larger organizations such as the Association for Behavior Analysis International could assist in moving the field forward. In the meantime, the Center for Open Science has established a general repository (https://osf.io) that some labs already use, while others use services such as GitHub to obtain a DOI and make datasets findable (Gilroy & Kaplan, 2019). We urge behavior analysts to do their part in addressing reproducibility and generalizability concerns by examining individual subject heterogeneity in their own research and by contributing to open science through making data publicly available.

Acknowledgements

We thank the numerous researchers who helped collect the original data used to support the analyses and simulations in this article.

Funding

This work was supported by the NINDS (R01-NS1109).

Data Availability

The datasets generated and analyzed in the current study are available on the Open Data Commons for Traumatic Brain Injury: https://odc-tbi.org/data/703. The R code used to perform statistical analyses is provided on GitHub: https://github.com/VonderHaarLab.

Declarations

Conflict of Interest

The authors have no competing interests to declare that are relevant to the content of this article.

Footnotes

The original version of this article has been updated to correctly display the given and family names of author Cole Vonder Haar.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Annicchiarico, I., & Cuenya, L. (2018). Two profiles in the recovery of reward devaluation in rats: Latent class growth analysis. Neuroscience Letters, 684, 104–108. 10.1016/j.neulet.2018.07.013
2. Austin, P. C. (2010). Estimating multilevel logistic regression models when the number of clusters is low: A comparison of different statistical software procedures. International Journal of Biostatistics, 6(1), 16. 10.2202/1557-4679.1195
3. Baer, D. M. (1976). The organism as host. Human Development, 19(2), 87–98. 10.1159/000271519
4. Bangdiwala, S. I., Bhargava, A., O'Connor, D. P., Robinson, T. N., Michie, S., Murray, D. M., & Pratt, C. A. (2016). Statistical methodologies to pool across multiple intervention studies. Translational Behavioral Medicine, 6(2), 228–235. 10.1007/s13142-016-0386-8
5. Baron, A. (1999). Statistical inference in behavior analysis: Friend or foe? The Behavior Analyst, 22(2), 83–85. 10.1007/BF03391983
6. Barrus, M. M., & Winstanley, C. A. (2016). Dopamine D3 receptors modulate the ability of win-paired cues to increase risky choice in a rat gambling task. Journal of Neuroscience, 36(3), 785–794. 10.1523/jneurosci.2225-15.2016
7. Baum, W. M. (1979). Matching, undermatching, and overmatching in studies of choice. Journal of the Experimental Analysis of Behavior, 32(2), 269–281. 10.1901/jeab.1979.32-269
8. Bechara, A., Damasio, A. R., Damasio, H., & Anderson, S. W. (1994). Insensitivity to future consequences following damage to human prefrontal cortex. Cognition, 50(1–3), 7–15. 10.1016/0010-0277(94)90018-3
9. Beltz, A. M., Wright, A. G., Sprague, B. N., & Molenaar, P. C. (2016). Bridging the nomothetic and idiographic approaches to the analysis of clinical data. Assessment, 23(4), 447–458. 10.1177/1073191116648209
10. Bono, R., Alarcón, R., & Blanca, M. J. (2021). Report quality of generalized linear mixed models in psychology: A systematic review. Frontiers in Psychology, 12. 10.3389/fpsyg.2021.666182
11. Boyle, P. P. (1977). Options: A Monte Carlo approach. Journal of Financial Economics, 4(3), 323–338. 10.1016/0304-405X(77)90005-8
12. Bray, B. C., & Dziak, J. J. (2018). Commentary on latent class, latent profile, and latent transition analysis for characterizing individual differences in learning. Learning & Individual Differences, 66, 105–110. 10.1016/j.lindif.2018.06.001
13. Burke, D. A., Whittemore, S. R., & Magnuson, D. S. K. (2013). Consequences of common data analysis inaccuracies in CNS trauma injury basic research. Journal of Neurotrauma, 30(10), 797–805. 10.1089/neu.2012.2704
14. Burroughs, V. J., Maxey, R. W., & Levy, R. A. (2002). Racial and ethnic differences in response to medicines: Towards individualized pharmaceutical treatment. Journal of the National Medical Association, 94(10 Suppl), 1–26.
15. Busemeyer, J. R., & Diederich, A. (2010). Cognitive modeling. SAGE Publications.
16. Calude, C. S., & Longo, G. (2017). The deluge of spurious correlations in big data. Foundations of Science, 22(3), 595–612. 10.1007/s10699-016-9489-4
17. Carrasco-Ramiro, F., Peiró-Pastor, R., & Aguado, B. (2017). Human genomics projects and precision medicine. Gene Therapy, 24(9), 551–561. 10.1038/gt.2017.77
18. Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2002). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Routledge.
19. Colaizzi, J. M., Flagel, S. B., Joyner, M. A., Gearhardt, A. N., Stewart, J. L., & Paulus, M. P. (2020). Mapping sign-tracking and goal-tracking onto human behaviors. Neuroscience & Biobehavioral Reviews, 111, 84–94. 10.1016/j.neubiorev.2020.01.018
20. DeHart, W. B., & Kaplan, B. A. (2019). Applying mixed-effects modeling to single-subject designs: An introduction. Journal of the Experimental Analysis of Behavior, 111(2), 192–206. 10.1002/jeab.507
21. Di Ciano, P., Manvich, D. F., Pushparaj, A., Gappasov, A., Hess, E. J., Weinshenker, D., & Le Foll, B. (2018). Effects of disulfiram on choice behavior in a rodent gambling task: Association with catecholamine levels. Psychopharmacology, 235(1), 23–35. 10.1007/s00213-017-4744-0
22. Dinh, A., Miertschin, S., Young, A., & Mohanty, S. D. (2019). A data-driven approach to predicting diabetes and cardiovascular disease with machine learning. BMC Medical Informatics & Decision Making, 19(1), 211. 10.1186/s12911-019-0918-5
23. Dockstader, C. L., & van der Kooy, D. (2001). Mouse strain differences in opiate reward learning are explained by differences in anxiety, not reward or learning. Journal of Neuroscience, 21(22), 9077. 10.1523/JNEUROSCI.21-22-09077.2001
24. Dorsey, M. F., Weinberg, M., Zane, T., & Guidi, M. M. (2009). The case for licensure of applied behavior analysts. Behavior Analysis in Practice, 2(1), 53–58. 10.1007/bf03391738
25. Ellis, S. E., & Leek, J. T. (2018). How to share data for collaboration. The American Statistician, 72(1), 53–57. 10.1080/00031305.2017.1375987
26. Eysenck, H. J. (1984). The place of individual differences in a scientific psychology. In J. R. Royce & L. P. Mos (Eds.), Annals of theoretical psychology (Vol. 1, pp. 233–285). Springer.
27. Fisher, A. J., Medaglia, J. D., & Jeronimus, B. F. (2018). Lack of group-to-individual generalizability is a threat to human subjects research. Proceedings of the National Academy of Sciences of the United States of America, 115(27), E6106–E6115. 10.1073/pnas.1711978115
28. Fishman, G. S. (1996). Monte Carlo: Concepts, algorithms, and applications. Springer. 10.1007/978-1-4757-2553-7
29. Frankot, M., Mueller, P. M., Young, M. E., & Vonder Haar, C. (2023). Statistical power and false positive rates for interdependent outcomes are strongly influenced by test type: Implications for behavioral neuroscience. Neuropsychopharmacology. 10.1038/s41386-023-01592-6
30. Friedman, B., & Nissenbaum, H. (1996). Bias in computer systems. ACM Transactions on Information Systems, 14, 330–347. 10.1145/230538.230561
31. Garner, J. P. (2014). The significance of meaning: Why do over 90% of behavioral neuroscience results fail to translate to humans, and what can we do to fix it? ILAR Journal, 55(3), 438–456. 10.1093/ilar/ilu047
32. Garofalo, S., & di Pellegrino, G. (2015). Individual differences in the influence of task-irrelevant Pavlovian cues on human behavior. Frontiers in Behavioral Neuroscience, 9, 163. 10.3389/fnbeh.2015.00163
33. Gilroy, S. P., & Kaplan, B. A. (2019). Furthering open science in behavior analysis: An introduction and tutorial for using GitHub in research. Perspectives on Behavior Science, 42(3), 565–581. 10.1007/s40614-019-00202-5
34. Gilroy, S. P., Strickland, J. C., Naudé, G. P., Johnson, M. W., Amlung, M., & Reed, D. D. (2022). Beyond systematic and unsystematic responding: Latent class mixture models to characterize response patterns in discounting research. Frontiers in Behavioral Neuroscience, 16. 10.3389/fnbeh.2022.806944
35. Gosselin, T., Le Guisquet, A. M., Brizard, B., Hommet, C., Minier, F., & Belzung, C. (2017). Fluoxetine induces paradoxical effects in C57BL6/J mice: Comparison with BALB/c mice. Behavioural Pharmacology, 28(6), 466–476. 10.1097/fbp.0000000000000321
36. Guthrie, E. R., & Horton, G. P. (1946). Cats in a puzzle box. Rinehart.
37. Hagopian, L. P. (2020). The consecutive controlled case series: Design, data-analytics, and reporting methods supporting the study of generality. Journal of Applied Behavior Analysis, 53(2), 596–619. 10.1002/jaba.691
38. Hagopian, L. P., Rooker, G. W., & Yenokyan, G. (2018). Identifying predictive behavioral markers: A demonstration using automatically reinforced self-injurious behavior. Journal of Applied Behavior Analysis, 51(3), 443–465. 10.1002/jaba.477
39. Hagopian, L. P., Rooker, G. W., & Zarcone, J. R. (2015). Delineating subtypes of self-injurious behavior maintained by automatic reinforcement. Journal of Applied Behavior Analysis, 48(3), 523–543. 10.1002/jaba.236
40. Hamaker, E., Dolan, C., & Molenaar, P. (2005). Statistical modeling of the individual: Rationale and application of multivariate stationary time series analysis. Multivariate Behavioral Research, 40, 207–233. 10.1207/s15327906mbr4002_3
41. Hoekstra, R., Kiers, H. A., & Johnson, A. (2012). Are assumptions of well-known statistical techniques checked, and why (not)? Frontiers in Psychology, 3, 137. 10.3389/fpsyg.2012.00137
42. Hudson, K., Lifton, R., & Patrick-Lake, B. (2015). The precision medicine initiative cohort program: Building a research foundation for 21st century medicine.
43. Kochanek, P. M., Dixon, C. E., Mondello, S., Wang, K. K. K., Lafrenaye, A., Bramlett, H. M., Dietrich, W. D., Hayes, R. L., Shear, D. A., Gilsdorf, J. S., Catania, M., Poloyac, S. M., Empey, P. E., Jackson, T. C., & Povlishock, J. T. (2018). Multi-center pre-clinical consortia to enhance translation of therapies and biomarkers for traumatic brain injury: Operation Brain Trauma Therapy and beyond. Frontiers in Neurology, 9, 640. 10.3389/fneur.2018.00640
44. Kostick-Quenet, K. M., Cohen, I. G., Gerke, S., Lo, B., Antaki, J., Movahedi, F., Njah, H., Schoen, L., Estep, J. E., & Blumenthal-Barby, J. S. (2022). Mitigating racial bias in machine learning. Journal of Law, Medicine & Ethics, 50(1), 92–100. 10.1017/jme.2022.13
45. Kyonka, E. G. E., Mitchell, S. H., & Bizo, L. A. (2019). Beyond inference by eye: Statistical and graphing practices in JEAB, 1992–2017. Journal of the Experimental Analysis of Behavior, 111(2), 155–165. 10.1002/jeab.509
46. Kyonka, E. G. E., & Subramaniam, S. (2018). Translating behavior analysis: A spectrum rather than a road map. Perspectives on Behavior Science, 41(2), 591–613. 10.1007/s40614-018-0145-x
47. Lamata, P. (2020). Avoiding big data pitfalls. Heart & Metabolism: Management of the Coronary Patient, 82, 33–35. 10.31887/hm.2020.82/plamata
48. Liang, S., Deng, W., Li, X., Greenshaw, A. J., Wang, Q., Li, M., & Li, T. (2020). Biotypes of major depressive disorder: Neuroimaging evidence from resting-state default mode network patterns. NeuroImage: Clinical, 28, 102514. 10.1016/j.nicl.2020.102514
49. Michael, J. (1974). Statistical inference for individual organism research: Mixed blessing or curse? Journal of Applied Behavior Analysis, 7(4), 647–653. 10.1901/jaba.1974.7-647
50. Molenaar, P. (2004). A manifesto on psychology as idiographic science: Bringing the person back into scientific psychology, this time forever. Measurement: Interdisciplinary Research & Perspectives, 2, 201–218. 10.1207/s15366359mea0204_1
51. Mouri, A., Koseki, T., Narusawa, S., Niwa, M., Mamiya, T., Kano, S., Sawa, A., & Nabeshima, T. (2012). Mouse strain differences in phencyclidine-induced behavioural changes. International Journal of Neuropsychopharmacology, 15(6), 767–779. 10.1017/s146114571100085x
52. Nakagawa, K., & Kajiwara, A. (2015). Female sex as a risk factor for adverse drug reactions. Nihon Rinsho, 73(4), 581–585.
53. Nakagawa, S., & Schielzeth, H. (2013). A general and simple method for obtaining R² from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4(2), 133–142. 10.1111/j.2041-210x.2012.00261.x
54. National Institutes of Health. (2022). All of Us research program. https://allofus.nih.gov/
55. NIH Precision Medicine Initiative Working Group. (2015). The precision medicine initiative cohort program – building a research foundation for 21st century medicine. https://acd.od.nih.gov/documents/reports/DRAFT-PMI-WGReport-9-11-2015-508.pdf
56. Normand, M. P., & Kohn, C. S. (2013). Don't wag the dog: Extending the reach of applied behavior analysis. The Behavior Analyst, 36(1), 109–122. 10.1007/bf03392294
57. Nuzzo, R. (2015). How scientists fool themselves—And how they can stop. Nature, 526(7572), 182–185. 10.1038/526182a
58. Oberski, D. (2016). Mixture models: Latent profile and latent class analysis. In J. Robertson & M. Kaptein (Eds.), Modern statistical methods for HCI (pp. 275–287). Springer International Publishing.
59. Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. 10.1126/science.aac4716
60. Préfontaine, I., Lanovaz, M. J., & Rivard, M. (2022). Brief report: Machine learning for estimating prognosis of children with autism receiving early behavioral intervention—A proof of concept. Journal of Autism & Developmental Disorders. 10.1007/s10803-022-05641-9
61. Radabaugh, H., Bonnell, J., Schwartz, O., Sarkar, D., Dietrich, W. D., & Bramlett, H. M. (2021). Use of machine learning to re-assess patterns of multivariate functional recovery after fluid percussion injury: Operation Brain Trauma Therapy. Journal of Neurotrauma, 38(12), 1670–1678. 10.1089/neu.2020.7357
62. Revusky, S. H. (1967). Some statistical treatments compatible with individual organism methodology. Journal of the Experimental Analysis of Behavior, 10(3), 319–330. 10.1901/jeab.1967.10-319
63. Richter, S. H., Garner, J. P., Auer, C., Kunert, J., & Würbel, H. (2010). Systematic variation improves reproducibility of animal experiments. Nature Methods, 7(3), 167–168. 10.1038/nmeth0310-167
64. Robinson, W. S. (1950). Ecological correlations and the behavior of individuals. American Sociological Review, 15, 351–357. 10.2307/2087176
65. Scheerer, N. E., Curcin, K., Stojanoski, B., Anagnostou, E., Nicolson, R., Kelley, E., Georgiades, S., Liu, X., & Stevenson, R. A. (2021). Exploring sensory phenotypes in autism spectrum disorder. Molecular Autism, 12(1), 67. 10.1186/s13229-021-00471-5
66. Shaver, T. K., Ozga, J. E., Zhu, B., Anderson, K. G., Martens, K. M., & Vonder Haar, C. (2019). Long-term deficits in risky decision-making after traumatic brain injury on a rat analog of the Iowa gambling task. Brain Research, 1704, 103–113. 10.1016/j.brainres.2018.10.004
67. Shull, R. L. (1999). Statistical inference in behavior analysis: Discussant's remarks. The Behavior Analyst, 22(2), 117–121. 10.1007/BF03391989
68. Sidman, M. (1960). Tactics of scientific research: Evaluating experimental data in psychology. Basic Books.
69. Sisodiya, S. M. (2021). Precision medicine and therapies of the future. Epilepsia, 62(S2), S90–S105. 10.1111/epi.16539
70. Skinner, B. F. (1938). The behavior of organisms: An experimental analysis. Appleton-Century.
71. Sturman, O., von Ziegler, L., Schläppi, C., Akyol, F., Privitera, M., Slominski, D., Grimm, C., Thieren, L., Zerbi, V., Grewe, B., & Bohacek, J. (2020). Deep learning-based behavioral analysis reaches human accuracy and is capable of outperforming commercial solutions. Neuropsychopharmacology, 45(11), 1942–1952. 10.1038/s41386-020-0776-y
72. Sutton, N. P., Grace, R. C., McLean, A. P., & Baum, W. M. (2008). Comparing the generalized matching law and contingency discriminability model as accounts of concurrent schedule performance using residual meta-analysis. Behavioural Processes, 78(2), 224–230. 10.1016/j.beproc.2008.02.012
73. Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning. MIT Press.
74. Todes, D. P. (2014). Ivan Pavlov: A Russian life in science. Oxford University Press.
75. Vanderveldt, A., Oliveira, L., & Green, L. (2016). Delay discounting: Pigeon, rat, human—Does it matter? Journal of Experimental Psychology: Animal Learning & Cognition, 42(2), 141.
76. Veldkamp, C. L., Nuijten, M. B., Dominguez-Alvarez, L., van Assen, M. A., & Wicherts, J. M. (2014). Statistical reporting errors and collaboration on statistical analyses in psychological science. PLoS One, 9(12), e114876. 10.1371/journal.pone.0114876
77. Vonder Haar, C., Frankot, M., Reck, A., Milleson, V., & Martens, K. (2022a). Large-N rat data enables phenotyping of risky decision-making: A retrospective analysis of brain injury on the rodent gambling task. Frontiers in Behavioral Neuroscience, 16. 10.3389/fnbeh.2022.837654
78. Vonder Haar, C., Martens, K. M., Riparip, L. K., Rosi, S., Wellington, C. L., & Winstanley, C. A. (2017). Frontal traumatic brain injury increases impulsive decision making in rats: A potential role for the inflammatory cytokine interleukin-12. Journal of Neurotrauma, 34(19), 2790–2800. 10.1089/neu.2016.4813
79. Vonder Haar, C., Martens, K. M., & Frankot, M. A. (2022b). Combined dataset of rodent gambling task in rats after brain injury.
80. Weller, B. E., Bowen, N. K., & Faubert, S. J. (2020). Latent class analysis: A guide to best practice. Journal of Black Psychology, 46(4), 287–311. 10.1177/0095798420930932
81. Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(10), 1–23. 10.18637/jss.v059.i10
82. Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer International.
83. Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J., Appleton, G., Axton, M., Baak, A., & Mons, B. (2016). The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3, 160018. 10.1038/sdata.2016.18
84. Wixted, J. T., & Ebbesen, E. B. (1997). Genuine power curves in forgetting: A quantitative analysis of individual subject forgetting functions. Memory & Cognition, 25(5), 731–739. 10.3758/bf03211316
85. Young, M. (2017a). A place for statistics in behavior analysis. Behavior Analysis: Research & Practice, 17, 1. 10.1037/bar0000099
86. Young, M. E. (2017b). Discounting: A practical guide to multilevel analysis of indifference data. Journal of the Experimental Analysis of Behavior, 108(1), 97–112. 10.1002/jeab.265
87. Young, M. E. (2019). Modern statistical practices in the experimental analysis of behavior: An introduction to the special issue. Journal of the Experimental Analysis of Behavior, 111(2), 149–154. 10.1002/jeab.511
88. Young, M. E., Clark, M. H., Goffus, A., & Hoane, M. R. (2009). Mixed effects modeling of Morris water maze data: Advantages and cautionary notes. Learning & Motivation, 40(2), 160–177. 10.1016/j.lmot.2008.10.004
89. Young, M. E., & Hoane, M. R. (2021). Mixed effects modeling of Morris water maze data revisited: Bayesian censored regression. Learning & Behavior, 49(3), 307–320. 10.3758/s13420-020-00457-y
90. Zeeb, F. D., & Winstanley, C. A. (2013). Functional disconnection of the orbitofrontal cortex and basolateral amygdala impairs acquisition of a rat gambling task and disrupts animals' ability to alter decision-making behavior after reinforcer devaluation. Journal of Neuroscience, 33(15), 6434–6443. 10.1523/jneurosci.3971-12.2013
91. Zimmermann, Z. J., Watkins, E. E., & Poling, A. (2015). JEAB research over time: Species used, experimental designs, statistical analyses, and sex of subjects. The Behavior Analyst, 38(2), 203–218. 10.1007/s40614-015-0034-5
