Abstract
Bayesian statistics provides an effective, reliable approach for research with small clinical samples and yields clinically meaningful results that can bridge research and practice. This tutorial demonstrates how Bayesian statistics can be effectively and reliably implemented with a small, heterogeneous participant sample to promote reproducible and clinically relevant research. We tested example research questions pertaining to language and clinical features in autism spectrum disorder (ASD; n = 20), a condition characterized by significant heterogeneity. We provide step-by-step instructions and visualizations detailing how to (1) identify and develop prior distributions from the literature base, (2) evaluate model convergence and reliability, and (3) compare models with different prior distributions to select the best performing model. Moreover, in step three, we demonstrate how to determine whether a sample size is sufficient for reliably interpreting model results. We also provide instructions detailing how to examine results with varied bounds of clinical interest, such as the probability that an effect will reflect at least one standard deviation change in scores on a standardized assessment. This information facilitates generalization and application of Bayesian results to a variety of clinical research questions and settings. The tutorial concludes with suggestions for future clinical research, ensuring the utility of our step-by-step instructions for a broad clinical audience.
Keywords: Bayesian Statistics, Reproducibility, Clinical Research, Autism Spectrum Disorder
Replication Crisis and Clinical Research
The replication crisis in behavioral science refers to the frequent failure to reproduce results when studies are repeated: statistically significant findings are often not observed again in replication attempts (e.g., Camerer et al., 2018). Bayesian statistics is an underutilized analytical approach that can address replication issues and offers three particular advantages in clinical research (Brydges & Gaeta, 2019b). First, the conceptual underpinnings of Bayesian statistics align with researcher intuitions regarding the scientific process (McMillan & Cannon, 2019; Winkler, 2001). Second, Bayesian statistics allows the researcher to draw upon a cumulative evidence base, combined with new data, to make clinically meaningful interpretations and decisions (Brydges & Gaeta, 2019b; McMillan & Cannon, 2019; Oleson et al., 2019). Third, Bayesian statistics can be effectively implemented for small, heterogeneous participant samples, such as those frequently observed in clinical research (Kay et al., 2016; van de Schoot et al., 2014). In this tutorial, we test example research questions related to autism and language to provide a step-by-step illustration of how to implement Bayesian statistics for clinical research. We begin with background information on the Bayesian framework and what this framework offers clinical research, and we end with a brief discussion that provides additional suggestions and interpretation to facilitate generalization to a variety of clinical research contexts.
The Bayesian Framework
Probability Concepts
Bayesian statistics is founded on the idea of epistemic probability, which expresses prior knowledge about a given effect (e.g., the presence of a group difference) based on available information before collecting new data (Kaplan, 2014; see also new edition Kaplan, in press). Prior knowledge involves prior research and expert opinion. When the cumulative evidence base lacks relevant or rigorous research about effects of interest, as in exploratory studies, prior knowledge is considered “weakly informative” or “noninformative.” In the case of weakly-informative prior knowledge, there may be sufficient evidence to suggest a nonzero effect of interest, for instance, but insufficient evidence to suggest a more precise hypothesis. In the case of noninformative prior knowledge, there is insufficient evidence on which to base a hypothesis. When the cumulative evidence base involves highly relevant and rigorous prior research, as in multiple meta-analyses, this prior knowledge is considered “informative.” Expert opinion also factors into the informativeness of prior knowledge. Similar to when researchers develop testable hypotheses, expert opinion involves synthesizing the evidence base to formulate a hypothesis about prior knowledge. For instance, if the expert does not incorporate a relevant set of studies into a synthesis and hypothesis, prior knowledge will be less informative than if all relevant prior studies are represented (see Winkler, 2001 for discussion). Prior knowledge that reflects opposing hypotheses may be compared to determine which prior knowledge specification better fits new data. In other words, different hypotheses about prior knowledge can be compared using statistical model comparison metrics, like the Bayes factor, deviance information criterion, or leave-one-out cross validation information criterion (Brydges & Gaeta, 2019b; Kaplan, 2021; see Methods for further information). These metrics show which prior knowledge specification may better fit the data and which prior knowledge specification may best predict new data. Together, prior research and expert opinion yield assumptions about a given effect, such as whether the effect is likely to be positive, normally distributed, or large in magnitude, and model comparison metrics may be used to guide selection of different assumptions for further analysis.
Epistemic probability involves updating prior knowledge (that is, prior research and/or expert opinion) with new data by applying Bayes’ Theorem: the probability of an effect based on data relevant to that effect (Figure 1; van de Schoot et al., 2014). In Figure 1, P(E) is the prior distribution, which reflects prior knowledge and assumptions about the effect, P(D|E) is the likelihood of the new data given the hypothesized effect, and P(D) is the marginal (that is, unconditional) probability of the new data across all possible values of an effect. The term P(E|D) is the posterior distribution of the effect given the data, which represents updated knowledge about the effect in the form of a distribution of parameter values. For example, a clinical researcher may hypothesize that performance on an experimental language task differs in an autism spectrum disorder (ASD) versus neurotypical (NT) group. The researcher would draw on prior research that has examined ASD versus NT performance on similar language tasks, and then use this information along with the researcher’s expert opinion (e.g., knowledge of experimental paradigms, population characteristics, clinical experience, etc.) to develop a hypothesized value for the group effect and the variation around the hypothesized group effect. Hypothesized values are then updated or revised with new data by applying Bayes’ Theorem.
Figure 1.

Bayes’ theorem.
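In the notation used above, Bayes’ theorem states that

$$P(E \mid D) = \frac{P(D \mid E)\,P(E)}{P(D)}$$

that is, the posterior distribution P(E|D) is proportional to the likelihood of the data multiplied by the prior distribution.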
The Bayesian approach is conceptually aligned with how clinical researchers engage in the scientific process and with researcher intuition (Gigerenzer & Ulrich, 1995; McMillan & Cannon, 2019; Winkler, 2001). Clinical researchers often act as Bayesians, using available information and building upon prior work to test clinically meaningful hypotheses and to advance current knowledge. They make Bayesian-like predictions based on their findings, clinical experience, and study limitations, reporting on the likelihood that an effect will be observed in new data and in clinical practice. This process is formalized and made transparent in Bayesian statistical approaches. Importantly, the intuitive nature of Bayesian statistics minimizes inaccurate applications of statistics, thereby promoting accurate and reproducible findings (McMillan & Cannon, 2019). The alignment between the Bayesian framework and goals of clinical research – producing reliable and useful information – can also connect research and practice.
Statistical Results
Bayesian statistical results provide the full posterior distribution of values for an effect, rather than a single value that is subjected to a binary significant/non-significant decision. When a prior distribution is combined with new data via Bayes’ Theorem, the result is a posterior distribution of values for the parameters of interest (Brydges & Gaeta, 2019a), denoted as P(E|D) in Figure 1. One summary measure which can be obtained from this posterior distribution is the posterior probability interval (PPI, sometimes referred to as the credible interval). The PPI describes the probability that an effect lies within a numerical interval. A PPI of the relationship between language and social skills in ASD, for example, will be an interval that contains the true value of this relationship at a given probability level, such as the 95% level. This PPI will have lower and upper thresholds which may be used to guide decisions about the relative importance of the relationship between language and social skills in ASD.
Taking this example further, a hypothetical 95% PPI of [−1,15] describing the relationship between language and social skills in ASD contains zero as a possible value of the effect. A PPI that contains zero may or may not provide evidence of an effect, depending on the clinical interpretation of the effect. The clinical interpretation may be guided by computing the probability that the effect is greater than zero using values yielded from the 95% PPI. This probability might be large enough to be deemed clinically important even when zero is contained within the PPI (Oleson et al., 2019). For instance, a 90% probability that a standard score increase of one point on a language assessment is associated with a standard score increase of 7 or more points on a social skills assessment (i.e., a 90% probability that the effect is ≥ 7, equivalent to half of a standard deviation) is likely to be interpreted as sufficient justification to consider language skills when providing social supports to individuals with ASD, even if the 95% PPI contains zero. In clinical research, bounds of interest of this sort are arguably more important than simply determining whether zero is contained within the PPI, which is often the criterion for an effect to be considered important or significant. Notably, the interpretation of the PPI is entirely different than that of the conventional confidence interval and arguably of much greater clinical relevance (Kaplan, 2014).
In addition to clinical interpretability, Bayesian statistical results provide a strong basis for reproducibility. For instance, reproducing an effect based on the overlapping distributions of parameter values from PPIs provides information that can be used to determine the degree to which an effect replicated and can be used to guide diagnosis and treatment of clinical conditions. In contrast, reproducing an effect based on a single binary decision, such as whether an effect is statistically significant, does not yield such rich evidence. Brydges and Gaeta (2019b) highlighted this issue by using Bayesian methods to re-analyze null findings from audiology research that employed null hypothesis significance testing. None of the null findings from these studies was associated with strong evidence in favor of the null hypothesis, even though null findings were often interpreted as the presence of no effect. That is, the Bayesian method was useful for quantifying the strength of evidence for the null hypothesis relative to the alternative, rather than simply accepting the null by default. The authors also suggested that Bayesian methods are useful for ensuring that sample sizes in clinical research are sufficient to determine whether an effect is present (Brydges & Gaeta, 2019b).
Sample Size Issues
A benefit of using Bayesian statistics in clinical research is that, unlike other statistical approaches (e.g., null hypothesis significance testing; van de Schoot et al., 2014), this approach does not rely on large-sample theory. Participant samples in clinical research are generally small due to low prevalence, the challenges of recruitment and assessment, and similar constraints. These relatively small samples are characterized by heterogeneity, such as variance associated with the behavioral phenotype of clinical conditions and between-study variance in coefficient and effect size estimates (IntHout et al., 2015). Small samples (e.g., n = 20) are more heterogeneous than large samples (Button et al., 2013; IntHout et al., 2015), and heterogeneity is associated with biased and non-reproducible effect estimates (Liu et al., 2005). Thus, relying on large-sample theory, such as in frequentist statistics, may impede our ability to detect and reproduce effects (Brydges & Gaeta, 2019b).
Sample Size Evaluation.
In the Bayesian framework, the appropriateness of sample size is evaluated by assessing the sensitivity of the results to differences in the specification of the prior distributions (e.g., informative vs. noninformative). That is, small samples can yield wide, imprecise PPIs when combined with noninformative or weakly informative prior distributions. Small samples can also yield narrow, precise PPIs when combined with accurate and precise prior knowledge reflected in informative prior distributions. In other words, a small sample can be associated with high precision results if prior knowledge about effects of interest accurately reflects the reality of an effect. These patterns are quite logical; they mirror the totality of knowledge available in the evidence base and in a new dataset. The degree to which results, like PPIs, are sensitive to the informativeness of prior knowledge is also an indicator of how appropriate the sample size is for the prior knowledge specified and new data (van de Schoot et al., 2014). For instance, results that indicate a large negative effect for noninformative prior knowledge and a large positive effect for weakly informative prior knowledge suggest that detecting a reliable effect is not possible with the sample utilized. Thus, examining the sensitivity of results to different prior distributions provides critical information for illustrating how Bayesian statistics yields reliable results for small clinical samples.
Diversity, Equity, and Inclusion.
Because the Bayesian framework does not rely on large-sample theory, it allows researchers to maximize the impact of participant samples and address a critical challenge to clinical research: equitable inclusion of diverse populations (Kay et al., 2016). Black, Indigenous, and People of Color (BIPOC) and other minoritized individuals (e.g., autistic individuals with intellectual disability) may be disproportionately affected by the limitations of statistical approaches that are based on large-sample theory, because such samples are often exceedingly small (Durkin et al., 2015; Rivera-Figueroa et al., 2022; Russell et al., 2019). BIPOC experience systemic exclusion from research and more barriers to participation in research than their white counterparts (Girolamo et al., 2022; Woodall et al., 2010). Disproportionate demands to increase BIPOC participation in research, however, could be coercive and an inappropriate response to the problem of inequitable representation (Jones & Mandell, 2020). Of course, the current evidence base is limited in its relevance to BIPOC and likely provides weakly-informative or noninformative prior knowledge. An important next step in conducting equitable research is using analytical approaches that yield the information necessary to increase the relevant evidence base, but that do not place disproportionate demands of participation on marginalized groups. Bayesian approaches may be uniquely well-suited to address these issues and we explore this possibility in the current tutorial.
The Current Tutorial
Example Research Questions
This tutorial illustrates how to implement Bayesian statistics with small, heterogeneous samples in clinical research. We test example research questions pertaining to associations between behavior that reflects the ASD phenotype (social skills) and language performance on an experimental task. We also test the degree to which these associations vary when accounting for cognitive ability, an area of phenotypic heterogeneity in ASD with an informative evidence base (Georgiades et al., 2013; Geschwind, 2009), and sociodemographic factors, an area of heterogeneity in ASD with a more limited evidence base (Jones & Mandell, 2020; see also Woodall et al., 2010). These analyses are applicable to a broad range of research questions posed in clinical research, namely, associations between the phenotype of a clinical condition and constructs thought to be theoretically or clinically important to this condition.
Tutorial Explanation
To implement Bayesian analyses with a small clinical sample, it is critical to examine relationships between prior knowledge and statistical results by comparing models with different prior distributions specified. In particular, researchers should examine model comparison metrics and the sensitivity of results, such as differences in PPIs, to the choice of prior distributions. These measures provide guidance as to how reliable and clinically meaningful results may be. We conducted two sets of model comparisons to capture the continuum of scenarios that researchers are likely to experience when specifying prior distributions for Bayesian analyses.
Specifically, we compare models that include informative prior distributions versus noninformative distributions. While it is likely that researchers will identify informative evidence from the literature that is relevant to their research question, this prior evidence may be derived from a different age group or methodological paradigm or from research with varied scientific rigor (e.g., multiple replication studies versus a single study). These differences can affect the degree to which prior findings can inform associations in a new dataset. Comparing priors drawn from prior literature (informative prior distributions) to noninformative prior distributions is a rigorous approach to determine whether informative prior distributions are appropriately specified for the current data. We also compare models that include weakly-informative prior distributions versus noninformative prior distributions. When there is little informative evidence from the literature, yet sufficient evidence to suggest that associations are nonzero, weakly-informative prior distributions may best capture assumptions about associations in a new dataset. Comparing weakly-informative prior distributions to noninformative prior distributions will determine which prior distribution is more appropriately specified for the current data.
Model Comparison 1 – Informative versus Noninformative Prior Distributions.
This model comparison tested the following research question: What is the relationship between behavior that reflects functional presentation of the ASD phenotype (i.e., social skills; as measured by the Vineland Adaptive Behavior Scales-3 [VABS-3]; Sparrow et al., 2016), and language performance on an experimental grammaticality judgement task (e.g., Eigsti & Bennetto, 2009) in individuals with ASD? These two models also tested: How does this relationship vary when accounting for cognitive ability (Penn Matrix Reasoning; Gur et al., 2010)? See Figure 2 note.
Figure 2.

Prior distribution decision tree.
Note. For the example research questions, social skills, language, and cognitive ability have been rigorously investigated in prior research on ASD (informative priors), whereas sociodemographic factors have not been rigorously investigated in prior research on ASD (weakly-informative, noninformative priors).
Model Comparison 2 – Weakly-informative versus Noninformative Prior Distributions.
This model comparison re-tested the research question: What is the relationship between social skills and grammaticality judgement performance in individuals with ASD? This model comparison also tested: How does this relationship vary when accounting for sociodemographic factors? See Figure 2 note.
Example Dataset
This study was approved by the Institutional Review Board at the University of Connecticut. Participants were individuals with ASD ages 14–29 years (n = 20). Diagnosis was confirmed using the Autism Diagnostic Observation Schedule (ADOS-2; Lord et al., 2000) Module 4 scores in combination with the Autism Diagnostic Interview-Revised (ADI-R; Lord et al., 1994) following criteria from the Diagnostic and Statistical Manual, 5th Edition (DSM-5; American Psychiatric Association, 2013). Other eligibility criteria were no history of major psychiatric or neurological diagnoses or uncorrected hearing or vision impairments that would impede participation per parental and proband report. See Supplementary Materials 1, Table S1.1, for participant characteristics and performance on study measures.
Measures
Social Skills.
The VABS-3 Socialization parent report scales (Sparrow et al., 2016) yielded standardized scores based on a sociodemographically representative sample of the U.S. population (M = 100, SD = 15). These standardized scores represented the social skills variable.
Language
Participants completed an experimental grammaticality judgement task. In this task, participants listened to 23 grammatical and 23 ungrammatical sentences administered in a pseudo-randomized fashion via Qualtrics survey. Ungrammatical sentences involved morphosyntactic errors in General American English: tense marking (e.g., *Yesterday the man is standing in the rain for his bus; n = 7), omission (e.g., *Mrs. Sampson clean her house every Wednesday; n = 8), word order (e.g., *They stood the line in very patiently; n = 4), and substitutions (e.g., *Leonard should has written a letter to his mother; n = 4). A single native-English speaker recorded all stimuli, all stimuli were presented auditorily, and participants made a two-alternative forced choice correct/incorrect judgement via button press. Grammaticality judgement accuracy was coded as accurate/inaccurate (0,1), and we examined sensitivity (A’; ability to distinguish grammatical vs. ungrammatical sentences) and response bias (B’’; tendency to respond yes/no) at the participant level. Note that grammaticality judgements did not differ between participants with exposure to languages other than General American English (e.g., French, Spanish) relative to participants with exposure only to General American English (p = 0.87).
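For readers who wish to compute these indices themselves, the sketch below implements one common non-parametric formulation of A’ and B’’ from hit and false-alarm rates; the function names, the specific formulation, and the example values are illustrative rather than the study’s exact code.

```r
# Sketch: one common non-parametric formulation of sensitivity (A') and response
# bias (B'') from a hit rate and a false-alarm rate (B'' shown for hit >= fa).
# Here, hit = proportion of ungrammatical sentences correctly judged ungrammatical;
# fa = proportion of grammatical sentences incorrectly judged ungrammatical.
a_prime <- function(hit, fa) {
  if (hit >= fa) {
    0.5 + ((hit - fa) * (1 + hit - fa)) / (4 * hit * (1 - fa))
  } else {
    0.5 - ((fa - hit) * (1 + fa - hit)) / (4 * fa * (1 - hit))
  }
}

b_double_prime <- function(hit, fa) {
  (hit * (1 - hit) - fa * (1 - fa)) / (hit * (1 - hit) + fa * (1 - fa))
}

# Illustrative participant: 21/23 ungrammatical items detected, 3/23 false alarms
a_prime(21 / 23, 3 / 23)         # values near 1 indicate good discrimination
b_double_prime(21 / 23, 3 / 23)  # values near 0 indicate little response bias
```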
Nonverbal Ability
Penn Matrix Reasoning z-scores (M = 0, SD = 1) were a proxy for nonverbal ability (Gur et al., 2010). Note that a z-score of 1 corresponds to approximately 15 standardized score points (M = 100, SD = 15).
Race, Ethnicity, and Sociodemographics
Information on participant race, ethnicity, and sociodemographics was collected via background history questionnaires completed by the parent and proband. Race (white = −0.5, multiracial = 0.25, Asian = 0.25) and ethnicity (not Hispanic = −0.5, Hispanic = 0.5) data were contrast coded. Sociodemographics included highest grade completed by parents (mothers in the current study) and combined family income (from the MacArthur Network on SES and Health; Adler, 2007); these data were coded continuously per the upper threshold of each categorical response (e.g., $50,000-$74,999 = $74,999; $75,000-$99,999 = $99,999). See Supplementary Materials 1, Table S1.2, for race, ethnicity, and sociodemographic descriptive data.
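A minimal sketch of this coding scheme (variable names and example rows are hypothetical; the actual coding was applied to the background-questionnaire data):

```r
# Hypothetical sketch of the contrast and upper-threshold coding described above.
demo <- data.frame(
  race      = c("white", "multiracial", "Asian"),
  ethnicity = c("not Hispanic", "Hispanic", "not Hispanic"),
  income    = c("$50,000-$74,999", "$75,000-$99,999", "$50,000-$74,999")
)

# Contrast codes for race and ethnicity
demo$race_c      <- ifelse(demo$race == "white", -0.5, 0.25)
demo$ethnicity_c <- ifelse(demo$ethnicity == "Hispanic", 0.5, -0.5)

# Income coded continuously at the upper threshold of each response category
income_upper  <- c("$50,000-$74,999" = 74999, "$75,000-$99,999" = 99999)
demo$income_c <- unname(income_upper[demo$income])
```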
Bayesian Analysis Step 1: Specifying Prior Distributions
Prior knowledge about effects of interest is specified as informative prior distributions by determining the most likely mean and our certainty about the mean (i.e., by identifying a variance value; large variance indicates low certainty or low precision and small variance indicates high certainty or high precision). To identify relevant prior knowledge and specify prior distributions, researchers must consider the relevancy and rigor of prior literature; Figure 2 presents a decision tree. For our research questions and example dataset, relevant studies included participants with ASD, social skills measured by the VABS, and structural language measured by a grammaticality judgement task. Studies that included participants similar in age/developmental level to the current study’s participants or that statistically examined relationships between social skills and language were deemed most relevant. We also deemed studies sufficiently rigorous if they were published in reputable journals and their findings were broadly consistent with findings from additional studies; see Supplementary Materials 1, Table S1.5, for detailed comparisons of the most relevant and less-relevant studies.
Studies included: (1) Fein et al. (2013), which involved participants of a similar age/developmental level as the current study and measured social skills using the VABS-3; (2) Eigsti and Bennetto (2009) and Ellis Weismer et al. (2017), which involved participants of a similar age/developmental level, measured language using grammaticality judgement tasks, and accounted for the role of nonverbal ability; (3) Loucas et al. (2008), which examined the relationship between language and social skills measured by the VABS and accounted for the role of nonverbal ability; and (4) several additional studies that examined the relationship between social skills and nonverbal ability, including Simonoff et al. (2020) and Zachor and Ben-Itzchak (2020). See Table 1 for informative prior distribution values and additional explanation, Supplementary Materials 1, Table S1.5, for further information on prior distribution values, and Supplementary Materials 1, pp. 12–13, for detailed hypotheses.
Table 1.
Informative prior specification.
| Variable | Prior | Primary Source(s) | Explanation |
|---|---|---|---|
| Social Skills standard scores | Informative Prior: [75,16] | Fein et al. (2013) | Confidence: High; Reasoning: The current study adopts the sampling approach and some participants from Fein et al. (2013), and the age and IQ ranges are similar. |
| Grammaticality Judgement A’ | Informative Prior: [0.96,0.05] | Eigsti and Bennetto (2009); Ellis Weismer et al. (2017) | Confidence: Moderate-High; Reasoning: The current study adopts the sampling approach and some participants from Eigsti & Bennetto (2009), and the age and IQ ranges are similar. Differences in methodology, such as Qualtrics survey task administration in the current study, may be associated with greater variance than in-lab procedures (as in Eigsti & Bennetto, 2009). |
| Nonverbal ability z-score | Informative Prior: [0.85,1.25] | Fein et al. (2013) | Confidence: Moderate-High; Reasoning: The current study adopts the sampling approach and some participants from Fein et al. (2013) and Eigsti & Bennetto (2009), and the age and IQ ranges are similar. Differences in methodology, such as the non-standardized (z-score scale) measure of nonverbal ability and Qualtrics administration in the current study, may be associated with greater variance than standardized and in-lab methods (as in Fein et al., 2013). |
| Social Skills standard scores ~ Grammaticality Judgement A’ | Informative Prior: [0.0001,0.0005]† | Loucas et al. (2008) | Confidence: Low-Moderate; Reasoning: There is little relevant prior work and no prior work using grammaticality judgement. Only small relationships between language and VABS Socialization scores are evident in the most relevant prior literature. Small differences in grammaticality judgement performance (0.05 change in A’) are likely to be associated with somewhat large differences in VABS Socialization scores given high overall accuracy in grammaticality judgement in prior work (Eigsti & Bennetto, 2009). |
| Social Skills standard scores ~ Nonverbal ability z-score | Informative Prior: [1,8] | Loucas et al. (2008); Simonoff et al. (2020); Zachor and Ben-Itzchak (2020) | Confidence: Low; Reasoning: There is little relevant prior work and standardized scores for social skills and nonverbal ability appear to covary; thus, a 1-unit change in social skills will reflect a 1-unit change in nonverbal ability, yet this relationship may vary by ~0.5 SDs. |
| Grammaticality Judgement A’ ~ Nonverbal ability z-score | Informative Prior: [0.01,0.005]† | Eigsti and Bennetto (2009); Ellis Weismer et al. (2017); Loucas et al. (2008) | Confidence: Low-Moderate; Reasoning: There is somewhat relevant prior work showing small relationships between language and nonverbal ability. Small differences in grammaticality judgement (0.05 change in A’) are likely to be associated with somewhat large differences in nonverbal ability z-scores. |
Note. Informative Prior notation = [Mean, Standard Deviation], approximated based on available information; Social Skills standard scores = VABS-3 Socialization scale; ~ = outcome variable regressed on predictor variable.
† Calculations based on 1 unit of change in A’ being ≥ 2 times the plausible descriptive range of standard score and z-score performance. These priors were validated (i.e., deemed plausible) based on simple regression using data from the larger project, specifically a random subset of neurotypical, loss of autism diagnosis, and autism spectrum disorder participants.
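To make the [Mean, Standard Deviation] notation in Table 1 concrete, the sketch below expresses the social skills prior of [75, 16] as a normal distribution, converts its standard deviation to the precision (1/variance) scale that JAGS uses, and checks how much prior mass falls in a plausible score range; this is an illustration of the idea rather than the exact study code.

```r
# Sketch: an informative prior of [75, 16] (mean, SD) for VABS-3 Socialization
# standard scores, expressed as a normal distribution.
prior_mean <- 75
prior_sd   <- 16

# JAGS parameterizes dnorm() by precision = 1 / variance rather than by SD
prior_precision <- 1 / prior_sd^2

# Visualize the prior and check the mass it places on scores between 50 and 100
curve(dnorm(x, mean = prior_mean, sd = prior_sd), from = 20, to = 130,
      xlab = "Socialization standard score", ylab = "Prior density")
pnorm(100, prior_mean, prior_sd) - pnorm(50, prior_mean, prior_sd)  # ~0.88
```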
Bayesian Analysis Step 2: Statistical Approach
Step 2a: Statistical Models
Our Bayesian approach involved linear models and incorporated information from the evidence base into statistical models via the prior distributions. All models were run using the free, open-source R package rjags (Plummer et al., 2022). We used another open-source package, rstanarm (Goodrich et al., 2022), to obtain the Bayes factor and leave-one-out cross validation information criterion values; see model comparison metrics below. All R code is presented in Supplementary Materials 4, with comments to facilitate implementation. For researchers who prefer statistical software with a graphical user interface more similar to SPSS than R, we recommend JASP for Bayesian analyses, as discussed in Brydges and Gaeta (2019b).
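A minimal sketch of this workflow is shown below; the model formula, prior values, object names, and MCMC settings are illustrative, not the study’s exact code (which is provided in Supplementary Materials 4).

```r
# Minimal rjags sketch: Bayesian linear regression with normal priors.
# Data are simulated here so the sketch runs on its own; in practice the
# observed study variables would be supplied instead.
library(rjags)

set.seed(1)
socialization <- rnorm(20, mean = 75, sd = 16)                              # hypothetical predictor
gj_aprime     <- 0.9 + 0.0005 * (socialization - 75) + rnorm(20, 0, 0.03)   # hypothetical outcome

model_string <- "
model {
  for (i in 1:N) {
    y[i] ~ dnorm(mu[i], tau)        # likelihood
    mu[i] <- beta0 + beta1 * x[i]
  }
  beta0 ~ dnorm(0, 1.0E-4)          # JAGS dnorm() uses precision = 1/variance
  beta1 ~ dnorm(0, 1.0E-4)          # replace with informative values as appropriate
  tau   ~ dgamma(0.001, 0.001)      # prior on residual precision
}
"

dat <- list(y = gj_aprime, x = socialization, N = length(gj_aprime))

jm <- jags.model(textConnection(model_string), data = dat,
                 n.chains = 3, n.adapt = 1000)
update(jm, n.iter = 5000)                                  # burn-in
samples <- coda.samples(jm, variable.names = c("beta0", "beta1", "tau"),
                        n.iter = 20000, thin = 10)
summary(samples)          # posterior means, SDs, and quantiles (PPIs)
```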
For the current tutorial, several definitions of “important” effects were considered, given that researchers may identify bounds of interest based on their research question and aims. First, statistical effects were considered “important” if zero was not contained within the 95% posterior probability interval (PPI), consistent with some prior studies employing Bayesian statistics (e.g., Larson et al., 2020; McGregor et al., 2022). Second, we accounted for other bounds of interest by evaluating effect size as it pertains to clinical thresholds. Specifically, we examined the probability of clinically meaningful values, such as an association equivalent to a one-SD change on a standardized assessment, for effects of interest using the 95% PPI. This approach is discussed in detail below in Step 3c: Clinically Meaningful Results. Note that missing data were imputed using predictive mean matching with one imputation (see Supplementary Materials 1, Tables S1.3 and S1.4, for information on missing data).
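Single imputation via predictive mean matching is available in the mice package; the sketch below is one way this step could be carried out on a hypothetical data frame, and is an assumption about implementation rather than the study’s exact code.

```r
# Sketch: single imputation with predictive mean matching, assuming the mice package.
library(mice)

set.seed(123)
analysis_data <- data.frame(                      # hypothetical data with missing values
  socialization = c(72, 80, NA, 65, 90, 77),
  gj_aprime     = c(0.91, 0.95, 0.88, NA, 0.97, 0.93)
)

imp <- mice(analysis_data, m = 1, method = "pmm", printFlag = FALSE)
analysis_complete <- complete(imp, 1)             # the single imputed dataset
```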
Step 2b: Model Convergence
Bayesian analyses require model convergence to ensure that the results can be properly and reliably interpreted. Convergence indicates that the computer algorithm used to run a Bayesian analysis has produced stable posterior distributions that can be reliably summarized (e.g., with PPIs; Kaplan, 2014; see also new edition Kaplan, in press). Recall that the posterior distribution is the probability of an effect given the data and represents updated knowledge about an effect in the form of a distribution of estimate values. Markov Chain Monte Carlo (MCMC) sampling is a computer algorithm that generates samples from a probability distribution. In the case of Bayesian methods, the MCMC algorithm generates effect estimates from the posterior probability distribution. The goal of the algorithm is to obtain the posterior distribution of an effect given the data. Model convergence diagnostic indices evaluate the degree to which the algorithm successfully produces a stable, or reliable, posterior distribution. We report five indices used to evaluate convergence, and we provide example plots representing adequate versus poor convergence in Figure 3. (1) Trace plots show the degree to which MCMC chains converge when they sample different points of the distribution; overlapping chains suggest good convergence and that the target distribution has been fully explored. (2) Density plots show the density of the posterior mean and variance; a smooth, bell-like curve suggests good convergence. (3) Autocorrelation plots show the correlation between the first and second draws of a parameter from the posterior distribution, the first and third draws, and so forth, to demonstrate how well the distribution of the data has been explored; tall bars that persist across lags suggest poor convergence and that successive MCMC draws are not independent. (4) Gelman diagnostic plots show the degree to which the between-chain variance and within-chain variance are equal at the point of convergence; lines that meet slowly or do not meet in the plot suggest poor convergence and that the chains have not fully explored the distribution. (5) Geweke diagnostics test the equality of means from the first 10% and last 50% of each MCMC chain, thereby assessing two independent sections of the chain; an absolute z-score > 1.96 suggests poor convergence of MCMC chains.
Figure 3.




Trace plots: (a) poor exploration of the target distribution; (b) adequate exploration of the target distribution.
Density plots: (a) somewhat poor bell-like curve (due to peakiness); (b) adequate bell-like curve.
Autocorrelation plots: (a) slightly poor exploration of the data distribution (note that a plot with equally tall bars across the x-axis would be considered more markedly poor); (b) adequate exploration of the data distribution.
Gelman plots: (a) poor MCMC chain convergence; (b) adequate MCMC chain convergence.
Results from models that clearly show poor convergence on at least one index (e.g., Figure 3, Autocorrelation plot a or Gelman plot a) or that show at least moderately poor convergence on multiple indices (e.g., Figure 3, Density plot a and Trace plot a) should not be interpreted. Typically, multiple indices are considered in this way to inform convergence decisions. Model convergence may be improved by more appropriately specifying prior distributions, such as using weakly informative rather than informative prior distributions, or by adjusting MCMC algorithm specifications, such as increasing the number of iterations or thinning intervals (see the R code in Supplementary Materials 4, as well as Kaplan, 2014, for additional technical information on MCMC). In addition to reporting convergence metrics for each model predictor, we also report convergence metrics for three additional parameters that Bayesian models yield: tau, the precision parameter specified for every data point, which determines the spread of the data around the model predictions; fit, the sum of squared model residuals across participants, representing the discrepancy between model-predicted values and observations; and fit.new, the corresponding sum of squared residuals computed for data simulated from the model at each draw. See Supplementary Materials 2 for complete model diagnostics, including all plots and model output.
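The diagnostics described above can be obtained with the coda package once MCMC samples are available (here, samples is an mcmc.list such as the one returned by coda.samples in the earlier sketch):

```r
# Sketch: convergence diagnostics for an mcmc.list object using the coda package.
library(coda)

plot(samples)            # trace and density plots for each monitored parameter
autocorr.plot(samples)   # autocorrelation at increasing lags
gelman.plot(samples)     # Gelman-Rubin shrink factor across iterations (requires > 1 chain)
gelman.diag(samples)     # shrink factor point estimates; values near 1 suggest convergence
geweke.diag(samples)     # z-scores comparing the first 10% and last 50% of each chain
effectiveSize(samples)   # effective number of independent draws per parameter
```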
Step 2c: Model Comparison
Model comparison metrics are used to guide model selection, such as between two models with different prior distributions (Kaplan, 2014). In this tutorial, we present four model comparison indices. First, we present the Bayes factor, a model comparison criterion that quantifies the degree to which the data favor one model over another. The models being compared can be very different models for the same outcome or models that differ only in the specification of the prior distributions. Larger values indicate more evidence for a model relative to the base or alternative model (Lissa et al., 2021), and specific interpretations are attributed to values of Bayes factors per Jeffreys’ (1961) evidence categories: > 100 = decisive evidence; 10–100 = strong evidence; 3–10 = moderate evidence; < 3 = no evidence (see also Brydges & Gaeta, 2019b). Note that an approximation of the Bayes factor is the Bayesian information criterion (BIC), which may be more familiar to some researchers; the model with the lowest BIC value is considered to best fit the data. The BIC should be used quite cautiously, as it is not formally an information criterion and it is most often used in frequentist settings (e.g., Brydges & Gaeta, 2019b; Kaplan, 2014; Lavine & Schervish, 1999; see also new edition Kaplan, in press). Nevertheless, we present it here due to its familiarity and to contrast it with two indices that are designed for the Bayesian context.
Second, we present the deviance information criterion (DIC), an explicitly Bayesian criterion for model selection that provides an index of model performance based on prediction error and model complexity; smaller DIC values indicate a better model from a predictive point of view – namely, the model that will do best in predicting out-of-sample observations. The DIC is the Bayesian counterpart of the potentially more well-known Akaike Information Criterion (AIC), used in frequentist analyses to select models based on predictive performance considerations. Note that the DIC can lead to unreliable results, in the sense of leading to selection of an incorrect model. In fact, it has been viewed as not entirely Bayesian as it uses point estimates of the posterior rather than averaging over the posterior distribution to account for uncertainty (see Vehtari et al., 2017).
Third, we present the leave-one-out cross-validation information criterion (LOOIC), a fully Bayesian measure of the model’s ability to predict new data, where each data point serves as the validation set and the remaining data points serve as the training set, in turn, yielding point-wise predictive validity; smaller values indicate better out-of-sample (e.g., future data) prediction (Vehtari et al., 2017; Kaplan, 2021). The LOOIC is considered the state of the art for model choice on the basis of predictive performance. In contrast to Bayes factor values, specific DIC and LOOIC values do not carry intrinsic meaning (e.g., a DIC value of −50 for a model is not interpretable). Instead, these metrics provide comparative information for model selection.
Fourth, we present posterior predictive checks (PPCs), a tool used to evaluate how well a model fits the data, in which the predicted value of a statistic of interest, such as the mean or the variance of an outcome, based on MCMC draws from the model is compared to the actual mean or variance of the outcome variable. This metric yields a posterior predictive p-value that represents the proportion of predicted values that equal or exceed the observed value. A p-value of approximately .50 indicates adequate model fit. A major benefit of posterior predictive checking is that any statistic of interest can be examined. Thus, apart from examining the fit of the model to the mean or variance of an outcome, PPCs can be used to assess the fit of any quantile of clinical interest (e.g., the probability that an effect is ≥ 7, or equivalent to half of a standard deviation). For clinical research, we advocate rigorous posterior predictive checking geared to the specific clinical outcome. If the goal is the selection of one model among a set of competing models, then we advocate the LOOIC as a model selection criterion given its more rigorous cross-validation approach toward prediction.
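On the rstanarm side, LOOIC and posterior predictive checks can be obtained roughly as sketched below; the model formulas, prior values, and object names are illustrative (Bayes factors for rstanarm fits are typically computed with an additional package such as bayestestR or bridgesampling, which we flag as an assumption and do not show here).

```r
# Sketch: fitting comparable rstanarm models and comparing predictive performance.
# analysis_complete is a hypothetical data frame (e.g., from the imputation sketch above).
library(rstanarm)
library(loo)

fit_informative <- stan_glm(gj_aprime ~ socialization, data = analysis_complete,
                            prior = normal(0, 0.01),             # illustrative informative prior
                            prior_intercept = normal(0.9, 0.1),  # illustrative informative prior
                            refresh = 0)
fit_noninformative <- stan_glm(gj_aprime ~ socialization, data = analysis_complete,
                               prior = NULL, prior_intercept = NULL,  # flat priors
                               refresh = 0)

loo_inf <- loo(fit_informative)      # LOOIC: smaller values = better out-of-sample prediction
loo_non <- loo(fit_noninformative)
loo_compare(loo_inf, loo_non)

pp_check(fit_informative)            # graphical posterior predictive check
# Numerical check of the mean: proportion of replicated means >= the observed mean (~.50 is good)
yrep <- posterior_predict(fit_informative)
mean(apply(yrep, 1, mean) >= mean(analysis_complete$gj_aprime))
```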
Additionally, the sensitivity of model results (e.g., regression coefficients, PPIs) to the choice of prior distributions provides information about the relationship between sample size and the reliability of results. When the interpretation of model output differs between models with different prior distributions, results are deemed sensitive to prior distributions. For instance, if a model with informative prior distributions yields a PPI that does not overlap with the PPI from a model with noninformative prior distributions, these results would be deemed sensitive to prior distributions. Model output results that are sensitive to prior distributions suggest that the sample size is insufficient and, therefore, results are not reliable. In this case, prior distributions and other parameters, like covariates, may be revised so that the model better explains the data.
These model comparison metrics provide different sources and types of information. We recommend prioritizing LOOIC and sensitivity of model results. LOOIC is a fully Bayesian, cutting-edge model comparison metric, and sensitivity of model results is critical to understanding whether the sample size is sufficient (i.e., similar to a measure of power in the frequentist framework). However, we present the full set of model comparison metrics as a scaffolding technique for clinical researchers and to demonstrate the decision-making process in which clinical researchers may engage.
Bayesian Analysis Step 3: Interpreting Results
Step 3a: Model Comparison 1 – Informative versus Noninformative Prior Distributions
This model comparison tested the relationship between social skills and grammaticality judgement performance and how this relationship varied when accounting for cognitive ability, comparing models with informative versus noninformative prior distributions. See Table 2 for model comparison results and Supplementary Materials 1 Tables S1.3 and S1.4 for complete statistical model results, as well as Figure 2 note.
Table 2.
Model comparison results.
| Model | Bayes Factor | Bayes Factor | DIC | LOOIC | Posterior Predictive Check | Convergence Issues | Selected Model |
|---|---|---|---|---|---|---|---|
| Goal 1: Informative vs. Noninformative | Informative/Noninformative | Noninformative/Informative | Informative – Noninformative | Informative – Noninformative | Informative – Noninformative | | |
| GJ ~ Socialization | 208.81 | 0.01 | -97.00 < -96.89 | -95.2 < -94.8 | 0.32 < 0.50 | | Informative |
| GJ ~ Socialization + Nonverbal Ability | 138^28 | 1.05^-30 | -85.99 > -96.75 | -96.4 < -95.6 | 0.32 < 0.51 | | Informative |
| Goal 2: Weakly-informative vs. Noninformative | Weakly-informative/Noninformative | Noninformative/Weakly-informative | Weakly-informative – Noninformative | Weakly-informative – Noninformative | Weakly-informative – Noninformative | | |
| GJ ~ Socialization | 0.001 | 931.56 | -94.27 > -96.89 | -23.3 > -94.8 | 0.32 < 0.50 | Weakly-informative | Noninformative |
| GJ ~ Socialization + Race | 357^18 | 1.14^-21 | -88.28 > -91.48 | -96.5 < -95.0 | 0.31 < 0.50 | | Weakly-informative |
| GJ ~ Socialization + Ethnicity | 224^18 | 3.14^-21 | -97.51 > -98.14 | -96.4 < -95.2 | 0.31 < 0.49 | | Weakly-informative |
| GJ ~ Socialization + Income | >1,000 | 0.00 | -90.91 > -96.17 | -91.7 < -91.5 | 0.34 < 0.50 | | Weakly-informative |
| GJ ~ Socialization + Maternal Education | 348^11 | 5.43^-14 | -91.23 > -98.22 | -82.4 > -92.1 | 0.35 < 0.50 | Weakly-informative; Noninformative | No model selected |
Note. Bayes Factor = larger values represent more evidence that the data favor the numerator model; DIC = Deviance information criterion, smaller values represent better model fit; LOOIC = Leave-one-out cross validation information criterion, smaller values indicate better prediction; Posterior Predictive Checks = yields a p-value indicating the proportion of model predictions that equal/exceed the observed value, a p-value of approximately .50 indicates that the model prediction fits the actual data; Convergence Issues = reports any model that fails to meet convergence standards on any metric; see Supplementary Materials 2 for complete convergence metric reporting; GJ = Grammaticality judgement; Socialization = VABS-3 Socialization standard scores; Nonverbal ability = Nonverbal ability z-scores on the Penn Matrix Reasoning scale; Noninformative/Informative = evidence in favor of the noninformative model relative to the informative model; Informative/Noninformative = evidence in favor of the informative model relative to the noninformative model; Informative – Noninformative = informative model value reported before the noninformative model value (DIC, LOOIC, and Posterior Predictive Check columns); Weakly-informative/Noninformative = evidence in favor of the weakly-informative model relative to the noninformative model; Noninformative/Weakly-informative = evidence in favor of the noninformative model relative to the weakly-informative model; Weakly-informative – Noninformative = weakly-informative model value reported before the noninformative model value; ~ = outcome variable regressed on predictor variable.
Model Comparison Metrics
All informative and noninformative models achieved adequate convergence, indicating sufficient reliability for further interpretation. Bayes factor values indicated that the informative model best fit the data. DIC values were similar between informative and noninformative models without the nonverbal ability predictor, but favored the noninformative model for models with the nonverbal ability predictor, indicating modest evidence of better prediction. LOOIC provided evidence that out-of-sample predictive performance was similar between models, and posterior predictive checks indicated that noninformative models provided better fit to the actual data. Informative models were selected due to the data favoring those models and little evidence of differences between models in predictive out-of-sample performance. These results suggest that the data favored models that included information from the evidence base and prediction was similar for models with versus without information from the evidence base. However, noninformative models predicted the actual mean of the outcome variable in the current dataset better than informative models based on posterior predictive checks. Given the focus of combining the current data with prior knowledge to yield results rather than describing our own data, we recommend selecting informative models. Researchers may select noninformative models if they aim to maintain relative “neutrality” in presenting their results.
Sensitivity of Model Output Results to Prior distributions
There was no substantive difference in results for the informative versus noninformative models, suggesting that the current sample size is sufficient for all prior distributions specified. The intercepts, b estimates, and 95% PPIs were highly similar between models without the nonverbal ability predictor, and the 95% PPIs were slightly narrower for the noninformative model with the nonverbal ability predictor.
Step 3b: Model Comparison 2 – Weakly-informative versus Noninformative Prior Distributions
This model comparison re-tested the relationship between social skills and grammaticality judgement performance and tested how this relationship varied when accounting for sociodemographic factors, comparing models with weakly-informative versus noninformative prior distributions. See Table 2 for model comparison results and Supplementary Materials 1, Table S1.4, for complete statistical model results, as well as Figure 2 note.
Model Comparison Metrics
The noninformative model with only the socialization predictor achieved adequate convergence, indicating sufficient reliability for further interpretation, but the weakly-informative model with only the socialization predictor had poor convergence based on Geweke diagnostics. Both weakly-informative and noninformative models with the race, ethnicity, or income predictor achieved adequate convergence, but both models with the maternal education predictor had poor convergence based on the autocorrelation plot, indicating poor exploration of the posterior distribution. Bayes factor values indicated that the data favored the noninformative model with only the socialization predictor, but favored the weakly-informative model when the race, ethnicity, or income predictor was included. DIC values were similar between models with only the socialization predictor and between models with the race, ethnicity, or income predictor. LOOIC values favored the noninformative model with only the socialization predictor, but were similar between models with race, ethnicity, or income. Posterior predictive checks indicated that the noninformative models provided better fit to the actual data.
The following models were selected based on these results: (1) the noninformative model with only the socialization predictor due to superior predictive performance based on LOOIC and posterior predictive check values, and due to the weakly-informative model having poor convergence; this result suggests that the model with no prior information was associated with better out-of-sample prediction than the model with weakly-informative prior information for the models with only the socialization predictor; (2) the weakly-informative model with the race, ethnicity, or income predictor due to the data favoring weakly-informative models and predictive performance being similar between models (note, however, that noninformative models better predicted the actual mean of the outcome variable in the current dataset and may be selected if the focus is best prediction of values within the current dataset); this result suggests that models with weakly-informative prior information were associated with greater odds that the data favored those models and with better out-of-sample prediction than models with no prior information for models with race, ethnicity, and income predictors; (3) no model with the maternal education predictor due to poor model convergence; results were not interpretable for models with the maternal education predictor.
Sensitivity of Model Output Results to Prior distributions
There was no substantive difference in results for the weakly-informative versus noninformative models with only the socialization predictor or models with the race, ethnicity, income, or maternal education predictor, suggesting that the current sample size is sufficient for all prior distributions specified. Note that while results from models with poor convergence should not be used to draw conclusions, results may be used to determine whether insufficient sample size represents a possible explanation for poor convergence (e.g., models with the maternal education predictor). For the noninformative model with only the socialization predictor, the 95% PPIs were slightly narrower, and the intercept and b estimates were highly similar between models. For the models with the race or income predictor, the intercepts, b estimates, and 95% PPIs were highly similar between noninformative and weakly-informative models. For the noninformative model with the ethnicity predictor, the 95% PPI was slightly narrower for the ethnicity predictor, and the intercept and b estimates, and intercept and socialization predictor 95% PPIs were highly similar between models. For the noninformative model with the maternal education predictor, the 95% PPI was slightly narrower for the intercept and maternal education predictor, and the intercept and b estimates, and socialization predictor 95% PPIs were highly similar between models. Thus, PPIs were slightly narrower for noninformative models than weakly-informative models for models with only the socialization predictor and for models with the socialization and ethnicity or maternal education predictor, which suggests slight sensitivity of results to prior distributions. Researchers may decide to specify informative prior distributions for these predictors to further test the sensitivity of results to prior distributions, particularly if clinical thresholds suggest meaningful differences in the range of PPI values (e.g., the difference in PPIs is greater than .5 standard deviations in performance on a standardized assessment). If there is additional evidence that results are sensitive to prior distributions, it may be beneficial to increase the sample size. Given that all other model output results did not differ meaningfully between models, the current overall findings suggest that sample size is sufficient.
Step 3c: Clinically Meaningful Results
So far in the current tutorial, statistical effects have been considered important if zero was not contained within the 95% PPI. To demonstrate other bounds of interest, we evaluated effect size specific to thresholds that may be deemed clinically meaningful. Using the function “pnorm” (base package in R; RStudio Team, 2022), we can probe Bayesian results, such as b estimates and SDs, to ask clinically meaningful questions, such as What is the probability that a grammaticality judgement A’ score of 0.95 versus 0.90 is associated with a meaningful difference in socialization standard scores? and What is the probability that a social skills standard score of 70 is associated with grammaticality judgement A’ scores between 0.80 and 0.90? A clinically meaningful relationship could be defined as a 0.05 change in grammaticality judgement A’ for every standard deviation (SD) change in VABS-3 and Nonverbal Ability scores (i.e., 15 standardized score points). For instance, a grammaticality judgement A’ score of 0.95 versus 0.90 is likely to reflect a meaningful difference in detecting grammatical errors, given that there are typically few errors on grammaticality judgement tasks (e.g., Eigsti & Bennetto, 2009). An association between a difference of 0.05 in grammaticality judgement scores and VABS-3 standard scores of 85 versus 70 will reflect a meaningful difference in social functioning given that a score of 85 is within the normal range whereas 70 is below the normal range relative to the normative sample (e.g., Chatham et al., 2018 reported that minimally clinically important differences were standardized score differences of 2–4 on the VABS-2). In the current analyses, there were no important effects as all PPIs contained zero (see Supplementary Materials 1, Tables S1.3 and S1.4 for complete results). There were also no effects that we deemed clinically meaningful. For instance, the probability that an increase of 15 in social skills standard scores was associated with a grammaticality judgement A’ score increase ≥1 was 5%, indicating a very low probability of even a small association between performance on these measures. These results suggest that grammaticality judgement performance is not meaningfully associated with social skills in individuals similar to our participant sample, and that these measures may not be sensitive to associations between language and social function observed in prior work (e.g., between VABS expressive language and VABS socialization, Park et al., 2012, and between verb diversity and ADOS scores, LeGrand et al., 2021).
Given the small effects from our analyses, we also tested the relationship between ADOS-2 total scores and VABS-3 daily living standard scores, another measure that reflects the ASD phenotype (see Supplementary Materials 3 for model comparison and complete model output). We defined our clinically meaningful relationship as a 1-point change in ADOS-2 scores for every ≥ 0.5 SD change in VABS-3 daily living scores (i.e., 8 standardized score points). Our results for the relationship between ADOS-2 total scores and VABS-3 daily living scores were: b = −0.106; SD = 0.053; 95% PPI[−0.214, −0.002]. This finding indicates that this relationship is important per our criterion – zero did not lie within the 95% PPI. When analyzing other possible bounds of clinical interest, we found that the probability of a 1-point change in ADOS-2 total scores for every ≥ 1 SD change in VABS-3 daily living scores was 77% and the probability of a 1-point change in ADOS-2 total scores for every ≥ 0.5 SD change in VABS-3 daily living scores was 35%. We also examined the probability of this relationship lying between two values, demonstrating a 15% probability of a 1-point change in ADOS-2 total scores given a 1–2 SD change in VABS-3 daily living scores, and a 41% probability of a 1-point change in ADOS-2 total scores for a 0.5–1 SD change in VABS-3 daily living scores. Although the probability of ≥ 0.5 SD change in VABS-3 daily living scores being associated with a 1-point change in ADOS-2 total scores was relatively low (35%), we concluded that this relationship is clinically meaningful because there is a high probability (77%) that ≥ 1 SD change in VABS-3 daily living scores is associated with meaningful change in ADOS-2 total scores.
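These probabilities follow directly from the reported posterior mean and SD using pnorm. Because the coefficient is expressed per VABS-3 standard-score point, a 1-point ADOS-2 change per k points of VABS-3 change corresponds to a coefficient magnitude of at least 1/k; the sketch below reconstructs the reported values under that reading (the effect is negative, so the relevant tail is the lower one).

```r
# Reconstructing the reported probabilities from the posterior summary of the
# ADOS-2 total ~ VABS-3 daily living model: b = -0.106, SD = 0.053.
b_mean <- -0.106
b_sd   <- 0.053

# P(1-point ADOS-2 change per >= 1 SD of VABS-3 daily living, i.e., 15 points): |b| >= 1/15
pnorm(-1 / 15, mean = b_mean, sd = b_sd)                                             # ~.77

# P(1-point ADOS-2 change per >= 0.5 SD, i.e., ~8 points): |b| >= 1/8
pnorm(-1 / 8, mean = b_mean, sd = b_sd)                                              # ~.35

# P(coefficient lies between the 2 SD and 1 SD thresholds): 1/30 <= |b| <= 1/15
pnorm(-1 / 30, mean = b_mean, sd = b_sd) - pnorm(-1 / 15, mean = b_mean, sd = b_sd)  # ~.15

# P(coefficient lies between the 1 SD and 0.5 SD thresholds): 1/15 <= |b| <= 1/8
pnorm(-1 / 15, mean = b_mean, sd = b_sd) - pnorm(-1 / 8, mean = b_mean, sd = b_sd)   # ~.41
```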
Discussion
This tutorial described the Bayesian framework and why this framework is increasingly valued in clinical research. We provided step-by-step instructions on how to effectively and reliably implement a Bayesian statistical approach with a small clinical sample, testing example research questions pertaining to the ASD phenotype. Notably, this population is characterized by heterogeneity that often presents challenges to reliable statistical analysis. In conducting this analysis, we illustrated how to specify prior distributions from the literature, providing detailed information about the decision-making process (e.g., Figure 2). Second, we described the statistical approach, providing step-by-step instructions on statistical modeling with corresponding R code, model convergence metrics used to ensure results are reliable, and model comparison metrics to test different prior distributions and determine whether sample size is sufficient. Third, we interpreted model convergence and comparison metrics to identify selected models and prior distributions. We also demonstrated how to identify and interpret multiple bounds of interest from statistical model results depending on research questions and clinical thresholds.
Importantly, this tutorial showed that a Bayesian statistical approach reliably tested our research questions in a small clinical autism sample. Specifically, statistical models demonstrated adequate convergence, which is necessary for interpreting results, for the vast majority of analyses; the exceptions were models including the maternal education variable. We found that models with informative prior distributions were selected over models with noninformative prior distributions. We also found that statistical results were minimally sensitive to the choice of prior distributions, indicating that our sample size was sufficient. Moreover, our results showed a high probability that performance on one measure of ASD symptomatology was associated with clinically meaningful differences in performance in another domain, daily living skills. This type of Bayesian analysis translates directly into clinical questions and may lead to improved research-to-practice pipelines (McMillan & Cannon, 2019). We expand on these findings below and offer additional information to further illustrate how to implement Bayesian analyses in future clinical research.
Model Convergence
Although the majority of statistical models achieved adequate model convergence, there were two cases of poor convergence. These cases were associated with a specific variable, maternal education, rather than with prior knowledge or sample size more broadly. To address this issue, we explored whether changes in statistical model parameters yielded more reliable results and found that increasing the model thinning interval from 10 to 100 led to adequate convergence (see Supplementary Materials 4 for example R code). Increasing the thinning interval retains parameter values only at wider intervals (e.g., every 10th or 100th iteration), which reduces autocorrelation among the retained samples and represents a useful first step in adjusting the MCMC algorithm specification for cases of poor model convergence (see Step 2b: Model Convergence for additional information). With this change in model parameters, no models in the current study required better-specified prior distributions or a larger sample size to yield reliable results. Thus, even though there was relatively limited variance in maternal education, findings are interpretable given that small changes in model specification led to adequate convergence.
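For readers who want a concrete starting point, the sketch below shows one way such an adjustment might look with rjags and the coda diagnostics it attaches. It is not the tutorial's actual model code (see Supplementary Materials 4 for that); the compiled model object `model` and the coefficient name `b_mat_ed` are hypothetical placeholders.

```r
library(rjags)  # also attaches coda for convergence diagnostics

# Draw posterior samples at two thinning intervals; 'model' is assumed to be a
# jags.model() object compiled with multiple chains, and "b_mat_ed" is a
# hypothetical name for the maternal education coefficient.
samples_thin10  <- coda.samples(model, variable.names = "b_mat_ed",
                                n.iter = 50000, thin = 10)
samples_thin100 <- coda.samples(model, variable.names = "b_mat_ed",
                                n.iter = 50000, thin = 100)

# Potential scale reduction factor (values near 1.0 suggest convergence)
gelman.diag(samples_thin10)
gelman.diag(samples_thin100)

# Autocorrelation should drop toward zero at the larger thinning interval
autocorr.plot(samples_thin100)
```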
Model Comparison
Model comparison findings suggested that the informative prior distributions, which reflected prior research and expert opinion, outperformed the noninformative prior distributions, which assumed no prior information; thus, drawing on the cumulative evidence base, combined with new data, yielded reliable and useful results. These results were associated with greater odds that the data favored the informative hypothesis; this may be particularly important in studies that test competing hypotheses, such as a study testing whether language predicts executive function or executive function predicts language in a clinical population (e.g., Larson et al., 2020). These results were also associated with better out-of-sample prediction, which may be particularly important in studies aiming to accurately predict new data, such as a study that aims to test a treatment effect for implementation in clinical settings (e.g., Haebig et al., 2019). However, posterior predictive checks indicated that noninformative models predicted the actual data to a greater degree than informative models. Predicting the actual data may be particularly important in studies that aim to accurately characterize their sample, such as a study that compares social skills in individuals with a current ASD diagnosis to a comparison group with a history of ASD who no longer meet diagnostic criteria (e.g., Fein et al., 2013).
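As an illustrative sketch only (not the tutorial's actual models), the kind of comparison described above could be set up in rstanarm roughly as follows; the data frame `dat`, the variable names, and the prior values are hypothetical placeholders.

```r
library(rstanarm)  # Bayesian regression models via Stan
library(loo)       # leave-one-out cross-validation for model comparison

# Model with an informative prior: slope centered on a literature-based value.
fit_informative <- stan_glm(outcome ~ predictor, data = dat,
                            prior = normal(location = 0.05, scale = 0.02),
                            chains = 4, iter = 4000, seed = 123)

# Model with a flat (noninformative) prior on the coefficients.
fit_flat <- stan_glm(outcome ~ predictor, data = dat,
                     prior = NULL,
                     chains = 4, iter = 4000, seed = 123)

# Out-of-sample predictive performance; higher elpd indicates better prediction.
loo_compare(loo(fit_informative), loo(fit_flat))

# Posterior predictive checks: how closely each model reproduces the observed data.
pp_check(fit_informative)
pp_check(fit_flat)
```

A Bayes factor comparison of the two prior specifications would require additional tooling (e.g., bridge sampling) and is not shown here.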
All model comparisons revealed that results were minimally sensitive to prior distributions, regardless of whether prior distributions were informative or less informative. In other words, the sample size was sufficient for the prior distributions specified across models and the model results were robust to these choices of priors. PPIs were slightly narrower for many of the noninformative models relative to the informative and weakly-informative models, suggesting slightly higher precision in the range of values for a given effect for noninformative models (van de Schoot et al., 2014). However, it is likely that differences in PPIs between models would be smaller with larger sample sizes.
The high reliability of the current findings related to race and ethnicity variables is of particular note given the minimal racial and ethnic diversity of our sample, which was 80% white, 20% Asian and multiracial, 95% non-Hispanic, and 5% Hispanic. Yet we demonstrated reliable results for race and ethnicity predictors without imposing disproportionate demands of participation on marginalized groups (Girolamo et al., 2022; Jones & Mandell, 2020). Additional work is necessary to ensure that model results are relevant and generalizable to BIPOC populations and to varied coding schemes (e.g., equally weighted group codes or dummy coding), which may be facilitated by considering within-group patterns for racial and ethnic groups when developing prior distributions. Collectively, these Bayesian analyses, which combined prior knowledge with new data, yielded a constellation of useful and reliable information pertaining to our clinical research questions, even with a small, heterogeneous sample of participants with ASD.
Clinically Meaningful Relationships
The Bayesian framework yields a distribution of values for an effect that captures the strength of evidence in favor of a hypothesis, accounting for uncertainty reflected by prior distributions (Brydges & Gaeta, 2019). This distribution of parameter values provides a strong basis for reproducibility and clinical utility (Oleson et al., 2019). We found no important relationships between grammaticality judgement and socialization or sociodemographic predictors, suggesting that there was no meaningful association between these measures. Possible interpretations of these findings are that targeting grammaticality judgement performance in clinical contexts is not likely to lead to improved social function, and that grammaticality judgement and social function do not vary meaningfully depending on sociodemographic factors. To better demonstrate how to ask clinically meaningful questions of relationships found to be important, we conducted a post-hoc test of the relationship between ADOS-2 total scores and another measure that reflects the ASD phenotype, VABS-3 daily living scores. We found that there was a high probability (77%) that meaningful differences (≥ 1 SD) in VABS-3 daily living scores will be associated with meaningful differences (≥ 1 point) in ADOS-2 total scores. One possible interpretation of this finding is that an individual with ASD who has relatively high daily living scores due to having relatively strong self-regulation skills, such as recognizing and navigating emotions appropriately, and task completion skills, such as getting ready for school independently, has a high probability of presenting with fewer ASD features on the ADOS-2, such as those related to reporting emotions (e.g., describing anger and what makes them feel angry) and completing the construction task (e.g., asking for more pieces to complete the puzzle). We also showed that more minimal differences in performance on the VABS-3 daily living scale (0.5 SDs) had a much lower probability (35%) of being associated with meaningful differences in performance on the ADOS-2. This finding may suggest that individuals with ASD who have relatively similar daily living scores are also likely to have relatively similar levels of ASD features on the ADOS-2. In sum, we were able to test and interpret multiple clinically meaningful questions, illustrating how Bayesian statistical results address the crucial need for clinically relevant evidence.
Limitations
Identifying relevant prior literature and specifying informative prior distributions remain significant challenges for Bayesian statistical approaches (Winkler, 2001). Thus, a primary contribution of this tutorial is the detailed decision-making process we have presented visually (Figure 2), with regard to the most relevant prior work (Table 1), and with regard to a broader literature base (Supplementary Materials 1, Table S1.5 and pp. 12–13). One particular challenge for specifying informative prior distributions is methodological differences between relevant studies. For instance, we inferred relationships between grammaticality judgement performance and VABS-3 social skills based on comparisons between language and social skills that used quite dissimilar measures (e.g., VABS-3 scores in an ASD group with versus without co-occurring structural language impairment). This limitation highlights the importance of model comparison when selecting prior distributions and of using the step-by-step approach we have outlined in this tutorial. An additional utility of model comparison may be testing competing hypotheses from the evidence base by encoding these hypotheses in different informative prior distributions. Models with different informative prior distributions may then be compared to determine which hypothesis better explains the data or better predicts new data, based on model comparison metrics.
It is critical for future work to explore how to appropriately quantify clinically meaningful relationships, for example as thresholds on effect estimates, for various research aims and clinical populations. For instance, Chatham et al. (2018) identified “minimal clinically-important differences” in VABS-2 scores in an ASD group that reflect functional change in real-world contexts. Future work would also benefit from evaluating the relationship between language and social skills using different measures (e.g., language samples) in order to more fully account for relationships between these skills. Evaluating these relationships with sociodemographically diverse participant samples, such as bilingual populations and samples with greater variance in maternal education, is also necessary to fully account for how relationships vary depending on cultural and linguistic factors (Girolamo et al., in press).
Conclusions
The current tutorial showed how to implement a Bayesian statistical approach to yield reproducible and meaningful evidence for clinical research. We used an example dataset to test common research questions pertaining to a clinical population characterized by heterogeneity, ASD, and provided step-by-step instructions to facilitate application of Bayesian statistics to a variety of clinical research questions. Our Bayesian statistical models yielded reliable results, including for models with predictors relevant to diverse populations, and we used these results to answer clinically meaningful questions. This tutorial has provided clear evidence that Bayesian statistics can be reliably and effectively implemented for small, heterogeneous samples and used to address the critical need for reproducible, clinically relevant research.
Supplementary Material
Acknowledgements:
The authors would like to thank the participants and their families, as well as the research assistants who make this research possible. We gratefully acknowledge funding from the National Institutes of Health (R01MH112687-01A1 and T32DC017703).
Footnotes
Conflict of Interest Statement: The authors have no conflicts of interest to report.
Precision is the inverse of the variance.
Pre-registration Link: https://osf.io/yx3f4
Data Availability Statement:
Data are available upon request from the first author.
References
- Adler N and the MacArthur Network. (2022). Research network on socioeconomic status and health. https://www.macfound.org/networks/research-network-on-socioeconomic-status-health
- Ambridge B, Bannard C, & Jackson GH (2015). Is Grammar Spared in Autism Spectrum Disorder? Data from Judgements of Verb Argument Structure Overgeneralization Errors. Journal of Autism and Developmental Disorders, 45(10), 3288–3296. 10.1007/s10803-015-2487-5
- American Psychiatric Association. (2013). Diagnostic and statistical manual of mental disorders (5th ed.). 10.1176/appi.books.9780890425596
- Anderson DK, Liang JW, & Lord C (2014). Predicting young adult outcome among more and less cognitively able individuals with autism spectrum disorders. Journal of Child Psychology and Psychiatry and Allied Disciplines, 55(5), 485–494. 10.1111/jcpp.12178
- Attanasio JS (1994). Inferential statistics and treatment efficacy studies in communication disorders. Journal of Speech and Hearing Research, 37(4), 755–759. 10.1044/jshr.3704.755
- Barton-Hulsey A, & Sterling A (2020). Grammatical judgement and production in male participants with idiopathic autism spectrum disorder. Clinical Linguistics and Phonetics, 34(12), 1088–1111. 10.1080/02699206.2020.1719208
- Brydges CR, & Gaeta L (2019a). An analysis of nonsignificant results in audiology using Bayes factors. Journal of Speech, Language, and Hearing Research, 62(12), 4544–4553.
- Brydges CR, & Gaeta L (2019b). An Introduction to Calculating Bayes Factors in JASP for Speech, Language, and Hearing Research. Journal of Speech, Language, and Hearing Research, 62(12), 4523–4533.
- Button KS, Ioannidis JPA, Mokrysz C, Nosek BA, Flint J, Robinson ESJ, & Munafò MR (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14, 365–376. 10.1038/nrn3475
- Camerer CF, Dreber A, Holzmeister F, Ho T, Huber J, Johannesson M, Kirchler M, Nave G, Nosek BA, Pfeiffer T, Altmejd A, Buttrick N, Chan T, Chen Y, Forsell E, Gampa A, Heikensten E, Hummer L, Imai T, … Manfredi D (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nature Human Behaviour, 2. 10.1038/s41562-018-0399-z
- Chatham CH, Taylor KI, Charman T, Eule E, Fedele A, Loth E, Murtagh L, Caceres ASJ, Sevigny J, Snyder L, Tillmann JE, Ventola PE, Wang PP, & Bolognani F (2018). Adaptive Behavior in Autism: Minimal Clinically-Important Differences on the Vineland-II. Autism Research, 11(2), 270–283. 10.1002/aur.1874
- Eigsti IM, & Bennetto L (2009). Grammaticality judgements in autism: deviance or delay. Journal of Child Language, 36(5), 999–1021.
- Ellis Weismer S, Davidson MM, Gangopadhyay I, Sindberg H, Roebuck H, & Kaushanskaya M (2017). The role of nonverbal working memory in morphosyntactic processing by children with specific language impairment and autism spectrum disorders. Journal of Neurodevelopmental Disorders, 9. 10.1186/s11689-017-9209-
- Fein D, Barton M, Eigsti IM, Kelley E, Naigles L, Schultz RT, Stevens M, Helt M, Orinstein A, Rosenthal M, Troyb E, & Tyson K (2013). Optimal outcome in individuals with a history of autism. Journal of Child Psychology and Psychiatry and Allied Disciplines, 54(2), 195–205. 10.1111/jcpp.12037
- Freeman BJ, Del’Homme M, Guthrie D, & Zhang F (1999). Vineland adaptive behavior scale scores as a function of age and initial IQ in 210 autistic children. Journal of Autism and Developmental Disorders, 29(5), 379–384. 10.1023/A:1023078827457
- Georgiades S, Szatmari P, Boyle M, Hanna S, Duku E, Zwaigenbaum L, Bryson S, Fombonne E, Volden J, Mirenda P, Smith I, Roberts W, Vaillancourt T, Waddell C, Bennett T, & Thompson A (2013). Investigating phenotypic heterogeneity in children with autism spectrum disorder: A factor mixture modeling approach. Journal of Child Psychology and Psychiatry, 54(2), 206–215. 10.1111/j.1469-7610.2012.02588.x
- Geschwind DH (2009). Advances in Autism. Annual Review of Medicine, 60, 367–380. 10.1146/annurev.med.60.053107.121225
- Gigerenzer G, & Hoffrage U (1995). How to improve Bayesian reasoning without instruction: Frequency formats. Psychological Review, 102(4), 684–704. 10.1037/0033-295X.102.4.684
- Girolamo T, Ghali S, Campos I, & Ford A (in press). Interpretation and use of standardized language assessments for diverse school-age individuals. Perspectives of the ASHA Special Interest Groups.
- Girolamo T, Parker TC, & Eigsti I (2022). Incorporating Dis/ability Studies and Critical Race Theory to combat systematic exclusion of Black, Indigenous, and People of Color in clinical neuroscience. Frontiers in Neuroscience, 16. 10.3389/fnins.2022.988092
- Goodrich B, Gabry J, Ali I, & Brilleman S (2022). rstanarm: Bayesian applied regression modeling via Stan. R package version 2.21.3. https://mc-stan.org/rstanarm/
- Gur RC, Richard J, Hughett P, Calkins ME, Macy L, Bilker WB, Brensinger C, & Gur RE (2010). A cognitive neuroscience-based computerized battery for efficient measurement of individual differences: standardization and initial construct validation. Journal of Neuroscience Methods, 187(2), 254–262. 10.1016/j.jneumeth.2009.11.017
- Haebig E, Leonard LB, Deevy P, Karpicke JD, Christ SL, Usler E, Kueser JB, Souto S, Krok W, & Weber C (2019). Retrieval-Based Word Learning in Young Typically Developing Children and Children With Developmental Language Disorder II: A Comparison of Retrieval Schedules. Journal of Speech, Language, and Hearing Research, 62, 944–964.
- Henrich J, Heine SJ, & Norenzayan A (2010). Most people are not WEIRD. Nature, 466, 29.
- Inthout J, Ioannidis JPA, Borm GF, & Goeman JJ (2015). Small studies are more heterogeneous than large ones: A meta-meta-analysis. Journal of Clinical Epidemiology, 68(8), 860–869. 10.1016/j.jclinepi.2015.03.017
- Jeffreys H (1961). Theory of probability (3rd ed.). Oxford University Press.
- Jones LA, & Campbell JM (2010). Clinical characteristics associated with language regression for children with autism spectrum disorders. Journal of Autism and Developmental Disorders, 40(1), 54–62. 10.1007/s10803-009-0823-3
- Kaplan D (2014). Bayesian statistics for the social sciences. The Guilford Press.
- Kaplan D (2021). On the Quantification of Model Uncertainty: A Bayesian Perspective. Psychometrika. 10.1007/s11336-021-09754-5
- Kaplan D (in press). Bayesian statistics for the social sciences (2nd ed.). Guilford Press.
- Kay M, Nelson GL, & Hekler EB (2016). Researcher-centered design of statistics: Why Bayesian statistics better fit the culture and incentives of HCI. Conference on Human Factors in Computing Systems - Proceedings, 4521–4532. 10.1145/2858036.2858465
- Kjelgaard MM, & Tager-Flusberg H (2006). An Investigation of Language Impairment in Autism: Implications for Genetic Subgroups. Language and Cognitive Processes, 16(2–3), 1–21.
- Kjellmer L, Hedvall Å, Fernell E, Gillberg C, & Norrelgen F (2012). Language and communication skills in preschool children with autism spectrum disorders: Contribution of cognition, severity of autism symptoms, and adaptive functioning to the variability. Research in Developmental Disabilities, 33(1), 172–180. 10.1016/j.ridd.2011.09.003
- Kurtz MM, Ragland JD, Moberg PJ, & Gur RC (2004). The Penn Conditional Exclusion Test: A new measure of executive-function with alternate forms for repeat administration. Archives of Clinical Neuropsychology, 19, 191–201.
- Kwok EYL, Brown HM, Smyth RE, & Oram Cardy J (2015). Meta-analysis of receptive and expressive language skills in autism spectrum disorder. Research in Autism Spectrum Disorders, 9, 202–222. 10.1016/j.rasd.2014.10.008
- Larson C, Kaplan D, Kaushanskaya M, & Ellis Weismer S (2020). Language and inhibition: Predictive relationships in children with language impairment relative to typically developing peers. Journal of Speech, Language, and Hearing Research, 63(4), 1115–1127. 10.1044/2019_JSLHR-19-00210
- Lavine M, & Schervish MJ (1999). Bayes factors: What they are and what they are not. The American Statistician, 53(2), 119–122. 10.1080/00031305.1999.10474443
- LeGrand KJ, Weil LW, Lord C, & Luyster RJ (2021). Identifying Childhood Expressive Language Features That Best Predict Adult Language and Communication Outcome in Individuals With Autism Spectrum Disorder. Journal of Speech, Language, and Hearing Research, 64, 1977–1991.
- Leonard LB, Ellis Weismer S, Miller CA, Francis DJ, Tomblin JB, & Kail RV (2007). Speed of processing, working memory, and language impairment in children. Journal of Speech, Language, and Hearing Research, 50, 408–428. 10.1044/1092-4388(2007/029)
- Van Lissa CJ, Gu X, Mulder J, Rosseel Y, & Van Zundert C (2021). Teacher’s Corner: Evaluating Informative Hypotheses Using the Bayes Factor in Structural Equation Models. Structural Equation Modeling: A Multidisciplinary Journal, 28(2), 292–301. 10.1080/10705511.2020.1745644
- Liu W, Zhao W, Sha ML, Icitovic N, & Chase GA (2005). Modelling clinical trials in heterogeneous samples. Statistics in Medicine, 24, 2765–2775. 10.1002/sim.2144
- Loucas T, Charman T, Pickles A, Simonoff E, Chandler S, Meldrum D, & Baird G (2008). Autistic symptomatology and language ability in autism spectrum disorder and specific language impairment. Journal of Child Psychology and Psychiatry and Allied Disciplines, 49(11), 1184–1192. 10.1111/j.1469-7610.2008.01951.x
- Lord C, Risi S, Lambrecht L, Cook EH Jr, Leventhal BL, DiLavore PC, & Rutter M (2000). The Autism Diagnostic Observation Schedule-Generic: A standard measure of social and communication deficits associated with the spectrum of autism. Journal of Autism and Developmental Disorders, 30, 205–223.
- Lord C, Rutter M, & Le Couteur A (1994). Autism Diagnostic Interview-Revised: A revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders. Journal of Autism and Developmental Disorders, 24, 659–685.
- Maye M, Boyd BA, Martínez F, Alycia P, Thurm A, & Mandell DS (2021). Biases, Barriers, and Possible Solutions: Steps Towards Addressing Autism Researchers Under-Engagement with Racially, Ethnically, and Socioeconomically Diverse Communities. Journal of Autism and Developmental Disorders, 1–6. 10.1007/s10803-021-05250-y
- McCauley JB, Pickles A, Huerta M, & Lord C (2020). Defining Positive Outcomes in More and Less Cognitively Able Autistic Adults. Autism Research, 13, 1548–1560. 10.1002/aur.2359
- McGregor KK, Smolak E, Jones M, Oleson J, Eden N, Arbisi-Kelm T, & Pomper R (2022). What children with developmental language disorder teach us about cross-situational word learning. Cognitive Science, 46(2). 10.1111/cogs.13094
- McMillan GP, & Cannon JB (2019). Bayesian Applications in Auditory Research. Journal of Speech, Language, and Hearing Research, 62, 577–586.
- Nead KT, Wehner MR, & Mitra N (2018). The Use of “Trend” Statements to Describe Statistically Nonsignificant Results in the Oncology Literature. JAMA Oncology, 4(12), 1778–1779. 10.1001/jamaoncol.2018.4524
- Oleson JJ, Brown GD, & McCreery R (2019). The evolution of statistical methods in speech, language, and hearing sciences. Journal of Speech, Language, and Hearing Research, 62(3), 498–506.
- Park CJ, Yelland GW, Taffe JR, & Gray KM (2012). Brief report: The relationship between language skills, adaptive behavior, and emotional and behavior problems in pre-schoolers with autism. Journal of Autism and Developmental Disorders, 42(12), 2761–2766. 10.1007/s10803-012-1534-8
- Pickles A, McCauley JB, Pepa LA, Huerta M, & Lord C (2020). The adult outcome of children referred for autism: typology and prediction from childhood. Journal of Child Psychology and Psychiatry and Allied Disciplines, 61(7), 760–767. 10.1111/jcpp.13180
- Plummer M, Stukalov A, & Denwood M (2022). rjags: Bayesian graphical models using MCMC. https://mcmc-jags.sourceforge.io
- Rivera-Figueroa K, Marfo NYA, & Eigsti I-M (2022). Parental Perceptions of Autism Spectrum Disorder in Latinx and Black Sociocultural Contexts: A Systematic Review. American Journal on Intellectual and Developmental Disabilities, 127(1), 42–63. 10.1352/1944-7558-127.1.42
- Roman-Urrestarazu A, van Kessel R, Allison C, Matthews FE, Brayne C, & Baron-Cohen S (2021). Association of Race/Ethnicity and Social Disadvantage With Autism Prevalence in 7 Million School Children in England. JAMA Pediatrics, 175(6). 10.1001/jamapediatrics.2021.0054
- RStudio Team. (2020). RStudio: Integrated development for R. RStudio, PBC. http://www.rstudio.com/
- Sadia F, & Houssain SS (2014). Contrast of Bayesian and Classical Sample Size Determination. Journal of Modern Applied Statistical Methods, 13(2). 10.22237/jmasm/1414815720
- Simonoff E, Kent R, Stringer D, Lord C, Briskman J, Lukito S, Pickles A, Charman T, & Baird G (2020). Trajectories in Symptoms of Autism and Cognitive Ability in Autism From Childhood to Adult Life: Findings From a Longitudinal Epidemiological Cohort. Journal of the American Academy of Child & Adolescent Psychiatry, 59(12), 1342–1352. 10.1016/j.jaac.2019.11.020
- Sparrow SS, Balla DA, & Cicchetti DV (2016). Vineland Adaptive Behavior Scales, Third Edition. Pearson.
- Stronach ST, & Wetherby AM (2017). Observed and Parent-Report Measures of Social Communication in Toddlers With and Without Autism Spectrum Disorder Across Race/Ethnicity. American Journal of Speech-Language Pathology, 26, 355–368.
- Usta MB, Karabekiroglu K, Sahin B, Aydin M, Bozkurt A, Karaosman T, Aral A, Cobanoglu C, Kurt AD, Kesim N, Sahin İ, & Ürer E (2019). Use of machine learning methods in prediction of short-term outcome in autism spectrum disorders. Psychiatry and Clinical Psychopharmacology, 29(3), 320–325. 10.1080/24750573.2018.1545334
- van de Schoot R, Kaplan D, Denissen J, Asendorpf JB, Neyer FJ, & van Aken MAG (2014). A Gentle Introduction to Bayesian Analysis: Applications to Developmental Research. Child Development, 85(3), 842–860. 10.1111/cdev.12169
- Van Wijngaarden-Cremers PJM, Van Eeten E, Groen WB, Van Deurzen PA, Oosterling IJ, & Van Der Gaag RJ (2014). Gender and age differences in the core triad of impairments in autism spectrum disorders: A systematic review and meta-analysis. Journal of Autism and Developmental Disorders, 44(3), 627–635. 10.1007/s10803-013-1913-9
- Vehtari A, Gelman A, & Gabry J (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413–1432. 10.1007/s11222-016-9696-4
- Wasserstein RL, & Lazar NA (2016). The ASA statement on p-values: Context, process, and purpose. The American Statistician, 70, 129–133. 10.1080/00031305.2016.1154108
- Whitehouse AJO, Watt HJ, Line EA, & Bishop DVM (2009). Adult psychosocial outcomes of children with specific language impairment, pragmatic language impairment and autism. International Journal of Language and Communication Disorders, 44(4), 511–528. 10.1080/13682820802708098
- Winkler RL (2001). Why Bayesian analysis hasn’t caught on in healthcare decision making. International Journal of Technology Assessment in Health Care, 17(1), 56–66. 10.1017/S026646230110406X
- Wittke K, Mastergeorge AM, Ozonoff S, Rogers SJ, & Naigles LR (2017). Grammatical language impairment in autism spectrum disorder: Exploring language phenotypes beyond standardized testing. Frontiers in Psychology, 8, 1–12. 10.3389/fpsyg.2017.00532
- Zachor DA, & Ben-Itzchak E (2020). From Toddlerhood to Adolescence, Trajectories and Predictors of Outcome: Long-Term Follow-Up Study in Autism Spectrum Disorder. Autism Research, 13(7), 1130–1143. 10.1002/aur.2313