Abstract
This article describes benchmark validation, an approach to validating a statistical model. According to benchmark validation, a valid model generates estimates and research conclusions consistent with a known substantive effect. Three types of benchmark validation, (1) benchmark value, (2) benchmark estimate, and (3) benchmark effect, are described and illustrated with examples. Benchmark validation methods are especially useful for statistical models with assumptions that are untestable or very difficult to test. Benchmark effect validation methods were applied to evaluate statistical mediation analysis in eight studies using the established effect that increasing mental imagery improves recall of words. Statistical mediation analysis led to conclusions about mediation that were consistent with established theory that increased imagery leads to increased word recall. Benchmark validation based on established substantive theory is discussed as a general way to investigate characteristics of statistical models and a complement to mathematical proof and statistical simulation.
Keywords: methods validation, replication, mediation
Translational Abstract
It is important to know whether a statistical model provides accurate information to researchers. This paper discusses ways that have been used to judge the accuracy of statistical models and then proposes a new way to judge if a statistical model is accurate. There are two main ways that models are now developed and studied. The first method uses mathematics to derive the best formulas to learn from data. The second method uses computers to generate data where the truth is known and the statistical model is judged by how often it gives the correct answer about the truth. A new method investigates how well the model gives the correct answer when the correct answer is already known to exist based on knowledge in a research area. The new way to understand models is called benchmark validation and it is applied to a model called the mediation model. The mediation model is important for several reasons, including that it helps researchers develop and improve programs to change problem behaviors such as criminal activity, overeating, and other unhealthy behaviors. Benchmark validation is applied to eight studies of the established effect that making images of words leads to better memory for those words than repeating the words over and over. The new method is likely to help researchers focus on the meaning of their results and the accumulation of known research and models in psychology and other sciences.
“It ought to be looked into; how do they know that their method should work”
(California Institute of Technology Commencement Address, Richard Feynman, 1974)
The focus of this paper is the validation of statistical models. A valid statistical model generates accurate estimates and accurate conclusions about the quantities that it was designed to measure. The quote above is from Nobel Prize-winning physicist Richard Feynman, who criticized social science methodology because researchers often do not investigate the accuracy of their methods. Feynman asked whether methods actually give truthful results or if the methodology is followed without criticism as in a cult. Similar concerns have been raised in psychology about why “Psychology isn’t doing very well as a scientific discipline and something seems to be wrong somewhere” (Lykken, 1991, p. 3) because of methodological concerns. Recent examples are criticisms of null hypothesis significance testing (Cumming, 2014; Harlow, Mulaik, & Steiger, 1997; Krantz, 1999; Rodgers, 2010) and concerns about the reproducibility of psychological results (Anderson & Maxwell, 2016; Lindsay, 2015; Nosek et al., 2015). Simultaneously, data are increasingly available from many sources that may be useful for validating statistical methods (Maxwell, Lau, & Howard, 2015; Miguel et al., 2014; National Institutes of Health, 2009; Nosek et al., 2015; Open Science Collaboration, 2015; Perrino et al., 2013). The validity of methodology is central to any field that considers itself a science, and approaches to help investigate the veracity of methods are needed. Controversies about widely used methods can be unsettling and confusing, particularly to early-career researchers, but they also represent opportunities to consider additional methods and approaches that may improve psychological science. In this light, we describe benchmark validation as a potential complement to other methods for validating statistical models.
The purpose of this paper is (1) to describe benchmark validation as an additional approach to assessing the accuracy of statistical models, and (2) to illustrate benchmark validation of mediation analysis with an established mediation relation. Mediation analysis was selected because the application and development of methods to assess mediating variables has grown rapidly. Growth in the application of these methods has led to serious concerns about the causal conclusions from these analyses, especially the extent to which untestable assumptions of mediation analysis are satisfied (Bullock, Green, & Ha, 2010; Holland, 1988; James, 1980; MacKinnon & Fairchild, 2009; McDonald, 1997). Alternative methods to improve assessment of mediation effects have also been developed (Coffman & Zhong, 2012; Jo, 2008; Frangakis & Rubin, 2002; Imai, Keele, & Tingley, 2010; MacKinnon & Pirlott, 2015; Pearl, 2001, 2014; Valeri & VanderWeele, 2013).
How are the validity and usefulness of statistical models in psychology assessed? Three different validation methods have been used in the methodological literature. First, models are most commonly developed and assessed mathematically with proof that estimators are unbiased and consistent. The bulk of this work emanates from statistics and mathematics and forms the basis of valid models in many fields. A second validation approach is based on simulation studies where the true population model is known because it is programmed in the simulation. The performance of statistical models is assessed by repeatedly sampling from the population and identifying which statistical models correctly identify when a phenomenon exists and when it does not exist. The simulation approach is widely used to evaluate statistical models, especially the performance of statistical model estimators as a function of sample size, known as finite sample bias. A third approach is the application of statistical models to real data to judge whether the method provides useful and reasonable information. Some journals, such as Multivariate Behavioral Research, require the application of new methods to a real data set. We describe a rarely used method related to this third approach, which we call benchmark validation (BV): it identifies a substantive finding that is widely accepted in a research area and assesses the extent to which a statistical model correctly recovers the known research finding. Within this approach are studies that evaluate whether models obtain an effect equal to a certain benchmark value, studies that evaluate whether models obtain an effect close to a benchmark estimate, and studies that evaluate whether a model yields a correct conclusion about the presence of an effect, which we call benchmark effect studies. This last approach, assessing whether a statistical model yields correct answers about a known benchmark effect, is applied in this paper.
Benchmark effect validation is the use of a known effect to validate that a statistical method yields correct answers both when an effect is known to exist and when an effect is known not to exist. Ideally, several known effects are investigated. To date, benchmark validation has rarely been used, but it could complement mathematical and simulation work and is especially valuable when assumptions of a method are untestable or it is difficult to obtain information to justify the assumptions. In this study, the effect of mental imagery on word recall is the benchmark effect studied, and mediation analysis is the statistical model evaluated, for reasons developed below.
Benchmark Validation
Compared to other model validation strategies, benchmark validation is especially valuable whenever there are untestable or unknown assumptions of a statistical model. Statistical simulations use a similar logic to benchmark validation, in that results from a statistical model are compared to known results. However, in the case of statistical simulation, the known results are artificially specified by the researcher as a population data-generating model. The population data-generating model often includes several statistical assumptions that may or may not be easy to justify or even test in real data. Some assumptions may be almost impossible to evaluate in real data, such as the unmeasured confounding assumption in mediation analysis, described later. In this case, a data-generating model is an inadequate standard against which models that violate the assumptions can be compared, because the data-generating model is unlikely to represent violations of assumptions for real empirical data. The researcher does not know the true underlying model for real data. A useful benchmark effect would not be constrained by artificial assumptions, so that statistical models with untestable or unknowable assumptions could still be validated against a known effect. For example, the illustrative data in this paper may contain violations of untestable statistical assumptions, so a statistical model that still detects the benchmark effect is valid and robust to the properties of real-world data that may violate statistical assumptions.
We propose three general types of benchmark validation studies: (1) benchmark value, (2) benchmark estimate, and (3) benchmark effect. For a benchmark value study, there is an exact value that should be obtained from a statistical analysis. For example, MacKinnon (1986) evaluated two models developed to estimate the number of words in human memory storage. Students were asked to recall each of the states in the United States four different times, and the different patterns of recall or failure to recall each state across the four trials were recorded. For example, Massachusetts may have been remembered on the first and third trial but not the second and fourth trial. Arizona may have been remembered only on the third trial and not on any other trial. The pattern of recall across the four trials provided information that could be used to estimate the total number of states, which should be 50. Here the benchmark value of 50 would be expected from a statistical model even though fewer states may have been remembered in repeated testing of the names of the 50 states (MacKinnon, 1986; Millsap & Meredith, 1987). MacKinnon (1986) evaluated a capture-recapture log-linear model to estimate the number of states (by estimating the number of states never remembered) and a Markov learning model to estimate the total number of states (by estimating a memory storage parameter). Neither of these models had estimates close to 50 (based on testing the null hypothesis that the number of states in memory storage was equal to 50), so there was no evidence of benchmark value validation for these models. Benchmark values may be very difficult to find in many social science research areas because most social science research focuses on differences between groups and does not predict a specific value for that difference.
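The logic of estimating a known total from recall patterns can be sketched with the simplest capture-recapture formula. This is not the log-linear or Markov model MacKinnon (1986) evaluated; it is a hypothetical two-trial Lincoln-Petersen sketch, with made-up recall data, showing how overlap between trials yields an estimate of the total count, here recovering the benchmark value of 50.

```python
# Illustrative sketch only: the two-trial Lincoln-Petersen estimator, a simpler
# relative of the capture-recapture models described above. Data are hypothetical.

def lincoln_petersen(trial1, trial2):
    """Estimate total population size from two recall trials.

    trial1, trial2: sets of items recalled on each trial.
    """
    n1, n2 = len(trial1), len(trial2)
    m = len(trial1 & trial2)  # items recalled on both trials
    return n1 * n2 / m

# Hypothetical recall patterns for the 50-states task: 40 states recalled on
# trial 1, 40 on trial 2, and 32 on both trials.
trial1 = set(range(40))
trial2 = set(range(8, 48))
estimate = lincoln_petersen(trial1, trial2)  # 40 * 40 / 32 = 50.0
```

A benchmark value study would then ask whether such an estimate is statistically close to the known total of 50.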
A more realistic type of study is a benchmark estimate study. Consider the case of two variables, X and Y, where X represents randomization to two groups and Y is a measured variable after randomization to the two groups. A variety of evidence exists for the accuracy of the causal estimator of X on Y under assumptions including successful random assignment and accurate measurement of X and Y (Holland, 1986; Rubin, 1974). A more complicated case involves situations where X is nonrandomized, and Y is measured. In this design, there is considerable controversy regarding when and if this design captures causal relations (Cook, Shadish, & Wong, 2008; Shadish, Cook, & Campbell, 2002). A primary limitation of correct inference from this design is the possibility of confounding variables related to both assignment to groups and the dependent variable. Benchmark validation strategies are useful to compare the estimates from nonrandomized designs (that may violate statistical assumptions) against estimates from randomized designs for a known effect. For example, Shadish, Clark, and Steiner (2008) invited college students to participate in a study where the known effect was the effect of academic training on academic performance. Participants were randomized to participate either in a randomized arm of the study or a nonrandomized arm of the study. Students in the randomized arm of the study were randomized to either receive verbal or quantitative training. Students in the nonrandomized arm of the study chose verbal or quantitative training. Because participants were randomized to a randomized or nonrandomized arm, the estimated effect from the randomized arm is the benchmark estimate to be compared to the effect from the nonrandomized arm. The effects in the nonrandomized arm were estimated using models that adjust for nonrandomization (such as propensity score matching). 
Here the untestable assumption investigated was that the propensity score method led to an unbiased estimate of the causal effect because it adjusts for confounders of assignment to groups. The estimate of the causal effect in the randomized arm was used as the benchmark estimate to compare effects from the nonrandomized arm, which is why these studies are called within-study comparison studies (Cook, Shadish, & Wong, 2008). Shadish et al. (2008) found that estimates from methods based on the propensity of group membership and analysis of covariance with all relevant covariates were close to the estimate from the randomization study, validating the use of these methods in nonrandomized studies. Similar benchmark estimate validation was used to evaluate the regression discontinuity design (Shadish et al., 2011), interrupted time series design (St. Clair, Hallberg, & Cook, 2016), the use of pretest measures to reduce selection bias (Hallberg, Cook, Steiner, & Clark, 2016), and the effects of preference as well as randomization (Long, Little, & Lin, 2008; Marcus, Stuart, Wang, Shadish, & Steiner, 2012).
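The within-study comparison logic can be illustrated with a toy simulation. The code below is not Shadish et al.'s (2008) analysis; it is a minimal sketch with hypothetical values, showing how self-selection on a measured covariate biases the naive group difference and how a propensity-based adjustment (here, inverse-propensity weighting) can recover an estimate close to the randomized benchmark.

```python
import numpy as np

# Toy within-study comparison sketch. All values are hypothetical.
# Participants self-select into training based on a measured covariate,
# biasing the naive group difference; inverse-propensity weighting adjusts
# for the selection.

rng = np.random.default_rng(2)
n = 200_000
covariate = rng.normal(size=n)                 # e.g., prior ability (measured)
p_treat = 1.0 / (1.0 + np.exp(-covariate))     # self-selection probability
t = (rng.random(n) < p_treat).astype(float)    # chosen training condition
y = 2.0 * t + 1.5 * covariate + rng.normal(size=n)  # true training effect = 2.0

# Naive difference in means is biased upward by self-selection.
naive = y[t == 1].mean() - y[t == 0].mean()

# Inverse-propensity weighting with the true propensity (estimated in practice)
# yields an estimate close to the randomized-arm benchmark of 2.0.
w1, w0 = t / p_treat, (1.0 - t) / (1.0 - p_treat)
ipw = (w1 * y).sum() / w1.sum() - (w0 * y).sum() / w0.sum()
```

The key assumption mirrored here is the one the within-study comparisons test empirically: that all covariates driving selection are measured and included in the adjustment.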
An analogous method was used by Lalonde (1986) where he compared an estimate from a randomized study of employment training with econometric-based estimates of employment training from a field study. None of the econometric models based on nonrandomized methods gave estimates close to the estimate from the randomized study, calling into question nonrandomized methods for the evaluation of employment training effects. In this case, the benchmark estimate was different from the estimate from econometric methods that relied on assumptions to verify the validity of the method. Dehejia and Wahba (1999) used the same data and demonstrated that the propensity score method gave results closer to the benchmark estimate value, but Smith and Todd (2005) criticized this result because a special sample was used that had more stable weights in the Dehejia and Wahba study. In general, criticisms of the Lalonde study include that the sample used for comparison was not necessarily in the same labor market, the dependent variable was not measured in the same way across samples, and there were limited numbers of covariates for propensity methods (Glazerman, Levy, & Myers, 2003). Overall, however, there was no evidence for benchmark estimate validity of the econometric models used with this example.
The Shadish et al. (2008) and the Lalonde (1986) studies are examples of a benchmark study where the estimate in the randomized study is used as the benchmark estimate to evaluate statistical models. In both of these studies it would be ideal to repeat the study a large number of times to verify that the benchmark estimate is close to the population value and not just the one study that had an estimate of that value. Whereas a benchmark estimate study can be performed to assess the validity of models that account for nonrandomization in a bivariate relation, it is nearly impossible to perform a similar study for mediation relations because there will not be a way to obtain a mediated effect estimate from randomization. It is possible to use a benchmark effect approach to validate mediation models, as described next.
Another type of benchmark study is a benchmark effect study. In this study, an effect is known, and the validity of a statistical model is evaluated by the extent to which it obtains results that demonstrate this effect. Dwyer (1992), for example, used the positive relation between relative weight and blood pressure to validate a differential equation model for longitudinal data relating change in relative weight to change in blood pressure. An early example of benchmark effect validation was the validation of a scale measuring attitudes towards the church by showing scale differences between groups of church members and nonmembers (Cronbach & Meehl, 1955; Thurstone & Chave, 1929). Research on the Minnesota Multiphasic Personality Inventory (MMPI, Butcher et al., 1992) also compared profiles of persons with and without a certain syndrome (Cronbach & Meehl, 1955) to validate the scale and to develop profiles useful for clinicians. A recent example of a benchmark effect study was conducted by Mooij, Peters, Janzing, Zscheischler, and Scholkopf (2016) where they evaluated methods to assess the direction of causation between two variables from observational data using modern causal and machine-learning methods to identify true cause-effect relations. The researchers used the Max Planck data on 100 different cause-effect pairs selected from 37 data sets in fields such as meteorology, biology, medicine, and engineering. The Max Planck data include example data sets where the direction of relation between an X and Y variable is considered known based on prior research or logic. One example is that altitude causes temperature and another example is that age causes weight in mollusks. Mooij et al. (2016) evaluated the accuracy rates of two machine learning methods, additive noise modeling and information geometric causal inference, by comparing the model results to the known directionality of the effects. One of the best methods, an additive noise modeling approach, had an accuracy of 64% (plus or minus 10%).
Recently, Thoemmes (2015) conducted a benchmark effect study with the Max Planck data in the evaluation of a method to assess temporal relations using only cross-sectional data, based on technical work by Dodge and Rousson (2000; 2001) and von Eye and DeShon (2012). The rationale of this approach is that a dependent variable is more likely to have a normal distribution than an independent variable, so the skewness of the dependent variable will be smaller than the skewness of the independent variable; that is, the moments of the dependent variable are closer to those of a normal distribution than the moments of the independent variable. Thoemmes (2015) used data from the Max Planck Institute on two-variable systems where the independent and dependent variables were considered known. If the method of determining temporal precedence based on cross-sectional data was valid, then the method ought to have correctly identified the temporal direction of these known effects. Across the 65 tests evaluated in the paper, the known causal order was rarely identified, although the rate improved when the assumptions of the test were fulfilled. Generally, the conclusion from this study was that the temporal-relation detection methodology was not validated.
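The skewness rationale can be made concrete with a small sketch. The function names and simulated data below are hypothetical and do not reproduce Thoemmes's (2015) tests; the sketch only illustrates the core rule that when an effect equals its cause plus independent symmetric noise, the effect is less skewed than the cause.

```python
import numpy as np

# Minimal sketch of the skewness-comparison idea behind Dodge and Rousson
# (2000; 2001): adding independent symmetric noise pulls a dependent variable
# toward normality, so the variable with the smaller absolute skewness is
# taken to be the effect. Names and data are hypothetical.

def sample_skewness(v):
    v = np.asarray(v, dtype=float)
    z = (v - v.mean()) / v.std()
    return np.mean(z ** 3)

def infer_direction(x, y):
    """Return 'x->y' if x is more skewed (treated as the cause), else 'y->x'."""
    return "x->y" if abs(sample_skewness(x)) > abs(sample_skewness(y)) else "y->x"

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=10_000)   # skewed cause
y = x + rng.normal(size=10_000)               # effect = cause + symmetric noise
direction = infer_direction(x, y)
```

Benchmark effect validation asks how often such a rule returns the known direction across many real cause-effect pairs, which is exactly what the Max Planck data permit.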
In summary, benchmark validation consists of conducting a research study where the value of an effect is known, such as the total number of states in the United States, or where part of the design of the study includes a benchmark estimate value that is considered the true effect, or where an effect is well-established in the research literature and should be observed with a valid statistical model. Previous studies did not consider the case where a method should not find an effect. In addition to the validation of a statistical model by showing that it yields correct answers when an effect is known to exist, it is also useful to demonstrate that the model gives correct answers when an effect is known not to exist. The benchmark validation method does require that there is an established benchmark (and ideally an established lack of an effect) and that suitable data are available to validate the benchmark. Benchmark validation may focus on a benchmark value, benchmark estimate, or benchmark effect and is a useful method when assumptions are untestable or difficult to test, as in mediation analysis.
Mediating Variables and Problems with Mediation Analysis
In psychology and many other disciplines, mediating variables are central to scientific progress because they explain how nature operates (Kashy, Donnellan, Ackerman, & Russell, 2009; MacKinnon, 2008; Spencer, Zanna, & Fong, 2005). The goal of mediation analysis is to evaluate whether a variable transmits the effects of one variable to another variable so that there is a causal sequence by which a manipulation, X, causes a mediating variable, M, which then causes an outcome variable, Y. Mediation analysis assesses whether there is evidence that M serves as a mediator in this causal sequence. Mediation analysis is challenging because typically the researcher can randomize only the initial variable in the sequence, and randomization of X does not ensure randomization of M. Theoretical mediation processes are widespread in psychology and include dissonance theory and memory theory. Valid mediation models are needed to investigate psychological phenomena that form the basis of theory. Applied research focuses on the design of interventions to change mediators that in turn change outcomes which are needed for the development of successful and efficient interventions.
Mathematical derivations have specified estimands and estimators of mediated effects including the correct standard errors based on assumptions regarding model coefficients and the appropriate distributions for mediation quantities (MacKinnon & Dwyer, 1993; Sobel, 1982). Statistical simulation studies have evaluated which models yield correct answers when the population mediation model is known (MacKinnon & Dwyer, 1993; MacKinnon, Lockwood, Hoffman, West, & Sheets, 2002). The application of mediation analysis has grown substantially and researchers generally find mediation analysis useful (MacKinnon, 2008). However, traditional mediation analysis based on linear regression or covariance structure modeling has been criticized based on a potential outcomes formulation of mediation (Holland, 1988). Some of these criticisms have been severe (Bullock et al., 2010). Most of these criticisms focused on the problems in the evidence for the causal sequence from a manipulation to a mediator to an outcome variable. In particular, the potential outcomes model for mediation demonstrates that mediation analysis requires untestable assumptions about comparisons of theoretical quantities that are impossible to ever measure because of unobserved potential outcomes (also called nested counterfactuals), such as mediator values in the reference group1 for persons in the treatment group and mediator values in the treatment group for persons in the reference group. Even when participants are randomized to conditions, there is no way to rule out confounding of the mediator to outcome relation. These criticisms based on the potential outcomes model have called into question the usefulness of mediation analysis models. Some results indicate that a true mediated effect can never be determined from data unless certain assumptions hold.
To make this more concrete, consider modeling the effect of X on Y. One possibility is to conduct an experiment within an experiment as done by Shadish et al. (2008) so that a randomized experimental mediation effect can be obtained. The randomized experiment effect provides a benchmark estimate against which other models can be compared. For mediation, it is difficult to conceive of a study that would provide the true mediated effect because of the problem of unobserved potential outcomes, as the mediator cannot easily be directly randomized. If a randomized experiment cannot be conducted in order to provide a benchmark estimate or value, we propose that a benchmark effect can instead be used to validate a statistical model. Evaluating whether a statistical model yields the correct answer in a substantive dataset with a known effect can help evaluate the accuracy of mediation models.
Statistical Mediation Analysis
Statistical mediation analysis is traditionally conducted by using information from two of the following three equations.
Y = b0Y,1 + cX + eY,1 (1)

Y = b0Y,2 + c′X + bM + eY,2 (2)

M = b0M + aX + eM (3)
Y is the dependent variable, M is the mediator, and X is the binary randomized independent variable. Note that Y and M are continuous variables. Linear relations are assumed between variables. The coefficient c represents the relation between X and Y (see Figure 1), c′ is the coefficient relating X to Y adjusted for the effects of M, b is the coefficient relating M to Y adjusted for the effects of X, a is the coefficient relating X to M, eY,1, eY,2, and eM represent unexplained residuals, and the intercepts are b0Y,1, b0Y,2, and b0M (see Figure 2). Note that both c and c′ are coefficients relating X to Y, but c′ is a partial effect, adjusted for the effects of M. In a sample, â, b̂, ĉ, and ĉ′ are estimators of a, b, c, and c′, respectively. The interaction of X and M is sometimes needed when an intervention is known to modify the strength of the relation between M and Y across levels of X. As shown in Equation 4, the interaction of X and M is represented by coefficient h estimated with ĥ in a sample and the unexplained residual is eY,3.
Figure 1.
X to Y Model.
Figure 2.
X to M to Y Mediation Model.
Y = b0Y,3 + c′X + bM + hXM + eY,3 (4)
If ĥ is nonzero, then the b̂ and ĉ′ coefficients differ across levels of X. Often the interaction of X and M is assumed to be zero and is not included in mediation analysis because the mediator to dependent variable relation is considered a known consistent relation based on past research and theory (MacKinnon, 2008). With random assignment of X, ĉ and â represent causal effects (with minor assumptions) but ĥ, b̂, and ĉ′ do not have a causal interpretation without further assumptions (Holland, 1988; MacKinnon, 2008; Robins & Greenland, 1992; VanderWeele & Vansteelandt, 2009). The product of â and b̂, âb̂, is the estimator of the mediated effect if ĥ is zero. If ĥ is nonzero, then the b̂ coefficient and the âb̂ mediated effect differ across levels of X. The causal interpretation of the âb̂ mediated effect requires a self-contained model with no omitted influences, correct functional form for the relations in the mediating process, psychometrically sound measures, uncorrelated errors across equations, correct temporal precedence, and correct timing of measurement to capture the mediation process (MacKinnon, 2008). Four assumptions are needed to identify direct and indirect effects (Pearl, 2001; VanderWeele & Vansteelandt, 2009; Valeri & VanderWeele, 2013).
1. The effect of the independent variable X on Y is unconfounded conditional on covariates.
2. The effect of the mediator M on the dependent variable Y is unconfounded conditional on X and covariates.
3. The effect of the independent variable X on the mediator M is unconfounded conditional on covariates.
4. There is no effect of the independent variable that itself confounds the M to Y relationship.
All four assumptions are necessary to estimate direct and indirect effects from observed data. Randomization to X will satisfy assumptions 1 and 3 but not assumptions 2 and 4. Randomization of individuals to levels of M is unlikely in most studies because individuals self-select their value of M; it is not randomly assigned to them. Assumption 2 is important because it cannot be tested or validated except in hypothetical situations. That is, X is randomized but the value of M for each participant is not randomized and is a function of individual characteristics. It is possible to evaluate the sensitivity of the mediation results of the study to possible confounders of the M to Y relation as we will demonstrate later in this paper. However, the sensitivity analysis posits hypothetical confounder effects. The assumption of no confounding of M to Y is untestable and has been considered problematic by several researchers (Bullock et al., 2010; Holland, 1988; Imai et al., 2010; MacKinnon, 2008). The purpose of the memory mediation example used in this paper is to evaluate whether mediated effects would be present even though it is impossible to ensure that all assumptions are satisfied, especially the assumption that there is no confounding of the M to Y relation.
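As a minimal sketch of the product-of-coefficients estimator from Equations 2 and 3, and of why assumption 2 matters, the simulation below (all parameter values hypothetical) estimates âb̂ from two least-squares regressions, first when the M to Y relation is unconfounded and then when an unmeasured variable confounds M and Y; in the second case the estimate is biased even though X is randomized.

```python
import numpy as np

# Sketch of the ab product-of-coefficients estimator on simulated data,
# plus a toy demonstration of the untestable no-confounding assumption
# for the M to Y relation. All values are hypothetical.

rng = np.random.default_rng(1)
n = 100_000
a_true, b_true, c_prime_true = 0.5, 0.4, 0.2    # so ab = 0.20

x = rng.integers(0, 2, size=n).astype(float)    # randomized binary X
u = rng.normal(size=n)                          # unmeasured confounder of M and Y

def ols(design, outcome):
    # ordinary least squares; first design column is the intercept
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef

def ab_estimate(x, m, y):
    ones = np.ones_like(x)
    a_hat = ols(np.column_stack([ones, x]), m)[1]       # a from Equation 3
    b_hat = ols(np.column_stack([ones, x, m]), y)[2]    # b from Equation 2
    return a_hat * b_hat

# Assumption 2 holds: ab-hat is close to the true mediated effect of 0.20.
m0 = a_true * x + rng.normal(size=n)
y0 = c_prime_true * x + b_true * m0 + rng.normal(size=n)
ab_clean = ab_estimate(x, m0, y0)

# Assumption 2 violated: u confounds the M to Y relation, so ab-hat is
# biased even though X is randomized.
m1 = a_true * x + u + rng.normal(size=n)
y1 = c_prime_true * x + b_true * m1 + u + rng.normal(size=n)
ab_confounded = ab_estimate(x, m1, y1)
```

Because the confounder is unmeasured, no diagnostic computed from x, m1, and y1 alone can reveal the bias, which is why benchmark validation against a known substantive effect is an attractive complement to simulation.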
Imagery and Memory Theory Data Illustration
The substantive area for benchmark effect validation in this paper is human memory. A known phenomenon in human memory research is that individuals instructed to make mental images of words recall more words than individuals given no instructions or instructions to repeat the words. Levels of processing theory predicts that word recall is improved because the level of processing of the words is deeper when using imagery than repetition (Craik, 2002; Craik & Lockhart, 1972; Craik & Tulving, 1975; Paivio, 1971). The mediation hypothesis is that the instructions increase mental imagery of the words, which then increases word recall. In this memory example, a valid mediation analysis model should provide evidence consistent with mediation of instruction effects through use of imagery to increased word recall but not mediation through repetition of the words. Benchmark effect validation of the mediation model could be assessed by conducting experiments that demonstrate how imagery leads to improved memory and assessing the extent to which the model yields an answer consistent with the known effect.
For the imagery and memory example, X is randomized but M and Y are observed. It is similar to the randomized study for a single randomized X variable and dependent Y variable in that X on Y and X on M are causal effects owing to randomization. However, the M to Y relation is not randomized and there may be variables that confound this relation. Furthermore, as described earlier, the traditional mediation analysis ignores a potential outcome that is impossible to measure, e.g., the mediator score under the reference condition for a person in the treatment group and the mediator score under the treatment condition for a person in the reference group. In this context, the known effect is the mediating effect of imagery instructions to imagery to word recall. A valid model should conclude that mediation is present even though there may be additional variables that may confound the M to Y relation and there are nested potential outcomes that are impossible to measure. Similarly, a mediation effect through repetition, though this may increase memory somewhat, should not be substantial based on memory processing theory. Mediation analysis models should lead to the correct conclusion about mediation through imagery, but not through repetition.
Single Mediator Analysis: Benchmark Effect Validation of Statistical Mediation Analysis of Imagery as a Mediator of Instructions on Memory
We have described the notion of benchmark validation as a method to evaluate statistical models and described several examples. We described statistical mediation analysis and its limitations owing to untestable assumptions such as the possible confounding of the M to Y relation. We described the research literature on the effect of imagery on word recall. Next, benchmark effect validation is applied to mediation analysis for the substantive mediation hypothesis of an experimental manipulation increasing imagery to increase word recall. The purpose of the single mediator analysis is to illustrate the application of benchmark validation methods for statistical mediation analysis in the common situation where a necessary assumption of the statistical analysis is impossible to ever verify, namely that the relation between M and Y is a true causal effect.
Methods
Data were obtained from eight experiments. The data were collected on the first day of class as part of the first author’s classroom teaching. The pedagogical value of the experiment was that students would have first-hand knowledge of the experiment, thereby increasing their understanding of course concepts. The memory manipulation was chosen because the sample size in each class was likely to provide reasonable statistical power to detect an effect, while remaining small enough that Type II errors could still occur. Permission to use the data was obtained from the university Institutional Review Board.
Participants were told that they would hear a list of words and would be asked to recall them. Prior to the experiment, participants were given a sheet of paper with one of two instructions, for either primary (repetition) or secondary (imagery) rehearsal. The primary rehearsal condition gave the instructions, “As you hear each word, repeat it over and over until you hear the next word.” The secondary rehearsal condition gave the instructions, “As you hear each word, make an image out of the word and the other words that you hear. For example, if you heard the words camel and woman, you would imagine a woman riding a camel.” The sheets were shuffled and then handed out randomly to students in the class. Participants were instructed to listen to the words and follow the instructions on the sheet rather than use some other strategy. At this point, the 20 words were read with 10 seconds between each word. Ten seconds after the last word, students were asked to write down as many of the words as they could. After a few minutes, they rated the extent to which they made images of the words on a scale of 1 (not at all) to 9 (absolutely), and then rated the extent to which they repeated the words over and over on the same scale. Additional materials for the experiment, the word list, data from the experiments, and additional analyses are available in the supplementary materials or by contacting the first author.
The variables used in the statistical analysis were: X (repetition or primary rehearsal = 0, imagery or secondary rehearsal = 1), R, repetition rated on a 1 to 9 scale, I, imagery rated on a 1 to 9 scale, and Y, total word recall out of 20 words. Repetition and imagery were mean-centered prior to all analyses. Five participants did not have complete data, leaving a total N of 369 across the eight studies (ns of 77, 43, 24, 79, 22, 45, 35, and 44).
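To make the analysis concrete, the two regression equations underlying the single mediator model (M regressed on X, and Y regressed on X and M) can be sketched on simulated data. Everything below is a hypothetical illustration: the coefficients, sample size, and noise levels are assumptions, not the study's values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data mimicking one study: X randomized (0 = repetition,
# 1 = imagery), M = imagery rating, Y = words recalled.
# Coefficients are illustrative, not the paper's estimates.
n = 80
X = rng.integers(0, 2, n)
M = 3.5 * X + rng.normal(0, 1.5, n)               # a-path ~ 3.5
Y = 10 + 0.5 * X + 0.5 * M + rng.normal(0, 2, n)  # b ~ 0.5, c' ~ 0.5
M = M - M.mean()                                  # mean-center the mediator

def ols(y, *cols):
    """OLS coefficients (intercept first) via least squares."""
    Xmat = np.column_stack([np.ones_like(y, dtype=float)] + list(cols))
    beta, *_ = np.linalg.lstsq(Xmat, y, rcond=None)
    return beta

c_total = ols(Y, X)[1]        # total effect of X on Y
a = ols(M, X)[1]              # effect of X on M
_, c_prime, b = ols(Y, X, M)  # direct effect and M-to-Y slope
ab = a * b                    # mediated (indirect) effect
print(a, b, c_prime, ab)
```

With ordinary least squares and an intercept in each equation, the identity ab = c − c′ holds exactly in the single mediator model, which is a useful check on the estimates.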
Recall that we are interested in testing the validity of the single mediator model in replicating a benchmark mediated effect of our experimental manipulation on total words recalled through its effect on use of mental imagery. We used four criteria to determine the validity of the single mediator model across the eight studies. First, the mediated effect should be statistically significant in the majority of studies; because of sampling variability and small sample size in some studies, we did not expect the mediated effect to be statistically significant in all studies. Second, the imagery mediated effect should generally be positive, though with sampling variability it is possible that the effect could be negative. Third, the mediated effect should be of meaningful size, such as at least a third of a word or at least 0.2 standard deviations (a small effect size). Fourth, the confidence interval obtained using the data accumulation procedures of Howard, Maxwell, and Fleming (2000) should not include zero. The final dataset combining all data from all studies represents the most complete evaluation of the benchmark effect validation of the mediation model for the imagery example. If the mediated effect was significant and positive for most of the studies, had a meaningful effect size, and had an accumulated-data confidence interval that did not include zero, we concluded that the single mediator model was a valid model for estimating the effect of the experimental manipulation on total words recalled through its effect on use of mental imagery. The statistical mediation model may not be validated if the constellation of assumptions violated in real data makes mediation analysis an inaccurate method.
Results
Mediation Analysis
As shown in Table 1, there was a significant effect of X on M (M is the measure of imagery) in every study, consistent with an effect of the manipulation on the reported imagery rating. The results in Table 1 are based on estimating Equations 2 and 3. Five of the eight studies had a significant total effect of the manipulation on words recalled, with an average of 2.497 words. The relation of M to Y was statistically significant in five of eight studies and was always positive, with a minimum value of 0.317. Based on distribution of the product confidence intervals (MacKinnon, Fritz, Williams, & Lockwood, 2007; Tofighi & MacKinnon, 2011), the mediated effect was statistically significant in five studies, averaging 2.023 words, and was in the same direction in all eight studies. The direct effect was never statistically significant and was positive five times and negative three times, consistent with a direct effect close to zero. Note that not all tests of mediation were statistically significant, owing to sampling error and to differences in sample size across the replications.
Table 1.
Mediation Model Results Across Eight Memory Studies
| Study | a | b | c | c′ | ab |
|---|---|---|---|---|---|
| 1 | 3.746* (0.478) | 0.389* (0.160) | 0.343 (0.683) | −1.113 (0.892) | 1.457* (0.632) |
| 2 | 3.924* (0.777) | 0.367 (0.195) | 3.307* (1.001) | 1.868 (1.237) | 1.440 (0.831) |
| 3 | 5.748* (0.602) | 0.317 (0.351) | 4.727* (0.989) | 2.906 (2.250) | 1.822 (2.038) |
| 4 | 3.238* (0.505) | 0.655* (0.125) | 2.259* (0.641) | 0.138 (0.684) | 2.122* (0.527) |
| 5 | 5.283* (0.795) | 0.494 (0.388) | 1.133 (1.401) | −1.475 (2.472) | 2.608 (2.110) |
| 6 | 2.545* (0.842) | 0.696* (0.153) | 1.573 (1.022) | −0.198 (0.933) | 1.771* (0.715) |
| 7 | 3.883* (0.769) | 0.715* (0.206) | 4.117* (1.053) | 1.341 (1.213) | 2.776* (0.983) |
| 8 | 3.558* (0.689) | 0.614* (0.226) | 2.517* (1.084) | 0.332 (1.292) | 2.185* (0.922) |
Note. Standard errors appear in parentheses. Results of estimating Equations 2 and 3.
The mediated effect was tested using distribution of the product confidence intervals.
* indicates p < 0.05.
An effect size measure for the mediated effect, the standardized mediated effect, was calculated by dividing the mediated effect by the standard deviation of Y. The standardized mediated effect represents the difference in the mediated effect between imagery and repetition groups in standard deviations of Y (MacKinnon, 2008; Miočević, O’Rourke, MacKinnon, & Brown, 2017). The standardized mediated effect ranged from 0.395 to 0.803, much larger than the criterion standardized effect size of 0.2. Using the percentile bootstrap, five of the eight studies had confidence intervals for the standardized mediated effect that did not contain zero.
The results in Table 2 were obtained by estimating Equations 3 and 4. Because X was coded 0 for repetition and 1 for imagery, when the interaction of X and M is included the coefficients for b and c′ apply to the repetition group. One of the eight studies had a statistically significant interaction of X and M on Y; five of the studies had a negative value for the interaction and three had a positive value, suggesting that the X and M interaction is close to zero. The mediated effect was negative in the one study with a statistically significant XM interaction, but further analysis showed that the M to Y relation was positive in the imagery condition, so the mediated effect was positive in the imagery condition.
Table 2.
Mediation Model Results Across Eight Memory Studies with the X and M Interaction
| Study | a | b | c | c′ | ab | h |
|---|---|---|---|---|---|---|
| 1 | 3.746* (0.478) | −0.019 (0.181) | 0.343 (0.683) | −1.791* (0.837) | −0.072 (0.684) | 1.194* (0.309) |
| 2 | 3.924* (0.777) | 0.496* (0.236) | 3.307* (1.001) | 2.152 (1.271) | 1.948* (1.020) | −0.413 (0.421) |
| 3 | 5.748* (0.602) | 0.443 (0.395) | 4.727* (0.989) | 3.927 (2.667) | 2.549 (2.398) | −0.664 (0.904) |
| 4 | 3.238* (0.505) | 0.557* (0.137) | 2.259* (0.641) | −0.424 (0.757) | 1.803* (0.530) | 0.524 (0.316) |
| 5 | 5.283* (0.795) | 1.107* (0.473) | 1.133 (1.401) | −1.200 (2.300) | 5.850* (2.676) | −1.465 (0.731) |
| 6 | 2.545* (0.842) | 0.941* (0.216) | 1.573 (1.022) | −0.389 (0.925) | 2.394* (0.981) | −0.477 (0.302) |
| 7 | 3.883* (0.769) | 0.811* (0.219) | 4.117* (1.053) | 2.617 (1.594) | 3.151* (1.068) | −0.744 (0.609) |
| 8 | 3.558* (0.689) | 0.510* (0.241) | 2.517* (1.084) | −0.878 (1.629) | 1.815* (0.941) | 0.814 (0.674) |
Note. Standard errors appear in parentheses. Results of estimating Equations 3 and 4.
The mediated effect was tested using distribution of the product confidence intervals.
* indicates p < 0.05.
Cumulative Data Analysis
A cumulative data analysis approach was used to aggregate the data across the eight memory studies and sequentially update the estimate of the mediated effect (Howard et al., 2000). This analysis was conducted so that the benchmark effect could be evaluated as a whole rather than by merely counting the number of statistically significant tests. The cumulative analysis was conducted by estimating the mediated effect from Study 1 data, then adding the data from the next study in the sequence (i.e., Study 2 data) and estimating the mediated effect on the combined dataset. This process was continued until the data from all eight memory studies were included in a single dataset. Imagery was mean-centered in each cumulative dataset (e.g., mean-centered in Study 1 data and then mean-centered in Study 1 + Study 2 data). Confidence intervals for the mediated effect estimate were created using asymmetric distribution of the product confidence intervals (MacKinnon et al., 2007; Tofighi & MacKinnon, 2011).
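The cumulative updating procedure can be sketched as follows, using eight simulated datasets whose sizes echo the study ns; the coefficients are illustrative assumptions, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Eight hypothetical study datasets (sizes echo the paper's range).
sizes = [77, 43, 24, 79, 22, 45, 35, 44]
studies = []
for n in sizes:
    X = rng.integers(0, 2, n).astype(float)
    M = 3.5 * X + rng.normal(0, 1.5, n)  # illustrative a-path
    Y = 10 + 0.5 * X + 0.5 * M + rng.normal(0, 2, n)
    studies.append((X, M, Y))

def mediated_effect(X, M, Y):
    M = M - M.mean()  # re-center the mediator within the pooled data
    ones = np.ones_like(X)
    a = np.linalg.lstsq(np.column_stack([ones, X]), M, rcond=None)[0][1]
    b = np.linalg.lstsq(np.column_stack([ones, X, M]), Y, rcond=None)[0][2]
    return a * b

# Sequentially pool Study 1, Studies 1-2, ..., Studies 1-8 and re-estimate.
estimates = []
for k in range(1, len(studies) + 1):
    X = np.concatenate([s[0] for s in studies[:k]])
    M = np.concatenate([s[1] for s in studies[:k]])
    Y = np.concatenate([s[2] for s in studies[:k]])
    estimates.append(mediated_effect(X, M, Y))
print([round(e, 2) for e in estimates])
```

As in the paper's procedure, each cumulative estimate uses all the data accumulated so far, so the sequence of estimates typically stabilizes and its confidence interval narrows as studies are added.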
Figure 3 displays the mediated effect estimate through imagery when the XM interaction was not included, with its corresponding confidence interval, for Study 1 data through Study 1 data + … + Study 8 data. When all the data were included, the mediated effect was 2.121, and the confidence interval, which narrowed as data accumulated, ranged from 1.597 to 2.684 and did not contain zero.
Figure 3.
Mediated effect through imagery with asymmetric distribution of the product confidence intervals. The mediated effect was first estimated on Study 1 data, then the data from the next study in the sequence (i.e., Study 2 data) was added to Study 1 data and the mediated effect was estimated on this combined dataset. This process was continued sequentially until the data from all eight studies were included in a single dataset.
Adjustment for Unmeasured Confounders and Measurement Error
Sensitivity analyses can be used to help determine how much the observed mediated effect would change assuming there is some unmeasured confounder(s) that affects the M to Y relation, or the imagery to total words recalled relation (Imai et al., 2010; MacKinnon, 2008). There are three types of sensitivity analyses that can be used to probe the untestable assumption of no unmeasured confounders for the single mediator model (Cox, Kisbu-Sakarya, Miočević, & MacKinnon, 2013). We apply one of these methods, the Left Out Variables Error (L.O.V.E.) method (Mauro, 1990), adapted to the case of mediation. The method calculates the magnitude of an unmeasured confounder, U, that would reduce the mediated effect to zero, based on the correlation between U and the mediator, M, and the correlation between U and the outcome, Y. The larger the correlation between U and M (R_UM) and between U and Y (R_UY) required to make the observed mediated effect equal to zero, the more robust the observed mediated effect is to unmeasured confounding.
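A simplified version of the L.O.V.E. computation can be sketched as follows. It assumes the M to Y relation can be summarized by a single adjusted correlation (an illustrative value here, not the study's), and uses the fact that the partial correlation of M and Y given U is zero exactly when R_UM × R_UY equals that correlation.

```python
import numpy as np

# Simplified L.O.V.E.-style curve: the M-to-Y relation (summarized here as
# a correlation r_my, an assumed illustrative value) is reduced to zero by
# a confounder U exactly when r_UM * r_UY = r_my, since the partial
# correlation of M and Y given U has numerator r_my - r_UM * r_UY.
r_my = 0.32  # hypothetical adjusted M-Y correlation

r_uy = np.linspace(0.35, 0.95, 13)  # candidate confounder-outcome correlations
r_um = r_my / r_uy                  # confounder-mediator correlation needed

for x, y in zip(r_uy, r_um):
    print(f"r_UY = {x:.2f} -> r_UM needed = {y:.2f}")
```

Each printed pair is one point on the sensitivity curve: the farther the curve sits from the axes, the larger the confounding required to nullify the mediated effect.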
The L.O.V.E. method was applied to the word data aggregated across all eight memory studies (see Figure 4). The sensitivity plot has the correlation between the unmeasured confounder, U, and the outcome, Y, on the x-axis and the correlation between U and M on the y-axis. The line represents the combinations of correlations between U and M and between U and Y that would make the observed mediated effect equal to zero. The further the line is from the axes, the more robust the observed mediated effect is to unmeasured confounding. Consider a single point on the line in Figure 4 (e.g., R_UM = .602 and R_UY = .524, calculated analytically): it would take a correlation of .602 between mental imagery and an unmeasured confounder, together with a correlation of .524 between total words recalled and that confounder, for the observed mediated effect to equal zero. These sensitivity results provide evidence that the observed mediated effect of our experimental manipulation on total words recalled through its effect on mental imagery is robust to unmeasured confounding. It is difficult to conceive of a confounder with such large relations with M and Y that could reduce the mediated effect to zero. An example of a possible confounder would be if some students had knowledge of mnemonics, such as the peg word technique, to improve memory for the words. The imagery and repetition groups would have to differ on this confounder, which is unlikely given random assignment to conditions. As a result, it is unlikely that a confounder of the M to Y relation could explain the mediated effects through imagery.
Figure 4.
L.O.V.E plot for word data aggregated across eight memory studies. X-axis represents correlation between an unmeasured confounder and the outcome and the Y-axis represents the correlation between the same unmeasured confounder and the mediator. The line represents the correlation between the unmeasured confounder and mediator and the unmeasured confounder and the outcome that would make the observed mediated effect equal zero.
Another sensitivity analysis was conducted to test the robustness of the mediated effect through imagery to both unmeasured confounding and measurement error (Fritz, Kenny, & MacKinnon, 2016). The reliability of the mediator and the outcome was varied from .80 to 1.0 while simultaneously estimating the correlation between the mediator and an unmeasured confounder, R_UM, and the correlation between the outcome and an unmeasured confounder, R_UY, that would make the mediated effect equal zero. Reliability of the mediator and outcome was varied by constraining the residual variance of M to (1 − Reliability) and the residual variance of Y to (1 − Reliability). The correlations R_UM and R_UY were estimated by creating a latent variable, U, constraining its variance to 1, estimating a path from U to M and from U to Y constrained to be equal, and constraining the b path to zero (MacKinnon, 2008). This resulted in the same unstandardized path coefficient from U to M and from U to Y, but different correlations between M and U and between Y and U, because the variances of M and Y were not, in general, equal. When reliability of M and Y was .80, the mediated effect estimate was larger and the direct effect was smaller, which is expected when correcting for measurement error in the mediator (Fritz et al., 2016; Hoyle & Kenny, 1999). Allowing for measurement error increases the value of the b coefficient, so the mediated effect adjusted for a reliability of .8 increased from 2.121 to 3.129; the correlation between imagery and a confounder would then have to be .673, and the correlation between recall and the confounder .586, to make the mediated effect equal to zero. Again, correlations with a potential confounder would have to be very large to make the mediated effect zero.
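The paper's reliability adjustment uses a structural equation model; as a rough stand-in, the classical correction for attenuation shows the direction of the effect: lower reliability of M implies a larger disattenuated b path and hence a larger mediated effect. The numbers below are assumed for illustration only.

```python
# Simplified stand-in for the SEM reliability adjustment: the classical
# correction for attenuation. With reliability rel_m for the mediator, the
# observed b path is attenuated roughly by a factor of rel_m, so the
# disattenuated mediated effect grows as reliability drops below 1.
# Values are illustrative, not the paper's SEM estimates.

def disattenuate_b(b_observed, rel_m):
    """Approximate true b when M is measured with reliability rel_m."""
    return b_observed / rel_m

a = 3.7       # a-path (X is randomized, so not attenuated by rel_m)
b_obs = 0.55  # observed (attenuated) b path
for rel in (1.0, 0.9, 0.8):
    ab = a * disattenuate_b(b_obs, rel)
    print(f"reliability {rel:.1f}: mediated effect ~ {ab:.2f}")
```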
Overall, the results provide benchmark validation of the mediated effect for imagery and word recall, even when sensitivity to potential confounders and measurement error are considered.
Two Mediator Model Analysis: Benchmark Validation of Statistical Mediation Analysis of Imagery and Repetition as Possible Mediators of Instructions on Memory
The single mediator analysis provided evidence for benchmark effect validation for statistical mediation analysis because there was evidence that imagery was a mediator of instructions on word recall. It is also useful to demonstrate that a mediation effect is not found for a mediator that theoretically should not improve recall substantially. Merely repeating a word over and over should not have a substantial effect on word recall because it does not entail the type of elaborative processing that improves memory. In the next analysis, the mediated effect of the repetition mediator is tested as part of a two mediator model, containing both the repetition mediator and the imagery mediator. There should be evidence for mediation through imagery but not through repetition in this model for benchmark effect validation.
Two Mediator Models
To simultaneously address the predicted mediation pathway through imagery and the predicted lack of a mediated effect through repetition, a two mediator model (MacKinnon, 2008) was estimated. All four variables (manipulation, imagery, repetition, and recall) were included in order to simultaneously estimate mediated effects through imagery and repetition. A covariance between imagery and repetition was included in the model because these two measures were likely to be correlated. In fact, the negative correlation between the two mediators suggested that as more of one strategy was used, less of the other was used. The results of this analysis are shown in Table 3. There were significant effects of the manipulation on the imagery and repetition self-reports in every study, as expected given the manipulation. Only one of the eight studies demonstrated a significant gain in R-squared when the interactions of X with imagery and of X with repetition on Y were included; therefore, these interactions were not included in the final analyses. The average relation of imagery to word recall was 0.515, which was statistically significant in six of eight studies. The average mediated effect for imagery was 1.954 words; the mediated effect was statistically significant in six of eight studies and always positive, consistent with the single mediator results described earlier. In contrast, the average relation of repetition to recall was −0.027 and none of the coefficients were statistically significant. None of the mediated effects through repetition were statistically significant, with an average of 0.152 words. Overall, the pattern of results for the two mediator model demonstrated benchmark validation of these models for the known imagery effect and the known lack of a mediated effect through repetition.
Table 3.
Path analysis Mediation Model Results Across Eight Memory Studies with Imagery and Repetition as Mediators
| Study | a (Imagery) | b (Imagery) | ab (Imagery) | a (Repetition) | b (Repetition) | ab (Repetition) | c′ |
|---|---|---|---|---|---|---|---|
| 1 | 3.746* (0.475) | 0.389* (0.157) | 1.457* (0.621) [0.297, 2.738] | −4.519* (0.450) | 0.111 (0.166) | −0.502 (0.756) [−2.003, 0.971] | −0.612 (1.154) |
| 2 | 3.924* (0.768) | 0.432* (0.205) | 1.696* (0.884) [0.113, 3.586] | −3.998* (0.717) | 0.179 (0.220) | −0.717 (0.903) [−2.568, 1.014] | 2.329 (1.324) |
| 3 | 5.748* (0.590) | 0.303 (0.332) | 1.744 (1.927) [−2.000, 5.585] | −5.350* (0.586) | −0.256 (0.334) | 1.369 (1.804) [−2.138, 4.969] | 1.615 (2.710) |
| 4 | 3.238* (0.502) | 0.638* (0.123) | 2.064* (0.515) [1.145, 3.156] | −3.776* (0.464) | −0.150 (0.133) | 0.568 (0.511) [−0.417, 1.597] | −0.373 (0.808) |
| 5 | 5.283* (0.776) | 0.443 (0.466) | 2.341 (2.512) [−2.486, 7.441] | −3.300* (0.974) | −0.066 (0.371) | 0.218 (1.278) [−2.333, 2.858] | −1.425 (2.366) |
| 6 | 2.545* (0.832) | 0.670* (0.158) | 1.704* (0.700) [0.516, 3.241] | −2.903* (0.670) | −0.100 (0.196) | 0.289 (0.588) [−0.847, 1.506] | −0.420 (1.009) |
| 7 | 3.883* (0.758) | 0.564* (0.258) | 2.190* (1.107) [0.214, 4.559] | −3.867* (0.791) | −0.225 (0.247) | 0.871 (0.991) [−1.006, 2.934] | 1.055 (1.204) |
| 8 | 3.558* (0.681) | 0.684* (0.228) | 2.433* (0.948) [0.768, 4.475] | −2.900* (0.538) | 0.303 (0.289) | −0.878 (0.868) [−2.678, 0.763] | 0.962 (1.383) |
| Average | 3.991 (0.673) | 0.515 (0.241) | 1.954 (1.152) [−0.179, 4.348] | −3.827 (0.649) | −0.027 (0.245) | 0.152 (0.962) [−1.749, 2.077] | 0.391 (1.495) |
| Weighted Average | 3.724 (0.635) | 0.531 (0.203) | 1.902 (0.921) [0.217, 3.833] | −3.829 (0.594) | −0.006 (0.214) | 0.049 (0.835) [−1.603, 1.709] | 0.247 (1.288) |
Note. Standard errors appear in parentheses and distribution of the product confidence intervals appear in brackets. The weighted average is weighted by the sample size in each study.
* indicates p < 0.05.
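The two mediator estimation can be sketched on simulated data in which, by construction, only imagery carries the effect of X on recall. All coefficients and the sample size are assumptions for illustration, not the study's values.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data with two mediators: I (imagery) should carry the effect,
# R (repetition) should not. Coefficients are illustrative only.
n = 120
X = rng.integers(0, 2, n).astype(float)
I = 3.8 * X + rng.normal(0, 1.5, n)               # imagery rises with X
R = -3.8 * X + rng.normal(0, 1.5, n)              # repetition falls with X
Y = 10 + 0.5 * I + 0.0 * R + rng.normal(0, 2, n)  # only imagery affects recall

ones = np.ones_like(X)
a_I = np.linalg.lstsq(np.column_stack([ones, X]), I, rcond=None)[0][1]
a_R = np.linalg.lstsq(np.column_stack([ones, X]), R, rcond=None)[0][1]
# Both mediators enter the outcome model together with X.
coefs = np.linalg.lstsq(np.column_stack([ones, X, I, R]), Y, rcond=None)[0]
b_I, b_R = coefs[2], coefs[3]

print(f"ab through imagery:    {a_I * b_I:.2f}")
print(f"ab through repetition: {a_R * b_R:.2f}")
```

Because the true b path for repetition is zero here, the repetition indirect effect estimate should hover near zero while the imagery indirect effect is clearly positive, mirroring the specificity pattern in Table 3.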
The standardized mediated effect size was calculated for the mediated effect through imagery and the mediated effect through repetition by dividing each mediated effect estimate by the standard deviation of Y. The effect size for the mediated effect through imagery ranged from 0.465 to 0.721; using the percentile bootstrap, confidence intervals for five of the eight studies did not contain zero. The effect size for the mediated effect through repetition ranged from −0.234 to 0.406, and confidence intervals for all studies contained zero.
Figure 5 displays the cumulative data analysis results for the mediated effect estimate through imagery from the multiple mediator model that also included repetition as a mediator. No interactions were included in this model. The mediated effect through imagery had a final value similar to that in Figure 3 for the single mediator model, 2.139, with a confidence interval from 1.606 to 2.715. The confidence intervals became narrower as data were accumulated across the eight studies and always excluded zero. Figure 6 displays the mediated effect through repetition from the multiple mediator model. The mediated effect through repetition was initially negative in Study 1 (−0.502) with a confidence interval from −2.003 to 0.971. As data were accumulated across the eight studies, the mediated effect through repetition moved toward zero, reaching −0.082 with a narrower confidence interval from −0.645 to 0.477. The confidence intervals for the mediated effect through repetition never excluded zero. Overall, the mediated effect via imagery was nonzero and close to 2 words, while the mediated effect through repetition was near zero, providing evidence that the mediation model validly detects the known benchmark effect through imagery but not through repetition.
Figure 5.
Mediated effect through imagery including repetition in a multiple mediator model with no interactions and with asymmetric distribution of the product confidence intervals. The mediated effect was first estimated on Study 1 data, then the data from the next study in the sequence (i.e., Study 2 data) was added to Study 1 data and the mediated effect was estimated on this combined dataset. This process was continued sequentially until the data from all eight studies were included in a single dataset.
Figure 6.
Mediated effect through repetition including imagery in a multiple mediator model with no interactions and with asymmetric distribution of the product confidence intervals. The mediated effect was first estimated on Study 1 data, then the data from the next study in the sequence (i.e., Study 2 data) was added to Study 1 data and the mediated effect was estimated on this combined dataset. This process was continued sequentially until the data from all eight studies were included in a single dataset.
The sensitivity analysis for the mediated effect via imagery in the two mediator model was generally the same as the single mediator model results, with correlations of .589 between a confounder and imagery and .511 between a confounder and recall required to make the mediated effect zero. When reliability was .80, the mediated effect increased from 2.139 to 3.360 and the direct effect became more negative (larger in magnitude). The mediated effect through repetition was so small that almost any plausible correlation with a confounder could render it zero, and adjusting for measurement error increased the magnitude of the mediated effect only from −0.082 to −0.704. More on sensitivity to confounding and measurement error, including analyses for each experiment, is included in the supplemental material for this paper.
As described earlier, the sample size for each study was considered reasonable for the overall effect of the manipulation on recall for words. We conducted a post-hoc power analysis for the mediated effect in each of the eight studies using a Monte Carlo procedure for the single mediator model and the multiple mediator model in Mplus 7.4 (Thoemmes, MacKinnon, & Reiser, 2010). For each study, a population model was chosen for which 10,000 simulated datasets were created. The population model was set equal to the parameter estimates (i.e., path coefficients, covariances, variances/residual variances) from the mediator models fit to the data accumulated across all eight studies. We acknowledge that there are limitations to post-hoc power calculations, including that the effects from all the studies are not likely to represent the population effects (Hoenig & Heisey, 2001), so these power calculations should be considered approximate.
For each of these 10,000 simulated datasets, the observed mediated effect was tested for statistical significance using percentile bootstrapping. The empirical power for each study is the percentage of times the mediated effect was significant across the 10,000 simulated datasets. Power was calculated with and without the XM interaction. On average, the empirical post-hoc power was .690 to detect the mediated effect through imagery with the XM interaction, .765 to detect the mediated effect without the XM interaction, .733 to detect the mediated effect through imagery in the multiple mediator model, and .062 to detect the mediated effect through repetition in the multiple mediator model. Overall, empirical power was slightly lower than .8 to detect the mediated effect through imagery, and the empirical power to detect the effect through repetition was close to the nominal .05 Type I error rate. Based on these power calculations, a sample size of about 45 is needed to have .80 power to detect the imagery mediated effect. When the sample size was 24 or 22, as in Studies 3 and 5, there was about a 45% chance of detecting a nonzero effect (.456 and .418, respectively, for the two mediator model).
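The Monte Carlo power procedure can be sketched as follows. For speed, this illustration tests the mediated effect with a Sobel z test rather than the percentile bootstrap used in the paper, and the population coefficients are assumed values, not the accumulated-data estimates.

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_with_se(Xmat, y):
    """OLS coefficients and their standard errors."""
    beta, *_ = np.linalg.lstsq(Xmat, y, rcond=None)
    resid = y - Xmat @ beta
    df = len(y) - Xmat.shape[1]
    sigma2 = resid @ resid / df
    cov = sigma2 * np.linalg.inv(Xmat.T @ Xmat)
    return beta, np.sqrt(np.diag(cov))

def power(n, a_true=3.5, b_true=0.5, reps=1000):
    """Monte Carlo power for the mediated effect, using a Sobel z test
    (a faster stand-in for the paper's percentile bootstrap)."""
    hits = 0
    for _ in range(reps):
        X = rng.integers(0, 2, n).astype(float)
        M = a_true * X + rng.normal(0, 1.5, n)
        Y = 10 + 0.5 * X + b_true * M + rng.normal(0, 2, n)
        ones = np.ones_like(X)
        (_, a), (_, se_a) = fit_with_se(np.column_stack([ones, X]), M)
        (_, _, b), (_, _, se_b) = fit_with_se(np.column_stack([ones, X, M]), Y)
        se_ab = np.sqrt(a**2 * se_b**2 + b**2 * se_a**2)
        hits += abs(a * b / se_ab) > 1.96
    return hits / reps

p22, p45 = power(22), power(45)  # small vs moderate sample size
print(p22, p45)
```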
Discussion
This paper described benchmark validation and then conducted a benchmark effect validation study of mediation analysis. We described the overall approach for benchmark validation, including benchmark value, benchmark estimate, and benchmark effect studies. Benchmark validation is not new and has been applied in several previous studies. Most prior related work has focused on benchmark estimate studies, where an estimate from a randomized study is compared to the estimate from a nonrandomized study. It is useful to conduct benchmark validation of statistical mediation analysis because of the possible influence of confounding variables and the imprecise specification of mediation quantities in mediation analysis, including that some quantities are impossible to ever measure from a potential outcomes perspective. Statistical mediation analyses led to accurate conclusions about the mediated effect of imagery on word recall and the lack of a mediated effect of repetition on word recall, thereby providing benchmark validation of statistical mediation analysis. The benchmark effect validation in this study covers only one substantive example, imagery and word recall, and does not suggest that all statistical mediation models will always be valid. The results do, however, provide an example where mediation analysis gave correct answers, and they illustrate a validation approach that could be applied to more examples. Ideally, many other applications of mediation analysis would be validated in a similar manner, providing cumulative evidence for the methodology.
There are other, more elaborate versions of the imagery and memory benchmark effect study than the one we used. For example, latent variable models for type of rehearsal and memory could provide more accurate measurement of depth of processing and memory (Geiselman, Woodward, & Beatty, 1982), and all possible known confounders could be measured. It may also be possible in the future to randomize participants to a level of the mediator, perhaps with direct brain stimulation (e.g., transcranial magnetic stimulation), allowing for a benchmark estimate validation study. Different research designs may also allow for other statistical methods, such as instrumental variable methods for mediation in which the direct effect is known to be zero so that the estimate of the b path can be considered causal. The point of our example data is that it reflects a well-known, accepted finding in cognitive psychology: that instructions to form images improve memory for words. Such an effect should be uncovered by any valid mediation analysis of imagery and memory.
Validating statistical models with benchmark validation complements other methods to evaluate statistical models, though mathematics and simulation research will continue to be the dominant methods used to develop and evaluate statistical models. Each approach has strengths and weaknesses. Mathematical derivations are based on assumptions that may be incorrect or unrealistic for real-world data. Simulation studies generally consist of two parts: specifying a data-generating model that is meant to emulate how a theoretical process operates in the population, and a data-analysis model that is meant to estimate the parameters of the theoretical model. Statistical models are tested in the data-analysis stage and are validated if they produce results that match the parameters in the data-generating model. Simulation studies generally favor the statistical models that generated the data or models closely related to the data-generating model. It is unclear whether population models in simulation studies accurately reflect the complications of real substantive data. One option would be to base simulations on real-world data sets, in which observed parameters from the real data set are used to create an artificial population from which simulated data sets are drawn; even then, the simulation conditions reflect the observed data only through the artificial specification of the population model. The benchmark validation method complements mathematical and simulation methods with a benchmark that is considered known based on an extensive substantive research literature. In particular, benchmark validation is valuable because the substantive effect may have complications that are difficult or impossible to model in a simulation study or mathematical derivation and can only be validated with benchmarks.
Ideally, benchmark validation studies of statistical models are applied in many different substantive research areas for different benchmark values, estimates, and effects. Benchmark validation studies may be most helpful after mathematical and simulation studies have demonstrated that a statistical method is accurate for population models.
There are limitations to benchmark validation, including whether such consistent effects can be found in the substantive research literature and how to determine when a known effect is present. Benchmark validation requires substantive effects with enough evidence and replication to be considered known. Do we have such known effects in psychology? Levels of processing theory for word recall was demonstrated in this research. One characteristic of benchmark validation is a strong theoretical explanation as well as an extensive empirical basis for an effect. Other possible known effects in psychology include some genetic relations (Plomin, DeFries, Knopik, & Neiderhiser, 2016), perception, cognitive dissonance, learning, anchoring, and Stroop interference, though the mediating processes for these phenomena may not be as well developed as for imagery and word recall. Other sources for known effects include central findings in a research area (Fiske, Gilbert, & Lindzey, 2010; Reisberg, 2013; Valsiner & Connolly, 2003). Replicated meta-analyses may provide additional evidence for a known effect (Howard et al., 2009), but reaching the goal of a known effect may require many meta-analysis and research synthesis studies using the best methodology (Borenstein, Hedges, Higgins, & Rothstein, 2009; Cheung, 2015; Cooper, Hedges, & Valentine, 2009; Simonsohn, Simmons, & Nelson, 2015). It is also ideal if the known effect studied is easy to replicate or is already present in many available data sets. Easily available data are becoming the norm, given requirements for data sharing and the capability of electronic devices to collect massive amounts of data.
Benchmark Validation Criteria
Benchmark estimate studies that include within-study comparison of randomized conditions are ideal benchmark validation studies because they compare a benchmark estimate from a randomized arm to an estimate from an observational study. The randomized design is considered the most accurate way to estimate causal effects, though it has limitations that are addressed by other methods (Shadish et al., 2002). Cook et al. (2008) described seven criteria for within-study comparison research that are relevant for benchmark estimate validation, paraphrased here: (1) two counterfactual groups must exist that vary in whether assignment to the intervention was random, (2) the experimental and observational studies must estimate the same causal effect, (3) the difference between the observational and experimental studies should not be correlated with other variables related to the study outcome, (4) persons conducting the analyses of the randomized and observational studies should be blind to each other’s results, (5) treatment and control groups must meet the usual criteria of technical accuracy, such as proper randomization, (6) the observational study should meet the technical requirements for accurate analysis with the statistical method used, and (7) a method should be specified to decide whether the results from the experimental and observational studies are comparable, such as statistical significance patterns, the metric of the causal effect, or the percentage difference between the two effects.
We propose seven criteria for selecting a substantive effect to use in a benchmark effect validation study: (1) there must be a body of research validating the existence of the effect, conducted by many different researchers and in many different research contexts, (2) experienced researchers in the field must agree that the effect is real and replicable, (3) no specific evidence exists against the presence of the effect, (4) data from repeated studies are available or could be obtained by conducting multiple studies with available resources, (5) the available data are likely to exhibit the benchmark effect given the effect size and sample size; ideally, power calculations demonstrate adequate power to find the effect, (6) a method is specified for concluding that there is benchmark validation, such as an effect, effect size, confidence interval, or probability that the effect is nonzero, and (7) it is possible to identify an effect that should not be present according to the statistical model, so that specificity is demonstrated when the model detects one effect but not the other. Overall, the benchmark effect should be unambiguously supported by researchers in a research area. Consensus among researchers about the benchmark effect may be the most challenging aspect of identifying benchmark effects.
The first six criteria apply for a benchmark value study if a specific value, rather than an estimate, is considered known. Such benchmark value studies may be rare in psychology, although they exist in other fields. For example, geographical imaging studies may validate maps by physically traveling to locations on the map, thereby obtaining ground truth. Similarly, medical imaging studies may validate a medical imaging instrument by seeing how well the image corresponds to a person’s physical anatomy.
Possible Results of a Benchmark Validation Study
The success of the benchmark effect validation method depends on whether the benchmark effect is a true effect that exists independent of the chosen statistical model. If the chosen benchmark effect is real, there are two possible conclusions from the study. Ideally, the statistical model corroborates the existence of the benchmark effect, providing evidence that the model is valid. If the model does not detect the benchmark effect, then it is not validated, suggesting that it does not lead to the correct answer. Either way, a benchmark validation (BV) study can inspire improvements or highlight limitations of the statistical model to inform further work. The BV study results, along with analytical and simulation work on the statistical model, may suggest next steps. Additional BV studies may clarify why the statistical model was inaccurate and can compare additional statistical models that may identify the benchmark effect more accurately. Although finding other benchmark effects and conducting additional studies may be time-consuming, this process may help expose weaknesses of existing methods and inspire new methods. For example, if our statistical mediation model had failed to find an imagery-mediated effect in the memory studies, we would next explore why the model failed. We would use analytical and simulation work to determine whether the model was underpowered for the observed sample size and effect size. Other approaches to mediation analysis based on machine learning or causal mediation would also be evaluated. Such a result would seriously question the accuracy of statistical mediation analysis and generate extensive further research.
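A Monte Carlo power check of this kind can be sketched simply. The following is an illustration only, not the analysis used in the eight studies: the sample size, the path values a and b, and the use of the Sobel test for the indirect effect are all illustrative assumptions.

```python
import numpy as np

def sobel_power(n=100, a=0.39, b=0.39, reps=1000, seed=1):
    """Estimate power of the Sobel test for the indirect effect a*b.

    Simulates X -> M -> Y with unit-variance residuals and no direct
    X -> Y effect, so the simple regression of Y on M recovers b.
    Uses a two-sided .05 test (normal critical value 1.96).
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        x = rng.normal(size=n)
        m = a * x + rng.normal(size=n)
        y = b * m + rng.normal(size=n)
        # OLS slope and standard error for M regressed on X
        xc, mc = x - x.mean(), m - m.mean()
        a_hat = np.dot(xc, mc) / np.sum(xc ** 2)
        resid_m = mc - a_hat * xc
        se_a = np.sqrt(np.sum(resid_m ** 2) / (n - 2) / np.sum(xc ** 2))
        # OLS slope and standard error for Y regressed on M
        yc = y - y.mean()
        b_hat = np.dot(mc, yc) / np.sum(mc ** 2)
        resid_y = yc - b_hat * mc
        se_b = np.sqrt(np.sum(resid_y ** 2) / (n - 2) / np.sum(mc ** 2))
        # First-order (Sobel) standard error of the product a_hat * b_hat
        se_ab = np.sqrt(a_hat ** 2 * se_b ** 2 + b_hat ** 2 * se_a ** 2)
        if abs(a_hat * b_hat / se_ab) > 1.96:
            hits += 1
    return hits / reps
```

With medium-sized paths (a = b = 0.39) and n = 100, repeated simulated data sets give an estimate of how often the mediated effect would be detected, which is the question a failed benchmark validation study would raise first.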
Problematic results may occur if the benchmark effect is not a real effect, in which case there are two possibilities that demonstrate limitations of the benchmark validation method: (1) an untrue benchmark effect suggests that an accurate statistical model is not valid, or (2) an untrue benchmark effect suggests that an inaccurate statistical model is valid. These possibilities must be considered in any BV study and highlight the importance of combining BV, mathematical, and simulation work in the development and evaluation of statistical models. If the statistical model does not detect the untrue benchmark effect, the model may well be valid, but it would be rejected because the benchmark effect is not real. This result is problematic because a valid statistical method would be rejected. In this situation, the simulation and analytical work that form the basis of the statistical method should be investigated more thoroughly, and it would be important to conduct other benchmark validation studies.
The other problematic outcome when the benchmark effect is not real occurs when the statistical model concludes that the effect is present, though this may be very rare if the criteria for benchmark effects are followed. In this case, the statistical method would be considered valid, and there would be further, spurious evidence for the benchmark effect studied. Other benchmark validation studies and further analytical and simulation work could then expose the benchmark effect as not real. Nevertheless, the possibility that the benchmark effect selected for validation is not real is an important consideration for any benchmark validation study. In summary, benchmark validation has limitations that must be considered, and benchmark validation studies combined with simulation and mathematical work will help develop and evaluate the veracity of a statistical model.
Given the criteria for selecting a benchmark effect, it would be surprising, but possible, for a benchmark effect not to be real. Such a result is possible if the above criteria are not followed, especially if there is not a real scientific consensus on the presence of the benchmark effect. There is at least one study that may be an example of this type of result for the confirmatory factor analysis model. McCrae, Zonderman, Costa, Bond, and Paunonen (1996) cast doubt on confirmatory factor analysis (CFA) because CFA did not validate dimensions of personality. It is possible in this case that CFA is valid, but the dimensions of personality are not as described in previous studies of the five-factor personality model. The use of the five personality factors as a benchmark effect violates criterion 3, which requires that there is no reason to doubt the benchmark effect. There are dissenting opinions about the five personality factors (Block, 1995), including that there are more than five personality dimensions (e.g., Cattell’s research demonstrating 16 personality factors; Cattell & Schuerger, 2003) and that context is important for personality (Ross & Nisbett, 1991). The results of the McCrae et al. (1996) study demonstrate the possible ambiguity of personality dimensions and show that other studies of CFA are needed, including other BV studies. It is interesting to speculate about benchmark studies to validate the CFA model. For example, one possible BV study of CFA could expect three dimensions for measures of three human characteristics, such as measures of physical size, cognitive abilities, and personality ratings. A valid method should suggest that the measures were obtained from three dimensions.
Benchmark Validation, Research Design, and Measurement
Benchmark validity is related to existing discussions of the validity of tests and research designs. A point emphasized in this literature is that validity is shown in the use of a test or research design. The validity of research designs is discussed extensively in the work of Shadish and colleagues (Shadish, Cook, & Campbell, 2002), primarily in the context of the validity of inferences drawn about the relation between two variables where one of the variables represents group membership. In this literature, statistical conclusion validity is the concept most closely related to benchmark validation, in that both refer to the extent to which a statistical analysis leads to the correct decision about whether variables are related and how strongly they are related. Many of the threats to statistical conclusion validity, such as low statistical power, violated assumptions, and unreliability of measures, are also threats to benchmark validation. There are also many connections with concepts in test validity. In that literature, the notion of predictive validity, that a valid test score predicts a criterion variable, is most closely related to benchmark validation of whether a statistical model leads to accurate decisions about a known effect (Nunnally & Bernstein, 1994). Convergent and divergent validity evidence correspond to whether the statistical model leads to the correct conclusion about benchmark effects that should and should not be present, respectively. Current views of test validity advocate using comprehensive information to assess whether interpretations are valid (Kane, 2006; Markus & Borsboom, 2013), which is similar to the use of mathematics, simulations, and benchmark validation studies to assess the accuracy of statistical models.
Benchmark Validation and Philosophy of Science
The notion of benchmark validation is consistent with several views from the philosophy of science literature, of which we mention two. Kuhn (1970) described normal science as conducted within a research paradigm, including studies employing facts of a scientific paradigm to solve problems, evaluating predictions from theory defined by the scientific paradigm, and conducting research to resolve ambiguities of the paradigm. Kuhn observed that, once a paradigm was established, it was very difficult for scientists to emerge from the bias it introduced. Kuhn questioned whether it was possible or even likely that truth could be obtained, and he concluded that scientific progress is not cumulative but requires scientific revolutions for newer, correct ideas to emerge. He argued that paradigms are neither right nor wrong but are merely the views held by scientists, and these views are heavily influenced by existing scientific paradigms. In response to Kuhn’s humanistic view that science is socially determined, Shapere (1983) noted that the goal of the scientist is to be aware of possible biases introduced by existing background information. He argued that there is an objective truth and that the goal of science is to obtain this truth. In one respect, benchmark validation is part of the normal science described by Kuhn, reflecting the evaluation of predictions from statistical models within the existing scientific paradigm. In another respect, benchmark validation provides a way to investigate when prevailing methods are inaccurate in the context of the meaning of actual research results. In either case, because the selection of a benchmark is based on prior knowledge in a research area, the potential biases of the effect owing to the existing scientific paradigm must be considered.
Summary and Future Directions
Perhaps the most important aspect of benchmark validation is that it will help focus research on consistent and replicable research results in psychology, an important aspect of substantive research that has been reemphasized recently (Gilbert, King, Pettigrew, & Wilson, 2016; Open Science Collaboration, 2015). A byproduct of the search for benchmarks to validate methods is attention to known effects in psychology, thereby providing a way to gauge the growth of knowledge in psychology. Rather than focusing on new knowledge in psychology, benchmark validation increases focus on what is considered known. Why not concentrate on finding known psychological effects and demonstrating these effects in many studies? Such research may not be especially splashy or newsworthy, but it focuses on the information that makes psychology a science.
Several important questions about benchmark validation remain. One is the number of replications required to validate a method. One way to view this problem is in terms of the width of the confidence interval of the effect across the studies (Cumming, 2014). The multilevel modeling literature provides confidence interval width recommendations for the number of studies that range from 20 to 50, depending on the size of the variance among replications (Hox, 2002; Snijders & Bosker, 1999). Recommendations from the meta-analysis literature also vary, with a minimum requirement of at least two studies (Valentine, Pigott, & Rothstein, 2010). Mosteller and Tukey (1977) noted that by about three degrees of freedom, the critical t value has already covered 89% of the drop from its value at one degree of freedom to its value at infinite degrees of freedom, and at 10 degrees of freedom the critical t is very close to the infinite-degrees-of-freedom value. Their observation would suggest at least four replication studies, with 10 or more studies being ideal. Another question is how to decide whether a study replicates the results of a previous study. Anderson and Maxwell (2016) provided several useful approaches to deciding whether a study replicates an effect obtained in an earlier study. In fact, benchmark validation could be used to evaluate these approaches to deciding whether an effect is replicated. To use the example of the imagery manipulation on recall, each study should replicate earlier studies, taking into account different sample sizes and sampling error. For example, normal-theory confidence intervals and Bayesian credible interval measures of replication should both lead to the conclusion that the studies replicate each other. Relatedly, benchmark validation provides an empirical way to evaluate the extent to which null hypothesis significance testing, sequential Bayesian estimation, or other possible methods lead to correct decisions about benchmark effects across many studies.
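The diminishing returns behind the Mosteller and Tukey (1977) observation are easy to verify numerically. This small SciPy check assumes a two-sided .05 test; the specific degrees of freedom printed are chosen for illustration.

```python
from scipy.stats import t

# Two-sided .05 critical t values shrink quickly with degrees of freedom.
crit = {df: t.ppf(0.975, df) for df in (1, 3, 10, 30)}
crit_inf = t.ppf(0.975, 10**9)  # effectively the normal value, about 1.96

for df, value in crit.items():
    print(f"df={df:>2}: critical t = {value:.3f}")  # 12.706, 3.182, 2.228, 2.042

# By df = 3, most of the drop from the df = 1 value toward the
# infinite-df value has already occurred (roughly 89%).
fraction = (crit[1] - crit[3]) / (crit[1] - crit_inf)
print(f"fraction of the drop covered at df = 3: {fraction:.3f}")
```

The computation shows why a handful of replications buys most of the precision gain, while each study beyond roughly ten changes the critical value very little.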
Besides the mediation analysis used for illustration in this paper, benchmark validation can help validate other statistical models. For methods for the effect of a nonrandomized X on Y, the randomized arm of a study is helpful for understanding the validity of methods for causal inference from nonexperimental studies. For structural equation modeling, it may be difficult to find known effects for these more complicated models. However, Sewall Wright’s (Wright, 1920) original work on the genetics of piebald guinea pigs was a complicated path analytic model that Wright considered known. Benchmark validation could be used with latent class models in situations where there are known classes, such as species of iris flowers (Fisher, 1936) or undergraduate major in a sample of students. For example, undergraduate seniors could be given a general knowledge test, assuming that they would be more likely to correctly answer questions related to their major. If latent class analysis of their answers correctly grouped students by their actual majors, this would provide some validity evidence for latent class models. Longitudinal models can be validated with effects that are known to change over time, such as vocabulary acquisition in children and growth in height. There may also be examples of known effects from the physical sciences and biology that can be used to validate methods more widely used in the social sciences, as has been done with the Max Planck Institute data.
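The known-classes validation logic can be sketched without full latent class software. As a stand-in, the toy example below fits a minimal k-means clustering model to synthetic data with three known groups and then checks whether the recovered classes match the known membership; the group centers, noise level, and agreement threshold are all illustrative assumptions, not values from any real study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for known classes (e.g., Fisher's iris species):
# three well-separated groups measured on two variables.
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
known = np.repeat([0, 1, 2], 50)
data = centers[known] + rng.normal(scale=0.5, size=(150, 2))

# Minimal k-means stands in for the clustering model being validated.
# Deterministic farthest-point initialization avoids bad random starts.
means = [data[0]]
for _ in range(2):
    dists = np.min([((data - m) ** 2).sum(axis=1) for m in means], axis=0)
    means.append(data[np.argmax(dists)])
means = np.array(means)
for _ in range(25):
    assign = np.argmin(((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2), axis=1)
    means = np.array([data[assign == k].mean(axis=0) for k in range(3)])

# Benchmark check: within each known group, how often does the model
# return that group's modal recovered label?
agreement = np.mean([
    np.mean(assign[known == g] == np.bincount(assign[known == g]).argmax())
    for g in range(3)
])
print(round(float(agreement), 2))
```

High agreement between recovered and known classes is the clustering analogue of a statistical model corroborating a benchmark effect; low agreement on such clearly separated groups would count against the model.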
Given the likelihood that statistical models have untestable assumptions and that future models will be developed that illustrate the shortcomings of existing statistical models, substantive and methodological researchers are left with a dilemma. If we know that our current models are not optimal, or if we know that at some time in the future current methodology may be viewed as obsolete or even misleading, how can we ever know that our models are valid or that our research conclusions are accurate? To be more confident that research conclusions are accurate, it is prudent for researchers to rely on accumulated evidence from mathematics, simulation studies, and also thorough validation of known benchmark values, estimates, and effects in a substantive area. We hope that this research will encourage researchers to search for benchmark values, estimates, and effects to validate statistical models, so we gain information about whether our methods actually work (Feynman, 1974).
Supplementary Material
Acknowledgments
This research was supported in part by the National Institute on Drug Abuse (R37DA09757 and DA043317) and the National Institute on Mental Health (MH40859).
We thank the Arizona State University quantitative seminar participants, Donna Coffman, Sander Greenland, David Lubinski, George Mount, Bengt Muthén, Joe Rodgers, and Steve West for comments related to this research. We thank the editor and three anonymous reviewers for many important changes.
Footnotes
We use the term ‘reference group’ for the experimental manipulation group that did not receive the treatment; the term ‘control group’ is also used for this group (Holland, 1986, 1988).
Portions of this research (formerly known as Known Effect Validation, KEV) were presented at the 2014 and 2016 Society for Multivariate Experimental Psychology conference, London School of Hygiene and Tropical Medicine, King’s College London, and Vanderbilt University.
The data described in this paper were collected to serve as a mediation data resource and are freely available from the journal website in the supplemental materials for this paper or by contacting the first author. Some of the data in this paper were used for examples in other publications.
References
- Anderson SF, Maxwell SE. There’s more than one way to conduct a replication study: Beyond statistical significance. Psychological Methods. 2016;21(1):1–12. doi: 10.1037/met0000051. [DOI] [PubMed] [Google Scholar]
- Block J. A contrarian view of the five-factor approach to personality description. Psychological Bulletin. 1995;117:187–215. doi: 10.1037/0033-2909.117.2.187. [DOI] [PubMed] [Google Scholar]
- Borenstein M, Hedges LV, Higgins JPT, Rothstein HR. Introduction to meta-analysis. Chichester, England: Wiley; 2009. [Google Scholar]
- Bullock JG, Green DP, Ha SE. Yes, but what’s the mechanism? (Don’t expect an easy answer) Journal of Personality and Social Psychology. 2010;98(4):550–558. doi: 10.1037/a0018933. [DOI] [PubMed] [Google Scholar]
- Butcher JN, Williams CL, Graham JR, Archer R, Tellegen A, Ben-Porath YS, Kaemmer B. MMPI-A manual for administration, scoring, and interpretation. Minneapolis: University of Minnesota Press; 1992. [Google Scholar]
- Cattell HEP, Schuerger JM. Essentials of 16PF Assessment. Hoboken, NJ: John Wiley & Sons, Inc; 2003. [Google Scholar]
- Cheung MW-L. Meta-analysis: A structural equation modeling approach. West Sussex, UK: John Wiley & Sons, Ltd; 2015. [Google Scholar]
- Coffman DL, MacKinnon DP, Zhu Y, Ghosh D. A comparison of potential outcomes approaches for assessing causal mediation. In: He H, Wu P, Chen D-G, editors. Statistical Causal Inferences and Their Applications in Public Health Research. Springer; 2016. pp. 263–293. [Google Scholar]
- Coffman DL, Zhong W. Assessing mediation using marginal structural models in the presence of confounding and moderation. Psychological Methods. 2012;17(4):642. doi: 10.1037/a0029311. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cook TD, Shadish WR, Wong VC. Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management. 2008;27(4):724–750. [Google Scholar]
- Cooper H, Hedges LV, Valentine JC. The handbook of research synthesis and meta-analysis. 2. The Russell Sage Foundation; New York: 2009. [Google Scholar]
- Cox MG, Kisbu-Sakarya Y, Miočević M, MacKinnon DP. Sensitivity plots for confounder bias in the single mediator model. Evaluation Review. 2014;37(5):405–431. doi: 10.1177/0193841X14524576. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Craik FIM. Levels of processing: Past, present... and future? Memory. 2002;10(5–6):305–318. doi: 10.1080/09658210244000135. [DOI] [PubMed] [Google Scholar]
- Craik FIM, Lockhart RS. Levels of processing: A framework for memory research. Journal of Verbal Learning and Verbal Behavior. 1972;11(6):671–684. [Google Scholar]
- Craik FIM, Tulving E. Depth of processing and the retention of words in episodic memory. Journal of Experimental Psychology: General. 1975;104(3):268–294. [Google Scholar]
- Cronbach LJ, Meehl PE. Construct validity in psychological tests. Psychological Bulletin. 1955;52:281–302. doi: 10.1037/h0040957. [DOI] [PubMed] [Google Scholar]
- Cumming G. The new statistics: Why and how. Psychological Science. 2014;25(1):7–29. doi: 10.1177/0956797613504966. [DOI] [PubMed] [Google Scholar]
- Dehejia RH, Wahba S. Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association. 1999;94(448):1053–1062. [Google Scholar]
- Dodge Y, Rousson V. Direction dependence in a regression line. Communications in Statistics-Theory and Methods. 2000;29(9–10):1957–1972. [Google Scholar]
- Dodge Y, Rousson V. On asymmetric properties of the correlation coefficient in the regression setting. American Statistician. 2001;55(1):51–54. [Google Scholar]
- Dwyer JH. Differential equation models for longitudinal data. Application: Blood pressure and relative weight. In: Dwyer JH, Feinleib M, Lippert P, Hoffmeister H, editors. Statistical models for longitudinal studies of health. New York, NY: Oxford University Press; 1992. pp. 71–98. [Google Scholar]
- Feynman R. Cargo cult science. Commencement address presented at the California Institute of Technology; Pasadena, CA. 1974. [Google Scholar]
- Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Human Genetics. 1936;7(2):179–188. [Google Scholar]
- Fiske ST, Gilbert DT, Lindzey G, editors. Handbook of social psychology. Vol. 2. Hoboken, NJ: John Wiley & Sons; 2010. [Google Scholar]
- Frangakis CE, Rubin DB. Principal stratification in causal inference. Biometrics. 2002;58(1):21–29. doi: 10.1111/j.0006-341x.2002.00021.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fritz MS, Kenny DA, MacKinnon DP. The combined effects of measurement error and omitting confounders in the single-mediator model. Multivariate Behavioral Research. 2016;51(5):681–697. doi: 10.1080/00273171.2016.1224154. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Geiselman RE, Woodard JA, Beatty J. Individual differences in verbal performance: A test of alternative information processing models. Journal of Experimental Psychology: General. 1982;111:109–134. [Google Scholar]
- Gilbert DT, King G, Pettigrew S, Wilson TD. Comment on “Estimating the reproducibility of psychological science”. Science. 2016;351(6277):1037. doi: 10.1126/science.aad7243. [DOI] [PubMed] [Google Scholar]
- Glazerman S, Levy DM, Myers D. Nonexperimental versus experimental estimates of earning impacts. Annals of the Academy of Political and Social Science. 2003;589(1):63–93. [Google Scholar]
- Hallberg K, Cook TD, Steiner PM, Clark MH. Pretest measures of the study outcome and the elimination of selection bias: Evidence from three within study comparisons. Prevention Science. 2016 doi: 10.1007/s11121-016-0732-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Harlow LL, Mulaik SA, Steiger JH, editors. What if there were no significance tests? Mahwah, NJ: Erlbaum; 1997. [Google Scholar]
- Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. American Statistician. 2012;55(1):19–24. [Google Scholar]
- Holland PW. Statistics and causal inference. Journal of the American Statistical Association. 1986;81(396):945–960. [Google Scholar]
- Holland PW. Causal inference, path analysis, and recursive structural equations models. Sociological Methodology. 1988;18(1):449–484. [Google Scholar]
- Howard GS, Lau MY, Maxwell SE, Venter A, Lundy R, Sweeny RM. Do research literatures give correct answers? Review of General Psychology. 2009;13(2):116–121. [Google Scholar]
- Howard GS, Maxwell SE, Fleming KJ. The proof of the pudding: An illustration of the relative strengths of null hypothesis, meta-analysis, and Bayesian analysis. Psychological Methods. 2000;5(3):315–332. doi: 10.1037/1082-989x.5.3.315. [DOI] [PubMed] [Google Scholar]
- Hox JJ. Multilevel analysis: techniques and applications. Mahwah, NJ: Erlbaum; 2002. [Google Scholar]
- Hoyle RH, Kenny DA. Sample size, reliability, and tests of statistical mediation. Statistical Strategies for Small Sample Research. 1999;1:195–222. [Google Scholar]
- Huang S, MacKinnon DP, Perrino T, Gallo C, Cruden G, Brown HC. A statistical method for synthesizing mediation analyses using the product of coefficient approach across multiple trials. Statistical Methods and Applications. 2016;25(4):565–579. doi: 10.1007/s10260-016-0354-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Imai K, Keele L, Tingley D. A general approach to causal mediation analysis. Psychological Methods. 2010;15(4):309–334. doi: 10.1037/a0020761. [DOI] [PubMed] [Google Scholar]
- James LR. The unmeasured variables problem in path analysis. Journal of Applied Psychology. 1980;65(4):415–421. [Google Scholar]
- Jo B. Causal inference in randomized experiments with mediational process. Psychological Methods. 2008;13(4):314–336. doi: 10.1037/a0014207. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kane MT. Validation. In: Brennan RL, editor. Educational measurement. 4. Westport, CT: Praeger; 2006. pp. 17–64. [Google Scholar]
- Kashy DA, Donnellan MB, Ackerman RA, Russell DW. Reporting and interpreting research in PSPB: Practices, principles, and pragmatics. Personality and Social Psychology Bulletin. 2009;35(9):1131–1142. doi: 10.1177/0146167208331253. [DOI] [PubMed] [Google Scholar]
- Krantz DH. The null hypothesis testing controversy in psychology. Journal of the American Statistical Association. 1999;94:1372–1381. [Google Scholar]
- Kuhn TS. The structure of scientific revolutions. 2. Chicago: University of Chicago Press; 1970. [Google Scholar]
- Lalonde RJ. Evaluating the econometric evaluation of training programs with experimental data. The American Economic Review. 1986;76(4):604–620. [Google Scholar]
- Lindsay DS. Replication in psychological science. Psychological Science. 2015;26(12):1827–1832. doi: 10.1177/0956797615616374. [DOI] [PubMed] [Google Scholar]
- Long Q, Little RJ, Lin X. Causal inference in hybrid intervention trials involving treatment choice. Journal of the American Statistical Association. 2008;103:474–484. [Google Scholar]
- Lykken DL. What’s wrong with psychology anyway? In: Ciccetti D, Grove W, editors. Thinking clearly about psychology. Minneapolis, MN: University of Minnesota Press; 1991. pp. 3–39. [Google Scholar]
- MacKinnon DP. Unpublished doctoral dissertation. University of California; Los Angeles: 1986. Measurement of human memory storage using statistical models of multiple recall performance. [Google Scholar]
- MacKinnon DP. Introduction to statistical mediation analysis. New York, NY: Lawrence Erlbaum; 2008. [Google Scholar]
- MacKinnon DP, Dwyer JH. Estimating mediated effects in prevention studies. Evaluation Review. 1993;17(2):144–158. [Google Scholar]
- MacKinnon DP, Fairchild AJ. Current directions in mediation analysis. Current Directions in Psychological Science. 2009;18(1):16–20. doi: 10.1111/j.1467-8721.2009.01598.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- MacKinnon DP, Fritz MS, Williams J, Lockwood CM. Distribution of the product confidence limits for the indirect effect: program PRODCLIN. Behavior Research Methods. 2007;39(3):384–389. doi: 10.3758/bf03193007. [DOI] [PMC free article] [PubMed] [Google Scholar]
- MacKinnon DP, Lockwood CM, Hoffman JM, West SG, Sheets V. A comparison of methods to test mediation and other intervening variable effects. Psychological Methods. 2002;7(1):83–104. doi: 10.1037/1082-989x.7.1.83. [DOI] [PMC free article] [PubMed] [Google Scholar]
- MacKinnon DP, Pirlott A. Statistical approaches to enhancing the causal interpretation of the M to Y relation in mediation analysis. Personality and Social Psychology Review. 2015;19(1):30–43. doi: 10.1177/1088868314542878. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Markus KA, Borsboom D. Frontiers of test validity theory: Measurement, causation, and meaning. New York: Taylor & Francis; 2013. [Google Scholar]
- Marcus SM, Stuart EA, Wang P, Shadish WR, Steiner PM. Estimating the causal effect of randomization versus treatment preference in a doubly randomized preference design. Psychological Methods. 2012;17(2):244–254. doi: 10.1037/a0028031. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mauro R. Understanding L.O.V.E (left out variables error): a method for estimating the effects of omitted variables. Psychological Bulletin. 1990;108(2):314–329. [Google Scholar]
- Maxwell SE, Lau MY, Howard GS. Is psychology suffering from a replication crisis? What does “failure to replicate” really mean? American Psychologist. 2015;70(6):487. doi: 10.1037/a0039400. [DOI] [PubMed] [Google Scholar]
- McCrae RR, Zonderman AB, Costa PT, Bond MH, Paunonen SV. Evaluating replicability of factors in the revised NEO personality inventory: confirmatory factor analysis versus procrustes rotation. Personality and Individual Differences. 1996;70(3):552–566. [Google Scholar]
- McDonald RP. Haldane’s lungs: A case study in path analysis. Multivariate Behavioral Research. 1997;32:1–38. doi: 10.1207/s15327906mbr3201_1. [DOI] [PubMed] [Google Scholar]
- Miguel E, Camerer C, Casey K, Cohen J, Esterling KM, Gerber A, … Laitin D. Promoting transparency in social science research. Science. 2014;343(6166):30–31. doi: 10.1126/science.1245317. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Millsap RE, Meredith W. Structure in semantic memory: A probabilistic approach using a continuous response task. Psychometrika. 1987;52(1):19–41. [Google Scholar]
- Miočević M, Gonzalez O, Valente MJ, MacKinnon DP. A tutorial in Bayesian potential outcomes mediation analysis. Structural Equation Modeling, in press. doi: 10.1080/10705511.2017.1342541.
- Miočević M, MacKinnon DP, Levy R. Power in Bayesian mediation analysis for small sample research. Structural Equation Modeling, in press. doi: 10.1080/10705511.2017.1312407.
- Miočević M, O’Rourke HP, MacKinnon DP, Brown CH. Statistical properties of five effect size measures for mediation models. Behavior Research Methods. 2017. Advance online publication. doi: 10.3758/s13428-017-0870-1.
- Mooij JM, Peters J, Janzing D, Zscheischler J, Schölkopf B. Distinguishing cause from effect using observational data: Methods and benchmarks. Journal of Machine Learning Research. 2016;17(32):1–102.
- Mosteller F, Tukey JW. Data analysis and regression: A second course in statistics. Reading, MA: Addison-Wesley; 1977.
- Muthén B, Asparouhov T. Causal effects in mediation modeling: An introduction with applications to latent variables. Structural Equation Modeling: A Multidisciplinary Journal. 2015;22(1):12–23.
- Muthén LK, Muthén BO. Mplus User’s Guide. 7th ed. Los Angeles, CA: Muthén & Muthén; 1998–2015.
- National Institutes of Health. NIH data sharing policy. 2009. Retrieved from http://grants.nih.gov/grants/policy/data_sharing
- Nosek BA, Alter G, Banks GC, Borsboom D, Bowman SD, Breckler SJ, … Contestabile M. Promoting an open research culture: Author guidelines for journals could help to promote transparency, openness, and reproducibility. Science. 2015;348(6242):1422. doi: 10.1126/science.aab2374.
- Nunnally JC, Bernstein IH. Psychometric theory. 3rd ed. New York: McGraw-Hill; 1994.
- Open Science Collaboration. Estimating the reproducibility of psychological science. Science. 2015;349(6251):aac4716. doi: 10.1126/science.aac4716.
- Paivio A. Imagery and verbal processes. New York, NY: Holt, Rinehart & Winston; 1971.
- Pearl J. Direct and indirect effects. In: Breese J, Koller D, editors. Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence; San Francisco, CA: Morgan Kaufmann; 2001. pp. 411–420.
- Pearl J. Interpretation and identification of causal mediation. Psychological Methods. 2014;19(4):459–481. doi: 10.1037/a0036434.
- Perrino T, Howe G, Sperling A, Beardslee W, Sandler I, … Brown CH. Advancing science through collaborative data sharing and synthesis. Perspectives on Psychological Science. 2013;8(4):433–444. doi: 10.1177/1745691613491579.
- Plomin R, DeFries JC, Knopik VS, Neiderhiser JM. Top 10 replicated findings from behavioral genetics. Perspectives on Psychological Science. 2016;11(1):3–23. doi: 10.1177/1745691615617439.
- Reisberg D, editor. The Oxford handbook of cognitive psychology. New York, NY: Oxford University Press; 2013.
- Robins JM, Greenland S. Identifiability and exchangeability for direct and indirect effects. Epidemiology. 1992;3(2):143–155. doi: 10.1097/00001648-199203000-00013.
- Rodgers JL. The epistemology of mathematical and statistical modeling: A quiet methodological revolution. American Psychologist. 2010;65(1):1–12. doi: 10.1037/a0018326.
- Ross LD, Nisbett RE. The person and the situation: Perspectives of social psychology. New York, NY: McGraw-Hill; 1991.
- Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology. 1974;66(5):688–701. doi: 10.1037/h0037350.
- Shadish WR, Cook TD, Campbell DT. Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton-Mifflin; 2002.
- Shadish WR, Clark MH, Steiner PM. Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. Journal of the American Statistical Association. 2008;103(484):1334–1344.
- Shadish WR, Galindo R, Wong VC, Steiner PM, Cook TD. A randomized experiment comparing random and cutoff-based assignment. Psychological Methods. 2011;16(2):179–191. doi: 10.1037/a0023345.
- Shapere D. Reason and the scientific search for knowledge. Boston: Reidel; 1983.
- Simonsohn U, Simmons JP, Nelson LD. Better p-curves: Making p-curve analysis more robust to errors, fraud, and ambitious p-hacking; a reply to Ulrich and Miller (2015). Journal of Experimental Psychology: General. 2015;144(6):1146–1152. doi: 10.1037/xge0000104.
- Smith JA, Todd PE. Does matching overcome LaLonde’s critique of nonexperimental estimators? Journal of Econometrics. 2005;125(1):305–353.
- Snijders TAB, Bosker RJ. Multilevel analysis: An introduction to basic and advanced multilevel modeling. Thousand Oaks, CA: Sage; 1999.
- Sobel ME. Asymptotic confidence intervals for indirect effects in structural equation models. Sociological Methodology. 1982;13:290–312.
- Spencer SJ, Zanna MP, Fong GT. Establishing a causal chain: Why experiments are often more effective than mediational analyses in examining psychological processes. Journal of Personality and Social Psychology. 2005;89(6):845–851. doi: 10.1037/0022-3514.89.6.845.
- St Clair T, Hallberg K, Cook TD. The validity and precision of the comparative interrupted time-series design: Three within-study comparisons. Journal of Educational and Behavioral Statistics. 2016;42(3):269–299. doi: 10.3102/1076998616636854.
- Thoemmes F. Empirical evaluation of directional-dependence tests. International Journal of Behavioral Development. 2015;39(6):560–569.
- Thoemmes F, MacKinnon DP, Reiser MR. Power analysis for complex mediational designs using Monte Carlo methods. Structural Equation Modeling: A Multidisciplinary Journal. 2010;17(3):510–534. doi: 10.1080/10705511.2010.489379.
- Thurstone LL, Chave EJ. The measurement of attitude: A psychophysical method and some experiments with a scale for measuring attitude toward the church. Chicago, IL: University of Chicago Press; 1929. pp. 1–21.
- Tingley D, Yamamoto T, Hirose K, Keele L, Imai K. mediation: R Package for Causal Mediation Analysis. Journal of Statistical Software. 2014;59(5):1–38.
- Tofighi D, MacKinnon DP. RMediation: An R package for mediation analysis confidence intervals. Behavior Research Methods. 2011;43(3):692–700. doi: 10.3758/s13428-011-0076-x.
- Valentine JC, Pigott TD, Rothstein HR. How many studies do you need? A primer on statistical power for meta-analysis. Journal of Educational and Behavioral Statistics. 2010;35(2):215–247.
- Valeri L, VanderWeele TJ. Mediation analysis allowing for exposure-mediator interactions and causal interpretation: Theoretical assumptions and implementation with SAS and SPSS macros. Psychological Methods. 2013;18(2):137–150. doi: 10.1037/a0031034.
- Valsiner J, Connolly K, editors. Handbook of developmental psychology. Thousand Oaks, CA: Sage; 2003.
- VanderWeele TJ. Explanation in causal inference: Methods for mediation and interaction. New York, NY: Oxford University Press; 2015.
- VanderWeele TJ, Vansteelandt S. Conceptual issues concerning mediation, interventions and composition. Statistics and Its Interface (Special Issue on Mental Health and Social Behavioral Science). 2009;2:457–468.
- von Eye A, DeShon RP. Directional dependence in developmental research. International Journal of Behavioral Development. 2012;36(4):303–312. doi: 10.1177/0165025412444077.
- Wright S. The relative importance of heredity and environment in determining the piebald pattern of guinea pigs. Proceedings of the National Academy of Sciences. 1920;6:320–332. doi: 10.1073/pnas.6.6.320.