Abstract
Plausible competing developmental models show similar or identical structural equation modeling (SEM) fit indices, despite making very different causal predictions. One way to help address this problem is to incorporate outside information into model selection. This study attempted to select among developmental models of children’s early mathematical skills by incorporating information about the extent to which models forecast the longitudinal pattern of causal impacts of early math interventions. We test the usefulness and validity of the approach by applying it to data from three randomized controlled trials of early math interventions with longitudinal follow-up assessments in the U.S. (Ns = 1375, 591, 744; baseline ages 4.3, 6.5, and 4.4 years; 17%–69% Black). We find that, across datasets: 1) some models consistently outperform others at forecasting later experimental impacts, 2) traditional statistical fit indices are not strongly related to causal fit as indexed by models’ accuracy at forecasting later experimental impacts, and 3) models show consistent patterns of similarity and discrepancy between statistical fit and their effectiveness at forecasting experimental impacts. We highlight the importance of triangulation and call for more comparisons of experimental and non-experimental estimates for choosing among developmental models.
Keywords: SEM, longitudinal data analysis, longitudinal model, randomized controlled trials
Developmentalists often rely on structural equation modeling (SEM) to relate theoretical assumptions to data and to compare models that make substantively different predictions about the causal relations among variables. Model fit indices are commonly used to evaluate whether a specific model is plausible or which of a set of alternative models is most plausible. However, it is well known that a variety of models that make drastically different, mutually exclusive theoretical predictions can show similar or identical fit indices (Kline, 2011; Tomarken & Waller, 2003). The general problem that plausible competing theories can make the same set of predictions, referred to as “underdetermination” of scientific theory in philosophy of science (Stanford, 2017), can in some cases be addressed by generating stronger, more specific predictions on which theories can be compared (Mayo, 2018; Meehl, 1990).
The goal of this paper is to triangulate on longitudinal models of children’s mathematical development by comparing the causal forecasting accuracy of competing longitudinal models, using a combination of experimental and non-experimental estimates. Specifically, we chose among models on the basis of their fit with estimates generated from research designs allowing for strong causal inference. For each model, we obtain an index of causal fit: the Causal Mean Squared Error (CMSE), which indexes models’ accuracy at forecasting later experimental impacts using a within-study design approach. The CMSE is calculated from two estimates of the causal effect on a longer-term outcome: one is estimated by comparing the experimental group with the control group, as one would when computing treatment impacts in a randomized controlled trial (RCT); the other is estimated by forecasting the experimental impact on a longer-term outcome, multiplying short-term experimental impacts by causal paths estimated in longitudinal models fit to data from the control group alone. The CMSE of each model is the mean of squared differences between observed impacts and forecasted impacts on children’s mathematical skills and serves as an index of the extent to which the longitudinal model conveys causally accurate information, under a set of assumptions. We use data from three randomized experiments, in which children were randomly assigned to receive an early math intervention or business as usual, to test the usefulness and validity of our proposed approach.
Results suggest that this approach can be useful for choosing among models. First, across datasets, some models consistently outperform other models at forecasting later experimental impacts. Further, across datasets, traditional fit indices are not strongly related to our index of causal fit, suggesting that this index provides information for model selection that is not redundant with statistical fit estimated from non-experimental longitudinal data alone. Finally, across datasets, models show consistent patterns of similarity and discrepancy between statistical fit and causal fit, suggesting that there is reliable additional information contained in our estimate of causal fit, compared to traditional measures.
We also present the results from a simulation study designed to test the usefulness of the CMSE for recovering a generating model across a variety of conditions in the Supplementary Materials. We find that the CMSE outperforms traditional fit statistics when assumptions of the method are met, and that, because the CMSE relies on information partially independent from the non-experimental data on which the structural model is fit, it is less redundant with statistical fit indices than those statistical fit indices are with each other.
Underdetermination of Developmental Models
A diverse set of models will often make similar or identical predictions about the covariance structure of a longitudinal dataset. Tomarken and Waller (2003) discussed the potential problems with well-fitting models in a review: In short, models that imply equivalent covariance matrices can make strongly discrepant causal predictions. Conventionally, a model fits well when there is minimal discrepancy between the observed covariances among the directly measured variables and the covariances among those variables implied by the model. However, a variety of developmental processes can generate similar patterns of covariation across time. For example, in the context of children’s mathematical development, these models include: 1) autoregressive models, wherein children’s skills at Time t are a function of their skills at Time t-1 and error; 2) autoregressive models with multiple lags, such that children’s math skills at Time t are a function of their skills at Times t-1, t-2, and error; and 3) latent trait or intercept models, such that children’s math skills at Times t-2, t-1, and t are all a function of some stable characteristic(s) and error. All of these models are consistent with positive correlations between measures of children’s mathematics achievement across time, but they make very different predictions about the effects on later math skills of a hypothetical intervention that targets children’s math skills at Time t-2.
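The underdetermination problem can be made concrete with a small simulation. The sketch below is a minimal illustration with hypothetical parameter values (it is not an analysis from the present study): it generates data from an AR(1) process and from a latent trait model, which imply essentially the same cross-wave correlations yet forecast very different downstream effects of an exogenous boost to early skill.

```python
import numpy as np

rng = np.random.default_rng(0)
n, waves = 100_000, 3

# Model 1: AR(1) -- skill at each wave is 0.8 * prior skill plus noise.
ar = np.empty((n, waves))
ar[:, 0] = rng.standard_normal(n)
for t in range(1, waves):
    ar[:, t] = 0.8 * ar[:, t - 1] + 0.6 * rng.standard_normal(n)

# Model 2: latent trait -- every wave reflects one stable factor plus noise.
trait = rng.standard_normal(n)
lt = 0.8 * trait[:, None] + 0.6 * rng.standard_normal((n, waves))

# Both models imply nearly identical positive cross-wave correlations...
print(np.corrcoef(ar[:, 0], ar[:, 2])[0, 1])  # ~0.64
print(np.corrcoef(lt[:, 0], lt[:, 2])[0, 1])  # ~0.64

# ...but a 1-unit exogenous boost to wave-0 skill propagates to wave 2
# only under AR(1): forecast 0.8 * 0.8 = 0.64 versus 0 under the trait model.
```

Fit statistics computed from the covariance structure alone cannot distinguish these two data-generating processes, but an experiment that shifts early skill can.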
Some have proposed that better incorporating causal predictions into the model selection process will decrease the chances of choosing substantively wrong models (e.g., Bailey et al., 2018; Borsboom et al., 2003; Protzko, 2017; Rhemtulla et al., 2020). For example, if a model implies that the only reason that people who like people also tend to enjoy parties is that they are both caused by a latent entity, extraversion, then an exogenous intervention that causes increases in liking people without affecting extraversion should not increase their likelihood of attending parties (Cramer et al., 2012).
Within-Study Designs
In the current study, we use a within-study design, which attempts to triangulate on developmental models by incorporating causal information to evaluate the fit of developmental models fit to non-experimental data. Our proposed work builds on a substantial body of within-study design or within-study comparison approaches which have been applied periodically in the social sciences for decades. Such approaches are usually applied to a control group within an experimental study to test the effectiveness of design features and statistical methods that best reproduce the causal effect estimated from the experimental design (Cook et al., 2008).
Shadish et al. (2008) applied a within-study design approach to test whether the impacts of two training regimens – a math program or vocabulary program – could be recovered in a dataset in which participants were allowed to select which program to participate in. Some students were randomly assigned to one training regimen or the other while others self-selected into the training they preferred. To test whether statistical methods could be used to reproduce the causal effect of the intervention, the experimenters first estimated the treatment effect in the experimental condition by regressing mathematics and vocabulary scores on counterpart pretest measures plus training type (math or vocabulary). Then they used a variety of analytic strategies, including ANCOVA (Analysis of Covariance) and several analyses incorporating propensity score matching, within the non-randomized condition to attempt to recover the average treatment effect. Perhaps reflecting positive self-selection into the training students felt they would enjoy the most, differences in vocabulary and mathematics test gains from the two training regimens were larger in the condition in which participants chose their training. However, for both outcomes, this bias could be reduced by over 80% by using ANCOVA with a rich set of baseline covariates or by a combination of stratifying on propensity scores and adding covariates to the model. These results were quite impressive: The total remaining bias was less than 5% of the experimental impact, indicating that observational studies can, in some cases, recover causal impacts with reasonable accuracy.
Weidmann and Miratrix (2020) demonstrate similar findings for a set of 14 educational interventions targeting various educationally relevant outcomes. These results were useful because they identified combinations of participant characteristics, selection processes into interventions, baseline covariates, and estimation strategies under which researchers can feasibly expect to recover the average treatment effect in future work.
We attempt to extend this work; however, instead of identifying the designs and analyses under which end-of-treatment effects can be recovered, we attempt to identify the designs and analyses under which longer-term impacts can be recovered on the basis of short-run impacts and a plausible longitudinal model fit to the control group. Accomplishing this successfully would allow researchers to 1) better choose among a range of longitudinal models that might be used in future non-experimental analyses of similar constructs and participants, and 2) make educated guesses about longer-run treatment impacts in future experiments based on estimates from a non-experimental dataset and an observed or hypothesized short-run treatment impact.
Methodological Assumptions
We adapt the within-study design approach to generate longer-run forecasts of intervention effects, conditional on the short-run effects and a longitudinal model fit to the control group relating end-of-treatment skills to later skills. In other words, we select longitudinal models based on their accuracy at forecasting later experimental impacts, conditional on observed short-run experimental impacts (e.g., treatment effect at the end of treatment). We choose to fit models to the control group because it can be regarded as non-experimental data and this practice thus is consistent with our goal of triangulation. Figure 1 shows the conceptual framework of the within-study design.
Figure 1.
The within-study design used in the current study. Panel A presents cexperimental, which is the causal effect of treatment on skill at Time t, drawn from an experiment with random assignment to a treatment or control group. The effects in Panel A are compared to cnon-experimental in Panel B. Panel B demonstrates how we calculated the model-implied treatment impact cnon-experimental. The model-implied treatment impact is the product of aexperimental (the observed effect of the treatment on skill at Time 1) and bnon-experimental (the estimated causal effect of skill at Time 1 on skill at Time t, drawn from the control group). Therefore, cnon-experimental = aexperimental * bnon-experimental.
This kind of within-study design can fail under several conditions. First, if the intervention affects variables other than the causal variables of interest in the model (in our case, observed early math skills) that substantially influence the later outcomes (i.e., if there is a direct effect of the intervention on later outcomes), then a model that recovers the correct value of the model-forecasted effects of end-of-treatment skills on later skills (the parameter bnon-experimental in Figure 1) will not reproduce the observed effects of the treatment in the subsequent waves (the parameter cexperimental in Figure 1). The assumption that the effect of a randomly assigned variable on outcomes of interest operates only through the causal variable of interest is known as the exclusion restriction assumption in instrumental variables (IV) analysis (for an introduction, see Angrist & Pischke, 2008). That is why, as we recommend in Table 1, we selected interventions that were specifically designed to increase end-of-treatment math skills. We hypothesize that this method would not work well for interventions likely to directly influence a broad set of outcomes, such as a year of schooling or adoption into a different family. Indeed, even in our examples, it is plausible that these interventions affect later math achievement via other pathways, such as motivation and attention, which would generally be hypothesized to boost later outcomes. In this case, the impacts of the intervention at Time t would be larger than expected based on the assumption that the intervention affected later math skills only via measured earlier math skills, and the correct value of bnon-experimental would under-predict these longer-run impacts.
In other words, when important end-of-treatment variables influenced by the intervention cannot be measured, our method may produce reliably biased estimates of the effects of earlier skills on later skills, sometimes in a predictable direction, and we may risk selecting the wrong model.
Table 1.
Design features shared across datasets
| Design feature | Purpose |
|---|---|
| Randomization to groups | Strongly identified end-of-treatment impacts |
| Intervention is specifically designed to increase end-of-treatment math skills | Decreases potential biases from indirect effects of the treatment via unmeasured mediators |
| At least 3 waves of data on key outcome measure (including pretest) | Allows for the estimation of longitudinal models with latent intercepts |
| a) Participants in preschool or the early school years in the U.S.; b) Math achievement measure at each wave; c) Approximately 1-year time lags between waves | Reasonably comparable parameters estimated across studies in longitudinal models |
| Non-trivial end-of-treatment impact | Analyses require multiplying estimated effects by the end-of-treatment impact; null impacts will not allow models to be clearly differentiated |
| Pretest covariates | To statistically control for potential confounds influencing earlier and later math achievement |
However, depending on the research question and design, bias from violations of the exclusion restriction assumption (i.e., bias from unmeasured mediators) may be greater or less than the omitted variable bias in non-experimental estimates (i.e., bias from unmeasured confounds). In the case of effective educational interventions, these two types of biases will frequently go in opposite directions, with unmeasured mediators leading our approach to underestimate long-run effects (because the unmeasured mediators positively influence longer-run outcomes) and unmeasured confounds leading our approach to overestimate long-run effects (because the confounders are unaffected by the intervention but generally increase the estimated effect of earlier skills on later skills). In the interventions analyzed in the current study, bias from unmeasured mediators may be minor, as the control groups tend to quickly learn much of the intervention content in the post-intervention period. For other kinds of interventions, in particular when 1) in the absence of the treatment (i.e., the counterfactual approximated by the control group), individuals do not eventually learn the key skills, capacities, or beliefs affected by the treatment, and 2) not all of these mediators are well measured at the end of treatment, our method will likely yield predictably overly pessimistic predictions.
Additional assumptions are required for the causal effects estimated from the experimental design to be generalizable to our non-experimental approach. These include linearity (e.g., the relationship between a predictor and an outcome is linear), monotonicity (e.g., there are no defiers for the assignment of the treatment), and additivity (e.g., the effects of different predictors on the expected value of the outcome are additive) (Pearl & Bareinboim, 2014). Because these assumptions are necessary for generalizing one effect to any other set of conditions, they are important to consider when using this design. For example, if prior math achievement is not linearly related to subsequent math achievement, then multiplying the average treatment effect by the estimated effect of earlier math skills on later math skills will not equal the observed total effect of the intervention. Here, we assume a linear effect of early math achievement on later math achievement.
The Current Study
Applying the proposed approach requires an experimental dataset; one can then use information from that dataset for model selection in future non-experimental studies, and to forecast future experimental impacts using observed short-run experimental impacts and a longitudinal model fit to non-experimental data. We provide examples using three datasets from longitudinal mathematics RCT interventions. We investigate whether some models are found to be consistently causally informative across the datasets. In addition, we test the usefulness and validity of our index of causal fit (i.e., models’ accuracy at forecasting later experimental impacts) by asking the following questions:
Do traditional model fit indices and the causally informative model fit index yield similar patterns of results across conceptual replications?
Are discrepancies between traditional model fit indices and the causally informative model fit index consistent across conceptual replications?
Method
Data
We use data from three randomized controlled trials of mathematics interventions for children in classrooms spanning preschool through fifth grade: the Technology-enhanced, Research-based, Instruction, Assessment, and professional Development study (TRIAD; Clements & Sarama, 2013), an evaluation of a Number Knowledge Tutoring program (NKT; Fuchs et al., 2013), and an evaluation of a preschool mathematics curriculum called Pre-K Mathematics (PKM; Starkey & Klein, 2012). The three datasets were chosen because they share several design features (see Table 1). Those features are necessary to identify some of the longitudinal models under consideration and to make results reasonably comparable across analyses. Details about the datasets including the designs, participants, and measures are shown in Table 2 (see the descriptions of the covariates and measures used in each dataset in the Supplementary Materials).
Table 2.
Datasets used in the study
| Data Characteristics | TRIAD | NKT | PKM |
|---|---|---|---|
| Key citation | Clements et al., (2013) | Fuchs et al. (2013) | Starkey & Klein (2012) |
| Level of randomization | Preschools (blocked) | Within classrooms | Preschool sites (blocked) |
| Sample size | 42 schools, 1,375 students | 591 students | 63 preschool sites, 744 children |
| Race/Ethnicity, FRPL status | 51% AA, 23% H, 85% FRPL | 69% AA, 20% White, 84% FRPL | 52% White, 18% H, 17% AA, 100% FRPL |
| Treatment group used in the study | Building Blocks-No Follow Through | Number Knowledge Tutoring (NKT) - Speeded Practice | Pre-K Mathematics |
| Waves | F PK (pretest), S PK (posttest), S K, S G1, F G4, S G4, S G5 | G1 (pretest and posttest), G2, G3 | F PK (pretest), S PK (posttest), S K, S G1 |
| Covariates | | | |
| Demographics | gender, ethnicity, age, mother’s education, FRPL, ELL, SE | gender, ethnicity, FRPL | gender, ethnicity, ELL, Head Start Program |
| Pretests | REMA | FCR, KeyMath-Numeration, WRAT-Reading, Nonverbal Reasoning, Processing Speed, WM-Listening Recall, WM-Counting Recall, Listening Comprehension, Attentive Behavior | TEMA, WJ Letter-Word Identification, Spelling, and Understanding Directions subtests |
| Mathematics achievement outcomes measures | REMA (PK-1), TEAM 3–5 (G4-5) | FCR | TEMA (pre-k pre, post, K, grade 1) |
| End of Treatment Impacts | REMA (.56 SD) | FCR (.40 SD) | TEMA (.33 SD) |
Notes. FRPL = Free or Reduced-Price Lunch eligibility, ELL = English Language Learner status, SE = Special Education status, AA=African American, H=Hispanic, F=Fall, W=Winter, S=Spring, G=Grade, PK=Pre-kindergarten, WRAT= Wide Range Achievement Test, WJ = Woodcock Johnson, WM = working memory, FCR = Facts correctly retrieved, TEMA = Test of early mathematics ability, REMA = Research based early mathematics assessment. In the original TRIAD study, 1375 students within 42 schools were included in the sample. They were randomly assigned into three groups: 1) Building Blocks-No Follow Through, 2) Building Blocks-Follow Through, 3) Control group. In the current analyses, we only use data from the first treatment group and the control group (N = 834, school n = 30) because students in the second treatment group received extended treatment after end of treatment in preschool. The original NKT study has two treatment groups: speeded practice group and non-speeded practice group. We only use data from the former group in the current analyses because a bigger treatment impact was found in this group.
TRIAD.
Thirty urban elementary schools in Massachusetts and New York serving low-income students were randomly assigned to receive treatment or continue with business-as-usual. This was an effectiveness study of the TRIAD scale up model, in which preschool teachers in the treatment group received professional development (PD) training in guiding instruction using the Building Blocks curriculum; the control group continued using the district mathematics curriculum. Students’ mathematics achievement was measured using the REMA and the TEAM 3–5 (Clements et al., 2011, 2012). End of treatment impacts were initially reported in Clements et al. (2011) and appear in Table 2. Impacts out to fifth grade have been reported in Clements et al. (2016) and appear in Table S1.3.
NKT.
Students at risk of low academic performance from elementary schools in a U.S. southeastern metropolitan district were randomly assigned to a treatment group, which received tutoring with speeded practice, or to business-as-usual instruction. The Addition Strategy Assessment-Facts Correctly Retrieved (Geary et al., 2007) was used to measure student math achievement. End of treatment impacts were initially reported in Fuchs et al. (2013) and appear in Table 2. Impacts out to third grade have been reported in Bailey et al. (2019) and appear in Table S2.3.
PKM.
Low-income students in state funded preschool or Head Start sites in urban areas of Kentucky and Sacramento, California were randomly assigned to receive treatment or continue with business-as-usual instruction. The treatment group received the Pre-K Mathematics curriculum (Klein et al., 2002)—which is designed to enrich the preschool and home learning environment—to supplement their existing classroom curricula. The control group received their business-as-usual math instruction using the existing curriculum without the support of an intentional math curriculum. The TEMA-3 (Ginsburg & Baroody, 2003) was used to measure both informal and formal (symbolic) mathematical knowledge after the treatment because it can be used with children ages 3 through 8. End of treatment impacts have been reported in Starkey et al. (2020) and appear in Table 2. Impacts out to first grade are used in the current paper and appear in Table S3.3.
Analysis
Figure 1 illustrates the within-study design, which contains four steps. First, we estimated the causal effect of each treatment on the same outcome measure at the posttest (the parameter aexperimental in Figure 1) and all subsequent waves (cexperimental) by regressing the outcome measure on randomly assigned treatment status and all demographic and pretest covariates listed in Table 2. It should be noted that all outcome scores used in this and the following steps were standardized by dividing raw scores by the standard deviation of the control group’s scores at each wave. We used the intent-to-treat (ITT) estimates of the treatment effects.
Second, we estimated the effects of end-of-treatment skills on later skills (bnon-experimental) using a variety of models, which we describe below, fit only to data from the control group of each study. This parameter estimate can be interpreted as the estimated effect of a 1-unit boost to skill at Time 1 (posttest at the end of treatment) on the same skill at Time t (any subsequent wave following the end of treatment).
Third, we projected the model-implied impacts of the intervention on long-run outcomes (cnon-experimental) by multiplying the end-of-treatment impact (aexperimental) by the estimates of bnon-experimental yielded by each model, such that cnon-experimental = aexperimental * bnon-experimental. That is, the estimated effect of an intervention on a later outcome equals the product of the effect of the intervention on the early outcome and the estimated effect of the early outcome on the later outcome. This idea follows from the logic of SEM: under a set of assumptions, an exogenous change to one of the variables in the model will have downstream effects that can be predicted by tracing forward along the paths from the variable affected directly by the intervention to its causal descendants.
Finally, for each model used to generate estimates of bnon-experimental, we computed an index of causal fit: the Causal Mean Squared Error (CMSE), based on the squared deviations between the model-implied treatment effects (cnon-experimental) and the observed causal effects (cexperimental). Specifically, we calculated the mean of the squared differences between cexperimental and cnon-experimental across the T post-treatment waves, as shown in the formula below.

CMSE = (1/T) * Σt (cexperimental,t − cnon-experimental,t)²
Then we compared each model’s CMSE to its performance on traditional indices of statistical fit based on the variance covariance matrix. Regression analyses were conducted using Stata/SE 14.0. All the SEM analyses were conducted in Mplus 7.4 (Muthén & Muthén, 2015). Code for Mplus can be found in the Supplementary Materials. We present several conventional indices of model fit, including the comparative fit index (CFI), Tucker-Lewis index (TLI), root mean squared error of approximation (RMSEA), and standardized root mean square residual (SRMR), in tables but focus on RMSEA in the figures and text. RMSEA is a measure of badness-of-fit of the estimated covariance matrix to the observed covariance matrix—the lower the score the better the fit—adjusted by the degrees of freedom such that greater parsimony in the model is rewarded (Kline, 2011).
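Under our assumptions, the forecasting and scoring steps reduce to a few lines of arithmetic. The sketch below (with hypothetical numbers; `cmse` is an illustrative name, not code from the study's Stata/Mplus pipeline) forecasts cnon-experimental from aexperimental and bnon-experimental and then computes the CMSE.

```python
import numpy as np

def cmse(a_exp, b_nonexp, c_exp):
    """Causal Mean Squared Error for one candidate longitudinal model.

    a_exp    : observed end-of-treatment impact (scalar, in control-group SDs)
    b_nonexp : model-implied effects of the posttest skill on the skill at
               each later wave (one entry per follow-up wave)
    c_exp    : observed experimental impacts at those same waves
    """
    c_nonexp = a_exp * np.asarray(b_nonexp)   # forecasted impacts
    return float(np.mean((np.asarray(c_exp) - c_nonexp) ** 2))

# Hypothetical example: a 0.56 SD posttest impact, a model forecasting
# persistence of 0.6 and 0.4, and observed follow-up impacts of 0.30 and 0.20.
print(cmse(0.56, [0.6, 0.4], [0.30, 0.20]))
```

Lower CMSE indicates that the model's forecasts track the experimental impacts more closely; models are then ranked on this index alongside their conventional fit statistics.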
Models for Forecasting Effects
We use four models as examples of how we apply our method. The models considered in this paper (see Table 3) are commonly used to model processes underlying the development of psychological constructs. We also consider several other models in the Supplementary Materials (see Tables S4 and S5, Figures S1 and S2). Importantly, assessing model fit on the basis of causal information does not require using this subset of models, and certainly not data from the narrow slice of educational interventions from which these studies were selected. Rather, researchers should choose the set of models they consider on the basis of theories within their area of interest and choose datasets that will allow them to attempt to recover similar parameters in these models. In this section, we briefly describe each model’s structure, assumptions, advantages, and disadvantages.
Table 3.
SEM models used in the study
| Model | Picture | Advantages | Disadvantages | Hypothesized math skill development |
|---|---|---|---|---|
| Regression | | • Flexible, does not impose functional form on pattern of impacts across time | • Limited to linear relations • Possible omitted variable bias due to unmeasured confounds and measurement error in covariates | Children’s early math skill is the foundation upon which later math skill is built. Early skill effects on later skill may or may not be fully mediated via intermediate skill. |
| AR(1) | | • Only requires 2 waves of data • Does not require covariates | • Without covariates, time- and domain-general omitted variable bias are major concerns • Implies exponential decay of correlations across time, which is unrealistic | Children’s early math skill influences immediately following math skill. |
| AR(2) | | • Does not require covariates • Accounts for stability in correlations across waves | • Requires at least 3 waves of data • Without covariates, time- and domain-general omitted variable bias are major concerns | Children’s early math skill influences immediately following math skill and provides a basis for math learning across development. |
| RIAR | | • Does not require covariates • Accounts for stability in correlations across waves • Models time-general confounds | • Requires at least 3 waves of data | In addition to autoregressive effects, there are unmeasured confounds (e.g., home environment) that exert a stable influence on children’s math skill throughout development. |
Note. AR = Autoregressive, RIAR = Random Intercept Autoregressive. O in the RIAR model refers to Occasion specific variance.
Regression.
To estimate the effects of an earlier skill on a later skill, the later skill is regressed on the earlier skill conditional on pretest covariates, which are assumed to affect both earlier and later skills as shown in Table 3. The estimated impact (bnon-experimental) is the path from an outcome at Time 1 to the same outcome at Time t. Regression is a widely used, powerful tool that is available in all commonly used statistical software packages. However, for these estimates to be unbiased, pretest covariates must capture all the factors that influence both earlier and later skills. Prior research in this area has found that these estimates are frequently upwardly biased (Bailey et al., 2018, 2019), likely due to the difficulty of fully measuring the set of confounding factors influencing earlier and later math skills, particularly in young children.
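As a sketch of this estimator (an illustrative OLS implementation with hypothetical variable names, not the study's Stata code), the later skill is regressed on the posttest skill plus pretest covariates, and bnon-experimental is the coefficient on the posttest skill:

```python
import numpy as np

def b_regression(y_later, y_post, covariates):
    """OLS estimate of b_non-experimental: regress the later skill on the
    posttest skill and pretest covariates, fit to control-group data only.
    Scores are assumed pre-standardized by the control-group SD at each wave."""
    X = np.column_stack([np.ones(len(y_post)), y_post, covariates])
    coefs, *_ = np.linalg.lstsq(X, y_later, rcond=None)
    return coefs[1]  # coefficient on the posttest skill
```

The estimate is causally interpretable only if the covariates capture all common causes of the earlier and later skill, which is the source of the upward bias discussed above.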
Autoregressive models.
In the autoregressive 1-lag model (AR(1)), the outcome variable of interest is assumed to be influenced by the same outcome measured at a preceding time point in a time series (see Table 3). The AR(1) model implies that both correlations and causal effects will decay exponentially as the distance in time increases between two waves. The estimate of bnon-experimental is the product of all of the paths between an outcome at Time 1 and the same outcome at Time t.
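Computing this product is straightforward. A minimal sketch with hypothetical path values (the function name is ours, for illustration):

```python
import numpy as np

def b_ar1(betas):
    """b_non-experimental under AR(1): the product of the lag-1 paths
    linking the posttest wave to wave t (betas[i] is the estimated path
    from wave i to wave i + 1)."""
    return float(np.prod(betas))

# With hypothetical paths of 0.8 and 0.7, a 1-unit posttest boost is
# forecast to shrink to 0.8 * 0.7 = 0.56 two waves later, illustrating
# the exponential decay the AR(1) model implies.
print(round(b_ar1([0.8, 0.7]), 2))  # 0.56
```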
The autoregressive 2-lag model (AR(2)) allows the outcome at Times t-1 and t-2 to affect the outcome at Time t (see Table 3). Conceptually, perhaps earlier skills affect much later skills both indirectly, via intermediate skills, and directly, as foundational skills are consequential throughout development (for discussion, see Bailey et al., 2017). In this case, both the causal effects and the correlations between time points can remain stable as the distance in time increases between them. The model can be estimated with at least 3 waves of data. The estimate of bnon-experimental is the sum of all of the paths between an outcome at Time 1 and the same outcome at Time t. For observations separated by at least 2 waves, there will be at least two possible paths between these outcome measures in the AR(2) model. Thus, the AR(2) model tends to imply greater correlational stability and greater causal stability of effects in the post-treatment period. The AR(2) model also allows for a direct effect of a skills-training intervention on later outcomes, which might plausibly be expected. However, in the particular subset of early childhood math interventions included in the current analysis, the control group at the wave immediately following the posttest always outperforms the treatment group at the posttest. This lessens the possibility of direct effects in later periods because there is likely little experimental impact on the exact content trained during the intervention at subsequent time points.
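The sum over all compound paths in the AR(2) model can be computed recursively: the implied effect at wave t is the lag-1 path times the effect at wave t-1 plus the lag-2 path times the effect at wave t-2. A minimal sketch with hypothetical path values (assuming, for simplicity, paths that are constant across waves):

```python
def b_ar2(beta1, beta2, lag):
    """b_non-experimental under AR(2): total implied effect of a 1-unit
    posttest boost on the outcome `lag` waves later, summing every
    compound path via the recursion e(t) = beta1*e(t-1) + beta2*e(t-2)."""
    prev2, prev1 = 0.0, 1.0  # effect at the boosted wave itself is 1
    for _ in range(lag):
        prev2, prev1 = prev1, beta1 * prev1 + beta2 * prev2
    return prev1

# With beta1 = 0.6 and beta2 = 0.3: one wave later the implied effect is
# 0.6; two waves later it is 0.6*0.6 + 0.3 = 0.66, so the lag-2 path keeps
# the forecast from decaying as fast as under AR(1).
print(round(b_ar2(0.6, 0.3, 2), 2))  # 0.66
```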
Random intercept autoregressive model.
The random intercept autoregressive (RIAR) model assumes that the outcome at Time t is influenced by a time-invariant factor and an occasion-specific term (see Table 3). Each occasion-specific term is assumed to be influenced by the preceding occasion-specific term, so the model implies that time-specific influences on the outcome decay as the lag increases. In this model, the bnon-experimental estimate is the product of the direct paths between the occasion-specific terms from Time 1 to Time t. Thus, the RIAR model accounts for the correlational stability of a construct across time (like the AR(2) model) while implying exponential decay in the causal stability of effects (like the AR(1) model).
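A small simulation illustrates the RIAR model's signature; all parameter values here are hypothetical, not estimates from the paper. The time-invariant factor keeps correlations between waves high, while the implied causal effect of shifting only the occasion-specific term decays as the occasion path raised to the lag.

```python
# RIAR sketch: outcome_t = trait + occasion_t, with
# occasion_t = a * occasion_{t-1} + noise. All parameter values below
# (trait and occasion variances of 1, occasion path a = .4) are
# hypothetical illustrations only.
import numpy as np

rng = np.random.default_rng(0)
n, waves, a = 100_000, 5, 0.4
trait = rng.normal(0.0, 1.0, n)
occ = rng.normal(0.0, 1.0, n)
y = [trait + occ]
for _ in range(waves - 1):
    occ = a * occ + rng.normal(0.0, np.sqrt(1 - a**2), n)  # keep var(occ) = 1
    y.append(trait + occ)

# Correlations with wave 1 stay high because of the shared trait...
print([round(float(np.corrcoef(y[0], y[t])[0, 1]), 2) for t in range(1, waves)])
# ...but the implied causal effect of a wave-1 occasion shift decays as a**lag:
print([round(a**lag, 3) for lag in range(1, waves)])
# [0.4, 0.16, 0.064, 0.026]
```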
Within each dataset, we estimated bnon-experimental using the four models described above. For the regression models, all of the variables listed as demographics and pretest variables in Table 2 (also see additional information on the measures in the Supplementary Materials) were included as covariates when estimating aexperimental, cexperimental, and bnon-experimental.
Results
We found no significant differences between the treatment and control groups on any background measure in any of the studies, and there were no significant group differences in attrition (see Tables S1.1, S2.1, and S3.1). Supplementary Tables S1.2, S2.2, and S3.2 show significant initial treatment effects in each dataset (also see Tables S1.3, S2.3, and S3.3: d = .59 in TRIAD, .40 in NKT, and .33 in PKM), with smaller effects in subsequent waves. Treatment effects across datasets showed substantial fadeout, with 45% to 79% of the original effects decaying in the first year after the end of treatment.
Are some models found to be consistently causally informative across the datasets?
Figure 2 presents the observed impacts (cexperimental) and model-implied treatment impacts (cnon-experimental) on posttest and later math achievement in each dataset. The trajectories of the actual treatment impacts (cexperimental) over time are displayed by the red lines in Figure 2. The model-implied treatment impacts (cnon-experimental) of different models are displayed using different colors. Additional details about the results from the TRIAD dataset are available in the Supplementary Materials (see Table S1.3 for regression models estimating the treatment impacts, aexperimental and cexperimental, Table S1.4 for estimated effects of end-of-preschool math achievement on later achievement from all models, bnon-experimental, and Table S1.5 for the comparison between observed impact, cexperimental, and model-implied impact on later math achievement, cnon-experimental, for all models). These details are organized in the same pattern for the NKT dataset (see Tables S2.3, S2.4, S2.5), and the PKM dataset (see Tables S3.3, S3.4, S3.5).
Figure 2.
Observed impact (cexperimental) and model-implied treatment impact (cnon-experimental) on later math achievement in each dataset. Note: The Observed line is the unbiased observed treatment effect (cexperimental). The Regression, AR(1), AR(2), and RIAR lines are the model-implied treatment impacts (cnon-experimental), each the product of the observed effect of the treatment on end-of-treatment math achievement (aexperimental) and the estimated effect (bnon-experimental) of end-of-treatment math achievement on later achievement. AR = Autoregressive, RIAR = Random Intercept Autoregressive. For all the models, standard errors were adjusted for clustering at the school level (TRIAD & PKM) or the classroom level (NKT).
There are several regularities in the lines plotted in Figure 2. The projected treatment effects implied by the models fit to the control groups show consistent patterns of bias across datasets, and some models consistently outperform others at forecasting later experimental impacts. For example, the RIAR models consistently outperform the AR(1) and AR(2) models, whereas the regression, AR(1), and AR(2) models consistently overestimate the long-run treatment impacts. The bias shown by the regression and AR models is often substantial, especially at later waves. For example, the AR(2) model forecasts effects 2 years after treatment of .40, .11, and .19 across the 3 datasets, compared to observed impacts of .18, −.01, and .02. Importantly, when comparing the predicted effects (bnon-experimental) of the AR models and the RIAR models, especially for the last wave of data (see Tables S1.4, S2.4, S3.4), the differences in effects are several times the standard errors in all three datasets, suggesting that the bnon-experimental estimates are meaningfully different. We include results of other models in the Supplementary Materials (see Figure S1). Other models with an intercept intended to account for unmeasured confounds also consistently outperform the AR(1) and AR(2) models.
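The construction of the forecasts in Figure 2 can be sketched as follows. The posttest impact and the two persistence profiles below are hypothetical values, chosen only to mirror the qualitative contrast between near-stable and decaying implied persistence:

```python
# Forecasted impact: cnon-experimental = aexperimental * bnon-experimental,
# computed per follow-up wave and compared against observed impacts.
def forecast_impacts(a_exp, b_by_lag):
    return [a_exp * b for b in b_by_lag]

a_exp = 0.40                  # hypothetical end-of-treatment impact (SD units)
ar_b = [0.80, 0.72, 0.68]     # near-stable implied persistence (AR-style)
riar_b = [0.45, 0.25, 0.15]   # decaying implied persistence (RIAR-style)
print([round(c, 3) for c in forecast_impacts(a_exp, ar_b)])    # [0.32, 0.288, 0.272]
print([round(c, 3) for c in forecast_impacts(a_exp, riar_b)])  # [0.18, 0.1, 0.06]
```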
RQ1: Do traditional model fit indices and the causally informative model fit index yield the same results?
Table 4 summarizes the traditional SEM model fit indices and the CMSE. The traditional fit indices and the index of causal fit are not perfectly related: some models within the same dataset showed better (lower) CMSE but worse traditional fit indices by conventional standards (Kline, 2011). Tables S1.6, S2.6, and S3.6 contain a detailed description of the model fit indices and CMSE for varying model specifications in each dataset.
Table 4.
Comparing traditional model fit indices and the causal model index
| Dataset | Model | CFI | TLI | RMSEA | SRMR | CMSE overall | CMSE 1 year later | CMSE other waves |
|---|---|---|---|---|---|---|---|---|
| TRIAD | AR(1) model | 0.943 | 0.920 | 0.162 | 0.099 | 0.040 | 0.018 | 0.045 |
| TRIAD | AR(2) model | 0.997 | 0.993 | 0.046 | 0.013 | 0.059 | 0.006 | 0.073 |
| TRIAD | RIAR model | 0.964 | 0.945 | 0.134 | 0.102 | 0.008 | 0.017 | 0.006 |
| NKT | AR(1) model | 0.996 | 0.991 | 0.029 | 0.029 | 0.006 | 0.001 | 0.010 |
| NKT | AR(2) model | 1.000 | 1.007 | 0.000 | 0.010 | 0.007 | 0.000 | 0.014 |
| NKT | RIAR model | 1.000 | 1.021 | 0.000 | 0.006 | 0.003 | 0.000 | 0.006 |
| PKM | AR(1) model | 0.953 | 0.905 | 0.174 | 0.047 | 0.032 | 0.035 | 0.029 |
| PKM | AR(2) model | 1.000 | 1.004 | 0.000 | 0.000 | 0.023 | 0.018 | 0.029 |
| PKM | RIAR model | 0.981 | 0.944 | 0.129 | 0.031 | 0.006 | 0.009 | 0.003 |
Notes. AR = Autoregressive, RIAR = Random Intercept Autoregressive, CMSE = causal mean squared error. For all the models, standard errors were adjusted for clustering at the school level (TRIAD & PKM) or the classroom level (NKT).
CMSE = (1/W) Σw (cexperimental,w − cnon-experimental,w)², where W is the number of post-treatment waves; RMSEA = √g × √(max[(χ²t − dft)/(dft × N), 0]), where χ²t = chi-square value for the tested model, dft = degree of freedom for the tested model, N = sample size, and g = number of groups.
RQ2: Are discrepancies between traditional model fit indices and the causally informative fit index consistent across conceptual replications?
The discrepancy between statistical and causal fit observed within each dataset replicates across all three datasets. Figure 3 highlights the similar pattern of discrepancy between the CMSE and the RMSEA of the SEM models for each dataset.
Figure 3.
RMSEA and CMSE of models for each dataset. Note: AR = Autoregressive, RIAR = Random Intercept Autoregressive. For all the models, standard errors were adjusted for clustering at the school level (TRIAD & PKM) or the classroom level (NKT).
The AR(2) models consistently show a very good RMSEA (sometimes approaching zero), yet they have the highest CMSE and are the worst at forecasting the observed effect in all three datasets. The RIAR models show good statistical fit and lower CMSE scores than the AR(1) and AR(2) models; the RIAR models, therefore, are better at predicting causal impacts than the other presented models.
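The CMSE computation behind this contrast can be sketched in a few lines. The formula below assumes a simple unweighted mean of squared forecast errors; the observed impacts and AR(2) forecasts reuse the per-dataset 2-year values quoted earlier purely as illustrative inputs, and the RIAR-style forecasts are hypothetical:

```python
# CMSE: mean squared difference between observed impacts (cexperimental)
# and model-implied impacts (cnon-experimental) over a set of contrasts.
# Assumed here: a simple unweighted mean of squared forecast errors.
def cmse(observed, forecast):
    errors = [(o - f) ** 2 for o, f in zip(observed, forecast)]
    return sum(errors) / len(errors)

observed = [0.18, -0.01, 0.02]       # observed 2-year impacts (quoted above)
ar2_forecast = [0.40, 0.11, 0.19]    # AR(2) forecasts (over-predicting)
riar_forecast = [0.20, 0.05, 0.01]   # hypothetical closer forecasts
print(round(cmse(observed, ar2_forecast), 3))   # 0.031
print(round(cmse(observed, riar_forecast), 3))  # 0.001
```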
Simulation study results
A simulation study was conducted to test the robustness of the findings to violations of the exclusion restriction assumption (details are presented in the Supplementary Materials). First, the CMSE was able to recover the generating model (i.e., the AR(2) or RIAR model in our case) at better-than-chance levels; proportions correct across values of the variables in the generating models are displayed in the top row of Figure S3. Second, the CMSE performed reasonably consistently across the levels of the manipulated data-generating features, with accuracies ranging from .72 to .95 (Figure S3). The CMSE was most effective at recovering the generating model with larger sample sizes, larger initial impacts, and more waves. As shown in Figure S3 (3rd, 4th, and 5th lines from the bottom), violation of the exclusion assumption did not have a large effect on the accuracy of the CMSE. Lastly, the CMSE contributes unique information above and beyond statistical fit indices such as the RMSEA, as shown by the high positive regression coefficients when the generating model is regressed on the models selected by all indices at once (see Table S7).
Discussion
The broad lesson we hope researchers will take from this paper is the usefulness of triangulating on developmental theories using both experimental and non-experimental estimates. When analyzing data from experiments with longer-term follow-up assessments, researchers sometimes publish experimental impacts and longitudinal associations in separate papers. However, we argue that more can be learned from analyses that compare and contrast these experimental and non-experimental estimates than from separate analyses that do not treat them as potentially mutually informative. We tested the usefulness and validity of such an approach for choosing among models by indexing models' accuracy at forecasting later experimental impacts. The forecasted impacts were based on a combination of observed short-run impacts from an experiment and a longitudinal model fit to a non-experimental dataset (the control group in each study). In support of the usefulness and validity of the approach, some models were consistently causally informative across the datasets. In particular, the models that specified omitted latent factors causing stability in mathematics achievement across time (the RIAR models) performed best. We found that traditional and causally informative model fit indices were not perfectly aligned with each other, and that models showed consistent patterns of similarity and discrepancy between statistical fit and effectiveness at forecasting experimental impacts across the three datasets.
In the math intervention data we analyzed, both the Latent State-Trait (LST) autoregressive model and the RIAR model outperform other models in forecasting long-run effects (see Figures S1 and S2 in the Supplementary Materials), and the differences between the two are negligible compared with the differences between these models and models that assume no confounding between earlier and later skill measures. Theoretically, this finding is consistent with the idea that longitudinal rank-order stability of children's math skills is better explained by unmeasured confounders (represented in these latent factor models) than by unmeasured mediators (implied by the AR(2) model). Finally, controlling for a strong set of baseline covariates in regression models produces more accurate forecasts of long-run intervention effects than regression models without controls (see Tables S1.5, S2.5, and S3.5).
Importantly, our goal is not to advocate for any specific models across a broad set of research topics within developmental psychology: Perhaps in different domains with different data structures and different kinds of causal questions, different models will be preferred. Our results should not be taken to imply that AR(1) and AR(2) models are always bad, or that RIAR models are always good. Rather, for this narrow range of early math interventions, we think that the assumptions of the RIAR models more closely approximate the underlying developmental processes than the AR(1) or AR(2) models. For other developmental problems, it is likely that other models will be preferable. For example, when untreated individuals do not eventually learn the key skills, capacities, or beliefs included in the treatment under counterfactual conditions, or when interventions target a larger variety of skills, the direct effects of intervention on later outcomes are likely to be larger. In this case, the RIAR models might consistently under-predict long-run treatment impacts.
Instead, the point of this exercise is to demonstrate the potential usefulness of attempting to choose among models on the basis of their ability to forecast later experimental impacts. The proposed approach is most applicable in the following situations: 1) when using datasets from randomized experiments; 2) when the treatment targets a narrow and known set of measured constructs; 3) when the end-of-treatment effect is large enough that different models imply substantially different patterns of impacts on other outcomes; and 4) when the data contain features required for estimating plausible causal models within the control group (e.g., pretest covariates and/or at least 2 waves after the end of treatment).
The triangulation approach requires a source of exogenous variation (in this case, a randomly assigned intervention) and thus cannot be applied to a dataset containing only the endogenous causal variable(s) and outcome(s) of interest. Nevertheless, studies using this approach may be useful to the field in several ways. First, a researcher analyzing non-experimental datasets can use the results of this within-study design as applied to relevant experimental datasets in the model selection process (see also Brick & Bailey, 2020). Most directly, the researcher can use information about what kinds of models tend to be more causally informative in previous work on relevant problems. For example, for modeling repeated measures of children’s mathematics achievement across time, models assuming unmeasured confounds but not unmeasured mediators (e.g., the RIAR model) appear to produce less biased long-run estimates than models that assume unmeasured mediators but not unmeasured confounds (e.g., the AR(2) model).
Second, parameter estimates from the triangulation approach can serve as benchmarks against which parameter estimates from future analyses can be compared. Of course, effects may not be identical across settings and populations, but estimates with opposite direction and/or very different magnitudes from comparable effects estimated in the within-study design may help researchers identify misspecified models. For example, across the three experimental datasets analyzed in this paper, the pattern of long-run impacts appears consistent with an autoregressive effect of preschool math achievement on math achievement one year later in the range of .4 SD. Therefore, in future analyses of observational datasets, if a model estimates either a zero effect of prior math achievement on later math achievement or an autoregressive effect approaching the correlation (up to .8), this could be an indication of model misspecification. Finally, the directions and magnitudes of the biases produced by particular models in the triangulation approach can guide researchers attempting to reason about the causal implications of a model fit to non-experimental data. For example, a model with unrealistically large (or small) autoregressive paths between earlier and later skills might fail to control for confounders that influence other later skills (or over-control for factors influenced by the causal skill of interest), thus raising concerns about the cross-lagged estimates within the same model.
Our simulation results suggest that the CMSE provides unique information above and beyond other model fit indices and may be useful for model selection when models differ in their causal predictions. Because the CMSE has different goals and relies on partly independent information, its added information is less redundant with commonly used indices of model fit than those indices are with each other. The CMSE has no penalty for model complexity, and therefore may be less likely to select more parsimonious models like the AR(1). This reflects the goal of the index (to obtain reliable causal estimates regardless of the model choice) and the unclear link between adding model parameters and generating causally informative estimates (if the model is fit to non-experimental data, it is not clear that adding unnecessary model parameters will improve the CMSE unless these additional parameters are useful for causal inference). The CMSE can therefore be thought of as taking an estimation-thinking approach aimed at capturing the causal process, regardless of the comparison models, and may accordingly run the risk of overfitting. However, the CMSE's calculation separates the analysis of the control group from the analysis of the full model, which provides at least some protection against overfitting. That is, if a given model is overly complex and capitalizes on sampling errors in the control group, those estimates will not generalize to the treatment-control contrast, and thus may result in a higher CMSE. In this way, the CMSE contains a "built-in" hold-out set that protects against overfitting. Still, it is important to remember that the CMSE will not be useful for distinguishing among models that make very similar causal predictions. Thus, the CMSE is best suited for distinguishing among models that make similar correlational predictions but different causal predictions.
Limitations and Future Directions
Just as strong model fit based only on statistical information cannot prove that a model is correct, neither can strong model fit on the basis of causal information. This paper does not solve the underdetermination problem in the context of model selection in SEM; it is merely an additional tool for allowing researchers to differentiate among theories and models that make similar statistical predictions. As Meehl (1978, 1990) advocated, psychologists should develop risky tests that distinguish between theories (and statistical models) that make similar predictions.
As previously noted, the current study requires the exclusion restriction assumption and several conditions necessary for the generalizability of one effect to another set of conditions, including linearity, monotonicity, and additivity. If the exclusion restriction assumption is violated in our example cases, cexperimental will be larger than the expected cnon-experimental because direct effects are likely positive (e.g., because they include unmeasured cognitive and motivational factors influenced by these math interventions; Watts et al., 2018) but not captured in our formula. However, as we see in the results, the direction of the bias of cnon-experimental in all three datasets is generally opposite to that: Most models tend to over-forecast the long-run effects of the intervention, conditional on the short-run effects. Importantly, this is not definitive proof that the exclusion restriction assumption is met; indeed, we find this unlikely. For example, because the change in the outcome in the intervention group happens before it is measured at the end of intervention, it is possible that an intervention influences more variables than the target variable X at different points of time (i.e., the intervention is “fat-handed”; Eronen, 2020). However, our results suggest that for the problem of estimating the causal effects of changes to earlier math achievement on later math achievement with this set of interventions and baseline statistical controls, upward bias in non-experimental estimates due to unmeasured confounds is a larger problem than downward bias due to unmeasured mediators in longer-run experimental impacts.
Moreover, the violations to the exclusion restriction assumption considered in our simulation study did not decrease the performance of the CMSE at distinguishing between the two models under consideration. Still, it is likely that most psychological interventions, including those included in the current study, influence multiple psychological constructs at once (Eronen, 2020), meaning that the exclusion restriction assumption is likely violated. Further, major violations of the exclusion restriction assumption can make the CMSE a poor index of model fit. Thus, we recommend use of the CMSE only for interventions for which a strong case can be made that treatment effects can be efficiently summarized by a range of available posttest measures.
The triangulation approach we proposed was intentionally tested on experimental data from a narrow set of psychological experiments. This approach can be used in model selection and study design in non-experimental research. We hope researchers will apply this approach to different research domains within psychology. For example, this approach may be useful for distinguishing among different kinds of cross-lagged panel models (Usami et al., 2019; Zyphur et al., 2019), networks vs. latent variable models (Epskamp, Rhemtulla, & Borsboom, 2017), formative vs. reflective latent variable models (e.g., Rhemtulla et al., 2020), or bifactor models vs. correlated factor models (Bonifay, Lane, & Reise, 2017), all of which can make similar statistical but very different causal predictions. Further, the CMSE is only one way of estimating model fit in relation to causal benchmarks, just as the RMSEA is just one way of estimating model fit in relation to statistical benchmarks. We encourage others to investigate whether modifying the CMSE formula to account for model parsimony can reduce bias and improve accuracy. The purpose of this research is to encourage broader experimentation with measures of causal fit, not to end the discussion.
Conclusion
Theoretical underdetermination is a critical problem for psychologists who work with observational data. Here we propose another tool for further addressing the underdetermination problem by incorporating causal benchmarks into the model selection process. Supporting the validity and potential usefulness of this approach, we present evidence that this approach can identify models that consistently perform better than others for a narrowly defined problem, and that our index of causal fit contains information not supplied by a commonly used index of statistical fit. We hope others will attempt to create and refine indices of causal model fit and apply this general approach to other important questions within psychology.
Supplementary Material
Public Significance Statement:
When analyzing data from experiments with longer-term follow-up assessments, sometimes researchers publish experimental impacts and longitudinal associations in separate papers. However, we argue that more can be learned from analyses that attempt to compare and contrast these experimental and non-experimental estimates than from separate analyses that do not treat the experimental and non-experimental analyses as potentially mutually informative.
Acknowledgments
The authors thank Douglas Clements, Julie Sarama, Lynn Fuchs, Alice Klein, and Prentice Starkey for sharing their data and for their feedback on various aspects of this project. The authors thank Greg Duncan and Tyler Watts for helpful feedback while conceptualizing this project. D. H. Bailey is supported by a Jacobs Fellowship. T. R. Brick is partially funded by the Penn State Institute for Computational and Data Sciences. The TRIAD study was supported by the Institute of Education Sciences (IES) through Grants R305K05157 and R305A120813; the NKT study was supported by the Eunice Kennedy Shriver National Institute of Child Health & Human Development to Vanderbilt University through Award Numbers R01 HD053714 and R37 HD0459M and Core Grant HD15052. The PKM study was supported by the IES through Grant R305K050004 to WestEd. The opinions expressed are those of the authors and do not represent views of the U.S. Department of Education, the Eunice Kennedy Shriver National Institute of Child Health & Human Development, or the National Institutes of Health.
References
- Angrist JD, & Pischke JS (2008). Mostly harmless econometrics: An empiricist's companion. Princeton, NJ: Princeton University Press.
- Bailey DH, Duncan G, Odgers C, & Yu W (2017). Persistence and fadeout in the impacts of child and adolescent interventions. Journal of Research on Educational Effectiveness, 10, 7–39.
- Bailey DH, Duncan GJ, Watts T, Clements DH, & Sarama J (2018). Risky business: Correlation and causation in longitudinal studies of skill development. American Psychologist, 73(1), 81.
- Bailey DH, Fuchs LS, Gilbert JK, Geary DC, & Fuchs D (2019). Prevention: Necessary but insufficient? A 2-year follow-up of an effective first-grade mathematics intervention. Child Development.
- Bonifay W, Lane SP, & Reise SP (2017). Three concerns with applying a bifactor model as a structure of psychopathology. Clinical Psychological Science, 5(1), 184–186.
- Borsboom D, Mellenbergh GJ, & Van Heerden J (2003). The theoretical status of latent variables. Psychological Review, 110(2), 203.
- Brick TR, & Bailey DH (2019). MICr: Matrices of Implied Causation in R. R package version 0.0.0.9000 [software program]. Available from https://github.com/trbrick/MICr
- Brick TR, & Bailey DH (2020). Rock the MIC: The Matrix of Implied Causation, a tool for experimental design and model checking. Advances in Methods and Practices in Psychological Science, 286–299. 10.1177/2515245920922775
- Clements DH, & Sarama J (2013). Building Blocks, Volumes 1 and 2. Columbus, OH: McGraw-Hill Education.
- Clements DH, Sarama J, Khasanova E, & Van Dine DW (2012). TEAM 3–5—Tools for elementary assessment in mathematics. Denver, CO: University of Denver.
- Clements DH, Sarama J, Layzer C, Unlu F, Wolfe CB, Spitler ME, & Weiss D (2016, March). Effects of TRIAD on mathematics achievement: Long-term impacts. Paper presented at the Spring 2016 Society for Research on Educational Effectiveness Conference, Washington, D.C. Abstract retrieved from https://www.sree.org/conferences/2016s/program/downloads/abstracts/1726.pdf
- Clements DH, Sarama J, Spitler ME, Lange AA, & Wolfe CB (2011). Mathematics learned by young children in an intervention based on learning trajectories: A large-scale cluster randomized trial. Journal for Research in Mathematics Education, 42, 127–166.
- Clements DH, Sarama J, Wolfe CB, & Spitler ME (2013). Longitudinal evaluation of a scale-up model for teaching mathematics with trajectories and technologies: Persistence of effects in the third year. American Educational Research Journal, 50(4), 812–850. doi: 10.3102/0002831212469270
- Cole DA, Martin NC, & Steiger JH (2005). Empirical and conceptual problems with longitudinal trait-state models: Introducing a trait-state-occasion model. Psychological Methods, 10(1), 3.
- Connolly AJ (1998). KeyMath-Revised. Circle Pines, MN: American Guidance Service.
- Cook TD, Shadish WR, & Wong VC (2008). Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management, 27(4), 724–750.
- Cramer AO, Van der Sluis S, Noordhof A, Wichers M, Geschwind N, Aggen SH, Kendler KS, & Borsboom D (2012). Dimensions of normal personality as networks in search of equilibrium: You can't like parties if you don't like people. European Journal of Personality, 26(4), 414–431.
- Epskamp S, Rhemtulla M, & Borsboom D (2017). Generalized network psychometrics: Combining network and latent variable models. Psychometrika, 82(4), 904–927.
- Eronen MI (2020). Causal discovery and the problem of psychological interventions. New Ideas in Psychology, 59, 100785. 10.1016/j.newideapsych.2020.100785
- Fuchs LS, Geary DC, Compton DL, Fuchs D, Schatschneider C, Hamlett CL, ... & Bryant JD (2013). Effects of first-grade number knowledge tutoring with contrasting forms of practice. Journal of Educational Psychology, 105, 58–77.
- Geary DC, Hoard MK, Byrd-Craven J, Nugent L, & Numtee C (2007). Cognitive mechanisms underlying achievement deficits in children with mathematical learning disability. Child Development, 78(4), 1343–1359.
- Ginsburg HP, & Baroody AJ (2003). Test of Early Mathematics Ability. Austin, TX: Pro-Ed.
- Glasgow C, & Cowley J (1994). Renfrew Bus Story test (North American Edition). Centreville, DE: Centreville School.
- Invernizzi M, Sullivan A, Swank L, & Meier J (2004). PALS pre-K: Phonological awareness literacy screening for preschoolers (2nd ed.). Charlottesville, VA: University Printing Services.
- Klein A, Starkey P, & Ramirez A (2002). Pre-K Mathematics Curriculum. Glendale, IL: Scott Foresman.
- Kline RB (2011). Principles and practice of structural equation modeling. New York, NY: Guilford Press.
- Landry S (2007). MClass: CIRCLE. New York, NY: Wireless Generation.
- Mayo DG (2018). Statistical inference as severe testing. Cambridge: Cambridge University Press.
- Meehl PE (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806–834.
- Meehl PE (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195–244.
- Muthén LK, & Muthén BO (2015). Mplus user's guide (1998–2015). Los Angeles, CA: Muthén & Muthén.
- Neale MC, Hunter MD, Pritikin JN, Zahery M, Brick TR, Kirkpatrick RM, ... & Boker SM (2016). OpenMx 2.0: Extended structural equation and statistical modeling. Psychometrika, 81(2), 535–549.
- Pearl J, & Bareinboim E (2014). External validity: From do-calculus to transportability across populations. Statistical Science, 29(4), 579–595.
- Pickering S, & Gathercole S (2001). Working Memory Test Battery for Children. London: The Psychological Corporation.
- Protzko J (2017). Effects of cognitive training on the structure of intelligence. Psychonomic Bulletin & Review, 24(4), 1022–1031.
- Rhemtulla M, van Bork R, & Borsboom D (2020). Worse than measurement error: Consequences of inappropriate latent variable measurement models. Psychological Methods.
- Shadish WR, Clark MH, & Steiner PM (2008). Can nonrandomized experiments yield accurate answers? A randomized experiment comparing random and nonrandom assignments. Journal of the American Statistical Association, 103, 1334–1344.
- Stanford K (2017). Underdetermination of scientific theory. In Zalta EN (Ed.), The Stanford Encyclopedia of Philosophy (Winter 2017 Edition). Retrieved from http://plato.stanford.edu/archives/win2017/entries/scientific-underdetermination/
- Starkey P, & Klein A (2012). Scaling up the implementation of a pre-kindergarten mathematics intervention in public preschool programs (Final Report: IES Grant R305K050004). Washington, DC: U.S. Department of Education.
- Starkey P, Klein A, DeFlorio L, & Beliakoff A (2020). Scaling up the Pre-K Mathematics intervention in public preschool programs. Manuscript submitted for publication. WestEd.
- Steyer R, & Schmitt T (1994). The theory of confounding and its application in causal modeling with latent variables. In von Eye A & Clogg CC (Eds.), Latent variables analysis: Applications for developmental research (pp. 36–67). Thousand Oaks, CA: Sage.
- Swanson JM, Schuck S, Porter MM, Carlson C, Hartman CA, Sergeant JA, ... & Wigal T (2012). Categorical and dimensional definitions and evaluations of symptoms of ADHD: History of the SNAP and the SWAN rating scales. The International Journal of Educational and Psychological Assessment, 10(1), 51.
- Tomarken AJ, & Waller NG (2003). Potential problems with "well fitting" models. Journal of Abnormal Psychology, 112(4), 578.
- Usami S, Murayama K, & Hamaker EL (2019). A unified framework of longitudinal models to examine reciprocal relations. Psychological Methods, 24(5), 637.
- Watts TW, Duncan GJ, Clements DH, & Sarama J (2018). What is the long-run impact of learning mathematics during preschool? Child Development, 89, 539–555.
- Weidmann B, & Miratrix L (2021). Lurking inferential monsters? Quantifying selection bias in evaluations of school programs. Journal of Policy Analysis and Management, 40(3), 964–986. 10.1002/pam.22236
- Wechsler D (1999). Wechsler Abbreviated Scale of Intelligence. San Antonio, TX: Psychological Corporation.
- Wilkinson GS (1993). Wide Range Achievement Test 3. Wilmington, DE: Wide Range.
- Woodcock RW (1997). Woodcock Diagnostic Reading Battery. Itasca, IL: Riverside.
- Woodcock RW, McGrew KS, & Mather N (2001). Woodcock-Johnson III. Itasca, IL: Riverside.
- Zyphur MJ, Allison PD, Tay L, Voelkle MC, Preacher KJ, Zhang Z, ... & Diener E. (2019). From Data to Causes I: Building A General Cross-Lagged Panel Model (GCLM). Organizational Research Methods, 1094428119847278. [Google Scholar]