Evaluating Restrictive Models in Educational and Behavioral Research: Local Misfit Overrides Model Tenability

Tenko Raykov; Christine DiStefano

doi:10.1177/0013164420944566

2020 Aug 1;81(5):980–995. doi: 10.1177/0013164420944566

PMCID: PMC8377339 PMID: 34565814

Abstract

The frequent practice of overall fit evaluation for latent variable models in educational and behavioral research is reconsidered. It is argued that since overall plausibility does not imply local plausibility and is only necessary for the latter, local misfit should be considered a sufficient condition for model rejection, even in the case of omnibus model tenability. The argument is exemplified with a comparison of the widely used one-parameter and two-parameter logistic models. A theoretically and practically relevant setting illustrates how discounting local fit and concentrating instead on overall model fit may lead to incorrect model selection, even if a popular information criterion is also employed. The article concludes with the recommendation for routine examination of particular parameter constraints within latent variable models as part of their fit evaluation.

Keywords: local fit, necessary condition, overall model fit, parameter constraint, sufficient condition, one-parameter logistic model, two-parameter logistic model

Over the past couple of decades, latent variable modeling (LVM) has been attracting increased interest from methodologists and substantive scholars across the educational, behavioral, social, business, organizational, and biomedical disciplines (e.g., Raykov & Marcoulides, 2011). A main reason for this attention to LVM is that the methodology offers readily utilized opportunities for examining relationships among unobserved variables, as well as between them and their presumed indicators (e.g., B. O. Muthén, 2002). A major means for realizing these opportunities is the latent variable model, which posits appropriate relationships between latent and observed variables and, where applicable, among latent variables (e.g., Hancock & Mueller, 2013).

A latent variable model (or latent trait model) usually reflects the accumulated knowledge in a substantive area of research or a hypothesis of relevance to be tested empirically (e.g., Bollen, 1989; McDonald, 1999). Often, particular aspects of prior research or theoretical propositions are represented by suitable constraints involving some parameters of one or more models under consideration. When these models are found to be tenable overall once tested against available data, following the current frequent practice in empirical research, one may be tempted to treat the more restrictive model as a plausible means of data description and explanation. As a prominent example, one may consider the one-parameter logistic model (1PL model) that is frequently employed in applications of item response modeling (IRM; e.g., Raykov & Marcoulides, 2018). In part due to its popularity in the educational and behavioral sciences, a researcher may be willing to treat this model as credible for a given data set if, when fitted, it is found to be associated with tenable overall model fit.

The goal of the present note is to caution that following this practice in applications of LVM more generally, and of IRM in particular, can yield incorrect conclusions leading researchers to favor a model that may be seriously misspecified and not dependable for subsequent research. It is argued below that while local fit is a necessary condition for overall model fit, local misfit is a sufficient condition for a model not to be considered a plausible means of data description and explanation. This argument is exemplified by a pair of popular IRMs, the 1PL and two-parameter (2PL) models. A relevant setting is described where a critical constraint in the 1PL model does not hold due to the data being generated by the 2PL model markedly violating it, while the 1PL model is associated with tenable overall fit and its Bayesian information criterion (BIC) is considerably smaller than that of the 2PL model. Implications for empirical educational, behavioral, and social research are finally discussed.

Local Misfit as a Sufficient Condition for Lack of Model Tenability

Local Fit and Overall Model Fit

Many latent variable models contain parameter constraints as essential components that reflect particular assumptions or accumulated knowledge in the subject-matter areas of their application. For example, within the framework of the popular congeneric test model, the model of true score equivalent tests (tau-equivalent tests) is often of high interest due to its presumption of equal measure loadings on the underlying common true score (e.g., Jöreskog, 1971; McDonald, 1999). When this assumption holds and the model contains no correlated error terms, one can use the popular Cronbach coefficient alpha (e.g., Cronbach, 1951) as an index of measuring instrument reliability, because population alpha then equals the population reliability coefficient (e.g., Novick & Lewis, 1967; Raykov, 1997). Similarly, the study of social phobia and its possible gender differences may well necessitate the utilization of a two-group model with invariant loadings and intercepts for the components of a unidimensional scale evaluating this construct (e.g., Brown, 2015). More generally, latent mean, variance, and correlation comparisons presuppose the validity of cross-group measurement invariance constraints, in particular for loadings and intercepts (e.g., Millsap, 2011). Also, a characteristic feature of the popular 1PL model (Rasch model; e.g., von Davier, 2016) is the equality of all item discrimination parameters.

All these models, and others similar to them, share the important feature that specific restrictions are imposed on some of their parameters. These parameter constraints may or may not hold in a population under consideration. In this article, the degree to which they are fulfilled is referred to as local fit, as these restrictions typically reside within a corresponding portion of the model; conversely, the extent to which the constraints are violated is referred to as local misfit. Relatedly, the degree to which the entire model is tenable is referred to as overall (global, omnibus) model fit, since it involves all parts of the model and whether they en bloc satisfy the set of all model assumptions in the population. In other words, local misfit is the degree to which the pertinent parameter restrictions are violated, regardless of whether the remaining parts of the model are correctly specified or not. At the same time, overall lack of model fit—or model misfit—refers to the extent that the entire model, with all its individual or local parts considered simultaneously, represents a valid omnibus means of description and explanation of a studied phenomenon in the population of concern (cf. Raykov & Penev, 2014).

A Necessary Condition and a Sufficient Condition for Model Plausibility or Lack Thereof

For restrictive models with parameter constraints, addressing the issue of model fit requires attention to both overall and local fit. The reason is that model validity logically implies the validity of any particular restriction that is part of the model specification. In this sense, validity of the parameter constraints under consideration is a necessary condition for model validity (overall, or global, model validity; e.g., Raykov & Marcoulides, 2018). However, correct restrictions on certain parameters of interest do not generally imply a correct model overall, as there may be other parts of the model that are misspecified. In this sense, local fit with regard to particular parameter constraints does not in general represent a sufficient condition for model validity. From the necessity just noted it follows by contraposition (e.g., Hodel, 2013) that local misfit, that is, violation of one or more parameter restrictions that are part of a more general model, is a sufficient condition to proclaim a model as not tenable overall. In other words, local misfit is sufficient for treating a model as not representing a plausible means of description and explanation of an analyzed data set.
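The logical structure of this argument can be written compactly. The symbols below are shorthand introduced here for illustration and are not the authors' notation: M stands for "the model is valid overall" and C_1, ..., C_q for "the ith parameter constraint holds".

```latex
\begin{aligned}
&\text{Necessity of local fit:} && M \;\Rightarrow\; C_1 \wedge C_2 \wedge \dots \wedge C_q \\
&\text{Contrapositive (local misfit suffices):} && \neg C_i \ \text{for some } i \;\Rightarrow\; \neg M \\
&\text{Local fit is not sufficient:} && C_1 \wedge C_2 \wedge \dots \wedge C_q \;\not\Rightarrow\; M
\end{aligned}
```

The second line is simply the first read in reverse, which is why a single violated constraint is enough to reject the model while satisfied constraints alone never establish it.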

This discussion highlights the relevance and need for separate examination of the validity of particular parameter constraints that are parts of latent variable models used in empirical educational and behavioral research. Unfortunately, this has not yet become routine practice in contemporary LVM applications. Therefore, a primary aim of this article is to emphasize the importance of local fit testing, in addition to more traditional overall model fit evaluation, especially for restrictive models containing particular parameter constraints. To this end, we use next the comparison of the 1PL and 2PL models, as these models are highly popular in educational and psychological studies.

Examining Plausibility of the One-Parameter Logistic Model: Local Misfit Overrides Overall Model Tenability

The recommendation for focused evaluation of parameter restrictions is especially relevant when utilizing unidimensional restrictive item response theory models, such as the widely used 1PL model. As discussed in detail in the literature (e.g., van der Linden, 2016a, 2016b), for a given set of k binary or binary scored items (k > 2) this model postulates the probability of “correct” response on the jth item, denoted P_j(θ), as follows:

P_{j} (θ) = \frac{e^{a (θ - b_{j})}}{1 + e^{a (θ - b_{j})}} = \frac{1}{1 + e^{- a (θ - b_{j})}} = \frac{1}{1 + \exp [- a (θ - b_{j})]}

(1)

where a is the common item discrimination parameter, θ the underlying latent dimension (construct; with its variance being a free model parameter), exp(·) is the exponential function, and b_j is the item difficulty parameter (j = 1, . . ., k). The 1PL model is a special case of the more general 2PL model that presumes this response probability as equal to

P_{j} (θ) = \frac{e^{a_{j} (θ - b_{j})}}{1 + e^{a_{j} (θ - b_{j})}} = \frac{1}{1 + e^{- a_{j} (θ - b_{j})}} = \frac{1}{1 + \exp [- a_{j} (θ - b_{j})]}

(2)

where a_j are the individual item discrimination parameters (j = 1, . . ., k). We would like to stress that the only difference between the 1PL and 2PL models lies in the assumed validity in the former of the restriction that all item discrimination parameters are the same, that is,

a_{1} = a_{2} = \dots = a_{k} = a .

(3)

In other words, the 1PL model is a 2PL model, where the k−1 pairwise discrimination parameter equality constraints in Equations (3) are assumed to hold; that is, Equations (3) represent a characteristic feature of the 1PL model, which makes it distinct from the 2PL model.¹
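The nesting of the two models can be made concrete with a minimal sketch of Equations (1) and (2). The function names and the numeric values of θ, a, and b_j below are hypothetical and chosen only for illustration.

```python
import math


def p_2pl(theta, a_j, b_j):
    """Equation (2): probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a_j * (theta - b_j)))


def p_1pl(theta, a, b_j):
    """Equation (1): the 2PL probability with a common discrimination a."""
    return p_2pl(theta, a, b_j)


# Hypothetical values, for illustration only.
theta = 0.5
difficulties = [-1.0, 0.0, 1.0]
common_a = 1.2

# With all discriminations equal (Equations (3)), the two models coincide.
probs_1pl = [p_1pl(theta, common_a, b) for b in difficulties]
probs_2pl_equal_a = [p_2pl(theta, common_a, b) for b in difficulties]

# With unequal discriminations, the 2PL probabilities depart from the 1PL ones.
probs_2pl_unequal_a = [
    p_2pl(theta, a, b) for a, b in zip([0.6, 1.2, 1.8], difficulties)
]
```

Writing the 1PL function as a call to the 2PL function mirrors the point made in the text: the restrictive model is the general one with the equality constraints imposed.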

This and the preceding discussion on necessary and sufficient conditions of model validity or lack thereof imply that a necessary condition for plausibility of the 1PL model is the tenability of constraints (3) within the 2PL model. That is, lack of local misfit with regard to the k−1 constraints in Equations (3) represents a necessary condition for plausibility of the 1PL model (see also Raykov & Marcoulides, 2018). Similarly, a sufficient condition for lack of plausibility of the 1PL model is the violation of one or more of the k−1 constraints in Equations (3), since the 1PL model cannot be valid (plausible) unless all of these k−1 discrimination parameter restrictions are correct (plausible). In other words, local misfit as represented by 1, 2, . . ., or k−1 violated item discrimination parameter equality restrictions from those stated in Equations (3), suffices for rejecting (statistically) the 1PL model as a means of data description and explanation in an empirical study. We hasten to add that no additional indication for overall model fit then, such as a nonsignificant likelihood ratio test statistic (or associated p value), favorable descriptive fit indices, or minimal information criterion value, can overturn such a conclusion for lack of plausibility of the 1PL model.

We exemplify next the need for examining separately local fit, with regard to particular model parameter constraints, in addition to examination of overall model fit.

Testing for Equality of Item Discrimination Parameters as an Essential Part of Evaluating Fit of the One-Parameter Item Response Model

The purpose of this section is to highlight the critical relevance of a test for identity of the item discrimination parameters in a 1PL model as part of evaluating its plausibility. To this end, we provide a demonstration of the fact that disregarding this local test, that is, of the crucial Equations (3) for the 1PL model, can lead to preferring the incorrect model, even if a widely used model selection index like the BIC is employed in addition to the routine evaluation of overall model fit.

Example Setting

To accomplish these aims, we consider the following theoretically and empirically relevant setting based on a large number of simulated data sets. In it, a 2PL model holds with considerably differing discrimination parameters across the items of a unidimensional instrument (item set) of concern. More specifically, we use here r = 10,000 data sets for k = 7 binary items with n = 1,000 cases each, which we simulate according to a 2PL model with the following item discrimination and difficulty parameters as well as unit latent variance (see the Mplus source code in the appendix used for data simulation, which includes the seed utilized; see also its Notes):

a_{1} = 1.19, a_{2} = .81, a_{3} = .82, a_{4} = .70, a_{5} = 1.09, a_{6} = 1.06, a_{7} = .62; and

b_{1} = -.506, b_{2} = .015, b_{3} = -1.783, b_{4} = .516, b_{5} = 1.566, b_{6} = .754, b_{7} = 1.632.

(4)

From Equations (4) we observe that the cross-item discrepancy in the discrimination parameters is marked, since it spans more than a half latent variance (standard deviation on the underlying latent dimension of interest): the largest and smallest values differ by 1.19 − .62 = .57. With these notable item discrimination differences in mind, the present setting can be seen as associated with a serious violation of Equations (3) that are characteristic of the 1PL model. This particular violation of the 1PL model does not allow one to treat it as a plausible means for description and explanation of the relationships among the seven items under consideration (cf. De Boeck & Wilson, 2004).
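The difficulty parameters in Equations (4) can be recovered from the loadings and thresholds listed in the appendix code, using the relation b_j = τ_j/a_j stated in its Note 1. The short sketch below does this and also computes the spread of the discrimination parameters; the variable names are illustrative.

```python
# Loadings (discriminations) and thresholds as listed in the appendix code.
loadings = [1.19, 0.81, 0.82, 0.70, 1.09, 1.06, 0.62]
thresholds = [-0.602, 0.012, -1.462, 0.361, 1.707, 0.799, 1.012]

# Appendix Note 1: b_j = tau_j / a_j, rounded to three decimals as in Equations (4).
difficulties = [round(tau / a, 3) for tau, a in zip(thresholds, loadings)]

# Spread of the discrimination parameters across the seven items.
a_range = round(max(loadings) - min(loadings), 2)

print(difficulties)
print(a_range)
```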

Modeling Results

When fitting the 2PL model to each of the simulated data sets across the 10,000 replications (using maximum likelihood; e.g., L. K. Muthén & Muthén, 2020), we obtained the following overall fit statistics: mean likelihood ratio (LR) chi-square (ave-χ²) = 118.995 (standard deviation = 14.447) for degrees of freedom (df) = 113 (with all pertinent 10,000 computations being successful). Table 1, discussed further below, provides the associated expected and observed proportions and percentiles for these chi-square values, followed by the same quantities pertaining to the BIC index for this unrestricted model.

Table 1.

Proportions and Percentiles Across 10,000 Replications for the Likelihood Ratio Chi-Square and BIC Associated With the 2PL Model (Software Output Format).

Likelihood ratio chi-square
Mean		118.995
Std Dev		14.447
Degrees of freedom		113
Number of successful computations		10000
Proportions		Percentiles
Expected	Observed	Expected	Observed
0.990	0.998	80.991	88.011
0.980	0.995	84.310	91.513
0.950	0.987	89.461	96.470
0.900	0.966	94.213	100.763
0.800	0.907	100.193	106.580
0.700	0.700	104.660	111.073
0.500	0.667	112.334	118.416
0.300	0.445	120.375	126.238
0.200	0.318	125.419	130.913
0.100	0.173	132.643	137.790
0.050	0.089	138.811	143.854
0.020	0.037	145.975	150.263
0.010	0.018	150.882	154.843
Bayesian (BIC)
Mean		8425.472
Std Dev		63.987
Number of successful computations		10000
Proportions		Percentiles
Expected	Observed	Expected	Observed
0.990	0.989	8276.620	8273.769
0.980	0.977	8294.062	8290.967
0.950	0.946	8320.220	8318.416
0.900	0.898	8343.466	8342.438
0.800	0.804	8371.621	8372.533
0.700	0.705	8391.917	8392.818
0.500	0.502	8425.472	8425.736
0.300	0.305	8459.026	8459.852
0.200	0.198	8479.323	8478.671
0.100	0.098	8507.477	8506.610
0.050	0.047	8530.723	8528.842
0.020	0.019	8556.881	8555.793
0.010	0.009	8574.323	8571.282

Note. BIC = Bayesian information criterion; 2PL model = two-parameter logistic model.

Similarly, when fitting the 1PL model to each of these 10,000 data sets we obtained the following overall fit statistics: ave-χ² = 138.547 (16.571), df = 119 (also with 10,000 successful computations; L. K. Muthén & Muthén, 2020, chap. 12). Table 2, discussed next, provides the associated expected and observed proportions and percentiles for these chi-square values, followed by the same quantities for the BIC index, in this restricted model.
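The degrees of freedom reported for the two models (113 and 119) can be reproduced from the number of response patterns. The sketch below assumes, consistently with the appendix code where the latent variance is fixed at 1, that the 2PL model has k discriminations plus k thresholds and the 1PL model one common discrimination plus k thresholds; these parameter counts are an inference, not a statement from the article.

```python
def lr_df(n_items, n_params):
    """Degrees of freedom of the overall LR chi-square for binary items:
    number of response patterns minus 1, minus the number of free parameters."""
    return 2 ** n_items - 1 - n_params


k = 7
# Assumed parameter counts (latent variance fixed at 1, as in the appendix code).
params_2pl = k + k   # k discriminations + k thresholds
params_1pl = 1 + k   # one common discrimination + k thresholds

df_2pl = lr_df(k, params_2pl)
df_1pl = lr_df(k, params_1pl)
df_difference = df_1pl - df_2pl  # number of constraints in Equations (3)
```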

Table 2.

Proportions and Percentiles Across 10,000 Replications for the Likelihood Ratio Chi-Square and BIC Index Associated With the 1PL Model (Software Output Format).

Likelihood ratio chi-square
Mean		138.547
Std Dev		16.571
Degrees of freedom		119
Number of successful computations		10000
Proportions		Percentiles
Expected	Observed	Expected	Observed
0.990	1.000	86.074	103.001
0.980	0.999	89.500	106.520
0.950	0.998	94.811	112.391
0.900	0.995	99.707	117.655
0.800	0.982	105.860	124.371
0.700	0.962	110.453	129.255
0.500	0.891	118.334	138.055
0.300	0.758	126.582	147.053
0.200	0.645	131.752	152.263
0.100	0.473	139.149	160.074
0.050	0.332	145.461	166.407
0.020	0.193	152.785	174.553
0.010	0.124	157.800	180.214

Bayesian (BIC)
Mean		8403.577
Std Dev		63.728
Number of successful computations		10000
Proportions		Percentiles
Expected	Observed	Expected	Observed
0.990	0.989	8255.326	8252.461
0.980	0.978	8272.699	8269.359
0.950	0.946	8298.751	8295.475
0.900	0.899	8321.903	8321.548
0.800	0.804	8349.943	8350.965
0.700	0.706	8370.158	8371.199
0.500	0.502	8403.577	8403.964
0.300	0.302	8436.996	8437.178
0.200	0.198	8457.211	8456.797
0.100	0.098	8485.251	8484.410
0.050	0.047	8508.403	8506.364
0.020	0.019	8534.455	8534.054
0.010	0.009	8551.828	8550.237

Note. BIC = Bayesian information criterion; 1PL model = one-parameter logistic model.

Several observations are readily made from Tables 1 and 2, beginning with their top part. One, since the chi-square distribution cutoffs at the .05 significance level are χ²_{.05, 113} = 138.811 for df = 113 (2PL model) and χ²_{.05, 119} = 145.461 for df = 119 (1PL model), it follows that the mean likelihood ratio chi-square values are not significant for either of these two models. In other words, each model (the 1PL model and the 2PL model) is, on average, tenable overall, that is, plausible as an overall means of data description and explanation. Two, the difference in the two average chi-square values is 138.547 − 118.995 = 19.552, for the 119 − 113 = 6 degrees of freedom of relevance here, with pertinent p value of .003. This result shows that the 1PL model is, on average, associated with an overall chi-square index that is significantly higher than the average chi-square index of the 2PL model. The last can be interpreted as suggesting that the constraint in Equations (3), which is characteristic of the 1PL model, is effectively not plausible on average across the 10,000 replications (see also below; this finding is not unexpected, since that constraint was markedly violated during the data simulation process). This conclusion is further corroborated by examining the 10,000 within-replication LR chi-square difference test statistic values, defined as the difference of the 1PL model LR chi-square minus the 2PL model LR chi-square value, whose histogram is presented in Figure 1 (see Note 3 to the appendix on how these LR test values are obtainable; see also seed used there, for result replication purposes). This examination shows that the overwhelming majority of the within-replication LR difference statistics (in fact, 79.3% of them) are associated with a test statistic value that is higher than the relevant chi-square cutoff for the 6 degrees of freedom here, Δχ²_{.05, 6} = 12.592. This observation may also be (informally) made from Figure 1.
The discussed findings can be seen as implying that the 1PL model cannot be treated as a plausible means of data description and explanation, due to an essential component of it being violated, namely, the item discrimination equality constraints in Equations (3).
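A single-replication version of this LR chi-square difference test can be sketched as follows. The closed-form tail probability used here is valid only for an even number of degrees of freedom (such as the 6 constraints in this setting), and the chi-square values passed in the example call are hypothetical, not taken from the article.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a central chi-square for even df, using
    the closed form exp(-x/2) * sum_{i < df/2} (x/2)**i / i!."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    return math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(df // 2)
    )


def lr_difference_test(chi2_restricted, df_restricted,
                       chi2_relaxed, df_relaxed, alpha=0.05):
    """LR chi-square difference test of the constraints separating
    two nested models fitted to the same data."""
    diff = chi2_restricted - chi2_relaxed
    df_diff = df_restricted - df_relaxed
    p_value = chi2_sf_even_df(diff, df_diff)
    return diff, df_diff, p_value, p_value < alpha


# Hypothetical single-replication values (1PL: df = 119; 2PL: df = 113).
diff, df_diff, p_value, reject = lr_difference_test(140.0, 119, 120.0, 113)
```

Evaluating `chi2_sf_even_df(12.592, 6)` returns approximately .05, which matches the cutoff quoted in the text for 6 degrees of freedom.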

Figure 1. Histogram of the likelihood ratio chi-square difference test statistic for the one-parameter logistic (1PL) model (restrictive model) versus the 2PL model (relaxed model), for the 10,000 replications used (see illustration section).

Note. 2PL model = two-parameter logistic model; 1PL model = one-parameter logistic model.

As a next step, closer examination of the percentiles of the individual model LR chi-square values’ distributions (right-most column of the corresponding parts of Tables 1 and 2) suggests further that the 1PL and 2PL models are in effect each plausible as a means of description and explanation of the analyzed data across the 10,000 replications. Specifically (see L. K. Muthén & Muthén, 2020, chap. 12), over 90% of these replications are associated with nonsignificant LR chi-square value for the 2PL model. Similarly, the 1PL model is associated with a nonsignificant LR chi-square value in approximately two thirds of these replications. This means that for the large majority of the replications each of the two models was plausible for the analyzed data.

In addition, the bottom halves of Tables 1 and 2 show that the median BIC for the 1PL model is more than 21 units lower than the median BIC for the 2PL model (8425.736 − 8403.964 = 21.772). This finding may be interpreted as suggesting that, on average, the 1PL model may be considered preferable to the 2PL model based on this popular index of model selection (e.g., Raftery, 1995; see also below). We stress that the latter result is also at odds with the fact that it was the 2PL model rather than the 1PL model that was used for data generation and that the 1PL model was incorrect throughout the data simulation process, as its characteristic feature of invariant item discrimination parameters, reflected in Equations (3), was markedly violated. The BIC-based preference for the 1PL model is additionally corroborated by an examination of the within-replication differences in the BIC values for the two models, defined as the difference of the 2PL model BIC minus the 1PL model BIC, whose histogram is presented in Figure 2 (see Note 3 to the appendix on how these BIC differences are obtainable, and seed used, for result replication purposes). This shows that the vast majority of these within-replication difference statistics (in fact, 91.4%) are associated with a BIC difference that is higher than 10. (Relatedly, 88.2% of these BIC differences are greater than 12.) In other words, in the overwhelming majority of the 10,000 replications used, the 1PL model was associated with a BIC value that was more than 10 units lower than the BIC value of the 2PL model, thus suggesting preference of the 1PL model over the 2PL model based on this widely used model selection criterion (e.g., Raftery, 1995). This observation may also be (informally) made from Figure 2.

Figure 2. Histogram of the difference between the 2PL model BIC and (minus) the 1PL model BIC, for the 10,000 replications used (see illustration section).

Note. 2PL model = two-parameter logistic model; 1PL model = one-parameter logistic model; BIC = Bayesian information criterion.
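Why the LR difference test and the BIC can disagree in this setting follows from how the two quantities are linked for nested models. The sketch below assumes the usual definition BIC = −2 log L + (number of parameters) × ln(n), under which the 2PL-minus-1PL BIC difference equals 6 ln(n) minus the LR chi-square difference; the function name and the example LR difference are illustrative.

```python
import math


def bic_difference(lr_diff, extra_params, n):
    """BIC(relaxed) - BIC(restricted) for nested models fitted to the same
    data, assuming BIC = -2 logL + (number of parameters) * ln(n)."""
    return extra_params * math.log(n) - lr_diff


n = 1000             # cases per replication
extra_params = 6     # discrimination parameters freed in the 2PL model
lr_cutoff = 12.592   # .05 chi-square cutoff for 6 degrees of freedom

# Largest LR difference at which the BIC still favors the 1PL model.
bic_break_even = extra_params * math.log(n)

# An LR difference between lr_cutoff and bic_break_even is statistically
# significant, yet the BIC still prefers the restricted 1PL model.
example_lr_diff = 20.0  # hypothetical value inside that window
example_bic_diff = bic_difference(example_lr_diff, extra_params, n)
```

With n = 1,000 the break-even point is about 41.4, so any LR difference between 12.592 and roughly 41.4 leads the LR test to reject the equality constraints while the BIC still selects the 1PL model. The difference between the mean BIC values in Tables 1 and 2 (8425.472 − 8403.577 = 21.895) falls well inside that window.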

The Danger of Disregarding Local Misfit

The preceding discussion in this section showed that not examining local fit, in particular with respect to specific parameter constraints in used latent variable or item response theory models, can yield incorrect conclusions that favor misspecified models. As elaborated there, it was only the separate examination of the plausibility of constraints (3), that is, examining local fit, which provided a way to sense the 1PL model violation built in the data simulation process. By way of contrast, neither the overall test of the 1PL model nor the popular BIC index was able, on average, to sense this violation. We therefore interpret the above discussion as illustrating that there are serious dangers researchers may face if not examining separately the validity of particular parameter constraints within more general latent variable models (latent trait models). In other words, disregarding local misfit may lead to conclusions of preferring seriously misspecified models in empirical educational and behavioral research, with all adverse consequences for ensuing analyses and modeling following from such an incorrect decision.²

Conclusion

The purpose of this note was to highlight the relevance of examining parameter constraints within used latent variable models, as part of their model fit evaluation process. By showing that disregarding the separate test of such constraints may result in misleading conclusions preferring models stipulating incorrect parameter restrictions, we aimed to raise caution that particular parameter constraints represent essential parts of latent variable models (latent trait models) implementing them. For this reason, violation of these constraints should be considered sufficient for rejecting the model containing them, even if this model may be found to be tenable overall based on widely used goodness of fit indices, such as chi-square values, descriptive and alternative fit indices, and/or information criteria.

Our recommendation for routine examination of local parameter constraints within more general models is also grounded in statistical considerations. The latter result from the fact that the test of local parameter constraints focuses all statistical power on the pertinent portion of the model where the constraints reside, whereas the overall model test “distributes” that power across all parts of the model, also those beyond the particular constraints. As a consequence, for a given sample the power associated with testing a parameter constraint is typically higher than the power for testing the entire model. Thus, the chance of sensing misspecifications in local parts of the model, that is, with regard to parameter restrictions of concern, can be considerably higher than that associated with the overall model test. Therefore, when local fit examination signals constraint misspecification within a used latent variable model or an item response model in particular, and the violation of these restrictions in the pertinent parameter estimates is considerable based on substantive considerations, a researcher should (1) reconsider a possible finding of tenable overall model fit and (2) not necessarily treat the model as a plausible means of data description and explanation.
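The power argument can be illustrated with a small Monte Carlo sketch. It rests on simplifying assumptions that are not part of the article: both LR statistics are treated as noncentral chi-square variables sharing one noncentrality value (13.5, a hypothetical number), and the .05 cutoffs for 6 and 119 degrees of freedom are those quoted earlier in the text.

```python
import random


def simulated_power(df, noncentrality, cutoff, n_draws=4000, seed=12345):
    """Monte Carlo estimate of P(X > cutoff) for a noncentral chi-square X
    with the given df and noncentrality, built as a sum of squared normals."""
    rng = random.Random(seed)
    shift = noncentrality ** 0.5
    rejections = 0
    for _ in range(n_draws):
        # One shifted normal carries the whole noncentrality; the rest are central.
        x = (rng.gauss(0.0, 1.0) + shift) ** 2
        x += sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df - 1))
        if x > cutoff:
            rejections += 1
    return rejections / n_draws


# Hypothetical noncentrality: the same misspecification feeds both tests.
ncp = 13.5
power_local = simulated_power(df=6, noncentrality=ncp, cutoff=12.592)
power_overall = simulated_power(df=119, noncentrality=ncp, cutoff=145.461)
```

Under these assumptions the 6-df test of the constraints rejects far more often than the 119-df overall test, because the same amount of misfit is diluted across many more degrees of freedom in the latter.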

In order to conduct local fit examination as advocated in this note, after examining overall fit a researcher first needs to identify the essential parametric constraint(s) implemented in a main model of interest that is utilized in his or her study. (That constraint may oftentimes be one of equality of several model parameters, but in general need not be such a simple linear restriction and may also be a nonlinear constraint; e.g., L. K. Muthén & Muthén, 2020.) This will usually be a restrictive model of main or possibly focal relevance in their empirical investigation, which reflects particular aspects of a research question or a series of such that are of concern to the scientist. The model will frequently be obtainable as nested in a more relaxed model representing the general analytic frame of importance when addressing the research questions. (In the preceding sections, the restrictive model was the 1PL model that was nested in the 2PL model representing—as the more relaxed model—the general modeling framework in the earlier developments in this note.) As a next step, the scholar needs to examine the validity of the parameter constraint(s), for example, using the LR test in cases where the popular maximum likelihood model fitting and parameter estimation method is applicable (e.g., Bollen, 1989). When evaluating the results of this restriction testing procedure the researcher also has to (1) assess in substantive terms the empirical degree of violation of the constraint(s) in the more general model without them by examining the pertinent parameter estimates (and associated standard errors) and (2) make a decision whether that violation has practical relevance, before declaring the constraint as consistent with the data or alternatively not supported by the latter (i.e., rejected). 
This activity may be especially recommendable in settings with large samples (e.g., with thousands of studied units of analysis), since statistical significance may then result mostly, if not exclusively, from excessive power (e.g., Schmidt, 1996). In these cases, expert knowledge in the particular subject-matter domain may well be indispensable when making a well-informed decision in support of local fit or as evidence for local misfit that suffices to reject the restrictive model as a means of data description and explanation.

While this article raises justified caution regarding unwarranted overreliance on overall model fit of restrictive latent variable models and item response models, it is worthwhile emphasizing the following points as potential limitations of the preceding discussion. First, it is unknown currently how frequent a finding like that discussed in the illustration section may be obtainable in empirical behavioral, educational, social, or related research. We therefore encourage future studies that examine the possibility of encountering plausible overall models where particular parameter constraints are violated, possibly based on comprehensive simulation studies that are beyond the scope of this article (including such on the effect of sample size on this type of findings; see also Note 2). Second, the article does not imply that the 1PL model is rarely to be trusted in IRM applications (cf. Raykov & Marcoulides, 2019). More specifically, we do not intend to leave the impression that other examples of its spurious overall tenability with notably unequal item discrimination parameters could be readily found, in particular such where these parameters’ discrepancy is substantively meaningful as in the illustration section. Third, this note does not aim to make a general statement per se about model fit or related aspects of latent variable models. Rather, its aim is to provide an important demonstration of the fact that a potential general claim of relying routinely on overall model tenability, or considering it superior to or more relevant than local fit, is unjustifiable. Fourth, the article does not mean to suggest that the popular model comparison index BIC cannot be generally trusted in empirical research. Rather, merely as a byproduct of the main discussion of the relevance of examining “local fit,” this note indicates an empirical setting where the BIC fails to select the true model, without attempting any generalizations about its behavior beyond that setting. 
Last but not least, the article does not imply that there cannot be other means to sense potential “local” deviations from the 1PL model when fitted to a given data set and found plausible as in the preceding section and thus encourages further research into the complex process of latent variable model fit evaluation that need not always be noncontroversial.

In conclusion, this note raised caution with regard to what may be currently seen as a frequently followed practice of being concerned predominantly with overall model fit in educational and behavioral research. The main goal was thereby the justification of a recommendation to integrate as an important complement in the process of model fit evaluation also the results of local fit examination, especially that pertaining to parameter constraints that can be an essential feature of used latent variable and item response models.

Acknowledgments

We are grateful to T. Asparouhov, P. Doebler, and G. A. Marcoulides for valuable and helpful discussions on simulation studies and model fit evaluation.

Appendix

Mplus Source Code for Simulation of Data Used in the Illustration Section

TITLE: SIMULATING BINARY DATA FOLLOWING THE 2PL-MODEL WITH THE ITEM
  DISCRIMINATION AND DIFFICULTY PARAMETERS IN EQUATIONS (4).
MONTECARLO: NAMES = Y1-Y7;
  CATEGORICAL = Y1-Y7;
  GENERATE = Y1-Y7 (1);
  NOBS = 1000;
  SEED = 16665709;
  NREPS = 10000;
MODEL POPULATION:
  F1 BY Y1*1.19
        Y2*.81
        Y3*.82
        Y4*.7
        Y5*1.09
        Y6*1.06
        Y7*.62;
  F1*1;
  [y1$1*-.602];
  [y2$1*.012];
  [y3$1*-1.462];
  [y4$1*.361];
  [y5$1*1.707];
  [y6$1*.799];
  [y7$1*1.012];
ANALYSIS: ESTIMATOR = ML;
MODEL: F BY Y1* Y2-Y7; ! FITS THE 2PL-MODEL TO EACH SIMULATED DATA SET
  ! FROM THE 10,000 REPLICATIONS (SEE NOTE 2 FOR THE 1PL-MODEL).
  F@1;
OUTPUT: TECH9;

Note 1. Each item difficulty parameter results as the ratio of the threshold given above (within brackets) to the corresponding loading, namely, b_j = τ_j/a_j, where τ_j is the threshold and a_j the loading (discrimination parameter) associated with the jth item (j = 1, . . ., 7; e.g., L. K. Muthén & Muthén, 2020; Takane & de Leeuw, 1987).
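
As a worked illustration of the ratio in Note 1, the following block applies b_j = τ_j/a_j to the thresholds and loadings listed in the MODEL POPULATION command above. The values are computed here and rounded to three decimals, so they may differ in the last digit from those reported in Equations (4).

```latex
\begin{align*}
b_1 &= -0.602/1.19 \approx -0.506, & b_2 &= 0.012/0.81 \approx 0.015,\\
b_3 &= -1.462/0.82 \approx -1.783, & b_4 &= 0.361/0.70 \approx 0.516,\\
b_5 &= 1.707/1.09 \approx 1.566,   & b_6 &= 0.799/1.06 \approx 0.754,\\
b_7 &= 1.012/0.62 \approx 1.632.
\end{align*}
```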

Note 2. To fit the 1PL model with this source code, all one needs to do is add the three symbols “(1)” at the end of the first MODEL line (and immediately before the semicolon). To use this source code with another sample size for each of the 10,000 replications, change the NOBS line in its MONTECARLO section to “NOBS = <new sample size>;”.
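
A minimal sketch of the change described in Note 2 is given next: the MODEL command with the equality label “(1)” placed immediately before the first semicolon. The comment text is added here for illustration only and is not part of the original source code.

```mplus
MODEL: F BY Y1* Y2-Y7 (1); ! LABEL (1) HOLDS ALL SEVEN LOADINGS EQUAL (1PL-MODEL)
  F@1;
```

All other commands of the source code remain unchanged when fitting the 1PL model in this way.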

Note 3. To obtain the 10,000 individual replication LR chi-square and BIC values, add “RESULTS = 1PL_model.dat;” in the MONTECARLO section when subsequently fitting the 1PL model (see the MODEL section and Note 2 above), or alternatively “RESULTS = 2PL_model.dat;” when fitting the 2PL model. Merging the two result files thus generated permits evaluation of (a) the within-replication LR chi-square difference test statistics, (b) the 2PL model versus 1PL model BIC differences, and (c) their pertinent histograms, which are all referred to in the discussion in the illustration section (see also Figures 1 and 2).
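
The quantities (a) and (b) in Note 3 are linked by a simple identity, sketched below under stated assumptions. The BIC is taken in its usual form, −2 log L + p ln n. The parameter counts are read off the source code above (seven thresholds plus seven loadings for the 2PL model, seven thresholds plus one common loading for the 1PL model, with the factor variance fixed at 1). Both LR chi-square values are assumed to be deviations from the same saturated model. The .05 significance level is likewise an assumption made here for illustration.

```latex
\begin{align*}
\mathrm{BIC}_m &= -2\log L_m + p_m \ln n, \qquad p_{\mathrm{1PL}} = 8,\quad p_{\mathrm{2PL}} = 14,\\
\Delta\chi^2 &= \chi^2_{\mathrm{1PL}} - \chi^2_{\mathrm{2PL}}
             = -2\log L_{\mathrm{1PL}} + 2\log L_{\mathrm{2PL}},\\
\mathrm{BIC}_{\mathrm{2PL}} - \mathrm{BIC}_{\mathrm{1PL}} &= 6\ln n - \Delta\chi^2 .
\end{align*}
```

With n = 1000, 6 ln n ≈ 41.45, so under these assumptions the BIC prefers the 1PL model in any replication with Δχ² below about 41.45, whereas the LR difference test with 6 degrees of freedom rejects the equality constraint at the .05 level once Δχ² exceeds 12.59. Replications with Δχ² between these two values are those where the two criteria disagree. With n = 500, the corresponding BIC bound is 6 ln n ≈ 37.29.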

^1.

The meaning of “distinct” as used in this sentence is “not logically equivalent,” that is, the 1PL model is not equivalent to the 2PL model. Strictly speaking, a 1PL model can be considered a 2PL model with a special additional characteristic, namely, the validity of Equations (3). In this sense, “distinct” is meant to highlight that the reverse is not true, that is, that a general 2PL model is not a 1PL model (but, rather, only a 2PL model with (3) is a 1PL model).

^2.

The same pattern of results with regard to the 1PL and 2PL models’ LR chi-square distributions, their difference, and corresponding BIC indices is observed at sample size n = 500 as well (similarly with 10,000 successful replications for either model; see Note 2 to the appendix), when the power associated with the critical constraint (3) is substantially lower (and arguably may be insufficient).

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD: Christine DiStefano https://orcid.org/0000-0001-7504-6554

References

Bollen K. A. (1989). Structural equations with latent variables. Wiley. doi:10.1002/9781118619179
Brown T. A. (2015). Confirmatory factor analysis for applied research. Guilford Press.
Cronbach L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334. doi:10.1007/BF02310555
DeBoek P., Wilson M. (2004). Explanatory item response theory models. Springer.
Hancock G. R., Mueller R. O. (2013). Structural equation modeling: A second course. Information Age. doi:10.1007/978-1-4757-3990-9
Hodel R. E. (2013). An introduction to mathematical logic. Dover.
Jöreskog K. G. (1971). Statistical analysis of sets of congeneric tests. Psychometrika, 36, 109-133. doi:10.1007/BF02291393
McDonald R. P. (1999). Test theory. A unified treatment. Lawrence Erlbaum.
Millsap R. E. (2011). Statistical approaches to measurement invariance. CRC Press. doi:10.4324/9780203821961
Muthén B. O. (2002). Beyond SEM: General latent variable modeling. Behaviormetrika, 29, 87-117. doi:10.2333/bhmk.29.81
Muthén L. K., Muthén B. O. (2020). Mplus user’s guide. Muthén & Muthén.
Novick M. R., Lewis C. (1967). Coefficient alpha and the reliability of composite measurement. Psychometrika, 32, 1-13. doi:10.1007/BF02289400
Raftery A. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111-163. doi:10.2307/271063
Raykov T. (1997). Scale reliability, Cronbach’s coefficient alpha, and violations of essential tau-equivalence for fixed congeneric components. Multivariate Behavioral Research, 32(4), 329-354. doi:10.1207/s15327906mbr3204_2
Raykov T., Marcoulides G. A. (2011). Introduction to psychometric theory. Taylor & Francis.
Raykov T., Marcoulides G. A. (2018). A course in item response theory and modeling with Stata. Stata Press.
Raykov T., Marcoulides G. A. (2019). Can the one-parameter logistic model be a spurious finding for a heterogeneous population? Measurement, 17, 192-199. doi:10.1080/15366367.2019.1591834
Raykov T., Penev S. (2014). Latent growth curve models selection: The potential of individual case residuals. Structural Equation Modeling, 21(1), 20-30. doi:10.1080/10705511.2014.856693
Schmidt F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115-129. doi:10.1037/1082-989X.1.2.115
Takane Y., de Leeuw J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393-408. doi:10.1007/BF02294363
van der Linden W. J. (Ed.). (2016a). Handbook of item response theory: Vol. 2. Statistical tools paperback. CRC Press. doi:10.1201/9781315374512
van der Linden W. J. (2016b). Unidimensional logistic response models. In van der Linden W. J. (Ed.), Handbook of item response theory: Vol. 2. Statistical tools paperback (pp. 13-30). CRC Press. doi:10.1201/9781315374512
von Davier M. (2016). Rasch model. In van der Linden W. J. (Ed.), Handbook of item response theory: Vol. 2. Statistical tools paperback (pp. 31-50). CRC Press.
