Abstract
The main thesis of the present study is to use the Bayesian structural equation modeling (BSEM) methodology for establishing approximate measurement invariance (A-MI), using data from a national examination in Saudi Arabia, as an alternative when strong invariance criteria are not met. We illustrate how to account for the absence of measurement invariance using relative rather than exact criteria. A secondary goal was to compare latent means across groups using invariant parameters only and through utilizing exact and relative evaluative criteria. The A-MI protocol suggested equivalence of the thresholds using prior variances equal to 0.10. Subsequent differences between groups were evaluated using effect size criteria and the prior-posterior predictive p-value (PPPP), which proved invaluable for testing differences against zero, against some meaningless nonzero estimate, and against the three commonly used effect size benchmarks described by Cohen in 1988 (i.e., .20, .50, and .80). Results substantiated the use of the PPPP for evaluating mean differences across groups when utilizing nonexact evaluative criteria.
Keywords: Bayesian analysis, approximate measurement invariance, prior-posterior predictive p-value
Traditional assessments of measurement invariance engage a very strict framework on the equivalence of the measured parameters (Meredith, 1993; Vandenberg & Lance, 2000; van de Schoot, Lugtig, & Hox, 2012; Widaman & Reise, 1997). This is in line with exact model fit and traditional evaluation of models in structural equation modeling (SEM) by use of omnibus fit statistics such as the chi-square test (De Jong, Steenkamp, & Fox, 2007; MacCallum, Browne, & Sugawara, 1996) and poses a significant limitation: In the absence of strict measurement invariance, mean comparisons between groups are rendered meaningless. Several researchers have commented on the unrealistic nature of exact fit and have proposed various conventions for evaluating model fit (Marcoulides & Yuan, 2017; Maydeu-Olivares, 2017). One alternative when exact measurement invariance is lacking is to engage approximate parameter equivalence using Bayesian methodology (Lee, 2007; Lynch, 2007; Lynch & Western, 2004; Zyphur & Oswald, 2015) by allowing parameters to vary by some small margin while still maintaining satisfactory levels of invariance so that means between groups can be contrasted (Fox, 2010; Muthén & Asparouhov, 2013; van de Schoot et al., 2013). Another alternative has been the alignment method put forth by Muthén and Asparouhov (2010, 2012; see also Asparouhov & Muthén, 2014; Marsh et al., 2018). A third involves estimating random slopes and intercepts of the measurement model; thus, between-groups differences are accounted for because the measurement parameters are modeled as random (Asparouhov & Muthén, 2015; De Boeck, 2008; Fox & Verhagen, 2017; Frederickx, Tuerlinckx, De Boeck, & Magis, 2010; Muthén & Asparouhov, 2018).
The main thesis of the present study is to illustrate the Bayesian structural equation modeling (BSEM; Scheines, Hoijtink, & Boomsma, 1999) methodology of establishing approximate measurement invariance (A-MI) using data from a national examination in Saudi Arabia as an alternative to not meeting strict invariance criteria. Instead, we illustrate how to account for the absence of measurement invariance using relative, compared to exact, invariance criteria. A secondary goal was to compare latent means across groups using invariant parameters only and through utilizing exact and relative evaluative criteria by use of the prior-posterior predictive p-value (PPPP). The article is organized along the following axes: it presents a description of (a) exact measurement invariance, (b) SEM-based differential item functioning (DIF) as described by Dimitrov (2017), (c) A-MI, and (d) latent mean difference conventions, followed by an applied example illustrating the above concepts.
Exact Measurement Invariance and Differential Item Functioning
The justification of measurement invariance across groups is a logical prerequisite to conducting substantive cross-group comparisons (e.g., males vs. females, adolescents vs. adults, students vs. employees), but measurement invariance is rarely tested in psychological research (Horn & McArdle, 1992) and even less so satisfied at proper levels to allow for meaningful between-groups comparisons. On the contrary, in most cases where cross-group comparisons are attempted it is assumed that the particular scale’s measurement properties are equivalent across groups. However, this assumption is highly problematic since it is not always true. Thus, all previous findings could be questioned since one can never be sure that the reported differences are due to actual group differences or are due to the different structure of the measure in the groups (Reise, Widaman, & Pugh, 1993).
Applying exact zero parameter differences, which means the factor loadings and/or intercepts are identical across comparison groups, is the traditional measurement invariance approach for full invariance (Raykov, Marcoulides, & Millsap, 2013). However, traditional testing across many groups, namely, the multigroup confirmatory factor analysis (MGCFA), is often too strict and, thus, might lead to inaccurate model rejection or a long series of model modifications with substantial risks of misspecification (Asparouhov & Muthén, 2014; de Bondt & van Petegem, 2015; Kim, Cao, Wang, & Nguyen, 2017). Some researchers (Davidov et al., 2015; Muthén & Asparouhov, 2013; van de Schoot et al., 2013) criticized the use of the MGCFA for testing measurement invariance across a large number of groups, mainly for two reasons: first, the exponential growth in the risk of committing Type I errors and, second, the improper use of goodness-of-fit indices with multiple groups, given recommendations that they may represent conservative estimates of group differences (Kim et al., 2017; Rutkowski & Svetina, 2014).
Measurement invariance refers to the statistical property of measurement that indicates that the same construct is being measured across some specific groups (Cheung & Rensvold, 2002; Millsap, 2011). In research practice, cross-group factorial invariance is traditionally tested by using the MGCFA technique (Raykov, Marcoulides, & Li, 2012; Vandenberg & Lance, 2000). There are three major forms of measurement invariance, namely, configural, metric, and scalar invariance (Cheung & Rensvold, 2002) and many more restrictions considered secondary for mean comparisons. Configural invariance is demonstrated when the same factor structure holds across groups (i.e., the same pattern of zero and non-zero item factor loadings). Configural invariance is tested by simultaneously running a MGCFA, in which the factor structure of the measure is constrained to be the same across groups. A measure exhibits configural invariance when both models demonstrate good fit (i.e., fail to be rejected). If a model exhibits configural noninvariance, it is not possible to compare means across groups: The observed variables are indicators of different traits in different groups. Metric or weak invariance refers to whether the proportion of true score variance is equivalent across groups. Equal true score variance suggests that the same construct is being measured in both groups (Meredith & Horn, 2001). In other words, this model tests whether the respondents from different groups respond to the items in the same way. Evidence of metric invariance exists if the constrained model does not fit significantly worse compared with the configural model. Lack of evidence of metric invariance is indicative of items that operate and function differently in one population compared with another. Lack of metric invariance seriously puts into question the similarity in which two populations conceptualize a specific construct(s). 
Scalar or strong invariance is established when item intercepts or thresholds (for ordinal data) are constrained to be equal across groups, and evidence of scalar invariance exists if the constrained model does not fit significantly worse compared with the metric model. Such evidence implies that persons from different groups with the same level of the latent construct are expected to provide equivalent responses. Evidence of scalar noninvariance may suggest some sort of response bias (Vandenberg & Lance, 2000). Scalar invariance must hold in order to compare factor means; further restrictions (i.e., factor variance, covariance, residual invariance) are optional and may be theoretically meaningful in specific contexts (Milfont & Fischer, 2010).
Following evaluation of measurement invariance by testing configural, metric, and scalar models, noninvariance in thresholds between groups can be further tested by applying any of the family of item response theory (IRT) models (Dorans & Holland, 1993; Hambleton & Swaminathan, 1985). Here we describe the 2-parameter logistic (2PL) IRT model, in which item difficulties and slopes are simultaneously estimated for all groups (see Thissen, Steinberg, & Wainer, 1988, 1993, for excellent descriptions), as described by Dimitrov (2017) and Raykov, Marcoulides, Lee, and Chang (2013) using the confirmatory factor analysis (CFA) framework. Based on the 2PL model, the probability of success is estimated as follows (Kim & Yoon, 2011):
| (1) |
With the probability of person s providing the correct response (i.e., 1) being a function of the item’s discrimination α and difficulty level b. The term D refers to the scaling constant 1.7, which places the estimates on the normal ogive model. The estimation of the parameters of “a” and “b” in the CFA framework is expressed using a system of linear equations as follows:
| (2) |
Based on the CFA framework, the latent trait η is explaining the observed responses as a function of the factor loading λ plus some sort of measurement error ε. The item parameters of slopes and thresholds in the CFA framework need to be transformed as shown below to represent item discrimination and difficulty estimates. The equations below draw heavily on the influential work of Dimitrov (2017) on employing the CFA framework for examining both DIF and differential slope functioning. The interested reader needs to consult the original source for the presentation of differential slope functioning among other estimates and extensions to the 3PL model.
| (3) |
and
| (4) |
With Equation 3 excluding the scaling constant D when the normal ogive model is utilized. Based on Dimitrov (2017), the expected item score is estimated as follows (Dimitrov, 2003):
| (5) |
With X being estimated as shown below:
| (6) |
And erf being the error function described by Hastings (1955):
| (7) |
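As a rough illustration of Equations 5 to 7, the sketch below computes an expected item score under the assumption that ability is standard normal and that a normal ogive 2PL item is used, in which case the marginal probability of a correct response is [1 − erf(X)]/2 with X = ab/√(2(1 + a²)). The erf approximation is the four-term rational form usually attributed to Hastings (1955). The function names and the discrimination value a = 1.0 are hypothetical, and the exact notation in Dimitrov (2003, 2017) may differ.

```python
import math

# Coefficients of the four-term Hastings-type approximation to erf
# (absolute error is on the order of 5e-4 for x >= 0).
M1, M2, M3, M4 = 0.278393, 0.230389, 0.000972, 0.078108


def erf_hastings(x: float) -> float:
    """Approximate erf(x); odd symmetry is used for negative arguments."""
    sign = -1.0 if x < 0 else 1.0
    x = abs(x)
    poly = 1.0 + M1 * x + M2 * x**2 + M3 * x**3 + M4 * x**4
    return sign * (1.0 - poly ** (-4))


def expected_item_score(a: float, b: float) -> float:
    """Marginal probability of a correct response, assuming theta ~ N(0, 1)
    and a normal-ogive item with discrimination a and difficulty b."""
    x = a * b / math.sqrt(2.0 * (1.0 + a * a))
    return (1.0 - erf_hastings(x)) / 2.0


# Hypothetical item: a = 1.0 is assumed; b = 1.360 is the male difficulty
# reported for Item 2 in Table 2.
print(round(expected_item_score(1.0, 1.360), 3))
```

An item with difficulty 0 returns exactly 0.5 under these assumptions, which is a quick sanity check on the sign conventions.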
DIF is then estimated as a function of the difference between the ICCs for the reference and focal groups, respectively:
| (8) |
Tests of significance using the z test, as well as parametric and nonparametric confidence intervals,¹ can then be used to evaluate the magnitude of DIF (Morey & Rouder, 2011; Raykov & Marcoulides, 2012).
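To make the preceding steps concrete, the following sketch uses the standard logistic 2PL form, P(θ) = 1 / (1 + exp(−Da(θ − b))), and a commonly used conversion from standardized CFA parameters to IRT parameters, a = λ/√(1 − λ²) and b = τ/λ. Both are assumptions about what Equations 1, 3, and 4 contain; the conversion holds for a standardized loading λ and threshold τ with a unit-variance latent response, and other parameterizations rescale it. The gap measure below (mean absolute distance between two ICCs on a grid) is illustrative only and is not the test statistic of Dimitrov (2017). All function names are hypothetical.

```python
import math

D = 1.7  # scaling constant mentioned in the text


def icc_2pl(theta: float, a: float, b: float, d: float = D) -> float:
    """Logistic 2PL item characteristic curve."""
    return 1.0 / (1.0 + math.exp(-d * a * (theta - b)))


def cfa_to_irt(loading: float, threshold: float) -> tuple:
    """Convert a standardized loading and threshold to (a, b) on the
    normal-ogive metric; assumes a unit-variance latent response."""
    a = loading / math.sqrt(1.0 - loading**2)
    b = threshold / loading
    return a, b


def icc_gap(a_ref, b_ref, a_foc, b_foc, grid=None) -> float:
    """Mean absolute difference between reference and focal ICCs."""
    grid = grid or [x / 10.0 for x in range(-40, 41)]
    diffs = [abs(icc_2pl(t, a_ref, b_ref) - icc_2pl(t, a_foc, b_foc)) for t in grid]
    return sum(diffs) / len(diffs)


# Item 2 difficulties from Table 2; the common discrimination of 1.0 is assumed.
b_males, b_females = 1.360, 0.185
dif_logit = b_males - b_females  # difference in difficulty between groups
print(round(dif_logit, 3), round(icc_gap(1.0, b_males, 1.0, b_females), 3))
```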
Differential Item Functioning: Effect Size Conventions
Several studies have looked into the issue of effect size metrics for DIF (e.g., Raju, 1988; Zwick, 2002, 2012), suggesting diverse analytical means (Penfield & Lam, 2000; Raju, 1988, 1990; Raju, van der Linden, & Fleer, 1995; Rudner, Getson, & Knight, 1980a, 1980b). Among them, the most prominent are the ETS criteria, which transform the difference logit parameter onto the delta metric system (Dorans & Holland, 1992; Holland & Thayer, 1988; Linacre & Wright, 1989; van der Linden & Glas, 2000). Under the conventions based on the difference logit parameter, values of .44 and below point to negligible DIF and estimates greater than .64 indicate large DIF. Lin and Lin (2014) extended the ETS criteria to avoid Type I errors, with values of .45 and below being indicative of negligible DIF, values between .45 and .90 in logits representing medium-level DIF, and values greater than .90 representing large DIF. These criteria, however, need to be interpreted in light of the fact that they were developed with the Rasch model in mind. The present study applies both conventions so that threshold discrepancies are evaluated accurately, which matters more as samples get large.
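The two sets of cutoffs can be written as a small classifier. This is a minimal sketch: the function name is hypothetical, and treating ETS values between .44 and .64 as "Medium" is an assumption, since the text names only the negligible and large bands for that convention. The DIF logit values are those reported in Table 2.

```python
def classify_dif(dif_logit: float, negligible_max: float, large_min: float) -> str:
    """Label the absolute logit difference using two cutoffs."""
    size = abs(dif_logit)
    if size <= negligible_max:
        return "Negligible"
    if size > large_min:
        return "Large"
    return "Medium"  # assumed label for the band between the two cutoffs


ETS = (0.44, 0.64)      # negligible at or below .44, large above .64
LIN_LIN = (0.45, 0.90)  # negligible at or below .45, large above .90

# DIF logit values per item, taken from Table 2.
table2 = {1: 0.471, 2: 1.175, 3: 0.388, 4: 0.599, 5: 1.050, 6: 0.592}

for item, dif in table2.items():
    print(item, classify_dif(dif, *ETS), classify_dif(dif, *LIN_LIN))
```

With these cutoffs the two conventions give the same label for every item in Table 2.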
Approximate Measurement Invariance
In 2013, Muthén and Asparouhov introduced Bayesian A-MI as a new invariance testing level and technique that allows for minor discrepancies between parameters across a large number of groups. A-MI provides a means of making meaningful comparisons between groups on point estimates when exact measurement invariance is not satisfied. The concept of A-MI draws from the notion that differences in point estimates of intercepts/thresholds between two groups follow a specified probability distribution (likely Gaussian) with a mean of zero and variance σ² (Fox & Glas, 2001). In other words, although mean differences may be nonexistent across parameters, variability in the estimated parameters across groups is likely nonzero, in contrast with the exact measurement invariance notion, which requires exact and absolute equivalence of parameters across groups. The procedure was put forth by Muthén and Asparouhov (2012), who suggested that the assumption of exact equivalence may be too strict and unrealistic (see also Alamri, 2019; Lek & van de Schoot, 2019), and it is in line with alternatives to the null hypothesis testing approach (Killeen, 2005). Thus, instead of imposing an almost unrealistic assumption, a compromise is made between exact equivalence (zero difference between parameters of interest) and approximate equivalence (nonzero difference). Under this notion, parameter differences between groups are allowed as long as they are not so large that they invalidate estimates of means, but they are also not required to be exactly zero (as in exact measurement invariance).
Thus, Bayesian priors are posited on variances only and not on means because groups should be no different in mean levels but nevertheless some variability around those zero mean estimates should be allowed to provide for the “wiggle room” (in van de Schoot et al.’s, 2013, terms) so that the equivalence of the measuring instrument is justified (and between-group inferences in level are consequently meaningful and valid). This prior variance estimate defines a level of variability in the parameters between groups that is sufficiently small so that parameters are considered invariant assuming that an optimal prior variance estimate is selected. If approximate invariance is satisfied, then latent factor means are not contaminated by measurement error due to between-groups differences and inferences with regard to level can be further conducted.² Figure 1 shows three distributions all having zero mean but different levels of variability. The upper distribution has a variance estimate of 0.001 allowing for some variability from zero, the lower distribution has a variance estimate of 0.01 providing more “wiggle room,” whereas the zero mean zero variance distribution is expressed with the vertical dotted line on the mean, having no shape (no room beyond the point estimate). The major thesis of BSEM is to provide some room around the estimated parameters that is in line with random and meaningless variation but without interfering with meaningful and substantial parameter estimates.
Figure 1.

Zero mean distributions of thresholds/latent means using small variance priors, that is, 0.001 (upper panel) and 0.01 (lower panel) along with estimated 95% confidence intervals based on the asymmetric bootstrap distribution (ADF approach). A zero mean zero variance distribution is depicted using the vertical dotted line on the mean of zero having no variance (thus no shape, Gaussian, or other). With the 0.01 variance prior, the range of intercept variability two standard deviations from the prior mean of zero is −0.2 to 0.2 as shown in Muthén (2012).
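The ±0.2 range quoted in the caption follows directly from the prior variance: two standard deviations of a N(0, 0.01) prior is 2 × √0.01 = 0.2. The short sketch below repeats that arithmetic for other prior variances discussed in this article; the function name and the choice of k = 2 standard deviations as a default are illustrative assumptions.

```python
import math


def prior_range(variance: float, k: float = 2.0) -> tuple:
    """Range covered by k standard deviations around a zero prior mean."""
    sd = math.sqrt(variance)
    return (-k * sd, k * sd)


for v in (0.001, 0.01, 0.10):
    low, high = prior_range(v)
    print(f"prior variance {v}: {low:.3f} to {high:.3f}")
```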
Model fit in the A-MI protocol is evaluated within the BSEM analytical framework using several means such as the posterior predictive p-value (PPP) and evaluation of its 95% credible interval (95% CI; Gelman, Carlin, Stern, & Rubin, 2004). In addition to model fit indices (i.e., PPP and 95% CI), researchers may use model fit comparison indices such as the deviance information criterion (Spiegelhalter, Best, Carlin, & van der Linde, 2002), the Bayesian information criterion (Schwarz, 1978), and the Bayes Factor (Alamri, 2019; Kass & Raftery, 1995).
Latent Mean Differences Comparisons
Assuming that exact or approximate measurement invariance is satisfied, one can proceed with the testing of latent means across groups of interest. Latent point estimate differences can be captured by (a) evaluating differences by use of effect size indicators (Little, Bovaird, & Slegers, 2006), (b) examining whether symmetric and asymmetric 95% confidence intervals include zero, and (c) using conventions on ignoring negligible effects by use of the prior-posterior predictive p-value (PPPP; Hoijtink & van de Schoot, 2018). A brief description of these approaches is provided below.
Generalizing the work of Cohen (1992, 1994) on effect size metrics with measured variables, Hancock (2001) proposed an effect size indicator for latent means, termed latent d, that is analogous to Cohen’s d. Specifically, he proposed the following estimation:
| (9) |
With a1 and a2 being the latent means, n1 and n2 the sample sizes per group, and ψ1 and ψ2 the latent variable variances.
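A minimal sketch of a latent d computation follows. It assumes the sample-size-weighted pooled variance form commonly given for Hancock's (2001) standardized latent effect size, d = |a1 − a2| / √[(n1ψ1 + n2ψ2)/(n1 + n2)]; Equation 9 may differ in detail. The group sizes are those of the present sample, whereas the means and variances in the example are hypothetical.

```python
import math


def latent_d(mean1: float, mean2: float, var1: float, var2: float,
             n1: int, n2: int) -> float:
    """Standardized latent mean difference using a pooled latent variance."""
    pooled_variance = (n1 * var1 + n2 * var2) / (n1 + n2)
    return abs(mean1 - mean2) / math.sqrt(pooled_variance)


# Hypothetical latent means and variances; n1 = 1,118 males, n2 = 882 females.
print(round(latent_d(0.0, 0.5, 1.0, 1.0, 1118, 882), 3))
```

Fixing the reference group mean at zero, as is typical in multigroup models, makes the numerator simply the estimated mean of the other group.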
The second approach to evaluating mean differences in latent constructs involves simulating population distributions using sample-based parameters and by evaluating both point estimates and 95% confidence intervals (95% CIs). Two types of distribution can be constructed, one assuming normality and one not placing strict assumptions of the shape of the distribution. The first method estimated intervals assuming a normal distribution and symmetric confidence intervals. The second involved the asymptotic distribution-free method, which estimates the bootstrap distribution using point and variance estimates from the sample (Browne, 1984; Raykov & Marcoulides, 2004). The advantage of the asymptotic distribution-free methodology is that the normality of the estimates is not presumed (Cheung, 2009; Kelley & Pornprasertmanit, 2016; Olsson, Foss, Troye, & Howell, 2000), but instead the population distribution is estimated based on the distribution shape of the resampled data.
The third and more recent approach involves using a p value that tests the hypothesis that differences between groups are beyond some nuisance, negligible effect, in light of prior recommendations on moving away from the exact null hypothesis testing paradigm (Cohen, 1994; Ioannidis, 2005; Killeen, 2005; Miller, 2009). This estimate is the PPPP using BSEM (Hoijtink & van de Schoot, 2018). As Asparouhov and Muthén (2017) pointed out, the PPPP is very suitable for testing hypotheses of minor parameter differences using small prior variances and is analogous to the Wald test, without having any influence on or being affected by model fit (i.e., the PPP). Thus, the PPP uses the discrepancy function to evaluate model fit and the modeling of the “major” parameters θ1, whereas the PPPP tests focused hypotheses on the minor parameters θ2 about differences between parameters that lie outside the N(0, v) distribution (using some meaningful³ prior variance estimates). Prior variance estimates of the PPP and the PPPP parameters do not need to be the same as the two analyses target different hypotheses that are “uncorrelated” by all means.
An Application of Approximate Measurement Invariance Using a National Examination in Saudi Arabia
In this section, we present an application of the A-MI protocol using data from a national examination in Saudi Arabia in the domain of Physics. For the purposes of illustration, six binary items from a Physics examination (out of 20) that exhibited proper levels of internal consistency reliability were utilized. Data involved a random sample of 2,000 participants, 1,118 males and 882 females, drawn from a large national sample (of over 60,000 participants). The sample size was selected with the aim of minimizing the Type I errors associated with a sample size of 60,000 while still maintaining properly estimated model parameters.⁴ Information on item content is not available due to confidentiality and for avoiding item exposure and contamination. This section is organized along the following axes. First, traditional models of exact measurement invariance were fit to the data across gender. Following that, and in the absence of meeting strict invariance, a DIF analysis was implemented. After confirming the location of noninvariance, A-MI was tested using BSEM. The section concludes with a series of tests evaluating mean differences across groups using exact and approximate invariance approaches.
Traditional Exact Measurement Invariance
Table 1 shows the traditional method for assessing measurement invariance using maximum likelihood estimation and assuming exact parameter equivalence. As shown in Table 1, the difference between the configural and metric models was not significant: the chi-square difference based on the loglikelihoods was equal to 4.27, which, for a difference of 6 df, was not significant (p = .640). This suggests that constraining the slopes to be equivalent across gender was a justifiable assumption. The difference between the metric and scalar models, however, pointed to the inferior fit of the latter, suggesting that constraining the intercepts to be equivalent across males and females was improper. Consequently, scalar invariance was not justified. The two options put forth by the traditional MI methodology were to go through a systematic and time-consuming process of evaluating noninvariant parameters and then decide on either relaxing the assumption of full equivalence to achieve partial invariance and/or dropping noninvariant parameters (Steinmetz, 2013). Besides being cumbersome, this procedure will likely capitalize on chance (MacCallum, 1992). Since the region of misfit was on the intercepts, and the data were dichotomous, a DIF analysis was implemented as discussed below using the Dimitrov (2017) protocol, which employs SEM.
Table 1.
ML Estimates of Measurement Invariance Models of the Physics Subscale.
| Model | LL | npar | Comparison | LRTS | df | p Value |
|---|---|---|---|---|---|---|
| Configural | −9147.77 | 26 | | | | |
| Metric | −9149.90 | 20 | Configural vs. metric | 4.27 | 6 | .640 |
| Scalar | −9164.00 | 14 | Metric vs. scalar | 28.196 | 6 | <.0001 |
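Note. LL = loglikelihood; npar = number of free parameters; LRTS = likelihood ratio test statistic. Differences between models were evaluated using −2 times the difference in loglikelihoods.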
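The comparisons in Table 1 can be approximately reproduced from the reported loglikelihoods and parameter counts. The sketch below is illustrative: the function names are hypothetical, and the chi-square tail probability uses a closed form that is valid only for even degrees of freedom (sufficient here, since df = 6). Small discrepancies from the tabled values reflect rounding of the reported loglikelihoods.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail chi-square probability; closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    terms = sum(half**i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms


def likelihood_ratio_test(ll_restricted: float, npar_restricted: int,
                          ll_general: float, npar_general: int) -> tuple:
    """Return (statistic, df, p) for two nested models."""
    statistic = -2.0 * (ll_restricted - ll_general)
    df = npar_general - npar_restricted
    return statistic, df, chi2_sf_even_df(statistic, df)


# Loglikelihoods and parameter counts from Table 1.
print(likelihood_ratio_test(-9149.90, 20, -9147.77, 26))  # configural vs. metric
print(likelihood_ratio_test(-9164.00, 14, -9149.90, 20))  # metric vs. scalar
```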
Differential Item Functioning
Table 2 and Figure 2 display the findings from the DIF procedure. Differences in logits between groups were significant at p < .01 for all items except Item 3 (p = .010), which was expected due to the large sample size. Consequently, two guidelines on effect size indicators of the logit difference were applied, one by ETS and one by Lin and Lin (2014). As shown in Table 2, the two conventions agreed, pointing to the presence of two large effects (Items 2 and 5), three medium effects (Items 1, 4, and 6), and one negligible effect (Item 3). Figure 2 shows that all items were easier for females than for males, as the levels of ability required to achieve 50% success were lower for females. At this point, the goal of contrasting mean levels between males and females in Physics becomes unrealistic as the physics scale functions in nonequivalent ways. Two analytical approaches, namely (a) purifying the instrument and/or (b) establishing partial measurement invariance and proceeding with mean comparisons, would likely result in loss of information; thus, A-MI was tested by applying zero-mean, small-variance priors to allow intercepts to vary between males and females by some negligible margin (as would be expected by random variation).
Table 2.
Differential Item Functioning (DIF) Indices Using Inferential and Effect Size (ES) Criteria.
| Domain: MAR | Males b | Females b | z test | p value | DIF logit | ES ETS | ES Lin & Lin |
|---|---|---|---|---|---|---|---|
| Item 1 | 0.585 | 0.114 | 3.243 | .001** | 0.471 | Medium | Medium |
| Item 2 | 1.360 | 0.185 | 3.999 | .001** | 1.175 | Large | Large |
| Item 3 | 0.469 | 0.081 | 2.576 | .010* | 0.388 | Negligible | Negligible |
| Item 4 | −0.229 | −0.828 | 4.157 | .001** | 0.599 | Medium | Medium |
| Item 5 | 0.060 | −0.990 | 7.971 | .001** | 1.050 | Large | Large |
| Item 6 | 0.139 | −0.453 | 5.223 | .001** | 0.592 | Medium | Medium |
**p < .01. *p < .05.
Figure 2.
Item characteristic curves of males (upper panel) and females (lower panel) on physics subscale after fitting a 2-PL model to the data. Dotted lines show differences in item difficulty parameters across males and females.
Approximate Measurement Invariance
The upper part of Table 3 shows that the scalar invariance model did not provide a good fit to the data based on the PPP estimate. Specifically, for the scalar model, the estimate of .202 was low in relation to accepted conventions in which PPP values should be around 0.500. Given the Table 3 results on not meeting strict invariance, an attempt was made to allow some variability among intercepts between males and females so that mean comparisons between groups would still be feasible without having to either constrain very different estimates to be equal or drop noninvariant parameters. The lower part of the table describes the results from this sensitivity analysis, in which a series of BSEM models were specified with small variance priors that allowed intercepts between males and females to vary to different degrees. Following the protocol of Seddig and Leitgob (2018), the following small variance priors were tested: 0.001, 0.005, 0.01, 0.10, and 0.50. Their optimality was judged by use of the PPP value, evaluation of the 95% credible interval around PPP, the degree of invariance across intercepts, and convergence (Asparouhov, Muthén, & Morin, 2015; see Appendix A for a sample model’s Mplus annotated syntax file). As shown in Table 3, all PPP values except that for the very small prior variance of 0.001 were indicative of good model fit, despite excessive power levels, which tend to diminish the effects of the priors on the posterior. Consequently, models with prior variance estimates in the intercepts of 0.005, 0.01, 0.10, and 0.50 were potentially plausible models. Among them, two were associated with a noninvariant intercept and, opting for full invariance, models with prior variances of 0.005 and 0.010 were rejected. Thus, the choice was between models with prior variances of 0.10 and 0.50.
Both provided good model fit, but the model with prior variance equal to 0.10 was slightly better, and that prior is certainly more informative compared with a prior variance of 0.50. Consequently, the model allowing a prior variance of 0.10 across intercepts was the preferred model with these data. Further evidence on the suitability of the chosen model is shown in Figure 3 using an example item (Item 2). The top left panel shows a trace plot of the posterior estimates, with the vertical line signaling the end of the burn-in phase; only the estimates to the right of that line were used in the estimation process. The top right panel shows the distribution of posterior estimates, which is of Gaussian shape. The bottom left panel shows autocorrelation estimates of parameter values across iterations for different intervals in the chain, which should be less than .10 (to confirm independent draws from the posterior distribution); this goal was accomplished by applying thinning (see bottom right panel). Examination of the parameter estimate stability between the models with and without thinning showed a very close resemblance. Examination of the Proportional Scale Reduction (Gelman & Rubin, 1992) pointed to acceptable values close to 1, that is, 1.085 at 400 iterations and 1.001 at 30,000 iterations. The last criterion, slow convergence of misfitted models, provided additional anecdotal evidence⁵ of model misfit for all but the preferred model (Seddig & Leitgob, 2018). Figure 4 shows a scatterplot of the relationship between intercepts as measured by the scalar model and by the A-MI model. The relationship between estimates was 0.988 by use of Pearson’s r coefficient (p < .001), showing strong convergence and also some adjustment of the intercept estimates using the A-MI model.
Table 3.
Bayesian Confirmatory Factor Analysis Model Fit for Physics Subscale.
| Invariance | Prior variance | Npar | PPP | 95% CI | Intercept invariance |
|---|---|---|---|---|---|
| Exact measurement invariance | |||||
| Configural | — | 25 | 0.475 | (−28.034)-(30.455) | — |
| Metric | — | 19 | 0.506 | (−28.543)-(27.114) | — |
| Scalar | — | 13 | 0.202 | (−17.491)-(41.399) | — |
| Approximate measurement invariance | |||||
| Scalar | N(0,0.001) | 19 | 0.275 | (−20.637)-(37.789) | All invariant |
| Scalar | N(0,0.005) | 19 | 0.428 | (−25.882)-(31.261) | T5 noninvariant |
| Scalar | N(0,0.010) | 19 | 0.473 | (−27.708)-(29.246) | T5 noninvariant |
| Scalar | N(0,0.100) | 19 | 0.505 | (−28.441)-(27.964) | All invariant |
| Scalar | N(0,0.500) | 19 | 0.499 | (−28.173)-(28.131) | All invariant |
Note. Npar = number of free parameters. The preferred model involves a prior variance estimate of 0.1 as after allowing for that variability across intercepts, no significant difference emerged across groups. The table involves estimates from a sensitivity analysis using different priors.
Figure 3.
Top left panel: Traceplot of posterior distribution threshold estimates of an example item (Item 2). The vertical line distinguishes the first from the second half of the iterations. The first half of the estimates termed “burn-in” phase is not utilized to represent the posterior distribution. Top right panel: Posterior distribution of threshold values for Item 2 using 200,000 iterations. The posterior distribution is based on iterations 100,000 to 200,000 as the first 100,000 represented the “burn-in” phase. Bottom left panel: Autocorrelation before thinning; bottom right panel: Autocorrelation estimate of parameters during iterations after thinning.
Figure 4.

Relationship of intercepts from the estimation of the scalar model versus the approximate measurement invariance model with a prior variance of 0.1.
Latent Means Comparisons
Using effect size conventions, results indicated that the difference in the mean levels of physics between males and females was medium-to-large (i.e., .742 in SD terms). The symmetric confidence interval around the mean difference estimate ranged between 0.323 and 1.335 and did not include zero, suggesting that the difference between males and females was rather substantial and unlikely to be zero. The respective values for the asymmetric 95% confidence intervals ranged between 0.305 and 1.175 (see Figure 5), with the mean of males on physics being in the region of rejection compared with the distribution of mean values for the females. Consequently, the conclusion drawn was that females held significantly higher ability levels in physics compared with males.
Figure 5.
Latent mean differences across males and females on the six-item physics instrument by use of nonsymmetric 95% confidence intervals and 10,000 replicated samples using bootstrapping. The distribution for males with zero mean was simulated with the same estimate of variance as that of females.
When applying the recently recommended notion of accounting for negligible mean differences described above, a series of models were fit in which continuous factor scores were regressed on a dummy gender indicator (see Appendix B for Mplus code and extract from Results section). Bayesian priors for testing the hypothesis that the b-slope (mean of females) was significantly different from the mean of males (intercept term) were set for various conventions of effect sizes, from negligible to large (see Table 4). As per the recommendations by Hoijtink and van de Schoot (2018), a three-step procedure was used: (a) setting prior variance estimates reflecting various conventions of effect size, (b) tailoring prior variances to the scale of the data at hand, and (c) estimating the PPPP to test hypotheses. As per the recommendations of Muthén and Asparouhov (2012) and Hoijtink and van de Schoot (2018), a prior variance of 0.01 is considered to represent negligible differences in estimates of interest (means or correlations) when the variance of the variables is 1. Consequently, the prior variance of 0.00564 in Table 4 was estimated using (0.1 × 0.751)² = 0.00564, where 0.1 is the 0.01 variance in SD terms and 0.751 is the factor variance (0.564) expressed in standard deviation units. Similarly, conventions of effect sizes reflecting small (0.20), medium (0.50), and large (0.80) effects were tested accordingly to verify the behavior of the PPPP in relation to expectations.
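The scaling step described above amounts to multiplying the squared effect size by the factor variance, since (ES × SD)² = ES² × variance. The sketch below applies that arithmetic to the four effect size conventions, using the factor variance of 0.564 given in the text; the function and constant names are hypothetical.

```python
FACTOR_VARIANCE = 0.564  # factor variance reported in the text (SD of about 0.751)


def scaled_prior_variance(effect_size: float,
                          factor_variance: float = FACTOR_VARIANCE) -> float:
    """Prior variance for an effect size expressed on the factor's own scale."""
    return effect_size**2 * factor_variance


for es in (0.1, 0.2, 0.5, 0.8):
    print(es, round(scaled_prior_variance(es), 5))
```

The resulting values correspond to the prior variances listed in Table 4.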
Table 4.
Mean Differences Between Males and Females by Use of the Prior-Posterior Predictive P-Value (PPPP).
| Effect size | Prior variance | Npar | PPP | 95% CI | PPPP |
|---|---|---|---|---|---|
| 0.1 | N(0,0.00564) | 3 | 0.004 | (4.514)-(38.484) | <.001** |
| 0.2 | N(0,0.02256) | 3 | 0.357 | (−6.268)-(10.837) | <.001** |
| 0.5 | N(0,0.14100) | 3 | 0.491 | (−7.260)-(7.313) | .011* |
| 0.8 | N(0,0.36096) | 3 | 0.497 | (−7.283)-(7.289) | .098† |
Note. Npar = number of free parameters. The preferred model involves a prior variance estimate of 0.1, as after allowing for that variability across intercepts, no significant difference emerged across groups. The table involves estimates from a sensitivity analysis using different priors.
**p < .01. *p < .05. †p < .05, one-tailed test.
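The prior variances listed in Table 4 follow from rescaling each standardized effect size to the metric of the factor and squaring the result. A minimal sketch of that arithmetic is given below; the function and variable names are ours, and the only input taken from the text is the factor variance of 0.564.

```python
import math

FACTOR_VARIANCE = 0.564  # factor variance reported in the text


def prior_variance(effect_size, factor_variance=FACTOR_VARIANCE):
    """Rescale a standardized effect size to the factor metric and square it."""
    factor_sd = math.sqrt(factor_variance)  # approximately 0.751
    return (effect_size * factor_sd) ** 2


# Effect size conventions tested in Table 4: negligible, small, medium, large.
table4_priors = {d: round(prior_variance(d), 5) for d in (0.1, 0.2, 0.5, 0.8)}
for d, variance in table4_priors.items():
    print(f"effect size {d}: prior N(0, {variance})")
```

Because (d × SD)² equals d² × variance, the same values can be obtained by multiplying the squared effect size by 0.564 directly.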
Results indicated that when the mean difference between males and females was allowed to vary by a negligible or small margin, as reflected by effect size estimates of 0.1 and 0.2 (small per Cohen), the PPPP values were significant, suggesting that the difference between males and females was significantly different from small. Thus, differences were both significant and substantial (nonnegligible) by the use of the PPPP hypothesis testing estimator. Interestingly, the null hypothesis that the difference between males and females was of medium effect size was rejected as well (PPPP = .011), as the true difference between means was 0.795, reflecting a large effect size. The series of PPPP tests concluded by contrasting the true difference of 0.795 SD with a large effect size difference of 0.8. This difference was not significant (as it should be), despite the considerable power associated with 2,000 participants. Consequently, it was concluded that the difference in physics aptitude across males and females was both significant and nonnegligible (i.e., substantial), favoring females.
Implications: Advantages and Limitations of Approximate Measurement Invariance
Advantages
There are several advantages of adopting a Bayesian framework for evaluating simple structures and measurement invariance and for validly assessing latent mean differences. First, the Bayesian methodology does not rely on normally distributed parameter estimates, and thus skewed posterior distributions can be accommodated (Muthén & Asparouhov, 2010). Second, frequentist analyses become computationally intensive and cumbersome, or even infeasible, when categorical indicators are involved and models are large, requiring a large number of dimensions of numerical integration. Further obstacles relate to employing multiple tests and capitalizing on chance (MacCallum, Roznowski, & Necowitz, 1992) or applying invalid modifications (MacCallum, 1986). Finally, if the noninvariant items detected are of focal interest, the A-MI approach is well suited to this purpose (Kim et al., 2017).
Limitations
Although Bayesian modeling is an intriguing area of inquiry, it has been criticized for, among other things, model estimation complexity, model misidentification (MacCallum, Cai, & Edwards, 2012), and the selection of improper criteria (Marcoulides, 2018). Informative prior selection is a growing yet understudied area of inquiry (Kaplan & Depaoli, 2013). For example, BSEM model evaluation is hindered by the unreliability of convergence criteria (e.g., the Potential Scale Reduction; Muthén & Asparouhov, 2012). Hoijtink and van de Schoot (2018) suggested the use of the PPPP (Asparouhov & Muthén, 2017) using an example from a regression analysis model, and Muthén has recently extended the routines to the CFA analytical framework using tests of cross-loadings. Although the PPPP was used in the present example, more research is needed to evaluate its behavior in light of good or bad overall model fit (by use of the PPP).
Moreover, to avoid misleading results when applying A-MI, many Bayesian decisions must be made in advance (Gelman & Rubin, 1992). The precision of the prior determines the wiggle room and affects the ability to detect noninvariance (Muthén & Asparouhov, 2013; van de Schoot et al., 2013). Therefore, choosing the prior variance and identifying the source of knowledge about the differences between parameters are among the most important decisions in the Bayesian approach. Although researchers may use Bayesian software default options such as those in Mplus (Muthén & Muthén, 1998-2017) and WinBUGS and OpenBUGS (Lunn, Thomas, Best, & Spiegelhalter, 2000), other decisions must be made, such as the number of iterations, the spacing between retained iterations of the final analysis (thinning), the number of burn-in iterations discarded, and the number of chains and processors used in the MCMC (Markov chain Monte Carlo) simulation (Gelman & Rubin, 1992; Raftery & Lewis, 1996).
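To make the role of burn-in, thinning, and multiple chains concrete, the sketch below computes a potential scale reduction (PSR) factor in the spirit of Gelman and Rubin (1992) for a single parameter. It is an illustration under stated assumptions only: the chains are simulated independent draws (hypothetical values, not MCMC output from the study), the helper names are ours, and the exact PSR formula implemented in a given software package may differ in detail.

```python
import random
import statistics


def post_process(draws, burn_in, thin):
    """Discard the first `burn_in` draws, then keep every `thin`-th draw."""
    return draws[burn_in::thin]


def potential_scale_reduction(chains):
    """Gelman-Rubin PSR for one parameter from equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    # W: average within-chain variance; B: between-chain variance scaled by n.
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(3)

# Two hypothetical chains that sample the same posterior (PSR close to 1).
agreeing = [[rng.gauss(0.8, 0.05) for _ in range(5_000)] for _ in range(2)]
psr_ok = potential_scale_reduction(
    [post_process(chain, burn_in=1_000, thin=2) for chain in agreeing]
)

# Two hypothetical chains stuck in different regions (PSR well above 1).
disagreeing = [[rng.gauss(mu, 0.05) for _ in range(5_000)] for mu in (0.8, 1.0)]
psr_bad = potential_scale_reduction(
    [post_process(chain, burn_in=1_000, thin=2) for chain in disagreeing]
)

print(f"PSR (agreeing chains) = {psr_ok:.3f}; PSR (disagreeing chains) = {psr_bad:.3f}")
```

Values close to 1 indicate that the chains agree. A value near 1 is necessary but not sufficient evidence of convergence, which is consistent with the concern raised above about the reliability of convergence criteria.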
With regard to our evaluation of DIF, we employed one among several possible analytical means (Clauser & Mazor, 1998; Hambleton, Robin, & Xing, 2000; Hambleton, Swaminathan, & Rogers, 1991) by use of SEM (Dimitrov, 2017) and did not go beyond assessing DIF and toward understanding the sources of DIF, despite the plethora of available methodologies (e.g., Penfield & Lam, 2000; Raju, 1988, 1990; Rudner et al., 1980a, 1980b). Certainly, this would be an important direction for future research.
Recommendations for Future Research
There are several avenues for future research stemming from the present study’s findings. With regard to DIF, it will be important to evaluate other types of DIF such as Differential Slope Functioning, the potential differences in pseudo-guessing and pseudo-carelessness parameters, as well as the interaction between probability of success and item difficulty levels (nonuniform DIF; Dimitrov, 2017) as in mixtures of populations (Raykov & Marcoulides, 2015). The case of intercept noninvariance can also be tested using the Multiple Indicators Multiple Causes model for which, to our knowledge, there is currently no published study. How power and large samples affect the global evaluative criteria of BSEM models is not yet well understood; similarly, we know very little about how BSEM behaves with respect to the size of the models or the use of a clustering variable (Raykov, Marcoulides, & Akaeze, 2017). Last, whether small-variance priors achieve proper invariance levels or instead lead to a misspecified model remains an open question.
As a thoughtful reviewer suggested, the magnitude of the priors (the “wiggle room”) is of great importance for the model under study. Several leaders in BSEM have pointed to variance estimates of .1 as reflecting negligible effects (e.g., van de Schoot et al., 2013), but it would be important to tailor that estimate for the specific problem under study. A potentially interesting direction would be to employ prespecified conventions of effect size measures such as Cohen’s (1992) estimates on what constitutes small, medium, and large effects, when little is known about the topic. A second, more interesting direction would involve the thoughtful process of translating effect sizes onto practically meaningful effects based on the dependent variable under study. For example, a small improvement in reading for individuals with specific learning disabilities may reflect a variance estimate that is small by all statistical means but is nevertheless nonnegligible based on what is known about this population and its potential to improve its reading ability. Tailoring prior variance estimates to what is negligible and what is substantial will aid our evaluation of hypotheses.
It must be noted that A-MI is not a panacea (Davidov et al., 2015). There are instances in which differences between groups are so substantial that A-MI cannot hold (see Lommen, van de Schoot, & Engelhard, 2014; Meuleman, 2012); if A-MI is used in such cases, it may result in erroneous model selection through the adoption of a model that misfits the data (Alamri, 2019). Furthermore, recent advances in methodology have put forth additional means of dealing with measurement invariance, two of which are the estimation of noninvariant parameters as random effects (Jak, Oort, & Dolan, 2013, 2014a, 2014b) and the use of the alignment method (Muthén & Asparouhov, 2014), or a combination of the two methodologies. Less explored methods involve the use of the p-technique factor analysis, which originated in Cattell (1952) and is described in Little et al. (2006) and Nesselroade, McArdle, Aggen, and Meyers (2002), or the use of Simultaneous Component Analysis (Ceulemans, Wildejans, Kiers, & Timmerman, 2016).
Appendix A
Annotated Mplus syntax file for estimating approximate measurement invariance of item thresholds using a prior variance of 0.1. The model engages the mixture modeling module in Mplus 8.3.
DATA: FILE IS physrand2.DAT; ! Name of data file
VARIABLE: NAMES ARE group q1-q6; ! Variable names
usevariables are q1-q6; ! Variables used in the model
classes=c(2); ! two latent classes engaged to represent gender
knownclass=c(group = 1-2); ! latent classes represent known groups (gender)
categorical are q1-q6; ! items are categorical (binary)
ANALYSIS: estimator=Bayes; !Bayesian estimator
type=mixture; Proc=2; ! Define mixture modeling, engaging two processors
chains=2; model=allfree; ! Engaging two chains, required freeing of pars
algorithm=integration; ! Algorithm=integration for dichotomous indicators
fbiterations=30000; ! large number of iterations
bseed = 3; ! Seed for MCMC random number generation
MODEL: ! Model Statement
%overall% ! Mixture syntax, estimates for males below that
f BY q1-q6* (l1-l6); ! Physics factor f defined by items q1-q6 with labels l1-l6
! that posit equal factor loadings across gender
[q1$1-q6$1](t#_1-t#_6); ! Thresholds of items q1-q6 freely estimated
f@1; [f@0]; ! Factor means and variances fixed to zero and 1
%c#2% ! Latent Class 2 represents females
f BY q1-q6* (l1-l6); ! Factor loadings fixed as metric model fit the data well
[f]; ! Factor mean of females freely estimated
Model Priors: ! Command for defining priors in BSEM
DO(1,6) DIFF(t1_#-t2_#)~N(0,0.1); !Looped statement that for items 1-6 the thresholds
! numbered # (1-6) for groups 1-2 would be allowed to
! vary by a prior variance of 0.1.
OUTPUT: tech1 tech3 tech8 ! tech1 shows labels of parameters, tech3 variances
! and covariances, tech8 PSR estimates to test
! convergence
cinterval; ! cinterval =symmetric confidence intervals
Appendix B
Annotated Mplus syntax file and excerpt from the results section for testing the hypothesis that latent mean differences are no different from a 0.50 (medium) effect size.
| DATA: FILE IS t3.DAT; | ! Name of data file |
| VARIABLE: NAMES ARE f gender; | ! Variable names, f refers to saved factor scores |
| usevariables are f gender; | ! Variables used in the model, fscores and gender |
| classes=c(1); | ! one latent class just to engage the mixture framework |
| ANALYSIS: estimator=Bayes; | ! Bayesian estimator |
| type=mixture; Proc=2; | ! Define mixture modeling, engaging two processors |
| chains=2; | ! Engaging two chains |
| biterations=500000(100000); | ! large number of iterations |
| MODEL: | ! Model Statement |
| %overall% | ! Mixture syntax command |
| f ON gender (a); | ! Factor scores regressed on a dummy gender variable |
| Output: tech1 tech3 tech8 cinterval; | ! Seeking various evaluative criteria tables |
| Model priors: | ! Defining prior variances statement |
| a~N(0,0.141); | ! Testing the hypothesis that the difference in means |
| | ! between males and females is no different from a .5 |
| | ! effect size. Prior variance = (0.5 × 0.751)² = 0.141 |
| Part of Mplus Output Results: | |
| MODEL FIT INFORMATION | |
| Number of Free Parameters | 3 |
| Bayesian Posterior Predictive Checking using Chi-Square | |
| 95% Confidence Interval for the Difference Between | |
| the Observed and the Replicated Chi-Square Values | |
| -7.260 7.313 | |
| Posterior Predictive P-Value | 0.491 |
| Prior Posterior Predictive P-Value | 0.011 |
Note. The significant PPPP (.011) leads to rejection of the hypothesis that the difference in means between males and females is no different from a medium (.5) effect size.
Credible intervals when the Bayesian framework is utilized.
The alternative of partial MI has been advocated by Steenkamp and Baumgartner (1998), who suggested that latent means are unbiased when at least two items’ slopes and intercepts are equivalent between groups. A key difference between partial and approximate invariance is that in the former, the between-group differences of only some item parameters are constrained to zero (minimum of two items; Byrne, Shavelson, & Muthén, 1989) while the rest of the parameters could vary to a great extent because partial equivalence is under the exact-zero framework (van de Schoot et al., 2013). However, partial invariance is controversial for many reasons. For example, the source of noninvariance should be located at an item level, and a reference variable should be correctly identified as invariant. Moreover, partial invariance does not accommodate all types of scale structures (e.g., a scale with a single latent factor and three items) because it requires at least two invariant items (van de Schoot et al., 2013; Zercher, Schmidt, Cieciuch, & Davidov, 2015). He and Kubacka (2015) stated that partial MI is not suitable when a scale has fewer than five items and a large number of groups is compared (i.e., a TALIS scale with more than 24 countries).
Meaningful in the sense of the hypothesis of interest. In the present example we utilized PPPP to test hypotheses that the mean differences between males and females are different (from negligible, small, medium, or large).
Asparouhov and Muthén (2017) suggested that sample sizes between 500 and 5,000 observations provide ample power levels for the proper rejection of models by use of the PPP.
Although this cannot be justified as a means of selecting an optimally fitting model.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: Georgios D. Sideridis
https://orcid.org/0000-0002-4393-5995
References
- Alamri A. A. S. (2019). Exploring the behavior of model fit criteria in the Bayesian approximate measurement invariance: A simulation study (Doctoral dissertation). University of South Florida. Available from ProQuest Dissertations & Theses Global.
- Asparouhov T., Muthén B. O. (2014). Multi-group factor analysis alignment. Structural Equation Modeling, 21, 1-14.
- Asparouhov T., Muthén B. (2010). Bayesian analysis using Mplus: Technical implementation. Mplus Technical Report. Retrieved from http://www.statmodel.com
- Asparouhov T., Muthén B. (2015). General random effect latent variable modeling: Random subjects, items, contexts, and parameters. In Harring J. R., Stapleton L. M., Beretvas S. N. (Eds.), Advances in multilevel modeling for educational research: Addressing practical issues found in real-world applications (pp. 163-192). Charlotte, NC: Information Age.
- Asparouhov T., Muthén B. (2017). Prior-posterior predictive p-values. Mplus Web Notes: No. 22, Version 2. Retrieved from https://www.statmodel.com/download/PPPP.pdf
- Asparouhov T., Muthén B., Morin A. J. S. (2015). Bayesian structural equation modeling with crossloadings and residual covariance: Comments on Stromeyer et al. Journal of Management, 41, 1561-1577.
- Browne M. W. (1984). Asymptotic distribution free methods in the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology, 37, 127-141.
- Byrne B. M., Shavelson R. J., Muthén B. O. (1989). Testing for equivalence of factor covariance and mean structures: The issue of partial measurement invariance. Psychological Bulletin, 105, 456-466.
- Cattell R. B. (1952). The three basic factor-analytic research designs: Their interrelations and derivatives. Psychological Bulletin, 49, 499-520.
- Ceulemans E., Wildejans T. F., Kiers H. A. L., Timmerman M. E. (2016). Multilevel simultaneous component analysis: A computational shortcut and software package. Behavior Research Methods, 48, 1008-1020.
- Cheung G. W., Rensvold R. B. (2002). Evaluating goodness-of-fit indexes for testing measurement invariance. Structural Equation Modeling, 9, 233-255.
- Cheung M. W.-L. (2009). Constructing approximate confidence intervals for parameters with structural equation models. Structural Equation Modeling, 16, 267-294.
- Clauser B. E., Mazor K. M. (1998). Using statistical procedures to identify differential item functioning test items. Educational Measurement: Issues and Practice, 17, 31-44.
- Cohen J. (1988). Statistical power analysis for the behavioral sciences. London, England: Erlbaum.
- Cohen J. (1992). A power primer. Psychological Bulletin, 112, 155-159.
- Cohen J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
- Davidov E., Cieciuch J., Meuleman B., Schmidt P., Algesheimer R., Hausherr M. (2015). The comparability of measurements of attitudes toward immigration in the European Social Survey: Exact versus approximate measurement equivalence. Public Opinion Quarterly, 79(Suppl. 1), 244-266.
- De Boeck P. (2008). Random item IRT models. Psychometrika, 73, 533-559.
- De Bondt N., Van Petegem P. (2015). Psychometric evaluation of the overexcitability questionnaire-two applying Bayesian structural equation modeling (BSEM) and multiple-group BSEM-based alignment with approximate measurement invariance. Frontiers in Psychology, 6, 1963.
- De Jong M. G., Steenkamp J.-B. E. M., Fox J.-P. (2007). Relaxing measurement invariance in cross-national consumer research using a hierarchical IRT model. Journal of Consumer Research, 34, 260-278.
- Dimitrov D. M. (2003). Marginal true-score measures and reliability of binary items as a function of their IRT parameters. Applied Psychological Measurement, 27, 440-458.
- Dimitrov D. M. (2017). Examining differential item functioning: IRT-based detection in the framework of confirmatory factor analysis. Measurement and Evaluation in Counseling and Development, 50, 183-200.
- Dorans N. J., Holland P. W. (1992, October). DIF detection and description: Mantel-Haenszel and standardization. Paper presented at the Educational Testing Service/AFHRL Conference, Princeton, NJ.
- Dorans N. J., Holland P. W. (1993). DIF detection and description: Mantel-Haenszel and standardization. In Holland P. W., Wainer H. (Eds.), Differential item functioning (pp. 35-66). Hillsdale, NJ: Erlbaum.
- Fox J. P. (2010). Bayesian item response theory. New York, NY: Springer.
- Fox J. P., Glas C. A. W. (2001). Bayesian estimation of a multilevel IRT model using Gibbs. Psychometrika, 66, 69-86.
- Fox J.-P., Verhagen A. (2017). Random item effects modeling for cross-national survey data. In Davidov E., Schmidt P., Billiet J. (Eds.), Cross-cultural analysis: Methods and applications (2nd ed., pp. 461-482). New York, NY: Routledge.
- Frederickx S., Tuerlinckx F., De Boeck P., Magis D. (2010). RIM: A random item mixture model to detect differential item functioning. Journal of Educational Measurement, 47, 432-457.
- Gelman A., Carlin J. B., Stern H. S., Rubin D. B. (2004). Bayesian data analysis (2nd ed.). Boca Raton, FL: Chapman & Hall.
- Gelman A., Rubin D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7, 457-511.
- Hambleton R. K., Robin R., Xing D. (2000). Item response models for the analysis of educational and psychological data. In Tinsley H. E. A., Brown S. (Eds.), Handbook of applied multivariate statistics and mathematical modeling (pp. 553-581). New York, NY: Academic Press.
- Hambleton R. K., Swaminathan H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer.
- Hambleton R. K., Swaminathan H., Rogers H. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
- Hancock G. R. (2001). Effect size, power, and sample size determination for structured means modeling and MIMIC approaches to between-groups hypothesis testing of means on a single latent construct. Psychometrika, 66, 373-388.
- Hastings C., Jr. (1955). Approximations for digital computers. Princeton, NJ: Princeton University Press.
- He J., Kubacka K. (2015). Data comparability in the teaching and learning international survey (TALIS) 2008 and 2013. OECD Education Working Papers, No. 124. Paris: OECD Publishing.
- Hoijtink H., van de Schoot R. (2018). Testing small variance priors using prior-posterior predictive p values. Psychological Methods, 23, 561-569.
- Holland P. W., Thayer D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In Wainer H., Braun H. I. (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Erlbaum.
- Horn J. L., McArdle J. J. (1992). A practical and theoretical guide to measurement invariance in aging research. Experimental Aging Research, 18, 117-144.
- Ioannidis J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.
- Jak S., Oort F. J., Dolan C. V. (2013). A test for cluster bias: Detecting violations of measurement invariance across clusters in multilevel data. Structural Equation Modeling, 20, 265-282.
- Jak S., Oort F. J., Dolan C. V. (2014a). Measurement bias in multilevel data. Structural Equation Modeling, 21, 31-39.
- Jak S., Oort F. J., Dolan C. V. (2014b). Using two-level factor analysis to test for cluster bias in ordinal data. Multivariate Behavioral Research, 49, 544-553.
- Kaplan D., Depaoli S. (2013). Bayesian statistical methods. In Little T. D. (Ed.), Oxford handbook of quantitative methods (pp. 407-437). Oxford, England: Oxford University Press.
- Kass R. A., Raftery A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
- Kelley K., Pornprasertmanit S. (2016). Confidence intervals for population reliability coefficients: Evaluation of methods, recommendations, and software for composite measures. Psychological Methods, 21, 69-92.
- Killeen P. R. (2005). An alternative to null-hypothesis significance tests. Psychological Science, 16, 345-353.
- Kim E. S., Cao C., Wang Y., Nguyen D. T. (2017). Measurement invariance testing with many groups: A comparison of five approaches. Structural Equation Modeling, 24, 524-544.
- Kim E. S., Yoon M. (2011). Testing measurement invariance: A comparison of multiple-group categorical CFA and IRT. Structural Equation Modeling, 18, 212-228.
- Lee S.-Y. (2007). Structural equation modeling: A Bayesian approach. Chichester, England: Wiley.
- Lek K. M., Van De Schoot R. (2018). A comparison of the single, conditional and person-specific standard error of measurement: What do they measure and when to use them? Frontiers in Applied Mathematics and Statistics, 4, 40.
- Lin P. Y., Lin Y. C. (2014). Examining student factors in sources of setting accommodation DIF. Educational and Psychological Measurement, 74, 759-794.
- Linacre J. M., Wright B. D. (1989). Mantel-Haenszel DIF and PROX are equivalent! Rasch Measurement Transactions, 3, 52-53.
- Little T. D., Bovaird J. A., Slegers D. W. (2006). Methods for the analysis of change. In Mroczek D. K., Little T. D. (Eds.), Handbook of personality development (pp. 181-211). Mahwah, NJ: Erlbaum.
- Lommen M. J. J., van de Schoot R., Engelhard I. M. (2014). The experience of traumatic events disrupts the measurement invariance of a posttraumatic stress scale. Frontiers in Psychology, 5, 1304.
- Lunn D. J., Thomas A., Best N., Spiegelhalter D. (2000). WinBUGS: A Bayesian modelling framework: Concepts, structure, and extensibility. Statistics and Computing, 10, 325-337.
- Lynch S. M. (2007). Introduction to applied Bayesian statistics and estimation for social scientists. New York, NY: Springer.
- Lynch S. M., Western B. (2004). Bayesian posterior predictive checks for complex models. Sociological Methods & Research, 32, 301-335.
- MacCallum R. C. (1986). Specification searches in covariance structure modeling. Psychological Bulletin, 100, 107-120.
- MacCallum R. C., Cai L., Edwards M. C. (2012). Hopes and cautions in implementing Bayesian structural equation modeling. Psychological Methods, 17, 340-345.
- MacCallum R. C., Roznowski M., Necowitz L. B. (1992). Model modification in covariance structure analysis: The problem of capitalization on chance. Psychological Bulletin, 111, 490-504.
- Marcoulides K. M. (2018). Careful with those priors: A note on Bayesian estimation in two-parameter logistic item response theory models. Measurement: Interdisciplinary Research and Perspectives, 16, 92-99.
- Marcoulides K. M., Yuan K.-H. (2017). New ways to evaluate goodness of fit: A note on using equivalence testing to assess structural equation models. Structural Equation Modeling, 24, 148-153.
- Marsh H. W., Guo J., Nagengast B., Asparouhov T., Muthén B., Parker P. D., Dicke T. (2017). What to do when scalar invariance fails: The extended alignment method for multi-group factor analysis comparison of latent means across many groups. Psychological Methods, 23, 524-545.
- Maydeu-Olivares A. (2017). Maximum likelihood estimation of structural equation models for continuous data: Standard errors and goodness of fit. Structural Equation Modeling, 24, 383-394.
- Meredith W. (1993). Measurement invariance, factor analysis and factorial invariance. Psychometrika, 58, 525-543. [Google Scholar]
- Meredith W., Horn J. (2001). The role of factorial invariance in modeling growth and change. In Collins L. M., Sayer A. G. (Eds.), Decade of behavior.New methods for the analysis of change (pp. 203-240). Washington, DC: American Psychological Association. [Google Scholar]
- Meuleman B. (2012). When are item intercept differences substantively relevant in measurement invariance testing? In Salzborn S., Davidov E., Reinecke J. (Eds.), Methods, theories and empirical applications in the social sciences: Festschrift for Peter Schmidt (pp. 97-104). Heidelberg, Germany: Springer VS. [Google Scholar]
- Milfont T.L., Fischer R. (2010) Testing measurement invariance across groups: Applications in cross-cultural research. International Journal of Psychological Research, 3, 111-121. [Google Scholar]
- Miller J. (2009). What is the probability of replicating a statistically significant effect? Psychonomic Bulletin & Review, 16, 617-640. [DOI] [PubMed] [Google Scholar]
- Millsap R. E. (2011). Statistical approaches to measurement invariance. New York, NY: Routledge/Taylor & Francis Group. [Google Scholar]
- Morey R. D., Rouder J. N. (2011). Bayes factor approaches for testing interval null hypotheses. Psychological Methods, 16, 416-419. [DOI] [PubMed] [Google Scholar]
- Muthén B., Asparouhov T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17, 313-335. [DOI] [PubMed] [Google Scholar]
- Muthén B., Asparouhov T. (2013). BSEM measurement invariance analysis. Mplus Web Notes, No. 17. Retrieved from https://www.statmodel.com/examples/webnotes/webnote17.pdf
- Muthén B., Asparouhov T. (2014). IRT studies of many groups: The alignment method. Frontiers in Psychology, 5, 978. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Muthén B., Asparouhov T. (2018). Recent methods for the study of measurement invariance with many groups: Alignment and random effects. Sociological Methods & Research, 47, 637-664. [Google Scholar]
- Muthén L., Muthén B. O. (1998-2017). Mplus user’s guide (8th ed.). Los Angeles, CA: Muthén & Muthén. [Google Scholar]
- Nesselroade J. R., McArdle J. J., Aggen S. H., Meyers J. M. (2002). Dynamic factor analysis models for representing process in multivariate time-series. In Moskowitz D. S., Hershberger S. L. (Eds.), Modeling intraindividual variability with repeated measures data: Methods and applications (pp. 235-265). Mahwah, NJ: Erlbaum. [Google Scholar]
- Olsson U. H., Foss T., Troye S. V., Howell R. D. (2000). The performance of ML, GLS, and WLS estimation in structural equation modeling under conditions of misspecification and nonnormality. Structural Equation Modeling, 7, 557-595. [Google Scholar]
- Penfield R. D., Lam T. C. M. (2000). Assessing differential item functioning in performance assessment: Review and recommendations. Educational Measurement: Issues and Practice, 19, 5-15. [Google Scholar]
- Raju N. (1988). The area between two item characteristic curves. New York, NY: Springer. [Google Scholar]
- Raju N. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14, 197-207. [Google Scholar]
- Raju N. S., van der Linden W. J., Fleer P. F. (1995). IRT-based internal measures of differential functioning of items and tests. Applied Psychological Measurement, 19, 353-368. [Google Scholar]
- Raftery A. E., Lewis S. M. (1996). Implementing MCMC. In Gilks W. R., Spiegelhalter D. J., Richardson S. (Ed.), Markov chain Monte Carlo in practice (pp. 115-130). London. England: Chapman & Hall. [Google Scholar]
- Raykov T., Marcoulides G. A. (2004). Using the Delta method for approximate interval estimation and parameter functions in SEM. Structural Equation Modeling, 11, 621-637. [Google Scholar]
- Raykov T., Marcoulides G. A. (2011). Introduction to psychometric theory. Abingdon, Oxon: Routledge. [Google Scholar]
- Raykov T., Marcoulides G. (2015). Scale reliability evaluation with heterogeneous populations. Educational and Psychological Measurement, 75, 875-892. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Raykov T., Marcoulides G. A., Akaeze H. O. (2017). Comparing between- and within-group variances in a two-level study: A latent variable modeling approach to evaluating their relationship. Educational and Psychological Measurement, 77, 351-361. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Raykov T., Marcoulides G. A., Lee C-L., Chang D. C. (2013). Studying differential item functioning via latent variable modeling: A note on a multiple testing procedure. Educational and Psychological Measurement, 73, 898-908. [Google Scholar]
- Raykov T., Marcoulides G., Li C. H. (2012). Measurement invariance for latent constructs in multiple populations: A critical review and refocus. Educational and Psychological Measurement, 72, 954-974. [Google Scholar]
- Raykov T., Marcoulides G. A., Millsap R. E. (2013). Factorial invariance in multiple populations: A multiple testing procedure. Educational and Psychological Measurement, 73, 713-727. [Google Scholar]
- Reise S. P., Widaman K. F., Pugh R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566.
- Rudner L. M., Getson P. R., Knight D. L. (1980a). Biased item detection techniques. Journal of Educational Statistics, 5, 213-233.
- Rudner L. M., Getson P. R., Knight D. L. (1980b). A Monte Carlo comparison of seven biased item detection techniques. Journal of Educational Measurement, 17, 1-10.
- Rutkowski L., Svetina D. (2014). Assessing the hypothesis of measurement invariance in the context of large-scale international surveys. Educational and Psychological Measurement, 74, 31-57.
- Scheines R., Hoijtink H., Boomsma A. (1999). Bayesian estimation and testing of structural equation modeling. Psychometrika, 64, 37-52.
- Schwarz G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
- Seddig D., Leitgöb H. (2018). Approximate measurement invariance and longitudinal confirmatory factor analysis: Concept and application with panel data. Survey Research Methods, 12, 29-41.
- Spiegelhalter D. J., Best N. G., Carlin B. P., van der Linde A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society B, 64, 583-639.
- Steenkamp J.-B. E. M., Baumgartner H. (1998). Assessing measurement invariance in cross-national consumer research. Journal of Consumer Research, 25, 78-90.
- Steinmetz H. (2013). Analyzing observed composite differences across groups. Is partial measurement invariance enough? Methodology, 9, 1-12.
- Thissen D., Steinberg L., Wainer H. (1988). Use of item response theory in the study of group differences in trace lines. In Wainer H., Braun H. I. (Eds.), Test validity (pp. 147-169). Hillsdale, NJ: Erlbaum.
- Thissen D., Steinberg L., Wainer H. (1993). Detection of differential item functioning using the parameters of item response models. In Holland P. W., Wainer H. (Eds.), Differential item functioning (pp. 67-113). Hillsdale, NJ: Erlbaum.
- Vandenberg R. J., Lance C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3, 4-70.
- van de Schoot R., Kluytmans A., Tummers L., Lugtig P., Hox J., Muthén B. (2013). Facing off with Scylla and Charybdis: A comparison of scalar, partial, and the novel possibility of approximate measurement invariance. Frontiers in Psychology, 4, 770.
- van de Schoot R., Lugtig P., Hox J. (2012). A checklist for testing measurement invariance. European Journal of Developmental Psychology, 9, 486-492.
- Van der Linden W. J., Glas C. A. W. (Eds.). (2000). Computerized adaptive testing: Theory and practice. Boston, MA: Kluwer.
- Widaman K. F., Reise S. P. (1997). Exploring the measurement invariance of psychological instruments: Applications in the substance use domain. In Bryant K. J., Windle M., West S. G. (Eds.), The science of prevention: Methodological advances from alcohol and substance abuse research (pp. 281-324). Washington, DC: APA.
- Zercher F., Schmidt P., Cieciuch J., Davidov E. (2015). The comparability of the universalism value over time and across countries in the European Social Survey: Exact vs. approximate measurement invariance. Frontiers in Psychology, 6, 733.
- Zwick R. (2002). The assessment of differential item functioning in computer adaptive tests. In Linden W., Glas G. W. (Eds.), Computerized adaptive testing: Theory and practice (pp. 221-244). Amsterdam, Netherlands: Springer.
- Zwick R. (2012). A review of ETS differential item functioning assessment procedures: Flagging rules, minimum sample size requirements, and criterion refinement. ETS Technical Report RR-12-08. Princeton, NJ: ETS.
- Zyphur M. J., Oswald F. L. (2015). Bayesian estimation and inference: A user’s guide. Journal of Management, 41, 390-420.



