Abstract
We present a fast, score-based test for detecting model misspecification in item response theory (IRT) models that remains valid when person parameters are treated as fixed effects, as may be used for very large data sets. The new approximation (i) eliminates the need to pre-specify ability groups or priors for person abilities, (ii) does not require explicit functional form assumptions, (iii) works with two estimators designed for very high item and person counts (constrained joint maximum likelihood, CJML, and joint maximum a posteriori, JMAP), and (iv) requires only a single model fit, making DIF screening faster and simpler than alternatives based on model comparisons. A spline-based residualization step further suppresses spurious Type I error when the ordering covariate is correlated with ability. Simulations with the two-parameter logistic model show nominal error rates and high power once examinees contribute around 15–20 responses; only extremely short tests (around 10 items) still pose challenges under strong impact. An application to 1,602 reading items and 57,684 students from the Mindsteps platform demonstrates scalability and practical value, flagging 13% of items for gender-related DIF and correlating highly with a conventional approach that explicitly models DIF. Together, these results position the proposed test as a robust, computation-light diagnostic for large-scale assessments when classical random-effects approaches are infeasible, or when the ability group structure or the shape of DIF effects is unknown or complex.
Keywords: measurement invariance, item factor analysis, large-scale assessments, model misspecification
Introduction
The use of an appropriate statistical model is often crucial when the interpretation of parameters is of interest. As such, approaches that can point out flaws in a fitted model in a flexible manner, without requiring a long list of alternative models as comparisons, are very valuable. In this paper we generalize an earlier approach to this problem in item response theory (IRT) models, allowing the detection of model flaws (a) without prior knowledge of ability differences among test takers, (b) without knowledge of the functional form that defines the model flaw, (c) in large-scale models that can require different estimation approaches, and (d) without the need to fit additional models for comparison. Many applications of statistical models require that model parameters are invariant across subgroups of interest. In the context of psychological measurement and IRT, this assumption is strongly related to measurement invariance (Millsap, 2011) and the absence of differential item functioning (DIF; Holland & Wainer, 1993). Differential item functioning means that persons with the same latent ability differ with respect to their probability of giving a correct response to an item. In non-technical terms, it means that the psychometric characteristics of items, such as their difficulty or discrimination, differ between groups of participants. DIF effects that correspond to a change in the relative difficulty of items are usually labeled uniform DIF, and all other changes non-uniform DIF.
Early tests checked for this invariance within pre-defined subgroups (Magis et al., 2010), but newer methods have shifted away from this necessity. An early method for detecting DIF with respect to continuous covariates was the multiple-indicator multiple-cause (MIMIC) model (Muthén, 1989). This approach was based on a structural equation model and initially allowed the detection of DIF in the item difficulties with respect to a linear covariate. More general models that can be seen as extensions of the MIMIC model were proposed, for instance, by Bauer (2017).
Notably, the structural change tests by Andrews (1993), a focus of this paper, analyze the ordering of data points (e.g., students) relative to a specified covariate and assess the stability of maximum likelihood estimates accordingly. Though some instability is anticipated due to sampling error, excessive instability often suggests variation in one or more item parameters with respect to the covariate.
Score-based tests of parameter invariance in psychometric models are based on a similar idea. They are a generalization of the Score or Lagrange Multiplier test used to detect differential item functioning for pre-defined groups (Glas, 1998, 1999; Glas & Suárez Falcón, 2003), and they detect instabilities, or fluctuations, in estimates of model parameters. Such estimates include maximum likelihood as well as other types of estimators; for this reason, these tests are also sometimes called M-fluctuation tests (Zeileis & Hornik, 2007). In contrast to MIMIC models, the score-based tests do not assume a specific (e.g., linear) relationship between the change of model parameter estimates and the underlying covariate. In psychometrics, score-based tests have been mainly proposed for latent variable models, including dichotomous and polytomous Rasch models (Komboz et al., 2018; Strobl et al., 2015), Bradley-Terry models (Strobl et al., 2011), factor analysis (Merkle et al., 2014; Merkle & Zeileis, 2013; Sterner & Goretzko, 2023), models of item response theory (Debelak et al., 2022; Debelak & Strobl, 2019; Schneider et al., 2022), and structural equation models (Arnold et al., 2020). As the name implies, score-based tests rely on the score, which is the gradient of the log-likelihood with regard to the model parameters.
An important characteristic of these score-based tests of parameter invariance is that they usually require the definition of ability groups, that is, of groups of respondents within which an equal level of ability can be assumed. If such a definition is not provided or if the groups are misspecified, score-based tests can show an increased Type I error rate (Debelak & Strobl, 2019; Schneider et al., 2022). Conceptually, this is caused by a common prior distribution for the person parameters in such models, which leads to a bias in the parameter estimates if ability differences are not accounted for. However, it is often impossible to define such ability groups in empirical assessments, which can make the application of score-based tests problematic.
We take two steps to address this problem. The first step is that we focus on parameter estimation techniques that estimate the person parameters along with the item parameters and do not require prior knowledge of the ability distribution or groups. Such methods also offer potential computational benefits and can be more tractable for large datasets with many items and/or models with high-dimensional person parameters. In this paper, we develop methods that approximate score-based tests for two such estimation methods for item response theory models. One is a joint maximum a posteriori (JMAP) approach (Driver & Tomasik, 2023a), while the other uses constrained joint maximum likelihood (Chen et al., 2019, 2020). As will be argued, these estimation methods violate a central assumption of traditional score-based invariance tests, and as such require a more general method. We derive a new method that is based on approximations for score-based tests in this scenario and evaluate it using simulation studies. The second step, which addresses inflated Type I error when ability differences are correlated with the covariate used for DIF testing, is to residualize estimated ability out of the covariates used for ordering the scores. This substantially reduces the capacity of the score test to reflect (unwanted) ability group differences, reducing the associated Type I error.
In summary, this paper makes the following contributions:
We suggest an approximation that is related to the framework of score-based invariance tests. Technically, our method is intended for applications where the individual score contributions are not independent and identically distributed (i.i.d.), whereas previous work (Debelak et al., 2022) assumed this for known groups of respondents. We further present methods that do not require the explicit definition of respondent groups between which ability differences are modeled.
We suggest residualizing ability out of any covariates used for detecting parameter variation to avoid spurious score-test significance due to ability differences.
We evaluate this new approximate approach with and without covariate residualization for two estimation methods used for large-scale assessments, namely, joint maximum a posteriori and constrained joint maximum likelihood estimation, as well as more typical marginal maximum likelihood approaches.
The rest of this paper is structured as follows: In the next section, we summarize theoretical foundations of score-based tests. We then discuss score-based tests for the scenario where the score contributions can be considered independent but not identically distributed. We then evaluate performance using simulation studies, apply the approach to an empirical data set, and conclude with a general discussion.
A Conceptual Framework for Score-Based Parameter Invariance Tests
In the following, we elaborate on the underlying idea of score-based invariance tests in an IRT context for a general Bayesian or maximum likelihood estimation framework. Similar presentations can be found in earlier work (e.g., Debelak & Strobl, 2019; Schneider et al., 2022; Strobl et al., 2015; Wang et al., 2014). We consider a parametric item factor model, whose item parameters are given by a vector π = (π1, …, π K ), where K stands for the number of items. The estimated model parameters further include the person parameters, but since we are not interested in checking them for measurement invariance, we did not include them in this vector. Frequentist and Bayesian methods for estimating these unknown parameters usually make use of the log-likelihood ℓ of the data U. For item factor models, U is usually a response matrix that consists of rows u1, …, u N , with N being the number of respondents. We use ℓ( π ; u1, …, u N ) to denote the overall log-likelihood of the response matrix given the values of π. In item factor models, it is usually assumed that the responses of different respondents are independent, which allows the log-likelihood to be written as a sum of individual, that is, case-wise, log-likelihoods:
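$$\ell(\pi; u_1, \ldots, u_N) = \sum_{j=1}^{N} \ell(\pi; u_j)$$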
The score is now the vector of the first partial derivatives of the log-likelihood with regard to all model parameters, given by:
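$$s(\pi; u_1, \ldots, u_N) = \left( \frac{\partial \ell(\pi; u_1, \ldots, u_N)}{\partial \pi_1}, \ldots, \frac{\partial \ell(\pi; u_1, \ldots, u_N)}{\partial \pi_K} \right)^{t}$$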
Here, t is used to indicate the transpose of the vector. Similar to the log-likelihood, the score can be presented as a sum of individual score contributions, with one score contribution per respondent:
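$$s(\pi; u_1, \ldots, u_N) = \sum_{j=1}^{N} s(\pi; u_j)$$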
Here, the last term is defined via:
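$$s(\pi; u_j) = \left( \frac{\partial \ell(\pi; u_j)}{\partial \pi_1}, \ldots, \frac{\partial \ell(\pi; u_j)}{\partial \pi_K} \right)^{t}$$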
This term is a vector, with one component per estimated item parameter. In practical evaluations, these score contributions can be seen as realizations of random variables. In many estimation frameworks for item parameters, such as marginal maximum likelihood estimation (MML; cf. Baker & Kim, 2004), the person parameters are treated as random effects (De Boeck, 2008). In this context, the respondents are treated as exchangeable, from which it follows that all individual score contributions for a specific item parameter can be considered identically distributed. Moreover, since respondents are assumed to be independent, it follows that individual score contributions for any item parameter can be treated as independent and identically distributed (i.i.d.) random variables for each item parameter. With the central limit theorem we obtain the result that sums of individual score contributions will approximately follow a normal distribution in sufficiently large samples. Since we are using maximum likelihood estimation to obtain each item parameter estimate, the individual score contributions also have to sum up to 0 over the whole sample. It follows from our discussion that sums of individual score contributions are approximately normally distributed around 0, with a specific variance.
We now consider the sums of individual score contributions for multiple item parameters for a sufficiently large sample. To account for the variances and pairwise covariances of the individual score contributions in the individual item parameters, we compute a consistent estimate of their covariance matrix and use it to decorrelate the individual score contributions. In the study at hand, this covariance matrix of individual score contributions was computed using a shrinkage covariance estimator (Schäfer & Strimmer, 2005) to avoid problems of non-positive definiteness in boundary cases; in most cases, however, the usual sample covariance formula should suffice. After this transformation, the covariance matrix of the individual score contributions corresponds to the identity matrix. Conceptually, this transformation is also a form of standardization.
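As a brief illustration, the decorrelation step can be implemented along the following lines, here using the corpcor implementation of the Schäfer-Strimmer shrinkage estimator (a sketch only; the exact implementation used in this study may differ):

```r
# S: N x p matrix of individual score contributions
# (rows: respondents, columns: item parameters)
library(corpcor)
V <- as.matrix(cov.shrink(S, verbose = FALSE))   # shrunken covariance estimate
E <- eigen(V, symmetric = TRUE)
V_inv_sqrt <- E$vectors %*% diag(1 / sqrt(E$values)) %*% t(E$vectors)
S_decorrelated <- S %*% V_inv_sqrt               # covariance approximately the identity
```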
Next, we consider how to represent expected fluctuations in the score contributions under different conditions. For this, we first order the responses with respect to the person covariate which we hypothesize to be related to violations of measurement invariance, such as gender or age. This implies a new ordering of the observed responses, which we denote as u(i). As a second step, we use the following cumulative score process, defined for 0 ≤ t ≤ 1:
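$$B(t; \hat\pi) = \hat{V}^{-1/2} \, \frac{1}{\sqrt{N}} \sum_{i=1}^{\lfloor N t \rfloor} s(\hat\pi; u_{(i)})$$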
Here, t is a value between 0 and 1, ⌊…⌋ is the floor function, and V̂ denotes a consistent estimate of the covariance matrix of individual score contributions (the estimate used for the decorrelation described above). Conceptually, this process summarizes to what extent the parameter estimates change depending on a respondent’s covariate. Under the null hypothesis that DIF is absent, we expect no systematic change in B(t; π̂) depending on t.
This process can be illustrated by a thought experiment, where we keep adding test takers with different covariate values (e.g., persons of different age or gender) to our sample to estimate the item parameters. If DIF is absent, the estimator should not depend on the person covariates of the test takers present in the sample but should fluctuate randomly around the true values of the item parameters. However, we would expect such a change if DIF effects were present. In our thought experiment, if the true item parameters differ for, say, female and male respondents, we would expect a shift of the item parameter estimates depending on the relative frequency of female and male respondents present in the sample used for the item parameter estimation. For more details and visuals on this topic, see Merkle and Zeileis (2013).
Since the individual score contributions are assumed to be i.i.d. random variables, the distribution of B(t; π̂) under the null hypothesis can be derived analytically. The cumulative sums of the decorrelated score contributions converge to a specific stochastic process, namely a Brownian bridge. Unlike a general stochastic process, a Brownian bridge is a continuous random path that is conditioned to begin and end at a specific value, typically 0, at two fixed points in time. Conceptually, it can be visualized as a randomly fluctuating “bridge” whose two ends are pinned to fixed points, while the path between them remains stochastic. In the case of maximum likelihood estimates, this convergence follows, for instance, from Donsker’s theorem (Billingsley, 1995), which generalizes the central limit theorem (Zeileis & Hornik, 2007).
In this context, missing values that result from respondents not taking all items can be accommodated by setting the corresponding score contributions to 0. This approach is also implemented in currently available software (Schneider et al., 2022).
Violations of parameter invariance now lead to a deviation between the observed distribution of B(t; π̂) and the one that is expected under the IRT model, and these deviations can be summarized via various test statistics (see, e.g., Merkle et al., 2014). Important examples for settings with a continuous person covariate include the double maximum statistic DM, the Cramér-von Mises statistic CvM, and the maximum Lagrange multiplier statistic maxLM. If we write B_ij for the value of the cumulative decorrelated score process after the first i respondents for item parameter j (i.e., the j-th component of B(i/N; π̂)), with respondents ordered by the person covariate of interest, these statistics are given by:
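$$\mathrm{DM} = \max_{i = 1, \ldots, N}\; \max_{j} \left| B_{ij} \right|, \qquad
\mathrm{CvM} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j} B_{ij}^{2}, \qquad
\mathrm{maxLM} = \max_{i = \underline{i}, \ldots, \overline{i}} \left\{ \frac{i}{N}\left(1 - \frac{i}{N}\right) \right\}^{-1} \sum_{j} B_{ij}^{2},$$

where j ranges over the item parameters under test and, for maxLM, i runs over a trimmed range $\underline{i}, \ldots, \overline{i}$ of respondents, since its weighting term vanishes at the boundaries (cf. Merkle & Zeileis, 2013; Merkle et al., 2014).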
The distribution of these statistics under the null hypothesis of invariance can be derived from the underlying stochastic process. In the case of DM, a closed-form solution is provided by Merkle and Zeileis (2013), whereas no closed-form solution is known for CvM or maxLM. As was already noted, the individual score contributions entering these test statistics are ordered with respect to the continuous covariate, e.g., age, of the individual respondents i. The results are unaffected by the direction of this ordering, that is, whether it goes from highest to lowest or vice versa. For settings with a categorical covariate with m categories, an unordered Lagrange multiplier test LMuo is available, which is given by (Merkle et al., 2014; Wang et al., 2014):
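$$\mathrm{LM}_{uo} = \sum_{l=1}^{m} \sum_{j} \frac{\left( B_{i_l j} - B_{i_{l-1} j} \right)^{2}}{(i_l - i_{l-1}) / N},$$

where the sum over j again runs over the item parameters under test, and i_0 = 0 and i_m = N by convention.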
In this equation, i_l = ⌊N ⋅ t_l⌋, where t_l, l = 1, …, m − 1, denotes the proportion of respondents in the first l categories. It is important to note that the value of LMuo does not depend on the order of the categories. It is essentially the sum of the squared, suitably scaled increments of the cumulative score process for each item parameter and each category. LMuo is asymptotically equivalent to the usual likelihood ratio statistic. For further details and statistics for scenarios with ordinal person covariates, see Merkle et al. (2014).
For these statistics, critical values for testing a null hypothesis of parameter invariance can be obtained by analytical derivations (DM, LMuo) or by simulation (CvM, maxLM). Score-based tests for IRT models have been described in various papers (e.g., Debelak & Strobl, 2019; Schneider et al., 2022; Strobl et al., 2015; Wang et al., 2018). For models such as the two-parameter logistic model (2PL; Birnbaum, 1968), these tests were developed for applications where the item parameters are estimated by MML estimation, as implemented in widely used software such as the R package mirt (Chalmers, 2012). A distinct characteristic of these classical approaches is that they treat person parameters as random effects and thus directly estimate only the item parameters, and not the person parameters. The related score-based tests also require that ability differences which correlate with the covariate used for ordering scores are defined a priori (Schneider et al., 2022). If ability differences between respondent groups are not accounted for, the ability difference is misinterpreted as a DIF effect, leading to increased Type I error (Debelak & Strobl, 2019). In the next section, we discuss alternative estimation approaches to MML. Such estimation approaches usually violate the assumption of i.i.d. individual score contributions, and thus require an adaptation of score-based tests.
Estimation Approaches for IRT Models with Fixed Effects Person Parameters
We now consider two estimation methods for item factor analysis which allow consistent estimation of item and person parameters, and which are computationally suitable for the evaluation of large-scale assessments. We illustrate both methods with the multidimensional version of the two-parameter logistic model (Birnbaum, 1968), which is based on the following item response function:
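$$P(U_{ji} = 1 \mid \theta_j, a_i, d_i) = \frac{\exp\left(a_i^{t} \theta_j + d_i\right)}{1 + \exp\left(a_i^{t} \theta_j + d_i\right)}$$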
Here, U_ji takes the value 1 for a correct and 0 for an incorrect response of respondent j to item i. θ_j is the vector of ability parameters of respondent j, d_i is the item intercept parameter, and a_i is the vector of discrimination parameters of item i.
Constrained Joint Maximum Likelihood Estimation
Constrained joint maximum likelihood estimation (Chen et al., 2019, 2020) maximizes the joint log-likelihood function of the observed response matrix under the following constraints for all respondents j and all items i:
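$$\|\theta_j\| \le C, \qquad \left\| (d_i, a_i^{t})^{t} \right\| \le C$$

(a sketch of the constraints as we read them; Chen et al., 2019, give the precise formulation in terms of augmented person and item parameter vectors).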
Here, ‖x‖ denotes the Euclidean norm of a vector x and C is a pre-specified positive constant. Heuristically, the inclusion of C regularizes the possible parameter values. In the current study, C was set to 5, which corresponds to the default value of the software and was also used in the simulations of Chen et al. (2019). Chen et al. (2019) show that, under mild regularity conditions and if the true model parameters meet the outlined constraints, the estimates are consistent as the numbers of respondents and items grow to infinity. They also provide an implementation in the R package mirtjml (Zhang et al., 2020), which was shown to be computationally more efficient than a traditional MML estimation approach.
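A minimal, hypothetical usage sketch of CJML estimation with mirtjml (argument names beyond the response matrix and the latent dimension, as well as the structure of the returned object, may differ across package versions):

```r
library(mirtjml)
# resp_matrix: N x K matrix of dichotomous (0/1) responses
fit <- mirtjml_expr(response = resp_matrix, K = 1)  # unidimensional 2PL via CJML
str(fit)  # estimated person parameters, discriminations/loadings, and intercepts
```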
Joint Maximum A Posteriori Estimation
Instead of specifying hard limits, Bayesian priors can be used to regularize estimates and keep them from tending towards infinity in limited-data edge cases. The R package bigIRT (Driver & Tomasik, 2023a) uses joint optimization of item and person parameters under broad priors to compute a first-pass estimate, then optionally uses these estimates as a basis for computing empirical Bayesian priors (for an overview see Carlin & Louis, 2008) to be used in a second-pass optimization. Because joint maximization is somewhat prone to generating extreme outlier estimates in certain cases (e.g., with all correct or all incorrect responses, a subject's ability estimate tends towards ±infinity), a robust approach is used to determine the width of the priors (by default, points further than 1.5 times the interquartile range are dropped when computing the empirical priors). The amount of regularization generated by the specified and/or estimated priors is governed by user-defined parameters.
Here, optimization towards a maximum is also performed, but the target is not simply the likelihood; it is the likelihood plus the prior term, referred to as the maximum a posteriori. Since the prior term for an item's or person's parameters does not change no matter how much data for that item or person is observed, with sufficient data the influence of the prior becomes negligible. With lower amounts of data, the estimate for a particular item or person becomes a weighting between the information for that specific item or person and the general tendency for items or people (in the form of a normal distribution). To demonstrate a use case for such approaches, Driver and Tomasik (2023a) report a simulation study comparing cases with 50,000 subjects, 1,000 or 5,000 items from a single scale, and 50 responses per subject, using two- and three-parameter logistic IRT models. With 5,000 items, model estimation is 10–20 times slower using MML with expectation-maximization in the mirt R package (Chalmers, 2012), and this difference grows with increasing numbers of items.
A Generalization of Score-Based Tests
In the two estimation methods outlined above, person parameters are treated as fixed effects, which allows the joint estimation of item and person parameters. As a consequence, however, the individual score contributions are not identically distributed.
Demonstration of Non IID Person Parameters
For an illustration of this point, we consider the unidimensional 2PL model and further assume that all parameters a_i, d_i, and θ_j are known. In the following, we use P_ji to denote the probability of a positive response, while Q_ji = 1 − P_ji denotes the probability of a negative response given θ_j, a_i, and d_i. We can now derive the individual score contributions to the a_i and d_i parameters using standard differentiation rules.
In the case of a correct response U ji = 1, the individual score contribution to parameter a i is:
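$$\frac{\partial}{\partial a_i} \log P_{ji} = \theta_j \left(1 - P_{ji}\right) = \theta_j Q_{ji}$$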
In the case of an incorrect response U ji = 0, the individual score contribution to parameter a i is:
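$$\frac{\partial}{\partial a_i} \log Q_{ji} = -\theta_j P_{ji}$$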
The corresponding results for parameter d i are:
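$$\frac{\partial}{\partial d_i} \log P_{ji} = Q_{ji}, \qquad \frac{\partial}{\partial d_i} \log Q_{ji} = -P_{ji}$$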
It is easy to see that, if the 2PL model is the true model and if the true parameter values θ j , a i , and d i are known for all items and respondents, the expected values of the individual score contributions are 0 for each pair i, j:
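Both score contributions can be written compactly as θ_j (U_ji − P_ji) for a_i and U_ji − P_ji for d_i, so that

$$E\left[\theta_j \left(U_{ji} - P_{ji}\right)\right] = \theta_j \left(E\left[U_{ji}\right] - P_{ji}\right) = 0, \qquad E\left[U_{ji} - P_{ji}\right] = 0.$$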
If the model is correctly specified, we also obtain an expected value of 0 for the cumulative score distributions that underlie the test statistics we discussed for the score-based tests.
Since only two outcomes are possible for each score contribution, we can also calculate the variance of their distribution for each combination of respondents and items. This variance is found to depend on P ji and Q ji and thus on the true item and person parameters. It immediately follows that the individual score contributions are not identically distributed for fixed a i or d i . We therefore cannot directly apply the theory underlying the classical score-based tests, which was outlined above, when treating the person parameters as fixed effects.
Approximate Solution to Non IID Person Parameters
To address this problem, we now regard it from a more abstract perspective and consider the individual score contributions for both parameters as realizations of binary random variables X_1, …, X_N, with N being the sample size, and E(X_1) = ⋯ = E(X_N) = 0. While these random variables are independent, they are not identically distributed. Let σ_j² denote the variance of X_j. The presented framework of score-based tests is based on the central limit theorem for i.i.d. random variables, which states that the following sum S_N of i.i.d. random variables converges to a standard normal distribution for N → ∞:
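$$S_N = \frac{\sum_{j=1}^{N} X_j}{\sqrt{\sum_{j=1}^{N} \sigma_j^{2}}}$$

For i.i.d. variables with common variance σ², this is the familiar $\sum_{j} X_j / (\sigma \sqrt{N})$; written with the individual variances σ_j², the same expression also covers the independent but not identically distributed case considered next.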
For independent, but not identically distributed random variables, this result remains correct under very general conditions (Bhattacharya & Waymire, 2022; Whitt, 2007). From a technical perspective, we can therefore use this result to apply score-based tests in the (unrealistic) case that the true person parameters are known. The basic idea of the proposed approximate method is to use estimates of the person parameters in place of the true values for sufficiently long tests. This idea motivates us to consider the following stochastic process, which rescales the individual score contributions to a common variance of 1, using an estimate of their standard deviation (see the Note below):
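$$B^{*}(t; \hat\pi) = \frac{1}{\sqrt{N}} \sum_{i=1}^{\lfloor N t \rfloor} \frac{s(\hat\pi; u_{(i)})}{\hat\sigma_{(i)}}, \qquad 0 \le t \le 1,$$

where σ̂_(i) denotes the estimated standard deviation of the score contribution of the i-th ordered respondent, applied componentwise to each item parameter (this notation is ours and is used again in the algorithm below).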
In our example of the 2PL model, the variance of the score contribution of respondent j to item i is θ_j² P_ji Q_ji for the slope parameter a_i and P_ji Q_ji for the intercept parameter d_i. P_ji and Q_ji are, in turn, estimated based on estimates of the item and person parameters. This leads to the following algorithm for carrying out score-based tests when the score contributions are independent, but not identically distributed:
1. Obtain estimates θ̂_j, â_i, and d̂_i for all model parameters.
2. Using these estimates, obtain estimates P̂_ji and Q̂_ji of the probabilities of a positive or negative response for each pair of items and respondents, as well as the individual score contributions for each item parameter.
3. Using the estimated probabilities P̂_ji and Q̂_ji (and, for the slope parameters, θ̂_j), estimate the standard deviation of each individual score contribution and use it to scale the individual score contributions to a variance of 1.
4. Use the scaled individual score contributions to obtain the cumulative score process B*(t; π̂). As our simulations illustrate, this stochastic process can be described well by a Brownian motion for sufficiently long tests.
5. After centering individual score contributions, the process of their cumulative sums can be described well by a Brownian bridge.
The approximation by a Brownian bridge allows the application of the same asymptotic theory that underlies the classical score-based tests, provided the person parameter estimates are sufficiently close to the true values. The proposed algorithm can be seen as a conceptual generalization of an algorithm described by Debelak et al. (2022), which considered a scenario with known groups of respondents, wherein the individual score contributions within each group were i.i.d. Note that the algorithm can be applied to a specific subset of parameters of interest, and not necessarily the entire parameter set (this becomes relevant for our empirical application later). We implemented the algorithm in R and evaluated it in a simulation study, described below.
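To make the steps concrete, the following minimal R sketch (our own illustration, not the implementation used for this paper) carries out steps 2–5 for the intercept parameter of a single 2PL item and computes a double maximum statistic; the function name and the brute-force simulation of the null distribution are illustrative simplifications:

```r
# Assumes estimates theta_hat (per person), a_hat, d_hat (for the item),
# the 0/1 response vector u, and a person covariate z are available.
dm_test_intercept <- function(u, theta_hat, a_hat, d_hat, z, n_sim = 10000) {
  ord <- order(z)                              # order respondents by the covariate
  u <- u[ord]; theta_hat <- theta_hat[ord]
  p <- plogis(a_hat * theta_hat + d_hat)       # step 2: estimated P_ji
  s <- ifelse(u == 1, 1 - p, -p)               # step 2: score contributions for d_i
  s <- s / sqrt(p * (1 - p))                   # step 3: rescale to unit variance
  s <- s - mean(s)                             # step 5: center before cumulating
  N <- length(s)
  B <- cumsum(s) / sqrt(N)                     # cumulative process (Brownian bridge under H0)
  dm_obs <- max(abs(B))
  # Crude simulation of the null distribution of the double maximum statistic
  dm_null <- replicate(n_sim, {
    e <- rnorm(N)
    max(abs(cumsum(e - mean(e)) / sqrt(N)))
  })
  list(statistic = dm_obs, p_value = mean(dm_null >= dm_obs))
}
```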
Partialling Ability From Covariates
While the algorithm as described so far applies when ability is known, with estimated ability it can show an increased Type I error rate if true ability is correlated with the covariate used for ordering the scores and the ability parameter estimates are biased. In this scenario, since ability is imperfectly estimated, the gradient for the item parameters will be biased by the mis-estimated ability, which can lead to spurious significance of the score-based tests. We can correct for this problem by residualizing the estimated person parameters out of the covariates used for ordering the scores. In our study, we used a thin-plate spline with basis dimension 10 to model the relationship between the covariate and the estimated person parameters. This is a flexible approach that allows for non-linear relationships and is implemented in the R package mgcv (Wood, 2017). The residuals of this regression are then used as a new covariate for ordering the scores, and the resulting ordered scores enter the algorithm described above. This residualization is necessarily imperfect, as it relies on estimated ability parameters, but it can substantially reduce the Type I error rate of the score-based tests when ability differences correlate with the covariates of interest. Of course, when the original covariate is perfectly collinear with ability, residualization leaves no variation in the covariate and hence no power to detect DIF effects. In practice, however, we expect this to be rare, and residualization can help to reduce Type I error rates due to ability differences in many practical applications.
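A minimal sketch of this residualization step using mgcv (variable names are illustrative):

```r
library(mgcv)
# covariate: person covariate used for ordering; theta_hat: estimated abilities
fit <- gam(covariate ~ s(theta_hat, bs = "tp", k = 10))  # thin-plate spline, basis dimension 10
covariate_resid <- residuals(fit)   # residualized covariate, used to order the scores
```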
Simulation Study
We used a simulation study to evaluate the Type I error rate and power of the proposed algorithm under different conditions. Data were generated based on the 2PL model of Birnbaum (1968), as well as a uniformly distributed (U (0, 1)) continuous covariate per person. The following conditions were systematically combined with each other:
The number of persons was 500, 1,000, or 5,000.
The test length for each person was 10, 20, or 50.
DIF was either absent, or present with respect to the covariate. To keep the simulation study concise, we focused on the simulation of uniform DIF, that is, on changes in the intercept parameters.
Systematic ability differences, or impact effects, were either absent or present with respect to the covariate.
Either no responses were missing, or 20% or 50% of the responses were randomly selected to be missing. In conditions with missingness, additional persons and items were used to keep the total number of observed responses approximately equal to the total number of responses in the conditions without missingness. The intention was to test the effect of sparsity rather than of reduced overall information.
For each combination of conditions, 1000 data sets were generated. In conditions without DIF and impact effects, item intercept parameters d i and person parameters θ j were drawn from a standard normal distribution with mean 0 and standard deviation 1, whereas the discrimination parameters a i were drawn from a normal distribution with mean 1 and standard deviation 0.2. The item and person parameters were sampled anew for each dataset generated in this study.
Individuals were assigned to one of two groups based on their covariate (split at 0.5). In conditions with impact, person parameters of the first group were sampled from a standard normal distribution N(0, 1), whereas the person parameters of the second group were sampled from a distribution shifted one unit higher, N(1, 1). Then, all person parameter values were standardized to a mean of 0 and a variance of 1. Although this step was independent of the generation of the item parameters, it also defined the scale of the item parameters. Several R packages, including mirtjml, assume the same mean and variance for the person parameter estimates. If DIF effects were present, item intercept parameters for 20% of the items were increased for at least some subjects. The different DIF conditions were as follows:
Stepwise DIF: Item intercept parameters for 20% of items were increased by 0.5, for all respondents with a covariate of 0.5 or more. This represents a scenario where the relative item difficulty shifts suddenly for a specific group of respondents (e.g., respondents above an age of 50), and stays at this level.
U-Turn DIF: Item intercept parameters for 20% of the items were increased by 1 for all respondents with a covariate value below .25 or above .75. Here, the covariate defines three separate groups between which the values of the item intercept parameters change. This represents a scenario where the relative item difficulty shifts suddenly for a specific group of respondents, but returns later to the original level. We used a larger effect size for U-Turn DIF to ensure detection propensity was similar to other conditions when using the double maximum statistic DM (the only continuous test available in software for all conditions).
Linear DIF: Item intercept parameters for 20% of items were increased by the value of each respondent covariate, so ranging from 0 to 1. This depicts a scenario where item parameters change gradually with a person covariate.
Categorical DIF: Item intercept parameters for 20% of items were increased by 0.5 for subjects with a covariate value greater than 0.5, and the covariate was converted to a binary variable (split at 0.5). This condition is close to the stepwise DIF condition, but the covariate is now binary rather than continuous, and collinear with the ability differences. For instance, this would correspond to a scenario where the relative item difficulty changes for a group that is defined by a categorical variable such as native language. This scenario also resembles that of a classical DIF analysis, where a focal and a reference group are compared against each other.
As a measure of the size of the DIF effect, one could consider the difference between the item parameters affected by DIF (Steinberg & Thissen, 2006). This effect size, which is particularly suitable for measuring DIF effects in a categorical covariate, is 0.5 for the conditions with categorical and stepwise DIF, 1 for U-turn DIF, and between 0 and 1, depending on the person covariate, for linear DIF. Another possible effect size is τ2, the variance of the difference between the item parameters over all items (Finch & French, 2023). τ2 equals 0.04 for all conditions but U-turn, which is 0.08. Chalmers (2023) discusses additional measures of effect size for DIF based on the response function.
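For concreteness, the following R sketch (illustrative only, not the authors' simulation code) generates one data set for the stepwise DIF condition with impact; which covariate group receives the ability shift is an assumption made for illustration:

```r
set.seed(1)
N <- 1000; K <- 20
z <- runif(N)                                     # person covariate, U(0, 1)
theta <- rnorm(N, mean = ifelse(z >= 0.5, 1, 0))  # impact: higher-covariate group shifted by 1
theta <- as.numeric(scale(theta))                 # re-standardize abilities
a <- rnorm(K, 1, 0.2)                             # discrimination parameters
d <- rnorm(K)                                     # intercept parameters
dif_items <- seq_len(ceiling(0.2 * K))            # 20% of items affected by DIF
d_person <- matrix(d, N, K, byrow = TRUE)
d_person[z >= 0.5, dif_items] <- d_person[z >= 0.5, dif_items] + 0.5  # stepwise DIF
p <- plogis(outer(theta, a) + d_person)           # 2PL response probabilities
U <- matrix(rbinom(N * K, 1, p), N, K)            # simulated 0/1 responses
```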
In each generated dataset, we estimated the item and person parameters for the 2PL model with a range of estimators, and the parameter estimates were used to carry out score-based tests using the algorithm outlined in the previous section. Three estimators of specific interest were used: constrained joint maximum likelihood estimation (CJML; using the default value C = 5), joint maximum a posteriori with generic weak priors (JMAP), and joint maximum a posteriori with empirical Bayesian priors (JMAP_EB). Both JMAP approaches use the bigIRT defaults of N(0, 100) priors for ability and difficulty parameters, and N(0, 4) for the “raw” discrimination parameters, which are then transformed via A = log(1 + exp(A_raw)). We also investigated performance with true item parameters instead of estimates. These true-value-based results served as a baseline, mirroring the (unrealistic) case of perfectly accurate estimation. We further included score-based tests based on MML estimation with the EM (expectation-maximization) algorithm, which corresponds to the approach described in previous works (Debelak & Strobl, 2019; Schneider et al., 2022). Two MML conditions were included, the first of which (MML_groups) accounted for impact effects via separate priors for the person parameters across ability groups, using a multiple-group IRT model based on the true impact-defining covariate. However, this is a somewhat unfair comparison, as the joint estimation approaches do not rely on knowing the impact group(s) when estimating person parameters; an MML condition without pre-specified groups was therefore also included.
Computational Details
In the simulation study, the following software packages were used: R version 4.5.0 (R Core Team, 2023); Stan version 2.32.0 (Carpenter et al., 2017); strucchange, version 1.5-3 (Zeileis et al., 2002); sandwich, version 3.0-2 (Zeileis et al., 2020); zoo, version 1.8-12 (Zeileis & Grothendieck, 2005); mirtjml, version 1.4.0 (Zhang et al., 2020); mirt, version 1.39 (Chalmers, 2012); lattice, version 0.21-8 (Sarkar, 2008); SimDesign, version 2.11 (Chalmers & Adkins, 2020); bigIRT, version 0.1.5 (Driver & Tomasik, 2023a); and data.table, version 1.14.8 (Dowle & Srinivasan, 2023). The R code for reproducing the results of the simulation studies and the empirical example is available at https://osf.io/mkfg9/.
Results
Tables for every experimental cell are provided in the Supplemental Document, while Figure 1 and Table 1 summarize the key patterns. In the case of DIF based on a categorical covariate (e.g., gender and native language in empirical data), the unordered Lagrange multiplier statistic LMuo was used as the test statistic, whereas in continuous cases (e.g., age in empirical data) we used the double maximum statistic DM, the Cramér-von Mises statistic CvM, and the maxLM statistic. These choices correspond to the recommendations of Merkle et al. (2014). Because of software limits in strucchange (CvM is capped at 25 parameters and maxLM at 40), we limit reporting in the manuscript to the double maximum statistic (DM) for all continuous-covariate scenarios, ensuring comparability across test lengths. We first focus on results without missingness, then discuss the minor deviations observed under increasing sparsity.
Figure 1.
Discovery rate of DM score-based tests, with no missingness. Note. Each plot shows mean power or Type I error rate (depending on whether DIF is present or not). Impact and DIF conditions are distinguished in vertical facets, and estimator in horizontal. The suffix “EB” on the estimator denotes empirical Bayes, and “groups” denotes pre-specified group structure
Table 1.
Discovery (Type I Error/Power) rates for 0% Missing, Continuous DIF, DM test
| N Persons | N Items | True | JMAP | JMAP_EB | CJML | MML | MML_groups |
|---|---|---|---|---|---|---|---|
| Impact, DIF | |||||||
| 500 | 10 | 0.50 (0.34) | 0.48 (0.40) | 0.38 (0.23) | 0.47 (0.37) | 0.97 (0.34) | 0.14 (0.96) |
| 500 | 20 | 0.62 (0.41) | 0.20 (0.20) | 0.24 (0.20) | 0.19 (0.20) | 0.72 (0.26) | 0.16 (0.47) |
| 500 | 50 | 0.76 (0.52) | 0.22 (0.24) | 0.24 (0.24) | 0.22 (0.24) | 0.42 (0.28) | 0.21 (0.23) |
| 1000 | 10 | 0.88 (0.70) | 0.85 (0.78) | 0.82 (0.55) | 0.86 (0.72) | 1.00 (0.75) | 0.42 (1.00) |
| 1000 | 20 | 0.96 (0.80) | 0.58 (0.48) | 0.68 (0.50) | 0.57 (0.47) | 1.00 (0.65) | 0.53 (0.95) |
| 1000 | 50 | 1.00 (0.92) | 0.70 (0.60) | 0.74 (0.60) | 0.69 (0.60) | 0.93 (0.66) | 0.65 (0.58) |
| 5000 | 10 | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 0.98 (1.00) |
| 5000 | 20 | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) |
| 5000 | 50 | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) |
| Impact, No DIF | |||||||
| 500 | 10 | 0.05 (0.04) | 0.17 (0.15) | 0.10 (0.06) | 0.19 (0.13) | 0.85 (0.08) | 0.01 (0.86) |
| 500 | 20 | 0.05 (0.04) | 0.02 (0.03) | 0.02 (0.03) | 0.02 (0.03) | 0.24 (0.05) | 0.01 (0.22) |
| 500 | 50 | 0.04 (0.04) | 0.01 (0.03) | 0.02 (0.03) | 0.01 (0.03) | 0.04 (0.04) | 0.02 (0.06) |
| 1000 | 10 | 0.04 (0.04) | 0.40 (0.33) | 0.28 (0.11) | 0.42 (0.26) | 1.00 (0.16) | 0.01 (1.00) |
| 1000 | 20 | 0.05 (0.04) | 0.02 (0.04) | 0.04 (0.04) | 0.02 (0.04) | 0.77 (0.07) | 0.01 (0.66) |
| 1000 | 50 | 0.04 (0.05) | 0.02 (0.04) | 0.02 (0.04) | 0.02 (0.04) | 0.08 (0.04) | 0.01 (0.10) |
| 5000 | 10 | 0.05 (0.05) | 0.95 (0.93) | 0.87 (0.60) | 0.92 (0.88) | 1.00 (0.93) | 0.01 (1.00) |
| 5000 | 20 | 0.04 (0.05) | 0.04 (0.05) | 0.09 (0.04) | 0.03 (0.04) | 1.00 (0.43) | 0.01 (1.00) |
| 5000 | 50 | 0.05 (0.05) | 0.05 (0.05) | 0.03 (0.04) | 0.02 (0.04) | 0.96 (0.07) | 0.01 (0.81) |
| No impact, DIF | |||||||
| 500 | 10 | 0.49 (0.47) | 0.31 (0.30) | 0.28 (0.27) | 0.31 (0.29) | 0.38 (0.33) | 0.31 (0.29) |
| 500 | 20 | 0.59 (0.58) | 0.32 (0.33) | 0.33 (0.32) | 0.33 (0.32) | 0.41 (0.38) | 0.37 (0.34) |
| 500 | 50 | 0.75 (0.75) | 0.43 (0.43) | 0.44 (0.43) | 0.43 (0.43) | 0.48 (0.46) | 0.46 (0.44) |
| 1000 | 10 | 0.88 (0.88) | 0.70 (0.69) | 0.69 (0.66) | 0.70 (0.68) | 0.81 (0.75) | 0.73 (0.67) |
| 1000 | 20 | 0.96 (0.96) | 0.77 (0.76) | 0.78 (0.76) | 0.77 (0.76) | 0.87 (0.84) | 0.82 (0.78) |
| 1000 | 50 | 1.00 (1.00) | 0.90 (0.90) | 0.90 (0.90) | 0.90 (0.90) | 0.93 (0.92) | 0.92 (0.90) |
| 5000 | 10 | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) |
| 5000 | 20 | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) |
| 5000 | 50 | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) | 1.00 (1.00) |
| No impact, No DIF | |||||||
| 500 | 10 | 0.04 (0.04) | 0.04 (0.04) | 0.03 (0.03) | 0.03 (0.04) | 0.04 (0.03) | 0.03 (0.04) |
| 500 | 20 | 0.04 (0.04) | 0.03 (0.03) | 0.03 (0.03) | 0.03 (0.03) | 0.03 (0.03) | 0.03 (0.03) |
| 500 | 50 | 0.04 (0.04) | 0.03 (0.03) | 0.03 (0.03) | 0.03 (0.03) | 0.03 (0.03) | 0.03 (0.03) |
| 1000 | 10 | 0.05 (0.05) | 0.05 (0.05) | 0.04 (0.03) | 0.05 (0.04) | 0.04 (0.03) | 0.04 (0.04) |
| 1000 | 20 | 0.04 (0.05) | 0.03 (0.04) | 0.03 (0.03) | 0.03 (0.03) | 0.04 (0.03) | 0.04 (0.04) |
| 1000 | 50 | 0.04 (0.05) | 0.04 (0.04) | 0.04 (0.04) | 0.04 (0.04) | 0.03 (0.03) | 0.03 (0.03) |
| 5000 | 10 | 0.05 (0.04) | 0.05 (0.04) | 0.05 (0.05) | 0.04 (0.04) | 0.04 (0.03) | 0.03 (0.05) |
| 5000 | 20 | 0.05 (0.04) | 0.03 (0.03) | 0.03 (0.04) | 0.03 (0.03) | 0.04 (0.03) | 0.03 (0.04) |
| 5000 | 50 | 0.04 (0.04) | 0.04 (0.04) | 0.04 (0.04) | 0.04 (0.04) | 0.04 (0.04) | 0.04 (0.04) |
Note. Residualized covariate approach shown in brackets when applicable.
Considering Figure 1, Type I error rates with no DIF or impact (bottom right) are as expected, but with impact (top right), Type I error rates are generally high when ability estimates are poor due to very few (10) items. The MML estimator shows very high Type I error rates whenever impact is present. Power (left side plots) is low with 500 subjects but adequate with 1000 or more subjects. Both lower item count and the presence of impact modestly reduce power.
Alternative Tests of Continuous DIF
The key difference notable with the alternative continuous score tests not reported here (maxLM and CvM) is that maxLM gives much better power for the U-turn DIF condition, making it the easiest condition to detect. This makes sense because (a) maxLM is a global maximum test and is therefore good at detecting localized, abrupt changes in item parameters like U-turns (Merkle & Zeileis, 2013), and (b) in the simulations U-turn DIF was specified with approximately double the effect size of the other conditions to ensure comparability when using the DM p-value. More nuanced comparison of the different tests is possible by examining the Supplemental Materials; however, our simulations were not designed to comprehensively compare the different tests, but simply to reflect a range of possible conditions. The test type (or types) chosen should ideally reflect expectations about the functional form of DIF, whenever there are any (Merkle & Zeileis, 2013).
Sparsity
Figure 1 in the online Supplement illustrates the case where half of an individual's responses are missing at random. To maintain a comparable amount of information, in such conditions the numbers of items and persons were each scaled up, so that with 50% missingness, in the 20-item condition, there were 30 items in total and 15 items answered at random per subject. The MML estimator still performs very poorly with regard to Type I error; we include it only as a comparative basis. With regard to the estimators of interest (CJML, JMAP, JMAP_EB), when the maximum of 5,000 subjects is assessed, higher than nominal Type I error of 7% to 13% is visible when impact is present in the 20-item (30 present, 15 answered) condition. This can be explained by the reduced number of items answered per person, which reduces the accuracy of the individual ability estimates. More concerning is that Type I error is still high with JMAP in the 50-item (70 present, 35 answered), 5,000-person, impact condition.
Residualized Covariate Approach
Higher than nominal Type I error due to impact can be reduced in general, and eliminated in the case of JMAP in the 50 item (70 items present, 35 answered), 5,000 person, impact condition. This reduction in Type I error is achieved by residualizing the covariate used in the score test on the ability estimates, such that the residual is independent of the impact effect (to the extent that the ability estimates are accurate). However, for very short tests (10 items), where ability estimates are imprecise, the residualization approach fails to reduce the inflated Type I error (e.g., remaining at .93 for JMAP with 5000 persons). Furthermore, this correction comes at the cost of statistical power, which decreases by approximately .20 in some MML conditions. This residual approach also substantially reduces the large Type I error rate shown by MML when impact is present. Somewhat counter-intuitively, Type I error increases for the MML_groups approach with residualization of the covariate, but to avoid confusion this is not plotted because (a) there is no Type I error problem with such an approach and (b) it is not the primary focus here.
Categorical DIF
For categorical DIF the LMuo test is used, with a very similar pattern of results to the continuous DIF conditions, as shown in Figure 2, which averages over all missingness conditions. Because of the perfect overlap between impact groups and the DIF condition (i.e., the binary covariate used for ordering the score-based tests also perfectly reflects the ability difference structure), residualizing estimated ability out of the covariate was not a viable approach to mitigate higher than nominal Type I error in the impact condition here. With moderate information per individual (15–20 items answered), the empirical Bayesian JMAP approach is problematic, but with more information per individual (30–50 items) its Type I error rate drops below nominal while still retaining good power. This is not the case for JMAP without the empirical Bayesian shrinkage, where increases in individual information do not appear to reduce the Type I error.
Figure 2.
Discovery rate of LMuo score-based tests. Note. Each plot shows mean power or Type I error rate (depending on whether DIF is present or not). Impact and DIF conditions are distinguished in vertical facets, and estimator in horizontal. The suffix “EB” on the estimator denotes empirical Bayes, and “groups” denotes pre-specified group structure
Empirical Application
To illustrate a use case for score-based DIF testing with fixed effects person parameters, we examine DIF against gender on the German reading scale from the Mindsteps online learning platform (see https://www.mindsteps.ch/). Mindsteps offers practice and tests with questions drawn from a bank of many thousands across a range of subjects, and is used by teachers and students in a variety of Swiss regions. The system covers topics from the third grade in elementary school until the third grade in secondary school, spanning 7 years of compulsory schooling. Currently, the item bank comprises up to 15,000 items per school subject. In Mindsteps, there are two types of item banks. A practice item bank is available to all students and teachers for training and teaching. Students can use this item bank to create and answer an item set from a topic domain on which to practice. A testing item bank is used to evaluate students’ ability. Teachers can select items according to desired competency domains or topics of the curriculum, and create assessments for students.
For our analyses, we use data from both item banks. Each assessment occasion consisted of at least 10 items, with most students completing multiple assessments. More details on the Mindsteps software, as well as discussion of vertical scaling and further modeling, are found in Tomasik et al. (2018), Berger et al. (2019), and Driver and Tomasik (2023b). In total, 3,403,101 responses from 57,684 students on 1602 German reading questions were available. 50% of repeated assessments were from the same month, 75% from the same year, and some students had assessments over multiple years. To simplify the demonstration here, we aggregate across assessments per student.
To maximize precision of the calibration of the 2PL model, in an initial step all data (after some basic cleaning) was used for parameter estimation, which was performed using joint maximum a posteriori (with weak generic priors) via bigIRT (Driver & Tomasik, 2023a). In a second step, score contributions were obtained only for students with 50 or more responses (to minimize false discovery concerns), leaving 21,028 students. The bigIRT software was designed with this dataset as a motivating example, and allows for fitting in minutes rather than hours as was typical of R packages offering MML approaches.
In contrast to the approach in the simulation, for this example we used univariate tests of DIF for each item, as with so many items global tests are nearly meaningless (and the required covariance matrices are exceptionally difficult to compute). A significance threshold of 0.01 was used to reduce false positives. Overall, 13% of items tested significant for gender differences, split roughly evenly between males and females in terms of direction.
To validate the performance of the score-based test approach in this empirical case, where misspecification is inevitable and the data are quite noisy, we re-fit the model to the data, this time including gender as a predictor of both students' ability and the difficulty of each item. We used a standard normal distribution as the prior for the item-wise DIF effects, as estimating 1,602 additional parameters without regularization would doubtless lead to overfitting. Correlations between the moderated and unmoderated model parameters were approximately 1.00. Figure 3 compares the mean score differences per item (calculated based on the initial, unmoderated model) with the estimated gender effect per item (from the moderated model fit); items that tested significant for DIF using score-based tests are highlighted in red. The linear correlation between the estimated coefficient and the score difference was .84, and the rank-order correlation was .93. These high correlations indicate that score contributions provide an effective way to screen for DIF in such scenarios: here we estimated the 1,602 extra parameters needed, but when testing multiple covariates that number could quickly grow.
Figure 3.
Mean score contributions versus model-based effect estimates. Note. Mean score differences per item from the initial unmoderated model are compared to the estimated gender effect per item from the moderated model fit. Note the substantial correlation, indicating that score contributions and score-based tests are a reliable approach to screen for DIF effects in such scenarios. Items that tested significant for DIF using score-based tests on the first model fit are shown as red triangles
Further investigation to understand the source of these apparent gender differences would be necessary; given that there are many such items, it would be important to examine item covariates (such as item type, e.g., multiple-choice versus short-form) and to consider whether other forms of misspecification may play a role. R code for the analyses is available in the OSF repository reported at the end of this manuscript, but the data unfortunately cannot be made public at present.
Discussion
Score-based tests offer a fast and flexible approach to model specification checking, as there is no need to fit and compare additional comparison models, and the different tests capture non-linear and non-monotonic forms of misspecification. This work describes a theoretical extension of score-based tests in an IRT context to two estimation methods, namely, constrained joint maximum likelihood estimation and joint maximum a posteriori estimation, that both allow the simultaneous estimation of person and item parameters. These results can be considered as an expansion of theoretical results by Debelak et al. (2022), and our results should also apply to comparable IRT estimation approaches that estimate both person and item parameters as fixed. We further evaluated the proposed adaptation of score-based tests with simulation studies and illustrated their application in an empirical data set.
Previous approaches for score-based tests for detecting DIF effects in IRT models, as they were described by Debelak and Strobl (2019) and Schneider et al. (2022), usually require the definition of respondent groups for which differences in the ability parameter distribution are assumed. As Debelak and Strobl (2019) show, a substantial misspecification of these groups can lead to an increased Type I error rate, which is usually problematic for DIF testing. It is a substantial advantage of the new methods that they do not require groups to be specified a priori, but account for ability groups directly by estimating the person parameters of the individual respondents.
Across extensive simulations the new tests achieved appropriate Type I control and competitive power once respondents contributed at least 15–20 informative responses. In real data from the Mindsteps platform (1,602 items, 57,684 students) the method flagged about 13% of reading items for gender-related DIF; re-fitting a model with item-specific gender effects confirmed a strong correspondence between score-based flags and estimated DIF coefficients, underscoring the procedure’s practical utility as a screening tool.
Despite generally positive results, Type I error is problematic when the person parameters are poorly estimated, particularly so when there is a large amount of information per item, as in our 5,000-person simulation condition. This is a relative weakness of the proposed method compared to the older approach based on MML estimation that relied on pre-specified groups: when the person parameters are poorly estimated, knowing the structure of ability differences (i.e., the impact groups) in advance appears necessary to attain adequate performance for DIF tests. From a practical perspective, this scenario is relevant, for instance, in very short tests, which generally do not allow an accurate estimation of the person parameters; in our simulations this weakness is apparent in the shortest, 10-item condition, and persists somewhat in the high-sparsity condition where 15 answers per respondent were available. Since the DIF tests use the item and person parameter estimates as an approximation of the true model parameters, an inaccurate estimate may lead to a misfit between the model underlying the score-based test and the observed data, which is in turn misinterpreted as a model violation or DIF effect. In the simulation studies, this can be observed as an increased Type I error rate in conditions with very short tests. The earlier approach based on MML estimation, where the different ability groups are assumed known, was robust against this effect in our simulations.
When ability is correlated with the ordering covariate for the score test (i.e., our impact conditions), Type I error can be inflated because mis-estimated abilities contaminate item-parameter gradients. In the high sparsity (50% missingness) condition, the JMAP estimator exhibited such higher than nominal Type I error when many (5,000) subjects were available, even in the highest (30 items answered) item number condition. Partialling estimated ability out of the covariate using spline regression proved effective for controlling Type I error in tests with 20 or more items. However, in short tests (10 items), the inaccuracies in ability estimation render this correction ineffective (e.g., remaining at .93 for JMAP with 5,000 persons). Furthermore, this correction comes at the cost of statistical power, which decreases by approximately .20 in some MML conditions. This residual approach also substantially reduces the large Type I error rate shown by MML when impact is present. This simple preprocessing step is accordingly recommended when the covariate is aligned with proficiency. The present residualization strategy presumes a continuous ordering variable; when the covariate is purely categorical and collinear with ability groups the technique is not helpful. Further work to develop this may be possible, as even when the impact groups are collinear with the covariate, individual differences in ability (controlling for group) should remain available.
This study has some important limitations, which point to possible topics for future studies. One limitation is that we only simulated DIF effects in the item intercept parameters. The main motivation was to limit the scope of the simulation study, since previous studies on score-based tests found that these tests also have power against changes in the discrimination parameters (e.g., Debelak et al., 2022; Debelak & Strobl, 2019). It therefore seems plausible that the proposed variations of score-based tests have power against non-uniform DIF as well, but this should be confirmed in future studies.
Perhaps the most important limitation is that we only derived and evaluated the adaptation of score-based tests for a specific type of IRT model, namely the unidimensional 2PL model. While this model is widely used and there is no obvious reason to expect that the approach would not generalize, it seems natural to investigate similar extensions for multidimensional 2PL models (Reckase, 2009), for the three-parameter (Birnbaum, 1968) and four-parameter logistic IRT models (Barton & Lord, 1981), and for models for polytomous items. It should be noted in this context that all R packages used in this study (mirt, bigIRT, and mirtjml) support the estimation of multidimensional 2PL models, but only mirt and bigIRT support the estimation of more complex IRT models.
An interesting direction for future studies would be to compare the estimation methods used in this study with regard to their bias in the estimation of the item and person parameters. As was seen in the empirical example, the relationship between the item and person parameter estimates obtained by different estimation methods is not necessarily monotonic, although all estimation methods considered in this study are consistent and should therefore agree for sufficiently large samples. This research topic would be most relevant for the application of these estimation methods in small or sparse samples.
Conclusion
Score-based tests with fixed-effects person parameters provide a flexible and computationally efficient way to probe measurement invariance in contemporary large-scale assessments. The proposed framework eliminates the need for a priori group definitions and, via the joint estimation approaches, scales to data volumes that can overwhelm classical marginal maximum likelihood estimation. Researchers gain a fast, model-agnostic diagnostic that detects item-level misfit, even in the presence of complex functional DIF, while maintaining appropriate error control. Residualizing ability out of the covariate(s) was shown to improve error control in certain difficult conditions; while this correction was helpful especially in conditions with sufficiently many items and test takers, it remained ineffective for very short tests. We hope that this approach will improve both the speed and the detection performance of misspecification checks in large-scale IRT models, and the elimination of a priori group definitions may also offer value in more typical IRT settings.
Supplemental Material
Supplemental Material for Score-Based Tests With Fixed Effects Person Parameters in Item Response Theory: Detecting Model Misspecification Including Differential Item Functioning by Rudolf Debelak and Charles C. Driver in Applied Psychological Measurement
Note
At this point, one could try to simplify the calculations by assuming a common variance for all individual score contributions s(π; u(i)), that is, by treating all individual score contributions as having the same or a similar variance. However, a small simulation study conducted by the authors showed that such a method would have a Type I error rate close to 1 under conditions similar to those considered in our simulation study; we omit the details for brevity.
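One plausible reading of this simplification (the display below uses our own illustrative notation, not the derivation from the main text) is to replace the contribution-specific covariance matrices by a single pooled estimate,

$$
\bar{V} = \frac{1}{N}\sum_{i=1}^{N} s\!\left(\hat{\pi}; u^{(i)}\right) s\!\left(\hat{\pi}; u^{(i)}\right)^{\top},
$$

and to use this pooled matrix for every person when standardizing the cumulative score process. Because the contributions are not identically distributed across persons in the fixed-effects setting considered here, such pooling is inconsistent with the sampling variability of the individual contributions, which accords with the breakdown observed in the small simulation study.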
Author Note: Some of the ideas and results in this work were disseminated as part of a conference talk at the 10th European Congress of Methodology.
Funding: The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was partly supported by the Swiss National Science Foundation (Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung) under Grant No. 188920 awarded to Martin Tomasik.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental Material: Supplemental material for this article is available online.
ORCID iD
Rudolf Debelak https://orcid.org/0000-0001-8900-2106
References
- Andrews D. W. K. (1993). Tests for parameter instability and structural change with unknown change point. Econometrica, 61(4), 821–856. 10.2307/2951764
- Arnold M., Oberski D. L., Brandmaier A. M., Voelkle M. C. (2020). Identifying heterogeneity in dynamic panel models with individual parameter contribution regression. Structural Equation Modeling: A Multidisciplinary Journal, 27(4), 613–628. 10.1080/10705511.2019.1667240
- Baker F. B., Kim S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). CRC Press.
- Barton M. A., Lord F. M. (1981). An upper asymptote for the three-parameter logistic item-response model. ETS Research Report Series, 1981(1), i–8. 10.1002/j.2333-8504.1981.tb01255.x
- Bauer D. J. (2017). A more general model for testing measurement invariance and differential item functioning. Psychological Methods, 22(3), 507–526. 10.1037/met0000077
- Berger S., Verschoor A. J., Eggen T. J. H. M., Moser U. (2019). Development and validation of a vertical scale for formative assessment in mathematics. Frontiers in Education, 4, 103. 10.3389/feduc.2019.00103
- Bhattacharya R., Waymire E. (2022). Martingale central limit theorem. In Stationary processes and discrete parameter Markov processes (pp. 201–214). Springer International Publishing. 10.1007/978-3-031-00943-3_15
- Billingsley P. (1995). Probability and measure. John Wiley & Sons.
- Birnbaum A. (1968). Some latent trait models and their use in inferring an examinee's ability. In Lord F. M., Novick M. R. (Eds.), Statistical theories of mental test scores. Addison-Wesley.
- Carlin B. P., Louis T. A. (2008). Bayesian methods for data analysis. CRC Press. 10.1201/b14884
- Carpenter B., Gelman A., Hoffman M. D., Lee D., Goodrich B., Betancourt M., Brubaker M., Guo J., Li P., Riddell A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1), 1–32. 10.18637/jss.v076.i01
- Chalmers R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. 10.18637/jss.v048.i06
- Chalmers R. P. (2023). A unified comparison of IRT-based effect sizes for DIF investigations. Journal of Educational Measurement, 60(2), 318–350. 10.1111/jedm.12347
- Chalmers R. P., Adkins M. C. (2020). Writing effective and reliable Monte Carlo simulations with the SimDesign package. The Quantitative Methods for Psychology, 16(4), 248–280. 10.20982/tqmp.16.4.p248
- Chen Y., Li X., Zhang S. (2019). Joint maximum likelihood estimation for high-dimensional exploratory item factor analysis. Psychometrika, 84(1), 124–146. 10.1007/s11336-018-9646-5
- Chen Y., Li X., Zhang S. (2020). Structured latent factor analysis for large-scale data: Identifiability, estimability, and their implications. Journal of the American Statistical Association, 115(532), 1756–1770. 10.1080/01621459.2019.1635485
- Debelak R., Pawel S., Strobl C., Merkle E. C. (2022). Score-based measurement invariance checks for Bayesian maximum-a-posteriori estimates in item response theory. British Journal of Mathematical and Statistical Psychology, 75(3), 728–752. 10.1111/bmsp.12275
- Debelak R., Strobl C. (2019). Investigating measurement invariance by means of parameter instability tests for 2PL and 3PL models. Educational and Psychological Measurement, 79(2), 385–398. 10.1177/0013164418777784
- De Boeck P. (2008). Random item IRT models. Psychometrika, 73(4), 533–559. 10.1007/s11336-008-9092-x
- Dowle M., Srinivasan A. (2023). data.table: Extension of 'data.frame' (R package version 1.14.8). https://CRAN.R-project.org/package=data.table
- Driver C. C., Tomasik M. J. (2023a). bigIRT: R software for item response theory models with large and sparse data. 10.31234/osf.io/594uw
- Driver C. C., Tomasik M. J. (2023b). Formalizing developmental phenomena as continuous-time systems: Relations between mathematics and language development. Child Development, 94(6), 1454–1471. 10.1111/cdev.13990
- Finch W. H., French B. F. (2023). Effect sizes for estimating differential item functioning influence at the test level. Psych, 5(1), 133–147. 10.3390/psych5010013
- Glas C. A. W. (1998). Detection of differential item functioning using Lagrange multiplier tests. Statistica Sinica, 8(3), 647–667.
- Glas C. A. W. (1999). Modification indices for the 2-PL and the nominal response model. Psychometrika, 64(3), 273–294. 10.1007/BF02294296
- Glas C. A. W., Suárez Falcón J. C. (2003). A comparison of item-fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27(2), 87–106. 10.1177/0146621602250530
- Holland P. W., Wainer H. (1993). Differential item functioning. Taylor & Francis.
- Komboz B., Strobl C., Zeileis A. (2018). Tree-based global model tests for polytomous Rasch models. Educational and Psychological Measurement, 78(1), 128–166. 10.1177/0013164416664394
- Magis D., Béland S., Tuerlinckx F., De Boeck P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42(3), 847–862. 10.3758/BRM.42.3.847
- Merkle E. C., Fan J., Zeileis A. (2014). Testing for measurement invariance with respect to an ordinal variable. Psychometrika, 79(4), 569–584. 10.1007/s11336-013-9376-7
- Merkle E. C., Zeileis A. (2013). Tests of measurement invariance without subgroups: A generalization of classical methods. Psychometrika, 78(1), 59–82. 10.1007/s11336-012-9302-4
- Millsap R. E. (2011). Statistical approaches to measurement invariance. Routledge.
- Muthén B. O. (1989). Latent variable modeling in heterogeneous populations. Psychometrika, 54(4), 557–585. 10.1007/BF02296397
- R Core Team. (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
- Reckase M. (2009). Multidimensional item response theory. Springer.
- Sarkar D. (2008). Lattice: Multivariate data visualization with R. Springer.
- Schäfer J., Strimmer K. (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1), Article 32. 10.2202/1544-6115.1175
- Schneider L., Strobl C., Zeileis A., Debelak R. (2022). An R toolbox for score-based measurement invariance tests in IRT models. Behavior Research Methods, 54(5), 2101–2113. 10.3758/s13428-021-01689-0
- Steinberg L., Thissen D. (2006). Using effect sizes for research reporting: Examples using item response theory to analyze differential item functioning. Psychological Methods, 11(4), 402–415. 10.1037/1082-989X.11.4.402
- Sterner P., Goretzko D. (2023). Exploratory factor analysis trees: Evaluating measurement invariance between multiple covariates. Structural Equation Modeling: A Multidisciplinary Journal, 30(6), 871–886. 10.1080/10705511.2023.2188573
- Strobl C., Kopf J., Zeileis A. (2015). Rasch trees: A new method for detecting differential item functioning in the Rasch model. Psychometrika, 80(2), 289–316. 10.1007/s11336-013-9388-3
- Strobl C., Wickelmaier F., Zeileis A. (2011). Accounting for individual differences in Bradley-Terry models by means of recursive partitioning. Journal of Educational and Behavioral Statistics, 36(2), 135–153. 10.3102/1076998609359791
- Tomasik M. J., Berger S., Moser U. (2018). On the development of a computer-based tool for formative student assessment: Epistemological, methodological, and practical issues. Frontiers in Psychology, 9, 2245. 10.3389/fpsyg.2018.02245
- Wang T., Merkle E. C., Zeileis A. (2014). Score-based tests of measurement invariance: Use in practice. Frontiers in Psychology, 5, 438. 10.3389/fpsyg.2014.00438
- Wang T., Strobl C., Zeileis A., Merkle E. C. (2018). Score-based tests of differential item functioning via pairwise maximum likelihood estimation. Psychometrika, 83(1), 132–155. 10.1007/s11336-017-9591-8
- Whitt W. (2007). Proofs of the martingale FCLT. Probability Surveys, 4, 268–302. 10.1214/07-PS122
- Wood S. N. (2017). Generalized additive models: An introduction with R (2nd ed.). Chapman and Hall/CRC. 10.1201/9781315370279
- Zeileis A., Grothendieck G. (2005). zoo: S3 infrastructure for regular and irregular time series. Journal of Statistical Software, 14(6), 1–27. 10.18637/jss.v014.i06
- Zeileis A., Hornik K. (2007). Generalized M-fluctuation tests for parameter instability. Statistica Neerlandica, 61(4), 488–508. 10.1111/j.1467-9574.2007.00371.x
- Zeileis A., Köll S., Graham N. (2020). Various versatile variances: An object-oriented implementation of clustered covariances in R. Journal of Statistical Software, 95(1), 1–36. 10.18637/jss.v095.i01
- Zeileis A., Leisch F., Hornik K., Kleiber C. (2002). strucchange: An R package for testing for structural change in linear regression models. Journal of Statistical Software, 7(2), 1–38. 10.18637/jss.v007.i02
- Zhang S., Chen Y., Li X. (2020). mirtjml: Joint maximum likelihood estimation for high-dimensional item factor analysis (R package version 1.4.0). https://CRAN.R-project.org/package=mirtjml