Applied Psychological Measurement. 2022 Jun 20;46(6):462–478. doi: 10.1177/01466216221108061

Item-Fit Statistic Based on Posterior Probabilities of Membership in Ability Groups

Bartosz Kondratek

PMCID: PMC9382089, PMID: 35991828

Abstract

A novel approach to item-fit analysis based on an asymptotic test is proposed. The new test statistic, $\chi^2_w$, compares pseudo-observed and expected item mean scores over a set of ability bins. The item mean scores are computed as weighted means, with weights based on test-takers' a posteriori density of ability within the bin. This article explores the properties of $\chi^2_w$ in the case of dichotomously scored items under unidimensional IRT models. Monte Carlo experiments were conducted to analyze the performance of $\chi^2_w$. Type I error of $\chi^2_w$ was acceptably close to the nominal level, and the statistic had greater power than Orlando and Thissen's $S\text{-}X^2$. Under some conditions, the power of $\chi^2_w$ also exceeded that reported for the computationally more demanding Stone's $\chi^2$.

Keywords: item response theory, item response theory model fit, item-fit, asymptotic test

Introduction

Item response theory (IRT) models are a potent tool for explaining test behavior. However, the validity of analyses that involve IRT is critically related to the extent to which an IRT model fits the data. Several authors have pointed to the consequences of lack of fit of the IRT model for subsequent analyses (e.g., Wainer & Thissen, 1987; Woods, 2008; Bolt et al., 2014). The Standards for Educational and Psychological Testing (AERA, APA, and NCME, 2014) recommend that evidence of model fit be provided as a prerequisite for making any inferences based on IRT.

Analysis of fit at the level of a single item plays an especially important role in assessing IRT model validity, since IRT models are designed for the very purpose of explaining observable data by separating item properties from the properties of test-takers. In unidimensional IRT models for dichotomous items, the probability of the response pattern $\mathbf{y}=(y_1,\ldots,y_n)$ conditional on the ability of the test-taker, $\theta$, is assumed to follow

$$p(\mathbf{y}\mid\theta)=\prod_{j=1}^{n} f_j(\theta)^{\,y_j}\bigl(1-f_j(\theta)\bigr)^{1-y_j}, \tag{1}$$

where $f_j$ is a monotonically increasing function that describes the conditional probability of a correct response to item $j\in\{1,\ldots,n\}$. The marginal likelihood of response vector $\mathbf{y}$ is given by

$$p(\mathbf{y})=\int p(\mathbf{y}\mid\theta)\,g(\theta)\,d\theta, \tag{2}$$

where $g$ is the a priori ability distribution. Finally, the a posteriori density of $\theta$ given response vector $\mathbf{y}$ is

$$g(\theta\mid\mathbf{y})=\frac{p(\mathbf{y}\mid\theta)\,g(\theta)}{p(\mathbf{y})}. \tag{3}$$

These equations illustrate how the item response functions $f_j$ mirror the structure of the observable data. They serve as building blocks of the whole IRT model, and their form affects any inference regarding a test-taker's position on the $\theta$ continuum. Item-fit analysis is therefore crucial in IRT: item-level misfit information allows one to improve the overall model fit by discarding misfitting items from analyses or by replacing the IRT model with one defined over a richer parameter space.
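To make equations (1)–(3) concrete, the following minimal Python sketch (not the author's Stata implementation) evaluates them numerically, assuming a 2PLM response function and a standard normal prior, and approximating the integral in (2) with Gauss–Hermite quadrature as the article does. All names (`irf_2pl`, `cond_like`, `posterior_masses`) are illustrative.

```python
import numpy as np

# Gauss-Hermite rule adapted to a N(0,1) prior: with nodes x and weights w
# from hermgauss, sum_k w_k * f(sqrt(2) x_k) / sqrt(pi) approximates
# the integral of f(theta) g(theta) dtheta for g = N(0,1).
nodes, weights = np.polynomial.hermite.hermgauss(51)
theta_grid = np.sqrt(2.0) * nodes
w_grid = weights / np.sqrt(np.pi)

def irf_2pl(theta, a, b):
    """A 2PLM item response function f_j(theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def cond_like(y, a, b, theta):
    """Equation (1): p(y | theta), evaluated at each grid point."""
    p = irf_2pl(theta[:, None], a, b)                  # shape (K, n)
    return np.prod(p ** y * (1.0 - p) ** (1.0 - y), axis=1)

def posterior_masses(y, a, b):
    """Equations (2)-(3): marginal likelihood p(y) and the posterior of
    theta given y, represented as probability masses on the grid."""
    L = cond_like(y, a, b, theta_grid)
    p_y = np.sum(w_grid * L)                           # eq. (2) by quadrature
    return p_y, w_grid * L / p_y                       # eq. (3) as masses
```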

Many different approaches to item-fit testing have been developed. Despite ample research on the topic, the available solutions either are restricted to special cases of models and testing designs or require resampling. This article describes a universal and computationally feasible method for testing item fit that aims to fill this gap.

Existing Item-Fit Statistics

Item-fit statistics are measures of discrepancy between expected item performance, based on $f_j$, and observed item performance. Usually, the difference between the observed and expected item score is calculated over groups of test-takers with similar ability and aggregated into a single number. The pursuit of an item-fit measure that allows for statistical testing started with the first applications of IRT. A selective account of previous research is presented here to provide context for the approach proposed in this article; an in-depth review of research on item fit is available, for example, in Swaminathan et al. (2007).

Grouping on Point Estimates of θ

The first advances in this field were inspired by solutions available for models without latent variables. These early approaches grouped test-takers on their point estimates of $\theta$ and computed Pearson's $X^2$ (Bock, 1972; Yen, 1981) or the likelihood-ratio statistic $G^2$ (McKinley & Mills, 1985). Uncertainty in the measurement of $\theta$ is not accounted for under such grouping, and the observed counts within groups are treated as independent of each other. Consequently, these fit statistics produce inflated Type I error rates, especially when tests are short (Orlando & Thissen, 2000; Stone & Hansen, 2000). Only for the Rasch family of models, where the number-correct score is a sufficient statistic for $\theta$, do such approaches yield item-fit statistics that follow the postulated asymptotic distribution (Andersen, 1973; Glas & Verhelst, 1989).

Grouping on Observed Sum-Scores

Orlando and Thissen (2000) approached the problem from a different angle. Instead of relying on a partition of the latent trait, they grouped test-takers on their number-correct scores. This allowed the observed frequency of correct responses to be computed directly from the data. To compute the expected frequency of correct responses in a given score group, Orlando and Thissen ingeniously employed the algorithm of Lord and Wingersky (1984). Their version of Pearson's statistic, $S\text{-}X^2$, has become a standard point of reference in studies on item fit because it is fast to compute and has Type I error rates very close to the nominal level.

A likelihood-based approach to item fit with aggregation over sum-scores was developed by Glas (1999) and further expanded by Glas and Suarez-Falcon (2003). This approach stands out from other item-fit measures not only by accounting for the stochastic nature of item parameters, but also because it does not directly rely on the observed-versus-expected difference. Item misfit is modeled by additional group-specific parameters, modification indices, introduced to capture systematic deviation of the data from the item response function. To test for significance, the modification indices are compared to zero using the Lagrange multiplier (LM) test. However, as pointed out by Sinharay (2006), the simulations run by Glas and Suarez-Falcon (2003) show that the Type I error of their statistic can be elevated under some score groupings.

Resampling Methods

Partitioning ability on observed scores, rather than on latent $\theta$, limits the practical applicability of statistics such as $S\text{-}X^2$ or Glas's LM. When test-takers respond to different sets of items, their raw sum-scores become incomparable. Yet much of the appeal of IRT arises precisely from the fact that it can be applied to incomplete testing designs. The pursuit of a solution that relies on residuals computed over the $\theta$ scale has therefore never stopped.

Stone (2000) developed a simulation-based approach: a $\chi^2$ statistic calculated over quadrature points of $\theta$, with a resampling algorithm for determining its distribution under the null hypothesis. Stone's $\chi^2$ has repeatedly been shown to provide acceptable Type I error rates and to exceed $S\text{-}X^2$ in power (Stone & Zhang, 2003; Chon et al., 2010; Chalmers & Ng, 2017). However, this comes at a significant computational cost.

Other computationally intensive approaches have also been developed. Sinharay (2006) and Toribio and Albert (2011) applied the posterior predictive model checking (PPMC) method available within the Bayesian framework (Rubin, 1984). The theoretical advantage of PPMC over Stone's $\chi^2$ or Orlando and Thissen's $S\text{-}X^2$ is that PPMC takes the uncertainty of item parameter estimation into account. However, the authors' simulation studies showed that PPMC tests were too conservative in terms of Type I error, albeit still practically useful in terms of power. Chalmers and Ng (2017) proposed a fit statistic averaged over a set of plausible values (draws from (3)) that requires additional resampling to obtain the $p$-value. Their statistic had deflated Type I error rates, similarly to PPMC.

Problem With Existing Item-Fit Statistics

From a practical standpoint, the existing research on item fit is disappointing. A researcher assessing item fit must either use Orlando and Thissen's $S\text{-}X^2$, which has low power and is not always applicable, or turn to methods that require considerable CPU time. In consequence, they may decide to ignore statistical significance altogether and assess item fit merely on the value of some discrepancy measure. An example of such an approach is found in the PISA 2015 technical report (OECD, 2017, p. 143), where the mean deviation (MD) and root mean square deviation (RMSD) are used with disregard for their sampling properties.

The statistic proposed in this article aims to fill this gap by providing a test of item fit that is applicable when raw-score grouping is not, and that is computationally feasible for practical use.

The Proposed Item-Fit Test

Let $\Delta_1,\ldots,\Delta_r$ be non-intersecting grouping intervals of ability $\theta$, such that

$$\Delta_1\cup\cdots\cup\Delta_r=\mathbb{R},\qquad \Delta_k\cap\Delta_h=\varnothing\ \text{ for } k\neq h. \tag{4}$$

The proposed approach to item-fit analysis compares two types of estimates of the expected item score over the intervals $\Delta_k$: $\hat{O}_j$ and $\hat{E}_j$. The row vector $\hat{O}_j$ is computed from the observed responses to item $j$, $\mathbf{y}_j$, with a covariance estimate $\hat{V}_j$. The row vector $\hat{E}_j$ consists of model-based expectations obtained from $\hat{f}_j$.

To test for model fit, the following Wald-type statistic is employed

$$\chi^2_{w\,j}=(\hat{O}_j-\hat{E}_j)\,\hat{V}_j^{-1}\,(\hat{O}_j-\hat{E}_j)^{T}, \tag{5}$$

which is assumed to be asymptotically chi-square distributed with $r-q$ degrees of freedom, where $q$ is the number of estimated model parameters used in the computation of $\hat{E}_j$.

The following sections define the quantities used in equation (5) and lay out the rationale behind the asymptotic claim about $\chi^2_w$. The presentation is restricted to unidimensional IRT models for dichotomous items, because most research in the field was done under these restrictions and this keeps the presentation simple. Accordingly, $O_j$ and $E_j$ will henceforth be referred to as vectors of the pseudo-observed and expected proportions of correct responses. It should be kept in mind, however, that equation (5) is defined in terms that apply to polytomous items, and equation (4) could also be defined over a multidimensional ability space.

The possibility of developing a Wald-type item-fit statistic such as equation (5) was mentioned by Stone (2000) at the very end of his paper. Stone discussed $\hat{O}_j$ and $\hat{E}_j$ computed at quadrature points, rather than over ability intervals, and did not indicate any way of obtaining $\hat{V}_j$. To give due credit to Stone for the general idea, the symbol $\chi^2_w$ is adopted for equation (5), as in the original article.

Case When Item Parameters are Known

Assume that the IRT model holds and the parameters of $f_j$ are known. The posterior probability that the ability of test-taker $i$ with response vector $\mathbf{y}_i$ falls into interval $\Delta_k$ is the definite integral of (3) over $\Delta_k$:

$$\tau_{ki}=\Pr(\theta\in\Delta_k\mid\mathbf{y}_i)=\int_{\Delta_k} g(\theta\mid\mathbf{y}_i)\,d\theta=\frac{\int_{\Delta_k} p(\mathbf{y}_i\mid\theta)\,g(\theta)\,d\theta}{p(\mathbf{y}_i)}. \tag{6}$$

After observing $m$ response vectors, an estimate of $O_{jk}$, the pseudo-observed proportion of correct responses to item $j$ in interval $\Delta_k$, is given by

$$\bar{O}_{jk}=\frac{\sum_{i=1}^{m} y_{ij}\,\tau_{ki}}{\sum_{i=1}^{m}\tau_{ki}}. \tag{7}$$

$\bar{O}_{jk}$ is a weighted mean of item responses, with weights being the posterior probabilities of test-taker membership in grouping interval $\Delta_k$. This estimate closely resembles the ML estimate of a component mean in a Bernoulli mixture model (McLachlan & Peel, 2000). The mixture-model analogy can be further seen by noting that for each response vector $\mathbf{y}_i$: $\tau_{ki}\geq 0$, $\sum_{k=1}^{r}\tau_{ki}=1$, and $p(y_j\mid\mathbf{y}_i)=\sum_{k=1}^{r}\tau_{ki}\,p(y_j\mid\mathbf{y}_i,\Delta_k)$. The difference from a mixture model is that the a posteriori group membership $\tau_{ki}$ used in (7) is obtained "externally" from the IRT model likelihood and not estimated via the likelihood of a mixture model.

The proposed item-fit test statistic assumes that the vector of estimates of pseudo-observed proportions (7) over all ability intervals (4), $\bar{O}_j$, is asymptotically multivariate normal with mean $O_j$ and covariance matrix $V_j$. That is, as $m\to\infty$,

$$(\bar{O}_j-O_j)\,V_j^{-\frac{1}{2}}\ \xrightarrow{\ d\ }\ N_r(0,\,I_r). \tag{8}$$

A test regarding $O_j$ can be derived from (8). To verify $H_0\colon O_j=O_{0j}$ against $H_1\colon O_j\neq O_{0j}$, the following quadratic form with an asymptotic $\chi^2_r$ distribution is employed:

$$(\bar{O}_j-O_{0j})\,V_j^{-1}(\bar{O}_j-O_{0j})^{T}\ \xrightarrow{\ d\ }\ \chi^2_r. \tag{9}$$

The covariance matrix $V_j$ in (9) is replaced by the estimator $\bar{V}_j=[\bar{v}_{jkh}]_{r\times r}$, whose $(k,h)$th element, the covariance between $\bar{O}_{jk}$ and $\bar{O}_{jh}$, is given by

$$\bar{v}_{jkh}=\frac{\sum_{i=1}^{m}\tau_{ki}\,\tau_{hi}\,(y_{ij}-\bar{O}_{jk})(y_{ij}-\bar{O}_{jh})}{\left(\sum_{i=1}^{m}\tau_{ki}\right)\left(\sum_{i=1}^{m}\tau_{hi}\right)}. \tag{10}$$

As pointed out in Shao (1999, p. 404), (9) remains true if $V_j$ is replaced by a consistent estimator.
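A sketch of the estimators (6), (7), and (10), reusing the quadrature grid and helpers from the block above. As a simplification, the bin probabilities $\tau_{ki}$ are accumulated from posterior masses on the Gauss–Hermite grid, whereas the article integrates (6) with a separate Gauss–Legendre rule per bin; `edges` is an assumed array of bin boundaries $-\infty=e_0<e_1<\cdots<e_r=\infty$.

```python
def tau_matrix(Y, a, b, edges):
    """Equation (6): tau[k, i] = posterior probability that test-taker i's
    ability lies in Delta_k = [edges[k], edges[k+1])."""
    m = Y.shape[0]
    r = len(edges) - 1
    T = np.empty((r, m))
    for i in range(m):
        _, post = posterior_masses(Y[i], a, b)
        for k in range(r):
            in_bin = (theta_grid >= edges[k]) & (theta_grid < edges[k + 1])
            T[k, i] = post[in_bin].sum()     # integral of (3) over Delta_k
    return T

def o_bar(yj, T):
    """Equation (7): pseudo-observed proportions, one per bin."""
    return (T * yj).sum(axis=1) / T.sum(axis=1)

def v_bar(yj, T, O):
    """Equation (10): covariance estimate of the pseudo-observed proportions."""
    r = T.shape[0]
    V = np.empty((r, r))
    for k in range(r):
        for h in range(r):
            num = (T[k] * T[h] * (yj - O[k]) * (yj - O[h])).sum()
            V[k, h] = num / (T[k].sum() * T[h].sum())
    return V
```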

Let $\mathbf{y}_{i\setminus j}$ denote the response vector of test-taker $i$ to all items but item $j$. The model-based probability of a correct response to item $j$ in interval $\Delta_k$, upon observing $\mathbf{y}_{i\setminus j}$, is given by

$$e_{jki}=p(y_j=1\mid\mathbf{y}_{i\setminus j},\Delta_k)=\frac{p(y_j=1,\,\mathbf{y}_{i\setminus j}\mid\Delta_k)}{p(\mathbf{y}_{i\setminus j}\mid\Delta_k)}=\frac{\int_{\Delta_k} f_j(\theta)\,p(\mathbf{y}_{i\setminus j}\mid\theta)\,g(\theta)\,d\theta}{\int_{\Delta_k} p(\mathbf{y}_{i\setminus j}\mid\theta)\,g(\theta)\,d\theta}. \tag{11}$$

After observing $m$ response vectors, a model-based expected proportion of correct responses to item $j$ in interval $\Delta_k$, analogous to (7), can be computed as

$$\bar{E}_{jk}=\frac{\sum_{i=1}^{m} e_{jki}\,\tau_{ki}}{\sum_{i=1}^{m}\tau_{ki}}. \tag{12}$$

Finally, item fit is tested by stating $H_0\colon O_j=\bar{E}_j$ against $H_1\colon O_j\neq\bar{E}_j$ with the test statistic

$$\chi^2_{w\,j}=(\bar{O}_j-\bar{E}_j)\,\bar{V}_j^{-1}(\bar{O}_j-\bar{E}_j)^{T}, \tag{13}$$

which is asymptotically chi-square distributed with $r$ degrees of freedom.
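Continuing the sketch, equations (11)–(13) can be assembled as follows; the leave-one-out likelihood $p(\mathbf{y}_{i\setminus j}\mid\theta)$ is obtained by dropping column $j$, and `scipy.stats.chi2` supplies the reference distribution.

```python
from scipy.stats import chi2

def e_bar(Y, j, a, b, T, edges):
    """Equations (11)-(12): model-expected proportions per bin for item j."""
    m, n = Y.shape
    r = len(edges) - 1
    keep = np.arange(n) != j
    fj = irf_2pl(theta_grid, a[j], b[j])
    E, W = np.zeros(r), T.sum(axis=1)
    for i in range(m):
        L = cond_like(Y[i, keep], a[keep], b[keep], theta_grid)
        for k in range(r):
            in_bin = (theta_grid >= edges[k]) & (theta_grid < edges[k + 1])
            e_jki = (w_grid * fj * L)[in_bin].sum() / (w_grid * L)[in_bin].sum()
            E[k] += e_jki * T[k, i]          # numerator of eq. (12)
    return E / W

def chi_w2(O, E, V, df):
    """Equation (13) / (5): the Wald-type statistic and its p-value."""
    d = O - E
    stat = float(d @ np.linalg.inv(V) @ d)
    return stat, chi2.sf(stat, df)
```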

Case When Item Parameters are Estimated

When the IRT model parameters are estimated from data, the item response functions $f_j$ are replaced with $\hat{f}_j$; the a posteriori group membership $\tau_{ki}$ in (6) is replaced with an estimate $\hat{\tau}_{ki}$; and the pseudo-observed proportion (7), the model-expected proportion (12), and the covariance element (10) are replaced, respectively, by the estimates

$$\hat{O}_{jk}=\frac{\sum_{i=1}^{m} y_{ij}\,\hat{\tau}_{ki}}{\sum_{i=1}^{m}\hat{\tau}_{ki}}, \tag{14}$$

$$\hat{E}_{jk}=\frac{\sum_{i=1}^{m} \hat{e}_{jki}\,\hat{\tau}_{ki}}{\sum_{i=1}^{m}\hat{\tau}_{ki}}, \tag{15}$$

$$\hat{v}_{jkh}=\frac{\sum_{i=1}^{m}\hat{\tau}_{ki}\,\hat{\tau}_{hi}\,(y_{ij}-\hat{O}_{jk})(y_{ij}-\hat{O}_{jh})}{\left(\sum_{i=1}^{m}\hat{\tau}_{ki}\right)\left(\sum_{i=1}^{m}\hat{\tau}_{hi}\right)}. \tag{16}$$

The item-fit statistic (13) for an IRT model with parameters estimated from data becomes $\chi^2_{w\,j}=(\hat{O}_j-\hat{E}_j)\,\hat{V}_j^{-1}(\hat{O}_j-\hat{E}_j)^{T}$, as previously stated in (5). The number of degrees of freedom of $\chi^2_{w\,j}$ needs to be adjusted to account for the number of estimated model parameters used in the computation of $\hat{E}_j$.

Monte Carlo Experiments

This section describes the results of three simulation studies conducted to examine the properties of $\chi^2_w$. The first study dealt with implementation issues; its main purpose was to verify how well the asymptotic claims about $\chi^2_w$ hold under varying approaches to the construction of grouping intervals. The second study replicated a Monte Carlo experiment designed by Stone and Zhang (2003), augmented with an additional condition of incomplete response vectors. This experiment allowed the Type I error rates and power of $\chi^2_w$ to be analyzed against the benchmark of Orlando and Thissen's $S\text{-}X^2$ and Stone's $\chi^2$, and verified the performance of $\chi^2_w$ in an incomplete-data design. The final study was based on the "bad items" design of Orlando and Thissen (2003) and aimed at providing further information about the power of $\chi^2_w$.

Ability parameters in all three simulation studies were sampled from the normal distribution $g(\theta)=N(0,1)$. Item response functions belonged to the logistic family of IRT models: the three-parameter logistic model (3PLM),

$$f_j(\theta)=P(y_j\mid\theta)=c_j+\frac{1-c_j}{1+e^{-a_j(\theta-b_j)}}, \tag{17}$$

the two-parameter logistic model (2PLM; (17) with $c_j=0$), and the one-parameter logistic model (1PLM; (17) with $c_j=0$ and $a_j=a$).
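For reference, (17) and its restrictions can be written as a single helper, consistent with the earlier sketches:

```python
def irf_3pl(theta, a, b, c=0.0):
    """Equation (17): the 3PLM; c = 0 gives the 2PLM, and a common value
    of a across items gives the 1PLM."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))
```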

All analyses were performed in Stata. Item responses under (17) were generated using uirt_sim (Kondratek, 2020). Parameters of the IRT models were estimated by uirt (version 2.1; Kondratek, 2016) with its default settings: EM algorithm, Gauss–Hermite quadrature with 51 integration points, and a 0.0001 stopping rule for the maximum absolute change in parameter values between EM iterations. The uirt software was also used to compute $S\text{-}X^2$ and $\chi^2_w$. Indefinite integrals over $g(\theta)$, needed for the expected proportions used in $S\text{-}X^2$ and to obtain the $p(\mathbf{y}_i)$ in the denominator of (6), were computed by Gauss–Hermite quadrature with 151 integration points. Definite integrals over $g(\theta)$, in the numerator of (6), employed Gauss–Legendre quadrature with 30 integration points in each ability bin $\Delta_k$.

Each Monte Carlo experiment involved 10,000 replications of the simulated conditions. Type I error and power were computed as the percentage of rejections of $H_0$ at significance level $\alpha=0.05$, averaged over replications.

Simulation Study 1 – Number and Range of Grouping Intervals

Implementing $\chi^2_w$ required decisions on the number of grouping intervals (4) and their ranges. The postulated distribution of $\chi^2_w$ is derived from the asymptotic normality of the vector of pseudo-observed proportions (8); therefore, the conventional rule for the appropriateness of the normal approximation to a sample proportion was adopted to govern the ranges of the ability intervals. This resulted in item-specific intervals $\Delta_{jk}$ constructed so that

$$m_{jk}\,\hat{\pi}_{jk}(1-\hat{\pi}_{jk})\approx\text{const}, \tag{18}$$

where $\hat{\pi}_{jk}$ is a simple model-based estimate of the proportion of correct responses in interval $\Delta_{jk}$,

$$\hat{\pi}_{jk}=\hat{p}(y_j=1\mid\Delta_{jk})=\frac{\int_{\Delta_{jk}}\hat{f}_j(\theta)\,g(\theta)\,d\theta}{\int_{\Delta_{jk}}g(\theta)\,d\theta}, \tag{19}$$

and $m_{jk}$ is the expected number of observations in interval $\Delta_{jk}$, $m_{jk}=m\int_{\Delta_{jk}}g(\theta)\,d\theta$.

Ranges of $\Delta_{jk}$ meeting condition (18) were determined by first splitting the ability distribution into 1000 finer intervals $\Delta_{jv}$, $v\in\{1,\ldots,1000\}$, equiprobable with respect to $g(\theta)$, so that $m_{jv}=0.001m$. The finer intervals were then aggregated into $\Delta_{jk}=\bigcup_{v=a}^{b}\Delta_{jv}$ so that $\sum_{v=a}^{b}\hat{\pi}_{jv}(1-\hat{\pi}_{jv})\approx\frac{1}{r}\sum_{v=1}^{1000}\hat{\pi}_{jv}(1-\hat{\pi}_{jv})$, where $r$ is the desired number of intervals $\Delta_{jk}$ in (4). Computation of $\hat{\pi}_{jv}$ in this step was performed with Gauss–Legendre quadrature with 11 integration points.
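A sketch of this binning rule under the same assumptions as before (a 2PLM item and $N(0,1)$ ability). As a simplification, $\hat{\pi}_{jv}$ in each fine slice is approximated by evaluating $\hat{f}_j$ at the slice's probability midpoint, rather than by the 11-point Gauss–Legendre rule used in the article.

```python
from scipy.stats import norm

def make_bins(a_j, b_j, r, n_fine=1000):
    """Split N(0,1) ability into n_fine equiprobable slices, then aggregate
    consecutive slices until each of the r bins carries roughly 1/r of the
    total pi*(1-pi) mass, per criterion (18)."""
    mid = norm.ppf((np.arange(n_fine) + 0.5) / n_fine)   # slice midpoints
    pi = irf_2pl(mid, a_j, b_j)                          # stand-in for eq. (19)
    mass = pi * (1.0 - pi)
    target = mass.sum() / r
    edges, acc = [-np.inf], 0.0
    for v in range(n_fine - 1):
        acc += mass[v]
        if acc >= target and len(edges) < r:
            edges.append(norm.ppf((v + 1) / n_fine))     # cut between slices
            acc = 0.0
    edges.append(np.inf)
    return np.array(edges)
```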

The behavior of $\chi^2_w$ upon adopting criterion (18) with three ability bins, when testing the fit of an easy 2PLM item ($a_j=1.7$ and $b_j=-1.84$) under a true $H_0$, is illustrated in Figure 1 (upper panel) and compared to an alternative equiprobable division of ability (lower panel). The graphs in Figure 1 were obtained in a simple Monte Carlo experiment in which the item tested for fit was embedded in a 30-item 2PLM test. The remaining 29 items had parameters sampled from $\ln a_{vj}\sim N(\ln 1.7,\,0.4)$ and $b_{vj}\sim N(0,1)$, and the sample size was $m=1000$. In each replication, $\chi^2_{w\,j}$ was computed and the pseudo-observed proportions of correct responses $\hat{O}_{jk}$ (14) were stored. Upon completing 10,000 replications, the resulting empirical distribution of $\chi^2_{w\,j}$ was compared against the theoretical $\chi^2(1)$ in a Q–Q plot, and the pseudo-observed proportions were transformed according to (8) so that standardized variables were obtained and compared against the theoretical $N(0,1)$ on histograms.

Figure 1. Distribution of $\chi^2_w$ under different choices of interval range.

The equiprobable division would be a tempting alternative to (18), as it results in an equal expected number of observations in each interval, $m_{jk}\approx\text{const}$. By being independent of item parameters, it would also decrease the computational cost of $\chi^2_w$, because the group membership probabilities $\hat{\tau}_{ki}$ would need to be obtained only once. However, the simulation results presented in Figure 1 indicate that $\chi^2_{w\,j}$ grossly deviated from the theoretical $\chi^2(1)$ under such a division. The expected number of observations in each bin exceeds 333, but because of the extreme easiness of the item, the rightmost bin is associated with a very small value of the $m_{jk}\hat{\pi}_{jk}(1-\hat{\pi}_{jk})$ criterion. The transformed proportion of correct responses in this bin exhibits a visible ceiling effect, and thus the $\chi^2(1)$ assumption does not hold. Yet when $\chi^2_{w\,j}$ was computed with intervals constructed using criterion (18), it approximated $\chi^2(1)$ well, even under these rather difficult conditions in terms of sample size and item difficulty.

A second implementation decision concerned the number of ability bins, $r$. It was expected that increasing $r$ would be detrimental to the normal approximation of the pseudo-observed proportions. From the standpoint of Type I error, the safest approach would therefore be to use the smallest possible number of intervals, $r=q+1$, leading to a $\chi^2_w$ with a single degree of freedom. However, the relation between $r$ and the power of $\chi^2_w$ was not obvious. On the one hand, increasing $r$ would allow a locally finer grade of deviances between (14) and (15) to be detected. On the other hand, it would increase the entries of the covariance matrix (16) because of the smaller effective sample size per interval.

Number of Ability Bins – Simulation Design

To investigate the properties of $\chi^2_w$ under a varying number of bins, a Monte Carlo experiment was conducted under a scheme similar to the one used for Figure 1. The conditions were extended to cover different IRT models (1PLM, 2PLM, and 3PLM), items of varying marginal difficulty $\pi_j=\int f_j(\theta)\,g(\theta)\,d\theta$ ($\pi_j=0.5$ and $\pi_j=0.9$ for all models, and additionally $\pi_j=0.3$ for the 3PLM), two sample sizes ($m\in\{400,\,4000\}$), and two test lengths ($n\in\{10,\,40\}$). Under each condition, the Type I error was obtained for both the fixed- and estimated-parameters cases, and Q–Q plots were drawn for selected numbers of ability bins for closer assessment of the distribution of $\chi^2_w$. Additionally, power to detect misfit was analyzed under a varying number of ability bins by fitting a 1PLM or 2PLM to an item generated under the 3PLM. This article presents only the main conclusions from the experiment; detailed results under all tested conditions are provided in the online supplement.

Number of Ability Bins – Simulation Results

Figure 2 depicts the relationship between the number of ability intervals and the resulting detection rates for an item of medium difficulty ($\pi_j=0.5$) that was simulated as 3PLM and then estimated as 3PLM (Type I error) or as either 2PLM or 1PLM (statistical power). It illustrates that increasing the number of ability bins decreases the statistical power of $\chi^2_w$. Additionally, increasing $r$ eventually leads to elevated Type I error rates. These patterns were also seen under the other conditions considered in the experiment, with the detrimental effect of increased $r$ on the Type I error being especially prominent for difficult items in small samples (online supplement).

Figure 2. Type I error and power of $\chi^2_w$ in relation to the number of ability bins.

Based on these results, it was decided to implement $\chi^2_w$ with $r=q+1$ intervals for the 3PLM and 2PLM, and for the 1PLM with either $r=3$ if criterion (18) exceeds 20 or $r=2$ otherwise. The threshold of 20, as a precaution, doubles the conventional rule for when a normal approximation to a sample proportion is appropriate. These settings were used in all the simulation studies covered in the rest of the article.
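Under these settings, a hypothetical end-to-end computation for one item of an estimated 2PLM ($q=2$, hence $r=q+1=3$ bins and $r-q=1$ degree of freedom) would chain the sketches above; `Y`, `a_hat`, and `b_hat` are assumed names for the response matrix and the estimated item parameters.

```python
# A hypothetical end-to-end run of the sketches above for item j.
j = 0
edges = make_bins(a_hat[j], b_hat[j], r=3)   # bins per criterion (18)
T = tau_matrix(Y, a_hat, b_hat, edges)       # eq. (6)
O = o_bar(Y[:, j], T)                        # eq. (14)
E = e_bar(Y, j, a_hat, b_hat, T, edges)      # eq. (15)
V = v_bar(Y[:, j], T, O)                     # eq. (16)
stat, p = chi_w2(O, E, V, df=1)              # eq. (5), df = r - q
```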

Q–Q plots were obtained under the adopted rule for the number of intervals for a more detailed verification of the postulated asymptotic distribution of $\chi^2_w$ (online supplement). For $m=4000$, the empirical distribution of $\chi^2_w$ was well aligned with the theoretical $\chi^2$ under all tested conditions, in both the known- and estimated-parameters cases. The approximation was also well behaved for $m=400$ and moderate item difficulty. However, the combination of small sample size and extreme item difficulty resulted in deviation of $\chi^2_w$ from its theoretical asymptotic distribution. Note that under such conditions the criterion (18) for the appropriateness of the normal approximation to a sample proportion is small in value, even when the lowest possible number of ability intervals is used. This warns that $\chi^2_w$ should be used with caution whenever the fit of an extremely easy or difficult item is to be assessed in a small sample. The condition $m_{jk}\hat{\pi}_{jk}(1-\hat{\pi}_{jk})>20$ seems to be a good guideline for deciding whether results of $\chi^2_w$ are trustworthy (see Figure 1).

Simulation Study 2 – Type I Error and Power

Simulation Design

This study replicated the Monte Carlo experiment designed by Stone and Zhang (2003). Three test lengths, $n\in\{10,\,20,\,40\}$, were crossed with three sample sizes, $m\in\{500,\,1000,\,2000\}$, and data were generated under two IRT models: 2PLM and 3PLM. Under the 2PLM generating scenario, a set of 10 pairs of item parameters was constructed by crossing two values of the item discrimination parameter, $a_j\in\{1.2,\,2.2\}$, with five values of the item difficulty parameter, $b_j\in\{-2,\,-1,\,0,\,1,\,2\}$. The 20-item and 40-item tests were built by adding another 10 or 3×10 items defined by repetition of the same parameter set. The 3PLM scenario used the same discriminations and difficulties as the 2PLM; all items except the easiest one, with $b_j=-2$, were given a pseudo-guessing parameter $c_j=0.25$.
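A sketch of this generating design (assumed helper names; `irf_3pl` comes from the earlier block):

```python
rng = np.random.default_rng(2022)

def simulate_stone_zhang(m, repeats=1, model="3PLM"):
    """Build the 10 base (a, b) pairs, repeat them to reach the desired test
    length, add c = 0.25 to all but the easiest item(s) per the design
    description, and draw dichotomous responses."""
    a = np.tile(np.repeat([1.2, 2.2], 5), repeats)
    b = np.tile([-2.0, -1.0, 0.0, 1.0, 2.0] * 2, repeats)
    c = np.where(b == -2.0, 0.0, 0.25) if model == "3PLM" else np.zeros_like(b)
    ability = rng.standard_normal(m)
    P = irf_3pl(ability[:, None], a, b, c)
    return (rng.random(P.shape) < P).astype(int), a, b, c
```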

This design was extended to create additional incomplete-data conditions by taking the complete data generated under the original design for $n\in\{20,\,40\}$ and $m\in\{1000,\,2000\}$ and treating a random 50% of the responses to each item as missing. As a result, four additional generating conditions were introduced in which the number of observations per item and the expected number of items per observation were half those of the original complete data.

In each replication, all item parameters were estimated from the simulated data under two IRT models: 1PLM and 2PLM. In the complete-data conditions, Orlando and Thissen's $S\text{-}X^2$ and the $\chi^2_w$ statistic were computed from the estimates of the IRT model. In the missing-responses scenario, only $\chi^2_w$ was obtained, because $S\text{-}X^2$ is not applicable to data with incomparable sum-scores. The case in which the generating model was the same as the estimating model (2PLM) served to analyze Type I error. The other three combinations, in which the generating model had more parameters than the estimating model, were used to assess the statistical power of $S\text{-}X^2$ and $\chi^2_w$.

Simulation Results

Table 1 summarizes the performance of $S\text{-}X^2$ and $\chi^2_w$ when both the generating and the estimating model were 2PLM. Entries in the table are percentages of rejections of $H_0$ at significance level $\alpha=0.05$, averaged over all items and all replications. Results for $S\text{-}X^2$ and $\chi^2$ reported in Stone and Zhang (2003) are included for reference; it should be kept in mind that their results were obtained with two orders of magnitude fewer replications. Figure 3 expands the analysis of Type I error of $S\text{-}X^2$ and $\chi^2_w$ by presenting rejection rates of a true $H_0$ at the item level. Results for the first 10 items are plotted against a 95% confidence bound around the nominal significance level $\alpha=0.05$, assuming a standard error of $\sqrt{\alpha(1-\alpha)/10^4}$.

Table 1. Type I error rates for different item-fit statistics (%).

                           Results of current study                    Results of Stone and Zhang
Test length  Sample size   S-X²        χw²         χw²                 S-X²     χ²
                           (complete)  (complete)  (50% missing)
n = 10       m = 500       4.8         4.2         —                   4        5
             m = 1000      4.7         3.8         —                   4        4
             m = 2000      4.9         3.7         —                   5        3
n = 20       m = 500       4.9         4.7         —                   5        5
             m = 1000      4.9         4.3         4.3                 5        3
             m = 2000      5.0         4.2         4.1                 5        3
n = 40       m = 500       4.7         5.1         —                   6        6
             m = 1000      4.8         4.8         4.8                 4        4
             m = 2000      5.0         4.7         4.5                 3        4

Note. "—" indicates that the 50%-missing condition was not included for that cell.
Figure 3. Type I error rates for $S\text{-}X^2$ (top) and $\chi^2_w$ (bottom) conditional on item parameters.

The Type I error of $S\text{-}X^2$, as seen in Table 1 and Figure 3, was almost exactly nominal for all items, test lengths, and sample sizes considered in the study. This confirms what was previously observed by Orlando and Thissen (2000) and Stone and Zhang (2003).

The averaged Type I error of $\chi^2_w$ was in the 0.037–0.045 range (Table 1), which is acceptable. These values do not exceed the ranges reported for both $S\text{-}X^2$ and $\chi^2$ by Stone and Zhang (2003) under the same experimental conditions. However, the item-level information in Figure 3 reveals that for highly discriminating items ($a_j=2.2$) of moderate difficulty ($b_j\in\{-1,\,0,\,1\}$), the Type I error of $\chi^2_w$ is deflated. This effect diminishes as test length increases. From a practical standpoint, deflated false rejection rates would not be a problem as long as $\chi^2_w$ has sufficient power to detect misfit.

The power of $S\text{-}X^2$ and $\chi^2_w$ was examined by averaging rejection rates in the three scenarios in which the model used for simulating responses had more item parameters than the model used in estimation: 2PLM-1PLM, 3PLM-1PLM, and 3PLM-2PLM. The results are presented in Table 2, together with the power of $S\text{-}X^2$ and $\chi^2$ from the simulation by Stone and Zhang (2003). The $\chi^2_w$ statistic outperformed $S\text{-}X^2$ in power under all experimental conditions. Compared to the results for $\chi^2$ reported in Stone and Zhang (2003), $\chi^2_w$ was more sensitive in detecting misfit under the 3PLM-2PLM scenario. In the other misfit scenarios, $\chi^2$ and $\chi^2_w$ achieved similar power for long tests ($n=40$), and the reported power of $\chi^2$ exceeded that of $\chi^2_w$ for shorter tests. The power of Stone's $\chi^2$ seems to be unaffected by test length, which makes it an especially useful item-fit measure for short tests.

Table 2. Power rates for different item-fit statistics (%).

                           Results of current study                    Results of Stone and Zhang
Test length  Sample size   S-X²        χw²         χw²                 S-X²     χ²
                           (complete)  (complete)  (50% missing)

Simulated 2PLM – Estimated 1PLM
n = 10       m = 500       23.8        35.1        —                   26       51
             m = 1000      49.5        59.9        —                   53       75
             m = 2000      80.7        86.5        —                   81       94
n = 20       m = 500       23.2        47.5        —                   23       56
             m = 1000      46.2        73.1        36.4                45       78
             m = 2000      78.7        93.7        60.8                80       96
n = 40       m = 500       20.3        52.3        —                   22       52
             m = 1000      38.4        77.2        46.6                40       78
             m = 2000      71.9        95.5        71.9                75       94

Simulated 3PLM – Estimated 1PLM
n = 10       m = 500       36.5        47.8        —                   35       69
             m = 1000      58.6        64.3        —                   59       82
             m = 2000      75.0        77.8        —                   75       88
n = 20       m = 500       41.4        61.9        —                   42       67
             m = 1000      63.7        74.9        49.7                65       80
             m = 2000      75.9        86.2        66.0                77       87
n = 40       m = 500       41.5        66.2        —                   42       68
             m = 1000      63.4        78.0        61.5                64       80
             m = 2000      74.6        88.2        74.6                75       90

Simulated 3PLM – Estimated 2PLM
n = 10       m = 500       7.1         23.7        —                   7        13
             m = 1000      9.3         40.2        —                   10       30
             m = 2000      13.3        62.4        —                   14       46
n = 20       m = 500       8.3         25.9        —                   9        13
             m = 1000      11.5        41.0        23.9                10       25
             m = 2000      16.8        58.6        40.2                17       44
n = 40       m = 500       8.0         20.6        —                   6        15
             m = 1000      11.4        32.1        25.1                7        28
             m = 2000      17.4        46.1        40.3                12       44

To conclude the remarks on this Monte Carlo experiment, it is worth noting that $\chi^2_w$ performed well when a random 50% of item responses were missing, both in terms of its averaged Type I error rates (Table 1) and its power (Table 2). Results for $\chi^2_w$ under the missing-responses condition closely resemble those observed for complete data with half as many items and observations, which is exactly the expected outcome. This gives $\chi^2_w$ an advantage over methods that rely on observed sum-score partitioning of ability, like $S\text{-}X^2$.

Simulation Study 3 – Power

Simulation Design

The last experiment adapted a design proposed by Orlando and Thissen (2003) to analyze power in misfit scenarios that go beyond fitting a restricted IRT model to data generated from an unrestricted model. It involved three "bad" items described by the response functions

$$\text{BAD1:}\quad P(y_j\mid\theta)=\frac{c_j}{1+e^{a_j(\theta-(b_j-d_j))}}+\frac{1}{1+e^{-a_j(\theta-b_j)}},$$

where $a_j=1.7\cdot 2.5$, $b_j=1$, $c_j=0.25$, and $d_j=1.5$;

$$\text{BAD2:}\quad P(y_j\mid\theta)=\frac{d_j}{1+e^{-a_j(\theta-b_j)}},$$

where $a_j=1.7\cdot 2$, $b_j=0.5$, and $d_j=0.7$; and

$$\text{BAD3:}\quad P(y_j\mid\theta)=\frac{x_j}{1+e^{-a_j(\theta-b_j)}}+\frac{y_j}{1+e^{-a_j(\theta-(b_j-d_j))}},$$

where $a_j=1.7\cdot 3.5$, $b_j=1$, $d_j=3$, $x_j=0.55$, and $y_j=0.45$.
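Under the reconstruction above (the signs of the exponents are inferred, since the extracted equations lost them), the three ICCs can be sketched as:

```python
def bad1(theta, a=1.7 * 2.5, b=1.0, c=0.25, d=1.5):
    """BAD1: a decaying guessing-like component plus a rising logistic,
    yielding a non-monotone ICC that a 3PLM cannot reproduce."""
    return c / (1 + np.exp(a * (theta - (b - d)))) + 1 / (1 + np.exp(-a * (theta - b)))

def bad2(theta, a=1.7 * 2.0, b=0.5, d=0.7):
    """BAD2: a logistic curve with upper asymptote d < 1."""
    return d / (1 + np.exp(-a * (theta - b)))

def bad3(theta, a=1.7 * 3.5, b=1.0, d=3.0, x=0.55, y=0.45):
    """BAD3: a mixture of two logistic steps centered at b and b - d."""
    return x / (1 + np.exp(-a * (theta - b))) + y / (1 + np.exp(-a * (theta - (b - d))))
```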

These bad items were embedded, one at a time, in tests consisting of $n\in\{10,\,20,\,40,\,80\}$ total items. The remaining items were drawn from the 2PLM with $\ln a_{vj}\sim N(0,\,0.5)$ and $b_{vj}\sim N(0,1)$. For each test length, samples of $m\in\{500,\,1000,\,2000\}$ response vectors were generated and the IRT model was fit to the data. Items BAD2, BAD3, and all remaining items were modeled with the 2PLM without imposing priors on item parameters; item BAD1 was modeled with the 3PLM using noninformative priors: $N(0,3)$ for $b_j$, $N(1.1,3)$ for $a_j$, and $\text{Beta}(1.01,\,1.03)$ for $c_j$. The estimated item parameters were used to compute $\chi^2_w$ and $S\text{-}X^2$ for the three bad items.

This design deviated from the original conditions used by Orlando and Thissen (2003) by adopting the 3PLM only for item BAD1 instead of for all items. This was motivated by the observation that the estimated $c_j$ parameter for items BAD2 and BAD3 approached 0 as $m$ increased; the 3PLM would be an unnecessarily over-parametrized choice for these items.

Simulation Results

The resulting power rates (Table 3) support the previous evidence (Table 2) that $\chi^2_w$ is more sensitive in detecting misfit than $S\text{-}X^2$. The power of both statistics rose with test length and sample size, but under all tested conditions $\chi^2_w$ exceeded $S\text{-}X^2$.

Table 3. Power rates for three types of misfitting items.

Item   Test length   m = 500             m = 1000            m = 2000
                     S-X²      χw²       S-X²      χw²       S-X²      χw²
BAD1   n = 10        0.379     0.634     0.551     0.845     0.790     0.969
       n = 20        0.539     0.868     0.753     0.982     0.955     0.999
       n = 40        0.598     0.964     0.861     0.999     0.992     1.000
       n = 80        0.533     0.987     0.865     1.000     0.998     1.000
BAD2   n = 10        0.130     0.228     0.209     0.413     0.378     0.680
       n = 20        0.221     0.450     0.406     0.749     0.731     0.953
       n = 40        0.315     0.607     0.586     0.875     0.912     0.993
       n = 80        0.351     0.641     0.659     0.905     0.957     0.996
BAD3   n = 10        0.221     0.520     0.444     0.802     0.783     0.969
       n = 20        0.359     0.764     0.756     0.967     0.982     1.000
       n = 40        0.443     0.842     0.873     0.990     1.000     1.000
       n = 80        0.444     0.851     0.878     0.992     1.000     1.000

Summary

Multiple Monte Carlo experiments were conducted to examine the properties of the new $\chi^2_w$ item-fit statistic. The Type I error of $\chi^2_w$ was close to the nominal level, and it outperformed Orlando and Thissen's $S\text{-}X^2$ in power under all tested conditions. In the 3PLM-2PLM misfit scenario, $\chi^2_w$ was also more sensitive than Stone's $\chi^2$. The results are promising, and $\chi^2_w$ is a viable candidate for testing item fit. It is especially attractive because it can be applied to incomplete testing designs, unlike alternatives that use observed scores for partitioning, and it is far less computationally demanding than the available statistics that involve residuals over the latent trait.

It is worth pointing out other possible applications of the item-fit approach proposed in this article. First, $\chi^2_w$ generalizes straightforwardly to polytomous items and to multivariate abilities; moreover, the quadrature used in its implementation can be replaced with other solutions to cover cases in which ability is not normally distributed. Second, the estimates of the observed proportions and of the covariance matrix in (5) can be used to construct confidence bounds around the observed proportions; such confidence intervals can be plotted against $\hat{f}_j$ to aid graphical analysis of item fit. Finally, the approach outlined here can also be applied to differential item functioning (DIF) analysis.

It should be noted that the mathematical underpinnings of $\chi^2_w$ laid out in this article are incomplete. The asymptotic multivariate normality of the vector of pseudo-observed proportions, (8), is assumed without proof, and the consistency of the proposed covariance estimator, (10), is likewise only assumed. Careful consideration should also be given to how replacing the item response functions of the known-parameters case with their ML estimates affects the asymptotic claims about $\chi^2_w$, especially when item parameters are estimated with priors. The results of the simulation studies support the asymptotic claims made about $\chi^2_w$, but they cannot be automatically generalized to conditions deviating from the specific ones considered here. This opens ground for future research on $\chi^2_w$.


Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Science Centre research grant number 2015/17/N/HS6/02965.

Supplemental Material: Supplemental material for this article is available online.

ORCID iD

Bartosz Kondratek https://orcid.org/0000-0002-4779-0471

References

  1. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
  2. Andersen E. B. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38(1), 123–140. 10.1007/bf02291180
  3. Bock R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37(1), 29–51.
  4. Bolt D. M., Deng S., Lee S. (2014). IRT model misspecification and measurement of growth in vertical scaling. Journal of Educational Measurement, 51(2), 141–162. 10.1111/jedm.12039
  5. Chalmers R. P., Ng V. (2017). Plausible-value imputation statistics for detecting item misfit. Applied Psychological Measurement, 41(5), 372–387. 10.1177/0146621617692079
  6. Chon K. W., Lee W., Dunbar S. B. (2010). A comparison of item fit statistics for mixed IRT models. Journal of Educational Measurement, 47(3), 318–338. 10.1111/j.1745-3984.2010.00116.x
  7. Glas C. A. W. (1999). Modification indices for the 2-PL and the nominal response model. Psychometrika, 64(3), 273–294. 10.1007/bf02294296
  8. Glas C. A. W., Suarez-Falcon J. C. (2003). A comparison of item-fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27(2), 87–106. 10.1177/0146621602250530
  9. Glas C. A. W., Verhelst N. D. (1989). Extensions of the partial credit model. Psychometrika, 54(4), 635–659. 10.1007/bf02296401
  10. Kondratek B. (2016). uirt: Stata module to fit unidimensional Item Response Theory models. In Statistical Software Components S458247. Boston College Department of Economics.
  11. Kondratek B. (2020). uirt_sim: Stata module to simulate data from unidimensional Item Response Theory models. In Statistical Software Components S458749. Boston College Department of Economics.
  12. Lord F. M., Wingersky M. S. (1984). Comparison of IRT true-score and equipercentile observed-score "equatings". Applied Psychological Measurement, 8(4), 453–461.
  13. McKinley R., Mills C. (1985). A comparison of several goodness-of-fit statistics. Applied Psychological Measurement, 9(1), 49–57. 10.1177/014662168500900105
  14. McLachlan G. J., Peel D. (2000). Finite mixture models. New York: Wiley.
  15. Organization for Economic Co-operation and Development (OECD) (2017). PISA 2015 technical report. Paris, France: OECD.
  16. Orlando M., Thissen D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24(1), 50–64. 10.1177/01466216000241003
  17. Orlando M., Thissen D. (2003). Further investigation of the performance of S−X2: An item fit index for use with dichotomous item response theory models. Applied Psychological Measurement, 27(4), 289–298.
  18. Rubin D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics, 12(4), 1151–1172. 10.1214/aos/1176346785
  19. Shao J. (1999). Mathematical statistics. New York: Springer-Verlag.
  20. Sinharay S. (2006). Bayesian item fit analysis for unidimensional item response theory models. British Journal of Mathematical and Statistical Psychology, 59(2), 429–449. 10.1348/000711005x66888
  21. Stone C. A. (2000). Monte Carlo based null distribution for an alternative goodness-of-fit test statistic in IRT models. Journal of Educational Measurement, 37(1), 58–75. 10.1111/j.1745-3984.2000.tb01076.x
  22. Stone C. A., Hansen M. A. (2000). The effect of errors in estimating ability on goodness-of-fit tests for IRT models. Educational and Psychological Measurement, 60(6), 974–991. 10.1177/00131640021970907
  23. Stone C. A., Zhang B. (2003). Assessing goodness of fit of item response theory models: A comparison of traditional and alternative procedures. Journal of Educational Measurement, 40(4), 331–352. 10.1111/j.1745-3984.2003.tb01150.x
  24. Swaminathan H., Hambleton R. K., Rogers H. J. (2007). Assessing the fit of item response theory models. In Rao C. R., Sinharay S. (Eds.), Handbook of statistics. New York, NY: Elsevier.
  25. Toribio S. G., Albert J. H. (2011). Discrepancy measures for item fit analysis in item response theory. Journal of Statistical Computation and Simulation, 81(10), 1345–1360. 10.1080/00949655.2010.485131
  26. Wainer H., Thissen D. (1987). Estimating ability with the wrong model. Journal of Educational Statistics, 12(4), 339–368. 10.2307/1165054
  27. Woods C. M. (2008). Consequences of ignoring guessing when estimating the latent density in item response theory. Applied Psychological Measurement, 32(5), 371–384. 10.1177/0146621607307691
  28. Yen W. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5(2), 245–262. 10.1177/014662168100500212
