Applied Psychological Measurement. 2022 Sep 17;47(1):3–18. doi: 10.1177/01466216221108077

The Standardized $S$-$X^2$ Statistic for Assessing Item Fit

Zhuangzhuang Han, Sandip Sinharay, Matthew S Johnson, Xiang Liu
PMCID: PMC9679924  PMID: 36425289

Abstract

The $S$-$X^2$ statistic (Orlando & Thissen, 2000) is popular among researchers and practitioners who are interested in the assessment of item fit. However, the statistic suffers from the Chernoff–Lehmann problem (Chernoff & Lehmann, 1954) and hence does not have a known asymptotic null distribution. This paper suggests a modified version of the $S$-$X^2$ statistic that is based on the modified Rao–Robson $\chi^2$ statistic (Rao & Robson, 1974). A simulation study and a real data analysis demonstrate that the use of the modified statistic instead of the $S$-$X^2$ statistic would lead to fewer items being flagged for misfit.

Keywords: item response theory model fit, Orlando-Thissen statistic, Pearson’s $\chi^2$ statistic, Rao-Robson’s modified $\chi^2$ statistic

Introduction

Standard 4.10 of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council for Measurement in Education, 2014) recommends documenting evidence of model-data fit when an item response theory (IRT) model is employed in test development and score reporting. In practice, analysis of model-data fit for IRT models involves the use of item-fit residuals and $\chi^2$-type statistics (Hambleton & Han, 2005). Among the $\chi^2$-type statistics for IRT models, the $S$-$X^2$ statistic (Orlando & Thissen, 2000) is popular, presumably for four reasons. First, to compute $S$-$X^2$, one has to divide the examinees into groups based on their observed total scores rather than the estimated abilities. Second, $S$-$X^2$ has been found to perform respectably in terms of Type I error rates and power in simulation studies (e.g., Glas & Suarez-Falcón, 2003; Sinharay, 2006; Sinharay & Lu, 2008; Stone & Zhang, 2003). Third, the simple and intuitive nature of $S$-$X^2$ has allowed it to be easily generalized to cases with polytomous items (Kang & Chen, 2008, 2010), multidimensional examinee abilities (Zhang & Stone, 2007), unfolding models (Roberts, 2008), and cognitive diagnostic models (e.g., Sorrel et al., 2017). Fourth, $S$-$X^2$ is implemented in multiple IRT software packages including irtplay (Lim, 2020), mirt (Chalmers, 2012), and IRTPRO (Cai et al., 2011).

Notwithstanding these appealing features, $S$-$X^2$ should not be used without considering its limitations. As noted by researchers such as Sinharay (2006), $S$-$X^2$, which is a special case of the Pearson’s $\chi^2$ statistic (Pearson, 1900), does not have a known asymptotic null distribution in typical IRT applications where the traditional marginal maximum likelihood estimates (MMLEs) of item parameters are used to compute the statistic. Instead, the values of $S$-$X^2$ are stochastically larger than those from the theorized ($\chi^2$) distribution of the statistic. As a consequence, the Type I error rates of $S$-$X^2$ tend to be slightly larger than the nominal level even for large samples, which has been observed in multiple simulation studies (e.g., Glas & Suarez-Falcón, 2003; Sinharay, 2006; Sinharay & Lu, 2008). The aim of this paper is to introduce a modified $S$-$X^2$ statistic that has a known $\chi^2$ asymptotic null distribution.

The next section includes a review of the Pearson’s $\chi^2$ statistic used for assessing general model-data fit and the $S$-$X^2$ statistic (Orlando & Thissen, 2000) for assessing item fit, followed by a brief review of a potential problem associated with the use of the Pearson’s $\chi^2$ statistic (Chernoff & Lehmann, 1954). The section also includes a description of the modified Pearson’s $\chi^2$ statistic that Rao and Robson (1974) suggested to overcome the Chernoff–Lehmann problem. The method section presents the details of our modified $S$-$X^2$ statistic that is a special case of the modified Pearson’s $\chi^2$ statistic. The section on simulation studies compares the modified $S$-$X^2$ statistic with the original $S$-$X^2$ statistic with respect to Type I error rates and power. The two statistics are compared using a real data set in the penultimate section. Conclusions and recommendations are provided in the last section. Although the $S$-$X^2$ statistic has been extended to tests with polytomously scored items (Kang & Chen, 2008, 2010), we will only consider tests with dichotomously scored items.

Background: Pearson’s $\chi^2$, Orlando-Thissen’s $S$-$X^2$, Chernoff–Lehmann Problem, and Rao–Robson’s Modified $\chi^2$

Pearson’s $\chi^2$ Statistic

Let us assume that a sample with $N$ independent observations, $y_1, y_2, \ldots, y_N$, is available from a population. Suppose that $p(y_i; \eta)$, the probability distribution of $y_i$, involves a parameter vector $\eta$ with $L$ elements. Suppose that the observations are partitioned into $K$ groups (or cells) and the proportion of observations belonging to group $k$ is $p_k = N_k/N$, where $N_k$ represents the number of observations in group $k$, $k = 1, 2, \ldots, K$. Let $\pi_k(\eta)$ denote the expected value of $p_k$ under the assumed probability distribution.

Pearson’s $\chi^2$ statistic (Pearson, 1900) for assessing goodness of fit, denoted henceforth as $P$-$X^2$, is defined as

$$P\text{-}X^2 = N \sum_{k=1}^{K} \frac{(p_k - \pi_k(\eta))^2}{\pi_k(\eta)} = [u(\eta)]^{T} u(\eta), \qquad (1)$$

where

$$u(\eta) = \sqrt{N} \left( \frac{p_1 - \pi_1(\eta)}{\sqrt{\pi_1(\eta)}}, \frac{p_2 - \pi_2(\eta)}{\sqrt{\pi_2(\eta)}}, \ldots, \frac{p_K - \pi_K(\eta)}{\sqrt{\pi_K(\eta)}} \right)^{T}. \qquad (2)$$

In practice, the parameter vector $\eta$ is unknown and $P$-$X^2$ is computed by replacing $\eta$ by $\hat{\eta}$, which is the maximum likelihood estimate (MLE) of $\eta$, and is assumed to follow a $\chi^2$ distribution with $K - L - 1$ degrees of freedom (df), or, the $\chi^2_{K-L-1}$ distribution, for large samples under no item misfit.
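As a small illustration of equations (1) and (2), the following sketch computes the statistic from cell counts and model probabilities. The function name `pearson_x2` and the counts and probabilities are hypothetical and chosen only for the example; parameter estimation is not shown.

```python
import math

def pearson_x2(counts, probs):
    """Return (P-X2, u) for observed cell counts and model cell probabilities."""
    n = sum(counts)
    # u_k = sqrt(N) * (p_k - pi_k) / sqrt(pi_k), as in equation (2)
    u = [math.sqrt(n) * (c / n - pi) / math.sqrt(pi)
         for c, pi in zip(counts, probs)]
    # P-X2 is the squared length of u, as in equation (1)
    return sum(x * x for x in u), u

counts = [18, 55, 27]        # hypothetical cell counts, N = 100
probs = [0.25, 0.50, 0.25]   # hypothetical model probabilities pi_k
x2, u = pearson_x2(counts, probs)
print(round(x2, 4))          # 2.62
```

The value agrees with the familiar form, the sum over cells of (observed count − expected count)² / expected count: 49/25 + 25/50 + 4/25 = 2.62.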

Orlando and Thissen’s $S$-$X^2$ Statistic

Orlando and Thissen (2000) developed the $S$-$X^2$ statistic, which is a special case of the Pearson’s $\chi^2$ statistic, to assess item fit in the context of IRT models for dichotomously scored items. Suppose that we are interested in assessing item fit for a $J$-item test. To compute $S$-$X^2$ for a given item of interest, the examinees are divided into $(J + 1)$ groups, where group $k$ includes all the examinees whose raw score is $k$. Let $N_k$ denote the size of group $k$. One then computes, for each group $k$, $O_k$, which is the observed proportion of test-takers in the group who answered the item correctly. The statistic $S$-$X^2$ for the item is then computed as

$$S\text{-}X^2 = \sum_{k=1}^{K} \frac{N_k [O_k - E_k(\eta)]^2}{E_k(\eta)[1 - E_k(\eta)]} = [v(\eta)]^{T} v(\eta), \qquad (3)$$

where $K = J - 1$, $E_k(\eta)$ is the expected value, under the IRT model, of $O_k$,

$$v(\eta) = \left( \frac{\sqrt{N_1}[O_1 - E_1(\eta)]}{\sqrt{E_1(\eta)[1 - E_1(\eta)]}}, \frac{\sqrt{N_2}[O_2 - E_2(\eta)]}{\sqrt{E_2(\eta)[1 - E_2(\eta)]}}, \ldots, \frac{\sqrt{N_K}[O_K - E_K(\eta)]}{\sqrt{E_K(\eta)[1 - E_K(\eta)]}} \right)^{T}, \qquad (4)$$

and the $L \times 1$ vector $\eta$ includes the parameters of the item of interest, that is, $\eta = (\eta_1, \ldots, \eta_L)^{T}$, where $L$ could vary over the items depending on the assumed IRT model, and, for example, would be equal to 2 if the two-parameter logistic (2PL) model is used. Let $v_k(\eta)$ denote the $k$-th element of $v(\eta)$.

In computing the $S$-$X^2$ statistic, the number of examinee groups ($K$) is typically equal to $J - 1$ because $O_0 = E_0(\eta) = 0$ and $O_J = E_J(\eta) = 1$ for any data set. For small samples, to ensure that the expected number of examinees is not too small in any examinee group, some groups may be merged and $K$ can be set equal to a number smaller than $J - 1$. In the simulations and empirical data examples for this paper, groups with fewer than 5 expected test-takers were merged, as was recommended by Orlando and Thissen (2000). However, for the sake of simplicity, merging is not considered in the theoretical derivations.
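The grouping by raw score and the sum in equation (3) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the tiny response matrix, and the expected proportions `e_k` are hypothetical, and the merging of sparse groups described above is deliberately left out.

```python
def observed_by_raw_score(responses, item):
    """Group examinees by raw score k and return {k: (N_k, O_k)} for k = 1..J-1."""
    n_items = len(responses[0])
    tallies = {k: [0, 0] for k in range(1, n_items)}  # k -> [N_k, number correct]
    for row in responses:
        k = sum(row)
        if k in tallies:          # raw scores 0 and J are excluded (O_0 = 0, O_J = 1)
            tallies[k][0] += 1
            tallies[k][1] += row[item]
    return {k: (n, c / n) for k, (n, c) in tallies.items() if n > 0}

def s_x2(n_k, o_k, e_k):
    """Equation (3): sum over groups of N_k (O_k - E_k)^2 / (E_k (1 - E_k))."""
    return sum(n * (o - e) ** 2 / (e * (1.0 - e))
               for n, o, e in zip(n_k, o_k, e_k))

# Hypothetical 3-item test with 8 examinees; item of interest is item 0.
responses = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [1, 0, 1],
             [0, 1, 1], [1, 0, 0], [0, 0, 0], [1, 1, 1]]
groups = observed_by_raw_score(responses, item=0)
n_k = [groups[k][0] for k in sorted(groups)]
o_k = [groups[k][1] for k in sorted(groups)]
e_k = [0.5, 0.8]                  # hypothetical model-implied proportions E_k
stat = s_x2(n_k, o_k, e_k)
```

In this toy data set both score groups contain three examinees, two of whom answered the item correctly, so each group contributes 1/3 to the statistic.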

The expected proportion of examinees for group $k$, $E_k(\eta)$, is computed as

$$E_k(\eta) = \frac{\int P(Y = 1 \mid \theta, \eta)\, P(T_{-1} = k - 1 \mid \theta, \eta)\, \psi(\theta)\, d\theta}{\int P(T = k \mid \theta, \eta)\, \psi(\theta)\, d\theta}, \qquad (5)$$

where $Y$ is the score of a randomly chosen examinee on the item of interest, $P(Y = 1 \mid \theta, \eta)$ is the probability that $Y$ is equal to 1 given examinee ability $\theta$ and item parameters $\eta$, $T$ is the total (raw) score on the test, $T_{-1}$ is the rest score, or the total score on all items except the item of interest, $P(T = k \mid \theta, \eta)$ is the probability that $T$ is equal to $k$ given ability $\theta$ and item parameters $\eta$, $P(T_{-1} = k - 1 \mid \theta, \eta)$ is the probability that the rest score given ability $\theta$ is equal to $k - 1$, and $\psi(\theta)$ is the population distribution of the examinee ability and typically assumed to be the standard normal distribution. The integrals in equation (5) are approximated using numerical integration.

The expressions $P(Y = 1 \mid \theta, \eta)$, $P(T_{-1} = k - 1 \mid \theta, \eta)$, and $P(T = k \mid \theta, \eta)$ depend on the IRT model fitted to the data. If, for example, the 2PL model is used, then

$$P(Y = 1 \mid \theta, \eta) = \frac{\exp[a(\theta - b)]}{1 + \exp[a(\theta - b)]},$$

where $a$ and $b$, respectively, are the slope and difficulty parameters of the item of interest. Also, the terms $P(T_{-1} = k - 1 \mid \theta, \eta)$ and $P(T = k \mid \theta, \eta)$ are computed using the Lord–Wingersky recursion formula (Lord & Wingersky, 1984).
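A minimal sketch of how equation (5) might be evaluated for the 2PL model is given below. It combines the Lord–Wingersky recursion with an equally spaced quadrature grid. The grid limits, the number of points, the function names, and the item parameters in the usage line are assumptions made for illustration; IRT software may use different quadrature rules.

```python
import math

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def lord_wingersky(probs):
    """Summed-score distribution for independent items with the given success probabilities."""
    dist = [1.0]
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1.0 - p)   # item answered incorrectly
            new[score + 1] += mass * p       # item answered correctly
        dist = new
    return dist

def expected_proportions(items, item, n_points=61, lo=-6.0, hi=6.0):
    """Equation (5): E_k for k = 1..J-1, with psi taken to be N(0, 1)."""
    n_items = len(items)
    step = (hi - lo) / (n_points - 1)
    num = [0.0] * (n_items + 1)
    den = [0.0] * (n_items + 1)
    for q in range(n_points):
        theta = lo + q * step
        # Unnormalised normal density; the constant cancels in the ratio.
        weight = math.exp(-0.5 * theta * theta)
        probs = [p_2pl(theta, a, b) for a, b in items]
        total = lord_wingersky(probs)                           # P(T = k | theta)
        rest = lord_wingersky(probs[:item] + probs[item + 1:])  # P(T_{-1} = k | theta)
        for k in range(1, n_items):
            num[k] += probs[item] * rest[k - 1] * weight
            den[k] += total[k] * weight
    return [num[k] / den[k] for k in range(1, n_items)]

# Usage with four hypothetical identical items (a = 1.2, b = 0.3).
e_k = expected_proportions([(1.2, 0.3)] * 4, item=0)
```

A convenient sanity check: when all items share the same parameters they are exchangeable, so the expected proportion in score group $k$ must equal $k/J$ regardless of the quadrature grid, which gives 0.25, 0.5, and 0.75 in the usage line above (up to floating-point error).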

Orlando and Thissen (2000) assumed that the asymptotic null distribution of $S$-$X^2$ is the $\chi^2_{K-L}$ distribution.

The Chernoff–Lehmann Problem with the Pearson’s $\chi^2$ Statistic

A critical step in defining $P$-$X^2$, the Pearson’s $\chi^2$ statistic, is the partitioning of the data into $K$ groups. Under the setup of the subsection on Pearson’s $\chi^2$ statistic above, the grouped data comprise $O_k = Np_k$, $k = 1, 2, \ldots, K$. Because the $O_k$’s follow the multinomial distribution (e.g., Agresti, 2013, p. 6), the log-likelihood of $\eta$ based on the grouped data is given by

$$\log \prod_k [\pi_k(\eta)]^{Np_k} = N \sum_k p_k \log \pi_k(\eta). \qquad (6)$$

Fisher (1924) proved that if $P$-$X^2$ is computed using the estimated parameter vector $\tilde{\eta}$ that maximizes the log-likelihood provided in equation (6), then the asymptotic null distribution of $P$-$X^2$ is the $\chi^2_{K-L-1}$ distribution. That is, for large samples and under no model misfit,

$$P\text{-}X^2 = [u(\tilde{\eta})]^{T} u(\tilde{\eta}) \sim \chi^2_{K-L-1}. \qquad (7)$$

The distribution reflects a loss of 1 df for each parameter that is estimated. The estimate $\tilde{\eta}$ is often referred to as the minimum $\chi^2$ estimator (e.g., Harris & Kanji, 1983).

Let $\hat{\eta}$ denote the MLE of $\eta$, which is computed by maximizing

$$\sum_{i=1}^{N} \log f(y_i, \eta),$$

which is the log-likelihood for the original/ungrouped data.

Chernoff and Lehmann (1954) proved that if one uses $\hat{\eta}$ to compute $P$-$X^2$, the corresponding statistic

$$P\text{-}X^2 = [u(\hat{\eta})]^{T} u(\hat{\eta}) \sim \chi^2_{K-L-1} + \sum_{l=1}^{L} \lambda_l(\hat{\eta})\, \chi^2_1, \qquad (8)$$

where $0 < \lambda_l(\hat{\eta}) < 1$; that is, the statistic is somewhere between a $\chi^2_{K-L-1}$ variable and a $\chi^2_{K-1}$ variable on average. Equation (8) implies that if a statistic of the form $[u(\hat{\eta})]^{T} u(\hat{\eta})$ is used to assess item fit and the $\chi^2_{K-L-1}$ distribution is used to approximate the limiting distribution of the statistic, the null hypothesis of adequate model fit will be rejected more often than is appropriate, which would result in an inflated Type I error rate of the fit-assessment approach.
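To make the "somewhere between" statement concrete, one can take expectations in equation (8) term by term, using the fact that a $\chi^2$ variable has mean equal to its df. The numbers $K = 9$ and $L = 2$ below are hypothetical and serve only as an illustration.

$$E\left[P\text{-}X^2\right] \approx (K - L - 1) + \sum_{l=1}^{L} \lambda_l(\hat{\eta}), \qquad K - L - 1 < E\left[P\text{-}X^2\right] < K - 1.$$

With $K = 9$ cells and $L = 2$ estimated parameters, the large-sample mean of the statistic lies strictly between 6 and 8, whereas the reference $\chi^2_{6}$ distribution has mean 6. The statistic is therefore too large on average relative to its reference distribution, which is the source of the inflated rejection rate.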

Equations (1) and (4) imply that the $S$-$X^2$ statistic is a special case of the Pearson’s $\chi^2$ statistic. In addition, $S$-$X^2$ is computed using the MMLE of the item parameters based on the original/ungrouped data and yet is assumed to have a $\chi^2_{J-L-1}$ asymptotic null distribution (Orlando & Thissen, 2000). Such a use of $S$-$X^2$ is exactly like the use of the Pearson’s $\chi^2$ statistic along with the $\chi^2_{K-L-1}$ asymptotic null distribution. Therefore, $S$-$X^2$ is expected to suffer from the Chernoff–Lehmann problem and is expected to follow not a $\chi^2$ distribution, but a distribution like the one given by equation (8). Thus, $S$-$X^2$ is expected to be larger on average than a $\chi^2_{J-L-1}$ random variable for large samples under no model misfit. Existing simulation studies that examined the Type I error rates of $S$-$X^2$ corroborate this fact. Glas and Suarez-Falcón (2003), Sinharay (2006), and Sinharay and Lu (2008) found in simulation studies that the Type I error rates of $S$-$X^2$ are slightly inflated when it is computed using the MMLEs of item parameters from ungrouped data and is assumed to have the $\chi^2_{J-L-1}$ asymptotic null distribution. For example, Table 1 of Glas and Suarez-Falcón (2003) shows that the Type I error rates of $S$-$X^2$ at the 5% significance level are 0.08, 0.08, and 0.07, respectively, for sample sizes 500, 1,000, and 4,000 for 10-item tests.

The resampling-based approaches developed by Sinharay (2006), Stone (2000), and Stone and Zhang (2003), which involve the determination of the null distribution of $S$-$X^2$ using simulations, offer alternative solutions and successfully avoid the use of an inaccurate asymptotic null distribution, but these approaches are computation-intensive. The use of the minimum $\chi^2$ estimator $\tilde{\eta}$ and the $P$-$X^2$ statistic defined in equation (7) is another possible approach to attain the target Type I error rate. However, $\hat{\eta}$ is a more efficient estimator compared to $\tilde{\eta}$ because the former utilizes more information than the latter (e.g., Rao, 1962; Rao & Robson, 1974). Also, $\hat{\eta}$ is more popular than $\tilde{\eta}$. For example, the former is implemented in several publicly available IRT software packages such as BILOG (Mislevy & Bock, 1991), MULTILOG (Thissen, 1991), and PARSCALE (Muraki & Bock, 2003). Further, a $\chi^2$-type statistic that utilizes $\hat{\eta}$ rather than $\tilde{\eta}$ is likely to be more useful and popular among researchers and practitioners.

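Table 1.

The Type I Error Rates of $S$-$X^2$ and $S$-$X^2_{RR}$ for the 2PL Model.

| Test length | Statistic | N = 500 | N = 1000 | N = 2000 | N = 4000 |
|---|---|---|---|---|---|
| 10 | $S$-$X^2$ | 0.092 | 0.087 | 0.074 | 0.068 |
| 10 | $S$-$X^2_{RR}$ | 0.035 | 0.042 | 0.043 | 0.044 |
| 20 | $S$-$X^2$ | 0.072 | 0.067 | 0.061 | 0.057 |
| 20 | $S$-$X^2_{RR}$ | 0.054 | 0.048 | 0.041 | 0.040 |
| 40 | $S$-$X^2$ | 0.062 | 0.057 | 0.053 | 0.051 |
| 40 | $S$-$X^2_{RR}$ | 0.054 | 0.053 | 0.049 | 0.047 |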

The Modified $\chi^2$ Statistic of Rao and Robson

One solution to the abovementioned Chernoff–Lehmann problem is to modify $P$-$X^2$ in a way such that the modified statistic has a known asymptotic null distribution.

One modification of the Pearson’s $\chi^2$ statistic was suggested by Rao and Robson (1974) and is computed as

$$P\text{-}X^2_{RR} = [u(\hat{\eta})]^{T}\, \Sigma_{u(\hat{\eta})}^{-1}\, u(\hat{\eta}),$$

where $\Sigma_{u(\hat{\eta})}$ is the approximate covariance matrix of $u(\hat{\eta})$ for large samples. The modification is essentially a standardization of $u(\hat{\eta})$ such that $\Sigma_{u(\hat{\eta})}^{-1/2} u(\hat{\eta})$ follows a multivariate normal distribution for large samples under no model misfit, and, consequently,

$$P\text{-}X^2_{RR} \sim \chi^2_{K-1}.$$

Note that there is no loss of df for parameter estimation in the null distribution of the $P$-$X^2_{RR}$ statistic. Rao and Robson (1974) found that $P$-$X^2_{RR}$ has larger power than the Pearson’s $\chi^2$ statistic computed using the minimum $\chi^2$ estimator defined in equation (7); this result is presumably due to the larger degrees of freedom of the former statistic compared to the latter statistic.

In this paper, we borrow the idea underlying $P$-$X^2_{RR}$ and derive the covariance matrix $\Sigma_{v(\hat{\eta})}$. The matrix $\Sigma_{v(\hat{\eta})}$ allows us to compute the statistic $S$-$X^2_{RR}$, which is a special case of the $P$-$X^2_{RR}$ statistic and is a modified version of the $S$-$X^2$ statistic, as

$$S\text{-}X^2_{RR} = [v(\hat{\eta})]^{T}\, \Sigma_{v(\hat{\eta})}^{-1}\, v(\hat{\eta}). \qquad (9)$$

Further,

$$S\text{-}X^2_{RR} \sim \chi^2_{J-1}$$

(Rao & Robson, 1974). The key to this modification is the computation of the covariance matrix $\Sigma_{v(\hat{\eta})}$. The detailed derivation of the matrix is provided below.

Method: Derivation of the Covariance Matrix Required in $S$-$X^2_{RR}$

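To obtain $\Sigma_{v(\hat{\eta})}$, we first approximate $v(\hat{\eta})$ using the first-order Taylor series expansion (e.g., Lehmann & Casella, 1998, p. 77) around $\eta_0$ as

$$v(\hat{\eta}) \approx v(\eta_0) + A_0(\hat{\eta} - \eta_0), \qquad (10)$$

where $\eta_0$ is the unknown true item parameter vector,

$$v(\eta_0) = \left( \frac{\sqrt{N_1}[O_1 - E_1(\eta_0)]}{\sqrt{E_1(\eta_0)[1 - E_1(\eta_0)]}}, \frac{\sqrt{N_2}[O_2 - E_2(\eta_0)]}{\sqrt{E_2(\eta_0)[1 - E_2(\eta_0)]}}, \ldots, \frac{\sqrt{N_K}[O_K - E_K(\eta_0)]}{\sqrt{E_K(\eta_0)[1 - E_K(\eta_0)]}} \right)^{T}, \qquad (11)$$

and $A_0$ is a $K \times L$ matrix whose $(k, l)$-th element is given by

$$(A_0)_{k,l} = \left. \frac{\partial v_k(\eta)}{\partial E_k(\eta)} \frac{\partial E_k(\eta)}{\partial \eta_l} \right|_{\eta = \eta_0} = N_k^{1/2} \left[ -\frac{1}{E_k(\eta_0)^{1/2} (1 - E_k(\eta_0))^{1/2}} + \frac{(E_k(\eta_0) - 0.5)(O_k - E_k(\eta_0))}{E_k(\eta_0)^{3/2} (1 - E_k(\eta_0))^{3/2}} \right] \left. \frac{\partial E_k(\eta)}{\partial \eta_l} \right|_{\eta = \eta_0}. \qquad (12)$$

Note that for large values of $N_k$, $O_k$ is approximately equal to $E_k(\eta_0)$, and, consequently, $(A_0)_{k,l}$ can be approximated as

$$(A_0)_{k,l} \approx -\sqrt{\frac{N_k}{E_k(\eta_0)(1 - E_k(\eta_0))}} \left. \frac{\partial E_k(\eta)}{\partial \eta_l} \right|_{\eta = \eta_0}. \qquad (13)$$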

Equation (10) implies that

$$\Sigma_{v(\hat{\eta})} \approx \Sigma_{v(\eta_0)} + 2\,\mathrm{Cov}[A_0(\hat{\eta} - \eta_0), v(\eta_0)] + A_0 \Sigma_{\hat{\eta}} A_0^{T}. \qquad (14)$$

Among the terms in equation (14), the elements of $A_0$ can be approximated using equation (13), and $\Sigma_{\hat{\eta}}$, which is the variance-covariance matrix among the estimates of the item parameters, can be obtained from the IRT software that was used to fit the IRT model to the data set (see Note 1). The computation of the other terms, $\Sigma_{v(\eta_0)}$ and $\mathrm{Cov}[A_0(\hat{\eta} - \eta_0), v(\eta_0)]$, is described below.
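The approximation of $A_0$ in equation (13) can be sketched as below. Two assumptions are made and should be checked against the original article: the leading minus sign, which follows from differentiating $v_k$ with respect to $E_k$, and the use of central finite differences for $\partial E_k / \partial \eta_l$, which is a convenience here and not something the paper prescribes. The function names are hypothetical; `e_func` stands for any routine that returns the list of $E_k$ values for a given parameter vector.

```python
import math

def approx_a0(n_k, e_k, de_k):
    """Equation (13): (A0)[k][l] ~ -sqrt(N_k / (E_k (1 - E_k))) * dE_k/d eta_l."""
    return [[-math.sqrt(n / (e * (1.0 - e))) * d for d in row]
            for n, e, row in zip(n_k, e_k, de_k)]

def numerical_gradient(e_func, eta, h=1e-5):
    """Central-difference estimate of dE_k/d eta_l as a K x L nested list."""
    n_groups = len(e_func(eta))
    grad = [[0.0] * len(eta) for _ in range(n_groups)]
    for l in range(len(eta)):
        up = list(eta)
        down = list(eta)
        up[l] += h
        down[l] -= h
        e_up, e_down = e_func(up), e_func(down)
        for k in range(n_groups):
            grad[k][l] = (e_up[k] - e_down[k]) / (2.0 * h)
    return grad

# Hypothetical single group: N_k = 100, E_k = 0.5, derivatives 0.2 and -0.1.
a0 = approx_a0([100], [0.5], [[0.2, -0.1]])
```

In the usage line the scaling factor is $-\sqrt{100/0.25} = -20$, so the row of $A_0$ is approximately $(-4, 2)$. In practice $A_0$ would be evaluated at $\hat{\eta}$, since $\eta_0$ is unknown.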

Computation of $\Sigma_{v(\eta_0)}$

Because of equation (11), the diagonal elements of $\Sigma_{v(\eta_0)}$ are terms such as $\mathrm{Var}(v_k(\eta_0))$, where

$$v_k(\eta_0) = \frac{\sqrt{N_k}[O_k - E_k(\eta_0)]}{\sqrt{E_k(\eta_0)[1 - E_k(\eta_0)]}}, \quad k = 1, 2, \ldots, K,$$

computed at $\eta = \eta_0$, and the off-diagonal elements of $\Sigma_{v(\eta_0)}$ are terms such as $\mathrm{Cov}(v_{k_1}(\eta_0), v_{k_2}(\eta_0))$ for $k_1 \neq k_2 = 1, 2, \ldots, K$ computed at $\eta = \eta_0$.

Because the variance of $O_k$ computed at $\eta = \eta_0$ is $E_k(\eta_0)[1 - E_k(\eta_0)]/N_k$, $v_k(\eta_0)$ is standardized, that is, its variance is 1 for $k = 1, 2, \ldots, K$. So, the diagonal elements of $\Sigma_{v(\eta_0)}$ are all equal to 1. Because the quantities $E_{k_1}(\eta_0)$ are constants, $\mathrm{Cov}(v_{k_1}(\eta_0), v_{k_2}(\eta_0))$ is a multiple of $\mathrm{Cov}(O_{k_1}, O_{k_2})$, the covariance of $O_{k_1}$ and $O_{k_2}$, computed at $\eta = \eta_0$. Appendix A includes a proof that $\mathrm{Cov}(O_{k_1}, O_{k_2})$, computed at $\eta = \eta_0$, is approximately equal to 0 for large samples. Therefore, the off-diagonal elements of $\Sigma_{v(\eta_0)}$ are all approximately equal to 0 for large samples.

Consequently, for large samples,

$$\Sigma_{v(\eta_0)} \approx I_K, \qquad (15)$$

where $I_K$ denotes an identity matrix of dimension $K \times K$.
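The unit-variance claim for the diagonal elements can be written out in one line. The step below treats the group size $N_k$ as fixed, which is an assumption made here for the illustration.

$$\mathrm{Var}[v_k(\eta_0)] = \frac{N_k\, \mathrm{Var}(O_k)}{E_k(\eta_0)[1 - E_k(\eta_0)]} = \frac{N_k}{E_k(\eta_0)[1 - E_k(\eta_0)]} \cdot \frac{E_k(\eta_0)[1 - E_k(\eta_0)]}{N_k} = 1.$$

For instance, with the hypothetical values $N_k = 200$ and $E_k(\eta_0) = 0.4$, $\mathrm{Var}(O_k) = 0.24/200 = 0.0012$, and multiplying by $N_k / 0.24$ recovers 1.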

Computation of $\mathrm{Cov}[A_0(\hat{\eta} - \eta_0), v(\eta_0)]$

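The grouped data in the context of item-fit analysis comprise the quantities $N_k O_k$ and $N_k(1 - O_k)$, which are the numbers of correct and incorrect answers on the item of interest for examinee group $k$. The log-likelihood of these grouped data is provided by

$$\ell(\eta) = \log \prod_k E_k(\eta)^{N_k O_k} (1 - E_k(\eta))^{N_k(1 - O_k)} = \sum_k \left[ N_k O_k \log(E_k(\eta)) + N_k(1 - O_k) \log(1 - E_k(\eta)) \right].$$

As mentioned earlier, the minimum $\chi^2$ estimator $\tilde{\eta}$ is obtained by solving

$$\frac{\partial \ell(\eta)}{\partial \eta_l} = \sum_k \left[ \frac{N_k O_k}{E_k(\eta)} - \frac{N_k(1 - O_k)}{1 - E_k(\eta)} \right] \frac{\partial E_k(\eta)}{\partial \eta_l} = 0, \quad l = 1, 2, \ldots, L, \qquad (16)$$

or by solving

$$\sum_k \frac{N_k[O_k - E_k(\eta)]}{E_k(\eta)[1 - E_k(\eta)]} \frac{\partial E_k(\eta)}{\partial \eta_l} = 0, \quad l = 1, 2, \ldots, L.$$

Therefore, the solution $\tilde{\eta}$ to the above equations satisfies

$$\left. \frac{\partial \ell(\eta)}{\partial \eta} \right|_{\eta = \tilde{\eta}} = 0_{L \times 1}, \qquad (17)$$

where $0_{L \times 1}$ is a vector of length $L$ whose elements are zeroes. Also note that equations (11), (13), and (16) imply that

$$\left. \frac{\partial \ell(\eta)}{\partial \eta} \right|_{\eta = \eta_0} = -A_0^{T} v(\eta_0). \qquad (18)$$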

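By applying the Taylor series expansion around $\eta = \eta_0$ to $\left. \partial \ell(\eta) / \partial \eta \right|_{\eta = \tilde{\eta}}$ and using the results provided in equations (17) and (18), we obtain

$$\left. \frac{\partial \ell(\eta)}{\partial \eta} \right|_{\eta = \tilde{\eta}} = 0_{L \times 1} \approx -A_0^{T} v(\eta_0) + B_0(\tilde{\eta} - \eta_0), \qquad (19)$$

where

$$B_0 = \left. \frac{\partial^2 \ell(\eta)}{\partial \eta\, \partial \eta^{T}} \right|_{\eta = \eta_0}.$$

Equation (19) implies that

$$B_0^{-1} A_0^{T} v(\eta_0) - \tilde{\eta} + \eta_0 \approx 0$$

or

$$B_0^{-1} A_0^{T} v(\eta_0) + \hat{\eta} - \tilde{\eta} \approx \hat{\eta} - \eta_0. \qquad (20)$$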

Using equation (20), we can express the covariance $\mathrm{Cov}[A_0(\hat{\eta} - \eta_0), v(\eta_0)]$ in equation (14) as

$$\mathrm{Cov}[A_0(\hat{\eta} - \eta_0), v(\eta_0)] \approx \mathrm{Cov}[A_0 B_0^{-1} A_0^{T} v(\eta_0) + A_0(\hat{\eta} - \tilde{\eta}), v(\eta_0)] = A_0 B_0^{-1} A_0^{T} \Sigma_{v(\eta_0)} + \mathrm{Cov}[A_0(\hat{\eta} - \tilde{\eta}), v(\eta_0)]. \qquad (21)$$

However, note that $\mathrm{Cov}[A_0(\hat{\eta} - \tilde{\eta}), v(\eta_0)]$, the second term on the right side of equation (21), converges to a matrix of zeroes since $A_0$ is a matrix of constants and $\hat{\eta} - \tilde{\eta}$, which is the difference between two sets of item parameter estimates, converges to a zero vector as sample size increases. Therefore, equation (21) yields the result that

$$\mathrm{Cov}[A_0(\hat{\eta} - \eta_0), v(\eta_0)] = A_0 B_0^{-1} A_0^{T} \Sigma_{v(\eta_0)}. \qquad (22)$$

Equations (14), (15), and (22) imply that

$$\Sigma_{v(\hat{\eta})} \approx I_K + 2 A_0 B_0^{-1} A_0^{T} + A_0 \Sigma_{\hat{\eta}} A_0^{T}. \qquad (23)$$

Although the minimum $\chi^2$ estimator appears in the above derivation, one does not have to compute the estimator to compute $\Sigma_{v(\hat{\eta})}$. That is because $A_0$ and $B_0$ can be adequately approximated using the MLE $\hat{\eta}$, which is an accurate estimator of $\eta_0$ for common IRT models (e.g., Harwell et al., 1988).

After approximating $\Sigma_{v(\hat{\eta})}$ using equation (23), one can compute our modified version of $S$-$X^2$ as

$$S\text{-}X^2_{RR} = [v(\hat{\eta})]^{T}\, \Sigma_{v(\hat{\eta})}^{-1}\, v(\hat{\eta}), \qquad (24)$$

where $v(\hat{\eta})$ is computed using equation (4) after replacing $\eta$ by $\hat{\eta}$. The asymptotic null distribution of $S$-$X^2_{RR}$ is a $\chi^2_{J-1}$ distribution (Rao & Robson, 1974). Thus, item misfit is indicated by values of $S$-$X^2_{RR}$ that are larger than the appropriate percentiles (say the 95th or 99th percentile) of the $\chi^2_{J-1}$ distribution.
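The assembly of the covariance matrix in equation (23) and the quadratic form in equation (24) can be sketched with plain Python lists as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names are hypothetical, the inputs `a0`, `b0`, and `sigma_eta` are assumed to have been evaluated at the MLE, the signs follow the form $I_K + 2A_0B_0^{-1}A_0^{T} + A_0\Sigma_{\hat{\eta}}A_0^{T}$, and a basic Gauss–Jordan solver stands in for a proper linear-algebra library.

```python
def transpose(x):
    return [list(col) for col in zip(*x)]

def matmul(x, y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def solve(m, rhs):
    """Solve m z = rhs (rhs is a matrix) by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(m[i]) + list(rhs[i]) for i in range(n)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        piv = aug[c][c]
        aug[c] = [x / piv for x in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def s_x2_rr(v, a0, b0, sigma_eta):
    """Equations (23)-(24): v' [I + 2 A0 B0^{-1} A0' + A0 Sigma_eta A0']^{-1} v."""
    k = len(v)
    a0_t = transpose(a0)
    term_b = matmul(a0, solve(b0, a0_t))           # A0 B0^{-1} A0'
    term_s = matmul(matmul(a0, sigma_eta), a0_t)   # A0 Sigma_eta A0'
    sigma_v = [[(1.0 if i == j else 0.0) + 2.0 * term_b[i][j] + term_s[i][j]
                for j in range(k)] for i in range(k)]
    z = solve(sigma_v, [[x] for x in v])           # Sigma_v^{-1} v
    return sum(x * row[0] for x, row in zip(v, z))

# Hypothetical inputs with K = 2 groups and L = 1 parameter.
stat = s_x2_rr(v=[1.0, 2.0], a0=[[1.0], [0.0]], b0=[[-2.0]], sigma_eta=[[0.5]])
```

In the usage line the covariance matrix works out to a diagonal matrix with entries 0.5 and 1, so the statistic is $1/0.5 + 4/1 = 6$. The resulting value would then be compared with a percentile of the $\chi^2_{J-1}$ distribution, which can be obtained, for example, from `scipy.stats.chi2.ppf`. The solver does not guard against a singular or ill-conditioned matrix.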

Simulation Studies

We performed a simulation study to evaluate the Type I error rates and power of the new $S$-$X^2_{RR}$ statistic defined in equation (24) and to compare its Type I error rates and power to those of the $S$-$X^2$ statistic (Orlando & Thissen, 2000) defined in equation (3). In the first part of the study, we compute and compare the Type I error rates of $S$-$X^2$ and $S$-$X^2_{RR}$ for data simulated from the 2PL model. In the second part of the simulation study, we examine and compare the power of $S$-$X^2$ and $S$-$X^2_{RR}$ for data simulated from the Rasch, the 2PL, and the 3PL models. Both statistics were computed using $\hat{\eta}$, which is the vector of the MMLEs of the item parameters.

Simulation Design

In the simulations, item scores were simulated under the Rasch, 2PL, and 3PL models. The test length was set equal to 10, 20, or 40. The sample size was set equal to 500, 1000, 2000, or 4000. The true slope parameters, difficulty parameters, and guessing parameters were randomly generated from uniform distributions $U(1, 2)$, $U(-3, 3)$, and $U(0.05, 0.3)$, respectively, where, for example, $U(1, 2)$ denotes the uniform distribution between 1 and 2. Simulating the true parameter values from other distributions did not affect the comparative performance of the item-fit statistics. To investigate the Type I error rates of the two statistics, the data-generating model (the IRT model that was used to simulate the data) was fitted to the data. To investigate the power of the two statistics, the Rasch and 2PL models were fitted to data simulated from the 3PL model and the Rasch model was fitted to data simulated from the 2PL model. After the models were fitted to the data and the item-fit statistics were computed, the Type I error rate of an item-fit statistic at the 5% significance level was computed as the proportion of values of the statistic that were larger than the 95th percentile of the $\chi^2$ distribution with $J - 1$ (for $S$-$X^2_{RR}$) or $J - L - 1$ (for $S$-$X^2$) df for the simulation cases where the data-generating model and the fitted model were the same; the power of a statistic was computed as the proportion of values of the statistic that were larger than the 95th percentile of the $\chi^2$ distribution with $J - 1$ or $J - L - 1$ df for the simulation cases where the data-generating model and the fitted model were different. Both Type I error rate and power for each combination of test length and sample size were computed from 100 replications. The true item parameters were resampled in each replication.
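The data-generating step of such a design could look roughly like the sketch below. The parameter distributions are taken from the description above; everything else is an assumption made for illustration, including the function names, the standard normal ability distribution, the choice to fix the slope at 1 for the Rasch model, and the seed. Model fitting and computation of the fit statistics are not shown.

```python
import math
import random

def draw_item_parameters(n_items, rng, model="2PL"):
    """Draw (a, b, c) per item: a ~ U(1, 2), b ~ U(-3, 3), c ~ U(0.05, 0.3)."""
    items = []
    for _ in range(n_items):
        a = rng.uniform(1.0, 2.0) if model in ("2PL", "3PL") else 1.0  # Rasch slope assumed 1
        b = rng.uniform(-3.0, 3.0)
        c = rng.uniform(0.05, 0.3) if model == "3PL" else 0.0
        items.append((a, b, c))
    return items

def simulate_responses(items, n_examinees, rng):
    """Simulate 0/1 item scores with abilities drawn from N(0, 1) (an assumption)."""
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)
        row = []
        for a, b, c in items:
            p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return data

def rejection_rate(statistics, critical_value):
    """Proportion of statistic values that exceed the critical value."""
    return sum(s > critical_value for s in statistics) / len(statistics)

rng = random.Random(2022)                      # arbitrary seed for reproducibility
items = draw_item_parameters(10, rng, model="2PL")
data = simulate_responses(items, 500, rng)
```

`rejection_rate` gives the Type I error rate when the fitted model matches the data-generating model and the power when it does not; the critical value is the 95th percentile of the relevant $\chi^2$ distribution.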

Results

Table 1 shows the Type I error rates of the two statistics for the various simulation cases where the data-generating model and the fitted model were the same. The table shows that the Type I error rates of $S$-$X^2$ are larger than the nominal level in all simulation cases, a finding that is in agreement with findings on Type I error rates of $S$-$X^2$ in Glas and Suarez-Falcón (2003), Sinharay (2006), and Sinharay and Lu (2008). However, the Type I error rates of $S$-$X^2$ are not much larger than the nominal level for 40-item tests. The Type I error rates of the modified statistic $S$-$X^2_{RR}$ are considerably smaller than those of $S$-$X^2$ in all cases. Thus, $S$-$X^2_{RR}$ overcomes the Chernoff–Lehmann problem to a certain extent. However, the Type I error rates of $S$-$X^2_{RR}$ are considerably smaller than the nominal level for 10-item tests; we plan to investigate this issue in future research.

Table 2 shows the values of power of the two item-fit statistics for the various simulation cases where the data-generating model and the fitted model were different. The two columns with heading, for example, “2PL/1PL,” show the power for the cases when the data were simulated from the 2PL model and analyzed using the Rasch model. Table 2 shows that the power of the modified statistic $S$-$X^2_{RR}$ is smaller than or equal to that of $S$-$X^2$ in every case. However, the slightly better power of $S$-$X^2$ relative to $S$-$X^2_{RR}$ is likely a consequence of the inflated Type I error rate of the former statistic. As the sample size increases, the power of both statistics approaches 1.0 for the “2PL/1PL” and “3PL/1PL” cases. The small power of both item-fit statistics for the “3PL/2PL” case is an outcome of the fact that the 2PL model can explain data simulated from the 3PL model except for the case that the difficulty and guessing parameters for the latter model are too high (Sinharay, 2006).

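Table 2.

The Power of $S$-$X^2_{RR}$ and $S$-$X^2$ for Various Combinations of Data-generating Model and Fitted Model.

| Test length | Sample size | 2PL/1PL: $S$-$X^2$ | 2PL/1PL: $S$-$X^2_{RR}$ | 3PL/1PL: $S$-$X^2$ | 3PL/1PL: $S$-$X^2_{RR}$ | 3PL/2PL: $S$-$X^2$ | 3PL/2PL: $S$-$X^2_{RR}$ |
|---|---|---|---|---|---|---|---|
| 10 | 500 | 0.26 | 0.19 | 0.34 | 0.26 | 0.06 | 0.05 |
| 10 | 1000 | 0.48 | 0.40 | 0.49 | 0.37 | 0.07 | 0.06 |
| 10 | 2000 | 0.65 | 0.57 | 0.67 | 0.55 | 0.08 | 0.06 |
| 10 | 4000 | 0.80 | 0.69 | 0.82 | 0.68 | 0.11 | 0.05 |
| 20 | 500 | 0.17 | 0.17 | 0.30 | 0.27 | 0.07 | 0.04 |
| 20 | 1000 | 0.39 | 0.35 | 0.47 | 0.40 | 0.09 | 0.05 |
| 20 | 2000 | 0.64 | 0.61 | 0.67 | 0.59 | 0.10 | 0.08 |
| 20 | 4000 | 0.80 | 0.75 | 0.82 | 0.76 | 0.13 | 0.10 |
| 40 | 500 | 0.18 | 0.17 | 0.27 | 0.25 | 0.08 | 0.05 |
| 40 | 1000 | 0.25 | 0.24 | 0.42 | 0.38 | 0.11 | 0.08 |
| 40 | 2000 | 0.54 | 0.48 | 0.66 | 0.64 | 0.11 | 0.09 |
| 40 | 4000 | 0.77 | 0.66 | 0.79 | 0.75 | 0.15 | 0.13 |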

Real Data Example

The two item-fit statistics, $S$-$X^2$ and $S$-$X^2_{RR}$, were computed for a real data set. The data set includes the item scores of 2000 examinees on a state test with 46 dichotomous and multiple-choice items (with 5 answer options for each item) designed to measure students’ achievement in mathematics and was previously analyzed in Sinharay (2017).

The Rasch, 2PL, and 3PL models were fitted to the data set and the values of $S$-$X^2_{RR}$ and $S$-$X^2$ were computed for all items for each IRT model. Table 3 shows the number of items for which the item-fit statistics were statistically significant at the 5% level of significance for the three IRT models. The table shows that for each IRT model, the use of $S$-$X^2_{RR}$ leads to fewer items being identified as misfitting compared to that of $S$-$X^2$, with the difference being more prominent for the 2PL model. This finding agrees with the finding of smaller Type I error rate and power of $S$-$X^2_{RR}$ compared to $S$-$X^2$ in the simulation study. Although both statistics are significant for a considerable number of items for the Rasch and 2PL models, $S$-$X^2$ and $S$-$X^2_{RR}$ are significant for only 6 and 3 items, respectively, for the 3PL model. Although the 3PL model seems to adequately fit the data set, more tests, including tests for local independence (e.g., Chen & Thissen, 1997), and further investigations should be conducted to finalize this conclusion.

Table 3.

The Number of Items with Statistically Significant Values of $S$-$X^2$ and $S$-$X^2_{RR}$ for the Three IRT Models for the Real Data Set.

| Statistic | Rasch | 2PL | 3PL |
|---|---|---|---|
| $S$-$X^2$ | 33 | 18 | 6 |
| $S$-$X^2_{RR}$ | 31 | 12 | 3 |

Note. IRT = item response theory.

The three panels of Figure 1 show scatter plots of $S$-$X^2$ versus $S$-$X^2_{RR}$ for the real data set under the three IRT models. The range of the X-axis is the same as that of the Y-axis in each panel. The range is much wider in the leftmost panel than in the other two panels. The panels include a diagonal line and also vertical and horizontal dashed lines indicating the critical values at the 5% level of significance for the respective statistics. The last two panels show that for several items, $S$-$X^2$ is larger than its critical value, but $S$-$X^2_{RR}$ is smaller than its critical value. Because item misfit often leads to an item being removed from the item pool (Sinharay & Haberman, 2014) and items are costly, these results indicate that the use of $S$-$X^2_{RR}$ rather than $S$-$X^2$ may lead to considerable saving of resources in operational testing.

Figure 1.

Plot of $S$-$X^2$ versus $S$-$X^2_{RR}$ for three item response theory models for the real data.

Conclusions and Recommendations

The item-fit statistic $S$-$X^2$ (Orlando & Thissen, 2000), in spite of its simplicity and popularity, does not have a known asymptotic null distribution (Sinharay, 2006) and the Type I error rate of the statistic is larger than the nominal level, especially for shorter tests. The present study adopts the modification procedure suggested by Rao and Robson (1974) to provide a modified version of $S$-$X^2$ that has a known $\chi^2$ asymptotic null distribution. The statistic $S$-$X^2$ can be written as $\hat{v}^{T}\hat{v}$. The central idea of the modification of Rao and Robson (1974) is the computation of $\hat{v}^{T}\Sigma_{\hat{v}}^{-1}\hat{v}$, where $\Sigma_{\hat{v}}$ is an approximate variance-covariance matrix of $\hat{v}$, so that $\hat{v}^{T}\Sigma_{\hat{v}}^{-1}\hat{v}$ has a known $\chi^2$ asymptotic null distribution. A major contribution of this paper is the derivation of the appropriate $\Sigma_{\hat{v}}$. Thus, this paper suggests a $\chi^2$-type statistic that (a) can be used to assess item fit for any IRT model for dichotomous items and (b) has a known asymptotic distribution under the null hypothesis. Item-fit statistics that have a known asymptotic $\chi^2$ distribution under the null hypothesis have been suggested for the Rasch model by, for example, Glas (1988), but there is a lack of such statistics for non-Rasch IRT models. Thus, this paper makes an important contribution given that experts such as Box (1979) called for statistics that have a known null distribution in assessing the fit of statistical models. Note that researchers such as Haberman et al. (2013) have suggested residual-based item-fit statistics that follow the standard normal distribution for non-Rasch IRT models, but we do not consider such statistics.

Simulation studies were conducted to compare the performance of $S$-$X^2$ and $S$-$X^2_{RR}$ with respect to Type I error rate and power. Results obtained from the simulation studies suggest that the Type I error rate of $S$-$X^2_{RR}$ is closer to the nominal level than that of $S$-$X^2$ across different conditions. However, $S$-$X^2_{RR}$ was found to be slightly conservative in comparison to $S$-$X^2$. Application of the two item-fit statistics to a real data set revealed that the number of misfitting items using $S$-$X^2_{RR}$ was smaller than that for $S$-$X^2$. In practice, item-fit statistics such as $S$-$X^2_{RR}$ should be used along with other methods such as informative graphics and pair-wise item-fit indexes in order to gain a thorough understanding of the type of misfit.

This paper has several limitations. First, it is possible to compare the two statistics for more simulated data and more real data. Second, the proposed statistic $S$-$X^2_{RR}$ applies only to dichotomous IRT models; it is possible to extend the statistic to tests with polytomous items or a mix of dichotomous and polytomous items in future research. Third, the current manuscript only investigates three unidimensional IRT models assuming the latent variable follows a normal distribution. To obtain a better understanding of the suggested statistic, one can look into its performance in other cases, including non-normal ability distributions, multidimensional latent variables, and discrete latent variables. Finally, this manuscript only considers statistical significance and does not discuss the practical significance of IRT model misfit (Hambleton & Han, 2005; Sinharay & Haberman, 2014).

Acknowledgments

The authors would like to thank John Donoghue, Sooyeon Kim, Hongwen Guo, Lora Monfils, and two anonymous reviewers for several helpful comments that led to a significant improvement of the article.

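Appendix A. Proof that $\mathrm{Cov}(O_{k_1}, O_{k_2})$ Computed at $\eta = \eta_0$ is Approximately Equal to Zero for Large Samples

Let $S_i$ denote the total score of examinee $i$, who is randomly chosen from the hypothetical population of all possible examinees. Let us define an indicator variable $W_{ik}$ as

$$W_{ik} = \begin{cases} 1, & S_i = k \\ 0, & S_i \neq k. \end{cases}$$

Then $O_k$ for an item of interest can be expressed as

$$O_k = \frac{\sum_i W_{ik} X_i}{\sum_i W_{ik}},$$

where $X_i$ is the score of examinee $i$ on the item.

Let us consider two possible values $k_1$ and $k_2$ of $S_i$, where $k_1 \neq k_2$, and define a vector $U$ as

$$U = \left( \sum_i W_{ik_1} X_i, \; \sum_i W_{ik_1}, \; \sum_i W_{ik_2} X_i, \; \sum_i W_{ik_2} \right)^{T}.$$

Then one can express $O_{k_1}$ and $O_{k_2}$ as

$$O_{k_1} = \frac{U_1}{U_2}, \quad O_{k_2} = \frac{U_3}{U_4},$$

where, for example, $U_1$ is the first component of $U$. The Jacobian for the transformation from $U$ to $O = (O_{k_1}, O_{k_2})^{T}$ is given by a matrix of the first derivatives of the elements of $O$ with respect to those of $U$, or, by

$$J = \begin{bmatrix} J_1 & J_2 & 0 & 0 \\ 0 & 0 & J_3 & J_4 \end{bmatrix}, \qquad (A1)$$

where

$$J_1 = \frac{1}{U_2}, \quad J_2 = -\frac{U_1}{U_2^2}, \quad J_3 = \frac{1}{U_4}, \quad J_4 = -\frac{U_3}{U_4^2}. \qquad (A2)$$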

Consequently, using the multivariate delta method (e.g., Lehmann & Casella, 1998, p. 61), the variance-covariance matrix of O for large samples can be approximated as

Cov(O)JΣUJ

where Σ U is the variance-covariance matrix of the vector U , J is the value of J provided in equation (A1) upon replacing the U k ’s with their expected values computed at η = η 0 , and the parameters η are fixed at η 0. Using the result that the (i, j)-th element of the product of three matrices A, B and C is equal to the (matrix) product of the i-th row of A, the matrix B, and the j-th column of C (e.g., Banerjee & Roy, 2014, p. 12), the covariance between Ok1 and Ok2 can be approximated, for large samples, as

Cov(Ok1,Ok2)(J1,J2,0,0)ΣU(0,0,J3,J4)

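where $\bar{J}_i$ is the value of $J_i$ upon replacing the $U_k$'s with their expected values computed at $\eta = \eta_0$, or, as

$$\mathrm{Cov}(O_{k_1}, O_{k_2}) \approx \bar{J}_1 \bar{J}_3 \sigma_{13} + \bar{J}_1 \bar{J}_4 \sigma_{14} + \bar{J}_2 \bar{J}_3 \sigma_{23} + \bar{J}_2 \bar{J}_4 \sigma_{24} \tag{A3}$$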

where $\sigma_{ij}$ is the $(i, j)$-th element of $\Sigma_U$.

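One can compute $\sigma_{24}$ as

$$\sigma_{24} = \mathrm{Cov}(U_2, U_4) = \mathrm{Cov}\left(\sum_i W_{ik_1},\ \sum_i W_{ik_2}\right) = \sum_i \mathrm{Cov}(W_{ik_1}, W_{ik_2})$$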

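where the last equality holds because the item scores are independent over two different examinees $i_1$ and $i_2$, which results in $\mathrm{Cov}(W_{i_1 k_1}, W_{i_2 k_2}) = 0$. Consequently

$$\sigma_{24} = \sum_i \left[E(W_{ik_1} W_{ik_2}) - E(W_{ik_1})\, E(W_{ik_2})\right] = -\sum_i E(W_{ik_1})\, E(W_{ik_2}) \tag{A4}$$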

because the raw score of examinee $i$ cannot be equal to $k_1$ and also equal to $k_2$, so that $W_{ik_1} W_{ik_2}$ is equal to 0.
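The argument behind equation (A4) can be checked exactly for a tiny sample. The sketch below enumerates every way of assigning a handful of examinees to the score groups "k1", "k2", and "other" and compares the resulting covariance of the two group counts with the negative sum of products of expectations. The group probabilities (0.2 and 0.3), the sample size of 4, and the function name `count_covariance` are hypothetical values chosen only for illustration.

```python
from itertools import product

def count_covariance(n_examinees, p1, p2):
    """Exact Cov(U2, U4) obtained by enumerating all score-group assignments."""
    probs = {"k1": p1, "k2": p2, "other": 1.0 - p1 - p2}
    e_u2 = e_u4 = e_u2u4 = 0.0
    for outcome in product(probs, repeat=n_examinees):
        weight = 1.0
        for group in outcome:          # examinees are independent
            weight *= probs[group]
        u2 = outcome.count("k1")       # number of examinees with total score k1
        u4 = outcome.count("k2")       # number of examinees with total score k2
        e_u2 += weight * u2
        e_u4 += weight * u4
        e_u2u4 += weight * u2 * u4
    return e_u2u4 - e_u2 * e_u4

N, P1, P2 = 4, 0.2, 0.3                # hypothetical sample size and probabilities
cov = count_covariance(N, P1, P2)
closed_form = -N * P1 * P2             # minus the sum over examinees of E(W_ik1) E(W_ik2)
print(cov, closed_form)
```

The enumeration grows as 3 to the power of the number of examinees, so this check is meant only for very small samples.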

Now note that $E(W_{ik_1})$, the probability that the raw score on the test is $k_1$ for an examinee who is randomly chosen from the population of all examinees, is equal to $\int S(T = k_1 \mid \theta, \eta)\, \psi(\theta)\, d\theta$ and hence is the same over all the examinees. Therefore

$$E(W_{ik_1}) = \frac{1}{N} \sum_i E(W_{ik_1}) = \frac{1}{N} E\left(\sum_i W_{ik_1}\right) = \frac{1}{N} E(U_2) \tag{A5}$$

Similarly, one obtains

$$E(W_{ik_2}) = \frac{1}{N} E(U_4). \tag{A6}$$

Equations (A4) to (A6) imply that

$$\sigma_{24} = -\sum_i \left[\frac{1}{N} E(U_2)\right] \left[\frac{1}{N} E(U_4)\right] = -\frac{1}{N} E(U_2)\, E(U_4)$$

Let $\bar{U}_k$ denote $E(U_k)$, where the expectation is computed at $\eta = \eta_0$, $k = 1, \ldots, 4$. Then

$$\sigma_{24} = -\frac{1}{N} \bar{U}_2 \bar{U}_4 \tag{A7}$$

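It is possible to prove in a similar manner that

$$\sigma_{13} = -\frac{1}{N} \bar{U}_1 \bar{U}_3, \quad \sigma_{14} = -\frac{1}{N} \bar{U}_1 \bar{U}_4, \quad \sigma_{23} = -\frac{1}{N} \bar{U}_2 \bar{U}_3 \tag{A8}$$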

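Finally, equations (A2), (A3), (A7), and (A8) imply that

$$\begin{aligned}
\mathrm{Cov}(O_{k_1}, O_{k_2}) &\approx -\bar{J}_1 \bar{J}_3 \frac{1}{N} \bar{U}_1 \bar{U}_3 - \bar{J}_1 \bar{J}_4 \frac{1}{N} \bar{U}_1 \bar{U}_4 - \bar{J}_2 \bar{J}_3 \frac{1}{N} \bar{U}_2 \bar{U}_3 - \bar{J}_2 \bar{J}_4 \frac{1}{N} \bar{U}_2 \bar{U}_4 \\
&= -\frac{1}{N} \left[ \bar{U}_1 \bar{U}_3 \frac{1}{\bar{U}_2} \frac{1}{\bar{U}_4} - \bar{U}_1 \bar{U}_4 \frac{1}{\bar{U}_2} \frac{\bar{U}_3}{\bar{U}_4^2} - \bar{U}_2 \bar{U}_3 \frac{\bar{U}_1}{\bar{U}_2^2} \frac{1}{\bar{U}_4} + \bar{U}_2 \bar{U}_4 \frac{\bar{U}_1}{\bar{U}_2^2} \frac{\bar{U}_3}{\bar{U}_4^2} \right] = 0
\end{aligned}$$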
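The cancellation in this appendix can also be checked numerically. The sketch below builds the variance-covariance matrix of U and the Jacobian of equation (A1) for a small hypothetical test and evaluates the delta-method approximation directly. It assumes a five-item Rasch model with made-up difficulties and a three-point discrete ability distribution standing in for the normal distribution; the function name `delta_method_cov` and all numerical inputs are illustrative and are not taken from the paper.

```python
import numpy as np
from itertools import product

def delta_method_cov(difficulties, thetas, weights, item, k1, k2, n_examinees):
    """Delta-method covariance matrix of (O_k1, O_k2) for one item."""
    b = np.asarray(difficulties, dtype=float)
    m = np.zeros(4)           # E(u) for a single examinee
    m2 = np.zeros((4, 4))     # E(u u') for a single examinee
    for pattern in product([0, 1], repeat=len(b)):
        x = np.array(pattern)
        # Marginal probability of this response pattern (Rasch model, assumed)
        prob = 0.0
        for theta, w in zip(thetas, weights):
            p = 1.0 / (1.0 + np.exp(-(theta - b)))
            prob += w * np.prod(np.where(x == 1, p, 1.0 - p))
        s = x.sum()           # total score for this pattern
        u = np.array([(s == k1) * x[item], s == k1,
                      (s == k2) * x[item], s == k2], dtype=float)
        m += prob * u
        m2 += prob * np.outer(u, u)
    sigma_u = n_examinees * (m2 - np.outer(m, m))   # Sigma_U (independent examinees)
    u_bar = n_examinees * m                         # expected values of U_1, ..., U_4
    jac = np.array([
        [1 / u_bar[1], -u_bar[0] / u_bar[1] ** 2, 0.0, 0.0],
        [0.0, 0.0, 1 / u_bar[3], -u_bar[2] / u_bar[3] ** 2],
    ])
    return jac @ sigma_u @ jac.T

cov = delta_method_cov(
    difficulties=[-1.0, -0.5, 0.0, 0.5, 1.0],       # hypothetical item difficulties
    thetas=[-1.5, 0.0, 1.5], weights=[0.25, 0.5, 0.25],
    item=0, k1=2, k2=3, n_examinees=1000,
)
print(cov)
```

In this sketch the off-diagonal element vanishes up to floating-point rounding, while the diagonal elements give the approximate variances of the two observed proportions.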

Note

1. For example, the R package mirt (Chalmers, 2012) can be used to compute such a matrix.

Footnotes

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

Author Note: Any opinions expressed in this publication are those of the author and not necessarily of Educational Testing Service.

ORCID iDs

Sandip Sinharay https://orcid.org/0000-0003-4491-8510

Matthew S. Johnson https://orcid.org/0000-0003-3157-4165

References

  1. Agresti A. (2013). Categorical data analysis (3rd ed.). Wiley.
  2. American Educational Research Association, American Psychological Association, & National Council for Measurement in Education (2014). Standards for educational and psychological testing. American Educational Research Association.
  3. Banerjee S., Roy A. (2014). Linear algebra and matrix analysis for statistics. Chapman and Hall/CRC.
  4. Box G. E. P. (1979). Some problems of statistics and everyday life. Journal of the American Statistical Association, 74(365), 1–4. https://doi.org/10.1080/01621459.1979.10481600
  5. Cai L., du Toit S. H. C., Thissen D. (2011). IRTPRO: Flexible, multidimensional, multiple categorical IRT modeling. Scientific Software International.
  6. Chalmers R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. https://doi.org/10.18637/jss.v048.i06
  7. Chen W.-H., Thissen D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265–289. https://doi.org/10.2307/1165285
  8. Chernoff H., Lehmann E. L. (1954). The use of maximum likelihood estimates in χ2 tests for goodness of fit. The Annals of Mathematical Statistics, 25(3), 579–586. https://doi.org/10.1214/aoms/1177728726
  9. Fisher R. A. (1924). The conditions under which χ2 measures the discrepancy between observation and hypothesis. Journal of the Royal Statistical Society, 87(3), 442–450.
  10. Glas C. A., Suarez-Falcón J. C. (2003). A comparison of item-fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27(2), 87–106. https://doi.org/10.1177/0146621602250530
  11. Glas C. A. W. (1988). The derivation of some tests for the Rasch model from the multinomial distribution. Psychometrika, 53(4), 525–546. https://doi.org/10.1007/bf02294405
  12. Haberman S. J., Sinharay S., Chon K. H. (2013). Assessing item fit for unidimensional item response theory models using residuals from estimated item response functions. Psychometrika, 78(3), 417–440. https://doi.org/10.1007/s11336-012-9305-1
  13. Hambleton R. K., Han N. (2005). Assessing the fit of IRT models to educational and psychological test data: A five step plan and several graphical displays. In Lenderking W. R., Revicki D. (Eds.), Advances in health outcomes research methods, measurement, statistical analysis, and clinical applications (pp. 57–78). Degnon Associates.
  14. Harris R. R., Kanji G. K. (1983). On the use of minimum chi-square estimation. The Statistician, 32(4), 379. https://doi.org/10.2307/2987540
  15. Harwell M. R., Baker F. B., Zwarts M. (1988). Item parameter estimation via marginal maximum likelihood and an EM algorithm: A didactic. Journal of Educational Statistics, 13(3), 243–271. https://doi.org/10.2307/1164654
  16. Kang T., Chen T. T. (2008). Performance of the generalized S-X2 item fit index for polytomous IRT models. Journal of Educational Measurement, 45(4), 391–406. https://doi.org/10.1111/j.1745-3984.2008.00071.x
  17. Kang T., Chen T. T. (2010). Performance of the generalized S-X2 item fit index for the graded response model. Asia Pacific Education Review, 12(1), 89–96. https://doi.org/10.1007/s12564-010-9082-4
  18. Lehmann E. L., Casella G. (1998). Theory of point estimation (2nd ed.). Springer-Verlag.
  19. Lim H. (2020). irtplay: Unidimensional item response theory modeling. (R package version 1.6.2).
  20. Lord F. M., Wingersky M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings”. Applied Psychological Measurement, 8(4), 453–461. https://doi.org/10.1177/014662168400800409
  21. Mislevy R. J., Bock R. D. (1991). BILOG 3.11 [computer software]. Scientific Software International.
  22. Muraki E., Bock R. D. (2003). PARSCALE 4: IRT item analysis and test scoring for rating-scale data [computer program]. Scientific Software.
  23. Orlando M., Thissen D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24(1), 50–64. https://doi.org/10.1177/01466216000241003
  24. Pearson K. (1900). X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302), 157–175. https://doi.org/10.1080/14786440009463897
  25. Rao C. R. (1962). Efficient estimates and optimum inference procedures in large samples. Journal of the Royal Statistical Society. Series B (Methodological), 24(1), 46–72. https://doi.org/10.1111/j.2517-6161.1962.tb00436.x
  26. Rao K. C., Robson D. S. (1974). A chi-square statistic for goodness-of-fit tests within the exponential family. Communications in Statistics, 3(12), 1139–1153. https://doi.org/10.1080/03610917408548327
  27. Roberts J. S. (2008). Modified likelihood-based item fit statistics for the generalized graded unfolding model. Applied Psychological Measurement, 32(5), 407–423. https://doi.org/10.1177/0146621607301278
  28. Sinharay S. (2006). Bayesian item fit analysis for unidimensional item response theory models. The British Journal of Mathematical and Statistical Psychology, 59(2), 429–449. https://doi.org/10.1348/000711005x66888
  29. Sinharay S. (2017). How to compare parametric and nonparametric person-fit statistics using real data. Journal of Educational Measurement, 54(4), 420–439. https://doi.org/10.1111/jedm.12155
  30. Sinharay S., Haberman S. J. (2014). How often is the misfit of item response theory models practically significant? Educational Measurement: Issues and Practice, 33(1), 23–35. https://doi.org/10.1111/emip.12024
  31. Sinharay S., Lu Y. (2008). A further look at the correlation between item parameters and item fit statistics. Journal of Educational Measurement, 45, 1–15. https://doi.org/10.1111/j.1745-3984.2007.00049.x
  32. Sorrel M. A., Abad F. J., Olea J., de la Torre J., Barrada J. R. (2017). Inferential item-fit evaluation in cognitive diagnosis modeling. Applied Psychological Measurement, 41(8), 614–631. https://doi.org/10.1177/0146621617707510
  33. Stone C. A. (2000). Monte Carlo based null distribution for an alternative goodness-of-fit test statistic in IRT models. Journal of Educational Measurement, 37(1), 58–75. https://doi.org/10.1111/j.1745-3984.2000.tb01076.x
  34. Stone C. A., Zhang B. (2003). Assessing goodness of fit of item response theory models: A comparison of traditional and alternative procedures. Journal of Educational Measurement, 40(4), 331–352. https://doi.org/10.1111/j.1745-3984.2003.tb01150.x
  35. Thissen D. (1991). MULTILOG: Multiple category item analysis and test scoring using item response theory [computer software]. Scientific Software International.
  36. Zhang B., Stone C. A. (2007). Evaluating item fit for multidimensional item response models. Educational and Psychological Measurement, 68(2), 181–196. https://doi.org/10.1177/0013164407301547

Articles from Applied Psychological Measurement are provided here courtesy of SAGE Publications
