The Standardized S-X² Statistic for Assessing Item Fit

Zhuangzhuang Han; Sandip Sinharay; Matthew S Johnson; Xiang Liu

2022 Sep 17;47(1):3–18. doi: 10.1177/01466216221108077

PMCID: PMC9679924 PMID: 36425289

Abstract

The S-X² statistic (Orlando & Thissen, 2000) is popular among researchers and practitioners who are interested in the assessment of item fit. However, the statistic suffers from the Chernoff–Lehmann problem (Chernoff & Lehmann, 1954) and hence does not have a known asymptotic null distribution. This paper suggests a modified version of the S-X² statistic that is based on the modified Rao–Robson χ² statistic (Rao & Robson, 1974). A simulation study and a real data analysis demonstrate that the use of the modified statistic instead of the S-X² statistic would lead to fewer items being flagged for misfit.

Keywords: item response theory model fit, Orlando-Thissen statistic, Pearson’s χ² statistic, Rao-Robson’s modified χ² statistic

Introduction

Standard 4.10 of the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014) recommends documenting evidence of model-data fit when an item response theory (IRT) model is employed in test development and score reporting. In practice, analysis of model-data fit for IRT models involves the use of item-fit residuals and χ²-type statistics (Hambleton & Han, 2005). Among the χ²-type statistics for IRT models, the S-X² statistic (Orlando & Thissen, 2000) is popular, presumably for four reasons. First, to compute S-X², one divides the examinees into groups based on their observed total scores rather than their estimated abilities. Second, S-X² has been found to perform respectably in terms of Type I error rates and power in simulation studies (e.g., Glas & Suarez-Falcón, 2003; Sinharay, 2006; Sinharay & Lu, 2008; Stone & Zhang, 2003). Third, the simple and intuitive nature of S-X² has allowed it to be easily generalized to cases with polytomous items (Kang & Chen, 2008, 2010), multidimensional examinee abilities (Zhang & Stone, 2007), unfolding models (Roberts, 2008), and cognitive diagnostic models (e.g., Sorrel et al., 2017). Fourth, S-X² is implemented in multiple IRT software packages including irtplay (Lim, 2020), mirt (Chalmers, 2012), and IRTPRO (Cai et al., 2011).

Notwithstanding these appealing features, S-X² should not be used without considering its limitations. As noted by researchers such as Sinharay (2006), S-X², which is a special case of the Pearson’s χ² statistic (Pearson, 1900), does not have a known asymptotic null distribution in typical IRT applications where the traditional marginal maximum likelihood estimates (MMLEs) of item parameters are used to compute the statistic. Instead, the values of S-X² are stochastically larger than those from the theorized (χ²) distribution of the statistic. As a consequence, the Type I error rates of S-X² tend to be slightly larger than the nominal level even for large samples, which has been observed in multiple simulation studies (e.g., Glas & Suarez-Falcón, 2003; Sinharay, 2006; Sinharay & Lu, 2008). The aim of this paper is to introduce a modified S-X² statistic that has a known χ² asymptotic null distribution.

The next section includes a review of the Pearson’s χ² statistic used for assessing general model-data fit and the S-X² statistic (Orlando & Thissen, 2000) for assessing item fit, followed by a brief review of a potential problem associated with the use of the Pearson’s χ² statistic (Chernoff & Lehmann, 1954). The section also includes a description of the modified Pearson’s χ² statistic that Rao and Robson (1974) suggested to overcome the Chernoff–Lehmann problem. The method section presents the details of our modified S-X² statistic that is a special case of the modified Pearson’s χ² statistic. The section on simulation studies compares the modified S-X² statistic with the original S-X² statistic with respect to Type I error rates and power. The two statistics are compared using a real data set in the penultimate section. Conclusions and recommendations are provided in the last section. Although the S-X² statistic has been extended to tests with polytomously scored items (Kang & Chen, 2008, 2010), we will only consider tests with dichotomously scored items.

Background: Pearson’s χ², Orlando-Thissen’s S-X², Chernoff–Lehmann Problem, and Rao–Robson’s Modified χ²

Pearson’s χ² Statistic

Let us assume that a sample with N independent observations, y₁, y₂, …, y_N, is available from a population. Suppose that p(y_i; η ), the probability distribution of y_i, involves a parameter vector η with L elements. Suppose that the observations are partitioned into K groups (or cells) and the proportion of observations belonging to group k is $p_{k} = \frac{N_{k}}{N}$ , where N_k represents the number of observations in group k, k = 1, 2, …, K. Let π_k( η ) denote the expected value of p_k under the assumed probability distribution.

Pearson’s χ² statistic (Pearson, 1900) for assessing goodness of fit, denoted henceforth as P-X², is defined as

P - X^{2} = N \sum_{k = 1}^{K} \frac{{(p_{k} - π_{k} (η))}^{2}}{π_{k} (η)} = {[u (η)]}^{⊤} u (η),

(1)

where

u (η) = \sqrt{N} {(\frac{p_{1} - π_{1} (η)}{\sqrt{π_{1} (η)}}, \frac{p_{2} - π_{2} (η)}{\sqrt{π_{2} (η)}}, \dots, \frac{p_{K} - π_{K} (η)}{\sqrt{π_{K} (η)}})}^{⊤} .

(2)

In practice, the parameter vector η is unknown, and P-X² is computed by replacing η with $\hat{η}$ , the maximum likelihood estimate (MLE) of η . The resulting statistic is then assumed to follow, for large samples and under no model misfit, a χ² distribution with K − L − 1 degrees of freedom (df), that is, the $χ_{K - L - 1}^{2}$ distribution.
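As a small numerical illustration of equations (1) and (2), the following sketch computes P-X² as the inner product of the vector u with itself. The function name `pearson_x2`, the cell counts, and the cell probabilities are made up for illustration and do not come from the paper.

```python
import math


def pearson_x2(counts, expected_probs):
    """Return P-X^2 = u'u, where u_k = sqrt(N) (p_k - pi_k) / sqrt(pi_k)."""
    n = sum(counts)  # total number of observations N
    u = [
        math.sqrt(n) * (count / n - pi) / math.sqrt(pi)
        for count, pi in zip(counts, expected_probs)
    ]
    return sum(x * x for x in u)


# Hypothetical example: K = 3 cells with model-implied probabilities pi_k.
counts = [18, 55, 27]
pi = [0.25, 0.50, 0.25]
x2 = pearson_x2(counts, pi)
print(round(x2, 4))
```

With these illustrative inputs the three cell contributions are 1.96, 0.50, and 0.16, so the statistic is 2.62.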

Orlando and Thissen’s S-X² Statistic

Orlando and Thissen (2000) developed the S-X² statistic, which is a special case of the Pearson’s χ² statistic, to assess item fit in the context of IRT models for dichotomously scored items. Suppose that we are interested in assessing item fit for a J-item test. To compute S-X² for a given item of interest, the examinees are divided into (J + 1) groups, where group k includes all the examinees whose raw score is k. Let N_k denote the size of group k. One then computes, for each group k, O_k, which is the observed proportion of test-takers in the group who answered the item correctly. The statistic S-X² for the item is then computed as

S - X^{2} = \sum_{k = 1}^{K} \frac{N_{k} {[O_{k} - E_{k} (η)]}^{2}}{E_{k} (η) [1 - E_{k} (η)]} = {[v (η)]}^{⊤} v (η),

(3)

where K = J − 1, E_k( η ) is the expected value of O_k under the IRT model,

v (η) = {(\frac{\sqrt{N_{1}} [O_{1} - E_{1} (η)]}{\sqrt{E_{1} (η) [1 - E_{1} (η)]}}, \frac{\sqrt{N_{2}} [O_{2} - E_{2} (η)]}{\sqrt{E_{2} (η) [1 - E_{2} (η)]}}, \dots, \frac{\sqrt{N_{K}} [O_{K} - E_{K} (η)]}{\sqrt{E_{K} (η) [1 - E_{K} (η)]}})}^{⊤},

(4)

and the L × 1 vector η includes the parameters of the item of interest, that is, $η = {(η_{1}, \dots, η_{L})}^{⊤}$ , where L could vary over the items depending on the assumed IRT model, and, for example, would be equal to 2 if the two-parameter logistic (2PL) model is used. Let v_k( η ) denote the k-th element of v ( η ).

In computing the S-X² statistic, the number of examinee groups (K) is typically equal to J − 1 because O₀ = E₀( η ) = 0 and O_J = E_J ( η ) = 1 for any data set, so the groups with raw scores 0 and J carry no information about item fit. For small samples, to ensure that the expected number of examinees is not too small in any examinee group, some groups may be merged and K can be set equal to a number smaller than J − 1. In the simulations and empirical data examples for this paper, groups with an expected count of fewer than 5 test-takers were merged, as recommended by Orlando and Thissen (2000). However, for the sake of simplicity, merging is not considered in the theoretical derivations.
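Equations (3) and (4) can be sketched in a few lines once the group sizes N_k, the observed proportions O_k, and the expected proportions E_k are available. The function names and the numerical inputs below are hypothetical; in an actual analysis the E_k would come from equation (5) evaluated at the estimated item parameters.

```python
import math


def standardized_residuals(group_sizes, observed, expected):
    """Elements v_k of equation (4) for score groups k = 1, ..., K."""
    return [
        math.sqrt(n) * (o - e) / math.sqrt(e * (1.0 - e))
        for n, o, e in zip(group_sizes, observed, expected)
    ]


def s_x2(group_sizes, observed, expected):
    """S-X^2 of equation (3), computed as v'v."""
    v = standardized_residuals(group_sizes, observed, expected)
    return sum(x * x for x in v)


# Hypothetical grouped data for one item with K = 3 score groups.
n_k = [100, 200, 100]
o_k = [0.30, 0.55, 0.80]
e_k = [0.25, 0.50, 0.75]
stat = s_x2(n_k, o_k, e_k)
print(round(stat, 4))
```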

The expected proportion of examinees for group k, E_k( η ), is computed as

E_{k} (η) = \frac{\int P (Y = 1 | θ, η) P (T_{- 1} = k - 1 | θ, η) ψ (θ) d θ}{\int P (T = k | θ, η) ψ (θ) d θ},

(5)

where Y is the score of a randomly chosen examinee on the item of interest, P(Y = 1|θ, η ) is the probability that Y is equal to 1 given examinee ability θ and item parameters η , T is the total (raw) score on the test, T₋₁ is the rest score, or the total score on all items except the item of interest, $P (T = k | θ, η)$ is the probability that T is equal to k given ability θ and item parameters η , $P (T_{- 1} = k - 1 | θ, η)$ is the probability that the rest score given ability θ is equal to k − 1, and ψ(θ) is the population distribution of the examinee ability and typically assumed to be the standard normal distribution. The integrals in equation (5) are approximated using numerical integration.

The expressions P(Y = 1|θ, η ), $P (T_{- 1} = k - 1 | θ, η)$ , and $P (T = k | θ, η)$ depend on the IRT model fitted to the data. If, for example, the 2PL model is used, then

P (Y = 1 | θ, η) = \frac{\exp [a (θ - b)]}{1 + \exp [a (θ - b)]},

where a and b, respectively, are the slope and difficulty parameters of the item of interest. Also, the terms $P (T_{- 1} = k - 1 | θ, η)$ and $P (T = k | θ, η)$ are computed using the Lord–Wingersky recursion formula (Lord & Wingersky, 1984).
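The computation of E_k( η ) in equation (5) under the 2PL model can be sketched as follows. The summed-score recursion is written from the standard description of the Lord–Wingersky algorithm. The quadrature rule (an equally spaced grid on [−6, 6] with normalized standard normal weights), the function names, and the item parameters are assumptions made for illustration; the paper states only that numerical integration is used.

```python
import math


def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def lord_wingersky(probs):
    """Summed-score distribution for independent items with the given
    probabilities of a correct response (all conditional on theta)."""
    dist = [1.0]  # with zero items, the score is 0 with probability 1
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1.0 - p)  # item answered incorrectly
            new[score + 1] += mass * p      # item answered correctly
        dist = new
    return dist


def expected_proportions(items, item_index, n_nodes=61, bound=6.0):
    """E_k of equation (5) for k = 1, ..., J - 1 under the 2PL model.

    `items` is a list of (a, b) pairs; `item_index` picks the item of interest.
    """
    n_items = len(items)
    step = 2.0 * bound / (n_nodes - 1)
    nodes = [-bound + i * step for i in range(n_nodes)]
    weights = [math.exp(-0.5 * t * t) for t in nodes]  # N(0, 1) kernel
    total = sum(weights)
    weights = [w / total for w in weights]
    num = [0.0] * (n_items + 1)
    den = [0.0] * (n_items + 1)
    for theta, w in zip(nodes, weights):
        probs = [p_2pl(theta, a, b) for a, b in items]
        rest = lord_wingersky(probs[:item_index] + probs[item_index + 1:])
        full = lord_wingersky(probs)
        for k in range(1, n_items):
            num[k] += w * probs[item_index] * rest[k - 1]  # numerator of (5)
            den[k] += w * full[k]                          # denominator of (5)
    return [num[k] / den[k] for k in range(1, n_items)]


# Hypothetical 4-item test with identical items; by exchangeability E_k = k / J.
items = [(1.0, 0.0)] * 4
e_k = expected_proportions(items, item_index=0)
print([round(e, 4) for e in e_k])
```

The identical-item case is a convenient sanity check: when all items share the same parameters, the probability of answering any one of them correctly given a raw score of k is k/J, regardless of the ability distribution.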

Orlando and Thissen (2000) assumed that the asymptotic null distribution of S-X² is the $χ_{K - L}^{2}$ distribution.

The Chernoff–Lehmann Problem with the Pearson’s χ² Statistic

A critical step in defining P-X², Pearson’s χ² statistic, is the partitioning of the data into K groups. Under the setup of the subsection on Pearson’s χ² statistic, the grouped data comprise O_k = Np_k, k = 1, 2, …, K. Because the O_k’s follow the multinomial distribution (e.g., Agresti, 2013, p. 6), the log-likelihood of η based on the grouped data is given, up to an additive constant, by

\log \prod_{k} {[π_{k} (η)]}^{N p_{k}} = N \sum_{k} p_{k} \log π_{k} (η) .

(6)

Fisher (1924) proved that if P-X² is computed using the estimated parameter vector $\tilde{η}$ that maximizes the log-likelihood provided in equation (6), then the asymptotic null distribution of P-X² is the $χ_{K - L - 1}^{2}$ distribution. That is, for large samples and under no model misfit,

P - X^{2} = {[u (\tilde{η})]}^{⊤} u (\tilde{η}) \sim χ_{K - L - 1}^{2} .

(7)

The distribution reflects a loss of 1 df for each parameter that is estimated. The estimate $\tilde{η}$ is often referred to as the minimum χ² estimator (e.g., Harris & Kanji, 1983).

Let $\hat{η}$ denote the MLE of η , which is computed by maximizing

\sum_{i = 1}^{N} \log p (y_{i}; η),

which is the log-likelihood for the original/ungrouped data.

Chernoff and Lehmann (1954) proved that if one uses $\hat{η}$ to compute P-X², the corresponding statistic satisfies, for large samples and under no model misfit,

P - X^{2} = {[u (\hat{η})]}^{⊤} u (\hat{η}) \sim χ_{K - L - 1}^{2} + \sum_{l = 1}^{L} λ_{l} (\hat{η}) χ_{1}^{2},

(8)

where 0 < λ_l( η ) < 1; that is, the statistic is somewhere between a $χ_{K - L - 1}^{2}$ variable and a $χ_{K - 1}^{2}$ variable on average. Equation (8) implies that if a statistic of the form ${[u (\hat{η})]}^{⊤} u (\hat{η})$ is used to assess item fit and the $χ_{K - L - 1}^{2}$ distribution is used to approximate the limiting distribution of the statistic, the null hypothesis of adequate model fit will be rejected more often than is appropriate, which would result in an inflated Type I error rate of the fit-assessment approach.

Equations (1) and (4) imply that the S-X² statistic is a special case of the Pearson’s χ² statistic. In addition, S-X² is computed using the MMLE of the item parameters based on the original/ungrouped data and yet is assumed to have a $χ_{J - L - 1}^{2}$ asymptotic null distribution (Orlando & Thissen, 2000). Such a use of S-X² is exactly like the use of the Pearson’s χ² statistic along with the $χ_{K - L - 1}^{2}$ asymptotic null distribution. Therefore, S-X² is expected to suffer from the Chernoff–Lehmann problem and to follow not a χ² distribution, but a distribution like the one given by equation (8). Thus, S-X² is expected to be larger on average than a $χ_{J - L - 1}^{2}$ random variable for large samples under no model misfit. Existing simulation studies that examined the Type I error rates of S-X² corroborate this expectation. Glas and Suarez-Falcón (2003), Sinharay (2006), and Sinharay and Lu (2008) found in simulation studies that the Type I error rates of S-X² are slightly inflated when it is computed using the MMLEs of item parameters from ungrouped data and is assumed to have the $χ_{J - L - 1}^{2}$ asymptotic null distribution. For example, Table 1 of Glas and Suarez-Falcón (2003) shows that the Type I error rates of S-X² at the 5% significance level are 0.08, 0.08, and 0.07, respectively, for sample sizes 500, 1000, and 4000 for 10-item tests. The resampling-based approaches developed by Sinharay (2006), Stone (2000), and Stone and Zhang (2003), which involve the determination of the null distribution of S-X² using simulations, offer alternative solutions and successfully avoid the use of an inaccurate asymptotic null distribution, but these approaches are computation-intensive. The use of the minimum χ² estimator $\tilde{η}$ and the P-X² statistic defined in equation (7) is another possible approach to attain the target Type I error rate. However, $\hat{η}$ is a more efficient estimator than $\tilde{η}$ because the former utilizes more information than the latter (e.g., Rao, 1962; Rao & Robson, 1974). Also, $\hat{η}$ is more popular than $\tilde{η}$ . For example, the former is implemented in several publicly available IRT software packages such as BILOG (Mislevy & Bock, 1991), MULTILOG (Thissen, 1991), and PARSCALE (Muraki & Bock, 2003). Thus, a χ²-type statistic that utilizes $\hat{η}$ rather than $\tilde{η}$ is likely to be more useful and popular among researchers and practitioners.

Table 1.

The Type I Error Rates of S-X² and $S - X_{R R}^{2}$ for the 2PL model.

Test Length	Statistic	N = 500	N = 1000	N = 2000	N = 4000
10	S-X²	0.092	0.087	0.074	0.068
10	$S - X_{R R}^{2}$	0.035	0.042	0.043	0.044
20	S-X²	0.072	0.067	0.061	0.057
20	$S - X_{R R}^{2}$	0.054	0.048	0.041	0.040
40	S-X²	0.062	0.057	0.053	0.051
40	$S - X_{R R}^{2}$	0.054	0.053	0.049	0.047

The Modified χ² Statistic of Rao and Robson

One solution to the abovementioned Chernoff–Lehmann problem is to modify P-X² in a way such that the modified statistic has a known asymptotic null distribution.

One modification of the Pearson’s χ² statistic was suggested by Rao and Robson (1974) and is computed as

P - X_{R R}^{2} = {[u (\hat{η})]}^{⊤} Σ_{u (\hat{η})}^{- 1} u (\hat{η}),

where $Σ_{u (\hat{η})}$ is the approximate covariance matrix of $u (\hat{η})$ for large samples. The modification is essentially a standardization of $u (\hat{η})$ such that $Σ_{u (\hat{η})}^{- 1 / 2} u (\hat{η})$ follows a multivariate normal distribution for large samples under no model misfit, and, consequently

P - X_{R R}^{2} \sim χ_{K - 1}^{2} .

Note that there is no loss of df for parameter estimation in the null distribution of the $P - X_{R R}^{2}$ statistic. Rao and Robson (1974) found that $P - X_{R R}^{2}$ has larger power than the Pearson’s χ² statistic computed using the minimum χ² estimator defined in equation (7)—this result is presumably due to the larger degrees of freedom of the former statistic compared to the latter statistic.

In this paper, we borrow the idea underlying $P - X_{R R}^{2}$ and derive the covariance matrix $Σ_{v (\hat{η})}$ . The matrix $Σ_{v (\hat{η})}$ allows us to compute the statistic $S - X_{R R}^{2}$ , which is a special case of the $P - X_{R R}^{2}$ statistic and is a modified version of the S-X² statistic, as

S - X_{R R}^{2} = {[v (\hat{η})]}^{⊤} Σ_{v (\hat{η})}^{- 1} v (\hat{η}) .

(9)

Further,

S - X_{R R}^{2} \sim χ_{J - 1}^{2}

(Rao & Robson, 1974). The key to this modification is the computation of the covariance matrix $Σ_{v (\hat{η})}$ . The detailed derivation of the matrix is provided below.

Method: Derivation of the Covariance Matrix Required in $S - X_{R R}^{2}$

To obtain $Σ_{v (\hat{η})}$ , we first approximate $v (\hat{η})$ using the first-order Taylor series expansion (e.g., Lehmann & Casella, 1998, p. 77) around η ₀ as

v (\hat{η}) \approx v (η_{0}) + A_{0} (\hat{η} - η_{0}),

(10)

where η ₀ is the unknown true item parameter vector,

v (η_{0}) = {(\frac{\sqrt{N_{1}} [O_{1} - E_{1} (η_{0})]}{\sqrt{E_{1} (η_{0}) [1 - E_{1} (η_{0})]}}, \frac{\sqrt{N_{2}} [O_{2} - E_{2} (η_{0})]}{\sqrt{E_{2} (η_{0}) [1 - E_{2} (η_{0})]}}, \dots, \frac{\sqrt{N_{K}} [O_{K} - E_{K} (η_{0})]}{\sqrt{E_{K} (η_{0}) [1 - E_{K} (η_{0})]}})}^{⊤},

(11)

and A ₀ is a K × L matrix whose (k, l)-th element is given by

\begin{array}{l} {(A_{0})}_{k, l} = {\frac{\partial v_{k} (η)}{\partial E_{k} (η)} \frac{\partial E_{k} (η)}{\partial η_{l}} |}_{η = η_{0}} \\ = N_{k}^{1 / 2} [- \frac{1}{E_{k} {(η_{0})}^{1 / 2} {(1 - E_{k} (η_{0}))}^{1 / 2}} + \frac{(E_{k} (η_{0}) - 0.5) (O_{k} - E_{k} (η_{0}))}{E_{k} {(η_{0})}^{3 / 2} {(1 - E_{k} (η_{0}))}^{3 / 2}}] {\frac{\partial E_{k} (η)}{\partial η_{l}} |}_{η = η_{0}} \end{array} .

(12)

Note that for large values of N_k, O_k is approximately equal to E_k( η ₀), and, consequently, ${(A_{0})}_{k, l}$ can be approximated as

{(A_{0})}_{k, l} \approx {- \sqrt{\frac{N_{k}}{E_{k} (η_{0}) (1 - E_{k} (η_{0}))}} \frac{\partial E_{k} (η)}{\partial η_{l}} |}_{η = η_{0}} .

(13)
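The approximation in equation (13) needs only the expected proportions and their derivatives with respect to the item parameters. The sketch below obtains the derivatives by central finite differences, which is an assumption made here for illustration; the paper does not say how the derivatives are computed. The function `toy_expected` is a hypothetical stand-in for E_k( η ), not the IRT-based quantity in equation (5).

```python
import math


def a0_matrix(expected_fn, eta, group_sizes, h=1e-5):
    """K x L matrix of equation (13), with dE_k/d(eta_l) by central differences.

    `expected_fn(eta)` must return the list [E_1(eta), ..., E_K(eta)].
    """
    e0 = expected_fn(eta)
    n_groups, n_params = len(e0), len(eta)
    a0 = [[0.0] * n_params for _ in range(n_groups)]
    for l in range(n_params):
        up = list(eta)
        down = list(eta)
        up[l] += h
        down[l] -= h
        e_up, e_down = expected_fn(up), expected_fn(down)
        for k in range(n_groups):
            d_e = (e_up[k] - e_down[k]) / (2.0 * h)
            scale = math.sqrt(group_sizes[k] / (e0[k] * (1.0 - e0[k])))
            a0[k][l] = -scale * d_e
    return a0


def toy_expected(eta, xs=(-1.0, 0.0, 1.0)):
    """Hypothetical stand-in for E_k(eta): a logistic curve in a group covariate."""
    a, b = eta
    return [1.0 / (1.0 + math.exp(-a * (x - b))) for x in xs]


a0 = a0_matrix(toy_expected, [1.0, 0.0], group_sizes=[100, 100, 100])
print([[round(x, 4) for x in row] for row in a0])
```

For the middle group of this toy example, E_k = 0.5 and the analytic derivatives are 0 and −0.25, so the corresponding row of the matrix is (0, 5).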

Equation (10) implies that

Σ_{v (\hat{η})} \approx Σ_{v (η_{0})} + 2 Cov [A_{0} (\hat{η} - η_{0}), v (η_{0})] + A_{0} Σ_{\hat{η}} A_{0}^{⊤} .

(14)

Among the terms in equation (14), the elements of A ₀ can be approximated using equation (13) and $Σ_{\hat{η}}$ , which is the variance-covariance matrix among the estimates of the item parameters, can be obtained from the IRT software that was used to fit the IRT model to the data set.¹ The computation of the other terms, $Σ_{v (η_{0})}$ and $Cov [A_{0} (\hat{η} - η_{0}), v (η_{0})]$ , is described below.

Computation of $Σ_{v (η_{0})}$

Because of equation (11), the diagonal elements of $Σ_{v (η_{0})}$ are terms such as Var(v_k( η ₀)), where

v_{k} (η_{0}) = \frac{\sqrt{N_{k}} [O_{k} - E_{k} (η_{0})]}{\sqrt{E_{k} (η_{0}) [1 - E_{k} (η_{0})]}}, k = 1,2, \dots, K,

computed at η = η ₀ and the off-diagonal elements of $Σ_{v (η_{0})}$ are terms such as $Cov (v_{k_{1}} (η_{0}), v_{k_{2}} (η_{0}))$ for k₁ ≠ k₂ = 1, 2, …, K computed at η = η ₀.

Because the variance of O_k computed at η = η ₀ is E_k( η ₀)[1 − E_k( η ₀)]/N_k, v_k( η ₀) is standardized, that is, its variance is 1 for k = 1, 2, …, K. So, the diagonal elements of $Σ_{v (η_{0})}$ are all equal to 1. Because the quantities $E_{k_{1}} (η_{0})$ are constants, $Cov (v_{k_{1}} (η_{0}), v_{k_{2}} (η_{0}))$ is a multiple of $Cov (O_{k_{1}}, O_{k_{2}})$ , the covariance of $O_{k_{1}}$ and $O_{k_{2}}$ , computed at η = η ₀. Appendix A includes a proof that $Cov (O_{k_{1}}, O_{k_{2}})$ , computed at η = η ₀, is approximately equal to 0 for large samples. Therefore, the off-diagonal elements of $Σ_{v (η_{0})}$ are all approximately equal to 0 for large samples.

Consequently, for large samples

Σ_{v (η_{0})} \approx I_{K},

(15)

where I _K denotes an identity matrix of dimension K × K.

Computation of $Cov [A_{0} (\hat{η} - η_{0}), v (η_{0})]$

The grouped data in the context of item-fit analysis comprise the quantities N_kO_k and N_k(1 − O_k), which are the numbers of correct and incorrect answers on the item of interest for examinee group k. The log-likelihood of these grouped data is

\tilde{ℓ} (η) = \log \prod_{k} {[E_{k} (η)]}^{N_{k} O_{k}} {[1 - E_{k} (η)]}^{N_{k} (1 - O_{k})} = \sum_{k} [N_{k} O_{k} \log E_{k} (η) + N_{k} (1 - O_{k}) \log (1 - E_{k} (η))] .

As mentioned earlier, the minimum χ² estimator $\tilde{η}$ is obtained by solving

\frac{\partial \tilde{ℓ} (η)}{\partial η_{l}} = \sum_{k} [\frac{N_{k} O_{k}}{E_{k} (η)} - \frac{N_{k} (1 - O_{k})}{1 - E_{k} (η)}] \frac{\partial E_{k} (η)}{\partial η_{l}} = 0, l = 1,2, \dots, L,

(16)

or by solving

\sum_{k} \frac{N_{k} [O_{k} - E_{k} (η)]}{E_{k} (η) [1 - E_{k} (η)]} \frac{\partial E_{k} (η)}{\partial η_{l}} = 0, l = 1,2, \dots, L .

Therefore, the solution $\tilde{η}$ to the above equations satisfies

\frac{\partial \tilde{ℓ} (η)}{\partial η} |_{η = \tilde{η}} = 0_{L \times 1},

(17)

where $0_{L \times 1}$ is a vector of length L whose elements are all zero. Also note that equations (11), (13), and (16) imply that

{\frac{\partial \tilde{ℓ} (η)}{\partial η} |}_{η = η_{0}} = - A_{0}^{⊤} v (η_{0}) .

(18)

By applying the Taylor series expansion around η = η ₀ to ${\frac{\partial \tilde{ℓ} (η)}{\partial η} |}_{η = \tilde{η}}$ and using the result provided in Equations (17) and (18), we obtain

{\frac{\partial \tilde{ℓ} (η)}{\partial η} |}_{η = \tilde{η}} = 0_{L \times 1} \approx - A_{0}^{⊤} v (η_{0}) + B_{0} (\tilde{η} - η_{0}),

(19)

where

B_{0} = {\frac{\partial^{2} \tilde{ℓ} (η)}{\partial η \partial η^{⊤}} |}_{η = η_{0}} .

Equation (19) implies that

B_{0}^{- 1} A_{0}^{⊤} v (η_{0}) - \tilde{η} + η_{0} \approx 0

and, after adding $\hat{η}$ to both sides and rearranging,

B_{0}^{- 1} A_{0}^{⊤} v (η_{0}) + \hat{η} - \tilde{η} \approx \hat{η} - η_{0} .

(20)

Using equation (20), we can express the covariance $Cov [A_{0} (\hat{η} - η_{0}), v (η_{0})]$ in equation (14) as

\begin{array}{l} Cov [A_{0} (\hat{η} - η_{0}), v (η_{0})] \approx Cov [A_{0} B_{0}^{- 1} A_{0}^{⊤} v (η_{0}) + A_{0} (\hat{η} - \tilde{η}), v (η_{0})] \\ = A_{0} B_{0}^{- 1} A_{0}^{⊤} Σ_{v (η_{0})} + Cov [A_{0} (\hat{η} - \tilde{η}), v (η_{0})] . \end{array}

(21)

However, note that $Cov [A_{0} (\hat{η} - \tilde{η}), v (η_{0})]$ , the second term on the right side of equation (21), converges to a matrix of zeroes because A ₀ is a matrix of constants and $\hat{η} - \tilde{η}$ , which is the difference between two sets of item parameter estimates, converges to a zero vector as the sample size increases. Therefore, equation (21) yields the result that

Cov [A_{0} (\hat{η} - η_{0}), v (η_{0})] \approx A_{0} B_{0}^{- 1} A_{0}^{⊤} Σ_{v (η_{0})} .

(22)

Equations (14), (15), and (22) imply that

Σ_{v (\hat{η})} \approx I_{K} + 2 A_{0} B_{0}^{- 1} A_{0}^{⊤} + A_{0} Σ_{\hat{η}} A_{0}^{⊤} .

(23)

Although the minimum χ² estimator appears in the above derivation, one does not have to compute the estimator to compute $Σ_{v (\hat{η})}$ . That is because A ₀ and B ₀ can be adequately approximated by evaluating them at the MLE $\hat{η}$ , which is an accurate estimator of η ₀ for common IRT models (e.g., Harwell et al., 1988).
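The matrix algebra in equation (23), followed by the quadratic form in equation (9), can be sketched as below. This is only an illustration of the arithmetic: the function names are hypothetical, the inputs A ₀, B ₀, $Σ_{\hat{η}}$ , and v are made-up numbers chosen so that the resulting matrix is invertible, and a small Gaussian elimination routine is used in place of a linear algebra library.

```python
def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def sigma_v_hat(a0, b0, sigma_eta):
    """Equation (23): I_K + 2 A0 B0^{-1} A0' + A0 Sigma_eta A0'."""
    n_groups, n_params = len(a0), len(a0[0])
    # b0_inv_rows[j] holds B0^{-1} a_j, where a_j is row j of A0.
    b0_inv_rows = [solve(b0, a0[j]) for j in range(n_groups)]
    sigma = [[0.0] * n_groups for _ in range(n_groups)]
    for i in range(n_groups):
        for j in range(n_groups):
            cross = sum(a0[i][l] * b0_inv_rows[j][l] for l in range(n_params))
            param = sum(
                a0[i][l] * sigma_eta[l][m] * a0[j][m]
                for l in range(n_params)
                for m in range(n_params)
            )
            sigma[i][j] = (1.0 if i == j else 0.0) + 2.0 * cross + param
    return sigma


def s_x2_rr(v, sigma):
    """Quadratic form v' Sigma^{-1} v of equation (9)."""
    return sum(a * b for a, b in zip(v, solve(sigma, v)))


# Hypothetical inputs with K = 2 groups and L = 1 item parameter.
a0 = [[-1.0], [-1.0]]
b0 = [[-4.0]]
sigma_eta = [[0.25]]
v = [1.0, 2.0]
sigma = sigma_v_hat(a0, b0, sigma_eta)
stat_rr = s_x2_rr(v, sigma)
print(sigma, round(stat_rr, 4))
```

With these inputs the matrix has 0.75 on the diagonal and −0.25 off the diagonal, and the quadratic form equals 9.5, compared with an unstandardized value of v'v = 5.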

After approximating $Σ_{v (\hat{η})}$ using equation (23), one can compute our modified version of S-X² as

S - X_{R R}^{2} = {[v (\hat{η})]}^{⊤} Σ_{v (\hat{η})}^{- 1} v (\hat{η}),

(24)

where $v (\hat{η})$ is computed using equation (4) after replacing η by $\hat{η}$ . The asymptotic null distribution of $S - X_{R R}^{2}$ is a $χ_{J - 1}^{2}$ distribution (Rao & Robson, 1974). Thus, item misfit is indicated by values of $S - X_{R R}^{2}$ that are larger than the appropriate percentiles (say 95th or 99th percentile) of the $χ_{J - 1}^{2}$ distribution.
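The flagging rule described above can be sketched as follows. Because the Python standard library has no χ² quantile function, the sketch substitutes the Wilson–Hilferty approximation, which is accurate to roughly two decimal places for the df used here; a statistical library would normally supply exact quantiles. The function names and the statistic value of 15.0 are hypothetical, and the df assume that no score groups were merged.

```python
from statistics import NormalDist


def chi2_quantile_wh(prob, df):
    """Wilson-Hilferty approximation to the chi-square quantile (not exact)."""
    z = NormalDist().inv_cdf(prob)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3


def flag_misfit(statistic, df, level=0.05):
    """True if the statistic exceeds the approximate (1 - level) quantile."""
    return statistic > chi2_quantile_wh(1.0 - level, df)


# Hypothetical 10-item test fitted with the 2PL model (J = 10, L = 2).
n_items, n_params = 10, 2
crit_rr = chi2_quantile_wh(0.95, n_items - 1)              # df = J - 1
crit_sx2 = chi2_quantile_wh(0.95, n_items - n_params - 1)  # df = J - L - 1
print(round(crit_rr, 2), round(crit_sx2, 2))

# The same made-up value can be flagged under one reference distribution only.
print(flag_misfit(15.0, n_items - n_params - 1), flag_misfit(15.0, n_items - 1))
```

The example also shows why the two statistics are not compared against the same critical value: the reference distribution of $S - X_{R R}^{2}$ has L more df than the one assumed for S-X².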

Simulation Studies

We performed a simulation study to evaluate the Type I error rates and power of the new $S - X_{R R}^{2}$ statistic defined in equation (24) and to compare them to those of the S-X² statistic (Orlando & Thissen, 2000) defined in equation (3). In the first part of the study, we compute and compare the Type I error rates of S-X² and $S - X_{R R}^{2}$ for data simulated from the 2PL model. In the second part of the simulation study, we examine and compare the power of S-X² and $S - X_{R R}^{2}$ for data simulated from the Rasch, the 2PL, and the 3PL models. Both statistics were computed using $\hat{η}$ , which is the vector of the MMLEs of the item parameters.

Simulation Design

In the simulations, item scores were simulated under the Rasch, 2PL, and 3PL models. The test length was set equal to 10, 20, or 40. The sample size was set equal to 500, 1000, 2000, or 4000. The true slope parameters, difficulty parameters, and guessing parameters were randomly generated from the uniform distributions U(1, 2), U(−3, 3), and U(0.05, 0.3), respectively, where, for example, U(1, 2) denotes the uniform distribution between 1 and 2. Simulating the true parameter values from other distributions did not affect the comparative performance of the item-fit statistics. To investigate the Type I error rates of the two statistics, the data-generating model (the IRT model that was used to simulate the data) was fitted to the data. To investigate the power of the two statistics, the Rasch and 2PL models were fitted to data simulated from the 3PL model and the Rasch model was fitted to data simulated from the 2PL model. After the models were fitted to the data and the item-fit statistics were computed, the Type I error rate of an item-fit statistic at the 5% significance level was computed as the proportion of values of the statistic that were larger than the 95th percentile of the χ² distribution with J − 1 (for $S - X_{R R}^{2}$ ) or J − L − 1 (for S-X²) df for the simulation cases where the data-generating model and the fitted model were the same. The power of a statistic was computed as the same proportion for the simulation cases where the data-generating model and the fitted model were different. Both the Type I error rate and the power for each combination of test length and sample size were computed from 100 replications. The true item parameters were resampled in each replication.
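The data-generation step of this design can be sketched as below. The uniform distributions for the item parameters follow the description above. Two details are assumptions made for illustration because the text does not state them: abilities are drawn from the standard normal distribution, and the slope is fixed at 1 under the Rasch model. The function name and its arguments are hypothetical.

```python
import math
import random


def simulate_responses(n_examinees, n_items, model="2PL", seed=0):
    """Return (items, data): item parameters (a, b, c) and a 0/1 score matrix."""
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        a = 1.0 if model == "Rasch" else rng.uniform(1.0, 2.0)  # slope
        b = rng.uniform(-3.0, 3.0)                              # difficulty
        c = rng.uniform(0.05, 0.3) if model == "3PL" else 0.0   # guessing
        items.append((a, b, c))
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)  # assumed N(0, 1) ability distribution
        row = []
        for a, b, c in items:
            p = c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return items, data


# One replication for a 10-item test and 500 examinees under the 3PL model.
items, data = simulate_responses(500, 10, model="3PL", seed=1)
raw_scores = [sum(row) for row in data]
print(len(data), len(data[0]), min(raw_scores), max(raw_scores))
```

Each replication would then fit the chosen IRT model to `data`, compute both item-fit statistics, and record whether each exceeds its critical value.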

Results

Table 1 shows the Type I error rates of the two statistics for the various simulation cases where the data-generating model and the fitted model were the same. The table shows that the Type I error rates of S-X² are larger than the nominal level in all simulation cases, a finding that agrees with findings on the Type I error rates of S-X² in Glas and Suarez-Falcón (2003), Sinharay (2006), and Sinharay and Lu (2008). However, the Type I error rates of S-X² are not much larger than the nominal level for 40-item tests. The Type I error rates of the modified statistic $S - X_{R R}^{2}$ are smaller than those of S-X² in all cases, and considerably so for 10-item tests. Thus, $S - X_{R R}^{2}$ overcomes the Chernoff–Lehmann problem to a certain extent. However, the Type I error rates of $S - X_{R R}^{2}$ are considerably smaller than the nominal level for 10-item tests; we plan to investigate this issue in future research.

Table 2 shows the power of the two item-fit statistics for the various simulation cases where the data-generating model and the fitted model were different. The two columns labeled, for example, “2PL/1PL” show the power for the cases in which the data were simulated from the 2PL model and analyzed using the Rasch model. Table 2 shows that the power of the modified statistic $S - X_{R R}^{2}$ is never larger than that of S-X². However, the slightly better power of S-X² relative to $S - X_{R R}^{2}$ is likely a consequence of the inflated Type I error rate of the former statistic. As the sample size increases, the power of both statistics increases toward 1.0 for the “2PL/1PL” and “3PL/1PL” cases, reaching roughly 0.7 to 0.8 at a sample size of 4000. The small power of both item-fit statistics for the “3PL/2PL” case is an outcome of the fact that the 2PL model can explain data simulated from the 3PL model except when the difficulty and guessing parameters for the latter model are too high (Sinharay, 2006).

Table 2.

The Power of $S - X_{R R}^{2}$ and S-X² for Various Combinations of Data-generating Model and Fitted Model.

Test Length	Sample Size	2PL/1PL: S-X²	2PL/1PL: $S - X_{R R}^{2}$	3PL/1PL: S-X²	3PL/1PL: $S - X_{R R}^{2}$	3PL/2PL: S-X²	3PL/2PL: $S - X_{R R}^{2}$
10	500	0.26	0.19	0.34	0.26	0.06	0.05
	1000	0.48	0.40	0.49	0.37	0.07	0.06
	2000	0.65	0.57	0.67	0.55	0.08	0.06
	4000	0.80	0.69	0.82	0.68	0.11	0.05
20	500	0.17	0.17	0.30	0.27	0.07	0.04
	1000	0.39	0.35	0.47	0.40	0.09	0.05
	2000	0.64	0.61	0.67	0.59	0.10	0.08
	4000	0.80	0.75	0.82	0.76	0.13	0.10
40	500	0.18	0.17	0.27	0.25	0.08	0.05
	1000	0.25	0.24	0.42	0.38	0.11	0.08
	2000	0.54	0.48	0.66	0.64	0.11	0.09
	4000	0.77	0.66	0.79	0.75	0.15	0.13

Real Data Example

The two item-fit statistics, S-X² and $S - X_{R R}^{2}$ , were computed for a real data set. The data set includes the item scores of 2000 examinees on a state test with 46 dichotomous and multiple-choice items (with 5 answer options for each item) designed to measure students’ achievement in mathematics and was previously analyzed in Sinharay (2017).

The Rasch, 2PL, and 3PL models were fitted to the data set and the values of $S - X_{R R}^{2}$ and S-X² were computed for all items for each IRT model. Table 3 shows the number of items for which the item-fit statistics were statistically significant at the 5% level of significance for the three IRT models. The table shows that for each IRT model, the use of $S - X_{R R}^{2}$ leads to fewer items being identified as misfitting compared to the use of S-X², with the difference being more prominent for the 2PL model. This finding agrees with the finding of smaller Type I error rate and power of $S - X_{R R}^{2}$ compared to S-X² in the simulation study. Although both statistics are significant for a considerable number of items for the Rasch and 2PL models, S-X² and $S - X_{R R}^{2}$ are significant for only 6 and 3 items, respectively, for the 3PL model. Although the 3PL model seems to fit the data set adequately, further investigations, including tests for local independence (e.g., Chen & Thissen, 1997), should be conducted before finalizing this conclusion.

Table 3.

The Number of Items with Statistically Significant Values of S-X² and $S - X_{R R}^{2}$ for the Three IRT models for the real data set.

Statistic	Rasch	2PL	3PL
S-X²	33	18	6
$S - X_{R R}^{2}$	31	12	3

Note. IRT = item response theory.

The three panels of Figure 1 show scatter plots of S-X² versus $S - X_{R R}^{2}$ for the real data set under the three IRT models. The range of the X-axis is the same as that of the Y-axis in each panel. The range is much wider in the leftmost panel than in the other two panels. The panels include a diagonal line and also vertical and horizontal dashed lines indicating the critical values at the 5% level of significance for the respective statistics. The last two panels show that for several items, S-X² is larger than its critical value, but $S - X_{R R}^{2}$ is smaller than its critical value. Because item misfit often leads to an item being removed from the item pool (Sinharay & Haberman, 2014) and items are costly, these results indicate that the use of $S - X_{R R}^{2}$ rather than S-X² may lead to a considerable saving of resources in operational testing.

Conclusions and Recommendations

The item-fit statistic S-X² (Orlando & Thissen, 2000), in spite of its simplicity and popularity, does not have a known asymptotic null distribution (Sinharay, 2006) and the Type I error rate of the statistic is larger than the nominal level, especially for shorter tests. The present study adopts the modification procedure suggested by Rao and Robson (1974) to provide a modified version of S-X² that has a known χ² asymptotic null distribution. The statistic S-X² can be written as ${\hat{v}}^{⊤} \hat{v}$ . The central idea of the modification of Rao and Robson (1974) is the computation of ${\hat{v}}^{⊤} Σ_{\hat{v}}^{- 1} \hat{v}$ , where $Σ_{\hat{v}}$ is an approximate variance-covariance matrix of $\hat{v}$ , so that ${\hat{v}}^{⊤} Σ_{\hat{v}}^{- 1} \hat{v}$ has a known χ² asymptotic null distribution. A major contribution of this paper is the derivation of the appropriate $Σ_{\hat{v}}$ . Thus, this paper suggests a χ²-type statistic that (a) can be used to assess item fit for any IRT model for dichotomous items and (b) has a known asymptotic distribution under the null hypothesis. Item-fit statistics that have a known asymptotic χ² distribution under the null hypothesis have been suggested for the Rasch model by, for example, Glas (1988), but there is a lack of such statistics for non-Rasch IRT models. Thus, this paper makes an important contribution given that experts such as Box (1979) called for statistics that have a known null distribution in assessing the fit of statistical models. Note that researchers such as Haberman et al. (2013) have suggested residual-based item-fit statistics that follow the standard normal distribution for non-Rasch IRT models, but we do not consider such statistics.

Simulation studies were conducted to compare the performance of S-X² and $S - X_{R R}^{2}$ with respect to Type I error rate and power. Results obtained from the simulation studies suggest that the Type I error rate of $S - X_{R R}^{2}$ is closer to the nominal level than that of S-X² across different conditions. However, $S - X_{R R}^{2}$ was found to be slightly conservative in comparison to S-X². Application of the two item-fit statistics to a real data set revealed that the number of items flagged as misfitting by $S - X_{R R}^{2}$ was smaller than the number flagged by S-X². In practice, item-fit statistics such as $S - X_{R R}^{2}$ should be used along with other methods such as informative graphics and pair-wise item fit indexes in order to gain a thorough understanding of the type of misfit.

This paper has several limitations. First, the two statistics were compared on a limited number of simulated and real data sets, and further comparisons are possible. Second, the proposed statistic $S - X_{R R}^{2}$ applies only to dichotomous IRT models; future research could extend the statistic to tests with polytomous items or a mix of dichotomous and polytomous items. Third, the current manuscript only investigates three unidimensional IRT models assuming the latent variable follows a normal distribution. To obtain a better understanding of the suggested statistic, one can look into its performance in other cases, including non-normal ability distributions, multidimensional latent variables, and discrete latent variables. Finally, this manuscript only considers statistical significance and does not discuss the practical significance of IRT model misfit (Hambleton & Han, 2005; Sinharay & Haberman, 2014).

Acknowledgments

The authors would like to thank John Donoghue, Sooyeon Kim, Hongwen Guo, Lora Monfils, and two anonymous reviewers for several helpful comments that led to a significant improvement of the article.

Appendix A. Proof that $Cov (O_{k_{1}}, O_{k_{2}})$ Computed at η = η ₀ is Approximately Equal to Zero for Large Samples

Let S_i denote the total score of examinee i, who is randomly chosen from the hypothetical population of all possible examinees. Let us define an indicator variable W_ik as

W_{i k} = \begin{cases} 1, & S_{i} = k \\ 0, & S_{i} \neq k \end{cases}

Then O_k for an item of interest can be expressed as

O_{k} = \frac{\sum_{i} W_{i k} X_{i}}{\sum_{i} W_{i k}}

where X_i is the score of examinee i on the item.
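The definition of O_k above can be illustrated with a minimal Python sketch. The function name `observed_proportions` and the small response matrix are hypothetical and serve only as an illustration; the sketch assumes that the total score S_i is the sum over all items, including the item of interest.

```python
from collections import defaultdict


def observed_proportions(responses, item):
    """Return {k: O_k} for one item.

    O_k is the proportion of examinees with total score k who answered
    the item correctly, i.e. sum_i(W_ik * X_i) / sum_i(W_ik).
    """
    correct = defaultdict(int)  # numerator: sum_i W_ik * X_i
    count = defaultdict(int)    # denominator: sum_i W_ik
    for row in responses:
        k = sum(row)            # total score S_i (assumed to include the item)
        count[k] += 1
        correct[k] += row[item]
    return {k: correct[k] / count[k] for k in sorted(count)}


# Hypothetical 0/1 responses: 6 examinees (rows) by 3 items (columns).
responses = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
]
props = observed_proportions(responses, item=0)
print(props)
```

Only score groups that actually occur in the data appear in the result, so the ratio is never evaluated for an empty group.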

Let us consider two possible values k₁ and k₂ of S_i, where k₁ ≠ k₂, and define a vector U as

U = {(\sum_{i} W_{i k_{1}} X_{i}, \sum_{i} W_{i k_{1}}, \sum_{i} W_{i k_{2}} X_{i}, \sum_{i} W_{i k_{2}})}^{⊤}

Then one can express $O_{k_{1}}$ and $O_{k_{2}}$ as

O_{k_{1}} = \frac{U_{1}}{U_{2}}, O_{k_{2}} = \frac{U_{3}}{U_{4}}

where, for example, U₁ is the first component of U. The Jacobian for the transformation from U to $O = {(O_{k_{1}}, O_{k_{2}})}^{⊤}$ is the matrix of the first derivatives of the elements of O with respect to those of U, that is,

J = \left[\begin{array}{cccc} J_{1} & J_{2} & 0 & 0 \\ 0 & 0 & J_{3} & J_{4} \end{array}\right]

(A1)

where

J_{1} = \frac{1}{U_{2}}, J_{2} = - \frac{U_{1}}{U_{2}^{2}}, J_{3} = \frac{1}{U_{4}}, J_{4} = - \frac{U_{3}}{U_{4}^{2}}

(A2)

Consequently, using the multivariate delta method (e.g., Lehmann & Casella, 1998, p. 61), the variance-covariance matrix of O for large samples can be approximated as

Cov (O) \approx \tilde{J} Σ_{U} {\tilde{J}}^{⊤}

where Σ_U is the variance-covariance matrix of the vector U , $\tilde{J}$ is the value of J provided in equation (A1) upon replacing the U_k’s with their expected values computed at η = η ₀, and the parameters η are fixed at η ₀. Using the result that the (i, j)-th element of the product of three matrices A, B and C is equal to the (matrix) product of the i-th row of A, the matrix B, and the j-th column of C (e.g., Banerjee & Roy, 2014, p. 12), the covariance between $O_{k_{1}}$ and $O_{k_{2}}$ can be approximated, for large samples, as

Cov (O_{k_{1}}, O_{k_{2}}) \approx ({\tilde{J}}_{1}, {\tilde{J}}_{2}, 0,0) Σ_{U} {(0,0, {\tilde{J}}_{3}, {\tilde{J}}_{4})}^{⊤}

where ${\tilde{J}}_{i}$ is the value of J_i upon replacing the U_k’s with their expected values computed at η = η ₀, or, as

Cov (O_{k_{1}}, O_{k_{2}}) \approx {\tilde{J}}_{1} {\tilde{J}}_{3} σ_{13} + {\tilde{J}}_{1} {\tilde{J}}_{4} σ_{14} + {\tilde{J}}_{2} {\tilde{J}}_{3} σ_{23} + {\tilde{J}}_{2} {\tilde{J}}_{4} σ_{24}

(A3)

where σ_ij is the (i, j)-th element of Σ_U.

One can compute σ₂₄ as

σ_{24} = Cov (U_{2}, U_{4}) = Cov (\sum_{i} W_{i k_{1}}, \sum_{i} W_{i k_{2}}) = \sum_{i} Cov (W_{i k_{1}}, W_{i k_{2}})

where the last equality holds because the item scores are independent over two different examinees i₁ and i₂, which results in $Cov (W_{i_{1} k_{1}}, W_{i_{2} k_{2}}) = 0$ . Consequently

σ_{24} = \sum_{i} [E (W_{i k_{1}} W_{i k_{2}}) - E (W_{i k_{1}}) E (W_{i k_{2}})] = - \sum_{i} E (W_{i k_{1}}) E (W_{i k_{2}})

(A4)

because the raw score of examinee i cannot be equal to k₁ and also equal to k₂ so that $W_{i k_{1}} W_{i k_{2}}$ is equal to 0.

Now note that $E (W_{i k_{1}})$ is the probability that the raw score on the test is k₁ for an examinee who is randomly chosen from the population of all examinees. This probability is equal to $\int S (T = k_{1} | θ, η) ψ (θ) d θ$ and hence is the same for all examinees. Therefore

E (W_{i k_{1}}) = \frac{1}{N} \sum_{i} E (W_{i k_{1}}) = \frac{1}{N} E (\sum_{i} W_{i k_{1}}) = \frac{1}{N} E (U_{2})

(A5)

Similarly, one obtains

E (W_{i k_{2}}) = \frac{1}{N} E (U_{4}) .

(A6)

Equations (A4) to (A6) imply that

σ_{24} = - \sum_{i} [\frac{1}{N} E (U_{2})] [\frac{1}{N} E (U_{4})] = - \frac{1}{N} E (U_{2}) E (U_{4})

Let ${\tilde{U}}_{k}$ denote E(U_k), where the expectation is computed at η = η ₀, k = 1, …, 4. Then

σ_{24} = - \frac{1}{N} {\tilde{U}}_{2} {\tilde{U}}_{4}

(A7)
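Equation (A7) can be checked by simulation. The sketch below is only an illustration under stated assumptions: the sample size, the two score probabilities (standing in for the probabilities of total scores k₁ and k₂), the number of replications, and the function name `simulate_count_cov` are all hypothetical.

```python
import random


def simulate_count_cov(n_examinees, p1, p2, n_reps, seed=0):
    """Estimate Cov(U_2, U_4), the covariance of two score-group counts."""
    rng = random.Random(seed)
    u2, u4 = [], []
    for _ in range(n_reps):
        c1 = c2 = 0
        for _ in range(n_examinees):
            r = rng.random()
            if r < p1:            # examinee has total score k1
                c1 += 1
            elif r < p1 + p2:     # examinee has total score k2
                c2 += 1
        u2.append(c1)
        u4.append(c2)
    m2 = sum(u2) / n_reps
    m4 = sum(u4) / n_reps
    cov = sum((a - m2) * (b - m4) for a, b in zip(u2, u4)) / (n_reps - 1)
    return cov


N = 200                 # hypothetical number of examinees
p1, p2 = 0.2, 0.3       # hypothetical P(S = k1) and P(S = k2)
cov_hat = simulate_count_cov(N, p1, p2, n_reps=2000)
cov_a7 = -(N * p1) * (N * p2) / N   # right-hand side of (A7)
print(cov_hat, cov_a7)
```

Under these assumptions the simulated covariance should be close to the value given by (A7), up to Monte Carlo error.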

It is possible to prove in a similar manner that

σ_{13} = - \frac{1}{N} {\tilde{U}}_{1} {\tilde{U}}_{3}, σ_{14} = - \frac{1}{N} {\tilde{U}}_{1} {\tilde{U}}_{4}, σ_{23} = - \frac{1}{N} {\tilde{U}}_{2} {\tilde{U}}_{3}

(A8)
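As a numerical sanity check, the covariances in (A7) and (A8) and the derivatives in (A2) can be substituted into (A3). The sketch below does this for one set of hypothetical expected values; the function name `approx_cov` and the numbers are assumptions made for illustration only.

```python
def approx_cov(u_tilde, n):
    """Evaluate the right-hand side of (A3) using (A2), (A7), and (A8)."""
    u1, u2, u3, u4 = u_tilde

    # Derivatives from (A2), evaluated at the expected values.
    j1, j2 = 1.0 / u2, -u1 / u2 ** 2
    j3, j4 = 1.0 / u4, -u3 / u4 ** 2

    # Covariances from (A7) and (A8): sigma_ab = -(1/N) * U_a * U_b.
    def sigma(a, b):
        return -a * b / n

    return (j1 * j3 * sigma(u1, u3) + j1 * j4 * sigma(u1, u4)
            + j2 * j3 * sigma(u2, u3) + j2 * j4 * sigma(u2, u4))


# Hypothetical expected values (U_1, U_2, U_3, U_4) for N = 1000 examinees.
value = approx_cov((60.0, 150.0, 140.0, 200.0), n=1000)
print(value)
```

The four terms cancel in pairs, so the result differs from zero only by floating-point rounding error.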

Finally, equations (A2), (A3), (A7), and (A8) imply that

\begin{array}{l} Cov (O_{k_{1}}, O_{k_{2}}) \approx - {\tilde{J}}_{1} {\tilde{J}}_{3} \frac{1}{N} {\tilde{U}}_{1} {\tilde{U}}_{3} - {\tilde{J}}_{1} {\tilde{J}}_{4} \frac{1}{N} {\tilde{U}}_{1} {\tilde{U}}_{4} - {\tilde{J}}_{2} {\tilde{J}}_{3} \frac{1}{N} {\tilde{U}}_{2} {\tilde{U}}_{3} - {\tilde{J}}_{2} {\tilde{J}}_{4} \frac{1}{N} {\tilde{U}}_{2} {\tilde{U}}_{4} \\ \approx - \frac{1}{N} [{\tilde{U}}_{1} {\tilde{U}}_{3} \frac{1}{{\tilde{U}}_{2}} \frac{1}{{\tilde{U}}_{4}} - {\tilde{U}}_{1} {\tilde{U}}_{4} \frac{1}{{\tilde{U}}_{2}} \frac{{\tilde{U}}_{3}}{{\tilde{U}}_{4}^{2}} - {\tilde{U}}_{2} {\tilde{U}}_{3} \frac{{\tilde{U}}_{1}}{{\tilde{U}}_{2}^{2}} \frac{1}{{\tilde{U}}_{4}} + {\tilde{U}}_{2} {\tilde{U}}_{4} \frac{{\tilde{U}}_{1}}{{\tilde{U}}_{2}^{2}} \frac{{\tilde{U}}_{3}}{{\tilde{U}}_{4}^{2}}] \\ = 0 \end{array}
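The result can also be examined empirically. The following Monte Carlo sketch is an illustration under assumptions that are not part of the derivation: it generates data from a Rasch model with hypothetical item difficulties and standard normal abilities, and the sample size, number of replications, seed, and function names are arbitrary choices.

```python
import math
import random


def simulate_pair(n_examinees, difficulties, item, k1, k2, rng):
    """Simulate one sample; return (O_k1, O_k2), or None if a group is empty."""
    correct = {k1: 0, k2: 0}
    count = {k1: 0, k2: 0}
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)  # ability drawn from N(0, 1)
        row = [int(rng.random() < 1.0 / (1.0 + math.exp(b - theta)))
               for b in difficulties]
        s = sum(row)                 # total score S_i
        if s in count:
            count[s] += 1
            correct[s] += row[item]
    if count[k1] == 0 or count[k2] == 0:
        return None
    return correct[k1] / count[k1], correct[k2] / count[k2]


def cov_and_corr(pairs):
    """Sample covariance and correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs) / (n - 1)
    sxx = sum((x - mx) ** 2 for x, _ in pairs) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in pairs) / (n - 1)
    return sxy, sxy / math.sqrt(sxx * syy)


rng = random.Random(2022)
difficulties = [-1.0, -0.5, 0.0, 0.0, 0.5, 1.0]  # hypothetical Rasch difficulties
pairs = []
for _ in range(400):                             # 400 replicated samples
    result = simulate_pair(300, difficulties, item=2, k1=2, k2=4, rng=rng)
    if result is not None:
        pairs.append(result)
cov_hat, corr_hat = cov_and_corr(pairs)
print(len(pairs), cov_hat, corr_hat)
```

Across replications, the sample covariance and correlation between the two observed proportions should be close to zero, up to Monte Carlo error.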

Note

For example, the R package mirt (Chalmers, 2012) can be used to compute such a matrix.

Footnotes

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

Author Note: Any opinions expressed in this publication are those of the author and not necessarily of Educational Testing Service.

ORCID iDs

Sandip Sinharay https://orcid.org/0000-0003-4491-8510

Matthew S. Johnson https://orcid.org/0000-0003-3157-4165

References

Agresti A. (2013). Categorical data analysis (3rd ed.). Wiley.
American Educational Research Association, American Psychological Association, & National Council for Measurement in Education (2014). Standards for educational and psychological testing. American Educational Research Association.
Banerjee S., Roy A. (2014). Linear algebra and matrix analysis for statistics. Chapman and Hall/CRC.
Box G. E. P. (1979). Some problems of statistics and everyday life. Journal of the American Statistical Association, 74(365), 1–4. doi:10.1080/01621459.1979.10481600
Cai L., du Toit S. H. C., Thissen D. (2011). IRTPRO: Flexible, multidimensional, multiple categorical IRT modeling. Scientific Software International.
Chalmers R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1–29. doi:10.18637/jss.v048.i06
Chen W.-H., Thissen D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265–289. doi:10.2307/1165285
Chernoff H., Lehmann E. L. (1954). The use of maximum likelihood estimates in χ² tests for goodness of fit. The Annals of Mathematical Statistics, 25(3), 579–586. doi:10.1214/aoms/1177728726
Fisher R. A. (1924). The conditions under which χ² measures the discrepancy between observation and hypothesis. Journal of the Royal Statistical Society, 87(3), 442–450.
Glas C. A., Suarez-Falcón J. C. (2003). A comparison of item-fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27(2), 87–106. doi:10.1177/0146621602250530
Glas C. A. W. (1988). The derivation of some tests for the Rasch model from the multinomial distribution. Psychometrika, 53(4), 525–546. doi:10.1007/bf02294405
Haberman S. J., Sinharay S., Chon K. H. (2013). Assessing item fit for unidimensional item response theory models using residuals from estimated item response functions. Psychometrika, 78(3), 417–440. doi:10.1007/s11336-012-9305-1
Hambleton R. K., Han N. (2005). Assessing the fit of IRT models to educational and psychological test data: A five step plan and several graphical displays. In Lenderking W. R., Revicki D. (Eds.), Advances in health outcomes research methods, measurement, statistical analysis, and clinical applications (pp. 57–78). Degnon Associates.
Harris R. R., Kanji G. K. (1983). On the use of minimum chi-square estimation. The Statistician, 32(4), 379. doi:10.2307/2987540
Harwell M. R., Baker F. B., Zwarts M. (1988). Item parameter estimation via marginal maximum likelihood and an EM algorithm: A didactic. Journal of Educational Statistics, 13(3), 243–271. doi:10.2307/1164654
Kang T., Chen T. T. (2008). Performance of the generalized S-X² item fit index for polytomous IRT models. Journal of Educational Measurement, 45(4), 391–406. doi:10.1111/j.1745-3984.2008.00071.x
Kang T., Chen T. T. (2010). Performance of the generalized S-X² item fit index for the graded response model. Asia Pacific Education Review, 12(1), 89–96. doi:10.1007/s12564-010-9082-4
Lehmann E. L., Casella G. (1998). Theory of point estimation (2nd ed.). Springer-Verlag.
Lim H. (2020). irtplay: Unidimensional item response theory modeling. (R package version 1.6.2).
Lord F. M., Wingersky M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings”. Applied Psychological Measurement, 8(4), 453–461. doi:10.1177/014662168400800409
Mislevy R. J., Bock R. D. (1991). BILOG 3.11 [computer software]. Scientific Software International.
Muraki E., Bock R. D. (2003). PARSCALE 4: IRT item analysis and test scoring for rating-scale data [computer program]. Scientific Software.
Orlando M., Thissen D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24(1), 50–64. doi:10.1177/01466216000241003
Pearson K. (1900). X. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 50(302), 157–175. doi:10.1080/14786440009463897
Rao C. R. (1962). Efficient estimates and optimum inference procedures in large samples. Journal of the Royal Statistical Society. Series B (Methodological), 24(1), 46–72. doi:10.1111/j.2517-6161.1962.tb00436.x
Rao K. C., Robson D. S. (1974). A chi-square statistic for goodness-of-fit tests within the exponential family. Communications in Statistics, 3(12), 1139–1153. doi:10.1080/03610917408548327
Roberts J. S. (2008). Modified likelihood-based item fit statistics for the generalized graded unfolding model. Applied Psychological Measurement, 32(5), 407–423. doi:10.1177/0146621607301278
Sinharay S. (2006). Bayesian item fit analysis for unidimensional item response theory models. The British Journal of Mathematical and Statistical Psychology, 59(2), 429–449. doi:10.1348/000711005x66888
Sinharay S. (2017). How to compare parametric and nonparametric person-fit statistics using real data. Journal of Educational Measurement, 54(4), 420–439. doi:10.1111/jedm.12155
Sinharay S., Haberman S. J. (2014). How often is the misfit of item response theory models practically significant? Educational Measurement: Issues and Practice, 33(1), 23–35. doi:10.1111/emip.12024
Sinharay S., Lu Y. (2008). A further look at the correlation between item parameters and item fit statistics. Journal of Educational Measurement, 45, 1–15. doi:10.1111/j.1745-3984.2007.00049.x
Sorrel M. A., Abad F. J., Olea J., de la Torre J., Barrada J. R. (2017). Inferential item-fit evaluation in cognitive diagnosis modeling. Applied Psychological Measurement, 41(8), 614–631. doi:10.1177/0146621617707510
Stone C. A. (2000). Monte Carlo based null distribution for an alternative goodness-of-fit test statistic in IRT models. Journal of Educational Measurement, 37(1), 58–75. doi:10.1111/j.1745-3984.2000.tb01076.x
Stone C. A., Zhang B. (2003). Assessing goodness of fit of item response theory models: A comparison of traditional and alternative procedures. Journal of Educational Measurement, 40(4), 331–352. doi:10.1111/j.1745-3984.2003.tb01150.x
Thissen D. (1991). MULTILOG: Multiple category item analysis and test scoring using item response theory [computer software]. Scientific Software International.
Zhang B., Stone C. A. (2007). Evaluating item fit for multidimensional item response models. Educational and Psychological Measurement, 68(2), 181–196. doi:10.1177/0013164407301547

