Educational and Psychological Measurement. 2017 Jun 22;78(1):32–45. doi: 10.1177/0013164417713570

Large Sample Confidence Intervals for Item Response Theory Reliability Coefficients

Björn Andersson, Tao Xin
PMCID: PMC5965626  PMID: 29795945

Abstract

In applications of item response theory (IRT), an estimate of the reliability of the ability estimates or sum scores is often reported. However, analytical expressions for the standard errors of the estimators of the reliability coefficients are not available in the literature and therefore the variability associated with the estimated reliability is typically not reported. In this study, the asymptotic variances of the IRT marginal and test reliability coefficient estimators are derived for dichotomous and polytomous IRT models assuming an underlying asymptotically normally distributed item parameter estimator. The results are used to construct confidence intervals for the reliability coefficients. Simulations are presented which show that the confidence intervals for the test reliability coefficient have good coverage properties in finite samples under a variety of settings with the generalized partial credit model and the three-parameter logistic model. Meanwhile, it is shown that the estimator of the marginal reliability coefficient has finite sample bias resulting in confidence intervals that do not attain the nominal level for small sample sizes but that the bias tends to zero as the sample size increases.

Keywords: reliability, item response theory, confidence intervals, asymptotic variance


In classical test theory, the reliability of a test plays a central role. The reliability is a measure of the consistency of the scores from a test. Reliability, together with the concept of item and test information, also plays a role in item response theory (IRT). Two different measures of reliability in IRT are marginal reliability (Cheng, Yuan, & Liu, 2012) and test reliability (Kim & Feldt, 2010; Lord, 1977, 1980). The marginal reliability denotes the ratio of the true score variance to the total variance, expressed with respect to the estimated latent abilities. Test reliability, on the other hand, denotes the ratio of true score variance to total variance expressed with respect to the sum scores. Note that, in the terminology used in this article, both the marginal reliability and the test reliability refer to reliability with regard to a population as a whole and hence are in some sense both “marginal” measures of reliability. Having a measure of the reliability of the scores from a test in the context of IRT is useful since it provides an indication of the overall consistency of the test scores in measuring the underlying trait or the observed score generated from the IRT model. Since the object of a test in IRT is to measure the latent concept for everyone taking the test, an overall measure of the reliability of the test scores across the spectrum of the latent distribution is important to consider.

When using IRT in empirical research, a reliability coefficient is often reported. The reported reliability coefficients are estimates of the true reliability coefficient and these estimates have a certain amount of variance associated with them due to the item parameter estimation. The large sample variances of several reliability coefficient estimators in classical test theory have been derived (Yuan & Bentler, 2002). However, analytical estimates of the variance of the IRT reliability coefficients are not available in the literature and an estimated standard error or confidence interval is therefore usually not reported. In this article, the large sample variances of the estimators of the IRT marginal reliability coefficient and the IRT test reliability coefficient are derived using standard asymptotic theory. The results may be used by empirical researchers to estimate confidence intervals for the reliability coefficients.

The article is structured as follows. First, IRT is introduced and common models for either dichotomous or polytomous data are briefly described. Then, the IRT marginal and test reliability coefficients are defined and the asymptotic variances of their estimators are derived. A simulation study then investigates how well the large sample results work with finite samples. Finally, concluding remarks are given.

Item Response Theory

Consider a test consisting of $J$ dichotomous or polytomous items. Denote the IRT item parameter vector by $\alpha$ and let $\alpha_j \subset \alpha$ be the item parameter vector for item $j$. Let $P_{jk}(\theta;\alpha_j)$ be the probability of obtaining category $k \in \{1, \ldots, m_j\}$ on item $j$ conditional on the latent variable $\theta$, where $m_j$ denotes the number of possible categories for item $j$. For dichotomous items, the three-parameter logistic (3-PL) model (Lord, 1980) defines the probability as

$$P_{jk}(\theta;\alpha_j) = u_{jk} P_j(\theta;\alpha_j) + (1 - u_{jk})\,(1 - P_j(\theta;\alpha_j)), \tag{1}$$

where $u_{j1} = 0$ and $u_{j2} = 1$ and

$$P_j(\theta;\alpha_j) = c_j + \frac{1 - c_j}{1 + \exp[-a_j(\theta - b_j)]}, \tag{2}$$

where $a_j$ is called the discrimination parameter, $b_j$ the location parameter and $c_j$ the pseudo-guessing parameter for item $j$. For polytomous items, the generalized partial credit model (GPCM) (Muraki, 1992) can be used, with probabilities defined by

$$P_{jk}(\theta;\alpha_j) = \frac{\exp\left[\sum_{v=1}^{k} a_j(\theta - b_{j,v})\right]}{\sum_{c=1}^{m_j} \exp\left[\sum_{v=1}^{c} a_j(\theta - b_{j,v})\right]}, \tag{3}$$

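where $a_j$ is again the discrimination parameter and the $b_{j,v}$ are called the item category parameters, with $b_{j,1} \equiv 0$. In IRT, the concept of item and test information has an important role. The test information is inversely related to the variance of the maximum likelihood estimator of the ability. Define the expected item information for a dichotomous IRT model as

$$I_j(\theta;\alpha_j) = \sum_{k=1}^{m_j} \frac{\left(\frac{\partial P_{jk}(\theta;\alpha_j)}{\partial\theta}\right)^2}{P_{jk}(\theta;\alpha_j)}, \tag{4}$$

where, for the 3-PL model,

$$\frac{\partial P_{jk}(\theta;\alpha_j)}{\partial\theta} = u_{jk}\frac{\partial P_j(\theta;\alpha_j)}{\partial\theta} - (1 - u_{jk})\frac{\partial P_j(\theta;\alpha_j)}{\partial\theta}, \tag{5}$$

where (Hambleton & Swaminathan, 1985)

$$\frac{\partial P_j(\theta;\alpha_j)}{\partial\theta} = \frac{a_j\,[1 - P_j(\theta;\alpha_j)]\,[P_j(\theta;\alpha_j) - c_j]}{1 - c_j}. \tag{6}$$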

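For a polytomous IRT model, define the expected item information as (Magis, 2015)

$$I_j(\theta;\alpha_j) = \sum_{k=1}^{m_j} \left[\frac{\left(\frac{\partial P_{jk}(\theta;\alpha_j)}{\partial\theta}\right)^2}{P_{jk}(\theta;\alpha_j)} - \frac{\partial^2 P_{jk}(\theta;\alpha_j)}{\partial\theta^2}\right], \tag{7}$$

where, for the GPCM (Muraki, 1992),

$$\frac{\partial P_{jk}(\theta;\alpha_j)}{\partial\theta} = P_{jk}(\theta;\alpha_j)\,a_j\left(k - \sum_{c=1}^{m_j} c\,P_{jc}(\theta;\alpha_j)\right) \tag{8}$$

and

$$\frac{\partial^2 P_{jk}(\theta;\alpha_j)}{\partial\theta^2} = \frac{\partial P_{jk}(\theta;\alpha_j)}{\partial\theta}\,a_j\left(k - \sum_{c=1}^{m_j} c\,P_{jc}(\theta;\alpha_j)\right) - P_{jk}(\theta;\alpha_j)\,a_j\sum_{c=1}^{m_j} c\,\frac{\partial P_{jc}(\theta;\alpha_j)}{\partial\theta}, \tag{9}$$

resulting in the expected information for the GPCM having the expression

$$I_j(\theta;\alpha_j) = \sum_{k=1}^{m_j} P_{jk}(\theta;\alpha_j)\,a_j\sum_{c=1}^{m_j} c\,\frac{\partial P_{jc}(\theta;\alpha_j)}{\partial\theta}. \tag{10}$$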

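The test information is the sum of the information for each item, i.e.,

$$I(\theta;\alpha) = \sum_{j=1}^{J} I_j(\theta;\alpha_j). \tag{11}$$

Let $\hat{\theta}_{\mathrm{MLE}}$ be the maximum likelihood estimator of the ability. As $J \to \infty$, the estimator $\hat{\theta}_{\mathrm{MLE}}$ has variance (Hambleton & Swaminathan, 1985)

$$\mathrm{Var}(\hat{\theta}_{\mathrm{MLE}} \mid \theta) = \frac{1}{I(\theta;\alpha)}. \tag{12}$$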
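The model definitions and information functions above can be illustrated with a minimal Python sketch. The function names (`p_3pl`, `info_3pl`, `gpcm_probs`, `info_gpcm`) and the 3-PL parameter values are illustrative assumptions and do not come from the article; the GPCM item uses the first row of Table 1. Categories are coded 1, ..., m_j as in Equation (3).

```python
import numpy as np

def p_3pl(theta, a, b, c):
    """Correct-response probability P_j(theta) of the 3-PL model, Equation (2)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """Expected item information, Equation (4), summed over both categories."""
    p = p_3pl(theta, a, b, c)
    dp = a * (1.0 - p) * (p - c) / (1.0 - c)      # Equation (6)
    # Category 1 has probability 1 - p and slope -dp; category 2 has p and dp.
    return dp**2 / (1.0 - p) + dp**2 / p

def gpcm_probs(theta, a, b):
    """GPCM category probabilities, Equation (3); b[0] must be 0."""
    z = np.cumsum(a * (theta - np.asarray(b, dtype=float)))
    ez = np.exp(z - z.max())                      # subtract max to avoid overflow
    return ez / ez.sum()

def info_gpcm(theta, a, b):
    """Expected GPCM item information via Equations (8) and (10)."""
    p = gpcm_probs(theta, a, b)
    k = np.arange(1, p.size + 1)
    dp = p * a * (k - np.sum(k * p))              # Equation (8)
    return a * np.sum(k * dp)                     # Equation (10), since sum_k P_jk = 1

# Item 1 of Table 1 (with b_{j,1} = 0 prepended) and a hypothetical 3-PL item.
print(info_gpcm(0.0, 1.49, [0.0, -1.14, 0.64, 0.68]))
print(info_3pl(0.0, 1.2, 0.3, 0.2))
```

For the GPCM, Equation (10) reduces to the squared discrimination times the conditional variance of the category index, which gives a convenient numerical check of an implementation.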

Item Response Theory Marginal Reliability

One measure of reliability for scores on the latent variable metric which has been proposed in the literature is the marginal reliability (Green, Bock, Humphreys, Linn, & Reckase, 1984), sometimes referred to as parallel forms reliability (Kim, 2012). With the assumption of an ability distribution with density g and variance 1, this reliability coefficient is defined as (Cheng et al., 2012)

$$\rho_{\Theta}(\alpha) = \int \frac{I(\theta;\alpha)}{I(\theta;\alpha) + 1}\, g(\theta)\, d\theta. \tag{13}$$

The marginal reliability coefficient can be interpreted as the reliability with regard to the maximum likelihood ability estimates. The density function g is often assumed to be the standard normal distribution during estimation and this is the assumption that will be used hereafter.
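As a rough illustration of Equation (13), the sketch below approximates the integral for the nine GPCM items of Table 1 under a standard normal latent density. It is a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the item information is computed in the equivalent form of squared discrimination times category variance, and NumPy's `hermegauss` rule is used, whose weights already absorb the factor exp(-θ²/2), so only the division by √(2π) remains. The article reports a true value of 0.8718 for these parameters; whether this sketch reproduces that figure has not been verified here.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# GPCM parameters from Table 1: a_j, then b_{j,2}, b_{j,3}, b_{j,4}.
A = np.array([1.49, 2.73, 0.79, 1.65, 0.71, 1.18, 1.22, 0.80, 1.40])
B = np.array([[-1.14, 0.64, 0.68], [-0.74, 0.57, 0.95], [-1.52, -0.24, 0.41],
              [-1.62, 0.20, 0.40], [-0.47, 0.36, 1.22], [0.14, 0.80, 1.43],
              [-0.70, 0.77, 1.01], [0.37, 1.26, 1.73], [0.89, 1.92, 2.26]])

def gpcm_info(theta, a, b):
    """Expected GPCM item information: a^2 times the category variance."""
    z = np.cumsum(a * (theta - np.concatenate(([0.0], b))))  # b_{j,1} = 0
    p = np.exp(z - z.max())
    p /= p.sum()
    k = np.arange(1, p.size + 1)
    return a**2 * (np.sum(k**2 * p) - np.sum(k * p) ** 2)

def marginal_reliability(a_vec, b_mat, n_points=49):
    """Quadrature approximation of Equation (13) with g the N(0,1) density."""
    nodes, weights = hermegauss(n_points)
    weights = weights / np.sqrt(2.0 * np.pi)      # weights now sum to 1
    total = 0.0
    for theta, w in zip(nodes, weights):
        info = sum(gpcm_info(theta, a, b) for a, b in zip(a_vec, b_mat))
        total += w * info / (info + 1.0)
    return total

rho_theta = marginal_reliability(A, B)
print(round(rho_theta, 4))
```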

Item Response Theory Test Reliability

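The IRT test reliability coefficient $\rho_{XX}(\alpha)$ is defined as the ratio of the true score variance $\sigma_T^2(\alpha)$ and the observed score variance $\sigma_X^2(\alpha)$. It can be interpreted as the parallel forms reliability with regard to the sum score. Define $W_{jk}$ as the score for obtaining category $k$ on item $j$, which we consider as a fixed constant in this article. From the definition of reliability, we can note that $\rho_{XX}(\alpha)$ can also be expressed as

$$\rho_{XX}(\alpha) = \frac{\sigma_T^2(\alpha)}{\sigma_X^2(\alpha)} = \frac{\sigma_X^2(\alpha) - \sigma_e^2(\alpha)}{\sigma_X^2(\alpha)} = 1 - \frac{\sigma_e^2(\alpha)}{\sigma_X^2(\alpha)}, \tag{14}$$

where, following Kim and Feldt (2010), $\sigma_e^2(\alpha)$ is the conditional error variance $\sigma_{e|\theta}^2(\alpha)$ averaged over the distribution of $\theta$, i.e.,

$$\sigma_e^2(\alpha) = \int \sigma_{e|\theta}^2(\alpha)\, g(\theta)\, d\theta, \tag{15}$$

where the conditional error variance $\sigma_{e|\theta}^2(\alpha)$ can be calculated by (Kim & Feldt, 2010)

$$\sigma_{e|\theta}^2(\alpha) = \sum_{j=1}^{J} \sigma_{e_j|\theta}^2(\alpha_j) = \sum_{j=1}^{J} \sigma_{X_j|\theta}^2(\alpha_j) = \sum_{j=1}^{J} \left\{\sum_{k=1}^{m_j} P_{jk}(\theta;\alpha_j)\,W_{jk}^2 - \left[\sum_{k=1}^{m_j} P_{jk}(\theta;\alpha_j)\,W_{jk}\right]^2\right\}. \tag{16}$$

Consider a test that has possible observed scores from 0 to $K$. The variance of the observed score distribution, $\sigma_X^2(\alpha)$, can be calculated from the probabilities $r_i(\alpha)$ for each possible observed score, i.e.,

$$\sigma_X^2(\alpha) = \sum_{i=0}^{K} r_i(\alpha)\,x_i^2 - \left(\sum_{i=0}^{K} r_i(\alpha)\,x_i\right)^2, \tag{17}$$

where $x_i$ denotes the score value of the $i$th observed score. The probabilities $r_i(\alpha)$ can be calculated from the recursive formulas given for dichotomous items in Lord and Wingersky (1984) and for polytomous items in Thissen, Pommerich, Billeaud, and Williams (1995).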
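Equations (14) to (17) can be sketched in Python as follows, again for the GPCM items of Table 1. Several choices here are assumptions made for illustration and are not stated in the article: the category scores are taken to be W_jk = 0, 1, ..., m_j - 1, the sum-score distribution is built with a simple item-by-item recursion in the spirit of the cited recursive formulas, and all function names are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# GPCM parameters from Table 1: a_j, then b_{j,2}, b_{j,3}, b_{j,4}.
A = np.array([1.49, 2.73, 0.79, 1.65, 0.71, 1.18, 1.22, 0.80, 1.40])
B = np.array([[-1.14, 0.64, 0.68], [-0.74, 0.57, 0.95], [-1.52, -0.24, 0.41],
              [-1.62, 0.20, 0.40], [-0.47, 0.36, 1.22], [0.14, 0.80, 1.43],
              [-0.70, 0.77, 1.01], [0.37, 1.26, 1.73], [0.89, 1.92, 2.26]])

def gpcm_probs(theta, a, b):
    """GPCM category probabilities, Equation (3), with b_{j,1} = 0 prepended."""
    z = np.cumsum(a * (theta - np.concatenate(([0.0], b))))
    p = np.exp(z - z.max())
    return p / p.sum()

def score_distribution(probs, scores):
    """Sum-score distribution at one theta, built item by item."""
    dist = np.array([1.0])                        # before any item, P(X = 0) = 1
    for p, w in zip(probs, scores):
        new = np.zeros(dist.size + int(w.max()))
        for pk, wk in zip(p, w):
            new[wk:wk + dist.size] += pk * dist   # shift by the category score
        dist = new
    return dist

def irt_test_reliability(a_vec, b_mat, n_points=49):
    nodes, q = hermegauss(n_points)
    q = q / np.sqrt(2.0 * np.pi)                  # standard normal quadrature weights
    scores = [np.arange(len(b) + 1) for b in b_mat]   # assumed W_jk = 0, ..., m_j - 1
    x = np.arange(sum(int(w.max()) for w in scores) + 1)
    r = np.zeros(x.size)
    var_e = 0.0
    for theta, ql in zip(nodes, q):
        probs = [gpcm_probs(theta, a, b) for a, b in zip(a_vec, b_mat)]
        cond_var = sum(np.sum(p * w**2) - np.sum(p * w) ** 2
                       for p, w in zip(probs, scores))    # Equation (16)
        var_e += ql * cond_var                            # Equation (15)
        r += ql * score_distribution(probs, scores)       # marginal r_i
    var_x = np.sum(r * x**2) - np.sum(r * x) ** 2         # Equation (17)
    return 1.0 - var_e / var_x, var_e, var_x              # Equation (14)

rho_xx, var_e, var_x = irt_test_reliability(A, B)
print(round(rho_xx, 4))
```

A useful sanity check is the law of total variance: the observed score variance from Equation (17) should equal the average error variance plus the variance of the conditional expected scores over the same quadrature points.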

Asymptotic Variance of Item Response Theory Reliability Coefficient Estimators

Define $\hat{\alpha}$ as the estimator of the item parameter vector and let $n$ denote the sample size. In the following, it is assumed that asymptotic normality holds for the item parameter estimator $\hat{\alpha}$. Under this condition, the asymptotic variance of the reliability coefficient estimators can be derived using standard asymptotic theory (Ferguson, 1996). Let $\Sigma_{\hat{\alpha}}$ denote the asymptotic covariance matrix of $\hat{\alpha}$.

Item Response Theory Marginal Reliability

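The integral in Equation (13) is approximated by a sum using Gauss-Hermite quadrature and we thus obtain the estimator of $\rho_{\Theta}(\alpha)$ as

$$\hat{\rho}_{\Theta}(\alpha) = \sum_{l=1}^{L} \frac{\sum_{j=1}^{J} I_j(\theta_l;\alpha_j)}{\sum_{j=1}^{J} I_j(\theta_l;\alpha_j) + 1}\, \frac{e^{-\theta_l^2/2}}{\sqrt{2\pi}}\, w_l, \tag{18}$$

where $w_l$ is the weight for the $l$th Gauss-Hermite quadrature point $\theta_l$. The function $\hat{\rho}_{\Theta}(\alpha)$ is continuous and differentiable with regard to the item parameter vector $\alpha$. We may then apply the delta method to approximate the variance of $\hat{\rho}_{\Theta}(\alpha)$ and hence we retrieve, for large $n$,

$$\sigma^2_{\hat{\rho}_{\Theta}(\alpha)} \approx \frac{\partial \hat{\rho}_{\Theta}(\alpha)}{\partial\alpha}\, \Sigma_{\hat{\alpha}} \left[\frac{\partial \hat{\rho}_{\Theta}(\alpha)}{\partial\alpha}\right]^{\top}, \tag{19}$$

where

$$\frac{\partial \hat{\rho}_{\Theta}(\alpha)}{\partial\alpha} = \frac{\partial}{\partial\alpha}\left[\sum_{l=1}^{L} \frac{\sum_{j=1}^{J} I_j(\theta_l;\alpha_j)}{\sum_{j=1}^{J} I_j(\theta_l;\alpha_j) + 1}\, \frac{e^{-\theta_l^2/2}}{\sqrt{2\pi}}\, w_l\right] = \sum_{l=1}^{L} \left\{\frac{\sum_{j=1}^{J} \frac{\partial I_j(\theta_l;\alpha_j)}{\partial\alpha}}{\sum_{j=1}^{J} I_j(\theta_l;\alpha_j) + 1} - \frac{\sum_{j=1}^{J} I_j(\theta_l;\alpha_j)\, \sum_{j=1}^{J} \frac{\partial I_j(\theta_l;\alpha_j)}{\partial\alpha}}{\left[\sum_{j=1}^{J} I_j(\theta_l;\alpha_j) + 1\right]^2}\right\} \frac{e^{-\theta_l^2/2}}{\sqrt{2\pi}}\, w_l. \tag{20}$$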

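In Equation (20), the expression of $\partial I_j(\theta_l;\alpha_j)/\partial\alpha$ for dichotomous IRT models, obtained by differentiating Equation (4), is

$$\frac{\partial I_j(\theta_l;\alpha_j)}{\partial\alpha} = \sum_{k=1}^{m_j} \frac{1}{P_{jk}(\theta_l;\alpha_j)}\, \frac{\partial P_{jk}(\theta_l;\alpha_j)}{\partial\theta_l} \left[2\,\frac{\partial^2 P_{jk}(\theta_l;\alpha_j)}{\partial\theta_l\,\partial\alpha} - \frac{1}{P_{jk}(\theta_l;\alpha_j)}\, \frac{\partial P_{jk}(\theta_l;\alpha_j)}{\partial\theta_l}\, \frac{\partial P_{jk}(\theta_l;\alpha_j)}{\partial\alpha}\right] \tag{21}$$

and for polytomous IRT with the GPCM, obtained by differentiating Equation (10), is

$$\frac{\partial I_j(\theta_l;\alpha_j)}{\partial\alpha} = \sum_{k=1}^{m_j} \left\{\frac{\partial P_{jk}(\theta_l;\alpha_j)}{\partial\alpha}\, a_j \sum_{c=1}^{m_j} c\,\frac{\partial P_{jc}(\theta_l;\alpha_j)}{\partial\theta_l} + \frac{\partial a_j}{\partial\alpha}\, P_{jk}(\theta_l;\alpha_j) \sum_{c=1}^{m_j} c\,\frac{\partial P_{jc}(\theta_l;\alpha_j)}{\partial\theta_l} + P_{jk}(\theta_l;\alpha_j)\, a_j \sum_{c=1}^{m_j} c\,\frac{\partial^2 P_{jc}(\theta_l;\alpha_j)}{\partial\theta_l\,\partial\alpha}\right\}. \tag{22}$$

For the 3-PL model,

$$\frac{\partial P_{jk}(\theta_l;\alpha_j)}{\partial\alpha} = u_{jk}\frac{\partial P_j(\theta_l;\alpha_j)}{\partial\alpha} - (1 - u_{jk})\frac{\partial P_j(\theta_l;\alpha_j)}{\partial\alpha}, \tag{23}$$

where $\partial P_j(\theta_l;\alpha_j)/\partial\alpha$ is given in Hambleton and Swaminathan (1985) while for the GPCM the vector $\partial P_{jk}(\theta_l;\alpha_j)/\partial\alpha$ is given in Muraki (1992). The vector $\partial^2 P_{jk}(\theta_l;\alpha_j)/\partial\theta_l\,\partial\alpha$ is provided in the appendix.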
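The delta method step for the marginal reliability can be sketched numerically. The code below is only an illustration: it replaces the analytic gradient of the marginal reliability estimator with a central finite-difference gradient, uses the first three 3-PL items of Table 2, and plugs in a purely hypothetical covariance matrix for the item parameter estimates. The function names and the parameter layout (a_1, b_1, c_1, a_2, ...) are assumptions for this example.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def numerical_gradient(func, alpha, step=1e-5):
    """Central finite-difference gradient (stands in for the analytic gradient)."""
    alpha = np.asarray(alpha, dtype=float)
    grad = np.zeros_like(alpha)
    for i in range(alpha.size):
        e = np.zeros_like(alpha)
        e[i] = step
        grad[i] = (func(alpha + e) - func(alpha - e)) / (2.0 * step)
    return grad

def delta_method_variance(func, alpha_hat, cov_alpha):
    """Gradient times covariance matrix times gradient transposed."""
    g = numerical_gradient(func, alpha_hat)
    return float(g @ np.asarray(cov_alpha, dtype=float) @ g)

def marginal_reliability_3pl(alpha, n_points=49):
    """Quadrature estimate of the marginal reliability for 3-PL items."""
    a, b, c = np.reshape(alpha, (-1, 3)).T
    nodes, w = hermegauss(n_points)
    w = w / np.sqrt(2.0 * np.pi)
    p = c + (1.0 - c) / (1.0 + np.exp(-a * (nodes[:, None] - b)))
    # Equations (4)-(6) combined into one expression per item.
    info = np.sum(a**2 * (1.0 - p) * (p - c) ** 2 / ((1.0 - c) ** 2 * p), axis=1)
    return float(np.sum(w * info / (info + 1.0)))

# Items 1-3 of Table 2, laid out as (a, b, c) triples.
alpha_hat = np.array([1.62, -1.85, 0.19, 0.81, -0.06, 0.15, 1.68, 2.05, 0.19])
cov_alpha = 0.01 * np.eye(alpha_hat.size)   # hypothetical, for illustration only
var_rho = delta_method_variance(marginal_reliability_3pl, alpha_hat, cov_alpha)
print(np.sqrt(var_rho))                     # approximate standard error
```

In practice the covariance matrix would come from the item parameter estimation, and the analytic derivatives given in the article avoid the step-size sensitivity of finite differences.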

Item Response Theory Test Reliability

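The integral implicit in the numerator of Equation (14) is approximated by a sum using Gauss-Hermite quadrature and so are the integrals required to calculate $r_i(\alpha)$. We thus obtain the estimator of $\rho_{XX}(\alpha)$ as

$$\hat{\rho}_{XX}(\alpha) = 1 - \frac{\sum_{l=1}^{L} \sum_{j=1}^{J} \left\{\sum_{k=1}^{m_j} P_{jk}(\theta_l;\alpha_j)\,W_{jk}^2 - \left[\sum_{k=1}^{m_j} P_{jk}(\theta_l;\alpha_j)\,W_{jk}\right]^2\right\} \frac{e^{-\theta_l^2/2}}{\sqrt{2\pi}}\, w_l}{\sum_{i=0}^{K} \hat{r}_i(\alpha)\,x_i^2 - \left[\sum_{i=0}^{K} \hat{r}_i(\alpha)\,x_i\right]^2}. \tag{24}$$

Again, $\hat{\rho}_{XX}(\alpha)$ is a continuous and differentiable function of the item parameter vector $\alpha$. Hence, the delta method can be used to approximate the variance of $\hat{\rho}_{XX}(\alpha)$ and we retrieve, for large $n$,

$$\sigma^2_{\hat{\rho}_{XX}(\alpha)} \approx \frac{\partial \hat{\rho}_{XX}(\alpha)}{\partial\alpha}\, \Sigma_{\hat{\alpha}} \left[\frac{\partial \hat{\rho}_{XX}(\alpha)}{\partial\alpha}\right]^{\top}, \tag{25}$$

where

$$\frac{\partial \hat{\rho}_{XX}(\alpha)}{\partial\alpha} = \frac{1}{\hat{\sigma}_X^2(\alpha)} \left[\frac{\hat{\sigma}_e^2(\alpha)}{\hat{\sigma}_X^2(\alpha)}\, \frac{\partial \hat{\sigma}_X^2(\alpha)}{\partial\alpha} - \frac{\partial \hat{\sigma}_e^2(\alpha)}{\partial\alpha}\right]. \tag{26}$$

Note that

$$\frac{\partial \hat{\sigma}_e^2(\alpha)}{\partial\alpha} = \sum_{l=1}^{L} \left\{\sum_{j=1}^{J} \sum_{k=1}^{m_j} \frac{\partial P_{jk}(\theta_l;\alpha_j)}{\partial\alpha}\, W_{jk}^2 - 2 \sum_{j=1}^{J} \left[\sum_{k=1}^{m_j} P_{jk}(\theta_l;\alpha_j)\,W_{jk} \sum_{k=1}^{m_j} \frac{\partial P_{jk}(\theta_l;\alpha_j)}{\partial\alpha}\, W_{jk}\right]\right\} \frac{e^{-\theta_l^2/2}}{\sqrt{2\pi}}\, w_l \tag{27}$$

and

$$\frac{\partial \hat{\sigma}_X^2(\alpha)}{\partial\alpha} = \sum_{i=0}^{K} x_i^2\, \frac{\partial \hat{r}_i(\alpha)}{\partial\alpha} - 2 \sum_{i=0}^{K} x_i\, \hat{r}_i(\alpha) \sum_{i=0}^{K} \frac{\partial \hat{r}_i(\alpha)}{\partial\alpha}\, x_i, \tag{28}$$

where the derivatives $\partial \hat{r}_i(\alpha)/\partial\alpha$ for the GPCM are given in Andersson (2016) and the derivatives $\partial \hat{r}_i(\alpha)/\partial\alpha$ for the 3-PL model are given in Ogasawara (2003).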
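The quotient-rule structure of the gradient of the test reliability estimator can be checked with a small sketch. The two toy functions below are hypothetical stand-ins for the estimated error variance and observed score variance (they are not IRT quantities); the point is only to show that the analytic combination of the two gradients agrees with a finite-difference gradient of one minus their ratio.

```python
import numpy as np

def grad_test_reliability(var_e, var_x, grad_var_e, grad_var_x):
    """Gradient of 1 - var_e / var_x in terms of the two variance gradients."""
    grad_var_e = np.asarray(grad_var_e, dtype=float)
    grad_var_x = np.asarray(grad_var_x, dtype=float)
    return ((var_e / var_x) * grad_var_x - grad_var_e) / var_x

# Hypothetical smooth stand-ins for the two variances and their gradients.
def toy_var_e(alpha):
    return alpha[0] ** 2 + 1.0

def toy_var_x(alpha):
    return alpha[0] ** 2 + alpha[1] ** 2 + 3.0

alpha = np.array([0.5, 1.5])
analytic = grad_test_reliability(
    toy_var_e(alpha), toy_var_x(alpha),
    grad_var_e=[2.0 * alpha[0], 0.0],
    grad_var_x=[2.0 * alpha[0], 2.0 * alpha[1]],
)

# Finite-difference gradient of rho = 1 - var_e / var_x for comparison.
def rho(a):
    return 1.0 - toy_var_e(a) / toy_var_x(a)

step = 1e-6
numeric = np.array([
    (rho(alpha + step * np.eye(2)[i]) - rho(alpha - step * np.eye(2)[i])) / (2.0 * step)
    for i in range(2)
])
print(analytic, numeric)
```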

Confidence Interval Estimation

For large sample sizes, approximate confidence intervals for the reliability coefficients can be constructed using the derived variances and a normal approximation. Hence, approximate 95% confidence intervals for $\rho_{\Theta}(\alpha)$ or $\rho_{XX}(\alpha)$ are given by $\hat{\rho}_{\Theta}(\alpha) \pm z(0.975)\sqrt{\hat{\sigma}^2_{\hat{\rho}_{\Theta}(\alpha)}}$ and $\hat{\rho}_{XX}(\alpha) \pm z(0.975)\sqrt{\hat{\sigma}^2_{\hat{\rho}_{XX}(\alpha)}}$, where $z(0.975)$ denotes the 0.975 quantile of the standard normal distribution. The properties of the confidence intervals in finite samples are investigated in the next section using simulations.
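The interval construction amounts to a few lines of code. The sketch below uses only the Python standard library; the function name `wald_interval` and the numerical inputs (an estimate of 0.87 with standard error 0.009) are hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(estimate, variance, level=0.95):
    """Normal-approximation interval: estimate +/- z * sqrt(variance)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # 0.975 quantile when level = 0.95
    half_width = z * sqrt(variance)
    return estimate - half_width, estimate + half_width

# Hypothetical reliability estimate and estimated asymptotic variance.
lower, upper = wald_interval(0.87, 0.009 ** 2)
print(lower, upper)
```

The interval is not truncated to the unit interval, so with very small samples or reliabilities close to 1 the upper limit could in principle exceed 1.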

Simulation Study

Simulations were conducted to study the finite sample properties of the derived asymptotic variances and confidence intervals. The values of the latent variable were drawn from the N(0,1) distribution and the responses to a test with nine GPCM items and the responses to a test with 40 3-PL items were simulated. The GPCM item parameters were taken from the study by Fischer, Tritt, Klapp, and Fliege (2011) and are given in Table 1. For the 3-PL model, the a-parameters were drawn from the U(0.5,2) distribution, the b-parameters from the N(0,1) distribution and the c-parameters from the U(0.15,0.20) distribution. The 3-PL parameters used in the study are given in Table 2. With the selected item parameters and N(0,1) as the reference distribution, for the GPCM the true marginal reliability coefficient was 0.8718 and the true test reliability coefficient was 0.8744 while for the 3-PL model the true marginal reliability coefficient was 0.8487 and the true test reliability coefficient was 0.8285. The item parameters were estimated with the R (R Development Core Team, 2016) package mirt (Chalmers, 2012) using marginal maximum likelihood with the assumption of a N(0,1) latent distribution. Since maximum likelihood estimation was used, no priors were specified for the item parameters. The observed information matrix (Yuan, Cheng, & Patton, 2013) was calculated after estimation by numerically differentiating the gradient of the observed log-likelihood function at the obtained maximum likelihood estimates to provide an estimate of the asymptotic covariance matrix, using newly written R code to improve the speed of calculation. The observed information matrix is appropriate in this setting since the models are correctly specified. After estimation, the reliability coefficients and their asymptotic variances were calculated using newly written R code. The number of Gauss-Hermite quadrature points used was 49 and 2,000 replications were used. 
The average bias, average asymptotic standard errors (ASE), Monte Carlo standard errors (MCSE), and empirical coverage rates of 95% confidence intervals were calculated. For the GPCM, sample sizes 250, 500, 1,000, 2,000, 4,000, and 8,000 were considered and for the 3-PL model sample sizes 1,000, 2,000, 4,000, and 8,000 were considered. Only the larger sample sizes were considered with the 3-PL model because reliable estimation with the smaller sample sizes was not possible due to high nonconvergence rates. With the GPCM, the nonconvergence rate was 0.45% (sample size 250) and with the 3-PL model the nonconvergence rates were 9.1% (sample size 1,000) and 0.7% (sample size 2,000). For all other settings, the nonconvergence rates were 0%.
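The data-generating step of such a simulation can be sketched as follows. This covers only the drawing of abilities and GPCM responses for the Table 1 items; the estimation step, which the study carried out with the R package mirt, is not reproduced here. The function name, the seed, and the inverse-CDF sampling scheme are assumptions made for this illustration.

```python
import numpy as np

# GPCM parameters from Table 1: a_j, then b_{j,2}, b_{j,3}, b_{j,4}.
A = np.array([1.49, 2.73, 0.79, 1.65, 0.71, 1.18, 1.22, 0.80, 1.40])
B = np.array([[-1.14, 0.64, 0.68], [-0.74, 0.57, 0.95], [-1.52, -0.24, 0.41],
              [-1.62, 0.20, 0.40], [-0.47, 0.36, 1.22], [0.14, 0.80, 1.43],
              [-0.70, 0.77, 1.01], [0.37, 1.26, 1.73], [0.89, 1.92, 2.26]])

def simulate_gpcm(n, a_vec, b_mat, rng):
    """Draw theta ~ N(0,1) and one response in 1..m_j per person and item."""
    theta = rng.standard_normal(n)
    data = np.empty((n, len(a_vec)), dtype=int)
    for j, (a, b) in enumerate(zip(a_vec, b_mat)):
        steps = a * (theta[:, None] - np.concatenate(([0.0], b)))  # n x m_j
        z = np.cumsum(steps, axis=1)
        p = np.exp(z - z.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)          # Equation (3), row-wise
        u = rng.random(n)
        # Inverse-CDF sampling: count cumulative probabilities below u.
        cat = (u[:, None] > np.cumsum(p, axis=1)).sum(axis=1) + 1
        data[:, j] = np.minimum(cat, p.shape[1])   # guard against rounding at 1.0
    return theta, data

rng = np.random.default_rng(2017)                  # arbitrary seed
theta, responses = simulate_gpcm(250, A, B, rng)
print(responses[:5])
```

Repeating this draw, refitting the model, and recording the reliability estimates and their asymptotic standard errors over replications would yield the bias, MCSE, and coverage summaries described above.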

Table 1.

Generalized Partial Credit Model Item Parameters for the Nine-Item Scale Used in the Simulation Study.

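| Item | $a_j$ | $b_{j,2}$ | $b_{j,3}$ | $b_{j,4}$ |
|---|---|---|---|---|
| 1 | 1.49 | -1.14 | 0.64 | 0.68 |
| 2 | 2.73 | -0.74 | 0.57 | 0.95 |
| 3 | 0.79 | -1.52 | -0.24 | 0.41 |
| 4 | 1.65 | -1.62 | 0.20 | 0.40 |
| 5 | 0.71 | -0.47 | 0.36 | 1.22 |
| 6 | 1.18 | 0.14 | 0.80 | 1.43 |
| 7 | 1.22 | -0.70 | 0.77 | 1.01 |
| 8 | 0.80 | 0.37 | 1.26 | 1.73 |
| 9 | 1.40 | 0.89 | 1.92 | 2.26 |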

Table 2.

Three-Parameter Logistic Item Parameters Used in the Simulation Study.

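| Item | a | b | c | Item | a | b | c |
|---|---|---|---|---|---|---|---|
| 1 | 1.62 | -1.85 | 0.19 | 21 | 1.19 | 0.52 | 0.20 |
| 2 | 0.81 | -0.06 | 0.15 | 22 | 1.25 | -0.85 | 0.18 |
| 3 | 1.68 | 2.05 | 0.19 | 23 | 0.80 | -0.62 | 0.17 |
| 4 | 1.50 | -0.52 | 0.15 | 24 | 1.60 | -1.17 | 0.16 |
| 5 | 0.73 | -0.06 | 0.17 | 25 | 1.79 | -0.67 | 0.18 |
| 6 | 1.37 | -1.16 | 0.20 | 26 | 0.84 | -0.40 | 0.20 |
| 7 | 1.00 | 0.79 | 0.18 | 27 | 1.49 | -1.94 | 0.19 |
| 8 | 1.63 | -0.40 | 0.17 | 28 | 0.80 | 0.77 | 0.17 |
| 9 | 1.55 | -1.37 | 0.18 | 29 | 1.66 | -0.86 | 0.16 |
| 10 | 0.70 | -1.13 | 0.19 | 30 | 1.13 | 0.07 | 0.19 |
| 11 | 1.18 | 1.28 | 0.15 | 31 | 1.69 | 1.20 | 0.18 |
| 12 | 1.76 | -2.36 | 0.19 | 32 | 0.62 | 0.30 | 0.20 |
| 13 | 0.75 | -0.47 | 0.20 | 33 | 0.57 | 1.13 | 0.16 |
| 14 | 1.38 | 0.74 | 0.17 | 34 | 1.07 | -0.38 | 0.17 |
| 15 | 0.54 | -1.25 | 0.16 | 35 | 1.55 | -0.74 | 0.17 |
| 16 | 0.92 | -0.36 | 0.19 | 36 | 0.88 | 1.56 | 0.19 |
| 17 | 1.73 | 0.50 | 0.17 | 37 | 1.37 | -2.30 | 0.18 |
| 18 | 0.63 | -0.90 | 0.17 | 38 | 1.65 | 0.98 | 0.16 |
| 19 | 0.61 | -1.34 | 0.20 | 39 | 0.95 | -0.11 | 0.17 |
| 20 | 0.75 | 0.27 | 0.15 | 40 | 1.76 | 1.40 | 0.18 |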

Results

Table 3 gives the results for the simulation with the GPCM. For the marginal reliability coefficient, there exists a small but statistically significant bias for sample sizes lower than 2,000. The asymptotic standard errors are accurate as an estimate of the sampling variability for all sample sizes. The empirical coverage rates of the 95% confidence intervals are slightly lower than the nominal level for sample sizes 250 and 500 but with sample sizes 1,000 and higher the empirical coverage rate is not statistically significantly different from 95%. For the test reliability coefficient, the bias is not statistically significantly different from zero for any sample size. The asymptotic standard error is accurate for all sample sizes considered. The empirical coverage rates of the 95% confidence intervals for the test reliability coefficient are not statistically significantly different from the nominal level for any sample size.

Table 3.

Mean Bias (×100), Asymptotic and Monte Carlo Standard Errors (×100), and Coverage of 95% Confidence Intervals (%) for the GPCM Reliability Coefficient Estimators, With Estimated Standard Errors in Parentheses.

| Reliability | Sample size | Mean bias | ASE | MCSE | CI coverage |
|---|---|---|---|---|---|
| Marginal | 250 | 0.14 (0.02) | 0.90 (0.00) | 0.90 (0.02) | 92.71 (0.58) |
| Marginal | 500 | 0.06 (0.01) | 0.65 (0.00) | 0.66 (0.01) | 93.45 (0.55) |
| Marginal | 1,000 | 0.03 (0.01) | 0.46 (0.00) | 0.46 (0.01) | 94.10 (0.53) |
| Marginal | 2,000 | 0.02 (0.01) | 0.32 (0.00) | 0.33 (0.01) | 94.20 (0.52) |
| Marginal | 4,000 | 0.01 (0.01) | 0.23 (0.00) | 0.23 (0.00) | 95.70 (0.45) |
| Marginal | 8,000 | 0.00 (0.00) | 0.16 (0.00) | 0.16 (0.00) | 95.05 (0.49) |
| Test | 250 | 0.02 (0.03) | 1.13 (0.00) | 1.13 (0.02) | 94.73 (0.50) |
| Test | 500 | -0.01 (0.02) | 0.80 (0.00) | 0.81 (0.01) | 94.15 (0.52) |
| Test | 1,000 | -0.00 (0.01) | 0.57 (0.00) | 0.57 (0.01) | 94.55 (0.51) |
| Test | 2,000 | -0.00 (0.01) | 0.40 (0.00) | 0.40 (0.01) | 95.10 (0.48) |
| Test | 4,000 | 0.00 (0.01) | 0.28 (0.00) | 0.28 (0.00) | 95.80 (0.45) |
| Test | 8,000 | -0.00 (0.00) | 0.20 (0.00) | 0.20 (0.00) | 95.30 (0.47) |

Note. GPCM = generalized partial credit model; ASE = asymptotic standard error; MCSE = Monte Carlo standard error; CI = confidence interval.
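The parenthesized standard errors of the coverage rates are consistent with a binomial standard error over the replications, although the article does not state how they were computed; the sketch below therefore rests on that assumption. It also assumes roughly 1,991 converged replications at sample size 250 (2,000 less the reported 0.45% nonconvergence). The function name `coverage_check` is hypothetical.

```python
from math import sqrt

def coverage_check(coverage_pct, n_reps, nominal_pct=95.0):
    """Binomial standard error (in %) of a coverage rate and its z statistic."""
    p = coverage_pct / 100.0
    se_pct = 100.0 * sqrt(p * (1.0 - p) / n_reps)
    z = (coverage_pct - nominal_pct) / se_pct
    return se_pct, z

# Marginal reliability, GPCM, values taken from Table 3.
se_250, z_250 = coverage_check(92.71, 1991)     # sample size 250
se_1000, z_1000 = coverage_check(94.10, 2000)   # sample size 1,000
print(round(se_250, 2), round(z_250, 2))
print(round(se_1000, 2), round(z_1000, 2))
```

Under this reading, the coverage at sample size 250 lies well beyond 1.96 standard errors below 95%, while the coverage at sample size 1,000 does not, which agrees with the description of Table 3 in the text.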

The results for the 3-PL model simulation are shown in Table 4. The estimator of the marginal reliability coefficient has a small but statistically significant bias under all settings considered. The asymptotic standard errors are accurate for all sample sizes but the empirical coverage rates of the confidence intervals are statistically significantly different from the nominal level of 95% with all sample sizes except the highest sample size of 8,000. For the test reliability coefficient, the bias is smaller and not statistically significantly different from zero with sample sizes 2,000 and higher. The asymptotic standard errors are accurate for all sample sizes and the empirical coverage rate is not statistically significantly different from the nominal level for any sample size.

Table 4.

Mean Bias (×100), Asymptotic and Monte Carlo Standard Errors (×100), and Coverage of 95% Confidence Intervals (%) for the 3-PL Reliability Coefficient Estimators, With Estimated Standard Errors in Parentheses.

| Reliability | Sample size | Mean bias | ASE | MCSE | CI coverage |
|---|---|---|---|---|---|
| Marginal | 1,000 | 0.52 (0.01) | 0.60 (0.00) | 0.61 (0.01) | 84.21 (0.86) |
| Marginal | 2,000 | 0.28 (0.01) | 0.43 (0.00) | 0.44 (0.01) | 89.48 (0.69) |
| Marginal | 4,000 | 0.15 (0.01) | 0.31 (0.00) | 0.32 (0.01) | 91.95 (0.61) |
| Marginal | 8,000 | 0.08 (0.00) | 0.22 (0.00) | 0.22 (0.00) | 94.70 (0.50) |
| Test | 1,000 | 0.05 (0.02) | 0.72 (0.00) | 0.74 (0.01) | 94.39 (0.54) |
| Test | 2,000 | 0.02 (0.01) | 0.51 (0.00) | 0.51 (0.01) | 95.12 (0.48) |
| Test | 4,000 | 0.01 (0.01) | 0.36 (0.00) | 0.36 (0.01) | 95.25 (0.48) |
| Test | 8,000 | -0.00 (0.01) | 0.25 (0.00) | 0.26 (0.00) | 95.65 (0.46) |

Note. 3-PL = three-parameter logistic; ASE = asymptotic standard error; MCSE = Monte Carlo standard error; CI = confidence interval.

Concluding Remarks

With the results presented in this article, empirical researchers and practitioners have access to methods that evaluate the variability of reliability coefficient estimators in IRT and with which confidence intervals for the reliability coefficients can be estimated. The simulation study indicates that the estimated confidence intervals for the test reliability coefficient have good coverage properties with sample sizes as small as 250 with the GPCM and 1,000 with the 3-PL model. The estimator of the marginal reliability coefficient is slightly biased with small samples when using the GPCM and with all sample sizes considered in this article when using the 3-PL model. With the GPCM, the estimated confidence intervals were however still largely accurate and had correct coverage with sample size 1,000 while with the 3-PL model the confidence intervals had correct coverage only with the largest sample size considered. If sufficient computational resources are available, a nonparametric bootstrap approach (Davison & Hinkley, 1997) can be used to estimate the finite sample bias of the reliability coefficient estimators. This bias estimate can then be used together with the results in this article to generate bias-adjusted confidence intervals, which will achieve an improved coverage rate.
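One possible reading of the bootstrap suggestion is sketched below. The article does not specify how the bias estimate should be combined with the asymptotic standard error, so the adjustment shown (shifting the interval centre by the estimated bias) is an assumption. The statistic used here is a cheap placeholder (the plug-in variance of a sample); in an actual application `statistic` would refit the IRT model to the resampled response matrix and return the reliability estimate. All names and numbers are hypothetical.

```python
import numpy as np

def bootstrap_bias(data, statistic, n_boot=2000, seed=0):
    """Nonparametric bootstrap bias: mean of resampled statistics minus the estimate."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    estimate = statistic(data)
    boot = np.array([statistic(data[rng.integers(0, n, size=n)])
                     for _ in range(n_boot)])      # resample rows with replacement
    return boot.mean() - estimate

def bias_adjusted_interval(estimate, bias, std_error, z=1.959964):
    """Shift the interval centre by the estimated bias (an assumed adjustment)."""
    centre = estimate - bias
    return centre - z * std_error, centre + z * std_error

def plug_in_variance(sample):
    """Placeholder statistic with a known small-sample downward bias."""
    return sample.var()                            # ddof = 0

demo_data = np.random.default_rng(1).standard_normal(50)
demo_bias = bootstrap_bias(demo_data, plug_in_variance)
print(demo_bias)                                   # expected to be slightly negative

# Hypothetical reliability estimate, bootstrap bias, and asymptotic standard error.
adj_lower, adj_upper = bias_adjusted_interval(0.850, 0.005, 0.006)
print(adj_lower, adj_upper)
```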

The differences between the properties of the marginal and test reliability estimators can be attributed to the differences with regard to the estimation of item response functions and the estimation of expected information functions. In Ogasawara (2002), it was shown that item response functions were possible to estimate accurately in spite of unstable item parameter estimates while the expected information functions were more affected by unstable item parameter estimates. Since the marginal reliability is calculated from the expected information functions while the test reliability only uses the item response functions, the difference in the accuracy of the two different reliability estimators is consistent with the results of Ogasawara (2002).

Since the marginal and test reliability coefficients measure different concepts, a direct comparison between them is not very meaningful. However, this study does indicate that the estimator of the test reliability coefficient has slightly higher sampling variability but lower bias than the estimator of the marginal reliability coefficient when using the same item parameters.

When calculating and reporting the IRT reliability coefficient estimates, it is important to note that the reliability coefficients are not invariant with regard to the latent distribution. This means that the reliability coefficients will be different for populations with different latent distributions even if the item parameters are invariant. If the distribution parameters of the population are known, the results of this paper can incorporate these without any changes to the derivations presented by suitably changing the approximation of the integrals needed. If the distribution parameters have been estimated, for example by using parametric (Mislevy, 1984), nonparametric (Bock & Aitkin, 1981), semi-parametric (Woods, 2006) or multiple group (Muthén & Lehman, 1985) methods, an extension to the methods in this article is required where the derivatives with respect to the distribution parameters are derived.

Appendix

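With the GPCM, for each item $j$ and each category $k \in \{1, \ldots, m_j\}$, the required derivative vector is

$$\frac{\partial^2 P_{jk}(\theta_l;\alpha_j)}{\partial\theta_l\,\partial\alpha} = \left(0, \ldots, 0,\ \frac{\partial^2 P_{jk}(\theta_l;\alpha_j)}{\partial\theta_l\,\partial a_j},\ \frac{\partial^2 P_{jk}(\theta_l;\alpha_j)}{\partial\theta_l\,\partial b_{j,2}},\ \ldots,\ \frac{\partial^2 P_{jk}(\theta_l;\alpha_j)}{\partial\theta_l\,\partial b_{j,m_j}},\ 0, \ldots, 0\right),$$

where

$$\frac{\partial^2 P_{jk}(\theta_l;\alpha_j)}{\partial\theta_l\,\partial a_j} = P_{jk}(\theta_l;\alpha_j)\left(k - \sum_{c=1}^{m_j} c\,P_{jc}(\theta_l;\alpha_j)\right) + \frac{\partial P_{jk}(\theta_l;\alpha_j)}{\partial a_j}\, a_j\left(k - \sum_{c=1}^{m_j} c\,P_{jc}(\theta_l;\alpha_j)\right) - P_{jk}(\theta_l;\alpha_j)\, a_j \sum_{c=1}^{m_j} c\,\frac{\partial P_{jc}(\theta_l;\alpha_j)}{\partial a_j}$$

and, for $k^* = 2, \ldots, m_j$,

$$\frac{\partial^2 P_{jk}(\theta_l;\alpha_j)}{\partial\theta_l\,\partial b_{j,k^*}} = \frac{\partial P_{jk}(\theta_l;\alpha_j)}{\partial b_{j,k^*}}\, a_j\left(k - \sum_{c=1}^{m_j} c\,P_{jc}(\theta_l;\alpha_j)\right) - P_{jk}(\theta_l;\alpha_j)\, a_j \sum_{c=1}^{m_j} c\,\frac{\partial P_{jc}(\theta_l;\alpha_j)}{\partial b_{j,k^*}}.$$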

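With the 3-PL model, for each item $j$, the derivative vectors are

$$\frac{\partial^2 P_{j1}(\theta_l;\alpha_j)}{\partial\theta_l\,\partial\alpha} = -\left(0, \ldots, 0,\ \frac{\partial^2 P_j(\theta_l;\alpha_j)}{\partial\theta_l\,\partial a_j},\ \frac{\partial^2 P_j(\theta_l;\alpha_j)}{\partial\theta_l\,\partial b_j},\ \frac{\partial^2 P_j(\theta_l;\alpha_j)}{\partial\theta_l\,\partial c_j},\ 0, \ldots, 0\right)$$

and

$$\frac{\partial^2 P_{j2}(\theta_l;\alpha_j)}{\partial\theta_l\,\partial\alpha} = \left(0, \ldots, 0,\ \frac{\partial^2 P_j(\theta_l;\alpha_j)}{\partial\theta_l\,\partial a_j},\ \frac{\partial^2 P_j(\theta_l;\alpha_j)}{\partial\theta_l\,\partial b_j},\ \frac{\partial^2 P_j(\theta_l;\alpha_j)}{\partial\theta_l\,\partial c_j},\ 0, \ldots, 0\right),$$

where

$$\frac{\partial^2 P_j(\theta_l;\alpha_j)}{\partial\theta_l\,\partial a_j} = \frac{(1 - P_j(\theta_l;\alpha_j))\,(P_j(\theta_l;\alpha_j) - c_j)}{1 - c_j} + \frac{a_j\,(1 - P_j(\theta_l;\alpha_j))}{1 - c_j}\, \frac{\partial P_j(\theta_l;\alpha_j)}{\partial a_j} - \frac{a_j\,(P_j(\theta_l;\alpha_j) - c_j)}{1 - c_j}\, \frac{\partial P_j(\theta_l;\alpha_j)}{\partial a_j},$$

$$\frac{\partial^2 P_j(\theta_l;\alpha_j)}{\partial\theta_l\,\partial b_j} = \frac{a_j\,(1 - P_j(\theta_l;\alpha_j))}{1 - c_j}\, \frac{\partial P_j(\theta_l;\alpha_j)}{\partial b_j} - \frac{a_j\,(P_j(\theta_l;\alpha_j) - c_j)}{1 - c_j}\, \frac{\partial P_j(\theta_l;\alpha_j)}{\partial b_j}$$

and

$$\frac{\partial^2 P_j(\theta_l;\alpha_j)}{\partial\theta_l\,\partial c_j} = -\frac{a_j\,(P_j(\theta_l;\alpha_j) - c_j)}{1 - c_j}\, \frac{\partial P_j(\theta_l;\alpha_j)}{\partial c_j} + a_j\,(1 - P_j(\theta_l;\alpha_j))\, \frac{\left(\frac{\partial P_j(\theta_l;\alpha_j)}{\partial c_j} - 1\right)(1 - c_j) + (P_j(\theta_l;\alpha_j) - c_j)}{(1 - c_j)^2}.$$

The derivatives $\partial P_j(\theta_l;\alpha_j)/\partial a_j$, $\partial P_j(\theta_l;\alpha_j)/\partial b_j$ and $\partial P_j(\theta_l;\alpha_j)/\partial c_j$ can be found in Hambleton and Swaminathan (1985).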

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Tao Xin declared funding by the National Natural Science Foundation of China (Grant No. 31371047).

References

  1. Andersson B. (2016). Asymptotic standard errors of observed-score equating with polytomous IRT models. Journal of Educational Measurement, 53, 459-477.
  2. Bock R. D., Aitkin M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.
  3. Chalmers R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1-29.
  4. Cheng Y., Yuan K.-H., Liu C. (2012). Comparison of reliability measures under factor analysis and item response theory. Educational and Psychological Measurement, 72, 52-67.
  5. Davison A. C., Hinkley D. V. (1997). Bootstrap methods and their application. Cambridge, England: Cambridge University Press.
  6. Ferguson T. (1996). A course in large sample theory. London, England: Chapman & Hall.
  7. Fischer H. F., Tritt K., Klapp B. F., Fliege H. (2011). How to compare scores from different depression scales: Equating the patient health questionnaire (PHQ) and the ICD-10-symptom rating (ISR) using item response theory. International Journal of Methods in Psychiatric Research, 20, 203-214.
  8. Green B. F., Bock R. D., Humphreys L. G., Linn R. L., Reckase M. D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21, 347-360.
  9. Hambleton R. K., Swaminathan H. (1985). Item response theory: Principles and applications. Boston, MA: Kluwer.
  10. Kim S. (2012). A note on the reliability coefficients for item response model-based ability estimates. Psychometrika, 77, 153-162.
  11. Kim S., Feldt L. S. (2010). The estimation of the IRT reliability coefficient and its lower and upper bounds, with comparisons to CTT reliability statistics. Asia Pacific Education Review, 11, 179-188.
  12. Lord F. M. (1977). Practical applications of item characteristic curve theory. Journal of Educational Measurement, 14, 117-138.
  13. Lord F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
  14. Lord F. M., Wingersky M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings”. Applied Psychological Measurement, 8, 452-461.
  15. Magis D. (2015). A note on the equivalence between observed and expected information functions with polytomous IRT models. Journal of Educational and Behavioral Statistics, 40, 96-105.
  16. Mislevy R. J. (1984). Estimating latent distributions. Psychometrika, 49, 359-381.
  17. Muraki E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.
  18. Muthén B., Lehman J. (1985). Multiple group IRT modeling: Applications to item bias analysis. Journal of Educational and Behavioral Statistics, 10, 133-142.
  19. Ogasawara H. (2002). Stable response functions with unstable item parameter estimates. Applied Psychological Measurement, 26, 239-254.
  20. Ogasawara H. (2003). Asymptotic standard errors of IRT observed-score equating methods. Psychometrika, 68, 193-211.
  21. R Development Core Team. (2016). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
  22. Thissen D., Pommerich M., Billeaud K., Williams V. S. (1995). Item response theory for scores on tests including polytomous items with ordered responses. Applied Psychological Measurement, 19, 39-49.
  23. Woods C. M. (2006). Ramsay-curve item response theory (RC-IRT) to detect and correct for nonnormal latent variables. Psychological Methods, 11, 253-270.
  24. Yuan K.-H., Bentler P. M. (2002). On robustness of the normal-theory based asymptotic distributions of three reliability coefficient estimates. Psychometrika, 67, 251-259.
  25. Yuan K.-H., Cheng Y., Patton J. (2013). Information matrices and standard errors for MLEs of item parameters in IRT. Psychometrika, 79, 232-254.

Articles from Educational and Psychological Measurement are provided here courtesy of SAGE Publications
