Abstract
This note highlights and illustrates the links between item response theory and classical test theory in the context of polytomous items. An item response modeling procedure is discussed that can be used for point and interval estimation of the individual true score on any item in a measuring instrument or item set following the popular and widely applicable graded response model. The method contributes to the body of research on the relationships between classical test theory and item response theory and is illustrated on empirical data.
Keywords: classical test theory, graded response model, individual trait level estimate, interval estimation, item response theory, polytomous item, true score, standard error
The past several decades have seen increased interest in the connections between item response theory (IRT) and item response modeling on one hand, and factor analysis and classical test theory (CTT) on the other (e.g., Takane & de Leeuw, 1987; Zimmerman, 1975; see also Raykov, Dimitrov, Marcoulides, & Harrison, 2017; Raykov & Marcoulides, 2017, and references therein). These relationships are highly useful for a deeper understanding of both methodologies and substantially facilitate their well-informed application. Recently, Raykov et al. (2017) highlighted and illustrated the links between IRT and CTT by discussing an item response modeling procedure for point and interval estimation of the individual true scores on each item in a measuring instrument or set consisting of binary or binary-scored measures. The present note extends their procedure to the more general case of homogeneous polytomous items and is concerned with instruments or item sets following the graded response model (GRM; Samejima, 1969, 2016), which is popular and widely used in educational and behavioral research.
Point and Interval Estimation of Individual True Scores on Ordinal Polytomous Items
Background, Notation, and Assumptions
For the aims of this article, we assume that a set of ordinal polytomous items is given, each having r response categories designated 1, 2, . . ., r (r ≥ 2), and note that the following procedure is directly applicable when the number of response categories differs across items. (The invariance of these numbers across items is inconsequential for the method outlined in the sequel, entails no limitation of generality, and is assumed in this discussion merely for convenience.) We symbolize the items by Y1, Y2, . . ., Yk (k > 1) and presume that they are the components of a unidimensional multi-item measuring instrument or item set under consideration, such as a psychometric scale or test (referred to as “instrument” below); these may be, for instance, items used in a partial scoring setting or responses to Likert-type questions.¹ We stipulate that the instrument has been administered to a sample of independent subjects from a studied population that is not a mixture of two or more latent classes (cf. Raykov, Marcoulides, & Chang, 2016). Last, we posit that the GRM is valid in the population for these items and designate by θ the underlying latent trait or ability (latent dimension) being evaluated by them.
Point Estimation of Item-Specific Individual True Scores
Following CTT, the true score Tij of the ith examined individual on the jth item is the expectation of his or her pertinent item score (treated as a random variable, at his or her given level of the underlying ability or trait θ, as in the rest of this section):
\[ T_{ij} = \varepsilon(Y_{ij}), \tag{1} \]
where θi is this person’s ability or trait level, Yij their observed score on the item, and ε(·) symbolizes expectation with respect to the pertinent propensity distribution of possible item scores (i = 1, . . ., n, with n denoting sample size, and j = 1, . . ., k; Lord & Novick, 1968). Since Yij is a discrete random variable, its expectation is (e.g., Casella & Berger, 2002)
\[ T_{ij} = \varepsilon(Y_{ij}) = \sum_{m=1}^{r} m \cdot P(Y_{ij} = m), \tag{2} \]
where P(·) denotes probability (and a dot is used to symbolize multiplication, as in the remainder of the section). According to the GRM (cf. de Ayala, 2009),
\[ P(Y_{ij} = m) = P(Y_{ij} \ge m) - P(Y_{ij} \ge m + 1), \quad m = 1, \ldots, r. \tag{3} \]
In Equations (3), P(Yij ≥ m) are the probabilities of responding in category m or higher, which are of main modeling interest and are parameterized in the GRM as follows:
\[ P(Y_{ij} \ge m) = \frac{\exp[a_j(\theta_i - b_{jm})]}{1 + \exp[a_j(\theta_i - b_{jm})]}, \quad m = 2, \ldots, r, \tag{4} \]
with exp(·) denoting exponentiation, aj an item discrimination parameter, and bjm a cut-point that can be thought of as a difficulty parameter associated with a response in category m or higher (m = 2, 3, . . ., r); in addition, P(Yij ≥ 1) = 1 and P(Yij ≥ r + 1) = 0 are presumed. That is, these probabilities can be viewed as formally satisfying the two-parameter logistic model, with an item-specific discrimination parameter and r − 1 additional parameters, each associated with the corresponding ordered category of the item (second through last) (“Stata Item Response Theory Manual,” 2015; cf. Samejima, 1969, 2016). In the rest of the article, for simplicity of reference, the latter r − 1 quantities are called “item difficulty parameters.”
Hence, from Equations (1) through (3) it follows that the true score of the ith individual on the jth item is (cf. de Ayala, 2009)
\[ T_{ij} = \sum_{m=1}^{r} m \cdot [P(Y_{ij} \ge m) - P(Y_{ij} \ge m + 1)] = 1 + \sum_{m=2}^{r} P(Y_{ij} \ge m). \]
Therefore, after fitting the GRM to a given data set on ordinal polytomous items (and finding it plausible), the point estimates of the individual true scores on any of the k items under consideration are obtainable, by substituting into this expression the probabilities estimated via Equation (4), as
\[ \hat{T}_{ij} = 1 + \sum_{m=2}^{r} \hat{P}(Y_{ij} \ge m) = 1 + \sum_{m=2}^{r} \frac{\exp[\hat{a}_j(\hat{\theta}_i - \hat{b}_{jm})]}{1 + \exp[\hat{a}_j(\hat{\theta}_i - \hat{b}_{jm})]}, \tag{5} \]
where a hat is used to denote an estimate of the quantity underneath, as in the remainder of this note, and \( \hat{P}(\cdot) \) denotes the model-based estimated probability of the event in parentheses (i = 1, . . ., n, j = 1, . . ., k).²
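To make the computation in Equations (4) and (5) concrete, the following minimal R sketch (not part of the original appendices) evaluates the GRM cumulative probabilities and the implied item true score; the function name and all numerical values are hypothetical and chosen purely for illustration.

# Minimal sketch of Equations (3) through (5); all parameter values below are hypothetical.
grm_true_score <- function(theta_i, a_j, b_j) {
  # b_j holds the r - 1 "difficulty" parameters b_j2, ..., b_jr (categories 2 through r).
  r <- length(b_j) + 1
  # Equation (4): P(Y_ij >= m) for m = 2, ..., r; P(Y_ij >= 1) = 1 by definition.
  p_ge <- c(1, plogis(a_j * (theta_i - b_j)))
  # Equations (3): category probabilities P(Y_ij = m) = P(Y_ij >= m) - P(Y_ij >= m + 1).
  p_cat <- p_ge - c(p_ge[-1], 0)
  # True score: expected item score, equivalently 1 + sum of P(Y_ij >= m) over m = 2, ..., r.
  sum((1:r) * p_cat)
}
# Illustrative call with made-up values (r = 5 categories):
grm_true_score(theta_i = 0.3, a_j = 1.5, b_j = c(-1.2, -0.2, 0.8, 1.7))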
Interval Estimation of Item-Specific Individual True Scores
Equation (5) provides only point estimates of the individual persons’ true scores on any item in an instrument or item set under consideration, with no information about the instability of these estimates. To resolve this issue, all estimated item discrimination and difficulty parameters are treated next as known, for instance, from prior calibration or from application of the marginal maximum likelihood estimation method (e.g., Reckase, 2009; see also Raykov et al., 2017), as is routinely done in corresponding IRT applications. In this way, by utilizing the delta method (e.g., Raykov & Marcoulides, 2004), we can furnish the following approximate standard error (SE) for the item-specific, individual true score in Equation (5) (cf. Raykov et al., 2017):
\[ S.E.(\hat{T}_{ij}) = S.E.(\hat{\theta}_i) \cdot \sum_{m=2}^{r} \frac{\hat{a}_j \exp[\hat{a}_j(\hat{\theta}_i - \hat{b}_{jm})]}{\big(1 + \exp[\hat{a}_j(\hat{\theta}_i - \hat{b}_{jm})]\big)^{2}} \tag{6} \]
(i = 1, . . ., n, j = 1, . . ., k). (The standard error appearing in Equation 6 is by definition the positive square root of the pertinent approximate variance resulting from the delta method; e.g., Casella & Berger, 2002.)
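As a companion to the sketch following Equation (5), the delta-method standard error in Equation (6) can be evaluated in R as follows; this is again a hedged sketch with hypothetical parameter and standard error values, multiplying the standard error of the trait level estimate by the derivative of the true score with respect to θ.

# Minimal sketch of Equation (6); parameter and standard error values are hypothetical.
grm_true_score_se <- function(theta_i, se_theta_i, a_j, b_j) {
  p_ge <- plogis(a_j * (theta_i - b_j))       # P(Y_ij >= m), m = 2, ..., r
  dT_dtheta <- sum(a_j * p_ge * (1 - p_ge))   # derivative of 1 + sum of the logistic terms
  se_theta_i * dT_dtheta                      # delta-method approximation to S.E. of T-hat
}
# Illustrative call, using the same made-up item parameters as above:
grm_true_score_se(theta_i = 0.3, se_theta_i = 0.45, a_j = 1.5, b_j = c(-1.2, -0.2, 0.8, 1.7))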
Employing the standard error in Equation (6) and the true score estimate in Equation (5), together with the monotone transformation-based procedure using the logistic function in Raykov and Marcoulides (2011, chapter 7; after a suitable initial linear transformation, see below and Appendix B), we finally obtain for each person and item a confidence interval (CI) for his or her true score on the jth item:
\[ \big(T_{ij,lo(\alpha)},\; T_{ij,up(\alpha)}\big), \tag{7} \]
where Tij,lo(α) and Tij,up(α), respectively, denote the lower and upper limits of the 100(1 − α)% CI for the true score of the ith examined person on that item (0 < α < 1; i = 1, . . ., n, j = 1, . . ., k).
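For completeness, the transformation-based construction implemented in the R-function of Appendix B can be sketched as follows (a sketch assuming r response categories, so that the rescaled true score lies in the unit interval; z_{α/2} denotes the standard normal quantile, e.g., 1.96 for α = .05):
\[ \hat{Z}_{ij} = \frac{\hat{T}_{ij} - 1}{r - 1}, \qquad \hat{\lambda}_{ij} = \ln\frac{\hat{Z}_{ij}}{1 - \hat{Z}_{ij}}, \qquad S.E.(\hat{\lambda}_{ij}) = \frac{S.E.(\hat{T}_{ij})/(r - 1)}{\hat{Z}_{ij}\,(1 - \hat{Z}_{ij})}, \]
\[ \lambda_{ij,lo/up} = \hat{\lambda}_{ij} \mp z_{\alpha/2} \cdot S.E.(\hat{\lambda}_{ij}), \qquad T_{ij,lo(\alpha)/up(\alpha)} = 1 + \frac{r - 1}{1 + \exp(-\lambda_{ij,lo/up})}. \]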
The item-specific individual true score estimate in Equation (5) as well as its associated standard error in Equation (6) and CI in (7) are readily obtained using widely circulated statistical software, such as Stata and R (e.g., Raykov & Marcoulides, 2017; see also the Mplus source code for examining the plausibility of the GRM, which is provided in appendix 1 of Raykov et al., 2017). The source code needed for the point estimation of the item-specific individual true scores and the associated approximate standard error (Equations 5 and 6) is supplied in Appendix A to this note, and the R-function for the construction of the true score CI (7) is found in Appendix B.
The applicability and utility of the discussed true score estimation procedure are demonstrated next on empirical data.
Illustration on Data
For the purposes of this section, we use a data set from an anxiety study that is available with the download of the (student version of the) IRT software IRTPRO from www.ssicentral.com (Cai, Thissen, & du Toit, 2017). The data set results from k = 5 ordinal polytomous items with r = 5 response options each, administered to n = 514 persons who were asked about their feelings of being calm, at ease, tense, regretful, or nervous (e.g., du Toit, 2003). For the sake of illustration and ease of reference, and without loss of generality, these items might be thought of in the remainder as being possibly indicative of the trait Generalized Anxiety.
We commence by examining the plausibility of the GRM. To this end, as in Raykov et al. (2017), we use confirmatory factor analysis for the five categorical items and apply the popular latent variable modeling software Mplus to fit the pertinent single-factor model to them (L. K. Muthén & Muthén, 2017; see appendix 1 in Raykov et al., 2017, for the needed source code and notes to it). The overall goodness-of-fit indices of the GRM fitted thereby are nonsignificant and suggest that it is a tenable means of data description and explanation: Pearson chi-square value = 1682.251, degrees of freedom (df) = 3090, associated p value (p) = 1; and likelihood ratio chi-square value = 837.822, df = 3090, p = 1.³ In addition, the individual item R² indices range between 32% and 71%, and thus none of them can be seen as indicating potentially serious “local” violations of model fit. With these global and “local” goodness-of-fit results, we may conclude that the GRM is plausible for the analyzed data set.
Next, we obtain the point estimates of the individual Generalized Anxiety trait levels, that is, the above values \( \hat{\theta}_i \) (i = 1, . . ., n, with n = 514 here; see Appendix A for the needed Stata source code). This is achieved with the first two commands in the Stata source code in Appendix A (after the initial three comment lines). With these latent trait estimates as well as the item discrimination and difficulty parameter estimates (i.e., of aj and bjm; j = 1, . . ., 5, m = 2, . . ., 5), using Equation (5), we arrive at the point estimates of the 514 individual true scores on each of the five items. This is accomplished with the first foreach loop in the Stata code in Appendix A (after the pertinent comment line). The resulting true score estimates for the first 10 subjects, say, are presented in Table 1.
Table 1.
id | calm | at ease | tense | regretful | nervous | T1 | T2 | T3 | T4 | T5 |
---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 3 | 2 | 2 | 2 | 2.493538 | 2.678101 | 2.651873 | 2.357293 | 2.530299 |
2 | 3 | 3 | 5 | 5 | 3 | 3.231783 | 3.420408 | 3.455674 | 2.983735 | 3.236365 |
3 | 3 | 3 | 3 | 3 | 4 | 2.925919 | 3.10302 | 3.120781 | 2.714475 | 2.937812 |
4 | 3 | 3 | 2 | 2 | 3 | 2.600281 | 2.784837 | 2.764364 | 2.441504 | 2.627367 |
5 | 2 | 3 | 2 | 4 | 4 | 2.560339 | 2.745312 | 2.722132 | 2.409815 | 2.590901 |
6 | 1 | 1 | 1 | 1 | 2 | 1.173661 | 1.254172 | 1.288422 | 1.370501 | 1.370685 |
7 | 3 | 2 | 1 | 1 | 1 | 1.782303 | 1.944211 | 1.901951 | 1.797227 | 1.874782 |
8 | 1 | 1 | 2 | 1 | 1 | 1.236826 | 1.343809 | 1.362768 | 1.424015 | 1.432933 |
9 | 3 | 3 | 3 | 1 | 1 | 2.510716 | 2.695529 | 2.669911 | 2.370757 | 2.545853 |
10 | 3 | 2 | 2 | 1 | 1 | 1.994575 | 2.149248 | 2.120625 | 1.959929 | 2.066717 |
Note. Tj = individual true score on the jth item (j = 1, . . ., 5, in the order “calm,” “atease,” “tense,” “regretful,” and “nervous”; subject identifier given in the left-most column).
In the second step of the procedure of this article, employing these individual true score estimates and based on Equation (6), we furnish the standard errors associated with each of the 514 individual true score estimates on any of the five items. This is accomplished with the second foreach loop in the Stata code in Appendix A (after the pertinent comment line). These standard errors for the first 10 respondents are presented in Table 2.
Table 2.
id | T1 | T1_SE | T2 | T2_SE | T3 | T3_SE | T4 | T4_SE | T5 | T5_SE |
---|---|---|---|---|---|---|---|---|---|---|
1 | 2.493538 | 0.313361 | 2.678101 | 0.3188743 | 2.651873 | 0.3289223 | 2.357293 | 0.2454092 | 2.530299 | 0.2836159 |
2 | 3.231783 | 0.3380273 | 3.420408 | 0.3509775 | 3.455674 | 0.3496579 | 2.983735 | 0.2905437 | 3.236365 | 0.31761 |
3 | 2.925919 | 0.2866951 | 3.10302 | 0.2878803 | 3.120781 | 0.3222451 | 2.714475 | 0.2521445 | 2.937812 | 0.2833688 |
4 | 2.600281 | 0.3114646 | 2.784837 | 0.3065315 | 2.764364 | 0.330206 | 2.441504 | 0.248196 | 2.627367 | 0.2852975 |
5 | 2.560339 | 0.3329757 | 2.745312 | 0.3314064 | 2.722132 | 0.351201 | 2.409815 | 0.2631197 | 2.590901 | 0.3030938 |
6 | 1.173661 | 0.17563 | 1.254172 | 0.2554093 | 1.288422 | 0.2167151 | 1.370501 | 0.1608515 | 1.370685 | 0.1859264 |
7 | 1.782303 | 0.3271973 | 1.944211 | 0.307346 | 1.901951 | 0.3244133 | 1.797227 | 0.2368114 | 1.874782 | 0.2802187 |
8 | 1.236826 | 0.2116557 | 1.343809 | 0.2925058 | 1.362768 | 0.2386862 | 1.424015 | 0.16746 | 1.432933 | 0.1958754 |
9 | 2.510716 | 0.3258688 | 2.695529 | 0.3296319 | 2.669911 | 0.342316 | 2.370757 | 0.255632 | 2.545853 | 0.2951923 |
10 | 1.994575 | 0.2961472 | 2.149248 | 0.2995908 | 2.120625 | 0.315244 | 1.959929 | 0.2372342 | 2.066717 | 0.2789896 |
Note. Tj = individual true score estimate on the jth item (see footnote in Table 1); Tj_SE = standard error for the individual true score estimate on the jth item (j = 1, . . ., 5; subject identifier given in left-most column).
In the third step of the method discussed in this note, utilizing the R-function “ci.ordinal_polytomous_item_TS” in Appendix B with the true score estimates and their standard errors obtained as above, we finally furnish the 95% CIs for each of the 514 individual true scores on any of the five items. These CIs for the first 10 persons and first item are presented in Table 3.
Table 3.
id | T1 | T1_SE | T1_low | T1_up |
---|---|---|---|---|
1 | 2.493538 | 0.3133610 | 1.944534 | 3.138326 |
2 | 3.231783 | 0.3380273 | 2.568180 | 3.847397 |
3 | 2.925919 | 0.2866951 | 2.383873 | 3.479087 |
4 | 2.600281 | 0.3114646 | 2.043796 | 3.229680 |
5 | 2.560339 | 0.3329757 | 1.974632 | 3.237688 |
6 | 1.173661 | 0.17563 | 1.022728 | 2.059820 |
7 | 1.782303 | 0.3271973 | 1.322688 | 2.609956 |
8 | 1.236826 | 0.2116557 | 1.038733 | 2.153108 |
9 | 2.510716 | 0.3258688 | 1.941091 | 3.179467 |
10 | 1.994575 | 0.2961472 | 1.528359 | 2.673828 |
Note. T1 = individual true score for the first item; T1_SE = associated standard error; T1_low = lower endpoint of the 95% confidence interval (CI) for the individual true score on the first item; T1_up = upper endpoint of this CI. As indicated in the main text, CIs for the individual true scores on the remaining four items are obtained by complete analogy to the developments in the current section (see Equations 6 and 7, and Appendix B).
The individual true score CIs with respect to the remaining four items are obtained by complete analogy, using their earlier rendered true score estimates and associated standard errors.
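As a brief numerical check of this step (a usage sketch, not part of the original appendices), applying the R-function of Appendix B to the first subject's true score estimate and standard error for the “calm” item from Table 2 reproduces, up to rounding, the first row of Table 3:

# Assumes ci.ordinal_polytomous_item_TS from Appendix B has been defined in the R session.
ci.ordinal_polytomous_item_TS(T_ij = 2.493538, se = 0.313361)
# Returns approximately: 1.944534 3.138326 (cf. Table 3, first row).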
Conclusion
In this note, we have discussed an extension of the true score point and interval estimation procedure in Raykov et al. (2017) to the case of ordinal polytomous items. The method has generalized their procedure to the setting of a unidimensional measuring instrument or item set adhering to the popular and widely applicable GRM, with that earlier procedure being a special case of the present one (when r = 2). We were similarly concerned here with an IRT modeling–based approach to point and interval estimation of individual true scores on each item in an item set under consideration that followed the GRM, and our goal was also to highlight and illustrate further important and useful links between IRT and CTT.
It is worth stressing at this point several limitations of the estimation approach of the present article. One is the requirement of large samples with respect to both persons and items, since the approach is instrumentally based on maximum likelihood estimation, which is grounded in asymptotic theory (see Raykov et al., 2017, for additional discussion of this general limitation, in particular with respect to the number of items). Two, the discussed method rests on the assumption that the items follow the GRM, and therefore its application with an instrument or item set that is multidimensional is not recommended (cf. Reckase, 2009). Three, we have assumed that the instrument or considered items are given, that is, prespecified, rather than sampled from a pool, population, or universe of items. Last, the procedure of this note makes extensive use of the delta method, which is based on a linear approximation, and it is unknown to what extent its approximate validity may generalize to complex parametric functions representing expected observed scores (true scores) in models other than the GRM.
In conclusion, this note provides educational and behavioral scientists with a widely applicable procedure for estimation of individual true scores on any of the items in a unidimensional measuring instrument or item set adhering to the popular GRM, and contributes further to the body of research on the connections between IRT and CTT (e.g., Raykov et al., 2017, and references therein).
Acknowledgments
We are grateful to B. Muthén and L. Steinberg for valuable discussions on the graded response model. We are indebted to R. Raciborski for helpful and instructive comments on applications of the Stata IRT module.
Appendix A
Stata Source Code for Point Estimation and Approximate Standard Error Evaluation of Individual True Scores on Ordinal Polytomous Items Following the Graded Response Model
* First read in the analyzed data set with Stata. (In the illustration section, this is the GA data set, with item names
*‘calm’, ‘atease’, ‘tense’, ‘regretful’, and ‘nervous’.)
* The next 2 lines fit the GRM and furnish the estimates of the individual trait levels θi (i = 1, …, 514).
. irt grm calm atease tense regretful nervous
. predict Theta, latent se(ThetaSE)
* The next block of commands (the first foreach loop) renders the estimated item-specific individual true scores for all 514 subjects.
foreach x in calm atease tense regretful nervous {
gen T_`x' = 1 + 1/(1+exp(_b[/`x':cut1] - _b[`x':Theta]*Theta)) ///
            + 1/(1+exp(_b[/`x':cut2] - _b[`x':Theta]*Theta)) ///
            + 1/(1+exp(_b[/`x':cut3] - _b[`x':Theta]*Theta)) ///
            + 1/(1+exp(_b[/`x':cut4] - _b[`x':Theta]*Theta))
}
* The next block of commands (the second foreach loop) furnishes the standard errors for the item-specific, individual true score estimates for all persons.
foreach x in calm atease tense regretful nervous {
gen T_SE_`x' = ThetaSE * ( ///
    _b[`x':Theta]*exp(_b[`x':Theta]*(Theta - _b[/`x':cut1]/_b[`x':Theta])) / ///
      (1+exp(_b[`x':Theta]*(Theta - _b[/`x':cut1]/_b[`x':Theta])))^2 ///
  + _b[`x':Theta]*exp(_b[`x':Theta]*(Theta - _b[/`x':cut2]/_b[`x':Theta])) / ///
      (1+exp(_b[`x':Theta]*(Theta - _b[/`x':cut2]/_b[`x':Theta])))^2 ///
  + _b[`x':Theta]*exp(_b[`x':Theta]*(Theta - _b[/`x':cut3]/_b[`x':Theta])) / ///
      (1+exp(_b[`x':Theta]*(Theta - _b[/`x':cut3]/_b[`x':Theta])))^2 ///
  + _b[`x':Theta]*exp(_b[`x':Theta]*(Theta - _b[/`x':cut4]/_b[`x':Theta])) / ///
      (1+exp(_b[`x':Theta]*(Theta - _b[/`x':cut4]/_b[`x':Theta])))^2 )
}
* List next the first 10 subjects’ true score estimates on the 5 items.
. list calm atease tense regretful nervous T_calm-T_nervous in 1/10
* List next the standard errors of the true score estimates for the first 10 subjects.
. list T_calm T_SE_calm T_atease T_SE_atease T_tense T_SE_tense T_regretful T_SE_regretful T_nervous T_SE_nervous in 1/10
Note 1. A comment line is preceded by an asterisk (cf. “Stata Item Response Theory Manual,” 2015); the initial dot starting a command signifies Stata’s prompt, and the commands, in particular the foreach blocks with their line continuations, can be executed in a do-file. The references “_b[item_name:Theta]” and “_b[/item_name:cut#]” are the “legends” or labels by which Stata internally refers to the item discrimination parameters (first reference) and the four item difficulty parameters (cuts), respectively, in the present illustration example and the “intercept-and-slope” parameterization implemented in the software (“Stata Item Response Theory Manual,” 2015). The command “irt grm” followed by the names of the items fits the GRM to them; “gen” is the computing command (short for “generate”); and “list” is the listing command (see also Raykov & Marcoulides, 2017, with respect to Stata’s IRT modeling capabilities). The symbol “///” (preceded by a space) signifies continuation of the current command on the following line.
Note 2. Apply this set of commands only to a data set where the GRM is plausible (e.g., as examined using the Mplus command file in appendix 1 of Raykov et al., 2017; see also the illustration section). With a different number of ordinal polytomous items adhering to the GRM, modify/extend appropriately the applicable commands for true score estimation and standard error evaluation.
Note 3. T_“name” = true score estimate on the “calm,” “atease,” “tense,” “regretful,” and “nervous” items, respectively; T_SE_“name” = standard error associated with the true score estimate on the corresponding item (j = 1, . . ., 5); Theta = estimate of the individual latent trait/ability level (i.e., \( \hat{\theta}_i \); i = 1, . . ., n); ThetaSE = standard error associated with the latter estimate (standard error of \( \hat{\theta}_i \); i = 1, . . ., n).
Appendix B
R-Function for Confidence Interval Construction for the Individual True Scores on Ordinal Polytomous Items Following the Graded Response Model
ci.ordinal_polytomous_item_TS = function(T_ij, se){ # R-function for CI of TS on each item.
T_ij = (T_ij-1)/4 # This R-function is applicable for ordinal polytomous items with r = 5 categories.
se = se/4 # Modify it correspondingly for items with a different number of categories.
l = log(T_ij/(1-T_ij)) # Logit transformation of T_ij (e.g., Raykov & Marcoulides, 2011, ch. 7).
sel = se/(T_ij*(1-T_ij))
ci_l_lo = l-1.96*sel
ci_l_up = l+1.96*sel
ci_lo = 1+4/(1+exp(-ci_l_lo))
ci_up = 1+4/(1+exp(-ci_l_up))
ci = c(ci_lo, ci_up)
ci
} # (ci_lo, ci_up) is the 95%-confidence interval for the individual true score Tij.
Note. This R-function is a modification of the R-function “ci.true_score” in Raykov et al. (2017). The present R-function is based on the fact that for a Likert-type item Y with r = 5 response categories (denoted 1 through 5), as in the illustration section example, the transformed score Z = (Y − 1)/4 lies in the interval [0, 1], and hence its expectation, that is, the correspondingly modified true score, resides within the same interval (see the first two command lines after the title line in the R-function above). Hence, to obtain a 95% CI for this true score, one can employ the R-function “ci.true_score” in Raykov et al. (2017); from the resulting CI, the reverse transformation (y = 4z + 1) then yields the corresponding CI limits for the individual true score on the original item Y, which is of concern here (see the 8th and 9th command lines above). For a different confidence level, use a correspondingly modified multiplier in lieu of 1.96 above (Lines 6 and 7; e.g., 1.64 for a 90% CI; Casella & Berger, 2002).
Footnotes
1. In the case k = 2, only the 1PL model (Rasch model) is identified, as a special case of the GRM with r = 2 response options per item (e.g., von Davier, 2016).
2. Note that the familiar expression Tij = P(Yij = 1) in the binary or binary-scored item case (e.g., Raykov et al., 2017) would result from the preceding discussion in this section and Equation (4) if the response options were denoted 0, 1, 2, . . ., r and, correspondingly, r = 1 were set.
3. Although the Pearson and likelihood ratio chi-square values are notably discrepant, which may well be the result of the not sufficiently large sample of the behavioral study data used in this section and of several empty cells among the associated response patterns, for the purposes of illustrating the discussed procedure we proceed here by interpreting the identity of their associated p values (at their highest possible level) as consistent with the fitted GRM being plausible for the analyzed data set.
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
- Cai L., Thissen D., du Toit S. H. C. (2017). IRTPRO 4.1 for Windows [Computer software]. Skokie, IL: Scientific Software International.
- Casella G., Berger J. (2002). Statistical inference. Monterey, CA: Wadsworth.
- de Ayala R. J. (2009). The theory and practice of item response theory. New York, NY: Guilford Press.
- du Toit M. (Ed.). (2003). IRT from SSI. Skokie, IL: Scientific Software International.
- Lord F. M., Novick M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
- Muthén L. K., Muthén B. O. (2017). Mplus user’s guide. Los Angeles, CA: Muthén & Muthén.
- Raykov T., Dimitrov D. M., Marcoulides G. A., Harrison M. (2017). On true score evaluation using item response theory modeling. Educational and Psychological Measurement, 79, 796-807. doi:10.1177/0013164417741711
- Raykov T., Marcoulides G. A. (2004). Using the delta method for approximate interval estimation of parametric functions in covariance structure models. Structural Equation Modeling, 11, 659-675.
- Raykov T., Marcoulides G. A. (2011). Introduction to psychometric theory. New York, NY: Taylor & Francis.
- Raykov T., Marcoulides G. A. (2017). A course in item response theory and modeling with Stata. College Station, TX: Stata Press.
- Raykov T., Marcoulides G. A., Chang C. (2016). Studying population heterogeneity in finite mixture settings using latent variable modeling. Structural Equation Modeling, 23, 726-730.
- Reckase M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
- Samejima F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometrika Monograph No. 17). Richmond, VA: Psychometric Press.
- Samejima F. (2016). Graded response models. In van der Linden W. J. (Ed.), Handbook of item response theory (Vol. 1, pp. 95-108). Boca Raton, FL: CRC Press.
- Stata Item Response Theory Manual: Release 14. (2015). College Station, TX: Stata Press.
- Takane Y., de Leeuw J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393-408.
- von Davier M. (2016). Rasch model. In van der Linden W. J. (Ed.), Handbook of item response theory (Vol. 1, pp. 31-50). Boca Raton, FL: CRC Press.
- Zimmerman D. W. (1975). Probability spaces, Hilbert spaces, and the axioms of test theory. Psychometrika, 40, 395-412.