On True Score Evaluation Using Item Response Theory Modeling

Tenko Raykov; Dimiter M Dimitrov; George A Marcoulides; Michael Harrison

2017 Nov 16;79(4):796–807. doi: 10.1177/0013164417741711


PMCID: PMC7328243 PMID: 32655184

Abstract

Building on prior research on the relationships between key concepts in item response theory and classical test theory, this note contributes to highlighting their important and useful links. A readily and widely applicable latent variable modeling procedure is discussed that can be used for point and interval estimation of the individual person true score on any item in a unidimensional multicomponent measuring instrument or item set under consideration. The method adds to the body of research on the connections between classical test theory and item response theory. The outlined estimation approach is illustrated on empirical data.

Keywords: classical test theory, delta method, interval estimation, item response theory, true score, standard error

Item response theory (IRT) modeling has become very popular across the educational, behavioral, and social science disciplines over the past half century or so (e.g., van der Linden, 2016). There are multiple benefits to applying IRT in empirical research, especially when one is concerned with scoring individual persons based on administered multi-item measuring instruments or item sets under consideration. In recent years, interest in the connections between IRT and classical test theory (CTT) (when properly used; e.g., Zimmerman, 1975) has also been steadily growing (e.g., Bechger, Maris, Verstralen, & Béguin, 2003; Dimitrov, 2003; Hambleton, Swaminathan, & Rogers, 1991; Kamata & Bauer, 2008; Kohli, Koran, & Henn, 2015; Kolen, Hanson, & Brennan, 1992; Raykov & Marcoulides, 2016, 2017; see also Lord, 1980; McDonald, 1999). This and related research can be seen as having been significantly influenced by important earlier work indicating the links between factor analysis and IRT (Takane & de Leeuw, 1987). The connections between CTT and IRT are highly useful for a deeper understanding of both of these important measurement methodologies as well as their applications, and the research mentioned has made marked contributions to explicating their interrelationships.

The present note builds upon prior work on the connections between key concepts in IRT and CTT, such as the item characteristic curve and true score. The aim of the following discussion is to highlight the close links that exist between these two fundamental notions and illustrate them by utilizing these relationships for point and interval estimation of item-specific, individual true scores on items under consideration using IRT modeling and appropriate follow-up procedures. While we are dealing primarily with the widely applicable two-parameter logistic (2PL) model with homogeneous binary or binary scored items in the absence of nesting or clustering effects, as indicated later the approach discussed in the remainder is also (a) readily employed after minor modifications with any unidimensional IRT model for a set of discrete items (ordinal items with at least three possible response options) and (b) applicable in more general settings with nesting or clustering of individual persons in higher order units (such as schools, neighborhoods, physicians, hospitals, firms, cities, etc.), by using appropriate adjustments (see the Conclusion section).

Background, Notation, and Assumptions

We assume in this and the next section that a set of binary or binary scored homogeneous items is given, designated by Y₁, . . ., Y_k, which are the components of a multi-item measuring instrument or item set under consideration, such as a psychometric scale or test (referred to as “scale” or “instrument” below; k > 1).¹ We presume that the instrument has been administered to a sample of independent subjects from a studied population, which is not a mixture of two or more latent classes (cf. Raykov, Marcoulides, & Chang, 2016). On each of the items, suppose that one of the two possible response options is denoted or coded with 1 (referred to as the “correct” response in what follows) and the other answer with 0. Finally, we posit that the popular 2PL model is valid in the population for the k items under consideration (e.g., Cai & Thissen, 2015; see also Note 2 and the Conclusion section for extensions).

In IRT, a key concept is that of item characteristic curve (ICC; e.g., van der Linden, 2016). In the setting of relevance here, for a given item the ICC is defined as the probability of “correct” response, denoted P(θ), as a function of the unidimensional latent trait or ability of interest, designated θ, which as mentioned is presumed throughout to be underlying the instrument or set of items in question (e.g., Hambleton et al., 1991). Similarly, a fundamental concept in CTT is the subject true score on each item considered, denoted T_ij for the ith person and jth item (i = 1, . . ., n, with n denoting sample size; j = 1, . . ., k; e.g., Raykov & Marcoulides, 2011). As shown in prior research (see, e.g., Raykov & Marcoulides, 2017, and references therein), the individual true score on the jth item equals the pertinent ICC, denoted next by ICC_j:

T_{ij} = T_{ij}(θ_{i}) = P_{j}(θ_{i}) = \mathrm{ICC}_{j} = \mathrm{ICC}_{j}(θ_{i}), \tag{1}

where the first equality emphasizes the fact that the true score is a function of the underlying individual level or value of the studied latent trait or ability, θ_i, and the second half of this chain of equations highlights that this true score equals the jth item’s characteristic curve evaluated at that value (i = 1, . . ., n, j = 1, . . ., k).

In the 2PL model, as is well known (e.g., van der Linden, 2016), the probability of correct response on the jth item, as a function of the underlying trait or ability θ, is

P_{j}(θ) = \frac{1}{1 + \exp[-a_{j}(θ - b_{j})]}, \tag{2}

where a_j and b_j are the item discrimination and difficulty parameters, respectively, and exp(·) denotes the exponential function (e.g., Apostol, 2006). Substituting Equation (2) into Equation (1), the true score of the ith individual on the jth item is expressed as

T_{ij} = \frac{1}{1 + \exp[-a_{j}(θ_{i} - b_{j})]}, \quad (i = 1, \dots, n, \; j = 1, \dots, k). \tag{3}

Equation (3) will play an instrumental role in the next two sections of this article.

Point and Interval Estimation of the Individual True Scores on the Items in a Multi-item Measuring Instrument

In an empirical educational or behavioral setting, suppose that one fits the 2PL model to the data on a used instrument or item set and finds it plausible. From the preceding section it follows then that the estimates of the individual true scores on any of the items are obtainable from Equation (3) as

\hat{T}_{ij} = \frac{1}{1 + \exp[-\hat{a}_{j}(\hat{θ}_{i} - \hat{b}_{j})]} = \frac{1}{1 + \exp[-(\hat{c}_{j} + \hat{a}_{j}\hat{θ}_{i})]}, \tag{4}

where ${\hat{c}}_{j}$ = $- {\hat{a}}_{j} {\hat{b}}_{j}$ is the “intercept” in the alternative “intercept-and-slope” parameterization (or reparameterization; e.g., Cai & Thissen, 2015), and a hat denotes an estimate of the parameter (quantity) underneath (i = 1, . . ., n, j = 1, . . ., k). That is, once the ith person’s level or value on the studied ability or trait θ has been estimated (using a plausible 2PL model), the estimate of his or her true score on the jth item is the logistic function on the right-hand side of Equation (4) (see also its middle part; i = 1, . . ., n, j = 1, . . ., k).
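The two forms of Equation (4) can be checked numerically. The following Python fragment is only a minimal sketch: the function names and the parameter values are hypothetical and chosen for illustration, and they are not estimates from any data set discussed in this note.

```python
import math


def true_score_2pl(theta, a, b):
    """Equation (3): 2PL probability of a correct response, i.e., the item true score."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def true_score_slope_intercept(theta, a, c):
    """Right-hand side of Equation (4), using the intercept c = -a * b."""
    return 1.0 / (1.0 + math.exp(-(c + a * theta)))


# Hypothetical item parameter and ability estimates (illustration only).
a_hat, b_hat, theta_hat = 1.2, -0.5, 0.3
c_hat = -a_hat * b_hat  # intercept implied by discrimination and difficulty

t_difficulty = true_score_2pl(theta_hat, a_hat, b_hat)
t_intercept = true_score_slope_intercept(theta_hat, a_hat, c_hat)
print(t_difficulty, t_intercept)  # the two parameterizations give the same value
```

When the ability estimate equals the item difficulty, the function returns .5, which is the usual interpretation of the difficulty parameter in the 2PL model.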

Equation (4) allows us to consider the estimate ${\hat{T}}_{ij}$ of the individual true score on the jth item as a nonlinear function of the person’s trait or ability level (value) estimate ${\hat{θ}}_{i}$ and the item discrimination and difficulty parameter estimates ${\hat{a}}_{j}$ and ${\hat{b}}_{j}$ , which is succinctly restated as follows:

\hat{T}_{ij} = \hat{T}_{ij}(\hat{a}_{j}, \hat{b}_{j}, \hat{θ}_{i}), \quad (i = 1, \dots, n, \; j = 1, \dots, k). \tag{5}

Treating next the estimated item discrimination and difficulty parameters as known, for instance from prior calibration or application of marginal maximum likelihood (e.g., Reckase, 2009), as is routinely done in corresponding IRT applications (e.g., Hambleton et al., 1991), an approximate standard error (SE) for the individual true score estimate per item in Equation (4) is obtained with the widely applicable delta method (e.g., Raykov & Marcoulides, 2004):

SE(\hat{T}_{ij}) = SE(\hat{θ}_{i}) \, \frac{\hat{a}_{j} \exp[\hat{a}_{j}(\hat{θ}_{i} - \hat{b}_{j})]}{\{1 + \exp[\hat{a}_{j}(\hat{θ}_{i} - \hat{b}_{j})]\}^{2}}, \quad (i = 1, \dots, n, \; j = 1, \dots, k). \tag{6}
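The factor multiplying the ability standard error in Equation (6) is the derivative of the logistic function in Equation (3) with respect to θ, which can equivalently be written as the discrimination times T(1 − T). The Python sketch below evaluates both forms and is meant only as an illustration: the function names and numeric inputs are hypothetical, and a positive discrimination parameter is assumed so that the resulting standard error is positive.

```python
import math


def true_score_se(theta, se_theta, a, b):
    """Equation (6): delta-method SE of the item true score (item parameters treated as known)."""
    z = a * (theta - b)
    return se_theta * a * math.exp(z) / (1.0 + math.exp(z)) ** 2


def true_score_se_compact(theta, se_theta, a, b):
    """The same quantity written as SE(theta) * a * T * (1 - T)."""
    t = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return se_theta * a * t * (1.0 - t)


# Hypothetical inputs (illustration only, not estimates from the LSAT data).
se_long = true_score_se(theta=0.3, se_theta=0.8, a=1.2, b=-0.5)
se_short = true_score_se_compact(theta=0.3, se_theta=0.8, a=1.2, b=-0.5)
print(se_long, se_short)  # both forms agree up to rounding error
```

The compact form makes it apparent that, for a fixed ability standard error, the true score standard error is largest where the true score is near .5 and shrinks as the true score approaches 0 or 1.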

With the standard error in Equation (6) and the true score estimate in Equation (4), the monotone transformation-based procedure using the logistic function in Raykov and Marcoulides (2011, Chap. 7) can finally be employed to obtain for each person and item a confidence interval (CI) of his or her true score, at any prespecified confidence level:

(T_{ij,lo}(α), \; T_{ij,up}(α)), \tag{7}

where T_ij,lo(α) and T_ij,up(α) correspondingly denote the lower and upper limit of the 100(1−α)% CI for the ith examined person on the jth item (0 < α < 1; i = 1, . . ., n, j = 1, . . ., k; see below and Appendix C).
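As a sketch of how the interval in (7) can be constructed, the steps below restate the logit-based computation carried out by the R function in Appendix C. The symbols $\hat{\kappa}_{ij}$ (the logit of the true score estimate) and $z_{\alpha/2}$ (the standard normal quantile, for example 1.96 for a 95% CI) are notation introduced here for illustration only and do not appear elsewhere in this note.

```latex
\begin{align*}
\hat{\kappa}_{ij} &= \ln\frac{\hat{T}_{ij}}{1 - \hat{T}_{ij}}, \\
SE(\hat{\kappa}_{ij}) &\approx \frac{SE(\hat{T}_{ij})}{\hat{T}_{ij}\,(1 - \hat{T}_{ij})}, \\
\kappa_{ij,lo} &= \hat{\kappa}_{ij} - z_{\alpha/2}\, SE(\hat{\kappa}_{ij}), \qquad
\kappa_{ij,up} = \hat{\kappa}_{ij} + z_{\alpha/2}\, SE(\hat{\kappa}_{ij}), \\
T_{ij,lo}(\alpha) &= \frac{1}{1 + \exp(-\kappa_{ij,lo})}, \qquad
T_{ij,up}(\alpha) = \frac{1}{1 + \exp(-\kappa_{ij,up})}.
\end{align*}
```

Because the logistic back-transformation is monotone and maps into the interval (0, 1), the resulting limits cannot fall outside the admissible range for a true score on a binary item.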

The item-specific, individual true score estimate in Equation (4), as well as its associated standard error in Equation (6) and CI in (7), are readily obtained using widely circulated statistical software, such as Stata, Mplus, and R (e.g., Raykov & Marcoulides, 2017; see also Raykov & Marcoulides, 2011, Chap. 7, for the R-function “ci.rel” furnishing the CI, which is also provided in Appendix C for completeness of this note). The needed software source code is found in Appendixes A, B, and C.

The discussed individual true score point and interval estimation procedure is illustrated next on empirical data.

Illustration on Data

For the purposes of this section, we make use of a widely circulated data set from the Law School Admission Council referred to as the LSAT data (Rizopoulos, 2007; this data set is distributed with the R package “ltm”). The set consists of the binary scored results of n = 1000 examinees on k = 5 items, which for the sake of illustration, the aims of the present discussion, and without loss of generality might be thought of as being possibly indicative of the trait General Mental Ability (GMA).²

We commence by examining the (anticipated) unidimensionality of this item set. To this end, we use confirmatory factor analysis accounting for the categorical nature of the items, applying the popular latent variable modeling (LVM) software Mplus (Muthén & Muthén, 2017; see Appendix A for needed source code and notes to it). The pair of overall goodness of fit indices of the single-factor model fitted thereby are numerically close to each other, nonsignificant, and suggest that it is a tenable means of data description and explanation: Pearson χ² = 18.149, degrees of freedom (df) = 21, associated p = .640; and likelihood ratio χ² = 21.230, df = 21, p = .445. In addition, the individual item R² indices range between 11% and 20%, and thus none of them indicates potentially serious “local” violations of model fit. We can therefore conclude that this one-factor model is plausible, and hence that the unidimensionality hypothesis for the 5-item set under consideration is plausible as well. These findings suggest that the analyzed data set does not contain sufficient evidence that would warrant rejection of the item homogeneity assumption, which as stated earlier underlies the point and interval estimation procedure of this article.

In the next step, we fit the 2PL model that is equivalent to the single-factor model (e.g., Takane & de Leeuw, 1987), and obtain thereby the point estimates of the individual GMA trait/ability levels or values ${\hat{θ}}_{i}$ (i = 1, . . ., n; see Appendix B for the needed Stata source code, notes to it, and Note 2).³ This activity is accomplished by the first two commands in the Stata source code in Appendix B (after its initial comment line). With these latent ability estimates, using Equation (4) we arrive at the point estimates of the 1,000 individual true scores on each of the five items in question. This is achieved with the next five commands in the Stata code in Appendix B. The resulting true score estimates for, say, the first 10 subjects are presented in Table 1.⁴

Table 1.

Individual True Score Estimates for Each Item (for First 10 Subjects, in Stata Format).

	id	item4	item5	Theta	ThetaSE	T1	T2	T3	T4	T5
1.	1	0	0	-1.89679	.8012803	.7697889	.4059575	.1914838	.4947615	.6915334
2.	2	0	0	-1.89679	.8012803	.7697889	.4059575	.1914838	.4947615	.6915334
3.	3	0	0	-1.89679	.8012803	.7697889	.4059575	.1914838	.4947615	.6915334
4.	4	0	1	-1.474806	.8022794	.8257122	.4810807	.2564461	.5669779	.747344
5.	5	0	1	-1.474806	.8022794	.8257122	.4810807	.2564461	.5669779	.747344
6.	6	0	1	-1.474806	.8022794	.8257122	.4810807	.2564461	.5669779	.747344
7.	7	0	1	-1.474806	.8022794	.8257122	.4810807	.2564461	.5669779	.747344
8.	8	0	1	-1.474806	.8022794	.8257122	.4810807	.2564461	.5669779	.747344
9.	9	0	1	-1.474806	.8022794	.8257122	.4810807	.2564461	.5669779	.747344
10.	10	1	0	-1.454535	.8024143	.8281078	.4847391	.2599042	.5704006	.74985


Note. Tj = individual true score on jth item (j = 1, . . ., 5); id = subject identifier.

With these individual true score estimates, using Equation (6) we obtain then the standard errors associated with each of the 1,000 individual true scores (estimates) on any of the five items. This is accomplished with the following five commands in the Stata code in Appendix B. These standard errors for the first 10 examinees are presented in Table 2 (see also Note 4).

Table 2.

Standard Errors for Individual True Score Estimates in Table 1 (for First 10 Subjects, in Stata Format).

	id	T1	T2	T3	T4	T5	T1_SE	T2_SE	T3_SE	T4_SE	T5_SE
1.	1	.7697889	.4059575	.1914838	.4947615	.6915334	.1172436	.1396598	.1104977	.1378818	.1122798
2.	2	.7697889	.4059575	.1914838	.4947615	.6915334	.1172436	.1396598	.1104977	.1378818	.1122798
3.	3	.7697889	.4059575	.1914838	.4947615	.6915334	.1172436	.1396598	.1104977	.1378818	.1122798
4.	4	.8257122	.4810807	.2564461	.5669779	.747344	.0953296	.1447546	.1362643	.1355914	.0995111
5.	5	.8257122	.4810807	.2564461	.5669779	.747344	.0953296	.1447546	.1362643	.1355914	.0995111
6.	6	.8257122	.4810807	.2564461	.5669779	.747344	.0953296	.1447546	.1362643	.1355914	.0995111
7.	7	.8257122	.4810807	.2564461	.5669779	.747344	.0953296	.1447546	.1362643	.1355914	.0995111
8.	8	.8257122	.4810807	.2564461	.5669779	.747344	.0953296	.1447546	.1362643	.1355914	.0995111
9.	9	.8257122	.4810807	.2564461	.5669779	.747344	.0953296	.1447546	.1362643	.1355914	.0995111
10.	10	.8281078	.4847391	.2599042	.5704006	.74985	.094308	.1448514	.1374826	.1353545	.0988711


Note. Tj = individual true score estimate on jth item, Tj_SE = standard error for individual true score estimate on jth item (j = 1, . . ., 5); id = subject identifier.

In the third and last step of the true score evaluation procedure discussed in this note, using the R-function “ci.true_score” in Appendix C with the true score estimates and their standard errors obtained as above, we finally furnish the 95% CIs for each of the 1,000 individual true scores on any of the five items of concern (see also notes to Appendix C). These CIs for, say, the first 10 persons and the first item are presented in Table 3 (see also Note 4).

Table 3.

Confidence Intervals at 95% Confidence Level for the Individual True Score Estimates (for First 10 Subjects and First Item, in Stata Format).

	id	T1	T1_SE	T1_low	T1_up
1.	1	.7697889	.1172436	.4776141	.9244108
2.	2	.7697889	.1172436	.4776141	.9244108
3.	3	.7697889	.1172436	.4776141	.9244108
4.	4	.8257122	.0953296	.5639477	.9455188
5.	5	.8257122	.0953296	.5639477	.9455188
6.	6	.8257122	.0953296	.5639477	.9455188
7.	7	.8257122	.0953296	.5639477	.9455188
8.	8	.8257122	.0953296	.5639477	.9455188
9.	9	.8257122	.0953296	.5639477	.9455188
10.	10	.8281078	.094308	.5680052	.9463857


Note. As indicated in the main text, confidence intervals for the individual true scores on the remaining four items are obtained by complete analogy to the developments in the current section. T1 = individual true scores on first item, T1_SE = associated standard errors, T1_low = lower endpoint of 95%-CI for the individual true score on the first item, T1_up = upper endpoint of 95% CI for the individual true score on the first item.

We mention in conclusion of this section that the individual true score CIs with regard to the remaining four items are obtained by complete analogy, using the R function in Appendix C on the pertinent, earlier obtained individual true score estimates and standard errors.
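The interval computation in this third step can also be cross-checked outside R. The Python fragment below is a hedged transcription of the logic of the R function in Appendix C, not the authors' own code; the function name `ci_true_score` and its `z` argument are introduced here for illustration. The inputs are first-item true score estimates and standard errors as reported in Tables 1 and 2.

```python
import math


def ci_true_score(t_hat, se, z=1.96):
    """Logit-based CI for an individual true score; z = 1.96 corresponds to the 95% level."""
    logit = math.log(t_hat / (1.0 - t_hat))   # logit transformation of the estimate
    se_logit = se / (t_hat * (1.0 - t_hat))   # delta-method SE on the logit scale
    lo = logit - z * se_logit
    up = logit + z * se_logit
    # Back-transform both limits to the true score (probability) metric.
    return 1.0 / (1.0 + math.exp(-lo)), 1.0 / (1.0 + math.exp(-up))


# First-item estimates and standard errors for subjects 1 and 5 (Tables 1 and 2).
ci_subject1 = ci_true_score(0.7697889, 0.1172436)
ci_subject5 = ci_true_score(0.8257122, 0.0953296)
print(ci_subject1)  # approximately (0.4776, 0.9244), cf. Table 3
print(ci_subject5)  # approximately (0.5639, 0.9455), cf. Table 3
```

Subjects with identical true score estimates and standard errors necessarily receive identical intervals, since the limits are a deterministic function of these two quantities.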

Conclusion

This note was concerned with an IRT modeling–based procedure for point and interval estimation of individual true scores on each item in a unidimensional measuring instrument or item set under consideration. The aim was to highlight and illustrate important and useful links between IRT and CTT, capitalizing on the identity of the key concepts of item characteristic curve and true score for binary or binary scored items. While we were concerned with the latter type of items in case of no nesting or clustering effects, the method is applicable with relatively minor modifications to any unidimensional IRT model for which the expectation (true score) of an item of interest can be formally obtained, such as (a) the three- and four-parameter logistic models and (b) models for categorical items with more than two possible response options (e.g., Raykov & Marcoulides, 2017). In addition, the approach is applicable also in settings with nesting or clustering effects via use of an appropriate method for obtaining adjusted standard errors and overall fit statistics correcting for these effects (e.g., Muthén & Muthén, 2017; StataCorp, 2015).⁵

Several limitations of the discussed estimation procedure are worth noting here. One is the requirement of large samples with respect to both examinees and items, since the method is instrumentally based on maximum likelihood estimation, which itself is grounded in asymptotic theory (e.g., Raykov & Marcoulides, 2017, and references therein). While future research is encouraged that will hopefully contribute to developing guidelines with regard to sample size from subject populations, unidimensional sets with about 20 items or more may provide practically satisfactory situations with respect to the requirement of a large number of items (e.g., Hambleton et al., 1991).⁶ Two, the method discussed in this note is based on the assumption of item homogeneity, and its application with an item set that is multidimensional is not recommended (cf. Reckase, 2009). Three, we assumed throughout that the set of items under consideration is given, that is, fixed or prespecified beforehand, rather than sampled from a pool, population, or universe of items. Last but not least, the interval estimation procedure of this article makes use of the delta method twice, and this method is ultimately based on a linear approximation (first-order Taylor expansion; e.g., Apostol, 2006). Therefore, it is not known to what extent its approximate validity generalizes to complex parametric functions representing expected observed scores (true scores) in more general models than the ones focused on in this article (see also Note 3); such functions are of special relevance for the discussed method and its applications.

In conclusion, the present note offers to educational and behavioral scientists a readily applicable procedure for estimation of individual true scores on the items in a homogeneous measuring instrument or item set under consideration, and contributes to the body of research on the relationships between IRT and CTT by highlighting and illustrating their links and connections allowing deeper understanding of both methodologies and their more informed use and applications in empirical research.

Acknowledgments

Thanks are due to M. Wilson for a valuable discussion on item response modeling. We are grateful to R. Raciborski and K. MacDonald for helpful comments on IRT software applications.

Appendix A

Mplus Source Code for Testing Unidimensionality for a Categorical Item Set

TITLE: TESTING UNIDIMENSIONALITY OF A SET OF CATEGORICAL ITEMS.

DATA: FILE = LSAT.DAT; ! LSAT DATA ON n = 1000 EXAMINEES AND k = 5 BINARY ITEMS.

VARIABLE: NAMES = ID ITEM1-ITEM5;
    USEVARIABLES = ITEM1-ITEM5;
    CATEGORICAL = ITEM1-ITEM5;

ANALYSIS: ESTIMATOR = ML;

MODEL: THETA BY ITEM1-ITEM5;

OUTPUT: STANDARDIZED;

Note. This command file is to be used as is for the case of k = 5 categorical (ordinal) items only, and is tailored to the LSAT data set in the illustration section. (An annotating comment is preceded by an exclamation mark.) With a different number of categorical items, modify appropriately the applicable commands (that utilize the information that k = 5). For details on the Mplus syntax and command language, see, for example, Raykov and Marcoulides (2006).

Appendix B

Stata Commands for Point Estimation of the Item-Specific Individual True Scores

* First read in the analyzed data set with Stata. (In the illustration section, it is the LSAT data set.)*

. irt 2pl item1-item5

. predict Theta, latent se(ThetaSE)

. gen T1=1/(1+exp(-_b[item1:Theta]*(Theta-(-_b[item1:_cons]/_b[item1:Theta]))))

. gen T2=1/(1+exp(-_b[item2:Theta]*(Theta-(-_b[item2:_cons]/_b[item2:Theta]))))

. gen T3=1/(1+exp(-_b[item3:Theta]*(Theta-(-_b[item3:_cons]/_b[item3:Theta]))))

. gen T4=1/(1+exp(-_b[item4:Theta]*(Theta-(-_b[item4:_cons]/_b[item4:Theta]))))

. gen T5=1/(1+exp(-_b[item5:Theta]*(Theta-(-_b[item5:_cons]/_b[item5:Theta]))))

. gen T1_SE = ThetaSE*T1*_b[item1:Theta]/(1+exp(_b[item1:Theta]*(Theta-(-_b[item1:_cons]/_b[item1:Theta]))))

. gen T2_SE = ThetaSE*T2*_b[item2:Theta]/(1+exp(_b[item2:Theta]*(Theta-(-_b[item2:_cons]/_b[item2:Theta]))))

. gen T3_SE = ThetaSE*T3*_b[item3:Theta]/(1+exp(_b[item3:Theta]*(Theta-(-_b[item3:_cons]/_b[item3:Theta]))))

. gen T4_SE = ThetaSE*T4*_b[item4:Theta]/(1+exp(_b[item4:Theta]*(Theta-(-_b[item4:_cons]/_b[item4:Theta]))))

. gen T5_SE = ThetaSE*T5*_b[item5:Theta]/(1+exp(_b[item5:Theta]*(Theta-(-_b[item5:_cons]/_b[item5:Theta]))))

Note 1. Consider the above as a “do-file” in Stata’s nomenclature, with a starting comment preceded and finalized by an asterisk (cf. StataCorp, 2015; the initial dot starting each line subsequently signifies Stata’s prompt in the so-called “Command Window,” once reading in the data set in question after commencing a Stata session). The references “_b[item#:Theta]” and “_b[item#:_cons]” are the “legends” or labels by which Stata internally refers to the item discrimination and difficulty-related parameters, respectively, in the earlier mentioned “intercept-and-slope” parameterization (while keeping highest possible precision of estimation). The command “irt 2pl item1-item5” fits the 2PL model to the five items under consideration, and “gen” is the compute command (short for “generate”; see also Raykov & Marcoulides, 2017, with regard to Stata’s IRT modeling capabilities).

Note 2. Apply this set of commands only to a data set where the unidimensionality hypothesis is plausible (e.g., as examined using the Mplus command file in Appendix A; see also illustration section). With a different number of binary scored items, modify/extend appropriately the applicable commands (that use the information that k = 5).

Note 3. Tj = true score on jth item; Tj_SE = standard error associated with true score on jth item (j = 1, . . ., 5); Theta = estimate of the individual latent trait/ability levels or values ( ${\hat{θ}}_{i}$ ; i = 1, . . ., n); ThetaSE = standard error associated with this estimate (i.e., the standard error of ${\hat{θ}}_{i}$ ; i = 1, . . ., n).

Appendix C

R-Function for Interval Estimation of Item-Specific Individual True Scores

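ci.true_score = function(T_ij, se) { # R-function for true score CI.

  l = log(T_ij / (1 - T_ij)) # Initial logit transformation of T_ij.

  sel = se / (T_ij * (1 - T_ij)) # Delta-method standard error of the logit.

  ci_l_lo = l - 1.96 * sel # Lower CI limit on the logit scale.

  ci_l_up = l + 1.96 * sel # Upper CI limit on the logit scale.

  ci_lo = 1 / (1 + exp(-ci_l_lo)) # Back-transformation to the true score metric.

  ci_up = 1 / (1 + exp(-ci_l_up))

  ci = c(ci_lo, ci_up)

  ci

} # (ci_lo, ci_up) is the 95%-CI for the individual true score T_ij.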

Note 1. This R-function is identical to the R-function “ci.rel” in Raykov and Marcoulides (2011, Chap. 7), and is presented here merely for the sake of completeness of the present article (after a corresponding change in its name/call name). (This R-function needs to be appropriately modified if a confidence interval at another level is to be obtained; for instance, for a 90% confidence interval, use 1.645 in lieu of 1.96 in the two middle commands of the function.)

Note 2. To use this R-function, at the R prompt type/submit “ci.true_score (T_ij, se),” where for “T_ij” the estimate of the individual true score T_ij on the jth item is entered, which is obtained with the Stata commands in Appendix B, and for “se” its associated standard error (see Equation 6 and pertinent section of Stata output resulting from using the corresponding commands in Appendix B; see also illustration section).

1. In case k = 2, only the 1PL model (Rasch model) is identified, as a special case of the 2PL model underlying this article, rather than the 2PL model itself; the 1PL model results from the latter when all item discrimination parameters are restricted to be equal among themselves (e.g., von Davier, 2016).

2. None of the discussion in this section is meant to imply that the five items used indeed measure general mental ability, and the latter reference is only used for the sake of illustration and naming an assumed construct that presumably underlies subjects’ performance on the items. The actual nature of this construct is irrelevant for the developments, discussion, and estimation procedure illustration that is the only aim of the present section.

3. In an empirical setting, it may well be advisable to find the most restricted tenable model for an analyzed item set (e.g., the 1PL model that is statistically and in terms of data fit equivalent to the Rasch model), before proceeding with the next step of the point and interval estimation procedure outlined in the article. We use the 2PL model in the rest of this section in order to demonstrate an application of the more general procedure rather than a special case of it (which would be the case if using the 1PL model that is coincidentally also plausible for the particular data set analyzed; e.g., Rizopoulos, 2007).

4. The identity of the true score estimates for some of the successively listed persons results from the identity of the set of their observed scores on the 5 items under consideration in this section, as can also be seen for instance from the second through sixth column in the interior of Table 1; the raw data set in addition comes as ranked in ascending order in terms of number correct scores (Rizopoulos, 2007).

5. In contrast with the nearly century-old true score estimation procedure by Kelley (1927), the one in this article (a) is based on a latent variable modeling approach rather than developed within an exclusively observed variable framework and (b) also offers an interval estimation method for the item-specific, individual true scores.

6. The current sentence is not to be interpreted as containing or implying a rule of thumb regarding the number of items, with regard to when one could use in a completely trustworthy way the discussed estimation procedure in empirical research. Rather, it is merely intended as a very rough guide for when this may be expected to be approximately the case in some empirical research settings.

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Apostol T. (2006). Calculus. New York, NY: Wiley.
Bechger T. M., Maris G., Verstralen H. F. M., Béguin A. A. (2003). Using classical test theory in combination with item response theory. Applied Psychological Measurement, 27, 319-334.
Cai L., Thissen D. (2015). Modern approaches to parameter estimation in item response theory. In Reise S. P., Revicki D. A. (Eds.), Handbook of item response theory modeling (pp. 41-59). New York, NY: Taylor & Francis.
Dimitrov D. M. (2003). Marginal true-score measures and reliability for binary items as a function of their IRT parameters. Applied Psychological Measurement, 27, 440-458.
Hambleton R. K., Swaminathan H., Rogers H. J. (1991). Fundamentals of item response theory. Thousand Oaks, CA: Sage.
Kamata A., Bauer D. J. (2008). A note on the relationship between factor analytic and item response theory models. Structural Equation Modeling, 15, 136-153.
Kelley T. L. (1927). Interpretation of educational measurements. New York, NY: Macmillan.
Kohli N., Koran J., Henn L. (2015). Relationships among classical test theory and item response theory frameworks via factor analytic models. Educational and Psychological Measurement, 75, 389-405.
Kolen M. J., Hanson B. A., Brennan R. L. (1992). Conditional standard error of measurement for scale scores. Journal of Educational Measurement, 29, 285-307.
Lord F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
McDonald R. P. (1999). Test theory. A unified treatment. Mahwah, NJ: Lawrence Erlbaum.
Muthén L. K., Muthén B. O. (2017). Mplus user’s guide. Los Angeles, CA: Muthén & Muthén.
Raykov T., Marcoulides G. A. (2004). Using the delta method for approximate interval estimation of parametric functions in covariance structure models. Structural Equation Modeling, 11, 659-675.
Raykov T., Marcoulides G. A. (2006). A first course in structural equation modeling. Mahwah, NJ: Lawrence Erlbaum.
Raykov T., Marcoulides G. A. (2011). Introduction to psychometric theory. New York, NY: Taylor & Francis.
Raykov T., Marcoulides G. A. (2016). On the relationship between classical test theory and item response theory: From one to the other and back. Educational and Psychological Measurement, 76, 325-338.
Raykov T., Marcoulides G. A. (2017). A course in item response theory and modeling with Stata. College Station, TX: Stata Press.
Raykov T., Marcoulides G. A., Chang C. (2016). Studying population heterogeneity in finite mixture settings using latent variable modeling. Structural Equation Modeling, 23, 726-730.
Reckase M. D. (2009). Multidimensional item response theory. New York, NY: Springer.
Rizopoulos D. (2007). ltm: An R package for latent variable modeling and item response modeling. Journal of Statistical Software, 17. Retrieved from https://www.jstatsoft.org/issue/view/v017
StataCorp. (2015). Stata item response theory manual. Release 14. College Station, TX: Stata Press.
Takane Y., de Leeuw J. (1987). On the relationship between item response theory and factor analysis of discretized variables. Psychometrika, 52, 393-408.
van der Linden W. J. (2016). Unidimensional logistic response models. In van der Linden W. J. (Ed.), Handbook of item response theory (Vol. 2, pp. 13-30). Boca Raton, FL: CRC Press.
von Davier M. (2016). Rasch model. In van der Linden W. J. (Ed.), Handbook of item response theory (Vol. 2, pp. 31-50). Boca Raton, FL: CRC Press.
Zimmerman D. W. (1975). Probability spaces, Hilbert spaces, and the axioms of test theory. Psychometrika, 40, 395-412.

