Educational and Psychological Measurement. 2024 Jun 24:00131644241259026. Online ahead of print. doi: 10.1177/00131644241259026

A Note on Evaluation of Polytomous Item Locations With the Rating Scale Model and Testing Its Fit

Tenko Raykov, Martin Pusic
PMCID: PMC11572229  PMID: 39563843

Abstract

A procedure is outlined for point and interval estimation of location parameters associated with polytomous items, or raters assessing studied subjects or cases, which follow the rating scale model. The method is developed within the framework of latent variable modeling, and is readily applied in empirical research using popular software. The approach permits testing the goodness of fit of this widely used model, which represents a rather parsimonious item response theory model as a means of description and explanation of an analyzed data set. The procedure allows examination of important aspects of the functioning of measuring instruments with polytomous ordinal items, which may also constitute person assessments furnished by teachers, counselors, judges, raters, or clinicians. The described method is illustrated using an empirical example.

Keywords: discrete ordinal item, interval estimation, location parameter, model fit evaluation, rater, polytomous item response model, rating scale model


Polytomous items have been increasingly used over the past several decades in applications of item response theory and modeling in the educational, behavioral, social, marketing, and biomedical sciences (e.g., Nering & Ostini, 2010). A key reason for this interest by methodologists as well as substantive researchers is the widely appreciated fact that psychometric scales, tests, surveys, questionnaires, inventories, or self-reports are oftentimes employed to evaluate typical rather than maximal performance (e.g., Raykov & Marcoulides, 2018). A main item response model applicable when examining responses obtained with such items is the rating scale model (RSM; e.g., Andrich, 2016). When plausible for an analyzed data set, the model represents an important polytomous item generalization of the popular Rasch model (e.g., von Davier, 2016). In particular, the RSM is a very parsimonious means of data description and explanation, which also allows insightful interpretations as well as enhanced statistical estimation precision (e.g., de Ayala, 2022). The RSM in addition offers the opportunity to define substantively informative location parameters for discrete ordinal items with three or more answer options (Zhang & Petersen, 2018). These parameters are of special empirical relevance due to the fact that when used with teachers, counselors, clinicians, raters, radiologists, assessors, or judges, they reflect the raters’ individual tendencies to be more/less lenient (or stringent) in their evaluation of a given group of studied units of analysis, such as respondents, examinees, students, patients, cases, customers, employees, or clients (cf. Raykov & Pusic, 2023).

These benefits of the RSM become available only when the model is tenable for a studied data set. For this reason, one of the aims of this note is to outline a widely applicable approach for testing the overall goodness of fit of the RSM. When the model is found plausible with it, the procedure discussed subsequently permits one to readily evaluate the item location parameters as well as various functions of them. Specifically, we describe below a point and interval estimation method for a location parameter for each polytomous ordinal item (rater) in a multicomponent measuring instrument or item set that follows the RSM. The procedure allows one additionally to estimate linear or nonlinear, substantively relevant functions of one or more item location parameters, such as for instance their differences. The outlined method permits also evaluation of the degree of uncertainty associated with these estimates by furnishing pertinent parameter and function confidence intervals (CIs).

Background, Notation, and Assumptions

The following discussion assumes that a set of p polytomous ordinal (categorical) items is given, which will be collectively denoted by y = (y_1, y_2, …, y_p)′ (e.g., de Ayala, 2022; underlining is used to denote vectors and priming to denote transposition in this article). As an example, the items could represent the discrete ratings of a studied group of examinees, patients, clients, students, employees, cases, or respondents, which are provided by p teachers, counselors, judges, assessors, raters, clinicians, evaluators, or radiologists (p > 1). For simplicity, we will refer generically to teachers, evaluators, judges, or assessors as “raters,” and will treat them and their alternative reference “items” synonymously throughout the remainder. The items are assumed to possess the following features: (a) they are associated with r possible ordered response options each, which are symbolized or scored using the numbers 0, 1, 2, …, r − 1 (r > 2), and (b) they comply with the RSM (see below). These p observed measures are stipulated as fixed, that is, not sampled from a larger pool or universe of items to which one wishes to draw inferences. In addition, it is posited that the items have been administered to a sample of subjects (students, respondents, examinees, patients, employees, cases, clients, or customers) from a population under investigation that is not characterized by clustering effects or substantial unobserved heterogeneity. Last but not least, all items are presumed to have the same r answer alternatives that are functionally equivalent, that is, possess the same meaning across the items (Andrich, 1978).

Assuming that the lowest category on a given item is chosen as a reference category, we denote by P_k(θ) the probability of responding in its kth category as a function of the underlying unobserved dimension evaluated by the items, which is commonly symbolized by θ (e.g., Reckase, 2009; k = 0, …, r − 1). This probability is referred to as the kth category characteristic curve (CCC) for the item. Within the RSM, the probability P_k(θ) is given as (cf. Andrich, 2016):

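\[
P_k(\theta) = \frac{\exp(ka\theta + k\beta + \delta_k)}{1 + \exp(a\theta + \beta + \delta_1) + \exp(2a\theta + 2\beta + \delta_2) + \cdots + \exp(Ka\theta + K\beta + \delta_K)}. \tag{1}
\]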

In Equation 1, K = r − 1 is set and a is the item discrimination parameter that is assumed constant over items, a key feature of the popular Rasch model whose polytomous item generalization the RSM is (cf. Raykov & Marcoulides, 2018). In addition, β and δ_k are, respectively, the difficulty and related threshold parameters associated with the item and its kth answer option, and exp(·) is the exponential function, that is, e^(·); k = 0, …, K, with δ_0 = 0 (see, e.g., Stata Item Response Theory Manual, 2023). Furthermore, as usual the population mean and variance of the studied latent ability or trait are presumed fixed at 0 and 1, respectively. (For further details on the RSM, see, for example, Andrich, 2016, as well as Stata Item Response Theory Manual, 2023, and references therein.)
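As an illustration of Equation 1, the following Python sketch computes the r category probabilities of a single item at a given value of θ. The function name `rsm_category_probs` and the parameter values in the usage lines are hypothetical and chosen only for illustration; they are not estimates from this article.

```python
import math

def rsm_category_probs(theta, a, beta, deltas):
    """Category probabilities P_0, ..., P_K from Equation 1.

    deltas holds (delta_1, ..., delta_K); delta_0 = 0 is prepended here.
    """
    all_deltas = [0.0] + list(deltas)
    # Exponent for category k is k*a*theta + k*beta + delta_k (zero for k = 0).
    exponents = [k * a * theta + k * beta + d for k, d in enumerate(all_deltas)]
    # Subtract the maximum before exponentiating, for numerical stability;
    # this leaves the ratios in Equation 1 unchanged.
    m = max(exponents)
    weights = [math.exp(e - m) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical parameter values (not taken from the article).
probs = rsm_category_probs(theta=0.5, a=1.2, beta=-0.8, deltas=[0.3, 0.1, -0.4, -1.0])
print([round(p, 3) for p in probs])
```

By construction, the returned probabilities sum to 1 for any θ, which offers a quick sanity check when experimenting with other parameter values.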

A Point and Interval Estimation Procedure of Categorical Item Locations and Functions of Them

When considering ordinal polytomous items adhering to the RSM, their individual location parameter has been defined as the projection on the studied latent continuum, θ, of the intersection point of (a) the CCC for an item’s lowest response category, that is, that denoted by 0, with (b) the CCC for its highest category, that is, that designated by K (Zhang & Petersen, 2018). Hence, the item location parameter is that point θ_l on the underlying dimension, which has the property,

\[
P_0(\theta_l) = P_K(\theta_l). \tag{2}
\]

From Equations 1 and 2, one obtains through substitution the following:

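\[
\frac{\exp(\delta_0)}{1 + \exp(a\theta_l + \beta + \delta_1) + \exp(2a\theta_l + 2\beta + \delta_2) + \cdots + \exp(Ka\theta_l + K\beta + \delta_K)} = \frac{\exp(Ka\theta_l + K\beta + \delta_K)}{1 + \exp(a\theta_l + \beta + \delta_1) + \exp(2a\theta_l + 2\beta + \delta_2) + \cdots + \exp(Ka\theta_l + K\beta + \delta_K)}. \tag{3}
\]

The solution of Equation 3 for the unknown value θ_l is then obtained with some algebra as follows: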

\[
\theta_l = -\frac{K\beta + \delta_K}{Ka}. \tag{4}
\]
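The algebra leading from Equation 3 to Equation 4 can be spelled out as follows. This is only a restatement of the steps implied by Equations 1 through 3 (with δ_0 = 0 and a nonzero discrimination parameter a), not an additional result.

```latex
\begin{align*}
P_0(\theta_l) = P_K(\theta_l)
  &\iff \exp(\delta_0) = \exp(Ka\theta_l + K\beta + \delta_K)
     && \text{(the denominators in Equation 3 are equal)} \\
  &\iff 0 = Ka\theta_l + K\beta + \delta_K
     && \text{(since } \delta_0 = 0 \text{ and } \exp \text{ is one-to-one)} \\
  &\iff \theta_l = -\frac{K\beta + \delta_K}{Ka}
     && \text{(since } a \neq 0\text{)}.
\end{align*}
```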

From Equation 4 and the invariance property of maximum likelihood (ML; e.g., Casella & Berger, 2002), it follows that when using ML for fitting the RSM, the ML estimator of this location parameter is obtained by substituting the ML estimators of a, β, and δ_K into the right-hand side of Equation 4. In an empirical study, this can be readily carried out with the widely available software Stata (Stata Base Reference Manual, 2023; the needed source code is provided in Appendix B). In addition, the overall goodness of fit of the RSM can be examined using the popular latent variable modeling software Mplus (Muthén & Muthén, 2024; the required command file is supplied in Appendix A). Furthermore, it then becomes possible to obtain CIs for (a) the location parameter of each item (rater) and (b) a linear or nonlinear function of these parameters for one or more items. Such functions, for example, the differences between rater (item) locations, may be of particular substantive interest in educational, clinical, biomedical, imaging, marketing, or social research (e.g., Zhang & Petersen, 2018; see also the introduction as well as the discussion and conclusion sections).
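To make the interval estimation step more concrete, the following Python sketch applies the delta method, on which Stata's nlcom command is based, to the function in Equation 4. It is a minimal sketch under stated assumptions: it works with the parameters (a, β, δ_K) of Equation 1 rather than with the intercept-and-slope parameters used internally by Stata, the function name `location_and_se` is hypothetical, and the estimates and covariance matrix below are made-up values, not results from this article.

```python
import math

def location_and_se(a, beta, delta_K, K, cov):
    """Delta-method point estimate and standard error for Equation 4.

    cov is the 3x3 covariance matrix of the estimators of
    (a, beta, delta_K), in that order.
    """
    theta_l = -(K * beta + delta_K) / (K * a)
    # Gradient of theta_l with respect to (a, beta, delta_K).
    grad = [
        (K * beta + delta_K) / (K * a ** 2),
        -1.0 / a,
        -1.0 / (K * a),
    ]
    # Approximate variance: grad' * cov * grad.
    var = sum(grad[i] * cov[i][j] * grad[j] for i in range(3) for j in range(3))
    return theta_l, math.sqrt(var)

# Hypothetical estimates and covariance matrix (not from the article).
cov = [
    [0.0040, 0.0005, 0.0002],
    [0.0005, 0.0060, -0.0010],
    [0.0002, -0.0010, 0.0200],
]
est, se = location_and_se(a=1.2, beta=-0.8, delta_K=-1.0, K=4, cov=cov)
z = 1.959964  # 97.5th percentile of the standard normal distribution
print(round(est, 3), round(se, 3), (round(est - z * se, 3), round(est + z * se, 3)))
```

The same logic extends to functions of several location parameters, such as their difference, provided the covariances between the estimators involved are included.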

We demonstrate next the discussed RSM fit evaluation and location parameter evaluation procedure using empirical data.

Application on Data

For the illustration purposes of this section, we employ a data set from an anxiety study (available from https://vpgcentral.com/software/scientific-software-international/, or on request from the authors; e.g., Cai et al., 2017). In that study, p = 5 Likert-type questions with r = 5 response options each were administered to n = 517 patients. The questions asked about their feelings of being calm, at ease, tense, regretful, or nervous, and inquired if they experienced them never, rarely, occasionally, frequently, or always (with their responses scored increasingly in this order, and item recoding carried out where needed by the original authors; cf. duToit, 2003; Raykov & Pusic, 2023). Owing to these study features, the used data set fulfills the abovementioned RSM requirement for the used items to possess the same number of response options with the same meaning (cf. Andrich, 1978).

Following the earlier described procedure, we commence by fitting the RSM to the analyzed data and testing (evaluating the goodness of) its fit. To this end, in this first step, the RSM is employed as a partial credit model (PCM) in which the successive differences in category difficulty parameters are constrained to be equal across item pairs (cf. Andrich, 2016; Raykov & Marcoulides, 2018). To evaluate its overall goodness of fit to the above data set, the RSM is fitted using the latent variable modeling program Mplus (see the needed command file in Appendix A). This leads to the following plausible model fit indices: Pearson χ2 = 2,173.927 for degrees of freedom (df) = 3,107 and associated p value (p) = 1.0, as well as likelihood ratio χ2 = 1,001.112, df = 3,107, p = 1.0. These results suggest that the RSM is a tenable model for the analyzed data (see Note 1). Based on this finding, we proceed in the next step to point and interval estimation of the five item location parameters (see Equation 4). This is readily accomplished with the Stata source code in Appendix B. As indicated earlier, point and interval estimation of the difference of these parameters for, say, a pair of items also becomes possible. This estimation activity permits one to compare the items (raters) in terms of their location, for example, two particular teachers, raters, assessors, evaluators, or items, and is accomplished by a single command (see the last nlcom command in Appendix B, and below). The resulting estimates of the location parameters of the five anxiety items, as well as of the difference between two of them, are presented in Table 1. Moreover, the CCCs for the Calm item, for instance, are displayed in Figure 1 (see also its caption regarding these curves for the other items).

Table 1.

Item Discrimination and Location Parameter Estimates, Standard Errors, and Confidence Intervals

Item/parameter              Estimate   SE     95% CI
Discrimination              1.248      .066   [1.118, 1.378]
Calm                        .923       .075   [.775, 1.070]
Tense                       .712       .071   [.572, .851]
Regretful                   1.031      .078   [.883, 1.189]
At Ease                     .696       .071   [.557, .835]
Nervous                     .828       .073   [.684, .972]
θ_l(Calm) − θ_l(At Ease)    .227       .066   [.097, .356]

Note. Estimates are rounded to 3 decimal places. SE = standard error; 95% CI = 95% confidence interval; θ_l(Calm) = location parameter for the Calm item; θ_l(At Ease) = location parameter for the At Ease item.

Figure 1.

Category Characteristic Curves for the Calm Item (cf. Second Row of Table 1 for the Item Location Estimate Indicated on the Figure; the CCCs of Each Remaining Item Are Furnished by Analogy, Providing Instead Its Name in the Used Command; “Pr” Denotes Probability and “Theta” the Underlying Latent Dimension)

As seen from the second through sixth rows of Table 1 (see its Estimate column, and below in this section), in the analyzed data set the item At Ease is located to the left of all other items. This item is followed by the item Tense, then Nervous, and then Calm. Furthermore, the item Regretful is located to the right of all other items. Comparing next the item location parameter CIs, and noting that those of the Tense and Regretful items do not overlap (with that of the latter being to the right of the CI of the former), it is suggested that in the studied patient population, the Regretful item is positioned to the right of the Tense item. (The latter comparison is not meant to be a statistical test of this particular parameter relationship; see below.) That is, patients with a given level of underlying anxiety tend to be more likely to respond “never” on the Regretful item than to respond so on the Tense item. In a similar fashion, it is seen that patients tend to be more likely to respond “always” on the Tense item than on the Regretful item. Furthermore, if a researcher were interested in the first place (i.e., before looking at the data) in whether the Calm item, say, was located in the population to the right of the At Ease item, he or she could interval estimate the difference in their location parameters using the earlier discussed procedure (see the last nlcom command in Appendix B). As a result, the final row of Table 1 shows these items’ location parameter difference as estimated at .227, with a standard error of .066 and a 95% CI = [.097, .356]. Because this interval lies entirely above 0, these findings suggest that in the studied clinical population the Calm item is also positioned to the right of the At Ease item.
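The comparisons in the preceding paragraph can be retraced with a few lines of Python using the rounded values from Table 1. This is a sketch for illustration only: the interval for the difference is recomputed assuming a normal-theory (Wald) form, estimate ± 1.96 × SE, and because the inputs are rounded to three decimals the recomputed bounds can differ from Table 1 in the last digit. The helper name `wald_ci` is hypothetical.

```python
# Estimates and standard errors copied from Table 1.
table1 = {
    "Calm": (0.923, 0.075),
    "Tense": (0.712, 0.071),
    "Regretful": (1.031, 0.078),
    "At Ease": (0.696, 0.071),
    "Nervous": (0.828, 0.073),
}

# Items ordered from left to right on the latent continuum.
ordered = sorted(table1, key=lambda item: table1[item][0])
print(ordered)

def wald_ci(estimate, se, z=1.959964):
    """Normal-theory 95% CI: estimate +/- z * SE."""
    return estimate - z * se, estimate + z * se

# CI for the Calm minus At Ease difference, using the SE reported in Table 1
# (that SE already reflects the covariance between the two estimators).
low, high = wald_ci(0.227, 0.066)
print(round(low, 3), round(high, 3), low > 0)
```

Note that the standard error of the difference cannot be recovered from the two item rows alone, since the two location estimators are correlated; this is why the value reported in the last row of Table 1 is used directly.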

Although this example was developed within the context of examining the latent trait of anxiety, as pointed out earlier, the same RSM application can be used in many other empirical settings. In particular, the outlined estimation procedure can be employed when one is interested in evaluating the tendencies of individual teachers, counselors, clinicians, radiologists, judges, assessors, or raters to be more/less lenient (or stringent) in their judgment of each member in a given group of persons or cases examined by them, such as students, examinees, clients, employees, respondents, or patients. The application of the outlined model testing and parameter estimation method is then directly conducted by treating the items as raters. With this in mind, in the example considered above, one could treat each of the five anxiety items as a corresponding rater (teacher, counselor, clinician, or judge say) that is assessing each of the n = 517 subjects who are viewed then as cases (observations). In more general terms, the particular item interpretation in the discussed generic person assessment setting will depend on the subject matter domain of application, and is best deferred to substantive experts using the outlined procedure.

Conclusion

This note provides a readily applicable approach for (a) interval estimation of location parameters for discrete ordinal items or teachers, counselors, raters, judges, assessors, or clinicians, following the widely used RSM, as well as (b) testing (evaluating the goodness of) the overall fit of this model to analyzed data sets. The RSM is of particular relevance in the educational, behavioral, and social sciences as a polytomous extension of the popular Rasch model (e.g., Andrich, 2016). When found plausible for a given data set, therefore, the RSM leads as a Rasch model generalization to insightful substantive interpretations (e.g., von Davier, 2016). They result in part from the fact that the RSM represents a very parsimonious means of data description and explanation (cf. de Ayala, 2022; see also below). The outlined model testing and parameter evaluation procedure in addition permits educational, behavioral, social, marketing, clinical, epidemiological, and radiology researchers to routinely point and interval estimate item (rater) locations, as well as their differences when interested in comparing two or more raters, or more generally categorical items with respect to their location. The method is also readily applicable when one is concerned with studying individual differences in the judgment of raters who are rating a given set of cases or patients (e.g., imaging cases) using discrete ordinal evaluation with three or more options (cf. Zhang & Petersen, 2018). This will be of particular relevance when a researcher is interested in finding out rater differences, or linear/nonlinear functions of them, in their degree of stringency (or leniency) while evaluating a set of subjects, patients, or cases under consideration.

With these important features, the outlined approach contributes significantly to the extant research on polytomous item location evaluation. In particular, the described model testing and interval estimation procedure complements the work by Zhang and Petersen (2018) in two important directions. Along the first, we offer a method for evaluating and testing the overall fit of the RSM (see also below). We stress that plausible fit of the RSM for an analyzed data set is a prerequisite for using this rather parsimonious model in applications of item response theory. The relevance of the availability now of such a test of the data fit of the RSM is realized by observing that while Zhang and Petersen (2018) instrumentally used the RSM, also as a crucial basis of their article, they did not provide evidence of its plausibility for their analyzed data sets. Another limitation of their research is that they did not provide evidence for the unidimensionality of the used items, whereas this item homogeneity presumption was in fact made throughout their work. In addition to (a) having offered in this note such a means for testing the overall fit of the RSM that incorporates the unidimensionality hypothesis, we (b) have also complemented their research by providing a readily employed procedure for interval estimation of the discussed item location parameter as well as linear and nonlinear functions of two or more such parameters. The latter extension is also a significant contribution because such interval estimation was not of concern in Zhang and Petersen (2018). Moreover, the present note also complements the recent discussion in Raykov and Pusic (2023), by extending their developments and estimation approach to the considerably more parsimonious RSM, while in addition offering a means of testing its data fit.

In this context, it is worthwhile emphasizing the following modeling related point. Specifically, as seen from the introductory section, the RSM has a rather limited number of parameters. Hence, when plausible for an analyzed data set, as a Rasch model generalization, it implies substantively more insightful interpretations (e.g., Andrich, 2016; de Ayala, 2022; von Davier, 2016), as well as smaller standard errors and shorter CIs, than alternative polytomous item response models. This is an especially beneficial feature of the RSM, in which it outperforms other, often used models such as the PCM and generalized partial credit model (GPCM; e.g., Raykov & Marcoulides, 2018). While the latter two models are more flexible, that is due to the fact that they possess markedly more parameters. Indeed, while the RSM used in the illustration section has merely nine parameters, the PCM has 21 and the GPCM has 25. As a consequence, both the PCM and GPCM have more than twice as many parameters as the RSM, which is a marked downside from a precision of estimation viewpoint. For this reason, the PCM and GPCM share an important limitation relative to the RSM: due to their lack of parsimony, they would tend to be associated with lower parameter estimation precision. In this connection, we would like to stress that a significant point of the present note is the provision of a readily applicable means of testing the fit of the RSM. With its overall goodness of fit test offered here, one can evaluate whether this markedly parsimonious model may be considered a plausible means of data description and explanation. In the affirmative case, the RSM represents a tenable and parameter-economical means of description and explanation of an analyzed data set, and the resulting location parameter estimates can be used in subsequent research (see Note 2).
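The parameter counts quoted above for p = 5 items with r = 5 response options can be retraced with the following Python sketch. The counting scheme is our assumption, chosen to mirror the Mplus specification in Appendix A: the PCM has K = r − 1 thresholds per item plus one latent variance (equivalently, one common discrimination), the GPCM has one discrimination per item instead, and the RSM is the PCM with (p − 1)(K − 1) equality constraints on successive threshold differences. The function name `parameter_counts` is hypothetical.

```python
def parameter_counts(p, r):
    """Free-parameter counts for the RSM, PCM, and GPCM.

    Counting assumption: p*K thresholds in the PCM and GPCM, plus either
    one common discrimination/variance (PCM) or p discriminations (GPCM);
    the RSM is the PCM minus (p - 1)*(K - 1) equality constraints.
    """
    K = r - 1
    pcm = p * K + 1
    gpcm = p * K + p
    rsm = pcm - (p - 1) * (K - 1)
    return {"RSM": rsm, "PCM": pcm, "GPCM": gpcm}

print(parameter_counts(p=5, r=5))  # {'RSM': 9, 'PCM': 21, 'GPCM': 25}
```

Under this scheme the RSM count simplifies to p + K, so the gap to the PCM and GPCM widens quickly as the number of items or response options grows.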

The estimation and testing procedure described in this note is best used with large samples of studied observations (persons, students, patients, respondents, or cases). This recommendation is a consequence of the fact that large samples substantially increase the likelihood that the asymptotic theory underlying this procedure and the ML method is practically applicable (e.g., Casella & Berger, 2002). We therefore encourage future research into possible guidelines allowing one to determine if certain empirical study sizes are associated with practical relevance of this theory, which is similarly of importance for the described RSM model fit evaluation and parameter interval estimation procedure (e.g., Raykov & Marcoulides, 2018).

In conclusion, this note outlines a readily used method for testing the fit of the widely applicable and parsimonious RSM, as well as point and interval estimation of location parameters and substantively informative functions of them, for polytomous ordinal items adhering to it.

Acknowledgments

The authors are indebted to G. A. Marcoulides, J. Pitblado, C. Huber, and J. Algina for valuable discussions on item response theory and its applications. They are grateful to C. Huber for helpful graphing, data extraction, and programming assistance. Thanks are also due to the Editor and two anonymous Referees for critical comments on an earlier version of the article, which have contributed considerably to its improvement.

Appendix A

Mplus Source Code for Testing the Rating Scale Model

TITLE: USING MPLUS FOR EVALUATING THE FIT OF THE RATING SCALE MODEL
DATA: FILE = <name of raw data set>;
VARIABLE: NAMES = Y1-Y5;
CATEGORICAL = Y1-Y5 (GPCM);
ANALYSIS: ESTIMATOR = ML;
MODEL: F BY Y1-Y5@1;
F*;
[Y1$1-Y1$4](t11-t14);
[Y2$1-Y2$4](t21-t24);
[Y3$1-Y3$4](t31-t34);
[Y4$1-Y4$4](t41-t44);
[Y5$1-Y5$4](t51-t54);
MODEL CONSTRAINT:
0 = t11-t12-t21+t22;
0 = t12-t13-t22+t23;
0 = t13-t14-t23+t24;
0 = t21-t22-t31+t32;
0 = t22-t23-t32+t33;
0 = t23-t24-t33+t34;
0 = t31-t32-t41+t42;
0 = t32-t33-t42+t43;
0 = t33-t34-t43+t44;
0 = t41-t42-t51+t52;
0 = t42-t43-t52+t53;
0 = t43-t44-t53+t54;
OUTPUT: TECH1 TECH8;
PLOT: TYPE = PLOT3;

Note. This source code requests first fitting the RSM (e.g., Raykov & Marcoulides, 2018, p. 202). To this end, initially the item threshold parameters (symbolized by the $ sign, 4 per item) are assigned labels (denoted “t##”). Then, in the MODEL CONSTRAINT section the relevant successive parameter difference equalities across consecutive pairs of items are introduced, which imply the same type of equality constraints across all pairs of items (e.g., Andrich, 1978). Further specific explanation of the used commands and subcommands is found in Muthén and Muthén (2024, ch. 5).
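The pattern behind the 12 MODEL CONSTRAINT lines above can be made explicit with a short Python sketch that generates them for p items with K thresholds each. This is only an illustration of the pattern; the function name `rsm_constraints` is hypothetical, and the simple "t" plus item digit plus threshold digit labeling is unambiguous only when p and K are both below 10.

```python
def rsm_constraints(p, K):
    """Return Mplus MODEL CONSTRAINT lines that equate successive threshold
    differences across consecutive item pairs (labels t<item><threshold>)."""
    lines = []
    for i in range(1, p):        # consecutive item pairs (i, i + 1)
        for k in range(1, K):    # consecutive threshold pairs (k, k + 1)
            lines.append(f"0 = t{i}{k}-t{i}{k + 1}-t{i + 1}{k}+t{i + 1}{k + 1};")
    return lines

for line in rsm_constraints(p=5, K=4):
    print(line)
```

For p = 5 and K = 4 this yields (p − 1)(K − 1) = 12 lines, beginning with `0 = t11-t12-t21+t22;` and ending with `0 = t43-t44-t53+t54;`, as in the command file above.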

Appendix B

Stata Code for Point and Interval Estimation of Polytomous Item (Rater) Locations and Functions of Them

  * The following command fits the RSM (see also Note below):

. irt rsm calm tense regretful atease nervous

  * Point and interval estimate next the location parameters of the five items

  * using the following 5 respective commands:

. nlcom (-_b[5.calm:_cons] - _b[5.calm:_one])/_b[5.calm:Theta]

. nlcom (-_b[5.tense:_cons] - _b[5.tense:_one])/_b[5.tense:Theta]

. nlcom (-_b[5.regretful:_cons] - _b[5.regretful:_one])/_b[5.regretful:Theta]

. nlcom (-_b[5.atease:_cons] - _b[5.atease:_one])/_b[5.atease:Theta]

. nlcom (-_b[5.nervous:_cons] - _b[5.nervous:_one])/_b[5.nervous:Theta]

  * Point and interval estimate next the difference in the location parameters of the

  * Calm and At Ease items, say:

. nlcom (-_b[5.calm:_cons] - _b[5.calm:_one])/_b[5.calm:Theta] - (-_b[5.atease:_cons] - _b[5.atease:_one])/_b[5.atease:Theta]

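  * Data extraction commands (to be run before the commands above, in a Stata session):

. copy https://vpgcentral.com/wp-content/uploads/2021/02/all-examples.zip allexamples.zip, replace

. unzipfile allexamples.zip, replace

. import spss using "./Traditional/Anxiety14.sav", clear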

Note. The command nlcom requests estimation of the parametric function that follows it (within the respective row), expressed in terms of the internal parameterization utilized when fitting the RSM (an intercept-and-slope reparameterization of the conventional model representation; for example, Andrich, 2016; Stata Item Response Theory Manual, 2023; the asterisk signals the start of an annotating comment line). The analyzed data set can be obtained from the authors upon request, or from the cited weblink in the illustration section by using these data extraction commands (in a Stata session).

Note 1. The reported model fit findings and their interpretation are not to be treated as implying that the rating scale model (RSM) need be the only model fitting plausibly the data set, which is used for method illustration purposes in this section.

Note 2. This discussion is not meant to devalue the importance of model flexibility. The essence of the point raised here is the statistically logical fact that when a parsimonious model (like the RSM) is tenable for a given data set, there is no obvious need to use as a means of data description and explanation another model (like the partial credit model [PCM] or generalized partial credit model [GPCM]) that contains further parameters. More flexible models of the latter type are typically of empirically justifiable benefit only when a more parsimonious alternative model is not plausible. For this reason, the overall fit evaluation method for the RSM provided in this note offers a useful means of examining whether a highly parsimonious model like the RSM may be employed as a modeling means of an analyzed data set; in the alternative case, the PCM and GPCM may well be used instead for this purpose.

Footnotes

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The authors received no financial support for the research, authorship, and/or publication of this article.

References

  1. Andrich D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–574.
  2. Andrich D. (2016). Rasch rating scale model. In van der Linden W. J. (Ed.), Handbook of item response modeling (Vol. 1, pp. 75–94). CRC Press.
  3. Cai L., Thissen D., du Toit S. H. C. (2017). IRTPRO 4.1 for Windows [Computer software]. Scientific Software International.
  4. Casella G., Berger J. O. (2002). Statistical inference. Wadsworth.
  5. de Ayala R. J. (2022). The theory and practice of item response theory. The Guilford Press.
  6. duToit M. (Ed.). (2003). IRT from SSI. Scientific Software International.
  7. Muthén L. K., Muthén B. O. (2024). Mplus user’s guide.
  8. Nering M. L., Ostini R. (2010). Handbook of polytomous item response theory models. Taylor & Francis.
  9. Raykov T., Marcoulides G. A. (2018). A course in item response theory and modeling with Stata. Stata Press.
  10. Raykov T., Pusic M. (2023). Evaluation of polytomous item locations in multi-component measuring instruments: A note on a latent variable modeling procedure. Educational and Psychological Measurement, 83, 630–641.
  11. Reckase M. D. (2009). Multidimensional item response theory. Springer.
  12. von Davier M. (2016). Rasch scale model. In van der Linden W. J. (Ed.), Handbook of item response modeling (Vol. 1, pp. 56–74). CRC Press.
  13. Zhang S., Petersen J. H. (2018). Quantifying rater variation for ordinal data using a rating scale model. Statistics in Medicine, 37, 2223–2237.

Articles from Educational and Psychological Measurement are provided here courtesy of SAGE Publications
