Abstract
Rating scale analysis techniques provide researchers with practical tools for examining the degree to which ordinal rating scales (e.g., Likert-type scales or performance assessment rating scales) function in psychometrically useful ways. When rating scales function as expected, researchers can interpret ratings in the intended direction (i.e., lower ratings mean “less” of a construct than higher ratings), distinguish between categories in the scale (i.e., each category reflects a unique level of the construct), and compare ratings across elements of the measurement instrument, such as individual items. Although researchers have used these techniques in a variety of contexts, studies are limited that systematically explore their sensitivity to problematic rating scale characteristics (i.e., “rating scale malfunctioning”). I used a real data analysis and a simulation study to systematically explore the sensitivity of rating scale analysis techniques based on two popular polytomous item response theory (IRT) models: the partial credit model (PCM) and the generalized partial credit model (GPCM). Overall, results indicated that both models provide valuable information about rating scale threshold ordering and precision that can help researchers understand how their rating scales are functioning and identify areas for further investigation or revision. However, there were some differences between models in their sensitivity to rating scale malfunctioning in certain conditions. Implications for research and practice are discussed.
Keywords: rating scale, partial credit model, generalized partial credit model, survey research, performance assessment
Rating scale functioning describes the degree to which ordinal rating scales (e.g., Likert-type scales or performance assessment scales) function in psychometrically useful ways (Wright & Masters, 1982). For example, rating scales function well when successive categories reflect increasing levels of a latent variable, each category reflects a distinct range of the latent variable, and categories have a comparable interpretation across items and subgroups of examinees, among other characteristics. Researchers use rating scale analysis techniques to empirically evaluate their rating scales for evidence of these properties in addition to other indicators of psychometric quality that focus on items, persons, and other facets.
In contrast to analyses that focus on total scores (e.g., classical test theory) or on item-level estimates of difficulty or discrimination (e.g., in many polytomous item response theory [IRT] analyses), rating scale analysis allows researchers to evaluate whether the categories in an ordinal rating scale can be interpreted in a meaningful way. Evidence of acceptable rating scale functioning ensures that researchers can interpret ratings in the intended direction, make distinctions between responses in different categories, and compare ratings between items and persons.
Linacre (2002, 2004) presented a set of guidelines for evaluating rating scale functioning based on the partial credit model (PCM; Masters, 1982), which belongs to the family of Rasch measurement theory models (Wright & Mok, 2004). In practice, researchers have used these guidelines to inform revisions to their rating scales prior to further analysis or administration of their instruments (e.g., Hagedoorn et al., 2018; Kornetti et al., 2004; Wesolowski et al., 2016; Wind et al., 2018). However, researchers have not fully considered the sensitivity of these indicators to problematic rating scale characteristics using a systematic approach. Moreover, researchers have not fully considered the degree to which different popular polytomous latent trait models offer insight into problematic rating scale characteristics. Because researchers often examine their surveys using non-Rasch IRT models such as the generalized partial credit model (GPCM; Muraki, 1997), it is important that they are aware of the degree to which these models can identify problematic rating scale characteristics, and how the sensitivity to rating scale problems varies across Rasch models and non-Rasch models. The current study uses real data and a simulation study to offer initial insight into these topics.
Evaluating Rating Scale Functioning
As mentioned earlier, Linacre (2002, 2004) proposed a set of guidelines for evaluating rating scale functioning in the context of Rasch measurement theory. These guidelines reflect concerns with the direction, precision, and model-data fit associated with rating scales. In practice, researchers can use these guidelines to evaluate the extent to which examinees’ interpretation and use of rating scales supports the interpretation and use of ratings to estimate person and item locations on a latent variable.
As a preliminary guideline before conducting additional rating scale analyses, Linacre (2004) suggested that researchers should verify the overall directional orientation of their rating scale by calculating corrected item-total correlations (i.e., biserial correlations) between examinee responses to each item and the total score. As in traditional item analysis procedures based on total scores (Crocker & Algina, 1986), positive, moderate-to-strong values support the interpretation of item responses in the expected direction. Following this preliminary check of directionality, Linacre (2004) proposed eight additional guidelines to evaluate rating scale functioning in more detail.
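As a brief, hedged illustration of this preliminary directionality check (not part of Linacre's original guidelines), the base-R sketch below computes corrected item-total correlations; the object `responses` is a hypothetical person-by-item matrix of ordinal ratings.

```r
# Corrected item-total correlation: each item against the total score
# computed from the remaining items (the item is removed from its own total).
corrected_item_total <- function(responses) {
  responses <- as.matrix(responses)
  total <- rowSums(responses)
  sapply(seq_len(ncol(responses)), function(i) {
    cor(responses[, i], total - responses[, i])
  })
}

# Positive, moderate-to-strong values support interpreting responses in the
# expected direction; near-zero or negative values warrant further review.
# r_it <- corrected_item_total(responses)
```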
First, Linacre (2004) recommended that researchers ensure that they have a minimum of 10 observations in each category of a rating scale to facilitate precise and stable estimates of category thresholds in polytomous models. Categories that are unobserved or have low frequencies lead to ambiguous interpretations. For example, low frequencies may indicate that participants who are truly located in a certain range of the latent variable were not included in the current sample. Unobserved categories could also indicate that the level reflected in the category is not observable or does not exist in the population. In this case, it may be useful to revise the length of the rating scale. In practice, researchers can use information about the context, sample, and construct to inform their interpretation of unobserved or infrequent categories. Linacre recommends that researchers consider techniques such as combining responses in these categories with adjacent categories or treating responses in these categories as missing data.
Somewhat related to the first guideline, Linacre (2004) recommended that researchers check the frequencies of ratings across categories for evidence of regular (unimodal or uniform) distributions. Highly skewed or multi-modal distributions can indicate that examinees may not have used the full range of the rating scale, and that one or more categories may be functioning as a “pivot point” that groups examinees into potentially fewer categories than intended.
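A minimal base-R sketch of these two frequency checks (assuming a hypothetical person-by-item matrix `responses` scored 0 to 4) is shown below; it simply tabulates category counts so that low-frequency categories and irregular distributions can be spotted.

```r
# Category frequencies for each item: look for categories with fewer than
# 10 observations and for highly skewed or multimodal distributions.
category_counts <- apply(responses, 2, function(x) table(factor(x, levels = 0:4)))
category_counts
```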
The third guideline is related to the order of rating scale categories. Linacre (2004) recommended that researchers ensure that the average examinee location estimate (e.g., θ) should increase monotonically along the ordinal rating scale. In other words, examinees who respond in lower categories should have lower average location estimates than examinees who respond in higher categories.
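A hedged base-R sketch of this check is shown below; `theta` is a hypothetical vector of examinee location estimates and `item` a hypothetical vector of their ratings on a single item.

```r
# Average examinee location (theta) within each observed rating category;
# these averages should increase as the category increases.
avg_theta_by_category <- tapply(theta, item, mean)
all(diff(avg_theta_by_category) > 0)   # TRUE when the averages advance monotonically
```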
Fourth, Linacre (2004) suggested that researchers examine model-data fit indicators specific to individual categories in the rating scale. These analyses are relatively straightforward in the Rasch measurement context, where model-data fit analysis based on residuals is part of routine procedures (Bond et al., 2020; Smith, 2004). With Rasch models, researchers can calculate residual summary statistics such as outfit mean square error (MSE) to evaluate the degree to which observations in each category reflect the expected patterns of responses, given model estimates for items and persons.
Linacre’s (2004) fifth guideline is that rating scale category thresholds advance as categories progress from low to high in the ordinal rating scale. For example, in a four-category rating scale (x = 0, 1, 2, 3), location estimates for the three thresholds should be ordered such that τ1 < τ2 < τ3. For example, Figure 1 shows the plots of category probabilities based on the PCM for two polytomous items scored in four ordinal rating scale categories. The x-axis shows person location estimates, and the y-axis shows the conditional probability for a rating in each category. Separate lines are plotted for each category in the rating scale. Thresholds are located at the intersection points between adjacent categories. For the item in Panel A, thresholds advance in the expected direction, such that the threshold between x = 0 and x = 1 (τ1) is located lower on the x-axis compared with the threshold between x = 1 and x = 2 (τ2), which is located lower on the x-axis compared with the threshold between x = 2 and x = 3 (τ3). However, for the item in Panel B, τ1 is located higher on the x-axis compared with τ2; this is an example of threshold disordering. This guideline is most relevant in the case of measurement models in which rating scale thresholds are specified with adjacent-categories probabilities, such as the rating scale model (RSM; Andrich, 1978), the PCM (Masters, 1982), and the GPCM (Muraki, 1997). In these models, disordered thresholds can occur. In contrast, models such as the graded response model (GRM; Samejima, 1969) include thresholds that are specified using cumulative probabilities. In these models, rating scale categories are ordered by definition within items, and researchers cannot directly identify threshold disordering using threshold location estimates (Andrich, 2015; Mellenbergh, 1995).
Figure 1.
Illustrative Category Probability Curves for a Four-Category Ordinal Rating Scale
Sixth, Linacre (2004) recommended that researchers evaluate the coherence of observations with rating scale categories. This guideline is based on the idea that ratings should correspond to examinees' overall location estimates (e.g., θ). In practice, researchers can evaluate category coherence using expected response functions, which illustrate the relationship between examinee locations on the latent variable and the rating scale categories. Then, researchers can calculate the percentage of examinees within certain location ranges who provided ratings in the expected category. Likewise, researchers can calculate the percentage of ratings in each category that were provided by examinees within a range of locations on the latent variable.
Linacre’s (2004) final guidelines are related to the distance between rating scale category threshold estimates on the logit scale. He recommended that the distance between categories should be large enough such that each category reflects a meaningful range of examinee locations on the latent variable but not so large that the interpretation of the category is vague. For example, in Figure 1A, there is a unique range of person locations along the x-axis at which each of the four rating scale categories is most probable. However, in Figure 1B, two categories (x = 1 and x = 2) are never the most probable response. In general, Linacre recommended a minimum distance of around 1 logit and a maximum distance around 5 logits between adjacent thresholds for most rating scales. In an earlier article, Linacre (2002) suggested that researchers can calculate an appropriate minimum distance between thresholds that can be used in contexts where rating scale categories are interpreted as a series of progressive steps, such as a developmental scale. In these situations, the categories can be interpreted as a series of dichotomous items that function as a set of Bernoulli trials, where examinees are expected to succeed on progressive trials as their locations on the latent variable increase. The minimum distance between adjacent thresholds can be calculated using
$$\tau_x = \ln\left(\frac{x}{(m - x) + 1}\right) \qquad (1)$$
where ln is the natural logarithm, x is the rating scale category number, and m is the number of categories in a rating scale minus one. For example, in a scale with three categories (x = 0, 1, 2), the two thresholds are τ1 = ln(1 / [(2 − 1) + 1]) = −0.693 and τ2 = ln(2 / [(2 − 2) + 1]) = 0.693. With these values, the recommended minimum distance between categories is τ2 − τ1 = 1.39 logits.
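To make the arithmetic behind Equation 1 concrete, the short base-R sketch below evaluates the formula for scales of different lengths; it is only a worked illustration of the guideline values cited in this article.

```r
# Linacre's (2002) formula: tau_x = ln(x / (m - x + 1)),
# where m is the number of categories minus one and x = 1, ..., m.
linacre_min_thresholds <- function(n_categories) {
  m <- n_categories - 1
  x <- seq_len(m)
  log(x / (m - x + 1))
}

# Three-category scale: thresholds of -0.693 and 0.693 logits,
# so the minimum recommended advance is about 1.39 logits.
diff(linacre_min_thresholds(3))   # 1.386

# Five-category scale: minimum advances of roughly 0.98, 0.81, and 0.98 logits
# between adjacent thresholds (the 0.98-logit guideline referenced later).
diff(linacre_min_thresholds(5))
```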
Research on Rating Scale Analysis
Among the indicators of rating scale functioning that Linacre (2002, 2004) identified, most research on rating scale analysis has focused primarily on evaluating the order of rating scale category thresholds (Adams et al., 2012; Andrich, 2013; Luo, 2005) and the degree to which rating scale categories provide distinct information about examinee locations on the construct as indicated by the distance between adjacent threshold estimates (e.g., Hagedoorn et al., 2018; Kornetti et al., 2004; Van Zile-Tamsen, 2017; Wind et al., 2018). Threshold ordering and category distinctiveness have practical implications that help researchers evaluate their scales for evidence that categories can be interpreted in the intended order and that each category contributes unique and meaningful information with which to distinguish among examinees, respectively.
Most research related to rating scale analysis has been conducted using polytomous Rasch models. Although it is possible to examine rating scale functioning for a complete set of items using the RSM (Andrich, 1978), researchers who are concerned with rating scale functioning typically use the PCM (Masters, 1982) or PCM versions of the many-facet Rasch model (PC-MFRM; Linacre, 1989). The PCM and PC-MFRM are especially useful for rating scale analysis because they specify the rating scale category threshold parameter separately for individual items (PCM) or levels of explanatory variables (PC-MFRM). As a result, these models allow researchers to conduct a detailed investigation of rating scale functioning. For example, researchers can use the PCM to identify individual items with potentially problematic rating scale characteristics that warrant additional investigation.
Although researchers have discussed rating scale analysis most often in the context of Rasch models, it is also possible to examine rating scale functioning using non-Rasch IRT models. For example, Hagedoorn et al. (2018) evaluated rating scale characteristics using the GPCM (Muraki, 1997). Although these researchers did not focus exclusively on rating scale analysis, they explored rating scale functioning using descriptive statistics and the GPCM in an analysis of a survey related to nurses’ attitudes toward family involvement in nursing care. Specifically, these authors examined the distribution of responses across categories using frequencies and evaluated threshold ordering using estimates from the GPCM. They found that participants used categories differently across items, but that the rating scale generally functioned as expected.
Models such as the RSM, PCM, and GPCM are particularly well-suited to rating scale analysis because they are specified using adjacent-categories thresholds (Mellenbergh, 1995). As discussed earlier, these models allow researchers to empirically identify threshold disordering when it is present (Andrich, 2013; Luo, 2005). In contrast, models based on cumulative thresholds, such as the GRM, are not particularly well-suited to rating scale analysis because they do not allow researchers to identify threshold disordering when it is present (Andrich, 2015).
Purpose
Given the emphasis in previous research on rating scale ordering and category precision, this study focuses on the sensitivity of popular polytomous IRT models to these issues. Specifically, the purpose of this study is to explore the sensitivity of the PCM and GPCM to problematic rating scale characteristics related to category ordering and category distinctiveness. The overarching research question for this study is
Research Question 1 (RQ1): To what extent can the PCM and GPCM alert researchers to problematic rating scale characteristics?
Using a real data analysis and simulation study, I explored the sensitivity of the PCM and GPCM to three specific aspects of rating scale functioning, as reflected in the following research questions:
Research Question 2 (RQ2): How sensitive are the PCM and GPCM to disordered rating scale category thresholds?
Research Question 3 (RQ3): How sensitive are the PCM and GPCM to narrow distances between adjacent rating scale category thresholds?
Research Question 4 (RQ4): How sensitive are the PCM and GPCM to wide distances between adjacent rating scale category thresholds?
Method
A real data analysis and simulation study were used to address the research questions. This section includes an overview of the real data and simulation study design. Then, methods used to analyze the real and simulated data are described.
Real Data
The real data are from a survey designed to measure children’s sense of belonging within their families. The data were originally presented in a study on the development of a global health measure for children (9–15 years) within the Patient-Reported Outcomes Measurement Information System (PROMIS), which included the Family Belongingness Scale (FBS; Forrest et al., 2016). The current sample includes responses from 1,704 children to 34 items related to their sense of belongingness within their family. Children responded to each item using a five-category ordinal rating scale (0 = never, 1 = rarely, 2 = sometimes, 3 = often, 4 = always), where responses in higher categories indicate stronger levels of belongingness. Table A1 in Appendix A in the Online Supplement includes the item stems from the FBS. The real data sample included only complete responses; there were no missing data.
Simulation Study
A simulation study was useful for addressing the research questions because it allowed me to create conditions in which specific items exhibited particular types of problematic rating scale characteristics. Accordingly, I used base programming in R (R Core Team, 2021) to generate ordinal ratings with the GPCM with no missing responses. I generated data with the GPCM because this model allowed me to incorporate some differences in item discrimination that are likely to occur in real-world settings. According to the GPCM, the probability for a rating in category k of a rating scale with x = 0, ..., m categories is defined as
$$P(X_{ni} = k) = \frac{\exp\left[\sum_{j=0}^{k} \alpha_i(\theta_n - \delta_{ij})\right]}{\sum_{c=0}^{m} \exp\left[\sum_{j=0}^{c} \alpha_i(\theta_n - \delta_{ij})\right]} \qquad (2)$$
where θn is the location estimate for person n on a log-odds scale that represents their level on the latent variable. The item-threshold parameter (δik) is a combination of the overall item difficulty location on the latent variable (δi) and the difficulty associated with the threshold between category k and category k − 1 specific to item i (τik). Finally, αi is the estimated discrimination associated with item i. Higher values of alpha indicate that an item discriminates more strongly between examinees who have different locations on the latent variable. By convention, the summation term for j = 0 in Equation 2 is defined as zero.
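As a minimal sketch of Equation 2 (not the code used in this study), the base-R function below returns the GPCM category probabilities for one person and one item; setting the discrimination to 1 yields the PCM given later in Equation 3.

```r
# GPCM category probabilities (Equation 2) for one person and one item.
# theta: person location; delta: item-threshold parameters (delta_i1, ..., delta_im);
# alpha: item discrimination. The summation term for k = 0 is zero.
gpcm_prob <- function(theta, delta, alpha = 1) {
  numerators <- exp(c(0, cumsum(alpha * (theta - delta))))
  numerators / sum(numerators)
}

# Example: a five-category item (four thresholds); the probabilities sum to 1.
gpcm_prob(theta = 0, delta = c(-1.5, -0.5, 0.5, 1.5), alpha = 1.2)
```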
Table 1 summarizes the fixed and manipulated characteristics of the simulated data. In all conditions, I generated ratings using a five-category rating scale (x = 0, 1, 2, 3, 4) because this scale length is popular in applied and methodological psychometric research related to both affective scales (e.g., Bozdağ & Bilge, 2022; Haddad et al., 2021; Hagedoorn et al., 2018; Moors, 2008; Waugh, 2002) and performance assessments that include ordinal rating scales (e.g., Andrich, 2010; Engelhard et al., 2018; Wind & Walker, 2019). I generated person parameters (θ) and overall item locations (δi) from a standard normal distribution to reflect recent simulation and real data IRT research related to rating scales (Buchholz & Hartig, 2019; Finch, 2011; Wind & Guo, 2019; Wolfe et al., 2014). I generated four thresholds associated with each item (τi1, τi2, τi3, τi4) to reflect the values of threshold parameters observed in the real data and Linacre’s (2002, 2004) guidelines for useful distances between adjacent thresholds in five-category rating scales. I selected the first threshold location (τi1) from a random uniform distribution with a minimum of −2.5 logits and a maximum of 0.5 logits. I selected the second threshold location (τi2) by adding τi1 to a value selected from a random uniform distribution with a minimum of 0.8 and a maximum of 1 logit. For τi3, I added τi2 to a value selected from a random uniform distribution with a minimum of 0.8 and a maximum of 1 logit. Finally, I calculated the fourth threshold (τi4) by subtracting the sum of the first three thresholds from 0, such that the sum of the item threshold parameters for each item was equal to zero. Then, I added the overall item location values to each threshold value to obtain the four item-threshold values for each item (δik).
Table 1.
Design of the Simulation Study
| | Characteristic | Level(s) |
|---|---|---|
| Held constant | Rating scale length | Five categories (x = 0, 1, 2, 3, 4) |
| | Generating person parameters | θ ~ N(0, 1) |
| | Generating item threshold parameters | τ1 ~ U(−2.5, 0.5); τ2 = τ1 + U(0.8, 1); τ3 = τ2 + U(0.8, 1); τ4 = 0 − (τ1 + τ2 + τ3) |
| | Generating item discrimination parameters | α ~ N(1, 0.05) |
| | Item sample size | N = 35 |
| Manipulated | Person sample size | N = 100; N = 500; N = 1,000 |
| | Generating item discrimination parameters | Small slope SD: α ~ N(1, 0.05); Large slope SD: α ~ N(1, 1) |
| | Type of rating scale malfunctioning | None; Disordered thresholds; Nondistinct narrow categories; Nondistinct wide categories |
| | Number of items with rating scale malfunctioning | 5; 15 |
I manipulated several characteristics to create simulation conditions. First, I used three sample sizes for persons that reflect the real data and recent simulation research in IRT: N = 100, 500, or 1,000. I specified two levels of item discrimination. In half of the conditions, I selected the item discrimination parameters from α ~ N(1, 0.05) such that there would be some variability in item slopes, but that most items would approximate the Rasch model requirement of a slope parameter around 1.00 for items. In the other half, I used a larger standard deviation (SD) for the slope parameter: α ~ N(1, 1); this specification reflected the relatively large SD for the slope parameter observed in the real data (SD = 0.85; discussed in the “Results” section) and also allowed me to explore the impact of variability in item discrimination that is directly modeled in the GPCM.
Next, I simulated either 5 or 15 randomly selected items to exhibit problematic rating scale characteristics (“rating scale malfunctioning”); these values reflect the prevalence of rating scale problems identified in the real data and in previous studies on rating scale functioning (e.g., Hagedoorn et al., 2018; Kornetti et al., 2004; Wesolowski et al., 2016; Wind et al., 2018). I varied the specific type of rating scale malfunctioning that occurred in each simulated dataset. First, I created a null condition with no simulated rating scale malfunctioning. In the disordered threshold conditions, I reversed the order of the first two thresholds such that δi1 > δi2 for the selected items. I also created a condition with nondistinct categories due to small distances between the first two thresholds by calculating the generating value for δi2 as δi1 plus a value selected from a random uniform distribution with a minimum of 0.05 and a maximum of 0.2 logits. These distances fall below Linacre’s (2002) recommended minimum distance of about 0.98 logits between adjacent thresholds for five-category rating scales. Finally, I created a condition with nondistinct categories due to large distances between the first two thresholds by calculating the generating value for δi2 as δi1 plus a value selected from a random uniform distribution with a minimum of 3 and a maximum of 4 logits. Although this range of distances between thresholds is somewhat smaller than Linacre’s (2004) recommended maximum value of 5 logits, it reflects half or more of the typical range of person and item locations observed in practical applications of item response theory models (DeAyala, 2009; Embretson & Reise, 2000). In practice, researchers may find such large distances between adjacent thresholds to be notably large and worthy of further investigation.
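The base-R sketch below illustrates how the generating threshold structure in Table 1 and the three malfunction manipulations described above could be set up; it is a simplified illustration of the design rather than the exact simulation code used in the study.

```r
set.seed(1)
n_items <- 35

# Baseline thresholds (Table 1): tau1 ~ U(-2.5, 0.5), tau2 = tau1 + U(0.8, 1),
# tau3 = tau2 + U(0.8, 1), and tau4 chosen so the four thresholds sum to zero.
tau1 <- runif(n_items, -2.5, 0.5)
tau2 <- tau1 + runif(n_items, 0.8, 1)
tau3 <- tau2 + runif(n_items, 0.8, 1)
tau4 <- 0 - (tau1 + tau2 + tau3)

# Overall item locations and discriminations (small slope SD condition),
# combined into item-threshold parameters delta_ik = delta_i + tau_ik.
delta_i <- rnorm(n_items, 0, 1)
alpha   <- rnorm(n_items, 1, 0.05)
delta   <- delta_i + cbind(tau1, tau2, tau3, tau4)

# Randomly select items to exhibit rating scale malfunctioning.
bad <- sample(n_items, 5)

# (a) Disordered thresholds: reverse the first two thresholds.
disordered <- delta
disordered[bad, 1:2] <- disordered[bad, 2:1]

# (b) Nondistinct narrow categories: second threshold only 0.05 to 0.2 logits
#     above the first.
narrow <- delta
narrow[bad, 2] <- narrow[bad, 1] + runif(length(bad), 0.05, 0.2)

# (c) Nondistinct wide categories: second threshold 3 to 4 logits above the first.
wide <- delta
wide[bad, 2] <- wide[bad, 1] + runif(length(bad), 3, 4)
```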
Data Analysis
The data analysis procedure involved applying the PCM and GPCM to the simulated and real data and using the results to evaluate the rating scale functioning across items. I used Marginal Maximum Likelihood Estimation (MMLE) functions in the Test Analysis Modules (TAM) package for R (Robitzsch et al., 2020) to apply both models to the real and simulated data. The GPCM is defined as in Equation 2, and the PCM is defined as
$$P(X_{ni} = k) = \frac{\exp\left[\sum_{j=0}^{k} (\theta_n - \delta_{ij})\right]}{\sum_{c=0}^{m} \exp\left[\sum_{j=0}^{c} (\theta_n - \delta_{ij})\right]} \qquad (3)$$
where all of the terms are defined as in Equation 2, except that the discrimination parameter (αi) is fixed at 1 for all items in the PCM. I used estimates from each model to evaluate rating scale functioning using indicators related to threshold ordering and category precision as described in the following sections.
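For readers who want to reproduce this type of analysis, the sketch below shows one plausible way to fit both models with the TAM package; exact function behavior and output structure may vary across TAM versions, and `ratings` is a hypothetical person-by-item data frame of responses scored 0 to 4.

```r
library(TAM)

# Fit the PCM (Equation 3) and the GPCM (Equation 2) via MMLE.
pcm_fit  <- tam.mml(resp = ratings, irtmodel = "PCM")
gpcm_fit <- tam.mml.2pl(resp = ratings, irtmodel = "GPCM")

# Item-level infit and outfit MSE statistics (used here with the PCM).
pcm_itemfit <- tam.fit(pcm_fit)

# Inspect estimated item, threshold, and (for the GPCM) slope parameters.
summary(pcm_fit)
summary(gpcm_fit)
```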
Overall Model Results
Before I examined the rating scale analysis results in detail, I checked the model results for overall adherence to model requirements. Given the focus of the current study on rating scale analysis, model comparisons via fit indices are not of primary importance. Nonetheless, to provide context for the analysis, I evaluated the basic requirements for unidimensional IRT models. As a general evaluation of the unidimensionality requirement for the PCM and GPCM, I calculated the coefficient omega (Green & Yang, 2015) on the FBS item responses.
Next, I evaluated item fit to the PCM using the unweighted (i.e., outfit) and weighted (i.e., infit) formulations of MSE fit statistics; researchers frequently examine these statistics for individual items as part of routine procedures in Rasch analyses of item response data (Smith, 2004). Briefly, these statistics are averages of squared standardized residuals that reflect the discrepancy between observed responses and model-expected responses given parameter estimates. Outfit MSE statistics are unweighted averages of the squared standardized residuals associated with individual items. Infit MSE statistics are also averages of squared standardized residuals, but they are weighted by response variance. As a result, infit MSE statistics are less sensitive to extreme unexpected responses compared with outfit MSE statistics. Because there is no known sampling distribution for outfit and infit MSE statistics, many researchers use critical values (i.e., cut scores) based on practical guidance and empirical methods (e.g., bootstrap methods) to interpret and evaluate these statistics in practice (DeAyala, 2009; Seol, 2016; Walker et al., 2018; Wolfe, 2013). Researchers who use Rasch models generally agree that values of outfit and infit MSE around 1.00 indicate acceptable fit to a Rasch model (Smith, 2004; Wu & Adams, 2013), values above 1.00 indicate more variation than expected in item responses, and values below 1.00 indicate less variation than expected. Reflecting the focus of the current study, item MSE statistics were interpreted as continuous variables, and the overall distribution of item fit statistics was considered rather than classifying items as “fitting” or “misfitting.”
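The following base-R sketch illustrates the logic of these statistics under their usual definitions: outfit is the unweighted mean of squared standardized residuals, and infit weights the same quantities by the model response variance. The vectors `observed`, `expected`, and `variance` are hypothetical model quantities for the responses to a single item.

```r
# Outfit and infit MSE for one item, given observed responses,
# model-expected responses, and model response variances.
item_fit_mse <- function(observed, expected, variance) {
  z2 <- (observed - expected)^2 / variance            # squared standardized residuals
  c(outfit = mean(z2),                                # unweighted average
    infit  = sum(variance * z2) / sum(variance))      # weighted by response variance
}
```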
Item-level fit analysis is not always included in routine applications of the GPCM (DeAyala, 2009). However, some researchers who use the GPCM conduct chi-square tests of conditional independence between item pairs (Chen & Thissen, 1997). These statistics evaluate the degree to which participant responses to items within each pair are statistically independent. Statistically significant values of these statistics suggest that an item pair may violate the independence assumption (Muraki & Muraki, 2018).
Threshold Ordering
The first set of indicators of rating scale functioning is related to the order of thresholds on the latent variable. When rating scales function as expected, the location estimates for thresholds are monotonically nondecreasing as categories progress from low to high. For example, in a rating scale with five ordered categories (x = 0, 1, 2, 3, 4), the expected order for the four thresholds is τ1 < τ2 < τ3 < τ4. This order indicates that the level of the latent variable associated with each category reflects the intended order given the definition of the scale categories. Accordingly, higher ratings correspond to higher levels of the latent variable.
When rating scale category thresholds are disordered, there is a mismatch between the level of the construct required for ratings in each category and the intended order of the rating scale categories. As a result, ratings cannot be interpreted in the intended order. For example, in a five-category rating scale, disordering could occur between the second and third thresholds such that τ2 > τ3. In this case, the level of the construct required to respond in Category 2 would be higher than the level required to respond in Category 3. Such disordering complicates the interpretation of responses and warrants further investigation.
Researchers can examine numeric threshold location estimates on the logit scale for evidence that they are ordered as expected. In addition, researchers can examine plots of rating scale category probability curves such as the plots in Figure 1 for evidence that the probability associated with each category progresses as expected along the latent variable. For example, the plot in Figure 1A supports the hypothesis of expected rating scale category ordering, but the plot in Figure 1B suggests disordered thresholds.
I evaluated rating scale category ordering for each item by comparing the empirical order of threshold estimates to the expected order given the ordinal rating scale. I flagged items for category disordering if any pair of adjacent thresholds was disordered and the distance between them was larger than the standard error associated with either threshold estimate. In the real data analysis, I supplemented the numeric comparison of threshold estimates with visual displays of category probability curves constructed as in Figure 1.
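A hedged base-R sketch of this flagging rule is shown below; `thresholds` and `se` are hypothetical matrices of adjacent-category threshold estimates and their standard errors (items in rows, thresholds in columns), and the rule reads “larger than the standard error associated with either threshold estimate” as exceeding the larger of the two standard errors in the pair.

```r
# Flag items with disordered adjacent-category thresholds.
flag_disordered <- function(thresholds, se) {
  disordered_item <- function(i) {
    gaps <- diff(thresholds[i, ])                    # negative gap = disordered pair
    se_pair <- pmax(se[i, -ncol(se)], se[i, -1])     # larger SE within each adjacent pair
    any(gaps < 0 & abs(gaps) > se_pair)
  }
  which(vapply(seq_len(nrow(thresholds)), disordered_item, logical(1)))
}
```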
Category Precision
Next, researchers can evaluate rating scales for evidence that each category in the scale provides precise information about examinee locations on the latent variable that is unique from other categories. Category precision indicators help researchers evaluate rating scales for evidence that each category provides an appropriate level of distinction between participants that is specific enough to identify meaningful differences with regard to the latent variable. For example, researchers may find that a rating scale category reflects only a very narrow range of the latent variable, and that a shorter rating scale length may be sufficient. Conversely, researchers may find that a rating scale category reflects a very wide range of the latent variable such that participants with meaningful differences tend to provide the same responses; in this case, more categories may be useful to better distinguish among participants.
There are several approaches to evaluating category precision for rating scales (Linacre, 2002). A relatively straightforward approach is to examine the distance between adjacent threshold estimates from models such as the PCM and GPCM. Distances that are too small indicate potentially redundant categories, and distances that are too large indicate potentially ambiguous categories. I evaluated rating scale category precision by examining the difference between pairs of adjacent thresholds specific to each item and comparing these differences to Linacre’s (2002) recommendations. I examined the average values of distances between adjacent thresholds for evidence of category precision, using Linacre’s (2002) recommended minimum value of 0.98 logits and a maximum value of 3 logits as rough guidelines for identifying narrow and wide distances between adjacent thresholds.
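As a sketch of this distance check (with the 0.98-logit minimum and 3-logit maximum used as rough guidelines, per the text above), the snippet below flags adjacent threshold pairs whose absolute distance falls outside those bounds; `thresholds` is again a hypothetical items-by-thresholds matrix of estimates.

```r
# Flag narrow and wide adjacent-category distances against guideline values.
flag_category_precision <- function(thresholds, min_dist = 0.98, max_dist = 3) {
  dists <- t(apply(thresholds, 1, function(x) abs(diff(x))))  # items x adjacent pairs
  list(narrow = which(dists < min_dist, arr.ind = TRUE),      # (item, pair) indices
       wide   = which(dists > max_dist, arr.ind = TRUE))
}
```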
Results
Real Data
Preliminary analyses of participant responses to the FBS indicated acceptable properties for rating scale analysis. All items had positive corrected item-total correlations (0.41 ≤ rpbis ≤ 0.81). In addition, participant responses to the FBS items included at least 10 observations in each category for each item with two exceptions: Item 31 (category 0: n = 7) and Item 34 (category 0: n = 5). According to Linacre (2002), these results support the estimation of rating scale category thresholds and further exploration using rating scale analysis techniques.
Overall Model Results
To provide context for the analysis, I evaluated the basic requirements for unidimensional IRT models. The Omega Hierarchical value was 0.84, which suggests that 84% of the variance in total scores can be attributed to individual differences on a common general factor related to belongingness. The Omega Total value was 0.98, which implies that approximately 2% of the variance in total scores reflects random error. The ratio of these two values (0.84/0.98 = 0.86) suggests that most of the reliable variance in total scores can be attributed to the general factor, which I assume reflects individual differences in children’s sense of belonging in their families. Accordingly, I interpreted the FBS scores as essentially unidimensional.
Partial Credit Model
Table A2 in Appendix A in the Online Supplement includes detailed item-level results from the PCM, including centered item location estimates, threshold estimates, and MSE fit statistics. Responses to the FBS items exhibited generally good fit to the PCM, with average values around 1.00 (infit MSE: M = 1.04; outfit MSE: M = 0.97). However, Item 18 had relatively high fit statistics (outfit MSE = 3.37, infit MSE = 2.09), which suggests frequent unexpected responses associated with this item. Item 3 had the lowest fit statistics (outfit MSE = 0.47, infit MSE = 0.60), which suggests less variation than expected in participant responses. Centered item location estimates (M = 0.00) ranged from −0.87 logits for Item 34, which was the easiest item for participants to endorse, to 1.38 logits for Item 10, which was the most difficult item for participants to endorse.
Generalized Partial Credit Model
For the FBS items, there were 561 chi-square tests of conditional independence, one for each pair of the 34 items in the scale. The Holm-adjusted p-values for these tests, which account for the multiple statistical tests, were not significant at p < .05. This result provides support for the use of the GPCM to analyze the FBS scale data. Table A3 in Appendix A in the Online Supplement summarizes results from the GPCM analysis of the FBS item responses. The average discrimination for the FBS items was equal to 1.64 (SD = 0.86). Individual item discrimination parameters (αi) ranged from 0.32 for Item 18, which was the least-discriminating item, to 3.22 for Item 15, which was the most-discriminating item. After centering such that the average item location was equal to 0.00 logits, the item location estimates ranged from −0.48 logits for Item 31, which was the easiest item to endorse, to 1.00 logits for Item 10, which was the most difficult item for participants to endorse, according to the GPCM.
Rating Scale Analysis Results
Table 2 summarizes the alignment between the PCM and GPCM regarding rating scale functioning for the FBS items. Detailed results about rating scale functioning for individual items based on each model are included in Appendix A in the Online Supplement.
Table 2.
Model Comparison Results for Real Data
| Item | Disordered thresholds (PCM) | Disordered thresholds (GPCM) | Adjacent threshold distance < 0.98 logits (PCM) | Adjacent threshold distance < 0.98 logits (GPCM) | Adjacent threshold distance > 3 logits (PCM) | Adjacent threshold distance > 3 logits (GPCM) |
|---|---|---|---|---|---|---|
| 1 | | | 1 | 1 | | |
| 2 | | | 1, 2 | 1, 2 | | |
| 3 | | | 1 | 1, 2, 3 | | |
| 4 | | | 1 | 1, 2 | | |
| 5 | | | 1 | 1, 2 | | |
| 6 | | | 1, 2, 3 | 1, 2, 3 | | |
| 7 | | | 1 | 1, 3 | | |
| 8 | 1 | 1 | 1 | 1, 3 | | |
| 9 | | | 1, 2, 3 | 1, 2, 3 | | |
| 10 | | | 1 | 1 | | |
| 11 | | | 1 | 1 | | |
| 12 | 3 | 1<, 3 | 1, 2, 3 | 1, 2, 3 | | |
| 13 | 1< | | 1 | 1, 3 | | |
| 14 | | | 1, 3 | 1, 2, 3 | | |
| 15 | | | 1 | 1, 2, 3 | | |
| 16 | 1< | 1 | 1 | 1 | | |
| 17 | | | 1 | 1, 2 | | |
| 18 | 1 | 1, 3 | 1, 2, 3 | 2, 3 | | 1 |
| 19 | | | 1, 2, 3 | 1, 2, 3 | | |
| 20 | | | 1, 3 | 1, 2, 3 | | |
| 21 | 1 | | 1, 3 | 1, 2, 3 | | |
| 22 | | 3 | 1, 2, 3 | 1, 2 | | |
| 23 | | | 1, 2 | 1, 2, 3 | | |
| 24 | 1< | | 1 | 1, 2, 3 | | |
| 25 | 1 | | 1 | 1 | | |
| 26 | | | 1 | 1 | | |
| 27 | 1< | | 1, 3 | 1, 2, 3 | | |
| 28 | 1 | 1, 3 | 1, 2, 3 | 1, 2, 3 | | |
| 29 | | | 1, 2, 3 | 1, 2, 3 | | |
| 30 | | | 1, 3 | 1, 3 | | |
| 31 | | | 1, 3 | 1, 2, 3 | | |
| 32 | 1 | 1 | 1, 2, 3 | 2, 3 | | |
| 33 | 1, 3 | 1, 3 | 1, 2, 3 | 1, 2, 3 | | |
| 34 | | | 1 | 1, 3 | | |
Note. Values indicate the threshold pair between which problematic rating scale functioning was detected. For threshold disordering, 1 = τ1 > τ2, 2 = τ2 > τ3, and 3 = τ3 > τ4. The “<” symbol indicates that the distance was smaller than the standard error associated with one or both thresholds. For adjacent threshold distances, values indicate the threshold pair between which narrow or wide absolute distances were identified, where 1 = |τ2 − τ1|, 2 = |τ3 − τ2|, and 3 = |τ4 − τ3|. PCM = partial credit model; GPCM = generalized partial credit model.
Threshold Ordering
Threshold location estimates from the PCM and GPCM were ordered as expected for most items in the FBS. Both models identified four items with at least one disordered threshold, and the direction of disordering was congruent between models for all of these items. Specifically, the first two thresholds were reversed such that τ1 > τ2 for the following items: Items 8, 16, 32, and 33. In addition, both models found that Item 33 also had τ3 > τ4. Besides those items, the PCM identified τ1 > τ2 for Items 13, 24, 25, and 27, but this disordering was not captured in the GPCM estimates. Conversely, only the GPCM identified τ3 > τ4 for Item 22. For Items 12, 18, and 28, the GPCM identified two disordered thresholds: τ1 > τ2 and τ3 > τ4 whereas the PCM only identified one instance of threshold disordering for these items.
To explore these results further, Figures 2 and 3 show category probability plots for selected items where there were discrepancies in category functioning results between the PCM and GPCM. First, Figure 2 shows category probability curves for items where only the PCM identified disordered thresholds between τ2 and τ1 (Items 13, 24, and 25). Inspection of the category probabilities for these items indicates that, although the thresholds were disordered when estimated using the PCM, these thresholds were quite close to one another according to both models. As a result, researchers would likely come to similar conclusions about the distinctiveness of the second category using either model.
Figure 2.
Category Probability Curves for Selected Items With Discrepancies in Threshold Ordering Between Models
Note. PCM = partial credit model; GPCM = generalized partial credit model.
Figure 3.
Category Probability Curves for Selected Items With Discrepancies in Threshold Ordering Between Models
Note. PCM = partial credit model; GPCM = generalized partial credit model.
Figure 3 shows category probability curves for selected items where the GPCM identified threshold disordering in ways that the PCM did not: Items 12, 18, 22, and 28. For each of these items, the GPCM captured differences in item discrimination that may have contributed to differences in threshold ordering. All of these items had relatively low discrimination parameter estimates according to the GPCM (see Table A3 in Appendix A in the Online Supplement). Among the FBS items, Item 18 had the lowest discrimination parameter estimate (α18 = 0.32), Item 22 had the second-lowest discrimination (α22 = 0.37), Item 28 had the fourth-lowest discrimination (α28 = 0.56), and Item 12 was ranked seventh in discrimination (α12 = 0.76). The relatively low discrimination estimates produced relatively flat probability curves for the middle rating scale categories, which in turn resulted in disordered thresholds. Although the PCM did not flag these items for the same threshold disordering, their outfit MSE statistics were relatively high compared with the other FBS items: Items 18, 22, and 28 had outfit MSE statistics that exceeded the 90th percentile of the empirical distribution of outfit MSE for the FBS scale (outfit MSE > 1.89), and the outfit MSE statistic for Item 12 was also relatively high (outfit MSE = 1.32). In summary, although the PCM did not flag the same instances of threshold disordering for these items that were identified in the GPCM via the estimation of the slope parameter, these items were identified as problematic by the PCM item-level fit statistics.
Category Precision
Both the PCM and the GPCM revealed differences in category precision across the FBS items, as reflected in absolute distances between adjacent thresholds. Based on the PCM, the absolute distance between adjacent thresholds ranged from 0.01 logits to 1.91 logits. Based on the GPCM, the absolute distance between adjacent thresholds was slightly larger, ranging from 0.01 logits to 3.26 logits.
For all 34 items, the distance between the first two thresholds was smaller than Linacre’s (2002) recommended minimum distance of 0.98 logits for a five-category rating scale. This distance ranged from 0.01 to 0.67 logits with the PCM and from 0.01 to 0.85 logits with the GPCM. For the remaining thresholds, the GPCM identified absolute distances that were lower than the 0.98-logit guideline more frequently than the PCM. For example, for Item 3, the GPCM identified smaller-than-recommended distances between all three pairs of thresholds (|τ2 − τ1|, |τ3 − τ2|, and |τ4 − τ3|) whereas the PCM only flagged the first threshold pair (|τ2 − τ1|) as smaller than recommended.
There were some differences between models in identifying imprecise categories for more than half of the FBS items. For example, the GPCM identified narrow absolute distances between more pairs of adjacent thresholds than the PCM for the following items: Items 3, 4, 5, 7, 8, 12, 13, 14, 15, 17, 20, 21, 23, 24, 27, 31, and 34. At the overall item level, most of these items had relatively low PCM fit statistics and relatively high GPCM slope estimates. For example, for Item 3 (α = 3.13; infit MSE = 0.47, outfit MSE = 0.60) and Item 15 (α = 3.22; infit MSE = 0.57, outfit MSE = 0.65), the GPCM identified small distances between all three pairs of adjacent thresholds, but the PCM only identified a narrow absolute distance between the first two thresholds. Figure 4 shows plots of category probability curves for Item 3 and Item 15, both of which were flagged for imprecise categories more frequently by the GPCM than by the PCM. The GPCM plots show the high discrimination of these items, which produces narrower category probability curves than those based on the PCM.
Figure 4.
Category Probability Curves for Selected Items Where GPCM Flagged More Imprecise Categories Than the PCM
Note. GPCM = generalized partial credit model; PCM = partial credit model.
However, this pattern was not always the case for items where the GPCM flagged imprecise categories more often than the PCM. For example, Item 7, which had a narrow distance between the first two thresholds based on the PCM (|τ2−τ1| = 0.09 logits), had a moderate slope estimate from the GPCM (α = 0.89) and MSE fit statistics from the PCM close to the expected value of 1 (infit MSE = 1.10, outfit MSE = 1.13). Figure 4 includes category probability curves for Item 7, which show generally similar patterns between the two models. Although the distances between adjacent thresholds were smaller for the GPCM compared with the PCM, the plots suggest that researchers may come to the same general conclusions about which categories had distinct ranges on the latent variable with both models.
For other FBS items, the PCM identified narrow distances between more pairs of adjacent thresholds than the GPCM: Items 18, 22, and 32. At the overall level, these items tended to have relatively low slope parameter estimates and high MSE fit statistics: Item 18 (α = 0.32; infit MSE = 3.37, outfit MSE = 2.09), Item 22 (α = 0.37; infit MSE = 3.17, outfit MSE = 1.96), Item 32 (α = 0.52; infit MSE = 2.08, outfit MSE = 1.54). Plots of category probability curves for these items (see Figure 5) provide more insight. Although the distances between the thresholds were not always flagged with the GPCM, this was likely due to category disordering, which clearly indicates imprecise categories that warrant further consideration.
Figure 5.
Category Probability Curves for Selected Items Where PCM Flagged More Imprecise Categories Than the GPCM
Note. PCM = partial credit model; GPCM = generalized partial credit model.
None of the FBS items had absolute distances between adjacent thresholds that exceeded Linacre’s (2002, 2004) recommended maximum value of 5 logits; the maximum distance between adjacent thresholds based on either model was an absolute difference of 3.26 logits between the first two thresholds for Item 18, and these thresholds were also disordered based on the GPCM. Further examination of the results for Item 18 indicated that this item had the highest values of MSE fit statistics from the PCM (infit MSE = 3.37, outfit MSE = 2.09), and it had the lowest estimated slope parameter from the GPCM (α = 0.32). These results suggest that both models identified potential problems related to this item, but only the GPCM specifically identified a potentially wide distance between the first two categories. Visual inspection of category probability curves for Item 18 (see Figure 3) supports these conclusions, where both models reflect disordered thresholds and notable differences in the range of examinee locations that correspond to each rating scale category.
Simulation Study
Items Simulated to Exhibit Rating Scale Problems
Figure 6 provides a graphical summary of the rating scale analysis results for the items that were simulated to exhibit problematic rating scale characteristics; detailed numeric results are provided in Table A6 in Appendix A in the Online Supplement.
Figure 6.
Summary of Rating Scale Analysis Results for Items Simulated to Exhibit Problematic Rating Scale Functioning
In the small slope SD conditions (Figure 6A), both models were relatively sensitive to disordered thresholds. In the smallest person sample size conditions (N = 100), the GPCM and PCM accurately detected between 83% and 85% of the items simulated to exhibit disordered thresholds. Sensitivity improved for both models to between 95% and 99% as person sample size increased (N ≥ 500). In the conditions with a large slope SD (Figure 6B), the GPCM was more sensitive to category disordering (0.79 ≤ sensitivity ≤ 0.96) compared with the PCM (0.71 ≤ sensitivity ≤ 0.79), and sensitivity increased as person sample size increased for both models.
In the Narrow Categories conditions, the GPCM and PCM were both generally accurate in identifying small absolute distances between the first two thresholds, with slightly smaller distances identified with the PCM compared with the GPCM. Overall, the absolute distance between thresholds was smaller in the conditions with a small slope SD (Figure 6C) compared with the large slope SD conditions (Figure 6D). In addition, the average absolute distance between these thresholds decreased as person sample size increased. In both slope SD conditions, when the examinee sample size was equal to N = 100, the average absolute distance between the first two thresholds was close to or larger than Linacre’s recommended minimum for a five-category scale (0.98 logits), which means that narrow categories could go undetected with small samples. These distances decreased as examinee sample size increased for both models.
In the Wide Categories conditions, both models were sensitive to large distances between adjacent thresholds. In the small slope SD conditions (Figure 6E), average absolute distances between the first two thresholds ranged from 3.53 logits to 4.31 logits, with slightly larger distances detected with the GPCM compared with the PCM. In the large slope SD conditions (Figure 6F), the distance between thresholds was also smaller with the PCM (3.63 ≤ |τ2 − τ1| ≤ 4.11) compared with the GPCM (4.29 ≤ |τ2 − τ1| ≤ 8.68).
Normal Items
Table A7 in Appendix A in the Online Supplement provides a summary of the results from the simulation study for the items that were not simulated to exhibit problematic rating scale characteristics (i.e., normal items). In the conditions with a small slope SD, both models were effective in identifying these items as exhibiting acceptable rating scale functioning. Specifically, 1% or fewer of the normal items were flagged for category disordering based on both models, and the distance between the first two thresholds was within the generally accepted range, with average values between 1.50 logits and 1.63 logits across all conditions. In the conditions with a large slope SD, disordered thresholds were identified more often based on both models, with between 1% and 15% of normal items identified with disordered thresholds. Threshold disordering was observed most often in these items when other items were simulated to exhibit threshold disordering or wide distances between thresholds. The average absolute distance between thresholds for normal items was within Linacre’s expected range in all conditions, ranging from 1.16 logits to 4.00 logits, with larger distances identified with the GPCM compared with the PCM.
Discussion
Rating scale analysis provides researchers with practical tools for examining the degree to which their ordinal rating scales function in psychometrically useful ways. Researchers can use rating scale analysis in combination with overall scale and item-level analyses to support the interpretation of ratings in a variety of contexts, including research on affective surveys or attitude scales with Likert-type response formats and performance assessments with ordinal rating scales. When rating scales function as expected, researchers can interpret ratings in the intended direction (i.e., lower ratings mean “less” of a construct than higher ratings), distinguish between categories in the scale (i.e., each category reflects a unique level of the construct), and compare ratings across elements of the measurement instrument, such as individual items.
I used a real data analysis and a simulation study to systematically explore rating scale analysis techniques based on two polytomous IRT models: the PCM and the GPCM. Although researchers have used both of these models to evaluate rating scale functioning across a variety of contexts, methodological and applied research on rating scale analysis based on Rasch models such as the PCM is more common compared with the GPCM. In addition, studies are limited that focus specifically on the degree to which these models can identify problematic rating scale characteristics. Because the PCM and GPCM are popular choices for analyzing ordinal polytomous data in the social sciences (Hambleton et al., 2010; Nering & Ostini, 2010), it is important that researchers understand their utility in detecting problematic rating scale characteristics that could compromise the interpretation and use of item responses.
To What Extent Can the PCM and GPCM Alert Researchers to Problematic Rating Scale Characteristics?
This study was guided by an overarching research question focused on the degree to which the PCM and GPCM can help researchers identify problematic rating scale characteristics. There are numerous ways in which rating scales can malfunction that can potentially compromise the interpretation and use of measurement procedures (Linacre, 2002, 2004). For a focused illustration and analysis, the current study explored two major types of rating scale malfunctioning: threshold disordering and imprecise categories, as reflected in narrow and wide categories.
Overall, results from the real data analysis and simulation study revealed that both the PCM and the GPCM provide valuable information about rating scale threshold ordering and precision that can help researchers understand how their rating scales are functioning and identify areas for further investigation or revision. Researchers can use rating scale analysis techniques based on both models to supplement routine analyses of psychometric quality that focus on global model fit and overall item and person location estimates. However, there were some differences between models in their sensitivity to rating scale malfunctioning in certain conditions. I discuss these results below.
How Sensitive Are the PCM and GPCM to Disordered Rating Scale Category Thresholds?
Threshold ordering indicators help researchers understand the degree to which their scale categories are oriented in the expected direction, such that responses in lower categories reflect lower levels of the construct and responses in higher categories reflect higher levels of the construct. Information about disordered thresholds can help researchers understand how their scale functions empirically (Adams et al., 2012), and help them identify anomalies that can guide additional research to improve their understanding of the construct of interest and revisions to their instruments (Andrich, 2013).
Altogether, results from the threshold ordering analyses suggest that researchers can use the PCM and GPCM to identify threshold disordering when it occurs. Specifically, results from the real data analysis and simulation study revealed that both models identified disordered rating scale category thresholds, and the results were comparable between models under most conditions. In the real data analysis, discrepancies between models in sensitivity to disordered thresholds were often associated with items that had relatively extreme discrimination parameters as estimated by the GPCM, misfit as detected by the PCM, or both.
The simulation study results suggested that sensitivity to disordered thresholds improved as examinee sample size increased (N ≥ 500). In addition, the PCM was more accurate in identifying threshold disordering when item discrimination was less variable (SDα = 0.05) compared with the conditions where there was more variation in discrimination (SDα = 1.00). This finding corroborated results from the real data analysis, where the PCM sometimes identified items as misfitting but not necessarily as having disordered thresholds.
Recognizing the differences between models in identifying threshold disordering under certain conditions, it is important that researchers interpret the results from threshold ordering analyses in combination with overall item analysis results, including model-data fit analyses with the PCM and considerations related to item discrimination with the GPCM. Graphical displays of category probability curves can supplement numeric analyses and help analysts understand the magnitude and nature of disordered thresholds when they occur.
How Sensitive Are the PCM and GPCM to Narrow Distances Between Adjacent Rating Scale Category Thresholds?
Category precision indicators help researchers understand the degree to which each category reflects a unique range of locations on the construct. When they are precise, rating scale categories reflect a range of the construct that is neither so narrow that it only reflects a small range of the construct that represents only a small group of participants nor so wide that its interpretation is ambiguous. Researchers may use results from category precision analyses to inform decisions related to the number of categories in their rating scales, the use of neutral categories, or to guide data analysis techniques, such as collapsing or combining adjacent categories prior to further analyses (Bond et al., 2020).
Narrow distances between adjacent rating scale category thresholds correspond to rating scale categories that reflect a small range of the construct that may not identify a group of examinees whose location on the construct is meaningfully different from those who respond in other categories of the scale. When categories are narrow, researchers may decide to reduce the length of their rating scale in future administrations of their instrument or combine categories prior to further analyses of their data (Bond et al., 2020).
Results from the real data analysis revealed at least one narrow category for all of the FBS items, as indicated by both the PCM and GPCM. Practically speaking, this result suggests that the lowest categories in the FBS rating scale may not provide unique information about children’s perceptions of their belongingness in their families. Although both models identified narrow categories for the FBS items, there were some differences between them. Specifically, the GPCM tended to identify narrow categories more often than the PCM. As with the disordered category results, further inspection of the items for which there were discrepancies between the PCM and GPCM revealed somewhat low fit statistics from the PCM and relatively high slope estimates from the GPCM. This result suggests that differences in item discrimination, as captured in fit statistics (PCM) or slope estimates (GPCM) may contribute to the information that these models provide about narrow categories.
Results from the simulation study indicated that the PCM and GPCM were generally accurate in identifying narrow categories, and that accuracy improved as examinee sample size increased. In addition, both models were more accurate in identifying narrow categories in the conditions with less variation in item discrimination (SDα = 0.05) compared with the conditions where there was more variation in discrimination (SDα = 1.00). This result is congruent with observations from the real data analysis and provides further support for the use of rating scale analysis in combination with overall item analysis techniques, including model-data fit analysis with the PCM and considerations related to differences in item discrimination with the GPCM.
How Sensitive Are the PCM and GPCM to Wide Distances Between Adjacent Rating Scale Category Thresholds?
Wide distances between adjacent rating scale category thresholds correspond to categories that span such a large range of the construct that examinees with meaningfully different levels on the construct may give the same response. As a result, researchers may not be able to identify important differences between examinees. When categories are wide, researchers may decide to increase the length of their rating scale in future administrations of their instrument (Bond et al., 2020).
Results from the real data analysis indicated that only one item had a notably large distance between adjacent thresholds, and this wide category was only identified with the GPCM. This item had high fit statistics from the PCM, and a low slope estimate from the GPCM. These results suggest that the relatively wide category was also reflected in overall indicators of item quality.
Results from the simulation study corroborated the real data results. Specifically, although both models were generally accurate in identifying wide absolute distances between thresholds, the GPCM identified larger distances compared with the PCM. In addition, as observed for the threshold disordering and narrow categories conditions, sensitivity to wide categories improved for both models as examinee sample size increased and when there was less variation in item discrimination.
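A simple numeric check corresponding to the narrow- and wide-category definitions above can be sketched as follows (base R). The minimum-distance argument stands in for the value given by Equation 1 in the text, and 5 logits is the maximum discussed there; the threshold estimates and the 0.8-logit minimum shown here are hypothetical.

```r
# Flag narrow and wide categories from estimated adjacent thresholds (items in rows).
flag_threshold_distances <- function(thresholds, min_dist, max_dist = 5) {
  dists <- t(apply(thresholds, 1, diff))       # distances between adjacent thresholds
  list(narrow = which(dists < min_dist, arr.ind = TRUE),  # categories spanning too little of the construct
       wide   = which(dists > max_dist, arr.ind = TRUE))  # categories spanning too much of the construct
}

est_tau <- rbind(c(-2.0, -1.8, 1.0),   # item 1: narrow gap between the first two thresholds
                 c(-3.5,  0.0, 4.0))   # item 2: no flags at these illustrative values
flag_threshold_distances(est_tau, min_dist = 0.8)   # substitute the minimum from Equation 1
```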
Implications
This study provided a systematic exploration of problematic rating scale characteristics related to threshold ordering and category precision using the PCM and GPCM. Researchers frequently use these models in methodological and applied research related to ordinal polytomous ratings, such as research on affective scales and educational performance assessments. These models are well-suited to rating scale analysis because they allow researchers to investigate rating scale functioning for individual items, and they are specified such that thresholds can exhibit disordering. Although researchers have recognized the utility of these models for rating scale analysis, they have not systematically explored their sensitivity to problematic rating scale characteristics.
When researchers are concerned with rating scale functioning, it is important that they are aware of differences in sensitivity to problematic rating scale characteristics between models. Although the PCM and GPCM provided generally consistent information about which items exhibited rating scale malfunctioning, there were some differences in the specific rating scale characteristics that each model identified across conditions.
In practice, researchers’ choice of a model for rating scale analysis should be guided by their theoretical perspective on measurement and their analytic goals. For example, researchers whose goal is to evaluate their rating scale data for evidence that responses reflect fundamental measurement properties (e.g., invariant measurement) would likely use a Rasch model to conduct rating scale analysis (Engelhard & Wind, 2018). In this case, rating scale analysis could supplement routine analyses of model-data fit that are integral to the Rasch approach to evaluating items and examinees for evidence of adherence to requirements for measurement. Results from this study suggest that researchers can use information about rating scale functioning to more fully understand and explore idiosyncrasies or systematic deviations from model expectations that may be associated with rating scale malfunctioning. In particular, the results suggest that rating scale analysis may offer specific insight into item misfit that can help researchers more fully understand potential causes for misfit to guide their understanding of the construct and inform revisions and further analyses with their rating scales.
However, researchers whose goal is to identify a model whose parameters reflect the empirical characteristics of their data may be more likely to use a model such as the GPCM, which directly incorporates differences in item discrimination into model estimates, including estimates related to rating scale functioning. In this case, rating scale analysis would complement the overall item characteristics estimated with the GPCM, such as differences in item discrimination, with additional insight into how individual rating scale categories are functioning.
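For readers who wish to conduct both types of analyses, a minimal sketch of the workflow is shown below, assuming the TAM package (Robitzsch et al., 2020) cited in the References. The object resp stands for a person-by-item matrix of responses scored 0, 1, 2, and so on, and the function calls reflect my reading of the package documentation rather than the exact code used in this study.

```r
# Fit the PCM and GPCM to an ordinal response matrix 'resp' (persons x items, scored 0, 1, 2, ...).
# Sketch based on the TAM package; verify function arguments against the package documentation.
library(TAM)
pcm_fit  <- tam.mml(resp)                          # partial credit model
gpcm_fit <- tam.mml.2pl(resp, irtmodel = "GPCM")   # generalized partial credit model
summary(pcm_fit)    # item-category threshold estimates for the PCM
summary(gpcm_fit)   # slope (discrimination) estimates in addition to thresholds
tam.fit(pcm_fit)    # infit and outfit statistics for evaluating PCM model-data fit
```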
Limitations and Directions for Future Research
This study has several limitations that warrant consideration in future research. First, the analyses were limited to one real data analysis of an affective survey administered to children and a simulation study with a relatively small set of conditions. Researchers should consider the characteristics of the real data and the simulation design before generalizing the results beyond the current analysis. For example, researchers could explore the sensitivity of the PCM and GPCM to rating scale malfunctioning in other assessment contexts, including other types of affective or diagnostic scales and participant samples that differ from the FBS and the sample of children included in this study, as well as additional rating scale lengths, item sample sizes, examinee sample sizes, and proportions of items that exhibit rating scale malfunctioning. Researchers could also consider differences in model sensitivity to rating scale malfunctioning under different data collection designs, such as the incomplete rating designs that often appear in rater-mediated educational performance assessments (Wind & Jones, 2019), and other types of missing data that occur in contexts such as affective survey research (Bodner, 2006).
In future studies, researchers could examine the sensitivity of the PCM and GPCM to indicators of rating scale functioning that were not included in the current analysis, including Linacre’s (2002, 2004) suggestions to consider the frequency of observations in each category, the distributional shape of observations across categories, model-data fit for categories, average participant location estimates within categories, and coherence between measures and ratings. Researchers could also consider alternative approaches to evaluating the distance between adjacent thresholds besides Linacre’s recommended minimum value as determined by Equation 1 and the maximum value of 5 logits.
Finally, researchers could consider sensitivity to rating scale malfunctioning using models besides the PCM and GPCM. For example, researchers could explore the degree to which polytomous IRT models based on cumulative category probabilities, such as the GRM, or nonparametric polytomous Mokken scaling models (Molenaar, 1997) provide insight into rating scale malfunctioning. Although these models do not allow rating scale category thresholds to be disordered, they may still offer insight into rating scale functioning that can inform the interpretation and use of measurement instruments with ordinal ratings. Along the same lines, researchers could consider the sensitivity of multidimensional models for ordinal ratings (e.g., Bonifay, 2019; Briggs & Wilson, 2003; Reckase, 2009) to issues related to rating scale functioning.
Supplemental Material
Supplemental material, sj-docx-1-epm-10.1177_00131644221116292 for Detecting Rating Scale Malfunctioning With the Partial Credit Model and Generalized Partial Credit Model by Stefanie A. Wind in Educational and Psychological Measurement
Footnotes
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author received no financial support for the research, authorship, and/or publication of this article.
ORCID iD: Stefanie A. Wind, https://orcid.org/0000-0002-1599-375X
Supplemental Material: Supplemental material for this article is available online.
References
- Adams R. J., Wu M. L., Wilson M. (2012). The Rasch rating model and the disordered threshold controversy. Educational and Psychological Measurement, 72(4), 547–573. 10.1177/0013164411432166
- Andrich D. A. (1978). A rating formulation for ordered response categories. Psychometrika, 43(4), 561–573. 10.1007/BF02293814
- Andrich D. A. (2010, June). The detection of a structural halo when multiple criteria have the same generic categories for rating [Conference session]. International Conference on Rasch Measurement, Copenhagen, Denmark.
- Andrich D. A. (2013). An expanded derivation of the threshold structure of the polytomous Rasch model that dispels any “threshold disorder controversy.” Educational and Psychological Measurement, 73(1), 78–124. 10.1177/0013164412450877
- Andrich D. A. (2015). The problem with the step metaphor for polytomous models for ordinal assessments. Educational Measurement: Issues and Practice, 34(2), 8–14. 10.1111/emip.12074
- Bodner T. E. (2006). Missing data: Prevalence and reporting practices. Psychological Reports, 99(3), 675–680. 10.2466/PR0.99.3.675-680
- Bond T. G., Yan Z., Heene M. (2020). Applying the Rasch model: Fundamental measurement in the human sciences (4th ed.). Routledge.
- Bonifay W. (2019). Multidimensional item response theory (1st ed.). SAGE.
- Bozdağ F., Bilge F. (2022). Scale adaptation for refugee children: Sense of school belonging and social contact. Journal of Psychoeducational Assessment. Advance online publication. 10.1177/07342829221094402
- Briggs D. C., Wilson M. (2003). An introduction to multidimensional measurement using Rasch models. Journal of Applied Measurement, 4(1), 87–100.
- Buchholz J., Hartig J. (2019). Comparing attitudes across groups: An IRT-based item-fit statistic for the analysis of measurement invariance. Applied Psychological Measurement, 43(3), 241–250. 10.1177/0146621617748323
- Chen W.-H., Thissen D. (1997). Local dependence indexes for item pairs using item response theory. Journal of Educational and Behavioral Statistics, 22(3), 265–289.
- Crocker L., Algina J. (1986). Introduction to classical and modern test theory. Holt, Rinehart and Winston.
- DeAyala R. J. (2009). The theory and practice of item response theory. The Guilford Press.
- Embretson S. E., Reise S. P. (2000). Item response theory for psychologists. Lawrence Erlbaum.
- Engelhard G., Wind S. A. (2018). Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments. Taylor & Francis.
- Engelhard G., Jr., Wang J., Wind S. A. (2018). A tale of two models: Psychometric and cognitive perspectives on rater-mediated assessments using accuracy ratings. Psychological Test and Assessment Modeling, 60(1), 33–52.
- Finch H. (2011). The impact of missing data on the detection of nonuniform differential item functioning. Educational and Psychological Measurement, 71(4), 663–683.
- Forrest C. B., Tucker C. A., Ravens-Sieberer U., Pratiwadi R., Moon J., Teneralli R. E., Becker B., Bevans K. B. (2016). Concurrent validity of the PROMIS® pediatric global health measure. Quality of Life Research, 25(3), 739–751. 10.1007/s11136-015-1111-7
- Green S. B., Yang Y. (2015). Evaluation of dimensionality in the assessment of internal consistency reliability: Coefficient alpha and omega coefficients. Educational Measurement: Issues and Practice, 34(4), 14–20. 10.1111/emip.12100
- Haddad C., Khoury C., Salameh P., Sacre H., Hallit R., Kheir N., Obeid S., Hallit S. (2021). Validation of the Arabic version of the eating attitude test in Lebanon: A population study. Public Health Nutrition, 24(13), 4132–4143. 10.1017/S1368980020002955
- Hagedoorn E. I., Paans W., Jaarsma T., Keers J. C., van der Schans C. P., Luttik M. L., Krijnen W. P. (2018). Translation and psychometric evaluation of the Dutch Families Importance in Nursing Care: Nurses’ attitudes scale based on the Generalized Partial Credit Model. Journal of Family Nursing, 24(4), 538–562. 10.1177/1074840718810551
- Hambleton R. K., van der Linden W. J., Wells C. S. (2010). IRT models for the analysis of polytomously scored data: Brief and selected history of model building advances. In Nering M. L., Ostini R. (Eds.), Handbook of polytomous item response theory models (pp. 21–42). Routledge.
- Kornetti D. L., Fritz S. L., Chiu Y.-P., Light K. E., Velozo C. A. (2004). Rating scale analysis of the Berg balance scale. Archives of Physical Medicine and Rehabilitation, 85(7), 1128–1135. 10.1016/j.apmr.2003.11.019
- Linacre J. M. (1989). Many-facet Rasch measurement. MESA Press.
- Linacre J. M. (2002). Optimizing rating scale category effectiveness. Journal of Applied Measurement, 3(1), 85–106.
- Linacre J. M. (2004). Optimizing rating scale category effectiveness. In Smith E. V., Jr., Smith R. M. (Eds.), Introduction to Rasch measurement theory: Models and applications (pp. 258–278). JAM Press.
- Luo G. (2005). The relationship between the Rating Scale and Partial Credit models and the implication of disordered thresholds of the Rasch models for polytomous responses. Journal of Applied Measurement, 6(4), 443–455.
- Masters G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47(2), 149–174. 10.1007/BF02296272
- Mellenbergh G. J. (1995). Conceptual notes on models for discrete polytomous item responses. Applied Psychological Measurement, 19(1), 91–100. 10.1177/014662169501900110
- Molenaar I. W. (1997). Nonparametric models for polytomous responses. In van der Linden W. J., Hambleton R. K. (Eds.), Handbook of modern item response theory (pp. 369–380). Springer.
- Moors G. (2008). Exploring the effect of a middle response category on response style in attitude measurement. Quality & Quantity, 42(6), 779–794. 10.1007/s11135-006-9067-x
- Muraki E. (1997). A generalized partial credit model. In van der Linden W. J., Hambleton R. K. (Eds.), Handbook of modern item response theory (pp. 153–164). Springer. 10.1007/978-1-4757-2691-6_9
- Muraki E., Muraki M. (2018). Generalized partial credit model. In van der Linden W. J. (Ed.), Handbook of item response theory (Vol. 1, pp. 127–138). CRC Press.
- Nering M. L., Ostini R. (Eds.). (2010). Handbook of polytomous item response theory models. Routledge.
- R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
- Reckase M. D. (2009). Multidimensional item response theory (1st ed.). Springer.
- Robitzsch A., Kiefer T., Wu M. (2020). TAM: Test analysis modules (Version 3.5-19) [Computer software]. https://CRAN.R-project.org/package=TAM
- Samejima F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34(Part 2, No. 17), 1–97.
- Seol H. (2016). Using the bootstrap method to evaluate the critical range of misfit for polytomous Rasch fit statistics. Psychological Reports, 118(3), 937–956.
- Smith R. M. (2004). Fit analysis in latent trait models. In Smith E. V., Smith R. M. (Eds.), Introduction to Rasch measurement (pp. 73–92). JAM Press.
- Van Zile-Tamsen C. (2017). Using Rasch analysis to inform rating scale development. Research in Higher Education, 58(8), 922–933.
- Walker A. A., Jennings J. K., Engelhard G. (2018). Using person response functions to investigate areas of person misfit related to item characteristics. Educational Assessment, 23(1), 47–68. 10.1080/10627197.2017.1415143
- Waugh R. F. (2002). Creating a scale to measure motivation to achieve academically: Linking attitudes and behaviours using Rasch measurement. British Journal of Educational Psychology, 72(1), 65–86. 10.1348/000709902158775
- Wesolowski B. C., Wind S. A., Engelhard G. (2016). Examining rater precision in music performance assessment: An analysis of rating scale structure using the Multifaceted Rasch Partial Credit Model. Music Perception, 33(5), 662–678. 10.1525/mp.2016.33.5.662
- Wind S. A., Guo W. (2019). Exploring the combined effects of rater misfit and differential rater functioning in performance assessments. Educational and Psychological Measurement, 79(5), 962–987. 10.1177/0013164419834613
- Wind S. A., Jones E. (2019). The effects of incomplete rating designs in combination with rater effects. Journal of Educational Measurement, 56(1), 76–100. 10.1111/jedm.12201
- Wind S. A., Tsai C.-L., Grajeda S. B., Bergin C. (2018). Principals’ use of rating scale categories in classroom observations for teacher evaluation. School Effectiveness and School Improvement, 29(3), 485–510. 10.1080/09243453.2018.1470989
- Wind S. A., Walker A. A. (2019). Exploring the correspondence between traditional score resolution methods and person fit indices in rater-mediated writing assessments. Assessing Writing, 39, 25–38. 10.1016/j.asw.2018.12.002
- Wolfe E. W. (2013). A bootstrap approach to evaluating person and item fit to the Rasch model. Journal of Applied Measurement, 14(1), 1–9.
- Wolfe E. W., Jiao H., Song T. (2014). A family of rater accuracy models. Journal of Applied Measurement, 16(2), 153–160.
- Wright B. D., Masters G. N. (1982). Rating scale analysis: Rasch measurement. MESA Press.
- Wright B. D., Mok M. M. C. (2004). An overview of the family of Rasch measurement models. In Smith E. V., Smith R. M. (Eds.), Introduction to Rasch measurement (pp. 1–24). JAM Press.
- Wu M., Adams R. J. (2013). Properties of Rasch residual fit statistics. Journal of Applied Measurement, 14(4), 339–355.