Educational and Psychological Measurement. 2016 Apr 18;77(2):330–350. doi: 10.1177/0013164416643826

Adjacent-Categories Mokken Models for Rater-Mediated Assessments

Stefanie A. Wind

Abstract

Molenaar extended Mokken’s original probabilistic-nonparametric scaling models for use with polytomous data. These polytomous extensions of Mokken’s original scaling procedure have facilitated the use of Mokken scale analysis as an approach to exploring fundamental measurement properties across a variety of domains in which polytomous ratings are used, including rater-mediated educational assessments. Because their underlying item step response functions (i.e., category response functions) are defined using cumulative probabilities, polytomous Mokken models can be classified as cumulative models based on the classifications of polytomous item response theory models proposed by several scholars. In order to permit a closer conceptual alignment with educational performance assessments, this study presents an adjacent-categories variation on the polytomous monotone homogeneity and double monotonicity models. Data from a large-scale rater-mediated writing assessment are used to illustrate the adjacent-categories approach, and results are compared with the original formulations. Major findings suggest that the adjacent-categories models provide additional diagnostic information related to individual raters’ use of rating scale categories that is not observed under the original formulation. Implications are discussed in terms of methods for evaluating rating quality.

Keywords: Mokken scaling, rating quality, polytomous item response theory models


Mokken (1971) presented a set of probabilistic-nonparametric models that can be used to explore fundamental measurement properties, including scalability, monotonicity, and invariant ordering, and that are less restrictive in terms of model requirements than parametric models based on similar properties. Building upon the original dichotomous formulations of Mokken's (1971) monotone homogeneity (MH) and double monotonicity (DM) models, Molenaar (1982, 1997) proposed polytomous extensions of the MH and DM models that facilitate the use of Mokken scale analysis as an approach to exploring fundamental measurement properties across a variety of domains in which polytomous ratings are used, including rater-mediated educational assessments (e.g., Wind, 2014, 2015; Wind & Engelhard, 2016).

When examining the results from any polytomous item response theory (IRT) model, including the polytomous MH and DM models (Molenaar, 1982, 1997), it is essential to consider how the rating scale categories are defined within the model formulation. Molenaar's polytomous formulations of the MH and DM models are based on item step response functions (ISRFs; i.e., category response functions) that are specified based on cumulative probabilities. Specifically, for m rating scale categories, a set of m − 1 ISRFs can be drawn for each item that reflect the conditional probability of a response in or above a given category in the ordinal rating scale. Given this definition, Molenaar (1997, p. 370) observed that his polytomous Mokken models are classified as cumulative models. These models are distinct from adjacent-categories models (i.e., divide-by-total models; Andrich, 2015) and continuation-ratio models. Specifically, adjacent-categories models define the probability of a rating in category k using a ratio based on the probabilities for a rating in category k and the category just below it (k − 1). In contrast, continuation-ratio models describe the probability of a rating in category k or higher using the ratio of the probability of a rating in category k or higher to the probability of a rating in category k − 1 or higher.
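To make the distinction concrete, the three classes of category probabilities can be written as shown below. This is a standard presentation from the polytomous IRT literature (e.g., Penfield, 2014); the notation is added here for illustration and is not taken verbatim from the sources cited above.

```latex
% Category probability definitions for the three model classes:
% item i, step k, latent variable theta.
\begin{align*}
\text{Cumulative:} \quad & P(X_i \geq k \mid \theta) \\[4pt]
\text{Adjacent categories:} \quad & P(X_i = k \mid X_i \in \{k-1,\, k\},\, \theta)
  = \frac{P(X_i = k \mid \theta)}{P(X_i = k-1 \mid \theta) + P(X_i = k \mid \theta)} \\[4pt]
\text{Continuation ratio:} \quad & P(X_i \geq k \mid X_i \geq k-1,\, \theta)
\end{align*}
```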

The distinction between these models, along with the conceptualization of the underlying psychological function associated with different formulations, has been a subject of discussion and debate in the polytomous IRT literature (e.g., Andrich, 2015; Penfield, 2014). Recently, Andrich (2015) cautioned researchers to carefully consider the appropriateness of various threshold specifications across domains. In particular, Andrich noted an important distinction between cumulative and adjacent-categories models that has implications for educational assessments. Describing models based on cumulative probabilities, he observed,

the model specifies the probability that a person will be classified above any threshold, that is, assessed in a category or in any category above it, and not in a specific category. This does not seem consistent with performance assessment—judges locate a performance in one of the categories, not in and beyond any particular category. (p. 6)

In order to permit a closer conceptual alignment with educational performance assessments (Andrich, 2015), this study presents an adjacent-categories variation on the polytomous MH and DM models.

Purpose

The purpose of this study is to illustrate an adjacent-categories formulation of the polytomous MH and DM models as a potentially more appropriate application of Mokken scale analysis (MSA) to educational performance assessments. The implications of the new formulation are considered in terms of three Mokken-based indicators of rating quality: (a) rater scalability, (b) rater monotonicity, and (c) invariant rater ordering.

Three research questions are used to organize this study:

  • Research Question 1: How can Molenaar’s (1982, 1997) polytomous formulations of the MH and DM models be redefined in terms of adjacent categories?

  • Research Question 2: How do conclusions regarding rating quality vary across the original (cumulative) and adjacent-categories formulations of the MH and DM models?

  • Research Question 3: What additional diagnostic information about rating quality is provided by the adjacent-categories formulations of the MH and DM models?

This study contributes to previous research in several ways. First, it introduces an alternative formulation of the polytomous MH and DM models that is more conceptually aligned with educational assessments—thus facilitating the more appropriate application of MSA to this context. In addition, the explicit consideration of the category specification highlights the importance of considering how categories are specified in polytomous IRT models in general, and in MSA in particular. Specifically, observed differences between the two approaches allow researchers to examine the implications of category specifications in terms of the diagnostic information that is provided regarding measurement quality, and the utility of this information across contexts.

Furthermore, this study provides researchers with additional tools for exploring rating quality from a nonparametric perspective. Although a variety of indicators of rating quality have been proposed based on polytomous parametric IRT models (e.g., Engelhard, 1994; Wolfe & McVay, 2012), very few nonparametric indicators have been proposed that are based on a coherent measurement framework. The methods illustrated in this study facilitate the investigation of a variety of measurement properties that can be explored prior to the application of parametric models. Finally, this study illustrates an approach to rating quality analysis that goes beyond the methods that are commonly used in large-scale performance assessment systems. Most operational performance assessment systems rely on indicators of rating quality based on indicators of agreement or rater-reliability that do not allow for the exploration of rating quality at the individual rater level or in terms of fundamental measurement properties (Johnson, Penny, & Gordon, 2009). The rating quality indices presented here offer an additional set of techniques for exploring rating quality that is conceptually aligned with rater-mediated educational assessments and is based on a coherent measurement framework.

Mokken Scaling Models

Mokken (1971) presented two scaling models for use with dichotomous items. First, the monotone homogeneity (MH) model is based on three major requirements:

  • Unidimensionality: Item responses reflect evidence of a single latent variable

  • Local Independence: Responses to an item are not influenced by responses to any other item, after controlling for the latent variable

  • Monotonicity: The probability for a positive or correct response is non-decreasing across increasing values of the latent variable.

The double monotonicity (DM) model is a special case of the MH model that adds a fourth requirement:

  • Nonintersecting item response functions: Item difficulty ordering is consistent across the range of the latent variable.

In practice, restscores (R) are used as estimates of latent variable locations (θ; see Hemker, Sijtsma, Molenaar, & Junker, 1997; Junker, 1993; Sijtsma & Molenaar, 2002), where R is defined for an individual student as the total score across items minus the score on the item or items being evaluated (Junker & Sijtsma, 2000). The MH and DM model requirements are evaluated using indicators of scalability, monotonicity, and nonintersection (see Sijtsma & Molenaar, 2002, for additional details about the dichotomous Mokken models).
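As a concrete illustration of the restscore definition, the following minimal R sketch computes R for one item from a persons × items score matrix; the matrix X, its dimensions, and the function name are hypothetical.

```r
# Restscore for person j on item i: total score across all items
# minus the score on the item being evaluated (Junker & Sijtsma, 2000).
set.seed(123)
X <- matrix(sample(0:3, 100 * 5, replace = TRUE), nrow = 100)  # toy ratings

restscore <- function(X, i) rowSums(X) - X[, i]

R1 <- restscore(X, 1)  # restscores used when evaluating item 1
head(R1)
```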

Polytomous Mokken Models

Molenaar’s (1982, 1997) polytomous formulations of the MH and DM models are based on the same underlying requirements as the original dichotomous formulations. However, under the polytomous formulation, the requirements are evaluated in terms of the item step response function (ISRF). As defined by Molenaar (1997, p. 370), ISRFs for a rating scale with m categories can be conceptualized as a set of m − 1 steps (τ_ijk) that reflect whether the observed rating on item i for person j is at or above category k, where:

τ_ijk = 1 when X_ij ≥ k, and τ_ijk = 0 otherwise,   (1)

where X_ij is the observed score on item i for person j.
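A minimal R sketch of this definition: for one item, the cumulative ISRFs P(X_ij ≥ k) are estimated within restscore groups. The toy data and the four-group split are assumptions for illustration only.

```r
# Estimate cumulative ISRFs P(X >= k) for item i within restscore groups.
set.seed(1)
X <- matrix(sample(0:3, 200 * 5, replace = TRUE), nrow = 200)  # toy ratings 0-3
i <- 1
R <- rowSums(X) - X[, i]          # restscores for item i
grp <- cut(R, breaks = 4)         # four restscore groups (illustrative)

isrf_cum <- sapply(1:3, function(k) tapply(X[, i] >= k, grp, mean))
colnames(isrf_cum) <- paste0("step k=", 1:3)
isrf_cum  # monotonicity: each column should be nondecreasing down the rows
```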

The major differences between the dichotomous and polytomous formulations are evident in the monotonicity and nonintersection requirements. In addition to being evaluated at the overall item level, these requirements are evaluated for ISRFs: the polytomous MH model requires monotonicity of ISRFs, and the polytomous DM model additionally requires nonintersection of ISRFs. The monotonicity requirement for ISRFs is as follows:

  • Monotonicity: The conditional probability for a rating in category k or higher is nondecreasing over increasing values of the latent variable.

In practice, this requirement is evaluated by examining plots of ISRFs for evidence of non-decreasing cumulative probabilities over increasing restscores. The first plot in Figure 1, Panel A illustrates this procedure for polytomous ratings. Furthermore, one-sided one-sample hypothesis tests are used to evaluate the null hypothesis that the cumulative probability of a rating in category k or higher is equal across adjacent restscore groups, against the alternative hypothesis that the cumulative probability of a rating in category k or higher is lower in the group with a higher restscore, which is a violation of monotonicity.

Figure 1. Inconsistent conclusions based on item step response functions (ISRFs). MH-R = monotone homogeneity for ratings; ac-MH = adjacent-categories monotone homogeneity.

Likewise, the polytomous DM model requires nonintersecting ISRFs:

  • Nonintersecting item step response functions: The conditional probability for a rating in category k or higher on item i has the same relative ordering across all values of the latent variable.

Violations of the nonintersection requirement occur for a pair of items ordered such that item i is more difficult than item j when, within one or more restscore groups, the cumulative probability of a rating in category k or higher on item i is higher than the cumulative probability of a rating in category k or higher on item j. These violations are detected using plots of ISRFs for pairs of items and statistical hypothesis tests.

In their discussion of invariant ordering for polytomous items, Ligtvoet, van der Ark, Bergsma, and Sijtsma (2011) and Ligtvoet, van der Ark, te Marvelde, and Sijtsma (2010) observed that the interpretation of invariant item ordering (IIO) at the overall item level is more meaningful than invariant ordering of ISRFs (Sijtsma, Meijer, & van der Ark, 2011). As a more practically useful method, Ligtvoet, van der Ark, te Marvelde, et al. (2010) proposed the manifest invariant item ordering (MIIO) technique to check the assumption of IIO in polytomous data using average scores on items. First, items are ordered according to overall difficulty based on their average scores. Then, the degree to which this ordering holds across restscore groups is examined within pairs of items using plots of IRFs and hypothesis tests.

Polytomous Mokken Models for Ratings

Recently, Wind and Engelhard (2016) proposed the application of the polytomous MH and DM models to the context of rater-mediated educational assessments through the use of the monotone homogeneity for ratings (MH-R) and double monotonicity for ratings (DM-R) models. These models are based on the same underlying requirements as the original formulations, with the exception that model requirements are evaluated for raters instead of items. Specifically, raters are treated as a type of “item,” and checks for monotonicity, scalability, and invariant ordering are performed to evaluate rating quality. Additional details about Mokken-based indicators of rating quality are provided in the data analysis section.

Adjacent-Categories Mokken Models for Ratings

In addition to the original cumulative formulation proposed by Molenaar (1982), it is also possible to define Mokken ISRFs using adjacent categories. This approach provides a potentially more appropriate formulation of the MH and DM models for rater-mediated educational assessments (Andrich, 2015). An adjacent-categories formulation of the Mokken ISRF can be specified as follows:

τ_ijk = 1 when X_ij = k, and τ_ijk = 0 when X_ij = k − 1,   (2)

where X_ij is the observed score from rater i for student j. Within a given restscore group, the ISRF is calculated as:

P(X = k | X = k or X = k − 1) = P(X = k) / [P(X = k) + P(X = k − 1)].   (3)

This formulation reflects the adjacent-categories class of polytomous IRT models because the probability of an observation in a given rating scale category is defined using the ratio of the probability for the category of interest to the sum of the probabilities for that category and the category just below it.
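A minimal R sketch of Equations (2) and (3), computing the adjacent-categories ISRF within restscore groups for a single rater; the simulated ratings and the grouping scheme are hypothetical.

```r
# Adjacent-categories ISRF: P(X = k | X = k or X = k - 1) per restscore group.
set.seed(2)
X <- matrix(sample(0:3, 365 * 20, replace = TRUE), nrow = 365)  # toy ratings
i <- 1                               # rater of interest
R <- rowSums(X) - X[, i]             # restscores
grp <- cut(R, breaks = 4)            # restscore groups (illustrative)

isrf_ac <- sapply(1:3, function(k)
  tapply(X[, i], grp, function(x) {
    n_k  <- sum(x == k)              # ratings in the category of interest
    n_k1 <- sum(x == k - 1)          # ratings in the category just below
    if (n_k + n_k1 == 0) NA else n_k / (n_k + n_k1)
  }))
colnames(isrf_ac) <- paste0("step k=", 1:3)
isrf_ac  # rows: restscore groups; columns: steps k = 1, 2, 3
```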

Based on this formulation of the ISRF, the MH and DM model requirements can be re-stated to reflect adjacent categories, while still maintaining the same basic underlying requirements as the original formulation. First, the adjacent-categories monotone homogeneity (ac-MH) model is based on the same underlying requirements as the original polytomous MH model (Molenaar, 1982, 1997), with the exception that the monotonicity requirement for ISRFs is redefined to reflect adjacent categories:

  • Monotonicity: The probability for a rating in category k, rather than category k − 1, is nondecreasing across the range of the latent variable.

This requirement is evaluated by examining plots of ISRFs defined using Equations (2) and (3) for evidence of nondecreasing adjacent-categories probabilities over increasing restscores. The second plot in Figure 1, Panel A illustrates this procedure for polytomous ratings. Furthermore, the one-sided hypothesis test typically used to evaluate the monotonicity requirement for the polytomous MH model can be adapted to reflect the adjacent-categories formulation. Specifically, the null hypothesis that the adjacent-categories probability of a rating in category k (rather than category k − 1) is equal across adjacent restscore groups is evaluated against the alternative hypothesis that this probability is lower in the group with the higher restscore, which would be a violation of monotonicity. Whereas the original (cumulative) formulation defines violations of monotonicity using cumulative proportions across adjacent restscore groups, the adjacent-categories formulation defines violations using adjacent-categories proportions across adjacent restscore groups (described further below).

Likewise, the adjacent-categories double monotonicity (ac-DM) model requires nonintersecting adjacent-categories ISRFs for pairs of items:

  • Nonintersecting adjacent-categories item step response functions: The conditional probability for a rating in category k rather than in category k − 1 from rater i has the same relative ordering across all values of the latent variable.

This requirement is evaluated by examining plots of adjacent-categories ISRFs for pairs of raters for evidence of nonintersection. Because the MIIO procedure is based on overall items, it can also be used to evaluate nonintersection for the ac-DM model with no change from the original formulation.

Method

In order to illustrate the adjacent-categories formulations of Mokken scaling models as a method for exploring rating quality, the ac-MH and ac-DM models are applied to a data set that has been previously analyzed by several researchers within the context of parametric and nonparametric IRT.

Instrument

The illustrative analysis in this study uses data that were previously examined in several other methodological studies on rating quality indicators in the context of both parametric IRT (e.g., Andrich, 2010; Gyagenda & Engelhard, 2009; Wind & Engelhard, 2012) and nonparametric IRT (Wind, 2015; Wind & Engelhard, 2016) using the original formulations of polytomous Mokken scale analysis. The data come from the Georgia High School Writing Test and include scores from 365 eighth-grade students whose persuasive essays were rated by 20 operational raters. Raters assigned a rating from 1 (low) to 4 (high) within four separate domains: Conventions, Organization, Sentence Formation, and Style. All raters scored the entire set of 365 essays, such that the rating design was fully connected (Engelhard, 1997). The ratings were recoded to a 0-to-3 scale (0 = low, 3 = high) prior to analysis.

Data Analysis

The data analysis procedure for this study includes calculating indicators of rating quality based on the original cumulative definition of the ISRF, and then calculating the same indices based on the ac-MH and ac-DM models. Three categories of rating quality indicators are explored: (a) rater monotonicity, (b) rater scalability, and (c) invariant rater ordering. In order to demonstrate the generalizability of the new formulations of the MSA models for evaluating the quality of multiple sets of ratings, the analyses were conducted separately within the four analytic rubric domains of the Georgia High School Writing Test. This section provides a brief overview of the three categories of rating quality indicators. All of the analyses based on the cumulative formulations of the MSA models were conducted using the mokken package for R (van der Ark, 2007, 2012). Analyses based on the adjacent-categories formulations were conducted directly in R (R Core Team, 2015).
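The cumulative-formulation analyses can be reproduced in outline with the mokken package. The following sketch assumes a students × raters matrix named ratings (simulated here, not the Georgia data) and shows the scalability and monotonicity checks; exact output depends on the package version.

```r
# Cumulative-formulation rating quality checks with the mokken package.
library(mokken)

set.seed(42)
ratings <- matrix(sample(0:3, 365 * 20, replace = TRUE), nrow = 365)
colnames(ratings) <- paste0("rater", 1:20)

# Rater scalability: pairwise (Hij), per-rater (Hi), and overall (H)
coefH(ratings)

# Rater monotonicity, using the conventional 0.03 critical value
mono <- check.monotonicity(ratings, minvi = 0.03)
summary(mono)
plot(mono)  # ISRF plots across restscore groups
```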

Rater Monotonicity

The first indicator of rating quality examined in this study is rater monotonicity. Evidence of rater monotonicity suggests consistency between the relative ordering of a set of student performances for a rater of interest and the relative ordering of the performances based on the other raters, as defined by the restscore groups. When rater monotonicity is observed, the first requirement for invariant measurement is met: Students are ordered the same way across all raters (Wright & Stone, 1979). Adherence to this requirement implies that the interpretation of relative student achievement does not depend on the particular rater who scored a student’s performance.

Monotonicity analyses for polytomous ratings involve examining the ISRF for evidence of nondecreasing probabilities across increasing levels of student achievement. For both the cumulative (original) formulation of the ISRF and the adjacent-categories formulation, monotonicity can be evaluated using graphical displays that illustrate ISRFs for individual raters, along with statistical hypothesis tests. The hypothesis test for monotonicity used in this study is based on the one-sided one-sample hypothesis test presented by Molenaar and Sijtsma (2000, p. 72). This procedure is currently implemented in the mokken package for R (van der Ark, 2007, 2012). First, potential violations of monotonicity are identified by comparing the cumulative probability for a rating in a given category across adjacent restscore groups. When the cumulative probability in the lower of the adjacent restscore groups exceeds the cumulative probability in the higher group by more than a predefined critical value (usually 0.03), the violation is examined using a one-sided, one-sample z test. As given in Molenaar and Sijtsma, violations are examined by comparing the frequency of observations below the category of interest to the frequency of observations in the category of interest or higher, as follows:

z = |2(√((a + 1)(d + 1)) − √(bc))| / √(t + 1),   (4)

where a, b, c, d, and t are obtained from a 2 × 2 table showing the frequency of observations below, and at or above, the category of interest in the two restscore groups. For a rating in category k, the values are defined as follows for the cumulative formulation:

  • a = the frequency of observations below category k in the lower restscore group

  • b = the frequency of observations in category k or higher in the lower restscore group

  • c = the frequency of observations below category k in the higher restscore group

  • d = the frequency of observations in category k or higher in the higher restscore group

  • t = the total number of observations for the pair of restscore groups.

Based on the adjacent-categories formulation for ratings, the values can be redefined as follows:

  • a = the frequency of observations in category k − 1 in the lower restscore group

  • b = the frequency of observations in category k in the lower restscore group

  • c = the frequency of observations in category k − 1 in the higher restscore group

  • d = the frequency of observations in category k in the higher restscore group

  • t = the total number of observations for the pair of restscore groups.

Table 1 illustrates this procedure for both approaches.

Table 1.

Values for Evaluating Violations of Monotonicity.

                                      Cumulative                Adjacent categories
                                      F(Xi < k)    F(Xi ≥ k)    F(Xi = k − 1)    F(Xi = k)
Lower restscore group within pair     a            b            a                b
Higher restscore group within pair    c            d            c                d
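The test in Equation (4) can be sketched as a small function applied to the 2 × 2 counts laid out in Table 1; the function name and example counts below are hypothetical, the statistic is written as reconstructed above, and the same function serves both formulations once a, b, c, and d are defined accordingly.

```r
# z test for a candidate monotonicity violation (Equation 4).
# a, b, c, d are the cell counts from Table 1; t is their total.
z_violation <- function(a, b, c, d) {
  t <- a + b + c + d
  abs(2 * (sqrt((a + 1) * (d + 1)) - sqrt(b * c))) / sqrt(t + 1)
}

# Hypothetical counts: lower restscore group (a = 30, b = 20),
# higher restscore group (c = 15, d = 35)
z_violation(30, 20, 15, 35)
```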

Rater Scalability

The second indicator of rating quality examined in this study is rater scalability, which is related to the MH model (Meijer, Tendeiro, & Wanders, 2015). Scalability coefficients are used in MSA to evaluate the impact of Guttman errors on the quality of a scale, where more Guttman errors limit the degree to which item and person total scores can be interpreted in terms of a unidimensional construct (Sijtsma & Molenaar, 2002). For polytomous ratings, scalability coefficients describe the degree to which a set of raters can be ordered to form a scale that describes differences among students in terms of a latent variable. Together, indicators of scalability for rater pairs (Hij), individual raters (Hi), and an overall group of raters (H) describe the degree to which observed total scores across raters provide a meaningful description of student ordering on the latent variable. Low values of rater scalability for individual raters are of particular interest, as they indicate frequent Guttman errors that might suggest idiosyncratic use of a set of rating scale categories.

Scalability coefficients based on the original formulation of polytomous MSA models (Molenaar, 1982, 1997) are calculated as follows. First, the cumulative category probabilities (ISRFs) are used to establish the Guttman pattern for each pair of raters. For example, the Guttman pattern for a rater pair including Rater i and Rater j might be defined as Xi ≥ 1, Xi ≥ 2, Xj ≥ 1, Xj ≥ 2, Xi ≥ 3, Xj ≥ 3 based on the observed cumulative probabilities for each rater. If no Guttman errors were observed, each pair of observed ratings (Xi, Xj) would follow this sequence, such that the expected order with no Guttman errors would be defined as: (0,0), (1,0), (2,0), (2,1), (2,2), (3,2), (3,3). Observations in the other cells of the joint frequency table for this pair of raters are defined as Guttman errors. Weights are defined for each error cell by calculating the number of errors involved in arriving at the score pattern, based on the Guttman pattern established using the cumulative probabilities (for additional details about weights, see Ligtvoet, 2010, pp. 22-24; Molenaar & Sijtsma, 2000, pp. 20-22). The rater pair scalability coefficient for Rater i and Rater j is calculated as:

H_ij = 1 − F_ij / E_ij,   (5)

where F_ij is the weighted observed frequency of Guttman errors and E_ij is the expected frequency of errors under marginal independence. Scalability coefficients for individual raters are calculated using each rater pair scalability coefficient that includes the rater of interest. Similarly, the overall scalability coefficient for a group of raters is calculated using all of the rater pair coefficients.

An adjacent-categories formulation of the scalability coefficient can be calculated in a similar fashion to the original coefficients. However, based on the adjacent-categories approach, the order of ISRFs is defined using adjacent-categories probabilities rather than cumulative probabilities. This change implies that when the ordering of ISRFs differs between the cumulative and adjacent-categories formulations, the values of the scalability coefficients also differ.
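To see how the two formulations can order the ISRFs differently, and hence change H_ij, consider the following sketch. The ratings, the function name, and the ordering rule (steps sorted by decreasing overall probability) follow the description above, but this is an illustration under those assumptions, not the article's exact algorithm.

```r
# Order the ISRF "steps" for a pair of raters by decreasing probability;
# this ordering defines the Guttman pattern used to weight errors.
set.seed(3)
xi <- sample(0:3, 365, replace = TRUE)  # toy ratings from rater i
xj <- sample(0:3, 365, replace = TRUE)  # toy ratings from rater j

step_order <- function(xi, xj, type = c("cumulative", "adjacent")) {
  type <- match.arg(type)
  steps <- expand.grid(rater = c("i", "j"), k = 1:3,
                       stringsAsFactors = FALSE)
  steps$p <- mapply(function(r, k) {
    x <- if (r == "i") xi else xj
    if (type == "cumulative") mean(x >= k)
    else sum(x == k) / (sum(x == k) + sum(x == k - 1))
  }, steps$rater, steps$k)
  steps[order(-steps$p), ]  # most likely step first
}

step_order(xi, xj, "cumulative")
step_order(xi, xj, "adjacent")  # a different order changes the error weights
```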

Invariant Rater Ordering

The third indicator of rating quality explored in this study is invariant rater ordering, which is based on Mokken’s (1971) DM model requirement of invariant ordering. Invariant rater ordering provides evidence of adherence to the second requirement for invariant measurement: raters are ordered the same way across students. Invariant rater ordering facilitates the interpretation of results from rater-mediated assessments in a somewhat different fashion than invariant ordering in traditional Mokken scaling applications. Rather than comparing the observed ordering to an a priori specification related to item severity or a progression of symptoms, invariant rater ordering reflects a fairness concern. Specifically, evidence of consistent rater ordering ensures that conclusions about student achievement do not depend on the rater who happened to score their work, and conclusions about rater severity do not depend on the students whose work they happened to score.

As noted above, recent research on the polytomous DM model suggests that indicators of invariant ordering are more meaningful for overall polytomous items than at the level of rating scale categories (Ligtvoet, van der Ark, Bergsma, et al., 2011; Ligtvoet, van der Ark, te Marvelde, et al., 2010). Accordingly, researchers are encouraged (e.g., Sijtsma et al., 2011) to examine IIO using the MIIO technique, in which nonintersection is examined at the overall item level. In the context of rater-mediated assessments, Wind and Engelhard (2016) proposed the use of the manifest invariant rater ordering (MIRO) procedure for evaluating invariant ordering of raters. MIRO analyses involve ordering raters in terms of their overall severity based on observed average ratings across a group of students. Using this overall ordering, violations are detected when raters’ relative ordering changes across restscore groups. Violations of MIRO can be evaluated using graphical displays and statistical hypothesis tests. Because MIRO is evaluated at the overall rater level, rather than within ISRFs, the tests for invariant rater ordering are equivalent between the original DM-R model and the ac-DM model.
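Because MIRO is, in effect, MIIO applied with raters in the role of items, it can be sketched with the mokken package's check.iio function; the ratings matrix below is simulated for illustration.

```r
# Invariant rater ordering (MIRO): MIIO with raters treated as "items".
library(mokken)

set.seed(4)
ratings <- matrix(sample(0:3, 365 * 20, replace = TRUE), nrow = 365)
colnames(ratings) <- paste0("rater", 1:20)

miro <- check.iio(ratings, method = "MIIO")
summary(miro)  # violations per rater, with significance tests
```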

Results

In this section, results are presented as they relate to rater monotonicity, rater scalability, and invariant rater ordering. Within each of these categories of rating quality indices, results are compared between the cumulative (original) formulations of the MH-R and DM-R models and the adjacent-categories formulations.

Rater Monotonicity

No violations of monotonicity were observed for any of the domains based on the cumulative definition. On the other hand, several violations of monotonicity (violations ≥ 0.03) were observed based on the ac-MH model within the Organization, Sentence Formation, and Style domains. Specifically, one violation of monotonicity was observed for each of the following raters: Raters 5 and 15 (Organization), Raters 3 and 18 (Sentence Formation), and Raters 10, 12, 18, and 19 (Style).

Although the significance test for monotonicity violations (Equation 4) did not indicate significant violations, examination of monotonicity plots for individual raters revealed several interesting patterns related to rating scale category use. The first major pattern was that, for most raters, the two formulations of the models led to consistent conclusions. The second major pattern is illustrated in Figure 1, where the major conclusions related to the two sets of ISRFs changed across the original and adjacent-categories formulations of the MH model. For example, Panel A shows ISRFs for Rater 5 in the Style domain; this rater provides an example of inconsistent conclusions between the two formulations. Whereas no violations were detected based on the cumulative formulation, inspection of the plot for the ac-MH model reveals that the ISRF that describes the probability for a rating in category 1, rather than category 0 (top line in the plot), is nonmonotonic between restscore groups 2 and 3. Similar findings are illustrated in Panel B for Rater 19 in the Organization domain and in Panel C for Rater 6 in the Sentence Formation domain, where violations of monotonicity are seen between the third and fourth restscore groups for the highest ISRF.

Another interesting pattern that was observed among the 20 raters is illustrated in Figure 2, where crossing ISRFs are observed within a rater. For example, Panel A shows ISRFs for Rater 12 in the Style domain. Within the first restscore group, the lowest ISRF, which describes the probability for a rating in category 3, rather than in category 2, is higher than the middle ISRF, which describes the probability for a rating in category 2, rather than in category 1. The intersecting ISRFs for this rater also correspond to violations of monotonicity. A similar result is illustrated in the adjacent-categories ISRFs for Rater 20 in the Sentence Formation domain (Panel B), where crossing ISRFs occur within the first restscore group. Figure 1, Panel C illustrates a somewhat different pattern of crossing ISRFs. Specifically, the highest ISRF, which describes the probability for a rating in category 1, rather than category 0, intersects with the middle ISRF for the third restscore group. It is interesting to note that this finding of crossing ISRFs does not correspond with a violation of monotonicity.

Figure 2. Crossing item step response functions (ISRFs) within raters. MH-R = monotone homogeneity for ratings; ac-MH = adjacent-categories monotone homogeneity.

Whereas the cumulative definition of ISRFs does not allow crossing of ISRFs within an item (Sijtsma & Molenaar, 2002), it is possible to observe crossing ISRFs based on the adjacent-categories formulation presented here. The finding of crossing ISRFs within raters provides interesting diagnostic information regarding individual raters' category use. Specifically, evidence of crossing ISRFs suggests rating scale category disordering for one or more restscore groups. It is important to note that crossing ISRFs within an individual rater are different from crossing ISRFs within pairs of raters, which suggest violations of invariant ordering. Rather than identifying violations of the invariant ordering requirement, crossing ISRFs within an individual rater suggest that the rater may have an idiosyncratic interpretation of the rating scale categories that warrants further investigation.

Rater Scalability

Table 2 includes individual rater scalability coefficients for the 20 raters using the cumulative and adjacent-categories formulations of Equation (5) across the four writing domains. As can be seen in the table, there are differences in rater scalability between the two formulations that result from differences in the relative ordering of ISRFs under each approach. When the coefficients were calculated using the original formulation, all of the raters had strong scalability within all four domains, based on Mokken's (1971) critical values for interpreting scalability coefficients (Hi ≥ 0.50, strong; 0.40 ≤ Hi < 0.50, medium; 0.30 ≤ Hi < 0.40, weak).

Table 2.

Scalability Results.

Rater   Conventions Hi    Organization Hi    Sentence Formation Hi    Style Hi
        MH-R    ac-MH     MH-R    ac-MH      MH-R    ac-MH            MH-R    ac-MH
1 0.81 0.79 0.80 0.74 0.84 0.30a 0.77 0.77
2 0.78 0.78 0.74 0.71 0.78 0.64 0.76 0.77
3 0.82 0.82 0.79 0.75 0.81 0.32a 0.78 0.79
4 0.81 0.79 0.81 0.72 0.78 0.30a 0.77 0.72
5 0.83 0.81 0.79 0.73 0.81 0.69 0.76 0.75
6 0.76 0.75 0.80 0.47 0.78 0.57 0.74 0.75
7 0.83 0.83 0.79 0.73 0.82 0.66 0.78 0.77
8 0.83 0.79 0.83 0.75 0.86 0.66 0.82 0.80
9 0.82 0.81 0.74 0.71 0.82 0.66 0.78 0.79
10 0.82 0.79 0.76 0.70 0.83 0.68 0.78 0.79
11 0.81 0.78 0.79 0.73 0.81 0.33a 0.78 0.80
12 0.83 0.83 0.78 0.72 0.80 0.63 0.78 0.79
13 0.84 0.75 0.78 0.72 0.85 0.42a 0.76 0.76
14 0.79 0.75 0.83 0.74 0.81 0.37a 0.77 0.78
15 0.78 0.77 0.76 0.71 0.81 0.65 0.78 0.76
16 0.83 0.81 0.78 0.73 0.82 0.68 0.80 0.80
17 0.77 0.78 0.75 0.71 0.81 0.70 0.75 0.73
18 0.75 0.78 0.77 0.71 0.77 0.70 0.76 0.73
19 0.81 0.81 0.78 0.73 0.83 0.66 0.78 0.79
20 0.78 0.78 0.75 0.38a 0.80 0.13a 0.74 0.74
Overall H 0.81 0.79 0.78 0.67 0.81 0.67 0.77 0.77

Note. MH-R = monotone homogeneity for ratings; ac-MH = adjacent-categories monotone homogeneity.

a. Indicates a change in scalability classification based on the following criteria: Hi ≥ 0.50, strong; 0.40 ≤ Hi < 0.50, medium; 0.30 ≤ Hi < 0.40, weak (Mokken, 1971).

Examination of scalability coefficients for individual raters based on the adjacent-categories formulation reveals that the classification of raters based on Mokken’s critical values remains consistent for most of the raters. However, some changes are observed in the Organization and Sentence Formation domains, where strong scalability coefficients are observed based on the original formulation that correspond to medium or weak scalability coefficients based on the adjacent-categories formulation. Table 2 also includes overall rater scalability coefficients for each domain. As can be seen in the table, the raters formed strong Mokken scales across all four domains based on both formulations of the ISRFs.

Invariant Rater Ordering

Following Ligtvoet, van der Ark, Bergsma, et al. (2011), Ligtvoet, van der Ark, te Marvelde, et al. (2010), and Sijtsma et al. (2011), invariant ordering was explored for overall raters, rather than within rating scale categories. As a result, the results are equivalent between the cumulative and adjacent-categories formulations of the DM model. These results are summarized in Table 3, where it can be seen that significant violations were observed most frequently for Rater 13, with 10 significant violations in the Sentence Formation domain and 9 significant violations in the Conventions domain. For additional details about these results, see Wind and Engelhard (2016).

Table 3.

Invariant Rater Ordering Results.

Rater   Number of violations (number significant)
        Conventions   Organization   Sentence Formation   Style
1 4 (3) 0 (0) 9 (6) 1 (0)
2 2 (1) 1 (0) 2 (2) 0 (0)
3 5 (3) 1 (0) 2 (2) 2 (0)
4 8 (5) 6 (3) 2 (1) 3 (0)
5 0 (0) 2 (1) 1 (1) 4 (0)
6 5 (0) 2 (0) 3 (2) 1 (0)
7 7 (5) 5 (2) 2 (1) 2 (1)
8 3 (1) 0 (0) 4 (1) 0 (0)
9 2 (1) 2 (0) 2 (1) 2 (0)
10 5 (3) 0 (0) 0 (0) 2 (0)
11 7 (4) 4 (2) 3 (2) 3 (0)
12 0 (0) 3 (1) 3 (2) 2 (0)
13 13 (9) 3 (1) 13 (10) 2 (0)
14 12 (7) 3 (1) 3 (1) 0 (0)
15 8 (5) 8 (1) 4 (2) 5 (3)
16 0 (0) 4 (0) 5 (1) 7 (2)
17 3 (1) 7 (0) 4 (3) 2 (0)
18 4 (2) 5 (1) 12 (8) 1 (0)
19 6 (4) 4 (1) 4 (2) 4 (1)
20 4 (0) 6 (2) 2 (0) 1 (1)

Summary and Conclusions

This study illustrated an adjacent-categories formulation of the polytomous MH and DM models as a potentially more appropriate approach to the application of MSA to the context of educational performance assessments. The use of the ac-MH and ac-DM models for evaluating rating quality was demonstrated using data from a large-scale rater-mediated writing assessment, and the results were compared to the original cumulative formulation of the MH and DM models. In this section, results are presented in terms of the three guiding research questions for this study.

  • Research Question 1: How can Molenaar’s (1982, 1997) polytomous formulations of the MH and DM models be redefined in terms of adjacent categories?

This study presented adjacent-categories formulations of the polytomous MH and DM models. In order to arrive at the adjacent-categories formulation, the specification of the ISRF was modified to reflect the probability for a rating in a given category, rather than a rating in the category just below it. Methods for evaluating rater monotonicity and scalability were adapted to reflect the adjacent-categories formulation. Specifically, monotonicity was evaluated for ISRFs using the adjacent-categories formulation, and the method for identifying Guttman errors was adapted to reflect category ordering based on adjacent-categories probabilities. Methods for evaluating invariant rater ordering were not changed, as these methods are based on overall average ratings.

  • Research Question 2: How do conclusions regarding rating quality vary across the original (cumulative) and adjacent-categories formulations of the MH and DM models?

The second research question for this study focused on the comparison between the cumulative and adjacent-categories formulations of the MH and DM models in terms of conclusions about rating quality. In terms of monotonicity, several violations were identified based on the ac-MH model that were not detected based on the cumulative MH model. Next, different values of rater scalability coefficients were observed for several raters based on the ac-MH model. Based on the adjacent-categories formulation, the scalability coefficients for eight raters resulted in lower classifications using Mokken's (1971) scalability criteria. Taken together, these findings suggest that the use of the adjacent-categories formulation can result in different conclusions about monotonicity and scalability for individual raters than would result from the original formulation of polytomous MSA models (Molenaar, 1982, 1997).

  • Research Question 3: What additional diagnostic information about rating quality is provided by the adjacent-categories formulations of the MH and DM models?

The third research question focused on the added diagnostic value of the adjacent-categories formulation of the MSA models in the context of rater-mediated assessments. Across the four domains, findings indicated scalability coefficient classifications, model violations, and patterns of rating scale category use under the adjacent-categories models that were not observed under the original specifications. Thus, the new specifications of the models provide diagnostic information at the individual rater level that can be used to guide further investigations related to rating quality, inform rater training or retraining procedures, and inform the development or revision of scoring materials such as rubrics or rating scales.

Furthermore, inspection of adjacent-categories ISRFs for individual raters revealed an interesting diagnostic opportunity related to rating scale category use. Specifically, when ISRFs are defined using the adjacent-categories formulation, it is possible to observe crossing ISRFs within individual raters that may signal idiosyncratic interpretation and application of a rating scale. Additional research is needed to more fully explore rating scale category disordering for individual raters, particularly as it relates to similar findings of category disordering based on parametric IRT models.

The major implication of these findings is that the adjacent-categories formulation of polytomous MSA models provides an additional diagnostic tool for exploring rating quality. Because the ac-MH and ac-DM models are based on the same underlying principles as the original Mokken models, the methods presented here provide researchers and directors of scoring centers with additional graphical and statistical techniques for evaluating rating quality within the framework of invariant measurement, and their interpretation is more closely aligned with educational performance assessments than that of methods based on cumulative IRT models. Specifically, the alternative specification of the polytomous MSA models presented here overcomes a limitation of the original formulations in the context of educational performance assessments: the adjacent-categories formulation is more conceptually aligned with the purposes and intended interpretations of scores from rater-mediated assessments. Furthermore, the adjacent-categories formulation facilitates the investigation of rating scale category functioning in terms of individual categories at the individual rater level. Because of the cumulative definition of ISRFs, this diagnostic information is not provided under the original specification of the MH and DM models.

As pointed out by Andrich (2015), the interpretation of results from any polytomous IRT model depends on the specification of the category probabilities, which should be selected to match the underlying processes specific to the context in which a rating scale is to be used. As was also observed in the current study, Andrich noted that the use of the adjacent-categories specification provides additional diagnostic information regarding rating scale category ordering. In his words:

the process behind the adjacent category models is the one consistent with assessments in ordered categories, and finally to demonstrate that when the adjacent category model . . . is used, the empirical ordering of the categories can be studied, understood, and where necessary improved to be consistent with the all-important intended ordering. (p. 6)

Using a comparison of the cumulative and adjacent-categories formulations of the polytomous MH and DM models in the context of a rater-mediated assessment, this study highlighted the added diagnostic information related to individual raters and rating scale categories that is obtained through the adjacent-categories approach. This information can be used to guide further investigations related to rating quality, inform rater training or retraining procedures, and inform the development or revision of scoring materials such as rubrics or rating scales.

In addition to the theoretical implications, the methods presented in this study have practical implications for operational assessment settings. As noted above, although a variety of rating quality indices based on modern measurement frameworks have been proposed, most large-scale operational performance assessment systems rely on indicators of rater agreement and rater reliability as evidence of rating quality (Johnson, Penny, & Gordon, 2009). The methods for exploring rating quality presented here go beyond these group-level indicators of rating quality, and provide a method for exploring rating quality at the individual rater level and in terms of fundamental measurement properties.

Limitations

When considering the results from this study, several limitations are important to note. In particular, this study used real data from an educational performance assessment. Because real data were used, it was not possible to identify critical values or conditions under which certain indicators of rating quality (monotonicity, scalability, or invariant ordering) will differ across models; such goals would be better suited to analyses with simulated data. However, because the motivation for the adjacent-categories formulation is its closer conceptual alignment with educational achievement tests, real data from an educational achievement test were viewed as a useful starting point for this initial investigation of the models.

Future research is needed based on simulated data in order to more fully explore the comparison between the adjacent-categories MSA models and the original formulation, as well as the comparison with other nonparametric and parametric models for exploring rating quality. In particular, simulation studies should be used to explore in greater depth the adjacent-categories formulation of the scalability coefficients, as well as standard errors for this alternative specification, in terms of their comparison with standard errors based on the original specification (Kuijpers, van der Ark, & Croon, 2013).

Acknowledgments

The author wishes to thank Dr. Randall Schumacker for his thoughtful comments on earlier drafts of this manuscript.

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

  1. Andrich D. A. (2010, June). The detection of a structural halo when multiple criteria have the same generic categories for rating. Paper presented at the International Conference on Rasch Measurement, Copenhagen, Denmark.
  2. Andrich D. A. (2015). The problem with the step metaphor for polytomous models for ordinal assessments. Educational Measurement: Issues and Practice, 34(2), 8-14. doi:10.1111/emip.12074
  3. Engelhard G. (1997). Constructing rater and task banks for performance assessments. Journal of Outcome Measurement, 1, 19-33.
  4. Engelhard G., Jr. (1994). Examining rater errors in the assessment of written composition with a many-faceted Rasch model. Journal of Educational Measurement, 31, 93-112. doi:10.2307/1435170
  5. Gyagenda I. S., Engelhard G. (2009). Using classical and modern measurement theories to explore rater, domain, and gender influences on student writing ability. Journal of Applied Measurement, 10, 225-246.
  6. Hemker B. T., Sijtsma K., Molenaar I. W., Junker B. W. (1997). Stochastic ordering using the latent trait and the sum score in polytomous IRT models. Psychometrika, 62, 331-347. doi:10.1007/BF02294555
  7. Johnson R. L., Penny J. A., Gordon B. (2009). Assessing performance: Designing, scoring, and validating performance tasks. New York, NY: Guilford Press.
  8. Junker B. W. (1993). Conditional association, essential independence and monotone unidimensional item response models. Annals of Statistics, 21, 1359-1378.
  9. Junker B. W., Sijtsma K. (2000). Latent and manifest monotonicity in item response models. Applied Psychological Measurement, 24, 65-81. doi:10.1177/01466216000241004
  10. Kuijpers R. E., van der Ark L. A., Croon M. A. (2013). Standard errors and confidence intervals for scalability coefficients in Mokken scale analysis using marginal models. Sociological Methodology, 43, 42-69. doi:10.1177/0081175013481958
  11. Ligtvoet R. (2010). Essays on invariant item ordering (Doctoral dissertation, Tilburg University). Retrieved from https://pure.uvt.nl/portal/files/1200331/thesisLigtvoet_final_.pdf
  12. Ligtvoet R., van der Ark L. A., Bergsma W. P., Sijtsma K. (2011). Polytomous latent scales for the investigation of the ordering of items. Psychometrika, 76, 200-216.
  13. Ligtvoet R., van der Ark L. A., te Marvelde J. M., Sijtsma K. (2010). Investigating an invariant item ordering for polytomously scored items. Educational and Psychological Measurement, 70, 578-595. doi:10.1177/0013164409355697
  14. Meijer R. R., Tendeiro J. N., Wanders R. B. K. (2015). The use of nonparametric item response theory to explore data quality. In Reise S. P., Revicki D. A. (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment (pp. 85-110). New York, NY: Routledge.
  15. Mokken R. J. (1971). A theory and procedure of scale analysis. Berlin, Germany: De Gruyter.
  16. Molenaar I. W. (1982). Mokken scaling revisited. Kwantitatieve Methoden, 3(8), 145-164.
  17. Molenaar I. W. (1997). Nonparametric models for polytomous responses. In van der Linden W. J., Hambleton R. K. (Eds.), Handbook of modern item response theory (pp. 369-380). New York, NY: Springer.
  18. Molenaar I. W., Sijtsma K. (2000). MSP5 for Windows: A program for Mokken scale analysis for polytomous items. Version 5.0 [Computer software]. Groningen, The Netherlands: ProGAMMA.
  19. Penfield R. D. (2014). An NCME instructional module on polytomous item response theory models. Educational Measurement: Issues and Practice, 33(1), 36-48. doi:10.1111/emip.12023
  20. R Core Team. (2015). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
  21. Sijtsma K., Meijer R. R., van der Ark L. A. (2011). Mokken scale analysis as time goes by: An update for scaling practitioners. Personality and Individual Differences, 50, 31-37. doi:10.1016/j.paid.2010.08.016
  22. Sijtsma K., Molenaar I. W. (2002). Introduction to nonparametric item response theory (Vol. 5). Thousand Oaks, CA: Sage.
  23. van der Ark L. A. (2007). Mokken scale analysis in R. Journal of Statistical Software, 20(11), 1-19.
  24. van der Ark L. A. (2012). New developments in Mokken scale analysis in R. Journal of Statistical Software, 48(5), 1-27.
  25. Wind S. A. (2014). Examining rating scales using Rasch and Mokken models for rater-mediated assessments. Journal of Applied Measurement, 15, 100-132.
  26. Wind S. A. (2015). Evaluating the quality of analytic ratings with Mokken scaling. Psychological Test and Assessment Modeling, 57(3), 423-444.
  27. Wind S. A., Engelhard G. (2012). Examining rating quality in writing assessment: Rater agreement, error, and accuracy. Journal of Applied Measurement, 13, 321-335.
  28. Wind S. A., Engelhard G. (2016). Exploring rating quality in rater-mediated assessments using Mokken scale analysis. Educational and Psychological Measurement, 76, 685-706.
  29. Wolfe E. W., McVay A. (2012). Application of latent trait models to identifying substantively interesting raters. Educational Measurement: Issues and Practice, 31(3), 31-37. doi:10.1111/j.1745-3992.2012.00241.x
  30. Wright B. D., Stone M. H. (1979). Best test design. Chicago, IL: MESA Press.
