Abstract
Two-level Mokken scale analysis is a generalization of Mokken scale analysis for multi-rater data. The bias of estimated scalability coefficients for two-level Mokken scale analysis, the bias of their estimated standard errors, and the coverage of the confidence intervals have been investigated under various testing conditions. It was found that the estimated scalability coefficients were unbiased in all tested conditions. For estimating standard errors, the delta method and the cluster bootstrap were compared. The cluster bootstrap structurally underestimated the standard errors of the scalability coefficients, with low coverage values. Except for unequal numbers of raters across subjects and small sets of items, the delta method standard error estimates had negligible bias and good coverage. Post hoc simulations showed that the cluster bootstrap does not correctly reproduce the sampling distribution of the scalability coefficients, and an adapted procedure was suggested. In addition, the delta method standard errors can be slightly improved if the harmonic mean is used for unequal numbers of raters per subject rather than the arithmetic mean.
Keywords: cluster bootstrap, delta method, Mokken scale analysis, rater effects, standard errors, two-level scalability coefficients
In multi-rater assessments, multiple raters evaluate or score the attribute of subjects on a standardized questionnaire. For example, several assessors may assess teachers’ teaching skills using a set of rubrics (e.g., Maulana, Helms-Lorenz, & Van de Grift, 2015; Van der Grift, 2007), both parents may rate their child’s behavior using a health-related quality of life questionnaire (e.g., Ravens-Sieberer et al., 2014), and policy holders may evaluate the quality of health-care plans using several survey items (e.g., Reise, Meijer, Ainsworth, Morales, & Hays, 2006). In multi-rater assessments, raters (assessors, parents, policy holders) are nested within subjects (teachers, children, health-care plans). From these two-level data, measuring the attribute (teaching skills, behavior, quality) of the subjects at Level 2 is of primary interest. Because raters are the respondents, they may have a large effect on the responses to the items, which can interfere with measuring the subjects’ attribute.
For dichotomous items, Snijders (2001) proposed two-level scalability coefficients to investigate the scalability of the items used in multi-rater assessments. These coefficients are generalizations of Mokken’s (1971) single-level scalability coefficients (or $H$ coefficients), which are useful as measures to assess whether “the items have enough in common for the data to be explained by one underlying latent trait . . . in such a way that ordering the subject by the total score is meaningful” (Sijtsma & Molenaar, 2002, p. 60). Mokken introduced scalability coefficients for each item-pair ($H_{ij}$), each item ($H_i$), and the total set of items ($H$). For multi-rater data, Snijders proposed extending the $H_{ij}$, $H_i$, and $H$ coefficients to within-rater scalability coefficients (denoted by superscript $W$), between-rater scalability coefficients (denoted by superscript $B$), and the ratio of the between- to within-rater coefficients (denoted by superscript $BW$).
The scalability coefficients are related to measurement models in which subject and rater effects are jointly modeled (Snijders, 2001). A more detailed description of the measurement models and the two-level coefficients is provided below. Crisan, Van de Pol, and Van der Ark (2016) generalized the two-level scalability coefficients for dichotomous items to polytomous items, and Koopman, Zijlstra, and Van der Ark (in press) derived standard errors for the estimated two-level scalability coefficients using the delta method (e.g., Agresti, 2012, pp. 577-581; Sen & Singer, 1993, pp. 131-152). Alternatively, a cluster bootstrap may be used to estimate standard errors. The cluster bootstrap (Sherman & Le Cessie, 1997; see also Cheng, Yu, & Huang, 2013; Deen & De Rooij, in press; Field & Welsh, 2007; Harden, 2011) has not been applied to two-level scalability coefficients, but it has been applied to similar data structures: for example, children within counties (Sherman & Le Cessie, 1997), siblings or genetic profiles within families (Bull, Darlington, Greenwood, & Shin, 2001; Watt, McConnachie, Upton, Emslie, & Hunt, 2000), repeated measurements of homeless people's housing status (De Rooij & Worku, 2012), or repeated measurements of children's microbial carriage (Lewnard et al., 2015).
For the two-level scalability coefficients, the problem at hand is that neither the bias of the point estimates nor the bias and accuracy of the standard errors have been thoroughly investigated. For the single-level scalability coefficients, the point estimates were mostly unbiased (Kuijpers, Van der Ark, Croon, & Sijtsma, 2016) and for both the analytically derived standard errors using the delta method (Kuijpers et al., 2016) and the bootstrap standard errors (Van Onna, 2004), the levels of bias and accuracy were satisfactory. However, these results cannot be generalized to two-level scalability coefficients because single-level coefficients do not take into account between-rater scalability, nor the dependency in the data due to the nesting of raters within subjects. The goal of this article is to investigate the bias of the point estimates and the standard errors of the two-level scalability coefficients. The remainder of this article first discusses two-level nonparametric item response theory (IRT) models, two-level scalability coefficients, and the two standard error estimation methods. Then, the article discusses the simulation study to investigate bias and coverage, and its results.
Nonparametric IRT Models for Two-Level Data
In multi-rater data, an attribute of subject $s$ ($s = 1, 2, \ldots, S$) is scored by $R_s$ raters using $I$ items. Raters are indexed by $r$ or $t$, and items are indexed by $i$ or $j$. Each item has $m + 1$ ordered response categories, indexed by $x$ or $y$ ($x, y = 0, 1, \ldots, m$). Let $X_{sri}$ denote the score of subject $s$ by rater $r$ on item $i$. Typically, the mean item score across raters, $\bar{X}_s$, is used as a measurement for the attribute of subject $s$.
In 2001, Snijders proposed a two-level nonparametric IRT model for two-level data, based on the monotone homogeneity model (Mokken, 1971; Sijtsma & Molenaar, 2002). Let $\theta_s$ be the value of subject $s$ on a unidimensional latent trait that represents the attribute being measured, and let $\delta_{sr}$ be a deviation that consists of the effect of rater $r$ and the interaction effect of rater $r$ and subject $s$. Hence, $\theta_{sr} = \theta_s + \delta_{sr}$ is the value of subject $s$ on the latent trait according to rater $r$. It is assumed that, on average, the rater deviation for subject $s$ equals zero ($E(\delta_{sr} \mid s) = 0$). In Snijders’s model, the responses to the different items and subjects are assumed stochastically independent given the latent values $\theta_s$ and $\delta_{sr}$. The probability that subject $s$ obtains at least score $x$ on item $i$ when assessed by rater $r$, $P(X_{sri} \ge x \mid \theta_{sr})$, is monotone nondecreasing in $\theta_{sr}$. Because $\theta_{sr} = \theta_s + \delta_{sr}$, the monotonicity assumption implies a nondecreasing item-step response function $P(X_{sri} \ge x \mid \theta_s)$, which is the expectation of $P(X_{sri} \ge x \mid \theta_{sr})$ with respect to the distribution of $\delta_{sr}$.
An alternative generalization of the monotone homogeneity model for two-level data is the nonparametric hierarchical rater model. The hierarchical rater model (DeCarlo, Kim, & Johnson, 2011; Mariano & Junker, 2007; Patz, Junker, Johnson, & Mariano, 2002) is a two-stage model for multi-rater assessments in which a single performance is rated. Similar to Snijders’s model, latent values $\theta_s$ and $\delta_{sr}$ are the subject’s latent trait level and the rater’s deviation, respectively. The hierarchical rater model assumes an unobserved ideal rating of the performance of subject $s$ on each item $i$, denoted by $\xi_{si}$. The ideal ratings may vary across performances and are solely based on the subject’s latent trait value. The ideal ratings on the different items are assumed stochastically independent given $\theta_s$, and the item-step response function $P(\xi_{si} \ge x \mid \theta_s)$ is nondecreasing in $\theta_s$. The observed item score $X_{sri}$ is rater $r$’s evaluation of ideal rating $\xi_{si}$ (i.e., of the performance). For raters with negative $\delta_{sr}$, the probability increases that $X_{sri}$ is smaller than $\xi_{si}$, and for raters with positive $\delta_{sr}$, the probability increases that $X_{sri}$ is larger than $\xi_{si}$. Observed ratings are stochastically independent given $\xi_{si}$ and $\delta_{sr}$, and the item-step response function $P(X_{sri} \ge x \mid \xi_{si}, \delta_{sr})$ is nondecreasing in $\xi_{si}$.
Scalability Coefficients for Two-Level Data
Scalability coefficients evaluate the ordering of observed item responses. They are a function of weighted item-score probabilities. These weights are explained briefly here (for more details, see Koopman, Zijlstra, & Van der Ark, 2017; Kuijpers, Van der Ark, & Croon, 2013) and illustrated in the appendix using a small data example. Let $\pi^W_{sij}(x, y)$ denote the bivariate probability that rater $r$ of subject $s$ scores $x$ on item $i$ and $y$ on item $j$. Let $\pi^B_{sij}(x, y)$ denote the bivariate probability that rater $r$ of subject $s$ scores $x$ on item $i$ and another rater $t$ ($t \ne r$) of the same subject scores $y$ on item $j$. Let $\pi_{si}(x)$ be the probability that a certain rater scores $x$ on item $i$ for subject $s$.
Let $I(\cdot)$ denote an indicator function, which takes value 1 if its argument is true and value 0 otherwise. Each item $i$ has $m$ item steps $X_i \ge x$ ($x = 1, 2, \ldots, m$). An item step is passed if $X_{sri} \ge x$, and an item step is failed if $X_{sri} < x$. $P(X_{sri} \ge x)$ is the popularity of item step $X_i \ge x$. Item steps of each item-pair are sorted in descending order of popularity. A Guttman error is defined as passing a less popular item step after a more popular item step has been failed. For instance, if for item-pair $(i, j)$ the order of item steps is $X_i \ge 1$, $X_j \ge 1$, $X_i \ge 2$, $X_j \ge 2$ (i.e., $P(X_i \ge 1) \ge P(X_j \ge 1) \ge P(X_i \ge 2) \ge P(X_j \ge 2)$), then item-score pattern $(x_i, x_j) = (0, 1)$ is a Guttman error, because this item-score pattern requires that the second ordered item step, $X_j \ge 1$, must be passed, whereas the first, easier step, $X_i \ge 1$, is failed. Patterns that are not a Guttman error are referred to as consistent patterns. If a Guttman error is observed within the same rater (i.e., scores $X_{sri}$ and $X_{srj}$ of the same rater $r$), this is referred to as a within-rater error. If a Guttman error is observed across two different raters of the same subject (i.e., scores $X_{sri}$ and $X_{stj}$, $t \ne r$), this is referred to as a between-rater error. A Guttman error is considered more severe if more ordered steps have been failed before a less popular item step has been passed (e.g., pattern $(0, 2)$ is worse than $(0, 1)$). The severity of the Guttman error for item-score pattern $(x, y)$ is indicated by weight $w^{xy}_{ij}$, which denotes the number of failed item steps preceding passed item steps (Molenaar, 1991). Let $u_g$ denote the evaluation (1 = passed, 0 = failed) of the $g$-th ordered item step with respect to item-score pattern $(x, y)$; then the weight is computed as

$$w^{xy}_{ij} = \sum_{g=1}^{2m} u_g \sum_{h=1}^{g-1} (1 - u_h) \qquad (1)$$
For consistent item-score patterns, value $w^{xy}_{ij}$ equals zero.
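As an illustration, the following is a minimal R sketch of Equation 1, assuming the item-step ordering from the example above ($X_i \ge 1$, $X_j \ge 1$, $X_i \ge 2$, $X_j \ge 2$; $m = 2$); the function and its name are illustrative only:

```r
# Guttman weight of item-score pattern (x, y) for the assumed step ordering:
# for every passed step, count the number of failed steps that precede it.
guttman_weight <- function(x, y) {
  u <- c(x >= 1, y >= 1, x >= 2, y >= 2)  # ordered item steps (passed = TRUE)
  sum(sapply(seq_along(u), function(g) u[g] * sum(!u[seq_len(g - 1)])))
}
guttman_weight(0, 1)  # 1: one failed step precedes the passed step
guttman_weight(0, 2)  # 3: the more severe Guttman error from the text
guttman_weight(1, 1)  # 0: a consistent pattern
```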
Let $F^W_{ij}$ be the sum of all weighted within-rater Guttman errors in item pair $(i, j)$ and let $E_{ij}$ be the sum of all expected weighted Guttman errors in item pair $(i, j)$ under marginal independence. The within-rater scalability coefficient for item-pair $(i, j)$ is then defined as

$$H^W_{ij} = 1 - \frac{F^W_{ij}}{E_{ij}} \qquad (2)$$
Let $F^B_{ij}$ be the sum of all weighted between-rater Guttman errors in item pair $(i, j)$. Replacing $F^W_{ij}$ with $F^B_{ij}$ in Equation 2 results in the between-rater scalability coefficient

$$H^B_{ij} = 1 - \frac{F^B_{ij}}{E_{ij}} \qquad (3)$$
Dividing the two coefficients results in ratio coefficient $H^{BW}_{ij} = H^B_{ij} / H^W_{ij}$. Note that if $F^B_{ij} = F^W_{ij}$, then $H^B_{ij} = H^W_{ij}$ and $H^{BW}_{ij} = 1$. As for single-level scalability coefficients, the two-level scalability coefficients for items are defined as $H^W_i = 1 - \sum_{j \ne i} F^W_{ij} / \sum_{j \ne i} E_{ij}$, and the two-level scalability coefficients for the total scale are defined as $H^W = 1 - \sum_i \sum_{j > i} F^W_{ij} / \sum_i \sum_{j > i} E_{ij}$, with $H^B_i$, $H^B$, and the ratios $H^{BW}_i = H^B_i / H^W_i$ and $H^{BW} = H^B / H^W$ defined analogously (e.g., Crisan et al., 2016; Snijders, 2001). In samples, the scalability coefficients are estimated by using the sample proportions; for computational details, see Snijders (2001; also see Crisan et al., 2016; Koopman et al., 2017).
Within-rater coefficient $H^W$ reflects the consistency of item-score patterns within raters, and its interpretation is similar to that of the single-level scalability coefficients of Mokken (1971). Between-rater coefficient $H^B$ reflects the consistency of item-score patterns between raters of the same subject. The maximum value of the within- and between-rater scalability coefficients equals 1, reflecting a perfect relation between the items, within and between raters of the same subject. Under the discussed IRT models, if the distribution of $\theta_s + \delta_{sr}$ is equally or more dispersed than the distribution of $\theta_s$, then $H^B \le H^W$ (Snijders, 2001). As the population of subject-rater combinations becomes more homogeneous (i.e., the variance of $\theta_s + \delta_{sr}$ becomes smaller), coefficient $H^W$ decreases. Likewise, as the population of subjects becomes more homogeneous (i.e., the variance of $\theta_s$ becomes smaller), coefficient $H^B$ decreases. Ratio coefficient $H^{BW}$ provides useful information on the between- to within-rater variability: The larger the variance of $\delta_{sr}$ (i.e., the rater effect) is compared with the variance of $\theta_s$ (i.e., the subject effect), the smaller the consistency of item-score patterns between raters of the same subject is relative to the consistency of item-score patterns within raters, and the smaller $H^B$ is compared with $H^W$. As a result, $H^{BW}$ decreases as the rater effect increases. For example, if $H^{BW}$ is close to 1, the test score is hardly affected by the individual raters and only a few raters per subject are necessary to scale the subjects, whereas if $H^{BW}$ is close to 0, the raters almost entirely determine the item responses and scaling subjects is not sensible.
For a satisfactory scale, Snijders (2001) suggested heuristic lower-bound criteria for the within- and between-rater coefficients at the item-pair, item, and total-scale level. In addition, he proposed cutoff values above which ratio coefficient $H^{BW}$ is considered reasonable or excellent, with similar interpretations for $H^{BW}_i$ and $H^{BW}_{ij}$. In single-level data, an often-used lower bound is .3 (Mokken, 1971, p. 185). Due to the availability of multiple parallel measurements per subject (i.e., multiple raters), the heuristics for two-level scalability coefficients are lower. The value of total-scale coefficients can be increased by removing items with low item scalability from the item set. In Mokken scale analysis for single-level data, there exists an item selection procedure based on single-level scalability coefficients, but such a procedure is not yet available for multi-rater data. In addition to Snijders’s criteria, the authors suggest that the confidence intervals (CIs) of the coefficients be used in evaluating the quality of a scale. Kuijpers et al. (2013) advised comparing the CI with the heuristic criteria: For example, a scale can only be accepted as strong when the lower bound of the 95% CI is at least .5. A less conservative approach is to require the lower bound for all coefficients to exceed zero. Items that fail to meet these criteria may be adjusted or removed from the item set.
Standard Error of Two-Level Scalability Coefficients
Analytical Standard Errors
The delta method approximates the variance of a transformation of a variable by using a first-order Taylor approximation (e.g., Agresti, 2012, pp. 577-581; Sen & Singer, 1993, pp. 131-152). Recently, Koopman et al. (in press) applied the delta method to derive standard errors for two-level scalability coefficients. Let $\mathbf{n}$ be a vector of order $(m+1)^I$ containing the frequencies of all possible item-score patterns, each pattern taking the form $(x_1, x_2, \ldots, x_I)$. The patterns are ordered lexicographically with the last digit changing fastest, such that $\mathbf{n} = (n_{0 \cdots 00}, n_{0 \cdots 01}, \ldots, n_{m \cdots mm})^\top$. Vector $\mathbf{n}$ is assumed to be sampled from a multinomial distribution with varying multinomial parameters per subject (Vágó, Kemény, & Láng, 2011). Vector $\boldsymbol{\pi}_s$ contains the probabilities of obtaining the item-score patterns in vector $\mathbf{n}$ for subject $s$, with expectation $\boldsymbol{\pi} = E(\boldsymbol{\pi}_s)$ for a randomly selected subject. Suppose that $R_s = R$ for each subject $s$. In addition, let $E(\mathbf{x})$ denote the expectation of vector $\mathbf{x}$, and $\mathrm{Diag}(\mathbf{x})$ a diagonal matrix with $\mathbf{x}$ on the diagonal. Then the variance-covariance matrix of $\mathbf{n}$ equals

$$V(\mathbf{n}) = SR\left[\mathrm{Diag}(\boldsymbol{\pi}) - E\!\left(\boldsymbol{\pi}_s \boldsymbol{\pi}_s^\top\right)\right] + SR^2\left[E\!\left(\boldsymbol{\pi}_s \boldsymbol{\pi}_s^\top\right) - \boldsymbol{\pi}\boldsymbol{\pi}^\top\right] \qquad (4)$$
(Koopman et al., in press; Vágó et al., 2011).
Let $g(\mathbf{n})$ be the transformation of vector $\mathbf{n}$ to a vector containing the scalability coefficients. Let $\mathbf{G} \equiv \mathbf{G}(\mathbf{n})$ be the matrix of first partial derivatives of $g(\mathbf{n})$. According to the delta method, the variance of $g(\mathbf{n})$, $V[g(\mathbf{n})]$, is approximated by

$$V[g(\mathbf{n})] \approx \mathbf{G}\, V(\mathbf{n})\, \mathbf{G}^\top \qquad (5)$$
The covariance matrix of the scalability coefficients can be estimated as $\hat{\mathbf{G}} \hat{V}(\mathbf{n}) \hat{\mathbf{G}}^\top$ by using the sample estimates for $\boldsymbol{\pi}_s$ and $\boldsymbol{\pi}$. For two-level scalability coefficients, Koopman et al. (in press) derived matrix $\mathbf{G}$ in Equation 5. Because the derivations are rather cumbersome and lengthy, they are omitted here; the interested reader is referred to Koopman et al. (in press). The estimated delta method standard errors are obtained by taking the square roots of the diagonal elements of the estimated covariance matrix.
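The following R sketch mirrors Equations 4 and 5 under simplifying assumptions; it is not the analytic implementation of Koopman et al. (in press). It assumes equal numbers of raters $R$ per subject, and `coef_from_freq` is a hypothetical user-supplied function that maps the pattern-frequency vector $\mathbf{n}$ to one or more scalability coefficients; the analytic matrix $\mathbf{G}$ is replaced here by a numerical approximation of the first partial derivatives.

```r
delta_se <- function(P, R, coef_from_freq, eps = 1e-6) {
  # P: S x K matrix; row s holds pi_s, the item-score-pattern probabilities
  # of subject s (K = number of possible patterns).
  S    <- nrow(P); K <- ncol(P)
  pbar <- colMeans(P)                         # E(pi_s)
  Epp  <- crossprod(P) / S                    # E(pi_s pi_s')
  Vn   <- S * R * (diag(pbar) - Epp) +
          S * R^2 * (Epp - tcrossprod(pbar))  # Equation 4
  n    <- S * R * pbar                        # expected pattern frequencies
  f0   <- coef_from_freq(n)
  G    <- matrix(sapply(seq_len(K), function(k) {
    nk <- n; nk[k] <- nk[k] + eps             # numerical partial derivatives
    (coef_from_freq(nk) - f0) / eps
  }), nrow = length(f0))
  sqrt(diag(G %*% Vn %*% t(G)))               # Equation 5: delta method SEs
}
```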
Bootstrap Standard Errors
The nonparametric bootstrap is a commonly used and easy-to-implement method to estimate standard errors (see, for example, Efron & Tibshirani, 1993; Van Onna, 2004). This method resamples the observed data with replacement to gain insight into the variability of the estimated coefficient. The bootstrap requires that all resampled observations be independent and identically distributed. Because in the two-level data structure the observations within subjects are expected to correlate, a standard bootstrap will not work. The cluster bootstrap accommodates this dependency by resampling the subjects, thereby retaining all raters of each resampled subject (see, for example, Deen & De Rooij, in press; Field & Welsh, 2007; Harden, 2011; Ng, Grieve, & Carpenter, 2013; Sherman & Le Cessie, 1997).
A bootstrap procedure is balanced if each observation occurs an equal number of times across the bootstrap samples. Balancing the bootstrap can reduce the variance of the estimation, resulting in a more efficient estimator (Chernick, 2008, p. 131; Efron & Tibshirani, 1993, pp. 348-349). The following algorithm is used to estimate a standard error with a balanced cluster bootstrap.
1. For a bootstrap of size $B$, replicate the $S$ subjects from the data $B$ times and randomly distribute these replications in a $B \times S$ matrix $\mathbf{S}$.
2. Create cluster-bootstrap data sets $\mathbf{X}^*_1, \mathbf{X}^*_2, \ldots, \mathbf{X}^*_B$. To obtain $\mathbf{X}^*_b$, take the $b$-th row of matrix $\mathbf{S}$; $\mathbf{X}^*_b$ consists of the observed ratings of all raters from the $S$ bootstrap subjects in that row.
3. Compute the scalability coefficients for each bootstrap data set $\mathbf{X}^*_b$.
4. Estimate the bootstrap standard errors by computing the standard deviation of each coefficient across the $B$ bootstrap samples.
Resampling at the subject level ensures that the bootstrap samples reflect a data structure similar to that of the original data set. The cluster bootstrap allows observations within subjects to correlate, but observations between subjects should be independent. The correlation structure may differ per subject and need not be known.
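A sketch of this algorithm in R is given below. `X` is assumed to hold one row per rater with the subject identifier in column 1, and `stat` is a hypothetical user-supplied function that returns one coefficient (e.g., $\hat{H}^W$) for a data set.

```r
balanced_cluster_boot_se <- function(X, stat, B = 1000) {
  ids <- unique(X[, 1])
  S   <- length(ids)
  # Step 1: replicate every subject B times and distribute the replications
  # at random over a B x S matrix, so each subject occurs exactly B times.
  Smat <- matrix(sample(rep(ids, B)), nrow = B, ncol = S)
  # Steps 2 and 3: build each bootstrap data set from all raters of the
  # drawn subjects and compute the coefficient; drawn subjects are relabeled
  # so that duplicates enter as distinct clusters.
  est <- apply(Smat, 1, function(drawn) {
    Xb <- do.call(rbind, lapply(seq_along(drawn), function(k) {
      x <- X[X[, 1] == drawn[k], , drop = FALSE]
      x[, 1] <- k
      x
    }))
    stat(Xb)
  })
  sd(est)  # Step 4: the bootstrap standard error
}
```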
Method
Simulated data were used to investigate the bias of the two-level scalability coefficient estimates, the bias of the standard error estimates, and the coverage of the Wald-based CIs. To keep the simulation study manageable (and readable), a completely crossed design was avoided. Instead, bias and coverage were first investigated in a small study that included the most important independent variable, the rater effect $\sigma^2_\delta$, and the two standard error estimation methods (the main design). Because the rater effect determines the scalability of subjects for a given test, it is considered the most important independent variable. Second, in a series of small studies with specialized designs, the effects of other independent variables were investigated using the most promising standard error estimation method. Finally, remarkable results were further investigated in post hoc simulations.
Data Simulation Strategy
Computation of the scalability coefficients and their standard errors by means of the delta method only assumes that the item scores follow a multinomial distribution with varying multinomial parameters across subjects (Koopman et al., in press). The cluster bootstrap assumes that data between subjects are independent. Both assumptions hold under the discussed two-level IRT models, given that each subject has a unique set of raters. The authors used a parametric hierarchical rater model to generate data, parameterized as follows:
$$P(X_{sri} = x \mid \theta_s, \delta_{sr}) = \sum_{\xi=0}^{m} P(X_{sri} = x \mid \xi_{si} = \xi, \delta_{sr})\, P(\xi_{si} = \xi \mid \theta_s) \qquad (6)$$
Latent trait values $\theta_s$ were sampled from a normal distribution with mean 0 and variance $\sigma^2_\theta$. Ideal ratings $\xi_{si}$ were obtained using a graded response model (Samejima, 1969). This model was used because it is the parametric version of the monotone homogeneity model that underlies Mokken scale analysis (Hemker, Sijtsma, Molenaar, & Junker, 1996). For latent trait value $\theta_s$, item discrimination parameter $\alpha_i$, and item-step location parameter $\beta_{ix}$, the probability of ideal rating $\xi_{si} \ge x$ according to the graded response model is

$$P(\xi_{si} \ge x \mid \theta_s) = \frac{\exp[\alpha_i(\theta_s - \beta_{ix})]}{1 + \exp[\alpha_i(\theta_s - \beta_{ix})]} \qquad (7)$$
Note that $P(\xi_{si} \ge 0 \mid \theta_s) = 1$ and $P(\xi_{si} \ge m + 1 \mid \theta_s) = 0$ by definition. Ideal ratings were sampled from a multinomial distribution using the category probabilities $P(\xi_{si} = x \mid \theta_s) = P(\xi_{si} \ge x \mid \theta_s) - P(\xi_{si} \ge x + 1 \mid \theta_s)$ for each subject $s$ and item $i$.
Rater deviations $\delta_{sr}$ were sampled from a normal distribution with mean 0 and variance $\sigma^2_\delta$. For deviation $\delta_{sr}$ and ideal rating $\xi_{si}$, the probability of observed score $x$, $P(X_{sri} = x \mid \xi_{si}, \delta_{sr})$, was obtained from a discrete signal detection model. In this model, the probabilities are proportional to a normal density in $x$ with mean $\xi_{si} + \delta_{sr}$ and rating variance $\sigma^2_\epsilon$; that is,

$$P(X_{sri} = x \mid \xi_{si}, \delta_{sr}) \propto \exp\!\left[-\,\frac{\{x - (\xi_{si} + \delta_{sr})\}^2}{2\sigma^2_\epsilon}\right] \qquad (8)$$
(also, see Patz et al., 2002). The computed probabilities for the $m + 1$ answer categories were normalized to sum to 1. Finally, observations $X_{sri}$ were sampled from a multinomial distribution with the normalized probabilities as parameters.
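The following R sketch illustrates this data-generating scheme. Argument names are illustrative; defaults follow the main design where the text states them, and the subject-effect variance is fixed at 1 here as an assumption.

```r
simulate_ratings <- function(S, R, I, m,
                             var_delta = 0.50,   # rater effect sigma^2_delta
                             var_eps   = 0.50,   # rating variance sigma^2_eps
                             alpha = rep(1, I),  # item discriminations
                             beta  = matrix(seq(-3, 3, length.out = I * m),
                                            nrow = I, ncol = m)) {
  theta <- rnorm(S)  # subject latent traits (variance 1 assumed here)
  X <- array(NA_integer_, dim = c(S, R, I))
  for (s in seq_len(S)) {
    # Equation 7: cumulative GRM probabilities P(xi >= x), x = 0, ..., m + 1,
    # differenced into category probabilities of the ideal ratings xi_si
    cum <- cbind(1, plogis(alpha * (theta[s] - beta)), 0)
    pxi <- cum[, 1:(m + 1), drop = FALSE] - cum[, 2:(m + 2), drop = FALSE]
    xi  <- apply(pxi, 1, function(p) sample(0:m, 1, prob = p))
    delta <- rnorm(R, 0, sqrt(var_delta))  # rater deviations delta_sr
    for (r in seq_len(R)) for (i in seq_len(I)) {
      # Equation 8: discrete signal detection around xi_si + delta_sr,
      # normalized over the m + 1 categories
      p <- exp(-(0:m - (xi[i] + delta[r]))^2 / (2 * var_eps))
      X[s, r, i] <- sample(0:m, 1, prob = p / sum(p))
    }
  }
  X
}
```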
Main Design
Independent variables
Rater effect $\sigma^2_\delta$ had four levels, each reflecting a different degree of rater effect: $\sigma^2_\delta = 0.25$ (very small), $0.50$ (small), $0.75$ (medium), and $1.00$ (large). As noted earlier, both the subject effect and the rater effect affect the magnitude of the scalability coefficients. Because the subject effect was held constant, the magnitude of $H^W$ was similar across the four levels of rater effect, which facilitated comparison; $H^B$ and $H^{BW}$ decreased as $\sigma^2_\delta$ increased.
Standard-error estimation method had two levels: the delta method and the bootstrap method. These methods were applied to each level of rater effect.
Other variables in the main design were fixed: The number of subjects was held constant, and each subject was rated by an independent group of raters of size $R = 5$. The number of items was $I = 10$, and each item had $m + 1 = 5$ answer categories. Item discrimination was equal for each item at $\alpha_i = 1$ (Equation 7), the item-step location parameters $\beta_{ix}$ (Equation 7) had equidistant values between −3 and 3, and the rating variance was $\sigma^2_\epsilon = 0.50$ (Equation 8).
Dependent variables
The scalability coefficients and the standard errors of the estimates were computed for the three classes of two-level total-scale scalability coefficients ($H^W$, $H^B$, and $H^{BW}$). Item-pair and item scalability coefficients were not computed because the total-scale coefficient can be written as a normalized weighted sum of the item-pair or item coefficients (Mokken, 1971, pp. 150-152). Therefore, it is expected that potential bias of $H_{ij}$ or $H_i$ is reflected in $H$. In the specialized designs, the authors investigated conditions with two items; in that case, $H = H_{ij}$.
Bias of the estimated H coefficient
Bias reflects the average difference between the sample estimate $\hat{H}$ and population value $H$. Let $\hat{H}_q$ be the estimated scalability coefficient in the $q$-th replication. The bias was determined across $Q$ replications as $\mathrm{Bias}(\hat{H}) = Q^{-1} \sum_{q=1}^{Q} (\hat{H}_q - H)$. The population values (Table 1) were determined based on a finite sample of 1,000,000 subjects and five raters per subject. Table 1 shows that $H^B$ and $H^{BW}$ decrease as rater effect $\sigma^2_\delta$ increases, and that the difference between $H^W$ and $H^B$ becomes larger as the rater effect increases. Therefore, the correlation between the sample estimates of $H^W$ and $H^B$ will be larger for small rater effects than for large rater effects. On average, a relative $\mathrm{Bias}(\hat{H})$ of 10% reflects a value of 0.044; therefore, absolute bias values below 0.044 are considered satisfactory.
Table 1.
Population Values of the Two-Level Scalability Coefficients $H^W$, $H^B$, and $H^{BW}$ and the SD of Their Sampling Distribution for the Four Conditions of Rater Effect $\sigma^2_\delta$ in the Main Design.

| | $\sigma^2_\delta = 0.25$ | | $\sigma^2_\delta = 0.50$ | | $\sigma^2_\delta = 0.75$ | | $\sigma^2_\delta = 1.00$ | |
|---|---|---|---|---|---|---|---|---|
| | $H$ | $SD$ | $H$ | $SD$ | $H$ | $SD$ | $H$ | $SD$ |
| $H^W$ | .437 | .037 | .418 | .034 | .435 | .029 | .479 | .025 |
| $H^B$ | .415 | .038 | .316 | .038 | .214 | .036 | .126 | .032 |
| $H^{BW}$ | .948 | .010 | .756 | .036 | .483 | .057 | .262 | .058 |
Bias of the estimated standard errors
Let $\widehat{SE}_q$ be the estimated standard error in the $q$-th replication, and $SE$ the population standard error; then $\mathrm{Bias}(\widehat{SE}) = Q^{-1} \sum_{q=1}^{Q} (\widehat{SE}_q - SE)$. The population values (Table 1) were determined by the standard deviation of $\hat{H}$ across the replications, which is assumed to be representative of the true standard deviation of the sampling distribution of $\hat{H}$ under the conditions of the main design. On average, a relative $\mathrm{Bias}(\widehat{SE})$ of 10% reflects a value of 0.004; therefore, absolute bias values below 0.004 are considered satisfactory.
Coverage
Coverage of the 95% CIs was computed as the proportion of times, across the $Q$ replications, that the population value $H$ was included in the Wald-based confidence interval $\hat{H} \pm 1.96 \times \widehat{SE}$. This interval was selected because the distribution of the two-level scalability coefficients is asymptotically normal (Koopman et al., in press). There were $Q$ replications per condition and $B$ balanced bootstrap samples per replication.
Analyses
The simulation study was programmed in R (R Core Team, 2018) and partly performed on a high-performance computing cluster. The scalability coefficients and delta method standard errors were computed using the R package mokken (Van der Ark, 2007, 2012; also, see Koopman et al., in press). The main design had eight conditions (two standard error estimation methods × four rater-effect levels). Summary descriptives were computed and visualized for the relevant outcome variables for all scalability coefficients. An Agresti–Coull CI (Agresti & Coull, 1998) was constructed around the estimated coverage using the R package binom (Dorai-Raj, 2014) to test whether coverage deviated from the desired value of .95.
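As a hedged illustration of this coverage test, the snippet below uses simulated stand-ins: the population value and SD are taken from Table 1 ($\sigma^2_\delta = 0.50$, $H^W$), whereas the replication count and the generated estimates are arbitrary placeholders, not the study's actual values.

```r
library(binom)
Q     <- 1000                                # illustrative replication count
H_pop <- .418                                # population H^W (Table 1)
est   <- rnorm(Q, mean = H_pop, sd = .034)   # stand-in replication estimates
se    <- rep(.034, Q)                        # stand-in estimated SEs
covered <- sum(est - 1.96 * se <= H_pop & H_pop <= est + 1.96 * se)
binom.confint(covered, Q, conf.level = .95, methods = "agresti-coull")
```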
Specialized Designs
Each specialized design varied one of the independent variables that had been fixed in the main design. The levels of rater effect remained unchanged ($\sigma^2_\delta = 0.25$, $0.50$, $0.75$, and $1.00$) to allow for the detection of potential interaction effects.
Independent variables
The following variables defined the specialized designs:
Number of subjects $S$ had three levels, one of which was the main-design value; the other two levels were a smaller and a larger number of subjects.
Number of raters per subject $R_s$ had six conditions. Let $U(a, b)$ denote a discrete uniform distribution with minimum $a$ and maximum $b$. In the six conditions, $R_s$ was sampled from $U(2, 2)$, $U(5, 5)$ (as in the main design), $U(30, 30)$, $U(4, 6)$, $U(3, 7)$, and $U(5, 30)$, respectively. Hence, in the first three conditions, each subject had the same number of raters, and in the last three conditions, the number of raters differed across subjects.
Rating variance $\sigma^2_\epsilon$ had four conditions. In three conditions, $\sigma^2_\epsilon$ was fixed at 0.25, 0.50 (as in the main design), and 0.75, respectively. In the fourth condition, $\sigma^2_\epsilon$ was sampled for each rater from an exponential distribution.
Number of items was $I = 2$, 3, 4, 6, 10 (as in the main design), or 20.
Number of answer categories $m + 1$ had four levels: 2 (dichotomous items), 3, 5 (as in the main design), and 7. The parameters of the signal detection model were adjusted according to the number of answer categories, to ensure that the magnitude of the scalability coefficients remained similar to those in the main design (Table 2).
Item discrimination parameter $\alpha_i$ had four levels. In three conditions, $\alpha_i$ was kept constant across items at 0.5, 1.0 (as in the main design), or 1.5. In the last condition, the item discrimination varied across items, taking equidistant values between 0.5 and 1.5.
Distance between item-step location parameters $\beta_{ix}$ had four levels. In the first three conditions, the $\beta_{ix}$ values ranged between −4.5 and 4.5, between −3 and 3 (as in the main design), or between −1.5 and 1.5. In the last condition, the item-step locations were equal for the same item steps across items and ranged between −3 and 3 within items (i.e., $\beta_{ix} = \beta_{jx}$ for all $i, j$).
Table 2.
Rater Effect ($\sigma^2_\delta$) and Rating Variance ($\sigma^2_\epsilon$) Values for the Number of Answer Categories ($m + 1$) Specialized Design.

| | | Rater effect | | | |
|---|---|---|---|---|---|
| $m + 1$ | $\sigma^2_\epsilon$ | 0.25 | 0.50 | 0.75 | 1.00 |
| 2 | .3 | 0.18 | 0.27 | 0.35 | 0.45 |
| 3 | .4 | 0.20 | 0.33 | 0.48 | 0.65 |
| 5 | .5 | 0.25 | 0.50 | 0.75 | 1.00 |
| 6 | .5 | 0.30 | 0.70 | 1.00 | 1.20 |

Note. The column headings 0.25 to 1.00 denote the four rater-effect levels of the main design; the table entries are the adjusted $\sigma^2_\delta$ values used for each number of answer categories. $m + 1 = 5$ is the level from the main design.
Dependent variables and analyses
The dependent variables and statistical analyses were the same for the specialized designs as for the main design. The specialized designs item discrimination, item-step location, and rating variance had an effect on the magnitude of (some of) the population values (see Table 3). Population SDs were similar to those in the main design, but increased for fewer items and smaller sets of subjects or raters.
Table 3.
Population Values for $H^W$, $H^B$, and $H^{BW}$ for the Specialized Designs Item Discrimination ($\alpha_i$), Item-Step Location ($\beta_{ix}$), and Rating Variance ($\sigma^2_\epsilon$), for Rater Effect $\sigma^2_\delta = 0.50$.

| | $\alpha_i$ | | | | $\beta_{ix}$ range | | | | $\sigma^2_\epsilon$ | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0.5 | 1 | 1.5 | Varied | ±1.5 | ±3 | ±4.5 | Equal | 0.25 | 0.50 | 0.75 | Varied |
| $H^W$ | .185 | .418 | .569 | .381 | .377 | .418 | .439 | .400 | .464 | .418 | .357 | .384 |
| $H^B$ | .125 | .316 | .439 | .284 | .327 | .316 | .270 | .252 | .343 | .316 | .269 | .270 |
| $H^{BW}$ | .675 | .756 | .772 | .747 | .866 | .756 | .616 | .630 | .738 | .756 | .752 | .704 |
Post Hoc Simulations
Some exploratory simulations were performed to investigate aberrant results from the main and specialized designs.
Results
Main Design
Bias of all two-level scalability coefficients was close to zero across the different levels of rater effect (Table 4, left panel).
Table 4.
Bias of the Estimated Coefficients ($\hat{H}$) and of the Estimated Standard Errors ($\widehat{SE}$).

| | Bias($\hat{H}$) | | | Bias($\widehat{SE}$) delta | | | Bias($\widehat{SE}$) bootstrap | | |
|---|---|---|---|---|---|---|---|---|---|
| $\sigma^2_\delta$ | $H^W$ | $H^B$ | $H^{BW}$ | $H^W$ | $H^B$ | $H^{BW}$ | $H^W$ | $H^B$ | $H^{BW}$ |
| 0.25 | −.000 | −.001 | −.002 | .002 | .002 | **.006** | **−.007** | **−.007** | −.002 |
| 0.50 | −.001 | −.002 | −.007 | .002 | .001 | .004 | **−.008** | **−.009** | **−.010** |
| 0.75 | .001 | −.002 | −.009 | .003 | .002 | .004 | **−.007** | **−.009** | **−.016** |
| 1.00 | .001 | −.003 | −.008 | .003 | .003 | **.006** | **−.007** | **−.009** | **−.016** |

Note. Bias that exceeds the boundary of .044 for $\hat{H}$ and .004 for $\widehat{SE}$ is printed in boldface.
Bias of the delta method standard error estimates was generally close to zero, but the bootstrap standard error estimates were negatively biased (Table 4, last two panels). As a result, coverage of the 95% CIs was too low for the cluster bootstrap, with values ranging between .82 and .88 across the different conditions and coefficients (Figure 1). The delta method coverage was excellent for the between-rater coefficient, but conservative for the within-rater coefficient if the rater effect was large (Figure 1). In addition, coverage of the ratio coefficient tended to be too high, especially if the rater effect was nearly absent. The high coverage may be explained by the small $\sigma^2_\delta$ value: For $\sigma^2_\delta = 0.25$, $H^{BW} = .948$, so there is hardly any variation of $\hat{H}^{BW}$ across different samples, indicated by a true standard error of .010 (Table 1). The bias of the estimated standard error was .006 (Table 4, first row, sixth column), which is identical to the bias in the $\sigma^2_\delta = 1.00$ condition (Table 4, last row, sixth column), for which the true standard error is .058 (Table 1). Relative to the true standard error, the bias of .006 was 60% for $\sigma^2_\delta = 0.25$, but only 10% for $\sigma^2_\delta = 1.00$. Therefore, coverage was much larger in the $\sigma^2_\delta = 0.25$ condition than in the $\sigma^2_\delta = 1.00$ condition, even though the absolute bias was equal.
Figure 1.
Plot of the coverage of the 95% confidence interval of the two-level scalability coefficients, for different levels of rater effect and the two standard error estimation methods.
Note. Error bars represent the 95% Agresti–Coull confidence interval.
Specialized Designs
For all conditions in the specialized designs, the bias of the point estimates of the two-level scalability coefficients was satisfactory, with values between −.004 and .014. Because of the poor performance of the cluster bootstrap in the main design, the bias and coverage of the cluster-bootstrap standard errors were not computed in the specialized designs, so all results for the standard errors pertain to the delta method. Number of subjects ($S$), number of answer categories ($m + 1$), item discrimination ($\alpha_i$), item-step location ($\beta_{ix}$), and rating variance ($\sigma^2_\epsilon$) had little or no effect on the bias of the estimated standard errors and the coverage of the Wald-based CI. As in the main design, for $H^W$ and $H^B$, bias was satisfactory and coverage was accurate, whereas for $H^{BW}$, the bias was occasionally unsatisfactory (Bias($\widehat{SE}$) > .004) and coverage conservative. Number of raters ($R_s$) and number of items ($I$) did have an effect (Table 5). No interaction effect was found between rater effect ($\sigma^2_\delta$) and the specialized design variables. Therefore, results are discussed only for $\sigma^2_\delta = 0.50$.
Table 5.
Bias of the Delta Method Standard Errors ($\widehat{SE}$) for the Two-Level Scalability Coefficients $H^W$, $H^B$, and $H^{BW}$ for the Specialized Designs Number of Raters ($R_s$) and Number of Items ($I$).

| $R_s$ | $H^W$ | $H^B$ | $H^{BW}$ | $I$ | $H^W$ | $H^B$ | $H^{BW}$ |
|---|---|---|---|---|---|---|---|
| 2 | .002 | .002 | **.009** | 2 | .002 | **−.009** | −.003 |
| 5 | .002 | .001 | .004 | 3 | .001 | −.004 | .000 |
| 30 | .000 | .000 | .001 | 4 | .002 | −.001 | .003 |
| 4-6 | .004 | **.005** | **.008** | 6 | .001 | .001 | **.006** |
| 3-7 | **.013** | **.015** | **.017** | 10 | .002 | .001 | .004 |
| 5-30 | **.032** | **.037** | **.035** | 20 | .002 | .002 | .003 |

Note. Bias that exceeds the boundary of .004 is printed in boldface.
For unequal numbers of raters, the standard errors of the two-level scalability coefficients were too conservative (Table 5, left panel) and the coverage of the CIs too high (Figure 2, left plot, right-hand side). The overestimation was stronger if the variation in $R_s$ was larger. As in the main design with five raters, the standard errors were also too conservative for $H^{BW}$ in the condition with two raters (Figure 2, left plot).
Figure 2.
Coverage plots for the two-level scalability coefficients, for different numbers of raters and items, respectively.
Note. Error bars represent the 95% Agresti–Coull confidence interval.
For two and three items, the standard errors were underestimated for the between-rater coefficient (Table 5, right panel). As a result, coverage was too low (Figure 2, right plot).
Post Hoc Simulations
It was unexpected that the cluster bootstrap in the main design performed poorly in estimating the standard errors of the two-level scalability coefficients, resulting in poor coverage values. Apparently, the cluster bootstrap does not correctly approximate the sampling distribution of $\hat{H}$ in the population. An explanation may be that the cluster bootstrap ignores the assumption that the raters should be a random sample from the population of raters. Therefore, an alternative, two-stage bootstrap is proposed (for a similar bootstrap procedure, see Ng et al., 2013). At Stage 1, the clusters are resampled as in the cluster bootstrap, and at Stage 2, the raters of the selected subjects are resampled. Compared with the cluster bootstrap, the two-stage bootstrap resulted in substantial improvements in the standard error estimates and the coverages (Table 6, rows 1 and 2). In an effort to further improve the coverage rates of the two-stage bootstrap, the percentile and bias-corrected accelerated intervals were also computed (see, for example, Efron & Tibshirani, 1993, pp. 170-187, for a detailed description). These two methods use the empirical distribution of the bootstrap estimates to construct an interval, rather than assuming a normal distribution. The coverages of the percentile and bias-corrected accelerated intervals were equal to or lower than those of the Wald-based intervals. Because the bias and coverages of the two-stage bootstrap are still inferior to those of the delta method (Table 6, row 3), the delta method remains the preferred method.
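A sketch of the proposed two-stage bootstrap in R, under the same assumptions as the cluster-bootstrap sketch above (subject identifier in column 1; `stat` a hypothetical coefficient function); Stage 2 is the only change relative to the cluster bootstrap:

```r
two_stage_boot_se <- function(X, stat, B = 1000) {
  ids <- unique(X[, 1])
  est <- replicate(B, {
    drawn <- sample(ids, replace = TRUE)      # Stage 1: resample subjects
    Xb <- do.call(rbind, lapply(seq_along(drawn), function(k) {
      x <- X[X[, 1] == drawn[k], , drop = FALSE]
      x <- x[sample(nrow(x), replace = TRUE), , drop = FALSE]  # Stage 2: raters
      x[, 1] <- k                             # keep duplicates as distinct clusters
      x
    }))
    stat(Xb)
  })
  sd(est)
}
```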
Table 6.
Post Hoc Results of the Bias($\widehat{SE}$) and Coverage for the Two-Stage Bootstrap, the Cluster Bootstrap, and the Delta Method; for the Arithmetic and Harmonic Mean of $R_s$; and for Item-Pairs With Two, Four, and 10 Items, for $H^W$, $H^B$, and $H^{BW}$, in the Main Design Condition With $\sigma^2_\delta = 0.50$.

| | | Bias($\widehat{SE}$) | | | Coverage | | |
|---|---|---|---|---|---|---|---|
| | | $H^W$ | $H^B$ | $H^{BW}$ | $H^W$ | $H^B$ | $H^{BW}$ |
| Method | | | | | | | |
| Two-stage bootstrap | | −.003 | −.004 | −.007 | .930 | .930 | .880 |
| Cluster bootstrap | | −.008 | −.009 | −.010 | .865 | .861 | .853 |
| Delta method | | .002 | .001 | .004 | .955 | .950 | .970 |
| Mean of $R_s$ | | | | | | | |
| 4-6 | A | .004 | .005 | .008 | .970 | .972 | .983 |
| | H | .003 | .003 | .007 | .965 | .965 | .979 |
| 3-7 | A | .013 | .015 | .017 | .991 | .993 | .990 |
| | H | .009 | .011 | .013 | .984 | .984 | .989 |
| 5-30 | A | .032 | .037 | .021 | .999 | .999 | 1.00 |
| | H | .018 | .021 | .021 | .992 | .994 | .999 |
| Number of items | | | | | | | |
| 2 | | .002 | −.009 | −.003 | .944 | .910 | .941 |
| 4 | | .002 | −.001 | .011 | .945 | .938 | .983 |
| 10 | | .002 | .003 | .019 | .950 | .953 | .989 |

Note. Bias that exceeds the boundary of .004 and coverages where .95 is outside the Agresti–Coull interval are printed in boldface. The two-stage bootstrap results are based on 100 replications. The results are averaged across all item-pairs. A = arithmetic mean and H = harmonic mean of $R_s$.
There were two odd results in the specialized designs: the relatively poor standard error estimates for unequal group sizes and for a set of two items. The standard error estimates of the two-level scalability coefficients rapidly increased as the variation in the number of raters across subjects became larger. For unequal numbers of raters across subjects, $R$ in Equation 4 was estimated by the (arithmetic) sample mean $\bar{R} = S^{-1} \sum_s R_s$. As a solution, the authors estimated $R$ by the harmonic mean, which is lower than the arithmetic mean if group sizes differ and is computed as $\tilde{R} = S / \sum_s R_s^{-1}$. Using the harmonic mean improved the bias of the standard error and the coverage compared with the arithmetic mean (Table 6, rows 4-9); a numerical illustration follows below. However, the estimates were still too conservative, and equal group sizes are preferred.
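A quick numerical illustration of the two means for unequal group sizes (the group sizes shown are hypothetical):

```r
Rs <- c(5, 12, 18, 24, 30)     # illustrative unequal numbers of raters
mean(Rs)                       # arithmetic mean: 17.8
length(Rs) / sum(1 / Rs)       # harmonic mean: approximately 12.1
```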
The standard error of between-rater coefficient $H^B$ was underestimated for sets of two items. Although testing with a small set of items is generally discouraged (see, for example, Emons, Sijtsma, & Meijer, 2007), this condition was of interest because for only two items, the total-scale coefficient $H^B$ is equal to item-pair coefficient $H^B_{ij}$. To investigate whether bias in the standard error of item-pair coefficient $H^B_{ij}$ persisted for larger sets of items, the coefficients and their standard errors were computed in a new condition with four items and in the main design with 10 items (both for $\sigma^2_\delta = 0.50$). As shown in the bottom three rows of Table 6, the bias of the $H^B_{ij}$ standard errors vanished as the number of items increased. However, Table 6 also shows that the standard error estimates and coverages of item-pair ratio coefficient $H^{BW}_{ij}$ were increasingly conservative, more so than for the total-scale coefficient $H^{BW}$.
Discussion
Point estimates of the two-level scalability coefficients were unbiased in all conditions, with bias values approximately zero. Standard errors were mostly unbiased if the delta method was used but not for the traditional cluster bootstrap. A two-stage cluster bootstrap was proposed that partially mitigated the bias, yet the delta method remains the preferred method.
The delta method resulted in unbiased standard error estimates for both the within- and between-rater scalability coefficients, $H^W$ and $H^B$, respectively. For large rater effects, the coverage of the within-rater coefficient was slightly conservative. However, if the rater effect is large, standard errors are of less interest, because the test will be judged to be of poor quality based on the (unbiased) coefficients alone. Standard error estimates and coverages for ratio coefficient $H^{BW}$ were conservative, especially if $H^{BW}$ was close to its upper bound of 1. In this latter situation, standard errors are also of less interest, because if the coefficient estimate is that high, so is its interval estimate.
For all coefficients, the delta method overestimated the standard error if the number of raters was unequal across subjects, especially if the variation was large. Post hoc simulations showed some improvement if the harmonic mean of the group sizes was used rather than the arithmetic mean, but equal group sizes are recommended. In addition, for small sets of items, the standard errors of between-rater coefficient $H^B$ were too liberal. Post hoc simulations showed that the standard errors of the total-scale and item-pair between-rater coefficients are unbiased, provided that a scale consists of at least four items.
The results of this study demonstrate that, in general, the estimated scalability coefficients and delta method standard errors are accurate and can therefore be confidently used in practice. When the scalability of a multi-rater test is deemed satisfactory, a related (but different) topic concerns the reliability. For a given test, Snijders (2001) presented coefficient alpha to determine how many raters are necessary for reliable scaling of the subjects. Note that the magnitude of the scalability coefficients is not affected by the number of raters. Alternatively, generalizability theory provides a more extensive selection of methods to investigate reliability (generalizability) of multi-rater tests (see, for example, Shavelson & Webb, 1991).
The application of two-level scalability coefficients and their standard errors is not limited to multi-rater data. They may also be applied in research with multiple (random) circumstances or time points in which the same questionnaire is completed. Also, the items may be replaced by a fixed set of situations in which a particular skill is scored using a single item. The standard errors examined in this article are also useful for single-level Mokken scale analysis for data from clustered samples (e.g., children nested in classes) because the single-level standard error will typically underestimate the true standard error (see, for example, Koopman et al., in press). Future research may focus on how the point and interval estimates can be useful to select a subset of items from a larger set of items.
Acknowledgments
The authors thank SURFsara (www.surfsara.nl) for the support in using the Lisa Compute Cluster to conduct our Monte Carlo simulations.
Appendix
Illustrative Example
Table A1 shows two small constructed data examples, each with two subjects and five raters per subject on two three-category items. The same item scores are present in both data sets, but Rater 4 of Subject 1 and Rater 5 of Subject 2 are exchanged in the second data set.
Table A1.
Two Small Constructed Multi-Rater Data Examples, One With a Large Rater Effect and One With a Small Rater Effect.

| | Data set 1: Large rater effect | | | | | | | | | | Data set 2: Small rater effect | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | s = 1 | | | | | s = 2 | | | | | s = 1 | | | | | s = 2 | | | | |
| r = | 1 | 2 | 3 | 4 | 5 | 1 | 2 | 3 | 4 | 5 | 1 | 2 | 3 | 4 | 5 | 1 | 2 | 3 | 4 | 5 |
| Item $i$ | 2 | 2 | 2 | 1 | 1 | 0 | 0 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 1 | 0 | 0 | 1 | 1 | 1 |
| Item $j$ | 1 | 2 | 2 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 0 | 1 | 0 | 1 | 0 |

Data set 1: $H^W = .762$, $H^B = .167$, $H^{BW} = .219$; 95% CIs [0.343, 1.181], [−0.231, 0.565], and [−0.288, 0.726], respectively.
Data set 2: $H^W = .762$, $H^B = .702$, $H^{BW} = .922$; 95% CIs [0.349, 1.175], [0.435, 0.970], and [0.441, 1.402], respectively.

Note. 95% CI is the 95% Wald-based confidence interval. Both data sets have two subjects ($S = 2$), each rated by a unique set of five raters ($R = 5$) on two three-category items ($I = 2$, $m + 1 = 3$).
For both data sets in Table A1, the item-step ordering is $X_i \ge 1$, $X_j \ge 1$, $X_i \ge 2$, $X_j \ge 2$. Therefore, consistent item-score patterns $(x_i, x_j)$ are $(0,0)$, $(1,0)$, $(1,1)$, $(2,1)$, and $(2,2)$, whereas patterns $(0,1)$, $(0,2)$, $(1,2)$, and $(2,0)$ are Guttman errors. Within raters, Guttman error $(0,1)$ occurs once in each data set (Rater 2 of Subject 2). In the first data set, there are five between-rater Guttman errors $(0,1)$ (for Subject 2, Rater 1 scored 0 on item $i$, whereas Raters 2, 4, and 5 scored 1 on item $j$, and Rater 2 scored 0 on item $i$, whereas Raters 4 and 5 scored 1 on item $j$), four between-rater Guttman errors $(1,2)$, and five between-rater Guttman errors $(2,0)$, summing to 14 between-rater Guttman errors. In the second data set, there are only three $(0,1)$ and two $(1,2)$ between-rater Guttman errors, summing to five.
Because there are relatively many between-rater Guttman errors in the first data set, there is little consistency between raters of the same subject, and $H^B$ is low compared with $H^W$, as reflected in ratio $H^{BW} = .219$. Although scalability coefficients $H^W$ and $H^B$ are above the criteria presented by Snijders (2001), the ratio coefficient is below .3 and the 95% CIs of $H^B$ and $H^{BW}$ include zero. This indicates that the item responses are mainly determined by the raters, and it is doubtful whether it makes sense to scale subjects on $\theta$ using the test score on this set of items. In the second data set, there is almost as much consistency between raters as within raters, reflected by ratio coefficient $H^{BW} = .922$. All coefficients are above the criteria of Snijders, and the lower bounds of the CIs exceed zero. This indicates that the item responses are mainly determined by the subjects, and subjects can be scaled on $\theta$ using these items.
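To make the computations concrete, the following first-principles R sketch (not the mokken implementation) reproduces $H^W = .762$, $H^B = .167$, and $H^{BW} = .219$ for the first data set. The weighted error sums are expressed as proportions of the 10 within-rater pairs and the 40 ordered between-rater pairs, so the same expected value serves both coefficients:

```r
xi   <- c(2, 2, 2, 1, 1, 0, 0, 1, 1, 2)  # item i: raters 1-5 of s = 1, then s = 2
xj   <- c(1, 2, 2, 0, 1, 0, 1, 0, 1, 1)  # item j
subj <- rep(1:2, each = 5)

# Item steps in descending popularity: X_i >= 1, X_j >= 1, X_i >= 2, X_j >= 2
steps <- list(c(1, 1), c(2, 1), c(1, 2), c(2, 2))       # (item, threshold)
w <- function(x, y) {                                   # weight (Equation 1)
  u <- sapply(steps, function(s) c(x, y)[s[1]] >= s[2])
  sum(sapply(seq_along(u), function(g) u[g] * sum(!u[seq_len(g - 1)])))
}

# Observed weighted Guttman errors, as proportions of (pairs of) ratings
FW <- mean(mapply(w, xi, xj))                           # 10 within-rater pairs
FB <- mean(unlist(lapply(1:10, function(r)              # 40 between-rater pairs
        sapply(setdiff(which(subj == subj[r]), r),
               function(t) w(xi[r], xj[t])))))

# Expected weighted Guttman errors under marginal independence
pi_i <- table(factor(xi, 0:2)) / 10
pi_j <- table(factor(xj, 0:2)) / 10
E <- sum(outer(0:2, 0:2, Vectorize(w)) * outer(c(pi_i), c(pi_j)))

HW <- 1 - FW / E; HB <- 1 - FB / E                      # Equations 2 and 3
round(c(HW = HW, HB = HB, HBW = HB / HW), 3)            # .762 .167 .219
```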
The data example demonstrates that high values for two-level coefficients do not require perfect agreement among raters of the same subject. For $H^B$ to be high, it is important that the probability of a between-rater Guttman error pattern is close to the probability of a within-rater Guttman error pattern.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the Netherlands Organization for Scientific Research (NWO; Grant 406.16.554).
ORCID iD: Letty Koopman
https://orcid.org/0000-0003-3832-2542
References
- Agresti, A. (2012). Categorical data analysis (3rd ed.). New York, NY: John Wiley.
- Agresti, A., & Coull, B. A. (1998). Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician, 52, 119-126. doi:10.1080/00031305.1998.10480550
- Bull, S., Darlington, G., Greenwood, C., & Shin, J. (2001). Design considerations for association studies of candidate genes in families. Genetic Epidemiology, 20, 149-174.
- Cheng, G., Yu, Z., & Huang, J. Z. (2013). The cluster bootstrap consistency in generalized estimating equations. Journal of Multivariate Analysis, 115, 33-47. doi:10.1016/j.jmva.2012.09.003
- Chernick, M. R. (2008). Bootstrap methods: A guide for practitioners and researchers (2nd ed.). Newtown, PA: John Wiley.
- Crisan, D. R., Van de Pol, J. E., & Van der Ark, L. A. (2016). Scalability coefficients for two-level polytomous item scores: An introduction and an application. In L. A. Van der Ark, D. M. Bolt, W.-C. Wang, J. A. Douglas, & M. Wiberg (Eds.), Quantitative psychology research: The 80th annual meeting of the Psychometric Society, Beijing, 2015 (pp. 139-154). New York, NY: Springer. doi:10.1007/978-3-319-38759-8_11
- DeCarlo, L. T., Kim, Y., & Johnson, M. S. (2011). A hierarchical rater model for constructed responses, with a signal detection rater model. Journal of Educational Measurement, 48, 333-356. doi:10.1111/j.1745-3984.2011.00143.x
- Deen, M., & De Rooij, M. (in press). ClusterBootstrap: An R package for the analysis of clustered data using generalized linear models with the cluster bootstrap. Behavior Research Methods. doi:10.3758/s13428-019-01252-y
- De Rooij, M., & Worku, H. M. (2012). A warning concerning the estimation of multinomial logistic models with correlated responses in SAS. Computer Methods and Programs in Biomedicine, 107, 341-346. doi:10.1016/j.cmpb.2012.01.008
- Dorai-Raj, S. (2014). binom: Binomial confidence intervals for several parameterizations (R package version 1.1-1) [Computer software]. Retrieved from https://CRAN.R-project.org/package=binom
- Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York, NY: Chapman & Hall.
- Emons, W. H., Sijtsma, K., & Meijer, R. R. (2007). On the consistency of individual classification using short scales. Psychological Methods, 12, 105-120. doi:10.1037/1082-989X.12.1.105
- Field, C. A., & Welsh, A. H. (2007). Bootstrapping clustered data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69, 369-390. doi:10.1111/j.1467-9868.2007.00593.x
- Harden, J. J. (2011). A bootstrap method for conducting statistical inference with clustered data. State Politics & Policy Quarterly, 11, 223-246. doi:10.1177/1532440011406233
- Hemker, B., Sijtsma, K., Molenaar, I., & Junker, B. (1996). Polytomous IRT models and monotone likelihood ratio of the total score. Psychometrika, 61, 679-693. doi:10.1007/BF02294042
- Koopman, L., Zijlstra, B. J. H., & Van der Ark, L. A. (2017). Weighted Guttman errors: Handling ties and two-level data. In L. A. Van der Ark, M. Wiberg, S. A. Culpepper, J. A. Douglas, & W.-C. Wang (Eds.), Quantitative psychology: The 81st annual meeting of the Psychometric Society, Asheville, North Carolina, 2016 (pp. 183-190). New York, NY: Springer. doi:10.1007/978-3-319-56294-0_17
- Koopman, L., Zijlstra, B. J. H., & Van der Ark, L. A. (in press). Standard errors of two-level scalability coefficients. British Journal of Mathematical and Statistical Psychology. doi:10.1111/bmsp.12174
- Kuijpers, R. E., Van der Ark, L. A., & Croon, M. A. (2013). Standard errors and confidence intervals for scalability coefficients in Mokken scale analysis using marginal models. Sociological Methodology, 43, 42-69. doi:10.1177/0081175013481958
- Kuijpers, R. E., Van der Ark, L. A., Croon, M. A., & Sijtsma, K. (2016). Bias in point estimates and standard errors of Mokken’s scalability coefficients. Applied Psychological Measurement, 40, 331-345. doi:10.1177/0146621616638500
- Lewnard, J. A., Givon-Lavi, N., Huppert, A., Pettigrew, M. M., Regev-Yochay, G., Dagan, R., & Weinberger, D. M. (2015). Epidemiological markers for interactions among Streptococcus pneumoniae, Haemophilus influenzae, and Staphylococcus aureus in upper respiratory tract carriage. The Journal of Infectious Diseases, 213, 1596-1605. doi:10.1093/infdis/jiv761
- Mariano, L. T., & Junker, B. W. (2007). Covariates of the rating process in hierarchical models for multiple ratings of test items. Journal of Educational and Behavioral Statistics, 32, 287-314. doi:10.3102/1076998606298033
- Maulana, R., Helms-Lorenz, M., & Van de Grift, W. (2015). Development and evaluation of a questionnaire measuring pre-service teachers’ behaviour: A Rasch modelling approach. School Effectiveness and School Improvement, 26, 169-194. doi:10.1080/09243453.2014.939198
- Mokken, R. J. (1971). A theory and procedure of scale analysis. The Hague, The Netherlands: Mouton.
- Molenaar, I. W. (1991). A weighted Loevinger H-coefficient extending Mokken scaling to multicategory items. Kwantitatieve Methoden, 12(37), 97-117.
- Ng, S.-W., Grieve, R., & Carpenter, J. R. (2013). Two-stage nonparametric bootstrap sampling with shrinkage correction for clustered data. The Stata Journal, 13, 141-164.
- Patz, R. J., Junker, B. W., Johnson, M. S., & Mariano, L. T. (2002). The hierarchical rater model for rated test items and its application to large-scale educational assessment data. Journal of Educational and Behavioral Statistics, 27, 341-384. doi:10.3102/10769986027004341
- R Core Team. (2018). R: A language and environment for statistical computing [Computer software]. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
- Ravens-Sieberer, U., Herdman, M., Devine, J., Otto, C., Bullinger, M., Rose, M., & Klasen, F. (2014). The European KIDSCREEN approach to measure quality of life and well-being in children: Development, current application, and future advances. Quality of Life Research, 23, 791-803. doi:10.1007/s11136-013-0428-3
- Reise, S. P., Meijer, R. R., Ainsworth, A. T., Morales, L. S., & Hays, R. D. (2006). Application of group-level item response models in the evaluation of consumer reports about health plan quality. Multivariate Behavioral Research, 41, 85-102. doi:10.1207/s15327906mbr41016
- Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometrika Monograph Supplement No. 17). Richmond, VA: Psychometric Society.
- Sen, P. K., & Singer, J. M. (1993). Large sample methods in statistics: An introduction with applications. London, England: Chapman & Hall.
- Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: SAGE.
- Sherman, M., & Le Cessie, S. (1997). A comparison between bootstrap methods and generalized estimating equations for correlated outcomes in generalized linear models. Communications in Statistics-Simulation and Computation, 26, 901-925. doi:10.1080/03610919708813417
- Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item response theory. Thousand Oaks, CA: SAGE.
- Snijders, T. A. B. (2001). Two-level non-parametric scaling for dichotomous data. In A. Boomsma, M. A. J. van Duijn, & T. A. B. Snijders (Eds.), Essays on item response theory (pp. 319-338). New York, NY: Springer. doi:10.1007/978-1-4613-0169-1_17
- Vágó, E., Kemény, S., & Láng, Z. (2011). Overdispersion at the binomial and multinomial distribution. Periodica Polytechnica Chemical Engineering, 55, 17-20. doi:10.3311/pp.ch.2011-1.03
- Van der Ark, L. A. (2007). Mokken scale analysis in R. Journal of Statistical Software, 20(11), 1-19. doi:10.18637/jss.v020.i11
- Van der Ark, L. A. (2012). New developments in Mokken scale analysis in R. Journal of Statistical Software, 48(5), 1-27. doi:10.18637/jss.v048.i05
- Van der Grift, W. (2007). Quality of teaching in four European countries: A review of the literature and application of an assessment instrument. Educational Research, 49, 127-152. doi:10.1080/00131880701369651
- Van Onna, M. J. H. (2004). Estimates of the sampling distribution of scalability coefficient H. Applied Psychological Measurement, 28, 427-449. doi:10.1177/0146621604268735
- Watt, G., McConnachie, A., Upton, M., Emslie, C., & Hunt, K. (2000). How accurately do adult sons and daughters report and perceive parental deaths from coronary disease? Journal of Epidemiology & Community Health, 54, 859-863. doi:10.1136/jech.54.11.859