Abstract
The degree of inter-rater agreement is usually assessed through κ-type coefficients, and the extent of agreement is then characterized by comparing the value of the adopted coefficient against a benchmark scale. Two motivating examples display the different behavior of some κ-type coefficients due to asymmetric distributions of marginal frequencies over categories. An extensive Monte Carlo simulation study has been conducted in order to investigate the robustness of four κ-type coefficients for nominal and ordinal classifications and of an inferential benchmarking procedure that, differently from straightforward benchmarking, does not neglect the influence of the experimental conditions. The robustness has been investigated for several scenarios differing in sample size, rating scale dimension, number of raters, frequency distribution of rater classifications, and pattern of agreement across raters. Simulation results reveal a more pronounced paradoxical behavior of Fleiss' kappa and Conger's kappa with ordinal rather than nominal classifications; the robustness of the coefficients improves with increasing sample size and number of raters for both nominal and ordinal classifications, whereas robustness improves with rating scale dimension only for nominal classifications. By identifying the scenarios (ie, minimum sample size, number of raters, rating scale dimension) with acceptable robustness, this study provides guidelines for the design of robust agreement studies.
Keywords: inter-rater agreement, κ-type coefficients, Monte Carlo simulation, paradoxical behavior, robustness
1. INTRODUCTION
Agreement studies are of critical importance in medicine, clinical epidemiology, diagnostic imaging as well as in similar contexts of research since they provide information about the repeatability and reproducibility of human measurement systems to which physicians, clinicians, and radiologists can be assimilated when evaluating patients' diseases using dichotomous, nominal, or ordinal rating scales.
The traditional measurement system analysis (MSA) procedures estimate the performance of a measurement system as its ability to provide true (ie, accurate) and consistent (ie, precise) results. 1 Generally speaking, accuracy is the closeness between repeated measurements and the true value, although ISO 5725 2 uses the term accuracy to cover both systematic bias (ie, trueness) and random error (ie, precision).
Actually, by definition, subjective evaluations lack a reference value for assessing their trueness, and thus the classical definition of accuracy cannot be directly operationalized for human measurement systems: subjective evaluations can be related only to consistency and assessed as the degree of agreement between repeated evaluations. From a conceptual standpoint, agreement measures the “closeness” between ratings and can be intended as a broader term that contains both accuracy and consistency: if all the ratings can be assumed to come from the same underlying distribution, then agreement is assessing precision around the mean of the ratings. 3
The agreement observed within a rater and among several independent raters provides, respectively, a measure of rater repeatability and of rater reproducibility; the more raters agree on the evaluations they provide, the more confident we can be that they are precise and that their evaluations are reproducible, exchangeable, 4 and thus trustworthy.
A number of theoretical and methodological approaches have been proposed over the years in different disciplines for the assessment of rater repeatability and/or reproducibility; these approaches can be grouped into two main families: the index-based approach and the model-based approach. The former quantifies the rater agreement level in a single number and does not provide insight into the structure and nature of agreement differences; 5 , 6 , 7 , 8 the latter overcomes this criticism and models the ratings provided by each rater to each subject, focusing on the association structure between repeated evaluations. 9 , 10
Even though the model-based approach gives more information than the single estimate provided by the index-based approach, the latter is the easiest to implement and thus the most widely applied, especially by practitioners. This article focuses on the index-based approach, relating the precision of categorical subjective evaluations to the concept of agreement.
The easiest way of measuring agreement between ratings is to calculate the overall percentage of agreement; nevertheless, this measure does not take into account the agreement that would be expected by chance alone. 11 A reasonable alternative is to adopt an agreement coefficient belonging to the wide family of κ-type coefficients, which corrects the probability of observed agreement by the probability of agreement expected by chance, resulting in a relative agreement measure. Specifically, κ-type coefficients compare a real measurement system (ie, rater) against a hypothetical chance measurement system, which is thus used as a reference for correcting the proportion of observed agreement. The chance agreement term estimates the agreement that would be obtained if the subjects had been evaluated completely at random.
The pioneering κ-type coefficients are Scott's π 12 and Cohen's kappa, 13 proposed in 1955 and 1960 for the simplest case of two raters and then extended, respectively, by Fleiss 14 and Conger 15 to the case of multiple raters. The coefficients differ in how the chance measurement system is conceived of. Specifically, the Cohen and Conger coefficients assume that the probabilities of the chance measurement system of classifying an item into each agreement category are equal to the probabilities characterizing the individual raters; according to Scott and Fleiss, instead, they are given by the overall classification probabilities, so that no assumption about the equality of marginal frequencies across the replicated evaluations is required.
Despite their popularity, all the above coefficients are known for being strongly dependent on trait prevalence and rater bias in the subject population, which affect the observed marginal frequency distribution of the ratings over classification categories and thus the calculation of the chance agreement term. Specifically, it has been shown that, for a fixed observed agreement component, “symmetrical unbalanced marginal frequencies produce lower values of κ than asymmetrical unbalanced marginal frequencies”; 16 this means that the coefficients are not robust to changes in the frequency distribution of subjects across rating categories and it is unclear what they are truly measuring. 17 , 18 These criticisms, first observed by Kraemer in 1979, 19 are widely known as the prevalence and bias paradoxes, as referred to by Feinstein and Cicchetti. 16 , 20
The debate about the uses and misuses of Scott's π 12 and Cohen's kappa 13 has been extensive and persistent in the specialized literature (eg, Brennan and Prediger, 21 Feinstein and Cicchetti, 16 Cicchetti and Feinstein, 20 Byrt et al, 22 Gwet, 18 Warrens, 23 Erdmann et al, 24 just to name a few), especially for the simplest case of two raters and dichotomous (or at least nominal) data.
For example, within the purview of 2 × 2 contingency tables, Cicchetti and Feinstein 16 , 20 identified the conditions that lead to paradoxical behavior, showing via practical examples the dependence of Cohen's kappa on trait prevalence; Guggenmoos-Holzmann 25 explored the dependence of Cohen's kappa on trait prevalence with respect to seven validity parameters (ie, sensitivity and specificity of each classification procedure, associations between the procedures both in presence and absence of the target trait, and trait prevalence) and discussed its interpretation as a measure of consistency in extreme populations with maximum true prevalence (ie, trait prevalence = 1) through two examples. The problematic dependency of Cohen's kappa and Scott's π on the differences in rater marginal frequencies has been shown also by Gwet, 26 who conducted a sensitivity analysis to investigate how the agreement assessed via some κ-type coefficients changes with respect to the variation of trait prevalence and rater classification probabilities. Similarly, Erdmann et al 24 conducted a simulation study to determine the standard error of some κ-type coefficients for dichotomous tests depending on trait prevalence, specificity, and sensitivity with different sample sizes.
For the case of 2 × 2 contingency tables, the origin of the paradoxical behavior of Cohen's kappa and Scott's π has been explored by Gwet 27 and a formal proof of the paradoxical behavior associated with Cohen's kappa has been provided by Warrens. 23
In order to overcome the criticisms related to κ-type coefficients, researchers have suggested formulating the agreement expected by chance as uniform across categories (ie, the coefficient commonly known as uniform kappa, although proposed by several authors such as Bennett et al, 28 Janson and Vegelius, 29 Brennan and Prediger, 21 and Byrt et al 22 ) or approximating the propensity of random ratings by the proportion of observed to maximum evaluation variance so as to consistently yield reliable results (ie, the AC1 agreement coefficient by Gwet 18 ). Another alternative approach for handling the paradoxical behavior in the case of two raters and nominal categories has been suggested by Nelson and Pepe, 30 who presented a graphical method for assessing inter-rater agreement.
On the contrary, little effort has been devoted to κ-type coefficients for inter-rater agreement with more than two raters. Gwet 27 introduced the variant of the AC1 coefficient for the case of more than two raters and suggested new variance estimators for the multiple-rater generalized statistics, whose validity, demonstrated via a Monte Carlo simulation study, does not depend upon the hypothesis of independence between raters. Falotico and Quatto 31 discussed the paradoxical behavior of Fleiss' kappa and of its asymptotic confidence interval and suggested the adoption of permutation and bootstrap techniques to avoid the former and the latter, respectively. Marasini et al, 32 instead, extended the uniform chance agreement to the case of multiple raters for ordinal rating categories.
To the best of our knowledge, the effect of changes in marginal frequencies over categories on inter-rater agreement indexes has been best investigated by Quarfoot and Levine, 33 who defined coefficient robustness as the ability of a coefficient to give roughly the same result for a fixed level of agreement across raters, irrespective of the frequency distribution of ratings across categories. Specifically, the Monte Carlo simulation study conducted by Quarfoot and Levine 33 aimed at exploring the robustness of 5 inter-rater agreement indexes with respect to six different frequency distributions of ratings and as many patterns of rater agreement, considering a large sample of 100 subjects classified by 8 raters for a total of 1440 investigated scenarios. A main limitation of the Quarfoot and Levine study is that it examined neither the more critical scenarios of small sample sizes and small groups of raters, nor the influence of the type of rating scale, nor the robustness of the lower confidence bound commonly adopted for a proper characterization of the extent of inter-rater agreement.
Since situations in which the number of subjects belonging to one of the rating categories far exceeds that of the others are very common in clinical contexts, an inter-rater agreement coefficient robust to paradoxical behavior due to prevalence or bias becomes of utmost importance. In such a framework, this article aims to identify the scenarios under which κ-type coefficients are not sensitive to paradoxical behavior, by investigating the effects of sample size, number of involved raters, and type of rating scale on the robustness of κ-type coefficients and by discussing their practical implications for the final characterization of the extent of rater agreement. The investigation concerns four κ-type coefficients for inter-rater agreement with nominal data together with their weighted versions for ordinal data. The investigated coefficients are the two well-cited Fleiss' kappa 14 and Conger's kappa 15 as well as the uniform kappa 21 and Gwet's agreement coefficients 27 (AC1 for nominal and AC2 for ordinal data), proposed as paradox-resistant κ-type coefficients. It is important to note that each of these indexes uses a different approach to correct for chance agreement, so that they could respond differently to the paradoxes.
According to the study aims, the algorithm proposed by Quarfoot and Levine has been extended in order to investigate the robustness of the κ-type coefficients and of the asymptotic lower confidence bound under several scenarios differing in sample size, rating scale dimension, number of raters, frequency distribution of ratings, and pattern of agreement across raters.
The remainder of this article is organized as follows: In Section 2, the κ-type coefficients and the inferential benchmarking procedure are introduced. In Section 3, the implications of the paradoxical behavior of the κ-type coefficients and the usefulness of the inferential characterization procedure are illustrated and discussed through two motivating examples. In Section 4, the simulation algorithm for the analysis of the paradoxical behavior is described and the main results are discussed. Finally, conclusions are summarized in Section 5.
2. MEASURING AGREEMENT FOR NOMINAL AND ORDINAL CLASSIFICATIONS
Let a set of n subjects randomly selected from the population of subjects be classified by r raters on a categorical scale with dimension K; let X_ig be the random variable denoting the category to which subject i is assigned by rater g and let x_ig denote its realization (i = 1, …, n; g = 1, …, r). The random variables X_ig are stochastically independent, their distribution depends on the true classification, and they are completely determined by the model parameters given by:
p_g(k | t) = Pr(X_ig = k | T_i = t)   (1)

with ∑_{k=1}^{K} p_g(k | t) = 1 and p_g(k | t) ≥ 0;

τ_t = Pr(T_i = t),   t = 1, …, K   (2)

so that

p_g(k) = Pr(X_ig = k) = ∑_{t=1}^{K} τ_t p_g(k | t)   (3)

where T_i denotes the true (latent) classification of subject i and p_g(·) is the marginal distribution for the generic rater g.
The agreement among the r raters can be defined by an arbitrary choice along a continuum ranging from agreement among all possible pairs of raters (ie, pairwise agreement, the least restrictive definition of agreement) to agreement among all the raters (ie, r-wise agreement, the most restrictive definition of agreement). Because of its practical interpretation, attention is hereafter restricted to pairwise agreement, according to which the probability that a pair of randomly selected raters, referred to as g and g′, agree on the classification of an arbitrary subject into category k, namely p_{a,k}, is given by:
p_{a,k} = [r(r − 1)/2]^{−1} ∑_{g<g′} Pr(X_ig = k, X_ig′ = k)   (4)

For a generic subject, instead, the overall probability of agreement is given by:

p_a = ∑_{k=1}^{K} p_{a,k}   (5)
At sample level, the probability of agreement p_a is replaced by an unbiased estimator (see De Mast and Van Wieringen 34 for a proof) given by the average proportion of observed agreement among all pairs of raters, formulated by Fleiss 14 as follows:

P_o = (1/n) ∑_{i=1}^{n} ∑_{k=1}^{K} r_ik(r_ik − 1) / [r(r − 1)]   (6)

where r_ik = ∑_{g=1}^{r} I(x_ig = k) is the number of raters classifying subject i into category k, I(·) being the indicator function. Since some inter-rater agreement is expected by chance alone, a positive value of observed agreement does not automatically provide information about rater consistency; for this reason, several authors 13 , 14 , 15 , 21 , 27 proposed the κ-type agreement coefficients, relative measures of agreement obtained by rescaling the observed agreement with the agreement expected by chance alone.
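For illustration, the observed agreement of Equation (6) can be computed directly from the n × K table of counts r_ik; the following minimal Python sketch (function and variable names are ours, not from the original study) assumes that every subject is classified by the same number of raters.

```python
def observed_agreement(counts):
    """Average pairwise observed agreement P_o, Eq. (6).

    counts[i][k] = number of raters r_ik assigning subject i to
    category k; every row must sum to the same number of raters r >= 2.
    """
    n = len(counts)
    r = sum(counts[0])
    total = 0.0
    for row in counts:
        # agreeing ordered rater pairs for this subject, out of r(r - 1)
        total += sum(c * (c - 1) for c in row) / (r * (r - 1))
    return total / n
```

For instance, two subjects each rated unanimously by three raters, counts = [[3, 0], [0, 3]], yield P_o = 1.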
2.1. κ-Type coefficients for nominal classifications
The κ-type agreement coefficients are formulated as follows:

κ = (p_a − p_e) / (1 − p_e)   (7)

where p_e is the probability of agreement expected by chance. At sample level, the estimator of κ is given by:

κ̂ = (P_o − P_e) / (1 − P_e)   (8)
In order to formulate the proportion of agreement expected by chance P_e, it is necessary to define how a chance measurement system is conceived of. Different notions of a chance measurement system are advocated in the literature, leading to as many κ-type coefficients; several well-known alternative coefficients are recalled in the following.
According to Bennett et al, 28 a chance measurement system classifies subjects following the uniform model. Thus, the probability that two raters agree by chance can be estimated as follows:

P_e^U = 1/K   (9)

and the obtained coefficient for inter-rater agreement, the uniform kappa κ_U, is just a linear transformation of the observed proportion of agreement P_o.
Fleiss, 14 instead, defined the proportion of agreement expected by chance under the assumption of homogeneous and thus exchangeable raters. Indeed, Fleiss' kappa κ_F is based on a one-way ANOVA setting where each subject is classified by a different set of raters randomly selected from a population, so that the variation due to the raters cannot be separated from the error variation. 35
Assuming that the probability π_k of classifying a subject into category k is given by the marginal distribution of the classifications provided by all raters, the probability that two raters agree by chance is:

P_e^F = ∑_{k=1}^{K} π_k²   (10)

where π_k can be estimated by the marginal frequencies p̄_k given by:

p̄_k = [1/(nr)] ∑_{i=1}^{n} r_ik   (11)
Although Fleiss proposed his statistic as a generalization of Cohen's kappa to the case of multiple raters, it actually reduces to Scott's π 12 for two raters and coincides with Cohen's kappa if and only if the raters' marginal distributions are all equal.
The generalization of Cohen's kappa to the case of r different raters, commonly referred to as Conger's kappa (ie, κ_C), was proposed first by Conger 15 and later by Davies and Fleiss, 36 Schouten, 37 and O'Connell and Dobson. 38 κ_C is based on a two-way ANOVA setting where all subjects are classified by the same set of raters, who are included as a systematic source of disagreement. 36 According to Conger, a rater providing random classifications is conceived of as one that classifies subjects randomly but with a distribution equal to the marginal distribution of her/his classifications; letting p_gk denote the proportion of subjects classified into category k by rater g, the probability that raters g and g′ agree by chance can be estimated as:

P_e^{(g,g′)} = ∑_{k=1}^{K} p_gk p_g′k   (12)

At sample level, the pairwise agreement coefficient estimates the expected agreement as the mean proportion of chance agreement between all pairs of raters, that is, by averaging all Cohen's kappa pairwise chance agreement estimates. Since averaging all pairwise chance agreement components becomes time-consuming when r is large, a more efficient alternative with direct calculation is recommended, by which P_e^C can be expressed as follows:

P_e^C = [1/(r(r − 1))] ∑_{k=1}^{K} [ (∑_{g=1}^{r} p_gk)² − ∑_{g=1}^{r} p_gk² ]   (13)
The AC1 agreement coefficient proposed by Gwet 27 formulates the agreement expected by chance as the probability of the simultaneous occurrence that one rater provides random ratings and two raters agree. The probability of random rating is approximated with a normalized measure of randomness defined by the ratio of the observed variance to the variance expected under the assumption of totally random ratings. The observed variance is ∑_{k=1}^{K} p̄_k(1 − p̄_k), with p̄_k still formulated as in Equation (11), whereas the variance expected under the assumption of totally random ratings is (K − 1)/K. Under this assumption, the probability that two raters agree by chance can be estimated as follows:

P_e^{AC1} = [1/(K − 1)] ∑_{k=1}^{K} p̄_k(1 − p̄_k)   (14)
The complete sample formulations of the κ-type coefficients for nominal classifications under comparison are reported in Table A1 in Appendix A.
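As a companion to the formulas above, the four chance-agreement models can be sketched in a few lines of Python; the code below is our own illustrative implementation (names are hypothetical, not from the article) and takes the raw n × r matrix of ratings, since Conger's term requires the per-rater marginals.

```python
def kappa_nominal(ratings, K):
    """Uniform kappa, Fleiss' kappa, Conger's kappa, and Gwet's AC1.

    ratings[i][g] is the category (0..K-1) that rater g assigns to
    subject i; all subjects are rated by the same r raters.
    """
    n, r = len(ratings), len(ratings[0])
    # subject-by-category counts r_ik
    counts = [[row.count(k) for k in range(K)] for row in ratings]
    # observed pairwise agreement, Eq. (6)
    p_o = sum(sum(c * (c - 1) for c in row) / (r * (r - 1))
              for row in counts) / n
    # overall marginals p_bar_k (Eq. 11) and per-rater marginals p_gk
    p_bar = [sum(row[k] for row in counts) / (n * r) for k in range(K)]
    p_g = [[sum(1 for i in range(n) if ratings[i][g] == k) / n
            for k in range(K)] for g in range(r)]
    pe = {
        "uniform": 1.0 / K,                                    # Eq. (9)
        "fleiss": sum(p * p for p in p_bar),                   # Eq. (10)
        "conger": sum((sum(p_g[g][k] for g in range(r)) ** 2
                       - sum(p_g[g][k] ** 2 for g in range(r)))
                      for k in range(K)) / (r * (r - 1)),      # Eq. (13)
        "ac1": sum(p * (1 - p) for p in p_bar) / (K - 1),      # Eq. (14)
    }
    # chance-corrected coefficients, Eq. (8)
    return {name: (p_o - e) / (1 - e) for name, e in pe.items()}
```

Under perfect agreement (eg, two raters giving identical ratings to every subject) all four coefficients equal 1, whatever their chance terms.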
2.2. κ-Type coefficients for ordinal classifications
When raters classify subjects on a K-point ordinal scale, some disagreements are more serious than others, that is, disagreement on two distant categories is more relevant than disagreement on neighboring categories; it is therefore necessary to assign a priori different weights, denoted as w_kl, to each pair of rating categories (k, l), with k, l = 1, …, K. The weighted κ-type coefficient for ordinal classifications is thus formulated as:

κ̂_w = (P_o^w − P_e^w) / (1 − P_e^w)   (15)

where the weighted version of the proportion of observed agreement 39 is:

P_o^w = (1/n) ∑_{i=1}^{n} ∑_{k=1}^{K} r_ik(r*_ik − 1) / [r(r − 1)],  with r*_ik = ∑_{l=1}^{K} w_kl r_il   (16)
The agreement weighting scheme is a non-increasing function of the distance |k − l|: w_kl = 1 for k = l and 0 ≤ w_kl < 1 for k ≠ l. The weights can be arbitrarily defined; anyway, the linear, 40 w_kl = 1 − |k − l|/(K − 1), and quadratic, 41 w_kl = 1 − (k − l)²/(K − 1)², weights are the most commonly used weighting schemes for κ-type coefficients. Although Fleiss and Cohen 41 and Schuster 42 showed that the κ-type coefficients with quadratic weights are equivalent to the intraclass correlation coefficient, Brenner and Kliebsch 43 showed that the use of linear weights instead of quadratic weights leads to a statistic less sensitive to the number of rating categories. Thus, the linear weights are here suggested and adopted. It is worth pointing out that the unweighted coefficients are special cases of the corresponding weighted versions, obtained with weights equal to either 0 or 1: w_kl = 1 if k = l and w_kl = 0 elsewhere.
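The two weighting schemes just mentioned are easy to generate programmatically; a small sketch (with 0-based category indices) is:

```python
def linear_weights(K):
    """w_kl = 1 - |k - l| / (K - 1): penalty grows linearly with distance."""
    return [[1 - abs(k - l) / (K - 1) for l in range(K)] for k in range(K)]

def quadratic_weights(K):
    """w_kl = 1 - (k - l)**2 / (K - 1)**2: distant disagreements weigh less."""
    return [[1 - (k - l) ** 2 / (K - 1) ** 2 for l in range(K)]
            for k in range(K)]
```

For K = 3 the linear scheme returns [[1, 0.5, 0], [0.5, 1, 0.5], [0, 0.5, 1]]; the identity matrix of the unweighted case is recovered by zeroing every off-diagonal weight.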
Assuming a uniform model for the chance measurement system, Marasini et al 32 formulated the weighted uniform kappa statistic by defining the weighted proportion of agreement expected by chance as:

P_e^{U,w} = T_w / K²   (17)

T_w being the sum of all weights w_kl: T_w = ∑_{k=1}^{K} ∑_{l=1}^{K} w_kl.
The weighted version of the proportion of chance agreement defined by Fleiss for ordinal classifications is:

P_e^{F,w} = ∑_{k=1}^{K} ∑_{l=1}^{K} w_kl p̄_k p̄_l   (18)

p̄_k and p̄_l being the estimates of the probability of classifying a subject into categories k and l, respectively.
In Conger's kappa, the weighted proportion of agreement expected by chance for ordinal classifications is formulated as follows:

P_e^{C,w} = ∑_{k=1}^{K} ∑_{l=1}^{K} w_kl ( p̄_k p̄_l − s_kl / r )   (19)

where

s_kl = [1/(r − 1)] ∑_{g=1}^{r} (p_gk − p̄_k)(p_gl − p̄_l)   (20)
Gwet 27 proposed AC2, a weighted version of AC1, obtained by formulating the agreement expected by chance as follows:

P_e^{AC2} = [T_w / (K(K − 1))] ∑_{k=1}^{K} p̄_k(1 − p̄_k)   (21)
The complete sample formulations of the κ-type coefficients for ordinal classifications under comparison are reported in Table A2 in Appendix A.
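To complement the formulas of this subsection, the sketch below (an illustrative implementation of ours, not the authors' code) computes the weighted observed agreement of Equation (16) together with the weighted uniform, weighted Fleiss, and AC2 chance terms; the weighted Conger term of Equations (19)-(20) is omitted for brevity.

```python
def weighted_kappas(ratings, K, w):
    """Weighted uniform kappa, weighted Fleiss' kappa, and Gwet's AC2.

    ratings[i][g] in 0..K-1; w is a K x K agreement weight matrix with
    w[k][k] = 1 (eg, the linear weights of Section 2.2).
    """
    n, r = len(ratings), len(ratings[0])
    counts = [[row.count(k) for k in range(K)] for row in ratings]
    # weighted counts r*_ik = sum_l w_kl r_il; with identity weights the
    # weighted observed agreement (Eq. 16) reduces to Eq. (6)
    p_ow = 0.0
    for row in counts:
        rstar = [sum(w[k][l] * row[l] for l in range(K)) for k in range(K)]
        p_ow += sum(row[k] * (rstar[k] - 1) for k in range(K)) / (r * (r - 1))
    p_ow /= n
    p_bar = [sum(row[k] for row in counts) / (n * r) for k in range(K)]
    t_w = sum(sum(wrow) for wrow in w)  # sum of all weights T_w
    pe = {
        "uniform_w": t_w / K ** 2,                                     # Eq. (17)
        "fleiss_w": sum(w[k][l] * p_bar[k] * p_bar[l]
                        for k in range(K) for l in range(K)),          # Eq. (18)
        "ac2": t_w * sum(p * (1 - p) for p in p_bar) / (K * (K - 1)),  # Eq. (21)
    }
    return {name: (p_ow - e) / (1 - e) for name, e in pe.items()}
```

With identity weights and unanimous ratings, all three coefficients equal 1, matching their unweighted counterparts.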
2.3. Characterization of the extent of rater agreement via lower confidence bound
All κ-type coefficients range from −1 to +1: when the observed proportion of agreement equals chance agreement, the coefficient is null; when the observed agreement is greater than chance agreement, the coefficient is positive; vice versa, a negative value can be interpreted as disagreement. Several benchmark scales have been proposed, mainly in the social and medical sciences, for interpreting the extent of agreement. 6 , 7 , 44 , 45 , 46 , 47 , 48 The best known scales are those proposed by Landis and Koch, 49 Altman, 50 and Shrout. 51 The first consists of six ranges of values corresponding to as many categories of agreement: Poor, Slight, Fair, Moderate, Substantial, and Almost Perfect agreement for coefficient values ranging between −1 and 0, 0 and 0.2, 0.21 and 0.4, 0.41 and 0.6, 0.61 and 0.8, and 0.81 and 1.0, respectively. This scale was then simplified by Altman, who collapsed the first two ranges of values into one agreement category, and later by Shrout, who deleted the category of negative values and moved the threshold value of Slight agreement from 0.2 to 0.1.
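The straightforward benchmarking step, comparing a point estimate against the scale, amounts to a simple lookup; a sketch of Landis and Koch's scale in Python (boundary handling is our choice):

```python
def landis_koch(value):
    """Map a coefficient value onto Landis and Koch's agreement categories."""
    if value < 0:
        return "Poor"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"),
                         (0.60, "Moderate"), (0.80, "Substantial"),
                         (1.00, "Almost Perfect")]:
        if value <= upper:
            return label
    raise ValueError("coefficient must not exceed 1")
```

For example, landis_koch(0.444) returns 'Moderate' and landis_koch(0.81) returns 'Almost Perfect'.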
Useful information about the true extent of agreement is provided by the lower confidence bound for κ. 45 The asymptotic normal approximation for κ-type coefficients 52 , 53 has been used to construct symmetric confidence intervals of the form:

κ̂ ± z_{1−α/2} √V̂(κ̂)   (22)

where z_{1−α/2} is the (1 − α/2) quantile of the standard normal distribution. A consistent large-sample variance estimator for κ̂ has been provided by Gwet 54 as a variant of the widely used variance estimate proposed by Fleiss et al. 52 Indeed, the latter has been derived under the assumption of no agreement among raters, making it suitable only for testing the hypothesis of no agreement; if this assumption is not satisfied, the variance estimate becomes irrelevant and should be avoided for quantifying the precision of κ̂ as well as for building confidence intervals. The variance estimator proposed by Gwet 54 , 55 and hereafter adopted is given by:

V̂(κ̂) = [(1 − f)/(n(n − 1))] ∑_{i=1}^{n} (κ_i − κ̂)²   (23)

where f is the sampling fraction of subjects from a target population of size N and κ_i is the agreement estimated at subject level. The formulations of κ_i for all κ-type coefficients under study can be found in Appendix B.
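Given the subject-level agreement values κ_i (whose coefficient-specific formulations are in Appendix B and are not reproduced here), the lower bound of Equations (22)-(23) reduces to a few lines; a sketch using only the Python standard library:

```python
from statistics import NormalDist

def lower_confidence_bound(kappa_hat, kappa_i, alpha=0.05, f=0.0):
    """Lower bound of the two-sided (1 - alpha) asymptotic CI, Eq. (22),
    with the variance estimator of Eq. (23).

    kappa_i holds the subject-level agreement values; f is the sampling
    fraction n/N (0 for an effectively infinite population).
    """
    n = len(kappa_i)
    var = (1 - f) / (n * (n - 1)) * sum((k - kappa_hat) ** 2 for k in kappa_i)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return kappa_hat - z * var ** 0.5
```

When all subject-level values coincide with the point estimate, the variance vanishes and the bound equals the estimate itself.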
The asymptotic confidence interval, whose accuracy depends on the asymptotic normality of the coefficient and of its variance, is by construction reliable only for large sample sizes. Under non-asymptotic conditions, alternative approaches such as bootstrap confidence intervals may be used (eg, References 8, 56, 57, 58).
Among the available methods to build bootstrap confidence intervals, the percentile bootstrap is the simplest and the most popular one. The lower and upper bounds of the two-sided (1 − α)100% percentile bootstrap confidence interval are, respectively, the α/2 and (1 − α/2) percentiles of the cumulative distribution function F̂* of the B bootstrap replications of the κ-type coefficient:

[ κ̂*_{(α/2)}, κ̂*_{(1−α/2)} ] = [ F̂*⁻¹(α/2), F̂*⁻¹(1 − α/2) ]   (24)
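The percentile interval of Equation (24) can be sketched as follows (illustrative code; `stat` stands for any κ-type point estimator taking a ratings matrix, and subjects are resampled with replacement):

```python
import random

def percentile_ci(ratings, K, stat, B=2000, alpha=0.05, seed=1):
    """Two-sided (1 - alpha) percentile bootstrap CI for an agreement
    statistic, resampling subjects (rows) with replacement.

    stat(ratings, K) -> coefficient value; B bootstrap replications.
    """
    rng = random.Random(seed)
    n = len(ratings)
    reps = []
    for _ in range(B):
        sample = [ratings[rng.randrange(n)] for _ in range(n)]
        reps.append(stat(sample, K))
    reps.sort()
    # order statistics at the alpha/2 and 1 - alpha/2 levels
    lo = reps[int((alpha / 2) * (B - 1))]
    hi = reps[int((1 - alpha / 2) * (B - 1))]
    return lo, hi
```

A degenerate sanity check: if every resample yields the same coefficient value, both bounds collapse onto it.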
On the other hand, the bias-corrected and accelerated bootstrap (BCa) confidence interval is recommended for severely non-normal data, 59 , 60 since it adjusts for any bias and lack of symmetry of the bootstrap distribution through the acceleration parameter â and the bias correction parameter ẑ_0. The lower and upper bounds of the two-sided (1 − α)100% BCa confidence interval are defined as:

[ F̂*⁻¹(Φ(ẑ_0 + (ẑ_0 + z_{α/2}) / (1 − â(ẑ_0 + z_{α/2})))), F̂*⁻¹(Φ(ẑ_0 + (ẑ_0 + z_{1−α/2}) / (1 − â(ẑ_0 + z_{1−α/2})))) ]   (25)

Although the BCa confidence interval entails a higher computational complexity, its coverage error is generally smaller than that of the other bootstrap intervals, although it can be erratic for small sample sizes. 59
3. TWO MOTIVATING EXAMPLES
Two real agreement studies are hereafter presented in order to show the implications of the paradoxical behavior of κ-type coefficients. To assess the degree of inter-rater agreement, the evaluations simultaneously provided by the raters have been arranged in an n × K table, where the generic cell (i, k) contains the number r_ik of raters who classify subject i into category k. The observed agreement is computed according to either Equation (6) for nominal data or Equation (16) for ordinal data, and corrected by the agreement expected by chance adopting the analyzed κ-type coefficients, whose formulations are summarized in Tables A1 and A2 in Appendix A; in the case of ordinal data, the linear weighting scheme has been adopted. Moreover, for a proper characterization of the extent of rater agreement, the lower bound of the asymptotic confidence interval of each coefficient has been built according to Equation (22).
3.1. Data sets
The first data set is based on the data originally provided by Sandifer et al 61 and also discussed by Fleiss. 14 In the study of Sandifer et al, between 6 and 10 psychiatrists from a pool of 43 psychiatrists were selected to diagnose a patient. As done by Fleiss, 14 we dropped the excess diagnoses in order to have a constant number of 6 assignments per patient. Specifically, the analyzed data set contains the diagnoses of 6 psychiatrists who were requested to classify 30 patients into one of the following K = 5 nominal diagnostic categories: (1) depression, (2) personality disorder, (3) schizophrenia, (4) neurosis, (5) other.
The second data set was originally published by Holmquist et al 62 and then analyzed also in the studies of Landis and Koch, 49 , 63 Agresti, 64 Becker and Agresti, 65 and Saraçbaşi; 66 it is one of the most common data sets in agreement studies where the paradoxical behavior of the κ-type coefficients is observed. Specifically, the study involved independent pathologists who classified slides with the aim of investigating the variability in the classification of carcinoma in situ of the uterine cervix. Based on the dimension and type of lesions, physicians had to classify the presence of a carcinoma in situ adopting an ordinal scale with K = 5 grades: (1) negative, (2) atypical squamous hyperplasia, (3) carcinoma in situ, (4) squamous carcinoma with early stromal invasion, (5) invasive carcinoma.
According to the study design, κ_F can be correctly applied only to the first data set, containing the diagnoses of 30 patients made by different groups of 6 psychiatrists each, whereas κ_C is suitable for the second data set, where all slides have been independently classified by the same set of pathologists.
3.2. Study results
The results of the two studies are reported in Table 1 and represented in Figure 1 against the agreement categories of Landis and Koch's benchmark scale. The observed agreement is equal to 0.555 and 0.855 in the psychiatric diagnosis and cervix carcinoma studies, respectively.
TABLE 1.
Point estimate, two-sided 95% asymptotic confidence interval, and expected agreement term of each κ-type coefficient for agreement in psychiatric diagnoses and carcinoma classifications
| | Psychiatric diagnosis study (nominal) | | | | Cervix carcinoma study (ordinal) | | |
|---|---|---|---|---|---|---|---|
| Coefficient | Point estimate | 95% CI | P_e term | Coefficient | Point estimate | 95% CI | P_e term |
| κ_U | 0.444 | [0.336, 0.552] | 0.200 | κ_U,w | 0.639 | [0.598, 0.680] | 0.600 |
| κ_F | 0.430 | [0.324, 0.536] | 0.220 | κ_C,w | 0.497 | [0.432, 0.562] | 0.713 |
| AC1 | 0.448 | [0.339, 0.557] | 0.195 | AC2 | 0.687 | [0.647, 0.727] | 0.687 |
FIGURE 1.

Point estimate and two‐sided 95% asymptotic confidence interval of each ‐type coefficient, plotted against Landis and Koch's benchmark scale
In the psychiatric diagnosis study, the degree of inter-rater agreement appears Moderate with every κ-type coefficient; κ_U, κ_F, and AC1 and their P_e terms are quite similar with one another and agree to within a few hundredths. Moreover, there is evidence for rejecting the null hypothesis of Slight inter-rater agreement and accepting the tested hypothesis of at least Fair agreement, since the lower bound of each confidence interval belongs to the region ranging from 0.2 to 0.4.
In the cervix carcinoma study, achieving more than 85% observed agreement might at first sight be impressive, but this finding must be tempered by the fact that the expected agreement due to a uniform distribution could be as high as 60% with a 5-point ordinal scale. Because of the differences among the P_e terms (see Table 1), the estimated degree of inter-rater agreement differs across coefficients: it is classified as Moderate using κ_C,w and as Substantial using κ_U,w and AC2. Moreover, there is evidence for rejecting the null hypothesis of no more than Fair inter-rater agreement and accepting the tested hypothesis of at least Moderate inter-rater agreement for both κ_U,w and κ_C,w, whereas for AC2 there is evidence for rejecting the null hypothesis of no more than Moderate inter-rater agreement and accepting the tested hypothesis of Substantial inter-rater agreement.
It is interesting to highlight how the differences in skewness of the marginal distributions (see Figure 2) and in sample size make the two agreement studies differ, respectively, in terms of similarity across the estimated agreement coefficients and width of the parametric confidence intervals. These differences highlight the importance of investigating the robustness of agreement coefficients against changes in marginal distributions.
FIGURE 2.

Marginal distributions of ratings over categories for psychiatric diagnoses and carcinoma classifications
4. ROBUSTNESS STUDY VIA MONTE CARLO SIMULATION
An extensive Monte Carlo simulation study has been conducted in order to investigate the robustness of κ-type coefficients to changes in the frequency distribution of ratings over categories for a fixed level of agreement.
The ratings provided by r raters on the same set of n subjects adopting a categorical scale are simulated considering that the first rater rates the subjects according to their distribution over classification categories (ie, frequency distribution, FD) and each of the other raters agrees with the first rater according to a given pattern of agreement (ie, agreement distribution, AD). Specifically, the FD mimics the type of subjects the raters are exposed to, as filtered through the rating instrument in question, whereas the AD mimics the agreement between each rater and the first one by modeling the rating probabilities of the other raters conditioned on the FD.
The simulation study has been designed as a multi-factor experimental design with five multi-level factors: rating scale dimension K, sample size n, number of raters r, FD, and AD. The factor K has 9 levels, the factor n has 3 levels, and the factor r has 3 levels, whereas the factors FD and AD have 6 and 2 levels, respectively. The FDs are all special cases of the beta-binomial distribution with different values of the shape parameters (α, β); the main characteristics and the patterns of all FDs are summarized in Table 2. AD 1 is a binomial distribution scaled on the K categories and centered on the ratings provided by the first rater; AD 2 is a uniform distribution, which represents the case where all classification categories have an equal chance of occurring.
TABLE 2.
Parameters and pattern of each FD
| Name | Parameters (α, β) | Pattern |
|---|---|---|
| FD 1 | (0.25, 0.25) | Extremes |
| FD 2 | (1, 1) | Uniform |
| FD 3 | (2, 2) | Central |
| FD 4 | (50, 50) | Binomial |
| FD 5 | (25, 50) | Skewed |
| FD 6 | (5, 50) | Very skewed |
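The FD–AD sampling scheme above can be sketched in Python. The beta-binomial draw follows Table 2; the exact centering used for AD 1 is not fully specified here, so the binomial parameterization below (mean matched to the reference rater's category) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_fd(n, k, a, b, rng):
    # FD: beta-binomial over the k categories;
    # category = 1 + Binomial(k - 1, p) with p ~ Beta(a, b)
    p = rng.beta(a, b, size=n)
    return 1 + rng.binomial(k - 1, p)

def sample_ad1(ref, k, rng):
    # AD 1 (assumed parameterization): binomial on the k categories
    # whose mean coincides with the reference rater's category
    p = (ref - 1) / (k - 1)
    return 1 + rng.binomial(k - 1, p)

def sample_ad2(n, k, rng):
    # AD 2: every category has an equal chance of occurring
    return rng.integers(1, k + 1, size=n)

# Example: 500 subjects on a 5-category scale under FD 4 (binomial pattern)
ref = sample_fd(500, 5, 50, 50, rng)
peer = sample_ad1(ref, 5, rng)
```

With a symmetric FD such as FD 4, the reference ratings concentrate around the central category, while FD 5 and FD 6 would shift them toward one end of the scale.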
For each combination of rating scale dimension, sample size, number of raters, FD, and AD, Monte Carlo data sets have been generated for both nominal and ordinal classifications, and the degree of inter-rater agreement has been assessed with all the κ-type coefficients under study. The robustness of the asymptotic lower confidence bound has been investigated for the sample sizes satisfying the asymptotic condition of normality. For each such scenario, the lower bound of the two-sided 95% asymptotic confidence interval has been built according to Equation (22).
Quarfoot and Levine 33 assessed the robustness of agreement coefficients by looking at the range over the coefficient mean values obtained from different FDs with a fixed AD; this approach is not recommended under non-asymptotic conditions because of the lower representativeness of the mean for the distribution of the κ-type coefficients. The approach suggested here is to assess robustness by looking at the mean range over the coefficient values obtained from different FDs with a fixed AD.
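The difference between the two summaries can be illustrated with synthetic numbers (hypothetical estimates, not values from the study): when the per-replication noise is large, the range over the mean values stays small even though the spread any single replication exhibits across FDs is much larger.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical coefficient estimates: rows = Monte Carlo replications,
# columns = the 6 FDs; the true FD means differ by at most 0.05, but each
# small-sample estimate carries noise with standard deviation 0.2
est = 0.5 + np.linspace(0.0, 0.05, 6) + 0.2 * rng.standard_normal((1000, 6))

# Range over the 6 FD mean values (the summary used by Quarfoot and Levine)
range_of_means = float(np.ptp(est.mean(axis=0)))
# Mean (over replications) of the per-replication range (the summary used here)
mean_range = float(np.mean(est.max(axis=1) - est.min(axis=1)))
```

Averaging before taking the range cancels the sampling noise, so `range_of_means` understates the FD-to-FD spread that a practitioner with a single small sample would actually face, which is what `mean_range` captures.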
The adopted simulation procedure works as follows:

1. sample the ratings of the reference rater under each FD;
2. sample the ratings of each of the other raters under each FD-AD pair;
3. compute the κ-type coefficients for nominal (see Table A1) and ordinal (see Table A2) classifications under each FD-AD pair;
4. build the lower confidence bound of each κ-type coefficient for each FD-AD pair through Equation (22);
5. repeat steps 1 to 4 S times;
6. for each FD-AD pair and combination of rating scale dimension, sample size, and number of raters, compute the mean agreement value of each κ-type coefficient over the S Monte Carlo data sets;
7. for each AD and Monte Carlo replication, compute the range of agreement over the coefficients estimated on data simulated from the 6 different FDs with the fixed AD (Equation (26));
8. for each AD, compute the mean range of agreement over the S replications (Equation (27)); similarly, for each AD, compute the mean range of agreement over the lower bounds obtained from the 6 different FDs with the fixed AD.
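The steps above can be sketched end to end. The minimal Python version below uses the uniform-chance (Brennan-Prediger type) coefficient because its formulation is the simplest, and the same assumed binomial centering for AD 1; the sizes n, r, k, and S are illustrative, not those of the study.

```python
import numpy as np

rng = np.random.default_rng(7)
n, r, k, S = 50, 5, 4, 200          # subjects, raters, categories, replications
fds = [(0.25, 0.25), (1, 1), (2, 2), (50, 50), (25, 50), (5, 50)]  # Table 2

def brennan_prediger(counts, r, k):
    # Uniform-chance coefficient: pairwise observed agreement vs chance 1/k
    n_sub = counts.shape[0]
    po = (counts * (counts - 1)).sum() / (n_sub * r * (r - 1))
    pe = 1.0 / k
    return (po - pe) / (1 - pe)

ranges = []
for _ in range(S):
    coeffs = []
    for a, b in fds:
        # step 1: reference ratings under the FD
        ref = 1 + rng.binomial(k - 1, rng.beta(a, b, size=n))
        # step 2: remaining raters under AD 1 (assumed binomial centering)
        p = (ref - 1) / (k - 1)
        peers = 1 + rng.binomial(k - 1, p[:, None], size=(n, r - 1))
        ratings = np.column_stack([ref, peers])
        # subject-by-category count matrix, then the coefficient (step 3)
        counts = np.stack([(ratings == c + 1).sum(axis=1) for c in range(k)], axis=1)
        coeffs.append(brennan_prediger(counts, r, k))
    # step 7: per-replication range across the 6 FDs with the fixed AD
    ranges.append(max(coeffs) - min(coeffs))
# step 8: mean range of agreement
mean_range = float(np.mean(ranges))
```

A robust coefficient would keep `mean_range` small; a coefficient whose value swings with the marginal distribution inflates it.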
The simulation algorithm has been implemented using Mathematica (Version 11.0, Wolfram Research, Inc., Champaign, IL, USA).
4.1. Simulation results
Simulation results revealed a negligible effect of the number of raters and of the sample size on the mean agreement value. For illustrative purposes, the mean and standard deviation of all the κ-type coefficients for selected numbers of raters and subjects are reported in Tables 1 through 4 in the Supplementary Materials. The obtained results show that the mean agreement value changes with the number of categories; conversely, the number of raters and the sample size only slightly affect the mean agreement value, which increases by about 3% with increasing number of raters and decreasing sample size.
Specifically, under AD 1 the simulation results reveal a slightly higher dependency of two of the coefficients on trait prevalence, since their mean values decrease to the lower adjacent benchmark category with FD 4, FD 5, and FD 6 (ie, unbalanced marginal distributions), whereas the other coefficients generally assume values belonging to the same agreement category for all FDs (see Table 1 in the Supplementary Materials). The sensitivity of the weighted Fleiss and Conger kappa coefficients to changes in FDs is much more evident: the mean agreement obtained with the very skewed frequency distribution FD 6 is two benchmark categories lower than that obtained with balanced FDs; vice versa, for the other weighted coefficients the mean agreement is essentially the same whatever the FD (see Table 2 in the Supplementary Materials). The above results obtained for ordinal data are in line with those discussed in Quarfoot and Levine 33 in terms of average agreement. Under AD 2, instead, all the mean agreement values are approximately 0 (see Tables 3 and 4 in the Supplementary Materials).
For the two extreme scenarios, ie, the smallest and the largest combinations of raters and subjects, Figures 3 and 4 report the distributions of the range over the coefficients estimated from different FDs with a fixed AD, and Tables 3 and 4 compare the mean range over the coefficient values against the range over the coefficient mean values (hereafter, range of means). The range distributions of two of the coefficients are clustered around a lower mean, and thus these coefficients can be considered more robust to changes in FDs than the weighted Fleiss and Conger kappa coefficients. The comparative analysis between the two summaries reveals that they are comparable under asymptotic conditions (see Table 4), but for small sample sizes the range of means always overestimates the coefficient robustness (see Table 3).
FIGURE 3.

Distribution of the range over the coefficient values obtained from different FDs with AD 1 when raters evaluate subjects on an ordinal scale
FIGURE 4.

Distribution of the range over the coefficient values obtained from different FDs with AD 1 when raters evaluate subjects on an ordinal scale
TABLE 3.
Range over the mean κ-type coefficient values (range of means) and mean range obtained from different FDs with AD 1, when raters evaluate subjects on ordinal scales (columns: rating scale dimension)

| Coefficient | Statistic | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| | Range of means | 0.066 | 0.098 | 0.111 | 0.121 | 0.125 | 0.124 | 0.133 | 0.127 | 0.125 |
| | Mean range | 0.462 | 0.387 | 0.337 | 0.309 | 0.289 | 0.275 | 0.264 | 0.250 | 0.239 |
| | Range of means | 0.406 | 0.434 | 0.452 | 0.447 | 0.449 | 0.468 | 0.473 | 0.463 | 0.468 |
| | Mean range | 0.714 | 0.660 | 0.636 | 0.624 | 0.616 | 0.628 | 0.630 | 0.623 | 0.622 |
| | Range of means | 0.389 | 0.417 | 0.435 | 0.431 | 0.435 | 0.453 | 0.456 | 0.447 | 0.451 |
| | Mean range | 0.691 | 0.638 | 0.614 | 0.602 | 0.597 | 0.607 | 0.609 | 0.602 | 0.600 |
| | Range of means | 0.139 | 0.157 | 0.152 | 0.154 | 0.150 | 0.144 | 0.142 | 0.134 | 0.130 |
| | Mean range | 0.452 | 0.376 | 0.329 | 0.301 | 0.282 | 0.265 | 0.251 | 0.236 | 0.227 |
TABLE 4.
Range over the mean κ-type coefficient values (range of means) and mean range obtained from different FDs with AD 1, when raters evaluate subjects on ordinal scales (columns: rating scale dimension)

| Coefficient | Statistic | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| | Range of means | 0.094 | 0.137 | 0.158 | 0.167 | 0.178 | 0.175 | 0.174 | 0.173 | 0.172 |
| | Mean range | 0.119 | 0.153 | 0.171 | 0.179 | 0.182 | 0.183 | 0.182 | 0.180 | 0.178 |
| | Range of means | 0.304 | 0.374 | 0.413 | 0.439 | 0.442 | 0.469 | 0.476 | 0.482 | 0.489 |
| | Mean range | 0.305 | 0.374 | 0.414 | 0.441 | 0.445 | 0.472 | 0.482 | 0.491 | 0.499 |
| | Range of means | 0.304 | 0.374 | 0.413 | 0.439 | 0.442 | 0.469 | 0.476 | 0.482 | 0.489 |
| | Mean range | 0.305 | 0.374 | 0.414 | 0.440 | 0.445 | 0.472 | 0.482 | 0.491 | 0.499 |
| | Range of means | 0.214 | 0.241 | 0.244 | 0.240 | 0.237 | 0.223 | 0.215 | 0.206 | 0.200 |
| | Mean range | 0.223 | 0.245 | 0.247 | 0.242 | 0.240 | 0.225 | 0.216 | 0.208 | 0.202 |
The simulation results obtained for the lower confidence bounds are almost the same as those obtained for the coefficient estimates, revealing that trait prevalence affects the sample variance in a similar manner; moreover, little difference is observed across the investigated settings, so the mean range of agreement is reported in Figures 5 and 6, respectively, for the unweighted (ie, nominal classifications) and the weighted (ie, ordinal classifications) coefficients.
FIGURE 5.

Mean range of agreement obtained for the four unweighted agreement coefficient estimates under AD 1 and AD 2
FIGURE 6.

Mean range of agreement obtained for the four weighted agreement coefficient estimates under AD 1 and AD 2
The patterns in Figure 5 show that under AD 2, except in the scenarios combining the fewest raters with the smallest samples, the mean range is no more than 0.2 for all four unweighted coefficients, so they can be recognized as robust to changes in FDs. Indeed, a mean range between 0 and 0.2 is reasonable since it confines the extent of agreement to the same benchmark category, or at most to two adjacent categories, whatever the frequency distribution of ratings over classification categories. Under AD 1, instead, robustness worsens while remaining quite comparable across all the agreement coefficients, which fail to be robust only with the smallest samples; with increasing sample size and number of involved raters, the coefficients exhibit a less paradoxical behavior.
Conversely, with ordinal classifications (see Figure 6), the patterns of the mean range differ across coefficients, and the similar behavior of the weighted Fleiss and Conger kappa coefficients on one hand, and of the other two coefficients on the other, allows them to be split into two groups, with a much higher robustness to changes in FDs for the latter. Specifically, under AD 2 the mean range is less than 0.2, so the coefficients can be considered robust; under AD 1, instead, the two robust coefficients remain robust in most scenarios, whereas for the weighted Fleiss and Conger kappa coefficients the mean range exceeds the value 0.4, so their adoption is not recommended because of their strong sensitivity to changes in frequency distribution.
It is worth pointing out that although κ-type coefficients are affected by the number of categories (as revealed by the simulation results reported in Tables 1 through 4 in the Supplementary Materials), it is not reasonable to assume that the change in robustness over the rating scale dimension is exclusively due to the changes of the underlying κ-type coefficients. Indeed, if the κ-type coefficients changed in the same way with increasing number of classification categories for all the 6 FDs under study, the mean range would be the same whatever the scale dimension. Actually, the mean range of agreement accounts only for the coefficient variation across FDs and does not depend on the estimated agreement values; that is, the same range could be obtained both with a low and with a high degree of agreement. Moreover, increasing the number of classification categories makes the most sensitive coefficients decrease with balanced FDs and increase with unbalanced FDs, so that the mean range increases, which means a worsening of coefficient robustness.
The simulation results can be read more interestingly in light of their practical implications for the characterization of the extent of agreement via a benchmark scale. Indeed, for a given AD, differences across FDs can make the extent of rater agreement span over a number of interpretation categories, depending on the adopted benchmark scale. For example, adopting the Landis and Koch benchmark scale, a small mean range implies that the extent of agreement spans at most two adjacent categories (eg, it may belong to Fair or range from Fair to Moderate), whereas the extent of agreement spans up to four-steps-apart categories for the largest observed mean ranges (eg, from Slight to Almost Perfect).
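A small sketch of this benchmarking effect follows; the Landis and Koch cut-points are the standard ones, while the coefficient values are hypothetical and chosen only to illustrate a small versus a large spread across FDs.

```python
# Landis and Koch benchmark scale: lower cut-points of each category
benchmarks = [(0.0, "Slight"), (0.2, "Fair"), (0.4, "Moderate"),
              (0.6, "Substantial"), (0.8, "Almost Perfect")]

def category(kappa):
    # values below 0 fall in "Poor"
    label = "Poor"
    for cut, name in benchmarks:
        if kappa >= cut:
            label = name
    return label

# Hypothetical coefficient estimates for one AD across the 6 FDs
small_spread = [0.32, 0.35, 0.38, 0.41, 0.44, 0.47]   # range 0.15
large_spread = [0.12, 0.25, 0.41, 0.55, 0.70, 0.82]   # range 0.70

small_cats = sorted({category(v) for v in small_spread})
large_cats = sorted({category(v) for v in large_spread})
```

With the small spread the verbal characterization moves only between two adjacent categories, whereas the large spread lets the same underlying agreement be labeled anywhere from Slight to Almost Perfect depending on the FD.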
5. CONCLUSIONS
This article investigates, via an extensive Monte Carlo simulation study, the robustness of κ-type coefficients for assessing inter-rater agreement with both nominal and ordinal classifications.
Simulation results show that the robustness of κ-type coefficients increases with sample size and number of involved raters, and they reveal the paradoxical behavior of all κ-type coefficients with few raters and small sample sizes for both nominal and ordinal classifications. The investigation of robustness under several experimental conditions, never explored before to the best of our knowledge, sheds light on the different behavior of inter-rater agreement coefficients with nominal and ordinal classifications, showing the higher paradoxical behavior of the weighted variants of Fleiss' and Conger's kappa. Indeed, with nominal classifications the robustness is comparable across κ-type coefficients, although the values of Fleiss' and Conger's kappa slightly decrease with unbalanced marginal distributions; vice versa, with ordinal classifications the other two coefficients are more robust than the weighted Fleiss and Conger kappa coefficients, which are strongly influenced by the frequency distributions over categories. The obtained simulation results allow the identification of the scenarios where the degree of agreement is about the same whatever the FD: many raters classifying a moderate set of subjects, or less than 5 raters classifying a larger set of subjects. For such scenarios, the effect of FDs being negligible, there is no doubt about the adoption of κ-type coefficients as robust measures of rater agreement.
The variation of the coefficient values, for both nominal and ordinal data, reflects the changes in the observed proportion of agreement, since the coefficients are just linear transformations of the observed agreement. It should not come as a surprise that both agreement terms change with FDs, because different combinations of AD and FD affect the distribution of ratings over classification categories (ie, the category assigned to subjects by raters), to which both the observed and the expected agreement are closely related. However, such variations are negligible, producing mean range values less than 0.2, with the exception of the scenarios with the fewest raters.
Fleiss' and Conger's kappa coefficients are more affected by paradoxical behavior because of the chance measurement system model they rely on. Firstly, they strongly depend on the true subject classification and confound measurement precision with accuracy and/or other properties of the subject population. Secondly, there is a strongly nonlinear relationship between the observed agreement and the coefficient value, so that small variations in the observed agreement can result in dramatic changes in the final degree of agreement. The strong sensitivity of the linearly weighted Fleiss' and Conger's kappa coefficients to changes in the distribution of classifications over categories can make their standard errors so large as to render the coefficients practically useless.
It is also worth noting that the choice between Fleiss' and Conger's kappa should be based on the way the raters are selected for the agreement study, since their adoption ignoring the study design can lead to incorrect conclusions: 67 in the two-way ANOVA setting, Fleiss' kappa is likely to result in an underestimation of the agreement level, giving on average smaller values than Conger's kappa, whereas the misuse of Conger's kappa in one-way ANOVA settings is likely to overestimate the agreement level. 68 , 69
The inter-rater agreement coefficients use, as reference for the observed agreement, different chance measurement systems, each of which can be more or less suitable for a given context. Fleiss' and Conger's kappa represent two alternatives when the chance measurement system cannot follow the uniform model. However, as revealed by the simulation results, they are not recommended in the case of unbalanced marginal distributions with ordinal classifications because of their sensitivity to trait prevalence; in such circumstances, the adoption of the two robust alternatives is strongly recommended.
Supporting information
Data S1 Supplementary material
ACKNOWLEDGEMENTS
Open Access Funding provided by Universita degli Studi di Napoli Federico II within the CRUI‐CARE Agreement. [Correction added on 26 May 2022, after first online publication: CRUI funding statement has been added.]
APPENDIX A.
A.1.
Let n be the number of subjects classified by r raters on a k-categorical rating scale, r_ic be the number of raters who classify subject i into category c, π̂_c be the estimate of the probability of classifying a subject into category c, p_cg be the proportion of subjects classified into category c by rater g, and w_ch be the agreeing weight assigned to each pair of rating categories (c, h).
The complete sample formulations of the κ-type coefficients under comparison, respectively for nominal and ordinal classifications, are reported in Tables A1 and A2.
TABLE A1.
Formulation of the introduced inter-rater κ-type coefficients for nominal classifications, following the standard multi-rater formulations in Gwet 70; all coefficients share the form (p_o − p_e)/(1 − p_e), with observed agreement p_o = Σ_i Σ_c r_ic(r_ic − 1)/[n r(r − 1)]

| Coefficient | Chance agreement p_e |
|---|---|
| Uniform-chance (Brennan–Prediger) | p_e = 1/k |
| Fleiss | p_e = Σ_c π̂_c² |
| Conger | p_e = Σ_c (p̄_c² − s_c²/r), with p̄_c and s_c² the mean and variance of p_cg over raters |
| Gwet AC1 | p_e = Σ_c π̂_c(1 − π̂_c)/(k − 1) |
TABLE A2.
Formulation of the introduced inter-rater κ-type coefficients for ordinal classifications (weighted analogues, following Gwet 70); all share the form (p_o − p_e)/(1 − p_e), with weighted observed agreement p_o = Σ_i Σ_c r_ic(r*_ic − 1)/[n r(r − 1)], where r*_ic = Σ_h w_ch r_ih and T_w = Σ_c Σ_h w_ch

| Coefficient | Chance agreement p_e |
|---|---|
| Weighted uniform-chance (Brennan–Prediger) | p_e = T_w/k² |
| Weighted Fleiss | p_e = Σ_c Σ_h w_ch π̂_c π̂_h |
| Weighted Conger | p_e = Σ_c Σ_h w_ch (p̄_c p̄_h − s_ch/r), with s_ch the covariance of p_cg and p_hg over raters |
| Gwet AC2 | p_e = T_w Σ_c π̂_c(1 − π̂_c)/[k(k − 1)] |
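As a sketch, the unweighted Fleiss' kappa can be computed from the n × k matrix of rater counts using the standard multi-rater formulation (see Gwet 70); the helper below assumes each subject is rated by the same number of raters.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa from an (n subjects x k categories) matrix of rater counts."""
    counts = np.asarray(counts, dtype=float)
    n, k = counts.shape
    r = counts[0].sum()                        # raters per subject (assumed constant)
    # observed pairwise agreement: p_o = sum r_ic (r_ic - 1) / [n r (r - 1)]
    po = (counts * (counts - 1)).sum() / (n * r * (r - 1))
    # chance agreement from the overall category proportions: p_e = sum pi_c^2
    pi = counts.sum(axis=0) / (n * r)
    pe = (pi ** 2).sum()
    return (po - pe) / (1 - pe)

# Perfect agreement: all 4 raters put each subject in the same category
perfect = np.array([[4, 0], [0, 4], [4, 0]])
```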
APPENDIX B.
B.1.
The κ-type coefficients can be assumed asymptotically normally distributed. 70
The variance of the first of the compared agreement coefficients is formulated as:
(B1)
where f = n/N is the sampling fraction of subjects from a target population of size N; in many studies f is set equal to 0 because N is unknown.
The variance of the weighted version of the Fleiss kappa coefficient is formulated as:
(B2)
with the intermediate quantities defined in Equation (11).
The variance of the weighted version of the Conger kappa agreement coefficient is formulated as:
(B3)
where the indicator term equals 1 if rater g classifies subject i into category c, and 0 otherwise.
Finally, the variance of the remaining agreement coefficient is formulated as:
(B4)
with the quantities defined in Equation (11).
Adopting the identity matrix as weighting scheme, these expressions work as well for unweighted coefficients.
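Given any of these variance estimates, the asymptotic lower confidence bound used in the simulation (Equation (22)) has the generic form estimate minus normal quantile times standard error; a stdlib-only sketch, where the exact form of Equation (22) is assumed to follow this standard construction:

```python
from math import sqrt
from statistics import NormalDist

def lower_confidence_bound(kappa_hat, var_hat, alpha=0.05):
    # LB of the two-sided (1 - alpha) interval, from asymptotic normality:
    # LB = estimate - z_{1 - alpha/2} * sqrt(variance)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return kappa_hat - z * sqrt(var_hat)
```

For example, an estimated coefficient of 0.6 with variance 0.01 yields a 95% lower bound of about 0.404.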
Vanacore A, Pellegrino MS. Robustness of κ-type coefficients for clinical agreement. Statistics in Medicine. 2022;41(11):1986–2004. doi: 10.1002/sim.9341
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.
REFERENCES
- 1. Bashkansky E, Dror S, Ravid R, Grabov P. Effectiveness of a product quality classifier. Quality Engineering. 2007;19(3):235‐244. 10.1080/08982110701334577 [DOI] [Google Scholar]
- 2. International Organization for Standardization (ISO) . Accuracy (Trueness and Precision) of Measurement Methods and Results ‐ Part 1: General Principles and Definitions (5725‐1). Geneva, Switzerland: ISO; 1994. [Google Scholar]
- 3. Barnhart HX, Haber MJ, Lin LI. An overview on assessing agreement with continuous measurements. J Biopharm Stat. 2007;17(4):529‐569. [DOI] [PubMed] [Google Scholar]
- 4. Hayes AF. Statistical Methods for Communication Science. London, UK: Routledge; 2020. [Google Scholar]
- 5. Vanbelle S. Agreement Between Raters and Groups of Raters. PhD thesis. Université de Liège, Belgium; 2009.
- 6. Watson P, Petrie A. Method agreement analysis: a review of correct methodology. Theriogenology. 2010;73(9):1167‐1179. [DOI] [PubMed] [Google Scholar]
- 7. Hallgren KA. Computing inter‐rater reliability for observational data: an overview and tutorial. Tutor Quant Methods Psychol. 2012;8(1):23‐39. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Zapf A, Castell S, Morawietz L, Karch A. Measuring inter‐rater reliability for nominal data – which coefficients and confidence intervals are appropriate? BMC Med Res Methodol. 2016;16(1):93‐102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9. de Mast J, van Wieringen WN. Modeling and evaluating repeatability and reproducibility of ordinal classifications. Technometrics. 2010;52(1):94‐106. 10.1198/tech.2009.08052 [DOI] [Google Scholar]
- 10. Yilmaz AE, Saracbasi T. Assessing agreement between raters from the point of coefficients and loglinear models. Journal of Data Science. 2017;15(1):1‐24. 10.6339/jds.201701_15(1).0001 [DOI] [Google Scholar]
- 11. Sim J, Wright CC. The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys Therapy. 2005;85(3):257‐268. [PubMed] [Google Scholar]
- 12. Scott WA. Reliability of content analysis: the case of nominal scale coding. Public Opinion Quarterly. 1955;19(3):321‐325. https://www.jstor.org/stable/2746450 [Google Scholar]
- 13. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Measur. 1960;20(1):37‐46. [Google Scholar]
- 14. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76(5):378‐382. [Google Scholar]
- 15. Conger AJ. Integration and generalization of kappas for multiple raters. Psychol Bull. 1980;88(2):322‐328. [Google Scholar]
- 16. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. the problems of two paradoxes. J Clin Epidemiol. 1990;43(6):543‐549. [DOI] [PubMed] [Google Scholar]
- 17. Eugenio BD, Glass M. The kappa statistic: a second look. Comput Linguist. 2004;30(1):95‐101. [Google Scholar]
- 18. Gwet K. Kappa statistic is not satisfactory for assessing the extent of agreement between raters. StatMethods Inter‐rater Reliab Assess. 2002;1(6):1‐6. [Google Scholar]
- 19. Kraemer HC. Ramifications of a population model for k as a coefficient of reliability. Psychometrika. 1979;44(4):461‐472. [Google Scholar]
- 20. Cicchetti DV, Feinstein AR. High agreement but low kappa: II. resolving the paradoxes. J Clin Epidemiol. 1990;43(6):551‐558. [DOI] [PubMed] [Google Scholar]
- 21. Brennan RL, Prediger DJ. Coefficient kappa: some uses, misuses, and alternatives. Educ Psychol Measur. 1981;41(3):687‐699. [Google Scholar]
- 22. Byrt T, Bishop J, Carlin JB. Bias, prevalence and kappa. J Clin Epidemiol. 1993;46(5):423‐429. [DOI] [PubMed] [Google Scholar]
- 23. Warrens MJ. A formal proof of a paradox associated with Cohen's kappa. J Classif. 2010;27(3):322‐332. [Google Scholar]
- 24. Erdmann TP, De Mast J, Warrens MJ. Some common errors of experimental design, interpretation and inference in agreement studies. Stat Methods Med Res. 2015;24(6):920‐935. [DOI] [PubMed] [Google Scholar]
- 25. Guggenmoos‐Holzmann I. How reliable are chance‐corrected measures of agreement? Stat Med. 1993;12(23):2191‐2205. [DOI] [PubMed] [Google Scholar]
- 26. Gwet K. Inter‐rater reliability: dependency on trait prevalence and marginal homogeneity. Statistical Methods for Inter‐Rater Reliability Assessment Series. 2002;2(1):1‐9. [Google Scholar]
- 27. Gwet KL. Computing inter‐rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol. 2008;61(1):29‐48. [DOI] [PubMed] [Google Scholar]
- 28. Bennett EM, Alpert R, Goldstein AC. Communications Through Limited Response Questioning. Public Opinion Quarterly. 1954;18(3):303. 10.1086/266520 [DOI] [Google Scholar]
- 29. Janson S, Vegelius J. On generalizations of the G index and the phi coefficient to nominal scales. Multivar Behav Res. 1979;14(2):255‐269. [DOI] [PubMed] [Google Scholar]
- 30. Nelson JC, Pepe MS. Statistical description of interrater variability in ordinal ratings. Stat Methods Med Res. 2000;9(5):475‐496. [DOI] [PubMed] [Google Scholar]
- 31. Falotico R, Quatto P. Fleiss' kappa statistic without paradoxes. Qual Quant. 2015;49(2):463‐470. [Google Scholar]
- 32. Marasini D, Quatto P, Ripamonti E. Assessing the inter‐rater agreement for ordinal data through weighted indexes. Stat Methods Med Res. 2016;25(6):2611‐2633. [DOI] [PubMed] [Google Scholar]
- 33. Quarfoot D, Levine RA. How robust are multirater interrater reliability indices to changes in frequency distribution? Am Stat. 2016;70(4):373‐384. [Google Scholar]
- 34. De Mast J, Van Wieringen WN. Measurement system analysis for categorical measurements: agreement and kappa‐type indices. Journal of Quality Technology. 2007;39(3):191‐202. 10.1080/00224065.2007.11917688 [DOI] [Google Scholar]
- 35. Landis JR, Koch GG. A one‐way components of variance model for categorical data. Biometrics. 1977;33(4):671. 10.2307/2529465 [DOI] [Google Scholar]
- 36. Davies M, Fleiss JL. Measuring agreement for multinomial data. Biometrics. 1982;38(4):1047. 10.2307/2529886 [DOI] [Google Scholar]
- 37. Schouten H. Measuring pairwise agreement among many observers. II. some improvements and additions. Biometr J. 1982;24(5):431‐435. [Google Scholar]
- 38. O'Connell DL, Dobson AJ. General observer‐agreement measures on individual subjects and groups of subjects. Biometrics. 1984;40(4):973. 10.2307/2531148 [DOI] [Google Scholar]
- 39. Abraira V, De Vargas AP. Generalization of the kappa coefficient for ordinal categorical data, multiple observers and incomplete designs. Qüestiió quaderns d'estadística i investigació operativa. 1999;23(3):561‐571. [Google Scholar]
- 40. Cicchetti DV, Allison T. A new procedure for assessing reliability of scoring EEG sleep recordings. American Journal of EEG Technology. 1971;11(3):101‐110. 10.1080/00029238.1971.11080840 [DOI] [Google Scholar]
- 41. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Measur. 1973;33(3):613‐619. [Google Scholar]
- 42. Schuster C. A note on the interpretation of weighted kappa and its relations to other rater agreement statistics for metric scales. Educ Psychol Measur. 2004;64(2):243‐253. [Google Scholar]
- 43. Brenner H, Kliebsch U. Dependence of weighted kappa coefficients on the number of categories. Epidemiology. 1996;7(2):199‐202. [DOI] [PubMed] [Google Scholar]
- 44. Everitt BS. The Analysis of Contingency Tables. Boca Raton, FL: CRC Press; 1992. [Google Scholar]
- 45. Blackman NJM, Koval JJ. Interval estimation for Cohen's kappa as a measure of agreement. Stat Med. 2000;19(5):723‐741. [DOI] [PubMed] [Google Scholar]
- 46. Altaye M, Donner A, Eliasziw M. A general goodness‐of‐fit approach for inference procedures concerning the kappa statistic. Stat Med. 2001;20(16):2479‐2488. [DOI] [PubMed] [Google Scholar]
- 47. Klar N, Lipsitz SR, Parzen M, Leong T. An exact bootstrap confidence interval for kappa in small samples. Journal of the Royal Statistical Society: Series D (The Statistician). 2002;51(4):467‐478. 10.1111/1467-9884.00331 [DOI] [Google Scholar]
- 48. Bland JM. Measurement in Health and Disease. Cohen's Kappa. York, UK: University of York, Department of Health Sciences; 2008. [Google Scholar]
- 49. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159‐174. [PubMed] [Google Scholar]
- 50. Altman DG. Practical Statistics for Medical Research. Boca Raton, FL: CRC Press; 1990. [Google Scholar]
- 51. Shrout PE. Measurement reliability and agreement in psychiatry. Stat Methods Med Res. 1998;7(3):301‐317. [DOI] [PubMed] [Google Scholar]
- 52. Fleiss JL, Nee JC, Landis JR. Large sample variance of kappa in the case of different sets of raters. Psychol Bull. 1979;86(5):974‐977. [Google Scholar]
- 53. Ben David A. Comparison of classification accuracy using Cohen's Weighted Kappa. Expert Systems with Applications. 2008;34(2):825‐832. 10.1016/j.eswa.2006.10.022 [DOI] [Google Scholar]
- 54. Gwet KL. Variance Estimation of Nominal‐Scale Inter‐Rater Reliability with Random Selection of Raters. Psychometrika. 2008;73(3):407‐430. 10.1007/s11336-007-9054-8 [DOI] [Google Scholar]
- 55. Gwet KL. Large‐sample variance of Fleiss generalized kappa. Educ Psychol Measur. 2021;81(4):781‐790. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56. Lee J, Fung K. Confidence interval of the kappa coefficient by bootstrap resampling. Psychiatry Res. 1993;49:97‐98. [DOI] [PubMed] [Google Scholar]
- 57. Reichenheim ME. Confidence Intervals for the Kappa Statistic. The Stata Journal: Promoting communications on statistics and Stata. 2004;4(4):421‐428. 10.1177/1536867x0400400404 [DOI] [Google Scholar]
- 58. Vanacore A, Pellegrino MS. Inferring rater agreement with ordinal classification. In Petrucci A, Racioppi F, Verde R, eds. New Statistical Developments in Data Science. Cham: Springer; 2017:91‐101.
- 59. Carpenter J, Bithell J. Bootstrap confidence intervals: when, which, what? a practical guide for medical statisticians. Stat Med. 2000;19(9):1141‐1164. [DOI] [PubMed] [Google Scholar]
- 60. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Boca Raton, FL: CRC Press; 1994. [Google Scholar]
- 61. Sandifer MG, Hordern A, Timbury GC, Green LM. Psychiatric diagnosis: a comparative study in North Carolina, London and Glasgow. Br J Psychiatry. 1968;114(506):1‐9. [DOI] [PubMed] [Google Scholar]
- 62. Holmquist N, McMahan C, Williams O. Variability in classification of carcinoma in situ of the uterine cervix. Arch Pathol. 1967;84(4):334‐345. [PubMed] [Google Scholar]
- 63. Landis JR, Koch GG. An application of hierarchical kappa‐type statistics in the assessment of majority agreement among multiple observers. Biometrics. 1977;33(2):363‐374. [PubMed] [Google Scholar]
- 64. Agresti A. An agreement model with kappa as parameter. Stat Probab Lett. 1989;7(4):271‐273. [Google Scholar]
- 65. Becker MP, Agresti A. Log‐linear modelling of pairwise interobserver agreement on a categorical scale. Stat Med. 1992;11(1):101‐114. [DOI] [PubMed] [Google Scholar]
- 66. Saraçbaşi T. Agreement models for multiraters. Turk J Med Sci. 2011;41(5):939‐944. [Google Scholar]
- 67. Hox J. Quantitative methodology series. Multilevel Analysis Techniques and Applications. Mahwah, NJ: Lawrence Erlbaum Associates Publishers; 2002. [Google Scholar]
- 68. Vanbelle S. Asymptotic variability of (multilevel) multirater kappa coefficients. Stat Methods Med Res. 2019;28(10‐11):3012‐3026. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 69. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86(2):420‐428. [DOI] [PubMed] [Google Scholar]
- 70. Gwet KL. Handbook of Inter‐rater Reliability: The Definitive Guide to Measuring the Extent of Agreement Among Raters. Oxford, UK: Advanced Analytics, LLC; 2014. [Google Scholar]