Introduction
It is often of interest to assess agreement between 2 or more observers when the observation of interest is categorical. Observations of clinical interest can include diagnoses, assessment of risk factors (exposures), and outcomes. The statistical tool most widely used to determine agreement is the kappa statistic. The kappa statistic is a chance-corrected measure; that is, it seeks to assess agreement beyond that which occurs by chance. However, it is well known that kappa has some limitations that can lead to paradoxical results (low kappa even in the presence of strong observer agreement) in certain circumstances.1,2 In this article, we will review kappa and its limitations, and introduce an alternative measure of agreement, the Agreement Coefficient 1 (AC1) given by Gwet.3
The Kappa Statistic
Cohen in 1960 proposed the kappa statistic in the context of 2 observers.4 It was later extended by Fleiss to include multiple observers.5 For illustration purposes, we will look at the simpler case of 2 observers, acknowledging that the principles are the same for multiple observers. Let us imagine that an investigator wants to develop a spine classification system based in part on the presence or absence of a finding on magnetic resonance imaging (MRI). The investigator gives 200 MRIs to 2 spine surgeons to assess independently whether the finding is present or absent in each image. The results are provided in Table 1.
Table 1.
A 2 × 2 Table of Results in a Hypothetical Example Comparing 2 Different Assessors.
| | | Assessor 1 | | |
| --- | --- | --- | --- | --- |
| | | Present | Absent | Total |
| Assessor 2 | Present | (a) 130 | (b) 56 | (g1) 186 |
| | Absent | (c) 9 | (d) 5 | (g2) 14 |
| | Total | (f1) 139 | (f2) 61 | (N) 200 |
The kappa statistic is given by the formula

κ = (Po − Pe)/(1 − Pe)

where Po = observed agreement, (a + d)/N, and Pe = agreement expected by chance, ((g1 * f1) + (g2 * f2))/N².
In our example,
Po = (130 + 5)/200 = 0.675
Pe = ((186 * 139) + (14 * 61))/200² = 0.668
κ = (0.675 − 0.668)/(1 − 0.668) = 0.022
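For readers who wish to reproduce these numbers, the short Python sketch below computes kappa directly from the four cell counts in Table 1; the function name cohen_kappa and the variable names are ours, chosen only for illustration.

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2 x 2 agreement table.

    a = both raters say "present", d = both say "absent",
    b and c = the two kinds of disagreement.
    """
    n = a + b + c + d
    po = (a + d) / n                      # observed agreement
    g1, g2 = a + b, c + d                 # row totals (Assessor 2)
    f1, f2 = a + c, b + d                 # column totals (Assessor 1)
    pe = (g1 * f1 + g2 * f2) / n ** 2     # agreement expected by chance
    return (po - pe) / (1 - pe)

print(round(cohen_kappa(130, 56, 9, 5), 3))  # ~0.022
```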
Kappa values range from −1 to 1, though in practice they usually fall between 0 and 1. A value of 1 represents perfect agreement, indicating that the raters agree in their classification of every case. Zero indicates agreement no better than that expected by chance. A negative kappa indicates agreement worse than that expected by chance.
A common scale of interpretation for the kappa statistic is given by Altman6 (Table 2).
Table 2.
A Commonly Used Scale of Interpretation for Kappa Statistic.
Kappa | Agreement |
---|---|
≤0.20 | Poor |
0.21-0.40 | Fair |
0.41-0.60 | Moderate |
0.61-0.80 | Good |
0.81-1.00 | Very good |
Using the scale shown in Table 2, the ability of our 2 investigators to agree on the presence or absence of a specific MRI finding is “poor.”
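As a simple illustration, the hypothetical helper below maps a coefficient onto the scale in Table 2; the cut-points follow the table, and the function name is ours.

```python
def altman_category(coefficient):
    """Map an agreement coefficient to the Altman scale in Table 2."""
    if coefficient <= 0.20:
        return "Poor"
    if coefficient <= 0.40:
        return "Fair"
    if coefficient <= 0.60:
        return "Moderate"
    if coefficient <= 0.80:
        return "Good"
    return "Very good"

print(altman_category(0.022))  # "Poor", matching the kappa computed above
```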
The Limitations of Kappa
While the kappa statistic above is quite low, one may notice that the absolute percentage of observer agreement (Po) is quite high (68%). How can the observers agree nearly 70% of the time yet have such a low kappa? The answer lies in the distribution of the marginal totals of the table, on which the magnitude of chance agreement (Pe), and consequently kappa, depends. The factors that characterize this distribution are referred to as prevalence and bias (see Box 1). Prevalence is the probability with which an observer will classify an object as present or absent; it is related to the balance in Table 1. Bias is the frequency with which raters choose a particular category, present or absent; it is related to the symmetry in Table 1. In our example, Table 1 is symmetrically imbalanced, with a high prevalence for each assessor categorizing the MRI finding as present. This has the effect of lowering kappa, creating what appears to be a "paradox": high percent agreement but low kappa.
Box 1.
Definitions Used to Describe the Limitations of the Kappa Statistic
Symmetrical: The distribution across g1 and g2 is the same as across f1 and f2
Asymmetrical: The distribution across g1 and g2 is in the opposite direction to f1 and f2
Balanced: The proportion of the total number of objects in g1 and f1 is equal to 0.5
Imbalanced: The proportion of the total number of objects in g1 and f1 is not equal to 0.5
Prevalence: Probability with which a rater will classify an object into a category; this is related to the balance in Table 1
Bias: Frequency with which raters choose a particular category; this is related to the symmetry in Table 1
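To make the prevalence effect concrete, the sketch below compares the imbalanced table from Table 1 with a hypothetical balanced table that has the same observed agreement (Po = 0.675); the balanced cell counts are invented purely for illustration.

```python
def agreement_stats(a, b, c, d):
    """Return (observed agreement, kappa) for a 2 x 2 table."""
    n = a + b + c + d
    po = (a + d) / n
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return po, (po - pe) / (1 - pe)

# Imbalanced table from Table 1: both raters classify most cases as "present".
print(agreement_stats(130, 56, 9, 5))    # Po = 0.675, kappa ~0.02

# Hypothetical balanced table with the same observed agreement.
print(agreement_stats(67, 32, 33, 68))   # Po = 0.675, kappa = 0.35
```

The observed agreement is identical, yet the balanced table yields a kappa more than 10 times higher, which is the "paradox" described above.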
An Alternative Statistic of Agreement
Given that kappa is affected by skewed distributions of categories (the prevalence problem) and by the degree to which observers disagree (the bias problem), Gwet3 in 2008 proposed an alternative agreement statistic that adjusts for chance agreement in a different way. He defined a new agreement coefficient (AC1) between 2 (or multiple) observers as the conditional probability that 2 randomly selected observers will agree, given that no agreement will occur by chance. In this way, the AC1 resists the so-called "paradox" of kappa. The AC1 statistic has the same formula as kappa except that it calculates the agreement expected by chance as follows:
Pe = agreement expected by chance, 2q * (1 − q), where q = (g1 + f1)/(2N)
In our example, Po remains the same, but Pe takes on a different value with the following results:
Po = (130 + 5)/200 = 0.675
q = (186 + 139)/(2 * 200) = 0.813
Pe = 2(0.813) * (1 − 0.813) = 0.305
AC1 = (0.675 − 0.305)/(1 − 0.305) = 0.532
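A minimal Python sketch of the AC1 calculation, mirroring the steps above, is shown below; the function name gwet_ac1 and the variable names are ours. Note that carrying unrounded intermediate values gives 0.533 rather than the 0.532 obtained above with rounded intermediates.

```python
def gwet_ac1(a, b, c, d):
    """Gwet's AC1 for a 2 x 2 agreement table."""
    n = a + b + c + d
    po = (a + d) / n              # observed agreement
    g1, f1 = a + b, a + c         # "present" totals for each rater
    q = (g1 + f1) / (2 * n)       # overall probability of a "present" rating
    pe = 2 * q * (1 - q)          # chance agreement under AC1
    return (po - pe) / (1 - pe)

print(round(gwet_ac1(130, 56, 9, 5), 3))  # ~0.533 (0.532 with rounded intermediates)
```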
How shall we interpret this coefficient? Remember from Table 2 that many use a common benchmark scale to determine if the agreement is poor, fair, moderate, good, or very good. The advantage is that this scale is straightforward. However, this simple method can lead to misleading conclusions for the following reasons:
1. The calculated kappa is specific to the pool of subjects used in the study and will change if different subjects are used.
2. The magnitude of the kappa coefficient depends on several factors, such as sample size, the number of categories, and the distribution of subjects among the categories. For example, a kappa value of 0.54 based on 200 subjects conveys a much stronger message about the extent of agreement among raters than a kappa value of 0.6 based on only 10 subjects.
A more standardized method that overcomes many of the problems cited above was proposed by Gwet.7 Using the agreement coefficient and its standard error, one can calculate the probability that the agreement coefficient falls into each category in Table 2. Starting with the highest benchmark range, 0.81 to 1.00 (very good), and moving toward the poorest range (≤0.20), one calculates the cumulative probability that the coefficient falls into that category. When the cumulative probability crosses a chosen threshold (say, 95%), that is the most likely range to which the estimate belongs. The results from our example show that kappa's cumulative probability does not reach 95% until the bottom range (poor), whereas the AC1 crosses the threshold in the 0.41 to 0.60 range (moderate; Table 3). Fortunately, Gwet's agreement coefficient and his method for benchmarking are included in several statistical packages.
Table 3.
Benchmark Range When the Level of the Cumulative Probability of the Agreement Coefficient Reaches 0.95 (95%).
| Benchmark Range | Description | Cumulative Probability (Kappa) | Cumulative Probability (AC1) |
| --- | --- | --- | --- |
| 0.81 to 1.00 | Very good | 0.00 | 0.00 |
| 0.61 to 0.80 | Good | 0.06 | 0.14 |
| 0.41 to 0.60 | Moderate | 0.09 | 0.98 |
| 0.21 to 0.40 | Fair | 0.11 | 1.00 |
| ≤0.20 | Poor | 1.00 | 1.00 |
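The sketch below illustrates the general idea of this cumulative-probability benchmarking using a normal approximation. The standard errors are hypothetical placeholders rather than values estimated from our data, so the output will not reproduce Table 3 exactly; Gwet's published method and software should be used for real analyses.

```python
from statistics import NormalDist

# Benchmark ranges from Table 2, ordered from best to worst: (lower, upper, label).
BENCHMARKS = [
    (0.81, 1.00, "Very good"),
    (0.61, 0.80, "Good"),
    (0.41, 0.60, "Moderate"),
    (0.21, 0.40, "Fair"),
    (-1.00, 0.20, "Poor"),
]

def benchmark(estimate, std_error, threshold=0.95):
    """Return the first benchmark label whose cumulative probability,
    accumulated from the top range down, reaches the threshold."""
    dist = NormalDist(mu=estimate, sigma=std_error)
    cumulative = 0.0
    for lower, upper, label in BENCHMARKS:
        cumulative += dist.cdf(upper) - dist.cdf(lower)
        if cumulative >= threshold:
            return label
    return BENCHMARKS[-1][2]

# Hypothetical standard errors, chosen only to illustrate the mechanics.
print(benchmark(0.532, 0.05))  # AC1 from our example -> "Moderate"
print(benchmark(0.022, 0.04))  # kappa from our example -> "Poor"
```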
Conclusions
Kappa remains the most frequently used statistic for assessing agreement between 2 or more observers when the observation of interest is categorical.
The kappa statistic is a chance-corrected measure. However, there are some limitations to the kappa that relate to the distribution of the marginal table totals on which the chance correction depends.
An alternative agreement statistic is Gwet’s AC1 that seeks to minimize the kappa limitations.
A standardized method of benchmarking includes calculating the cumulative probability of an agreement coefficient falling within a benchmark range, providing a more standardized way of interpreting the agreement statistic.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Spectrum Research, Inc, received funding in exchange for providing methodological and statistical assistance.
References
1. Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol. 1990;43:551–558.
2. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol. 1990;43:543–549.
3. Gwet KL. Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol. 2008;61(pt 1):29–48.
4. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37–46.
5. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76:378–382.
6. Altman DG. Practical Statistics for Medical Research. London, England: Chapman & Hall/CRC; 1991.
7. Gwet KL. Benchmarking agreement coefficients. K. Gwet's Inter-Rater Reliability Blog. https://inter-rater-reliability.blogspot.com/2014/. Published December 12, 2014. Accessed January 3, 2020.