Introduction
It is often of interest to assess agreement between 2 or more observers when the observation of interest is categorical. Observations of clinical interest can include diagnoses, assessment of risk factors (exposures), and outcomes. The statistical tool most widely used to determine agreement is the kappa statistic. The kappa statistic is a chance-corrected measure; that is, it seeks to assess agreement beyond that which occurs by chance. However, it is well known that kappa has some limitations that can lead to paradoxical results (low kappa even in the presence of strong observer agreement) in certain circumstances.1,2 In this article, we will review kappa and its limitations, and introduce an alternative measure of agreement, the Agreement Coefficient 1 (AC1) given by Gwet.3
The Kappa Statistic
Cohen in 1960 proposed the kappa statistic in the context of 2 observers.4 It was later extended by Fleiss to include multiple observers.5 For illustration purposes, we will look at the simpler case of 2 observers, acknowledging that the principles are the same for multiple observers. Let us imagine that an investigator wants to develop a spine classification system based in part on the presence or absence of a finding on magnetic resonance imaging (MRI). The investigator gives 200 MRIs to 2 spine surgeons to assess independently whether the finding is present or absent in each image. The results are provided in Table 1.
Table 1.
A 2 × 2 Table of Results in a Hypothetical Example Comparing 2 Different Assessors.
| | | Assessor 1 | | |
| --- | --- | --- | --- | --- |
| | | Present | Absent | Total |
| Assessor 2 | Present | (a) 130 | (b) 56 | (g1) 186 |
| | Absent | (c) 9 | (d) 5 | (g2) 14 |
| | Total | (f1) 139 | (f2) 61 | (N) 200 |
The kappa statistic is given by the formula

κ = (Po − Pe)/(1 − Pe)

where Po = observed agreement, (a + d)/N, and Pe = agreement expected by chance, ((g1 * f1) + (g2 * f2))/N².
In our example,
Po = (130 + 5)/200 = 0.675
Pe = ((186 * 139) + (14 * 61))/200² = 0.668
κ = (0.675 − 0.668)/(1 − 0.668) = 0.022
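For readers who wish to reproduce these numbers, the short Python sketch below computes kappa directly from the four cell counts in Table 1; the function name cohen_kappa and the variable names are ours, chosen only for illustration.

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2 x 2 agreement table.

    a = both raters say "present", d = both say "absent",
    b and c = the two kinds of disagreement.
    """
    n = a + b + c + d
    po = (a + d) / n                      # observed agreement
    g1, g2 = a + b, c + d                 # row totals (Assessor 2)
    f1, f2 = a + c, b + d                 # column totals (Assessor 1)
    pe = (g1 * f1 + g2 * f2) / n ** 2     # agreement expected by chance
    return (po - pe) / (1 - pe)

print(round(cohen_kappa(130, 56, 9, 5), 3))  # ~0.022
```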
Kappa values range from −1 to 1, though in practice they usually fall between 0 and 1. A value of 1 represents perfect agreement, indicating that the raters agree in their classification of every case. Zero indicates agreement no better than that expected by chance. A negative kappa indicates agreement worse than that expected by chance.
A common scale of interpretation for the kappa statistic is given by Altman6 (Table 2).
Table 2.
A Commonly Used Scale of Interpretation for Kappa Statistic.
Kappa | Agreement |
---|---|
≤0.20 | Poor |
0.21-0.40 | Fair |
0.41-0.60 | Moderate |
0.61-0.80 | Good |
0.81-1.00 | Very good |
Using the scale shown in Table 2, the ability of our 2 investigators to agree on the presence or absence of a specific MRI finding is “poor.”
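As a simple illustration, the hypothetical helper below maps a coefficient onto the scale in Table 2; the cut-points follow the table, and the function name is ours.

```python
def altman_category(coefficient):
    """Map an agreement coefficient to the Altman scale in Table 2."""
    if coefficient <= 0.20:
        return "Poor"
    if coefficient <= 0.40:
        return "Fair"
    if coefficient <= 0.60:
        return "Moderate"
    if coefficient <= 0.80:
        return "Good"
    return "Very good"

print(altman_category(0.022))  # "Poor", matching the kappa computed above
```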
The Limitations of Kappa
While the kappa statistic above is quite low, one may notice that the absolute percentage of observer agreement (Po) is quite high (68%). How can the observers agree nearly 70% of the time yet have such a low kappa? The answer lies in the distribution of the marginal totals of the table, on which the magnitude of chance agreement (Pe), and consequently kappa, depends. The factors that characterize this distribution are referred to as prevalence and bias (see Box 1). Prevalence is the probability with which an observer will classify an object as present or absent; it is related to the balance in Table 1. Bias is the frequency with which raters choose a particular category, present or absent; it is related to the symmetry in Table 1. In our example, Table 1 is symmetrically imbalanced, with a high prevalence for each assessor categorizing the MRI finding as present. This has the effect of lowering kappa, creating what appears to be a "paradox": high percent agreement but low kappa.
Box 1.
Definitions Used to Describe the Limitations of the Kappa Statistic
Symmetrical: The distribution across g1 and g2 is the same as across f1 and f2
Asymmetrical: The distribution across g1 and g2 is in the opposite direction to f1 and f2
Balanced: The proportion of the total number of objects in g1 and f1 is equal to 0.5
Imbalanced: The proportion of the total number of objects in g1 and f1 is not equal to 0.5
Prevalence: Probability with which a rater will classify an object into a category; this is related to the balance in Table 1
Bias: Frequency with which raters choose a particular category; this is related to the symmetry in Table 1
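To make the prevalence effect concrete, the sketch below compares the imbalanced table from Table 1 with a hypothetical balanced table that has the same observed agreement (Po = 0.675); the balanced cell counts are invented purely for illustration.

```python
def agreement_stats(a, b, c, d):
    """Return (observed agreement, kappa) for a 2 x 2 table."""
    n = a + b + c + d
    po = (a + d) / n
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return po, (po - pe) / (1 - pe)

# Imbalanced table from Table 1: both raters classify most cases as "present".
print(agreement_stats(130, 56, 9, 5))    # Po = 0.675, kappa ~0.02

# Hypothetical balanced table with the same observed agreement.
print(agreement_stats(67, 32, 33, 68))   # Po = 0.675, kappa = 0.35
```

The observed agreement is identical, yet the balanced table yields a kappa more than 10 times higher, which is the "paradox" described above.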
An Alternative Statistic of Agreement
Given that kappa is affected by skewed distributions of categories (the prevalence problem) and by the degree to which observers disagree (the bias problem), Gwet3 in 2008 proposed an alternative agreement statistic that adjusts for chance agreement in a different way. He defined a new agreement coefficient (AC1) between 2 (or multiple) observers as the conditional probability that 2 randomly selected observers will agree, given that no agreement will occur by chance. In this way, the AC1 resists the so-called "paradox" of kappa. The AC1 statistic has the same formula as kappa except that it calculates the agreement expected by chance as follows:
Pe = agreement expected by chance, 2q * (1 − q), where q = (g1 + f1)/(2N)
In our example, Po remains the same, but Pe takes on a different value with the following results:
Po = (130 + 5)/200 = 0.675
q = (186 + 139)/(2 * 200) = 0.813
Pe = 2(0.813) * (1 − 0.813) = 0.305
AC1 = (0.675 − 0.305)/(1 − 0.305) = 0.532
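A minimal Python sketch of the AC1 calculation, mirroring the steps above, is shown below; the function name gwet_ac1 and the variable names are ours. Note that carrying unrounded intermediate values gives 0.533 rather than the 0.532 obtained above with rounded intermediates.

```python
def gwet_ac1(a, b, c, d):
    """Gwet's AC1 for a 2 x 2 agreement table."""
    n = a + b + c + d
    po = (a + d) / n              # observed agreement
    g1, f1 = a + b, a + c         # "present" totals for each rater
    q = (g1 + f1) / (2 * n)       # overall probability of a "present" rating
    pe = 2 * q * (1 - q)          # chance agreement under AC1
    return (po - pe) / (1 - pe)

print(round(gwet_ac1(130, 56, 9, 5), 3))  # ~0.533 (0.532 with rounded intermediates)
```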
How shall we interpret this coefficient? Remember from Table 2 that many use a common benchmark scale to determine if the agreement is poor, fair, moderate, good, or very good. The advantage is that this scale is straightforward. However, this simple method can lead to misleading conclusions for the following reasons:
1. The calculated kappa is specific to the pool of subjects used in the study and will change if different subjects are used.
2. The magnitude of the kappa coefficient depends on several factors, such as sample size, the number of categories, and the distribution of subjects among the categories. For example, a kappa value of 0.54 based on 200 subjects conveys a much stronger message about the extent of agreement among raters than a kappa value of 0.6 based on only 10 subjects.
A more standardized method that overcomes many of the problems cited above was proposed by Gwet.7 Using the agreement coefficient and its standard error, one can calculate the probability that the agreement coefficient falls into each category in Table 2. Starting with the highest benchmark range, 0.81 to 1.00 (very good), and moving toward the poorest range (≤0.20), one calculates the cumulative probability that the coefficient falls into that category. When the cumulative probability crosses a chosen threshold (say, 95%), that is the most likely range to which the estimate belongs. The results from our example show that kappa's cumulative probability does not reach 95% until the bottom range (poor), whereas the AC1 crosses the threshold in the 0.41 to 0.60 range (moderate; Table 3). Fortunately, Gwet's agreement coefficient and his method for benchmarking are included in several statistical packages.
Table 3.
Benchmark Range When the Level of the Cumulative Probability of the Agreement Coefficient Reaches 0.95 (95%).
| Benchmark Range | Description | Cumulative Probability (Kappa) | Cumulative Probability (AC1) |
| --- | --- | --- | --- |
| 0.81 to 1.00 | Very good | 0.00 | 0.00 |
| 0.61 to 0.80 | Good | 0.06 | 0.14 |
| 0.41 to 0.60 | Moderate | 0.09 | 0.98 |
| 0.21 to 0.40 | Fair | 0.11 | 1.00 |
| ≤0.20 | Poor | 1.00 | 1.00 |
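The sketch below illustrates the general idea of this cumulative-probability benchmarking using a normal approximation. The standard errors are hypothetical placeholders rather than values estimated from our data, so the output will not reproduce Table 3 exactly; Gwet's published method and software should be used for real analyses.

```python
from statistics import NormalDist

# Benchmark ranges from Table 2, ordered from best to worst: (lower, upper, label).
BENCHMARKS = [
    (0.81, 1.00, "Very good"),
    (0.61, 0.80, "Good"),
    (0.41, 0.60, "Moderate"),
    (0.21, 0.40, "Fair"),
    (-1.00, 0.20, "Poor"),
]

def benchmark(estimate, std_error, threshold=0.95):
    """Return the first benchmark label whose cumulative probability,
    accumulated from the top range down, reaches the threshold."""
    dist = NormalDist(mu=estimate, sigma=std_error)
    cumulative = 0.0
    for lower, upper, label in BENCHMARKS:
        cumulative += dist.cdf(upper) - dist.cdf(lower)
        if cumulative >= threshold:
            return label
    return BENCHMARKS[-1][2]

# Hypothetical standard errors, chosen only to illustrate the mechanics.
print(benchmark(0.532, 0.05))  # AC1 from our example -> "Moderate"
print(benchmark(0.022, 0.04))  # kappa from our example -> "Poor"
```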
Conclusions
Kappa remains the most frequently used statistic for assessing agreement between 2 or more observers when the observation of interest is categorical.
The kappa statistic is a chance-corrected measure. However, there are some limitations to the kappa that relate to the distribution of the marginal table totals on which the chance correction depends.
An alternative agreement statistic is Gwet’s AC1 that seeks to minimize the kappa limitations.
A standardized method of benchmarking includes calculating the cumulative probability of an agreement coefficient falling within a benchmark range, providing a more standardized way of interpreting the agreement statistic.
Footnotes
Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Spectrum Research, Inc, received funding in exchange for providing methodological and statistical assistance.
References
1. Cicchetti DV, Feinstein AR. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol. 1990;43:551–558.
2. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol. 1990;43:543–549.
3. Gwet KL. Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol. 2008;61(pt 1):29–48.
4. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37–46.
5. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76:378–382.
6. Altman DG. Practical Statistics for Medical Research. London, England: Chapman & Hall/CRC; 1991.
7. Gwet KL. Benchmarking agreement coefficients. K. Gwet's Inter-Rater Reliability Blog. https://inter-rater-reliability.blogspot.com/2014/. Published December 12, 2014. Accessed January 3, 2020.