Quantifying Interrater Agreement and Reliability Between Thoracic Pathologists: Paradoxical Behavior of Cohen’s Kappa in the Presence of a High Prevalence of the Histopathologic Feature in Lung Cancer

Kay See Tan; Yi-Chen Yeh; Prasad S Adusumilli; William D Travis

doi:10.1016/j.jtocrr.2023.100618

JTO Clin Res Rep. 2023 Dec 16;5(1):100618.

PMCID: PMC10820331 PMID: 38283651

Abstract

Introduction

Cohen’s kappa is often used to quantify the agreement between two pathologists. Nevertheless, a high prevalence of the feature of interest can lead to seemingly paradoxical results, such as low Cohen’s kappa values despite high “observed agreement.” Here, we investigate Cohen’s kappa using data from histologic subtyping assessment of lung adenocarcinomas and introduce alternative measures that can overcome this “kappa paradox.”

Methods

A total of 50 frozen sections from stage I lung adenocarcinomas less than or equal to 3 cm in size were independently reviewed by two pathologists to determine the absence or presence of five histologic patterns (lepidic, papillary, acinar, micropapillary, solid). For each pattern, observed agreement (proportion of cases with concordant “absent” or “present” ratings) and Cohen’s kappa were calculated, along with Gwet’s AC1.

Results

The prevalence of any amount of the histologic patterns ranged from 42% (solid) to 97% (acinar). On the basis of Cohen’s kappa, there was substantial agreement for four of the five patterns (lepidic, 0.65; papillary, 0.67; micropapillary, 0.64; solid, 0.61). Acinar had the lowest Cohen’s kappa (0.43, moderate agreement), despite having the highest observed agreement (88%). In contrast, Gwet’s AC1 values were close to or higher than Cohen’s kappa across patterns (lepidic, 0.64; papillary, 0.69; micropapillary, 0.71; solid, 0.73; acinar, 0.85). The proportion of positive versus negative agreement was 93% versus 50% for acinar.

Conclusions

Given the dependence of Cohen’s kappa on feature prevalence, interrater agreement studies should include complementary indices such as Gwet’s AC1 and proportions of specific agreement, especially in settings with a high prevalence of the feature of interest.

Keywords: Interobserver coefficient, Reproducibility, Predominant histologic subtypes, Diagnostic accuracy, Performance metrics, Sensitivity and specificity

Introduction

Interrater agreement and reliability are key metrics to determine the reproducibility of diagnoses, immunohistochemical results, and other test results such as molecular assays in surgical pathology. If two pathologists can reliably apply a criterion or tool to make the same assessment on the same specimen, the interrater agreement will be high and can serve as evidence of reliable ratings. If the ratings are highly discordant, then either the tool is not useful or the raters require additional training. The statistical measure most widely used to quantify the agreement between pathologists is Cohen’s kappa.¹ Cohen’s kappa reflects the agreement beyond that which occurs by chance (i.e., chance corrected). Despite its popularity, Cohen’s kappa has been found to produce paradoxical results under certain circumstances.²^,³ Paradoxical results occur when a high level of agreement is accompanied by a low kappa value, leading to seemingly counterintuitive conclusions. In the present study, we review Cohen’s kappa statistic and assess its limitations using data from a published study of surgical pathology in lung cancer. We provide practical recommendations and propose alternative measures of agreement for future studies of interrater agreement.

Materials and Methods

Patient Data and Study Design

The present study uses data from a surgical pathology study by Yeh et al.⁴ that focused on stage I lung adenocarcinomas less than or equal to 3 cm in size. Details regarding patient selection, study methods, and evaluation of surgical specimens are reported in the previous study.⁴ Data were collected under a protocol (IRB 17-630) approved by the Institutional Review Board at Memorial Sloan Kettering Cancer Center, which included a waiver of informed consent.

In brief, patients with lung adenocarcinoma less than or equal to 3 cm in size who underwent surgical resection from 1995 to 2009 were identified from the prospectively curated Memorial Sloan Kettering Cancer Center Thoracic Service database. Original permanent and frozen section slides were available for a cohort of 361 patients. By analyzing various subsets of the 361-patient cohort, Yeh et al.⁴ investigated the strengths and limitations of frozen sections for the accurate identification of prognostically important histologic features. In particular, a subset of 50 patients was randomly selected from the full cohort of 361 patients and independently reviewed by three pathologists to determine the presence or absence of lepidic, acinar, papillary, micropapillary, and solid patterns on frozen sections.⁴^,⁵

The present study uses data from this same set of 50 frozen sections. For the purpose of illustration, we use the ratings from two (instead of three) pathologists. On the basis of these ratings, various agreement measures are presented, which are as follows: “observed agreement” (the proportion of cases with the same ratings from both raters), “chance agreement” (the probability of two raters agreeing by random chance), and “chance-corrected agreement” (agreement metrics that adjust for chance agreement, such as Cohen’s kappa and Gwet’s AC1).

Cohen’s Kappa

The equation for Cohen’s kappa is presented in Figure 1. Cohen’s kappa can take values from −1 to 1, although it typically falls between 0 and 1 in practice; higher values indicate greater interrater agreement, and values at or below 0 indicate agreement no better than chance. The degree of agreement is conventionally categorized as poor (kappa ≤ 0.20), fair (0.21 ≤ kappa ≤ 0.40), moderate (0.41 ≤ kappa ≤ 0.60), substantial (0.61 ≤ kappa ≤ 0.80), and almost perfect (0.81 ≤ kappa ≤ 1.00).⁶
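Because Figure 1 is not reproduced in this text, the standard form of Cohen’s kappa for a 2 × 2 table is sketched here as a reading aid. The symbols $p_o$ (observed agreement), $p_e$ (chance agreement), and $N$ (total number of cases) are notational assumptions; $a$ and $d$ denote the numbers of cases rated present and absent, respectively, by both raters, and $A_{+}$, $A_{-}$, $B_{+}$, and $B_{-}$ denote each rater’s totals of positive and negative ratings.

$$
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_o = \frac{a + d}{N}, \qquad p_e = \frac{A_{+} B_{+} + A_{-} B_{-}}{N^{2}}
$$

Here $p_e$ is the agreement expected if the two raters assigned ratings independently, each at his or her own observed rate. It therefore approaches 1 when both raters rate nearly every case the same way, which leaves little room in the denominator $1 - p_e$.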

Gwet’s AC1

Gwet’s AC1 is calculated using the formula presented in Figure 1. Similar to Cohen’s kappa, Gwet’s AC1 attempts to remove the chance agreement from the observed agreement, using the same structure of (observed agreement – chance agreement) / (1 – chance agreement).
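As a hedged sketch of the formula referred to above (Figure 1 is not reproduced in this text), Gwet’s AC1 for two raters and two categories can be written as follows. The symbols $p_o$, $p_e^{\mathrm{AC1}}$, $\pi$, and $N$ are notational assumptions: $p_o$ is the observed agreement, $N$ is the number of cases, and $A_{+}$ and $B_{+}$ are the numbers of “present” ratings given by raters A and B.

$$
\mathrm{AC1} = \frac{p_o - p_e^{\mathrm{AC1}}}{1 - p_e^{\mathrm{AC1}}}, \qquad p_e^{\mathrm{AC1}} = 2\pi(1 - \pi), \qquad \pi = \frac{A_{+} + B_{+}}{2N}
$$

Because $2\pi(1 - \pi)$ is largest at $\pi = 0.5$ and shrinks as $\pi$ approaches 0 or 1, this chance-agreement term becomes smaller, not larger, as the feature becomes very common or very rare.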

Positive and Negative Agreement

The proportion of specific agreement comprises two separate indices, $P_{pos}$ (positive agreement) and $P_{neg}$ (negative agreement). $P_{pos}$ is the number of cases classified as positive (i.e., the feature of interest is present) by both pathologists, divided by the average number of positive ratings given by the two pathologists; $P_{neg}$ is the corresponding proportion for negative ratings. In accordance with the notations in Figure 1, $a$ and $d$ are the numbers of cases rated positive and negative, respectively, by both raters; the number of positive readings is $A_{+}$ for rater A and $B_{+}$ for rater B, and the number of negative readings is $A_{-}$ for rater A and $B_{-}$ for rater B. Hence, positive agreement is calculated as $P_{pos} = a / [(A_{+} + B_{+}) / 2]$, and negative agreement is calculated as $P_{neg} = d / [(A_{-} + B_{-}) / 2]$.

Statistical Analysis

Patient characteristics are summarized as frequency and percentage for categorical variables and as median (25th–75th percentiles) for continuous variables. On the basis of the “absent” or “present” ratings across the 50 frozen sections for each histologic pattern, we calculated the observed agreement between the two pathologists, Cohen’s kappa, and Gwet’s AC1. We also derived the observed proportions of positive and negative agreement ($P_{pos}$ and $P_{neg}$). In addition, we determined the prevalence of each pattern (prevalence of the feature of interest) on the basis of the proportion of cases with the feature present in the full cohort.⁴ Observed agreement, Cohen’s kappa, and Gwet’s AC1 were calculated using the immer⁷^,⁸ and epiR⁹ packages in R (version 4.1.2, R Foundation for Statistical Computing, Vienna, Austria). For comparison, we present two additional agreement metrics, Aickin’s α¹⁰ and the B statistic,¹¹ calculated using the immer⁷^,⁸ and vcd¹² packages in R.
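The indices listed above can be recomputed by hand from the four cell counts of a 2 × 2 table. The following is a minimal Python sketch under stated assumptions, not the immer or epiR implementation used in the study: the function name `agreement_from_2x2`, the `Agreement` container, and the cell labels `a`, `b`, `c`, `d` are hypothetical, and the formulas are the standard two-rater, two-category forms of Cohen’s kappa, Gwet’s AC1, and specific agreement.

```python
from dataclasses import dataclass


@dataclass
class Agreement:
    observed: float  # proportion of concordant ratings
    kappa: float     # Cohen's kappa
    ac1: float       # Gwet's AC1
    p_pos: float     # positive agreement
    p_neg: float     # negative agreement


def agreement_from_2x2(a: int, b: int, c: int, d: int) -> Agreement:
    """Agreement indices for two raters and a present/absent rating.

    a = both raters "present"; d = both raters "absent";
    b and c = the two kinds of discordant ratings.
    Raises ZeroDivisionError for degenerate tables (for example,
    when every case falls in a single concordant cell).
    """
    n = a + b + c + d
    a_pos, b_pos = a + b, a + c  # "present" totals for each rater
    a_neg, b_neg = c + d, b + d  # "absent" totals for each rater

    p_o = (a + d) / n
    # Cohen: chance agreement from the product of each rater's own rates.
    pe_kappa = (a_pos * b_pos + a_neg * b_neg) / n**2
    # Gwet: chance agreement from the average "present" rate of both raters.
    pi = (a_pos + b_pos) / (2 * n)
    pe_ac1 = 2 * pi * (1 - pi)

    return Agreement(
        observed=p_o,
        kappa=(p_o - pe_kappa) / (1 - pe_kappa),
        ac1=(p_o - pe_ac1) / (1 - pe_ac1),
        p_pos=a / ((a_pos + b_pos) / 2),
        p_neg=d / ((a_neg + b_neg) / 2),
    )


# Acinar pattern cell counts as listed in Table 2.
acinar = agreement_from_2x2(41, 2, 4, 3)
print(f"observed={acinar.observed:.2f} kappa={acinar.kappa:.2f} "
      f"AC1={acinar.ac1:.2f} Ppos={acinar.p_pos:.2f} Pneg={acinar.p_neg:.2f}")
```

All five indices are unchanged when `b` and `c` are swapped, so the sketch does not depend on which pathologist is placed in the rows of the table.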

Results

Clinicopathologic Characteristics of the Patients

The characteristics of the 50 included patients are summarized in Table 1. On the basis of the full cohort of 361 patients previously reported by Yeh et al.,⁴ the prevalence of each pattern (prevalence of the feature of interest) ranged from 42% for solid pattern to 97% for acinar pattern (Table 2).

Table 1.

Patient Characteristics (N = 50)

Characteristics	Median (25th–75th Percentile) or n (%)
Age at surgery, y	66 (60–73)
Sex
Female	28 (56)
Male	22 (44)
Pathologic stage
1A	49 (98)
1B	1 (2.0)

Table 2.

Interobserver Agreement Between Two Pathologists for the Presence or Absence of Histologic Patterns Using Frozen Sections

Features	Ratings by Two Pathologists			Observed Agreement, %	Cohen’s Kappa	Prevalence of Feature,^a %	$P_{pos}$ (95% CI)^b	$P_{neg}$ (95% CI)^b	Gwet’s AC1	B Statistic	Aickin’s α
Lepidic		Present	Absent	82	0.65	69	80 (64–90)	84 (71–92)	0.64	0.70	0.78
	Present	18	9
	Absent	0	23
Acinar		Present	Absent	88	0.43	97	93 (86–97)	50 (17–78)	0.85	0.86	0.66
	Present	41	2
	Absent	4	3
Papillary		Present	Absent	84	0.67	75	80 (63–90)	87 (75–94)	0.69	0.73	0.73
	Present	16	7
	Absent	1	26
Micropapillary		Present	Absent	84	0.64	47	76 (57–89)	88 (77–94)	0.71	0.73	0.67
	Present	13	4
	Absent	4	29
Solid		Present	Absent	84	0.61	42	71 (49–86)	89 (79–95)	0.73	0.76	0.72
	Present	10	7
	Absent	1	32

CI, confidence interval; $P_{neg}$, negative agreement; $P_{pos}$, positive agreement.

^a The prevalence of the feature was derived from the full cohort of 361 patients in the study by Yeh et al.⁴

^b The 95% CIs around $P_{pos}$ and $P_{neg}$ reflect Bayesian intervals with a Beta(1,1) prior.

Interrater Agreement for the Presence of Histologic Patterns Using Frozen Sections

The observed agreement between the two pathologists was high across all five histologic patterns, ranging from 82% for lepidic pattern to 88% for acinar pattern (Table 2).

The conventional approach (i.e., using Cohen’s kappa) indicated substantial agreement for four of the five histologic patterns (kappa: lepidic, 0.65; papillary, 0.67; micropapillary, 0.64; solid, 0.61). The lowest Cohen’s kappa was 0.43, for acinar pattern, which corresponds to moderate agreement.

Gwet’s AC1 values were close to or higher than Cohen’s kappa across all five patterns (Gwet’s AC1: lepidic, 0.64; papillary, 0.69; micropapillary, 0.71; solid, 0.73). In particular, Gwet’s AC1 for acinar pattern (0.85) was the highest across all five patterns. Although not the focus of the current study, B statistics and Aickin’s α were similar to Gwet’s AC1 except for acinar and micropapillary, in which Aickin’s α values were in between Cohen’s kappa and Gwet’s AC1.

Influence of the Prevalence of the Feature of Interest on Interrater Agreement

The prevalence of each histologic pattern is presented in Table 2. For the four patterns with a Cohen’s kappa greater than 0.6 (lepidic, papillary, micropapillary, and solid), the prevalence of any amount of each pattern was between 42% and 75%; for the pattern with the lowest Cohen’s kappa (acinar; Cohen’s kappa, 0.43), the prevalence was 97%.

The details for solid pattern versus acinar pattern illustrate the influence of the prevalence of the feature of interest on interrater agreement. The prevalence of solid pattern was 42%, and the observed agreement between the two pathologists was 84%. Cohen’s kappa resulted in a chance-corrected agreement of 0.61, similar to Gwet’s AC1 of 0.73. In contrast, the prevalence of acinar pattern was 97%. Even with a high observed agreement of 88% between the two pathologists, Cohen’s kappa was 0.43, compared with Gwet’s AC1 of 0.85 (which was closer to the observed agreement).
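The contrast between the two patterns can be traced to the chance-agreement term that each coefficient subtracts. The worked arithmetic below is a sketch that uses the marginal totals in Table 2 (solid: 17 and 11 “present” ratings; acinar: 43 and 45 “present” ratings, each of 50) together with the standard two-category chance terms, $p_e^{\kappa} = (A_{+}B_{+} + A_{-}B_{-}) / N^{2}$ for Cohen’s kappa and $p_e^{\mathrm{AC1}} = 2\pi(1 - \pi)$ with $\pi = (A_{+} + B_{+}) / 2N$ for Gwet’s AC1; these symbols are notational assumptions.

$$
\begin{aligned}
\text{Solid:} \quad & p_e^{\kappa} = \frac{17 \times 11 + 33 \times 39}{50^{2}} \approx 0.590 \\
& p_e^{\mathrm{AC1}} = 2(0.28)(0.72) \approx 0.403 \\
\text{Acinar:} \quad & p_e^{\kappa} = \frac{43 \times 45 + 7 \times 5}{50^{2}} = 0.788 \\
& p_e^{\mathrm{AC1}} = 2(0.88)(0.12) \approx 0.211
\end{aligned}
$$

For acinar pattern, this gives $\kappa = (0.88 - 0.788) / (1 - 0.788) \approx 0.43$ and $\mathrm{AC1} = (0.88 - 0.211) / (1 - 0.211) \approx 0.85$. Under Cohen’s chance model, 78.8% agreement is already expected for acinar pattern, so the observed 88% leaves little margin above chance.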

Distinguishing Between Positive and Negative Agreement

When lepidic pattern was assessed, the average number of “present” and “absent” ratings was 22.5 and 27.5 of 50 cases, respectively. Hence, the proportion of “present” ratings that were concordant between the two pathologists ($P_{pos}$) was 80% (18 of 22.5), and the proportion of “absent” ratings that were concordant ($P_{neg}$) was 84% (23 of 27.5). High $P_{pos}$ and $P_{neg}$ values were similarly observed for papillary, micropapillary, and solid patterns.

In contrast, when acinar pattern was assessed, the average number of “present” and “absent” ratings was 44 and six of 50 cases, respectively, reflecting a high prevalence of the pattern. The six discordant ratings resulted in $P_{pos}$ of 93% (41 of 44) and $P_{neg}$ of 50% (three of six). This implies that, in practice, if one pathologist rates the case as “absent” for acinar pattern, it may be worthwhile to obtain the opinion of a second pathologist. In the case of a “present” rating, however, the probability that the second pathologist agrees is 93%.

Discussion

Cohen’s kappa is routinely used to determine interrater agreement between two raters. The primary idea underlying Cohen’s kappa is that part of the observed agreement between two raters is attributable to chance—that is, that the two raters agree (whether the feature of interest is present or absent) simply because of chance. Cohen’s kappa adjusts for this chance agreement to derive a chance-corrected agreement. Two examples using data from the study population are provided subsequently to illustrate the potential limitations of Cohen’s kappa.

In example 1, pathologist A and pathologist B agree with each other on 16 of 20 frozen section slides. On 15 of the 16 slides, both pathologists observed the feature of interest, and on one slide, neither pathologist observed the feature of interest (Fig. 2; example 1). Therefore, the observed agreement is (15 + 1) / 20 = 0.8. Nevertheless, pathologist A may have agreed with pathologist B simply by chance, even if neither pathologist had scrutinized the frozen sections. To calculate the chance agreement, note that pathologist A rated the feature as present on 17 of 20 slides and absent on three of 20 slides, and pathologist B likewise rated the feature as present on 17 of 20 slides. Thus, each pathologist said “present” 85% of the time and “absent” 15% of the time. Consequently, the probability that both pathologists would say “present” by chance is 0.85 × 0.85 = 0.7225, and the probability that both would say “absent” by chance is 0.15 × 0.15 = 0.0225. The overall chance agreement is, therefore, 0.7225 + 0.0225 = 0.745, meaning that the two pathologists would be expected to agree on 74.5% of slides by chance alone. Following the formula in Figure 1, Cohen’s kappa is calculated as (observed agreement – chance agreement) / (1 – chance agreement), which yields κ = (0.8 − 0.745) / (1 − 0.745) ≈ 0.22; this value lies at the boundary between poor and fair agreement.

Figure 2. Summary of ratings by two pathologists; both examples have 80% observed agreement between the two raters.

In example 2 (Fig. 2), the observed agreement is exactly the same as in example 1, 80% (16 of 20 cases in agreement), but Cohen’s kappa is much higher because of a smaller chance agreement. The chance agreement in example 2 is (0.5 × 0.5) + (0.5 × 0.5) = 0.5, which yields κ = (0.8 − 0.5) / (1 − 0.5) = 0.6; this is considered moderate agreement. Although both examples were derived from tables with an observed agreement of 80%, example 1 had a much lower kappa value (kappa = 0.22 in example 1 versus kappa = 0.6 in example 2). This discrepancy arises from the markedly different prevalence of the feature of interest (17 / 20 = 85% in example 1 versus 10 / 20 = 50% in example 2). When the prevalence of the feature of interest is close to 50%, as in example 2, the resulting kappa value is closer to the observed agreement. In contrast, when the prevalence of the feature of interest is either very high (close to 100%) or very low (close to 0%), as in example 1, the kappa value seems to be counterintuitively low.
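The arithmetic for the two examples can be checked with a short Python sketch. The cell counts below are reconstructed from the numbers quoted in the text and are an assumption, because Figure 2 is not reproduced here; the function name `cohen_kappa_2x2` and the variable names are hypothetical.

```python
def cohen_kappa_2x2(a, b, c, d):
    """Return (observed agreement, chance agreement, kappa) for a 2 x 2 table.

    a = both raters "present", d = both raters "absent",
    b and c = the two kinds of discordant ratings.
    """
    n = a + b + c + d
    p_o = (a + d) / n
    # Chance agreement if each rater rated independently at his or her own rate.
    p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# Reconstructed cell counts (assumption): 16 concordant and 4 discordant slides.
example_1 = (15, 2, 2, 1)  # each rater says "present" on 17 of 20 slides
example_2 = (8, 2, 2, 8)   # each rater says "present" on 10 of 20 slides

for name, cells in (("example 1", example_1), ("example 2", example_2)):
    p_o, p_e, kappa = cohen_kappa_2x2(*cells)
    print(f"{name}: observed={p_o:.2f} chance={p_e:.3f} kappa={kappa:.2f}")
```

Both tables share the same observed agreement of 0.80; only the chance term changes (0.745 versus 0.500), and that alone moves kappa from about 0.22 to 0.60.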

To avoid the paradoxical results that can occur with Cohen’s kappa under certain circumstances, Gwet¹³ proposed a new agreement measure called the “first-order agreement coefficient” or AC1. The primary difference between Cohen’s kappa and Gwet’s AC1 lies in the calculation of chance agreement, which is based on the chance that raters may agree on a rating despite the fact that one or both of them may have made a random classification. Random ratings can occur when the rater is uncertain about how to classify a specimen (perhaps when the specimen’s characteristics do not match the rating instructions) and hence randomly assigns “present” for the feature of interest. In a situation where Cohen’s kappa is low despite a high level of overall agreement, Gwet’s AC1 has been introduced as a “paradox-resistant” alternative to Cohen’s kappa.¹⁴

As revealed in the present study, Gwet’s AC1 provides a chance-corrected agreement coefficient that is more in line with observed agreement, compared with Cohen’s kappa. Despite its popularity, Cohen’s kappa has its drawbacks, particularly in the setting of a high prevalence of the feature of interest. Cohen’s kappa assumes that agreement is at random and, hence, captures the agreement beyond that occurring at random. Conversely, Gwet’s AC1 acknowledges that agreement between observers is not totally at random—that is, there will be cases where the feature is truly present that will be easy to reach agreement on, there will be cases where the feature is truly absent that will be easy to reach agreement on, and there will be cases for which it will be difficult to reach agreement. Taking this perspective into consideration, Gwet’s AC1 avoids the overpenalization that results with Cohen’s kappa simply as a consequence of a high prevalence of the feature of interest.

The findings in this illustrative study, by the use of the ratings of two pathologists, can be extended to the setting of multiple response categories and multiple raters. Instead of only two possible attributions (“present” or “absent”), the response categories can be ordinal, such as “absent,” “low,” “intermediate,” and “high.” Cohen’s kappa has been extended to handle such settings using the weighted kappa.¹⁵ Furthermore, whereas Cohen’s kappa applies only to two raters, Light’s kappa¹⁶ can be applied in the setting of multiple raters. Both Cohen’s kappa and Light’s kappa assume that a fixed number of raters are rating identical cases; in contrast, Fleiss’ kappa¹⁷ is a more flexible approach that can be applied to any number of raters rating different cases. Similarly, Gwet’s AC1 has also been extended to accommodate multiple raters.¹⁸

To the best of our knowledge, Gwet’s AC1 has never been compared with Cohen’s kappa in the context of lung cancer pathology. Nevertheless, discussions surrounding the paradox of low Cohen’s kappa despite high observed agreement have been ongoing. Whereas some have cautioned against the use of Cohen’s kappa in these settings,²^,³ others have argued for continued support of Cohen’s kappa. Vach¹⁹ argued that the dependence of Cohen’s kappa on the prevalence of the feature of interest “does not matter,” because kappa is exactly fulfilling its purpose, which is to improve the interpretation of agreement rates. Indeed, it is intuitive that different populations, regardless of the prevalence of the feature of interest, would yield different kappa values. In fact, the chance correction used in Cohen’s kappa actually helps to standardize results across populations, which can be advantageous for comparisons across studies and of the performance of the raters.²⁰ Rather than criticizing Cohen’s kappa for its dependence on the prevalence of features or searching for statistical methods to salvage inefficient studies, the focus should be placed on obtaining populations with a prevalence of the feature of interest near 50%.²¹ Nevertheless, one could argue that this is not realistic from a clinical perspective and, furthermore, that doing so hampers the generalizability of the findings to clinical practice.

In contrast, other experts have proposed adjustments and extensions to Cohen’s kappa that are suggested to be paradox proof. In addition to Gwet’s AC1, alternative measures such as Aickin’s α¹⁰ and prevalence- and bias-adjusted kappa²² have been proposed to address the paradoxical behavior of kappa. One of the most creative alternatives is the B statistic proposed by Bangdiwala and Shankar,¹¹ which uses a visualization of the agreement between raters and adjusts the observed area of agreement with that expected to result from chance. As revealed in the results, both Aickin’s α and B statistic were higher than Cohen’s kappa across all five features.

The decision regarding which interrater indices to report should be guided by the purpose of the study, whether reliability or agreement (or both) is of primary interest.²³ Although they are often used interchangeably, there are important differences between the concepts of reliability and agreement.²⁴^,²⁵ In agreement, the question of interest is, “Are the ratings identical or close between two pathologists for each case?” In reliability, the question of interest is, “How well do the ratings distinguish one case from another?” Hence, agreement indices apply to instruments (or rating criteria) that are used for evaluative purposes, whereas reliability indices are required for instruments that are used for discriminative purposes. Although Cohen’s kappa was first proposed to describe agreement between raters, it was argued that, with adjustment of the observed agreement for the chance agreement, an agreement measure can be turned into a reliability measure.²⁶

In accordance with the suggestion from Feinstein and Cicchetti,²^,³ we have presented results for positive and negative agreement, in addition to overall agreement. Similar to the concept of sensitivity and specificity in a diagnostic test, these agreement indices distinguish between positive and negative classifications, which may have different implications in clinical practice. A clinical application of positive and negative agreement can be illustrated using the examples of lepidic pattern and acinar pattern from our study. When assessing lepidic pattern, on the basis of the $P_{pos}$ value of 80% and the $P_{neg}$ value of 84%, it may not be necessary to request a second opinion for either an “absent” or a “present” rating. For acinar pattern, however, the $P_{pos}$ and $P_{neg}$ values were 93% and 50%, respectively. This implies that, in practice, if one pathologist rates “absent,” it may be worthwhile to obtain the opinion of a second pathologist. In the case of a “present” rating, however, the probability that the second pathologist agrees is 93%, so it may not be worthwhile to involve a second pathologist. This illustrates the value of specific agreement in clinical practice and indicates that both $P_{pos}$ and $P_{neg}$ are useful contextual metrics in interrater agreement studies.

A recent article by Vach and Gerke²⁷ conducted a head-to-head comparison of Cohen’s kappa and Gwet’s AC1. On the basis of the behavior of both metrics under various settings, the study concluded that in the case of no association or maximal disagreement, Gwet’s AC1 should not be viewed as a substitute for kappa and that the classification of degrees of agreement in Landis and Koch⁶ should not be applied to Gwet’s AC1. Even though the extreme scenarios of no association and maximal disagreement are unlikely between pathologists, in the present study, we have argued that agreement studies should present Gwet’s AC1 alongside the conventional Cohen’s kappa, rather than as a replacement. Much like the convention of presenting both sensitivity and specificity for medical diagnostic tests, the use of multiple indices is based on the acknowledgment that no single index of agreement can be satisfactory for all purposes. In addition, by including complementary indices, such as positive and negative agreement, an interrater study can provide a more clinically relevant determination of interrater variability. With advances in technology, these metrics are readily available in standard statistical software; therefore, researchers are not restricted to reporting only Cohen’s kappa.

Agreement statistics depend on feature prevalence. In addition to Cohen’s kappa, future interrater variability studies should consider the purpose of the study, report the prevalence of the feature(s) of interest, and include additional agreement statistics such as Gwet’s AC1, especially in cases where there is a high prevalence of the feature of interest. In addition to overall agreement, positive and negative agreement should also be reported to allow for clinical and practical interpretation of agreement studies in pathology.

CRediT Authorship Contribution Statement

Kay See Tan: Conceptualization, Data curation, Formal analysis, Methodology, Writing – original draft, Writing – review and editing.

Yi-Chen Yeh: Data curation, Formal analysis, Investigation.

Prasad S. Adusumilli: Conceptualization, Investigation, Writing – review and editing.

William D. Travis: Conceptualization, Investigation, Writing – review and editing.

Acknowledgments

This work was supported by the National Institutes of Health/National Cancer Institute Cancer Center Support Grant P30 CA008748 (to Memorial Sloan Kettering Cancer Center). Dr. Adusumilli’s laboratory work is supported by grants from the National Institutes of Health (P30 CA008748, R01 CA236615-01, and R01 CA235667), the U.S. Department of Defense (BC132124, LC160212, CA170630, CA180889, and CA200437), the Batishwa Fellowship, the Comedy versus Cancer Award, the Dalle Pezze Foundation, the Derfner Foundation, the Esophageal Cancer Education Fund, the Geoffrey Beene Foundation, the Memorial Sloan Kettering Technology Development Fund, the Miner Fund for Mesothelioma Research, the Mr. William H. Goodwin and Alice Goodwin, the Commonwealth Foundation for Cancer Research, and the Experimental Therapeutics Center of Memorial Sloan Kettering Cancer Center. The sponsors played no role in any aspect of this work.

The authors acknowledge excellent editorial assistance from Christy Rajcoomar and David B. Sewell of Memorial Sloan Kettering Cancer Center.

Footnotes

Disclosure: Dr. Adusumilli’s laboratory receives research support from ATARA Biotherapeutics. The remaining authors declare no conflict of interest.

Cite this article as: Tan KS, Yeh YC, Adusumilli PS, Travis WD. Quantifying interrater agreement and reliability between thoracic pathologists: paradoxical behavior of Cohen’s kappa in the presence of a high prevalence of the histopathologic feature in lung cancer. JTO Clin Res Rep. 2024;5:100618.

References

1. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37–46.
2. Cicchetti D.V., Feinstein A.R. High agreement but low kappa: II. Resolving the paradoxes. J Clin Epidemiol. 1990;43:551–558. doi: 10.1016/0895-4356(90)90159-m.
3. Feinstein A.R., Cicchetti D.V. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol. 1990;43:543–549. doi: 10.1016/0895-4356(90)90158-l.
4. Yeh Y.C., Nitadori J., Kadota K., et al. Using frozen section to identify histological patterns in stage I lung adenocarcinoma of ≤3 cm: accuracy and interobserver agreement. Histopathology. 2015;66:922–938. doi: 10.1111/his.12468.
5. Travis W.D., Brambilla E., Noguchi M., et al. International Association for the Study of Lung Cancer/American Thoracic Society/European Respiratory Society: international multidisciplinary classification of lung adenocarcinoma: executive summary. Proc Am Thorac Soc. 2011;8:381–385. doi: 10.1513/pats.201107-042ST.
6. Landis J.R., Koch G.G. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174.
7. Robitzsch A., Steinfeld J. Item response models for human ratings: overview, estimation methods, and implementation in R. Psychol Test Assess Model. 2018;60:101–138.
8. Robitzsch A., Steinfeld J. immer: Item response models for multiple ratings. R package version 1.1-35; 2018. https://cran.r-project.org/web/packages/immer/index.html. Accessed July 1, 2023.
9. Stevenson M., Sergeant E., with contributions from Telmo Nunes, Cord Heuer, Jonathon Marshall, Javier Sanchez. epiR: Tools for the Analysis of Epidemiological Data. R package version 2.0.19. https://CRAN.R-project.org/package=epiR
10. Aickin M. Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen’s kappa. Biometrics. 1990:293–302.
11. Bangdiwala S.I., Shankar V. The agreement chart. BMC Med Res Methodol. 2013;13:1–7. doi: 10.1186/1471-2288-13-97.
12. Meyer D., Zeileis A., Hornik K. vcd: Visualizing Categorical Data. R package version 1.4-8. 2020. https://cran.r-project.org/web/packages/vcd/index.html
13. Gwet K.L. Computing inter-rater reliability and its variance in the presence of high agreement. Br J Math Stat Psychol. 2008;61:29–48. doi: 10.1348/000711006X126600.
14. Wongpakaran N., Wongpakaran T., Wedding D., Gwet K.L. A comparison of Cohen’s kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples. BMC Med Res Methodol. 2013;13:1–7. doi: 10.1186/1471-2288-13-61.
15. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull. 1968;70:213–220. doi: 10.1037/h0026256.
16. Light R.J. Measures of response agreement for qualitative data: some generalizations and alternatives. Psychol Bull. 1971;76:365.
17. Fleiss J.L. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76:378.
18. Gwet K. Handbook of Inter-rater Reliability. Gaithersburg, MD: STATAXIS Publishing Company; 2001.
19. Vach W. The dependence of Cohen’s kappa on the prevalence does not matter. J Clin Epidemiol. 2005;58:655–661. doi: 10.1016/j.jclinepi.2004.02.021.
20. Kraemer H.C., Bloch D.A. Kappa coefficients in epidemiology: an appraisal of a reappraisal. J Clin Epidemiol. 1988;41:959–968. doi: 10.1016/0895-4356(88)90032-7.
21. Hoehler F.K. Bias and prevalence effects on kappa viewed in terms of sensitivity and specificity. J Clin Epidemiol. 2000;53:499–503. doi: 10.1016/s0895-4356(99)00174-2.
22. Byrt T., Bishop J., Carlin J.B. Bias, prevalence and kappa. J Clin Epidemiol. 1993;46:423–429. doi: 10.1016/0895-4356(93)90018-v.
23. Kottner J., Audigé L., Brorson S., et al. Guidelines for reporting reliability and agreement studies (GRRAS) were proposed. Int J Nurs Stud. 2011;48:661–671. doi: 10.1016/j.ijnurstu.2011.01.016.
24. de Vet H.C., Terwee C.B., Knol D.L., Bouter L.M. When to use agreement versus reliability measures. J Clin Epidemiol. 2006;59:1033–1039. doi: 10.1016/j.jclinepi.2005.10.015.
25. Kottner J., Streiner D.L. The difference between reliability and agreement. J Clin Epidemiol. 2011;64:701–702. doi: 10.1016/j.jclinepi.2010.12.001.
26. Guyatt G., Walter S., Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. J Chronic Dis. 1987;40:171–178. doi: 10.1016/0021-9681(87)90069-5.
27. Vach W., Gerke O. Gwet’s AC1 is not a substitute for Cohen’s kappa: a comparison of basic properties. MethodsX. 2023;10. doi: 10.1016/j.mex.2023.102212.

