Author manuscript; available in PMC 2023 Oct 25.
Published in final edited form as: Stat Med. 2023 Mar 5;42(11):1687–1698. doi: 10.1002/sim.9694

Simulating and estimating agreement in the presence of multiple raters and covariates

Katelyn A McKenzie 1, Jonathan D Mahnken 1
PMCID: PMC10599607  NIHMSID: NIHMS1938653  PMID: 36872574

Abstract

Cohen’s and Fleiss’s kappa are popular estimators for assessing agreement among two and multiple raters, respectively, for a binary response. While additional methods have been developed to account for multiple raters and covariates, they are not always applicable, rarely used, and none simplify to Cohen’s kappa. Furthermore, there are no methods to simulate Bernoulli observations under the kappa agreement structure such that the developed methods could be adequately assessed. This manuscript overcomes these shortfalls. First, we developed a model-based estimator for kappa that accommodates multiple raters and covariates through a generalized linear mixed model and encompasses Cohen’s kappa as a special case. Second, we created a framework to simulate dependent Bernoulli observations that upholds the kappa agreement structure of every 2-tuple pair of raters and includes covariates. We used this framework to assess our method when kappa was nonzero. Simulations showed that Cohen’s and Fleiss’s kappa estimates were inflated, unlike our model-based kappa. We analyzed an Alzheimer’s disease neuroimaging study and the classic cervical cancer pathology study. The proposed model-based kappa and the advancement in simulation methodology demonstrate that the popular Cohen’s and Fleiss’s kappa approaches are poised to yield invalid conclusions, while our work overcomes these shortfalls, leading to improved inferences.

Keywords: Alzheimer’s disease, Cohen’s kappa, Fleiss’s kappa, generalized linear mixed model

1 |. INTRODUCTION

Many fields, such as medicine and psychology, utilize agreement studies.1–3 Estimators for agreement are also used in other fields, such as a performance metric in machine learning.4,5 Agreement studies are useful when the outcome is unknown or subjective (eg, Alzheimer’s disease diagnosis based on neuroimaging biomarkers or the optimal machine learning model). Multiple raters may be included in agreement studies. One of the first proposed estimators of agreement among multiple (>2) raters was Fleiss’s kappa, which remains popular today.6 Although developed explicitly for two raters, some use Cohen’s kappa to assess agreement across all pairs of two raters.7

There are limitations to Cohen’s kappa and Fleiss’s kappa. It is well known that Cohen’s kappa depends on the marginal probability of a positive response (frequently termed sample disease prevalence).8 Both Cohen’s and Fleiss’s kappa can have paradoxical results, where there is high observed agreement, but low values of kappa.9 Neither approach can directly control for explanatory factors.

There are other statistical methods that can accommodate multiple raters and tend to provide population-based inference. These include Landis and Koch’s hierarchical kappa-type statistic,10 Tanner and Young’s log-linear modeling approach,11 Klar’s generalized estimating equations (GEE) approach,12 Kraemer’s intraclass kappa,13 and Nelson’s generalized linear mixed model (GLMM) kappa-like statistics.14,15 These methods perform best with a larger number of raters.

Fleiss’s and Cohen’s kappa are generally favored in practice over these newer methods. This may be due to the ease of calculation and widespread acceptance of Fleiss’s and Cohen’s kappa as estimators for agreement. For instance, the GEE approach of Klar, Lipsitz, and Ibrahim requires solving two sets of equations by iterative procedures.12 None of these newer methods provide estimates of agreement for each pair of raters and across all pairs of raters after accounting for explanatory factors. Additionally, these newer methods may not apply to all situations. For example, the kappa-like statistic proposed by Nelson and Edwards14,15 cannot be used when the GLMM does not converge (ie, the variance of the random effects is not estimable). These methods do not simplify to Fleiss’s or Cohen’s kappa; thus, while comparisons could technically be made, the interpretations of their estimates are not exchangeable.

Aside from these statistical estimators, there are no frameworks available to simulate Bernoulli observations under a particular agreement dependency structure. Several methods have been proposed to generate correlated binary variables; however, correlation does not imply agreement.16–22

The purpose of this work is twofold. First, we propose an estimator of the kappa agreement structure that accounts for multiple raters and explanatory factors when the outcome is binary. This estimator includes Cohen’s kappa as a special case. Second, we provide a framework for simulating Bernoulli observations under the kappa agreement structure while adjusting for covariates. This framework is used to generate Bernoulli observations for multiple raters under the kappa agreement structure for each 2-tuple pair of raters. The primary context considered is the assessment of pairwise agreement among multiple raters evaluating each subject once.

In Section 2, we describe the kappa agreement structure and common estimators: Cohen’s and Fleiss’s kappa. We define our estimator and show it is a function of each 2-tuple pair of raters’ kappa agreement structure. In Section 3, the novel simulation framework is described. In Section 4, simulation results show that Cohen’s and Fleiss’s kappa estimators provide inflated estimates while our estimator does not. In Section 5, we analyze an Alzheimer’s neuroimaging study and a classic pathology agreement study. These analyses demonstrate when our method can be used despite lack of convergence of the GLMM. Lastly, in Section 6, we provide a discussion.

2 |. METHODS

Kappa is a popular agreement structure and is defined as a function of observed ($\pi_o$) and expected ($\pi_e$) agreement:

$$\kappa = \frac{\pi_o - \pi_e}{1 - \pi_e} \tag{1}$$
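For example, if raters agree on 80% of subjects ($\pi_o = 0.80$) while agreement expected by chance is $\pi_e = 0.50$, then $\kappa = (0.80 - 0.50)/(1 - 0.50) = 0.60$.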

Many estimators for kappa have been proposed.6,23,24 These works define πo as the observed joint probability of agreement between raters, and πe as the chance-expected agreement.25 A common estimate for πo is the proportion of subjects the raters provided the same assessment while πe is estimated under the assumption of rater independence. These estimators define πe under different functions of each rater’s marginal probabilities for each outcome.

In 1960, Cohen proposed a parameterization of kappa (termed Cohen’s kappa) that is widely used today.23 For a binary outcome (say category 1 = positive and 2 = negative) and two raters, Cohen’s kappa ($\kappa_C$) can be written as:

$$\kappa_C = \frac{\pi_{11} + \pi_{22} - \left(\pi_{1+}\pi_{+1} + \pi_{2+}\pi_{+2}\right)}{1 - \left(\pi_{1+}\pi_{+1} + \pi_{2+}\pi_{+2}\right)},$$

where $\pi_{1+}$ and $\pi_{+1}$ ($\pi_{2+}$ and $\pi_{+2}$) are the marginal probabilities the raters categorized the subjects “1” (“2”) and $\pi_{11}$ ($\pi_{22}$) is the unconstrained probability both raters classified the same subjects as “1” (“2”). In this case, $\pi_o = \pi_{11} + \pi_{22}$ and $\pi_e = \pi_{1+}\pi_{+1} + \pi_{2+}\pi_{+2}$.
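As a concrete illustration, the following minimal R sketch computes $\kappa_C$ from a 2 × 2 table of counts; the counts are hypothetical, invented for illustration only:

```r
# Cohen's kappa for two raters and a binary outcome (hypothetical counts).
# Rows = Rater 1's classification, columns = Rater 2's classification.
tab <- matrix(c(40,  5,
                10, 45), nrow = 2, byrow = TRUE)
p    <- tab / sum(tab)                  # joint probabilities pi_jk
pi_o <- p[1, 1] + p[2, 2]               # observed agreement: pi_11 + pi_22
pi_e <- sum(rowSums(p) * colSums(p))    # chance agreement: pi_1+ pi_+1 + pi_2+ pi_+2
kappa_C <- (pi_o - pi_e) / (1 - pi_e)
kappa_C                                  # 0.7 for these counts
```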

Fleiss’s kappa ($\kappa_F$) was proposed as an estimator of kappa when there are multiple raters. A variety of study designs can be analyzed using $\kappa_F$; examples include when all raters evaluate all subjects and when each rater evaluates a different subset of subjects. Notably, $\kappa_F$ requires the raters to evaluate the same number of subjects.6 In this estimator, $\pi_e$ is a function of the average probability for each category across all raters, while $\pi_o$ is the sum of the proportion of rater-pairs in agreement for each category.

While $\kappa_C$ was developed exclusively for two raters, there are cases where it is used when there are more than two raters.7,26 In these situations, an estimate of $\kappa_C$ is generally provided for each pair of raters, while $\kappa_F$ provides a single estimate across all raters.

2.1 |. Formulating kappa using generalized linear mixed models

Cohen’s kappa was first developed as a data-driven measure of agreement.23 Subsequent works showed that $\pi_e$ in $\kappa_C$ has maximum likelihood estimator properties.25 The purpose of this subsection is to demonstrate that $\pi_e$ can be formulated using GLMMs. This is valuable because we can then model the complex data structure that is mis-specified under the widely used $\kappa_C$ and $\kappa_F$.

Consider the marginal probability of a positive (or negative) evaluation used in the definitions of $\pi_e$ for $\kappa_C$ and $\kappa_F$ (eg, $\pi_{1+}$). This is simply the proportion of subjects the given rater evaluated as positive (or negative). It is known that the marginal probability of a positive (or negative) evaluation can be equivalently defined under a logistic regression model that includes only raters as a fixed effect in the linear predictor term.27 While $\kappa_C$ and $\kappa_F$ are currently classified in the literature as simple “chance” adjusted measures of agreement,15 $\kappa_C$ and $\kappa_F$ are indeed model-based, since $\pi_e$ is a function of a very simple assumed logistic regression model.
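To make this point concrete, the following minimal R sketch (with hypothetical data, not from the paper) confirms that a logistic regression with rater as the only fixed effect reproduces each rater’s marginal proportion of positive evaluations, ie, the quantity used in $\pi_e$ for $\kappa_C$ and $\kappa_F$:

```r
# Check: a rater-only logistic regression recovers each rater's marginal
# proportion of positive evaluations (hypothetical data, 3 raters x 50 subjects).
set.seed(2)
dat <- data.frame(rater = factor(rep(1:3, each = 50)),
                  y     = rbinom(150, 1, rep(c(0.3, 0.5, 0.6), each = 50)))
fit <- glm(y ~ rater, family = binomial, data = dat)
# Fitted probabilities are constant within rater and equal the raw proportions.
cbind(model    = unique(fitted(fit)),
      observed = tapply(dat$y, dat$rater, mean))
```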

Including raters as the only fixed effect in this logistic regression model has three limitations. First, we assume that the included raters encompass the entire population of raters, and that the scope of inference is limited to these specific raters. Second, the probability of a positive response is assumed to be constant for all subjects evaluated by a given rater, regardless of each individual subject’s characteristics. Third, as the number of raters grows, more parameters must be estimated; this could limit the number of raters included in each analysis or require more subjects to obtain the maximum likelihood estimates.

As further described below in Section 2.2, this work overcomes the limitations of defining $\pi_e$ as a function of the highly restricted logistic regression model that includes only raters as a fixed effect. This is accomplished by including raters as a random effect and explanatory measures as fixed effects in the model used to define $\pi_e$.

2.2 |. Model-based estimator of kappa

The primary study design considered in this manuscript is when the outcome is binary, there are more than two raters, and each rater evaluates each subject once. Additionally, the goal of the study is to assess agreement under a pairwise structure, or when a pair of two raters reaches the same conclusion. Other definitions of agreement are not considered.28,29

We defined observed agreement ($\pi_o$ in Equation 1) as the joint probability of agreement among all pairs of raters, and expected agreement ($\pi_e$ in Equation 1) as the probability of agreement for each pair of raters under the assumption of rater conditional independence. Simple, unadjusted proportions in the data (ie, a saturated model) are used to estimate $\pi_o$; thus, we describe $\pi_o$ as unconstrained. A GLMM is used to estimate $\pi_e$, instead of the minimal logistic regression model described in Section 2.1. We use the term model-based to describe $\pi_e$, to emphasize that $\pi_e$ is always a function of a GLMM.

Consider the following notation. Let

$$Y_{ij} = \begin{cases} 1, & \text{rater } j \text{ evaluates subject } i \text{ as positive} \\ 0, & \text{rater } j \text{ evaluates subject } i \text{ as negative} \end{cases}$$

where $i = 1, \ldots, N$ and $j = 1, \ldots, R$. Then, $Y_{ij} \sim \text{Bern}(\theta_{ij})$, where $\theta_{ij} = E(Y_{ij}) = P(Y_{ij} = 1)$.

In a population of $N$ subjects, $R$ raters, and $T = \binom{R}{2}$ pairs of two raters, the unconstrained probability of agreement is defined as

$$\pi_o = \frac{1}{N}\frac{1}{T}\sum_{i=1}^{N}\sum_{j=1}^{R-1}\sum_{j'=j+1}^{R}\left[P(Y_{ij}=1, Y_{ij'}=1) + P(Y_{ij}=0, Y_{ij'}=0)\right]$$

and the model-based probability of agreement, under the assumption of rater independence, is

$$\pi_e = \frac{1}{N}\frac{1}{T}\sum_{i=1}^{N}\sum_{j=1}^{R-1}\sum_{j'=j+1}^{R}\left[P(Y_{ij}=1)P(Y_{ij'}=1) + \left(1-P(Y_{ij}=1)\right)\left(1-P(Y_{ij'}=1)\right)\right]$$

The probability rater j evaluates subject i as positive can be modeled using a GLMM, where raters are included as a random effect. This allows for rater-specific tendencies to be modeled and accounts for non-independent observations since each rater evaluates more than one subject.

Consider model $M$: $\text{logit}(\theta_{ij}) = \mathbf{x}_{ij}^T\boldsymbol{\beta} + \mathbf{z}_j^T\mathbf{b}$, where $i = 1, \ldots, N$, $j = 1, \ldots, R$, $\boldsymbol{\beta}$ is a vector of parameters for subject and rater characteristics, $\mathbf{b} \sim N(\mathbf{0}, \sigma_R^2\mathbf{I})$ is a vector of random effects for the raters, and $\theta_{ij} = E(Y_{ij} \mid \mathbf{x}_{ij}, b_j) = P(Y_{ij} = 1 \mid \mathbf{x}_{ij}, b_j)$. Then, the model-based probability of agreement under model $M$ is:

$$\pi_e^M = \frac{1}{N}\frac{1}{T}\sum_{i=1}^{N}\sum_{j=1}^{R-1}\sum_{j'=j+1}^{R}\left[\theta_{ij}\theta_{ij'} + (1-\theta_{ij})(1-\theta_{ij'})\right],$$

which includes random and fixed effects.

The proposed model-based estimator of kappa, under model $M$, is then:

$$\kappa^M = \frac{\dfrac{1}{N}\dfrac{1}{T}\sum_{i=1}^{N}\sum_{j=1}^{R-1}\sum_{j'=j+1}^{R} P(Y_{ij} = Y_{ij'}) \;-\; \dfrac{1}{N}\dfrac{1}{T}\sum_{i=1}^{N}\sum_{j=1}^{R-1}\sum_{j'=j+1}^{R}\left[\theta_{ij}\theta_{ij'} + (1-\theta_{ij})(1-\theta_{ij'})\right]}{1 \;-\; \dfrac{1}{N}\dfrac{1}{T}\sum_{i=1}^{N}\sum_{j=1}^{R-1}\sum_{j'=j+1}^{R}\left[\theta_{ij}\theta_{ij'} + (1-\theta_{ij})(1-\theta_{ij'})\right]}$$

Cohen’s kappa ($\kappa_C$) and Fleiss’s kappa ($\kappa_F$) can also be written under this notation (Supplemental Appendix A). Using algebra, $\kappa^M$ can be re-written as:

$$\kappa^M = \frac{\pi_o - \pi_e^M}{1 - \pi_e^M} = \frac{\sum_{l=1}^{T} \kappa_l \left(1 - \pi_{e,l}^M\right)}{c} \tag{2}$$

where $l$ is the index for each of the $T$ 2-tuple pairs of raters, $c = T - \sum_{l=1}^{T} \pi_{e,l}^M$ is a constant, and $\pi_{e,l}^M = \frac{1}{N}\sum_{i=1}^{N}\left[\theta_{ij}\theta_{ij'} + (1-\theta_{ij})(1-\theta_{ij'})\right]$ is the model-based probability of agreement for the 2-tuple rater pair $l$. In words, $\kappa^M$ is a weighted average of each rater-pair’s kappa, where the weights are the proportion of model-based disagreement for rater-pair $l$ relative to the total model-based disagreement among all rater-pairs.
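The identity in Equation (2) can be checked numerically. The following R sketch uses hypothetical values of $\theta_{ij}$ (standing in for fitted model-based probabilities) and simulated ratings, and verifies that the direct formula and the disagreement-weighted average of pairwise kappas coincide:

```r
# Numerical check of Equation (2) under hypothetical inputs.
set.seed(1)
N <- 100; R <- 3
theta <- matrix(runif(N * R, 0.2, 0.8), N, R)   # hypothetical theta_ij
y     <- matrix(rbinom(N * R, 1, theta), N, R)  # hypothetical 0/1 ratings

rater_pairs <- combn(R, 2)
agree <- function(a, b) mean(a * b + (1 - a) * (1 - b))
pi_o_l <- apply(rater_pairs, 2, function(jj) agree(y[, jj[1]],     y[, jj[2]]))
pi_e_l <- apply(rater_pairs, 2, function(jj) agree(theta[, jj[1]], theta[, jj[2]]))
kappa_l <- (pi_o_l - pi_e_l) / (1 - pi_e_l)     # pairwise kappas

T_pairs <- ncol(rater_pairs)
kappa_direct   <- (mean(pi_o_l) - mean(pi_e_l)) / (1 - mean(pi_e_l))
kappa_weighted <- sum(kappa_l * (1 - pi_e_l)) / (T_pairs - sum(pi_e_l))
all.equal(kappa_direct, kappa_weighted)          # TRUE
```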

We estimate $\pi_o$, in a sample of $n$ subjects and $r$ raters, as the average of the proportion of subjects upon which the evaluations agreed within each rater-pair:

$$\hat{\pi}_o = \frac{1}{n}\frac{1}{t}\sum_{i=1}^{n}\sum_{j=1}^{r-1}\sum_{j'=j+1}^{r}\left[y_{ij}y_{ij'} + (1-y_{ij})(1-y_{ij'})\right]$$

where $t = \binom{r}{2}$, $i = 1, \ldots, n \le N$, and $j = 1, \ldots, r \le R$.

Under a specified (assumed) model $M$, we obtain predicted probabilities of a positive response for each subject-rater observation, $\hat{\theta}_{ij}$. These model-based predicted probabilities, $\text{logit}(\hat{\theta}_{ij}) = \mathbf{x}_{ij}^T\hat{\boldsymbol{\beta}} + \hat{b}_j$, are then used to calculate $\hat{\pi}_e^M$ assuming rater conditional independence:

$$\hat{\pi}_e^M = \frac{1}{n}\frac{1}{t}\sum_{i=1}^{n}\sum_{j=1}^{r-1}\sum_{j'=j+1}^{r}\left[\hat{\theta}_{ij}\hat{\theta}_{ij'} + (1-\hat{\theta}_{ij})(1-\hat{\theta}_{ij'})\right]$$

In order to determine which fixed effects to include in estimating the probability of a positive response, usual GLMM building methods can be used.30 While estimates of kappa are known to be approximately normally distributed when the number of subjects is large,23 we bootstrap with subject-level resampling to estimate the variances of $\hat{\kappa}$ across all raters as well as of $\hat{\kappa}_l$, which estimates agreement for the 2-tuple rater pair $l$.31 Empirical confidence intervals were calculated using the bootstrap samples. The estimates, empirical standard errors, and empirical 95% confidence intervals (using the 0.025 and 0.975 quantiles of the bootstrapped data) are reported for all results (simulations and real-data analyses). The mean squared error is reported for simulation results.
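For concreteness, a minimal sketch of this estimation workflow is given below. It assumes a hypothetical long-format data frame `dat` with columns `subject`, `rater`, `y` (0/1), and a covariate `x_case`, and uses `lme4::glmer` to fit a model of the form described above; it is an illustration of the approach under those assumptions, not the authors’ released code (which is available on GitHub):

```r
# Sketch of the model-based kappa estimator for one fitted model M.
library(lme4)

model_kappa <- function(dat) {
  # Model M: covariate as fixed effect, rater as random intercept.
  fit <- glmer(y ~ x_case + (1 | rater), data = dat, family = binomial)
  dat$theta <- fitted(fit)   # theta-hat_ij on the probability scale

  # Subject x rater matrices of observed ratings and model-based probabilities.
  y_mat  <- tapply(dat$y,     list(dat$subject, dat$rater), mean)
  th_mat <- tapply(dat$theta, list(dat$subject, dat$rater), mean)

  rater_pairs <- combn(ncol(y_mat), 2)
  agree  <- function(a, b) mean(a * b + (1 - a) * (1 - b))
  pi_o_l <- apply(rater_pairs, 2, function(jj) agree(y_mat[, jj[1]],  y_mat[, jj[2]]))
  pi_e_l <- apply(rater_pairs, 2, function(jj) agree(th_mat[, jj[1]], th_mat[, jj[2]]))

  list(kappa_l = (pi_o_l - pi_e_l) / (1 - pi_e_l),                      # pairwise
       kappa   = (mean(pi_o_l) - mean(pi_e_l)) / (1 - mean(pi_e_l)))    # overall
}
# Variances would come from bootstrapping subjects: resample subject IDs with
# replacement and re-run model_kappa on each bootstrap sample.
```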

3 |. SIMULATING BERNOULLI VARIABLES UNDER THE KAPPA AGREEMENT STRUCTURE

There are two general simulation situations of interest. First, when there is no agreement beyond rater independence (and what assumed model $M$ would predict): $\kappa = 0$. It is straightforward to generate data when $\kappa = 0$ (and $\kappa_l = 0\ \forall\, l$), as joint distributions are simply the products of marginal univariate Bernoulli random variables.

Second, when agreement is greater than an independence structure and what model $M$ would predict: $\kappa > 0$. Generating Bernoulli observations is not trivial when $\kappa > 0$ (and $\kappa_l = c_l\ \forall\, l$, where $c_l \in (0,1]$). All 2-tuple pairs of raters have a specific kappa dependency structure, and each rater contributes to $R - 1$ values of $\kappa_l$. For instance, suppose $R = 4$ and we are simulating Rater 4’s observation vector. Rater 4’s observations (drawn once) must adhere to the three, potentially distinct, kappa dependency structures with Raters 1, 2, and 3, in addition to the previously drawn bivariate dependency structures of Raters 1 and 2, Raters 1 and 3, and Raters 2 and 3.

Methods are available to simulate correlated binary observations. These methods use the marginal distribution of the observations (ie, after integrating out any random effects) and may include covariates.16–21 However, we know of no simulation method that enforces a specific kappa agreement dependency structure (ie, for each 2-tuple rater pair, $\kappa_l$, and across all raters, $\kappa$) while simultaneously allowing the probability of a positive response to vary for each subject-rater observation. Further, our development allows for rater-specific inference, rather than population-based inference, by including raters as random effects.

The main idea of our approach is to use the multiplication rule for conditional probabilities to draw observations for each rater sequentially.32 This is necessary to maintain the kappa agreement dependency structure for each 2-tuple rater pair and to allow the probability of a positive response to vary for each subject-rater observation. For example, Rater 1 will either have a positive ($Y_{i1} = 1$) or negative ($Y_{i1} = 0$) evaluation for subject $i$ based on Rater 1’s probability of a positive evaluation, $P(Y_{i1} = 1)$. Next, we calculate Rater 2’s probability of a positive response based on Rater 1’s outcome:

$$P(Y_{i2}=1 \mid Y_{i1}=1) = \frac{P(Y_{i1}=1, Y_{i2}=1)}{P(Y_{i1}=1)} \quad \text{and} \quad P(Y_{i2}=1 \mid Y_{i1}=0) = \frac{P(Y_{i1}=0, Y_{i2}=1)}{P(Y_{i1}=0)}.$$

This is completed sequentially for each rater. Figure 1 is a tree diagram that depicts drawing sequential observations for three raters.

FIGURE 1. Tree diagram that depicts drawing sequential observations for three raters.
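The following R sketch illustrates the first step of this scheme for a single subject and one pair of raters, under assumed marginals and a target pairwise kappa: it recovers the joint 2 × 2 cell probabilities implied by $\kappa_l$ (see Supplemental Appendix B for the general derivation) and then draws Rater 2 conditionally on Rater 1. The numbers are hypothetical:

```r
# One sequential draw for a pair of raters, preserving a target pairwise kappa.
draw_pair <- function(p1, p2, kappa) {
  pi_e <- p1 * p2 + (1 - p1) * (1 - p2)    # chance agreement
  pi_o <- kappa * (1 - pi_e) + pi_e        # implied observed agreement (Eq. 1)
  # For binary outcomes the marginals and pi_o pin down the joint cells:
  # P(1,1) + P(0,0) = pi_o and P(1,1) - P(0,0) = p1 + p2 - 1.
  p11 <- (pi_o + p1 + p2 - 1) / 2
  p00 <- pi_o - p11
  stopifnot(p11 >= 0, p00 >= 0, p11 <= min(p1, p2))  # kappa feasible for marginals
  y1 <- rbinom(1, 1, p1)
  p2_cond <- if (y1 == 1) p11 / p1 else (p2 - p11) / (1 - p1)
  y2 <- rbinom(1, 1, p2_cond)
  c(y1 = y1, y2 = y2)
}
draw_pair(p1 = 0.6, p2 = 0.5, kappa = 0.4)
```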

We now provide more details. Suppose there are $R$ raters and an assumed model $M$. The values of $\kappa_l$ are selected based on model $M$; note that raters with similar tendencies may attain a larger $\kappa_l$. The value of $\kappa$ is calculated via Equation (2). The theoretical probabilities of the usual two-way table of two raters for each subject (eg, $P(Y_{ij}=1, Y_{ij'}=1)$; see Supplemental Table S1 for complete details) are calculated using the corresponding value of $\kappa_l$, $P(Y_{ij}=1)$, and $P(Y_{ij'}=1)$, where $l$ represents the rater pair of Raters $j$ and $j'$ (details in Supplemental Appendix B).

Since each rater’s observations are drawn sequentially, the joint probabilities for each set of raters, $P(Y_{i1}=y_{i1}, Y_{i2}=y_{i2}, Y_{i3}=y_{i3})$, $P(Y_{i1}=y_{i1}, Y_{i2}=y_{i2}, Y_{i3}=y_{i3}, Y_{i4}=y_{i4}), \ldots, P(Y_{i1}=y_{i1}, \ldots, Y_{iR}=y_{iR})$, for all possible combinations of each of the $y_{ij} = 1$ or $0$, must be solved. Each joint probability is a numerator for a conditional probability used to draw the next rater’s observation. For example, $P(Y_{i1}=y_{i1}, \ldots, Y_{iR}=y_{iR})$ is a numerator for drawing the last rater’s (Rater $R$) probability of a positive response.

First, the probabilities involving Rater $R$ (a total of $2^R$ probabilities) are found. For instance, if $R=4$, we solve for the $16 = 2^4$ probabilities: $P(Y_{i1}=1, Y_{i2}=1, Y_{i3}=1, Y_{i4}=1)$, $P(Y_{i1}=1, Y_{i2}=1, Y_{i3}=1, Y_{i4}=0), \ldots, P(Y_{i1}=0, Y_{i2}=0, Y_{i3}=0, Y_{i4}=0)$. We use non-negative least squares to solve for these probabilities: $A\mathbf{u}_i = \mathbf{q}_i$, where $\mathbf{u}_i \ge \mathbf{0}$.33 Here, $\mathbf{q}_i$ is a vector containing the known probabilities of agreement for all 2-tuple rater pairs, $P(Y_{ij}=1, Y_{ij'}=1)$ and $P(Y_{ij}=0, Y_{ij'}=0)$, and each rater’s probability of a positive response, $P(Y_{ij}=1)$. An example $\mathbf{q}_i$, $\mathbf{u}_i$, and $A$ is provided in Supplemental Appendix C.

The probabilities in $\mathbf{u}_i$ form a partition of the joint probabilities for each set of raters needed to compute the conditional probabilities of a positive response when drawing the raters sequentially.32 Hence, algebra is used to solve for the required sets of joint probabilities. Recall that when $R=4$, $\mathbf{u}_i = \left[P(Y_{i1}=1, Y_{i2}=1, Y_{i3}=1, Y_{i4}=1), \ldots, P(Y_{i1}=0, Y_{i2}=0, Y_{i3}=0, Y_{i4}=0)\right]$. To solve for the probabilities involving Raters 1, 2, and 3, $P(Y_{i1}=1, Y_{i2}=1, Y_{i3}=1), \ldots, P(Y_{i1}=0, Y_{i2}=0, Y_{i3}=0)$, the probabilities in the respective partition in $\mathbf{u}_i$ are summed. For example, $P(Y_{i1}=1, Y_{i2}=1, Y_{i3}=1) = P(Y_{i1}=1, Y_{i2}=1, Y_{i3}=1, Y_{i4}=1) + P(Y_{i1}=1, Y_{i2}=1, Y_{i3}=1, Y_{i4}=0)$.

Each subject is considered separately throughout all calculations. This is because the true probabilities of agreement and disagreement are known for each subject within each pair of raters. There may be some cases where a calculated probability is larger than 1. This may happen when either $P(Y_{ij}=1, Y_{ij'}=1) < 0$ or $P(Y_{ij}=0, Y_{ij'}=0) < 0$, which indicates that the selected $\kappa_l$ is not possible given the marginal probabilities of the raters (ie, given model $M$).

There are several packages available in R that can implement non-negative least squares (eg, “limSolve”, “glmnet”, or “bvls”).34–36 We used the “limSolve” package due to ease of implementation; its solver utilizes FORTRAN code from LINPACK.33
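As an illustration of this step, the sketch below sets up $A$, $\mathbf{q}_i$, and $\mathbf{u}_i$ for a hypothetical three-rater subject and solves the system with `limSolve::nnls`. The probabilities are invented but mutually consistent; note the system is underdetermined, so non-negative least squares returns one feasible solution:

```r
# Non-negative least squares step for one subject (hypothetical 3-rater numbers;
# in the paper q_i comes from model M and the selected kappa_l values).
library(limSolve)

# The 2^3 = 8 joint cells, one row per (y1, y2, y3) pattern.
cells <- as.matrix(expand.grid(y1 = 1:0, y2 = 1:0, y3 = 1:0))

p   <- c(0.60, 0.50, 0.70)   # marginals P(Y_ij = 1), j = 1, 2, 3
p11 <- c(0.40, 0.50, 0.42)   # P(Y_ij = 1, Y_ij' = 1) for pairs (1,2), (1,3), (2,3)
p00 <- c(0.30, 0.20, 0.22)   # P(Y_ij = 0, Y_ij' = 0) for the same pairs

rater_pairs <- combn(3, 2)
A <- rbind(
  rep(1, nrow(cells)),   # all joint cell probabilities sum to 1
  t(cells),              # rows selecting cells with Y_j = 1 (the marginals)
  t(apply(rater_pairs, 2, function(jj) cells[, jj[1]] * cells[, jj[2]])),
  t(apply(rater_pairs, 2, function(jj) (1 - cells[, jj[1]]) * (1 - cells[, jj[2]])))
)
q <- c(1, p, p11, p00)

u <- nnls(A, q)$X   # solves A u = q with u >= 0 (one feasible solution)
round(u, 3)
```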

4 |. SIMULATION RESULTS

Extensive simulations were completed. While data were simulated under a variety of conditions, four representative results are presented. All reported simulations had the same number of raters ($r = 6$), number of subjects ($n = 200$), and model $M$:

$$M: \text{logit}(\theta_{ij}) = 1.5\,x_{i,\text{case}} + 1\,x_{i,\text{age}} - 0.75\,x_{j,\text{unexp}} + b_j$$

The outcome in these simulations can be thought of as a medical diagnostic test evaluation. The explanatory factors included the disease status of each subject ($x_{i,\text{case}} = 1$ if subject $i$ was a case), the standardized age of each subject ($x_{i,\text{age}}$, continuous), and the (binary) training level of the rater ($x_{j,\text{unexp}} = 1$ if the rater was unexperienced). The proportion of cases among subjects was set at 0.5, as was the proportion of unexperienced raters. Each rater’s random effect was drawn independently from a $N(0, \sigma_r^2)$ distribution. The simulations differed in the SD of the rater random effect ($\sigma_r = 0.5$ or $1$) and the value of kappa ($\kappa = 0$ or $\kappa > 0$).
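For reference, a minimal sketch of this simulation configuration (assuming the coefficients displayed in model $M$ above) is given below; the outcome draws themselves would then follow the sequential kappa-preserving scheme of Section 3:

```r
# Sketch of the simulation design for model M (assumed setup, not the authors'
# exact code): covariates, rater random effects, and the implied P(Y_ij = 1).
set.seed(3)
n <- 200; r <- 6; sigma_r <- 0.5
x_case  <- rbinom(n, 1, 0.5)     # subject disease status
x_age   <- rnorm(n)              # standardized subject age
x_unexp <- rbinom(r, 1, 0.5)     # rater training level
b       <- rnorm(r, 0, sigma_r)  # rater random effects

# n x r matrix of P(Y_ij = 1) under model M for every subject-rater combination.
eta   <- outer(1.5 * x_case + 1 * x_age, -0.75 * x_unexp + b, `+`)
theta <- plogis(eta)
```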

We estimated each rater-pair’s $\kappa_l$ using Cohen’s kappa ($\kappa_{l,C}$) and the proposed method ($\kappa_{l,M}$). Kappa was estimated among all raters using Fleiss’s kappa ($\kappa_F$) and the proposed method ($\kappa^M$). The mean squared error (MSE) was calculated. Numerical simulation results are provided in Supplemental Tables S2 and S3, with key findings presented for $\sigma_r = 0.5$. The provided results are representative of all conducted simulations.

Figure 2 is a plot of densities for $\hat{\kappa}_l$ across all simulations. The nominal kappa values ($\kappa_l$) are represented by color-coded, vertical reference lines. The rows refer to each simulation: (1) $\sigma_r = 0.5$ and $\kappa = 0$ (and $\kappa_l = 0\ \forall\, l$); (2) $\sigma_r = 0.5$ and $\kappa > 0$ (and $\kappa_l > 0\ \forall\, l$); (3) $\sigma_r = 1$ and $\kappa = 0$ (and $\kappa_l = 0\ \forall\, l$); (4) $\sigma_r = 1$ and $\kappa > 0$ (and $\kappa_l > 0\ \forall\, l$). The distance between the nominal kappa values and the corresponding densities of $\hat{\kappa}_{l,M}$ and $\hat{\kappa}_{l,C}$ represents the bias of the proposed method and Cohen’s kappa, respectively. Rater random effects were consistent for a given value of $\sigma_r$, allowing for comparison of Simulations 1 and 2, and Simulations 3 and 4.

FIGURE 2. Simulation densities of estimates for each pair of two raters. Estimates were calculated using Cohen’s kappa (left panel) and the GLMM-based kappa (right panel). Nominal kappa values are represented by color-coded, vertical reference lines.

There are several key findings. First, $\hat{\kappa}_{l,C}$ tended to be larger (biased) than the nominal values, whereas $\hat{\kappa}_{l,M}$ was distributed around the nominal values (unbiased). Under the hypothesis of no agreement beyond what model $M$ would predict under the assumption of independence (ie, $\kappa = 0$), $\hat{\kappa}_{l,C}$ was not centered around 0, while $\hat{\kappa}_{l,M}$ was. For example, $\bar{\kappa}_{1,C} = 0.261$ (SD = 0.057) while $\bar{\kappa}_{1,M} = 0.02$ (SD = 0.071), where $\bar{\kappa}_{1,C}$ and $\bar{\kappa}_{1,M}$ are the averages of $\hat{\kappa}_{1,C}$ and $\hat{\kappa}_{1,M}$, respectively, over all simulations (see also Supplemental Table S2). Likewise, when $\kappa > 0$, $\hat{\kappa}_{l,C}$ was generally biased larger than the nominal values, whereas $\hat{\kappa}_{l,M}$ was centered/unbiased. Comparing the estimates for Pair 1 (when $\sigma_r = 0.5$; $\kappa_1 = 0.321$), $\bar{\kappa}_{1,C} = 0.51$ (SD = 0.051) while $\bar{\kappa}_{1,M} = 0.303$ (SD = 0.061) (Supplemental Table S2). Notably, as $\kappa_l$ approached its upper bound of 1, Cohen’s kappa estimates became less inflated. For example, when $\kappa_4 = 0.928$, $\bar{\kappa}_{4,C} = 0.953$ (SD = 0.022) and $\bar{\kappa}_{4,M} = 0.928$ (SD = 0.034) (Supplemental Table S2).

While $\hat{\kappa}_{l,C}$ was inflated compared to $\hat{\kappa}_{l,M}$, the SD of $\hat{\kappa}_{l,C}$ was slightly less than that of $\hat{\kappa}_{l,M}$ (Supplemental Tables S2 and S3). This indicates that $\hat{\kappa}_{l,C}$ was more precise around a biased estimate. Across both estimation methods, the SD of $\hat{\kappa}_l$ decreased as the value of $\kappa_l$ increased. This is consistent with properties of the binomial distribution, where the variance of a binomial random variable is smallest when the probability of a positive response is close to 0 or 1. The MSE was smaller under the proposed method; however, when $\kappa_l$ was closer to its upper bound of 1, the MSE under both methods was comparable. Lastly, the SD of the random effects did not impact the ability to estimate $\kappa_l$.

Similar results were found when examining estimates of $\kappa$ (Figure 3 and Supplemental Table S4). Estimates of $\kappa_F$ were inflated (biased upward) and produced similar values, while estimates of $\kappa^M$ were centered around the nominal values (unbiased). The MSE for $\hat{\kappa}^M$ was always smaller than that of $\hat{\kappa}_F$ (Supplemental Table S4). The estimates were not impacted by $\sigma_r$.

FIGURE 3. Simulation density of estimates among all raters. Estimates were calculated using Fleiss’s kappa (red) and the GLMM-based kappa (green). Nominal kappa values are represented by color-coded, vertical reference lines.

5 |. APPLICATIONS TO BIOMEDICAL STUDIES

Two study datasets were analyzed. The first contained interpretations of 54 Florbetapir positron emission tomography (PET) scans by three experienced raters for evidence of elevated amyloid-β plaque burden (a pathological hallmark of Alzheimer’s disease).7 Clinical and image-specific information was provided. The second contained seven raters’ interpretations of the cervical cancer status (present or absent) of 118 pathology samples.10,27,37 This dataset is considered a classic example of assessing agreement among multiple raters, but does not contain covariate data. As in the simulations, $\kappa_l$ was estimated using $\kappa_{l,M}$ and $\kappa_{l,C}$, while $\kappa$ was estimated using $\kappa_F$ and $\kappa^M$.

We include both sets of analyses as they highlight different features of our approach. The agreement among raters of the Florbetapir PET scans demonstrates the impact of ignoring covariates. The agreement among raters of pathology samples shows that including a random effect for rater accounts for the dependence structure induced when each rater is paired with all others and results in estimates of agreement consistent with other unadjusted analyses.

5.1 |. Agreement of Florbetapir PET scans

Alzheimer’s disease is common, and its prevalence is expected to increase as our population ages.38 It is important to identify biomarkers of Alzheimer’s disease, as historically diagnosis was determined at autopsy. A pathological hallmark of Alzheimer’s disease is amyloid-β plaques in the brain.38 Imaging technologies, such as Florbetapir PET scans, have advanced such that these plaques can be visualized pre-autopsy.39 The degree of amyloid-β deposition in the brain is not linearly correlated with clinical features of Alzheimer’s disease, and deposition may begin before clinical symptoms appear.40 As such, agreement studies have been conducted to assess the utility of Florbetapir PET scans in identifying patients with high amyloid-β deposition.7

One such study comes from the Alzheimer’s Prevention Program (APP) from the Alzheimer’s Disease Center (ADC) at the University of Kansas Medical Center (KUMC).7 In the APP study, 54 Florbetapir PET scans were evaluated for the presence of high amyloid-β deposition by three experienced raters. Information on study participants was known, such as age and sex. Additionally, the standard uptake value ratio (SUVR), a quantitative (computer algorithm based) measure of the amyloid-β burden status (elevated or not-elevated), was available. Raters were not provided the SUVR status and evaluated all scans independently.

The outcome $Y_{ij}$ was binary for the presence of elevated amyloid-β deposition, where $Y_{ij} = 1$ when rater $j$ interpreted scan $i$ as elevated and $Y_{ij} = 0$ otherwise. The final model selected included SUVR status (elevated or non-elevated) as a fixed effect, as other participant features did not describe variation in a positive test interpretation ($P > 0.10$). The model with rater as a random effect did not converge, as there were only three raters. Thus, rater was reverted to a fixed effect and model $M$ was: $\text{logit}(\theta_{ij}) = \beta_0 + \beta_j + \beta_3 x_{i,\text{SUVR}+}$, where $\theta_{ij} = P(Y_{ij}=1)$, $\beta_0$ was the intercept, $\beta_j$ was the effect of rater $j$ ($j = 1, 2$, with rater 3 as baseline), and $\beta_3$ was the effect of an elevated SUVR status. Note that since the GLMM did not converge, other methods, such as that of Nelson and Edwards, cannot be used to assess agreement.

Supplemental Table S5 provides the logistic regression results and the values of $\hat{\kappa}_l$ and $\hat{\kappa}$. Figure 4 shows the empirical densities of $\hat{\kappa}_l$ (Figure 4A) and $\hat{\kappa}$ (Figure 4B). The observed proportion of a positive evaluation was 0.259 for Rater 2 and 0.296 for Raters 1 and 3. The estimated rater effects were not detectably different ($P > 0.55$). Of the 54 images, 24 were elevated based on SUVR status. The odds of a positive evaluation for scans with an elevated SUVR status were 43.467 times the odds for a scan with a non-elevated SUVR status ($P < 0.01$).

FIGURE 4. APP study results among all raters. Observed estimates are denoted by vertical lines. (A) Plots for all pairs of two raters under Cohen’s kappa (yellow) and the model-based kappa method (green). (B) Plots for all raters under Fleiss’s kappa (red) and the model-based kappa (green).

All estimates of $\kappa_l$ were larger for $\kappa_{l,C}$ as compared to $\kappa_{l,M}$ (Figure 4A; Supplemental Table S5). Kappa was largest for Pair 1: $\hat{\kappa}_{1,C} = 0.816$ (95% CI: 0.611, 1) and $\hat{\kappa}_{1,M} = 0.705$ (95% CI: 0.376, 1). The estimates for Pairs 2 and 3 were nearly identical within each estimation method. The standard errors were smaller for $\hat{\kappa}_{l,C}$ as compared to $\hat{\kappa}_{l,M}$. As discussed before, the SD of Bernoulli random variables is smallest when the probability of a positive response is close to 0 or 1. The estimate of $\kappa$ encompassing all raters was larger for $\hat{\kappa}_F$ than for $\hat{\kappa}_M$: $\hat{\kappa}_F = 0.757$ (95% CI: 0.575, 0.912) and $\hat{\kappa}_M = 0.605$ (95% CI: 0.345, 0.816) (Supplemental Table S5). This difference results from adjusting for SUVR status.

5.2 |. Analysis of cervical cancer data

A classic example for assessing agreement among multiple raters is the interpretation of 118 samples for the presence/absence of cervical cancer from seven pathologists.10,27,41 This dataset does not contain any additional information related to the samples or the pathologists.

In this analysis, the outcome $Y_{ij}$ was binary for the presence of cancer, where $Y_{ij} = 1$ when pathologist $j$ interpreted sample $i$ as cancerous. Since there were no additional covariates, model $M$ was $\text{logit}(\theta_{ij}) = \beta_0 + b_j$, where $\theta_{ij} = P(Y_{ij}=1)$, $\beta_0$ was the intercept, and $b_j$ was the pathologist random effect assumed to be independently and identically normally distributed for all seven pathologists (ie, $b_j \sim N(0, \sigma_R^2)$, $j = 1, \ldots, 7$). Numerical results are provided in Supplemental Tables S6 and S7 for $\hat{\kappa}_l$ and $\hat{\kappa}$, respectively.

Figure 5A shows empirical sampling distributions of $\hat{\kappa}_l$. Estimates of $\kappa_l$ ranged from 0.215 (Pair 10) to 0.810 (Pair 20) but did not greatly change by computation method ($\kappa_{l,M}$ vs $\kappa_{l,C}$; Supplemental Table S6). The greatest difference was for Pair 10, where $\hat{\kappa}_{10,M} = 0.215$ (95% CI: 0.123, 0.324) and $\hat{\kappa}_{10,C} = 0.234$ (95% CI: 0.144, 0.339), and the smallest difference was for Pair 6, where $\hat{\kappa}_{6,M} = 0.7943$ (95% CI: 0.6750, 0.9105) and $\hat{\kappa}_{6,C} = 0.7937$ (95% CI: 0.6742, 0.9098).

FIGURE 5. Cervical cancer pathology study results. Observed estimates are denoted by vertical lines. (A) Results for each pair of two raters under Cohen’s kappa and the GLMM-based kappa. (B) Results among all raters under Fleiss’s kappa (red) and the GLMM-based kappa (green).

The estimates of $\kappa$, and corresponding empirical 95% confidence intervals, were similar for both methods (Supplemental Table S7): $\hat{\kappa}_F = 0.512$ (95% CI: 0.423, 0.594) and $\hat{\kappa}_M = 0.519$ (95% CI: 0.437, 0.598). Figure 5B shows the empirical sampling distributions of $\hat{\kappa}$. Compared to the previous example, the standard errors of $\hat{\kappa}_l$ and $\hat{\kappa}$ were smaller. This is likely due to the larger number of samples and raters.13

The similarity in values of $\hat{\kappa}_l$ and $\hat{\kappa}$ regardless of estimation method suggests that $\kappa^M$ produces similar results to $\kappa_C$ and $\kappa_F$ when there is no adjustment for covariates. However, the proposed method correctly accounts for the agreement dependency structure among all pairs of two raters and provides theoretical justification for considering all pairs of raters.

6 |. DISCUSSION

An estimator for the agreement structure kappa that accounts for multiple raters and covariates was proposed. Predicted probabilities under a GLMM, including covariates as fixed effects and raters as random effects, were used to estimate $\pi_e$. This new approach does not require the assumption of equality of any probabilities, which is a limitation of commonly used methods. Through simulations and the amyloid-β PET analysis, we showed that ignoring necessary factors leads to inflated estimates. The amyloid-β PET analysis also demonstrated that our approach can still be used even if the GLMM does not converge, while other approaches cannot. Our estimate was similar to Fleiss’s kappa in the analysis of the cervical cancer pathology data, which included no covariates.

A simulation framework for generating multiple Bernoulli observations under an agreement structure was presented. This required accounting for multivariate dependence structures across each 2-tuple pair of raters, not just a marginal dependence overall. Our framework is flexible and allows the probability of a positive response to depend on subject and rater specific factors. To our knowledge, this is the first simulation framework that uses the conditional (rather than marginal) distribution of Bernoulli observations to account for all dependent structures across and between observations. Together, the proposed method and simulation framework provide a structure for assessing more general agreement study designs, where multiple raters and covariates can be included in the analysis.

An important discussion point of this work is the understanding that Cohen’s and Fleiss’s kappa are functions of an assumed model (via $\pi_e$). This model is a logistic regression model including only raters as a fixed effect. Thus, a model is always assumed when calculating Cohen’s and Fleiss’s kappa, whether or not that model is explicitly stated. Under the framework presented in this manuscript, a more flexible (and arguably more plausible) model for data generated from a given study can be used to estimate agreement.

The proposed method and simulation framework differ from previously developed methods. Some methods produce a distinct kappa value for each covariate pattern.15 For example, if there is a binary explanatory factor, their measure of agreement will include at least two kappa-like statistics. Additionally, the definitions of observed and expected agreement are not always consistent. For instance, Nelson and Edwards define expected agreement as a function of two distinct raters and subjects: $p_c = 1 - 2P(Y_{ij} = 1, Y_{i'j'} = 0)$. Additionally, these previously developed methods may not always be applicable, as they utilize the estimated variance of the raters’ random effects and therefore rely on the convergence of the GLMM. Their method could not be used to assess agreement in the APP study.

Many of those works utilize GEEs to estimate the marginal distribution of the binary observations. We prefer the use of a GLMM to that of a GEE for several reasons. First, the GEE framework restricts us to the population-averaged marginal inference space. Second, GLMMs provide greater flexibility when determining the covariance matrix of the random effects.30 Lastly, under a conditional GLMM framework the probability of a positive response, and not the mean of the marginal distribution, is explicitly estimated. We refer the interested reader to Stroup30 for further theoretical differences between GEEs and GLMMs.

There are limitations with the proposed method. Convergence of the GLMM may be challenging with a smaller number of raters. If this occurs, raters may be included as fixed effects, just as one would do for modeling in general. Another limitation is that if the model perfectly predicts the response (eg, the model is saturated), $\hat{\kappa}_l$ and $\hat{\kappa}$ will be 0. This implies two interpretations. First, reproducibility may be suspect, as saturated model results often do not generalize or perform well with other datasets. Second, if the included covariates account for all variation in the data (and the model is not saturated), then those covariates are important factors for agreement. Lastly, while the simulation framework may theoretically include any number of raters, we are limited to $R < 31$, as the maximum vector length is $2^{31} - 1$ in R version 4.1.1 on a Windows 64-bit operating system with 16 GB RAM.

Future work includes expanding the proposed estimator and simulation framework so each rater may evaluate each subject more than once. It is clear how to generalize the estimation of $\pi_e$ and the simulation framework; it is less clear how to estimate $\pi_o$. By including replicates of all subject-rater observations, the proposed method could be compared to Nelson’s kappa-like statistic and may be used to assess intra-rater agreement (ie, rater consistency). Additionally, it is not clear how to choose the optimal number of raters and subjects. The simulation framework outlined in this manuscript may be modified to answer such questions.

In conclusion, we have developed an estimator for the agreement structure kappa that allows for multiple raters by accounting for the dependency among all pairs of two raters and adjusts for covariates. We provided a framework for simulating Bernoulli variables under the agreement structure kappa. This work demonstrates that usual methods are poised to yield invalid conclusions while our work overcomes shortfalls, leading to improved inferences.

Supplementary Material

Supporting Information
R Files

ACKNOWLEDGEMENTS

This work was partially supported by the National Institutes of Health F30 AG071349, R01 AG043962, P30 AG035982, UL1 TR000001 and UL1 TR002366; the Department of Biostatistics & Data Science, University of Kansas Medical Center; gifts from Frank and Evangeline Thompson, The Ann and Gary Dickinson Family Charitable Foundation, John and Marny Sherman, and Brad and Libby Bergman; and Lilly Pharmaceuticals (grant to support F18-AV45 doses and partial scan costs).

Funding information

National Center for Advancing Translational Sciences, Grant/Award Numbers: UL1 TR000001, UL1 TR002366; National Institute on Aging, Grant/Award Numbers: F30 AG071349, P30 AG035982, R01 AG043962; National Institute on Aging, Grant/Award Number: P30AG072973

Footnotes

SUPPORTING INFORMATION

Additional supporting information can be found online in the Supporting Information section at the end of this article.

DATA AVAILABILITY STATEMENT

Supplemental appendices and Tables are available online. The Alzheimer’s disease data that support the findings of this study may be made available from, and with the approval of, the University of Kansas Medical Center (KUMC) Alzheimer’s Disease Center (ADC). R code detailing simulations can be found on GitHub at https://github.com/CRISsupport/-Simulating-multi-rater-kappa-with-covariates.

REFERENCES

1. Ho GY, Leonhard M, Volk GF, et al. Inter-rater reliability of seven neurolaryngologists in laryngeal EMG signal interpretation. Eur Arch Otorhinolaryngol. 2019;276(10):2849–2856.
2. Jones JD, Boyd RC, Calkins ME, et al. Parent-adolescent agreement about adolescents’ suicidal thoughts. Pediatrics. 2019;143(2):30642950.
3. Gambino CM, Lo Sasso B, Colomba C, et al. Comparison of a rapid immunochromatographic test with a chemiluminescence immunoassay for detection of anti-SARS-CoV-2 IgM and IgG. Biochem Med (Zagreb). 2020;30(3):030901.
4. Jin H, Chien S, Meijer E, Khobragade P, Lee J. Learning from clinical consensus diagnosis in India to facilitate automatic classification of dementia: machine learning study. JMIR Ment Health. 2021;8(5):e27113.
5. Lehman CD, Yala A, Schuster T, et al. Mammographic breast density assessment using deep learning: clinical implementation. Radiology. 2019;290(1):52–58.
6. Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76(5):378–382.
7. Harn NR, Hunt SL, Hill J, Vidoni E, Perry M, Burns JM. Augmenting amyloid PET interpretations with quantitative information improves consistency of early amyloid detection. Clin Nucl Med. 2017;42(8):577–581.
8. Guggenmoos-Holzmann I. The meaning of kappa: probabilistic concepts of reliability and validity revisited. J Clin Epidemiol. 1996;49(7):775–782.
9. Feinstein AR, Cicchetti DV. High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol. 1990;43(6):543–549.
10. Landis JR, Koch GG. An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics. 1977;33(2):363–374.
11. Tanner M, Young M. Modeling agreement among raters. J Am Stat Assoc. 1985;80:175–180.
12. Klar N, Lipsitz SR, Ibrahim JG. An estimating equations approach for modelling kappa. Biom J. 2000;42(1):45–58.
13. Chmura Kraemer H, Periyakoil VS, Noda A. Kappa coefficients in medical research. Stat Med. 2002;21(14):2109–2129.
14. Nelson KP, Edwards D. On population-based measures of agreement for binary classifications. Can J Stat. 2008;36(3):411–426.
15. Nelson KP, Edwards D. Improving the reliability of diagnostic tests in population-based agreement studies. Stat Med. 2010;29(6):617–626.
16. Emrich LJ, Piedmonte MR. A method for generating high-dimensional multivariate binary variates. Am Stat. 1991;45(4):302–304.
17. Oman SD, Zucker DM. Modelling and generating correlated binary variables. Biometrika. 2001;88(1):287–290.
18. Qaqish BF. A family of multivariate binary distributions for simulating correlated binary variables with specified marginal means and correlations. Biometrika. 2003;90(2):455–463.
19. Demirtas H. A method for multivariate ordinal data generation given marginal distributions and correlations. J Stat Comput Simul. 2006;76(11):1017–1025.
20. Ferrari PA, Barbiero A. Simulating ordinal data. Multivar Behav Res. 2012;47(4):566–589.
21. Touloumis A. Simulating correlated binary and multinomial responses under marginal model specification: the SimCorMultRes package. R J. 2016;8:79–91.
22. Ranganathan P, Pramesh CS, Aggarwal R. Common pitfalls in statistical analysis: measures of agreement. Perspect Clin Res. 2017;8(4):187–191.
23. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20(1):37–46.
24. Scott WA. Reliability of content analysis: the case of nominal scale coding. Public Opin Q. 1955;19(3):321–325.
25. Fleiss JL, Levin B, Paik MC. Statistical Methods for Rates and Proportions. 3rd ed. Hoboken, NJ: Wiley; 2003.
26. Fleck BW, Tandon A, Jones PA, Mulvihill AO, Minns RA. An interrater reliability study of a new ‘zonal’ classification for reporting the location of retinal haemorrhages in childhood for clinical, legal and research purposes. Br J Ophthalmol. 2010;94(7):886–890.
27. Agresti A. Categorical Data Analysis. 3rd ed. Hoboken, NJ: Wiley; 2013.
28. Hubert L. Kappa revisited. Psychol Bull. 1977;84(2):289–297.
29. Light RJ. Measures of response agreement for qualitative data: some generalizations and alternatives. Psychol Bull. 1971;76(5):365–377.
30. Stroup WW. Generalized Linear Mixed Models: Modern Concepts, Methods and Applications. 1st ed. Boca Raton, FL: CRC Press; 2012.
31. James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning: with Applications in R. New York: Springer; 2013.
32. DeGroot MH, Schervish MJ. Probability and Statistics. 4th ed. Boston, MA: Pearson; 2011.
33. Lawson CL, Hanson RJ. Solving Least Squares Problems. 2nd ed. Philadelphia, PA: Society for Industrial and Applied Mathematics; 1995.
34. limSolve: Solving Linear Inverse Models [computer program]. R package; 2009.
35. Friedman J, Hastie T, Tibshirani R, et al. glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models [computer program]. R package; 2022.
36. bvls: The Stark-Parker Algorithm for Bounded-Variable Least Squares [computer program]. R package; 2013.
37. Holmquist ND, McMahan CA, Williams OD. Variability in classification of carcinoma in situ of the uterine cervix. Arch Pathol. 1967;84(4):334–345.
38. 2020 Alzheimer’s disease facts and figures. Alzheimers Dement. 2020.
39. Ferreira LK, Busatto GF. Neuroimaging in Alzheimer’s disease: current role in clinical practice and potential future applications. Clinics (Sao Paulo). 2011;66(Suppl 1):19–24.
40. Förster S, Yousefi BH, Wester HJ, et al. Quantitative longitudinal interrelationships between brain metabolism and amyloid deposition during a 2-year follow-up in patients with early Alzheimer’s disease. Eur J Nucl Med Mol Imaging. 2012;39(12):1927–1936.
41. Stoyan D, Pommerening A, Hummel M, Kopp-Schneider A. Multiple-rater kappas for binary data: models and interpretation. Biom J. 2018;60(2):381–394.
