Published in final edited form as: Stat Med. 2010 Mar 15;29(6):617–626. doi: 10.1002/sim.3819

Improving the reliability of diagnostic tests in population-based agreement studies

Kerrie P. Nelson a,*, Don Edwards b
PMCID: PMC5079112  NIHMSID: NIHMS824041  PMID: 20128018

Abstract

Many large-scale studies have recently been carried out to assess the reliability of diagnostic procedures, such as mammography for the detection of breast cancer. The large numbers of raters and subjects involved raise new challenges in how to measure agreement in these types of studies. An important motivator of these studies is the identification of factors that contribute to the often wide discrepancies observed between raters’ classifications, such as a rater’s experience, in order to improve the reliability of the diagnostic process of interest. Incorporating covariate information into the agreement model is a key component in addressing these questions. Few agreement models are currently available that jointly model larger numbers of raters and subjects and incorporate covariate information. In this paper, we extend a recently developed population-based model and measure of agreement for binary ratings to incorporate covariate information using the class of generalized linear mixed models with a probit link function. Important information on factors related to the subjects and raters can be included as fixed and/or random effects in the model. We demonstrate how agreement can be assessed between subgroups of the raters and/or subjects, for example, comparing agreement between experienced and less experienced raters. Simulation studies are carried out to test the performance of the proposed models and measures of agreement. Application to a large-scale breast cancer study is presented.

Keywords: agreement, model-based kappa, Cohen’s kappa, intraclass kappa, generalized linear mixed model, crossed random effects

1. Introduction

The reliability of common medical procedures for the diagnosis of cancer and other diseases has become an important issue over the last few decades [1–4]. Subjective classifications of tests including mammograms, x-rays and biopsies are routinely carried out by physicians and other biomedical professionals, yet wide discrepancies have been observed between experts [1, 4–6]. Many studies have recently been carried out to measure the levels of agreement between experts in such settings, and to attempt to identify factors that may contribute to the observed discrepancies. Identification of influential factors provides valuable insight into how the reliability of diagnostic procedures might be improved [7–9]. These studies include, among others, a nationwide study carried out by Beam et al. [9] involving the classification of 148 mammograms by over 100 experts. Their study concluded that individual radiologists’ current reading volume was not statistically associated with accuracy in reading screening mammograms, although several other factors were. Miglioretti et al. [7] examined the variability between raters in their classification of over 35 000 mammograms. They found considerable variation in the interpretative performance of diagnostic mammography that was not explained by the characteristics of the patients whose mammograms were interpreted. In a Gleason grading study [10], thirty-eight prostate cancer biopsies were classified by 41 pathologists, and only barely moderate agreement between raters was found. The assessment of agreement in these studies is challenging because of the large numbers of experts and subjects under study. Modeling the additional information needed to identify influential factors poses a further challenge.

Currently available methods for modeling agreement and reliability among raters can be broadly categorized as summary measures and model-based approaches. Summary statistics include the very popular Cohen’s kappa [3] and its extensions, which are often applied in biomedical studies to describe the overall level of chance-corrected agreement between two or more raters. The inclusion of covariate information when using Cohen’s kappa has also been described [11–13]. However, it is well acknowledged that Cohen’s kappa has a number of weaknesses [14, 15] and can lead to an inaccurate assessment of agreement, sometimes severely so. Cohen’s kappa is applicable when several subjects are classified by two raters; when multiple raters classify several subjects, the multirater intraclass kappa described in Kraemer et al. [16] is the appropriate summary measure of reliability. In general, summary statistics provide a useful overall measure of agreement but, by reducing the entire agreement process to a single value, discard a large amount of information.

Model-based approaches provide a more complete and broader framework for assessing factors influencing agreement, and thus perhaps for improving reliability. These include log-linear models [17–21], latent class and trait models [20–24], and logistic regression models [25, 26]. Several of these approaches include the raters as fixed effects and therefore provide inference only for the specific raters and items under study, not for entire populations of such raters. Such methods work well when a small number of raters is involved, but become increasingly complex, frequently involving a large number of parameters, when the classifications of more than two or three raters are included. Log-linear and latent-variable models can incorporate covariate information; however, few currently available methods are able both to incorporate covariate information and to yield inference regarding the underlying diagnostic procedure.

For common medical procedures, it is desirable to make inference regarding the levels of agreement in the underlying populations of raters and subjects who are typically involved in the diagnostic procedure of interest, and to examine the influence of factors on the agreement process in such a setting. A number of approaches have been developed with this purpose in mind, including a population definition for the intraclass kappa [27], a flexible latent variable model proposed by Williamson and Manatunga [23] for ordinal classifications, and more recently, a generalized linear mixed model for binary classifications [28] designed to examine agreement in a population-based setting. While Williamson and Manatunga’s approach can potentially include any number of raters, a fixed effect term is included for each individual rater, so the model is best suited to a small-to-moderate number of raters.

Clinical factors are often associated with the prevalence of a disease and may influence the reliability of diagnostic procedures. In the breast cancer setting, a mammogram showing advanced-stage breast cancer is easier to classify than one showing a less advanced cancer, leading to better agreement between raters. Prior knowledge of a subject’s clinical history could also influence a rater’s perception of the mammogram and consequently their classification. For example, in the classification of mammograms, the age of the woman is an important factor because the prevalence of breast cancer increases with age [29]. Other factors such as a rater’s level of experience and type of training could also affect the levels of agreement present.

The earlier paper by Nelson and Edwards [28] proposed a simple population-based agreement model and summary statistic focused on inference regarding agreement in a diagnostic procedure for large numbers of raters and subjects (assumed randomly sampled from their populations), where each rater in the sample is assumed to rate each subject independently of the other raters. That paper concentrated on defining general concepts for measuring reliability and on showing how the class of generalized linear mixed models provides a suitable framework in such a setting, with attention focused on the more theoretical aspects of the simplest proposed model under probit and logit link functions.

In this paper we extend the simple population-based agreement model and summary measure of agreement proposed by Nelson and Edwards [28] by incorporating important covariate information into the model, so that agreement can be measured between subgroups of raters and/or subjects, for example comparing agreement among experienced raters with that among less experienced raters. Such models can lead to the identification of factors that influence agreement and hence the reliability of the diagnostic procedure, since consistency between raters is an important prerequisite for a reliable procedure. In agreement models, the inclusion of covariate information requires careful consideration because a covariate may influence both the prevalence of the disease under study and the agreement process; handled properly, it yields valuable information about the factors that influence both how a rater classifies a subject and the agreement between the raters.

The remainder of the paper is organized as follows. Section 2 introduces the basic population-based agreement model described in Nelson and Edwards [28] and presents the extensions to the model. In Section 3, the associated population-based summary measure of agreement and an extended version are described. Simulation studies are carried out in Section 4 to examine the effects of including important factors in the proposed model and in the agreement summary statistic described in Sections 2 and 3. Application to a breast cancer study involving the classification of a large set of mammograms is presented in Section 5. Section 6 contains concluding remarks and discussion.

2. Models and measures of agreement

A natural choice for modeling agreement data when the underlying populations of raters and subjects and the diagnostic procedure are of interest is the class of generalized linear mixed models with a crossed random effect structure. We restrict our attention here to classifications made on a binary scale, for example, yij = 1 if the ith subject is rated as positive (for example, diseased) by the jth rater, and yij = 0 otherwise. It is assumed that the raters and the subjects are randomly selected from their respective populations, and that each of the J raters classifies all of the subjects under study, independently of the other J − 1 raters. Although binary outcomes can be modeled using a generalized linear mixed model with either a probit or logit link function, among others, we focus here on the probit link function for mathematical convenience; nearly identical results are obtained with the logit link function [28]. The basic form of the agreement model is

\Phi^{-1}(p_{ij}) = \eta + u_i + v_j, \qquad (1)

for subject i = 1, …, I and rater j = 1, …, J. The quantity pij = Pr(yij = 1) is the probability of the ith item being classified as positive by the jth rater, and the constant η is the intercept term of the model, which can also be regarded as a measure of the prevalence of positives in the data: when η is large, the overall frequency of positives in the data is high. The terms ui and vj represent random effects for the ith subject and jth rater, respectively, assumed to follow independent Normal(0, σ_u²) and Normal(0, σ_v²) distributions. A subject with a positive (negative) random effect is more (less) likely than other subjects to be classified as positive over many raters, and a large value of σ_u² is indicative of subjects that are easy to distinguish from one another. A rater with a positive (negative) random effect is more liberal (cautious) than other raters in classifying subjects as positive, and a large value of σ_v² suggests more variability within the population of raters in how they classify the subjects.
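
To make the roles of the two random effects concrete, the short R sketch below (our own illustrative code, not taken from the paper; parameter values are arbitrary) simulates binary ratings from model (1) and shows how the crossed subject and rater effects generate the classifications.

# Illustrative simulation from model (1): Phi^{-1}(p_ij) = eta + u_i + v_j
# (a sketch under assumed parameter values, not the authors' code)
set.seed(1)
I <- 50; J <- 50
eta <- 1                           # intercept, related to the prevalence of positives
sigma2u <- 1; sigma2v <- 1         # subject and rater random-effect variances
u <- rnorm(I, 0, sqrt(sigma2u))    # subject effects u_i
v <- rnorm(J, 0, sqrt(sigma2v))    # rater effects v_j
p <- pnorm(eta + outer(u, v, "+")) # I x J matrix of p_ij = Pr(y_ij = 1)
y <- matrix(rbinom(I * J, 1, p), I, J)   # binary classifications y_ij
# Raters tend to agree on subjects whose u_i is far from zero:
mean(y[which.max(u), ]); mean(y[which.min(u), ])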

2.1. Inclusion of covariates

Many agreement models, including log-linear and logistic regression models, use an agreement index as the response, where 0 = two raters disagree and 1 = two raters agree in their classification of a single item [17, 25, 26, 30]. In the generalized linear mixed model considered here, and in Williamson and Manatunga’s agreement model [23], the response variable yij is instead the binary classification made on the ith subject by the jth rater. When a covariate is included in the model, careful consideration must therefore be given to whether it is likely to directly affect the prevalence, in which case it is included as a fixed effect, or more likely to influence the agreement between the raters, in which case it is included as a random effect term. Covariates that influence both the prevalence and the agreement can be included as both fixed and random effects.

Earlier studies [5, 7, 9, 10] demonstrated that certain factors, including a radiologist’s average volume of mammogram interpretations, a woman’s age and time since her previous mammogram, and a physician’s length of practice and type of training, may influence how a rater classifies an item and may affect the agreement between the raters. For example, raters who carry out more classifications on a routine basis may develop more experience and accuracy in reading mammograms, leading to a higher rate of agreement between experienced raters. Factors related to the subject (for example, stage of cancer) would lead to using a test in some clinical populations and not in others, whereas factors related to the raters would change the form of the test in the same population. The basic agreement model in equation (1) is extended to

\Phi^{-1}(p_{ij}) = \eta + \boldsymbol{\beta}_1'\mathbf{x}_i + \boldsymbol{\beta}_2'\mathbf{x}_j + \mathbf{z}_1'\mathbf{u}_i + \mathbf{z}_2'\mathbf{v}_j, \qquad i = 1, \ldots, I, \; j = 1, \ldots, J, \qquad (2)

where xi (r × 1) is the vector of covariates associated with the ith subject and xj (s × 1) is the vector of covariates associated with the jth rater. The associated vectors of fixed-effect parameters are β1 (r × 1) and β2 (s × 1), respectively. The vectors z1 (p × 1) and z2 (q × 1) are the design vectors for the random effect vectors ui (p × 1) and vj (q × 1), respectively, where ui and vj contain the random effects associated with the subjects and the raters. The random effects are assumed to follow multivariate normal distributions

\mathbf{u}_i \sim \mathrm{MVN}(\mathbf{0}, \Sigma_u) \quad \text{and} \quad \mathbf{v}_j \sim \mathrm{MVN}(\mathbf{0}, \Sigma_v),

where the covariance matrices Σu and Σv are of dimensions p × p and q × q, respectively. More complex random effect structures can be employed if required and if sufficient data are available. The random effect vectors ui and vj can be predicted to describe the unique effects of each rater and subject included in the study.

3. A population-based measure of agreement

Cohen’s kappa is a very popular summary measure of agreement owing to its appealing simplicity of calculation and interpretation. In this section, we describe and extend a simple population-based summary measure of agreement introduced in Nelson and Edwards [28] to include covariate information associated with the raters and subjects, with the aim of improving the reliability of a diagnostic test and of determining the influence of these covariates on the prevalence and on the agreement. We demonstrate how the extended summary statistic can be used to compare agreement between subgroups of raters and/or subjects, for example the amount of agreement among experienced versus inexperienced raters. The summary statistic is comparable to Cohen’s kappa in style and interpretation, avoids many of the weaknesses observed in the use of Cohen’s kappa, and allows for population-based inference. An advantage of the model-based kappa statistic is that all available data are used to estimate the parameters, even when agreement between subgroups is of interest. A general overall measure of agreement based upon all the raters and subjects included in a study can be obtained by fitting the simplest form of the generalized linear mixed model presented in equation (1).

The simplest form of the model-based kappa measure of agreement as introduced in Nelson and Edwards [28] is

\kappa_m = 1 - 4\int_{-\infty}^{\infty} \Phi\!\left(\frac{z\sqrt{\rho}}{\sqrt{1-\rho}}\right)\left[1 - \Phi\!\left(\frac{z\sqrt{\rho}}{\sqrt{1-\rho}}\right)\right]\phi(z)\,dz, \qquad (3)

where ρ = σ_u²/(σ_u² + σ_v² + 1) and 0 ≤ κm ≤ 1; the maximum likelihood estimates of the variances σ_u² and σ_v² are obtained from the corresponding generalized linear mixed model in equation (1). The numerical value of the summary statistic can be interpreted in a similar manner to Cohen’s kappa, where a value close to 1 is suggestive of strong agreement, following guidelines such as those given in Landis and Koch [31]. The approximate asymptotic variance of κ̂m is derived using the multivariate delta theorem, and is estimated as

\widehat{\mathrm{var}}(\hat{\kappa}_m) \approx 16\left[\left\{\int_{-\infty}^{\infty} \frac{1}{2\hat{\rho}(1-\hat{\rho})}\left(\frac{z\sqrt{\hat{\rho}}}{\sqrt{1-\hat{\rho}}}\right)\phi\!\left(\frac{z\sqrt{\hat{\rho}}}{\sqrt{1-\hat{\rho}}}\right)\left[1 - 2\Phi\!\left(\frac{z\sqrt{\hat{\rho}}}{\sqrt{1-\hat{\rho}}}\right)\right]\phi(z)\,dz\right\}^2 \times \left[\left(\frac{\hat{\sigma}_v^2\hat{\sigma}_w^2}{(\hat{\sigma}_T^2)^2}\right)\left(\frac{2\hat{\sigma}_u^4}{I}\right) + \left(\frac{\hat{\sigma}_u^2}{(\hat{\sigma}_T^2)^2}\right)^2\left(\frac{2\hat{\sigma}_v^4}{J}\right)\right]\right].
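
As an illustration of how equation (3) can be evaluated in practice, the following R sketch (our own code, with hypothetical function and argument names) computes κm by one-dimensional numerical integration from estimates of σ_u² and σ_v².

# Sketch: model-based kappa of equation (3) from estimated variance components
kappam_basic <- function(sigma2u, sigma2v) {
  rho <- sigma2u / (sigma2u + sigma2v + 1)
  integrand <- function(z) {
    a <- z * sqrt(rho) / sqrt(1 - rho)
    pnorm(a) * (1 - pnorm(a)) * dnorm(z)
  }
  1 - 4 * integrate(integrand, lower = -Inf, upper = Inf)$value
}
kappam_basic(1, 1)   # example: sigma_u^2 = sigma_v^2 = 1 gives approximately 0.22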

3.1. Inclusion of covariates

In equation (4), we present a measure of agreement, extended from the original κm in equation (3), between two raters j and j′ each classifying the ith subject, which accounts for the fixed-effects vector xj, where the two raters may have different values for at least one of the covariates of interest contained in xj. For example, κm in equation (4) provides a summary measure of agreement between two randomly chosen raters who differ in their amount of experience in reading mammograms.

\kappa_m = 1 - 4\int_{-\infty}^{\infty} \Phi\!\left(\frac{z\sqrt{\rho} + \boldsymbol{\beta}_1'\mathbf{x}_i + \boldsymbol{\beta}_2'\mathbf{x}_j}{\{\sigma_T^2(1-\rho)\}^{1/2}}\right)\left[1 - \Phi\!\left(\frac{z\sqrt{\rho'} + \boldsymbol{\beta}_1'\mathbf{x}_i + \boldsymbol{\beta}_2'\mathbf{x}_{j'}}{\{\sigma_{T'}^2(1-\rho')\}^{1/2}}\right)\right]\phi(z)\,dz \qquad (4)

where

\rho = \sigma_1^2/(\sigma_1^2 + \sigma_2^2 + 1) \quad \text{and} \quad \rho' = \sigma_1^2/(\sigma_1^2 + \sigma_{2'}^2 + 1). \qquad (5)

The terms σ_1² and σ_2² represent the variances of the sums of the random effect components z1′ui and z2′vj for the ith subject and jth rater, respectively, that is, the variability associated with the subject and rater random effects, with σ_1² = var(z1′ui) = z1′Σu z1 and σ_2² = var(z2′vj) = z2′Σv z2. Similarly, σ_2′² = var(z2′vj′) is the corresponding variance for the j′th rater, whose design vector may contain different covariate values. The quantity σ_T² = σ_1² + σ_2² + 1 is a measure of the total variability present in the model (T for total), given the covariate values of the ith subject and the jth rater; similarly, σ_T′² is the total variability in the model given the covariate values of the ith subject and the j′th rater. In equation (5) an extended version of ρ is presented: it is defined as a measure of subject distinguishability relative to the variability between two raters with the same covariate information, given the ith subject’s covariate information. The term ρ′ specifies the corresponding value of ρ for another rater j′ who may have a different set of covariate values from rater j, leading to a different value of σ_2′² and consequently of ρ′; the formula for ρ′ is also given in equation (5).
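
The quantities in equations (4) and (5) can be evaluated numerically once the model has been fitted. The R sketch below is our own illustrative implementation mirroring equation (4) (function and argument names are hypothetical): it computes the extended κm for two raters j and j′ whose covariate values and random-effect design vectors may differ.

# Sketch: extended model-based kappa of equations (4)-(5) for raters j and j'
kappam_cov <- function(beta1, beta2, xi, xj, xjprime,
                       Sigma_u, Sigma_v, z1, z2, z2prime) {
  s2_1  <- as.numeric(t(z1) %*% Sigma_u %*% z1)             # sigma_1^2 = var(z1' u_i)
  s2_2  <- as.numeric(t(z2) %*% Sigma_v %*% z2)             # sigma_2^2 = var(z2' v_j)
  s2_2p <- as.numeric(t(z2prime) %*% Sigma_v %*% z2prime)   # sigma_2'^2 for rater j'
  rho  <- s2_1 / (s2_1 + s2_2 + 1)
  rhop <- s2_1 / (s2_1 + s2_2p + 1)
  s2_T  <- s2_1 + s2_2 + 1                                  # sigma_T^2
  s2_Tp <- s2_1 + s2_2p + 1                                 # sigma_T'^2
  fix_j  <- sum(beta1 * xi) + sum(beta2 * xj)               # beta_1'x_i + beta_2'x_j
  fix_jp <- sum(beta1 * xi) + sum(beta2 * xjprime)          # beta_1'x_i + beta_2'x_j'
  integrand <- function(z) {
    a <- (z * sqrt(rho)  + fix_j)  / sqrt(s2_T  * (1 - rho))
    b <- (z * sqrt(rhop) + fix_jp) / sqrt(s2_Tp * (1 - rhop))
    pnorm(a) * (1 - pnorm(b)) * dnorm(z)
  }
  1 - 4 * integrate(integrand, lower = -Inf, upper = Inf)$value
}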

With the inclusion of covariates, the exact form of the variance of κm is dependent upon the random effect design vectors z1 and z2 and the assumed correlation structures of the random effects. The model-based kappa statistic κm is a function of the variance components contained in Σu and Σv, and the regression coefficient vectors β1 and β2, where vector θ contains all of the individual components. The asymptotic variance of κm can then be obtained using the multivariate delta theorem, and estimated as

\widehat{\mathrm{var}}(\hat{\kappa}_m) = 16\,\mathrm{var}\!\left(\int_{-\infty}^{\infty} \Phi\!\left(\frac{z\sqrt{\hat{\rho}} + \hat{\boldsymbol{\beta}}_1'\mathbf{x}_i + \hat{\boldsymbol{\beta}}_2'\mathbf{x}_j}{\{\hat{\sigma}_T^2(1-\hat{\rho})\}^{1/2}}\right)\left[1 - \Phi\!\left(\frac{z\sqrt{\hat{\rho}'} + \hat{\boldsymbol{\beta}}_1'\mathbf{x}_i + \hat{\boldsymbol{\beta}}_2'\mathbf{x}_{j'}}{\{\hat{\sigma}_{T'}^2(1-\hat{\rho}')\}^{1/2}}\right)\right]\phi(z)\,dz\right) \qquad (6)
= \frac{16}{IJ}\left(\mathbf{h}'\Sigma\mathbf{h}\right), \quad \text{where } \mathbf{h} = \left(\frac{\partial\kappa_m}{\partial\theta_1}, \ldots, \frac{\partial\kappa_m}{\partial\theta_l}\right)' \qquad (7)

is the vector of derivatives of κm with respect to the variance components contained in the Σu and Σv matrices and β1 and β2, and Σ (l × l) is the variance–covariance matrix of all the variance components and β contained in θ. An application of this equation is presented in the Appendix.
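
When the derivatives in h are tedious to obtain analytically, a numerical gradient can be used instead. The R sketch below (our own illustrative code; kappam_fn is a hypothetical function mapping the parameter vector θ to κm) applies the generic delta method, taking Σ to be the estimated variance–covariance matrix of θ̂, in the same spirit as the variance calculation of Table V.

# Sketch: delta-method variance of kappa_m using a central-difference gradient
delta_var_kappam <- function(kappam_fn, theta_hat, Sigma, eps = 1e-5) {
  l <- length(theta_hat)
  h <- numeric(l)
  for (k in seq_len(l)) {          # numerical d kappa_m / d theta_k
    up <- theta_hat; dn <- theta_hat
    up[k] <- up[k] + eps
    dn[k] <- dn[k] - eps
    h[k] <- (kappam_fn(up) - kappam_fn(dn)) / (2 * eps)
  }
  as.numeric(t(h) %*% Sigma %*% h) # h' Sigma h
}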

4. Simulation studies

Simulation studies were carried out to examine the effects of including factors of interest on the estimation in the models and measures of agreement described. The simulations were based upon three probit generalized linear mixed models of increasing complexity as follows:

Model (a): \Phi^{-1}(p_{ij}) = \eta + u_i + v_j,
Model (b): \Phi^{-1}(p_{ij}) = \eta + \beta x_j + u_i + v_j,
Model (c): \Phi^{-1}(p_{ij}) = \eta + \beta x_j + u_i + v_j + z_2 v_{1j},

where i = 1, …, I, j = 1, …, J, with I and J each set at 50. The term η is the intercept, and the response variable yij represents the classification made by the jth rater on the ith subject, equaling 0 for a subject classified as not diseased and 1 otherwise. The random effects ui and vj in models (a) and (b) are assumed to be normally distributed as N(0, σ_u²) and N(0, σ_v²), respectively. The additional random effect v1j in model (c) is associated with a factor z2 likely to influence the agreement between the raters, such as a rater’s level of experience (z2 = 1 for an experienced rater, and 0 otherwise). For simulation purposes, z2j is randomly generated from a Bin(n = 1, p = 0.5) distribution, j = 1, …, J, and the random effect term v1j is also assumed to be normally distributed, with variance σ_{v1}², and correlated with the other rater random effect vj. The covariance matrix for model (c) thus takes the form

\begin{pmatrix} u_i \\ v_j \\ v_{1j} \end{pmatrix} \sim \mathrm{MVN}\left(\mathbf{0}, \begin{bmatrix} \sigma_u^2 & 0 & 0 \\ 0 & \sigma_v^2 & \rho_v\sigma_v\sigma_{v_1} \\ 0 & \rho_v\sigma_v\sigma_{v_1} & \sigma_{v_1}^2 \end{bmatrix}\right). \qquad (8)

In models (b) and (c), xj represents a fixed covariate value for the jth rater (for example, the experience level of the jth rater: 1 = high, 0 = low). In the simulations, xj is randomly generated from a Bin(1, 0.5) distribution. Two different values for the intercept, η = 1 and 3, were included in the simulations, and the regression coefficient β was set at 0.5. Different values of the variance components were also included to assess the effects of increasing variability: for all three models, (σ_u², σ_v², σ_{v1}²) was set at (1, 1, 1) and at (5, 5, 5), and for model (c), the correlation ρ_v between the random effects vj and v1j was set at 0.25. Because of the computational intensity involved, the number of simulations was restricted to the fitting of 100 randomly generated data sets for each simulation scenario.
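
The data-generating step for the most complex setting, model (c), can be sketched as follows in R (our own illustrative code, not the authors' simulation program): it draws the correlated rater effects (vj, v1j) from the bivariate normal distribution implied by (8) and then generates the binary ratings.

# Sketch: generate one simulated data set from model (c)
# Phi^{-1}(p_ij) = eta + beta*x_j + u_i + v_j + z_2j*v_1j
set.seed(2)
I <- 50; J <- 50
eta <- 1; beta <- 0.5
sigma2u <- 1; sigma2v <- 1; sigma2v1 <- 1; rhov <- 0.25
xj  <- rbinom(J, 1, 0.5)                 # rater covariate (e.g. experience)
z2j <- rbinom(J, 1, 0.5)                 # indicator attached to the extra rater effect
u   <- rnorm(I, 0, sqrt(sigma2u))        # subject effects
Sv  <- matrix(c(sigma2v, rhov*sqrt(sigma2v*sigma2v1),
                rhov*sqrt(sigma2v*sigma2v1), sigma2v1), 2, 2)
vmat <- MASS::mvrnorm(J, mu = c(0, 0), Sigma = Sv)   # columns: v_j and v_1j
linpred <- eta + outer(u, beta*xj + vmat[, 1] + z2j*vmat[, 2], "+")
y <- matrix(rbinom(I*J, 1, pnorm(linpred)), I, J)    # I x J matrix of ratings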

Each data set was fitted using a Monte-Carlo expectation–maximization algorithm (MCEM) developed by McCulloch [32] (see also Kuk and Cheng [33]) to obtain almost-exact maximum likelihood estimates of the parameters in the generalized linear mixed model. A description of this algorithm is provided in Nelson and Edwards [28].
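
The MCEM algorithm itself is not reproduced here. As a rough, approximate alternative for readers who wish to experiment (a Laplace-approximation fit via the lme4 package, not the almost-exact MCEM of McCulloch [32], and fitting only a model (b)-type specification without the correlated extra rater effect), the crossed random-effects probit model can be fitted as sketched below, using the simulated objects y, xj, I and J from the previous sketch.

# Sketch: approximate (Laplace) fit of a crossed random-effects probit model with lme4
library(lme4)
dat <- data.frame(y       = as.vector(y),                # column-major: subject index varies fastest
                  subject = factor(rep(1:I, times = J)),
                  rater   = factor(rep(1:J, each  = I)),
                  x       = rep(xj, each = I))           # rater covariate expanded to long format
fit <- glmer(y ~ x + (1 | subject) + (1 | rater),
             data = dat, family = binomial(link = "probit"))
print(VarCorr(fit))    # approximate estimates of sigma_u^2 and sigma_v^2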

Starting values of the parameters in the vector θ were set at η = 0.5 and β = 0.05, and the random effect parameters σ_u², σ_v², σ_{v1}² and ρ_v were all set at 0.5 for simplicity. Other sets of starting values were also tested, and resulted in the same estimated parameters in each case. The convergence criterion within the algorithm was that the maximum absolute change in any of the K estimated parameters between successive iterations, max_k |θ_k^(t) − θ_k^(t−1)|, be less than 0.0001, where k = 1, …, K indexes the parameters contained in θ.

The model-based kappa statistic was calculated for each individual data set, based upon the estimated values of the parameters from the corresponding generalized linear mixed model, and for model (a), a version of Cohen’s kappa for multiple raters [31] was calculated.

Tables I and II display the mean estimates of the parameters and their associated standard errors from the simulation studies described above. We observe that the parameters η and β are estimated on average with little bias in each of the 12 simulation scenarios examined. Similarly, the almost-exact maximum likelihood mean estimates of the variance components σ_u² and σ_v² are nearly unbiased, although a slight increase in bias is observed for the variance components of model (c). The mean estimates of the correlation coefficient ρ_v are underestimated in every simulation scenario for model (c), sometimes severely so. In the simplest model (a), the mean estimated Cohen’s kappa is only slightly lower in value than the corresponding model-based kappa statistic when the prevalence term η = 1; however, the mean estimated Cohen’s kappa is noticeably smaller than the mean estimated model-based kappa for η = 3 (Cohen’s κ̂ = 0.09, κ̂m = 0.21), although the associated standard errors are large enough to suggest that there may be no significant difference between these two estimated measures of agreement.

Table I.

Simulation results for the probit generalized linear mixed model and κm based upon 100 data sets for I = 50 and J = 50 for two different sets of parameter values θ = (η, β, σ_u², σ_v², σ_{v1}², ρ_v).

Parameter   True value   Model (a)      Model (b)      Model (c)
(i) η = 1, σ_u² = σ_v² = σ_{v1}² = 1
η           1            0.95 (0.17)    0.99 (0.24)    1.02 (0.36)
β           0.5                         0.46 (0.25)    0.48 (0.41)
σ_u²        1            0.97 (0.20)    0.94 (0.26)    0.94 (0.21)
σ_v²        1            1.02 (0.23)    1.09 (0.31)    0.97 (0.32)
σ_{v1}²     1                                          1.37 (0.46)
ρ_v         0.25                                       −0.0004 (0.001)
κ                        0.19 (0.04)
κm                       0.21 (0.03)
(i) η = 1, σ_u² = σ_v² = σ_{v1}² = 5
η           1            0.89 (0.29)    0.71 (0.48)    1.049 (0.50)
β           0.5                         0.38 (0.67)    0.43 (0.66)
σ_u²        5            4.90 (0.84)    4.65 (0.98)    4.66 (0.92)
σ_v²        5            5.26 (1.26)    5.09 (1.36)    4.28 (0.82)
σ_{v1}²     5                                          6.64 (2.24)
ρ_v         0.25                                       0.0004 (0.005)
κ                        0.28 (0.05)
κm                       0.29 (0.04)

The three models fitted are: (a) Φ^{-1}(pij) = η + ui + vj, (b) Φ^{-1}(pij) = η + βxj + ui + vj and (c) Φ^{-1}(pij) = η + βxj + ui + vj + z2v1j; xj ~ Bin(1, 0.5), z2 ~ Bin(1, 0.5). Cohen’s kappa = κ and model-based kappa = κm. Mean parameter estimates are presented with associated standard errors in parentheses.

Table II.

Simulation results for the probit generalized linear mixed model and model-based kappa statistic κm based upon 100 data sets for I = 50 and J = 50 using two different sets of parameter values θ = (η, β, σ_u², σ_v², σ_{v1}², ρ_v).

Parameter   True value   Model (a)      Model (b)      Model (c)
(ii) η = 3, σ_u² = σ_v² = σ_{v1}² = 1
η           3            2.95 (0.25)    3.05 (0.41)    3.02 (0.41)
β           0.5                         0.51 (0.40)    0.42 (0.08)
σ_u²        1            1.00 (0.37)    1.05 (0.30)    0.76 (0.30)
σ_v²        1            1.09 (0.38)    1.14 (0.36)    1.3 (0.52)
σ_{v1}²     1                                          1.25 (0.94)
ρ_v         0.25                                       −0.004 (0.002)
κ                        0.09 (0.04)
κm                       0.21 (0.05)
(ii) η = 3, σ_u² = σ_v² = σ_{v1}² = 5
η           3            2.93 (0.33)    2.87 (0.49)    2.71 (0.29)
β           0.5                         0.45 (0.56)    0.54 (0.31)
σ_u²        5            5.00 (1.12)    4.44 (1.04)    3.99 (0.56)
σ_v²        5            5.39 (1.51)    4.34 (0.51)    4.44 (1.59)
σ_{v1}²     5                                          5.21 (3.25)
ρ_v         0.25                                       0.003 (0.01)
κ                        0.24 (0.06)
κm                       0.29 (0.05)

The three models fitted are: (a) Φ^{-1}(pij) = η + ui + vj, (b) Φ^{-1}(pij) = η + βxj + ui + vj and (c) Φ^{-1}(pij) = η + βxj + ui + vj + z2v1j; xj ~ Bin(1, 0.5), z2 ~ Bin(1, 0.5). Cohen’s kappa = κ and model-based kappa = κm. Mean parameter estimates are presented with associated standard errors in parentheses.

5. Application to a breast cancer study

An agreement study was carried out by Beam et al. [9] in which 148 randomly selected mammograms were classified by a large number of physicians randomly selected from a group of 294 physicians from the USA. The mammograms included both diseased and non-diseased cases. Data on a number of covariates, including the subject’s age, the number of mammograms read in the previous year by each rater and the number of years of experience of the raters, were collected. The subjects’ ages ranged from 40 to 85 years. The classifications were made using the BIRADS scale and have been dichotomized here so that an outcome of 0 represents a mammogram classified as non-diseased, and 1 as diseased. Full details on the data collection can be found in Beam et al. [9]. Data on 104 randomly sampled physicians were analyzed using the models and measure of agreement described in Sections 2 and 3, and a summary of the pairwise agreement is presented in Table III. The age of the woman from whom each mammogram was taken was included as an indicator variable xi, where xi = 1 for a subject less than or equal to 60 years of age. The level of experience of a physician was included as an indicator variable z2, equal to 1 if a physician had 10 or more years of experience of rating mammograms and 0 otherwise. These are two examples of models that can be fitted to this data set; other models could also be feasible.

Table III.

Summary of the pairwise agreement between 104 randomly selected physicians each independently classifying 148 slides for the presence (yij = 1) or absence (yij = 0) of breast cancer [9].

                              Physician B
Physician A        Non-diseased    Diseased    Total
Non-diseased       460 951         64 531      525 482
Diseased           74 467          192 739     267 206
Total              535 418         257 270     792 688

Two generalized linear mixed models with a probit link function were fitted to this data set using McCulloch’s MCEM algorithm; one model (i) to assess the overall agreement present between all the raters and subjects included in the study, and a second model (ii) to assess agreement after accounting for a subject’s age and each rater’s level of experience. These models are

Model (i): \Phi^{-1}(p_{ij}) = \eta + u_i + v_j,
Model (ii): \Phi^{-1}(p_{ij}) = \eta + \beta x_i + u_i + v_j + z_2 v_{1j}.

Details on the forms of ρv, var(κ̂m) and the likelihood function L(θ, y) for model (ii) are presented in the Appendix.

Table IV presents the parameter estimates and the model-based and Cohen’s kappa statistics for these two models. We observe a negative intercept for both models, reflecting the fact that over half of the mammograms were classified as not having cancer present. The negative regression coefficient for subject’s age suggests that the odds in favor of a younger patient (less than 60 years) being classified as having a diseased mammogram are approximately 45 per cent of those of an older patient (over 60 years old). The variability observed between the subjects (σ_u²) is larger than the variability observed between the raters (σ_v²). Cohen’s kappa for overall agreement between all raters was estimated at κ̂ = 0.60, whereas the model-based kappa was κ̂m = 0.53 (s.e. = 0.08). The agreement between highly experienced raters classifying mammograms of younger women (z2 = 1 and xi = 1) is κ̂m = 0.53 (s.e. = 0.07), whereas the chance-corrected agreement between less experienced raters is estimated as κ̂m = 0.50 (s.e. = 0.08), suggesting little difference between these two estimated measures of agreement.

Table IV.

Results for the breast cancer data set.

Parameter                  Model (i)        Model (ii)
η                          −0.83 (0.15)     −0.13 (0.02)
β                                           −0.37 (0.03)
σ_u²                       3.54 (0.45)      3.45 (0.40)
σ_v²                       0.25 (0.04)      0.25 (0.03)
σ_{v1}²                                     0.24 (0.003)
ρ_v                                         0.001 (0.01)
Cohen’s kappa κ            0.60
Model-based kappa κm       0.53 (0.08)
κm(z2 = 0, xi = 1)                          0.50 (0.08)
κm(z2 = 1, xi = 1)                          0.53 (0.07)

The two models fitted are: (i) Φ^{-1}(pij) = η + ui + vj and (ii) Φ^{-1}(pij) = η + βxi + ui + vj + z2v1j; xi is an indicator variable for the ith subject’s age (1 for subjects less than or equal to 60 years of age, 0 for subjects greater than 60 years of age). The term z2 = 0 for an inexperienced rater, and 1 for an experienced rater. Parameter estimates are presented with standard errors in parentheses.

6. Discussion

In this paper, we have extended a model-based approach and measure of agreement appropriate for large studies that aim to improve diagnostic tests by examining the reliability between many raters, each classifying the same set of many subjects. This population-based approach, which is based upon the class of generalized linear mixed models, allows for inference regarding the underlying diagnostic procedure, and for conclusions to be drawn regarding the populations of raters and subjects who are typically involved in the medical testing procedure of interest.

Extending the basic model and measure of agreement presented by Nelson and Edwards [28] to include covariate information that may affect both the prevalence of the disease and the agreement process yields valuable information about the roles of raters and subjects. For example, the significance of the level of experience of raters when classifying test items, such as mammograms, can be more closely examined. Other information can also be easily incorporated in this modeling approach, such as factors related to the subject’s clinical history. The use of generalized linear mixed models allows for missing observations and unbalanced data, and easily incorporates large numbers of raters and subjects, unlike most other methods for assessing agreement. Specific details regarding the performance of individual raters included in the study can also be obtained easily by estimating the random effects.

Obtaining almost-exact maximum likelihood estimates of the parameters in the generalized linear mixed models requires the use of a computationally intensive algorithm, such as McCulloch’s MCEM algorithm [32] or an equivalent [33]. At present, software packages do not have the capacity to obtain almost-exact maximum likelihood estimates for generalized linear mixed models with crossed random effects structures. The model-based kappa statistic can be calculated using standard software such as R and SAS; a computer function written in the freely available software package R is included in Table V.

Table V.

Function in R for calculating the two-rater model-based kappa statistic and its variance.

twokappafn = function(sigma2u, beta1)
{
  # Probability that two raters both classify a subject as positive
  integrand1 = function(z)
  {
    term1 = pnorm(sqrt(sigma2u)*z - beta1)
    term2 = pnorm(sqrt(sigma2u)*z + beta1)
    term1*term2*dnorm(z)
  }
  result1 = integrate(integrand1, lower=-100, upper=100)
  res1 = result1$value
  # Probability that two raters both classify a subject as negative
  integrand2 = function(z)
  {
    term3 = (1 - pnorm(sqrt(sigma2u)*z - beta1))
    term4 = (1 - pnorm(sqrt(sigma2u)*z + beta1))
    term3*term4*dnorm(z)
  }
  result2 = integrate(integrand2, lower=-100, upper=100)
  res2 = result2$value
  # two-rater model-based kappa
  2*(res1 + res2) - 1
}

# Calculation of variance: need to enter values for varbeta1, varsigmasqu, sigmasqu and beta1.
# Derivatives of the two agreement probabilities with respect to sigmasqu
integrand1a = function(z)
{
  (0.5*(1/sqrt(sigmasqu))*dnorm(sqrt(sigmasqu)*z + beta1/2)*pnorm(sqrt(sigmasqu)*z - beta1/2) +
   0.5*(1/sqrt(sigmasqu))*pnorm(sqrt(sigmasqu)*z + beta1/2)*dnorm(sqrt(sigmasqu)*z - beta1/2))*dnorm(z)*z
}
result1a = integrate(integrand1a, lower=-100, upper=100)
integrand1b = function(z)
{
  (-0.5*(1/sqrt(sigmasqu))*dnorm(sqrt(sigmasqu)*z + beta1/2)*(1 - pnorm(sqrt(sigmasqu)*z - beta1/2)) -
   0.5*(1/sqrt(sigmasqu))*(1 - pnorm(sqrt(sigmasqu)*z + beta1/2))*dnorm(sqrt(sigmasqu)*z - beta1/2))*dnorm(z)*z
}
result1b = integrate(integrand1b, lower=-100, upper=100)
vectorh1 = result1a$value + result1b$value
# Derivatives of the two agreement probabilities with respect to beta1
integrand2a = function(z)
{
  (0.5*dnorm(sqrt(sigmasqu)*z + beta1/2)*pnorm(sqrt(sigmasqu)*z - beta1/2) -
   0.5*pnorm(sqrt(sigmasqu)*z + beta1/2)*dnorm(sqrt(sigmasqu)*z - beta1/2))*dnorm(z)
}
result2a = integrate(integrand2a, lower=-100, upper=100)
integrand2b = function(z)
{
  (-0.5*dnorm(sqrt(sigmasqu)*z + beta1/2)*(1 - pnorm(sqrt(sigmasqu)*z - beta1/2)) +
   0.5*(1 - pnorm(sqrt(sigmasqu)*z + beta1/2))*dnorm(sqrt(sigmasqu)*z - beta1/2))*dnorm(z)
}
result2b = integrate(integrand2b, lower=-100, upper=100)
vectorh2 = result2a$value + result2b$value
covarmat = matrix(c(varsigmasqu, 0, 0, varbeta1), ncol=2)
vectorh = c(vectorh1, vectorh2)
varkappam = 4*(vectorh %*% covarmat %*% vectorh)
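
As a brief usage illustration (with made-up input values, not estimates from the paper), the function can be called directly, and the variance portion of Table V can then be run once the four required quantities have been placed in the workspace.

# Hypothetical example values for illustration only
twokappafn(sigma2u = 1, beta1 = 0.5)        # two-rater model-based kappa
sigmasqu <- 1; beta1 <- 0.5                 # inputs assumed by the variance script
varsigmasqu <- 0.04; varbeta1 <- 0.01       # (made-up) variances of the two estimates
# ...then run the variance calculation of Table V to obtain varkappam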

As is the case for any class of models where the outcomes of interest are binary, larger data sets are required when more factors are to be included into the model, since more parameters are required to be estimated. This also ensures successful convergence of the MCEM algorithm in the parameter estimation process.

It is assumed in the current model setting that the random effects of the subjects and raters are normally distributed. Previous research [34] has shown that, for a Poisson generalized linear mixed model, fixed effect parameters are estimated with little or no bias, whereas variance components are estimated with more bias and variability, when non-normal random effects are modeled assuming normality.

Further extensions to the model include multiple independent classifications of each item by the individual raters. This can be useful when we wish to examine the variability of ratings made by any rater for a particular item, also known as ‘intra-rater’ variability. This can be flexibly included in the framework of the generalized linear mixed model and summary statistic proposed, and is a topic for future research. Other future directions for extending this class of models in the assessment of agreement include development of a model to allow for ordinal classification scales, and to investigate the inclusion of non-normal random effect distributions for the raters and items.

Acknowledgments

The authors are grateful for the support provided by the United States National Institutes of Health grant R03CA114783-01A1. We thank Dr Craig Beam for kindly providing us with the breast cancer data set. We also gratefully acknowledge helpful advice from Charles McCulloch, Patrick Graham, Robert Best and the comments from the editors and the referees.

Contract/grant sponsor: United States National Institutes of Health; contract/grant number: R03CA114783-01A1

APPENDIX A

DETAILS ON MODEL FITTING FOR APPLICATION

For model (i), the random effects ui and vj, i = 1, …, I and j = 1, …, J, are assumed to be normally distributed with zero means and variances σ_u² and σ_v², respectively. The random effects in model (ii) are assumed to be distributed as in Section 5. For model (ii), the term ρ takes the form

\rho = \frac{\sigma_1^2}{\sigma_2^2 + \sigma_1^2 + 1} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_v^2 + z_2^2\sigma_{v_1}^2 + 2z_2\rho_v\sigma_v\sigma_{v_1} + 1}

and σ_T² = σ_u² + σ_v² + z_2²σ_{v1}² + 2z_2ρ_vσ_vσ_{v1} + 1. The statistic κ̂m is calculated using the estimated quantities σ̂_u², σ̂_v², σ̂_{v1}² and ρ̂_v from the fitted model. To calculate the variance for κm, let θ = (σ_u², σ_v², σ_{v1}², ρ_v, β). Since κm is a function of θ, the multivariate delta theorem can be applied directly. The variance of κ̂m, var(κ̂m), can be obtained using equation (7), where the vector h takes the following form:

\mathbf{h} = \left(\frac{\partial\kappa_m}{\partial\sigma_u^2}, \frac{\partial\kappa_m}{\partial\sigma_v^2}, \frac{\partial\kappa_m}{\partial\sigma_{v_1}^2}, \frac{\partial\kappa_m}{\partial\rho_v}, \frac{\partial\kappa_m}{\partial\beta}\right)'.

The matrix Σ is of dimension 5 × 5 since the vector θ contains five elements, and takes the form

\Sigma = \left(E\left[-\frac{\partial^2\log L}{\partial\theta\,\partial\theta'}\right]\right)^{-1},

where the likelihood function of interest here is

L(\theta; \mathbf{y}) = \prod_{i=1}^{I}\prod_{j=1}^{J} p_{ij}^{y_{ij}}(1-p_{ij})^{1-y_{ij}}\,\frac{1}{2\pi\sigma_v\sigma_{v_1}\sqrt{1-\rho_v^2}}\exp\left\{-\frac{1}{2(1-\rho_v^2)}\left(\frac{v_j^2}{\sigma_v^2} + \frac{v_{1j}^2}{\sigma_{v_1}^2} - \frac{2\rho_v v_j v_{1j}}{\sigma_v\sigma_{v_1}}\right)\right\} \times \frac{1}{\sqrt{2\pi\sigma_u^2}}\exp\left\{-\frac{u_i^2}{2\sigma_u^2}\right\}

with pij = Φ(η+βxi+ui+vj+z2v1j).

References

1. Elmore JG, Wells CK, Lee CH, Howard DH, Feinstein AR. Variability in radiologists’ interpretations of mammograms. The New England Journal of Medicine. 1994;331(22):1493–1499. doi: 10.1056/NEJM199412013312206.
2. Yerushalmy J. The importance of observer error in the interpretation of photofluorograms and the value of multiple readings. International Tuberculosis Year Book. 1956;24:110–124.
3. Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 1960;20:37–46.
4. Holmquist ND, McMahan CA, Williams OD. Variability in classification of carcinoma in situ of the uterine cervix. Archives of Pathology. 1967;84:334–345.
5. Sickles EA, Miglioretti DL, Ballard-Barbash R, Geller BM, Leung JWT, Rosenberg RD. Performance benchmarks for diagnostic mammography. Radiology. 2005;235:775–790. doi: 10.1148/radiol.2353040738.
6. Beam CA, Conant EF, Sickles EA. Factors affecting radiologist inconsistency in screening mammography. Academic Radiology. 2002;9:531–540. doi: 10.1016/s1076-6332(03)80330-6.
7. Miglioretti DL, Smith-Bindman R, Abraham L, Brenner RJ, Carney PA, Bowles EJA, Buist DSM, Elmore JG. Radiologist characteristics associated with interpretive performance of diagnostic mammography. Journal of the National Cancer Institute. 2007;99:1854–1863. doi: 10.1093/jnci/djm238.
8. Elmore JG, Carney PA. Does practice make perfect when interpreting mammography? Journal of the National Cancer Institute. 2002;94:321–323. doi: 10.1093/jnci/94.5.321.
9. Beam CA, Conant EF, Sickles EA. Association of volume and volume-independent factors with accuracy in screening mammogram interpretation. Journal of the National Cancer Institute. 2003;95:282–290. doi: 10.1093/jnci/95.4.282.
10. Allsbrook WC, Mangold KA, Johnson MH, Lane RB, Lane CG, Epstein JI. Interobserver reproducibility of Gleason grading of prostatic carcinoma. Human Pathology. 2001;32(1):81–88. doi: 10.1053/hupa.2001.21135.
11. Lipsitz SR, Laird NM, Brennan TA. Simple moment estimates of the κ-coefficient and its variance. Applied Statistics. 1994;43:309–323.
12. Barlow W. Measurement of interrater agreement with adjustment for covariates. Biometrics. 1996;52:695–702.
13. Klar N, Lipsitz SR, Ibrahim JG. An estimating equations approach for modelling kappa. Biometrical Journal. 2000;1:45–58.
14. Maclure M, Willett WC. Misinterpretation and misuse of the Kappa statistic. American Journal of Epidemiology. 1987;126(2):161–169. doi: 10.1093/aje/126.2.161.
15. Nelson JC, Pepe MS. Statistical description of interrater variability in ordinal ratings. Statistical Methods in Medical Research. 2000;9:475–496. doi: 10.1177/096228020000900505.
16. Kraemer HC, Periyakoil VS, Noda A. Kappa coefficients in medical research. Statistics in Medicine. 2002;21:2109–2129. doi: 10.1002/sim.1180.
17. Tanner MA, Young MA. Modeling agreement among raters. Journal of the American Statistical Association. 1985;80(389):175–180.
18. Agresti A. A model for agreement between ratings on an ordinal scale. Biometrics. 1988;44:539–548.
19. Goodman LA. Simple models for the analysis of association in cross classifications having ordered categories. Journal of the American Statistical Association. 1979;74:537–552.
20. Coull BA, Agresti A. Generalized log-linear models with random effects, with application to smoothing contingency tables. Statistical Modelling. 2003;3:251–271.
21. Graham P. Modeling covariate effects in observer agreement studies: the case of nominal scale agreement. Statistics in Medicine. 1995;14:299–310. doi: 10.1002/sim.4780140308.
22. Dawid AP, Skene AM. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics. 1979;28:20–28.
23. Williamson JM, Manatunga AK. Assessing interrater agreement from dependent data. Biometrics. 1997;54:707–714.
24. Uebersax JS, Grove WM. Latent class analysis of diagnostic agreement. Statistics in Medicine. 1990;9:559–572. doi: 10.1002/sim.4780090509.
25. Coughlin SS, Pickle LW, Goodman MT, Wilkens LR. The logistic modeling of interobserver agreement. Journal of Clinical Epidemiology. 1992;45(11):1237–1241. doi: 10.1016/0895-4356(92)90164-i.
26. Lipsitz SR, Parzen M, Fitzmaurice GM, Klar N. A two-stage logistic regression model for analyzing inter-rater agreement. Psychometrika. 2003;68(2):289–298.
27. Kraemer HC. Ramifications of a population model for κ as a coefficient of reliability. Psychometrika. 1979;44(4):461–472.
28. Nelson KP, Edwards D. On population-based measures of agreement for binary classifications. Canadian Journal of Statistics. 2008;36(3):411–426.
29. Holford TR, Cronin KA, Mariotto AB, Feuer EJ. Changing patterns in breast cancer incidence trends. Journal of the National Cancer Institute Monographs. 2006;36:19–25. doi: 10.1093/jncimonographs/lgj016.
30. Agresti A. An Introduction to Categorical Data Analysis. New York: Wiley; 1996.
31. Landis JR, Koch GG. Measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–174.
32. McCulloch CE. Maximum likelihood algorithms for generalized linear mixed models. Journal of the American Statistical Association. 1997;92(437):162–170.
33. Kuk AYC, Cheng YW. The Monte Carlo Newton–Raphson algorithm. Journal of Statistical Computation and Simulation. 1997;59:233–250.
34. Nelson KP, Leroux BG. Properties and comparison of estimation methods in a log-linear generalized linear mixed model. Journal of Statistical Computation and Simulation. 2008;78(3):367–384.
