Published in final edited form as: Stat Med. 2021 May 9;40(17):4014–4033. doi: 10.1002/sim.9011

Measuring Rater Bias in Diagnostic Tests with Ordinal Ratings

Chanmin Kim 1,*, Xiaoyan Lin 3, Kerrie P Nelson 2

Summary

Diagnostic tests frequently rely on the interpretation of images by skilled raters. In many clinical settings, however, the variability observed between experts’ ratings undermines confidence in these interpretations, leading to uncertainty in the diagnostic process. For example, in breast cancer testing, radiologists interpret mammographic images, while breast biopsy results are examined by pathologists. Each of these procedures involves elements of subjectivity. We propose a flexible two-stage Bayesian latent variable model to investigate how the skills of individual raters impact the diagnostic accuracy of image-related testing in large-scale medical testing studies. A strength of the proposed model is that it accommodates studies in which the true disease status of a patient, ascertainable within a reasonable time frame, may or may not be known. In these studies, many raters each contribute classifications on a large sample of patients using a defined ordinal grading scale, leading to a complex correlation structure between ratings. In contrast to currently available methods, our modeling approach considers the different sources of variability contributed by experts and patients while accounting for the correlations present between ratings and patients. We propose a novel measure of a rater’s ability (the magnifier) that, unlike conventional measures of sensitivity and specificity, is robust to the underlying prevalence of disease in the population, providing an alternative measure of diagnostic accuracy across patient populations. Extensive simulation studies demonstrate lower bias in the estimation of parameters and measures of accuracy, and show that the proposed model outperforms existing models. Receiver operating characteristic (ROC) curves are derived to assess the diagnostic accuracy of individual experts and their overall performance. Our proposed modeling approach is applied to a large breast imaging study with known disease status and a uterine cancer dataset with unknown disease status.

Keywords: Diagnostic test, Bayesian latent variable model, Ordinal Ratings, Variability, Breast imaging, ROC curve

1 ∣. INTRODUCTION

Assessing diagnostic accuracy is an important component of evaluating the performance of a medical diagnostic test. When the medical procedure produces binary results, diagnostic accuracy can be relatively easy to assess through estimating the sensitivity and specificity of the test and receiver operating characteristic (ROC) analysis1,2,3,4,5. However, many medical diagnostic procedures use an ordered categorical scale to measure the severity of disease. For example, the Breast Imaging Reporting and Data System (BI-RADS) scale6 is used by radiologists for classifying breast cancer status from mammograms, and the Gustilo and Anderson scale7 is used for grading the severity of open fractures. Simply dichotomizing the ordinal scale and applying methods for binary data is one possible way to analyze ordinal data. However, doing so discards a great deal of information and therefore reduces accuracy when assessing the performance of the diagnostic test.

Numerous methods have been proposed to analyze the accuracy of multirater ordinal data8. Such data arise in clinical studies assessing the diagnostic accuracy of an existing or new diagnostic procedure, where many experts each interpret the test results of a large sample of patients. Consequently, variability between the multiple experts’ ratings needs to be accounted for in the modeling process. Many such methods focus on the calculation of omnibus indices, such as the kappa coefficient and its many kappa-like variants9,10,11,12,13,14,15, to measure agreement between raters. Obuchowski et al. (2004)16 examined several different methods for multireader (multirater) ROC studies. In recent decades, Bayesian models have also been developed for the analysis of multirater ordinal data17,18,19; a gold standard diagnosis is not required in any of these approaches. Cao et al. (2010)19 proposed a Bayesian hierarchical model for ordinal score data with no gold standard outcome (e.g., diseased/non-diseased) using rater bias and discrimination terms; however, they focus on estimating the ranks of the latent traits rather than their values (which are our focus). Thus, they do not discuss non-identifiability of the model parameters, as it is not a salient issue when estimating only the ranks. Recently, Bayesian nonparametric techniques have also been applied to analyze ordinal data20,21,22 in settings where no gold standard outcome is considered.

In this paper, we are interested in large-scale studies, where many raters contribute ordinal classifications on a large sample of items. Our modeling approach flexibly accommodates research studies where the true disease status of patients is known or unknown. The motivating dataset was collected by Beam et al. (2003)23, who studied variability in the interpretation of mammograms by radiologists in the United States. Each radiologist in a random sample of one hundred and seven interpreted a large random sample of mammograms, recording the results on a modified ordinal five-point BI-RADS scale. Although the motivating data form a balanced design, our proposed method can flexibly accommodate unbalanced data with missing ratings.

For this multirater ordinal data structure, a two-stage latent trait model is proposed. Specifically, we employ a patient-level latent trait variable to link radiologists’ ordinal ratings with patients’ true binary disease-development outcomes. The first component of our modeling approach adopts a simple probit model for the probability of disease, based upon a patient’s unique latent trait, with the patient’s true binary disease outcome (diseased/not diseased). For the ordinal rating component, our model builds upon the binary model of Lin et al. (2018)5, where a rater’s overall diagnostic ability is modeled through a diagnostic bias and a magnifier. While the earlier latent trait models of Uebersax and Grove (1993)24 and Johnson (1996)25 assume additive effects of the patient latent trait and rater effects, our model assumes that raters, depending on their ability, can magnify or shrink the impact of the observed patient latent trait and then adjust with their innate diagnostic bias when evaluating patient disease severity. Qu et al. (1996)1 propose a latent class model with random effects in which the patient latent trait is multiplied by a rater-specific coefficient; however, that model is developed for binary data.

Although the motivating mammogram dataset has the true patient disease outcomes available, our proposed method is applicable even when the true disease outcomes are unknown. This is done by fitting only the model for radiologists’ ordinal ratings conditional on the patient latent trait variable, which is drawn via the data augmentation algorithm26. The Holmquist et al. (1968)27 dataset is a relevant application, in which seven pathologists each independently evaluated and classified lesions on 118 slides with potential carcinoma in situ of the uterine cervix into a five-category ordinal scale, from 1 (negative) to 5 (invasive carcinoma). In the simulation study, we also test our model’s performance when the true disease status is not known. In practice, it is recommended that model robustness be carefully examined when true disease outcomes are unavailable28. Our proposed two-stage latent trait model conveniently assesses diagnostic accuracy through estimated sensitivities and specificities and ROC analysis. Sensitivities and specificities can be calculated for each cut-point of the ordinal ratings. Accordingly, ROC curves for individual raters and an overall ROC curve for the medical test can be generated, facilitating evaluation of the abilities of individual experts by comparison with the overall ROC curve. A Bayesian framework is adopted for estimation of the proposed ordinal latent model. In a large-scale study, each rater contributes classifications on the same large sample of patients’ test results, leading to a complex correlation structure between ratings. Several approaches29,30,31 have been proposed for studies with strong dependence between ratings. Here, the Bayesian hierarchical structure and Markov chain Monte Carlo (MCMC)32,33 computation naturally accommodate the complex correlation between ratings and patients. By borrowing information across raters and patients, these methods can provide robust estimation even when there are missing ratings. Not only can individual diagnostic ability (bias and magnifier) be estimated, but the overall performance of raters can also be naturally assessed with regard to diagnostic accuracy (e.g., ROC analysis).

The remainder of the article proceeds as follows. Section 2 proposes our Bayesian hierarchical model and discusses the conditions for identifiability. Section 3 provides details on posterior sampling for two cases: when the true disease status is known and when it is unknown. In Section 4, a rigorous simulation study for estimation of the parameters is presented. In Section 5, ROC curves based upon the proposed model are presented. Section 6 presents extensive simulation studies comparing the proposed model to existing methods. Section 7 illustrates how to check the goodness-of-fit of the proposed model, while Section 8 illustrates the proposed approach using a large-scale breast screening study23 with known disease status and a uterine cancer dataset27 with unknown disease status. The paper concludes with a discussion in Section 9.

2 ∣. PROPOSED BAYESIAN HIERARCHICAL MODEL

We base our modeling on a study setting where n raters (such as radiologists) each independently classify the same set of m patient test results (such as mammograms) according to an ordinal classification scale with K categories. This results in m × n classifications of the form Wij = k (i = 1, ⋯ , m; j = 1, ⋯ , n; k = 1, ⋯ , K). Our proposed model is a two-stage Bayesian hierarchical model that flexibly incorporates the true disease status of patients and the classifications of the raters as presented in Eq. (1) and (2), respectively:

$$P(D_i \mid u_i) = \Phi(u_i), \qquad (1)$$

$$P(W_{ij} \le k \mid u_i) = \Phi\big(\alpha_k - (a_j + b_j u_i)\big), \qquad (2)$$

for k = 1, ⋯ , K; i = 1, ⋯ , m; j = 1, ⋯ , n, where ui is an unobserved continuous latent variable of the underlying latent disease severity of patient i. When the true disease status of each patient Di is known, we define Di = 0 if the patient does not have the disease, while Di = 1 if the patient has the disease. The disease status may be confirmed or determined within a reasonable time interval after the patient’s image was taken. The latent variable ui links the true disease outcome (Di = 0 or 1) in Eq (1) with the classification assigned by the jth rater, Wij = k in Eq (2). The rater decision Wij is also affected by the aj and bj parameters. The term aj (j = 1, ⋯ , n) refers to the intrinsic bias of a rater to assign a higher (more severe) disease score to a patient whose latent trait ui value is 0 (i.e., an equal chance of disease or no disease). The term bj (j = 1, ⋯, n) reflects the clinician’s diagnostic ability and skill to magnify the patient latent disease severity in the same direction (when bj > 0).
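To make the two-stage structure concrete, the following minimal R sketch simulates data from Eqs. (1) and (2). It is illustrative only, not the authors' code: the mixture for ui and the reading of N(0, 0.5) and N(1.7, 0.5) as (mean, variance) pairs are assumptions borrowed from the simulation settings of Section 4.

```r
## Illustrative simulation from the two-stage model (not the authors' code).
## Assumptions: Scenario-2-style mixture for u_i; "0.5" read as a variance.
set.seed(1)
m <- 90; n <- 50; K <- 4
alpha <- c(-1, 1, 2)                      # cutpoints alpha_1 < alpha_2 < alpha_3
u <- ifelse(runif(m) < 0.9, rnorm(m, -1, 0.5), rnorm(m, 1.5, 0.5))  # latent traits
D <- rbinom(m, 1, pnorm(u))               # Eq. (1): P(D_i = 1 | u_i) = Phi(u_i)
a <- rnorm(n, 0,   sqrt(0.5))             # rater bias a_j
b <- rnorm(n, 1.7, sqrt(0.5))             # rater magnifier b_j
W <- matrix(NA_integer_, m, n)
for (i in 1:m) for (j in 1:n) {
  z <- rnorm(1, a[j] + b[j] * u[i], 1)    # latent score underlying Eq. (2)
  W[i, j] <- findInterval(z, alpha) + 1   # ordinal rating in 1..K
}
```

Generating Wij through the latent score z makes the equivalence with the cumulative probit form of Eq. (2) explicit.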

Building upon an earlier model for binary classification processes in Lin et al. (2018)5, the proposed model introduces a salient challenge in parameter identifiability arising from the ordinal nature of the data.

2.1 ∣. Identifiability of Proposed Model

The issue of identifiability commonly arises in the modeling of ordinal data34,35. Without any restrictions on the model parameters in the proposed Bayesian latent model (e.g., the aj’s, bj’s and ui’s), the parameters in Eq. (1) and (2) generally cannot be point-identified, as multiple parametrizations are equivalent for some of the parameters (e.g., bj × ui = (c × bj) × (ui/c) for all i = 1, ⋯ , m and j = 1, ⋯ , n)36. However, by fixing the posterior mean of ui based upon the prevalence of disease in the study population via Eq. (1), we are able to satisfactorily identify all parameters in the model. Additionally, the assumption that the distribution of the rater bias terms aj has zero mean provides a simple and reasonable interpretation: a rater with a large positive aj tends to assign larger scores to patients relative to other raters. It is worth noting that we do not shift the aj’s by their mean in each MCMC iteration (i.e., to guarantee the mean of the aj’s is 0); rather, we shift the posterior samples of the aj’s after all MCMC iterations (post-processing) and shift all other posterior distributions of the relevant parameters accordingly. In this way, the joint posterior distribution of all parameters can be explored freely, without additional restrictions during the MCMC iterations. In Section 4, we demonstrate these properties through simulation studies under various scenarios.
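The post-processing step can be expressed in a few lines of R. This is a sketch under assumed names: a_draws is a hypothetical R × n matrix of posterior draws of the aj’s and alpha_draws an R × (K − 1) matrix of cutpoint draws; shifting both by the per-iteration mean bias leaves Φ(αk − (aj + bjui)) unchanged.

```r
## Post-hoc centering of the rater biases (hypothetical object names).
abar        <- rowMeans(a_draws)      # mean bias at each MCMC iteration r
a_draws     <- a_draws - abar         # biases now have mean zero per iteration
alpha_draws <- alpha_draws - abar     # cutpoints shift by the same amount, so
                                      # Phi(alpha_k - (a_j + b_j u_i)) is unchanged
```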

3 ∣. POSTERIOR SAMPLING OF THE PROPOSED MODEL

3.1 ∣. When true disease status is known

In this section we develop a Bayesian algorithm for posterior sampling to fit the proposed two-stage model in Eq. (1) and (2). For efficient Bayesian computation in probit models, we introduce normal latent variables. For the disease status Di of each patient, a normal latent variable

$$Z_{i0} \mid u_i \sim N(u_i, 1)$$

is introduced. For each score by the jth rater on the ith patient’s test result Wij, we define a normal latent variable Zij

$$Z_{ij} \mid u_i \sim N(a_j + b_j u_i, 1).$$

For each pair of Wij and Zij, for i = 1, ⋯ , m; j = 1, ⋯ , n, we have

$$\begin{aligned}
P(Z_{ij} < \alpha_1 \mid u_i) &= P(W_{ij} = 1 \mid u_i) \\
P(\alpha_1 \le Z_{ij} < \alpha_2 \mid u_i) &= P(W_{ij} = 2 \mid u_i) \\
&\;\;\vdots \\
P(\alpha_{K-1} \le Z_{ij} \mid u_i) &= P(W_{ij} = K \mid u_i).
\end{aligned}$$

Then, the augmented joint density function of d = (d1, d2, ⋯ , dm), w = (w11, w12, ⋯ , wmn) and z = (z10, ⋯ , zm0, z11, z12, ⋯ , zmn) is

$$p(d, w, z \mid a, b, u, \alpha) = \prod_{i=1}^{m} \Bigg[ \phi(z_{i0} - u_i)\, [I(z_{i0} > 0)]^{d_i} [I(z_{i0} \le 0)]^{1 - d_i} \prod_{j=1}^{n} \Big( \phi(z_{ij} - a_j - b_j u_i)\, [I(z_{ij} < \alpha_1)]^{I(w_{ij} = 1)} [I(\alpha_1 \le z_{ij} < \alpha_2)]^{I(w_{ij} = 2)} \cdots [I(\alpha_{K-1} \le z_{ij})]^{I(w_{ij} = K)} \Big) \Bigg]. \qquad (3)$$

We adopt Gibbs sampling37, a special case of the Metropolis-Hastings algorithm, for our posterior sampling with the following prior specifications: ui ~ N(μu, τu), aj ~ N(0, τa), bj ~ N(μb, τb), μb ~ N(0, τ), τa ~ Ga(γ, γ), τb ~ Ga(γ, γ), α1 ~ Unif(−δ, δ), and αk ∣ αk−1 ~ Unif(αk−1, αk−1 + δ) for k = 2, ⋯ , K − 1, where τa and τb are the precision parameters of the prior distributions of aj and bj, respectively. Here, we assume μu = 0 and τu = τ = 1. The following steps are then iterated to obtain posterior samples of the latent variables (zi0, zij; i = 1, ⋯ , m, j = 1, ⋯ , n) and the parameters (aj, bj, ui, μb, τa, τb, αk; i = 1, ⋯ , m, j = 1, ⋯ , n, k = 1, ⋯ , K − 1):

  1. Sample $z_{i0}$, for i = 1, ⋯ , m:
$$z_{i0} \sim \begin{cases} N(u_i, 1)\, I(z_{i0} > 0), & \text{for } d_i = 1 \\ N(u_i, 1)\, I(z_{i0} \le 0), & \text{for } d_i = 0 \end{cases}$$
  2. Sample $z_{ij}$, for i = 1, ⋯ , m; j = 1, ⋯ , n (a truncated-normal sampling sketch in R follows this list):
$$z_{ij} \sim \begin{cases} N(a_j + b_j u_i, 1)\, I(z_{ij} < \alpha_1), & \text{for } w_{ij} = 1 \\ N(a_j + b_j u_i, 1)\, I(\alpha_1 \le z_{ij} < \alpha_2), & \text{for } w_{ij} = 2 \\ \qquad \vdots \\ N(a_j + b_j u_i, 1)\, I(\alpha_{K-1} \le z_{ij}), & \text{for } w_{ij} = K \end{cases}$$
  3. Sample $u_i$, for i = 1, ⋯ , m:
$$u_i \sim N\left( \frac{\sum_{j=1}^{n} b_j (z_{ij} - a_j) + z_{i0} + \mu_u \tau_u}{\sum_{j=1}^{n} b_j^2 + 1 + \tau_u},\; \frac{1}{\sum_{j=1}^{n} b_j^2 + 1 + \tau_u} \right)$$
    and set $u_i^{*} = u_i + C$, where the constant C is solved for in each iteration to satisfy $P(D = 1) = E(\Phi(u)) \approx \frac{1}{m}\sum_{i=1}^{m} \Phi(u_i^{*})$. See below for more details.
  4. Sample $a_j$, for j = 1, ⋯ , n:
$$a_j \sim N\left( \frac{\sum_{i=1}^{m} (z_{ij} - b_j u_i)}{m + \tau_a},\; \frac{1}{m + \tau_a} \right)$$
  5. Sample $b_j$, for j = 1, ⋯ , n:
$$b_j \sim N\left( \frac{\sum_{i=1}^{m} u_i (z_{ij} - a_j) + \mu_b \tau_b}{\sum_{i=1}^{m} u_i^2 + \tau_b},\; \frac{1}{\sum_{i=1}^{m} u_i^2 + \tau_b} \right)$$
  6. Sample $\mu_b$:
$$\mu_b \sim N\left( \frac{\tau_b \sum_{j=1}^{n} b_j}{n \tau_b + \tau},\; \frac{1}{n \tau_b + \tau} \right)$$
  7. Sample $\tau_a$:
$$\tau_a \sim Ga\left( \gamma + \frac{n}{2},\; \gamma + \frac{1}{2}\sum_{j=1}^{n} a_j^2 \right)$$
  8. Sample $\tau_b$:
$$\tau_b \sim Ga\left( \gamma + \frac{n}{2},\; \gamma + \frac{1}{2}\sum_{j=1}^{n} (b_j - \mu_b)^2 \right)$$
  9. Sample $\alpha_k$, for k = 1, ⋯ , K − 1:
$$\alpha_k \sim \mathrm{Unif}(L_k, U_k),$$
    where
$$L_k = \begin{cases} \max\left(-\delta,\; \alpha_{k+1} - \delta,\; \max_{i,j:\, w_{ij} = k} z_{ij}\right), & k = 1 \\ \max\left(\alpha_{k-1},\; \alpha_{k+1} - \delta,\; \max_{i,j:\, w_{ij} = k} z_{ij}\right), & k = 2, \dots, K-2 \\ \max\left(\alpha_{k-1},\; \max_{i,j:\, w_{ij} = k} z_{ij}\right), & k = K-1 \end{cases}$$
$$U_k = \begin{cases} \min\left(\delta,\; \alpha_{k+1},\; \min_{i,j:\, w_{ij} = k+1} z_{ij}\right), & k = 1 \\ \min\left(\alpha_{k-1} + \delta,\; \alpha_{k+1},\; \min_{i,j:\, w_{ij} = k+1} z_{ij}\right), & k = 2, \dots, K-1 \end{cases}$$
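Steps 1 and 2 require draws from univariate truncated normals. One standard way to implement them, shown here as a sketch rather than the authors' implementation, is the inverse-CDF method; d_i, u, a, b, alpha and the rating matrix W denote the sampler's current states.

```r
## Inverse-CDF draw from N(mu, 1) truncated to (lo, hi) -- one common choice.
rtnorm_1 <- function(mu, lo, hi) {
  p <- runif(1, pnorm(lo - mu), pnorm(hi - mu))
  mu + qnorm(p)
}

## Step 1: z_i0 given d_i (truncation at 0).
z_i0 <- if (d_i == 1) rtnorm_1(u[i], 0, Inf) else rtnorm_1(u[i], -Inf, 0)

## Step 2: z_ij truncated to the cutpoint interval of the observed category w_ij.
bounds <- c(-Inf, alpha, Inf)            # alpha = (alpha_1, ..., alpha_{K-1})
z_ij <- rtnorm_1(a[j] + b[j] * u[i], bounds[W[i, j]], bounds[W[i, j] + 1])
```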

In Step 3, we shift the posterior samples of the ui’s to satisfy the identifiability condition of Section 2.1: P(D = 1) = E(Φ(u)). In each iteration, we solve the equation $P(D = 1) = \frac{1}{m}\sum_{i=1}^{m} \Phi(u_i + C)$ for C using the rootSolve R package and set $u_i^{*} = u_i + C$. Note that since the latent variable in our model (ui) is continuous, a label switching issue may occur up to alternating signs (e.g., bj × ui = (−bj) × (−ui)). However, because we fix the posterior mean of ui based upon the prevalence of disease in the study population via Eq. (1), this label switching issue is avoided.
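A sketch of the shift in Step 3: since $\frac{1}{m}\sum_i \Phi(u_i + C)$ is strictly increasing in C, the equation has a unique root, and base R's uniroot suffices here (the authors use multiroot from rootSolve).

```r
## Shift the current draws u to satisfy the prevalence constraint (sketch).
shift_u <- function(u, prev) {
  C <- uniroot(function(C) mean(pnorm(u + C)) - prev,
               interval = c(-20, 20))$root   # solves (1/m) sum Phi(u_i + C) = prev
  u + C
}
u_star <- shift_u(u, prev = 0.25)            # e.g., P(D = 1) = 0.25
```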

For effective dissemination of our method, we developed an R package, BayesODT (available at https://lit777.github.io/BayesODT.html), that runs the MCMC algorithm described above and provides other useful features such as visualization and receiver operating characteristic curves.

3.2 ∣. When true disease status is not known

We note that this Gibbs sampling algorithm can also be implemented in study settings where the true disease status of patients is unknown. In that case, we assume only Model (2), omit Step 1 of the sampling procedure in Section 3.1, and replace Step 3 by

$$u_i \sim N\left( \frac{\sum_{j=1}^{n} b_j (z_{ij} - a_j) + \mu_u \tau_u}{\sum_{j=1}^{n} b_j^2 + \tau_u},\; \frac{1}{\sum_{j=1}^{n} b_j^2 + \tau_u} \right)$$

and set $u_i^{*} = u_i - \frac{1}{m}\sum_{i=1}^{m} u_i$. Note that when the true disease status is not known, the first component of the two-stage model (Eq. (1)) is not used and, therefore, the posterior mean of ui is not adjusted based upon the prevalence of disease. We present examples in the simulation studies and the performance (identifiability) of our proposed approach in Section 4 for both scenarios (true disease status known and unknown). We also demonstrate the performance of our proposed approach in comparison to existing approaches in a cancer dataset where the true disease status of patients is unknown.

4 ∣. SIMULATION STUDY

4.1 ∣. When true disease status is known

We conduct an extensive simulation study to assess the performance of the proposed Bayesian latent variable model under a diverse range of disease prevalence rates. In this section, we focus on evaluating identifiability and parameter estimation for our proposed model; that is, we examine estimates of the aj, bj and ui parameters for all i = 1, ⋯ , m and j = 1, ⋯ , n. In Section 6, we use the same simulation settings, with a range of additional settings, to compare our model’s performance to existing methods in estimating the average AUC. Table 1 presents the prevalence for each set of simulations as follows: Scenario (1) P(D = 1) = 0.05; (2) P(D = 1) = 0.25; (3) P(D = 1) = 0.5. Further, we examine scenarios in which the true disease status of each patient is known and unknown. In each situation we have m = 90 patients and n = 50 raters. We also test another scenario (4) with m = 148 patients and n = 100 raters for the moderate prevalence rate P(D = 1) = 0.25, reflecting the imaging application in Section 8. We set K = 4, so the ordinal classification scale has 4 categories, and the vector of true cutpoints is set to (α1 = −1, α2 = 1, α3 = 2). True values of aj and bj for each rater j = 1, ⋯ , n are independently sampled from N(0, 0.5) and N(1.7, 0.5), respectively. Patients’ unobserved disease severity terms (i.e., latent traits) ui, i = 1, ⋯ , m, are randomly sampled from mixture distributions: Scenario (1) 0.95N(−1.5, 0.5²) + 0.05N(1.5, 0.5²) for P(D = 1) = 0.05; Scenario (2) 0.9N(−1, 0.5²) + 0.1N(1.5, 0.5²) for P(D = 1) = 0.25; and Scenario (3) 0.9N(0, 0.5²) + 0.1N(1.5, 0.5²) for P(D = 1) = 0.5. We also consider a non-normal mixture for ui in Scenario (6): 0.9N(−1, 0.5²) + 0.1 Exp(1). Based on these values, true disease statuses Di, i = 1, ⋯ , m, are sampled using the model in Eq. (1). With the simulated aj, bj and ui, in each dataset the ratings Wij, i = 1, ⋯ , m; j = 1, ⋯ , n, are generated based on Eq. (2). In this simulation study, we evaluate model performance in terms of the estimation of αk, aj, bj and ui for k = 1, 2, 3; j = 1, ⋯ , n; i = 1, ⋯ , m, where we shift the mean of the aj parameters to zero after model fitting for ease of interpretation. The simulation data are openly available on GitHub at https://lit777.github.io/.

TABLE 1.

Various scenarios in the simulation study with m = 90 patients and n = 50 raters.

                        True disease prevalence
Disease status (D)   P(D = 1) = 0.05   P(D = 1) = 0.25       P(D = 1) = 0.50
Known                (1)               (2), (4)*, (6)*,†     (3)
Unknown              -                 (5)*                  -

* indicates scenarios with larger sets of patients (m = 148) and raters (n = 100).

† indicates the scenario with a non-normal mixture model for ui.

In the Supplementary Materials, we present Figures S1, S2 and S3 for simulation scenarios 1, 2 and 3 (i.e., P(D = 1) = 0.05, 0.25 and 0.5), respectively. Figures S4 and S6 display results for simulation scenarios 4 and 6 (i.e., the larger datasets), which demonstrate similarly excellent parameter estimation. The results illustrate that the Bayesian estimation procedure is very successful, leading to estimates of αk, aj and bj that are approximately unbiased, except for a few bj parameters that are uniformly shrunk; those biases are only small to moderate. The proposed model also almost fully captures the true latent variables ui, supporting the robustness of our method for estimating a non-normal distribution of patient disease severity.

4.2 ∣. Simulation scenario when true disease status is unknown

In Figure S5 in the Supplementary Materials, we examine another scenario in which the true disease status is not known (Scenario (5)), with m = 148, n = 100 and P(D = 1) = 0.25. Using the modified Gibbs sampling algorithm described in Section 3.2, we obtain estimates of the target parameters. Compared to the estimates from Scenario (4), the estimates of αk and the aj’s are slightly more biased, but the estimates of the bj’s are more consistent with the true parameter values. Since Scenario (5) does not impose any restriction on ui through the model P(Di) = Φ(ui), the bj parameters, which form an interaction term with ui in Eq. (2), have more freedom to explore the parameter space, so the estimates of the bj parameters in Scenario (5) are more likely to capture the true parameter values. The simulation data that support the findings of this study are available from the corresponding author upon request.

5 ∣. ROC CURVE

5.1 ∣. Estimation of ROC Curves for Each Individual Rater

In addition to estimating the model parameters of direct interest for each rater and patient (i.e., aj, bj, ui), an important goal of the proposed modeling approach is the generation of Receiver Operating Characteristic (ROC) curves38 for each individual rater and for the overall diagnostic procedure. ROC curves provide a graphical assessment of diagnostic accuracy by plotting the true-positive rate against the false-positive rate. An ROC curve rising steeply above the 45-degree line indicates a more accurate diagnostic test, while a curve closer to the 45-degree line suggests less accurate ratings. The estimated ROC curve of each individual rater provides valuable insight into their performance in classifying the disease status of patients based upon their personal skill and bias. Posterior estimates of ROC curves for individual raters can be obtained through simulated values from the MCMC iterations. Denoting the rth sampled value from the posterior distribution of (αk, aj, bj, ui) by (αk(r), aj(r), bj(r), ui(r)) (i.e., posterior samples from the rth MCMC iteration; r = 1, ⋯ , R), the conditional distribution function of Wij,

$$P\big(W_{ij} \le k \mid \alpha_k^{(r)}, a_j^{(r)}, b_j^{(r)}, u_i^{(r)}\big), \quad \text{for } k = 1, \dots, K, \qquad (4)$$

averaged over these values, can be used to construct a Bayesian estimate of the ROC curve. Specifically, for each rater j, the ROC curve is created by plotting the following points at each of the K cutpoints (k = 1, ⋯ , K)

$$\big\{ P(W_{\cdot j} > k \mid D_\cdot = 0),\; P(W_{\cdot j} > k \mid D_\cdot = 1) \big\}$$

where W·j is the average of the ratings of the j-th rater over all the patients in the study. The cumulative probability of the average rating for the j-th rater, $P(W_{\cdot j} \le k \mid D_\cdot = d)$, given the true disease status d, can be estimated by

$$P(W_{\cdot j} \le k \mid D_\cdot = d) = \frac{P(W_{\cdot j} \le k, D_\cdot = d)}{P(D_\cdot = d)} = \frac{\int \Phi\big(\alpha_k - (a_j + b_j u)\big)\, \Phi(u)^d \big(1 - \Phi(u)\big)^{(1-d)}\, dF(u)}{\int \Phi(u)^d \big(1 - \Phi(u)\big)^{(1-d)}\, dF(u)} \approx \frac{1}{R}\sum_{r=1}^{R} \frac{\frac{1}{m}\sum_{i=1}^{m} \Phi\big(\alpha_k^{(r)} - (a_j^{(r)} + b_j^{(r)} u_i^{(r)})\big)\, \Phi(u_i^{(r)})^d \big(1 - \Phi(u_i^{(r)})\big)^{(1-d)}}{\frac{1}{m}\sum_{i=1}^{m} \Phi(u_i^{(r)})^d \big(1 - \Phi(u_i^{(r)})\big)^{(1-d)}} \qquad (5)$$

for k = 1, ⋯ , K, where d denotes the true disease status (1 = disease, 0 = normal) and F(u) denotes the true distribution of the latent disease severity u. The final line in Eq. (5) follows by inserting the r-th posterior samples into the models in Eq. (1) and Eq. (2) and averaging over all R sets of posterior samples. Selected estimated ROC curves for individual raters are presented in Figure 2(a), which depicts estimated ROC curves for four raters (rater IDs 32, 42, 17 and 40), along with the estimated ROC curve over all 50 raters, from one simulated dataset based on Scenario 2. These raters exhibit high bias aj and high magnifier bj (rater 32); high bias aj and low magnifier bj (rater 42); low bias aj and high magnifier bj (rater 17); and low bias aj and low magnifier bj (rater 40). As expected, when the diagnostic magnifier bj is high (raters 32 and 17), the area under the ROC curve (AUC), a measure of how well a rater (one with a higher bj value being more highly skilled) distinguishes between the categories of the ordinal scale, is high. The AUC for the j-th rater, AUCj, can be expressed as30

$$AUC_j = \sum_{k=1}^{K-1} \left\{ P(W_{\cdot j} = k \mid D_\cdot = 0) \sum_{c=k+1}^{K} P(W_{\cdot j} = c \mid D_\cdot = 1) \right\} + 0.5 \sum_{k=1}^{K} P(W_{\cdot j} = k \mid D_\cdot = 0)\, P(W_{\cdot j} = k \mid D_\cdot = 1),$$

where $P(W_{\cdot j} = k \mid D_\cdot = d)$ is estimated by

$$\hat{P}(W_{\cdot j} = k \mid D_\cdot = d) = \frac{1}{R}\sum_{r=1}^{R} \frac{\frac{1}{m}\sum_{i=1}^{m} \left\{\Phi\big(\alpha_k^{(r)} - (a_j^{(r)} + b_j^{(r)} u_i^{(r)})\big) - \Phi\big(\alpha_{k-1}^{(r)} - (a_j^{(r)} + b_j^{(r)} u_i^{(r)})\big)\right\} \Phi(u_i^{(r)})^d \big(1 - \Phi(u_i^{(r)})\big)^{(1-d)}}{\frac{1}{m}\sum_{i=1}^{m} \Phi(u_i^{(r)})^d \big(1 - \Phi(u_i^{(r)})\big)^{(1-d)}}.$$
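The following R sketch turns the two estimators above into a rater-specific AUC. Object names (alpha_draws, a_draws, b_draws, u_draws, holding R posterior draws of the cutpoints, biases, magnifiers and latent traits) are hypothetical; the logic follows the estimator of $\hat{P}(W_{\cdot j} = k \mid D_\cdot = d)$ and the AUCj formula directly.

```r
## Estimate P(W_.j = k | D = d) by averaging over posterior draws (sketch).
p_k_given_d <- function(j, d, alpha_draws, a_draws, b_draws, u_draws) {
  R <- nrow(u_draws); K <- ncol(alpha_draws) + 1
  out <- matrix(0, R, K)
  for (r in 1:R) {
    u    <- u_draws[r, ]
    wgt  <- pnorm(u)^d * (1 - pnorm(u))^(1 - d)        # Phi(u)^d (1-Phi(u))^(1-d)
    cuts <- c(-Inf, alpha_draws[r, ], Inf)
    eta  <- a_draws[r, j] + b_draws[r, j] * u
    cdf  <- sapply(cuts, function(t) pnorm(t - eta))   # m x (K+1)
    cell <- cdf[, -1, drop = FALSE] - cdf[, -(K + 1), drop = FALSE]  # P(W = k | u_i)
    out[r, ] <- colMeans(cell * wgt) / mean(wgt)
  }
  colMeans(out)                                        # average over the R draws
}

p0 <- p_k_given_d(j = 1, d = 0, alpha_draws, a_draws, b_draws, u_draws)
p1 <- p_k_given_d(j = 1, d = 1, alpha_draws, a_draws, b_draws, u_draws)
K  <- length(p0)
AUC_j <- sum(sapply(1:(K - 1), function(k) p0[k] * sum(p1[(k + 1):K]))) +
  0.5 * sum(p0 * p1)
```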

FIGURE 2.

(a) Estimated ROC curves and AUC values for our proposed model for four selected raters (j = 17, 32, 40, 42) and overall, from one simulated dataset in Scenario 2 (see Section 5.1). (b) Estimates of the smoothed ROC curve (red) based on Eq. (7), the ROC curve based on Eq. (6) (blue), and the ROC curve based on Eq. (8) (cyan), from the same simulated dataset in Scenario 2.

The AUC over all raters can be estimated similarly.

5.2 ∣. Estimation of ROC curve over all raters

A Bayesian estimate for the ROC curve over all raters can be constructed by averaging over the group of n raters (j index). This estimated ROC curve provides a valuable overall assessment of the accuracy of the underlying population of raters in the study setting after accounting for the variability between raters. The curve is generated based upon the following points at each k cutpoint (k = 1, … , K)

$$\big\{ P(W_{\cdot\cdot} > k \mid D_\cdot = 0),\; P(W_{\cdot\cdot} > k \mid D_\cdot = 1) \big\}$$

where W·· is the average rating over all the patients and raters in the study. The cumulative probability of the average rating over all raters, $P(W_{\cdot\cdot} \le k \mid D_\cdot = d)$, given the true disease status d, can be estimated by

$$P(W_{\cdot\cdot} \le k \mid D_\cdot = d) = \frac{P(W_{\cdot\cdot} \le k, D_\cdot = d)}{P(D_\cdot = d)} \approx \frac{1}{R}\sum_{r=1}^{R} \frac{1}{n}\sum_{j=1}^{n} \frac{\frac{1}{m}\sum_{i=1}^{m} \Phi\big(\alpha_k^{(r)} - (a_j^{(r)} + b_j^{(r)} u_i^{(r)})\big)\, \Phi(u_i^{(r)})^d \big(1 - \Phi(u_i^{(r)})\big)^{(1-d)}}{\frac{1}{m}\sum_{i=1}^{m} \Phi(u_i^{(r)})^d \big(1 - \Phi(u_i^{(r)})\big)^{(1-d)}}. \qquad (6)$$

The overall estimated ROC curve for one simulated dataset is presented in Figure 2(b). Figure 1 depicts three rater-level ROC curve plots showing the impact of increasing bj for a fixed aj value. True disease statuses Di (i = 1, ⋯ , 50) are simulated based on the disease prevalence rate P(D = 1) = 0.25 (Scenario 2), and patient ordinal test results (k = 1, ⋯ , 4) are generated for each aj ∈ {−1.5, 0, 1.5} and bj ∈ {0.5, 1, 1.5, 2}. In general, the resulting ROC curves are increasing in the bj parameter, in the sense that when b1 > b2 the corresponding AUC(b1) > AUC(b2) (except in the (aj = 0, bj = 1) case).

FIGURE 1.

Rater-level ROC curve plots for our proposed ordinal latent variable model in Eqs. (1) and (2) for different aj and bj values. (a) aj = −1.5 and bj = {0.5, 1, 1.5, 2}. (b) aj = 0 and bj = {0.5, 1, 1.5, 2}. (c) aj = 1.5 and bj = {0.5, 1, 1.5, 2}. "Overall" indicates the estimated ROC curve over all 4 raters for our proposed model for each aj value.

5.3 ∣. Smoothed ROC curve

To construct a smoothed overall ROC curve based upon all raters, we can use posterior estimates $(\hat{a}, \hat{b})$ of the means of (aj, bj), where $\hat{a} = \frac{1}{R}\frac{1}{n}\sum_{r=1}^{R}\sum_{j=1}^{n} a_j^{(r)}$ and $\hat{b} = \frac{1}{R}\frac{1}{n}\sum_{r=1}^{R}\sum_{j=1}^{n} b_j^{(r)}$, obtained from the Bayesian modeling of the dataset. We then evaluate the following formula, which calculates $P(W_{\cdot\cdot} \le s \mid D_\cdot = d)$ for arbitrarily many categories s = 1, ⋯ , S with corresponding cutpoints t = {t1, ⋯ , tS} (−∞ < t1 < ⋯ < tS < ∞),

$$\frac{1}{R}\sum_{r=1}^{R} \frac{\frac{1}{m}\sum_{i=1}^{m} \Phi\big(t_s - (\hat{a} + \hat{b}\, u_i^{(r)})\big)\, \Phi(u_i^{(r)})^d \big(1 - \Phi(u_i^{(r)})\big)^{(1-d)}}{\frac{1}{m}\sum_{i=1}^{m} \Phi(u_i^{(r)})^d \big(1 - \Phi(u_i^{(r)})\big)^{(1-d)}} \qquad (7)$$

over a fine grid of t values ranging between two bounds chosen so that the formula is approximately 0 at the lower bound t1 and approximately 1 at the upper bound tS (e.g., a sequence of numbers from −10 to 10 equally spaced by 0.1). This approach is similar to that described in39, except that we insert a posterior estimate (the posterior mean) of each rater-level parameter rather than the r-th sample, and evaluate the function over these posterior values. Figure 2(a) displays the estimated ROC curves for four raters and the overall ROC curve for a simulated dataset from Scenario 2. Figure 2(b) displays an estimate of the smoothed ROC curve defined by Eq. (7) and an estimate of the ROC curve over all raters based on Eq. (6), for the same simulated dataset. The gaps between the different ROC curves in Figure 2(b) are mainly induced by the different ways of averaging over the posterior samples. Since the cumulative distribution function of the standard normal distribution, Φ(·), is not linear, the two approaches, (a) averaging over the posterior samples inside the Φ(·) function (smoothed ROC, Eq. (7)) versus (b) averaging over the posterior samples outside the Φ(·) function (ROC 1, Eq. (6)), can produce discrepant estimated ROC curves. If we replace Eq. (6) with Eq. (8), an alternative estimate of $P(W_{\cdot\cdot} \le k \mid D_\cdot = d)$ in which the averaging over raters is moved inside the cumulative distribution function, the new ROC estimate (ROC 2 in Figure 2) approaches the smoothed ROC:

$$P(W_{\cdot\cdot} \le k \mid D_\cdot = d) \approx \frac{1}{R}\sum_{r=1}^{R} \frac{\frac{1}{m}\sum_{i=1}^{m} \Phi\Big(\alpha_k^{(r)} - \big(\frac{1}{n}\sum_{j=1}^{n} a_j^{(r)} + \frac{1}{n}\sum_{j=1}^{n} b_j^{(r)}\, u_i^{(r)}\big)\Big)\, \Phi(u_i^{(r)})^d \big(1 - \Phi(u_i^{(r)})\big)^{(1-d)}}{\frac{1}{m}\sum_{i=1}^{m} \Phi(u_i^{(r)})^d \big(1 - \Phi(u_i^{(r)})\big)^{(1-d)}}. \qquad (8)$$
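As an illustration of Eq. (7), the sketch below evaluates the smoothed overall ROC over a grid of thresholds, plugging in the posterior-mean rater effects. Matrix names follow the earlier hypothetical convention, and treating ui as the r-th draw inside the average is our reading of Eq. (7).

```r
## Smoothed overall ROC of Eq. (7) on a grid of thresholds t (sketch).
a_hat <- mean(a_draws); b_hat <- mean(b_draws)   # posterior means of a_j, b_j
cdf_at <- function(t, d, u_draws) {              # approx. P(W <= t | D = d)
  R <- nrow(u_draws); acc <- 0
  for (r in 1:R) {
    u   <- u_draws[r, ]
    wgt <- pnorm(u)^d * (1 - pnorm(u))^(1 - d)
    acc <- acc + mean(pnorm(t - (a_hat + b_hat * u)) * wgt) / mean(wgt)
  }
  acc / R
}
t_grid <- seq(-10, 10, by = 0.1)
fpr <- 1 - sapply(t_grid, cdf_at, d = 0, u_draws = u_draws)  # P(W > t | D = 0)
tpr <- 1 - sapply(t_grid, cdf_at, d = 1, u_draws = u_draws)  # P(W > t | D = 1)
plot(fpr, tpr, type = "l", xlab = "1 - specificity", ylab = "sensitivity")
```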

6 ∣. COMPARISON TO EXISTING METHODS

In this section, we evaluate our model performance and compare results to two existing models proposed in Albert (2007)30, namely a Gaussian random effects (GRE) model and a finite mixture (FM) of normals. Albert’s GRE model is a latent variable model for ordinal ratings which does not assume a known gold standard measurement is available, with the following form:

$$P(W_{ij} \le k \mid D_i, b_{D_i, i}) = \Phi(C_{D_i, k} + b_{D_i, i}),$$

where the cutpoints $C_{D_i, k}$ are non-decreasing in k for each unknown disease status Di ∈ {0, 1}, and $b_{D_i, i}$ is a random effect that captures the dependence in the joint conditional distribution of the vector of rater classifications, $P(W_i \mid D_i)$ for Wi = (Wi1, Wi2, ⋯ , Win). Albert’s FM model is a possible alternative approach for incorporating conditional dependence between tests:

$$P(W_i \mid D_i) = \begin{cases} \eta_{D_i} + (1 - \eta_{D_i}) \prod_{j=1}^{J} \Phi(C_{D_i, 1}), & \text{if } W_i = (1, 1, \dots, 1) \text{ and } D_i = 0 \\ \eta_{D_i} + (1 - \eta_{D_i}) \prod_{j=1}^{J} \big\{1 - \Phi(C_{D_i, K-1})\big\}, & \text{if } W_i = (K, K, \dots, K) \text{ and } D_i = 1 \\ (1 - \eta_{D_i}) \prod_{j=1}^{J} \big\{\Phi(C_{D_i, W_{ij}}) - \Phi(C_{D_i, W_{ij} - 1})\big\}, & \text{otherwise} \end{cases}$$

where the fractions η0 and η1 lie in [0, 1]. We compare the performance of our model to Albert’s GRE and FM models in terms of estimation of the average AUC across extensive simulation scenarios: (1) four scenarios from the GRE model; (2) four scenarios from the FM model; (3) six scenarios from our simulation setup in Table 1; and (4) one scenario (the simulation example in Ishwaran and Gatsonis (2000)39, hereafter I&G), where I&G describe an ordinal latent variable modeling approach that allows additional location and scale covariates for each patient i. There are nine scenarios (eight from Albert (2007) and one from our simulation scenarios) with unknown disease status and six scenarios (five from our simulation scenarios and one from Ishwaran and Gatsonis (2000)) with known disease status. We generate 500 simulated datasets per scenario. For the GRE-based scenarios, in cases 1 and 2 we use (C0,1, C0,2, C0,3, C0,4) = (1.0, 2.9, 4.8, 6.7) and (C1,1, C1,2, C1,3, C1,4) = (−1.5, −1.1, −0.6, −0.1); in cases 3 and 4 we use (C0,1, C0,2, C0,3, C0,4) = (0.5, 1.0, 2.0, 2.5) and (C1,1, C1,2, C1,3, C1,4) = (−1.8, −1.4, 1.2, 4.0), where σ0 = σ1 = 1 in all four cases. For the FM-based scenarios, in cases 1 and 2 we use (C0,1, C0,2, C0,3, C0,4) = (0, 0.5, 1.0, 1.5) and (C1,1, C1,2, C1,3, C1,4) = (−1.5, −1.1, −0.6, −0.1); in cases 3 and 4 we use (C0,1, C0,2, C0,3, C0,4) = (−0.75, −0.25, 0.25, 0.75) and (C1,1, C1,2, C1,3, C1,4) = (−1.5, −1.1, −0.6, −0.1), where η0 = η1 = 0.2 in all four cases. For the GRE- and FM-based scenarios, we assume P(Di = 1) = 0.2. For the I&G-based scenario, two patient-specific covariates are simulated: the first, binary, covariate x1 is randomly generated from {−0.5, 0.5}, and the second covariate x2 is generated from the discrete uniform distribution on {−2, −1, 0, 1, 2}. Additionally, two-dimensional covariates ui,k are simulated from the discrete uniform distribution on {0, 1, 2} and the normal distribution N(0, 0.3²), respectively, for each i and k.

Table 3 shows the results of the simulation study. As expected, the GRE and FM models perform well under their respective scenarios (i.e., correct model specification) in terms of bias, mean squared error (MSE) and coverage probability. However, when the models are misspecified (i.e., the GRE model is fit to data generated under the FM, I&G or our proposed model, or the FM model is fit to data generated under the GRE, I&G or our proposed model), the AUC estimates from the FM and GRE models are substantially biased, by 0.033 to 0.222 in magnitude (gray cells indicate absolute biases larger than 0.1), and coverage probabilities fall below the nominal level of 0.95. In contrast, our proposed model performs well across all scenarios with respect to bias, MSE and interval width. In particular, for the GRE-based scenarios, the biases and MSEs of our proposed model are similar to those of the GRE model (the correct model) in the first two scenarios, and in the next two scenarios our model produces much lower biases than the other competing models. Our model produces coverage probabilities below 0.95 in some scenarios, but it performs well even compared to the correctly specified models (e.g., the GRE model for GRE-based scenarios 3 and 4, and the FM model for FM-based scenario 3). Although the GRE model is correctly specified for the 3rd and 4th GRE-based cases, it shows the worst performance there because of a "relabeling" of the latent variable in its likelihood, as noted in Albert (2007), whereas our proposed model generates much smaller biases and MSEs.

TABLE 3.

Simulation results based upon 500 datasets for 8 scenarios based on Albert (2007), 1 scenario based on Ishwaran and Gatsonis [I & G] (2000), and 6 scenarios from our simulation setting in Section 4. Our proposed model is examined along with Gaussian random effect [GRE], finite mixture [FM] and Dorfman-Berbaum-Metz [DBM] methods in terms of the following metrics: averages of estimated AUCs, standard errors, 95% confidence intervals (credible intervals for our Bayesian methods), biases, MSEs and coverages (nominal level of 0.95). Gray cells indicate biases larger than 0.1 (in magnitude).

Specified model: GRE (Albert 2007) and FM (Albert 2007). Cell entries are Avg. of estimated AUCs (SE); 95% C.I.; Bias (MSE); Coverage.

True Model     | Case | Known Disease | m   | n   | Truth | GRE (Albert 2007)                            | FM (Albert 2007)
GRE            | (1)  | No            | 150 | 7   | 0.90  | 0.90 (0.02); 0.87-0.93; 0.002 (0.001); 0.98  | 0.94 (0.02); 0.89-0.97; 0.035 (0.002); 0.52
GRE            | (2)  | No            | 500 | 7   | 0.90  | 0.90 (0.03); 0.85-0.95; 0.004 (0.001); 0.99  | 0.93 (0.01); 0.91-0.96; 0.033 (0.001); 0.13
GRE            | (3)  | No            | 50  | 7   | 0.80  | 0.63 (0.13); 0.44-0.85; −0.170 (0.045); 0.05 | 0.91 (0.05); 0.84-0.99; 0.108 (0.014); 0.04
GRE            | (4)  | No            | 500 | 7   | 0.80  | 0.59 (0.12); 0.46-0.82; −0.208 (0.056); 0    | 0.89 (0.03); 0.86-0.98; 0.085 (0.008); 0.01
FM             | (1)  | No            | 150 | 7   | 0.90  | 0.83 (0.03); 0.77-0.88; −0.073 (0.006); 0.87 | 0.89 (0.02); 0.86-0.93; 0.004 (0.001); 0.98
FM             | (2)  | No            | 500 | 7   | 0.90  | 0.83 (0.02); 0.79-0.86; −0.069 (0.005); 0.35 | 0.90 (0.01); 0.88-0.92; 0.001 (0.001); 0.96
FM             | (3)  | No            | 150 | 7   | 0.80  | 0.76 (0.03); 0.69-0.79; −0.044 (0.003); 0.96 | 0.82 (0.04); 0.75-0.89; 0.023 (0.002); 0.92
FM             | (4)  | No            | 500 | 7   | 0.80  | 0.76 (0.01); 0.73-0.78; −0.039 (0.002); 0.87 | 0.81 (0.03); 0.77-0.88; 0.007 (0.001); 0.96
Our Scenarios  | (1)  | Yes           | 90  | 50  | 0.78  | 0.64 (0.06); 0.51-0.74; −0.136 (0.022); 0.14 | 0.96 (0.02); 0.93-0.98; 0.184 (0.034); 0
Our Scenarios  | (2)  | Yes           | 90  | 50  | 0.71  | 0.59 (0.06); 0.46-0.71; −0.111 (0.016); 0.27 | 0.93 (0.02); 0.87-0.97; 0.222 (0.049); 0
Our Scenarios  | (3)  | Yes           | 90  | 50  | 0.64  | 0.57 (0.09); 0.41-0.72; −0.071 (0.013); 0.53 | 0.80 (0.01); 0.75-0.85; 0.158 (0.025); 0
Our Scenarios  | (4)  | Yes           | 148 | 100 | 0.67  | 0.58 (0.03); 0.52-0.64; −0.085 (0.008); 0.35 | 0.86 (0.04); 0.78-0.93; 0.192 (0.038); 0
Our Scenarios  | (5)  | No            | 148 | 100 | 0.71  | 0.57 (0.03); 0.51-0.64; −0.135 (0.019); 0.19 | 0.93 (0.02); 0.88-0.96; 0.222 (0.049); 0
Our Scenarios  | (6)  | Yes           | 148 | 100 | 0.71  | 0.57 (0.03); 0.51-0.64; −0.135 (0.019); 0.19 | 0.93 (0.02); 0.88-0.96; 0.222 (0.049); 0
I & G          | (1)  | Yes           | 250 | 3   | 0.79  | 0.84 (0.05); 0.71-0.92; 0.045 (0.005); 0.76  | 0.96 (0.01); 0.95-0.98; 0.171 (0.029); 0

Specified model: DBM and Our Proposed Method. "-" indicates that DBM is not applicable (unknown disease status).

True Model     | Case | Known Disease | m   | n   | Truth | DBM                                          | Our Proposed Method
GRE            | (1)  | No            | 150 | 7   | 0.90  | -                                            | 0.90 (0.05); 0.85-0.94; 0.002 (0.003); 0.84
GRE            | (2)  | No            | 500 | 7   | 0.90  | -                                            | 0.91 (0.05); 0.88-0.93; 0.007 (0.002); 0.74
GRE            | (3)  | No            | 50  | 7   | 0.80  | -                                            | 0.82 (0.04); 0.79-0.85; 0.025 (0.002); 0.95
GRE            | (4)  | No            | 500 | 7   | 0.80  | -                                            | 0.79 (0.06); 0.78-0.81; −0.008 (0.004); 0.97
FM             | (1)  | No            | 150 | 7   | 0.90  | -                                            | 0.86 (0.07); 0.82-0.89; −0.046 (0.007); 0.81
FM             | (2)  | No            | 500 | 7   | 0.90  | -                                            | 0.86 (0.06); 0.84-0.88; −0.035 (0.001); 0.25
FM             | (3)  | No            | 150 | 7   | 0.80  | -                                            | 0.78 (0.05); 0.75-0.81; −0.025 (0.003); 0.96
FM             | (4)  | No            | 500 | 7   | 0.80  | -                                            | 0.80 (0.06); 0.78-0.82; −0.004 (0.004); 0.97
Our Scenarios  | (1)  | Yes           | 90  | 50  | 0.78  | 0.77 (0.05); 0.65-0.86; −0.003 (0.003); 0.99 | 0.80 (0.05); 0.69-0.90; 0.023 (0.004); 0.97
Our Scenarios  | (2)  | Yes           | 90  | 50  | 0.71  | 0.71 (0.04); 0.62-0.79; 0.001 (0.002); 0.99  | 0.71 (0.04); 0.65-0.79; <0.001 (0.001); 0.94
Our Scenarios  | (3)  | Yes           | 90  | 50  | 0.64  | 0.64 (0.03); 0.58-0.70; 0.007 (0.001); 0.79  | 0.67 (0.02); 0.64-0.70; 0.035 (0.002); 0.96
Our Scenarios  | (4)  | Yes           | 148 | 100 | 0.67  | 0.67 (0.03); 0.61-0.74; 0.003 (0.001); 0.95  | 0.67 (0.02); 0.63-0.72; 0.002 (0.001); 0.98
Our Scenarios  | (5)  | No            | 148 | 100 | 0.71  | -                                            | 0.72 (0.03); 0.67-0.78; 0.009 (0.001); 0.99
Our Scenarios  | (6)  | Yes           | 148 | 100 | 0.71  | 0.71 (0.03); 0.64-0.77; 0.001 (0.001); 0.99  | 0.70 (0.03); 0.65-0.77; −0.004 (0.001); 0.93
I & G          | (1)  | Yes           | 250 | 3   | 0.79  | 0.79 (0.03); 0.74-0.84; 0.002 (0.001); 0.97  | 0.79 (0.01); 0.79-0.81; 0.007 (0.001); 0.94

In contrast to our model performing relatively well in both Albert’s GRE- and FM-based scenarios, both of Albert’s competing GRE and FM models perform poorly in our simulation scenarios from Section 4. Most of the biases and MSEs for the GRE and FM models are larger than 0.1 (in magnitude) and 0.03, respectively. Even when the true disease status is not known (case 5), the two models do not capture the true AUC in their 95% C.I.s, while our proposed model is unaffected by the absence of the true disease status. This discrepancy is mainly driven by the fact that the proposed model employs rater-specific parameters (aj, bj), patient-specific parameters (ui) and their interaction bjui, creating random effects for both rater and patient, whereas Albert’s competing models are limited to capturing the dependence structure between raters’ classifications.

For our simulation scenarios with known disease status, we additionally compare our model with the Dorfman-Berbaum-Metz (DBM) method (1992)40. The DBM approach fits a mixed-effects ANOVA model to jackknife pseudovalues of accuracy for multirater scenarios. This method is restricted to assessing accuracy when the true disease status of patients is known. Overall, we observe that the DBM method exhibits similar levels of bias to our proposed method, and lower bias than Albert’s GRE and FM methods, in the scenarios examined in the simulation studies. The coverage probabilities of the DBM approach tend to be larger than those of our proposed approach.

In the last scenario, from Ishwaran and Gatsonis (2000) (the simulation example described earlier in this section), a vector of patient- and rater-specific covariates is introduced into the modeling. Our model has patient-specific parameters (ui) that can incorporate a patient-specific covariate if available, rater-specific parameters (aj, bj) that can play the role of a vector of rater-specific covariates, and their interactions bjui, which allow more complex structures driven by patient-level and rater-level covariates. This is seen numerically in the last scenario, where our model (and the DBM method) captures the true AUC with the smallest MSE, while the GRE and FM models generate estimates that differ substantially from the true AUC.

In conclusion, these extensive simulation studies demonstrate that our method provides a flexible modeling approach that can accommodate many different multirater ordinal data structures, whether the true disease status is known or unknown.

7 ∣. GOODNESS-OF-FIT TESTS

One way to assess the goodness-of-fit of the proposed Bayesian latent variable model is to use a posterior predictive p-value41,42, the Bayesian version of a tail-area probability, evaluated using posterior simulations. Formally, the posterior predictive p-value, pB, is defined as follows:

$$p_B = P\big(T(W^{rep}, \theta) \ge T(W, \theta) \mid W\big),$$

where W = (W11, W12, ⋯ , Wmn), θ = (a, b, u, α, z), and the replicated data are $W^{rep} = (W_{11}^{rep}, W_{12}^{rep}, \dots, W_{mn}^{rep})$. Technical details are described in Gelman et al. (2013)42. If the model assumptions are not well reflected in the observed data, the posterior predictive p-value will be near 0 or 1.

To compute the posterior predictive p-value, we evaluate the following integral

$$p_B = \int\!\!\int I_{\{T(W^{rep}, \theta) \ge T(W, \theta)\}}\; p(W^{rep} \mid \theta)\, p(\theta \mid W)\, dW^{rep}\, d\theta,$$

where I is the indicator function. From the posterior samples of θ, we draw one replicated dataset Wrep,r for the r-th posterior sample θ(r) (r = 1, ⋯ , R), and evaluate the test quantities T(Wrep,r, θ(r)) and T(W, θ(r)). The posterior predictive p-value is estimated as the proportion of the R simulations for which the predictive test quantity is greater than or equal to the realized test quantity.

In our application, we assess whether our proposed model is adequate based on the following test quantity for ordinal data

$$T(W, \theta) = \sum_{k=1}^{K} \frac{(N_k - p_k N)^2}{p_k N},$$

where Nk is the number of observed ratings with Wi,j = k for i = 1, ⋯ , m; j = 1, ⋯ , n; k = 1, ⋯ , K, and $N = \sum_{k=1}^{K} N_k$. The probability pk = P(Wi,j = k) for k = 1, ⋯ , K is estimated based on Eq. (2) and the posterior samples of the parameters. We apply this goodness-of-fit approach to the two applications in Section 8.
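A sketch of the whole check, under the same hypothetical posterior-sample matrices as before (with m, n, K and the observed rating matrix W as defined in Section 2): for each draw r, replicate the ratings from Eq. (2), evaluate T for the replicated and observed data, and report the exceedance proportion.

```r
## Posterior predictive p-value via the chi-square-type quantity above (sketch).
T_stat <- function(W, p) {                 # W: ratings, p: length-K cell probs
  N  <- length(W)
  Nk <- tabulate(W, nbins = length(p))     # counts N_k of each category
  sum((Nk - p * N)^2 / (p * N))
}

R <- nrow(u_draws)
exceed <- logical(R)
for (r in 1:R) {
  eta  <- outer(u_draws[r, ], b_draws[r, ]) +          # m x n matrix of b_j u_i
          matrix(a_draws[r, ], m, n, byrow = TRUE)     # ... plus a_j
  cuts <- c(-Inf, alpha_draws[r, ], Inf)
  Wrep <- matrix(findInterval(rnorm(m * n, eta, 1), cuts), m, n)  # replicate Eq. (2)
  pk   <- sapply(1:K, function(k)
            mean(pnorm(cuts[k + 1] - eta) - pnorm(cuts[k] - eta)))  # P(W = k)
  exceed[r] <- T_stat(Wrep, pk) >= T_stat(W, pk)
}
p_B <- mean(exceed)                        # values near 0 or 1 flag poor fit
```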

8 ∣. APPLICATIONS

8.1 ∣. Breast cancer study with known disease status

Despite the widespread use of mammography as a screening tool for breast cancer, issues still surround its use due to variability between radiologists’ ratings. In view of this, Beam et al. (2003)23 undertook a large-scale breast screening study to assess the accuracy of, and agreement between, radiologists. One hundred and forty-eight mammograms were randomly obtained from a large breast cancer screening program (affiliated with the University of Pennsylvania); among these, 64 (43%) were from women with cancer. Radiologists from randomly sampled mammography facilities accredited by the U.S. Food and Drug Administration as of January 1, 1998 were invited to participate, and 104 radiologists were recruited. Each rater was asked to grade each mammogram using a modified version of the Breast Imaging Reporting and Data System (BI-RADS)6 ordinal classification scale: 1=normal, return to normal screening; 2=benign, return to normal screening; 3=probably benign; 4=possibly malignant, biopsy recommended; 5=probably malignant, biopsy strongly recommended. We apply our proposed model to the Beam et al. (2003)23 dataset with the goal of assessing the accuracy of the radiologists’ ratings while accounting for the variability between radiologists.

First, to fit the model, we ran the Gibbs sampler for 25,000 iterations. After a burn-in of 15,000 iterations, we take every 5th draw of the sampler (thinning to decrease autocorrelation), for a total of R = 2,000 retained iterations. Visual assessment of trace plots42 supports convergence of the MCMC chain. For the hyper-parameters in Section 3.1, we set γ = τ = 0.1 and δ = 15. For the identifiability condition, we shift the sampled ui in each MCMC iteration to have E(Φ(ui)) = 0.43 (the disease prevalence in the study) by solving $\frac{1}{m}\sum_{i=1}^{m} \Phi(u_i + C) = 0.43$ for C and setting $u_i^{*} = u_i + C$ for all i = 1, ⋯ , m in each iteration. In practice, we use the multiroot function in the R package rootSolve to solve this equation.
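For the record, the burn-in and thinning bookkeeping above amounts to the following trivial sketch:

```r
## 25,000 iterations, 15,000 burn-in, keep every 5th draw.
keep <- seq(15001, 25000, by = 5)
length(keep)    # 2000 retained draws, matching R = 2,000
```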

Table 2 contains the posterior summaries of the overall (hyper-)parameters. The posterior mean of the mean bias over raters, $\bar{a}$, is −1.41 (95% C.I. [−1.46, −1.36]); for simpler interpretability, we shift this to 0. It is worth noting that we do not shift the aj’s to mean 0 in each MCMC iteration, but rather after all MCMC iterations. The posterior mean of the overall mean diagnostic magnifier of the group, $\bar{b}$, is 0.55 (95% C.I. 0.49-0.62). The posterior means (and 95% C.I.s) of the four cutpoints, after the post-processing step setting $\bar{a} = 0$, are $\hat{\alpha}_1 = -0.96$ (−1.03, −0.90), $\hat{\alpha}_2 = -0.24$ (−0.29, −0.19), $\hat{\alpha}_3 = 0.75$ (0.72, 0.78) and $\hat{\alpha}_4 = 2.65$ (2.58, 2.71). The posterior means of τa and τb are 0.46 and 80.70, implying that the aj’s are estimated with large uncertainty because the raters vary widely in their diagnostic biases. It is also worth noting, as presented in Figure 3, that most of the raters have estimated diagnostic magnifiers $\hat{b}_j$ greater than 0.4, indicating their level of skill. Rater 51’s estimated magnifier is $\hat{b}_{51} = 0.79$, indicating a very skilled rater who can successfully distinguish disease status according to a patient’s true disease trait (as reflected by ui). Estimated rater biases $\hat{a}_j$ vary between −1.11 and 0.96. For example, rater 45 has $\hat{a}_{45} = 0.96$, indicating that rater 45 tends to assign higher ratings on average relative to other raters in the study. Furthermore, the majority of the patients with breast cancer have estimated latent severity values greater than 0 (the black solid line in Figure 4), which suggests that higher ratings are assigned to them, given that all of the estimates of bj are positive.

TABLE 2.

Posterior summary of (hyper-) parameters for our proposed model for the Beam et al. (2003)23 mammogram dataset with 104 radiologists’ classifications of 148 mammograms.

Parameter   Posterior Mean   95% Credible Interval   Note
α1          −2.37            (−2.44, −2.31)          Ordinal cut-points
α2          −1.65            (−1.69, −1.60)
α3          −0.66            (−0.69, −0.63)
α4          1.24             (1.17, 1.30)
τa          0.46             (0.34, 0.59)            Rater bias - precision
μb          0.55             (0.49, 0.62)            Rater magnifier - mean
τb          80.70            (55.70, 112.01)         Rater magnifier - precision

FIGURE 3.

The Beam et al. (2003)23 mammogram data with 104 radiologists’ classifications of 148 mammograms: the posterior means (95% posterior intervals) of the individual aj, bj and ui parameters based on our proposed model. The red stars (black circles) in the plot for ui are for subjects with disease (no disease).

FIGURE 4.

The Beam et al. (2003)23 mammogram data with 104 radiologists’ classifications of 148 mammograms: estimation of the distribution of patient latent disease severity u. The estimated distributions are depicted separately for the patients with D = 0 (no disease) and for the patients with D = 1 (disease), along with the estimated distribution for all patients.

In Table 4, our estimates of the rater-specific AUCs are compared with the estimates from the DBM method40. Among the 104 rater-specific AUCs, the minimum, 1st quartile, 2nd quartile, 3rd quartile and maximum values for each method are presented. The interquartile ranges of the estimated rater-specific AUCs from the two methods largely coincide. Figure 5 illustrates the posterior estimates of ROC curves for individual raters and for the overall group of all 104 raters. The left plot depicts the estimated ROC curves and AUCs for rater IDs 41, 45 and 101, representing a rater with low departure aj from the overall mean of 0 and low magnifier (low skill level) bj (ID 41); a rater with high departure (strong bias) aj from the overall mean of 0 and high magnifier (strong skill level) bj (ID 45); and a rater with high departure aj from the overall mean of 0 and low magnifier bj (ID 101). As discussed in the simulation study, when the diagnostic magnifier is high (ID 45), the area under the ROC curve is high. In addition to the estimated ROC curves for rater IDs 41, 45 and 101, the empirical ROC curves for the raters corresponding to the minimum, 1st quartile, 3rd quartile and maximum of the set of all 104 AUCs are presented. Figure 5(b) depicts the estimated overall summary ROC curves for the group of 104 raters derived from different methods. The estimated smoothed and non-smoothed ROC curves based upon our methods (Eqs. 6-8) in Section 5 and the empirical overall ROC curve coincide closely with one another. The estimated overall ROC curve based on Eq. (6) (ROC 1 in the figure) has an estimated AUC of 0.88 (SD: 0.01), the estimated overall ROC curve based on Eq. (8) (ROC 2 in the figure) gives an estimated AUC of 0.90 (SD: 0.01), and the empirical overall ROC curve gives an estimated AUC of 0.90 (SE: 0.04).

TABLE 4.

Estimates (standard errors) of the rater-specific AUC are obtained under different models: Gaussian random effect (GRE) and finite mixture (FM) models from Albert (2007), Dorfman-Berbaum-Metz (DBM) method from Dorfman et al. (1992), and our proposed model.

Beam data (known disease status): estimated rater-specific AUC
Method      Min.         1st Quartile  2nd Quartile  3rd Quartile  Max.
DBM         0.73 (0.03)  0.88 (0.02)   0.91 (0.02)   0.93 (0.02)   0.97 (0.01)
Our Model   0.79 (0.01)  0.88 (0.01)   0.89 (0.01)   0.91 (0.01)   0.93 (0.01)

Holmquist data (unknown disease status): estimated rater-specific AUC, by rater
Method      1            2            3            4            5            6            7
GRE         0.87 (0.04)  0.83 (0.05)  0.89 (0.04)  0.99 (0.01)  0.86 (0.05)  0.79 (0.05)  0.95 (0.04)
FM          0.94 (0.02)  0.89 (0.03)  0.90 (0.04)  0.94 (0.03)  0.93 (0.02)  0.85 (0.03)  1.00 (0.01)
Our Model   0.93 (0.02)  0.91 (0.01)  0.89 (0.01)  0.89 (0.01)  0.89 (0.01)  0.85 (0.01)  0.91 (0.01)

FIGURE 5.

Different ROC curves from the Beam et al. (2003)23 mammogram dataset. The left plot displays estimated ROC curves and corresponding AUCs for a selection of individual raters based upon our model, along with empirical ROC curves for 4 different individual raters. The right plot displays the smoothed overall summary ROC curves based upon the group of 104 raters. The red curve is the smoothed ROC based on Eq. (7), the pink curve the ROC based on Eq. (8), and the blue curve the ROC based on Eq. (6).

We check the goodness-of-fit of the model for the breast imaging study data using the posterior predictive p-value method described in Section 7. From a comparison between the predictive test quantity T(wrep, θ) and the realized test quantity T(w, θ), the estimated posterior predictive p-value is 0.22, which lies within a reasonable range (between 0.05 and 0.95)42, implying the model provides a reasonable fit overall.

8.2 ∣. Uterine cancer study with unknown disease status

We also consider the uterine cancer data from Holmquist et al. (1968)27, in which seven raters were asked to classify each of 118 slides (items) with potential carcinoma in situ of the uterine cervix on a five-category ordinal scale: 1=Negative; 2=Atypical squamous hyperplasia; 3=Carcinoma in situ; 4=Squamous carcinoma with early stromal invasion; and 5=Invasive carcinoma. Within this unknown-disease-status setting, we apply our proposed model to the Holmquist et al. (1968) dataset and compare its performance with that of the available competing models, the FM and GRE models from Albert (2007). Table S1 in the supplementary material contains the posterior summaries of the overall (hyper-)parameters. The posterior means (and 95% C.I.s) of the four cutpoints from our model are $\hat{\alpha}_1 = -1.18$ (−2.74, −0.35), $\hat{\alpha}_2 = 0.59$ (−0.89, 1.37), $\hat{\alpha}_3 = 3.09$ (1.58, 3.93) and $\hat{\alpha}_4 = 4.57$ (2.98, 5.53).

Table 4 includes our estimates of the rater-specific AUCs for all seven raters, along with the estimates from the GRE and FM models. In general, the methods produce similar estimates of the rater-specific AUCs. Figure 6 depicts posterior estimates of ROC curves for individual raters and the estimated overall summary ROC curves for the group of 7 raters derived from different methods. The estimated overall ROC curve based on Eq. (6) (ROC 1 in the figure) has an estimated AUC of 0.88 (SD: 0.01), and the estimated overall ROC curve based on Eq. (8) (ROC 2 in the figure) gives an estimated AUC of 0.90 (SD: 0.01). Based on the estimated posterior predictive p-value of 0.49, the proposed model provides a reasonable fit to the Holmquist et al. (1968) dataset.

FIGURE 6.

Estimated ROC curves from the Holmquist et al. (1968)27 uterine cancer dataset based upon our proposed model. The left plot displays estimated ROC curves and corresponding AUCs for all 7 raters. The right plot displays the smoothed overall summary ROC curves based upon the group of all 7 raters. The red curve is the smoothed ROC based on Eq. (7), the red dotted curve the ROC based on Eq. (8), and the black dotted curve the ROC based on Eq. (6).

9 ∣. SUMMARY

Assessing the diagnostic accuracy of a test procedure is challenging in large multirater studies where many raters contribute ratings on the same sample of patients’ test results. In this paper, we propose a Bayesian hierarchical two-stage latent variable model to analyze multirater ordinal rating data. The model provides novel measures for the robust estimation of rater diagnostic ability (bias and magnifier) parameters and patient latent trait values, whether the true disease outcomes are known or unknown. Such information can provide valuable training information for individual raters and clearly quantifies the degree of variability between raters in a diagnostic setting. Diagnostic accuracy is not a fixed property of a screening test and can vary with patient subgroups and their spectrum of disease, the clinical setting, and the test raters43,44. It is essential to consider these elements when evaluating the diagnostic accuracy of a test. Our proposed novel measure of rater ability (the magnifier) provides an alternative measure of diagnostic accuracy across patient populations. An important strength of our proposed modeling approach is that, in contrast to existing models30, the measures of rater ability (magnifier) and bias are robust to misspecification of the latent trait distribution of patients’ true disease status, and our model flexibly captures the dependencies for both patients and raters and their interactions. Here, the latent trait is an unobserved continuous variable representing the underlying latent disease severity of each patient.

Modeling the probability of the true binary disease status through the patients’ latent traits makes it convenient to assess diagnostic accuracy via estimated sensitivities and specificities and ROC analysis. A further advantage of our proposed method is that a strict distributional specification for the latent trait variable is not required, in contrast to frequentist methods24. Instead, the data themselves can inform the distribution of the latent trait through Bayesian posterior computation. A further strength of the proposed approach is that each rater is not required to rate all patients, as long as each patient has several ratings. The Gibbs sampler can be slightly modified to accommodate missing ratings: in Step 2 of the Gibbs sampler, we only need to sample the zij’s corresponding to missing ratings from untruncated normal distributions.
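As a sketch of the missing-rating modification just described (hypothetical names, matching the earlier Gibbs notation):

```r
## In Step 2, a missing rating w_ij imposes no truncation on z_ij.
z_ij <- rnorm(1, a[j] + b[j] * u[i], 1)   # untruncated draw when w_ij is missing
```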

The proposed method assumes common ordinal thresholds for all raters, which can be problematic. However, the rater-specific bias parameters (the aj’s) help to alleviate this limitation. Further, the common-thresholds assumption can be relaxed to allow rater-specific thresholds, although that would require a large dataset to ensure sufficient data to estimate the many additional parameters. Overall, our proposed approach is well suited to large-scale studies and can be applied efficiently using our R package in real-life study settings. Our model can also be extended to investigate the effects of important factors on rater diagnostic skills and on patient latent traits, a topic for future research.

For computational convenience in our Gibbs sampling algorithm, we assign independent normal priors to the rater bias and magnifier parameters. To investigate the correlation between rater bias and rater magnifier, a bivariate normal prior could instead be assigned to them. Furthermore, for a large-scale study, nonparametric techniques, such as assigning Dirichlet process mixture priors19,20 to the rater parameters, may be adopted to help cluster raters.
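A minimal sketch of how such a bivariate prior could be specified is given below; the hyperparameters mu0, rho, and Sigma0 are illustrative assumptions, and MASS::mvrnorm is used only for the multivariate normal draw.

```r
# Bivariate normal prior on each rater's (bias, magnifier) pair,
# allowing a priori correlation between the two ability parameters
library(MASS)

mu0    <- c(0, 1)                          # assumed prior means for (bias, magnifier)
rho    <- 0.3                              # assumed prior correlation
Sigma0 <- matrix(c(1, rho, rho, 1), 2, 2)  # prior covariance matrix

J <- 7                                     # number of raters (e.g., 7 in the uterine cancer study)
prior_draws <- mvrnorm(J, mu = mu0, Sigma = Sigma0)
colnames(prior_draws) <- c("bias_a", "magnifier_b")
```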

Supplementary Material

sm1

11 ∣. ACKNOWLEDGEMENTS

The authors are grateful for the support provided by grants R01CA172463 (Nelson) and R01CA226805 (Nelson, Kim and Lin) from the National Institutes of Health. This work is also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (No. NRF-2020R1F1A1A01048168). We thank Dr. Craig Beam for providing us with the mammography study data. We thank Don Edwards for his valuable guidance and advice throughout this manuscript.

10 ∣. SOFTWARE

Software in the form of an R package, together with complete documentation, is available at https://lit777.github.io/BayesODT.html.

References

1. Qu Y, Tan M, Kutner M. Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics 1996: 797–810.
2. Obuchowski N, Goske M, Applegate K. Assessing physicians’ accuracy in diagnosing paediatric patients with acute abdominal pain: measuring accuracy for multiple diseases. Statistics in Medicine 2001; 20(21): 3261–3278.
3. Nelson K, Edwards D. On population-based measures of agreement for binary classifications. Canadian Journal of Statistics 2008; 36(3): 411–426.
4. Obuchowski N. An ROC-type measure of diagnostic accuracy when the gold standard is continuous-scale. Statistics in Medicine 2006; 25(3): 481–493.
5. Lin X, Chen H, Edwards D, Nelson K. Modeling rater diagnostic skills in binary classification processes. Statistics in Medicine 2018; 37(4): 557–571.
6. American College of Radiology. Illustrated Breast Imaging Reporting and Data System (BI-RADS™), 3rd edn. Reston, VA: American College of Radiology; 1998. American College of Radiology. ACR BI-RADS®–Mammography, 4th edn. In: ACR Breast Imaging Reporting and Data System, Breast Imaging Atlas. Reston, VA: American College of Radiology; 2003.
7. Epstein J, Allsbrook W Jr, Amin M, Egevad L, ISUP Grading Committee. The 2005 International Society of Urological Pathology (ISUP) consensus conference on Gleason grading of prostatic carcinoma. The American Journal of Surgical Pathology 2005; 29(9): 1228–1242.
8. Johnson V, Albert J. Ordinal data modeling. Springer Science & Business Media; 2006.
9. Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960; 20(1): 37–46.
10. Conger A. Integration and generalization of kappas for multiple raters. Psychological Bulletin 1980; 88(2): 322.
11. Banerjee M, Capozzoli M, McSweeney L, Sinha D. Beyond kappa: A review of interrater agreement measures. Canadian Journal of Statistics 1999; 27(1): 3–23.
12. Williamson J, Lipsitz S, Manatunga A. Modeling kappa for measuring dependent categorical agreement data. Biostatistics 2000; 1(2): 191–202.
13. Berry K, Johnston J, Mielke P Jr. Weighted kappa for multiple raters. Perceptual and Motor Skills 2008; 107(3): 837–848.
14. Gwet K. Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters. Advanced Analytics, LLC; 2014.
15. Quarfoot D, Levine R. How robust are multirater interrater reliability indices to changes in frequency distribution? The American Statistician 2016; 70(4): 373–384.
16. Obuchowski N, Beiden S, Berbaum K, et al. Multireader, multicase receiver operating characteristic analysis: an empirical comparison of five methods. Academic Radiology 2004; 11(9): 980–995.
17. Albert J, Chib S. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 1993; 88(422): 669–679.
18. Ishwaran H. Univariate and multirater ordinal cumulative link regression with covariate specific cutpoints. Canadian Journal of Statistics 2000; 28(4): 715–730.
19. Cao J, Stokes S, Zhang S. A Bayesian approach to ranking and rater evaluation: An application to grant reviews. Journal of Educational and Behavioral Statistics 2010; 35(2): 194–214.
20. Kottas A, Müller P, Quintana F. Nonparametric Bayesian modeling for multivariate ordinal data. Journal of Computational and Graphical Statistics 2005; 14(3): 610–625.
21. Savitsky T, Dalal S. Bayesian non-parametric analysis of multirater ordinal data, with application to prioritizing research goals for prevention of suicide. Journal of the Royal Statistical Society: Series C (Applied Statistics) 2014; 63(4): 539–557.
22. Bao J, Hanson T. Bayesian nonparametric multivariate ordinal regression. Canadian Journal of Statistics 2015; 43(3): 337–357.
23. Beam C, Conant E, Sickles E. Association of volume and volume-independent factors with accuracy in screening mammogram interpretation. Journal of the National Cancer Institute 2003; 95(4): 282–290.
24. Uebersax J, Grove W. A latent trait finite mixture model for the analysis of rating agreement. Biometrics 1993: 823–835.
25. Johnson V. On Bayesian analysis of multirater ordinal data: An application to automated essay grading. Journal of the American Statistical Association 1996; 91(433): 42–51.
26. Tanner M, Wong W. The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 1987; 82(398): 528–540.
27. Holmquist N, McMahan C, Williams O. Variability in classification of carcinoma in situ of the uterine cervix. Obstetrical & Gynecological Survey 1968; 23(6): 580–585.
28. Collins J, Albert P. Estimating diagnostic accuracy without a gold standard: a continued controversy. Journal of Biopharmaceutical Statistics 2016; 26(6): 1078–1082.
29. Albert P, Dodd L. A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics 2004; 60(2): 427–435.
30. Albert P. Random effects modeling approaches for estimating ROC curves from repeated ordinal tests without a gold standard. Biometrics 2007; 63(2): 593–602.
31. Wang Z, Zhou X. Random effects models for assessing diagnostic accuracy of traditional Chinese doctors in absence of a gold standard. Statistics in Medicine 2012; 31(7): 661–671.
32. Robert C, Casella G. Monte Carlo statistical methods. Springer Science & Business Media; 2013.
33. Kim C, Daniels M, Hogan J, Choirat C, Zigler C. Bayesian methods for multiple mediators: Relating principal stratification and causal mediation in the analysis of power plant emission controls. The Annals of Applied Statistics 2019; 13(3): 1927.
34. Swartz T, Haitovsky Y, Vexler A, Yang T. Bayesian identifiability and misclassification in multinomial data. Canadian Journal of Statistics 2004; 32(3): 285–302.
35. Agresti A. Analysis of ordinal categorical data. John Wiley & Sons; 2010.
36. Jones G, Johnson W, Hanson T, Christensen R. Identifiability of models for multiple diagnostic testing in the absence of a gold standard. Biometrics 2010; 66(3): 855–863.
37. Geman S, Geman D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. Elsevier; 1987: 564–584.
38. Hanley J, McNeil B. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143(1): 29–36.
39. Ishwaran H, Gatsonis C. A general class of hierarchical ordinal regression models with applications to correlated ROC analysis. Canadian Journal of Statistics 2000; 28(4): 731–750.
40. Dorfman D, Berbaum K, Metz C. Receiver operating characteristic rating analysis: generalization to the population of readers and patients with the jackknife method. Investigative Radiology 1992; 27(9): 723–731.
41. Meng X. Posterior predictive p-values. The Annals of Statistics 1994; 22(3): 1142–1160.
42. Gelman A, Carlin J, Stern H, Dunson D, Vehtari A, Rubin D. Bayesian data analysis. Chapman and Hall/CRC; 2013.
43. Leeflang M, Deeks J, Gatsonis C, Bossuyt P. Systematic reviews of diagnostic test accuracy. Annals of Internal Medicine 2008; 149(12): 889–897.
44. Šimundić AM. Measures of diagnostic accuracy: basic definitions. EJIFCC 2009; 19(4): 203.
