Author manuscript; available in PMC: 2021 Apr 1.
Published in final edited form as: Stat Methods Med Res. 2019 May 30;29(4):1112–1128. doi: 10.1177/0962280219852649

Bayesian Hierarchical Latent Class Models for Estimating Diagnostic Accuracy

Chunling Wang 1, Xiaoyan Lin 1, Kerrie P Nelson 2
PMCID: PMC6884669  NIHMSID: NIHMS1057624  PMID: 31146651

Abstract

The diagnostic accuracy of a test or rater has a crucial impact on clinical decision making. The assessment of diagnostic accuracy for multiple tests or raters also merits much attention. A Bayesian hierarchical conditional independence latent class model for estimating the sensitivities and specificities of a large group of tests or raters is proposed, applicable to both with-gold-standard and without-gold-standard situations. Through the hierarchical structure, not only the sensitivities and specificities of individual tests but also the diagnostic performance of the whole group of tests are estimated. For a small group of tests or raters, the proposed model is further extended by introducing pairwise covariances between tests to improve the fit and allow more modeling flexibility. Correlation residual analysis is applied to detect significant covariances between multiple tests. An efficient Just Another Gibbs Sampler (JAGS) implementation is adopted for both models. Three real data sets from the literature are analyzed to illustrate the proposed methods.

Keywords: Binary diagnostic outcome, Latent class model, Multiple tests, Sensitivity, Specificity

1. Introduction

In medical practice, multiple diagnostic tests (or raters) are often utilized to diagnose the disease status of a patient. Assessing the diagnostic accuracy of individual tests is important, and the diagnostic accuracy based upon results from multiple raters also merits much attention. For a diagnostic test with a binary outcome, when the true disease status (with or without disease) is known or a gold standard exists, the estimation of the sensitivity and specificity of the test is straightforward. However, on many occasions the true disease status or a gold standard is prohibitively expensive or unethical to obtain. For this situation, many latent class models, where the true disease status is unknown and therefore treated as latent, have been proposed to assess the diagnostic accuracy of tests.

The early development of latent class models lay in conditional independence models, where diagnostic results for the same patient across multiple tests are independent conditional on a patient’s true disease status. Hui and Walter1 laid a solid foundation in studying the false positive and negative error rates of two diagnostic tests in two populations. Joseph et al.2 discussed a similar problem from a Bayesian perspective. However, this conditional independence between multiple tests does not always hold due to some common factors connecting tests.3 Vacek4 and Brenner5, among others, demonstrated that ignoring the possible dependence between tests can lead to biased estimates of the prevalence of disease and accuracy of tests.

In the literature of latent class models, there are two general approaches for handling the conditional dependence between multiple tests. Vacek4, Torrance-Rynard and Walter6, Yang and Becker7, and Jones et al.8, among others, directly incorporated conditional pairwise covariances between tests. Qu et al.9 developed latent Gaussian random effects (GRE) to model the conditional dependence between multiple tests. Albert and Dodd10 provided a cautionary note and guidance in using latent class models with various dependence structures. In this paper, we adopt the first type of the conditional dependence model due to the ease of interpretation. A systematic review of latent class models can be found in van Smeden et al.11 and Collins and Albert.12

Dendukuri and Joseph3 proposed Bayesian approaches for handling both of these conditional dependence structures. However, they considered only positive correlation between tests and dealt with only two tests, without providing a generalized method to handle more tests. Their full conditional distributions (for two tests) for the Gibbs sampler implementation were complicated, and when the number of tests increases, their likelihood based on the multinomial distribution of the test-result frequencies grows exponentially. Our proposed model, an extension of Jones et al.8, is amenable to accommodating more tests and allows for both positive and negative correlations between tests. The multinomial property is just a special case used in our computational algorithms.

There has been much discussion about model identifiability for latent class models; see, for example, Rothenberg13, Dendukuri et al.14, and Gustafson.15 Jones et al.8 investigated the identifiability of the first type of conditional dependence model and concluded that a sufficient number of degrees of freedom does not guarantee unique estimates of prevalence and test performance. They provided symbolic algebra methodology to determine whether a proposed study design would lead to an identifiable model. When using Bayesian approaches, identifiability is not mandatory if good prior information is available.8 For example, Dendukuri and Joseph3 adopted informative priors to avoid nonidentifiability. In this paper, we propose to apply correlation residual analysis to reduce the number of parameters by including only significant covariance terms in the model, improving identifiability.

In the literature, most methods can only accommodate a small number of tests or raters. For a large group of tests or raters, not only is the diagnostic accuracy of individual tests or raters of interest, but so is the diagnostic performance of the whole group. Zhang et al.16 proposed a latent class model with crossed random effects for subjects and raters to estimate the diagnostic accuracy of a group of raters. Lin et al.17 described a modeling approach to assess each rater’s diagnostic skills by linking rater binary decisions with patient true disease status through patient latent disease severity. In this paper, for the first time, the conditional independence model and the pairwise covariance model are further developed to analyze a large number of tests under the Bayesian framework. Unlike Dendukuri and Joseph’s approach3, where individual beta priors are assigned to the sensitivities (specificities), our method assumes that all the sensitivities (specificities) follow a common beta prior, with the two hyperparameters reflecting the group-level sensitivity (specificity). Our proposed “Poisson zero trick” JAGS18 algorithms are easy to implement and flexibly incorporate a large number of tests. The algorithms are feasible for analyzing dozens of tests under the conditional dependence model and even hundreds of tests under the conditional independence model.

The rest of this article is organized as follows. Section 2 explicitly describes the conditional independence model and the pairwise covariance model with their hierarchical priors. The correlation residual analysis is also introduced in this section. Section 3 shows details about the computation strategies with regard to their implementation in JAGS. Simulation studies to investigate the performance of the proposed methods are shown in Section 4. Section 5 illustrates the proposed methods with three real data examples. Lastly, Section 6 summarizes the main results with some discussions.

2. Models

Suppose that K tests (or raters) are used to evaluate n subjects, yielding vectors of binary test results T1, …, Tn, where Ti = (Ti1,…,TiK) for the ith subject (i = 1,…,n). Denote Tij = 1 if the test result is positive and Tij = 0 if negative. Let Di denote the latent true disease status for subject i, with 1 being positive and 0 negative. Denote by π the underlying disease prevalence P(Di = 1). Then, the likelihood function for a latent class model for binary results is:

$$L = \prod_{i=1}^{n} \left\{ \pi P(T_i = t_i \mid D_i = 1) + (1 - \pi) P(T_i = t_i \mid D_i = 0) \right\} \equiv \prod_{i=1}^{n} L_i, \qquad (1)$$

where Li = πP(Ti = ti | Di = 1) + (1 − π)P(Ti = ti | Di = 0).

2.1. Hierarchical conditional independence model (Model M1)

We first consider the conditional independence model, where the diagnostic results of a subject are independent across all tests conditional on the true disease status. The sensitivity and specificity of the jth test are denoted as Sej and Spj, respectively. Then, according to the conditional independence structure,

$$\begin{aligned} P(T_i = t_i \mid D_i = 1) &= \prod_{j=1}^{K} P(T_{ij} = t_{ij} \mid D_i = 1) = \prod_{j=1}^{K} Se_j^{t_{ij}} (1 - Se_j)^{1 - t_{ij}};\\ P(T_i = t_i \mid D_i = 0) &= \prod_{j=1}^{K} P(T_{ij} = t_{ij} \mid D_i = 0) = \prod_{j=1}^{K} (1 - Sp_j)^{t_{ij}} Sp_j^{1 - t_{ij}}. \end{aligned} \qquad (2)$$
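As a concrete illustration, the conditional independence likelihood in equations (1) and (2) can be sketched in a few lines of Python (the numeric values in the usage note below are illustrative, not from the paper):

```python
def p_results_given_status(t, se, sp, diseased):
    """P(T_i = t | D_i) under conditional independence (equation (2))."""
    p = 1.0
    for tij, se_j, sp_j in zip(t, se, sp):
        if diseased:
            # contribute Se_j for a positive result, 1 - Se_j for a negative one
            p *= se_j if tij == 1 else (1.0 - se_j)
        else:
            # contribute 1 - Sp_j for a positive result, Sp_j for a negative one
            p *= (1.0 - sp_j) if tij == 1 else sp_j
    return p

def likelihood_contribution(t, se, sp, pi):
    """L_i = pi * P(t | D=1) + (1 - pi) * P(t | D=0) (equation (1))."""
    return (pi * p_results_given_status(t, se, sp, True)
            + (1.0 - pi) * p_results_given_status(t, se, sp, False))
```

For example, with Se = (0.9, 0.8), Sp = (0.95, 0.85), and π = 0.4, the pattern t = (1, 1) contributes Li = 0.4 × 0.72 + 0.6 × 0.0075 = 0.2925.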

Individual sensitivities and specificities are assumed to be random effects and to independently follow common beta distributions, respectively:

$$Se_j \overset{iid}{\sim} \mathrm{Beta}\big(\omega_1(\kappa_1 - 2) + 1,\; (1 - \omega_1)(\kappa_1 - 2) + 1\big), \qquad Sp_j \overset{iid}{\sim} \mathrm{Beta}\big(\omega_2(\kappa_2 - 2) + 1,\; (1 - \omega_2)(\kappa_2 - 2) + 1\big),$$

where ω1 is the mode of the beta distribution of test sensitivities and the concentration parameter κ1 reflects the spread of the distribution: the larger the value of κ1, the more concentrated the distribution of test sensitivities is around the mode. To ensure the existence of the mode ω1, κ1 needs to be larger than 2. Similarly, ω2 and κ2 describe the mode and spread of the distribution of test specificities. An advantage of this parameterization of the beta distribution is that it naturally captures group-level diagnostic accuracy and variation through the mode ω and the concentration parameter κ. See page 129 in Kruschke19 for more details of this parameterization. To allow the data to inform these group-level parameters, we assign vague priors to them: uniform(0.5,1) priors to ω1 and ω2, diffuse gamma priors, such as gamma(0.01,0.01), to κ1 − 2 and κ2 − 2, and a uniform(0,1) prior to the disease prevalence π.
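The (ω, κ) parameterization maps to the usual beta shape parameters as in the distributions above; a minimal sketch (the example values in the check are illustrative):

```python
def beta_shapes_from_mode(omega, kappa):
    """Convert mode omega and concentration kappa (> 2) to the beta shape
    parameters (a, b), as in Se_j ~ Beta(omega*(kappa-2)+1, (1-omega)*(kappa-2)+1).
    The mode of Beta(a, b) is (a-1)/(a+b-2), which recovers omega."""
    if kappa <= 2:
        raise ValueError("kappa must exceed 2 for the mode to exist")
    a = omega * (kappa - 2) + 1
    b = (1 - omega) * (kappa - 2) + 1
    return a, b
```

For instance, ω = 0.9 and κ = 12 give shapes (10, 2), whose mode (10 − 1)/(10 + 2 − 2) = 0.9 matches ω.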

Note that assigning uniform(0.5,1) priors to ω1 and ω2 is a natural choice, since for most reasonable tests we expect the sensitivity and specificity to be above 0.5. To avoid the “label-switching” problem in the fitting, we may follow Jones et al.8 and add the restriction Sej + Spj > 1. We can still assign Sej the same beta prior, but let Spj follow the truncated beta prior with support (1 − Sej, 1). Similarly, to prevent the sampling of π from getting stuck at the extremes 0 or 1, we may adjust the uniform(0,1) prior of the disease prevalence to a uniform(1/n, 1 − 1/n) prior.20

2.2. Hierarchical conditional dependence model (Model M2)

Following Dendukuri and Joseph,3 our second model takes into account the pairwise dependence between multiple tests. The conditional covariances between test j and test k are denoted Cjk+ and Cjk−, given that the subject is diseased or non-diseased, respectively. For instance, given subject i is diseased, the joint probabilities for test j and test k classifying subject i are as follows:

$$\begin{aligned} P(T_{ij}=1, T_{ik}=1 \mid D_i=1) &= Se_j Se_k + C_{jk}^{+},\\ P(T_{ij}=1, T_{ik}=0 \mid D_i=1) &= Se_j(1 - Se_k) - C_{jk}^{+},\\ P(T_{ij}=0, T_{ik}=1 \mid D_i=1) &= Se_k(1 - Se_j) - C_{jk}^{+},\\ P(T_{ij}=0, T_{ik}=0 \mid D_i=1) &= (1 - Se_j)(1 - Se_k) + C_{jk}^{+}. \end{aligned}$$

Similarly, given subject i is non-diseased, the joint probabilities for test j and test k are

$$\begin{aligned} P(T_{ij}=1, T_{ik}=1 \mid D_i=0) &= (1 - Sp_j)(1 - Sp_k) + C_{jk}^{-},\\ P(T_{ij}=1, T_{ik}=0 \mid D_i=0) &= (1 - Sp_j)Sp_k - C_{jk}^{-},\\ P(T_{ij}=0, T_{ik}=1 \mid D_i=0) &= Sp_j(1 - Sp_k) - C_{jk}^{-},\\ P(T_{ij}=0, T_{ik}=0 \mid D_i=0) &= Sp_j Sp_k + C_{jk}^{-}. \end{aligned}$$

It is clear that a positive (negative) value of Cjk+ reflects positive (negative) dependence between test j and test k in diagnosing the same subjects when the true disease status is positive. A similar interpretation applies to Cjk− when the true disease status is negative. To be valid covariances, Cjk+ and Cjk− must ensure that each of the probabilities above lies between 0 and 1. Therefore, the necessary constraints are

$$\begin{aligned} (1 - Se_j)(Se_k - 1) &< C_{jk}^{+} < \min(Se_j, Se_k) - Se_j Se_k,\\ (1 - Sp_j)(Sp_k - 1) &< C_{jk}^{-} < \min(Sp_j, Sp_k) - Sp_j Sp_k. \end{aligned} \qquad (3)$$
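The bounds in (3) are straightforward to compute; a small helper (a sketch, with illustrative accuracy values in the check) might look like:

```python
def cov_bounds(acc_j, acc_k):
    """Open-interval bounds from constraint (3) on a conditional covariance.
    acc_j, acc_k are the two sensitivities (for C+_jk) or the two
    specificities (for C-_jk)."""
    lower = (1 - acc_j) * (acc_k - 1)            # always <= 0
    upper = min(acc_j, acc_k) - acc_j * acc_k    # always >= 0
    return lower, upper
```

For example, with sensitivities 0.9 and 0.8 the valid range for C+jk is (−0.02, 0.08), so negative as well as positive dependence can be accommodated.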

Under this conditional pairwise dependence model, the joint probabilities of all K test results for subject i conditional on the true disease status can be expressed as follows, using the generalized form in Jones et al.8:

$$\begin{aligned} P(T_i = t_i \mid D_i = 1) &= \prod_{k=1}^{K} Se_k^{t_{ik}}(1 - Se_k)^{1 - t_{ik}} + \sum_{u=1}^{K-1}\sum_{v=u+1}^{K} (-1)^{t_{iu}+t_{iv}}\, C_{uv}^{+} \prod_{k \neq u,v}^{K} Se_k^{t_{ik}}(1 - Se_k)^{1 - t_{ik}},\\ P(T_i = t_i \mid D_i = 0) &= \prod_{k=1}^{K} Sp_k^{1 - t_{ik}}(1 - Sp_k)^{t_{ik}} + \sum_{u=1}^{K-1}\sum_{v=u+1}^{K} (-1)^{t_{iu}+t_{iv}}\, C_{uv}^{-} \prod_{k \neq u,v}^{K} Sp_k^{1 - t_{ik}}(1 - Sp_k)^{t_{ik}}. \end{aligned} \qquad (4)$$
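Equation (4) can be sketched directly; for K = 2 it reduces to the pairwise joint probabilities given above, which provides a convenient sanity check (the function name and numeric values below are illustrative):

```python
from itertools import combinations

def joint_prob_dependent(t, acc, cov, diseased=True):
    """P(T_i = t | D_i) under the pairwise-covariance model (equation (4)).
    acc: sensitivities if diseased, specificities otherwise.
    cov[(u, v)]: C+_uv or C-_uv for 0-based test indices u < v; omitted
    pairs are treated as covariance 0."""
    def marg(k):
        # marginal factor for test k given the disease status
        if diseased:
            return acc[k] if t[k] == 1 else 1 - acc[k]
        return 1 - acc[k] if t[k] == 1 else acc[k]

    K = len(t)
    base = 1.0
    for k in range(K):
        base *= marg(k)
    extra = 0.0
    for u, v in combinations(range(K), 2):
        c = cov.get((u, v), 0.0)
        rest = 1.0
        for k in range(K):
            if k not in (u, v):
                rest *= marg(k)
        # (-1)^(t_u + t_v): +c for agreeing results, -c for disagreeing ones
        extra += (-1) ** (t[u] + t[v]) * c * rest
    return base + extra
```

With Se = (0.9, 0.8) and C+12 = 0.05, the pattern (1, 1) has probability 0.72 + 0.05 = 0.77 and (1, 0) has 0.18 − 0.05 = 0.13, matching the pairwise formulas, and the four cell probabilities still sum to 1.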

For the covariances, we assign uniform priors subject to the constraints in (3). For the other parameters, we adopt the same prior specification as in Subsection 2.1.

2.3. Correlation residual plot

It is worth pointing out that not every pairwise covariance is needed in the conditional dependence model. Adding unnecessary covariances not only introduces redundant parameters but also adds computational burden and uncertainty to the model. Therefore, we introduce a correlation residual analysis9 to detect significant dependence between tests. The correlation between each pair of tests, such as test j and test k, is defined as

$$r_{jk} = \frac{P(T_j = 1, T_k = 1) - P(T_j = 1)P(T_k = 1)}{\sqrt{P(T_j = 1)\big(1 - P(T_j = 1)\big)\, P(T_k = 1)\big(1 - P(T_k = 1)\big)}}.$$

The correlation residual is the difference between the observed correlation and the model-based correlation. For the observed correlation, P(Tj = 1), P(Tk = 1), and P(Tj = 1, Tk = 1) are estimated by their sample proportions $\frac{1}{n}\sum_{i=1}^{n} t_{ij}$, $\frac{1}{n}\sum_{i=1}^{n} t_{ik}$, and $\frac{1}{n}\sum_{i=1}^{n} t_{ij} t_{ik}$, respectively. For the model-based correlation, the joint probability is

$$P(T_j = 1, T_k = 1) = \pi\, Se_j Se_k + (1 - \pi)(1 - Sp_j)(1 - Sp_k)$$

for model M1 in Section 2.1 and is

$$P(T_j = 1, T_k = 1) = \pi\big(Se_j Se_k + C_{jk}^{+}\big) + (1 - \pi)\big((1 - Sp_j)(1 - Sp_k) + C_{jk}^{-}\big)$$

for model M2 in Section 2.2. The marginal probabilities for models M1 and M2 have the same form P(Tj = 1) = πSej + (1 − π)(1 − Spj). After plugging in the estimates of the disease prevalence π, sensitivities and specificities, the model-based correlation is calculated.
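Putting these pieces together, the model-based and observed correlations (and hence the correlation residual, their difference) can be sketched as follows; setting Cjk+ = Cjk− = 0 recovers the model-based correlation under M1:

```python
import math

def model_based_correlation(pi, se_j, se_k, sp_j, sp_k, c_pos=0.0, c_neg=0.0):
    """Model-based correlation r_jk; c_pos = c_neg = 0 corresponds to M1."""
    # marginal positive-result probabilities, same form for M1 and M2
    p_j = pi * se_j + (1 - pi) * (1 - sp_j)
    p_k = pi * se_k + (1 - pi) * (1 - sp_k)
    # joint probability under the pairwise-covariance model M2
    p_jk = (pi * (se_j * se_k + c_pos)
            + (1 - pi) * ((1 - sp_j) * (1 - sp_k) + c_neg))
    return (p_jk - p_j * p_k) / math.sqrt(p_j * (1 - p_j) * p_k * (1 - p_k))

def observed_correlation(t_j, t_k):
    """Sample correlation between two tests' binary result vectors."""
    n = len(t_j)
    p_j = sum(t_j) / n
    p_k = sum(t_k) / n
    p_jk = sum(a * b for a, b in zip(t_j, t_k)) / n
    return (p_jk - p_j * p_k) / math.sqrt(p_j * (1 - p_j) * p_k * (1 - p_k))
```

The correlation residual for a pair is then `observed_correlation(...) - model_based_correlation(...)`, evaluated at posterior draws of the parameters to obtain its MCMC chain.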

For the correlation residual analysis, plotting all pairwise correlation residuals from the conditional independence model (M1) provides a simple visual way to check whether M1 sufficiently explains the data. If M1 provides a good fit, then all correlation residuals are expected to be close to zero. Otherwise, the conditional dependence model (M2) should be applied. A formal way to identify significant correlations should take into account the uncertainty of the model-based correlation. We can use the Markov Chain Monte Carlo (MCMC)21 samples of the parameters to obtain the MCMC chain of the correlation residual for each pair of tests. Based upon multiple simulation studies we conducted, the following criterion is proposed to identify significant covariance terms for M2: when a one-sided 95% credible interval (CI) of the correlation residual from M1, either the (0, 95%) or the (5%, 100%) CI, includes only negative values or only positive values, we define that correlation pair as significant and the corresponding covariance terms are added to the pairwise covariance model (M2). For nonsignificant pairs, we set the corresponding covariance terms to 0 in M2. After fitting M2, we can check the correlation residuals again; a well-fitting M2 model should now have all correlation residuals close to 0. Simulation results in Section 4 show that the one-sided credible interval criterion works well to identify significant correlations between tests.

3. Computational techniques

To implement the hierarchical structure of the models, we apply MCMC computation in Just Another Gibbs Sampler (JAGS).18 The advantage of using JAGS is that the user does not need to derive the complicated full conditional distributions or to write the sampling code manually. Instead, the user only needs to prepare the JAGS model declaration syntax and then, in R, pass the data and model syntax to JAGS, for example via the function jags() in the R2jags library.22 There are many built-in distributions in JAGS; however, none of them directly accommodates our proposed models. Consequently, we utilize the “Poisson zero trick” to accommodate the likelihood in equation (1). Specifically, for i = 1,…,n, a dummy observation zi = 0 is introduced, and zi follows a Poisson distribution with mean parameter −log(Li). In this way, the likelihood function ingeniously links the original data with the introduced zi’s; that is, the zi’s contribute the same likelihood as the original data.
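The reason the trick works is that the Poisson probability mass at zero with mean λi = −log(Li) is exp(−λi) = exp(log Li) = Li, so each dummy zero contributes exactly the likelihood term Li. A quick check of this identity in Python (the value of Li is illustrative):

```python
import math

def poisson_pmf_zero(lam):
    """Poisson pmf at z = 0 with mean lam: exp(-lam) * lam**0 / 0! = exp(-lam)."""
    return math.exp(-lam)

# With lam_i = -log(L_i), the dummy zero observation contributes
# exp(-lam_i) = exp(log(L_i)) = L_i, i.e. exactly the likelihood term.
L_i = 0.2925  # an illustrative likelihood contribution
```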

3.1. Multinomial property

For K tests, there are in total 2^K possible test-result combinations for each patient. Given the parameters, we can calculate the probability of each test-result combination according to the models. For instance, in the case of two tests, there are 4 possible test-result combinations, i.e. 00, 01, 10 and 11, where 01 represents a negative result from the first test and a positive result from the second test. The probability of each test-result combination can be calculated by plugging the combination values as ti into (2) or (4) (depending on whether model M1 or M2 is used) and then evaluating the likelihood formula Li. For example, the probability of the test results 00 is

$$P_{00} = P(T_{i1} = 0, T_{i2} = 0) = \pi\, P(T_{i1} = 0, T_{i2} = 0 \mid D_i = 1) + (1 - \pi)\, P(T_{i1} = 0, T_{i2} = 0 \mid D_i = 0).$$

Denote the frequencies of subjects for the four combinations as N00, N01, N10, and N11, respectively. The vector (N00, N01, N10, N11) follows a Multinomial(n, p) distribution, with n (= N00 + N01 + N10 + N11) the total number of subjects and p = (P00, P01, P10, P11) the probability vector. This multinomial property holds for both models M1 and M2. Adopting this multinomial property for the fitting, especially for model M2, can significantly alleviate the computational burden. In Section 4, for the second simulation scenario, we observed that it takes on average six or seven seconds to fit model M2 when using the multinomial property, while it takes more than one and a half hours to fit model M2 without it. However, when the sample size is relatively small, in the sense that many test-result combinations are observed with zero frequency, imposing the multinomial distribution distorts the estimation instead. This phenomenon is observed in the first simulation scenario in Section 4.
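A sketch of the multinomial probability vector under model M1 (illustrative parameter values; the 2^K cell probabilities necessarily sum to 1):

```python
from itertools import product

def cell_probabilities(pi, se, sp):
    """Probability of each of the 2^K test-result patterns under model M1;
    together these form the multinomial probability vector p."""
    def li(t):
        # individual likelihood L_i for pattern t, as in equations (1)-(2)
        p1 = p0 = 1.0
        for tk, se_k, sp_k in zip(t, se, sp):
            p1 *= se_k if tk else 1 - se_k
            p0 *= (1 - sp_k) if tk else sp_k
        return pi * p1 + (1 - pi) * p0
    return {t: li(t) for t in product((0, 1), repeat=len(se))}
```

For two tests with Se = (0.9, 0.8), Sp = (0.95, 0.85), and π = 0.4, this yields the four probabilities (P00, P01, P10, P11), with P11 = 0.2925.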

The JAGS model syntax for models M1 and M2 without using the multinomial property is included in the appendix. The authors’ R code and JAGS model syntax for the two models with and without the multinomial imposition are available in the online supplementary file. Specifically, Algorithm 1 is for model M1 without the multinomial imposition; Algorithm 2 is for model M2 without the multinomial imposition; Algorithm 3 is for model M1 with the multinomial imposition; and Algorithm 4 is for model M2 with the multinomial imposition.

4. Simulation study

In this section, we conduct a simulation study to demonstrate the advantage of using the conditional dependence model when dependence between tests does exist, to compare the performance of the four algorithms, and to investigate the operating characteristics of the correlation residual analysis. We consider two simulation scenarios: the first is similar to the real data example in Section 5.2, with a relatively small number of patients, and the second is similar to the real data example in Section 5.3, with a large number of patients. Specifically, in the first scenario, n = 400 patients are rated by K = 4 tests with test sensitivities (0.96, 0.87, 0.81, 0.86) and test specificities (0.97, 0.98, 0.99, 0.97). Among the 4 tests, C23+ = 0.05, C34+ = 0.05, C23− = 0.001, C34− = 0.001, and all other pairwise covariances are set to 0. In the second scenario, n = 4000 patients are rated by K = 5 tests with test sensitivities (0.77, 0.65, 0.71, 0.68, 0.71) and test specificities (0.89, 0.93, 0.88, 0.90, 0.86). Among the 5 tests, C23+ = 0.05, C14+ = 0.05, C24+ = 0.05, C25+ = 0.05, C35+ = 0.001, C23− = 0.001, C14− = 0.001, C24− = 0.001, C25− = 0.001, C35− = 0.05, and all other pairwise covariances are set to 0. For both scenarios, the underlying disease prevalence π is 0.45. Given these true parameter values, data are generated according to the conditional dependence model. Specifically, the test results for each patient are randomly simulated from a categorical distribution over the 2^K possible test-result combinations, with the probability of each combination calculated using the individual likelihood defined in (1) and the two conditional probabilities calculated via equation (4). For each scenario, 200 data sets are simulated.

We fit each data set using the four algorithms and summarize the simulation results in Table 1 and Table 2 for the two simulation scenarios, respectively. In the tables, Bias stands for the difference between the average of the 200 point estimates (posterior means) and the true value; SD stands for the average of the 200 posterior standard deviations; RMSE is the square root of the average of the 200 MSEs, with each MSE calculated between the estimated diagnostic accuracy parameters (sensitivities and specificities) and their true values; DIC stands for the average Deviance Information Criterion23, that is, the average of the 200 DICs, each produced by JAGS. For both scenarios, Table 1 and Table 2 show that after incorporating the pairwise covariances, model M2 provides overall more accurate estimation of the disease prevalence, sensitivities and specificities than model M1. Specifically, the point estimates for model M2 are much less biased, with only slightly larger posterior standard deviations than model M1, and the RMSEs of model M2 are smaller than those of model M1. Furthermore, model M2 provides unbiased estimates of the covariances, which model M1 is unable to do.

Table 1:

Simulation results for scenario 1: K = 4, n = 400, and 200 data replicates

Truth M1 (Algorithm 1) M1 (Algorithm 3) M2 (Algorithm 2) M2 (Algorithm 4)
Bias (SD) Bias (SD) Bias (SD) Bias (SD)
π 0.450 −0.011 (0.025) −0.002 (0.025) 0.000 (0.026) 0.006 (0.026)
Se1 0.960 −0.009 (0.017) −0.027 (0.019) −0.007 (0.017) −0.021 (0.020)
Se2 0.870 0.019 (0.024) 0.003 (0.025) 0.002 (0.026) −0.009 (0.027)
Se3 0.810 0.031 (0.029) 0.019 (0.029) −0.002 (0.033) −0.001 (0.029)
Se4 0.860 0.016 (0.025) 0.001 (0.026) 0.001 (0.027) −0.009 (0.028)
Sp1 0.970 −0.017 (0.016) −0.020 (0.016) 0.000 (0.013) −0.006 (0.014)
Sp2 0.980 −0.006 (0.011) −0.011 (0.012) −0.004 (0.011) −0.011 (0.012)
Sp3 0.990 −0.003 (0.007) −0.008 (0.009) −0.007 (0.009) −0.017 (0.011)
Sp4 0.970 −0.006 (0.013) −0.011 (0.014) 0.003 (0.012) −0.010 (0.013)
C23+ 0.050 −0.001 (0.015) −0.008 (0.013)
C34+ 0.050 0.001 (0.015) −0.006 (0.013)
C23− 0.001 0.004 (0.004) 0.005 (0.005)
C34− 0.001 0.003 (0.004) 0.005 (0.005)
RMSE 0.01567 0.01501 0.00416 0.01195
DIC 1224.128 238.467 1189.774 208.30
Time 18.488 secs 1.237 secs 36.217 secs 1.893 secs

Table 2:

Simulation results for scenario 2: K = 5, n = 4000, and 200 data replicates

Truth M1 (Algorithm 1) M1 (Algorithm 3) M2 (Algorithm 2) M2 (Algorithm 4)
Bias (SD) Bias (SD) Bias (SD) Bias (SD)
π 0.450 −0.020 (0.011) −0.017 (0.011) 0.002 (0.013) 0.004 (0.013)
Se1 0.770 −0.008 (0.013) −0.011 (0.013) −0.004 (0.015) −0.006 (0.015)
Se2 0.650 0.044 (0.015) 0.042 (0.015) 0.001 (0.016) 0.000 (0.016)
Se3 0.710 0.040 (0.012) 0.038 (0.012) 0.003 (0.015) 0.001 (0.014)
Se4 0.680 0.019 (0.014) 0.017 (0.014) −0.002 (0.016) −0.002 (0.016)
Se5 0.710 0.048 (0.012) 0.045 (0.012) 0.002 (0.014) 0.000 (0.014)
Sp1 0.890 −0.029 (0.009) −0.029 (0.009) −0.002 (0.010) −0.002 (0.011)
Sp2 0.930 0.080 (0.006) 0.079 (0.006) 0.069 (0.009) 0.067 (0.009)
Sp3 0.880 0.010 (0.010) 0.010 (0.010) 0.003 (0.011) 0.002 (0.011)
Sp4 0.900 −0.008 (0.008) −0.009 (0.008) −0.003 (0.009) −0.003 (0.010)
Sp5 0.860 −0.016 (0.010) 0.017 (0.010) 0.003 (0.011) 0.002 (0.011)
C23+ 0.050 −0.002 (0.006) −0.002 (0.006)
C14+ 0.050 0.001 (0.007) −0.001 (0.007)
C24+ 0.050 0.000 (0.006) 0.000 (0.006)
C25+ 0.050 −0.002 (0.006) −0.002 (0.006)
C35+ 0.001 −0.002 (0.006) −0.002 (0.006)
C23− 0.001 0.000 (0.004) 0.000 (0.004)
C14− 0.001 0.002 (0.005) 0.001 (0.005)
C24− 0.001 −0.001 (0.002) 0.000 (0.002)
C25− 0.001 −0.001 (0.004) 0.001 (0.004)
C35− 0.050 −0.002 (0.006) −0.002 (0.006)
RMSE 0.01882 0.01914 0.00587 0.00586
DIC 22017.830 937.885 21569.440 495.858
Time 12.592 mins 2.693 secs 98.422 mins 6.261 secs

In terms of the multinomial imposition, the algorithms that use it take much less time for the fitting. For example, Table 2 shows that for the second simulation scenario, it takes on average more than one and a half hours to fit model M2 without the multinomial imposition, while it takes only 6 seconds with it. Table 1 shows that model M2 with the multinomial imposition (Algorithm 4) actually leads to worse estimation results, with larger biases, than model M2 without it (Algorithm 2) for the first scenario, while Table 2 shows that M2 with the multinomial imposition (Algorithm 4) provides almost identical estimation results to those without it (Algorithm 2) for the second scenario. This implies that when the sample size n is large, in the sense that the number of unique observed test-result combinations equals 2^K, using the multinomial property for the fitting provides estimation results as good as the original model while alleviating the computational burden. However, when the sample size is small, imposing the multinomial property distorts the estimation instead. Finally, comparing the two tables, we see that as the sample size n increases, the biases and posterior standard deviations both clearly decrease for each algorithm.

Table 3 and Table 4 summarize the correlation residual analysis for the 200 simulated data sets in each of the two simulation scenarios. Table 3 presents the selection frequencies of significant correlation pairs for scenario 1, based on the one-sided credible interval criterion (described in Section 2.3) after fitting model M1. It shows that the correlation pairs are correctly included 149 times. There are 24 times that only r34 is selected, 20 times that only r23 is selected, and very few times that other correlations are selected. An additional simulation study (see Part B of the online supplement) shows that partially missing important covariances results in estimation bias, but the bias is smaller than under the completely misspecified conditional independence model (M1). On average, the estimation results for scenario 1 when fitting M2 with the covariances selected by the one-sided credible interval criterion are very close to those from fitting the true model. Table 4 presents the selection frequencies of significant correlation pairs for scenario 2, where the true model has significant covariances C23+, C14+, C24+, C25+ and C35−. It shows that besides the true significant pairs r23, r14, r24, r25, and r35, two extra pairs r34 and r45 are also always selected. The other three pairs r12, r13, and r15 are selected only 34, 19, and 29 times, respectively. Based on this observation, we fit the data with model M2 with these seven pairs of covariances added. The estimation results (see Part C of the online supplement) are comparable to those obtained from the true model. In particular, the estimates of the two extra covariance terms are approximately 0. A further simulation study shows that adding all ten pairs of covariances drastically worsens the estimation.
In summary, when the covariance pattern is simple, as in scenario 1, our correlation residual check can frequently detect the pattern; when the covariance pattern is complicated, as in scenario 2, a few extra covariance terms may be selected without much influence on the estimation; when all pairwise covariance terms are included, the estimation results can be poor.

Table 3:

Frequencies of correlation selection for scenario 1: K = 4, n = 400, and 200 data replicates. Note that “1” indicates “being selected”

Frequency r12 r13 r23 r14 r24 r34
149 0 0 1 0 0 1
24 0 0 0 0 0 1
20 0 0 1 0 0 0
2 0 0 1 1 0 0
3 0 0 1 1 0 1
1 0 0 0 0 0 0
1 1 0 1 0 0 1

Table 4:

Frequencies of correlation selection for scenario 2: K = 5, n = 4000, and 200 data replicates. Note that “1” indicates “being selected”

Frequency r12 r13 r23 r14 r24 r34 r15 r25 r35 r45
124 0 0 1 1 1 1 0 1 1 1
23 0 0 1 1 1 1 1 1 1 1
2 0 1 1 1 1 1 1 1 1 1
30 1 0 1 1 1 1 0 1 1 1
17 0 1 1 1 1 1 0 1 1 1
4 1 0 1 1 1 1 1 1 1 1

5. Data analysis

In this section, we illustrate our proposed methods through three examples.

5.1. Example 1: Beam’s mammogram data

Beam’s mammogram data24 was collected to study the variability in the interpretation of mammograms by a national sample of radiologists in the United States. It contains diagnostic results of 107 radiologists, each evaluating 146 women from a breast cancer screening program. The true disease status of each patient (breast cancer or no breast cancer) was known from a biopsy examination or a minimum 2-year follow-up study. See Beam et al.24 for more details of this data set. Beam’s data was analyzed by Lin et al.17 to estimate radiologist diagnostic skills with a latent variable model, where the diagnostic results were dichotomized. Here we illustrate our Bayesian hierarchical conditional independence model with this dichotomous rating data. Pairwise covariances and the multinomial property are not convenient to use due to the large number (107) of raters, and therefore we restrict our analysis to model M1; Algorithm 1 in the supplement is applied. The breast cancer disease prevalence π is estimated as 0.44 based upon the sample of 146 women. The mode of rater sensitivities, ω1, is estimated as 0.94 and the mode of rater specificities, ω2, as 0.86. These modes measure the diagnostic accuracy of the whole population of radiologists. The estimated concentration parameter κ1 is 16 and the estimated κ2 is 10.4, indicating a wider spread of rater specificities. It is therefore reasonable to conclude that the raters usually reach higher consensus in identifying subjects as diseased when the true disease status is diseased. Figure 1 plots the estimated sensitivities and specificities obtained from model M1 versus the empirical sensitivities and specificities calculated from the data, demonstrating that the latent class model works well for Beam’s data.

Figure 1:


Plots of estimated sensitivities and specificities versus empirical sensitivities and specificities for the mammogram data in Example 1.

One advantage of latent class models is that they do not require information on the true disease status. Theoretically, when the disease status is known, the nonidentifiability issue (if it exists) is automatically resolved and the estimation is more accurate. Figure 2 shows that with and without the true disease status, the estimated sensitivities and specificities are very close for this data set, showing that there is no identifiability problem for Beam’s data and, again, that the latent class model works well. In addition, the estimated beta distributions for sensitivities and specificities closely match the density curves of the empirical sensitivities and specificities, as presented in Figure 3.

Figure 2:


Plots of estimated sensitivities and specificities with true disease status known versus those without true disease status known for the mammogram data in Example 1.

Figure 3:


Density curves of the estimated and empirical distributions of sensitivities and specificities for the mammogram data in Example 1.

5.2. Example 2: Alvord’s HIV data

Alvord’s HIV data25 was used to determine the diagnostic performance of HIV antibody assays. In the study, serum samples from each of 428 subjects were tested by four conventional bioassays (tests). Alvord et al.25 showed that the traditional independent latent two-class model is inadequate to fit the data and that a three-class model can properly resolve the issue. Qu et al.9 used a correlation residual plot to check the covariance pattern and applied Gaussian random effects to model the test dependence, which was shown to greatly improve the fit. One drawback of their model is that the parameters introduced are not easily interpretable and cannot directly explain the covariance. The pairwise covariance dependence model (M2) proposed in this paper naturally solves this problem and can easily be applied to this study with four bioassays. Because nine test-result combinations have observed frequencies of 0, the multinomial property is not suitable to use.

The top panel of Figure 4 shows the pairwise correlation residuals, in the order (T1,T2), (T1,T3), (T2,T3), (T1,T4), (T2,T4) and (T3,T4), after model M1 is fitted. Clearly, the correlation between T2 and T3 is not negligible. Based upon this information, model M2 is applied with C23+ and C23− introduced and the other covariance terms set to 0. The correlation residual plot of this model, in the bottom panel of Figure 4, shows all of the correlation residuals close to 0, indicating an improved goodness of fit. Table 5 shows the observed frequencies and the expected frequencies from M1 as well as M2. Clearly, M2 provides a better fit than M1. The chi-square goodness-of-fit test between the observed and expected frequencies indicates that M1 does not fit the data well, with p-values of only 0.0057 for posterior mean frequencies and 0.0077 for posterior median frequencies. The p-values of the chi-square test for M2 are above 0.9 for both posterior mean and median frequencies, suggesting that M2 provides a more appropriate fit. The Bayesian model comparison criterion DIC also confirms that M2 fits the data better, with a smaller DIC value. Table 6 shows the estimated sensitivities and specificities from M1 and M2, which are very similar to those in Qu et al.9 The correlation residual plot shows the correlation pattern as a whole, while the estimated covariance terms nicely decompose this pattern into two parts based on the two latent classes. In this example, C23− is negligible compared to C23+, which indicates that the dependence between test 2 and test 3 mainly comes from diagnosing subjects who are diseased.
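A pairwise correlation residual compares the sample correlation of a test pair with the correlation implied by the fitted model. A rough Python sketch for the conditional independence model M1, using the standard two-class mixture moments (the function names are ours, not from the paper):

```python
import numpy as np

def fitted_moments(pi, se, sp, j, k):
    """Marginal and joint positivity probabilities of tests j and k under
    the conditional independence model M1 (two-class Bernoulli mixture)."""
    pj = pi * se[j] + (1 - pi) * (1 - sp[j])
    pk = pi * se[k] + (1 - pi) * (1 - sp[k])
    pjk = pi * se[j] * se[k] + (1 - pi) * (1 - sp[j]) * (1 - sp[k])
    return pj, pk, pjk

def correlation_residual(X, pi, se, sp, j, k):
    """Observed minus model-fitted correlation for the test pair (j, k);
    X is an n-by-K binary matrix of test results."""
    obs = np.corrcoef(X[:, j], X[:, k])[0, 1]
    pj, pk, pjk = fitted_moments(pi, se, sp, j, k)
    fit = (pjk - pj * pk) / np.sqrt(pj * (1 - pj) * pk * (1 - pk))
    return obs - fit
```

A residual far from 0 for some pair flags that pair as a candidate for a covariance term in model M2.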

Figure 4:

Pairwise correlation residual plots for models M1 and M2 for the HIV data in Example 2. The top panel is for model M1; the bottom panel is for model M2.

Table 5:

Observed and expected frequencies for models M1 and M2 without imposing the multinomial property for the HIV data in Example 2

Test results Observed frequency Expected frequency
M1 (Algorithm 1) M2 (Algorithm 2)
Mean Median Mean Median
0000 170 169.12 168.72 168.62 168.31
1000 4 5.46 5.20 5.32 4.95
0100 6 6.40 6.18 6.48 6.27
1100 1 0.27 0.25 0.23 0.21
0001 15 13.87 13.58 13.97 13.68
1001 17 9.38 9.25 16.84 16.56
1101 4 11.93 11.70 4.77 4.47
1011 83 88.90 88.97 81.66 81.46
1111 128 118.67 118.24 126.19 125.88
Goodness of fit (Chi-square test p value) 0.005749 0.007699 0.9953 0.9979
DIC 1284.388 1273.979

Table 6:

Estimation results (posterior quantities) from models M1 and M2 without imposing the multinomial property for the HIV data in Example 2

M1 (Algorithm 1) M2 (Algorithm 2)
Mean Median SD Mean Median SD
π 0.540 0.540 0.025 0.542 0.542 0.024
Se1 0.995 0.997 0.005 0.995 0.997 0.005
Se2 0.572 0.572 0.032 0.572 0.571 0.032
Se3 0.909 0.910 0.020 0.908 0.908 0.019
Se4 0.995 0.996 0.005 0.995 0.996 0.005
Sp1 0.969 0.971 0.013 0.970 0.972 0.013
Sp2 0.964 0.965 0.013 0.960 0.961 0.013
Sp3 0.993 0.995 0.006 0.993 0.995 0.006
Sp4 0.924 0.926 0.020 0.924 0.925 0.018
C23+ 0.032 0.032 0.010
C23− 0.003 0.002 0.004

5.3. Example 3: Handelman’s dentistry data

Handelman’s dentistry data26 comprise evaluations by 5 dentists, each rating 3869 dental x-rays on a binary scale with 0 denoting sound and 1 denoting carious. See more details of this data set in Espeland and Handelman.26 This data set contains a large number of subjects (n = 3869) and has complete frequency information for all possible test-result combinations; that is, the number of unique observed test-result combinations equals the total number 2^5 (= 32) of possible test-result combinations. Therefore, the multinomial property is imposed when fitting models M1 and M2.
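When every one of the 2^K patterns is observed, the raw 0/1 records can be collapsed into multinomial counts over the patterns before fitting, which is what makes the multinomial algorithms fast. A small Python sketch of this collapsing step (the function name is illustrative):

```python
from collections import Counter

def multinomial_counts(X, K):
    """Collapse n binary K-test result vectors into counts over all 2**K
    possible test-result combinations, keyed as K-character bit strings."""
    counts = Counter("".join(str(b) for b in row) for row in X)
    # include zero-count patterns so the multinomial support is complete
    return {format(i, f"0{K}b"): counts.get(format(i, f"0{K}b"), 0)
            for i in range(2 ** K)}
```

For the dentistry data this reduces 3869 subject records to 32 pattern counts, so the likelihood is evaluated over patterns rather than subjects.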

From the top panel of Figure 5, we see that the correlation pattern is more complicated: both positive and negative correlations appear. The pairwise correlations from left to right are in the order (T1,T2), (T1,T3), (T2,T3), (T1,T4), (T2,T4), (T3,T4), (T1,T5), (T2,T5), (T3,T5) and (T4,T5). Instead of handling each evident covariance separately, we recommend adding all pairwise covariance terms and letting the data determine the values of the covariances. The correlation residual plot for M2 in the bottom panel of Figure 5 shows that all the residuals are close to 0, indicating that M2 provides an improved and good fit for the data.

Figure 5:

Pairwise correlation residual plots for models M1 and M2 for the dentistry data in Example 3. The top panel is for model M1; the bottom panel is for model M2.

Table 7 shows the observed and expected frequencies estimated from models M1 and M2. The estimated frequencies from model M2 are much closer to the observed frequencies than those from model M1, especially for the patterns “00000” and “11111”. The chi-square goodness-of-fit test also indicates that M1 does not provide a good fit for the data, with vanishingly small p-values, while M2 is shown to fit the data well, with p-values of 0.47 and 0.64. A smaller value of DIC also indicates that M2 is the more appropriate model.
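The goodness-of-fit comparisons above rest on Pearson’s chi-square statistic between observed and expected pattern frequencies. A minimal sketch of the statistic itself (the p-value additionally requires the chi-square tail probability, e.g. from scipy.stats):

```python
def chi_square_stat(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all
    test-result patterns; expected counts must be positive."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

In the tables, this statistic is computed once with the posterior mean expected frequencies and once with the posterior medians, giving the two p-values reported per model.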

Table 7:

Observed and expected frequencies for models M1 and M2 with the multinomial property imposed for the dentistry data in Example 3

Test results Observed frequency Expected frequency
M1 (Algorithm 3) M2 (Algorithm 4)
Mean Median Mean Median
00000 1880 1822.27 1821.36 1860.31 1860.33
00001 789 821.36 820.14 780.61 780.52
00010 43 62.67 62.22 43.45 43.08
00011 75 50.72 50.70 76.91 76.55
00100 23 30.05 30.06 23.64 23.39
00101 63 48.65 48.71 63.57 63.21
00110 8 4.91 4.84 10.94 10.53
00111 22 36.26 36.09 22.17 21.80
01000 188 212.74 212.46 187.77 187.77
01001 191 150.99 150.84 194.21 193.36
01010 17 13.41 13.34 20.28 19.88
01011 67 61.42 61.15 64.10 63.70
01100 15 13.01 12.95 17.30 16.97
01101 85 90.50 90.50 83.62 83.35
01110 8 9.78 9.68 6.77 6.56
01111 56 85.87 85.73 59.41 59.11
10000 22 22.81 22.62 22.25 21.96
10001 26 26.60 26.49 27.41 26.97
10010 6 2.59 2.57 8.90 8.48
10011 14 17.10 17.04 13.91 13.67
10100 1 3.19 3.16 3.22 2.85
10101 20 25.77 25.67 19.99 19.82
10110 2 2.79 2.77 2.19 2.06
10111 17 24.69 24.53 19.88 19.64
11000 2 7.08 6.99 4.21 3.85
11001 20 42.79 42.60 20.01 19.79
11010 6 4.62 4.55 4.50 4.34
11011 27 40.20 40.09 30.90 30.66
11100 3 6.96 6.87 3.34 3.14
11101 72 61.40 61.19 75.26 75.17
11110 1 6.66 6.59 5.10 5.10
11111 100 59.15 58.93 92.89 92.63
Goodness of fit (Chi-square test p value) < 2.2e−16 < 2.2e−16 0.4736 0.6402
DIC 619.1735 540.0874

Table 8 displays the estimated posterior means, medians and standard deviations of the sensitivities and specificities from models M1 and M2. We observe that the posterior means and medians of the sensitivities from model M2 tend to be larger than those from model M1, while the posterior means and medians of the specificities tend to be smaller. In addition, the posterior standard deviations from M2 are overall larger than those from M1. Compared with Qu et al.’s results,9 our estimated sensitivities and specificities from model M1 (Algorithm 3) correspond closely to those from their 2LC model, while the estimates from model M2 (Algorithm 4) are not as closely aligned with those from their 2LCR model. Table 9 displays the estimated pairwise covariances, with C13+ estimated to be noticeably larger than the other covariances.

Table 8:

Estimation results (posterior quantities) from models M1 and M2 with the multinomial property imposed for the dentistry data in Example 3

M1 (Algorithm 3) M2 (Algorithm 4)
Mean Median SD Mean Median SD
π 0.202 0.202 0.010 0.173 0.171 0.034
Se1 0.407 0.407 0.021 0.442 0.439 0.073
Se2 0.706 0.706 0.021 0.755 0.756 0.083
Se3 0.597 0.596 0.023 0.604 0.601 0.090
Se4 0.493 0.493 0.021 0.498 0.497 0.059
Se5 0.899 0.900 0.014 0.937 0.949 0.038
Sp1 0.989 0.989 0.002 0.979 0.979 0.005
Sp2 0.898 0.898 0.007 0.884 0.881 0.018
Sp3 0.986 0.986 0.003 0.963 0.963 0.009
Sp4 0.968 0.968 0.004 0.951 0.951 0.012
Sp5 0.695 0.695 0.009 0.683 0.680 0.029

Table 9:

Estimated pairwise covariances conditional on the diseased class and the nondiseased class for the dentistry data in Example 3

T1 T2 T3 T4 T5
C+ T1 0.017 0.057 0.028 0.004
T2 0.019 0.010 0.003
T3 −0.007 0.007
T4 −0.006
T5
C− T1 0.001 0.001 0.004 0.003
T2 0.006 0.005 0.014
T3 0.005 0.008
T4 0.010
T5

6. Summary

In this paper, we propose two Bayesian hierarchical latent class models that allow the estimation of the sensitivity and specificity of multiple diagnostic tests with or without gold standard data. Our proposed models build upon existing approaches by flexibly accommodating a large number of diagnostic tests. Further, our proposed pairwise covariance dependence model (M2), in contrast to the approach of Qu et al.,9 provides easily interpretable parameter estimates and a direct interpretation of the covariance parameters. Through the Bayesian hierarchical structure, the individual sensitivities and specificities are modeled as random effects following two common overarching beta distributions, respectively. The mode parameters of the overarching beta distributions therefore reflect the group-level sensitivity and specificity, while the concentration parameters implicitly reflect the rating agreement among the group, with larger values indicating more consistent group diagnostic performance.
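The mode–concentration parameterization of the overarching beta distributions maps directly to the standard shape parameters, exactly as in the dbeta calls of the Appendix code. A small Python sketch of the conversion (the function name is ours):

```python
def beta_shapes_from_mode(omega, kappa):
    """Convert mode omega (0 < omega < 1) and concentration kappa (> 2) of
    a beta distribution to its standard shape parameters (alpha, beta),
    matching dbeta(omega*(kappa-2)+1, (1-omega)*(kappa-2)+1) in the
    Appendix JAGS code. The mode of Beta(a, b) is (a-1)/(a+b-2) = omega."""
    if not (0 < omega < 1 and kappa > 2):
        raise ValueError("require 0 < omega < 1 and kappa > 2")
    a = omega * (kappa - 2) + 1
    b = (1 - omega) * (kappa - 2) + 1
    return a, b
```

This parameterization is convenient here because omega is directly the group-level sensitivity (or specificity) and kappa grows with the agreement among tests.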

Another contribution of this paper is to provide easy-to-implement JAGS algorithms for applying these models, together with a guideline for their use. Algorithm 1, for the independence model, should always be the first attempt to fit the data. If the correlation residual analysis detects significant dependence between at least two of the tests, then Algorithm 2, for the conditional dependence model, is implemented to improve the fit. Algorithms 3 and 4 in the supplement are analogous to Algorithms 1 and 2 but with the multinomial distribution imposed. When the number n of subjects is large relative to the number 2^K of possible test-result combinations, using the multinomial distribution property can significantly reduce the computational burden. For instance, in Example 3 it takes 1.5 hours to run Algorithm 2 but only 19 seconds to run Algorithm 4.

Although the pairwise covariance model is easy to apply and interpret, one limitation is its ability to handle data generated from other models, such as the GRE model9 and the finite mixture (FM) model.10 Our simulation study (see part D of the online supplement) shows that, for data from these two models, our correlation residual analysis tends to include all pairs of covariances in the pairwise covariance model, and our method leads to biased estimation results. One possible reason is that the pairwise covariance model models the covariances between tests directly, while the GRE and FM models introduce correlation between tests through common subject-level random effects. Finally, for real data analysis, we recommend using a model selection criterion, such as DIC, to choose the best among candidate models.

Supplementary Material


Acknowledgements

The authors are grateful for the support provided by grants R01-CA226805 and R01-CA172463 from the United States National Institutes of Health. We thank Dr. Craig Beam for providing the mammography dataset.

Appendix

Algorithm 1:

JAGS code for the conditional independence model (M1)

modelString1 = "
model {
for (i in 1:n){
 for (k in 1:K){
  s1[i,k] = se[k]^x[i,k]*((1-se[k])^(1-x[i,k]))
  s2[i,k] = sp[k]^(1-x[i,k])*((1-sp[k])^x[i,k])}
 prob[i] = pi*(prod(s1[i,1:K])) + (1-pi)*(prod(s2[i,1:K]))
 z[i] ~ dpois(-log(prob[i]))}
for (k in 1:K){
 se[k] ~ dbeta(omega1*(kappa1-2)+1, (1-omega1)*(kappa1-2)+1)
 sp[k] ~ dbeta(omega2*(kappa2-2)+1, (1-omega2)*(kappa2-2)+1)
}
omega1 ~ dbeta(1,1)T(0.5,)
omega2 ~ dbeta(1,1)T(0.5,)
kappa1 = kappaMinusTwo1 + 2
kappaMinusTwo1 ~ dgamma(0.01,0.01)
kappa2 = kappaMinusTwo2 + 2
kappaMinusTwo2 ~ dgamma(0.01,0.01)
pi ~ dbeta(1,1)
}"
writeLines(modelString1, con="model1.bug")

Algorithm 2:

JAGS code for the conditional dependence model (M2)

modelString2 = "
model {
for (i in 1:n){
 for (k in 1:K){
  s1[i,k] = se[k]^x[i,k]*((1-se[k])^(1-x[i,k]))
  s2[i,k] = sp[k]^(1-x[i,k])*((1-sp[k])^x[i,k])}
 for (j in 1:K){
  for (h in 1:K){
   cop[i,j,h] = c1[j,h]*(-1)^(x[i,j] + x[i,h])/(s1[i,j]*s1[i,h])
   con[i,j,h] = c2[j,h]*(-1)^(x[i,j] + x[i,h])/(s2[i,j]*s2[i,h])}}
 eta[i] = (prod(s1[i,1:K])*(1 + sum(cop[i,,])))
 theta[i] = (prod(s2[i,1:K])*(1 + sum(con[i,,])))
 prob[i] = pi*eta[i] + (1-pi)*theta[i]
 z[i] ~ dpois(-log(prob[i]))}
for (k in 1:K){
 se[k] ~ dbeta(omega1*(kappa1-2)+1, (1-omega1)*(kappa1-2)+1)
 sp[k] ~ dbeta(omega2*(kappa2-2)+1, (1-omega2)*(kappa2-2)+1)}
for (l in 1:(K-1)){
 for (h in (l+1):K){
  c1[l,h] ~ dunif((se[l]-1)*(1-se[h]), (min(se[l],se[h])-se[l]*se[h]))
  c2[l,h] ~ dunif((sp[l]-1)*(1-sp[h]), (min(sp[l],sp[h])-sp[l]*sp[h]))}}
for (h in 1:K){
 for (l in h:K){
  c1[l,h] = 0
  c2[l,h] = 0}}
omega1 ~ dbeta(1,1)T(0.5,)
omega2 ~ dbeta(1,1)T(0.5,)
kappa1 = kappaMinusTwo1 + 2
kappaMinusTwo1 ~ dgamma(0.01,0.01)
kappa2 = kappaMinusTwo2 + 2
kappaMinusTwo2 ~ dgamma(0.01,0.01)
pi ~ dbeta(1,1)
}"
writeLines(modelString2, con="model2.bug")

References

  • 1. Hui SL and Walter SD. Estimating the error rates of diagnostic tests. Biometrics 1980; 36: 167–171.
  • 2. Joseph L, Gyorkos TW and Coupal L. Bayesian estimation of disease prevalence and the parameters of diagnostic tests in the absence of a gold standard. American Journal of Epidemiology 1995; 141: 263–272.
  • 3. Dendukuri N and Joseph L. Bayesian approaches to modeling the conditional dependence between multiple diagnostic tests. Biometrics 2001; 57: 158–167.
  • 4. Vacek PM. The effect of conditional dependence on the evaluation of diagnostic tests. Biometrics 1985; 41: 959–968.
  • 5. Brenner H. How independent are multiple ‘independent’ diagnostic classifications? Statistics in Medicine 1996; 15: 1377–1386.
  • 6. Torrance-Rynard VL and Walter SD. Effects of dependent errors in the assessment of diagnostic test performance. Statistics in Medicine 1997; 16: 2157–2175.
  • 7. Yang I and Becker MP. Latent variable modeling of diagnostic accuracy. Biometrics 1997; 53: 948–958.
  • 8. Jones G, Johnson WS, Hanson TE and Christensen R. Identifiability of models for multiple diagnostic testing in the absence of a gold standard. Biometrics 2010; 66: 855–863.
  • 9. Qu YS, Tan M and Kutner MH. Random effects models in latent class analysis for evaluating accuracy of diagnostic tests. Biometrics 1996; 52: 797–810.
  • 10. Albert PS and Dodd LE. A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics 2004; 60: 427–435.
  • 11. van Smeden M, Naaktgeboren CA, Reitsma JB, Moons KG and de Groot JA. Latent class models in diagnostic studies when there is no reference standard: a systematic review. American Journal of Epidemiology 2014; 179: 423–431.
  • 12. Collins J and Albert PS. Estimating diagnostic accuracy without a gold standard: a continued controversy. Journal of Biopharmaceutical Statistics 2016; 26: 1078–1082.
  • 13. Rothenberg TJ. Identification in parametric models. Econometrica 1971; 39: 577–591.
  • 14. Dendukuri N, Rahme E, Belisle P and Joseph L. Bayesian sample size determination for prevalence and diagnostic test studies in the absence of a gold standard test. Biometrics 2004; 60: 388–397.
  • 15. Gustafson P. On model expansion, model contraction, identifiability and prior information: two illustrative scenarios involving mismeasured variables. Statistical Science 2005; 20: 111–140.
  • 16. Zhang B, Chen Z and Albert PS. Estimating diagnostic accuracy of raters without a gold standard by exploiting a group of experts. Biometrics 2012; 68: 1294–1302.
  • 17. Lin XY, Chen H, Edwards D and Nelson KP. Modeling rater diagnostic skills in binary classification processes. Statistics in Medicine 2018; 37: 557–571.
  • 18. Plummer M. JAGS: a program for analysis of Bayesian graphical models using Gibbs sampling. In: Proceedings of the 3rd International Workshop on Distributed Statistical Computing, Vienna, Austria, 2003.
  • 19. Kruschke JK. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. Academic Press, 2015.
  • 20. Robert CP and Soubrian C. Estimation of a mixture model through Bayesian sampling and prior feedback. Test 1993; 2: 125–146.
  • 21. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A and Rubin DB. Bayesian Data Analysis. Chapman and Hall/CRC Texts in Statistical Science, 2013.
  • 22. Su YS and Yajima M. Using R to run ‘JAGS’. R package, 2015.
  • 23. Spiegelhalter DJ, Best NG, Carlin BP and van der Linde A. Bayesian measures of model complexity and fit (with discussion). Journal of the Royal Statistical Society, Series B 2002; 64: 583–639.
  • 24. Beam CA, Conant EF and Sickles EA. Association of volume-independent factors with accuracy in screening mammogram interpretation. Journal of the National Cancer Institute 2003; 95(4): 282–290.
  • 25. Alvord WG, Drummond JE, Arthur LO, Biggar RJ, Goedert JJ, Levine PH, Murphy EL, Weiss SH and Blattner WA. A method for predicting individual HIV infection status in the absence of clinical information. AIDS Research and Human Retroviruses 1988; 4: 295–304.
  • 26. Espeland MA and Handelman SL. Using latent class models to characterize and assess relative error in discrete measurements. Biometrics 1989; 45: 587–599.
