Author manuscript; available in PMC 2014 Sep 20.
Published in final edited form as: Stat Med. 2013;32(21). doi: 10.1002/sim.5796

Kappa statistic for the clustered dichotomous responses from physicians and patients

Chaeryon Kang a, Bahjat Qaqish b, Jane Monaco b, Stacey L Sheridan c,d, Jianwen Cai b,*
PMCID: PMC3844626  NIHMSID: NIHMS461200  PMID: 23533082

Abstract

The bootstrap method for estimating the standard error of the kappa statistic in the presence of clustered data is evaluated. Such data arise, for example, in assessing agreement between physicians and their patients regarding their understanding of physician-patient interactions and discussions. We propose a computationally efficient procedure for generating correlated dichotomous responses for physicians and their assigned patients for simulation studies. The simulation results demonstrate that, with at least a moderately large number of clusters, the proposed bootstrap method produces better estimates of the standard error and better coverage than the asymptotic standard error estimate that ignores dependence among patients within physicians. An application to a coronary heart disease prevention study is presented.

Keywords: Kappa statistic, dichotomous responses for clusters, bootstrap resampling for clusters

1. Introduction

The kappa statistic is a commonly used measure of inter-rater agreement. The kappa statistic was originally proposed by Cohen [1] to quantify the degree of agreement beyond chance when two raters simultaneously score the same subjects on a nominal or ordinal scale. Inter-observer reliability is measured by comparing the observed proportion of agreement, Po, with the proportion of agreement expected by chance, Pe, and scaling the difference so that a value of one indicates perfect agreement and a value of zero indicates no agreement beyond that expected by chance.

Many extensions to the kappa statistic have been proposed including weighting to account for the seriousness of misclassification errors [2, 3], allowing more than two raters [4], using stratified samples [5], comparing a single rater to a consensus group of raters [3], and allowing for multiple observations per subject and multiple categorizations [6].

Recent work has focused on the study of correlated kappa statistics. For example, when two or more raters evaluate the same subjects at different time points or under different conditions and the equality of the kappa statistics is of interest, the correlation between the kappa statistics must be addressed. McKenzie [7] proposed resampling techniques for comparing correlated kappa statistics, considering the case of pairwise comparisons of kappa statistics when the observations were evaluated by three raters. VanBelle and Albert [8] extended these methods using the bootstrap method to allow the comparison of more than two kappa statistics. Model-based procedures for comparing two dependent kappa statistics calculated from two observers from the same data when the outcome is binary were proposed by Donner, Shoukri, Klar and Bartfay [9]. GEE methods for comparing correlated kappa statistics have been proposed for dichotomous [10] or categorical [11] outcomes. Methods for weighted correlated kappas have also been developed [12] using GEE methods which allow for modeling of covariate effects. Barnhart and Williamson [13] proposed a weighted least squares method for comparing correlated kappa coefficients in the case of categorical covariates. Rather than comparing correlated kappa statistics, our research focuses on inference for individual kappa parameters in the presence of correlated binary outcomes.

Clustered binary data can result, as in our motivating example, when investigators are interested in the agreement between ratings by physicians and their patients regarding the same event. Since each physician can see more than one patient, patients seen by the same physician form a cluster. In other words, the responses of a physician for his/her patients tend to be more similar than those from other physicians, which results in clustered responses within a physician. We propose a bootstrap method to address the correlated data structure in such cases.

In our example, 24 physicians and their 157 patients evaluated clinical discussions regarding coronary heart disease (CHD) prevention [14]. Cluster sizes ranged from 1 to 20 patients per physician. Following the physician-patient discussion, each participant was surveyed regarding discussion content and resulting decisions. Each member of a physician-patient pair reported whether or not CHD was discussed. For pairs in which at least one of the physician/patient pair reported that CHD was discussed, agreement was evaluated for several outcome measures including whether medication was recommended or change in the patient’s lifestyle for CHD prevention was recommended. The kappa statistic was used as a measure of physician-patient agreement regarding their discussion, and the bootstrap method was used to estimate the standard errors accounting for the clustered data structure.

While methods for analysis have been proposed for clustered data, research has focused primarily on extensions to McNemar’s test or estimates of association, such as the odds ratio [15, 16, 17] rather than the kappa statistic. Oden [18] and Schouten [19] proposed a pooled kappa for situations in which two raters evaluate a set of paired units, such as pairs of eyes.

Bootstrap confidence intervals for kappa statistics have been proposed previously to address small sample sizes when observations are independent [20] and for comparing correlated kappa statistics [7, 8]. To our knowledge, the properties of the bootstrap method have not been studied for inference on individual kappa parameters in the context of clustered data.

In this paper, we evaluate the bootstrap method for calculating the kappa statistic and estimating its standard error in the presence of clustered binary outcomes. In Section 2, we provide background information regarding the kappa statistic, its asymptotic standard error, and the bootstrap method. We describe the generation of correlated dichotomous responses specifically for our simulations in Section 3. Simulation results are presented in Section 4. An applied example of the method follows in Section 5. Finally, Section 6 contains a discussion of the results.

2. Method

2.1. The kappa statistic and its asymptotic standard error (ASE)

In this section, we briefly describe the kappa statistic and its asymptotic standard error (ASE) for the case of independent subjects. Suppose that (Y, X) represents a pair of dichotomous responses from two raters, for example, a physician and a patient. Data on N subjects can be depicted in a 2 × 2 table. Let n00, n01, n10 and n11 represent the cell counts for (Y, X) = (0, 0), (0, 1), (1, 0), (1, 1), respectively. Let n·0 = n00 + n10, n0· = n00 + n01, n·1 = n01 + n11, and n1· = n10 + n11, and let N = n00 + n01 + n10 + n11 denote the number of subjects under study. Defining Po = (n00 + n11)/N and Pe = (n·0 × n0· + n·1 × n1·)/N², the kappa statistic κ̂ introduced by Cohen [1] is calculated as follows:

$$\hat{\kappa} = \frac{P_o - P_e}{1 - P_e}. \tag{1}$$

We also define

$$
\begin{aligned}
q &= \frac{n_{00}}{N}\left\{1-\left(\frac{n_{0\cdot}}{N}+\frac{n_{\cdot 0}}{N}\right)(1-\hat{\kappa})\right\}^{2}+\frac{n_{11}}{N}\left\{1-\left(\frac{n_{1\cdot}}{N}+\frac{n_{\cdot 1}}{N}\right)(1-\hat{\kappa})\right\}^{2},\\
r &= (1-\hat{\kappa})^{2}\left\{\frac{n_{01}}{N}\left(\frac{n_{\cdot 0}}{N}+\frac{n_{1\cdot}}{N}\right)^{2}+\frac{n_{10}}{N}\left(\frac{n_{\cdot 1}}{N}+\frac{n_{0\cdot}}{N}\right)^{2}\right\},\\
s &= \left\{\hat{\kappa}-P_{e}(1-\hat{\kappa})\right\}^{2}.
\end{aligned}
$$

Following [21] and [22], ASE of the kappa statistic can be estimated by

$$\mathrm{ASE}(\hat{\kappa})=\sqrt{\frac{q+r-s}{N(1-P_{e})^{2}}}. \tag{2}$$

Note that (1) and (2) can be obtained using SAS PROC FREQ.
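As an illustration, formulas (1) and (2) are also straightforward to code directly. The following is a minimal R sketch (our code, not part of the original paper; the function name is ours), with rows indexed by the physician's response Y and columns by the patient's response X. The example call reproduces κ̂ = 0.551 and ASE = 0.076 from the "Discussed CHD" table of Section 5.

```r
# Minimal sketch of formulas (1) and (2): kappa and its ASE from the
# counts of a 2x2 table (rows: Y = physician, columns: X = patient).
kappa_ase <- function(n00, n01, n10, n11) {
  N   <- n00 + n01 + n10 + n11
  p   <- matrix(c(n00, n01, n10, n11), 2, 2, byrow = TRUE) / N
  p0. <- p[1, 1] + p[1, 2]; p1. <- p[2, 1] + p[2, 2]  # row margins
  p.0 <- p[1, 1] + p[2, 1]; p.1 <- p[1, 2] + p[2, 2]  # column margins
  Po  <- p[1, 1] + p[2, 2]
  Pe  <- p.0 * p0. + p.1 * p1.
  k   <- (Po - Pe) / (1 - Pe)                         # formula (1)
  q <- p[1, 1] * (1 - (p0. + p.0) * (1 - k))^2 +
       p[2, 2] * (1 - (p1. + p.1) * (1 - k))^2
  r <- (1 - k)^2 * (p[1, 2] * (p.0 + p1.)^2 + p[2, 1] * (p.1 + p0.)^2)
  s <- (k - Pe * (1 - k))^2
  c(kappa = k, ase = sqrt((q + r - s) / (N * (1 - Pe)^2)))  # formula (2)
}

kappa_ase(27, 12, 15, 103)  # "Discussed CHD" table: kappa ~ 0.551, ASE ~ 0.076
```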

2.2. Bootstrap sampling algorithm

The ASE calculation introduced in the previous section was developed under the assumption of independence. It is therefore not appropriate for clustered data, since responses within clusters tend to be positively correlated, which results in underestimation of the standard error of the kappa statistic. To resolve this problem, we propose adopting a bootstrap method [23] that randomly samples the clusters with replacement and takes all observations belonging to the sampled clusters. This is called the cluster bootstrap method since bootstrap sampling is conducted on clusters only [24, 25, 26]. In our study, a cluster is a physician, and the observations within the cluster are patients.

2.2.1. Bootstrap sampling of clusters (physicians)

  1. Assume that there are n clusters (physicians), indexed by {1, … , n}. Draw a random sample of n clusters with replacement from the original data. The selected clusters are indexed by {1*, 2*, … , n*}, where each i* (i = 1, … , n) is an element of {1, … , n}.

  2. For each sampled cluster, take all observations belonging to cluster i*. Let $n_{i^*}$ denote the size of cluster i*, $Y_{i^*} = (y_{i^*,1}, \ldots, y_{i^*,n_{i^*}})^T$ and $X_{i^*} = (x_{i^*,1}, \ldots, x_{i^*,n_{i^*}})^T$. In our example, $Y_{i^*}$ represents the vector of responses from physician i* for his/her $n_{i^*}$ patients, and $X_{i^*}$ represents the vector of responses from the $n_{i^*}$ patients of physician i*. The bootstrap sample Z consists of (Y*, X*), where $Y^* = (Y_{1^*}^T, \ldots, Y_{n^*}^T)^T$ and $X^* = (X_{1^*}^T, \ldots, X_{n^*}^T)^T$.

  3. Repeat steps 1 and 2 B times to generate B independent bootstrap samples Z1, …, ZB.

  4. Calculate the kappa statistic $\hat{\kappa}_b$ corresponding to each bootstrap sample $Z_b$ using formula (1).

  5. Calculate the bootstrap estimate $\hat{\kappa}_{B}=\frac{1}{B}\sum_{b=1}^{B}\hat{\kappa}_{b}$.

  6. Estimate the bootstrap standard error by $\widehat{\mathrm{SE}}(\hat{\kappa}_{B})=\sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\kappa}_{b}-\hat{\kappa}_{B}\right)^{2}}$.
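To make the algorithm concrete, here is a minimal R sketch of steps 1-6 (our illustration, not the authors' code; `y` and `x` are the 0/1 response vectors and `cluster` identifies each observation's physician):

```r
# Cluster bootstrap for kappa (steps 1-6 of Section 2.2.1): resample
# physicians with replacement and carry along all of their patients.
boot_kappa <- function(y, x, cluster, B = 1000) {
  kap <- function(y, x) {                        # formula (1) for 0/1 data
    Po <- mean(y == x)
    Pe <- mean(y) * mean(x) + mean(1 - y) * mean(1 - x)
    (Po - Pe) / (1 - Pe)
  }
  ids <- unique(cluster)
  n   <- length(ids)
  kb  <- replicate(B, {
    pick <- sample(ids, n, replace = TRUE)       # step 1: sample clusters
    idx  <- unlist(lapply(pick, function(i) which(cluster == i)))  # step 2
    kap(y[idx], x[idx])                          # step 4
  })
  list(kappa_B = mean(kb),                       # step 5
       se_B    = sd(kb),                         # step 6 (denominator B - 1)
       ci_norm = mean(kb) + c(-1.96, 1.96) * sd(kb),
       ci_perc = unname(quantile(kb, c(0.025, 0.975))))
}
```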

2.2.2. Confidence intervals

The 95% confidence intervals are obtained by

$$
\begin{aligned}
\text{95\% bootstrap CI based on the normal approximation} &= \left(\hat{\kappa}_{B}-1.96\,\widehat{\mathrm{SE}}(\hat{\kappa}_{B}),\ \hat{\kappa}_{B}+1.96\,\widehat{\mathrm{SE}}(\hat{\kappa}_{B})\right),\\
\text{95\% bootstrap CI based on percentiles} &= \left(\hat{\kappa}_{B}^{(0.025)},\ \hat{\kappa}_{B}^{(0.975)}\right),
\end{aligned} \tag{3}
$$

where $\widehat{\mathrm{SE}}(\hat{\kappa}_B)$ denotes the bootstrap standard error estimate of $\hat{\kappa}_B$, and $\hat{\kappa}_B^{(1-\alpha)}$ is the 100(1 − α)th empirical percentile of the bootstrap estimates.

In addition to the two confidence intervals above, we calculate bias-corrected and accelerated (BCa) intervals, an improved percentile method that automatically corrects the bias of the bootstrap estimator and provides second-order accuracy [23, 27, 28]. We computed the BCa confidence interval following [23], with some modification since our resampling unit is clusters (physicians), not individual subjects. Let $\hat{G}(c)$ denote the empirical cumulative distribution function of the bootstrap estimates, $\hat{G}(c)=\frac{1}{B}\sum_{b=1}^{B} 1\{\hat{\kappa}_b < c\}$. Define
$$\hat{\kappa}_{\mathrm{BCa}}^{(\alpha)}=\hat{G}^{-1}\left\{\Phi\left(z_0+\frac{z_0+z^{(\alpha)}}{1-a\left(z_0+z^{(\alpha)}\right)}\right)\right\},$$
where Φ denotes the standard normal cumulative distribution function, $z^{(\alpha)}$ denotes the 100αth percentile of the standard normal distribution, and $z_0$ and $a$ are the bias-correction and acceleration constants defined below. The 95% bootstrap confidence interval using the BCa method is then defined as follows:

$$\text{95\% bootstrap CI based on } \mathrm{BC}_{a}=\left(\hat{\kappa}_{\mathrm{BCa}}^{(0.025)},\ \hat{\kappa}_{\mathrm{BCa}}^{(0.975)}\right). \tag{4}$$

The constant $\hat{z}_0$ can be computed as $\hat{z}_0=\Phi^{-1}\left(\frac{1}{B}\sum_{b=1}^{B}1\{\hat{\kappa}_b<\hat{\kappa}\}\right)$. Next, we calculate $a$ following [23]. Since our resampling unit is a cluster (physician), i = 1, … , n, we define $U_i=\hat{\kappa}_{(\cdot)}-\hat{\kappa}_{(i)}$, where $\hat{\kappa}_{(\cdot)}=\frac{1}{n}\sum_{i=1}^{n}\hat{\kappa}_{(i)}$ and $\hat{\kappa}_{(i)}$ is the kappa statistic computed from the original sample after deleting all subjects belonging to the ith cluster. We then compute $\hat{a}=\frac{\sum_{i=1}^{n}U_i^3}{6\left(\sum_{i=1}^{n}U_i^2\right)^{3/2}}$.
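In R, the BCa constants and interval can be computed along the following lines (our sketch; `kb` is the vector of B bootstrap statistics, `k_hat` the original-sample kappa, and `kap_del` the n jackknife statistics, each computed with one physician's cluster deleted):

```r
# BCa interval from bootstrap replicates and a cluster-level jackknife.
bca_ci <- function(kb, k_hat, kap_del, probs = c(0.025, 0.975)) {
  z0 <- qnorm(mean(kb < k_hat))              # bias-correction constant z0
  U  <- mean(kap_del) - kap_del              # U_i = kappa(.) - kappa(i)
  a  <- sum(U^3) / (6 * sum(U^2)^1.5)        # acceleration constant a
  za <- qnorm(probs)
  adj <- pnorm(z0 + (z0 + za) / (1 - a * (z0 + za)))  # adjusted percentiles
  unname(quantile(kb, adj))                  # G^{-1} via empirical quantiles
}
```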

Note that neither the standard nor the percentile method is second-order accurate, so a relatively larger number of bootstrap replications is required for the BCa method compared to the standard and percentile methods. Efron and Tibshirani [23] suggest that at least 1,000 bootstrap replications are needed for the BCa method.

3. Simulation set-up

In this section, we provide a detailed description of the data generation procedure for the simulation study, based on the clustered data structure in which the cluster is a physician and the observations within a cluster are the patients of that physician. The calculation of the kappa statistic, estimation of its standard error, and construction of its confidence intervals follow. Suppose that a pair of dichotomous responses is obtained for each physician-patient encounter. For example, the dichotomous response could denote a survey response about the physician-patient discussion or an assessment of the treatment.

3.1. Generating dichotomous responses for physician-patient pairs

3.1.1. Notation and assumptions

Suppose we have n clusters representing n physicians, and each cluster consists of m pairs of dichotomous responses from the physician-patient pairs. For patient i of a physician, let Yi and Xi be random variables representing the physician’s assessment and the patient’s assessment of the same discussion, respectively. Note that Yi ∈ {0, 1} and Xi ∈ {0, 1}, with Yi = 1 or Xi = 1 denoting “yes” for a given question. Let Y = (Y1, … , Ym)T and X = (X1, … , Xm)T denote the random vectors of dichotomous responses for a physician and his/her patients, and let μy = (μy1, … , μym)T = (P{Y1 = 1}, … , P{Ym = 1})T = E[Y] and μx = (μx1, … , μxm)T = (P{X1 = 1}, … , P{Xm = 1})T = E[X] denote the corresponding marginal mean vectors. The correlation matrix of the response vectors is defined as $R_c=(r_{ij})=\begin{pmatrix} R_{yy} & R_{xy}\\ R_{xy}^{T} & R_{xx}\end{pmatrix}$, where Ryy = (Corr(Yi, Yj)), Rxy = (Corr(Yi, Xj)), and Rxx = (Corr(Xi, Xj)).

In this simulation, a homogeneous cluster size is assumed for simplicity, although the same procedure can be applied to heterogeneous cluster sizes. We assume that all physicians have the same mean vector, μy1 = … = μym = μy, and all patients have the same mean vector, μx1 = … = μxm = μx. Also, all physicians have the same correlation matrix and the same strength of agreement with patients. An exchangeable correlation structure is assumed within physician and between physician and patient. Hence we define ρw = Corr(Yi, Yj), i ≠ j, to be the within-physician correlation and ρb = Corr(Yi, Xi) to be the physician-patient correlation. The parameter ρb is related to kappa as explained in subsection 3.1.3. Since all physicians are assumed to have the same mean and correlation matrix, we generate n independent sets of responses for the n physicians by repeating the following data generating procedure n times independently.

3.1.2. Generating correlated dichotomous responses within physicians

Note that each physician could have their own practice pattern, so it is reasonable to assume that the responses from a physician for different patients are correlated. We generate an m × 1 vector of correlated dichotomous responses Y for each of the n physicians following Qaqish [29]. Qaqish [29] introduced the conditional linear family of multivariate Bernoulli distributions which is useful for simulating correlated binary random variables with specified marginal mean vector μ = (μ1, … , μm)T and correlation matrix, R = (rij). The algorithm has been implemented in the R-package binarySimCLF [30], which we used to generate responses for physicians.

Before proceeding with the description of the data generating procedure, we briefly describe some restrictions that any valid (μ, R) should satisfy, as noted in [29]. First, the correlation matrix R = (rij) should be positive definite; second, restrictions on rij are imposed by μi and μj. Defining ψi = μi/(1 − μi), the correlations must satisfy

$$\max\left(-\sqrt{\psi_i\psi_j},\ -\frac{1}{\sqrt{\psi_i\psi_j}}\right)\le r_{ij}\le\min\left(\sqrt{\frac{\psi_i}{\psi_j}},\ \sqrt{\frac{\psi_j}{\psi_i}}\right),\qquad (i\ne j). \tag{5}$$

For example, if we assume a homogeneous mean μy = 0.4 for all Yi, (5) implies −2/3 ≤ ρw ≤ 1. Since we assume an exchangeable and positive correlation between a physician's responses, both conditions are satisfied.
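The paper generates Y with the binarySimCLF package [30]. For the exchangeable case used here, the conditional linear family of [29] reduces to a simple sequential scheme, which we sketch below in R for illustration (our derivation-based code, not the package's; within the bounds above with 0 ≤ ρ ≤ 1, the conditional means always lie in [0, 1]):

```r
# Sequential generation of an exchangeable correlated binary vector via the
# conditional linear family: E[Y_i | y_1, ..., y_{i-1}] =
#   mu + rho / (1 + (i - 2) * rho) * sum(y_j - mu).
r_exch_bin <- function(m, mu, rho) {
  y <- numeric(m)
  y[1] <- rbinom(1, 1, mu)
  for (i in seq_len(m)[-1]) {
    lam  <- mu + rho / (1 + (i - 2) * rho) * sum(y[seq_len(i - 1)] - mu)
    y[i] <- rbinom(1, 1, lam)
  }
  y
}
```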

3.1.3. Generating responses for patients given responses for physician

Once the dichotomous responses for physicians, Y, are generated, the dichotomous responses for patients given the responses for physicians, X, can be generated in such a way that kappa and the marginal means attain their stipulated values. We assume that responses for patients are conditionally independent given the physician's responses; this implies that, marginally, responses for patients are correlated within physicians.

Let us consider a 2 × 2 table, where Yi denotes the dichotomous response of a physician about patient i, and Xi denotes the corresponding patient’s response. Let a ≡ P(Yi = 0, Xi = 0) and c ≡ P(Yi = 1, Xi = 0). Also, d and b can be expressed as follows:

$$d=P(Y_i=1,X_i=1)=\mu_y\mu_x+\rho_b\sqrt{\mu_y(1-\mu_y)\mu_x(1-\mu_x)},\qquad b=P(Y_i=0,X_i=1)=\mu_x-d,$$

and we define b0 and b1 by

$$
\begin{aligned}
b_0 &= \frac{b}{1-\mu_y}=\frac{P(Y_i=0,X_i=1)}{P(Y_i=0)}=P(X_i=1\mid Y_i=0),\\
b_1 &= \frac{d}{\mu_y}-b_0=\frac{P(Y_i=1,X_i=1)}{P(Y_i=1)}-b_0=P(X_i=1\mid Y_i=1)-P(X_i=1\mid Y_i=0).
\end{aligned} \tag{6}
$$

This implies that E[Xi | Yi] = b0 + b1Yi. Therefore, we generate Xi, i = 1, … , m, as independent Bernoulli variables with conditional means b0 + b1Yi, i = 1, … , m. The correlation coefficient between the responses of physician and patient must satisfy the restriction given in (5). In the simulation, we set μy = 0.4 and μx = 0.5, so ψy = μy/(1 − μy) = 2/3 and ψx = μx/(1 − μx) = 1. Therefore,

$$-\sqrt{2/3}\le\rho_b\le\sqrt{2/3}=0.816497.$$

We calculate κ0 and ρb as follows:

$$\kappa_0=\frac{a+d-\left[\mu_y\mu_x+(1-\mu_y)(1-\mu_x)\right]}{1-\left[\mu_y\mu_x+(1-\mu_y)(1-\mu_x)\right]},\qquad \rho_b=\frac{d-\mu_y\mu_x}{\sqrt{\mu_y(1-\mu_y)\mu_x(1-\mu_x)}}.$$

From the above two formulae, κ0 and ρb are related by $\kappa_0=\rho_b\left\{\tfrac{1}{2}\left(\sqrt{\psi_y/\psi_x}+\sqrt{\psi_x/\psi_y}\right)\right\}^{-1}$.

Setting μy = 0.4 and μx = 0.5, the maximum available value of ρb is 0.816497, and hence the maximum value of κ0 we can use is 0.816497/1.02062 = 0.8. This is a reasonable bound in practice. For more details on parameter restrictions, see [29]. A detailed calculation of the marginal correlation between patients within a physician is given in Appendix 1. Using this simulation algorithm, for each configuration we generated M = 1,000 independent data sets (Monte Carlo simulations), with n clusters each.
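To illustrate Sections 3.1.2-3.1.3 end to end, the mapping from (μy, μx, κ0) to (ρb, b0, b1) and the conditional generation of X can be sketched in R as follows (our illustration; it reuses `r_exch_bin` from the sketch in Section 3.1.2):

```r
# Generate patient responses X given physician responses y so that the
# marginal mean mu_x and the agreement kappa0 take their stipulated values.
r_patients <- function(y, mu_y, mu_x, kappa0) {
  psi_y <- mu_y / (1 - mu_y); psi_x <- mu_x / (1 - mu_x)
  rho_b <- kappa0 * (sqrt(psi_y / psi_x) + sqrt(psi_x / psi_y)) / 2
  d  <- mu_y * mu_x + rho_b * sqrt(mu_y * (1 - mu_y) * mu_x * (1 - mu_x))
  b0 <- (mu_x - d) / (1 - mu_y)        # P(X = 1 | Y = 0), formula (6)
  b1 <- d / mu_y - b0                  # P(X = 1 | Y = 1) - P(X = 1 | Y = 0)
  rbinom(length(y), 1, b0 + b1 * y)    # conditionally independent given y
}

# One simulated physician with m = 20 patients under the settings of
# Section 4 (mu_y = 0.4, mu_x = 0.5, rho_w = 0.3, kappa0 = 0.5):
y <- r_exch_bin(20, mu = 0.4, rho = 0.3)
x <- r_patients(y, mu_y = 0.4, mu_x = 0.5, kappa0 = 0.5)
```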

3.2. Calculate the kappa statistic and the bootstrap kappa statistic

After generating (Y, X), we calculate the kappa statistic κ̂ assuming independent observations by formula (1) and ASE(κ̂) by formula (2). A 95% confidence interval is constructed as (κ̂ − 1.96 × ASE(κ̂), κ̂ + 1.96 × ASE(κ̂)). We calculate the bootstrap kappa statistic, κ̂B, and the bootstrap standard error estimate, ŜE(κ̂B), as described in Section 2.2. The 95% bootstrap confidence intervals based on the normal approximation, on percentiles, and on the BCa method are constructed by formulas (3) and (4). M independent data sets are simulated, and the coverage rate is calculated as follows:

$$\text{Coverage rate}(\%)=\frac{\text{Number of times the true kappa value }(\kappa_0)\text{ lies within the confidence interval}}{\text{Total number of simulations }(M)}\times 100. \tag{7}$$

A method whose coverage rate is closer to the target nominal coverage probability, for example, 95%, has better coverage performance.

3.3. The number of Monte-Carlo simulations

The number of Monte Carlo simulations, denoted by M, can be determined by $M=\left(\frac{Z_{1-\alpha/2}\,\sigma}{\delta}\right)^{2}$ following [31], where σ denotes the standard deviation of κ̂, δ is the permissible difference between κ0 and κ̂, and $Z_{1-\alpha/2}$ denotes the (1 − α/2)th quantile of the standard normal distribution. The standard error estimates of κ̂ and κ̂B from the data analysis results presented later in this study are between 0.053 and 0.091. With M = 500, the permissible difference δ between κ0 and κ̂ is between 0.0046 and 0.0080. In our simulation study, M = 1,000 and the corresponding δ is between 0.0033 and 0.0056, which is reasonably small.
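As a quick check of this arithmetic (our snippet):

```r
# delta = z_{0.975} * sigma / sqrt(M) for the range of SEs quoted above.
round(qnorm(0.975) * c(0.053, 0.091) / sqrt(500),  4)  # 0.0046 0.0080
round(qnorm(0.975) * c(0.053, 0.091) / sqrt(1000), 4)  # 0.0033 0.0056
```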

4. Simulation results

In this section, we present simulation results examining the performance of the bootstrap estimates with clustered data under various scenarios for the number of physicians (number of clusters), the number of patients per physician (cluster size), and the strength of agreement between physician and patient (kappa value). We also present results under the independence assumption. The simulation was performed using R (Version 2.15.1) and the R package binarySimCLF.

4.1. Determine the number of bootstrap replications B

Before comparing the two methods, a simulation study was conducted to decide the number of bootstrap replications B. We generated 1,000 independent data sets as described in Section 3. Each data set consists of 25 clusters (physicians), and each cluster consists of 20 pairs of dichotomous responses for physician and patient. For each simulated data set, numbers of replications B ranging from 50 to 2,000 were explored. For this simulation study, the marginal means of the dichotomous responses for physicians (μy) and patients (μx) were 0.4 and 0.5, respectively. The correlation coefficient within a physician (ρw) was 0.3. We calculated the Monte Carlo standard error estimate (MCSE) of κ̂B and ŜE(κ̂B) over M = 1,000 simulations by $\sqrt{\sum_{m=1}^{M}\left(\hat{\theta}^{(m)}-\bar{\hat{\theta}}\right)^{2}/\{M(M-1)\}}$, where $\bar{\hat{\theta}}=\frac{1}{M}\sum_{m=1}^{M}\hat{\theta}^{(m)}$ and $\hat{\theta}^{(m)}$ denotes the estimate of the parameter of interest, θ, obtained from the mth simulation.
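Equivalently, the MCSE is just the standard deviation of the M estimates divided by √M; in R (our one-line sketch):

```r
# Monte-Carlo standard error of a mean over M simulation estimates.
mcse <- function(est) sd(est) / sqrt(length(est))
```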

Table 1 provides summary statistics for the bootstrap method for various numbers of bootstrap replications B: the bootstrap estimate of the kappa statistic (mean and MCSE of κ̂B), the bootstrap standard error estimate (mean and MCSE of ŜE(κ̂B)), and the 95% coverage rates for the bootstrap confidence interval based on the normal approximation (CRBS), on percentiles (CRBP), and on the BCa method (CRBBCa). We observed no notable change in κ̂B, ŜE(κ̂B), or the coverage rates based on the normal approximation and percentile methods as the number of bootstrap replications increased. However, the coverage rate based on the BCa method improved with the number of bootstrap replications up to B = 1,000, which is the minimum number suggested by [23]. Therefore, we concluded that 1,000 is a reasonable number of bootstrap replications, and results with B = 1,000 are presented for the remaining simulations.

Table 1.

Simulation results for determining the number of bootstrap replications B (μy = 0.4, μx = 0.5, ρw = 0.3, κ0 = 0.5, number of clusters = 25, and cluster size = 20)

B      Mean(κ̂B)  MCSE(κ̂B)  Mean(ŜE(κ̂B))  MCSE(ŜE(κ̂B))  CRBS   CRBP   CRBBCa
50     0.4947     0.0014     0.0407         0.0002         91.9   92.7   90.4
100    0.4946     0.0014     0.0408         0.0002         92.6   93.2   91.8
200    0.4947     0.0014     0.0410         0.0002         92.9   92.2   92.5
300    0.4947     0.0014     0.0411         0.0002         93.1   92.3   92.7
500    0.4947     0.0014     0.0412         0.0002         93.2   92.9   92.6
1000   0.4947     0.0014     0.0412         0.0002         93.2   93.0   93.1
1500   0.4948     0.0013     0.0411         0.0002         93.2   93.0   93.5
2000   0.4947     0.0013     0.0411         0.0002         93.2   93.2   93.1

Note: Mean of κ^B denotes the average bootstrap kappa statistic.

MCSE of κ^B denotes the Monte-Carlo Standard Error estimate of the bootstrap kappa statistic.

Mean of SE^(κ^B) denotes the average bootstrap standard error estimate of bootstrap kappa statistic.

MCSE of SE^(κ^B) denotes the Monte-Carlo Standard Error estimate of SE^(κ^B).

CRBS denotes the 95% confidence interval coverage rate (%) using the 95% confidence interval based on normal approximation.

CRBP denotes the 95% confidence interval coverage rate (%) using the 95% confidence interval based on percentiles.

CRBBCa denotes the 95% confidence interval coverage rate (%) using the 95% confidence interval based on BCa.

4.2. Varying strength of agreement between raters, the number of clusters, and cluster size

4.2.1. Simulation set-up and summary statistics

We generated 1,000 independent data sets for each setting, varying the number of clusters n = (10, 25, 50, 100) and the cluster size m = (5, 20, 50, 100). Poor (κ0 = 0), fair (0.3), moderate (0.5), and substantial (0.8) strengths of agreement between the two raters (physician and patient) were investigated. The marginal means of the dichotomous responses for physicians (μy) and patients (μx) were 0.4 and 0.5, respectively. The correlation coefficient of the responses within a physician (ρw) was fixed at 0.3 (exchangeable correlation structure). As discussed in Section 3.1.3, 0.8 is the maximum possible value of κ0 for μy = 0.4, μx = 0.5, and ρw = 0.3.

Table 2 provides the kappa statistics assuming independent observations (mean and MCSE of κ̂), the asymptotic standard error estimates assuming independence (mean and MCSE of ASE(κ̂)), the empirical standard deviations of the kappa statistics over 1,000 simulations (std(κ̂)), and the 95% coverage rates using the confidence interval based on the normal approximation (CRS), in addition to the summary statistics for bootstrapping on the physicians introduced in Section 4.1.

Table 2.

Summary of the simulation results for comparing the bootstrap method and the method assuming independent observations with various levels of agreement between physician and patient for μy = 0.4, μx = 0.5, and ρw = 0.3.

Columns, in order: # of physicians; # of patients; κ0; [kappa assuming independent observations] Mean(κ̂), MCSE(κ̂), Mean(ASE(κ̂)), MCSE(ASE(κ̂)), std(κ̂), CRS; [kappa using the bootstrap method] Mean(κ̂B), MCSE(κ̂B), Mean(ŜE(κ̂B)), MCSE(ŜE(κ̂B)), CRBS, CRBP, CRBBCa.
25 5 0.0 0.004 0.0026 0.086 0.000107 0.081 95.9 0.004 0.0025 0.083 0.000362 95.1 94.8 94.2
0.3 0.296 0.0027 0.083 0.000105 0.085 93.5 0.293 0.0027 0.081 0.000346 92.3 91.9 92.2
0.5 0.493 0.0025 0.076 0.000127 0.080 93.0 0.488 0.0025 0.076 0.000331 91.8 92.5 93.2
0.8 0.797 0.0018 0.053 0.000198 0.057 91.3 0.794 0.0018 0.056 0.000316 93.4 93.7 94.2

20 0.0 0.002 0.0014 0.043 0.000037 0.044 94.0 0.002 0.0014 0.042 0.000202 93.2 92.4 91.6
0.3 0.299 0.0014 0.042 0.000028 0.045 93.0 0.296 0.0014 0.042 0.000186 93.2 93.4 93.3
0.5 0.498 0.0013 0.038 0.000032 0.043 91.2 0.495 0.0014 0.041 0.000189 93.2 93.0 93.1
0.8 0.798 0.0011 0.026 0.000062 0.034 87.7 0.796 0.0011 0.034 0.000200 94.7 94.6 94.4

50 5 0.0 −0.004 0.0019 0.061 0.000049 0.061 95.3 −0.004 0.0019 0.060 0.000194 94.7 94.1 93.6
0.3 0.295 0.0019 0.059 0.000050 0.059 95.5 0.293 0.0019 0.059 0.000184 94.4 94.3 94.2
0.5 0.494 0.0017 0.054 0.000061 0.055 93.9 0.492 0.0017 0.054 0.000172 93.8 93.8 93.7
0.8 0.797 0.0013 0.037 0.000098 0.041 92.4 0.795 0.0013 0.040 0.000168 94.0 94.2 94.0

20 0.0 0.000 0.0010 0.031 0.000018 0.031 93.7 0.000 0.0010 0.030 0.000101 94.8 94.4 93.5
0.3 0.298 0.0010 0.030 0.000012 0.031 94.8 0.297 0.0010 0.030 0.000095 94.6 94.6 95.0
0.5 0.498 0.0009 0.027 0.000016 0.030 92.2 0.497 0.0009 0.029 0.000098 93.7 93.7 93.7
0.8 0.799 0.0008 0.019 0.000031 0.024 89.2 0.798 0.0008 0.024 0.000097 95.4 95.0 95.2

100 5 0.0 −0.002 0.0013 0.044 0.000024 0.042 96.1 −0.001 0.0013 0.043 0.000098 96.0 95.9 95.9
0.3 0.297 0.0013 0.042 0.000023 0.041 95.6 0.296 0.0013 0.042 0.000091 95.5 95.2 95.0
0.5 0.497 0.0012 0.038 0.000029 0.038 95.7 0.496 0.0012 0.039 0.000084 96.3 96.1 96.4
0.8 0.798 0.0009 0.026 0.000048 0.029 92.1 0.797 0.0009 0.028 0.000086 93.8 93.6 93.9

20 0.0 0.001 0.0006 0.020 0.000008 0.019 94.2 0.001 0.0006 0.019 0.000045 94.5 94.6 94.1
0.3 0.301 0.0007 0.021 0.000006 0.022 94.3 0.300 0.0007 0.021 0.000049 94.7 94.2 93.7
0.5 0.500 0.0006 0.017 0.000006 0.019 93.3 0.499 0.0006 0.019 0.000046 95.4 95.4 94.9
0.8 0.799 0.0005 0.012 0.000013 0.016 85.9 0.799 0.0005 0.016 0.000045 95.2 94.8 94.5

Note: κ0 denotes strength of agreement between physician and patient.

Mean (MCSE) of κ^ denotes the average (Monte-Carlo Standard Error estimate of) kappa statistic assuming independence.

Mean (MCSE) of ASE(κ^) denotes the average (Monte-Carlo Standard Error estimate of) asymptotic standard error of kappa statistic assuming independence.

std(κ^) denotes the empirical standard deviation of kappa statistics.

CRS denotes the 95% confidence interval coverage rate (%) for κ^ using the 95% confidence interval based on normal approximation assuming independence.

Mean (MCSE) of κ^B denotes the average (Monte-Carlo Standard Error estimate of) bootstrap kappa statistic.

Mean (MCSE) of SE^(κ^B) denotes the average (Monte-Carlo Standard Error estimate of ) bootstrap standard error estimate of bootstrap kappa statistic.

CRBS denotes the 95% confidence interval coverage rate (%) for κ^B using the 95% bootstrap confidence interval based on normal approximation.

CRBP denotes the 95% confidence interval coverage rate (%) for κ^B using the 95% bootstrap confidence interval based on percentiles.

CRBBCa denotes the 95% confidence interval coverage rate (%) for κ̂B using the 95% bootstrap confidence interval based on BCa.

4.2.2. Simulation result

Figures 1(a) and 1(b) display the average bias of κ̂ over 1,000 Monte Carlo simulations. Both methods produced better point estimates of κ with larger numbers of physicians and of patients per physician. No marked difference between the two methods was observed in the point estimates, although the kappa statistics assuming independent observations were slightly closer to κ0 than those from the bootstrap method across all strengths of agreement between raters, especially with a small number of physicians (n = 10). This bias is negligible according to [23].

Figure 1. Comparisons between κ̂ assuming independent observations (denoted by Indep{# of physicians}) and κ̂B using the cluster bootstrap method (denoted by Boots{# of physicians}) for numbers of physicians (10, 25, 50, 100) at different κ0 values (strength of agreement between physician and patient).

Table 2 and Figures 1(c), 1(d), 1(e), and 1(f) present the 95% coverage rates, the average ratio of ASE(κ̂) to the empirical standard deviation std(κ̂), and the average ratio of the bootstrap standard error estimate ŜE(κ̂B) to the empirical standard deviation, for numbers of clusters n = (25, 50, 100). The empirical standard deviation was calculated from the 1,000 estimates of κ and can be considered the 'true' standard deviation of the kappa statistic. Overall, the average ratio of ASE(κ̂) to std(κ̂) decreased rapidly as the strength of agreement between physician and patient increased, while the average ratio of ŜE(κ̂B) to std(κ̂) stayed close to 1. This indicates that the standard error calculation assuming independent observations tends to underestimate the standard error of the kappa statistic on average, particularly when there is strong agreement between raters. The underestimated standard error produced confidence intervals that became too narrow as the strength of agreement increased, while the coverage rate based on the bootstrap method remained close to the target nominal level (95%) as long as the number of physicians was not very small, even under substantial agreement between raters. The advantage of the cluster bootstrap method was more obvious with larger numbers of physicians and of patients per physician (Figures 1(e) and 1(f)). Overall, the BCa method produced confidence intervals similar to or slightly better than the percentile and normal approximation methods, except for the case with a small number of physicians (n = 10) under no agreement between raters.

4.3. Varying strength of correlation within cluster (physician), the number of clusters, and cluster size

So far, we have examined the performance of the two methods for various strengths of agreement between raters while keeping the within-cluster correlation fixed. In this section, we varied the correlation coefficient within cluster (ρw) under a fixed moderate strength of agreement between raters (κ0 = 0.5). Table 3 and Figure 2 provide the simulation results for ρw = 0.1, 0.3, 0.5, and 0.8. The ASE of the kappa statistic assuming independent observations tended to underestimate the standard error on average when subjects within a physician were at least moderately positively correlated, while the cluster bootstrap method did not. This pattern is more obvious in Figure 2. Similar to the results obtained by varying the strength of agreement between raters, however, the bootstrap method produced poorer coverage rates than the ASE assuming independent observations when both the number of physicians and the number of patients per physician were very small and the within-physician correlation was weak.

Table 3.

Summary of the simulation results comparing the bootstrap method and the method assuming independent observations with various levels of the correlation coefficient between any two responses of a physician (μy = 0.4, μx = 0.5, ρb = 0.29, and κ0 = 0.5).

Columns, in order: # of physicians; # of patients; ρw; [kappa assuming independent observations] Mean(κ̂), MCSE(κ̂), Mean(ASE(κ̂)), MCSE(ASE(κ̂)), std(κ̂), CRS; [kappa using the bootstrap method] Mean(κ̂B), MCSE(κ̂B), Mean(ŜE(κ̂B)), MCSE(ŜE(κ̂B)), CRBS, CRBP, CRBBCa.
25 5 0.1 0.495 0.0025 0.076 0.000127 0.079 93.6 0.492 0.0025 0.075 0.000326 92.0 92.6 92.9
0.3 0.493 0.0025 0.076 0.000127 0.080 93.0 0.488 0.0025 0.076 0.000331 91.8 92.5 93.2
0.5 0.490 0.0025 0.076 0.000126 0.079 92.9 0.484 0.0025 0.078 0.000336 92.6 93.4 93.6
0.8 0.488 0.0025 0.076 0.000128 0.080 93.1 0.479 0.0025 0.081 0.000358 93.6 93.1 94.2

20 0.1 0.499 0.0013 0.038 0.000032 0.041 93.6 0.497 0.0013 0.038 0.000181 92.2 91.8 92.1
0.3 0.498 0.0013 0.038 0.000032 0.043 91.2 0.495 0.0014 0.041 0.000189 93.2 93.0 93.1
0.5 0.498 0.0014 0.038 0.000033 0.045 90.1 0.492 0.0014 0.044 0.000231 93.6 93.1 92.5
0.8 0.494 0.0015 0.038 0.000034 0.048 89.1 0.484 0.0015 0.050 0.000336 95.8 94.4 94.1

50 5 0.1 0.496 0.0017 0.054 0.000061 0.054 94.4 0.494 0.0017 0.054 0.000164 94.1 93.6 93.8
0.3 0.494 0.0017 0.054 0.000061 0.055 93.9 0.492 0.0017 0.054 0.000172 93.8 93.8 93.7
0.5 0.494 0.0018 0.054 0.000062 0.056 93.2 0.491 0.0018 0.055 0.000169 93.4 93.5 94.2
0.8 0.494 0.0018 0.054 0.000062 0.057 92.6 0.489 0.0018 0.057 0.000181 94.1 93.6 93.7

20 0.1 0.500 0.0009 0.027 0.000015 0.028 94.9 0.499 0.0009 0.027 0.000090 94.6 94.6 95.0
0.3 0.498 0.0009 0.027 0.000016 0.030 92.2 0.497 0.0009 0.029 0.000098 93.7 93.7 93.7
0.5 0.497 0.0010 0.027 0.000016 0.032 90.6 0.494 0.0010 0.031 0.000115 94.1 94.0 94.2
0.8 0.495 0.0011 0.027 0.000017 0.034 89.2 0.491 0.0011 0.034 0.000168 95.3 94.0 94.5

100 5 0.1 0.498 0.0012 0.038 0.000030 0.038 95.4 0.497 0.0012 0.038 0.000085 95.7 95.6 95.3
0.3 0.497 0.0012 0.038 0.000029 0.038 95.7 0.496 0.0012 0.039 0.000084 96.3 96.1 96.4
0.5 0.495 0.0012 0.038 0.000029 0.039 94.2 0.494 0.0012 0.039 0.000088 95.7 94.9 95.3
0.8 0.496 0.0013 0.038 0.000030 0.040 93.6 0.494 0.0013 0.040 0.000094 94.5 94.3 94.8

20 0.1 0.500 0.0006 0.019 0.000007 0.019 93.9 0.500 0.0006 0.019 0.000045 93.9 94.0 94.0
0.3 0.500 0.0006 0.019 0.000008 0.020 93.3 0.499 0.0006 0.021 0.000049 94.2 94.2 94.2
0.5 0.499 0.0007 0.019 0.000008 0.021 92.0 0.498 0.0007 0.022 0.000058 95.4 95.1 94.9
0.8 0.499 0.0007 0.019 0.000008 0.023 90.2 0.497 0.0007 0.024 0.000081 96.1 95.2 95.7

Note: ρw denotes the correlation coefficient within a physician.

Mean (MCSE) of κ^ denotes the average (Monte-Carlo Standard Error estimate of) kappa statistic assuming independence.

Mean (MCSE) of ASE(κ^) denotes the average (Monte-Carlo Standard Error estimate of) asymptotic standard error of kappa statistic assuming independence.

std(κ^) denotes the empirical standard deviation of kappa statistics.

CRS denotes the 95% confidence interval coverage rate (%) for κ^ using the 95% confidence interval based on normal approximation assuming independence.

Mean (MCSE) of κ^B denotes the average (Monte-Carlo Standard Error estimate of) bootstrap kappa statistic.

Mean (MCSE) of SE^(κ^B) denotes the average (Monte-Carlo Standard Error estimate of ) bootstrap standard error estimate of bootstrap kappa statistic.

CRBS denotes the 95% confidence interval coverage rate (%) for κ^B using the 95% bootstrap confidence interval based on normal approximation.

CRBP denotes the 95% confidence interval coverage rate (%) for κ^B using the 95% bootstrap confidence interval based on percentiles.

CRBBCa denotes the 95% confidence interval coverage rate (%) for κ̂B using the 95% bootstrap confidence interval based on BCa.

Figure 2. Comparisons between κ̂ assuming independent observations (denoted by Indep{# of physicians}) and κ̂B using the cluster bootstrap method (denoted by Boots{# of physicians}) for numbers of physicians (10, 25, 50, 100) at different ρw values (within-physician correlation coefficient).

5. Example

5.1. Data Description

A clustered data structure occurred in our motivating example, in which investigators were interested in the agreement between physicians and their patients regarding the content of discussions about CHD during a clinic visit. Our study consisted of a subset of physician/patient pairs from a larger randomized trial. This larger study [32] compared control patients who received usual care with intervention patients who received a computerized decision aid regarding heart disease prevention followed by several adherence reminders. The patients in the larger study were seen for three visits and evaluated at the third visit for predicted CHD risk and adherence.

Our nested study focused on the discussions between 24 physicians and their 157 patients during the second visit. Complete eligibility criteria, described elsewhere [14], included no history of CHD or diabetes but at least a moderate risk of developing CHD within the next 10 years, as predicted by the Framingham risk calculator using risk factors such as age, gender, smoking status, diabetes status, presence/absence of an enlarged heart, blood pressure, total cholesterol, and LDL. Among the physician/patient pairs who participated in the larger study and were also eligible for the sub-study, almost all participated in this smaller study (100% of physicians and 97% of patients).

Following the second clinic visit, each physician and his/her patient were surveyed regarding the content of their discussion during that visit. Patients were surveyed immediately following the visit, and the majority of physicians completed the survey on the same day as the clinic visit. The items of interest were the agreement between patient and physician about whether CHD was discussed, whether the physician recommended taking medicine, and whether the physician recommended changing lifestyle. For each of these binary measures, the kappa statistic was computed assuming independence. Because each physician could have more than one patient (1-20 patients per physician), the cluster bootstrap method was used to account for the correlated data structure in calculating the standard error of the kappa statistic. Physicians were sampled with replacement, and all patients corresponding to a selected physician were included. Tables 4, 5, and 6 provide the cell counts of the 2 × 2 tables for three topics: Discussed CHD, MD recommended medication, and MD recommended lifestyle change.

Table 4.

Cell counts of the 2 × 2 table for Discussed CHD

                   Patient: No    Patient: Yes    Total
Physician: No      27 (0.172)     12 (0.076)      39 (0.248)
Physician: Yes     15 (0.096)     103 (0.656)     118 (0.752)
Total              42 (0.268)     115 (0.732)     157

Table 5.

Cell counts of the 2 × 2 table for MD recommended medication

                   Patient: No    Patient: Yes    Total
Physician: No      29 (0.223)     19 (0.146)      48 (0.369)
Physician: Yes     17 (0.131)     65 (0.500)      82 (0.631)
Total              46 (0.354)     84 (0.646)      130

Table 6.

Cell counts of the 2 × 2 table for MD recommended lifestyle change

                   Patient: No    Patient: Yes    Total
Physician: No      51 (0.392)     15 (0.115)      66 (0.508)
Physician: Yes     18 (0.139)     46 (0.354)      64 (0.492)
Total              69 (0.531)     61 (0.469)      130

5.2. Analysis Results

Table 7 presents the kappa statistics describing agreement between physician and patient regarding CHD discussions, for both the method assuming independence and the cluster bootstrap method presented in Section 4. Of the 157 discussions, 103 physician/patient pairs agreed that CHD had been discussed and 27 pairs agreed that CHD was not discussed, indicating moderate agreement between physician and patient. The 130 discussions in which at least one member of the physician/patient pair reported that CHD was discussed were further analyzed. Moderate agreement was found on whether the physician recommended taking medicine and fair agreement on whether the physician recommended lifestyle modification. We note that the SEs associated with the kappa statistic for "Discussed CHD" and "MD recommended medication" are smaller under the bootstrap method than under the kappa method assuming independent patients, while the SE for "MD recommended lifestyle change" is greater under the bootstrap method. The kappa statistic is a function of the first two moments; hence its asymptotic standard error involves third- and fourth-order moments, which can make the bootstrap standard error either larger or smaller than that assuming independence.

Table 7.

Agreement between physician and patient regarding CHD discussions.

The first three columns after the topic assume independent observations; the remaining columns use the bootstrap method.

Topic                              κ̂      ASE(κ̂)   CIS               κ̂B     ŜE(κ̂B)   CIBS              CIBP              CIBBCa
Discussed CHD                      0.551   0.076     (0.402, 0.700)    0.545   0.053     (0.441, 0.650)    (0.432, 0.648)    (0.440, 0.654)
MD recommended medication          0.400   0.083     (0.237, 0.563)    0.405   0.067     (0.275, 0.536)    (0.287, 0.541)    (0.284, 0.536)
MD recommended lifestyle change    0.492   0.076     (0.342, 0.641)    0.485   0.091     (0.305, 0.664)    (0.303, 0.662)    (0.322, 0.680)

Note: κ^ denotes the kappa statistic assuming independence.

ASE(κ^) denotes the asymptotic standard error of kappa statistics assuming independence.

CIS denotes the 95% confidence interval based on normal approximation assuming independence.

κ^B denotes the bootstrap kappa statistic.

SE^(κ^B) denotes the bootstrap standard error estimate of bootstrap kappa statistic.

CIBS denotes the 95% bootstrap confidence interval based on normal approximation.

CIBP denotes the 95% bootstrap confidence interval based on percentiles.

CIBBCa denotes the 95% bootstrap confidence interval based on BCa.

The number of bootstrap replications is B = 1,000.

Analysis was performed using R (Version 2.15.1).

5.3. Additional simulation to understand data analysis result

To further demonstrate the possibility of obtaining a smaller bootstrap standard error than the ASE in a particular data analysis, we conducted a small simulation in which data were generated to mimic the "Discussed CHD" data in terms of the marginal means of the dichotomous responses (0.752 for physicians and 0.732 for patients), the kappa statistic (0.55), and the heterogeneous number of patients per physician. The number of physicians and the average number of patients per physician in the data are 24 and 6.5, respectively. We compared standard error estimates and coverage rates by varying the number of physicians (24 or 72), the average number of patients per physician (6.5 or 13), and the within-physician correlation (ρw = 0.1, 0.3, 0.7).

Table 8 summarizes the results of the simulation study mimicking the "Discussed CHD" data. With the same number of physicians (n = 24) and average number of patients per physician (m = 6.5) as the "Discussed CHD" data, the bootstrap standard error estimate of the kappa statistic was, on average, slightly smaller than the ASE assuming independent observations under weak within-physician correlation. The bootstrap method improved upon the ASE assuming independent observations as the number of physicians or the within-physician correlation increased, rather than as the number of patients per physician increased. The proportion of simulations with a greater ASE than bootstrap SE was 53.7%, 41.9%, and 19.9% for strong (ρw = 0.7), fair (0.3), and weak (0.1) within-physician correlation, respectively. This result shows that the bootstrap standard error estimate of the kappa statistic can be smaller than the ASE assuming independence in a particular data analysis, especially for a data set with a small number of clusters and small cluster size.

Table 8.

Summary of the simulation results mimicking the "Discussed CHD" data with various numbers of clusters, cluster sizes, and within-cluster correlations, under μy = 0.752, μx = 0.732, and κ0 = 0.55.

Columns, in order: # of physicians; average # of patients; ρw; [kappa assuming independent observations] Mean(κ̂), MCSE(κ̂), Mean(ASE(κ̂)), MCSE(ASE(κ̂)), std(κ̂), CRS; [kappa using the bootstrap method] Mean(κ̂B), MCSE(κ̂B), Mean(ŜE(κ̂B)), MCSE(ŜE(κ̂B)), CRBS, CRBP, CRBBCa.
24 6.5 0.1 0.546 0.0025 0.0760 0.000209 0.0781 94.1 0.541 0.0025 0.0758 0.000504 92.8 91.8 92.1
0.3 0.537 0.0026 0.0771 0.000271 0.0836 93.7 0.527 0.0026 0.0822 0.000608 92.6 92.0 92.1
0.7 0.523 0.0032 0.0783 0.000363 0.0990 89.7 0.499 0.0031 0.0984 0.000912 92.6 89.7 91.7

13 0.1 0.543 0.0018 0.0542 0.000121 0.0567 93.6 0.539 0.0018 0.0545 0.000368 92.5 93.1 92.8
0.3 0.536 0.0019 0.0547 0.000166 0.0616 91.9 0.526 0.0019 0.0621 0.000522 92.8 91.6 91.5
0.7 0.521 0.0026 0.0558 0.000249 0.0002 85.7 0.498 0.0026 0.0830 0.000879 92.8 88.3 90.8

72 6.5 0.1 0.548 0.0014 0.0440 0.000070 0.0449 94.5 0.546 0.0014 0.0444 0.000185 94.2 94.3 93.8
0.3 0.546 0.0015 0.0442 0.000090 0.0478 93.8 0.542 0.0015 0.0466 0.000230 94.2 94.2 94.1
0.7 0.540 0.0017 0.0446 0.000122 0.0529 91.4 0.532 0.0017 0.0526 0.000356 93.7 92.4 93.0

13 0.1 0.549 0.0010 0.0312 0.000040 0.0324 93.7 0.547 0.0010 0.0320 0.000131 94.5 94.6 94.0
0.3 0.545 0.0011 0.0314 0.000056 0.0357 92.0 0.542 0.0011 0.0352 0.000182 94.7 93.4 94.5
0.7 0.539 0.0014 0.0317 0.000081 0.0436 87.7 0.531 0.0014 0.0436 0.000351 96.6 93.0 95.6

Note: ρw denotes the correlation coefficient within a physician.

Mean (MCSE) of κ^ denotes the average (Monte-Carlo Standard Error estimate of) kappa statistic assuming independence.

Mean (MCSE) of ASE(κ^) denotes the average (Monte-Carlo Standard Error estimate of) asymptotic standard error of kappa statistic assuming independence.

std(κ^) denotes the empirical standard deviation of kappa statistics.

CRS denotes the 95% confidence interval coverage rate (%) for κ^ using the 95% confidence interval based on normal approximation assuming independence.

Mean (MCSE) of κ^B denotes the average (Monte-Carlo Standard Error estimate of) bootstrap kappa statistic.

Mean (MCSE) of SE^(κ^B) denotes the average (Monte-Carlo Standard Error estimate of ) bootstrap standard error estimate of bootstrap kappa statistic.

CRBS denotes the 95% confidence interval coverage rate (%) for κ^B using the 95% bootstrap confidence interval based on normal approximation.

CRBP denotes the 95% confidence interval coverage rate (%) for κ^B using the 95% bootstrap confidence interval based on percentiles.

CRBBCa denotes the 95% confidence interval coverage rate (%) for κ̂B using the 95% bootstrap confidence interval based on BCa.

6. Discussion

In this article, we evaluated the bootstrap method for calculating the kappa statistic and its standard error for clustered dichotomous responses. For the simulation study, we adopted a computationally efficient procedure to generate correlated dichotomous responses for physicians and their patients. Through simulation studies, we have demonstrated that the asymptotic standard error of the kappa statistic assuming independent observations tends to underestimate the standard error of the kappa statistic on average, particularly when there is strong agreement between physician and patient or strong within-physician correlation. This underestimation yields confidence intervals that are too narrow and have poor coverage performance. By taking into account the correlation within physicians, the proposed bootstrap method produced nearly unbiased standard error estimates and coverage rates close to the target nominal level, except when both the number of physicians and the number of patients per physician are small.

Bootstrap methods can be computationally intensive. One alternative to the bootstrap method is to use an asymptotic approximation, which requires establishing the asymptotic distribution of the kappa statistic with clustered data. The most difficult part would be determining the asymptotic variance of the kappa statistic while taking into account the correlation within clusters. In the case of independent samples, formulas for the variance of the kappa statistic in situations with different sets of raters [33] or with a small number of subjects [34] have been proposed. Feder [35] proposed a Taylor linearization to estimate the variance of the kappa statistic when each individual is interviewed on two occasions. The kappa statistic is a function of the first two moments; hence its asymptotic standard error involves third- and fourth-order moments, which are not easily modeled or estimated. The bootstrap offers an automatic way of adjusting the standard errors without explicit modeling of higher-order moments. We will explore developing variance formulas for the kappa statistic in clustered data settings in future research.

Acknowledgements

We are grateful to the editors and the reviewers for their insightful comments which have led to important improvements in the paper. This research was partially supported by the National Institutes of Health grants: UL1 RR025747, R01 HL57444, P01 CA142538, and K23 HL074375.

Appendix 1. Marginal correlation between responses from two patients of the same physician

The marginal correlation between responses of any two patients of a physician can be expressed as follows:

$$\mathrm{Corr}(X_i,X_j)=\frac{E(X_iX_j)-\mu_x^{2}}{\mu_x(1-\mu_x)},$$

and

$$E(X_iX_j)=E\left[E\left[X_iX_j\mid Y_i,Y_j\right]\right]=E\left[(b_0+b_1Y_i)(b_0+b_1Y_j)\right]=b_0^{2}+2b_0b_1E(Y_i)+b_1^{2}E(Y_iY_j)=b_0^{2}+2b_0b_1\mu_y+b_1^{2}E(Y_iY_j). \tag{8}$$

The second equality holds because of the assumption that Xi and Xj are conditionally independent given Yi and Yj. Given the marginal means μy and μx, the within-physician correlation ρw, and the correlation between physician's and patient's responses ρb (or, equivalently, κ), the marginal correlation between any two patients within a physician is automatically determined by the above relationship. These marginal correlations generally increase with ρw and ρb. The within-physician patient-patient correlation is approximately ρwρb².
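As a check (our substitution, under the exchangeable assumptions above, where $E(Y_iY_j)=\mu_y^{2}+\rho_w\,\mu_y(1-\mu_y)$ and $b_1=\rho_b\sqrt{\mu_y(1-\mu_y)\mu_x(1-\mu_x)}/\{\mu_y(1-\mu_y)\}$), combining (8) with $\mu_x=b_0+b_1\mu_y$ gives

$$\mathrm{Corr}(X_i,X_j)=\frac{b_1^{2}\left\{E(Y_iY_j)-\mu_y^{2}\right\}}{\mu_x(1-\mu_x)}=\frac{b_1^{2}\,\rho_w\,\mu_y(1-\mu_y)}{\mu_x(1-\mu_x)}=\rho_w\,\rho_b^{2},$$

consistent with the approximation quoted above.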

References

  1. Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 1960;20(1):37-46.
  2. Cohen J. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin. 1968;70(4):213-220. doi: 10.1037/h0026256.
  3. Landis J, Koch G. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159-174.
  4. Fleiss J. Measuring nominal scale agreement among many raters. Psychological Bulletin. 1971;76(5):378-382.
  5. Barlow W, Lai M, Azen S. A comparison of methods for calculating a stratified kappa. Statistics in Medicine. 1991;10(9):1465-1472. doi: 10.1002/sim.4780100913.
  6. Kraemer H. Extension of the kappa coefficient. Biometrics. 1980;36(2):207-216.
  7. McKenzie D, Mackinnon A, Péladeau N, Onghena P, Bruce P, Clarke D, Harrigan S, McGorry P. Comparing correlated kappas by resampling: is one level of agreement significantly different from another? Journal of Psychiatric Research. 1996;30(6):483-492. doi: 10.1016/s0022-3956(96)00033-7.
  8. Vanbelle S, Albert A. A bootstrap method for comparing correlated kappa coefficients. Journal of Statistical Computation and Simulation. 2008;78(11):1009-1015.
  9. Donner A, Shoukri M, Klar N, Bartfay E. Testing the equality of two dependent kappa statistics. Statistics in Medicine. 2000;19(3):373-387. doi: 10.1002/(sici)1097-0258(20000215)19:3<373::aid-sim337>3.0.co;2-y.
  10. Klar N, Lipsitz S, Ibrahim J. An estimating equations approach for modelling kappa. Biometrical Journal. 2000;42(1):45-58.
  11. Williamson J, Manatunga A, Lipsitz S. Modeling kappa for measuring dependent categorical agreement data. Biostatistics. 2000;1(2):191-202. doi: 10.1093/biostatistics/1.2.191.
  12. Gonin R, Lipsitz S, Fitzmaurice G, Molenberghs G. Regression modelling of weighted κ by using generalized estimating equations. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2000;49(1):1-18.
  13. Barnhart H, Williamson J. Weighted least-squares approach for comparing correlated kappa. Biometrics. 2002;58(4):1012-1019. doi: 10.1111/j.0006-341x.2002.01012.x.
  14. Behrend L, Maymani H, Diehl M, Gizlice Z, Cai J, Sheridan S. Patient-physician agreement on the content of CHD prevention discussions. Health Expectations. 2010;14(1):58-72. doi: 10.1111/j.1369-7625.2010.00614.x.
  15. Eliasziw M, Donner A. Application of the McNemar test to non-independent matched pair data. Statistics in Medicine. 1991;10(12):1981-1991. doi: 10.1002/sim.4780101211.
  16. Donald A, Donner A. Adjustments to the Mantel-Haenszel chi-square statistic and odds ratio variance estimator when the data are clustered. Statistics in Medicine. 1987;6(4):491-499. doi: 10.1002/sim.4780060408.
  17. Donald A, Donner A. A simulation study of the analysis of sets of 2 × 2 contingency tables under cluster sampling: Estimation of a common odds ratio. Journal of the American Statistical Association. 1990;85(410):537-543.
  18. Oden N. Estimating kappa from binocular data. Statistics in Medicine. 1991;10(8):1303-1311. doi: 10.1002/sim.4780100813.
  19. Schouten H. Estimating kappa from binocular data and comparing marginal probabilities. Statistics in Medicine. 1993;12(23):2207-2217. doi: 10.1002/sim.4780122306.
  20. Klar N, Lipsitz S, Parzen M, Leong T. An exact bootstrap confidence interval for κ in small samples. Journal of the Royal Statistical Society: Series D (The Statistician). 2002;51(4):467-478.
  21. Cunningham M. More than just the kappa coefficient: a program to fully characterize inter-rater reliability between two raters. SAS Global Forum, Statistics and Data Analysis; 2009.
  22. Fleiss J, Levin B, Paik M. Statistical Methods for Rates and Proportions. 3rd edition. John Wiley & Sons; Hoboken, New Jersey: 2003.
  23. Efron B, Tibshirani R. An Introduction to the Bootstrap. Chapman & Hall/CRC; 1993.
  24. Davison A, Hinkley D. Bootstrap Methods and their Application. Cambridge University Press; 1997.
  25. Field C, Welsh A. Bootstrapping clustered data. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2007;69(3):369-390.
  26. Monaco J, Cai J, Grizzle J. Bootstrap analysis of multivariate failure time data. Statistics in Medicine. 2005;24(22):3387-3400. doi: 10.1002/sim.2205.
  27. Efron B, Tibshirani R. Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science. 1986;1(1):54-75.
  28. DiCiccio T, Efron B. Bootstrap confidence intervals. Statistical Science. 1996:189-212.
  29. Qaqish B. A family of multivariate binary distributions for simulating correlated binary variables with specified marginal means and correlations. Biometrika. 2003;90(2):455-463.
  30. By K, Qaqish B. binarySimCLF: Simulates Correlated Binary Data. R package, 2009. URL http://CRAN.R-project.org/package=binarySimCLF.
  31. Burton A, Altman D, Royston P, Holder R. The design of simulation studies in medical statistics. Statistics in Medicine. 2006;25(24):4279-4292. doi: 10.1002/sim.2673.
  32. Sheridan S, Behrend L, Pignone M, Keyserling T, Rimer B, Simpson R, Bangdiwala K, Cai J, Gizlice Z. A randomized trial of an intervention to improve adherence to effective heart disease prevention medications. BMC Health Services Research. 2011;11(331). doi: 10.1186/1472-6963-11-331.
  33. Fleiss J, Nee J, Landis J. Large sample variance of kappa in the case of different sets of raters. Psychological Bulletin. 1979;86(5):974.
  34. Gross S. The kappa coefficient of agreement for multiple observers when the number of subjects is small. Biometrics. 1986;42(4):883-893.
  35. Feder M. Variance estimation of the survey-weighted kappa measure of agreement. ASA Section on Survey Research Methods, Joint Statistical Meetings; Seattle, WA; August 2006.
