Abstract
Asymptotic approaches are traditionally used to calculate confidence intervals for the intraclass correlation coefficient in a clustered binary study. When the sample size is small to medium, or the correlation or response rate is near the boundary, asymptotic intervals often do not have satisfactory coverage. We propose using the importance sampling method to construct profile confidence limits for the intraclass correlation coefficient. Importance sampling is a simulation-based approach to reduce the variance of the estimated parameter. Four existing asymptotic limits are used as statistical quantities for sample space ordering in the importance sampling method. Simulation studies are performed to evaluate the performance of the proposed accurate intervals with regard to coverage and interval width. Simulation results indicate that the accurate intervals based on the asymptotic limits by Fleiss and Cuzick generally have shorter width than the others in many cases, while the accurate intervals based on the Zou and Donner asymptotic limits outperform the others when the correlation and response rate are close to their boundaries.
Keywords: Clustered binary data, confidence interval, importance sampling, intraclass correlation coefficient, profile confidence limit
1. Introduction
The intraclass correlation coefficient (ICC), denoted by ρ, is widely used to estimate the correlation between multiple measures from one cluster. For example, CNS Vital Signs are often used to define the presence or absence of impairment for professional fighters [1]. Recently, several new measures have been added to the battery to define impairment: clinical diagnosis, and brain image data at baseline or longitudinal change [18]. When multiple assessment tools are used to define the impairment status of each fighter, one of the first research questions is the ICC among these measures. When the outcome is binary, many methods have been developed to estimate the ICC: moment generating functions, and the quasi-likelihood approach [13]. It should be noted that some of these estimates may not exist for some data [3]. In addition to the point estimate, confidence intervals for the ICC should be reported in studies with binary correlated data.
For binary data, the exact one-sided limit for the ICC by Buehler [2] is preferable [6,10,15,19]. Buehler's exact one-sided limit guarantees that the coverage is bounded below by the nominal level. It has been successfully applied to studies with independent samples or paired samples [7,11], and to single-arm two-stage designs with binary outcome [15]. However, it quickly becomes too computationally intensive for a study with clustered binary outcomes. The computational intensity of Buehler's exact limit comes from the massive size of the complete sample space [16,21,23,25]. For these reasons, we propose using importance sampling (IS) to calculate confidence intervals for ρ. The IS approach reduces the computational intensity by avoiding enumeration of the complete sample space, while the computed IS intervals remain highly accurate, with actual coverage close to the nominal level. In the IS approach, a statistical quantity has to be used for sample space ordering. We use the following four existing intervals as statistical quantities: the ANOVA interval, the modified Wald interval by Zou and Donner [30], the interval by Fleiss and Cuzick [4], and the Pearson interval that gives equal weight to each pair of observations [8].
In the confidence interval calculation for ρ, there is another parameter in the tail probability: the event rate π, which is treated as a nuisance parameter. Kabaila and Lloyd [5] developed the profile confidence limit by using the profile maximum likelihood estimate (MLE) of π given ρ, instead of all the possible π values. This profile confidence limit is computationally efficient and has very good statistical properties [5]. Therefore, we use a combination of the IS approach for discrete data [9] and the profile limits by Kabaila and Lloyd [5] to construct confidence intervals for ρ. We conduct extensive simulation studies to compare their performance with regard to coverage and interval width [22,24]. The psychiatric data set reported by Lipsitz et al. [8] is used to illustrate the application of the proposed accurate intervals.
The rest of the article is organized as follows. In Section 2, we first introduce the four existing asymptotic intervals, and then propose accurate intervals for ρ by using the asymptotic limits as statistical quantities for sample space ordering. In Section 3, we compare the performance of accurate intervals with a common cluster size and multiple cluster sizes through simulation studies. The proposed accurate intervals are then applied to the data set from a psychiatric study. Finally, we provide some comments on the ICC confidence interval calculation as well as a paradox of the ICC estimate in Section 4.
2. Methods
Suppose a study has $k$ clusters, with $n_i$ individuals in the $i$th cluster, where $i = 1, 2, \ldots, k$. When the outcome is binary, the number of events among the $n_i$ individuals from the $i$th cluster is $x_i$, with data $\mathbf{x} = (x_1, x_2, \ldots, x_k)$; see Table 1. The total number of individuals in the study is $N = \sum_{i=1}^{k} n_i$. The ICC provides a quantitative measure of correlation between individuals from the same cluster, and it is the parameter of interest in this article.
Table 1.
Data for a clustered binary study with k clusters.
| Cluster | Events | Non-events | Total |
|---|---|---|---|
| 1 | $x_1$ | $n_1 - x_1$ | $n_1$ |
| ··· | ··· | ··· | ··· |
| $k$ | $x_k$ | $n_k - x_k$ | $n_k$ |
Under the exchangeable correlation model [12], the probability of observing a given number of events in a cluster of a given size is computed as
| (1) |
This probability mass function is used in the simulation studies.
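As an illustration of how data can be simulated under an exchangeable correlation model, the sketch below assumes that Equation (1) takes the generalized binomial form commonly attributed to Madsen [12]: with probability ρ all individuals in a cluster share one Bernoulli(π) outcome, and otherwise the count is binomial. This form, and the names `cluster_pmf` and `simulate_clusters`, are assumptions made for illustration rather than the article's own code.

```python
import math
import random

def cluster_pmf(n, rho, pi):
    """Probabilities of 0..n events in a cluster of size n.

    Assumed generalized binomial form: a (1 - rho) share of a binomial,
    plus extra mass rho*(1 - pi) on 0 events and rho*pi on n events.
    """
    probs = [(1 - rho) * math.comb(n, x) * pi**x * (1 - pi)**(n - x)
             for x in range(n + 1)]
    probs[0] += rho * (1 - pi)
    probs[n] += rho * pi
    return probs

def simulate_clusters(sizes, rho, pi, rng):
    """Draw one event count per cluster, keeping the given cluster sizes."""
    return [rng.choices(range(n + 1), weights=cluster_pmf(n, rho, pi))[0]
            for n in sizes]

rng = random.Random(2024)
events = simulate_clusters([5] * 20, rho=0.5, pi=0.3, rng=rng)
```

Under this assumed form the probabilities sum to one and the expected count is nπ, which gives a quick sanity check on any implementation.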
2.1. Asymptotic intervals
Traditionally, the statistical inference for ρ is based on asymptotic approaches. For this reason, we first introduce four commonly used methods to construct asymptotic intervals for ρ.
2.1.1. ANOVA interval and ZD interval
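In a one-way ANOVA, the between-cluster mean square (MSB) and the within-cluster mean square (MSW) are calculated as

$$\mathrm{MSB} = \frac{1}{k-1}\left[\sum_{i=1}^{k}\frac{x_i^2}{n_i} - \frac{\left(\sum_{i=1}^{k} x_i\right)^2}{N}\right], \qquad \mathrm{MSW} = \frac{1}{N-k}\left[\sum_{i=1}^{k} x_i - \sum_{i=1}^{k}\frac{x_i^2}{n_i}\right],$$

where $x_i$ is the number of events among the $n_i$ individuals in the $i$th cluster and $N = \sum_{i=1}^{k} n_i$.
Then, the ICC estimate based on the ANOVA method is

$$\hat{\rho}_A = \frac{\mathrm{MSB} - \mathrm{MSW}}{\mathrm{MSB} + (n_0 - 1)\,\mathrm{MSW}}, \qquad (2)$$

where $n_0 = \frac{1}{k-1}\left[N - \frac{1}{N}\sum_{i=1}^{k} n_i^2\right]$. Smith [29] developed the large sample variance estimator for $\hat{\rho}_A$ as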
| (3) |
where and . A confidence interval for ρ can be computed as
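$$\hat{\rho} \pm z_{1-\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\hat{\rho})}, \qquad (4)$$

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$ quantile of the standard normal distribution. The ANOVA interval is obtained by replacing $\hat{\rho}$ and its variance estimator with the ANOVA estimate in Equation (2) and the variance estimator in Equation (3).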
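The ANOVA estimate and the Wald-type interval can be sketched as follows. The code assumes the standard one-way ANOVA mean squares for binary data and takes the variance as an input, since the variance formula itself is not reproduced in the code; the function names are hypothetical.

```python
from statistics import NormalDist

def anova_icc(events, sizes):
    """ANOVA-type ICC estimate from per-cluster event counts and sizes.

    Assumes the usual mean squares for binary data; it is undefined
    (division by zero) when MSB and MSW are both zero.
    """
    k, N = len(sizes), sum(sizes)
    total = sum(events)
    s1 = sum(x * x / n for x, n in zip(events, sizes))
    msb = (s1 - total**2 / N) / (k - 1)      # between-cluster mean square
    msw = (total - s1) / (N - k)             # within-cluster mean square
    n0 = (N - sum(n * n for n in sizes) / N) / (k - 1)
    return (msb - msw) / (msb + (n0 - 1) * msw)

def wald_interval(estimate, variance, level=0.95):
    """Two-sided Wald-type interval: estimate +/- z * sqrt(variance)."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * variance**0.5
    return estimate - half_width, estimate + half_width

rho_hat = anova_icc([0, 1, 5, 4, 0, 2], [5] * 6)
lower, upper = wald_interval(0.4, 0.01)
```

A Wald-type interval is symmetric about the point estimate and is not truncated to the admissible range of ρ, which is one reason its coverage can be poor near the boundary.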
Later, Zou and Donner [30] derived a variance estimator for ρ by using the moment functions and the delta method under the exchangeable correlation model:
| (5) |
where , , , and . When the variance estimator in Equation (4) is replaced by the Zou and Donner variance estimator in Equation (5), the resulting confidence interval is referred to as the ZD interval.
2.1.2. FC interval
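Fleiss and Cuzick [4] developed a kappa-type estimator for the ICC based on the probability that two individuals have the same outcome (both 0 or both 1), computed for two individuals from the same cluster and for two individuals from different clusters. This approach is referred to as the FC approach, with the ICC estimator

$$\hat{\rho}_{FC} = 1 - \frac{\sum_{i=1}^{k} x_i (n_i - x_i)/n_i}{(N-k)\,\hat{\pi}(1-\hat{\pi})},$$

where $x_i$ is the number of events among the $n_i$ individuals in the $i$th cluster, $N = \sum_{i=1}^{k} n_i$, and $\hat{\pi} = \sum_{i=1}^{k} x_i / N$.
Since $\hat{\pi}$ and $1-\hat{\pi}$ are in the denominator, this estimator is not defined when $\hat{\pi} = 0$ or 1. The variance of $\hat{\rho}_{FC}$ is calculated as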
| (6) |
where and . The FC interval is computed by using the FC estimate and its variance estimator in Equation (4). Zou and Donner [30] found that the FC interval has coverage close to the nominal level over a wide range of parameter configurations, as compared to the ANOVA interval and the ZD interval.
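A minimal sketch of the FC point estimate is given below, assuming the commonly cited form of the Fleiss and Cuzick estimator (one minus the ratio of within-cluster disagreement to its expectation under independence). The function name `fc_icc` is hypothetical, and the variance in Equation (6) is not implemented.

```python
def fc_icc(events, sizes):
    """Fleiss-Cuzick-type ICC estimate from event counts and cluster sizes."""
    k, N = len(sizes), sum(sizes)
    pi_hat = sum(events) / N                 # pooled event rate
    if pi_hat in (0.0, 1.0):
        raise ValueError("estimator is undefined when the pooled event rate is 0 or 1")
    within = sum(x * (n - x) / n for x, n in zip(events, sizes))
    return 1 - within / ((N - k) * pi_hat * (1 - pi_hat))

# 49 clusters of size 5 with 5 events each, and one cluster with 4 events
rho_fc = fc_icc([5] * 49 + [4], [5] * 50)
```

With the data above the pooled event rate is 0.996, so the denominator is very small and the estimate is slightly negative even though nearly all pairs within a cluster agree.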
2.1.3. Pearson interval
Pearson's approach computes the ICC from all possible combinations of pairs within each cluster. For a cluster with the size of $n$, the total possible number of pairs is $n(n-1)/2$. For each pair, its correlation is computed. These correlations are added as the overall correlation within that cluster. Pearson's ICC estimator gives equal weight to each pair of individuals to avoid giving more weight to clusters with larger sizes.
The ICC based on the Pearson's approach is presented as
where . The variance of is
| (7) |
where and . The Pearson interval is computed by using the Pearson estimate and its variance estimator in Equation (4).
When $\hat{\rho} = 1$, the variance estimates are zero for the ANOVA method, the FC method, and the Pearson method in Equations (3), (6), and (7). This occurs when all the individuals in each cluster have the same outcome.
2.2. Accurate intervals
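The original data set in Table 1 can be re-organized by the size of clusters. Let $W$ be the maximum cluster size: $W = \max(n_1, \ldots, n_k)$. The re-organized data is

$$\mathbf{m} = \{m_{wj}:\ j = 0, 1, \ldots, w;\ w = 1, \ldots, W\},$$

where $m_{wj}$ is the number of clusters that have $j$ events and the cluster size of $w$, $j = 0, 1, \ldots, w$, and $w = 1, \ldots, W$. Then, $M_w = \sum_{j=0}^{w} m_{wj}$ is the number of clusters having the cluster size of $w$.
For clusters having the same cluster size of $w$, their numbers of events follow a multinomial distribution with the probabilities in Equation (1), specifically

$$(m_{w0}, m_{w1}, \ldots, m_{ww}) \sim \mathrm{Multinomial}\big(M_w;\ p_{w0}, p_{w1}, \ldots, p_{ww}\big),$$

where $p_{wj}$ is the probability in Equation (1) of $j$ events in a cluster of size $w$.
It follows that the probability of data is

$$P(\mathbf{m};\rho,\pi) = \prod_{w=1}^{W} \frac{M_w!}{\prod_{j=0}^{w} m_{wj}!} \prod_{j=0}^{w} p_{wj}^{\,m_{wj}},$$

which is a function of the parameter of interest ρ and the nuisance parameter π.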
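The re-organization by cluster size and the resulting probability of the data can be sketched as below. The generalized binomial form of `cluster_pmf` is an assumption about Equation (1), and `reorganize` and `log_prob` are hypothetical helper names; the probability is computed on the log scale for numerical stability.

```python
import math
from collections import Counter

def cluster_pmf(n, rho, pi):
    """Assumed generalized binomial form for Equation (1)."""
    probs = [(1 - rho) * math.comb(n, x) * pi**x * (1 - pi)**(n - x)
             for x in range(n + 1)]
    probs[0] += rho * (1 - pi)
    probs[n] += rho * pi
    return probs

def reorganize(events, sizes):
    """Count clusters by (cluster size w, number of events j)."""
    return Counter(zip(sizes, events))

def log_prob(counts, rho, pi):
    """Log-probability of the counts: one multinomial per cluster size."""
    size_totals = Counter()
    logp = 0.0
    for (w, j), m in counts.items():
        size_totals[w] += m
        p = cluster_pmf(w, rho, pi)[j]
        if p == 0.0:
            return -math.inf                     # data impossible at (rho, pi)
        logp += m * math.log(p) - math.lgamma(m + 1)
    # add log(M_w!) for each cluster size to complete the multinomial coefficients
    return logp + sum(math.lgamma(m_w + 1) for m_w in size_totals.values())

counts = reorganize(events=[1, 0, 3, 3, 2], sizes=[2, 2, 3, 3, 3])
value = log_prob(counts, rho=0.2, pi=0.4)
```

Grouping clusters by size means the probability depends on the data only through the counts, which is what keeps repeated evaluation over many (ρ, π) values cheap.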
Let T be a statistical quantity to order the upper limits. The tail probability can be expressed as
| (8) |
where is an observed data. Exact one-sided upper limit by Buehler [2] can be computed as the supremum of ρ such that
| (9) |
In order to compute Buehler's exact one-sided limit [2], the complete sample space has to be enumerated, which is not computationally feasible for a study with clustered binary data due to the massive size of the sample space. Additionally, the tail probability has to be calculated for all possible values of π, which further adds to the computational burden.
To overcome these challenges, we consider the importance sampling (IS) method described by Lloyd and Li [9] to estimate the tail probability in Equation (8). Importance sampling reduces the computational intensity by avoiding the enumeration of all data sets in the complete sample space [14,16]. The resulting estimate closely approximates the true tail probability [9].
The importance sampling distribution is assumed to follow a multinomial distribution, with ρ and π estimated from the observed data set. From the ANOVA model fitted to the observed data, the estimates $\hat{\rho}$ and $\hat{\pi}$ are obtained and then used in the probability functions in Equation (1). In the ANOVA model, $\hat{\pi}$ is calculated as the ratio of the total number of events to the total number of individuals. The number of events for the $i$th cluster with the size of $n_i$ is then generated from the estimated multinomial distribution. This step is repeated $k$ times to generate one IS data set having the same cluster sizes as the observed data. Suppose we have $B$ importance data sets, indexed by $b = 1, \ldots, B$. These $B$ importance samples are used to estimate the tail probability as
where is the importance weight and is the indicator function.
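The importance sampling estimate described above can be sketched as follows. Several pieces are assumptions made for illustration: the generalized binomial form of `cluster_pmf` for Equation (1), the direction of the indicator (simulated statistic no larger than the observed one), and the simple pooled-rate statistic passed as `T` in the usage line, where the article instead orders the sample space by an asymptotic limit. The importance weight is the ratio of the probability of a simulated data set at the target (ρ, π) to its probability at the sampling values.

```python
import math
import random

def cluster_pmf(n, rho, pi):
    """Assumed generalized binomial form for Equation (1)."""
    probs = [(1 - rho) * math.comb(n, x) * pi**x * (1 - pi)**(n - x)
             for x in range(n + 1)]
    probs[0] += rho * (1 - pi)
    probs[n] += rho * pi
    return probs

def is_tail_probability(T, observed, sizes, rho, pi, rho_hat, pi_hat,
                        B=5000, seed=1):
    """Estimate P(T(X) <= T(observed); rho, pi) from B draws at (rho_hat, pi_hat)."""
    rng = random.Random(seed)
    target = {n: cluster_pmf(n, rho, pi) for n in set(sizes)}
    proposal = {n: cluster_pmf(n, rho_hat, pi_hat) for n in set(sizes)}
    t_obs = T(observed, sizes)
    total = 0.0
    for _ in range(B):
        # one IS data set: one draw per cluster, same sizes as the observed data
        draw = [rng.choices(range(n + 1), weights=proposal[n])[0] for n in sizes]
        if T(draw, sizes) <= t_obs:
            # importance weight: product of per-cluster probability ratios
            total += math.prod(target[n][x] / proposal[n][x]
                               for n, x in zip(sizes, draw))
    return total / B

pooled_rate = lambda ev, sz: sum(ev) / sum(sz)   # stand-in ordering statistic
g_hat = is_tail_probability(pooled_rate, [1, 0, 2, 3], [3] * 4,
                            rho=0.2, pi=0.4, rho_hat=0.3, pi_hat=0.4, B=2000)
```

Because the same B simulated data sets can be reweighted for any candidate (ρ, π), the tail probability can be evaluated over a grid of parameter values without simulating again.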
When $\hat{G}$ is used as the estimator for $G$, one has to compute the upper limit as the supremum of ρ such that $\hat{G}$ is larger than the nominal tail level for all the possible π values, which is computationally intensive. To further reduce the computational intensity, Kabaila and Lloyd [5] proposed the upper profile limit as the supremum of ρ such that
| (10) |
where $\hat{\pi}(\rho)$ is the MLE of π for a given ρ. In addition, the profile limit helps reduce the conservativeness of the confidence intervals, as the lower limit and the upper limit are computed separately at the nominal level of $\alpha/2$ for each limit. The accurate lower limit can be computed similarly to the accurate upper limit. The aforementioned four asymptotic limits can be used as statistical quantities for sample space ordering in the accurate interval calculation. We refer to the accurate intervals based on the ANOVA limits, the ZD limits, the FC limits, and the Pearson limits as the ANOVA-IS interval, the ZD-IS interval, the FC-IS interval, and the Pearson-IS interval, respectively.
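The profile step, replacing π by its MLE for each candidate ρ, can be sketched with a simple grid search. The grid search is an illustrative choice, not necessarily the optimizer used by the authors, and `cluster_pmf` again assumes the generalized binomial form for Equation (1); `log_lik` and `profile_pi` are hypothetical names.

```python
import math

def cluster_pmf(n, rho, pi):
    """Assumed generalized binomial form for Equation (1)."""
    probs = [(1 - rho) * math.comb(n, x) * pi**x * (1 - pi)**(n - x)
             for x in range(n + 1)]
    probs[0] += rho * (1 - pi)
    probs[n] += rho * pi
    return probs

def log_lik(events, sizes, rho, pi):
    """Log-likelihood of the cluster-level counts at (rho, pi)."""
    total = 0.0
    for x, n in zip(events, sizes):
        p = cluster_pmf(n, rho, pi)[x]
        if p <= 0.0:
            return -math.inf
        total += math.log(p)
    return total

def profile_pi(events, sizes, rho, grid=999):
    """Grid approximation to the MLE of pi for a fixed rho."""
    candidates = [(i + 1) / (grid + 1) for i in range(grid)]
    return max(candidates, key=lambda pi: log_lik(events, sizes, rho, pi))

events, sizes = [1, 2, 0, 3, 1, 2], [5] * 6
pi_at_rho = {rho: profile_pi(events, sizes, rho) for rho in (0.0, 0.3, 0.6)}
```

When ρ = 0 the assumed model reduces to independent binomial counts, so the profile estimate should match the pooled event rate up to the grid resolution.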
3. Results
We first compare the performance of asymptotic intervals and accurate intervals for the ICC with regard to coverage probability when all the clusters have the same cluster size. The coverage probability is calculated as the proportion of the simulated data sets whose confidence intervals contain the pre-specified ρ value. When all the clusters have the same size, the asymptotic FC intervals and the asymptotic Pearson intervals are identical. For this reason, we only need to compare the performance of the following three intervals: the ANOVA interval, the ZD interval, and the FC interval. Figure 1 shows the coverage probabilities of these asymptotic intervals with the number of clusters k = 20, 50, and 100, and the common cluster size 5 or 8, at the nominal level of 95%. The parameter of interest ρ is set to three values, including 0.5 and 0.9, and the event rates π are 0.1, 0.3, 0.6, and 0.9. For each configuration, we simulate 2,000 data sets using the probability functions in Equation (1). For each simulated data set, we fit the ANOVA model to obtain the parameter estimates $\hat{\rho}$ and $\hat{\pi}$. We then simulate 5,000 importance samples from the multinomial distribution whose parameters are replaced with $\hat{\rho}$ and $\hat{\pi}$.
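The two performance measures used in the simulations can be computed from a list of simulated intervals as in the sketch below. The function name and the three intervals in the usage line are illustrative only and are not results from the article.

```python
def coverage_and_width(intervals, rho_true):
    """Empirical coverage and mean width of (lower, upper) intervals."""
    covered = sum(lower <= rho_true <= upper for lower, upper in intervals)
    mean_width = sum(upper - lower for lower, upper in intervals) / len(intervals)
    return covered / len(intervals), mean_width

# three made-up intervals evaluated against a true value of 0.5
coverage, width = coverage_and_width([(0.1, 0.6), (0.55, 0.9), (0.2, 0.5)], 0.5)
```

With 2,000 simulated data sets per configuration, the Monte Carlo standard error of an estimated coverage near 95% is about 0.005, which helps in judging whether a departure from the nominal level is meaningful.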
Figure 1.
Coverage probabilities of asymptotic intervals when the number of clusters is k = 20, 50, and 100 and a common cluster size is 5 or 8.
It can be seen from Figure 1 that the ANOVA interval has the lowest coverage as compared to the other two methods. At the nominal level of 95%, the coverage probability of the ANOVA interval could be as low as 40% in some configurations. Both the ZD interval and the FC interval have coverage around the nominal level. The ZD interval often has a higher coverage than the FC interval. When the number of clusters is k = 100, the ZD interval controls the coverage well in many cases, with the actual coverage probabilities above 95%. The FC interval has coverage close to the nominal level, but often below it.
As compared to the coverage of asymptotic intervals in Figure 1, the proposed accurate intervals generally control well for coverage, see Figure 2. Although the accurate intervals based on the IS approach do not theoretically guarantee the coverage, the computed accurate intervals often have very good performance with regard to coverage. The accurate ANOVA-IS intervals occasionally have the coverage slightly below the nominal level when ρ is high.
Figure 2.
Coverage probabilities of accurate intervals when the number of clusters is k = 20, 50, and 100 and a common cluster size is 5 or 8.
We then compare the interval widths of the accurate intervals in Figure 3. The results for a study with the cluster size of 5 are similar to those with the cluster size of 8. When the number of clusters is small (k = 20), the ZD-IS interval has the shortest width as compared to the ANOVA-IS interval and the FC-IS interval in many configurations. The ANOVA-IS interval has the shortest width when ρ is 0.9, π is near the boundary, and the study has a small number of clusters. As the number of clusters increases from k = 20 to k = 100, the FC-IS interval and the ZD-IS interval have similar interval widths, and they have shorter widths than the ANOVA-IS interval. When k = 100, the width of the ZD-IS interval is the shortest when ρ is 0.9 and π is near the boundary. The simulations for a study with a common cluster size indicate that the ZD-IS interval and the FC-IS interval outperform the ANOVA-IS interval with regard to coverage and interval width. For this reason, we exclude the ANOVA-IS interval in the following simulations for a study with multiple cluster sizes.
Figure 3.
Interval width of accurate intervals when the number of clusters is k = 20, 50, and 100 and a common cluster size is 5 or 8.
The FC and Pearson intervals differ from each other when a study has more than one cluster size. Figure 4 shows the coverage probability and interval width of the three asymptotic intervals and their associated accurate intervals, with 60 clusters (20 clusters each for the cluster sizes of 3, 4, and 5). The same configurations of ρ and π as in the aforementioned simulations are studied. The asymptotic ZD interval is very conservative when ρ and π are near the boundary, with actual coverage much larger than the nominal level. The FC interval always has a higher coverage than the Pearson interval, and both are below the nominal level at the two smaller values of ρ (the smallest value and 0.5). When ρ = 0.9, the FC interval and the Pearson interval have coverage around the nominal level, while the coverage of the ZD interval ranges from 0.9 to 1. Meanwhile, the accurate intervals control the coverage much better than the asymptotic intervals, as seen in the figure.
Figure 4.
Coverage probability and interval width of asymptotic intervals (left) and accurate intervals (right) for a study with 3 different cluster sizes: 3, 4, and 5. The number of clusters is 20 for each cluster size with a total of 60 clusters. The computed coverage probability and interval width are plotted as a function of ρ and π.
Among the accurate intervals, the ZD-IS interval has the shortest width when both ρ and π are near the boundary. When ρ is large and π is near the boundary, the ZD-IS interval is the best among them, because the other two intervals often have very small accurate lower limits. In other cases, the accurate FC-IS interval often performs better than the others, with shorter widths. Similar results are observed in Figure 5 when the number of clusters is increased to k = 90.
Figure 5.
Coverage probability and interval width of asymptotic intervals (left) and accurate intervals (right) for a study with 3 different cluster sizes: 3, 4, and 5. The number of clusters is 30 for each cluster size with a total of 90 clusters. The computed coverage probability and interval width are plotted as a function of ρ and π.
3.1. Example
We use the psychiatric data reported by Lipsitz et al. [8] to illustrate the application of the proposed accurate intervals for ρ. That study had 26 patients who are considered as clusters here. Each patient with psychiatric disorders is assessed by at least three different psychiatrists, with the outcome as the status of disorder (neurosis versus other disorder). The cluster sizes are 3, 4, 5, and 6. The detailed data set can be found in Lipsitz et al. [8].
The estimated ρ values based on the ANOVA method, the FC method, and the Pearson method are presented in Table 2. The estimate based on the FC method is similar to that based on the Pearson method, and they are slightly less than the estimate based on the ANOVA method. The accurate lower limits of the ZD-IS interval, the FC-IS interval, and the Pearson-IS interval are larger than their associated asymptotic lower limits, while the differences in upper limits are small. The accurate ANOVA-IS interval is wider than the other accurate intervals. Suppose we are interested in testing whether ρ is above 0.2 under the alternative. Three IS intervals (the ZD-IS interval, the FC-IS interval, and the Pearson-IS interval) reject the null hypothesis, while their asymptotic counterparts fail to reject it. The ANOVA-IS interval is the only one of the four IS intervals that fails to reject the null hypothesis.
Table 2.
Asymptotic and accurate intervals for the psychiatric study.
| ρ | Method for interval | Asymptotic | Accurate |
|---|---|---|---|
| | ANOVA | (0.231, 0.613) | (0.182, 0.690) |
| | ZD | (0.195, 0.649) | (0.214, 0.639) |
| | FC | (0.187, 0.632) | (0.209, 0.637) |
| | Pearson | (0.179, 0.637) | (0.206, 0.642) |
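The decisions discussed for the psychiatric study can be checked directly from the limits in Table 2: the null hypothesis is rejected when the lower limit exceeds 0.2. The sketch below also computes the midpoint of each asymptotic interval; because Wald-type intervals are symmetric, the midpoint approximates the underlying point estimate up to rounding, which is an inference from the table rather than a value reported in it.

```python
table2 = {
    "ANOVA":   {"asymptotic": (0.231, 0.613), "accurate": (0.182, 0.690)},
    "ZD":      {"asymptotic": (0.195, 0.649), "accurate": (0.214, 0.639)},
    "FC":      {"asymptotic": (0.187, 0.632), "accurate": (0.209, 0.637)},
    "Pearson": {"asymptotic": (0.179, 0.637), "accurate": (0.206, 0.642)},
}

threshold = 0.2
# reject H0 (rho <= threshold) when the lower confidence limit is above it
rejects = {method: {kind: limits[0] > threshold for kind, limits in row.items()}
           for method, row in table2.items()}

# midpoint of a symmetric Wald-type interval, as a proxy for the point estimate
midpoints = {method: sum(row["asymptotic"]) / 2 for method, row in table2.items()}

for method in table2:
    print(method, rejects[method], round(midpoints[method], 4))
```

The ANOVA and ZD midpoints coincide, which is consistent with the ZD interval being built around the ANOVA estimate with a different variance estimator.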
4. Discussion
We study the four accurate confidence intervals for ρ in a clustered binary study. Four asymptotic limits are used as statistical quantities in computing these accurate intervals. Another asymptotic interval proposed by Chakraborty and Hossain [3] is a simulation based method that would require more computational resources [28]. That approach would introduce additional simulation errors in computing the confidence interval for ρ. We do not have the computational resources to conduct simulations using that method.
In addition to the Wald-type confidence intervals studied in this article, confidence intervals can be solved from cubic functions of ρ, similar to the Wilson type confidence intervals [30]. The proposed accurate confidence intervals are computed by using the ordering information of upper limits or lower limits [20]. We do not expect significant changes with regard to sample space ordering between Wald-type limits and Wilson type limits. In addition, more computational resources are required to solve cubic functions [15,17,26,27]. This could be considered as future research on the difference between Wald type limits and Wilson type limits for accurate intervals calculation.
Similar to the κ statistic for testing agreement, there is a paradox about the ICC estimate: a high probability that two individuals from the same cluster have the same outcome, but a low ρ estimate. For example, consider a study with k = 50 clusters having a common cluster size of 5, in which 1 cluster has 4 events and 1 non-event, and the remaining 49 clusters all have 5 events. Given that almost every cluster has 5 events, the probability of two individuals from the same cluster having the same outcome is very high, close to 1. However, the estimated ρ is not positive: -0.004 for the FC method and approximately 0 for the ANOVA method. This could be caused by the value of π being close to 1, which leads to a very small denominator in the estimate. When π is near 0.5, the estimated ρ is closer to that probability.
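The arithmetic behind this example can be traced as follows, assuming the commonly cited forms of the FC and ANOVA estimators (with $x_i$ events in cluster $i$, $n_i = 5$, $k = 50$, and $N = 250$). The within-cluster agreement probability is computed here as the proportion of concordant pairs, which is an illustrative definition.

```latex
\begin{align*}
\text{Proportion of concordant pairs}
  &= \frac{49\binom{5}{2} + \binom{4}{2}}{50\binom{5}{2}}
   = \frac{490 + 6}{500} = 0.992, \\
\hat{\pi} &= \frac{49 \times 5 + 4}{250} = 0.996, \\
\hat{\rho}_{FC}
  &= 1 - \frac{\sum_i x_i(n_i - x_i)/n_i}{(N-k)\,\hat{\pi}(1-\hat{\pi})}
   = 1 - \frac{0.8}{200 \times 0.996 \times 0.004}
   = 1 - \frac{0.8}{0.7968} \approx -0.004, \\
\mathrm{MSB} &= \frac{248.2 - 249^2/250}{49} = \frac{0.196}{49} = 0.004,
\qquad
\mathrm{MSW} = \frac{249 - 248.2}{200} = 0.004, \\
\hat{\rho}_{A} &= \frac{\mathrm{MSB} - \mathrm{MSW}}{\mathrm{MSB} + (n_0 - 1)\,\mathrm{MSW}}
   = \frac{0}{0.02} = 0 .
\end{align*}
```

The pairs agree 99.2% of the time, yet $\hat{\pi}(1-\hat{\pi}) = 0.004$ is so small that the single discordant cluster is enough to push both estimates to zero or below.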
Acknowledgements
The authors are very grateful to the Editor, Associate Editor, and two reviewers for their insightful comments that helped improve the manuscript. Research reported in this publication was supported by the National Institutes of Health under Award Numbers R01AG070849 and R03CA248006.
Disclosure statement
No potential conflict of interest was reported by the author(s).
References
- 1. Bernick C., Banks S.J., Shin W., Obuchowski N., Butler S., Noback M., Phillips M., Lowe M., Jones S., and Modic M., Repeated head trauma is associated with smaller thalamic volumes and slower processing speed: The professional fighters' brain health study, Br. J. Sports Med. 49 (2015), pp. 1007–1011.
- 2. Buehler R.J., Confidence intervals for the product of two binomial parameters, J. Am. Stat. Assoc. 52 (1957), pp. 482–493.
- 3. Chakraborty H. and Hossain A., R package to estimate intracluster correlation coefficient with confidence interval for binary data, Comput. Methods Programs Biomed. 155 (2018), pp. 85–92.
- 4. Fleiss J.L. and Cuzick J., The reliability of dichotomous judgments: Unequal numbers of judges per subject, Appl. Psychol. Meas. 3 (1979), pp. 537–542.
- 5. Kabaila P. and Lloyd C.J., Profile upper confidence limits from discrete data, Aust. N. Z. J. Stat. 42 (2000), pp. 67–79.
- 6. Kabaila P. and Lloyd C.J., The efficiency of Buehler confidence limits, Stat. Probab. Lett. 65 (2003), pp. 21–28.
- 7. Kabaila P. and Lloyd C.J., Improved Buehler limits based on refined designated statistics, J. Stat. Plan. Inference 136 (2006), pp. 3145–3155.
- 8. Lipsitz S.R., Laird N.M., and Brennan T.A., Simple moment estimates of the κ-coefficient and its variance, Appl. Stat. 43 (1994), pp. 309–323.
- 9. Lloyd C.J. and Li D., Computing highly accurate confidence limits from discrete data using importance sampling, Stat. Comput. 24 (2014), pp. 663–673.
- 10. Lloyd C.J. and Moldovan M.V., Exact one-sided confidence limits for the difference between two correlated proportions, Stat. Med. 26 (2007), pp. 3369–3384.
- 11. Lloyd C.J. and Moldovan M.V., Unconditional efficient one-sided confidence limits for the odds ratio based on conditional likelihood, Stat. Med. 26 (2007), pp. 5136–5146.
- 12. Madsen R.W., Generalized binomial distributions, Commun. Statist. Theory Methods 22 (1993), pp. 3065–3086.
- 13. Ridout M.S., Demétrio C.G., and Firth D., Estimating intraclass correlation for binary data, Biometrics 55 (1999), pp. 137–148.
- 14. Shan G., Exact Statistical Inference for Categorical Data, 1st ed., Academic Press, San Diego, CA, 2015.
- 15. Shan G., Exact confidence limits for the response rate in two-stage designs with over or under enrollment in the second stage, Stat. Methods Med. Res. 27 (2018), pp. 1045–1055.
- 16. Shan G., Accurate confidence intervals for proportion in studies with clustered binary outcome, Stat. Methods Med. Res. 29 (2020), pp. 3006–3018.
- 17. Shan G., Optimal two-stage designs based on restricted mean survival time for a single-arm study, Contemp. Clin. Trials Commun. 21 (2021), p. 100732.
- 18. Shan G., Bayram E., Caldwell J.Z.K., Miller J.B., Shen J.J., and Gerstenberger S., Partial correlation coefficient for a study with repeated measurements, Stat. Biopharm. Res. (2020), pp. 1–7. DOI: 10.1080/19466315.2020.1784780
- 19. Shan G., Bernick C., Caldwell J.Z.K., and Ritter A., Machine learning methods to predict amyloid positivity using domain scores from cognitive tests, Sci. Rep. 11 (2021), p. 4822.
- 20. Shan G. and Gerstenberger S., Fisher's exact approach for post hoc analysis of a chi-squared test, PLoS ONE 12 (2017), p. e0188709.
- 21. Shan G. and Ma C., Unconditional tests for comparing two ordered multinomials, Stat. Methods Med. Res. 25 (2016), pp. 241–254.
- 22. Shan G. and Wang W., ExactCIdiff: An R package for computing exact confidence intervals for the difference of two proportions, R. J. 5 (2013), pp. 62–71.
- 23. Shan G. and Wang W., Advanced statistical methods and designs for clinical trials for COVID-19, Int. J. Antimicrob. Agents 57 (2021), p. 106167.
- 24. Shan G., Wilding G.E., and Hutson A.D., Computationally intensive two-stage designs for clinical trials, Wiley StatsRef: Statistics Reference Online (2017), pp. 1–7. DOI: 10.1002/9781118445112.stat07986
- 25. Shan G., Wilding G.E., Hutson A.D., and Gerstenberger S., Optimal adaptive two-stage designs for early phase II clinical trials, Stat. Med. 35 (2016), pp. 1257–1266.
- 26. Shan G. and Zhang H., Two-stage optimal designs with survival endpoint when the follow-up time is restricted, BMC Med. Res. Methodol. 19 (2019).
- 27. Shan G., Zhang H., and Jiang T., Adaptive two-stage optimal designs for phase II clinical studies that allow early futility stopping, Seq. Anal. 38 (2019), pp. 199–213.
- 28. Shan G., Zhang H., Jiang T., Peterson H., Young D., and Ma C., Exact p-values for Simon's two-stage designs in clinical trials, Stat. Biosci. 8 (2016), pp. 351–357.
- 29. Smith C.A.B., On the estimation of intraclass correlation, Ann. Hum. Genet. 21 (1957), pp. 363–373.
- 30. Zou G. and Donner A., Confidence interval estimation of the intraclass correlation coefficient for binary outcome data, Biometrics 60 (2004), pp. 807–811.





