Abstract
Reliability of measurement instruments providing quantitative outcomes is usually assessed by an intraclass correlation coefficient. When participants are repeatedly measured by a single rater or device, or are each rated by a different group of raters, the intraclass correlation coefficient is based on a one-way analysis of variance model. When planning a reliability study, it is essential to determine the number of participants and measurements per participant (i.e. number of raters or number of repeated measurements). Three different sample size determination approaches under the one-way analysis of variance model were identified in the literature, all based on a confidence interval for the intraclass correlation coefficient. Although eight different confidence interval methods can be identified, the Wald confidence interval with Fisher’s large sample variance approximation remains the most commonly used despite its well-known poor statistical properties. Therefore, the first objective of this work is to compare the statistical properties of all identified confidence interval methods, including those overlooked in previous studies. The second objective is to develop a general procedure to determine the sample size using all approaches, since a closed-form formula is not always available. This procedure is implemented in an R Shiny app. Finally, we provide advice for choosing an appropriate sample size determination method when planning a reliability study.
Keywords: Intrarater reliability, interrater reliability, measurement errors, reproducibility (of results), observer variation
1. Introduction
Reliability is important in many scientific disciplines.1–3 All measurement and evaluation processes are subject to measurement error. These errors can have a serious impact on research undermining the conclusions of the study, as well as in daily practice when measurement and evaluation processes are used to make diagnoses or assess the progression of participants, for example. It is therefore essential for measurement instruments to be reliable (i.e. the device/rater is able to distinguish among participants in a population) and valid (i.e. measurements reflect the underlying true values). The reliability of a device/rater is usually evaluated during a reliability study. Generally, a reliability study consists of participants measured repeatedly under similar conditions by the same device/rater (intrarater reliability) or by different devices/raters (interrater reliability). In interrater reliability studies, the set of raters can be the same, or different for every participant. In this article, we focus (1) on intrarater studies where the same number of repeated measurements is made simultaneously on each participant, and the order of the measurements is interchangeable, and, (2) on specific interrater reliability studies where the set of raters is different for every participant, and the same number of raters rates each participant. In the second case the reliability coefficient additionally reflects the differences between raters, next to the measurement error.
When the outcome measurements are quantitative, reliability can be quantified using an intraclass correlation coefficient (ICC). ICC is defined as the correlation between repeated measurements at multiple occasions made by the same rater/device or by different raters/devices on the same participants. It compares the variability of measurements/ratings within participants to the variability of measurements/ratings between participants. Depending on the design of the study, different forms of ICC should be used.4,5 This article focuses on the ICC defined in the one-way analysis of variance (ANOVA) model, ICC(1). 4 When planning a reliability study, determining the minimum number of raters/repetitions and participants is of prime importance. In fact, too many participants may prove to be time-consuming and may also increase the research budget, while too few may adversely impact the precision of the ICC estimate, preventing the drawing of any conclusion on the study. Several approaches to determine sample sizes can be identified in the literature. The aim of this review is two-fold. First, it is to compare the statistical properties of the sample sizes obtained with the approaches in realistic settings. Second, it is to develop a general procedure for sample size determination, since a closed-form formula is not always available for all the approaches.
Existing literature on determining sample size indicates two main approaches, namely, the confidence interval approach6,8 and the hypothesis testing approach.6,8–10 The confidence interval approach requires defining, around a planned ICC, a target width of the confidence interval that the researcher aims to achieve. A generalization of the width of the confidence interval approach, the assurance probability approach, 6 is based on testing whether the width of the confidence interval is less than a pre-specified width with a given assurance probability. The testing approach is based on the power of testing the hypothesis that the ICC is lower or equal to (null hypothesis), or, above (alternative hypothesis) a pre-specified value of the ICC. A common feature of these approaches is that the variance of the ICC estimator needs to be defined. In the literature, two closed-form approximations of the large-sample variance of the ICC estimate are mainly used. These are namely, the Swiger variance, 11 which is based on the Taylor-series expansion of the ratio of the ANOVA mean squares, and the Fisher variance, 12 a large-sample approximation obtained by Fisher. We further consider another form of the variance, known as the Zerbe variance, 13 based on the formulation of the ratio of two independent F-statistics. This variance is far less popular and was not included in previous reviews.
Confidence intervals formed around the ICC are mainly based on the Wald method,6,14 or on the F-statistic, termed the Searle method. 15 The Wald and the Searle methods can be further applied using a normalization transformation.12,16 When comparing the coverage probability of the confidence intervals (confidence intervals based on the Wald method with the Fisher variance and the Searle method), Zou 6 concluded that the normalized Searle method performs better than the Wald method with the Fisher variance. However, when comparing the coverage probabilities and mean interval widths of confidence intervals obtained with the Wald method (with the Swiger variance), the Searle method, and the normalized Searle method, Donner and Wells 17 concluded that no method was superior in all situations.
In the context of sample size determination with the confidence interval approach, a closed-form formula was derived by Bonett 7 for the Wald confidence interval with the Fisher variance and is the most common choice. 18 While Shieh more recently defined a numerical procedure for the Searle method, 19 no procedure to determine sample sizes exists for the other methods. Note that, in common statistical software like R, 18 SAS, and PASS, 20 the Wald method with the Fisher variance is the only one that is available (see Appendix E). As for the comparison among the methods, Shieh 19 compared the statistical properties of the Wald method (with the Fisher variance) and the Searle method with respect to the width of the confidence interval approach and the assurance probability approach. To summarize the results, the Searle method and the assurance probability approach, with a assurance probability, showed better coverage than the width of the confidence interval approach. 6 Furthermore, this was achieved with a somewhat smaller width of the confidence interval. We aim to complete these comparisons by considering all the identified confidence interval methods.
For the testing approach, sample size determination was derived only for the normalized Searle method 18 and the Searle method (numerically). Only the latter is available in common statistical software. 20 As for the comparison, Shieh 21 showed that the approximate sample size formula obtained using the normalized Searle method 8 under-performs, with respect to the observed power of the hypothesis test, when compared to numerical sample sizes obtained via the Searle method. We extend the work of Shieh, 21 by comparing the results that can be obtained using all the methods for sample size determination identified in this article.
Several studies have investigated inference procedures for the ICC in this context, but they are incomplete because they do not consider all the confidence interval methods identified. In summary, our contribution is as follows. First, we compare the statistical properties of all identified confidence interval methods. Second, we analytically derive the sample size formulas using the Swiger and the Zerbe variances. Third, we develop a numerical procedure to obtain sample sizes with all identified confidence interval methods under the three sample size approaches. In this numerical procedure, we derive formulas to approximate the assurance probability function and the power function (except for the Searle method for which these formulas were already derived 21 ). Additionally, we provide guidelines for end users. We further provide a user-friendly and interactive R Shiny application to obtain sample sizes with all the methods discussed in this article, available on GitHub (https://github.com/DiproMondal/sample-size-ICC) and on the Shiny server (https://dipro.shinyapps.io/sample-size-icc/).
The article is organized as follows. Section 2 introduces the methods to estimate ICC, its variance, and confidence interval. Section 3 introduces the simulation setup that is used to evaluate the statistical properties of the confidence interval methods. Section 4 describes different approaches for sample size calculation when the number of raters, $k$, is fixed. We further propose a general procedure to obtain minimum sample sizes under any approach. Section 5 presents a case study. Finally, Section 6 concludes the article with a summary and a discussion of the results obtained in this article.
2. Definition
Consider the scenario in which each participant is measured on a quantitative scale by a different set of raters randomly drawn from a population of raters, 4 or is measured repeatedly by a measuring device several times under identical conditions. Further assume that the number of raters/repeated measurements per participant is the same, which is a common assumption when planning a reliability study. Let $Y_{ij}$ represent the measurement of participant $i$ ($i = 1, \dots, n$) by rater $j$ ($j = 1, \dots, k$). This outcome can be described by a one-way ANOVA model, which can be written as
$$Y_{ij} = \mu + s_i + e_{ij}, \tag{1}$$
where $\mu$ is the grand mean, $s_i$ is the effect of participant $i$, and $e_{ij}$ is the measurement error for participant $i$ measured by rater $j$. The total number of observations is denoted by $N$ ($N = nk$). The assumptions of this ANOVA model are that the participant effects are identically and normally distributed with mean $0$ and variance $\sigma_s^2$, the measurement errors are identically and normally distributed with mean $0$ and variance $\sigma_e^2$, and the errors and participant effects are independent. Table 1 shows the variance components of this one-way ANOVA model. The mean squares in Table 1 are $\mathrm{BMS} = \frac{k}{n-1}\sum_{i=1}^{n}(\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2$ and $\mathrm{WMS} = \frac{1}{n(k-1)}\sum_{i=1}^{n}\sum_{j=1}^{k}(Y_{ij} - \bar{Y}_{i\cdot})^2$, where $\bar{Y}_{i\cdot} = \frac{1}{k}\sum_{j=1}^{k} Y_{ij}$ and $\bar{Y}_{\cdot\cdot} = \frac{1}{nk}\sum_{i=1}^{n}\sum_{j=1}^{k} Y_{ij}$. Using this variance decomposition, the ICC is defined as
$$\rho = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_e^2}. \tag{2}$$
Note that the value of $\rho$ becomes closer to 1 as the measurement error variance becomes smaller ($\sigma_e^2 \to 0$) and becomes closer to 0 as it increases ($\sigma_e^2 \to \infty$).
Table 1.
Variance decomposition for the one-way ANOVA model described by equation (1).
Source of variation | Degrees of freedom | Mean squares | Expected mean squares |
---|---|---|---|
Between participants | $n-1$ | BMS | $k\sigma_s^2 + \sigma_e^2$ |
Within participants | $n(k-1)$ | WMS | $\sigma_e^2$ |
ANOVA: analysis of variance; BMS: between mean squares; WMS: within mean squares.
2.1. Estimation of ICC
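ICC is usually estimated using the ANOVA 4 or the maximum-likelihood estimator. The ANOVA estimator is given by
$$\hat{\rho} = \frac{\mathrm{BMS} - \mathrm{WMS}}{\mathrm{BMS} + (k-1)\,\mathrm{WMS}}, \tag{3}$$
where $k$ is the number of raters (or repeated measurements) per participant.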
Since this estimator is negatively biased, 22 a maximum-likelihood estimator has been suggested 23 :
Comparing the bias of the two estimators, Wang et al. 23 showed that the bias of is still quite large and decreases only slightly for large samples. For instance, to achieve a bias of not > 10 , a total of 100 observations (e.g. 20 participants and five raters) are required when expecting (the value of at which the bias is maximum). When one expects higher values of , as in the context considered in this article, the bias of is small and the two estimators lead to almost identical estimates. For this reason, the maximum-likelihood estimator is generally not used in the literature. Accordingly, we will only consider the ANOVA estimator in this article. Note that this estimator relies on the assumptions of the ANOVA model (equation (1)). A brief discussion of what happens when these assumptions are violated is given in Section 6.
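As an illustration of the ANOVA estimator, the following minimal Python sketch computes BMS, WMS, and the resulting ICC for a balanced layout with one row per participant. The function name `icc_oneway` and the toy data are hypothetical and chosen only for this example; the formulas are the standard one-way ANOVA mean squares.

```python
from statistics import fmean


def icc_oneway(ratings):
    """ANOVA estimator of ICC(1) for balanced data (one row per participant)."""
    n = len(ratings)       # number of participants
    k = len(ratings[0])    # measurements per participant
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a balanced n x k layout with n >= 2 and k >= 2")
    row_means = [fmean(row) for row in ratings]
    grand_mean = fmean(row_means)
    # Between-participants mean square, n - 1 degrees of freedom
    bms = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    # Within-participants mean square, n(k - 1) degrees of freedom
    wms = sum(
        (y - m) ** 2 for row, m in zip(ratings, row_means) for y in row
    ) / (n * (k - 1))
    return (bms - wms) / (bms + (k - 1) * wms), bms, wms


# Hypothetical toy data: 3 participants, 2 measurements each
example = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
icc, bms, wms = icc_oneway(example)
print(f"BMS={bms:.3f}  WMS={wms:.3f}  ICC={icc:.4f}")
```

For these toy data the mean squares are BMS = 8 and WMS = 0.5, so the estimate is 7.5/8.5.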
2.2. Large sample variance of the ICC
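Here we focus on the three approximated closed-form expressions of the variance of $\hat{\rho}$ available in the literature for large $n$. Swiger et al. 11 provided the large sample variance of $\hat{\rho}$ as,
$$\mathrm{Var}_S(\hat{\rho}) = \frac{2(N-1)(1-\rho)^2[1+(k-1)\rho]^2}{k^2(N-n)(n-1)}, \tag{4}$$
where $n$ is the number of participants, $k$ is the number of raters, and $N = nk$.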
Given that , as , this leads to the variance obtained by Fisher 12 when , which is a reasonable assumption for small and , 24
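$$\mathrm{Var}_F(\hat{\rho}) = \frac{2(1-\rho)^2[1+(k-1)\rho]^2}{k(k-1)n}. \tag{5}$$
Note that in equation (5), the number of participants $n$ in the denominator is sometimes replaced by $n-1$. 25 Lastly, following Zerbe and Goldgar 13 and Kaart, 26 the variance can also be estimated by the ratio of two independent F-statistics as,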
(6) |
These three formulas are related by the following inequality, (see Appendix A for proof).
2.3. Confidence interval for the ICC
In the literature, there are four methods to compute the upper ( ) and lower ( ) bounds of the confidence interval for , namely the Wald method, 6 the Searle method,9,15 and their normalized versions. Demetrashvili et al. 27 further suggested two generic methods not considered here because they are not accurate in the balanced one-way random effects model.
2.3.1. Wald confidence interval ( , , and )
Based on the central limit theorem, the upper and lower bounds of the confidence interval for the ICC can be written as,6,14
$$\hat{\rho} \pm z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{\rho})}, \tag{7}$$
where $z_{1-\alpha/2}$ is the $100(1-\alpha/2)$th percentile of the standard normal distribution. Plugging equations (4) to (6) into (7) as the variance leads to three Wald confidence intervals, based on the Swiger, Fisher, and Zerbe variances, respectively.
The Wald method assumes that the sampling distribution of $\hat{\rho}$ is normal. However, $\rho$ is bounded between 0 and 1, implying a skewed sampling distribution of $\hat{\rho}$ when $\rho$ is close to the boundaries. 28 Since typically ICC values close to one are of interest in a reliability study, Wald confidence intervals may thus have poor statistical properties in this context.
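A minimal sketch of the Wald interval follows. It assumes the Swiger variance in its usual published form and the Fisher variance with $n$ in the denominator; both are written out in the code so the assumption is explicit. The function names are hypothetical, and the variances are evaluated at the estimate, as is usual for a plug-in Wald interval.

```python
from math import sqrt
from statistics import NormalDist


def swiger_variance(rho, n, k):
    """Large-sample variance attributed to Swiger et al. (assumed form)."""
    N = n * k
    return (2 * (N - 1) * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2
            / (k ** 2 * (N - n) * (n - 1)))


def fisher_variance(rho, n, k):
    """Fisher's large-sample variance, assumed here with n in the denominator."""
    return 2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2 / (k * (k - 1) * n)


def wald_interval(rho_hat, n, k, variance=fisher_variance, level=0.95):
    """Symmetric interval rho_hat +/- z * sqrt(Var), with Var evaluated at rho_hat."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half = z * sqrt(variance(rho_hat, n, k))
    return rho_hat - half, rho_hat + half


lo_f, hi_f = wald_interval(0.8, n=50, k=3)
lo_s, hi_s = wald_interval(0.8, n=50, k=3, variance=swiger_variance)
# With a high estimate and few participants the upper limit can exceed 1
lo_b, hi_b = wald_interval(0.95, n=10, k=2)
print((lo_f, hi_f), (lo_s, hi_s), (lo_b, hi_b))
```

The last call illustrates the boundary problem described above: because the interval is symmetric, its upper limit falls outside the admissible range of the ICC.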
2.3.2. Searle method ( )
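Under the assumption of normality of the ANOVA model, the ratio of the between-mean squares and within-mean squares (i.e. the F-statistic) is distributed as $\frac{1+(k-1)\rho}{1-\rho}\,F_{n-1,\,n(k-1)}$, where $F_{n-1,\,n(k-1)}$ represents an F-distribution with $n-1$ and $n(k-1)$ degrees of freedom. We represent this ratio as
$$F_0 = \frac{\mathrm{BMS}}{\mathrm{WMS}}. \tag{8}$$
Then, the lower and upper bounds of the confidence interval for $\rho$ are given by Searle 14 as,
$$\rho_L = \frac{F_0/F_U - 1}{F_0/F_U + k - 1}, \qquad \rho_U = \frac{F_0/F_L - 1}{F_0/F_L + k - 1}, \tag{9}$$
where $F_L$ and $F_U$ are the $100(\alpha/2)$th and the $100(1-\alpha/2)$th percentile of an F-distribution with $n-1$ and $n(k-1)$ degrees of freedom, respectively. We refer to this as the Searle method.
Rather than making a normality assumption on $\hat{\rho}$, this method makes an assumption of normality on the outcome measurements. Hence this method has been referred to as being an exact procedure by several authors.6,17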
2.3.3. Normalized ICC method ( )
The Fisher transformation can be applied to the ICC so that the transformed ICC approximately follows a normal distribution. Applying this transformation to leads to
(10) |
where , and the variance, can be derived applying the Delta method 16 to one of the variances defined in equations (4) to (6) leading, respectively, to
(11) |
(12) |
(13) |
Since is approximately normally distributed and defined on the real line, we can then compute the Wald confidence interval for this transformation as
Finally, the confidence interval for is obtained by back-transformation leading to
(14) |
We refer to the confidence intervals obtained by these methods as , , and , respectively.
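The normalized interval can be sketched as follows, assuming the usual Fisher transformation for the ICC, $z = \tfrac{1}{2}\ln\{[1+(k-1)\rho]/(1-\rho)\}$, and the Fisher variance with $n$ in the denominator. Under these assumptions the Delta method gives a variance of $k/[2(k-1)n]$ on the transformed scale, which no longer depends on $\rho$. The function names are hypothetical, and only the Fisher-variance variant is shown.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def fisher_z(rho, k):
    """Fisher transformation of the ICC for k measurements per participant."""
    return 0.5 * log((1 + (k - 1) * rho) / (1 - rho))


def inverse_fisher_z(z, k):
    """Back-transformation from the z scale to the ICC scale."""
    e = exp(2 * z)
    return (e - 1) / (e + k - 1)


def normalized_interval(rho_hat, n, k, level=0.95):
    """Wald interval on the z scale, back-transformed to the ICC scale."""
    q = NormalDist().inv_cdf(1 - (1 - level) / 2)
    # Delta method applied to the assumed Fisher variance: k / (2 (k - 1) n)
    se_z = sqrt(k / (2 * (k - 1) * n))
    z_hat = fisher_z(rho_hat, k)
    return inverse_fisher_z(z_hat - q * se_z, k), inverse_fisher_z(z_hat + q * se_z, k)


lower, upper = normalized_interval(0.95, n=10, k=2)
print(lower, upper)
```

Unlike a symmetric Wald interval on the original scale, the back-transformed interval is asymmetric around the estimate and its upper limit stays below 1.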
2.3.4. Normalized Searle method ( )
The F-statistic ( ) can also be normalized by a log-transformation to obtain confidence limits.6,9,14 Normalizing starting from equation (8), we obtain
(15) |
where and . The confidence interval on this log transformed scale is then
Note that the expression for provided in equation (3) of Zou 6 is not correct, so we use as specified above. The confidence limits for can be obtained directly by back-transforming as
(16) |
We refer to this as the normalized Searle method. Note that for $k = 2$, the confidence intervals based on the transformed F-statistic and the normalized ICC with the Swiger variance (following equations (16) and (14)), are the same.
3. Simulation comparison of the confidence interval methods
We set up a Monte Carlo simulation to evaluate the statistical properties of the eight confidence interval methods described in Section 2.3. Based on the ANOVA model defined in equation (1), participant effects ( ) are drawn from a standard normal distribution. Then, (= ) errors ( ) are drawn from a normal distribution, with zero mean and variance determined by the relation in equation (2) for a given value of . This process is replicated 25,000 times. For each replication, the confidence interval using the eight methods described in Section 2.3 is obtained. We study the properties of the methods for values of varying from to (in steps of ), from to (in steps of ), and from to (in steps of ).
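The data-generating step described above can be sketched in Python as follows. With participant effects of unit variance, equation (2) implies an error variance of $(1-\rho)/\rho$. For brevity the sketch evaluates only one interval, the Wald interval with the Fisher variance (assumed here with $n$ in the denominator), and uses far fewer replications than the 25,000 of the study. The function names and the scenario values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_icc(rng, rho, n, k):
    """Draw one balanced data set from the one-way model and return the ANOVA ICC."""
    sd_e = sqrt((1 - rho) / rho)   # error SD implied by unit participant variance
    rows = []
    for _ in range(n):
        s = rng.gauss(0.0, 1.0)    # participant effect
        rows.append([s + rng.gauss(0.0, sd_e) for _ in range(k)])
    means = [sum(r) / k for r in rows]
    grand = sum(means) / n
    bms = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    wms = sum((y - m) ** 2 for r, m in zip(rows, means) for y in r) / (n * (k - 1))
    return (bms - wms) / (bms + (k - 1) * wms)


def wald_fisher_coverage(rho, n, k, reps=1000, level=0.95, seed=1):
    """Proportion of replications whose Wald (Fisher variance) interval covers rho."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    hits = 0
    for _ in range(reps):
        est = simulate_icc(rng, rho, n, k)
        half = z * sqrt(2 * (1 - est) ** 2 * (1 + (k - 1) * est) ** 2 / (k * (k - 1) * n))
        hits += (est - half) <= rho <= (est + half)
    return hits / reps


coverage = wald_fisher_coverage(rho=0.8, n=30, k=3)
print(coverage)
```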
The methods are compared based on the coverage probability and average confidence interval width in each scenario. The coverage probability is defined as the proportion of times the true value of $\rho$ is covered by the confidence intervals across the 25,000 replications. We define coverage probability as acceptable if it falls within the range $p \pm z_{0.975}\sqrt{p(1-p)/B}$, where $p$ is the nominal coverage and $B$ is the number of simulations. This is the range in which the simulated proportions are expected to lie in 95% of the cases if the nominal coverage is the true coverage probability. Specifically, for a nominal coverage of 95%, the coverage probabilities from the simulation are expected to lie between 0.947 and 0.953. The average width of a confidence interval is defined as the average difference between the upper and lower limits of a confidence interval over the 25,000 replications. Since a shorter confidence interval is desirable, methods with a smaller average width are considered better.
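The acceptable range can be checked numerically. The sketch below assumes the range is the normal-approximation interval for a binomial proportion, $p \pm z_{0.975}\sqrt{p(1-p)/B}$, which reproduces the bounds quoted for a nominal coverage of 95% and 25,000 replications. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def acceptable_range(nominal, n_sim, level=0.95):
    """Range in which a simulated proportion should fall if `nominal` is the truth."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half = z * sqrt(nominal * (1 - nominal) / n_sim)
    return nominal - half, nominal + half


low, high = acceptable_range(0.95, 25_000)
print(round(low, 3), round(high, 3))
```

The same function gives lower limits of about 0.896 and 0.795 for nominal values of 0.9 and 0.8, which match the thresholds used for the assurance probability and the power later in the article.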
Table 2 summarizes the results for , while complete results can be found in Supplemental Material 1. Table 2 shows that for and , , , (equivalent to ), and provide acceptable coverage for all values of , while provides acceptable coverage only for . , , and do not provide acceptable coverage (based on sample sizes explored in Table 2, i.e., ).
Table 2.
Summary of the methods which show acceptable coverage for the confidence interval, that is, between 0.947 and 0.953, for the ICC, and different number of raters, , and participants, . In each row, the method providing the average minimum width of the confidence interval is marked in bold. For , the differences in average width are , therefore, none of the methods have been marked bold in those cases.
2 | 0.7–0.9 | |||||||||
0.7–0.8 | ||||||||||
0.9 | ||||||||||
ICC: intraclass correlation coefficient.
For , still provides acceptable coverage under all scenarios while and only for . The coverage of and deteriorates first when increasing from to , and then improves on increasing further. These confidence interval methods provide acceptable coverage when for . on the other hand provides acceptable coverage when for . provides acceptable coverage when for , while provides acceptable coverage only when for . The effect of increasing is not monotonic for some of the confidence interval methods. However, increasing above 5 does not seem to notably improve the coverage of the methods (see Supplemental Material 1).
The confidence interval methods providing the smallest average width most frequently, under the different scenarios, are marked in bold. The difference in average width between the different confidence interval methods decreases from to as increases from to . It must be noted here that though provides better coverage compared to , it has the largest width among the confidence interval methods.
In summary for , , , and provide acceptable coverage in almost all scenarios.
4. Sample size determination
Sample size determination when the number of raters, , is fixed, is reviewed for three approaches, namely, the width of confidence interval approach, the assurance probability approach, and the testing approach. These sample size approaches require a planning value, , and yield valid results when the initial guess for is accurate. The eight confidence interval methods reviewed in Section 2.3 can be used with each of the three approaches. However, a closed-form formula for sample size determination is not always available, which necessitates numerical evaluation procedures to determine sample sizes.
4.1. Width of confidence interval approach
The approach consists in finding the minimum number of participants for a given value of the expected width, , of the confidence interval around a planned value of and for a given number of raters . Bonett 7 derived an analytical formula based on the Wald confidence interval and the Fisher variance ( ). We generalize this approach by considering all large sample variance formulas reviewed in Section 2.2.
The expected width of the Wald confidence interval is given by , where is the confidence level and the variance can be estimated using equations (4) to (6). Using the Swiger variance (equation (4)), under the approximation and taking the positive root, the minimum number of participants is given by (see Appendix B.1 for the derivation)
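$$n = \frac{8 z_{1-\alpha/2}^2 (1-\rho)^2 [1+(k-1)\rho]^2}{k(k-1)w^2} + 1, \tag{17}$$
where $z_{1-\alpha/2}$ is the $100(1-\alpha/2)$th percentile of the standard normal distribution, $\rho$ is the planned ICC, $k$ is the number of raters, and $w$ is the target width. Using the Fisher variance (equation (5)), the expression for the required minimum number of participants is the same as equation (17), minus one participant. Bonett 7 used the Fisher variance with $n-1$ in the denominator of equation (5) instead of $n$. As a result, the sample size derived by Bonett is the same as equation (17).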
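As a worked example, the sketch below evaluates Bonett's published closed form, $n = 8 z_{1-\alpha/2}^2 (1-\rho)^2 [1+(k-1)\rho]^2 / [k(k-1)w^2] + 1$, which the paragraph above identifies with equation (17). The function name is hypothetical. Rounding to the nearest integer is an assumption: it reproduces the parenthesized Table 3 entries for the rows checked here, but the rounding rule is not stated in this section.

```python
from statistics import NormalDist


def bonett_sample_size(rho, k, width, level=0.95):
    """Real-valued solution of Bonett's closed-form sample size formula."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return (8 * z ** 2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2
            / (k * (k - 1) * width ** 2) + 1)


raw = bonett_sample_size(rho=0.7, k=2, width=0.1)
n_swiger = round(raw)        # Swiger-based closed form (same as Bonett)
n_fisher = round(raw - 1)    # Fisher-based closed form: one participant fewer
print(n_swiger, n_fisher)
```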
Using the Zerbe variance (equation (6)), the minimum number of participants obtained under the assumption that is:
(18) |
where and (see Appendix B.2 for the derivation).
Giraudeau and Mary 29 provided an approximate formula for the width of the confidence interval obtained with the Searle method which coincides with the width obtained using the Wald confidence interval with the Fisher variance. Analytical formulas can hardly be obtained for the Searle and the normalization methods. Hence, we propose a general numerical procedure to determine the minimum sample size, , which can be used with all confidence interval methods. Specifically, this numerical evaluation method consists of finding the expected width of the confidence interval for the specified values of and . This is done for every , starting from and increasing by one unit at a time. The minimum sample size is the smallest value of for which the expected width of confidence interval is smaller or equal to . Bonett 7 and Shieh 19 used a similar numerical approach to obtain sample sizes for .
Table 3 shows the minimal sample sizes obtained by using the numerical evaluation for $w \in \{0.1, 0.2\}$, $\rho \in \{0.7, 0.8, 0.9\}$, and $k \in \{2, 3, 6\}$. The values within parentheses indicate sample sizes obtained using equation (17), equation (17) with a subtraction of one participant, and equation (18) for the Wald method with the Swiger, Fisher, and Zerbe variances, respectively. It can be observed that the sample sizes obtained with the different confidence interval methods are rather close. Sample sizes providing acceptable coverage (the calculation of the acceptable range is given in Section 3) for different combinations of $w$, $\rho$, and $k$ are marked in bold. Table 3 indicates that the confidence interval methods , , and provide sample sizes with acceptable coverage in most cases. Note that the numerical approach of Bonett 7 and Shieh 19 leads to sample sizes very close to the values we obtain (data not shown).
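The numerical evaluation can be sketched for the two Wald intervals whose expected width has a simple form, $2 z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{\rho})}$ evaluated at the planned ICC. The variance formulas below are assumptions written out in the code (Swiger in its usual published form, Fisher with $n$ in the denominator), the search starts at $n = 2$ because the Swiger variance is undefined for $n = 1$, and the function names are hypothetical. Under these assumptions the search returns the non-parenthesized Table 3 entries for the cases checked.

```python
from math import sqrt
from statistics import NormalDist


def swiger_variance(rho, n, k):
    N = n * k
    return (2 * (N - 1) * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2
            / (k ** 2 * (N - n) * (n - 1)))


def fisher_variance(rho, n, k):
    return 2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2 / (k * (k - 1) * n)


def min_participants(rho, k, width, variance, level=0.95, n_max=100_000):
    """Smallest n whose expected Wald interval width does not exceed `width`."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    for n in range(2, n_max + 1):          # increase n one unit at a time
        if 2 * z * sqrt(variance(rho, n, k)) <= width:
            return n
    raise ValueError("no n up to n_max reaches the target width")


n_f = min_participants(0.7, 2, 0.1, fisher_variance)
n_s = min_participants(0.7, 2, 0.1, swiger_variance)
print(n_f, n_s)
```

For the other confidence interval methods the expected width has to be evaluated differently, so this loop covers only the simplest case of the general procedure.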
Table 3.
The minimum number of participants, $n$, required to achieve an expected width, $w$, of the confidence interval, given $\rho$ and the number of raters, $k$, according to the numerical evaluation method. Sample sizes that provide coverage within an acceptable range (based on 25,000 simulations, i.e. between 0.947 and 0.953) are marked in bold. The values in parentheses indicate sample sizes obtained with analytical formulas given in equations (17) for the Wald method with the Swiger variance, (17) with a subtraction of one participant for the Wald method with the Fisher variance, and (18) for the Wald method with the Zerbe variance, respectively.
$w$ | $\rho$ | $k$ | Wald (Swiger) | Wald (Fisher) | Wald (Zerbe) | Searle | Normalized ICC (Swiger) | Normalized ICC (Fisher) | Normalized ICC (Zerbe) | Normalized Searle |
---|---|---|---|---|---|---|---|---|---|---|
0.1 | 0.7 | 2 | 401 (401) | 400 (400) | 408 (408) | 403 | 402 | 401 | 409 | 402 |
0.1 | 0.7 | 3 | 267 (267) | 266 (266) | 270 (270) | 267 | 267 | 267 | 271 | 267 |
0.1 | 0.7 | 6 | 188 (188) | 187 (187) | 189 (188) | 187 | 189 | 188 | 190 | 188 |
0.1 | 0.8 | 2 | 200 (200) | 200 (199) | 207 (207) | 204 | 202 | 202 | 209 | 202 |
0.1 | 0.8 | 3 | 140 (139) | 139 (138) | 143 (142) | 140 | 141 | 141 | 145 | 141 |
0.1 | 0.8 | 6 | 104 (103) | 103 (102) | 105 (104) | 103 | 105 | 104 | 106 | 104 |
0.1 | 0.9 | 2 | 56 (56) | 56 (55) | 63 (63) | 61 | 60 | 59 | 67 | 60 |
0.1 | 0.9 | 3 | 41 (41) | 41 (40) | 45 (44) | 43 | 44 | 43 | 47 | 44 |
0.1 | 0.9 | 6 | 32 (32) | 31 (31) | 34 (33) | 32 | 34 | 33 | 35 | 34 |
0.2 | 0.7 | 2 | 101 (101) | 100 (100) | 108 (108) | 103 | 102 | 102 | 109 | 102 |
0.2 | 0.7 | 3 | 68 (67) | 67 (66) | 71 (70) | 67 | 68 | 68 | 72 | 68 |
0.2 | 0.7 | 6 | 48 (48) | 47 (47) | 49 (48) | 47 | 49 | 48 | 50 | 48 |
0.2 | 0.8 | 2 | 51 (51) | 50 (50) | 57 (57) | 54 | 53 | 53 | 60 | 53 |
0.2 | 0.8 | 3 | 36 (36) | 35 (35) | 39 (38) | 36 | 37 | 37 | 41 | 37 |
0.2 | 0.8 | 6 | 27 (27) | 26 (26) | 28 (27) | 26 | 28 | 27 | 29 | 27 |
0.2 | 0.9 | 2 | 15 (15) | 14 (14) | 21 (21) | 19 | 18 | 17 | 24 | 18 |
0.2 | 0.9 | 3 | 11 (11) | 11 (10) | 14 (14) | 13 | 13 | 13 | 16 | 13 |
0.2 | 0.9 | 6 | 9 (9) | 8 (8) | 10 (9) | 9 | 11 | 10 | 12 | 10 |
4.2. Assurance probability approach
The assurance probability approach based on the width of the confidence interval for , 19 consists of finding the minimum number of participants such that
(19) |
where is the probability that the width , is less than or equal to a constant, , and is the assurance probability. The assurance probability approach based on the width of the confidence interval was introduced by Zou, 6 who pointed out that the width of confidence interval approach seen in the previous subsection is a special case, which corresponds to setting the assurance probability to . Zou 6 also introduced an assurance probability approach based on the lower limit of a confidence interval, see Section 4.3.
Zou 6 derived an analytical formula based on the Wald confidence interval and the Fisher variance ( ). Shieh 19 later extended the approach numerically to the Searle method ( ). In this article, we numerically generalize the assurance probability approach by considering all the confidence interval methods mentioned in Section 2.3.
Using the Wald confidence interval and the Fisher variance (equation (5)), Zou 6 obtained the minimum number of participants as
(20) |
where and . Zou 6 used in equation (5) and derived the formula considering the half-width of the confidence interval. As a result, the formula in Zou has different coefficients than equation (20). Using the Swiger variance (equation (4)), we derived the sample size under the approximation that (taking the positive root). This leads to equation (20) with the addition of one participant.
The analytical forms of the other confidence interval methods (including ) are too complex to yield a closed-form sample size. Therefore, we propose a generalization of the numerical approach explained in Section 4.1, which uses assurance probability functions to find the minimum $n$ satisfying a pre-defined value of the assurance probability. This numerical procedure works with all the confidence interval methods mentioned in Section 2.3. The derivation of the assurance probability functions is given in Appendix C.
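The assurance probability of a candidate sample size can also be checked empirically. The sketch below is a Monte Carlo check and not the assurance probability functions of Appendix C: it simulates data from the one-way model with unit participant variance, computes the width of the Wald interval with the Fisher variance (assumed with $n$ in the denominator) from each estimate, and reports the proportion of widths not exceeding the target. The function names, the seed, and the number of replications are hypothetical choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_icc(rng, rho, n, k):
    """One simulated ANOVA estimate of the ICC under the one-way model."""
    sd_e = sqrt((1 - rho) / rho)
    rows = []
    for _ in range(n):
        s = rng.gauss(0.0, 1.0)
        rows.append([s + rng.gauss(0.0, sd_e) for _ in range(k)])
    means = [sum(r) / k for r in rows]
    grand = sum(means) / n
    bms = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    wms = sum((y - m) ** 2 for r, m in zip(rows, means) for y in r) / (n * (k - 1))
    return (bms - wms) / (bms + (k - 1) * wms)


def empirical_assurance(rho, n, k, width, reps=1000, level=0.95, seed=1):
    """Proportion of simulated Wald (Fisher variance) intervals no wider than `width`."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    scale = 2 * z * sqrt(2 / (k * (k - 1) * n))
    narrow = 0
    for _ in range(reps):
        est = simulate_icc(rng, rho, n, k)
        narrow += scale * (1 - est) * (1 + (k - 1) * est) <= width
    return narrow / reps


# n = 134 is the Table 4 entry for w = 0.2, ICC = 0.7, k = 2;
# n = 101 is the corresponding Table 3 entry from the width approach.
a_assurance = empirical_assurance(0.7, n=134, k=2, width=0.2)
a_width = empirical_assurance(0.7, n=101, k=2, width=0.2)
print(a_assurance, a_width)
```

The sample size from the width approach yields a noticeably lower proportion of sufficiently narrow intervals than the one from the assurance probability approach, which is the motivation for the latter.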
Table 4 shows the sample sizes obtained by the numerical procedure using assurance probability functions. The analytical counterparts are shown between parentheses (when available). It can be observed that the minimum sample sizes obtained analytically are close to the values obtained via the numerical procedure. Further, our method gives sample sizes close to the ones obtained by Shieh, who also used a numerical method for (Tables 8 and 9 of Shieh 19 ). It can be further observed that the sample sizes obtained by the different confidence interval methods are rather close. Sample sizes providing acceptable assurance probability under different combinations of , , and , for are marked in bold. The lower limit of the acceptable range of assurance probabilities is calculated in the same way as in Section 3 where is replaced by . Confidence interval methods , , , and provide sample sizes with acceptable assurance probability in most cases while for only.
Table 4.
Minimum number of participants, $n$, required to achieve an expected width, $w$, of the confidence interval, given $\rho$, the number of raters, $k$, and an assurance probability of 0.9, according to the numerical procedure using assurance probability functions. Sample sizes that provide acceptable empirical assurance probability (i.e. above 0.896 for 25,000 simulations) are marked in bold. The values in parentheses indicate sample sizes obtained from the analytical formulas given in equation (20) for the Wald method with the Fisher variance and equation (20) with an addition of 1 participant for the Wald method with the Swiger variance, respectively.
$w$ | $\rho$ | $k$ | Wald (Swiger) | Wald (Fisher) | Wald (Zerbe) | Searle | Normalized ICC (Swiger) | Normalized ICC (Fisher) | Normalized ICC (Zerbe) | Normalized Searle |
---|---|---|---|---|---|---|---|---|---|---|
0.1 | 0.7 | 2 | 470 (470) | 469 (469) | 477 | 473 | 472 | 471 | 479 | 473 |
0.1 | 0.7 | 3 | 309 (308) | 308 (307) | 312 | 309 | 314 | 313 | 317 | 308 |
0.1 | 0.7 | 6 | 214 (214) | 213 (213) | 216 | 214 | 221 | 220 | 222 | 213 |
0.1 | 0.8 | 2 | 255 (255) | 254 (254) | 262 | 260 | 259 | 258 | 266 | 260 |
0.1 | 0.8 | 3 | 176 (176) | 175 (175) | 179 | 178 | 180 | 180 | 184 | 177 |
0.1 | 0.8 | 6 | 129 (129) | 128 (128) | 130 | 130 | 134 | 133 | 135 | 129 |
0.1 | 0.9 | 2 | 87 (87) | 87 (86) | 94 | 94 | 93 | 93 | 100 | 94 |
0.1 | 0.9 | 3 | 63 (63) | 63 (62) | 67 | 67 | 68 | 67 | 71 | 66 |
0.1 | 0.9 | 6 | 49 (49) | 48 (48) | 50 | 52 | 53 | 52 | 54 | 50 |
0.2 | 0.7 | 2 | 134 (134) | 134 (133) | 141 | 137 | 136 | 135 | 143 | 137 |
0.2 | 0.7 | 3 | 88 (88) | 87 (87) | 91 | 88 | 91 | 90 | 94 | 87 |
0.2 | 0.7 | 6 | 61 (61) | 60 (60) | 62 | 60 | 64 | 64 | 66 | 59 |
0.2 | 0.8 | 2 | 77 (77) | 76 (76) | 84 | 81 | 80 | 80 | 87 | 81 |
0.2 | 0.8 | 3 | 53 (53) | 53 (52) | 56 | 55 | 56 | 56 | 60 | 54 |
0.2 | 0.8 | 6 | 39 (39) | 38 (38) | 40 | 40 | 42 | 41 | 43 | 39 |
0.2 | 0.9 | 2 | 29 (29) | 29 (28) | 36 | 35 | 34 | 34 | 41 | 35 |
0.2 | 0.9 | 3 | 22 (22) | 21 (21) | 25 | 25 | 25 | 24 | 28 | 24 |
0.2 | 0.9 | 6 | 17 (17) | 16 (16) | 18 | 19 | 20 | 19 | 21 | 18 |
4.3. Testing approach
The testing approach consists of finding the minimum number of participants when one is interested in achieving a pre-specified power when testing the null hypothesis that is less than or equal to a constant, , that is, , against the alternative that is greater than , that is, . Denoting under the alternative hypothesis, the power of this test can be defined as the probability that the null hypothesis is rejected when the alternative hypothesis is true ( ). In our case, this is the probability that the lower limit of the confidence interval for is greater than when the alternative hypothesis is true. The mathematical form of the criterion under this approach can be written as, 6
(21) |
where is the probability that the lower limit of the confidence interval for , , is greater than the pre-specified value, , under the alternative hypothesis that . Donner and Eliasziw, 10 Walter et al., 9 and Zou 6 derived an analytical formula for the minimum number of participants, , based on the transformation of the F-statistic ( ) when minimizing the criterion specified in equation (21). Specifically,
(22) |
Shieh 21 used a numerical evaluation procedure to obtain sample sizes for the Searle method. Zou 6 obtained equation (22) by introducing an assurance probability based on a pre-specified lower limit of an asymmetrical interval procedure, which is equivalent to the testing approach.
We derived power functions following equation (21) for all the confidence interval methods (see Appendix D). These power functions were then used to obtain sample sizes for the testing approach using the numerical procedure mentioned in Section 4.2. The numerical procedure uses the power functions to find the minimum $n$ satisfying a pre-defined power ($1-\beta$).
Table 5 shows the sample sizes obtained by the numerical procedure using the power functions and numerical evaluation. The values within parentheses indicate sample sizes obtained by the analytical formulas for the method which correspond exactly to the ones obtained by our numerical procedure. The values obtained for the method using our numerical procedure are exactly one unit greater than the values obtained by the numerical method of Shieh. 21 Furthermore, unlike in the previous approaches, the Wald confidence interval methods tend to require smaller sample sizes than the other confidence interval methods. The actual power of the hypothesis test was also calculated at the obtained sample sizes. Sample sizes providing acceptable power for different combinations of , , , and are marked in bold. The lower limit of the acceptable range of power is calculated in the same way as in Section 3, where is replaced by . The confidence interval methods and provide the sample sizes with acceptable power in most cases while , , and provide sample sizes with acceptable power for only. The Wald methods always have power below the acceptable value (i.e. below 0.795 when the nominal power is 0.8, and below 0.896 when it is 0.9). For example, the actual power for the Wald methods can go as low as which is the case for , , , and .
Table 5.
Minimum number of participants, $n$, for a given value of considering the null ($\rho_0$) and alternative hypothesis ($\rho_1$) for a specified number of raters, $k$, and power of the test, $1-\beta$, according to the numerical procedure using power functions. Sample sizes that provide acceptable empirical power (i.e. above 0.896 when $1-\beta = 0.9$ and above 0.795 when $1-\beta = 0.8$ for 25,000 simulations) are marked in bold. The values in parentheses indicate the sample sizes obtained from the analytical formula given in equation (22) for .
Power | $\rho_0$ | $\rho_1$ | $k$ | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|
0.9 | 0.7 | 0.8 | 2 | 112 | 111 | 119 | 162 | 161 | 161 | 168 | 162 (162) |
0.9 | 0.7 | 0.8 | 3 | 78 | 78 | 82 | 110 | 112 | 112 | 116 | 110 (110) |
0.9 | 0.7 | 0.8 | 6 | 58 | 58 | 60 | 80 | 84 | 83 | 85 | 80 (80) |
0.9 | 0.8 | 0.9 | 2 | 32 | 31 | 38 | 63 | 62 | 62 | 69 | 63 (63) |
0.9 | 0.8 | 0.9 | 3 | 24 | 23 | 27 | 45 | 46 | 45 | 49 | 45 (45) |
0.9 | 0.8 | 0.9 | 6 | 19 | 18 | 20 | 34 | 36 | 35 | 37 | 35 (35) |
0.8 | 0.7 | 0.8 | 2 | 81 | 81 | 88 | 117 | 117 | 116 | 123 | 117 (117) |
0.8 | 0.7 | 0.8 | 3 | 57 | 56 | 60 | 79 | 82 | 81 | 85 | 80 (80) |
0.8 | 0.7 | 0.8 | 6 | 43 | 42 | 44 | 57 | 61 | 60 | 62 | 58 (58) |
0.8 | 0.8 | 0.9 | 2 | 23 | 23 | 30 | 46 | 45 | 45 | 52 | 46 (46) |
0.8 | 0.8 | 0.9 | 3 | 17 | 17 | 20 | 32 | 33 | 33 | 36 | 33 (33) |
0.8 | 0.8 | 0.9 | 6 | 14 | 13 | 15 | 25 | 26 | 25 | 27 | 25 (25) |
4.4. Software for sample size calculation
Currently, only the method is available in common software (see Appendix E) for the width of the confidence interval and assurance probability approaches, while only is available for the testing approach. Therefore, a Shiny app containing all the approaches to determine minimum required sample sizes has been developed 30 and made available on GitHub (https://github.com/DiproMondal/sample-size-ICC) and on the Shiny server (https://dipro.shinyapps.io/sample-size-icc/).
5. Empirical illustration
5.1. Reliability of systolic blood pressure measurements
In this section, we illustrate how the confidence interval methods described in Section 2.3 and the approaches for sample size determination described in Section 4 are used in the context of a reliability study. In the study of Bland and Altman, 31 three repeated systolic blood pressure measurements ($k = 3$) were made on 85 participants ($n = 85$) by two experienced observers (raters J and R) and a semi-automatic blood pressure monitor. For the purpose of our illustration, we use the measurements made by rater J only, which can be modeled by a one-way ANOVA.
The ANOVA model assumes that the outcome measurements are normally distributed and the variance across repetitions is homogeneous across participants. Exploratory data analysis revealed that the excess kurtosis for the repetitions was mild while the degree of asymmetry of the repetitions indicated moderate skewness. Furthermore, the data also present mild heteroscedasticity on the repeated measurements. Following equation (3), we obtain . The confidence intervals obtained using the eight confidence interval methods are shown in Table 6, which have rather similar bounds.
Table 6.
Lower and upper limits of the 95% confidence intervals for $\rho$ for the systolic blood pressure measurements.
Lower limit | 0.948 | 0.948 | 0.947 | 0.945 | 0.945 | 0.945 | 0.945 | 0.945 |
Upper limit | 0.975 | 0.975 | 0.976 | 0.974 | 0.973 | 0.973 | 0.973 | 0.973 |
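To make the point estimate concrete, the following sketch computes the ICC from a balanced participants-by-repetitions layout. Equation (3) is not reproduced in this section, so the sketch assumes it is the usual one-way ANOVA estimator, (MSB − MSW) / (MSB + (k − 1) MSW). The function name `icc_oneway` and the toy data are hypothetical and are not the blood pressure data analysed above.

```python
def icc_oneway(data):
    """ICC estimate from a balanced one-way layout.

    data: list of n rows (participants), each holding k repeated measurements.
    Assumes the standard ANOVA estimator (MSB - MSW) / (MSB + (k - 1) * MSW).
    """
    n = len(data)
    k = len(data[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in data):
        raise ValueError("need a balanced layout with n >= 2 and k >= 2")

    grand_mean = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]

    # Between-participant and within-participant sums of squares.
    ssb = k * sum((m - grand_mean) ** 2 for m in row_means)
    ssw = sum((x - m) ** 2 for row, m in zip(data, row_means) for x in row)

    msb = ssb / (n - 1)          # between-participant mean square
    msw = ssw / (n * (k - 1))    # within-participant mean square
    return (msb - msw) / (msb + (k - 1) * msw)


# Toy example: three participants, two repeated measurements each.
toy = [[1, 2], [3, 4], [5, 6]]
print(round(icc_oneway(toy), 4))
```

Confidence limits such as those in Table 6 would additionally require the variance approximations or F quantiles of Section 2.3, which are not included in this sketch.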
5.2. Planning a reliability study
A researcher may be interested in planning a study to measure blood pressure aiming at a reliability of . The sample size approaches described in previous sections can be used to find the number of participants required for such a study.
Figure 1 shows, for each in the interval (x-axis), the minimum required (y-axis) using (top-down) the width of confidence interval approach described in Section 4.1, the assurance probability approach described in Section 4.2, and the testing approach described in Section 4.3 for the confidence interval methods and . For example, suppose the study only allows for three repeated measurements per participant. Then using the width of confidence interval approach, the assurance probability approach, and the testing approach the researcher would require, respectively, 43, 67, and 133 (considering the Searle confidence interval method) participants for the criteria given in Figure 1. The effect of increasing the number of measurements per participant to four is a decrease in the number of participants to 37, 59, and 115 participants (considering the Searle confidence interval method), respectively, for the width of confidence interval approach, the assurance probability approach, and the testing approach. The gain in having a smaller number of participants for the study decreases as the number of measurements per participant increases.
Figure 1.
Combinations for and for the confidence interval methods and satisfying the criteria specified for the three different sample size approaches. The criteria for the sample size approaches are mentioned in the sub-figures where, for the width of confidence interval approach, is the expected width of the confidence interval around a given value ; for the assurance probability approach, the notations are the same as the width of confidence interval approach with the addition of denoting the assurance probability; for the testing approach is the power of the hypothesis test, and are the values under the null and alternative hypotheses. (a) Width of confidence interval approach; (b) assurance probability approach; and (c) testing approach.
If, instead, there is flexibility in choosing the number of repetitions per participant, the researcher can consider a cost-constraint approach to find the optimal combination of the number of participants ($n$) and number of repeated measures per participant ($k$). The optimal combination is then obtained by finding the values of $n$ and $k$ for which the total cost, $T$, is minimum. A plausible cost function is
(23) |
where is the total cost, is the cost of recruiting a participant, and is the cost of making one observation.
Table 7 shows the optimal combinations of obtained by minimizing the total cost, (equation (23)), for different combinations of and . It can be observed from the table that as increases relative to , more repetitions per participant are required with a smaller number of participants to achieve the same criterion value.
Table 7.
Optimal combination of the number of repetitions and participants, ( ) for the sample size approaches with the confidence interval methods, and , for different costs of recruiting a participant, and making an observation, .
Sample size approach | ||||
---|---|---|---|---|
Width of confidence interval approach for and | 1 | 5 | (261) | (260) |
1 | 1 | (343) | (344) | |
5 | 1 | (437) | (438) | |
Assurance probability approach for , , and | 1 | 5 | (294) | (295) |
1 | 1 | (367) | (367) | |
5 | 1 | (459) | (458) | |
Testing approach for , , and | 1 | 5 | (2185) | (2185) |
1 | 1 | (3133) | (3133) | |
5 | 1 | (4115) | (4116) |
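The cost-constrained search described in this section can be sketched as a loop over candidate numbers of repetitions. The cost function below, total = n × (cost per participant + cost per observation × k), is an assumption made for illustration and is not necessarily equation (23). The helper `required_n` is a hypothetical stand-in for any sample size routine (here the closed-form testing-approach approximation attributed to Walter et al. 9), and the names `cheapest_design`, `cost_participant`, and `cost_observation` are likewise illustrative.

```python
import math
from statistics import NormalDist


def required_n(rho0, rho1, k, power, alpha=0.05):
    """Stand-in sample size routine (assumed closed-form approximation)."""
    z = NormalDist().inv_cdf
    theta0 = rho0 / (1 - rho0)
    theta1 = rho1 / (1 - rho1)
    c0 = (1 + k * theta0) / (1 + k * theta1)
    n = 1 + 2 * k * (z(1 - alpha) + z(power)) ** 2 / ((k - 1) * math.log(c0) ** 2)
    return math.ceil(n)


def cheapest_design(rho0, rho1, power, cost_participant, cost_observation,
                    k_values=range(2, 21)):
    """Return (k, n, total_cost) minimising an assumed linear cost function."""
    best = None
    for k in k_values:
        n = required_n(rho0, rho1, k, power)
        # Assumed cost: recruiting n participants plus n * k observations.
        total = n * (cost_participant + cost_observation * k)
        if best is None or total < best[2]:
            best = (k, n, total)
    return best


# Recruitment as expensive as an observation, then five times as expensive.
print(cheapest_design(0.7, 0.8, 0.8, cost_participant=1, cost_observation=1))
print(cheapest_design(0.7, 0.8, 0.8, cost_participant=5, cost_observation=1))
```

In this sketch a higher recruitment cost relative to the observation cost shifts the optimum towards more repetitions and fewer participants, in line with the pattern described for Table 7. The settings used here are arbitrary and are not those of Table 7.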
6. Discussion
Sample size determination is a crucial aspect of the planning stage of a reliability study. Usually, the number of raters, $k$, is fixed due to budget or time constraints in the study, and the sample size of participants, $n$, needs to be determined. This article gives a complete overview of the different approaches available in that case. Analytical closed-form solutions for sample size determination only exist in a few cases. Therefore, we proposed a general procedure that entails deriving an assurance probability or power function (depending on the approach) and finding the optimal $n$ via a simple search procedure.
Before inspecting the different approaches for sample size determination, we looked at the statistical properties of the different confidence interval methods. We have shown that the confidence interval based on the Searle method ( ) provides acceptable coverage in almost all scenarios for , and, and for . This can be explained by the fact that is an exact method and is based on a normalizing transformation of . is the Wald method based on the Zerbe variance which was also derived as a ratio of F-statistics. It must be noted, however, that does not provide acceptable coverage for small (when , see Supplemental Material 1). The other methods, based on some approximations, only provide acceptable coverage in a few scenarios. It is worthwhile to note that the Wald confidence interval using the Fisher variance, , widely used in the literature, shows acceptable coverage only for large sample sizes, , when and . Note that the Zerbe variance provides better statistical properties than the Fisher variance when , but the width of the confidence interval is larger.
Sample sizes were determined using three different approaches which rely on the limits of a confidence interval for $\rho$. Sample sizes in the case of the width of confidence interval approach were obtained via a numerical evaluation. We derived the assurance probability and power functions for the assurance probability and testing approaches, respectively, to determine sample sizes. These functions, when combined with the numerical evaluation, enabled us to determine sample sizes for all the methods discussed. Sample sizes obtained through this procedure were similar to those from the corresponding available analytical formulas. Furthermore, sample sizes obtained with different confidence interval methods in the width of confidence interval approach and the assurance probability approach were similar. However, this was not the case in the testing approach, where smaller sample sizes were obtained using the Wald confidence interval to achieve a required power level compared to the other confidence interval methods. This is probably because the Wald confidence interval method assumes a symmetric distribution for the estimate of $\rho$, which is not a realistic assumption when $\rho$ is large (e.g. 0.8, 0.9). 28 In all the approaches, the Searle method ( ) provided sample sizes with good statistical properties as well as when . We, therefore, advise the use of these methods to make statistical inference on the ICC in the one-way ANOVA setting.
We have shown that the choice of the approach to determine sample size, or even the choice of the confidence interval method, has an impact on the resulting sample size. We therefore advise researchers to carefully consider the requirements of their studies as a guide to choosing the appropriate sample size approach. For the three different approaches discussed in this article, the Searle confidence interval method demonstrated good statistical properties, making it our recommended choice. Furthermore, in order to determine sample sizes, we have developed an R Shiny app which we believe will prove valuable to researchers in need of a simple and efficient interface for obtaining sample sizes.
Our study is not without limitations. First, the confidence interval methods investigated in this paper, except the Searle confidence interval method (which is exact), rely on large sample approximations. Therefore, practitioners should exercise caution when calculations lead to a small minimal sample size because good statistical behavior is not guaranteed. Note that the minimal sample sizes obtained with the different approaches rarely go below in realistic scenarios (see Tables 3 to 5). Second, the estimator of $\rho$ and its confidence interval rely on the assumptions of normality and homoscedasticity in line with the one-way ANOVA model (equation (1)). Violations of these conditions impact the statistical properties of the confidence intervals. The effect of non-normality on the Type-I error rate of the F-statistic was studied by various authors.32–37 However, simulation studies 38 showed that the effect of heteroscedasticity outweighs the effect of non-normality on the Type-I error rate of the F-statistic, even for a balanced design 39 as considered here. We, therefore, advise researchers to check for violations of the assumptions of the ANOVA model (equation (1)) before using the methods described in this article. Readers interested in non-parametric estimators of ICC, not requiring the normality assumption, are directed to the works of Rothery, 40 Shirahata, 41 Commenges and Jacqmin, 42 and Ukoumunne et al. 43 Note, however, that these papers do not develop a sample size procedure. Third, as previously mentioned, we consider an equal number of ratings per participant, constituting a balanced design. Considering unbalanced designs will require specifying the degree of imbalance in advance, which is not an easy task. Furthermore, Donner 14 showed that with an unbalanced design, the F-statistic is not exact and this, in turn, affects the statistical properties of the ICC and its confidence interval.
Fourth, we focused on reliability in the context of a one-way ANOVA model. Whether the numerical procedure we developed can be extended to multi-way ANOVA models will require further investigation, as methods to construct confidence intervals are different in that case.44,45
Supplemental Material
Supplemental material, sj-pdf-1-smm-10.1177_09622802231224657 for Review of sample size determination methods for the intraclass correlation coefficient in the one-way analysis of variance model by Dipro Mondal, Sophie Vanbelle, Alberto Cassese and Math JJM Candel in Statistical Methods in Medical Research
Appendix A. Ratio of variance estimators
We will show that,
and thus
where is the average width of the confidence interval based on variance approximation .
Note that the variance estimators (equations (4) to (6)) can be written in the form,
where and depends on the form of the variance approximation . For the Swiger variance, , for the Fisher variance, , and for the Zerbe variance, .
= .
.
The proof of 2 is as follows:
We can re-write the ratio of the Zerbe and the Swiger variances into two parts as,
We have . For the second part, we have,
implying that
Appendix B. Sample size determination in the width of confidence interval approach
The sample size formulas for the width of confidence interval approach are derived here. Using the Wald confidence interval from Section 2.3.1, the minimum value of $n$ is obtained when:
$$w = 2\, z_{1-\alpha/2} \sqrt{\mathrm{Var}(\hat{\rho})},$$
where $w$ is the expected width of the confidence interval, $z_{1-\alpha/2}$ is the $100(1-\alpha/2)$th percentile of the standard normal distribution and $\mathrm{Var}(\hat{\rho})$ can be determined by equations (4) to (6).
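When no closed-form solution is wanted, the width criterion can also be solved by a direct search over $n$. The sketch below assumes that the criterion is "Wald interval width, 2 z √Var, does not exceed the target width", and that equation (4) is the Swiger variance in its usual form, 2(nk − 1)(1 − ρ)²(1 + (k − 1)ρ)² / (k² n (k − 1)(n − 1)). Neither expression is reproduced in this appendix. The names `swiger_variance` and `min_n_for_width` are hypothetical.

```python
import math
from statistics import NormalDist


def swiger_variance(rho, n, k):
    """Assumed form of the Swiger large-sample variance of the ICC estimator."""
    return (2 * (n * k - 1) * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2
            / (k ** 2 * n * (k - 1) * (n - 1)))


def min_n_for_width(rho, k, width, alpha=0.05, n_max=100_000):
    """Smallest n whose expected Wald interval width is at most `width`."""
    z = NormalDist().inv_cdf(1 - alpha / 2)      # two-sided normal quantile
    for n in range(2, n_max + 1):
        expected_width = 2 * z * math.sqrt(swiger_variance(rho, n, k))
        if expected_width <= width:
            return n
    raise ValueError("no n <= n_max satisfies the width criterion")


# Planning value rho = 0.8, three measurements per participant, width 0.2.
print(min_n_for_width(rho=0.8, k=3, width=0.2))
```

Because the variance decreases monotonically in $n$ for $n \ge 2$, a simple upward scan is sufficient here. A bisection search would be faster for very narrow target widths.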
B.1 Width of confidence interval approach with the Wald confidence interval and the Swiger variance
Using the Swiger variance formula (equation (4)), we have:
Solving the last equation yields the following solution:
which leads to the approximation,
The other solution (with a negative sign in front of the square root) leads to n approximately equal to 0. The excess part creates a difference between n and . (By numerical evaluation for , , and we observe a maximum difference of 0.5 between n and .)
B.2 Width of confidence interval approach with the Wald confidence interval and the Zerbe variance
In a similar way, we obtain with the Zerbe variance (equation (6))
Assuming , leads to the following result
Solving the last equation gives only one root in the real domain:
where .
Appendix C. Assurance probability functions
The assurance probability functions for the assurance probability approach 6 are derived here. The criterion for this approach is given as,
C.1 Wald confidence interval method
The expected width of the Wald confidence interval is given as . Then the criterion can be further specified as,
Rewriting , with and defined in Appendix A, we have,
Using the Delta method, we obtain, . Then, the assurance probability function can be written as,
where is the cumulative standard normal distribution.
C.2 Searle method
The width of the Searle confidence interval is given as,
Rewriting , we notice that is a decreasing function of when , otherwise concave. Assuming , the point of extrema (i.e. ) is . Therefore, we can define the inverse function, when as,
and the inverse function when as,
where and . Following the criterion for the assurance probability approach,
Then, the assurance probability function can be written as,
where is the cumulative F distribution with and degrees of freedom, , and and are defined above.
C.3 Normalized ICC method
The width of the normalized ICC confidence interval method is given as,
Denoting and , assuming the variance of to be known, we can rewrite the width of the normalized ICC confidence interval method as,
which after some algebraic manipulation and using Euler’s transformation of the hyperbolic function is equal to . Then, following the criterion for the assurance probability approach,
Denoting , then,
which gives . Note that is a convex function ( , since, and , ). Therefore,
Therefore, the assurance probability can be approximated by,
where , and is substituted from equations (11) to (13) for the three different variance approximations, respectively.
C.4 Normalized Searle method
In the normalized Searle method, the normalization is . Following the steps of the derivation of the assurance probability function for the Searle method, we have,
We can take the logarithm to obtain,
Now, since
and
the assurance probability function can be approximated by,
where .
Appendix D. Power functions
The power functions for the testing approach are derived here. The criterion for this approach is given as,
where is the lower limit of the confidence interval for .
D.1 Wald confidence interval method
The lower limit of the confidence interval is given as . Then, assuming to be known, the criterion can be elaborated as,
so that,
Then, the power function can be written as,
where is the cumulative standard normal distribution.
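A minimal sketch of how such a power function can drive the numerical search is given below. It assumes that the Wald lower limit is ρ̂ − z₁₋α √Var, that the variance is treated as known and evaluated at ρ₁, and that the Fisher variance has the form 2(1 − ρ)²(1 + (k − 1)ρ)² / (k(k − 1)n). Under these assumptions the power reduces to Φ((ρ₁ − ρ₀)/√Var − z₁₋α). The names `fisher_variance`, `wald_power`, and `min_n_for_power`, and the default one-sided `alpha=0.05`, are illustrative.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def fisher_variance(rho, n, k):
    """Assumed form of the Fisher large-sample variance of the ICC estimator."""
    return 2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2 / (k * (k - 1) * n)


def wald_power(n, k, rho0, rho1, alpha=0.05):
    """P(lower Wald limit > rho0 | rho = rho1), variance evaluated at rho1."""
    se = math.sqrt(fisher_variance(rho1, n, k))
    z = _STD_NORMAL.inv_cdf(1 - alpha)           # one-sided normal quantile
    return _STD_NORMAL.cdf((rho1 - rho0) / se - z)


def min_n_for_power(k, rho0, rho1, power, alpha=0.05, n_max=100_000):
    """Smallest n whose approximate power reaches the target."""
    for n in range(2, n_max + 1):
        if wald_power(n, k, rho0, rho1, alpha) >= power:
            return n
    raise ValueError("no n <= n_max reaches the requested power")


# Two raters, rho0 = 0.7 against rho1 = 0.8, target power 0.9.
print(min_n_for_power(k=2, rho0=0.7, rho1=0.8, power=0.9))
```

For the settings tried, this search returns values that coincide with one of the Wald columns of Table 5. That agreement is suggestive rather than proof that the same expression was used.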
D.2 Searle method
Following Shieh, 21 the power function can be written as,
where is the cumulative F-distribution with and degrees of freedom, and with representing values under the null ( ) and alternative hypotheses ( ).
D.3 Normalized ICC method
The lower limit of the confidence interval is given as . Then the criterion can be approximated by,
Following the same steps as for the Wald confidence interval method, we get,
where , and with representing values under the null and alternative hypotheses.
D.4 Normalized Searle method
The lower limit of the confidence interval is given as . Then the criterion can be approximated by,
Then, following the same steps as we did for the Wald confidence interval method, we get,
(.1) |
where , with representing values under the null and alternative hypotheses.
Appendix E. Sample size formulas used in common statistical software
An overview of the methods implemented in common statistical software is provided in Table 8.
Table 8.
A list of other relevant software commonly used to obtain sample size for ICC(1).
Source | Sample size approaches | Additional comments |
---|---|---|
SAS19,21 | Testing ( ) | Shieh provided sample size calculation for assurance probability ( , ) and testing ( , ) also including cost-constraints |
PASS 20 | Testing ( ) | |
R package MBESS 46 | Assurance probability ( ) Testing ( ) | Allows sample size calculation under cost-constraints |
R package presize 47 | Assurance probability ( ) Testing ( ) | Allows sample size calculation with dropout rates. (webcalculator) |
R package ICC.Sample.Size 48 | Testing ( ) | Allows sample size calculation for the testing approach with two tails |
ICC: intraclass correlation coefficient; SAS: Statistical Analysis System.
Footnotes
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.
ORCID iDs: Dipro Mondal https://orcid.org/0000-0002-4356-0011
Sophie Vanbelle https://orcid.org/0000-0001-6584-2522
Alberto Cassese https://orcid.org/0000-0001-5830-4136
Math JJM Candel https://orcid.org/0000-0002-2229-1131
Supplemental material: Supplemental material for this article is available online.
References
- 1. Lucas NP, Macaskill P, Irwig L, et al. The development of a quality appraisal tool for studies of diagnostic reliability (QAREL). J Clin Epidemiol 2010; 63: 854–861.
- 2. Mokkink LB, Terwee CB, Gibbons E, et al. Inter-rater reliability of the COSMIN (consensus-based standards for the selection of health status measurement instruments) checklist. Qual Life Res 2010; 19: 25–25.
- 3. Kottner J, Audige L, Brorson S, et al. Guidelines for reporting reliability and agreement studies (GRRAS) were proposed. Int J Nurs Stud 2011; 48: 661–671.
- 4. McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods 1996; 1: 30–46.
- 5. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979; 86: 420–428.
- 6. Zou GY. Sample size formulas for estimating intraclass correlation coefficients with precision and assurance. Stat Med 2012; 31: 3972–3981.
- 7. Bonett DG. Sample size requirements for estimating intraclass correlations with desired precision. Stat Med 2002; 21: 1331–1335.
- 8. Shoukri M, Asyali M, Donner A. Sample size requirements for the design of reliability study: review and new results. Stat Methods Med Res 2004; 13: 251–271.
- 9. Walter SD, Eliasziw M, Donner A. Sample size and optimal designs for reliability studies. Stat Med 1998; 17: 101–110.
- 10. Donner A, Eliasziw M. Sample size requirements for reliability studies. Stat Med 1987; 6: 441–448.
- 11. Swiger LA, Harvey WR, Everson DE, et al. The variance of intraclass correlation involving groups with one observation. Biometrics 1964; 20: 818.
- 12. Fisher R. Statistical methods for research workers. 13th ed., rev. New York: Hafner, 1958.
- 13. Zerbe CO, Goldgar DE. Comparison of intraclass correlation coefficients with the ratio of two independent F-statistics. Commun Stat-Theory Methods 1980; 9: 1641–1655.
- 14. Donner A. A review of inference procedures for the intraclass correlation coefficient in the one-way random effects model. Int Stat Rev 1986; 54: 67–82.
- 15. Searle SR. Linear models. New York: John Wiley & Sons, 1971.
- 16. Ramasundarahettige CF, Donner A, Zou GY. Confidence interval construction for a difference between two dependent intraclass correlation coefficients. Stat Med 2009; 28: 1041–1053.
- 17. Donner A, Wells GA. A comparison of confidence interval methods for the intraclass correlation coefficient. Biometrics 1986; 42: 401–412.
- 18. Borg D, Bach A, O’Brien J, et al. Calculating sample size for reliability studies. PM&R 2022; 14: 1018–1025.
- 19. Shieh G. Sample size requirements for the design of reliability studies: precision consideration. Behav Res Methods 2014; 46: 808–822.
- 20. Bujang MA, Baharum N. A simplified guide to determination of sample size requirements for estimating the value of intraclass correlation coefficient: a review. Arch Orofac Sci 2017; 12: 1–11.
- 21. Shieh G. Optimal sample sizes for the design of reliability studies: power consideration. Behav Res Methods 2014; 46: 772–785.
- 22. Shoukri MM, Al-Hassan T, Deniro M, et al. Bias and mean square error of reliability estimators under the one and two random effects models: the effect of non-normality. Open J Stat 2016; 06: 254–273.
- 23. Wang CS, Yandell BS, Rutledge JJ. Bias of maximum likelihood estimator of intraclass correlation. Theor Appl Genet 2004; 82: 421–424.
- 24. Donner A, Koval JJ. A note on the accuracy of Fisher’s approximation to the large sample variance of an intraclass correlation. Commun Stat - Simul Comput 1983; 12: 443–449.
- 25. Visscher PM. On the sampling variance of intraclass correlations and genetic correlations. Genetics 1998; 149: 1605–1614.
- 26. Kaart T. A new approximation to the variance of the anova estimate of the intraclass correlation coefficient. Proc Est Acad Sci Phys Math 2005; 54. DOI: 10.3176/phys.math.2005.4.04.
- 27. Demetrashvili N, Wit EC, van den Heuvel ER. Confidence intervals for intraclass correlation coefficients in variance components models. Stat Methods Med Res 2016; 25: 2359–2376.
- 28. Liljequist D, Elfving B, Skavberg Roaldsen K. Intraclass correlation – a discussion and demonstration of basic features. PLoS ONE 2019; 14: e0219854.
- 29. Giraudeau B, Mary JY. Planning a reproducibility study: how many subjects and how many replicates per subject for an expected width of the 95 per cent confidence interval of the intraclass correlation coefficient. Stat Med 2001; 20: 3205–3214.
- 30. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2022. https://www.R-project.org/
- 31. Bland JM, Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res 1999; 8: 135–160.
- 32. Tiku ML. Approximating the general non-normal variance-ratio sampling distributions. Biometrika 1964; 51: 83–95.
- 33. Gayen AK. The distribution of the variance ratio in random samples of any size drawn from non-normal universes. Biometrika 1950; 37: 236–255.
- 34. Scheffé H. The analysis of variance. Oxford, England: Wiley, 1959.
- 35. Khan A, Rayner GD. Robustness to non-normality of common tests for the many-sample location problem. J Appl Math Decis Sci 2003; 7: 657201.
- 36. Blanca MJ, Alarcón R, Arnau J, et al. Non-normal data: Is ANOVA still a valid option? Psicothema 2017; 29: 552–557.
- 37. Donaldson TS. Robustness of the F-test to errors of both kinds and the correlation between the numerator and denominator of the F-ratio. J Am Stat Assoc 1968; 63: 660–676. http://www.jstor.org/stable/2284037
- 38. Marcinko T. Consequences of assumption violations regarding one-way ANOVA. The 8th International Days of Statistics and Economics, Prague, September 11–13, 2014.
- 39. Wilcox R. Chapter 7: one-way and higher designs for independent groups. In: Wilcox R (ed.) Introduction to robust estimation and hypothesis testing. 3rd ed. Statistical Modeling and Decision Science. Boston: Academic Press, 2012, pp. 291–377. ISBN 978-0-12-386983-8. DOI: 10.1016/B978-0-12-386983-8.00007-X.
- 40. Rothery P. A nonparametric measure of intraclass correlation. Biometrika 1979; 66: 629–639.
- 41. Shirahata S. Nonparametric measures of interclass correlation. Commun Stat – Theory Method 1982; 11: 1707–1721.
- 42. Commenges D, Jacqmin H. The intraclass correlation-coefficient – distribution-free definition and test. Biometrics 1994; 50: 517–526.
- 43. Ukoumunne O, Davison A, Gulliford M, et al. Non-parametric bootstrap confidence intervals for the intraclass correlation coefficient. Stat Med 2003; 22: 3805–3821.
- 44. Ionan AC, Polley MY, McShane LM, et al. Comparison of confidence interval methods for an intra-class correlation coefficient (ICC). BMC Med Res Methodol 2014; 121. DOI: 10.1186/1471-2288-14-121.
- 45. Almehrizi RS, Emam M. Asymptotic standard errors of intraclass correlation coefficients for two-way model. Commun Stat-Simul Comput 2021; 52: 2073–2092.
- 46. Ken K. MBESS: The MBESS R Package. https://CRAN.R-project.org/package=MBESS
- 47. Alan GH, Armando L, Odile S, et al. presize: an R-package for precision-based sample size calculation in clinical research. J Open Source Softw 2021; 6: 3118.
- 48. Alasdair R, Saurabh S, Dinesh K. ICC.Sample.Size: Calculation of Sample Size and Power for ICC. https://CRAN.R-project.org/package=ICC.Sample.Size