Skip to main content
Sage Choice logoLink to Sage Choice
. 2024 Feb 6;33(3):532–553. doi: 10.1177/09622802231224657

Review of sample size determination methods for the intraclass correlation coefficient in the one-way analysis of variance model

Dipro Mondal 1,, Sophie Vanbelle 1, Alberto Cassese 2, Math JJM Candel 1
PMCID: PMC10981208  PMID: 38320802

Abstract

Reliability of measurement instruments providing quantitative outcomes is usually assessed by an intraclass correlation coefficient. When participants are repeatedly measured by a single rater or device, or, are each rated by a different group of raters, the intraclass correlation coefficient is based on a one-way analysis of variance model. When planning a reliability study, it is essential to determine the number of participants and measurements per participant (i.e. number of raters or number of repeated measurements). Three different sample size determination approaches under the one-way analysis of variance model were identified in the literature, all based on a confidence interval for the intraclass correlation coefficient. Although eight different confidence interval methods can be identified, Wald confidence interval with Fisher’s large sample variance approximation remains most commonly used despite its well-known poor statistical properties. Therefore, a first objective of this work is comparing the statistical properties of all identified confidence interval methods—including those overlooked in previous studies. A second objective is developing a general procedure to determine the sample size using all approaches since a closed-form formula is not always available. This procedure is implemented in an R Shiny app. Finally, we provide advice for choosing an appropriate sample size determination method when planning a reliability study.

Keywords: Intrarater reliability, interrater reliability, measurement errors, reproducibility (of results), observer variation

1. Introduction

Reliability is important in many scientific disciplines.13 All measurement and evaluation processes are subject to measurement error. These errors can have a serious impact on research undermining the conclusions of the study, as well as in daily practice when measurement and evaluation processes are used to make diagnoses or assess the progression of participants, for example. It is therefore essential for measurement instruments to be reliable (i.e. the device/rater is able to distinguish among participants in a population) and valid (i.e. measurements reflect the underlying true values). The reliability of a device/rater is usually evaluated during a reliability study. Generally, a reliability study consists of participants measured repeatedly under similar conditions by the same device/rater (intrarater reliability) or by different devices/raters (interrater reliability). In interrater reliability studies, the set of raters can be the same, or different for every participant. In this article, we focus (1) on intrarater studies where the same number of repeated measurements is made simultaneously on each participant, and the order of the measurements is interchangeable, and, (2) on specific interrater reliability studies where the set of raters is different for every participant, and the same number of raters rates each participant. In the second case the reliability coefficient additionally reflects the differences between raters, next to the measurement error.

When the outcome measurements are quantitative, reliability can be quantified using an intraclass correlation coefficient (ICC). ICC is defined as the correlation between repeated measurements at multiple occasions made by the same rater/device or by different raters/devices on the same participants. It compares the variability of measurements/ratings within participants to the variability of measurements/ratings between participants. Depending on the design of the study, different forms of ICC should be used.4,5 This article focuses on the ICC defined in the one-way analysis of variance (ANOVA) model, ICC(1). 4 When planning a reliability study, determining the minimum number of raters/repetitions and participants is of prime importance. In fact, too many participants may prove to be time-consuming and may also increase the research budget, while too few may adversely impact the precision of the ICC estimate, preventing the drawing of any conclusion on the study. Several approaches to determine sample sizes can be identified in the literature. The aim of this review is two-fold. First, it is to compare the statistical properties of the sample sizes obtained with the approaches in realistic settings. Second, it is to develop a general procedure for sample size determination, since a closed-form formula is not always available for all the approaches.

Existing literature on determining sample size indicates two main approaches, namely, the confidence interval approach6,8 and the hypothesis testing approach.6,810 The confidence interval approach requires defining, around a planned ICC, a target width of the confidence interval that the researcher aims to achieve. A generalization of the width of the confidence interval approach, the assurance probability approach, 6 is based on testing whether the width of the confidence interval is less than a pre-specified width with a given assurance probability. The testing approach is based on the power of testing the hypothesis that the ICC is lower or equal to (null hypothesis), or, above (alternative hypothesis) a pre-specified value of the ICC. A common feature of these approaches is that the variance of the ICC estimator needs to be defined. In the literature, two closed-form approximations of the large-sample variance of the ICC estimate are mainly used. These are namely, the Swiger variance, 11 which is based on the Taylor-series expansion of the ratio of the ANOVA mean squares, and the Fisher variance, 12 a large-sample approximation obtained by Fisher. We further consider another form of the variance, known as the Zerbe variance, 13 based on the formulation of the ratio of two independent F-statistics. This variance is far less popular and was not included in previous reviews.

Confidence intervals formed around the ICC are mainly based on the Wald method,6,14 or on the F-statistic, termed the Searle method. 15 The Wald and the Searle methods can be further applied using a normalization transformation.12,16 When comparing the coverage probability of the confidence intervals (confidence intervals based on the Wald method with the Fisher variance and the Searle method), Zou 6 concluded that the normalized Searle method performs better than the Wald method with the Fisher variance. However, when comparing the coverage probabilities and mean interval widths of confidence intervals obtained with the Wald method (with the Swiger variance), the Searle method, and the normalized Searle method, Donner and Wells 17 concluded that no method was superior in all situations.

In the context of sample size determination with the confidence interval approach, a closed-form formula was derived by Bonett 7 for the Wald confidence interval with the Fisher variance and is the most common choice. 18 While Shieh more recently defined a numerical procedure for the Searle method, 19 no procedure to determine sample sizes exists for the other methods. Note that, in common statistical software like R, 18 SAS, and PASS, 20 the Wald method with the Fisher variance is the only one that is available (see Appendix E). As for the comparison among the methods, Shieh 19 compared the statistical properties of the Wald method (with the Fisher variance) and the Searle method with respect to the width of the confidence interval approach and the assurance probability approach. To summarize the results, the Searle method and the assurance probability approach, with a 90% assurance probability, showed better coverage than the width of the confidence interval approach. 6 Furthermore, this was achieved with a somewhat smaller width of the confidence interval. We aim to complete these comparisons by considering all the identified confidence interval methods.

For the testing approach, sample size determination was derived only for the normalized Searle method 18 and the Searle method (numerically). Only the latter is available in common statistical software. 20 As for the comparison, Shieh 21 showed that the approximate sample size formula obtained using the normalized Searle method 8 under-performs, with respect to the observed power of the hypothesis test, when compared to numerical sample sizes obtained via the Searle method. We extend the work of Shieh, 21 by comparing the results that can be obtained using all the methods for sample size determination identified in this article.

Several studies have investigated inference procedures for the ICC in this context but are incomplete as these studies do not consider all the confidence interval methods identified. In summary, our contribution is as follows. First, we compare the statistical properties of all identified confidence interval methods. Second, we analytically derive the sample size formulas using the Swiger and the Zerbe variances. Third, we develop a numerical procedure to obtain sample sizes with all identified confidence interval methods under the three sample size approaches. In this numerical procedure, we derive formulas to approximate the assurance probability function and the power function (except for the Searle method for which these formulas were already derived 21 ). Additionally, we provide guidelines for end users. We further provide an user-friendly and interactive R Shiny application to obtain sample sizes with all the methods discussed in this article on https://github.com/DiproMondal/sample-size-ICCGithub and the https://dipro.shinyapps.io/sample-size-icc/Shiny server.

The article is organized as follows. Section 2 introduces the methods to estimate ICC, its variance, and confidence interval. Section 3 introduces the simulation setup that is used to evaluate the statistical properties of the confidence interval methods. Section 4 describes different approaches for sample size calculation when the number of raters, k , is fixed. We further propose a general procedure to obtain minimum sample sizes under any approach. Section 5 presents a case study. Finally, Section 6 concludes the article with a summary and a discussion of the results obtained in this article.

2. Definition

Consider the scenario in which each participant is measured on a quantitative scale by a different set of raters randomly drawn from a population of raters, 4 or is measured repeatedly by a measuring device several times under identical conditions. Further assume that the number of raters/repeated measurements per participant is the same, which is a common assumption when planning a reliability study. Let Yij represent the measurement of participant i (i=1,2,,n) by rater j (j=1,2,,k) . This outcome can be described by a one-way ANOVA model, which can be written as

Yij=μ+si+ϵij (1)

where μ is the grand mean, si is the effect of participant i , and ϵij is the measurement error for participant i measured by rater j . The total number of observations is denoted by N ( N=kn ). The assumptions of this ANOVA model are that the participant effects si are identically and normally distributed with mean 0 and variance σs2 , the measurement errors ϵij are identically and normally distributed with mean 0 and variance σϵ2 , and the errors and participant effects are independent. Table 1 shows the variance components of this one-way ANOVA model. The mean squares in Table 1 are BMS=kn1i=1n(Y¯i.Y¯..)2 and WMS=1Nni=1nj=1k(YijY¯i.)2 , where Y¯i.=1kj=1kYij and Y¯..=1Ni=1nj=1kYij . Using this variance decomposition, the ICC is defined as

ρ=σs2σs2+σϵ2,0ρ1 (2)

Note that the value of ρ becomes closer to 1 as the measurement error variance becomes smaller ( σϵ2<<σs2 ) and ρ becomes closer to 0 as it increases ( σϵ2>>σs2 ).

Table 1.

Variance decomposition as for the one-way ANOVA model described by equation (1).

Source of Degrees of Mean Expected
variation freedom squares mean squares
Between participants n1 BMS σϵ2+kσs2
Within participants Nn WMS σϵ2

ANOVA: analysis of variance; BMS: between mean squares; WMS: within mean squares.

2.1. Estimation of ICC

ICC is usually estimated using the ANOVA 4 or the maximum-likelihood estimator. The ANOVA estimator is given by

ρ^ANOVA=BMSWMSBMS+(k1)WMS (3)

Since this estimator is negatively biased, 22 a maximum-likelihood estimator has been suggested 23 :

ρ^ML=BMS(n1)/nWMSBMS(n1)/n+(k1)WMS

Comparing the bias of the two estimators, Wang et al. 23 showed that the bias of ρ^ML is still quite large and decreases only slightly for large samples. For instance, to achieve a bias of ρ^ML not > 10 % , a total of 100 observations (e.g. 20 participants and five raters) are required when expecting ρ=0.5 (the value of ρ at which the bias is maximum). When one expects higher values of ρ , as in the context considered in this article, the bias of ρ^ANOVA is small and the two estimators lead to almost identical estimates. For this reason, the maximum-likelihood estimator is generally not used in the literature. Accordingly, we will only consider the ANOVA estimator in this article. Note that this estimator relies on the assumptions of the ANOVA model (equation (1)). A brief discussion of what happens when these assumptions are violated is given in Section 6.

2.2. Large sample variance of the ICC

Here we focus on the three approximated closed-form expressions of the variance of ρ^ available in the literature for large n . Swiger et al. 11 provided the large sample variance of ρ^ as,

var(ρ^)S=2(N1)(1ρ)2[1+(k1)ρ]2k2(Nn)(n1). (4)

Given that k(Nn)=nk(k1) , as N=kn , this leads to the variance obtained by Fisher 12 when N1k(n1)1 , which is a reasonable assumption for small k and n30 , 24

var(ρ^)F=2(1ρ)2[1+(k1)ρ]2nk(k1). (5)

Note that in equation (5), n is sometimes replaced by n1 . 25 Lastly, following Zerbe and Goldgar 13 and Kaart, 26 the variance can also be estimated by the ratio of two independent F-statistics as,

var(ρ^)Ze=2(1ρ)2[1+(k1)ρ]2(Nn)2(N3)k2(n1)(Nn2)2(Nn4). (6)

These three formulas are related by the following inequality, var(ρ^)Ze>var(ρ^)S>var(ρ^)F (see Appendix A for proof).

2.3. Confidence interval for the ICC

In the literature, there are four methods to compute the upper ( U ) and lower ( L ) bounds of the confidence interval for ρ , namely the Wald method, 6 the Searle method,9,15 and their normalized versions. Demetrashvili et al. 27 further suggested two generic methods not considered here because they are not accurate in the balanced one-way random effects model.

2.3.1. Wald confidence interval ( WaldS , WaldF , and WaldZe )

Based on the central limit theorem, the upper ( U ) and lower ( L ) bounds of the confidence interval for the ICC can be written as,6,14

U,L=ρ^±z1α/2var(ρ^), (7)

where z1α/2 is the (1α/2)×100 percentile of the standard normal distribution. Plugging equations (4) to (6) into (7) as the variance leads to confidence intervals, which we denote as WaldS , WaldF , and WaldZe , respectively.

The Wald method assumes that the sampling distribution of ρ^ is normally distributed. However, ρ is bounded between 0 and 1, implying a skewed sampling distribution of ρ^ when ρ is close to the boundaries. 28 Since typically ICC values close to one are of interest in a reliability study, Wald confidence intervals may thus have poor statistical properties in this context.

2.3.2. Searle method ( Fρ )

Under the assumption of normality of the ANOVA model, the ratio of the between-mean squares and within-mean squares (i.e. the F-statistic) is distributed as 1+(k1)ρ1ρFν1,ν2 , where Fν1,ν2 represents an F-distribution with ν1=n1 and ν2=n(k1) degrees of freedom. We represent this ratio as

F(ρ^)=BMSWMS=1+(k1)ρ^1ρ^. (8)

Then, the upper and lower bounds of the confidence interval for ρ are given by Searle 14 as,

U,L=F(ρ^)/Fl1F(ρ^)/Fl+k1,F(ρ^)/Fu1F(ρ^)/Fu+k1. (9)

where Fl and Fu are the α/2×100 and the (1α/2)×100 percentile of an F-distribution with n1 and n(k1) degrees of freedom, respectively. We denote this method as Fρ .

Rather than making a normality assumption on ρ^ , this method makes an assumption of normality on the outcome Yij . Hence this method has been referred to as being an exact procedure by several authors.6,17

2.3.3. Normalized ICC method ( ZS,ZF,andZZe )

The Fisher transformation can be applied to the ICC so that the transformed ICC approximately follows a normal distribution. Applying this transformation to ρ^ leads to

Z(ρ^)=12ln1+ρ^1ρ^˙N(E(Z(ρ^)),var(Z(ρ^))), (10)

where E(Z(ρ^))=12ln1+ρ1ρ , and the variance, var(Z(ρ^)) can be derived applying the Delta method 16 to one of the variances defined in equations (4) to (6) leading, respectively, to

var(Z(ρ^))S=2(N1)[1+(k1)ρ]2k2(1+ρ)2(Nn)(n1), (11)
var(Z(ρ^))F=2[1+(k1)ρ]2N(1+ρ)2(k1), (12)
var(Z(ρ^))Ze=2[1+(k1)ρ]2(Nn)2(N3)k2(n1)(Nn2)2(Nn4)(1+ρ)2. (13)

Since Z(ρ^) is approximately normally distributed and defined on the real line, we can then compute the Wald confidence interval for this transformation as

UZ,LZ=Z(ρ^)±z1α/2var(Z(ρ^)).

Finally, the confidence interval for ρ is obtained by back-transformation leading to

U,L=exp(2UZ)1exp(2UZ)+1,exp(2LZ)1exp(2LZ)+1. (14)

We refer to the confidence intervals obtained by these methods as ZS , ZF , and ZZe , respectively.

2.3.4. Normalized Searle method ( ZFρ )

The F-statistic ( F(ρ^) ) can also be normalized by a log-transformation to obtain confidence limits.6,9,14 Normalizing F(ρ^) starting from equation (8), we obtain

Z(F(ρ^))=12ln1+(k1)ρ^1ρ^N(E(Z(F(ρ^))),var(Z(F(ρ^)))), (15)

where E(Z(F(ρ^)))=12ln1+(k1)ρ1ρ and var(Z(F(ρ^)))=12(1n1+1n(k1)) . The confidence interval on this log transformed scale Z(F(ρ^)) is then

UZF,LZF=Z(F(ρ^))±z1α/2var(Z(F(ρ^))).

Note that the expression for var(Z(F(ρ^))) provided in equation (3) of Zou 6 is not correct, so we use var(Z(F(ρ^))) as specified above. The confidence limits for ρ can be obtained directly by back-transforming as

U,L=exp(2UZF)1exp(2UZF)+k1,exp(2LZF)1exp(2LZF)+k1. (16)

We denote this method as ZFρ . Note that for k=2 , the confidence intervals based on the transformed F-statistic and the normalized ICC with the Swiger variance (following equations (.1) and (14)), are the same.

3. Simulation comparison of the confidence interval methods

We set up a Monte Carlo simulation to evaluate the statistical properties of the eight confidence interval methods described in Section 2.3. Based on the ANOVA model defined in equation (1), n participant effects ( si ) are drawn from a standard normal distribution. Then, N (= nk ) errors ( ϵij ) are drawn from a normal distribution, with zero mean and variance determined by the relation in equation (2) for a given value of ρ . This process is replicated 25,000 times. For each replication, the confidence interval using the eight methods described in Section 2.3 is obtained. We study the properties of the methods for values of k varying from 2 to 10 (in steps of 1 ), n from 20 to 100 (in steps of 10 ), and ρ from 0.1 to 0.9 (in steps of 0.1 ).

The methods are compared based on the coverage probability and average confidence interval width in each scenario. The coverage probability is defined as the proportion of times the true value of ρ is covered by the confidence intervals across the 25,000 replications. We define coverage probability as acceptable if it falls within the range 1α±z0.975(1α)αηsim where 1α is the nominal coverage and ηsim(=25,000) is the number of simulations. This is the range of proportions from the simulation, where one expects these proportions to lie in 95% of the cases, if the nominal coverage is the true coverage probability. Specifically, for a nominal coverage of 95%, the coverage probabilities from the simulation are expected to lie between 0.947 and 0.953. The average width of a confidence interval is defined as the average difference between the upper and lower limits of a confidence interval over the 25,000 replications. Since a shorter width of the confidence interval is desirable, methods with a smaller average width of the confidence interval are considered to be better.

Table 2 summarizes the results for ρ0.7 , while complete results can be found in Supplemental Material 1. Table 2 shows that for k=2 and ρ0.7 , WaldZe , F , ZS (equivalent to ZFρ ), and ZFρ provide acceptable coverage for all values of n , while ZF provides acceptable coverage only for n40 . WaldS , WaldF , and ZZe do not provide acceptable coverage (based on sample sizes explored in Table 2, i.e., n100 ).

Table 2.

Summary of the methods which show acceptable coverage for the 95% confidence interval, that is, between 0.947 and 0.953, for the ICC, ρ0.7 and different number of raters, k , and participants, n . In each row, the method providing the average minimum width of the confidence interval is marked in bold. For n60 , the differences in average width are <0.01 , therefore, none of the methods have been marked bold in those cases.

k ρ n WaldS WaldF WaldZe F ZS ZF ZZe ZFρ
2 0.7–0.9 20
40
>2 0.7–0.8 20
40
80
90
0.9 20
40
50
60
80

ICC: intraclass correlation coefficient.

For k>2 , F still provides acceptable coverage under all scenarios while ZFρ and WaldZe only for n40 . The coverage of ZS and ZF deteriorates first when increasing k from 2 to 3 , and then improves on increasing k further. These confidence interval methods provide acceptable coverage when ρ0.7 for n90 . ZZe on the other hand provides acceptable coverage when ρ0.7 for n80 . WaldS provides acceptable coverage when ρ0.7 for n80 , while WaldF provides acceptable coverage only when ρ0.9 for n90 . The effect of increasing k is not monotonic for some of the confidence interval methods. However, increasing k above 5 does not seem to improve notably the coverage of the methods (see blueSupplemental Material 1).

The confidence interval methods providing the smallest average width most frequently, under the different scenarios, are marked in bold. The difference in average width between the different confidence interval methods decreases from 0.1 to <0.01 as n increases from 20 to 60 . It must be noted here that though WaldZe provides better coverage compared to WaldF , it has the largest width among the confidence interval methods.

In summary for ρ0.7 , WaldZe , ZFρ , and F provide acceptable coverage in almost all scenarios.

4. Sample size determination

Sample size determination when the number of raters, k , is fixed, is reviewed for three approaches, namely, the width of confidence interval approach, the assurance probability approach, and the testing approach. These sample size approaches require a planning value, ρ , and yield valid results when the initial guess for ρ is accurate. The eight confidence interval methods reviewed in Section 2.3 can be used with each of the three approaches. However, a closed-form formula for sample size determination is not always available, which necessitates numerical evaluation procedures to determine sample sizes.

4.1. Width of confidence interval approach

The approach consists in finding the minimum number of participants for a given value of the expected width, ω , of the confidence interval around a planned value of ρ and for a given number of raters k . Bonett 7 derived an analytical formula based on the Wald confidence interval and the Fisher variance ( WaldF ). We generalize this approach by considering all large sample variance formulas reviewed in Section 2.2.

The expected width of the Wald confidence interval is given by ω=2z1α/2var(ρ^) , where 1α is the confidence level and the variance can be estimated using equations (4) to (6). Using the Swiger variance (equation (4)), under the approximation NN1 and taking the positive root, the minimum number of participants is given by (see Appendix B.1 for the derivation)

n=1+8Ak,ρ2z1α/22k(k1)ω2, (17)

where Ak,ρ=(1ρ)×[1+(k1)ρ] . Using the Fisher variance (equation (5)), the expression for the required minimum number of participants is the same as equation (17), but subtracting one participant. Bonett 7 used the Fisher variance with n1 in the denominator of equation (5) instead of n . As a result, the sample size derived by Bonnet is the same as equation (17).

Using the Zerbe variance (equation (6)), the minimum number of participants obtained under the assumption that N3Nk1 is:

n=[(Aω3+24Aω2+103Aω+16+63Aω4Aω2+71Aω+8)13+(Aω3+24Aω2+103Aω+1663Aω4Aω2+71Aω+8)13+Aω+8]×13(k1), (18)

where Aω=8z1α/22Ak,ρ2kω2 and Ak,ρ=(1ρ)×[1+(k1)ρ] (see Appendix B.2 for the derivation).

Giraudeau and Mary 29 provided an approximate formula for the width of the confidence interval obtained with the Searle method which coincides with the width obtained using the Wald confidence interval with the Fisher variance. Analytical formulas can hardly be obtained for the Searle and the normalization methods. Hence, we propose a general numerical procedure to determine the minimum sample size, n , which can be used with all confidence interval methods. Specifically, this numerical evaluation method consists of finding the expected width of the confidence interval for the specified values of ρ and k . This is done for every n , starting from n=4 and increasing n by one unit at a time. The minimum sample size is the smallest value of n for which the expected width of confidence interval is smaller or equal to ω . Bonett 7 and Shieh 19 used a similar numerical approach to obtain sample sizes for F .

Table 3 shows the minimal sample sizes obtained by using the numerical evaluation for ω {0.1,0.2}, ρ {0.7,0.8,0.9}, and k {2,3,6}. The values within parentheses indicate sample sizes obtained using equation (17), equation (17) with a subtraction of one participant and equation (18) for WaldS , WaldF , and WaldZe , respectively. It can be observed that the sample sizes obtained with the different confidence interval methods are rather close. Sample sizes providing acceptable coverage (the calculation of the acceptable range is given in Section 3) for different combinations of ω , ρ , and k , are marked in bold. Table 3 indicates that the confidence interval methods WaldZe , F , and ZFρ provide sample sizes with acceptable coverage in most cases. Note that the numerical approach of Bonnett 7 and Shieh 19 leads to sample sizes very close to the values we obtain (data not shown).

Table 3.

The minimum number of participants, n , required to achieve an expected width, ω , of the 95% confidence interval, given ρ and the number of raters, k , according to the numerical evaluation method. Sample sizes that provide coverage within an acceptable range (based on 25,000 simulations, i.e. between 0.947 and 0.953) are marked in bold. The values in parentheses indicate sample sizes obtained with analytical formulas given in equations (17) for WaldS , (17) with a subtraction of one participant for WaldF , and (18) for WaldZe , respectively.

ω ρ k WaldS WaldF WaldZe F ZS ZF ZZe ZFρ
0.1 0.7 2 401 (401) 400 (400) 408 (408) 403 402 401 409 402
3 267 (267) 266 (266) 270 (270) 267 267 267 271 267
6 188 (188) 187 (187) 189 (188) 187 189 188 190 188
0.8 2 200 (200) 200 (199) 207 (207) 204 202 202 209 202
3 140 (139) 139 (138) 143 (142) 140 141 141 145 141
6 104 (103) 103 (102) 105 (104) 103 105 104 106 104
0.9 2 56 (56) 56 (55) 63 (63) 61 60 59 67 60
3 41 (41) 41 (40) 45 (44) 43 44 43 47 44
6 32 (32) 31 (31) 34 (33) 32 34 33 35 34
0.2 0.7 2 101 (101) 100 (100) 108 (108) 103 102 102 109 102
3 68 (67) 67 (66) 71 (70) 67 68 68 72 68
6 48 (48) 47 (47) 49 (48) 47 49 48 50 48
0.8 2 51 (51) 50 (50) 57 (57) 54 53 53 60 53
3 36 (36) 35 (35) 39 (38) 36 37 37 41 37
6 27 (27) 26 (26) 28 (27) 26 28 27 29 27
0.9 2 15 (15) 14 (14) 21 (21) 19 18 17 24 18
3 11 (11) 11 (10) 14 (14) 13 13 13 16 13
6 9 (9) 8 (8) 10 (9) 9 11 10 12 10

4.2. Assurance probability approach

The assurance probability approach based on the width of the confidence interval for ρ , 19 consists of finding the minimum number of participants n such that

P(Wω)1γ, (19)

where P(Wω) is the probability that the width W , is less than or equal to a constant, ω , and 1γ is the assurance probability. The assurance probability approach based on the width of the confidence interval was introduced by Zou, 6 who pointed out that the width of confidence interval approach seen in the previous subsection is a special case, which corresponds to setting the assurance probability to 0.5 . Zou 6 also introduced an assurance probability approach based on the lower limit of a confidence interval, see Section 4.3.

Zou 6 derived an analytical formula based on the Wald confidence interval and the Fisher variance ( WaldF ). Shieh 19 later extended the approach numerically to the Searle method ( F ). In this article, we numerically generalize the assurance probability approach by considering all the confidence interval methods mentioned in Section 2.3.

Using the Wald confidence interval and the Fisher variance (equation (5)), Zou 6 obtained the minimum number of participants as

n=1k(k1)ω2[4Ak,ρz1α/2(Ak,ρz1α/2+Bk,ρz1γω)+16Ak,ρ3z1α/23(Ak,ρz1α/2+2Bk,ρz1γω)], (20)

where Ak,ρ=(1ρ)×[1+(k1)ρ] and Bk,ρ=2(k1)ρk+2 . Zou 6 used n1 in equation (5) and derived the formula considering the half-width of the confidence interval. As a result, the formula in Zou has different coefficients than equation (20). Using the Swiger variance (equation (4)), we derived the sample size under the approximation that NN1 (taking the positive root). This leads to equation (20) with the addition of one participant.

The analytical forms of the other confidence interval methods (including WaldZe ) are too complex. Therefore, we propose a generalization of the numerical approach explained in Section 4.1, which uses assurance probability functions to find the minimum n satisfying a pre-defined value of the assurance probability ( 1γ ). This numerical procedure works with all the confidence interval methods mentioned in Section 2.3. The derivation of the assurance probability functions are given in Appendix C.

Table 4 shows the sample sizes obtained by the numerical procedure using assurance probability functions. The analytical counterparts are shown between parentheses (when available). It can be observed that the minimum sample sizes obtained analytically are close to the values obtained via the numerical procedure. Further, our method gives sample sizes close to the ones obtained by Shieh, who also used a numerical method for F (Tables 8 and 9 of Shieh 19 ). It can be further observed that the sample sizes obtained by the different confidence interval methods are rather close. Sample sizes providing acceptable assurance probability under different combinations of ω , ρ , and k , for 1γ=0.9 are marked in bold. The lower limit of the acceptable range of assurance probabilities is calculated in the same way as in Section 3 where 1α is replaced by 1γ . Confidence interval methods F , ZS , ZF , and ZZe provide sample sizes with acceptable assurance probability in most cases while ZFρ for k=2 only.

Table 4.

Minimum number of participants, n , required to achieve an expected width, ω , of the 95% confidence interval, given ρ , the number of raters, k , and the assurance probability 1γ=0.9 , according to the numerical procedure using assurance probability functions. Sample sizes that provide acceptable empirical assurance probability (i.e. above 0.896 for 25,000 simulations) are marked in bold. The values in parentheses indicate sample sizes obtained from the analytical formulas given in equation (20) for WaldF and equation (20) with an addition of 1 participant for WaldS , respectively.

ω ρ k WaldS WaldF WaldZe F ZS ZF ZZe ZFρ
0.1 0.7 2 470 (470) 469 (469) 477 473 472 471 479 473
3 309 (308) 308 (307) 312 309 314 313 317 308
6 214 (214) 213 (213) 216 214 221 220 222 213
0.8 2 255 (255) 254 (254) 262 260 259 258 266 260
3 176 (176) 175 (175) 179 178 180 180 184 177
6 129 (129) 128 (128) 130 130 134 133 135 129
0.9 2 87 (87) 87 (86) 94 94 93 93 100 94
3 63 (63) 63 (62) 67 67 68 67 71 66
6 49 (49) 48 (48) 50 52 53 52 54 50
0.2 0.7 2 134 (134) 134 (133) 141 137 136 135 143 137
3 88 (88) 87 (87) 91 88 91 90 94 87
6 61 (61) 60 (60) 62 60 64 64 66 59
0.8 2 77 (77) 76 (76) 84 81 80 80 87 81
3 53 (53) 53 (52) 56 55 56 56 60 54
6 39 (39) 38 (38) 40 40 42 41 43 39
0.9 2 29 (29) 29 (28) 36 35 34 34 41 35
3 22 (22) 21 (21) 25 25 25 24 28 24
6 17 (17) 16 (16) 18 19 20 19 21 18

4.3. Testing approach

The testing approach consists of finding the minimum number of participants when one is interested in achieving a pre-specified power (1β) when testing the null hypothesis that ρ is less than or equal to a constant, ρ0 , that is, ρρ0 , against the alternative that ρ is greater than ρ0 , that is, ρ>ρ0 . Denoting ρ=ρA under the alternative hypothesis, the power of this test can be defined as the probability that the null hypothesis is rejected when the alternative hypothesis is true ( ρ=ρA ). In our case, this is the probability that the lower limit L of the confidence interval for ρ is greater than ρ0 when the alternative hypothesis is true. The mathematical form of the criterion under this approach can be written as, 6

P(Lρ0|ρ=ρA)1β, (21)

where P(Lρ0|ρ=ρA) is the probability that the lower limit of the confidence interval for ρ , L , is greater than the pre-specified value, ρ0 , under the alternative hypothesis that ρ=ρA (1>ρA>ρ0>0) . Donner and Eliasziw, 10 Walter et al., 9 and Zou 6 derived an analytical formula for the minimum number of participants, n , based on the transformation of the F-statistic ( ZFρ ) when minimizing the criterion specified in equation (.1). Specifically,

n=1+2(z1β+z1α)2k[ln(F(ρA)/F(ρ0))]2(k1). (22)

Shieh 21 used a numerical evaluation procedure to obtain sample sizes for the Searle method. Zou 6 obtained equation (22) by introducing an assurance probability based on a pre-specified lower limit of an asymmetrical interval procedure, which is equivalent to the testing approach.

We derived power functions following equation (.1) for all the confidence interval methods (see Appendix D). These power functions were then used to obtain sample sizes for the testing approach using the numerical procedure mentioned in Section 4.2. The numerical procedure uses the power functions to find the minimum n satisfying a pre-defined power ( 1β ).

Table 5 shows the sample sizes obtained by the numerical procedure using the power functions and numerical evaluation. The values within parentheses indicate sample sizes obtained by the analytical formulas for the method ZFρ which correspond exactly to the ones obtained by our numerical procedure. The values obtained for the method F using our numerical procedure are exactly one unit greater than the values obtained by the numerical method of Shieh. 21 Furthermore, unlike the previous approaches, the sample sizes obtained via the Wald confidence interval methods tend to require smaller sample sizes than other confidence interval methods. The actual power of the hypothesis test was also calculated at the obtained sample sizes. Sample sizes providing acceptable power for different combinations of 1β , ρ0 , ρA , and k are marked in bold. The lower limit of the acceptable range of power is calculated in the same way as in Section 3, where 1α is replaced by 1β . The confidence interval methods F and ZZe provide the sample sizes with acceptable power in most cases while ZS , ZF , and ZFρ provide sample sizes with acceptable power for k=2 only. The Wald methods always have power below acceptable value (i.e. < 0.795 when 1β=0.8 , and 0.896 when 1β=0.9 ). For example, the actual power for the Wald methods can go as low as 0.729 which is the case for 1β=0.8 , ρ0=0.7 , ρA=0.8 , and k=2 .

Table 5.

Minimum number of participants, n , for a given value of ρ considering the null ( ρ0 ) and alternative hypothesis ( ρA ) for a specified number of raters, k , and power of the test 1β according to the numerical procedure using power functions. Sample sizes that provide acceptable empirical power (i.e. above 0.896 when 1β=0.9 and above 0.795 when 1β=0.8 for 25,000 simulations) are marked in bold. The values in parentheses indicate the sample sizes obtained from the analytical formula given in equation (22) for ZFρ .

1β ρ0 ρA k WaldS WaldF WaldZe F ZS ZF ZZe ZFρ
0.9 0.7 0.8 2 112 111 119 162 161 161 168 162 (162)
3 78 78 82 110 112 112 116 110 (110)
6 58 58 60 80 84 83 85 80 (80)
0.8 0.9 2 32 31 38 63 62 62 69 63 (63)
3 24 23 27 45 46 45 49 45 (45)
6 19 18 20 34 36 35 37 35 (35)
0.8 0.7 0.8 2 81 81 88 117 117 116 123 117 (117)
3 57 56 60 79 82 81 85 80 (80)
6 43 42 44 57 61 60 62 58 (58)
0.8 0.9 2 23 23 30 46 45 45 52 46 (46)
3 17 17 20 32 33 33 36 33 (33)
6 14 13 15 25 26 25 27 25 (25)

4.4. Software for sample size calculation

Currently, only the method WaldF is available in common software (see Appendix E) for the width of the confidence interval and assurance probability approaches, while only ZFρ is available for the testing approach. Therefore, a Shiny app containing all the approaches to determine minimum required sample sizes has been developed 30 and made available on https://github.com/DiproMondal/sample-size-ICCGithub and the https://dipro.shinyapps.io/sample-size-icc/Shiny server.

5. Empirical illustration

5.1. Reliability of systolic blood pressure measurements

In this section, we illustrate how the confidence interval methods described in Section 2.3 and the approaches for sample size determination described in Section 4 are used in the context of a reliability study. In the study of Bland and Altman, 31 three repeated systolic blood pressure measurements ( k=3 ) were made on 85 participants ( n=85 ) by two experienced observers raters J and R and a semi-automatic blood pressure monitor. For the purpose of our illustration, we use the measurements made by rater J only, which can be modeled by a one-way ANOVA.

The ANOVA model assumes that the outcome measurements are normally distributed and the variance across repetitions is homogeneous across participants. Exploratory data analysis revealed that the excess kurtosis for the repetitions was mild while the degree of asymmetry of the repetitions indicated moderate skewness. Furthermore, the data also present mild heteroscedasticity on the repeated measurements. Following equation (3), we obtain ρ^=0.962 . The confidence intervals obtained using the eight confidence interval methods are shown in Table 6, which have rather similar bounds.

Table 6.

Lower and upper limits of the 95% confidence intervals for ρ for the systolic blood pressure measurements.

WaldS WaldF WaldZe F ZS ZF ZZe ZFρ
Lower limit 0.948 0.948 0.947 0.945 0.945 0.945 0.945 0.945
Upper limit 0.975 0.975 0.976 0.974 0.973 0.973 0.973 0.973

5.2. Planning a reliability study

A researcher may be interested in planning a study to measure blood pressure aiming at a reliability of ρ=0.9 . The sample size approaches described in previous sections can be used to find the number of participants required for such a study.

Figure 1 shows, for each k in the interval [2,30] (x-axis), the minimum required n (y-axis) using (top-down) the width of confidence interval approach described in Section 4.1, the assurance probability approach described in Section 4.2, and the testing approach described in Section 4.3 for the confidence interval methods F and ZFρ . For example, suppose the study only allows for three repeated measurements per participant. Then using the width of confidence interval approach, the assurance probability approach, and the testing approach the researcher would require, respectively, 43, 67, and 133 (considering the Searle confidence interval method) participants for the criteria given in Figure 1. The effect of increasing the number of measurements per participant to four is a decrease in the number of participants to 37, 59, and 115 participants (considering the Searle confidence interval method), respectively, for the width of confidence interval approach, the assurance probability approach, and the testing approach. The gain in having a smaller number of participants for the study decreases as the number of measurements per participant increases.

Figure 1.

Figure 1.

Combinations for n and k for the confidence interval methods F and ZFρ satisfying the criteria specified for the three different sample size approaches. The criteria for the sample size approaches are mentioned in the sub-figures where, for the width of confidence interval approach, ω is the expected width of the confidence interval around a given value ρ ; for the assurance probability approach, the notations are the same as the width of confidence interval approach with the addition of 1γ denoting the assurance probability; for the testing approach 1β is the power of the hypothesis test, ρ0 and ρA are the values under the null and alternative hypotheses. (a) Width of confidence interval approach; (b) assurance probability approach; and (c) testing approach.

If, instead, there is flexibility in choosing the number of repetitions per participant, the researcher can consider a cost-constraint approach to find the optimal combination of the number of participants ( n ) and number of repeated measures per participant ( k ). Then, the optimal combination of (k,n) is obtained by finding the value of k and n for which the total cost, T, is minimum. A plausible cost function is

T=nc1+nkc2, (23)

where T is the total cost, c1 is the cost of recruiting a participant, and c2 is the cost of making one observation.

Table 7 shows the optimal combinations of (k,n) obtained by minimizing the total cost, T (equation (23)), for different combinations of c1 and c2 . It can be observed from the table that as c1 increases relative to c2 , more repetitions per participant are required with a smaller number of participants to achieve the same criterion value.

Table 7.

Optimal combination of the number of repetitions and participants, ( k,n ) for the sample size approaches with the confidence interval methods, F and ZFρ , for different costs of recruiting a participant, c1 and making an observation, c2 .

Sample size approach c1 c2 F ZFρ
Width of confidence interval approach for ω=0.1 and ρ=0.9 1 5 (261) (260)
1 1 (343) (344)
5 1 (437) (438)
Assurance probability approach for 1γ=0.9 , ω=0.1 , and ρ=0.9 1 5 (294) (295)
1 1 (367) (367)
5 1 (459) (458)
Testing approach for 1β=0.9 , ρ0=0.85 , and ρA=0.9 1 5 (2185) (2185)
1 1 (3133) (3133)
5 1 (4115) (4116)

6. Discussion

Sample size determination is a crucial aspect of the planning stage of a reliability study. Usually, the number of raters, k , is fixed due to budget or time constraints in the study, and the sample size of participants, n , needs to be determined. This article gives a complete overview of the different approaches available in that case. Analytical closed-form solutions for sample size determination only exist in a few cases. Therefore, we proposed a general procedure that entails deriving an assurance probability or power function (depending on the approach) and finding optimal n via a simple search procedure.

Before inspecting the different approaches for sample size determination, we looked at the statistical properties of the different confidence interval methods. We have shown that the confidence interval based on the Searle method ( F ) provides acceptable coverage in almost all scenarios for n20 , and, WaldZe and ZFρ for n40 . This can be explained by the fact that F is an exact method and ZFρ is based on a normalizing transformation of F . WaldZe is the Wald method based on the Zerbe variance which was also derived as a ratio of F-statistics. It must be noted however, that WaldZe does not provide acceptable coverage for small ρ (when ρ<0.5 , see blueSupplemental Material 1). The other methods, based on some approximations, only provide acceptable coverage in few scenarios. It is worthwhile to note that the Wald confidence interval using the Fisher variance, WaldF , widely used in the literature shows acceptable coverage only for large sample sizes, n80 , when ρ0.9 and k>2 . Note that the Zerbe variance provides better statistical properties than the Fisher variance when ρ0.7 , but the width of the confidence interval is larger.

Sample sizes were determined using three different approaches which rely on the limits of a confidence interval for ρ . Sample sizes in the case of the width of confidence interval were obtained via a numerical evaluation. We derived the assurance probability and power functions for assurance probability and testing approaches, respectively, to determine sample sizes. These functions, when combined with the numerical evaluation, enabled us to determine sample sizes for all the methods discussed. Sample sizes obtained through this procedure and the corresponding available analytical formulas led to similar sample sizes. Furthermore, sample sizes obtained with different confidence interval methods in the width of confidence interval approach and the assurance probability approach were similar. However, this was not the case in the testing approach where smaller sample sizes were obtained using the Wald confidence interval to achieve a required power level compared to the other confidence interval methods. This is probably because the Wald confidence interval method assumes a symmetric distribution for the estimate of ρ , which is not a realistic assumption when ρ is large (e.g. 0.8, 0.9). 28 In all the approaches, the Searle method ( F ) provided sample sizes with good statistical properties as well as ZFρ when k=2 . We, therefore, advise the use of these methods to make statistical inference on the ICC in the one-way ANOVA setting.

We have shown that the choice of the approach to determine sample size or even the choice of the confidence interval method, has an impact on the resulting sample size. We, therefore advise researchers to carefully consider requirements for their studies as a guide to choose the appropriate sample size approach. For the three different approaches discussed in this article, the Searle confidence interval method demonstrated good statistical properties, making it our recommended choice. Furthermore, in order to determine sample sizes, we have developed an R Shiny app which we believe will prove valuable to researchers in need of a simple and efficient interface for obtaining sample sizes.

Our study is not without limitations. First, the confidence interval methods investigated in this paper, except the Searle confidence interval method (which is exact), rely on large sample approximations. Therefore, practitioners should exercise caution when calculations lead to a small minimal sample size because a good statistical behavior is not guaranteed. Note that the minimal sample sizes obtained with the different approaches rarely go below 20 in realistic scenarios (see Tables 3 to 5). Second, the estimator of ρ and its confidence interval rely on the assumptions of normality and homoscedasticity in line with the one-way ANOVA model (equation (1)). Violations of these conditions impact the statistical properties of the confidence intervals. The effect of non-normality on the Type-I error rate of the F-statistic was studied by various authors.3237 However, simulation studies 38 showed that the effect of heteroscedasticity outweighs the effect of non-normality on the Type-I error rate of the F-statistic, even for a balanced design 39 as considered here. We, therefore, advise researchers to check for violations of the assumptions of the ANOVA model (equation (1)) before using the methods described in this article. Readers interested in non-parametric estimators of ICC, not requiring the normality assumption, are directed to the works of Rothery, 40 Shirahata, 41 Commenges and Jacqmin, 42 and Ukoumunne et al. 43 Note that, however, these papers do not develop a sample size procedure. Third, as previously mentioned, we consider an equal number of ratings per participant constituting a balanced design. Considering unbalanced designs will require specifying the degree of imbalance in advance, which is not an easy task. Furthermore, Donner 14 showed that with an unbalanced design, the F-statistic is not exact and this, in turn, affects the statistical properties of the ICC and its confidence interval. Fourth, we focused on reliability in the context of a one-way ANOVA model. Whether the numerical procedure we developed can be extended to multi-way ANOVA models, will require further investigation, as methods to construct confidence intervals are different in that case.44,45

Supplemental Material

sj-pdf-1-smm-10.1177_09622802231224657 - Supplemental material for Review of sample size determination methods for the intraclass correlation coefficient in the one-way analysis of variance model

Supplemental material, sj-pdf-1-smm-10.1177_09622802231224657 for Review of sample size determination methods for the intraclass correlation coefficient in the one-way analysis of variance model by Dipro Mondal, Sophie Vanbelle, Alberto Cassese and Math JJM Candel in Statistical Methods in Medical Research

Appendix A. Ratio of variance estimators

We will show that,

var(ρ^)Ze>var(ρ^)S>var(ρ^)F,

and thus

Ω(ρ^)Ze>Ω(ρ^)S>Ω(ρ^)F

where Ω(ρ^)m is the average width of the confidence interval based on variance approximation m (m=S,F,Ze) .

Note that the variance estimators (equations (4) to (6)) can be written in the form,

var(ρ^)m=(f(ρ)f(n,k)m)2=(f(ρ))2f(n,k)m

where (f(ρ))2=2(1ρ)2[1+(k1)ρ]2 and f(n,k)m depends on the form of the variance approximation m . For the Swiger variance, f(n,k)S=k2(Nn)(n1)(N1) , for the Fisher variance, f(n,k)F=nk(k1) , and for the Zerbe variance, f(n,k)Ze=k2(n1)(Nn2)2(Nn4)(Nn)2(N3) .

  1. var(ρ^)Svar(ρ^)F = N1Nk>1 .

  2. var(ρ^)Zevar(ρ^)S=(Nn)3(N3)(Nn2)2(Nn4)(N1)>1 .

The proof of 2 is as follows:

We can re-write the ratio of the Zerbe and the Swiger variances into two parts as,

var(ρ^)Zevar(ρ^)S=(Nn)2(Nn2)2×(Nn)(N3)(Nn4)(N1)

We have (Nn)2(Nn2)2>1 . For the second part, we have,

(Nn)(N3)(Nn4)(N1)=2(N+n)4>0(for n2andN4),

implying that

(Nn)(N3)(Nn4)(N1)>1.

Appendix B. Sample size determination in the width of confidence interval approach

The sample size formulas for the width of confidence interval approach are derived here. Using the Wald confidence interval from Section 2.3.1, the minimum value of n is obtained when:

2z1α/2var(ρ^)ω, i.e.var(ρ^)ω24z1α/22,

where ω is the expected width of the confidence interval, z1α/2 is the (1α/2)×100 percentile of the standard normal distribution and var(ρ^) can be determined by equations (4) to (6).

B.1 Width of confidence interval approach with the Wald confidence interval and the Swiger variance

Using the Swiger variance formula (equation (4)), we have:

2(nk1)(1ρ)2[1+(k1)ρ]2k2n(k1)(n1)=ω24z1α/22,n(n1)nk1=8Ak,ρ2z1α/22k2(k1)ω2where Ak,ρ=(1ρ)(1+(k1)ρ),n2n[1+8Ak,ρ2z1α/22k(k1)ω2]+8Ak,ρ2z1α/22k2(k1)ω2=0.

Solving the last equation yields the following solution:

n=k(k1)ω2+8Ak,ρ2z1α/22+[k(k1)ω2+8Ak,ρ2z1α/22]232Ak,ρ2(k1)ω2z1α/222k(k1)ω2,

which leads to the approximation,

nk(k1)ω2+8Ak,ρ2z1α/22+[k(k1)ω2+8Ak,ρ2z1α/22]22k(k1)ω2=n*.

The other solution (with a negative sign in front of the square root) leads to n approximately equal to 0. The excess part 32Ak,ρ2(k1)ω2z1α/22 creates a difference <1 between n and n* . (By numerical evaluation for k{220} , ρ{0.10.9} , ω{0.10.3} and α{0.05,0.01} we observe a maximum difference of 0.5 between n and n* .)

n1+8Ak,ρ2z1α/22k(k1)ω2.

B.2 Width of confidence interval approach with the Wald confidence interval and the Zerbe variance

In a similar way, we obtain with the Zerbe variance (equation (6))

2(1ρ)2[1+(k1)ρ]2(nkn)2(nk3)k2(n1)(nkn2)2(nkn4)=ω24z1α/2,8Ak,ρ2z1α/22(k1)2(nk3)ω2k2(n1)=(nkn2)2(nkn4)n2.

Assuming nk3nkk1 , leads to the following result

8Ak,ρ2z1α/22ω2k=(nkn2)2(nkn4)(nkn)2,(nkn)3(8+8Ak,ρ2z1α/22ω2k)(nkn)2+20(nkn)16=0.

Solving the last equation gives only one root in the real domain:

n=[(Aω3+24Aω2+103Aω+16+63Aω4Aω2+71Aω+8)13+(Aω3+24Aω2+103Aω+1663Aω4Aω2+71Aω+8)13+Aω+8]×13(k1).

where Aω=8z1α/22Ak,ρ2kω2 .

Appendix C. Assurance probability functions

The assurance probability functions for the assurance probability approach 6 are derived here. The criterion for this approach is given as,

P(Wω)1γ.

C.1 Wald confidence interval method

The expected width of the Wald confidence interval is given as 2z1α/2var(ρ^) . Then the criterion can be further specified as,

P(2z1α/2var^(ρ^)ω)1γ.

Rewriting var^(ρ^)=(f(ρ^)f(n,k))2 , with f(ρ^) and f(n,k) defined in Appendix A, we have,

P(2z1α/2f(ρ^)f(n,k)ω)1γ,P(f(ρ^)ω2z1α/2f(n,k))1γ,P(f(ρ)f(ρ^)var(f(ρ^))f(ρ)ω2z1α/2f(n,k)var(f(ρ^)))1γ.

Using the Delta method, we obtain, var(f(ρ^))=var(ρ)×|f(ρ)|2 . Then, the assurance probability function can be written as,

1Φz(f(ρ)ω2z1α/2f(n,k)f(ρ)|f(ρ)|f(n,k)).

where Φz(.) is the cumulative standard normal distribution.

C.2 Searle method

The width of the Searle confidence interval is given as,

F(ρ^)/Fl1F(ρ^)/Fl+k1F(ρ^)/Fu1F(ρ^)/Fu+k1=kF(ρ^)(FuFl)(F(ρ^)+(k1)Fu)(F(ρ^)+(k1)Fl).

Rewriting kF(ρ^)(FuFl)(F(ρ^)+(k1)Fu)(F(ρ^)+(k1)Fl)=fF(F(ρ^)) , we notice that fF(.) is a decreasing function of F(ρ^) when k=2 , otherwise concave. Assuming F(ρ^)=x , the point of extrema (i.e. fF(x)=0 ) is (k1)FuFl . Therefore, we can define the inverse function, fF1(x) when x>(k1)FuFl as,

fF+1(x)=12x[Fu*Fl*+(FuFl)(Fu*2FuFl*2Fl)],

and the inverse function when x(k1)FuFl as,

fF1(x)=12x[Fu*Fl*(FuFl)(Fu*2FuFl*2Fl)],

where Fu*=Fu(k(k1)x) and Fl*=Fl(k+(k1)x) . Following the criterion for the assurance probability approach,

P(kF(ρ^)(FuFl)(F(ρ^)+(k1)Fu)(F(ρ^)+(k1)Fl)ω)1γ,P(fF(F(ρ^))ω)1γ,P(F(ρ^)fF1(ω))+P(F(ρ^)>fF+1(ω))1γ.

Then, the assurance probability function can be written as,

1+ΦF(fF1(ω)τ)ΦF(fF+1(ω)τ),

where ΦF(.) is the cumulative F distribution with (n1) and n(k1) degrees of freedom, τ=1+(k1)ρ1ρ , and fF1(.) and fF+1(.) are defined above.

C.3 Normalized ICC method

The width of the normalized ICC confidence interval method is given as,

exp(2UZ)1exp(2UZ)+1exp(2LZ)1exp(2LZ)+1.

Denoting x=2Z(ρ^) and δ=2z1α/2var(Z(ρ^)) , assuming the variance of Z(ρ^) to be known, we can rewrite the width of the normalized ICC confidence interval method as,

exp(x)exp(δ)1exp(x)exp(δ)+1exp(x)exp(δ)1exp(x)exp(δ)+1,=2exp(xδ)(exp(2δ)1)(exp(x)+exp(δ))(exp(x)+exp(δ)),=4exp(x)(exp(x)+exp(δ))(exp(x)+exp(δ))×exp(2δ)12exp(δ).

which after some algebraic manipulation and using Euler’s transformation of the hyperbolic function is equal to 2sinh(δ)cosh(δ)+cosh(x) . Then, following the criterion for the assurance probability approach,

P(2sinh(δ)cosh(δ)+cosh(x)ω)1γ,P(cosh(x)2sinh(δ)1ωcosh(δ)2sinh(δ))1γ,P(cosh(x)2sinh(δ)ωcosh(δ))1γ.

Denoting A=2sinh(δ)ωcosh(δ) , then,

cosh(x)=A,ex2A+ex=0,e2x2Aex+1=0.

which gives x=ln(A±A21) . Note that cosh(.) is a convex function ( 2x2cosh(x)=cosh(x)=ex+ex2>0 , since, ex>0 and ex>0 , xR ). Therefore,

P(cosh(x)A)={P(xln(AA21))+P(xln(A+A21))ifA11ifA<1,={P(Z(ρ^)12ln(AA21))+P(Z(ρ^)12ln(A+A21))ifA11ifA<1.

Therefore, the assurance probability can be approximated by,

{Φ(12ln(AA21)E(Z(ρ^))var(Z(ρ^)))+1Φ(12ln(A+A21)E(Z(ρ^))var(Z(ρ^)))ifA11ifA<1,

where E(Z(ρ^))=12ln(1+ρ1ρ) , and var(Z(ρ^)) is substituted from equations (11) to (13) for the three different variance approximations, respectively.

C.4 Normalized Searle method

In the normalized Searle method, the normalization is 12lnF(ρ^) . Following the steps of the derivation of the assurance probability function for the Searle method, we have,

P(F(ρ^)fF1(ω))+P(F(ρ^)>fF+1(ω))1γ.

We can take the logarithm to obtain,

P(12lnF(ρ^)12lnfF1(ω))+P(12lnF(ρ^)>12lnfF+1(ω))1γ.

Now, since

P(12lnF(ρ^)12lnfF1(ω))=P(12lnF(ρ)12lnF(ρ^)var(Z(F(ρ^)))12lnF(ρ)12lnfF1(ω)var(Z(F(ρ^)))),

and

P(12lnF(ρ^)>12lnfF+1(ω))=P(12lnF(ρ)12lnF(ρ^)var(Z(F(ρ^)))12lnF(ρ)12lnfF+1(ω)var(Z(F(ρ^)))),

the assurance probability function can be approximated by,

1+ΦZ(12lnF(ρ)12lnfF+1(ω)var(Z(F(ρ^))))ΦZ(12lnF(ρ)12lnfF1(ω)var(Z(F(ρ^)))),

where var(Z(F(ρ^)))=12(1n1+1n(k1)) .

Appendix D. Power functions

The power functions for the testing approach are derived here. The criterion for this approach is given as,

P(Lρ0|ρ=ρA)1β.

where L is the lower limit of the confidence interval for ρ .

D.1 Wald confidence interval method

The lower limit of the confidence interval is given as ρ^z1αvar(ρ^) . Then, assuming var(ρ^) to be known, the criterion can be elaborated as,

P(ρ^z1αvar(ρ^)ρ0|ρ=ρA)1β,P(ρ^ρ0+z1αvar(ρ^)|ρ=ρA)P(ρ^ρ0+z1αvar(ρ^|ρ=ρA))1β,

so that,

P(ρAρ^var(ρ^|ρ=ρA)ρAρ0var(ρ^|ρ=ρA)z1α|ρ=ρA)1β.

Then, the power function can be written as,

Φz(ρAρ0var(ρ^|ρ=ρA)z1α).

where Φz(.) is the cumulative standard normal distribution.

D.2 Searle method

Following Shieh, 21 the power function can be written as,

1ΦF(τ0τAFn1,n(k1),1α),

where ΦF(.) is the cumulative F-distribution with n1 and n(k1) degrees of freedom, and τi=1+kρi1ρi with i{0,A} representing values under the null ( 0 ) and alternative hypotheses ( A ).

D.3 Normalized ICC method

The lower limit of the confidence interval is given as exp(2LZ)1exp(2LZ)+1 . Then the criterion can be approximated by,

P(exp(2LZ)1exp(2LZ)+1ρ0|ρ=ρA)1β,P(LZ12ln1+ρ01ρ0|ρ=ρA)1β.

Following the same steps as for the Wald confidence interval method, we get,

1Φz(z1αμzA*μz0*var(Z(ρ^)|ρ=ρA)),

where μzi*=12lnτi* , and τi*=1+ρi1ρi with i{0,A} representing values under the null and alternative hypotheses.

D.4 Normalized Searle method

The lower limit of the confidence interval is given as exp(2LZF)1exp(2LZF)+k1 . Then the criterion can be approximated by,

P(exp(2LZF)1exp(2LZF)+k1ρ0|ρ=ρA)1β,P(LZF12ln1+(k1)ρ01ρ0|ρ=ρA)1β.

Then, following the same steps as we did for the Wald confidence interval method, we get,

1Φz(z1αμzA*μz0*var(Z(ρ^)|ρ=ρA)), (.1)

where μzi=12lnτi , with i{0,A} representing values under the null and alternative hypotheses.

Appendix E. Sample size formulas used in common statistical software

An overview of the methods implemented in common statistical software is provided in Table A.1.

Table 8.

A list of other relevant software commonly used to obtain sample size for ICC(1).

Source Sample size approaches Additional comments
SAS19,21 Testing ( ZFρ ) Shieh provided sample size calculation for assurance probability ( WaldF , F ) and testing ( F , ZFρ ) also including cost-constraints
PASS 20 Testing ( ZFρ )
R package MBESS 46 Assurance probability ( WaldF ) Testing ( ZFρ ) Allows sample size calculation under cost-constraints
R package presize 47 Assurance probability ( WaldF ) Testing ( ZFρ ) Allows sample size calculation with dropout rates. (webcalculator)
R package ICC.Sample.Size 48 Testing ( ZFρ ) Allows sample size calculation for the testing approach with two tails

ICC: intraclass correlation coefficient; SAS: Statistical Analysis System.

Footnotes

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

Supplemental material: Supplemental material for this article is available online.

References

  • 1.Lucas NP, Macaskill P, Irwig L. et al. The development of a quality appraisal tool for studies of diagnostic reliability (QAREL). J Clin Epidemiol 2010; 63: 854–861. [DOI] [PubMed] [Google Scholar]
  • 2.Mokkink LB, Terwee CB, Gibbons E. et al. Inter-rater reliability of the cosmin (consensus-based standards for the selection of health status measurement instruments) checklist. Qual Life Res 2010; 19: 25–25. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Kottner J, Audige L, Brorson S. et al. Guidelines for reporting reliability and agreement studies (GRRAS) were proposed. Int J Nurs Stud 2011; 48: 661–671. [DOI] [PubMed] [Google Scholar]
  • 4.McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods 1996; 1: 30–46. [Google Scholar]
  • 5.Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979; 86: 420–428. [DOI] [PubMed] [Google Scholar]
  • 6.Zou GY. Sample size formulas for estimating intraclass correlation coefficients with precision and assurance. Stat Med 2012; 31: 3972–3981. [DOI] [PubMed] [Google Scholar]
  • 7.Bonett DG. Sample size requirements for estimating intraclass correlations with desired precision. Stat Med 2002; 21: 1331–1335. [DOI] [PubMed] [Google Scholar]
  • 8.Shoukri M, Asyali M, Donner A. Sample size requirements for the design of reliability study: review and new results. Stat Methods Med Res 2004; 13: 251–271. [Google Scholar]
  • 9.Walter SD, Eliasziw M, Donner A. Sample size and optimal designs for reliability studies. Stat Med 1998; 171: 101–110. [DOI] [PubMed] [Google Scholar]
  • 10.Donner A, Eliasziw M. Sample size requirements for reliability studies. Stat Med 1987; 6: 441–448. [DOI] [PubMed] [Google Scholar]
  • 11.Swiger LA, Harvey WR, Everson DE. et al. The variance of intraclass correlation involving groups with one observation. Biometrics 1964; 20: 818. [Google Scholar]
  • 12.Fisher R. Statistical methods for research workers. 13. ed., rev. ed. New York: Hafner, 1958. [Google Scholar]
  • 13.Zerbe CO, Goldgar DE. Comparison of intracalss correlation coefficients with the ratio of two independent F-statistics. Commun Stat-Theory Methods 1980; 9: 1641–1655. [Google Scholar]
  • 14.Donner A. A review of inference procedures for the intraclass correlation coefficient in the one-way random effects model. Int Stat Rev 1986; 54: 67–82. [Google Scholar]
  • 15.Searle SR. Linear models. New York: John Wiley & Sons, 1971.
  • 16.Ramasundarahettige CF, Donner A, Zou GY. Confidence interval construction for a difference between two dependent intraclass correlation coefficients. Stat Med 2009; 28: 1041–1053. [DOI] [PubMed] [Google Scholar]
  • 17.Donner A, Wells GA. A comparison of confidence interval methods for the intraclass correlation coefficient. Biometrics 1986; 42: 401–412. [PubMed] [Google Scholar]
  • 18.Borg D, Bach A, O’Brien J, et al. Calculating sample size for reliability studies. PM&R 2022; 14: 1018–1025. [DOI] [PubMed] [Google Scholar]
  • 19.Shieh G. Sample size requirements for the design of reliability studies: precision consideration. Behav Res Methods 2014; 46: 808–822. [DOI] [PubMed] [Google Scholar]
  • 20.Bujang MA, Baharum N. A simplified guide to determination of sample size requirements for estimating the value of intraclass correlation coefficient: a review. Arch Orofac Sci 2017; 12: 1–11. [Google Scholar]
  • 21.Shieh G. Optimal sample sizes for the design of reliability studies: power consideration. Behav Res Methods 2014; 46: 772–785. [DOI] [PubMed] [Google Scholar]
  • 22.Shoukri MM, Al-Hassan T, Deniro M. et al. Bias and mean square error of reliability estimators under the one and two random effects models: the effect of non-normality. Open J Stat 2016; 06: 254–273. [Google Scholar]
  • 23.Wang CS, Yandell BS, Rutledge JJ. Bias of maximum likelihood estimator of intraclass correlation. Theor Appl Genet 2004; 82: 421–424. [DOI] [PubMed] [Google Scholar]
  • 24.Donner A, Koval JJ. A note on the accuracy of Fisher’s approximation to the large sample variance of an intraclass correlation. Commun Stat - Simul Comput 1983; 12: 443–449. [Google Scholar]
  • 25.Visscher PM. On the sampling variance of intraclass correlations and genetic correlations. Genetics 1998; 149: 1605–1614. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Kaart T. A new approximation to the variance of the anova estimate of the intraclass correlation coefficient. Proce Est Acad Sci Phys, Math 2005; 54. DOI: 10.3176/phys.math.2005.4.04. [DOI] [Google Scholar]
  • 27.Demetrashvili N, Wit EC, van den Heuvel ER. Confidence intervals for intraclass correlation coefficients in variance components models. Stat Methods Med Res 2016; 25: 2359–2376. [DOI] [PubMed] [Google Scholar]
  • 28.Liljequist D, Elfving B, Skavberg Roaldsen K. Intraclass correlation – a discussion and demonstration of basic features. PLoS ONE 2019; 14: e0219854. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Giraudeau B, Mary JY. Planning a reproducibility study: how many subjects and how many replicates per subject for an expected width of the 95 per cent confidence interval of the intraclass correlation coefficient. Stat Med 2001; 20: 3205–3214. [DOI] [PubMed] [Google Scholar]
  • 30.R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2022. https://www.R-project.org/ SEP.
  • 31.Bland JM, Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res 1999; 8: 135–160. [DOI] [PubMed] [Google Scholar]
  • 32.Tiku ML. Approximating the general non-normal variance-ratio sampling distributions. Biometrika 1964; 51: 83–95. [Google Scholar]
  • 33.Gayen AK. The distribution of the variance ratio in random samples of any size drawn from non-normal universes. Biometrika 1950; 37: 236–255. [PubMed] [Google Scholar]
  • 34.Scheffé H. The analysis of variance. Oxford, England: Wiley, 1959. [Google Scholar]
  • 35.Khan A, Rayner GD. Robustness to non-normality of common tests for the many-sample location problem. J Appl Math Decis Sci 2003; 7: 657201. [Google Scholar]
  • 36.Blanca MJ, Alarcón R, Arnau J. et al. Non-normal data: Is ANOVA still a valid option? Psicothema 2017; 29: 552–557. [DOI] [PubMed] [Google Scholar]
  • 37.Donaldson TS. Robustness of the F-test to errors of both kinds and the correlation between the numerator and denominator of the f-ratio. J Am Stat Assoc 1968; 63: 660–676. http://www.jstor.org/stable/2284037 . [Google Scholar]
  • 38.Marcinko T. Consequences of assumption violations regarding one-way ANOVA. The 8th International Days of Statistics and Economics, Prague, September 11–13, 2014.
  • 39.Wilcox R. Chapter 7—one-way and higher designs for independent groups. In Wilcox R (ed.) Introduction to Robust Estimation and Hypothesis Testing (Third Edition), third edition ed. Statistical Modeling and Decision Science, Boston: Academic Press. ISBN 978-0-12-386983-8, 2012. pp. 291–377. DOI: 10.1016/B978-0-12-386983-8.00007-X. [DOI]
  • 40.Rothery P. A nonparametric measure of intraclass correlation. Biometrika 1979; 66: 629–639. [Google Scholar]
  • 41.Shirahata S. Nonparametric measures of interclass correlation. Commun Stat – Theory Method 1982; 11: 1707–1721. [Google Scholar]
  • 42.Commenges D, Jacqmin H. The intraclass correlation-coefficient – distribution-free definition and test. Biometrics 1994; 50: 517–526. [PubMed] [Google Scholar]
  • 43.Ukoumunne O, Davison A, Gulliford M, et al. Non-parametric bootstrap confidence intervals for the intraclass correlation coefficient. Stat Med 2003; 22: 3805–3821. [DOI] [PubMed] [Google Scholar]
  • 44.Ionan AC, Polley MY, McShane LM, et al. Comparison of confidence interval methods for an intra-class correlation coefficient (ICC). BMC Med Res Methodol 2014; 121. DOI: 10.1186/1471-2288-14-121. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Almehrizi RS, Emam M. Asymptotic standard errors of intraclass correlation coefficients for two-way model. Commun Stat-Simul Comput 2021; 52: 2073–2092. [Google Scholar]
  • 46.Ken K. MBESS: The MBESS R Package https://CRAN.R-project.org/package=MBESS.
  • 47.Alan GH, Armando L, Odile S, et al. ‘presize‘: an R-package for precision-based sample size calculation in clinical research. J Open Source Softw 2021; 6: 3118. [Google Scholar]
  • 48.Alasdair R, Saurabh S, Dinesh K. ICC.Sample.Size: Calculation of Sample Size and Power for ICC. https://CRAN.R-project.org/package=ICC.Sample.Size.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

sj-pdf-1-smm-10.1177_09622802231224657 - Supplemental material for Review of sample size determination methods for the intraclass correlation coefficient in the one-way analysis of variance model

Supplemental material, sj-pdf-1-smm-10.1177_09622802231224657 for Review of sample size determination methods for the intraclass correlation coefficient in the one-way analysis of variance model by Dipro Mondal, Sophie Vanbelle, Alberto Cassese and Math JJM Candel in Statistical Methods in Medical Research


Articles from Statistical Methods in Medical Research are provided here courtesy of SAGE Publications

RESOURCES