The validity and reliability of measurement tools are essential in empirical research, and agreement assessment is a key part of method validation.1 Various methods exist depending on data type, including Cohen’s kappa for categorical data and Bland–Altman analysis, the intraclass correlation coefficient (ICC), and the concordance correlation coefficient (CCC) for continuous data. A common challenge is determining an appropriate sample size: Too small a sample yields imprecise estimates, whereas too large a sample wastes resources. A formal sample size calculation ensures that a study can detect meaningful agreement with adequate statistical power, and this tutorial provides a practical guide for doing so.2,3
This tutorial addresses a gap in applied research by providing a practical, consolidated guide to sample size calculation for commonly used agreement methods.
Discussion
Agreement Methods and Sample Size Calculation
Cohen’s Kappa (κ)
Method
Cohen’s kappa is a statistic used to measure inter-rater agreement for categorical (nominal) items between two raters. It corrects for the agreement expected by chance, making it more robust than simple percent agreement. The formula is *κ = (P0 − Pe)/(1 − Pe)*, where P0 is the observed agreement and Pe is the expected chance agreement.4
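To make the formula concrete, the following minimal sketch computes κ from two raters’ label lists. Python is used here for self-containment, although the worked examples in this tutorial use R; the function name `cohens_kappa` is ours.

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater1)
    # P0: observed proportion of exact agreements
    p0 = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Pe: agreement expected by chance, from each rater's marginal proportions
    m1, m2 = Counter(rater1), Counter(rater2)
    pe = sum((m1[c] / n) * (m2[c] / n) for c in m1.keys() | m2.keys())
    return (p0 - pe) / (1 - pe)
```

The same two-step structure, observed minus chance agreement rescaled by the maximum possible excess over chance, underlies all of the kappa variants discussed below.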
Sample Size Considerations
The sample size for kappa depends on the number of categories, the prevalence of those categories (marginal proportions), the value of kappa under the null hypothesis (typically 0), the expected value of kappa under the alternative hypothesis, the desired power, and the alpha level. Although approximate formulas exist, the distribution of kappa is non-normal, particularly near the boundaries of its range.
Formula/Rule of Thumb
A widely used formula for the case of two raters and two categories is:

*n = [(z_{1 − α}√A + z_{1 − β}√B)/(κ1 − κ0)]²*

Where A and B are complex functions of the marginal proportions and the kappa values. Due to this complexity, software is almost always used. A common rule of thumb is that a sample of 50–100 subjects is often a reasonable starting point for a kappa study, but this can vary dramatically with the number of categories and the expected agreement.
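As an illustration of how the pieces combine, the sketch below plugs normal critical values into this type of formula. The A and B variance factors must come from the published closed-form expressions or from software; the unit values used in the usage note are placeholders, and the function name is ours.

```python
import math
from statistics import NormalDist

def kappa_sample_size(kappa0, kappa1, var0, var1, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a one-sided test of
    H0: kappa = kappa0 against H1: kappa = kappa1.
    var0/var1 stand in for the A and B variance factors, which depend
    on the marginal proportions and must be supplied externally."""
    z_a = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    z_b = NormalDist().inv_cdf(power)
    n = ((z_a * math.sqrt(var0) + z_b * math.sqrt(var1))
         / (kappa1 - kappa0)) ** 2
    return math.ceil(n)
```

With placeholder unit variances, `kappa_sample_size(0.60, 0.75, 1.0, 1.0, alpha=0.05, power=0.85)` returns 320; realistic variance factors are smaller, which is why dedicated software reports smaller n.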
Applied Example in Clinical Psychiatry
Scenario
A study evaluates diagnostic agreement between two psychiatrists for a specific anxiety disorder (binary outcome: Diagnosis present or absent). Expected agreement is κ1 = 0.75, tested against a clinically relevant benchmark of κ0 = 0.60, with 85% power and one-sided α = 0.05. The prevalence of a positive diagnosis is estimated at 40%.
Calculation
Using R’s kappaSize package:
library(kappaSize)
PowerBinary(kappa0 = 0.60, kappa1 = 0.75, props = c(0.40, 0.60), alpha = 0.05, power = 0.85)
Result
Required sample size ≈ 177 patients.
Interpretation
To detect that the psychiatrists’ agreement exceeds κ = 0.60 with 85% power, 177 patients must be independently assessed by both raters. Testing against a non-zero null (κ0 = 0.60) and using a one-sided test increases the required sample size compared to testing against κ0 = 0.
Fleiss’ Kappa (κ)
Method
Fleiss’ kappa generalizes Cohen’s kappa to more than two raters and assesses the extent to which multiple raters agree on a categorical scale. It assumes that each subject is rated by the same number of raters, although the raters need not be the same individuals.5
Sample Size Considerations
The sample size depends on the number of raters (*k*), the number of categories (*c*), the null and alternative kappa values, the power, alpha, and the distribution of ratings across categories. The calculations are more complex than for Cohen’s kappa.
Formula/Rule of Thumb
There is no straightforward, widely applicable closed-form formula; simulation or specialized power-analysis tools are usually used to calculate sample size. In general, increasing the number of raters reduces the number of subjects required, although this relationship is nonlinear. For reliable estimates, studies with three to five raters often require at least 30–50 subjects; however, this is quite context-dependent.
Applied Example in Clinical Psychology
Scenario
A panel of four clinical psychologists (k = 4) will independently rate patient vignettes for a personality disorder trait (categorical: Absent, subthreshold, present). The expected agreement is κ1 = 0.55, tested against a “slight” benchmark κ0 = 0.20, with 85% power and one-sided α = 0.05. Expected marginal ratings are 50% absent, 30% subthreshold, and 20% present.
Calculation
Using simulation in R:
1. Simulate rating matrices for a range of sample sizes, assuming the expected distribution and a true κ = 0.55.
2. Compute Fleiss’ kappa and test H0: κ ≤ 0.20.
3. Identify the smallest sample size at which power reaches 0.85.
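The simulation loop just described can be sketched as follows. The rating generator is a simplifying assumption of ours (a mixture model whose expected kappa equals the target), and the power criterion here is a crude proxy, the share of kappa estimates exceeding κ0, rather than a formal test based on a standard error or confidence bound.

```python
import random

def fleiss_kappa(counts):
    """Fleiss' kappa from a subjects-by-categories count table.
    counts[i][j] = number of the k raters assigning subject i to
    category j; every row must sum to the same number of raters k."""
    n, k = len(counts), sum(counts[0])
    ncat = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (n * k) for j in range(ncat)]
    p_i = [(sum(c * c for c in row) - k) / (k * (k - 1)) for row in counts]
    p_bar = sum(p_i) / n
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

def simulated_power(n, k, probs, kappa1, kappa0, sims=1000, seed=1):
    """Crude power proxy: share of simulated kappa estimates above kappa0.
    Ratings follow a mixture model -- with probability kappa1 all k raters
    agree on a category drawn from `probs`, otherwise each rater rates
    independently -- whose expected kappa equals kappa1."""
    rng = random.Random(seed)
    ncat = len(probs)
    hits = 0
    for _ in range(sims):
        table = []
        for _ in range(n):
            row = [0] * ncat
            if rng.random() < kappa1:
                row[rng.choices(range(ncat), probs)[0]] = k  # unanimous
            else:
                for _ in range(k):  # independent ratings
                    row[rng.choices(range(ncat), probs)[0]] += 1
            table.append(row)
        if fleiss_kappa(table) > kappa0:
            hits += 1
    return hits / sims
```

A full study design would replace the crude criterion with a test comparing the lower confidence bound of κ̂ to κ0 and search over n for the smallest value reaching the target power.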
Result
n = 92 patient vignettes required.
Interpretation
To detect that agreement among four psychologists exceeds “slight” (κ > 0.20) with 85% power when true agreement is moderate (κ = 0.55), 92 vignettes are needed. This example highlights that simulation is the most reliable approach for multi-rater ordinal designs, accounting for expected rating distributions.
Weighted Kappa (κ_w)
Method
For ordinal data (such as disease severity: Mild, moderate, or severe), weighted kappa is used. It recognizes that not all disagreements are equally serious; a disagreement between “mild” and “severe” is more significant than one between “mild” and “moderate.” Disagreements are penalized according to their distance using a weight matrix, typically with linear (absolute-difference) or quadratic (squared-distance) weights.6
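A minimal sketch of the weighted-kappa computation, assuming categories coded 0 to c − 1 (the function name is ours; Python for self-containment):

```python
def weighted_kappa(rater1, rater2, n_cats, quadratic=True):
    """Weighted kappa for ordinal ratings coded 0..n_cats-1.
    Quadratic weights penalize squared distance; linear weights,
    absolute distance."""
    n = len(rater1)
    exp = 2 if quadratic else 1
    # disagreement weights, scaled to [0, 1]
    w = [[(abs(i - j) / (n_cats - 1)) ** exp for j in range(n_cats)]
         for i in range(n_cats)]
    # joint distribution of the two raters' categories
    obs = [[0.0] * n_cats for _ in range(n_cats)]
    for a, b in zip(rater1, rater2):
        obs[a][b] += 1.0 / n
    row = [sum(obs[i]) for i in range(n_cats)]  # rater 1 marginals
    col = [sum(obs[i][j] for i in range(n_cats)) for j in range(n_cats)]
    d_obs = sum(w[i][j] * obs[i][j]
                for i in range(n_cats) for j in range(n_cats))
    d_exp = sum(w[i][j] * row[i] * col[j]
                for i in range(n_cats) for j in range(n_cats))
    return 1.0 - d_obs / d_exp
```

With only two categories the weight matrix reduces to 0/1 penalties, and the result coincides with unweighted Cohen’s kappa, a useful sanity check.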
Sample Size Considerations
As with Cohen’s kappa, the sample size rationale accounts for the selected weighting system. The weights and marginal distributions of the ordinal categories affect the variance of the weighted kappa statistic.
Formula/Rule of Thumb
The sample size formula is an extension of the formula for Cohen’s kappa, incorporating weights. It is computationally intensive. A pragmatic rule of thumb is to use the sample size calculated for an unweighted kappa with the same number of categories as a conservative estimate, or to use simulation. Sample sizes are often comparable to those for Cohen’s kappa, typically >50 subjects.
Applied Example in Clinical Psychiatry
Scenario
Two psychiatrists independently rate depressive episode severity using a four-point CGI-S scale (1 = normal to 4 = severely ill) with quadratic weights. Expected agreement is substantial (κ_w1 = 0.70), tested against a minimum acceptable threshold of κ_w0 = 0.50, with 80% power and one-sided α = 0.05.
Calculation
Using simulation in R or SAS, the expected rating distribution (e.g., 20% normal, 35% mild, 30% moderate, 15% severe) and the weighting scheme are applied. Contingency tables are simulated iteratively to estimate κ_w and power for different sample sizes.
Result
n = 85 patients required.
Interpretation
To detect that inter-rater reliability exceeds moderate (κ_w > 0.50) with 80% power when true reliability is substantial (κ_w = 0.70), 85 independently rated patients are needed. The example shows that sample size depends on both kappa values and the expected distribution of ordinal ratings.
Intraclass Correlation Coefficient (ICC)
Method
For continuous data, the ICC is used to evaluate measurement reliability. It measures the proportion of the total variance in the measurements that is attributable to between-subject variance. Depending on the research design (e.g., whether the same or different raters rate each subject), there are several models (e.g., one-way random, two-way random, two-way mixed).7
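The variance-decomposition idea can be sketched for the simplest case, a one-way random-effects ICC computed from ANOVA mean squares (the function name is ours; two-way models additionally need rater and interaction terms):

```python
def icc_oneway(data):
    """One-way random-effects ICC(1) from ANOVA mean squares.
    data: one inner list of k ratings per subject."""
    n, k = len(data), len(data[0])
    grand = sum(x for row in data for x in row) / (n * k)
    means = [sum(row) / k for row in data]
    # between-subject and within-subject mean squares
    msb = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2
              for row, m in zip(data, means) for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

When every subject’s ratings are identical, the within-subject mean square is zero and the ICC is exactly 1; pure rater noise with no between-subject variance drives it toward its lower bound.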
Sample Size Considerations
The sample size depends on the ICC model, the number of raters (*k*), the true ICC value (ρ), the null-hypothesis value (often ρ0 = 0 or a low value), power, and alpha. A key insight is that increasing the number of raters (*k*) can be a more efficient way to increase power than increasing the number of subjects (*n*).
Formula/Rule of Thumb
A standard formula for a one-way random-effects model (each subject measured by different random raters) is the approximation of Walter, Eliasziw, and Donner (1998):

*n = 1 + 2k(z_{1 − α} + z_{1 − β})²/[(k − 1)(ln C0)²]*, where *C0 = (1 + kρ0/(1 − ρ0))/(1 + kρ1/(1 − ρ1))*

Here *k* is the number of raters, ρ0 is the null ICC, and ρ1 is the alternative ICC. A common rule of thumb is the “30/30 rule”: At least 30 subjects, each measured by at least 30 raters (or 30 measurements per subject), is a good target for a precise estimate, but this is often impractical. A more common design is 2–5 raters and 50–100 subjects.
Applied Example in Clinical Psychology/Psychiatry
Scenario
A study evaluates the interrater reliability of a semi-structured interview for generalized anxiety disorder using four clinical psychologists (k = 4) and a two-way random-effects ICC model (absolute agreement). The null hypothesis is H0: ICC ≤ 0.75 versus H1: ICC > 0.75, with 90% power and α = 0.05. The anticipated true ICC is 0.85.
Calculation
Using R (ICC.Sample.Size) or Power Analysis and Sample Size (PASS) software, parameters including the number of raters, null and alternative ICC, power, and alpha are specified.
Result
The required sample size is n = 38 patients, totaling 152 ratings across raters.
Interpretation
With four raters, 38 patients are sufficient to detect an ICC exceeding 0.75 with 90% power when the true ICC is 0.85. Increasing the number of raters reduces the required number of subjects compared to designs with fewer raters.
Concordance Correlation Coefficient (CCC)
Method
The CCC (Lin, 1989) assesses the agreement between two continuous measurements by measuring their deviation from the line of perfect concordance (the 45° line). It incorporates both precision (Pearson’s correlation) and accuracy (the shift from the 45° line). The formula is *ρ_c = 2σ_xy/(σ_x² + σ_y² + (µ_x − µ_y)²)*, where σ_xy is the covariance between the two measurements.
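The formula translates directly into code. The sketch below uses Lin’s plug-in estimator with population (divide-by-n) variances; the function name is ours.

```python
def ccc(x, y):
    """Lin's concordance correlation coefficient for paired measurements."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # population (divide-by-n) variances and covariance, as in Lin (1989)
    sx = sum((v - mx) ** 2 for v in x) / n
    sy = sum((v - my) ** 2 for v in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * sxy / (sx + sy + (mx - my) ** 2)
```

A constant shift between methods leaves Pearson’s correlation at 1 but pulls the CCC below 1 through the (µ_x − µ_y)² term, which is exactly the accuracy penalty described above.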
Sample Size Considerations
Sample size calculation for the CCC is based on its asymptotic distribution. It requires specifying the expected CCC value (ρ_c), the null-hypothesis value, the desired power, and α. The calculation also involves the expected means and variances of the two measurement methods.
Formula/Rule of Thumb
The formula involves a transformation of the CCC (Fisher’s Z). The sample size formula takes the form:

*n = ψ(z_{1 − α} + z_{1 − β})²/(ξ1 − ξ0)²*

Where ξ0 and ξ1 are Fisher’s Z transformations of the null and alternative CCC values, and ψ is a scale parameter related to the variances and means. Due to its complexity, software is essential. There is no simple rule of thumb, but sample sizes are generally similar to those for ICC, often in the range of 50–200 pairs.
Applied Example in Clinical Psychology
Scenario
A study validates a new five-minute computerized cognitive battery (Method A) against a 45-minute paper-and-pencil test (Method B) for processing speed. The goal is to test whether agreement is excellent, with a null CCC (ρ0) = 0.90 and an anticipated true CCC (ρ1) = 0.94, using 90% power and one-sided α = 0.05.
Calculation
Using R (cccPower) or PASS, parameters including the null and alternative CCC, power, alpha, and pilot-based means and variances (σ1² ≈ σ2², µ1 ≈ µ2) are input.
Result
The required sample size is n = 112 participants (paired assessments).
Interpretation
To detect that the screening battery agrees excellently with the gold-standard test (CCC > 0.90) with 90% power when true CCC = 0.94, 112 participants are needed.
Bland–Altman Plot
Method
The Bland–Altman plot is a graphical method for assessing agreement between two continuous measurements. It plots the differences between the two methods against their averages. The analysis focuses on estimating the mean difference (bias) and the limits of agreement (LoA), defined as bias ± 1.96 × SD of the differences.8
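The quantities plotted and reported can be computed in a few lines (the function name is ours):

```python
def limits_of_agreement(method_a, method_b):
    """Bias and 95% limits of agreement for paired measurements."""
    d = [a - b for a, b in zip(method_a, method_b)]
    n = len(d)
    bias = sum(d) / n
    # sample SD of the differences (n - 1 denominator)
    sd = (sum((x - bias) ** 2 for x in d) / (n - 1)) ** 0.5
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

Plotting the differences against the per-pair averages then overlays these three horizontal lines on the scatter.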
Sample Size Considerations
The sample size for a Bland–Altman analysis is not intended to test a hypothesis but to estimate the LoA precisely. The goal is to have sufficiently narrow confidence intervals around the LoA. The precision of the LoA depends on the standard error of the standard deviation of the differences.9
Formula/Rule of Thumb
A formula for the confidence interval for a standard deviation is used. The approximate 95% CI for the population standard deviation (σ) is given by: *[s√((n − 1)/χ²_{1 − α/2, n − 1}), s√((n − 1)/χ²_{α/2, n − 1})]*. To ensure that the upper limit of the LoA is estimated precisely, one can solve for *n*. A widely cited rule of thumb from Bland and Altman (1999) is that a sample of at least 100 subjects is desirable for a reliable estimate of the LoA. For preliminary studies, a minimum of 50 subjects is often used.
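For rough planning, a commonly cited normal-theory approximation from Bland and Altman (1999), Var(LoA) ≈ 3s²/n, can be inverted to give an n for a target CI half-width around a limit of agreement. The sketch below (function name ours) is cruder than the exact chi-square interval used in the worked example that follows and can suggest smaller samples:

```python
import math

def n_for_loa_precision(sd_diff, target_halfwidth, z=1.96):
    """Approximate subjects needed so the 95% CI around a limit of
    agreement has the given half-width, using the approximation
    Var(LoA) ~= 3 * sd^2 / n (Bland & Altman, 1999)."""
    # CI half-width ~= z * sd * sqrt(3 / n); solve for n and round up
    return math.ceil(3 * (z * sd_diff / target_halfwidth) ** 2)
```

Halving the target half-width roughly quadruples the required n, which is the familiar inverse-square relationship between precision and sample size.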
Applied Example in Clinical Psychology/Psychiatry
Scenario
A study validates a brief digital self-report depression questionnaire against the clinician-administered HAM-D scale by estimating the LoA. For clinical utility, the 95% CI for the upper LoA should be ≤ ±3 points.
Calculation
Pilot data (n = 30) indicate an SD of differences of s = 4.5. Iterative calculations show:
- n = 50 → CI too wide ([3.9, 5.4]); imprecise LoA
- n = 100 → CI narrower ([4.0, 5.1]); precise LoA
Result
A sample of n = 100 participants is needed to achieve a reliable and precise estimate of agreement.
Summary of Agreement and Reliability Assessment Methods and Sample Size Guidelines
Sample size determination for studies evaluating agreement or reliability requires careful consideration of the chosen statistical method, as each metric has distinct assumptions and requirements. Table 1 presents a concise comparison of common agreement and reliability metrics, along with key practical guidance for planning an adequate sample size in each case.
Table 1.
Comparison of Agreement and Reliability Assessment Methods.
| Method | Data Type | Primary Purpose | Sample Size Guidance (n = Number of Subjects) |
| --- | --- | --- | --- |
| Cohen’s kappa (κ) | Categorical (nominal) | Chance-corrected agreement between two raters. | Minimum 50–100. 200+ may be required to detect rare outcomes or to test against a benchmark value. |
| Fleiss’ kappa (κ) | Categorical (nominal) | Chance-corrected agreement among >2 raters. | For 3–5 raters: 50–150. Simulation is required for precise calculation. |
| Weighted kappa (κ_w) | Ordinal | Agreement between two raters with severity-weighted disagreements. | As with Cohen’s kappa; simulation is the recommended approach due to the complexity of weighting. |
| ICC | Continuous | Reliability among ≥2 raters or measurements. | Common: 2–5 raters and 50–100 subjects. Increasing the number of raters (*k*) is an efficient way to increase power. |
| CCC | Continuous | Agreement between two measurement methods. | Typically 50–200 paired measurements. Requires software for calculation. |
| Bland–Altman plot and LoA | Continuous | Estimate bias and LoA between two methods. | n ≥ 100 for reliable LoA; n ≥ 50 for pilot studies. Based on the precision of the standard deviation of differences. |
According to Table 1, no single sample size rule applies universally to agreement studies. Although the guidelines presented offer a practical starting point, conducting formal a priori sample size calculations—using dedicated software or simulation techniques—is strongly recommended. Such calculations require precise definitions of the anticipated agreement level, the margin of clinical relevance or the null-hypothesis benchmark, the expected rating distributions, and the desired statistical power. Diligent sample size planning is fundamental to ensuring that a study is sufficiently powered to produce reliable evidence regarding the degree of measurement agreement or reliability.
Software Implementation Guide
For the sample size calculations and power analyses outlined in Table 1, researchers can utilize several statistical software platforms. The choice of tool often depends on the specific agreement metric, the researcher’s preference for a formula-based versus simulation-based approach, and the availability of commercial or open-source solutions. Table 2 provides an overview of the primary software options and their key capabilities.
Table 2.
Overview of Software and Approaches for Power Analysis.
| Software/Package | Key Capabilities for Power Analysis of Agreement Metrics | Considerations and Primary Approach |
| --- | --- | --- |
| R | | |
| pwr | Fundamental power analysis; can be adapted for some agreement metrics | Formula-based; requires adaptation |
| irr | Calculates agreement statistics (e.g., kappa, ICC) | Power analysis typically requires a custom simulation using its functions |
| kappaSize | Dedicated sample size calculation for Cohen’s and weighted kappa | Specialized, formula-based procedures for kappa |
| ICC.Sample.Size | Specifically for calculating the sample size for ICC | Specialized, formula-based procedure for ICC |
| Custom simulation | The most flexible approach for any agreement statistic (kappa, ICC, CCC, etc.) | Simulate datasets under H0 and H1, compute the statistic, and estimate power as the proportion of correct rejections |
| SAS | | |
| PROC POWER | Built-in power analysis for the ICC (using the DIST = TESTF option) | Direct, formula-based method for ICC |
| PROC FREQ with simulation | Can be used for power analysis of kappa statistics via simulation | A simulation-based approach within the standard procedure |
| User-written macros | Available for specific statistics such as CCC | Provides flexibility for metrics not covered by built-in procedures |
| PASS software | Dedicated procedures for sample size calculation for kappa, ICC, and CCC | Comprehensive, commercial software with a user-friendly interface and validated routines |
| Stata | | |
| kapci, icc | Provide confidence intervals for kappa and ICC | No direct a priori power analysis; useful for post hoc analysis |
| sampsi with formulas/simulate | Can be used with known formulas or, more commonly, with the simulate command for custom power analysis | Relies on user programming for simulation-based power analysis |
| G*Power | Excellent for many common statistical tests (t tests, ANOVA, correlation, etc.) | Not recommended for agreement metrics, as it lacks direct support for kappa, ICC, or CCC |
As illustrated in Table 2, a range of software options exists, from specialized packages for specific metrics (e.g., kappaSize in R) to flexible simulation frameworks available in most environments. For most agreement metrics beyond the ICC, simulation, whether in R, SAS, or Stata, is the most robust and generalizable method for power analysis, as it can accommodate complex designs, unevenly distributed ratings, and non-standard null hypotheses. While commercial software such as PASS offers validated, user-friendly procedures, open-source solutions provide greater flexibility for custom study designs.
Power Analysis in Agreement Studies
Power analysis is an essential element in the design of any research project, including agreement studies, because it determines the necessary sample size. This section describes the fundamental concepts, implementation procedures, and specific considerations for conducting power analysis in the context of reliability and agreement statistics. A well-conducted power analysis protects against inconclusive results and wasted resources by ensuring that a study has a high probability of detecting a clinically important level of agreement, if it exists.
Conclusions
Formal a priori power analysis is crucial for designing robust agreement studies. By linking clinical or research objectives to statistical parameters such as the significance level and desired power, it ensures an adequate sample size to detect meaningful agreement, such as a substantial kappa or a high ICC. Consideration of method-specific factors—such as the number of raters for ICC, marginal proportions for kappa, or precision for Bland–Altman plots—enhances efficiency. Using pilot data, the literature, and modern statistical tools transforms study design into a scientifically solid investigation, improving the reliability, validity, and credibility of measurement agreement research.
Acknowledgments
Not applicable.
Footnotes
Appropriate Permissions from the Concerned Authorities: None.
Data Sharing Statements: Not applicable.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Declaration Regarding the Use of Generative AI: No part of this article was written or generated by a generative AI tool. The authors take full responsibility for the accuracy, integrity, and originality of the published article.
Ethics Committee Details: This study was a tutorial and did not require ethics approval.
Funding: The authors received no financial support for the research, authorship, and/or publication of this article.
Informed Consent/Assent: Not applicable.
Prior Presentations: None.
PROSPERO/CTRI Details: None.
Registration: Not applicable.
Simultaneous Submission to Another Journal or Resource: Not applicable.
Status of Your Study (for Study Protocol): Not applicable.
References
- 1. De Vet HC, Terwee CB, Mokkink LB, et al. Measurement in medicine: A practical guide. Cambridge University Press, 2011.
- 2. Bujang MA and Baharum N. A simplified guide to determination of sample size requirements for estimating the value of intraclass correlation coefficient: A review. Arch Orofac Sci, 2017; 12(1): 1–11.
- 3. Bland JM and Altman DG. Measuring agreement in method comparison studies. Stat Methods Med Res, 1999; 8(2): 135–160.
- 4. Kvålseth TO. Note on Cohen’s kappa. Psychol Rep, 1989; 65(1): 223–226.
- 5. Rücker G, Schimek-Jasch T and Nestle U. Measuring inter-observer agreement in contour delineation of medical imaging in a dummy run using Fleiss’ kappa. Methods Inf Med, 2012; 51(6): 489–494.
- 6. Cohen J. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychol Bull, 1968; 70(4): 213.
- 7. Bartko JJ. The intraclass correlation coefficient as a measure of reliability. Psychol Rep, 1966; 19(1): 3–11.
- 8. Kaur P and Stoltzfus JC. Bland–Altman plot: A brief overview. Int J Acad Med, 2017; 3(1): 110–111.
- 9. Jan S-L and Shieh G. The Bland–Altman range of agreement: Exact interval procedure and sample size determination. Comput Biol Med, 2018; 100: 247–252.
