Figure 4.
Test-retest reliability effects across different effect size metrics and statistical tests. (A) Observed effect sizes as a function of reliability for r_true = 0.5, comparing group differences to correlational strength. Because the effect sizes of the different tests are not directly comparable, each effect size is normalized by its own maximum value at ICC = 1. The inset shows the results for r_true = 0.9. The dashed line denotes . (B) The p-value as a function of reliability for r_true = 0.5 and a total sample size of N = 60. Dichotomizing data substantially increases p-values, especially when reliability is low. (C) The sample size required to achieve 80% statistical power at α = 0.05 as a function of reliability for the three effect size metrics. Dichotomizing data substantially increases the sample size required to detect the same true effect, especially when reliability is low.
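The qualitative pattern in panels A to C can be illustrated for the correlational metric alone with a minimal sketch. It is not the procedure used to generate the figure, and it rests on assumptions that the caption does not state: the classical Spearman attenuation formula with only one of the two measures being unreliable, and the Fisher z approximation for both the p-value and the required sample size. The function names (`attenuated_r`, `approx_p_value`, `required_n`) are hypothetical, and the group-difference and dichotomized metrics are not modeled.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the Fisher z approximation


def attenuated_r(r_true: float, icc: float) -> float:
    """Expected observed correlation under classical attenuation.

    Assumption (not from the caption): one measure has reliability `icc`
    and the other is measured without error, so r_obs = r_true * sqrt(icc).
    """
    return r_true * math.sqrt(icc)


def approx_p_value(r_obs: float, n: int) -> float:
    """Two-sided p-value for a correlation via the Fisher z approximation."""
    z = math.atanh(r_obs) * math.sqrt(n - 3)
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))


def required_n(r_obs: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate sample size needed to detect r_obs (two-sided test)."""
    z_alpha = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) / math.atanh(r_obs)) ** 2 + 3)


# Sweep reliability for a true correlation of 0.5 and a sample of N = 60.
for icc in (0.2, 0.4, 0.6, 0.8, 1.0):
    r_obs = attenuated_r(0.5, icc)
    print(
        f"ICC={icc:.1f}  normalized r={r_obs / 0.5:.3f}  "
        f"p(N=60)={approx_p_value(r_obs, 60):.4g}  "
        f"N for 80% power={required_n(r_obs)}"
    )
```

Under these assumptions the normalized effect size equals the square root of the ICC, the p-value rises as reliability falls, and the required sample size grows steeply at low reliability, which matches the direction of the trends described in the caption.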
