Table 2.
| Index | Overview | Range | Interpretation | Advantage(s) for CT | Disadvantage(s) for CT |
|---|---|---|---|---|---|
| Pearson’s r | Measures strength of linear association between scores from two independent observers | −1 to 1 | Values closer to −1 or 1 indicate a strong negative or positive relationship, respectively | Easy to compute with standard software; method is well known and coefficient is easy to interpret | Cannot be used to assess intra-rater reliability because it requires exactly 2 independent observers; generally inappropriate for reliability because it measures linearity: a perfect positive correlation (r = 1) can result even when two observers give systematically different scores; cannot be used for categorical CT scores in CF (e.g., presence/absence of bronchiectasis) |
| Bland-Altman analysis | CI (typically 95%) for the mean difference Δ between two sets of observer scores | Depends on the Δ estimate | When used with a Bland-Altman plot, the LOA can reveal systematic differences and variability | Plot is easy to interpret; statistical approach to obtaining the mean difference and CI is straightforward | Influenced by the normality assumption; valid only for continuous scores; cannot be used with >2 observers; assumes one method of measurement is the ‘gold standard’ and is therefore known; requires an understanding of what constitutes an acceptable/unacceptable difference in observer scores |
| ICC/weighted Kappa statistic | Ratio of between-subject variability to total variability, where total variability is the sum of between- and within-subject variability | −1 to 1 | Scaled coefficient ranges adopted in the literature (not widely recognized in statistics) indicate fair (0.4 to 0.6), moderate (0.6 to 0.8), and excellent (0.8 to 1) agreement | Unit-less value; has continuous (ICC) and categorical (Kappa) versions to accommodate different CT subscores, with consistent interpretation for both data types; can be used to assess intra-rater reliability; can accommodate >2 observers | Interpretation for the same CT marker is not consistent across different populations and/or studies; a high estimate does not always reflect excellent agreement; requires that ANOVA assumptions are met |
| CCC | 1 minus the ratio of within-subject squared deviation to total deviation | −1 to 1 | Scaled coefficient may be thought of as a standardized estimate of the mean squared difference between observers | Same advantages as ICC; relaxes the ANOVA assumption required for the ICC; currently the only reliability statistic endorsed by the Metrics Champion Consortium for Imaging | Estimates often correspond to the ICC and therefore share the same issues with interpretation across multiple studies/populations |
| CP | Proportion of scans with differences in scores that fall within an acceptable threshold | 0 to 1 | Unscaled index expressed as a probability estimate, where values close to 1 indicate higher reliability | Can be computed using a nonparametric approach for any data type; can be used for >2 observers; has consistent interpretation across multiple studies/populations; can be used to pinpoint specific instances of poor reliability | A threshold defining a meaningful difference between observer scores must be set a priori, which can be difficult when studying novel CT markers or scoring systems; this threshold affects the probability estimate |
Abbreviations: Analysis of variance (ANOVA); chest computed tomography (CT); concordance correlation coefficient (CCC); coverage probability (CP); cystic fibrosis (CF); intra-class correlation coefficient (ICC); limits of agreement (LOA)
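As an illustrative sketch (not part of the original table), the continuous-score indices above can be computed directly for two observers scoring the same scans. The observer scores and the CP threshold below are hypothetical, chosen only to show how Pearson’s r, the CCC, Bland-Altman limits of agreement, and the coverage probability relate to one another:

```python
import math

# Hypothetical CT scores for the same 8 scans from two observers
obs1 = [10.0, 12.0, 9.0, 14.0, 11.0, 13.0, 8.0, 15.0]
obs2 = [11.0, 12.5, 9.5, 13.0, 12.0, 14.0, 8.5, 14.5]
n = len(obs1)

mean1 = sum(obs1) / n
mean2 = sum(obs2) / n
# Population variances and covariance, as used in the CCC definition
var1 = sum((x - mean1) ** 2 for x in obs1) / n
var2 = sum((y - mean2) ** 2 for y in obs2) / n
cov = sum((x - mean1) * (y - mean2) for x, y in zip(obs1, obs2)) / n

# Pearson's r: measures linear association only, so a systematic
# offset between observers does not lower it
r = cov / math.sqrt(var1 * var2)

# Lin's concordance correlation coefficient: penalizes the squared
# difference in means, so CCC <= |r|, with equality only when the
# observers agree in both location and scale
ccc = 2 * cov / (var1 + var2 + (mean1 - mean2) ** 2)

# Bland-Altman: mean difference with 95% limits of agreement (LOA)
diffs = [x - y for x, y in zip(obs1, obs2)]
d_bar = sum(diffs) / n
sd_d = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (n - 1))
loa = (d_bar - 1.96 * sd_d, d_bar + 1.96 * sd_d)

# Coverage probability: proportion of score differences within an
# a priori threshold (0.75 score units here, a hypothetical choice)
threshold = 0.75
cp = sum(abs(d) <= threshold for d in diffs) / n

print(f"Pearson r = {r:.3f}, CCC = {ccc:.3f}")
print(f"Mean diff = {d_bar:.3f}, 95% LOA = ({loa[0]:.3f}, {loa[1]:.3f})")
print(f"CP(|diff| <= {threshold}) = {cp:.3f}")
```

Note how the CCC comes out below Pearson’s r whenever the two observers’ mean scores differ, and how the CP estimate depends entirely on the threshold chosen, which is the a priori decision the table flags as the main disadvantage of CP.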