Author manuscript; available in PMC: 2017 Mar 7.
Published in final edited form as: J Cyst Fibros. 2016 Dec 28;16(2):175–185. doi: 10.1016/j.jcf.2016.12.008

Table 2.

Overview of select reliability indices for chest-computed tomography markers

| Index | Overview | Range | Interpretation | Advantage(s) for CT | Disadvantage(s) for CT |
|---|---|---|---|---|---|
| Pearson's r | Measures the strength of linear association between scores from two independent observers | −1 to 1 | Values closer to −1 or 1 indicate a strong negative or positive relationship, respectively | Easy to compute with standard software; the method is well known and the coefficient is easy to interpret | Cannot assess intra-rater reliability because it requires exactly two independent observers; generally inappropriate for reliability because it measures linearity: a perfect positive correlation (r = 1) can result even when two observers give systematically different scores; cannot be used for categorical CT scores in CF (e.g., presence/absence of bronchiectasis) |
| Bland-Altman analysis | CI (typically 95%) for the mean difference Δ between two sets of observer scores | Depends on the Δ estimate | When used with a Bland-Altman plot, the LOA can reveal systematic differences and variability | Plot is easy to interpret; the statistical approach to obtaining the mean difference and CI is straightforward | Sensitive to the normality assumption; valid only for continuous scores; cannot be used with >2 observers; assumes one method of measurement is the 'gold standard' and is therefore known; requires an understanding of what constitutes an acceptable/unacceptable difference in observer scores |
| ICC/weighted Kappa statistic | Ratio of between-subject variability to total variability, where total variability is the sum of between- and within-subject variability | −1 to 1 | Scaled coefficient ranges adopted in the literature (not widely recognized in statistics) indicate fair (0.4 to 0.6), moderate (0.6 to 0.8), and excellent agreement (0.8 to 1) | Unit-less value; has continuous (ICC) and categorical (Kappa) versions to accommodate different CT subscores, with consistent interpretation for both data types; can assess intra-rater reliability; can accommodate >2 observers | Interpretation for the same CT marker is not consistent across different populations and/or studies; a high estimate does not always reflect excellent agreement; requires that ANOVA assumptions are met |
| CCC | 1 minus the ratio of within-subject squared deviation to total deviation | −1 to 1 | Scaled coefficient may be thought of as a standardized estimate of the mean squared difference between observers | Same advantages as the ICC; relaxes the ANOVA assumption required for the ICC; currently the only reliability statistic endorsed by the Metrics Champion Consortium for Imaging | Estimates often correspond to the ICC and therefore share the same issues with interpretation across multiple studies/populations |
| CP | Proportion of scans with differences in scores that fall within an acceptable threshold | 0 to 1 | Unscaled index expressed as a probability estimate, where values close to 1 indicate higher reliability | Can be computed with a nonparametric approach for any data type; can be used with >2 observers; has consistent interpretation across multiple studies/populations; can pinpoint specific instances of poor reliability | A threshold defining a meaningful difference between observer scores must be set a priori, which can be difficult when studying novel CT markers or scoring systems; this threshold affects the probability estimate |

Abbreviations: Analysis of variance (ANOVA); chest-computed tomography (CT); concordance correlation coefficient (CCC); coverage probability (CP); cystic fibrosis (CF); intra-class correlation coefficient (ICC); limits of agreement (LOA)
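The table's caution about Pearson's r can be illustrated numerically: two observers whose scores differ by a near-constant offset still produce r close to 1, while the Bland-Altman mean difference and 95% limits of agreement expose the systematic shift. Below is a minimal stdlib-only sketch; the eight paired observer scores are hypothetical toy data, not from the study.

```python
import math
import statistics

# Hypothetical scores from two observers reading the same 8 CT scans.
# Observer 2 scores systematically higher by roughly one point.
obs1 = [10.0, 12.5, 8.0, 15.0, 11.0, 9.5, 14.0, 13.0]
obs2 = [11.0, 13.0, 8.5, 16.5, 11.5, 10.5, 15.0, 13.5]

def pearson_r(x, y):
    """Pearson correlation: measures *linear* association only."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x)
                    * sum((b - my) ** 2 for b in y))
    return num / den

def bland_altman(x, y):
    """Mean observer difference and 95% limits of agreement (mean ± 1.96 SD)."""
    diffs = [a - b for a, b in zip(x, y)]
    mean_d = statistics.fmean(diffs)
    sd_d = statistics.stdev(diffs)
    return mean_d, (mean_d - 1.96 * sd_d, mean_d + 1.96 * sd_d)

r = pearson_r(obs1, obs2)
mean_d, (lo, hi) = bland_altman(obs1, obs2)
print(f"Pearson r = {r:.3f}")                      # high despite the offset
print(f"mean difference = {mean_d:.3f}, 95% LOA = ({lo:.3f}, {hi:.3f})")
```

Here r comes out above 0.95 even though observer 2 is consistently higher; the negative mean difference (obs1 − obs2) and its LOA make the bias visible, which is exactly the failure mode of r as a reliability index described in the table.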
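The ICC, CCC, and CP rows of the table can likewise be sketched in a few lines each. The snippet below uses the same kind of hypothetical two-observer toy data; it implements a one-way random-effects ICC from the ANOVA mean squares, Lin's CCC from population moments, and CP as the proportion of scans whose observer difference falls within a chosen threshold (here delta = 1, an arbitrary illustrative choice, since the table notes the threshold must be set a priori).

```python
import statistics

# Hypothetical scores from two observers on the same 8 CT scans.
obs1 = [10.0, 12.5, 8.0, 15.0, 11.0, 9.5, 14.0, 13.0]
obs2 = [11.0, 13.0, 8.5, 16.5, 11.5, 10.5, 15.0, 13.5]

def icc_oneway(x, y):
    """One-way random-effects ICC for k = 2 ratings per subject:
    (MSB - MSW) / (MSB + (k - 1) * MSW), i.e. between-subject
    variability relative to total variability."""
    n, k = len(x), 2
    grand = statistics.fmean(x + y)
    subj_means = [(a + b) / 2 for a, b in zip(x, y)]
    ssb = k * sum((m - grand) ** 2 for m in subj_means)
    ssw = sum((a - m) ** 2 + (b - m) ** 2
              for a, b, m in zip(x, y, subj_means))
    msb = ssb / (n - 1)
    msw = ssw / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

def lin_ccc(x, y):
    """Lin's concordance correlation coefficient (population moments):
    penalizes both poor correlation and location/scale shifts."""
    n = len(x)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return 2 * cov / (vx + vy + (mx - my) ** 2)

def coverage_probability(x, y, delta):
    """CP: proportion of scans whose observer difference is within ±delta."""
    return sum(abs(a - b) <= delta for a, b in zip(x, y)) / len(x)

print(f"ICC          = {icc_oneway(obs1, obs2):.3f}")
print(f"CCC          = {lin_ccc(obs1, obs2):.3f}")
print(f"CP(delta=1)  = {coverage_probability(obs1, obs2, 1.0):.3f}")
```

With these data, 7 of the 8 paired differences lie within ±1, so CP = 0.875; the CP estimate changes if delta changes, which is the threshold sensitivity the table flags as CP's main disadvantage.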