Abstract
Quantitative imaging biomarkers (QIBs) are increasingly used in clinical studies. Because many QIBs are derived through multiple steps in image data acquisition and data analysis, QIB measurements can exhibit large variability, which poses a significant challenge in translating QIBs into clinical trials, and ultimately, clinical practice. Repeatability and reproducibility together constitute the reliability of a QIB measurement. In this article, we review the statistical aspects of repeatability and reproducibility of QIB measurements by introducing methods and metrics for the assessment of QIB repeatability and reproducibility and by illustrating the impact of QIB measurement error on sample size and statistical power calculations, as well as on the predictive performance of a QIB used as a predictive biomarker.
Introduction
Medical imaging modalities such as CT, MRI, and positron emission tomography (PET) are routinely used in clinical practice for disease screening, diagnosis, staging, therapeutic monitoring, evaluation of residual disease, and assessment of disease recurrence. Traditionally, qualitative, contrast-based interpretation of medical images has been the most common radiology practice. With advances in imaging technologies in recent years, imaging metrics that can quantify tissue biological and physiological properties, in addition to those that quantify tissue morphology such as disease size, are increasingly used in research and early phase clinical trials to characterize disease and response to treatment. A recent review 1 by a group of principal investigators from the Quantitative Imaging Network (National Cancer Institute, National Institutes of Health) has called for wider incorporation of quantitative imaging methods into clinical trials, and eventually, clinical practice for evaluation of cancer therapy response. In the emerging era of precision medicine, quantitative imaging biomarkers (QIBs) can be integrated with quantitative biomarkers from genomics, transcriptomics, proteomics, and metabolomics to facilitate patient stratification for individualized treatment strategies and improve treatment outcomes. 2 A QIB is defined as “an objective characteristic derived from an in vivo image measured on a ratio or interval scale as an indicator of normal biological processes, pathogenic processes or a response to a therapeutic intervention.” 3 QIBs can be generally classified into five different types: structural, morphological, textural, functional, and physical property QIBs. 4 Kessler et al 3 have introduced terminology related to QIBs for scientific studies. Study designs and statistical methods used for assessing QIB technical performance have been extensively reviewed. 4–8 Because many QIBs are derived through multiple steps in image data acquisition and data analysis that often involve different manufacturer scanner platforms and different computer algorithms and software tools, QIB measurements can exhibit large variability, which poses a significant challenge in translating QIBs into clinical trials, and ultimately, clinical practice. In order for a QIB and its changes to be interpretable in clinical settings across institutions and clinics for disease characterization and therapy response assessment, it is essential to evaluate the repeatability and reproducibility of the QIB.
Both repeatability and reproducibility constitute the reliability of a QIB measurement. Repeatability refers to the precision of a QIB measured under identical conditions (e.g. using the same measurement procedure, same measurement system, same image analysis algorithm, and same location over a short period of time; also known as the repeatability condition), 3,4 which is mainly a measure of the within-subject variability and the variability caused by the same imaging device over time. On the other hand, reproducibility refers to the precision of a QIB measured under different experimental conditions 3,4 (also known as the reproducibility condition), which is mainly a measure of the variability associated with different measurement systems, imaging methods, study sites, and populations. In recent years, many studies have been conducted to investigate the reliability of different QIBs. For example, Yokoo et al 9 studied the precision of hepatic proton-density fat-fraction measurements by using MRI; Lodge 10 examined the repeatability of the standardized uptake value (SUV) in oncologic 18F-fludeoxyglucose (FDG) PET studies; Shukla-Dave et al 12 reviewed and emphasized the need for assessment of reproducibility and repeatability of MRI QIBs in oncologic studies; Park et al 13 reviewed the challenges in reproducibility of quantitative radiomics metrics; Fedorov et al 14 and Schwier et al 15 assessed the repeatability of radiomics features from multiparametric MRI of small prostate tumors; Hernando et al 16 quantified the reproducibility of MRI-based proton-density fat-fraction measurements in a phantom across vendor platforms at different field strengths; Kalpathy-Cramer et al 17 investigated the repeatability and reproducibility of tumor volume estimates from CT images of lung cancer; Lin et al 18 evaluated the repeatability of 18F-NaF PET–derived SUV metrics; Baumgartner et al 11 studied the repeatability of 18F-FDG PET brain imaging; Jafar et al, 19 Winfield et al, 20 Weller et al, 21 Lecler et al, 22 and Lu et al 23 estimated the repeatability and reproducibility of the apparent diffusion coefficient (ADC) derived from diffusion-weighted MRI; Hagiwara et al 24 studied the repeatability and reproducibility of quantitative relaxometry with a multidynamic multiecho MRI sequence using a phantom and healthy human subjects; Jafari-Khouzani et al 25 appraised the repeatability of brain tumor perfusion measurement using dynamic susceptibility contrast MRI; Han et al 26 studied the repeatability and reproducibility of the ultrasonic attenuation coefficient and backscatter coefficient in the liver; Hagiwara et al 27 reviewed the repeatability of MRI and CT QIBs; Wang et al 28 examined the repeatability and reproducibility of 2D and 3D hepatic MR elastography markers in healthy volunteers; and Olin et al 29 investigated the reproducibility of MR-based attenuation correction factors in PET/MRI and their impact on 18F-FDG PET quantification in patients with non-small cell lung cancer.
In this article, we review the statistical aspects of repeatability and reproducibility of QIBs. We introduce methods and metrics for the assessment of QIB repeatability and reproducibility, and illustrate the impact of QIB measurement error on sample size and statistical power calculations, as well as on the predictive performance when a QIB is used as a predictive biomarker.
Measurement error model
Precision of a QIB is defined as the closeness of agreement between repeated measurements of the QIB, 3 and repeatability and reproducibility comprise different sources of variability that may impact the precision of a given QIB. Measurement error is defined as the difference between a measured quantity and its true value. 30 Any source of variability can cause measurement error in QIB measurements. Though it is critical to identify and obtain valid inference on the impact of every component of variation (see Section 3), we start by introducing a general model of measurement error. Table 1 lists commonly used symbols in this article.
Table 1.
List of symbols and notations
Symbol | Definition
---|---
~ | Distributed as
N(μ, σ²) | Normal distribution with mean μ and variance σ²
df | Degrees of freedom
χ²_df(q) | Quantile function of the chi-square distribution with df degrees of freedom
Var(·) | Variance
Y_itl | Measured QIB value from the lth measurement of a repeated QIB measurement made at time t for subject i
X_it | True QIB value for subject i at time t
n | Number of subjects included in the study
m | Number of replicates for each subject
μ | Mean of the true QIB value X_it
ε_itl | Measurement error
δ_ik | Within-subject error (under repeatability condition)
γ_j | Between-condition error (under reproducibility condition)
(γδ)_ij | Interaction between subject and condition
σ², σ_δ², σ_γ², σ_γδ², σ_X² | Variance of ε_itl, δ_ik, γ_j, (γδ)_ij, and X_it, respectively
ˆ | Represents the estimator of a parameter
H₀ | Null hypothesis
H₁ | Alternative hypothesis
QIBs, quantitative imaging biomarkers.
In a measurement error model, instead of the true QIB value, we can only observe QIB values with errors that are random across different QIB measurements. If the errors are constant for all measurements, then the error is called bias. 3 Since both repeatability and reproducibility mainly concern random errors, we assume that there is no bias and that the random errors are independent and identically distributed with mean equal to zero and variance σ². Thus, the only unknown parameter σ² measures the level of variability, with larger values indicating larger variability or worse precision. Let Y_itl be the measured QIB value from the lth measurement of a repeated QIB measurement made at time t for subject i, with X_it being the corresponding true value; the measurement error model can be expressed as
$$Y_{itl} = \beta_{0t} + \beta_{1} X_{it} + \varepsilon_{itl} \quad (1)$$

where β_0t represents the bias of the QIB measurement and β_1 represents the proportional bias of the QIB measurement. The random error ε_itl is assumed to be normally distributed (other commonly used distributional assumptions are discussed below). Because both bias and proportional bias cannot be identified solely through the QIB measurements, it is usually assumed that these values are constant and known in advance from ground-truth studies such as a phantom study. 31
Because we can always standardize the QIB measurement through (Y_itl − β_0t)/β_1 to remove the effect of bias and proportional bias, without loss of generality we can reasonably assume the setting of no bias and no proportional bias (β_0t = 0 and β_1 = 1), and model (1) becomes

$$Y_{itl} = X_{it} + \varepsilon_{itl} \quad (2)$$
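To make the additive error structure concrete, the following minimal sketch (not from the original article; all parameter values are illustrative) simulates test–retest QIB data under model (2):

```python
# A minimal sketch: simulating QIB measurements under model (2), where the
# observed value is the true value plus zero-mean Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)

n, m = 20, 2                 # subjects, replicates per subject
sigma_X, sigma = 1.0, 0.3    # SD of true values and of measurement error

X = rng.normal(5.0, sigma_X, size=n)                   # true QIB values, one per subject
Y = X[:, None] + rng.normal(0.0, sigma, size=(n, m))   # model (2): Y_ik = X_i + eps_ik

print(Y.shape)  # (20, 2): n subjects by m replicates
```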
Repeatability and reproducibility
Repeatability and reproducibility represent different sources of variability (causing the measurement error ε_itl) in model (1). Repeatability refers to the precision of a QIB measured under identical conditions, while reproducibility refers to the precision of a QIB measured under different experimental conditions. 3,4 Kessler et al 3 recommend identifying and evaluating separately each experimental component that contributes to reproducibility-related variability. However, in many real-life settings, it is very difficult to independently investigate each source of variability under the reproducibility condition. Thus, the statistical model introduced below assumes only a single parameter representing all sources of variability associated with the reproducibility condition.
Following the measurement error model (2), we consider a simplified setting where all measurements for subject i are made within a relatively short time interval so that the true value X_i remains unchanged. That is, let Y_ijk be the kth repeated QIB measurement made on subject i under experimental condition j (different experimental conditions may include different measurement systems, imaging methods, and countries/regions); in a similar fashion to model (1), we employ a linear relationship between Y_ijk and X_i, 4,5,32 but further break down the measurement error into different components of repeatability- and reproducibility-related errors. Similar to model (2), we assume there is no bias or proportional bias, and the model that accounts for both repeatability- and reproducibility-related errors can be written as
$$Y_{ijk} = X_{i} + \delta_{ik} + \gamma_{j} + (\gamma\delta)_{ij} \quad (3)$$
The terms δ_ik, γ_j, and (γδ)_ij represent different components of measurement error caused by within-subject variability (under the repeatability condition), between-condition variability (under the reproducibility condition), and the interaction between subject and condition, respectively, and we assume they follow normal distributions. In general, the random error variances σ_δ², σ_γ², and σ_γδ² are the key performance characteristics used in repeatability and reproducibility studies (see details below).
Repeatability
Many studies have been conducted to investigate the repeatability of QIBs for different imaging modalities, including but not limited to CT, 17,18 MRI, 9,12,14,15,20–25 and PET. 10,11,18,29 Test–retest studies are usually performed to evaluate the repeatability of QIB measurements. These studies usually require each subject to be scanned repeatedly over a short period of time under the assumption that X_i does not change. If the repeatability study condition holds, i.e. all repeated scans are performed at the same location, with the same measurement procedure, and using the same measurement system and image analysis algorithm, an estimate of QIB repeatability can be calculated.
In practice, test–retest studies can be performed fairly easily using a phantom, but can be difficult with human subjects due to expense and logistical problems. Therefore, repeatability studies using human subjects are often limited to a small number of replicates (usually two) for each subject. Furthermore, for imaging studies with contrast administration, because there usually exists a required contrast washout period between two consecutive scans (e.g. consecutive dynamic contrast-enhanced (DCE) MRI scans are usually required to be performed at least 24 h apart), 4,31 “coffee-break” experiments, where there is only a short break between repeated scans, are not always possible. Thus, for repeatability studies with a long interval between scans, possible changes in the true values should also be considered in the model.
Specifically, for a test–retest study with n subjects and m replicates for each subject, since all experiments are conducted under the same experimental condition, without loss of generality we set j = 1 and model (3) becomes
$$Y_{i1k} = X_{i} + \gamma_{1} + \delta_{ik} \quad (4)$$
for i = 1, …, n and k = 1, …, m. The random effect variance σ_δ² is the key performance characteristic used in a repeatability study, with a smaller value corresponding to better repeatability. Because we only consider QIB measurements from a single site and a single measurement system in a test–retest repeatability study, the random error γ_j in (3), which measures reproducibility-related variability such as between-site variability, becomes a condition-specific systematic error γ_1 in (4), representing the condition-specific bias. As discussed in Section 2, the constant bias γ_1 cannot be identified through the test–retest study and should be estimated through a phantom study with ground-truth values. The interaction term (γδ)_ij in model (3) vanishes because there is only one study condition and the interaction effect cannot be observed.
Many metrics related to the estimate of the within-subject variance σ_δ² have been proposed to quantify the magnitude of repeatability. 4 Table 2 shows the metrics considered in this article for repeatability measurement. The within-subject standard deviation (wSD) is the most commonly used metric for assessing repeatability. It is the standard deviation (σ_δ) of repeat measurements for a single subject. If we assume all subjects have the same σ_δ and the ground-truth values X_i are independent and normally distributed, wSD can be obtained by fitting a linear mixed-effects model with subject-specific random intercepts using maximum likelihood. 33 Although other estimation procedures such as method-of-moments estimators can also provide consistent estimation of wSD, the maximum likelihood method is usually preferred for small sample sizes when model (4) is correctly specified. Alternatively, wSD can be calculated by averaging the within-subject sample variances 32
Table 2.
Repeatability and reproducibility metrics
Metric | Definition
---|---
Repeatability metrics |
wSD | Within-subject standard deviation
ICC | Intraclass correlation coefficient, proportion of total variation associated with the variation of true value
wCV | Within-subject coefficient of variation, ratio of the within-subject standard deviation to its mean
Reproducibility metrics |
tSD (reproducibility SD) | Total standard deviation under reproducibility conditions
CCC | Concordance correlation coefficient, measurement agreement between two experimental conditions
$$\widehat{\mathrm{wSD}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\frac{\sum_{k=1}^{m}\left(Y_{i1k}-\bar{Y}_{i1}\right)^{2}}{m-1}} \quad (5)$$

where $\bar{Y}_{i1} = \frac{1}{m}\sum_{k=1}^{m} Y_{i1k}$. This estimator (equation (5)) is equivalent to the estimator obtained from the one-way analysis of variance (ANOVA) model. 34
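A small sketch of estimator (5), assuming a numpy array `Y` of n subjects by m replicates (the data-generating values below are illustrative):

```python
# Sketch of the ANOVA-type wSD estimator in equation (5): pool the
# within-subject sample variances and take the square root.
import numpy as np

def wsd(Y: np.ndarray) -> float:
    """Within-subject SD: sqrt of the mean within-subject sample variance."""
    s2 = Y.var(axis=1, ddof=1)       # per-subject sample variance (m - 1 denominator)
    return float(np.sqrt(s2.mean()))

rng = np.random.default_rng(1)
X = rng.normal(5.0, 1.0, size=50)                       # true values
Y = X[:, None] + rng.normal(0.0, 0.3, size=(50, 2))     # two replicates per subject
print(wsd(Y))  # should be close to the true error SD of 0.3
```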
Another closely related metric for repeatability measurement is the intraclass correlation coefficient (ICC), 35,36 which is defined as the proportion of total variation that is associated with the variation of the true value. That is, if we assume X_i ~ N(μ, σ_X²), we have Var(Y_i1k) = σ_X² + σ_δ² and

$$\mathrm{ICC} = \frac{\sigma_{X}^{2}}{\sigma_{X}^{2} + \sigma_{\delta}^{2}} \quad (6)$$
Comparing the variance of the QIB measurement (denominator of equation (6)) with the variance of the true value X_i (numerator of equation (6)), we can observe that the extra variation equals the within-subject variation σ_δ². If σ_δ² is much smaller than σ_X², then σ_X² + σ_δ² ≈ σ_X² and ICC is close to 1, which indicates that the measurement error contributes little to the variation of the QIB measurement. In other words, a larger ICC implies better repeatability and a smaller ICC implies worse repeatability. ICC can be estimated with known estimates of σ_X² and σ_δ² (equation (6)), both of which can be obtained by fitting either a linear mixed-effects model with subject-specific random intercepts or a one-way ANOVA model. When using ICC as the measure of repeatability, it is crucial to ensure that the subjects participating in the study are representative of the study population so that the estimated QIB variation can well reflect the variation of the study population (variance of X_i).
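The ICC in equation (6) can be estimated from the one-way ANOVA mean squares; the following sketch uses the standard moment estimator ICC = (MSB − MSW)/(MSB + (m − 1)MSW), with illustrative simulated data:

```python
# Sketch of the one-way ANOVA estimator of ICC (equation (6)).
# MSB and MSW are the between- and within-subject mean squares.
import numpy as np

def icc_oneway(Y: np.ndarray) -> float:
    n, m = Y.shape
    grand = Y.mean()
    msb = m * ((Y.mean(axis=1) - grand) ** 2).sum() / (n - 1)                  # between subjects
    msw = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum() / (n * (m - 1))     # within subjects
    return (msb - msw) / (msb + (m - 1) * msw)

rng = np.random.default_rng(2)
X = rng.normal(5.0, 1.0, size=100)                      # sigma_X = 1
Y = X[:, None] + rng.normal(0.0, 0.5, size=(100, 3))    # sigma_delta = 0.5
print(icc_oneway(Y))   # expect roughly 1 / (1 + 0.25) = 0.8
```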
The within-subject coefficient of variation (wCV) (Table 2) is an alternative metric to wSD. The wCV is defined as the ratio of the within-subject standard deviation to its mean, and it is commonly used for test–retest studies of repeatability when wSD is not constant among studied subjects and model (4) becomes inadequate. A useful alternative to model (4) is to assume that the wSD increases proportionally with the true value X_i, i.e.,
$$Y_{i1k} = \gamma_{1} + X_{i}\,\delta_{ik} \quad (7)$$
In model (7), with the extra constraint of δ_ik > 0, it is more adequate to assume that δ_ik follows a log-normal or Weibull distribution 37 with the mean of δ_ik equal to one. Raunig et al 4 suggest using the log-normal distribution so that the log-transformed QIB measurement is normally distributed (after adjusting for the site-specific bias γ_1):
$$\log\left(Y_{i1k} - \gamma_{1}\right) = \log\left(X_{i}\right) + \delta^{*}_{ik}, \qquad \delta^{*}_{ik} = \log\left(\delta_{ik}\right) \sim N\!\left(-\sigma_{*}^{2}/2,\ \sigma_{*}^{2}\right) \quad (8)$$
Under model (8), wCV only depends on the log-transformed within-subject variance σ*² 4 through the form

$$\mathrm{wCV} = \sqrt{\exp\left(\sigma_{*}^{2}\right) - 1} \quad (9)$$
Therefore, we can apply any of the estimators of wSD to the log-transformed QIB measurements to obtain valid estimates of σ*², and wCV can be estimated by plugging the estimate of σ*² into equation (9). Without the log-normal distribution assumption, by mimicking estimator (5), when m = 2 we can still estimate wCV by pooling and averaging the within-subject sample coefficients of variation 32

$$\widehat{\mathrm{wCV}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\frac{\left(Y_{i11}-Y_{i12}\right)^{2}}{2\,\bar{Y}_{i1}^{2}}} \quad (10)$$
Both ICC and wCV have the benefit of being dimensionless, which makes them useful for comparing quantities measured on different scales.
Instead of a single point estimate of the repeatability metric (either wSD, ICC, or wCV), it is desirable to make inference on these values by either constructing confidence intervals (CIs) or performing hypothesis testing. Both confidence intervals and hypothesis testing involve estimating the distribution of the estimator and thus depend on the choice of the estimation method. For wSD (denoted as σ with corresponding estimator σ̂), if ANOVA-type estimators such as equations (5) and (10) are used, df·σ̂²/σ² follows a χ² distribution with a degree of freedom (df) of n(m − 1). Thus, the (1 − α) ∗ 100% CI of σ² is

$$\left(\frac{df\,\hat{\sigma}^{2}}{\chi^{2}_{df}\left(1-\alpha/2\right)},\ \frac{df\,\hat{\sigma}^{2}}{\chi^{2}_{df}\left(\alpha/2\right)}\right) \quad (11)$$

where χ²_df(1 − α/2) and χ²_df(α/2) are the (1 − α/2) and α/2 quantiles, respectively, of the χ² distribution with degree of freedom df. To test whether the level of wSD is greater than a threshold value c, we conduct a hypothesis test with null hypothesis H₀: σ ≤ c vs alternative hypothesis H₁: σ > c. The corresponding test statistic is T = df·σ̂²/c², and we reject the null hypothesis if T > χ²_df(1 − α). On the other hand, if the maximum likelihood estimators are used, making inference based on the asymptotic distribution of σ̂ is usually problematic for small sample sizes, and numerical methods such as the bootstrap CI 38 or profile likelihood CI 39 should be considered. For wCV, a CI of σ*² on the log-transformed data is first determined using formula (11); the CI of wCV can then be obtained based on equation (9). Because the estimator of ICC is a nonlinear function of σ̂_X² and σ̂_δ², its exact sampling distribution is not available. In this case, bootstrap confidence intervals have been extensively used in the literature. 40,41 We can also construct the confidence interval of ICC by approximating its sampling distribution using either the F (also known as the Satterthwaite approximation) or the β distribution. 42 Based on Monte Carlo simulation of the sampling distribution of the generalized pivotal quantity of ICC, Ionan et al 43 suggest the generalized confidence interval proposed by Weerahandi. 44 Sample size calculation can be conducted using the method provided by A’Hern. 36
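A sketch of the chi-square interval (11) and the one-sided test described above, using scipy's chi-square quantile function; the threshold c and the data below are illustrative:

```python
# Sketch: chi-square CI for wSD (equation (11)) and the one-sided test
# H0: sigma <= c vs H1: sigma > c, using the ANOVA-type estimator
# with df = n(m - 1).
import numpy as np
from scipy import stats

def wsd_ci_and_test(Y, alpha=0.05, c=0.4):
    n, m = Y.shape
    df = n * (m - 1)
    sigma2_hat = Y.var(axis=1, ddof=1).mean()        # pooled within-subject variance
    lo = df * sigma2_hat / stats.chi2.ppf(1 - alpha / 2, df)
    hi = df * sigma2_hat / stats.chi2.ppf(alpha / 2, df)
    ci = (np.sqrt(lo), np.sqrt(hi))                  # CI for sigma (wSD)
    t_stat = df * sigma2_hat / c**2                  # test statistic T
    reject = t_stat > stats.chi2.ppf(1 - alpha, df)  # reject if T beyond 1-alpha quantile
    return ci, reject

rng = np.random.default_rng(3)
Y = rng.normal(5.0, 1.0, size=(40, 1)) + rng.normal(0.0, 0.3, size=(40, 2))
print(wsd_ci_and_test(Y))
```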
Reproducibility
Reproducibility concerns the consistency or precision of the QIB measurement made on the same subject with the same experimental design but under different experimental conditions, such as different measurement devices. The reproducibility of QIBs for different imaging modalities, such as CT, 17 MRI, 9,12,16,19,24 and PET, 13,29 has also been extensively studied. Although many experimental factors can be included in the reproducibility study condition, it is practically impossible to consider all conditions in a single reproducibility study. Raunig et al 4 provided a list of conditions that can be tested in reproducibility studies. Depending on which condition is being tested, reproducibility studies can be classified into two categories: (1) repeated measurement design and (2) cohort measurement design. For example, the former can be used to study the variability caused by different scanners, while the latter can be used to study the variability caused by different study sites. Because the within-subject variability is generally embedded in the variability under different experimental conditions, a reproducibility study can generate repeatability results for each experimental condition. Specifically, for a repeated measurement design, subject i is repeatedly measured m times under each of the J experimental conditions; for a cohort measurement design, each subject is repeatedly measured m times under one of the experimental conditions. That is, a repeated measurement design requires each subject to be measured m × J times, while a cohort measurement design requires each subject to be measured only m times. Model (3) is valid for both experimental designs, and the key performance characteristic is the sum of the random effect variances (σ_δ² + σ_γ² + σ_γδ²), which represents the total variation under the reproducibility study condition. However, subjects in the cohort measurement design are only measured under a single experimental condition and thus the subject-condition interaction effect does not exist. In this case, we can assume σ_γδ² = 0, and the total variation for the cohort measurement design becomes σ_δ² + σ_γ².
Similar to repeatability studies, either linear mixed-effects models or two-way ANOVA can be used to fit the data and obtain valid estimates of σ_δ², σ_γ², and σ_γδ². The square root of the total variance σ_δ² + σ_γ² + σ_γδ², denoted as total SD (tSD) (Table 2) or reproducibility SD, can be used as a metric to quantify the magnitude of reproducibility. If estimators based on linear mixed-effects models are considered, the sampling distributions for these estimators are unknown, and numerical methods such as the bootstrap or permutation should be used to make valid inferences on these parameters. On the other hand, if ANOVA-type estimators are considered, the appropriately scaled estimators of all three terms σ_δ², σ_γ², and σ_γδ² follow χ² distributions, and the corresponding CIs and test statistics can be easily obtained. However, because the sampling distribution of the tSD estimator, which equals the square root of σ̂_δ² + σ̂_γ² + σ̂_γδ², is unknown, numerical methods are recommended.
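For a repeated measurement design, the variance components of model (3) can be estimated by the method of moments from the two-way random-effects ANOVA mean squares. The following sketch (our own illustration, with negative estimates truncated at zero) assumes a numpy array of shape (n, J, m):

```python
# Sketch: method-of-moments estimates of sigma_delta^2 (within-subject),
# sigma_gamma^2 (between-condition), and sigma_gammadelta^2 (interaction)
# from the two-way random-effects ANOVA mean squares.
import numpy as np

def variance_components(Y: np.ndarray):
    n, J, m = Y.shape
    y_i = Y.mean(axis=(1, 2)); y_j = Y.mean(axis=(0, 2))
    y_ij = Y.mean(axis=2);     g = Y.mean()
    msb = m * n * ((y_j - g) ** 2).sum() / (J - 1)                  # conditions
    msab = m * ((y_ij - y_i[:, None] - y_j[None, :] + g) ** 2).sum() / ((n - 1) * (J - 1))
    mse = ((Y - y_ij[:, :, None]) ** 2).sum() / (n * J * (m - 1))   # replicates
    s2_delta = mse                                  # within-subject
    s2_gd = max((msab - mse) / m, 0.0)              # interaction
    s2_gamma = max((msb - msab) / (m * n), 0.0)     # between-condition
    return s2_delta, s2_gamma, s2_gd

rng = np.random.default_rng(4)
n, J, m = 30, 3, 2
X = rng.normal(5, 1, n)[:, None, None]
gamma = rng.normal(0, 0.4, J)[None, :, None]
gd = rng.normal(0, 0.2, (n, J))[:, :, None]
Y = X + gamma + gd + rng.normal(0, 0.3, (n, J, m))
s2d, s2g, s2gd = variance_components(Y)
print(np.sqrt(s2d + s2g + s2gd))   # tSD (reproducibility SD)
```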
In a scenario where only two experimental conditions are being compared, e.g. comparing two scanner platforms, the variance of the interaction term (γδ)_ij (σ_γδ²) is no longer estimable (only σ_δ² and σ_γ² remain in the model). We then use the agreement between these two experimental conditions as the measure of reproducibility. For such a situation, where m = 1 for a repeated measurement design, Lin 45 proposed the concordance correlation coefficient (CCC) (Table 2) as an agreement measure of reproducibility, 46 defined as

$$\mathrm{CCC} = \frac{2\rho\,\sigma_{1}\sigma_{2}}{\sigma_{1}^{2}+\sigma_{2}^{2}+\left(\mu_{1}-\mu_{2}\right)^{2}}$$

to evaluate the QIB agreement between the two experimental conditions, where μ_1 and μ_2 are the means and σ_1 and σ_2 are the standard deviations of the measured QIB under experimental conditions 1 and 2, respectively, and ρ is the Pearson correlation between the measured QIB values Y_i1 and Y_i2. Similar to the Pearson correlation, CCC also ranges between −1 and 1, with values close to 1 (or −1) representing good concordance (or good discordance) and 0 representing no correlation.
The method introduced by Lin 45 is commonly used to estimate CCC and its CI. 47 Lin and Williamson 48 provide a simple method to perform sample size calculation for CCC.
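A sketch of the sample CCC for paired measurements from two conditions, following the definition above (the simulated, slightly miscalibrated second condition is illustrative):

```python
# Sketch of the sample concordance correlation coefficient (CCC).
import numpy as np

def ccc(y1: np.ndarray, y2: np.ndarray) -> float:
    mu1, mu2 = y1.mean(), y2.mean()
    s1, s2 = y1.var(), y2.var()                   # 1/n variances, as in Lin's estimator
    cov = ((y1 - mu1) * (y2 - mu2)).mean()        # 2*rho*sigma1*sigma2 = 2*cov
    return 2 * cov / (s1 + s2 + (mu1 - mu2) ** 2)

rng = np.random.default_rng(5)
x = rng.normal(5, 1, 100)                         # true values
y1 = x + rng.normal(0, 0.3, 100)                  # condition 1
y2 = 0.9 * x + 0.5 + rng.normal(0, 0.3, 100)      # condition 2, slightly miscalibrated
print(ccc(y1, y2))
```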
Plots for repeatability and reproducibility studies
Various plots can be used to visually study the impact of measurement errors in repeatability and reproducibility studies. For repeatability studies, although box-whisker plots can provide information on the variability for each subject, 4 they are usually not feasible for test–retest studies, since only a few repeated measurements (usually two) are available for each subject. Bland–Altman plots 49 are commonly included in repeatability studies to visualize trends in variability across the measurement range. Figure 1 demonstrates Bland–Altman plots based on 100 simulated data points with X generated uniformly at random on the interval (0, 5). When gold-standard or reference values are available (see the setting described in Obuchowski and Buckler 32), the differences between the QIB measurements and the corresponding reference values are plotted against the reference values in Bland–Altman plots. Figure 1a illustrates the case of additive error (model (4) with σ = 0.8), while Figure 1b shows the case of multiplicative error (model (7) with σ = 0.3). When reference values are not available, e.g. in test–retest studies, the standard deviations (or differences in the case of m = 2) of repeated measurements are plotted against the averages of repeated measurements in Bland–Altman plots. Figure 1c and d show the cases of additive and multiplicative errors based on two repeated measurements and the same values of σ as in Figure 1a and b, respectively. When multiplicative errors are observed (e.g. in Figure 1b and d), log-transformed QIB measurements should be considered, as suggested in Section 3.1.
Figure 1.
Bland–Altman plot examples for (a) reference value available with constant measurement error variance, (b) reference value available with increasing measurement error variance, (c) reference value not available with constant measurement error variance, and (d) reference value not available with increasing measurement error variance.
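The test–retest variant of the Bland–Altman plot described above can be reproduced with a few lines of matplotlib; the simulation settings below mimic the additive-error case of Figure 1c but are otherwise illustrative:

```python
# Sketch: Bland-Altman plot for a test-retest study (reference values
# unavailable, m = 2): per-subject differences against per-subject means,
# with mean-difference and 95% limits-of-agreement lines.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
x = rng.uniform(0, 5, 100)                       # true values on (0, 5)
y1 = x + rng.normal(0, 0.8, 100)                 # replicate 1, additive error
y2 = x + rng.normal(0, 0.8, 100)                 # replicate 2

mean, diff = (y1 + y2) / 2, y1 - y2
md, sd = diff.mean(), diff.std(ddof=1)

plt.scatter(mean, diff, s=12)
plt.axhline(md, color="k")                       # mean difference
for loa in (md - 1.96 * sd, md + 1.96 * sd):     # 95% limits of agreement
    plt.axhline(loa, color="k", linestyle="--")
plt.xlabel("Mean of repeated measurements")
plt.ylabel("Difference of repeated measurements")
plt.show()
```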
When only two experimental conditions are being compared, Bland–Altman plots can also be used in reproducibility studies, especially when gold-standard or reference values are not available. 4 In addition, a scatter plot of experimental condition 1 vs condition 2 with a fitted regression line can provide a useful visualization of the agreement between the two methods (Figure 2). When there exist more than two experimental conditions, we can follow the same procedure as above for each pair of experimental conditions.
Figure 2.
Scatter plot of a two-method comparison (slope = 0.831, intercept = 0.335).
Examples of the impact of QIB measurement errors on clinical studies
QIB as trial end point
QIBs can serve as a clinical trial end point to assess treatment efficacy, where subjects enrolled in the study are scanned before and after treatment, and the difference of the mean QIB measurements over the treatment course is used to determine the efficacy of the treatment. Since it is often nearly impossible to perform repeated measurements at a single time point in a longitudinal study, it is difficult to assess repeatability- and/or reproducibility-related QIB measurement errors. Following model (2), for a setting without repeated measurements, let Y_i1 and Y_i2 be the QIB measurements (or log-transformed QIB measurements) before and after intervention for subject i, respectively, and we further assume that the corresponding true value X_it follows a normal distribution with mean μ_t and variance σ_X² for t = 1, 2 and i = 1, …, n. The common approach to assess treatment efficacy is to test whether the mean difference Δ = μ_2 − μ_1 is greater than a threshold value Δ_0 so that the difference is practically meaningful (i.e. null hypothesis H_0: Δ ≤ Δ_0 vs alternative hypothesis H_1: Δ > Δ_0). Under model (2), the corresponding test statistic Z is

$$Z = \frac{\bar{Y}_{2}-\bar{Y}_{1}-\Delta_{0}}{\sqrt{2\left(\sigma_{X}^{2}+\sigma^{2}\right)\left(1-\rho\right)/n}}$$

where Ȳ_1 and Ȳ_2 are the sample means of the measurements before and after intervention, and ρ is the Pearson correlation (ranging between 0 and 1) between Y_i1 and Y_i2. Thus, for a significance level of α, the minimum sample size required to achieve power β is

$$n = \frac{2\left(\sigma_{X}^{2}+\sigma^{2}\right)\left(1-\rho\right)\left(z_{1-\alpha}+z_{\beta}\right)^{2}}{\left(\Delta-\Delta_{0}\right)^{2}} \quad (12)$$
where z_{1−α} and z_β are the (1 − α) and β quantiles of the standard normal distribution. For example, under the setting of Δ = 0.5 and Δ_0 = 0, which represents a 50% change against no change in hypothesis testing if log-transformed QIBs are considered, Figure 3 illustrates the required sample sizes to achieve 80% power with a significance level of 5% for different values of σ, σ_X, and ρ. From both Figure 3 and equation (12), we can see that the required sample size is an increasing function of σ and σ_X, and a decreasing function of ρ.
Figure 3.
Sample size required to achieve 80% power with a significance level of 5% against the measurement error standard deviation σ. The true difference Δ is 0.5, commonly representing a 50% change for log-transformed QIBs, with Δ_0 = 0 and different values of σ_X and ρ. QIB, quantitative imaging biomarker.
For a longitudinal study, the correlation parameter ρ measures the level of dependence between the QIB measurements before and after treatment, with a longer time interval between the measurements generally resulting in a smaller ρ. Obuchowski et al 6 showed that the range of ρ, which depends on the time interval between the two measurements, is between 0 and σ_X²/(σ_X² + σ²). Using equation (12), the corresponding range of the required sample size is from 2σ²(z_{1−α} + z_β)²/(Δ − Δ_0)² to 2(σ_X² + σ²)(z_{1−α} + z_β)²/(Δ − Δ_0)². Although the required sample size is a decreasing function of ρ, and ρ is a decreasing function of the time interval between the measurements, we cannot jump to the conclusion that studies with a smaller time interval require smaller sample sizes. This is because a smaller time interval usually also results in a smaller difference between μ_1 and μ_2.
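Equation (12) translates directly into a small helper; the function name and the example parameter values below are our own illustration:

```python
# Sketch of the sample size formula in equation (12) for a pre/post design
# with a QIB end point. alpha = significance level, power = target power.
import math
from scipy.stats import norm

def sample_size(delta, delta0, sigma_x, sigma, rho, alpha=0.05, power=0.80):
    z = norm.ppf(1 - alpha) + norm.ppf(power)            # z_{1-alpha} + z_beta
    var_diff = 2 * (sigma_x**2 + sigma**2) * (1 - rho)   # Var(Y_i2 - Y_i1)
    return math.ceil(var_diff * z**2 / (delta - delta0) ** 2)

# 50% change on the log scale (delta = 0.5) vs no change (delta0 = 0)
print(sample_size(delta=0.5, delta0=0.0, sigma_x=0.8, sigma=0.3, rho=0.5))
```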
QIB as predictive biomarker
In addition to serving as clinical trial end points, QIBs can also be used as predictive biomarkers for early prediction of treatment effect, or as intermediate end points in multiarm, multistage trials. 50 Under this scenario, it is usually assumed that the true value of the QIB is associated with the primary trial end point, but we only measure the QIB with error (see model (1)). Many methods have been proposed to adjust for measurement error in covariates in regression models. 51–53 However, using such adjustments requires knowledge of the distribution of measurement errors, which can only be obtained from additional studies such as repeatability or reproducibility studies. This requirement, on the one hand, emphasizes the importance of repeatability and reproducibility studies; on the other hand, it may not always be met, and the standard approach that ignores the measurement error is then conducted. 54 In this study, we use simulation to illustrate the impact of measurement error when the standard approach is used.
Because the sample sizes in published studies where QIBs were used as predictive biomarkers are usually small, it is difficult to analytically evaluate the impact of measurement error on QIBs as predictive biomarkers. Based on the study design and assumptions, Monte Carlo simulations can be used to numerically approximate the impact of measurement error. For illustration purposes, we designed our simulations based on the study by Tudorica et al, 54 where DCE-MRI QIBs were used for early prediction of breast cancer response [pathologic complete response (pCR) vs non-pCR] to neoadjuvant chemotherapy (NACT). We denote Z_i as the indicator of pCR and X_i as the true value of a DCE-MRI QIB for subject i. The true DCE-MRI QIB X_i was generated from a normal distribution. Tudorica et al 54 provided a list of DCE-MRI QIBs with their means and SDs for pCR and non-pCR patients. Here, we considered the percent change in the QIB K trans (transfer rate constant) after the first cycle of NACT relative to baseline (pCR: mean = −64%, SD = 9%; non-pCR: mean = −14%, SD = 41%), which showed the best predictive performance for pCR vs non-pCR in that study. 54 Consistent with the sample size of the study, 54 we included a total of 28 subjects in this simulation study. Here, without loss of generality, we assumed that the first five subjects are pCR patients and the remaining subjects are non-pCR patients. As noted above, we can only observe the DCE-MRI QIB with error (model (2)). The standard approach is to fit a univariate logistic regression model using the observed DCE-MRI QIB Y_i as the covariate, i.e.
$$\mathrm{logit}\left\{\Pr\left(Z_{i}=1\right)\right\} = \alpha_{0} + \alpha_{1} Y_{i} \quad (13)$$
The area under the receiver operating characteristic curve (AUC) was used to evaluate the QIB predictive performance. Sample size calculation for AUC can be conducted using the formula provided by Obuchowski et al. 55 Because the effect of measurement error on AUC is still not clear, we performed a simulation study to evaluate this effect. The true K trans percent change values were repeatedly generated 1000 times, and the average AUC across these 1000 simulated data sets and the corresponding 95% CIs were calculated. Figure 4 illustrates the average AUCs against different values of σ, the measurement error standard deviation in K trans percent change. Our simulation results show that the predictive performance as measured by the average AUC decreases, and the length of the 95% CIs increases, with increased measurement error (σ).
Figure 4.
Average AUC (solid black curve) and the corresponding 95% confidence intervals (dashed blue curves) against different values of σ, the measurement error standard deviation. Results were obtained based on 1000 simulated data sets. AUC, area under the receiver operating characteristic curve.
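A sketch of the Monte Carlo experiment described above. Because the univariate logistic model (13) is monotone in its single covariate, the apparent AUC equals the AUC computed directly from the observed values, which we obtain via the Mann–Whitney statistic; the group means and SDs follow the K trans percent-change values quoted in the text, while the seed and σ grid are our own illustration:

```python
# Sketch: average apparent AUC of the observed QIB as measurement error grows.
# pCR: -64 +/- 9; non-pCR: -14 +/- 41; 5 pCR and 23 non-pCR subjects.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(7)
n_pcr, n_non = 5, 23

def average_auc(sigma_err: float, n_sim: int = 1000) -> float:
    aucs = []
    for _ in range(n_sim):
        x_pcr = rng.normal(-64, 9, n_pcr)                  # true percent change, pCR
        x_non = rng.normal(-14, 41, n_non)                 # true percent change, non-pCR
        y_pcr = x_pcr + rng.normal(0, sigma_err, n_pcr)    # observed with error
        y_non = x_non + rng.normal(0, sigma_err, n_non)
        # AUC = P(non-pCR value > pCR value), since pCR values are lower
        u = mannwhitneyu(y_non, y_pcr).statistic
        aucs.append(u / (n_pcr * n_non))
    return float(np.mean(aucs))

for s in (0, 20, 40):
    print(s, round(average_auc(s), 3))   # AUC shrinks as sigma_err grows
```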
Discussion
In this review article, we provided a general introduction to the study designs, statistical models, and statistical metrics that can be used to assess the repeatability and reproducibility of QIB measurements. We also illustrated the impact of repeatability- and reproducibility-related QIB measurement errors on QIB applications, e.g. on sample size calculation when a QIB is used as a clinical trial end point.
The statistical models presented here assume that the measurement errors are normally distributed and independent of the true QIB values. If the measurement errors increase in proportion to the true QIB values, i.e. multiplicative errors (see model (7)), log-transformed QIB values can be used, as the relationship between error and true value becomes additive after the transformation. In practice, when QIB measurements have values equal or close to zero, we can add a small constant to all QIB values before taking the log-transformation. For more complex error structures, such as non-Gaussian or heterogeneous measurement errors, the statistical methods introduced in this article can provide a reasonable approximation of the repeatability and reproducibility metrics of interest, e.g. wSD and ICC, but statistical inferences on these estimates can be biased and may lead to false conclusions.
Test–retest studies are commonly used to study the repeatability and reproducibility of a QIB, where each object, e.g. a phantom or a human subject, is repeatedly measured. This approach may sometimes be impractical for human subject studies for reasons such as cost, time, and the invasiveness of the imaging scan. As an alternative strategy, Obuchowski and Buckler 32 proposed a method to estimate the measurement error under the repeatability condition when a test–retest study is not feasible. The method requires that a reference (gold-standard) value be available for each subject. By assuming the reference value to be the true QIB value X_i, it can serve as the second measured value in estimators (5) and (10) for wSD or wCV estimation; for example, treating the reference value as the second replicate in estimator (5) gives

$$\widehat{\mathrm{wSD}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\frac{\left(Y_{i}-X_{i}\right)^{2}}{2}}$$
Repeatability and reproducibility can be assessed in the same study. It is possible to study repeatability in a restricted subset of a reproducibility study to verify that repeatability is acceptable, e.g. in an initial subset of subjects entering the study, to confirm that the study is worth pursuing.
There is an increasing need to accelerate the clinical translation of QIBs. However, significant challenges remain. Using solid tumor therapy response as an example, the 1D imaging tumor size measurement based on the RECIST (Response Evaluation Criteria In Solid Tumors) 1.1 guidelines 56 is the only widely used QIB in today’s standard of care and clinical trials. Many QIBs that interrogate tumor biology and physiology, and thus are well suited for evaluation of response to increasingly used and effective molecularly targeted therapies, have proven difficult to translate into clinical trials and practice. This is mainly due to the variabilities in quantifying these QIB parameter values caused by differences in vendor imaging platforms, imaging data acquisition methods, and imaging data analysis algorithms and software tools. Because of the lack of sufficient repeatability and reproducibility studies to understand the variabilities of these functional QIBs, unlike the RECIST tumor size measurement, there is currently no consensus on the magnitudes of changes in these QIBs for defining clinical response end points such as complete response, stable disease, etc. In order to establish a path to clinical translation for functional QIBs, there is a clear need not only for standardization of data acquisition and analysis to minimize variability, 1 but also for more effort in the assessment of QIB repeatability and reproducibility. 1 It is our hope that the statistical tools presented in this article may contribute to this endeavor.
Footnotes
Acknowledgements: We thank the anonymous reviewers for their insightful comments.
Competing interests: None.
Funding: R01 CA248192, National Institutes of Health, U.S.A.
REFERENCES
- 1. Yankeelov TE, Mankoff DA, Schwartz LH, Lieberman FS, Buatti JM, Mountz JM, et al. Quantitative imaging in cancer clinical trials. Clin Cancer Res 2016; 22: 284–90. doi: 10.1158/1078-0432.CCR-14-3336
- 2. Pinker K, Chin J, Melsaether AN, Morris EA, Moy L. Precision medicine and radiogenomics in breast cancer: new approaches toward diagnosis and treatment. Radiology 2018; 287: 732–47. doi: 10.1148/radiol.2018172171
- 3. Kessler LG, Barnhart HX, Buckler AJ, Choudhury KR, Kondratovich MV, Toledano A, et al. The emerging science of quantitative imaging biomarkers terminology and definitions for scientific studies and regulatory submissions. Stat Methods Med Res 2015; 24: 9–26. doi: 10.1177/0962280214537333
- 4. Raunig DL, McShane LM, Pennello G, Gatsonis C, Carson PL, Voyvodic JT, et al. Quantitative imaging biomarkers: a review of statistical methods for technical performance assessment. Stat Methods Med Res 2015; 24: 27–67. doi: 10.1177/0962280214537344
- 5. Obuchowski NA, Reeves AP, Huang EP, Wang X-F, Buckler AJ, Kim HJG, et al. Quantitative imaging biomarkers: a review of statistical methods for computer algorithm comparisons. Stat Methods Med Res 2015; 24: 68–106. doi: 10.1177/0962280214537390
- 6. Obuchowski NA, Mozley PD, Matthews D, Buckler A, Bullen J, Jackson E. Statistical considerations for planning clinical trials with quantitative imaging biomarkers. J Natl Cancer Inst 2019; 111: 19–26. doi: 10.1093/jnci/djy194
- 7. Sullivan DC, Obuchowski NA, Kessler LG, Raunig DL, Gatsonis C, Huang EP, et al. Metrology standards for quantitative imaging biomarkers. Radiology 2015; 277: 813–25. doi: 10.1148/radiol.2015142202
- 8. Huang EP, Wang X-F, Choudhury KR, McShane LM, Gönen M, Ye J, et al. Meta-analysis of the technical performance of an imaging procedure: guidelines and statistical methodology. Stat Methods Med Res 2015; 24: 141–74. doi: 10.1177/0962280214537394
- 9. Yokoo T, Serai SD, Pirasteh A, Bashir MR, Hamilton G, Hernando D, et al. Linearity, bias, and precision of hepatic proton density fat fraction measurements by using MR imaging: a meta-analysis. Radiology 2018; 286: 486–98. doi: 10.1148/radiol.2017170550
- 10. Lodge MA. Repeatability of SUV in oncologic 18F-FDG PET. J Nucl Med 2017; 58: 523–32.
- 11. Baumgartner R, Joshi A, Feng D, Zanderigo F, Ogden RT. Statistical evaluation of test-retest studies in PET brain imaging. EJNMMI Res 2018; 8: 13. doi: 10.1186/s13550-018-0366-8
- 12. Shukla-Dave A, Obuchowski NA, Chenevert TL, Jambawalikar S, Schwartz LH, Malyarenko D, et al. Quantitative imaging biomarkers alliance (QIBA) recommendations for improved precision of DWI and DCE-MRI derived biomarkers in multicenter oncology trials. J Magn Reson Imaging 2019; 49: e101–21. doi: 10.1002/jmri.26518
- 13. Park JE, Park SY, Kim HJ, Kim HS. Reproducibility and generalizability in radiomics modeling: possible strategies in radiologic and statistical perspectives. Korean J Radiol 2019; 20: 1124. doi: 10.3348/kjr.2018.0070
- 14. Fedorov A, Vangel MG, Tempany CM, Fennessy FM. Multiparametric magnetic resonance imaging of the prostate: repeatability of volume and apparent diffusion coefficient quantification. Invest Radiol 2017; 52: 538.
- 15. Schwier M, Griethuysen J, Vangel MG, Pieper S, Peled S, Tempany C, et al. Repeatability of multiparametric prostate MRI radiomics features. Sci Rep 2019; 9: 1–16. doi: 10.1038/s41598-019-45766-z
- 16. Hernando D, Sharma SD, Aliyari Ghasabeh M, Alvis BD, Arora SS, Hamilton G, et al. Multisite, multivendor validation of the accuracy and reproducibility of proton-density fat-fraction quantification at 1.5T and 3T using a fat-water phantom. Magn Reson Med 2017; 77: 1516–24. doi: 10.1002/mrm.26228
- 17. Kalpathy-Cramer J, Zhao B, Goldgof D, Gu Y, Wang X, Yang H, et al. A comparison of lung nodule segmentation algorithms: methods and results from a multi-institutional study. J Digit Imaging 2016; 29: 476–87. doi: 10.1007/s10278-016-9859-z
- 18. Lin C, Bradshaw T, Perk T, Harmon S, Eickhoff J, Jallow N, et al. Repeatability of quantitative 18F-NaF PET: a multicenter study. J Nucl Med 2016; 57: 1872–79. doi: 10.2967/jnumed.116.177295
- 19. Jafar MM, Parsai A, Miquel ME. Diffusion-weighted magnetic resonance imaging in cancer: reported apparent diffusion coefficients, in-vitro and in-vivo reproducibility. World J Radiol 2016; 8: 21–49. doi: 10.4329/wjr.v8.i1.21
- 20. Winfield JM, Tunariu N, Rata M, Miyazaki K, Jerome NP, Germuska M, et al. Extracranial soft-tissue tumors: repeatability of apparent diffusion coefficient estimates from diffusion-weighted MR imaging. Radiology 2017; 284: 88–99. doi: 10.1148/radiol.2017161965
- 21. Weller A, Papoutsaki MV, Waterton JC, Chiti A, Stroobants S, Kuijer J, et al. Diffusion-weighted (DW) MRI in lung cancers: ADC test-retest repeatability. Eur Radiol 2017; 27: 4552–62. doi: 10.1007/s00330-017-4828-6
- 22. Lecler A, Savatovsky J, Balvay D, Zmuda M, Sadik J-C, Galatoire O, et al. Repeatability of apparent diffusion coefficient and intravoxel incoherent motion parameters at 3.0 Tesla in orbital lesions. Eur Radiol 2017; 27: 5094–5103. doi: 10.1007/s00330-017-4933-6
- 23. Lu Y, Hatzoglou V, Banerjee S, Stambuk HE, Gonen M, Shankaranarayanan A, et al. Repeatability investigation of reduced field-of-view diffusion-weighted magnetic resonance imaging on thyroid glands. J Comput Assist Tomogr 2015; 39: 334–39. doi: 10.1097/RCT.0000000000000227
- 24. Hagiwara A, Hori M, Cohen-Adad J, Nakazawa M, Suzuki Y, Kasahara A, et al. Linearity, bias, intrascanner repeatability, and interscanner reproducibility of quantitative multidynamic multiecho sequence for rapid simultaneous relaxometry at 3 T: a validation study with a standardized phantom and healthy controls. Invest Radiol 2019; 54: 39–47. doi: 10.1097/RLI.0000000000000510
- 25. Jafari-Khouzani K, Emblem KE, Kalpathy-Cramer J, Bjørnerud A, Vangel MG, Gerstner ER, et al. Repeatability of cerebral perfusion using dynamic susceptibility contrast MRI in glioblastoma patients. Transl Oncol 2015; 8: 137–46. doi: 10.1016/j.tranon.2015.03.002
- 26. Han A, Andre MP, Deiranieh L, Housman E, Erdman JW Jr, Loomba R, et al. Repeatability and reproducibility of the ultrasonic attenuation coefficient and backscatter coefficient measured in the right lobe of the liver in adults with known or suspected nonalcoholic fatty liver disease. J Ultrasound Med 2018; 37: 1913–27. doi: 10.1002/jum.14537
- 27. Hagiwara A, Fujita S, Ohno Y, Aoki S. Variability and standardization of quantitative imaging: monoparametric to multiparametric quantification, radiomics, and artificial intelligence. Invest Radiol 2020; 55: 601–16. doi: 10.1097/RLI.0000000000000666
- 28. Wang K, Manning P, Szeverenyi N, Wolfson T, Hamilton G, Middleton MS, et al. Repeatability and reproducibility of 2D and 3D hepatic MR elastography with rigid and flexible drivers at end-expiration and end-inspiration in healthy volunteers. Abdom Radiol (NY) 2017; 42: 2843–54. doi: 10.1007/s00261-017-1206-4
- 29. Olin A, Ladefoged CN, Langer NH, Keller SH, Löfgren J, Hansen AE, et al. Reproducibility of MR-based attenuation maps in PET/MRI and the impact on PET quantification in lung cancer. J Nucl Med 2018; 59: 999–1004. doi: 10.2967/jnumed.117.198853
- 30. Fuller WA. Measurement Error Models. John Wiley & Sons; 2009.
- 31. Obuchowski NA, Bullen J. Quantitative imaging biomarkers: effect of sample size and bias on confidence interval coverage. Stat Methods Med Res 2018; 27: 3139–50. doi: 10.1177/0962280217693662
- 32. Obuchowski NA, Buckler AJ. Estimating the precision of quantitative imaging biomarkers without test-retest studies. Acad Radiol 2022; 29: 543–49. doi: 10.1016/j.acra.2021.06.009
- 33. Gałecki A, Burzykowski T. Linear Mixed-Effects Models Using R. New York, NY: Springer; 2013, pp. 245–73. doi: 10.1007/978-1-4614-3900-4
- 34. Girden ER. ANOVA. Thousand Oaks, CA: Sage; 1992. doi: 10.4135/9781412983419
- 35. Killip S, Mahfoud Z, Pearce K. What is an intracluster correlation coefficient? Crucial concepts for primary care researchers. Ann Fam Med 2004; 2: 204–8. doi: 10.1370/afm.141
- 36. A’Hern RP. Employing multiple synchronous outcome samples per subject to improve study efficiency. BMC Med Res Methodol 2021; 21: 211. doi: 10.1186/s12874-021-01414-7
- 37. Rinne H. The Weibull Distribution. CRC Press; 2008. doi: 10.1201/9781420087444
- 38. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. CRC Press; 1994. doi: 10.1201/9780429246593
- 39. Cole SR, Chu H, Greenland S. Maximum likelihood, profile likelihood, and penalized likelihood: a primer. Am J Epidemiol 2014; 179: 252–60. doi: 10.1093/aje/kwt245
- 40. Cook JA, Bruckner T, MacLennan GS, Seiler CM. Clustering in surgical trials: database of intracluster correlations. Trials 2012; 13: 2. doi: 10.1186/1745-6215-13-2
- 41. Thompson DM, Fernald DH, Mold JW. Intraclass correlation coefficients typical of cluster-randomized studies: estimates from the Robert Wood Johnson Prescription for Health projects. Ann Fam Med 2012; 10: 235–40.
- 42. Demetrashvili N, Wit EC, Heuvel ER. Confidence intervals for intraclass correlation coefficients in variance components models. Stat Methods Med Res 2016; 25: 2359–76. doi: 10.1177/0962280214522787
- 43. Ionan AC, Polley MYC, McShane LM, Dobbin KK. Comparison of confidence interval methods for an intra-class correlation coefficient (ICC). BMC Med Res Methodol 2014; 14: 121. doi: 10.1186/1471-2288-14-121
- 44. Weerahandi S. Generalized confidence intervals. J Am Stat Assoc 1993; 88: 899–905. doi: 10.1080/01621459.1993.10476355
- 45. Lin LI-K. A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989; 45: 255–68. doi: 10.2307/2532051
- 46. Huang W, Li X, Chen Y, Li X, Chang M-C, Oborski MJ, et al. Variations of dynamic contrast-enhanced magnetic resonance imaging in evaluation of breast cancer therapy response: a multicenter data analysis challenge. Transl Oncol 2014; 7: 153–66. doi: 10.1593/tlo.13838
- 47. Chen CC, Barnhart HX. Comparison of ICC and CCC for assessing agreement for data without and with replications. Comput Stat Data Anal 2008; 53: 554–64. doi: 10.1016/j.csda.2008.09.026
- 48. Lin HM, Williamson JM. A simple approach for sample size calculation for comparing two concordance correlation coefficients estimated on the same subjects. J Biopharm Stat 2015; 25: 1145–60. doi: 10.1080/10543406.2014.971163
- 49. Altman DG, Bland JM. Measurement in medicine: the analysis of method comparison studies. The Statistician 1983; 32: 307–17. doi: 10.2307/2987937
- 50. Millen GC, Yap C. Adaptive trial designs: what are multiarm, multistage trials? Arch Dis Child Educ Pract Ed 2020: 376–78.
- 51. Yang M, Adomavicius G, Burtch G, Ren Y. Mind the gap: accounting for measurement error and misclassification in variables generated via data mining. Information Systems Research 2018; 29: 4–24. doi: 10.1287/isre.2017.0727
- 52. Chesher A. The effect of measurement error. Biometrika 1991; 78: 451–62. doi: 10.1093/biomet/78.3.451
- 53. Carroll RJ, Ruppert D, Stefanski LA. Measurement Error in Nonlinear Models. Boston, MA: CRC Press; 2018. doi: 10.1007/978-1-4899-4477-1
- 54. Tudorica A, Oh KY, Chui SY-C, Roy N, Troxell ML, Naik A, et al. Early prediction and evaluation of breast cancer response to neoadjuvant chemotherapy using quantitative DCE-MRI. Transl Oncol 2016; 9: 8–17. doi: 10.1016/j.tranon.2015.11.016
- 55. Obuchowski NA, Lieber ML, Wians FH Jr. ROC curves in clinical chemistry: uses, misuses, and possible solutions. Clin Chem 2004; 50: 1118–25. doi: 10.1373/clinchem.2004.031823
- 56. Eisenhauer EA, Therasse P, Bogaerts J, Schwartz LH, Sargent D, Ford R, et al. New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). Eur J Cancer 2009; 45: 228–47.