Abstract
Objective evaluation of quantitative imaging (QI) methods with patient data is highly desirable, but is hindered by the lack or unreliability of an available gold standard. To address this issue, techniques that can evaluate QI methods without access to a gold standard are being actively developed. These techniques assume that the true and measured values are linearly related by a slope, bias, and Gaussian-distributed noise term, where the noise between measurements made by different methods is independent of each other. However, this noise arises in the process of measuring the same quantitative value, and thus can be correlated. To address this limitation, we propose a no-gold-standard evaluation (NGSE) technique that models this correlated noise by a multi-variate Gaussian distribution parameterized by a covariance matrix. We derive a maximum-likelihood-based approach to estimate the parameters that describe the relationship between the true and measured values, without any knowledge of the true values. We then use the estimated slopes and diagonal elements of the covariance matrix to compute the noise-to-slope ratio (NSR) to rank the QI methods on the basis of precision. The proposed NGSE technique was evaluated with multiple numerical experiments. Our results showed that the technique reliably estimated the NSR values and yielded accurate rankings of the considered methods for 83% of 160 trials. In particular, the technique correctly identified the most precise method for ∼ 97% of the trials. Overall, this study demonstrates the efficacy of the NGSE technique to accurately rank different QI methods when correlated noise is present, and without access to any knowledge of the ground truth. The results motivate further validation of this technique with realistic simulation studies and patient data.
Keywords: no-gold-standard, objective evaluation, medical imaging, quantitative imaging
1. INTRODUCTION
Medical imaging provides a mechanism to study in vivo physiological properties of the human body, and thus plays an important role in the diagnosis, prognosis, and assessment of treatment response of different diseases. To facilitate decision-making in clinical practice, quantitative imaging (QI), i.e., the extraction of numerical or statistical features from medical images, is being actively investigated.1, 2 QI has demonstrated substantial promise in multiple clinical applications. These include the quantification of metabolic tumor volume from oncological positron emission tomography (PET) for predicting clinical outcomes,3 quantification of dopamine transporter uptake from single-photon emission computed tomography (SPECT) to assess the severity of Parkinson disease,4, 5 and quantification of regional uptake from PET and SPECT for dosimetry in targeted radionuclide therapy.6–9
Given the significant interest in QI, multiple methods have been and are being developed for QI. For clinical translation of QI, it is essential that the measurements made by those methods are reliable. Thus, there is an important need for objective evaluation of the reliability of measurements obtained using QI methods. Typically, such evaluation requires the presence of either the true value of the quantitative parameter or a reference standard. Such true values or reference standards can be available in realistic simulation and physical phantom studies.10–14 While these studies are important for the initial development of QI methods, there is an important need for techniques that can perform objective evaluation of QI methods directly with patient data. Such evaluation then requires the presence of gold-standard quantitative values. These are typically time-consuming, expensive, and tedious to obtain. Further, even when an approximate gold standard is available, it could suffer from the lack of reliability. Thus, techniques that can objectively evaluate QI methods in the absence of a gold standard are much needed.
To objectively evaluate QI methods without the knowledge of a gold standard, a regression-without-truth (RWT) technique was proposed in a set of seminal papers.15, 16 The RWT technique assumes that the true and measured values are linearly related by a slope, bias, and Gaussian-distributed noise term. It was demonstrated that even in the absence of a gold standard, the values of the slope, bias, and the standard deviation of the noise term for all the considered QI methods can be estimated using a maximum-likelihood (ML) approach. These estimated parameters can then be used to rank different QI methods on the basis of precision. The efficacy of the RWT technique was demonstrated in evaluating segmentation methods on the task of estimating the apparent diffusion coefficient from diffusion-weighted magnetic resonance imaging (MRI) scans,17, 18 and on the task of estimating the left ventricular ejection fraction from cardiac cine MRI sequences.19 The RWT technique was then advanced further and the efficacy of the resultant no-gold-standard evaluation (NGSE) technique20 was demonstrated in objectively evaluating reconstruction methods for SPECT on the task of quantifying regional uptake.20, 21 Further, the technique was applied to clinical oncological PET images to evaluate segmentation methods on the task of measuring metabolic tumor volume.22 While the findings from these studies are encouraging, an important assumption in these existing evaluation techniques is that the noise in the measurements obtained using different QI methods is mutually independent. However, the noise with different QI methods arises in the process of measuring the same true value, and thus can be correlated. Consequently, the independence assumption is often violated. To address this issue, we propose an advanced NGSE technique that accounts for the presence of such correlated noise. We start by presenting the theory of this technique.
2. METHODS
2.1. Theory
Consider a clinical scenario where a total of P patients are being scanned by an imaging system. From the acquired data of each patient, a set of K QI methods are used to measure certain quantitative values. Such quantitative values can be the mean activity concentration within different organs of interest. Our objective is to estimate the parameters that can describe the relationship between the true and measured values, without access to a gold standard. These estimated parameters can then be used to rank the QI methods.
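We assume that there exists a linear stochastic relationship between the true and measured values. This relationship is parameterized by a slope, bias, and Gaussian-distributed noise term. We note that this noise arises in the process of measuring the same true values but with different methods. Thus, the noise is expected to be correlated. We model this correlated noise by a zero-mean multi-variate Gaussian distribution denoted by $\mathcal{N}(\mathbf{0}, \mathbf{C})$, where $\mathbf{C}$ is the $K \times K$ covariance matrix. Specifically, the diagonal elements of $\mathbf{C}$, i.e., $\sigma_k^2$, denote the variance of the noise of each method. The off-diagonal elements of $\mathbf{C}$, i.e., $\sigma_{k,k'}$, denote the covariance of the noise between methods $k$ and $k'$. For the $p$th patient, denote the true value by $a_p$ and the estimated value using the $k$th method by $\hat{a}_{p,k}$. Additionally, denote the slope and bias of the $k$th method by $u_k$ and $v_k$, respectively. For the $p$th patient, we can then write the relationship between the true and measured values as
$$\begin{bmatrix} \hat{a}_{p,1} \\ \vdots \\ \hat{a}_{p,K} \end{bmatrix} = \begin{bmatrix} u_1 & v_1 \\ \vdots & \vdots \\ u_K & v_K \end{bmatrix} \begin{bmatrix} a_p \\ 1 \end{bmatrix} + \mathcal{N}(\mathbf{0}, \mathbf{C}). \quad (1)$$
Denote the vector $[\hat{a}_{p,1}, \ldots, \hat{a}_{p,K}]^T$ by $\hat{\mathbf{A}}_p$, the matrix containing $\{u_k\}$ and $\{v_k\}$ by $\Theta$, and the vector $[a_p, 1]^T$ by $\mathbf{A}_p$. Based on Eq. (1), obtaining the probability of observing $\hat{\mathbf{A}}_p$ given the knowledge of $\{\Theta, \mathbf{C}\}$ depends on the true values, which are unknown. To address this issue, we next assume that the true values are sampled from a four-parameter beta distribution (FPBD) parameterized by a vector $\Omega$.20 This FPBD incorporates the fact that the true values lie within a certain range. Additionally, the FPBD can model a wide variety of ranges and shapes of the true distribution.
Let $\hat{\mathcal{A}} = \{\hat{\mathbf{A}}_p, p = 1, \ldots, P\}$ denote the collection of measurements made by all the $K$ methods from all the $P$ patients. The NGSE technique uses an ML approach to estimate the values of $\{\Theta, \mathbf{C}, \Omega\}$ that maximize the probability of observing $\hat{\mathcal{A}}$. The ML estimate of $\{\Theta, \mathbf{C}, \Omega\}$ is given by
$$\{\hat{\Theta}, \hat{\mathbf{C}}, \hat{\Omega}\}_{\mathrm{ML}} = \underset{\Theta, \mathbf{C}, \Omega}{\arg\max}\ \mathrm{pr}(\hat{\mathcal{A}} \mid \Theta, \mathbf{C}, \Omega), \quad (2)$$
where $\mathrm{pr}(\hat{\mathcal{A}} \mid \Theta, \mathbf{C}, \Omega)$ denotes the probability of observing $\hat{\mathcal{A}}$ given the knowledge of $\{\Theta, \mathbf{C}, \Omega\}$. We note that obtaining this probability does not require any knowledge of the true values.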
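As an illustration of why the true values are not needed, the following Python sketch evaluates a negative log-likelihood in which the true value of each patient is integrated out numerically against the FPBD. It is a minimal sketch under stated assumptions, not the implementation used in this study: it assumes that patients are independent, that the integral can be approximated on a midpoint grid, and that the FPBD is parameterized as two shape parameters plus a lower and an upper bound. The function name `neg_log_likelihood` and the argument `n_grid` are hypothetical.

```python
import numpy as np
from scipy.stats import beta, multivariate_normal


def neg_log_likelihood(a_hat, u, v, C, omega, n_grid=400):
    """Negative log-likelihood of the measurements, with true values integrated out.

    a_hat : (P, K) array of measurements (P patients, K methods)
    u, v  : length-K slopes and biases
    C     : (K, K) noise covariance matrix
    omega : (shape_1, shape_2, lower, upper) of the four-parameter beta distribution
    """
    a_hat = np.asarray(a_hat, dtype=float)
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    shape_1, shape_2, lower, upper = omega

    # Midpoint grid over the support of the true value. Midpoints avoid the
    # end points, where the beta density can be unbounded.
    edges = np.linspace(lower, upper, n_grid + 1)
    a = 0.5 * (edges[:-1] + edges[1:])
    width = edges[1] - edges[0]
    prior = beta.pdf(a, shape_1, shape_2, loc=lower, scale=upper - lower)  # (G,)

    # Residuals of every patient at every candidate true value: (P, G, K).
    mean = a[:, None] * u[None, :] + v[None, :]
    resid = a_hat[:, None, :] - mean[None, :, :]

    # Correlated Gaussian noise density evaluated at the residuals: (P, G).
    noise = multivariate_normal(mean=np.zeros(len(u)), cov=C)
    cond = noise.pdf(resid)

    # Numerical integral over the true value, one value per patient: (P,).
    marginal = np.sum(cond * prior[None, :], axis=1) * width
    return -np.sum(np.log(marginal))
```

A function of this form could be passed to a constrained numerical optimizer, with constraints that keep the covariance matrix positive definite and the FPBD bounds ordered.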
The expression for this probability was derived and then maximized to obtain the ML estimates using a constrained optimization technique based on the interior-point algorithm.23 From the estimated parameters, we used the slope terms $\{\hat{u}_k\}$ and noise standard deviation terms $\{\hat{\sigma}_k\}$, i.e., the square roots of the diagonal elements of the estimated covariance matrix, to compute the noise-to-slope ratio (NSR) for each method. For the $k$th method, the NSR is given by
$$\mathrm{NSR}_k = \frac{\hat{\sigma}_k}{\hat{u}_k}. \quad (3)$$
The NSR evaluates QI methods on the basis of precision,15, 16, 20 and a lower value indicates a more precise estimation performance.
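The NSR computation and the resulting ranking can be sketched in a few lines of Python. The helper names `noise_to_slope_ratio` and `rank_by_precision` are hypothetical. The input values are the true slopes and true noise standard deviations listed in Tables 1 and 2, used here only to illustrate the arithmetic.

```python
import numpy as np


def noise_to_slope_ratio(slopes, covariance):
    """NSR of each method: square root of the diagonal of C divided by the slope."""
    slopes = np.asarray(slopes, dtype=float)
    sigma = np.sqrt(np.diag(np.asarray(covariance, dtype=float)))
    return sigma / slopes


def rank_by_precision(nsr):
    """Zero-based method indices ordered from lowest NSR (most precise) upward."""
    return np.argsort(nsr)


slopes = [1.10, 0.90, 1.05]
# Only the diagonal of C enters the NSR, so the off-diagonals are left at zero.
C = np.diag(np.square([0.20, 0.30, 0.45]))
nsr = noise_to_slope_ratio(slopes, C)
ranking = rank_by_precision(nsr)
print(np.round(nsr, 2))  # [0.18 0.33 0.43]
print(ranking + 1)       # [1 2 3]
```

The off-diagonal elements of the covariance matrix affect the estimation of the parameters but do not appear in the NSR itself.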
2.2. Evaluating the NGSE technique using numerical experiments
We evaluated the performance of the NGSE technique using multiple numerical experiments. In each experiment, P = 200 true values were sampled from a known FPBD. From these true values, noisy measured values were generated for K = 3 hypothetical QI methods. Each method yielded outputs that were linearly related to the true values by a slope of $u_k$ and bias of $v_k$. The variance of the noise of each method was characterized by the diagonal elements of the covariance matrix $\mathbf{C}$. Additionally, the covariance of the noise between different methods was characterized by the off-diagonal elements of $\mathbf{C}$. These noisy measurements were then input to the NGSE technique to estimate $\{\Theta, \mathbf{C}, \Omega\}$. From these estimated parameters, we used the slope terms and the noise standard deviation terms to compute the NSR for all methods to rank them based on precision, as described in Sec. 2.1.
In this evaluation, we sampled the 200 true values from the FPBD for 4 combinations of $\Omega$ such that different ranges and shapes of the true distribution were modeled. To evaluate the sensitivity of the NGSE technique to correlated noise, we generated two sets of QI methods for each combination of $\Omega$. The first set of methods had lower correlated noise, i.e., smaller off-diagonal elements of $\mathbf{C}$. In contrast, the second set of methods had higher correlated noise, i.e., larger off-diagonal elements of $\mathbf{C}$. For both sets, the slopes $\{u_k\}$, biases $\{v_k\}$, and noise variances of the three methods were set to the same values; the corresponding true slopes and noise standard deviations are listed in Tables 1 and 2. Finally, for each combination of $\Omega$ and each set of methods, we repeated the experiment for 20 different noise realizations. Thus, we evaluated the performance of the NGSE technique for a total of 4×2×20 = 160 trials.
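The data-generation step of one such trial can be sketched as follows. This is an illustrative sketch, not the code used in this study. The slopes and noise standard deviations are the true values listed in Tables 1 and 2. The FPBD parameters, the zero biases, and the use of a single common correlation coefficient `rho` for all method pairs are assumptions made only for this example, and the function names are hypothetical.

```python
import numpy as np


def covariance_from_correlation(sd, rho):
    """Covariance matrix with standard deviations sd and one common correlation rho."""
    sd = np.asarray(sd, dtype=float)
    corr = np.full((len(sd), len(sd)), rho)
    np.fill_diagonal(corr, 1.0)
    return corr * np.outer(sd, sd)


def simulate_measurements(rng, omega, u, v, C, n_patients=200):
    """Sample true values from an FPBD and generate correlated noisy measurements."""
    shape_1, shape_2, lower, upper = omega
    a_true = lower + (upper - lower) * rng.beta(shape_1, shape_2, size=n_patients)
    noise = rng.multivariate_normal(np.zeros(len(u)), C, size=n_patients)
    a_hat = a_true[:, None] * u[None, :] + v[None, :] + noise
    return a_true, a_hat


rng = np.random.default_rng(0)
u = np.array([1.10, 0.90, 1.05])  # true slopes (Tables 1 and 2)
v = np.zeros(3)                   # assumed biases (hypothetical)
sd = [0.20, 0.30, 0.45]           # true noise standard deviations (Tables 1 and 2)
omega = (2.0, 3.0, 0.0, 5.0)      # assumed FPBD parameters (hypothetical)

for rho in (0.2, 0.6):            # assumed lower and higher noise correlation
    C = covariance_from_correlation(sd, rho)
    a_true, a_hat = simulate_measurements(rng, omega, u, v, C)
    resid = a_hat - (a_true[:, None] * u + v)
    sample_rho = np.corrcoef(resid, rowvar=False)[0, 1]
    print(f"rho = {rho}: sample noise correlation (methods 1, 2) = {sample_rho:.2f}")
```

Repeating the loop body over several choices of `omega` and several random seeds would reproduce the structure of the 4×2×20 design, with the measurements `a_hat` being the only input passed to the estimation step.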
3. RESULTS
We first present the performance of the NGSE technique for the set of hypothetical QI methods that had lower correlated noise. The means and standard deviations of the estimated slope $\hat{u}_k$, noise standard deviation $\hat{\sigma}_k$, and resultant NSR for all considered methods are reported in Table 1. As described in Sec. 2.2, these statistics were computed from a total of 80 trials. In each trial, we considered either a different combination of $\Omega$ or a different noise realization of the synthetic measurements given the true values. We observe that the NGSE technique reliably estimated the slope, noise standard deviation, and consequently the NSR values. From the estimated NSR values, the NGSE technique accurately ranked the methods for 78% of the 80 trials. Further, the technique correctly identified method 1 as the most precise method for 97% of the 80 trials.
Table 1: The means and standard deviations of slope, noise standard deviation, and resultant NSR estimated using the NGSE technique for the set of methods that had lower correlated noise.
| Method index | True slope | Estimated slope | True noise standard deviation | Estimated noise standard deviation | True NSR | Estimated NSR |
|---|---|---|---|---|---|---|
| 1 | 1.10 | 1.11 ± 0.08 | 0.20 | 0.17 ± 0.09 | 0.18 | 0.16 ± 0.08 |
| 2 | 0.90 | 0.89 ± 0.07 | 0.30 | 0.30 ± 0.06 | 0.33 | 0.35 ± 0.09 |
| 3 | 1.05 | 1.06 ± 0.09 | 0.45 | 0.44 ± 0.04 | 0.43 | 0.42 ± 0.07 |
We then present in Table 2 the results for the set of methods that had higher correlated noise. We again observe that the NGSE technique reliably estimated the slope, noise standard deviation, and resultant NSR. For 87% of the 80 trials, the technique yielded accurate rankings of the considered methods. Further, the technique correctly identified that method 1 was the most precise for 97% of the 80 trials.
Table 2: The means and standard deviations of slope, noise standard deviation, and resultant NSR estimated using the NGSE technique for the set of methods that had higher correlated noise.
| Method index | True slope | Estimated slope | True noise standard deviation | Estimated noise standard deviation | True NSR | Estimated NSR |
|---|---|---|---|---|---|---|
| 1 | 1.10 | 1.13 ± 0.09 | 0.20 | 0.17 ± 0.08 | 0.18 | 0.15 ± 0.08 |
| 2 | 0.90 | 0.91 ± 0.07 | 0.30 | 0.30 ± 0.06 | 0.33 | 0.34 ± 0.08 |
| 3 | 1.05 | 1.07 ± 0.09 | 0.45 | 0.44 ± 0.05 | 0.43 | 0.42 ± 0.07 |
4. DISCUSSION AND CONCLUSION
For clinical translation of QI, there is an important need for techniques that can objectively evaluate QI methods with patient data. In this context, existing statistical techniques assume that the noise between measurements obtained using different methods is independent of each other. However, this assumption can often be violated since the noise arises in the process of measuring the same true value. To address this issue, we developed an NGSE technique that models this correlated noise by a multi-variate Gaussian distribution.
Our results from the numerical experiments (Tables 1 and 2) showed that the NGSE technique yielded reliable estimates of slope, noise standard deviation, and consequently the NSR for all hypothetical QI methods. Additionally, the technique yielded accurate rankings of these methods for 83% of the total 160 trials. Further, the technique was able to identify the most precise method for 97% of the cases. This observation is especially important since when evaluating different QI methods, the objective is typically to find the method that yields the most reliable performance.22 All these results demonstrated that in controlled settings, where the true and measured values were linearly related by design, the NGSE technique was able to reliably rank QI methods without access to any knowledge of the ground truth.
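The pooled ranking accuracy is consistent with the per-set results in Sec. 3. Since each set of methods contributed 80 trials, the pooled value is approximately the average of the two reported (rounded) percentages:

$$\frac{80 \times 78\% + 80 \times 87\%}{160} = 82.5\% \approx 83\%.$$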
The results motivate further evaluation of the proposed technique with QI methods that are developed for clinical applications. These include methods developed for reconstruction, post-reconstruction processing, segmentation, and quantification. Often, such methods are evaluated using strategies that rely on the availability of a ground truth. Further, this ground truth may not be relevant to the clinical task. For example, segmentation methods are evaluated using metrics such as the Dice score and Hausdorff distance, which quantify spatial overlap and shape similarity, respectively, between the estimated segmentation and a certain ground-truth segmentation. Such evaluation then requires access to the true segmentation, which is typically unavailable. Usually, manual segmentations are used as a surrogate for the ground truth, but these can be erroneous and suffer from intra- and inter-reader variability.24 Similarly, denoising methods for low-dose images are evaluated by comparing the denoised image to a certain normal-dose image using metrics such as the structural similarity index and root-mean-square error. However, the normal-dose image is also noisy, and thus provides a limited measure of ground truth. More importantly, it is unclear whether the evaluation based on those conventional metrics correlates with performance on the clinical task.25–27 Thus, these methods should preferably be evaluated based on clinical-task performance.28 The proposed NGSE technique provides a mechanism to perform evaluation on clinically relevant quantitative tasks and without access to the ground truth.
One limitation of the proposed technique is that the true and measured values are assumed to be linearly related. This linear relationship is desirable in QI since it ensures that the measured quantitative value is linearly related to the biological effect. However, this assumption of linearity may not always hold true in QI. To address this issue, one strategy is to check whether the measurements made by different methods are linearly related to each other. This will increase the confidence that the measured values are also linearly related to the ground truth.22 A second limitation is that the NGSE technique requires many patient images since multiple parameters need to be estimated. One way to reduce the required number of input images is to incorporate the prior information of the parameters to be estimated.29 Thus, extending the proposed technique to incorporate such prior knowledge is an important research direction.
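One simple way to screen for the linearity mentioned above is to compute the Pearson correlation between the measurements of every pair of methods. The sketch below is one possible check, not the procedure used in this study or in the cited work. The function name `pairwise_correlation` is hypothetical and the data are synthetic. Correlations close to 1 are consistent with a linear relationship between methods, although they do not prove linearity with respect to the ground truth.

```python
import numpy as np


def pairwise_correlation(a_hat):
    """Pearson correlation between the measurements of every pair of methods.

    a_hat : (P, K) array, one row per patient and one column per method.
    Returns a (K, K) correlation matrix.
    """
    a_hat = np.asarray(a_hat, dtype=float)
    return np.corrcoef(a_hat, rowvar=False)


# Synthetic illustration with hypothetical values (not data from this study).
rng = np.random.default_rng(0)
a_true = np.linspace(0.0, 5.0, 50)
u = np.array([1.10, 0.90, 1.05])
a_hat = a_true[:, None] * u + rng.normal(scale=0.1, size=(50, 3))

corr = pairwise_correlation(a_hat)
print(np.round(corr, 3))
```

In practice, scatter plots of each pair of methods would complement the correlation values, since a single coefficient can hide curvature over part of the range.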
In conclusion, our study demonstrated the ability of the proposed NGSE technique to accurately rank different QI methods in the presence of correlated noise, and without the need for any knowledge of the ground truth. The results motivate further evaluation of the technique with realistic simulation studies and patient data.
ACKNOWLEDGEMENTS
Financial support for this work was provided by the National Institute of Biomedical Imaging and Bioengineering R01-EB031051, R56-EB028287, and R21-EB024647 (Trailblazer Award).
REFERENCES
- [1] Sullivan DC, Obuchowski NA, Kessler LG, Raunig DL, Gatsonis C, Huang EP, Kondratovich M, McShane LM, Reeves AP, Barboriak DP, et al., “Metrology standards for quantitative imaging biomarkers,” Radiology 277(3), 813–825 (2015).
- [2] Rosenkrantz AB, Mendiratta-Lala M, Bartholmai BJ, Ganeshan D, Abramson RG, Burton KR, John-Paul JY, Scalzetti EM, Yankeelov TE, Subramaniam RM, et al., “Clinical utility of quantitative imaging,” Acad. Radiol. 22(1), 33–49 (2015).
- [3] Ohri N, Duan F, Machtay M, Gorelick JJ, Snyder BS, Alavi A, Siegel BA, Johnson DW, Bradley JD, DeNittis A, et al., “Pretreatment FDG-PET metrics in stage III non–small cell lung cancer: ACRIN 6668/RTOG 0235,” J. Natl. Cancer Inst. 107(4) (2015).
- [4] Filippi L, Manni C, Pierantozzi M, Brusa L, Danieli R, Stanzione P, and Schillaci O, “123I-FP-CIT semi-quantitative SPECT detects preclinical bilateral dopaminergic deficit in early Parkinson’s disease with unilateral symptoms,” Nucl. Med. Commun. 26(5), 421–426 (2005).
- [5] Moon HS, Liu Z, Ponisio M, Laforest R, and Jha A, “A physics-guided and learning-based estimation method for segmenting 3D DaT-Scan SPECT images,” (2020).
- [6] Flux G, Bardies M, Monsieurs M, Savolainen S, Strand S-E, and Lassmann M, “The impact of PET and SPECT on dosimetry for targeted radionuclide therapy,” Z. Med. Phys. 16(1), 47–59 (2006).
- [7] Ljungberg M, Sjögreen K, Liu X, Frey E, Dewaraja Y, and Strand S-E, “A 3-dimensional absorbed dose calculation method based on quantitative SPECT for radionuclide therapy: evaluation for 131I using Monte Carlo simulation,” J. Nucl. Med. 43(8), 1101–1109 (2002).
- [8] Dewaraja YK, Frey EC, Sgouros G, Brill AB, Roberson P, Zanzonico PB, and Ljungberg M, “MIRD pamphlet no. 23: quantitative SPECT for patient-specific 3-dimensional dosimetry in internal radionuclide therapy,” J. Nucl. Med. 53(8), 1310–1325 (2012).
- [9] Li Z, Benabdallah N, Abou DS, Baumann BC, Dehdashti F, Jammalamadaka U, Laforest R, Wahl RL, Thorek DL, and Jha AK, “A projection-domain low-count quantitative SPECT method for alpha-particle emitting radiopharmaceutical therapy,” arXiv preprint arXiv:2107.00740 (2021).
- [10] Du Y, Tsui BM, and Frey EC, “Partial volume effect compensation for quantitative brain SPECT imaging,” IEEE Trans. Med. Imaging 24(8), 969–976 (2005).
- [11] Jin X, Mulnix T, Gallezot J-D, and Carson RE, “Evaluation of motion correction methods in human brain PET imaging—A simulation study based on human motion data,” Med. Phys. 40(10), 102503 (2013).
- [12] Ouyang J, El Fakhri G, Moore SC, and Kijewski MF, “Fast Monte Carlo Simulation Based Joint Iterative Reconstruction for Simultaneous 99mTc/123I Brain SPECT Imaging,” in [2006 IEEE Nuclear Science Symposium Conference Record], 4, 2251–2256, IEEE (2006).
- [13] He B, Du Y, Song X, Segars WP, and Frey EC, “A Monte Carlo and physical phantom evaluation of quantitative In-111 SPECT,” Phys. Med. Biol. 50(17), 4169 (2005).
- [14] Liu Z, Mhlanga JC, Laforest R, Derenoncourt P-R, Siegel BA, and Jha AK, “A Bayesian approach to tissue-fraction estimation for oncological PET segmentation,” Phys. Med. Biol. 66(12), 124002 (2021).
- [15] Hoppin JW, Kupinski MA, Kastis GA, Clarkson E, and Barrett HH, “Objective comparison of quantitative imaging modalities without the use of a gold standard,” IEEE Trans. Med. Imaging 21(5), 441–449 (2002).
- [16] Kupinski MA, Hoppin JW, Clarkson E, Barrett HH, and Kastis GA, “Estimation in medical imaging without a gold standard,” Acad. Radiol. 9(3), 290–297 (2002).
- [17] Jha AK, Kupinski MA, Rodríguez JJ, Stephen RM, and Stopeck AT, “Evaluating segmentation algorithms for diffusion-weighted MR images: a task-based approach,” in [Medical Imaging 2010: Image Perception, Observer Performance, and Technology Assessment], 7627, 76270L, International Society for Optics and Photonics (2010).
- [18] Jha AK, Kupinski MA, Rodriguez JJ, Stephen RM, and Stopeck AT, “Task-based evaluation of segmentation algorithms for diffusion-weighted MRI without using a gold standard,” Phys. Med. Biol. 57(13), 4425 (2012).
- [19] Lebenberg J, Buvat I, Lalande A, Clarysse P, Casta C, Cochet A, Constantinidés C, Cousty J, De Cesare A, Jehan-Besson S, et al., “Nonsupervised ranking of different segmentation approaches: application to the estimation of the left ventricular ejection fraction from cardiac cine MRI sequences,” IEEE Trans. Med. Imaging 31(8), 1651–1660 (2012).
- [20] Jha AK, Caffo B, and Frey EC, “A no-gold-standard technique for objective assessment of quantitative nuclear-medicine imaging methods,” Phys. Med. Biol. 61(7), 2780 (2016).
- [21] Jha AK, Song N, Caffo B, and Frey EC, “Objective evaluation of reconstruction methods for quantitative SPECT imaging in the absence of ground truth,” in [Medical Imaging 2015: Image Perception, Observer Performance, and Technology Assessment], 9416, 94161K, International Society for Optics and Photonics (2015).
- [22] Jha AK, Mena E, Caffo BS, Ashrafinia S, Rahmim A, Frey EC, and Subramaniam RM, “Practical no-gold-standard evaluation framework for quantitative imaging methods: application to lesion segmentation in positron emission tomography,” J. Med. Imaging 4(1), 011011 (2017).
- [23] Byrd RH, Hribar ME, and Nocedal J, “An interior point algorithm for large-scale nonlinear programming,” SIAM J. Optim. 9(4), 877–900 (1999).
- [24] Leung KH, Marashdeh W, Wray R, Ashrafinia S, Pomper MG, Rahmim A, and Jha AK, “A physics-guided modular deep-learning based automated framework for tumor segmentation in PET,” Phys. Med. Biol. 65(24), 245032 (2020).
- [25] Yu Z, Rahman MA, Schindler T, Gropler R, Laforest R, Wahl R, and Jha A, “AI-based methods for nuclear-medicine imaging: Need for objective task-specific evaluation,” (2020).
- [26] Zhu Y, Yousefirizi F, Liu Z, Klyuzhin I, Rahmim A, and Jha A, “Comparing clinical evaluation of PET segmentation methods with reference-based metrics and no-gold-standard evaluation technique,” (2021).
- [27] Liu Z, Mhlanga JC, Siegel BA, and Jha AK, “Need for objective task-based evaluation of segmentation methods in oncological PET: a study with ACRIN 6668/RTOG 0235 multi-center clinical trial data,” in review (2022).
- [28] Jha AK, Myers KJ, Obuchowski NA, Liu Z, Rahman MA, Saboury B, Rahmim A, and Siegel BA, “Objective Task-Based Evaluation of Artificial Intelligence-Based Medical Imaging Methods: Framework, Strategies, and Role of the Physician,” PET Clin. 16(4), 493–511 (2021).
- [29] Jha A and Frey E, “Incorporating prior information in a no-gold-standard technique to assess quantitative SPECT reconstruction methods,” in [International Meeting on Fully 3D reconstruction in Radiology and Nuclear Medicine], 47–51 (2015).
