Journal of Medical Imaging. 2022 Aug 18;9(4):044005. doi: 10.1117/1.JMI.9.4.044005

Repeatability and reproducibility of magnetic resonance imaging-based radiomic features in rectal cancer

Robba Rai a,b,c,*, Michael B Barton a,b,c, Phillip Chlap a,b,c, Gary Liney a,b,c, Carsten Brink d,e, Shalini Vinod a,b,c, Monique Heinke f, Yuvnik Trada g, Lois C Holloway a,b,c,h,i
PMCID: PMC9386367  PMID: 35992729

Abstract.

Purpose

Radiomics of magnetic resonance images (MRIs) in rectal cancer can non-invasively characterize tumor heterogeneity with potential to discover new imaging biomarkers. However, for radiomics to be reliable, the imaging features measured must be stable and reproducible. The aim of this study is to quantify the repeatability and reproducibility of MRI-based radiomic features in rectal cancer.

Approach

An MRI radiomics phantom was used to measure the longitudinal repeatability of radiomic features and the impact of post-processing changes related to image resolution and noise. Repeatability measurements in rectal cancers were also quantified in a cohort of 10 patients with test–retest imaging among two observers.

Results

We found that many radiomic features, particularly from texture classes, were highly sensitive to changes in image resolution and noise. About 49% of features had a coefficient of variation ≤10% in longitudinal phantom measurements. About 75% of radiomic features in in vivo test–retest measurements had an intraclass correlation coefficient ≥0.8. We saw excellent interobserver agreement, with a mean Dice similarity coefficient of 0.95±0.04 for test and retest scans.

Conclusions

The results of this study show that even when using a consistent imaging protocol many radiomic features were unstable. Therefore, caution must be taken when selecting features for potential imaging biomarkers.

Keywords: magnetic resonance image, radiomics, rectal cancer, repeatability, reproducibility, stability

1. Introduction

Radiomics has gained popularity during the last decade and is proving to be a promising method to non-invasively characterize tumor heterogeneity, treatment response, and radiosensitivity and to predict outcome based on imaging features.1–5 With radiomics, tumors are segmented on medical images, and first- and second-order features describing the intensity histogram, pixel intensity relationships, and shape are extracted algorithmically. An advantage of radiomics is that the methodology can be applied to a number of tumor sites2 and is non-invasive compared to traditional methods of characterizing tumors such as biopsy. With biopsies, a single tumor sample is often analyzed, which may not be representative of the tumor genomic landscape due to complex intratumor heterogeneity.6 Radiomics can address this limitation of biopsy by using tumor segmentations from 3D volumetric medical imaging such as computed tomography (CT), magnetic resonance imaging (MRI), and positron emission tomography (PET). Imaging can also be repeated over time to measure the temporal heterogeneity of tumors.

For radiomics to be useful, the predictive models need to have high accuracy, reliability, and efficiency.7 Many radiomic features are highly correlated with one another and are therefore redundant; these need to be removed prior to radiomics analysis to reduce the risk of overfitting the predictive models.8,9 This also ensures that only the most informative features are included in the radiomics signature for a predictive model. To achieve this, many radiomics studies use a variety of datasets for training and validation.2,10–12

Further investigations into the repeatability and reproducibility of the features used are also needed.13 Repeatability refers to radiomic features remaining unchanged when the object of interest is imaged multiple times.13 Reproducibility refers to features remaining unchanged when the object is imaged using different equipment or settings, such as different scanners, software, or sequence parameters.13

Repeatability and reproducibility of radiomic features are imperative if predictive radiomic models are to be used in multicenter trials, where image quality can vary greatly for a number of reasons, including different protocol acquisition parameters, different machine vendors, and fluctuations in image quality.14 The repeatability and reproducibility of radiomic features can be quantified through a rigorous assessment of feature stability, including quality assurance (QA) of imaging protocols and measurement of feature stability using in vivo test–retest datasets.

There has been recent interest in rectal cancer radiomics to predict response to treatments using MRI, because MRI provides superior soft tissue contrast compared to CT and PET. MRI is often used to diagnose, stage and restage rectal cancer15 and to determine the status of the circumferential resection margin, which is associated with the risk of local recurrence.16

Recent MRI studies in rectal cancer have explored radiomics models to predict pathological complete response to neoadjuvant chemoradiotherapy. The studies often use a combination of anatomical and functional imaging, such as diffusion weighted imaging and dynamic contrast enhanced imaging,17,18 or anatomical T2-weighted (T2-w) imaging alone.19,20 Although this growing body of evidence is yielding promising results, there is still a need to explore the repeatability and reproducibility of radiomics features. This is particularly important for building predictive radiomics models, which require reproducible and stable imaging features.7 Traverso et al.,13 in a systematic review, found that there was limited investigation into the most stable features for MRI-based radiomics analysis and recommended standardization of radiomics analysis. The number of studies exploring MRI-based radiomics, particularly in oncological settings, is increasing compared to CT and PET. However, despite the increase in studies exploring the feasibility of using MRI-based features to uncover radiomic signatures and potential imaging biomarkers, there is still limited research on the impact of image quality and image acquisition parameters on radiomics features.21

The purpose of this study is to investigate the repeatability and reproducibility of MRI-based rectal cancer radiomic features using both phantom and clinical test–retest datasets.

2. Materials and Methods

2.1. Radiomics Phantom

The radiomics phantom was developed by our group previously14 and was used to assess the repeatability of radiomic features for the imaging sequences used in clinical rectal cancer patients. The phantom was imaged over a period of four months for a total of 13 imaging sessions to measure the temporal stability of radiomic features.

The phantom was positioned on two oil-filled containers to ensure adequate signal to trigger a localizer scan [Fig. 1(a)]. A non-slip mat was placed under the phantom base to minimize movement and vibrations caused by image acquisition. The phantom test objects were placed on a 3D printed horizontal plate in their corresponding position. The horizontal plate was aligned using the external laser bridge system (LAP Laser, Luneburg, Germany) at every scan session to ensure accurate reproducibility of position and to minimize setup errors.

Fig. 1.

(a) Phantom setup on RT table and (b) individual test objects part of the radiomics phantom.

The phantom comprises 11 test objects, of which three (M1, M2, and M8) were used in this study for radiomics analysis [Fig. 1(b)]. These three test objects were selected for segmentation because they cover a range of textures and shapes: a solid homogeneous structure (M1), a heterogeneous structure with fine texture details (M2), and a complex shape (M8). It was beyond the scope of this study to analyze all 11 test objects; however, a complete analysis of all shapes has been previously reported.14 The test objects were segmented using a global threshold-based segmentation method described by Rai et al.14 The upper and lower threshold values were based on the minimum and maximum pixel values in the histogram of each model. Segmentation was performed on one reference scan (test) and then propagated to the second scan (retest) to minimize contour variability.

2.2. Patient Population

Retrospective analysis of a test–retest cohort of rectal cancer patients was performed. This study was approved by the South Western Sydney Local Health District human research ethics committee (HREC No. HREC/15/LPOOL/555). All patients provided informed consent and all imaging and experiments were performed in accordance with relevant guidelines and regulations. Ten patients treated at Liverpool and Macarthur Cancer Therapy Centre between 2015 and 2017 with stage IIIA–IVB rectal adenocarcinoma were recruited. Inclusion criteria included any patient with confirmed rectal adenocarcinoma requiring a simulation MRI for their radiation therapy (RT) planning. Exclusion criteria were patients with a contraindication to MRI such as non-safe or non-compatible medical devices.

2.3. Image Acquisition

All imaging in this study was conducted on a dedicated radiotherapy 3 Tesla MRI scanner (MAGNETOM Skyra, Siemens Healthineers, Erlangen, Germany), using an 18-channel receive-only surface coil and a 32-channel spine coil integrated into the MRI bed. A flat top RT table (CIVCO Medical Solutions, Coralville) was used for both phantom and patient imaging to replicate patient treatment conditions [Fig. 1(a)].

The imaging protocol acquired for both phantom and patients was a T2-w turbo spin echo (TSE) sequence. There were no differences in acquisition parameters between the clinical protocol and phantom protocol. Phantom imaging was acquired in a coronal view whilst patient imaging was acquired in a transverse view. The sequence was acquired with Cartesian k-space ordering, a TE/TR of 96/10000 ms, 3/0 mm slice thickness/spacing, 1 signal average, a turbo factor of 15, an acceleration factor of 2 for parallel imaging, a 320 × 224 matrix, a 160 deg flip angle, a 400-Hz/Px receiver bandwidth, a 220 mm field of view (FOV), and 0.7 × 0.7 mm² in-plane resolution.

2.3.1. Patient imaging

All patients in the test–retest cohort were scanned in their RT treatment position, including a flat table top and RT immobilization equipment. The external laser system was used to align the patient to their planning tattoos (given in CT simulation) to assist with reproducing the patient position. After the first sequence was acquired (test), the patient left the MRI room and went for a short walk. They were then repositioned in the same RT planning position and the second T2-w TSE scan was acquired (retest). All 10 patients received two scans for a total of 20 datasets.

2.4. Image Postprocessing

The impact of varied image quality on the reproducibility of radiomic features was investigated in the phantom.

To assess the impact on feature stability, the original phantom images were pre-processed with added noise and variable spatial resolution. To adjust the spatial resolution, the images were resampled to 0.5, 0.8, 1, and 1.2 mm². To add noise to the original images, a Gaussian noise filter was applied to the datasets. The standard deviation (SD) of the added noise was set to 2, 5, 10, and 20, with image intensities ranging from 0 to 302. For all images (both original and processed), the signal-to-noise ratio (SNR) within the individual images was assessed as the mean signal intensity within one phantom test object (M1) divided by the SD of a nearby air signal. All post-processing was performed in 3D Slicer (Version 4.13.0).22
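The noise and SNR steps can be illustrated with a minimal sketch. It is not the 3D Slicer workflow used in the study: the function names, seeds, and toy intensities are hypothetical, and the air region is simplified to zero-mean Gaussian noise rather than the noise distribution of a real magnitude MR image.

```python
import random
import statistics


def add_gaussian_noise(pixels, sd, seed=0):
    """Return a copy of `pixels` with zero-mean Gaussian noise of the given SD added."""
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, sd) for p in pixels]


def snr(object_pixels, air_pixels):
    """SNR: mean signal inside the test object divided by the SD of a nearby air region."""
    return statistics.mean(object_pixels) / statistics.stdev(air_pixels)


# Hypothetical toy intensities (not data from the study).
rng = random.Random(1)
object_roi = [200.0 + rng.gauss(0.0, 3.0) for _ in range(2000)]
air_roi = [rng.gauss(0.0, 3.0) for _ in range(2000)]

baseline_snr = snr(object_roi, air_roi)

# SNR after adding each of the four noise levels described in the text.
noisy_snr = {
    sd: snr(add_gaussian_noise(object_roi, sd, seed=2),
            add_gaussian_noise(air_roi, sd, seed=3))
    for sd in (2, 5, 10, 20)
}
```

In this sketch the SNR falls as the added noise SD rises, because the added noise variance adds to the variance already present in the air region.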

2.5. Tumor and Tissue Segmentation

All segmentations were performed in the image processing software MiM (MiM Maestro, Cleveland, Ohio).

2.5.1. Tumor segmentation

The rectal tumor volumes were contoured manually by two radiation oncologists (MH and YT) on both repeat T2-w sequences. Observer 2 (YT) copied and adjusted the original contour of observer 1 (MH). The rectal tumor volume was defined as a high signal region on T2-w imaging by both observers. Care was taken to ensure that the tumor volumes excluded healthy rectal wall, lumen, and fecal matter (Fig. 2). This process was completed on both test and retest imaging by both observers.

Fig. 2.

Example of contours of a rectal tumor on (a) axial; (b) coronal; and (c) sagittal reconstructions. Care was taken to ensure rectal wall, lumen, and fecal matter were not included in the tumor volumes.

2.5.2. Normal tissue segmentation

The right gluteus maximus was selected as a reference for normal tissue. Radiomic feature stability within this reference muscle was assessed to identify features that are stable in otherwise normal human tissue.

The gluteus maximus was selected as it was covered by the FOV of all MRI scans across all datasets and was consistently outside the highest planning dose areas (<20% of maximum dose). A spherical region of interest (ROI) was placed over the middle of the gluteus maximus on axial imaging. The size of the spherical ROI was kept consistent across all scans with an average volume of 4 cc. The reference muscle was contoured manually by a single observer (RR) on T2-w imaging.
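As a rough geometric check, a 4 cc sphere has a radius of just under 1 cm. The sketch below computes that radius and counts the voxels a spherical ROI of that size would cover. It assumes the 0.7 × 0.7 mm in-plane resolution and 3 mm slice thickness reported for the T2-w TSE sequence; the function names are hypothetical, and this is not the ROI tool used in MiM.

```python
import math


def sphere_radius_mm(volume_cc):
    """Radius (mm) of a sphere with the given volume in cc (1 cc = 1000 mm^3)."""
    return (3.0 * volume_cc * 1000.0 / (4.0 * math.pi)) ** (1.0 / 3.0)


def spherical_roi_offsets(radius_mm, spacing_mm):
    """Voxel index offsets (i, j, k) whose centers lie within radius_mm of the ROI center."""
    sx, sy, sz = spacing_mm
    nx, ny, nz = (int(radius_mm // s) for s in spacing_mm)
    return [
        (i, j, k)
        for i in range(-nx, nx + 1)
        for j in range(-ny, ny + 1)
        for k in range(-nz, nz + 1)
        if (i * sx) ** 2 + (j * sy) ** 2 + (k * sz) ** 2 <= radius_mm ** 2
    ]


spacing = (0.7, 0.7, 3.0)  # assumed voxel size in mm, taken from the reported protocol
radius = sphere_radius_mm(4.0)
offsets = spherical_roi_offsets(radius, spacing)

# Volume actually covered by the discretized sphere, in cc.
roi_volume_cc = len(offsets) * spacing[0] * spacing[1] * spacing[2] / 1000.0
```

Because the slice thickness is coarse relative to the radius, the discretized ROI only approximates the nominal volume, which may be one reason the text reports an average rather than an exact volume.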

2.6. Radiomics Analysis

Both in vivo and phantom segmentations were analyzed using PyRadiomics,23 which is available in-house as a plug-in within MiM software. A total of 83 radiomic features were calculated, including 15 first-order statistics, 14 shape features, and textural features comprising 23 from the gray level co-occurrence matrix (GLCM), 13 from the gray level run length matrix (GLRLM), 13 from the gray level size zone matrix (GLSZM), and 5 from the neighbouring gray tone difference matrix (NGTDM). All features were computed with a fixed bin width of 25. The full list of features is outlined in Table 1.
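To illustrate what a fixed bin width means for gray-level discretization, the sketch below maps intensities to 1-based bins whose edges are aligned to multiples of the bin width, following the fixed-bin-width scheme described in the PyRadiomics documentation. It is a standalone illustration rather than a call into PyRadiomics, and the ROI intensities are hypothetical.

```python
import math


def discretize_fixed_bin_width(intensities, bin_width=25):
    """Map intensities to 1-based gray-level bins of equal width.

    Bin edges sit at multiples of `bin_width`, and the bin containing the
    lowest intensity in the ROI is labelled 1.
    """
    lowest_bin = math.floor(min(intensities) / bin_width)
    return [math.floor(x / bin_width) - lowest_bin + 1 for x in intensities]


# Hypothetical ROI intensities spanning the 0 to 302 range mentioned for the phantom images.
roi = [0, 12, 24, 25, 49, 50, 150, 302]
levels = discretize_fixed_bin_width(roi)
print(levels)  # [1, 1, 1, 2, 2, 3, 7, 13]
```

With a bin width of 25, an intensity range of 0 to 302 yields 13 gray levels. Texture matrices such as the GLCM are built from these discretized levels, so the bin width directly affects texture feature values.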

Table 1.

List of radiomic features analyzed in this study.

Feature class | Feature
First order | Energy, entropy, kurtosis, maximum, mean, mean absolute deviation, median, minimum, range, robust mean absolute deviation, root mean squared, skewness, total energy, uniformity, and variance
GLCM | Autocorrelation, cluster prominence, cluster shade, cluster tendency, correlation, difference average, difference entropy, difference variance, inverse difference, inverse difference moment, inverse difference moment normalized, inverse difference normalized, information measure of correlation 1, information measure of correlation 2, inverse variance, joint average, joint energy, joint entropy, maximum probability, maximal correlation coefficient, sum average, sum entropy, and sum squares
GLRLM | High gray-level run emphasis, long run emphasis, long run high gray-level emphasis, long run low gray-level emphasis, low gray-level run emphasis, run entropy, run length non uniformity, run length non uniformity normalized, run percentage, run variance, short run emphasis, short run high gray-level emphasis, and short run low gray-level emphasis
GLSZM | High gray-level zone emphasis, large area emphasis, large area high gray-level emphasis, large area low gray-level emphasis, low gray-level zone emphasis, size zone non uniformity, size zone non uniformity normalized, small area emphasis, small area high gray-level emphasis, small area low gray-level emphasis, zone entropy, zone percentage, and zone variance
NGTDM | Busyness, coarseness, complexity, contrast, and strength
Shape | Elongation, flatness, least axis length, major axis length, maximum 2D diameter column, maximum 2D diameter row, maximum 2D diameter slice, maximum 3D diameter, mesh volume, minor axis length, sphericity, surface area, surface volume ratio, and voxel volume

2.7. Statistical Analysis and Feature Selection

For both the original and post-processed phantom imaging, the coefficient of variation (COV) was calculated to assess the degree of variability in features. The COV was calculated as

$$\mathrm{COV} = \frac{\text{standard deviation}}{\text{mean}} \times 100\%,$$

where the SD and the mean are derived from the radiomic features over all time points of imaging for each model.

The features were divided into four groups based on COV as very small (COV ≤ 5%), small (5% < COV ≤ 10%), intermediate (10% < COV ≤ 20%), and large (COV > 20%) ranges of variation.24
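The COV calculation and grouping can be sketched as follows. The sketch assumes the sample SD (n − 1 denominator), inclusive upper boundaries for each group, and grouping by COV magnitude. The text does not state these details, and the feature values are hypothetical.

```python
import statistics


def cov_percent(values):
    """Coefficient of variation: sample SD divided by the mean, as a percentage."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0


def cov_group(cov):
    """Assign a COV (in %) to one of the four ranges of variation."""
    magnitude = abs(cov)  # assumption: the sign of the COV is ignored when grouping
    if magnitude <= 5:
        return "very small"
    if magnitude <= 10:
        return "small"
    if magnitude <= 20:
        return "intermediate"
    return "large"


# Hypothetical values of one feature for one test object over five imaging sessions.
sessions = [102.0, 98.0, 105.0, 95.0, 100.0]
cov = cov_percent(sessions)  # about 3.8%
group = cov_group(cov)       # "very small"
```

Because the COV divides by the mean, features whose mean is close to zero, or can change sign, can produce very large or negative COV values even when their absolute variation is small.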

For the test–retest patient cohort, the intraclass correlation coefficient (ICC) test was performed to measure the reproducibility and repeatability of radiomic features between the test and retest scans for both observers for the rectal tumor volumes and normal muscles.

The impact of spatial resolution and Gaussian noise filtering on the reproducibility and repeatability of each radiomic feature was quantified using ICC. A two-way mixed effects model with absolute agreement was used to compute the ICCs. Features that returned an ICC ≥ 0.8 were considered to have an almost perfect strength of agreement. ICC was calculated in R using the DescTools package (Version 3.5.1).
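For readers unfamiliar with the calculation, a two-way, absolute-agreement ICC can be computed from the mean squares of a two-way ANOVA. The sketch below implements the single-measurement form. The text does not state whether single or average measures were used, so that choice is an assumption, and the function is a standalone illustration rather than the DescTools routine.

```python
def icc_absolute_agreement(ratings):
    """Two-way model, absolute agreement, single-measurement ICC.

    `ratings` is a list of rows, one per subject, each holding the k repeated
    measurements of one feature (for example, test and retest values).
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for subjects (rows), measurements (columns), and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical test and retest values of one feature in five patients.
test_retest = [[10.1, 10.3], [12.0, 11.8], [9.5, 9.9], [14.2, 14.0], [11.1, 11.4]]
icc = icc_absolute_agreement(test_retest)  # close to 1 for these values
```

Absolute agreement penalizes systematic offsets between test and retest, so a feature that shifts by a constant amount on every retest scores lower than it would under a consistency definition.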

The Wilcoxon signed-rank test was used to compare the tumor volumes of observers 1 and 2 for both test and retest scans. This non-parametric test was selected as it does not assume a particular underlying distribution of the data. A p-value of <0.05 was considered statistically significant. The Wilcoxon signed-rank test was performed in SPSS (IBM SPSS Statistics, version 23, New York).

3. Results

3.1. Patient Demographics

Table 2 outlines patient demographics. Ten patients were analyzed in this study, each having two MRI scans. The average time ± SD between the test and retest scans was 11 ± 2 min.

Table 2.

Patient characteristics.

Characteristics Patient cohort
Number of patients 10
Age (years)
Median 60
Q1–Q3 55–70
Gender
Male 7
Female 3
TNM staging
IIA 0
IIIA 1 (10%)
IIIB 7 (70%)
IIIC 0
IVA 1 (10%)
IVB 1 (10%)
Time between scans, mean ± SD 11 ± 2 (min)

3.2. Repeatability of Radiomic Features in the Patient Test–Retest Cohort

Figure 3 shows a scatter plot of the ICC values for all radiomic feature classes for observers 1 and 2. Overall, 74.6% of radiomic features returned a high ICC > 0.8, with all shape features having a high strength of agreement.

Fig. 3.

Scatter plot comparison of feature stability for observer 1 and 2 as measured by the ICC for the test–retest patient cohort.

3.3. Interobserver Variability in the Patient Test–Retest Cohort

The mean ± SD volume for observers 1 and 2 for the test scans was 38.01 ± 18.89 ml and 36.83 ± 14.82 ml, respectively, with a mean difference of 1.18 ml (p=0.878 [95% CI: −14.77 to 17.13]). The mean ± SD volume for observers 1 and 2 for the retest scans was 36.68 ± 17.26 ml and 39.19 ± 19.79 ml, respectively, with a mean difference of −2.52 ml (p=0.766 [95% CI: −19.96 to 14.93]).

There was good interobserver agreement, with a mean ± SD Dice similarity coefficient (DSC) of 0.95 ± 0.04 for both the test and retest scans. The mean ± SD Hausdorff distance (HD) for the test and retest datasets was 0.061 ± 0.037 cm and 0.057 ± 0.042 cm, respectively.
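The Dice similarity coefficient is twice the overlap between two contours divided by the sum of their sizes. A minimal sketch, treating each contour as a set of voxel indices (the two masks are hypothetical, not study contours):

```python
def dice_similarity(mask_a, mask_b):
    """DSC = 2|A ∩ B| / (|A| + |B|) for two collections of voxel indices."""
    a, b = set(mask_a), set(mask_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty contours agree perfectly
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical contours on a single slice: a 10 x 10 block of voxels and the
# same block shifted by one voxel column.
observer_1 = {(x, y, 0) for x in range(10) for y in range(10)}
observer_2 = {(x, y, 0) for x in range(1, 11) for y in range(10)}

dsc = dice_similarity(observer_1, observer_2)
print(dsc)  # 0.9
```

A one-voxel shift of a 100-voxel contour already lowers the DSC to 0.9, which gives a sense of how closely two contours must match to reach a mean DSC of 0.95.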

3.4. Feature Stability in Reference Tissue

Table 3 gives the ICC values for the reference muscle across both texture and first-order statistics. Overall, 75% of radiomic features returned a high ICC ≥ 0.8. Only one feature, variance from first-order statistics, had a low ICC of 0.03.

Table 3.

List of ICC values for each radiomic feature in the reference muscle.

Feature class | Low (<0.5) | Moderate (0.5–0.8) | High (≥0.8)
First order | Variance | Kurtosis, maximum, minimum, and uniformity | Energy, entropy, mean, mean absolute deviation, median, range, robust mean absolute deviation, root mean squared, skewness, and total energy
GLCM | None | Correlation, inverse difference moment, inverse difference moment normalized, and information measure of correlation 1 | Autocorrelation, cluster prominence, cluster shade, cluster tendency, difference average, difference entropy, difference variance, inverse difference, inverse difference normalized, information measure of correlation 2, inverse variance, joint average, joint energy, joint entropy, maximum probability, maximal correlation coefficient, sum average, sum entropy, and sum squares
GLRLM | None | Long run low gray-level emphasis, low gray-level run emphasis, run length non uniformity, run variance, and short run low gray-level emphasis | High gray-level run emphasis, long run emphasis, long run high gray-level emphasis, run entropy, run length non uniformity normalized, run percentage, short run emphasis, and short run high gray-level emphasis
GLSZM | None | Large area low gray-level emphasis and size zone non uniformity | High gray-level zone emphasis, large area emphasis, large area high gray-level emphasis, low gray-level zone emphasis, size zone non uniformity normalized, small area emphasis, small area high gray-level emphasis, small area low gray-level emphasis, zone entropy, zone percentage, and zone variance
NGTDM | None | Coarseness | Busyness, complexity, contrast, and strength

3.5. Temporal Stability of Radiomic Features in Phantom

Figure 4 shows an example of two scans of the phantoms taken on two separate days during the 4-month period of monitoring. Both scans show the phantom taken at the same slice position.

Fig. 4.

An example of two repeat scans for the phantom imaging. Scan 1 and 2 were taken on different days on a single scanner.

Table 4 shows the COV of all features as measured in the three phantom objects (M1, M2, and M8) on repeat imaging over a period of 4 months. The COV value listed for each phantom object is calculated over all the imaging sessions (n=13).

Table 4.

COV of each phantom object (M1, M2, and M8) with corresponding mean COV and SD.

Feature class Feature M1 M2 M8 Mean SD (%)
First order Energy 12.9% 12.6% 15.6% 13.7% 1.7
Entropy 3.2% 3.3% 5.1% 3.9% 1.1
Kurtosis 7.8% 5.8% 9.6% 7.8% 1.9
Maximum 4.7% 5.6% 8.9% 6.4% 2.2
Mean 5.2% 7.0% 6.1% 6.1% 0.9
Mean absolute deviation 7.3% 5.1% 11.6% 8.0% 3.3
Median 5.2% 11.0% 6.1% 7.4% 3.1
Minimum 23.0% 21.5% 24.8% 23.1% 1.7
Range 5.3% 5.6% 8.9% 6.6% 2.0
Robust mean absolute deviation 11.5% 5.0% 14.1% 10.2% 4.7
Root mean squared 5.0% 6.3% 6.1% 5.8% 0.7
Skewness 22.7% −9.4% 15.2% 9.5% 16.8
Total energy 12.9% 12.6% 15.6% 13.7% 1.7
Uniformity 7.2% 7.3% 10.9% 8.5% 2.1
Variance 11.4% 11.1% 20.1% 14.2% 5.1
GLCM Autocorrelation 10.1% 10.6% 10.5% 10.4% 0.3
Cluster prominence 24.3% 24.2% 35.4% 28.0% 6.5
Cluster shade 66.1% −20.5% 30.2% 25.3% 43.5
Cluster tendency 11.5% 11.4% 19.8% 14.3% 4.8
Correlation 17.9% 2.9% 12.7% 11.1% 7.6
Difference average 10.1% 6.2% 11.3% 9.2% 2.7
Difference entropy 5.1% 2.8% 5.1% 4.3% 1.3
Difference variance 20.1% 11.2% 17.1% 16.1% 4.5
Inverse difference 2.6% 2.9% 3.8% 3.1% 0.6
Inverse difference moment 3.2% 3.8% 5.0% 4.0% 0.9
Inverse difference moment normalized 0.4% 0.2% 0.6% 0.4% 0.2
Inverse difference normalized 0.8% 0.5% 0.9% 0.8% 0.2
Information measure of correlation 1 10.9% 4.8% 7.0% 7.5% 3.1
Information measure of correlation 2 8.9% 2.8% 5.2% 5.7% 3.1
Inverse variance 6.0% 2.8% 4.7% 4.5% 1.6
Joint average 4.9% 5.3% 5.2% 5.1% 0.2
Joint energy 16.9% 14.0% 21.5% 17.5% 3.8
Joint entropy 4.6% 3.0% 5.2% 4.3% 1.1
Maximum probability 12.3% 2.8% 4.3% 6.5% 5.1
Maximal correlation coefficient 20.3% 21.5% 23.0% 21.6% 1.3
Sum average 4.9% 5.3% 5.2% 5.1% 0.2
Sum entropy 3.6% 2.4% 4.4% 3.5% 1.0
Sum squares 12.0% 11.4% 19.6% 14.3% 4.6
GLRLM High gray-level run emphasis 8.6% 11.2% 11.3% 10.3% 1.5
Long run emphasis 24.2% 10.1% 7.7% 14.0% 8.9
Long run high gray-level emphasis 20.8% 10.5% 6.3% 12.5% 7.4
Long run low gray-level emphasis 35.1% 14.6% 23.7% 24.5% 10.2
Low gray-level run emphasis 12.6% 7.7% 14.7% 11.7% 3.6
Run entropy 1.7% 1.5% 2.0% 1.7% 0.2
Run length nonuniformity 14.1% 10.0% 9.3% 11.1% 2.6
Run length nonuniformity normalized 4.7% 2.5% 3.1% 3.4% 1.1
Run percentage 3.6% 2.4% 2.2% 2.7% 0.7
Run variance 31.9% 21.6% 12.9% 22.1% 9.5
Short run emphasis 2.3% 1.1% 1.4% 1.6% 0.6
Short run high gray-level emphasis 8.7% 12.0% 12.6% 11.1% 2.1
Short run low gray-level emphasis 13.9% 7.2% 14.3% 11.8% 4.0
GLSZM High gray-level zone emphasis 15.3% 6.3% 11.1% 10.9% 4.5
Large area emphasis 34.6% 42.9% 37.6% 38.4% 4.2
Large area high gray-level emphasis 31.5% 33.7% 28.4% 31.2% 2.7
Large area low gray-level emphasis 47.5% 45.6% 42.5% 45.2% 2.5
Low gray-level zone emphasis 31.2% 28.9% 17.7% 25.9% 7.2
Size zone non uniformity 44.2% 18.4% 18.2% 26.9% 14.9
Size zone non uniformity normalized 29.4% 8.3% 9.0% 15.6% 12.0
Small area emphasis 21.8% 6.0% 5.5% 11.1% 9.3
Small area high gray-level emphasis 23.2% 10.7% 15.0% 16.3% 6.3
Small area low gray-level emphasis 50.3% 39.6% 21.7% 37.2% 14.4
Zone entropy 9.3% 1.7% 2.9% 4.6% 4.1
Zone percentage 24.8% 17.9% 13.4% 18.7% 5.8
Zone variance 33.8% 43.1% 38.0% 38.3% 4.7
NGTDM Busyness 24.1% 22.4% 19.4% 22.0% 2.4
Coarseness 13.4% 13.4% 15.3% 14.1% 1.1
Complexity 16.0% 14.2% 27.7% 19.3% 7.3
Contrast 20.0% 12.7% 26.6% 19.7% 6.9
Strength 18.2% 12.4% 19.5% 16.7% 3.8
Shape Elongation 0.4% 0.3% 1.0% 0.6% 0.3%
Flatness 6.8% 7.5% 3.9% 6.1% 1.9%
Least axis length 7.6% 9.2% 4.0% 6.9% 2.7%
Major axis length 1.3% 2.0% 0.9% 1.4% 0.6%
Maximum 2D diameter column 1.7% 1.8% 2.6% 2.0% 0.5%
Maximum 2D diameter row 1.5% 1.8% 1.1% 1.4% 0.4%
Maximum 2D diameter slice 0.6% 1.7% 1.8% 1.4% 0.7%
Maximum 3D diameter 1.3% 1.8% 1.1% 1.4% 0.4%
Mesh volume 10.0% 13.4% 6.6% 10.0% 3.4%
Minor axis length 1.4% 2.0% 1.3% 1.6% 0.4%
Sphericity 2.9% 2.2% 2.5% 2.5% 0.4%
Surface area 4.1% 6.6% 2.8% 4.5% 1.9%
Surface volume ratio 5.9% 6.2% 5.1% 5.7% 0.6%
Voxel volume 9.9% 13.3% 6.6% 10.0% 3.4%

3.6. Impact of Postprocessing on Feature Reproducibility

The original phantom image had an SNR of 73 (a.u.). The datasets with added Gaussian noise of 2, 5, 10, and 20 SD had an SNR of 57, 38, 18, and 10 (a.u.), respectively.

Figure 5(a) shows a heatmap demonstrating the ICC across the different noise iterations for each individual texture feature against each phantom test object. With an increase in noise the repeatability of some features from all feature classes decreases. Features from the GLSZM appear to be more sensitive to smaller changes in noise compared to features from GLCM, GLRLM, and NGTDM.

Fig. 5.

Heatmap of all ICC for each variation in image quality across the three test objects M1, M2, and M8 for texture features GLCM, GLRLM, GLSZM and NGTDM. (a) The ICC for all noise levels (2, 5, 10, and 20 SD) and (b) the ICC for all resampled resolution iterations introduced, except for 0.8 mm, which is the original resolution. The columns correspond to each feature and the rows correspond to the test object.

In contrast, first-order features [Fig. 6(a)] appear to be more robust in the presence of added noise, with the exception of maximum, uniformity, and variance. Figure 6(a) shows the ICC across the different noise levels for each first-order feature against each phantom test object. Even with an added 20 SD of Gaussian noise, the majority of features have ICC ≥ 0.8.

Fig. 6.

Heatmap of all ICC for each variation in image quality across the three test objects M1, M2, and M8 for first-order features. (a) The ICC for all noise levels (2, 5, 10, and 20 SD) and (b) ICC for all resampled resolution iterations. The columns correspond to each feature and the rows correspond to the test object.

Figure 5(b) shows the ICC across the different resolution settings for each individual texture feature against each phantom test object. Texture features appear to be more sensitive to changes in resolution, particularly those from GLSZM and NGTDM, whereas GLCM and GLRLM have the more stable features of the four texture classes.

The change in resolution also impacted the reproducibility of first-order features, particularly maximum, uniformity, and variance; however, the majority of features were highly reproducible [Fig. 6(b)].

4. Discussion

This study has used phantom and test–retest imaging in a rectal cancer patient cohort to assess the stability of MRI-based radiomics features.

The results of the study show that shape features, in both the phantom and test–retest cohorts, were the most stable feature class and provided the most repeatable and reproducible results, which is in agreement with previous repeatability studies in cervical25 and prostate26,27 cancer.

In the phantom repeatability measurements, shape features, which were based on auto-defined contours, performed best, with all features having a COV ≤ 10%. The remaining features from the first-order and texture feature classes had more varied results. One feature from the GLSZM class and no features from the NGTDM class had a COV ≤ 10% for the test–retest imaging protocol. This is similar to the results of Gourtsoyianni et al.,28 who found COV ranging from 30% to 50% or greater for features from GLSZM and NGTDM, concluding that these features were not reliable for texture analysis in primary rectal cancer. Moreover, a more recent study by Lee et al.29 found that features computed from NGTDM also exhibited high variation when assessing radiomic feature reliability between T1 and T2 weighted images.

In comparison to this study, Bianchini et al.30 found excellent repeatability of radiomic features with 92.1% to 96.4% of features having excellent repeatability (ICC>0.9) on two independent 1.5T MRI scanners and 79.9% of features exhibiting excellent repeatability at 3T. However, with repositioning the phantom (test–retest experiment), they observed a consistent reduction across all scanners, with only 78.4% to 80.6% of features having excellent repeatability (ICC>0.9) on the two independent 1.5T MRI scanners and only 11.2% of features exhibiting excellent repeatability at 3T. Overall, they only found 3.3% of features extracted had excellent repeatability (ICC>0.9) and reproducibility [concordance correlation coefficient (CCC)>0.9].

We found a varied range of ICC in the repeatability of texture and first-order statistics features for the patient test–retest cohort. All shape features had an ICC ≥ 0.8 with excellent interobserver agreement. This is not surprising as there was good interobserver agreement between the contoured volumes for both the test and retest scans as measured using DSC and HD. There was no statistically significant difference between the contoured volumes of the two observers for the test and retest scans. The second set of tumor segmentations was completed by one observer using a copy and adjust method instead of an independent blinded contour. This may have contributed to the excellent interobserver agreement in DSC and mean HD values.

Similarly to this study, Timmeren et al.31 found that shape features analyzed with CT were the most stable in a cohort of rectal and lung cancer patients. In a recent study by Schurink et al.,32 rectal cancer shape features were also found to be the most reliable radiomic feature and less affected by sources of variation such as image acquisition and segmentation. Contrary to the findings in this study, Traverso et al.33 found that first-order statistics had higher repeatability compared to GLCM and GLRLM features, but found very poor inter-observer repeatability of shape features in their rectal cancer dataset, which they attributed to interobserver variability of the tumor delineation.

The investigations into the impact of added Gaussian noise and varied spatial resolution yielded noteworthy results. First-order features were more robust in the presence of increased Gaussian noise and significant decreases of SNR compared to texture features. Overall, increasing or decreasing spatial resolution using clinically relevant and realistic values severely impacted the reproducibility of most texture features. As discussed previously, Bianchini et al.30 observed differences in the repeatability and reproducibility of radiomic features at different magnetic field strengths (1.5T versus 3T), which they postulated may be due to the differences in SNR between the two systems.

First-order features and texture features from the GLCM had more moderate ICC values with changes to resolution compared to features from the GLRLM, GLSZM, and NGTDM. These results are in agreement with previous studies where resolution was found to significantly impact the reliability of texture features in both phantom29,30,34,35 and in vivo datasets.29,33,36 Mayerhoefer et al.34 also found that texture features from the GLCM are superior to features from the GLRLM in their ability to distinguish texture in the presence of spatial resolution perturbations. However, both Mayerhoefer et al.34 and this study used retrospective adjustment of spatial resolution. A recent study by Yuan et al.35 prospectively adjusted imaging parameters that impact SNR, including partial Fourier in both the Y and Z directions, and compared the adjusted image quality to the baseline scans. They found that, based on percent deviation (d%), 25% of all computed features had excellent agreement (<5%) with baseline. However, when using ICC as a measure of robustness, 91% of features had excellent agreement (>0.9) with baseline imaging.
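
The gap between the two metrics reported by Yuan et al. can be illustrated with a small sketch. The data are made up, and the definitions are assumptions: d% is taken as the absolute difference from baseline divided by the baseline value, and ICC is computed in the two-way random-effects, absolute-agreement, single-measurement form, ICC(2,1), which may differ from the form used in the cited study. The sketch shows one possible reason the metrics can disagree: when a feature varies widely between objects, a consistent deviation of several percent barely lowers the ICC.

```python
def percent_deviation(baseline, adjusted):
    """Per-object deviation from baseline, in percent."""
    return [100.0 * abs(a - b) / abs(b) for b, a in zip(baseline, adjusted)]

def icc_2_1(table):
    """ICC(2,1) from an n-objects x k-conditions table (list of rows),
    using the mean squares of a two-way ANOVA without replication."""
    n, k = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Hypothetical values of one feature in five objects, at baseline and after a
# protocol change that inflates every value by roughly 8% to 10%.
baseline = [10.0, 50.0, 100.0, 200.0, 400.0]
adjusted = [11.0, 55.0, 108.0, 215.0, 430.0]

deviations = percent_deviation(baseline, adjusted)
icc = icc_2_1([list(pair) for pair in zip(baseline, adjusted)])
print([round(d, 1) for d in deviations])
print(round(icc, 3))
```

Here every deviation exceeds a 5% cutoff, yet the ICC remains above 0.9, so the same feature would be rated poor by d% and excellent by ICC.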

There are several limitations to this study. The phantom repeatability measurements were only performed on three test objects. Further investigations into the impact of image quality changes should be performed on the remaining test objects, as some texture measurements may not be suitable for analysis of tumors with more complex shapes.14

Moreover, adjustment of spatial resolution and added Gaussian noise was performed retrospectively in the image domain. A more comprehensive approach would be to vary the resolution and image noise (SNR) by adjusting individual MRI parameters at the time of imaging, as in previous studies.21,29,30 This may give more insight into the MRI parameters that impact radiomic feature stability, such as TE and TR,29,30 and help ensure that MRI protocols are kept consistent for studies extracting MRI-based radiomic features.

The patient numbers were also low; therefore, the observed correlations should be validated in a larger patient cohort. The radiomics analysis was only performed on T2-w images. Further investigations should assess the repeatability and reproducibility for a range of sequences used in rectal cancer treatment and staging, such as diffusion-weighted imaging and dynamic contrast-enhanced imaging.

The main novelty in this study is the use of a dedicated MRI radiomics phantom as a QA tool for a clinical imaging protocol used for radiomics analysis. The study found that even when using a radiomics phantom and in vivo test–retest imaging with a standardized imaging protocol in a single-center setting, some radiomic features suffer from poorer reproducibility and reliability. For radiomics to be used in the future as a reliable method to quantify tumor types, QA of imaging protocols is highly recommended alongside in vivo test–retest imaging. The sources of variation in MRI-based radiomics are well documented in the literature.21,29,30,32,35,37 However, similar to other quantitative imaging biomarkers such as ADC, further research into monitoring radiomic feature stability using dedicated phantoms is warranted. This is imperative in the setting of multicenter as well as single-center trials to ensure only the most robust features are included in a radiomic signature.
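
As a rough sketch of what such phantom-based monitoring could look like, the snippet below computes a coefficient of variation (CoV) per feature across repeated phantom sessions and flags features above a tolerance. The feature names and values are invented for illustration, the helper names are hypothetical, and the 10% default tolerance simply mirrors the CoV cutoff reported in the abstract; a real QA program would choose its own limits.

```python
import statistics

def coefficient_of_variation(values):
    """CoV in percent: sample standard deviation divided by |mean|."""
    return 100.0 * statistics.stdev(values) / abs(statistics.mean(values))

def unstable_features(sessions, tolerance_percent=10.0):
    """Return {feature: CoV%} for features whose CoV exceeds the tolerance."""
    covs = {name: coefficient_of_variation(vals) for name, vals in sessions.items()}
    return {name: cov for name, cov in covs.items() if cov > tolerance_percent}

# Hypothetical feature values from four phantom scanning sessions.
phantom_sessions = {
    "shape_volume": [1000.0, 1002.0, 998.0, 1001.0],
    "texture_example": [2.0, 3.1, 1.4, 2.6],
}

flagged = unstable_features(phantom_sessions)
print(sorted(flagged))  # ['texture_example']
```

Features flagged in this way could be excluded from a candidate radiomic signature or trigger a review of the imaging protocol.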

5. Conclusion

This study has evaluated the performance of radiomics analysis using phantom and test–retest imaging of rectal cancers. Even when using a consistent imaging protocol, many radiomic features were unstable in both in vivo and phantom experiments. Texture features from the GLRLM, GLSZM, and NGTDM appear to be more sensitive to changes in image quality such as varied noise and resolution. Shape features performed the best of all feature classes, with consistently high reproducibility and repeatability in vivo and in the phantom. There is a need for controlled image quality and regular QA of imaging protocols if these features are to be used for radiomic models in multicenter trials.

Biography

Biographies of the authors are not available.

Disclosures

The authors have no conflicts of interest to declare.

Contributor Information

Robba Rai, Email: robba.rai@health.nsw.gov.au.

Michael B. Barton, Email: profmichaelbarton@gmail.com.

Phillip Chlap, Email: phillip.chlap@unsw.edu.au.

Gary Liney, Email: gary.liney@health.nsw.gov.au.

Carsten Brink, Email: Carsten.Brink@rsyd.dk.

Shalini Vinod, Email: shalini.vinod@health.nsw.gov.au.

Monique Heinke, Email: Monique.Heinke@genesiscare.com.

Yuvnik Trada, Email: yuvnik@gmail.com.

Lois C. Holloway, Email: lois.holloway@health.nsw.gov.au.

References

1. Leijenaar R. T. H., et al., “Stability of FDG-PET radiomics features: an integrated analysis of test-retest and inter-observer variability,” Acta Oncol. 52, 1391–1397 (2013). doi: 10.3109/0284186X.2013.812798
2. Aerts H. J., et al., “Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach,” Nat. Commun. 5, 4006 (2014). doi: 10.1038/ncomms5006
3. Lambin P., et al., “Radiomics: extracting more information from medical images using advanced feature analysis,” Eur. J. Cancer 48, 441–446 (2012). doi: 10.1016/j.ejca.2011.11.036
4. Gillies R. J., et al., “The biology underlying molecular imaging in oncology: from genome to anatome and back again,” Clin. Radiol. 65, 517–521 (2010). doi: 10.1016/j.crad.2010.04.005
5. Kumar V., et al., “Radiomics: the process and the challenges,” Magn. Reson. Imaging 30(9), 1234–1248 (2012). doi: 10.1016/j.mri.2012.06.010
6. Clark W. M., et al., “Intratumor heterogeneity and branched evolution revealed by multiregion sequencing,” N. Engl. J. Med. 353(7), 701–711 (2005). doi: 10.1056/NEJMra041866
7. Yip S. S. F., Aerts H. J. W. L., “Applications and limitations of radiomics,” Phys. Med. Biol. 61, R150–R166 (2016). doi: 10.1088/0031-9155/61/13/R150
8. Papanikolaou N., Matos C., Koh D. M., “How to develop a meaningful radiomic signature for clinical use in oncologic patients,” Cancer Imaging 20, 1–10 (2020). doi: 10.1186/s40644-020-00311-4
9. Larue R. T. H. M., et al., “Quantitative radiomics studies for tissue characterization: a review of technology and methodologic procedures,” Br. J. Radiol. 90, 20160665 (2017).
10. Yang L., et al., “Development of a radiomics nomogram based on the 2D and 3D CT features to predict the survival of non-small cell lung cancer patients,” Eur. Radiol. 29, 2196–2206 (2019). doi: 10.1007/s00330-018-5770-y
11. Parmar C., et al., “Radiomic feature clusters and prognostic signatures specific for lung and head & neck cancer,” Nat. Sci. Rep. 5, 1–10 (2015). doi: 10.1038/srep11044
12. Shen C., et al., “2D and 3D CT radiomics features prognostic performance comparison in non-small cell lung cancer,” Transl. Oncol. 10, 886–894 (2017). doi: 10.1016/j.tranon.2017.08.007
13. Traverso A., et al., “Repeatability and reproducibility of radiomic features: a systematic review,” Int. J. Radiat. Oncol. 102, 1143–1158 (2018). doi: 10.1016/j.ijrobp.2018.05.053
14. Rai R., et al., “Multicenter evaluation of MRI-based radiomic features: a phantom study,” Med. Phys. 47, 3054–3063 (2020). doi: 10.1002/mp.14173
15. Beets-Tan R. G. H., et al., “Magnetic resonance imaging for clinical management of rectal cancer: updated recommendations from the 2016 European Society of Gastrointestinal and Abdominal Radiology (ESGAR) consensus meeting,” Eur. Radiol. 28, 1465–1475 (2018). doi: 10.1007/s00330-017-5026-2
16. LeBlanc J. K., “Imaging and management of rectal cancer,” Nat. Clin. Pract. Gastroenterol. Hepatol. 4, 665–676 (2007). doi: 10.1038/ncpgasthep0977
17. Nie K., et al., “Rectal cancer: assessment of neoadjuvant chemoradiation outcome based on radiomics of multiparametric MRI,” Clin. Cancer Res. 22, 5256–5264 (2016). doi: 10.1158/1078-0432.CCR-15-2997
18. Liu Z., et al., “Radiomics analysis for evaluation of pathological complete response to neoadjuvant chemoradiotherapy in locally advanced rectal cancer,” Clin. Cancer Res. 23(23), 7253–7262 (2017). doi: 10.1158/1078-0432.CCR-17-1038
19. Horvat N., et al., “MR imaging of rectal cancer: radiomics analysis to assess treatment response after neoadjuvant therapy,” Radiology 287, 833–843 (2018). doi: 10.1148/radiol.2018172300
20. De Cecco C. N., et al., “Texture analysis as imaging biomarker of tumoral response to neoadjuvant chemoradiotherapy in rectal cancer patients studied with 3-T magnetic resonance,” Invest. Radiol. 50, 239–245 (2015). doi: 10.1097/RLI.0000000000000116
21. Xue C., et al., “Radiomics feature reliability assessed by intraclass correlation coefficient: a systematic review,” Quant. Imaging Med. Surg. 11, 4431–4460 (2021). doi: 10.21037/qims-21-86
22. Fedorov A., et al., 3D Slicer [internet], https://www.slicer.org/
23. Van Griethuysen J. J. M., et al., “Computational radiomics system to decode the radiographic phenotype,” Cancer Res. 77, e104–e107 (2017). doi: 10.1158/0008-5472.CAN-17-0339
24. Yan J., et al., “Impact of image reconstruction settings on texture features in 18F-FDG PET,” J. Nucl. Med. 56, 1667–1673 (2015). doi: 10.2967/jnumed.115.156927
25. Fiset S., et al., “Repeatability and reproducibility of MRI-based radiomic features in cervical cancer,” Radiother. Oncol. 135, 107–114 (2019). doi: 10.1016/j.radonc.2019.03.001
26. Fedorov A., et al., “Multiparametric magnetic resonance imaging of the prostate: repeatability of volume and apparent diffusion coefficient quantification,” Invest. Radiol. 52, 538–546 (2017). doi: 10.1097/RLI.0000000000000382
27. Schwier M., et al., “Repeatability of multiparametric prostate MRI radiomics features,” Sci. Rep. 9, 1–16 (2019). doi: 10.1038/s41598-019-45766-z
28. Gourtsoyianni S. T. D. R., et al., “Primary rectal cancer: repeatability of global and local-regional MR imaging texture features,” Radiology 284(2), 552–561 (2017). doi: 10.1148/radiol.2017161375
29. Lee J., et al., “Radiomics feature robustness as measured using an MRI phantom,” Sci. Rep. 11, 3973 (2021). doi: 10.1038/s41598-021-83593-3
30. Bianchini L., et al., “A multicenter study on radiomic features from T2-weighted images of a customized MR pelvic phantom setting the basis for robust radiomic models in clinics,” Magn. Reson. Med. 85, 1713–1726 (2021). doi: 10.1002/mrm.28521
31. Van Timmeren J. E., et al., “Test–retest data for radiomics feature stability analysis: generalizable or study-specific?” Tomography 2, 361–365 (2016). doi: 10.18383/j.tom.2016.00208
32. Schurink N. W., et al., “Sources of variation in multicenter rectal MRI data and their effect on radiomics feature reproducibility,” Eur. Radiol. 32, 1506–1516 (2022). doi: 10.1007/s00330-021-08251-8
33. Traverso A., et al., “Stability of radiomic features of apparent diffusion coefficient (ADC) maps for locally advanced rectal cancer in response to image pre-processing,” Phys. Med. 61, 44–51 (2019). doi: 10.1016/j.ejmp.2019.04.009
34. Mayerhoefer M. E., et al., “Effects of MRI acquisition parameter variations and protocol heterogeneity on the results of texture analysis and pattern discrimination: an application-oriented study,” Med. Phys. 36, 1236–1243 (2009). doi: 10.1118/1.3081408
35. Yuan J., et al., “Quantitative assessment of acquisition imaging parameters on MRI radiomics features: a prospective anthropomorphic phantom study using a 3D-T2W-TSE sequence for MR-guided-radiotherapy,” Quant. Imaging Med. Surg. 11, 1870–1887 (2021). doi: 10.21037/qims-20-865
36. Brynolfsson P., et al., “Haralick texture features from apparent diffusion coefficient (ADC) MRI images depend on imaging and pre-processing parameters,” Sci. Rep. 7, 4041 (2017). doi: 10.1038/s41598-017-04151-4
37. Zhao B., “Understanding sources of variation to improve the reproducibility of radiomics,” Front. Oncol. 11, 1–21 (2021). doi: 10.3389/fonc.2021.633176

Articles from Journal of Medical Imaging are provided here courtesy of Society of Photo-Optical Instrumentation Engineers
