Author manuscript; available in PMC: 2024 Apr 1.
Published in final edited form as: J Magn Reson Imaging. 2022 Jul 19;57(4):1029–1039. doi: 10.1002/jmri.28365

Generalizability of Deep Learning Segmentation Algorithms for Automated Assessment of Cartilage Morphology and MRI Relaxometry

Andrew M Schmidt 1, Arjun D Desai 1,2, Lauren E Watkins 1,3, Hollis Crowder 4, Marianne S Black 1,4, Valentina Mazzoli 1, Elka B Rubin 1, Quin Lu 5, James W MacKay 6,7, Robert D Boutin 1, Feliks Kogan 1, Garry E Gold 1,3, Brian A Hargreaves 1,2,3, Akshay S Chaudhari 1,8
PMCID: PMC9849481  NIHMSID: NIHMS1822687  PMID: 35852498

Abstract

Background:

Deep learning (DL)-based automatic segmentation models can expedite manual segmentation yet require resource-intensive fine-tuning before deployment on new datasets. The generalizability of DL methods to new datasets without fine-tuning is not well characterized.

Purpose:

Evaluate the generalizability of DL-based models by deploying pre-trained models on independent datasets varying by MR scanner, acquisition parameters, and subject population.

Study Type:

Retrospective based on prospectively acquired data.

Population:

Overall test dataset: 59 subjects (26 females); Study 1: five healthy subjects (zero females), Study 2: eight healthy subjects (eight females), Study 3: ten subjects with osteoarthritis (eight females), Study 4: 36 subjects with various knee pathology (ten females).

Field Strength/Sequence:

3-T, quantitative double-echo steady-state (qDESS).

Assessment:

Four annotators manually segmented knee cartilage. Each reader segmented one of four qDESS datasets in the test dataset. Two DL models, one trained on qDESS data and another on Osteoarthritis Initiative (OAI)-DESS data, were assessed. Manual and automatic segmentations were compared by quantifying variations in segmentation accuracy, volume, and T2 relaxation times for superficial and deep cartilage.

Statistical Tests:

Dice similarity coefficient (DSC) for segmentation accuracy. Lin's Concordance Correlation Coefficient (CCC), Wilcoxon rank-sum tests, and root-mean-square error coefficient of variation (RMSE-CV) to quantify manual vs. automatic T2 and volume variations. Bland-Altman plots for manual vs. automatic T2 agreement. A p-value <0.05 was considered statistically significant.

Results:

DSCs for the qDESS-trained model, 0.79–0.93, were higher than those for the OAI-DESS-trained model, 0.59–0.79. T2 and volume CCCs for the qDESS-trained model, 0.75–0.98 and 0.47–0.95, were higher than the respective CCCs for the OAI-DESS-trained model, 0.35–0.90 and 0.13–0.84. Bland-Altman 95% limits of agreement for superficial and deep cartilage T2 were narrower for the qDESS-trained model, ±2.4 ms and ±4.0 ms, than for the OAI-DESS-trained model, ±4.4 ms and ±5.2 ms.

Data Conclusion:

The qDESS-trained model may generalize well to independent qDESS datasets regardless of MR scanner, acquisition parameters, and subject population.

Keywords: cartilage, segmentation, machine learning, osteoarthritis, qMRI

Introduction

Knee osteoarthritis (OA) is a prevalent, debilitating whole joint disease that is characterized by the progressive structural degradation of articular cartilage (1,2). MRI can non-invasively detect early changes in cartilage morphology and composition indicative of joint degeneration due to OA (3,4). Developing strategies to reliably assess these early changes can potentially identify individuals at risk prior to the development of irreversible changes and contribute to strategies for early intervention (5,6). Cartilage morphology metrics such as thickness and volume and cartilage relaxometry metrics such as T2 relaxation times have demonstrated potential as imaging biomarkers for detecting such early changes (7,8). Despite these potential benefits, widespread use of quantitative morphological and compositional MRI techniques for OA remains limited. A major obstacle in developing quantitative MRI biomarkers for OA is manual tissue segmentation, a time-consuming process prone to inter- and intra-reader variability (9,10).

Deep learning (DL)-based automatic segmentation algorithms have the potential to expedite segmentation tasks, minimize inter-reader variability, and circumvent intra-reader variability (11,12). These automated segmentation models are becoming an increasingly common method for medical image analysis and have demonstrated high performance on quantitative MRI tasks including cartilage morphology and relaxometry (13). A robust fully automatic segmentation model could help more efficiently analyze large datasets and avoid the manual segmentation bottleneck hindering the widespread application of quantitative MRI techniques for OA. Recent results from the 2019 knee MRI segmentation challenge indicate that automatic segmentation models can perform cartilage segmentation comparable to that of manual segmentation performed by human annotators, regardless of the DL approach used, provided the models are specifically tuned to the features of a given dataset (14).

Prior knee DL segmentation models have primarily demonstrated high efficacy on double-echo steady-state (DESS) MRI scans obtained from the Osteoarthritis Initiative (OAI) (14–16). The OAI dataset contains a subset of publicly available expert-annotated segmentations which has led to increasing use of OAI-trained DL segmentation models in musculoskeletal research (14,17,18). However, systematic evaluation of model performance remains a challenge as different groups have used different subsets of OAI data, which makes it difficult to accurately compare methods (14–19). Further, existing segmentation metrics, such as the Dice similarity coefficient (DSC), have been shown to not necessarily predict algorithm performance on clinically relevant quantitative MRI biomarkers (14). Therefore, there is a need to establish methods to systematically evaluate the performance of these OAI-trained models using clinically relevant endpoints.

Besides discordance between technical and clinical metrics for evaluation, DL techniques can be highly sensitive to shifts in data distributions that can be caused by changes in the imaging setup, such as differences in MR scanner vendors, image acquisition parameters, and patient populations (20). A major consequence of these distribution shifts is that a method trained on one dataset may not generalize to other datasets. Algorithms are often fine-tuned to every new dataset, which involves training an already pretrained model on a small corpus of relevant data (21). This fine-tuning is often not feasible due to the requirement of additional dataset-specific training data and additional computational resources. Consequently, the generalizability of DL methods for knee MRI segmentation on independent datasets with different MRI vendors, sequence variations, and varied subject populations has not been assessed. Such a validation may be crucial to improve the characterization of the robustness and utility of DL model generalizability in these cross-cohort scenarios.

In this study, we provide a framework to systematically evaluate the generalizability of a DL-based MRI cartilage segmentation model. We evaluate whether automated cartilage segmentation models generalize across four separate datasets that vary in MR vendor and model, subject population, and image acquisition parameters. We further evaluate the impact of using different DL models to assess this generalizability by using clinically relevant metrics of cartilage morphometry and relaxometry in addition to traditional segmentation metrics.

Materials and Methods

All data were acquired with institutional review board approval and all subjects provided written informed consent.

Subjects

Data from four independent studies described previously were used for evaluation (22–25), for a total of 59 subjects (82 knees). The studies varied in terms of subject demographics and joint health (Table 1), MR vendor and model, and image acquisition parameters (Table 2). Study 1 recruited healthy male subjects (22); Study 2 recruited healthy female recreational runners (23); Study 3 recruited subjects with knee OA and mild to severe knee joint pain (24); and Study 4 recruited subjects referred for a clinical knee MRI to evaluate various pathologies, including but not limited to: cartilage damage, ligament damage, meniscal damage, and joint effusion (25). Three subjects were excluded from Study 2 due to missing segmentation masks and ten subjects were excluded from Study 3 due to the unavailability of segmentations at the onset of this study (23,24). The healthy control subjects from Studies 2 and 3 were also excluded from this analysis to ensure the subject populations were distinct between Studies 1–4 (23,24). Subjects in Study 4 subsequently underwent arthroscopic surgery for internal derangement (25).

Table 1.

Details on subject demographics and cartilage surfaces examined within the four studies.

Subject Details

| | Study 1 | Study 2 | Study 3 | Study 4 |
|---|---|---|---|---|
| Knee Count | 5 left, 5 right | 8 left, 8 right | 10 left, 10 right | 24 left, 12 right |
| Subject Sex | 0 female, 5 male | 8 female, 0 male | 8 female, 2 male | 10 female, 26 male |
| Subject Age (mean years ± STD) | 39.6 (±11.3) | 33.9 (±4.3) | 60.5 (±8.9) | 42.1 (±15.8) |
| Cartilage Surfaces with Manual Segmentations | FC | FC | FC | FC, TC, PC |
| Subject Joint Status | Healthy | Healthy | Mild to Moderate OA | Various Pathologies |
| Time of Subject Enrollment | 2018 | 2018–2020 | 2019–2020 | 2016–2018 |

STD: Standard Deviation; FC: Femoral Cartilage; TC: Tibial Cartilage; PC: Patellar Cartilage.

Table 2.

Image acquisition parameters and MR scanner details for all studies.

Image Acquisition Parameters

| | Study 1 (Scanner 1) | Study 1 (Scanner 2) | Study 2 | Study 3 | Study 4 |
|---|---|---|---|---|---|
| MR Scanner | GE Premier | Philips Ingenia | GE Premier | GE Signa PET/MRI | GE MR750 |
| Unilateral/Bilateral Scan | Unilateral | Unilateral | Bilateral | Bilateral | Unilateral |
| Coil | 18ch transmit-receive coil | 16ch transmit-receive coil | 16ch flexible receive coil | 16ch flexible receive coil | 8ch transmit-receive coil |
| Bandwidth (kHz) | 62.5 | 63 | 83 | 71 | 83.3 |
| Parallel Imaging (ky × kz) | None | None | 2 × 3 | 2 × 3 | 2 × 1 |
| Number of Slices | 44 | 40 | 250 | 220 | 80 |
| Slice Thickness (mm) | 3 | 3 | 1.4 | 1.5 | 1.6 (interpolated to 0.8) |
| In-Plane Resolution (mm × mm) | 0.50 × 0.50 | 0.50 × 0.50 | 0.50 × 0.50 | 0.50 × 0.50 | 0.38 × 0.38 |
| Repetition Time (ms) | 18.0 | 17.7 | 15.5 | 19.8 | 20.4 |
| 1st Echo Time (ms) | 5.8 | 5.8 | 5.2 | 6.7 | 6.4 |
| 2nd Echo Time (ms) | 30.2 | 13.3 | 25.8 | 33.0 | 34.3 |
| Maximum Nominal Gradient Strength (mT/m) | 80 | 45 | 80 | 44 | 50 |

ch: channel; kHz: kilohertz; ky × kz: k-space y × k-space z; mm: millimeter; mT/m: millitesla per meter.

Image Acquisition

Scans for Studies 1 and 2 were performed on a GE Premier 3.0-T MRI scanner (GE Healthcare, Waukesha, WI, USA), denoted as Scanner 1 in this study (22,23). All subjects in Study 1 were also scanned with a Philips Ingenia 3-T MRI scanner (Philips Healthcare, Best, Netherlands), denoted as Scanner 2 in this study (22). Scans for Study 1 were performed on Scanners 1 and 2 for all subjects on the same day, and care was taken to ensure minimal physical activity on the day of imaging as well as the day prior to imaging (22). Scans for Study 3 were performed on a 3-T GE Signa Positron Emission Tomography (PET)/MRI hybrid system (GE Healthcare, Milwaukee, WI, USA) (24). Scans for Study 4 were performed on a clinical GE MR750 3-T MRI scanner (GE Healthcare, Milwaukee, WI, USA) (25).

All studies varied in scan acquisition parameters such as resolution, parallel imaging, number and type of coils, and bandwidth, all of which affect the image signal-to-noise ratio (SNR). Regarding parallel imaging, for simultaneous bilateral knee MRI, it has been shown that 2×3 acceleration with a dual-coil-array is comparable in SNR and quantitative accuracy to a 2×1 acceleration with a single coil due to minimal crosstalk between the coils and a minimal increase in g-factor maps (26). Scans for Studies 1 and 4 were acquired unilaterally and scans for Studies 2 and 3 were acquired bilaterally (22–25).

We acquired all scans with a qDESS sequence, a multi-contrast, distortion-free three-dimensional (3D) MRI pulse sequence initially described as a sequence that combined two echoes into one image and later modified to generate two separate echo images (27–30). These separate qDESS echoes can be used to analytically compute accurate T2 relaxation time measurements of articular cartilage (28–30). The qDESS sequence also supports high-resolution cartilage morphometry measurements, as validated in the OAI with the DESS sequence, which combines the two qDESS echoes (28,29,31). Thus, qDESS datasets are suitable for evaluating the generalizability of OAI-DESS-trained segmentation algorithms.
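As a concrete illustration of how the two qDESS echoes yield a T2 estimate, the sketch below uses only the simplified exponential relation between the second and first echo signals. The analytic method used in this work (29) additionally accounts for flip angle and T1, so this is a minimal approximation rather than the deployed implementation; the function name and clipping threshold are illustrative choices.

```python
import numpy as np

def qdess_t2_basic(echo1, echo2, tr_ms, te_ms, t2_max_ms=100.0):
    """Rough voxelwise T2 estimate from the two qDESS echo images.

    Assumes the simplified relation S2/S1 ~ exp(-2*(TR - TE)/T2), which ignores
    T1 and flip-angle effects; the refined analytic model used in this study
    (reference 29) includes those corrections.
    """
    s1 = np.asarray(echo1, dtype=float)
    s2 = np.asarray(echo2, dtype=float)
    ratio = np.divide(s2, s1, out=np.zeros_like(s1), where=s1 > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        t2 = -2.0 * (tr_ms - te_ms) / np.log(ratio)
    t2[~np.isfinite(t2)] = 0.0          # mask voxels without a valid signal ratio
    return np.clip(t2, 0.0, t2_max_ms)  # cap implausible values, a common cartilage convention
```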

Image Processing

Four readers manually segmented femoral cartilage in Studies 1–4 and tibial cartilage and patellar cartilage in Study 4 (22–25). All segmentations were performed in ITK-SNAP using multi-contrast and multi-planar images (http://www.itksnap.org). Femoral cartilage in Study 1 was segmented by a reader (E.B.R.) with three years of experience (22). Femoral cartilage in Study 2 was segmented by a reader (H.C.) with two years of experience (23). Femoral cartilage in Study 3 was segmented by a reader (L.E.W.) with five years of experience (24). Femoral cartilage, tibial cartilage, and patellar cartilage in Study 4 were segmented by a medical image analysis laboratory and manually corrected by a reader (A.M.S.) with three years of experience (25). In general, all readers opted to segment conservatively, especially around fluid-cartilage and bone-cartilage interfaces. All quality-controlled manual segmentations were considered the ground truth against which the automated segmentations were compared.

Two independent, publicly available DL-based segmentation models were used in this analysis: an OAI-DESS-trained model and a qDESS-trained model (32). Both models are available for installation at https://dosma.readthedocs.io/en/latest/models.html. The OAI-DESS-trained model was trained using 120 DESS scans (60 subjects with OA with Kellgren-Lawrence (KL) grades 1–3 at two timepoints, one year apart) with 0.7 mm slice resolution and no acceleration, acquired on Siemens 3-T scanners (14). The qDESS-trained model was trained on 86 qDESS scans (86 subjects who underwent routine diagnostic knee MRI) provided in the publicly available SKM-TEA dataset, with 1.5 mm slice thickness and 2×1 acceleration, acquired on a GE 3-T scanner (25). During training, the qDESS-trained model used qDESS images in which the two echoes were combined via a root-sum-of-squares (RSS) technique. Both models are based on a previously described two-dimensional (2D) U-Net architecture, and neither model was fine-tuned to specifically quantify cartilage volume or cartilage T2 relaxation times (12). Both models were used to automatically segment the femoral cartilage in Studies 1–4 and the tibial cartilage and patellar cartilage in Study 4. There was no overlap between the scans used for training and those used for evaluating the segmentation algorithms.
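For context, the following sketch shows how such a pretrained 2D U-Net could be applied slice-by-slice to an RSS-combined qDESS volume. It is written against a generic Keras-style model object, and the function names, per-volume normalization, and threshold are illustrative assumptions; the exact preprocessing, API, and weights should be taken from the DOSMA documentation linked above rather than from this sketch.

```python
import numpy as np

def rss_combine(echo1, echo2):
    """Root-sum-of-squares combination of the two qDESS echoes, matching the
    single-image input representation used to train the qDESS model."""
    return np.sqrt(echo1.astype(float) ** 2 + echo2.astype(float) ** 2)

def segment_volume(model, volume, threshold=0.5):
    """Apply a 2D U-Net slice-by-slice to a 3D volume (rows, columns, slices)
    and return a binary mask.

    `model` is assumed to be a Keras-style network mapping (H, W, 1) slices to
    per-pixel probabilities; the z-score normalization and threshold here are
    illustrative choices, not the documented pipeline.
    """
    vol = (volume - volume.mean()) / (volume.std() + 1e-8)
    mask = np.zeros(vol.shape, dtype=np.uint8)
    for k in range(vol.shape[-1]):
        slc = vol[..., k][np.newaxis, ..., np.newaxis]      # shape (1, H, W, 1)
        prob = model.predict(slc, verbose=0)[0, ..., 0]
        mask[..., k] = (prob >= threshold).astype(np.uint8)
    return mask
```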

For all manual and automatic segmentations, the Deep Open-Source Medical Image Analysis (DOSMA) framework (32), an open-source toolbox for MRI analysis, was used to calculate T2 maps via an established method (28–30). It was also used to automatically subdivide the segmented cartilage into anatomical subregions. For sub-regional cartilage analysis, femoral and tibial cartilage segmentations were unrolled to produce a 2D image (33) and automatically subdivided into the medial and lateral condyles, deep and superficial layers, and anterior, central, and posterior zones using previously described methods (23,34). Patellar cartilage segmentations were unrolled to produce a 2D image and automatically subdivided into the medial and lateral condyles and deep and superficial layers. A summary of the image processing workflow can be seen in Figure 1.

Figure 1:

qDESS image processing workflow. The qDESS sequence reconstructs two individual echo images separately and combines them into a single output image using a RSS technique. These two echo images are used to automatically generate a 3D T2 map. The cartilage type of interest (femoral cartilage in the above figure) is then segmented from the RSS image either by a manual reader or an automated segmentation model. Using an open-source toolbox for MR image analysis, the 3D segmentation is unrolled to produce a 2D image of the segmented cartilage, which is automatically subdivided into anatomical subregions and used to produce a T2 map of the segmented cartilage. Anatomical subregions in the region map are the following: MA – medial anterior, MC – medial central, MP – medial posterior, LA – lateral anterior, LC – lateral central, LP – lateral posterior.
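To make the subregion step concrete, the sketch below splits an unrolled 2D femoral cartilage T2 map into medial/lateral halves and anterior/central/posterior thirds using simple index ranges and reports the mean T2 per subregion. The axis orientation, the equal-thirds split, and the use of zero-valued pixels as background are assumptions for illustration only; the toolbox used in this study defines the subregion boundaries anatomically (23,34).

```python
import numpy as np

def subregion_mean_t2(unrolled_t2, medial_first=True):
    """Mean T2 per femoral cartilage subregion from an unrolled 2D T2 map.

    Assumes rows run medial-to-lateral and columns run anterior-to-posterior,
    splits each axis into simple halves/thirds, and treats zero-valued pixels
    as background; these are illustrative simplifications of the anatomical
    subdivision used in the study.
    """
    rows, cols = unrolled_t2.shape
    sides = [("medial", slice(0, rows // 2)), ("lateral", slice(rows // 2, rows))]
    if not medial_first:
        sides = [("lateral", sides[0][1]), ("medial", sides[1][1])]
    zones = [
        ("anterior", slice(0, cols // 3)),
        ("central", slice(cols // 3, 2 * cols // 3)),
        ("posterior", slice(2 * cols // 3, cols)),
    ]
    means = {}
    for side, rs in sides:
        for zone, cs in zones:
            region = unrolled_t2[rs, cs]
            cartilage = region[region > 0]          # zeros mark non-cartilage pixels
            means[f"{side} {zone}"] = float(cartilage.mean()) if cartilage.size else float("nan")
    return means
```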

Cartilage Abnormality Evaluation

Study 3 had associated MRI Osteoarthritis Knee Scores (MOAKS) for cartilage analysis (24,35), and Study 4 had associated cartilage lesion scores classified via the modified Noyes articular cartilage lesion classification system (25,36). One MSK radiologist (J.W.M., with 8 years of experience, blinded) performed MOAKS evaluation of all cartilage subregions in Study 3. A second MSK radiologist (R.D.B., with 25 years of experience, blinded) performed cartilage lesion classification via the modified Noyes classification system for all subjects in Study 4. A detailed breakdown of cartilage lesion evaluation for Studies 3 and 4 can be seen in Tables 3 and 4.

Table 3.

MOAKS summary for subjects in Study 3. MOAKS regions for femoral cartilage are subdivided by medial/lateral condyles, and anterior/central/posterior subregions. MOAKS was performed to evaluate cartilage lesion size and full thickness loss.

Study 3 Femoral Cartilage MOAKS Summary

| MOAKS Score | Size (Region Count) | Full Thickness Loss (Region Count) |
|---|---|---|
| 0 | 80 | 98 |
| 1 | 6 | 3 |
| 2 | 24 | 13 |
| 3 | 10 | 6 |

Table 4.

Modified Noyes cartilage lesion scores summary for subjects in Study 4.

Study 4 Cartilage Lesion Summary

| Tissue | Lesion Grade | Lesion Count |
|---|---|---|
| Femoral Cartilage | 1 | 5 |
| Femoral Cartilage | 2A | 13 |
| Femoral Cartilage | 2B | 2 |
| Femoral Cartilage | 3 | 4 |
| Tibial Cartilage | 1 | 5 |
| Tibial Cartilage | 2A | 7 |
| Tibial Cartilage | 2B | 1 |
| Tibial Cartilage | 3 | 1 |
| Patellar Cartilage | 1 | 4 |
| Patellar Cartilage | 2A | 8 |
| Patellar Cartilage | 2B | 1 |
| Patellar Cartilage | 3 | 3 |

Statistical Analysis

The DSC was used to assess pixel-wise accuracy between manual and automatic segmentations in Studies 1–4. Variations in T2 relaxation times and cartilage volume between manual and automatic segmentations were quantified with Lin's Concordance Correlation Coefficient (CCC), Wilcoxon rank-sum tests, and root-mean-square error coefficient of variation (RMSE-CV). Wilcoxon rank-sum tests were also used to test whether there were consistent differences between the automatic segmentations produced by the OAI-DESS-trained and qDESS-trained models. Bland-Altman plots with 95% limits of agreement (LoA) were used to visualize and analyze manual vs. automatic T2 agreement. All statistical analyses were performed in Python using the open-source pandas package, with statistical significance defined as p<0.05 (Python version 3.6.6, available at https://www.python.org; pandas version 1.1.5, available at https://pandas.pydata.org/).
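As a reference for the metric definitions, a minimal numpy/scipy sketch of each statistic is shown below. The RMSE-CV here is normalized by the mean of the manual measurements, which is one common convention; that choice, and all function names, are assumptions rather than the study's documented implementation.

```python
import numpy as np
from scipy.stats import ranksums

def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def lin_ccc(x, y):
    """Lin's concordance correlation coefficient between paired measurements."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    covariance = np.cov(x, y, bias=True)[0, 1]
    return 2.0 * covariance / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

def rmse_cv(manual, auto):
    """RMSE coefficient of variation (%), normalized by the manual mean (one common convention)."""
    manual, auto = np.asarray(manual, float), np.asarray(auto, float)
    return 100.0 * np.sqrt(np.mean((auto - manual) ** 2)) / manual.mean()

def bland_altman_loa(manual, auto):
    """Mean bias and 95% limits of agreement (bias +/- 1.96 SD of the differences)."""
    diff = np.asarray(auto, float) - np.asarray(manual, float)
    bias, sd = diff.mean(), diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Example: test for consistent differences between manual and automatic T2 values
# stat, p_value = ranksums(manual_t2_values, automatic_t2_values)
```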

Results

The DSC values for the qDESS-trained model, indicating segmentation accuracy between manual and automatic segmentations, ranged from 0.79 to 0.93 and were significantly higher than all respective DSC values for the OAI-DESS-trained model, which ranged from 0.59 to 0.79.

For both layers of all cartilage surfaces in all studies, manual vs. automatic T2 CCC values were high for the qDESS-trained model, ranging from 0.75 to 0.98 (Table 5). The qDESS-trained model performed particularly well in deep cartilage regions, with T2 CCC values ranging from 0.89 to 0.98. The qDESS-trained model T2 CCC values were higher than all respective T2 CCC values for the OAI-DESS-trained model, which ranged from 0.35 to 0.90, apart from superficial femoral cartilage in Study 1 (Scanner 2), for which both models had a CCC value of 0.86.

Table 5.

CCC and RMSE-CV values indicating the T2 relaxation time agreement between manual ground truth segmentations and automatic segmentations from both segmentation models for the superficial and deep layers of femoral cartilage in Studies 1–4 and the superficial and deep layers of tibial cartilage and patellar cartilage in Study 4.

Manual vs. Automatic T2 Agreement

| Study | Tissue | Cartilage Region | CCC (qDESS Model) | CCC (OAI-DESS Model) | RMSE-CV % (qDESS Model) | RMSE-CV % (OAI-DESS Model) |
|---|---|---|---|---|---|---|
| 1: Scanner 1 | FC | Superficial | 0.75* | 0.41** | 4.7 | 7.6 |
| 1: Scanner 1 | FC | Deep | 0.96 | 0.56** | 2.1 | 7.6 |
| 1: Scanner 2 | FC | Superficial | 0.86 | 0.86 | 3.9 | 4.0 |
| 1: Scanner 2 | FC | Deep | 0.96 | 0.35** | 2.0 | 10.7 |
| 2 | FC | Superficial | 0.94 | 0.88 | 2.3 | 3.5 |
| 2 | FC | Deep | 0.92 | 0.48** | 3.9 | 10.2 |
| 3 | FC | Superficial | 0.85 | 0.82 | 3.9 | 4.3 |
| 3 | FC | Deep | 0.91 | 0.64** | 3.8 | 8.1 |
| 4 | FC | Superficial | 0.94 | 0.90 | 2.2 | 2.8 |
| 4 | FC | Deep | 0.98 | 0.69** | 1.7 | 6.5 |
| 4 | TC | Superficial | 0.84 | 0.55** | 4.3 | 7.5 |
| 4 | TC | Deep | 0.89 | 0.57** | 3.4 | 7.2 |
| 4 | PC | Superficial | 0.84 | 0.33** | 3.2 | 10.1 |
| 4 | PC | Deep | 0.96 | 0.39** | 1.7 | 8.2 |

P-values from Wilcoxon rank-sum tests indicating significant differences between manual and automatic segmentation T2 values are denoted as follows: * <0.01, ** <0.001.

FC: Femoral Cartilage; TC: Tibial Cartilage; PC: Patellar Cartilage.

The T2 RMSE-CV values for the qDESS-trained model were also lower than all respective values for the OAI-DESS-trained model, ranging from 1.7% to 4.7% for the qDESS-trained model and from 2.8% to 10.7% for the OAI-DESS-trained model. Bland-Altman plots indicating manual vs. automatic T2 variations for deep and superficial femoral cartilage for the qDESS-trained model had 95% LoA of ±2.4 milliseconds (ms) (mean bias 0.45 ms) and ±4.0 ms (mean bias 0.61 ms), respectively. These LoA for the deep and superficial femoral cartilage for the qDESS-trained model were narrower than the LoA for the OAI-DESS-trained model, which were ±4.4 ms (mean bias 1.60 ms) and ±5.2 ms (mean bias −1.20 ms), respectively (Fig 2). For both models, manual vs. automatic T2 performance was similar between the anterior, central, and posterior anatomical subregions.

Figure 2:

Bland-Altman plots for deep and superficial femoral cartilage T2 relaxation times for both the OAI-DESS-trained model (top) and qDESS-trained model (bottom) in Studies 1–4. Data are further stratified by study and anterior/central/posterior anatomic region. The T2 variations are minimal for both models and show no systematic error; however, the limits of agreement for the qDESS-trained model are narrower for both cartilage layers.

For the OAI-DESS-trained model, significant differences in manual vs. automatic T2 estimates were seen in both layers of femoral cartilage in Study 1 (Scanner 1), in the deep femoral cartilage of Studies 1–4 (Scanner 2 for Study 1), and in both layers of tibial and patellar cartilage in Study 4. For the qDESS-trained model, significant differences in manual vs. automatic T2 estimates were only seen in the superficial femoral cartilage of Study 1 (Scanner 1). All T2 results for both models can be seen in Table 5. Qualitative evaluation of the resulting segmentations via visual inspection showed that most of the manual vs. automatic T2 errors occurred at the bone-cartilage interface in Studies 1–4 (Fig 3).

Figure 3:

Comparison of manual and automatic segmentations from both models and respective 2D unrolled T2 maps in the left knee of a subject in Study 4. Also shown are the subject’s T2 values from the superficial and deep cartilage regions, cartilage volumes, and DSC values for the qDESS-trained and OAI-DESS-trained models. Arrows indicate examples of visually apparent differences in the automated segmentations and resultant T2 maps. These differences typically appear at the periphery of cartilage surfaces, which have limited impact on subregion estimates.

Cartilage volume CCC values for the qDESS-trained model ranged from 0.47 to 0.95 and were higher than all respective OAI-DESS-trained model CCC values, which ranged from 0.13 to 0.84, apart from femoral cartilage volume in Study 3, for which the qDESS-trained model had a CCC value of 0.71 and the OAI-DESS-trained model had a CCC value of 0.82. Volume RMSE-CV values followed a similar trend: all RMSE-CV values were lower for the qDESS-trained model than the respective values for the OAI-DESS-trained model, apart from Study 3. Volume RMSE-CV values ranged from 2.9% to 16.1% for the qDESS-trained model and from 4.9% to 37.4% for the OAI-DESS-trained model. Most volume RMSE-CV values for the OAI-DESS-trained model fell between 4.9% and 17.7%, except for the patellar cartilage of Study 4, which had a value of 37.4%. All morphology results for both models can be seen in Table 6.

Table 6.

Morphology result summary for all studies. CCC and RMSE-CV values indicate the agreement between manual ground truth cartilage segmentation volume and the automatic segmentation volumes from both models. DSC values indicate the pixelwise segmentation accuracy of the automatic segmentations from both models compared to the manual ground truth segmentations.

Manual vs. Automatic Morphology Agreement

| Study | Tissue | Volume CCC (qDESS Model) | Volume CCC (OAI-DESS Model) | Volume RMSE-CV % (qDESS Model) | Volume RMSE-CV % (OAI-DESS Model) | DSC ± STD (qDESS Model) | DSC ± STD (OAI-DESS Model) |
|---|---|---|---|---|---|---|---|
| 1: Scanner 1 | FC | 0.89 | 0.13* | 2.9 | 17.7 | 0.85 (±0.07) | 0.66 (±0.07) |
| 1: Scanner 2 | FC | 0.84 | 0.82 | 4.5 | 4.9 | 0.93 (±0.03) | 0.70 (±0.05) |
| 2 | FC | 0.47* | 0.44** | 16.1 | 16.5 | 0.79 (±0.04) | 0.71 (±0.03) |
| 3 | FC | 0.71 | 0.82 | 12.0 | 8.2 | 0.81 (±0.08) | 0.71 (±0.09) |
| 4 | FC | 0.87 | 0.84 | 6.7 | 7.4 | 0.88 (±0.04) | 0.79 (±0.04) |
| 4 | TC | 0.87 | 0.55* | 6.4 | 15.0 | 0.87 (±0.04) | 0.69 (±0.06) |
| 4 | PC | 0.95 | 0.30* | 7.1 | 37.4 | 0.89 (±0.07) | 0.59 (±0.20) |

P-values from Wilcoxon rank-sum tests indicating significant differences between cartilage volume computed using manual and automated segmentations are denoted as follows: * <0.01, ** <0.001. P-values from Wilcoxon rank-sum tests indicating significant differences between the automatic segmentations from the OAI-DESS-trained and qDESS-trained models are denoted at <0.001.

STD: Standard Deviation; FC: Femoral Cartilage; TC: Tibial Cartilage; PC: Patellar Cartilage.

Discussion

In this study, we evaluated whether DL-based cartilage segmentation models can generalize to datasets different from those on which they were trained. By evaluating manual vs. automatic cartilage T2 and cartilage volume variations using two separately trained DL models, we showed that the qDESS-trained model generalized well to independent qDESS datasets and performed consistently better than the OAI-DESS-trained model across all studies. We also demonstrated that the lower-performing OAI-DESS-trained model had better performance for assessing sub-regional T2 relaxation times than for assessing cartilage morphology. Our results may indicate that distributional shifts resulting from data domain differences between the OAI-DESS and qDESS sequences can lead to larger variations in model performance than distributional shifts resulting from differences in imaging parameters or multi-vendor scanners. Overall, the systematically reduced performance of off-the-shelf pretrained models compared to models trained in a domain-specific manner may highlight the need to fine-tune DL models on data collected from the settings where the models will eventually be deployed.

The segmentation accuracy of the qDESS-trained model measured by DSC across all studies and all articular cartilage in the knee was comparable to respective manual segmentations (14). The OAI-DESS-trained model had significantly lower segmentation accuracy for Studies 1–4 and all articular cartilage in the knee. The qDESS-trained model also had fewer instances of manual vs. automatic cartilage volume variations and higher cartilage volume CCC values in every study, apart from the femoral cartilage volume in Study 3 in which the qDESS-trained model had a lower CCC value than the OAI-DESS-trained model. Although pre-trained OAI-DESS segmentation models have achieved high accuracies on OAI-DESS datasets (14–19), we demonstrate that these models do not necessarily generalize to other MRI scans, which may necessitate additional fine-tuning to accurately assess cartilage morphological metrics. Our results suggest that despite similar contrast and resolution, DESS and qDESS data might have sufficiently different data distributions, resulting in lower generalizability of OAI-DESS-trained models to unseen qDESS image sets, particularly for morphological outcomes.

The qDESS-trained model had higher T2 CCC values in nearly every study, apart from superficial femoral cartilage in Study 1 (Scanner 2), in which both models had the same CCC value, and fewer instances of manual vs. automatic T2 differences. Further, the T2 RMSE-CV values for the qDESS-trained model observed in the present study were comparable to the T2 scan-rescan CV (3.3%) from a study that also used the qDESS sequence employed here (28). This suggests the disagreement between T2 relaxation times obtained manually and those obtained automatically from our qDESS-trained model was within the range of general T2 variability, which may make these models viable for large-scale, prospective relaxometry studies. While the RMSE-CV values for the OAI-DESS-trained model were slightly higher than those for the qDESS-trained model, the LoA for both models from the Bland-Altman plots for manual vs. automatic T2 variations were similar to those of current state-of-the-art quantitative MRI segmentation models (17), despite the OAI-DESS-trained model having significantly lower segmentation accuracy. The discordance between DSC and T2 performance may be explained by the fact that T2 is averaged across a sub-region containing many pixels. Most manual vs. automatic T2 map variations were visually apparent at the bone-cartilage interface and the cartilage surface. These variations at the periphery of cartilage surfaces have limited impact on subregion T2 estimates because of this averaging across many pixels. This is in line with findings from previous work indicating that existing segmentation metrics, such as DSC, are not accurate indicators of relaxometry parameters (25). Our findings suggest low DSC values may not be indicative of manual vs. automatic T2 agreement and that existing pre-trained models may be suitable for quantifying relaxometry parameters but not morphology parameters.

Most prior DL segmentation studies report model accuracies based on the annotations of a single reader (37,38). However, in most practical applications, different readers perform segmentation across different studies (39,40). In our analysis of generalizability, we characterized the performance of the segmentation algorithms across studies that used different readers to annotate the ground truth cartilage surfaces. Despite the inter-reader variations that exist across different studies, the qDESS-trained model consistently performed well across all outcomes in all studies. This may suggest DL models have the potential to mitigate inter-reader and intra-reader variability, both of which are critical for enabling reproducible relaxometry analyses. It is worth noting that the largest cartilage volume differences for both models were seen in Study 2, where the outcome of interest was cartilage T2 relaxation time (23). Consequently, the reader utilized a conservative approach to segmentation, which may have contributed to the discordance between manual vs. automatic cartilage volume variations for both models in this study. The very high T2 CCC values produced by the qDESS-trained model in Study 2 further suggest that pre-trained segmentation models may be suitable for quantifying relaxometry parameters but likely require fine-tuning to accurately quantify outcomes of interest related to cartilage morphology.

Interestingly, the respective study population did not appear to play a role in either model's performance. Previous work has demonstrated segmentation models that perform well on healthy cohorts (19) and segmentation models that perform well on OA populations (14,16–18). However, there has been limited work demonstrating segmentation algorithm effectiveness across both healthy cohorts and cohorts with knee pathology (15). The OAI-DESS-trained model was trained using scans of subjects with OA with KL grades 1–3, while the qDESS-trained model was trained using scans of diverse subjects who underwent routine diagnostic knee MRI. Given the heterogeneity of subjects encountered during model training, no consistent trends were evident based on the prevalence of OA or subject population for either model. This may suggest potential model generalizability on datasets with varied subject health.

Further, the performance of the segmentations was comparable across all the different MR scanners used. While this does not preclude that T2 relaxation times and cartilage morphology measurements may vary across scanners, the performance of the automated segmentation models was comparable to manual segmentation across scanners. Segmentation algorithms are often evaluated using scans from a single MR vendor or scanner, which makes it difficult to draw conclusions about algorithm generalizability. Here, we used four different scanners from two different vendors, including one study in which subjects were scanned on MR scanners from two different vendors on the same day. These scanners included considerably different gradient performances that can affect the duty cycle and repetition times, data acquisition, and, consequently, image SNR. We observed no clear performance trends across the different MR scanners for either model, and the qDESS-trained model generalized well to all studies. This generalizability across MR vendors and scanners may allow algorithms to be more easily deployed across multiple institutions and may help inform more clinically useful metrics for evaluating algorithm performance.

Limitations

First, Study 4 was the only study that examined patellar cartilage and tibial cartilage. To compare our segmentation algorithms to ground truth data, all cartilage surfaces must be manually segmented, which is a very time-consuming task. We chose Study 4 because this dataset offered a mixed clinical population with the highest count of available knees. Nevertheless, in the future we will extend our segmentation models to more datasets with more cartilage surfaces and tissues. Second, Studies 3 and 4 contained subjects with mild to moderate OA, but radiographs were not available for all of these subjects, so we were unable to obtain KL grades, which would have allowed us to quantitatively evaluate whether OA severity affected model performance. Given that model performance did not appear to vary based on subject population between Studies 1–4, we do not believe access to KL grades would have affected our conclusions. However, applying the segmentation models to a larger cohort of OA subjects with known KL grades would allow us to examine the relationship between clinical KL grading standards and model performance. Third, although we used four different studies with segmentation as an outcome, the individual studies were small (average of 15 subjects and 21 knees) due to challenges in acquiring ground-truth manual segmentations. Future work investigating larger cohorts will improve the validity of our findings. Finally, the two echoes of the qDESS sequence used to train our qDESS-trained model were combined into a single image using an RSS technique. This was done in part to make the performance of the two models more comparable, as the OAI-DESS-trained model was trained on DESS data, in which the echoes are combined. However, newer segmentation models that utilize both contrasts have been shown to have improved segmentation performance (25).

Conclusion

We demonstrated that qDESS-trained models generalize well to independent qDESS datasets regardless of MR scanner type, MR scan parameters, and subject population. However, while OAI-DESS-trained models may be suitable for quantifying cartilage relaxometry, they may not generalize well to similar MRI scans and may require fine-tuning to accurately assess cartilage morphology.


Grant Support:

We would like to acknowledge our funding sources: National Institutes of Health (NIH) grants R01-AR077604, R00-EB022634, R01-EB002524, R01-AR074492, K24-AR062068, and P41-EB015891; GE Healthcare; Philips; the Wu Tsai Human Performance Alliance and Stanford Medicine Precision Health and Integrated Diagnostics; the DOD National Science and Engineering Graduate Fellowship (ARO); and the National Science Foundation (GRFP-DGE 1656518).

References

1. Loeser RF, Goldring SR, Scanzello CR, Goldring MB. Osteoarthritis: a disease of the joint as an organ. Arthritis Rheum 2012;64:1697–1707.
2. Vina ER, Kwoh CK. Epidemiology of osteoarthritis: literature update. Curr Opin Rheumatol 2018;30:160–167.
3. Chaudhari AS, Kogan F, Pedoia V, Majumdar S, Gold GE, Hargreaves BA. Rapid Knee MRI Acquisition and Analysis Techniques for Imaging Osteoarthritis. J Magn Reson Imaging 2020;52:1321–1339.
4. Menashe L, Hirko K, Losina E, et al. The diagnostic performance of MRI in osteoarthritis: a systematic review and meta-analysis. Osteoarthritis Cartilage 2012;20:13–21.
5. Li X, Majumdar S. Quantitative MRI of articular cartilage and its clinical applications. J Magn Reson Imaging 2013;38:991–1008.
6. Matzat SJ, van Tiel J, Gold GE, Oei EH. Quantitative MRI techniques of cartilage composition. Quant Imaging Med Surg 2013;3:162–174.
7. Dardzinski BJ, Mosher TJ, Li S, Van Slyke MA, Smith MB. Spatial variation of T2 in human articular cartilage. Radiology 1997;205:546–550.
8. Mosher TJ, Dardzinski BJ, Smith MB. Human articular cartilage: influence of aging and early symptomatic degeneration on the spatial variation of T2—preliminary findings at 3 T. Radiology 2000;214:259–266.
9. Eckstein F, Kwoh CK, Link TM, OAI investigators. Imaging research results from the Osteoarthritis Initiative (OAI): A review and lessons learned 10 years after start of enrolment. Ann Rheum Dis 2014;73:1289–1300.
10. Pedoia V, Majumdar S, Link TM. Segmentation of joint and musculoskeletal tissue in the study of arthritis. MAGMA 2016;29:207–221.
11. Pedoia V, Norman B, Mehany SN, Bucknor MD, Link TM, Majumdar S. 3D convolutional neural networks for detection and severity staging of meniscus and PFJ cartilage morphological degenerative changes in osteoarthritis and anterior cruciate ligament subjects. J Magn Reson Imaging 2019;49:400–410.
12. Desai AD, Gold GE, Hargreaves BA, Chaudhari AS. Technical Considerations for Semantic Segmentation in MRI using Convolutional Neural Networks. arXiv 2019; eprint arXiv:1902.01977.
13. Gan HS, Ramlee MH, Wahab AA, et al. From classical to deep learning: review on cartilage and bone segmentation techniques in knee osteoarthritis research. Artif Intell Rev 2021;54:2445–2494.
14. Desai AD, Caliva F, Iriondo C, et al. The International Workshop on Osteoarthritis Imaging Knee MRI Segmentation Challenge: A Multi-Institute Evaluation and Analysis Framework on a Standardized Dataset. Radiol Artif Intell 2021;3:e200078.
15. Eckstein F, Chaudhari AS, Fuerst D, et al. A Deep Learning Automated Segmentation Algorithm Accurately Detects Differences in Longitudinal Cartilage Thickness Loss - Data from the FNIH Biomarkers Study of the Osteoarthritis Initiative. Arthritis Care Res (Hoboken) 2022;74:929–936.
16. Ambellan F, Tack A, Ehlke M, Zachow S. Automated segmentation of knee bone and cartilage combining statistical shape knowledge and convolutional neural networks: Data from the Osteoarthritis Initiative. Med Image Anal 2019;52:109–118.
17. Norman B, Pedoia V, Majumdar S. Use of 2D U-Net Convolutional Neural Networks for Automated Cartilage and Meniscus Segmentation of Knee MR Imaging Data to Determine Relaxometry and Morphometry. Radiology 2018;288:177–185.
18. Gatti AA, Maly MR. Automatic knee cartilage and bone segmentation using multi-stage convolutional neural networks: data from the osteoarthritis initiative. MAGMA 2021;34:859–875.
19. Wirth W, Eckstein F, Kemnitz J, et al. Accuracy and longitudinal reproducibility of quantitative femorotibial cartilage measures derived from automated U-Net-based segmentation of two different MRI contrasts: data from the osteoarthritis initiative healthy reference cohort. MAGMA 2021;34:337–354.
20. Chaudhari AS, Sandino CM, Cole EK, et al. Prospective Deployment of Deep Learning in MRI: A Framework for Important Considerations, Challenges, and Recommendations for Best Practices. J Magn Reson Imaging 2021;54:357–371.
21. Kim HE, Cosa-Linan A, Santhanam NT, et al. Transfer learning for medical image classification: a literature review. BMC Med Imaging 2022;22:69.
22. Chaudhari AS, Lu Q, Wisser A, et al. Scan-Rescan Variability and Left-Right Knee Asymmetry of Cartilage Morphometry Assessed with Rapid MRI in a Harmonized Multi-Vendor Study. Proceedings of the 27th Annual Meeting of ISMRM (virtual), 2020 (abstract 2730).
23. Crowder HA, Mazzoli V, Black MS, et al. Characterizing the transient response of knee cartilage to running: Decreases in cartilage T2 of female recreational runners. J Orthop Res 2021;39:2340–2352.
24. Watkins L, MacKay J, Haddock B, et al. Assessment of quantitative [18F]Sodium fluoride PET measures of knee subchondral bone perfusion and mineralization in osteoarthritic and healthy subjects. Osteoarthritis Cartilage 2021;29:849–858.
25. Desai AD, Schmidt AM, Rubin EB, et al. SKM-TEA: A Dataset for Accelerated MRI Reconstruction with Dense Image Labels for Quantitative Clinical Evaluation. Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks (Round 2) 2021.
26. Kogan F, Levine E, Chaudhari AS, et al. Simultaneous bilateral-knee MR imaging. Magn Reson Med 2018;80:529–537.
27. Bruder H, Fischer H, Graumann R, Deimling M. A new steady-state imaging sequence for simultaneous acquisition of two MR images with clearly different contrasts. Magn Reson Med 1988;7:35–42.
28. Chaudhari AS, Black MS, Eijgenraam S, et al. Five-minute knee MRI for simultaneous morphometry and T2 relaxometry of cartilage and meniscus and for semiquantitative radiological assessment using double-echo in steady-state at 3T. J Magn Reson Imaging 2018;47:1328–1341.
29. Sveinsson B, Chaudhari AS, Gold GE, Hargreaves BA. A simple analytic method for estimating T2 in the knee from DESS. Magn Reson Imaging 2017;38:63–70.
30. Welsch GH, Scheffler K, Mamisch TC, et al. Rapid estimation of cartilage T2 based on double echo at steady state (DESS) with 3 Tesla. Magn Reson Med 2009;62:544–549.
31. Eijgenraam SM, Chaudhari AS, Reijman M, et al. Time-saving opportunities in knee osteoarthritis: T2 mapping and structural imaging of the knee using a single 5-min MRI scan. Eur Radiol 2020;30:2231–2240.
32. Desai AD, Barbieri M, Mazzoli V, et al. DOSMA: A deep-learning, open-source framework for musculoskeletal MRI analysis. Proceedings of the 27th Annual Meeting of ISMRM, Montreal, 2019 (abstract 1135).
33. Monu UD, Jordan CD, Samuelson BL, Hargreaves BA, Gold GE, McWalter EJ. Cluster analysis of quantitative MRI T2 and T1ρ relaxation times of cartilage identifies differences between healthy and ACL-injured individuals at 3T. Osteoarthritis Cartilage 2017;25:513–520.
34. Thomas KA, Krzemiński D, Kidziński Ł, et al. Open Source Software for Automatic Subregional Assessment of Knee Cartilage Degradation Using Quantitative T2 Relaxometry and Deep Learning. Cartilage 2021;13:747S–756S.
35. Hunter DJ, Guermazi A, Lo GH, et al. Evolution of semi-quantitative whole joint assessment of knee OA: MOAKS (MRI Osteoarthritis Knee Score). Osteoarthritis Cartilage 2011;19:990–1002.
36. Kijowski R, Davis KW, Woods MA, et al. Knee joint: comprehensive assessment with 3D isotropic resolution fast spin-echo MR imaging—diagnostic performance compared with that of conventional MR imaging at 3.0 T. Radiology 2009;252:486–495.
37. Kemnitz J, Steidle-Kloc E, Wirth W, et al. Local MRI-based measures of thigh adipose tissue derived from fully automated deep convolutional neural network-based segmentation show a comparable responsiveness to bidirectional change in body weight as from quality controlled manual segmentation. Ann Anat 2022;240:151866.
38. Zhao R, Yaman B, Zhang Y, et al. fastMRI+, Clinical pathology annotations for knee and brain fully sampled magnetic resonance imaging data. Sci Data 2022;9:152.
39. Fujinaga Y, Yoshioka H, Sakai T, Sakai Y, Souza F, Lang P. Quantitative measurement of femoral condyle cartilage in the knee by MRI: validation study by multireaders. J Magn Reson Imaging 2014;39:972–977.
40. Bae KT, Shim H, Tao C, et al. Intra- and inter-observer reproducibility of volume measurement of knee cartilage segmented from the OAI MR image set using a novel semi-automated segmentation method. Osteoarthritis Cartilage 2009;17:1589–1597.
