Pediatr Radiol. Author manuscript; available in PMC: 2022 Mar 1.
Published in final edited form as: Pediatr Radiol. 2020 Oct 13;51(3):392–402. doi: 10.1007/s00247-020-04854-3

DeepLiverNet: a deep transfer learning model for classifying liver stiffness using clinical and T2-weighted magnetic resonance imaging data in children and young adults

Hailong Li 1,2, Lili He 1,2,3, Jonathan A Dudley 2,4, Thomas C Maloney 2,4, Elanchezhian Somasundaram 4, Samuel L Brady 4,5, Nehal A Parikh 1,3, Jonathan R Dillman 2,4,5
PMCID: PMC8675279  NIHMSID: NIHMS1757172  PMID: 33048183

Abstract

Background

Although MR elastography allows for quantitative evaluation of liver stiffness to assess chronic liver diseases, it has associated drawbacks related to additional scanning time, patient discomfort, and added costs.

Objective

To develop a machine learning model that can categorically classify the severity of liver stiffness using both anatomical T2-weighted MRI and clinical data for children and young adults with known or suspected pediatric chronic liver diseases.

Materials and methods

We included 273 subjects with known or suspected chronic liver disease. We extracted axial T2-weighted fast spin-echo fat-suppressed images, clinical data (e.g., demographic/anthropometric data, particular medical diagnoses, laboratory values) and MR elastography liver stiffness measurements. We propose DeepLiverNet to classify patients into one of two groups: no/mild liver stiffening (<3 kPa) or moderate/severe liver stiffening (≥3 kPa). We conducted internal cross-validation using 178 subjects and external validation using an independent cohort of 95 subjects. We assessed diagnostic performance using accuracy, sensitivity, specificity and area under the receiver operating characteristic curve (AuROC).

Results

In the internal cross-validation experiment, the combination of clinical and imaging data produced the best performance (AuROC=0.86) compared with clinical (AuROC=0.83) or imaging (AuROC=0.80) data alone. Using both clinical and imaging data, DeepLiverNet correctly classified patients with an accuracy of 88.0%, a sensitivity of 74.3% and a specificity of 94.6%. In our external validation experiment, the same deep learning model achieved an accuracy of 80.0%, a sensitivity of 61.1%, a specificity of 91.5% and an AuROC of 0.79.

Conclusion

A deep learning model that incorporates clinical data and anatomical T2-weighted MR images might provide a means of risk-stratifying liver stiffness and directing the use of MR elastography.

Keywords: Children, Chronic liver disease, Deep learning, Liver, Liver stiffness, Magnetic resonance elastography, Magnetic resonance imaging, Risk stratification

Introduction

Chronic liver diseases are a common source of morbidity and mortality in both children and adults in the United States and around the world [1, 2]. The detection and progression of such liver diseases are typically assessed using a combination of clinical history, physical examination, laboratory testing, biopsy with histopathological assessment, and imaging [3]. Historically, imaging assessment of chronic liver diseases has relied upon subjective evaluation of liver morphology, echogenicity and echotexture at US, signal intensity at MRI, and appearance following intravenous contrast material administration at MRI and CT. More recently, however, quantitative preclinical and clinical methods have become increasingly available [4-7].

Elasticity imaging can be performed using either commercially available US or MRI equipment and allows for quantitative evaluation of liver stiffness. While liver stiffness can be impacted by a variety of physiological and histopathological processes, including inflammation [8, 9], steatosis [10] and passive congestion [11, 12], liver stiffening is most often the result of tissue fibrosis in the setting of chronic liver diseases [8, 13]. MR elastography, in particular, uses an active–passive driver system (with the passive paddle placed over the right upper quadrant of the abdomen at the level of the costal margin) to create transverse (shear) waves in the liver. The displacement of liver tissue related to these waves can be imaged using a modified phase-contrast pulse sequence and can be used to create an elastogram (map or parametric image) of liver stiffness [14, 15]. Although MR elastography obviates the need for liver biopsy for some children and allows more frequent longitudinal assessment of liver health, it has associated drawbacks related to additional patient time in the scanner (not including time to place and adjust the passive driver), patient discomfort, and added costs (e.g., infrastructure and patient charges).

A growing number of reports show that modern machine learning techniques have shifted from a primary focus on computer-aided diagnosis to organ and lesion segmentation, image processing, patient or lesion classification, and outcome prediction [16-22]. These newer techniques might ultimately enable objective, automated diagnosis and prognostication for individual patients. Previously, we developed a support vector machine classifier that categorically classifies liver stiffness using clinical features and non-stiffness MRI radiomic features in people with known or suspected pediatric chronic liver diseases [23]. Such an algorithm could theoretically reduce the use of MR elastography in patients with predicted normal liver stiffness, thereby decreasing imaging time and health care costs. However, in that prior study, we extracted handcrafted radiomic features (e.g., histogram, geometric and texture metrics) from livers manually segmented on axial T2-weighted fat-suppressed MR images. This handcrafted feature extraction process is time-consuming and might miss important non-hepatic image features indicative of liver stiffening (e.g., splenomegaly, varices, ascites), potentially leading to suboptimal performance. Meanwhile, deep learning has demonstrated state-of-the-art performance in medical image analysis [21, 24, 25], providing an opportunity to use the original axial T2-weighted fat-suppressed MR images of the liver and surrounding structures directly, without manual segmentation or radiomic feature extraction.

In the current project, we developed a deep learning approach to classify the severity of liver stiffness as determined by MR elastography using clinical features and anatomical MR imaging data (Fig. 1). Specifically, we propose DeepLiverNet, a multi-channel deep transfer learning convolutional neural network classification model, to categorically classify MR elastography-derived liver stiffness by integrating clinically available features and axial T2-weighted fat-suppressed MR structural liver images from children and young adults with known or suspected pediatric chronic liver diseases. Transfer learning and data augmentation were utilized to aid model training. We comprehensively evaluated the proposed DeepLiverNet using internal cross-validation and also external validation on an independent cohort.

Fig. 1.


Liver stiffness stratification with DeepLiverNet using anatomical two-dimensional (2-D) axial T2-weighted fast spin-echo fat-suppressed MR images and clinical data. The input of the imaging channel is S axial 2-D T2-weighted MR images, each 224×224 in size, and the input of the clinical channel is a vector of k clinical features. The layer type, filter size and number of neurons are listed for each layer. Blue indicates non-trainable layers; all other layers are trainable. Batch Norm batch normalization layer, Conv convolutional layer, Full Conn fully connected layer, Maxpool max-pooling layer

Materials and methods

This retrospective study complied with the Health Insurance Portability and Accountability Act and was approved by our institutional review board. The requirement for participant informed consent was waived. This study was performed in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki and its later amendments.

We searched Cincinnati Children’s Hospital Medical Center’s Department of Radiology records from January 2011 through October 2018 to retrieve clinically performed MR elastography examinations from children with known or suspected pediatric chronic liver diseases, irrespective of clinical indication. We identified two cohorts: an internal validation cohort and an external validation cohort. The internal validation cohort, scanned on 1.5-tesla (T) (Signa and Discovery 450W) and 3-T (Discovery 750W and Architect) GE Healthcare MRI scanners (Waukesha, WI), was used for model development and internal validation. The external validation cohort was scanned on 1.5-T (Ingenia) and 3-T (Ingenia) Philips Healthcare MRI scanners (Best, the Netherlands). Only one MR elastography examination (the most recent) was selected for each unique patient; all other MR elastography examinations from that patient were excluded. Examinations from patients with missing clinical or imaging data were excluded. No MRI examinations were excluded because of excessive motion artifacts. Ultimately, this resulted in 178 MR elastography examinations for the internal validation cohort and 95 MR elastography examinations for the external validation cohort.

The institutional MR elastography technique used during the study period has been described previously [26]. The mean liver stiffness value in kPa (shear modulus), calculated as the mean of four anatomical levels/slices through the mid liver weighted for region-of-interest (ROI) size, was retrieved from the clinical imaging report of each MR elastography examination. Based on mean liver stiffness, patients were divided into two groups: no/mild liver stiffening (<3 kPa) or moderate/severe liver stiffening (≥3 kPa). We chose a cut-off value of 3 kPa because it provides reasonable clinical sensitivity and specificity for detecting abnormal liver stiffening based on prior studies, including pediatric cohorts [13, 23, 27, 28]. Liver volume in milliliters (mL), liver chemical shift-encoded fat fraction (%), presence of liver fat (fat fraction >5%), and MRI scanner information (i.e. manufacturer, machine model, field strength) were also extracted from clinical imaging reports (Online Supplementary Material 1).
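The study read the ROI-weighted mean stiffness directly from clinical reports rather than computing it. As a minimal sketch of the arithmetic behind that value and of the 3 kPa dichotomization, assuming per-slice mean stiffness values and ROI areas are available (the function names and inputs are illustrative, not part of the authors' pipeline):

```python
from typing import Sequence

# Cut-off used in this study to dichotomize MR elastography liver stiffness
STIFFNESS_CUTOFF_KPA = 3.0


def weighted_mean_stiffness(slice_means_kpa: Sequence[float],
                            roi_areas: Sequence[float]) -> float:
    """Mean stiffness across slices, weighting each slice by its ROI area."""
    if len(slice_means_kpa) != len(roi_areas) or len(roi_areas) == 0:
        raise ValueError("expected one ROI area per slice")
    total_area = sum(roi_areas)
    weighted_sum = sum(m * a for m, a in zip(slice_means_kpa, roi_areas))
    return weighted_sum / total_area


def stiffness_label(mean_kpa: float) -> int:
    """0 = no/mild stiffening (<3 kPa); 1 = moderate/severe stiffening (>=3 kPa)."""
    return int(mean_kpa >= STIFFNESS_CUTOFF_KPA)


# Hypothetical example: four MRE slices with different ROI sizes (arbitrary units)
mean_kpa = weighted_mean_stiffness([2.8, 3.1, 3.4, 2.9], [10, 20, 30, 40])
label = stiffness_label(mean_kpa)
```

In this hypothetical case the weighted mean is 3.08 kPa, so the patient falls in the moderate/severe group even though two of the four slices are below 3 kPa.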

T2-weighted magnetic resonance images

Axial two-dimensional T2-weighted fast spin-echo fat-suppressed images obtained as part of routine clinical MR elastography examinations were extracted from our clinical picture archiving and communication system (PACS). T2-weighted images were obtained using the following parameters/parameter ranges during the study period: repetition time/echo time (TR/TE) = >3,000/~85 ms; flip angle = 90°; number of signal averages = 2; parallel imaging acceleration factor = 2; matrix = ~256 × 224; and slice thickness = 5–6 mm. Individual T2-weighted images were resampled using nearest-neighbor interpolation to a field of view of 300 × 300 mm², with an in-plane resolution of 1.0 × 1.0 mm².
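The resampling step can be sketched with NumPy as below. This assumes the source pixel spacing is known (e.g., from the image header) and that the 300 × 300 mm² field of view is centered on the image; both the centering and the zero padding outside the acquired field of view are assumptions, not details stated in the study.

```python
import numpy as np


def resample_nearest(image, spacing_mm, target_spacing_mm=1.0, fov_mm=300.0):
    """Resample a 2-D slice onto a square, centered grid with nearest-neighbor lookup.

    image: (rows, cols) array; spacing_mm: (row_spacing, col_spacing) of the source.
    The output covers fov_mm x fov_mm at target_spacing_mm; regions outside the
    source image are zero-padded.
    """
    rows, cols = image.shape
    out_n = int(round(fov_mm / target_spacing_mm))  # 300 pixels for 300 mm at 1.0 mm
    # Physical offsets (mm) of output pixel centers from the grid center
    offsets = (np.arange(out_n) - (out_n - 1) / 2.0) * target_spacing_mm
    # Nearest source row/column index for each output row/column
    src_r = np.rint(offsets / spacing_mm[0] + (rows - 1) / 2.0).astype(int)
    src_c = np.rint(offsets / spacing_mm[1] + (cols - 1) / 2.0).astype(int)
    valid_r = (src_r >= 0) & (src_r < rows)
    valid_c = (src_c >= 0) & (src_c < cols)
    out = np.zeros((out_n, out_n), dtype=image.dtype)
    out[np.ix_(valid_r, valid_c)] = image[np.ix_(src_r[valid_r], src_c[valid_c])]
    return out


# Hypothetical slice: ~256 x 224 matrix at 1.5 mm pixel spacing
slice_t2 = np.ones((256, 224))
resampled = resample_nearest(slice_t2, spacing_mm=(1.5, 1.5))
```

Nearest-neighbor lookup copies existing intensities rather than blending them, which matches the interpolation named in the text; a smaller source image simply ends up zero-padded around its center.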

Clinical data

For each subject, we retrieved 27 clinical features from the electronic medical record system (Epic Systems, Verona, WI), using only values/records within 6 months of the MR elastography examination. We obtained clinical data from three major domains: (1) demographic/anthropometric data (e.g., age, gender, body mass index); (2) medical history/diagnoses (e.g., diabetic status; specific diagnoses such as non-alcoholic fatty liver disease, viral hepatitis or primary sclerosing cholangitis); and (3) laboratory testing (e.g., alanine aminotransferase, aspartate aminotransferase, bilirubin, albumin, gamma-glutamyl transferase, and fibrosis-4 score). The complete list of clinical features used for developing our model is presented in Online Supplementary Material 1.
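Two of the derived laboratory indices in this work, the fibrosis-4 (FIB-4) score and the aspartate aminotransferase to platelet ratio index (APRI, which appears among the ranked features in the Results), follow standard published formulas. The sketch below computes both and selects, for a given laboratory test, the result closest in time to the MR elastography date within a 6-month window. Reading "within 6 months" as ±183 days around the examination, and passing the AST upper limit of normal as a parameter, are assumptions.

```python
import math
from datetime import date
from typing import List, Optional, Tuple

WINDOW_DAYS = 183  # assumption: "within 6 months" read as +/-183 days of the MRE date


def closest_within_window(records: List[Tuple[date, float]], mre_date: date,
                          window_days: int = WINDOW_DAYS) -> Optional[float]:
    """Value of the lab result measured closest to the MRE date, or None."""
    eligible = [(abs((d - mre_date).days), v) for d, v in records
                if abs((d - mre_date).days) <= window_days]
    if not eligible:
        return None
    return min(eligible, key=lambda item: item[0])[1]


def fib4(age_years: float, ast_u_l: float, alt_u_l: float,
         platelets_1e9_l: float) -> float:
    """FIB-4 = (age x AST) / (platelet count x sqrt(ALT))."""
    return (age_years * ast_u_l) / (platelets_1e9_l * math.sqrt(alt_u_l))


def apri(ast_u_l: float, ast_uln_u_l: float, platelets_1e9_l: float) -> float:
    """APRI = (AST / AST upper limit of normal) / platelet count x 100."""
    return (ast_u_l / ast_uln_u_l) / platelets_1e9_l * 100.0


# Hypothetical AST results (U/L) around an MRE performed on 2018-03-01
ast_history = [(date(2018, 1, 10), 60.0), (date(2018, 5, 1), 55.0),
               (date(2017, 1, 1), 80.0)]
ast = closest_within_window(ast_history, date(2018, 3, 1))
```

Here the January result (50 days before the examination) is selected and the 2017 result is ignored. For a hypothetical 15-year-old with AST 60 U/L, ALT 36 U/L and platelets 200 × 10^9/L, FIB-4 is 0.75; with an assumed AST upper limit of normal of 40 U/L, APRI is also 0.75 (up to floating-point rounding).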

Overview of liver stiffness stratification

Our task was to classify each patient with known or suspected chronic liver disease into one of two groups: no/mild liver stiffening or moderate/severe liver stiffening (Fig. 1). Our deep learning model contains two separate input channels, one for imaging data and one for clinical data. For the imaging channel, we designed a transfer learning block that reuses a pre-trained deep learning model for image feature extraction, followed by an adaptive learning block that learns latent imaging features specific to liver stiffening. The clinical channel was designed to capture latent clinical features. We then employed a fusion block to integrate the latent imaging and clinical features. Finally, we used a softmax classifier to stratify the severity of liver stiffness.

Architecture of DeepLiverNet

DeepLiverNet uses a multi-channel deep architecture (i.e. an imaging channel and a clinical channel) to take individual axial 2-D T2-weighted MR images (S slices) and clinical data (k clinical features) simultaneously (Fig. 1).

The imaging channel comprises an image input layer, a transfer learning block and an adaptive learning block. First, the image input layer contains S parallel input sub-channels, each taking one of S fixed-size axial T2-weighted MR slices. Next, to extract liver image features, we designed the transfer learning block (Fig. 1, purple box) by reusing available pre-trained deep models. Based on the optimization experiment described below, we chose to reuse the weights of the VGG-19 model [29] (1st to 21st layers) for the transfer learning block. To reuse the pre-trained VGG-19 model and improve training efficiency, we scaled the T2-weighted images down to a matrix size of 224 × 224. We then designed the adaptive learning block (Fig. 1, orange box), which contains S parallel sub-channels corresponding to the input sub-channels, to learn the latent features of each of the S liver slices. Finally, the sub-channels of the adaptive learning block are integrated by a fully connected layer.
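Because VGG-19 was pre-trained on 224 × 224 three-channel color images, each grayscale slice has to be brought into that input format. The sketch below is one plausible way to do this with NumPy, assuming nearest-neighbor downscaling from the 300 × 300 grid, per-slice min-max scaling to [0, 255], and replication of the gray channel into three channels; the study does not state these choices. In Keras, `tf.keras.applications.vgg19.preprocess_input` could then apply the ImageNet-specific channel reordering and mean subtraction expected by the pre-trained weights.

```python
import numpy as np


def to_vgg_input(slices, out_size=224):
    """Convert an (S, H, W) stack of grayscale slices to (S, out_size, out_size, 3)."""
    slices = np.asarray(slices, dtype=np.float32)
    s, h, w = slices.shape
    # Nearest-neighbor sampling positions (pixel centers) on the source grid
    ri = np.floor((np.arange(out_size) + 0.5) * h / out_size).astype(int)
    ci = np.floor((np.arange(out_size) + 0.5) * w / out_size).astype(int)
    small = slices[:, ri][:, :, ci]
    # Per-slice min-max scaling to the 0-255 range of the ImageNet training images
    lo = small.min(axis=(1, 2), keepdims=True)
    hi = small.max(axis=(1, 2), keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)
    scaled = (small - lo) / span * 255.0
    # Replicate the single gray channel into the three channels VGG-19 expects
    return np.repeat(scaled[..., np.newaxis], 3, axis=-1)


# Hypothetical stack of S = 4 resampled 300 x 300 slices
stack = np.random.default_rng(0).random((4, 300, 300))
batch = to_vgg_input(stack)
```

Each of the S = 4 slices becomes one 224 × 224 × 3 input for its own sub-channel of the imaging channel.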

In the clinical channel, a fully connected layer is applied directly to learn latent features from the clinical data, represented as a low-dimensional vector of k features. After feature extraction, a fusion block integrates the latent features from the imaging and clinical channels, and a two-way softmax classifier classifies the severity of liver stiffness.
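To make the data flow concrete, the following NumPy sketch runs a forward pass through a simplified version of this two-channel design and evaluates the cross-entropy loss used for training. It starts from per-slice feature vectors assumed to have already been produced by the frozen transfer learning block. The layer widths, ReLU activations, random initialization and the use of concatenation as the fusion operation are hypothetical placeholders (the actual layer types and sizes are listed in Fig. 1); only the overall structure, with S = 4 slice sub-channels, a clinical vector of k = 27 features and a two-way softmax, follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(x):
    return np.maximum(x, 0.0)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def dense(n_in, n_out):
    """Randomly initialized weights and bias for one fully connected layer."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)


S, F, K = 4, 512, 27  # slices, per-slice frozen-feature size (hypothetical), clinical features
H_SLICE, H_IMG, H_CLIN, H_FUSE = 64, 64, 16, 32  # hypothetical layer widths

slice_layers = [dense(F, H_SLICE) for _ in range(S)]  # adaptive block: one sub-channel per slice
img_merge = dense(S * H_SLICE, H_IMG)                 # fully connected merge of sub-channels
clin_layer = dense(K, H_CLIN)                         # clinical channel
fuse_layer = dense(H_IMG + H_CLIN, H_FUSE)            # fusion block (concatenation assumed)
out_layer = dense(H_FUSE, 2)                          # two-way softmax classifier


def forward(slice_feats, clinical):
    """slice_feats: (batch, S, F) frozen image features; clinical: (batch, K)."""
    per_slice = [relu(slice_feats[:, s] @ W + b) for s, (W, b) in enumerate(slice_layers)]
    img = relu(np.concatenate(per_slice, axis=1) @ img_merge[0] + img_merge[1])
    clin = relu(clinical @ clin_layer[0] + clin_layer[1])
    fused = relu(np.concatenate([img, clin], axis=1) @ fuse_layer[0] + fuse_layer[1])
    return softmax(fused @ out_layer[0] + out_layer[1])


def cross_entropy(probs, labels):
    """Mean categorical cross-entropy for integer labels (0 or 1)."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))
```

Each row of the output holds the predicted probabilities of no/mild and moderate/severe stiffening; in practice the weights would be learned by back-propagation rather than left at their random values.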

The current architecture was designed by brute-force search over a limited set of combinations of the numbers of layers and neurons. In addition, we compared multiple publicly available deep models pre-trained on ImageNet (~1.2 million color images) [30]: VGG-16 and VGG-19 [29], ResNet [31], Inception [32] and NASNet [33]. For these experiments, we divided the internal validation cohort into training (80%), validation (10%) and testing (10%) datasets. Various combinations of architecture options were tested, and the combination with the best performance on the validation dataset was considered optimal for this study. Additional details of architecture optimization are included in Online Supplementary Material 2.
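A minimal sketch of this optimization protocol, assuming a subject-level random split and a hypothetical `val_score(backbone, width)` callback that trains one candidate and returns its validation performance (training itself is omitted):

```python
import itertools
import random


def split_subjects(subject_ids, seed=0, frac_train=0.8, frac_val=0.1):
    """Subject-level 80/10/10 split so that no subject appears in two subsets."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(frac_train * len(ids))
    n_val = round(frac_val * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def select_config(backbones, widths, val_score):
    """Brute-force search: the (backbone, width) pair with the best validation score."""
    return max(itertools.product(backbones, widths), key=lambda cfg: val_score(*cfg))


train_ids, val_ids, test_ids = split_subjects(range(178))
```

With 178 subjects this yields 142/18/18 subjects. Candidate backbones would be names such as "VGG-16" or "VGG-19"; the hidden-layer widths searched over are not specified in the text and would be hypothetical.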

We used cross-entropy as the loss function and a mini-batch Adam algorithm [34] to train DeepLiverNet. The learning rate was set to 0.01 and the number of epochs to 30. We applied an early stopping mechanism that ceased optimization if the validation loss did not change over five consecutive epochs. DeepLiverNet was implemented in Python 3.6 (Python Software Foundation, Wilmington, DE) using Keras (version 2.2.4) with a TensorFlow (version 1.10; Google Brain Team, Mountain View, CA) backend on a computer workstation (256 gigabytes of random access memory [RAM], 2×NVIDIA [Nvidia Corporation, Santa Clara, CA] GTX1080 Ti with CUDA 10.0 [Compute Unified Device Architecture 10.0]). Additional model training details are included in Online Supplementary Material 2.
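The early stopping rule can be read as patience-based stopping on the validation loss. The sketch below assumes that "did not change" means "did not improve on the best value seen so far", which mirrors the behavior of the Keras `EarlyStopping(monitor="val_loss", patience=5)` callback; the authors' exact callback settings are not stated. Training is capped at the 30 epochs used in the study.

```python
import math


class EarlyStopper:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = math.inf
        self.wait = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss  # improvement: reset the counter
            self.wait = 0
        else:
            self.wait += 1  # no improvement this epoch
        return self.wait >= self.patience


def epochs_trained(val_losses, max_epochs=30, patience=5):
    """Number of epochs run before early stopping or the epoch cap is reached."""
    stopper = EarlyStopper(patience)
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.should_stop(loss):
            return epoch
    return min(len(val_losses), max_epochs)
```

For example, a loss sequence of 1.0, 0.9, 0.8 followed by five more epochs at 0.8 stops after epoch 8, while a steadily improving loss runs the full 30 epochs.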

To avoid model overfitting, we used a well-established data augmentation scheme [35] that increases the training data through image rotation and shifting: random rotation (≤10°) and random vertical and horizontal shifts (≤5 voxels) applied to a randomly selected T2-weighted image. Figure 2 illustrates an original liver image and three randomly synthesized liver images from the same subject. We augmented the training samples tenfold; the test dataset of every experiment was fully excluded from data augmentation.
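The augmentation scheme can be sketched in NumPy as follows. Uniformly sampled angles in [-10°, 10°], integer shifts in [-5, 5] voxels, nearest-neighbor interpolation, zero filling of pixels that move outside the image, and applying both transforms to every synthesized copy are assumptions; the study specifies only the limits and the tenfold augmentation factor.

```python
import numpy as np


def rotate_shift(image, angle_deg, shift_rc):
    """Rotate a 2-D image about its center, then shift it by (rows, cols) voxels.

    Uses inverse mapping with nearest-neighbor sampling; pixels that map from
    outside the original image are filled with zeros.
    """
    rows, cols = image.shape
    theta = np.deg2rad(angle_deg)
    cr, cc = (rows - 1) / 2.0, (cols - 1) / 2.0
    r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    # Undo the shift, then undo the rotation, to find each output pixel's source
    r0 = r - shift_rc[0] - cr
    c0 = c - shift_rc[1] - cc
    src_r = np.rint(np.cos(theta) * r0 + np.sin(theta) * c0 + cr).astype(int)
    src_c = np.rint(-np.sin(theta) * r0 + np.cos(theta) * c0 + cc).astype(int)
    valid = (src_r >= 0) & (src_r < rows) & (src_c >= 0) & (src_c < cols)
    out = np.zeros_like(image)
    out[valid] = image[src_r[valid], src_c[valid]]
    return out


def augment(image, n_copies=10, max_angle_deg=10.0, max_shift=5, rng=None):
    """Synthesize n_copies randomly rotated (<=10 deg) and shifted (<=5 voxel) images."""
    rng = np.random.default_rng() if rng is None else rng
    copies = []
    for _ in range(n_copies):
        angle = rng.uniform(-max_angle_deg, max_angle_deg)
        shift = rng.integers(-max_shift, max_shift + 1, size=2)
        copies.append(rotate_shift(image, angle, shift))
    return np.stack(copies)
```

Only training images would be passed through `augment`; test images are used as acquired, consistent with the exclusion described above.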

Fig. 2.


Original axial T2-weighted MRI liver image (a) and three associated synthesized MRI liver images (b–d) using the rotation (≤10°)- and shift (≤5 voxels)-based data augmentation approach for a 16-year-old female subject

Internal validation

We developed and validated our deep model using T2-weighted imaging data alone, clinical data alone, and combined imaging and clinical data from the internal validation cohort (178 unique examinations from patients scanned on GE Healthcare MRI scanners). The clinical MR elastography examinations in this study comprised four axial images (6.5-mm slice thickness) sampled through the liver volume. We identified the individual T2-weighted slices corresponding to those MR elastography anatomical levels, i.e. S = 4. Subject-wise 10-fold cross-validation was used to test DeepLiverNet (Fig. 3). For each cross-validation run, the subjects in the whole cohort were divided into 10 portions of approximately equal size. In each iteration, one portion was used for testing while the other nine portions were used for model training; in addition, 10% of the training data was held out as validation data to monitor the convergence of model training. We repeated this process until each of the 10 portions had been tested once and then averaged performance across the 10 folds. To test the reproducibility of the model, we repeated this 10-fold cross-validation experiment 10 times and calculated 95% confidence intervals. We assessed the diagnostic performance of the model using accuracy, sensitivity, specificity and area under the receiver operating characteristic curve (AuROC).
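A sketch of repeated subject-wise 10-fold cross-validation, assuming a hypothetical `evaluate_fold(train_ids, test_ids)` callback that trains a model and returns one performance metric. The text does not say how the 95% confidence interval was computed; a normal approximation over the 10 repeat-level means is used here as an assumption.

```python
import random
import statistics


def kfold_subjects(subject_ids, k=10, seed=0):
    """Shuffle subjects and deal them into k folds of near-equal size."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def repeated_cv(subject_ids, evaluate_fold, k=10, repeats=10):
    """Repeat subject-wise k-fold CV; return the mean score and a 95% CI.

    evaluate_fold(train_ids, test_ids) is a hypothetical callback that trains on
    train_ids (optionally holding out 10% for convergence monitoring) and
    returns a performance metric on test_ids.
    """
    repeat_means = []
    for rep in range(repeats):
        folds = kfold_subjects(subject_ids, k, seed=rep)
        fold_scores = []
        for i, test_ids in enumerate(folds):
            train_ids = [s for j, fold in enumerate(folds) if j != i for s in fold]
            fold_scores.append(evaluate_fold(train_ids, test_ids))
        repeat_means.append(statistics.mean(fold_scores))
    mean = statistics.mean(repeat_means)
    # Normal-approximation 95% CI over the repeat-level means (assumed method)
    half_width = 1.96 * statistics.stdev(repeat_means) / len(repeat_means) ** 0.5
    return mean, (mean - half_width, mean + half_width)
```

Splitting by subject rather than by image keeps all four slices of a patient in the same fold, so no patient contributes to both training and testing within an iteration.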

Fig. 3.


Internal and external validation experiments flow chart. In the internal validation, we applied 10-fold cross-validation to develop and evaluate the model using 178 patients. In the external validation, we trained our DeepLiverNet using 178 patients from our internal validation cohort and tested the model using 95 unseen subjects from the external validation cohort

External validation

DeepLiverNet was externally validated using examinations from an independent cohort of 95 unique patients scanned on Philips Healthcare MRI scanners. Testing the model on data acquired with scanners from a different manufacturer allows us to assess its generalizability when it is used as an off-the-shelf product on unseen data. This is especially relevant for potential future clinical use, when training the model with data from a particular scanner may not be feasible. We trained DeepLiverNet using the 178 subjects from our internal validation cohort and tested it on the 95 unseen subjects from the external validation cohort using T2-weighted imaging data alone, clinical data alone, and combined imaging and clinical data. In addition, we tested the DeepLiverNet model trained with combined imaging and clinical data on sub-populations of the external validation cohort defined by gender and fatty liver disease diagnosis. The same rotation- and shift-based data augmentation used in the internal validation experiment was applied to balance and augment the imaging data in the external validation experiment. Again, we assessed diagnostic performance using accuracy, sensitivity, specificity and AuROC.
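For reference, the four performance metrics can be computed from binary labels (1 = moderate/severe stiffening) and model scores as sketched below. AuROC is computed through its equivalence to the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half; this is a standard identity, not necessarily the authors' implementation. The example reproduces the accuracy, sensitivity and specificity reported in Table 4 for the fatty liver disease subgroup, using the counts implied by those percentages (1 of 5 stiff livers and 10 of 10 non-stiff livers correctly classified).

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auroc(y_true, scores):
    """AuROC as the probability a positive case outscores a negative one (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Counts implied by the fatty liver disease row of Table 4 (n = 15)
y_true = [1] * 5 + [0] * 10
y_pred = [1] + [0] * 4 + [0] * 10
fatty = confusion_metrics(y_true, y_pred)
```

This gives an accuracy of 11/15 (73.3%), a sensitivity of 20% and a specificity of 100%, illustrating how a model that rarely predicts the positive class can keep a reasonable accuracy while missing most stiff livers.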

Visualization of discriminative image regions

We visualized the most discriminative regions of a given T2-weighted liver image using the gradient-weighted class activation mapping (Grad-CAM) technique [36]. The Grad-CAM algorithm generates a coarse localization heat map showing the image regions that our model relies on for risk stratification, helping to explain the decision-making process of DeepLiverNet.
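Grad-CAM weights each feature map of a chosen convolutional layer by the spatially averaged gradient of the class score with respect to that map, sums the weighted maps, and keeps only positive evidence through a ReLU. The NumPy sketch below assumes the activations and gradients have already been obtained from the trained network (in TensorFlow, for example, with `tf.GradientTape`). It normalizes the map to [0, 1] and, for simplicity, uses nearest-neighbor upsampling to the input size rather than a smoother interpolation.

```python
import numpy as np


def grad_cam(activations, gradients, out_shape=None):
    """Grad-CAM heat map from one layer's activations and class-score gradients.

    activations, gradients: arrays of shape (H, W, C) for a single image.
    Returns an (H, W) map, or a map of out_shape if given, scaled to [0, 1].
    """
    # Channel importance weights: global average of the gradients over space
    weights = gradients.mean(axis=(0, 1))
    # Weighted sum of feature maps; ReLU keeps regions that support the class
    cam = np.maximum((activations * weights).sum(axis=-1), 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()
    if out_shape is not None:
        # Nearest-neighbor upsampling to the input image size
        ri = np.arange(out_shape[0]) * cam.shape[0] // out_shape[0]
        ci = np.arange(out_shape[1]) * cam.shape[1] // out_shape[1]
        cam = cam[np.ix_(ri, ci)]
    return cam


# Toy 7 x 7 x 2 layer: channel 0 supports the class, channel 1 argues against it
acts = np.zeros((7, 7, 2))
acts[3, 3, 0] = 5.0
acts[0, 0, 1] = 5.0
grads = np.zeros((7, 7, 2))
grads[..., 0] = 1.0
grads[..., 1] = -1.0
heat = grad_cam(acts, grads, out_shape=(224, 224))
```

In the toy example only the central region lights up, because the corner activation belongs to a channel with a negative weight and is removed by the ReLU.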

Discriminative clinical feature ranking

We applied a connection weights algorithm [37] to rank the importance of the clinical features and non-deep-learning imaging features (e.g., MRI-derived liver volume and fat fraction). Specifically, we calculated the partial derivatives of the model output with respect to each of these features; a higher absolute value of the partial derivative indicates greater importance for stratifying liver stiffness.
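The partial-derivative ranking described here can be approximated numerically when analytic gradients are not at hand. The sketch below uses central finite differences on a hypothetical `predict` function that returns the predicted probability of moderate/severe stiffening, averaging the absolute derivative over patients; with a deep learning framework, automatic differentiation would normally replace the finite differences. Averaging absolute values across patients is an assumption about how per-patient derivatives were aggregated, and the feature names are placeholders.

```python
import numpy as np


def derivative_importance(predict, X, names, eps=1e-4):
    """Rank features by mean |d predict / d x_j| across patients.

    predict: callable mapping an (n, k) array to n predicted probabilities.
    Uses central finite differences as a stand-in for analytic gradients.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    grads = np.empty((n, k))
    for j in range(k):
        step = np.zeros(k)
        step[j] = eps
        grads[:, j] = (predict(X + step) - predict(X - step)) / (2.0 * eps)
    importance = np.abs(grads).mean(axis=0)
    order = np.argsort(-importance)
    return [(names[j], float(importance[j])) for j in order]


# Hypothetical logistic model over three standardized features
w = np.array([0.1, -2.0, 0.5])


def predict(X):
    return 1.0 / (1.0 + np.exp(-(X @ w)))


X_demo = np.random.default_rng(0).normal(size=(20, 3))
ranking = derivative_importance(predict, X_demo, ["feature_a", "feature_b", "feature_c"])
```

Because the sign of the derivative is discarded, a feature that strongly lowers the predicted risk (here `feature_b`, weight -2.0) ranks as highly as one that strongly raises it.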

Statistical analysis

Continuous data were summarized as means and standard deviations; categorical data were summarized as counts and percentages. The two-sided Student’s t-test (continuous data) and the chi-square test (categorical data) were used to assess baseline differences between cohorts and differences in model performance. A P-value <0.05 was considered statistically significant. Analyses were performed with the statistical package of MATLAB 2018a (MathWorks, Natick, MA).
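As a worked check of the categorical comparison, the following sketch implements Pearson's chi-square test for a 2 × 2 table without a continuity correction (an assumption; the text does not state whether a correction was used), using only the Python standard library. For one degree of freedom, the chi-square survival function reduces to erfc(√(χ²/2)). Applied to the sex distribution in Table 1 (117 of 178 internal vs 59 of 95 external patients male), it gives P ≈ 0.55, matching the reported value.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table; returns (chi2, p)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    p = math.erfc(math.sqrt(chi2 / 2.0))  # survival function of chi-square with 1 df
    return chi2, p


# Rows: internal and external cohorts; columns: male, female (Table 1)
chi2, p = chi_square_2x2([[117, 61], [59, 36]])
```

The same function applied to the within-cohort sex split of the internal cohort (85 of 121 vs 32 of 57 male) gives P ≈ 0.06, in line with the value reported in the Results.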

Results

No significant baseline differences were found between patients in our internal and external validation cohorts (Table 1).

Table 1.

Baseline characteristics of internal and external validation cohorts

Characteristic | Internal cohort | External cohort | P-value (a)
MRI scanner manufacturer | GE Healthcare | Philips Healthcare | -
Number of subjects (n) | 178 | 95 | -
Age (years) (b) | 14.7 (4.8) | 14.0 (5.3) | 0.29
Male, number (%) (c) | 117 (65.7%) | 59 (62.1%) | 0.55
Body mass index (kg/m²) (b) | 30.0 (9.7) | 28.9 (10.9) | 0.37
Liver stiffness (kPa) (b) | 2.9 (1.1) | 3.1 (1.4) | 0.11

(a) P-value <0.05 is significant
(b) Age, body mass index and liver stiffness are presented as mean (standard deviation)
(c) Gender is presented as the number (percentage) of male patients

Internal validation

We used 178 MR elastography examinations performed on GE Healthcare MRI scanners in 178 unique patients for the internal validation experiment. Of these, 121 patients had a mean liver stiffness <3 kPa and a mean age of 14.2 (standard deviation [SD] = 4.6) years; 85/121 (70.2%) patients were male. Fifty-seven of the 178 patients had a mean liver stiffness ≥3 kPa and a mean age of 15.8 (SD=5.0) years; 32/57 (56.1%) patients were male. There was no significant difference in age (P=0.05) or gender (P=0.06) between groups. Patients with a mean liver stiffness <3 kPa had a mean liver stiffness of 2.3 (SD=0.4) kPa, whereas patients with a mean liver stiffness ≥3 kPa had a mean liver stiffness of 4.0 (SD=1.2) kPa. One hundred forty-one (79.2%) MR elastography examinations were performed on 1.5-T MRI scanners, and 37 (20.8%) were performed on 3-T MRI scanners. Of these 178 unique patients, 24 (13.5%) had been diagnosed with fatty liver disease.

Classifying liver stiffness using T2-weighted imaging data alone

We first set out to determine the performance of DeepLiverNet using only non-stiffness T2-weighted imaging data. Using imaging data alone, DeepLiverNet classified patients with regard to categorical MR elastography liver stiffness with an AuROC of 0.80 (Table 2), an accuracy of 85.2%, a sensitivity of 66.0% and a specificity of 93.0%.

Table 2.

Internal cross-validation of DeepLiverNet model for categorically classifying patients using imaging data alone, clinical data alone and combined data (n=178)

Data | Accuracy | Sensitivity | Specificity | AuROC
Imaging data (a) | 85.2% [84.4%, 86.0%] | 66.0% [64.5%, 67.7%] | 93.0% [91.1%, 94.9%] | 0.80 [0.79, 0.81]
Clinical data (a) | 83.8% [83.0%, 84.6%] | 70.9% [68.8%, 73.0%] | 89.8% [89.1%, 90.4%] | 0.83 [0.81, 0.84]
Combined imaging and clinical data (a) | 88.0% [87.6%, 88.5%] | 74.3% [73.0%, 75.6%] | 94.6% [93.9%, 95.3%] | 0.86 [0.85, 0.87]

(a) Numbers in brackets are 95% confidence intervals

AuROC area under the receiver operating characteristic curve

Classifying liver stiffness using clinical data alone

Using only clinical data, the model classified patients with an AuROC of 0.83 (Table 2), significantly greater than that of the imaging-only model (P=0.003). This model achieved an accuracy of 83.8%, a sensitivity of 70.9% and a specificity of 89.8%.

Classifying liver stiffness using both imaging and clinical data

The DeepLiverNet combining both T2-weighted MR imaging and clinical data was able to correctly classify patients with an AuROC of 0.86 (Table 2). This was significantly greater than imaging data alone (P<0.0001) or clinical data alone (P<0.0001). The DeepLiverNet model using both clinical and imaging data achieved an accuracy of 88.0%, with a sensitivity of 74.3% and a specificity of 94.6%.

External validation

Ninety-five MRI examinations from 95 unique patients were included in our external validation experiment. Of these 95 patients, 59 had a mean liver stiffness of <3 kPa with a mean age of 15.0 (SD=4.7) years; 40/59 (67.8%) patients were male. The other 36 patients had a mean liver stiffness of ≥3 kPa and a mean age of 14.1 (SD=4.7) years; 28/36 (77.8%) patients were male. There was no significant difference in age (P=0.45) or gender (P=0.30) between groups. Patients with a mean liver stiffness <3 kPa had a mean liver stiffness of 2.3 (SD=0.3) kPa, whereas patients with a mean liver stiffness ≥3 kPa had a mean liver stiffness of 4.4 (SD=1.4) kPa. Ninety (94.7%) of these MR elastography examinations were performed on 1.5-T MRI scanners, and only 5 (5.3%) MR elastography examinations were performed on 3-T MRI scanners. The DeepLiverNet trained for classifying liver stiffness using both clinical and imaging features was able to correctly classify these patients with an accuracy of 80.0% and AuROC of 0.79 (Table 3). With the imaging data alone, the model had an accuracy of 77.2% and AuROC of 0.75. With the clinical data alone, the model achieved an accuracy of 75.0% and AuROC of 0.74.

Table 3.

External validation of DeepLiverNet model for categorically classifying patients using imaging data alone, clinical data alone and combined data (n=95)

Data | Accuracy | Sensitivity | Specificity | AuROC
Imaging data alone | 77.2% | 60.3% | 89.4% | 0.75
Clinical data alone | 75.0% | 60.9% | 87.3% | 0.74
Combined imaging and clinical data | 80.0% | 61.1% | 91.5% | 0.79

AuROC area under the receiver operating characteristic curve

We continued to investigate the performance of the proposed DeepLiverNet on sub-populations of the external cohort (Table 4). The model was first evaluated in a male group (n=59) and a female group (n=36) separately. Using both imaging and clinical data, the model showed an AuROC of 0.80 and accuracy of 81.4% in the male group, similar to the whole external cohort. A comparable performance was also achieved in the female group, with an AuROC of 0.82 and accuracy of 77.8%. Receiver operating characteristic (ROC) curves of DeepLiverNet for total patients, males only, and females only are displayed in Fig. 4.

Table 4.

External validation of DeepLiverNet model on diverse sub-populations of the external cohort using combined imaging and clinical data

Subgroup | No. of test subjects | Accuracy | Sensitivity | Specificity | AuROC
Male patients | 59 | 81.4% | 54.5% | 97.3% | 0.80
Female patients | 36 | 77.8% | 71.4% | 81.8% | 0.82
Patients with fatty liver disease | 15 | 73.3% | 20.0% | 100.0% | 0.62
Patients without fatty liver disease | 80 | 81.2% | 67.7% | 89.8% | 0.83

AuROC area under the receiver operating characteristic curve, No. number

Fig. 4.


External validation receiver operating characteristic curves of DeepLiverNet for sub-populations based on patients’ gender (a) and groups with and without fatty liver disease (b) using the combination of imaging and clinical data. Dashed line indicates a random classifier

Next, we tested DeepLiverNet in patients without (n=80) and with (n=15) a diagnosis of fatty liver disease (Table 4). Compared with the whole external cohort, DeepLiverNet achieved slightly higher performance in patients without a diagnosis of fatty liver disease. In contrast, diagnostic performance was markedly lower in patients with a diagnosis of fatty liver disease, with an AuROC of 0.62 and a very low sensitivity of 20%. ROC curves of DeepLiverNet in all patients and in patients with and without fatty liver disease are shown in Fig. 4.

Visualization of discriminative image regions

We visualized the most discriminative image regions for given T2-weighted liver images using the Grad-CAM technique [36] (Fig. 5). Coarse localization heat maps were overlaid on the input liver images. Figure 5 shows axial T2-weighted liver images and their most discriminative regions from three subjects with different liver stiffness values (ranging from 4.3 kPa to 6.9 kPa). In Fig. 5, a learned heat map covers the whole liver region, and the maps also show that both liver and spleen regions were used by the model for decision-making. In one child in Fig. 5, subjective assessment of the map shows localization to the left hepatic lobe and medial portion of the spleen as well as intervening tissues (e.g., the gastrohepatic ligament region). More examples of discriminative region visualization are displayed in Online Supplementary Material 3.

Fig. 5.


Axial T2-weighted fast spin-echo fat-suppressed liver MR images (a, c, e) and their most discriminative regions based on a normalized gradient-weighted class activation mapping algorithm (b, d, f) from three patients who underwent MR elastography. a, b A 14-year-old boy with no known medical history and a liver stiffness value of 4.3 kPa. c, d A 13-year-old boy with no known medical history and a liver stiffness value of 4.6 kPa. e, f A 13-year-old boy with no known medical history and a liver stiffness value of 6.9 kPa

Discriminative clinical features ranking

Ranked by the connection weights algorithm, the 10 most discriminative features for classifying liver stiffness in our DeepLiverNet model included total bilirubin, fibrosis-4 score, gamma-glutamyl transferase, direct bilirubin, MRI liver volume, MRI chemical shift-encoded fat fraction, aspartate aminotransferase to platelet ratio index (APRI), body mass index, aspartate aminotransferase and serum albumin.

Discussion

Deep learning, which simultaneously learns data representations and decision-making, is a state-of-the-art artificial intelligence technique that has achieved exceptional performance in numerous fields, such as image recognition, object detection and natural language processing [24]. In this work, we focused on supervised deep learning, in which a model is given a set of input data (e.g., clinical data or MR images) and associated labels (i.e. liver stiffness) and learns the latent relationship between them. In this retrospective study, we proposed and evaluated DeepLiverNet for a liver stiffening classification task. By integrating clinical and T2-weighted MRI liver data, DeepLiverNet achieved an AuROC of 0.86 and an accuracy of 88.0% at internal validation. This multi-channel deep learning model outperformed the single-channel models trained with either clinical or imaging data alone.

DeepLiverNet performed similarly to our previously developed support vector machine classifier, which achieved an AuROC of 0.84 and an accuracy of 81.8% on the same liver stiffening classification task in a similar cohort [23]. Unlike the support vector machine, DeepLiverNet does not require manual liver segmentation or handcrafted radiomic features engineered by data scientists. Instead, it automatically identifies and extracts useful imaging features, including features that may not be easily recognized visually by human experts, for the liver stiffening classification task. Thus, DeepLiverNet reduces the time spent on image processing during both model development and application. We believe that, with continued refinement, such a model could convincingly outperform our support vector machine model and be used to reliably identify children with normal liver stiffness at the point of care (e.g., integrated within the MR console) to triage the need for additional MR elastography, potentially avoiding MR elastography in up to two-thirds of candidates and thereby shortening examination length and lowering health care costs.

Overfitting occurs when a model fits the training data closely but generalizes poorly to unseen data. It is especially common when classifying medical images, where biological heterogeneity is inherent and training samples are relatively limited. We therefore applied two strategies to mitigate overfitting. The first strategy was transfer learning. ImageNet models [29, 31-33], pretrained on ~1.2 million non-medical color images (dogs, cats, cars, etc.), were reused to help train DeepLiverNet on medical images (i.e., anatomical T2-weighted MR images) for the liver stiffening classification task. Although non-medical color images and gray-scale medical images differ in content, basic image elements such as edges, shapes and blobs are similar across images. After comparing various ImageNet models, we opted to use VGG-19. Interestingly, VGG-19 achieved the best performance in our optimization experiments even though its architecture is relatively simpler than those of other models (e.g., Inception, ResNet and NASNet). The appropriate architecture for a deep learning model depends on the complexity of the task [38]. Although these deeper models are useful for general computer vision classification tasks with a thousand categories, they might not be optimal for reuse in our two-class classification task. Indeed, a similar trend has been reported previously [25]. The second strategy for minimizing overfitting was data augmentation. Image augmentation methods are frequently applied to increase the variability of training samples and enhance model generalizability [29, 35, 39]. With these two strategies, our internal and external validation results show promise for DeepLiverNet as an off-the-shelf product for clinical use in the near future.
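
To make the two strategies concrete, the sketch below shows generic data augmentation on a single gray-scale slice, plus one common way to feed gray-scale images to an RGB-pretrained backbone such as VGG-19: replicating the slice into three identical channels. This is a minimal sketch under stated assumptions. The specific augmentations used for DeepLiverNet are not described in this section, so the small translation and intensity scaling here are assumed examples, and the channel replication is an assumed preprocessing step rather than a confirmed detail of the model. All function names are hypothetical.

```python
import random
from typing import List

Image = List[List[float]]  # one 2D gray-scale slice, rows x columns


def translate(img: Image, dy: int, dx: int, fill: float = 0.0) -> Image:
    """Shift the slice down by dy and right by dx pixels; uncovered pixels get `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = img[sy][sx]
    return out


def scale_intensity(img: Image, factor: float) -> Image:
    """Multiply every pixel by `factor` to mimic mild signal-intensity variation."""
    return [[v * factor for v in row] for row in img]


def augment(img: Image, rng: random.Random, max_shift: int = 2,
            max_scale: float = 0.1) -> Image:
    """Random small translation followed by random global intensity scaling."""
    dy = rng.randint(-max_shift, max_shift)
    dx = rng.randint(-max_shift, max_shift)
    factor = rng.uniform(1.0 - max_scale, 1.0 + max_scale)
    return scale_intensity(translate(img, dy, dx), factor)


def to_three_channels(img: Image) -> List[Image]:
    """Copy a gray-scale slice into 3 identical channels for an RGB-pretrained backbone."""
    return [[row[:] for row in img] for _ in range(3)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the example is reproducible
    slice_ = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    augmented = augment(slice_, rng)
    rgb_like = to_three_channels(augmented)
    print(len(rgb_like), len(rgb_like[0]), len(rgb_like[0][0]))  # 3 4 4
```

Left-right flips are deliberately omitted because they would move the liver to the wrong side of the abdomen, which is anatomically implausible for this task.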

DeepLiverNet reached a slightly lower AuROC of 0.79 and an accuracy of 80.0% at external validation on an independent, cross-platform patient cohort. When DeepLiverNet was applied separately to the male and female groups of the external cohort, comparable diagnostic performance was observed in both sub-populations, suggesting that gender might not be a confounder of MR elastography. Interestingly, DeepLiverNet's diagnostic performance improved in patients without fatty liver disease (n=80), with the AuROC increasing slightly to 0.83. Conversely, the model performed poorly in the subset of patients with documented fatty liver disease (n=15). In particular, its sensitivity was very low (20%), indicating that the model tended to misclassify patients with fatty liver disease and moderate/severe liver stiffening as having no/mild stiffening. Indeed, liver fat has been suggested as a potential confounder of MR elastography, possibly having a softening effect [10, 13]. In a recent study, the diagnostic performance of MR elastography for predicting histological liver fibrosis was better in patients without fatty liver disease than in those with fatty liver disease [13]. Further studies are needed to understand the potential confounding effects of liver fat.
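
The subgroup results above rest on the standard definitions: sensitivity = TP/(TP + FN), specificity = TN/(TN + FP), and accuracy = (TP + TN)/N, with moderate/severe stiffening as the positive class. A low sensitivity therefore means false negatives dominate among stiff livers. The sketch below computes these metrics separately within each subgroup (for example, by fatty liver status or by sex). It is a minimal illustration with hypothetical function names and toy data, not the analysis code used here.

```python
from typing import Dict, Sequence


def binary_metrics(labels: Sequence[int], preds: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity, with 1 = moderate/severe stiffening."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / len(labels),
        "sensitivity": tp / (tp + fn) if tp + fn else nan,  # undefined without positives
        "specificity": tn / (tn + fp) if tn + fp else nan,  # undefined without negatives
    }


def metrics_by_group(
    labels: Sequence[int], preds: Sequence[int], groups: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Compute binary_metrics separately within each subgroup label."""
    out = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        out[g] = binary_metrics([labels[i] for i in idx], [preds[i] for i in idx])
    return out


if __name__ == "__main__":
    # Toy labels/predictions (hypothetical, not study data).
    labels = [1, 1, 1, 0, 0, 1, 1, 0, 0]
    preds = [1, 0, 0, 0, 0, 1, 1, 0, 1]
    groups = ["fatty"] * 5 + ["no_fat"] * 4
    for name, m in metrics_by_group(labels, preds, groups).items():
        print(name, m)
```

In small subgroups such as n=15, each misclassified positive shifts sensitivity substantially, so subgroup estimates should be read with their small denominators in mind.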

Visualizing the imaging channel of DeepLiverNet can show explicitly where the model extracts image features for liver stiffness stratification. Although the model uses entire T2-weighted images for prediction, diverse regions were highlighted by the Grad-CAM heat maps. The whole liver or spleen regions were used for liver stiffness stratification in multiple patients (Fig. 5; Online Supplementary Material 3). In some cases, only spleen regions were used for decision-making (Online Supplementary Material 3). For several correctly classified patients, similar regions covering the left hepatic lobe and medial spleen were highlighted on the heat maps despite varying degrees of liver stiffening (Fig. 5; Online Supplementary Material 3). A bold interpretation of these Grad-CAM heat maps might be that our deep learning model was emphasizing the relationship between the liver and spleen (and intervening tissues, such as the gastrohepatic ligament), such as the ratio of liver to spleen volume. It is established that the morphology of the left hepatic lobe and spleen changes with progressive liver fibrosis and cirrhosis [40]. In our previous study [23], liver volume was also recognized as a predictor of liver stiffening by a support vector machine model.
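
For readers unfamiliar with Grad-CAM [36], a heat map for class c is built from one convolutional layer's activation maps A^k. Each channel is weighted by the spatial average of the class-score gradient, alpha_k = (1/Z) Σ_i Σ_j ∂y^c/∂A^k_ij, and the map is ReLU(Σ_k alpha_k A^k). The sketch below implements only that combination step, assuming the activations and gradients have already been obtained from a deep learning framework's automatic differentiation. The choice of layer inside DeepLiverNet, and the upsampling and overlay onto the T2-weighted image, are not covered, and the function name is hypothetical.

```python
from typing import List

FeatureMaps = List[List[List[float]]]  # [channel k][row i][col j]


def grad_cam(activations: FeatureMaps, gradients: FeatureMaps) -> List[List[float]]:
    """Grad-CAM map from one convolutional layer's activations A^k and the
    gradients dy_c/dA^k of the target class score y_c (same shapes)."""
    h, w = len(activations[0]), len(activations[0][0])
    # alpha_k: global average of the gradients over each channel's spatial grid.
    alphas = [sum(sum(row) for row in g_k) / (h * w) for g_k in gradients]
    cam = [[0.0] * w for _ in range(h)]
    for a_k, alpha in zip(activations, alphas):
        for i in range(h):
            for j in range(w):
                cam[i][j] += alpha * a_k[i][j]
    # ReLU keeps only regions whose activation increases the class score.
    cam = [[max(v, 0.0) for v in row] for row in cam]
    # Scale to [0, 1] for display as a heat map (skipped if the map is all zero).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


if __name__ == "__main__":
    # Two toy 2x2 channels: channel 0 supports the class, channel 1 opposes it.
    acts = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
    grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
    print(grad_cam(acts, grads))  # [[1.0, 0.0], [0.0, 0.0]]
```

Because the map comes from a coarse late-layer grid, highlighted regions are approximate, which is one reason the liver/spleen interpretation above is offered as a bold rather than definitive reading.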

Interpreting the clinical channel of DeepLiverNet revealed the most discriminative clinical features for classifying liver stiffness. These features (including total bilirubin, fibrosis-4 score, gamma-glutamyl transferase, direct bilirubin and MRI liver volume) are quite similar to those identified in an overlapping subject cohort using a more traditional machine learning approach (support vector machine) to identify clinical and imaging features predictive of liver stiffness [23]. According to the existing literature, changes in such clinical features are associated with progressive chronic liver disease and increasing liver fibrosis/cirrhosis.
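
This section does not state how the clinical features were ranked. One established way to interpret a small neural network's inputs is the connection-weight approach of Olden and Jackson [37]. For a network with one hidden layer, the importance of input i is the sum over hidden units h of w_ih × w_h, where w_ih are the input-to-hidden weights and w_h are the hidden-to-output weights. The sketch below implements only that product-sum, omitting the randomization test Olden and Jackson use for significance. It assumes a single-hidden-layer clinical channel, which may not match DeepLiverNet's architecture, so it illustrates the technique rather than reproducing the method used here. The weights and function names are hypothetical.

```python
from typing import Dict, List, Sequence


def connection_weight_importance(
    feature_names: Sequence[str],
    w_input_hidden: List[List[float]],  # shape [n_features][n_hidden]
    w_hidden_output: Sequence[float],   # shape [n_hidden], single output unit
) -> Dict[str, float]:
    """Connection-weight importance for a one-hidden-layer network:
    importance_i = sum over hidden units h of w_ih * w_h (sign = direction)."""
    return {
        name: sum(w_ih * w_h for w_ih, w_h in zip(row, w_hidden_output))
        for name, row in zip(feature_names, w_input_hidden)
    }


def rank_by_magnitude(scores: Dict[str, float]) -> List[str]:
    """Feature names ordered from most to least influential (by absolute value)."""
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)


if __name__ == "__main__":
    # Hypothetical weights for three of the clinical features named in the text.
    names = ["total_bilirubin", "fib4_score", "ggt"]
    w_ih = [[1.0, 0.5], [0.5, -1.0], [0.25, 0.25]]
    w_ho = [1.0, 2.0]
    scores = connection_weight_importance(names, w_ih, w_ho)
    print(scores)
    print(rank_by_magnitude(scores))  # ['total_bilirubin', 'fib4_score', 'ggt']
```

Ranking by absolute value separates how strongly a feature matters from whether it pushes predictions toward or away from moderate/severe stiffening, which the sign retains.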

There are limitations to this work. First, DeepLiverNet performed worse at external validation than at internal validation. It remains challenging to apply a model to unseen data from different imaging platforms and obtain results identical to those of internal validation. We expect that more training data, including data from multiple MRI scanner manufacturers, would further improve the generalizability of the model. Second, only four axial T2-weighted liver images, corresponding to the locations where liver stiffness was measured in the MR elastography examinations, were used in model evaluation. It is conceivable that additional T2-weighted slices, or even the whole liver, could be used to improve model performance. Third, the only imaging data used for DeepLiverNet were T2-weighted fat-suppressed liver images. Additional imaging data from other pulse sequences, such as T1-weighted or diffusion-weighted imaging, could improve model performance. Last, DeepLiverNet was used to stratify the severity of liver stiffness measured by MR elastography, which is an indirect surrogate marker of liver fibrosis. Our future plans include developing similar deep learning methods to predict liver stiffness on a continuous scale and to stage liver fibrosis categorically (or continuously, based on advanced digital pathology) on a histopathological basis.
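
Using more slices, or the whole liver, raises the question of how per-slice outputs become one patient-level prediction. This section does not describe how DeepLiverNet combines its four slices. The sketch below shows two simple pooling rules as assumptions, mean and max pooling of per-slice probabilities, with a hypothetical function name. The same function works unchanged whether four or forty slices are supplied.

```python
from statistics import mean
from typing import Sequence


def patient_probability(slice_probs: Sequence[float], method: str = "mean") -> float:
    """Pool per-slice probabilities of moderate/severe stiffening into one patient score.

    "mean" averages evidence across slices; "max" flags a patient if any single
    slice looks stiff, which at a fixed threshold favors sensitivity over specificity.
    """
    if not slice_probs:
        raise ValueError("need at least one slice probability")
    if method == "mean":
        return mean(slice_probs)
    if method == "max":
        return max(slice_probs)
    raise ValueError(f"unknown pooling method: {method}")


if __name__ == "__main__":
    four_slices = [0.2, 0.4, 0.6, 0.8]  # toy per-slice outputs, not study data
    print(patient_probability(four_slices), patient_probability(four_slices, "max"))
```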

Conclusion

A deep learning model incorporating clinical features and T2-weighted MR images from children with known or suspected chronic liver disease classified patients as having normal/minimally elevated versus moderately/severely elevated liver stiffness with an accuracy of up to 88%. We performed both internal and external validation experiments using data acquired on MRI scanners from two manufacturers in subjects with a variety of chronic liver diseases. Further studies are needed to refine the model and to validate it in other patient groups, including cohorts with specific liver diseases (e.g., primary sclerosing cholangitis, viral hepatitis, non-alcoholic fatty liver disease). Ultimately, we plan to use this model as the foundation for predicting histological liver fibrosis, perhaps eliminating the need for biopsy in some children with known or suspected chronic liver disease.

Supplementary Material

1757172_OL_sup-1
1757172_OL_sup-2
1757172_OL_sup-3

Acknowledgments

This study was supported by an internal grant from the Department of Radiology, Cincinnati Children’s Hospital Medical Center and NIH R01-EB030582.

Footnotes

Publisher's Disclaimer: This Author Accepted Manuscript is a PDF file of an unedited peer-reviewed manuscript that has been accepted for publication but has not been copyedited or corrected. The official version of record that is published in the journal is kept up to date and so may therefore differ from this version.

Conflicts of interest: None

References

1. Chalasani N, Younossi Z, Lavine JE et al. (2018) The diagnosis and management of nonalcoholic fatty liver disease: practice guidance from the American Association for the Study of Liver Diseases. Hepatology 67:328–357
2. Lavanchy D (2009) The global burden of Hepatitis C. Liver Int 29:74–81
3. Tapper EB, Lok AS-F (2017) Use of liver imaging and biopsy in clinical practice. N Engl J Med 377:756–768
4. Serai SD, Trout AT, Miethke A et al. (2018) Putting it all together: established and emerging MRI techniques for detecting and measuring liver fibrosis. Pediatr Radiol 48:1256–1272
5. Smith AD, Porter KK, Elkassem AA et al. (2019) Current imaging techniques for noninvasive staging of hepatic fibrosis. AJR Am J Roentgenol 2019:1–13
6. Banerjee R, Pavlides M, Tunnicliffe EM et al. (2014) Multiparametric magnetic resonance for the non-invasive diagnosis of liver disease. J Hepatol 60:69–77
7. Dillman JR, Heider A, Bilhartz JL et al. (2015) Ultrasound shear wave speed measurements correlate with liver fibrosis in children. Pediatr Radiol 45:1480–1488
8. Yin M, Glaser KJ, Talwalkar JA et al. (2015) Hepatic MR elastography: clinical performance in a series of 1,377 consecutive examinations. Radiology 278:114–124
9. Shi Y, Guo Q, Xia F et al. (2014) MR elastography for the assessment of hepatic fibrosis in patients with chronic Hepatitis B infection: does histologic necroinflammation influence the measurement of hepatic stiffness? Radiology 273:88–98
10. Joshi M, Dillman JR, Singh K et al. (2018) Quantitative MRI of fatty liver disease in a large pediatric cohort: correlation between liver fat fraction, stiffness, volume, and patient-specific factors. Abdom Radiol 43:1168–1179
11. DiPaola FW, Schumacher KR, Goldberg CS et al. (2017) Effect of Fontan operation on liver stiffness in children with single ventricle physiology. Eur Radiol 27:2434–2442
12. Rotemberg V, Palmeri M, Nightingale R et al. (2011) The impact of hepatic pressurization on liver shear wave speed estimates in constrained versus unconstrained conditions. Phys Med Biol 57:329
13. Trout AT, Sheridan RM, Serai SD et al. (2018) Diagnostic performance of MR elastography for liver fibrosis in children and young adults with a spectrum of liver diseases. Radiology 287:824–832
14. Serai SD, Towbin AJ, Podberesky DJ (2012) Pediatric liver MR elastography. Dig Dis Sci 57:2713–2719
15. Muthupillai R, Lomas D, Rossman P et al. (1995) Magnetic resonance elastography by direct visualization of propagating acoustic strain waves. Science 269:1854–1857
16. Bahl M, Barzilay R, Yedidia AB et al. (2018) High-risk breast lesions: a machine learning model to predict pathologic upgrade and reduce unnecessary surgical excision. Radiology 286:810–818
17. Dawes TJW, de Marvao A, Shi W et al. (2017) Machine learning of three-dimensional right ventricular motion enables outcome prediction in pulmonary hypertension: a cardiac MR imaging study. Radiology 283:381–390
18. Kickingereder P, Bonekamp D, Nowosielski M et al. (2016) Radiogenomics of glioblastoma: machine learning-based classification of molecular characteristics by using multiparametric and multiregional MR imaging features. Radiology 281:907–918
19. Wu H, Deng Z, Zhang B et al. (2016) Classifier model based on machine learning algorithms: application to differential diagnosis of suspicious thyroid nodules via sonography. AJR Am J Roentgenol 207:859–864
20. Abajian A, Murali N, Savic LJ et al. (2018) Predicting treatment response to intra-arterial therapies for hepatocellular carcinoma with the use of supervised machine learning — an artificial intelligence concept. J Vasc Interv Radiol 29:850–857
21. Kline TL, Korfiatis P, Edwards ME et al. (2017) Performance of an artificial multi-observer deep neural network for fully automated segmentation of polycystic kidneys. J Digit Imaging 30:442–448
22. Mutasa S, Chang PD, Ruzal-Shapiro C, Ayyala R (2018) MABAL: a novel deep-learning architecture for machine-assisted bone age labeling. J Digit Imaging 31:513–519
23. He L, Li H, Dudley JA et al. (2019) Machine learning prediction of liver stiffness using clinical and T2-weighted MRI radiomic data. AJR Am J Roentgenol 213:1–10
24. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444
25. Lakhani P, Sundaram B (2017) Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology 284:574–582
26. Serai SD, Dillman JR, Trout AT (2016) Spin-echo echo-planar imaging MR elastography versus gradient-echo MR elastography for assessment of liver stiffness in children and young adults suspected of having liver disease. Radiology 282:761–770
27. Sawh MC, Newton KP, Goyal NP et al. (2020) Normal range for MR elastography measured liver stiffness in children without liver disease. J Magn Reson Imaging 51:919–927
28. Yin M, Talwalkar JA, Glaser KJ et al. (2007) Assessment of hepatic fibrosis with magnetic resonance elastography. Clin Gastroenterol Hepatol 5:1207–1213
29. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. Cornell University. https://arxiv.org/abs/1409.1556. Accessed 22 Aug 2020
30. ImageNet (2016) Website. http://www.image-net.org/. Accessed 22 Aug 2020
31. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 770–778
32. Szegedy C, Vanhoucke V, Ioffe S et al. (2016) Rethinking the inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2818–2826
33. Zoph B, Vasudevan V, Shlens J, Le QV (2018) Learning transferable architectures for scalable image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 8697–8710
34. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. Cornell University. https://arxiv.org/abs/1412.6980. Accessed 22 Aug 2020
35. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. Comm ACM, 60
36. Selvaraju RR, Das A, Vedantam R et al. (2016) Grad-CAM: why did you say that? Cornell University. https://arxiv.org/abs/1611.07450. Accessed 22 Aug 2020
37. Olden JD, Jackson DA (2002) Illuminating the “black box”: a randomization approach for understanding variable contributions in artificial neural networks. Ecol Model 154:135–150
38. Ng A (2017) Machine learning yearning
39. Szegedy C, Liu W, Jia Y et al. (2015) Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 1–9
40. Pickhardt PJ, Malecki K, Hunt OF et al. (2017) Hepatosplenic volumetric assessment at MDCT for staging liver fibrosis. Eur Radiol 27:3060–3068
