Author manuscript; available in PMC: 2025 Oct 1.
Published in final edited form as: Acad Radiol. 2024 May 3;31(10):4045–4056. doi: 10.1016/j.acra.2024.03.044

Local Assessment and Small Bowel Crohn’s Disease Severity Scoring using AI

Binu Enchakalody 1, Ashish P Wasnik 1,2, Mahmoud Al-Hawary 1,2,3,4, Stewart C Wang 1, Grace L Su 1,4, Brian Ross 1, Ryan W Stidham 1,4,5
PMCID: PMC12108342  NIHMSID: NIHMS2070211  PMID: 38702212

Abstract

Rationale and Objectives:

We present a machine learning and computer vision approach for localized, automated, and standardized scoring of Crohn’s disease (CD) severity in the small bowel, overcoming the current limitations of manual measurements on CT enterography (CTE) imaging and of qualitative assessments, while also accounting for the complex anatomy and distribution of the disease.

Materials and Methods:

Two radiologists introduced a severity score and evaluated disease severity at 7.5 mm intervals along the curved planar reconstruction of the distal and terminal ileum using 236 CTE scans. A hybrid model, combining a deep-learning 3-D convolutional neural network (CNN) and a Random Forest model, was developed to classify disease severity at each mini-segment. Precision, sensitivity, weighted Cohen’s κ, and accuracy were evaluated on a 20% hold-out test set.

Results:

The hybrid model achieved precision and sensitivity ranging from 42.4% to 84.1% across severity categories (normal, mild, moderate, and severe) on the test set. The model’s Cohen’s κ score (κ = 0.83) and accuracy (70.7%) were comparable to the interobserver agreement between experienced radiologists (κ = 0.87, accuracy = 76.3%). The model’s predicted disease length correlated with radiologist-reported disease length (r = 0.83), and the model identified the portion of the total ileum containing moderate-to-severe disease with an accuracy of 91.51%.

Conclusion:

The proposed automated hybrid model offers a standardized, reproducible, and quantitative local assessment of small bowel CD severity and demonstrates its value in CD severity assessment.

Keywords: Crohn’s disease, Disease Severity scoring, CD severity detection, Machine learning in CD, Small bowel AI

INTRODUCTION

Crohn’s disease (CD) is a chronic inflammatory bowel disease affecting over a million North Americans (1,2). It leads to varying degrees of inflammatory and fibrotic damage, most commonly observed in the ileum of the small bowel (3). While colonoscopy is the conventional CD diagnostic method, cross-sectional Computed Tomography Enterography (CTE) and Magnetic Resonance Enterography (MRE) play a crucial role in accurate assessment. These manual assessment methods are limited by inter-observer variation, subjectivity, and lack of reproducibility, owing in part to complex disease features and bowel shape.

Manual linear measurements from MRE, including the MaRIA (4), simplified MaRIA (5), Lemann Index (6), and London (7) scores, have contributed to the development of CD severity scoring methods, prompting efforts to standardize measurement techniques among experts (8, 9). Current systems employ three to six severity-related features for disease grading through linear measurements, and grading standards for CD in MRE have been compared (10, 11). Yet these systems often omit established CD imaging findings because of measurement challenges, subjectivity, time constraints, poor reproducibility, and practicality issues linked to the scoring tool’s simplicity. Currently, there are no established standards for severity scoring in CTE.

Advancements in artificial intelligence (AI), computer vision, and machine learning (ML) provide novel approaches for standardized assessments and explainable CD quantitation. Previous work shows that automated methods can efficiently extract and quantify conventional CD imaging features, as defined by radiologists and employed in clinical practice (12), thereby enabling automated and reproducible assessments. The assessment of small bowel CD has been demonstrated in T1-weighted MRE scans using Curved Planar Reformation (CPR) applied to the bowel imagery, along with a spatial coordinate reference (13–15). Other studies have demonstrated that deep-learning models can detect CD regions in the large intestine and score CD severity in the terminal ileum using MRE (16, 17). A recent study demonstrated that an SVM model trained using radiomic features and clinical data accurately predicted ileal CD in T2-weighted MRI, achieving results comparable to those of three radiologists who manually annotated the diseased region (18). To assess ulcerative colitis, a study compared the performance of a deep learning model with that of human experts in categorizing a 4-level Mayo severity score (19). ML models using normal and diseased annotations placed by two experts on the CPR view of the small bowel in CTE have demonstrated promising results in improving CD assessment (20).

We introduce a novel, four-level severity score for CD and demonstrate its reproducibility by training an AI model to assess severity. We also present methods for localizing, assessing, categorizing, and summarizing small bowel CD regions using this severity score. Our method replicates expert radiologist evaluations of disease severity along planar reformatted views, offering a novel perspective for CD diagnosis in CTE. Moreover, we highlight the value of combining clinically inspired traditional features with a deep-learning model, providing clinicians with a comprehensive suite of clinically explainable measures that characterize bowel health along the segment’s length.

METHODS

Study Selection

In this institutional review board (IRB)-approved, HIPAA-compliant study, subjects with a diagnosis of CD were retrospectively selected who underwent CTE between 2014 and 2018 and had three years of follow-up recorded in electronic medical records at a single tertiary care referral center. Inclusion criteria consisted of adults aged 18 years and above with a verified CD diagnosis and radiological confirmation of CD on CTE. Exclusion criteria included the presence of penetrating complications, an ileostomy, malignancy, or inadequate imaging quality for radiologist evaluation; of 282 subjects initially meeting the study criteria, 236 remained after exclusions. The patient demographics, characteristics, and exclusion criteria are detailed in Appendix: Table 3 and Appendix: Figure 1 in the supplemental material. The scans were acquired using a standard institutional protocol on 64/128 multi-detector CT, with only one CTE per subject being used.

Pre-Processing for Annotation

This study focuses on using the CPR (21) view of the ileum to visualize the small bowel, addressing the limitations of relying solely on raw CTE for local CD assessment. The reformatted volume is generated using expert-placed center points originating from the ileocolonic junction and extending through the ileum to capture 10 cm proximal to any detected disease (12). This ensures that all reformatted study volumes originate from the same anatomical landmark. Center points for the CPR were placed via a custom GUI, enabling the experts to view the axial, coronal, and sagittal axes. A MATLAB-based program then automatically generates the CPR volume from these center points, providing axial slices perpendicular to the center line for cross-sectional views along the length of the bowel.

Disease Severity Score Annotation

Two experienced abdominal radiologists with 15 and 20 years of expertise in CD imaging assessments annotated consecutive 7.5 mm intervals (mini-segments) with overall disease severity. The experts recommended this interval to accommodate severity changes that are likely to occur within 1 cm intervals. Disease grading was mutually agreed upon by the radiologists following the standardized guidelines (9) from the American Gastroenterology Association (AGA) and the Society of Abdominal Radiology (SAR), assigning simplified grades of normal (0), mild (1), moderate (2), and severe (3) to each mini-segment. Experts annotated using a custom MATLAB-based GUI displaying CPR views and mini-segment boundaries. The following are the key conventional bowel features used by the experts for disease severity assessment.

  • Mild: Bowel wall thickness ranging from 3 to 5 mm, possibly exhibiting slight enhancement, and potentially accompanied by mesenteric fat stranding or hypervascularity.

  • Moderate: Bowel wall thickness ranging from 5 to 8 mm, possibly showing moderately increased enhancement, and potentially featuring mild mesenteric fat stranding or hypervascularity.

  • Severe: Bowel wall thickness > 8 mm, possibly displaying increased enhancement, and potentially exhibiting significant mesenteric fat stranding or hypervascularity.

Inter-observer agreement was also assessed, and the disagreements were resolved through consensus, which served as the ground truth for model training and evaluation.
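The wall-thickness bands above admit a simple rule-of-thumb sketch (a hypothetical helper for illustration only; actual expert grading also weighed enhancement and mesenteric findings, which no single threshold captures):

```python
def grade_from_wall_thickness(thickness_mm: float) -> int:
    """Map bowel wall thickness (mm) to a 0-3 severity grade.

    Thresholds follow the mild/moderate/severe bands described in the
    annotation guidelines; real grading also considers enhancement and
    mesenteric fat stranding, which this sketch ignores.
    """
    if thickness_mm < 3.0:
        return 0  # normal: wall < 3 mm
    if thickness_mm <= 5.0:
        return 1  # mild: 3-5 mm
    if thickness_mm <= 8.0:
        return 2  # moderate: 5-8 mm
    return 3      # severe: > 8 mm
```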

Data Distribution

The study includes a cohort of 236 distinct patients, with an average age of 43.3 ± 16 years. The gender distribution in the cohort was 47.46% female and 52.54% male (Table 3). The studies were divided randomly into 80% training and 20% test sets, with the split conducted on a per-patient basis so that the test set contained no mini-segments from studies used in training the model. Each mini-segment is a patch with dimensions 30 mm (height) × 30 mm (width) × 7.5 mm (depth), derived from the reformatted volume, with the corresponding voxels mapped to the range of −300 to +300 Hounsfield Units (HU). The mini-segments derived from each study are treated as independent samples. Table 1 shows the data distribution of the studies; the same set of mini-segments was used to train and test all models. Figure 1 illustrates the functionality of the disease severity model, emphasizing the role played by the reformatted view and mini-segments.
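As one illustration of the patch preparation step, the HU mapping described above can be sketched as clipping each mini-segment to the −300 to +300 HU window and rescaling to [0, 1]; the patch centering, slice count, and normalization choice here are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def extract_mini_segment(volume_hu, z0, patch_hw=30, depth=8,
                         hu_min=-300.0, hu_max=300.0):
    """Crop one mini-segment patch from a reformatted volume (H x W x Z, HU)
    and map voxel values to the [-300, 300] HU window, rescaled to [0, 1].

    The 7.5 mm depth corresponds to `depth` axial slices here (a placeholder;
    the true slice count depends on the reconstruction spacing).
    """
    h, w, _ = volume_hu.shape
    cy, cx = h // 2, w // 2            # patch centered on the lumen center line
    half = patch_hw // 2
    patch = volume_hu[cy - half:cy + half, cx - half:cx + half, z0:z0 + depth]
    patch = np.clip(patch, hu_min, hu_max)      # window to [-300, 300] HU
    return (patch - hu_min) / (hu_max - hu_min)  # rescale to [0, 1]
```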

TABLE 1.

Data Distribution in Disease Severity

Data Set | Patients | Mini-segments | Severity 0 | Severity 1 | Severity 2 | Severity 3
Train | 194 | 6929 | 44.1% | 21.1% | 24.5% | 10.1%
Test | 42 | 1448 | 48.4% | 24.1% | 17.6% | 10.1%

Figure 1.

Flowchart illustrates the functionality of the disease severity model. After the manual placement of center points on the CTE, the reformatted view and mini-segments are generated for both the Random Forest (RF) and Convolutional Neural Network (CNN) models. The prediction probabilities of these models are the inputs to the Hybrid model.

Semantic Segmentation and Post Processing

We trained a 2-D semantic segmentation model using the DeepLabV3+ (22) architecture with an Xception (23) backbone on axial images from reformatted small-bowel volumes of 194 CTE scans in the training set (Table 1). The model classified three main regions: the bowel wall perimeter; the inner wall, which includes the lumen and mucosa; and the background. The masks were then converted into equally spaced point clouds to create a surface matching the segment’s shape and length, ensuring a seamless transition of the surface perimeter along the bowel. A MATLAB-based GUI was designed for editing bowel wall and inner wall perimeters as required. Post-processing involved segmenting 3-D regions of mucosa and lumen within the inner wall using superpixels (24) and k-means clustering. For training, we used an ADAM (25) optimizer with focal loss (26) to address class imbalance. The training lasted 8 epochs, with a learning rate of 10⁻⁵ and a mini-batch size of 16. We employed MATLAB’s Deep Learning framework on two parallel TITAN-X GPUs with 24 GB RAM and CUDA 11.1.
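For reference, focal loss down-weights well-classified examples relative to cross-entropy, which is why it helps with the class imbalance noted above. A minimal sketch (the focusing parameter γ is an assumed default; the paper does not report its value):

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Multi-class focal loss, FL = -(1 - p_t)^gamma * log(p_t),
    averaged over samples. With gamma = 0 it reduces to cross-entropy.

    probs:   (N, C) softmax probabilities
    targets: (N,) integer class labels
    """
    p_t = np.clip(probs[np.arange(len(targets)), targets], eps, 1.0)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t)))
```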

Designed Features

The reformatted view provides a unique perspective of the bowel, centered on the lumen, enabling accurate and reproducible measurement of the bowel wall, lumen, and surrounding regions. The features depend on the segmentation of the bowel wall and lumen, allowing granular measurements along the segment’s length within the reformatted image volume. A suite of 163 features, including density-based, linear, cross-sectional area, volume-based, and texture predictors, resembles clinically established measures within the bowel wall, lumen, and outer surroundings, and falls into four broad categories: geometry (G), density (D), texture-based (T), and relative (R) features. Geometry-based features quantify linear, 2-D, and volume-based measures of the bowel wall, wall thickness (excluding lumen volume), lumen diameter, and surrounding areas. To account for the bowel’s non-circular shape at each cross-section, equivalent diameter replaces traditional diameter and thickness measures. Density-based features capture the density and summary statistics of the bowel wall, wall thickness, lumen, and the surrounding areas outside the bowel. Texture features are obtained from an isotropic volume of each mini-segment, disregarding density values below −100 HU and above 250 HU. Extracted features include Gray Level Co-occurrence Matrix (27) (GLCM), Neighboring Gray Tone Difference Matrix (28) (NGTDM), Gray Level Run Length Matrix (29) (GLRLM), and Gray Level Size Zone Matrix (30) (GLSZM) features for each mini-segment. The set of relative predictors assesses changes in bowel wall, wall thickness, and lumen within the study, comparing them to both diseased and normal mini-segments. This involves categorizing the mini-segments into binary groups of diseased (≥ mild) and normal, followed by summarizing their measures.
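As a concrete example of one texture family, a GLCM statistic can be computed by quantizing the −100 to 250 HU window into gray levels, counting co-occurrences at a fixed offset, and summarizing the normalized matrix. This sketch computes only the contrast statistic for a single horizontal offset, whereas radiomics toolkits aggregate many statistics over many offsets and all three dimensions:

```python
import numpy as np

def glcm_contrast(img, levels=8, lo=-100.0, hi=250.0):
    """Contrast statistic of a gray-level co-occurrence matrix (GLCM)
    for the horizontal (0, 1) offset, after quantizing HU values in
    [-100, 250] into `levels` bins; voxels outside the window are ignored.
    """
    q = np.floor((np.clip(img, lo, hi) - lo) / (hi - lo) * (levels - 1e-9))
    q = q.astype(int)
    valid = (img >= lo) & (img <= hi)          # drop out-of-window voxels
    glcm = np.zeros((levels, levels))
    a, b = q[:, :-1], q[:, 1:]                 # horizontal neighbor pairs
    m = valid[:, :-1] & valid[:, 1:]
    np.add.at(glcm, (a[m], b[m]), 1.0)         # accumulate co-occurrences
    glcm /= max(glcm.sum(), 1.0)               # normalize to probabilities
    i, j = np.indices(glcm.shape)
    return float(np.sum(glcm * (i - j) ** 2))  # contrast = sum p(i,j)(i-j)^2
```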

Random Forest Model

The RF model uses designed features inspired by established markers of disease severity, requiring prior knowledge of the bowel wall and lumen surroundings. The adjudicated ground truth acts as the model’s response. The model selection process involved assessing variations of decision trees, bagged classifiers, discriminant analysis classifiers, and SVM, using the average model accuracy in a 5-fold cross-validation of the training set. The bagged RF trees model, tuned for hyperparameters including the number of trees (499), learning rate (λ = 1), regularization, and misclassification cost, was selected as the best performing model. Its final performance was evaluated on the test set.

Deep Learning Model

Exploring non-traditional features can yield valuable insights into CD, free from the bias of clinical knowledge. We trained a deep learning model on raw mini-segment volumes and their corresponding disease severity scores, using convolutional neural networks (CNN) to generate abstract predictors without the need for bowel wall or lumen perimeter segmentation. A custom 3-D ResNet-50 (31)-inspired architecture with 179 layers was trained on mini-segment volume patches resized to 128 (height) × 128 (width) × 64 (depth) × 1 (channel). The disease severity labels were used as the model response, and the input was a single-channel, grayscale image volume of each mini-segment. Various models were evaluated for model selection based on validation loss. Hyperparameter tuning involved varying learning rates, L2 regularization, and custom loss functions to optimize the final model, which was trained using an ADAM optimizer, cross-entropy loss, a learning rate of 10⁻⁶ for 25 epochs, and a mini-batch size of 6. The training was performed on MATLAB’s Deep Learning framework, utilizing two parallel TITAN-X GPUs with 24 GB RAM and CUDA 11.1. The performance of the trained model was evaluated on the test set.

Hybrid Model

Classifier confidence in disease severity category decisions is measured by the posterior probability scores associated with each prediction. The RF model averages the probabilities of all ensemble trees, while the CNN model’s final activation layer provides a Softmax-normalized score (ranging from 0 to 1) indicating decision confidence. Our hybrid model combines the strengths of both by calculating the mean probability score. Agreement between the models validates the predicted severity score for the mini-segment; in disagreements, the probability scores of the mini-segment and its nearest neighbors are averaged to derive a new mean probability score (Equation 1). The disease severity category with the highest score is selected as the final decision. Unlike the RF and CNN models, which treat mini-segments as independent data points, the hybrid model views them as sequences to minimize abrupt grading fluctuations, resulting in a smoother transition between adjacent mini-segment severity scores.

Score_c = ½ (CNNScore_c + EnsScore_c),  c ∈ {0, 1, 2, 3}    (Equation 1)
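The hybrid decision rule can be sketched as follows, assuming a one-neighbor-on-each-side window for the disagreement case (the paper does not state the exact neighborhood size):

```python
import numpy as np

def hybrid_predict(cnn_probs, rf_probs):
    """Hybrid severity decision per mini-segment.

    cnn_probs, rf_probs: (N, 4) class-probability arrays over severity 0-3,
    ordered along the bowel. Where the two models agree, the argmax of the
    mean score (Equation 1) is kept; where they disagree, the mean scores of
    the mini-segment and its immediate neighbors are averaged before taking
    the argmax (the one-neighbor window is an assumption of this sketch).
    """
    mean = 0.5 * (cnn_probs + rf_probs)          # Equation 1, per category
    pred = mean.argmax(axis=1)
    disagree = cnn_probs.argmax(axis=1) != rf_probs.argmax(axis=1)
    n = len(mean)
    for i in np.where(disagree)[0]:
        lo, hi = max(0, i - 1), min(n, i + 2)    # neighbor window
        pred[i] = mean[lo:hi].mean(axis=0).argmax()
    return pred
```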

Inter-Observer Reliability

The optimal Bayes error rate for classifying disease severity categories within small bowel mini-segments is not reported. However, the inter-observer agreement among radiologists serves as a surrogate for the expected model classification error. Both radiologists independently reviewed the test set mini-segments, and inter-observer agreement was evaluated using the quadratic weighted Cohen’s κ statistic (32, 33) within the disease severity grades. The MATLAB Statistics and Machine Learning Toolbox was used to conduct the statistical analysis.
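For reference, the quadratic weighted Cohen's κ penalizes disagreements by the squared distance between grades, so a normal-vs-mild disagreement costs far less than a normal-vs-severe one. A minimal sketch:

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_classes=4):
    """Quadratic weighted Cohen's kappa between two raters' integer
    grades in 0..n_classes-1. Returns 1.0 for perfect agreement, 0.0
    for chance-level agreement.
    """
    a, b = np.asarray(a), np.asarray(b)
    obs = np.zeros((n_classes, n_classes))
    np.add.at(obs, (a, b), 1.0)                   # observed confusion matrix
    obs /= obs.sum()
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0))  # chance expectation
    i, j = np.indices((n_classes, n_classes))
    w = (i - j) ** 2 / (n_classes - 1) ** 2       # quadratic disagreement weights
    return float(1.0 - (w * obs).sum() / (w * exp).sum())
```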

Full Segment Disease Severity Assessment

Assessing disease severity in a granular manner enables not only spatial localization but also measurement of the length of the affected areas. By combining the mini-segment assessments across a scan, we can summarize the distribution of predicted disease severity for the entire segment. Regions with moderate to severe (≥ moderate) disease grades often require immediate medical attention, so we also examine the full segment performance of a binary disease severity model that merges the normal to mild (≤ mild) and moderate to severe (≥ moderate) groups. Understanding the distribution, location, and length of these regions informs clinical decision-making, and the full segment assessment visually presents this patient-specific insight.
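The full-segment summary reduces to simple arithmetic over the ordered mini-segment grades: each grade contributes 7.5 mm of bowel length, and the first index with grade ≥ 2 locates the onset of moderate-to-severe disease. A sketch (the return structure is illustrative, not the paper's implementation):

```python
SEG_MM = 7.5  # each mini-segment spans 7.5 mm along the bowel

def full_segment_summary(grades):
    """Summarize an ordered list of mini-segment grades (0-3).

    Returns per-category disease length in cm, the total length of
    moderate-to-severe (>= 2) disease, and the distance in cm from the
    ileo-colonic origin to its first occurrence (None if absent).
    """
    lengths_cm = {g: grades.count(g) * SEG_MM / 10.0 for g in range(4)}
    mod_sev_cm = lengths_cm[2] + lengths_cm[3]
    first = next((i for i, g in enumerate(grades) if g >= 2), None)
    first_cm = None if first is None else first * SEG_MM / 10.0
    return lengths_cm, mod_sev_cm, first_cm
```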

RESULTS

Agreement on Disease Severity Grading

Two expert radiologists independently reviewed 42 studies, evaluating a total of 1448 mini-segments. The inter-observer agreement yielded a Cohen’s κ of 0.87 (95% CI: 0.83–0.91, σ = 0.027), indicating a “strong” level of agreement (Table 3 in (33)). The overall accuracy between expert agreements was 76.31%, with the highest agreement observed for the normal (0) category and the least for the moderate (2) category. Specifically, for the normal category, Radiologist 1 agreed with 81.8% of the 905 mini-segments graded as normal by Radiologist 2, while Radiologist 2 agreed with 97.8% of the 757 mini-segments graded as normal by Radiologist 1. For the moderate category, Radiologist 1 agreed with 33.8% of the 98 mini-segments graded as moderate by Radiologist 2, and Radiologist 2 agreed with 40.3% of the 119 mini-segments graded as moderate by Radiologist 1.

Semantic Segmentation Model Performance

The DeepLabV3+ model’s performance was evaluated using a test set of 10 CTE scans (5% of the data) containing 3535 axial images. These images were independently reviewed by experts, who provided the ground truth for the previously segmented bowel wall masks. The model’s efficacy was assessed across three classes (Background, Bowel Wall, and Inner Wall) using several metrics, including global accuracy (0.97), mean accuracy (0.84), mean Intersection over Union (IoU) (0.74), weighted IoU (0.95), and Boundary F1 (BF) score (0.61). Figure 2(d) shows an example where the segmentation is superimposed on the cross-sectional view of the reformatted perspective of the bowel wall.

Figure 2.

Model results: A 40-year-old female with CD underwent analysis using a disease severity classifier on her bowel segment, divided into 37 mini-segments measuring 7.5 mm each, totaling 27.75 cm from the Ileo-cecal valve (origin). The first incident of moderate to severe disease is visible on a 5 cm patch from 2.5 cm through 7.5 cm, distal from the origin. Predictions from RF, Hybrid, and CNN models, along with ground truth, are shown in (a) over a cross-sectional view of the reformatted bowel segment. The 3-D rendering of the bowel wall and lumen, with ground truth and hybrid model predictions, are presented in (b) and (c). Bowel wall and lumen segmentation on a cross-sectional slice of the reformatted volume is shown in (d). Figure (e) illustrates the cross-sectional area of the bowel wall and lumen volume along the segment’s length. Narrowing of lumen volume (visible as the inner tube in the rendering) correlates with disease severity.

Evaluating Model Performance on Severity Grading

The performance of the three models (RF, CNN, and hybrid) was evaluated on the test set using sensitivity, precision, area under the curve (AUC), and accuracy for each severity category. The overall accuracy of the models on the test set was 66.75%, 65.09%, and 70.7% for the RF, CNN, and hybrid models, respectively. The Cohen’s κ score with 95% confidence interval (CI) between model prediction and radiologist truth was 0.78 (CI = 0.74–0.83) for the RF model, 0.76 (CI = 0.71–0.81) for the CNN model, and 0.83 (CI = 0.80–0.87) for the hybrid model. The hybrid model showed a notable improvement in sensitivity for the mild category compared to the individual models. Table 2 presents the precision, sensitivity, AUC, accuracy, and Cohen’s κ with 95% CI for the four severity categories of the three models. Figure 3 displays the ROC curves of the hybrid model with 95% CI for all categories. The RF model showed higher sensitivity than the CNN model in classifying the normal (0) and moderate (2) categories, while the CNN model had considerably higher sensitivity than the RF model for the severe (3) category. Both models had high sensitivity for detecting normal and low sensitivity for mild severity. Figure 2(a),(c) displays the per-mini-segment outcomes of the model, as applied to a test study.

TABLE 2.

Disease Severity, Category-wise Performance Metrics

Model | Severity | Precision | Sensitivity | F1 Score | AUC | Accuracy
Hybrid | 0 | 78.8% | 84.10% | 0.814 | 89.69% | 70.7%
Hybrid | 1 | 52.3% | 42.40% | 0.468 | 74.84% |
Hybrid | 2 | 65.60% | 64.90% | 0.652 | 89.49% |
Hybrid | 3 | 73.60% | 84.00% | 0.785 | 97.16% |
RF | 0 | 75.60% | 85.70% | 0.803 | 86.62% | 66.75%
RF | 1 | 50.80% | 26.70% | 0.350 | 69.30% |
RF | 2 | 50.10% | 71.00% | 0.587 | 87.73% |
RF | 3 | 70.30% | 53.20% | 0.606 | 92.12% |
CNN | 0 | 75.20% | 80.80% | 0.779 | 84.85% | 65.09%
CNN | 1 | 44.80% | 31.40% | 0.369 | 69.12% |
CNN | 2 | 52.70% | 52.30% | 0.525 | 78.59% |
CNN | 3 | 61.40% | 82.70% | 0.705 | 93.58% |

Figure 3.

Model ROC performance: The ROC curve for the hybrid classifier model displays the AUC values for the normal (0), mild (1), moderate (2), and severe (3) disease categories. The RF and CNN models exhibit the lowest sensitivity (26.7% and 31.4%), precision (50.8% and 44.8%), and AUC (69.3% and 69.1%) in classifying mild (1) disease. The hybrid model, which combines both models, demonstrates a marked improvement in sensitivity, precision, and AUC (42.4%, 52.3%, and 74.8%) for the same category.

Interpreting Mini-segment Prediction Results

Explainable machine learning allows us to interpret the model’s decision-making process and identify potential biases. Most RF model predictors are explainable since they relate to clinically accepted disease markers, and variable importance plots gauge the relative significance of each predictor; Figure 5(a) shows the predictor importance plot of the top 25 variables. The deep learning model is not biased toward clinically established measures and identifies abstract features within the images. We interpret the CNN model’s predictions by generating visual explanations with Grad-CAM (34), a technique that produces an activation map by visualizing gradients from the final CNN reduction layer (Fig 5(b)).

Figure 5.

Interpreting model results: The predictor importance plot (a) displays the top 25 variables of significance in the RF model. The leading five predictors are a blend of the four feature groups (Section 2.6). Both ‘relWallThkNorPerc’ (relative normal wall thickness %) and ‘relLumAbnHU’ (relative abnormal lumen density) are relative assessments of normal wall thickness and abnormal lumen density with respect to each other. ‘largeZoneLGE’ signifies a GLSZM texture feature, while wall thickness and lumen % denote correlated geometric measures, representing the proportion of wall thickness and lumen within the bowel wall volume. In the Grad-CAM heat map (b), the activation heat map is superimposed on the mini-segment volumes influencing the model’s severity prediction. This is a 2-D projection of the highest gradient values across the mini-segment’s depth, providing insight into the regions influencing the model’s decision for that particular disease severity level.

Full Segment Disease Summary Statistics

The model’s ability to quantify ileal disease comprehensively was assessed by comparing mini-segment summary statistics per CTE study with radiologist assessments, illustrated in Figure 4. The mean absolute error (MAE) for normal, mild, moderate, and severe disease length predictions was 4.94 cm, 4.64 cm, 3.03 cm, and 1.55 cm, respectively. The overall analysis of the test set showed 70.7% accuracy across all severity categories and 91.51% accuracy in detecting moderate-to-severe disease. The mean length of moderate-to-severe disease segments in the test set was 16.08 cm, while the predicted mean length was 15.24 cm. The correlation between reported and predicted disease lengths for the entire ileum per CTE was strong (r = 0.83, p < 0.001) in the test set. Misclassification in disease severity categories decreased with increasing severity. Figure 6 visualizes the full segment analysis of each test study, using actual and predicted disease assessments, severity distribution, and the location of the initial occurrence of moderate to severe disease.

Figure 4.

Measured vs. predicted disease lengths: The notched box plot depicts the difference between reported and predicted disease lengths for each severity category across all studies in the test set. The mean and standard deviation (μ ± σ) of the differences for the normal, mild, moderate, and severe categories are −0.82 ± 5.04 cm, 1.16 ± 5.19 cm, 0.054 ± 3.54 cm, and −0.40 ± 2.44 cm, respectively. The plot indicates that accuracy tends to increase with disease severity. The test set comprises a total reported length of 552.7 cm of normal bowel, 261.7 cm of mild disease, 191.2 cm of moderate disease, and 110.2 cm of severe disease.

Figure 6.

Full segment summary analysis: Illustrating the full segment summary for the 42 test studies, consisting of a total of 1448 mini-segments. In (a), stem plots demonstrate the hybrid model’s full segment accuracy across all severity categories, as well as the accuracy for moderate and higher (≥ moderate) grades per study. The x-axis represents the length of the bowel segment in that particular study. In (b), the predicted and measured initial occurrence of moderate to severe disease is visualized along the bowel, spanning from the origin at the ileo-colonic junction to 25 cm from it. (c) presents the distribution of disease severity reported by expert-adjudicated ground truth for each study; the x-axis represents the measured length of moderate to severe (≥ moderate) disease. (d) shows the predicted disease severity distribution and the predicted disease length (moderate and higher) for the same studies; the x-axis represents the predicted length of moderate to severe (≥ moderate) disease.

DISCUSSION

In this study, we highlight the benefits of a standardized four-level, small bowel CD severity scoring system for CTE. We address the challenges associated with manual cross-sectional measurements by introducing a novel hybrid AI model that provides localized CD severity scoring, replicating expert-level judgment and quantifying disease burden on CTE scans. The proposed model combines clinically inspired features with the abstract predictors of a deep learning model, presenting an automated solution for CD severity predictive analysis. The severity assessments provide detailed descriptions of intestinal damage along with clinically relevant measures that are challenging to collect manually.

Argument for Standardized Severity Scoring

Conventional considerations for standardizing CD severity grading incorporate imaging features based not only on their observed association with disease biology and outcomes, but also on their ability to be reliably assessed or measured. Our proposed method aims to reduce inter- and intra-observer errors in disease assessment. The AI and automated post-processing methods discussed here can localize regions of disease and create reproducible assessments that quantify their qualitative characteristics. Figure 2 illustrates the predictive analysis available using the model, with mini-segments delineated by white dotted boundaries in (a). CD severity and localization can be tracked by anatomical indexing from the terminal ileum, with derivable measures from the suite of designed features available for any location along the segment’s length. Comparing the independent expert reviews with the consensus reveals the subjectivity among experienced radiologists, particularly in the normal and mild categories of decision-making; standardized severity assessment rules can improve this. Our model’s results suggest the potential of training reproducible models to replicate expert judgment.

Spatially Classifying Disease Severity

The CNN model exhibited surprisingly good performance compared to the RF model, despite lacking spatial knowledge of the bowel wall and lumen perimeters. The hybrid model indicated that CNN predictions contributed valuable information for replicating radiologist CD severity judgment. The top-performing RF model predictors include traditionally established imaging features of bowel structural damage, such as wall thickening and lumen narrowing. Figure 7 illustrates the interaction between two of these traditional features and disease severity scores, providing clinicians with a familiar perspective from which to validate the model’s results. Figure 2(e) shows an example where clinically relevant features, including bowel wall, wall thickness, and lumen area, can be extracted from various regions of interest along the segment’s length. Leveraging abstract CNN features and higher-order RF features without prior clinical assumptions opens avenues for discovering insights not yet recognized by medical experts. For instance, a neural network-based radiomics predictor outperformed radiologists in classifying intestinal fibrosis using pathology specimens (35). Although abstract features show promising performance, sometimes surpassing human judgment, it remains crucial to interpret and explain the model’s decision-making process. Our study emphasizes the potential of using a reformatted view to anatomically index and localize disease while quantitatively assessing severity with explainable models.

Figure 7.

Clinically established CD markers: The box plot illustrates the relationship between predicted CD severity and two clinically established markers for detecting CD: bowel wall thickness and lumen diameter. Increased severity corresponds to higher average equivalent bowel wall thickness radius (4.75 mm normal, 5.37 mm mild, 6.2 mm moderate, and 6.5 mm severe) and reduced maximum equivalent lumen wall diameter (11.31 mm normal, 7.16 mm mild, 5.05 mm moderate, and 3.03 mm severe) based on the hybrid model’s predictions on the test set.

Summarizing Mini-Segment Assessments

The summarized spatial disease distribution in the small bowel introduces a novel descriptor for CD. The assessments across a study offer an overview of diseased regions, severity distribution, and distal location relative to an established anatomical landmark such as the ileo-colonic junction. A moderate or severe CD assessment often requires medical intervention and possibly resection surgery, and prior knowledge of the diseased region, its severity, and its distal location gives clinicians a better understanding of the situation. The advantages of this information, analyzed on a per-subject basis, are visualized in Figure 6(a), (b), and Figure 4, with the actual and predicted lengths, the location of their initial occurrence, and the differences in these measurements for the ≥ moderate categories across each test study. Apart from the MRE global scoring system (MEGS) (36), which includes the total length of any disease, no current CTE scoring system accounts for the spatial distribution of disease. Our full segment analysis helps visualize spatial variation that may aid clinicians in understanding differences between seemingly similar patients, providing relevant information for early detection and treatment.

Limitations

Several limitations should be considered. The four-level CD mini-segment severity grading relies on expert radiologist judgment, which may be perceived as qualitative and less reproducible than fully quantitative scoring. However, most current CTE scoring indices also require qualitative assessments, and our experts' judgments demonstrated good reproducibility; most of the model's misclassifications differed from the reference by only one severity grade. Future models could address inter-observer bias by incorporating consensus from additional expert radiologists using federated data. The analysis was restricted to patients with CD without major structural complications such as fistulas or abscesses; future work will focus on developing models to identify and account for such complications. The current scope is limited to the ileum, and further development is needed to address the computational complexities of inferring disease in the colon. The lack of adequate histopathology and clinical outcomes data prevented validation of the proposed scoring methods against those standards. A larger dataset could improve these models and unlock the full potential of deep learning for severity assessment. Finally, although the techniques provide automated analysis, they still require manual placement of center points by specialists and computationally intensive hardware for scan interpretation, so implementation in clinical settings remains a challenge. While this is an exciting development in CD severity assessment, ongoing work is needed to refine these measures and evaluate their clinical utility.

CONCLUSION

The application of computer vision and AI methods to CTE imaging has the potential to revolutionize the way we assess CD. By automating the assessment process, trained models can provide diagnostic reports on regions with disease distribution and associated measures, which can aid clinicians and radiologists in patient management. Automated solutions can overcome the challenges of collecting exact, reproducible, and reliable measures of CD. The proposed methods not only replicate conventional complex expert-level CD severity assessments but also provide new perspectives on quantifying CD through the addition of spatial information. As automated image analysis continues to advance, our expectations of AI will shift from replicating expert measurements to advancing the quality of CD descriptions.

Supplementary Material

Supplement

ACKNOWLEDGMENTS

The research presented in this paper was supported by the National Institutes of Health.

FUNDING SUPPORT

National Institutes of Health, NIDDK R01DK124779.

Footnotes

DECLARATION OF COMPETING INTEREST

B. Enchakalody, A. Wasnik, S. Wang, M. Al-Hawary, and R. Stidham disclose a conflict of interest. They are the inventors of patent 16/427,769, titled “Automated assessment of bowel damage in intestinal diseases,” and the methods from this patent are utilized in this study. All other authors have no competing interests.

DATA SHARING STATEMENT

Data generated or analyzed during the study are available from the corresponding author on request following institutional approval.

Conflict of Interest

RWS has served as a consultant or on advisory boards for AbbVie, Janssen, Takeda, Gilead, Eli Lilly, Exact Sciences, Evergreen Pharmaceuticals, and CorEvitas. RWS, BE, SCW, MAH, and APW hold intellectual property on cross-sectional imaging technologies licensed by the University of Michigan to AMI, Inc. Unrelated: APW receives book royalties from Elsevier Inc. and a research grant from Sequana Medical.

REFERENCES

  • 1. Kilcoyne A, Kaplan JL, Gee MS. Inflammatory bowel disease imaging: current practice and future directions. World J Gastroenterol 2016; 22(3):917–932. doi:10.3748/wjg.v22.i3.917
  • 2. Ng SC, Shi HY, Hamidi N, et al. Worldwide incidence and prevalence of inflammatory bowel disease in the 21st century: a systematic review of population-based studies. Lancet 2017; 390(10114):2769–2778. doi:10.1016/S0140-6736(17)32448-0
  • 3. Burisch J, Munkholm P. Inflammatory bowel disease epidemiology. Curr Opin Gastroenterol 2013; 29(4).
  • 4. Rimola J, Ordás I, Rodriguez S, et al. Magnetic resonance imaging for evaluation of Crohn's disease: validation of parameters of severity and quantitative index of activity. Inflamm Bowel Dis 2010; 17(8):1759–1768. doi:10.1002/ibd.21551
  • 5. Roseira J, Ventosa AR, de Sousa HT, Brito J. The new simplified MARIA score applies beyond clinical trials: a suitable clinical practice tool for Crohn's disease that parallels a simple endoscopic index and fecal calprotectin. United European Gastroenterol J 2020; 8(10):1208–1216. doi:10.1177/2050640620943089
  • 6. Pariente B, Torres J, Burisch J, et al. OP05 Validation of the Lémann index in Crohn's disease. J Crohns Colitis 2020; 14(Suppl 1):S005–S006. doi:10.1093/ecco-jcc/jjz203.004
  • 7. D'Amico F, Chateau T, Laurent V, Danese S, Peyrin-Biroulet L. Which MRI score and technique should be used for assessing Crohn's disease activity? J Clin Med 2020; 9(6).
  • 8. Al-Hawary MM, Kaza RK, Platt JF. CT enterography: concepts and advances in Crohn's disease imaging. Radiol Clin North Am 2013; 51(1):1–16.
  • 9. Bruining DH, Zimmermann EM, Loftus EV, et al. Consensus recommendations for evaluation, interpretation, and utilization of computed tomography and magnetic resonance enterography in patients with small bowel Crohn's disease. Gastroenterology 2018; 154(4):1172–1194. doi:10.1053/j.gastro.2017.11.274
  • 10. Rimola J, Alvarez-Cofino A, Perez-Jeldres T, et al. Comparison of three magnetic resonance enterography indices for grading activity in Crohn's disease. J Gastroenterol 2016; 52(5):585–593.
  • 11. D'Amico F, Chateau T, Laurent V, Danese S, Peyrin-Biroulet L. Which MRI score and technique should be used for assessing Crohn's disease activity? J Clin Med 2020; 9(6). doi:10.3390/jcm9061691
  • 12. Stidham RW, Enchakalody B, Waljee AK, et al. Assessing small bowel stricturing and morphology in Crohn's disease using semi-automated image analysis. Inflamm Bowel Dis 2019. doi:10.1093/ibd/izz196
  • 13. Lamash Y, Kurugol S, Freiman M, et al. Curved planar reformatting and convolutional neural network-based segmentation of the small bowel for visualization and quantitative assessment of pediatric Crohn's disease from MRI. J Magn Reson Imaging 2018; 49(6):1565–1576.
  • 14. Puylaert CA, Schuffler PJ, Naziroglu RE, et al. Semiautomatic assessment of the terminal ileum and colon in patients with Crohn disease using MRI (the VIGOR++ project). Acad Radiol 2018; 25(8):1038–1045. doi:10.1016/j.acra.2017.12.024
  • 15. Lamash Y, Kurugol S, Warfield SK. Semi-automated extraction of Crohn's disease MR imaging markers using a 3D residual CNN with distance prior. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Cham: Springer International Publishing; 2018. p. 218–226.
  • 16. Mahapatra D, Schueffler P, Tielbeek JAW, Buhmann JM, Vos FM. A supervised learning approach for Crohn's disease detection using higher-order image statistics and a novel shape asymmetry measure. J Digit Imaging 2013; 26(5):920–931. doi:10.1007/s10278-013-9576-9
  • 17. Holland R, Patel U, Lung P, Chotzoglou E, Kainz B. Automatic detection of bowel disease with residual networks. In: Rekik I, Adeli E, Park SH, eds. Predictive Intelligence in Medicine. Cham: Springer International Publishing; 2019. p. 151–159.
  • 18. Liu RX, Li H, Towbin AJ, et al. Machine learning diagnosis of small bowel Crohn disease using T2-weighted MRI radiomic and clinical data. Am J Roentgenol 2024; 222:e2329812. doi:10.2214/AJR.23.29812
  • 19. Stidham RW, Liu W, Bishu S, et al. Performance of a deep learning model vs human reviewers in grading endoscopic disease severity of patients with ulcerative colitis. JAMA Netw Open 2019; 2(5):e193963. doi:10.1001/jamanetworkopen.2019.3963
  • 20. Enchakalody BE, Henderson B, Wang SC, et al. Machine learning methods to predict presence of intestine damage in patients with Crohn's disease. In: Medical Imaging 2020: Computer-Aided Diagnosis. SPIE; 2020. p. 742–753. doi:10.1117/12.2549326
  • 21. Kanitsar A, Fleischmann D, Wegenkittl R, Felkel P, Gröller ME. CPR: curved planar reformation. In: Proceedings of the Conference on Visualization '02 (VIS '02). USA: IEEE Computer Society; 2002. p. 37–44.
  • 22. Chen L-C, Zhu Y, Papandreou G, Schroff F, Adam H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Computer Vision – ECCV 2018: 15th European Conference, Munich, Germany, September 8–14, 2018, Proceedings, Part VII. Berlin, Heidelberg: Springer-Verlag; 2018. p. 833–851. doi:10.1007/978-3-030-01234-2_49
  • 23. Chollet F. Xception: deep learning with depthwise separable convolutions. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Los Alamitos, CA, USA: IEEE Computer Society; 2017. p. 1800–1807. doi:10.1109/CVPR.2017.195
  • 24. Achanta R, Shaji A, Smith K, Lucchi A, Fua P, Süsstrunk S. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Trans Pattern Anal Mach Intell 2012; 34(11):2274–2282. doi:10.1109/TPAMI.2012.120
  • 25. Kingma DP, Ba J. Adam: a method for stochastic optimization. In: Bengio Y, LeCun Y, eds. 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, Conference Track Proceedings; 2015. http://arxiv.org/abs/1412.6980
  • 26. Lin T-Y, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. IEEE Trans Pattern Anal Mach Intell 2020; 42(2):318–327. doi:10.1109/TPAMI.2018.2858826
  • 27. Haralick RM, Shanmugam K, Dinstein I. Textural features for image classification. IEEE Trans Syst Man Cybern 1973; SMC-3(6):610–621. doi:10.1109/TSMC.1973.4309314
  • 28. Amadasun M, King R. Textural features corresponding to textural properties. IEEE Trans Syst Man Cybern 1989; 19(5):1264–1274. doi:10.1109/21.44046
  • 29. Galloway MM. Texture analysis using gray level run lengths. Comput Graph Image Process 1975; 4(2):172–179. doi:10.1016/S0146-664X(75)80008-6
  • 30. Thibault G, Fertil B, Navarro CL, et al. Texture indexes and gray level size zone matrix: application to cell nuclei classification. In: 10th International Conference on Pattern Recognition and Information Processing (PRIP 2009), Minsk, Belarus; 2009. p. 140–145. https://hal.science/hal-01499715
  • 31. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. 2015. doi:10.48550/ARXIV.1512.03385
  • 32. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960; 20(1):37–46.
  • 33. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb) 2012; 22(3):276–282.
  • 34. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis 2019; 128(2):336–359. doi:10.1007/s11263-019-01228-7
  • 35. Meng J, Luo Z, Chen Z, et al. Intestinal fibrosis classification in patients with Crohn's disease using CT enterography-based deep learning: comparisons with radiomics and radiologists. Eur Radiol 2022. doi:10.1007/s00330-022-08842-z
  • 36. Makanyanga JC, Pendse D, Dikaios N, et al. Evaluation of Crohn's disease activity: initial validation of a magnetic resonance enterography global score (MEGS) against faecal calprotectin. Eur Radiol 2013; 24(2):277–287.
