Author manuscript; available in PMC: 2021 Jan 1.
Published in final edited form as: Aliment Pharmacol Ther. 2020 Nov 1;53(2):281–290. doi: 10.1111/apt.16136

Machine learning identifies novel blood protein predictors of penetrating and stricturing complications in newly diagnosed paediatric Crohn’s disease

Ryan C Ungaro 1, Liangyuan Hu 2, Jiayi Ji 2, Shikha Nayar 3, Subra Kugathasan 4, Lee A Denson 5, Jeffrey Hyams 6, Marla C Dubinsky 7, Bruce E Sands 1, Judy H Cho 3
PMCID: PMC7770008  NIHMSID: NIHMS1646225  PMID: 33131065

Summary

Background:

There is a need for improved risk stratification in Crohn’s disease.

Aim:

To identify novel blood protein biomarkers associated with future Crohn’s disease complications.

Methods:

We performed a case-cohort study utilising a paediatric inception cohort, the Risk Stratification and Identification of Immunogenetic and Microbial Markers of Rapid Disease Progression in Children with Crohn’s disease (RISK) study. All patients had inflammatory disease (B1) at baseline. Outcomes were development of stricturing (B2) or penetrating (B3) complications. We assayed 92 inflammation-related proteins in baseline plasma using a proximity extension assay (Olink Proteomics). An ensemble machine learning technique, random survival forests (RSF), selected variables predicting B2 and B3 complications. Selected analytes were compared to clinical variables and serology only models. We examined selected proteins in a single-cell sequencing cohort to analyse differential cell expression in blood and ileum.

Results:

We included 265 patients with mean age 11.6 years (standard deviation [SD] 3.2). Seventy-three and 34 patients, respectively, had B2 and B3 complications within mean 1123 (SD 477) days for B2 and 1251 (SD 442) days for B3. A model with 5 protein markers predicted B3 complications with an area under the curve (AUC) of 0.79 (95% confidence interval [CI] 0.76–0.82) compared to 0.69 (95% CI 0.66–0.72) for serologies and 0.74 (95% CI 0.71–0.77) for clinical variables. A model with 4 protein markers predicted B2 complications with an AUC of 0.68 (95% CI 0.65–0.71) compared to 0.62 (95% CI 0.59–0.65) for serologies and 0.52 (95% CI 0.50–0.55) for clinical variables. B2 analytes were highly expressed in ileal stromal cells while B3 analytes were prominent in peripheral blood and ileal T cells.

Conclusions:

We identified novel blood proteomic markers, distinct for B2 and B3, associated with progression of paediatric Crohn’s disease.

1 | INTRODUCTION

Crohn’s disease is one of the major types of inflammatory bowel disease (IBD) that causes chronic inflammation anywhere in the gastrointestinal tract from the mouth to the anus.1 Crohn’s disease is a progressive condition that can lead to complications including strictures, fistulas or intra-abdominal abscesses that can necessitate surgery. Crohn’s disease can result in irreversible bowel damage that can decrease quality of life and productivity.2,3 At least 50% of Crohn’s disease patients will have a surgery within 10 years of diagnosis.4,5 Increasing evidence suggests that if inflammation in Crohn’s disease is controlled and the most effective immunosuppression is used early in the disease course, clinical outcomes are significantly improved, with higher rates of remission and decreased need for surgery.6,7 Despite this, the most effective medications are utilised at very low rates although many Crohn’s disease patients will progress to complications.8 If Crohn’s disease patients at risk for more aggressive disease can be identified earlier, this may lead to more effective treatment strategies, including closer monitoring with a quicker step-up approach or earlier intervention with biologic agents. Risk stratification information can also be utilised to improve shared decision-making between patients and physicians. Incorporation of risk stratification into decision making on treatment selection for Crohn’s disease patients has also been advocated by recent national guidelines.9 However, many of the proposed clinical risk factors for higher-risk Crohn’s disease lack precise and consistent definitions (deep ulcerations, extensive disease) or may be of lower value in recently diagnosed patients who have not yet experienced complications (prior Crohn’s disease-related surgery, presence of strictures or fistulae at diagnosis).

Prior prognostic biomarkers in Crohn’s disease have included serologies (antibodies against bacterial and fungal antigens in the blood) and certain genes.10 Some of the first biomarkers suggested to be indicative of an increased risk of complications were blood serologies, with increasing quartiles of expression associating with higher rate of complications.11 Incorporating clinical, serology, and genetic risk factors for a complicated disease course has had modest predictive capacity with few potential new prognostic markers identified.12 More recently, the Risk Stratification and Identification of Immunogenetic and Microbial Markers of Rapid Disease Progression in Children with Crohn’s disease (RISK) cohort identified a model that combined clinical, serology and ileal biopsy gene expression markers with an area under the curve (AUC) of 0.72.13 However, risk prediction using tissue based gene expression may be more difficult to implement in clinical practice than blood-based tests. A more recent study using RISK identified potential blood markers of stricturing disease, illustrating the potential of blood-based markers for prediction of complications.14

Identification of more easily assayed blood markers of complication risk in Crohn’s disease patients at the time of diagnosis is needed. Peripheral blood-based proteomics offers a new modality that may enhance our ability to prognosticate more accurately by providing a new way to characterise a patient’s inflammatory signature. Utilising a prospective inception cohort of paediatric Crohn’s disease patients, we aimed to assess the association of novel blood protein biomarkers with the future development of penetrating and stricturing complications.

2 | MATERIALS AND METHODS

We performed a case-cohort study utilising the RISK cohort. RISK is a prospective, observational inception cohort started in 2008 with Crohn’s disease patients from 28 gastroenterology centres in the United States and Canada.13 Patients with suspected IBD were enrolled between 2008 and 2012. Newly diagnosed children and adolescents age 18 years or younger with confirmed Crohn’s disease were followed longitudinally with data collected every 6 months. All patients had Crohn’s disease confirmed based on standard clinical, endoscopic and histologic criteria as described by Lennard-Jones et al.15 At enrolment, blood samples for plasma were taken from all patients. Clinical management, including decisions on diagnostic testing and medications, was at the discretion of the treating physician. The Montreal classification was used to define disease behaviour. An uncomplicated disease state (inflammation only) was referred to as B1. Stricturing disease (B2) was defined as persistent luminal narrowing with pre-stenotic dilatation as shown by small bowel contrast imaging. Internal penetrating disease (B3) was defined as intra-abdominal fistulising disease resulting in intra-abdominal or pelvic abscesses or fistulas to an adjacent organ (excluding the vagina or perianal region). All patients who later developed B2 or B3 complications were selected and controls (baseline B1 who remained B1 during follow-up) were then randomly selected across all participating sites.

We assayed candidate protein biomarkers using a proximity extension assay (Olink Proteomics). We ran baseline plasma samples using the Olink inflammation panel. This assay can be run on as little as one microlitre of sample utilising a pair of oligonucleotide-labelled antibodies that bind to the target protein and has potential to be scalable for clinical use. When the two probes are in close proximity, a new polymerase chain reaction (PCR) target sequence is formed by a proximity-dependent DNA polymerization event which is subsequently detected and quantified using standard real-time PCR. This technique limits cross-reactivity to maximise sensitivity and specificity. Normalised protein expression is log2-transformed. Intra-assay coefficient of variation (CV) ranges between 5% and 13% (mean 8%), and inter-assay CV ranges between 9% and 39% (mean 15%).16 We excluded proteins that were undetectable in 95% or more of samples. In addition, we analysed baseline clinical and serology data from the RISK cohort. Clinical variables of interest included those that were associated with disease progression in the original RISK study (age, African American race, and ileal disease location). In addition, all models included anti-tumour necrosis factor biologic (anti-TNF) use within 90 days as a covariate as this was previously associated with decreased risk of progression to B3.13 Baseline serology data included natural log (Ln) of anti-Saccharomyces cerevisiae (ASCA) IgG, ASCA IgA, anti-flagellin (Cbir1), anti-outer membrane protein C precursor (OmpC) and perinuclear anti-neutrophil cytoplasmic antibodies (pANCA).
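The detectability filter described above can be illustrated with a minimal sketch. It assumes that an undetected measurement is stored as `None`; the function name `filter_detectable`, the data layout and the toy values are hypothetical and are not taken from the study's analysis code.

```python
from typing import Dict, List, Optional


def filter_detectable(
    npx: Dict[str, List[Optional[float]]],
    max_undetected_fraction: float = 0.95,
) -> Dict[str, List[Optional[float]]]:
    """Keep proteins undetected in fewer than `max_undetected_fraction` of samples.

    Assumption of this sketch: `None` marks a sample in which the protein
    was below the limit of detection.
    """
    kept = {}
    for protein, values in npx.items():
        undetected = sum(v is None for v in values)
        # Exclude proteins undetectable in 95% or more of samples.
        if undetected / len(values) < max_undetected_fraction:
            kept[protein] = values
    return kept


# Toy data: 20 samples per protein; all values are invented.
toy = {
    "MMP10": [5.1] * 20,               # detected in every sample
    "IL33": [None] * 19 + [0.4],       # undetected in 95% of samples
    "CCL4": [None] * 10 + [3.2] * 10,  # undetected in 50% of samples
}
filtered = filter_detectable(toy)
print(sorted(filtered))
```

In this toy input only the protein that is undetected in 19 of 20 samples is dropped, which mirrors the "95% or more" rule stated in the text.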

Our primary outcome was the composite of any B2 or B3 complication. Secondary outcomes of interest included B2 and B3 complications individually. Patients who developed both B2 and B3 were considered in both specific complication analyses. Descriptive analyses (means, medians and proportions as appropriate) were performed for cohort characteristics. Pearson correlation was used to analyse associations between proteins and standard disease biomarkers including C-reactive protein (CRP), albumin and serologies.

For variable selection in prediction models we utilised random survival forest (RSF) machine learning methodology. The ensemble machine learning technique is highly flexible, fluidly handles a large number of predictors, and is able to detect interactions and non-linearity in the associations between analytes and outcomes of interest. Bootstrapping and random node splitting were used to grow an ensemble of binary trees to form the RSF model. The RSF model can provide its own internal estimate of predictive performance that correlates well with either cross-validation estimates or test set estimates. We built an RSF model by ensembling binary trees grown on bootstrapped samples to select the most important variables that are associated with time to event. We used 1000 trees to construct our models, with the square root of the number of predictors sampled at each split. When constructing a bootstrap sample in the ensemble, certain samples are left out. These samples are called out-of-bag (OOB), and they can be used to assess the predictive performance of that specific model since they were not used to build the model. The average of the OOB performance measures can then be used to evaluate the predictive performance of the entire ensemble. To further validate the RSF models and also provide the confidence intervals for the predictive performance (area under the curve, AUC), we implemented repeated cross-validation with 5 folds and 200 replications for each of our final models with 95% confidence intervals (CI).
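The resampling scheme in this paragraph can be sketched without any survival-analysis library. The snippet below only shows how a bootstrap sample yields out-of-bag indices and how 5-fold cross-validation repeated 200 times produces 1000 train/test splits. The helper names are hypothetical, rounding the square root of 85 predictors is an assumption, and no forest is actually fitted.

```python
import random
from typing import Iterator, List, Tuple


def bootstrap_with_oob(n: int, rng: random.Random) -> Tuple[List[int], List[int]]:
    """Draw n indices with replacement; indices never drawn are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    oob = [i for i in range(n) if i not in drawn]
    return in_bag, oob


def repeated_kfold(
    n: int, k: int, repeats: int, rng: random.Random
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for `repeats` reshuffled k-fold splits."""
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for held_out in range(k):
            test = folds[held_out]
            train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
            yield train, test


rng = random.Random(0)
n_patients = 265                   # cohort size reported in the paper
n_predictors = 85                  # assumption: proteins only, after filtering
mtry = round(n_predictors ** 0.5)  # predictors tried at each split (rounding assumed)

in_bag, oob = bootstrap_with_oob(n_patients, rng)
splits = list(repeated_kfold(n_patients, k=5, repeats=200, rng=rng))
print(mtry, len(splits), len(oob))
```

Each tree in a forest would be grown on one `in_bag` sample and scored on its `oob` indices; the repeated k-fold splits are what would supply the spread of AUC values behind a confidence interval.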

The variable importance score from the computed RSF model was used to assess how informative a variable is regarding time until event. We used an iterative algorithm based on variable importance scores to select variables for model building. At each iteration, the variable with the lowest importance score was removed, a model was rebuilt using the remaining variables, and the prediction error rate of the model was recorded. The process was repeated until all variables were removed. Then the model with the smallest prediction error was selected as the final model. In constructing the RSF model, OOB samples were left out and not used for model fitting. The RSF prediction error rate was computed from Harrell’s concordance index: the error rate equals 1 minus the concordance index, which is interpreted here as the area under the curve (AUC). We computed the average AUC on OOB samples across the ensemble to evaluate the accuracy of the final RSF models.
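Both ingredients of this selection procedure can be written down compactly. The sketch below gives a plain implementation of Harrell's concordance index and a backward-elimination loop driven by importance scores. The callback `fit_and_score` is hypothetical: it stands in for refitting a survival forest and returning its error rate and variable importances. The toy numbers are invented.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def harrell_c(times: Sequence[float], events: Sequence[int], risk: Sequence[float]) -> float:
    """Harrell's concordance index for right-censored data.

    A pair is comparable when the subject with the shorter time had an event.
    It is concordant when that subject also has the higher risk score.
    Tied risk scores count as half; pairs with tied times are skipped here.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        for j in range(len(times)):
            if times[i] < times[j] and events[i] == 1:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable


def backward_select(
    variables: Sequence[str],
    fit_and_score: Callable[[List[str]], Tuple[float, Dict[str, float]]],
) -> Tuple[List[str], float]:
    """Drop the least important variable each round; return the best subset."""
    current = list(variables)
    best_subset, best_error = list(current), float("inf")
    while current:
        error, importance = fit_and_score(current)
        if error < best_error:
            best_subset, best_error = list(current), error
        weakest = min(current, key=lambda v: importance[v])
        current.remove(weakest)
    return best_subset, best_error


# Toy survival data: times in days, 1 = event, 0 = censored.
c_index = harrell_c([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.4, 0.5, 0.1])
error_rate = 1 - c_index


def toy_fit(subset: List[str]) -> Tuple[float, Dict[str, float]]:
    """Invented stand-in for refitting a model on `subset`."""
    importance = {"A": 3.0, "B": 2.0, "noise": 0.1}
    if "noise" in subset:
        return 0.30, importance
    return (0.25 if len(subset) == 2 else 0.35), importance


selected, best = backward_select(["A", "B", "noise"], toy_fit)
print(round(c_index, 2), round(error_rate, 2), selected, best)
```

In the toy survival data, four of the five comparable pairs are concordant, so the concordance index is 0.8 and the error rate is 0.2. The elimination loop keeps the subset with the lowest recorded error even though it continues until no variables remain.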

To evaluate how each of the individual analytes selected by RSF was linked to time to event (B2, B3 or any complication), we constructed Cox regression models with selected variables for each outcome of interest. To better understand the predictive model’s performance, we constructed predictiveness curves to analyse variations in risk based on protein RSF model score and identify low and high-risk groups using quantiles of OOB survival probability at 3 and 5 years.17 Statistical analyses were performed using R 3.6.0 (R Foundation for Statistical Computing, 2019).
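Quantile-based risk grouping of the kind described here can be sketched as follows, assuming decile cut-offs (the Results define low and high risk as the bottom and top deciles of the protein model score) and assuming that a higher score means higher risk. The function name `risk_groups` and the toy scores are hypothetical.

```python
from typing import Dict


def risk_groups(scores: Dict[str, float], tail: float = 0.10) -> Dict[str, str]:
    """Label the lowest and highest `tail` fraction of scores as low/high risk."""
    ranked = sorted(scores, key=lambda pid: scores[pid])
    cut = max(1, int(len(ranked) * tail))
    labels = {pid: "intermediate" for pid in ranked}
    for pid in ranked[:cut]:
        labels[pid] = "low"    # bottom decile of the risk score
    for pid in ranked[-cut:]:
        labels[pid] = "high"   # top decile of the risk score
    return labels


# Toy scores for 20 hypothetical patients; values are invented.
toy_scores = {f"P{i:02d}": i / 20 for i in range(20)}
labels = risk_groups(toy_scores)
print([pid for pid, g in labels.items() if g == "low"],
      [pid for pid, g in labels.items() if g == "high"])
```

The observed complication rate within each labelled group at a fixed horizon (for example 5 years) is then what a predictiveness curve summarises.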

Last, to explore blood and tissue cellular expression of these peripheral blood protein analytes, we analysed data from a previously reported comprehensive immune cell single-cell RNA sequencing dataset of inflamed ileum and peripheral blood mononuclear cells (PBMCs) from 11 Crohn’s disease patients undergoing ileal resection (GEO accession: GSE134809).18 The detailed methodology for this cohort has been previously reported. Briefly, after isolation of PBMCs and ileal lamina propria cells, cells were suspended at 1 × 10^6/mL in phosphate buffered saline (PBS) and 10 000 cells were loaded onto the Chromium™ Controller instrument within 15 min of cell suspension preparation using GemCode Gel Bead and Chip, all from 10x Genomics (Pleasanton, CA), following the manufacturer’s recommendations. Cells were partitioned into Gel Beads in Emulsion in the Chromium™ Controller instrument where cell lysis and barcoded reverse transcription of RNA occurred. Libraries were prepared using 10x Genomics Library Kits and sequenced on an Illumina NextSeq500 according to the manufacturer’s recommendations. Using an annotated library of cell clusters from this dataset, we then examined PBMC and tissue expression of B2 and B3 peripheral blood analytes that were selected by the RSF model. The number of cells within each cell-type cluster expressing specific genes (unique molecular identifiers, UMI) was compared using the Wilcoxon rank-sum test.
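For readers unfamiliar with the Wilcoxon rank-sum test used for the expression comparison, a minimal sketch follows. It uses the normal approximation with midranks for ties and applies no tie or continuity correction, so it is an approximation rather than the exact procedure of any particular statistics package. The function name and the toy UMI counts are hypothetical.

```python
import math
from typing import Sequence, Tuple


def rank_sum_test(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Ties receive midranks; no tie or continuity correction is applied.
    Returns (z statistic, two-sided p value).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = midrank
        i = j
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(x), len(y)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Invented UMI counts for one gene in two hypothetical cell clusters.
stromal = [5, 7, 8, 9, 12]
other = [0, 1, 1, 2, 3]
z, p = rank_sum_test(stromal, other)
print(f"z={z:.2f}, p={p:.4f}")
```

A positive z here indicates that the first cluster tends to have higher counts than the second.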

3 | RESULTS

A total of 265 patients were included in our cohort. Ninety-eight were B1 at baseline and then later developed a B2 or B3 complication and 167 were controls who remained B1 and did not experience any B2 or B3 event during follow-up. Mean age was 11.6 years, the majority had ileocolonic disease, and 12.8% were African American (Table 1). The mean time to initiating anti-TNF therapy was 412 (SD 441) days. Mean time to B2 complication was 1123 (SD 477) days and 1251 (SD 442) for B3 complication.

TABLE 1.

Patient cohort characteristics (Total n = 265)

Characteristic
Baseline Age, years (mean, SD) 11.6 (3.2)
African American, n (%) 31 (12.8)
Baseline disease location, n (%)
 Ileal (L1) 44 (23.2)
 Colonic (L2) 39 (20.5)
 Ileocolonic (L3) 107 (56.3)
 Upper tract disease 137 (58.5)
Baseline Perianal disease, n (%) 96 (39.8)
Baseline smoker, n (%) 4 (1.7)
Baseline albumin, g/dL (mean, SD) 3.4 (0.7)
Developed any complication, n (%) 98 (37)
Type of complication*
 B2, n 73
 B3, n 34
 B2/B3, n 9
 Time to B2 in days (mean, SD) 1123 (477)
 Time to B3 in days (mean, SD) 1251 (442)

Note: * If a patient was classified as B2/B3, the patient was considered as having both complications for analyses.

SD, standard deviation; B2, stricturing complication; B3, penetrating complication.

Out of a total of 92 inflammation-related proteins assayed in baseline patient plasma (Table S1), seven were not detected in ≥ 95% of samples (IL2, IFNγ, IL2RB, TSLP, IL22RA, IL20 and IL33), leaving 85 proteins that were included for analysis. RSF modelling selected nine proteins (IL12B, CXCL9, IL7, CCL3, CD6, IL15RA, MMP10, CCL11, IL10) and three serologic markers (LnCbir, LnASCA IgA, LnOMPC) that were most predictive for any new complication. The protein-based model for any complication had a numerically higher AUC than the serology only model (AUC 0.66, 95% CI 0.62–0.69 vs AUC 0.64, 95% CI 0.61–0.67) and was statistically significantly better than a clinical variables only model (AUC 0.66, 95% CI 0.62–0.69 vs AUC 0.56, 95% CI 0.52–0.59; Table 2). A combined model with all protein, serology, and clinical variables had the numerically highest AUC of 0.69 (95% CI 0.66–0.72).

TABLE 2.

Performance of random forest machine learning selected models for any complication, B2, or B3

Variable type | AUC (95% CI) | AUC (95% CI) with early anti-TNF exposure (a) | Selected variables (b)
Any complication
 Proteins only (Olink) | 0.66 (0.63, 0.69) | 0.66 (0.62, 0.69) | IL12B, CXCL9, IL7, CCL3, CD6, IL15RA, MMP10, CCL11, IL10
 Serologies only | 0.64 (0.61, 0.67) | 0.64 (0.61, 0.67) | LnCbir, LnASCA IgA, LnOMPC
 Clinical variables only | 0.57 (0.54, 0.60) | 0.56 (0.52, 0.59) | Age, African American Race, Ileal Disease Location
 Combined model | 0.68 (0.65, 0.71) | 0.69 (0.66, 0.72) | All above variables
B2
 Proteins only (Olink) | 0.68 (0.65, 0.71) | 0.68 (0.65, 0.71) | IL7, MMP10, IL12B, CCL11
 Serologies only | 0.62 (0.59, 0.65) | 0.62 (0.59, 0.65) | LnASCA IgA, LnCbir
 Clinical variables only | 0.51 (0.48, 0.54) | 0.52 (0.50, 0.55) | Age, African American Race, Ileal Disease Location
 Combined model | 0.70 (0.67, 0.73) | 0.69 (0.66, 0.72) | All above variables
B3
 Proteins only (Olink) | 0.78 (0.75, 0.81) | 0.79 (0.76, 0.82) | TNFSF14, CCL4, IL15RA, TNFB, CD40
 Serologies only | 0.71 (0.68, 0.74) | 0.69 (0.66, 0.72) | LnASCA IgA, LnANCA, LnCbir
 Clinical variables only | 0.74 (0.71, 0.77) | 0.74 (0.71, 0.77) | Age, African American Race, Ileal Disease Location
 Combined model | 0.78 (0.75, 0.81) | 0.77 (0.75, 0.79) | All above variables

Abbreviations: AUC, area under the curve; B2, stricturing complication; B3, penetrating complication; CI, confidence interval; TNF, tumour necrosis factor.

a Early TNF exposure defined as within 90 days of diagnosis.

b Variables selected using variable importance (VIMP) selection criteria.

We next examined the performance of RSF selected models for B2 or B3 complications specifically. For B2 complications, four proteins (IL7, MMP10, IL12B, CCL11) and two serologic markers (LnASCA IgA, LnCbir) were selected as most predictive. A protein-based model had statistically significantly better performance for predicting B2 than the serology only model, as suggested by non-overlapping 95% confidence intervals for AUC (AUC 0.68, 95% CI 0.65–0.71 vs AUC 0.62, 95% CI 0.59–0.65). The protein-based model was significantly better than the clinical variables only model (AUC 0.68, 95% CI 0.65–0.71 vs AUC 0.52, 95% CI 0.50–0.55). A combined model had the numerically highest AUC of 0.69 (95% CI 0.66–0.72, Table 2). For B3 complications, five proteins (TNFSF14, CCL4, IL15RA, TNFB, CD40) and three serologic markers (LnASCA IgA, LnANCA, LnCbir) were selected as most predictive. The protein-based B3 model performed significantly better than the serologies only model (AUC 0.79, 95% CI 0.76–0.82 vs AUC 0.69, 95% CI 0.66–0.72). The protein-based model had a numerically but not significantly higher AUC compared to clinical variables only and combined variables models (Table 2). In addition, we fit a logistic regression with baseline perianal disease as the outcome and the five selected B3 proteins as predictors and observed that none were significantly associated with perianal disease (data not shown). Of note, none of the selected proteins were significantly correlated with CRP, albumin or any of the serologic markers (Table S1).

Protein predictors of complications that were selected by RSF modelling were then analysed with Cox regression modelling to understand the magnitude and directionality of individual variables (Table 3). The proteins that were individually significantly associated with developing any complication were CCL3 (HR 1.12, 95% CI 1.01–1.24) and MMP10 (HR 0.65, 95% CI 0.47–0.90). In comparison, RSF selected serologies significantly associated with any complication in Cox regression included LnCbir (HR 1.50, 95% CI 1.21–1.87) and LnASCA IgA (HR 1.31, 95% CI 1.08–1.58). For B2 complications, MMP10 was significantly associated with a decreased risk (HR 0.62, 95% CI 0.45–0.86). Serologies associated with B2 included LnCbir (HR 1.43, 95% CI 1.13–1.81) and LnASCA IgA (HR 1.40, 95% CI 1.14–1.71). For B3 complications, proteins significantly associated with increased risk were CD40 (HR 2.94, 95% CI 1.53–5.66) and CCL4 (HR 1.79, 95% CI 1.38–2.32) while TNFSF14 was associated with decreased risk (HR 0.38, 95% CI 0.25–0.57). LnCbir was the only serology significantly associated with development of B3 in Cox regression (HR 1.55, 95% CI 1.09–2.19). The impact of early anti-TNF use (within 90 days of diagnosis) had similar results to the original RISK study with decreased risk of B3 complications but no clear impact on development of B2 (Table 3).13
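The link between a reported hazard ratio, its confidence interval and its P value can be illustrated with a short back-calculation. The sketch assumes a standard Wald interval on the log scale, that is, HR = exp(β) with 95% CI exp(β ± 1.96 × SE). Because the published figures are rounded, the recovered values are approximate. The function name `wald_from_hr` is hypothetical.

```python
import math
from typing import Tuple


def wald_from_hr(hr: float, lo: float, hi: float) -> Tuple[float, float, float, float]:
    """Recover log-HR, its standard error, z and two-sided p from HR and 95% CI.

    Assumes the interval is a Wald interval: exp(beta +/- 1.96 * se).
    """
    beta = math.log(hr)
    se = (math.log(hi) - math.log(lo)) / (2 * 1.96)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return beta, se, z, p


# MMP10 in the B2 model: HR 0.62, 95% CI 0.45-0.86 as reported in the text.
beta, se, z, p = wald_from_hr(0.62, 0.45, 0.86)
print(f"beta={beta:.3f} se={se:.3f} z={z:.2f} p={p:.4f}")
```

For MMP10 in the B2 model the recovered P value lands close to the 0.004 listed in Table 3, which is a useful consistency check when reading rounded estimates.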

TABLE 3.

Hazard ratios of selected individual proteins for any complication, B2 only, or B3 only in multivariate Cox regression models

Protein | Hazard ratio (HR) | 95% CI | P value
Any complication
 CXCL9 | 1.04 | 0.82–1.31 | 0.75
 IL12B | 1.05 | 0.82–1.36 | 0.69
 CCL3 | 1.12 | 1.01–1.24 | 0.03
 MMP10 | 0.65 | 0.47–0.90 | 0.01
 IL7 | 1.06 | 0.73–1.55 | 0.76
 IL15RA | 1.23 | 0.37–4.08 | 0.74
 CCL11 | 0.86 | 0.59–1.27 | 0.45
 IL10 | 1.05 | 0.84–1.31 | 0.66
 CD6 | 0.85 | 0.63–1.15 | 0.29
 Early anti-TNF | 1.09 | 0.64–1.85 | 0.76
B2
 IL7 | 1.15 | 0.78–1.70 | 0.48
 MMP10 | 0.62 | 0.45–0.86 | 0.004
 IL12B | 1.10 | 0.87–1.39 | 0.43
 CCL11 | 0.86 | 0.57–1.30 | 0.47
 Early anti-TNF | 1.42 | 0.83–2.44 | 0.20
B3
 TNFSF14 | 0.38 | 0.25–0.57 | <0.001
 CCL4 | 1.79 | 1.38–2.32 | <0.001
 IL15RA | 0.05 | 0.01–1.37 | 0.08
 TNFB | 0.52 | 0.27–0.99 | 0.05
 CD40 | 2.94 | 1.53–5.66 | 0.001
 Early anti-TNF | 0.73 | 0.29–1.84 | 0.50

Abbreviations: B2, stricturing complication; B3, penetrating complication; CI, confidence interval; TNF, tumour necrosis factor.

Next, we constructed predictiveness curves to analyse variations in complication risk at 3 and 5 years from diagnosis based on protein models selected by RSF to visualise risk distribution based on differential protein model risk score cut-offs. Across all models the protein scores positively correlated with survival risk (Figure 1A–C). To further characterise patients, we considered low risk as those with a protein model score in the bottom decile and high risk as those in the top decile. Of patients with a high protein risk score, 64% developed a complication by 5 years, in marked contrast with 23% of those with a low-risk score (Figure 1A). Similar results were seen for the B2 protein model, with 64% of high-risk patients experiencing a stricture by year 5 compared with 15% of the low-risk group (Figure 1B). Patients with a high-risk protein model score had a 40% risk of developing B3 by 5 years, while very few patients in the lowest decile (5%) experienced B3 in the same time-frame (Figure 1C).

FIGURE 1.


A, Survival probability for any progression at 3 and 5 years by decile of protein risk score. Plots of out of bag survival probability by random survival forest by decile of selected protein risk score. Red lines mark 10th and 90th percentile risk scores. B, Survival probability for stricturing (B2) complication at 3 and 5 years by decile of protein risk score. Plots of out of bag survival probability by random survival forest by decile of selected protein risk score. Red lines mark 10th and 90th percentile risk scores. C, Survival probability for penetrating (B3) complication at 3 and 5 years by decile of protein risk score. Plots of out of bag survival probability by random survival forest by decile of selected protein risk score. Red lines mark 10th and 90th percentile risk scores.

Last, we sought to compare gene expression of these plasma proteins by cell type in ileal tissue and peripheral blood by analysing expression patterns in a comprehensive immune cell repertoire single-cell RNA sequencing library from Crohn’s disease patients. In inflamed ileal tissue, stromal cells and mononuclear phagocytes were primary sources of B2-associated analytes (IL7, MMP10, IL12B, CCL11), whereas B3-associated analytes (IL15RA, CD40, TNFB, TNFSF14, CCL4) were expressed in a broader subset of immune cells and in PBMCs, particularly T cells, on visual inspection of heat maps (Figure 2). In formal statistical comparisons of B2 proteins, IL7 and CCL11 were significantly more frequent in ileal stromal cells, IL12B was more frequently seen in non-stromal cells, and MMP10 was significantly increased in ileal mononuclear phagocytes (Figure S1). Analytes associated with an increased risk of complications (IL12B and IL7 for B2 disease, and CCL4 and CD40 for B3 disease) were consistently expressed in activated dendritic cells, inflammatory macrophages, and cytotoxic T cells of inflamed tissues. Notably, B3-associated analytes were also abundant at the transcriptional level in PBMCs whereas B2-associated analytes were more restricted to tissue-specific expression, with the exception of IL7 expression in peripheral blood B cells. These results taken together underscore the differences in the cellular and molecular changes that underlie B2 and B3 complications in Crohn’s disease.

FIGURE 2.


Single-cell RNA gene expression of analytes significantly associated with stricturing (B2) and penetrating (B3) disease. Transcriptional profiles observed in inflamed ileal tissue and PBMCs of 11 Crohn’s disease patients. Enriched analytes in patients with B2 stricturing disease (top). Enriched analytes in patients with B3 penetrating disease (bottom). Analytes with a HR < 1 (left) and HR > 1 (right) are distinguished by the dotted white line. Cell types are noted on the Y-axis. Tregs: regulatory T cells, Cytotox. T: cytotoxic T cells, ILC1: innate lymphoid cells 1, Res. Macs: residential macrophages, Inf. Macs: inflammatory macrophages, Act. DCs: activated dendritic cells, Act. Fibro: activated fibroblasts, Mem. B: memory B cells, Naïve B: naïve B cells, Int. mono: intermediate monocytes, Non-cl. Mono: non-classical monocytes, CD56hi: CD56hi natural killer cells, CD56lo: CD56lo natural killer cells, EM T: effector memory T cells, Active T: active T cells.

4 | DISCUSSION

Applying machine learning to a paediatric Crohn’s disease inception cohort, we have identified protein biomarkers that can compositely predict the development of complications when assayed in the blood at time of diagnosis. Distinct proteins were selected for B2 and B3, highlighting the differential underlying biologic processes behind these complications. Protein-based models performed as well as or better than clinical feature and serology-based models at predicting Crohn’s disease complications and selected proteins were not correlated with other biomarkers of disease activity or prognosis. Complication risk scores based on selected proteins provide a dynamic risk distribution range that can assist in risk stratifying patients at the time of diagnosis.

To date, there are limited biomarkers for predicting disease complications in Crohn’s disease. Clinical variables have been proposed as defining higher risk Crohn’s disease patients including strictures, fistulae, or abdominal abscesses. However, most Crohn’s disease patients present with uncomplicated behaviour and markers that can risk-stratify for later complications can be valuable for patient and physician decision making and treatment selection. Serologic markers were some of the first biomarkers associated with increasing risk of complications in Crohn’s disease.11 Anti-Cbir and ASCA IgA antibodies have been most consistently associated with risk of developing Crohn’s disease complications, including in the original RISK analysis, and were also the serologies selected by RSF in the current study.13 A web-based Personalised Risk and Outcome Prediction Tool (PROSPECT) that provides a personalised risk profile for Crohn’s disease patients based on a combination of these clinical, serologic and genetic variables has demonstrated good predictive performance with a Harrell’s C index of 0.73, though it is not yet available clinically.12 More recently, a blood-based 17-gene expression test that corresponds to differences in T cell exhaustion has been associated with a more aggressive Crohn’s disease course.19 This blood signature is promising but the outcome defining complicated Crohn’s disease was time to treatment escalation, which may not reflect the risk of developing Crohn’s disease-related complications. Last, in the RISK cohort, a potential serum biomarker of stricturing complications (extracellular matrix protein 1, ECM1) was identified and may have prognostic potential in patients in the highest quartile of expression.14

The protein models from the current study perform similarly well or better than prior clinical and serologic Crohn’s disease risk stratification markers. When comparing novel proteins with clinical variables and serologies in the RISK cohort, the protein-based models have a numerically higher AUC than all clinical and serology only models for any complication as well as B2 and B3 specifically. Furthermore, the protein models performed statistically significantly better than serology-only models for B2 and B3 complications but the difference was more marked for B3. This may potentially be due to systemic inflammatory profiles better reflecting risk of developing penetrating complications or imprecision in the definition of B2 complications. Composite models that added clinical and serology data to the novel proteins had marginally higher numerical performance. Interestingly, the protein-based model AUC for any complication (AUC 0.66) and combined model (AUC 0.69) had slightly lower performance than a model from the original RISK study that incorporated a gene expression signature from ileal biopsies with clinical and serology (AUC 0.72), suggesting that peripheral blood based prognostic markers may be able to approach tissue-based markers.

The markers selected by RSF have biologic plausibility as being important in Crohn’s disease. Proteins that were most significantly associated with risk of complications included CCL3, CCL4, CD40, MMP10 and TNFSF14. CCL3, CCL4 and CD40 were all associated with an increased risk of complications. CCL3 (C-C Motif Chemokine Ligand 3), also known as Macrophage Inflammatory Protein 1-Alpha, is a chemoattractant for various leukocytes (NK cells, monocytes, T and B cells), and has been demonstrated to be upregulated in mucosa of Crohn’s disease patients while its administration in mouse models of IBD has been shown to exacerbate disease.20,21 CCL4 (C-C Motif Chemokine Ligand 4), also known as Macrophage Inflammatory Protein 1-Beta, is a chemokine that, similar to CCL3, is produced by macrophages and is upregulated in the inflamed mucosa of IBD patients.22 CCL4 has also been found to be downregulated in the intestinal tissue of IBD patients treated with anti-TNF.23 CD40 is a member of the TNF-receptor superfamily that has previously been noted to be elevated in the blood of IBD patients, particularly those with abscess and/or fistulae, and overexpressed in Crohn’s disease inflamed mucosa on endothelial cells and dendritic cells.24–26 CD40 ligand blockade ameliorates inflammation in mouse models of IBD and an early phase clinical trial demonstrated some efficacy of CD40 blockade in humans.27,28

In contrast, MMP10 and TNFSF14 were associated with a decreased risk of progression to complications in our study. MMP10 (matrix metalloproteinase 10) is involved in the degradation of extracellular matrix proteins and has been found to be differentially expressed in the mucosa of IBD patients.29 Interestingly, lack of MMP10 in mouse models of IBD has been associated with exacerbation of inflammation.30 MMP10 had a protective effect against complications in our study, particularly against B2 complications, which is plausible mechanistically. TNFSF14 (tumour necrosis factor superfamily member 14, also known as LIGHT) was found to be associated with a decreased risk of B3 complications in our study. TNFSF14 has been observed to be over-expressed in Crohn’s disease patient T cells.31 In mouse models of IBD, deficiency of TNFSF14 was associated with significantly worse disease severity, highlighting its potential role as protective against intestinal inflammation.32

Through analysis of our single-cell RNA sequencing cohort, we observed that the selected blood protein analytes had differential expression in the ileum and PBMCs of Crohn’s disease patients. B2 associated markers, with the exception of IL7, were primarily expressed in ileal tissue but not in PBMCs. In contrast, B3 associated proteins were expressed both in ileal tissue and PBMCs. Taken together, these results underscore the differences in the cellular and molecular changes that may underlie the development of B2 and B3 complications in Crohn’s disease. Local tissue-based markers may ultimately prove more useful for predicting B2, but the biologic processes that are associated with B3 may reflect a more systemic inflammatory response for which blood markers may be sufficiently robust. As early anti-TNF treatment has been associated with preventing B3 complications, the composite of B3 associated markers may prove useful in defining patients at diagnosis who may most benefit from early biologic therapy.

Our study had a number of strengths and limitations. Strengths included the use of a multi-centre Crohn’s disease inception cohort with treatment-naïve blood samples and longitudinal follow up. In addition, we were able to incorporate clinical and serologic data into our analyses for comparison of risk stratification models. We also utilised a sample-sparing, multiplex protein biomarker discovery platform that has potential for scalability. One limitation is the observational nature of the RISK cohort. Diagnostic studies that defined complications were not protocolised for specific timepoints, so it is possible that complications developed prior to the actual date they were recognised. In addition, treatment decisions were made by the patients’ primary gastroenterologists and may have affected clinical outcomes. For B2 complications, there is currently uncertainty about the best definition of a stricture in Crohn’s disease, so this may have led to imprecision in measuring this outcome. In addition, given the unique nature of RISK (completely treatment naïve paediatric patients with samples at time of Crohn’s disease diagnosis), we were unable to identify a comparable external validation cohort. However, it is important to note that our methodology does not depend on any hypothesis testing, with which a Type I error is associated. This is an advantage of machine learning. Unlike many parametric regression methods that rely on an arbitrary cut-off value of a certain statistical test, our variable selection used a data-driven approach based on the RSF’s variable importance scores calculated from the OOB samples, which can provide their own internal estimate of predictive performance that correlates well with either cross-validation estimates or a separate testing set. The relatively small sample size of events is a limitation, and future validation studies with larger data sets are needed.
Adding to the biologic plausibility of our results are the marked differences in the single cell data between B2 and B3 associated markers. In particular, the marked tissue-predominance of B2 markers (with the exception of IL7) is consistent with present understanding of stricturing pathogenesis. A technical limitation of the single cell data is that the methodology does not allow characterisation of granulocytes, including neutrophils and eosinophils. Our single cell data suggest that the gut could be a source of peripherally detected proteins, but we cannot rule out other stromal sources of detected analytes such as the bone marrow. It is also important to note that some proteins selected by RSF modelling were not significant in Cox modelling. This is likely because Cox models make assumptions about linearity and proportional hazards that RSF models do not make. RSF modelling can better account for non-linearity and interactions between analytes, allowing it to capture biologic complexity and select sets of proteins that are significantly associated. In addition, the variable selection algorithm based on the RSF model is data driven and does not rely on an arbitrary cut-off value, which is frequently needed by Cox models. Finally, RSF modelling can pick up combinations of proteins which alone are not significant but in combination may be significant and may reflect the complex interplay of multiple contributing cells and proteins in IBD.
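
The idea of a variable importance score computed on samples not used for fitting can be illustrated with a minimal sketch. This is not the analysis code used in this study: a fixed linear risk score stands in for a fitted random survival forest, a simulated held-out sample stands in for the OOB samples, importance is measured as the drop in Harrell's concordance index after permuting one marker, and all names (`marker_a`, `marker_b`, `concordance_index`, `permutation_importance`, `predict`) are hypothetical.

```python
import math
import random


def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly by risk."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if not events[i]:
            continue  # a pair is only comparable if the earlier time is an event
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


def permutation_importance(predict, rows, times, events, column, rng):
    """Drop in C-index after shuffling one marker across held-out subjects."""
    baseline = concordance_index(times, events, [predict(r) for r in rows])
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted_rows = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
    permuted = concordance_index(times, events, [predict(r) for r in permuted_rows])
    return baseline - permuted


# Simulated held-out data (assumption): only marker_a drives the hazard.
rng = random.Random(0)
rows, times, events = [], [], []
for _ in range(200):
    marker_a, marker_b = rng.gauss(0, 1), rng.gauss(0, 1)
    event_time = rng.expovariate(math.exp(marker_a))
    rows.append([marker_a, marker_b])
    times.append(min(event_time, 1.5))  # administrative censoring at t = 1.5
    events.append(event_time <= 1.5)


def predict(row):
    # Stand-in for a fitted model's risk score (assumed, not an actual forest).
    return row[0] + 0.05 * row[1]


imp_a = permutation_importance(predict, rows, times, events, 0, rng)
imp_b = permutation_importance(predict, rows, times, events, 1, rng)
print(f"importance marker_a={imp_a:.3f}, marker_b={imp_b:.3f}")
```

In this toy setting, permuting the informative marker should reduce concordance noticeably, whereas permuting the noise marker should leave it essentially unchanged. Ranking variables by such a score requires no significance threshold, which is the property emphasised above.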

In summary, we have identified panels of blood protein markers that have predictive capacity for development of complications in paediatric Crohn’s disease. These markers have biologic plausibility and may highlight potentially important pathways in Crohn’s disease progression. Protein based models had good performance for predicting later complications, outperforming clinical and serologic based models for B2 and serologic based models for B3. Although further studies with external cohorts are needed before these markers could be implemented clinically, our findings support the use of blood biomarkers in assisting with risk stratification of Crohn’s disease patients at the time of diagnosis.

Supplementary Material

Supplementary File 1
Supplementary File 2

ACKNOWLEDGEMENTS

We thank the Crohn’s and Colitis Foundation for providing funding for the RISK study, the RISK Steering Committee, and all participating RISK study sites. We also gratefully acknowledge the Sanford J. Grossman Charitable Trust, NIDDK R01 DK106593, U01 DK062422 (JHC).

Declaration of personal interests: RCU has served as an advisory board member or consultant for Eli Lilly, Janssen, Pfizer, and Takeda, and has received research support from AbbVie, Boehringer Ingelheim, and Pfizer. LAD reports research support from Friesland, Campina, Glycosyn, and Janssen. MCD reports personal fees from AbbVie, Janssen, Takeda, Pfizer, Celgene, and UCB and grants from Janssen and Prometheus. BES received grant support for research from AbbVie, Amgen, Bristol-Myers Squibb, Celgene, Janssen, MedImmune (AstraZeneca), Millennium Pharmaceuticals, Pfizer Inc, Prometheus Laboratories, and Takeda; and personal fees from 4D Pharma, AbbVie, Akros Pharma, Allergan, Amgen, Arena Pharmaceuticals, Boehringer Ingelheim, Capella Biosciences, Celgene, EnGene, Ferring, Forward Pharma, Gilead, Immune Pharmaceuticals, Janssen, Lilly, Luitpold Pharmaceuticals, Lyndra, MedImmune, Oppilan Pharma, Otsuka, Palatin Technologies, Pfizer, Progenity, Receptos, Rheos Pharmaceuticals, Salix Pharmaceuticals, Seres Therapeutics, Shire, Synergy Pharmaceuticals, Takeda, Target PharmaSolutions, Theravance Biopharma R&D, TiGenix, Topivert Pharma, UCB, Vedanta Biosciences, and Vivelix Pharmaceuticals.

Funding information: RCU is supported by an NIH K23 Career Development Award (K23KD111995-01A1).

Footnotes

SUPPORTING INFORMATION

Additional supporting information will be found online in the Supporting Information section.

REFERENCES

1. Torres J, Mehandru S, Colombel J-F, et al. Crohn’s disease. Lancet. 2017;389:1741–1755.
2. Pariente B, Mary J-Y, Danese S, et al. Development of the Lémann index to assess digestive tract damage in patients with Crohn’s disease. Gastroenterology. 2015;148:52–63.e3.
3. De Boer AGEM, Bennebroek Evertsz F, Stokkers PC, et al. Employment status, difficulties at work and quality of life in inflammatory bowel disease patients. Eur J Gastroenterol Hepatol. 2016;28:1130–1136.
4. Peyrin-Biroulet L, Loftus EV, Colombel J-F, et al. Long-term complications, extraintestinal manifestations, and mortality in adult Crohn’s disease in population-based cohorts. Inflamm Bowel Dis. 2011;17:471–478.
5. Loftus EV, Schoenfeld P, Sandborn WJ. The epidemiology and natural history of Crohn’s disease in population-based patient cohorts from North America: a systematic review. Aliment Pharmacol Ther. 2002;16:51–60.
6. Ungaro RC, Yzet C, Bossuyt P, et al. Deep remission at 1 year prevents progression of early Crohn’s disease. Gastroenterology. 2020;159:139–147.
7. Colombel J-F, Panaccione R, Bossuyt P, et al. Effect of tight control management on Crohn’s disease (CALM): a multicentre, randomised, controlled phase 3 trial. Lancet. 2018;390:2779–2789.
8. Rubin DT, Mody R, Davis KL, et al. Real-world assessment of therapy changes, suboptimal treatment and associated costs in patients with ulcerative colitis or Crohn’s disease. Aliment Pharmacol Ther. 2014;39:1143–1155.
9. Lichtenstein GR, Loftus EV, Isaacs KL, et al. ACG clinical guideline: management of Crohn’s disease in adults. Am J Gastroenterol. 2018;113:481–517.
10. Torres J, Caprioli F, Katsanos KH, et al. Predicting outcomes to optimize disease management in inflammatory bowel diseases. J Crohns Colitis. 2016;10:1385–1394.
11. Dubinsky MC, Kugathasan S, Mei L, et al. Increased immune reactivity predicts aggressive complicating Crohn’s disease in children. Clin Gastroenterol Hepatol. 2008;6:1105–1111.
12. Siegel CA, Horton H, Siegel LS, et al. A validated web-based tool to display individualised Crohn’s disease predicted outcomes based on clinical, serologic and genetic variables. Aliment Pharmacol Ther. 2016;43:262–271.
13. Kugathasan S, Denson LA, Walters TD, et al. Prediction of complicated disease course for children newly diagnosed with Crohn’s disease: a multicentre inception cohort study. Lancet. 2017;389:1710–1718.
14. Wu J, Lubman DM, Kugathasan S, et al. Serum protein biomarkers of fibrosis aid in risk stratification of future stricturing complications in pediatric Crohn’s disease. Am J Gastroenterol. 2019;114:777–785.
15. Lennard-Jones JE, Shivananda S. Clinical uniformity of inflammatory bowel disease at presentation and during the first year of disease in the north and south of Europe. EC-IBD Study Group. Eur J Gastroenterol Hepatol. 1997;9:353–359.
16. Assarsson E, Lundberg M, Holmquist G, et al. Homogenous 96-plex PEA immunoassay exhibiting high sensitivity, specificity, and excellent scalability. PLoS One. 2014;9:e95192.
17. Pepe MS, Feng Z, Huang Y, et al. Integrating the predictiveness of a marker with its performance as a classifier. Am J Epidemiol. 2008;167:362–368.
18. Martin JC, Chang C, Boschetti G, et al. Single-cell analysis of Crohn’s disease lesions identifies a pathogenic cellular module associated with resistance to anti-TNF therapy. Cell. 2019;178:1493–1508.e20.
19. Biasci D, Lee JC, Noor NM, et al. A blood-based prognostic biomarker in IBD. Gut. 2019;68:1386–1395.
20. Banks C, Bateman A, Payne R, et al. Chemokine expression in IBD. Mucosal chemokine expression is unselectively increased in both ulcerative colitis and Crohn’s disease. J Pathol. 2003;199:28–35.
21. Pender SLF, Chance V, Whiting CV, et al. Systemic administration of the chemokine macrophage inflammatory protein 1alpha exacerbates inflammatory bowel disease in a mouse model. Gut. 2005;54:1114–1120.
22. Arijs I, De Hertogh G, Machiels K, et al. Mucosal gene expression of cell adhesion molecules, chemokines, and chemokine receptors in patients with inflammatory bowel disease before and after infliximab treatment. Am J Gastroenterol. 2011;106:748–761.
23. Magnusson MK, Strid H, Isaksson S, et al. Response to infliximab therapy in ulcerative colitis is associated with decreased monocyte activation, reduced CCL2 expression and downregulation of Tenascin C. J Crohns Colitis. 2015;9:56–65.
24. Senhaji N, Kojok K, Darif Y, et al. The contribution of CD40/CD40L axis in inflammatory bowel disease: an update. Front Immunol. 2015;6:529.
25. Danese S, Katz JA, Saibeni S, et al. Activated platelets are the source of elevated levels of soluble CD40 ligand in the circulation of inflammatory bowel disease patients. Gut. 2003;52:1435–1441.
26. Ludwiczek O, Kaser A, Tilg H. Plasma levels of soluble CD40 ligand are elevated in inflammatory bowel diseases. Int J Colorectal Dis. 2003;18:142–147.
27. Liu Z, Geboes K, Colpaert S, et al. Prevention of experimental colitis in SCID mice reconstituted with CD45RBhigh CD4+ T cells by blocking the CD40-CD154 interactions. J Immunol. 2000;164:6005–6014.
28. Kasran A, Boon L, Wortel CH, et al. Safety and tolerability of antagonist anti-human CD40 Mab ch5D12 in patients with moderate to severe Crohn’s disease. Aliment Pharmacol Ther. 2005;22:111–122.
29. Dobre M, Milanesi E, Mănuc TE, et al. Differential intestinal mucosa transcriptomic biomarkers for Crohn’s disease and ulcerative colitis. J Immunol Res. 2018;2018:1–10.
30. Koller FL, Dozier EA, Nam KT, et al. Lack of MMP10 exacerbates experimental colitis and promotes development of inflammation-associated colonic dysplasia. Lab Invest. 2012;92:1749–1759.
31. Shih T-C, Hsieh S-Y, Hsieh Y-Y, et al. Aberrant activation of nuclear factor of activated T cell 2 in lamina propria mononuclear cells in ulcerative colitis. World J Gastroenterol. 2008;14:1759–1767.
32. Krause P, Zahner SP, Kim G, et al. The tumor necrosis factor family member TNFSF14 (LIGHT) is required for resolution of intestinal inflammation in mice. Gastroenterology. 2014;146:1752–1762.e4.
