PLOS Digital Health. 2024 Jul 25;3(7):e0000533. doi: 10.1371/journal.pdig.0000533

Machine-learning-based prediction of disability progression in multiple sclerosis: An observational, international, multi-center study

Edward De Brouwer 1,#, Thijs Becker 2,3,#, Lorin Werthen-Brabants 4, Pieter Dewulf 5, Dimitrios Iliadis 5, Cathérine Dekeyser 6,7,8, Guy Laureys 6,7, Bart Van Wijmeersch 9,10, Veronica Popescu 9,10, Tom Dhaene 4, Dirk Deschrijver 4, Willem Waegeman 5, Bernard De Baets 5, Michiel Stock 5,11, Dana Horakova 12, Francesco Patti 13, Guillermo Izquierdo 14, Sara Eichau 14, Marc Girard 15, Alexandre Prat 15, Alessandra Lugaresi 16, Pierre Grammond 17, Tomas Kalincik 18,19, Raed Alroughani 20, Francois Grand’Maison 21, Olga Skibina 22, Murat Terzi 23, Jeannette Lechner-Scott 24, Oliver Gerlach 25,26, Samia J Khoury 27, Elisabetta Cartechini 28, Vincent Van Pesch 29, Maria José Sà 30, Bianca Weinstock-Guttman 31, Yolanda Blanco 32, Radek Ampapa 33, Daniele Spitaleri 34, Claudio Solaro 35, Davide Maimone 36, Aysun Soysal 37, Gerardo Iuliano 38, Riadh Gouider 39, Tamara Castillo-Triviño 40, José Luis Sánchez-Menoyo 41, Guy Laureys 42, Anneke van der Walt 43, Jiwon Oh 44, Eduardo Aguera-Morales 45, Ayse Altintas 46, Abdullah Al-Asmi 47, Koen de Gans 48, Yara Fragoso 49, Tunde Csepany 50, Suzanne Hodgkinson 51, Norma Deri 52, Talal Al-Harbi 53, Bruce Taylor 54, Orla Gray 55, Patrice Lalive 56, Csilla Rozsa 57, Chris McGuigan 58, Allan Kermode 59, Angel Pérez Sempere 60, Simu Mihaela 61, Magdolna Simo 62, Todd Hardy 63, Danny Decoo 64, Stella Hughes 65, Nikolaos Grigoriadis 66, Attila Sas 67, Norbert Vella 68, Yves Moreau 1, Liesbet Peeters 3,10,*
Editor: Ryan S McGinnis
PMCID: PMC11271865  PMID: 39052668

Abstract

Background

Disability progression is a key milestone in the disease evolution of people with multiple sclerosis (PwMS). Prediction models of the probability of disability progression have not yet reached the level of trust needed to be adopted in the clinic. A common benchmark to assess model development in multiple sclerosis is also currently lacking.

Methods

Data of adult PwMS with a follow-up of at least three years from 146 MS centers, spread over 40 countries and collected by the MSBase consortium, were used. After applying basic quality-related inclusion criteria, the cohort comprised a total of 15,240 PwMS. External validation was performed and repeated five times to assess the significance of the results. The Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines were followed. Confirmed disability progression after two years was predicted, with a confirmation window of six months. Only routinely collected variables were used, such as the expanded disability status scale, treatment, relapse information, and MS course. To learn the probability of disability progression, state-of-the-art machine learning models were investigated. The discrimination performance of the models was evaluated with the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (AUC-PR), and their calibration via the Brier score and the expected calibration error. All our preprocessing and model code are available at https://gitlab.com/edebrouwer/ms_benchmark, making this task an ideal benchmark for predicting disability progression in MS.

Findings

Machine learning models achieved a ROC-AUC of 0⋅71 ± 0⋅01, an AUC-PR of 0⋅26 ± 0⋅02, a Brier score of 0⋅10 ± 0⋅01, and an expected calibration error of 0⋅07 ± 0⋅04. The history of disability progression was identified as more predictive of future disability progression than the treatment or relapse history.

Conclusions

Good discrimination and calibration performance on an external validation set was achieved using only routinely collected variables. This suggests that machine learning models can reliably inform clinicians about the future occurrence of progression, and that they are ready to be tested in a clinical impact study.

Author summary

Models that accurately predict disability progression in individuals with multiple sclerosis (MS) have the potential to greatly benefit both patients and medical professionals. By aiding in life planning and treatment decision-making, these predictive models can enhance the overall quality of care for people with MS. While previous academic literature has demonstrated the feasibility of predicting disability progression, recent systematic reviews have shed light on several methodological limitations within the existing research. These reviews have highlighted concerns such as the absence of probability calibration assessment, potential biases in cohort selection, and insufficient external validation. Furthermore, the datasets examined often include variables that are not routinely collected in clinical settings or readily available for digital analysis. Consequently, it remains uncertain whether the models identified in these systematic reviews can be effectively implemented in a clinical context. Compounding this issue, the lack of availability of data and analysis code makes it challenging to compare results across different publications. To address these gaps, this study endeavors to develop and validate a machine-learning-based prediction model using the largest longitudinal patient cohort ever assembled for disability progression prediction in MS. Leveraging data from MSBase, a comprehensive international data registry comprising information from multiple MS centers, we aim to create robust models capable of accurately predicting the probability of disability progression. The integration of machine learning models into routine clinical practice has the potential to greatly enhance treatment decision-making and life planning for individuals with MS. The models developed through this study could be subsequently evaluated in a clinical impact study involving MS centers participating in MSBase. This research represents a significant advancement towards the practical application of machine learning models in improving the treatment and care of individuals with MS.

Introduction

Multiple sclerosis (MS) is a chronic autoimmune disease of the central nervous system [1]. A recent census estimated that more than 2⋅8 million people are currently living with MS [2]. The disease causes a wide variety of symptoms such as mobility problems, cognitive impairment, pain, and fatigue. Importantly, the rate of disability progression is highly variable among people with MS (PwMS) [3]. This heterogeneity makes the personalization of care difficult, and prognostic models are thus of high relevance for medical professionals, as they could contribute to better individualized treatment decisions. Indeed, a more aggressive treatment could be prescribed in case of a negative prognosis. Moreover, surveys indicate that PwMS are interested in their prognosis [4], which could help them with planning their lives.

There is a large amount of literature on prognostic MS models [5–10]. Some prognostic models are, or were at some point, available as web tools. However, with the exception of Tintore et al. [10], which focuses on conversion to MS, none have been integrated into clinical practice and no clinical impact studies have been performed [5, 6]. Because MS is a complex chronic disease that is often treated within a multidisciplinary context, the performance of a prognostic model studied in isolation from its clinical context gives limited information on its clinical relevance [11, 12]. Recent systematic reviews have highlighted several methodological issues within the current literature [5, 6], such as the lack of calibration assessment or a possibly significant bias in the cohort selection. Moreover, the investigated datasets are rarely made available. They furthermore often contain variables that are not routinely collected within the current clinical workflow (e.g. neurofilament light chain) or are not readily available for digital analysis (e.g. Magnetic Resonance Imaging (MRI) images).

In this article, we aimed to develop a model with three specific goals. Firstly, it should predict the probability (a value between 0 and 1) of disability progression for a PwMS within the next two years, instead of just a binary target (0 or 1, i.e., disease progression or no disease progression). Secondly, it should be applicable to a well-defined, relevant, and large patient population. Thirdly, all variables used in the model should be available in routine clinical care. A successful combination of these three goals would justify a clinical impact study of the model and represent a significant step towards clinical applicability.

With this aim in mind, we developed and externally validated machine learning models to predict disability progression after two years for PwMS, using commonly available clinical features. For this task, we represented disability progression as a binary variable indicating whether a confirmed disability progression will occur within the next two years, as defined by Kalincik et al. [13]. We trained the models using the largest longitudinal patient cohort to date for disability progression prediction in MS. The cohort was extracted from MSBase, a large international data registry containing data from multiple MS centers. We evaluated the performance, including the predicted probabilities, of different machine learning architectures and found they could achieve a ROC-AUC of 0⋅71 ± 0⋅01, an AUC-PR of 0⋅26 ± 0⋅02, a Brier score of 0⋅10 ± 0⋅01, and an expected calibration error of 0⋅07 ± 0⋅04.

Importantly, and in contrast with the available literature on disease progression models for MS (except for one model to predict relapses [14]), our data pre-processing pipeline and our models check all the boxes of the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) checklist. Our work therefore provides an important step towards the integration of artificial intelligence (AI) models in MS care. The outline of our approach is presented in Fig 1.

Fig 1. Overall layout of our approach.

Fig 1

A: Representation of a clinical trajectory of an individual person with multiple sclerosis (PwMS). The trajectory consists of, among others, relapses, EDSS values, and treatment durations collected over time. The full list of used variables is given in the Materials and Methods. The trajectory of each patient is divided into an observation window (the available clinical history for the prediction) and the future trajectory, which is used to compute the confirmed disability progression label at two years (wc). B: For an individual PwMS, the clinical trajectory in the observation window is extracted and used in the machine learning model to predict a well-calibrated probability of disability progression at two years. Based on the predictions, clinicians can adjust their clinical decisions accordingly. C: The MSBase dataset contains clinical data from 146 individual MS clinical centers with different clinical practices. We leveraged this feature by creating an external validation cohort of patients. We split the data per clinic, with 60% of patients used for training the model, 20% for optimizing the hyperparameters (validation set), and 20% for external validation. The results presented in this work are all on the external validation cohort.

Results

Cohort statistics

In this multi-center international study, we used data of people with MS from 146 centers spread over 40 different countries and compiled in the MSBase registry [15] as of September 2020. All data were prospectively collected during routine clinical care predominantly from tertiary MS centres [16].

The inclusion criteria for the initial extraction of the data from MSBase were: having at least 12 months of follow-up, being aged 18 years or older, and being diagnosed with relapsing remitting (RR) MS, secondary progressive (SP) MS, or primary progressive (PP) MS. Clinically-isolated syndrome (CIS) patients were excluded. This resulted in an initial cohort of 40,827 patients.

The clinical trajectory of each patient in the cohort consisted of multiple, potentially overlapping, clinical episodes, which allowed us to artificially augment the dataset. We defined a clinical episode as the conjunction of an observation window, a baseline EDSS measurement, and a disability progression label. Details about the construction of the clinical episodes are given in the Materials and Methods. For each episode, we required a minimum of three EDSS measurements over the last three years and three months at the time of prediction. This inclusion criterion reflects the typical follow-up frequency for PwMS, which is once or twice a year. Prior work showed that longitudinal clinical history is beneficial for the prediction of disability progression [17]. The final cohort comprised a total of 283,115 valid episodes from 26,246 patients. Basic characteristics of the final cohort are shown in Table 1.

Table 1. Summary statistics of the cohort of interest after extraction from MSBase (Extracted Cohort) and after patient and sample selection (Final Cohort).

For all variables the value at the last recorded visit was used. KFS stands for Kurtzke Functional Systems Score, DMT for Disease Modifying Therapy, CIS for Clinically Isolated Syndrome.

Variable Cohort (minimum 3 EDSS measurements)
Patients (% female) 26,246 (71⋅8)
Age, Yearsa 42⋅8 (10⋅8)
Age at MS onset, yearsa 31⋅3 (8⋅9)
Disease duration, yearsa 11⋅6 (8⋅0)
Education status, % higherc 18⋅2 (65⋅1)
First symptom, none given (%) 13⋅7
 supratentorial (%) 28⋅2
 optic pathways (%) 22⋅6
 brainstem (%) 24⋅3
 spinal cord (%) 26⋅4
MS course /
 CIS (%) 0
 Relapsing-Remitting (%) 83⋅5
 Primary Progressive (%) 5⋅0
 Secondary Progressive (%) 11⋅5
EDSSa 3⋅0 (2⋅1)
EDSSt=0 category /
 EDSSt=0 ≤ 5⋅5 (%) 83⋅9
 EDSSt=0 > 5⋅5 (%) 16⋅1
Annualized relapse rateb 0⋅82 [0⋅43, 1⋅47]
KFS Scores /
 pyramidalb 2 [1, 3]
 cerebellarb 0 [0, 2]
 brainstemb 0 [0, 1]
 sensoryb 1 [0, 2]
 sphinctericb 0 [0, 1]
 visualb 0 [0, 1]
 cerebralb 0 [0, 1]
 ambulatoryb 0 [0, 1]
DMT /
 none 23⋅5
 low-efficacy 51⋅3
 moderate-efficacy 13⋅6
 high-efficacy 11⋅6
 high induction 7⋅2

a: mean (standard deviation)

b: median [quartiles]

c: % missing data

Model performance

The performance of the predictive models assessed on the external test cohort is reported in Tables 2–4. A visual illustration of the discrimination performance is shown in Fig 2. A temporal-attention-based model reached an area under the receiver operating characteristic curve (ROC-AUC) of 0⋅71 ± 0⋅01 and an area under the precision-recall curve (AUC-PR) of 0⋅26 ± 0⋅02, with a calibration error of 0⋅07 ± 0⋅04 on the external test cohort.

Table 2. Summary statistics of the performance measures (averages ± standard deviations).

Baseline performances are 0⋅5 for the area under the receiver operating characteristic curve (ROC-AUC) and 0⋅11 for the area under the precision-recall curve (AUC-PR). ↑ indicates higher is better. ↓ indicates lower is better. p-value for ROC-AUC between Ensemble and MLP: 0⋅152 (unpaired t-test). p-value for AUC-PR between Attention and MLP: 0⋅452 (unpaired t-test).

Model ROC-AUC ↑ AUC-PR ↑ Brier ↓ ECE ↓
Ensemble 0⋅71 ± 0⋅01 0⋅25 ± 0⋅02 0⋅10 ± 0⋅01 0⋅06 ± 0⋅05
Attention 0⋅71 ± 0⋅01 0⋅26 ± 0⋅02 0⋅10 ± 0⋅01 0⋅07 ± 0⋅04
Bayesian NN 0⋅71 ± 0⋅01 0⋅25 ± 0⋅01 0⋅10 ± 0⋅01 0⋅08 ± 0⋅04
MLP 0⋅70 ± 0⋅01 0⋅24 ± 0⋅02 0⋅10 ± 0⋅01 0⋅09 ± 0⋅03

Table 4. Results for disability progression prediction for different baseline Expanded Disability Status Scale score (EDSSt=0), EDSSt=0 ≤ 5⋅5 and >5⋅5.

↑ indicates higher is better. ↓ indicates lower is better. We report averages ± standard deviations computed over the different folds. Training sizes of the different groups: EDSSt=0 ≤ 5⋅5 = 185,556 episodes (16,282 patients); EDSSt=0 > 5⋅5 = 34,848 episodes (4,686 patients).

Model EDSSt=0 ROC-AUC ↑ AUC-PR ↑ Brier ↓ ECE ↓
Attention EDSSt=0 ≤ 5⋅5 0⋅72 ± 0⋅01 0⋅26 ± 0⋅01 0⋅09 ± 0⋅0 0⋅07 ± 0⋅04
Attention EDSSt=0 > 5⋅5 0⋅65 ± 0⋅01 0⋅27 ± 0⋅04 0⋅15 ± 0⋅01 0⋅07 ± 0⋅02
Bayesian NN EDSSt=0 ≤ 5⋅5 0⋅72 ± 0⋅01 0⋅25 ± 0⋅01 0⋅09 ± 0⋅0 0⋅08 ± 0⋅04
Bayesian NN EDSSt=0 > 5⋅5 0⋅64 ± 0⋅02 0⋅26 ± 0⋅03 0⋅15 ± 0⋅01 0⋅11 ± 0⋅03
MLP EDSSt=0 ≤ 5⋅5 0⋅71 ± 0⋅01 0⋅24 ± 0⋅01 0⋅1 ± 0⋅01 0⋅09 ± 0⋅03
MLP EDSSt=0 > 5⋅5 0⋅63 ± 0⋅01 0⋅26 ± 0⋅04 0⋅15 ± 0⋅02 0⋅09 ± 0⋅03

Fig 2. Visual representation of the discrimination performance.

Fig 2

ROC curve, precision-recall curve, and distribution of the estimated probability of disability progression per group, obtained with the temporal attention model.

To assess the reliability of those results on specific subgroups of patients, we also evaluated the performance for each MS course at the time of prediction (Table 3) and for different baseline EDSS (EDSSt=0) values (Table 4). The relapsing-remitting (RR) category showed a performance similar to the full cohort. We observed a decreased discrimination performance in the primary progressive and secondary progressive groups. We conjecture that this is due to the low sample size in these groups. A similar effect was observed when segmenting by disability severity, with the higher-severity group showing a lower discrimination performance. In the supporting information, we also present a segmentation of the results by the medical center of origin of the patients (S1, S2 and S3 Figs), indicating a higher variability of the results for small centers.

Table 3. Results for disability progression prediction per MSCourse (Primary Progressive (PP), Relapsing Remitting (RR), and Secondary Progressive (SP)), for the best models.

↑ indicates higher is better. ↓ indicates lower is better. We report averages ± standard deviations computed over the different folds. Training sizes of different groups: PP = 10,976 episodes (1,192 patients); RR = 185,724 episodes (16,268 patients); SP = 23,704 episodes (2,402 patients).

Model MSCourse ROC-AUC ↑ AUC-PR ↑ Brier ↓ ECE ↓
Attention PP 0⋅65 ± 0⋅01 0⋅33 ± 0⋅04 0⋅16 ± 0⋅01 0⋅07 ± 0⋅02
Attention RR 0⋅70 ± 0⋅01 0⋅21 ± 0⋅01 0⋅09 ± 0⋅01 0⋅06 ± 0⋅03
Attention SP 0⋅65 ± 0⋅01 0⋅33 ± 0⋅03 0⋅17 ± 0⋅01 0⋅10 ± 0⋅05
Bayesian NN PP 0⋅66 ± 0⋅01 0⋅34 ± 0⋅03 0⋅16 ± 0⋅01 0⋅09 ± 0⋅05
Bayesian NN RR 0⋅70 ± 0⋅01 0⋅20 ± 0⋅01 0⋅09 ± 0⋅01 0⋅09 ± 0⋅03
Bayesian NN SP 0⋅64 ± 0⋅01 0⋅32 ± 0⋅02 0⋅17 ± 0⋅01 0⋅11 ± 0⋅02
MLP PP 0⋅63 ± 0⋅03 0⋅32 ± 0⋅05 0⋅16 ± 0⋅01 0⋅09 ± 0⋅03
MLP RR 0⋅69 ± 0⋅01 0⋅19 ± 0⋅0 0⋅09 ± 0⋅01 0⋅05 ± 0⋅02
MLP SP 0⋅63 ± 0⋅01 0⋅31 ± 0⋅02 0⋅17 ± 0⋅01 0⋅10 ± 0⋅04

The calibration of the different models was assessed using the Brier score and the expected calibration error (ECE), which are reported in Tables 2–4. In Fig 3, we report the calibration plot of the longitudinal attention model on the external test cohort. We observed a very good calibration of the predicted risks in the range between 0 and 0⋅3, suggesting an excellent reliability of the predictive model. The calibration curves of the other models are given in the supporting information (S4 Fig), along with a segmentation of the calibration of the models by clinical subgroup (S5 Fig). A comprehensive comparison of all considered models is available in S2, S3, S4, S5, S6 and S7 Tables.

Fig 3. Calibration diagram for the temporal attention model for the first data split.

Fig 3

The val.prob.ci.2 function [18] was used to generate this plot.

Feature importance

The importance of the different variables used in the machine learning models was investigated. Fig 4 shows the results of a permutation importance test on the multi-layer perceptron (MLP) model, which assesses the loss in discrimination performance when a variable is shuffled over the test set [19]. Fig 4 ranks the features in decreasing order of importance. We found the most important variables to be the baseline EDSS at prediction time, the number of years since 1990, and the mean EDSS and Kurtzke Functional Systems Score (KFS) over the last three years. The complete results are available in S8 Table.
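As a concrete illustration, the permutation test can be sketched in a few lines of Python. This is our own minimal sketch, not the pipeline code: `predict_proba` is a placeholder for the trained model's scoring function, and the feature matrix is assumed to be a NumPy array. The drop in ROC-AUC after shuffling a column estimates that feature's importance.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def permutation_importance(predict_proba, X_test, y_test,
                           n_repeats: int = 10, seed: int = 0) -> np.ndarray:
    """Average ROC-AUC drop when each feature column is shuffled on the test set."""
    rng = np.random.default_rng(seed)
    base = roc_auc_score(y_test, predict_proba(X_test))
    drops = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        for _ in range(n_repeats):
            X_perm = X_test.copy()
            rng.shuffle(X_perm[:, j])  # break the feature-outcome association
            drops[j] += base - roc_auc_score(y_test, predict_proba(X_perm))
    return drops / n_repeats  # larger drop = more important feature
```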

Fig 4. Feature importance of different variables.

Fig 4

Feature importance of different variables used in the MLP model, based on the average performance degradation on the ROC-AUC, AUC-PR, and ECE metrics. 'EDSS at 0' stands for the Expanded Disability Status Scale score at the time of prediction. 'Date Reference' represents the date of prediction. 'Mean EDSS' stands for the average EDSS over the last 3 years. 'MS Course = SP' is a binary variable indicating that the MS course is secondary progressive at the time of prediction. 'Mean KFS x' denotes the average of the corresponding Kurtzke Functional Systems Score over the last three years. 'Std EDSS' represents the standard deviation of the EDSS over the last 3 years.

The baseline EDSS was expected to be important in the prediction, as the definition of the progression event directly depends on it (Eq (1)). The importance of the time since 1990 suggests a change in the behavior of the disease over the years, which could be explained by progress in clinical care or by earlier diagnosis of milder forms of the disease. The importance of the previous values of the EDSS and KFS demonstrates the added value of considering longitudinal data, as already shown in De Brouwer et al. [17]. Remarkably, no variables involving disease modifying treatments (DMT) received a significant importance score.

Discussion

The models investigated in this study provide a significant advance towards deploying AI in clinical practice in MS. After validation of the results in a clinical impact study, they have the potential to bring the benefits of advanced predictive modeling to MS research and care.

Our work confirms that predicting the probability of disability progression of MS patients is feasible. Importantly, despite MS progression being inherently stochastic, this study shows that relevant historical clinical data, collected as part of routine clinical care, can lead to high discrimination performance and good calibration (Fig 3), which is crucial in healthcare applications. Combined with our rigorous benchmarking, external validation, and strict adherence to the TRIPOD guidelines, this points towards a readiness of these models to be tested in a clinical impact study. Such a study would evaluate the performance of these models in real-world clinical practice, compare them with the predictions of clinicians, and assess the value of such a prediction for PwMS. Over- or under-prediction of the probability of progression could indeed lead to unnecessary emotional stress or optimism.

Our attained ROC-AUC of 0⋅71 is compatible with the values found in the literature, which range between 0⋅64 and 0⋅89 [5], albeit on the low end of this range. This could be explained by several factors: MSBase being a large and diverse population; the use of a limited set of variables, since we constrained ourselves to variables that are collected during routine clinical care; and a validation set-up where prediction is done on patients from different clinics than those in the training set.

Previous work had only reported calibration graphically [20–22], with some of these models showing good calibration. Our study empirically confirms that well-calibrated models are achievable. As previous studies used different patient populations, covariates, and prediction targets, we could not directly compare our models with theirs.

The models developed in this study also suffer from limitations. First, several countries with good-quality MS registries were not included because they are not part of the MSBase initiative. Since treatment decisions can be country-specific to a significant degree [23], the proposed models may perform differently in countries not represented in this dataset. A clinical impact study in MS centers participating in MSBase, however, would not suffer from such external validity problems.

Second, our inclusion criteria required patients with good follow-up (at least one yearly visit with an EDSS measurement), so stable patients who do not visit regularly, or patients newly diagnosed with MS, cannot benefit from these models. This limits the application to patients with an already established clinical MS history. This decision was motivated by prior work [17], which showed that including clinical history as a predictor leads to more accurate prognosis, a finding that we confirm in this study. A new dedicated model would be required for disability progression in patients with a shorter clinical history. Nevertheless, MS being a chronic disease, many patients would still satisfy our follow-up inclusion criteria (64% in the MSBase cohort).

Third, our analysis showed that the performance of the different models varied across patient subgroups. When segmenting the cohort by disease course or by baseline EDSS, we found that the majority subgroups (i.e. relapsing-remitting and EDSSt=0 ≤ 5⋅5) showed a better discrimination performance than subgroups with lower prevalence. We conjecture that this difference in performance is due to the lower sample size in the minority subgroups. This finding suggests a more limited value of the models for PwMS belonging to the minority subgroups. Nevertheless, calibration did not differ significantly across subgroups.

Fourth, the progression target that we defined in this work cannot realistically capture the full complexity of the disease, and progression in MS cannot be summarized by the EDSS alone. The EDSS itself, as an attempt to quantify progression on a one-dimensional scale, lacks the expressivity to reliably encode the progression of the disease. What is more, we framed disability progression as a classification task, which is less granular than predicting the future EDSS itself, but more amenable to machine learning. Despite these imperfections, the confirmed disability progression label used in this work has proven clinically useful [13], striking a good balance between abstraction and expressivity. Our work builds upon those concepts and inherits their flaws and advantages.

Despite these imperfections, our models could help patients in the planning of their lives and provide a baseline for further research. We placed a strong emphasis on reproducibility, in an attempt to provide a strong benchmark for this important task. Thanks to the clinically-informed pre-processing pipeline, researchers can easily extend the current models or propose their own, to continuously improve disease progression prediction. Extensions to our method could include treatment recommendation or the inclusion of other biomarkers available in a specific center.

Materials and methods

Resource availability

Lead contact

Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Liesbet Peeters (liesbet.peeters@uhasselt.be).

Materials availability

Trained machine learning models can be found at https://gitlab.com/edebrouwer/ms_benchmark.

Cohort definition

In this multi-center international study, we used data of people with MS from 146 centers spread over 40 different countries and compiled in the MSBase registry [15] as of September 2020. All data were prospectively collected during routine clinical care predominantly from tertiary MS centres [16].

The inclusion criteria for the initial extraction of the data from MSBase were: at least 12 months of follow-up, aged 18 years or older, and diagnosed with relapsing remitting (RR) MS, secondary progressive (SP) MS, or primary progressive (PP) MS. Clinically-isolated syndrome (CIS) patients were excluded from the study. This initial dataset contained a total of 44,886 patients.

In order to ensure data quality, some observations or patients were removed from that cohort. The exclusion criteria were:

  • Visits of the same patient that occurred on the same day but with different expanded disability status scale (EDSS) values were removed. Duplicate visits with the same EDSS on the same visit date were reduced to a single visit. Visits from before 1970 were discarded.

  • Patients with the CIS MS course at their last visit were discarded. For those patients the relevant question is whether or not they will progress to confirmed MS, which is a different question than the one investigated in this work.

  • Patients whose diagnosis date or age at first symptoms (i.e., MS onset date) was missing or with invalid formatting were removed.

  • Patients whose MS course or sex was not available were removed.

  • Patients whose date of MS diagnosis, birth, MS onset, start of progression, clinic entry, or first relapse was later than the extraction date were discarded.

  • All visits whose visit date had an invalid format or was after the extraction date were discarded.

These criteria resulted in a total of 40,827 patients in the cohort. A flowchart of the patient inclusion for the final cohort is shown in Fig 5. Basic characteristics of the final cohorts are shown in Table 1.

Fig 5. Flowchart of patient selection.

Fig 5

Flowchart of patient selection for both at least three and at least six visits in the last 3.25 years.

The clinical trajectory of each patient in the cohort consisted of multiple, potentially overlapping, clinical episodes, which allowed us to artificially augment the dataset. Clinical episodes are defined in the sections below.

External validation was used to assess the performance of our predictive models by splitting the cohort by MS center. The models were thus evaluated on patients from different clinics than the ones used for training. An assessment of the heterogeneity across centers is available in the supporting information (S3 Fig).

Definition of disability progression

Machine learning models were trained to predict a disability progression binary variable for each clinical episode. In this section, we describe the definition of this binary disability progression label. Conceptually, disability progression is defined as a sustained increase in EDSS over time.

Because assessing progression requires a baseline EDSS value to compare with, predictions were made at visit dates where an EDSS measurement was recorded. In our notation, t = 0 denotes the time of the visit at which the prediction is made, and the baseline EDSS is thus written as EDSSt=0. Motivated by the non-linearity of the EDSS, unconfirmed disability progression (w = 1) after two years (t = 2y) is defined as follows [13]:

$$
w = \begin{cases}
1 & \text{if } \mathrm{EDSS}_{t=2y} - \mathrm{EDSS}_{t=0} \geq 1.5 \text{ and } \mathrm{EDSS}_{t=0} = 0 \\
1 & \text{if } \mathrm{EDSS}_{t=2y} - \mathrm{EDSS}_{t=0} \geq 1 \text{ and } 0 < \mathrm{EDSS}_{t=0} \leq 5.5 \\
1 & \text{if } \mathrm{EDSS}_{t=2y} - \mathrm{EDSS}_{t=0} \geq 0.5 \text{ and } \mathrm{EDSS}_{t=0} > 5.5 \\
0 & \text{otherwise}
\end{cases} \tag{1}
$$

EDSSt=2y represents the last recorded EDSS before t = 2 years. We chose a time horizon of two years as a trade-off between short and long disease time scales. A short horizon would lead to very few confirmed progressions in the cohort, making predictive modeling difficult. A long horizon would result in fewer patients satisfying the inclusion criteria, reducing the sample size. Two years is a typical choice in the literature and is a relevant timescale for PwMS to plan their lives.
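For illustration, Eq (1) translates directly into a short Python function. This is a minimal sketch of ours, not taken from the released pipeline; the function name and signature are hypothetical.

```python
def unconfirmed_progression(edss_baseline: float, edss_2y: float) -> int:
    """Unconfirmed two-year disability progression label w of Eq (1)."""
    delta = edss_2y - edss_baseline
    if edss_baseline == 0:
        return int(delta >= 1.5)   # increase of at least 1.5 from EDSS 0
    if edss_baseline <= 5.5:
        return int(delta >= 1.0)   # increase of at least 1 from EDSS in (0, 5.5]
    return int(delta >= 0.5)       # increase of at least 0.5 from EDSS > 5.5
```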

The EDSS suffers from inter- and intra-rater variability [24]. The actual state of the patient also fluctuates, because of, e.g., recent relapses from which the patient could still (partly) recover. We therefore studied disability progression confirmed (wc) for at least six months. Progression was confirmed if all EDSS values measured within six months after the progression event, as well as the first EDSS measurement after two years, led to the same worsening target w = 1 according to Eq (1). EDSS measurements within one month after a relapse were not taken into account for confirming disability progression [13]. wc represents the target binary label used to train the machine learning models.

Importantly, if progression (w = 1) could not be confirmed because there were no EDSS measurements after two years that could be used for confirmation, it was not considered a valid target and no label was derived. If progression could not be confirmed because an EDSS value used for confirmation led to w = 0, it counted as no confirmed disability progression (wc = 0). If there was no disability progression (w = 0), no confirmation was needed to make it a valid target. Note that even with confirmation for at least six months, around 20% of progression events are expected to regress after more than five years [25]. However, disability progression that lasts several years is a relevant outcome for a person with MS.
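The confirmation logic can be sketched as follows, building on the `unconfirmed_progression` function above. This is a simplified illustration of the rules just described, with hypothetical variable names, and it applies the one-month relapse exclusion to all measurements for brevity; the exact implementation, with all edge cases, is in the released pipeline.

```python
def confirmed_progression(edss_baseline, visits, relapse_times,
                          horizon=2.0, window=0.5, relapse_excl=1/12):
    """Confirmed label wc; returns None when no valid label can be derived.
    `visits` is a list of (time_in_years, edss) pairs after t = 0."""
    def usable(t):  # drop EDSS measured within one month after a relapse [13]
        return all(not (0.0 <= t - r < relapse_excl) for r in relapse_times)

    usable_visits = sorted((t, e) for t, e in visits if usable(t))
    before = [(t, e) for t, e in usable_visits if t <= horizon]
    after = [(t, e) for t, e in usable_visits if t > horizon]
    if not before:
        return None
    t_prog, edss_2y = before[-1]  # last recorded EDSS before t = 2 years
    if unconfirmed_progression(edss_baseline, edss_2y) == 0:
        return 0                  # w = 0 needs no confirmation
    if not after:
        return None               # nothing to confirm with: invalid episode
    confirming = [e for t, e in usable_visits if t_prog < t <= t_prog + window]
    confirming.append(after[0][1])  # first EDSS measurement after two years
    return int(all(unconfirmed_progression(edss_baseline, e) for e in confirming))
```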

We note that the above definition of confirmed disability progression was introduced and clinically motivated in Kalincik et al. [13]. Although it is only a surrogate for the actual and complex disease progression mechanism, it represents a clinically validated label that is more amenable to statistical and machine learning analysis.

Definition of clinical episodes

For each patient, every visit can potentially provide a valid baseline EDSS for a progression episode. More generally, it is possible to divide the available clinical history of a patient into multiple (potentially overlapping) episodes for which a disability progression label can be computed. Each episode therefore consists of an observation window (the clinical history before t = 0), a baseline EDSS (EDSSt=0), and a confirmation label (wc), as shown in Fig 6. Extracting several episodes per patient allowed us to significantly increase the number of data points in the study.

Fig 6. Problem Setup.

Fig 6

A: For each patient episode, the available data for prediction consists of the baseline data and the longitudinal clinical data in the observation window. Disability progression (wc) was assessed based on the difference between the EDSS at time t = 0 and two years later (t = t2y), as defined in Eq (1). B: Based on the available historical clinical data (in the observation time window), we aimed at training a model able to predict the probability p(wc) of disability progression at a two-year horizon (t2y).

To assess the impact of follow-up on the performance of the models, we defined two cohorts of patients: one with a minimum of three EDSS measurements, the other with a minimum of six EDSS measurements, over the last three years and three months of the observation window. While our results focus on the cohort with a minimum of three EDSS measurements, performance results for the other cohort are presented in the supporting information. The three-measurement requirement excluded patients with a less than yearly (respectively twice-yearly, for six measurements) EDSS follow-up frequency. The three additional months were chosen to allow for some margin regarding when the yearly visit was planned.

Episodes were considered valid if they met the following criteria:

  • A valid confirmed disability progression label (wc) could be computed at t = 0.

  • The time at which the prediction was made was after January 1st, 1990 (t0 > 1990). This ensured that we had a cohort of patients from decades in which disease modifying therapies (DMTs) were available [26].

  • There were at least k EDSS measurements in the last three years and three months of the observation window, where k is either three or six.

Examples of valid and invalid episodes are presented in Fig 7. The final cohorts comprised a total of 283,115 valid episodes from 26,246 patients for a minimum of three EDSS measurements, and 166,172 valid episodes from 15,240 patients for a minimum of six EDSS measurements. For the 3-visit cohort, 11⋅64% of the episodes represented a progression event, a mild class imbalance. We addressed this imbalance by re-weighting each sample inversely to the frequency of its label, as sketched below.
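A minimal sketch of this re-weighting, assuming binary labels in a NumPy array; the function name is ours, not from the released pipeline.

```python
import numpy as np

def label_weights(labels: np.ndarray) -> np.ndarray:
    """Weight each episode inversely to the frequency of its label."""
    counts = np.bincount(labels, minlength=2)
    inv_freq = len(labels) / (2.0 * counts)  # standard "balanced" weighting
    return inv_freq[labels]

# With ~11.64% positive episodes, positives receive a weight of about 4.3 and
# negatives about 0.57, so both classes contribute equally to the training loss.
```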

Fig 7. Examples of valid and non-valid episodes.

Fig 7

The time is in years (y) and months (m). (a) Confirmed progression after two years. The EDSS around 2y6m is not used to confirm the progression, because it occurs within 1 month after a relapse. Progression is confirmed with the EDSS measurement around 4y. There are 3 EDSS measurements between −3y and 0y, which is enough follow-up data. (b) This is not a valid sample: there are not enough EDSS measurements between −3y and 0y. (c) This is not a valid sample: the progression cannot be confirmed because there are no EDSS values after 2y. (d) This is a valid sample: the EDSS decreases after 2y, so this counts as no disability progression. (e) This is a valid sample: w = 0, so no confirmation is needed.

Variables

A set of clinical variables was retained from all available variables and included in the observation window of each episode. The following static (i.e., non-varying over time) variables were selected: birth date, sex, MS onset date, education status (higher education, no higher education, unknown) and the location of the first symptom (i.e., supratentorial, optic pathways, brainstem or spinal cord).

The following longitudinal variables were also collected in the observation window (i.e., for times t ≤ 0): EDSS, MS course (Relapsing Remitting MS (RRMS), Primary Progressive MS (PPMS), Secondary Progressive MS (SPMS), Clinically Isolated Syndrome (CIS)), relapse occurrence, relapse position (pyramidal tract, brainstem, bowel bladder, cerebellum, visual function, sensory), all Kurtzke functional system (KFS) scores, and Fampridine administration. The disease modifying therapies (DMTs) and immunosuppressants were categorized into low-efficacy, moderate-efficacy, and high-efficacy:

  • Low-efficacy: Interferons, Teriflunomide, Glatiramer, Azathioprine, Methotrexate.

  • Moderate-efficacy: Fingolimod, Dimethyl-Fumarate, Cladribine, Siponimod, Daclizumab.

  • High-efficacy: Alemtuzumab, Rituximab, Ocrelizumab, Natalizumab, Mitoxantrone, Cyclophosphamide.

Except for Mitoxantrone and Cyclophosphamide, we assumed that only one DMT was administered at a time. This implies that if a new DMT was started, the administration of the previous DMT was considered to have ended, even if no end date was registered in the data. Mitoxantrone and Cyclophosphamide can be administered in combination with another DMT. Indeed, they are induction DMTs and are thus expected to have a long-term effect. Therefore, only the start dates of these two DMTs were recorded, and they were coded by a separate category: highly active induction DMTs. Alemtuzumab and Cladribine are also induction DMTs, but in contrast to Mitoxantrone and Cyclophosphamide they are not combined with other DMTs. If a new DMT was started, Alemtuzumab or Cladribine was assumed to have been ineffective, and the start date of the new DMT was taken as its end date.

MRI variables were not included due to high missingness. Indeed, lesion counts were available in fewer than 1⋅7% of the clinical episodes. The variable indicating whether the MRI was normal, abnormal MS-typical, or abnormal MS-atypical was judged insufficiently informative.

The above variables were then grouped into three feature sets: static, dynamic (summary statistics of the clinical history), and longitudinal [17]. These represent increasing quantities of information regarding the clinical history of patients.

Grouping of the included clinical variables

The static feature set contains variables available at t = 0, without taking into account possible previous values. Categorical variables are encoded as indicator variables. For example, sex is encoded as female 'yes / no' and male 'yes / no'. If a feature contains missing values, the category 'unknown' is added. The EDSS and the KFS scores were treated as continuous variables, even though they are categorical. The variables of the static feature set are: sex, age (years), age at MS onset (years), disease duration (years), MS course at t = 0 (RRMS, SPMS, PPMS), EDSS at t = 0, last used DMT at t = 0, use of induction DMT at t = 0, all KFS scores at t = 0, education status, first symptom (supratentorial, optic pathways, brainstem, spinal cord, or missing), time of prediction (years since 1990), and time of diagnosis (years since 1990).

The dynamic feature set adds information about the clinical history before t = 0 (longitudinal information) to the static dataset. It contains variables that are hand-engineered from the longitudinal variables: number of visits in the last 3.25 years, the minimum and maximum over the whole history (t ≤ 0) of the EDSS and all KFS variables, mean and standard deviation over the last 3.25 years of the EDSS and all KFS scores, oldest EDSS and KFS scores measured in the last 3.25 years, relapse rate over the whole history (number of relapses divided by the follow-up period since the first clinical visit), time since the last relapse (years), presence of a high-efficacy DMT in the past, disease duration until a first DMT was administered, disease duration until a high-efficacy DMT was administered, time spent on a DMT during the disease duration (ratio of time on a DMT divided by the time since MS onset), and time since the last Fampridine administration.

The variables representing the time since the last relapse, the disease duration until a DMT was administered, the disease duration until a high-efficacy DMT was administered, and the time since the last Fampridine administration were transformed according to a 1/(1 + t) scaling, with t the actual time. If no time could be defined because, e.g., no DMT had ever been administered, the transformed variable was set to 0. If t < 0, which can happen because of erroneous dates in the dataset, the transformed variable was set to 1.
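A sketch of this transform, with an optional-time convention of our choosing (None encodes "the event never occurred"):

```python
from typing import Optional

def time_transform(t: Optional[float]) -> float:
    """1/(1 + t) scaling of a time-since-event variable (t in years)."""
    if t is None:   # event never occurred, e.g. no DMT ever administered
        return 0.0
    if t < 0:       # erroneous dates in the dataset
        return 1.0
    return 1.0 / (1.0 + t)  # recent events map near 1, distant events decay to 0
```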

The longitudinal feature set contains the dates and values for the following variables: all measured EDSS values and KFS scores, relapses occurrence (encoded as a binary variable set to 1 when a relapse occurs), relapse position (brainstem, pyramidal tract or other), cumulative relapse count, MS course, DMT administration (start and end dates), induction DMT administration (start date), and Fampridine administration. The timing of measurements was expected to be informative [17, 23].

Models

Disability progression prediction was framed as a classification problem. There exists a large literature on machine learning models for clinical applications [27–29]. The following models were used to predict disability progression: a multi-layer perceptron, a Bayesian neural network, and a temporal attention model with continuous temporal embeddings [30]. This work was supported by a large project (Flanders AI), and those models were selected as the best performing ones among a larger array of candidate models implemented by the different partners (see S1 Text for details). We followed the TRIPOD guidelines for reporting prognostic models [31]. The checklist can be found in Fig 8, at the end of this section.

Fig 8. TRIPOD checklist.

Fig 8

The multi-layer perceptron model is a neural network architecture that takes as input the static and dynamic feature sets, represented as a fixed-length vector. The model is composed of five hidden layers of dimension 128.
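A minimal PyTorch sketch of such an architecture, with the five hidden layers of width 128 described above; the activation function and the single-logit output head are our assumptions, not specifications from the paper.

```python
import torch.nn as nn

def make_mlp(n_features: int, hidden: int = 128, n_layers: int = 5) -> nn.Sequential:
    """MLP over the fixed-length static + dynamic feature vector."""
    layers, width = [], n_features
    for _ in range(n_layers):
        layers += [nn.Linear(width, hidden), nn.ReLU()]
        width = hidden
    layers.append(nn.Linear(width, 1))  # one logit; train with BCEWithLogitsLoss
    return nn.Sequential(*layers)
```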

The Bayesian neural network has an architecture similar to the multi-layer perceptron, but provides uncertainty estimates on the weights of the last hidden layer by incorporating MC dropout [32]. This should confer better generalization capabilities as well as better calibration.
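MC dropout keeps dropout active at prediction time and averages several stochastic forward passes. A generic sketch of this inference step, assuming the model contains dropout layers (the paper restricts dropout to the last hidden layer, which we do not reproduce here):

```python
import torch

def mc_dropout_predict(model: torch.nn.Module, x: torch.Tensor,
                       n_samples: int = 50) -> torch.Tensor:
    """Average predicted probability over stochastic forward passes."""
    model.train()  # keep dropout layers stochastic at inference time
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(model(x)) for _ in range(n_samples)])
    return probs.mean(dim=0)
```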

The temporal attention model relies on a transformer architecture [30]. In contrast to the previous models, this architecture can handle the longitudinal feature set, as it is able to process the whole clinical time series. Each visit is encoded as a fixed-length vector, along with a mask for missing features and a continuous temporal embedding. This temporal embedding allows for arbitrary time differences between measurements and is therefore especially suited for clinical time series, where irregular sampling is common. The static and dynamic feature sets were included in the model as extra temporal features that are repeated over the patient history. Two temporal attention layers with dimension 128 were used. The code for training the models and the final models are publicly available at https://gitlab.com/edebrouwer/ms_benchmark.

Evaluation

The dataset was split into 60% for training, 20% for validation, and 20% for testing. The validation data was used to optimize the hyperparameters of the models. Post-hoc calibration methods (Platt scaling [33] and isotonic regression [34]) were fit on the validation set, and the performance was evaluated on the test set.
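Both post-hoc calibrators can be expressed in a few lines with scikit-learn. This sketch assumes NumPy arrays of model scores and binary labels; it is an illustration of the general technique, not the pipeline code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

def platt_scale(val_scores, val_labels, test_scores):
    """Fit a logistic calibrator on validation scores, apply it to test scores."""
    lr = LogisticRegression().fit(np.asarray(val_scores).reshape(-1, 1), val_labels)
    return lr.predict_proba(np.asarray(test_scores).reshape(-1, 1))[:, 1]

def isotonic_scale(val_scores, val_labels, test_scores):
    """Non-parametric monotone calibration via isotonic regression."""
    iso = IsotonicRegression(out_of_bounds="clip").fit(val_scores, val_labels)
    return iso.predict(test_scores)
```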

The test set was not seen during model training and hyperparameter optimization. To produce a measure of uncertainty of the performance of the models, the procedure of splitting the data and training the models was repeated five times, corresponding to five splits.

As the dataset consists of patients from different centers, we split the dataset such that the validation and test sets represent an external validation. Patients from the same centers were therefore assigned to the same set (training, validation or test).
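A sketch of this center-level split using scikit-learn's GroupShuffleSplit; the 60/20/20 proportions follow the text above, and the variable names are ours.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def split_by_center(center_ids: np.ndarray, seed: int = 0):
    """60/20/20 train/validation/test split with whole centers kept together."""
    gss = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=seed)
    train_idx, rest_idx = next(gss.split(center_ids, groups=center_ids))
    # split the remaining 40% evenly into validation and test, again by center
    gss2 = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    val_rel, test_rel = next(gss2.split(rest_idx, groups=center_ids[rest_idx]))
    return train_idx, rest_idx[val_rel], rest_idx[test_rel]
```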

Discrimination was evaluated using the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (AUC-PR). Calibration was evaluated numerically using the Brier score and the expected calibration error (ECE) with 20 bins. Calibration was also evaluated visually using reliability diagrams.
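The four metrics can be computed as follows. In this sketch, AUC-PR is approximated by average precision and the ECE uses 20 equal-width bins as stated above; the function names are ours.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

def expected_calibration_error(y_true, y_prob, n_bins: int = 20) -> float:
    """Weighted average gap between predicted probability and observed frequency."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

def evaluate(y_true, y_prob) -> dict:
    return {
        "ROC-AUC": roc_auc_score(y_true, y_prob),
        "AUC-PR": average_precision_score(y_true, y_prob),
        "Brier": brier_score_loss(y_true, y_prob),
        "ECE": expected_calibration_error(y_true, y_prob),
    }
```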

The list of the main hyperparameters of each method, along with the values used for cross-validation, is presented in S9, S10, S11, S12, S13 and S14 Tables.

Tripod checklist

The design of the algorithms carefully followed the TRIPOD checklist, as shown in Fig 8. All points were checked or were not applicable to our study. The points requiring comment are the following:

  • 6b. Report any actions to blind assessment of the outcome to be predicted.

  • 11. Provide details on how risk groups were created, if done. No risk groups were identified in this study.

  • 14b. This can only be done for statistical models. However, we reported measures of variable importance.

  • 17. Model updating. The models proposed here were not updates of previous iterations but rather their first development.

Note also that no sample size calculations were performed; the size of this retrospective dataset was fixed.

Supporting information

S1 Fig. ROC-AUC scores per MS center.

ROC-AUC of individual centers in the test set against the size of the center. As the size of the centers grows, the performance converges to the average ROC-AUC. As the size of centers shrinks, the variability in performance increases, which is statistically expected due to low sample size. Centers with no progression are not plotted (because ROC-AUC is not defined in this case).

(PDF)

pdig.0000533.s001.pdf (27.3KB, pdf)
S2 Fig. Visualization of the different countries in the dataset.

Each country is represented as the set of vectors of static variables for each episode. A distance between countries was computed using the earth mover's distance. The 2D visualization was obtained using multidimensional scaling (MDS).

(PNG)

pdig.0000533.s002.png (34.8KB, png)
S3 Fig. Visualization of the different clinical centers in the dataset.

Each center is represented as the set of vectors of static variables for each episode. A distance between centers was computed using the earth mover's distance. The 2D visualization was obtained using multidimensional scaling (MDS). Each center is colored by its country of origin.

(PNG)

pdig.0000533.s003.png (49.7KB, png)
S4 Fig. Calibration diagram for all models.

Calibration curves of the different models on the test set (fold 0, i.e., the first train-test split). Calibration was performed using Platt scaling [33]. A good calibration was observed for all models. The discrepancy with the ideal line (dotted) in the higher-score regime can be explained by the lower number of data points in that region, leading to more variance.

(PDF)

pdig.0000533.s004.pdf (23.5KB, pdf)
S5 Fig. Predicted percentage of worsening per subgroup.

Predicted percentage of worsening per subgroup, for both MS courses and EDSS larger or smaller than 5.5. Green is the actual prevalence for the groups on the x-axis; red and purple are model predictions. This shows the calibration performance for different subgroups. An acceptable discrepancy of at most 3 percentage points is observed, with a tendency of the models to underestimate the prevalence of disability progression.

(PDF)

pdig.0000533.s005.pdf (18.8KB, pdf)
S1 Table. Summary statistics of the patient cohort.

Summary statistics of the cohort of interest after patient and sample selection. For all variables the value at the last recorded visit was used. KFS stands for Kurtzke Functional Systems Score, DMT for Disease Modifying Therapy, CIS for Clinically Isolated Syndrome.

(PDF)

pdig.0000533.s006.pdf (56.4KB, pdf)
S2 Table. Summary statistics of the performance measures (Cohort with minimum 3 visits).

ROC-AUC, AUC-PR, Brier score and ECE of all models (averages ± standard deviations). Cohort of patients with at least 3 visits with EDSS in the last 3.25 years.

(PDF)

pdig.0000533.s007.pdf (34.6KB, pdf)
S3 Table. Summary statistics of the performance measures (Cohort with minimum 6 visits).

ROC-AUC, AUC-PR, Brier score and ECE of all models (averages ± standard deviations). Cohort of patients with at least 6 visits with EDSS in the last 3.25 years.

(PDF)

pdig.0000533.s008.pdf (34.6KB, pdf)
S4 Table. Summary statistics of the performance measures on different MS subgroups (Cohort with minimum 3 visits).

ROC-AUC, AUC-PR, Brier score and ECE of all models on the different MS course subgroups (averages ± standard deviations). Primary Progressive (PP), Relapsing Remitting (RR), and Secondary Progressive (SP) are considered. Cohort of patients with at least 3 visits with EDSS in the last 3.25 years.

(PDF)

pdig.0000533.s009.pdf (35.2KB, pdf)
S5 Table. Summary statistics of the performance measures on different MS subgroups (Cohort with minimum 6 visits).

ROC-AUC, AUC-PR, Brier score and ECE of all models on the different MS course subgroups (averages ± standard deviations). Primary Progressive (PP), Relapsing Remitting (RR), and Secondary Progressive (SP) are considered. Cohort of patients with at least 6 visits with EDSS in the last 3.25 years.

(PDF)

pdig.0000533.s010.pdf (35.2KB, pdf)
S6 Table. Summary statistics of the performance measures on different severity subgroups (Cohort with minimum 3 visits).

ROC-AUC, AUC-PR, Brier score and ECE by severity subgroup (averages ± standard deviations). Low-severity patients are defined as those with EDSS ≤ 5.5 at baseline, while high-severity patients are defined as having EDSS > 5.5 at baseline. Cohort of patients with at least 3 visits with EDSS in the last 3.25 years.

(PDF)

pdig.0000533.s011.pdf (36KB, pdf)
S7 Table. Summary statistics of the performance measures on different severity subgroups (Cohort with minimum 6 visits).

ROC-AUC, AUC-PR, Brier score and ECE by severity subgroup (averages ± standard deviations). Low-severity patients are defined as those with EDSS ≤ 5.5 at baseline, while high-severity patients are defined as having EDSS > 5.5 at baseline. Cohort of patients with at least 6 visits with EDSS in the last 3.25 years.

(PDF)

pdig.0000533.s012.pdf (36KB, pdf)
S8 Table. Feature importance for different performance metrics.

Features are ranked by order of importance for the Dynamic Model. Feature importance is assessed by the average difference in performance when the specific feature is shuffled. Averages ± standard deviations are reported.

(PDF)

pdig.0000533.s013.pdf (34.2KB, pdf)
S9 Table. Hyperparameters table for the temporal attention model.

List of hyperparameters used for training the models.

(PDF)

pdig.0000533.s014.pdf (19KB, pdf)
S10 Table. Hyperparameters table for the multi-layer perceptron model.

List of hyperparameters used for training the models.

(PDF)

pdig.0000533.s015.pdf (18.4KB, pdf)
S11 Table. Hyperparameters table for the recurrent neural network model.

List of hyperparameters used for training the models.

(PDF)

pdig.0000533.s016.pdf (19.3KB, pdf)
S12 Table. Hyperparameters table for the dynamic MTP model.

List of hyperparameters used for training the models.

(PDF)

pdig.0000533.s017.pdf (18.7KB, pdf)
S13 Table. Hyperparameters table for the factorization machines model.

List of hyperparameters used for training the models.

(PDF)

pdig.0000533.s018.pdf (18KB, pdf)
S14 Table. Hyperparameters table for the logistic regression model.

List of hyperparameters used for training the models.

(PDF)

pdig.0000533.s019.pdf (17.2KB, pdf)
S1 Text. Models description.

Description of the Bayesian neural networks, DeepMTP, and Factorization Machines models.

(PDF)

pdig.0000533.s020.pdf (129.4KB, pdf)

Acknowledgments

The authors also wish to acknowledge the MSBase contributors for sharing the clinical data:

  • Eva Kubala Havrdova, Charles University in Prague and General University Hospital, Prague, Czech Republic

  • Serkan Ozakbas, Dokuz Eylul University, Konak/Izmir, Turkey

  • Marco Onofrj, University G. d’Annunzio, Chieti, Italy

  • Raed Alroughani, Amiri Hospital, Sharq, Kuwait

  • Maria Pia Amato, University of Florence, Florence, Italy

  • Katherine Buzzard, Box Hill Hospital, Melbourne, Australia

  • Cavit Boz, KTU Medical Faculty Farabi Hospital, Trabzon, Turkey

  • Vahid Shaygannejad, Isfahan University of Medical Sciences, Isfahan, Iran

  • Jens Kuhle, Universitatsspital Basel, Basel, Switzerland

  • Bassem Yamout, American University of Beirut Medical Center, Beirut, Lebanon

  • Recai Turkoglu, Haydarpasa Numune Training and Research Hospital, Istanbul, Turkey

  • Julie Prevost, CSSS Saint-Jérôme, Saint-Jerome, Canada

  • Ernest Butler, Monash Medical Centre, Melbourne, Australia

  • Celia Oreja-Guevara, Hospital Clinico San Carlos, Madrid, Spain

  • Richard Macdonell, Austin Health, Melbourne, Australia

  • Ricardo Fernandez Bolaños, Hospital Universitario Virgen de Valme, Seville, Spain

  • Marie D’hooghe, Nationaal MS Centrum, Melsbroek, Belgium

  • Liesbeth Van Hijfte, Universitary Hospital Ghent, Ghent, Belgium

  • Helmut Butzkueven, The Alfred Hospital, Melbourne, Australia

  • Michael Barnett, Brain and Mind Centre, Sydney, Australia

  • Justin Garber, Westmead Hospital, Sydney, Australia

  • Sarah Besora, Hospital Universitari MútuaTerrassa, Barcelona, Spain

  • Edgardo Cristiano, Centro de Esclerosis Múltiple de Buenos Aires (CEMBA), Buenos Aires, Argentina

  • Magd Zakaria, Ain Shams University

  • Maria Laura Saladino, INEBA—Institute of Neuroscience Buenos Aires, Buenos Aires, Argentina

  • Shlomo Flechter, Assaf Harofeh Medical Center, Beer-Yaakov, Israel

  • Leontien Den braber-Moerland, Francicus Ziekenhuis, Roosendaal, Netherlands

  • Fraser Moore, Jewish General Hospital, Montreal, Canada

  • Rana Karabudak, Hacettepe University, Ankara, Turkey

  • Claudio Gobbi, Ospedale Civico Lugano, Lugano, Switzerland

  • Jennifer Massey, St Vincent’s Hospital, Sydney, Australia

  • Nevin Shalaby, Kasr Al Ainy MS research Unit (KAMSU), Cairo, Egypt

  • Jabir Alkhaboori, Royal Hospital, Muscat, Oman

  • Cameron Shaw, Geelong Hospital, Geelong, Australia

  • Jose Andres Dominguez, Hospital Universitario de la Ribera, Alzira, Spain

  • Jan Schepel, Waikato Hospital, Hamilton, New Zealand

  • Krisztina Kovacs, Péterfy Sandor Hospital, Budapest, Hungary

  • Pamela McCombe, Royal Brisbane and Women’s Hospital, Brisbane, Australia

  • Bhim Singhal, Bombay Hospital Institute of Medical Sciences, Mumbai, India

  • Mike Boggild, Townsville Hospital, Townsville, Australia

  • Imre Piroska, Veszprém Megyei Csolnoky Ferenc Kórház zrt., Veszprem, Hungary

  • Neil Shuey, St Vincents Hospital, Fitzroy, Melbourne, Australia

  • Carlos Vrech, Sanatorio Allende, Cordoba, Argentina

  • Tatjana Petkovska-Boskova, Clinic of Neurology Clinical Center, Skopje, Macedonia

  • Ilya Kister, New York University Langone Medical Center, New York, United States

  • Cees Zwanikken, University Hospital Nijmegen, Nijmegen, Netherlands

  • Jamie Campbell, Craigavon Area Hospital, Craigavon, United Kingdom

  • Etienne Roullet, MS Clinic, Hopital Tenon, Paris, France

  • Cristina Ramo-Tello, Hospital Germans Trias i Pujol, Badalona, Spain

  • Jose Antonio Cabrera-Gomez, Centro Internacional de Restauracion Neurologica, Havana, Cuba

  • Maria Edite Rio, Centro Hospitalar Universitario de Sao Joao, Porto, Portugal

  • Pamela McCombe, University of Queensland, Brisbane, Australia

  • Mark Slee, Flinders University, Adelaide, Australia

  • Saloua Mrabet, Razi Hospital, Manouba, Tunisia

Data Availability

The dataset used in this study is available upon request to the MSBase principal investigators included in the study. MSBase operates as a single point of contact to facilitate the data sharing agreements with the individual data custodians. Inquiries should be addressed to info@msbase.org. Access to the data is restricted to ensure a controlled usage of patients' data and to stay in line with specific data ownership requirements. The data processing and training scripts to reproduce all experiments are publicly available at https://gitlab.com/edebrouwer/ms_benchmark.

Funding Statement

This study was funded by the Research Foundation Flanders (FWO) and the Flemish government through the Onderzoeksprogramma Artificiele Intelligentie (AI) Vlaanderen program (https://www.flandersairesearch.be/en). This funding was awarded to YM, LB, TD, DD, WW, and BDB and funded EDB, TB, LWB, PD, DI, MS, YM, LB, TD, DD, WW, and BDB. EDB was also concomitantly funded by a FWO-SB fellowship (1S98821N - https://fwo.be). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Reich DS, Lucchinetti CF, Calabresi PA. Multiple Sclerosis. New England Journal of Medicine. 2018;378(2):169–180. doi: 10.1056/NEJMra1401483 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Walton C, King R, Rechtman L, Kaye W, Leray E, Marrie RA, et al. Rising prevalence of multiple sclerosis worldwide: Insights from the Atlas of MS. Multiple Sclerosis Journal. 2020;26(14):1816–1821. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Degenhardt A, Ramagopalan SV, Scalfari A, Ebers GC. Clinical prognostic factors in multiple sclerosis: a natural history review. Nature Reviews Neurology. 2009;5(12):672–682. doi: 10.1038/nrneurol.2009.178 [DOI] [PubMed] [Google Scholar]
  • 4. Dennison L, Brown M, Kirby S, Galea I. Do people with multiple sclerosis want to know their prognosis? A UK nationwide study. PLOS ONE. 2018;13(2):1–14. doi: 10.1371/journal.pone.0193407 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Brown FS, Glasmacher SA, Kearns PKA, MacDougall N, Hunt D, Connick P, et al. Systematic review of prediction models in relapsing remitting multiple sclerosis. PLOS ONE. 2020;15(5):1–13. doi: 10.1371/journal.pone.0233575 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Havas J, Leray E, Rollot F, Casey R, Michel L, Lejeune F, et al. Predictive medicine in multiple sclerosis: A systematic review. Multiple Sclerosis and Related Disorders. 2020;40:101928. doi: 10.1016/j.msard.2020.101928 [DOI] [PubMed] [Google Scholar]
  • 7. Seccia R, Romano S, Salvetti M, Crisanti A, Palagi L, Grassi F. Machine Learning Use for Prognostic Purposes in Multiple Sclerosis. Life. 2021;11(2). doi: 10.3390/life11020122 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Hartmann M, Fenton N, Dobson R. Current review and next steps for artificial intelligence in multiple sclerosis risk research. Computers in Biology and Medicine. 2021;132:104337. doi: 10.1016/j.compbiomed.2021.104337 [DOI] [PubMed] [Google Scholar]
  • 9. Peeters LM, Parciak T, Kalra D, Moreau Y, Kasilingam E, Van Galen P, et al. Multiple Sclerosis Data Alliance–A global multi-stakeholder collaboration to scale-up real world data research. Multiple Sclerosis and Related Disorders. 2021;47:102634. doi: 10.1016/j.msard.2020.102634 [DOI] [PubMed] [Google Scholar]
  • 10. Tintore M, Rovira À, Río J, Otero-Romero S, Arrambide G, Tur C, et al. Defining high, medium and low impact prognostic factors for developing multiple sclerosis. Brain. 2015;138(7):1863–1874. doi: 10.1093/brain/awv105 [DOI] [PubMed] [Google Scholar]
  • 11. Kelly CJ, Karthikesalingam A, Suleyman M, Corrado G, King D. Key challenges for delivering clinical impact with artificial intelligence. BMC medicine. 2019;17(1):1–9. doi: 10.1186/s12916-019-1426-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Mateen BA, Liley J, Denniston AK, Holmes CC, Vollmer SJ. Improving the quality of machine learning in health applications and clinical research. Nature Machine Intelligence. 2020;2(10):554–556. doi: 10.1038/s42256-020-00239-1 [DOI] [Google Scholar]
  • 13. Kalincik T, Manouchehrinia A, Sobisek L, Jokubaitis V, Spelman T, Horakova D, et al. Towards personalized therapy for multiple sclerosis: prediction of individual treatment response. Brain. 2017;140(9):2426–2443. doi: 10.1093/brain/awx185 [DOI] [PubMed] [Google Scholar]
  • 14. Chalkou K, Steyerberg E, Bossuyt P, Subramaniam S, Benkert P, Kuhle J, et al. Development, validation and clinical usefulness of a prognostic model for relapse in relapsing-remitting multiple sclerosis. Diagnostic and prognostic research. 2021;5(1):1–16. doi: 10.1186/s41512-021-00106-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Butzkueven H, Chapman J, Cristiano E, Grand’Maison F, Hoffmann M, Izquierdo G, et al. MSBase: an international, online registry and platform for collaborative outcomes research in multiple sclerosis. Multiple Sclerosis Journal. 2006;12(6):769–774. doi: 10.1177/1352458506070775 [DOI] [PubMed] [Google Scholar]
  • 16. Kalincik T, Butzkueven H. The MSBase registry: Informing clinical practice. Multiple Sclerosis Journal. 2019;25(14):1828–1834. doi: 10.1177/1352458519848965 [DOI] [PubMed] [Google Scholar]
  • 17. De Brouwer E, Becker T, Moreau Y, Havrdova EK, Trojano M, Eichau S, et al. Longitudinal machine learning modeling of MS patient trajectories improves predictions of disability progression. Computer Methods and Programs in Biomedicine. 2021;106180. doi: 10.1016/j.cmpb.2021.106180 [DOI] [PubMed] [Google Scholar]
  • 18. Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. Journal of Clinical Epidemiology. 2016;74:167–176. doi: 10.1016/j.jclinepi.2015.12.005 [DOI] [PubMed] [Google Scholar]
  • 19. Altmann A, Toloşi L, Sander O, Lengauer T. Permutation importance: a corrected feature importance measure. Bioinformatics. 2010;26(10):1340–1347. doi: 10.1093/bioinformatics/btq134 [DOI] [PubMed] [Google Scholar]
  • 20. Schlaeger R, D’Souza M, Schindler C, Grize L, Dellas S, Radue E, et al. Prediction of long-term disability in multiple sclerosis. Multiple Sclerosis Journal. 2012;18(1):31–38. doi: 10.1177/1352458511416836 [DOI] [PubMed] [Google Scholar]
  • 21. de Groot V, Beckerman H, Uitdehaag BM, Hintzen RQ, Minneboo A, Heymans MW, et al. Physical and Cognitive Functioning After 3 Years Can Be Predicted Using Information From the Diagnostic Process in Recently Diagnosed Multiple Sclerosis. Archives of Physical Medicine and Rehabilitation. 2009;90(9):1478–1488. doi: 10.1016/j.apmr.2009.03.018 [DOI] [PubMed] [Google Scholar]
  • 22. von Gumberz J, Mahmoudi M, Young K, Schippling S, Martin R, Heesen C, et al. Short-term MRI measurements as predictors of EDSS progression in relapsing-remitting multiple sclerosis: grey matter atrophy but not lesions are predictive in a real-life setting. PeerJ. 2016;4:e2442. doi: 10.7717/peerj.2442 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Agniel D, Kohane IS, Weber GM. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ. 2018;361. doi: 10.1136/bmj.k1479 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Goodkin DE, Cookfair D, Wende K, Bourdette D, Pullicino P, Scherokman B, et al. Inter- and intrarater scoring agreement using grades 1.0 to 3.5 of the Kurtzke Expanded Disability Status Scale (EDSS). Neurology. 1992;42(4):859–859. doi: 10.1212/WNL.42.4.859 [DOI] [PubMed] [Google Scholar]
  • 25. Kalincik T, Cutter G, Spelman T, Jokubaitis V, Havrdova E, Horakova D, et al. Defining reliable disability outcomes in multiple sclerosis. Brain. 2015;138(11):3287–3298. doi: 10.1093/brain/awv258 [DOI] [PubMed] [Google Scholar]
  • 26. Robertson D, Moreo N. Disease-modifying therapies in multiple sclerosis: overview and treatment considerations. Federal Practitioner. 2016;33(6):28. [PMC free article] [PubMed] [Google Scholar]
  • 27. Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al. A guide to deep learning in healthcare. Nature medicine. 2019;25(1):24–29. doi: 10.1038/s41591-018-0316-z [DOI] [PubMed] [Google Scholar]
  • 28. De Brouwer E, Gonzalez J, Hyland S. Predicting the impact of treatments over time with uncertainty aware neural differential equations. In: International Conference on Artificial Intelligence and Statistics. PMLR; 2022. p. 4705–4722.
  • 29. De Brouwer E, Krishnan RG. Anamnesic Neural Differential Equations with Orthogonal Polynomial Projections. In: The Eleventh International Conference on Learning Representations; 2022.
  • 30. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in Neural Information Processing Systems; 2017. p. 5998–6008.
  • 31. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) the TRIPOD statement. Circulation. 2015;131(2):211–219. doi: 10.1161/CIRCULATIONAHA.114.014508 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Gal Y, Ghahramani Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In: International Conference on Machine Learning. PMLR; 2016. p. 1050–1059.
  • 33. Platt JC. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers. 1999;10(3):61–74. [Google Scholar]
  • 34. Zadrozny B, Elkan C. Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2002. p. 694–699.
PLOS Digit Health. doi: 10.1371/journal.pdig.0000533.r001

Decision Letter 0

Martin G Frasch, Ryan S McGinnis

21 Aug 2023

PDIG-D-23-00247

Machine-learning-based prediction of disability progression in multiple sclerosis: an observational, international, multi-center study

PLOS Digital Health

Dear Dr. De Brouwer,

Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 60 days (by Oct 20 2023, 11:59 PM). If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Ryan S McGinnis

Academic Editor

PLOS Digital Health

Journal Requirements:

1. We ask that a manuscript source file is provided at Revision. Please upload your manuscript file as a .doc, .docx, .rtf or .tex.

2. Please provide separate figure files in .tif or .eps format only and remove any figures embedded in your manuscript file. Please also ensure that all files are under our size limit of 10MB.

For more information about figure files please see our guidelines:

https://journals.plos.org/digitalhealth/s/figures

https://journals.plos.org/digitalhealth/s/figures#loc-file-requirements

3. We have noticed that you have uploaded Supporting Information files, but you have not included a list of legends. Please add a full list of legends for your Supporting Information files after the references list.

Additional Editor Comments (if provided):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

Reviewer #3: Yes

--------------------

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I don't know

Reviewer #2: Yes

Reviewer #3: Yes

--------------------

3. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

--------------------

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

--------------------

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors present a machine learning method to predict whether disease progression will occur in persons with MS. They present numerous algorithms and train them on commonly available features. Overall, the paper presents meaningful findings; however, the framing of the paper and its claims are not supported by the results.

Major Concerns:

- The authors claim to train models 'for disability progression prediction in MS.' While this may be the end goal of the work, the work presented in this manuscript is a classification of whether a change will occur or not, not a prediction of the actual disease progression. The presented work is an important stepping stone; however, it cannot address the suggested uses for patient planning and care. Please reframe the paper in a more appropriate manner to reflect the results presented.

- The exclusion criteria exclude those who are newly diagnosed and/or currently have a steady disease state. This is partially addressed as a limitation in the discussion, but it needs to be expanded. These exclusions limit the model to predicting whether a change will occur in those who already present changes, instead of PwMS overall.

- P. 18: is this definition of disease progression a clinical standard, or is it introduced for this work? It needs to be defined and motivated, or cited.

- Please provide more details on model training and evaluation (e.g., training epochs).

- Several models are presented with very similar results, and claims are made about which is best. Please add significance tests to determine whether these differences are significant.

Minor Concerns:

- Throughout the manuscript there is a mix of present and past tense

- Define all acronyms before they are used in text

- Need to include definitions of acronyms in table captions

- Define all elements of equations clearly (e.g., page 26: parameter p seems to be defined in the figures but not in the text).

- Some results, such as table 5, may be easier to interpret as a figure/bar chart

- Please fix the reference to the supplementary materials on page 24, line 329 ('??'), and the corresponding table in the supplementary materials.

Reviewer #2: General:

This manuscript shows the ability of predicting MS progression based on a large data set. The study design is superior in its choice of, and rationale for, the chosen input variables, which all target clinical feasibility. The manuscript comes with some flaws in writing: the methods section is narrative in major parts, the structure is confusing, and the different names for the used models lack explanation. This stands in contrast to being applicable in clinical practice. Any clinician who reads the paper will stop reading the methods out of confusion, which would be a pity.

The discussion is missing, as there is not a single reference to other published models. In its current state, the paper needs to be rewritten in large parts to be comprehensive.

Introduction:

The introduction stays vague in developing the rationale for the parameters of interest. Also, no hypotheses are formulated. Instead, the results are already summarized, which feels like a repetition of the abstract.

Results:

The Cohort section of the results is mainly methods; I would therefore suggest removing this information.

Tables: please explain all the abbreviations. Also, the captions for tables are usually placed above the table, if not requested otherwise.

Discussion:

Was the model also tested on a shorter input span? It would be interesting to know the minimal number of required monitoring points.

The discussion does not contain a single reference to other published models (which were named in the introduction). Please report and compare the results to existing prediction models to give the reader a fair chance to evaluate the reported outcome.

Methods:

General:

Although I appreciate a detailed description of the methods, the amount of information is just overwhelming. Describing why a parameter is abbreviated is too much (e.g., why w is chosen to abbreviate worsening). The methods are too narrative and would be shorter with less storytelling. Also, the methods are quite chaotic.

If the section about valid/non-valid samples were clearer, Figure 5 would not be needed.

Line 240 ff: why were the two cohorts defined? There is so much description of things that are not important, but then this is not clear.

What was the rationale behind training the model with these long time spans? For efficient patient management, much shorter time frames are required. Also, what real predictive value does the model have if so much input is needed?

Line 291: 'If data can be missing...' probably better to use 'If data was missing...'

Line 330: supplementary material number is not valid.

It is completely unclear to me why three different models are first named and described briefly, followed by a section where they are explained again in great detail, but differently named and in a different order than in the summary description. Any reader who is not deeply informed about machine learning will be lost. I encourage the authors to be concise and precise.

Reviewer #3: While the authors present a useful application of standard machine learning methods and architectures in the clinical setting of MS, the novelty of the approaches used by the authors is somewhat limited. I would recommend the authors consider the following:

1. The authors claim to evaluate the performance of their model on an external dataset (where the "externalization" refers to splitting the data by patient center). Have the authors conducted any data-distribution level visualizations/statistics to verify whether the data occupy different distributions? Specifically, have the authors considered other axes of data heterogeneity in addition to patient population?

2. It is great that the authors assessed the degree of model calibration, together with the Brier score, in their work. Given the size of the dataset, have the authors considered the degree to which model predictions are repeatable across patients and/or explainable?

--------------------

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

--------------------

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLOS Digit Health. doi: 10.1371/journal.pdig.0000533.r003

Decision Letter 1

Martin G Frasch, Ryan S McGinnis

27 Feb 2024

PDIG-D-23-00247R1

Machine-learning-based prediction of disability progression in multiple sclerosis: an observational, international, multi-center study

PLOS Digital Health

Dear Dr. De Brouwer,

Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 60 days (by Apr 27 2024, 11:59 PM). If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Ryan S McGinnis

Academic Editor

PLOS Digital Health

Journal Requirements:

Additional Editor Comments (if provided):

Thank you for your revised submission. We have arranged review of this revised manuscript. Please address the remaining comments in your revision.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #4: (No Response)

--------------------

2. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Partly

Reviewer #4: Partly

--------------------

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #4: Yes

--------------------

4. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #4: Yes

--------------------

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #4: Yes

--------------------

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thank you for addressing the comments. The additional details and clarifications are greatly appreciated. Based on the p-values added in Table 2, there is not sufficient evidence that the attention model performs best. Please adjust the wording throughout (including the introduction) to address this. Additionally, there are a few formatting (e.g., p. 7, line ~52 line break) and wording errors that could use one more read-through.

Reviewer #4: This manuscript presents a machine learning approach for predicting if a person with multiple sclerosis will experience disability progression in the next two years based on data commonly collected during routine clinical care. Models were trained on a very large, multi-national dataset (MSBase). I was not a reviewer on the prior version of the manuscript, but I have reviewed the comments from prior reviewers and believe that the authors have adequately addressed many of their concerns. In my review, I have identified several areas (below) that would benefit from revision to strengthen this manuscript and further highlight this important work.

-Abstract, Methods: typo – expended -> expanded

-line 64: What is the scientific or clinical rationale for predicting progression in the next two years? Why not 1 year? Or 6 months? This seems to have also been a question from the prior reviewers that has not been addressed and would be helpful for readers.

-lines 114-115: it is noted that the lower performance observed in the primary progressive and secondary progressive subgroups was due to small sample size, but how is that known? Was this tested in some way?

-tables 2-4: it would be helpful to note, in the caption, what is being reported after the +/- in these tables.

-tables 3-4: an indication of the training set size for each model being compared would be helpful as it is not exactly clear with all of the subgroups.

-lines 153-154: While it is argued that the presented models are a ‘significant advance towards deploying AI in clinical practice in MS’ - how do we know what good enough performance is in this context? Is a ROC AUC of 0.72 and PR-AUC of 0.26 sufficient for predicting MS progression? What are the negative byproducts of an incorrect prediction (both false positive and negative)? It would be helpful for readers if these results could be further evaluated and placed in their intended context of use in the discussion section.

-Discussion: this section would be strengthened with a more detailed discussion of the reported results. For example, the results of Tables 3 and 4 are clearly important, but are not discussed. It would be helpful for the authors to further elucidate what may be causing the observed differences in model performance and to highlight what those performance differences may imply for the translation of this approach into the clinical environment.

-there seems to be an extra (2) between lines 271 and 272

-line 323: 11.64% progression events suggest a pretty significant imbalance in the data, but it is not clear how this was dealt with in training and evaluating the models. This should be described more clearly.

--------------------

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #4: No

--------------------

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLOS Digit Health. doi: 10.1371/journal.pdig.0000533.r005

Decision Letter 2

Martin G Frasch, Ryan S McGinnis

14 May 2024

Machine-learning-based prediction of disability progression in multiple sclerosis: an observational, international, multi-center study

PDIG-D-23-00247R2

Dear Dr De Brouwer,

We are pleased to inform you that your manuscript 'Machine-learning-based prediction of disability progression in multiple sclerosis: an observational, international, multi-center study' has been provisionally accepted for publication in PLOS Digital Health.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email from a member of our team. 

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact digitalhealth@plos.org.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Digital Health.

Best regards,

Ryan S McGinnis

Academic Editor

PLOS Digital Health

***********************************************************

Many thanks to the authors for addressing the remaining reviewer comments.

Reviewer Comments (if any, and for reference):

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Comments addressed. Thank you.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. ROC-AUC scores per MS center.

    ROC-AUC of individual centers in the test set against the size of the center. As the size of the centers grows, the performance converges to the average ROC-AUC. As the size of the centers shrinks, the variability in performance increases, which is statistically expected due to the low sample size. Centers with no progression events are not plotted, because ROC-AUC is not defined in that case. A minimal computation sketch follows below.

    (PDF)

    pdig.0000533.s001.pdf (27.3KB, pdf)
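
    For readers who want to reproduce the per-center evaluation above, the following is a minimal Python sketch, assuming a pandas DataFrame with hypothetical columns center_id, y_true, and y_score; the public repository may organize the predictions differently.

    import pandas as pd
    from sklearn.metrics import roc_auc_score

    def roc_auc_per_center(df):
        """ROC-AUC and sample size per MS center.

        Centers whose test samples contain only one class (e.g. no
        progression events) are skipped: ROC-AUC is undefined there,
        matching the centers omitted from S1 Fig.
        """
        rows = []
        for center, group in df.groupby("center_id"):
            if group["y_true"].nunique() < 2:
                continue  # ROC-AUC undefined for a single class
            rows.append({
                "center": center,
                "n_samples": len(group),
                "roc_auc": roc_auc_score(group["y_true"], group["y_score"]),
            })
        return pd.DataFrame(rows)

    Plotting roc_auc against n_samples then reproduces the funnel-shaped convergence described in the caption.
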
    S2 Fig. Visualization of the different countries in the dataset.

    Each country is represented as the set of vectors of static variables for each episode. A distance between countries was computed using earth mover distance. The 2D visualization was obtained by using multidimensional scaling (MDS).

    (PNG)

    pdig.0000533.s002.png (34.8KB, png)
    S3 Fig. Visualization of the different clinical centers in the dataset.

    Each center is represented as the set of vectors of static variables for each episode. A distance between centers was computed using the earth mover distance. The 2D visualization was obtained using multidimensional scaling (MDS). We color each center by its country of origin. A sketch of this embedding pipeline (shared with S2 Fig) follows below.

    (PNG)

    pdig.0000533.s003.png (49.7KB, png)
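
    S2 and S3 Figs share one embedding pipeline: each country or center is summarized by the empirical distribution of its static-variable vectors, pairwise distances between these distributions are computed with the earth mover distance, and the resulting distance matrix is projected to 2D with MDS. A minimal sketch, assuming the POT (Python Optimal Transport) package; function and variable names are illustrative.

    import numpy as np
    import ot  # POT: Python Optimal Transport
    from sklearn.manifold import MDS

    def emd(X1, X2):
        """Earth mover distance between two sets of static-variable vectors."""
        w1 = np.full(len(X1), 1.0 / len(X1))  # uniform weight per sample
        w2 = np.full(len(X2), 1.0 / len(X2))
        M = ot.dist(X1, X2)        # pairwise (squared Euclidean) cost matrix
        return ot.emd2(w1, w2, M)  # optimal transport cost

    def embed_2d(groups):
        """2D MDS embedding of a list of sample matrices (one per center/country)."""
        n = len(groups)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                D[i, j] = D[j, i] = emd(groups[i], groups[j])
        mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
        return mds.fit_transform(D)
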
    S4 Fig. Calibration diagram for all models.

    Calibration curves of the different models on the test set (fold, i.e. train-test split, 0). Calibration was performed using Platt scaling [33]; a sketch of this step follows below. Good calibration was observed for all models. The discrepancy with the ideal line (dotted) in the higher-score regime can be explained by the lower number of data points in that region, leading to more variance.

    (PDF)

    pdig.0000533.s004.pdf (23.5KB, pdf)
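
    Platt scaling [33] fits a one-dimensional logistic regression that maps raw model scores to calibrated probabilities on held-out data. A minimal sketch using scikit-learn; the pipeline in the public repository may differ in its details.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.calibration import calibration_curve

    def platt_scale(scores_val, y_val, scores_test):
        """Fit Platt scaling on a validation split; apply it to test scores."""
        lr = LogisticRegression()
        lr.fit(np.asarray(scores_val).reshape(-1, 1), y_val)
        return lr.predict_proba(np.asarray(scores_test).reshape(-1, 1))[:, 1]

    # A reliability diagram like S4 Fig plots, per probability bin:
    # prob_true, prob_pred = calibration_curve(y_test, calibrated, n_bins=10)
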
    S5 Fig. Predicted percentage of worsening per subgroup.

    Predicted percentage of worsening per subgroup, for both MS courses and for EDSS larger or smaller than 5.5. Green is the actual prevalence for the age groups on the x-axis, and red and purple are model predictions. This shows the calibration performance for different subgroups. An acceptable discrepancy is observed (of at most 3 percentage points), together with a tendency of the models to underestimate the prevalence of disability progression.

    (PDF)

    pdig.0000533.s005.pdf (18.8KB, pdf)
    S1 Table. Summary statistics of the patient cohort.

    Summary statistics of the cohort of interest after patient and sample selection. For all variables, the value at the last recorded visit was used. KFS stands for Kurtzke Functional Systems Score, DMT for Disease Modifying Therapy, and CIS for Clinically Isolated Syndrome.

    (PDF)

    pdig.0000533.s006.pdf (56.4KB, pdf)
    S2 Table. Summary statistics of the performance measures (Cohort with minimum 3 visits).

    ROC-AUC, AUC-PR, Brier score and ECE of all models (averages ± standard deviations). Cohort of patients with at least 3 visits with an EDSS score in the last 3.25 years. A sketch of the ECE computation follows below.

    (PDF)

    pdig.0000533.s007.pdf (34.6KB, pdf)
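
    The expected calibration error (ECE) reported in these tables alongside the Brier score bins predictions by their predicted probability and averages the gap between predicted and observed event rates. A minimal sketch of one common equal-width-bin formulation; the repository may use a different binning scheme.

    import numpy as np

    def expected_calibration_error(y_true, y_prob, n_bins=10):
        """Weighted average of |mean predicted prob - observed rate| per bin."""
        y_true = np.asarray(y_true, dtype=float)
        y_prob = np.asarray(y_prob, dtype=float)
        # Bin index = floor(p * n_bins), with p = 1.0 mapped to the last bin.
        bins = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
        ece = 0.0
        for b in range(n_bins):
            mask = bins == b
            if mask.any():
                gap = abs(y_prob[mask].mean() - y_true[mask].mean())
                ece += mask.mean() * gap  # weight by bin occupancy
        return ece

    # For comparison, the Brier score is simply np.mean((y_prob - y_true) ** 2).
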
    S3 Table. Summary statistics of the performance measures (Cohort with minimum 6 visits).

    ROC-AUC, AUC-PR, Brier score and ECE of all models (averages ± standard deviations). Cohort of patients with at least 6 visits with an EDSS score in the last 3.25 years.

    (PDF)

    pdig.0000533.s008.pdf (34.6KB, pdf)
    S4 Table. Summary statistics of the performance measures on different MS subgroups (Cohort with minimum 3 visits).

    ROC-AUC, AUC-PR, Brier score and ECE of all models on the different MS course subgroups (averages ± standard deviations). Primary Progressive (PP), Relapsing Remitting (RR), and Secondary Progressive (SP) courses are considered. Cohort of patients with at least 3 visits with an EDSS score in the last 3.25 years.

    (PDF)

    pdig.0000533.s009.pdf (35.2KB, pdf)
    S5 Table. Summary statistics of the performance measures on different MS subgroups (Cohort with minimum 6 visits).

    ROC-AUC, AUC-PR, Brier score and ECE of all models on the different MS course subgroups (averages ± standard deviations). Primary Progressive (PP), Relapsing Remitting (RR), and Secondary Progressive (SP) courses are considered. Cohort of patients with at least 6 visits with an EDSS score in the last 3.25 years.

    (PDF)

    pdig.0000533.s010.pdf (35.2KB, pdf)
    S6 Table. Summary statistics of the performance measures on different severity subgroups (Cohort with minimum 3 visits).

    ROC-AUC, AUC-PR, Brier score and ECE by severity subgroup (averages ± standard deviations). Low severity patients are defined as those with EDSS ≤ 5.5 at baseline, while high severity patients are defined as those with EDSS > 5.5 at baseline. Cohort of patients with at least 3 visits with an EDSS score in the last 3.25 years.

    (PDF)

    pdig.0000533.s011.pdf (36KB, pdf)
    S7 Table. Summary statistics of the performance measures on different severity subgroups (Cohort with minimum 6 visits).

    ROC-AUC, AUC-PR, Brier score and ECE by severity subgroup (averages ± standard deviations). Low severity patients are defined as those with EDSS ≤ 5.5 at baseline, while high severity patients are defined as those with EDSS > 5.5 at baseline. Cohort of patients with at least 6 visits with an EDSS score in the last 3.25 years.

    (PDF)

    pdig.0000533.s012.pdf (36KB, pdf)
    S8 Table. Feature importance for different performance metrics.

    Features are ranked by order of importance for the Dynamic Model. Feature importance is assessed by the average difference in performance when the specific feature is shuffled. Averages ± standard deviations are reported. A sketch of this permutation procedure follows below.

    (PDF)

    pdig.0000533.s013.pdf (34.2KB, pdf)
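
    The shuffling procedure behind S8 Table is permutation feature importance [19]: each feature is permuted across the evaluation set and the average drop in performance is recorded. A minimal sketch for a ROC-AUC-scored classifier exposing predict_proba; scikit-learn's sklearn.inspection.permutation_importance provides an equivalent built-in.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def permutation_importance_auc(model, X, y, n_repeats=5, seed=0):
        """Average ROC-AUC drop when each feature column is shuffled."""
        rng = np.random.default_rng(seed)
        base = roc_auc_score(y, model.predict_proba(X)[:, 1])
        drops = np.zeros(X.shape[1])
        for j in range(X.shape[1]):
            scores = []
            for _ in range(n_repeats):
                X_perm = X.copy()
                # Shuffle column j to break its link with the outcome.
                X_perm[:, j] = rng.permutation(X_perm[:, j])
                scores.append(roc_auc_score(y, model.predict_proba(X_perm)[:, 1]))
            drops[j] = base - np.mean(scores)
        return drops  # a larger drop indicates a more important feature
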
    S9 Table. Hyperparameters table for the temporal attention model.

    List of hyperparameters used for training the models.

    (PDF)

    pdig.0000533.s014.pdf (19KB, pdf)
    S10 Table. Hyperparameters table for the multi-layer perceptron model.

    List of hyperparameters used for training the models.

    (PDF)

    pdig.0000533.s015.pdf (18.4KB, pdf)
    S11 Table. Hyperparameters table for the recurrent neural network model.

    List of hyperparameters used for training the models.

    (PDF)

    pdig.0000533.s016.pdf (19.3KB, pdf)
    S12 Table. Hyperparameters table for the dynamic MTP model.

    List of hyperparameters used for training the models.

    (PDF)

    pdig.0000533.s017.pdf (18.7KB, pdf)
    S13 Table. Hyperparameters table for the factorization machines model.

    List of hyperparameters used for training the models.

    (PDF)

    pdig.0000533.s018.pdf (18KB, pdf)
    S14 Table. Hyperparameters table for the logistic regression model.

    List of hyperparameters used for training the models.

    (PDF)

    pdig.0000533.s019.pdf (17.2KB, pdf)
    S1 Text. Model descriptions.

    Description of the Bayesian neural networks, DeepMTP, and Factorization Machines models.

    (PDF)

    pdig.0000533.s020.pdf (129.4KB, pdf)
    Attachment

    Submitted filename: ResponseToReviewers.pdf

    pdig.0000533.s021.pdf (180.7KB, pdf)
    Attachment

    Submitted filename: MS_rebuttal.pdf

    pdig.0000533.s022.pdf (200KB, pdf)


