Abstract
Amyotrophic Lateral Sclerosis (ALS) is an inexorably progressive neurodegenerative condition with no effective disease-modifying therapies. The development and validation of reliable prognostic models is a recognised research priority. We present a prognostic model for survival in ALS that takes result uncertainty into account. Patient data were reduced and projected onto a 2D space using Uniform Manifold Approximation and Projection (UMAP), a novel non-linear dimension reduction technique. Data from 5,220 patients were included: development data originated from past clinical trials, and real-world population data served as validation data. Predictors included age, gender, region of onset, symptom duration, weight at baseline, functional impairment, and estimated rate of functional loss. UMAP projection of patients showed an informative 2D data distribution. As limited data availability precluded complex model designs, the projection was divided into three zones with distinct survival rates. These rates were defined using confidence bounds: high, intermediate, and low 1-year survival rates at respectively (), () and (). Predicted 1-year survival was estimated using zone membership. This approach requires a limited set of features, is easily updated, improves with additional patient data, and accounts for result uncertainty.
Subject terms: Neurological disorders, Computer science
Introduction
Amyotrophic Lateral Sclerosis (ALS) is a relentlessly progressive neurodegenerative condition involving both upper and lower motor neurons, leading to progressive limb weakness and bulbar dysfunction. Mean survival time from disease onset is typically 3 to 5 years1, with death occurring secondary to respiratory failure. The disease is characterised by considerable clinical heterogeneity2 and differences in progression rate3, with some patients surviving 10 years or more4,5.
From a clinical perspective, accurate prognostic indicators are indispensable for optimising multidisciplinary care, planning interventions, advising patients on end-of-life decisions, and allocating resources. Disease heterogeneity is a recognised barrier to successful clinical trials in ALS6, and accurate prognosis prediction would improve patient stratification. Previous epidemiological studies have identified a number of negative prognostic indicators7, such as older age of onset, bulbar onset, respiratory compromise, cognitive impairment, short symptom onset to diagnosis interval, marked functional disability, c9orf72 status, and fast progression rate8–11. However, individualised prediction is seldom reliable when clinical and demographic variables are considered alone11. There is a growing trend to develop accurate prognostic tools based on a combination of prognostic factors12, using supervised machine learning models such as random forests13, regression models14, neural networks with random forests15 and boosting algorithms16. Recently, Westeneng et al.17 presented an externally validated Royston-Parmar regression prediction model of survival in a large European ALS population.
Unsupervised learning methods provide new modelling opportunities in ALS due to their ability to detect data distributions without a firm underlying statistical hypothesis18,19. Dimension reduction methods project data onto a new low-dimensional space and allow informative data visualisation. A neighbourhood-based approach takes full advantage of patient similarity for prognosis modelling and can unravel relevant correlations between predictors and outcomes. Uniform Manifold Approximation and Projection (UMAP)20 is a novel method based on non-linear dimension reduction which can be readily combined with probability assessments. The main objective of our study was to evaluate a UMAP-based 1-year survival prediction model in ALS, designed using three clinical trial datasets and validated on a Real-World (RW) dataset. Model performance was compared with random forest and logistic regression models. The model is easily updated, works with a limited set of features and factors in result uncertainty. Taking advantage of the UMAP projection, other prognosis outcomes and different time frames can be explored.
Methods
Patient population
Development and validation data for this research included a total of 5,393 patients from four different datasets, three of which originated from clinical trials. The first dataset, referred to as ‘Trophos’, was a clinical trial for olesoxime, a drug developed by Trophos21, which included 512 patients. After excluding samples with missing data, 431 patients remained. The second dataset, ‘Exonhit’, was a clinical trial for pentoxifylline, a drug produced by Exonhit Pharma22, which included 400 patients. Given the considerable negative effect of the tested treatment on survival time, patients who received the treatment were excluded from outcome analysis. Nevertheless, these patients were included in dimension reduction, as projection calculation is solely based on baseline features. Following the exclusion of incomplete samples, data from 345 patients were included in the dimension reduction phase; after also excluding patients who had received the treatment, 172 patients were retained for 1-year survival analysis. The third database was ‘PRO-ACT’, funded by the ALS Therapy Alliance and released in 2012 as part of the DREAM Phil Bowen ALS prediction Prize4Life competition. PRO-ACT consists of pooled data from 16 completed phase II-III clinical trials and one observational study23. The original sample size was 10,723, reduced to 3,971 after discarding samples with missing data. The fourth dataset was population-based and contained RW patient data. These data were obtained from the database of the Paris tertiary referral centre for ALS and were collected between September 1999 and April 2008. The original sample size was 1,377, which was reduced to 646 after the removal of incomplete samples. Baseline patient feature distribution for 1-year survival analysis is presented for each cohort in Table 1. Additional information on each dataset is provided as supplementary information.
Table 1.
Source | n | Gender (male/female) | Onset (spinal/bulbar) | Age (years) | Symptom duration (months) | Baseline weight (kg) | Baseline ALSFRS (score) | Baseline ALSFRS decline rate (score/month) |
---|---|---|---|---|---|---|---|---|
PRO-ACT | 3,971 | 2,485/1,486 | 3,117/854 | (18:81) | (0.5:140.4) | (30:148.6) | (7:40) | (6.09:0) |
Trophos | 431 | 277/154 | 346/85 | (26:79) | (5:38) | (41:130) | (16:40) | (2.67:0) |
Exonhit | 172 | 118/54 | 129/43 | (26.3:77.9) | (5:58) | (45:112) | (10:39) | (3.14:0.05) |
Real world | 646 | 345/301 | 458/188 | (26.3:92.2) | (0:228.5) | (40:140) | (3.5:40) | (4.16:0) |
Overall | 5,220 | 3,225/1,995 | 4,050/1,170 | (18:92.2) | (0:228.5) | (30:148.6) | (3.5:40) | (6.09:0) |
Numerical predictors are described using mean ± standard deviation (range).
Clinical predictors and outcomes
The primary outcome was 1-year survival. Overall survival (in months), and 1-year functional loss (using the validated ALS Functional Rating Scale (ALSFRS)) were secondary outcomes. Each outcome had a specific data scope: 1-year survival was a binary variable and was predicted for patients dying within 12 months or with an available ALSFRS score at t+12 months. 1-year functional loss was predicted for patients that survived at t+12 months with an ALSFRS score at that time. Patients who died or had invasive ventilation within the first year were assigned an ALSFRS score of 0 at year 1. Overall survival (in months) was used for patients when such information was available but provides a limited understanding of true patient survival given patient monitoring ended at t+12 months for most data.
The choice of predictors was based on feature completeness after database cross-referencing. Predictors include gender, region of onset (spinal/bulbar), age, symptom duration, baseline ALSFRS score, baseline weight, and estimated functional decline rate24. The functional decline rate was estimated using the following formula:
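$$\text{ALSFRS decline rate} = \frac{\text{ALSFRS}_{\text{baseline}} - \text{ALSFRS}_{\text{max}}}{\Delta t} \qquad (1)$$

with $\text{ALSFRS}_{\text{baseline}}$, the ALSFRS score recorded at baseline, $\text{ALSFRS}_{\text{max}}$, the maximum score for the ALSFRS (40) and $\Delta t$, time in months between symptom onset and baseline.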
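As an illustration, the decline rate estimate can be sketched in a few lines of Python. The function name and the handling of a zero symptom duration are assumptions made for this example, and the sign convention (negative values for functional loss) is inferred from the worked patient examples given in the Results.

```python
ALSFRS_MAX = 40  # maximum ALSFRS score, as stated in the text


def alsfrs_decline_rate(baseline_score: float, months_since_onset: float) -> float:
    """Estimated ALSFRS points gained or lost per month between onset and baseline.

    Negative values indicate functional loss; 0 means no loss so far.
    Rejecting non-positive durations is an assumption of this sketch.
    """
    if months_since_onset <= 0:
        raise ValueError("symptom duration must be positive")
    return (baseline_score - ALSFRS_MAX) / months_since_onset


# Patient A from the Results: baseline ALSFRS 36 after 6.5 months of symptoms.
print(round(alsfrs_decline_rate(36, 6.5), 1))  # -0.6
```

A patient with the maximum score at baseline gets a rate of 0, which matches the upper end of the ranges reported in Table 1.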
Table 2 provides an overview of patient outcome feature distribution. Patient survival was on average above for all datasets, and 1-year average ALSFRS was above 17 for all datasets. Overall patient survival was bounded by clinical trial follow up time.
Table 2.
Source | n (1-year survival) | Survival rate (%) | n (survival) | Survival (months) | n (1-year ALSFRS) | 1-year ALSFRS (score) |
---|---|---|---|---|---|---|
PRO-ACT | 3,971 | 76 | 1,434 | (0:31) | 3,789 | (0:40) |
Trophos | 431 | 84 | 99 | (3:15) | 428 | (0:38) |
Exonhit | 172 | 72 | 79 | (1:18) | 165 | (0:39) |
Real world | 646 | 67 | 447 | (0:41) | 543 | (0:40) |
Overall | 5,220 | 75 | 2,059 | (0:41) | 4,925 | (0:40) |
Numerical predictors are described using mean ± standard deviation (range).
Missing data management
Missing feature analysis focused solely on baseline predictors and outcomes (overall survival, 1-year survival, and 1-year ALSFRS). Table 3 presents missing data ratio per feature for all datasets. Features which were not available in all datasets, such as testing and biological lab results, were discarded. ALSFRS sub-scores were not recorded for Trophos patients and were discarded as a whole. Outcome features can easily be missing due to loss to follow up or death. Features at time t+3 were less available than at baseline. Data collection was not disclosed for PRO-ACT data which aggregates multiple clinical trials and this prevented the identification of missing data patterns. Due to data collection differences between the cohorts, we did not perform missing data imputation and opted for complete case analysis.
Table 3.
Group | n | Survived (yes/no) | Gender (male/female) | Onset (spinal/bulbar) | Age (years) | Symptom duration (months) | Baseline weight (kg) | Baseline ALSFRS (score) | Baseline ALSFRS decline rate (score/month) | 1-year ALSFRS (score) |
---|---|---|---|---|---|---|---|---|---|---|
High survival rate zone | 1,525 | 1,378/147 | 1,189/336 | 1,187/338 | (22:78) | (2.9:59.8) | (46:148.6) | (27:40) | (1.46:0) | (0:40) |
Intermediate survival rate zone | 1,524 | 1,219/305 | 899/625 | 1,171/353 | (18:81) | (3.1:140.4) | (30:122.5) | (25:39) | (2.38:0.02) | (0:39) |
Low survival rate zone | 1,525 | 892/633 | 792/733 | 1,234/291 | (25:80) | (0.5:86.7) | (36.5:138.9) | (7:35) | (6.09:0.15) | (0:37) |
Overall | 4,574 | 3,489/1,085 | 2,880/1,694 | 3,592/982 | (18:81) | (0.5:140.4) | (30:148.6) | (7:40) | (6.09:0) | (0:40) |
Numerical predictors are described using mean ± standard deviation (range).
Data processing
Pre-processing was limited to predictor normalisation to the 0–1 range. Data transformation was carried out through non-linear dimension reduction, also called manifold learning, using the Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP)20 method. UMAP works in two steps. First, a compressed representation of the input space (i.e. the initial patient data) is generated through topological analysis of the data structure. Subsequently, a low-dimensional (in our case 2D) data embedding is created through a cross-entropy optimisation process. UMAP preserves data neighbourhoods, distances and density. ‘Development data’ were used to learn a 2D representation of patients. Validation data were projected using the learnt mapping. Further information on UMAP can be found in the supplementary information section. The sample size of the development data for 1-year survival was 4,574. Functional loss and overall survival analyses had lower sample sizes, respectively 4,382 and 1,612. Sample sizes of the validation data for 1-year survival, functional loss and overall survival were respectively 646, 541 and 447.
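The workflow described above (normalise, learn the embedding on development data, then project validation data with the learnt mapping) could be sketched as follows. This is a minimal sketch, not the authors' code: it assumes the third-party `umap-learn` package, default UMAP hyperparameters, and that validation data are scaled with the minima and ranges learnt on the development data. The function names are hypothetical.

```python
import numpy as np


def fit_minmax(X):
    """Return per-feature (minimum, range) learnt from the development data."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero for constant features
    return lo, span


def apply_minmax(X, lo, span):
    """Scale features to the 0-1 range using development-data statistics."""
    return (X - lo) / span


def project(dev_X, val_X, random_state=0):
    """Learn a 2D embedding on development data, then map validation data into it."""
    import umap  # third-party package "umap-learn", imported lazily

    lo, span = fit_minmax(dev_X)
    reducer = umap.UMAP(n_components=2, random_state=random_state)
    dev_2d = reducer.fit_transform(apply_minmax(dev_X, lo, span))
    val_2d = reducer.transform(apply_minmax(val_X, lo, span))
    return dev_2d, val_2d
```

With this choice, validation values outside the development range fall slightly outside 0–1, which keeps the learnt mapping unchanged when new patients are added.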
1-year survival rate zones were identified by dividing the UMAP projection space into multiple small square cells. A local survival rate was calculated for each cell from the development samples belonging to that cell. Confidence bounds were derived at a fixed confidence level using the cell sample size and the following formula25:

$$P \pm z\sqrt{\frac{P(1-P)}{N}} \qquad (2)$$

with $z$, the critical value of the standard normal distribution at that confidence level, $P$, the outcome probability and $N$, the sample size.
Cell sample size directly influenced the cell survival rate. The less populated a cell, the wider the probability confidence interval, and the less reliable the analysis of cell membership. Cells were combined to form three equally populated zones with sample sizes sufficient to bound survival rates’ confidence intervals. These zones were designed to have distinct survival rates. Validation data were projected onto the UMAP projection space to check if distribution patterns observed for development data still held. RW patients were then assigned to their corresponding survival rate zone. Validation data zone assignment was assessed with regards to actual survival.
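The cell-level assessment can be illustrated with a short sketch. It assumes a normal-approximation interval for a proportion and a 95% level (`z = 1.96`); both are assumptions of this example, as are the function names and the cell size, none of which are taken from the study.

```python
import math
from collections import defaultdict


def wald_interval(p, n, z=1.96):
    """Normal-approximation bounds for a proportion p observed on n samples."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def cell_survival(points, survived, cell_size):
    """Group 2D points into square cells and summarise survival per cell.

    points: iterable of (x, y) projection coordinates
    survived: iterable of booleans (alive at one year)
    Returns {(col, row): (n, survival_rate, (lower, upper))}.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [n_patients, n_survivors]
    for (x, y), alive in zip(points, survived):
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell][0] += 1
        counts[cell][1] += int(alive)
    return {
        cell: (n, k / n, wald_interval(k / n, n))
        for cell, (n, k) in counts.items()
    }
```

Under these assumptions, a cell of 25 patients with a survival rate of 0.5 has a half-width of 0.196, which shows why sparsely populated cells were merged into larger zones.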
The model was compared to logistic regression and random forest models. Models were trained on two different subsets of features: all baseline features, and age and baseline ALSFRS only. Models were trained on development data and tested on validation data. The numbers of True Positives (TP), False Positives (FP), False Negatives (FN) and True Negatives (TN) were reported for each model. The following classification metrics were used: accuracy ((TP + TN)/(TP + FP + FN + TN)), precision (or positive predictive value = TP/(TP + FP)), specificity (or true negative rate, selectivity = TN/(TN + FP)), recall (or sensitivity, true positive rate = TP/(TP + FN)), balanced accuracy (average of specificity and recall = (specificity + recall)/2) and F1-measure (harmonic mean of precision and recall = 2 × precision × recall/(precision + recall)). As the model returned a survival probability and not a survival status, the total number of survivors could only be approximated. It was calculated by adding up the expected number of survivors for each zone, obtained from the total number of patients within that zone and the associated survival rate.
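For reference, the metrics can be computed directly from the four confusion counts. The sketch below uses the standard definitions, with balanced accuracy taken as the mean of recall and specificity; this reading reproduces the values reported in Table 6 for the two-feature logistic regression model. The function name is an assumption of this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return the comparison metrics as percentages rounded to integers."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    metrics = {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "specificity": specificity,
        "recall": recall,
        "balanced_accuracy": (recall + specificity) / 2,
        "f1": 2 * precision * recall / (precision + recall),
    }
    return {name: round(100 * value) for name, value in metrics.items()}


# Counts reported in Table 6 for the two-feature logistic regression model.
print(classification_metrics(tp=89, fp=124, fn=58, tn=375))
# {'accuracy': 72, 'precision': 42, 'specificity': 75, 'recall': 61,
#  'balanced_accuracy': 68, 'f1': 49}
```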
Results
Analysis of patient characteristics—input feature distribution
Development data were projected using UMAP in a 2D space shown in Fig. 1a. Initial plot of data did not show relevant patient stratification as all patients were clustered together. Plot analysis helps to identify strong correlations between projection and predictors. This was the case for age and baseline ALSFRS scores (Fig. 1d,g respectively) and to a lesser extent for symptom duration and estimated ALSFRS decline rate (Fig. 1e,h respectively). Onset, gender, and baseline weight did not show a high degree of correlation as demonstrated in Fig. 1b,c,f as feature distribution appeared to be random with regards to UMAP projection. Projection data seemed to be independent of cohort membership as patients from each source were evenly distributed in the projection space.
Analysis of patient outcomes—output feature distribution
Analysis of UMAP projection with regards to outcome variables showed spatial patterns as presented in Fig. 2. Survival in months is shown in Fig. 2a. Patients with a longer survival (more than 12 months is referred to as the 13+ on the colour map) tended to be located in the upper part of the UMAP projection. 1-year survival led to an uneven patient distribution, as shown in Fig. 2b. Patients deceased within the year tended to concentrate in the lower pane of the UMAP projection which was consistent with the pattern for overall survival. Patients who survived a year tended to spread evenly across the entire projection space. Fig. 2c shows that similarly to 1-year survival, the 1-year ALSFRS score correlated well with the UMAP projection. ALSFRS score patterns differed slightly from 1-year survival as the lower left pane concentrated patients with lowest ALSFRS. Unsurprisingly, the 1-year ALSFRS score, in Fig. 2c, correlated strongly with baseline ALSFRS score, in Fig. 1g.
Analysis of projection space segmentation—zone division
As stated earlier, patients who were not alive at year 1 were mainly located in the lower pane of the projection space as seen in Fig. 3a. Dividing the projection space in square cells helped to unravel local survival patterns as shown in Fig. 3b. Cells in the lower left side of the projection space had survival rates lower than . As average sample size within each cell is below 25, confidence intervals were approximately minimum with survival rate between 10 and . To ensure statistical significance, a simple division of the UMAP projection space according to the vertical axis was proposed as shown in Fig. 3c. This led to high, intermediate, and low survival rate zones with respectively (), () and () survival rates. Predictors of patient population within each zone are presented in Table 4. Baseline features for the intermediate survival rate zone were similar to overall baseline features. Baseline features for high and low survival rate zones differed significantly from one another. The former had younger patients and patients with higher weight with shorter symptom duration, with less functional disability and lower functional loss rate; while the latter had older patients with lower baseline weight and longer symptom duration, higher functional loss and functional loss rate.
Table 4.
Feature | PRO-ACT | Exonhit | Trophos | Real world | Overall |
---|---|---|---|---|---|
Initial sample size (n) | 10,723 | 400 | 512 | 1,377 | 13,012 |
Gender | |||||
Onset | |||||
Age | |||||
Symptom duration | |||||
Baseline weight | |||||
Baseline height | |||||
Baseline ALSFRS | |||||
Baseline ALSFRS upper limb sub-score | |||||
Baseline ALSFRS lower limb sub-score | |||||
Baseline ALSFRS bulbar sub-score | |||||
Baseline ALSFRS respiratory sub-score | |||||
Baseline ALSFRS trunk sub-score | |||||
Baseline pulse | |||||
Baseline diastolic blood pressure | |||||
Baseline systolic blood pressure | |||||
Baseline vital capacity (L) | |||||
Baseline vital capacity (%) | |||||
Survival (month) | |||||
1-year survival | |||||
1-year ALSFRS | |||||
Overall missing ratio | |||||
Overall predictor missing ratio | |||||
Overall outcome missing ratio | |||||
Final sample size for 1-year survival (n) | 3,971 | 172 | 431 | 646 | 5,220 |
Novel patient data, provided all baseline features are recorded, can be projected in the reduced UMAP space. The corresponding 2D coordinates determine zone membership to one of the three survival rate zones. Zone membership and the spatial positioning within the projection space provide a broad estimate of patient 1-year survival. Three examples are provided for more details and presented in Fig. 3d:
Patient A (ID 4922) is a 41-year-old woman with spinal onset, a baseline weight of 84 kg, a baseline ALSFRS score of 36 and an estimated symptom duration of 6.5 months; the estimated baseline ALSFRS decline rate is therefore around −0.6 ALSFRS points per month. This information is used to compute the spatial coordinates of patient A within the UMAP projection space. Patient A’s coordinates in the UMAP projection space are (0.92, 0.79), which fall into the high survival rate zone. Patient A has a resulting 1-year survival rate estimate of .
Patient B (ID 429) is a 57-year-old man with spinal onset, a baseline weight of 71 kg, a baseline ALSFRS score of 33 and an estimated symptom duration of 13 months; the estimated baseline ALSFRS decline rate is therefore around −0.5 ALSFRS points per month. This information is used to compute the spatial coordinates of patient B within the UMAP projection space. Patient B’s coordinates in the UMAP projection space are (0.46, 0.62), which fall into the intermediate survival rate zone. Patient B has a resulting 1-year survival rate estimate of .
Patient C (ID 2816) is a 78-year-old woman with spinal onset, a baseline weight of 64 kg, a baseline ALSFRS score of 19 and an estimated symptom duration of 11 months; the estimated baseline ALSFRS decline rate is therefore around −1.8 ALSFRS points per month. This information is used to compute the spatial coordinates of patient C within the UMAP projection space. Patient C’s coordinates in the UMAP projection space are (0.41, 0.03), which fall into the low survival rate zone. Patient C has a resulting 1-year survival rate estimate of .
Follow-up at one year showed that patients A and B survived, while patient C died within the first year. A refined division of the projection space was also carried out and is presented in the supplementary information section.
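Zone assignment for a newly projected patient can be sketched as a simple lookup on the vertical coordinate. The sketch assumes that the three zones are horizontal bands, that higher vertical coordinates correspond to higher survival, and that the cut points are the terciles of the development data (so that zones are equally populated). The function names are hypothetical, and the actual thresholds used in the study are not reproduced here.

```python
def tercile_thresholds(dev_y):
    """Cut points on the vertical axis giving three equally populated zones."""
    ys = sorted(dev_y)
    n = len(ys)
    return ys[n // 3], ys[(2 * n) // 3]


def assign_zone(y, thresholds):
    """Map a vertical projection coordinate to a survival rate zone."""
    low_cut, high_cut = thresholds
    if y < low_cut:
        return "low"
    if y < high_cut:
        return "intermediate"
    return "high"


# Toy development coordinates; real cut points would come from the projection.
thresholds = tercile_thresholds([i / 10 for i in range(9)])
print(thresholds, assign_zone(0.75, thresholds))
```

The survival estimate for a patient is then the rate, with its confidence bounds, attached to the assigned zone.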
Analysis of the model with additional data—external data testing
The prognosis model was assessed using external data. Patient distribution within the projection space was examined with regards to outcome variables. The different trends for outcome variables identified in Fig. 2 remained valid, with patient distribution being uneven for patients who died within one year. Patients with a shorter survival tended to concentrate in the lower pane of the projection, as shown in Fig. 4a, as did patients who did not reach the 1-year milestone in Fig. 4b. Patients were also distributed in line with the functional loss pattern identified earlier, according to their impairment after one year of follow-up. Patients suffering from a stronger functional loss were located in the lower-left part of the projection, as presented in Fig. 4c. Additional information on differences between development and validation data using the Kullback-Leibler divergence and complementary figures on distribution comparisons are presented in the supplementary information section.
Zone division—external data evaluation
Patient distribution within the three zones is presented in Table 5. Of the RW patients, 42.6% fell within the low survival rate zone, 24.8% within the high survival rate zone, and the remaining 32.7% within the non-informative intermediate survival rate zone. The overall survival rate of the RW patient dataset was 67.0%. Measured survival rates within the low, intermediate, and high survival rate zones were respectively 48.4%, 75.8%, and 87.5%. Patients in the low survival rate group had a poorer survival rate than observed with trial data. Adding 646 patients reduced the overall confidence bound for survival relatively by (from to ).
Table 5.
Group | Deceased | Survived | Count per zone | Percent per zone |
---|---|---|---|---|
High survival rate zone | 20 | 140 | 160 | 24.8 |
Intermediate survival rate zone | 51 | 160 | 211 | 32.7 |
Low survival rate zone | 142 | 133 | 275 | 42.6 |
Count per status | 213 | 433 | 646 | |
Percent per status | 33.0 | 67.0 | | |
The model was compared to logistic regression and random forest models. Results are presented in Table 6. In the high survival rate zone, 90% of the 160 patients were labelled as survivors (144). In the intermediate survival rate zone, 80% of the 211 patients were labelled as survivors (169). In the low survival rate zone, 58% of the 275 patients were labelled as survivors (160). Overall, 473 patients were predicted to survive and 173 were predicted to die, whereas 433 patients actually survived and 213 died. Performance assessment is approximated based on these figures. Hence 433 survivors (TP) and 173 deceased patients (TN) were counted as predicted correctly, while 40 patients were wrongly labelled as survivors (FP). Our model obtained higher classification metrics than the other models, specifically with regards to the F1-measure and balanced accuracy, where our model reached 96% and 91% respectively, whereas the other models averaged around 50% and 66%.
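The approximation of predicted survivors can be made explicit with a small sketch. The zone sizes are the validation counts from Table 5. The zone survival rates (90%, 80% and 58%) are assumptions of this example, rounded from the survived/deceased counts reported per zone for the development data, and the function name is hypothetical.

```python
def expected_survivors(zone_sizes, zone_rates_pct):
    """Approximate the number of predicted survivors in each zone.

    zone_sizes: {zone: number of validation patients assigned to the zone}
    zone_rates_pct: {zone: development-data 1-year survival rate, in percent}
    """
    return {
        zone: round(zone_sizes[zone] * zone_rates_pct[zone] / 100)
        for zone in zone_sizes
    }


sizes = {"high": 160, "intermediate": 211, "low": 275}  # validation counts per zone
rates = {"high": 90, "intermediate": 80, "low": 58}  # assumed, rounded rates

per_zone = expected_survivors(sizes, rates)
total = sum(per_zone.values())
print(per_zone, total)  # {'high': 144, 'intermediate': 169, 'low': 160} 473
```

With 646 validation patients, 473 predicted survivors leave 173 predicted deaths. Because only counts are predicted, not individual outcomes, the confusion counts derived from them remain an approximation.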
Table 6.
Model | TP | FP | FN | TN | Accuracy (%) | Precision (%) | Specificity (%) | Recall (%) | Balanced accuracy (%) | F1 measure (%) |
---|---|---|---|---|---|---|---|---|---|---|
LR 2 features | 89 | 124 | 58 | 375 | 72 | 42 | 75 | 61 | 68 | 49 |
LR 7 features | 100 | 113 | 64 | 369 | 73 | 47 | 77 | 61 | 69 | 53 |
RF 2 features | 85 | 128 | 96 | 337 | 65 | 40 | 72 | 47 | 60 | 43 |
RF 7 features | 119 | 94 | 104 | 329 | 69 | 56 | 78 | 53 | 66 | 55 |
Proposed Model | 433 | 40 | 0 | 173 | 94 | 91 | 81 | 100 | 91 | 96 |
LR, RF and Proposed Model respectively stand for Logistic Regression, Random Forest and UMAP combined with spatial division.
Discussion
Our study demonstrated the utility of UMAP for survival analysis in ALS. We have successfully applied this non-linear dimension reduction method to ALS clinical trial data to predict overall survival, 1-year survival and 1-year functional loss. Our results showed that limited patient information, collected early in the course of the disease, was sufficient to obtain a relevant low-dimensional patient projection with regards to key outcome variables (survival and functional loss). These input features correlated with the different outcomes of interest, thus explaining the observed distribution patterns. These correlations persisted for external RW patients. One-year survival patient distribution patterns were used to identify zones with distinct survival rates. We proposed a simple 1-year survival estimation model which fared well against the tested machine learning models although performance metrics could only be grossly approximated. The benefit of our approach with regards to standard machine learning methods is threefold. First, our model is simple; it uses only simple probabilities and readily available clinical features. Second, we limit prognosis error by providing a coarse prognosis estimate. Third, our model is easily updated and improves with additional data. No learning was required for our model to work as UMAP is a dimension reduction method. Given dimension reduction was performed on baseline features, projection analysis can be extended to other prognosis outcomes, namely functional loss or clinical staging, and different time frames.
As this study evaluated pre-existing datasets, we faced a number of constraints. PRO-ACT data are not uniformly recorded; for instance, vital capacity may be available in litres or percent, and slow and forced vital capacities are inconsistently documented. Units for weight are not clearly labelled as pounds or kilograms. A weight value of 99 without an associated unit may equally be interpreted as kilograms or pounds. These inconsistencies concern of PRO-ACT patients. Inclusion criteria for all datasets pooled within PRO-ACT are not comprehensively documented; 6 out of the 23 pooled clinical trial names were not disclosed. Available trial data also suffer from inclusion bias, as patients with marked cognitive or behavioural impairment often face worse prognosis26–28, and are often excluded from or drop out of clinical trials.
Missing data imputation was omitted and our model was trained solely on complete case samples. Although generally recommended in medical settings, data imputation seemed hazardous in this specific data context, specifically working with PRO-ACT. Multiple imputation methods often assume that the missingness patterns are missing at random, i.e. that they depend on other observed variables in the dataset. This information is difficult to verify and these data imputation methods are often performed on the biggest feature subset available so as to improve the odds of such a hypothesis being true. Given the differences in the data collection process and the limited feature subset shared between the different datasets, data imputation could not have been carried out on the global data structure. Data imputation at a dataset level would not have been productive and would have led to significant additional noise in data given small sample size and significant missing feature ratio for each dataset. Even advanced multiple imputation methods such as Quartagno et al.29 which deal with missing data imputation at a study level (for meta-analysis purposes) require knowing the collection process for each study in scope, which we cannot access for PRO-ACT as features could be missing due to loss to follow up or due to clinical trial setup. Furthermore, as UMAP is a neighbourhood-based approach, data imputation can be seen as adding data where it is missing. This would have induced sample similarity in cases where little information was known on the subjects, creating visual artefacts of similar patients within the projection space and adding significant bias to the visual representation. Our spatial distribution approach would have had a more limited performance had we worked with imputed data that would have artificially created spatial proximity.
Another data constraint was that lack of availability of established prognostic indicators in at least one of the four datasets, such as ALSFRS sub scores, cognitive profile, Riluzole intake, vital capacity30, time to generalisation31 or weight loss, which is considered more relevant than absolute weight at baseline32. This limited the model’s ability to discriminate patients within the projection space. Additional clinical features, such as upper or lower limb onset, upper or lower motor neuron predominance, may be potential predictors to improve our model further. The inclusion of biological, genetic33, and imaging features14,34 are likely to have improved current prognosis modelling35. In our study, overall survival was only regarded as a secondary outcome as global survival was not available in most cases. Analysis of overall survival would not have led to accurate results given the available data is predominantly censored after trial end. As overall survival prediction remains key and 1-year survival, a substitute target, it seemed relevant to analyse how overall survival correlated with UMAP projection coordinates. Given our data, 1-year survival was a good proxy of overall survival.
Feature processing excluded dealing time-resolved features in a time-series manner, comparable to past ALS prognosis studies15,36–38. As such, feature processing and model design was simplified. Time-series information, specifically with regards to ALSFRS, was obtained using intercept and slope values. As such, we did not intend to carry out a statistical analysis of data using traditional Kaplan Meier (for 1-year survival) or Cox regression (for functional loss) approaches that factor in time and censoring. A Kaplan–Meier approach can provide an interesting overview of the outcome with regard to time but never at a patient level which is the approach we wished to explore.
As a non-linear unsupervised learning model, UMAP can capture and characterise complex relationships between predictors. UMAP is more than a data visualisation method: the projection space preserves distances, density and neighbourhoods which allow manipulation of projected data through spatial analysis or clustering methods. However, it is a black-box approach. Model interpretability cannot be obtained: the explicit relationship between UMAP input and output variables remains unavailable. Analysis of input feature distribution in the UMAP projection gives a broad overview of variable importance with regards to the projection. Data is projected in a reduced space with interesting data distribution and preserved input space proprieties. UMAP provides the foundation to develop our prognosis model which derived from UMAP space segmentation. Our model combined UMAP with a simple spatial division in order to leverage observed correlations between projection features and the primary outcome. As such, similarly to other machine learning models, UMAP identifies underlying data correlations but cannot reveal causal relationships. Nevertheless, our model provides confidence intervals which most machine learning techniques such as random forest, boosting or neural network methods do not ordinarily provide. This additional information can help clinicians to evaluate prognosis in finer detail.
ALS prognosis modelling has been already been extensively researched in the past. Random forest models were frequently tested15,36,37,39–41, repeatedly outperforming other machine learning models. As logistic regression is a probabilistic model, it seemed interesting to compare our model with these two machine learning models. Given the strong correlation between age and baseline ALSFRS features and projection space coordinates, evaluating model performance on this feature subset was also valuable. Given the imbalance with regards to the outcome (as of patients survived 12 months), accuracy alone would not have been a reliable performance metric. Precision and recall metrics provided a finer understanding of model weaknesses and strengths. As performance metrics were calculated differently for our model and the other machine learning models, where individual predictions were available for all patients, performance results should be viewed with caution.
Given the cell sample size, the estimated survival probability for each cell was not directly used for prognosis estimation, as the confidence interval was not narrow enough. Although each cell carried limited survival information on its own, combined they were useful in understanding the differences in spatial distribution. Sample size was crucial as it directly determined the level of detail of the projection space division. A larger data sample would be required to define more zones with distinct survival rates. Dividing the projection space into three was deemed the most appropriate approach given the patient distribution and sample size. Based on the available data, we faced a trade-off between prognosis personalisation and narrow confidence bounds for survival. As the model was designed using trial patients, testing on external RW data was necessary to assess its validity and its ability to scale up. Only minor differences were observed when assessing zone membership. A large number of patients were assigned to the low survival rate zone. This is explained by the fact that clinical trial inclusion criteria select less severe patients. Additional RW data could correct this bias and limit the resulting over-optimistic prognosis.
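The trade-off between finer spatial division and narrow confidence bounds can be sketched with the standard normal-approximation sample size formula for a proportion, n = z² p(1 − p) / h², where h is the desired half-width of the interval. This formula and the target half-widths below are illustrative assumptions; the source does not state which interval or threshold was used.

```python
import math


def required_sample_size(p, half_width, z=1.959964):
    """Smallest n for which the normal-approximation 95% interval around a
    survival proportion p has at most the requested half-width."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if half_width <= 0:
        raise ValueError("half_width must be positive")
    return math.ceil(z * z * p * (1 - p) / half_width ** 2)


# Worst case p = 0.5; the half-widths are hypothetical targets.
for h in (0.10, 0.05, 0.025):
    print(f"half-width {h}: at least {required_sample_size(0.5, h)} patients per zone")
```

Halving the target half-width roughly quadruples the number of patients needed in each zone, so each additional zone carved out of a fixed cohort widens the bounds of the others.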
In conclusion, we have successfully implemented a simple 1-year survival model partially based on a novel non-linear unsupervised learning method. Further work will be needed to extend our analyses to other prognosis outcomes, such as functional loss and clinical staging systems. Given the relatively low incidence of ALS compared to other neurodegenerative conditions, robust international collaborations are necessary to collect large datasets and build precision models42. Notwithstanding the constraints of the available data, we have demonstrated that, by combining UMAP with a probabilistic and spatial distribution analysis, important correlations can be unravelled.
Acknowledgements
This article reported its findings as advised in the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis) statement43. As such, the following organisations and individuals within the PRO-ACT Consortium contributed to the design and implementation of the PRO-ACT Database and/or provided data, but did not participate in the analysis of the data or the writing of this report: Neurological Clinical Research Institute (NCRI), Massachusetts General Hospital (MGH); Northeast ALS Consortium; Novartis; Prize4Life; Regeneron Pharmaceuticals, Inc.; Sanofi; Teva Pharmaceutical Industries, Ltd. VG, GL, J-FP-P, and FD contributions were made within a SORBONNE UNIVERSITE/CNRS and FRS Consulting partnership which received funding from MESRI grant CIFRE 2017/1051. Peter Bede is supported by the Health Research Board (HRB – Ireland; HRB EIA-2017-019), Irish Institute of Clinical Neuroscience IICN and the Iris O’Brien Foundation.
Author contributions
V.G. contributed to the design of the study, analysed the data, and wrote the first draft of the manuscript. V.G., G.L., F.D., J.-F.P.-P., M.-S.S.-B., P.B. and P.-F.P. contributed to discussions regarding model testing and results, and to the revision of the manuscript. V.G., G.L., F.D., J.-F.P.-P., M.-S.S.-B., P.B. and P.-F.P. read and approved the final version.
Data availability
Anonymised data are freely accessible from the public database of the Northeast ALS Consortium. Statistical code is shared on the following GitHub repository: .
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information is available for this paper at 10.1038/s41598-020-70125-8.
References
- 1. Robberecht W, Philips T. The changing scene of amyotrophic lateral sclerosis. Nat. Rev. Neurosci. 2013;14:248–264. doi: 10.1038/nrn3430.
- 2. Finegan E, Chipika RH, Shing SLH, Hardiman O, Bede P. Primary lateral sclerosis: A distinct entity or part of the ALS spectrum? Amyotroph. Lateral Scler. Frontotemporal Degener. 2019;20:133–145. doi: 10.1080/21678421.2018.1550518.
- 3. Swinnen B, Robberecht W. The phenotypic variability of amyotrophic lateral sclerosis. Nat. Rev. Neurol. 2014;10:661–670. doi: 10.1038/nrneurol.2014.184.
- 4. Paganoni S, et al. Diagnostic timelines and delays in diagnosing amyotrophic lateral sclerosis (ALS). Amyotroph. Lateral Scler. Frontotemporal Degener. 2014;15:453–456. doi: 10.3109/21678421.2014.903974.
- 5. Labra J, Menon P, Byth K, Morrison S, Vucic S. Rate of disease progression: A prognostic biomarker in ALS. J. Neurol. Neurosurg. Psychiatry. 2015;87:628–632. doi: 10.1136/jnnp-2015-310998.
- 6. Mitsumoto H, Brooks BR, Silani V. Clinical trials in amyotrophic lateral sclerosis: Why so many negative trials and how can trials be improved? Lancet Neurol. 2014;13:1127–1138. doi: 10.1016/s1474-4422(14)70129-2.
- 7. Elamin M, et al. Predicting prognosis in amyotrophic lateral sclerosis: A simple algorithm. J. Neurol. 2015;262:1447–1454. doi: 10.1007/s00415-015-7731-6.
- 8. Gordon PH, et al. Predicting survival of patients with amyotrophic lateral sclerosis at presentation: A 15-year experience. Neurodegener. Dis. 2012;12:81–90. doi: 10.1159/000341316.
- 9. Elamin M, et al. Executive dysfunction is a negative prognostic indicator in patients with ALS without dementia. Neurology. 2011;76:1263–1269. doi: 10.1212/wnl.0b013e318214359f.
- 10. Wolf J, et al. Factors predicting one-year mortality in amyotrophic lateral sclerosis patients—Data from a population-based registry. BMC Neurol. 2014;14:197. doi: 10.1186/s12883-014-0197-9.
- 11. Chiò A, et al. Prognostic factors in ALS: A critical review. Amyotroph. Lateral Scler. 2009;10:310–323. doi: 10.3109/17482960802566824.
- 12. Grollemund V, et al. Machine learning in amyotrophic lateral sclerosis: Achievements, pitfalls, and future directions. Front. Neurosci. 2019;13:135. doi: 10.3389/fnins.2019.00135.
- 13. Huang Z, et al. Complete hazard ranking to analyze right-censored data: An ALS survival study. PLoS Comput. Biol. 2017;13:e1005887. doi: 10.1371/journal.pcbi.1005887.
- 14. Schuster C, Hardiman O, Bede P. Survival prediction in amyotrophic lateral sclerosis based on MRI measures and clinical characteristics. BMC Neurol. 2017;17:1. doi: 10.1186/s12883-017-0854-x.
- 15. Beaulieu-Jones BK, Greene CS, et al. Semi-supervised learning of the electronic health record for phenotype stratification. J. Biomed. Inform. 2016;64:168–178. doi: 10.1016/j.jbi.2016.10.007.
- 16. Ong M-L, Tan PF, Holbrook JD. Predicting functional decline and survival in amyotrophic lateral sclerosis. PLoS ONE. 2017;12:e0174925. doi: 10.1371/journal.pone.0174925.
- 17. Westeneng H-J, et al. Prognosis for patients with amyotrophic lateral sclerosis: Development and validation of a personalised prediction model. Lancet Neurol. 2018;17:423–433. doi: 10.1016/s1474-4422(18)30089-9.
- 18. Taguchi, Y.-h., Iwadate, M. & Umeyama, H. Heuristic principal component analysis-based unsupervised feature extraction and its application to gene expression analysis of amyotrophic lateral sclerosis data sets. In 2015 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 10.1109/cibcb.2015.7300274 (IEEE, 2015).
- 19. Tang M, et al. Model-based and model-free techniques for amyotrophic lateral sclerosis diagnostic prediction and patient clustering. Neuroinformatics. 2019;17:407–421. doi: 10.1007/s12021-018-9406-9.
- 20. McInnes, L., Healy, J. & Melville, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018).
- 21. Lenglet T, et al. A phase II–III trial of olesoxime in subjects with amyotrophic lateral sclerosis. Eur. J. Neurol. 2014;21:529–536. doi: 10.1111/ene.12344.
- 22. Meininger V, et al. Pentoxifylline in ALS: A double-blind, randomized, multicenter, placebo-controlled trial. Neurology. 2006;66:88–92. doi: 10.1212/01.wnl.0000191326.40772.62.
- 23. PRO-ACT database. https://nctu.partners.org/ProACT/Home/Index. Accessed 01 Jan 2020.
- 24. Querin G, et al. Spinal cord multi-parametric magnetic resonance imaging for survival prediction in amyotrophic lateral sclerosis. Eur. J. Neurol. 2017;24:1040–1046. doi: 10.1111/ene.13329.
- 25. Hulley SB, Cummings SR, Browner WS, Grady DG, Newman TB. Designing Clinical Research. Philadelphia: Lippincott Williams & Wilkins; 2006.
- 26. Elamin M, et al. Cognitive changes predict functional decline in ALS: A population-based longitudinal study. Neurology. 2013;80:1590–1597. doi: 10.1212/wnl.0b013e31828f18ac.
- 27. Olney RK, et al. The effects of executive and behavioral dysfunction on the course of ALS. Neurology. 2005;65:1774–1777. doi: 10.1212/01.wnl.0000188759.87240.8b.
- 28. Xu Z, Alruwaili ARS, Henderson RD, McCombe PA. Screening for cognitive and behavioural impairment in amyotrophic lateral sclerosis: Frequency of abnormality and effect on survival. J. Neurol. Sci. 2017;376:16–23. doi: 10.1016/j.jns.2017.02.061.
- 29. Quartagno M, Carpenter JR. Multiple imputation for IPD meta-analysis: Allowing for heterogeneity and studies with missing covariates. Stat. Med. 2016;35:2938–54. doi: 10.1002/sim.6837.
- 30. Pirola A, et al. The prognostic value of spirometric tests in amyotrophic lateral sclerosis patients. Clin. Neurol. Neurosurg. 2019;184:105456. doi: 10.1016/j.clineuro.2019.105456.
- 31. Tortelli R, et al. Time to generalization and prediction of survival in patients with amyotrophic lateral sclerosis: A retrospective observational study. Eur. J. Neurol. 2016;23:1117–1125. doi: 10.1111/ene.12994.
- 32. Moglia C, et al. Early weight loss in amyotrophic lateral sclerosis: Outcome relevance and clinical correlates in a population-based cohort. J. Neurol. Neurosurg. Psychiatry. 2019;90:666–673. doi: 10.1136/jnnp-2018-319611.
- 33. Byrne S, et al. Cognitive and clinical characteristics of patients with amyotrophic lateral sclerosis carrying a c9orf72 repeat expansion: A population-based cohort study. Lancet Neurol. 2012;11:232–240. doi: 10.1016/s1474-4422(12)70014-5.
- 34. Bede P, Iyer PM, Finegan E, Omer T, Hardiman O. Virtual brain biopsies in amyotrophic lateral sclerosis: Diagnostic classification based on in vivo pathological patterns. NeuroImage Clin. 2017;15:653–658. doi: 10.1016/j.nicl.2017.06.010.
- 35. Agosta F, et al. Survival prediction models in motor neuron disease. Eur. J. Neurol. 2019;26:1143–1152. doi: 10.1111/ene.13957.
- 36. Hothorn T, Jung HH. RandomForest4life: A random forest for predicting ALS disease progression. Amyotroph. Lateral Scler. Frontotemporal Degener. 2014;15:444–452. doi: 10.3109/21678421.2014.893361.
- 37. Ko, K. D., El-Ghazawi, T., Kim, D. & Morizono, H. Predicting the severity of motor neuron disease progression using electronic health record data with a cloud computing big data approach. In 2014 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, 10.1109/cibcb.2014.6845506 (IEEE, 2014).
- 38. Küffner R, et al. Crowdsourced analysis of clinical trial data to predict amyotrophic lateral sclerosis progression. Nat. Biotechnol. 2015;33:51. doi: 10.1038/nbt.3051.
- 39. Taylor AA, et al. Predicting disease progression in amyotrophic lateral sclerosis. Ann. Clin. Transl. Neurol. 2016;3:866–875. doi: 10.1002/acn3.348.
- 40. Jahandideh S, et al. Longitudinal modeling to predict vital capacity in amyotrophic lateral sclerosis. Amyotroph. Lateral Scler. Frontotemporal Degener. 2017;19:294–302. doi: 10.1080/21678421.2017.1418003.
- 41. Pfohl SR, Kim RB, Coan GS, Mitchell CS. Unraveling the complexity of amyotrophic lateral sclerosis survival prediction. Front. Neuroinform. 2018;12:36. doi: 10.3389/fninf.2018.00036.
- 42. Bede P, Querin G, Pradat P-F. The changing landscape of motor neuron disease imaging. Curr. Opin. Neurol. 2018;31:431–438. doi: 10.1097/wco.0000000000000569.
- 43. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): The TRIPOD statement. BMC Med. 2015;13:1. doi: 10.1186/s12916-014-0241-z.