PLOS ONE
. 2024 Feb 29;19(2):e0299487. doi: 10.1371/journal.pone.0299487

Machine learning approaches to enhance diagnosis and staging of patients with MASLD using routinely available clinical information

Matthew McTeer 1,*,#, Douglas Applegate 2, Peter Mesenbrink 3, Vlad Ratziu 4, Jörn M Schattenberg 5, Elisabetta Bugianesi 6, Andreas Geier 7, Manuel Romero Gomez 8, Jean-Francois Dufour 9, Mattias Ekstedt 10, Sven Francque 11, Hannele Yki-Jarvinen 12, Michael Allison 13, Luca Valenti 14, Luca Miele 15, Michael Pavlides 16, Jeremy Cobbold 16, Georgios Papatheodoridis 17, Adriaan G Holleboom 18, Dina Tiniakos 17,19, Clifford Brass 2, Quentin M Anstee 19,20,#, Paolo Missier 1,#; on behalf of the LITMUS Consortium investigators
Editor: Pavel Strnad
PMCID: PMC10903803  PMID: 38421999

Abstract

Aims

Metabolic dysfunction-Associated Steatotic Liver Disease (MASLD) outcomes such as MASH (metabolic dysfunction-associated steatohepatitis), fibrosis and cirrhosis are ordinarily determined by resource-intensive and invasive liver biopsies. We aim to show that routine clinical tests offer sufficient information to predict these endpoints.

Methods

Using the LITMUS Metacohort derived from the European NAFLD Registry, the largest MASLD dataset in Europe, we created three feature combinations that vary in difficulty of procurement, including a 19-variable feature set obtainable through a routine clinical appointment or blood test. These data were used to train predictive models with the supervised machine learning (ML) algorithm XGBoost, alongside the missing-data imputation technique MICE and the class-balancing algorithm SMOTE. Shapley Additive exPlanations (SHAP) were computed to determine the relative importance of each clinical variable.

Results

Analysing nine biopsy-derived MASLD outcomes, with cohort sizes ranging between 5385 and 6673 subjects, we predicted individuals' status with training-set AUCs ranging from 0.719 to 0.994, including classifying individuals who are At-Risk MASH at an AUC of 0.899. Using two further feature combinations of 26 and 35 variables, which included composite scores known to be good indicators of MASLD endpoints as well as advanced specialist tests, we found that predictive performance did not sufficiently improve. We also present local and global explanations for each ML model, offering clinicians interpretability without the expense of worsened predictive performance.

Conclusions

This study developed a series of ML models with AUCs ranging from 0.719 to 0.994, using only easily extractable and readily available information to predict MASLD outcomes that are usually determined through highly invasive means.

Introduction

Metabolic dysfunction-Associated Steatotic Liver Disease (MASLD), formerly known as Non-Alcoholic Fatty Liver Disease (NAFLD) [1], is the world's most common chronic liver disease, and with the rise in increasingly sedentary lifestyles, poses a major challenge to healthcare systems globally. It is estimated that over 25% of the global adult population has MASLD [2], which is predicted to soon be the leading cause of liver transplantation [3]. MASLD encompasses a spectrum of disease severity, ranging from isolated increased hepatic triglyceride content (steatosis; metabolic dysfunction-associated steatotic liver—MASL), through hepatic inflammation and hepatocyte injury (metabolic dysfunction-associated steatohepatitis—MASH) with increasing fibrosis, and ultimately to cirrhosis and/or hepatocellular carcinoma [4]. More advanced stages of hepatic fibrosis are associated with an increased risk of liver-related and all-cause mortality [5]. The reference standard for grading and staging MASLD is histological, using a semi-quantitative scoring system [6, 7]. However, liver biopsy requires expertise in both procurement and histological assessment, is costly, harbours inherent risks and has methodological limitations (e.g., sampling variability and intra- and inter-pathologist scoring variability), rendering it unsuitable for routine MASLD clinical practice [8, 9]. In recent years there have been major advances in the development of non-invasive biomarkers, both blood-based and radiological [10]. Candidate serum and imaging biomarkers, as well as multi-marker panels, are currently being evaluated in large, multi-centre independent cohorts by international research consortia like LITMUS in Europe and NIMBLE in the USA [11, 12]. However, studies suggest that biomarker performance for the diagnostic context of use remains to date only borderline, with classification AUC scores around 0.80 [13].
With no single marker or panel conclusively predicting biopsy results, the hope remains that a combination of complementary assessments may improve diagnostic performance. The application of standard machine learning (ML) approaches to multi-modal training sets also remains relatively unexplored in this research area.

The objective of this study was to investigate the role of selected clinical variables associated with MASL and MASH in predicting a set of biopsy-derived outcomes that indicate the stage of progression along the MASLD spectrum. This work explored the utility of ML approaches to predict binary target conditions relating to biopsy-derived phenotypes across the MASLD spectrum, including At-Risk MASH, Advanced Fibrosis and Cirrhosis. Ultimately, we aimed to show that routinely available clinical tests can provide sufficient information to predict these outcomes, suggesting a reduced need to carry out invasive biopsies. Scholars have attempted to tackle this problem using ML, with many studies primarily focusing on identifying novel combinations of biomarkers that can replace existing surrogate scores indicating disease severity [14–17]. Most studies claim to outperform existing surrogate markers such as the Hepatic Steatosis Index (HSI) and the Fatty Liver Index (FLI). The strongest results come from studies that use non-routinely collected multi-omics data [18]. Some scholars have focused on using only routinely collected clinical information in their analysis [19–21]; however, the study cohorts used or the results their methods yielded have been limited.

This paper demonstrates how we achieved our aim by using only data that are easily and readily available from routine clinical appointments and standard blood tests to accurately predict, via ML, individuals who are at risk of MASH and other outcomes relating to MASLD severity. We also show that introducing variables that are more difficult to obtain into ML classifiers does not improve accuracy sufficiently to offset the cost of procuring those variables.

Materials and methods

Study population

This study utilised data drawn from the LITMUS Metacohort from patients participating in the European NAFLD Registry (NCT04442334), an international cohort of NAFLD patients prospectively recruited following standardized procedures and monitoring; see Hardy and Wonders et al. for details [12]. Patients were required to provide informed consent prior to inclusion. Studies contributing to the Registry were approved by the relevant Ethical Committees in the participating countries and conform to the guidelines of the Declaration of Helsinki. The Metacohort enrolled subjects from sites in Belgium, Finland, France, Germany, Italy, the Netherlands, Spain, Sweden, Switzerland, and the UK between Jan 6, 2010, and Dec 29, 2017. Subjects were at least 18 years old and clinically suspected of having MASLD, having been referred for further investigation due to abnormal biochemical liver tests and/or radiological evidence of steatosis. Participating subjects also received a liver biopsy confirming their MASLD status within 6 months of enrolment. After providing written informed consent, participants underwent standardised assessment protocols, including collection of serum blood samples for later analysis with novel biomarkers. Participants reporting excessive alcohol consumption (>20/30 g per day for women/men) in the preceding 6 months and/or a history of excessive alcohol consumption in the past 5 years were excluded, along with participants reporting other causes of chronic liver disease. Summary statistics of the LITMUS Metacohort at baseline assessment are presented in S1 Table in the S1 File.

Features and responses

The branch of ML that this paper focuses on is supervised learning classification. Here, ML algorithms learn from observations that have been labelled, in this case as either negative (0) or positive (1) for a particular target condition, and use that information to build a model that can predict individuals whose status for the target is unknown (i.e., unlabelled). The information in this case refers to the clinical data collected, known as features. In this paper, we use a set of non-invasive clinical and novel biomarkers as our set of predictive features. Clinically derived features were collected by a trained investigator (e.g., weight, BMI, comorbidity information) while standard clinical biochemistry (e.g., LDL, HDL, platelet count, ALT, AST, GGT) was measured at each site's local laboratory. Additional biomarkers available included vibration-controlled transient elastography (VCTE; Fibroscan™, Echosens, Paris, France) to measure liver stiffness; the Enhanced Liver Fibrosis (ELF) test [22, 23], measured on the ADVIA Centaur CP system (Siemens, Munich, Germany); and multiple direct collagen biomarkers, including collagen neo-epitopes Pro-C3, Pro-C4, Pro-C6 [24, 25].

Three different combinations of these features which vary in the difficulty of procurement are used, which are referred to as follows:

  • Core features—19 clinical variables that are considered standard measurements that are achieved through a routine clinical appointment or blood test.

  • Extended features—26 clinical variables that include the 19 Core features plus 7 features that are either easy to acquire but not routinely collected, or composite scores known to be good indicators of MASLD endpoints.

  • Specialist features—35 variables that include the 26 clinical features outlined in the Extended feature set and 9 specialist tests that are rarely procured.

These three feature sets are outlined in Table 1 and are described and evaluated individually in [13]. All three feature sets were applied within the ML modelling to predict 9 binarized target conditions, with the number of individuals in the negative and positive classes highlighted in Table 2. These targets are recorded by pathologists from liver biopsies. Biopsy evaluation was performed by expert liver pathologists at the recruiting site. Biopsies, when deemed of sufficient quality and size for clinical diagnosis, were assessed using the NASH Clinical Research Network (NASH CRN) scoring system, in which steatosis and lobular inflammation are scored on a semi-quantitative ordinal scale of integers 0–3 and hepatocyte ballooning is scored 0–2. Summed, these three scores provide the composite NAFLD Activity Score (denoted NAS in Table 2), ranging from 0 to 8. The Disease Activity Score (denoted A in Table 2) is also a composite score, ranging from 0–5 and consisting of the levels of hepatocyte ballooning and lobular inflammation. Fibrosis stage (denoted F in Table 2) is scored with integers 0–4 [6]. Definitions for each target condition are outlined in Table 2; they are of particular interest as they indicate whether a patient has progressed to MASH or advanced hepatic fibrosis. As an example, an individual is classified as At-Risk MASH (i.e. positive) if they have a NAS score greater than or equal to 4 as well as a fibrosis stage greater than or equal to 2; otherwise the individual is not At-Risk MASH (i.e. negative).

Table 1. Clinical variables belonging to the three feature sets used within this analysis.

Feature Set Clinical Variables
Core Features Age, Gender, BMI, Historic Alcohol Consumption (>5 years ago), Insulin Resistance, Hypertensive, Metabolic Syndrome, eGFR, Dyslipidaemia, ALT, AST, GGT, Platelets, Creatinine, Serum Triglycerides, Albumin, Bilirubin, Obstructive Sleep Apnoea, AST-ALT Ratio
Extended Features Core Features + FIB4, NFS, APRI, BARD, Waist-to-hip Ratio, Ferritin, IgA
Specialist Features Extended Features + Fibroscan Stiffness, CK18-M30, CK18-M65, Pro-C3, Pro-C6, ELF, ADAPT, FIBC3, ABC3D

Table 2. MASLD target conditions' class distributions.

Target Condition Definition Negative (0) Positive (1) -/+ Ratio
MASL vs. MASH NAS <4 (0) vs. NAS ≥4 (1) 2776 3132 0.9: 1
At-Risk MASH NAS <4 AND/OR F<2 (0) vs. NAS ≥4 AND F≥2 (1) 4014 2010 2.0: 1
High Activity A <2 (0) vs. A ≥2 (1) 2426 3672 0.7: 1
Clinically Significant Fibrosis F <2 (0) vs. F ≥2 (1) 3771 2532 1.5: 1
Advanced Fibrosis (Histology confirmed) F <3 (0) vs. F ≥3 (1) 4921 1382 3.6: 1
Cirrhosis (Histology confirmed) F <4 (0) vs. F ≥4 (1) 5815 488 11.1: 1
Advanced Fibrosis (Histology & Clinically confirmed) F <3 (0) vs. F ≥3 AND clinically cirrhotic cases (1) 5163 1510 3.4: 1
Cirrhosis (Histology & Clinically confirmed) F <4 (0) vs. F ≥4 AND clinically cirrhotic cases (1) 6009 664 9.0: 1
At-Risk MASLD Otherwise (0) vs. 2 ≤ F ≤ 3 AND NAS ≥4 (1) (Cirrhotics Excluded) 3979 1406 2.8: 1

The outputs of each combination of feature set and target condition are models that represent functions of the input clinical variables and can accurately predict one or more of these outcomes.
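The binarization of histology scores into target labels (Table 2) reduces to simple threshold logic. A minimal sketch for two of the targets, with function names of our own choosing, not taken from the paper:

```python
def at_risk_mash(nas: int, fibrosis: int) -> int:
    """Positive (1) when NAS >= 4 AND fibrosis stage >= 2 (per Table 2)."""
    return int(nas >= 4 and fibrosis >= 2)

def advanced_fibrosis(fibrosis: int) -> int:
    """Histology-confirmed Advanced Fibrosis: fibrosis stage >= 3."""
    return int(fibrosis >= 3)
```

For example, a patient with NAS 5 and fibrosis stage 2 is labelled positive for At-Risk MASH, while the same NAS with fibrosis stage 1 is labelled negative.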

Learning approach—XGBoost

Due to data sparsity and varying levels of missingness across these 3 feature sets, we focused our research on algorithms that do not require a fully imputed dataset, namely XGBoost (eXtreme Gradient Boosting) [26]. XGBoost was highly suitable for these binary classification exercises due to its consistently high performance in classification tasks and its ability to treat missing values for features as values themselves, meaning that no imputation is required. As a gradient-boosted tree ensemble method, XGBoost also mitigates issues surrounding variable multicollinearity: each split considers one variable at a time, not its correlation with other features. This is particularly important in our models that use the Extended or Specialist feature set, as many features used are composites of others.
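XGBoost's treatment of missingness can be illustrated with a toy version of its "default direction" rule: at each split, samples with a missing feature value are routed to whichever branch lowers the loss more. A simplified numpy sketch of the idea (not the actual XGBoost implementation, which works on gradient statistics):

```python
import numpy as np

def best_missing_direction(x, y, threshold):
    """Choose the branch to which samples with a missing feature value
    are routed, picking whichever assignment yields the lower total
    squared error (a toy version of XGBoost's learned default direction)."""
    miss = np.isnan(x)
    left, right = x < threshold, x >= threshold   # NaN compares False in both

    def sse(mask):
        if mask.sum() == 0:
            return 0.0
        return float(((y[mask] - y[mask].mean()) ** 2).sum())

    err_left = sse(left | miss) + sse(right)      # send missing samples left
    err_right = sse(left) + sse(right | miss)     # send missing samples right
    return "left" if err_left <= err_right else "right"
```

Because the direction is learned per split, a patient with a missing AST value, say, is simply routed down the branch that missing-AST patients in the training data most resembled, with no imputation step.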

Missing imputation—MICE

In preliminary studies we compared a number of supervised learning algorithms on predicting the binary target condition for presence/absence of “At-Risk MASH” (NAS ≥4 with Fibrosis ≥2) using all predictive features. Unlike XGBoost, the vast majority of algorithms require a fully complete dataset, and with the Metacohort being a real-world dataset, only 3 of the 35 clinical features were fully complete. The missing-data imputation tool MICE [27] (Multiple Imputation by Chained Equations) was therefore necessary. These experiments showed that XGBoost, despite tolerating missing values, remained the best-performing learning algorithm, and that performing MICE imputation on the training set offered marginal improvements over running XGBoost without imputation. Depending on the data type of each feature, MICE uses different methods to determine the missing value: for example, predictive mean matching for continuous numerical data, logistic regression for dichotomous variables and polytomous regression for categorical data. The level of imputation required for ‘Core’ features is far less than that of the harder-to-procure feature sets; however, even of the 19 ‘Core’ clinical features, only 10 were more than 90% complete.
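The chained-equations idea can be sketched in a few lines: initialise missing entries, then repeatedly regress each incomplete column on the others and refresh its missing entries from the fit. This simplified pass uses ordinary least squares throughout, whereas MICE proper switches model by data type (predictive mean matching, logistic or polytomous regression) and draws multiple stochastic imputations:

```python
import numpy as np

def mice_like_impute(data, n_iter=5):
    """Simplified chained-equations imputation: fill missing entries with
    column means, then repeatedly regress each incomplete column on all
    the others (OLS) and overwrite its missing entries with the fit."""
    X = data.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(data, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]          # crude initialisation
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            # Design matrix: intercept plus every other column.
            A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
            obs = ~miss[:, j]
            coef, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ coef
    return X
```

On a toy dataset where one column is exactly twice another, a missing entry converges to the regression prediction rather than the column mean.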

Class balancing—SMOTE

In ML classification models, imbalanced datasets can cause the model to skew predictions towards the majority class in order to maximise accuracy. We therefore have the option of either downsampling (removing datapoints of the majority class) or upsampling (increasing the number of datapoints of the minority class). Downsampling is less favourable because it discards perfectly valid datapoints, so we focus on upsampling techniques. SMOTE (Synthetic Minority Oversampling Technique) is an upsampling method that synthetically creates new minority-class datapoints: it selects minority-class examples that are near neighbours in feature space, draws a hypothetical line between them, and creates a new datapoint at some point along this line, thus generating a new synthetic minority-class sample [28]. SMOTE is considered more reliable than other upsampling techniques because it interpolates new data between existing minority-class datapoints, whereas techniques that simply duplicate existing datapoints can lead to overfitting. It is important to note, however, that SMOTE is only operational when there are no missing values across the feature set; therefore, all models that use SMOTE must also use MICE.
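The interpolation step described above can be sketched directly. A simplified version of the core SMOTE idea, not the imbalanced-learn implementation:

```python
import numpy as np

def smote_sample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority-class points by interpolating
    between a randomly chosen minority point and one of its k nearest
    minority-class neighbours (the core SMOTE idea)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(minority, dtype=float)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        dist = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                        # position along the segment
        new.append(X[i] + lam * (X[j] - X[i]))
    return np.array(new)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay within the region the minority class already occupies rather than duplicating existing rows.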

Model interpretability—SHAP

Particularly in ML studies in medical domains, there has long existed a tension between understanding how classifiers reach their conclusions and overall model accuracy [29]. Medical settings in particular rely on model interpretability to reduce the complexity of a model and thereby increase the trustworthiness of its results. Using an additive feature attribution method known as Shapley values [30], the relative importance of each feature can be determined; this helped explain individual predictions through feature weightings for every model generated in this paper. Shapley values can be added to a model post hoc; using the Python library SHAP, this allows easy integration of interpretability into naturally hard-to-understand black-box models such as XGBoost.
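The Shapley value underlying SHAP is a feature's marginal contribution to the prediction, averaged over all orderings in which features could be added; for a handful of "players" it can be computed exactly (SHAP's TreeExplainer obtains the same attributions for tree ensembles in polynomial time). A self-contained sketch, with illustrative feature names of our own:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: each player's marginal contribution to
    value(), weighted over all coalitions it could join."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        rest = [q for q in players if q != p]
        for r in range(n):
            for S in combinations(rest, r):
                w = factorial(r) * factorial(n - r - 1) / factorial(n)
                phi[p] += w * (value(set(S) | {p}) - value(set(S)))
    return phi
```

For an additive "game" where each feature contributes a fixed amount, the Shapley values recover exactly those contributions, and they always sum to the value of the full coalition (the efficiency property that makes SHAP attributions add up to the model output).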

Experimental design

Analysing the 9 target conditions outlined in Table 2 and utilising the 3 feature sets of varying degrees of procurement, we applied 3 different model frameworks (XGBoost; XGBoost with MICE; XGBoost with MICE and SMOTE) to every dataset and target combination. This provided 81 classifiers in total. We applied an 80%:20% train-test split to each feature set and target condition; the training data were used to tune the hyperparameters of the XGBoost classifiers to derive a ML model. This model was then cross-validated on the training data to obtain evaluation metrics; the mean AUC, Accuracy, Sensitivity, Specificity and F1 scores across 5 folds were recorded. The model was then applied to the test set to derive predicted values and establish a test-set AUC.
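The evaluation protocol (an 80:20 split, with 5-fold cross-validation inside the training portion) can be sketched with index bookkeeping alone; helper names here are ours, not the paper's code:

```python
import numpy as np

def train_test_indices(n, test_frac=0.2, seed=0):
    """Shuffle n sample indices and return (train, test) index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(n * test_frac))
    return idx[n_test:], idx[:n_test]

def kfold_indices(train_idx, k=5):
    """Yield (fit, validate) index pairs for k-fold cross-validation
    over the training portion only; the test set is never touched."""
    folds = np.array_split(train_idx, k)
    for i in range(k):
        fit = np.concatenate([folds[j] for j in range(k) if j != i])
        yield fit, folds[i]
```

Keeping the held-out 20% outside the cross-validation loop is what allows the test-set AUC to serve as an overfitting check on the tuned model.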

Results

Machine learning vs. Univariate linear approach

The ML classifiers used multiple variables in their decision making and were more effective at predicting these outcomes than any individual feature alone. Fig 1 compares the training-set AUC achieved by univariate logistic regression models for each of the 35 features explored in the analysis with the training-set AUC achieved across all ML models created in predicting At-Risk MASH. All 9 ML models outperformed each individual variable used in isolation, demonstrating that the predictive power of these ML models is substantially greater than that of individual variables previously used to predict various MASLD outcomes. On test-set AUC the ML models performed less well; however, the differences between training and test performance are small and to be expected. This is to ensure that our classifiers have not overfitted, that is, performed so well on training data that they struggle to generalise to unseen data and predict observations reliably; we discuss our approaches to mitigating overfitting in the discussion section of this work. A handful of univariate test-set AUCs are difficult to compare with our ML classifiers because of very small sample sizes (N < 200 in 8 univariate cases), although the majority of univariate models have test sets comparable to the N ≈ 1200 of the ML classifiers. Even the smallest univariate samples offer a robust estimate, however, as far fewer observations are required in models that use only one feature. In general, ML models outperform models that use each variable individually, highlighting the improvement in classifier performance that ML models can offer over existing approaches.
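AUC, the headline metric in these comparisons, can be computed directly from ranks: it is the probability that a randomly chosen positive case scores higher than a randomly chosen negative one. A minimal sketch:

```python
import numpy as np

def auc(y_true, scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    case scores above a randomly chosen negative case (ties count 0.5)."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y == 1], s[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

Perfect separation of the classes gives an AUC of 1.0, a completely reversed ranking gives 0.0, and an uninformative score gives 0.5, which is why values around 0.80 for individual biomarkers leave clear headroom.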

Fig 1. ML/Linear approach comparison for predicting At-Risk MASH.

Fig 1

Error bars denote +/- S.D from k = 5 fold cross-validation.

Modelling using core features

Focusing first on At-Risk MASH as the target condition, the first stage of ML modelling addressed whether it was possible to accurately predict individuals with different MASLD outcomes using only Core variables that are routinely collected from either a routine clinical appointment or a standard blood test. The Core feature set had 3 combinations of ML modelling applied: XGBoost; XGBoost with MICE; XGBoost with MICE and SMOTE. Each model contained p = 19 predictors and had N = 6024 observations, of which 2010 were considered ‘At-Risk MASH’ (positive) and 4014 were not (negative). Taking the 80:20 training-test split, we observed a 2:1 negative-to-positive ratio in the training-set class split. We wished to balance only the training set for the XGBoost with MICE and SMOTE model, and therefore artificially enhanced the minority class (in this case the positive set) from 1601 to 3218 datapoints to match the number of negative cases in the training set. It is important to note that rebalancing was not applied to the test set, so that the test set remains as close as possible to what we would expect to see in reality, thus reducing model bias.

Good classifier performance was attained using Core variables to predict At-Risk MASH, with a training AUC of 0.814 for the model that used no imputation or class balancing. AUC improved markedly, by ∼8%, when MICE and SMOTE were used. It is worth noting that the classifier was much better at predicting the negative (and in this case majority) class of individuals who were not At-Risk MASH, with an average specificity of 86.2%, than the positive (minority) class, with an average sensitivity of 63.1%. However, with the class-balancing algorithm SMOTE, the performance of the classifier on the minority positive class improved at the expense of reduced predictive power on the majority negative class; in fact, classifier accuracy for the positive class then exceeded that for the negative class. This was preferable, as it ultimately improved all other evaluation metrics of the classifier.

Following the repetition of the 3 combinations of ML modelling techniques upon all other target conditions using the Core variable dataset, as was the case with At-Risk MASH, the best performing classifier was the model that used missing imputation MICE and class balancing algorithm SMOTE. Performance metrics for these models are shown in Table 3.

Table 3. Evaluation metrics for ’Core’ dataset performance upon predicting all response using XGBoost with MICE and SMOTE.

Response AUC Accuracy Sensitivity Specificity F1
MASL vs. MASH 0.719 0.663 0.658 0.667 0.661
At-Risk MASH 0.899 0.820 0.827 0.812 0.821
High Activity 0.801 0.723 0.720 0.734 0.724
Clinically Significant Fibrosis 0.852 0.778 0.767 0.789 0.775
Advanced Fibrosis (Histology confirmed) 0.960 0.895 0.909 0.880 0.896
Cirrhosis (Histology confirmed) 0.994 0.964 0.980 0.949 0.965
Advanced Fibrosis (Histology & Clinically confirmed) 0.961 0.901 0.915 0.888 0.903
Cirrhosis (Histology & Clinically confirmed) 0.993 0.960 0.973 0.947 0.960
At-Risk MASLD 0.921 0.846 0.856 0.835 0.847

Strong predictive performance was achieved for the XGBoost with MICE and SMOTE classifiers using only Core variables, with eight of the nine targets producing an AUC >0.800. We found that all models employing SMOTE improved either sensitivity or specificity at the expense of a small decline in the other, providing a more equal balance between these metrics than before class balancing was used. For classifiers with a heavy class imbalance, such as Advanced Fibrosis (Histology confirmed), Cirrhosis (Histology confirmed), Advanced Fibrosis (Histology & Clinically confirmed), Cirrhosis (Histology & Clinically confirmed) and At-Risk MASLD, the improvement in sensitivity or specificity was greater, and therefore the improvement in overall AUC was greater also.

We also compared the AUC achieved from k = 5 training cross-validation with the AUC achieved when applying the models to the test sets, to check for overfitting. Cross-validation helps to assess how models generalise to unseen data, its main purpose being to estimate model performance on new test data. Fig 2 displays the ROC curves for all 5 cross-validation folds on the training set, with the mean ROC, alongside the ROC achieved on the test set. The model in question in Fig 2 was the XGBoost with MICE and SMOTE model used to predict At-Risk MASH using Core variables only. The AUC for the test set was lower than the mean training-set AUC by approximately 9%. This was not a significant drop in model performance, and with a test AUC of 0.80 the model displayed good generalisation to new and unseen data.

Fig 2. Training/Test set comparison.

Fig 2

Training and Test AUCs and ROC curves for XGB + MICE + SMOTE model using Core variables upon predicting At-Risk MASH.

Model interpretability

Applying Shapley values to the most optimal model, we obtained a clear ranking of variables by the magnitude of their effect on the model’s output. Fig 3 ranks the ‘Core’ features by importance to model prediction from top to bottom, with AST, Platelet Count and AST-ALT Ratio being the most influential predictors of the 19 available. The relative feature value for each variable is presented, with red representing higher values for that feature and blue representing lower values. The x-axis of this chart shows the relative ‘push’ towards a positive or negative model output: for instance, the higher the Age of an individual, the more likely the model is to classify that individual as positive and ‘At-Risk MASH’.

Fig 3. SHAP summary plots.

Fig 3

Ranking of Core variables in terms of their influence on predicting At-Risk MASH for XGBoost with MICE and SMOTE model.

Interpretability was also available for local predictions as well as for the global model. Fig 4 illustrates 4 ‘force plots’ for 4 individuals with and without Type 2 Diabetes, and with and without high-stage fibrosis (F >2). Force plots visualise each feature’s attribution as ‘forces’ that either increase or decrease the model’s predicted value for the observation. In the case of the individual without Type 2 Diabetes and with low-stage fibrosis (top left), strong negative influences from Albumin, GGT and Age outweigh the positive forces from Insulin Resistance, such that a low prediction output of 0.07 was obtained; this individual would therefore be predicted by the model to not be At-Risk MASH.

Fig 4. SHAP force plots.

Fig 4

Force plots illustrating the impact of each feature upon the prediction of 4 random individuals’ probabilities of At-Risk MASH. Top Left: A non-diabetic, 49 year old man of low fibrosis stage. Top Right: A diabetic, 69 year old woman of low fibrosis stage. Bottom Left: A non-diabetic, 76 year old woman of high fibrosis stage. Bottom Right: A diabetic, 55 year old man of high fibrosis stage.

It is worth noting that an individual’s diabetic status and fibrosis stage were not used to train this model; however, from the force plots we can see that very high values of 0.99 and 0.96 were predicted for the 2 individuals in a high fibrosis stage, the model therefore considering these individuals to be At-Risk MASH. For the 2 individuals in a low fibrosis stage, one low prediction value of 0.07 and one high prediction value of 0.98 were returned, the higher value belonging to the individual who is diabetic; the model therefore considers the non-diabetic individual with low fibrosis stage to be not At-Risk MASH and the diabetic individual with low fibrosis stage to be At-Risk MASH. Each force plot has different features considered most important for its respective prediction; however, common features that appeared as most critical in these plots were Age, AST, and Platelet count.

Modelling using extended and specialist features

Initially focusing again on the target condition of At-Risk MASH, the analysis was repeated using the Extended and Specialist feature sets; in total, therefore, 9 models predicted this response. Fig 5 illustrates the AUC for each of these models.

Fig 5. Modelling for At-Risk MASH.

Fig 5

AUC for every classifier predicting At-Risk MASH by Feature Set and Model Composition.

By directly comparing each model composition across feature sets, we see that the average improvement in AUC between the Core and Extended feature sets, 0.03%, was negligible for predicting At-Risk MASH. This compares with an average decline in AUC of 3.4% between the Core and Specialist feature sets. Other performance metrics showed similar improvements and deteriorations for At-Risk MASH. The average improvements in model accuracy, sensitivity, and specificity ranged between 0.03% and 1.57% when comparing the Extended to the Core feature set, so very little difference was found using the extra 7 variables in this new set. Comparing average performance with the Specialist feature set, there was a 4.70% fall in overall accuracy, an 8.17% fall in specificity and a 6.23% improvement in sensitivity.

This pattern was observed across every target condition explored in this work, with the average improvement in AUC when using the Extended feature set over the Core feature set being 0.39%. The same held for accuracy, sensitivity and specificity, with the greatest improvement being a 1.17% increase in sensitivity. It seems unlikely that such gains in classification performance offset the cost of obtaining the extra 7 features.

The differences in average performance metrics between the Core and Specialist feature sets were far more variable, however. AUC for predicting MASL vs. MASH and High Activity improved by more than 5% on average, yet for other targets such as At-Risk MASH and At-Risk MASLD average AUC deteriorated when Specialist features were introduced. It is therefore difficult to draw a general conclusion when comparing these 2 feature sets in terms of overall performance. Sensitivity, however, improved markedly (avg. 10.07%) for every target when the Specialist feature set was used, typically at the expense of heavily reduced specificity (avg. -4.86%). The high variability between the performance of the Core and Specialist feature sets was likely due to differences in the number of individuals available for each set; comparison is therefore more difficult, but there is little evidence to suggest that Specialist features perform significantly better than those accessible through a routine clinical appointment.

Discussion

Our best models achieved an AUC of 0.899 in cross-validation and 0.800 on our hold-out test set for predicting At-Risk MASH, with similar performance for other endpoints. These scores largely track the performance observed in [13] and reflect a modest improvement over individual biomarkers. Notably, our machine learning models using ‘Core’ features significantly outperformed established markers such as FIB-4 (AUC = 0.708 for At-Risk MASH on our test sample), and provided similar levels of performance to the best-performing specialised markers. This suggests that incremental improvement in MASLD/MASH screening is possible when established biomarker assays are combined with more advanced models. Our inability to radically improve classification performance may be due to the relatively small sample size of novel biomarkers available in the LITMUS Metacohort; progress with similar models may be possible with the more complete prospective LITMUS Study Cohort [12]. We must also acknowledge that classification performance is ultimately limited by the fundamental variability in biopsy reads, though we do not claim to have reached that ceiling yet.
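For reference, FIB-4 is a composite score computed from four routinely available values. A minimal sketch using the standard published formula (the input values here are purely illustrative):

```python
# FIB-4: age[yr] * AST[U/L] / (platelets[10^9/L] * sqrt(ALT[U/L]))
# Standard published formula; the example inputs are illustrative only.
import math

def fib4(age, ast, alt, platelets):
    return age * ast / (platelets * math.sqrt(alt))

print(round(fib4(age=55, ast=40, alt=35, platelets=220), 2))  # → 1.69
```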

Medical data typically have varying levels of missingness, as it is either infeasible or unnecessary to record every variable for each individual at baseline appointments. The level of missingness therefore increases as more complex feature sets are used in the modelling. For some variables, the imputation required for the Specialist feature set was so extensive that it was unwise to use the same number of observations as for the other two feature sets. Comparisons between classifiers using Core features (N ≈ 6000) and those using Specialist features (N ≈ 950) should therefore be treated with caution.
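The chained-equations idea behind MICE can be sketched as follows, using scikit-learn's IterativeImputer as a stand-in for the `mice` package used in the study (an assumption for illustration; the data here are synthetic):

```python
# Sketch of MICE-style imputation: each variable with missing values is
# modelled on the others, cycling until the imputations stabilise.
# IterativeImputer is a stand-in for the R `mice` package.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 3] = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# Knock out ~20% of one column at random to mimic clinical missingness.
mask = rng.random(200) < 0.2
X_miss = X.copy()
X_miss[mask, 3] = np.nan

imputer = IterativeImputer(max_iter=10, random_state=0)
X_imp = imputer.fit_transform(X_miss)

print(np.isnan(X_imp).sum())  # 0: no missing values remain
```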

All target conditions explored in this paper are also imbalanced. Ideally for ML classification, negative and positive cases would occur in a 1:1 ratio; the closest any target came was Steatosis vs. MASH at 0.9:1. Some targets displayed severe class imbalance, with Cirrhosis (Histology Confirmed) showing a negative-to-positive class ratio of 11.1:1. The class balancing algorithm SMOTE was therefore applied more aggressively where imbalance was greatest. Although SMOTE is a marked improvement over earlier upsampling methods, it still has limitations: in particular, it does not assess the quality of the synthetic samples it generates and can therefore struggle to fully capture the distribution of the minority class. Nevertheless, SMOTE offers significant advantages in ML classification and is widely regarded as beneficial and reliable for heavily imbalanced datasets.
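The interpolation step at the heart of SMOTE can be illustrated in a few lines of numpy. This is a simplified sketch of the idea, not the imblearn implementation the study relied on:

```python
# Minimal illustration of SMOTE's core idea: synthesise new minority
# samples by interpolating between a sample and one of its k nearest
# minority-class neighbours. A sketch, not the library implementation.
import numpy as np

def smote_like(X_min, n_synth, k=5, seed=0):
    rng = np.random.default_rng(seed)
    # Pairwise distances within the minority class only.
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]            # k nearest neighbours
    base = rng.integers(0, len(X_min), n_synth)  # anchor samples
    nbr = nn[base, rng.integers(0, k, n_synth)]  # one neighbour each
    u = rng.random((n_synth, 1))                 # interpolation weights
    return X_min[base] + u * (X_min[nbr] - X_min[base])

rng = np.random.default_rng(1)
X_min = rng.normal(size=(30, 2))        # e.g. 30 positive (minority) cases
X_new = smote_like(X_min, n_synth=70)   # upsample towards a 1:1 ratio
print(X_new.shape)                      # (70, 2)
```

Because each synthetic point lies on a segment between two real minority samples, the new points never leave the region the minority class already occupies.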

Performance fell from the base XGBoost model to XGBoost with MICE in 8 of the 9 targets observed, suggesting worse classifier performance on the fully imputed datasets. In contrast, adding the class balancing algorithm SMOTE improved performance for all targets bar one, by between 5.10% and 14.60%. Targets with the most drastic class imbalance, where SMOTE was applied most aggressively to enlarge the minority class, clearly showed the greatest improvements in AUC. For targets where the initial imbalance was milder and SMOTE was used more conservatively, there was at worst little difference in classifier performance, and in general the model still improved. In all SMOTE models, an improvement in either sensitivity or specificity was offset by a decline in the other; the only two models in which specificity improved at the expense of sensitivity were those whose target condition had more positive than negative cases. Notably, for every target condition the net change across these two metrics was greater than zero. Together with the overall improvements in AUC and accuracy, we conclude that SMOTE offers a significant improvement in model performance.
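The protocol of rebalancing only the training data while leaving the test set at its natural prevalence can be sketched as below. Plain duplication via `resample` stands in for SMOTE, and scikit-learn's gradient boosting stands in for XGBoost; both substitutions are illustrative assumptions, not the study's exact stack:

```python
# Sketch: upsample the minority class in the TRAINING split only, then
# evaluate on an untouched test split that keeps its natural prevalence.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X, y = make_classification(n_samples=1000, weights=[0.85], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Duplicate minority samples (stand-in for SMOTE) until classes are 1:1.
X_pos, X_neg = X_tr[y_tr == 1], X_tr[y_tr == 0]
X_pos_up = resample(X_pos, n_samples=len(X_neg), random_state=0)
X_bal = np.vstack([X_neg, X_pos_up])
y_bal = np.array([0] * len(X_neg) + [1] * len(X_pos_up))

clf = GradientBoostingClassifier(random_state=0).fit(X_bal, y_bal)
# Test split is never rebalanced, mimicking deployment conditions.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```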

Finally, the reported training AUC of up to 0.994 for predicting Cirrhosis (Histology Confirmed) suggests the possibility of overfitting. Several measures were taken in the experimental design of this work to reduce this risk. These included k-fold cross-validation when tuning hyperparameters and assessing model fit, which prevents the ML models from being strongly influenced by any particular part of the training data and gives a more accurate indication of model performance. Furthermore, tuning the XGBoost hyperparameters ‘max_depth’ and ‘colsample_bytree’ controlled model complexity and added randomness, respectively, making model training more robust to noise. SMOTE also ensures that models do not become biased towards, and overfit to, the majority class. We acknowledge that despite these steps overfitting can still occur and, as shown in Fig 2, mean training AUC remains slightly greater than test AUC. We would therefore urge caution when clinicians apply these models to unseen data, while noting that the small size of this reduction indicates that our models still generalise well overall.
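The overfitting controls described above can be sketched with cross-validated hyperparameter tuning. Here GradientBoostingClassifier's `max_depth` and `max_features` serve as stand-ins for XGBoost's `max_depth` and `colsample_bytree` (the analogy is approximate: `max_features` subsamples features at each split rather than per tree), and the data are synthetic:

```python
# Sketch: k-fold cross-validated tuning of tree depth and feature
# subsampling, the two complexity/randomness controls discussed above.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={
        "max_depth": [2, 3, 4],      # caps model complexity
        "max_features": [0.5, 1.0],  # random feature subsampling per split
    },
    scoring="roc_auc",
    cv=5,  # 5-fold CV: no single fold dominates the fit
)
grid.fit(X, y)
print(grid.best_params_)
```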

Conclusion

Building upon previous linear approaches to predicting MASLD-related endpoints, this research highlights the capability of more complex, non-linear machine learning methods to accurately classify individuals across the spectrum of MASLD natural progression. In particular, we have demonstrated the ability to predict such outcomes to a high degree of accuracy using easily extractable, readily available information collected from routine clinical appointments or standard blood tests. Using the ML algorithm XGBoost together with the missing-data imputation algorithm MICE and the class balancing tool SMOTE on these easily accessible variables, we obtained a classifier with an AUC of 0.899 for predicting At-Risk MASH. Using the same model structure, we can also predict other MASLD outcomes at training set AUCs of up to 0.99 in some cases. We have further shown that introducing variables that are more complex and harder to obtain through standard healthcare procedures does not improve these classifiers enough to offset the cost of procuring those variables, although confirmatory analysis on suitable validation sets is required when they become available. Each model created in this research was also designed to be highly interpretable, allowing clinicians to explore how each individual classifier reached its conclusions. With the help of SHAP, each model can display the features most important to its decision making, how specific values of each feature contribute to the final output, and personalised predictions for each individual used in the classifier's training.
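Although the study used SHAP for attribution, the underlying goal of ranking which features drive a model's predictions can be illustrated with the simpler, model-agnostic permutation importance (a swapped-in technique shown here only as a sketch on synthetic data, not what the paper applied):

```python
# Sketch of feature attribution via permutation importance: shuffle one
# feature at a time and measure the drop in AUC; large drops mark
# features the model genuinely relies on.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=500, n_features=6, n_informative=2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

imp = permutation_importance(
    clf, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=0
)
ranking = imp.importances_mean.argsort()[::-1]
print(ranking[:2])  # indices of the two most influential features
```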

Supporting information

S1 File

(DOCX)

pone.0299487.s001.docx (171.3KB, docx)

Acknowledgments

The LITMUS consortium, coordinated by Quentin M. Anstee (quentin.anstee@newcastle.ac.uk). Below are all investigators in the LITMUS consortium and their respective affiliations:

Newcastle University: Quentin M. Anstee, Ann K. Daly, Simon Cockell, Dina Tiniakos, Pierre Bedossa, Alastair Burt, Fiona Oakley, Heather J. Cordell, Christopher P. Day, Kristy Wonders, Paolo Missier, Matthew McTeer, Luke Vale, Yemi Oluboyede, Matt Breckons. AMC Amsterdam: Patrick M. Bossuyt, Hadi Zafarmand, Yasaman Vali, Jenny Lee, Max Nieuwdorp, Adriaan G. Holleboom, Athanasios Angelakis, Joanne Verheij. Institute of Cardiometabolism And Nutrition: Vlad Ratziu, Karine Clément, Rafael Patino-Navarrete, Raluca Pais. Hôpital Beaujon, Assistance Publique Hopitaux de Paris: Valerie Paradis. University Medical Center Mainz: Detlef Schuppan, Jörn M. Schattenberg, Rambabu Surabattula, Sudha Myneni, Yong Ook Kim, Beate K. Straub. University of Cambridge: Toni Vidal-Puig, Michele Vacca, Sergio Rodrigues-Cuenca, Mike Allison, Ioannis Kamzolas, Evangelia Petsalaki, Mark Campbell, Chris J. Lelliott, Susan Davies. Örebro University: Matej Orešič, Tuulia Hyötyläinen, Aidan McGlinchey. Center for Cooperative Research in Biosciences: Jose M. Mato, Óscar Millet. University of Bern: Jean-François Dufour, Annalisa Berzigotti, Mojgan Masoodi, Naomi F. Lange. University of Oxford: Michael Pavlides, Stephen Harrison, Stefan Neubauer, Jeremy Cobbold, Ferenc Mozes, Salma Akhtar, Seliat Olodo-Atitebi. Perspectum: Rajarshi Banerjee, Elizabeth Shumbayawonda, Andrea Dennis, Anneli Andersson, Ioan Wigley. Servicio Andaluz de Salud, Seville: Manuel Romero-Gómez, Emilio Gómez-González, Javier Ampuero, Javier Castell, Rocío Gallego-Durán, Isabel Fernández-Lizaranzu, Rocío Montero-Vallejo. Nordic Bioscience: Morten Karsdal, Daniel Guldager Kring Rasmussen, Diana Julie Leeming, Antonia Sinisi, Kishwar Musa. Integrated Biobank of Luxembourg: Estelle Sandt, Manuela Tonini. University of Torino: Elisabetta Bugianesi, Chiara Rosso, Angelo Armandi. Università degli Studi di Firenze: Fabio Marra. Consiglio Nazionale delle Ricerche: Amalia Gastaldelli. 
Università Politecnica delle Marche: Gianluca Svegliati. University Hospital of Angers: Jérôme Boursier. Antwerp University Hospital: Sven Francque, Luisa Vonghia, An Verrijken, Eveline Dirinck, Ann Driessen. Linköping University: Mattias Ekstedt, Stergios Kechagias. University of Helsinki: Hannele Yki-Järvinen, Kimmo Porthan, Johanna Arola. UMC Utrecht: Saskia van Mil. Medical School of National & Kapodistrian University of Athens: George Papatheodoridis. Faculdade de Medicina, Universidade de Lisboa: Helena Cortez-Pinto. Faculty of Pharmacy, Universidade de Lisboa: Cecilia M. P. Rodrigues. Università degli Studi di Milano: Luca Valenti, Serena Pelusi. Università degli Studi di Palermo: Salvatore Petta, Grazia Pennisi. Università Cattolica del Sacro Cuore: Luca Miele, Antonio Liguori. University Hospital Würzburg: Andreas Geier, Monika Rau. RWTH Aachen University Hospital: Christian Trautwein, Johanna Reißing. University of Nottingham: Guruprasad P. Aithal, Susan Francis, Naaventhan Palaniyappan, Christopher Bradley. Antaros Medical: Paul Hockings, Moritz Schneider. National Institute for Health Research, Biomedical Research Centre at University Hospitals Birmingham NHS Foundation Trust and the University of Birmingham: Philip N. Newsome, Stefan Hübscher. iXscient: David Wenn. Genfit: Jeremy Magnanensi. Intercept Pharma: Aldo Trylesinski. OWL: Rebeca Mayo, Cristina Alonso. Eli Lilly and Company: Kevin Duffin, James W. Perfield, Yu Chen, Mark L. Hartman. Pfizer: Carla Yunis, Theresa Tuthill, Magdalena Alicia Harrington, Melissa Miller, Yan Chen, Euan James McLeod, Trenton Ross, Barbara Bernardo. Boehringer-Ingelheim: Corinna Schölch, Judith Ertle, Ramy Younes, Harvey Coxson, Eric Simon. Somalogic: Joseph Gogain, Rachel Ostroff, Leigh Alexander, Hannah Biegel. Novo Nordisk: Mette Skalshøi Kjær, Lea Mørch Harder, Naba Al-Sari, Sanne Skovgård Veidal, Anouk Oldenburger. Ellegaard Göttingen Minipigs: Jens Ellegaard.
Novartis Pharma AG: Maria-Magdalena Balp, Lori Jennings, Miljen Martic, Jürgen Löffler, Douglas Applegate. AstraZeneca: Richard Torstenson, Daniel Lindén. Echosens: Céline Fournier-Poizat, Anne Llorca. Resoundant: Michael Kalutkiewicz, Kay Pepin, Richard Ehman. Bristol-Myers Squibb: Gerald Horan. HistoIndex: Gideon Ho, Dean Tai, Elaine Chng, Teng Xiao. Gilead: Scott D. Patterson, Andrew Billin. RTI-HS: Lynda Doward, James Twiss. Takeda Pharmaceuticals Company Ltd.: Paresh Thakker, Zoltan Derdak, Hiroaki Yashiro. AbbVie: Henrik Landgren. Medical University of Graz: Carolin Lackner. University of Groningen: Annette Gouw. Aristotle University of Thessaloniki: Prodromos Hytiroglou. KU Leuven: Olivier Govaere. Resolution Therapeutics: Clifford Brass.

The code used for the statistical analysis of this work is available in the GitHub repository: https://github.com/mattmcteer/ML-Approaches-To-MASLD.

Data Availability

Data underpinning this study are not publicly available. The European NAFLD Registry protocol has been published in [1], including details of sample handling and processing, and the network of recruitment sites. Patient level data will not be made available due to the various constraints imposed by ethics panels across all the different countries from which patients were recruited and the need to maintain patient confidentiality. The point of contact for any enquiries regarding the European NAFLD Registry is the oversight group via email: NAFLD.Registry@newcastle.ac.uk.

Funding Statement

This work was supported by Newcastle University and Red Hat UK. This work has been supported by the LITMUS project, which has received funding from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement No. 777377. This Joint Undertaking receives support from the European Union’s Horizon 2020 research and innovation programme and EFPIA. QMA is an NIHR Senior Investigator and is supported by the Newcastle NIHR Biomedical Research Centre. This communication reflects the view of the authors and neither IMI nor the European Union and EFPIA are liable for any use that may be made of the information contained herein.

References

  • 1. Rinella ME, Lazarus JV, Ratziu V, Francque SM, Sanyal AJ, Kanwal F, et al. A multi-society Delphi consensus statement on new fatty liver disease nomenclature. Annals of Hepatology. 2023; p. 101133.
  • 2. Younossi Z, Anstee QM, Marietti M, Hardy T, Henry L, Eslam M, et al. Global burden of NAFLD and NASH: trends, predictions, risk factors and prevention. Nature Reviews Gastroenterology & Hepatology. 2018;15(1):11–20. doi: 10.1038/nrgastro.2017.109
  • 3. Satapathy SK, Bernstein DE, Roth NC. Liver transplantation in patients with non-alcoholic steatohepatitis and alcohol-related liver disease: the dust is yet to settle. Translational Gastroenterology and Hepatology. 2022;7. doi: 10.21037/tgh-2020-15
  • 4. Anstee QM, Reeves HL, Kotsiliti E, Govaere O, Heikenwalder M. From NASH to HCC: current concepts and future challenges. Nature Reviews Gastroenterology & Hepatology. 2019;16(7):411–428. doi: 10.1038/s41575-019-0145-7
  • 5. Taylor RS, Taylor RJ, Bayliss S, Hagström H, Nasr P, Schattenberg JM, et al. Association between fibrosis stage and outcomes of patients with nonalcoholic fatty liver disease: a systematic review and meta-analysis. Gastroenterology. 2020;158(6):1611–1625. doi: 10.1053/j.gastro.2020.01.043
  • 6. Kleiner DE, Brunt EM, Van Natta M, Behling C, Contos MJ, Cummings OW, et al. Design and validation of a histological scoring system for nonalcoholic fatty liver disease. Hepatology. 2005;41(6):1313–1321. doi: 10.1002/hep.20701
  • 7. Dyson J, McPherson S, Anstee Q. Non-alcoholic fatty liver disease: non-invasive investigation and risk stratification. Journal of Clinical Pathology. 2013;66(12):1033–1045. doi: 10.1136/jclinpath-2013-201620
  • 8. Brunt EM, Clouston AD, Goodman Z, Guy C, Kleiner DE, Lackner C, et al. Complexity of ballooned hepatocyte feature recognition: Defining a training atlas for artificial intelligence-based imaging in NAFLD. Journal of Hepatology. 2022;76(5):1030–1041. doi: 10.1016/j.jhep.2022.01.011
  • 9. Davison BA, Harrison SA, Cotter G, Alkhouri N, Sanyal A, Edwards C, et al. Suboptimal reliability of liver biopsy evaluation has implications for randomized clinical trials. Journal of Hepatology. 2020;73(6):1322–1332. doi: 10.1016/j.jhep.2020.06.025
  • 10. Anstee QM, Castera L, Loomba R. Impact of non-invasive biomarkers on hepatology practice: past, present and future. Journal of Hepatology. 2022;76(6):1362–1378. doi: 10.1016/j.jhep.2022.03.026
  • 11. Sanyal AJ, Shankar SS, Calle RA, Samir AE, Sirlin CB, Sherlock SP, et al. Non-invasive biomarkers of nonalcoholic steatohepatitis: the FNIH NIMBLE project. Nature Medicine. 2022;28(3):430–432. doi: 10.1038/s41591-021-01652-8
  • 12. Hardy T, Wonders K, Younes R, Aithal GP, Aller R, Allison M, et al. The European NAFLD Registry: a real-world longitudinal cohort study of nonalcoholic fatty liver disease. Contemporary Clinical Trials. 2020;98:106175. doi: 10.1016/j.cct.2020.106175
  • 13. Vali Y, Lee J, Boursier J, Petta S, Wonders K, Tiniakos D, et al. Biomarkers for staging fibrosis and non-alcoholic steatohepatitis in non-alcoholic fatty liver disease (the LITMUS project): a comparative diagnostic accuracy study. The Lancet Gastroenterology & Hepatology. 2023. doi: 10.1016/S2468-1253(23)00017-1
  • 14. Sorino P, Caruso MG, Misciagna G, Bonfiglio C, Campanella A, Mirizzi A, et al. Selecting the best machine learning algorithm to support the diagnosis of Non-Alcoholic Fatty Liver Disease: A meta learner study. PLoS One. 2020;15(10):e0240867. doi: 10.1371/journal.pone.0240867
  • 15. Canbay A, Kälsch J, Neumann U, Rau M, Hohenester S, Baba HA, et al. Non-invasive assessment of NAFLD as systemic disease—a machine learning perspective. PLoS One. 2019;14(3):e0214436. doi: 10.1371/journal.pone.0214436
  • 16. Chen YS, Chen D, Shen C, Chen M, Jin CH, Xu CF, et al. A novel model for predicting fatty liver disease by means of an artificial neural network. Gastroenterology Report. 2021;9(1):31–37. doi: 10.1093/gastro/goaa035
  • 17. Lee J, Westphal M, Vali Y, Boursier J, Petta S, Ostroff R, et al. Machine learning algorithm improves the detection of NASH (NAS-based) and at-risk NASH: A development and validation study. Hepatology. 2023;78(1):258–271. doi: 10.1097/HEP.0000000000000364
  • 18. Atabaki-Pasdar N, Ohlsson M, Viñuela A, Frau F, Pomares-Millan H, Haid M, et al. Predicting and elucidating the etiology of fatty liver disease: A machine learning modeling and validation study in the IMI DIRECT cohorts. PLoS Medicine. 2020;17(6):e1003149. doi: 10.1371/journal.pmed.1003149
  • 19. Ma H, Xu CF, Shen Z, Yu CH, Li YM. Application of machine learning techniques for clinical predictive modeling: a cross-sectional study on nonalcoholic fatty liver disease in China. BioMed Research International. 2018;2018. doi: 10.1155/2018/4304376
  • 20. Yip TF, Ma A, Wong VS, Tse YK, Chan HY, Yuen PC, et al. Laboratory parameter-based machine learning model for excluding non-alcoholic fatty liver disease (NAFLD) in the general population. Alimentary Pharmacology & Therapeutics. 2017;46(4):447–456. doi: 10.1111/apt.14172
  • 21. Schattenberg JM, Balp MM, Reinhart B, Tietz A, Regnier SA, Capkun G, et al. NASHmap: clinical utility of a machine learning model to identify patients at risk of NASH in real-world settings. Scientific Reports. 2023;13(1):5573. doi: 10.1038/s41598-023-32551-2
  • 22. Guha IN, Parkes J, Roderick P, Chattopadhyay D, Cross R, Harris S, et al. Noninvasive markers of fibrosis in nonalcoholic fatty liver disease: Validating the European Liver Fibrosis Panel and exploring simple markers. Hepatology. 2008;47(2):455–460. doi: 10.1002/hep.21984
  • 23. Vali Y, Lee J, Boursier J, Spijker R, Löffler J, Verheij J, et al. Enhanced liver fibrosis test for the non-invasive diagnosis of fibrosis in patients with NAFLD: a systematic review and meta-analysis. Journal of Hepatology. 2020;73(2):252–262. doi: 10.1016/j.jhep.2020.03.036
  • 24. Boyle M, Tiniakos D, Schattenberg JM, Ratziu V, Bugianessi E, Petta S, et al. Performance of the PRO-C3 collagen neo-epitope biomarker in non-alcoholic fatty liver disease. JHEP Reports. 2019;1(3):188–198. doi: 10.1016/j.jhepr.2019.06.004
  • 25. Mak AL, Lee J, van Dijk AM, Vali Y, Aithal GP, Schattenberg JM, et al. Systematic review with meta-analysis: diagnostic accuracy of pro-C3 for hepatic fibrosis in patients with non-alcoholic fatty liver disease. Biomedicines. 2021;9(12):1920. doi: 10.3390/biomedicines9121920
  • 26. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. p. 785–794.
  • 27. Van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software. 2011;45:1–67.
  • 28. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. 2002;16:321–357. doi: 10.1613/jair.953
  • 29. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Advances in Neural Information Processing Systems. 2017;30.
  • 30. Lundberg SM, Lee SI. Consistent feature attribution for tree ensembles. arXiv preprint arXiv:1706.06060. 2017.

Decision Letter 0

Pavel Strnad

13 Dec 2023

PONE-D-23-40017
Machine Learning Approaches to Enhance Diagnosis and Staging of Patients with MASLD Using Routinely Available Clinical Information
PLOS ONE

Dear Dr. McTeer,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jan 27 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Pavel Strnad

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

  1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

    https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

    https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

  2. Thank you for stating the following in the Competing Interests section:

I have read the journal's policy and the authors of this manuscript have the following competing interests:

Quentin M. Anstee has received research grant funding from AstraZeneca, Boehringer Ingelheim, and Intercept Pharmaceuticals, Inc.; has served as a consultant on behalf of Newcastle University for Alimentiv, Akero, AstraZeneca, Axcella, 89bio, Boehringer Ingelheim, Bristol Myers Squibb, Galmed, Genfit, Genentech, Gilead, GSK, Hanmi, HistoIndex, Intercept Pharmaceuticals, Inc., Inventiva, Ionis, IQVIA, Janssen, Madrigal, Medpace, Merck, NGM Bio, Novartis, Novo Nordisk, PathAI, Pfizer, Poxel, Resolution Therapeutics, Roche, Ridgeline Therapeutics, RTI, Shionogi, and Terns; has served as a speaker for Fishawack, Integritas Communications, Kenes, Novo Nordisk, Madrigal, Medscape, and Springer Healthcare; and receives royalties from Elsevier Ltd.

Jörn M. Schattenberg has served as consultant for Alentis Therapeutics, Astra Zeneca, Apollo Endosurgery, Bayer, Boehringer Ingelheim, Gilead Sciences, GSK, Ipsen, Inventiva Pharma, Madrigal, MSD, Northsea Therapeutics, Novartis, Novo Nordisk, Pfizer, Roche, Sanofi, Siemens Healthineers. Research Funding: Gilead Sciences, Boehringer Ingelheim, Siemens Healthcare GmbH. Stock Options: AGED diagnostics, Hepta Bio. Speaker Honorarium: Advanz, Echosens, MedPublico GmbH.

Andreas Geier served as a speaker and consultant for AbbVie, Advanz, Alexion, AstraZeneca, Bayer, BMS, Burgerstein, CSL Behring, Eisai, Falk, Gilead, Heel, Intercept, Ipsen, Merz, MSD, Novartis, Pfizer, Roche, Sanofi-Aventis; received research funding from Intercept, Falk, Novartis.

Dina Tiniakos served as consultant on behalf of the University or for ICON, Merck Greece, Madrigal, Inventiva, Histoindex, Cymabay and Clinnovate.

Please confirm that this does not alter your adherence to all PLOS ONE policies on sharing data and materials, by including the following statement: "This does not alter our adherence to PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests). If there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include your updated Competing Interests statement in your cover letter; we will change the online submission form on your behalf.

  3. One of the noted authors is a group or consortium [insert name of group or team]. In addition to naming the author group, please list the individual authors and affiliations within this group in the acknowledgments section of your manuscript. Please also indicate clearly a lead author for this group along with a contact email address.

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments :


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In the manuscript, entitled “Machine Learning Approaches to Enhance Diagnosis and Staging of Patients with MASLD Using Routinely Available Clinical Information” the authors provide an extensive dataset of machine learning models predicting outcome of MASLD. Subjects for the analyses were drawn from the LITMUS Metacohort (derived from the European NAFLD Registry), which is a very well known/established and extensively characterised patient cohort, tremendously increasing the overall value of this particular study. Therefore, the authors employed selected clinical parameters associated with MASLD in combination with histopathologic assessments based on a liver biopsy indicating the different disease stages of MASLD and progression to MASH, fibrosis and ultimately liver cirrhosis. The study data suggest that commonly available clinical variables/tests (i.e. anamnesis, biomarkers, elastography determined as core, extended or specialist features) provide sufficient information to predict MASLD patient outcomes - potentially reducing the need of more invasive tests such as a liver biopsy.

1) “Test set AUC of univariate models and our ML classifiers are also difficult to compare due to the very small sample sizes of some covariates in the univariate modelling, such as N ≈ 150 compared to test set size of all ML classifiers at N ≈ 1200.” I totally agree with the authors that it is very difficult to compare both approaches as the total number of individuals/parameters differs enormously. Besides, it is hardly surprising that an unbiased machine learning approach outperforms univariate linear analyses – the additional value of this result is relatively low.

2) “Recalling that we wished only to balance the training set for the XGBoost with MICE and SMOTE model, we artificially enhanced the minority class (in this case the positive set) from 1601 to 3218 to match the case numbers for negative class in the training set. It is also important to note that rebalancing was not applied to the test set - this is so the test set is as close as possible to what we would expect to see in reality, thus reducing any model biases.” Could you please explain the reason why you had to increase the number of cases artificially, and how this manipulation does not influence the results? Is the mentioned interpretation of those data reliable? Perhaps this should be stressed, or at least clarified, in the main manuscript for a broad readership.

3) “The average improvement in model accuracy, sensitivity, and specificity range between 0.03% and 1.57% when again comparing Extended feature set to Core feature set performance - therefore very little difference was found using the extra 7 specialist variables within this new set of variables. […] Sensitivity however appeared to improve significantly (avg. 10.07%) for every target when the Specialist feature set was used, this typically was at the expense of heavily reduced specificity (avg. -4.86%).” In conclusion, more commonly available parameters and tests (“core features”) might be more valuable than “specialist features”. This is hardly surprising – yet promising for our daily clinical routine. However, the number of individuals for whom all “specialist features” were available was significantly lower compared to “core features”. Therefore, one should be very careful when interpreting those data, and analyses on a greater scale are needed.

4) In reference to “Table 3”, the highest prediction accuracy was achieved regarding definite parameters/endpoints such as advanced fibrosis or cirrhosis. However, those patients with high inflammatory activity or “at-risk MASH” with advanced or rapidly progressive fibrosis are the patients who need to be identified to stop further progression to cirrhosis and its complications. In conclusion, one receives the impression that the findings of this machine learning approach are neither surprising nor novel. Nonetheless, unbiased machine learning approaches will determine the near future of diagnostics and therapeutic interventions, improving our daily clinical routines. In this context, the current study provides interesting machine learning approaches, with a lack of novelty, based on a great database.

Reviewer #2: I congratulate the authors on their timely and interesting manuscript, which focuses on the innovative use of supervised learning in diagnosing the recently renamed Metabolic dysfunction-Associated Steatotic Liver Disease (MASLD).

The study's strength lies in its longitudinal design, encompassing a substantial period from 2010 to 2017, which allows for a comprehensive analysis. The requirement that all participants have biopsy-confirmed MASLD within six months of enrolment adds a significant degree of diagnostic certainty to the study.

The exclusion of participants with excessive alcohol consumption and other chronic liver diseases is appropriate, as it helps maintain the focus on MASLD as the primary condition under study.

The manuscript does an excellent job of demonstrating the application of supervised learning in medical diagnostics. The division of clinical variables into Core, Extended, and Specialist feature sets is a thoughtful approach, offering a layered understanding of data utility in clinical practice.

This manuscript makes a significant contribution to the field of hepatology. The methodological approach is solid, and the insights provided could substantially improve MASLD diagnosis.

Still, I have some comments that need to be addressed:

- The manuscript would benefit from a Table 1 describing the baseline characteristics of the cohort and the different subcohorts.

- I wonder why Gender had so little impact on the results (Figure 3) and how ethnicity was distributed within the cohort.

- I like that the authors followed the recently published guidelines, and that participants reporting excessive alcohol consumption (>20/30g per day for women/men) or other causes of chronic liver diseases were excluded. Why is excessive alcohol consumption in Figure 3 even though it was excluded?

- The reported accuracy rate of up to 99.4% suggests a possibility of overfitting. It is essential to discuss measures to mitigate this and how such high accuracy might be interpreted in real-world clinical settings.

- The manuscript would benefit from a deeper analysis explaining the disparity between the excellent performance of individual predictors like AST, Platelet Count, and AST-ALT Ratio in the machine learning models, and the overall poor performance of FIB-4 as a composite score.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Feb 29;19(2):e0299487. doi: 10.1371/journal.pone.0299487.r002

Author response to Decision Letter 0


1 Feb 2024

Dear Academic Editor and Reviewers,

First of all, on behalf of all authors of this work, I would like to thank you all for your kind comments and constructive feedback on our paper entitled “Machine learning approaches to enhance diagnosis and staging of patients with MASLD using routinely available clinical information”. We would also like to thank you for the opportunity to submit minor revisions to our work, taking into account the changes you suggest. In this rebuttal letter we have responded to each of the Editor’s and Reviewers’ comments and highlighted where the suggested changes are made within our revised manuscript.

The Editor and Reviewers’ comments are highlighted here in red and our responses are in black beneath:

Editor’s Comments:

1. “Please ensure that your manuscript meets PLOS ONE’s style requirements, including those for file naming.”

A: We confirm that the manuscript meets PLOS ONE’s style requirements as per the template PLOS provides for LaTeX submissions available at https://journals.plos.org/plosone/s/latex. File names have been updated to adhere to these style requirements also.

2a. “Please confirm that [Competing Interests] do not alter your adherence to all PLOS ONE policies on sharing data and materials. If there are restrictions on sharing of data and/or materials, please state these.”

A: We confirm the following statement regarding stated Competing Interests: This does not alter our adherence to PLOS ONE policies on sharing data and materials. We have also included this statement within the updated cover letter relating to this submission.

In response to the editors’ comment regarding data availability, we would like to request an exception, on the grounds that, with reference to the wording below, in this instance public deposition would breach compliance with the protocol approved by our research ethics board. Specifically, the research is based entirely on one of the LITMUS datasets (denoted the ‘1a Metacohort’), which is described in [1]. Such data have been derived from multiple international cohorts, each collected under a separate ethical approval in a different country. At this stage it would be completely unrealistic to approach every different ethics panel to seek permission to share patient-level data. We therefore respectfully ask for exemption from the data policy on this occasion. We have also included this response within the cover letter and in the ‘Comments’ section of the online submission.

[1] Hardy T, Wonders K, Younes R, Aithal GP, Aller R, Allison M, et al. The European NAFLD Registry: a real-world longitudinal cohort study of nonalcoholic fatty liver disease. Contemporary clinical trials. 2020;98:106175. https://doi.org/10.1016/j.cct.2020.106175.

“Data policy:

All PLOS journals now require all data underlying the findings described in their manuscript to be freely available to other researchers, either 1. In a public repository, 2. Within the manuscript itself, or 3. Uploaded as supplementary information. This policy applies to all data except where public deposition would breach compliance with the protocol approved by your research ethics board. If your data cannot be made publicly available for ethical or legal reasons (e.g., public availability would compromise patient privacy), please explain your reasons on resubmission and your exemption request will be escalated for approval.”

2b. “Please note that exceptions to the data policy are only granted if there are legal or ethical restrictions being placed upon the data by an IRB or ethics committee. At that point we require the authors to provide contact information for an institutional point of contact where fellow researchers can send data inquiries.

If the authors are unable to make the data publicly available, then we ask that the authors provide the source of the data, if it is owned by one or more third parties, or an institutional point of contact, including an email address or phone number, where fellow researchers can send data inquiries.

Please note that PLOS does not allow authors to be the sole contact for data inquiries. If the data is only available upon request, please provide contact information, such as an email address, for a non-author, institutional point of contact (such as an IRB or ethics committee contact) who can field data inquiries from fellow researchers. If the data contact is an individual, please provide their title and relationship to the data as well.”

A: In response to the Editor’s additional revisions from 22nd January and 1st February, we provide the following Data Availability statement:

Data underpinning this study are not publicly available. The European NAFLD Registry protocol has been published in [1], including details of sample handling and processing, and the network of recruitment sites. Patient-level data will not be made available due to the various constraints imposed by ethics panels across all the different countries from which patients were recruited, and the need to maintain patient confidentiality. The point of contact for any enquiries regarding the European NAFLD Registry is the oversight group, via email: NAFLD.Registry@newcastle.ac.uk.

Please note that this data contact is not an individual but an institutional point of contact. This statement has also been made clear within the Manuscript in the Acknowledgements section, the Cover Letter and ‘Comments’ on the online submission.

As a courtesy to the Editor, in this response letter we have also included the list of registry sites and PIs as shown in the table below: (see letter)

3. “Please list the individual authors and affiliations within [group or consortium] in the acknowledgements section of your manuscript. Please also indicate clearly a lead author for this group along with a contact email address.”

A: We have now included all authors and affiliations within the LITMUS Consortium in the acknowledgements section of the manuscript. The lead coordinator of the consortium is Quentin M. Anstee, contactable at quentin.anstee@newcastle.ac.uk. We have also made this information clear in the acknowledgements section (page 12).

Reviewer #1’s Comments:

1. ‘“Test set AUC of univariate models and our ML classifiers are also difficult to compare due to the very small sample sizes of some covariates in the univariate modelling, such as N ≈ 150 compared to test set size of all ML classifiers at N ≈ 1200.” I totally agree with the reviewers that it is very difficult to compare both approaches as the total number of individuals/parameters differs enormously. Besides, it is hardly surprising that an unbiased machine learning approach outperforms univariate linear analyses – the additional value of this result is relatively low.’

A: It is perhaps important to stress that only 8 out of 35 univariate logistic regression models had a test set of sample size lower than N=200, and half of the univariate models had a test set sample size of N>1000. We have clarified within the manuscript that although comparison may be difficult for a handful of covariates where test set sizes are small, comparison is still useful, and more reliable, for the vast majority of other covariates used within the univariate logistic regression models. Since the univariate models each take into account only one variable, far fewer observations are required in order to develop a reliable estimator model, with scholars [2,3] arguing that approximately 10-20 observations per covariate are required. Our smallest training sets, with N as low as 150, therefore still allow for the provision of robust models and worthwhile comparisons. Although we agree with the reviewer that the result of ML approaches outperforming univariate linear approaches is unsurprising, we argue that it provides important context in highlighting the improvement in classifier performance that ML models can offer over linear models. We have added these arguments within the relevant results section in the revised manuscript (page 7).

[2] Schmidt, F.L., 1971. The relative efficiency of regression and simple unit predictor weights in applied differential psychology. Educational and Psychological Measurement, 31(3), pp.699-714.

[3] Harrell, F.E., 2001. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis (Vol. 608). New York: springer.

2. ‘“Recalling that we wished only to balance the training set for the XGBoost with MICE and SMOTE model, we artificially enhanced the minority class (in this case the positive set) from 1601 to 3218 to match the case numbers for negative class in the training set. It is also important to note that rebalancing was not applied to the test set - this is so the test set is as close as possible to what we would expect to see in reality, thus reducing any model biases.” Could you please explain to me the reason why you had to increase the number of cases artificially and how does this manipulation do not influence the results. Is the mentioned interpretation of those data reliable? Maybe this should be stressed or at least clarified in the main manuscript for a broad readership.’

A: In ML classification, an imbalanced dataset can result in the ML model skewing predictions towards the majority class in order to maximise model accuracy. For instance, take the extreme example of a cancer dataset in which 99 out of 100 patients were considered to have a benign tumour and the remaining 1 a malignant tumour. The ML classifier, which is trained to maximise the accuracy of its predictions, could therefore predict all patients to have benign tumours and receive an accuracy rate of 99%; naturally, this is not useful. In an ideal classification dataset, the ratio of negative-to-positive cases should be 1:1; we therefore have two options: ‘upsample’, i.e. artificially increase the minority class, or ‘downsample’, i.e. remove instances of the majority class. Upsampling is typically preferred to downsampling simply because downsampling involves the removal of perfectly valuable datapoints. SMOTE is an example of upsampling that focuses on generating new instances of minority class datapoints by interpolating between existing minority class datapoints, rather than, for example, simply replicating minority datapoints exactly. SMOTE is therefore far more reliable in terms of reducing the risk of overfitting, which is common in random oversampling techniques. SMOTE naturally has limitations: it does not consider the quality of the synthetic samples generated and therefore may not completely capture the distribution of the minority class; however, it is still a markedly improved version of existing upsampling methods and is of great benefit to heavily imbalanced datasets such as those used within this work. We have now stressed in the manuscript why this upsampling took place in the first place (page 6) and discussed the reliability of the interpretation within the discussion section (page 11).
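The interpolation idea described above can be illustrated with a short sketch in plain Python. This is a simplified illustration only, not the study's actual pipeline (which uses the established SMOTE implementation); the function name, the toy two-feature datapoints, and the choice of k are all hypothetical:

```python
import random

def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation
    between a randomly chosen minority point and one of its k nearest
    minority-class neighbours (the core idea behind SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy minority class with two features; 1617 new points would take the
# positive class from 1601 to 3218, as in the rebalancing described above.
minority = [(1.0, 2.0), (1.2, 1.9), (0.8, 2.2), (1.1, 2.1)]
new_points = smote_like_oversample(minority, n_new=1617)
```

Because every synthetic point is a convex combination of two real minority points, the new samples stay within the region the minority class already occupies, which is why this is less prone to overfitting than exact replication.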

3. ‘“The average improvement in model accuracy, sensitivity, and specificity range between 0.03% and 1.57% when again comparing Extended feature set to Core feature set performance - therefore very little difference was found using the extra 7 specialist variables within this new set of variables. […] Sensitivity however appeared to improve significantly (avg. 10.07%) for every target when the Specialist feature set was used, this typically was at the expense of heavily reduced specificity (avg. -4.86%).” In conclusion, more commonly available parameters and tests (“core features”) might be more valuable than “specialist features”. This sounds little surprising – yet promising for our daily clinical routine. However, the number of individuals, where all “special features” were available, was significantly lower compared to “core features”. Therefore, one should be very careful interpreting those data and analyses on a greater scale are needed.’

A: We acknowledge that the results are indeed promising for daily clinical practice, showing that impressive levels of ML classifier performance can be achieved using easily procurable variables, and that variables which are more complex to procure offer little improvement to classification in our case. We also concur with the reviewer that comparing results between datasets of different sizes requires caution, and that at present the claim that ‘Core’ features are more valuable than ‘Specialist’ features is not yet definitive. We have therefore added to the conclusion section of this paper that confirmatory analysis is required upon suitable validation sets when they become available, to make this clear to readers (page 12).

4. ‘In reference to “Table 3” highest prediction accuracy was achieved regarding definite parameters / endpoints such as advanced fibrosis or cirrhosis. However, those patients with high inflammatory activity or “at-risk MASH” with advanced or rapidly progressive fibrosis are those patients who needs to be identified to stop further progression to cirrhosis and its complications. In conclusion, one receives the impression that the findings of this machine learning approach are not surprising or novel. Nonetheless, unbiased machine learning approaches will determine the near future of diagnostics and therapeutic interventions improving our daily clinical routines. In this context, the current study provides interesting machine learning approaches with a lack of novelty based on a great database.’

A: We concur with the reviewer that the results are perhaps unsurprising; however, we would argue that ML provides a more advanced form of modelling which can be used in this case as confirmatory of existing clinical dogma, as opposed to being disruptive of expected results from more traditional forms of analysis. We agree also with the reviewer that the LITMUS Metacohort is indeed a great database, incredibly rich with respect to its field. We would therefore argue that novelty is found through coupling this dataset with robust algorithms based upon state-of-the-art ML to essentially confirm and reinforce something that clinical practice would expect.

Reviewer #2’s Comments:

1. ‘The manuscript would benefit from a table 1 describing the baseline characteristics of the cohort and the different subccohorts.’

A: In the interest of space within the main manuscript, we have now included a table within the supplementary material to this work summarising the statistics of the LITMUS Metacohort upon baseline assessment. No subcohorts were used or created within this work; however, since At-Risk MASH is the prevalent response variable within this work, we have also included the summary statistics for individuals who are positive and negative for At-Risk MASH, alongside the Metacohort as a whole. The characteristics displayed within the table include information regarding demographics of the cohort, comorbidities and biomarkers. We have referred to this table on page 3 of the main manuscript.

2. ‘I wonder why Gender had such a little impact on the results (Figure 3) and how ethnicity was distributed within the cohort.’

A: Having discussed the results of all models extensively with clinicians within our team, there was no medical explanation for why Gender was not considered one of the more significant features. From a ML perspective, one possible explanation is simply that the Gender feature within the dataset is binary and can therefore only ever be 0 (male) or 1 (female). This comparatively offers less information than continuous, numerical features such as AST, Platelets, Age etc., in which the range of possible values is far greater and there is therefore more data for the XGBoost model to draw conclusions from. We can see, however, from our SHAP values in Figure 3 that “high values” of Gender (in this case = 1, referring to female) have a more negative impact on the prediction of an individual being At-Risk MASH; therefore, if you are a woman, it is less likely that you will be considered At-Risk MASH, and vice-versa for males. Ethnicity throughout the cohort is almost entirely European Caucasian, with only a handful of individuals from other ethnic backgrounds – for this reason, discussion surrounding ethnicity in this work is largely omitted.

3. ‘I like, that the authors followed the recently published guidelines and Participants reporting excessive alcohol consumption (>20/30g per day for women/men) or other causes of chronic liver diseases were excluded. Why is excessive alcohol consumption in figure 3 even though it was excluded?’

A: Our apologies to the reviewer; we have edited the manuscript such that on page 3 this now reads “participants reporting excessive alcohol consumption (>20/30g per day for women/men) in the preceding 6 months and/or a history of excessive alcohol consumption in the past 5 years were excluded”. The ‘Excessive Alcohol Consumption’ feature used within the ML models refers to whether or not participants had excessive alcohol consumption more than 5 years ago, which would therefore not have excluded them from this study. We appreciate that this caused confusion and have therefore renamed the original variable ‘Excessive Alcohol Consumption’ to ‘Historic Alcohol Consumption’ within Table 1, Figure 1 and Figure 3, as well as editing our definition of excessive alcohol consumption as an exclusion criterion (page 3). New versions of Figures 1 and 3 have been uploaded.

4. ‘The reported accuracy rate of up to 99.4% suggests a possibility of overfitting. It is essential to discuss measures to mitigate this and how such high accuracy might be interpreted in real-world clinical settings.’

A: We agree with the reviewer that accuracy rates as high as 99% do suggest the possibility of overfitting of the ML model to the training set. Several steps were taken within the experimental design of this work to reduce the possibility of overfitting, including the use of k-fold cross-validation during both hyperparameter tuning and model fitting, the focus upon specific XGBoost hyperparameters that are known to control overfitting, and the use of the class balancing algorithm SMOTE. We have updated the discussion section of the manuscript (page 11) to showcase our attempts to prevent overfitting within our models, as well as to discuss their effectiveness. We acknowledge that, despite these steps, Figure 2 shows that overfitting is still evident through a disparity between mean training AUC and test AUC. In terms of how this result may be interpreted in real-world settings, we would argue that the difference between mean training AUC and test AUC is not large enough to cause concern; clinicians using these models upon unseen data should naturally be cautious about a potential small reduction in performance, but can still be reassured of good implementation and generalisation. We have also included how the overfitting in our models may be interpreted in a real-world setting within the discussion section of the updated manuscript (page 11).
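The k-fold cross-validation procedure mentioned above can be sketched in a few lines of plain Python. This is a generic illustration of the splitting logic only, not the authors' actual pipeline (which would typically rely on a library routine); the function name and the toy sizes are hypothetical:

```python
import random

def k_fold_indices(n, k=5, seed=42):
    """Yield (train, validation) index lists: the n indices are shuffled
    once, split into k roughly equal folds, and each fold is held out
    exactly once while the remaining folds form the training data."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

# Each candidate hyperparameter setting is scored by its mean validation
# performance across the k folds, so no single split drives the choice.
splits = list(k_fold_indices(n=20, k=5))
```

Because every observation appears in a validation fold exactly once, a model that merely memorises its training fold is penalised, which is how cross-validation guards against the kind of overfitting discussed above.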

5. ‘The manuscript would benefit from a deeper analysis explaining the disparity between the excellent performance of individual predictors like AST, Platelet Count, and AST-ALT Ratio in the machine learning models, and the overall poor performance of FIB-4 as a composite score.’

A: We agree with the reviewer’s comment that a deeper analysis would be beneficial; however, this is perhaps the most difficult amendment to make to our work. We know with help from SHAP that AST, Platelets and AST-ALT Ratio are the three most important individual predictors in assessing At-Risk MASH when ‘Core’ features are used; however, it is difficult to quantify their importance into a reasonable comparison with the 72.5% training AUC achieved by the univariate logistic regression model using FIB-4 as the sole predictor (as seen in Fig 1). The disparity between the ML and the univariate logistic regression models in general is, however, one of the key results we would like to promote within this work, and on page 7 we have made amendments to highlight the improved performance of more advanced learning algorithms such as XGBoost over existing forms of analysis such as univariate regression. It is worth noting that FIB-4 in our univariate logistic regression models offers an accuracy score largely in line with what is available in the current literature [4]. FIB-4 is considered to be a good test but not a perfect one, and exploring its use in large, well-characterised independent cohorts is more likely to give a measure of its true accuracy than the small studies that originally described its use. In addition, individual biomarkers are generally not used in the assessment of MASLD-related outcomes but are utilised within composite scores such as FIB-4, NFS and APRI; therefore, deeper analysis and wider comparisons across studies of markers such as AST, Platelet count and AST-ALT Ratio are difficult to make.

[4] Vali, Y., Lee, J., Boursier, J., Petta, S., Wonders, K., Tiniakos, D., Bedossa, P., Geier, A., Francque, S., Allison, M. and Papatheodoridis, G., 2023. Biomarkers for staging fibrosis and non-alcoholic steatohepatitis in non-alcoholic fatty liver disease (the LITMUS project): a comparative diagnostic accuracy study. The Lancet Gastroenterology & Hepatology.
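For readers less familiar with the composite score discussed above, FIB-4 combines exactly the individual markers named in that comment (age, AST, ALT and platelet count) into a single fixed ratio; a minimal sketch, with illustrative input values that are not drawn from the study data:

```python
import math

def fib4(age_years, ast, alt, platelets):
    """FIB-4 index: (age [years] * AST [U/L]) / (platelets [10^9/L] * sqrt(ALT [U/L]))."""
    return (age_years * ast) / (platelets * math.sqrt(alt))

# Illustrative values only:
score = fib4(age_years=55, ast=40, alt=36, platelets=200)  # = 1.833...
# Commonly cited cut-offs: < 1.30 makes advanced fibrosis unlikely,
# > 2.67 makes it likely; values in between are indeterminate.
```

One plausible reading of the disparity raised by the reviewer is that a fixed ratio like this imposes a single functional form on the markers, whereas a tree ensemble such as XGBoost can learn separate, non-linear thresholds and interactions for each marker individually.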

We would like to once again thank the Academic Editor and Reviewers for their kind comments and constructive feedback on our work; we hope that our responses within this rebuttal letter and our highlighted changes within the new manuscript are sufficient. If you have any further comments surrounding this work, we would be more than happy to discuss them further.

Many thanks!

Best Wishes

Matthew McTeer (lead author)

M.McTeer@newcastle.ac.uk

Attachment

Submitted filename: Response to Reviewers.docx

pone.0299487.s002.docx (411.4KB, docx)

Decision Letter 1

Pavel Strnad

12 Feb 2024

Machine Learning Approaches to Enhance Diagnosis and Staging of Patients with MASLD Using Routinely Available Clinical Information

PONE-D-23-40017R1

Dear Dr. McTeer,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Pavel Strnad

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: The authors have addressed all my comments sufficiently.

The story is timely and exciting and I was very grateful to serve as a reviewer.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Carolin Victoria Schneider

**********

Acceptance letter

Pavel Strnad

16 Feb 2024

PONE-D-23-40017R1

PLOS ONE

Dear Dr. McTeer,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Pavel Strnad

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File

    (DOCX)

    pone.0299487.s001.docx (171.3KB, docx)
    Attachment

    Submitted filename: Response to Reviewers.docx

    pone.0299487.s002.docx (411.4KB, docx)

    Data Availability Statement

    Data underpinning this study are not publicly available. The European NAFLD Registry protocol has been published in [1], including details of sample handling and processing, and the network of recruitment sites. Patient level data will not be made available due to the various constraints imposed by ethics panels across all the different countries from which patients were recruited and the need to maintain patient confidentiality. The point of contact for any enquiries regarding the European NAFLD Registry is the oversight group via email: NAFLD.Registry@newcastle.ac.uk.

