PLOS One. 2020 Oct 20;15(10):e0240867. doi: 10.1371/journal.pone.0240867

Selecting the best machine learning algorithm to support the diagnosis of Non-Alcoholic Fatty Liver Disease: A meta learner study

Paolo Sorino 1, Maria Gabriella Caruso 2, Giovanni Misciagna 3, Caterina Bonfiglio 1, Angelo Campanella 1, Antonella Mirizzi 1, Isabella Franco 1, Antonella Bianco 1, Claudia Buongiorno 1, Rosalba Liuzzi 1, Anna Maria Cisternino 4, Maria Notarnicola 2, Marisa Chiloiro 5, Giovanni Pascoschi 6,#, Alberto Rubén Osella 1,*,#; MICOL Group
Editor: Ming-Lung Yu
PMCID: PMC7575109  PMID: 33079971

Abstract

Background & aims

The use of liver ultrasound scanning (US) to diagnose Non-Alcoholic Fatty Liver Disease (NAFLD) generates costs and overloads waiting lists. We aimed to compare various machine learning algorithms within a meta-learner approach to find the best predictor of NAFLD.

Methods

The study included 2970 subjects; 2920 constituted the training set and 50 randomly selected subjects were used in the test phase, with cross-validation performed during training. The best predictors were combined to create three models: 1) FLI plus GLUCOSE plus SEX plus AGE, 2) AVI plus GLUCOSE plus GGT plus SEX plus AGE, 3) BRI plus GLUCOSE plus GGT plus SEX plus AGE. Eight machine learning algorithms were trained with the predictors of each of the three models. For these algorithms, the percent accuracy, variance and percent weight were compared.

Results

The SVM algorithm performed best with all models. Model 1 had 68% accuracy, with 1% variance and an algorithm weight of 27.35; Model 2 had 68% accuracy, with 1% variance and an algorithm weight of 33.62; and Model 3 had 77% accuracy, with 1% variance and an algorithm weight of 34.70. Model 2, composed of AVI plus GLUCOSE plus GGT plus SEX plus AGE, was the best performing overall, despite its lower percentage accuracy.

Conclusion

A Machine Learning approach can support NAFLD diagnosis and reduce health costs. The SVM algorithm is easy to apply and the necessary parameters are easily retrieved in databases.

Introduction

Non-alcoholic fatty liver disease (NAFLD) is the leading cause of chronic liver disease in Western countries, as well as a condition raising the risk for cardiovascular diseases, type 2 diabetes mellitus and chronic renal disease, and increased mortality [1, 2].

Worldwide, NAFLD prevalence is currently estimated to be around 24% and is constantly increasing (from 15% in 2005 to 25% in 2010).

A meta-analysis published in 2016 reported an average prevalence of 23.71% in Europe [3]. Population-based studies conducted in our geographical area (district of Bari, Apulian Region, Italy), have estimated a NAFLD prevalence of around 30%, mainly among male subjects and elderly people [4].

NAFLD is defined as an accumulation of Triglycerides in hepatocytes (>5% of the liver volume) of a patient with reduced alcohol intake (<20 g/day in women or <30 g/day in men), after excluding viral infectious causes or other specific liver diseases [5].

NAFLD can manifest as pure fatty liver disease (hepato-steatosis) or as non-alcoholic steatohepatitis (NASH), an evolution of the former where the steatosis is associated with inflammation and hepatocellular injury, and with fibrogenic activation that can lead to cirrhosis and the onset of hepatocarcinoma [6].

According to recent EASL-EASD-EASO guidelines [7], the gold standard to identify steatosis in individual patients is magnetic resonance imaging (MRI), but ultrasound scanning (US) is preferred because it is more widely available and cheaper than MRI.

However, for large scale screening studies, serum biomarkers and steatosis score indexes have been preferred, as the availability and cost of US have a substantial impact on screening feasibility [7]. One of the best validated indexes is the Fatty Liver Index (FLI) [8], although other scores or anthropometric measures work as well as the FLI in predicting the NAFLD risk [9].

The FLI is important and has widespread use in the diagnosis of NAFLD in epidemiological studies because it relies on few parameters that are easy to obtain in clinical practice.

In a previous study, we compared the performance of eight alternative hybrid methods to identify subjects with NAFLD in a large population [9]. In that setting, predictive formulas were used as a first approach strategy, to identify subjects considered to be at greater risk of steatosis, needing to be scheduled for later US. In clinical practice, it is necessary to develop support models for the diagnosis of NAFLD that include only clinical and laboratory routine parameters which are easily retrievable from health databases.

In recent years, due to the increasing prevalence of NAFLD, there has been a research trend towards low-cost diagnostic methods, and Machine Learning has been identified as a valid tool. Machine Learning (ML) is a branch of artificial intelligence that aims to allow machines to work using intelligent "learning" software [10]. Data sets are provided, which the machine processes through algorithms that develop their own logic to perform the required function, activity or task. Machine Learning has already been used as a support tool for the diagnosis of some diseases and for risk quantification, such as cardiovascular risk in patients with Diabetes Mellitus [11, 12], Ischemic Heart Disease [13] and cancers [14].

The purpose of our study is to compare various Machine Learning algorithms, using easily available laboratory parameters, in order to find the best algorithm to support the identification of subjects at greater risk of NAFLD to be scheduled for US assessment. We also compared the performances of the trained models, to assess the true reliability and ease of use of each, and validated the results on a large population-based study.

Materials and methods

Population

Details about the study population have been published elsewhere [9]. Briefly, a prospective cohort study was conducted by the Laboratory of Epidemiology and Biostatistics of the National Institute of Gastroenterology, “Saverio de Bellis” Research Hospital (Castellana Grotte, Bari, Italy). The MICOL Study is a population-based prospective cohort study randomly drawn from the electoral list of Castellana Grotte (≥30 years old) in 1985 and followed up in 1992, 2005–2006 and 2013–2016. In 2005–2006, a random sample of subjects aged 30–50 years (PANEL Study) was added to this cohort to compensate for cohort aging. In this paper the baseline for the MICOL cohort was established in 2005–2006 to capture all ages.

Participants were interviewed to collect information about sociodemographic characteristics, health status, personal history and lifestyle factors. Weight was taken with the subject in underwear, on a SECA® electronic balance, and approximated to the nearest 0.1 kg. Height was measured with a wall-mounted stadiometer SECA®, approximated to 1 cm. Blood pressure (BP) measurements were performed following international guidelines [15]. The mean of 3 BP measurements was calculated.

A fasting venous blood sample was drawn, and the serum was separated into two different aliquots. One aliquot was immediately stored at −80°C. The second aliquot was used to test biochemical serum markers by standard laboratory techniques in our Central laboratory.

The study included a total of 2970 out of 3000 selected subjects; 56.5% were male. In 1985, 2472 subjects had been enrolled. The cohort was followed up in 1992 and, in 2005–2006, 1697 subjects from the original cohort were still present. A further sample of 1273 subjects was enrolled in 2005–2006 to compensate for cohort aging [16]. All subjects gave informed written consent to participate.

All procedures were performed in accordance with the ethical standards of the institutional research committee (IRCCS Saverio de Bellis Research Hospital) and with the 1964 Helsinki Declaration, and had Ethical Committee approval for the MICOL Study (DDG-CE-347/1984; DDG-CE-453/1991; DDG-CE-589/2004; DDG-CE 782/2013).

Model development

The decisive variables and the training sessions of the models

After the adoption of Best Subset Selection [17], we compared three methods to choose the optimal model: Mallow’s Cp (Cp) [18], the Bayesian Information Criterion (BIC) [19] and the Adjusted R-squared (R2Adj) [20].

The following 27 variables were considered when employing the three subset selection methods: Age, Sex, Height, Body Mass Index (BMI) [21], Waist Circumference (WC) [22], Systolic Blood Pressure (SBP), Diastolic Blood Pressure (DBP), A Body Shape Index (ABSI) [23], Abdominal Volume Index (AVI) [24], Body Adiposity Index (BAI) [25], Body Roundness Index (BRI) [26], Hepatic Steatosis Index (HIS) [27], Leukocyte Alkaline Phosphatase Score [28], Total Bilirubin, Alkaline Phosphatase, Glucose, Gamma-Glutamyl Transferase (GGT), Glutamic-Oxaloacetic Transaminase (GOT), Glutamate Pyruvate Transaminase (GPT), High-Density Lipoproteins Cholesterol (HDL-C), Low-Density Lipoproteins Cholesterol (LDL-C), Triglycerides, FLI, Waist-Hip Ratio (WHR) [22], Waist-to-Height Ratio (WHtR) [22], Waist/Height0.5 (WHt.5R) [29] and Diabetes. All biochemical markers were introduced into the models as continuous variables, to better reflect their natural scale, since categorization entails a loss of information.

The following best subsets were obtained after running the code. Cp: Age, Sex, AVI, BRI, FLI, GGT, Glucose, GPT, HIS, DBP, WHR, WHtR, WHt.5R and WC; BIC: Age, Sex, AVI, BRI, FLI, GGT and Glucose; R2-Adj: Age, Sex, AVI, BRI, FLI, GGT, Glucose, GPT, HIS, DBP, SBP, Triglycerides, WHR, WHtR, WHt.5R and WC. The best model was then taken from the BIC method, because the trade-off between the number of variables and the explained variation did not show substantial changes. Each model was composed of one of the indexes FLI, AVI or BRI, plus Age, Sex, Glucose and Gamma-Glutamyl Transferase (GGT).
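The selection step can be sketched in Python. This is an illustrative reconstruction, not the authors' code: it uses synthetic data in place of the study variables and the standard BIC formula for an ordinary least squares fit, n·ln(RSS/n) + k·ln(n).

```python
# Illustrative best-subset selection scored by BIC (not the authors' code;
# the synthetic predictors stand in for variables such as Age, AVI or GGT).
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
n = 200
X_all = rng.normal(size=(n, 5))
# Hypothetical outcome depending only on columns 0 and 2.
y = 1.5 * X_all[:, 0] - 2.0 * X_all[:, 2] + rng.normal(scale=0.5, size=n)

def bic(X, y):
    """BIC for an OLS fit with intercept: n*ln(RSS/n) + k*ln(n)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = np.sum((y - Xd @ beta) ** 2)
    return len(y) * np.log(rss / len(y)) + Xd.shape[1] * np.log(len(y))

# Exhaustively score every non-empty subset of predictors.
best = min(
    (subset for r in range(1, 6) for subset in combinations(range(5), r)),
    key=lambda s: bic(X_all[:, s], y),
)
print("Best subset by BIC:", best)
```

With a clear signal, the BIC penalty of ln(n) per extra parameter keeps irrelevant predictors out, which mirrors the trade-off between model size and explained variation described above.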

Table 1 shows the formulas employed in this study and the variables necessary for their construction.

Table 1. Indexes formula and their structure.

Reference Name Formula
Bedogni G [8] Fatty Liver Index (FLI) FLI = e^L / (1 + e^L) * 100, where L = 0.953*ln(TG) + 0.139*BMI + 0.718*ln(GGT) + 0.053*WC - 15.745
Guerrero-Romero F [24] Abdominal Volume Index (AVI) AVI = [2*(WC)^2 + 0.7*(WC - HC)^2] / 1000
Thomas DM [26] Body Roundness Index (BRI) BRI = 364.2 - 365.5 * sqrt(1 - (WC / (2π))^2 / (0.5*Ht)^2)

Abbreviations: TG = Triglycerides; BMI = Body Mass Index; GGT = Gamma-Glutamyl Transferase; WC = Waist Circumference; HC = Hip Circumference; Ht = Height (WC and Ht in meters in the BRI formula).
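The formulas in Table 1 translate directly into code. The sketch below is illustrative; units follow the cited definitions (TG in mg/dL, GGT in U/L, WC and HC in centimeters for FLI and AVI, WC and height in meters for BRI), and the example values are hypothetical, not taken from the study data.

```python
import math

def fli(tg, bmi, ggt, wc):
    """Fatty Liver Index (Bedogni et al. [8]): TG mg/dL, GGT U/L, WC cm."""
    L = (0.953 * math.log(tg) + 0.139 * bmi
         + 0.718 * math.log(ggt) + 0.053 * wc - 15.745)
    return math.exp(L) / (1 + math.exp(L)) * 100

def avi(wc, hc):
    """Abdominal Volume Index (Guerrero-Romero [24]): WC and HC in cm."""
    return (2 * wc ** 2 + 0.7 * (wc - hc) ** 2) / 1000

def bri(wc_m, ht_m):
    """Body Roundness Index (Thomas et al. [26]): WC and height in meters."""
    return 364.2 - 365.5 * math.sqrt(
        1 - (wc_m / (2 * math.pi)) ** 2 / (0.5 * ht_m) ** 2
    )

# Hypothetical subject, for illustration only.
print(round(fli(tg=150, bmi=28, ggt=40, wc=95), 1))
print(round(avi(wc=95, hc=100), 2))
print(round(bri(wc_m=0.95, ht_m=1.70), 2))
```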

These formulas are simple and the variables (essential to calculate them) are easily available because they are collected and routinely assessed as anthropometric measures and biochemical markers. The selected database included all the variables essential for the construction of the predictive formulas, along with other values such as: Age, Sex, GGT and Glucose. However, GGT was not included in the model employing the FLI Index formula because it is already included in the formula.

Following the European Association for the Study of the Liver (EASL), European Association for the Study of Diabetes (EASD) and European Association for the Study of Obesity (EASO) recommendations [7], NAFLD diagnosis was performed using an ultrasound scanner Hitachi H21 Vision (Hitachi Medical Corporation, Tokyo, Japan). Examination of the visible liver parenchyma was performed with a 3.5 MHz transducer.

Model implementation

The algorithms considered were the following:

  • Boosting Tree Classifier (using AdaboostClassifier) [30, 31]

  • Decision Tree Classifier [32]

  • Naive Bayes Classifier [33, 34]

  • K-Nearest Neighbors Classifier [35]

  • Neural Network Classifier [36]

  • Random Forest Classifier [37]

  • Regularized Multinomial Classifier (using Logistic Regression) [38, 39]

  • Support Vector Machine Classifier [40, 41]

All eight algorithms considered were supervised learning models, except for the neural network algorithm, which consists of reinforcement learning. A program was then developed in Python, calling the machine learning functions contained in the Scikit-learn library [42]. This was used to compare all models, in order to identify the one best suited to diagnose NAFLD while minimizing false-negative predictions.
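A minimal sketch of how the eight classifiers can be instantiated from Scikit-learn follows; the hyperparameters shown are library defaults, not the tuned values used in the study.

```python
# The eight classifier families compared in the study, instantiated from
# Scikit-learn with default hyperparameters (illustrative only).
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "Boosting Tree": AdaBoostClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Neural Network": MLPClassifier(max_iter=500),
    "Random Forest": RandomForestClassifier(),
    "Regularized Multinomial": LogisticRegression(penalty="l2", max_iter=1000),
    "Support Vector Machine": SVC(),
}
print(len(classifiers))  # 8
```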

The actual training sessions

The models considered were the following:

  • 2.1. FLI plus Glucose plus Age plus Sex

  • 2.2. AVI plus Glucose plus GGT plus Age plus Sex

  • 2.3. BRI plus Glucose plus GGT plus Age plus Sex

For each of the models above, all the algorithms were compared using the meta-learner approach [43, 44].

Initially, the data were pre-processed to eliminate missing values in the features (x) and the target variable (y), yielding a dataset of 2868 subjects. We then set the algorithm to randomly extract 50 individuals from this dataset to build a new dataset, used in the prediction phase by the machine learning algorithms trained during the training phase. During the training sessions, all 2868 subjects were used for training and testing by means of the 10-fold cross-validation approach [45]. To retrieve the best parameters for each algorithm, we used GridSearchCV, a method contained in Scikit-learn. In this way, the algorithm with the lowest variance was identified, to reduce the possibility of a false prediction. After selecting the best parameters for each algorithm, we carried out the prediction on the held-out dataset of 50 individuals, thus obtaining the predicted target variable for each algorithm.
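The tuning and hold-out procedure can be sketched as follows. The synthetic data, parameter grid and seeds are illustrative assumptions, not the study's actual settings; only the overall shape (a 50-subject hold-out plus GridSearchCV with 10-fold cross-validation on an SVM) follows the description above.

```python
# Sketch of the tuning step: 10-fold CV with GridSearchCV on an SVM.
# Data and parameter grid are illustrative, not the study's settings.
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))             # stand-ins for AVI, Glucose, GGT, Age, Sex
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # hypothetical NAFLD label

# Hold out 50 subjects for the prediction phase, as in the paper.
X_train, X_pred, y_train, y_pred_true = train_test_split(
    X, y, test_size=50, random_state=0
)

grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=10,                  # 10-fold cross-validation
    scoring="accuracy",
)
grid.fit(X_train, y_train)

accuracy = grid.score(X_pred, y_pred_true)
print(grid.best_params_, round(accuracy, 2))
```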

The analysis of model performance

We compared eight different types of Machine Learning algorithms, in terms of the percentage accuracy and variance for each model. The parameters used for comparison in this case were: values of accuracy, variance, calculated confidence limits (95%), the weight of each model (as a %) and the number of ultrasound examinations it could avoid.

The accuracy of a model is defined as:

Accuracy = (Number of correct predictions / Total number of predictions) * 100 [46]

More specifically, the accuracy of a model is calculated with the following formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN) [46]

Where TP = True Positive, TN = True Negative, FP = False Positive and FN = False Negative.
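The two formulas are equivalent, since the number of correct predictions is TP + TN; a quick check with hypothetical confusion-matrix counts:

```python
# Hypothetical counts, for illustration only.
tp, tn, fp, fn = 30, 40, 5, 5

correct = tp + tn                  # correct predictions
total = tp + tn + fp + fn          # all predictions
assert correct / total == (tp + tn) / (tp + tn + fp + fn)
print(correct / total * 100)       # 87.5
```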

Another aspect to take into account was the value of variance for each algorithm, as a key indicator of the variability of a dataset.

A low variance means that the accuracy of the predictions has little variability. Conversely, a high variance indicates a weak prediction, even when the accuracy percentage is high.

Another fundamental parameter we considered was the weight of the algorithms, i.e. the value indicating the algorithm with the best performance in the test phase.

Each algorithm received a percentage weight out of a total of 100%, and the algorithm contributing most to the sum of the weights was taken as the best. In addition, for each algorithm under study, the optimal parameters were calculated for the future development of an algorithm with the highest possible accuracy and weight.
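The weighting scheme can be sketched as a weighted majority vote across algorithms; all weights and predictions below are hypothetical, chosen only to illustrate the mechanism.

```python
import numpy as np

# Hypothetical per-algorithm weights (summing to 100%) and binary
# predictions for five subjects.
weights = {"SVM": 34.7, "Boosting Tree": 22.0, "KNN": 18.3, "Naive Bayes": 25.0}
predictions = {
    "SVM":           np.array([1, 0, 1, 1, 0]),
    "Boosting Tree": np.array([1, 0, 0, 1, 0]),
    "KNN":           np.array([0, 0, 1, 1, 1]),
    "Naive Bayes":   np.array([1, 1, 1, 1, 0]),
}

# The best algorithm is the one carrying the largest weight.
best = max(weights, key=weights.get)
print("Best algorithm:", best)

# Weighted vote per subject: predict 1 when more than half the total
# weight (here 50 of 100) votes positive.
score = sum(w * predictions[name] for name, w in weights.items())
ensemble = (score > 50).astype(int)
print("Ensemble prediction:", ensemble)  # [1 0 1 1 0]
```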

Results

Subject characteristics and the performance of each of the three indexes considered are shown in Table 2. NAFLD prevalence was 31.55% and, as expected, was higher among men. Subjects with NAFLD were slightly older and had higher levels of Glucose and GGT.

Table 2. Subset subject characteristics by NAFLD condition.

MICOL Study, Castellana Grotte (BA), Italy, 2005.

Variables NAFLD Absent NAFLD Present
N (%) 2033 (68.45) 937 (31.55)
Sex
 Female 1007 (49.5) 286 (30.5)
 Male 1026 (50.5) 651 (69.5)
Age 54.04 (15.61) 55.42 (13.67)
FLI 31.09 (25.50) 64.11 (24.47)
AVI 15.96 (4.27) 21.15 (4.96)
BRI 4.48 (1.66) 6.21 (1.89)
GLUCOSE 105.51 (24.03) 117.62 (33.13)
GGT 14.60 (15.18) 20.57 (19.75)

Cells reporting subject characteristics contain mean (±SD) or n (%).

FLI plus glucose plus age plus sex predictive model

In the training session of the model employing FLI plus Glucose plus Age plus Sex as the predictive variables, the Boosting Tree was the most accurate algorithm, with 76% accuracy and a weight factor of 19.02%, while the algorithm with the lowest variance was the Support Vector Machine, with 68% accuracy (lower than the Boosting Tree) but a model weight of 27.35%. Compared to the Boosting Tree, the Support Vector Machine made fewer mistakes during the testing phase.

In Fig 1 we show the trend of the error in the test phase and the trend of the error in the training phase for each model.

Fig 1. Machine learning applied to NAFLD diagnosis.

Fig 1

Training Error vs Test Error for the FLI plus Glucose plus Age plus Sex Predictive Model.

AVI plus glucose plus GGT plus age plus sex predictive model

In the training session of the model employing AVI plus Glucose plus GGT plus Age plus Sex as the predictive variables, the Regularized Multinomial (RM) was the most accurate algorithm, with 76% accuracy and a weight factor of 28.32%, while the algorithm with the lowest variance was the Support Vector Machine, with 68% accuracy (lower than the RM) but an algorithm weight of 32.62%.

Compared to the RM, the Support Vector Machine made fewer mistakes during the testing phase.

In Fig 2 we show the trend of the error in the test phase and the trend of the error in the training phase for each model.

Fig 2. Machine learning applied to NAFLD diagnosis.

Fig 2

Training Error vs Test Error for the AVI plus Glucose plus GGT plus Age plus Sex Predictive Model.

BRI plus glucose plus GGT plus age plus sex predictive model

In the training session of the model employing BRI plus Glucose plus GGT plus Age plus Sex as the predictive variables, the Nearest Neighbor was the most accurate algorithm, with 77% accuracy and a weight factor of 22.21%, while the algorithm with the lowest variance was the Support Vector Machine, also with 77% accuracy but a model weight of 34.70%.

Despite having the same percentage accuracy as the Nearest Neighbor, the Support Vector Machine made fewer mistakes during the testing phase.

In Fig 3 we show the trend of the error in the test phase and the trend of the error in the training phase for each model.

Fig 3. Machine Learning Applied to NAFLD Diagnosis.

Fig 3

Training Error vs Test Error for the BRI plus Glucose plus GGT plus Age plus Sex Predictive Model.

Figs 4, 5 and 6 show the results, in terms of accuracy, of the test with 95% confidence intervals, the weight of the model and also the value of the variance.

Fig 4. Machine learning applied to NAFLD diagnosis.

Fig 4

Forest Plot for the FLI plus Glucose plus Age plus Sex Predictive Model.

Fig 5. Machine learning applied to NAFLD diagnosis.

Fig 5

Forest Plot for the AVI plus Glucose plus GGT plus Age plus Sex Predictive Model.

Fig 6. Machine learning applied to NAFLD diagnosis.

Fig 6

Forest Plot for the BRI plus Glucose plus GGT plus Age plus Sex Predictive Model.

From the above Figures, it can be seen that the SVM algorithm, even if it did not feature the highest accuracy percentage, was the best for predicting NAFLD.

From the output of the three models, we also calculated the sensitivity, specificity, and positive and negative predictive values in the prediction phase; these are shown in Table 3.

Table 3.

Model Sensitivity Specificity PPV NPV
FLI plus Glucose plus Age plus Sex 0.979 1.00 1.00 0.990
AVI plus Glucose plus GGT plus Age plus Sex 0.985 1.00 1.00 0.993
BRI plus Glucose plus GGT plus Age plus Sex 0.967 0.99 0.997 0.990

Discussion

In this study, the search for the best algorithm to support NAFLD diagnosis was conducted by comparing the different Machine Learning algorithms for each model.

Specifically, the model with a high level of accuracy and the model with the lowest level of variance were identified in this research. The Machine Learning model presenting the lowest variance was selected.

Today, the search for non-invasive methods as alternatives to expensive NAFLD diagnostic tools (MRI, ultrasound) is very important. The reorganization of the National Health System requires closer consideration of performance, together with factors linked to cost reduction and waiting times. The aim of our study was to use modern machine learning techniques to support medical decisions during the diagnostic phase with easier and cheaper tools, thus reducing both the costs and the waiting times inherent to instrumental methods.

In view of the results of this study, it is possible to state that the most appropriate machine learning algorithm is the Support Vector Machine in Python. In particular, the Support Vector Machine employing AVI plus Glucose plus GGT plus Sex plus Age, despite having an almost identical percentage accuracy and weight to the other models, produced fewer prediction errors in the test step. In the test phase, the models composed of FLI plus Glucose plus Age plus Sex and of AVI plus Glucose plus GGT plus Age plus Sex yielded a percentage error of 32%, while the model composed of BRI plus Glucose plus GGT plus Age plus Sex yielded an error of 23%. However, in the prediction phase, the model that made the fewest errors was the one composed of AVI plus Glucose plus GGT plus Age plus Sex, with an error of 20%, versus 26% for FLI plus Glucose plus Age plus Sex and 28% for BRI plus Glucose plus GGT plus Age plus Sex.

Therefore, AVI plus Glucose plus GGT plus Sex plus Age was the model that contributed most to reducing unnecessary ultrasound examinations.

The good performance of ML Algorithms used to identify NAFLD, applying common anthropometric parameters and other variables, has shown them to be a valid alternative to classic Indexes [47, 48].

Moreover, the SVM was well able to identify subjects without NAFLD. From an ethical perspective, the model with the lowest variance is the best one, as it is characterized by a smaller number of false negatives, despite a lower percentage of accuracy.

This kind of study underlines the fact that this ML algorithm can be used to find subjects at high risk of NAFLD, who need to undergo US. Furthermore, since low-risk subjects do not undergo US, 81.9% of unnecessary US examinations could be avoided (calculated as the total number of subjects in the test set divided by that number plus the number of incorrect predictions).

Some methodological issues need to be considered. A strength of this study is the population-based random sample from which the observations were drawn. The NAFLD prevalence in the sample is a good estimator of the population prevalence and its age-sex distribution. Limitations include both the limited number of observations and the method used to perform the NAFLD diagnosis. The low sensitivity of the NAFLD diagnostic methodology may be criticized, as it fails to detect fatty liver content >25–90% [49]. However, this is a population-based study and subjects were chosen from the electoral register; they did not seek medical attention and participated on a volunteer basis. Thus, diagnosis of NAFLD by US was the only diagnostic procedure we could propose to the participants, since ethical issues prevented us from proposing biopsy or H-MRS. Moreover, to lighten the waiting lists, our purpose was to find a machine learning algorithm permitting us to avoid a number of USs which would otherwise have been prescribed. This algorithm is therefore useful to exclude NAFLD and as a valid diagnostic support in the context of epidemiologic studies, not as a replacement diagnostic tool.

In conclusion, this model, like others based on ML Algorithms, may be considered as a valid support for medical decision making as regards health policies, in epidemiological studies and screening.

Supporting information

S1 File. Dataset with random selection.

(XLS)

Acknowledgments

Computations for this research were performed in the Laboratory of Epidemiology and Biostatistics, National Institute of Gastroenterology, “S de Bellis” Research Hospital, MICOL Working Group: Vittorio Pugliese (Laboratory of Epidemiology and Biostatistics), Mario Correale, Palma Iacovazzi, Anna Mastrosimini, Giampiero De Michele (Laboratory of Clinical Pathology), Osvaldo Burattini (Unit of Gastroenterology), Valeria Tutino, Benedetta D’Attoma (Laboratory of Nutritional Biochemistry), Maria R Noviello (Department of Radiology), National Institute of Gastroenterology “S de Bellis” Research Hospital, Castellana Grotte (BA), Italy.

Abbreviations

ABSI

A Body Shape Index

AVI

Abdominal Volume Index

BAI

Body Adiposity Index

BIC

Bayesian Information Criterion

BMI

Body Mass Index

BP

Blood pressure

BRI

Body Roundness Index

Cp

Mallow’s Cp

DBP

Diastolic Blood Pressure

FLI

Fatty Liver Index

FN

False Negative

FP

False Positive

GGT

Gamma-Glutamyl Transferase

GOT

Glutamic-Oxaloacetic Transaminase

GPT

Glutamate Pyruvate Transaminase

HDL-C

High-Density Lipoproteins Cholesterol

HIS

Hepatic Steatosis Index

LDL-C

Low-Density Lipoproteins Cholesterol

ML

machine learning

MRI

magnetic resonance imaging

NAFLD

Non-Alcoholic Fatty Liver Disease

NASH

non-alcoholic steatohepatitis

R2-Adj

R-squared Adjusted

RM

Regularized Multinomial

SBP

Systolic Blood Pressure

SVM

Support Vector Machine

TN

True Negative

TP

True Positive

US

ultrasound scan

WC

waist circumference

WHR

Waist-Hip Ratio

WHt.5R

Waist/Height0.5

WHtR

Waist-to-Height Ratio

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

This work was funded by a grant from the Ministry of Health, Italy (Progetto Finalizzato del Ministero della Salute, ICS 160.2/RF 2003), 2004/2006) and by Apulia Region-D.G.R. n. 1159, 28/6/2018 and 2019.

References

  • 1.Fazel Y, Koenig AB, Sayiner M, Goodman ZD, Younossi ZM. Epidemiology and natural history of non-alcoholic fatty liver disease. Metabolism. 2016;65(8):1017–25. Epub 2016/03/22. 10.1016/j.metabol.2016.01.012 . [DOI] [PubMed] [Google Scholar]
  • 2.Levene AP, Goldin RD. The epidemiology, pathogenesis and histopathology of fatty liver disease. Histopathology. 2012;61(2):141–52. Epub 2012/03/01. 10.1111/j.1365-2559.2011.04145.x . [DOI] [PubMed] [Google Scholar]
  • 3.Younossi ZM, Koenig AB, Abdelatif D, Fazel Y, Henry L, Wymer M. Global epidemiology of nonalcoholic fatty liver disease-Meta-analytic assessment of prevalence, incidence, and outcomes. Hepatology. 2016;64(1):73–84. Epub 2015/12/29. 10.1002/hep.28431 . [DOI] [PubMed] [Google Scholar]
  • 4.Cozzolongo R, Osella AR, Elba S, Petruzzi J, Buongiorno G, Giannuzzi V, et al. Epidemiology of HCV infection in the general population: a survey in a southern Italian town. The American journal of gastroenterology. 2009;104(11):2740–6. Epub 2009/07/30. . [DOI] [PubMed] [Google Scholar]
  • 5.Neuschwander-Tetri BA, Caldwell SH. Nonalcoholic steatohepatitis: summary of an AASLD Single Topic Conference. Hepatology. 2003;37(5):1202–19. Epub 2003/04/30. 10.1053/jhep.2003.50193 . [DOI] [PubMed] [Google Scholar]
  • 6.Ratziu V, Bellentani S, Cortez-Pinto H, Day C, Marchesini G. A position statement on NAFLD/NASH based on the EASL 2009 special conference. Journal of hepatology. 2010;53(2):372–84. Epub 2010/05/25. 10.1016/j.jhep.2010.04.008 . [DOI] [PubMed] [Google Scholar]
  • 7.European Association for the Study of the Liver (EASL), European Association for the Study of Diabetes (EASD), European Association for the Study of Obesity (EASO). EASL-EASD-EASO Clinical Practice Guidelines for the management of non-alcoholic fatty liver disease. Journal of hepatology. 2016;64(6):1388–402. 10.1016/j.jhep.2015.11.004 [DOI] [PubMed] [Google Scholar]
  • 8.Bedogni G, Bellentani S, Miglioli L, Masutti F, Passalacqua M, Castiglione A, et al. The Fatty Liver Index: a simple and accurate predictor of hepatic steatosis in the general population. BMC Gastroenterol. 2006;6:33 Epub 2006/11/04. 10.1186/1471-230X-6-33 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Procino F, Misciagna G, Veronese N, Caruso MG, Chiloiro M, Cisternino AM, et al. Reducing NAFLD-screening time: A comparative study of eight diagnostic methods offering an alternative to ultrasound scans. Liver International. 2019;39(1):187–96. 10.1111/liv.13970 [DOI] [PubMed] [Google Scholar]
  • 10.Mohammed M, Khan MB, Bashier EBM. Machine Learning: Algorithms and Applications. CRC Press; 2016. [Google Scholar]
  • 11.Napoli C, Benincasa G, Schiano C, Salvatore M. Differential Epigenetic Factors in the Prediction of Cardiovascular Risk in Diabetic Patients. Eur Heart J Cardiovasc Pharmacother. 2019. Epub 2019/10/31. 10.1093/ehjcvp/pvz062 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Dagliati A, Marini S, Sacchi L, Cogni G, Teliti M, Tibollo V, et al. Machine Learning Methods to Predict Diabetes Complications. Journal of Diabetes Science and Technology. 2018;12(2):295–302. 10.1177/1932296817706375 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Kukar M, Kononenko I, Groselj C, Kralj K, Fettich J. Analysing and improving the diagnosis of ischaemic heart disease with machine learning. Artif Intell Med. 1999;16(1):25–50. Epub 1999/05/04. 10.1016/s0933-3657(98)00063-3 . [DOI] [PubMed] [Google Scholar]
  • 14.Kourou Konstantina, E TP, Exarchos Konstantinos P., Karamouzis Michalis V., and Fotiadis Dimitrios I. "Machine learning applications in cancer prognosis and prediction". Computational and Structural Biotechnology Journal, vol 13 2014. 10.1016/j.csbj.2014.11.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Sever P. New hypertension guidelines from the National Institute for Health and Clinical Excellence and the British Hypertension Society. J Renin Angiotensin Aldosterone Syst. 2006;7(2):61–3. Epub 2006/11/04. 10.3317/jraas.2006.011 . [DOI] [PubMed] [Google Scholar]
  • 16.Osella AR, Misciagna G, Leone A, Di Leo A, Fiore G. Epidemiology of hepatitis C virus infection in an area of Southern Italy. Journal of hepatology. 1997;27(1):30–5. Epub 1997/07/01. 10.1016/s0168-8278(97)80276-0 . [DOI] [PubMed] [Google Scholar]
  • 17.James G, and Witten D, and Hastie T, and Tibshirani R. Linear Model Selection and Regularization An Introduction to Statistical Learning: with Applications in R: Springer; New York; 2013. p. 203–64. [Google Scholar]
  • 18.Stevens VL, Jacobs EJ, Patel AV, Sun J, McCullough ML, Campbell PT, et al. Weight cycling and cancer incidence in a large prospective US cohort. Am J Epidemiol. 2015;182(5):394–404. Epub 2015/07/26. 10.1093/aje/kwv073 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Vrieze SI. Model selection and psychological theory: A discussion of the differences between the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Psychological Methods. 2012;17(2):228–43. 10.1037/a0027127 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Miles J. R Squared, Adjusted R Squared. American Cancer Society; 2014. 10.1002/9781118445112.stat06627 [DOI] [Google Scholar]
  • 21.Nuttall FQ. Body Mass Index: Obesity, BMI, and Health: A Critical Review. Nutrition Today. 2015;50(3).
  • 22.Bacopoulou F, Efthymiou V, Landis G, Rentoumis A, Chrousos GP. Waist circumference, waist-to-hip ratio and waist-to-height ratio reference percentiles for abdominal obesity among Greek adolescents. BMC Pediatr. 2015;15:50. Epub 2015/05/04. 10.1186/s12887-015-0366-z
  • 23.Lee D-Y, Lee M-Y, Sung K-C. Prediction of Mortality with A Body Shape Index in Young Asians: Comparison with Body Mass Index and Waist Circumference. Obesity. 2018;26(6):1096–103. 10.1002/oby.22193
  • 24.Guerrero-Romero F, Rodríguez-Morán M. Abdominal volume index. An anthropometry-based index for estimation of obesity is strongly related to impaired glucose tolerance and type 2 diabetes mellitus. Arch Med Res. 2003;34(5):428–32. Epub 2003/11/07. 10.1016/S0188-4409(03)00073-0
  • 25.Freedman DS, Thornton JC, Pi-Sunyer FX, Heymsfield SB, Wang J, Pierson RN Jr., et al. The body adiposity index (hip circumference ÷ height(1.5)) is not a more accurate measure of adiposity than is BMI, waist circumference, or hip circumference. Obesity (Silver Spring, Md). 2012;20(12):2438–44. Epub 2012/04/10. 10.1038/oby.2012.81
  • 26.Thomas DM, Bredlau C, Bosy-Westphal A, Mueller M, Shen W, Gallagher D, et al. Relationships between body roundness with body fat and visceral adipose tissue emerging from a new geometrical model. Obesity (Silver Spring, Md). 2013;21(11):2264–71. Epub 2013/03/23. 10.1002/oby.20408
  • 27.Lee JH, Kim D, Kim HJ, Lee CH, Yang JI, Kim W, et al. Hepatic steatosis index: a simple screening tool reflecting nonalcoholic fatty liver disease. Digestive and liver disease: official journal of the Italian Society of Gastroenterology and the Italian Association for the Study of the Liver. 2010;42(7):503–8. Epub 2009/09/22. 10.1016/j.dld.2009.08.002
  • 28.Lipshitz J, Limaye S, Patel D. Leukocyte Alkaline Phosphatase Score as a Marker of Severity and Progression of Myelodysplastic Syndrome. Blood. 2007;110(11):4625. 10.1182/blood.V110.11.4625.4625
  • 29.Nevill AM, Duncan MJ, Lahart IM, Sandercock GR. Scaling waist girth for differences in body size reveals a new improved index associated with cardiometabolic risk. Scand J Med Sci Sports. 2017;27(11):1470–6. Epub 2016/10/12. 10.1111/sms.12780
  • 30.Wu B, Nevatia R. Cluster Boosted Tree Classifier for Multi-View, Multi-Pose Object Detection. 2007 IEEE 11th International Conference on Computer Vision. 2007. p. 1–8.
  • 31.Freund Y, Schapire RE. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J Comput Syst Sci. 1997;55(1):119–39. 10.1006/jcss.1997.1504
  • 32.Wu C-C, Chen Y-L, Liu Y-H, Yang X-Y. Decision tree induction with a constrained number of leaf nodes. Applied Intelligence. 2016;45(3):673–85. 10.1007/s10489-016-0785-z
  • 33.Panda M, Patra M. Network intrusion detection using Naive Bayes. 2007;7.
  • 34.Larsen K. Generalized Naive Bayes Classifiers. SIGKDD Explor Newsl. 2005;7(1):76–81. 10.1145/1089815.1089826
  • 35.Zuo W, Zhang D, Wang K. On kernel difference-weighted k-nearest neighbor classification. Pattern Analysis and Applications. 2008;11(3):247–57. 10.1007/s10044-007-0100-z
  • 36.Wang SC. Artificial Neural Network. In: Interdisciplinary Computing in Java Programming. The Springer International Series in Engineering and Computer Science, vol 743. 2003. 10.1007/978-1-4615-0377-4_5
  • 37.Pal M. Random forest classifier for remote sensing classification. International Journal of Remote Sensing. 2005;26(1):217–22. 10.1080/01431160412331269698
  • 38.Li J, Qian Y, editors. Regularized Multinomial Regression Method for Hyperspectral Data Classification via Pathwise Coordinate Optimization. 2009 Digital Image Computing: Techniques and Applications; 2009 1–3 Dec. 2009.
  • 39.Yu H-F, Huang F-L, Lin C-J. Dual coordinate descent methods for logistic regression and maximum entropy models. Mach Learn. 2011;85(1–2):41–75. 10.1007/s10994-010-5221-8
  • 40.Karatzoglou A, Meyer D, Hornik K. Support Vector Machines in R. Journal of Statistical Software. 2006;15(9):1–28. Epub 2005-12-05. 10.18637/jss.v015.i09
  • 41.Cristianini N, Shawe-Taylor J. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press; 2000.
  • 42.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–30.
  • 43.Hilario M, Kalousis A. Model selection via meta-learning: a comparative study. International Journal on Artificial Intelligence Tools. 2001;10:525–54. 10.1142/S0218213001000647
  • 44.Cerulli G. A Super-Learning Machine for Predicting Economic Outcomes. MPRA Paper. 2020.
  • 45.Bro R, Kjeldahl K, Smilde AK, Kiers HA. Cross-validation of component models: a critical look at current methods. Anal Bioanal Chem. 2008;390(5):1241–51. Epub 2008/01/25. 10.1007/s00216-007-1790-1
  • 46.Biswas AK, Noman N, Sikder AR. Machine learning approach to predict protein phosphorylation sites by incorporating evolutionary information. BMC Bioinformatics. 2010;11:273. Epub 2010/05/25. 10.1186/1471-2105-11-273
  • 47.Yip TC-F, Ma AJ, Wong VW-S, Tse Y-K, Chan HL-Y, Yuen P-C, et al. Laboratory parameter-based machine learning model for excluding non-alcoholic fatty liver disease (NAFLD) in the general population. Alimentary Pharmacology & Therapeutics. 2017;46(4):447–56. 10.1111/apt.14172
  • 48.Canbay A, Kälsch J, Neumann U, Rau M, Hohenester S, Baba HA, et al. Non-invasive assessment of NAFLD as systemic disease-A machine learning perspective. PLoS One. 2019;14(3):e0214436. Epub 2019/03/27. 10.1371/journal.pone.0214436
  • 49.Saadeh S, Younossi ZM, Remer EM, Gramlich T, Ong JP, Hurley M, et al. The utility of radiological imaging in nonalcoholic fatty liver disease. Gastroenterology. 2002;123(3):745–50. Epub 2002/08/29. 10.1053/gast.2002.35354

Decision Letter 0

Ming-Lung Yu

7 Aug 2020

PONE-D-20-19706

SELECTING THE BEST MACHINE LEARNING ALGORITHM TO SUPPORT THE DIAGNOSIS OF NON-ALCOHOLIC FATTY LIVER DISEASE: A META LEARNER STUDY.

PLOS ONE

Dear Dr. Osella,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Especially, please expand the sample size of validation cohort.

Please submit your revised manuscript by Sep 21 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Ming-Lung Yu, MD, PhD

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for including your ethics statement: "Ethical Committee approval for the MICOL Study (DDG-CE-347/1984; DDG-CE-453/1991; DDG-CE-589/2004; DDG-CE 782/2013)".

Please amend your current ethics statement to include the full name of the ethics committee/institutional review board(s) that approved your specific study.

Once you have amended this/these statement(s) in the Methods section of the manuscript, please add the same text to the “Ethics Statement” field of the submission form (via “Edit Submission”).

For additional information about PLOS ONE ethical requirements for human subjects research, please refer to http://journals.plos.org/plosone/s/submission-guidelines#loc-human-subjects-research.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

"This work was funded by a grant from the Ministry of Health, Italy (Progetto Finalizzato

del Ministero della Salute, ICS 160.2/RF 2003), 2004/2006) and by Apulia Region-D.G.R. n. 1159,

28/6/2018 and 2019"

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

"The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."

4. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions.

In your revised cover letter, please address the following prompts:

a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent.

b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories.

We will update your Data Availability statement on your behalf to reflect the information you provide.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Authors did a tremendous effort building the database and analyzing different AI algorithms to compare them, but I think that the networks they have built are unbalanced and some other methodological concerns that they should address.

• The authors did make a database including 2970 patients. They used 2920 cases for the training set and only 50 to test the algorithm. From the point of view of this reviewer, these 50 represent less than 2% of them, but it is recommended to include at least 15% of the data for validation purposes. Moreover, cases are well represented when they are selected randomly, as they did.

• Authors did use accuracy and variance to determine if the algorithm works. However, usually sensitivity and specificity are calculated because accuracy, although it takes into account the FP-FN, does not distinguish them separately, and it is as important that your algorithm is correct as it is not wrong, especially when it comes to the diagnosis of a disease.

• As for the description of the population they have used, there are several concerns. On the one hand, from the total cases, only about 32% have NAFLD, which makes the network not compensated and, be able to detect better the cases that do not have NAFLD than those that have NAFLD (something that they mention in the Discussion part, "the SVM was well able to identify subjects without NAFLD", obviously, because it has been trained with more cases without NAFLD). This can lead to very good accuracy results if the test cases are selected randomly and none of them is a NAFLD patient (and in their case it can happen since few test cases have been used). On the other hand, the network is also gender imbalanced, 56.5% male. Perhaps it would have been better to create two different networks, one for men and another for women. This decompensation is even greater in the case of women with NAFLD, there are only 30.5% of the cases, and they do not mention at any moment, neither in the conclusions or discussion part, and I think it is something they should say.

• At the Model Development part, at the end, they say that they do not include in the AI the parameter GGT because it is included in the formula, but in the formula it is included in a logarithmic way, so what is really evaluated is its logarithm and not the variable itself.

• For the diagnosis of NAFLD they use the ultrasounds, here you know the diagnostic limitations of the ultrasounds, a histological diagnosis should be recommended. In spite of barriers doing that, authors should discuss this aspect.

• "The actual training sessions" section is confusing. First, they remove a number of cases because they are missing features, but they don't include how many cases were removed, so the number of cases included for training the network remained hidden. A sentence completely baffles me: "For the training phase we had operated in the same way, randomly selecting the 50 individuals extracted from the initial database", it is not clear if they have used only 50 cases to train the network (it would be very few cases), or if they refer to another stage and have named it wrong. In the AI setting, the data are divided into three groups: training (most of them), to train the network, validation (around 15%, different from the training), to evaluate the network while it is being trained, and test (at least 15% of the data, different from the other two groups), to check the functioning of the network once it is built.

• The Results part, when they analyze BRI plus Glucose etc, we can read "the algorithm accuracy percentages were compared to those of the algorithm reducing the numbers of false negatives", I don't know what it means by reducing the number of false negatives, they should to clarify it, because it could be that they compare the algorithms by seeing which has less number of false negatives or that they have eliminated false negatives to make the comparison, which in my opinion would not be correct.

• The Discussion part, they don't indicate at any time the percentage of error, and I think they should say so, because an algorithm can be better than another if one has 60% of error and the other one 70%, but in both cases they would be very big errors to consider the algorithm as good.

• References Format: many of the references are placed in different ways, before or after the point.

Reviewer #2: Dear Authors, congratulations of the innovative work to help identify NAFLD patients via machine learning.

Here are my comments:

Q1. For noninvasive diagnosis of NAFLD, proton magnetic resonance spectroscopy (H-MRS) instead of ultrasound is usually adopted for its accuracy and fewer interobserver errors.

A technician or clinician performing ultrasound gives diagnosis of fatty liver when encountering higher echogenicity of liver parenchyma than that of spleen or renal cortex. The ultrasound diagnosis of steatosis is therefore subjective, prone to interobserver error, especially when studying those patients with borderline of mild fatty liver.

Did this study stratify and give different weight to a diagnosis of steatosis according to the severity noted by the ultrasound report? How did the study eliminate the interobserver error of ultrasound?

Q2. As discussed in Q1, the gold standard for diagnosis of hepatic steatosis is core biopsy, and H-MRS is regarded widely as the most accurate non-invasive diagnostic tool of steatosis. The current study will be much more persuasive if the training data set includes NAFLD confirmed by biopsy or H-MRS, and includes the ultrasound diagnosis when finally pooling the algorithms for weighting (%).

Q3. Some formatting errors or redundancy: "the accuracy of a model" in page 14. In table 2, were the numbers in the parenthesis of "Age", "FLI", "AVI", etc. indicating standard deviation? Please specify.

Q4. Are the parameters as rGT and glucose in the training models treated as continuous variables? If these were continuous variables, how did the authors describe those within normal range and those were mildly/severely deviated? The differentiation of variables within or above normal range may enhance the accuracy of the algorithm.

Q5. In figure 4, 5, 6, SVM performed nearly identical percentage of accuracy, which was relatively lower than other algorithms, while fewer wrong predictions in test phase was observed. What were the authors explanations?

Q6. As discussed in Q2, if the study aimed to "search the best algorithm to support NAFLD diagnosis", it needs a new study design. While if the study intends to show how machine learning reduce unnecessary ultrasound examinations, it only needs a minor revision.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Manuel Romero-Gómez

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Oct 20;15(10):e0240867. doi: 10.1371/journal.pone.0240867.r002

Author response to Decision Letter 0


8 Sep 2020

1) The manuscript now meets PLOS ONE's style requirements.

2) We have included the full name of the ethics committee/institutional review board(s) that approved our specific study.

3) We have removed all funding-related text from the manuscript.

4) We have uploaded the minimal anonymized data set necessary.

Reviewer #1: Authors did a tremendous effort building the database and analyzing different AI algorithms to compare them, but I think that the networks they have built are unbalanced and some other methodological concerns that they should address.

1) The authors did make a database including 2970 patients. They used 2920 cases for the training set and only 50 to test the algorithm. From the point of view of this reviewer, these 50 represent less than 2% of them, but it is recommended to include at least 15% of the data for validation purposes. Moreover, cases are well represented when they are selected randomly, as they did.

1) We have modified the paragraph and hope it is now more readable. The paragraph now reads: “Initially, it was important to pre-process the data to eliminate missing values in the features (x) or in the target variables (y). This yielded a dataset containing 2868 subjects. We then set the algorithm to randomly extract 50 individuals from the initial database, which were used for the construction of a new dataset. This dataset was used for the prediction phase, making use of the machine learning algorithms that had been adequately trained during the training phase. During the training sessions, all 2868 subjects were used for training and testing by means of the 10-fold cross-validation approach (45). To retrieve the best parameters for each algorithm, we used GridSearchCV, a method contained in Scikit-learn in Python. In this way, the algorithm characterized by the lowest variance was identified, to reduce the possibility of a false prediction. After selecting the best parameters for each algorithm, we carried out the prediction on the dataset of 50 individuals set aside for the prediction phase, thus obtaining the predicted target variable for each algorithm.”
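The procedure described in this paragraph (10-fold cross-validation with GridSearchCV to tune each algorithm, then prediction on 50 randomly held-out individuals) can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' actual code: the synthetic data, its size, and the SVM parameter grid are all invented assumptions.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the study data: 5 predictors (e.g. AVI, glucose,
# GGT, sex, age) and a binary NAFLD target. Sizes are illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(550, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=550) > 0).astype(int)

# Set aside 50 randomly chosen individuals for the prediction phase,
# keeping the rest for training, as described in the response.
X_train, X_pred, y_train, y_pred_true = train_test_split(
    X, y, test_size=50, random_state=0)

# Tune the SVM with GridSearchCV under 10-fold cross-validation
# (the parameter grid here is a guess, not the authors' settings).
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=10)
grid.fit(X_train, y_train)

# Predict the target variable for the 50 held-out individuals.
predictions = grid.predict(X_pred)
accuracy = (predictions == y_pred_true).mean()
```

`grid.best_params_` then holds the parameter combination selected by cross-validation, which is what the response refers to as "the best parameters for each algorithm".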

2) Authors did use accuracy and variance to determine if the algorithm works. However, usually sensitivity and specificity are calculated because accuracy, although it takes into account the FP-FN, does not distinguish them separately, and it is as important that your algorithm is correct as it is not wrong, especially when it comes to the diagnosis of a disease.

2) We used accuracy and variance without considering specificity and sensitivity because the purpose of this study was to highlight that a carefully selected machine learning algorithm (among those considered) could considerably reduce the number of subjects to be sent for ultrasound examination and, consequently, the costs of instrumental diagnosis. Furthermore, having the algorithm predictions for 50 subjects for each of the models as output, we experimentally verified the performance of the algorithms in terms of incorrect predictions. However, as suggested by the reviewer, we evaluated the specificity and sensitivity of the SVM algorithm in the prediction phase and added the table to the text.

The table is as follows.

Table 1 below shows the sensitivity and specificity values for each model.

Table 1

Model                                          Sensitivity  Specificity  PPV    NPV
FLI plus Glucose plus Age plus Sex             0.979        1.00         1.00   0.990
AVI plus Glucose plus GGT plus Age plus Sex    0.985        1.00         1.00   0.993
BRI plus Glucose plus GGT plus Age plus Sex    0.967        0.99         0.997  0.990
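The quantities reported for each model follow directly from the confusion-matrix counts of the prediction phase. A minimal sketch (the counts below are invented for illustration, not the ones behind the table):

```python
def sensitivity(tp, fn):
    # True-positive rate: fraction of NAFLD cases correctly flagged.
    return tp / (tp + fn)

def specificity(tn, fp):
    # True-negative rate: fraction of non-NAFLD subjects correctly cleared.
    return tn / (tn + fp)

def ppv(tp, fp):
    # Positive predictive value: how often a positive call is right.
    return tp / (tp + fp)

def npv(tn, fn):
    # Negative predictive value: how often a negative call is right.
    return tn / (tn + fn)

# Hypothetical counts for a 50-subject prediction set.
tp, fp, tn, fn = 15, 0, 34, 1
```

With no false positives, specificity and PPV are both 1.00, which is the pattern the first two models show in the table.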

3) As for the description of the population they have used, there are several concerns. On the one hand, from the total cases, only about 32% have NAFLD, which makes the network not compensated and, be able to detect better the cases that do not have NAFLD than those that have NAFLD (something that they mention in the Discussion part, "the SVM was well able to identify subjects without NAFLD", obviously, because it has been trained with more cases without NAFLD). This can lead to very good accuracy results if the test cases are selected randomly and none of them is a NAFLD patient (and in their case it can happen since few test cases have been used). On the other hand, the network is also gender imbalanced, 56.5% male. Perhaps it would have been better to create two different networks, one for men and another for women. This decompensation is even greater in the case of women with NAFLD, there are only 30.5% of the cases, and they do not mention at any moment, neither in the conclusions or discussion part, and I think it is something they should say.

3) We have modified and added some comments in the text. We understand the reviewer's concern and have tried to explain the structure of our population. The paragraph now reads: “Data come from a population-based random sample of subjects. As the sampling procedure was random, the sample represents not only the age-sex structure of the population but also the age-sex distribution of the diseases or conditions present in the population. Hence, the frequency of NAFLD in the sample represents the prevalence of NAFLD in the population. Moreover, in the population studied, NAFLD is more prevalent among men than among women.”

4) At the Model Development part, at the end, they say that they do not include in the AI the parameter GGT because it is included in the formula, but in the formula it is included in a logarithmic way, so what is really evaluated is its logarithm and not the variable itself.

4) Yes. In the formula, the natural logarithm of GGT is evaluated. The authors provided an explanation in the original paper (Bedogni G et al. The Fatty Liver Index: a simple and accurate predictor of hepatic steatosis in the general population. BMC Gastroenterol 2006;6:33): “All variables besides gender were evaluated as continuous predictors. Linearity of logits was ascertained using the Box-Tidwell procedure [22]. To obtain a linear logit, we transformed age using the coefficient suggested by the Box-Tidwell procedure [(age/10) ∧ 4.9255] and ALT, AST, GGT, insulin and triglycerides using natural logarithms (loge). The logits of the other predictors (BMI, waist circumference, glucose, cholesterol, ethanol and the sum of 4 skinfolds) were linear.” To make it clearer, we have added a comment in the text.

5) For the diagnosis of NAFLD they use the ultrasounds, here you know the diagnostic limitations of the ultrasounds, a histological diagnosis should be recommended. In spite of barriers doing that, authors should discuss this aspect.

5) We have added text, which now reads: “Limitations include both the limited number of observations and the method used to perform the NAFLD diagnosis. The low sensitivity of the NAFLD diagnostic methodology may be criticized, as it fails to detect fatty liver content >25-90% (47). However, this is a population-based study and the subjects were chosen from the electoral register. They did not seek medical attention and participated on a volunteer basis. Hence, diagnosis of NAFLD by US was the only diagnostic procedure we could propose to the participants; ethical issues prevented us from proposing biopsy or H-MRS. Moreover, to lighten the waiting lists, our purpose was to find a machine learning algorithm that would permit us to avoid a number of USs which would otherwise have been prescribed. Thus, this algorithm is useful for excluding NAFLD and as a valid diagnostic support in the context of epidemiologic studies, not as a replacement diagnostic tool.”

6) "The actual training sessions" section is confusing. First, they remove a number of cases because they are missing features, but they don't include how many cases were removed, so the number of cases included for training the network remained hidden. A sentence completely baffles me: "For the training phase we had operated in the same way, randomly selecting the 50 individuals extracted from the initial database", it is not clear if they have used only 50 cases to train the network (it would be very few cases), or if they refer to another stage and have named it wrong. In the AI setting, the data are divided into three groups: training (most of them), to train the network, validation (around 15%, different from the training), to evaluate the network while it is being trained, and test (at least 15% of the data, different from the other two groups), to check the functioning of the network once it is built.

6) "The actual training sessions" section has been modified and now reads: “Initially, it was important to pre-process the data to eliminate missing values in the features (x) or in the target variables (y). This yielded a dataset containing 2868 subjects. We then set the algorithm to randomly extract 50 individuals from the initial database, which were used for the construction of a new dataset. This dataset was used for the prediction phase, making use of the machine learning algorithms that had been adequately trained during the training phase. During the training sessions, all 2868 subjects were used for training and testing by means of the 10-fold cross-validation approach (45). To retrieve the best parameters for each algorithm, we used GridSearchCV, a method contained in Scikit-learn in Python. In this way, the algorithm characterized by the lowest variance was identified, to reduce the possibility of a false prediction. After selecting the best parameters for each algorithm, we carried out the prediction on the dataset of 50 individuals set aside for the prediction phase, thus obtaining the predicted target variable for each algorithm.”

7) In the Results section, in the analysis of BRI plus Glucose etc., we can read "the algorithm accuracy percentages were compared to those of the algorithm reducing the numbers of false negatives". I don't know what "reducing the number of false negatives" means, and the authors should clarify it: it could mean either that they compared the algorithms by which had the fewest false negatives, or that they eliminated false negatives to make the comparison, which in my opinion would not be correct.

7) We thank the reviewer. The sentence was incorrect and has been removed.

8) In the Discussion section, the percentage of error is never indicated, and I think it should be, because one algorithm can be better than another when one has a 60% error and the other 70%, yet in both cases the errors would be far too large to consider either algorithm good.

We have added to the text, which now reads: “In the test phase, the models composed of FLI plus Glucose plus Age plus Sex and of AVI plus Glucose plus GGT plus Age plus Sex yielded a percentage error of 32%, while the model composed of BRI plus Glucose plus GGT plus Age plus Sex yielded an error of 23%. In the prediction phase, however, the model that made the fewest errors was the one composed of AVI plus Glucose plus GGT plus Age plus Sex, with an error of 20%, against 26% for FLI plus Glucose plus Age plus Sex and 28% for BRI plus Glucose plus GGT plus Age plus Sex. Therefore, AVI plus Glucose plus GGT plus Sex plus Age was the model that contributed most to reducing unnecessary ultrasound examinations.”

9) References Format: many of the references are placed in different ways, before or after the point.

9) The references have been corrected and all inserted before the point.

Reviewer #2: Dear Authors, congratulations of the innovative work to help identify NAFLD patients via machine learning.

Here are my comments:

Q1. For the noninvasive diagnosis of NAFLD, proton magnetic resonance spectroscopy (H-MRS), rather than ultrasound, is usually adopted for its accuracy and lower interobserver error.

A technician or clinician performing ultrasound gives a diagnosis of fatty liver when encountering higher echogenicity of the liver parenchyma than that of the spleen or renal cortex. The ultrasound diagnosis of steatosis is therefore subjective and prone to interobserver error, especially when studying patients with borderline or mild fatty liver.

Did this study stratify and give different weight to a diagnosis of steatosis according to the severity noted by the ultrasound report? How did the study eliminate the interobserver error of ultrasound?

Q2. As discussed in Q1, the gold standard for diagnosis of hepatic steatosis is core biopsy, and H-MRS is regarded widely as the most accurate non-invasive diagnostic tool of steatosis. The current study will be much more persuasive if the training data set includes NAFLD confirmed by biopsy or H-MRS, and includes the ultrasound diagnosis when finally pooling the algorithms for weighting (%).

A1, A2. We have explained and added a paragraph in the text. It now reads: “Following the European Association for the Study of the Liver (EASL), European Association for the Study of Diabetes (EASD) and European Association for the Study of Obesity (EASO) recommendations (7), NAFLD diagnosis was performed using an ultrasound scanner Hitachi H21 Vision (Hitachi Medical Corporation, Tokyo, Japan). Examination of the visible liver parenchyma was performed with a 3.5 MHz transducer.” And: “The low sensitivity of the NAFLD diagnostic methodology may be criticized, as it fails to detect fatty liver content >25-90% (47). However, this is a population-based study and subjects were chosen from the electoral register. They did not seek medical attention and participated on a volunteer basis. The diagnosis of NAFLD by US was therefore the only diagnostic procedure we could propose to the participants; ethical issues prevented us from proposing biopsy or H-MRS. Moreover, with a view to lightening the waiting lists, our purpose was to find a machine learning algorithm that permits us to avoid a number of USs that would otherwise have been prescribed. This algorithm is thus useful for excluding NAFLD and as a valid diagnostic support in the context of epidemiologic studies, not as a replacement diagnostic tool.”

Q3. Some formatting errors or redundancy: "the accuracy of a model" on page 14. In Table 2, do the numbers in the parentheses of "Age", "FLI", "AVI", etc. indicate standard deviation? Please specify.

A3. We have modified Table 2 by adding a footnote that reads: Cells reporting subject characteristics contain mean (±SD) or n (%).

Q4. Are parameters such as GGT and glucose in the training models treated as continuous variables? If they are continuous, how did the authors handle values within the normal range versus those that were mildly or severely deviated? Differentiating variables within or above the normal range may enhance the accuracy of the algorithm.

A4. We considered the indexes in the original form in which they were published after a peer review process, without any modification. Nevertheless, we have added a paragraph about this in the text, which now reads: “We chose to introduce the biochemical markers as continuous variables to better reflect their natural scale. The assumption behind this choice is that categorization entails a loss of information.”

Q5. In figures 4, 5 and 6, SVM performed at a nearly identical percentage of accuracy, relatively lower than the other algorithms, while fewer wrong predictions were observed in the test phase. What is the authors' explanation?

A5. The low number of incorrect predictions in the test phase is explained by the fact that, although SVM performed with lower accuracy than the other algorithms, it was the one with the lowest variance. In the test phase it therefore made fewer errors than algorithms with a higher accuracy percentage but a larger variance.
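The variance argument in this response can be illustrated by comparing the mean and spread of cross-validation scores across folds. This is a sketch with synthetic data and illustrative model choices, not the study's actual comparison.

```python
# Compare two classifiers by the mean and standard deviation of their
# 10-fold cross-validation accuracies: a lower std across folds suggests
# more stable out-of-sample behaviour, even at slightly lower mean accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

for name, clf in [("SVM", SVC()),
                  ("Tree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"{name}: mean={scores.mean():.2f} sd={scores.std():.2f}")
```

Under this view, the "best" algorithm is not necessarily the one with the highest mean fold accuracy but the one whose fold scores vary least.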

Q6. As discussed in Q2, if the study aimed to "search the best algorithm to support NAFLD diagnosis", it needs a new study design. While if the study intends to show how machine learning reduce unnecessary ultrasound examinations, it only needs a minor revision.

A6. The article aimed to highlight, once the best algorithm capable of supporting the diagnosis of NAFLD had been selected from among the algorithms studied, how much such an approach could reduce costs and the number of unnecessary ultrasound examinations.

Moreover, to make our methods clearer, we have added the following paragraph to the text: “A fasting venous blood sample was drawn, and the serum was separated into two different aliquots. One aliquot was immediately stored at −80 °C. The second aliquot was used to test biochemical serum markers by standard laboratory techniques in our Central laboratory.”

Attachment

Submitted filename: Reviewer.docx

Decision Letter 1

Ming-Lung Yu

5 Oct 2020

SELECTING THE BEST MACHINE LEARNING ALGORITHM TO SUPPORT THE DIAGNOSIS OF NON-ALCOHOLIC FATTY LIVER DISEASE: A META LEARNER STUDY.

PONE-D-20-19706R1

Dear Dr. Osella,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Ming-Lung Yu, MD, PhD

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: The authors have already addressed all of the concerns I raised.

The paper showed that a machine learning approach could be an efficient and low-cost tool to support the diagnosis of NAFLD.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

Acceptance letter

Ming-Lung Yu

9 Oct 2020

PONE-D-20-19706R1

Selecting the Best Machine Learning Algorithm to support the diagnosis of Non-Alcoholic Fatty Liver Disease: A Meta Learner Study.

Dear Dr. Osella:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Ming-Lung Yu

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. Dataset with random selection.

    (XLS)


    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.


