PLOS ONE. 2022 Aug 19;17(8):e0273383. doi: 10.1371/journal.pone.0273383

Framework for feature selection of predicting the diagnosis and prognosis of necrotizing enterocolitis

Jianfei Song 1,#, Zhenyu Li 2,#, Guijin Yao 1, Songping Wei 1, Ling Li 1,*, Hui Wu 2,*
Editor: Vijayalakshmi Kakulapati
PMCID: PMC9390903  PMID: 35984833

Abstract

Neonatal necrotizing enterocolitis (NEC) occurs worldwide and is a major source of neonatal morbidity and mortality. Researchers have developed many methods for predicting NEC diagnosis and prognosis. However, most use statistical methods to select features, which may ignore the correlation between features. In addition, because they consider only a small number of features, they neglect laboratory parameters such as white blood cell count, lymphocyte percentage, and mean platelet volume, which could be potentially influential factors affecting the diagnosis and prognosis of NEC. To address these issues, we include more perinatal, clinical, and laboratory information, including anemia-red blood cell transfusion and feeding strategies, and propose a ridge regression and Q-learning strategy based bee swarm optimization (RQBSO) metaheuristic algorithm for predicting NEC diagnosis and prognosis. Finally, a linear support vector machine (linear SVM), which specializes in classifying high-dimensional features, is used as the classifier. In the NEC diagnostic prediction experiment, the area under the receiver operating characteristic curve (AUROC) on dataset 1 (feeding intolerance + NEC) reaches 94.23%. In the NEC prognostic prediction experiment, the AUROC on dataset 2 (medical NEC + surgical NEC) reaches 91.88%. Additionally, the classification accuracy of the RQBSO algorithm on the NEC dataset is higher than that of the other feature selection algorithms. Thus, the proposed approach has the potential to identify predictors that contribute to the diagnosis of NEC and the stratification of disease severity in a clinical setting.

Introduction

Necrotizing enterocolitis (NEC) is one of the most devastating gastrointestinal diseases in the neonatal intensive care unit (NICU), with significant morbidity and mortality [1]. It is estimated that the incidence of NEC has remained at 3%-15%, and the mortality rate at 20%-30%, for decades [2, 3]. In general, the diagnosis of NEC is based on a combination of clinical, laboratory, and radiographic findings, most of which are nonspecific or even insidious [4, 5]; for example, abdominal distention and reduced bowel sounds are clinical indications of both feeding intolerance (FI) and NEC. These insensitive features hinder timely diagnosis and accurate treatment. Given the difficulty of early diagnosis of NEC and the lack of reliable biomarkers, it is essential to develop an effective diagnostic model of NEC to quickly and accurately identify the key information affecting the diagnosis and prognosis of NEC, leading to more timely treatment.

NEC can be a rapidly progressing disease, and it may take only one to two days to progress from initial symptoms to full-blown illness and death. The severity of the disease is usually divided into "medical NEC" and "surgical NEC". Medical NEC refers only to medical management, while surgical NEC involves surgical intervention. In addition, as the disease progresses, the child’s symptoms become more pronounced and the risk of long-term complications increases significantly, including neurocognitive impairment, developmental failure, short bowel syndrome, and cholestasis [6–8]. Therefore, it is necessary to identify high-risk infants before the disease progresses rapidly to ensure that therapeutic interventions can be initiated as soon as possible, before bowel resection is required.

In recent years, machine learning (ML) methods have been widely used to diagnose cancer [9–11] and other common diseases [12, 13]. Many researchers have developed prediction models for early NEC diagnosis (suspected NEC + NEC) and graded NEC diagnosis (medical NEC + surgical NEC). In the feature selection stage, they use statistical analysis to extract important features. In the classification stage, most researchers use ML methods such as linear discriminant analysis (LDA) [14, 15], random forest (RF) [16–18], or Light Gradient Boosting Machine (GBM) [19] as classifier models. Table 1 summarizes some studies using ML for the diagnosis or prognosis of NEC.

Table 1. Relevant studies involving ML methods for NEC diagnosis and prognosis.

Author  Number of features used  Classifier  AUROC
NEC diagnosis (suspected NEC and NEC)
Pantalone, J. M. et al. 14 RF 87.7%
Lure, A. C. et al. 16 RF 98%
Jaskari, J. et al. 14 RF 80.6%
Gao, W. J. et al. 23 GBM 93.37%
NEC prognosis (medical NEC and surgical NEC)
Ji, J. et al. 9 LDA 84.38%
Sylvester, K. G. et al. 27 LDA 81.7%
Pantalone, J. M. et al. 14 RF 75.9%
Gao, W. J. et al. 23 GBM 94.13%

Abbreviations: RF, random forest; GBM, light gradient boosting machine; LDA, linear discriminant analysis.

Most relevant studies perform well in the diagnosis and prognosis prediction of NEC. However, some key issues also need to be addressed. Firstly, most researchers use statistical methods to select features, which may ignore the correlation between features. Specifically, the idea of statistical methods is to use statistical significance to explore the association between each feature and category labels. Since there may be potential correlations between features, it is crucial to consider the correlation between features to ensure that the best performing subset of features is selected. Secondly, most researchers select a small number of features, which may overlook features that are highly correlated with predicted outcomes. Therefore, in order to solve the above problems, we need to include more features while considering their correlation.

Feature selection is a fundamental task in machine learning and statistics, and has proved an effective way to process feature-correlated data in previous studies [20, 21]. Feature selection methods fall into three categories: filter methods, wrapper methods, and embedded methods. Filter methods [22–25] extract a subset of features from the initial dataset, scoring each feature with a statistical measure of relevance and filtering on that score. The advantage of this approach is that the calculation is relatively easy and efficient. However, filter methods only rank features by their single-feature association with class labels and thus tend to ignore correlations between features [26]. Wrapper methods [27, 28] integrate the classification algorithm into the feature selection process. Because wrappers directly optimize the target classification algorithm, they often achieve better classification performance than filters. Wrappers usually run much slower than filter methods due to their consideration of inter-feature relationships [29]. Embedded methods [30–33] use a classification learning algorithm to evaluate the validity of features, which retains the high precision of wrapper methods and the high efficiency of filter methods. However, the time complexity is relatively high when processing high-dimensional data, and redundant features cannot be completely removed [34].

To address the above issues, various works have proposed solving feature selection problems with metaheuristics [35]. Most of them use genetic algorithms (GA) [36–39]. Metaheuristic algorithms based on swarm intelligence are also applied to feature selection, such as ant colony optimization (ACO) [40, 41], particle swarm optimization (PSO) [42, 43], and bee swarm optimization (BSO) [44, 45]. Although metaheuristic algorithms are very effective at solving feature selection problems, the growing number of features makes this task increasingly difficult. Therefore, combining metaheuristic algorithms with machine learning and approaches from other areas may achieve better results [46, 47].

In this paper, we propose a novel algorithm called the ridge regression and Q-learning strategy based bee swarm optimization (RQBSO) metaheuristic algorithm to predict NEC diagnosis and prognosis. Ridge regression is an embedded feature selection method. Compared with other feature selection methods, the ridge regression algorithm can filter out irrelevant features while considering the correlation between features. Therefore, ridge regression helps screen out irrelevant variables and improves the efficiency of the metaheuristic search. To obtain the optimal feature subset, a Q-learning strategy based bee swarm optimization (QBSO) metaheuristic algorithm is used. The advantage of Q-learning is that it does not require a complete model of the underlying problem, because learning is performed by gathering experience through trial and error [48]. By combining Q-learning with the BSO algorithm, the BSO algorithm can adapt during the search over feature subsets. In the classification stage, since the RQBSO method outputs sparse feature vectors, a linear SVM specialized in processing such data is used as the classifier model.

Materials and methods

Datasets

Settings and patients

This retrospective observational study was conducted in the neonatal intensive care unit (NICU) of Jilin University First Hospital, China, from January 1, 2015 to October 30, 2021 in accordance with the Helsinki Declaration of the World Medical Association. The study is approved by the Institutional Review Board of Jilin University First Hospital (Ethics No.2021-042). Due to the nature of the study, the informed consent from the parents/guardians of the patients is waived.

The infants with the presentation of FI who underwent abdominal imaging were enrolled, and their medical records were collected. FI is defined as “the inability to digest enteral feedings presented as gastric residual volume of more than 50%, abdominal distension or emesis or both, and the disruption of the patient’s feeding plan” [49]. The exclusion criteria are as follows: (a) congenital malformations, (b) spontaneous bowel perforation, (c) emergency surgical conditions unrelated to NEC, and (d) incomplete information.

Data collection and definitions

The collected NEC and FI datasets include clinical patient information obtained between diagnosis and discharge from the NICU. The final diagnosis is determined by two independent senior neonatologists from an examination of the complete medical chart, including all perinatal and clinical findings, such as clinical manifestations, laboratory tests, the results of imaging, and the disease course. In case of disagreement between the two neonatologists, a consensus is reached with the help of a senior expert. We judge whether the infant experienced NEC based on modified Bell stage ≥IIA and then determine that the following criteria should be met in the whole disease course: (1) the presentation of FI; (2) abdominal signs (such as bowel sound attenuation and abdominal tenderness) and systemic signs (such as apnoea, lethargy, and temperature instability); and (3) antibiotics therapy and withholding feeds for at least one week [2, 50].

The NEC group is further divided into a "medical NEC group" and a "surgical NEC group". Medical NEC involves only medical management, including withholding feeds, provision of parenteral nutrition, and empirical use of antibiotics, while surgical NEC involves surgical interventions, including laparotomy and peritoneal drainage. To avoid selection bias, infants who die from severe NEC disease are assigned to the surgical NEC group. Timing of NEC onset (t0) is defined as the earliest occurrence of one of the following, within 48 hours of confirmation: 1) first notification of abdominal problems by the neonatologist, 2) abdominal radiographs or abdominal ultrasound ordered, 3) stopping enteral feeding, or 4) initiation of antibiotics [51, 52]. To identify predictors of NEC diagnosis and disease severity, we evaluate perinatal, clinical, and laboratory variables including treatment details prior to clinical onset of NEC in detail. A detailed description of each variable is shown in the S1 Table.

Methods

In this paper, we propose a feature selection cascade framework to address NEC diagnosis and prognosis prediction. Fig 1 shows the flowchart of our experiments, which can be divided into three stages: data preprocessing, feature selection using the RQBSO algorithm, and model classification.

Fig 1. The flowchart of the proposed method.


All experiments are performed on a computer with 16 GB RAM and an i7-6700 CPU clocked at 3.40 GHz, running Jupyter Notebook 3.6.1. All analyses are performed using the Scikit-learn library for Python 3.7 and the Matplotlib visualization tool.

Data preprocessing

First, we count the missing data and exclude clinical parameters from the study if more than 30% of their values are missing. The remaining missing values are then filled using the k-nearest neighbor method. K-nearest neighbor filling is based on the principle that missing values are estimated from the feature values of the k nearest neighboring samples. Assuming that xai (the i-th feature of the a-th sample) is a missing value, the samples that do not contain a missing value at the corresponding position serve as providers of training information (neighbors). The reciprocal of the Euclidean distance between the a-th sample and the b-th sample is used as the weight in filling (Eq (1)).

$$\omega_{ab}=\frac{1}{\sqrt{\sum_{k}\left(x_{ak}-x_{bk}\right)^{2}}} \tag{1}$$

where k indexes the features of the sample. The estimate of a missing value is then a weighted average of the feature values of the nearest neighbor samples (Eq (2)).

$$x_{ai}=\frac{1}{\sum_{b}\omega_{ab}}\sum_{b}x_{bi}\,\omega_{ab} \tag{2}$$

Intuitively, the closer a neighboring sample is to the target sample, the smaller the Euclidean distance between the two, the larger the weight factor, and the greater its contribution to filling the missing value. However, this method can only estimate continuous variables, not discrete ones. To address this drawback, we extend the existing k-nearest neighbor algorithm to discrete variables by voting among the k nearest neighbor samples and filling the missing value with the neighbor category that receives the most votes, as shown in Eq (3).

$$x_{ai}=\mathrm{Mode}\{x_{bi}\},\quad b\in K \tag{3}$$

where K is the set of all k nearest neighbor samples.

By adopting this hybrid missing-value filling strategy, we not only make effective use of the existing information but also extend the k-nearest neighbor filling method so that it can handle both discrete and continuous variables.
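As a concrete illustration, the hybrid filling scheme above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code: the function name, the toy data, and the tie-breaking for equidistant neighbors are our assumptions.

```python
import numpy as np

def knn_fill(X, is_discrete, k=3):
    """Fill missing entries of X using the k nearest complete samples.

    Continuous features: inverse-distance weighted mean (Eqs (1)-(2)).
    Discrete features: majority vote among the k neighbors (Eq (3)).
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for a in range(X.shape[0]):
        for i in np.where(np.isnan(X[a]))[0]:
            # candidate neighbors must have feature i observed
            donors = [b for b in range(X.shape[0])
                      if b != a and not np.isnan(X[b, i])]

            def dist(b):
                # Euclidean distance over features observed in both rows
                mask = ~np.isnan(X[a]) & ~np.isnan(X[b])
                return float(np.sqrt(np.sum((X[a, mask] - X[b, mask]) ** 2)))

            nearest = sorted(donors, key=dist)[:k]
            values = [X[b, i] for b in nearest]
            if is_discrete[i]:
                # discrete feature: mode of the neighbors' categories
                filled[a, i] = max(set(values), key=values.count)
            else:
                # continuous feature: weights are reciprocal distances (Eq (1))
                w = np.array([1.0 / (dist(b) + 1e-12) for b in nearest])
                filled[a, i] = np.sum(w * np.array(values)) / np.sum(w)
    return filled

# toy example: the last sample is missing its first (continuous) feature,
# and its nearest neighbor (row 2) dominates the weighted average
X = np.array([[1.0, 0.0],
              [1.2, 0.0],
              [5.0, 1.0],
              [np.nan, 1.0]])
print(knn_fill(X, is_discrete=[False, False], k=2)[3, 0])  # close to 5.0
```

The reciprocal-distance weighting means an exact-match neighbor effectively determines the filled value, which matches the intuition stated above.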

We normalize the raw NEC data by the z-score algorithm [53] to eliminate the effects of inter-feature variation in magnitude and distribution. In addition, the normalized data can improve the convergence speed and prediction accuracy of the ML model.

Feature selection using RQBSO algorithm

The RQBSO framework is a feature selection algorithm for the diagnosis and prognosis of NEC. It combines a ridge regression algorithm with a Q-learning strategy based BSO metaheuristic algorithm. Unlike BSO, it filters out irrelevant features with the ridge regression technique, so the BSO search does not need to traverse all features. Therefore, compared with BSO, the RQBSO algorithm trains faster. Figs 2 and 3 show the structure and pseudocode of the RQBSO algorithm, respectively.

Fig 2. The structure of the used RQBSO algorithm.


Fig 3. The pseudocode of the used RQBSO algorithm.


In the first stage of RQBSO, we collapse these NEC data vectors in the data input layer into a NEC data matrix suitable for processing by the feature selection algorithm. Eq (4) shows the process as

$$X=\begin{bmatrix}x_{11}&x_{12}&\cdots&x_{1N}\\x_{21}&x_{22}&\cdots&x_{2N}\\\vdots&\vdots&\ddots&\vdots\\x_{m1}&x_{m2}&\cdots&x_{mN}\end{bmatrix} \tag{4}$$

where xij denotes the j-th feature of the i-th sample. The matrix X is fed to the next stage for feature prescreening.

In the second stage of RQBSO, ridge regression is used for preliminary screening of features. The purpose is to filter out irrelevant features, reduce the space of the feature search, and improve search efficiency. The optimization objective of Ridge is

$$J(\beta)=\sum_{i=1}^{m}\left(y_{i}-\beta^{T}x_{i}\right)^{2}+\lambda\|\beta\|_{2}^{2},\quad\lambda>0 \tag{5}$$

where xi denotes the i-th sample, yi denotes the i-th label, and the regularization parameter λ determines the degree to which model coefficients are compressed. We use cross-validation to determine the appropriate λ value. To solve for the regression coefficient β, we take the partial derivative of Eq (5) with respect to β, as shown in Eq (6)

$$\nabla J(\beta)=-2X^{T}(Y-X\beta)+2\lambda\beta \tag{6}$$

where $X=[x_{1},x_{2},\ldots,x_{m}]^{T}$ and $Y=[y_{1},y_{2},\ldots,y_{m}]^{T}$. Setting $\nabla J(\beta)=0$, the value of β can be obtained (as shown in Eq (7)):

$$\beta=\left(X^{T}X+\lambda I\right)^{-1}X^{T}Y \tag{7}$$

where I denotes the identity matrix. Features whose regression coefficients are compressed to zero are filtered out, yielding an interpretable model and achieving the goal of feature screening.
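Under the scikit-learn stack the experiments are run on, this prescreening stage might look like the sketch below. Note that ridge, unlike LASSO, rarely drives coefficients exactly to zero, so this sketch screens by coefficient magnitude; the synthetic data and the 1e-2 cut-off are our illustrative assumptions, not the paper's values.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
# synthetic stand-in for the NEC data: only the first three features carry signal
y = X[:, 0] + 0.5 * X[:, 1] - 0.5 * X[:, 2] + 0.1 * rng.normal(size=200)

# cross-validation chooses the regularization strength lambda of Eq (5)
# (scikit-learn calls it `alpha`)
ridge = RidgeCV(alphas=np.logspace(-3, 3, 50)).fit(X, y)

# keep features whose coefficients are not compressed to (near) zero;
# the 1e-2 threshold is an illustrative choice
keep = np.where(np.abs(ridge.coef_) > 1e-2)[0]
print("retained feature indices:", keep)
```

The retained indices would then define the reduced search space handed to the QBSO stage.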

In the final stage of RQBSO, the features that flow into the next stage are further filtered using the QBSO feature selection method to obtain the optimal subset of features. The QBSO method can be roughly divided into three stages: the determination of the search area, the local search of the Q-learning strategy, and the determination of the optimal feature subset.

The determination of the search area. In the first iteration, 20% of the features are randomly selected as the initial feature set, which serves as the initial reference solution Refsol. To obtain the feature subsets of the search area, we use two different strategies to ensure that the generated subsets are as different as possible. In the first strategy, the k-th feature subset is generated by flipping every flip-th bit of Refsol starting from the k-th bit, so that each subset flips n/flip bits. Here, flip is a hyper-parameter; the number of subsets generated equals the number of bees and determines how many bits are flipped in Refsol. As an example, let n = 20 and flip = 5, where n denotes the number of features in Refsol. If the feature indices are 0 to 19, feature subsets f0, f1, f2, f3 and f4 are obtained by flipping the following bits, as shown in Fig 4: (0,5,10,15), (1,6,11,16), (2,7,12,17), (3,8,13,18) and (4,9,14,19). In the second strategy, the k-th feature subset is obtained by flipping n/flip contiguous bits starting from bit k·(n/flip). Following the previous example, the feature subsets f0, f1, f2, f3 and f4 are obtained by flipping the following bits: (0,1,2,3), (4,5,6,7), (8,9,10,11), (12,13,14,15) and (16,17,18,19). With the above two search strategies, we determine the search area for each bee.

Fig 4.


(a) solutions generated by the first strategy, (b) solutions generated by the second strategy.
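The two flipping strategies can be reproduced in a few lines (the function name is ours); with n = 20 and flip = 5 this yields exactly the bit groups listed above:

```python
def flip_indices(n, flip):
    """Bit positions each of the `flip` subsets flips in Refsol.

    Strategy 1: every flip-th bit starting at bit k (n/flip bits per subset).
    Strategy 2: a contiguous run of n/flip bits starting at bit k*(n/flip).
    """
    step = n // flip  # number of bits flipped per subset
    strategy1 = [[k + j * flip for j in range(step)] for k in range(flip)]
    strategy2 = [[k * step + j for j in range(step)] for k in range(flip)]
    return strategy1, strategy2

s1, s2 = flip_indices(20, 5)
print(s1[0])  # [0, 5, 10, 15]
print(s2[0])  # [0, 1, 2, 3]
```

The two strategies partition the same n bits differently, so the bees start their local searches from complementary regions of the search space.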

The local search of the Q-learning strategy. After determining the search area of the feature sets, we perform a nearest neighbor search for each bee by flipping each bit of the feature set separately. We denote the action of flipping the current feature as $a_t$ ($a_t \in A_t$) and the state as $s_t$ ($s_t \in S_t$), where $A_t=\{a_t,a_{t+1},\ldots,a_{t+n}\}$ and $S_t=\{s_t,s_{t+1},\ldots,s_{t+n}\}$, and denote $a_{t+1}$ as the action at the next moment (flipping the next feature) and $s_{t+1}$ as the state resulting from that action. By comparing the state at the current moment with that at the next moment, we obtain the reward $r_t$ when searching for a subset of features in different neighborhoods, as shown in Eq (8).

$$r_{t}=\begin{cases}\mathrm{Acc}(s_{t+1}),&\mathrm{Acc}(s_{t})<\mathrm{Acc}(s_{t+1})\\\mathrm{Acc}(s_{t+1})-\mathrm{Acc}(s_{t}),&\mathrm{Acc}(s_{t})>\mathrm{Acc}(s_{t+1})\end{cases} \tag{8}$$

where Acc(st) denotes the classification accuracy of the selected feature subset in the current state, and Acc(st+1) denotes the classification accuracy of the selected feature subset in the next state. If the accuracy of selecting a subset of features in the current state is equal to that in the next state, then the reward rt is calculated by comparing the number of selected features in the two states, as shown in Eq (9).

$$r_{t}=\begin{cases}\tfrac{1}{2}\,\mathrm{Acc}(s_{t+1}),&\mathrm{nbFeatures}(s_{t})>\mathrm{nbFeatures}(s_{t+1})\\-\tfrac{1}{2}\,\mathrm{Acc}(s_{t+1}),&\mathrm{nbFeatures}(s_{t})<\mathrm{nbFeatures}(s_{t+1})\end{cases} \tag{9}$$

where nbFeatures(st) denotes the number of features selected in the current state and nbFeatures(st+1) denotes the number of features selected in the next state. Then, we construct a Q-table of states and actions to store the Q-values and obtain the most favorable action (subset of features) based on the Q-values. The Q-values are calculated as shown in Eq (10).

$$Q(s_{t+1},a_{t+1})=r_{t}+\gamma\,Q(s_{t},a_{t}) \tag{10}$$

where 0 ≤ γ ≤ 1 denotes the discount parameter, Q(st+1, at+1) denotes the Q-value under the next state and action, Q(st, at) denotes the Q-value under the current state and action, and the initial Q-values are zero. By comparing the Q-values of the candidate feature subsets, we select the subset with the largest Q-value as the starting solution for each bee’s next search and continuously update it. This process repeats until a predetermined number of iterations (localIteration) is reached, and the best solution found is finally returned as the result of that bee’s search.
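The reward and update rules of Eqs (8)-(10) reduce to a few lines of Python. This is a sketch under our reading of Eq (9), in which the smaller feature subset receives the positive half-reward; the function names are ours.

```python
def reward(acc_now, acc_next, nfeat_now, nfeat_next):
    """Reward r_t of Eqs (8)-(9): accuracy first, subset size as tie-breaker."""
    if acc_next > acc_now:       # accuracy improved: full positive reward
        return acc_next
    if acc_next < acc_now:       # accuracy dropped: negative reward
        return acc_next - acc_now
    # equal accuracy: prefer the smaller feature subset (our sign convention)
    return 0.5 * acc_next if nfeat_next < nfeat_now else -0.5 * acc_next

def q_update(q_now, r, gamma=0.1):
    """Eq (10): Q(s_{t+1}, a_{t+1}) = r_t + gamma * Q(s_t, a_t)."""
    return r + gamma * q_now
```

With γ = 0.1 (the value in Table 3), past Q-values decay quickly, so the search is dominated by the most recent rewards.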

The determination of the optimal feature subset. After determining the optimal solution of each bee’s search, we compare their Q-values and take the feature subset corresponding to the largest Q-value as the reference solution for the next iteration. Steps (1) and (2) are then repeated until a predefined number of iterations (MaxIteration) is reached. Finally, the feature set with the maximum Q-value is returned as the optimal feature subset. If the maximum Q-value found in an iteration is smaller than that of the previous iteration, we perform a diversification operation in the next iteration; that is, we randomly re-select 20% of the features as the solution for the next iteration, then repeat steps (1) and (2) to determine a new maximum Q-value and compare it with the current maximum.

The advantage of the QBSO algorithm is that it learns through interaction with the environment. At the same time, the Q-learning adaptive search strategy helps it avoid getting trapped in local optima.

Model classification

To evaluate the performance of the feature selection algorithm, we use a supervised classification model, the linear SVM, to calculate the classification accuracy. The linear SVM classifier is a popular supervised learning algorithm that classifies samples with a computed decision hyperplane. The choice of the error penalty factor C, which controls the tolerance to misclassification, significantly affects the accuracy of the linear SVM. In our experiments, we use an SVM with a linear kernel function [54] and set the parameter C to 1.
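With scikit-learn, which the experiments are run on, the classifier amounts to the sketch below; the synthetic data is our stand-in for the selected NEC feature subset, not the study data.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# synthetic high-dimensional stand-in for the selected feature subset
X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           random_state=0)

# linear kernel with penalty parameter C = 1, as in the experiments
clf = SVC(kernel="linear", C=1).fit(X, y)
print("training accuracy: %.3f" % clf.score(X, y))
```

A larger C would tolerate fewer training errors at the risk of overfitting; C = 1 is scikit-learn's default.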

Performance measurements

To obtain a highly robust model, we use ten-fold cross-validation in our experiments. Specifically, we randomly divide the experimental data into 10 equal parts. In each round, nine parts are used in turn for training and the remaining part for testing. We take the average of the 10 results as the estimate of model accuracy.
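The ten-fold protocol corresponds to the following sketch (synthetic data again; the shuffling seed is our assumption, since the paper does not report one):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           random_state=0)

# 10 folds: train on 9 parts, test on the held-out part, average the 10 scores
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="linear", C=1), X, y, cv=cv)
print("mean accuracy over 10 folds: %.3f" % scores.mean())
```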

A binary classification algorithm optimizes the parameters of a model and predicts whether a new sample belongs to the positive (P) or negative (N) group. The sizes of the positive and negative groups are denoted as P and N, respectively. A positive sample counts as a true positive if predicted positive and as a false negative if predicted negative; a negative sample counts as a false positive if predicted positive and as a true negative if predicted negative. The numbers of true positives, false negatives, false positives, and true negatives are denoted as TP, FN, FP, and TN, respectively. The binary classification performance is evaluated by the following measurements, as in [55]. This study defines recall (Rec) as the percentage of correctly predicted positive samples, i.e. Rec = TP/(TP+FN). The overall accuracy is defined as Acc = (TP+TN)/(TP+FN+TN+FP). The F1-score, also known as the F-measure or F-score, has been widely used to evaluate the performance of binary classification models [56]; it is defined as 2*(Pre*Rec)/(Pre+Rec), where precision is defined as Pre = TP/(TP+FP). In addition, the ROC and PRC curves reflect the relationship between the true positive rate and the false positive rate, and between precision and recall, respectively. They are often used as performance graphing methods in medical decision-making [57].
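These definitions translate directly into code; the helper below uses confusion counts chosen only for illustration.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F1-score from the four confusion counts."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    rec = tp / (tp + fn)            # Rec = TP / (TP + FN)
    pre = tp / (tp + fp)            # Pre = TP / (TP + FP)
    f1 = 2 * pre * rec / (pre + rec)
    return acc, rec, pre, f1

acc, rec, pre, f1 = binary_metrics(tp=80, fn=20, fp=10, tn=90)
print("Acc=%.3f Rec=%.3f Pre=%.3f F1=%.3f" % (acc, rec, pre, f1))
```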

Results

Study on the NEC cohort

Two datasets are created for analysis. Dataset 1 includes 447 patients, of whom 296 (66.22%) are positive for NEC (median gestational age 31.71 (30.00–34.00) [IQR] weeks) and 151 (33.78%) are classified as FI (median gestational age 31.71 (30.14–33.85) [IQR] weeks). Dataset 2 includes only the NEC group (n = 296), in which 91 patients (median gestational age 31.00 (28.86–33.71) [IQR] weeks) undergo surgery and 205 patients (median gestational age 32.00 (30.50–34.29) weeks) undergo conservative treatment. Each dataset consists of 119 variables, and the demographic factors, clinical characteristics, and laboratory results of each dataset are shown in Table 2.

Table 2. Main perinatal and clinical characteristics of two datasets.

Dataset 1 (n = 447) Dataset 2 (n = 296)
FI (n = 151) NEC (n = 296) Medical NEC (n = 205) Surgical NEC (n = 91)
perinatal characteristics
GA (median [IQR], weeks) 31.71[30.14–33.85] 31.71[30.00–34.00] 32.00[30.50–34.29] 31.00[28.86–33.71]
BW (median [IQR], g) 1660[1320–1920] 1600[1100–1790] 1660[1400–2100] 1450[1200–1850]
Female (n [%]) 48[47.1] 59[48] 91 [44.4] 42[46.2]
BW for GA
SGA (n [%]) 10[6.6] 41[13.9] 30[14.6] 11[12.1]
AGA (n [%]) 137[90.7] 250[84.5] 173[84.4] 77[84.6]
LGA (n [%]) 4[2.7] 5[1.6] 2[1.0] 3[3.3]
Vaginal delivery (n [%]) 72[47.7] 127[42.9] 81[39.5] 43[47.3]
Apgar 1-min (median [IQR]) 7[6–8] 7[6–8] 7[6–8] 7[5–8]
Apgar 5-min (median [IQR]) 8[8–9] 9[8–9] 9[8–9] 8[7–9]
PPROM (n [%]) 47[31.1] 98[33.1] 70[34.1] 28[30.8]
Corrected GA at clinical onset (median [IQR], weeks) 34.43[33.14–35.86] 34.07[32.61–35.86] 34.14[32.71–36.00] 34.00[32.29–35.71]
clinical characteristics
Early Use of Antibiotics 95[62.9] 172[58.1] 111[54.1] 61[67.0]
MV (n [%]) 73[48.3] 165[55.7] 94[45.9] 71[78.0]
PDA (n [%]) 93[61.6] 200[67.6] 136[66.3] 64[70.3]
IVH (n [%]) 62[41.1] 68[23.0] 38[18.5] 30[33.0]
Infectious diseases (n [%]) 60[39.7] 107[36.1] 64[31.2] 43[47.3]
Anemia-RBC transfusiona
Not anemia (n [%]) 94[63.6] 203[68.6] 166[81.0] 37[40.7]
Anemia-not transfusion (n [%]) 25[16.6] 28[9.5] 11[5.4] 17[18.6]
Anemia-transfusion (n [%]) 32[19.8] 65[21.9] 28[13.6] 37[40.7]
Feeding strategy
Type of milk
human milk (n [%]) 63[41.7] 54[18.2] 38[18.5] 16[17.6]
Formula milk (n [%]) 59[39.1] 158[53.4] 110[53.7] 48[52.7]
Combination (n [%]) 29[19.2] 84[28.4] 57[27.8] 27[29.7]
HMF 44[29.1] 29[9.8] 19[9.3] 10[11.0]
Enteral nutrition startb
Quick (n [%]) 124[82.1] 221[74.7] 160[78.0] 61[67.0]
Medium (n [%]) 25[16.6] 59[19.9] 32[15.6] 27[29.7]
Slow (n [%]) 2[1.3] 16[5.4] 13[6.4] 3[3.3]
daily milk incrementc
Quick (n [%]) 53[35.1] 73[24.7] 58[28.3] 15[16.5]
Slow (n [%]) 98[64.9] 223[75.3] 147[71.7] 76[83.5]
Probiotics 119[78.8] 124[41.9] 75[36.6] 49[53.8]
Clinical manifestations
Bowel sound attenuation 60[39.7] 182[61.5] 121[59.0] 61[67.0]
bloody stools 81[53.6] 105[35.5] 75[36.6] 30[33.0]
gastric residual 39[25.8] 141[47.6] 97[47.3] 44[48.4]
abdominal distension 55[36.4] 160[54.1] 91[44.4] 69[75.8]
laboratory parameters d
WBC at birth 8.87[2.48–44.36] 10.91[3.48–52.29] 11.12[3.48–39.46] 10.38[4.20–52.29]
NEUT% at birth 0.57[0.12–0.90] 0.58[0.06–0.93] 0.58[0.06–0.93] 0.57[0.15–0.84]
LY% at birth 0.32[0.08–0.80] 0.33[0.03–0.90] 0.34[0.03–0.90] 0.33[0.06–0.74]
MO% at birth 0.08[0.01–0.18] 0.07[0–0.22] 0.06[0.00–0.19] 0.07[0.00–0.22]
NEUT# at birth 4.78[0.5–32.4] 5.94[0.31–43.26] 6.04[0.31–35.00] 5.46[1.17–43.26]
LY# at birth 2.94[0.93–15.93] 3.4[0.7–30.67] 3.40[0.70–30.67] 3.46[0.79–29.90]
MO# at birth 0.67[0.05–5.90] 0.67[0–4.97] 0.62[0.00–3.40] 0.79[0.01–4.97]
RBC at birth 4.61[2.71–6.26] 4.57[2.54–6.13] 4.58[2.54–6.13] 4.43[3.07–5.92]
HGB at birth 172[99–237] 172[86–226] 173[86–226] 170[117–220]
HCT at birth 51.4[29.6–69.3] 51.4[29–69.3] 51.7[29.0–69.3] 50.4[33.6–67.6]
MCV at birth 111.9[98.6–129.3] 112.9[79.2–132.9] 112.4[97.0–209.0] 114.4[79.2–130.6]
MCH at birth 37.8[32.8–44.1] 37.9[15.6–44.7] 37.8[15.6–44.7] 38.0[25.8–43.6]
RDW at birth 16.45[13.9–21.1] 16.6[13.1–26.9] 16.7[13.1–26.9] 16.3[13.4–25.3]
PLT at birth 227[116–406] 219[42–509] 218[42–509] 220[69–460]
PCT at birth 0.23[0.11–0.41] 0.23[0.09–0.55] 0.23[0.09–0.55] 0.24[0.10–0.46]
MPV at birth 10.2[9.2–11.8] 10.7[8.5–13.0] 10.6[8.5–13.0] 11.0[8.9–12.4]
PDW at birth 11.1[9.3–14.6] 11.9[8.4–18.9] 11.8[8.4–18.9] 12.0[8.6–15.6]
WBC at clinical onset 9.71[3.44–25.37] 9.42[0.95–48.85] 9.72[2.07–48.85] 8.64[0.95–27.79]
NEUT% at clinical onset 0.41[0.12–0.84] 0.61[0.14–0.91] 0.60[0.14–0.88] 0.62[0.18–0.91]
LY% at clinical onset 0.43[0.10–0.73] 0.27[0.06–0.73] 0.27[0.06–0.71] 0.26[0.07–0.73]
MO% at clinical onset 0.09[0.01–0.24] 0.08[0–0.58] 0.08[0.00–0.58] 0.07[0.00–0.26]
NEUT# at clinical onset 3.90[0.94–16.76] 5.61[0.39–43.02] 5.77[0.51–43.02] 5.02[0.39–23.50]
LY# at clinical onset 3.97[0.72–8.62] 2.47[0.06–9.53] 2.57[0.21–7.86] 2.25[0.06–9.53]
MO# at clinical onset 0.86[0.04–3.37] 0.68[0.01–4.43] 0.74[0.01–4.43] 0.54[0.05–3.69]
RBC at clinical onset 3.71[2.29–5.50] 3.86[2.41–6.08] 3.87[2.50–6.08] 3.74[2.41–5.03]
HGB at clinical onset 125[77–180] 135[77–310] 136[77–310] 126[86–185]
HCT at clinical onset 37.4[23.0–51.4] 39.6[23.7–63.0] 40.3[23.7–63.0] 38.4[25.3–55.4]
MCV at clinical onset 101.75[83.80–113.20] 102.65[83.60–122.80] 103.3[85.1–122.8] 101.2[83.6–119.4]
MCH at clinical onset 34.6[28.1–39.7] 34.9[26.7–41.0] 35.3[26.7–41.0] 34.0[27.0–40.6]
RDW at clinical onset 15.9[13.2–20.8] 16.01[10.30–24.30] 16.0[10.4–24.3] 16.3[10.3–22.4]
PLT at clinical onset 317.5[105.0–823.0] 261.5[4.0–799.0] 257[5–609] 272[4–799]
PCT at clinical onset 0.36[0.15–0.85] 0.32[0.01–0.91] 0.31[0.11–0.68] 0.33[0.01–0.91]
MPV at clinical onset 11.2[9.2–13.2] 12[9–14] 11.96[9.50–14.00] 12[9–14]
PDW at clinical onset 13.0[8.9–20.3] 14.2[9.6–23.0] 14.2[9.8–23.0] 14.5[9.6–22.8]
WBC change 0.01[-0.72, 2.44] -0.12[-0.92, 5.82] -0.09[-0.92, 5.82] -0.18[-0.92, 2.53]
NEUT% change -0.28[-0.76, 6.00] 0.07[-0.83, 11.81] 0.05[-0.83, 11.81] 0.12[-0.78, 4.06]
LY% change 0.25[-0.82, 5.37] -0.19[-0.86, 11.96] -0.19[-0.86, 11.96] -0.19[-0.86, 7.10]
MO% change 0.17[-0.90, 17.82] 0.25[-1.00, 800.00] 0.33[-1.00, 800.00] 0.18[-1.00, 500.00]
NEUT# change -0.15[-0.92, 10.62] -0.12[-0.94, 35.09] -0.10[-0.94, 35.09] -0.15[-0.94, 4.98]
LY# change 0.25[-0.77, 6.16] -0.27[-0.98, 3.99] -0.24[-0.98, 2.81] -0.35[-0.97, 3.99]
MO# change 0.29[-0.95, 11.27] 0.03[-0.99, 6400.00] 0.14[-0.99, 6400.00] -0.20[-0.96, 73.27]
RBC change -0.18[-0.55, 0.31] -0.13[-0.42, 0.57] -0.13[-0.42, 0.49] -0.15[-0.42, 0.57]
HGB change -0.26[-0.63, 0.18] -0.20[-0.55, 1.40] -0.19[-0.48, 1.40] -0.25[-0.55, 0.33]
HCT change -0.25[-0.63, 0.15] -0.21[-0.50, 0.36] -0.20[-0.50, 0.36] -0.25[-0.50, 0.30]
MCV change -0.09[-0.27, -0.01] -0.08[-0.33, 0.23] -0.07[-0.49, 0.11] -0.10[-0.28, 0.23]
MCH change -0.07[-0.29, 0.03] -0.06[-0.33, 1.43] -0.06[-0.33, 1.43] -0.09[-0.31, 0.25]
RDW change -0.03[-0.21, 0.23] -0.04[-0.39, 0.47] -0.04[-0.39, 0.43] -0.01[-0.37, 0.47]
PLT change 0.46[-0.62, 2.38] 0.20[-0.97, 4.21] 0.18[-0.97, 4.21] 0.23[-0.97, 1.93]
PCT change 0.64[-0.35, 2.55] 0.40[-0.95, 2.33] 0.40[-0.65, 2.33] 0.41[-0.95, 2.00]
MPV change 0.09[-0.16, 0.25] 0.09[-0.14, 0.34] 0.09[-0.09, 0.34] 0.09[-0.10, 0.30]
PDW change 0.16[-0.18, 0.81] 0.20[-0.25, 0.97] -0.16[-0.45, 0.63] -0.14[-0.41, 0.53]

Abbreviations: BW, birth weight; FPIES, Food protein-induced enterocolitis; GA, gestational age; MV, mechanical ventilation; HMF, human milk fortifier; PPROM, Preterm premature rupture of membranes; PDA, patent ductus arteriosus; IVH, intraventricular hemorrhage; IQR, interquartile range; RBC, red blood cell.

aAnemia is determined based on hemoglobin concentration, days after birth, respiratory status, and clinical manifestations, following the recommendations of the Canadian Pediatric Society. The usual transfusion volume was 10 to 20 ml kg−1 over 3 to 5 h, and feeding volumes were routinely decreased during transfusions.

bSlow, never start or start later than postnatal day 4; Medium, start on postnatal day 3 or 4; Quick, start within postnatal day 2.

cSlow, the daily milk increment is less than 20 ml per kilogram of body weight until reaching full feeding volumes; Quick, more than 20 ml per kilogram of body weight.

dLaboratory value changes are the percentage change of each indicator at clinical onset relative to its value at birth.

Comparison with other feature selection algorithms

We evaluate our proposed feature selection algorithm, RQBSO, against three major groups of feature selection methods: two filter methods, Max-Relevance and Min-Redundancy (mRMR) [58] and ReliefF [59]; three wrapper methods, GA [39], BSO [44], and recursive feature elimination (RFE) [60]; and two leading embedded methods, LASSO [30] and ridge regression [61]. The key parameter settings of the RQBSO algorithm are shown in Table 3; the hyper-parameters of the other methods are detailed in S2 Table.

Table 3. Hyper-parameters used by RQBSO algorithm.

Parameter value
Ridge alphas 503.15
BSO flip 5
nBees 10
maxIteration 10
localIteration 10
Q-Learning γ 0.1
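The `flip` and `nBees` entries in Table 3 refer to the BSO neighborhood search: each bee explores a candidate feature subset obtained by flipping a few bits of a reference binary mask. The sketch below is illustrative (not the authors' code); the function name and the use of NumPy are our own assumptions.

```python
# Illustrative sketch of BSO-style neighborhood generation for feature
# selection: each bee's candidate mask differs from the reference mask in
# exactly `flip` randomly chosen feature bits (cf. flip=5, nBees=10 in Table 3).
import numpy as np

def neighborhood(ref_mask, n_bees=10, flip=5, rng=None):
    """Generate `n_bees` candidate feature masks around a reference solution."""
    rng = rng or np.random.default_rng()
    bees = []
    for _ in range(n_bees):
        cand = ref_mask.copy()
        idx = rng.choice(len(cand), size=flip, replace=False)
        cand[idx] ^= 1  # flip the selected bits (0 -> 1, 1 -> 0)
        bees.append(cand)
    return bees

ref = np.ones(50, dtype=int)  # start from "all features selected"
cands = neighborhood(ref, n_bees=10, flip=5, rng=np.random.default_rng(0))
print(len(cands), int((cands[0] != ref).sum()))  # 10 bees, 5 flipped bits each
```

Each candidate subset is then scored by the wrapped classifier, and in RQBSO the Q-learning component (with discount γ from Table 3) guides which regions of the search space the swarm revisits.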

Fig 5A–5D and Table 4 compare the RQBSO algorithm with the three groups of feature selection methods using ten-fold cross-validation. As shown in Fig 5A and 5B, RQBSO (orange curve) outperforms the other algorithms, with AUROC values of 94.20% and 91.85% on the two datasets, respectively. At the same FPR level, our method obtains a higher TPR, which is of great significance for the diagnosis and prognosis of NEC. The two filter methods (mRMR and ReliefF) perform poorly in AUROC because they fail to consider correlations between features. The PRC curves in Fig 5C and 5D confirm these results: RQBSO has the highest AUPRC values on both datasets, 97.42% and 84.61%, respectively.
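The evaluation protocol above can be reproduced in outline as follows. This is a sketch under stated assumptions, not the study's code: the data below are synthetic stand-ins for the NEC datasets, and the classifier settings are illustrative.

```python
# Ten-fold cross-validated AUROC and AUPRC for a linear SVM, mirroring the
# evaluation protocol described in the text (synthetic, imbalanced toy data).
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score, average_precision_score

X, y = make_classification(n_samples=300, n_features=30,
                           weights=[0.7, 0.3], random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Decision-function scores are sufficient for ranking-based metrics
# such as AUROC and AUPRC.
scores = cross_val_predict(LinearSVC(max_iter=10000), X, y, cv=cv,
                           method="decision_function")
print(f"AUROC={roc_auc_score(y, scores):.3f}",
      f"AUPRC={average_precision_score(y, scores):.3f}")
```

Because AUPRC depends on the class prevalence, the gap between the two datasets' AUPRC values (97.42% vs. 84.61%) partly reflects their different positive-class proportions.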

Fig 5. Comparison of ROC and PRC curve of RQBSO and other algorithms.

Fig 5

(a, b) show the ROC curves of dataset 1 and dataset 2; the numbers in parentheses indicate the AUROC values. The x-axis is 1-specificity, or false positive rate (FPR); the y-axis is sensitivity, or true positive rate (TPR). (c, d) show the PRC curves of dataset 1 and dataset 2; the numbers in parentheses indicate the AUPRC values. The x-axis is recall; the y-axis is precision.

Table 4. The performance comparison of different feature selection models.

RQBSO mRMR ReliefF GA BSO RFE LASSO Ridge
Dataset 1
Acc (%) 91.07 82.88 82.87 85.43 85.72 86.53 85.67 84.57
Rec (%) 96.94 86.26 88.02 89.86 94.27 89.82 89.39 89.39
Pre (%) 94.31 86.76 85.48 87.55 87.46 89.11 88.22 86.80
F1-Score (%) 92.90 86.36 86.69 88.57 89.03 89.41 88.75 88.00
Dataset 2
Acc (%) 84.37 75.40 76.76 75.53 81.36 75.02 77.19 73.37
Rec (%) 68.93 47.68 50.89 39.82 70.18 43.57 49.64 37.50
Pre (%) 93.33 70.95 68.71 70.05 89.31 65.12 72.25 65.82
F1-Score (%) 72.37 53.97 56.19 45.66 63.05 50.31 55.59 43.82

Table 4 shows that the classification accuracies on the NEC diagnosis and prognosis datasets are 91.07% and 84.37%, respectively, a clear margin over the other methods. In terms of precision, RQBSO exceeds 93% on both dataset 1 and dataset 2. Compared with the other feature selection algorithms, our accuracy and precision remain at a high level.

Feature importance analysis

We apply the RQBSO feature selection algorithm on dataset 1 and 2 to select the optimal feature set and calculate the final ranking of the selected features. The normalized importance scores of the selected features are presented in Tables 5 and 6.
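One plausible way to produce normalized importance scores like those in Tables 5 and 6 is to rank the selected features by the magnitude of their ridge regression coefficients and scale the magnitudes to sum to one. The paper does not state its exact scoring rule, so the sketch below is an assumption for illustration only.

```python
# Hedged sketch: normalized feature importance from |ridge coefficients|.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 8))
true_coef = np.array([2.0, -1.5, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0])
y = X @ true_coef + rng.normal(scale=0.1, size=150)

coef = Ridge(alpha=1.0).fit(X, y).coef_
importance = np.abs(coef) / np.abs(coef).sum()  # scores sum to 1
ranking = np.argsort(importance)[::-1]          # descending importance
print(ranking[:4])                              # strongest features first
```

Ties in such a scheme would explain why adjacent ranks in Table 5 (e.g. ranks 1 and 2) can share identical scores.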

Table 5. Feature importance ranking of dataset 1.

Rank Feature Importance score
1 Placenta abnormalities 0.041254
2 PDW at birth 0.041254
3 Type of milk 0.040842
4 LY% change 0.040842
5 Signs of peritoneal irritation 0.040429
6 Feeding volume at NEC onset 0.039604
7 Drowsiness 0.039191
8 NEUT% at clinical onset 0.039191
9 Meconium amniotic fluid 0.038366
10 Probiotics 0.037954
11 Early onset sepsis 0.036716
12 Acidosis 0.036716
13 HCT at clinical onset 0.036304
14 PDA 0.035891
15 Daily milk increment 0.034653
16 WBC change 0.034653
17 Gastric residual 0.034241
18 PS 0.031766
19 Inotropic 0.030941
20 Abdominal distension 0.030528
21 LY# change 0.030116
22 LY# at clinical onset 0.029703
23 MO# at birth 0.028878
24 MO% at birth 0.028053
25 DIC 0.027640
26 MCH at clinical onset 0.025578
27 NEUT# change 0.025165
28 LY% at birth 0.021865
29 Temperature instability 0.021040
30 Bloody stools 0.020627

Table 6. Feature importance ranking of dataset 2.

Rank Feature Importance score
1 Anemia-RBC transfusion 0.069979
2 Signs of peritoneal irritation 0.069979
3 Acidosis 0.069279
4 Tachycardia 0.068579
5 WBC change 0.068579
6 LY% at birth 0.066480
7 WBC at clinical onset 0.065780
8 Early onset sepsis 0.061582
9 Apgar 5-min 0.059482
10 PICC 0.052484
11 Total number of RBC transfusions 0.049685
12 MCH at clinical onset 0.049685
13 Postnatal age at clinical onset 0.047586
14 PLT change 0.045486
15 Caffeine 0.044787
16 Para 0.032190
17 NEUT# at clinical onset 0.029391
18 Fever 0.025192
19 MCV at clinical onset 0.023793

In the differential diagnosis of NEC, placental abnormalities and platelet distribution width (PDW) at birth are the two most important features, followed by type of milk, lymphocyte percentage (LY%) change, signs of peritoneal irritation, feeding volume at NEC onset, and drowsiness (Table 5). Overall, perinatal features account for 7.96% of the importance in the differential diagnosis of NEC, clinical features before clinical onset for 28.84%, clinical features at clinical onset for 25.04%, and laboratory parameters for 38.16%.

In the classification of NEC, anemia-RBC transfusion, signs of peritoneal irritation, acidosis, tachycardia, and white blood cell count (WBC) change are the top five most important features (Table 6). Overall, perinatal features account for 9.17% of NEC classification, clinical features before clinical onset account for 27.85%, clinical features at clinical onset account for 28.06%, and laboratory parameters account for 34.92%.
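The category percentages quoted in the two paragraphs above are group sums of the normalized importance scores. A toy illustration of that aggregation (values invented, not those of Tables 5 and 6):

```python
# Summing normalized importance scores within feature categories.
groups = {
    "perinatal": [0.05, 0.04],
    "clinical before onset": [0.15, 0.13],
    "clinical at onset": [0.16, 0.12],
    "laboratory": [0.20, 0.15],
}
total = sum(sum(v) for v in groups.values())
for name, vals in groups.items():
    print(f"{name}: {sum(vals) / total:.2%}")
```

In both datasets this breakdown puts laboratory parameters first, supporting the paper's point that routine blood-test variables deserve more attention than prior feature sets gave them.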

Comparison with other ML classifiers

In addition to the linear SVM, we evaluate three representative classification algorithms on the datasets. The k-nearest neighbor (KNN) algorithm is a distance-based classifier. The multi-layer perceptron (MLP) is one of the most widely used neural network models, a multilayer feedforward network. The random forest (RF) algorithm is an ensemble learning method consisting of multiple decision trees.

We use the four classifiers to classify the NEC datasets. Compared with the KNN, MLP, and RF methods, the linear SVM trains and classifies faster because it is a linear classifier well suited to high-dimensional features, and it also generalizes well. As shown in Fig 6, the linear SVM achieves AUROC values of 94.22% and 91.85% and AUPRC values of 97.43% and 85.36% on datasets 1 and 2, respectively. In contrast, the AUROC and AUPRC values of KNN, MLP, and RF are all lower. Therefore, the linear SVM holds a significant advantage over the other three classifiers in our experiments.
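A comparison scaffold for the four classifiers can be written in a few lines. This is a sketch on synthetic stand-in data with default or lightly tuned settings, not the study's configuration.

```python
# Cross-validated AUROC for the four classifiers discussed above.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=30, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
models = {
    "KNN": KNeighborsClassifier(),
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "linear SVM": LinearSVC(max_iter=10000),
}
for name, model in models.items():
    # roc_auc scoring uses decision_function or predict_proba as available.
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    print(f"{name:10s} AUROC={auc:.3f}")
```

On real high-dimensional clinical data, the relative ordering of the classifiers would of course depend on the feature set and sample size, which is the comparison Fig 6 reports.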

Fig 6. Comparison of ROC and PRC curve of different classifiers.

Fig 6

(a, b) show the ROC curves of dataset 1 and dataset 2; the numbers in parentheses indicate the AUROC values. The x-axis is 1-specificity, or false positive rate (FPR); the y-axis is sensitivity, or true positive rate (TPR). (c, d) show the PRC curves of dataset 1 and dataset 2; the numbers in parentheses indicate the AUPRC values. The x-axis is recall; the y-axis is precision.

Discussion

Predictive features

This study builds and tests a feature selection and classification algorithm that uses data available prior to disease onset for automatic diagnostic classification and NEC risk prediction. Using different ML-based classifiers trained and tested on different datasets, we obtain two general models with high accuracy and precision. In our multivariate feature selection algorithm, the previously described NEC parameters of higher WBC, signs of peritoneal irritation, and early clinical onset of NEC are significant weighted predictors of surgical NEC, while higher neutrophil percentage (NEUT%) at clinical onset, breast milk, and the use of probiotics are significant weighted predictors for identifying classic NEC [7, 14, 62]. In addition, we identify mean corpuscular hemoglobin (MCH) at clinical onset and anemia-RBC transfusion, both known risk factors for the development of NEC [63–65], as weighted predictors of surgical NEC. This suggests that our feature selection method identifies pathophysiologically important predictors of NEC diagnosis and prognosis. Previously unreported key variables predicting NEC, such as some routine blood-test parameters and their variations, should be brought to the attention of clinicians.

Strengths and limitations

One of the strengths of this study is the extensive collection of perinatal, clinical, and laboratory information, including topical issues in NEC in recent years such as anemia-RBC transfusion and feeding strategies, which allows a detailed assessment for predicting the diagnosis and severity of NEC. In addition, we propose the RQBSO feature selection algorithm, a hybrid strategy that combines machine learning with a swarm optimization algorithm. This algorithm achieves better feature selection results on both the NEC diagnosis and risk prediction datasets, and the average classification accuracy of RQBSO-filtered features is higher on both. Moreover, most of the features filtered by RQBSO are clinically significant, and these important weighted predictors deserve the attention of clinicians.

The present study has some limitations. Firstly, the number of extracted features is disproportionate to the size of the dataset, which may affect the performance of our ML classifiers; increasing the sample size would probably improve performance. Secondly, the Bell staging criteria used in this study provide a relatively poor description of bowel injury. Although we exclude possible confounding factors when separating medical NEC from surgical NEC, applying ML methods to classify datasets with poorly defined, non-discrete entities may be flawed. Finally, the lack of out-of-sample validation and the single-center retrospective design limit the generalizability of our models. We hope to validate them with future data from our NICU or other NICUs.

Conclusion

In this work, we propose a new feature selection framework, RQBSO, for the early diagnosis of NEC and the identification of high-risk infants. To evaluate its effectiveness, we conduct experiments on two skewed datasets for NEC differential diagnosis and risk prediction. On the differential diagnosis data, we obtain an average recognition accuracy of 91.07% and an AUROC of 94.20%; on the risk prediction data, the accuracy is 84.37% and the AUROC is 91.85%. These results show that the method achieves high recognition accuracy in both the differential diagnosis and the risk prediction of NEC. In addition, the method screens out new significantly weighted predictors that may enable earlier identification and more timely treatment.

In future work, we plan to apply our method to higher-dimensional datasets and perform deeper parameter tuning to investigate their impact on algorithm performance.

Supporting information

S1 Table. Description of different features.

(DOCX)

S2 Table. Hyper-parameters used by other algorithms.

(DOCX)

S1 Dataset

(ZIP)

Acknowledgments

We would like to thank colleagues at the Neonatology Department of the First Hospital of Jilin University for robust data support.

Data Availability

All relevant data are within the paper and its Supporting Information files.

Funding Statement

The authors received no specific funding for this work.

References

  • 1.Torrazza RM, Ukhanova M, Wang XY, Sharma R, Hudak ML, Neu J, et al. Intestinal Microbial Ecology and Environmental Factors Affecting Necrotizing Enterocolitis. PLoS One. 2013;8(12). doi: 10.1371/journal.pone.0083304 WOS:000329194700031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Neu J, Walker WA. Medical Progress: Necrotizing Enterocolitis. New England Journal of Medicine. 2011;364(3):255–64. doi: 10.1056/NEJMra1005408 WOS:000286383900010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Yee WH, Soraisham AS, Shah VS, Aziz K, Yoon W, Lee SK, et al. Incidence and Timing of Presentation of Necrotizing Enterocolitis in Preterm Infants. Pediatrics. 2012;129(2):E298–E304. doi: 10.1542/peds.2011-2022 WOS:000300395100007. [DOI] [PubMed] [Google Scholar]
  • 4.Sanchez JB, Kadrofske M. Necrotizing enterocolitis. Neurogastroenterology and Motility. 2019;31(3). doi: 10.1111/nmo.13569 WOS:000459504300018. [DOI] [PubMed] [Google Scholar]
  • 5.Kim JH, Sampath V, Canvasser J. Challenges in diagnosing necrotizing enterocolitis. Pediatric Research. 2020;88:16–20. doi: 10.1038/s41390-020-1090-4 WOS:000618528700004. [DOI] [PubMed] [Google Scholar]
  • 6.Rees CM, Pierro A, Eaton S. Neurodevelopmental outcomes of neonates with medically and surgically treated necrotizing enterocolitis. Archives of Disease in Childhood-Fetal and Neonatal Edition. 2007;92(3):F193–F8. doi: 10.1136/adc.2006.099929 WOS:000246069900012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Robinson JR, Rellinger EJ, Hatch LD, Weitkamp JH, Speck KE, Danko M, et al. Surgical necrotizing enterocolitis. Seminars in Perinatology. 2017;41(1):70–9. doi: 10.1053/j.semperi.2016.09.020 WOS:000395965400010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Matei A, Montalva L, Goodbaum A, Lauriti G, Zani A. Neurodevelopmental impairment in necrotising enterocolitis survivors: systematic review and meta-analysis. Archives of Disease in Childhood-Fetal and Neonatal Edition. 2020;105(4):F432–F9. doi: 10.1136/archdischild-2019-317830 WOS:000553105500017. [DOI] [PubMed] [Google Scholar]
  • 9.Bejnordi BE, Veta M, van Diest PJ, van Ginneken B, Karssemeijer N, Litjens G, et al. Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. Jama-Journal of the American Medical Association. 2017;318(22):2199–210. doi: 10.1001/jama.2017.14585 WOS:000417822700018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Mobadersany P, Yousefi S, Amgad M, Gutman DA, Barnholtz-Sloan JS, Vega JEV, et al. Predicting cancer outcomes from histology and genomics using convolutional networks. Proceedings of the National Academy of Sciences of the United States of America. 2018;115(13):E2970–E9. doi: 10.1073/pnas.1717139115 WOS:000428382400012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Becker AS, Marcon M, Ghafoor S, Wurnig MC, Frauenfelder T, Boss A. Deep Learning in Mammography Diagnostic Accuracy of a Multipurpose Image Analysis Software in the Detection of Breast Cancer. Investigative Radiology. 2017;52(7):434–40. doi: 10.1097/RLI.0000000000000358 WOS:000403234600007. [DOI] [PubMed] [Google Scholar]
  • 12.Rajkomar A, Dean J, Kohane I. Machine Learning in Medicine. New England Journal of Medicine. 2019;380(14):1347–58. doi: 10.1056/NEJMra1814259 WOS:000463386900011. [DOI] [PubMed] [Google Scholar]
  • 13.Oh SL, Hagiwara Y, Raghavendra U, Yuvaraj R, Arunkumar N, Murugappan M, et al. A deep learning approach for Parkinson’s disease diagnosis from EEG signals. Neural Computing & Applications. 2020;32(15):10927–33. doi: 10.1007/s00521-018-3689-5 WOS:000549646700010. [DOI] [Google Scholar]
  • 14.Ji J, Ling XFB, Zhao YZ, Hu ZK, Zheng XL, Xu ZN, et al. A Data-Driven Algorithm Integrating Clinical and Laboratory Features for the Diagnosis and Prognosis of Necrotizing Enterocolitis. PLoS One. 2014;9(2). doi: 10.1371/journal.pone.0089860 WOS:000332396200089. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Sylvester KG, Ling XFB, Liu GY, Kastenberg ZJ, Ji J, Hu ZK, et al. A novel urine peptide biomarker-based algorithm for the prognosis of necrotising enterocolitis in human infants. Gut. 2014;63(8):1284–92. doi: 10.1136/gutjnl-2013-305130 WOS:000339164200014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Pantalone JM, Liu S, Olaloye OO, Prochaska EC, Yanowitz T, Riley MM, et al. Gestational Age-Specific Complete Blood Count Signatures in Necrotizing Enterocolitis. Frontiers in Pediatrics. 2021;9. doi: 10.3389/fped.2021.604899 WOS:000627765400001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Lure AC, Du XS, Black EW, Irons R, Lemas DJ, Taylor JA, et al. Using machine learning analysis to assist in differentiating between necrotizing enterocolitis and spontaneous intestinal perforation: A novel predictive analytic tool. Journal of Pediatric Surgery. 2021;56(10):1703–10. doi: 10.1016/j.jpedsurg.2020.11.008 WOS:000712991700005. [DOI] [PubMed] [Google Scholar]
  • 18.Jaskari J, Myllarinen J, Leskinen M, Rad AB, Hollmen J, Andersson S, et al. Machine Learning Methods for Neonatal Mortality and Morbidity Classification. Ieee Access. 2020;8:123347–58. doi: 10.1109/access.2020.3006710 WOS:000553749700001. [DOI] [Google Scholar]
  • 19.Gao WJ, Pei YY, Liang HY, Lv JJ, Chen JL, Zhong W. Multimodal AI System for the Rapid Diagnosis and Surgical Prediction of Necrotizing Enterocolitis. Ieee Access. 2021;9:51050–64. doi: 10.1109/access.2021.3069191 WOS:000638393900001. [DOI] [Google Scholar]
  • 20.Khaire UM, Dhanalakshmi R. Stability of feature selection algorithm: A review. Journal of King Saud University-Computer and Information Sciences. 2022;34(4):1060–73. doi: 10.1016/j.jksuci.2019.06.012 WOS:000782989300003. [DOI] [Google Scholar]
  • 21.Min F, Hu QH, Zhu W. Feature selection with test cost constraint. International Journal of Approximate Reasoning. 2014;55(1):167–79. doi: 10.1016/j.ijar.2013.04.003 WOS:000329256700007. [DOI] [Google Scholar]
  • 22.Munirathinam DR, Ranganadhan M. A new improved filter-based feature selection model for high-dimensional data. Journal of Supercomputing. 2020;76(8):5745–62. doi: 10.1007/s11227-019-02975-7 WOS:000549632900005. [DOI] [Google Scholar]
  • 23.Thaseen IS, Kumar CA, Ahmad A. Integrated Intrusion Detection Model Using Chi-Square Feature Selection and Ensemble of Classifiers. Arabian Journal for Science and Engineering. 2019;44(4):3357–68. doi: 10.1007/s13369-018-3507-5 WOS:000462305100032. [DOI] [Google Scholar]
  • 24.Farahani G. Feature Selection Based on Cross-Correlation for the Intrusion Detection System. Security and Communication Networks. 2020;2020. doi: 10.1155/2020/8875404 WOS:000578263600007. [DOI] [Google Scholar]
  • 25.Cai YD, Huang T, Hu LL, Shi XH, Xie L, Li YX. Prediction of lysine ubiquitination with mRMR feature selection and analysis. Amino Acids. 2012;42(4):1387–95. doi: 10.1007/s00726-011-0835-0 WOS:000301181400030. [DOI] [PubMed] [Google Scholar]
  • 26.Gardeux V, Chelouah R, Wanderley MFB, Siarry P, Braga AP, Reyal F, et al. Computing molecular signatures as optima of a bi-objective function: method and application to prediction in oncogenomics. Cancer informatics. 2015;14:33–45. doi: 10.4137/CIN.S21111 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach Learn. 2002;46(1–3):389–422. doi: 10.1023/a:1012487302797. WOS:000171501800018. [DOI] [Google Scholar]
  • 28.Ma L, Li MC, Gao Y, Chen T, Ma XX, Qu LA. A Novel Wrapper Approach for Feature Selection in Object-Based Image Classification Using Polygon-Based Cross-Validation. Ieee Geoscience and Remote Sensing Letters. 2017;14(3):409–13. doi: 10.1109/lgrs.2016.2645710 WOS:000395908600027. [DOI] [Google Scholar]
  • 29.Yu L, Liu H. Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research. 2004;5:1205–24. WOS:000236328300001. [Google Scholar]
  • 30.Tibshirani R. Regression shrinkage and selection via the lasso: a retrospective. J R Stat Soc Ser B-Stat Methodol. 2011;73:273–82. doi: 10.1111/j.1467-9868.2011.00771.x WOS:000290575300001. [DOI] [Google Scholar]
  • 31.Zhang SC, Cheng DB, Hu RY, Deng ZY. Supervised feature selection algorithm via discriminative ridge regression. World Wide Web-Internet and Web Information Systems. 2018;21(6):1545–62. doi: 10.1007/s11280-017-0502-9 WOS:000449485200007. [DOI] [Google Scholar]
  • 32.Yang MS, Ali W. Fuzzy Gaussian Lasso clustering with application to cancer data. Mathematical Biosciences and Engineering. 2020;17(1):250–65. doi: 10.3934/mbe.2020014 WOS:000495897300014. [DOI] [PubMed] [Google Scholar]
  • 33.Chen SB, Zhang YM, Ding CHQ, Zhang J, Luo B. Extended adaptive Lasso for multi-class and multi-label feature selection. Knowledge-Based Systems. 2019;173:28–36. doi: 10.1016/j.knosys.2019.02.021 WOS:000465056400003. [DOI] [Google Scholar]
  • 34.Xia JN, Sun DY, Xiao F, editors. Summary of LASSO and related methods. 13th International Conference on Enterprise Information Systems (ICEIS 2011); 2011 Jun 08–11; Beijing, China; 2011.
  • 35.Agrawal P, Abutarboush HF, Ganesh T, Mohamed AW. Metaheuristic Algorithms on Feature Selection: A Survey of One Decade of Research (2009–2019). IEEE Access. 2021;9:26766–91. doi: 10.1109/access.2021.3056407 WOS:000619305900001. [DOI] [Google Scholar]
  • 36.Ge H, Hu TL, editors. Genetic algorithm for feature selection with mutual information. 7th International Symposium on Computational Intelligence and Design (ISCID); 2014 Dec 13–14; Hangzhou, China; 2014.
  • 37.Bu HL, Zheng SZ, Xia J, editors. Genetic algorithm based semi-feature selection method. International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing; 2009 Aug 03–05; Shanghai, China; 2009.
  • 38.Lee J, Jang H, Ha S, Yoon Y. Android Malware Detection Using Machine Learning with Feature Selection Based on the Genetic Algorithm. Mathematics. 2021;9(21). doi: 10.3390/math9212813 WOS:000720038500001. [DOI] [Google Scholar]
  • 39.Park J, Park MW, Kim DW, Lee J. Multi-Population Genetic Algorithm for Multilabel Feature Selection Based on Label Complementary Communication. Entropy. 2020;22(8). doi: 10.3390/e22080876 WOS:000564072200001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Peng HJ, Ying C, Tan SH, Hu B, Sun ZX. An Improved Feature Selection Algorithm Based on Ant Colony Optimization. Ieee Access. 2018;6:69203–9. doi: 10.1109/access.2018.2879583 WOS:000452597700001. [DOI] [Google Scholar]
  • 41.Xu J, Li GY. A Two-Stage Improved Ant Colony Optimization Based Feature Selection for Web Classification. International Journal of Innovative Computing Information and Control. 2016;12(6):1851–63. WOS:000406146300007. [Google Scholar]
  • 42.Shaheen H, Agarwal S, Ranjan P, editors. MinMaxScaler binary PSO for feature selection. 1st International Conference on Sustainable Technologies for Computational Intelligence (ICTSCI); 2019 Mar 29–30; Jaipur, India; 2020.
  • 43.Wu Q, Ma ZP, Fan J, Xu G, Shen YF. A Feature Selection Method Based on Hybrid Improved Binary Quantum Particle Swarm Optimization. IEEE Access. 2019;7:80588–601. doi: 10.1109/access.2019.2919956 WOS:000474607500001. [DOI] [Google Scholar]
  • 44.Sadeg S, Hamdad L, Benatchba K, Habbas Z, editors. BSO-FS: Bee Swarm Optimization for Feature Selection in Classification. 13th International Work-Conference on Artificial Neural Networks (IWANN); 2015 Jun 10–12; Palma de Mallorca, Spain; 2015.
  • 45.Sadeg S, Hamdad L, Remache AR, Karech MN, Benatchba K, Habbas Z, editors. QBSO-FS: A Reinforcement Learning Based Bee Swarm Optimization Metaheuristic for Feature Selection. 15th International Work-Conference on Artificial Neural Networks (IWANN); 2019 Jun 12–14; Spain; 2019.
  • 46.Calvet L, de Armas J, Masip D, Juan AA. Learnheuristics: hybridizing metaheuristics with machine learning for optimization with dynamic inputs. Open Mathematics. 2017;15:261–80. doi: 10.1515/math-2017-0029 WOS:000404637700001. [DOI] [Google Scholar]
  • 47.Talbi EG. Combining metaheuristics with mathematical programming, constraint programming and machine learning. Annals of Operations Research. 2016;240(1):171–215. doi: 10.1007/s10479-015-2034-y WOS:000376301200008. [DOI] [Google Scholar]
  • 48.Wauters T, Verbeeck K, De Causmaecker P, Berghe GV. Boosting metaheuristic search using reinforcement learning. Hybrid Metaheuristics: Springer; 2013. p. 433–52. [Google Scholar]
  • 49.Moore TA, Wilson ME. Feeding intolerance: a concept analysis. Advances in neonatal care: official journal of the National Association of Neonatal Nurses. 2011;11(3):149–54. doi: 10.1097/ANC.0b013e31821ba28e . [DOI] [PubMed] [Google Scholar]
  • 50.Qi YH, Liu C, Zhong X, Ma XL, Zhou J, Shi Y, et al. IL-27 as a potential biomarker for distinguishing between necrotising enterocolitis and highly suspected early-onset food protein-induced enterocolitis syndrome with abdominal gas signs. Ebiomedicine. 2021;72. doi: 10.1016/j.ebiom.2021.103607 WOS:000710193200001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Le VT, Klebanoff MA, Talavera MM, Slaughter JL. Transient effects of transfusion and feeding advances (volumetric and caloric) on necrotizing enterocolitis development: A case-crossover study. PLoS One. 2017;12(6). doi: 10.1371/journal.pone.0179724 WOS:000404046100046. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Lambert DK, Christensen RD, Baer VL, Henry E, Gordon PV, Besner GE, et al. Fulminant necrotizing enterocolitis in a multihospital healthcare system. Journal of Perinatology. 2012;32(3):194–8. doi: 10.1038/jp.2011.61 WOS:000300875000005. [DOI] [PubMed] [Google Scholar]
  • 53.Cheadle C, Vawter MP, Freed WJ, Becker KG. Analysis of microarray data using Z score transformation. Journal of Molecular Diagnostics. 2003;5(2):73–81. doi: 10.1016/s1525-1578(10)60455-2 WOS:000182492800002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research. 2008;9:1871–4. WOS:000262636800009. [Google Scholar]
  • 55.Guo P, Luo YX, Mai GQ, Zhang M, Wang GQ, Zhao MM, et al. Gene expression profile based classification models of psoriasis. Genomics. 2014;103(1):48–55. doi: 10.1016/j.ygeno.2013.11.001 WOS:000333512800006. [DOI] [PubMed] [Google Scholar]
  • 56.Lipton ZC, Elkan C, Naryanaswamy B. Optimal Thresholding of Classifiers to Maximize F1 Measure. Machine learning and knowledge discovery in databases: European Conference, ECML PKDD: proceedings ECML PKDD (Conference). 2014;8725:225–39. MEDLINE:26023687. doi: 10.1007/978-3-662-44851-9_15 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters. 2006;27(8):861–74. doi: 10.1016/j.patrec.2005.10.010 WOS:000237462800002. [DOI] [Google Scholar]
  • 58.Peng HC, Long FH, Ding C. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. Ieee Transactions on Pattern Analysis and Machine Intelligence. 2005;27(8):1226–38. doi: 10.1109/TPAMI.2005.159 WOS:000229700900004. [DOI] [PubMed] [Google Scholar]
  • 59.Robnik-Sikonja M, Kononenko I. Theoretical and empirical analysis of ReliefF and RReliefF. Mach Learn. 2003;53(1–2):23–69. doi: 10.1023/a:1025667309714. WOS:000185138700002. [DOI] [Google Scholar]
  • 60.Duan KB, Rajapakse JC, Wang HY, Azuaje F. Multiple SVM-RFE for gene selection in cancer classification with expression data. Ieee Transactions on Nanobioscience. 2005;4(3):228–34. doi: 10.1109/tnb.2005.853657 WOS:000231695900005. [DOI] [PubMed] [Google Scholar]
  • 61.Hoerl AE, Kennard RW. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics. 1970;12(1):55–67. doi: 10.1080/00401706.1970.10488634 WOS:A1970F898700005. [DOI] [Google Scholar]
  • 62.el Hassani SE, Niemarkt HJ, Derikx JPM, Berkhout DJC, Ballon AE, de Graaf M, et al. Predictive factors for surgical treatment in preterm neonates with necrotizing enterocolitis: a multicenter case-control study. European Journal of Pediatrics. 2021;180(2):617–25. doi: 10.1007/s00431-020-03892-1 WOS:000595384200001. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Maheshwari A, Patel RM, Christensen RD. Anemia, red blood cell transfusions, and necrotizing enterocolitis. Seminars in Pediatric Surgery. 2018;27(1):47–51. doi: 10.1053/j.sempedsurg.2017.11.009 WOS:000423139600009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Ozcan B, Aydemir O, Isik DU, Bas AY, Demirel N. Severe Anemia Is Associated with Intestinal Injury in Preterm Neonates. American Journal of Perinatology. 2020;37(6):603–6. doi: 10.1055/s-0039-1683982 WOS:000529912700007. [DOI] [PubMed] [Google Scholar]
  • 65.Martini S, Spada C, Aceti A, Rucci P, Gibertoni D, Battistini B, et al. Red blood cell transfusions alter splanchnic oxygenation response to enteral feeding in preterm infants: an observational pilot study. Transfusion. 2020;60(8):1669–75. doi: 10.1111/trf.15821 WOS:000529678100001. [DOI] [PubMed] [Google Scholar]

Decision Letter 0

Vijayalakshmi Kakulapati

6 Jun 2022

PONE-D-22-08369: Predicting the diagnosis and prognosis of necrotizing enterocolitis using a novel feature selection framework. PLOS ONE.

Dear Dr. Li,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Jul 21 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Vijayalakshmi Kakulapati, Ph.D

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf  and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Please provide additional details regarding participant consent. In the ethics statement in the Methods and online submission information, please ensure that you have specified (1) whether consent was informed and (2) what type you obtained (for instance, written or verbal, and if verbal, how it was documented and witnessed). If your study included minors, state whether you obtained consent from parents or guardians. If the need for consent was waived by the ethics committee, please include this information.

If you are reporting a retrospective study of medical records or archived samples, please ensure that you have discussed whether all data were fully anonymized before you accessed them and/or whether the IRB or ethics committee waived the requirement for informed consent. If patients provided informed written consent to have data from their medical records used in research, please include this information.

4. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ

5. Thank you for stating the following financial disclosure:

“The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

At this time, please address the following queries:

a)        Please clarify the sources of funding (financial or material support) for your study. List the grants or organizations that supported your study, including funding received from your institution.

b)        State what role the funders took in the study. If the funders had no role in your study, please state: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

c)        If any authors received a salary from any of your funders, please state which authors and which funders.

d)        If you did not receive any funding for this study, please state: “The authors received no specific funding for this work.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf. 


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I Don't Know

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 1. Line 1: Based on the content of the work, I would rather suggest the author coin the topic in line with the modified algorithm RQBSO. Aside from this, the work will expose the novelty, not the topic. However, I suggest something like “Framework for feature selection of predicting the diagnosis and prognosis of necrotizing enterocolitis”

2. Line 23-26: list some potential influencing factors ignored

3. Line 29—31: What is RQBSO? NEC? And does linear support vector machine translate to SVM?

4. Line 47: I’m not aware of this type of citation style

5. Line 54: Check the sentence especially the use of the word “mine”

6. Line 74: Take abbreviations outside the table as a note

7. Line 82: Second or secondly

8. Line 84: kindly justify why the need to improve prediction requires inclusion of more features

9. Line 91-94: consider rewriting into smaller sentences

10. Line 112: This suggests the main focus of this work; the abbreviations need to be checked, especially how lines 113-114 translate to RQBSO

11. Line 127: from this stage the headings and heading numbering are defective. Check the heading style and heading numbering of the journal. Specifically, the headings at line 127 and line 177 should go together, and each description of information in them must flow and be justified

12. Line 149: The patient characteristics should have gone with line 128, Patient and data sets

13. Line 155-156. This covers about four pages of data. These data shouldn’t have come under materials used or patient characteristics, but rather under data presentation or analysis. If they should be here, kindly justify

14. Line 178-179: the proposed method includes three steps, yet only two steps are mentioned and the headings that follow are more than three. Coordinate and arrange your materials to follow the proposed steps

15. Line 182 onward contains mathematical expressions whose characters and formatting impair the quality of the information in the equations. For example, in line 188, the multiplication sign is easily confused with the other x’s defined in the equation

16. Line 194: Reconcile eqt2 because there will be a problem if it is substituted into eqt1

17. Line 202: check eqt3

18. Line 211: in this work we use the RQBSO algorithm……., if RQBSO is the main focus, it must have been described before this point

19. Line 230: delete Eq.(5)

20. Line 223-241: You need to utilize software that can improve your mathematical writings to make sense, this applies to all equations in the work

21. Line 250, 268 and 298: headings

22. Line 312: the entirety of this section should have been devoted to the development and justification of RQBSO, where the figure will now be a pictorial representation of the proposed method. As it stands now, this section is the pillar of the work and needs to be strengthened

23. Lines 314, 319, 329, 336, 341, 347 should go to the introduction or a review of relevant tools, not where the main work is being discussed. They can only be mentioned to justify their use

24. Line 353-356: deliberate action is needed to cite and justify these equations

25. Line 358: Evaluation cannot be done under the heading Results and Discussion. Either the heading or the content is faulty

26. Line 357 and line 441: reconcile

27. Line 470: Here, we can simply be ……’’In this work, a novel………’’

28. Line 478: the word …most… must be made specific by mentioning the existing ……

General Comment

My experience of going through this work suggests to me that the author has lots of material to present to justify publication. Unfortunately, the material was not well organized and presented. The language of communication is good, but research writing and presentation are lacking, which hinders the flow of communication. Aside from this:

1. Style and formatting

2. Heading and heading numbering

3. Citation and referencing style (I don’t know if this is the journal style)

4. Probably due to submission, I suggest all parts are put together before release for reviewing

Reviewer #2: The manuscript has a very good and sufficient introduction, and all the analysis and calculations are made in a good manner. The figures are in good resolution and all the references are put in order of date. The manuscript has been accepted from my side.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: MKA Abdulrahman

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Aug 19;17(8):e0273383. doi: 10.1371/journal.pone.0273383.r002

Author response to Decision Letter 0


20 Jul 2022

Dear Editor and Reviewers:

Thank you for your letter and for the reviewers’ comments concerning our manuscript entitled “Predicting the diagnosis and prognosis of necrotizing enterocolitis using a novel feature selection framework” (Manuscript Number: PONE-D-22-08369). We have studied the comments carefully, and they have all been valuable and very helpful for revising and improving our paper. We have revised the article based on the recommendations of the reviewers and the revised portions are marked in red in the paper. We hope that the revision is acceptable and look forward to hearing from you.

Review Comments to the Author

Reviewer #1:

Comment 1: Line 1: Based on the content of the work, I would rather suggest the author coin the topic in line with the modified algorithm RQBSO. Aside from this, the work will expose the novelty, not the topic. However, I suggest something like “Framework for feature selection of predicting the diagnosis and prognosis of necrotizing enterocolitis”.

Reply 1: Thank you for your comment. As you said, the focus of our work is to propose the RQBSO algorithm and use it to predict the diagnosis and prognosis of NEC, as well as to uncover some potential impact factors. At the same time, our work should highlight its novelty rather than just its topic. Therefore, using “Framework for feature selection of predicting the diagnosis and prognosis of necrotizing enterocolitis” as the title can better highlight the focus of our work. We have revised it according to your suggestion.

Actual changes:

• Changing the title of Line 1.

Comment 2: Line 23-26: list some potential influencing factors ignored.

Reply 2: Thank you for your comment. Previous studies have focused on perinatal characteristics (gestational age, birth weight, etc.) and clinical characteristics (blood and stool, mechanical ventilation, etc.) of patients while neglecting some laboratory parameters such as white blood cell count, lymphocyte percentage, and mean platelet volume. In addition, our study incorporates recent topical issues such as anemia-RBC transfusion and feeding strategies, allowing for detailed assessment of possible variables predictive of the diagnosis and severity of NEC.

Actual changes:

• The influencing factors ignored by previous studies are added after Line 26.

Comment 3: Line 29—31: What is RQBSO? NEC? And does linear support vector machine translate to SVM?

Reply 3: Thank you for your comment. I apologize that the explanation of the RQBSO algorithm and linear SVM in the previous manuscript raised questions for you. The RQBSO algorithm means a ridge regression and Q-learning strategy based bee swarm optimization metaheuristic algorithm. NEC is a worldwide pediatric disease and a major source of neonatal morbidity and mortality. There are some differences between linear SVM and SVM. More precisely, the SVM module is a wrapper for the libsvm library and supports different kernel functions (linear kernel, Gaussian kernel, Laplace kernel, etc.), while linear SVM is based on liblinear and supports only linear kernel functions. In the revised version, I correct the writing of RQBSO and linear SVM so that you can understand it better.

Actual changes:

• Correcting the definition of RQBSO in Line 29.

• Correcting the definition of linear SVM in Line 31.

• The definition of NEC is described in detail in Line 21.
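For concreteness, the libsvm/liblinear distinction described above can be sketched in code. scikit-learn is assumed here purely for illustration (the manuscript does not name its software): its `SVC` wraps libsvm and supports several kernel functions, while `LinearSVC` wraps liblinear and supports only a linear decision function.

```python
# Illustrative sketch only; the manuscript's actual implementation may differ.
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# SVC wraps libsvm: linear, Gaussian (RBF), and other kernels are available.
kernel_svm = SVC(kernel="rbf").fit(X, y)

# LinearSVC wraps liblinear: only a linear kernel, but it scales better
# to high-dimensional feature vectors.
linear_svm = LinearSVC(dual=False, max_iter=10000).fit(X, y)

print(linear_svm.score(X, y))
```

With the high-dimensional feature vectors produced after feature selection, the liblinear-based variant is typically the faster choice.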

Comment 4: Line 47: I’m not aware this type of citation style.

Reply 4: Thank you for your comment. The formatting requirements for references in PLOS ONE journals are as follows:

1. References should be listed after the main text, before the supporting information.

2. References with more than six authors should list the first six author names, followed by “et al.”

3. References should be formatted according to the NLM/ICMJE style: https://www.nlm.nih.gov/bsd/uniform_requirements.html.

In the references of the previous manuscript, I used the NLM style supported by the PLOS ONE journal. It is possible that this style is used by fewer authors, so in the revised manuscript I change the citation style of the references to the ICMJE style supported by the PLOS ONE journal.

Actual changes:

• Revising the format of references in Lines 491-720.

Comment 5: Line 54: Check the sentence especially the use of the word “mine”.

Reply 5: Thank you for your comment. In the previous manuscript, the original meaning of the sentence was that the diagnosis of NEC in clinical medicine currently suffers from the difficulty of early diagnosis and the lack of reliable biomarkers. Therefore, there is an urgent need to develop effective NEC diagnostic models to quickly and accurately identify relevant information affecting the diagnosis and prognosis of NEC, thus enabling a more accurate diagnosis. Therefore, the word “mine” is not used appropriately. In the revised manuscript, I replace the word “mine” and phrase the sentence in more detail.

Actual changes:

• Changing the word "mine" in Line 54, and phrasing the sentence in more detail.

Comment 6: Line 74: Take abbreviation outside the table as note.

Reply 6: Thank you for your suggestion. In the previous manuscript, we did not annotate the abbreviations for the classifiers in Table 1, but gave the full name of each classifier at its first occurrence in the table. That is not quite the norm. In the revised manuscript, instead of writing the full names of the classifiers in the table, we use abbreviations and define them in a note outside the table.

Actual changes:

• Changing the full name of the classifier to an abbreviation in Table 1 in Line 74, and adding a comment outside the table.

Comment 7: Line 82: Second or secondly.

Reply 7: Thank you for your comment. There is a small syntax problem here, and secondly should be used instead of second.

Actual changes:

• Changing the word "second" to "secondly" in Line 82, and checking the paragraph thoroughly.

Comment 8: Line 84: kindly justify why the need to improve prediction requires inclusion of more features.

Reply 8: Thank you for your comment. After careful examination, I found some problems with the description of this sentence. Previous studies have the following problems: First, they mostly used statistical methods to select features, which may ignore the correlation between features. Secondly, most of the researchers selected a small number of features, which may ignore features that are highly correlated with predicted outcomes. Therefore, in order to fully consider the association of unknown features and to explore the potential influencing factors associated with NEC diagnosis and disease severity stratification, we need to include more features for study. It is not certain that the inclusion of more features will improve prediction. Therefore, in the revised manuscript, I will remove the statement of improved prediction.

Actual changes:

• Removing the statement of improved prediction in Line 84.

Comment 9: Line 91-94: consider review into smaller sentences.

Reply 9: Thank you for your suggestion. In the previous manuscript, I used long sentences to describe the filter method and did not show the advantages and disadvantages of the method well. As a result, it may be relatively difficult to understand. In the revised manuscript, I rewrite the long sentences into shorter sentences and describe the advantages and disadvantages of the method so that the method can be better understood.

Actual changes:

• Changing long sentences to short sentences in Lines 91-94 and thoroughly checking the paragraph.

Comment 10: Line 112: This suggest the main focus of this work, the abbreviations need to be check especially how line 113-114 translate to RQBSO.

Reply 10: Thank you for your comment. I apologize that the abbreviation of the RQBSO algorithm in the previous manuscript raised questions for you. In the revised version, I redefine the abbreviation of the RQBSO algorithm. The RQBSO algorithm means a ridge regression and Q-learning strategy based bee swarm optimization metaheuristic algorithm. The ridge regression algorithm allows filtering out irrelevant features while considering the correlation between features. Thus, the ridge regression algorithm will help filter out irrelevant variables and improve the efficiency of the metaheuristic algorithm search. Using the Q-learning strategy based bee swarm optimization metaheuristic algorithm, adaptive learning can be performed during the search for feature subsets to obtain the optimal feature subset.

Actual changes:

• Redefining the abbreviation of the RQBSO algorithm in Line 112.
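The ridge-based pre-filtering step described in this reply can be sketched in a few lines. This is a minimal illustration, not the manuscript's implementation: the scikit-learn API, the `alpha` value, and the coefficient-magnitude threshold are all assumptions, and the subsequent Q-learning BSO search over the retained features is omitted.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

def ridge_prefilter(X, y, alpha=1.0, threshold=0.05):
    """Keep features whose standardized ridge coefficient magnitude exceeds
    `threshold`; the rest are treated as irrelevant and dropped before the
    metaheuristic search (hypothetical helper for illustration)."""
    Xs = StandardScaler().fit_transform(X)
    coef = Ridge(alpha=alpha).fit(Xs, y).coef_
    return np.flatnonzero(np.abs(coef) > threshold)

# Toy example: only the first two features actually drive the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = 2 * X[:, 0] - 3 * X[:, 1] + 0.1 * rng.normal(size=300)
print(ridge_prefilter(X, y))  # indices of the retained features
```

Because ridge regression shrinks correlated coefficients jointly rather than zeroing one of them arbitrarily, this pre-filter can discard irrelevant variables while still respecting correlations between features, shrinking the search space for the swarm optimizer.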

Comment 11: Line 127: from this stage the heading and heading numbering are defective. Check the heading style and heading numbering of the journal. Specifically, heading line 127 and line 177 should go together and each description of information in it must flow and justified.

Reply 11: Thank you for your suggestion. In the original draft, I wrote Materials and Methods in two separate sections. As you said, the headings and the numbering of the headings are somewhat flawed and hinder the reader's reading and understanding. In the revised version, I integrate Materials and Methods together. Specifically, in Materials, I first present the ethical instructions and patient exclusion criteria, then describe the data collection process and give a detailed description of the relevant medical terminology. In Methods, I first introduce the entire experimental procedure, then provide a detailed description of each process in the experimental procedure. Finally, I give the performance evaluation metrics used in the study.

Actual changes:

• Combining the headings of Lines 127 and 177, and detailing the datasets used and the methodology in that section.

Comment 12: Line 149: The patient characteristics should have gone with line 128, Patient and data sets.

Reply 12: Thank you for your comment. Both sections Patients and datasets and Patient characteristics provide patient-specific information, including patient exclusion and inclusion criteria, descriptions of relevant medical terminology, and personal information (characteristics) of the patient, and should therefore be integrated. In the revised version, I describe Patients and datasets in more detail, including the ethical statement, the inclusion and exclusion criteria for patients, the data collection process, and the introduction of relevant medical terminology. The patient characteristics are also statistically analyzed and then shown in the data analysis.

Actual changes:

• Combining the content of Line 128 and Line 149 and rewriting the datasets section.

Comment 13: Line 155-156. This covers about four pages of data. These data shouldn’t have come under material used or patient characteristics rather data presentation or during analysis. If it should be here kindly justify.

Reply 13: Thank you for your comment. You are correct. In the previous manuscript, I used statistical analysis to provide a thorough description of the patients' clinical characteristics, specifically their perinatal characteristics, clinical features, and laboratory parameters. The categorical variables were summarized as counts and percentages, whereas the non-normally distributed continuous variables were summarized as medians and quartiles. Routine blood test results are analyzed as percentage change at the onset of NEC compared to birth. This part does not belong under the materials used, but rather under data presentation or analysis. In the revised manuscript, I put this part of the statistical analysis into the Study on the NEC cohort under Results. This section focuses on the clinical characteristics of the patients.

Actual changes:

• The contents of Table 2 in Lines 155-156 are placed in the "NEC Cohort Study" under the "Results" section, and the statistical analysis process is described in more detail in this section.

Comment 14: Line 178-179: the proposed method include three steps-two steps mentioned and the heading that follows are more than three. Coordinate and arrange your materials to follow the proposed steps.

Reply 14: Thank you for your comments. In the previous manuscript, I proposed a framework consisting of three steps of data pre-processing, feature selection and classification, but the method was followed by more than three subheadings (K-fold cross-validation and Evaluation indicators were added). Therefore, in the revised version, I made the following changes: k-fold cross-validation and Evaluation were integrated into a secondary heading after Datasets and Methods in order to make the reader better understand the core content of the article.

Actual changes:

• The subheadings of the proposed method in Lines 178-179 are revised and the method is described in more detail.

• The “K-fold cross-validation” section in Line 341 is integrated with the “Evaluation indicators” section in Line 347 and placed in the section after “Datasets” and “Methods”.

Comment 15: Line 182 onward contained mathematical expressions which the character and formatting impair the quality of information in the equation. For example, in line 188, the multiplication sign is not far from the other x’s define in the equation.

Reply 15: Thank you for your suggestion. In the previous manuscript, I was not sufficiently rigorous in writing certain mathematical expressions, resulting in characters and formatting of certain expressions that may compromise the quality of the information in the equations. Therefore, in the revised manuscript, I check all the mathematical expressions in the article and rewrite all the equations. For example, to address the problem that the multiplication sign in line 188 in the original draft was easily confused with other x's in the equation, I change the multiplication sign to "*" to better distinguish them.

Actual changes:

• The mathematical expressions after Line 182 are thoroughly checked and all equations are rewritten.

Comment 16: Line 194: Reconcile eqt2 because there will be a problem if it is substituted into eqt1.

Reply 16: Thank you for your comment. Eqt1 and eqt2 are formulas for missing value filling using the K-nearest neighbor algorithm, which works by estimating the missing values from the feature values of the k nearest neighbor samples and filling them in. In the previous manuscript, I was not clear enough about the formulation, thus causing confusion for you. In the revised manuscript, I describe the k-nearest neighbor algorithm in more detail and adjust eqt1 and eqt2. First, we use the reciprocal of the Euclidean distance between two samples as the filling weight (eqt1). Then we calculate the Euclidean distance between the sample with missing values and the other samples to determine the nearest neighbor samples of that sample. Finally, the weighted average of the feature values of the nearest neighbor samples is used as the estimate of the missing values (eqt2).

Actual changes:

• The eqt1 in Line 188 and eqt2 in Line 194 are rewritten and the k-nearest neighbor algorithm is described in more detail.
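A simplified sketch of this inverse-distance-weighted filling scheme follows. It is an illustration under stated assumptions, not the manuscript's exact eqt1/eqt2: complete rows are taken as donors, distances are computed on the columns observed in the incomplete row, and the function name is hypothetical.

```python
import numpy as np

def knn_impute_continuous(X, k=3):
    """Fill NaNs with an inverse-distance-weighted average over the k
    nearest complete rows (Euclidean distance on the observed columns)."""
    X = X.astype(float).copy()
    complete = X[~np.isnan(X).any(axis=1)]            # donor rows
    for i in np.flatnonzero(np.isnan(X).any(axis=1)):
        obs = ~np.isnan(X[i])                          # observed columns of row i
        d = np.sqrt(((complete[:, obs] - X[i, obs]) ** 2).sum(axis=1))
        nn = np.argsort(d)[:k]                         # k nearest donors
        w = 1.0 / (d[nn] + 1e-12)                      # eqt1-style weights
        X[i, ~obs] = (w @ complete[nn][:, ~obs]) / w.sum()  # eqt2-style mean
    return X

X = np.array([[1.0, 2.0], [1.1, 2.1], [5.0, 6.0], [1.05, np.nan]])
print(knn_impute_continuous(X, k=2))  # the NaN is filled near 2.05
```

The small constant added to the distance guards against division by zero when an exact duplicate row exists.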

Comment 17: Line 202: check eqt3

Reply 17: Thank you for your comment. Eqt3 is an extension of our existing k-nearest neighbor algorithm for estimating discrete variables. In the revised version, I modify eqt3 based on eqt1 and eqt2. First, the nearest neighbor samples of a sample with missing values are found according to Euclidean distance. Second, a vote is taken among the nearest neighbor samples, and the category with the most votes among them is used to fill the missing value.

Actual changes:

• Modifying eqt1 in Line 188 and eqt2 in Line 194 and extending eqt3 based on the modified eqt1 and eqt2.
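The discrete-variable extension (the eqt3 idea) can likewise be sketched as a majority vote among neighbors. As above, this is an illustrative simplification: rows with a known category serve as donors, distances are taken over the numeric columns, and the function name is hypothetical.

```python
import numpy as np
from collections import Counter

def knn_impute_categorical(X_num, cat, k=3):
    """Fill missing categorical values (None) by majority vote among the
    k rows nearest in Euclidean distance on the numeric columns."""
    cat = list(cat)
    donors = [j for j, c in enumerate(cat) if c is not None]
    for i, c in enumerate(cat):
        if c is None:
            d = np.sqrt(((X_num[donors] - X_num[i]) ** 2).sum(axis=1))
            nn = [donors[j] for j in np.argsort(d)[:k]]
            cat[i] = Counter(cat[j] for j in nn).most_common(1)[0][0]  # vote
    return cat

X_num = np.array([[0.0], [0.1], [0.2], [5.0], [0.05]])
cat = ["A", "A", "A", "B", None]
print(knn_impute_categorical(X_num, cat, k=3))  # the None becomes "A"
```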

Comment 18: Line 211: in this work we use the RQBSO algorithm……., if RQBSO is the main focus, it must have been describe before this place.

Reply 18: Thank you for your comment. This section is an introduction to the RQBSO algorithm and a description of the specific process of the algorithm: the RQBSO framework is a feature selection algorithm for NEC diagnosis and prognosis. It combines a ridge regression algorithm and a BSO metaheuristic based on a Q-learning strategy. And "in this work we use the RQBSO algorithm ......." is a summary statement. It is inappropriate to be placed here. Therefore, in the revised version, I removed the relevant summary statement and described the specific process of the RQBSO algorithm in detail in this section.

Actual changes:

• The summary statement in Line 211 is removed and the specific procedure of the RQBSO algorithm is described in detail in that section.

Comment 19: Line 230: delete Eq.(5).

Reply 19: Thank you for your suggestion. In the revised version, I remove Eq. (5).

Actual changes:

• Removing Eq. (5) from Line 230.

Comment 20: Line 223-241: You need to utilize software that can improve your mathematical writings to make sense, this applies to all equations in the work.

Reply 20: Thank you for your comment. In the revised draft, I check all the formulas and replace the software used to write the mathematical formulas so that the reader could read them better.

Actual changes:

• Rechecking the formulas in Lines 223-241 and replacing the software used to write the mathematical formulas.

Comment 21: Line 250, 268 and 298: headings.

Reply 21: Thank you for your comment. Due to my lack of standardization in writing, the numbers (1), (2), and (3) in Lines 250, 268 and 298 of the original manuscript may have been interpreted by you as subheadings. However, these three lines are not subheadings, but the execution flow of the QBSO algorithm. In the revised version, I bold these three lines to indicate that they are the execution process of the algorithm.

Actual changes:

• Deleting (1), (2), (3) in Lines 250, 268 and 298, and bolding the three lines.

Comment 22: Line 312: the entirty of these section should have been devoted to development and justification of RQBSO where the figure will now be pictorial representation of the proposed method. As it stand now, this section is the pillar of the work which needs to be strengthened.

Reply 22: Thank you for your comment. As you said, Fig. 4 is the pseudocode of the RQBSO algorithm, which aims to describe the whole execution of the algorithm in a form close to natural language. And this part is also the core and backbone of the whole algorithm. Therefore, in the revised version, I have repositioned Fig. 4 and described the RQBSO algorithm in more detail so that the reader can better understand the execution of the algorithm.

Actual changes:

• Repositioning Fig. 4 under the “Feature selection” section in Line 210, and describing the RQBSO algorithm in more detail.

Comment 23: Lines 314, 319, 329, 336, 341, 347 should go to introduction or still reeiw of relevant tools but not where the main work is been discussed. It can only be mentioned to justify its use.

Reply 23: Thank you for your comment. Lines 314, 319, 325, 329, and 336 focus on the ML models used in the model prediction phase and compare the four models mentioned above. Since we ultimately use the linear SVM model, the discussion of the main work is not the place for the remaining three models. Therefore, in the Model classification section of the revised manuscript, I only present the linear SVM model, while the Comparison with other ML classifiers section in Results shows the comparison of the four models. Lines 341 and 347 concern the performance evaluation metrics used in the study. In this paper, we use recall, accuracy, precision, F1 score, ROC curve, and PRC curve to evaluate the performance of model prediction. These are also the most common evaluation metrics for binary classification performance.

Actual changes:

• Removing the description of other ML models in Lines 325, 329 and 336, and keeping only the description of the linear SVM model in Line 319.

• The content of Lines 341 and 347 is merged and the merged content is modified.

Comment 24: Line 353-356: deliberate action is needed to cite and justify these equations.

Reply 24: Thank you for your comment. This study uses recall, accuracy, F1-score, precision, ROC curve and PRC curve to measure the prediction performance of the model. Recall is defined as the proportion of actual positive samples that are correctly predicted. Precision is defined as the proportion of samples predicted as positive that are truly positive. Accuracy is defined as the proportion of correct predictions among all samples. F1-score is defined as the harmonic mean of precision and recall. The ROC and PRC curves reflect the relationship between the true positive rate and the false positive rate, and between precision and recall, respectively. The above metrics have been widely used to evaluate the performance of binary classification models, and the ROC and PRC curves are often used as performance mapping methods in medical decision making. In the revised manuscript, references and proofs regarding the above equations are listed with relevant literature supporting them.

Actual changes:

• Providing justification for the equations in Lines 353-356, with references to support them.
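The metric definitions discussed in Reply 24 can be made concrete with a short pure-Python sketch (toy labels, not from the manuscript), computed directly from the confusion-matrix counts:

```python
# Toy ground-truth labels and classifier predictions for a binary task.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Confusion-matrix counts.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

recall = tp / (tp + fn)             # actual positives correctly recovered
precision = tp / (tp + fp)          # predicted positives that are correct
accuracy = (tp + tn) / len(y_true)  # all samples correctly classified
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```

Sweeping a decision threshold over continuous classifier scores and recording (false positive rate, recall) or (recall, precision) at each threshold yields the ROC and PRC curves, respectively.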

Comment 25: Line 358: Evaluation cannot be done under the heading Result and Discussion. Is either the heading or content is faulty.

Reply 25: Thank you for your comment. You are correct. In the previous manuscript, I evaluated the RQBSO algorithm and other feature selection algorithms under Results and discussion, which is not appropriate. The evaluation belongs in the section that compares the performance of the different feature selection algorithms. I have adjusted this in the revised manuscript.

Actual changes:

• Removing the evaluation of the performance of the different algorithms from Line 358 and adding it at Line 380 under the heading Comparison with other feature selection algorithms.

Comment 26: Line 357 and line 441: reconcile.

Reply 26: Thank you for your comment. In the revised version, I reconcile the Results and discussion chapter by splitting it into two separate chapters. The Results section focuses on the statistical analysis of the data, comparisons with different methods, and analysis of feature importance. The Discussion section focuses on the pathophysiologically important predictors of NEC diagnosis and prognosis, as well as the strengths and limitations of the study.

Actual changes:

• Splitting the heading spanning Lines 357 and 441 into two sections and refactoring the content of both.

Comment 27: Line 470: Here, we can simply be ……’’In this work, a novel………’’.

Reply 27: Thank you for your suggestion. There were writing irregularities in this section, and I have revised it according to your comment.

Actual changes:

• Simplifying the content in Line 470.

Comment 28: Line 478: the word … most… there must be specific by mentioning the existing ……

Reply 28: Thank you for your comment. In this study, the RQBSO algorithm is proposed and evaluated on two skewed datasets for NEC differential diagnosis and risk prediction. To assess its effectiveness, we compare it with three sets of feature selection methods, and our algorithm outperforms all three. However, the word "most" is not appropriate, because the number of compared algorithms is limited and such a claim would require naming the existing algorithms specifically. Therefore, in the revised manuscript, I delete "which is better than most existing feature selection algorithms" and instead state that our method achieves high recognition accuracy in the differential diagnosis and risk prediction of NEC.

Actual changes:

• Removing the phrase "which is better than most existing feature selection algorithms" in Line 478 and stating instead that our method achieves high recognition accuracy in the differential diagnosis and risk prediction of NEC.

Reviewer #2: The manuscript has very well and enough introduction, all the analysis and calculations are made in good manner. The figures are in a good resolution and all the references are put in order of date. The manuscript has got accepted from my side.

Reply: Thank you very much for your recognition of my manuscript.

Best regards!

Ling Li

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Vijayalakshmi Kakulapati

8 Aug 2022

Framework for feature selection of predicting the diagnosis and prognosis of necrotizing enterocolitis

PONE-D-22-08369R1

Dear Dr. Li,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Vijayalakshmi Kakulapati, Ph.D

Academic Editor

PLOS ONE

Acceptance letter

Vijayalakshmi Kakulapati

10 Aug 2022

PONE-D-22-08369R1

Framework for feature selection of predicting the diagnosis and prognosis of necrotizing enterocolitis

Dear Dr. Li:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Vijayalakshmi Kakulapati

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Description of different features.

    (DOCX)

    S2 Table. Hyper-parameters used by other algorithms.

    (DOCX)

    S1 Dataset

    (ZIP)


    Data Availability Statement

    All relevant data are within the paper and its Supporting Information files.

    Articles from PLoS ONE are provided here courtesy of PLOS