Table 1.
Package | Feature selection algorithm | Extracted variables
---|---|---
Wrapper algorithms | |
“Boruta”31 | An all-relevant feature selection algorithm using a random forest classifier. |
Filter algorithms | |
“FSelector”32 | For the entropy-based methods below, continuous features were discretized. |
| 1) Gain ratio: An entropy-based filter using the information-gain criterion of a decision-tree classifier, modified to reduce bias toward highly branching features with many values. Bias is reduced by normalizing information gain by the intrinsic information of a split. | From the top 30 variables: age, waist circumference, monocyte count, mean SBP, BMI, diagnosed hypertension, uric acid, GGT, serum phosphorus, vigorous activity, familial diabetes, marital status, serum potassium, hepatitis B, hepatitis C, race, ALT, overweight baby at birth, gender, hepatitis B surface antibody, bilateral ovariectomy, self-perceived DM risk
| 2) Symmetrical uncertainty: An entropy-based filter using the information-gain criterion, modified to reduce bias toward highly branching features with many values. Bias is reduced by normalizing information gain by the entropies of the corresponding features. | From the top 30 variables: age, waist circumference, mean SBP, BMI, gender, GGT, race, serum uric acid, phosphorus, hepatitis E IgG, hepatitis B, serum potassium, food security, ALT, hepatitis B surface antibody, hepatitis C, self-perceived DM risk, female hormone intake, hysterectomy
| 3) Random forest: The algorithm finds attribute weights using a random forest algorithm. | From the top 30 variables: age, waist circumference, BMI, mean SBP, mean DBP, income-poverty ratio, hematocrit, osmolality, triglycerides level, bilateral ovariectomy, RBC count, female hormone intake, WBC count, marital status, serum uric acid, hemoglobin, GGT, monocyte count, serum calcium, hepatitis E IgG, serum phosphorus
| 4) Relief: The algorithm finds weights of continuous and discrete attributes based on the distance between instances. | From the top 30 variables: education, past any tobacco use, hepatitis C, vigorous activity, overweight baby at birth, citizenship, alcohol use, female hormone intake, moderate activity, hysterectomy, duration of watching TV, bilateral ovariectomy, gestational DM, diagnosed jaundice, familial diabetes, hepatitis E IgG
Embedded algorithms | |
“glmnet”33 | Lasso (least absolute shrinkage and selection operator) regularization: a constraint is placed on the sum of the absolute values of the model parameters, which must be less than a fixed value (upper bound). The regularization process penalizes regression coefficients, shrinking some of them to exactly zero; the variables with nonzero coefficients after regularization are selected. The lambda value that minimizes the cross-validated mean squared error determines the sparse model containing the selected features. | From the top 15 variables: self-perceived DM risk, age, citizenship, diagnosed hypertension, waist circumference, RBC count, hepatitis E IgG, serum iron, serum calcium, serum globulin, serum potassium
“caret”34 | Recursive feature elimination: a resampling-based recursive feature elimination method is applied, with a random forest algorithm used at each iteration to evaluate the model. The algorithm is configured to explore all possible subsets of the attributes. | From the top 30 variables: age, waist circumference, duration of watching TV, mean SBP, hematocrit, WBC count, GGT, gestational DM, mean DBP, hepatitis E IgG, income-poverty ratio, food security, RBC count, marital status, osmolality, diagnosed jaundice, serum uric acid, overweight baby at birth, serum iron, BMI, AMT, hysterectomy
Abbreviations: ALT, alanine aminotransferase; AMT, aspartate aminotransferase; BMI, body mass index; DBP, diastolic blood pressure; DM, diabetes mellitus; GGT, gamma-glutamyl transferase; IgG, immunoglobulin G; RBC, red blood cells; SBP, systolic blood pressure; WBC, white blood cells.
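The Boruta row hinges on "shadow features": a real feature is confirmed only if it out-scores randomized copies of the data. A minimal Python sketch of that idea, using absolute correlation with the outcome as a stand-in importance measure instead of Boruta's random-forest importance (the data and names are illustrative, not from the study):

```python
import numpy as np

def shadow_screen(X, y, n_rounds=50, rng=None):
    """Fraction of rounds each real feature out-scores the best shuffled shadow."""
    if rng is None:
        rng = np.random.default_rng(0)

    def importance(M):
        # Stand-in importance: absolute correlation of each column with y.
        return np.abs(np.corrcoef(M.T, y)[-1, :-1])

    wins = np.zeros(X.shape[1])
    for _ in range(n_rounds):
        shadows = rng.permuted(X, axis=0)  # shuffle each column independently: destroys any real signal
        wins += importance(X) > importance(shadows).max()
    return wins / n_rounds

rng = np.random.default_rng(3)
X = rng.standard_normal((120, 5))
y = X[:, 0] + 0.8 * X[:, 4] + 0.3 * rng.standard_normal(120)
freq = shadow_screen(X, y, rng=rng)
print(freq)  # the two informative features should win nearly every round
```

Boruta iterates this competition with statistical tests until every feature is confirmed or rejected; the shuffling logic above is the core of the "all-relevant" idea.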
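The two entropy-based FSelector filters differ only in the normalizer applied to information gain: the intrinsic information of the split (gain ratio) versus the entropies of both variables (symmetrical uncertainty). A small pure-Python sketch of both scores for already-discretized features (toy data, not the study's):

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H(X) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(Y | X): entropy of y within each level of x, weighted by P(x)."""
    n = len(x)
    return sum(
        (len(sub) / n) * entropy(sub)
        for level in set(x)
        for sub in [[yi for xi, yi in zip(x, y) if xi == level]]
    )

def information_gain(x, y):
    return entropy(y) - conditional_entropy(x, y)

def gain_ratio(x, y):
    """Information gain normalized by the intrinsic information of the split, H(X)."""
    hx = entropy(x)
    return information_gain(x, y) / hx if hx > 0 else 0.0

def symmetrical_uncertainty(x, y):
    """2 * IG / (H(X) + H(Y)): normalized by the entropies of both variables."""
    denom = entropy(x) + entropy(y)
    return 2 * information_gain(x, y) / denom if denom > 0 else 0.0

# Toy example: a binary outcome and two discretized features.
y = [1, 1, 1, 0, 0, 0]
x_informative = [1, 1, 1, 0, 0, 0]  # perfectly predicts y
x_noisy       = [1, 0, 1, 0, 1, 0]  # nearly independent of y
print(gain_ratio(x_informative, y), gain_ratio(x_noisy, y))
print(symmetrical_uncertainty(x_informative, y))
```

Both normalizations penalize features whose own entropy is high, which is what curbs the bias toward highly branching features noted in the table.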
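The Relief filter weighs each feature by contrasting every instance with its nearest same-class neighbor ("hit") and nearest other-class neighbor ("miss"). A compact numpy sketch of that update, assuming numeric features and a binary outcome (illustrative only, not the FSelector implementation):

```python
import numpy as np

def relief_weights(X, y):
    """Basic Relief: weight features by near-hit vs. near-miss differences."""
    n, p = X.shape
    span = X.max(axis=0) - X.min(axis=0)  # scale differences to [0, 1] per feature
    span[span == 0] = 1.0
    w = np.zeros(p)
    for i in range(n):
        dist = np.abs((X - X[i]) / span).sum(axis=1)  # Manhattan distance
        dist[i] = np.inf                              # exclude the instance itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class neighbor
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class neighbor
        w -= np.abs(X[i] - X[hit]) / span / n   # differing from a hit lowers the weight
        w += np.abs(X[i] - X[miss]) / span / n  # differing from a miss raises the weight
    return w

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = np.column_stack([
    y + 0.1 * rng.standard_normal(60),  # class-linked feature
    rng.standard_normal(60),            # pure noise
])
w = relief_weights(X, y)
print(w)  # the class-linked feature should receive the clearly larger weight
```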
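The lasso mechanics in the glmnet row can be made concrete with coordinate descent: soft-thresholding shrinks weak coefficients exactly to zero, and the surviving nonzero coefficients are the selected variables. A minimal numpy sketch at a fixed lambda (glmnet additionally chooses lambda by cross-validation; data here is synthetic):

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate-descent lasso on standardized columns and centered response."""
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = y - y.mean()
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = ys - Xs @ beta + Xs[:, j] * beta[j]        # partial residual
            beta[j] = soft_threshold(Xs[:, j] @ r / n, lam)  # shrink, possibly to zero
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + 0.1 * rng.standard_normal(200)
beta = lasso_cd(X, y, lam=0.1)
print(np.flatnonzero(beta))  # indices of the selected (nonzero-coefficient) features
```

Sweeping `lam` over a grid and keeping the value with the smallest cross-validated error yields the sparse model the table describes.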
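Recursive feature elimination, as in the caret row, repeatedly fits a model, drops the weakest feature, and ranks features by elimination order. A short numpy sketch using a plain least-squares fit in place of caret's random forest ranking (illustrative only):

```python
import numpy as np

def rfe_ranking(X, y):
    """Rank features from last-eliminated (most useful) to first-eliminated."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while len(remaining) > 1:
        sub = X[:, remaining]
        Xs = (sub - sub.mean(axis=0)) / sub.std(axis=0)  # standardize for comparable coefficients
        coef, *_ = np.linalg.lstsq(Xs, y - y.mean(), rcond=None)
        weakest = remaining[int(np.argmin(np.abs(coef)))]  # smallest standardized |coefficient|
        eliminated.append(weakest)
        remaining.remove(weakest)
    return remaining + eliminated[::-1]

rng = np.random.default_rng(2)
X = rng.standard_normal((150, 4))
y = 2 * X[:, 1] + 0.5 * X[:, 3] + 0.1 * rng.standard_normal(150)
ranking = rfe_ranking(X, y)
print(ranking)  # the informative features should come first
```

caret's `rfe` wraps the same loop in resampling, refitting on bootstrap or CV splits to pick the best-performing subset size rather than trusting a single fit.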