PLoS One. 2024 Jul 2;19(7):e0306359. doi: 10.1371/journal.pone.0306359

Comparison of model feature importance statistics to identify covariates that contribute most to model accuracy in prediction of insomnia

Alexander A Huang 1, Samuel Y Huang 2,*
Editor: Sergio A Useche3
PMCID: PMC11218970  PMID: 38954735

Abstract

Importance

Sleep is critical to a person’s physical and mental health, and there is a need both to create high-performing machine learning models and to understand critically how models rank covariates.

Objective

The study aimed to compare how different model metrics rank the importance of various covariates.

Design, setting, and participants

A retrospective, cross-sectional cohort study was conducted using the publicly available National Health and Nutrition Examination Survey (NHANES).

Methods

This study employed univariate logistic models to screen for strong, independent covariates associated with the sleep disorder outcome; these covariates were then used in machine-learning models, and the best-performing model was chosen. The chosen machine-learning model was used to rank covariates by gain, cover, and frequency to identify risk factors for sleep disorder, and feature importance was also evaluated using univariable and multivariable t-statistics. A correlation matrix was created to determine the similarity of the variable rankings produced by the different model metrics.

Results

The XGBoost model had the highest mean AUROC of 0.865 (SD = 0.010), with Accuracy of 0.762 (SD = 0.019), F1 of 0.766 (SD = 0.006), Sensitivity of 0.768 (SD = 0.023), Specificity of 0.782 (SD = 0.025), Positive Predictive Value of 0.806 (SD = 0.025), and Negative Predictive Value of 0.737 (SD = 0.034). The machine-learning importance metrics gain and cover were strongly positively correlated with one another (r > 0.70). Metrics from the multivariable and univariable models were weakly negatively correlated with the machine-learning metrics (r between -0.3 and 0).

Conclusion

The ranking of important variables associated with sleep disorder in this cohort by the machine learning models was not related to the ranking produced by the regression models.

Introduction

Insomnia is a widespread clinical condition characterized by difficulty initiating or maintaining sleep, which can result in significant physical and mental health consequences. The annual prevalence of insomnia symptoms in the general adult population ranges from 35–50%, with the prevalence of insomnia disorder ranging from 12–20% [1]. Several risk factors contribute to the development of insomnia, including depression, female sex, older age, lower socioeconomic status, concurrent medical and mental disorders, marital status, and race [2–6]. Moreover, insomnia often follows a chronic course, and its functional consequences include reduced productivity, increased absenteeism, and increased healthcare costs. Insomnia also increases the risk of developing mental disorders, including depression, and is associated with worse treatment outcomes in depression and alcohol dependence [3, 7–11]. Furthermore, insomnia is linked to an increased risk of developing metabolic syndrome, hypertension, and coronary heart disease [1, 7, 12–15]. Despite the high prevalence and negative consequences of sleep disorders, researchers are only beginning to apply advanced mathematical models in this field, and it is necessary for physicians to understand how such models are created. Increased research into feature importance will strengthen physicians’ capacity to reason clinically about model outputs.

Linear regression, logistic regression, multivariable statistics, and machine learning have all been essential tools for outcome researchers and physicians in the diagnosis and treatment of various diseases [16, 17]. Linear regression has been used to assess the relationship between continuous predictors and outcomes, which is useful in identifying risk factors for disease progression or treatment outcomes [18]. Logistic regression, on the other hand, has been used to model the probability of binary outcomes, such as the presence or absence of a disease. This technique is particularly useful for diagnosis, where the aim is to correctly classify patients as either having or not having a disease based on predictors such as symptoms, demographics, and laboratory values. Multivariable statistics, which include techniques such as multiple regression and analysis of variance, have been used to model the relationship between multiple predictors and an outcome variable [19]. This is important in identifying the most important risk factors for disease progression or treatment outcomes, as well as determining the optimal treatment approach for different patient groups. Machine learning techniques, such as XGBoost and neural networks, have become increasingly popular in recent years due to their ability to handle complex data and identify patterns that may not be apparent with traditional statistical methods [20]. Machine learning has been used to develop predictive models for various diseases, such as diabetes, heart disease, and cancer, as well as to identify subgroups of patients who may benefit from targeted treatments [21–25]. With the rise of machine learning techniques in healthcare research, it is crucial to examine how these models compare to traditional statistical approaches in terms of variable selection and ranking. While traditional statistical models focus on hypothesis testing and estimation, machine learning models aim to predict outcomes by learning patterns in the data. These differences in approach may lead to variations in variable importance and ranking, which can ultimately impact clinical decision-making. Therefore, to better study the prediction of insomnia, it is essential to conduct studies that assess the degree of similarity between machine learning and traditional statistical models in their variable selection and ranking based on various importance statistics, such as cover, frequency, gain, the univariable t-statistic, and the multivariable t-statistic. This study aims to address this gap in knowledge and provide insight into the similarities and differences between these two modeling approaches.

To address these limitations, we will highlight how some of the most common models rank covariates based on their model statistics. We will use a correlation matrix to visually present the correlational relationships between these model statistics and show their degree of similarity. The analysis will use the NHANES 2017–2020 cohort, a large, nationally representative sample of US adults, to examine demographic, laboratory, physical exam, and lifestyle covariates. This study will help to increase understanding of the different methods for evaluating risk factors for sleep disorders and to clarify the key risk factors for sleep disorders in the US population.

Methods

A cross-sectional cohort study was carried out using the publicly available National Health and Nutrition Examination Survey (NHANES) data. The retrospective study included patients who had completed questionnaires on their demographics, diet, exercise, and mental health and who had undergone laboratory and physical examinations. The National Center for Health Statistics (NCHS) Ethics Review Board approved the data acquisition and analysis for this study. To ensure patient privacy, all data, including medical records, survey information, and demographic information, were fully anonymized prior to analysis. All patients provided written consent for their data to be made public.

Dataset and cohort selection

The NHANES program, developed by the NCHS, aims to assess the health and nutritional status of the US population through complex, multi-stage surveys conducted by the CDC. The NHANES dataset includes data on health, nutrition, and physical activity from a representative sample of the US population. For this study, the focus was on individuals aged 18 years or older who completed the demographic, dietary, exercise, and mental health questionnaires and had both laboratory and physical exam data available for analysis. All patients in the dataset with full insomnia data were included. A total of 7,929 patients met the inclusion criteria, of whom 2,302 (29%) had a sleep disorder.

Assessment of sleep disorder

To identify patients with sleep disorders in this study, we utilized the medical conditions file. Participants were queried with the following question: "Have you ever reported to a healthcare professional or doctor that you experience difficulty sleeping?" If the answer to this question was "Yes," the participant was classified as having a sleep disorder for the purposes of this study.

Independent variable

The NHANES dataset was searched to identify potential model covariates from the demographics, dietary, physical examination, laboratory, and medical questionnaire datasets. In total, 783 covariates were found and extracted, and they were then merged with the sleep disorder indicator.
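
As a rough illustration of this assembly step, the R sketch below merges NHANES files and codes the outcome. The file names (P_DEMO.XPT, P_SLQ.XPT), variable names (SEQN, RIDAGEYR, SLQ050), and response codes are assumptions about the NHANES 2017–2020 release, not details taken from the paper; the actual study merged 783 covariates across five domains.

```r
# Illustrative sketch only: file and variable names are assumptions.
library(haven)  # read_xpt() reads NHANES SAS transport (.XPT) files
library(dplyr)

demo  <- read_xpt("P_DEMO.XPT")  # demographics file
sleep <- read_xpt("P_SLQ.XPT")   # sleep questionnaire file

cohort <- demo %>%
  inner_join(sleep, by = "SEQN") %>%   # SEQN: NHANES respondent ID
  filter(RIDAGEYR >= 18) %>%           # adults only
  mutate(sleep_disorder = case_when(
    SLQ050 == 1 ~ 1,                   # "Yes"  -> sleep disorder
    SLQ050 == 2 ~ 0,                   # "No"
    TRUE ~ NA_real_                    # refused / don't know -> missing
  )) %>%
  filter(!is.na(sleep_disorder))       # keep complete outcome data only
```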

Covariate selection considerations

Recognizing the potential influence of collinearity on variable selection, we underscore the importance of a preliminary correlation analysis prior to covariate selection. To allow a more balanced comparison of the machine learning and regression models’ predictive capabilities for insomnia, we standardized the evaluation approach by incorporating common performance metrics: mean absolute error (MAE), mean squared error (MSE), and mean absolute percentage error (MAPE). This uniformity enables a more equitable assessment of each model’s effectiveness. We also included a residual analysis, examining the discrepancies between observed insomnia outcomes and the probabilities predicted by the models, to give deeper insight into each model’s predictive accuracy. Given these considerations, we adopted the previously proposed methodology described below for its simplicity and comparable results.
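
For reference, the error metrics named above can be computed directly from observed outcomes and predicted probabilities. A minimal sketch follows, assuming vectors `obs` (0/1 outcomes) and `pred` (predicted probabilities); note that MAPE is undefined wherever an observed value is zero, which limits its usefulness for binary outcomes.

```r
# Minimal sketch, assuming obs (0/1 outcomes) and pred (predicted probabilities).
err  <- obs - pred
mae  <- mean(abs(err))              # mean absolute error
mse  <- mean(err^2)                 # mean squared error
mape <- mean(abs(err / obs)) * 100  # undefined where obs == 0
```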

Model construction and statistical analysis

In this study, univariate logistic models were employed to determine which covariates were associated with a sleep disorder outcome. Covariates that demonstrated a p-value of less than 0.0001 in univariate analysis were included in the final machine-learning model. The use of univariate logistic models served as an initial filter of the 700+ covariates present in the dataset, ensuring that only strong, independent covariates were used in the machine learning models. This initial filtering also facilitated physician review of clinically relevant risk factors. Following the initial filtering process, model importance statistics derived from the machine-learning models were used to identify key risk factors.
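
A minimal sketch of this filtering step is shown below, assuming a data frame `cohort` with the binary `sleep_disorder` outcome and a character vector `covariate_names` listing the candidate covariates (both names are illustrative, not taken from the paper).

```r
# Fit one univariate logistic model per covariate and keep those with
# Wald p < 0.0001, per the threshold stated in the text.
filter_covariates <- function(data, outcome, covariate_names, alpha = 1e-4) {
  keep <- character(0)
  for (v in covariate_names) {
    fit <- try(glm(reformulate(v, response = outcome),
                   data = data, family = binomial()), silent = TRUE)
    if (inherits(fit, "try-error")) next
    p <- coef(summary(fit))[2, "Pr(>|z|)"]  # p-value for the covariate term
    if (!is.na(p) && p < alpha) keep <- c(keep, v)
  }
  keep
}

selected <- filter_covariates(cohort, "sleep_disorder", covariate_names)
```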

Four machine-learning methods were carried out: XGBoost, Random Forest, Support Vector Machine, and Deep Neural Network (Table 1). All machine-learning models were constructed using 10-fold cross-validation. A train:test split (80:20) was used to compute the final set of model fit parameters. The model fit parameters considered in this study included accuracy, F1, sensitivity, specificity, positive predictive value, negative predictive value, and AUROC (area under the receiver operating characteristic curve).
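
The sketch below illustrates one way to reproduce this protocol for the XGBoost model with the caret package (an 80:20 split with 10-fold cross-validation inside the training set); the seed and the default tuning grid are illustrative assumptions, and `selected` is the covariate list from the filtering step above.

```r
library(caret)
library(pROC)

set.seed(1)
cohort$sleep_disorder <- factor(cohort$sleep_disorder, labels = c("No", "Yes"))
idx      <- createDataPartition(cohort$sleep_disorder, p = 0.8, list = FALSE)
train_df <- cohort[idx, ]
test_df  <- cohort[-idx, ]

ctrl <- trainControl(method = "cv", number = 10,   # 10-fold cross-validation
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fit <- train(reformulate(selected, response = "sleep_disorder"),
             data = train_df, method = "xgbTree",  # XGBoost via caret
             trControl = ctrl, metric = "ROC")

probs <- predict(fit, newdata = test_df, type = "prob")[, "Yes"]
roc(response = test_df$sleep_disorder, predictor = probs)  # held-out AUROC
```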

If the models performed differently from one another, the best model based on the model metrics would be chosen; if they performed similarly, the model of choice would be selected based on a literature search. XGBoost was chosen as the most optimal model based on the seven model fit parameters computed; it is also prevalent in the literature and has shown strong predictive accuracy in healthcare prediction. To identify risk factors for sleep disorder, model covariates were ranked on three criteria: Gain, Cover, and Frequency. Gain refers to the relative contribution of a feature within the machine-learning model; Cover is the relative number of observations associated with the feature; and Frequency is the percentage of times the feature appears in the trees of the model. To visualize the relationship between potential risk factors and sleep disorder, SHAP explanations were utilized. Additionally, feature importance was evaluated using both univariable and multivariable t-statistics.
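
As a sketch of how Gain, Cover, and Frequency can be obtained, the native xgboost interface reports all three per feature. The inputs `X` (a numeric model matrix of the filtered covariates) and `y` (the 0/1 outcome) are assumed, and `nrounds` is an illustrative choice.

```r
library(xgboost)

# Assumed inputs: numeric model matrix X and 0/1 outcome y.
bst <- xgboost(data = X, label = y,
               objective = "binary:logistic", nrounds = 100, verbose = 0)

imp <- xgb.importance(model = bst)  # columns: Feature, Gain, Cover, Frequency
head(imp, 10)                       # top features ranked by Gain
```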

Determination of the similarity of the importance of variables by the model metrics

Variables were ranked based on each criterion (Gain, Cover, Frequency, univariable t-statistic, and multivariable t-statistic). A correlation matrix was created containing the correlation coefficient for every pairing of Gain, Cover, Frequency, univariable t-statistic, and multivariable t-statistic. All statistical analysis was done in R (RStudio version 2023.06.0+421). Packages utilized: dplyr, tidyr, stringr, lubridate, summarytools, psych, ggplot2, plotly, ggpubr, caret, randomForest, glmnet, xgboost, keras, shap, pROC, missForest, boot, cvms, recipes, VennDiagram, fastshap [26].
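
A minimal sketch of this comparison follows, assuming `stats_tbl` is a data frame shaped like Table 2 (one row per covariate, columns Gain, Cover, Frequency, Univariable, Multivariable). Whether ranks or raw statistics are correlated is a design choice; the sketch correlates ranks.

```r
metrics <- c("Gain", "Cover", "Frequency", "Univariable", "Multivariable")
ranks   <- sapply(stats_tbl[metrics], rank)  # rank covariates under each metric
cor_mat <- cor(ranks)                        # pairwise correlations of rankings
round(cor_mat, 2)
```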

Results

Overall performance and variability of the models

Table 1 shows model accuracy statistics for the four machine learning models. The XGBoost model had strong performance, most notably the highest mean AUROC of all models, with mean AUROC = 0.865 (SD = 0.010), Accuracy = 0.762 (SD = 0.019), F1 = 0.766 (SD = 0.006), Sensitivity = 0.768 (SD = 0.023), Specificity = 0.782 (SD = 0.025), Positive Predictive Value = 0.806 (SD = 0.025), and Negative Predictive Value = 0.737 (SD = 0.034). Among the 10,000 simulations completed, the AUROC ranged from 0.755 to 0.918, a difference of 0.163; the accuracy ranged from 0.657 to 0.894, a 0.237 difference; the F1 ranged from 0.655 to 0.875, a 0.221 difference; the sensitivity ranged from 0.675 to 0.887, a 0.211 difference; and the specificity ranged from 0.565 to 0.936, a 0.370 difference. The machine learning models all had strong performance, with mean AUROCs ranging from 0.818 to 0.865.

Table 1. Comparison of different machine learning models.

XGBoost Metrics Minimum 5th Percentile 25th Percentile Median 75th Percentile 95th Percentile Maximum Mean Standard Deviation Range
Accuracy 0.657 0.709 0.734 0.771 0.796 0.808 0.894 0.762 0.019 0.237
F1 0.655 0.739 0.750 0.772 0.798 0.823 0.875 0.766 0.006 0.221
Sensitivity 0.675 0.727 0.770 0.774 0.791 0.847 0.887 0.768 0.023 0.211
Specificity 0.565 0.693 0.743 0.762 0.797 0.823 0.936 0.782 0.025 0.370
Positive Predictive Value 0.668 0.731 0.761 0.785 0.819 0.863 0.941 0.806 0.025 0.273
Negative Predictive Value 0.544 0.640 0.720 0.719 0.774 0.809 0.913 0.737 0.034 0.369
AUROC 0.755 0.800 0.833 0.840 0.861 0.905 0.918 0.865 0.010 0.163
Deep Neural Network Metrics Minimum 5th Percentile 25th Percentile Median 75th Percentile 95th Percentile Maximum Mean Standard Deviation Range
Accuracy 0.645 0.715 0.736 0.744 0.789 0.804 0.879 0.766 0.020 0.234
F1 0.682 0.717 0.748 0.749 0.793 0.813 0.859 0.780 0.017 0.177
Sensitivity 0.645 0.734 0.764 0.763 0.786 0.817 0.857 0.784 0.016 0.213
Specificity 0.575 0.700 0.737 0.757 0.797 0.845 0.923 0.740 0.010 0.349
Positive Predictive Value 0.648 0.709 0.768 0.806 0.805 0.860 0.944 0.790 0.033 0.297
Negative Predictive Value 0.532 0.657 0.711 0.703 0.732 0.806 0.896 0.722 0.027 0.365
AUROC 0.726 0.789 0.844 0.824 0.867 0.861 0.891 0.818 0.006 0.166
Random Forest Metrics Minimum 5th Percentile 25th Percentile Median 75th Percentile 95th Percentile Maximum Mean Standard Deviation Range
Accuracy 0.659 0.705 0.743 0.773 0.785 0.806 0.848 0.773 0.011 0.189
F1 0.671 0.733 0.736 0.762 0.782 0.816 0.869 0.770 0.017 0.198
Sensitivity 0.649 0.715 0.745 0.784 0.799 0.832 0.854 0.793 0.014 0.205
Specificity 0.559 0.669 0.750 0.762 0.765 0.816 0.902 0.753 0.018 0.344
Positive Predictive Value 0.635 0.730 0.750 0.781 0.816 0.826 0.905 0.791 0.013 0.270
Negative Predictive Value 0.536 0.654 0.706 0.717 0.735 0.813 0.895 0.725 0.015 0.359
AUROC 0.727 0.786 0.800 0.848 0.869 0.884 0.920 0.831 0.011 0.193
Support Vector Machines Metrics Minimum 5th Percentile 25th Percentile Median 75th Percentile 95th Percentile Maximum Mean Standard Deviation Range
Accuracy 0.663 0.725 0.743 0.755 0.765 0.798 0.875 0.744 0.014 0.212
F1 0.657 0.729 0.731 0.739 0.771 0.791 0.883 0.743 0.015 0.226
Sensitivity 0.640 0.719 0.773 0.780 0.776 0.828 0.889 0.796 0.017 0.249
Specificity 0.563 0.678 0.720 0.740 0.765 0.850 0.930 0.754 0.010 0.367
Positive Predictive Value 0.665 0.722 0.769 0.791 0.840 0.832 0.929 0.800 0.020 0.264
Negative Predictive Value 0.549 0.656 0.676 0.748 0.758 0.803 0.886 0.748 0.012 0.337
AUROC 0.724 0.812 0.820 0.857 0.848 0.858 0.904 0.839 0.014 0.180

Comparison of four machine learning models (XGBoost, Deep Neural Network, Random Forest, Support Vector Machines) using the model statistics Accuracy, F1, Sensitivity, Specificity, Positive Predictive Value, Negative Predictive Value, and AUROC in the NHANES cohort.

Table 2 shows the model statistics including the gain, cover, frequency, univariable t-statistic, and multivariable t-statistic for all covariates with p-values <0.0001. These allowed for variable selection and clinical evaluation of the importance of each of these potential features.

Table 2. Model gain statistics.

Feature Gain Cover Frequency Univariable Multivariable
PHQ_9 0.3088 0.1966 0.0609 5.5356 0.2812
Age 0.0754 0.0939 0.0607 5.9325 0.1769
Blood cadmium (ug/L) 0.0248 0.0306 0.0408 5.4400 1.1306
Alcohol..gm. 0.0253 0.0274 0.0405 0.3661 0.4035
BMXWAIST—Waist Circumference (cm) 0.0270 0.0209 0.0398 11.2737 3.4972
BMXWT—Weight (kg) 0.0299 0.0340 0.0383 0.3272 0.3907
Food.folate..mcg. 0.0245 0.0178 0.0378 6.0746 0.9804
RBC folate (ng/mL) 0.0229 0.0213 0.0363 4.9902 1.3090
Caffeine..mg..1 0.0234 0.0210 0.0344 5.4546 0.4402
 Red blood cell count (million cells/uL) 0.0206 0.0202 0.0341 5.6093 3.4363
Dietary.fiber..gm. 0.0196 0.0115 0.0317 5.0440 0.1824
HS C-Reactive Protein (mg/L) 0.0172 0.0156 0.0310 5.7368 2.1163
LBXTR—Triglyceride (mg/dL) 0.0186 0.0177 0.0305 12.1990 2.0226
 Glucose, refrigerated serum (mg/dL) 0.0181 0.0146 0.0296 8.7860 1.3297
N-acetyl-S-(n-propyl)-L-cysteine comt 0.0194 0.0197 0.0290 10.7528 1.3323
Red cell distribution width (%) 0.0165 0.0281 0.0283 3.9850 0.6928
Insulin (pmol/L) 0.0174 0.0132 0.0277 5.7933 0.6871
 Alkaline Phosphatase (ALP) (IU/L) 0.0149 0.0105 0.0275 5.1057 0.2457
Gamma Glutamyl Transferase (GGT) (IU/L) 0.0141 0.0117 0.0253 5.2053 2.2462
BMXBMI—Body Mass Index (kg/m**2) 0.0155 0.0100 0.0241 5.1035 0.6673
Total Protein (g/dL) 0.0149 0.0164 0.0237 6.7845 0.1932
 Glycohemoglobin (%) 0.0133 0.0154 0.0232 5.0028 0.6921
Blood Urea Nitrogen (mg/dL) 0.0117 0.0090 0.0196 5.1904 1.2308
Cotinine, Serum (ng/mL) 0.0110 0.0094 0.0189 6.0654 1.0683
MCQ366b - Doctor told you to exercise 0.0386 0.0479 0.0188 6.5157 0.0332
Albumin, refrigerated serum (g/dL) 0.0100 0.0076 0.0173 5.6835 2.0091
MCQ540—Ever seen a DR about this pain 0.0222 0.0384 0.0171 5.0680 0.6962
Hydroxycotinine, Serum (ng/mL) 0.0107 0.0096 0.0146 4.8752 1.3299
MCQ300a - Close relative had heart attack? 0.0083 0.0159 0.0104 2.1012 6.3521
MCQ520—Abdominal pain during past 12 months? 0.0085 0.0077 0.0086 2.4081 2.2197
SMQ856—Last 7-d worked at job not at home? 0.0106 0.0196 0.0086 3.5418 0.5547
MCQ300b - Close relative had asthma? 0.0059 0.0183 0.0084 3.1039 1.4597
MCQ160p - Ever told you had COPD, emphysema, ChB 0.0080 0.0165 0.0082 2.4048 2.8528
MCQ366a - Doctor told you to control/lose weight 0.0097 0.0110 0.0079 1.2951 2.2230
MCQ160b - Ever told had congestive heart failure 0.0063 0.0175 0.0078 0.2023 2.8321
MCQ366c - Doctor told you to reduce salt in diet 0.0070 0.0112 0.0075 2.2783 3.4871
MCQ560—Ever had gallbladder surgery? 0.0060 0.0134 0.0071 1.3450 0.9613
SMQ020—Smoked at least 100 cigarettes in life 0.0040 0.0066 0.0057 0.9800 0.5384
Fewer_carbs 0.0045 0.0085 0.0055 1.6327 2.1652
MCQ160m - Ever told you had thyroid problem 0.0050 0.0080 0.0054 0.0449 2.0343
Changed_eating_habits 0.0040 0.0088 0.0053 14.7371 1.0287
MCQ220—Ever told you had cancer or malignancy 0.0030 0.0040 0.0043 17.9071 0.2356
Used_liquid_diet 0.0027 0.0112 0.0036 11.9125 2.6688
MCQ371a - Are you now controlling or losing weight 0.0021 0.0031 0.0035 13.2250 0.3785
MCQ366d - Doctor told you to reduce fat/calories 0.0022 0.0020 0.0029 6.6864 2.8681
Ate_less_junk_food 0.0019 0.0013 0.0029 4.0939 2.8168
MCQ160l - Ever told you had any liver condition 0.0018 0.0072 0.0026 6.4612 7.6485
MCQ300c - Close relative had diabetes? 0.0016 0.0011 0.0026 5.2172 2.4245
MCQ160f - Ever told you had a stroke 0.0015 0.0052 0.0025 7.6107 0.4380
MCQ371c - Are you now reducing salt in diet 0.0012 0.0008 0.0023 11.8760 0.1188
Ate_fruits_veg 0.0012 0.0015 0.0023 0.8459 0.0653
MCQ160e - Ever told you had heart attack 0.0009 0.0013 0.0016 2.7936 0.7437
Drank_water_lose_weight 0.0009 0.0004 0.0016 4.3767 1.2387
Gender 0.0008 0.0004 0.0015 27.6767 0.4588
Special_diet_lose_weight 0.0009 0.0014 0.0014 2.1374 0.1151
Supplement_lose_weight 0.0006 0.0025 0.0013 3.9734 1.3689
Ate_less_sugar 0.0007 0.0005 0.0012 9.6662 1.7768
MCQ371d - Are you now reducing fat in diet 0.0006 0.0001 0.0012 5.0506 3.3152
SMQ690A - Used last 5 days—Cigarettes 0.0005 0.0004 0.0010 11.1669 1.6108
MCQ550—Has DR ever said you have gallstones 0.0002 0.0000 0.0005 5.2151 22.8579
MCQ510f - Liver condition: Other liver disease 0.0002 0.0011 0.0003 4.8092 1.9217
MCQ160c - Ever told you had coronary heart disease 0.0001 0.0000 0.0003 1.2255 1.9703
Weight_loss_surgery 0.0001 0.0011 0.0002 5.5425 2.4325
MCQ160d - Ever told you had angina/angina pectoris 0.0001 0.0001 0.0002 4.1821 3.2080

The Gain, Cover, Frequency, univariable t-statistic, and multivariable t-statistic of all covariates within the XGBoost and regression models. The Gain represents the relative contribution of the feature to the model and is the most important metric of model importance within this study. Covariates are ordered according to the Gain statistic.

Table 3 highlights the top ten variables for each of the model statistics. Across all feature importance statistics, the PHQ-9 score was the most important, with a gain of 0.309, cover of 0.197, frequency of 0.061, univariable t-statistic of 5.536, and multivariable t-statistic of 0.281. Age was the second most important feature for four of the five statistics, with a gain of 0.075, cover of 0.094, frequency of 0.061, univariable t-statistic of 5.933, and multivariable t-statistic of 0.177.

Table 3. Top 10 ranked features for each feature importance method.

Feature Importance Method Gain Cover Frequency Multivariable Univariable
Top 10 Variables Selected PHQ_9 PHQ_9 PHQ_9 PHQ_9 PHQ_9
Age Age Age Age MCQ366b - Doctor told you to exercise
MCQ366b - Doctor told you to exercise MCQ366b - Doctor told you to exercise Blood cadmium (ug/L) MCQ366b - Doctor told you to exercise MCQ366a - Doctor told you to control/lose weight
BMXWT—Weight (kg) MCQ540—Ever seen a DR about this pain Alcohol..gm. Albumin, refrigerated serum (g/dL) MCQ366d - Doctor told you to reduce fat/calories
BMXWAIST—Waist Circumference (cm) BMXWT—Weight (kg) BMXWAIST—Waist Circumference (cm) MCQ520—Abdominal pain during past 12 months? BMXBMI—Body Mass Index (kg/m**2)
Alcohol..gm. Blood cadmium (ug/L) BMXWT—Weight (kg) BMXWT—Weight (kg) MCQ366c - Doctor told you to reduce salt in diet
Blood cadmium (ug/L) Red cell distribution width (%) Food.folate..mcg. GenderMale MCQ540—Ever seen a DR about this pain
Food.folate..mcg. Alcohol..gm. RBC folate (ng/mL) Weight_loss_surgery Age
Caffeine..mg..1 RBC folate (ng/mL) Caffeine..mg..1 SMQ856—Last 7-d worked at job not at home? SMQ856—Last 7-d worked at job not at home?
RBC folate (ng/mL) Caffeine..mg..1 Red blood cell count (million cells/uL) MCQ371c - Are you now reducing salt in diet BMXWT—Weight (kg)

Legend (SHAP explanations): covariate value on the x-axis and change in log-odds on the y-axis; the red line represents the relationship between the covariate and the log-odds of insomnia. Splines relate the covariate value (x-axis) to the probability of insomnia (y-axis).

Correlation matrix of correlations between model gain statistics

Fig 1 shows that Gain and Cover were strongly positively correlated, with a correlation coefficient of 0.96. Pairs that were moderately positively correlated included gain and frequency (r = 0.61) and cover and frequency (r = 0.68). The pairs of univariable t-statistic and gain (r = -0.027), univariable t-statistic and cover (r = -0.067), univariable t-statistic and frequency (r = -0.046), multivariable t-statistic and gain (r = -0.13), multivariable t-statistic and cover (r = -0.16), multivariable t-statistic and frequency (r = -0.23), and multivariable t-statistic and univariable t-statistic (r = -0.076) all had weak negative correlations.

Fig 1. Comparisons of model gain statistics.


Discussion

In this retrospective, cross-sectional cohort of United States adults, machine learning models utilizing demographic, laboratory, physical examination, and lifestyle questionnaire data all had strong predictive accuracy, with mean AUROCs ranging from 0.818 to 0.865. In the machine learning models, the variables most strongly associated with a sleep disorder were depression (PHQ-9 score), weight, age, and waist circumference. XGBoost was chosen as the machine learning model of choice because it had the highest mean AUROC. Comparing the machine learning models was important to show that their performance metrics are similar across the distribution, ensuring that any differences in variable contributions are not due to differences in model performance.

In the field of machine learning, identifying the most important variables in predicting an outcome is crucial. Our study reveals that different measures of feature importance result in wide variability in selecting the top 10 covariates in the final model. This discrepancy arises from the varied methods used to assess which covariates contribute most to the model; for example, linear regression relies on a least-squares criterion that treats estimates as non-interactive. In contrast, machine learning models construct feature importance metrics from gain, cover, and frequency statistics, resulting in a different set of top covariates [17]. The interaction of these biomolecular pathways is challenging to comprehend, and traditional regression models may not effectively account for such complex interactions [27]. Therefore, we propose that machine learning methods utilizing gain, cover, and frequency model selection statistics are better equipped to handle these complexities and provide a more accurate representation of the most important covariates in predicting outcomes.

In the context of feature selection in machine learning, it is crucial to recognize that each method yields a different set of best covariates. As such, different model selection statistics need to be combined to determine the best approach. In this study, we evaluated three measures of machine learning feature importance (cover, gain, and frequency) and two measures for regression (univariable and multivariable t-statistics). We found a strong correlation among the machine learning importance metrics of frequency, gain, and cover. However, there were weak, and sometimes negative, correlations between the feature importance ranks of machine learning models and those of univariable and multivariable regression. These findings suggest that complex interactions occur within the machine learning models that are not accounted for in multivariable regression. As such, interpreting the multivariable results alone may yield an inaccurate representation of the importance of these covariates [28].

In modeling, understanding which covariates are important due to their interactions with other covariates or on their own is challenging. Accounting for confounding variables has always been difficult, and multivariable regression is the most common approach, but it cannot efficiently account for every possible interaction. It is impossible to run all the pairwise, three-way, four-way, and five-way interactions present in a multivariable model efficiently [29]. Thus, the most efficient way to capture these interactions is through machine learning models that iterate through the data, develop the most efficient models, and are cross-validated and effectively tested through train-test splits. Therefore, the large discrepancy between the feature ranks of univariable and multivariable regression and those of the machine learning models highlights the importance of accounting for interaction terms through machine learning methods.

Our study evaluates how differences in the handling of interaction terms lead to differences in model feature statistic rankings. By efficiently visualizing the relationship between each covariate and the outcome and by accounting for confounding variables through machine learning, we can better identify the most important variables for further investigation in prospective studies. We therefore argue that machine learning offers a way of evaluating variables beyond traditional regression, as it can account for confounding variables and identify important variables for future studies.

Limitations

This study has both strengths and limitations. The utilization of the NHANES dataset, which is a large retrospective cohort, allows for the selection of a substantial sample size, evaluation of data quality, and broad generalizability. However, it also carries the limitations of retrospective studies, such as reliance on self-reported surveys to obtain information on the outcome of interest and lifestyle choices. Prospective studies with automated measurements of foods may be more accurate, but they may not have the advantage of including a larger volume of participants through self-reported information. Another limitation is the voluntary nature of the cohort, which may introduce selection bias. However, the demographic diversity of the cohort analyzed suggests that the findings may still be generalizable to other cohorts. It is important to note that while this study focused on machine learning models and traditional statistical models, other models that are not linear or involve machine learning could be explored in future studies.

Conclusion

Machine learning models offer information beyond that provided by regression models when ranking variable importance for predicting insomnia.

Data Availability

The data from this cohort are freely available without restriction and can be found on the NHANES section of the CDC website: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?cycle=2017-2020.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1. Buysse DJ. Insomnia. JAMA. 2013;309(7):706–16. doi: 10.1001/jama.2013.193. PMCID: PMC3632369.
  • 2. Blake MJ, Trinder JA, Allen NB. Mechanisms underlying the association between insomnia, anxiety, and depression in adolescence: Implications for behavioral sleep interventions. Clin Psychol Rev. 2018;63:25–40. doi: 10.1016/j.cpr.2018.05.006.
  • 3. Di H, Guo Y, Daghlas I, Wang L, Liu G, Pan A, et al. Evaluation of Sleep Habits and Disturbances Among US Adults, 2017–2020. JAMA Netw Open. 2022;5(11):e2240788. doi: 10.1001/jamanetworkopen.2022.40788. PMCID: PMC9644264.
  • 4. Pavlova MK, Latreille V. Sleep Disorders. Am J Med. 2019;132(3):292–9. doi: 10.1016/j.amjmed.2018.09.021.
  • 5. Muth CC. Sleep-Wake Disorders. JAMA. 2016;316(21):2322. doi: 10.1001/jama.2016.17785.
  • 6. Wesselius HM, van den Ende ES, Alsma J, Ter Maaten JC, Schuit SCE, Stassen PM, et al. Quality and Quantity of Sleep and Factors Associated With Sleep Disturbance in Hospitalized Patients. JAMA Intern Med. 2018;178(9):1201–8. doi: 10.1001/jamainternmed.2018.2669. PMCID: PMC6142965.
  • 7. Edinger JD. Classifying insomnia in a clinically useful way. J Clin Psychiatry. 2004;65 Suppl 8:36–43.
  • 8. Frydman D. [Individual evolution of idiopathic insomnia]. Waking Sleeping. 1979;3(1):51–5.
  • 9. Goldberg LD. Managing insomnia in an evolving marketplace. Am J Manag Care. 2006;12(8 Suppl):S212–3.
  • 10. Medina-Chávez JH, Fuentes-Alexandro SA, Gil-Palafox IB, Adame-Galván L, Solís-Lam F, Sánchez-Herrera LY, et al. [Clinical practice guideline. Diagnosis and treatment of insomnia in the elderly]. Rev Med Inst Mex Seguro Soc. 2014;52(1):108–19.
  • 11. Roth T. Introduction—Advances in our understanding of insomnia and its management. Sleep Med. 2007;8 Suppl 3:25–6. doi: 10.1016/j.sleep.2007.10.001.
  • 12. Spiegelhalder K, Espie C, Nissen C, Riemann D. Sleep-related attentional bias in patients with primary insomnia compared with sleep experts and healthy controls. J Sleep Res. 2008;17(2):191–6. doi: 10.1111/j.1365-2869.2008.00641.x.
  • 13. Tsuchihashi-Makaya M, Matsuoka S. Insomnia in Heart Failure. Circ J. 2016;80(7):1525–6. doi: 10.1253/circj.CJ-16-0501.
  • 14. Wittchen HU, Krause P, Höfler M, Pittrow D, Winter S, Spiegel B, et al. [NISAS-2000: The "Nationwide Insomnia Screening and Awareness Study". Prevalence and interventions in primary care]. Fortschr Med Orig. 2001;119(1):9–19.
  • 15. Yoshihisa A, Kanno Y, Takeishi Y. Insomnia and Cardiac Events in Patients With Heart Failure - Reply. Circ J. 2016;81(1):126. doi: 10.1253/circj.CJ-16-1198.
  • 16. Castro HM, Ferreira JC. Linear and logistic regression models: when to use and how to interpret them? J Bras Pneumol. 2023;48(6):e20220439. doi: 10.36416/1806-3756/e20220439. PMCID: PMC9747134.
  • 17. Huang AA, Huang SY. Increasing transparency in machine learning through bootstrap simulation and shapely additive explanations. PLoS One. 2023;18(2):e0281922. doi: 10.1371/journal.pone.0281922. PMCID: PMC9949629.
  • 18. Gomila R. Logistic or linear? Estimating causal effects of experimental treatments on binary outcomes using regression analysis. J Exp Psychol Gen. 2021;150(4):700–9. doi: 10.1037/xge0000920.
  • 19. Richardson AM, Joshy G, D’Este CA. Understanding statistical principles in linear and logistic regression. Med J Aust. 2018;208(8):332–4. doi: 10.5694/mja17.00222.
  • 20. Huang AA, Huang SY. Use of machine learning to identify risk factors for insomnia. PLoS One. 2023;18(4):e0282622. doi: 10.1371/journal.pone.0282622. PMCID: PMC10096447.
  • 21. Baik SM, Kim KT, Lee H, Lee JH. Machine learning algorithm for early-stage prediction of severe morbidity in COVID-19 pneumonia patients based on bio-signals. BMC Pulm Med. 2023;23(1):121. doi: 10.1186/s12890-023-02421-8.
  • 22. Cai Y, Su H, Si Y, Ni N. Machine learning-based prediction of diagnostic markers for Graves’ orbitopathy. Endocrine. 2023. doi: 10.1007/s12020-023-03349-z.
  • 23. Dos Reis AHS, de Oliveira ALM, Fritsch C, Zouch J, Ferreira P, Polese JC. Usefulness of machine learning softwares to screen titles of systematic reviews: a methodological study. Syst Rev. 2023;12(1):68. doi: 10.1186/s13643-023-02231-3.
  • 24. Meza Ramirez CA, Greenop M, Almoshawah YA, Martin Hirsch PL, Rehman IU. Advancing cervical cancer diagnosis and screening with spectroscopy and machine learning. Expert Rev Mol Diagn. 2023. doi: 10.1080/14737159.2023.2203816.
  • 25. Mohebi M, Amini M, Alemzadeh-Ansari MJ, Alizadehasl A, Rajabi AB, Shiri I, et al. Post-revascularization Ejection Fraction Prediction for Patients Undergoing Percutaneous Coronary Intervention Based on Myocardial Perfusion SPECT Imaging Radiomics: a Preliminary Machine Learning Study. J Digit Imaging. 2023. doi: 10.1007/s10278-023-00820-1.
  • 26. Liu Q, Gui D, Zhang L, Niu J, Dai H, Wei G, et al. Simulation of regional groundwater levels in arid regions using interpretable machine learning models. Sci Total Environ. 2022;831:154902. doi: 10.1016/j.scitotenv.2022.154902.
  • 27. Bzdok D, Altman N, Krzywinski M. Statistics versus machine learning. Nat Methods. 2018;15(4):233–4. doi: 10.1038/nmeth.4642. PMCID: PMC6082636.
  • 28. Dharma C, Fu R, Chaiton M. Table 2 Fallacy in Descriptive Epidemiology: Bringing Machine Learning to the Table. Int J Environ Res Public Health. 2023;20(13):6194. doi: 10.3390/ijerph20136194. PMCID: PMC10340623.
  • 29. Bunce C, Czanner G, Grzeda MT, Dore CJ, Freemantle N. Ophthalmic statistics note 12: multivariable or multivariate: what’s in a name? Br J Ophthalmol. 2017;101(10):1303–5. doi: 10.1136/bjophthalmol-2017-310846.

Decision Letter 0

Sergio A Useche

18 Sep 2023

PONE-D-23-11459
Comparison of Model Feature Importance Statistics to Identify Covariates that Contribute Most to Model Accuracy
PLOS ONE

Dear Dr. Huang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

First of all, apologies for the delay. It has been a real struggle to find suitable (and available) reviewers for this submission. Below, you will find that our referees ask for major revisions in order to read and reconsider your paper on the basis of a revised version. Please keep in mind the fundamental flaws highlighted in this report, and please work carefully on providing more technical details about the study. In addition, I would invite the authors to double-check the coherence among title, aim, methods, conclusions, and the context covered by the study. The full set of comments raised by your Reviewers is appended below.

Please submit your revised manuscript by Nov 02 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Sergio A. Useche, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Additional comments for the author:

1. The title, Comparison of model feature importance statistics to identify the covariates that most contribute to model accuracy, doesn’t tell in what situation the comparison was made, please include the information that it was to insomnia data.

2. The Dataset and Cohort Selection section should inform the sample size.

3. Methods section can also include data analysis methods, such as definitions or a brief description of machine learning models and model gain statistics.

4. The confusion matrix (Figure 1, line 203) would not be the correlation matrix ? A confusion matrix table reports the absolute frequency of true positives, false negatives, false positives, and true negatives.

5. Selecting a better model that identifies the most relevant risk factors requires not only good model fit parameters but also smaller errors. Residual measures should also be considered for comparison models (such as MAE, MSE, MAPE,...).

Reviewer #2: The article identifies features that best predict insomnia using a machine learning model. Based on the information provided, I think the paper overall was solid methodologically, although a few things can be clarified. I have a few suggestions and comments:

1. I think my major confusion is it is unclear if the aim of the paper is methodological or if it’s trying to contribute to the insomnia literature. If the goal is content, the title will need to be changed to reflect that. If the goal is more of a methodological contribution, then probably the introduction needs to be reworked, you will want to start with the contributions of machine learning models and then discuss how you’re going to use this paper to illustrate this point, such as in insomnia research. There are two different aims here: Line 77-82, as well as Line 89-92. Of course, you can still contribute to the literature of insomnia research while being a methods paper, but the focus is unclear.

2. It is unclear what software the authors used to run the analysis, so it is difficult to assess whether things were done properly. Is it Python, R, or something else? Please provide citations for gain, cover, and Shapley values and which packages were used to perform these. If you calculated them from scratch, then provide the GitHub link or codes to ensure reproducibility, which is the policy of PLOS ONE. I’m less familiar with this approach, I’m most familiar with the agnostic method of variable ranking (perhaps cite the book Interpretable Machine Learning). I know Greenwell’s R package to do this, not sure what was used in this analysis. Based on your description, your way of variable importance makes sense, but I’d like to see more citation to see what your contributions are and if this method has been used before, or you wrote new packages.

Greenwell B.M., Boehmke B.C., McCarthy A.J. A Simple and Effective Model-Based Variable Importance Measure. arXiv. 2018;1805.04755.

Molnar, C. (2022). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable (2nd ed.). christophm.github.io/interpretable-ml-book/

3. Related to above, this comes back to my confusion whether the aim is to suggest other researchers to focus more on ML when identifying the most important predictors instead of using multivariable regression, or if you simply want to contribute to insomnia literature using ML. If the former, please state how other researchers can use the technique in their work. A few other methods papers have been written on variable importance, such as the one below. This paper has a step by step of how to identify the most important predictors when running a descriptive epidemiological analysis, which is what this is. Again, not sure if that’s the aim, if you want to encourage other researchers to apply ML instead of multivariable regression. I’d recommend looking into this work; they also discussed what you put under discussion about interaction terms and the challenge when different models chose inconsistent variables as the most important predictors. I believe this is an important future direction that should be continually discussed.

Dharma C, Fu R, Chaiton M. Table 2 Fallacy in Descriptive Epidemiology: Bringing Machine Learning to the Table. Int J Environ Res Public Health. 2023 Jun 21;20(13):6194. doi: 10.3390/ijerph20136194. PMID: 37444042; PMCID: PMC10340623.

4. I do not understand why the univariable filter was first done before the machine learning models, I thought one of the benefits of ML is to indeed filter out the unnecessary variables. Again, I will need to see citations for this or explain why this was done. The theme is, there is a lack of reliable citation for how the methods were performed and why certain decisions were made.

5. Minor: Unless I misunderstood, “Legend” caption under Table 3 should have gone under Figure 1, otherwise the description does not make sense about the y-axis and x-axis in Table 3.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Jul 2;19(7):e0306359. doi: 10.1371/journal.pone.0306359.r002

Author response to Decision Letter 0


19 Oct 2023

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Additional comments for the author:

1. The title, Comparison of model feature importance statistics to identify the covariates that most contribute to model accuracy, doesn’t tell in what situation the comparison was made, please include the information that it was to insomnia data.

2. The Dataset and Cohort Selection section should inform the sample size.

3. Methods section can also include data analysis methods, such as definitions or a brief description of machine learning models and model gain statistics.

4. The confusion matrix (Figure 1, line 203) would not be the correlation matrix ? A confusion matrix table reports the absolute frequency of true positives, false negatives, false positives, and true negatives.

5. Selecting a better model that identifies the most relevant risk factors requires not only good model fit parameters but also smaller errors. Residual measures should also be considered for comparison models (such as MAE, MSE, MAPE,...).

Reviewer #2: The article identifies features that best predict insomnia using a machine learning model. Based on the information provided, I think the paper overall was solid methodologically, although a few things can be clarified. I have a few suggestions and comments:

1. I think my major confusion is it is unclear if the aim of the paper is methodological or if it’s trying to contribute to the insomnia literature. If the goal is content, the title will need to be changed to reflect that. If the goal is more of a methodological contribution, then probably the introduction needs to be reworked, you will want to start with the contributions of machine learning models and then discuss how you’re going to use this paper to illustrate this point, such as in insomnia research. There are two different aims here: Line 77-82, as well as Line 89-92. Of course, you can still contribute to the literature of insomnia research while being a methods paper, but the focus is unclear.

2. It is unclear what software the authors used to run the analysis, so it is difficult to assess whether things were done properly. Is it Phyton, R, or something else? Please provide citations for gain, cover, and Shapley values and which packages were used to perform these. If you calculated them from scratch, then provide the GitHub link or codes to ensure reproducibility, which is the policy of PLOSOne. I’m less familiar with this approach, I’m most familiar with the agnostic method of variable ranking (perhaps cite the book Interpretable Machine learning). I know Greenwell’s R package to do this, not sure what was used in this analysis. Based on your description, your way of variable importance makes sense, but I’d like to see more citation to see what your contributions are and if this method has been used before, or you wrote new packages.

Greenwell B.M., Boehmke B.C., McCarthy A.J. A Simple and Effective Model-Based Variable Importance Measure. arXiv. 20181805.04755

Molnar, C. (2022). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable (2nd ed.).christophm.github.io/interpretable-ml-book/

3. Related to above, this comes back to my confusion whether the aim is to suggest other researchers to focus more on ML when identifying the most important predictors instead of using multivariable regression, or if you simply want to contribute to insomnia literature using ML. If the former, please state how other researchers can use the technique in their work. A few other methods papers have been written on variable importance, such as the one below. This paper has a step by step of how to identify the most important predictors when running a descriptive epidemiological analysis, which is what this is. Again, not sure if that’s the aim, if you want to encourage other researchers to apply ML instead of multivariable regression. I’d recommend looking into this work; they also discussed what you put under discussion about interaction terms and the challenge when different models chose inconsistent variables as the most important predictors. I believe this is an important future direction that should be continually discussed.

Dharma C, Fu R, Chaiton M. Table 2 Fallacy in Descriptive Epidemiology: Bringing Machine Learning to the Table. Int J Environ Res Public Health. 2023 Jun 21;20(13):6194. doi: 10.3390/ijerph20136194. PMID: 37444042; PMCID: PMC10340623.

4. I do not understand why the univariable filter was applied before the machine learning models; I thought one of the benefits of ML is precisely to filter out unnecessary variables. Again, I would need to see citations for this, or an explanation of why it was done. The recurring theme is a lack of reliable citations for how the methods were performed and why certain decisions were made.
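For reference, the kind of univariable pre-filter the reviewer is asking about can be sketched as below; the significance cutoff and column names are hypothetical, not taken from the manuscript:

    import pandas as pd
    import statsmodels.api as sm

    def univariable_filter(df: pd.DataFrame, outcome: str, alpha: float = 0.05):
        """Keep covariates whose univariate logistic p-value is below alpha."""
        keep = []
        for col in df.columns.drop(outcome):
            X = sm.add_constant(df[[col]])             # intercept + single covariate
            fit = sm.Logit(df[outcome], X).fit(disp=0)
            if fit.pvalues[col] < alpha:
                keep.append(col)
        return keep

    # usage (hypothetical): selected = univariable_filter(nhanes_df, outcome="insomnia")

Whether such pre-filtering helps or harms a tree ensemble is exactly the question the reviewer raises, since gradient boosting can down-weight uninformative covariates on its own.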

5. Minor: Unless I misunderstood, the “Legend” caption under Table 3 should have gone under Figure 1; otherwise the description of the y-axis and x-axis does not make sense for Table 3.

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Attachment

Submitted filename: 230918 Response to Reviewers.docx

pone.0306359.s001.docx (17.3KB, docx)

Decision Letter 1

Sergio A Useche

18 Dec 2023

PONE-D-23-11459R1

Comparison of Model Feature Importance Statistics to Identify Covariates that Contribute Most to Model Accuracy

PLOS ONE

Dear Dr. Huang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Thank you for your amendments and responses. Your referees have provided feedback on your revisions. Overall, the revisions seem sound, but a few more comments need your attention. Please find them below and address them to the best of your ability, so that a prompt final decision can be made on the paper should both referees suggest its acceptance.

Please submit your revised manuscript by Feb 01 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Sergio A. Useche, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 1. (Major) The authors emphasize the importance of comparing how various model metrics rank covariates, given the critical role of sleep in an individual's physical and mental health. They applied four different machine learning models to an insomnia dataset and found that these models provided additional insight into the ranking of covariates for predicting insomnia compared to regression models, on the grounds that machine learning models take into account the existing collinearity between covariates when ranking the important ones, as stated in the “Discussion” (p 263-265): “Therefore, we argue that machine learning brings a new way of evaluating variables beyond traditional regression, as it can account for confounding variables and identify important variables for future studies."

However, variable selection metrics, such as gain in machine learning, generally cannot directly account for confounding variables or evaluate the existing correlations or collinearities among covariates. Gain is typically calculated from a feature's ability to reduce impurity or enhance model accuracy. Collinearity among covariates can, however, indirectly impact the variable selection process: in the presence of high correlations, the gain of one variable may be dampened by another, correlated one, leading to an underestimation of its true importance. Therefore, correlation analysis should be performed before the selection of covariates, or other prior analyses should be considered, such as variable normalization or standardization and principal component analysis.
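The dampening effect the reviewer describes is easy to reproduce on simulated data. The sketch below (entirely invented, not the study's data) trains an XGBoost classifier on two nearly identical covariates plus an independent noise feature; the gain attributable to the underlying signal tends to be split between the correlated pair:

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(42)
    x1 = rng.normal(size=2000)
    x2 = x1 + rng.normal(scale=0.05, size=2000)  # near-duplicate of x1
    x3 = rng.normal(size=2000)                   # independent noise
    y = (x1 + rng.normal(scale=0.5, size=2000) > 0).astype(int)

    X = np.column_stack([x1, x2, x3])
    model = xgb.XGBClassifier(n_estimators=50, max_depth=3)
    model.fit(X, y)

    # Gain for f0 and f1 tends to be shared between the correlated pair,
    # understating how important the underlying signal is relative to f2.
    print(model.get_booster().get_score(importance_type="gain"))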

Another way to ensure that machine learning models added information to the covariate selection ranking is for both models to exhibit similar performance. However, no performance evaluation metrics are reported for the regression models for comparison. To assess which model provides the best prediction of insomnia based on covariates, in addition to observing the fit of both models, it would be necessary to evaluate the residuals (the difference between the observed insomnia probability in the data and the insomnia probability estimated by the models), which can be summarized through metrics such as MAE, MSE, and MAPE.

Although the authors state clearly in the “Introduction” (p 74) that “While traditional statistical models focus on hypothesis testing and estimation, machine learning models aim to predict outcomes by learning patterns in the data”, the study concludes only that the machine learning and regression models selected different covariates when ordered by importance in predicting the probability of insomnia, and that there was no relationship between the models' rankings of important variables associated with sleep disorder.

Since the selection of predictor variables for insomnia may differ between the models, it is important to carefully consider the performance of the models and the predicted values of both machine learning and regression models for a comparison between them. Therefore:

1. The discussion will need to be revised.

2. The same performance metrics should be evaluated for both machine learning and regression models.

3. It would be valuable to incorporate the residuals analysis into the study.

2. (Minor) In the "Dataset and Cohort Selection" section, lines 112 and 113 state: "... All patients in the dataset with full insomnia data were included in this study." The question posed in the initial review regarding the number of patients included was not addressed (the Dataset and Cohort Selection section should report the sample size).

Reviewer #2: Thank you for the opportunity to rereview this paper. At first I thought I had received the wrong version, since the authors did not give the typical point-by-point response to the points raised by the reviewers, but I see now that the revisions have been made. The authors have sufficiently addressed all the comments I raised. One minor thing that I think would benefit the paper greatly is to clarify in the introduction what has and has not been done in the literature. Perhaps write an explicit sentence such as, “To date, most studies have only examined predictors of insomnia with the use of univariate and multivariable regressions.” (if this is correct, or note if no studies have looked at it at all, whether with ML or traditional regression). You can then proceed to why ML can provide additional benefits (i.e., identifying previously unknown interactions, etc.), and hence why the current study is needed. That will make it an easier read. Otherwise, well done.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Jul 2;19(7):e0306359. doi: 10.1371/journal.pone.0306359.r004

Author response to Decision Letter 1


19 Mar 2024

We would like to extend our sincere gratitude to the reviewers for their insightful comments and constructive criticisms. Their detailed feedback has been invaluable in guiding our revisions, allowing us to address critical aspects of our methodology and analysis that required further clarification and enhancement. The reviewers’ expertise and thoughtful suggestions have significantly contributed to the depth and rigor of our study, ensuring a more comprehensive and robust examination of the predictive models utilized in the context of insomnia prediction. Their contributions have not only facilitated methodological improvements but also enriched our discussion, leading to a more nuanced understanding of the complexities involved in variable selection within machine learning and regression models. We are grateful for the opportunity to refine our work through this collaborative and iterative process, which has undeniably strengthened the quality and impact of our research. We have responded to all reviewer feedback to the best of our abilities.

Reviewer Comments Line by Line Response:

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 1. (Major) The authors emphasize the importance of comparing how various model metrics rank covariates, given the critical role of sleep in an individual's physical and mental health. They applied four different machine learning models to an insomnia dataset and found that these models provided additional insight into the ranking of covariates for predicting insomnia compared to regression models, on the grounds that machine learning models take into account the existing collinearity between covariates when ranking the important ones, as stated in the “Discussion” (p 263-265): “Therefore, we argue that machine learning brings a new way of evaluating variables beyond traditional regression, as it can account for confounding variables and identify important variables for future studies."

However, variable selection metrics, such as gain in machine learning, generally cannot directly account for confounding variables or evaluate the existing correlations or collinearities among covariates. Gain is typically calculated from a feature's ability to reduce impurity or enhance model accuracy. Collinearity among covariates can, however, indirectly impact the variable selection process: in the presence of high correlations, the gain of one variable may be dampened by another, correlated one, leading to an underestimation of its true importance. Therefore, correlation analysis should be performed before the selection of covariates, or other prior analyses should be considered, such as variable normalization or standardization and principal component analysis.

Another way to ensure that machine learning models added information to the covariate selection ranking is for both models to exhibit similar performance. However, no performance evaluation metrics are reported for the regression models for comparison. To assess which model provides the best prediction of insomnia based on covariates, in addition to observing the fit of both models, it would be necessary to evaluate the residuals (the difference between the observed insomnia probability in the data and the insomnia probability estimated by the models), which can be summarized through metrics such as MAE, MSE, and MAPE.

Although the authors state clearly in the “Introduction” (p 74) that “While traditional statistical models focus on hypothesis testing and estimation, machine learning models aim to predict outcomes by learning patterns in the data”, the study concludes only that the machine learning and regression models selected different covariates when ordered by importance in predicting the probability of insomnia, and that there was no relationship between the models' rankings of important variables associated with sleep disorder.

Since the selection of predictor variables for insomnia may differ between the models, it is important to carefully consider the performance of the models and the predicted values of both machine learning and regression models for a comparison between them. Therefore:

1. The discussion will need to be revised.

Overall Response: Thank you for these comments, we have revised the manuscript accordingly.

In our exploration of the predictive modeling of insomnia using both traditional statistical and machine learning approaches, we identified several methodological and analytical challenges that highlight the need for ongoing research in this field. The application of machine learning models to insomnia datasets revealed the potential for these models to provide a nuanced understanding of covariate importance, taking into account the collinearity among variables. This stands in contrast to traditional regression models, which may not fully account for such complexities. However, the reliance on variable selection metrics in machine learning, such as gain, necessitates a careful consideration of their limitations, particularly regarding the indirect treatment of confounding variables and collinearities.

Our analysis underscores the importance of incorporating additional statistical methods, such as correlation analysis, variable normalization, or principal component analysis, prior to the selection of covariates. These methods could mitigate the effects of collinearity and provide a more accurate assessment of variable importance. Furthermore, the comparison between machine learning and regression models in our study was primarily qualitative, based on the ranking of covariates. To strengthen this comparison, a quantitative assessment involving performance metrics and a residuals analysis would be invaluable. Such an analysis would offer a clearer picture of the predictive accuracy of each model type and the reliability of their variable importance rankings.
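A minimal sketch of the pre-selection steps mentioned above (correlation screening, standardization, and principal component analysis), assuming scikit-learn and invented data rather than the study's actual workflow:

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 6))            # hypothetical covariate matrix

    corr = np.corrcoef(X, rowvar=False)      # pairwise covariate correlations
    Xs = StandardScaler().fit_transform(X)   # rescale to mean 0, variance 1
    pca = PCA(n_components=0.95)             # keep components explaining 95% of variance
    X_pc = pca.fit_transform(Xs)

    print(corr.round(2))
    print("components kept:", pca.n_components_)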

Moreover, our study highlights the gap in the literature regarding the comprehensive evaluation of insomnia predictors using machine learning techniques. While traditional statistical models have predominantly focused on hypothesis testing and estimation, machine learning models present an opportunity to predict outcomes by identifying complex patterns in the data. This distinction suggests that machine learning could complement traditional approaches by uncovering previously unknown interactions and predictors of insomnia. Yet, the explicit comparison of these models' performance and the detailed examination of their predictive capabilities remain areas ripe for further investigation.

In addition, the documentation of our dataset and cohort selection process revealed the necessity for greater transparency in reporting research methodologies. Specifically, providing detailed information about the sample size and selection criteria is essential for ensuring the reproducibility and validity of research findings.

In conclusion, our study contributes to the growing body of research on insomnia prediction by highlighting the potential synergies between machine learning and traditional statistical models. However, it also emphasizes the need for methodological enhancements and deeper analytical rigor to fully leverage the strengths of each modeling approach. Future research should focus on addressing these limitations through the integration of additional statistical techniques, comprehensive model performance evaluations, and clearer articulation of research contributions within the context of existing literature.

2. (Minor) In the "Dataset and Cohort Selection" section, lines 112 and 113 state: "... All patients in the dataset with full insomnia data were included in this study." The question posed in the initial review regarding the number of patients included was not addressed (the Dataset and Cohort Selection section should report the sample size).

We have addressed this minor concern in the Methods section of the paper.

Reviewer #2: Thank you for the opportunity to rereview this paper. At first I thought I had received the wrong version, since the authors did not give the typical point-by-point response to the points raised by the reviewers, but I see now that the revisions have been made. The authors have sufficiently addressed all the comments I raised.

We thank the reviewer for the comments.

Attachment

Submitted filename: Response to Reviewers.docx

pone.0306359.s002.docx (16.5KB, docx)

Decision Letter 2

Sergio A Useche

16 Jun 2024

Comparison of Model Feature Importance Statistics to Identify Covariates that Contribute Most to Model Accuracy

PONE-D-23-11459R2

Dear Dr. Huang,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up to date by logging into Editorial Manager® and clicking the 'Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Sergio A. Useche, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Thanks for your amendments! The paper is publishable in its current form.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

Acceptance letter

Sergio A Useche

24 Jun 2024

PONE-D-23-11459R2

PLOS ONE

Dear Dr. Huang,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Sergio A. Useche

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    Attachment

    Submitted filename: 230918 Response to Reviewers.docx

    pone.0306359.s001.docx (17.3KB, docx)
    Attachment

    Submitted filename: Response to Reviewers.docx

    pone.0306359.s002.docx (16.5KB, docx)

    Data Availability Statement

    Data Availability: The data from this cohort is freely available without restriction and can be found on the NHANES section of the CDC website. Data Share Statement: Data described in the manuscript are present at: https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?cycle=2017-2020.

