PLOS One. 2023 Jun 8;18(6):e0286883. doi: 10.1371/journal.pone.0286883

Machine learning application for predicting smoking cessation among US adults: An analysis of waves 1-3 of the PATH study

Mona Issabakhsh 1,*, Luz Maria Sánchez-Romero 1, Thuy T T Le 2, Alex C Liber 1, Jiale Tan 3, Yameng Li 1, Rafael Meza 4, David Mendez 2, David T Levy 1
Editor: Mohammad Amin Fraiwan 5
PMCID: PMC10249849  PMID: 37289765

Abstract

Identifying determinants of smoking cessation is critical for developing optimal cessation treatments and interventions. Machine learning (ML) is becoming more prevalent for smoking cessation success prediction in treatment programs. However, only individuals with an intention to quit smoking cigarettes participate in such programs, which limits the generalizability of the results. This study applies data from the Population Assessment of Tobacco and Health (PATH), a United States longitudinal nationally representative survey, to select primary determinants of smoking cessation and to train ML classification models for predicting smoking cessation among the general population. An analytical sample of 9,281 adult current established smokers from the PATH survey wave 1 was used to develop classification models to predict smoking cessation by wave 2. Random forest and gradient boosting machines were applied for variable selection, and the SHapley Additive exPlanation (SHAP) method was used to show the effect direction of the top-ranked variables. The final model predicted wave 2 smoking cessation for current established smokers in wave 1 with an accuracy of 72% in the test dataset. The validation results showed that a similar model could predict wave 3 smoking cessation of wave 2 smokers with an accuracy of 70%. Our analysis indicated that more past 30-day e-cigarette use at the time of quitting, less past 30-day cigarette use before quitting, ages older than 18 at smoking initiation, fewer years of smoking, poly tobacco past 30-day use before quitting, and higher BMI resulted in higher chances of cigarette cessation for adult smokers in the US.

Introduction

Smoking prevalence in the United States (US) has significantly decreased over time (from 23.3% in 2000 [1] to 13.7% in 2018) [2]. Still, cigarette smoking continues to be one of the most significant public health issues, responsible for about 480,000 deaths annually in the US [3]. Smoking cessation is the most cost-effective means of tobacco-related disease prevention [4]. The global and national importance of smoking cessation has been discussed widely in the literature [5–7]. To promote smoking cessation, the World Health Organization (WHO) has also emphasized strengthening its Framework Convention on Tobacco Control implementation in all countries [8]. Although most cigarette users want to quit smoking, and more than half of current smokers make a quit attempt every year, fewer than 10% remain abstinent for at least six months [9]. Identifying the factors driving and sustaining smoking cessation is thus a critical need to address this epidemic effectively.

Current literature on smoking cessation is dominated by population-based studies that account for a small number of predictors. Those studies often require estimating state transition rates for cessation, which makes model application and prediction more challenging [10, 11]. On the other hand, machine learning (ML) algorithms use a flexible model structure that can include multiple variables without needing state transition rates. ML has been used successfully in tobacco research for high-quality estimations and predictions and considering multiple predictors [12]. Researchers have applied ML algorithms to assist with patient smoking status classification from electronic medical records data [13]. ML models have also been used for smoking cessation prediction based on clinical data [14]. Lai et al. [15] developed ML algorithms for smoking cessation outcome prediction among current smokers enrolled in a cessation program at a medical center in Northern Taiwan. Coughlin et al. [9] used ML to predict abstinence from smoking using cognitive behavioral therapy data for tobacco dependence in the US. Medina and Mohaghegh [16] developed ML models to predict smoking cessation outcome among Quitline service users in New Zealand. These studies have mainly focused on cessation program data. Only individuals with an intention to quit participate in such programs, which limits the generalizability of the results. Applying ML to a nationally representative survey would provide, in contrast, a broad perspective on the factors that can lead to smoking cessation among the general population.

This study uses the US nationally representative longitudinal data from the Population Assessment of Tobacco and Health (PATH) [17] survey to develop ML predictive models (i.e., binary classifiers). Our objective is to analyze the smoking cessation process by distinguishing its important determinants and predicting smoking cessation after one data wave (roughly one year) for survey participants. This study should help us better understand how the transition between current and former cigarette use (a.k.a. smoking cessation) happens over time, and thereby better characterize factors that support individuals in their smoking cessation journey, both to confirm factors that have been established in the current literature, and to discover novel factors missed in previous studies. To the best of our knowledge, the PATH study has not previously been used to predict smoking cessation using ML algorithms. The outcome of this research is a set of significant determinants of smoking cessation, and accurate ML predictive models for smoking cessation, considering the nationally representative longitudinal data from the PATH survey.

Materials and methods

Data

We used data from the PATH survey, a nationally representative US longitudinal cohort study of tobacco use and its effects on population health [18]. We conducted a longitudinal analysis of PATH data in our study with one-time measurements. The same assumptions (as described below) are considered throughout the data cleaning and model development steps. We used the open-access PATH dataset (not the restricted version) [17], in which all data were fully deidentified. Therefore, Georgetown University and the University of Michigan Institutional Review Boards exempted our analysis from review. To develop the predictive models, we used unweighted PATH adult survey data (ages 18 and above) from wave 1 (September 2013 to December 2014) and wave 2 (October 2014 to October 2015) [17]. We considered current cigarette smokers at wave 1 and checked whether they quit by wave 2. We only focused on PATH waves 1 and 2 to elicit variables and attributes involved in smoking cessation before the use of e-cigarettes (specifically JUUL) became more widespread, to limit the effect of changes in cigarette smoking due to e-cigarette use (and its unstable patterns in the initial years) on our results [19].

Our baseline sample included current smokers in wave 1, who smoked 100 cigarettes or more during their lifetime and reported smoking every day or some days at the time of the survey. In other words, we considered current established smokers in wave 1. We tracked the participants’ smoking cessation (defined below) in wave 2. In total, 32,320 adult respondents were surveyed in PATH wave 1, among which 26,447 individuals also participated in wave 2. Out of all adults who participated in both waves, 9,281 were current established smokers in wave 1.

In this study, we applied binary classifiers to predict cigarette smoking cessation (quit or not quit). We defined cigarette smoking cessation for current established cigarette smokers in wave 1 as not reporting smoking a cigarette in the past 30 days and self-reporting having quit smoking cigarettes in wave 2 [20]. More specifically, individuals who answered “YES” to the question “Have you completely quit smoking cigarettes?” AND “NO” to the question “In the past 30 days, did you smoke a cigarette, even one or two puffs?” are considered to have quit smoking cigarettes in wave 2. We excluded participants who did not answer either of these questions. Of the 9,281 current established smokers in wave 1 who participated in wave 2, 710 had successfully quit by wave 2 (Fig 1).

$$
\text{Cessation outcome (binary)}: \quad
\begin{cases}
\text{Quit } (1) \\
\text{Not quit } (0)
\end{cases}
$$
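The outcome definition above can be sketched as a small labeling function. This is an illustrative sketch only: the function name, argument names, and the use of `None` for an unanswered question are assumptions, not PATH variable names or the authors' code. It also assumes that a respondent missing an answer to either question is excluded.

```python
from typing import Optional


def cessation_outcome(completely_quit: Optional[str],
                      smoked_past_30_days: Optional[str]) -> Optional[int]:
    """Return 1 (quit), 0 (not quit), or None (excluded) for one wave 2 respondent.

    Both argument names are hypothetical stand-ins for the two survey questions.
    """
    # Assumption: a respondent who skipped either question is excluded.
    if completely_quit is None or smoked_past_30_days is None:
        return None
    reported_quit = completely_quit.strip().upper() == "YES"
    no_recent_smoking = smoked_past_30_days.strip().upper() == "NO"
    # Quit requires BOTH conditions; everything else is "not quit".
    return 1 if (reported_quit and no_recent_smoking) else 0
```

Requiring both answers keeps a respondent who reports quitting but also smoked in the past 30 days in the "not quit" class.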

Fig 1. Analytical sample selection flowchart, PATH adult survey waves 1 and 2.

Data cleaning

Data cleaning is the process of removing (or fixing) incorrect, irrelevant, corrupted, incorrectly formatted, incomplete, or duplicate data within a dataset to ensure data quality. Data cleaning is essential in model development since it speeds up the model training process, improves model accuracy and interpretability, and reduces model complexity and overfitting [21]. In our analysis, we initially considered all PATH wave 1 adult survey variables (1,742 in total), mainly related to participants’ characteristics and tobacco product use habits. To this set, we added 24 PATH-derived variables related to participants’ past 30 days use (for different tobacco products in wave 1 and tobacco products other than cigarettes in wave 2) and their nicotine dependence, provided by The Center for the Assessment of Tobacco Regulations (CAsToR) Data Analysis and Dissemination Core [22]. We considered “tobacco products use in the past 30 days” in wave 1 to assess the relationship between participants’ tobacco product use in wave 1 and cigarette cessation by wave 2. Past 30 days use of tobacco products other than cigarettes is also considered for wave 2 to assess whether the use of other tobacco products correlates with cigarette cessation at the time of quitting.

We started the analysis with 1,766 variables in total. In the data cleaning process, we removed variables that exclusively targeted never and former smokers, survey design variables, and categorical variables with a single level (no variation). Variables were also merged as needed. For instance, in the wave 1 PATH survey, two separate variables show the amount usually paid for a pack of cigarettes in dollars and cents. The summation of these two variables shows the total amount paid for a pack of cigarettes. In this case, to calculate a single “cigarette price” variable, we multiplied the “dollar” variable by 100 and added it to the “cents” variable.
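As a small worked example of the merge described above, the sketch below combines the two price components into a single value expressed in cents. The function and argument names are hypothetical and do not correspond to PATH variable names.

```python
def cigarette_price_cents(dollars: int, cents: int) -> int:
    """Combine separate dollar and cent components into one price in cents."""
    return dollars * 100 + cents
```

For instance, a reported price of 6 dollars and 25 cents becomes 625 under this rule.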

After the initial data cleaning, we divided the remaining 1,096 variables into categories based on their similarities. From each category, we selected those variables that were most relevant to current smokers and smoking cessation, with the least number of missing samples. We then developed a correlation matrix and removed correlated variables to avoid collinearity [23, 24]. Because the number of “quit” instances in the dataset was limited (710 among 9,281), we filled in the missing data as the last step of data preparation instead of discarding records with missing values. Specifically, for categorical variables, we added “missing” as a factor level and labeled NA samples as “missing”. For numeric variables, we filled in the NAs with the average value of the variable. The final set included 181 variables that were considered for the model fitting. Fig 2 shows a summary of the data cleaning process, and the details of the data cleaning are also provided in the S1 Appendix.
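The two imputation rules described above can be sketched as follows. This is a minimal illustration using plain Python lists, with `None` standing in for NA; the function names are assumptions, and the original analysis was carried out in R.

```python
from statistics import mean


def impute_categorical(values, missing_label="missing"):
    """Treat NA as its own factor level by relabeling it as "missing"."""
    return [missing_label if v is None else v for v in values]


def impute_numeric(values):
    """Replace NA with the mean of the observed values of the same variable."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a variable with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]
```

Keeping "missing" as an explicit level preserves every record, which matters when only a small share of the sample belongs to the "quit" class.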

Fig 2. The data cleaning process.

Class imbalance

Among all smokers in PATH wave 1, only 7% quit smoking after one wave. This issue is called class imbalance in the ML literature [25–27], and is a challenge in deploying ML algorithms for smoking cessation prediction using both clinical [28] and survey data, due to the inherent low rate of quits. Class imbalance may result in models that overweight specificity (detecting “not quits”) over sensitivity (detecting “quits”), given that most individuals do not have the desired outcome (here, smoking cessation).

Advanced ML sampling techniques, including random over-sampling, random under-sampling, and ensemble-based methods, have been developed to deal with class imbalance [28]. Random over-sampling and under-sampling methods are mainly applied to decrease the effect of the skewed class distribution on the performance of the classifiers [29]. In random over-sampling, samples are replicated in the minority class in the training set (i.e., quit samples are replicated). In contrast, in random under-sampling, samples are discarded from the majority class in the training set (i.e., not quit samples are discarded) to balance the class distribution [28]. In ensemble-based methods (e.g., bagging), the total data is divided into n parts, and n different models are trained. Each model uses all the samples of the minority class and 1/n of the majority class [30]. Thereby, all samples of the majority class are used for training (no information loss), while samples of the minority class are used efficiently. Because of the skewed class distribution in our data, class imbalance was a major issue. We applied random sampling and ensemble-based techniques for feature selection and predictive model training to overcome this problem.
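The three sampling strategies described above can be sketched on plain lists of training samples. This is an illustrative sketch under stated assumptions: the function names are hypothetical, over-sampling is assumed to replicate minority samples until both classes are the same size, and the way the n bagged models are later combined is not specified in the text and is therefore left out.

```python
import random


def random_over_sample(minority, majority, rng):
    """Replicate minority samples (with replacement) until the classes are balanced."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority


def random_under_sample(minority, majority, rng):
    """Discard majority samples at random until the classes are balanced."""
    return minority, rng.sample(majority, len(minority))


def bagging_partitions(minority, majority, n, rng):
    """Split the majority class into n disjoint parts; each part is paired with
    ALL minority samples and would be used to train one of n models."""
    shuffled = majority[:]
    rng.shuffle(shuffled)
    return [(minority, shuffled[i::n]) for i in range(n)]
```

Under-sampling throws majority samples away, whereas the bagging partition uses every majority sample exactly once across the n training sets.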

Variable importance and direction effect

We used ML model fitting algorithms to find the most relevant and significant variables in predicting smoking cessation among those 181 variables selected in the data cleaning step (Fig 2). To obtain significant variables arranged in the order of importance, we applied two tree-based algorithms: Random Forest (RF) [31] and Gradient Boosting Machine (GBM) [32]. Tree-based algorithms are known for handling class imbalance efficiently [28]. GBM and RF are both tree-based algorithms but differ in how the trees are built and how the results are combined. GBM builds one tree at a time; therefore, each new tree helps correct errors made by previously trained trees. RF, on the other hand, trains each tree independently using a random sample of the data. RF and GBM provide variable importance, representing the average increase in prediction error or decrease in prediction accuracy when the variable is removed from the model [33]. Because of the differences between the structure of GBM and RF, the variable importance provided by these two algorithms is not completely identical. Therefore, we combined the top variables selected by both algorithms to increase the probability of identifying variables relevant to smoking cessation.

To determine the direction of the relationship between each variable and cessation, we used the SHapley Additive exPlanation (SHAP) [34] technique. SHAP is a coalitional game theory method that explains instances’ predictions by computing each variable’s contribution (and the effect direction of its contribution) to the prediction. We used TreeSHAP [35], a variant of SHAP for tree-based ML algorithms, to determine the effect direction of each variable on cigarette cessation. TreeSHAP is a commonly used method for variable interpretation, particularly in the public health domain. It has been used in tobacco research [16], cancer prevention and control research [36], and other applications [37]. The TreeSHAP analysis explains the predictions of the machine learning models in terms of the individual contribution of each variable included in the model.
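To make the idea of a per-variable contribution concrete, the sketch below computes exact Shapley values by brute force for a toy value function. It is not TreeSHAP (which computes these values efficiently for tree ensembles) and not the authors' code; the variable names and the toy scores are invented purely for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values: the weighted average marginal contribution of each
    feature over all subsets of the remaining features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):  # k = size of the subset that excludes f
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_value(subset):
    """Hypothetical model score when only the variables in `subset` are known."""
    score = 0.0
    if "ecig_days" in subset:
        score += 0.3
    if "years_smoked" in subset:
        score -= 0.2
    if "ecig_days" in subset and "bmi" in subset:
        score += 0.1  # toy interaction term
    return score


features = ["ecig_days", "years_smoked", "bmi"]
phi = shapley_values(features, toy_value)
```

The sign of each value gives the effect direction (positive pushes the prediction up, negative pushes it down), and the values sum to the difference between the score with all variables and the score with none.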

Machine learning predictive models

After acquiring the analytical sample, variables were selected to be included in the classification model. The sample was then divided into a training (70%) and a testing (30%) dataset [38]. The training dataset was used to “train” the classifiers to predict the smoking cessation for each individual. The testing dataset (which shared the same distribution and properties as the training dataset) was not used in the training process. After training, the testing dataset was used to evaluate the ability of each classifier to predict smoking cessation.
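A 70/30 split of this kind can be sketched as a seeded random shuffle followed by a cut. The function name, the seed, and the rounding rule are assumptions; the text does not state whether the split was stratified by outcome, so this sketch uses a simple random split.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows reproducibly and cut them into training and testing sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


train_rows, test_rows = train_test_split(list(range(9281)))
```

With 9,281 rows and this rounding rule, the cut gives 6,497 training rows and 2,784 testing rows; the exact split sizes used in the study are not reported in the text.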

We developed classification models with a Generalized Linear Model (GLM) (an extension of linear regression, used here for binary classification) [39], RF [31], GBM [32], and extreme gradient boosting (XGBoost) [40] algorithms. We assessed four sampling strategies (no sampling, under-sampling, over-sampling, and bagging) in the classifiers’ training to overcome the class imbalance issue and to evaluate the effect of the sampling strategy on the performance of the classifiers in the testing set. The training process of all four algorithms was performed in the R statistical software version 4.0.2, using the training set data (70% of the total analytical samples) [41]. The hyperparameters of algorithms were adjusted by experimenting to achieve the best performance for the testing set. By limiting the complexity of each model, the likelihood of overfitting was reduced.

The classifiers were evaluated based on the number of cases correctly identified as “quit” compared with cases incorrectly identified as “quit”, and the number of cases correctly identified as “not quit” compared with cases incorrectly identified as “not quit” [42]. The performance of the trained classifiers was compared based on classification accuracy (the ability to make correct predictions), sensitivity (the ability to predict “quit” cases correctly), specificity (the ability to predict “not quit” cases correctly), and the area under the receiver operating characteristic curve (AUC-ROC) (the ability to separate “quit” from “not quit” cases across classification thresholds) in the testing set [42].
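The threshold-based metrics above can be computed directly from the four confusion-matrix counts. The sketch below is illustrative (the function name is an assumption) and takes balanced accuracy to be the mean of sensitivity and specificity, which is consistent with the values reported in Table 2.

```python
def classification_metrics(y_true, y_pred):
    """Compute metrics for binary labels where 1 = quit and 0 = not quit."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # quit, predicted quit
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not quit, predicted not quit
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not quit, predicted quit
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # quit, predicted not quit
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# A classifier that labels everyone "not quit" on a 7% / 93% sample.
y_true = [1] * 7 + [0] * 93
y_pred = [0] * 100
metrics = classification_metrics(y_true, y_pred)
```

In this example plain accuracy is 0.93 even though no quit case is detected, while balanced accuracy is 0.5, which is why balanced accuracy is the more informative summary under class imbalance.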

Results

Variable selection outcome

Out of 181, GBM selected 67, and RF selected 178 significant variables. The difference between the number of variables selected by GBM and RF is due to different algorithm structures (as discussed in the previous section). GBM places significant weight on the top 12 variables and 1% weight on the rest, while RF distributes the importance weight between multiple variables. Table 1 shows the variables in order of importance determined by GBM and RF. As shown in Table 1, 80% of the top 15 variables (72% of the top 25) selected by RF and GBM are the same, showing robustness of results across the two variable selection algorithms. The importance order, however, is different between GBM and RF.

Table 1. Comparison of the top 25 variables selected by GBM and RF.

| No. | Variable | Category | GBM order | RF order |
| --- | --- | --- | --- | --- |
| 1 | Age range when first started smoking cigarettes every day | Cigarette smoking habits | 1 | 3 |
| 2 | In the past 30 days, the number of days used e-cigarettes | Wave 2 past 30-days use | 2 | 7 |
| 3 | Adult poly tobacco product user (used at least 10 days in the past 30 days) | Wave 1 past 30-days use | 3 | 6 |
| 4 | How long smoked cigarettes fairly regularly | Cigarette smoking habits | 4 | 1 |
| 5 | Body mass index (BMI) | Demographics | 5 | 2 |
| 6 | In the past 30 days, the number of days smoked cigarettes | Wave 1 past 30-days use | 6 | 4 |
| 7 | Number of minutes from waking to smoking the first cigarette | Cigarette smoking habits | 7 | 5 |
| 8 | In the past 30 days, the number of days smoked cigarillos | Wave 2 past 30-days use | 8 | 58 |
| 9 | How would you describe your overall opinion of tobacco | Beliefs | 9 | 15 |
| 10 | The extent to which health warnings on cigarette packs make you more likely to quit/stay quit from smoking | Health warnings | 10 | 13 |
| 11 | In the past 30 days, the number of days smoked filtered cigars | Wave 2 past 30-days use | 11 | 113 |
| 12 | Age range when interviewed | Demographics | 12 | 8 |
| 13 | How often have you seen a list of the chemicals contained in tobacco products in the past 12 months | Health warnings | 13 | 14 |
| 14 | General perception: Harmfulness of cigarettes to health | Beliefs | 14 | 22 |
| 15 | How often do you use the internet | Social media | 15 | 11 |
| 16 | Number of hours in the past 7 days that you were in close contact with others when they were smoking | Family/friends smoking habits | 16 | 9 |
| 17 | In the past 30 days, the average number of cigarettes smoked per day on days smoked | Wave 1 past 30-days use | 17 | 42 |
| 18 | Statement that best describes rules about using non-combustible tobacco products inside your home | Smoking rules at home | 18 | 45 |
| 19 | Self-perception of quality of life | Physical/mental health and quality of life | 19 | 23 |
| 20 | Current employment status | SES | 20 | 43 |
| 21 | Currently covered by health insurance or health coverage plan | SES | 21 | 25 |
| 22 | Statement that best describes rules about smoking a combustible tobacco product inside your home | Smoking rules at home | 22 | 41 |
| 23 | Last time that you used alcohol or other drugs weekly or more often | Alcohol/other substances use | 23 | 21 |
| 24 | Opinion on using tobacco among people who are important to you | Family/friends smoking habits | 24 | 16 |
| 25 | Used a coupon when buying cigarettes in the past 30 days | Cigarette smoking habits | 25 | 36 |
| 26 | The amount usually paid for a pack of cigarettes | Cigarette smoking habits | 26 | 10 |
| 27 | How often have you thought about the chemicals contained in tobacco in the past 12 months | Health warnings | 35 | 18 |
| 28 | Highest grade or level of school completed | SES | 38 | 12 |
| 29 | Level of satisfaction with social activities and relationships | Physical/mental health and quality of life | 39 | 24 |
| 30 | How often you have been bothered by emotional problems such as feeling anxious, depressed, or irritable in the past 7 days | Physical/mental health and quality of life | 41 | 20 |
| 31 | How often have you noticed things that promote tobacco in the past 6 months? | Ads/promotions | 49 | 19 |
| 32 | Self-perception of mental health | Physical/mental health and quality of life | 61 | 17 |

Based on the GBM results, the most significant variable in smoking cessation prediction in wave 2 is the age at smoking initiation, followed by e-cigarette use in the past 30 days in wave 2 and poly tobacco product use in wave 1. Next was the number of years having smoked cigarettes, followed by the participant’s body mass index (BMI). In general, out of 67 variables identified as significant by GBM, 11 variables (16%) correspond to individuals’ cigarette smoking history and cigarette use habits in wave 1, 11 (16%) are variables related to tobacco products use (other than cigarettes and e-cigarettes) in waves 1 and 2, 10 (15%) are physical and mental health and quality of life variables in wave 1, 9 (13%) are demographics and socio-economic status (SES) variables in wave 1, 7 (10%) are variables related to family and friends tobacco products use habits and rules about tobacco products use at home in wave 1, 5 (8%) are variables related to knowledge and beliefs about smoking and tobacco products harmfulness in wave 1, 4 (6%) are variables related to the exposure to tobacco ads and promotions in wave 1, 3 (5%) are alcohol and other substances use variables in wave 1, 3 (5%) are variables related to noticing tobacco products health warnings in wave 1, 2 (3%) are e-cigarette use variables in waves 1 and 2, and 2 (3%) are variables related to social media exposure in wave 1.

The top five variables selected by RF are years of smoking cigarettes, BMI, age at smoking initiation, past 30 days cigarette use in wave 1, and minutes from waking up to smoking the first cigarette in wave 1. In general, out of 178 variables identified as significant by RF, 64 (35%) correspond to tobacco products use (other than cigarettes and e-cigarettes) in waves 1 and 2, 35 (20%) are physical and mental health and quality of life variables in wave 1, 16 (9%) are variables regarding knowledge and beliefs about smoking and tobacco products harmfulness in wave 1, 14 (8%) are demographic and SES variables in wave 1, 14 (8%) are related to individuals’ cigarette smoking history and cigarette use habits in wave 1, 8 (4%) are variables related to alcohol and other substances use in wave 1, 7 (4%) are e-cigarette use variables in waves 1 and 2, 7 (4%) are variables related to family and friends tobacco products use habits and rules about tobacco products use at home in wave 1, 5 (3%) are variables regarding exposure to tobacco ads and promotions in wave 1, 3 (2%) are social media exposure variables, 3 (2%) are variables related to noticing tobacco products health warnings in wave 1, and 2 (1%) are variables about previous quit attempts.

We combined the top 25 variables selected by GBM with the top 25 variables selected by RF (the union shown in Table 1) to develop our final model to predict smoking cessation for participants in wave 2. Because 18 variables appear in both top-25 lists, the final dataset included 32 variables.
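The size of the final variable set and the overlap percentages can be checked from the rank columns of Table 1. The sketch below encodes each row as (GBM order, RF order), keyed by the "No." column; the dictionary and function names are assumptions introduced for illustration.

```python
# (GBM order, RF order) for the 32 variables, keyed by the "No." column of Table 1.
TABLE1_ORDERS = {
    1: (1, 3), 2: (2, 7), 3: (3, 6), 4: (4, 1), 5: (5, 2), 6: (6, 4),
    7: (7, 5), 8: (8, 58), 9: (9, 15), 10: (10, 13), 11: (11, 113),
    12: (12, 8), 13: (13, 14), 14: (14, 22), 15: (15, 11), 16: (16, 9),
    17: (17, 42), 18: (18, 45), 19: (19, 23), 20: (20, 43), 21: (21, 25),
    22: (22, 41), 23: (23, 21), 24: (24, 16), 25: (25, 36), 26: (26, 10),
    27: (35, 18), 28: (38, 12), 29: (39, 24), 30: (41, 20), 31: (49, 19),
    32: (61, 17),
}


def top_k(orders, column, k):
    """Return the Table 1 row numbers ranked within the top k by one algorithm
    (column 0 = GBM, column 1 = RF)."""
    return {no for no, ranks in orders.items() if ranks[column] <= k}


gbm_top25 = top_k(TABLE1_ORDERS, 0, 25)
rf_top25 = top_k(TABLE1_ORDERS, 1, 25)
final_variables = gbm_top25 | rf_top25   # union used for the final model
shared_top25 = gbm_top25 & rf_top25      # selected by both algorithms
shared_top15 = top_k(TABLE1_ORDERS, 0, 15) & top_k(TABLE1_ORDERS, 1, 15)
```

With these ranks the union has 32 members, and 18 of 25 (72%) and 12 of 15 (80%) variables are shared, matching the figures reported in the text.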

Fig 3 shows the TreeSHAP summary plot for the combination of the top five variables selected by RF and GBM. Each point on the summary plot is a SHAP value for a variable and a survey participant, equivalent to the variable’s marginal contribution to the cessation prediction. In the summary plot in Fig 3, the y-axis shows variable names, while the x-axis shows the SHAP values. In the plot, colors show the original value of each variable. For categorical variables, two colors (yellow and dark purple) are shown to compare two levels of the variable. In contrast, for continuous variables, a spectrum of colors (from yellow to dark purple) is shown to represent different values of the variable. SHAP values less than zero show “low predicted cessation,” while values greater than zero show “high predicted cessation.”

Fig 3. The TreeSHAP summary plot for the combination of the top five variables selected by RF and GBM.

Based on the summary plot, we observed that lower values of past 30 days e-cigarette use in wave 2 mainly resulted in lower predicted cessation, while higher values of past 30 days e-cigarette use in wave 2 resulted in higher predicted cessation. More years of smoking mainly resulted in lower predicted cessation, while shorter smoking duration resulted in higher predicted cessation. More past 30 days cigarette use in wave 1 resulted in lower predicted cessation, while less past 30 days cigarette use in wave 1 resulted in higher predicted cessation. The summary plot showed no noticeable direction effect for minutes from waking up to smoking, which is likely due to the low variability of time to the first cigarette in the morning in our dataset. In the wave 1 PATH survey, 75% of participants reported smoking the first cigarette of the day within one hour of waking up. This can also be observed in the summary plot, where most dots are yellow for the minutes from waking up to smoking, indicating low values of this variable.

It can be observed that higher values of BMI (darker dots) resulted in higher predicted cessation, while lower values resulted in lower predicted cessation. For categorical variables, one level of the variable is compared to the other levels. The summary plot suggests that only cigarette use resulted in lower predicted cessation compared to poly tobacco products use. Alternatively, simultaneous use of cigarettes and other combustibles, and cigarettes and smokeless tobacco products in wave 1 resulted in a higher predicted cessation compared to only cigarette use. A lower predicted cessation is observed for participants who started smoking when they were 18 or younger, and comparatively, a higher predicted cessation for people who started smoking at ages 18–24. It should be noted that most participants (52%) indicated 18 or younger as the age of smoking initiation, followed by 18–24 (41%) and other ages (7%). Therefore, it can be concluded that the probability of cessation is lower for participants who started smoking at ages 18 or younger compared to older ages.

Model performance

After model training, we used the testing dataset for the performance evaluation. We trained predictive models applying RF, GBM, GLM, and XGBoost algorithms with variables shown in Table 1. We tried different sampling strategies (no sampling, under-sampling, over-sampling, and bagging) to evaluate the effect of sampling on the prediction power of the classifiers in the test set. A comparison between the sensitivities, specificities, accuracies, and AUC-ROC values of the models for the testing set is shown in Table 2. The ROCs of the models are compared in Fig 4.

Table 2. Evaluation results of the predictive models.

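| Sampling strategy | Model | Sensitivity | Specificity | Balanced accuracy | AUC-ROC |
| --- | --- | --- | --- | --- | --- |
| No sampling | GBM | 0.0135 | 0.9972 | 0.5054 | 0.7696 |
| No sampling | XGBoost | 0.0676 | 0.9917 | 0.5296 | 0.7574 |
| No sampling | GLM | 0.0495 | 0.9929 | 0.5212 | 0.7392 |
| No sampling | RF | 0.0045 | 0.9992 | 0.5018 | 0.7584 |
| Over-sampling | GBM | 0.6712 | 0.7732 | 0.7222 | 0.7757 |
| Over-sampling | XGBoost | 0.3108 | 0.9094 | 0.6101 | 0.7021 |
| Over-sampling | GLM | 0.6531 | 0.7165 | 0.6848 | 0.7244 |
| Over-sampling | RF | 0.0360 | 0.9948 | 0.5154 | 0.7614 |
| Under-sampling | GBM | 0.7162 | 0.7114 | 0.7138 | 0.7652 |
| Under-sampling | XGBoost | 0.7432 | 0.6937 | 0.7185 | 0.7645 |
| Under-sampling | GLM | 0.6667 | 0.6409 | 0.6538 | 0.6991 |
| Under-sampling | RF | 0.7432 | 0.6917 | 0.7175 | 0.7652 |
| Bagging | GBM | 0.6824 | 0.7445 | 0.7135 | 0.7631 |
| Bagging | XGBoost | 0.7008 | 0.7019 | 0.7014 | 0.7557 |
| Bagging | GLM | 0.6607 | 0.6637 | 0.6622 | 0.7063 |
| Bagging | RF | 0.7297 | 0.7146 | 0.7221 | 0.7637 |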

Fig 4. ROC comparison of the predictive models.

Without a sampling strategy, the classifier ignores the minority class and labels almost all samples as “not quit” because of class imbalance. Consequently, the sensitivity measure (which shows the power of the classifier to predict quit cases correctly) is close to 0, and the specificity measure (which indicates the ability of the classifier to predict not quit cases correctly) is close to 1 in all cases with no sampling. Without any sampling, balanced accuracy is also as low as 0.5 (random guess) in all cases, which again indicates the poor performance of the classifiers in predicting the actual smoking cessation outcome. Even though the AUC-ROC is around 0.75 for all cases with no sampling, it does not imply good performance. The AUC-ROC metric is high in cases with no sampling only because 93% of the data points are “not quit,” which is detected perfectly well by the classifiers (specificity ∼ 1), neglecting 7% of “quit” cases.

With random over-sampling, classifiers performed better in sensitivity and balanced accuracy compared with no sampling. A limitation of random over-sampling is that, since the minority class samples are duplicated in the training set, not much variance is added to help with the learning process. Random under-sampling helped increase the sensitivity and balanced accuracy better than random over-sampling and much better than no sampling. However, the drawback of random under-sampling is information loss, meaning that a part of the majority class data is removed to develop a balanced training set. All classifiers performed well with bagging. Bagging resulted in sensitivity, specificity, and accuracy of around 0.7 and a mean AUC-ROC of 0.75. Compared with under-sampling and over-sampling, in bagging, all samples are used efficiently for training, with no information loss.

Considering all performance metrics shown in Table 2, all four classification algorithms performed poorly without sampling; with over-sampling, GBM performed best; with under-sampling, GBM and RF performed best; and with bagging, RF performed best. Generally, GBM, XGBoost, and RF performed better than GLM in all metrics and considering all sampling techniques.

For validation, we used the best-performing model (RF with bagging) with a similar combination of variables to predict the smoking cessation of wave 2 adult current established smokers (ages 18 and above) in wave 3. The model predicted the smoking cessation outcome for individuals in wave 3 with an accuracy level of 70% in the test set. The detailed steps of the validation process are explained in the S2 Appendix.

Discussion

To the best of our knowledge, the PATH survey data has not previously been used to predict smoking cessation using ML predictive models, and our study is the first that surveyed all variables of at least one wave of the PATH dataset for smoking cessation prediction. Despite the class imbalance issue and a highly skewed class distribution (7% quit, 93% not quit cases), we detected important determinants of smoking cessation introduced in Table 1 and developed predictive models using these variables. No existing study in the literature has considered simultaneously all the final variables listed in Table 1 to predict smoking cessation. Our best model (RF with bagging) showed good performance in predicting smoking cessation in a representative population of adult cigarette users in the US (sensitivity 73%, specificity 71%, balanced accuracy 72%, and AUC-ROC 76%).

Our analysis showed that variables such as the age at smoking initiation, years of smoking cigarettes, BMI of participants, past 30-day use of poly tobacco products in wave 1, minutes from waking up to smoking the first cigarette in wave 1, and past 30-day use of e-cigarettes in wave 2 are important determinants of smoking cessation. We applied the TreeSHAP algorithm to assess the direction of the relationship between those variables and smoking cessation. The TreeSHAP analysis demonstrated that more e-cigarette use in the past 30 days at the time of quitting, being older than 18 at smoking initiation, smoking for fewer years, less cigarette use in the past 30 days before quitting, poly tobacco past 30-day use (compared with cigarette-only use) before quitting, and higher BMI mainly resulted in higher odds of cessation for adults. We use the term “mainly” since each variable affects each participant differently (dots in Fig 3). What we reported here is the effect of each variable on most participants.
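To illustrate how a Shapley-based attribution yields an effect direction for one participant, the sketch below computes exact Shapley values by brute force for a tiny, hypothetical scoring function. It is not the study's model and not the TreeSHAP algorithm itself (which computes these quantities efficiently for tree ensembles); the feature names, thresholds, and baseline are assumptions made only for this example.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution over all
    feature orderings. Features not yet 'revealed' keep their baseline value."""
    names = list(x)
    orderings = list(permutations(names))
    phi = {name: 0.0 for name in names}
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = x[name]
            updated = predict(current)
            phi[name] += (updated - previous) / len(orderings)
            previous = updated
    return phi

def toy_quit_score(f):
    """Hypothetical score standing in for a model's predicted chance of quitting."""
    score = 0.10
    if f["ecig_days"] > 10:
        score += 0.20
    if f["years_smoking"] > 20:
        score -= 0.05
    if f["ecig_days"] > 10 and f["years_smoking"] <= 20:
        score += 0.10  # interaction between the two features
    return score

baseline = {"ecig_days": 0, "years_smoking": 30}    # hypothetical reference participant
participant = {"ecig_days": 25, "years_smoking": 5}  # hypothetical participant to explain

phi = shapley_values(toy_quit_score, participant, baseline)
for name, value in phi.items():
    print(name, round(value, 3))
```

A positive value means the feature pushes this participant's score above the baseline and a negative value pushes it below. The values sum to the difference between the participant's score and the baseline score, and because they are computed per participant, the same variable can point in different directions for different people.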

Our findings are consistent with previous research that has suggested an association between early smoking initiation and a longer duration of smoking [15], increased chances of nicotine dependence [15, 16], and lower chances of cessation [43, 44]. Our analysis also showed a positive association between higher BMI and cigarette quitting. This relationship could be explained by the positive correlation between higher BMI and health risks, specifically heart disease [45], which can motivate people to quit smoking cigarettes [46]. Studies have mainly discussed the effect of smoking cessation on weight gain [44]. Another ML study considered BMI as a determinant of smoking cessation but did not discuss the direction of the relationship [15]. Previous literature has reported mixed outcomes regarding the effect of e-cigarette use on smoking cessation, ranging from no effect [47] to a significant and positive effect [48, 49]. Our analysis suggested that e-cigarette use is associated with a higher chance of cessation for adult cigarette smokers. Our study also showed that past 30-day use of other tobacco products (both combustible and smokeless) in addition to cigarettes could increase the chance of cigarette cessation for adult smokers, compared with cigarette-only use. One interpretation is that combining cigarettes with other tobacco products may lead adult smokers to partially or completely substitute those products for cigarettes and, therefore, to higher chances of quitting cigarettes [50]. Our analysis did not show an obvious direction for the effect of “minutes from waking up to smoking the first cigarette” on cigarette cessation since 75% of participants in our study reported smoking the first cigarette of the day within one hour of waking up. A shorter time to the first cigarette in the morning is known in the literature as an indicator of nicotine dependence, which can reduce the chance of cessation [51]. The low variability in the time between waking up and smoking the first cigarette reported by study participants might have affected the ability of the model to identify the direction of this relationship.

Our results point to the potential importance of non-tobacco behavioral and socioeconomic variables that are generally overlooked in previous smoking cessation studies, such as health insurance coverage [52] and mental health status [53]. Our study also revealed uncommon variables associated with smoking cessation, such as internet use and perceived quality of life. Using the internet might expose individuals to tobacco coupons and advertisements, as well as raise awareness regarding the harmfulness of tobacco through surfing websites. While many studies surveyed the effect of cigarette use on quality of life [54–56], we did not find any research exploring the effect of quality of life on smoking cessation.

Our study results show that to compare the performance of the classifiers in case of class imbalance, all metrics of sensitivity, specificity, balanced accuracy, and AUC-ROC should be considered and that only focusing on a single metric can be misleading. For instance, Fig 4 shows an almost equal performance for all classifiers; however, according to the results provided in Table 2, those classifiers with “No-sampling” performed poorly in detecting quit cases (low sensitivities). Considering a combination of all metrics in Table 2, an RF model with bagging outperformed all other classifiers. Compared with random under and over-sampled data, bagging enables using the entire train set efficiently for training the classifiers. The developed RF model with bagging predicted wave 1 (wave 2) current established smokers’ cessation in wave 2 (wave 3) with a 72% (70%) accuracy level in the testing set.
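The point that a single metric can mislead under class imbalance can be checked with a small calculation. The sketch below assumes a hypothetical test set of 1,000 smokers with the 7% / 93% class split described earlier and a trivial classifier that always predicts "not quit"; the counts and the function name are illustrative and are not taken from Table 2.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard metrics from confusion-matrix counts; 'positive' means a quit case."""
    sensitivity = tp / (tp + fn)  # share of quit cases detected
    specificity = tn / (tn + fp)  # share of non-quit cases detected
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

# Hypothetical: 70 quit and 930 non-quit cases, classifier always predicts "not quit".
trivial = classification_metrics(tp=0, fn=70, tn=930, fp=0)
print(trivial["accuracy"])           # 0.93
print(trivial["sensitivity"])        # 0.0
print(trivial["balanced_accuracy"])  # 0.5

# Balanced accuracy implied by the reported sensitivity (73%) and specificity (71%).
reported_balanced_accuracy = (0.73 + 0.71) / 2
print(round(reported_balanced_accuracy, 2))  # 0.72
```

The trivial classifier reaches 93% plain accuracy while detecting no quit cases at all, which is why balanced accuracy and sensitivity are more informative here. The last line shows that the reported 72% is the mean of the reported sensitivity and specificity.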

Our model performed better than or equal to similar studies in the literature in predicting smoking cessation. The artificial neural network algorithm developed by Lai et al. [15] predicted smoking cessation with an average accuracy of 64%. Coughlin et al. [9] applied a classification and regression tree algorithm which predicted the smoking cessation outcome with an accuracy level of 64%. Medina and Mohaghegh [16] developed a CatBoost algorithm capable of predicting smoking cessation at 70% accuracy. However, it should be noted that because of the differences between study designs and the data used, these studies cannot be compared directly.

Our study is subject to limitations. We have mainly considered waves 1–2 (developing cohort) of the PATH survey for model development and waves 2–3 (validation cohort) for model validation. Thus, our results and conclusions may not apply to later years when other factors (such as the use of JUUL and nicotine pouches) might be more relevant. Furthermore, the validation cohort in our study had 690 fewer (7.4% lower) participants than the developing cohort. This reduction is partially because some current smokers in wave 1 had quit by wave 2. Additionally, based on the PATH survey user guide [17], some participants surveyed in PATH wave 1 were permanently or temporarily ineligible to participate in the follow-up waves (for instance, because they were deceased or moved out of the US). A number of eligible participants did not agree to participate in follow-up waves, and some did not respond to follow-up surveys. Even with the observed loss to follow-up between waves 1–2 and 2–3, we had a large enough sample (w1–2: 9,281, w2–3: 8,591) to accomplish our analysis and train, test, and validate accurate predictive models. In addition, our analysis is based on PATH data for the US, and our results may not apply to specific sub-populations within the US (e.g., racial/ethnic groups) or other countries. Further validation would be necessary to assess the performance of our model in other populations or in more recent years (e.g., due to the introduction and rapid growth of JUUL in the US). Another limitation of our analysis is the inability to detect the direction of the effect of “minutes from waking up to smoking the first cigarette” (an important indicator of nicotine dependence) on cigarette cessation. Most participants in our baseline sample reported smoking the first cigarette of the day within one hour of waking up, which resulted in low variability in this measure and possibly the inability of the model to detect the direction of its effect.
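The cohort-size figures quoted above can be reproduced from the two sample sizes given in the text; the variable names in this short check are illustrative.

```python
developing_n = 9281  # waves 1-2 cohort size, from the text
validation_n = 8591  # waves 2-3 cohort size, from the text

difference = developing_n - validation_n
relative_drop = difference / developing_n

print(difference)              # 690
print(f"{relative_drop:.1%}")  # 7.4%
```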

Conclusions

To better characterize factors that support individuals in their smoking cessation journey, and to inform future tobacco policies, we applied ML to interpret each variable’s effect on smoking cessation odds. Compared to other studies which are focused on cessation treatment program data, we used data from a US nationally representative survey to predict smoking cessation among the general population, and considered a broad range of factors that could lead to smoking cessation. This study shows the virtue of ML algorithms over other modeling strategies to find important determinants of smoking cessation and develop accurate predictive models, specifically in large datasets with vast numbers of variables.

Supporting information

S1 Appendix. Steps of data cleaning.

(PDF)

S2 Appendix. Model validation for cessation transition between waves 2-3.

(PDF)

Acknowledgments

We would like to thank the Center for the Assessment of Tobacco Regulations (CAsToR) Data Analysis and Dissemination Core for providing data analysis insights for this work and to members of CAsToR for providing comments on the initial draft.

Data Availability

Data are from the Public Use Files for the Population Assessment of Tobacco and Health, Waves 1-3 at https://www.icpsr.umich.edu/icpsrweb/NAHDAP/studies/36498.

Funding Statement

Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health (NIH) and FDA Center for Tobacco Products (CTP) under Award Number U54CA229974. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or the Food and Drug Administration. MI and TTTL were funded. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Cigarette Smoking Among Adults, United States, 2000-2002. Available from: https://www.cdc.gov/mmwr/preview/mmwrhtml/mm5129a3.htm.
  • 2. Tobacco Product Use and Cessation Indicators Among Adults, United States, 2018-2019. Available from: https://www.cdc.gov/mmwr/volumes/68/wr/mm6845a2.htm?scid=mm6845a2w.
  • 3. Smoking and Tobacco Use. Available from: https://www.cdc.gov/tobacco/datastatistics/factsheets/fastfacts/index.htm.
  • 4. Pipe AL, Evans W, Papadakis S. Smoking cessation: health system challenges and opportunities. Tobacco Control. 2022;31(2):340–347. doi: 10.1136/tobaccocontrol-2021-056575
  • 5. Lin H, Xiao D, Liu Z, Shi Q, Hajek P, Wang C. National survey of smoking cessation provision in China. Tobacco induced diseases. 2019;17. doi: 10.18332/tid/104726
  • 6. Shaik SS, Doshi D, Bandari SR, Madupu PR, Kulkarni S. Tobacco use cessation and prevention–A Review. Journal of clinical and diagnostic research. 2016;10(5):ZE13. doi: 10.7860/JCDR/2016/19321.7803
  • 7. Jha P, Ramasundarahettige C, Landsman V, Rostron B, Thun M, Anderson RN, et al. 21st-century hazards of smoking and benefits of cessation in the United States. New England Journal of Medicine. 2013;368(4):341–350. doi: 10.1056/NEJMsa1211128
  • 8. WHO sustainable development goals. Available from: https://www.who.int/europe/about-us/our-work/sustainable-development-goals/targets-of-sustainable-development-goal-3.
  • 9. Coughlin LN, Tegge AN, Sheffer CE, Bickel WK. A machine-learning approach to predicting smoking cessation treatment outcomes. Nicotine and Tobacco Research. 2020;22(3):415–422. doi: 10.1093/ntr/nty259
  • 10. Blok DJ, de Vlas SJ, van Empelen P, van Lenthe FJ. The role of smoking in social networks on smoking cessation and relapse among adults: a longitudinal study. Preventive medicine. 2017;99:105–110. doi: 10.1016/j.ypmed.2017.02.012
  • 11. Charafeddine R, Demarest S, Cleemput I, Van Oyen H, Devleesschauwer B. Gender and educational differences in the association between smoking and health-related quality of life in Belgium. Preventive medicine. 2017;105:280–286. doi: 10.1016/j.ypmed.2017.09.016
  • 12. Vázquez AL, Rodríguez MMD, Barrett TS, Schwartz S, Buenabad NGA, Gamiño MNB, et al. Innovative identification of substance use predictors: machine learning in a national sample of Mexican children. Prevention Science. 2020;21(2):171–181. doi: 10.1007/s11121-020-01089-4
  • 13. Caccamisi A, Jørgensen L, Dalianis H, Rosenlund M. Natural language processing and machine learning to enable automatic extraction and classification of patients’ smoking status from electronic medical records. Upsala Journal of Medical Sciences. 2020;125(4):316–324. doi: 10.1080/03009734.2020.1792010
  • 14. Morgenstern JD, Buajitti E, O’Neill M, Piggott T, Goel V, Fridman D, et al. Predicting population health with machine learning: a scoping review. BMJ open. 2020;10(10):e037860. doi: 10.1136/bmjopen-2020-037860
  • 15. Lai CC, Huang WH, Chang BCC, Hwang LC. Development of Machine Learning Models for Prediction of Smoking Cessation Outcome. International journal of environmental research and public health. 2021;18(5):2584. doi: 10.3390/ijerph18052584
  • 16. Medina IC, Mohaghegh M. Explainable Machine Learning Models for Prediction of Smoking Cessation Outcome in New Zealand. In: 2022 14th International Conference on COMmunication Systems & NETworkS (COMSNETS). IEEE; 2022. p. 764–768.
  • 17. Population Assessment of Tobacco and Health (PATH) Study [United States] Public-Use Files (ICPSR 36498). Available from: https://www.icpsr.umich.edu/web/NAHDAP/studies/36498.
  • 18. Hyland A, Ambrose BK, Conway KP, Borek N, Lambert E, Carusi C, et al. Design and methods of the Population Assessment of Tobacco and Health (PATH) Study. Tobacco control. 2017;26(4):371–378. doi: 10.1136/tobaccocontrol-2016-052934
  • 19. Levy DT, Yuan Z, Li Y, Mays D, Sanchez-Romero LM. An examination of the variation in estimates of e-cigarette prevalence among US adults. International journal of environmental research and public health. 2019;16(17):3164. doi: 10.3390/ijerph16173164
  • 20. Cook S, Hirschtick JL, Patel A, Brouwer A, Jeon J, Levy DT, et al. A longitudinal study of menthol cigarette use and smoking cessation among adult smokers in the US: Assessing the roles of racial disparities and E-cigarette use. Preventive medicine. 2022;154:106882. doi: 10.1016/j.ypmed.2021.106882
  • 21. Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–2517. doi: 10.1093/bioinformatics/btm344
  • 22. The Center for the Assessment of Tobacco Regulations Data Analysis and Dissemination Core. Available from: https://tcors.umich.edu/CoresData.php.
  • 23. Jed Wing AWCKAETCZM Steve Weston. findCorrelation: Determine highly correlated variables. caret: Classification and Regression Training. R package version 6.0-35. 2014.
  • 24. Mukaka MM. A guide to appropriate use of correlation coefficient in medical research. Malawi medical journal. 2012;24(3):69–71.
  • 25. Zhu M, Xia J, Jin X, Yan M, Cai G, Yan J, et al. Class weights random forest algorithm for processing class imbalanced medical data. IEEE Access. 2018;6:4641–4652. doi: 10.1109/ACCESS.2018.2789428
  • 26. Dal Pozzolo A, Caelen O, Le Borgne YA, Waterschoot S, Bontempi G. Learned lessons in credit card fraud detection from a practitioner perspective. Expert systems with applications. 2014;41(10):4915–4928. doi: 10.1016/j.eswa.2014.02.026
  • 27. Le T, Lee MY, Park JR, Baik SW. Oversampling techniques for bankruptcy prediction: Novel features from a transaction dataset. Symmetry. 2018;10(4):79. doi: 10.3390/sym10040079
  • 28. Davagdorj K, Lee JS, Pham VH, Ryu KH. A comparative analysis of machine learning methods for class imbalance in a smoking cessation intervention. Applied Sciences. 2020;10(9):3307. doi: 10.3390/app10093307
  • 29. Haixiang G, Yijing L, Shang J, Mingyun G, Yuanyue H, Bing G. Learning from class-imbalanced data: Review of methods and applications. Expert systems with applications. 2017;73:220–239. doi: 10.1016/j.eswa.2016.12.035
  • 30. Jafarzadeh H, Mahdianpari M, Gill E, Mohammadimanesh F, Homayouni S. Bagging and boosting ensemble classifiers for classification of multispectral, hyperspectral and PolSAR data: a comparative evaluation. Remote Sensing. 2021;13(21):4405. doi: 10.3390/rs13214405
  • 31. Liaw A, Wiener M. Classification and regression by randomForest. R news. 2002;2(3):18–22.
  • 32. Friedman JH. Greedy function approximation: a gradient boosting machine. Annals of statistics. 2001; p. 1189–1232.
  • 33. Breiman L. Random forests. Machine learning. 2001;45(1):5–32. doi: 10.1023/A:1010933404324
  • 34. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Advances in neural information processing systems. 2017;30.
  • 35. Lundberg SM, Erion GG, Lee SI. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888. 2018.
  • 36. Inoguchi T, Nohara Y, Nojiri C, Nakashima N. Association of serum bilirubin levels with risk of cancer development and total death. Scientific reports. 2021;11(1):1–12. doi: 10.1038/s41598-021-92442-2
  • 37. Shakeri E, Crump T, Weis E, Souza R, Far B. Using SHAP Analysis to Detect Areas Contributing to Diabetic Retinopathy Detection. IEEE 23rd International Conference on Information Reuse and Integration for Data Science. 2022;166–171.
  • 38. Lee CK, Hofer I, Gabel E, Baldi P, Cannesson M. Development and validation of a deep neural network model for prediction of postoperative in-hospital mortality. Anesthesiology. 2018;129(4):649–662. doi: 10.1097/ALN.0000000000002186
  • 39. Wright RE. Logistic regression. 1995.
  • 40. Chen T, He T, Benesty M, Khotilovich V, Tang Y, Cho H, et al. Xgboost: extreme gradient boosting. R package version 0.4-2. 2015;1(4):1–4.
  • 41. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
  • 42. Zhu W, Zeng N, Wang N. Sensitivity, specificity, accuracy, associated confidence interval and ROC analysis with practical SAS implementations. NESUG proceedings: health care and life sciences, Baltimore, Maryland. 2010;19:67.
  • 43. Breslau N, Peterson EL. Smoking cessation in young adults: age at initiation of cigarette smoking and other suspected influences. American journal of public health. 1996;86(2):214–220. doi: 10.2105/ajph.86.2.214
  • 44. Chen J, Millar WJ. Age of smoking initiation: implications for quitting. Health reports-statistics Canada. 1998;9:39–48.
  • 45. Abbasi F, Brown BW, Lamendola C, McLaughlin T, Reaven GM. Relationship between obesity, insulin resistance, and coronary heart disease risk. Journal of the American College of Cardiology. 2002;40(5):937–943. doi: 10.1016/S0735-1097(02)02051-X
  • 46. Goettler D, Wagner M, Faller H, Kotseva K, Wood D, Leyh R, et al. Factors associated with smoking cessation in patients with coronary heart disease: a cohort analysis of the German subset of EuroAspire IV survey. BMC cardiovascular disorders. 2020;20(1):1–9. doi: 10.1186/s12872-020-01429-w
  • 47. Wang RJ, Bhadriraju S, Glantz SA. E-cigarette use and adult cigarette smoking cessation: a meta-analysis. American journal of public health. 2021;111(2):230–246. doi: 10.2105/AJPH.2020.305999
  • 48. Hartmann-Boyce J, McRobbie H, Butler AR, Lindson N, Bullen C, Begh R, et al. Electronic cigarettes for smoking cessation. Cochrane database of systematic reviews. 2021;(9).
  • 49. Kasza KA, Edwards KC, Kimmel HL, Anesetti-Rothermel A, Cummings KM, Niaura RS, et al. Association of e-cigarette use with discontinuation of cigarette smoking among adult smokers who were initially never planning to quit. JAMA network open. 2021;4(12):e2140880–e2140880. doi: 10.1001/jamanetworkopen.2021.40880
  • 50. Stanton CA, Halenar MJ. Patterns and correlates of multiple tobacco product use in the United States. Nicotine and Tobacco Research. 2018;20(suppl 1):S1–S4. doi: 10.1093/ntr/nty081
  • 51. Baker TB, Piper ME, McCarthy DE, Bolt DM, Smith SS, Kim SY, et al. Time to first cigarette in the morning as an index of ability to quit smoking: implications for nicotine dependence. Nicotine and Tobacco Research. 2007;9(Suppl4):S555–S570. doi: 10.1080/14622200701673480
  • 52. Bailey SR, Hoopes MJ, Marino M, Heintzman J, O’Malley JP, Hatch B, et al. Effect of gaining insurance coverage on smoking cessation in community health centers: a cohort study. Journal of general internal medicine. 2016;31(10):1198–1205. doi: 10.1007/s11606-016-3781-4
  • 53. Cengelli S, O’Loughlin J, Lauzon B, Cornuz J. A systematic review of longitudinal population-based studies on the predictors of smoking cessation in adolescent and young adult smokers. Tobacco control. 2012;21(3):355–362. doi: 10.1136/tc.2011.044149
  • 54. Chen J, Qi Y, Wampfler JA, Jatoi A, Garces YI, Busta AJ, et al. Effect of cigarette smoking on quality of life in small cell lung cancer patients. European journal of cancer. 2012;48(11):1593–1601. doi: 10.1016/j.ejca.2011.12.002
  • 55. Garces YI, Yang P, Parkinson J, Zhao X, Wampfler JA, Ebbert JO, et al. The relationship between cigarette smoking and quality of life after lung cancer diagnosis. Chest. 2004;126(6):1733–1741. doi: 10.1378/chest.126.6.1733
  • 56. Turner J, Page-Shafer K, Chin DP, Osmond D, Mossar M, Markstein L, et al. Adverse impact of cigarette smoking on dimensions of health-related quality of life in persons with HIV infection. AIDS patient care and STDs. 2001;15(12):615–624. doi: 10.1089/108729101753354617

Decision Letter 0

Mohammad Amin Fraiwan

11 Apr 2023

PONE-D-23-00290

Machine learning application for predicting smoking cessation among US adults

PLOS ONE

Dear Dr. Issabakhsh,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by May 26 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Mohammad Amin Fraiwan

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in your Competing Interests section: 

"NO authors have competing interests."

Please complete your Competing Interests on the online submission form to state any Competing Interests. If you have no competing interests, please state "The authors have declared that no competing interests exist.", as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now

 This information should be included in your cover letter; we will change the online submission form on your behalf.

3. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The manuscript is well written and the methodology section is detailed and particularly data cleaning sub-section. The limitations of the study can be described more in detail. The potential strengths of the study stand out. The study will be interesting for wider readership and adds to existing knowledge base in tobacco research.

Reviewer #2: Review Report

Title: Machine learning application for predicting smoking cessation among US adults.

Manuscript Number:

Review Version: I

Review Comments

  • The objective and the results are little beat not consistent.

  • Establish the background more and incorporate global and national promises.

  • Use of exclamations in inappropriate way.

  • There is high drop out among the first and the second wave and between the second and the third wave? What was the reason behind and how was that treated?

  • The inclusion criteria are loose and is that only current or life time smoking or both?

  • Dis TreeSHAP validated on Artificial intelligence before its application on human being?

  • What are the assumptions taken in to account in general and in the final model? Is that one time or repeated time measurements? The analysis should be linear regression or longitudinal analysis? How did you ensure data quality?

  • How did the ethical considerations secure? Did the national or local IRB approval?

  • The analysis needs further explanation.

  • Ensure the completeness of the contents of the manuscript E.g. The title lacks time period.

Regards,

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Jun 8;18(6):e0286883. doi: 10.1371/journal.pone.0286883.r002

Author response to Decision Letter 0


24 May 2023

Responses for reviewers’ comments: PLOS ONE - Manuscript ID PONE-D-23-00290

Title: Machine learning application for predicting smoking cessation among US adults

We appreciate the opportunity to revise our manuscript. We responded to the reviewers’ comments below and tracked changes in the manuscript. Please note that the page and line numbers provided in the responses refer to the “tracked changes” version of the manuscript.

Reviewer 1

The manuscript is well written, and the methodology section is detailed, particularly the data cleaning sub-section. The limitations of the study could be described in more detail. The potential strengths of the study stand out. The study will be interesting to a wider readership and adds to the existing knowledge base in tobacco research.

Response: We thank the reviewer for their valuable comment. In response to this comment, we have described the limitations of our study in more detail and revised the last paragraph of the “Discussion” section (page 11, lines 394-417) as follows:

Our study is subject to limitations. We have mainly considered waves 1-2 (developing cohort) of the PATH survey for model development and waves 2-3 (validation cohort) for model validation. Thus, our results and conclusions may not apply to later years when other factors (such as the use of JUUL and nicotine pouches) might be more relevant. Furthermore, the validation cohort in our study had 690 fewer (7.4% lower) participants than the developing cohort. This reduction is partially because some current smokers in Wave 1 had quit by Wave 2. Additionally, based on the PATH survey user guide,[1] some participants surveyed in PATH wave 1 were permanently or temporarily ineligible to participate in the follow-up waves (for instance, because they were deceased or moved out of the US). A number of eligible participants did not agree to participate in follow-up waves, and some did not respond to follow-up surveys. Even with the observed loss to follow-up between waves 1-2 and 2-3, we had a large enough sample (w1-2: 9,281, w2-3: 8,591) to accomplish our analysis and train, test, and validate accurate predictive models.

In addition, our analysis is based on PATH data for the US, and our results may not apply to specific sub-populations within the US (e.g., racial/ethnic groups) or other countries. Further validation would be necessary to assess the performance of our model in other populations or in more recent years (e.g., due to the introduction and rapid growth of JUUL in the US). Another limitation of our analysis is the inability to detect the direction effect of “minutes from waking up to smoking the first cigarette” (an important indicator of nicotine dependence) on cigarette cessation. Most participants in our baseline sample reported smoking the first cigarette of the day within one hour of waking up, which resulted in low variability of minutes from waking up to smoking the first cigarette and possibly the inability of the model to detect its direction effect.

Reviewer 2

Comment 1: The objective and the results are a little bit inconsistent.

Response: We thank the reviewer for noting this. We have revised the objective sentence in the “Introduction” section (page 2, lines 33-37) as follows to make sure of the consistency of the objective and results:

This study uses the US nationally representative longitudinal data from the Population Assessment of Tobacco and Health (PATH) survey to develop ML predictive models (i.e., binary classifiers). Our objective is to analyze the smoking cessation process by distinguishing its important determinants and predicting smoking cessation after one data wave (roughly one year) for survey participants.

Comment 2: Establish the background more and incorporate global and national promises.

Response: We thank the reviewer for their comment. We have extended the introduction of the paper to cite studies discussing the global and national importance of smoking cessation.[2-5] We added the following sentences to the “Introduction” section (page 2 lines 6-9):

The global and national importance of smoking cessation has been discussed widely in the literature.[3-5] To promote smoking cessation, the World Health Organization (WHO) has also emphasized strengthening its Framework Convention on Tobacco Control implementation in all countries.[1]

Comment 3: Use of exclamation marks in an inappropriate way.

Response: We thank the reviewer for noting this problem. Probably this happened because of a “compiling” issue. We have revised the manuscript and made corrections where needed.

Comment 4: There is high dropout between the first and second waves and between the second and third waves. What was the reason behind this, and how was it treated?

Response: We thank the reviewer for raising this issue. The validation cohort (waves 2-3) in our study had 690 fewer participants (7.4% lower) than the developing cohort (waves 1-2). This reduction is partially because some current smokers in wave 1 had quit by wave 2. In addition, the PATH survey user guide explains part of this "drop out" between the baseline wave of PATH (wave 1) and the follow-up waves.[1] Based on “Section 4. Response Rates”, on page 32 of the PATH user guide, “Some addresses sampled for the PATH Study could not be located or accessed, others were found to be ineligible (e.g., vacant lots and group quarters), and some eligible households did not complete the household screener. Further, not all sampled persons within eligible households agreed to participate in the PATH Study, and those who were recruited at Wave 1, i.e., those in the Wave 1 Cohort, may not have responded at some or all of the follow-up waves”. Additionally, as explained in Section 4 of the PATH user guide, some participants surveyed in PATH wave 1 were permanently or temporarily ineligible to participate in the follow-up waves because they were deceased or moved out of the US.[1]

Even with the observed loss to follow-up between waves 1-2 and 2-3, we had a large enough sample (w1-2: 9,281, w2-3: 8,591) to accomplish our analysis and train, test, and validate accurate predictive models, as discussed in the “Results” section. We added the following paragraph to the last part of the “Discussion” section (page 11, lines 398-407) and addressed this issue as one limitation of our study:

Furthermore, the validation cohort in our study had 690 fewer (7.4% lower) participants than the developing cohort. This reduction is partially because some current smokers in Wave 1 had quit by Wave 2. Additionally, based on the PATH survey user guide,[1] some participants surveyed in PATH wave 1 were permanently or temporarily ineligible to participate in the follow-up waves (for instance, because they were deceased or moved out of the US). A number of eligible participants did not agree to participate in follow-up waves, and some did not respond to follow-up surveys. Even with the observed loss to follow-up between waves 1-2 and 2-3, we had a large enough sample (w1-2: 9,281, w2-3: 8,591) to accomplish our analysis and train, test, and validate accurate predictive models.
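The reduction quoted above follows directly from the two cohort sizes. A minimal arithmetic check in Python, with illustrative variable names that do not come from the manuscript:

```python
# Cohort sizes reported in the response (waves 1-2 and waves 2-3).
developing_n = 9281
validation_n = 8591

# Absolute and relative reduction from the developing to the validation cohort.
difference = developing_n - validation_n
percent_lower = 100 * difference / developing_n

print(difference)               # 690
print(round(percent_lower, 1))  # 7.4
```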

Comment 5: The inclusion criteria are loose; do they cover only current smoking, lifetime smoking, or both?

Response: We thank the reviewer for their question. As explained in the manuscript, “Our baseline sample included current established cigarette smokers in wave 1, defined as those who smoked 100 cigarettes or more in their lifetime and reported smoking every day or some days.” In other words, we have considered both criteria: current smokers who smoked 100 cigarettes or more in their lifetime. To make the definition clearer, we revised the “Data” section (page 3, lines 70-73) as follows:

Our baseline sample included current smokers in wave 1, who smoked 100 cigarettes or more during their lifetime and reported smoking every day or some days at the time of the survey. In other words, we considered current established smokers in wave 1.
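The two-part definition above can be read as a simple filter in which both conditions must hold. The sketch below uses hypothetical records and field names (`lifetime_cigarettes`, `smoking_frequency`); they are illustrative only and are not PATH variable names.

```python
# Hypothetical respondent records; field names are illustrative, not PATH variables.
respondents = [
    {"id": 1, "lifetime_cigarettes": 250, "smoking_frequency": "every day"},
    {"id": 2, "lifetime_cigarettes": 40, "smoking_frequency": "some days"},
    {"id": 3, "lifetime_cigarettes": 500, "smoking_frequency": "not at all"},
    {"id": 4, "lifetime_cigarettes": 100, "smoking_frequency": "some days"},
]


def is_current_established_smoker(record):
    """Both criteria must hold: 100+ lifetime cigarettes and current use."""
    established = record["lifetime_cigarettes"] >= 100
    current = record["smoking_frequency"] in {"every day", "some days"}
    return established and current


baseline_ids = [r["id"] for r in respondents if is_current_established_smoker(r)]
print(baseline_ids)  # [1, 4]
```

Respondent 2 fails the lifetime threshold and respondent 3 is not a current smoker, so neither enters the baseline sample.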

Comment 6: Is TreeSHAP validated on artificial intelligence before its application to human beings?

Response: We thank the reviewer for this question. The TreeSHAP is a typical method for variable interpretation, specifically in the public health domain. TreeSHAP has been used in tobacco research,[6] in cancer prevention and control research,[7] and in other applications [8] in a similar manner to our analysis. To explain the application of TreeSHAP more, we added the following sentence to the last paragraph of the “Variable importance and direction effect” section (page 5, lines 157-161):

The TreeSHAP is a typical method for variable interpretation, specifically in the public health domain. It has been used in tobacco research,[6] cancer prevention and control research,[7] and other applications.[8] The TreeSHAP analysis is used to explain the prediction of the machine learning models independently by each variable included in the model.
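To illustrate what a per-variable Shapley attribution is, the sketch below computes exact Shapley values by brute force for a toy scoring function. It is not the TreeSHAP algorithm, which obtains attributions of this kind efficiently for tree ensembles. The toy model, the feature names, and the choice to replace "absent" features with a single baseline value are all simplifying assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def toy_model(x):
    # Hypothetical scoring function with one interaction term; not the paper's model.
    return (0.5 * x["ecig_days"]
            - 0.2 * x["cigs_per_day"]
            + 0.1 * x["ecig_days"] * x["older_start"])


def shapley_values(model, instance, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    names = list(instance)
    n = len(names)

    def value(subset):
        mixed = {k: (instance[k] if k in subset else baseline[k]) for k in names}
        return model(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {name}) - value(coalition))
        phi[name] = total
    return phi


instance = {"ecig_days": 10, "cigs_per_day": 5, "older_start": 1}
baseline = {"ecig_days": 0, "cigs_per_day": 0, "older_start": 0}
phi = shapley_values(toy_model, instance, baseline)
print(phi)
```

The attributions sum to the difference between the model output for the instance and for the baseline, and the sign of each value gives the direction in which that variable moved the prediction.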

Comment 7: What assumptions are taken into account in general and in the final model? Are the measurements one-time or repeated? Should the analysis be linear regression or longitudinal analysis? How did you ensure data quality?

Response: We thank the reviewer for their questions. We have explained the assumptions of our model extensively in the Material and methods section and in further detail in the first part of the “Appendix” section. To answer your questions, we made the same assumptions “in general and in the final model” and have considered “one-time” measurements. Linear regression is not suitable for our analysis since we consider a binary response (i.e., quit/not quit), as explained in the last paragraph of the “Data” section. Instead, we have adopted a generalized linear regression model, an extension of linear regression for binary classification. We have also developed more advanced predictive models with random forest, gradient boosting machines, and extreme gradient boosting algorithms. We conducted a longitudinal analysis by developing two-wave transitions in our paper; we used waves 1 and 2 for developing our model (not multiple waves), and after predictive model development (using the waves 1-2 cohort), we validated our results by applying the waves 2-3 cohort. More specifically, we considered current smokers in wave 1 (baseline) and tracked smoking cessation for those respondents in wave 2 (follow-up) in order to develop a model capable of predicting smoking cessation for current smokers within a year, similar to what would likely be done in a smoking cessation clinical trial. As explained in the “Machine learning predictive models” section, we have ensured “data quality” mainly by data cleaning (lines 80-114 and 437-489) and measured model performance using a testing dataset. In response to this comment, we revised the following parts of the manuscript:

We added the following sentence to the “Data” section (pages 2-3, lines 54-57):

We conducted a longitudinal analysis of PATH data in our study with one-time measurements. The same assumptions (as described below) are considered throughout the data cleaning and model development steps.

Page 3, lines 80-82 of the manuscript were revised as follows:

Data cleaning is the process of removing (or fixing) incorrect, irrelevant, corrupted, incorrectly formatted, incomplete, or duplicate data to ensure data quality.

The following sentence was added to page 5 lines 170-172 of the manuscript:

We developed classification models with Generalized Linear Regression (GLM) (an extension of the linear regression for binary classification), RF, GBM, and extreme gradient boosting (XGBoost) algorithms.
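As a hedged illustration of how a generalized linear model handles a binary outcome, the sketch below fits a logistic regression (a GLM with a logit link) by batch gradient descent in plain Python. The logit link, the toy data, and the function names are assumptions made here; the response does not specify the link function or the software used.

```python
import math


def sigmoid(z):
    # Maps the linear predictor onto a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit intercept and weights by batch gradient descent on the log-loss."""
    w = [0.0] * (len(X[0]) + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def predict_proba(w, x):
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


# Toy data (hypothetical): one standardized feature; 1 = quit, 0 = not quit.
X = [[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

weights = fit_logistic(X, y)
probabilities = [predict_proba(weights, x) for x in X]
labels = [1 if p >= 0.5 else 0 for p in probabilities]
print(weights, labels)
```

Unlike ordinary linear regression, the fitted values are probabilities bounded between 0 and 1, which are then thresholded (here at 0.5) to give a quit/not quit label.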

Comment 8: How were the ethical considerations secured? Was national or local IRB approval obtained?

Response: We thank the reviewer for their question. As explained in the “Data” section, the need for IRB and participants’ consent were waived in our research since we used the open-access PATH dataset [9] (not the restricted version) in which all data were fully anonymized. We revised the “Data” section (page 3, lines 57-60) as follows:

We used the open-access PATH dataset (not the restricted version) [9], in which all data were fully deidentified. Therefore, Georgetown University and the University of Michigan Institutional Review Boards exempted our analysis from review.

Comment 9: The analysis needs further explanation.

Response: We thank the reviewer for their comment, and hope that the added explanations to the manuscript help make the analysis clearer. We edited the manuscript entirely, and revised the following parts:

“Data” section, pages 2-3, lines 54-57:

We conducted a longitudinal analysis of PATH data in our study with one-time measurements. The same assumptions (as described below) are considered throughout the data cleaning and model development steps.

“Data cleaning” section, page 3, lines 80-82:

Data cleaning is the process of removing (or fixing) incorrect, irrelevant, corrupted, incorrectly formatted, incomplete, or duplicate data to ensure data quality.

“Variable importance and direction effect” section, page 5, lines 157-161:

The TreeSHAP is a typical method for variable interpretation, specifically in the public health domain. It has been used in tobacco research,[6] cancer prevention and control research,[7] and other applications.[8] The TreeSHAP analysis is used to explain the prediction of the machine learning models independently by each variable included in the model.

“Machine learning predictive models” section, page 5, lines 170-172:

We developed classification models with Generalized Linear Regression (GLM) (an extension of the linear regression for binary classification), RF, GBM, and extreme gradient boosting (XGBoost) algorithms.

“Machine learning predictive models” section, pages 5-6, lines 185-191:

The performance of the trained classifiers was compared based on classification accuracy (the ability to make correct predictions), sensitivity (the ability to predict “quit” cases correctly), specificity (the ability to predict “not quit” cases correctly), and the area under the receiver operating characteristic curve (AUC-ROC; the ability to separate “quit” from “not quit” cases across classification thresholds) in the testing set.
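The four metrics named above can be computed from true labels and predicted scores as in the following sketch. The labels, scores, and the 0.5 threshold are made-up values for illustration, and the AUC-ROC is computed through its pairwise-ranking interpretation rather than by integrating a curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) with 1 = quit and 0 = not quit."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def auc_roc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels and predicted probabilities of quitting.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)   # share of "quit" cases predicted correctly
specificity = tn / (tn + fp)   # share of "not quit" cases predicted correctly
auc = auc_roc(y_true, scores)

print(accuracy, sensitivity, specificity, auc)
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, whereas AUC-ROC depends only on how the scores rank the cases.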

Comment 10: Ensure the completeness of the contents of the manuscript, e.g., the title lacks the time period.

Response: We thank the reviewer for noting this. We have revised the title of the manuscript to: “Machine learning application for predicting smoking cessation among US adults: an analysis of Waves 1-3 of the PATH study”. Additionally, we have revised and edited the manuscript entirely to ensure the completeness of the content.

References

1. Population Assessment of Tobacco and Health (PATH) Study [United States] Public-Use Files (ICPSR 36498). Available from: https://www.icpsr.umich.edu/web/NAHDAP/studies/36498.

2. WHO sustainable development goals. Available from: https://www.who.int/europe/about-us/our-work/sustainable-development-goals/targets-of-sustainable-development-goal-3.

3. Lin, H., et al., National survey of smoking cessation provision in China. Tobacco induced diseases, 2019. 17.

4. Shaik, S.S., et al., Tobacco use cessation and prevention–A Review. Journal of clinical and diagnostic research: JCDR, 2016. 10(5): p. ZE13.

5. Jha, P., et al., 21st-century hazards of smoking and benefits of cessation in the United States. New England Journal of Medicine, 2013. 368(4): p. 341-350.

6. Medina, I.C. and M. Mohaghegh. Explainable Machine Learning Models for Prediction of Smoking Cessation Outcome in New Zealand. in 2022 14th International Conference on COMmunication Systems & NETworkS (COMSNETS). 2022. IEEE.

7. Inoguchi, T., et al., Association of serum bilirubin levels with risk of cancer development and total death. Scientific reports, 2021. 11(1): p. 1-12.

8. Shakeri, E., et al. Using SHAP Analysis to Detect Areas Contributing to Diabetic Retinopathy Detection. in 2022 IEEE 23rd International Conference on Information Reuse and Integration for Data Science (IRI). 2022. IEEE.

9. Population Assessment of Tobacco and Health (PATH) Study. Available from: https://www.icpsr.umich.edu/web/NAHDAP/studies/36231.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Mohammad Amin Fraiwan

25 May 2023

Machine learning application for predicting smoking cessation among US adults: an analysis of Waves 1-3 of the PATH study

PONE-D-23-00290R1

Dear Dr. Issabakhsh,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Mohammad Amin Fraiwan

Academic Editor

PLOS ONE

Acceptance letter

Mohammad Amin Fraiwan

31 May 2023

PONE-D-23-00290R1

Machine learning application for predicting smoking cessation among US adults: an analysis of Waves 1-3 of the PATH study

Dear Dr. Issabakhsh:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Mohammad Amin Fraiwan

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Steps of data cleaning.

    (PDF)

    S2 Appendix. Model validation for cessation transition between waves 2-3.

    (PDF)

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    Data are from the Public Use Files for the Population Assessment of Tobacco and Health, Waves 1-3 at https://www.icpsr.umich.edu/icpsrweb/NAHDAP/studies/36498.


    Articles from PLOS ONE are provided here courtesy of PLOS