Abstract
Introduction
Cigarette smoking continues to pose a threat to public health. Identifying individual risk factors for smoking initiation is essential to further mitigate this epidemic. To the best of our knowledge, no study to date has used machine learning (ML) techniques to automatically uncover informative predictors of smoking onset among adults using the Population Assessment of Tobacco and Health (PATH) study.
Aims and Methods
In this work, we employed random forest paired with Recursive Feature Elimination to identify relevant PATH variables that predict smoking initiation among adults who have never smoked at baseline between two consecutive PATH waves. We included all potentially informative baseline variables in wave 1 (wave 4) to predict past 30-day smoking status in wave 2 (wave 5). Using the first and most recent pairs of PATH waves was found sufficient to identify the key risk factors of smoking initiation and test their robustness over time. The eXtreme Gradient Boosting method was employed to test the quality of these selected variables.
Results
As a result, classification models suggested about 60 informative PATH variables among many candidate variables in each baseline wave. With these selected predictors, the resulting models have a high discriminatory power with the area under the specificity-sensitivity curves of around 80%. We examined the chosen variables and discovered important features. Across the considered waves, two factors, (1) BMI, and (2) dental and oral health status, robustly appeared as important predictors of smoking initiation, besides other well-established predictors.
Conclusions
Our work demonstrates that ML methods are useful to predict smoking initiation with high accuracy, identifying novel smoking initiation predictors, and to enhance our understanding of tobacco use behaviors.
Implications
Understanding individual risk factors for smoking initiation is essential to prevent smoking initiation. With this methodology, a set of the most informative predictors of smoking onset in the PATH data was identified. Besides reconfirming well-known risk factors, the findings suggested additional predictors of smoking initiation that have been overlooked in previous work. More studies that focus on the newly discovered factors (BMI and dental and oral health status) are needed to confirm their predictive power for the onset of smoking as well as to determine the underlying mechanisms.
Introduction
During the past decade, the tobacco product landscape has evolved rapidly with a remarkable decrease in cigarette smoking prevalence,1 a dramatic increase in the popularity of electronic nicotine delivery systems (ENDS), and the emergence of other novel tobacco products such as heated tobacco products.2,3 However, cigarette smoking is still the main cause of tobacco-related morbidity and mortality, being responsible for hundreds of thousands of premature deaths and costing the U.S. economy hundreds of billions of dollars annually.4,5 Despite the significant decline in smoking prevalence, reducing initiation rates and increasing cessation rates remain the focus of tobacco control to further reduce the burden of tobacco use and prevent a potential reversal in the reductions of population-level smoking rates from the past decades. To better support tobacco regulations, more studies predicting smoking behaviors as well as identifying risk factors of such behaviors are needed.
Using survey data and traditional regression methods, previous research has uncovered a list of individual-specific characteristics associated with smoking initiation. Because of the standard assumptions of these models and the correlation between potential explanatory variables, these analyses can generally only incorporate a relatively small number of expert-selected predictors.6–9 Consequently, they generally do not exploit the vast amount of information in survey data such as the many variables collected by the U.S. Population Assessment of Tobacco and Health (PATH) study.10 In addition, the findings from traditional regression methods are likely to depend on the independent variables included in the model. This dependence may lead to different conclusions if different sets of independent variables are considered6 and to the omission of some unanticipated predictors. Furthermore, previous studies have shown that these traditional methods may not perform optimally as prediction models, especially when dealing with class-imbalanced data (having an imbalance between class samples).11,12 Therefore, in predicting individuals’ smoking behaviors, alternative methods are needed.
Machine learning (ML) methods such as random forest (RF) are powerful tools that can address some of these limitations. Many ML algorithms can automatically identify hidden patterns in complex data that researchers would otherwise struggle to uncover to predict future outcomes. Applications of ML techniques have just started infiltrating tobacco control research, despite the popularity of its applications in other research fields.13 In tobacco control, a growing number of research articles have focused on predicting smoking and vaping behaviors and identifying the most important predictors of such behaviors through ML algorithms. For instance, researchers used ML methods such as RF classifiers and penalized logistic regression models to predict ENDS use status and identify the most important predictors of this behavior based on various surveys.13–15 Coughlin et al. employed decision trees to classify smoking cessation outcomes in group cognitive-behavioral therapy at a 6-month follow-up and identify the most relevant risk factors of treatment outcomes.16 Kim et al. applied regression tree models to identify complex interactions among various factors that differentiate global adolescents’ levels of tobacco susceptibility and use.17
Most studies used data from clinical trials or local representative surveys.13,15,16,18 Therefore, the findings may not be generalizable to the whole U.S. population. The PATH study has been a primary data source of research work on population-level tobacco use behaviors.10 PATH is a large, longitudinal, and nationally representative survey.10 It contains detailed individual-specific information, including demographics, socioeconomic and health status, perceptions, substance use, and tobacco use. To the best of our knowledge, no study to date has applied ML models to predict smoking initiation and identify relevant predictors of this behavior in the adult PATH data. Here we employed an RF classifier paired with recursive feature elimination (RF-RFE), a popular ML method, to identify critical baseline factors that predict smoking initiation among adults who have never smoked at baseline between two consecutive PATH waves. Our analysis incorporates all PATH variables relevant to the smoking onset. This study is intended to enrich the literature on applications of ML methods in tobacco control, advance our understanding of smoking initiation, and aid tobacco regulators in designing targeted tobacco interventions.
Methods
Data Preparation
The PATH survey is a national longitudinal survey of tobacco use and its effects on the U.S. population. PATH conducted its first wave in 2013 and has publicly released five waves of adult data (wave 1: September 2013–December 2014, wave 2: October 2014–October 2015, wave 3: October 2015–October 2016, wave 4: December 2016–January 2018, and wave 5: December 2018–November 2019).10 Using the PATH survey, we aimed to predict smoking initiation between two consecutive waves among adults who have never smoked at baseline and identify relevant predictors of this behavior. Here, individuals who have never tried smoking, not even one or two puffs of a cigarette, are referred to as individuals who have never smoked. In PATH, a nationally representative sample of adults was surveyed in wave 1 and completed follow-up interviews at each subsequent wave. In wave 4, a new replenishment sample was added to the study to address attrition over time and reset the sample size, making it again nationally representative.
In this study, we analyzed the earliest (waves 1 and 2) and more recent (waves 4 and 5) pairs of waves. Considering these two pairs of waves, we employed the RF-RFE method to identify the most important predictors of smoking initiation in the PATH survey and investigated whether these predictors changed over time. These important predictors were selected by the RF-RFE method as most informative for the prediction of this smoking behavior, see section “Statistical Analysis” for details. We used the wave 1 (wave 4) variables to predict the past 30-day (P30D) smoking status in wave 2 (wave 5). The P30D smoking status of an individual is defined as “Yes” if he/she smoked a cigarette in the past 30 days and “No” otherwise.
For waves 1 and 2, we first extracted a subpopulation of individuals who have never smoked in wave 1 and combined it with the outcome variable that indicates the P30D smoking status in wave 2 using individuals’ identity numbers. We then removed non-relevant variables such as personal identity numbers, random questions, sample weights, imputed variables, adult survey type, and specimen collection status. Furthermore, variables with more than 5% missing values of the total sample size were dropped to maintain the highest possible number of predictors. We also excluded individuals with missing smoking status in wave 2. To use these datasets for model development later, all the remaining missing values were imputed using a supervised imputation method.19 Finally, we removed highly correlated variables. As a result, we obtained a final analytic dataset which includes data from 5776 adults with 145 predictors, see Table 1.
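The variable-filtering steps described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the column names and toy values are hypothetical, median imputation stands in for the supervised imputation method cited in the text, and the correlation cutoff of 0.9 is an assumption, since the paper does not report the threshold used.

```python
from math import sqrt
from statistics import median

def missing_fraction(values):
    """Share of missing (None) entries in one variable."""
    return sum(v is None for v in values) / len(values)

def impute_median(values):
    """Stand-in imputation (assumption): fill gaps with the observed median."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def pearson(a, b):
    """Pearson correlation of two complete numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b) if var_a and var_b else 0.0

def filter_predictors(columns, max_missing=0.05, max_abs_corr=0.9):
    """Drop variables with more than 5% missing values, impute the rest,
    then greedily drop any variable highly correlated with one already kept."""
    complete = {name: impute_median(vals) for name, vals in columns.items()
                if missing_fraction(vals) <= max_missing}
    kept = []
    for name, vals in complete.items():
        if all(abs(pearson(vals, complete[other])) <= max_abs_corr for other in kept):
            kept.append(name)
    return kept

# Hypothetical toy data with 20 respondents.
age = [18, 25, 31, 40, 52, 60, 23, 35, 47, 29, 19, 26, 33, 41, 55, 61, 24, 36, 48, 30]
columns = {
    "age": age,
    "age_months": [12 * a for a in age],       # redundant with "age"
    "mostly_missing": [None, None] + [1] * 18,  # 10% missing, so dropped
    "income": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, None],
}
kept = filter_predictors(columns)
print(kept)
```

In this toy example only `age` and `income` survive; which member of a correlated pair is retained depends on column order, a design choice the paper does not specify.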
Table 1.
A Description of the Clean and Complete Datasets Extracted From the PATH Data After Data Processing
Baseline smoking status | Outcome smoking status | Outcome value: Yes | Outcome value: No | Number of predictors | Total |
---|---|---|---|---|---|
Individuals who have never smoked in wave 1 | P30D smoking status in wave 2 | 197 (3.4%) | 5579 (96.6%) | 145 | 5776 (100%) |
Individuals who have never smoked in wave 4 | P30D smoking status in wave 5 | 208 (2.6%) | 7687 (97.4%) | 182 | 7895 (100%) |
PATH = Population Assessment of Tobacco and Health.
However, as shown in Table 1, most wave 1 individuals who have never smoked (more than 95%) retained their never-smoking status in wave 2. Therefore, the distribution of the outcome P30D smoking status in wave 2 is highly imbalanced, which often has a negative impact on predicting the minority class (ie, individuals who have smoked in the past 30 days). We divided each dataset into training and test data (80%:20%). The training data were used for training and validating the chosen model classifier and for feature selection, while the test data, which remained untouched, were used to reevaluate the model performance on the RF-RFE selected variables.
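As an illustration of the 80%:20% split, the sketch below partitions the wave 1 to wave 2 sample sizes from Table 1. Splitting within each outcome class (stratification) is an assumption made here so that the rare "Yes" class appears in both parts; the paper does not state whether its split was stratified, and the function name and seed are hypothetical.

```python
import random

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)

# Outcome counts from Table 1: 197 "Yes" (1) and 5579 "No" (0).
labels = [1] * 197 + [0] * 5579
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "minority cases in the test data")
```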
Analogously, we repeated this whole process and obtained the final clean dataset for waves 4 and 5 (7895 adults with 182 predictors) shown in Table 1. A difference in the number of predictors between these two pairs of waves is because of the changes in the PATH survey questionnaires between wave 1 and wave 4. These data preparation processes are illustrated in Figure 1.
Figure 1.
Diagram of the data preparation process, where N is the number of participants and P is the number of variables including the outcome variable.
Statistical Analysis
After data preprocessing, we obtained two datasets corresponding to two pairs of waves in Table 1. With this many variables, our datasets are likely to contain many redundant or irrelevant predictors. Searching for an optimal subset of predictors is usually needed to produce more effective data mining algorithms, improve the predictive accuracy of the considered model, and increase the comprehensibility of the underlying process that generated the data.20 As such, for each pair of waves, we used an RF classifier combined with RFE to obtain a subset of predictors on which the RF classifier performs best.21 RF is an ensemble method that consists of many decision trees trained in parallel on different subsets bootstrapped from the original data. The final RF decision is generated by aggregating the decisions of all individual decision trees. For RF-RFE, the RF classifier was first trained on the entire training set, providing a rank order of all the predictors based on the mean decrease in accuracy. The mean decrease in accuracy was computed by randomly permuting the values of each predictor. Variables whose permutation causes a significant loss in accuracy receive high importance scores. The variable(s) with the lowest importance ranking was (were) then removed from the dataset. The model was re-trained on the reduced set of predictors, and its performance was recorded. This process was repeated until all the predictors were explored.21 We trained RF-RFE on the training data with 5-fold cross-validation repeated 3 times to score different subsets of predictors, select the best subset of predictors, and avoid selection bias.22 To overcome the class imbalance issue and improve the classification power of the model, we performed random oversampling examples (ROSE, an R package)23 of the training set within each cross-validation iteration (oversampling the training folds, but not the validation one).
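The elimination loop described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: a nearest-centroid classifier stands in for the random forest, importance is the drop in accuracy after a single permutation of each feature, subsets are scored on the training data itself rather than by repeated cross-validation with oversampling, and all names and toy data are hypothetical.

```python
import random

def fit_centroids(rows, labels, features):
    """Stand-in classifier (assumption): per-class mean of each feature."""
    model = {}
    for cls in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == cls]
        model[cls] = {f: sum(r[f] for r in members) / len(members) for f in features}
    return model

def predict(model, row, features):
    """Assign the class whose centroid is nearest in squared distance."""
    return min(model, key=lambda cls: sum((row[f] - model[cls][f]) ** 2 for f in features))

def accuracy(model, rows, labels, features):
    hits = sum(predict(model, r, features) == y for r, y in zip(rows, labels))
    return hits / len(labels)

def permutation_importance(model, rows, labels, features, rng):
    """Decrease in accuracy after shuffling one feature's values once."""
    baseline = accuracy(model, rows, labels, features)
    importance = {}
    for f in features:
        shuffled = [r[f] for r in rows]
        rng.shuffle(shuffled)
        permuted = [{**r, f: v} for r, v in zip(rows, shuffled)]
        importance[f] = baseline - accuracy(model, permuted, labels, features)
    return importance

def recursive_feature_elimination(rows, labels, features, seed=0):
    """Rank, drop the least important feature, refit, and record performance."""
    rng = random.Random(seed)
    remaining = list(features)
    history = []  # (accuracy, feature subset) at every elimination step
    while remaining:
        model = fit_centroids(rows, labels, remaining)
        history.append((accuracy(model, rows, labels, remaining), list(remaining)))
        if len(remaining) == 1:
            break
        importance = permutation_importance(model, rows, labels, remaining, rng)
        remaining.remove(min(remaining, key=lambda f: importance[f]))
    # Highest accuracy wins; ties go to the smaller subset.
    return max(history, key=lambda item: (item[0], -len(item[1])))

# Toy data: one informative feature and two pure-noise features.
data_rng = random.Random(1)
labels = [i % 2 for i in range(200)]
rows = [{"signal": 4 * y + data_rng.gauss(0, 1),
         "noise_a": data_rng.gauss(0, 1),
         "noise_b": data_rng.gauss(0, 1)} for y in labels]
best_accuracy, best_subset = recursive_feature_elimination(
    rows, labels, ["signal", "noise_a", "noise_b"])
print(best_accuracy, best_subset)
```

Scoring on held-out folds, as the paper does, matters in practice because training accuracy tends to favor larger subsets; the sketch omits it only to stay short.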
To guarantee the stability of our selected predictors, we repeated this RF-RFE training process until the intersection of the best subsets of the RF-RFE selected predictors of all simulations remains the same after many iterations (a total of 100 simulations were enough for this). Finally, the final most important predictor subset, which may contain all predictors if they are all informative, was a set of the RF-RFE selected predictors which were chosen in at least 95% of all simulations. For more details on how RF-RFE works, we refer the reader to Kuhn’s work.22 Here, we chose RF as a core method for RFE due to its appealing features including high prediction accuracy, ability to deal with high dimensional data, and internal measure of variable importance.24,25
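The 95% selection-frequency rule can be expressed compactly. The sketch below assumes each simulation returns a set of selected predictor names; the predictor names and the simulated selections are hypothetical and serve only to show the counting step.

```python
from collections import Counter

def stable_predictors(selected_sets, min_fraction=0.95):
    """Keep predictors chosen in at least `min_fraction` of all simulations."""
    counts = Counter(name for subset in selected_sets for name in set(subset))
    n_runs = len(selected_sets)
    return sorted(name for name, c in counts.items() if c / n_runs >= min_fraction)

# 100 hypothetical RF-RFE simulations.
runs = []
for i in range(100):
    subset = {"bmi", "education"}      # selected in every run
    if i < 96:
        subset.add("census_region")    # selected in 96% of runs
    if i % 2 == 0:
        subset.add("unstable_item")    # selected in only 50% of runs
    runs.append(subset)

final_predictors = stable_predictors(runs)
print(final_predictors)
```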
To evaluate how well the model performance would generalize to new data drawn from the same distribution, but not included in the training data, we trained the eXtreme Gradient Boosting method (XGBoost) using the training data with the RF-RFE selected variables and tested the model’s ability to classify the P30D smoking status using the test data. XGBoost, an efficient and effective open-source implementation of Gradient Boosting, which was developed by Chen and Guestrin in 2016,26 has been shown to outperform many other classification techniques.
Performance Measures of the RF Classifier
A receiver operating characteristic (ROC) curve is a graphical representation of a classifier’s performance as a tradeoff between specificity and sensitivity. It is typically obtained by plotting the true positive rate (sensitivity) against the true negative rate (specificity), or equivalently against the false positive rate (1 − specificity), for a single classifier at various probability cutoff values. For each outcome wave,

$$\text{Sensitivity} = \frac{\#\ \text{individuals with true P30D smoking status}}{\#\ \text{individuals with true P30D smoking status} + \#\ \text{individuals with false never smoking status}},$$

$$\text{Specificity} = \frac{\#\ \text{individuals with true never smoking status}}{\#\ \text{individuals with true never smoking status} + \#\ \text{individuals with false P30D smoking status}},$$

where individuals with true P30D smoking status (false never smoking status) are those whose actual status in the outcome wave is P30D smoking and who are correctly (incorrectly) classified by the model, and individuals with true never smoking status (false P30D smoking status) are those whose actual status in the outcome wave is never smoking and who are correctly (incorrectly) classified by the model. Sensitivity and specificity show the ability of a classifier to correctly identify individuals’ P30D smoking status in wave 2 (wave 5) among those who had never smoked in wave 1 (wave 4). The area under the ROC curve (AUC) is usually a preferred measure of the model performance in binary classification problems. AUC is a more discriminating measure of performance than accuracy in classification.27 The closer AUC gets to 100%, the better the classifier performance. A perfect classifier corresponds to an AUC of 100%, while a classifier without any predictive power corresponds to an AUC of 50%. Finally, to provide a reliable estimate of AUC, we also computed the 95% confidence interval (CI) of AUC with 2000 stratified bootstrap replicates.
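The performance measures above can be computed as in the following sketch. It assumes labels coded 1 for P30D smoking and 0 otherwise, scores interpreted as the predicted probability of P30D smoking, a rank-based (Mann-Whitney) AUC, and a percentile interval from a bootstrap that resamples each class separately. The toy scores are hypothetical, and the authors' software may differ in detail.

```python
import random

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity from true and predicted 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def stratified_bootstrap_ci(pos_scores, neg_scores, n_boot=2000, level=0.95, seed=0):
    """Percentile CI for AUC, resampling positives and negatives separately."""
    rng = random.Random(seed)
    stats = sorted(
        auc(rng.choices(pos_scores, k=len(pos_scores)),
            rng.choices(neg_scores, k=len(neg_scores)))
        for _ in range(n_boot)
    )
    tail = (1 - level) / 2
    return stats[int(tail * n_boot)], stats[int((1 - tail) * n_boot) - 1]

# Hypothetical test-set labels and predicted probabilities of P30D smoking.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.5]
pos = [s for t, s in zip(y_true, scores) if t == 1]
neg = [s for t, s in zip(y_true, scores) if t == 0]

y_pred = [1 if s >= 0.5 else 0 for s in scores]
sens, spec = sensitivity_specificity(y_true, y_pred)
auc_value = auc(pos, neg)
ci_low, ci_high = stratified_bootstrap_ci(pos, neg)
print(sens, spec, auc_value, (ci_low, ci_high))
```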
The study was exempted from the review by the Institutional Review Board at the authors’ institution because of the public availability of the data.
Results
Most Relevant Predictors of Smoking Initiation
RF-RFE selected only 64 and 56 relevant variables out of 145 and 182 possible variables in waves 1 and 4, respectively, for predicting smoking initiation among adults who have never smoked at baseline, as shown in detail in Supplementary Tables A1, A2 in the appendix. With these recommended variables, the classification model’s performance metrics (Accuracy and Kappa28) reach their best levels. Based on the description of each variable, we grouped them into different categories as detailed in Supplementary Tables A1, A2 in the appendix. Variables describing the same characteristics are put into the same category. For instance, the “Social influence” category includes the variables in waves 1 and 4 that describe the number of hours in the past 7 days that an individual was in close contact with others when they were smoking. We refer the readers to Supplementary Tables A1, A2 in the appendix for the description of each category in Table 2.
Table 2.
Individuals’ Characteristics Related to the Transition from Individuals Who Have Never Smoked to Those Who Have Smoked in the Past 30 Days Between Two Consecutive Waves
Predictor category | Wave 1 | Wave 4 |
---|---|---|
BMI | Yes | Yes |
Census region | Yes | Yes |
Dental and Oral care | Yes | Yes |
Education | Yes | Yes |
Employment status | Yes | Yes |
Experimentation with other tobacco products | Yes | Yes |
Exposure to radio and TV | Yes | Yes |
Financial status | Yes | Yes |
Gender, Race | Yes | Yes |
Mental and Physical health status | Yes | Yes |
Social influence | Yes | Yes |
Tobacco risk perceptions and awareness | Yes | Yes |
Alcohol use status | Yes | No |
Exposure to tobacco products and tobacco advertisements | Yes | No |
Household rules about tobacco use | Yes | No |
Insurance status | Yes | No |
Relatives’ asthma status | Yes | No |
Taking anti-inflammatory or pain medication | Yes | No |
Presence of youth in the household | Yes | No |
Internet use | Yes | No |
Exposure to anti-tobacco advertisements | No | Yes |
Marital status | No | Yes |
Smoking intention | No | Yes |
As a result, we obtained a set of individual characteristics associated with the onset of cigarette smoking, see Table 2 and Supplementary Tables A1, A2 in the appendix for more details. Table 2 indicates that across the considered baseline waves, census region, BMI, dental and oral care, education, employment status, experimentation with other tobacco products, exposure to radio and TV, financial status, mental and physical health status, gender, race, social influence, and tobacco risk perceptions and awareness are important factors associated with smoking initiation by the following wave. In addition to those variables, alcohol use status, exposure to tobacco products and tobacco advertisements, household rules about tobacco use, insurance status, relatives’ asthma status, taking anti-inflammatory or pain medication, presence of youth in the household, and internet use are related to smoking initiation between waves 1 and 2. However, exposure to anti-tobacco advertisements, marital status, and smoking intention are predictive factors identified only in the wave 4 to 5 analysis.
Supplementary Figures A1, A2 in the appendix comprehensively present the rankings of all the RF-RFE selected predictors and their impact directions on P30D smoking status.
Performance of the RF Classifier
Figure 2 shows the predictive power of the trained XGBoost classifiers on the test data using the RF-RFE selected variables (around 60 variables). Critical metrics for evaluating the classifiers’ performance, including AUC, specificity, and sensitivity with their 95% CIs, are computed. The classifiers perform very well on all considered datasets, with AUCs of around 80% (82% [CI: 75.5% to 87.5%] for waves 1 and 2 and 78% [CI: 70% to 85.5%] for waves 4 and 5). Column 3 displays the suggested probability cutoffs, which correspond to the points on the ROC curves closest to the top-left corner of the plots, where sensitivity and specificity are both perfect. These suggested thresholds are computed using the closest-topleft method.29 Individuals with predicted probability greater than these thresholds are classified as individuals who have never smoked, and as individuals who have smoked in the past 30 days otherwise. Columns 4 and 5 are the values of sensitivity and specificity corresponding to the thresholds in column 3. Increasing sensitivity (more individuals who have smoked in the past 30 days are identified correctly) leads to a decrease in specificity (more individuals who have never smoked are incorrectly identified), and vice versa. Therefore, these thresholds are just suggestive and should be adjusted to achieve one’s desired balance between sensitivity and specificity. The ROC curves of the XGBoost classifiers are plotted in Figure 2. The list of optimized XGBoost hyperparameters can be found in Supplementary Table A3 in the appendix.
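The closest-topleft rule picks the cutoff whose (specificity, sensitivity) point minimizes the squared distance (1 − sensitivity)² + (1 − specificity)² to the perfect corner. The sketch below is a minimal illustration under stated assumptions: scores are treated as the predicted probability of P30D smoking (the opposite orientation to the never-smoking probability described in the text), every observed score is tried as a candidate cutoff, and the toy data are hypothetical.

```python
def closest_topleft_threshold(y_true, scores):
    """Return (threshold, sensitivity, specificity) nearest the top-left corner."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = None
    for threshold in sorted(set(scores)):
        # Classify as positive (P30D smoking) when the score reaches the cutoff.
        tp = sum(t == 1 and s >= threshold for t, s in zip(y_true, scores))
        tn = sum(t == 0 and s < threshold for t, s in zip(y_true, scores))
        sens, spec = tp / n_pos, tn / n_neg
        distance = (1 - sens) ** 2 + (1 - spec) ** 2
        if best is None or distance < best[0]:
            best = (distance, threshold, sens, spec)
    return best[1], best[2], best[3]

# Hypothetical labels (1 = P30D smoking) and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.5]
threshold, sens, spec = closest_topleft_threshold(y_true, scores)
print(threshold, sens, spec)
```

On these toy values the rule selects a cutoff of 0.8, trading one missed positive for perfect specificity; a different cost balance between the two error types would justify a different cutoff, as the text notes.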
Figure 2.
The receiver operating characteristic (ROC) curves of all the XGBoost classifiers. The diagonal line in each plot represents random guessing. Each plot shows the area under the ROC curve (AUC) with its 95% confidence interval (CI), together with a suggested threshold (specificity, sensitivity).
Discussion
This is the first study to leverage all possible original public adult PATH variables to extract a subset of the most informative variables for predicting the transition from having never smoked to having smoked in the past 30 days between two consecutive waves using RF-RFE. We uncovered about 60 variables associated with the smoking onset between two considered PATH waves among adults who have never smoked at baseline. Despite the severely skewed class distribution shown in Table 1, with these RF-RFE selected variables, the XGBoost model performs well in classifying smoking status in the follow-up wave among individuals who have never smoked at baseline with AUC = 80% approximately (see Figure 2). The summary in Table 2 shows that most of the discovered risk factors are robust across waves and their association with smoking initiation is well reported in the literature.
Before starting our discussion about the association between the resulting risk factors in Table 2 and the smoking initiation behavior, we would like to emphasize that the direction of the impact (ie, negative or positive) of each risk factor on the onset of smoking, as discussed below, is based on the exhaustive information provided in Supplementary Tables A1, A2 and Figures A1, A2 in the appendix.
For the joint predictive variables between waves 1 and 4, most of them (ie, census region,30 education,31,32 employment status,33 experimentation with other tobacco products,32,34–38 exposure to radio and TV,39 financial status,32,40 mental and physical health status,41 gender,30 race,31,32,42 social influence,43–45 and tobacco risk perceptions and awareness31,45) are familiar risk factors of smoking initiation. Meanwhile, other factors, including BMI and dental and oral health status, have not been consistently identified as predictors of smoking initiation in the literature. Across the considered waves, our model consistently indicates BMI as one of the main predictors of the onset of cigarette smoking in the following waves. Our findings indicate that individuals with lower BMI are likely to be at a higher risk of smoking initiation. Several studies have found a strong connection between BMI and smoking initiation while analyzing different datasets.45–48 This association could be related to weight concerns, since many believe that smoking helps control their weight, or could arise via other sociodemographic factors such as lower socioeconomic status, income, and educational attainment, which are highly correlated with BMI.45–48 Oral and dental care robustly appears among the predictors of smoking initiation in both baseline waves. The association between oral and dental care and smoking onset has not been widely reported in the literature, except in earlier studies that analyzed data from a randomized clinical trial in California to predict tobacco initiation over a 2-year follow-up interval among adolescent never-tobacco users.48,49 People who take good care of their oral health seem to be less likely to start smoking, possibly because they are aware of the harmful impacts of smoking on their teeth and gum health. More work is required to investigate and understand the impacts of these factors on smoking initiation.
Besides the shared predictors of smoking initiation across the considered waves, each baseline wave has its own set of relevant predictors, in part because of the changes in the PATH questionnaire. Again, most wave-specific variables are well reported in previous studies except taking anti-inflammatory or pain medication, insurance status, relatives’ asthma status, and internet use in wave 1. We found that taking anti-inflammatory or pain medication is associated with a lower risk of smoking initiation. Individuals taking these medications are likely to be in poor health already, and thus may not want to start smoking. In addition, our results show that having no insurance is associated with a higher likelihood of smoking initiation.30 This link is probably because of the confounding effects of other factors such as income and employment status. Having any biological relative (father, mother, brothers or sisters, sons or daughters, whether living or deceased) with asthma is associated with a lower probability of smoking. This may reflect increased awareness of tobacco risks, since asthma is a well-known tobacco-related disease. Finally, our findings show that the more time an individual spends on the internet, the more likely that person is to initiate smoking. Internet use is likely to be an indicator of exposure to smoking-stimulating factors, including onscreen tobacco use, tobacco advertisements, and digital social influence. A recent study14 reported a similar observation while using a penalized logistic regression model on the PATH data to predict the use of ENDS among never users of any tobacco products. These authors found an association between digital media use and future ENDS use among U.S. adolescents when analyzing different waves. There could be other possible explanations for the connection between these uncommon factors and smoking initiation. As such, more studies on these matters are needed.
Like other similar studies,14,18 our work has limitations. First, since the PATH survey is self-reported, the responses are potentially subject to recall bias. Second, our analysis was conducted based on the unweighted PATH data; therefore, our results may not generalize to the whole population. However, since the PATH survey is nationally representative, that is, the survey sample captures the characteristics of most U.S. individuals, our conclusions are likely to hold in general. External validation of these findings is needed. Third, factors not included in the PATH study were not considered. Fourth, even though we simulated RF-RFE a hundred times to obtain the final converged set of most informative predictors of smoking onset, other ML methods may provide a slightly different set with which the XGBoost classifier’s discriminative power remains similar to our presented results. As such, a future study that compares the sets of most informative predictors of this smoking behavior obtained from different feature selection techniques is encouraged. Finally, it should be noted that our results cannot establish the causal direction of the association of the model-selected predictors with smoking initiation because of the nature of these ML algorithms.
While most smoking experimentation happens during adolescence, more established smoking behavior is usually observed during young adulthood.4 In addition, recent studies have reported a shift in the age of smoking initiation towards older ages.50 As such, with our outcome of interest, the current P30D smoking status, it is relevant to analyze the young adult PATH data to understand smoking initiation. In PATH, according to our definition of the baseline population and the outcome of interest, we found that the unweighted proportions of individuals who have never smoked at baseline and have smoked in the past 30 days in the following wave were similar in both the adult and youth data. We chose to start with the adult data first to demonstrate the feasibility of this approach. The analysis based on the youth data, our ongoing work, is more complex because of the aged-out individuals whose behaviors are different from the rest. As such, analyses of predictors of adolescent initiation will be the focus of a follow-up paper.
In summary, this work demonstrates the potential applications of ML methods in developing highly accurate classification models to predict smoking initiation using nationally representative survey data and systematically identifying all potential risk factors of smoking behaviors in the PATH data. Besides reconfirming well-known risk factors in the literature, our findings discovered additional predictors of smoking initiation that have not been paid much attention to in previous studies. More studies that focus on the newly discovered factors are needed to confirm their predictive power against changes in smoking behaviors as well as determine the underlying mechanisms.
Supplementary Material
A Contributorship Form detailing each author’s specific involvement with this content, as well as any supplementary data, are available online at https://academic.oup.com/ntr.
Contributor Information
Thuy T T Le, Department of Health Management and Policy, School of Public Health, University of Michigan, Ann Arbor, MI, USA.
Mona Issabakhsh, Department of Oncology, School of Medicine, Georgetown University, Washington, DC, USA.
Yameng Li, Department of Oncology, School of Medicine, Georgetown University, Washington, DC, USA.
Luz María Sánchez-Romero, Department of Oncology, School of Medicine, Georgetown University, Washington, DC, USA.
Jiale Tan, Department of Epidemiology, School of Public Health, University of Michigan, Ann Arbor, MI, USA.
Rafael Meza, Integrative Oncology, BC Cancer Research Institute, Vancouver, BC, Canada.
David Levy, Department of Oncology, School of Medicine, Georgetown University, Washington, DC, USA.
David Mendez, Department of Health Management and Policy, School of Public Health, University of Michigan, Ann Arbor, MI, USA.
Funding
Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health (NIH) and FDA Center for Tobacco Products (CTP) under Award Number U54CA229974. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or the Food and Drug Administration.
Declaration of Interests
The authors do not report any conflicts of interest.
Data Availability
All data produced are available online at https://www.icpsr.umich.edu/web/NAHDAP/studies/36231