Abstract
Background:
Social media platforms are critical channels for promoting e-cigarettes, particularly among youth, making analysis of their vast and diverse content essential for public health interventions. Prevalence rates of e-cigarette use are high and evidence suggests that social media are popular forums that promote e-cigarette use through direct and indirect marketing techniques. The volume and diverse nature of e-cigarette-related information on social media is challenging and may obfuscate public health prevention messaging. Traditional hand-coding methods are labor-intensive and limit scalability. In contrast, unsupervised machine learning approaches, such as topic modeling, allow for efficient analysis of large datasets, uncovering patterns and trends that manual methods cannot achieve at scale. The present study focused on ascertaining the extent to which themes and topics in tweets related to e-cigarettes can be successfully rendered into useful homogenous units using machine learning. A better understanding of current depictions and discussions around e-cigarette products and use on social media can inform public health counter messaging and policy interventions.
Methods:
We used topic modeling (BERTopic) to iteratively derive vape-related tweet clusters and calculate the importance of particular words to these groupings. We conducted a qualitative content analysis to study clustered tweets. We also sought to determine the geographic locations of e-cigarette conversations using automated geoparsing methods, which translate toponyms in textual data into geographic identifiers, to attempt to infer the location of tweets.
Results:
We were able to successfully identify >100,000 tweets in broad thematic categories in English and Spanish. Our correlation and inter-topic map analysis of the machine-derived topics, which examines the relationships between topics, indicated that most of the topics were unique (correlation value < 0.5) and did not overlap with each other. We identified six categories: Flavors and Disposable Vapes, Cannabis, Vape Shops and Refillable Vapes, Vape Culture, Anti-vaping and Quitting, and Spanish Tweets and Vaping Nicotine. Further analysis of these categories using qualitative methods identified themes within each. For example, Category 6 (Spanish Tweets and Vaping Nicotine) included four topics focused on the health risks of vaping, personal urges to vape, vaping harms among vulnerable populations, and the regulation of vaping products. Using geoparsing, which automatically detects location information, we found that the United States had the highest number of tweets related to vaping.
Discussion/conclusion:
Results underscore the possibility of leveraging BERTopic modeling to reduce large quantities of data and thereby comprehensively describe and categorize the myriad e-cigarette-related messages to which social media users are exposed. This data reduction approach can be applied to various social media platforms to describe and categorize e-cigarette posts and thereby triangulate and validate findings. Thematic content analysis of the topics identified through this technique requires supervision and human input. Our approach provides a comprehensive understanding of the evolving e-cigarette discourse, informing public health counter-messaging and policy interventions. Moreover, findings support the need for regulation, such as reducing appealing flavors, and suggest that social media can be used effectively to support public health messaging (i.e., quitting messages).
Keywords: BERTopic, Topic modeling, Natural language processing, Vape, E-cigarettes
1. Introduction
The prevalence of electronic cigarettes (commonly known as e-cigarettes and referred to as vaping) has increased over the past several years (Evans-Polce et al., 2020; Obisesan et al., 2020). In the United States (US) in 2023, 10 % of high school students and 4.6 % of middle school students used e-cigarettes during the 30 days prior to assessment (Birdsey et al., 2023). Prevalence of e-cigarette use, either daily or occasionally, among young adults aged 18–24 years in the US rose from 2.4 % in 2012–2013 to 5.2 % in 2017 and increased further to 7.6 % in 2018 (Dai and Leventhal, 2019; Primack et al., 2015). As of 2021, 4.5 % of US adults aged 18 years or older and 11 % of those aged 18–24 years reported currently using e-cigarettes, which is the highest level of any age group (MMWR, 2023).
One factor impacting the initiation, continuation, and escalation of e-cigarette use among all age groups involves engagement on social media platforms (Cavazos-Rehg et al., 2021) such as Twitter (now known as X). Young adults use social media at higher rates than older adults (Pew Research Center, 2024) and e-cigarette content is ubiquitous on social media platforms, in the form of direct and indirect marketing (Jung et al., 2024). As a result, young adults are exposed to or see e-cigarette related content in the form of direct marketing as well as more subtle images in which e-cigarettes are used as part of daily life, thereby increasing young adults’ risk for e-cigarette use (Jung et al., 2024).
At the time of data collection, Twitter was one of the most popular social media platforms, hosting 237.8 million monetizable daily active users (Statista, 2023), with around 77 million Twitter users in the US alone (Twitter by the Numbers, 2023). Many adolescents and young adults use Twitter (Lu et al., 2022), with 20 % of adolescents and 42 % of young adults reporting use of this social media platform in 2023 (Pew Research Center, 2023; Statista, U.S. Twitter reach by age group 2021, n.d). The platform actively hosts vaping conversations (Hassan et al., 2022; Malik et al., 2021), and the largest category of e-cigarette-related tweets involves marketing and promotion, personal experience-related and personal opinion-oriented tweets (Cole-Lewis et al., 2015). Any promotional information on Twitter could be from companies or individuals, such as users or social influencers. More recent work has found a significant increase in vaping and e-cigarette discussions on Twitter, including posts (i.e., ‘tweets’) related to flavors, perceptions, attitudes, and sentiments (positive or negative) (Gao et al., 2022). During the 2019 outbreak of E-cigarette or Vaping Use-Associated Lung Injury (EVALI), an officially identified and named severe pulmonary illness resulting in hospitalization and deaths associated with the use of e-cigarette products, there was an increase on Twitter of negative sentiment in vaping-related posts and engagement with them (Wu et al., 2022).
Pre-trained topic models remain underutilized for analyzing e-cigarette-related messages on social media. We aim to address this gap by identifying the most common themes and topics in e-cigarette-related tweets on Twitter using multilingual BERTopic built on Bidirectional Encoder Representations from Transformers (BERT), a machine learning topic modeling system pre-trained on 40 epochs over a 3.3-billion-word corpus. Our study uniquely provides a method to apply BERTopic to e-cigarette-related tweets, and it successfully clusters a large volume of tweets into manually interpretable clusters. We sought to identify unique clusters of topics in each category (e.g., a topic cluster about quitting vaping and another topic cluster about flavors) that future content analyses can examine further. We also provide an open-source dataset of 121,000 unique tweets scraped from 98,634 unique individuals between November 4, 2022 and February 23, 2023, using 118 vape-related hashtags, for others to use or to replicate and extend our work.
2. Literature review/related work
Social media platforms are regularly used to express ideas, beliefs, personal experiences, and opinions on e-cigarettes and other substance use (Ren et al., 2022). Studies have used the Twitter Application Programming Interface (API), a gateway to automatically collect large volumes of tweets, to examine the association between social media use and vaping among adolescents and young adults (Adhikari et al., 2021; Lee et al., 2021; Ren et al., 2022; Wu et al., 2022). For example, Ren et al. (2022) utilized a detection model to extract vaping-related keywords from Twitter, develop classifiers, and identify vaping-related text within the tweets. Additionally, Galimov et al. (2022) collected tweets over a 7-month period focusing on mentions of vaping-related and ice-flavor-related terms using content analysis and identified themes such as marketing of ice-flavored e-cigarette products (e.g., blueberry ice or melon ice) and personal testimonials about these products. In a similar study, Cole-Lewis et al. (2015) used content analysis to analyze the user categories (e.g., tobacco company, government), message framing, genre (news experience, marketing, etc.) and sentiments (positive, negative, neutral) surrounding e-cigarette discussions on Twitter. All these studies involved the collection of large volumes of data directly from the Twitter API for specific analysis of e-cigarette-related content.
Historically, content analyses of social media have relied on hand-coding of text by trained coders, which limits the number of posts that can feasibly be coded. However, the large volume of e-cigarette-related posts on Twitter alone overwhelms traditional hand-coding methods, making automated machine learning approaches that classify content without human input (i.e., ‘unsupervised’ systems) imperative for analyzing large numbers of posts. Topic modeling, an unsupervised machine learning method, has been successfully applied in e-cigarette-related research (Zhan et al., 2017). This method identifies words that statistically co-occur and clusters them into machine-derived categories. For instance, Hassan et al. (2022) utilized the Latent Dirichlet Allocation (LDA) topic modeling approach to automatically identify topics in 189,658 e-cigarette-related tweets. LDA has also been utilized to identify potential e-cigarette beliefs for campaign design targeting youth (Sangalang et al., 2019) and to identify themes related to industry activities, protecting minors, health effects, and regulations outside of the US (Liu et al., 2022). Unsupervised machine learning approaches have also been applied to analyze e-cigarette-related content on other social media platforms, such as Instagram (Ketonen and Malik, 2020). However, neural machine learning-based topic modeling remains underutilized for analyzing large-scale e-cigarette-related messages on social media, which makes our investigation of the most common themes and topics in e-cigarette-related tweets methodologically unique.
Studies have successfully applied the Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) pre-trained language model to classify public sentiments about Juul flavors on Twitter (Malik et al., 2021) and to examine how vaping was framed on Facebook between 2008 and 2021 (Chen et al., 2022). Baker et al. (2022) used a BERTweet-based deep learning classifier to identify vape-related tweets and to further classify them by commercial nature and sentiment. Additionally, Baker et al. compared the BERTweet classifier against a long short-term memory (LSTM) model on hand-coded tweets and showed that pre-trained models achieve higher accuracy than traditional deep learning approaches without pre-training. Although the Baker et al. (2022) study reinforced the importance of open-source deep learning BERTweet classifiers in analyzing time-sensitive data on social media platforms like Twitter, their work did not specifically aim to understand broader vape-related conversations on social media. Moreover, as of April 2024, only a handful of studies have utilized BERT for topic modeling (i.e., using BERTopic, a state-of-the-art natural language processing and deep-learning neural method) to cluster vape-related social media posts into interpretable topics (Chen et al., 2022; Lee et al., 2024; Liu et al., 2024; Shah et al., 2024). Our study therefore expands on this work as well as Baker et al. (2022), but with the specific aim of better categorizing content on Twitter via neural topic modeling, an unsupervised machine learning algorithm incorporating neural components to organize a corpus of documents into coherent topics.
In the current study, we leverage BERTopic to analyze a large Twitter dataset collected between November 2022 and February 2023 and to evaluate the extent to which collected tweets can be stratified into vape-related categories suitable for further qualitative analysis. The categories rendered through this process will be further examined in an ongoing study, Project VAMoS, which is a four-year web-based study examining the independent and combined influences of social media exposure/engagement and acculturation on the vaping behaviors of a longitudinal cohort of 18–29-year-old Mexican-American college students in Texas. We use BERTopic to identify thematic groups of vape-related tweets and calculate the importance of particular words to these groupings. We also sought to determine the geographic locations of e-cigarette conversations, which may aid in identifying locations where e-cigarette use may be more prevalent. Our research question is as follows: Can neural machine learning-based topic modeling approaches be used to stratify large volumes of vape-related tweets into topical clusters, which are suitable for future in-depth qualitative categorization of the content in those clusters?
3. Methods
In this section, we introduce the computational and qualitative methods used in the current study. Fig. 1 provides a visual summary of our research methods.
Fig. 1. Methods framework.
3.1. Hashtags and categories
For the current study, a team of content experts created a list of 118 unique vape-related hashtags based on a) the existing literature about e-cigarettes and b) the VAMoS study population, Mexican-American college students in Texas (see Appendix A for a list of hashtags used). An initial list of subgroup categories was assembled by examining information in a dataset containing 118 unique hashtags (e.g., brands, etc.). Next, categories were manually created by individually reviewing each hashtag that did not fit into the existing categories. An “Other” subgroup was created for hashtags that did not fit existing subgroups. After reviewing all the data, the subgroup titles (see Appendix A) were refined for accuracy and then reviewed by the co-authors.
3.2. Dataset generation
We initially collected 465 million tweets using a Python-based script that collected data from Twitter’s ‘Spritzer’ STREAM API, which samples 1 % of all tweets globally regardless of keyword, hashtag, etc. These data were collected between November 2022 and February 2023 using an Oracle Cloud Compute instance that scraped tweet id and tweet text only. Next, we searched the 465 million tweets for e-cigarette-related hashtags in each category (Appendix A), and we obtained a total of 154,281 tweets.
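The hashtag search over the sampled stream can be sketched as follows. This is a minimal illustration rather than the study's script: the three hashtags stand in for the 118 listed in Appendix A, and the function names `has_vape_hashtag` and `filter_stream` are hypothetical.

```python
import re

# Hypothetical stand-in for the 118 hashtags in Appendix A.
VAPE_HASHTAGS = {"#vape", "#vapelife", "#ecig"}

HASHTAG_RE = re.compile(r"#\w+")


def has_vape_hashtag(text, hashtags=VAPE_HASHTAGS):
    """Return True if the tweet carries at least one target hashtag (case-insensitive)."""
    found = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    return not found.isdisjoint(hashtags)


def filter_stream(tweets, hashtags=VAPE_HASHTAGS):
    """Yield (tweet_id, text) pairs whose text contains a target hashtag."""
    for tweet_id, text in tweets:
        if has_vape_hashtag(text, hashtags):
            yield tweet_id, text


# Toy sample of (tweet id, tweet text) records, the two fields that were scraped.
sample = [
    (1, "Great clouds today #VapeLife"),
    (2, "Nothing to see here #coffee"),
    (3, "new pods in stock #ecig #sale"),
]
matched = list(filter_stream(sample))
```

Matching whole hashtags, rather than substrings, avoids counting a tag such as `#vaporwave` as a hit for `#vapor`; whether the original search matched whole tags is not stated in the text.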
3.3. Preprocessing
We preprocessed the data by first removing duplicate tweets, converting all text to lowercase, and then removing the links and URLs from the tweets because this information is not relevant to topic modeling. We removed retweets (including ‘RT’), along with duplicated punctuation and numbers that occur in tweet text. Emojis also were removed. Finally, stop words (i.e., commonly used words such as “the,” “an,” “for”) were removed using the stop words list from NLTK (about 6700 stop words) (Hardeniya et al., 2016).
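The preprocessing steps above can be sketched with the standard library alone. This is an illustration under stated assumptions: the small `STOP_WORDS` set replaces the NLTK list used in the study, the emoji ranges are approximate, and the order of the individual cleaning steps is our guess.

```python
import re

# Tiny illustrative stop word set; the study used the NLTK list instead.
STOP_WORDS = {"the", "an", "for", "a", "is", "to"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
RT_RE = re.compile(r"\brt\b")  # retweet marker, matched after lowercasing
# Approximate emoji ranges (assumption; not exhaustive).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
DIGIT_RE = re.compile(r"\d+")
REPEAT_PUNCT_RE = re.compile(r"([^\w\s])\1+")  # e.g. "!!!" -> "!"


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, 'rt', emojis, numbers, and stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = RT_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    text = REPEAT_PUNCT_RE.sub(r"\1", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


def preprocess(tweets):
    """Drop exact duplicate tweets first, then clean each remaining tweet."""
    seen, cleaned = set(), []
    for text in tweets:
        if text in seen:
            continue
        seen.add(text)
        cleaned.append(clean_tweet(text))
    return cleaned
```

Hashtags and mentions are left intact here because the topic keywords reported later include hashtags; that choice is an inference from Table 1, not a documented step.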
3.4. Topic modeling
We selected ‘multilingual BERTopic’ as the tool for topic modeling for its extensibility and its successful use in computational public health literature (Taeb et al., 2021). Devlin et al. (2018) introduced BERT, a language model built on the transformer architecture, an attention-based model (Vaswani et al., 2017), which improved results for downstream NLP applications. BERT’s encoder reads the entire sequence of text at once rather than strictly left to right or right to left, which lets it analyze documents bidirectionally. This is advantageous because the model can infer the context of a given word from the surrounding words. BERTopic leverages this characteristic by embedding input documents with a pre-trained transformer model and grouping the embeddings into clusters; a class-based variant of term frequency-inverse document frequency (c-TF-IDF) then identifies the words that best characterize each cluster. Clustering is performed with a hierarchical algorithm, the Hierarchical Density-based Spatial Clustering of Applications with Noise (HDBSCAN; Campello et al., 2015). HDBSCAN treats documents that cannot be assigned to any cluster as outliers, which keeps unrelated documents out of the clusters and minimizes noise. This approach has been used to improve topic representations (Grootendorst, 2022).
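To make the term weighting concrete, the sketch below computes class-based TF-IDF scores as defined by Grootendorst (2022): the frequency of a term within a cluster multiplied by log(1 + A / f), where A is the average number of words per cluster and f is the term's frequency across all clusters. It uses raw counts and toy token lists; the cluster labels and the helper names `c_tf_idf` and `top_terms` are illustrative, and the BERTopic library itself may normalize frequencies differently.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per cluster.

    clusters: dict mapping a cluster label to a list of token lists (one per document).
    Returns a dict mapping each label to a {term: score} dict.
    """
    # Treat every cluster as one concatenated "class document".
    class_counts = {
        label: Counter(tok for doc in docs for tok in doc)
        for label, docs in clusters.items()
    }
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(class_counts)

    return {
        label: {
            term: tf * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        for label, counts in class_counts.items()
    }


def top_terms(scores, label, n=3):
    """Return the n highest-scoring terms of a cluster (ties broken alphabetically)."""
    ranked = sorted(scores[label].items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Toy clusters, not study data.
clusters = {
    "flavors": [["mango", "pod", "vape"], ["mango", "ice", "vape"]],
    "quitting": [["quit", "smoking", "vape"], ["quit", "today"]],
}
scores = c_tf_idf(clusters)
```

In this toy input, "vape" appears in both clusters and is therefore down-weighted relative to terms that are specific to one cluster, which is the behavior that makes the keyword lists useful for labeling topics.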
Quantitative methods have been developed to evaluate topic models discerned from BERTopic and can help evaluate the uniqueness of the topics identified. Similarity matrices, term scores, and inter-topic maps have been successfully used by others to assist with topic interpretability (Alqahtani et al., 2022). We interpret the BERTopic output using keyword c-TF-IDF scores (‘term scores’) for each parameter, sorted in decreasing order. Term score visualization, which shows terms’ relative relevance to a topic (ordered from most to least influential), indicates which terms are most central to their respective topics. We also use similarity matrices for each category of tweets we studied. For example, Xu et al. (2017, p. 1) developed topic correlation for ranking topics based on clinical utility to “enhance the ability of discovering high-quality topics.” Similarity matrices use a cosine similarity score and visualize how similar (i.e., correlated) topics are to one another. Output matrices range from green to blue, with dark blue indicating the strongest relationships between topics and light green indicating the weakest. We also produced inter-topic distance maps for each category of tweets. These maps provide two-dimensional visualizations; the size of the circle for a specific topic represents the number of tweets that the model estimates fall under that topic. Circles that are closer together represent topics that have more words in common. Similarly, circles that are farther apart have less overlap and are more distinct from one another. We embed our c-TF-IDF representation of the topics in 2D using UMAP and then visualize the two dimensions using ‘Plotly’ to create an interactive view. Our inter-topic distance plots are based on the LDAvis visualization technique (Sievert and Shirley, 2014).
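The similarity matrix described above can be illustrated with a short cosine-similarity sketch. The topic vectors are made-up term-score vectors over a shared vocabulary, the topic names are hypothetical, and the 0.5 cutoff mirrors the correlation value reported in the abstract; none of this reproduces the study's actual output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(topic_vectors):
    """Return (labels, matrix) where matrix[i][j] is the similarity of topics i and j."""
    labels = list(topic_vectors)
    matrix = [
        [cosine_similarity(topic_vectors[a], topic_vectors[b]) for b in labels]
        for a in labels
    ]
    return labels, matrix


def overlapping_pairs(labels, matrix, threshold=0.5):
    """List topic pairs whose similarity meets or exceeds the threshold."""
    return [
        (labels[i], labels[j])
        for i in range(len(labels))
        for j in range(i + 1, len(labels))
        if matrix[i][j] >= threshold
    ]


# Hypothetical term-score vectors over a shared four-term vocabulary.
topic_vectors = {
    "flavors": [2.6, 2.1, 0.0, 0.0],
    "quitting": [0.0, 1.0, 2.6, 1.9],
    "cessation": [0.0, 0.9, 2.5, 2.0],
}
labels, matrix = similarity_matrix(topic_vectors)
pairs = overlapping_pairs(labels, matrix)
```

Pairs returned by `overlapping_pairs` are the ones a reader would see as dark cells off the diagonal of the heat map, and they are candidates for merging or closer qualitative review.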
3.5. Geoparsing
We used the ‘mordecai’ Python library (Halterman, 2017), which searches tweet text for location-related keywords to geolocate tweets (e.g., “Great day for an outdoor vape in Texas”). Mordecai extracts place names from a piece of English-language text, resolves their locations, and returns their coordinates and structured geographic information using language models. We chose a confidence threshold of 75 % for identifying the location from the text because we observed that mordecai provided inaccurate location estimates at confidence values of around 30 %. The 75 % threshold therefore gives us greater confidence that our geolocation method is reliable. With this higher threshold, only a small fraction of tweets could be geolocated, as most did not mention a location in the tweet’s text.
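The confidence filter can be sketched independently of the geoparser itself. In the example below, each tweet is represented by a list of already-parsed place records; the field names `country_predicted` and `country_conf` are assumptions modeled on the kind of record a geoparser returns, and the records are invented for illustration.

```python
from collections import Counter

CONF_THRESHOLD = 0.75  # the 75 % cutoff described in the text


def confident_countries(entities, threshold=CONF_THRESHOLD):
    """Return predicted countries whose confidence meets the threshold."""
    return [
        ent["country_predicted"]
        for ent in entities
        if ent.get("country_conf", 0.0) >= threshold
    ]


def count_countries(parsed_tweets, threshold=CONF_THRESHOLD):
    """Count tweets per country, counting each country at most once per tweet."""
    counts = Counter()
    for entities in parsed_tweets:
        counts.update(set(confident_countries(entities, threshold)))
    return counts


# Invented geoparser output for three tweets (hypothetical field names).
parsed_tweets = [
    [{"word": "Texas", "country_predicted": "USA", "country_conf": 0.96}],
    [{"word": "Paris", "country_predicted": "FRA", "country_conf": 0.31}],
    [],  # most tweets mention no place at all
]
counts = count_countries(parsed_tweets)
```

Raising the threshold trades coverage for precision: the low-confidence record is dropped, which is consistent with the observation that only a small fraction of tweets could be geolocated.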
3.6. Iterations of topic modeling
To infer major themes from the tweets used to classify tweets into categories, we used BERTopic modeling as mentioned in Section 3.4. Several iterations were undertaken within the topic modeling to identify increasingly relevant tweets related to e-cigarettes. The first iteration included all tweets in all languages. In our next iteration, we filtered tweets by language. All authors then met, discussed the content of each cluster, and selected relevant and irrelevant topics based on consensus agreement. In each iteration of topic modeling, we observed some topics with irrelevant keywords that would have been very difficult to find manually. In subsequent iterations, we used these keywords to delete irrelevant tweets from the data and topic modeling was performed again. This iterative process ensured that the final topics that we obtained were more relevant to our study. Please see Appendix D for the frequency of topics generated per iteration. A detailed description of each iteration is as follows:
Iteration 1 (N = 120,921): The first round of topic modeling included all scraped tweets in all languages (with stop words removed using the NLTK library). For some languages, the topics generated from this process did not contain enough tweets to derive valuable conclusions. We therefore decided to focus only on English and Spanish, the languages with enough tweets to analyze computationally using our methods.
Iteration 2 (N = 69,642): This iteration of topic modeling was performed on only English- and Spanish-language tweets detected from the larger corpus using the Python langdetect library (Langdetect 1.0.9, 2021). This procedure resulted in many topics being generated with 10 keywords per topic.
Iteration 3 (N = 48,562): To generate more precise topics, irrelevant keywords were removed manually, as mentioned in Appendix C. All authors (some of whom are experts in tobacco regulatory science) determined the relevancy of each keyword. After selecting the relevant tweets, we repeated the topic modeling process and obtained several relevant topics as detailed in the Results section.
Iteration 4 (N = 37,122): We repeated topic modeling by further identifying and removing irrelevant keywords. The entire list of keywords removed is provided in Appendix C.
Iteration 5 (N = 10,504): We manually selected only relevant topics and related tweets from the topic modeling results of iteration 4, forming a subset of only the relevant tweets we identified (~ one third of the tweets from iteration 4). To accomplish this, all authors (including experts in tobacco regulatory science) met synchronously through a hybrid in-person meeting with Zoom video conferencing to select the relevant topics and resolve discrepancies. This iteration yielded our study’s final tweets and categories.
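The keyword-based pruning between iterations can be sketched as a simple loop. This is a simplified illustration: the keywords are hypothetical rather than those in Appendix C, matching is done on whitespace-separated tokens, and the refitting of the topic model between rounds is omitted.

```python
def drop_irrelevant(tweets, irrelevant_keywords):
    """Keep only tweets that contain none of the irrelevant keywords as whole tokens."""
    blocked = {kw.lower() for kw in irrelevant_keywords}
    return [t for t in tweets if blocked.isdisjoint(t.lower().split())]


def run_iterations(tweets, keyword_rounds):
    """Apply one keyword list per round and record the corpus size after each round."""
    sizes = [len(tweets)]
    for keywords in keyword_rounds:
        tweets = drop_irrelevant(tweets, keywords)
        sizes.append(len(tweets))
    return tweets, sizes


# Toy corpus and hypothetical keyword lists, one list per round.
corpus = [
    "mango pod tastes great",
    "crypto giveaway click now",
    "trying to quit vaping today",
    "football match tonight",
]
remaining, sizes = run_iterations(corpus, [["crypto"], ["football"]])
```

In the study each round was followed by a new topic model and a consensus review, so the keyword lists were discovered from the model output rather than fixed in advance as they are here.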
To generate a better understanding of the topic modeling results, we applied a multi-level qualitative method to interpret the themes of each topic. First, by evaluating the keywords generated by BERTopic, we identified e-cigarette-related topics under each category (see Figs. 3–8). Second, two authors independently labeled the theme of each topic by reviewing the first 10 tweets (a sample used by Castillo et al. (2011) for evaluation) that have the highest probability of being identified with that topic. Finally, the two authors compared the independently labeled themes and resolved discrepancies through discussion. Table 1 lists the theme of each topic under the six categories and presents three example tweets for each theme.
Fig. 3. Topic modeling results of category 1 “Flavors and Disposable Vapes” with top 12 topics. Topics relevant to e-cigarettes are framed with red boxes; values represent c-TF-IDF scores per keyword.
Fig. 8. Topic modeling results of category 6, “Spanish Tweets and Vaping Nicotine” with all 4 relevant topics; values represent c-TF-IDF scores per keyword.
Table 1.
Qualitative analyses of BERTopic modeling.
| Category | Summary | Topic | Theme | Example Tweets |
|---|---|---|---|---|
|
| ||||
| Category 1 | This category focuses on the flavors available for vaping and discussions related to cannabis vaporizers. Users share their preferences for flavors and express interest in cannabis legalization and related products. | Topic 5 | Flavors | cranberry grape #vocano #vocanovape #disposable #7000puffs #usavape #elfbar #escobars #hyde #bang youre the mango to my juul pod the last time had a fruit was from his mango juul pod |
| Topic 11 | Cannabis Vaporizers | i hope they legalize weed soon i might buy me a vaporizer then #edibles #cbdoil #cannabiscommunity the davind miqro vaporizer is a small, sleek and sophisticated loose leaf vaporizer crafted to fît your lifestyle. we have packed all of the flavor, quality and precision. that davinci is known for #cbd #cannabis #hemp i guess, i can help you with a vaporizer or two | ||
| Category 2 | This category revolves around various aspects of cannabis culture, including discussions about cannabis itself, CBD oil, news related to legalization, and different compounds and strains associated with cannabis consumption. | Topic 5 | Cannabis culture | rt : lot to be said for this............ #cannabis #cannabiscommunity #weed #marijuana #thc #cbd #cannabisculture https:/live from the heat seat and its fuckin nice out #cannabisculture #cannabis #topshelflife #topshelf #pot #stoner #potheadsociety #pothead #dank #purple #highsociety #hightimes #ganja #highlife #terps #fueledbythc #marijuana. rt : another day another country.... #cannabis #cannabiscommunity #weed #marijuana #thc #cbd #cannabisculture |
| Topic 6 | Buying and using CBD oil | retro post: cbd oil (cannabidiol) what is it, and should i buy it? by dr kathleen thompson can i use something other than weed? it’s illegal here you know. can i use the name of my cbd oil? where and how to buy cbd and hemp oil online icd35gi |
||
| Topic 7 | Cannabis news and legalization | rt : #mmemberville #legalizecannabis #cannabiscommunity #stonerfam #federallylegalizecannabis #weedlife #cannabis rt : new facts for you! #cannabisindustry #cannabis #cannafarm #cannabisfacts legalization #cannabisnews fifa world cup : anti-doping laws explained #cbd #cannabislegal #weed #cbdedibles #cbdoil #cannabiscommunity |
||
| Topic 8 | Cannabis compounds and strains | #crd #cbd #thco # #cbn #cbg thco crd up lift thch ## #thco #thch #hhcp #cbd #cbn #cbg #crd ## rt : ## #thco #thch #hhcp #cbd #cbn #cbg #crd ## |
||
| Topic 9 | Cannabis legalization | rt : el registro del pro grama de cannabis medicinal de argentina Cel reprocann) ya cuenta con ms de personas inscrit colombia - proyecto para regularizacin del cannabis de uso adulto rt : pilas, pueblo colombiano! estn usando los proyectos de legalizacin del cannabis para crear un as mafias alrededor de |
||
| Category 3 | This category encompasses the discussions around specific vaping devices like the Calibum and promotions/offers related to smoke and vape shops | Topic 0 | Calibum | which one would you choose for your night? #uwell #uwelltech #calibuma #vapedaily #vapetime #vapefam #vapelove #calibuma3 #calibumak3 only for adults(21+). rt : choose one? #uwell #uwelltech #calibum #vapedaily #vapetime #vapefam #vapelove #uwellcrown #crownm only for adults(21+). rt : which one do you want to focus on? #uwell #uwelltech #calibuma #vapedaily #vapetime #vapefam #vapelove #calibuma3 # |
| Topic 1 | Smoke and vape shop | rt : up to % off on bongs, pipes and gifts on #headshop #smokeshop follow us for more promotions im go to my nearest smokeshop and just flip out on em as if they the ones responsible for the treachery rt : i own a smokeshop in texas and cops are horrible harassing my customers, they wait across stopping them for any reas |
||
| Category 4 | This category encompasses discussions about the vaping experience, including insights into vaping products, promotions, lifestyle choices, the overall vape culture and broader discussions around nicotine vaping and regulations. | Topic 0 | Vape community | rt : buy get free snowwolf fuel cell mini w pod device – $8.88 & you get of these kits!! #vapecommunity classic ample build space lets you build your deadrabbit solo #rda coil with ease - #vapesourcing #vape #vapelife #vapecommunity #gift #ffeeshipping #newarrivals #newvape #vapor #vaping #vapefam #atomizer stuck in the office on a rainy day with vaporesso luxe qs, but still in a good mood! only use vape in permitted areas. - not for u.s. market #vape #vaporesso #vapeon #vapelife #vapecommunity #vaping #luxe #office #rainy #good #mood |
| Topic 1 | Vaping Lifestyle | el vaper es un viaje de ida me e pillado un vaper y mis flemas ahora saben a pia colada cada da me gustan ms los vapers socorro |
||
| Topic 2 | Desire to vape | yo llegando a la salida ya graduado a fumar vaper con wilson algn da : coo quiero un puff o un vaper o lo que sea asta cigarro si es posible pienso tumbarme a ver tiktoks y fumar de mi vaper melocotn manzana compulsivamente has ta que todas mis neuronas se fundan o me revienten los pulmones (lo que pase antes) |
||
| Topic 3 | Nicotine vaping | rt: exclusive nicotine vapers’ lung cancer risk from the major lung cancer carcinogen is >20x lower than in smokers, even where is the evidence that prohibiting ’flavors’ will reduce teen nicotine vaping? such prohibitions harm* adult ’flavor’ vapers for every one teen ’flavor’ vaper...... who may just switch to tobacco flavor or worse: to combustible cigarettes. * fyi, “harm” = relapse to smoking rt: when prohibited ’flavored’ pod-based nicotine vapes, % of adult vapers relapsed to smoking deadly combus |
||
| Category 5 | This category delves into discussions about traditional smoking, cannabis consumption, vaping as a quitting tool, legalities surrounding vaping, and observations about vaping devices. It captures the diverse facets of smoking and vaping culture, including social interactions, health considerations, industry debates, and personal experiences related to smoking, vaping, and cannabis use. | Topic 0 | Cigarettes and smoking | him eyes are yellow cause he used to smoke cigarettes he is this is why i dont smoke bro think its better to smoke |
| Topic 1 | Attempt to quit smoking weed | i do not wanna smoke weed no more im growing out a lot of things rt: i will not smoke weed for days! february, and im not touching that shit! wish me luck i dont know what youre trying to do here, if ive not been clear: you shouldnt encourage your teenagers to smoke weed, if you cant understand that, please fuck off. |
||
| Topic 2 | Negative emotions around smoking cigarettes | i rather her smoke weed then having a cigarette ew she smoke cigarettes if you smoke and the girl doesnt leave you, shes either stupid or cheap. |
||
| Topic 3 | E-cigarettes as quitting devices | moved to vaping on a very low nicotine level rt: gen vape: how clinicians can help their young patients quit. ...“...nicotine can actually rewire a teen’s #brain to crave m rt: interesting that you group nicotine vaping with other substances. you must be aware that nicotine vaping is a w |
||
| Topic 6 | Discussions around commitment to quit smoking | today is the great american smokeout/vape out. if you or a loved one have a habit of smoking or vaping, make today the day to quit. there are people and resources who are here to help. australia imposes a very heavy tobacco tax, yet some continue to smoke, ultimately, change needs to come from within ... rt: today is the #greatamericansmokeout, an opportunity to commit to a smoke-free life, quitting smoking is a process i |
||
| Topic 8 | Smoke blunts | why arch that back when you smoke that blunt just gonna smoke this blunt and stfu sometimes i wanna smoke a blunt just to be bad | ||
| Category 6 | This category covers discussions about the health risks associated with vaping, including debates about its addictive nature and comparisons to traditional smoking. It also delves into personal urges and desires to vape, as well as concerns about vaping harms among vulnerable populations, such as adolescents and pregnant individuals. | Topic 0 | Health risk of vaping | vapeo desde el y veo muchas publicaciones de que le vape es malo y peor que le cigarro, es verdad es mucho ms adictivo y ms sencillo de utilizar en cualquier lado (no como el cigarro) pero todas esas publicaciones de que se hospitalizan pt1. los peligros del cigarillo electmico / vapeo. nada es inocuo, y en especial inhalar humo a alta temperatura con sustancias txicas. fumar no es saludable. rt: los cigarros electmicos o vapeadores resultan dainos para la salud, ocasionando enfermedades en muy poco tiempo de |
| Topic 1 | Vaping urge/desires | necesito vapear ya no me queda quiero vapear ahora les juro que este no soy yo quiero empezar a vapear |
||
| Topic 2 | Vaping harms among vulnerable populations | rt: la comercializacin ilegal de los vapeadores, pone en riesgo la salud de adolescentes y adultos debido a su concentracin de rt: los #vapeadores constituyen un riesgo especialmente para la salud de los jvenes. la evali es un dao pulmonar potencialmente advertencia para vapeadores embarazadas! la prctica condujo a la cesrea de emergencia de la mujer y al sangrado en suspulmones. |
||
| Topic 3 | Regulation around vaping products | A pesar de tener hasta componentes químicos tóxicos, cigarrillos electrónicos se promodonan con sabores atractivos para niñas, niños y adolescentes. La venta de vapeadores a menores de edad debe ser prohibida. Colombia debe contar con #leyvapeadoresregulados #nomáscortinasdehumo por el bienestar de los consumidores, jvenes, nios y nias de Colombia, debemos reglamentar el uso de los vapeadores. hoy acompaamos a nuestro colega dd en la audienda pblicasobre d uso y distribucin de estos dispositivos electmicos. se mete una vez ms con los mexicanos, d prohibicionismo ahora va en contra de las tienditas. los tenderos no podm exhibir productos legales como los cigarros. ayer fueron los vapeadores, hoy los dgarros. maana qu ser? #nosvasaquebrargatell #ligadeguerreros |
||
4. Results
4.1. Iterations
Our initial dataset, VAPE-TWEET (N = 154,281 tweets), was organized by category and consisted of two primary columns: tweet ID and tweet text. To obtain the metadata, we queried the Twitter API using the tweet IDs. After removing tweets that were not related to e-cigarettes or not written in English or Spanish (N = 143,777), and after conducting the fifth iteration of topic modeling, we were able to classify all remaining tweets into major topics. We identified relevant e-cigarette-related tweets (N = 10,504) organized around six categories as follows: category 1 (334 tweets), category 2 (1598 tweets), category 3 (104 tweets), category 4 (393 tweets), category 5 (7889 tweets), and category 6 (186 tweets). Categories 3, 4, and 6 did not go through iterations 3, 4, and 5; hence, for these categories, all topics from iteration 2 were retained. Fig. 2 illustrates our iterative process for categories 1, 2, and 5.
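The counts reported above can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in this section; the variable names are illustrative and are not taken from the authors' code.

```python
# Counts as reported in Section 4.1 (variable names are illustrative).
initial_tweets = 154_281
removed_tweets = 143_777
tweets_per_category = {1: 334, 2: 1598, 3: 104, 4: 393, 5: 7889, 6: 186}

# Tweets remaining after filtering, and the sum over the six categories.
retained_tweets = initial_tweets - removed_tweets
category_total = sum(tweets_per_category.values())

# Share of the retained tweets held by each category, in percent.
category_share = {
    category: round(100 * count / retained_tweets, 1)
    for category, count in tweets_per_category.items()
}

print(retained_tweets, category_total)
print(category_share)
```

The retained total and the per-category sum agree, and category 5 accounts for roughly three quarters of the retained tweets.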
Fig. 2.

Tweets per category per iteration.
The number of tweets per category varied more across iterations for some categories than for others. Some categories had <8 topics and <400 tweets (see Appendix D.1 and F.1). The average number of tweets per topic was 48, 48, 52, 78, 202, and 47 for categories 1–6, respectively (see Appendix D.2). Appendix F.2 details the number of tweets per topic, broken down by category, for only the top 12 topics (given that some categories, such as category 5, had a very large number of topics, whereas others, such as category 6, did not). Although the initial volume of tweets at iteration 1 was not interpretable, by iteration 5 we were able to create topics containing subsets of tweets that can be qualitatively studied in more detail in future work.
4.2. Term scores
Topic modeling results are presented in Figs. 3–8. These results are the product of the fifth (final) iteration of topic modeling and graphically depict the topic word scores, which indicate each word’s relevance to the topic. Words higher in the list for a topic exert more influence over that topic and hence are more representative of it. Longer bars (the horizontal rectangles in the bar charts) indicate higher c-TF-IDF scores and thus higher relevance to that topic, which provides insight into the composition of each topic by keyword. The values on the X axis are the c-TF-IDF scores per keyword. Using the framework detailed in the Methods section above, we derived 7, 33, 2, 5, 39, and 4 topics for categories 1–6, respectively. As depicted in Fig. 3, one of the topics for category 1 was Juul flavors (topic 5), with the keywords ‘Juul’ and ‘grape’ at the top of the list, indicating that these are the highest-scoring keywords in that topic.
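To make the term scores concrete, the following is a minimal sketch of the class-based TF-IDF weighting described by Grootendorst (2022), W(t, c) = tf(t, c) · log(1 + A / tf(t)), where tf(t, c) is the frequency of term t in topic c, tf(t) is its frequency across all topics, and A is the average number of words per topic. The toy tweets, topic names, and function names are hypothetical, and the sketch omits details of the BERTopic library (such as its tokenization and normalization), so it illustrates the idea rather than reproducing the scores in Figs. 3–8.

```python
import math
from collections import Counter


def c_tf_idf(docs_by_topic):
    """Return {topic: {term: score}} with score = tf_tc * log(1 + A / tf_t)."""
    # Treat all tweets of a topic as one document and count its terms.
    tf = {
        topic: Counter(" ".join(docs).lower().split())
        for topic, docs in docs_by_topic.items()
    }
    # Frequency of each term across all topics.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # A: average number of words per topic.
    avg_words = sum(total.values()) / len(tf)
    return {
        topic: {
            term: count * math.log(1 + avg_words / total[term])
            for term, count in counts.items()
        }
        for topic, counts in tf.items()
    }


def top_terms(term_scores, n=3):
    """Highest-scoring terms first; ties are broken alphabetically."""
    ranked = sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical toy topics, not drawn from VAPE-TWEET.
toy = {
    "juul_flavors": ["juul grape pods", "grape juul flavor", "mint juul"],
    "vape_shops": ["vape shop sale", "shop deals vape juice"],
}
scores = c_tf_idf(toy)
print(top_terms(scores["juul_flavors"], 2))
```

In this toy input, terms that are frequent within one topic and absent from the other rise to the top of that topic's list, which mirrors how the bar charts rank keywords.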
4.3. Topic correlation using similarity matrices
We calculated correlation coefficients between pairs of topics and present the correlation matrices for each category in Fig. 9 and Appendix G. Low correlation values indicate that the topics are unique and suggest that, if studied further, they are likely to provide distinct insights rather than repeating content across topics.
Fig. 9.

Similarity matrix for the topics of category 2.
The matrix for category 1 (see Appendix G) indicates some moderate correlation values (~ 0.4), suggesting that the topics identified may be unique. The matrix for category 2 (see Fig. 9) includes topics that were retained after iteration 5 (i.e., only expert-filtered topics) and that have a correlation value of <0.3 (i.e., light green shading), which indicates that these topics likely do not overlap with each other. Though category 3 contains only two topics, its matrix (see Appendix G) indicates that they are not correlated with each other, and hence both can be retained for future qualitative analysis. As can be seen from the correlation matrix for category 4, topics 1 and 4, and topics 2 and 3, seem to be highly correlated (see Appendix G). For further analysis of the tweets in these topics, it might be possible to merge topics 1 and 4 and topics 2 and 3. As the matrix for category 5 (see Appendix G) indicates, most of the topics that were retained after iteration 5 (i.e., only expert-filtered topics) have a correlation value <0.5, which represents <25 % shared variability and indicates that they likely do not overlap with each other. Lastly, the matrix for category 6 (see Appendix G) indicates that topics 1 and 2 were highly correlated (> 0.9) and hence can be merged for future qualitative analysis. The other topics are less correlated.
Overall, many topic pairs had low correlation values (< 0.3), representing <10 % shared variability, particularly in categories 2 and 5. We therefore conclude that these topics are largely unique and unlikely to overlap with other topics. Topics that do have a high correlation value (e.g., > 0.9 in categories 4 and 6) can be merged and studied as one topic.
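The link between a correlation value and shared variability is the squared correlation, r², so r = 0.5 corresponds to 25 % and r = 0.3 to 9 %. The sketch below works through that arithmetic and shows one way to flag topic pairs above a merge threshold. The matrix values are hypothetical, and the 0.9 threshold simply follows the cut-off mentioned in the text; this is an illustration, not the authors' implementation.

```python
def shared_variance(r):
    """Proportion of variability shared by two topics with correlation r."""
    return r * r


def merge_candidates(corr, threshold=0.9):
    """Return index pairs (i, j), i < j, whose correlation exceeds threshold."""
    n = len(corr)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if corr[i][j] > threshold
    ]


# Hypothetical symmetric correlation matrix for four topics.
corr = [
    [1.00, 0.95, 0.20, 0.10],
    [0.95, 1.00, 0.25, 0.15],
    [0.20, 0.25, 1.00, 0.28],
    [0.10, 0.15, 0.28, 1.00],
]

print(shared_variance(0.5))    # upper bound used for category 5
print(shared_variance(0.3))    # upper bound used for categories 2 and 5
print(merge_candidates(corr))  # pairs that could be studied as one topic
```

Only the upper triangle is scanned because the matrix is symmetric and the diagonal is always 1.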
4.4. Inter-topic maps
All inter-topic maps are included in Appendix H. The results of the inter-topic map analysis corroborate the correlations presented in Section 4.3. Specifically, in most categories, distinct clusters of topics emerged. In some cases, it would have been beneficial to merge topics prior to qualitative analysis. These results and recommendations are indicated in this section. The inter-topic map for category 1 (see Appendix H.1) indicates 7–8 unique clusters that can be qualitatively studied. The inter-topic map for category 2 (see Appendix H.2) visualizes 12 relatively distinct clusters created across all topics. Because category 3 has only 2 topics, our model was not able to create an inter-topic map for this category. As displayed within the inter-topic map for category 4 (see Fig. 10 or Appendix H.3), the 4 circles visualized represent the 5 topics. The axes in Fig. 10 represent a latent space of text embeddings that capture the semantic meaning of the text but do not have a physical meaning. As discussed previously in Section 4.3, 2 of these topics should be merged because they are highly correlated. A similar conclusion can be reached from reviewing the inter-topic map, as there is high overlap in the topic circles: 2 major clusters emerge (3 if we consider the barely overlapping circle as separate). The inter-topic map for category 5 (see Appendix H.4) indicates that, with a high volume of topics, one might need to merge the topics prior to qualitatively studying them. Moreover, even if two topics are very close to each other in the inter-topic map, they might be considered as part of a larger cluster and studied accordingly. The inter-topic map for category 6 (see Appendix H.5) indicates that this category includes 4 distinct topic clusters.
Fig. 10.

Inter-topic distance map of Category 4; circular clusters illustrate similar topics and the axes (D1 and D2) represent a latent space of text embeddings that capture the semantic meaning of the text.
4.5. Geoparsing
Using the geoparsing methods described above, we identified a total of 526 Twitter users across a total of 66 unique countries. The country with the most vape-related tweets was the US (N = 227; 43.2 % of users). We were also able to geolocate states within the US, and the state with the most vape-related tweets was Texas (N = 24; 10.5 % of users). Please see Appendix E for the generated heat map of the tweets we were able to geolocate. We were not able to geolocate many tweets because users are not tagged by location and usually do not share their location when posting a tweet.
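As a rough illustration of what geoparsing involves, the sketch below resolves place names in free text against a tiny lookup table and then tallies the share of located items per country. The gazetteer, the example strings, and the function names are all hypothetical, and this simple token lookup is a stand-in for illustration only; it is not the geoparsing tool or pipeline used in the study.

```python
from collections import Counter

# Hypothetical miniature gazetteer: toponym -> (country code, region).
GAZETTEER = {
    "austin": ("USA", "Texas"),
    "houston": ("USA", "Texas"),
    "texas": ("USA", "Texas"),
    "california": ("USA", "California"),
    "bogota": ("COL", None),
    "colombia": ("COL", None),
    "sydney": ("AUS", "New South Wales"),
}


def geoparse(text):
    """Return the first gazetteer match found in text, or None."""
    for token in text.lower().replace(",", " ").split():
        if token in GAZETTEER:
            return GAZETTEER[token]
    return None


def country_shares(texts):
    """Return ({country: percent of located texts}, number located)."""
    located = [hit for hit in map(geoparse, texts) if hit is not None]
    counts = Counter(country for country, _ in located)
    shares = {c: round(100 * n / len(located), 1) for c, n in counts.items()}
    return shares, len(located)


# Hypothetical location strings; one cannot be resolved.
examples = ["Austin, TX", "somewhere over the rainbow", "Bogota Colombia", "Houston"]
shares, n_located = country_shares(examples)
print(n_located, shares)
```

Texts with no recognizable place name are dropped before the shares are computed, which is why percentages of this kind describe only the located subset.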
4.6. Qualitative analyses
Through qualitative analyses of the BERTopic results, we identified the central theme of each topic within the six BERTopic-derived categories (see Table 1 for details). The table provides example tweets from selected topics in each category. Specifically, the two topics in Category 1 (Flavors and Disposable Vapes) center on different vape flavors and on types of cannabis vaporizers. The five topics of Category 2 (Cannabis) focused on various aspects of cannabis culture, and the two topics of Category 3 (Vape Shops and Refillable Vapes) discussed a specific vaping device (i.e., Caliburn) and promotions available at smoke and vape shops. Category 4 (Vape Culture) included four topics that discussed the vaping experience and overall vape culture, and the six topics of Category 5 (Anti-vaping and Quitting) delved into different aspects of smoking and vaping culture, covering traditional smoking, cannabis consumption, vaping as a cessation aid, and the legality of vaping. Category 6 (Spanish Tweets and Vaping Nicotine) included four topics focused on the health risks of vaping, personal motivations for vaping, and the regulation of vaping products.
5. Discussion
In the current study, we presented and applied a method to collect vape-related data from the Twitter API, preprocess these data, and classify tweets into relevant topics using computational methods. We used BERTopic, a topic modeling technique built on transformer-based text embeddings, to identify relevant topics. Informed by the BERTopic modeling, we combined automated textual analysis with human coding and proposed 23 e-cigarette-related themes under six categories. One noteworthy finding is that, except for Categories 2 and 6, the topics in the other four categories described the dual use of combustible cigarettes, e-cigarettes, and marijuana, which might reflect the prevalence of substance use behaviors among the general public. This analysis of recent vape-related tweets not only generates valuable knowledge for tobacco regulatory science scholarship but also provides important information for policymakers seeking to further regulate e-cigarette marketing on social media, especially for underage audiences.
We extend previous studies that examined e-cigarette-related tweets using BERT topic modeling (Chen et al., 2022; Baker et al., 2022; Malik et al., 2021) by analyzing a more comprehensive and recent data set and qualitatively interpreting it. Our current study provides a unique methodological contribution to tobacco regulatory science scholarship from two perspectives. First, compared to Chen et al.'s (2022) study analyzing news articles and Facebook posts, and to Malik et al.'s (2021) inquiry focusing on tweets regarding Juul flavors from 2018 to 2019, our study uses more recent data. Second, the similarity matrix and inter-topic distance map generated important knowledge regarding relationships between and among topics, which represents valuable information to help understand the e-cigarette conversations on Twitter at a macro level. Our analysis and process of compiling tweets detailed here not only offer a valuable characterized dataset of recent conversations regarding e-cigarette products but also provide a method for constructing such a dataset, underscoring the benefits of our methods for identifying relevant topics.
Our study provides novel public health and policy insights derived from analyzing vape-related tweets. Our findings highlight several key themes, including vape juice flavor-related tweets that may appeal to youth and glamorize vaping, discussions around cannabis use and legalization, direct and indirect marketing of vaping devices, and vaping as a cessation aid with associated health concerns. These findings align with previous research on e-cigarette content on social media involving discussions related to flavors, attitudes, and sentiments (Allem et al., 2020; Amin et al., 2023; Chu et al., 2015; Gao et al., 2022; Jung et al., 2024), thus suggesting that the tweeted content is consistent with general social media discourse. Our study is distinguished by the use of BERTopic modeling, which allowed us to analyze over 100,000 tweets, identify six e-cigarette-related topics, and categorize the tweets within these topics. In contrast, prior e-cigarette-related studies examined fewer than 1500 tweets (e.g., Chu et al., 2015 and Amin et al., 2023), which limited the scope of their findings. Thus, by using BERTopic modeling, we were able to conduct a more comprehensive examination of Twitter messages related to e-cigarettes. Furthermore, this methodology can be applied to other social media platforms, offering nuanced and timely data that can inform regulatory efforts and public health initiatives.
Our findings provide insights that can inform the regulation of vape marketing on social media (e.g., marketing practices targeting youth through flavors). Specifically, these insights reveal a critical need for targeted regulatory measures that address not only the presence but also the persuasive nature of flavored e-cigarette promotions that appeal to younger demographics. Previous work has found increases in content on social media related to flavors (Gao et al., 2022). Regulatory bodies could consider the enforcement of stricter advertising guidelines that prohibit the use of appealing flavors in promotional content that is accessible to underage audiences. For example, specific promotions and offers from smoke and vape shops, as identified in our study, could be subject to stricter regulations or outright bans to prevent promotional activities on platforms frequented by youth. Although many social media companies have already restricted or prevented the promotion of tobacco products on their platforms, our analysis suggests that this practice is still prevalent, indicating a need for improved enforcement. Social media platforms may also apply algorithm-based filters to the feeds of youths identified as underage based on their date of birth at registration and on behavioral data (e.g., searching, browsing, forwarding, commenting, or generating pro-e-cigarette messages). Finally, the detailed thematic insights provided by our methods can serve as a foundational tool for policymakers to understand the evolving landscape of e-cigarette marketing strategies on digital platforms, enabling more dynamic and responsive regulation. For example, our analysis has highlighted the prevalence of discussions related to cannabis use with e-cigarettes, suggesting a potential area for intensified public health messaging and prevention efforts.
Ultimately, our computational qualitative results highlight the need to continue increasing social media monitoring and surveillance for vape-related content (e.g., vaping trends, trending/viral vape products, and harmful marketing tactics). Monitoring social media purely through human-conducted qualitative means is not scalable. Therefore, using computational methods such as BERTopic may facilitate ongoing surveillance, which could inform timely interventions and policy responses. Moreover, social media companies could engage in active monitoring of vape-related content on their platforms via computational methods such as topic modeling and could then share trends with policymakers. Topic modeling and qualitative interpretation of machine-generated outputs can therefore inform public health interventions that target the challenges, trends, and crises posed by vaping and help mitigate their potential public health impacts.
6. Future work and limitations
The rapidly changing social media landscape, coupled with the identification of weak correlations between and among topics, reflects the diversity of e-cigarette conversation on social media and suggests the need for more nuanced studies using content analysis. We identified 23 topics on e-cigarette use across six broad thematic categories and explored these topics to identify themes within each topic. We also selected exemplary tweets from each topic. Although we sought to enhance validity by having two authors independently code all the topics and resolve discrepancies through discussion, the qualitative analysis is susceptible to subjectivity. Even so, human interpretation complements the machine-generated topics and reinforces our understanding of the landscape of e-cigarette-related conversations on social media. Based on the 10 most relevant tweets within each topic, we qualitatively labeled each topic. Further in-depth qualitative research is needed to determine how this information can be used to develop counter-messaging campaigns to prevent or reduce e-cigarette use.
In addition, themes identified within topics discussed both nicotine and cannabis use. As cannabis use continues to be legalized around the US, using our approach to monitor discussion provides insights into users’ attitudes and normative beliefs surrounding the use of cannabis, which can ultimately be used to inform policies designed to regulate the sale and distribution of cannabis. Future studies would also benefit from inferential statistical analyses to understand what topics are more popular on Twitter (more likes and comments implying more engagement with the general public and hence more conversations being centered around the topics). Since the topics fall in one or more of the categories we discussed in our results (e.g., vape juice flavors, vaping for cessation), such inferences could inform policy around these categories.
In addition to the future work mentioned above, there are a few areas of improvement and limitations to consider for this study. While the inputs used to create the Twitter dataset focused on the e-cigarette literature and Mexican American college students (i.e., the Project VAMoS study population), only a very small fraction of tweets could be geoparsed to a particular location (e.g., Texas). This is not surprising, as only roughly 1 % of Twitter users explicitly share their location (Zheng et al., 2018), yet 10 % of those that could be geolocated were in Texas, suggesting that the initial Twitter dataset was an effective mechanism for identifying tweets relevant to our study population. However, because it was almost impossible to geolocate a large majority of tweets, our method is not suitable for informing local policy. Additionally, we were not able to reliably ascertain the age of Twitter users, as our method does not detect the age of the person tweeting. Further, we looked only at Twitter data and did not collect data from any other social media platforms, which could provide a more detailed, in-depth understanding of e-cigarette-related conversations and discussions on social media. We found some tweets in the six categories to contain health misinformation, underscoring the need for further regulation as well as education campaigns for youth and adults. For example, subcategories contain tweets related to using e-cigarettes as a method of cessation of combustible tobacco. This is concerning given evidence suggesting limited efficacy of e-cigarettes as aids for smoking cessation (Chen et al., 2023). These areas are beyond the scope of our study. Future tobacco regulatory science work should investigate these claims on Twitter (i.e., X) and other social media platforms.
7. Conclusion
In the present study, we successfully developed a framework and methodology to collect vape-related Twitter data at scale and use machine learning to identify tweets relevant to e-cigarette use. Moreover, we combined the computational method of topic modeling with qualitative textual analysis to successfully identify and name a total of 23 relevant topics (e.g., regarding brands, flavors, and regulation) across six broad thematic categories. That is, BERTopic modeling reduced the vast quantity of information on social media into homogenous, meaningful units, which made it feasible for members of the research team to further analyze the data using qualitative methods. The study highlights the need for continued monitoring of e-cigarette-related discussions on social media platforms, as these discussions can have significant implications for public health messaging. Overall, the study provides valuable insights into the topics and themes that are most discussed in e-cigarette-related tweets on Twitter and demonstrates the potential of BERTopic as a tool for analyzing large-scale social media data.
Considering the limited number of geolocated tweets, it is important to acknowledge that although Texas appears to have a high number of e-cigarette-related tweets, the sample is likely not sufficiently representative to draw definitive conclusions. It could be worthwhile to further investigate states with more tweets to provide insights into designing persuasive campaigns to reduce e-cigarette initiation and use in those states. Our findings shed light on the landscape of e-cigarette-related tweets and highlight the possibility of leveraging BERTopic modeling to analyze large Twitter data sets. We hope that the present study will inspire further qualitative work to categorize the themes of machine-derived e-cigarette-related topics.
Supplementary Material
Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.etdah.2024.100160.
Fig. 4.

Topic modeling results of category 2 “Cannabis” with top 12 topics.
Topics relevant to e-cigarettes are framed with red boxes; values represent c-TF-IDF scores per keyword.
Fig. 5.

Topic modeling results of category 3 “Vape Shops and Refillables” with top 2 relevant topics.
Topics relevant to e-cigarettes are framed with red boxes; values represent c-TF-IDF scores per keyword.
Fig. 6.

Topic modeling results of category 4 “Vape Culture” with top 5 topics.
Topics relevant to e-cigarettes are framed with red boxes; values represent c-TF-IDF scores per keyword.
Fig. 7.

Topic modeling results of category 5 “Anti-Vaping and Quitting” with top 12 topics.
Topics relevant to e-cigarettes are framed with red boxes; values represent c-TF-IDF scores per keyword.
Footnotes
Declaration of interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
CRediT authorship contribution statement
D. Murthy: Writing – review & editing, Writing – original draft, Visualization, Supervision, Methodology, Investigation, Formal analysis, Conceptualization. S. Keshari: Writing – original draft, Visualization, Software, Methodology, Investigation, Formal analysis, Data curation. S. Arora: Writing – review & editing, Writing – original draft, Validation, Methodology, Investigation, Formal analysis. Q. Yang: Writing – review & editing, Writing – original draft, Visualization, Methodology, Investigation, Formal analysis. A. Loukas: Writing – review & editing, Project administration, Investigation, Funding acquisition. S.J. Schwartz: Writing – review & editing, Investigation. M.B. Harrell: Writing – review & editing, Investigation. E.T. Hébert: Writing – review & editing, Investigation. A.V. Wilkinson: Writing – review & editing, Writing – original draft, Project administration, Methodology, Funding acquisition.
Data availability
We publicly share VAPE-TWEET, a Twitter data set of 154,281 (121,000 unique) tweets scraped from 98,634 unique individuals between November 4, 2022, and February 23, 2023, using 118 vape-related hashtags, for others to use or to replicate our work. We provide information on how to access our tweet dataset (archived using twarc) at https://github.com/keshariS/dataScrapers/tree/main/VAMoS/dataset and our code at https://github.com/keshariS/dataScrapers/tree/main/allScraper_Twitter.
References
- Adhikari S, Uppal A, Mermelstein R, Berger-Wolf T, Zheleva E, 2021. Understanding the dynamics between vaping and cannabis legalization using twitter opinions. In: Proceedings of the International AAAI Conference on Web and Social Media, 15, pp. 14–25. 10.1609/icwsm.v15i1.18037. [DOI] [Google Scholar]
- Allem J-P, Escobedo P, Dharmapuri L, 2020. Cannabis surveillance with twitter data: emerging topics and social bots. Am. J. Public Health 110 (3), 357–362. 10.2105/AJPH.2019.305461. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Alqahtani E, Janbi N, Sharaf S, Mehmood R, 2022. Smart homes and families to enable sustainable societies: a data-driven approach for multi-perspective parameter discovery using BERT modeling. Sustainability 14 (20), 13534. 10.3390/su142013534. [DOI] [Google Scholar]
- Amin S, Jaiswal A, Washington PY, Pokhrel P, 2023. Investigating #vapingcessation in twitter. Am. J. Health Behav. 47 (6), 1183–1191. 10.5993/AJHB.47.6.11. [DOI] [Google Scholar]
- Baker W, Colditz JB, Dobbs PD, Mai H, Visweswaran S, Zhan J, Primack BA, 2022. Classification of twitter vaping discourse using bertweet: comparative deep learning study. JMIR Med. Inform. 10 (7), e33678. 10.2196/33678. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Birdsey J, Cornelius M, Jamal A, Park-Lee E, Cooper MR, Wang J, Sawdey MD, Cullen KA, Neff L, 2023. Tobacco product use among u. S. Middle and high school students-National youth tobacco survey, 2023 MMWR Morb. Mortal. Wkly. Rep. 72 (44), 1173–1182. 10.15585/mmwr.mm7244a1. [DOI] [Google Scholar]
- Campello RJGB, Moulavi D, Zimek A, Sander J, 2015. Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans. Knowl. Discov. Data 10 (1), 1–51. 10.1145/2733381. [DOI] [Google Scholar]
- Castillo C, Mendoza M, Poblete B, 2011. Information credibility on twitter. In: Proceedings of the 20th International Conference on the World wide web, pp. 675–684. [Google Scholar]
- Cavazos-Rehg P, Li X, Kasson E, Kaiser N, Borodovsky JT, Grucza R, Bierut LJ, 2021. Exploring how social media exposure and interactions are associated with e-cigarettes and tobacco use in adolescents from the PATH study. Nicot. Tobacco Res. 23 (3), 487–494. 10.1093/ntr/ntaa113. [DOI] [Google Scholar]
- CDC MMWR, 2023. Quickstats: percentage distribution of cigarette smoking status among current adult e-cigarette users, by age group — National health interview survey, united states, 2021 MMWR Morb. Mortal. Wkly. Rep. 72. 10.15585/mmwr.mm7210a7. [DOI] [Google Scholar]
- Chen K, Babaeianjelodar M, Shi Y, Aanegola R, Cheung LY, Nakov P, Yadav S, Bancroft A, Khudabukhsh A, Choudhury MD, Altice FL, & Kumar N (2022). US News and Social Media Framing around Vaping. ArXiv, abs/2206.07765. [Google Scholar]
- Chen R, Pierce JP, Leas EC, Benmarhnia T, Strong DR, White MM, Messer K, 2023. Effectiveness of e-cigarettes as aids for smoking cessation: evidence from the PATH Study cohort, 2017–2019. Tob. Control 32 (e2), e145–e152. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chu K-H, Unger JB, Cruz TB, Soto DW, 2015. Electronic cigarettes on twitter – spreading the appeal of flavors. Tob. Regul. Sci. 1 (1), 36–41. 10.18001/TRS.1.1.4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cole-Lewis H, Pugatch J, Sanders A, Varghese A, Posada S, Yun C, Schwarz M, Augustson E, 2015. Social listening: a content analysis of e-cigarette discussions on twitter. J. Med. Internet Res. 17 (10), e243. 10.2196/jmir.4969. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dai H, Leventhal AM, 2019. Prevalence of e-cigarette use among adults in the United states, 2014–2018. JAMA 322 (18), 1824–1827. 10.1001/jama.2019.15331. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Devlin J, Chang MW, Lee K, & Toutanova K (2018). Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv: 1810.04805. [Google Scholar]
- Evans-Polce R, Veliz P, Boyd CJ, McCabe VV, McCabe SE, 2020. Trends in e-cigarette, cigarette, cigar, and smokeless tobacco use among us adolescent cohorts, 2014–2018. Am. J. Public Health 110 (2), 163–165. 10.2105/AJPH.2019.305421. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Galimov A, Vassey J, Galstyan E, Unger JB, Kirkpatrick MG, Allem J-P, 2022. Ice flavor-related discussions on twitter: content analysis. J. Med. Internet Res. 24 (11), e41785. 10.2196/41785. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gao Y, Xie Z, Li D, 2022. Investigating the impact of the New York state flavor ban on e-cigarette-related discussions on twitter: observational study. JMIR Publ. Health Surveill. 8 (7), e34114. 10.2196/34114. [DOI] [Google Scholar]
- Grootendorst M (2022). BERTopic: neural topic modeling with a class-based TF-IDF procedure. doi: 10.48550/ARXIV.2203.05794. [DOI] [Google Scholar]
- Halterman A, 2017. Mordecai: full text geoparsing and event geocoding. J. Open Source Softw. 2 (9), 91. 10.21105/joss.00091. [DOI] [Google Scholar]
- Hardeniya N, Perkins J, Chopra D, Joshi N, Mathur I, 2016. Natural Language processing: Python and NLTK. Packt Publishing Ltd. [Google Scholar]
- Hassan L, Elkaref M, de Mel G, Bogdanovica I, Nenadic G, 2022. Text mining tweets on e-cigarette risks and benefits using machine learning following a vaping related lung injury outbreak in the USA. Healthc. Anal. 2, 100066. 10.1016/j.health.2022.100066. [DOI] [Google Scholar]
- Jung S, Murthy D, Bateineh BS, Loukas A, Wilkinson AV, 2024. The normalization of vaping on tiktok using computer vision, natural language processing, and qualitative thematic analysis: mixed methods study. J. Med. Internet Res. 26, e55591. 10.2196/55591. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ketonen V, Malik A, 2020. Characterizing vaping posts on Instagram by using unsupervised machine learning. Int. J. Med. Inform. 141, 104223. 10.1016/j.ijmedinf.2020.104223. [DOI] [PubMed] [Google Scholar]
- Langdetect 1.0.9, 2021. Available: https://pypi.org/project/langdetect/. [Google Scholar]
- Lee J, Ouellette RR, Murthy D, Pretzer B, Anand T, Kong G, 2024. Identifying e-cigarette content on tiktok: using a bertopic modeling approach. Nicot. Tobacco Res. ntae171. 10.1093/ntr/ntae171. [DOI] [Google Scholar]
- Lee J, Tan ASL, Porter L, Young-Wolff KC, Carter-Harris L, Salloum RG, 2021. Association between social media use and vaping among Florida adolescents, 2019. Prev. Chronic Dis. 18, 200550. 10.5888/pcd18.200550. [DOI] [Google Scholar]
- Liu AH, Hootman J, Li D, Xie Z, 2024. Public perceptions of synthetic cooling agents in electronic cigarettes on twitter. Plos One 19 (3), e0292412. 10.1371/journal.pone.0292412. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liu Q, Liang Y, Wang S, Huang Z, Wang Q, Jia M, Ming WK, 2022. Health communication through Chinese media on E-cigarette: a topic modeling approach. Int. J. Environ. Res. Public Health 19 (13), 7591. 10.3390/ijerph19137591. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lu X, Sun L, Xie Z, Li D, 2022. Perception of the Food and Drug Administration electronic cigarette flavor enforcement policy on Twitter: observational study. JMIR Publ. Health Surveill. 8 (3), e25697. 10.2196/25697.
- Malik A, Khan MI, Karbasian H, Nieminen M, Ammad-Ud-Din M, Khan SA, 2021. Modeling public sentiments about JUUL flavors on Twitter through machine learning. Nicot. Tobacco Res. 23 (11), 1869–1879. 10.1093/ntr/ntab098.
- Obisesan OH, Osei AD, Uddin SMI, Dzaye O, Mirbolouk M, Stokes A, Blaha MJ, 2020. Trends in e-cigarette use in adults in the United States, 2016–2018. JAMA Intern. Med. 180 (10), 1394. 10.1001/jamainternmed.2020.2817.
- Pew Research Center. (2023). Teens, social media & technology 2023. Retrieved from https://www.pewresearch.org/internet/2023/12/11/teens-social-media-and-technology-2023/.
- Pew Research Center. (2024). Social media fact sheet. Retrieved from https://www.pewresearch.org/internet/fact-sheet/social-media/.
- Primack BA, Soneji S, Stoolmiller M, Fine MJ, Sargent JD, 2015. Progression to traditional cigarette smoking after electronic cigarette use among US adolescents and young adults. JAMA Pediatr. 169 (11), 1018–1023. 10.1001/jamapediatrics.2015.1742.
- Ren Y, Wu D, Singh A, Kasson E, Huang M, Cavazos-Rehg P, 2022. Automated detection of vaping-related tweets on Twitter during the 2019 EVALI outbreak using machine learning classification. Front. Big Data 5. 10.3389/fdata.2022.770585.
- Sangalang A, Volinsky AC, Liu J, Yang Q, Lee SJ, Gibson LA, Hornik RC, 2019. Identifying potential campaign themes to prevent youth initiation of e-cigarettes. Am. J. Prev. Med. 56 (2), S65–S75. 10.1016/j.amepre.2018.07.039.
- Shah NA, Li Z, McMann T, Calac AJ, Le N, Nali MC, Cuomo RE, Mackey TK, 2024. Identification and characterization of synthetic nicotine product promotion and sales on Instagram using natural language processing. Nicot. Tobacco Res. 26 (5), 580–588. 10.1093/ntr/ntad222.
- Sievert C, Shirley K, 2014. LDAvis: a method for visualizing and interpreting topics. In: Proceedings of the workshop on interactive language learning, visualization, and interfaces, pp. 63–70.
- Statista, U.S. Twitter reach by age group 2021. (n.d.). Retrieved May 5, 2023, from https://www.statista.com/statistics/265647/share-of-us-internet-users-who-use-twitter-by-age-group/.
- Statista. (2023). Number of monetizable daily active Twitter users (mDAU) worldwide from 1st quarter 2017 to 2nd quarter 2022. Retrieved from https://www.statista.com/statistics/970920/monetizable-daily-active-twitter-users-worldwide/.
- Taeb M, Chi H, Yan J, 2021. Applying machine learning to analyze anti-vaccination on tweets. In: 2021 IEEE International Conference on Big Data (Big Data), pp. 4426–4430. 10.1109/BigData52589.2021.9671647.
- Twitter by the Numbers (2023): Stats, Demographics & Fun Facts. (2023, March 9). Omnicore Agency. https://www.omnicoreagency.com/twitter-statistics/.
- Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I, 2017. Attention is all you need. Adv. Neural Inf. Process. Syst. 30. In: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
- Wu D, Kasson E, Singh AK, Ren Y, Kaiser N, Huang M, Cavazos-Rehg PA, 2022. Topics and sentiment surrounding vaping on Twitter and Reddit during the 2019 e-cigarette and vaping use-associated lung injury outbreak: comparative study. J. Med. Internet Res. 24 (12), e39460. 10.2196/39460.
- Xu X, Jin T, Wei Z, Wang J, 2017. Incorporating topic assignment constraint and topic correlation limitation into clinical goal discovering for clinical pathway mining. J. Healthc. Eng. 2017, 1–13. 10.1155/2017/5208072.
- Zhan Y, Liu R, Li Q, Leischow SJ, & Zeng DD (2017). Identifying topics for E-cigarette.
- Zheng X, Han J, Sun A, 2018. A survey of location prediction on Twitter. IEEE Trans. Knowl. Data Eng. 30 (9), 1652–1671. 10.1109/TKDE.2018.2807840.
Data Availability Statement
We publicly share VAPE-TWEET, a Twitter dataset of 154,281 tweets (121,000 unique) scraped from 98,634 unique individuals between November 4, 2022, and February 23, 2023, using 118 vape-related hashtags, so that others can use it or replicate our work. We provide information on how to access our tweet dataset (archived using twarc) at https://github.com/keshariS/dataScrapers/tree/main/VAMoS/dataset and our code at https://github.com/keshariS/dataScrapers/tree/main/allScraper_Twitter.
