Contextual Word Embeddings and Topic Modeling in Healthy Dieting and Obesity

Vijaya Kumari Yeruva; Sidrah Junaid; Yugyung Lee

doi:10.1007/s41666-019-00052-5

2019 Jun 10;3(2):159–183.

PMCID: PMC8982804 PMID: 35415426

Abstract

An alarming proportion of the US population is overweight. Obesity increases the risk of illnesses such as diabetes and cardiovascular diseases. In this paper, we propose the Contextual Word Embeddings (ContWEB) framework, which aims to build contextual word embeddings on the relationship between obesity and healthy eating from the crowd domain (Twitter) and the expert domain (PubMed). Our work is based on a pipeline model that consists of the following chain of processing elements: (1) use term frequency and inverse document frequency (TF-IDF) and Word2Vec on the data collected from the crowd and expert domains; (2) apply natural language processing (NLP) algorithms to the corpus; (3) construct social word embeddings by sentiment analysis; (4) discover the contextual word embeddings using co-occurrence and conditional probability; (5) find an optimal number of topics for topic modeling with the obesity and healthy dieting corpus; and (6) extract latent features using Latent Dirichlet Allocation (LDA). The ContWEB framework has been implemented on the Apache Spark and TensorFlow platforms. We have evaluated the ContWEB framework in terms of the effectiveness of the contextual word embeddings constructed from the crowd and the expert domains. We conclude that the ContWEB framework would be useful in enhancing the decision-making process for healthy eating and obesity prevention.

Keywords: Natural language processing, Word embeddings, Topic modeling, Sentiment analysis, Obesity and healthy dieting

Introduction

An alarming proportion of the US population is overweight: two-thirds of US adults are overweight, and one-third of those overweight are obese [1]. Already, one in six children in the USA is obese, and one in three is overweight [2]. Obesity increases the risk of illnesses such as diabetes and cardiovascular diseases. This “epidemic” can be attributed to the combination of cheap, high-calorie food and lack of physical activity. The guidelines are already available through public health programs such as “Healthy Eating Made Easier” [3] and MyPlate’s dietary guidelines for Americans [4]. MyPlate reported that a healthy lifestyle would be built throughout our lifetime and also be affected by personal and social factors such as our stage of life, situations, preferences, access to food, and culture.

As we are living in the era of social media, our decision on our foods may be strongly influenced by social trends. The social data provided by 316 million Twitter users can be analyzed to understand their perspective and behaviors on health [5]. Twitter data have been widely used as a means for understanding trends in public health, such as tracking and understanding spreading diseases, e.g., influenza and cholera [6, 7]. Public behaviors, trends, preferences, and their healthy lifestyle can be discovered from social media data. Furthermore, the topics or sentiments about diseases and medical conditions were analyzed [8]. Research with social media data may overcome the limitations of traditional methods such as paper surveys or face-to-face interviews in health-related studies [9].

Our work is motivated by [10, 11], in which the relationships between obesity and food trends were found through the analysis of Twitter data. As the number of social network users is expected to increase from 2.34 billion in 2016 to 2.95 billion in 2020 [12], the impact of word embeddings and topic modeling with such social media data would be significant. In our research, we define the term “expert” as someone who performs a task with in-depth knowledge and experience in their professional activities [13] and “crowd” as a “potentially large and unknown population” [14]. Thus, in our context, an expert is a practitioner or a caregiver in the health domain, and a crowd consists of naive individuals showing their health behaviors in the social media context. For our study, the experts’ perspectives are extracted from PubMed publications, and the crowd’s perspectives from Twitter tweets.

In our work, we would like to explore if there is any difference in the crowd perspective and the expert one on healthy dieting and obesity. We assume that the construction of contextual word embeddings would be useful to understand the social trends of healthy eating and find the relation between obesity and healthy dieting. For this purpose, we analyze the crowd’s perspectives from Twitter as well as experts’ from the PubMed publications. Healthy dieting statements in social media posts and biomedical publications would be useful in identifying potential topics or the relationships between healthy dieting and obesity.

In this paper, we present the ContWEB framework, which is designed to discover the relationships between healthy dieting and obesity by constructing contextual word embeddings from the crowd’s and experts’ perspectives. Our contributions in this paper are:

To collect data from social media and publications using contextual terms from TF-IDF [15] and Word2Vec [16] (in Section 3.1),
To extract contextual features through the workflow of natural language processing and feature extraction techniques from the crowd (Twitter) and the expert (PubMed) domains (in Section 3.2),
To conduct a sentiment analysis of healthy dieting moods in the crowd domain (in Section 3.3),
To model co-occurrence analysis to find the relationship between healthy dieting and obesity in both the crowd and expert domains (in Section 3.4),
To measure topic coherence to determine the optimal number of latent topics (in Section 3.5),
To find latent topics from the corpus in both the crowd (Twitter) and the expert (PubMed) domains (in Section 3.5).

Related Work

Related work on word embedding construction includes word co-occurrence matrices with dimensionality reduction [17], context learning with word proximity [18], probabilistic models [19], and supervised learning [20]. Word and phrase embeddings as an underlying representation have boosted the performance of natural language processing (NLP) tasks such as syntactic parsing [21] and sentiment analysis [22]. More recently, researchers in biomedical word embeddings reported that the word embeddings discovered from EHR and PubMed contain more relevant medical terms than the ones from GloVe [23] and Google News [24].

The recent work [8] pointed out the importance of multi-dimensional analysis of news media data and presented their findings on topics and sentiments of news articles from Reuters media in terms of diseases and medical conditions. They confirmed that their discoveries could be used for practical guidance for medical needs and research priorities in the decision-making of public policy.

It was confirmed that there is a strong relationship between food, mood, and stress [25]. Feeding behavior on either increased or reduced food intake is caused by external and psychological stress [26] and may lead to increased consumption of foods leading to obesity [27]. Nguyen et al. [28] analyzed 80 million tweets using machine learning algorithms and built a national neighborhood database for well-being and health behaviors. They validated with both machine-labeled and manually labeled tweets and obtained the accuracy of 78% for sentiment and 83% for food with the F scores 0.54 and 0.86, respectively. Fast food tweets were more frequently posted from big cities. Also, tweets for fast food restaurants are much more popular than tweets for food items. The state-level food sentiment is strongly associated with the prevalence of chronic conditions such as obesity and diabetes.

Eichstaedt et al. [29] analyzed Twitter messages using a regression model and found markers of cardiovascular mortality at the community level. For this study, they analyzed the psychological correlation of mortality and demographic, socioeconomic, and health risk factors (e.g., smoking, diabetes, hypertension, and obesity). Their results showed that the Twitter-based model for predicting mortality outperformed the classical risk factor-based predictive models.

According to the Centers for Disease Control and Prevention (CDC) [30], “young adults are about half as likely to have obesity as middle-aged adults. Adults aged 18–24 had the lowest self-reported obesity (17.3%) compared to adults aged 45–54 years who had the highest prevalence (35.1%).” They revealed that social media analysis is useful in raising obesity awareness and promoting healthy dieting. Paul et al. [31] presented the Ailment Topic Aspect model to analyze Twitter messages and to measure behavioral risk factors by geographic region for chronic medical conditions such as allergies, obesity, and insomnia. They concluded that Twitter could be broadly applicable to public health research. Madan et al. [32] studied the relationship between social interaction and health-related behaviors such as diet choices or long-term weight changes using sensing and self-reporting tools. Scanfeld et al. [33] analyzed Twitter antibiotics data to determine the categories of antibiotics-related tweets, such as cold and antibiotics, flu and antibiotics, and leftover antibiotics.

There is exciting research on sentiment analysis with food tweets. Sentiment analysis aims to determine whether a feature of a tweet is positive, negative, or neutral. Poria et al. [34] presented an innovative method to extract features from textual and visual datasets using deep convolutional neural networks. With the use of those features and a multiple-kernel learning classifier, they achieved the state of the art of multi-modal emotion recognition. Go et al. [35] trained on one million tweets in the food domain and achieved an accuracy of 83% in sentiment analysis. Food-mood [36] analyzed tweets for food sentiment and social and cultural aspects using Bayesian Sentiment classifier. Interestingly, they indicated the importance of continually evolving food trends (e.g., meat or fast food sentiment). However, there is room for improvement in utilizing diverse data such as tweet messages and publications, to find relationships among food, sentiments, and obesity. Besides, topic modeling and word embeddings techniques have not been evaluated in the context of healthy dieting and obesity.

Contextual Word Embeddings Framework

The Contextual Word Embeddings (ContWEB) framework (Fig. 1) aims to extract contextual word embeddings from both the crowd (Twitter) and expert (PubMed) domains. We define the word embeddings as “the collective name for a set of language modeling and feature learning techniques where words or phrases from the vocabulary are mapped to vectors of real numbers.” The contextual word embeddings in the ContWEB framework would be useful in finding the relationship between obesity and healthy dieting. It is based on the workflow of NLP, TF-IDF [15], and Word2Vec [16], and the topic model with LDA [37] in the construction of the word embeddings in a health domain.

Data Collection Using Word Embedding Techniques

TF-IDF and Word2Vec were used for data collection.

Term Frequency and Inverse Document Frequency (TF-IDF)

In this paper, we used TF-IDF to collect data from the expert and the crowd domains, which is the initial step in finding relevant word embeddings in the context of healthy dieting and obesity. The term frequency-inverse document frequency (TF-IDF) [15], the product of term frequency and inverse document frequency, was proposed to reflect how important a word is to a document in a corpus and to find representative features of the corpus. The term frequency (TF) is a log-scaled count f_{t,d} of how often the term t appears in a specific document d (1). The inverse document frequency (IDF) reduces the importance of words that occur in most of the documents D (2) and is mainly used to eliminate terms that are common across all the documents. The IDF value is computed by dividing the total number of documents by the number of documents that contain the given term t (plus one, as in (2)) and then taking the logarithm of the result. If the term t appears in most of the documents, it is likely to be a common term that is not specific to any given document d. In that case the ratio approaches 1 and its logarithm approaches 0, so the IDF value, and thereby the TF-IDF value computed as the product of the TF and IDF values (3), is low for the term t.

TF(t, d) = 1 + \log(f_{t, d}) \qquad (1)

IDF(t, D) = \log \frac{N}{1 + |\{d \in D : t \in d\}|} \qquad (2)

where N is the total number of the documents D in the corpus, i.e., N = |D| and |{d ∈ D : t ∈ d}| is the number of documents d in which the term t appears (i.e., TF(t,d)≠ 0).

The TF-IDF value is high if the term has a high term frequency and a low document frequency in the corpus. Hence, by considering the TF-IDF value, we can eliminate the common terms.

TF\text{-}IDF(t, d, D) = TF(t, d) \cdot IDF(t, D) \qquad (3)

The top k important words can be extracted based on the weights of the terms (e.g., k = 10).
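As an illustration of (1)–(3), the following minimal Python sketch computes the TF-IDF weights and the top k terms of one document. It is not the ContWEB implementation: the function names and the toy corpus are hypothetical, and a term that is absent from a document is assumed to have TF = 0.

```python
import math


def tf(term, doc_tokens):
    """Log-scaled term frequency, Eq. (1); assumed to be 0 when the term is absent."""
    f = doc_tokens.count(term)
    return 1.0 + math.log(f) if f > 0 else 0.0


def idf(term, corpus):
    """Smoothed inverse document frequency, Eq. (2): log(N / (1 + df))."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + df))


def tf_idf(term, doc_tokens, corpus):
    """TF-IDF weight, Eq. (3): product of TF and IDF."""
    return tf(term, doc_tokens) * idf(term, corpus)


def top_k_terms(doc_tokens, corpus, k=10):
    """Return the k highest-weighted terms of one document (ties broken alphabetically)."""
    scores = {t: tf_idf(t, doc_tokens, corpus) for t in set(doc_tokens)}
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["obesity", "diet", "risk", "diet"],
    ["diet", "fruit", "vegetable"],
    ["diet", "exercise", "risk"],
    ["diet", "sleep"],
]
print(top_k_terms(corpus[0], corpus, k=2))
```

Note that, with the "1 +" smoothing in the denominator of (2), a term that occurs in every document (here "diet") receives a slightly negative IDF, so it ranks below the more specific terms.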

Word Embeddings with Word2Vec

Word2Vec [16], a family of shallow, two-layer neural network models, is used to capture the contexts of words. It is a step in the construction of the word embedding model. In this paper, we used a large corpus of Twitter (crowd) data and PubMed (expert) data to produce a vector space (e.g., of several hundred dimensions) for the representative features from TF-IDF. In this model, each significant feature (word) in the corpus is assigned a corresponding vector in the vector space. Word vectors that are closely related and share common contexts in the corpus are located in close proximity to each other in the vector space [38].
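The notion of proximity in the vector space can be illustrated with cosine similarity. The sketch below does not train Word2Vec; it assumes that word vectors are already available and uses hypothetical three-dimensional toy vectors (real embeddings have several hundred dimensions) to retrieve the nearest neighbors of a term.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(term, vectors, k=2):
    """Return the k words whose vectors are closest to the vector of `term`."""
    query = vectors[term]
    others = [(w, cosine(query, v)) for w, v in vectors.items() if w != term]
    return [w for w, _ in sorted(others, key=lambda p: -p[1])[:k]]


# Hypothetical toy embeddings, for illustration only.
toy_vectors = {
    "obesity": [0.9, 0.1, 0.3],
    "overweight": [0.8, 0.2, 0.3],
    "diabetes": [0.7, 0.1, 0.6],
    "salad": [0.1, 0.9, 0.2],
}
print(nearest("obesity", toy_vectors, k=2))
```

In such a lookup, terms that share contexts with the query term appear first, which is how related terms can be found for a representative TF-IDF feature.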

Natural Language Processing (NLP)

For the given corpus from the experts’ and crowds’ domains, the NLP tasks were conducted using CoreNLP Library [39] as follows:

Tokenization: Tokenization is the process of breaking sentences into tokens that are the smallest constructs of any text data.
Lemmatization: Lemmatization is the process of analyzing words morphologically (separating them into individual morphemes and identifying the class of the morphemes) in order to reduce inflected forms to their base form (lemma).
Stopword Removal: Stopword Removal is the process of removing stopwords from the corpus. For example, the stopwords in English include “able,” “about,” “above,” and “according”.
Multi-Word Detection: We partially applied the multi-word detection only to detect medical terms (e.g., “cardiac mesoderm”) using NCBO BioPortal v4.0 API [40].
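The first three steps above can be sketched in plain Python as follows. This is only an illustration and does not use CoreNLP: the regular-expression tokenizer, the small stopword set, and the lemma lookup table are hypothetical stand-ins for the library components, and multi-word detection is not covered.

```python
import re

# Hypothetical, deliberately tiny resources; a real pipeline would use full lists.
STOPWORDS = {"able", "about", "above", "according", "the", "is", "are", "a", "and", "to", "of"}
LEMMAS = {"eating": "eat", "ate": "eat", "apples": "apple", "diets": "diet"}


def tokenize(text):
    """Break a lowercased sentence into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(text):
    """Tokenize, remove stopwords, then map each token to its lemma if known."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    return [LEMMAS.get(t, t) for t in tokens]


print(preprocess("The apples are above the diets"))
```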

Sentiment Analysis for Social Word Embeddings

We conducted the sentiment analysis to identify social word embeddings on healthy dieting and food sentiment from the crowd’s corpus (Twitter). For our research, we proposed a meta-model, called Assemble Sentiment Analysis (ASA), based on the combination of Valence Aware Dictionary and sEntiment Reasoner (VADER) [41], CoreNLP [39], and TextBlob [42], to incorporate social embeddings and obtain a more accurate analysis of sentiments in the crowd’s domain. The ASA model is a rule-based sentiment analysis tool that exploits each model’s strengths and is attuned explicitly to sentiments expressed on Twitter.

VADER handles both polarity and intensity. It takes a human-centric approach: its dictionary, built from human raters and the wisdom of the crowd, maps lexical features to emotional intensities. The sentiment score of a lexical feature in VADER is measured on a scale from −4 to +4, where the value 0 is neutral. This model works well with short text, especially with acronyms, slang, emoticons, contractions, and punctuation. We converted the sentiment-related symbols such as emoticons or acronyms to sentiment words about food–mood in our crowd dataset. The features are described as follows:

A full list of Western-style emoticons, for example, :-) denotes a smiley face and generally indicates positive sentiment
Sentiment-related acronyms and initialisms (e.g., LOL and WTF are both examples of sentiment-laden initialisms)
Commonly used slang with sentiment value (e.g., nah, meh, and giggly).

Stanford CoreNLP [39] computes the sentiment based on how individual words change the meaning of longer phrases, a new type of recursive neural network that builds on grammatical structures. The sentiment tree bank includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences. CoreNLP assigns sentiments negative, positive, and neutral to a sentence.

In TextBlob [42], the sentiment property returns a named tuple of the form Sentiment (polarity, subjectivity) where Polarity Score lies within the range [− 1.0,1.0] and Subjectivity lies within the range [0.0,1.0]. The TextBlob is available as a Python library.

For simplicity’s sake, we have integrated the following two models: (1) sentiment analysis and (2) healthy dieting to classify the tweets in the crowd domain into the four categories of healthy dieting and food sentiment as follows: healthy-positive (HP), healthy-negative (HN), unhealthy-positive (UP), unhealthy-negative (UN). The results of the healthy food sentiment classification are discussed in Section 4.4.
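The four categories can be expressed as a simple lookup over the two model outputs. The sketch below is illustrative only; the function name and the string labels are hypothetical, and it assumes that the food type and the sentiment of a tweet have already been determined.

```python
CATEGORY_CODES = {
    ("healthy", "positive"): "HP",
    ("healthy", "negative"): "HN",
    ("unhealthy", "positive"): "UP",
    ("unhealthy", "negative"): "UN",
}


def food_sentiment_category(food_type, sentiment):
    """Combine a food type and a sentiment label into one of HP, HN, UP, UN."""
    try:
        return CATEGORY_CODES[(food_type, sentiment)]
    except KeyError:
        raise ValueError(f"unsupported combination: {food_type!r}, {sentiment!r}")


print(food_sentiment_category("unhealthy", "negative"))
```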

Sentiment Analysis in Crowd

The sentiment score is calculated by summing the valence scores of each word present in the tweet and then normalizing the sum, according to the rules, to a value between −1 (highly negative) and +1 (highly positive). The sentiment category of a given tweet (t) is determined by its compound score (CS(t)), as shown in (4).

Sentiment(t) = \begin{cases} Positive & \text{if } CS(t) \geq 0.5 \\ Negative & \text{if } CS(t) \leq -0.5 \\ Neutral & \text{if } -0.5 < CS(t) < 0.5 \end{cases} \qquad (4)

A tweet can be determined to be negative or positive by averaging its positive, negative, and neutral sentiments. Interestingly, “eating green grapes” is classified as a neutral sentiment. In this paper, we treat the neutral sentiment as positive because we want to find the negative/positive eating trends of the users.
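The thresholding in (4) and the folding of neutral tweets into the positive class can be sketched as follows. The function names are hypothetical, and the compound score is assumed to be already computed and to lie in [−1, +1].

```python
def sentiment_label(compound, threshold=0.5):
    """Three-way label from a compound score in [-1, 1], following Eq. (4)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def binary_label(compound, threshold=0.5):
    """Two-way label in which neutral tweets are treated as positive."""
    label = sentiment_label(compound, threshold)
    return "positive" if label == "neutral" else label


for score in (0.8, 0.1, -0.7):
    print(score, sentiment_label(score), binary_label(score))
```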

Healthy Dieting Analysis in Crowd

Using the ASA model, we designed the food sentiment model for both healthy and unhealthy food tweets. The positive sentiment was aligned to the foods in the healthy dieting, and the negative sentiment was aligned to the foods in the unhealthy dieting. Through this extension, we could identify the inclination of tweets toward healthy (H) dieting, unhealthy (U) dieting, and compound (C) dieting, which is both healthy and unhealthy dieting, by calculating the healthy dieting score (hs(f)) including the compound food. For the food tweet dataset (f ), we classify the food type, type(f), as the negative sentiment if the healthy dieting score (hs(f)) is less than 0 and the positive sentiment if hs(f) is greater than 0. More specifically, we compute the healthy dieting score (hs(f)) according to the rules in Equation 5.

hs(f) = \begin{cases} 1 & \text{if } type(f) = H \\ -1 & \text{if } type(f) = U \\ \frac{\sum hs(H) + \sum hs(U)}{\sum hs(H) - \sum hs(U)} & \text{if } type(f) = C \end{cases} \qquad (5)

For example, for the tweet “There better be wine and coffee in hell,” the healthy dieting scores (hs) are [healthy: 0, unhealthy: −2, compound: −1]: the score is 0 for healthy food and −2 for the two unhealthy food items (wine and coffee), so the healthy dieting score is hs = $\frac{0 + (-2)}{0 - (-2)} = -1$.
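The worked example can be reproduced with a short sketch. It assumes, as the example suggests, that each healthy food item contributes +1 and each unhealthy item contributes −1; under that assumption the compound ratio also yields +1 for purely healthy and −1 for purely unhealthy tweets. The function names and the small keyword sets (a subset of Table 4) are chosen for illustration only.

```python
# Small illustrative subsets of the healthy/unhealthy keyword lists.
HEALTHY = {"salad", "apple", "spinach"}
UNHEALTHY = {"wine", "coffee", "pizza"}


def count_food_items(tokens):
    """Count healthy and unhealthy food keywords among the tokens of a tweet."""
    n_healthy = sum(1 for t in tokens if t in HEALTHY)
    n_unhealthy = sum(1 for t in tokens if t in UNHEALTHY)
    return n_healthy, n_unhealthy


def healthy_dieting_score(n_healthy, n_unhealthy):
    """Ratio form of the healthy dieting score; None if no food item is mentioned."""
    if n_healthy == 0 and n_unhealthy == 0:
        return None
    sum_h = float(n_healthy)     # assumed: each healthy item scores +1
    sum_u = -float(n_unhealthy)  # assumed: each unhealthy item scores -1
    return (sum_h + sum_u) / (sum_h - sum_u)


tokens = "there better be wine and coffee in hell".split()
print(healthy_dieting_score(*count_food_items(tokens)))
```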

Co-occurrence Analysis for Contextual Word Embeddings

In this paper, we define the co-occurrence matrix as the frequency with which two terms occur in the same document. In other words, we measure how often two words appear together in a single document (e.g., a single PubMed abstract or a single tweet). We use the co-occurrence matrix to find important co-occurring terms by determining the proximity of the terms in the document.
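A document-level co-occurrence count can be sketched as below. The function name and the toy documents are hypothetical; each pair of distinct terms is counted once per document in which both terms appear, and the sparse counter stands in for the full matrix.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for every unordered term pair, the number of documents containing both."""
    counts = Counter()
    for tokens in documents:
        for pair in combinations(sorted(set(tokens)), 2):
            counts[pair] += 1
    return counts


# Hypothetical tokenized documents (e.g., tweets or abstracts).
docs = [
    ["obesity", "pizza", "soda"],
    ["obesity", "pizza"],
    ["salad", "obesity"],
]
counts = cooccurrence_counts(docs)
print(counts[("obesity", "pizza")])
```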

The co-occurrence model is extended to a probability model that can incorporate the conditional probability of obesity (O) and its related diseases (D) for a given food domain (F), such as p(O|F) or p(D|F), as well as infer the food (healthy or unhealthy) (F) for known diseases such as obesity (O), p(F|O), or related diseases (D), p(F|D). We computed the conditional probability of a given keyword in a given set of abstracts, in terms of disease and food, as follows:

p(D_k|D_p): The conditional probability of disease keyword D_k being present in the PubMed disease publications D_p
p(D_k|F_p): The conditional probability of disease keyword D_k being present in the PubMed food publications F_p
p(F_k|D_p): The conditional probability of food keyword F_k being present in the PubMed disease publications D_p
p(F_k|F_p): The conditional probability of food keyword F_k being present in the PubMed food publications F_p
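One plausible way to estimate these probabilities is the fraction of abstracts in a publication set that mention the keyword. This estimator is an assumption made for illustration, since the exact estimator is not spelled out here; the function name and the toy abstracts are likewise hypothetical.

```python
def keyword_probability(keyword, abstracts):
    """Estimate p(keyword | publication set) as the share of abstracts mentioning it."""
    if not abstracts:
        return 0.0
    hits = sum(1 for tokens in abstracts if keyword in tokens)
    return hits / len(abstracts)


# Hypothetical disease publications, each reduced to a set of tokens.
disease_pubs = [
    {"obesity", "diet"},
    {"diabetes"},
    {"obesity"},
    {"pizza", "obesity"},
]
print(keyword_probability("obesity", disease_pubs))  # disease keyword in disease publications
print(keyword_probability("pizza", disease_pubs))    # food keyword in disease publications
```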

Word Embeddings with Topic Modeling

Our approach to topic discovery is based on Latent Dirichlet Allocation (LDA) [37] in which topics are discovered through the probability distribution of topics over documents that are associated with word distribution. Words can be mapped into topics that are associated with documents. The LDA model is a Bayesian model where the distributions over the parameters (𝜃_d and φ_t are the topic distribution for document d and the word distribution for topic t, respectively) are modeled, given by Dirichlet distributions with hyperparameters (α and β). We define the LDA variable names in Table 1.

Table 1.

LDA variable names

M	The number of documents
N	The number of words in document d
α	The parameter of the Dirichlet prior on the per-document topic distributions
β	The parameter of the Dirichlet prior on the per-topic word distribution
𝜃_d	The topic distribution for document d
φ_t	The word distribution for topic t
z_dn	The topic for the nth word in document d
w_dn	The nth word in document d

Given the parameters 𝜃 and φ, the probability of a word in the LDA model is defined in (6).

P(w_{dn} = w | \theta_{d}, \phi) = \sum_{t} \theta_{dt} \phi_{tw} \qquad (6)

where the document is d, topic t, and term w.
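The mixture in (6) can be checked numerically with a small sketch. The function name and the two-topic toy distributions below are hypothetical; the point is only that the word probability is the topic-weighted sum of the per-topic word probabilities.

```python
def word_probability(theta_d, phi, word):
    """P(w | theta_d, phi): sum over topics of theta_dt * phi_tw, as in Eq. (6)."""
    return sum(theta_d[t] * phi[t].get(word, 0.0) for t in range(len(theta_d)))


# Hypothetical document-topic and topic-word distributions.
theta_d = [0.7, 0.3]
phi = [
    {"diet": 0.5, "fruit": 0.5},
    {"diet": 0.2, "pizza": 0.8},
]
print(word_probability(theta_d, phi, "diet"))
```

Because each distribution sums to 1, the probabilities of all words in the vocabulary also sum to 1 for the document.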

We computed the overall term frequency and the estimated term frequency within the selected topic according to Termite [43]. The most salient terms were computed for topic t and term w for the given frequency of words (freq(w)) in Equation 7.

salient(w) = freq(w) \cdot \sum_{t} P(t | w) \cdot \log \frac{P(t | w)}{P(t)} \qquad (7)
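The saliency of a term can be computed directly from this formula. The sketch below is an illustration with hypothetical inputs: the topic distribution given the word, P(t | w), and the marginal topic distribution, P(t), are assumed to be available as lists of equal length, and topics with P(t | w) = 0 are skipped because they contribute nothing to the sum.

```python
import math


def saliency(freq_w, p_t_given_w, p_t):
    """Saliency of a term: frequency times the divergence of P(t|w) from P(t)."""
    distinctiveness = sum(
        ptw * math.log(ptw / pt)
        for ptw, pt in zip(p_t_given_w, p_t)
        if ptw > 0
    )
    return freq_w * distinctiveness


# A term concentrated in one of two equally likely topics is highly salient.
print(saliency(10, [1.0, 0.0], [0.5, 0.5]))
# A term spread across topics exactly like the corpus carries no topic information.
print(saliency(10, [0.5, 0.5], [0.5, 0.5]))
```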

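The relevance metric R of a term w to a topic t was computed for a given weight parameter λ (where 0 ≤ λ ≤ 1) according to (8), as defined in [44], where φ_{tw} = P(w | t) is the probability of the term w in topic t and p_w = P(w) is the marginal probability of w in the corpus.

R(w, t | \lambda) = \lambda \log P(w | t) + (1 - \lambda) \log \frac{P(w | t)}{P(w)} = \lambda \log(\phi_{tw}) + (1 - \lambda) \log(\frac{\phi_{tw}}{p_{w}}) \qquad (8)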
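The log form of the relevance metric can be sketched as follows; the function name and the example values are hypothetical. Setting λ = 1 ranks terms by their probability within the topic, while λ = 0 ranks them by their lift over the marginal probability.

```python
import math


def relevance(phi_tw, p_w, lam):
    """Relevance of a term to a topic: lam*log(phi_tw) + (1-lam)*log(phi_tw/p_w)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * math.log(phi_tw) + (1.0 - lam) * math.log(phi_tw / p_w)


# Hypothetical term with topic probability 0.2 and marginal probability 0.1.
for lam in (0.0, 0.5, 1.0):
    print(lam, relevance(0.2, 0.1, lam))
```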

In this paper, we applied the LDA model [37] to build latent topics on healthy dieting and obesity in the crowd domain (using Twitter data) and the expert domain (using publications with PubMed API [45]). A topic model from a Food domain is aligned with a topic model from an Obesity domain. We applied the library of PyLDA [46] to discover the hidden topics from the expert corpus and the crowd corpus and visualized them with PyLDA.

Results and Evaluation

The prototype of the ContWEB framework was implemented using Scala and Python with Apache Spark [47] and TensorFlow [48]. In addition, open source projects, such as CoreNLP [39], VADER [41], TextBlob [42], and Python Scikit-learn [49] and PyLDA [46], are fully utilized in the implementation of the ContWEB system.

In the rest of this section, we discuss the results and the evaluation of the proposed framework. First, we describe the word embeddings extracted by TF-IDF and Word2Vec [16] for the data collection from the crowd domain (Twitter) and the expert domain (PubMed) datasets. Second, we show the results of the sentiment analysis with the social embeddings from food tweets into four categories such as healthy-positive, healthy-negative, unhealthy-positive, and unhealthy-negative. Third, we present the co-occurrence matrix [17] with the conditional probability for finding the relationships between obesity and healthy dieting. Fourth, we present the latent topic embeddings extracted using LDA [37].

Results from Data Collection

Crowd Data Collection

For the data collection in the crowd domain, we considered three social media platforms, Twitter, Facebook, and Instagram, as shown in Table 2. In terms of active users and API availability, Twitter and Facebook are more suitable than Instagram. In this paper, due to Twitter’s outstanding search capability, we collected the crowd data from Twitter.

Table 2.

Social network comparison

Features	Twitter	Facebook	Instagram
Active users [50]	560M	1B	150M
Text limit (characters) [51]	280	63,206	2,200
Followers’ values for the influence posts [52]	$60,000	$187,500	$150,000
User ratio over the Internet users [53]	Female 22% Male 15%	Female 76% Male 66%	Female 20% Male 15%
Application program interface (API)	Twitter API [54]	Facebook API [55]	Instagram API [56]: recent users

The tweets for the crowd corpus were collected from January 15, 2018, to January 19, 2018 (5 days), through the Twitter streaming API [57] using keywords on food or obesity. We took the standard food keywords from the following sources: the USDA MyPlate (“2015–2020 Dietary Guidelines for Americans” for children) [4], the USDA Standardized Recipe [58], ChooseMyPlate [4], the most unhealthy meals in America [59], and Worst Options for Restaurant Menu [60].

Table 3 shows the statistics of the crowd data collection. In total, 54,473 tweets were collected as healthy foods and 68,974 tweets as unhealthy foods. From this crowd dataset, we also analyzed tweets including any medical terms. For the crowd data collection, we used the healthy/unhealthy food keywords (76 healthy foods and 28 unhealthy foods) and 10 diseases as defined in Table 4.

Table 3.

Crowd data collection

Keyword	Tweet no.	Tweet no. (after cleaning)
Healthy food	54,473	41,199
Unhealthy food	68,974	62,410
Total		103,609

Table 4.

Keywords for data collection

Type	Count	Keywords
Disease	10	obesity heart_attack diabetes blood_pressure heart_disease overweight breast_cancer HIV heartburn lung_cancer
Healthy food	76	alfalfa apple asparagus avocado banana beans beets blueberry broccoli cabbage cantaloupe carnitas carrot cauliflower celery cherry chives coconut cranberry cucumber dates egg elderberry figs garlic ginger gourd grape grapefruit greens grits guava habanero jambalaya kiwi lemon lentils lettuce lime lobster maize mango melon mint miso olive onion orange parsley peach peanut peas peppermint peppers pickle pineapple plum pomegranate pumpkin raisin rocket salad salsa salmon seaweed sesame soybean spinach squash strawberry sweetcorn tofu tomatillo tomato vegetable zucchini
Unhealthy food	28	arbys bacon bbq beer bread cake checkers chips chipotle coffee cookie dessert donut ketchup kfc mayo milkshake nuggets nutella pancake pepperoni pizza potato rum sausage soda waffles wine

The frequency of food items was counted for each tweet, and then each tweet was classified as either healthy or unhealthy. From the healthy dieting analysis, Fig. 2 shows the top 15 healthy/unhealthy food keywords used for the collection of the crowd dataset. Figure 3a shows the top 7 frequently mentioned diseases in the crowd domain. Diabetes and heart attack are ranked highly in the crowd domain.

Fig. 2 — Crowd: top 15 healthy/unhealthy foods

Fig. 3 — Crowd and expert: top 7 diseases

As shown in Table 3, after removing duplicates and pre-processing (removal of URLs, punctuation, special characters, and stopwords), about 25% of the healthy tweets and 10% of the unhealthy tweets were removed.
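A cleaning step of this kind can be sketched with regular expressions. The patterns and function names below are assumptions for illustration, not the exact rules used for the crowd dataset; stopword removal is omitted, and duplicates are detected after cleaning.

```python
import re

URL_RE = re.compile(r"https?://\S+")
NON_WORD_RE = re.compile(r"[^a-z0-9\s]")


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, punctuation, and special characters."""
    text = URL_RE.sub(" ", text.lower())
    text = NON_WORD_RE.sub(" ", text)
    return " ".join(text.split())


def clean_and_deduplicate(tweets):
    """Clean every tweet and keep only the first occurrence of each cleaned text."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        text = clean_tweet(tweet)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


print(clean_and_deduplicate(["Love pizza!", "love PIZZA", "Salad :)"]))
```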

Expert Data Collection

As shown in Table 5, we collected 6602 abstracts from PubMed API [45] using keywords as well as extended search using the representative or related terms from TF-IDF [15] and Word2Vec [16]. Each of the data collection steps is described in more detail below.

We collected 1000 abstracts for each of the disease keywords “Obesity,” “Diabetes,” and “Heart Disease” through the PubMed API [45].
We applied TF-IDF to the 1000 “Obesity” abstracts to find a set of representative terms.
We applied Word2Vec to each TF-IDF term using the 1000 obesity abstracts and found the related Word2Vec terms (shown in Table 4).
We collected 25 abstracts for each Word2Vec term of obesity, for a total of 275 abstracts.
We collected 25 abstracts for each additional keyword that we encountered in the obesity-related documents but that Word2Vec did not identify.
We collected about 25 abstracts for each healthy and unhealthy food keyword created from the USDA food API (a total of 104 keywords, shown in Table 4).

Table 5.

Expert dataset

Category	Disease/food	No. of abstracts
Diseases	Diabetes	1000
	Heart disease	1000
	Obesity	1000
Food	Healthy	2584
	Unhealthy	743
Medical terms	Word2Vec of obesity	275
Total number of abstracts		6602

Figure 3b shows the top 7 frequently mentioned diseases in the expert dataset. Obesity, diabetes, and heart diseases are ranked highly in the expert domain.

NLP Results

NLP with Twitter Crowd Corpus: Before applying the NLP techniques to the Twitter dataset, the size of the corpus was 671,383 words. After removing stopwords and words shorter than three characters, the size of the corpus was 528,740 words. After lemmatization, 50,637 unique words were collected in the crowd domain.

NLP with PubMed Expert Corpus: For 6,602 abstracts, after tokenization, we got 1,147,215 words. After stopword removal, we got 286,854 words. After lemmatization, 9,567 unique words were extracted for the expert dictionary.

Table 6 shows the results from the NLP for the crowd and the expert domains.

Table 6.

NLP results (word no.)

Operation	Crowd corpus	Expert corpus
Before NLP	671,383	1,147,215
After stopword removal	528,740	286,854
Unique lemmatization	50,637	9,567

Results from Sentiment Analysis

For the sentiment analysis in the crowd domain, the sentiment and food emojis of tweets, encoded as Unicode sequences, are parsed into their CLDR short names by creating and using a lexicon according to the Full Emoji List, v11.0 [61]. For example, positive, neutral, and negative faces can be converted to CLDR short names such as grinning face, thinking face, and frowning face, respectively. The food sentiment analysis was computed using the positive and negative terms [62] shown in Table 7.
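The emoji conversion can be sketched with a small lookup table. The four-entry lexicon and the function name below are hypothetical and only illustrate the idea; a full lexicon would be built from the complete emoji list.

```python
# Tiny illustrative lexicon mapping emoji code points to CLDR short names.
EMOJI_SHORT_NAMES = {
    "\U0001F600": "grinning face",
    "\U0001F914": "thinking face",
    "\u2639": "frowning face",
    "\U0001F355": "pizza",
}


def replace_emojis(text, lexicon=EMOJI_SHORT_NAMES):
    """Replace every known emoji in the text with its short name."""
    out = []
    for ch in text:
        if ch in lexicon:
            out.append(f" {lexicon[ch]} ")
        elif ch == "\uFE0F":
            continue  # drop the emoji variation selector that may follow a symbol
        else:
            out.append(ch)
    return " ".join("".join(out).split())


print(replace_emojis("pizza night \U0001F600"))
```

After this replacement, the short names are ordinary words, so the same tokenization and sentiment lexicons can be applied to them as to the rest of the tweet.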

Table 7.

Word embeddings for crowd food sentiment analysis

Type	Keywords
Positive	appetize, tempt, attractive, yummy, savory, tasty, relish, good, luscious, mellow, mouthwatering, addictive, delectable, aromatic, delicious, connoisseur, authentic
Negative	bland, burnt, disgusting, greasy, heavy, indigestion, junkfood, messy, mushy, obesity, smelly, stink, tasteless, yuck

We analyzed the content of diet tweets and the sentiment of the tweet messages (e.g., the relative frequency of positive and negative sentiments). We also analyzed the healthy food trends with obesity and obesity-related diseases. As expected, fruits and vegetables are categorized as healthy food, and fast foods are classified as unhealthy food.

After applying term frequency (TF) to the crowd dataset (Twitter) for the healthy/unhealthy categorization, the food–mood analysis was conducted on healthy food tweets. Table 8 shows the healthy dieting categories in terms of the number of healthy, unhealthy, and compound (both healthy and unhealthy) foods, as well as the food–mood in terms of the number of positive and negative tweets.

Table 8.

Crowd healthy dieting categorization

Healthy food	Unhealthy food	Compound food	Total
30,953	39,281	7,163	77,397
Emotion	Positive	Negative	Total
	62,485	14,912	77,397

We also grouped the tweets into six food sentiment categories (as shown in Table 9). Unhealthy-negative (40.5%) is the most frequent food sentiment type, followed by healthy-positive (32.6%), unhealthy-positive (10.2%), compound-positive (7.5%), healthy-negative (7.3%), and compound-negative (1.69%). These results indicate that Twitter users are aware of healthy dieting issues and expressed more positive sentiments about the topics. Since the datasets used in this study are newly collected, they are not labeled. Thus, we manually evaluated about 10% of the crowd dataset labeled by the proposed model, ASA. We confirmed that ASA’s classification accuracy (80%, 79%, and 75.25% for positive, negative, and neutral sentiments, respectively) is higher than that of other tools, including VADER [41], TextBlob [42], and CoreNLP [39] (Table 10).
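
The category shares quoted above follow directly from the tweet counts reported in Table 9. A minimal sketch of that arithmetic is given below; the function name `percentage_shares` is introduced here for illustration only.

```python
def percentage_shares(counts):
    """Return each category's share of the total, in percent."""
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}


# Tweet counts per (food type, sentiment) pair, as reported in Table 9.
table9_counts = {
    ("healthy", "positive"): 25268,
    ("healthy", "negative"): 5685,
    ("unhealthy", "positive"): 7917,
    ("unhealthy", "negative"): 31364,
    ("compound", "positive"): 5853,
    ("compound", "negative"): 1310,
}

shares = percentage_shares(table9_counts)
for category, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(category, f"{share:.2f}%")
```

The six counts sum to the 77,397 tweets of Table 8, so the shares add up to 100%.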

Table 9.

Crowd food sentiment analysis

	Healthy		Unhealthy		Compound
	No. of tweets	Tweet (%)	No. of tweets	Tweet (%)	No. of tweets	Tweet (%)
Positive	25,268	32.6	7,917	10.2	5,853	7.5
Negative	5,685	7.3	31,364	40.5	1,310	1.69
Total	30,953	39.99	39,281	50.75	7,163	9.25

Table 10.

Evaluation of sentiment analysis accuracy on the Twitter food dataset

	Positive (%)	Negative (%)	Neutral (%)
VADER	74.52	71.4	67.36
TextBlob	69.0	60.35	63.93
CoreNLP	58.53	33.33	45.75
ASA (Ours)	80	79	75.25

Figure 4 shows the tag clouds of four categories of food sentiment (healthy-positive, unhealthy-positive, healthy-negative, unhealthy-negative) obtained through the food tweet sentiment analysis. Each tag cloud shows 50–60 food keywords. The tag cloud for each category depicts frequently mentioned food keywords, and the importance of each tag is shown by its font size or color: a larger tag indicates a more frequently mentioned food keyword.

Fig. 4 — Crowd’s food–mood word embeddings

Result from Co-occurrence Analysis

The word embeddings based on co-occurrence and conditional probability were built from the data collected in the expert domain. As shown in Table 11, the 6602 abstracts are divided into disease abstracts (3275) and food abstracts (3327) based on the disease keywords (10 diseases) and food keywords. The conditional probabilities computed for these keywords and abstracts are shown in Table 12.

Table 11.

Expert: diseases/foods in publications

Category	Number
PubMed abstract no.	6602
Disease abstract no.	3275
Food abstract no.	3327
Unique disease keyword no.	71
Unique food keyword no.	286
Disease keyword no. in disease abstracts	4143
Disease keyword no. in food abstracts	431
Food keyword no. in disease abstracts	114
Food keyword no. in food abstracts	5837

Table 12.

Conditional probability in diseases and foods in PubMed publications

Ranking	p(D_k/D_p)		p(D_k/F_p)		p(F_k/D_p)		p(F_k/F_p)
1	Obesity	34%	Cholesterol	27%	Dates	27%	Lettuce	4%
2	Diabetes	33%	Diabetes	21%	Bread	11%	Tomato	3%
3	Diabetic	11%	Obesity	18%	Olive	9%	Citrus	2%
4	Hypertension	8%	Diabetic	10%	Pomelo	6%	Apple	2%
5	Cholesterol	5%	HIV	8%	Soybean	5%	Onion	2%
6	Overweight	3%	Hepatitis	3%	Orange	4%	Potato	2%
7	HIV	1%	Hypertension	3%	Peas	4%	Orange	2%
8	Thrombosis	1%	Overweight	3%	Beans	3%	Cucumber	2%
9	Dialysis	1%	Gastroparesis	1%	Citrus	3%	Garlic	2%
10	Sepsis	1%	Cholera	1%	Grape	3%	Grape	2%

D_k/D_p: disease keyword/disease publication; D_k/F_p: disease keyword/food publication

F_k/D_p: food keyword/disease publication; F_k/F_p: food keyword/food publication
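
To make the quantities in Table 12 concrete, the sketch below estimates the probability of a keyword given a publication class as a relative frequency, i.e., the number of occurrences of the keyword divided by the number of occurrences of all keywords of the same kind in that class. This estimator and the toy abstracts are assumptions made for illustration, not the exact computation or data used for Table 12.

```python
from collections import Counter


def keyword_probabilities(abstracts, keywords):
    """Estimate p(keyword | publication class) as a relative frequency.

    abstracts: list of token lists, all taken from one publication class.
    keywords:  set of keywords of one kind (e.g., disease or food terms).
    """
    counts = Counter(tok for doc in abstracts for tok in doc if tok in keywords)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() keeps the ranking order used in a table such as Table 12.
    return {kw: n / total for kw, n in counts.most_common()}


# Toy disease abstracts (hypothetical data).
disease_abstracts = [
    ["obesity", "risk", "diabetes"],
    ["obesity", "apple"],
    ["hypertension", "obesity"],
]
disease_keywords = {"obesity", "diabetes", "hypertension"}
p_dk_dp = keyword_probabilities(disease_abstracts, disease_keywords)
```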

Figure 5 shows the co-occurrence relationships between 20 healthy and unhealthy food items and 7 diseases (obesity, heart attack, diabetes, blood pressure, heart diseases, overweight, breast cancer) in the crowd domain. Among the 20 food items, only 7 healthy food items are mentioned together with obesity and its related diseases, while 13 unhealthy food items are mentioned. Figure 6 shows the co-occurrence relationships between 20 healthy/unhealthy food items and the 7 diseases in the expert domain. Interestingly, 13 healthy food items co-occur with obesity and its related diseases. This result shows that the experts and the crowd have entirely different perspectives.

More precisely, in Fig. 5, diabetes shows the highest co-occurrence weight (38%) and heart attack the second highest (16%) among the 7 diseases for the top 20 food items. Apple and coffee have the highest co-occurrence weights. From this, we can interpret that apple may be positively related to the diseases while coffee is negatively related to them. Heart disease and overweight showed the lowest co-occurrence weights (4% each). The 20 food items shown in Fig. 5 cover more than 86% of the co-occurrence weight of the food domain.

Figure 6 shows the disease–food co-occurrence relationships for the top 20 food items and 7 diseases in the expert domain. In this figure, obesity shows the highest co-occurrence weight (41%) among the 7 diseases for the top 20 food items (with vegetable and dates having the highest co-occurrence weights), and diabetes shows the second highest (36%). Unlike in the crowd domain, heart attack and breast cancer show the lowest co-occurrence weights (1% and 3%, respectively). The 20 food items shown in Fig. 6 represent 99% of the co-occurrence weight of the food domain.
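
A minimal sketch of how such disease–food co-occurrence weights could be derived is given below. It assumes that a co-occurrence is counted once per document mentioning both a disease and a food term, and that weights are counts normalized by the total number of co-occurrences; the documents and keyword sets are toy examples, not the data behind Figs. 5 and 6.

```python
from collections import Counter
from itertools import product


def cooccurrence_weights(docs, diseases, foods):
    """Normalized counts of documents mentioning a disease and a food together."""
    pair_counts = Counter()
    for doc in docs:
        tokens = set(doc)
        # Every (disease, food) pair present in the same document co-occurs once.
        for disease, food in product(diseases & tokens, foods & tokens):
            pair_counts[(disease, food)] += 1
    total = sum(pair_counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in pair_counts.items()}


def disease_weights(weights):
    """Aggregate the pair weights per disease."""
    per_disease = Counter()
    for (disease, _food), weight in weights.items():
        per_disease[disease] += weight
    return dict(per_disease)


# Toy documents (hypothetical data).
docs = [
    ["diabetes", "apple"],
    ["diabetes", "coffee", "apple"],
    ["obesity", "coffee"],
    ["apple"],
]
weights = cooccurrence_weights(docs, {"diabetes", "obesity"}, {"apple", "coffee"})
by_disease = disease_weights(weights)
```

Note that such a weight only measures how often two terms appear together; it does not indicate whether the association is positive or negative.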

Fig. 6 — Expert disease–food co-occurrence

Results of Topic Modeling

We used topic coherence, which is commonly employed to determine the optimal number of topics with LDA. The UMass measure [63] is an intrinsic coherence measure based on an asymmetrical confirmation measure between pairs of top words. For a given ordered word set, it computes a pairwise score function by comparing a word to the preceding and succeeding words. The UMass measure computes how much a common word triggers a rarer word, p(rare word | common word), and is based on the smoothed conditional log-probability.

We determine an optimal number of topics using the UMass measure for both the expert corpus and the crowd corpus. The UMass coherence (C_UM) was defined by [63] for the measure between top word pairs of a topic based on smoothed conditional probability as defined in (9).

C_{UM} = \frac{2}{N (N - 1)} \sum_{i = 2}^{N} \sum_{j = 1}^{i - 1} \log \frac{P (w_{i}, w_{j}) + \alpha}{P (w_{j})} \quad (9)

where N is the number of top words of a topic and α is added to avoid logarithm of zero.
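
As a sketch under stated assumptions, the coherence of a single topic could be computed as follows. Probabilities are estimated as document frequencies, the smoothing term is added to the joint probability before taking the logarithm so that a pair that never co-occurs does not produce log 0, and the default value of `alpha` is an assumption rather than a value reported in this paper.

```python
import math


def umass_coherence(top_words, docs, alpha=1e-12):
    """UMass-style coherence of one topic.

    top_words: the topic's top words, ordered from most to least probable;
               each word is assumed to occur in at least one document.
    docs:      iterable of token collections; probabilities are estimated
               as document frequencies.
    alpha:     small smoothing constant (assumed value) that avoids log(0).
    """
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    n = len(top_words)
    score = 0.0
    for i in range(1, n):
        for j in range(i):
            p_joint = prob(top_words[i], top_words[j])
            score += math.log((p_joint + alpha) / prob(top_words[j]))
    return 2.0 * score / (n * (n - 1))


# Toy corpus (hypothetical data).
toy_docs = [["apple", "diet"], ["apple", "diet", "sugar"], ["sugar"], ["apple"]]
toy_score = umass_coherence(["apple", "diet", "sugar"], toy_docs)
```

Scores are at most zero for words that never co-occur more often than they occur alone, and a score closer to zero indicates a more coherent topic.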

In our experiment, the coherence measure was used to determine the optimal number of topics for the given corpus (the experts’ and the crowd’s corpus) by varying the value of K. It is not easy to set the parameters of the LDA model, such as the number of topics K and the Dirichlet priors for the document-to-topic (𝜃_dt) and topic-to-word (φ_tw) distributions. In this paper, we followed the standard heuristic: the φ_tw prior is 0.01 and the 𝜃_dt prior is 0.05 ∗ (L/K), where L is the average document length and K is the number of topics.
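
The prior heuristic can be written out as a short computation. The function name `lda_priors` and the toy documents are introduced here for illustration only.

```python
def lda_priors(docs, num_topics):
    """Return (document-topic prior, topic-word prior) under the heuristic
    theta prior = 0.05 * (L / K) and phi prior = 0.01."""
    avg_len = sum(len(doc) for doc in docs) / len(docs)  # L
    theta_prior = 0.05 * (avg_len / num_topics)
    phi_prior = 0.01
    return theta_prior, phi_prior


# Two toy documents with an average length of L = 100 tokens and K = 5 topics.
toy_corpus = [["w"] * 80, ["w"] * 120]
theta_prior, phi_prior = lda_priors(toy_corpus, num_topics=5)
```

For L = 100 and K = 5, the document-to-topic prior is 0.05 ∗ (100/5) = 1.0, while the topic-to-word prior stays at 0.01 regardless of K.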

The topic coherence refers to interpretability of LDA topics by measuring if retrieved words consistently belong to the same topic. The higher the topic coherence score, the better the topic model.

In our experiments, we computed the coherence scores for topic numbers from 3 to 10 for both the crowd and expert domain datasets. For the evaluation of the topic model, we estimated the number of topics in both the crowd’s and the experts’ domains. From this analysis, we found that the coherence scores for the experts’ domain were higher than those for the crowd’s domain. The average coherence score per topic was evaluated for a range of models trained with different numbers of topics. For the crowd’s topic number, Fig. 7a shows a high coherence score around 4–6 topics, so we can expect the optimal number of topics to lie in this range. Similarly, for the experts’ topic number, Fig. 7b shows a high coherence score around 3–5 topics, so the optimal number of topics is expected to lie in this range. Thus, we chose 5 as the optimal number of topics for both the crowd and expert domains.

Fig. 7 — Crowd’s and experts’ optimal topics

We generated the top five topics for the experts’ obesity domain and the top five LDA topics in the crowd’s disease domain. To generate the LDA topics for the expert dataset, we used BioPortal API [40] with the expert dataset. The top five topics in obesity publications are shown in Fig. 8. The topics’ terms and their significance measures are presented in Table 13. As seen from the table, many of these terms from the obesity domain are related to heart or diabetes diseases. The top five topics in the crowd’s disease domain are shown in Fig. 9. The topics’ terms and their significance measures are presented in Table 14. As seen from the table, many of these terms discovered from the crowd domain are also related to the experts’ terms (e.g., heart nos, murine heart, heart structure, circulatory organ) or general terms (e.g., pathology, clinical trials, mortality, disease or disorder). However, these results confirm that the crowd’s perspectives (Twitter) are significantly different from the experts’ perspectives (PubMed). For example, the dominant topics (topics 1 and 2) cover the topic terms from the crowd’s perspective such as cancer, blood pressure, cure, treatment, eat, and family.

Table 13.

Expert’s top five topics in obesity

Top salient terms	Topic 1 (28.6%)	Topic 2 (17%)	Topic 3 (15.5%)	Topic 4 (9.3%)	Topic 5 (7.2%)
orphanet_377788	heart	disease_or_disorder	disease_state	patient	occurrence
disease_diseasefinding	pathology	pathology	disease	patients	high risk
disease_process	disease_state	disease	heart	heart	available
risk	disease	disease_state	pathology	disease_or_disorder	availability
actmoodrisk	disease_or_disorder	heart	disease_or_disorder	disease	incidence
subject_risk	heart_nos	heart_nos	heart_nos	disease_state	actmoodrisk
multicellular_organism	murine_heart	orphanet_377788	study	pathology	risk
occurrence	heart_structure	disease_diseasefinding	diseases	clinical_trials	multiple
incidence	entire_heart	disease_process	disorders	risk	thromboembolism
infection	cardiac_mesoderm	circulatory_organ	cardiac	subject_risk	multicellular_organism

Fig. 9 — Crowd’s top five disease topics

Table 14.

Crowd’s top five topics in disease

Top most salient terms	Topic 1 (23.8%)	Topic 2 (23.7%)	Topic 3 (18.1%)	Topic 4 (17.8%)	Topic 5 (16.6%)
heart attack	cancer	diabetes	heart attack	cancer	overweight
diabetes	heartburn	blood pressure	HIV	help	cancer
HIV	lung cancer	heart disease	heart	obesity	obesity
overweight	cure	breast cancer	blood pressure	breast cancer	dialysis
cancer	eat	risk	die	eating disorder	drink
blood pressure	family	cancer	diabetic	leukemia	need
heart disease	diagnose	health	attack	hivaids	prep
risk	dad	treatment	hernia	fight	nurse
heartburn	days	diet	brain cancer	raise	eat
breast cancer	friends	obesity	test	sepsis	ulcers

Discussion

This study makes a unique contribution by exploring the contextual word embeddings discovered from the crowd and expert domains. It is interesting to find that the topics about healthy dieting and obesity in the crowd domain are entirely different from those found in the expert domain. In the context of healthy dieting and obesity, the crowd’s perspectives in social networks thus differ from the experts’ perspectives in the publication domain.

In this paper, we showed how to identify whether a particular tweet is a healthy dieting tweet and how to determine whether a tweet expresses a positive or negative sentiment about healthy dieting. With the ASA meta-model for the food sentiment, we obtained a higher average precision of the food–mood with the contextual word embeddings. We observed that Word2Vec was suitable for converting the selected features to a vector space model [38], and that co-occurrence and LDA are very effective for reducing the large-scale, high-dimensional corpus data to a lower-dimensional representation by finding representative terms.

There are some limitations to this study. First, basic NLP techniques, such as n-grams, named entity detection, and negation detection, were not fully explored in finding contextual word embeddings. NLP validation is essential in contextual word embedding. Notably, the meaning of clinical entities can be affected by negation. However, it is challenging to process negated sentences with clinical context (e.g., a diagnosed patient’s condition) and to interpret the context accurately. The limitations of existing approaches such as NegEx [64] and ConText [65] have been studied. In our paper, we simply measured the strength of association (the frequency of co-occurrence of two terms) without differentiating negative from positive correlations, e.g., the high correlation between coffee and heart disease. Second, the ContWEB framework has not been thoroughly evaluated, mainly because of a lack of benchmark datasets in the healthy dieting and obesity domain. Thus, it was not feasible to formulate the relationships between healthy dieting and obesity. Furthermore, there is a limitation in interpreting the results from the food–mood analysis and topic modeling. It is hoped that, in the future, extensive experiments and evaluation can be carried out, leading to advances in medical research and health care.

Conclusion

In this paper, we have developed the Contextual Word Embeddings (ContWEB) framework to explore the relationships between healthy dieting and obesity via contextual word embeddings from the crowd’s and experts’ perspectives. The ContWEB framework was designed with advanced word embedding techniques, including feature extraction and topic modeling, sentiment analysis, and co-occurrence relationships between healthy dieting and obesity and other related diseases. The proposed framework was implemented with Apache Spark [47] and TensorFlow [48]. The results and the evaluation confirmed that the framework is useful in revealing social topics and sentiments related to obesity or its related diseases.

From the results, we realized that both the experts’ and the crowd’s perspectives should be analyzed to understand the topics and the social sentiments on healthy dieting in the real world. We conclude that the framework has the potential to help us better understand how social media influences our behavior and decision-making on healthy eating and obesity prevention. The differences between crowds and experts in the diseases associated with healthy dieting reflect their collective knowledge and behaviors about healthy dieting. We believe that integrating the perspectives of the crowds and the experts leads to a comprehensive understanding of healthy dieting and its awareness.

Contributor Information

Vijaya Kumari Yeruva, Email: vyq4b@mail.umkc.edu.

Sidrah Junaid, Email: sjhv6@mail.umkc.edu.

Yugyung Lee, Email: leeyu@umkc.edu.

References

1. Flegal KM, Carroll MD, Ogden CL, Curtin LR. Prevalence and trends in obesity among us adults, 1999-2008. Jama. 2010;303(3):235–241. doi: 10.1001/jama.2009.2014.
2. Ogden CL, Carroll MD, Kit BK, Flegal KM. Prevalence of obesity and trends in body mass index among us children and adolescents, 1999-2010. Jama. 2012;307(5):483–490. doi: 10.1001/jama.2012.40.
3. Dairy Council of California (2017) Healthy eating made easier. [Online]. Available: https://www.healthyeating.org/Healthy-Kids/Kids-Games-Activities.aspx
4. USDAMyPlate (2017) The usda myplate (2015-20 dietary guidelines for americans for children). [Online]. Available: https://www.choosemyplate.gov/games
5. Mejova Y, Weber I, Macy MW (eds) (2015) Twitter: a digital socioscope. Cambridge University Press, Cambridge
6. Achrekar H, Gandhe A, Lazarus R, Yu S-H, Liu B (2011) Predicting flu trends using twitter data. In: 2011 IEEE conference on computer communications workshops (INFOCOM WKSHPS). IEEE, pp 702–707
7. Culotta A (2010) Towards detecting influenza epidemics by analyzing twitter messages. In: Proceedings of the first workshop on social media analytics. ACM, pp 115–122
8. Huang M, ElTayeby O, Zolnoori M, Yao L. Public opinions toward diseases: Infodemiological study on news media data. J Med Internet Res. 2018;5:20. doi: 10.2196/10047.
9. Ghosh D, Guha R. What are we ‘tweeting’ about obesity? mapping tweets with topic modeling and geographic information system. Cartogr Geogr Inf Sci. 2013;40(2):90–102. doi: 10.1080/15230406.2013.776210.
10. Widener MJ, Li W. Using geolocated twitter data to monitor the prevalence of healthy and unhealthy food references across the us. Appl Geogr. 2014;54:189–197. doi: 10.1016/j.apgeog.2014.07.017.
11. Karami A, Dahl AA, Turner-McGrievy G, Kharrazi H, Shaw G. Characterizing diabetes, diet, exercise, and obesity comments on twitter. Int J Inf Manag. 2018;38(1):1–6. doi: 10.1016/j.ijinfomgt.2017.08.002.
12. Statista (2017) Number of social media users worldwide from 2010 to 2020. [Online]. Available: https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/
13. Nofer M, Hinz O. Are crowds on the internet wiser than experts? the case of a stock prediction community. J Bus Econ. 2014;84(3):303–338. doi: 10.1007/s11573-014-0720-x.
14. Poetz MK, Schreier M. The value of crowdsourcing: can users really compete with professionals in generating new product ideas? J Prod Innov Manag. 2012;29(2):245–256. doi: 10.1111/j.1540-5885.2011.00893.x.
15. Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: European conference on machine learning. Springer, pp 137–142
16. Mikolov T, Chen K, Corrado G, Dean J, Sutskever L, Zweig G (2014) “word2vec”, Google Scholar
17. Lund K, Burgess C. Producing high-dimensional semantic spaces from lexical co-occurrence. Behav Res Methods Instrum Comput. 1996;28(2):203–208. doi: 10.3758/BF03204766.
18. Levy O, Goldberg Y (2014) Linguistic regularities in sparse and explicit word representations. In: Proceedings of the eighteenth conference on computational natural language learning, pp 171–180
19. Globerson A, Chechik G, Pereira F, Tishby N. Euclidean embedding of co-occurrence data. J Mach Learn Res. 2007;8(Oct):2265–2295.
20. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
21. Socher R, Bauer J, Manning C, et al. (2013) Parsing with compositional vector grammars. In: Proceedings of the 51st annual meeting of the association for computational linguistics (vol 1: Long Papers), pp 455–465
22. Socher R, Perelygin A, Wu J, Chuang J, Manning C, Ng A, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 conference on empirical methods in natural language processing, pp 1631–1642
23. Pennington J, Socher R, Manning C (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543
24. Wang Y, Liu S, Afzal N, Rastegar-Mojarad M, Wang L, Shen F, Liu H (2018) A comparison of word embeddings for the biomedical natural language processing, arXiv:http://arXiv.org/abs/1802.00400
25. Bast ES, Berry EM. Laugh away the fat? therapeutic humor in the control of stress-induced emotional eating. Rambam Maimonides Medical Journal. 2014;1:5. doi: 10.5041/RMMJ.10141.
26. Yau YH, Potenza MN. Stress and eating behaviors. Minerva Endocrinol. 2013;38(3):255.
27. Tryon MS, Carter CS, DeCant R, Laugero KD. Chronic stress exposure may affect the brain’s response to high calorie food cues and predispose to obesogenic eating habits. Physiol Behav. 2013;120:233–242. doi: 10.1016/j.physbeh.2013.08.010.
28. Nguyen QC, Li D, Meng H-W, Kath S, Nsoesie E, Li F, Wen M. Building a national neighborhood dataset from geotagged twitter data for indicators of happiness, diet, and physical activity. JMIR Public Health Surveill. 2016;2:2. doi: 10.2196/publichealth.5064.
29. Eichstaedt JC, Schwartz HA, Kern ML, Park G, Labarthe DR, Merchant RM, Jha S, Agrawal M, Dziurzynski LA, Sap M, et al. Psychological language on twitter predicts county-level heart disease mortality. Psychol Sci. 2015;26(2):159–169. doi: 10.1177/0956797614557867.
30. CDC (2017) Centers for disease and control prevention: Adult obesity prevalence maps. [Online]. Available: https://www.cdc.gov/obesity/data/prevalence-maps.html
31. Paul MJ, Dredze M. You are what you tweet: analyzing twitter for public health. Icwsm. 2011;20:265–272.
32. Madan A, Moturu ST, Lazer D, Pentland AS (2010) Social sensing: obesity, unhealthy eating and exercise in face-to-face networks. In: Wireless Health 2010. ACM, pp 104–110
33. Scanfeld D, Scanfeld V, Larson EL. Dissemination of health information through social networks: Twitter and antibiotics. Am J Infect Control. 2010;38(3):182–188. doi: 10.1016/j.ajic.2009.11.004.
34. Poria S, Chaturvedi I, Cambria E, Hussain A (2016) Convolutional mkl based multimodal emotion recognition and sentiment analysis. In: 2016 IEEE 16th international conference on data mining (ICDM). IEEE, pp 439–448
35. Go A, Huang L, Bhayani R. Twitter sentiment analysis. Entropy. 2009;17:252.
36. Dixon N, Jakić B, Lagerweij R, Mooij M, Yudin E (2012) Foodmood: measuring global food sentiment one tweet at a time. In: Proceedings of sixth international AAAI conference on Weblogs and social media
37. Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. J Mach Learn Res. 2003;3(Jan):993–1022.
38. Erk K, Padó S (2008) A structured vector space model for word meaning in context. In: Proceedings of the conference on empirical methods in natural language processing. Association for computational linguistics, pp 897–906
39. Manning C, Surdeanu M, Bauer J, Finkel J, Bethard S, McClosky D (2014) The stanford corenlp natural language processing toolkit. In: Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pp 55–60
40. NCBO (2017) Bioportal api. [Online]. Available: http://data.bioontology.org/documentation
41. Gilbert CHE (2014) Vader: a parsimonious rule-based model for sentiment analysis of social media text. In: Eighth international conference on Weblogs and social media (ICWSM-14). Available at (20/04/16) http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf
42. Loria S, Keen P, Honnibal M, Yankovsky R, Karesh D, Dempsey E, et al. (2014) Textblob: simplified text processing, Secondary TextBlob: simplified text processing
43. Chuang J, Manning C, Heer J (2012) Termite: visualization techniques for assessing textual topic models. ACM, pp 74–77
44. Sievert C, Shirley K (2014) Ldavis: a method for visualizing and interpreting topics, pp 63–70
45. NCBI (2017) Pubmed central (pmc). [Online]. Available: https://www.ncbi.nlm.nih.gov/home/develop/api/
46. Dorlhiac GF, Fare C, van Thor JJ. Pyldm-an open source package for lifetime density analysis of time-resolved spectroscopic data. PLoS Comput Biol. 2017;13(5):e1005528. doi: 10.1371/journal.pcbi.1005528.
47. Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: cluster computing with working sets. HotCloud. 2010;10(10-10):95.
48. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M et al (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems, arXiv:http://arXiv.org/abs/1603.04467
49. scikit (2018) Scikit-learn machine learning in python. [Online]. Available: http://scikit-learn.org/stable/
50. SI Media (2018) Social influence media: active users. [Online]. Available: https://socialinfluencemedia.com/social-media-marketing/
51. SI Media (2018) Social influence media: Text size. [Online]. Available: https://sproutsocial.com/insights/social-media-character-counter/
52. Economist (2016) Influencer posts. [Online]. Available: https://www.economist.com/graphic-detail/2016/10/17/celebrities-endorsement-earnings-on-social-media
53. brandwatch (2018) Internet users. [Online]. Available: https://www.brandwatch.com/blog/men-vs-women-active-social-media/
54. Twitter (2018) Twitter api. [Online]. Available: https://developer.twitter.com/en/docs/tweets/search/overview
55. Facebook (2018) Facebook api. [Online]. Available: https://developers.facebook.com/docs/graph-api/reference/v3.0/user/feed
56. Instagram (2018) Instagram api. [Online]. Available: https://www.instagram.com/developer/endpoints/users/
57. Twitter (2016) Twitter streaming api. [Online]. Available: http://apiwiki.twitter.com/
58. USDA (2017) The usda standard on food and nutrition. [Online]. Available: https://www.fns.usda.gov
59. BusinessInsider (2017) The 8 unhealthiest restaurant meals in america. [Online]. Available: http://www.businessinsider.com/most-unhealthy-meals-in-america-2017-7/#uno-pizzeria-and-grill-chocolate-cake-1740-calories-2
60. Eatthis (2017) The #1 worst menu option at 41 popular restaurants. [Online]. Available: http://www.eatthis.com/restaurant-menu-worst-options/
61. Unicode (2017) Emoji list, v11.0. [Online]. Available: https://unicode.org/emoji/charts/full-emoji-list.html
62. TEFLtastic (2013) Positive and negative words in food. [Online]. Available: https://tefltastic.files.wordpress.com/2013/07/positive-and-negative-words-about-food.pdf
63. Mimno D, Wallach HM, Talley E, Leenders M, McCallum A (2011) Optimizing semantic coherence in topic models. In: Proceedings of the conference on empirical methods in natural language processing. Association for computational linguistics, pp 262–272
64. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform. 2001;34(5):301–310. doi: 10.1006/jbin.2001.1029.
65. Afzal Z, Pons E, Kang N, Sturkenboom MC, Schuemie MJ, Kors JA. Contextd: an algorithm to identify contextual properties of medical terms in a dutch clinical corpus. BMC Bioinf. 2014;15(1):373. doi: 10.1186/s12859-014-0373-3.

[CR1] 1.Flegal KM, Carroll MD, Ogden CL, Curtin LR. Prevalence and trends in obesity among us adults, 1999-2008. Jama. 2010;303(3):235–241. doi: 10.1001/jama.2009.2014. [DOI] [PubMed] [Google Scholar]

[CR2] 2.Ogden CL, Carroll MD, Kit BK, Flegal KM. Prevalence of obesity and trends in body mass index among us children and adolescents, 1999-2010. Jama. 2012;307(5):483–490. doi: 10.1001/jama.2012.40. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR3] 3.Diary Council of California (2017) Healthy eating made easier. [Online]. Available: https://www.healthyeating.org/Healthy-Kids/Kids-Games-Activities.aspx

[CR4] 4.USDAMyPlate (2017) The usda myplate (2015-20 dietary guidelines for americans for children). [Online]. Available: https://www.choosemyplate.gov/games

[CR5] 5.Mejova Y, Weber I, Macy MW (eds) (2015) Twitter: a digital socioscope. Cambridge University Press, Cambridge

[CR6] 6.Achrekar H, Gandhe A, Lazarus R, Yu S-H, Liu B (2011) Predicting flu trends using twitter data. In: 2011 IEEE conference on computer communications workshops (INFOCOM WKSHPS). IEEE, pp 702–707

[CR7] 7.Culotta A (2010) Towards detecting influenza epidemics by analyzing twitter messages. In: Proceedings of the first workshop on social media analytics. ACM, pp 115–122

[CR8] 8.Huang M, ElTayeby O, Zolnoori M, Yao L. Public opinions toward diseases: Infodemiological study on news media data. J Med Internet Res. 2018;5:20. doi: 10.2196/10047. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR9] 9.Ghosh D, Guha R. What are we ‘tweeting’ about obesity? mapping tweets with topic modeling and geographic information system. Cartogr Geogr Inf Sci. 2013;40(2):90–102. doi: 10.1080/15230406.2013.776210. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR10] 10.Widener MJ, Li W. Using geolocated twitter data to monitor the prevalence of healthy and unhealthy food references across the us. Appl Geogr. 2014;54:189–197. doi: 10.1016/j.apgeog.2014.07.017. [DOI] [Google Scholar]

[CR11] 11.Karami A, Dahl AA, Turner-McGrievy G, Kharrazi H, Shaw G. Characterizing diabetes, diet, exercise, and obesity comments on twitter. Int J Inf Manag. 2018;38(1):1–6. doi: 10.1016/j.ijinfomgt.2017.08.002. [DOI] [Google Scholar]

[CR12] 12.Statista (2017) Number of social media users worldwide from 2010 to 2020. [Online]. Available: https://www.statista.com/statistics/278414/number-of-worldwide-social-network-users/

[CR13] 13.Nofer M, Hinz O. Are crowds on the internet wiser than experts? the case of a stock prediction community. J Bus Econ. 2014;84(3):303–338. doi: 10.1007/s11573-014-0720-x. [DOI] [Google Scholar]

[CR14] 14.Poetz MK, Schreier M. The value of crowdsourcing: can users really compete with professionals in generating new product ideas? J Prod Innov Manag. 2012;29(2):245–256. doi: 10.1111/j.1540-5885.2011.00893.x. [DOI] [Google Scholar]

[CR15] 15.Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: European conference on machine learning. Springer, pp 137–142

[CR16] 16.Mikolov T, Chen K, Corrado G, Dean J, Sutskever L, Zweig G (2014) “word2vec”, Google Scholar

[CR17] 17.Lund K, Burgess C. Producing high-dimensional semantic spaces from lexical co-occurrence. Behav Res Methods Instrum Comput. 1996;28(2):203–208. doi: 10.3758/BF03204766. [DOI] [Google Scholar]

[CR18] 18.Levy O, Goldberg Y (2014) Linguistic regularities in sparse and explicit word representations. In: Proceedings of the eighteenth conference on computational natural language learning, pp 171–180

[CR19] 19.Globerson A, Chechik G, Pereira F, Tishby N. Euclidean embedding of co-occurrence data. J Mach Learn Res. 2007;8(Oct):2265–2295. [Google Scholar]

[CR20] 20.Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119

[CR21] 21.Socher R, Bauer J, Manning C, et al. (2013) Parsing with compositional vector grammars. In: Proceedings of the 51st annual meeting of the association for computational linguistics (vol 1: Long Papers), pp 455–465

[CR22] 22.Socher R, Perelygin A, Wu J, Chuang J, Manning C, Ng A, Potts C (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 conference on empirical methods in natural language processing, pp 1631–1642

[CR23] 23.Pennington J, Socher R, Manning C (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp 1532–1543

[CR24] 24.Wang Y, Liu S, Afzal N, Rastegar-Mojarad M, Wang L, Shen F, Liu H (2018) A comparison of word embeddings for the biomedical natural language processing, arXiv:http://arXiv.org/abs/1802.00400 [DOI] [PMC free article] [PubMed]

[CR25] 25.Bast ES, Berry EM. Laugh away the fat? therapeutic humor in the control of stress-induced emotional eating. Rambam Maimonides Medical Journal. 2014;1:5. doi: 10.5041/RMMJ.10141.

[CR26] 26.Yau YH, Potenza MN. Stress and eating behaviors. Minerva Endocrinol. 2013;38(3):255.

[CR27] 27.Tryon MS, Carter CS, DeCant R, Laugero KD. Chronic stress exposure may affect the brain’s response to high calorie food cues and predispose to obesogenic eating habits. Physiol Behav. 2013;120:233–242. doi: 10.1016/j.physbeh.2013.08.010.

[CR28] 28.Nguyen QC, Li D, Meng H-W, Kath S, Nsoesie E, Li F, Wen M. Building a national neighborhood dataset from geotagged Twitter data for indicators of happiness, diet, and physical activity. JMIR Public Health Surveill. 2016;2:2. doi: 10.2196/publichealth.5064.

[CR29] 29.Eichstaedt JC, Schwartz HA, Kern ML, Park G, Labarthe DR, Merchant RM, Jha S, Agrawal M, Dziurzynski LA, Sap M, et al. Psychological language on Twitter predicts county-level heart disease mortality. Psychol Sci. 2015;26(2):159–169. doi: 10.1177/0956797614557867.

[CR30] 30.CDC (2017) Centers for Disease Control and Prevention: Adult obesity prevalence maps. [Online]. Available: https://www.cdc.gov/obesity/data/prevalence-maps.html

[CR31] 31.Paul MJ, Dredze M. You are what you tweet: analyzing Twitter for public health. ICWSM. 2011;20:265–272.

[CR32] 32.Madan A, Moturu ST, Lazer D, Pentland AS (2010) Social sensing: obesity, unhealthy eating and exercise in face-to-face networks. In: Wireless Health 2010. ACM, pp 104–110

[CR33] 33.Scanfeld D, Scanfeld V, Larson EL. Dissemination of health information through social networks: Twitter and antibiotics. Am J Infect Control. 2010;38(3):182–188. doi: 10.1016/j.ajic.2009.11.004.

[CR34] 34.Poria S, Chaturvedi I, Cambria E, Hussain A (2016) Convolutional MKL based multimodal emotion recognition and sentiment analysis. In: 2016 IEEE 16th international conference on data mining (ICDM). IEEE, pp 439–448

[CR35] 35.Go A, Huang L, Bhayani R. Twitter sentiment analysis. Entropy. 2009;17:252.

[CR36] 36.Dixon N, Jakić B, Lagerweij R, Mooij M, Yudin E (2012) Foodmood: measuring global food sentiment one tweet at a time. In: Proceedings of sixth international AAAI conference on Weblogs and social media

[CR37] 37.Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn Res. 2003;3(Jan):993–1022.

[CR38] 38.Erk K, Padó S (2008) A structured vector space model for word meaning in context. In: Proceedings of the conference on empirical methods in natural language processing. Association for computational linguistics, pp 897–906

[CR39] 39.Manning C, Surdeanu M, Bauer J, Finkel J, Bethard S, McClosky D (2014) The Stanford CoreNLP natural language processing toolkit. In: Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pp 55–60

[CR40] 40.NCBO (2017) BioPortal API. [Online]. Available: http://data.bioontology.org/documentation

[CR41] 41.Hutto CJ, Gilbert E (2014) VADER: a parsimonious rule-based model for sentiment analysis of social media text. In: Eighth international conference on Weblogs and social media (ICWSM-14). Available at (20/04/16) http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf

[CR42] 42.Loria S, Keen P, Honnibal M, Yankovsky R, Karesh D, Dempsey E, et al. (2014) TextBlob: simplified text processing

[CR43] 43.Chuang J, Manning C, Heer J (2012) Termite: visualization techniques for assessing textual topic models. ACM, pp 74–77

[CR44] 44.Sievert C, Shirley K (2014) LDAvis: a method for visualizing and interpreting topics, pp 63–70

[CR45] 45.NCBI (2017) PubMed Central (PMC). [Online]. Available: https://www.ncbi.nlm.nih.gov/home/develop/api/

[CR46] 46.Dorlhiac GF, Fare C, van Thor JJ. PyLDM - an open source package for lifetime density analysis of time-resolved spectroscopic data. PLoS Comput Biol. 2017;13(5):e1005528. doi: 10.1371/journal.pcbi.1005528.

[CR47] 47.Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: cluster computing with working sets. HotCloud. 2010;10(10-10):95.

[CR48] 48.Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M et al (2016) TensorFlow: large-scale machine learning on heterogeneous distributed systems, arXiv:1603.04467

[CR49] 49.scikit (2018) Scikit-learn: machine learning in Python. [Online]. Available: http://scikit-learn.org/stable/

[CR50] 50.SI Media (2018) Social influence media: active users. [Online]. Available: https://socialinfluencemedia.com/social-media-marketing/

[CR51] 51.SI Media (2018) Social influence media: Text size. [Online]. Available: https://sproutsocial.com/insights/social-media-character-counter/

[CR52] 52.Economist (2016) Influencer posts. [Online]. Available: https://www.economist.com/graphic-detail/2016/10/17/celebrities-endorsement-earnings-on-social-media

[CR53] 53.brandwatch (2018) Internet users. [Online]. Available: https://www.brandwatch.com/blog/men-vs-women-active-social-media/

[CR54] 54.Twitter (2018) Twitter API. [Online]. Available: https://developer.twitter.com/en/docs/tweets/search/overview

[CR55] 55.Facebook (2018) Facebook API. [Online]. Available: https://developers.facebook.com/docs/graph-api/reference/v3.0/user/feed

[CR56] 56.Instagram (2018) Instagram API. [Online]. Available: https://www.instagram.com/developer/endpoints/users/

[CR57] 57.Twitter (2016) Twitter streaming API. [Online]. Available: http://apiwiki.twitter.com/

[CR58] 58.USDA (2017) The USDA standard on food and nutrition. [Online]. Available: https://www.fns.usda.gov

[CR59] 59.BusinessInsider (2017) The 8 unhealthiest restaurant meals in America. [Online]. Available: http://www.businessinsider.com/most-unhealthy-meals-in-america-2017-7/#uno-pizzeria-and-grill-chocolate-cake-1740-calories-2

[CR60] 60.Eatthis (2017) The #1 worst menu option at 41 popular restaurants. [Online]. Available: http://www.eatthis.com/restaurant-menu-worst-options/

[CR61] 61.Unicode (2017) Emoji list, v11.0. [Online]. Available: https://unicode.org/emoji/charts/full-emoji-list.html

[CR62] 62.TEFLtastic (2013) Positive and negative words in food. [Online]. Available: https://tefltastic.files.wordpress.com/2013/07/positive-and-negative-words-about-food.pdf

[CR63] 63.Mimno D, Wallach HM, Talley E, Leenders M, McCallum A (2011) Optimizing semantic coherence in topic models. In: Proceedings of the conference on empirical methods in natural language processing. Association for computational linguistics, pp 262–272

[CR64] 64.Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform. 2001;34(5):301–310. doi: 10.1006/jbin.2001.1029.

[CR65] 65.Afzal Z, Pons E, Kang N, Sturkenboom MC, Schuemie MJ, Kors JA. ContextD: an algorithm to identify contextual properties of medical terms in a Dutch clinical corpus. BMC Bioinf. 2014;15(1):373. doi: 10.1186/s12859-014-0373-3.
