Abstract
Objective
To detect and visualize salient queries about menopause using Big Data from ChaCha®.
Methods
We used Word Adjacency Graph (WAG) modeling to detect clusters and visualize the range of menopause-related topics and their mutual proximity. The subset of relevant queries was fully modeled. We split each query into token words (i.e., meaningful words and phrases) and removed stop words (i.e., function words that carry little meaning). The remaining words were considered in sequence to build summary tables of words and 2- and 3-word phrases. Phrases occurring at least 10 times were used to build a network graph model that was iteratively refined by observing and removing clusters of unrelated content.
Results
We identified two menopause-related subsets of queries by searching for questions containing menopause and menopause-related terms (e.g., climacteric, hot flashes, night sweats, hormone replacement). The first contained 263,363 queries from individuals age 13 and older and the second contained 5,892 queries from women aged 40–62 years. In the first set, we identified 12 topic clusters: 6 relevant to menopause and 6 less relevant. In the second set, we identified 15 topic clusters: 11 relevant to menopause and 4 less relevant. Queries about hormones were pervasive within both WAG models. Many of the queries reflected low literacy levels and/or feelings of embarrassment.
Conclusions
We modeled menopause-related queries posed by ChaCha® users between 2009 and 2012. ChaCha® data may be used on its own or in combination with other Big Data sources to identify patient-driven educational needs and create patient-centered interventions.
Keywords: Menopause, hot flashes, symptoms, hormone therapy, data mining, information science
INTRODUCTION
Historically, identifying menopause-related health information needs was done through large surveys or small sample qualitative studies. Predetermined survey questions can be prohibitive when trying to obtain women’s own perspectives and this problem can be overcome with open-ended interviews in qualitative studies. However, qualitative studies are also typically limited by geographic region or other sociodemographic characteristics of participants, and qualitative data collection and analyses are time intensive.
A modern alternative to both methodologies is available. With the advent of technologies that allow users to ask questions at the push of a few keys or buttons, there are now large datasets available that consist of an organic set of user-derived queries about a variety of topics, including health-related information. Mining these large datasets, or Big Data, to guide research and population-based health interventions is a priority of the National Institutes of Health Big Data to Knowledge (BD2K) initiative.1
There are many sources of Big Data that can be mined for health information needs and these have been both touted and criticized.2 Electronic health records, particularly from large health care systems, provide the most focused health information from a large pool of demographically and geographically diverse persons. However, problems with integration of information across computer systems and missing and/or inaccurate documentation have been noted.3 Search engine queries and social media data have demonstrated value in predicting a variety of health issues, such as influenza, cigarette smoking, depression, and suicide.4, 5 However, the sheer number of different social media platforms, their lack of integration, and the fact that subpopulations of users (e.g., old vs. young, male vs. female) have different preferences for different platforms can limit generalizability and/or complicate analyses.6–8 Each source of Big Data merits investigation to determine its relevance and usefulness for specific purposes, such as identifying menopause-related health information needs.
In 2015, our lab became the only research team in the world with access to raw data from ChaCha® for the years 2009 to 2012.9 ChaCha® is a United States-based company. Users are able to anonymously submit questions and receive a human-guided, real-time, anonymous, and verified answer. During 2009 to 2012, users submitted their questions via texts or the Web. After several queries from the same user, ChaCha® asked the user to provide their age, gender, and zip code, but the information was not required. ChaCha® is not a social media platform and its anonymity appears to allow users to ask unusual and potentially embarrassing or stigmatizing questions.10
The purpose of this study was to identify salient queries about menopause using Big Data from ChaCha®. We were interested in determining the extent to which this source might provide value for identifying menopause-related health information needs. Specific aims were to (1) create Word Adjacency Graph (WAG) models to detect clusters of menopause-related queries and (2) visualize the types of menopause-related topics included in queries. We evaluated these specific aims within the entire dataset (users aged 13+) as well as within a subset of queries from midlife women aged 40 to 62. We selected the age range 40 to 62 since it was the same range used by a research network in clinical trials testing menopausal symptom management therapies.11
METHODS
This was a study of existing, de-identified data, which did not require institutional review board approval because it did not meet definitions for human subjects research. The ChaCha® database contained 1.93 billion complete questions asked between January 2009 and November 2012 by 19.3 million users.10 Of these, 7.94 million or 40% reported age (68% were younger than age 20), 8.87 million or 46% reported gender (49% male, 51% female), and 5.11 million or 26% reported location (99.5% from the United States).10
A menopause-related subset was identified by searching for questions containing a set of key words. We started with an initial set of key words that was expanded based on our initial review of the data and variations in spelling of terms. The final set of key words we used can be grouped into 5 broad topics: (1) menopause/menopausal status (midlife, menopause, menopausal, climacteric, amenorrhea, premenopausal, pre-menopausal, premenopause, pre-menopause, perimenopausal, peri-menopausal, perimenopause, peri-menopause, postmenopausal, post-menopausal, postmenopause, post-menopause); (2) hot flash symptoms (hot flash, hot flashes, hot flush, hot flushes, power surges, always hot, night sweats); (3) medications (Actonel, bisphosphonates, estrogen, Evista, Fosamax, hormone, hormone replacement, hormone replacement therapy, hormones, HRT, oestrogen, phytoestrogens, progesterone, progestin, Prolia, Reclast, SERM, Tamoxifen); (4) surgery (hysterectomy, oophorectomy); and (5) specialty provider (OBGYN, OB-GYN, gynecologist). Our searches using these key words resulted in 263,363 menopause-related queries in total (user age 13+) and 5,892 queries from women aged 40 to 62.
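The keyword search described above can be sketched as follows. This is a minimal illustration, not the authors' actual search procedure: the keyword list is only a subset of the terms named in the text, the function names are hypothetical, and whole-word, case-insensitive matching is an assumption (the source does not state how matching was performed).

```python
import re

# Illustrative subset of the key words listed in the text (not the full list).
KEYWORDS = [
    "menopause", "menopausal", "climacteric", "hot flash", "hot flashes",
    "hot flush", "night sweats", "hormone", "hormones", "estrogen", "HRT",
    "hysterectomy", "gynecologist", "OBGYN", "OB-GYN",
]


def build_pattern(keywords):
    """Compile one case-insensitive regex matching any keyword as a whole word or phrase."""
    # Try longer alternatives first so multi-word phrases win over their prefixes.
    alternatives = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    # Assumption: a keyword must not be embedded inside a longer word.
    return re.compile(r"(?<!\w)(?:" + "|".join(alternatives) + r")(?!\w)", re.IGNORECASE)


def filter_queries(queries, pattern):
    """Return the queries that contain at least one keyword."""
    return [q for q in queries if pattern.search(q)]


PATTERN = build_pattern(KEYWORDS)

# Made-up example queries, for illustration only.
sample = [
    "What causes hot flashes at night?",
    "What is the capital of Ohio?",
    "Can you die going through menopause",
    "Is a premenopausal woman still fertile?",
]
matched = filter_queries(sample, PATTERN)
```

Under whole-word matching, the last sample query is not selected because "premenopausal" is not in the illustrative list, which is one reason a search of this kind needs each spelling variant listed explicitly.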
We used Word Adjacency Graph (WAG) modeling to view common words from the queries in relation to one another. WAG modeling, as described in detail by Miller and colleagues12, organizes data by forming pairs of words that are in direct sequence. Word pairs form clusters, and a visual layout of the resulting network graph often reveals several self-organized, self-describing subsets of words. Visual layout of the graph model was achieved using Gephi, an open source graph modeling tool.13
Each query was split into token words (individual words in a sentence), and stop words (e.g., a, the, and) were removed. Remaining words were considered in sequence to build a summary table of words and phrases of varying lengths (two-word phrases, three-word phrases, etc.). Two-word phrases occurring at least 10 times were used to build a network graph model. We chose a minimum of 10 occurrences as a minimally important signal.
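The tokenizing, stop word removal, and phrase counting steps above can be sketched as follows. This is a simplified illustration under stated assumptions: the stop word list, the tokenizing rule, and all function names are hypothetical, and only the threshold of 10 occurrences comes from the text.

```python
import re
from collections import Counter

# Hypothetical stop word list; the source gives only "a, the, and" as examples.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "at", "is",
              "i", "do", "what", "how", "can", "you"}


def tokenize(query):
    """Lowercase a query and split it into word tokens (assumed rule)."""
    return re.findall(r"[a-z0-9']+", query.lower())


def content_words(query, stop_words=STOP_WORDS):
    """Tokens that remain after stop words are removed, in original order."""
    return [t for t in tokenize(query) if t not in stop_words]


def count_phrases(queries, n):
    """Count n-word phrases formed from consecutive remaining words."""
    counts = Counter()
    for q in queries:
        words = content_words(q)
        counts.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts


def frequent_phrases(counts, min_count=10):
    """Keep phrases that occur at least min_count times (10 in the text)."""
    return {p: c for p, c in counts.items() if c >= min_count}


# Made-up queries, repeated to simulate frequency.
queries = ["What causes hot flashes?"] * 12 + ["How to stop hot flashes at night"] * 3
bigrams = count_phrases(queries, 2)
trigrams = count_phrases(queries, 3)
frequent = frequent_phrases(bigrams)
```

In this toy input, only the pairs ("causes", "hot") and ("hot", "flashes") reach the threshold; pairs seen three times, such as ("stop", "hot"), are dropped before any graph is built.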
The graph model was refined by performing partitioning, then reviewing resulting subsections of the graph manually for appropriateness. There were no irrelevant topics to remove but some were deemed of lesser salience than others as described in results below.
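Building the graph from frequent word pairs and splitting it into groups for manual review can be sketched as follows. The source does not name the partitioning algorithm, so this sketch substitutes connected components, a much cruder grouping than the modularity-based partitioning available in a tool such as Gephi. The pair counts and function names are hypothetical.

```python
from collections import defaultdict


def build_graph(pair_counts):
    """Undirected weighted adjacency: graph[u][v] = times u and v were adjacent."""
    graph = defaultdict(dict)
    for (u, v), count in pair_counts.items():
        if u == v:
            continue  # ignore a word paired with itself
        graph[u][v] = graph[u].get(v, 0) + count
        graph[v][u] = graph[v].get(u, 0) + count
    return dict(graph)


def connected_components(graph):
    """Group words reachable from one another (stand-in for real partitioning)."""
    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            members.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(members)
    return clusters


# Hypothetical counts for word pairs that passed the frequency threshold.
pair_counts = {
    ("hot", "flashes"): 15,
    ("flashes", "night"): 11,
    ("night", "sweats"): 20,
    ("sex", "drive"): 12,
    ("pregnancy", "test"): 14,
}
graph = build_graph(pair_counts)
clusters = connected_components(graph)
```

Each resulting group would then be reviewed by hand for relevance, as the text describes. A real WAG model is far denser, which is why a modularity-style method is needed to separate topics that share connecting words.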
RESULTS
As shown in Table 1, we identified a subset of 263,363 queries about menopause-related topics in the overall dataset (263K subset). The majority of those queries were from individuals aged 13 to 29 and from females. When concentrating only on queries from women of menopausal age (40–62 years), we found 5,892 queries about menopause-related topics, mostly from those in their 40s (5.8K subset).
Table 1.
Description of Those Submitting Menopause-Related Queries
Number total queries | 263,364 (a) | 5,892 (b) |
---|---|---|
Queries by age (N, %) | | |
13–19 years | 150,728 (57%) | |
20–29 years | 40,975 (15%) | |
30–39 years | 10,241 (4%) | |
40–49 years | 6,634 (3%) | 4,933 (84%) |
50–59 years | 1,356 (< 1%) | 902 (15%) |
60–69 years | 702 (< 1%) | 41 (1%) |
70+ years | 1,135 (< 1%) | |
Not reported | 51,772 (20%) | |
Queries by gender (N, %) | | |
Female | 148,669 (56%) | 5,892 (100%) |
Male | 66,585 (25%) | |
Not reported | 48,110 (18%) | |
(a) All users who submitted menopause-related queries
(b) Subset of females aged 40–62 years who submitted menopause-related queries
In the larger 263K subset, the WAG modeling process revealed 12 topic clusters (see Table 2). There were 6 topics considered highly relevant to menopause and 6 less relevant. As noted in the example questions, some of the topics considered relevant did contain some irrelevant questions (e.g., query about hormones in Table 2). The WAG modeling process placed the terms hormones and estrogen into separate clusters because of the literal difference in key words and because these constituted non-overlapping queries.
Table 2.
Word Adjacency Graph Modeling Results: Topic Clusters, Descriptions, and Sample Questions from the 263K Subset of User Queries
Name | Description: Questions about… | Sample Questions |
---|---|---|
Highly relevant to menopause | | |
Hot flashes | What is the nature and cause of hot flashes, how to stop ‘hot flashes’ | |
Menopausal symptoms | Type and duration of ‘menopausal symptoms’ | |
Menses | Factors affecting start and stop of menses and effects of menstrual and menopausal hormone changes | |
Hysterectomy | A hysterectomy and its effects | |
Heart rate | The relationship of heart rate to hormones, thyroid function, anxiety, and blood sugar | |
Sex drive | What hormones affect sex drive and factors that increase or decrease sex drives | |
Less relevant to menopause | | |
Hormones | The nature and function of hormones in the body | |
Hormones and body proportion | Effects of hormones on the growth, size, and nature of breasts and other body parts | |
Estrogen | What factors increase, decrease or block estrogen and estrogen in men | |
Ortho Tri-Cyclen | The medication Ortho Tri-Cyclen and Ortho Tri-Cyclen Lo. | |
Pregnancy test | What a pregnancy test measures, when to take one, and what might affect the results | |
Gynecologist | Where to find a gynecologist including free clinics, at what age girls and women should have their first visit, and what a gynecologist can see on exam. | |
In the smaller 5.8K subset of midlife women, the WAG modeling process revealed 15 topic clusters (Table 3). There were 11 topics considered relevant to menopause and 4 less relevant. As noted in the example questions, the relevant topics contained questions that were more specific to menopause than the queries in the 263K subset. The WAG modeling process identified hormone replacement and estrogen as separate topic clusters because of the difference in query words. Similarly, queries containing the phrase ‘hot flashes’ were part of a different topic cluster than queries containing ‘night sweats and other menopausal symptoms’ despite their related meaning.
Table 3.
Word Adjacency Graph Modeling Results: Topic Clusters, Descriptions, and Sample Questions from the 5.8K Subset of User Queries
Name | Description: Questions about… | Sample Questions |
---|---|---|
Highly relevant to menopause | | |
Onset of menopause | Age of onset for menopause | |
Menopause vs. pregnancy | Differentiating pregnancy and menopause | |
Menopausal ‘symptoms’ | Nature, causes, type, and treatment of ‘menopausal symptoms’ | |
Hot flashes and night sweats | Nature, causes, duration, and treatment for ‘hot flashes and night sweats’ | |
Hot flash | Nature, causes and duration of a ‘hot flash’ | |
Sex drive | A woman’s sex drive and activity at menopause | |
Emotional symptoms | Mood, sadness, feelings, cortisol | |
Heart, skin | Heart palpitations and skin problems at menopause | |
Hormone therapy | Risks, benefits, and effects of hormone therapy | |
Hysterectomy | Nature, risks, consequences, post-operative recovery after hysterectomy | |
Provider and prescriptions | Provider role and location, prescriptions | |
Less relevant to menopause | | |
Hormones | Function of hormones in the body, Foods and other sources of hormones, Human growth hormone | |
Birth control | Nature and risks of birth control pills | |
Menstrual bleeding | Questions about menstrual bleeding in general and not limited to menopause | |
Midlife crisis | Definition, cause, impact in women and men | |
The WAG modeling revealed the pervasive nature of questions related to hormones in both subsets of queries. The two hormone-related topic clusters in the 263K sample had an extensive network of queries, with long fingers reaching across and down the model (see Figure 1). This meant that questions about hormones were related to many other topics. Similar findings were seen for the hormone-related topic clusters in the 5.8K sample. As shown in Figure 2, most of the lower half of the WAG model was related to hormones and those topics had long networks of connection to other topics.
Figure 1. Word Adjacency Graph (WAG) Model for the 263K Menopause-Related Queries.
WAG modeling shows words and word pairs chaining together in clusters, which are shown in different colors. The size of each circle represents its betweenness centrality, that is, how central it is to the overall WAG model.
Figure 2. Word Adjacency Graph (WAG) Model for the 5.8K Subset of Menopause-Related Queries from Women aged 40 to 62 Years.
WAG modeling shows words and word pairs chaining together in clusters, which are shown in different colors. The size of each circle represents its betweenness centrality, that is, how central it is to the overall WAG model.
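Betweenness centrality, which sets the circle sizes in both figures, counts how often a word lies on the shortest paths between other words. A minimal sketch using Brandes' algorithm is shown below. It assumes an unweighted, undirected graph and unnormalized scores; the source does not state the settings used in Gephi, and the example graph is made up.

```python
from collections import deque


def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph.

    graph maps each node to an iterable of neighbors; every neighbor
    must also appear as a key. Scores are unnormalized.
    """
    centrality = {v: 0.0 for v in graph}
    for source in graph:
        # Breadth-first search, counting shortest paths from source.
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths to v
        dist = {v: -1 for v in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to source.
        delta = {v: 0.0 for v in graph}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                centrality[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2 for v, c in centrality.items()}


# Made-up star graph: "hormone" links three otherwise unconnected words.
graph = {
    "hormone": ["estrogen", "thyroid", "growth"],
    "estrogen": ["hormone"],
    "thyroid": ["hormone"],
    "growth": ["hormone"],
}
scores = betweenness_centrality(graph)
```

In this toy graph, "hormone" sits on the shortest path between all three pairs of other words and scores 3.0, while the outer words score 0.0. This is the pattern behind a hub word appearing as a large circle with long chains of connection.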
The nature of the queries suggested users were more comfortable asking basic questions reflective of low health literacy levels and/or questions that were embarrassing to the users. Some of these low literacy and embarrassing questions are provided in Table 4.
Table 4.
Potentially Embarrassing or Stigmatizing Queries from Women Aged 40 to 62
Age | Query Text |
---|---|
40 | I have had a HYSTERECTOMY years ago and now I am having severe sweats acne and odor could this be a hormonal imbalance |
40 | How often do I need to douche after a hysterectomy? |
40 | What is an obgyn? |
40 | How much time should you wait to have oral sex after hysterectomy |
40 | I am going thru MENOPAUSE and taking PREMARIN could it cause my nipples to itch? |
42 | Can you have anal sex after a hysterectomy |
43 | Should a girl shave her vagina before going to an obgyn appointment? |
44 | How long does it usually take to have a bowel movement after an abdominal hysterectomy? |
44 | When you have sex with your clothes on but you still like have those hormones and everything freaking out can you still get pregnant? |
45 | Is it normal 4 my breasts 2 fill with milk aftr hysterectomy? |
46 | Can teeth or gums burn with menopause? |
47 | What is a Gynecologist |
47 | Will sex feel different after getting a hysterectomy? |
48 | Can you die going through menopause |
48 | Do doctors test you for marijuana when you go in to an obgyn for a pap smear? |
49 | How can I make appointments with my gynecologist less uncomfortable? I just hate talking about these personal things! |
50 | Does having sex with a man taking heart meds make a woman go into menopause early? |
50 | I'm going through menopause, can I get pregnant w/out protection? |
52 | Does smoking weed add extra hormones to you? |
53 | What kind of doctor treats hormone imbalance? |
54 | Can you still get wet in your vagina after menopause |
54 | I know this is a gross Q. But if a woman who has had a total hysterectomy has a perforated bowel could the waste come out of her vagina as well as the rectum? |
DISCUSSION
Our analyses show the value of Big Data from ChaCha® for understanding the public’s needs for menopause-related information. We were able to quickly and efficiently organize a large amount of textual data and analyze its relevance for understanding the public’s information needs regarding menopause. We found that an age- and gender-specific query (5.8K subset) produced more specific information for the topic of menopause within the large ChaCha® dataset, but that menopause queries were not limited to individuals at midlife. We can speculate several reasons for menopause queries from younger individuals and males including seeking information for someone else (e.g., mother, wife), experiencing non-menopause related hot flashes (e.g., due to systemic disease),14 and the pervasive nature of questions about hormones.
Although quick and efficient, the WAG modeling process yielded separate clusters of topics with similar meaning rather than a meta-cluster due to differences in words used within the queries. During the computing process, the literal sorting that resulted from the WAG modeling precluded sorting by meaning.12 The WAG graphs showed topical proximity of similar-meaning words and phrases such as “hot flashes” and “hot flashes and night sweats” but did not group these together into a meta-cluster of “vasomotor symptoms” because these words and phrases were parts of non-overlapping queries. Our results stand in contrast to a more traditional qualitative analysis, which would have grouped these queries together under a common theme. With the WAG modeling, we identified three different subsets of queries, each with a slightly different context. These subtle differences could be further evaluated using other Big Data analytic tools such as text mining to more carefully distinguish their differences and/or similarities.
Although the menopause-related topic clusters were not surprising, the number and specificity of user-generated queries provide new insights. From the sheer numbers of questions, there appears to be a need for reliable information that is readily available to the public. Such information should also address the needs of users who were asking embarrassing or potentially stigmatizing questions. Providing information using words and phrases from the queries may be more meaningful to health care consumers than medical jargon. For example, providing information about ‘sex drive’ may be more meaningful to consumers than addressing ‘libido’ or ‘sexual function’. The user-generated words and phrases we uncovered could be used to open discussions during clinical encounters and/or incorporated into patient educational materials, decision aids, and self-management interventions. In addition, the pervasive nature of hormone-related queries in both subsets may reflect greater awareness of the risks and benefits of hormone replacement therapy that have come to light within the past 15 years. Hormone has become a household word and the public may be more attuned to worries about hormones.
Our findings suggest that consideration should be given to using ChaCha® menopause queries in combination with other data sources in the future. Several studies have reported on using combined Big Data sources. For example, one study evaluated differences in obesity related posts and queries across social media platforms and online video and blog data sources in order to inform public health obesity prevention campaigns.15 Another study used social media, online video, and blog data to increase accuracy of influenza-like illness predictions.16 Combining data from multiple sources can provide a more holistic picture of the public’s needs for health information to build more comprehensive interventions. Our findings suggest ChaCha® contains enough data on menopause-related queries to be used in the future.
Study results should be considered in light of some study limitations. The gender and ages associated with the user-generated queries are self-reported by anonymous users and some opted not to report this information. More of our 263K sample provided gender and age information compared to those in the total dataset (80% versus 40%–56%). Our subgroup analysis was limited to individuals with a known gender and age and therefore may not be fully representative of all midlife women who submitted queries to ChaCha®. In addition, some demographic information may have been reported inaccurately by users. Some user-generated queries may have been made on behalf of another person and therefore may not fully represent the user’s views. These analyses are purposefully based only on one source of Big Data without triangulation to other data sources.
CONCLUSION
In conclusion, Big Data mining of the ChaCha® question and answer service revealed a subset of menopause-specific queries. The WAG modeling topics that emerged can be used as a guide for the types of information contained in this dataset. This dataset could be further mined in greater depth, alone or in combination with other Big Data sources, to inform clinical practice and development of educational materials, decision aids, or other self-management interventions.
Acknowledgments
Sources of support: This paper was made possible with the support of the IU School of Nursing Social Health Network Research Laboratory. One author of this publication was supported by the National Institute of Nursing Research of the National Institutes of Health under Award Number T32NR007066. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Footnotes
Conflict of interest: None.
Disclaimers: None.
REFERENCES
- 1. [cited 2016]; Data science at NIH: BD2K funds biomedical data science training: National Institutes of Health. 2016. Available from: https://datascience.nih.gov/bd2k.
- 2. Hansen MM, Miron-Shatz T, Lau AY, Paton C. Big Data in Science and Healthcare: A Review of Recent Literature and Perspectives. Contribution of the IMIA Social Media Working Group. Yearb Med Inform. 2014;9:21–26. doi: 10.15265/IY-2014-0004.
- 3. Luo J, Wu M, Gopukumar D, Zhao Y. Big Data Application in Biomedical Research and Health Care: A Literature Review. Biomed Inform Insights. 2016;8:1–10. doi: 10.4137/BII.S31559.
- 4. Woo H, Cho Y, Shim E, Lee JK, Lee CG, Kim SH. Estimating Influenza Outbreaks Using Both Search Engine Query Data and Social Media Data in South Korea. Journal of Medical Internet Research. 2016;18:e177. doi: 10.2196/jmir.4955.
- 5. Block M, Stern DB, Raman K, Lee S, Carey J, Humphreys AA, Mulhern F, Calder B, Schultz D, Rudick CN, Blood AJ, Breiter HC. The relationship between self-report of depression and media usage. Front Hum Neurosci. 2014;8:712. doi: 10.3389/fnhum.2014.00712.
- 6. Alshaikh F, Ramzan F, Rawaf S, Majeed A. Social network sites as a mode to collect health data: a systematic review. Journal of Medical Internet Research. 2014;16:e171. doi: 10.2196/jmir.3050.
- 7. Leng HK. Methodological issues in using data from social networking sites. Cyberpsychol Behav Soc Netw. 2013;16:686–689. doi: 10.1089/cyber.2012.0355.
- 8. Afsar B. The relation between Internet and social media use and the demographic and clinical parameters, quality of life, depression, cognitive function and sleep quality in hemodialysis patients: social media and hemodialysis. General hospital psychiatry. 2013;35:625–630. doi: 10.1016/j.genhosppsych.2013.05.001.
- 9. ChaCha Search. [cited 2016]; ChaCha. 2016. Available from: http://www.chacha.com/
- 10. Priest C, Knopf A, Groves D, Carpenter JS, Furrey C, Krishnan A, Miller WR, Otte JL, Palakal M, Wiehe S, Wilson J. Finding the Patient's Voice Using Big Data: Analysis of Users' Health-Related Concerns in the ChaCha Question-and-Answer Service (2009–2012). Journal of Medical Internet Research. 2016;18:e44. doi: 10.2196/jmir.5033.
- 11. Newton KM, Carpenter JS, Guthrie KA, Anderson GL, Caan B, Cohen LS, Ensrud KE, Freeman EW, Joffe H, Sternfeld B, Reed SD, Sherman S, Sammel MD, Kroenke K, Larson JC, Lacroix AZ. Methods for the design of vasomotor symptom trials: the Menopausal Strategies: Finding Lasting Answers to Symptoms and Health network. Menopause. 2014;21:45–58. doi: 10.1097/GME.0b013e31829337a4.
- 12. Miller WR, Groves D, Knopf A, Otte JL, Silverman R. The patient voice in big data: Word adjacency graph modeling of epilepsy-related ChaCha data. Western Journal of Nursing Research. doi: 10.1177/0193945916670363. in press.
- 13. [cited 2016 July 15]; Gephi makes graph handy. 2016 updated 2016. Available from: https://gephi.org/
- 14. Mohyi D, Tabassi K, Simon J. Differential diagnosis of hot flashes. Maturitas. 1997;27:203–214. doi: 10.1016/s0378-5122(97)83974-6.
- 15. Chou WY, Prestin A, Kunath S. Obesity in social media: a mixed methods analysis. Transl Behav Med. 2014;4:314–323. doi: 10.1007/s13142-014-0256-1.
- 16. Santillana M, Nguyen AT, Dredze M, Paul MJ, Nsoesie EO, Brownstein JS. Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance. PLoS Comput Biol. 2015;11:e1004513. doi: 10.1371/journal.pcbi.1004513.