Abstract
This paper explores methods to compare concept spaces derived from different discourses in a common health domain. The concept spaces are generated from the research literature and from message board discussions on the Internet. We explore a number of methods for comparing and contrasting concept space pairs. We experiment with five select health domains in this exploratory research: Autism, AIDS, Fibromyalgia, Irritable Bowel Syndrome and Multiple Sclerosis. The paper concludes with a discussion about the potential of our methods. Future work on refinements to our techniques is also outlined.
Keywords: text mining, concept spaces, message boards
1. INTRODUCTION
An Internet phenomenon that has a growing impact on health care is that of chat rooms, message boards and blogs. These electronic media allow individuals with shared interests to form online communities that overcome barriers such as time and space. In addition to health information and medical referrals, online groups are recognized for providing emotional support, guidance, and promoting self-education and responsibility for one’s health. Available for nearly all medical conditions, some are even known to initiate clinical research while others have established tissue banks and registries [1]. For rare health problems, online groups serve to establish critical mass, an aspect that can also be of value to medical researchers. As Ferguson and Frydman [2004] state, “we are witnessing the most important technocultural medical revolution of the past century” [2]. We believe that we are, as a species, accumulating a large and growing number of observations and inquiries, at the individual level, regarding health-related problems of almost every kind. These electronic recordings are unprecedented and offer a wealth of information that can be mined not only to guide patient care but also to suggest plausible directions in medical research and consumer health education.
Viewed differently, this patient-initiated, mainly experiential dialog in health care occurs parallel to the dialog that occurs in biomedical research via publications. From the perspective of information retrieval and text mining, there is an opportunity to contribute a suite of text analysis and mining tools that may jointly serve both the patient and the medical research communities. This opportunity to strengthen the links between health care consumers and care givers and researchers motivates our current research in which we explore automatic methods for comparing message board communications of patients and their families with the more formal, peer-reviewed communications within the research literature.
There are obvious advantages to be gained by being able to compare and connect patient communications with communications in the professional literature. For instance, differences may be revealing. A topic that is given high importance in one discourse and not in the other may be very telling. If patients are not emphasizing the underlying aspect sufficiently then perhaps there is a need for patient education targeting that topic. If the balance weighs in the other direction, then possibly this provides the motivation for new research. On the other hand, a topic that appears equally important in both discourses may be, in an abstract sense, reassuring both to the patient group as well as to the health care profession. Thus our goal is to explore methods for comparing patient and professional communications. We illustrate our general approach with five select health domains: AIDS, Autism, Fibromyalgia (Fib), Irritable Bowel Syndrome (IBS) and Multiple Sclerosis (MS).
2. METHODS
Our aim is to compare concept spaces. We view a concept space as a weighted, undirected graph with concepts at the nodes and edges representing their connections. We extract one space from the message board and another from the professional literature for a given health domain. Both are represented using a common concept vocabulary.
Given a domain of interest such as Autism, there are three steps. (1) Collect PubMed documents representing the research discourse and collect postings from appropriate message boards representing the patient discourse. (2) Extract concepts from each collection. (3) Calculate the frequency of each concept within a space and the co-occurrence frequency of each concept pair (assessed at the sentence level). These data are stored in a relational database and then analyzed.
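Step (3) can be sketched as follows. This is a minimal illustration only, assuming the concepts have already been extracted and grouped by document and sentence; the function name `count_concepts` and the toy data are hypothetical and not part of the original pipeline, which used a Perl program and a relational database.

```python
from collections import Counter
from itertools import combinations


def count_concepts(documents):
    """Count concept frequencies and sentence-level co-occurrences.

    documents: list of documents; each document is a list of sentences;
    each sentence is a list of concept names (assumed already extracted).
    """
    concept_freq = Counter()
    pair_freq = Counter()
    for sentences in documents:
        for concepts in sentences:
            concept_freq.update(concepts)
            # Each unordered pair of distinct concepts in a sentence
            # counts as one co-occurrence.
            for a, b in combinations(sorted(set(concepts)), 2):
                pair_freq[(a, b)] += 1
    return concept_freq, pair_freq


# Toy input: two documents, three sentences in total.
docs = [
    [["Autism", "Diet"], ["Diet", "Vitamins", "Autism"]],
    [["Autism", "Vaccines"]],
]
concept_freq, pair_freq = count_concepts(docs)
print(concept_freq.most_common())
print(pair_freq.most_common())
```

Storing pairs as sorted tuples keeps the edges undirected, which matches the view of a concept space as an undirected graph.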
2.1 Obtaining the Document Collections
PubMed queries executed via Entrez [3] were constructed using appropriate MeSH terms. We retrieved records matching the time span of the message board collection. For example, for Autism we searched: “Autistic disorder”[MAJR] AND hasabstract[text] AND English[Lang] AND (“human”[MeSH Terms] OR “hominidae”[MeSH Terms]) AND “2001/11/14 23.12”[EDAT] : “2004/11/14 23.12”[EDAT]. Apart from substituting the disease MeSH term, no significant adjustments were made for the other queries. For the message board, we picked http://www.healthboards.com/ because it is moderated, crawlable, active (it has run from the year 2000 until now and has a substantial user base) and ranks high when searching “disease-X message board” on Google.
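Since the queries differ only in the disease MeSH term and the date window, they can be generated from a single template. The sketch below only builds the query string with straight quotes; it does not contact Entrez, and the helper name `build_pubmed_query` is a hypothetical one introduced for illustration.

```python
def build_pubmed_query(mesh_major_term, start, end):
    """Build a query string following the Autism example in the text.

    mesh_major_term: disease MeSH term used as a major topic ([MAJR]).
    start, end: Entrez date strings bounding the [EDAT] range.
    """
    return (
        f'"{mesh_major_term}"[MAJR] AND hasabstract[text] AND English[Lang] '
        f'AND ("human"[MeSH Terms] OR "hominidae"[MeSH Terms]) '
        f'AND "{start}"[EDAT] : "{end}"[EDAT]'
    )


autism_query = build_pubmed_query(
    "Autistic disorder", "2001/11/14 23.12", "2004/11/14 23.12"
)
print(autism_query)
```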
A message board document is the full thread of a discussion including the original post on the topic and all responses. So message board document sizes vary far more than PubMed records with only the titles and abstracts for individual articles. Message board documents on Autism for example, contain on average 6.31 posts per thread (standard deviation 6.38). A summary of the collections we built is in Table 1.
Table 1.
Dataset Details. ST: number of semantic types; Types: count of unique concept terms; Tokens: count of all concept terms.
Collection | Number of Docs | Concept Types | Concept Tokens | Tokens/Type | ST |
---|---|---|---|---|---|
PubMed (PM) | | | | | |
Autism | 998 | 4077 | 35686 | 8.75 | 125 |
AIDS | 1341 | 6455 | 47358 | 7.34 | 123 |
Fib | 705 | 3894 | 26897 | 6.91 | 117 |
IBS | 1026 | 5125 | 43768 | 8.54 | 116 |
MS | 1383 | 5544 | 46576 | 8.40 | 123 |
Message Board (MB) | | | | | |
Autism | 847 | 5750 | 78167 | 13.59 | 127 |
AIDS | 2298 | 6207 | 158788 | 25.58 | 125 |
Fib | 2874 | 11671 | 334602 | 28.67 | 129 |
IBS | 1450 | 6424 | 106142 | 16.52 | 113 |
MS | 1268 | 6860 | 102853 | 14.99 | 120 |
2.2 Extracting Concepts Using MetaMap
The UMLS Metathesaurus [4] contains information about biomedical and health related concepts, their various names, and the relationships among them. Metathesaurus concepts are assigned to at least one of 135 semantic types. For example, Jaundice is assigned the types Sign or Symptom and Pathologic Function. MetaMap [5] is a program developed at the NLM to discover Metathesaurus concepts from free-text. Because MetaMap employs general-purpose NLP tools, and because its source vocabularies serve as index terms for a wide range of electronic health resources of interest to lay people (professional literature, lay literature, clinical records) [6], we view MetaMap as suitable for identifying concepts in message board discussions and in PubMed records. Each dataset was submitted to MetaMap. A Perl program cleaned the results returned; extracted the sentences, document IDs, concept terms, and semantic types; and calculated statistics of occurrences for concepts and concept pairs. Note we identify a link between two concepts if and only if they co-occur in a sentence of a document.
2.3 Concept Space Comparisons
We compare concept spaces in several ways as detailed next.
2.3.1 Concepts
First, we measure the extent to which a pair of spaces shares concepts. We use Dice’s similarity measure for this:
$$\mathrm{Dice}(C_1, C_2) = \frac{2\,|C_1 \cap C_2|}{|C_1| + |C_2|}$$
Here C is the set of concepts in a space and |C| represents the size of this set. $C_1 \cap C_2$ (C1 AND C2) represents the set intersection between spaces C1 and C2. This measure generates scores in the interval [0,1]. If C1 and C2 have identical concepts, the score is 1, and it is 0 if they share no concepts. We also compare the PubMed and message board discourses by focusing on subspaces defined by UMLS semantic types. Since not all of the 135 types are equally important (or even present) in particular concept spaces, we focus on those that are important for a health domain. In particular, for a given domain such as Autism, we rank semantic types by concept frequency and focus our analysis on the S top ranked (we set S = 20) semantic types. Note that this could just as easily be done for all semantic types.
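The measure and its restriction to semantic-type subspaces can be sketched as below. This is an illustration under stated assumptions: Dice is taken in its standard form (twice the intersection size over the sum of the set sizes), a semantic type's rank is assumed to come from the summed frequency of its concepts, and the names `dice`, `top_types`, `subspace_dice` and the toy data are hypothetical.

```python
from collections import Counter


def dice(c1, c2):
    """Dice similarity between two sets; 0.0 if both are empty."""
    c1, c2 = set(c1), set(c2)
    if not c1 and not c2:
        return 0.0
    return 2 * len(c1 & c2) / (len(c1) + len(c2))


def top_types(freq, types_of, s=20):
    """Return the s semantic types with the highest summed concept frequency."""
    totals = Counter()
    for concept, n in freq.items():
        for st in types_of[concept]:
            totals[st] += n
    return {st for st, _ in totals.most_common(s)}


def subspace_dice(freq1, freq2, types_of, s=20):
    """Dice score within each semantic type in the top s of both spaces."""
    shared = top_types(freq1, types_of, s) & top_types(freq2, types_of, s)
    scores = {}
    for st in shared:
        sub1 = {c for c in freq1 if st in types_of[c]}
        sub2 = {c for c in freq2 if st in types_of[c]}
        scores[st] = dice(sub1, sub2)
    return scores


# Toy data: concept -> semantic types, and concept frequencies per space.
types_of = {
    "Pain": {"Sign or Symptom"},
    "Fatigue": {"Sign or Symptom"},
    "MRI": {"Diagnostic Procedure"},
    "Week": {"Temporal Concept"},
    "Month": {"Temporal Concept"},
}
pm_freq = {"Pain": 5, "MRI": 7, "Week": 2}
mb_freq = {"Pain": 9, "Fatigue": 4, "Week": 3, "Month": 1}

overall = dice(pm_freq, mb_freq)
scores = subspace_dice(pm_freq, mb_freq, types_of, s=2)
print(overall, scores)
```

Types that fall outside the top S of either space are simply absent from the result, which corresponds to an n/a entry in a per-type comparison.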
2.3.2 Concepts+Links
The previous approach focuses on the concepts shared between the two concept spaces. It does not consider concept co-occurrences in the space. Recall that we regard a concept space as a collection of concept nodes and co-occurrence based edges. Thus we next explore these structural or ‘topological’ aspects as well. In essence we apply the same Dice similarity measure as before. However, each space is now represented as a collection of concepts (nodes) and their co-occurrences (edges). That is, if a concept pair co-occurs in a sentence of a document, then the pair is included in the set of features representing that space.
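A minimal sketch of this node-plus-edge comparison follows. It assumes each space is given as a set of concepts and a list of co-occurring pairs, and that edges are undirected; the function names and the toy spaces are hypothetical.

```python
def features(concepts, cooccurring_pairs):
    """Represent a concept space as one feature set of nodes and edges."""
    nodes = {("node", c) for c in concepts}
    # frozenset makes (a, b) and (b, a) the same undirected edge.
    edges = {("edge", frozenset(pair)) for pair in cooccurring_pairs}
    return nodes | edges


def dice(f1, f2):
    """Dice similarity between two feature sets; 0.0 if both are empty."""
    if not f1 and not f2:
        return 0.0
    return 2 * len(f1 & f2) / (len(f1) + len(f2))


pm = features({"IBS", "Pain", "Placebo"}, [("IBS", "Pain"), ("Pain", "Placebo")])
mb = features({"IBS", "Pain", "Stress"}, [("Pain", "IBS"), ("IBS", "Stress")])

score = dice(pm, mb)
print(score)
```

Because the number of possible edges grows much faster than the number of nodes, scores computed this way can be expected to be far lower than concept-only scores.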
2.3.3 Central Topics
We also calculate the centrality of topics in each concept space. We do this by ranking the concepts by IDF (= log(N/n+1)). Here N is the total number of documents in a discourse while n is the number of documents in which the particular concept occurs. After a preliminary examination of the resulting central topics, it became clear that many non-informative concepts were highly ranked. Thus we remove concept terms that rank among the top 100 across all the datasets in one discourse (i.e., PubMed or message board). This vocabulary normalization step allows us to remove ‘stop words’ in that discourse. For example, STUDY (Scientific Study), which ranks among the top 100 across all five concept spaces in the PubMed discourse, is removed. After this clean-up process we identify the top P ranked terms (we set P = 200) as central topics for the concept space.
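The ranking and stop-concept removal can be sketched as below, under two labeled assumptions. First, concepts are ranked by document frequency n, most frequent first; since IDF decreases as n grows, this gives the same order as ranking by ascending IDF, and it matches the later description of central topics as the most frequent concepts. Second, a stop concept is taken to be one that appears in the top-ranked list of every dataset in the discourse. All names and the toy data are hypothetical.

```python
from collections import Counter


def rank_by_document_frequency(doc_concepts):
    """Rank concepts by the number of documents containing them.

    doc_concepts: list of sets, one set of concepts per document.
    Ties are broken alphabetically so the order is deterministic.
    """
    df = Counter()
    for concepts in doc_concepts:
        df.update(set(concepts))
    return [c for c, _ in sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))]


def central_topics(datasets, p=200, stop_rank=100):
    """Return the top p concepts per dataset after removing stop concepts.

    datasets: dict mapping a health domain to its per-document concept
    sets, all taken from the same discourse.
    """
    rankings = {d: rank_by_document_frequency(docs) for d, docs in datasets.items()}
    # Stop concepts: in the top stop_rank of every dataset (assumption).
    stop = set.intersection(*(set(r[:stop_rank]) for r in rankings.values()))
    return {d: [c for c in r if c not in stop][:p] for d, r in rankings.items()}


# Toy discourse with two domains; small parameters for readability.
toy = {
    "IBS": [{"Study", "IBS", "Pain"}, {"Study", "IBS"}, {"Study", "Placebo"}],
    "MS": [{"Study", "MS"}, {"Study", "MS", "MRI"}, {"Study", "Pain"}],
}
central = central_topics(toy, p=2, stop_rank=1)
print(central)
```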
2.3.4 Hot Topics
For each of the top Q central terms in a concept space (we set Q = 200), we compare its rank with its rank in the other discourse’s concept space for the same health domain. If that term does not exist or its rank is substantially lower in the other concept space, we regard it as an interesting term, which we call a ‘hot topic’. We set the ranking difference threshold, D, to 200. For example, if term X ranks number 1 in the PubMed Autism collection but ranks number 202 in the Autism message board collection, then it is a hot topic candidate in PM Autism.
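The selection rule can be sketched as follows. The text does not say whether the rank difference must exceed D or merely reach it; this sketch assumes a difference of at least D qualifies, which is consistent with the rank 1 versus rank 202 example. The function name `hot_topics` and the toy rankings are hypothetical.

```python
def hot_topics(ranking_here, ranking_other, q=200, d=200):
    """Return hot topic candidates among the top q terms of one space.

    ranking_here, ranking_other: lists of concepts, best rank first,
    for the same health domain in the two discourses.
    """
    other_rank = {c: i for i, c in enumerate(ranking_other, start=1)}
    hot = []
    for rank, concept in enumerate(ranking_here[:q], start=1):
        r_other = other_rank.get(concept)
        # Hot if absent from the other space, or ranked at least d
        # positions lower there (assumed reading of the threshold).
        if r_other is None or r_other - rank >= d:
            hot.append(concept)
    return hot


# Toy example with small parameters.
mb = ["Tingling", "MRI", "Pain"]
pm = ["MRI", "Pain", "CNS", "Relapsing", "Tingling"]
candidates = hot_topics(mb, pm, q=2, d=3)
print(candidates)
```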
Each of these approaches involves one or more parameters. We set the reported values by conducting preliminary experiments that are not reported here for space reasons. For example, we explored Q = 50 and 150 and D = 100, 300. We also explored setting P to 100 and 300. In each case manual inspection of results recommended the settings used in this study. In further work we will more systematically explore the parameter space for optimal values.
3. RESULTS
3.1 Concept Based Similarity
Table 2 shows the similarity of each of the 25 concept space pairs involving both discourses. AIDS for example has 6,455 unique concepts in the PubMed discourse and 6,207 in the message board discourse giving a similarity score of 0.3886.
Table 2.
Dice’s Similarity (Concept Based)
MB (rows) / PM (columns) | Autism | AIDS | Fib | IBS | MS |
---|---|---|---|---|---|
Autism | 0.4060 | 0.1527 | 0.1336 | 0.1394 | 0.1351 |
AIDS | 0.1326 | 0.3886 | 0.3247 | 0.3343 | 0.3249 |
Fib | 0.1260 | 0.3178 | 0.3030 | 0.3227 | 0.3076 |
IBS | 0.1211 | 0.3126 | 0.3156 | 0.3682 | 0.2926 |
MS | 0.1377 | 0.3405 | 0.3314 | 0.3386 | 0.3607 |
Thus about 62% of the PubMed concepts and 60% of the message board concepts on AIDS are unique: the Dice score implies roughly 2,460 shared concepts (0.3886 × (6,455 + 6,207) / 2), which is about 38% of the PubMed set and 40% of the message board set. The lowest similarity (within a domain) is for Fibromyalgia (0.3030) and the highest is 0.4060 for Autism. Autism appears the least similar to the other domains while AIDS appears the most similar. Our interest is in the diagonal cells, where the discourses come from the same health domain. In order to determine if these are significant, we compare them with scores for topically ‘unrelated’ pairs of spaces selected appropriately for each health domain. For example, Autism has 4 unrelated pairs involving the PubMed Autism discourse (e.g., PM Autism and MB AIDS, similarity = 0.1326) and another 4 unrelated pairs involving the Autism message board discourse (e.g., MB Autism and PM AIDS, similarity = 0.1527). The average similarity score for these 8 unrelated pairs, along with the variance, is in Table 3. From Tables 2 and 3 we see that the similarity score for each PubMed and message board pair from the same domain is always greater than the corresponding unrelated pairs’ mean plus variance, suggesting statistical significance.
Table 3.
Mean and Variance of Similarities of Unrelated Pairs for each Health Domain. (Derived from Table 2).
Statistic | Autism | AIDS | Fib | IBS | MS |
---|---|---|---|---|---|
Mean | 0.1348 | 0.2800 | 0.2724 | 0.2721 | 0.2761 |
Variance | 0.00009 | 0.0087 | 0.0078 | 0.0079 | 0.0077 |
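The comparison against unrelated pairs can be checked directly from Table 2. With five domains, each diagonal cell has four off-diagonal scores in its row and four in its column, and averaging those eight scores reproduces the Table 3 means. The sketch below is illustrative: the variance estimator used for Table 3 is not stated, so the sample variance from Python's `statistics` module is an assumption and its values need not match the table exactly.

```python
from statistics import mean, variance

domains = ["Autism", "AIDS", "Fib", "IBS", "MS"]

# Table 2: rows are message board (MB) spaces, columns are PubMed (PM) spaces.
table2 = [
    [0.4060, 0.1527, 0.1336, 0.1394, 0.1351],
    [0.1326, 0.3886, 0.3247, 0.3343, 0.3249],
    [0.1260, 0.3178, 0.3030, 0.3227, 0.3076],
    [0.1211, 0.3126, 0.3156, 0.3682, 0.2926],
    [0.1377, 0.3405, 0.3314, 0.3386, 0.3607],
]


def unrelated_scores(i):
    """Scores of the off-diagonal pairs sharing a row or column with domain i."""
    row = [table2[i][j] for j in range(len(domains)) if j != i]
    col = [table2[j][i] for j in range(len(domains)) if j != i]
    return row + col


results = {}
for i, name in enumerate(domains):
    scores = unrelated_scores(i)
    m = mean(scores)
    v = variance(scores)  # sample variance; an assumption, see lead-in
    results[name] = (m, v, table2[i][i] > m + v)
    print(name, round(m, 4), results[name][2])
```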
3.1.1 Concept Based Similarity: Sub Spaces
We first identify the 20 most frequent semantic types in the two discourses for each domain. We then compute similarity using Dice’s measure, independently within each of the semantic types that are in common between the two sets of 20. These scores are shown in Table 4. We see, for example, that Autism is unique in that both discourses emphasize Social Behavior and Spatial Concept. Temporal Concept, Qualitative Concept and Functional Concept have high similarity for all domains, while similarities in Finding are on the lower side and Sign or Symptom is interesting only in a couple of domains. Exploring reasons for these will be the aim of future research.
Table 4.
Similarity between Concept Subspaces. n/a means that the semantic type is not in the Top 20 for both discourses.
Semantic Types | Autism | AIDS | Fib | IBS | MS |
---|---|---|---|---|---|
Age Group | 0.6429 | n/a | n/a | n/a | n/a |
Family Group | 0.5124 | n/a | n/a | n/a | n/a |
Finding | 0.4269 | 0.3463 | 0.2779 | 0.3658 | 0.3852 |
Functional Concept | 0.5983 | 0.6085 | 0.6093 | 0.5887 | 0.6079 |
Idea or Concept | 0.5581 | 0.6009 | 0.4513 | n/a | 0.5227 |
Intellec. Product | 0.4544 | 0.4560 | 0.3801 | 0.4308 | 0.4306 |
Manufactured Object | 0.2719 | 0.3592 | 0.1630 | 0.2556 | 0.2923 |
Mental Process | 0.5472 | 0.5292 | 0.5025 | 0.5275 | 0.4800 |
Mental Dysfunct. | 0.4084 | n/a | n/a | n/a | n/a |
Qualit. Concept | 0.5932 | 0.5802 | 0.5050 | 0.5619 | 0.5464 |
Quant. Concept | 0.5067 | 0.5093 | 0.4943 | 0.4485 | 0.5234 |
Social Behavior | 0.5833 | n/a | n/a | n/a | n/a |
Spatial Concept | 0.4865 | n/a | n/a | n/a | n/a |
Sign or Symptom | n/a | n/a | 0.2391 | 0.2964 | n/a |
Body Part, Organ or Comp. | n/a | n/a | n/a | 0.3496 | 0.3293 |
Temporal Concept | 0.6182 | 0.5879 | 0.5697 | 0.5831 | 0.5901 |
Therap. or Preventive Procedure | 0.4250 | n/a | 0.3032 | 0.3410 | 0.3510 |
Disease or Syndrome | n/a | 0.2301 | 0.2518 | 0.3250 | 0.2786 |
Organic Chemical | n/a | 0.2083 | 0.1301 | 0.1508 | n/a |
Population Group | n/a | 0.5528 | 0.3497 | 0.4151 | n/a |
Virus | n/a | 0.5088 | n/a | n/a | n/a |
Amino acid, or protein | n/a | n/a | n/a | n/a | 0.1474 |
3.2 Concept+Link Based Similarity
Table 5 shows the similarities when both nodes and edges of each concept space are considered. As in the case of the analysis at the concept level (Tables 2 and 3), similarities, although low, are again higher than the mean+variance value for the corresponding set of unrelated pairs.
Table 5.
Dice’s Similarity. Both Concepts and Links (sentence-based co-occurrences) are considered.
Health Domain | Dice Similarity | Unrelated Pairs: Mean | Unrelated Pairs: Variance |
---|---|---|---|
Autism | 0.0265 | 0.0077 | 0.000002 |
AIDS | 0.0169 | 0.0134 | 0.00002 |
Fib | 0.0145 | 0.0139 | 0.000017 |
IBS | 0.0192 | 0.0135 | 0.00001 |
MS | 0.0196 | 0.0144 | 0.00001 |
3.3 Central Topics
Table 6 lists the 11 most frequent concepts for both discourses for IBS and MS. Differences between discourses are interesting even if not totally unexpected. Patients are more concerned about particular symptoms and aspects related to their lives (work, stress). In PubMed we see a greater focus on drugs (such as tegaserod (Zelnorm), Alosetron and Interferon-beta) especially in the context of clinical trials. The message boards emphasize symptoms such as tingling, lesions and pain usually in the context of diagnosis and treatment strategies.
Table 6.
Central Topics for IBS and MS; MB: Message Board; PM: PubMed.
IBS (MB) | IBS (PM) | MS (MB) | MS (PM) |
---|---|---|---|
IBS | IBS | MRI | Multiple sclerosis |
Pain | tegaserod | Avonex | Interferon-beta |
Fiber | Alosetron | Pain | Relapsing |
symptoms | Placebo | Copaxone | Glatiramer acetate |
Time | Pain | Hope | MRI |
Work | constipation | Stress | Mitoxantrone |
Stress | Diarrhoea | Numbness | Fatigue |
Diarrhoea | Dyspepsia | Neck | CD4 |
calcium | Anxiety | Lesions | MBP |
gas | bloating | Tingling | Depression |
constipation | Fibromyalgia | Spinal tap | CNS |
3.4 Hot Topics
We examined the hot topic candidates for each concept space pair (as described in 2.3.4) and manually weeded out non-informative concepts (e.g., patient, drug). Examples of hot topics are in Table 7. One limitation is that we do not recognize synonyms. For example, Copaxone is the trade name for glatiramer acetate (the name more frequently observed in PubMed). Similarly, Zelnorm is the trade name for Tegaserod. But once again we see a greater emphasis, in the message boards, on the symptoms within the different health domains. Tingling in the extremities is a common complaint in MS, and several of the postings discuss this aspect. The emphasis on this concept is low in PubMed.
Table 7.
Hot Topics in Message Board Discourses.
AIDS | Autism | Fib | IBS | MS |
---|---|---|---|---|
Oral Sex, Worried, Mouth, Anxiety | Mercury, Diet, Vaccines, Thimerosal, Mercury Poisoning, Vitamins | Neck, Ultram, MRI, Headache | Calcium, Metamucil, Zelnorm, Foods, Magnesium, Milk, Cramps | Copaxone, Numbness, Tingling (Pins and needles), Thyroid, Steroids, Burning |
We also looked at hot topics within particular semantic types (details not shown due to lack of space). For example in Autism within the Pharmacologic Substance subspace concepts such as Wheat, Melatonin, Cod Liver Oil, Soy, Prozac are identified as hot topics in the message board while in PubMed we find Naltrexone and Fluvoxamine. We chose to examine the Pharmacologic Substance subspace since it shows low similarity between the two discourses. Such subspaces offer in general, a higher likelihood of finding hot topics.
4. RELATED WORK
Researchers are studying the impact of the vast amount of health information that has become available on the Internet. Such research often deals with one specific health care aspect, for example, evaluation of the quality of self-education mammography materials [7] and assessing the quality of newspaper medical advice columns [8].
In information retrieval, researchers are actively developing techniques to mine the biomedical professional literature [9, 10, 11]. From the WWW conference alone, we can find a wealth of literature on the analysis of web communities’ textual communications [12], opinion extraction and semantic classification of product reviews [13], and the study of the dynamics of information propagation in Blogspace [14]. Although most of the literature on web mining has business or marketing applications in mind, some work targets the health domain [15]. Some researchers have explored the feasibility of detecting health-related concepts within electronic messages of patients [6]. The most closely related work we have found is by Abidi et al. [16], who present a knowledge management approach to the timely provision of clinical knowledge to health care practitioners. Their proposed system identifies, captures and organizes the knowledge inherent within on-line problem-solving discussions between pediatric pain practitioners, and establishes linkages between pediatric pain discussions and the corresponding medical literature on children’s pain available at PubMed. Our system differs from theirs in source of data, scope, and purpose. Because they have not implemented their ideas, it is also not possible to compare with their methods.
5. CONCLUSIONS AND FUTURE WORK
Our explorations with the five health domains display the potential of our methods. Methodologically it is reassuring to see that our overall similarity scores for the PM-MB pairs in the five domains show statistical significance. We also see high similarity scores within most of the top ranking semantic types shared by the discourses. This may indicate public awareness of research and the success of consumer health education. But fundamentally there are several implicit confounding factors. The two discourses in a domain are by their very nature different: in vocabulary and in interests. There are other challenges. Using MetaMap to normalize the vocabulary helps immensely, but we have to work with its errors. For example, Zidovudine, a common medication for HIV, is mapped to five concepts including “AZT (Zidovudine)” and “Azidothymidine (Zidovudine)”. This suggests the need for more post-processing. We also need to consider variants of a concept spanning the consumer and professional vocabularies. Spelling errors, abbreviations, and conversational (low content) sentences in MB postings need to be treated systematically. We are investigating SVM classifiers for removing low content sentences.
With this baseline exploratory study our immediate aim is to improve our similarity measures. For example, we would like to use cosine similarity with weights computed for each concept and edge. We will also explore formal graph matching algorithms that look for sub and super graph structures. We are exploring a probabilistic measure for central and hot topic selection. The underlying question is: with what statistical confidence can we identify a term as a hot topic that differentiates two spaces? By developing formal measures we also plan to reduce the number of parameters in our methods.
Another angle to explore is to identify opinions [17] expressed by the authors in both discourses. For example, although the statistics show that professional and lay people both talk about MMR vaccination and mercury in the Autism domain, their opinions on these topics differ. Our data show that lay people have strong negative opinions on MMR vaccination and speculate that mercury contained in the vaccine is a causal agent for autism; professionals dismiss the association through epidemiological and clinical studies. Ultimately our goal is real-time detection of hot topics in health message boards with just-in-time links to the appropriate professional literature.
6. ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation under Grant No. 0312356 awarded to P. Srinivasan.
7. REFERENCES
1. Solovitch S. The citizen scientists. Wired. 2001;9(9).
2. Ferguson T, Frydman G. The first generation of e-patients. BMJ. 2004;328:1148–1149. doi: 10.1136/bmj.328.7449.1148.
3. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
4. http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html (2004 version).
5. http://skr.nlm.nih.gov/
6. Brennan P, Aronson A. Toward Linking Patients and Clinical Information: Detecting UMLS Concepts in E-Mail. Working paper. 2004.
7. Tamm EP, Raval BK, Huynh PT. Evaluation of the quality of self-education mammography material available for patients on the Internet. Acad Radiol. 2000;7:137–141. doi: 10.1016/s1076-6332(00)80113-0.
8. Molnar FJ, Man-Son-Hing M, Dalziel WB, et al. Assessing the quality of newspaper medical advice columns for elderly readers. CMAJ. 1999;161:393–395.
9. Srinivasan P. Text Mining: Generating Hypotheses from MEDLINE. JASIST. 5(5):396–413.
10. Chaussabel D, Sher A. Mining microarray expression data by literature profiling. Genome Biology. 3(10):research 0055.1–0055.16. 2002. doi: 10.1186/gb-2002-3-10-research0055.
11. Shatkay H, Edwards S, Wilbur WJ, Boguski M. Genes, Themes and Microarrays. Using information retrieval for large-scale gene analysis. ISMB, La Jolla, CA, 317–328. ‘02.
12. Yukio O, Hirotaka S, Yutaka M, et al. Featuring web communities based on word co-occurrence structure of communications. WWW’02. 2002.
13. Dave K, Lawrence S, Pennock D. Mining the Peanut Gallery: Opinion Extraction and Semantic Classification of Product Reviews. WWW’03. 2003.
14. Gruhl D, Guha R, Liben-Nowell D, et al. Information Diffusion through Blogspace. WWW’04. 2004.
15. Johnson HA, Wagner MM, et al. Analysis of web access logs for surveillance of influenza. Medinfo. 2004:1202–6.
16. Abidi SS, Finley A, Milios E, et al. Knowledge Management in Pediatric Pain: Mapping On-line Expert Discussions to Medical Literature. Medinfo. 2004;2004:3–7.
17. Wiebe J, Wilson T, Bruce R. Learning Subjective Language. Computational Linguistics. 2004;30(3):277–30.