Author manuscript; available in PMC: 2012 Dec 1.
Published in final edited form as: J Biomed Inform. 2011 Aug 12;44(6):1032–1038. doi: 10.1016/j.jbi.2011.08.008

Toward Automated Consumer Question Answering: Automatically Separating Consumer Questions from Professional Questions in the Healthcare Domain

Feifan Liu 1, Lamont D Antieau 1, Hong Yu 1,2
PMCID: PMC3226885  NIHMSID: NIHMS324105  PMID: 21856442

Abstract

Objective

Both healthcare professionals and healthcare consumers have information needs that can be met through the use of computers, specifically via medical question answering systems. However, the two groups differ in literacy level and technical expertise, and an effective question answering system must account for these differences if it is to formulate the most relevant responses for users from each group. In this paper, we propose that a first step toward answering the queries of different users is automatically classifying questions according to whether they were asked by healthcare professionals or consumers.

Design

We obtained two sets of consumer questions (~10,000 questions in total) from Yahoo! Answers. The professional questions comprise two collections: 4654 point-of-care questions (denoted PointCare) obtained from interviews with a group of family doctors following patient visits, and 5378 questions posed by physicians through a professional online service (denoted OnlinePractice). With more than 20,000 questions combined, we developed supervised machine-learning models for automatic classification of consumer questions versus professional questions. To evaluate the robustness of our models, we tested the model trained on the Consumer-PointCare dataset against the Consumer-OnlinePractice dataset. We evaluated both linguistic features and statistical features and examined how the characteristics of the two types of professional questions (PointCare vs. OnlinePractice) affect classification performance. We also explored information gain for feature reduction and back-off linguistic category features.

Results

10-fold cross-validation results showed the best F1-measure of 0.936 and 0.946 on Consumer-PointCare and Consumer-OnlinePractice respectively, and the best F1-measure of 0.891 when testing the Consumer-PointCare model on the Consumer-OnlinePractice dataset.

Conclusion

Healthcare consumer questions posted to Yahoo! online communities can be reliably distinguished from professional questions posted by point-of-care clinicians and online physicians. The supervised machine-learning models are robust for this task. Our study will significantly benefit further development of automated consumer question answering.

Keywords: question classification, medical question answering, supervised machine learning, support vector machines, natural language processing

1. Introduction

The use of computer technology is an integral part of meeting healthcare information needs. While the earliest use of computers to search for medical information was generally performed by healthcare professionals, the Internet has paved the way for laypeople with healthcare concerns such as diseases, symptoms, and treatments to search for information [1-3], and the number of people using the Internet for such information is growing. Tu and Cohen [4] report, for instance, that the number of those searching for health-related information online doubled between 2001 and 2007 to include nearly one-third of all adult Internet users. This growth has not only expanded the healthcare topics that users search for, but also the range of sources available to them; recently, for instance, there has been a boom in healthcare-related social networking [5], particularly among question-answering sites [6]. These trends suggest that healthcare consumers will become increasingly dependent on Internet resources for answers to their medical questions in the years ahead.

In light of these developments, this paper reports on efforts to improve the medical question-answering system AskHERMES (http://www.askhermes.org/). AskHERMES (Help clinicians Extract and aRticulate Multimedia information from literature to answer their ad-hoc clinical quEStions) is a fully automated system that uses natural language processing tools to retrieve, extract, analyze, and integrate information from medical literature and other information resources to provide answers in response to questions posed by healthcare professionals, e.g. physicians. To date, much of the research that has been done to benefit the system has focused on extracting information needs from the complex questions that physicians often ask [7] and on presenting responses to such questions in the most effective way [8]. More recently, the use of automatic speech recognition tools that enable physicians to pose questions to the system as spoken rather than typewritten input has been explored [9]. The research direction presented in the current paper, however, reflects the belief that AskHERMES could also benefit healthcare consumers, but only after the development of several subcomponents that would allow it to accurately interpret the information needs of lay users (i.e. healthcare consumers), quickly find information appropriate to their literacy level and technical expertise (or lack thereof), and then summarize the retrieved information and present it in the most useful way.

As a first step in this process, we investigated ways of determining whether users are healthcare consumers or professionals based on quantitative linguistic differences between the questions asked by members of both groups. Although the literature on differences in communication between physicians and patients acknowledges that questions are a significant component of medical encounters [10-12], studies in this area have generally focused on questions that patients ask healthcare professionals [10-14], questions posed by healthcare professionals [13], or on the more general aspects of linguistic interaction between the two groups [15]. With respect to the literature on using computers to search for medical-related information, studies have also tended to investigate either the queries of healthcare consumers [16] or those of professionals [17] without taking into account the questions of both groups. A more rigorous, comparative analysis of these questions might reveal stylistic differences that could enable us to better meet the information needs of members of both groups.

With these aims in mind, we developed a supervised machine-learning framework to automatically distinguish the questions of healthcare consumers and professionals. Although an exploratory study of the differences between the information-seeking behaviors of the two groups revealed significant differences at every level of the grammar, we primarily focused on shallow linguistic features (e.g. bag-of-words features) without deep language processing (e.g. syntactic parsing), as previous work determined that words are adequate representational units for the purposes of classification [18]. We found machine-learning approaches suitable for classifying questions by whether they were posed by consumers or professionals. In addition to the success of a bag-of-words approach for classification, we experimented with statistical features and linguistic category features to improve the robustness of the classifiers.

2. Related Work

Many studies in question classification have focused on the semantics of questions and their potential answers, and to that end, they have investigated the use of taxonomies in question classification both in the open domain [19, 20] and in the medical domain [21]. Some systems have explored the use of syntactic features for classification but have generally done so as a supplement to semantics rather than as a replacement [22-24]. Other studies have identified additional dimensions that could be useful for question classification, for instance, the distinction between factual and analytical questions [25, 26], factual and opinion questions [27], objective and subjective questions [28, 29], and answerable and unanswerable questions [30]. We propose that the ability to distinguish between the questions asked by consumers and professionals could be a dimension worth exploring, in our case, because of its potential to tailor information retrieval and question answering systems for different users.

Different linguistic features and feature selection methods have been studied in previous work. In the area of corpus linguistics, studies focusing on readability [31-33] have explored word length, word frequency, and sentence length to determine linguistic complexity and genre. Information-gain-based feature selection has been shown to be helpful for text and evidence classification [34, 35]. Motivated by this prior work, we evaluated both linguistic features and statistical features on our task, and we propose linguistic category features, which are expected to capture language usage differences between healthcare professionals and consumers at a higher level, thereby alleviating the data sparseness problem that results from "bag of words" features.

3. Material and Methods

We first discuss the collection of our data and provide a brief characterization before describing the machine-learning methods that were used for question classification.

3.1 Data

We used four representative datasets in our study: two sets of consumer questions and two sets of professional questions, as described below.

  1. Consumer questions I (Consumer-I):

    We downloaded 5,013 consumer questions posted on Yahoo! Answers between May and June 2009 (http://answers.yahoo.com, category "Health/Diseases and Conditions").

  2. Consumer questions II (Consumer-II):

    We reused 5499 consumer questions, a subset extracted from a previous study (http://ir.mathcs.emory.edu/shared/). Questions in this subset were posted in the "Health/Diseases and Conditions" category of Yahoo! Answers between November 2007 and January 2008.

  3. Point-of-care clinical questions (PointCare):

    A set of 4654 professional questions collected through interviews with family doctors following patient visits [13, 36].

  4. Online questions among physician practices (OnlinePractice):

    ParkHurstExchange (http://www.parkhurstexchange.com) is an online publishing service based in Montreal, Canada, that provides credible and highly respected publications of physician practice questions and answers from healthcare professionals. Questions posted by physicians are selected and answered by professional members, and the answers are further reviewed and approved by the Medical Editor-in-Chief. Through this service, physicians can ask their own questions, browse questions in different specialties, and search them by keyword. We downloaded 5,378 professional questions from the ParkHurstExchange website as of December 6, 2010.

Although none of the four question collections described above was originally intended for an automated question-answering system, there are several benefits to using them: 1) they are relatively large collections of questions in which each question can be attributed to a consumer or a professional with a high degree of certainty, and thus they are amenable to supervised machine learning; 2) they go beyond the use of search terms to include utterances comprising complete sentences and even longer passages of discourse, which we anticipate will be more representative of queries formulated in natural language as opposed to keywords and phrases; and 3) the two professional datasets represent two different clinical settings that vary in several respects; for instance, point-of-care questions are relatively spontaneous while online physician practice questions are relatively well planned. This allows us to examine how such diversity affects the classification performance and robustness of our proposed approach.

3.2 Linguistic Observation on Question Collections

There is a wide range of question types in both consumer and professional questions, and typologies for the healthcare professional questions have been proposed in a number of publications [13, 36]. A typological classification based on the interrogative words in questions [36] is not only useful for the professional questions, but also applies to the Yahoo consumer questions, as shown in the instances below:

  • 1a. How do I treat hand eczema?

  • 1b. What can I use to relieve a sunburn?

  • 1c. When is the best time to take your resting heart rate and why?

  • 1d. Where can you find truthful answers about bone cancer?

  • 1e. Why do we get blisters on our feet?

Both professional and consumer questions pose some difficulties for understanding due to typographical and grammatical errors. They also comprise instances that violate the general syntactic rules of written questions, e.g. some appear in declarative or imperative form rather than interrogative, or they are incomplete by the rules of any sentence type. Such instances include the following:

  • 2a. I wonder if this patient could have a rotator cuff thing? (professional)

  • 2b. If im lactose intolerant…….? (consumer)

Additionally, many utterances in interrogative form are embedded within sentences that take other syntactic forms, as in the following:

  • 3a. this patient still has a cervical strain that is flared up. muscle spasm? what to do next? i’ll probably refer to a neurologist if still no better at the next visit. (professional)

  • 3b. I have mosquito bites on my feet and the scars from it arent going away what can i do? (consumer)

In order to minimize some of the syntactic and pragmatic issues raised by such questions, we focused primarily on shallow-level features (e.g. bag-of-words features) without deep language processing, as discussed below. Additionally, the lexicon is of particular interest to the biomedical informatics community because of the challenges that medical terms pose for laypersons in terms of comprehension [37-41] and information retrieval [42, 43].

3.3 Machine Learning Approach

The task is formulated as a binary classification problem aiming to separate consumer questions from professional questions related to healthcare. We explored supervised machine-learning (ML) classification with various algorithms using the freely available Weka toolkit (http://www.cs.waikato.ac.nz/ml/weka/). To separate consumer questions from spoken clinical questions, we applied the classification framework to the Consumer-PointCare dataset, in which the instances for the two classes come from Consumer-I and PointCare (described in Section 3.1). Similarly, to separate consumer questions from online clinical questions, we applied the classification framework to the Consumer-OnlinePractice dataset, in which the instances for the two classes come from Consumer-II and OnlinePractice (described in Section 3.1), respectively.

3.3.1 Learning Features

Bag-of-words (BOW) Features

As learning features for our ML approaches, we first explored bag-of-words features, which rely on the lexical terms used in both sets of questions (referred to as BOW features). To obtain BOW features from each question, we tokenized the question and filtered noisy terms via regular expressions, e.g. terms containing only symbols or starting with symbols.
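A minimal sketch of this preprocessing step, assuming whitespace tokenization and simple regular-expression filters (the exact tokenizer and filter patterns used in the study are not specified):

```python
import re
from collections import Counter

# Hypothetical filters: drop tokens made up entirely of symbols or starting with a symbol.
SYMBOL_ONLY = re.compile(r"^\W+$")
LEADING_SYMBOL = re.compile(r"^\W")

def bow_features(question: str) -> Counter:
    """Lowercase, split on whitespace, and count the remaining lexical terms."""
    tokens = question.lower().split()
    kept = [t for t in tokens if not (SYMBOL_ONLY.match(t) or LEADING_SYMBOL.match(t))]
    return Counter(kept)

print(bow_features("What can I use to relieve a sunburn ?"))
# Counter({'what': 1, 'can': 1, 'i': 1, 'use': 1, 'to': 1, 'relieve': 1, 'a': 1, 'sunburn': 1})
```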

Statistical Features

Even though we used two relatively large collections of professional and consumer health questions (~10,000 each), these questions still represent only a small portion of the questions that physicians and consumers could potentially pose to a question-answering system. Even between the two datasets we used, there is considerable difference in lexical usage. To offset this data sparseness and build a more robust system, we explored features based on statistical aspects of language structure (referred to as Statistical Features), listed below; a computation sketch follows the list.

  1. Word length. Healthcare professionals tend to use domain terms to express their information needs, and those technical terms are frequently longer than common words. We calculated the maximum, minimum and average letter counts of words in each question as features.

  2. Inverse Document Frequency (IDF). Many domain terms in the professional questions are rare compared with common words, and IDF is a good indicator of a word’s rarity. We calculated each word’s IDF value on the MEDLINE 2010 corpus, which contains nearly 19 million records, and then used the maximum, minimum and average IDF values in each question as features.

  3. Question length. Professional questions tend to contain more words and we counted the number of words in each question as a feature to capture the length difference between consumer questions and professional questions.
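The following sketch shows how these seven statistical features could be computed, assuming word-level IDF values estimated from MEDLINE are supplied as a dictionary (the exact IDF formula and the handling of unseen words are assumptions, not specified in the paper):

```python
def statistical_features(question: str, idf: dict) -> dict:
    """Seven statistical features: word-length max/min/avg, IDF max/min/avg,
    and question length (number of words)."""
    words = question.lower().split()
    lengths = [len(w) for w in words]
    # Assumption: words unseen in the reference corpus get the largest observed IDF (rarest).
    default_idf = max(idf.values()) if idf else 0.0
    idfs = [idf.get(w, default_idf) for w in words]
    return {
        "word_len_max": max(lengths), "word_len_min": min(lengths),
        "word_len_avg": sum(lengths) / len(lengths),
        "idf_max": max(idfs), "idf_min": min(idfs),
        "idf_avg": sum(idfs) / len(idfs),
        "question_len": len(words),
    }

# A standard IDF definition over MEDLINE records would be idf(w) = log(N / df(w)),
# with N ~ 19 million records and df(w) the number of records containing w.
```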

Linguistic Category Features

To improve the robustness of our approach, we extracted four linguistic categories as higher-level features (referred to as Linguistic Category Features) by manually examining the top bag-of-words features in an information-gain-based ranking.

Table 1 shows the top 20 words in the Consumer-PointCare dataset based on information gain. We observed that many of the terms that achieved a high information gain score were stopwords that are often omitted from NLP tasks including information retrieval due to their characterization as insignificant discriminators [44]. This finding is consistent with our earlier studies [21, 30, 45] in which we found stopwords were helpful for sentence classification and question classification tasks.

Table 1.

Top 20 words based on information gain value on the Consumer-PointCare dataset

Rank Word  Rank Word
1 the      11 she
2 what     12 of
3 patient  13 you
4 my       14 a
5 is       15 help
6 with     16 rid
7 should   17 dose
8 woman    18 and
9 for      19 this
10 her     20 in
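Information gain here is the reduction in class entropy obtained by splitting the question collection on whether a question contains the word. The sketch below computes it for binary word-presence features (base-2 entropy and whitespace tokenization are assumptions; an equivalent score is available in Weka as InfoGainAttributeEval):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(word, questions, labels):
    """IG(class; word) = H(class) - H(class | word present or absent)."""
    present = [y for q, y in zip(questions, labels) if word in q.lower().split()]
    absent = [y for q, y in zip(questions, labels) if word not in q.lower().split()]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional

def rank_words(questions, labels):
    """Rank the vocabulary by information gain, as in Table 1."""
    vocabulary = {w for q in questions for w in q.lower().split()}
    return sorted(vocabulary, key=lambda w: information_gain(w, questions, labels), reverse=True)
```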

The finding that stopwords hold value for our classification has a practical benefit in that stopwords are typically members of the broad supercategory that is traditionally recognized in linguistics as closed-class words, as opposed to open-class words [46]. Closed-class words belong to parts-of-speech sets that rarely, if ever, admit new members, including prepositions, pronouns, and interrogative words. The finite nature of these sets enabled us to extract the entire set as learning features based on the observation that some of the members of these sets had a high value in information gain. Based on these values, the second author of this study recognized four linguistic categories as having an especially high potential for our classification purposes, viz. interrogative words (e.g. what, how), personal pronouns (e.g. I, my, her, she, you), indefinite pronouns (e.g. anyone, somebody), and auxiliary verbs (e.g. is, should, do). As these categories are closed-class words, we extracted all of their members from Wikipedia and automatically derived corresponding features indicating whether a question contains those linguistic category members. For open-class words such as nouns and verbs, which admit new members freely, more sophisticated techniques are needed to group them at a higher level, and we leave this for future work.
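A sketch of deriving these features, using abbreviated closed-class word lists for illustration (the study used complete lists drawn from Wikipedia; one binary feature per category is an assumption here, and per-member indicators would be a straightforward variant):

```python
# Abbreviated closed-class lists; the study extracted the full member lists from Wikipedia.
CATEGORIES = {
    "interrogative": {"what", "how", "when", "where", "why", "which", "who"},
    "personal_pronoun": {"i", "me", "my", "you", "your", "he", "his", "she", "her", "we", "our"},
    "indefinite_pronoun": {"anyone", "anybody", "someone", "somebody", "anything", "something"},
    "auxiliary_verb": {"is", "are", "was", "should", "can", "could", "do", "does", "will", "would"},
}

def linguistic_category_features(question: str) -> dict:
    """Binary features: does the question contain any member of each closed class?"""
    words = set(question.lower().split())
    return {category: int(bool(words & members)) for category, members in CATEGORIES.items()}

print(linguistic_category_features("What can I use to relieve a sunburn?"))
# {'interrogative': 1, 'personal_pronoun': 1, 'indefinite_pronoun': 0, 'auxiliary_verb': 1}
```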

3.3.2 Evaluation Metrics

We used recall, precision, and weighted F1-score as the evaluation metrics. Recall is the number of correctly classified questions of a class divided by the total number of questions belonging to that class; precision is the number of correctly classified questions divided by the total number of questions assigned to that class; and the F1-score is the harmonic mean of recall and precision.
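In standard form, for a class c with TP_c true positives, FP_c false positives, FN_c false negatives, and n_c questions out of N in total, the weighted F1 averages the per-class F1 scores by class size (this is the usual interpretation of the weighted average reported by classification toolkits):

```latex
P_c = \frac{TP_c}{TP_c + FP_c}, \qquad
R_c = \frac{TP_c}{TP_c + FN_c}, \qquad
F1_c = \frac{2 P_c R_c}{P_c + R_c}, \qquad
F1_{\mathrm{weighted}} = \sum_c \frac{n_c}{N}\, F1_c
```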

4. Results

In this section, we present the results of our supervised learning approach to the classification of consumer and professional questions. We experimented with different classification algorithms available in the Weka toolkit, found that support vector machines (SVMs) trained with the Sequential Minimal Optimization (SMO) algorithm developed by Platt [47] worked best, and therefore report only those results. We first evaluated our approach with different features on Consumer-PointCare and Consumer-OnlinePractice, respectively, using 10-fold cross-validation, and we examined different characteristics of consumer questions and professional healthcare questions. We then evaluated information-gain-based feature selection, and finally we report the classification performance obtained by applying the model trained on the Consumer-PointCare dataset to the Consumer-OnlinePractice dataset, to show the robustness of our learning framework.
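The study ran this pipeline in Weka; the sketch below reproduces the same experimental setup with scikit-learn components as stand-ins (CountVectorizer for bag-of-words counts, LinearSVC in place of Weka's SMO, 10-fold cross-validation with weighted F1), purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def evaluate_bow_svm(questions, labels):
    """10-fold cross-validated weighted F1 of a linear SVM over bag-of-words features.

    questions: list of question strings; labels: "consumer" or "professional" per question.
    """
    model = make_pipeline(CountVectorizer(lowercase=True), LinearSVC())
    scores = cross_val_score(model, questions, labels, cv=10, scoring="f1_weighted")
    return scores.mean()
```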

4.1 Performance Comparison on Two Datasets with Different Features

We report 10-fold cross-validation results on the two datasets (Consumer-PointCare and Consumer-OnlinePractice) separately, as shown in Table 2. We found that the performance patterns across feature settings are similar on the two datasets. Bag-of-words (BOW) features perform best, with F-measures of 0.918 and 0.945 (rows 3 and 10), compared with 0.824/0.875 (rows 4 and 11) using statistical features and 0.818/0.755 (rows 6 and 13) using linguistic category features. Statistical features proved very helpful in boosting performance when combined with other features, achieving the highest F-measures of 0.929 on Consumer-PointCare and 0.946 on Consumer-OnlinePractice when combined with BOW features.

Table 2.

Evaluation performance with different feature settings on two datasets

Results on Consumer-PointCare
Feature Settings Precision Recall F-measure
Bag-of-words Features (BOW) 0.918 0.918 0.918
Statistical Features (SF) 0.825 0.824 0.824
BOW+SF 0.929 0.929 0.929
Linguistic Category Features (LCF) 0.821 0.818 0.818
LCF+SF 0.882 0.882 0.882
Results on Consumer-OnlinePractice
Feature Settings Precision Recall F-measure
Bag-of-words Features (BOW) 0.945 0.945 0.945
Statistical Features (SF) 0.876 0.875 0.875
BOW+SF 0.946 0.946 0.946
Linguistic Category Features (LCF) 0.757 0.755 0.755
LCF+SF 0.898 0.898 0.898

Overall, performance on Consumer-OnlinePractice is better than on Consumer-PointCare, except that the linguistic category features yielded a better F-measure on Consumer-PointCare (0.818) than on Consumer-OnlinePractice (0.755). Note that although the proposed linguistic category features do not perform as well as the other features, they offer greater potential for generalization than BOW features. We also tried stemmed BOW features, but stemming degraded performance, which is consistent with findings in other natural language processing tasks [46, 48].

To obtain a deeper understanding of how each individual linguistic category and statistical feature contributes to distinguishing consumer questions from professional questions, we evaluated performance on the Consumer-PointCare dataset using each feature on its own, as shown in Table 3. The results on Consumer-OnlinePractice showed a similar pattern.

Table 3.

Analysis on the contribution of each statistical feature and linguistic category (F-measure)

Statistical Features
Word Length: Max 0.762, Avg 0.656, Min 0.53
Question Length: 0.696
Inverse Document Frequency (IDF): Max 0.354, Avg 0.661, Min 0.534
Linguistic Category Features
Interrogative words: 0.673
Personal pronouns: 0.657
Indefinite pronouns: 0.489
Auxiliary verbs: 0.726

We can see that most statistical features perform well on this task. The maximum word length yielded the best performance, with an F-measure of 0.762, while the maximum IDF yielded the worst performance, 0.354. However, the average IDF achieved a relatively good performance of 0.661, the third-highest score after the second-best F-measure of 0.696 obtained using question length. Of the four linguistic categories we proposed, auxiliary verbs performed best, with an F-measure of 0.726, and indefinite pronouns performed worst, with an F-measure of 0.489. Interrogative words and personal pronouns performed similarly, yielding F-measures of 0.673 and 0.657, respectively.

Figure 1 shows the significant differences in the distribution of personal pronouns used in professional and consumer questions (on the Consumer-PointCare dataset), which further explains the effectiveness of the personal pronoun linguistic category. As the figure shows, consumer questions contain a greater percentage of pronouns (9.0%) than professional questions (5.6%) overall, and there is a greater drop in usage among professional questions than consumer questions as the focus is narrowed from all pronouns to first- and second-person pronouns (3.4% versus 8.4%) and, finally, to first-person pronouns only (1.9% versus 7.0%).

Figure 1. Distributions of personal pronouns used in professional and consumer questions on the Consumer-PointCare dataset.
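A sketch of how such pronoun proportions can be computed over a question collection (the pronoun lists below are illustrative and not necessarily the exact lists behind Figure 1):

```python
# Illustrative pronoun sets; the exact lists used for Figure 1 are not given in the paper.
PERSONAL = {"i", "me", "my", "we", "us", "our", "you", "your", "he", "him", "his",
            "she", "her", "it", "its", "they", "them", "their"}
FIRST_AND_SECOND = {"i", "me", "my", "we", "us", "our", "you", "your"}
FIRST_ONLY = {"i", "me", "my", "we", "us", "our"}

def pronoun_rate(questions, pronouns):
    """Percentage of word tokens in a question collection that belong to `pronouns`."""
    tokens = [w for q in questions for w in q.lower().split()]
    return 100.0 * sum(w in pronouns for w in tokens) / len(tokens)
```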

Based on this analysis, we tried removing the worst statistical feature (the maximum IDF) and the worst linguistic category (indefinite pronouns) from the corresponding feature settings of Table 2, but there was no improvement, suggesting that different features complement each other through their interactions.

4.2 Feature Selection on BOW Features

As previously described, bag-of-words (BOW) features are very useful for this classification; however, BOW features are computationally expensive to construct and they contribute to data sparseness problems, thereby posing a practical challenge for automatic question answering systems. In addition to the non-lexical statistical features and higher-level linguistic category features discussed earlier, we explored information-gain-based feature selection for dimension reduction of the BOW features, which is expected to remove useless redundancy without noticeable information loss.
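Selecting the top-k BOW features by information gain can be sketched as follows; mutual information between a binary word feature and the class label is the same quantity as the information gain used here, and scikit-learn's SelectKBest plays a role similar to the information-gain ranking available in Weka:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def top_k_bow(questions, labels, k=3000):
    """Keep the k bag-of-words features with the highest information gain
    (mutual information with the consumer/professional label)."""
    vectorizer = CountVectorizer(binary=True, lowercase=True)
    X = vectorizer.fit_transform(questions)
    selector = SelectKBest(mutual_info_classif, k=k).fit(X, labels)
    kept_words = [word for word, keep in
                  zip(vectorizer.get_feature_names_out(), selector.get_support()) if keep]
    return selector.transform(X), kept_words
```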

Figure 2 shows the performance curves on the two datasets with different numbers of features selected by information gain. We can see that information-gain-based feature selection achieved best F-measures of 0.926 on Consumer-PointCare and 0.946 on Consumer-OnlinePractice, compared with 0.918 and 0.945 using all of the BOW features. In addition, using only the top 3000 BOW features for both systems achieves highly competitive performance and could potentially improve the system's robustness. When combined with statistical features, the top 3000 BOW features based on information gain further improved performance from 0.926 to 0.936 on Consumer-PointCare and from 0.943 to 0.945 on Consumer-OnlinePractice.

Figure 2. Performance (F-measure) curves for classifying healthcare consumer questions against two types of professional questions, with different numbers of features selected based on information gain.

4.3 Performance When Using Consumer-OnlinePractice as Blind Test Data

In this section, we evaluated the robustness of our approach under different settings. Specifically, we trained the learning model on the Consumer-PointCare dataset and used the Consumer-OnlinePractice dataset as blind test data for evaluation. The results are shown in Table 4, where "BOW+SF" stands for bag-of-words features combined with statistical features, "LCF+SF" for linguistic category features combined with statistical features, and "BOW3000+SF" for the top 3000 BOW features based on information gain combined with statistical features.
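A sketch of this blind-test protocol, fitting the vocabulary and classifier on Consumer-PointCare only and applying them unchanged to Consumer-OnlinePractice (scikit-learn components again serve as stand-ins for the Weka setup):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def blind_test(train_questions, train_labels, test_questions, test_labels):
    """Train on one consumer/professional dataset and evaluate on another."""
    model = make_pipeline(CountVectorizer(lowercase=True), LinearSVC())
    model.fit(train_questions, train_labels)      # vocabulary is fixed by the training data
    predictions = model.predict(test_questions)   # out-of-vocabulary test words are ignored
    return f1_score(test_labels, predictions, average="weighted")
```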

We can see that using all BOW features is not as advantageous as in the cross-validation results on the training data (Consumer-PointCare) in Table 2, achieving an F-measure of 0.879 compared with 0.918 (row 3 in Table 2), while the statistical features, as expected, were more robust on new test data and obtained an F-measure of 0.841 compared with 0.824 (row 4 in Table 2). The table also shows that the proposed linguistic category features (LCF) and the information-gain-based dimension reduction of BOW features (BOW3000) both help achieve better performance on the test data when combined with the statistical features, with the F-measure increasing from 0.841 to 0.856 and to the best value of 0.891, respectively. In addition, we found that keeping the top 3000 BOW features did not lose any useful information, yielding the same best performance as using all of the BOW features (compare rows 4 and 6 in Table 4).

Table 4.

Classification performance using Consumer-PointCare as training data and Consumer-OnlinePractice as test data.

Feature Settings Precision Recall F-measure
Bag-of-Words Features (BOW) 0.88 0.879 0.879
Statistical Features (SF) 0.858 0.843 0.841
BOW+SF 0.891 0.891 0.891
LCF+SF 0.86 0.856 0.856
BOW3000+SF 0.891 0.891 0.891

5. Discussion

This study explored the possibility of using a classifier to distinguish between questions asked by healthcare professionals and those asked by consumers. Although the semantics of medical questions play an important role in related tasks, this study focused on the differences in language usage between consumers and healthcare professionals, for which traditional semantics matters less than it does in other question classification tasks. The results show that our supervised learning framework, based on inexpensive bag-of-words features and statistical features, obtained satisfactory performance for this classification task.

To account for the diversity between point-of-care and online physician practice questions, we used two datasets, Consumer-PointCare and Consumer-OnlinePractice, in our study. From the results in Table 2, we can see that the different facets of language in the two clinical settings influenced the accuracy of distinguishing consumer questions from professional questions, but the similar patterns across feature settings indicate that the two settings share internal characteristics. This makes it feasible to develop an effective, unified learning framework compatible with both point-of-care and online physician practice questions, and even with new professional questions generated under different scenarios.

In Table 4, where the model was trained on the Consumer-PointCare dataset and tested on the Consumer-OnlinePractice dataset, performance is slightly worse than the cross-validation results on the training data (Consumer-PointCare) shown in Table 2, but an F-measure of nearly 0.9 still demonstrates the strong robustness of the proposed framework for this task. In other words, although there are differences in various aspects of language usage between point-of-care and online physician practice questions, this result further validates that the differences between these two types of professional questions do not overwhelm their inherent difference from consumer questions. This finding lays a solid foundation for further development in automated consumer question answering.

Among the seven statistical features we used, word length was shown to be a significant indicator of the identity of the question askers. We found that professionals are more likely than consumers to include long scientific words in their questions. This finding has been observed previously in the literature on differences between the language of healthcare professionals and consumers; for instance, it has been addressed by the U.S. Food and Drug Administration, which has proposed simpler language on medication labels by replacing words such as “pulmonary” with “lung”, “assistance” with “help” or “aid”, and “medication” with “drug” [41].

As shown in Table 3, the number of words in the question also played an important role in question classification. Among the individual features analyzed on the Consumer-PointCare dataset in Table 3, maximum word length achieved the best performance with an F-measure of 0.762, suggesting that the appearance of long words in a question is indicative of the type of user who asked it. We found that in the Consumer-PointCare dataset the maximum length of individual words differed considerably between the two types of questions: the longest word in the professional questions, for instance, is 27 letters long (esphogagogastroduodenoscopy), and the professional question set contains numerous words of 20 letters or more, including electroencephalographic, pseudothrombophlebitis, and dehydroepisandrosterone. The longest word in the consumer questions, on the other hand, is a single instance of the misspelled, 20-letter word herpaghonacyphilaid.

The overall distributions of word length in the two types of questions also differed. In the professional questions of the Consumer-PointCare dataset, 24.5% of the words are 7 or more letters long, while only 15.3% of the words in the consumer questions are; 7.6% of the words in the professional questions are 10 or more letters long, compared with only 2.6% in the consumer questions. From these numbers, we can see that the longer a word is, the more likely it is to appear in a question asked by a healthcare professional rather than a consumer.

We assumed that highly educated and specialized healthcare professionals use more domain terms than consumers, making inverse document frequency (IDF) a useful feature for classification; however, our analysis of each individual feature in Table 3 showed mixed results. While the average IDF works quite well, the maximum IDF achieved the worst F-measure among all the statistical features, suggesting that the assumption is unwarranted, for several possible reasons. First, the misspellings and Internet abbreviations that have appeared in recent years, such as “plz” and “lol,” and that appear in consumer questions are difficult to account for using such a model; because these forms are rare or absent in MEDLINE, they receive very high IDF values, which undermines the maximum IDF as a marker of professional language. Furthermore, in addition to such forms, consumer questions might include infrequent dialectal variants for diseases, treatments, and conditions that are unlikely to surface in general medical language. Finally, from a methodological standpoint, we calculated IDF values from the MEDLINE corpus; basing IDF values on the distribution of words in a more general corpus, or in a variety of corpora, might be a better strategy for making the most of IDF for such classification.

We observed that bag-of-words (BOW) features contain many redundancies for the classification task. As the number of BOW features increases (Figure 2), performance improves only marginally or even degrades; information-gain-based dimension reduction therefore allows the system to use fewer BOW features without information loss, which is especially useful for making the system more compatible and robust, as shown in Table 4. The proposed linguistic category features achieved promising results on our classification task. Their effectiveness was demonstrated in both the cross-validation results (Table 2) and the blind test results (Table 4), boosting performance when combined with statistical features. Although the linguistic category features did not obtain the best performance in our current study, they point to a way of examining, in future work, other linguistic categories (specific open-class sets, e.g. nominalizations and words of Latin and Greek origin) or topic-related clusters that could further benefit this classification task. We analyzed the distributions of one linguistic category (personal pronouns) in particular, as shown in Figure 1; the results suggest that differences in subjectivity and objectivity play important roles in distinguishing between the language of professional questions and that of consumer questions. In other words, pronoun usage in the professional questions reflects the objective orientation of clinicians, who generally have questions about the healthcare of their patients and infrequently refer to themselves in medical questions. Consumer questions, on the other hand, generally concern the askers of the questions themselves because they are experiencing the problem directly, and their pronoun usage reflects this subjectivity. Although these observations are perhaps intuitively obvious, our study shows they have practical applications for classification.

Although our systems perform quite well, there are several limitations to the current study. With respect to the use of IDF values as a means of measuring the value of words based on their rarity, we used MEDLINE as our reference corpus with limited success. Future work on this aspect of language use might be better served by relying on more general English corpora, such as the Brown Corpus, the Frown Corpus or the Wall Street Journal corpus, or by using a combination of medical and general corpora. Additionally, the word count of a question was shown to be useful for this task; however, future work should explore more sophisticated measurements of sentential complexity, and deep linguistic analysis (e.g. automatic parsing) of questions might also aid in this kind of classification. Finally, only the Yahoo! question prompts were tested as consumer questions; future research should include all of the language provided by the asker in Yahoo! Answers for a given linguistic event.

6. Conclusion

We evaluated a supervised learning framework for separating consumer questions from professional questions, in which we explored bag-of-words features as well as statistical features. The results of our work suggest that automating the classification of questions into professional and consumer categories is feasible. The proposed approach performed well in separating consumer questions from two types of professional questions (point-of-care and online physician practice), and the competitive performance generalized when a model trained on Consumer-PointCare was tested on Consumer-OnlinePractice, showing the robustness of our learning framework for this specific task. The proposed linguistic category features and the dimension reduction of bag-of-words features were shown to enhance the system's robustness. In addition, several differences between the questions of healthcare professionals and healthcare consumers were analyzed, such as word length and personal pronoun usage.

Our future work will further enhance the classification performance by exploring additional helpful features, such as IDF calculated on a more general corpus, linguistic open classes or topic-related word clusters, and syntactic parsing. More extensive, user-centered evaluations will also be conducted. We will investigate a systematic way to incorporate the classification framework proposed in this paper into the AskHERMES system for a preliminary evaluation of automated consumer question answering.

  • We propose automatically classifying consumer healthcare questions apart from professional questions.

  • Supervised machine-learning models show robust results on different datasets.

  • Bag-of-words and statistical features were shown to be useful for this task.

Acknowledgments

The authors would like to thank Yong-Gang Cao for technical support and Shashank Agarwal and Betsy Barry for helpful comments.


References

  • 1.Anderson JG. Consumers of e-health: patterns of use and barriers. Soc Sci Comput Rev. 2004;22:242–248.
  • 2.Elkin N. How America Searches: Health and Wellness. iCrossing. 2008:1–17.
  • 3.Weaver JB, III, Mays D, Lindner G, Eroglu D, Fridinger F, Bernhardt JM. Profiling Characteristics of Internet Medical Information Users. Journal of the American Medical Informatics Association. 2009;16:714–722. doi: 10.1197/jamia.M3150.
  • 4.Tu HT, Cohen GR. Striking jump in consumers seeking health care information. Track Rep. 2008:1–8.
  • 5.Landro L. Social networking comes to health care. Wall Street Journal. 2006;27.
  • 6.Agichtein E, Castillo C, Donato D, Gionis A, Mishne G. Finding high-quality content in social media with an application to community-based question answering. Proceedings of WSDM. 2008.
  • 7.Yu H, Cao YG. Automatically extracting information needs from ad hoc clinical questions. AMIA Annual Symposium Proceedings. 2008;2008:96.
  • 8.Cao YG, Ely J, Antieau L, Yu H. Evaluation of the clinical question answering presentation. BioNLP. 2009.
  • 9.Liu F, Kruse AM, Tur G, Hakkani-Tür D, Yu H. Towards Spoken Clinical Question Answering: Evaluating and Adapting Automatic Speech Recognition Systems for Spoken Clinical Questions. Journal of the American Medical Informatics Association. 2011. doi: 10.1136/amiajnl-2010-000071. In revision.
  • 10.Dorr DA, Tran H, Gorman P, Wilcox AB. Information needs of nurse care managers. AMIA Annu Symp Proc. 2006:913.
  • 11.Katz MG, Jacobson TA, Veledar E, Kripalani S. Patient Literacy and Question-asking Behavior During the Medical Encounter: A Mixed-methods Analysis. J Gen Intern Med. 2007;22:782–786. doi: 10.1007/s11606-007-0184-6.
  • 12.Graber MA, Randles BD, Ely JW, Monnahan J. Answering clinical questions in the ED. Am J Emerg Med. 2008;26:144–7. doi: 10.1016/j.ajem.2007.03.031.
  • 13.Ely JW, Osheroff JA, Ebell MH, Bergus GR, Levy BT, Chambliss ML, Evans ER. Analysis of questions asked by family doctors regarding patient care. BMJ. 1999;319:358–61. doi: 10.1136/bmj.319.7206.358.
  • 14.Roter DL. Patient participation in the patient-provider interaction: the effects of patient question asking on the quality of interaction, satisfaction and compliance. Health Education & Behavior. 1977;5:281. doi: 10.1177/109019817700500402.
  • 15.Rudd RE, Moeykens BA, Colton TC. Health and Literacy: A Review of Medical and Public Health Literature. In: Annual Review of Adult Learning and Literacy. New York: Jossey-Bass; 1999. www.cete.org/acve/docs/pab00016.pdf.
  • 16.Bader JL, Theofanos MF. Searching for cancer information on the internet: analyzing natural language search queries. Journal of Medical Internet Research. 2003;5. doi: 10.2196/jmir.5.4.e31.
  • 17.Mendonça EA, Kaufman D, Johnson SB. Answering Information Needs in Workflow.
  • 18.Joachims T. Learning to classify text using support vector machines: Methods, theory, and algorithms. Computational Linguistics. 2002;29:656–664.
  • 19.Suzuki J, Taira H, Sasaki Y, Maeda E. Question classification using HDAG kernel. Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering, Volume 12. 2003:61–68.
  • 20.Hermjakob U. Parsing and question classification for question answering. ACL Workshop on Open-Domain Question Answering. 2001.
  • 21.Yu H, Sable C, Zhu HR. Classifying Medical Questions based on an Evidence Taxonomy. Proceedings of the AAAI 2005 Workshop on Question Answering in Restricted Domains. 2005.
  • 22.Zhang D, Lee WS. Question classification using support vector machines. The 26th Annual International ACM SIGIR Conference. 2003:26–32.
  • 23.Li X, Roth D. Learning question classifiers. COLING '02. 2002.
  • 24.Hacioglu K, Ward W. Question classification with support vector machines and error correcting codes. Proceedings of HLT-NAACL 2003, Short Papers, Volume 2. 2003:28–30.
  • 25.Small S, Liu T, Shimizu N, Strzalkowski T. HITIQA: an interactive question answering system - a preliminary report. Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering, Volume 12. 2003:46–53.
  • 26.Small S, Strzalkowski T. HITIQA: A data driven approach to interactive analytical question answering. Proceedings of HLT-NAACL 2004: Short Papers. 2004:53–56.
  • 27.Ku L-W, Liang Y-T, Chen H-H. Question Analysis and Answer Passage Retrieval for Opinion Question Answering Systems.
  • 28.Li B, Liu Y, Agichtein E. CoCQA: co-training over questions and answers with an application to predicting question subjectivity orientation. Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2008:937–946.
  • 29.Li B, Liu Y, Ram A, Garcia EV, Agichtein E. Exploring question subjectivity prediction in community QA. Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2008:735–736.
  • 30.Yu H, Sable C. Being Erlang Shen: Identifying answerable questions. Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence Workshop on Knowledge and Reasoning for Answering Questions. 2005.
  • 31.Biber D, Conrad S, Reppen R. Corpus linguistics: Investigating language structure and use. Cambridge University Press; 1998.
  • 32.Biber D. Variation across speech and writing. Cambridge University Press; 1991.
  • 33.Redish JC, Selzer J. The place of readability formulas in technical communication. Technical Communication. 1985;32:46–52.
  • 34.Lin J, Demner-Fushman D. "Bag of Words" is not enough for Strength of Evidence Classification. AMIA Annual Symposium Proceedings. 2005;2005:1031.
  • 35.Yang Y, Pedersen JO. A comparative study on feature selection in text categorization. Proceedings of the Fourteenth International Conference on Machine Learning (ICML '97). 1997.
  • 36.Ely JW, Osheroff JA, Gorman PN, Ebell MH, Chambliss ML, Pifer EA, Stavri PZ. A taxonomy of generic clinical questions: classification study. BMJ. 2000;321:429–32. doi: 10.1136/bmj.321.7258.429.
  • 37.Hartley J. Clarifying the abstracts of systematic literature reviews. Bulletin of the Medical Library Association. 2000;88:332.
  • 38.McCray AT, Ide NC, Loane RR, Tse T. Strategies for supporting consumer health information seeking. Medinfo 2004: Proceedings of the 11th World Conference on Medical Informatics; San Francisco; September 7–11, 2004. p. 1152.
  • 39.McCray AT. Promoting health literacy. Journal of the American Medical Informatics Association. 2005;12:152–163. doi: 10.1197/jamia.M1687.
  • 40.Weeks WB, Wallace AE. Readability of British and American medical prose at the start of the 21st century. BMJ. 2002;325:1451. doi: 10.1136/bmj.325.7378.1451.
  • 41.Farley D. Label Literacy for OTC Drugs. FDA Consumer. 1997;31.
  • 42.Zeng QT, Tse T. Exploring and developing consumer health vocabularies. Journal of the American Medical Informatics Association. 2006;13:24. doi: 10.1197/jamia.M1761.
  • 43.Zeng QT, Crowell J, Plovnick RM, Kim E, Ngo L, Dibble E. Assisting consumer health information retrieval with query recommendations. Journal of the American Medical Informatics Association. 2006;13:80–90. doi: 10.1197/jamia.M1820.
  • 44.Baeza-Yates R, Ribeiro-Neto B. Modern information retrieval. New York: ACM Press; 1999.
  • 45.Agarwal S, Yu H. Automatically Classifying Sentences in Full-Text Biomedical Articles into Introduction, Methods, Results and Discussion. Proceedings of the AMIA Summit on Translational Bioinformatics. 2009.
  • 46.Jurafsky D, Martin JH. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall PTR; 2000.
  • 47.Platt JC. Using analytic QP and sparseness to speed training of support vector machines. Advances in Neural Information Processing Systems. 1999:557–563.
  • 48.Hull DA. Stemming algorithms: A case study for detailed evaluation. Journal of the American Society for Information Science. 1996;47:70–84.
