AMIA Annual Symposium Proceedings. 2011 Oct 22;2011:954–959.

An Investigation into the Feasibility of Spoken Clinical Question Answering

Tim Miller 1, Kourosh Ravvaz 2, James J Cimino 3, Hong Yu 4
PMCID: PMC3243288  PMID: 22195154

Abstract

Spoken question answering for clinical decision support is a potentially revolutionary technology for improving the efficiency and quality of health care delivery. This application involves many technologies currently being researched, including automatic speech recognition (ASR), information retrieval (IR), and summarization, all in the biomedical domain. In certain domains, the problem of spoken document retrieval has been declared solved because of the robustness of IR to ASR errors. This study investigates the extent to which spoken medical question answering benefits from that same robustness. We used the best results from previous speech recognition experiments as inputs to a clinical question answering system, and had physicians perform blind evaluations of results generated both by ASR transcripts of questions and gold standard transcripts of the same questions. Our results suggest that the medical domain differs enough from the open domain to require additional work in automatic speech recognition adapted for the biomedical domain.

1. Introduction

Question answering (QA) is an extension of information retrieval that changes the nature of input from keywords to natural language questions, and changes the nature of output from documents to answers. This type of interface potentially offers advantages to end users by not requiring them to convert their question to keywords, and by giving them a direct answer to their query instead of a list of documents to peruse.

The QA paradigm offers particular advantages when the modality changes to speech (both input and output). While an argument could be made that entering keywords via a keyboard is faster than entering a question, speaking a question may be faster and more natural than typing, and can be used in situations where a standard computer interface is not available. Ultimately, this type of interface could be part of a bigger ‘automated assistant’ application, incorporating dialogue understanding, so that users may interact with it as they would with another person, requiring very little adaptation on the part of human users.

The medical domain offers an important early testbed and research platform for these technologies, for some of the reasons mentioned above. Physicians are very busy, and are less likely to use technologies that require more time and cognitive overhead. Presumably, though, they would be willing to simply speak questions as they arise in the clinical setting if they could get a fast and accurate response. Besides the speed and convenience, QA in the medical domain would often occur away from a traditional keyboard/monitor setup (e.g., operating rooms, ambulances, battlefields). In addition, smart phones are now ubiquitous, and entering precise questions via touchscreen or miniature physical keyboard can be tedious and error-prone. For all these use cases, a spoken language interface may be the most practical option.

Unfortunately, there is reason to believe that speech recognition in the medical domain is a very challenging task [1]. Question answering offers additional challenges due to the spontaneous nature of such questions (possibly resulting in disfluency, ungrammaticality, etc.) and the difference between the structure of questions and average medical texts.

In one domain that could be considered the “open domain” (broadcast news), automatic speech recognition (ASR) is already considered good enough that document retrieval is regarded as a solved problem [2]. This is despite the fact that word error rates (WER) for ASR systems in that domain are still quite high (around 35%). Fortunately for that domain, information retrieval techniques seem to be robust enough to compensate for the noise in the output of ASR systems. This is very valuable, because improving or adapting speech recognition is potentially a difficult, expensive, and time-consuming process.

In this paper, we describe an experiment on the robustness of spoken question answering in the medical domain. In this experiment, physician subjects evaluated the quality of a question answering system both on ASR output and on gold standard question transcriptions. The results of this experiment allow us to measure the extent to which information retrieval can compensate for recognition errors in the medical question answering domain.

2. Background

2.1. Question Answering

For these experiments we use AskHERMES [3], a question answering system for the medical domain. It has several sophisticated NLP technologies integrated in a pipeline fashion that attempt to answer clinical questions that physicians may have. The standard interface is a text field into which questions are entered via keyboard.

The first processing step is query term extraction. In this step, stop words are removed, and MetaMap [4] is used to find Unified Medical Language System (UMLS) concepts and semantic types for all keywords that have possible medical meanings (e.g., using the word ‘tremor’ will result in inclusion of the word ‘shaking’). Keywords are extracted using supervised machine learning methods [5]. The query is then formulated from identified keywords and synonyms.
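The query formulation step can be illustrated with a minimal sketch. The stop-word list and synonym table below are hypothetical stand-ins: AskHERMES relies on MetaMap/UMLS and a supervised keyword extractor, neither of which is reproduced here.

```python
# Hypothetical stop-word list and synonym table, used only for illustration.
# The real system uses MetaMap/UMLS concepts and a trained keyword extractor.
STOP_WORDS = {"a", "an", "the", "is", "are", "what", "for", "of", "in", "to", "with"}
SYNONYMS = {"tremor": ["shaking"]}


def build_query(question: str) -> list[str]:
    """Drop stop words, then expand each remaining keyword with its synonyms."""
    tokens = [t.strip(".,?!").lower() for t in question.split()]
    keywords = [t for t in tokens if t and t not in STOP_WORDS]
    query: list[str] = []
    for keyword in keywords:
        for term in [keyword, *SYNONYMS.get(keyword, [])]:
            if term not in query:  # keep first occurrence, preserve order
                query.append(term)
    return query


print(build_query("What causes a tremor in the hands?"))
# ['causes', 'tremor', 'shaking', 'hands']
```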

This query is used to search the PubMed database, indexed locally using the open-source Lucene software.* The set of returned documents is then passed to an answer generating subsystem along with the original question.

To generate an answer, the first step is to detect phrases relevant to the question from returned documents. Next, passages are identified comprising one or more sentences that attempt to provide answers with necessary context. Finally, putative answers are presented in clusters based on the subset of keywords that each contains.
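The final clustering step can be sketched as follows, assuming that passages are grouped by the exact subset of query keywords they contain. The function name, the simple whitespace tokenization, and the ordering of clusters (larger keyword subsets first) are illustrative assumptions, not details taken from AskHERMES.

```python
from collections import defaultdict


def cluster_by_keywords(passages, keywords):
    """Group passages by the subset of query keywords each one contains."""
    clusters = defaultdict(list)
    for passage in passages:
        words = {w.strip(".,;:?!").lower() for w in passage.split()}
        present = frozenset(k for k in keywords if k in words)
        if present:  # passages with no query keyword are dropped
            clusters[present].append(passage)
    # Assumption: clusters covering more keywords are shown first.
    return sorted(clusters.items(), key=lambda item: -len(item[0]))


passages = [
    "Tremor is common in Parkinson disease.",
    "Propranolol reduces essential tremor.",
    "Parkinson disease is progressive.",
]
for keyword_set, members in cluster_by_keywords(passages, ["tremor", "parkinson"]):
    print(sorted(keyword_set), members)
```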

2.2. Automatic Speech Recognition

Automatic speech recognition is the well-known task of converting an acoustic signal into the string of words that generated that signal. In this work we take advantage of data that was available from a related experiment using domain adaptation to perform ASR on biomedical text [1].

Hidden Markov Models (HMMs) [6] are commonly used for ASR. HMMs use Bayes’s Law (Equation 2 below) and Markov independence assumptions (Equation 3) to rewrite the probability of a sequence of ‘hidden’ states (h) given observations (o) in terms of a Language Model (ΘL) and an Observation Model (ΘO). The language model is typically defined on the level of word n-grams (illustrated as bigrams in Equation 3), while the observation model estimates the probability of observed evidence given a current hypothesized phoneme. The probability of a sequence is then the cumulative product of the language model transitions and the observation probabilities. The Viterbi algorithm [7] is used to extract the most likely sequence.

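$$
\begin{aligned}
\hat{h}_{1..T} &= \operatorname*{argmax}_{h_{1..T}} P(h_{1..T} \mid o_{1..T}) && (1) \\
&= \operatorname*{argmax}_{h_{1..T}} P(h_{1..T}) \, P(o_{1..T} \mid h_{1..T}) && (2) \\
&\overset{\text{def}}{=} \operatorname*{argmax}_{h_{1..T}} \prod_{t=1}^{T} P_{\Theta_L}(h_t \mid h_{t-1}) \, P_{\Theta_O}(o_t \mid h_t) && (3)
\end{aligned}
$$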
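The maximization in Equation 3 can be illustrated with a small Viterbi decoder. This is a textbook sketch over a toy two-state model with made-up probabilities, not the SRI recognizer; `start_p` plays the role of the transition out of an initial state, and log-probabilities are used to avoid numerical underflow.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence under a bigram HMM."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor maximizing path score times transition probability
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev
    # Follow back-pointers from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy model with hypothetical numbers (not a speech model).
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

print(viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p))
# ['Healthy', 'Healthy', 'Fever']
```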

In this work, the speech recognizer used is the SRI system [8] that had the best performance in domain adaptation experiments by Liu et al. [1] Domain adaptation is the building of domain-specific models by combining sparse models trained on domain-specific data only (ΘD) with models built from domains with more available data (base model ΘB), with the relative weight of each model determined by a parameter λ. Liu et al. used language model interpolation to build an adapted model ΘA:

$$
P_{\Theta_A}(h_t \mid h_{t-1}) = \lambda \, P_{\Theta_B}(h_t \mid h_{t-1}) + (1 - \lambda) \, P_{\Theta_D}(h_t \mid h_{t-1}) \tag{4}
$$
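As a minimal sketch of Equation 4, the snippet below linearly interpolates two bigram tables. The bigram probabilities and the value of λ are invented for illustration; treating bigrams missing from a table as probability 0 is a simplifying assumption (a real language model would smooth them).

```python
def interpolate_bigram(bigram, base_lm, domain_lm, lam):
    """Equation 4: lam * P_base + (1 - lam) * P_domain for one bigram."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Assumption: unseen bigrams get probability 0 (no smoothing in this sketch).
    p_base = base_lm.get(bigram, 0.0)
    p_domain = domain_lm.get(bigram, 0.0)
    return lam * p_base + (1.0 - lam) * p_domain


# Hypothetical probabilities P(second word | first word).
base_lm = {("blood", "pressure"): 0.02, ("blood", "glucose"): 0.001}
domain_lm = {("blood", "pressure"): 0.10, ("blood", "glucose"): 0.08}

for bigram in base_lm:
    print(bigram, interpolate_bigram(bigram, base_lm, domain_lm, lam=0.5))
```

Setting λ closer to 0 trusts the sparse domain-specific model more, while λ closer to 1 falls back on the base model.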

Liu et al. describe experiments using two different ASR systems under a variety of parameter settings and against a number of different inputs. Experiments below use the configuration with the best performance, as this can be thought of as a reasonable lower bound on future performance. That setting had as input a set of questions that were read by physicians (not spontaneously spoken), and attained a word error rate of 29.3%.
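For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, insertions, and deletions) between the reference and the hypothesis, divided by the number of reference words. The sketch below follows that standard definition; the example sentences are hypothetical and not drawn from the study's data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution plus one insertion against six reference words.
wer = word_error_rate("what is the dose of metformin",
                      "what is the dose of met forming")
print(f"{wer:.3f}")  # 0.333
```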

3. Methods

The goal of this experiment was to determine the extent to which incorrect analysis of an automatic speech recognition component can harm the overall value of a question answering system. Therefore, the experimental design used an existing question answering system (AskHERMES) and asked three domain experts (physicians) to evaluate the quality of answers. Subjects were shown the gold standard question transcription, and below that were shown two sets of answers side by side, one in which the QA system was given the gold standard question transcription as input, and another in which the QA system was given the ASR hypothesis transcription as input. The subjects were blind to which answer was based on the gold standard transcription and which was based on ASR output. The placement of each answer was determined randomly for each question, to control for bias for answers on one side or another.
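The randomized, blinded placement can be sketched in a few lines. The function name, question identifiers, and the optional seed are illustrative assumptions; the study's own survey software is not described at this level of detail.

```python
import random


def assign_sides(question_ids, seed=None):
    """Randomly decide, per question, which side shows the gold-standard answer."""
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    placement = {}
    for question_id in question_ids:
        gold_left = rng.random() < 0.5
        placement[question_id] = {
            "left": "gold" if gold_left else "asr",
            "right": "asr" if gold_left else "gold",
        }
    return placement


# The mapping stays on the server; subjects only ever see "left" and "right".
print(assign_sides(["q1", "q2", "q3"], seed=0))
```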

Subjects were asked to determine whether the left or right answer was superior, or whether they were equal. Since question answering in the medical domain is a very difficult problem, and AskHERMES is intended for written questions, we considered the possibility that the output would not be good in either condition. To account for this possibility, we split the ‘equal’ value into ‘equally good’ and ‘equally bad.’ When computing statistics for the results we then left out the ‘equally bad’ questions.

All users were given unique subject numbers and URLs so that they could complete the experiment on their own time and without being on site. Their position in the experiment was saved so that they did not have to complete the entire survey at one time. Response times for each question were logged, both as a potential variable of interest, and to make sure that enough time was spent on the questions to give them enough thought (as expected, this was not an issue).

Five questions were correctly recognized by the ASR system, so these were counted as ‘equally good’ automatically, without presenting them to the users. The set of input questions then included only those in which the hypothesis was different from the gold standard, and all questions from the gold standard had punctuation removed to simulate the output from a perfect ASR transcription. There were approximately 160 questions given these constraints. To make the best use of the data, it was divided into three sets of 50 questions, with ten questions left over. The ten questions were given to all subjects in order to get a preliminary reading on agreement. Each of the three sets of 50 questions remaining was given to one subject only.

The results from AskHERMES were generated ahead of time to decrease loading time (with left/right placement randomized at load time). The output from AskHERMES was extracted using simple pattern matching of the HTML source, and included only the generated answer portion of the output, excluding the extra information from the interface such as related questions and other display modes. As described in Section 2, the answer is headed by a labeled cluster of keywords extracted from the question. Since there is some uncertainty in question answering, a ranked set of answers is included. This list of answers was not modified in any way (to prevent any potential bias), and the answer lists were sometimes quite long. Subjects were thus explicitly instructed that they need not exhaustively review all the answers, and to spend only as much time on each question as they might be willing to spend when interacting with a QA system in a clinical setting. We believe this resulted in a more realistic test setting while also preventing subjects from spending too much of their time on the experiment.

To analyze the results, we first compared the results against the hypothesis that IR should be able to make up for the errors in the ASR system, and thus the gold standard question answering should have the same acceptability rate as the question answering based on speech input. This requires first calculating the percentage of acceptable answers from each source. For the gold standard acceptance rate, we simply add the number of ‘equally good’ responses to the number in which the gold standard was selected as better and divide by the total number of trials. For the ASR input acceptance rate, we add the ‘equally good’ responses to the number in which the ASR output answer was selected as better and divide by the total number of trials.

We also measured the results against the hypothesis that the acceptability rate for spoken QA should be the product of the ASR accuracy (1 − WER) and the baseline acceptability rate of the QA system on text input (1 − QER, where QER is the QA error rate). In this case, our ASR system had a word error rate (WER) of approximately 30%. We make the assumptions that the WER applies uniformly across word types, and that there is one most important keyword per utterance, and thus apply the 30% word error rate at the level of the input string. For example, in a question about diabetes, there is a 30% chance of missing the word ‘diabetes,’ which should make it impossible for IR to return the relevant documents. So under these assumptions we should expect 70% of the ASR results to be usable by the IR component for returning a reasonable set of documents.

4. Results

Results showed first of all that 83 of the answers (54%) were ‘equally bad.’ There were 72 remaining trials, in which the gold standard was considered superior 38 times (25%), the ASR hypothesis was considered superior 11 times (7%), and the remaining 23 trials (15%) were judged ‘equally good’ (percentages do not sum to 100 due to rounding error). In total, the subjects judged that using gold standard input the question was answered acceptably in 61 out of 155 trials (39.4%). Using the ASR input the question was answered acceptably in 34 out of 155 trials (21.9%).
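The acceptance rates follow directly from the four judgment counts reported above; the short sketch below simply redoes that arithmetic (the dictionary keys are illustrative names).

```python
# Judgment counts as reported in the Results section.
counts = {"gold_better": 38, "asr_better": 11, "equally_good": 23, "equally_bad": 83}


def acceptance_rates(counts):
    """Acceptable = 'equally good' plus the trials where that source was preferred."""
    total = sum(counts.values())
    gold = (counts["gold_better"] + counts["equally_good"]) / total
    asr = (counts["asr_better"] + counts["equally_good"]) / total
    return gold, asr


gold_rate, asr_rate = acceptance_rates(counts)
print(f"gold: {gold_rate:.1%}, ASR: {asr_rate:.1%}")  # gold: 39.4%, ASR: 21.9%
```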

Evaluating these results against hypothesis one above, Pearson’s chi-squared significance test shows that the lower than expected acceptability result is in fact statistically significant (p < 0.001). This suggests that IR is not good enough to completely compensate for the errors in ASR. However, this hypothesis might be expecting too much of IR, which motivated the second analysis.
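The paper does not spell out how the chi-squared test was set up. One plausible reading, sketched below as an assumption, is a goodness-of-fit test of the ASR outcome (34 acceptable, 121 not) against the counts expected if ASR input matched the gold standard rate (61 acceptable, 94 not). With two categories there is one degree of freedom, for which the p-value has the closed form erfc(sqrt(χ²/2)).

```python
import math


def chi_square_gof(observed, expected):
    """Pearson goodness-of-fit statistic and p-value for two categories (df = 1)."""
    if len(observed) != 2 or len(expected) != 2:
        raise ValueError("this sketch only handles two categories (df = 1)")
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Assumed layout: [acceptable, not acceptable] out of 155 trials.
stat, p_value = chi_square_gof(observed=[34, 121], expected=[61, 94])
print(f"chi2 = {stat:.2f}, p = {p_value:.2g}")  # chi2 is roughly 19.7
```

Under this reading the p-value falls well below 0.001, in line with the reported result.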

Recall that the second analysis method described above sets the expected acceptability rate based on a combination of the error rate of the ASR and the error rate of the QA system on gold standard input. Based on the assumptions described in the methodology, one might expect the number of acceptable answers to ASR input to be 70% of 39.4%, for a total of 27.5% ([1 − WER] * [1 − QER]). In fact, 22% of answers based on ASR output were judged to answer the question. This is lower than the expected value, though Pearson’s chi-squared test finds that this difference is not statistically significant (p > 0.10). Regardless, the hypothesis that the robustness of IR would lead to better than expected performance is not supported by this evidence.
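The arithmetic behind this comparison can be worked through explicitly. The chi-squared statistic below assumes a two-category goodness-of-fit layout (acceptable versus not acceptable, out of 155 trials), which the paper does not state; it is shown only because it is consistent with the reported p-value.

$$
\begin{aligned}
\text{expected rate} &= (1 - \mathrm{WER}) \times (1 - \mathrm{QER}) = 0.70 \times \tfrac{61}{155} \approx 0.275 \\
\text{expected count} &= 0.70 \times 61 = 42.7 \quad (\text{observed: } 34 \text{ of } 155) \\
\chi^2 &= \frac{(34 - 42.7)^2}{42.7} + \frac{(121 - 112.3)^2}{112.3} \approx 1.773 + 0.674 \approx 2.45
\end{aligned}
$$

With one degree of freedom, the critical value at the 0.10 level is 2.706, so a statistic of about 2.45 agrees with the reported p > 0.10.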

One possible interpretation of these results is that the assumptions above are flawed: for example, it may be necessary to recognize several particular words correctly rather than just the single most important word. Another interpretation is that the robustness of IR is not enough to overcome ASR errors because ASR errors can manifest in two places: first in the document retrieval, and again in the summaries.

5. Discussion

One interesting result is that a number of hypothesized ASR transcripts with errors were still chosen over the gold standard. Given a high number of speech recognition errors, one might expect the results to be split exclusively between the ‘equally good’ and the gold standard. We investigated this further, and this seems to be caused by long inputs, errors in keyword extraction, and lucky guesses on erroneous input. For example, here is one considerably long question:

This morning I saw a 70-year-old man with left-sided shoulder pain and numbness his biceps and triceps were weak and he had atrophy of his triceps got cortisone injection by orthopedist 2 weeks ago with no improvement, normal x-rays high dose nonsteroidal anti-inflammatory drugs no help I didn’t know what to make of it neuropathy I called a neurologist and ended up referring him

Due to the length of this question, the keyword extraction algorithm that is part of the QA system is vulnerable. The extensive background context provided by the physician increases the number of potential mistakes for an automated algorithm to make while selecting and weighting the most important keywords. However, the ASR hypothesis leaves out a number of words due to recognition errors. By chance, in this example these misrecognized words are not as pertinent to answering the question, and do not ‘distract’ the keyword extraction algorithm. The remaining keywords represent the question well, and as a result, the top answers for the ASR output are better at answering the question than the top answers for the gold standard transcript.

These experiments suggest that there is more work required in the biomedical domain to bring spoken question answering even to the level of regular question answering. There are two good reasons why we think this is the case.

First, automatic speech recognition in specific domains can be quite difficult. In the medical domain, not only is the language model different, there is an essentially entirely new sub-vocabulary consisting of medical jargon that does not appear at all in broad domains used for language model adaptation. In addition, medical questions (in the corpus used here) often contain an extensive context, and are thus quite long. While word error rates in the broad domain may be acceptable, it appears that in the medical domain it is too likely that an important keyword will be misrecognized. As the analysis above suggests, it is then a matter of chance which keywords are recognized and thus which document types are returned.

Second, beyond document retrieval, there is an answer generation/summarization component of question answering that is reliant on both the document retrieval and the speech recognition. This has the potential to cause compounding errors, so that even if the document retrieval works well enough, the summarization component may erroneously generate passages that emphasize the wrong keywords, leading to results that are doubly wrong.

This experiment was partially inspired by and modeled after a similar experiment in the open domain [2], as mentioned in Section 1. However, the experiments are not exactly analogous, because that work evaluated document retrieval and this work uses question answering, a task comprising document retrieval as well as a number of other components. There are several reasons for doing this. First, QA is viewed as the actual application for use by end users. As mentioned in the Introduction, this assumes that physicians do not want to ask questions and look through documents; they want answers. Thus, it makes sense to evaluate the QA framework here and not simply the document retrieval component.

Second, evaluating QA directly probably provides more precise feedback. Returning documents that have the same ‘gist’ as the question may be sufficient to score well in a document retrieval evaluation, but that is not good enough for QA. With QA, the evaluation criterion is ‘Did the results answer the question?’ By directly evaluating the QA system, the question we asked our experts was much more precise, making it easier to answer and more valuable for analysis.

Consideration of these factors suggests that these experiments should be replicated in the open domain. After all, document retrieval is currently a widely used paradigm for end users (e.g., search engines on the World Wide Web), but presumably question answering is seen as a logical direction for many use cases. In that case, we speculate that similar results would occur in the open domain, requiring additional work on open-domain ASR to make generalized spoken question answering accurate.

There is another issue with spoken question answering, unexplored by these experiments, which also presents obstacles. One motivating case for spoken QA is non-standard computing environments where a standard keyboard/mouse/monitor setup is not available. In the experiments presented here we focus on the input modality, and used the standard output interface (a list of answers on a webpage displayed on a monitor). For the use case described above, a different output modality will be required, most likely machine-generated speech. This increases the burden on the QA system to provide answers that are concise, complete, and correctly ranked (so that the first answer is the best answer). This topic is not examined here, but is obviously relevant, and will hopefully see great improvements in the near future under the banner of general QA research.

In conclusion, spoken question answering is a potentially transformational technology, but improvements in automatic speech recognition in the medical domain are probably required before this technology is usable.

Acknowledgments

The project described was supported by NIH Grant Number 1R01LM009384. We would also like to acknowledge the contribution of Carmelo Gaudioso, M.D., as an expert subject.

Footnotes

References

  • [1]. Liu Feifan, Tur Gokhan, Hakkani-Tür Dilek, Yu Hong. Towards spoken clinical question answering: Evaluating automatic speech recognition systems for clinical spoken questions. Journal of American Medical Informatics Association. 2011 doi: 10.1136/amiajnl-2010-000071.
  • [2]. Okurowski Mary Ellen, Wilson Harold, Urbina Joaquin, Taylor Tony, Clark Ruth Colvin, Krapcho Frank. NAACL-ANLP 2000 Workshop on Automatic summarization - Volume 4. Seattle, Washington: Association for Computational Linguistics; 2000. Text summarizer in use: lessons learned from real world deployment and evaluation; pp. 49–58.
  • [3]. Cao YongGang, Liu Feifan, Simpson Pippa, Antieau Lamont, Bennett Andrew, Cimino James J, Ely John, Yu Hong. AskHERMES: an online question answering system for complex clinical questions. Journal of Biomedical Informatics. 2011 doi: 10.1016/j.jbi.2011.01.004. In Press, Accepted Manuscript.
  • [4]. Aronson AR. Effective mapping of biomedical text to the UMLS metathesaurus: the MetaMap program. Proceedings of the AMIA Symposium; 2001. pp. 17–21. PMCID: 2243666.
  • [5]. Yu H, Cao YG. Automatically extracting information needs from ad hoc clinical questions. AMIA Annu Symp Proc; 2008. pp. 96–100.
  • [6]. Rabiner LR. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE; 1989. pp. 257–286.
  • [7]. Viterbi A. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. Information Theory, IEEE Transactions on. 1967;13(2):260–269.
  • [8]. Stolcke Andreas, Anguera Xavier, Boakye Kofi, Çetin Özgür, Janin Adam, Arindam M, Peskin Barbara, Wooters Chuck, Zheng Jing. Further progress in meeting recognition: The ICSI-SRI spring 2005 speech-to-text evaluation system. Machine Learning for Multimodal Interaction: Second International Workshop, MLMI 2005. Volume 3869 of Lecture Notes in Computer Science. 2005;78:463–4752.
