See also the article by Chaudhari et al in this issue.

Aaron C. Abajian, MD, MS, is a resident physician in diagnostic radiology at the University of Washington in Seattle. This coming July, he will begin a 1-year independent interventional radiology residency at Memorial Sloan Kettering Cancer Center in New York. His background is in software development. He holds undergraduate and master's degrees in computer science and has worked as a software engineer for radiology and medical informatics companies. His most recent position focused on natural language processing for health care.

Hoiwan Cheung, MD, is a resident physician in diagnostic radiology at the University of Washington in Seattle. This coming July, she will be starting her fellowship in musculoskeletal radiology at the Hospital for Special Surgery in New York.
Speech recognition (SR) plays an important role in the workflow of practicing radiologists. Progressive growth in imaging volumes and infrastructure has increased the daily diagnostic workload. SR technologies improve efficiency by reducing report dictation times and may reduce errors by standardizing phrases with verbal macros (1). However, the most prominent SR software programs introduce errors that are difficult to identify using traditional spelling- and grammar-checking engines. For example, a correctly spelled word in a grammatically correct context may have little clinical meaning (eg, “There is a 10 mm renal bone (stone).”). More challenging are dictation errors that make clinical sense but convey an incorrect result (eg, “New (No) bowel obstruction.”). While referring providers may understand the intended meaning, patients are increasingly reviewing their own radiology reports and are reaching out to radiologists for clarification and correction. It is worthwhile to consider whether such errors may be reduced or eliminated using modern technologies (2).
At a basic level, SR is a task of word completion. SR engines match spoken words against a list of phonetically similar candidates. The challenge is identifying the right word given all available information. Consider the impression, “No renal ______ is present. No CT explanation for flank pain.” Should the blank be filled with stone, bone, home, or cone? The correct completion depends on local context and remote information. Local context refers to the immediately surrounding words and is affected by their definitions, pluralities, tenses, and parts of speech. Certainly “stone” is more likely to follow “renal” than the other candidates.
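The idea of ranking phonetically similar candidates by local context can be sketched with a toy frequency model. The bigram counts below are invented for illustration; a production SR engine would use a full statistical or neural language model rather than a lookup table.

```python
# Toy illustration of local-context ranking: score each candidate word by
# how often it follows the preceding word in a (hypothetical) report corpus.
from collections import defaultdict

# Invented counts standing in for corpus statistics.
bigram_counts = defaultdict(int, {
    ("renal", "stone"): 950,
    ("renal", "bone"): 3,
    ("renal", "home"): 0,
    ("renal", "cone"): 0,
})

def rank_candidates(prev_word, candidates):
    """Order phonetically similar candidates by local-context frequency."""
    return sorted(candidates,
                  key=lambda w: bigram_counts[(prev_word, w)],
                  reverse=True)

print(rank_candidates("renal", ["bone", "stone", "home", "cone"]))
# "stone" ranks first because it follows "renal" far more often
```

Even this crude model resolves the example above; the harder cases are those, like “New/No bowel obstruction,” where local context alone cannot decide.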
Remote information refers to both previously dictated sentences and yet-to-be-dictated words. In the example above, the subsequently transcribed sentence indicates that the patient has flank pain. Although this information comes after the completion in question, it nonetheless informs what the completion should be. We might also expect “flank pain” to occur in the clinical indication at the beginning of the radiology report. Regardless of where “flank pain” occurs, it supports “stone” as the right completion. Recent developments in machine learning, particularly deep learning, have provided a means to incorporate both local context and remote information when performing SR.
Two broad developments—word embeddings and transformer models—have addressed the issues of local context and remote information. Embeddings convert each word into a set of numbers called a vector. Algebraic operations on word vectors numerically capture similarities between words (3). Words with similar meanings, such as “heart attack” and “myocardial infarction,” are close together under a specific distance measure. By design, similar words appear in similar contexts. For example, “The king/queen created a new tax,” reads naturally because the word vectors for king and queen both fit this context. Conversely, “No renal bone is present,” does not fit the expected local context for the word vector for “bone.”
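The “close together under a specific distance measure” idea can be made concrete with cosine similarity, the measure most commonly used with word embeddings. The two-dimensional vectors below are invented for illustration; real embeddings have hundreds of dimensions learned from large text corpora.

```python
import math

# Hypothetical low-dimensional embeddings; real models learn these from data.
vectors = {
    "king":  [0.90, 0.80],
    "queen": [0.88, 0.82],
    "stone": [-0.70, 0.50],
}

def cosine(u, v):
    """Cosine similarity: 1.0 for parallel vectors, lower for dissimilar ones."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Similar words point in similar directions, so their similarity is near 1.
print(cosine(vectors["king"], vectors["queen"]))  # close to 1.0
print(cosine(vectors["king"], vectors["stone"]))  # much lower
```

Interchangeability in context is exactly what this geometry encodes: “king” and “queen” can both complete “The ____ created a new tax,” so their vectors lie close together.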
Word embeddings help determine what words make sense for the local context, but they do not solve the problem of remote information. Both “New” and “No” fit the context of “____ bowel obstruction,” but which of these is correct? As discussed above, we should seek remote information. If we see, “No evidence of obstruction” or “Bowel: Normal” before or after, certainly “No” is the more likely completion for the impression.
Transformer models address this issue by asking, “If this word were chosen, would the sentence make sense given the preceding sentence?” This process proceeds from the first sentence to the last. All pairs of adjacent sentences are thus checked for congruency. Consider again the impression, “No acute findings. ____ bowel obstruction.” Should the blank be filled with “New” or “No”? The previous sentence provides insight into future transcribed text. We saw another example earlier where “No CT explanation for flank pain,” informed that “stone” was the correct completion for the immediately preceding sentence.
Under the hood, transformer models maintain sentence encodings much like word-vector embeddings. A sentence encoding attempts to capture the gist of a sentence numerically to compare it to its surrounding sentences. The latest transformer models are thus aptly named bidirectional encoder representations from transformers (BERT) (4). The term bidirectional refers to the fact that all words in the two adjacent sentences are considered simultaneously, as opposed to being read left-to-right. BERT models are the cutting edge of SR outside of medicine; Google uses BERT models as the foundation of its SR on Google.com and the Android platform (5). BERT models perform remarkably well when pretrained on the type of documents they are transcribing.
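A minimal sketch of the sentence-encoding idea: represent each sentence as the average of its word vectors and compare adjacent sentences by cosine similarity. The word vectors below are invented, and a real transformer learns far richer encodings than averaging, but the congruency check works the same way.

```python
import math

# Invented word vectors; a real encoder learns these representations.
word_vecs = {
    "no": [1.0, 0.1], "new": [-0.9, 0.2],
    "acute": [0.3, 0.8], "findings": [0.2, 0.7],
    "bowel": [0.1, 0.5], "obstruction": [0.0, 0.6],
}

def encode(sentence):
    """Toy sentence encoding: the average of the sentence's word vectors."""
    vecs = [word_vecs[w] for w in sentence.lower().split() if w in word_vecs]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Which completion is more congruent with the preceding sentence?
context = encode("No acute findings")
for candidate in ("No bowel obstruction", "New bowel obstruction"):
    print(candidate, round(cosine(context, encode(candidate)), 3))
```

Under these assumed vectors, “No bowel obstruction” scores higher than “New bowel obstruction” because its encoding lies closer to that of “No acute findings.”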
You can try dictating a radiology report using Google Docs' SR engine. It easily handles common English idioms but stumbles on specific radiology terminology. This is because that engine has not been specifically trained to handle radiology reports. By teaching a BERT model the expected terminology being transcribed, its contextual performance can reach that of a human as measured by a standardized benchmark (6). Once such a model is created, it can be used to augment primary dictation in that domain (7). Alternatively, trained BERT models may be used to correct existing dictations.
The latter task is presented by Chaudhari et al in this issue of Radiology: Artificial Intelligence (8). The authors take an existing, well-established BERT model and train it on a large corpus of radiology reports. The resulting model is then applied to radiology reports transcribed using existing SR software. Their BERT model calculates whether each transcribed word makes sense, without having knowledge of the original audio that led to that transcription. As the authors point out, it can identify dictation omissions, additions, and incorrect substitutions.
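This audio-free error-detection workflow can be sketched as follows. The probability function below is a hypothetical stand-in for querying a domain-tuned masked language model; the threshold and scores are invented for illustration, not taken from the authors' work.

```python
# Hedged sketch of post hoc dictation checking: flag transcribed words that a
# language model considers unlikely in their context, with no access to audio.
def toy_word_probability(left, word, right):
    # Invented scores; a real system would query a masked LM here.
    unlikely = {("renal", "bone"), ("mm", "bone")}
    return 0.01 if (left, word) in unlikely else 0.6

def flag_suspect_words(report, threshold=0.05):
    """Return (position, word) pairs whose contextual probability is low."""
    words = report.lower().replace(".", "").split()
    flags = []
    for i, w in enumerate(words):
        left = words[i - 1] if i > 0 else ""
        right = words[i + 1] if i + 1 < len(words) else ""
        if toy_word_probability(left, w, right) < threshold:
            flags.append((i, w))
    return flags

print(flag_suspect_words("There is a 10 mm renal bone."))
# flags "bone" as unlikely after "renal"
```

Because the checker scores words rather than audio, the same design can run retroactively over an archive of reports or live during dictation, as the next paragraph discusses.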
Their work is particularly noteworthy because it is readily applied. It could be adapted to existing dictation infrastructure to correct prior and future radiology reports. Radiologists may dictate using any available SR software and use the BERT model as a new type of universal “checker” that provides spelling, grammar, and before-after context correction. The correction may be applied live during dictation or retroactively before a report becomes visible to referring providers and patients. Adoption of such models can improve patient and provider confidence in our reports and save us from infamous dictation errors. Incorporating existing trained BERT models into our current dictation software could help radiologists communicate more clearly and effectively with referring providers by reducing dictation errors and the time spent correcting them.
Footnotes
Authors declared no funding for this work.
Disclosures of conflicts of interest: A.C.A. Software engineering consulting work for Verantos (no direct relation to the present work); former Radiology: Artificial Intelligence trainee editorial board member. H.C. No relevant relationships.
References
- 1. Hammana I, Lepanto L, Poder T, Bellemare C, Ly MS. Speech recognition in the radiology department: a systematic review. Health Inf Manag 2015;44(2):4–10.
- 2. Lee CI, Langlotz CP, Elmore JG. Implications of direct patient online access to radiology reports through patient web portals. J Am Coll Radiol 2016;13(12 Pt B):1608–1614.
- 3. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv 1301.3781 [preprint]. https://arxiv.org/abs/1301.3781. Posted January 16, 2013. Accessed June 8, 2022.
- 4. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 1810.04805 [preprint]. https://arxiv.org/abs/1810.04805. Posted October 11, 2018. Accessed June 8, 2022.
- 5. Nayak P. Understanding searches better than ever before. https://blog.google/products/search/search-language-understanding-bert/. Published October 25, 2019. Accessed June 8, 2022.
- 6. Wang A, Singh A, Michael J, Hill F, Levy O, Bowman SR. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv 1804.07461 [preprint]. https://arxiv.org/abs/1804.07461. Posted April 20, 2018. Accessed June 8, 2022.
- 7. Huang WC, Wu CH, Luo SB, et al. Speech recognition by simply fine-tuning BERT. In: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, June 6–11, 2021. Piscataway, NJ: IEEE, 2021; 7343–7347.
- 8. Chaudhari GR, Liu T, Chen TL, et al. Application of a domain-specific BERT for detection of speech recognition errors in radiology reports. Radiol Artif Intell 2022;4(4):e210185.
