See also the article by Steinkamp et al in this issue.
Dr Liu completed her PhD in biomedical informatics and postdoctoral training in neurosurgery at Stanford University. She received her Bachelor of Computer Science, with a bioinformatics option, from the University of Waterloo in Canada. Her research interests focus on applying machine learning and deep learning methods to solve biomedical problems. She previously worked at Veracyte and is currently a principal deep learning engineer at Roche Sequencing Solutions.
Introduction
Pathology reports store crucial information about clinicians’ observations and interpretations of tissue samples, as well as diagnoses. These reports often summarize the results of follow-up pathologic examinations conducted in response to abnormal radiologic imaging findings. Rapid, automatic extraction of information from these reports would greatly improve diagnostic workflow and provide clinical decision support. However, as in radiology reports, medical findings in pathology reports are often captured in free-text format. As a result, extracting information from these reports effectively is challenging because of the wide range of findings described and the variability of natural language descriptions (1). In this issue of Radiology: Artificial Intelligence, Steinkamp and colleagues addressed this challenge by applying machine learning models to classify pathology reports into four major classes of organ systems to facilitate radiology follow-up recommendations (2). They demonstrated that state-of-the-art neural network–based approaches (F1 score, approximately 96%) consistently outperformed conventional machine learning algorithms (best F1 score, approximately 94%, from extreme gradient boosting, or XGBoost) on this classification task. Moreover, they interpreted the internal representations used by the neural network–based algorithms, elucidating the important features the classifiers had learned. The computational approaches for automatic text classification of pathology reports presented in this study are applicable to other clinical scenarios facing similar challenges, such as information extraction from radiology reports.
With the recent availability of large amounts of training data and computational power, neural network–based algorithms, also referred to as deep learning, have achieved remarkable accuracy in natural language processing (NLP) applications, such as speech recognition and language translation, and in computer vision tasks, including image classification and object detection. Deep neural network–based methods have also shown promising results in biomedical applications. The two major classes of methods are the recurrent neural network (RNN) and the convolutional neural network (CNN). RNN algorithms are best suited to classification tasks involving sequential or temporal data. Pathology and radiology reports typically consist of sequences of words (called tokens in NLP), so their semantic representations can be naturally encoded by an RNN. The CNN, initially developed for computer vision tasks, has been successfully applied to biomedical images for automated cell segmentation on microscopic images (3), cancer metastasis detection on pathologic images (4), and lesion segmentation on radiologic images. More recently, CNNs have also shown remarkable performance in classifying text data. In the study by Steinkamp et al, both a CNN and an RNN were applied to the organ-level classification of pathology reports (2).
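To make the RNN approach concrete, the following is a minimal sketch of a sequence classifier for tokenized report text, written here in PyTorch. The architecture, layer sizes, and vocabulary size are illustrative assumptions, not the configuration used by Steinkamp et al.

```python
import torch
import torch.nn as nn

class ReportRNN(nn.Module):
    """Minimal LSTM classifier for tokenized report text (illustrative only)."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer indices into the vocabulary
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)      # final hidden state summarizes the sequence
        return self.fc(hidden.squeeze(0))         # (batch, num_classes) logits

# Example: classify a batch of two padded reports, each 50 tokens long
model = ReportRNN(vocab_size=20000)
logits = model(torch.randint(1, 20000, (2, 50)))
predicted_class = logits.argmax(dim=1)
```

Because the LSTM consumes the report token by token, its final hidden state can reflect dependencies spanning the entire sequence.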
In addition to the CNN and RNN, Steinkamp et al trained classifiers using three widely used conventional machine learning methods. They demonstrated that the CNN (sensitivity, 95.1%; specificity, 97.5%; F1 score, 96.3%) and RNN (sensitivity, 94.3%; specificity, 99.1%; F1 score, 96.7%) achieved better performance than random forest (sensitivity, 72.4%; specificity, 95.1%; F1 score, 82.8%), XGBoost (sensitivity, 93.5%; specificity, 94.3%; F1 score, 93.9%), and support vector machines (sensitivity, 82.9%; specificity, 98.0%; F1 score, 89.9%) in classifying pathology reports into the relevant organ classes (2). Another study, by Chen et al, applied deep learning to classify free-text radiology reports for extracting pulmonary embolism findings and showed that a CNN-based approach yielded better classification results than traditional NLP approaches (5). Qiu et al likewise confirmed the superior performance of a CNN over conventional methods in extracting International Classification of Diseases codes from a large set of breast and lung cancer pathology reports (6). Traditional machine learning methods require an additional feature engineering and extraction step, which often relies on domain expert knowledge, before the data are fed into the classifiers. For example, before applying the conventional machine learning algorithms, Steinkamp et al first performed feature extraction using a previously developed word frequency method to generate the input. In contrast, this feature extraction step is not needed with the neural network–based methods, because encodings of semantic representations of the input data are learned during training. In addition, Steinkamp and colleagues demonstrated that the neural network–based algorithms outperformed a baseline model using simple string matching (sensitivity, 99.1%; specificity, 60.3%; F1 score, 75.2%) by more than 21 percentage points in F1 score. Establishing that a more complex method outperforms simple baseline approaches is an important result.
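As an illustration of the explicit feature extraction step that conventional methods require, the sketch below builds a word-frequency pipeline with scikit-learn. The sample reports, labels, and the TF-IDF vectorizer are hypothetical stand-ins for the study's word frequency method, not its actual implementation.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Toy examples for illustration only; real training data would be labeled reports.
reports = ["specimen shows invasive ductal carcinoma of the breast",
           "colon biopsy with tubular adenoma, no high-grade dysplasia"]
labels = ["breast", "gastrointestinal"]

pipeline = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),  # explicit word-frequency features
    RandomForestClassifier(n_estimators=200, random_state=0),
)
pipeline.fit(reports, labels)
print(pipeline.predict(["breast core needle biopsy with fibroadenoma"]))
```

The vectorizer step makes the feature engineering visible: the classifier never sees raw text, only the hand-specified frequency representation.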
The unstructured free-text format of medical reports hinders automatic extraction of clinically relevant information. In comparison, medical imaging data are intrinsically structured as matrices of pixel or voxel values and can be fed directly into CNN algorithms. Fortunately, algorithms such as Global Vectors for Word Representation (GloVe) have recently been developed to capture semantic relationships between words by learning pretrained word vectors from global word co-occurrence statistics (7). In their study, Steinkamp et al mapped the tokens of each report into the GloVe feature space and used the resulting embeddings as the input to the deep learning models (2). This approach uses vectors precomputed from a large corpus of natural language text and, unlike conventional machine learning methods, does not require human intervention or expert domain knowledge to identify relationships between words.
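Mapping tokens into the GloVe feature space can be sketched as follows. The file name refers to a standard pretrained distribution from the Stanford NLP group; the specific GloVe variant and dimensionality used in the study are assumptions here.

```python
import numpy as np

def load_glove(path):
    """Load pretrained GloVe vectors from a whitespace-separated text file."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

glove = load_glove("glove.6B.100d.txt")  # assumed 100-dimensional pretrained vectors

# Map a tokenized report into the GloVe feature space; unknown tokens get zero vectors.
tokens = "invasive ductal carcinoma of the breast".split()
embedded = np.stack([glove.get(t, np.zeros(100, dtype=np.float32)) for t in tokens])
print(embedded.shape)  # (num_tokens, 100), ready to feed into a CNN or RNN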
The two neural network–based models, the CNN and RNN, had similar classification performance in terms of sensitivity, specificity, and F1 score; the authors concluded that the RNN was the best performing model, with a slightly higher F1 score than the CNN. Another common metric for evaluating such classification models is the area under the receiver operating characteristic curve, which summarizes the trade-off between sensitivity and specificity across decision thresholds. Beyond classification performance, other factors such as computational complexity and memory usage must be taken into account when deploying deep learning methods in clinical applications. The CNN uses a sliding window to examine locally adjacent words, whereas the RNN can capture semantic dependencies across long distances in a sequence. One advantage of the CNN over the RNN is runtime: computation in an RNN proceeds sequentially, token by token, so its runtime may become a bottleneck, whereas the CNN allows parallel processing because each window of the input is processed independently of the others. However, the CNN requires more memory than the RNN, as it processes data in layers and stores the intermediate features from each layer in memory to propagate them to the next layer.
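For contrast with the RNN sketch above, here is a minimal one-dimensional CNN text classifier in PyTorch. Again, the window size, filter count, and other hyperparameters are illustrative assumptions, not the study's configuration.

```python
import torch
import torch.nn as nn

class ReportCNN(nn.Module):
    """Minimal 1D-CNN text classifier (illustrative only)."""
    def __init__(self, vocab_size, embed_dim=100, num_filters=128,
                 window_size=3, num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # The convolution slides a window of `window_size` tokens over the sequence;
        # all window positions are computed in parallel, unlike an RNN's sequential steps.
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=window_size)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))                   # (batch, num_filters, positions)
        x = x.max(dim=2).values                        # global max pooling over positions
        return self.fc(x)                              # (batch, num_classes) logits

model = ReportCNN(vocab_size=20000)
logits = model(torch.randint(1, 20000, (2, 50)))
```

The max-pooling step keeps only the strongest filter response per feature, which is why the intermediate per-position activations must be held in memory before pooling.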
Machine learning models are often regarded as black boxes because their internal representations are difficult to interpret. A key contribution of the study by Steinkamp et al is the identification and interpretation of the latent features learned and used by the neural network–based classifiers. The study identified the salient words that most strongly influenced the classification of each pathology report, and the results were visualized by highlighting those words in the text, allowing easy interpretation (2). The authors also investigated, for each classification method, whether occluding certain words changed the classification results, and showed that the neural network–based approaches were more robust to occlusion than the conventional machine learning methods.
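The general idea behind such an occlusion analysis can be sketched as follows. The helper occlusion_saliency is hypothetical (applicable, for example, to the ReportRNN sketch above); the study's exact occlusion procedure may differ.

```python
import torch

def occlusion_saliency(model, token_ids, target_class, pad_id=0):
    """Score each token by how much masking it reduces the target-class probability.

    token_ids: (1, seq_len) tensor for a single tokenized report.
    """
    model.eval()
    with torch.no_grad():
        base = torch.softmax(model(token_ids), dim=1)[0, target_class]
        scores = []
        for i in range(token_ids.size(1)):
            occluded = token_ids.clone()
            occluded[0, i] = pad_id            # mask out one token at a time
            prob = torch.softmax(model(occluded), dim=1)[0, target_class]
            scores.append((base - prob).item())  # large drop -> salient token
    return scores
```

Tokens whose removal causes the largest drop in predicted probability are the ones the classifier relies on most, and they can be highlighted in the report text for inspection.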
The deep learning algorithms investigated in this study achieved excellent performance in classifying pathology reports into four relevant organ classes. This is not a trivial task, as revealed by the high false-positive rate of the simple string matching baseline (2). However, to truly expedite diagnostic decisions, extraction and classification of information at a finer-grained level may be required. Although limited by the availability of training data, the study also demonstrates the feasibility of a more complex 12-organ classification task. Future studies with more detailed labels and more complex tasks may be warranted.
In conclusion, the computational techniques developed in this study are generalizable to analyzing radiology reports. The successful clinical application of these neural network–based methods can improve diagnostic decision making.
Footnotes
This work is the author’s own opinion and not necessarily the opinion of her employer.
Current address: Roche Sequencing Solutions, Santa Clara, Calif
Disclosures of Conflicts of Interest: T.T.L. disclosed no relevant relationships.
References
1. Hassanpour S, Langlotz CP. Unsupervised topic modeling in a large free text radiology report repository. J Digit Imaging 2016;29(1):59–62.
2. Steinkamp JM, Chambers CM, Lalevic D, Zafar HM, Cook TS. Automated organ-level classification of free-text pathology reports to support a radiology follow-up tracking engine. Radiol Artif Intell 2019;1(5):e180052.
3. Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham, Switzerland: Springer, 2015; 234–241.
4. Liu Y, Gadepalli K, Norouzi M, et al. Detecting cancer metastases on gigapixel pathology images. arXiv 1703.02442 [preprint]. https://arxiv.org/abs/1703.02442. Posted March 3, 2017. Accessed June 28, 2019.
5. Chen MC, Ball RL, Yang L, et al. Deep learning to classify radiology free-text reports. Radiology 2018;286(3):845–852.
6. Qiu JX, Yoon HJ, Fearn PA, Tourassi GD. Deep learning for automated extraction of primary sites from cancer pathology reports. IEEE J Biomed Health Inform 2018;22(1):244–251.
7. Pennington J, Socher R, Manning C. GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014; 1532–1543.