Author manuscript; available in PMC: 2020 Aug 1.
Published in final edited form as: Int J Med Inform. 2019 Jun 6;129:81–87. doi: 10.1016/j.ijmedinf.2019.05.021

Identifying incidental findings from radiology reports of trauma patients: An evaluation of automated feature representation methods

Gaurav Trivedi a,c,*, Charmgil Hong c, Esmaeel R Dadashzadeh b,d, Robert M Handzel d, Harry Hochheiser a,b, Shyam Visweswaran a,b
PMCID: PMC6717529  NIHMSID: NIHMS1037843  PMID: 31445293

Abstract

Background:

Radiologic imaging of trauma patients often uncovers findings that are unrelated to the trauma. These are termed incidental findings, and identifying them in radiology examination reports is necessary for appropriate follow-up. We developed and evaluated an automated pipeline to identify incidental findings at sentence and section levels in radiology reports of trauma patients.

Methods:

We created an annotated dataset of 4,181 reports and investigated automated feature representations including traditional word and clinical concept (such as SNOMED CT) representations, as well as word and concept embeddings. We evaluated these representations by using them with traditional classifiers such as logistic regression and with deep learning methods such as convolutional neural networks (CNNs).

Results:

The best performance was observed using word embeddings with CNNs with F1 scores of 0.66 and 0.52 at section and sentence levels respectively. The F1 score was statistically significantly higher for sections compared to sentences (Wilcoxon; Z < 0.001, p < 0.05). Compared to using words alone, the addition of SNOMED CT concepts did not improve performance. At the sentence level, the F1 score improved significantly from 0.46 to 0.52 when using pre-trained embeddings (Wilcoxon; Z < 0.001, p < 0.05).

Conclusion:

The results show that the best performance was achieved by using embeddings with CNNs at both sentence and section levels. This provides evidence that such a pipeline is capable of accurately identifying incidental findings in radiology reports in an automated manner.

Keywords: Automated feature representations, Radiology reports, Incidental findings, Word embeddings, Convolutional neural networks

1. Background and motivation

Trauma is a leading cause of morbidity and mortality, accounting for an estimated 79,000 deaths each year among people younger than 45 years [1]. Assessment of injuries in trauma patients relies on extensive radiologic imaging that includes whole-body computed tomography (CT) and magnetic resonance imaging (MRI) scans. While invaluable in demonstrating the extent of injuries, whole-body imaging often uncovers findings (occult masses, lesions, and anatomic anomalies) that are unrelated to the trauma. These unrelated findings are termed incidental findings [2]. They range from an inconsequential renal cyst to a potentially life-threatening lung nodule (see Fig. 1). About 40% of all incidental findings have sinister features that warrant follow-up and treatment [3]. Members of the trauma team are responsible for reading radiology examination reports, identifying incidental findings, assessing their clinical significance, and communicating the information to the patient and other physicians. Automated methods to identify incidental findings in radiology reports can be invaluable at busy trauma centers.

Fig. 1.

A de-identified radiology report of CT imaging in a patient with trauma. It revealed a nodule in the left lung as an incidental finding (underlined).

Natural language processing (NLP) techniques enable automatic identification and extraction of information from radiology reports. Applications of NLP to radiology reports include retrieval of reports that mention a specific condition or set of conditions [4], information extraction such as follow-up recommendations [5,6], and summarization [7]. In a recent systematic review, Pons et al. [8] comprehensively catalog NLP applications in radiology reports. A variety of NLP pipelines have been developed to process clinical text reports and extract features. The steps in these pipelines often include section and sentence segmentation followed by tokenization and normalization. Subsequent steps include enrichment with syntactic (linguistic) and semantic (often based on specialized biomedical lexicons) annotations. Traditionally, both rule-based and machine learning methods have been applied to the extracted features for classification or information extraction tasks. Rule-based systems define a set of conditions on features to classify reports. For example, Dutta et al. [5] employed keyword-based rules to identify reports with relevant recommendations, and Elkin et al. [4] developed term-based rules to identify pneumonia in radiology reports. More recently, machine learning approaches have been favored: although rules are easy to understand, they are difficult to maintain and often perform worse than machine learning systems. While machine learning approaches have in the past employed traditional classification methods like logistic regression and support vector machines, deep learning methods are increasingly used because they can identify relevant terms in free text without the substantial preprocessing that traditional classification methods need [9]. Cai et al. [10] provide a comprehensive survey of common NLP pipelines and methods that have been applied to radiology reports.

Past work in automated identification of incidental findings in radiology reports is scant. In a corpus of 573 radiology reports related to thromboembolic diseases, Pham et al. [11] applied support vector machine and maximum entropy classifiers to identify incidental findings. They obtained F1 scores of 0.57 at the report level and 0.80 when retaining only results and conclusion sections in the report. Using a corpus of 661 radiology reports, Johnson [12] applied a combination of machine learning methods and hand-crafted rules to identify incidental findings, and obtained an F1 score of 0.69 at the report level. In these studies, identification was done only at the report level and would need additional manual steps to identify sentences that represent incidental findings.

In this paper, we focus on identifying incidental findings at sentence and section levels in radiology reports. We explore automated feature representation methods including traditional word and clinical concept representations (using a standard clinical vocabulary such as the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED CT)), as well as word and concept embeddings. Dense vector embeddings used in conjunction with deep neural networks have been shown to be particularly effective in NLP tasks [13–15]. A word embedding represents each word in the corpus by a vector of real numbers [16]. This vector may encode information regarding that word’s meaning derived from the context in which it appears [17]. These embeddings may be trained in an unsupervised manner using a large corpus. Clinical concept embedding differs from word embedding only in that clinical concepts replace words [18]. In the following sections we describe our data and experimental setup to compare classifier performance using these feature representation methods.

2. Methods

2.1. Data and annotation

We obtained 170,052 radiology reports for trauma patients at a major academic medical center. The reports were de-identified and stripped of explicit identifiers regarding imaging modalities using software from DE-ID Data Corp [20]. Using approximate regular expression rules to identify the imaging modality, we estimate that these include about 47K CT scans, 86K X-ray reports, and 10K ultrasound and MRI reports each. The remainder included studies from interventional radiology, fluoroscopy, nuclear medicine, and other modalities.

To create an annotated dataset, two trauma physicians (E.R.D. and R.M.H.) annotated 4,181 radiology reports for incidental findings using a custom annotation tool. The remaining 165,871 reports constituted the unannotated dataset for training embeddings. The annotators selected phrases or full sentences as incidental findings, and annotations were allowed to cross sentence boundaries as needed. Annotators focused on two types of incidental findings that are recommended for follow-up: lesions suspected to be malignant and arterial aneurysms meeting specified size and location criteria. Table 1 provides detailed annotation guidelines that were used by the physicians.

Table 1.

Annotation guidelines that are adapted from Sperry et al. [19]. Any lesion of malignant potential and any arterial aneurysm that is greater than a specified size was annotated.

Lesions
Brain Any solid lesion
Thyroid Any lesion
Bone Any osteolytic or osteoblastic lesion, not age-related
Breast Any solid lesion
Lung Any lesion
Liver Any heterogeneous lesion
Kidney Any heterogeneous lesion
Adrenal Any lesion
Pancreas Any lesion
Ovary Any heterogeneous lesion
Bladder Any lesion
Prostate Any lesion
Intraperitoneal/retroperitoneal Any free lesion
Arterial Aneurysms
Thoracic aorta ≥5 cm
Abdominal aorta ≥4 cm
External iliac artery ≥3 cm
Common femoral artery ≥2 cm
Popliteal artery ≥1 cm

We were motivated by an interest in guiding physicians directly to specific mentions of incidental findings. Thus, we explored identification of incidental findings at finer text-resolution levels by processing the reports to extract sentences and sections. We extracted individual sentences using spaCy (a Python NLP library suitable for large-scale information extraction tasks; https://spacy.io [21]). A sentence was labeled positive if a phrase in it or the entire sentence was selected by the annotators. Sections were extracted after applying regular expressions to identify section headings. A section was marked positive if it contained one or more sentences with incidental findings.
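
The labeling rules described above can be illustrated with a minimal sketch. It assumes that annotations are available as character offsets, that section headings are upper-case labels followed by a colon, and that a naive period-based splitter stands in for spaCy; the pattern `HEADING` and all function names are hypothetical and are not taken from the study's code.

```python
import re

# Hypothetical heading pattern: an upper-case label followed by a colon at line start.
HEADING = re.compile(r"^([A-Z][A-Z ]+):", re.MULTILINE)

def split_sections(report):
    """Return (heading, body_start, body_end) offsets; text before the first heading is ignored."""
    matches = list(HEADING.finditer(report))
    sections = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report)
        sections.append((m.group(1), m.end(), end))
    return sections

def split_sentences(report, start, end):
    """Naive splitter on sentence-final periods; a stand-in for a real segmenter."""
    return [(start + m.start(), start + m.end())
            for m in re.finditer(r"[^.]+\.", report[start:end])]

def overlaps(span, annotations):
    """True if the span shares at least one character with any annotated span."""
    return any(span[0] < a_end and a_start < span[1] for a_start, a_end in annotations)

def label_report(report, annotations):
    """Label sentences by overlap with annotations; a section is positive if any sentence is."""
    labelled = []
    for heading, s_start, s_end in split_sections(report):
        sentences = [(report[a:b].strip(), overlaps((a, b), annotations))
                     for a, b in split_sentences(report, s_start, s_end)]
        labelled.append((heading, any(flag for _, flag in sentences), sentences))
    return labelled

report = ("FINDINGS: No acute fracture. There is a 6 mm nodule in the left lung.\n"
          "IMPRESSION: No acute traumatic injury.")
start = report.index("6 mm nodule")
annotations = [(start, start + len("6 mm nodule in the left lung"))]
result = label_report(report, annotations)
```

In this toy report only the second sentence of the FINDINGS section overlaps the annotated span, so that sentence and its enclosing section are labeled positive while the IMPRESSION section stays negative.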

2.2. Feature representations

We explored feature representations that consisted of words (Words-only) and words augmented with clinical concepts (Words+Concepts). We also investigated word and concept embeddings, which represent each word or concept as a vector such that similarity among vectors correlates with semantic similarity.

For the Words-only representation, we converted all text to lowercase and removed headings, newlines, non-alphanumeric characters and common English stopwords using the Natural Language Toolkit (NLTK, a toolkit for symbolic and statistical NLP; https://www.nltk.org/ [22]). In the Words+Concepts representation, we augmented the Words-only representation with SNOMED CT vocabulary concepts. We extracted clinical concepts in reports using Noble Coder, which automatically identifies concepts in free text based on a standard vocabulary (http://noble-tools.dbmi.pitt.edu/ [23]). Table 2 shows example sentences where SNOMED CT concepts may be useful in identifying incidental findings. We hypothesized that the addition of clinical concepts would reduce variability in textual features; for example, “cardiomegaly” and “enlarged heart” would be mapped to the same concept.
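
As a rough illustration of the two representations, the sketch below normalizes text and appends concept tokens. It is not the study's implementation: the stopword set is a tiny stand-in for NLTK's English list, and the `CONCEPTS` lookup with its placeholder identifiers is a hypothetical substitute for a concept extractor such as Noble Coder (the identifiers are not SNOMED CT codes).

```python
import re

# Tiny illustrative stopword list; a stand-in for a full English stopword list.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "there", "with"}

# Hypothetical phrase-to-concept lookup; identifiers are placeholders, not SNOMED CT codes.
CONCEPTS = {
    "enlarged heart": "CONCEPT_CARDIOMEGALY",
    "cardiomegaly": "CONCEPT_CARDIOMEGALY",
}

def normalize(text):
    """Lowercase and replace runs of non-alphanumeric characters with single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def words_only(text):
    """Words-only features: normalized tokens with stopwords removed."""
    return [tok for tok in normalize(text).split() if tok not in STOPWORDS]

def words_plus_concepts(text):
    """Words+Concepts features: word tokens plus one token per matched concept phrase."""
    cleaned = normalize(text)
    concepts = [cid for phrase, cid in CONCEPTS.items() if phrase in cleaned]
    return words_only(text) + concepts

a = words_plus_concepts("There is cardiomegaly.")
b = words_plus_concepts("Enlarged heart.")
```

Both toy sentences end with the same concept token even though they share no words, which is the reduction in textual variability that the hypothesis above refers to.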

Table 2.

Two example sentences where identified SNOMED CT concepts are shown in bold and concepts that may signal incidental findings are shown underlined. The first sentence is an example where the radiologist explicitly identifies an incidental finding. The second sentence illustrates an example where the finding is incidental only in the context of a trauma patient; in a diagnostic radiological examination this would have been a regular finding.

Incidental note is made of a low-attenuation mass at the medial upper pole of the right kidney which may represent a renal cyst.
There is a 2 cm, partially calcified nodule in the right lobe of the thyroid gland.

We used a bag-of-words model with term frequency–inverse document frequency (TF-IDF) weighting [24] to obtain vector representations at section and sentence levels, for use with traditional machine learning classifiers. We compared the use of the high-dimensional and sparse TF-IDF representation with denser word embeddings for training convolutional neural networks (CNNs). Word embedding is a class of unsupervised methods that derive dense vector representations of words from a large text corpus. We used a word2vec method that is trained by predicting context words from a target word [17] and compared three different schemes for generating these embeddings [25]:

  1. Random: Word-vectors were initialized randomly.

  2. Folds-only: The embeddings were created from only the training dataset. Since the experiments used a 5-fold cross-validation scheme, a distinct embedding was created with each cross-fold.

  3. Pre-trained: The embedding was obtained from the unannotated dataset.
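
For the traditional classifiers, the TF-IDF weighting mentioned above can be sketched in a few lines. This is the classic formulation (raw term count times log(N/df)); scikit-learn's default uses a smoothed and normalized variant, so the numbers here would not match its output exactly. The function name `tfidf` is an assumption of this sketch.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    weight = term count in the document * log(N / number of documents containing the term)
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in docs]

docs = [["lung", "nodule"], ["lung", "clear"], ["lung", "nodule", "nodule"]]
vectors = tfidf(docs)
```

A term that occurs in every document ("lung" here) receives zero weight, while rarer terms are weighted up, which is why the resulting vectors are sparse and high-dimensional.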

We used Gensim (a Python framework for efficient vector space modeling; https://pypi.org/project/gensim/ [26]) to train the word embeddings. Similarly, we explored the use of concept embeddings with Words+Concepts features. We downloaded SNOMED CT concept embeddings published by Beam et al. [18], which were trained on a large corpus covering 108,477 medical concepts.
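
The difference between the initialization schemes can be shown with a small sketch that builds the embedding matrix handed to a network. It is illustrative only: `build_embedding_matrix` is a hypothetical helper, the uniform range is arbitrary, and falling back to random vectors for words missing from the lookup table is an assumption of this sketch. Folds-only and Pre-trained would both pass a lookup table and differ only in the corpus used to train it.

```python
import random

def build_embedding_matrix(vocab, dim, pretrained=None, seed=0):
    """Return one vector per vocabulary word.

    pretrained=None gives the Random scheme. Otherwise vectors are copied from
    the lookup table, and missing words fall back to random vectors (assumption).
    """
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        if pretrained is not None and word in pretrained:
            matrix.append(list(pretrained[word]))
        else:
            matrix.append([rng.uniform(-0.25, 0.25) for _ in range(dim)])
    return matrix

vocab = ["lung", "nodule", "aneurysm"]
toy_pretrained = {"lung": [0.1, 0.2, 0.3], "nodule": [0.4, 0.5, 0.6]}  # made-up values
random_init = build_embedding_matrix(vocab, dim=3)
pretrained_init = build_embedding_matrix(vocab, dim=3, pretrained=toy_pretrained)
```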

2.3. Experimental methods

We trained two sets of classifiers to predict sentences and sections describing incidental findings. The traditional classifiers included Naïve Bayes, random forest, logistic regression, and support vector machine (SVM), and were built using scikit-learn (a machine learning library in Python; https://scikit-learn.org/ [27]). We compared them with CNNs using an architecture described by Kim [25] that is implemented in Keras (a deep learning library in Python; https://keras.io/ [28]).
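
The core operation of a sentence-classification CNN of this kind is a filter that slides over windows of consecutive word vectors, followed by max-over-time pooling. The sketch below shows that operation for a single filter in plain Python; it is a simplified illustration with made-up numbers, not the Keras model used in the experiments, and `conv_max_pool` is a hypothetical name.

```python
def conv_max_pool(embedded, filt, bias=0.0):
    """Apply one filter to every window of consecutive word vectors,
    pass each response through ReLU, and keep the maximum response.
    Assumes the text has at least as many words as the filter spans."""
    width = len(filt)  # number of words the filter spans
    activations = []
    for i in range(len(embedded) - width + 1):
        window = embedded[i:i + width]
        total = bias + sum(w * x
                           for f_row, vec in zip(filt, window)
                           for w, x in zip(f_row, vec))
        activations.append(max(0.0, total))  # ReLU
    return max(activations)  # max-over-time pooling

# Toy 2-dimensional embeddings for a 4-word sentence (made-up numbers).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One filter spanning 2 words; a trained model learns many filters of several widths.
filt = [[1.0, 0.0], [0.0, 1.0]]
feature = conv_max_pool(sentence, filt)
```

Each filter contributes one pooled feature regardless of text length; the pooled features from all filters are concatenated and passed to the output layer.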

We used a 5-fold cross-validation scheme to train and evaluate classifiers. We computed F1 score, precision, recall, and area under the ROC curve (AUROC) for each fold. Table 3 shows the user-specified parameter settings we explored in our experiments. We picked parameter settings that maximized the F1 score and report the results using these settings in Section 3. We conducted experiments to compare sentence and section classifiers, to compare Words-only with Words+Concepts feature representations, and to compare pre-trained embeddings with random embeddings. Wilcoxon signed-rank and Kruskal-Wallis tests were used to statistically compare the F1 scores across the cross-validation folds.
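
The per-fold metrics can be made concrete with a short sketch. Precision, recall, and F1 follow their standard definitions for the positive class. The fold assignment is stratified by class, which is an assumption of this sketch; the text above does not state how folds were drawn. Function names are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def stratified_folds(labels, k=5):
    """Assign indices to k folds round-robin within each class, so every
    fold keeps roughly the same share of positives (assumed scheme)."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for j, idx in enumerate(members):
            folds[j % k].append(idx)
    return folds

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
folds = stratified_folds([1] * 5 + [0] * 20, k=5)
```

Because F1 ignores true negatives, it is more informative than accuracy when positives are rare, as they are in this dataset.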

Table 3.

Implementation details for feature representations and classifiers.

| Representation | Classifier | Parameter settings |
|---|---|---|
| TF-IDF (scikit-learn [27], version 0.20) | Naïve Bayes (scikit-learn) | Bernoulli Naïve Bayes default settings |
| TF-IDF (scikit-learn [27], version 0.20) | Random forest (scikit-learn) | Number of trees = 100; Metric = Gini; Minimum samples at a leaf node = 20 |
| TF-IDF (scikit-learn [27], version 0.20) | Logistic regression (scikit-learn) | L2 regularization coefficient determined by 3-fold internal cross validation |
| TF-IDF (scikit-learn [27], version 0.20) | SVM (scikit-learn) | L2 regularization coefficient determined by 3-fold internal cross validation |
| Word2vec (Gensim [26], version 3.6.0) | CNN (Keras [28], version 2.2.4) | Best settings determined by searching over the following values: Dimension = {50, 75, 100}; Minimum word count = 3; Window size = {5, 10, 15}; Hidden dimension = 75; Filter sizes = {3, 5, 7}; Number of filters = 25; Batch size = 32; Epochs = 150 |

3. Results

3.1. Annotations

The annotated dataset consisted of 4,181 reports, of which 439 (10.5%) contained at least one incidental finding. Table 4 shows the distribution of incidental findings at sentence and section levels in the annotated dataset.

Table 4.

Distribution of positives, words, and concepts (mean ± standard deviation) at sentence and section levels in the annotated dataset. Positives denote the raw count of sentences or sections that contained one or more incidental findings (along with percentages).

| Level | Total | Positives | Words | Concepts |
|---|---|---|---|---|
| Sentences | 110,354 | 1,276 (1.15%) | 8.0 ± 6.6 | 2.9 ± 2.9 |
| Sections | 23,302 | 661 (2.83%) | 38.1 ± 61.9 | 13.1 ± 23.8 |

An initial pilot set of 128 radiology reports was annotated by the two physicians independently, and the inter-annotator agreement (IAA) measured using Cohen’s Kappa statistic [29] was 0.73. After review and deliberation the annotation guidelines were revised, and a second pilot set of 144 radiology reports was annotated. This resulted in a revised IAA of 0.83. Each of the remaining 4,053 reports was annotated by a single physician using the revised annotation scheme.
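
Cohen's Kappa corrects the observed agreement between two annotators for the agreement expected by chance. The sketch below computes it for two lists of labels; the toy labels are made up, and the unit being compared in the study (for example reports or sentences) is not specified here.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items.
    Undefined (division by zero) when chance agreement equals 1."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)

annotator_1 = [1, 1, 0, 0, 0, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 0, 0, 1, 0, 0, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy example the annotators agree on 8 of 10 items (0.80), chance agreement is 0.58, and Kappa is therefore (0.80 − 0.58) / (1 − 0.58), roughly 0.52.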

3.2. Classifier performance

The performance of classifiers at the sentence and section levels in terms of F1 score, precision, recall and AUROC is shown in Table 5.

Table 5.

F1 scores, precision, recall and AUROC values at the section and sentence levels from 5-fold cross-validation (mean ± standard deviation). The best F1 scores are highlighted. Words+Concepts includes both words and extracted SNOMED CT concepts as features.

| Level | Classifier | Words-only F1 score | Words-only precision | Words-only recall | Words-only AUROC | Words+Concepts F1 score | Words+Concepts precision | Words+Concepts recall | Words+Concepts AUROC |
|---|---|---|---|---|---|---|---|---|---|
| Sentences | Naïve Bayes | 0.32±0.02 | 0.23±0.02 | 0.53±0.04 | 0.93±0.01 | 0.32±0.02 | 0.22±0.01 | 0.54±0.03 | 0.94±0.01 |
| Sentences | Random Forest | 0.33±0.05 | 0.69±0.04 | 0.21±0.04 | 0.93±0.01 | 0.31±0.04 | 0.64±0.06 | 0.21±0.03 | 0.93±0.01 |
| Sentences | Logistic Regression | 0.42±0.03 | 0.78±0.05 | 0.29±0.03 | 0.95±0.01 | 0.43±0.03 | 0.78±0.03 | 0.30±0.03 | 0.95±0.01 |
| Sentences | SVM | 0.42±0.03 | 0.79±0.05 | 0.29±0.03 | 0.92±0.03 | 0.45±0.03 | 0.81±0.05 | 0.31±0.03 | 0.92±0.03 |
| Sentences | CNN Random | 0.46±0.03 | 0.54±0.03 | 0.41±0.03 | 0.90±0.03 | 0.47±0.04 | 0.57±0.09 | 0.41±0.06 | 0.90±0.03 |
| Sentences | CNN Folds-only | 0.46±0.03 | 0.59±0.07 | 0.39±0.04 | 0.90±0.03 | 0.47±0.03 | 0.54±0.06 | 0.42±0.04 | 0.89±0.02 |
| Sentences | CNN Pre-trained | 0.52±0.03 | 0.57±0.06 | 0.49±0.07 | 0.92±0.02 | NC | NC | NC | NC |
| Sections | Naïve Bayes | 0.39±0.01 | 0.27±0.01 | 0.72±0.02 | 0.95±0.00 | 0.39±0.01 | 0.27±0.01 | 0.70±0.02 | 0.94±0.00 |
| Sections | Random Forest | 0.45±0.04 | 0.56±0.04 | 0.37±0.04 | 0.97±0.01 | 0.43±0.03 | 0.55±0.03 | 0.36±0.03 | 0.97±0.01 |
| Sections | Logistic Regression | 0.56±0.01 | 0.71±0.04 | 0.47±0.01 | 0.97±0.01 | 0.57±0.03 | 0.71±0.05 | 0.48±0.02 | 0.97±0.01 |
| Sections | SVM | 0.62±0.01 | 0.79±0.02 | 0.51±0.02 | 0.97±0.01 | 0.60±0.02 | 0.77±0.02 | 0.49±0.03 | 0.97±0.01 |
| Sections | CNN Random | 0.66±0.02 | 0.69±0.06 | 0.65±0.07 | 0.90±0.02 | 0.65±0.01 | 0.68±0.06 | 0.63±0.05 | 0.93±0.01 |
| Sections | CNN Folds-only | 0.65±0.02 | 0.68±0.05 | 0.63±0.03 | 0.89±0.01 | 0.59±0.13 | 0.59±0.16 | 0.60±0.11 | 0.92±0.03 |
| Sections | CNN Pre-trained | 0.65±0.02 | 0.68±0.05 | 0.64±0.05 | 0.92±0.03 | NC | NC | NC | NC |

Results for traditional methods are reported with a bag-of-words model using TF-IDF. CNNs were trained using dense word embeddings. CNN Random: embeddings are randomly initialized. CNN Folds-only: embeddings are trained on the fly using the training set in each cross-validation fold from the annotated dataset. CNN Pre-trained: embeddings are trained using the unannotated dataset. NC denotes not computed: pre-trained embeddings were not available for all SNOMED CT concepts.

3.2.1. Performance at sentence and section levels

A sentence contained 8.04 words on average (SD = 6.6), while a section contained 38.14 words on average (SD = 61.9). Traditional classifiers with TF-IDF representation had an average vocabulary size of 7.6K and 7.8K words in each training fold for sentences and sections, respectively. The F1 scores for classifiers over 5 folds on sections were statistically significantly higher than the F1 scores for sentences (Wilcoxon; Z < 0.001, p < 0.05). The best F1 scores were 0.66 and 0.52 for sections and sentences respectively, and were obtained using CNNs with word embeddings. We also observed balanced precision and recall scores for CNNs.

3.2.2. Comparison of Words-only and Words+Concepts representations

We counted an average of 2.9 concepts (SD = 2.9) per sentence and found 27,906 sentences without any concepts. For sections, we counted an average of 13.12 concepts (SD = 23.8) per section and found 4,216 sections without any concepts.

There was no statistically significant difference between the F1 scores of Words-only and Words+Concepts representations (Wilcoxon; Z = 31.0, p > 0.5; see Fig. 2). The best F1 scores for sections were around 0.65 for both representations.

Fig. 2.

Comparison of Words-only and Words+Concepts representation. Each cross-fold is shown in the plot, along with the mean and standard deviation denoted by horizontal bars.

While using the published embedding for SNOMED CT concepts [18], we found that its coverage was low on our dataset and did not contain embeddings for over a third of the concepts. We did not train our own concept embeddings in this work.

3.2.3. Comparison of word embeddings

We compared CNNs with word embeddings that were trained using Random, Folds-only, and Pre-trained initializations. The Pre-trained word embeddings covered 24,034 distinct words (20M total), while the Folds-only embeddings covered between 6,832 and 6,900 distinct words (488K to 491K total) across the 5 cross-validation folds.

Across sentences and sections combined, there was no statistically significant difference in performance among the three embeddings (Kruskal-Wallis; H = 0.42, p = 0.82). However, at the sentence level the Pre-trained embedding showed a statistically significant improvement over the Random embedding (Wilcoxon; Z < 0.001, p < 0.05; see Fig. 3), with the F1 score improving from 0.46 to 0.52.

Fig. 3.

Comparison of CNNs with Random, Folds-only and Pre-trained embeddings. Each cross-fold is shown in the plot, along with mean and standard deviation denoted by horizontal bars.

4. Discussion

High-performance automated methods to identify text spans with incidental findings in radiology reports might be particularly useful in freeing members of busy trauma centers to focus on urgent clinical activities. We developed and evaluated several feature representations and classifiers to identify incidental findings at section and sentence levels in radiology reports of trauma patients. We annotated a corpus of over 4,000 reports for this task. We compared Words-only and Words+Concepts feature sets, and evaluated them using both the traditional TF-IDF representation and word embeddings. In addition to using a much larger training set than prior work (4,181 vs. 661 reports [12]), our exploration of multiple approaches provides insight into the problem while suggesting areas for future work.

Granularity and potential generalizability: We evaluated performance at both the section and sentence levels. Section-level performance was comparable to previously described report-level performance (F1 = 0.66 vs. 0.69 [12]). The feature representations used in our study eliminate the need for feature curation, presenting the possibility of easier transfer to other datasets. Our examination of classification at the level of individual sentences and sections was motivated by an interest in guiding physicians directly to specific mentions of incidental findings. Although sentence level performance was lower than section level performance (F1 = 0.52 vs. 0.66), these results suggest that retrained classifiers, perhaps informed by detailed error analysis, might perform better. An initial error analysis of sentence-level results found difficulties with sentence boundary detection. Replacing the basic spaCy sentence segmentation tool with a customized version tuned to the idiosyncrasies of clinical narratives might improve performance.

Words vs. concepts: The comparison of the Words-only representation with Words+Concepts explored the utility of adding features from standard vocabularies such as SNOMED CT. The limited set of concepts (lesions and aneurysms) found in the annotation guidelines (Table 1) presented the possibility that a few concepts may be highly predictive of incidental findings. However, the combined Words+Concepts features did not perform better than the basic Words-only features. The combination of the highly skewed dataset and the wide variation in the distribution of incidental findings (Table 4) may have contributed to these results (25% of sentences and 18% of sections had no concepts identified). Future work with alternative concept extraction tools and with alternative vocabularies such as RadLex [30] may improve performance when using concepts.

Embeddings: Use of pre-trained word embeddings with deep learning improved performance at the sentence level (F1 score improved from 0.46 to 0.52). However, we did not see such improvements at the section level. This may be due to the limitations of the CNN architecture for modeling long-distance dependencies [31]. Temporal models such as recurrent neural networks [32,33] with attention mechanisms [34] have the potential to provide better results for longer texts. The corpus of 165K reports (24K distinct words) used to create the pre-trained embeddings is a relatively small training sample for deep learning. Embeddings trained on a much larger corpus might yield better performance. Future work may explore alternative methods of generating embeddings such as GloVe [35].

The contextualized nature of incidental findings will likely present challenges to any automated extraction approach. For example, the classification of an observation of a tumor as incidental might depend on whether or not the patient or their physician was aware of the tumor. Similarly, simple kidney cysts might not be considered incidental if they are not serious enough to treat. Given potential disagreements between domain experts on these classifications, the performance of automated methods will necessarily be limited.

There are several limitations related to the dataset used in this study. It is substantially skewed, with about 1% positive sentences and 3% positive sections (Table 4), and has wide variation in the number of words and concepts across individual sentences and sections. A larger and/or less-skewed dataset might provide better training data. Up-sampling of positive cases and inclusion of data from additional institutions might increase the robustness of the training data. As our de-identification processes stripped information regarding the type of imaging used (e.g., X-ray vs. CT), we could not examine whether performance differs across imaging modalities. Examining these potential differences might be an interesting area for future work.
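
Up-sampling of positive cases, suggested above as a possible remedy for class imbalance, could look like the following sketch. It was not part of the reported experiments. It duplicates randomly chosen positive examples until both classes are the same size, and it should be applied only to the training portion of each fold so that duplicated items never leak into the evaluation data. The function name is hypothetical.

```python
import random

def upsample_positives(examples, labels, seed=0):
    """Randomly duplicate positive examples until classes are balanced."""
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if not positives:
        return list(examples), list(labels)  # nothing to duplicate
    extra = [rng.choice(positives) for _ in range(len(negatives) - len(positives))]
    indices = positives + negatives + extra
    rng.shuffle(indices)
    return [examples[i] for i in indices], [labels[i] for i in indices]

X = list(range(10))
y = [1] + [0] * 9
X_bal, y_bal = upsample_positives(X, y)
```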

5. Conclusion

We evaluated the effect of several automated feature representations on classifier performance for the task of identifying sentences and sections in radiology reports that contain incidental findings. The best F1 scores were 0.52 and 0.66 at sentence and section levels respectively, both using word embeddings with CNNs. The inclusion of concepts from SNOMED CT did not lead to improved performance. Enhancements to the feature representations that we explored in this paper can form the basis of a future tool that will automatically identify and extract incidental findings. Potential clinical uses of such a tool include automated and accurate communication of information at patient handoffs [36].

Summary points.

What was already known on the topic?

  • Incidental findings in radiology examination reports can be identified using machine learning methods.

  • Prior work demonstrated proofs of concept using small datasets and hand-crafted features.

What this study added to our knowledge?

  • It is feasible to extract incidental findings at finer levels of resolution (sentence and section levels) at performance similar to that at report-level in prior work. Including clinical concepts from SNOMED CT did not result in improvement in performance.

  • Pre-trained word embeddings with convolutional neural networks (CNNs) hold promise for accurate and efficient identification of incidental findings.

Funding

The research reported in this publication was supported by the National Library of Medicine of the National Institutes of Health under award number R01LM012095 and a Provost Fellowship in Intelligent Systems at the University of Pittsburgh (awarded to G.T.). The content of the paper is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the University of Pittsburgh.

Footnotes

Conflicts of interest

The authors do not have any competing interests.

References

  • [1].DiMaggio C, Ayoung-Chee P, Shinseki M, Wilson C, Marshall G, Lee DC, Wall S, Maulana S, Leon Pachter H, Frangos S, Traumatic injury in the United States: in-patient epidemiology 2000–2013, Injury 47 (2016) 1393–1403. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Lumbreras B, Donat L, Hernández-Aguado I, Incidental findings in imaging diagnostic tests: a systematic review, Br. J. Radiol 83 (2010) 276–289 PMID:20335439. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [3].Kroczek EK, Wieners G, Steffen I, Lindner T, Streitparth F, Hamm B, Maurer MH, Non-traumatic incidental findings in patients undergoing whole-body computed tomography at initial emergency admission, Emerg. Med. J 34 (2017) 643–646. [DOI] [PubMed] [Google Scholar]
  • [4].Elkin PL, Froehling D, Wahner-Roedler D, Trusko B, Welsh G, Ma H, Asatryan AX, Tokars JI, Rosenbloom ST, Brown SH, NLP-based identification of pneumonia cases from free-text radiological reports, AMIA Annu. Symp. Proc (2008) 172–176. [PMC free article] [PubMed] [Google Scholar]
  • [5].Dutta S, Long WJ, Brown DFM, Reisner AT, Automated detection using natural language processing of radiologists recommendations for additional imaging of incidental findings, Ann. Emerg. Med 62 (2013) 162–169. [DOI] [PubMed] [Google Scholar]
  • [6].Oliveira L, Tellis R, Qian Y, Trovato K, Mankovich G, Follow-up recommendation detection on radiology reports with incidental pulmonary nodules, Stud. Health Technol. Inform 216 (2015) 1028. [PubMed] [Google Scholar]
  • [7].Goff DJ, Loehfelm TW, Automated radiology report summarization using an open-source natural language processing pipeline, J. Digit. Imaging 31 (2018) 185–192. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [8].Pons E, Braun LMM, Hunink MGM, Kors JA, Natural language processing in radiology: a systematic review, Radiology 279 (2016) 329–343. [DOI] [PubMed] [Google Scholar]
  • [9].Chen MC, Ball RL, Yang L, Moradzadeh N, Chapman BE, Larson DB, Langlotz CP, Amrhein TJ, Lungren MP, Deep learning to classify radiology freetext reports, Radiology 286 (2017) 845–852. [DOI] [PubMed] [Google Scholar]
  • [10].Cai T, Giannopoulos AA, Yu S, Kelil T, Ripley B, Kumamaru KK, Rybicki FJ, Mitsouras D, Natural language processing technologies in radiology research and clinical applications, Radiographics 36 (2016) 176–191. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [11].Pham AD, Neveol A, Lavergne T, Yasunaga D, Clement O, Meyer G, Morello R, Burgun A, Natural language processing of radiology reports for the detection of thromboembolic diseases and clinically relevant incidental findings, BMC Bioinform 15 (2014) 266. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [12].Johnson EB, Methods in text mining for diagnostic radiology, Case Western Reserve University, 2016. Ph.D. thesis.
  • [13].Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P, Natural language processing (almost) from scratch, J. Mach. Learn. Res 12 (2011) 2493–2537. [Google Scholar]
  • [14].Kim Y, Convolutional neural networks for sentence classification, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2014, pp. 1746–1751. [Google Scholar]
  • [15].Bengio Y, Courville A, Vincent P, Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell 35 (2013) 1798–1828. [DOI] [PubMed] [Google Scholar]
  • [16].Goldberg Y, A primer on neural network models for natural language processing, CoRR abs/1510.00726 (2015). [Google Scholar]
  • [17].Mikolov T, Chen K, Corrado G, Dean J, Efficient estimation of word representations in vector space, CoRR abs/1301.3781 (2013). [Google Scholar]
  • [18].Beam AL, Kompa B, Fried I, Palmer NP, Shi X, Cai T, Kohane IS, Clinical concept embeddings learned from massive sources of medical data, CoRR abs/1804.01486 (2018).
  • [19].Sperry JL, Massaro MS, Collage RD, Nicholas DH, Forsythe RM, Watson GA, Marshall GT, Alarcon LH, Billiar TR, Peitzman AB, Incidental radiographic findings after injury: dedicated attention results in improved capture, documentation, and management, Surgery 148 (2010) 618–624.
  • [20].Gupta D, Saul M, Gilbertson J, Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research, Am. J. Clin. Pathol 121 (2004) 176–186.
  • [21].Honnibal M, Johnson M, An improved non-monotonic transition system for dependency parsing, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 1373–1378.
  • [22].Loper E, Bird S, NLTK: The natural language toolkit, Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics – Volume 1, ETMTNLP ’02, Association for Computational Linguistics, Stroudsburg, PA, USA, 2002, pp. 63–70.
  • [23].Tseytlin E, Mitchell K, Legowski E, Corrigan J, Chavan G, Jacobson RS, NOBLE – Flexible concept recognition for large-scale biomedical natural language processing, BMC Bioinform 17 (2016) 32.
  • [24].Jones KS, A statistical interpretation of term specificity and its application in retrieval, J. Doc 28 (1972) 11–21.
  • [25].Kim Y, Convolutional neural networks for sentence classification, CoRR abs/1408.5882 (2014).
  • [26].Řehůřek R, Sojka P, Software framework for topic modelling with large corpora, Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50.
  • [27].Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al., Scikit-learn: machine learning in Python, J. Mach. Learn. Res 12 (2011) 2825–2830.
  • [28].Chollet F, et al., Keras, (2015) https://keras.io.
  • [29].Cohen J, A coefficient of agreement for nominal scales, Educ. Psychol. Meas 20 (1960) 37–46.
  • [30].Mejino JL Jr., Rubin DL, Brinkley JF, FMA-RadLex: An application ontology of radiological anatomy derived from the foundational model of anatomy reference ontology, AMIA Annual Symposium Proceedings, volume, American Medical Informatics Association, 2008, p. 465.
  • [31].Kalchbrenner N, Grefenstette E, Blunsom P, A convolutional neural network for modelling sentences, arXiv preprint arXiv:1404.2188 (2014).
  • [32].Hochreiter S, Schmidhuber J, Long short-term memory, Neural Comput 9 (1997) 1735–1780.
  • [33].Cho K, van Merrienboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y, Learning phrase representations using RNN encoder-decoder for statistical machine translation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2014, pp. 1724–1734.
  • [34].Gao S, Young MT, Qiu JX, Yoon H-J, Christian JB, Fearn PA, Tourassi GD, Ramanathan A, Hierarchical attention networks for information extraction from cancer pathology reports, J. Am. Med. Inform. Assoc 25 (2018) 321–330.
  • [35].Pennington J, Socher R, Manning C, GloVe: Global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
  • [36].Trivedi G, Towards interactive natural language processing in clinical care, IEEE International Conference on Healthcare Informatics (ICHI) (2018) 448–449.
