Author manuscript; available in PMC: 2018 May 1.
Published in final edited form as: J Biomed Inform. 2017 Apr 18;69:177–187. doi: 10.1016/j.jbi.2017.04.011

Automated annotation and classification of BI-RADS assessment from radiology reports

Sergio M Castro 1, Eugene Tseytlin 1, Olga Medvedeva 1, Kevin Mitchell 1, Shyam Visweswaran 1, Tanja Bekhuis 1, Rebecca S Jacobson 1
PMCID: PMC5706448  NIHMSID: NIHMS871606  PMID: 28428140

Abstract

The Breast Imaging Reporting and Data System (BI-RADS) was developed to reduce variation in the descriptions of findings. Manual analysis of breast radiology report data is challenging but is necessary for clinical and healthcare quality assurance activities.

The objective of this study is to develop a natural language processing (NLP) system for automated extraction of BI-RADS categories from breast radiology reports. We evaluated an existing rule-based NLP algorithm, and then developed and evaluated our own method using a supervised machine learning approach. We divided the BI-RADS category extraction task into two specific tasks: (1) annotation of all BI-RADS category values within a report, and (2) classification of the laterality of each BI-RADS category value. We used one algorithm for task 1 and evaluated three algorithms for task 2. Across all evaluations and model training, we used a total of 2159 radiology reports from 18 hospitals, spanning 2003 to 2015.

Performance with the existing rule-based algorithm was not satisfactory. Conditional random fields showed high performance for task 1, with an F1 measure of 0.95. The rules-from-partial-decision-trees (PART) algorithm showed the best performance across classes for task 2, with a weighted F1 measure of 0.91 for BI-RADS 0–6 and 0.93 for BI-RADS 3–5. Classification performance by class improved for all classes from Naïve Bayes to Support Vector Machine (SVM), and again from SVM to PART.

Our system is able to annotate and classify all BI-RADS mentions present in a single radiology report and can serve as the foundation for future studies that will leverage automated BI-RADS annotation to provide feedback to radiologists as part of a learning health system loop.


2. Introduction

Breast cancer is the most common malignancy in women in the United States [1], and also in the world [2]. It is a major public health concern with considerable medical and economic burden. Early detection of breast cancer is associated with treatment at earlier stage and mortality reduction [3]. Clinical practice guidelines include regular screening mammography recommendations for women of average risk [4, 5].

The Breast Imaging Reporting and Data System (BI-RADS) was developed by the American College of Radiology (ACR) to reduce variation in radiologists’ descriptions of findings used for diagnosis [6]. BI-RADS includes 1) a standard lexicon to describe anatomical features present in breast imaging, and 2) a classification system designed to categorize findings by likelihood of malignancy, independently for each breast (Table 1) [7–9].

Table 1.

Description of BI-RADS assessment categories [7]

BI-RADS Category Final Assessment meaning Likelihood of Breast Cancer
0 Need additional imaging evaluation and/or prior imaging for comparison Not applicable
1 Negative Negligible
2 Benign finding Negligible
3 Probably benign finding <2%
4 Suspicious abnormality 23%–34%
5 Highly suggestive of malignancy ≥95%
6 Malignancy confirmed by biopsy 100%

In addition to its clinical use, the system is also used in research settings and as a healthcare quality assurance tool in mammography, ultrasound and magnetic resonance imaging [10]. Healthcare quality assurance programs in breast imaging provide feedback to radiologists regarding their ability to detect and diagnose lesions, with the goal of continuous performance improvement. Initiatives designed to increase and improve feedback to radiologists are becoming more common, extending even beyond the Mammography Quality Standards Act (MQSA) [11] and hospital accreditation practices [12]. Current performance evaluations for radiologists use BI-RADS to summarize performance metrics in correlation with pathology reports [13] and/or standard of practice benchmarks [10].

Data needed for performance evaluation and quality reviews are stored within radiology and pathology systems, typically as free text. Consequently, most provider organizations use a labor-intensive, manual procedure for correlating these data, including manual data entry for all breast imaging patients. The current process requires collection and coding of key data elements for efficient retrieval, follow-up of patients’ outcomes, correlation of pathology results with radiologists’ reports, and computation of metrics based on rates of patients’ outcomes and rates of adherence to evidence-based guidelines. These steps must then be repeated at specific intervals to ensure adequate follow-up for low-risk categories.

Manual analysis of these data can be quite challenging due to the large volume of screening examinations, reliance on manual abstraction and report generation, sequestration of the data in task specific databases and systems, limited access to clinical reports, patient mobility, off-site biopsies, and physicians’ referral patterns. In addition, the required human effort imposes significant limits on the timeliness, granularity and flexibility of the data that can be provided for healthcare performance improvement and research purposes.

Various natural language processing (NLP) techniques can be employed to automatically identify and extract key expressions from radiology reports [14–17]. These techniques may be used independently or in combination to accomplish different subtasks within a multi-step pipeline [18].

We sought to develop an NLP system for BI-RADS extraction as the first step in developing a complete system for correlating radiology with pathology results and providing feedback to radiologists. Such a system would automatically extract features of interest (e.g., BI-RADS assessment categories) from the radiology reports, correlate the findings with the pathologic findings (e.g., malignant, benign, high risk), and then present this information to the clinician in a meaningful way to support performance improvement. The envisioned system provides an example of a single learning health system loop. Learning health system approaches aim to continuously improve practice by capturing data at the various levels of clinical practice and using those data efficiently to change practice [19, 20].

There have been only a few publications related to the extraction of BI-RADS features from free-text radiology reports. Most studies have focused on extracting standardized descriptions of anatomical findings [21–23], rather than the BI-RADS categories. Only one published study aimed to specifically extract the BI-RADS category from the radiology report [24]. These authors developed the BI-RADS Observation Kit (BROK) algorithm to extract the final assessment category. BROK uses regular-expression string matching to obtain the report’s BI-RADS category and, when possible, the laterality of the breast to which it is assigned.
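
To make the flavor of this approach concrete, the sketch below shows a simplified, hypothetical BROK-style pattern in Python; the actual BROK rules are more extensive and are not reproduced here.

```python
import re

# Hypothetical BROK-style rules (illustrative only, not the published BROK
# regular expressions): look for a laterality word near a BI-RADS anchor
# followed by a category digit; fall back to a pattern without laterality.
LAT_PATTERN = re.compile(
    r"\b(?P<lat>left|right|bilateral)\b[^.\n]{0,40}?"
    r"BI-?\s?RADS[^0-9\n]{0,20}(?P<cat>[0-6])",
    re.IGNORECASE,
)
ANY_PATTERN = re.compile(r"BI-?\s?RADS[^0-9\n]{0,20}(?P<cat>[0-6])", re.IGNORECASE)

def extract_birads(text):
    """Return (laterality, category) for the first match, if any."""
    m = LAT_PATTERN.search(text)
    if m:
        return m.group("lat"), m.group("cat")
    m = ANY_PATTERN.search(text)
    return (None, m.group("cat")) if m else (None, None)

print(extract_birads("IMPRESSION: Right breast: BI-RADS Category 4."))
# -> ('Right', '4')
```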

A published report of the algorithm by the developers showed high performance with recall of 100.0 % (95 % confidence interval (CI), 99.7, 100.0 %) and precision of 96.6 % (95 % CI, 95.4, 97.5 %). However, performance in this study was measured against radiology reports that were randomly sampled by imaging technique. This sampling strategy would almost certainly over-represent the more common low-risk categories (BI-RADS 1 and 2) and under-represent the less common high-risk categories (BI-RADS 3, 4, and 5), which are the categories of greatest interest for performance improvement.

We first sought to evaluate whether BROK could be used to extract all BI-RADS categories present in a radiology report. On the basis of our findings, we then chose to develop and evaluate our own information extraction method for BI-RADS categories extraction using a supervised machine learning approach.

3. Materials and Methods

3.1. Data source

All radiology reports used in this study were obtained from the University of Pittsburgh Text Information Extraction System (TIES) [25, 26]. TIES is a de-identified database comprising approximately 24 million radiology reports from all 18 University of Pittsburgh Medical Center (UPMC) hospitals from 2003 to 2015.

3.2. Radiology reports

We included a total of 2159 radiology reports for all evaluations and model training described in this study. Reports included screening mammograms, diagnostic mammograms, breast ultrasounds, computed tomography, and magnetic resonance imaging. To avoid potential reuse bias, we created four different datasets and used them in different stages of our research and for different purposes (Table 2). The selection of the documents was based on BROK algorithm output. Use of the BROK algorithm in the selection step was necessary in order to obtain sufficient cases of each of the BI-RADS categories.

Table 2.

Description of datasets.

Dataset 1 (480 reports). Description: random sample of radiology reports stratified by BI-RADS final category (1–6) using the BROK software, then manually classified at the document level for BI-RADS final category. Use: evaluation of baseline BROK (rule-based approach) performance; BROK error analysis; BROK rule adaptations.

Dataset 2 (600 reports). Description: random sample of radiology reports flagged as ambiguous by BROK, then manually classified at the document level for BI-RADS final category. Use: BROK error analysis; BROK rule adaptations.

Dataset 3 (480 reports). Description: random sample of radiology reports stratified by BI-RADS final category (1–6) using the BROK algorithm, then manually classified at the document level for BI-RADS final category. Use: evaluation of rule-based BI-RADS category extraction (BROK vs. adapted BROK).

Dataset 4 (599 reports). Description: random sample of radiology reports stratified by BI-RADS final category (1–6) using the BROK algorithm, then manually annotated at the mention level following the annotation guidelines. Use: development and evaluation of the BI-RADS category value annotator and the BI-RADS laterality classifier (machine learning approach).

There was no overlap among the four datasets. A total of 1560 (72.2%) reports were used for the evaluation of the rule-based approach (BROK), and a total of 599 (27.8%) were used for the development and evaluation of the machine learning approach. Section 3.3 provides a detailed description of the development and use of datasets 1, 2, and 3. Section 3.4 provides a detailed description of the development and use of dataset 4. For each dataset, we manually classified (datasets 1–3) or manually annotated (dataset 4) BI-RADS category values and laterality.

3.3. Rule based approach

We used three different sets of breast radiology reports for the evaluation of the BROK rule-based approach (Table 2). For all datasets, selected reports were manually reviewed by one author (SC) to determine a document-level BI-RADS classification with laterality (e.g., Left, Right, Bilateral, or Nonspecific) per document, congruent with the classification scheme used in the output of the BROK system. Performance metrics (precision, recall and accuracy) were determined at the document level for each individual final BI-RADS category present in a single report, and overall for all final BI-RADS categories by comparing the BROK classification against this manual classification.

We evaluated the baseline performance of BROK on dataset 1, which contained a stratified sample of 80 random radiology reports per BI-RADS category from 1–6 (total 480 reports). We excluded reports with a BI-RADS final category of 0 from the selection criteria because these patients will, by definition, undergo further imaging studies; BI-RADS 0 assessments can therefore be expected within documents that also contain other BI-RADS categories.

We then performed an error analysis for all cases in the first dataset where the BROK algorithm and the manual classification disagreed. To evaluate potential false negatives, we also obtained a second dataset containing a random sample of 600 reports classified as ambiguous by BROK (Table 2). These included reports with the BROK-defined outputs of “multiple BI-RADS without laterality found”, “conflict with bilateral/overall/combined BI-RADS”, “multiple (left/right/bilateral) BI-RADS found”, or “no BI-RADS category found”. Error analyses on both dataset 1 and dataset 2 were used to adapt the BROK algorithm’s regular expressions.

In the final step, we evaluated the adapted BROK system on a third dataset that contained a stratified sample of 80 randomly selected radiology reports per BI-RADS category from 1–6 (total 480 reports) (Table 2). We compared the performance of the baseline system against the adapted system.

3.4. Machine learning approach

We also developed a second annotator using a machine learning (ML) approach, which annotated and classified multiple BI-RADS mentions within a given document. We first developed an annotation guideline and used it to create a manually annotated corpus for development, training and testing the system, based on dataset 4 (Table 2). On one subset of this data, we trained our supervised ML methods to extract BI-RADS information from the radiology reports. Finally, we evaluated the model performance and examined the resulting errors on held-out data. Figure 1 provides a summary of the methods for this aspect of the study.

Figure 1. Methods for development of the two-step machine learning annotator.

Annotation Guidelines and Schema

The annotation schema and annotation guidelines were designed using a traditional, cyclic, and iterative approach. In the first step of the annotation process, two expert annotators independently applied the schema to 15 documents for annotation training and discussed the issues that arose, leading to changes in the annotation guidelines and schema. The resulting second version of the schema was applied to 30 new documents, and Inter-Annotator Agreement (IAA) was calculated using Cohen’s kappa for all the instances annotated in the schema (Figure 3).
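
A minimal sketch of the IAA computation, assuming the two annotators’ labels have been aligned over the same candidate spans (the labels below are invented for illustration):

```python
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["Left", "Right", "Overall", "Overall", "Bilateral"]
annotator_2 = ["Left", "Right", "Overall", "Nonspecific", "Bilateral"]

# Cohen's kappa: observed agreement corrected for chance agreement.
print(cohen_kappa_score(annotator_1, annotator_2))
```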

Figure 3. Corpus annotation process and inter-annotator agreement.

The annotators met to review the disagreements and arrived at consensus on the changes to the schema and the guidelines. This process was performed until the annotators achieved an IAA ≥ 0.85. The final annotation schema included the BI-RADS category value (0 to 6), the BI-RADS category descriptor (value meaning), the BI-RADS laterality (Table 3) and variations of BI-RADS acronyms.

Table 3.

Description of BI-RADS laterality

Entity Definition
Left BI-RADS BI-RADS category assigned to the left breast
Right BI-RADS BI-RADS category assigned to the right breast
Bilateral BI-RADS BI-RADS category assigned to both breasts
Non-specific BI-RADS Unspecified BI-RADS category, ambiguous as to whether affected breast is left or right
Overall BI-RADS Corresponds to the most abnormal BI-RADS of the two breasts, based on the highest likelihood of malignancy. It is usually found after the detailed description of the BI-RADS category for each breast. In some cases, it is the only BI-RADS class in the report.

Dataset

Dataset 4 was used for the development of the ML-based annotator and was produced from a stratified sample of 100 randomly selected radiology reports per BI-RADS category from 1–6 (total 600 reports). One sampled report was excluded from the study because it was erroneously labeled as a radiology report when in fact it was a pathology report, yielding a total corpus of 599 documents. Documents were annotated using the Anafora annotation tool [27]. Anafora is a web-based annotation tool with human-readable XML output.

We used an annotation approach based on the assumption that single annotation of additional data can increase the amount of training data without loss of system performance [28]. The annotation task was divided into three rounds (Figure 3). Within each round, both annotators created annotations on two sets of documents. One set contained documents that were annotated by both annotators, while the other set contained documents that were annotated by only a single annotator (either Annotator 1 or Annotator 2). The double-annotated set was used to calculate the IAA at intervals throughout the annotation process to assure a high level of agreement throughout the task. The final gold standard combined the annotations of both annotators. Discrepancies identified during the annotation phase were discussed after the IAA calculation, and a single annotation was selected by consensus for inclusion in the gold set.

Preprocessing radiology reports

We used the clinical Text Analysis and Knowledge Extraction System (cTAKES) [29] for the preprocessing steps. cTAKES is an open-source NLP system for information extraction from free text in electronic medical records. Annotations created by cTAKES were used to create relevant features for the machine learning algorithms.

We made two minor changes to existing cTAKES annotators to be used in the preprocessing steps, based on our initial evaluations. First, we modified the cTAKES sectionizer to identify custom headers that are not included in the CDA/HL7 standard. This list of headers was based on those identified by TIES within the radiology reports. Second, we adjusted the cTAKES tokenizer to distinguish when the “.” character was used as a punctuation mark versus a decimal mark, when this character followed a number.
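
The decimal-versus-punctuation distinction can be illustrated with a small sketch (an illustration in Python, not the actual cTAKES change):

```python
def classify_period(text: str, i: int) -> str:
    """Classify the '.' at index i: a period flanked by digits on both
    sides is treated as a decimal mark, anything else as punctuation."""
    assert text[i] == "."
    if 0 < i < len(text) - 1 and text[i - 1].isdigit() and text[i + 1].isdigit():
        return "decimal"
    return "punctuation"

report = "Mass measuring 1.2 cm."
print(classify_period(report, 16))  # 'decimal'      ('.' inside 1.2)
print(classify_period(report, 21))  # 'punctuation'  (sentence-final '.')
```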

We also developed two new Unstructured Information Management Architecture (UIMA) rule-based annotators to 1) label BI-RADS anchor expressions, such as those that were manually annotated (e.g. ‘BIRADS’, ‘BI-RADS’, ‘ACR code’) and 2) annotate simple time expressions, e.g. ‘6-month’, ‘1-year’, ‘12:30 AM’.
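
Hedged sketches of the two rule-based annotators follow; the patterns are illustrative Python regular expressions, whereas the production annotators were implemented as UIMA components.

```python
import re

# Illustrative patterns for the two rule-based annotators described above.
BIRADS_ANCHOR = re.compile(r"\b(?:BI[- ]?RADS|ACR\s+code)\b", re.IGNORECASE)
TIME_EXPRESSION = re.compile(
    r"\b(?:\d{1,2}:\d{2}\s*(?:AM|PM)?"            # clock times, e.g. '12:30 AM'
    r"|\d{1,2}[- ](?:month|year|week|day)s?)\b",  # spans, e.g. '6-month'
    re.IGNORECASE,
)

line = "RECOMMEND: 6-month follow-up ultrasound. BI-RADS: 3."
print(BIRADS_ANCHOR.findall(line))    # ['BI-RADS']
print(TIME_EXPRESSION.findall(line))  # ['6-month']
```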

The final cTAKES preprocessing pipeline consisted of the following sequential modules: (1) sectioning, to identify the headers and annotate the various sections of the radiology report; (2) sentence splitting; (3) tokenizing, to split the sentences into tokens and classify each lexical unit into different token types; and (4) section filtering. In the tokenizing module, we also included the context-dependent tokenizer to identify roman numerals that are sometimes used in the text to specify BI-RADS category values. The cTAKES context-dependent tokenizer creates annotations from one or more tokens, using surrounding token types as clues.

Model development

We divided the BI-RADS category extraction task into two specific tasks (Figure 2). The first task was to develop an automated method for annotation of all BI-RADS risk assessment values within the breast radiology reports (including screening mammogram, diagnostic mammogram, breast ultrasound and breast MRI). The problem was formulated as a sequence prediction problem. All tokens within the text were classified as ‘BI-RADS category value’ or ‘not BI-RADS category value’, based on contextual information in the surrounding line of text. We trained a linear chain Conditional Random Fields (CRF) model to perform this task [30]. The second task was to develop an automated method for classifying the resulting annotations as Left, Right, Bilateral, Overall, or Nonspecific. The problem was formulated as a multi-class classification problem. We trained and compared multiple classifiers to perform this task.
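
As a toy illustration of the task 1 formulation (the report line is invented), every token in a line receives a label, and only the category digit is positive:

```python
# Token-level sequence labeling for task 1: 'B' marks a BI-RADS category
# value, 'O' marks everything else.
tokens = ["RIGHT", "BREAST", ":", "BI-RADS", "Category", "4", "."]
labels = ["O",     "O",      "O", "O",       "O",        "B", "O"]
```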

Figure 2. Example radiology reports of increasing complexity with multiple BI-RADS statements, depicting the two separate tasks.

Derived datasets for training and testing

We split dataset 4 as follows: a training set containing 60% of all instances, a development set containing 10% of all instances, and a test set containing 30% of all instances. The training set was used to train the classifiers, and the test set was used for the final evaluation, including the final error analysis. For task 1, we used the development set for initial model building, preliminary experiments, and feature development, construction, and selection; consequently, the development set was excluded from the training and testing splits. Error analysis was performed throughout the development process, and the preprocessing system or model was adjusted accordingly. For task 2, we did not use the development set, so we added it to the test set to increase the sample size and variety of the documents used in the final evaluation.

Task 1 – Development of a BI-RADS category value annotator

We used the open-source MAchine Learning for LanguagE Toolkit (MALLET) [31] to build a linear chain CRF model for BI-RADS token annotation. CRFs are a common method for information extraction and have been applied successfully to a wide variety of clinical text and reports [32–35]. Linear chain CRFs model dependencies between the labels of adjacent tokens, and are therefore used to predict a sequence of labels given a sequence of tokens.

We used the following set of cTAKES-derived labels as features to train the model: 1) report section, 2) token type, and 3) context-dependent token type. We used an iterative approach to feature construction and selection. For each set of features, we trained the model on the training set, then computed evaluation metrics and performed error analysis on the development set. We used our error analysis to identify causes of false negatives and false positives, and these insights were employed to improve the preprocessing pipeline and to add or change features used in the subsequent model. Feature selection was expert-determined and refined using the results of the error analysis. After we reached an acceptable level of performance, the final model was tested once on the test set.
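
A minimal Python sketch of this setup is shown below, using sklearn-crfsuite rather than MALLET (which the authors used via Java); the feature encoding and field names are assumptions for illustration.

```python
import sklearn_crfsuite  # pip install sklearn-crfsuite

def token_features(sent, i):
    """Features for token i of a sentence; each token is a dict produced
    by the preprocessing pipeline (field names here are assumptions)."""
    tok = sent[i]
    feats = {
        "section": tok["section"],              # e.g. 'impression'
        "token_type": tok["token_type"],        # e.g. 'numeric', 'word'
        "context_token": tok["context_token"],  # e.g. 'roman_numeral'
        "birads_anchor": tok["anchor"],         # from the anchor annotator
        "time_expression": tok["time_expr"],    # from the time annotator
    }
    if i > 0:  # a simple neighboring-token feature
        feats["prev_token_type"] = sent[i - 1]["token_type"]
    return feats

def sent2features(sent):
    return [token_features(sent, i) for i in range(len(sent))]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
# Training: X is a list of token sequences, y the matching 'B'/'O' labels.
# crf.fit([sent2features(s) for s in train_sents], train_labels)
```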

Task 2 – Development of a BI-RADS laterality classifier

We used the open-source Waikato Environment for Knowledge Analysis (WEKA) [36] to implement three different BI-RADS laterality classifiers: Naïve Bayes (NB) [37], rules from partial decision trees (PART) [38], and Support Vector Machine (SVM) [39]. We tested the ability of each model to classify the BI-RADS category value detected in the previous step as left, right, bilateral, overall, or nonspecific. We chose NB because it is the simplest type of Bayesian network and assumes conditional independence of the predictors given the target variable [37, 40]. The PART algorithm produces output that is easy to interpret; it uses partial decision trees to generate a decision list and employs a separate-and-conquer approach, making the “best” leaf into a rule in each iteration [38]. We trained the SVM classifier using sequential minimal optimization (SMO) [39, 41, 42] with a first-order polynomial kernel; this linear method produced the best performance when compared against a number of alternative kernels.
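
A hedged scikit-learn analogue of this comparison is sketched below. Note that scikit-learn has no PART implementation, so a decision tree stands in for it here, and a linear-kernel SVC approximates SMO with a first-order polynomial kernel; the toy feature dictionaries are invented.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "NB": MultinomialNB(),
    "SVM (linear, approximating SMO)": SVC(kernel="linear"),
    "decision tree (stand-in for PART)": DecisionTreeClassifier(),
}

# Two invented BI-RADS mentions, one feature dict per mention.
X = [
    {"bow=left": 1, "birads_value": 4, "n_birads": 3, "count_left": 2},
    {"bow=overall": 1, "birads_value": 2, "n_birads": 3, "count_left": 2},
]
y = ["Left", "Overall"]

for name, clf in models.items():
    pipe = make_pipeline(DictVectorizer(sparse=False), clf)
    pipe.fit(X, y)
    print(name, pipe.predict(X))
```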

Each classifier was trained with an identical set of features, automatically generated as cTAKES output. These features (a construction sketch in code follows the list) included:

  • Bag-of-words (BoW). Each unique word token was treated as a feature. We included only the word tokens present in the line of the report containing the BI-RADS category value identified in Task 1. We did not normalize or remove stop words.

  • Total number of BI-RADS category value annotations. This value was obtained by counting the number of BI-RADS category values present in the report.

  • BI-RADS category value. We used the BI-RADS value (0–6) assigned by the radiologist and automatically extracted in Task 1.

  • BI-RADS sequence. We assigned a rank order to each BI-RADS token annotation based on its order of appearance in the report (e.g., first, middle, last). When there was only one BI-RADS category value in the report, it was considered the last BI-RADS annotation.

  • Imaging study type. The procedure type, extracted from the report title in the radiology report system or the Technique description section. Possible values were mammogram, ultrasound, or magnetic resonance imaging.

  • Laterality. This feature was obtained from the report title in the radiology report system or the Technique description section, and specified whether the study was performed on one breast or both breasts.

  • Breast(s) studied. The anatomic location, obtained from the report title in the radiology report system or the Technique description section. It specified whether the study was performed on the left breast or the right breast.

  • Laterality word token counts. Three additional features obtained by counting mentions of the word tokens “left”, “right”, and “bilateral” throughout the complete report.
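
A sketch of how such a feature dictionary might be assembled for one BI-RADS mention follows; the function signature and field names are assumptions, since the paper derives these features from cTAKES output rather than from raw strings.

```python
from collections import Counter

def mention_features(report_text, line_text, value, rank, n_total, study):
    """Assemble the feature dictionary for a single BI-RADS mention."""
    counts = Counter(report_text.lower().split())
    feats = {f"bow={w}": 1 for w in line_text.lower().split()}  # bag of words
    feats.update({
        "birads_value": value,       # 0-6, extracted in task 1
        "n_birads": n_total,         # total BI-RADS mentions in the report
        "sequence": rank,            # 'first', 'middle', or 'last'
        "study_type": study,         # mammogram / ultrasound / MRI
        "count_left": counts["left"],
        "count_right": counts["right"],
        "count_bilateral": counts["bilateral"],
    })
    return feats
```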

Statistical analysis

We used standard metrics to evaluate the performance of both the BI-RADS annotator and classifier, including recall, precision, F1-measure, and accuracy. We separately measured performance for each BI-RADS laterality category.

For task 1 (BI-RADS category value annotation), a token was counted as a true positive if the model predicted ‘BI-RADS value’ and the gold annotation agreed, and as a true negative if the model predicted ‘not BI-RADS value’ and the gold annotation agreed. False negatives were tokens for which the model incorrectly predicted ‘not BI-RADS value’, and false positives were tokens for which the model incorrectly predicted ‘BI-RADS value’, in each case relative to the gold annotation.

For task 2 (BI-RADS laterality classification), a given classification was counted as a true positive if the predicted label (from the unordered set ‘Left’, ‘Right’, ‘Bilateral’, ‘Overall’, ‘Nonspecific’) was identical to the gold annotation. True negatives were the correctly recognized cases that did not belong to the class under consideration. Any non-identical classification was counted as an error. For this multi-class classification problem, we used micro-averaging [43] to calculate overall performance metrics and the F1 measure used to compare classifiers. We measured performance for all BI-RADS category values and also for the subset of BI-RADS categories 3, 4, and 5. We evaluated this subset separately because these categories indicate an increased risk of cancer and trigger a subsequent clinical action (re-imaging after some interval, additional imaging, or biopsy). Given our anticipated use case, discriminative performance is most important in this subset.
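
For reference, the standard micro-averaged definitions over classes c = 1, …, k (following Sokolova and Lapalme [43]) are:

```latex
\mathrm{Precision}_{\mu} = \frac{\sum_{c=1}^{k} \mathrm{TP}_c}{\sum_{c=1}^{k} (\mathrm{TP}_c + \mathrm{FP}_c)}, \qquad
\mathrm{Recall}_{\mu} = \frac{\sum_{c=1}^{k} \mathrm{TP}_c}{\sum_{c=1}^{k} (\mathrm{TP}_c + \mathrm{FN}_c)}, \qquad
F1_{\mu} = \frac{2 \, \mathrm{Precision}_{\mu} \, \mathrm{Recall}_{\mu}}{\mathrm{Precision}_{\mu} + \mathrm{Recall}_{\mu}}
```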

4. Results

4.1. Rule based approach (BROK)

Baseline algorithm performance

Overall algorithm performance on dataset 1 showed that the original BROK performed relatively poorly in this sample (Table 4), with an F1 value of 0.55, favoring precision (0.69) over recall (0.46). BROK accurately classified the BI-RADS category for the left breast in 50.2% of the reports and for the right breast in 49.8% of the documents. Most of the reports contained a final BI-RADS category for each individual breast.

Table 4.

BROK performance metrics

System      Original    Original    Adapted
Dataset     Dataset 1   Dataset 3   Dataset 3
Accuracy    0.50        0.59        0.74
Recall      0.46        0.62        0.78
Precision   0.69        0.64        0.81
F1          0.55        0.63        0.79

Comparison of BROK and adapted BROK

Following our initial error analysis, we then compared the performance of the original version of BROK against the adapted version of BROK (Table 4). We observed some variation in performance of the original BROK across datasets, with increased and more balanced performance in dataset 3. Adaptations we made to the algorithm had a positive effect on performance when compared to the original system.

BROK Error Analysis

The algorithm was unable to classify 131 of the BI-RADS instances present in dataset 1, of which 74 were errors in the extraction of right-breast BI-RADS categories and 57 were errors in the extraction of left-breast BI-RADS categories. Nearly three quarters of the extraction errors occurred in documents whose final category represented higher malignancy risk: BI-RADS category 4 (22.1%), BI-RADS category 3 (19.1%), BI-RADS category 5 (16%), and BI-RADS category 6 (16%).

In datasets 1 and 2, reports that contained data extraction errors shared one or more of the following features: multiple BI-RADS categories for each breast, difficulties in preprocessing, addenda, and multiple imaging techniques in a single report. Most of the data extraction errors resulted from variations in the reporting template for BI-RADS categories. We grouped the errors into categories and determined the most likely source of each (Table 5). On the basis of the error analysis, we adapted the BROK algorithm in three ways: (1) removed the existing rule that suppressed statements beginning with outline numbers, (2) included additional terms in the regular expressions, and (3) increased the allowable distance between the BI-RADS anchor, laterality, and category terms. The adapted version of the algorithm, tested on dataset 3, showed continued limitations because it relied too heavily on the terms in the regular expressions, which were frequently not present within the same line.

Table 5.

Description of most common data extraction errors using BROK

Erroneous bilateral assignation. Definition: the algorithm erroneously assigned a category to both breasts. Error source: removal of numerals from a list; the reporting template allowed BI-RADS categories to appear at the beginning of a sentence, and BROK removed these numerals as if they were list markers, which impeded identification.

BI-RADS category not found (regular-expression trigger word not found). Definition: BROK did not assign any BI-RADS category because it failed to identify a trigger word. Error source: BI-RADS acronym variant not included; some variations of the BI-RADS acronym (e.g., ‘BI RADS’, ‘ACR code’) were not included in the regular expressions.

Laterality contextual errors. Definition: the tool failed to identify the correct laterality because it was not within the token distance specified in the algorithm. Error source: distance between critical terms; the laterality term was too far from the BI-RADS trigger of the regular expression.

We observed that a major limitation of BROK was that it could only assign classifications at the document level. For our future system, we anticipated the need to evaluate multiple lesions on radiology studies and correlate these lesions with the associated pathologic findings. Therefore, a basic requirement for methods development was annotation at the mention level to ensure that both breasts could be separately evaluated.

4.2. Machine learning approach

Annotation process and Corpus development

A total of 120 documents were double annotated (annotated by two individuals) and 479 were single annotated (annotated by one individual). We obtained high levels of agreement from the beginning of the process, and these were sustained throughout the annotation task (Figure 3). The final corpus contained 1,014 annotated BI-RADS instances in 599 radiology reports (Table 6). One report was excluded from the corpus because it turned out to be a pathology report. The final IAA across all 120 double-annotated documents was 0.95.

Table 6.

BI-RADS corpus distribution

Corpus split   Documents   Word tokens   Numeric tokens (machine)   BI-RADS tokens (gold)   BI-RADS category 0/1/2/3/4/5/6
Training       368         105333        7588                       608                     83/79/129/100/90/77/50
Development    58          16311         1189                       101                     12/15/16/32/14/7/5
Test           173         55711         4077                       305                     48/34/67/50/46/36/24
Total          599         177355        12854                      1014                    143/128/212/182/150/120/79

We split the data into three sets: a training set containing 60% of all instances (608 BI-RADS instances, 368 documents), a development set containing 10% of all instances (101 BI-RADS instances, 58 documents), and a test set containing 30% of all instances (305 BI-RADS instances, 173 documents). BI-RADS instances across different categories were uniformly distributed, and corresponded to approximately 10% of all numeric tokens present in the corpus. Distributions were similar across all data sets (Table 6).

Evaluation of BI-RADS Token Annotator (Task 1)

Only three features were included in the baseline model, so that we could start from the simplest possible model and improve on it: (1) section (e.g., technique, findings, impression, or recommendation); (2) token type (e.g., word, punctuation, symbol, newline, contraction, or numeric); and (3) context-dependent token type (e.g., roman numeral, range, or measurement). The unit of analysis was the token. A numeric token is defined as a consecutive series of digits.

Error analysis on the development set showed that failures in the sectionizer impaired the algorithm’s ability to detect BI-RADS categories, and that time expressions (12-hour format) and time-length expressions (e.g., “6-months”) followed a token sequence very similar to that of BI-RADS categories. We therefore adjusted the section headers so that the section feature annotated only the impression, recommendation, and addendum sections. We also developed two UIMA rule-based annotators, one for BI-RADS anchor words and one for time expressions. We implemented these changes sequentially and observed continuous improvement in overall performance on the development set. The final model demonstrated a recall of 0.93, precision of 0.98, and F1 measure of 0.95. A more detailed account of the model’s performance is presented in Table 7.

Table 7.

BI-RADS token annotator performance

Task: BI-RADS token annotator. Features: Segment, TokenType, ContextToken, BI-RADS Anchor, Time Expression. Output: BI-RADS category.
TP 283   TN 76224   FP 7   FN 22   Recall 0.93   Precision 0.98   F1 0.95

Evaluation of BI-RADS Annotation Classifier (Task 2)

Average classification performance was similar in the test set across all models. For all BI-RADS token annotation values (0 to 6), NB demonstrated a weighted F1 measure of 0.83, SVM a weighted F1 measure of 0.89, and the PART model a weighted F1 measure of 0.91. For the BI-RADS annotation subset (values 3–5), NB demonstrated a weighted F1 measure of 0.87, SVM a weighted F1 measure of 0.90, and the PART model a weighted F1 measure of 0.93. PART model performance was slightly superior and more balanced across the classes. Further investigation of the classification performance by class showed that performance improved for all classes from NB to SVM, and again from SVM to PART.

Bilateral BI-RADS was the class with the highest rate of classification errors for all models. Most of these errors were due to the lack of an explicit indication of the class within the report or in the feature derived from the study name. Table 8 presents a detailed description of the classification performance of the different models. Although we sought to include ‘nonspecific’ as a fifth class, the total number of instances of this class (N=2) was too low; both instances were assigned to the training set, and the class therefore could not be properly evaluated in Table 8.

Table 8.

BI-RADS classifiers performance

Classifier BI-RADS values Output TP TN FP FN Recall Precision F1
Naïve Bayes
0 – 6 Left BI-RADS 97 279 21 9 0.92 0.82 0.87
Right BI-RADS 81 292 15 18 0.82 0.84 0.83
Bilateral BI-RADS 45 325 20 16 0.74 0.69 0.71
Overall BI-RADS 113 254 12 27 0.81 0.90 0.85
Nonspecific BI-RADS 0 404 2 0 0.00 0.00 0.00
Weighted Average 0.83

3 – 5 Left BI-RADS 46 130 10 0 1.00 0.82 0.90
Right BI-RADS 50 124 6 6 0.89 0.89 0.89
Bilateral BI-RADS 3 173 4 6 0.33 0.43 0.38
Overall BI-RADS 64 108 3 11 0.85 0.96 0.90
Nonspecific BI-RADS 0 186 0 0 0.00 0.00 0.00
Weighted Average 0.87

SVM
0 – 6 Left BI-RADS 99 278 22 7 0.93 0.82 0.87
Right BI-RADS 90 299 8 9 0.91 0.92 0.91
Bilateral BI-RADS 49 338 7 12 0.80 0.88 0.84
Overall BI-RADS 124 259 7 16 0.89 0.95 0.92
Nonspecific BI-RADS 0 406 0 0 0.00 0.00 0.00
Weighted Average 0.89

3 – 5 Left BI-RADS 46 128 12 0 1.00 0.79 0.88
Right BI-RADS 50 129 1 6 0.89 0.98 0.93
Bilateral BI-RADS 6 174 3 3 0.67 0.67 0.67
Overall BI-RADS 66 109 2 9 0.88 0.97 0.92
Nonspecific BI-RADS 0 186 0 0 0.00 0.00 0.00
Weighted Average 0.90

PART
0 – 6 Left BI-RADS 101 293 7 5 0.95 0.94 0.94
Right BI-RADS 89 295 12 10 0.90 0.88 0.89
Bilateral BI-RADS 51 339 6 10 0.84 0.89 0.86
Overall BI-RADS 127 253 13 13 0.91 0.91 0.91
Nonspecific BI-RADS 0 406 0 0 0.00 0.00 0.00
Weighted Average 0.91

3 – 5 Left BI-RADS 45 138 2 1 0.98 0.96 0.97
Right BI-RADS 49 128 2 7 0.88 0.96 0.92
Bilateral BI-RADS 9 174 3 0 1.00 0.75 0.86
Overall BI-RADS 70 105 6 5 0.93 0.92 0.93
Nonspecific BI-RADS 0 186 0 0 0.00 0.00 0.00
Weighted Average 0.93

5. Discussion

We evaluated an existing method for the extraction of final BI-RADS assessment categories at the document level, and then developed our own multi-model NLP system for the annotation and classification of all BI-RADS assessment categories present in a single radiology report.

To the best of our knowledge, there are no other published reports that may be directly compared to our results. Many studies have developed tools to extract useful information from breast radiology reports using NLP techniques [21, 22, 44–47]. However, most of them have been limited to extraction of observed features from the reports, such as density, shape, and calcifications, and they relate exclusively to mammography reports. Our work differs because our goal was to extract the BI-RADS categories rather than the description of the findings. Only the publication of the BROK algorithm [24] had objectives similar to our research. An important difference between our goals and the goals of the BROK authors is that we specifically sought to annotate all mentions of BI-RADS categories in a report, including both breasts as well as the overall assessment. In contrast, the authors of BROK focused only on the final BI-RADS category for a given report. Consequently, it is not meaningful to directly compare our results to those of these previous authors.

Rule based approach (BROK)

Our evaluation of BROK showed a very different performance profile when compared to the previously published evaluation [24]. Even the adaptations we performed on the algorithm were not sufficient to obtain a satisfactory classification performance for our purposes. Two factors could have contributed to this observed difference between these evaluations.

First, the increased number of hospitals, longer time frame, and greater complexity of reports in our evaluation, compared with the original evaluation, could explain the decreased performance that we observed. In general, rule-based approaches can be brittle, because the algorithm must account for many different potential patterns, which must be known a priori. We note that the regular expression approach might be improved by the implementation of specialty-specific standardized lexicons; in radiology, there is such a trend towards standardization of radiological language [48]. However, the use of templates and lexicons is still not the norm. Another important difference is that our evaluation was conducted on a sample stratified by BI-RADS category. Because the preponderance of breast imaging studies are screening mammography exams with negative/benign results [49–51], the sample used in the previously published BROK evaluation is likely to be heavily skewed towards simpler and more uniform language.

Second, we identified the level of classification of BROK to be problematic. BROK identifies the BI-RADS category across the entire document. However, cases with differing and actionable BI-RADS assessments of each breast are not uncommon. BROK also does not attempt to differentiate between a bilateral and an overall BI-RADS assessment. A bilateral assessment with high risk of malignancy has completely different implications for patients and clinicians, in comparison to an overall assessment with high risk of malignancy. This is particularly important in radiologic-pathologic correlation, where two specimens must be correlated in the case of a bilateral BI-RADS assessment, whereas only one specimen must be correlated in the case of an overall BI-RADS assessment.

Finally, since our goal is to develop the methods needed for a learning healthcare system feedback loop, we considered more complex reports associated with higher BI-RADS categories to be very important, even though they are less frequent in any large corpus. For all of these reasons, we elected to develop our own method for BI-RADS annotation and information extraction, using a machine learning approach.

BI-RADS category value annotator

The use of multi-step systems for clinical information extraction tasks is not new. Such systems have been successfully applied in the past for diverse tasks such as coreference resolution [52] and relationship extraction [33]. Our system for BI-RADS annotation and classification shares the same general structure: the first stage consists of a preprocessing step followed by a machine learning algorithm that focuses on recognition of the entity, and the second stage implements a different machine learning algorithm that uses the output of the previous stage and performs the classification task. Although our system follows this general structure, our approach differs in the methods used and in the features we aimed to extract and classify.

We developed a linear-chain CRF model that demonstrated high recall, precision, and accuracy for the annotation of all BI-RADS tokens within any section of a breast radiology report. Our BI-RADS token annotator detects BI-RADS categories with a high level of granularity and accuracy, and is capable of identifying all types of BI-RADS token categories (including overall and bilateral BI-RADS assessments) present in breast imaging reports. Importantly, because we specifically sought to categorize the full range of categories, and to annotate all assessments (instead of producing a single annotation for the entire document), results from our study are not easily compared with any existing method, including BROK.

The pipeline approach is an important strength of our method which likely contributed to its high accuracy. Specifically, an automated section detection method was used to produce machine annotation for sections that were used as features for the CRF. Similarly, we used a modified tokenizer to create machine annotations for common numeric token types (including time), which were also features in our CRF. These features decreased the number of false negatives, as shown in our development set results. The inclusion of our UIMA annotators for time and BI-RADS anchors also helped to decrease the number of false positives.

Results from our error analysis showed that in rare situations, sectionizer failures at the beginning of a section were more likely to produce errors for the model. These problems could be easily solved in the future by standardizing the report’s format. However, the overall impact of these errors on the system performance was quite small.

BI-RADS laterality classifiers

NB networks and SVM classifiers have been used in the past for classification of radiology reports [21, 53–57]. NB networks have been used specifically for classification of breast radiology reports [58, 59]. However, these studies have been limited to malignancy risk detection from BI-RADS lexicon features. SVM after CRF detection has been used successfully for classification in biomedical text [33, 52]. Importantly, however, no study to date has tried to classify reports into specific categories at the level of granularity that we did in this study.

Our results demonstrate that we were able to assign BI-RADS laterality categories with moderate to very good performance with all of the classifiers. Our best classifier used the PART algorithm which demonstrated very good recall, precision and F1 measure for the multiclass classification of BI-RADS categories, once the annotation has been detected. Decision trees have the advantage of being relatively easy to interpret.

All models performed better for the left and right classes than for the bilateral class. Many factors could account for this observed classification behavior, including (1) the expert-constructed feature space may be biased towards detection of the left and right classes, (2) variations in reporting across different radiology techniques could produce missing data for key features describing bilateral assessments, (3) the absence of specific features describing bilateral and overall BI-RADS, (4) laterality coreference within the reports, and (5) ambiguity within the reports.

Performance variation among classifiers may reflect the fact that we did not fully explore the entire feature space. Expert-constructed features introduced the potential for conflicting features, which a decision list is likely to resolve more simply; this may explain PART’s better performance.

We included features such as imaging study and laterality counts to balance the assumption that a significant amount of information was present in the same text line of the report as the BI-RADS value. By including these features, we attempted to enhance semantic information and reduce the ambiguity that could be present in a single line of the report. In a recent study by Bozkurt and Rubin [60], a different approach to reduce ambiguity in BI-RADS reporting was assessed, and these researchers were able to identify mismatches between BI-RADS reported categories and descriptions of mammogram laterality. Detection of ambiguities could help reduce variability in reporting and improve performance of algorithms that aim to classify laterality.

Limitations

An important potential limitation of our methods is that we used the BROK software to sample by BI-RADS category during development of the corpus used to build our machine-learning-based annotator. Although we manually classified or annotated all documents, it is likely that the most complex documents (classified by BROK as ambiguous due to multiple BI-RADS mentions) were not included in dataset 4. A future prospective study of the software against manual abstractors will be needed to determine whether our machine-learning-based annotator is robust for such highly complex reports.

For the BI-RADS token annotator, the generalizability of our findings may also be limited because the detection step of our annotator strongly relies on the accurate performance of the preprocessing steps.

6. Conclusion

In conclusion, we have developed and evaluated a complete NLP system for automated BI-RADS annotation and classification, using a novel approach. BI-RADS classification with PART showed very good performance, and was consistent across all laterality classes. Our system is able to provide a detailed list of BI-RADS categories present in a single radiology report.

This work provides a solid foundation for future studies that will leverage automated BI-RADS annotation, to provide feedback to radiologists as part of a learning health system loop.

Highlights.

  • We developed an NLP system to extract all BI-RADS categories from breast radiology reports.

  • We evaluated an existing rule-based NLP system and developed our own using a multi-step approach.

  • ML classifiers can detect and assign laterality to all BI-RADS categories in a single report.

Acknowledgments

We thank Maria Bond at the University of Pittsburgh Department of Biomedical Informatics for preparation and review of the manuscript. We also thank Guergana Savova at Harvard Medical School and Computational Health Informatics Program and Harry Hochheiser at University of Pittsburgh Department of Biomedical Informatics for expert review of the manuscript.

8. Funding Sources

This research was supported by the National Cancer Institute (1U24CA184407). The first author (SC) was supported by the Fogarty Training Grant 1 D43 TW008443.

Abbreviations

ACR

American College of Radiology

BI-RADS

Breast Imaging Reporting and Data System

BROK

BI-RADS Observation Kit

CI

confidence interval

CRF

conditional random fields

cTAKES

clinical Text Analysis and Knowledge Extraction System

MALLET

MAchine Learning for LanguagE Toolkit

ML

machine learning

MQSA

Mammography Quality Standards Act

NB

Naïve Bayes

NLP

Natural Language Processing

SVMs

support vector machines

TIES

Text Information Extraction System

UIMA

Unstructured Information Management Architecture

WEKA

Waikato Environment for Knowledge Analysis


9. Conflicts of Interest

At the time of this publication, the authors do not report any conflict of interest.

References

  • 1.Siegel RL, Miller KD, Jemal A. Cancer statistics, 2016: Cancer Statistics, 2016. CA Cancer J Clin. 2016;66:7–30. doi: 10.3322/caac.21332. [DOI] [PubMed] [Google Scholar]
  • 2.Torre LA, Bray F, Siegel RL, Ferlay J, Lortet-Tieulent J, Jemal A. Global cancer statistics, 2012: Global Cancer Statistics, 2012. CA Cancer J Clin. 2015;65:87–108. doi: 10.3322/caac.21262. [DOI] [PubMed] [Google Scholar]
  • 3.Berry DA, Cronin KA, Plevritis SK, Fryback DG, Clarke L, Zelen M, et al. Effect of Screening and Adjuvant Therapy on Mortality from Breast Cancer. New England Journal of Medicine. 2005;353:1784–92. doi: 10.1056/NEJMoa050518. [DOI] [PubMed] [Google Scholar]
  • 4.U.S. Preventive Services Task Force. Screening for breast cancer: U.S. Preventive Services Task Force recommendation statement. Ann Intern Med. 2009;151:716–26. W-236. doi: 10.7326/0003-4819-151-10-200911170-00008. [DOI] [PubMed] [Google Scholar]
  • 5.Oeffinger KC, Fontham EH, Etzioni R, et al. Breast cancer screening for women at average risk: 2015 guideline update from the american cancer society. Journal of the American Medical Association. 2015;314:1599–614. doi: 10.1001/jama.2015.12783. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.American College of Radiology, Bi-Rads Committee. ACR BI-RADS atlas: breast imaging reporting and data system. Reston, VA: American College of Radiology; 2013. [Google Scholar]
  • 7.Eberl MM, Fox CH, Edge SB, Carter CA, Mahoney MC. BI-RADS Classification for Management of Abnormal Mammograms. J Am Board Fam Med. 2006;19:161–4. doi: 10.3122/jabfm.19.2.161. [DOI] [PubMed] [Google Scholar]
  • 8.Lacquement MA, Mitchell D, Hollingsworth AB. Positive predictive value of the Breast Imaging Reporting and Data System. J Am Coll Surg. 1999;189:34–40. doi: 10.1016/s1072-7515(99)00080-0. [DOI] [PubMed] [Google Scholar]
  • 9.Orel SG, Kay N, Reynolds C, Sullivan DC. BI-RADS categorization as a predictor of malignancy. Radiology. 1999;211:845–50. doi: 10.1148/radiology.211.3.r99jn31845. [DOI] [PubMed] [Google Scholar]
  • 10.Sickles EA, D’Orsi CJ. ACR BI-RADS® Follow-up and Outcome Monitoring ACR BI-RADS® Atlas, Breast Imaging Reporting and Data System. Reston, VA: American College of Radiology; 2013. [Google Scholar]
  • 11.U. S. Food and Drug Administration. Mammography Quality Standards Act (MQSA) http://www.fda.gov/Radiation-EmittingProducts/MammographyQualityStandardsActandProgram/Regulations/ucm110823.htm, (accessed April 22, 2015).
  • 12.Steele JR, Hovsepian DM, Schomer DF. The Joint Commission Practice Performance Evaluation: a Primer for Radiologists. J am Coll Radiol. 2010;7:425–30. doi: 10.1016/j.jacr.2010.01.027. [DOI] [PubMed] [Google Scholar]
  • 13.Geller BM, Kerlikowske K, Carney PA, Abraham LA, Yankaskas BC, Taplin SH, et al. Mammography surveillance following breast cancer. Breast Cancer Res Treat. 2003;81:107–15. doi: 10.1023/A:1025794629878. [DOI] [PubMed] [Google Scholar]
  • 14.Cai T, Giannopoulos AA, Yu S, Kelil T, Ripley B, Kumamaru KK, et al. Natural Language Processing Technologies in Radiology Research and Clinical Applications. RadioGraphics. 2016;36:176–91. doi: 10.1148/rg.2016150080. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Cheng LTE, Zheng J, Savova GK, Erickson BJ. Discerning Tumor Status from Unstructured MRI Reports—Completeness of Information in Existing Reports and Utility of Automated Natural Language Processing. Journal of Digital Imaging: the official journal of the Society for Computer Applications in Radiology. 2010;23:119–32. doi: 10.1007/s10278-009-9215-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Pham A-D, Névéol A, Lavergne T, Yasunaga D, Clément O, Meyer G, et al. Natural language processing of radiology reports for the detection of thromboembolic diseases and clinically relevant incidental findings. BMC Bioinformatics. 2014;15:266. doi: 10.1186/1471-2105-15-266. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Pons E, Braun LMM, Hunink MGM, Kors JA. Natural Language Processing in Radiology: A Systematic Review. Radiology. 2016;279:329–43. doi: 10.1148/radiol.16142770. [DOI] [PubMed] [Google Scholar]
  • 18.Liu F, Chen J, Jagannatha A, Yu H. Learning for Biomedical Information Extraction: Methodological Review of Recent Advances. 2016 http://arxiv.org/abs/1606.07993. (accessed June 26, 2016)
  • 19.Friedman C, Rubin J, Brown J, Buntin M, Corn M, Etheredge L, et al. Toward a science of learning systems: a research agenda for the high-functioning Learning Health System. J Am Med Inform Assoc. 2015;22:43–50. doi: 10.1136/amiajnl-2014-002977. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Institute of Medicine (US) Olsen L, Aisner D, McGinnis JM. The learning healthcare system: workshop summary. 2011 http://www.ncbi.nlm.nih.gov/books/NBK92076/ (accessed August 11, 2015) [PubMed]
  • 21.Bozkurt S, Lipson JA, Senol U, Rubin DL. Automatic abstraction of imaging observations with their characteristics from mammography reports. J Am Med Inform Assoc. 2015;22:e81–92. doi: 10.1136/amiajnl-2014-003009. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Nassif H, Cunha F, Moreira IC, Cruz-Correia R, Sousa E, Page D, et al. Extracting BI-RADS Features from Portuguese Clinical Texts. IEEE Int Conf Bioinforma Biomed. 2012:1–4. doi: 10.1109/bibm.2012.6392613. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Xu Y, Tsujii J, Chang EIC. Named entity recognition of follow-up and time information in 20 000 radiology reports. J Am Med Inform Assoc. 2012;19:792–9. doi: 10.1136/amiajnl-2012-000812. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Sippo DA, Warden GI, Andriole KP, Lacson R, Ikuta I, Birdwell RL, et al. Automated Extraction of BI-RADS Final Assessment Categories from Radiology Reports with Natural Language Processing. J Digit Imaging. 2103;26:989–94. doi: 10.1007/s10278-013-9616-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Crowley RS, Castine M, Mitchell K, Chavan G, McSherry T, Feldman M. caTIES: a grid based system for coding and retrieval of surgical pathology reports and tissue specimens in support of translational research. Journal of the American Medical Informatics Association. 2010;17:253–64. doi: 10.1136/jamia.2009.002295. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.The TIES Cancer Research Network (TCRN) Homepage.
  • 27.Chen W-T, Styler W. Anafora: A Web-based General Purpose Annotation Tool. In: Vanderwende K, Daumé H III, Kirchhoff K, editors. 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Atlanta, GA: The Association for Computational Linguistics; 2013. pp. 14–9. [PMC free article] [PubMed] [Google Scholar]
  • 28.Dligach D, Palmer M. LAW V’11 Proceedings of the 5th Linguistic Annotation Workshop. Stroudsburg, PA: Association for Computational Linguistics; 2011. Reducing the Need for Double Annotation; pp. 65–73. [Google Scholar]
  • 29.Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17:507–13. doi: 10.1136/jamia.2009.001560. [DOI] [PMC free article] [PubMed] [Google Scholar]
30. Lafferty JD, McCallum A, Pereira FCN. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001). Morgan Kaufmann Publishers Inc; 2001. pp. 282–9.
31. McCallum AK. MALLET: A Machine Learning for Language Toolkit. 2002. http://mallet.cs.umass.edu.
32. Jiang M, Chen Y, Liu M, Rosenbloom ST, Mani S, Denny JC, et al. A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries. J Am Med Inform Assoc. 2011;18:601–6. doi: 10.1136/amiajnl-2011-000163.
33. Patrick J, Li M. High accuracy information extraction of medication information from clinical notes: 2009 i2b2 medication extraction challenge. J Am Med Inform Assoc. 2010;17:524–7. doi: 10.1136/jamia.2010.003939.
34. Torii M, Wagholikar K, Liu H. Using machine learning for concept extraction on clinical documents from multiple data sources. J Am Med Inform Assoc. 2011;18:580–7. doi: 10.1136/amiajnl-2011-000155.
35. Uzuner O, Solti I, Cadag E. Extracting medication information from clinical text. J Am Med Inform Assoc. 2010;17:514–8. doi: 10.1136/jamia.2010.003947.
36. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA Data Mining Software: An Update. SIGKDD Explor Newsl. 2009;11:10–8.
37. Meyfroidt G, Güiza F, Ramon J, Bruynooghe M. Machine learning techniques to examine large patient databases. Best Pract Res Clin Anaesthesiol. 2009;23:127–43. doi: 10.1016/j.bpa.2008.09.003.
38. Frank E, Witten IH. Generating Accurate Rule Sets Without Global Optimization. In: Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc; 1998. pp. 144–51.
39. Platt JC. Fast training of support vector machines using sequential minimal optimization. In: Schölkopf B, Burges CJC, Smola AJ, editors. Advances in kernel methods. MIT Press; 1999. pp. 185–208.
40. John GH, Langley P. Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Montreal, Quebec, Canada: Morgan Kaufmann Publishers Inc; 1995. pp. 338–45.
41. Hastie T, Tibshirani R. Classification by pairwise coupling. Ann Statist. 1998;26:451–71.
42. Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK. Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Comput. 2001;13:637–49.
43. Sokolova M, Lapalme G. A systematic analysis of performance measures for classification tasks. Inf Process Manag. 2009;45:427–37.
44. Gao H, Aiello Bowles EJ, Carrell D, Buist DSM. Using Natural Language Processing to Extract Mammographic Findings. J Biomed Inform. 2015;54:77–84. doi: 10.1016/j.jbi.2015.01.010.
45. Jain NL, Friedman C. Identification of findings suspicious for breast cancer based on natural language processing of mammogram reports. In: Proceedings of the American Medical Informatics Association Annual Fall Symposium; 1997. pp. 829–33.
46. Nassif H, Woods R, Burnside E, Ayvaci M, Shavlik J, Page D. Information Extraction for Clinical Data Mining: A Mammography Case Study. In: Proceedings of the IEEE International Conference on Data Mining Workshops; 2009. pp. 37–42. doi: 10.1109/icdmw.2009.63.
47. Percha B, Nassif H, Lipson J, Burnside E, Rubin D. Automatic classification of mammography reports by BI-RADS breast tissue composition class. J Am Med Inform Assoc. 2012;19:913–6. doi: 10.1136/amiajnl-2011-000607.
48. RadLex. 2016. http://www.rsna.org/RadLex.aspx (accessed June 24, 2016).
49. Akande HJ, Olafimihan BB, Oyinloye OI. A five year audit of mammography in a tertiary hospital, North Central Nigeria. Niger Med J. 2015;56:213–7. doi: 10.4103/0300-1652.160401.
50. Badan GM, Roveda Júnior D, Ferreira CAP, de Noronha Junior OA. Complete internal audit of a mammography service in a reference institution for breast imaging. Radiol Bras. 2014;47:74–8. doi: 10.1590/S0100-39842014000200007.
51. Poplack SP, Tosteson AN, Grove MR, Wells WA, Carney PA. Mammography in 53,803 women from the New Hampshire mammography network. Radiology. 2000;217:832–40. doi: 10.1148/radiology.217.3.r00dc33832.
52. Jonnalagadda SR, Li D, Sohn S, Wu ST-I, Wagholikar K, Torii M, et al. Coreference analysis in clinical notes: a multi-pass sieve with alternate anaphora resolution modules. J Am Med Inform Assoc. 2012;19:867–74. doi: 10.1136/amiajnl-2011-000766.
53. Benndorf M, Kotter E, Langer M, Herda C, Wu Y, Burnside ES. Development of an online, publicly accessible naive Bayesian decision support tool for mammographic mass lesions based on the American College of Radiology (ACR) BI-RADS lexicon. Eur Radiol. 2015;25:1768–75. doi: 10.1007/s00330-014-3570-6.
54. Bouzghar G, Levenback BJ, Sultan LR, Venkatesh SS, Cwanger A, Conant EF, et al. Bayesian Probability of Malignancy With BI-RADS Sonographic Features. J Ultrasound Med. 2014;33:641–8. doi: 10.7863/ultra.33.4.641.
55. Percha B. Machine Learning Approaches to Automatic BI-RADS Classification of Mammography Reports. Stanford, CA: Department of Biomedical Informatics, Stanford University; 2010.
56. Zuccon G, Wagholikar AS, Nguyen AN, Butt L, Chu K, Martin S, et al. Automatic Classification of Free-Text Radiology Reports to Identify Limb Fractures using Machine Learning and the SNOMED CT Ontology. AMIA Jt Summits Transl Sci Proc. 2013;2013:300–4.
57. Bozkurt S, Gimenez F, Burnside ES, Gulkesen KH, Rubin DL. Using Automatically Extracted Information from Mammography Reports for Decision-Support. J Biomed Inform. 2016. doi: 10.1016/j.jbi.2016.07.001.
58. Ferreira P, Fonseca NA, Dutra I, Woods R, Burnside E. Predicting Malignancy from Mammography Findings and Surgical Biopsies. In: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine; 2011. doi: 10.1109/BIBM.2011.71.
59. Fischer EA, Lo JY, Markey MK. Bayesian networks of BI-RADS descriptors for breast lesion classification. In: Proceedings of the 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (IEMBS '04); 2004. pp. 3031–4. doi: 10.1109/IEMBS.2004.1403858.
60. Bozkurt S, Rubin D. Automated detection of ambiguity in BI-RADS assessment categories in mammography reports. Stud Health Technol Inform. 2014;197:35–9.
