Author manuscript; available in PMC: 2023 Apr 1.
Published in final edited form as: Acad Radiol. 2021 Feb 11;29(4):479–487. doi: 10.1016/j.acra.2021.01.017

Automated Radiology-Arthroscopy Correlation of Knee Meniscal Tears Using Natural Language Processing Algorithms

Matthew D Li 1, Francis Deng 1, Ken Chang 1, Jayashree Kalpathy-Cramer 1, Ambrose J Huang 2
PMCID: PMC8355247  NIHMSID: NIHMS1673861  PMID: 33583713

Abstract

Rationale and Objective:

Train and apply natural language processing (NLP) algorithms for automated radiology-arthroscopy correlation of meniscal tears.

Materials and Methods:

In this retrospective single-institution study, we trained supervised machine learning models (logistic regression, support vector machine, and random forest) to detect medial or lateral meniscus tears on free-text MRI reports. We trained and evaluated model performances with cross-validation using 3593 manually annotated knee MRI reports. To assess radiology-arthroscopy correlation, we then randomly partitioned this dataset 80:20 for training and testing, where 108 test set MRIs were followed by knee arthroscopy within 1 year. These free-text arthroscopy reports were also manually annotated. The NLP algorithms trained on the knee MRI training dataset were then evaluated on the MRI and arthroscopy report test datasets. We assessed radiology-arthroscopy agreement using the ensembled NLP-extracted findings versus manually annotated findings.

Results:

The NLP models showed high cross-validation performance for meniscal tear detection on knee MRI reports (medial meniscus F1 scores 0.93–0.94, lateral meniscus F1 scores 0.86–0.88). When these algorithms were evaluated on arthroscopy reports, despite never training on arthroscopy reports, performance was similar, though higher with model ensembling (medial meniscus F1 score 0.97, lateral meniscus F1 score 0.99). However, ensembling did not improve performance on knee MRI reports. In the radiology-arthroscopy test set, the ensembled NLP models were able to detect mismatches between MRI and arthroscopy reports with sensitivity 79% and specificity 87%.

Conclusion:

Radiology-arthroscopy correlation can be automated for knee meniscal tears using NLP algorithms, which shows promise for education and quality improvement.

Keywords: Natural language processing, machine learning, radiology-arthroscopy correlation, knee MRI, meniscal tear

Introduction

Magnetic resonance imaging (MRI) is commonly performed for evaluation of internal derangements of the knee, including meniscal abnormalities. The diagnostic value of knee MRI compared to the reference standard of knee arthroscopy has been studied in numerous populations since the late 1980s [1–3], usually with small sample sizes, though some of the earlier studies featured as many as 1014 patients. More recent studies have featured between 54 and 120 patients [4–11]. Importantly, these studies evaluate the performance of radiologists or other clinicians specifically tasked with evaluating the menisci for the research study, which is not necessarily reflective of clinical performance in a non-experimental setting (i.e., findings communicated in the radiology report).

Assessment of the diagnostic performance of radiologists is important to demonstrate clinical value and identify areas for improvement, but manual retrospective review of radiology and arthroscopy reports is time-consuming and requires domain-specific knowledge. Natural language processing (NLP) has been applied to the classification of findings in radiology reports [12], and to a lesser extent in operative reports as well [13,14]. The majority of these NLP algorithms have employed rule-based approaches, which require rules created by a domain expert specifically for the text corpus of interest. Increasingly, supervised machine learning techniques that do not require such rules, including deep learning methods, have been employed to classify clinical text documents. For example, various machine learning models have been applied to detect abnormalities in radiology reports [15–18]. A support vector machine (SVM)-based NLP algorithm has been used to classify knee MRI reports to differentiate between normal and abnormal cases [19].

We hypothesized that supervised NLP models can be applied to free-text narrative knee MRI radiology and arthroscopy reports to accurately identify the presence of meniscal tears, including the laterality, using models trained on knee MRI reports. Correct identification of these findings in both the radiology and arthroscopy reports allows automated identification of disagreements between the reports. This system is automatic and scalable, with possible applications for quality assessment and improvement.

Materials and Methods

This HIPAA-compliant retrospective study was approved by the institutional review board (IRB) of our integrated health system, with informed consent waived.

Knee MRI report data collection, annotation, and partition

A radiology report search application was used to extract free-text narrative knee MRI reports from the database of our institution, including a large academic hospital and its outpatient imaging practice [20]. The collected data included all knee MRIs acquired over a one-year period from January 1, 2018 to December 31, 2018, which totaled 3631 knee MRI reports from 3448 unique patients. Of note, these MRI reports are semi-structured with headings; however, the text content within each heading was not structured, and the headings varied depending on the report template used.

Each of these reports was manually annotated for the presence or absence of meniscal tear, with separate labels for the medial and lateral menisci, by a diagnostic radiology resident physician (postgraduate year 4). A meniscus was considered torn if a tear or re-tear was reported, which included abnormalities explicitly described as tears as well as other findings consistent with tears, such as meniscocapsular separation or meniscal root avulsion. A meniscus was considered not torn if it met either of the following criteria: (1) the meniscus was reported to be intact, including variant anatomy such as a discoid meniscus, or (2) only non-tear abnormalities were reported, including but not limited to post-operative change, intrasubstance high signal, fraying, blunting, maceration, and attenuation. These non-tear abnormalities were not considered tears because we focused the tear definition on findings that would be potentially treatable by meniscal surgery [21] and thus more clinically relevant to the orthopedic surgeon. In occasional cases, the findings were hedged (e.g., “minor meniscal tearing/fraying cannot be excluded”). If the possibility of a tear was raised in the absence of a more likely alternative diagnosis, the meniscus was considered torn; if a more likely alternative diagnosis was suggested, it was not.

Knee MRI report text pre-processing

Full reports from the knee MRIs, including the study findings and impression, were pre-processed for input to the supervised machine learning models. First, sentences containing “menisc*” or “horn” (where * denotes wildcard characters) were excerpted from the text using the re (version 2.2.1) regular expression operations library in Python (version 3.7.1). Reports that did not mention these terms were excluded from further analysis, as such reports were for cases with prior total arthroplasty, tumor resection, or focused examinations of nerve structures. This excluded 38 MRI reports from 31 patients (1% of reports), yielding a final dataset of 3593 MRI reports from 3417 patients, as summarized in Table 1.
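The excerpting step can be sketched with the re library; the report sentence below is hypothetical, and the exact pattern is an assumption modeled on the terms described above:

```python
import re

# Hypothetical report text for illustration (not from the study data).
report = (
    "FINDINGS: The medial meniscus shows a tear of the posterior horn. "
    "The lateral meniscus is intact. The cruciate ligaments are normal."
)

# Split into sentences, then keep only those mentioning "menisc*" or "horn".
sentences = re.split(r"(?<=[.!?])\s+", report)
keep = [s for s in sentences if re.search(r"menisc|horn", s, re.IGNORECASE)]
print(keep)  # the cruciate ligament sentence is dropped
```

A report whose sentences all fail this filter would be excluded from the dataset entirely, as described above.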

Table 1.

Summary of knee MRI and arthroscopy report data sets. N, number.

Unique Patients, N MRI Reports, N Medial Meniscal Tears, N (% of reports) Lateral Meniscal Tears, N (% of reports) MRIs with subsequent arthroscopy*, N (% of reports)
Data for Nested Cross-Validation Analysis of NLP on MRI Reports
All Data 3417 3593 1566 (44) 795 (22) 564 (16)
Data Partition for NLP Testing on MRI and Arthroscopy Reports for MRI-arthroscopy correlation
Training Set 2730 2864 1247 (44) 648 (23) 456 (16)
Test Set 687 729 319 (44) 147 (20) 108 (15)
* Arthroscopy performed <1 year after the MRI. If multiple arthroscopies were performed within that time frame, the arthroscopy report closest in time to the MRI was used.

To make these text excerpts digestible for our machine learning models, we represented the text as vectors using bag-of-words vectorization with N-grams (N = 1, 2, or 3, minimum term frequency 5%), which is an approach that has previously been used for knee MRI report natural language processing for general abnormality detection [19]. This vectorization was performed in Python using the sklearn (version 0.20.3) and NLTK [22] (version 3.4) packages. We used the NLTK mark_negation function, which appends a “_NEG” suffix to words found between a negation term and punctuation.
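A minimal sketch of this vectorization step, using NLTK's mark_negation and sklearn's CountVectorizer (the excerpts are hypothetical, and the minimum-term-frequency setting is omitted for brevity):

```python
from nltk.sentiment.util import mark_negation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical meniscus-related excerpts for illustration.
docs = [
    "complex tear of the medial meniscus posterior horn",
    "no tear of the medial meniscus",
]

# mark_negation appends "_NEG" to tokens between a negation word and the
# next clause punctuation, so "no tear" yields the token "tear_NEG".
negated = [" ".join(mark_negation(d.split())) for d in docs]

# Bag-of-words with unigrams through trigrams, as described above.
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(negated)
print(negated[1])  # → "no tear_NEG of_NEG the_NEG medial_NEG meniscus_NEG"
```

The negation suffix lets the bag-of-words model distinguish "tear" from "tear_NEG", which would otherwise collapse into the same feature.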

NLP machine learning model training and testing

The pre-processed radiology reports from the training set were then passed as inputs for supervised training of logistic regression, SVM, and random forest machine learning models for the classification of meniscal tears (medial and lateral trained separately) using sklearn (linear_model.LogisticRegression, svm.SVC, ensemble.RandomForestClassifier) in Python. We used a nested stratified 5-fold cross-validation strategy with a grid search for model hyperparameter optimization to evaluate model performance [23]. In this strategy, there is an outer loop cross-validation (for evaluating performance on data unseen in training or hyperparameter optimization) and multiple inner loop cross-validations (for optimizing hyperparameters). Within the training partition of each outer loop cross-validation fold, we performed a 5-fold inner loop cross-validation, selecting the optimal hyperparameters based on the F1 score on the inner loop validation partition. Each cross-validation partition was stratified on outcome (meniscal tear) to maintain the balance of labels in each partition.

For the hyperparameter grid search, we varied the regularization hyperparameter C (0.1, 1, 10, 100) for the logistic regression models; C (0.1, 1, 10, 100) and the kernel coefficient gamma (1, 0.1, 0.01, 0.001) for the SVM models; and, for the random forest models, the number of estimators (100, 500, 1000) and the maximum number of features the model examines to determine each split (square root or log2 of the total number of features). Other model hyperparameters were kept as defaults.
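The nested scheme can be sketched with sklearn's GridSearchCV nested inside cross_val_score; synthetic features stand in for the report vectors, and only the logistic regression grid is shown:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the vectorized report features and tear labels.
X, y = make_classification(n_samples=300, n_features=50, random_state=0)

# Inner loop: 5-fold grid search over the regularization grid, scored by F1.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10, 100]},
    scoring="f1",
    cv=inner_cv,
)

# Outer loop: estimates performance on folds unseen during tuning.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(grid, X, y, scoring="f1", cv=outer_cv)
print(scores.mean())
```

Because the grid search is re-fit inside each outer fold, the outer-loop F1 scores are never contaminated by hyperparameter selection.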

Accuracy, precision, recall, specificity, and F1 score were used to report the nested stratified 5-fold cross-validation performance results for each NLP model. We also evaluated model performance with different training set sizes after a random 80:20 partition of the knee MRI cohort into training and test sets, with training using the same hyperparameter cross-validation grid search.

Knee arthroscopy report data collection and annotation

A patient database query tool, the Research Patient Data Registry (RPDR) [24], was used to extract free-text narrative operative reports for knee arthroscopies performed at our hospital for the patients who underwent arthroscopy within 1 year following knee MRI (Table 1). For MRI studies associated with more than one arthroscopy, the arthroscopy closest in time to the MRI date was included, while later arthroscopy reports were omitted from analysis.

While nested 5-fold cross-validation was used to evaluate NLP model performance for the knee MRI reports, to avoid overcomplexity in evaluating MRI-arthroscopy report correlation, we randomly partitioned the knee MRI dataset into a training set containing 80% of the data and a test set containing 20%. The partition occurred at the patient level, with no crossover of patients between training and test sets. For the 108 arthroscopies performed for MRIs in this test set, the arthroscopy reports were manually annotated for medial and lateral meniscal tear by a diagnostic radiology resident physician (postgraduate year 4). The criteria for tear versus no tear were the same as those used to classify the knee MRI reports.

Knee arthroscopy report text pre-processing and NLP testing

Arthroscopy reports are semi-structured at our institution, but the structure differed between orthopedic surgeons. For example, some reports described findings by knee compartment, while others grouped findings together in one section. The descriptions of these findings, however, are all non-structured free text. Furthermore, some operative reports include MRI or pre-operative diagnosis text that precedes the report findings. To avoid contaminating our text analysis with these data, we used regular expressions to extract the arthroscopy procedural findings sections of the reports (Appendix Table A.1).
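Conceptually, this extraction reduces to searching for the surgeon-specific heading and keeping only what follows; the report text below is hypothetical, with a heading modeled on the patterns in Appendix Table A.1:

```python
import re

# Hypothetical operative report for illustration (not from the study data).
report = (
    "PREOPERATIVE DIAGNOSIS: Medial meniscal tear. "
    "DESCRIPTION OF PROCEDURE: Arthroscopy revealed a complex tear "
    "of the posterior horn of the medial meniscus."
)

# Keep only the text after the findings heading so the pre-operative
# diagnosis does not contaminate the NLP input.
match = re.search(r"DESCRIPTION OF PROCEDURE:?\s*(.*)", report, re.DOTALL)
findings = match.group(1) if match else report
print(findings)
```

In practice, one such pattern per surgeon (Appendix Table A.1) would be tried until one matches.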

We assessed the generalizability of the MRI report-based NLP models (trained on only the training partition) on the test set arthroscopy reports. The free-text narrative arthroscopy procedural findings were pre-processed using the same text analysis pipeline used for MRI report analysis. We trained logistic regression, SVM, and random forest models using the knee MRI training partition with the same hyperparameter cross-validation grid search described above. These models were then tested on the test set knee MRI and arthroscopy reports, with performance evaluated using accuracy, precision, recall, specificity, and F1 score. We also assessed the ensemble performance of the NLP models (all three models or each pairwise combination), with the predicted label determined by model voting (i.e., a meniscal tear was predicted if ≥2 models predicted that label).
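The majority vote over the three models' predictions can be sketched as follows (the per-model predictions here are hypothetical):

```python
import numpy as np

# Hypothetical per-model tear predictions (1 = tear) for five reports.
preds = np.array([
    [1, 0, 1, 1, 0],  # logistic regression
    [1, 0, 0, 1, 0],  # SVM
    [1, 1, 0, 1, 1],  # random forest
])

# Majority vote across models: a tear is predicted when >= 2 models agree.
ensemble = (preds.sum(axis=0) >= 2).astype(int)
print(ensemble)  # → [1 0 0 1 0]
```

With three models, the ≥2 threshold is a strict majority, so no tie-breaking rule is needed.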

Evaluation of radiology-arthroscopy correlation

For the test set MRIs with subsequently performed arthroscopy, we assessed radiology report performance using the arthroscopy report as the reference standard, comparing the manually annotated versus NLP-extracted labels for meniscal tears. The NLP label assignment was determined using an ensemble of the three NLP models described above, with a majority vote determining the final label.

Statistics

Wilcoxon signed rank tests were used to compare cross-validation performance metrics of the different NLP models [25]. A p-value < 0.05 was considered a priori as the level of statistical significance. Bootstrapped 95% confidence intervals were also calculated for non-cross-validation performance metrics.
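A sketch of both statistics, assuming scipy.stats.wilcoxon and a simple percentile bootstrap (all scores below are hypothetical):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired per-fold F1 scores for two models.
f1_a = np.array([0.93, 0.94, 0.92, 0.95, 0.93])
f1_b = np.array([0.92, 0.93, 0.94, 0.93, 0.91])
stat, p = wilcoxon(f1_a, f1_b)  # paired, non-parametric comparison

# Bootstrapped 95% CI for a non-cross-validated metric (here, accuracy):
# resample the per-case correctness indicators with replacement.
rng = np.random.default_rng(0)
correct = np.array([1] * 90 + [0] * 10)  # 0.90 accuracy on 100 cases
boots = [rng.choice(correct, size=correct.size, replace=True).mean()
         for _ in range(2000)]
lo, hi = np.percentile(boots, [2.5, 97.5])
print(p, lo, hi)
```

The signed-rank test is appropriate here because the cross-validation F1 scores are paired fold-by-fold across models.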

Results

Knee MRI and arthroscopy report data sets

In the knee MRI data set containing 3593 total reports from 2018, the mean patient age was 47 years (standard deviation ± 18 years). Of these MRIs, 254 (7%) were from pediatric patients (<18 years old); 1897 (53%) were from patients identifying as female and 1696 (47%) from patients identifying as male.

The prevalence of medial and lateral meniscal tears in the complete knee MRI dataset was 44% and 22%, respectively (Table 1). In the random test partition used for analysis of MRI-arthroscopy correlation, 16% of the knee MRIs were followed by a knee arthroscopy within 1 year (median 42 days, interquartile range 27–39 days, range 6–210 days), resulting in a test set of 108 MRI and arthroscopy reports. The prevalence of medial and lateral meniscal tears in the 108 arthroscopy reports in the test set was 66% and 42%, respectively. The most common indications for knee arthroscopy included anterior cruciate ligament tears, meniscal tears, and cartilage injury.

The knee MRIs were dictated by 80 different radiology staff, including residents, fellows, and attending radiologists. All trainee reports were reviewed, edited, and signed by attending radiology staff. Attending radiologists included 11 academic musculoskeletal radiologists, 6 academic pediatric radiologists (interpreting pediatric knee MRIs only), 2 emergency radiologists (both with musculoskeletal fellowship training), and 5 community radiologists (3 with musculoskeletal fellowship training). Nine different orthopedic surgeons performed the arthroscopy operations, eight of whom had subspecialty training that included knee arthroscopy.

NLP machine learning model performance on knee MRI reports

Logistic regression, SVM, and random forest machine learning models evaluated for the detection of meniscal tears on the free-text knee MRI reports showed nested stratified 5-fold cross-validation accuracies ranging from 0.94–0.95 and F1 scores ranging from 0.86–0.94 (Table 2). There was no significant difference between the F1 scores of the different NLP models in cross-validation (p>0.1). Performance was higher for the medial meniscus than for the lateral meniscus, as reflected by higher F1 scores. When evaluating performance with different training set sizes (N=100 to 2864) with a test set held constant, we found that performance improved with larger training set sizes (Appendix Table A.2). Extrapolating from these data, slight further gains with larger training set sizes may be possible.

Table 2.

Knee MRI report performance of different supervised NLP machine learning algorithms trained to classify MRI reports for the presence of medial and lateral meniscal tears. Performance metrics on the held-out outer-loop folds are presented as the average of nested stratified 5-fold cross-validation, with the cross-validation performance range in parentheses.

Machine Learning Model Accuracy Precision (positive predictive value) Recall (sensitivity) Specificity F1 score
Medial Meniscus (knee MRI report)
Logistic Regression 0.94 (0.94–0.95) 0.94 (0.92–0.95) 0.93 (0.92–0.96) 0.95 (0.94–0.97) 0.94 (0.93–0.95)
Support Vector Machine 0.94 (0.93–0.95) 0.94 (0.91–0.97) 0.93 (0.92–0.95) 0.95 (0.93–0.98) 0.93 (0.93–0.94)
Random Forest 0.95 (0.94–0.96) 0.93 (0.90–0.94) 0.95 (0.93–0.96) 0.94 (0.92–0.96) 0.94 (0.93–0.95)
Lateral Meniscus (knee MRI report)
Logistic Regression 0.94 (0.94–0.95) 0.90 (0.85–0.94) 0.85 (0.82–0.90) 0.97 (0.95–0.98) 0.87 (0.86–0.88)
Support Vector Machine 0.95 (0.94–0.95) 0.90 (0.88–0.96) 0.86 (0.83–0.89) 0.97 (0.96–0.99) 0.88 (0.86–0.89)
Random Forest 0.94 (0.93–0.95) 0.88 (0.82–0.91) 0.85 (0.82–0.89) 0.97 (0.95–0.97) 0.86 (0.84–0.88)

NLP machine learning model performance for radiology-arthroscopy correlation

To evaluate the application of these NLP models for radiology-arthroscopy correlation, we randomly partitioned the knee MRI cohort 80:20 into training and test sets, trained the NLP models on this MRI training set with grid search cross-validation for hyperparameter optimization, and then tested on the knee MRI test set and patient-matched arthroscopy test set. The best-performing hyperparameters in training are shown in Appendix Table A.3.

On the knee MRI test set, accuracy of the NLP models ranged from 0.93–0.95 and F1 scores from 0.87–0.94, without improvement with model ensembling (Table 3). The models trained on the knee MRI training set were then evaluated on the free-text knee arthroscopy report test set with no further training. Accuracies ranged from 0.84–0.97 and F1 scores from 0.87–0.96 (Table 3), similar to the performance on the MRI test set. Model performance was again higher for the medial meniscus than for the lateral meniscus, as reflected by higher F1 scores. On the arthroscopy reports, however, the ensembled NLP models showed performance equal to or higher than the individual NLP models as assessed by F1 score, with the best scores seen when all three NLP models were ensembled (Table 3).

Table 3.

Performance of different NLP machine learning algorithms trained on a random 80% partition of the knee MRI report dataset and tested on a held out 20% partition of the knee MRI report dataset and corresponding arthroscopy report dataset. Performance metrics on the test sets are presented as bootstrapped median values with 95% confidence intervals. LOG, logistic regression; SVM, support vector machine; RF, random forest.

Machine Learning Model Accuracy Precision (positive predictive value) Recall (sensitivity) Specificity F1 score
Medial Meniscus (knee MRI reports)
LOG 0.93 (0.91–0.95) 0.93 (0.90–0.96) 0.92 (0.89–0.95) 0.95 (0.92–0.97) 0.92 (0.90–0.94)
SVM 0.93 (0.91–0.95) 0.94 (0.91–0.96) 0.90 (0.86–0.93) 0.95 (0.93–0.97) 0.92 (0.89–0.94)
RF 0.95 (0.93–0.96) 0.94 (0.91–0.96) 0.94 (0.91–0.96) 0.95 (0.93–0.97) 0.94 (0.92–0.96)
Ensemble: LOG + SVM 0.93 (0.91–0.95) 0.93 (0.90–0.95) 0.92 (0.89–0.95) 0.94 (0.92–0.97) 0.92 (0.90–0.94)
Ensemble: LOG + RF 0.94 (0.92–0.96) 0.90 (0.87–0.93) 0.96 (0.93–0.98) 0.92 (0.89–0.95) 0.93 (0.91–0.95)
Ensemble: SVM + RF 0.94 (0.92–0.96) 0.91 (0.88–0.94) 0.95 (0.93–0.97) 0.93 (0.90–0.95) 0.93 (0.91–0.95)
Ensemble: All 3 Models 0.93 (0.92–0.95) 0.94 (0.91–0.96) 0.91 (0.88–0.94) 0.95 (0.93–0.97) 0.92 (0.90–0.94)
Lateral Meniscus (knee MRI reports)
LOG 0.95 (0.93–0.97) 0.88 (0.83–0.93) 0.87 (0.81–0.92) 0.97 (0.96–0.98) 0.88 (0.84–0.91)
SVM 0.95 (0.94–0.97) 0.91 (0.86–0.95) 0.85 (0.79–0.91) 0.98 (0.97–0.99) 0.88 (0.84–0.92)
RF 0.95 (0.93–0.96) 0.86 (0.80–0.91) 0.89 (0.84–0.94) 0.96 (0.95–0.98) 0.87 (0.83–0.91)
Ensemble: LOG + SVM 0.95 (0.93–0.97) 0.88 (0.83–0.93) 0.87 (0.81–0.92) 0.97 (0.96–0.98) 0.88 (0.83–0.91)
Ensemble: LOG + RF 0.94 (0.92–0.96) 0.82 (0.76–0.88) 0.92 (0.87–0.96) 0.95 (0.93–0.97) 0.86 (0.82–0.90)
Ensemble: SVM + RF 0.95 (0.93–0.96) 0.83 (0.77–0.89) 0.92 (0.87–0.96) 0.95 (0.94–0.97) 0.87 (0.83–0.91)
Ensemble: All 3 Models 0.95 (0.94–0.97) 0.90 (0.85–0.95) 0.86 (0.81–0.93) 0.98 (0.96–0.99) 0.88 (0.84–0.92)
Medial Meniscus (knee arthroscopy reports)
LOG 0.94 (0.90–0.98) 0.92 (0.86–0.97) 1.00 (1.00–1.00) 0.84 (0.71–0.94) 0.96 (0.92–0.99)
SVM 0.91 (0.84–0.95) 0.93 (0.86–0.99) 0.93 (0.86–0.99) 0.87 (0.73–0.97) 0.93 (0.88–0.97)
RF 0.84 (0.77–0.91) 0.95 (0.89–1.00) 0.81 (0.71–0.90) 0.92 (0.82–1.00) 0.87 (0.80–0.93)
Ensemble: LOG + SVM 0.93 (0.88–0.97) 0.90 (0.83–0.96) 1.00 (1.00–1.00) 0.79 (0.65–0.91) 0.95 (0.91–0.98)
Ensemble: LOG + RF 0.93 (0.88–0.97) 0.90 (0.83–0.96) 1.00 (1.00–1.00) 0.79 (0.64–0.90) 0.95 (0.91–0.98)
Ensemble: SVM + RF 0.94 (0.89–0.97) 0.92 (0.86–0.97) 0.99 (0.96–1.00) 0.84 (0.72–0.95) 0.95 (0.92–0.98)
Ensemble: All 3 Models 0.95 (0.92–0.99) 0.95 (0.89–0.99) 0.99 (0.95–1.00) 0.89 (0.79–0.98) 0.97 (0.93–0.99)
Lateral Meniscus (knee arthroscopy reports)
LOG 0.93 (0.87–0.97) 0.85 (0.72–0.95) 0.95 (0.85–1.00) 0.92 (0.85–0.98) 0.89 (0.81–0.96)
SVM 0.97 (0.94–1.00) 1.00 (1.00–1.00) 0.92 (0.81–1.00) 1.00 (1.00–1.00) 0.96 (0.89–1.00)
RF 0.92 (0.86–0.96) 0.89 (0.76–0.97) 0.86 (0.73–0.97) 0.95 (0.89–0.99) 0.87 (0.77–0.94)
Ensemble: LOG + SVM 0.94 (0.88–0.98) 0.85 (0.73–0.95) 0.97 (0.91–1.00) 0.92 (0.85–0.97) 0.91 (0.83–0.97)
Ensemble: LOG + RF 0.91 (0.84–0.96) 0.78 (0.64–0.89) 1.00 (1.00–1.00) 0.86 (0.77–0.94) 0.87 (0.78–0.94)
Ensemble: SVM + RF 0.96 (0.93–0.99) 0.90 (0.80–0.98) 1.00 (1.00–1.00) 0.95 (0.89–0.99) 0.95 (0.89–0.99)
Ensemble: All 3 Models 0.99 (0.96–1.00) 1.00 (1.00–1.00) 0.97 (0.90–1.00) 1.00 (1.00–1.00) 0.99 (0.95–1.00)

In the 108 test set knee MRIs with arthroscopy reports, the ensemble NLP model identified 19 of 24 reports (sensitivity 79%) with disagreement on the meniscal tear abnormality (medial or lateral) between the MRI and arthroscopy reports, and 73 of 84 true negative cases (specificity 87%). In examining the cases where the NLP models incorrectly classified reports for meniscal tears (in the test set reports, 5 medial and 6 lateral menisci from knee MRIs and 5 medial and 1 lateral menisci from the arthroscopies), a common failure mode occurred when multiple abnormalities were described for a single meniscus (e.g., post-meniscectomy changes, fraying, and a tear in the same meniscus). Another failure mode involved low-prevalence findings (e.g., discoid meniscus), for which the NLP models likely did not see sufficient cases in training to learn the pattern. In addition, some arthroscopy reports did not explicitly mention the medial or lateral meniscus, which sometimes led to incorrect classifications.
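The mismatch-flagging step itself reduces to comparing the ensembled labels across each matched MRI-arthroscopy pair; the labels below are hypothetical:

```python
# Hypothetical NLP-extracted labels (1 = tear) for matched report pairs.
mri_labels = [1, 0, 1, 1, 0]
arthro_labels = [1, 1, 1, 0, 0]

# Flag MRI-arthroscopy disagreements for manual review.
flagged = [i for i, (m, a) in enumerate(zip(mri_labels, arthro_labels))
           if m != a]
print(flagged)  # → [1, 3]
```

Sensitivity and specificity for mismatch detection then follow from comparing these flags against the manually annotated disagreements.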

The reasons for the 24 radiology-arthroscopy disagreements found in the test set are summarized in Appendix Table A.4. In the most common scenario (6 cases, 25%), knee arthroscopy was performed for anterior cruciate ligament repair/reconstruction and/or meniscal repair, but an additional meniscal tear was discovered at arthroscopy that was not described on the MRI report.

Discussion

We developed an NLP-based approach for radiology-arthroscopy correlation of meniscal tears that uses supervised machine learning methods trained on free-text knee MRI reports. The models can be applied to MRI reports but also generalize to free-text arthroscopy reports without additional training on arthroscopy reports. This approach allowed us to automatically identify cases where the MRI and arthroscopy reports disagree, which could be used to flag cases for teaching and quality improvement. Identifying specific abnormalities that are frequently over- or under-called on knee MRI can help radiologists learn from such cases [26,27].

Among expert musculoskeletal radiologists, there is inconsistency in evaluating the presence of meniscal abnormality on knee MRIs [28]. Timely feedback has the potential to improve this consistency. However, manual identification of cases with arthroscopy correlation is tedious and requires diligent follow-up. In our study, arthroscopy was performed in 16% of knee MRI cases; thus, a radiologist following up on their knee MRI cases would find an arthroscopy correlate in only 1 of 6 cases. A rule-based algorithm has been used to automatically notify radiologists of arthroscopy notes corresponding to their radiology reports, with matching by anatomy of interest (e.g., knee versus elbow) [29]. In our work, we address the next step, with automated identification of the abnormality of interest in the MRI and arthroscopy reports. Our NLP approach, with the generalization of algorithms from radiology to arthroscopy reports, could be useful for analysis of other findings reported on free-text narrative radiology reports that have subsequent operative correlation, which is a useful non-interpretive application of machine learning tools in radiology with ramifications for education and quality improvement [12,30]. Automated feedback has been deployed for screening mammography, though the structured nature of mammography reports makes that task technically less challenging [31]. Adoption of more structured reporting in the interpretation of knee MRIs could also help improve the performance of automated feedback systems in musculoskeletal radiology. In addition to providing feedback, such NLP systems could be used to audit radiologist performance, which may be helpful as reimbursement systems move from fee-for-service to value-based programs.

The performance of the NLP models developed in our study was similar to that of previous work using NLP machine learning for the classification of knee MRI reports as normal versus abnormal [19]. However, the classification task in our study was more complex, as we evaluated the meniscus specifically, differentiated by laterality, and used a definition for meniscal tear focused on identifying tears that may be potential targets for intervention and are thus clinically relevant for the orthopedic surgeon. In addition, our model performance was worse for the lateral meniscus, which may reflect the lower prevalence of lateral meniscus pathology in our dataset, leaving fewer training examples for the models to learn from. Thus, the NLP approach used in our study could be further improved. KneeTex [32], a rule-based NLP algorithm for evaluation of knee MRI reports, achieved a high level of performance for extraction of more detailed information than in our study, but required a complex domain-specific ontology, while our models required only labeling of meniscal abnormalities. Deep learning-based NLP approaches have also shown promise for these types of analyses, though their comparative benefit over more traditional machine learning models, depending on the specific task, remains an open area of research [16–18].

There are limitations to this study. First, although this NLP approach would substantially decrease the total number of studies requiring manual review for radiology-arthroscopy correlation, model performance is not perfect: some studies without a disagreement between the knee MRI and arthroscopy reports would not be screened out and would still require manual review. Training on larger datasets and using more sophisticated machine learning models, such as deep learning-based and transformer-based models in particular [33], may help achieve superior performance. Second, the NLP models were trained on data from a single institution, which may limit their generalizability to other institutions with different reporting practices. However, given that the models generalized well to arthroscopy reports, with inherently different reporting styles, they may still generalize. A model trained specifically on arthroscopy reports could lead to some performance improvement, though we found our models performed similarly for both MRI and arthroscopy reports. Nevertheless, NLP models may require local training or tuning to optimize performance for new sites, as seen in previous work applying NLP to radiology reports [34]. Third, in analyzing MRI-arthroscopy correlation, there is a delay between the two studies, and patients may develop new meniscal abnormalities in that interval. Thus, while the arthroscopy report may be considered the reference standard, a “missed” meniscal tear may not have been present on the initial MRI. Fourth, different types of meniscal tears may have different management implications; training machine learning models to identify further subdivisions of meniscal abnormalities is a future direction of research. Fifth, the focus of our study is on meniscal tears, but other abnormalities may be clinically relevant for radiology-arthroscopy correlation and would require further labeling of training data.

Conclusions

Natural language processing algorithms can be trained to automatically identify free-text knee MRI and arthroscopy reports with medial or lateral meniscal tears. Interestingly, algorithms trained exclusively on MRI reports generalized well to arthroscopy reports, which suggests that machine learning models trained on free-text radiology reports may have application to non-radiology clinical text. While the model performance has room for improvement (perhaps through the use of deep learning techniques instead of conventional machine learning models), this approach shows promise for reducing the burden of manual review to identify cases where the radiologist interpretation did not match the arthroscopic findings. This may be useful for radiology education and quality improvement.

Acknowledgments

Research reported in this publication was supported by a training grant from the National Institutes of Health under Award Number F30CA239407 to K.C.

Abbreviations:

MRI: Magnetic resonance imaging
NLP: Natural language processing
SVM: Support vector machine

Appendix

Table A.1.

Different orthopedic surgeons use different terms in their reports to denote the start of the findings section of the report. We used the regular expressions summarized in this table to extract only the relevant portion of the report (omitting the pre-operative diagnosis sections, patient history, and study indication).

Orthopedic Surgeon Arthroscopy Report Regular Expression
#1 “DESCRIPTION OF PROCEDURE”
#2 “DESCRIPTION OF PROCEDURE”
#3 “ARTHROSCOPIC FINDINGS”
#4 “DESCRIPTION OF PROCEDURE”
#5 “Description of Procedure”
#6 “FINDINGS”
#7 “Procedure was then initiated.”
#8 “DESCRIPTION OF PROCEDURE”
#9 “DESCRIPTION OF PROCEDURE”
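A minimal sketch of this marker-based section extraction in Python follows. The marker strings are taken from the table above; the function name and the fall-back behavior when no marker is found are our own illustrative choices, not part of the published pipeline.

```python
import re

# Section markers used by the different surgeons (Table A.1). Matching is
# case-sensitive because "Description of Procedure" and "DESCRIPTION OF
# PROCEDURE" belong to different dictation templates.
FINDINGS_MARKERS = [
    "DESCRIPTION OF PROCEDURE",
    "ARTHROSCOPIC FINDINGS",
    "Description of Procedure",
    "FINDINGS",
    "Procedure was then initiated.",
]

def extract_findings(report_text: str) -> str:
    """Return the report text from the first findings marker onward.

    If no marker is found, fall back to the full report so that no
    document is silently dropped from the dataset.
    """
    pattern = "|".join(re.escape(m) for m in FINDINGS_MARKERS)
    match = re.search(pattern, report_text)
    return report_text[match.start():] if match else report_text
```

Because `re.search` returns the earliest match in the text, the pre-operative diagnosis, history, and indication sections preceding the marker are discarded regardless of which surgeon's template is in use.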

Table A.2.

Evaluation of the impact of training set size on NLP model performance. F1 scores were calculated for each of the NLP machine learning algorithms, trained with grid search cross-validation on a sample of the random 80% partition of the knee MRI report dataset and tested on a held-out 20% partition of the dataset.

Medial Meniscus Lateral Meniscus
Training set size Logistic Regression SVM Random Forest Logistic Regression SVM Random Forest
2864 0.92 0.92 0.94 0.88 0.88 0.87
2000 0.91 0.92 0.93 0.89 0.87 0.85
1000 0.90 0.91 0.93 0.86 0.84 0.82
500 0.89 0.90 0.91 0.85 0.86 0.80
100 0.87 0.87 0.88 0.77 0.75 0.76
50 0.81 0.82 0.84 0.73 0.70 0.67
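The subsampling experiment in Table A.2 can be sketched as follows. The TF-IDF feature representation, the logistic regression hyperparameters, and the function name here are illustrative assumptions for one of the three classifiers, not a description of the authors' exact pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

def f1_vs_training_size(reports, labels, sizes, seed=0):
    """Train on nested subsamples of an 80% split and compute the F1
    score on the held-out 20% split for each training-set size."""
    X_train, X_test, y_train, y_test = train_test_split(
        reports, labels, test_size=0.2, random_state=seed, stratify=labels)
    scores = {}
    for n in sizes:
        clf = make_pipeline(
            TfidfVectorizer(),  # assumed bag-of-words feature extraction
            LogisticRegression(C=0.1, solver="liblinear"))
        clf.fit(X_train[:n], y_train[:n])  # subsample of the training split
        scores[n] = f1_score(y_test, clf.predict(X_test))
    return scores
```

Keeping the test partition fixed while shrinking only the training subsample isolates the effect of training set size on performance, which is the comparison the table reports.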

Table A.3.

Best hyperparameters from grid search 5-fold cross-validation during training of the NLP models on the training set partition. Other model hyperparameters were kept as the defaults in sklearn version 0.20.3.

NLP Model Hyperparameter Value
Logistic Regression
 Medial meniscus C (regularization) 0.1
 Lateral meniscus C (regularization) 0.1
SVM
 Medial meniscus C (regularization) 10
  Gamma (kernel coefficient) 0.001
 Lateral meniscus C (regularization) 10
  Gamma (kernel coefficient) 0.001
Random Forest
 Medial meniscus Number of estimators 1000
  Max features Square root of total number of features
 Lateral meniscus Number of estimators 1000
  Max features Square root of total number of features
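In sklearn terms, the tuned models in Table A.3 could be instantiated as below. The hyperparameters are identical for the medial and lateral meniscus tasks; all other settings are left at the library defaults, per the table caption. The `solver` argument is pinned explicitly here (it was the default in sklearn 0.20.3 but has since changed), and the function name is our own.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def build_tuned_models():
    """Instantiate the three classifiers with the best hyperparameters
    from grid search cross-validation (Table A.3)."""
    return {
        # C = 0.1: strong L2 regularization
        "logistic_regression": LogisticRegression(C=0.1, solver="liblinear"),
        # RBF kernel with C = 10, gamma = 0.001
        "svm": SVC(C=10, gamma=0.001),
        # 1000 trees, sqrt(n_features) candidates per split
        "random_forest": RandomForestClassifier(
            n_estimators=1000, max_features="sqrt"),
    }
```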

Table A.4.

Reasons for radiology-arthroscopy disagreements in the knee MRI and arthroscopy test sets.

Reason Number of Cases (of 24 total disagreements)
False Negatives
Arthroscopy performed for anterior cruciate ligament repair/reconstruction and/or meniscal repair, found to have an additional meniscal tear missed on MRI. 6
Degenerative fraying described on MRI but called a tear on arthroscopy. 2
Previous meniscectomy with re-tear missed on MRI. 2
Arthroscopy performed for knee exploration / irrigation due to infection, found to have meniscal tear missed on MRI. 1
Meniscocapsular sprain described on MRI but called tear on arthroscopy. 1
False Positives
Arthroscopy performed for anterior cruciate ligament or medial and lateral meniscal repairs, but a tear of one of the menisci described on MRI was not seen at arthroscopy. 3
Previous meniscectomy with re-tear described on MRI but not seen on arthroscopy. 1
Arthroscopy performed for lysis of adhesions, MRI reported meniscal remnant tear, arthroscopy reported remnant but no tear. 1
Meniscal fraying with possible tear described on MRI but no tear on arthroscopy. 1
Meniscocapsular separation described on MRI, but not seen on arthroscopy. 1
Other
Meniscal tear described in the MRI but the corresponding medial or lateral meniscus was not described in the arthroscopy report. 4
Error in the arthroscopy report, where the report conclusions stated no meniscal tear though the body of the report described a meniscal tear correctly correlating with MRI findings. 1

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Statement of data access and integrity: The authors declare that they had full access to all of the data in this study and the authors take complete responsibility for the integrity of the data and the accuracy of the data analysis.

Conflicts of Interest: Dr. Chang reports grants from NIH, during the conduct of the study. Dr. Kalpathy-Cramer reports grants from GE Healthcare, non-financial support from AWS, and grants from Genentech Foundation, outside the submitted work. The other authors declare no conflict of interest.

References

[1] Crues JV, Mink J, Levy TL, Lotysch M, Stoller DW. Meniscal tears of the knee: accuracy of MR imaging. Radiology 1987;164:445–8. doi:10.1148/radiology.164.2.3602385.
[2] Fischer SP, Fox JM, Del Pizzo W, Friedman MJ, Snyder SJ, Ferkel RD. Accuracy of diagnoses from magnetic resonance imaging of the knee. A multi-center analysis of one thousand and fourteen patients. J Bone Joint Surg Am 1991;73:2–10.
[3] Oei EHG, Nikken JJ, Verstijnen ACM, Ginai AZ, Myriam Hunink MG. MR Imaging of the Menisci and Cruciate Ligaments: A Systematic Review. Radiology 2003;226:837–48. doi:10.1148/radiol.2263011892.
[4] Grossman JW, De Smet AA, Shinki K. Comparison of the Accuracy Rates of 3-T and 1.5-T MRI of the Knee in the Diagnosis of Meniscal Tear. Am J Roentgenol 2009;193:509–14. doi:10.2214/AJR.08.2101.
[5] Behairy NH, Dorgham MA, Khaled SA. Accuracy of routine magnetic resonance imaging in meniscal and ligamentous injuries of the knee: comparison with arthroscopy. Int Orthop 2009;33:961–7. doi:10.1007/s00264-008-0580-5.
[6] Subhas N, Sakamoto FA, Mariscalco MW, Polster JM, Obuchowski NA, Jones MH. Accuracy of MRI in the Diagnosis of Meniscal Tears in Older Patients. Am J Roentgenol 2012;198:W575–80. doi:10.2214/AJR.11.7226.
[7] Navali AM, Bazavar M, Mohseni MA, Safari B, Tabrizi A. Arthroscopic evaluation of the accuracy of clinical examination versus MRI in diagnosing meniscus tears and cruciate ligament ruptures. Arch Iran Med 2013;16:229–32. https://doi.org/013164/AIM.008.
[8] Yaqoob J, Alam MS, Khalid N. Diagnostic accuracy of Magnetic Resonance Imaging in assessment of Meniscal and ACL tear: Correlation with arthroscopy. Pakistan J Med Sci 2015;31:263–8. doi:10.12669/pjms.312.6499.
[9] Sharifah MIA, Lee CL, Suraya A, Johan A, Syed AFSK, Tan SP. Accuracy of MRI in the diagnosis of meniscal tears in patients with chronic ACL tears. Knee Surgery, Sport Traumatol Arthrosc 2015;23:826–30. doi:10.1007/s00167-013-2766-7.
[10] Abbas MM, Abulaban AA, Kotb MM. Accuracy of Magnetic Resonance Imaging in Diagnosis of Internal Derangement of the Knee. J King Abdulaziz Univ - Med Sci 2016;23:11–7. doi:10.4197/Med.23-4.2.
[11] Coşkun Bilge A, Tokgöz N, Dur H, Uçar M. The value of magnetic resonance imaging in diagnosing meniscal tears: A retrospective cohort study. J Surg Med 2019;3:64–9. doi:10.28982/josam.515244.
[12] Pons E, Braun LMM, Hunink MGM, Kors JA. Natural Language Processing in Radiology: A Systematic Review. Radiology 2016;279:329–43. doi:10.1148/radiol.16142770.
[13] Wyles CC, Tibbo ME, Fu S, Wang Y, Sohn S, Kremers WK, et al. Use of Natural Language Processing Algorithms to Identify Common Data Elements in Operative Notes for Total Hip Arthroplasty. J Bone Jt Surg 2019;101:1931–8. doi:10.2106/JBJS.19.00071.
[14] Liu L, Shorstein NH, Amsden LB, Herrinton LJ. Natural language processing to ascertain two key variables from operative reports in ophthalmology. Pharmacoepidemiol Drug Saf 2017;26:378–85. doi:10.1002/pds.4149.
[15] Zech J, Pain M, Titano J, Badgeley M, Schefflein J, Su A, et al. Natural Language–based Machine Learning Models for the Annotation of Clinical Radiology Reports. Radiology 2018;287:570–80. doi:10.1148/radiol.2018171093.
[16] Kehl KL, Elmarakeby H, Nishino M, Van Allen EM, Lepisto EM, Hassett MJ, et al. Assessment of Deep Natural Language Processing in Ascertaining Oncologic Outcomes From Radiology Reports. JAMA Oncol 2019;5:1421. doi:10.1001/jamaoncol.2019.1800.
[17] Banerjee I, Ling Y, Chen MC, Hasan SA, Langlotz CP, Moradzadeh N, et al. Comparative effectiveness of convolutional neural network (CNN) and recurrent neural network (RNN) architectures for radiology text report classification. Artif Intell Med 2019;97:79–88. doi:10.1016/j.artmed.2018.11.004.
[18] Lee C, Kim Y, Kim YS, Jang J. Automatic Disease Annotation From Radiology Reports Using Artificial Intelligence Implemented by a Recurrent Neural Network. Am J Roentgenol 2019;212:734–40. doi:10.2214/AJR.18.19869.
[19] Hassanpour S, Langlotz CP, Amrhein TJ, Befera NT, Lungren MP. Performance of a Machine Learning Classifier of Knee MRI Reports in Two Large Academic Radiology Practices: A Tool to Estimate Diagnostic Yield. Am J Roentgenol 2017;208:750–3. doi:10.2214/AJR.16.16128.
[20] Dang PA, Kalra MK, Schultz TJ, Graham SA, Dreyer KJ. Informatics in Radiology. RadioGraphics 2009;29:1233–46. doi:10.1148/rg.295085036.
[21] Abram SGF, Beard DJ, Price AJ, BASK Meniscal Working Group. National consensus on the definition, investigation, and classification of meniscal lesions of the knee. Knee 2018;25:834–40. doi:10.1016/j.knee.2018.06.001.
[22] Bird S, Klein E, Loper E. Natural Language Processing with Python. O'Reilly Media; 2009.
[23] Cawley GC, Talbot NLC. On Over-Fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. J Mach Learn Res 2010;11:2079–2107.
[24] Murphy SN, Chueh HC. A security architecture for query tools used to access large biomedical databases. Proceedings AMIA Symp 2002:552–6.
[25] Demšar J. Statistical Comparisons of Classifiers over Multiple Data Sets. J Mach Learn Res 2006;7:1–30.
[26] Simpfendorfer C, Polster J. MRI of the Knee: What Do We Miss? Curr Radiol Rep 2014;2:43. doi:10.1007/s40134-014-0043-2.
[27] Lecouvet F, Van Haver T, Acid S, Perlepe V, Kirchgesner T, Vande Berg B, et al. Magnetic resonance imaging (MRI) of the knee: Identification of difficult-to-diagnose meniscal lesions. Diagn Interv Imaging 2018;99:55–64. doi:10.1016/j.diii.2017.12.005.
[28] Hoover KB, Vossen JA, Hayes CW, Riddle DL. Reliability of meniscus tear description: a study using MRI from the Osteoarthritis Initiative. Rheumatol Int 2019:1–7. doi:10.1007/s00296-019-04489-0.
[29] Moore W, Doshi A, Bhattacharji P, Gyftopoulos S, Ciavarra G, Kim D, et al. Automated Radiology-Operative Note Communication Tool; Closing the Loop in Musculoskeletal Imaging. Acad Radiol 2018;25:244–9. doi:10.1016/j.acra.2017.08.016.
[30] Richardson ML, Garwood ER, Lee Y, Li MD, Lo HS, Nagaraju A, et al. Noninterpretive Uses of Artificial Intelligence in Radiology. Acad Radiol 2020. doi:10.1016/j.acra.2020.01.012.
[31] Sippo DA, Sullivan AM, Cohen A, Mercaldo SF, Bahl M, Lehman CD. The Adoption and Impact on Performance of an Automated Outcomes Feedback Application for Tomosynthesis Screening Mammography. J Am Coll Radiol 2020. doi:10.1016/j.jacr.2020.05.036.
[32] Spasić I, Zhao B, Jones CB, Button K. KneeTex: an ontology-driven system for information extraction from MRI reports. J Biomed Semantics 2015;6:34. doi:10.1186/s13326-015-0033-1.
[33] Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. Transformers: State-of-the-Art Natural Language Processing. Proc. 2020 Conf. Empir. Methods Nat. Lang. Process. Syst. Demonstr. Stroudsburg, PA: Association for Computational Linguistics; 2020, p. 38–45. doi:10.18653/v1/2020.emnlp-demos.6.
[34] Li MD, Lang M, Deng F, Chang K, Buch K, Rincon S, et al. Analysis of Stroke Detection during the COVID-19 Pandemic Using Natural Language Processing of Radiology Reports. Am J Neuroradiol 2020. doi:10.3174/ajnr.A6961.
