Author manuscript; available in PMC: 2023 Mar 1.
Published in final edited form as: Acad Radiol. 2021 Dec 1;29(Suppl 3):S188–S200. doi: 10.1016/j.acra.2021.09.005

Figure 1.

Overview of the pipeline. (A) The steps of our pipeline: we collect 871 radiology reports from our four systems, preprocess the text to clean it, and extract features using four different methods. We load the n-gram, controlled-vocabulary, and document-embedding feature matrices into a logistic regression model to predict the presence of each finding. For rules, we instead use a rule-based model that classifies a report as “positive” if at least one mention of the finding is non-negated and “negative” if the finding is not mentioned or all of its mentions are negated. We perform two types of assessments: generalizability and performance based on AUC. (B) The four NLP methods used to featurize the text, illustrated for two example findings: fracture and any degeneration. The resulting finding-specific feature matrices are then passed to the machine learning model, which uses the first column as the labels and the remaining columns as features to predict the presence of these findings. AUC, Area Under the Curve; UMLS, Unified Medical Language System.
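
The rule-based decision described in panel (A) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Mention` class and the `classify_report` function are hypothetical names, and the sketch assumes that an upstream step has already extracted each mention of a finding and marked it as negated or not.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mention:
    """One extracted mention of a finding (hypothetical structure)."""
    finding: str   # e.g. "fracture"
    negated: bool  # True if the mention is negated, e.g. "no fracture"


def classify_report(mentions: List[Mention], finding: str) -> str:
    """Label a report for one finding.

    "positive" if at least one mention of the finding is non-negated;
    "negative" if the finding is not mentioned or every mention is negated.
    """
    relevant = [m for m in mentions if m.finding == finding]
    # any() over an empty list is False, so "no mention" falls through to negative.
    if any(not m.negated for m in relevant):
        return "positive"
    return "negative"


# Hypothetical report with one negated and one affirmed fracture mention.
report = [Mention("fracture", True), Mention("fracture", False)]
label = classify_report(report, "fracture")
```

Under this rule a single affirmed mention outweighs any number of negated ones, so the example report above is labeled positive for fracture and negative for any finding it never mentions.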