Overview of the Pipeline. (A) This visualization shows the
steps of our pipeline: we collect 871 radiology reports from our four
systems, preprocess the text data to clean it, and extract features
using our four different methods. We load the n-gram,
controlled vocabulary, and document embedding
feature matrices into a logistic regression model to predict the presence of
each finding. For rules, we instead use a rule-based model
that classifies a report as “positive” if at least one mention of the finding is
non-negated and “negative” if there is no mention or all mentions
of the finding are negated. We perform two types of assessments:
generalizability and performance based on AUC. (B) A visual representation of how
the four NLP methods featurize the text for two example findings:
fracture and any degeneration. The resulting
finding-specific feature matrices are then passed to the machine learning model,
which uses the first column as the labels and the remaining columns as features to
predict the presence of these findings. AUC, Area Under the Curve; UMLS, Unified
Medical Language System.
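
The decision rule of the rule-based model in (A) can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name classify_report and the input format (one boolean per mention of the finding, True when that mention is negated) are assumptions introduced here, and mention detection and negation detection are taken as already done upstream.

```python
from typing import List


def classify_report(mention_negated: List[bool]) -> str:
    """Label one report for one finding.

    mention_negated (hypothetical input format) holds one entry per
    mention of the finding in the report: True if that mention is
    negated, False if it is affirmed.
    """
    # "positive" requires at least one non-negated mention.
    # any() over an empty list is False, so a report with no mention
    # falls through to "negative", as does one whose mentions are
    # all negated.
    has_affirmed_mention = any(not negated for negated in mention_negated)
    return "positive" if has_affirmed_mention else "negative"


# Example: a report with one negated and one affirmed mention of "fracture".
label = classify_report([True, False])
```

A report with no mentions and a report whose mentions are all negated both map to “negative”, which is why a single any() check covers the three cases described in the caption.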