Abstract
Obtaining medication use and response information is essential for both care providers and researchers seeking to understand patients’ medication use and long-term treatment patterns. While unstructured clinical notes contain such information, they have rarely been analyzed at scale for this purpose because manual review is expensive. Here, we aimed to extract and analyze medication use patterns from clinical notes for a population of breast cancer patients at an academic medical center using unsupervised topic modeling techniques. Notably, we proposed a two-stage modeling process built upon correlated topic modeling (CTM) and structural topic modeling (STM) to capture nuanced information about medication behavior, including drug-disease relationships as well as medication schedules. The STM-derived topics show longitudinal prevalence patterns that may reflect changing patient needs and behaviors after the diagnosis of a severe disease. These patterns also show promise as predictors of medication-taking behavior.
INTRODUCTION
With an estimated 70% of Americans taking prescription drugs1, care providers must understand how, when, and why patients take (or do not take) their medications in the outpatient setting. In addition, secondary researchers can make use of high-quality information regarding patients’ medication use to define windows of medication exposure, detect potential medication side effects, and understand patient-reported barriers to taking medication2–4.
Structured electronic health records (EHRs) contain medication lists and refill records, which are valuable for both care providers and researchers seeking to understand medication use. However, a great deal of information on patient behavior also resides in unstructured text such as clinical notes and patient-provider communications5. While use of this information previously required extensive manual review of clinical documents, natural language processing (NLP) has made automated information extraction possible for large clinical document corpora6,7.
For instance, topic modeling, especially its standard implementation, Latent Dirichlet Allocation (LDA), is a classical unsupervised NLP technique used for these information extraction tasks8. LDA treats documents in a corpus as realizations drawn from an underlying probability distribution of abstract topics, which themselves are distributions over the words seen in the corpus. Each document is assumed to be composed of one or more topics, while each topic is assumed to be associated with relatively few high-probability words. The topics produced by such models, which are interpreted by their top probable words, yield useful representations of a text corpus and may possess predictive utility when used as features in other statistical models9.
Existing work on topic modeling over documents from the EHR includes, but is not limited to, 1) prediction of disease progression, staging, or outcome10–12; 2) prediction of hospital readmissions or mortality13–15; and 3) methods to improve topic modeling, topic summarization, or novel information retrieval in the medical domain16–19. In particular, three recent papers suggested that such techniques can be applied to extract medication-related information from unstructured text. Yin et al. showed, in two papers, that topics derived via hierarchical clustering of word embeddings over patient-provider communications can be used to predict initiation20 and discontinuation21 of hormonal adjuvant therapy in breast cancer patients. Beam et al.22 applied LDA to summarize information from clinical notes that would influence a provider’s prescription of sleep medication. All three studies achieved relatively high predictive accuracy with their primary models and developed topics covering side effects, patient-provider communication styles regarding medication, comorbidities, and medical treatments that might provide insight into medication-related behaviors.
Despite these notable findings, little other research exists on using topic modeling methods to assess the outpatient medication-related content of unstructured EHR text, especially for notes from patients with a life-altering disease such as breast cancer. Identifying determinants and patterns of patient medication-taking behavior is an active research area, but existing studies primarily rely on qualitative approaches applied to limited amounts of data23–26. An automated method could improve the volume and variety of documents reviewed for these studies.
However, LDA has significant inherent limitations27 related to modeling topics in clinical corpora. One is an assumption of total independence between topics co-occurring in the same documents, which is unrealistic in medical notes, where topics on diseases would be expected to co-occur with their comorbidities, treatments, and symptoms. Correlated topic modeling (CTM)28 addresses this limitation by treating topics as draws from a logistic normal distribution that can have significant correlations with each other when they co-occur within documents.
Like LDA, however, CTM cannot model how topic prevalence changes over time, even though the content of patients’ clinical notes is expected to change with disease progression and treatment resolution. Structural topic modeling (STM) extends CTM to include document-level metadata in topic models29, allowing more insightful interpretation of topic prevalence. For clinical notes, time (whether an absolute timestamp or a relative time from a patient’s diagnosis) can be treated as a form of metadata in STM30 to investigate how topics’ longitudinal prevalence patterns evolve.
To understand the characteristics and quality of medication-related information in clinical notes, and examine the viability of using topic modeling to retrieve such information, we propose a two-stage topic modeling process over a set of clinical documents derived from patients with breast cancer treated at Vanderbilt University Medical Center (VUMC). This approach enables the topic models to separate general medication information into more nuanced topics by removing extraneous documents from a large clinical note corpus to focus on notes that are most likely to relate to medication and medication use. We refine our models through STM to identify significant signals in patients’ long-term care. Our work demonstrates how a two-stage topic modeling approach effectively extracts condition-specific medication information from the EHR and reveals longitudinal patterns in note information content over time from diagnosis with breast cancer.
METHODS
In this retrospective longitudinal study, we successively apply unsupervised machine learning techniques to reduce the high-dimensional, sparse latent space of unstructured clinical notes and identify medication-related topics. Figure 1 summarizes our research pipeline, which consists of four major components: data extraction, data preprocessing, model training, and model evaluation.
Figure 1.
Research pipeline.
Data extraction
This study used de-identified EHR data from the VUMC Synthetic Derivative (SD)31. We restricted our initial study cohort to patients who: 1) received at least one breast cancer-related ICD 9/10 code (174.x, 175.x, C50.x); 2) had at least one record including demographic, condition, and medication information; 3) were at least 18 years of age in 2010; 4) were born in 1923 or later and, if deceased, died after 2009; and 5) had at least one available clinical note between 2010 and 2021. The clinical note had to contain at least one of the following case-insensitive keywords/stems: ‘adher’, ‘compliance’, ‘compliant’, ‘comply’, ‘drug’, ‘med’, ‘pharm’, ‘pill’, ‘presc’, ‘rx’, and ‘side effect’. Filtering notes for these keywords narrowed our initial note corpus to texts that likely contained medication-related information.
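The keyword screen above amounts to a case-insensitive substring match. A minimal sketch follows; the function name and data structures are illustrative, not the study's actual code:

```python
import re

# Keywords/stems from the cohort definition; matching is a plain
# case-insensitive substring search (a sketch, not the study's code).
KEYWORDS = ['adher', 'compliance', 'compliant', 'comply', 'drug', 'med',
            'pharm', 'pill', 'presc', 'rx', 'side effect']
KEYWORD_RE = re.compile('|'.join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)

def is_candidate_note(text: str) -> bool:
    """True if the note mentions at least one medication-related keyword/stem."""
    return KEYWORD_RE.search(text) is not None
```

Note that stems such as ‘med’ match liberally (e.g., inside ‘immediate’), which suits a deliberately high-recall first pass.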
As we sought to model a broad range of medication-related topics, we set the first breast cancer diagnosis date as year 0 to align each patient’s clinical notes, such that the clinical notes and medical history data before (after) the first breast cancer diagnosis of each patient would have a negative (positive) year value. The date of first breast cancer diagnosis for each patient was defined as the earliest date a breast cancer ICD code was received.
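Under this definition, the year-0 alignment reduces to simple date arithmetic. The following sketch (function name ours) computes the signed year offset described above:

```python
from datetime import date

def years_from_diagnosis(note_date: date, first_dx_date: date) -> float:
    """Signed offset in years of a note relative to the first breast
    cancer diagnosis: negative before diagnosis, positive after."""
    return (note_date - first_dx_date).days / 365.25
```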
Data preprocessing
Raw note data were cleaned and tokenized using a combination of NLTK (version 3.5)32 and hand-written regex patterns in Python 3 (version 3.8). Tokens were not stemmed or lemmatized, as recent research indicated such practices may harm topic model performance33. The term ‘repnumber’ was substituted for all numeric quantities, while ‘meddose’ was used for combinations of a number and mg or ml. Additional substitutions included ‘trttiming’ for common abbreviations for scheduled medication timing, such as qid and bid, and ‘admroute’ for the abbreviations po and iv. The term prn (“pro re nata,” as needed, for medications patients take at their own discretion) was retained, as “prn medications” and “scheduled medications” constitute different classes where patient behavior is concerned. Some partially numeric terms such as o2 and co2 were expanded (to oxygen and carbondioxide, respectively) and hyphens were replaced with a null character to collapse terms such as x-ray or j-tube. Stop words—both the common English set from NLTK as well as a corpus-derived set defined by the authors—were removed.
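These substitutions can be approximated with ordered regex rules. The patterns below are our assumptions (the paper's hand-written regexes are not published):

```python
import re

# Ordered normalization rules approximating the substitutions described
# above; the specific patterns are illustrative, not the study's own.
SUBSTITUTIONS = [
    (re.compile(r'\b\d+(?:\.\d+)?\s*(?:mg|ml)\b', re.IGNORECASE), 'meddose'),
    (re.compile(r'\b(?:qid|bid|tid)\b', re.IGNORECASE), 'trttiming'),
    (re.compile(r'\b(?:po|iv)\b', re.IGNORECASE), 'admroute'),
    (re.compile(r'\bo2\b', re.IGNORECASE), 'oxygen'),
    (re.compile(r'\bco2\b', re.IGNORECASE), 'carbondioxide'),
    (re.compile(r'\b\d+(?:\.\d+)?\b'), 'repnumber'),  # remaining numbers
    (re.compile(r'-'), ''),  # collapse hyphenated terms: x-ray -> xray
]

def normalize(text: str) -> str:
    """Apply the substitutions in order, then lowercase."""
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    return text.lower()
```

Order matters here: the dose pattern must fire before the generic number rule, or ‘10 mg’ would become ‘repnumber mg’ rather than ‘meddose’.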
To mitigate the negative impact of redundant information in clinical notes34, we incorporated redundancy filtering into our research pipeline. Our approach followed Cohen et al.35, who calculate redundancy as the fraction of character sequences in a patient’s note that are identical to sequences in their previous notes. In our initial experiments, we found that removing fully redundant documents (redundancy score equal to 1) was best for model performance: it improved topic semantic coherence (high-probability words in a topic frequently co-occur in a document36) and exclusivity (high-probability words in one topic are exclusive to that topic37) while also increasing the lower bound of the maximum model likelihood. Therefore, after calculating the redundancy score of each note, we removed all completely redundant notes from the corpus. Finally, we removed tokens that had fewer than three characters and did not correspond to common clinical terms (see Appendix for the list of included tokens <3 characters in length), as well as tokens appearing in fewer than 100 documents.
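As a rough stand-in for the sequence-based redundancy measure of Cohen et al., one can score the fraction of a note's character n-grams already present in the patient's earlier notes. The n-gram formulation and the choice n=10 are our simplifications:

```python
def redundancy_score(note: str, prior_notes: list[str], n: int = 10) -> float:
    """Fraction of the note's character n-grams that appear in the
    patient's earlier notes; 1.0 marks a fully redundant note."""
    if len(note) < n:
        return 0.0
    prior_grams: set[str] = set()
    for prev in prior_notes:
        prior_grams.update(prev[i:i + n] for i in range(len(prev) - n + 1))
    grams = [note[i:i + n] for i in range(len(note) - n + 1)]
    return sum(g in prior_grams for g in grams) / len(grams)

# Fully redundant notes (score == 1.0) would then be dropped from the corpus.
```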
Model training and evaluation
After cleaning the data, we relied on topic modeling to capture nuanced descriptions of medication-taking behavior in clinical notes. We applied multiple rounds of model training for corpus refinement and hyperparameter tuning. All the topic modeling models were trained and evaluated using the stm package (version 1.3.6)30.
Because the original data may contain notes with little connection to medication-taking behavior, directly tuning hyperparameters (e.g., over a large range of candidate topic counts) on this dataset would be time-consuming and would generate many irrelevant topics. To mitigate this issue, we designed a two-stage topic modeling pipeline. In the first stage, we refined our corpus to documents with a high prevalence of medication-related terms. To do this, we trained an initial “filtering” topic model, from which we selected all medication-related topics to form a composite medication-relatedness score for each document in the corpus. Filtering the corpus at varying thresholds of this score allowed us to create corpora with varying degrees of medication-relatedness for the second round of model training.
We chose a 30-topic CTM model for our filtering model based on preliminary modeling work which showed that this number of topics yielded a model with good semantic coherence and exclusivity (e.g., it developed several medication-specific topics while not splitting medication-related terminology across too many topics). After training this model, we manually reviewed its topics to identify those related to outpatient medication use. Authors KK and TB independently identified topics they believed represented outpatient medication-related content, then compared their lists. Topics on both lists were immediately accepted as being medication-related, while discrepant topics were discussed until consensus on inclusion or exclusion from the score was reached. Once consensus was achieved on the medication-related topics, we computed the cumulative gamma value 𝛾M of these topics for each document following Equation (1), creating its medication-relatedness score. 𝛾M scores how much each note in the corpus discussed medication and, potentially, medication-related behaviors.
𝛾M = ∑t∈T 𝛾t (1)
where T represents the set of topics containing medication-related information and 𝛾t is the document’s estimated proportion for topic t.
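Given the document-topic proportion matrix from the fitted model, Equation (1) reduces to a column sum. This NumPy-based sketch (names ours) mirrors that computation:

```python
import numpy as np

def medication_relatedness(theta: np.ndarray, med_topics: list[int]) -> np.ndarray:
    """Equation (1): gamma_M for each document is the summed proportion
    of the selected medication-related topics. theta is a documents x
    topics matrix; med_topics uses the paper's 1-based topic numbering."""
    cols = [t - 1 for t in med_topics]  # convert to 0-based column indices
    return theta[:, cols].sum(axis=1)

# Second-stage corpora are then formed by thresholding, e.g. gamma_M > 0.5:
# keep = medication_relatedness(theta, med_topics) > 0.5
```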
In the second stage, we trained four STM models (of 30, 45, 60, and 75 topics) on each of four different corpora: the original corpus, and three corpora filtered by thresholds of 𝛾M >0.1, >0.2, and >0.5, respectively. The metadata that each STM model was adjusted for included: 1) patient gender, race, and ethnicity; 2) the year the note was documented; 3) the patient’s age when the note was documented; and 4) the number of years from the patient’s first breast cancer diagnosis to the note date, which can be either positive or negative. This feature also allowed us to evaluate temporal topic prevalence relative to the patient’s initial breast cancer diagnosis. The note year served as an adjustment for temporal changes in the standard of care.
We evaluated each model’s residuals, lower bound of the maximum model likelihood, and topic semantic coherence and exclusivity to select a best-performing model. Author JW provided expert annotation of the topics from that model, after which we investigated how topic prevalence varied by the STM model’s metadata.
RESULTS
Study cohort
Our initial and final study cohorts consist of 1,306 and 1,031 patients, respectively. These patients are primarily White, non-Hispanic females with an average age of 60 years when first diagnosed with breast cancer. The cohort additionally includes a nontrivial proportion of male and Black patients. Table 1 contains cohort demographics for the two study cohorts. It should be noted that the statistics for the final patient cohort are calculated based on the note corpus with 𝛾M>0.5, as this corpus was determined to produce the best resulting topic models (see below).
Table 1:
Cohort demographics.
| Raw cohort | Final patient cohort | |
|---|---|---|
| Characteristic | Count (proportion), n = 1,306 | Count (proportion), n = 1,031 |
| Sex | ||
| Female | 1,193 (0.91) | 931 (0.90) |
| Male | 113 (0.09) | 100 (0.10) |
| Race | ||
| White | 1,054 (0.81) | 862 (0.84) |
| Black | 130 (0.10) | 111 (0.11) |
| Asian | 6 (0.004) | 6 (0.006) |
| Other | 8 (0.006) | 5 (0.005) |
| Unknown/Not reported | 108 (0.08) | 47 (0.05) |
| Ethnicity | ||
| Hispanic | 9 (0.006) | 9 (0.009) |
| Non-Hispanic | 1,182 (0.91) | 972 (0.94) |
| Unknown/Not reported | 115 (0.09) | 50 (0.05) |
| Age at first breast cancer diagnosis (years) | ||
| Mean (SD) | 59.6 (13.3) | 60.1 (13.2) |
From the raw cohort, we obtained 145,934 clinical notes that included at least one of the keywords. The number of notes per patient follows a long-tailed distribution, where 90% of patients have 283 notes or fewer (Figure 2). This distribution is preserved for the corpus reduced by the redundancy filter and for our final corpus, which was further reduced by the medication-relatedness score filter. In the final corpus, 90% of patients have 66 notes or fewer.
Figure 2.
Distribution of the number of notes per patient for the raw corpus, the redundancy filtered corpus, and the final corpus further filtered for notes with a 𝛾M≥0.5. Note that the x-axis is log-scaled.
CTM model to filter clinical notes
After applying the redundancy filter to our initial corpus, we were left with 124,347 clinical notes. We identified seven topics (1, 2, 3, 5, 12, 14, and 15; see Appendix: Filtering CTM Topics) as unambiguously containing content related to outpatient medications. Topics 23 and 29 also appeared to contain medication-related content, but on review, we determined that topic 23 contained primarily words related to lab results and topic 29 was a mixed topic containing many terms unrelated to medication.
Topics 11 and 18 proved more difficult to classify. We initially excluded topic 18 from 𝛾M as it primarily contained terms related to infusion of chemotherapy medications in a (presumably) monitored setting, while including topic 11 as it contained generic medication-related terms. However, more careful review of topic 11 showed it contained specific terms (e.g., ‘levophed’, ‘norepinephrine’, ‘vancomycin’, ‘electrolytea’) for medications administered intravenously in an ER or inpatient setting. This argued for topic 11’s exclusion on the same grounds as topic 18, or conversely, topic 18’s inclusion in calculating 𝛾M given its substantial medication-related content. However, subsequent testing determined that including topic 11, but not topic 18, in calculating 𝛾M improved the subjective quality of topics in the second-stage models. Figure 3 depicts the correlation structure (cutoff ρ = 0.01) of topics from the filtering CTM, showing topic 18 to be uncorrelated with all other topics, while topic 11 clusters with many of the other chosen topics, further justifying its inclusion in the calculation of 𝛾M. We ultimately selected topics 1, 2, 3, 5, 11, 12, 14, and 15 to calculate 𝛾M. These topics contained a mix of generic (e.g., ‘capsule’, ‘meddose’, ‘oral’, ‘prn’, ‘tablet’) and condition-specific (e.g., ‘insulin’, topic 5; ‘amiodarone’, ‘coumadin’, topic 1) terms.
Figure 3.
Correlation graph for topics in the filtering CTM, colored by inclusion status in calculating 𝛾M.
Hyperparameter selection
Next, we trained 16 STM models on corpora further refined by 𝛾M values derived from topics of the previous CTM model. The 16 models differed in the training corpus (per the 𝛾M threshold) and the number of topics. Figure 4a, b, and c display diagnostics for the 16 models. The number of notes for each 𝛾M threshold was as follows: 124,347 for 0; 74,447 for 0.1; 54,807 for 0.2; and 26,363 for 0.5. All two-stage models (𝛾M threshold > 0) demonstrated performance improvements over the single-stage models (𝛾M threshold = 0).
Figure 4.
STM model diagnostics for second round of topic modeling. Higher 𝛾M thresholds lead to nearly monotonic improvements in most model diagnostics. Models using the original redundancy-filtered corpus (𝛾M threshold = 0) form a single-stage baseline for comparison.
We found that models trained on the corpus of notes with 𝛾M >0.5—the smallest corpus, with 26,363 notes—had the largest model likelihood lower bound, the smallest residuals, and semantic coherence scores competitive with the other models, so we focused on these models. Among the models trained on notes where 𝛾M >0.5, the model with 60 topics showed the best balance between exclusivity and semantic coherence (Figure 4d).
Topics of the selected STM model
Among the model’s 60 topics, some topics related to condition-specific medications while others seemed to capture aspects of breast cancer progression and treatment. We display the most probable terms and the most frequent and exclusive (FREX) terms for some of these topics in Table 2. The full set of topics is available in the Appendix: Best STM Topics.
Table 2.
A collection of breast cancer and medication-related topics from the best STM model.
| Topic | High Probability Terms | High FREX Terms | Interpretation |
|---|---|---|---|
| 18 | meddose, insulin, units, diabetes, daily, tablet, left, unit, also, mouth | lancets, pen, insulin, ultra, humalog, lantus, strips, onetouch, sugar, novolog | Diabetes management |
| 20 | inr, warfarin, coumadin, visit, diagnosis, anticoagulation, dose, week, lovenox, dosing | inr, warfarin, fri, tues, coumadin, wed, mon, thurs, sun, anticoagulation | Anticoagulation, including weekly timing of coumadin/warfarin doses and INR testing |
| 22 | disease, r, l, breast, since, treatment, metastatic, meddose, history, progression | view, brerepnumber, restaging, scans, life, palliation, initiation, preservation, understood, participation | Breast cancer status |
| 36 | pain, meddose, oxycodone, medication, trttiming, use, continue, prn, medications, history | oxycontin, oxycodone, fentanyl, worst, patches, patch, breakthrough, dilaudid, prescribed, efficacy | Pain management and pain medications; patient descriptions of pain |
| 37 | meddose, day, trttiming, negative, denies, transplant, gvhd, started, today, admroute | gvhd, cmv, engraftment, pbsct, donor, csa, fk, antiinfective, valtrex, allo | Immunosuppressives and management of transplant patients |
| 39 | er, bone, metastatic, meddose, breast, faslodex, hours, diagnosis, pain, r | xgeva, faslodex, counter, [identifier], ibrance, aromasin, letrozole, afinitor, exemestane, unexpected | Treatment of hormone-positive metastatic breast cancer |
| 43 | inhaler, mcg, daily, nasal, actuation, day, spray, meddose, albuterol, puffs | inhalation, inhaler, puffs, hfa, albuterol, aerosol, puff, spiriva, advair, mcgrepnumber | Inhaled medications for respiratory illness |
| 47 | breast, left, right, biopsy, carcinoma, lymph, axillary, node, grade, negative | mammogram, birads, nipple, histologic, intermediate, ultrasound, extent, receptor, proliferative, breasts | Initial diagnosis and management of breast cancer in the pretreatment period |
| 53 | pleural, effusion, er, meddose, negative, disease, since, metastatic, diagnosis, pr | thoracentesis, pleural, effusion, individual, gdcrepnumber, pericardial, effusions, cfr, nrc, regulation | Metastatic breast cancer; pleural effusions are a common site of metastasis |
| 60 | radiation, brain, treatment, mri, oncology, cgy, lesion, dose, metastatic, resection | cgy, brain, fractions, frontal, simulation, radiation, onc, srs, cerebellar, roncop | Brain metastases from breast cancer |
These topics contained common, nonspecific terms, such as dosing (‘meddose’) and timing (‘trttiming’) information alongside more nuanced, disease-specific information. The anticoagulation topic (Topic 20), for instance, contained terms for weekdays, relating to specific timing of anticoagulant medication administration.
The STM also revealed how breast cancer-specific topics change in proportion in the corpus over time from a breast cancer diagnosis. Figure 5 displays topic proportions for several breast-cancer-related topics plotted against year from first breast cancer diagnosis. The sharpest increase at the time of diagnosis was seen in topic 47 (initial diagnosis and management of breast cancer), while topics 53 and 60, with terms for metastatic cancer, grew in proportion further from patients’ initial diagnoses.
Figure 5.
Topic proportion over time from breast cancer diagnosis for breast-cancer-related latent topics.
A different longitudinal pattern was seen in topics for patients’ existing comorbidities. The left side of Figure 6 shows topic proportions over time from breast cancer diagnosis for four comorbidity topics (18, 20, 37, 43) captured by our model. This “bow-shaped” curve showed a sharp reduction in topic proportion in the period immediately around patients’ breast-cancer diagnoses, followed by a local maximum in proportion somewhere between 0 and 5 years following diagnosis, a reduction to a local minimum, and then a gradual increase in topic proportion again. The pattern is much more pronounced for the topic relating to post-transplant immunosuppression, compared to the other three topics, but has the same relative shape in all. By contrast, topic 36 (pain medication and management), seen at the right of Figure 6, increases in prevalence just prior to diagnosis and remains relatively stable in proportion afterward, though it shows some of the same bow-shaped pattern as the others.
Figure 6.
Topic proportion over time from breast cancer diagnosis for latent topics covering medications used for various comorbid diseases (left) and pain (right).
DISCUSSION
In this study, we demonstrated how a two-stage unsupervised topic modeling pipeline can extract specific topics about patient medication use from clinical documents drawn from the EHR, substantially improving model performance over a single-stage model trained on a large document corpus. This supports the use of latent topic models to filter documents of interest out of large research corpora used in training other language or machine learning models.
After using two preliminary steps for filtering our corpus to texts with the highest concentration of medication-specific discussion, we obtained a model that captured several latent topics with high specificity for terms around medication use. These included topics for several diseases and their treatments (diabetes in topic 18, anticoagulant therapy in topic 20, post-transplantation immunosuppression in topic 37, and inhaled medications in topic 43), anti-cancer treatments (topics 17, 23, 26, 30, 35, 39, 59), pain control and opioid medications (topic 36), pharmacy refill requests (topics 14 and 31), and nuanced information relating to the timing of medication (topic 20).
While our filtering criteria were highly specific for medication-related terms, the second-stage topics also captured other areas of patient experience, such as the ability to perform activities of daily living (topic 2), nutrition needs for patients with difficulty eating (topic 7), urgent care consultation in the context of severe symptoms (topic 32), and non-pharmaceutical pain management (topic 38). All of these topics may represent barriers to patients taking their medications, thus forming rich additional context on patient medication-taking behavior2–4.
With the addition of a temporal covariate to the STM, we found that the prevalence of topics could vary by time from a patient’s breast cancer diagnosis. Figure 5 shows that topic prevalence for topics related to breast cancer progression and staging rises sharply at initial diagnosis, then fluctuates over time as treatment milestones or disease progression occur. Topic proportions of medication-related topics might therefore be expected to increase or subside with the presence and severity of an underlying condition in other diseases.
The bow-shaped pattern seen in Figure 6 further suggests that the diagnosis of a new, severe condition causes significant changes in how existing conditions are discussed. The proportion of topics concerning pre-existing comorbidities dips around the time of breast cancer diagnosis, reflecting that medication-related notes in this period contain much more breast-cancer-related content and less focus on existing conditions. It rises again in the following years while a patient is under treatment (potentially corresponding to exacerbation of these conditions and/or complications arising from breast cancer treatment), declines near the five-year mark (often a milestone for cessation of adjuvant therapy in breast cancer), and then rises once more. If these changes in topic proportion also reflect neglect or exacerbation of the treatment of pre-existing conditions, they may be predictive of changes in medication behavior for those conditions as well, such as deliberate cessation of medications that interact poorly with new treatments or unintentional cessation due to lifestyle changes while dealing with a new, life-altering disease. Cheng and Levy demonstrated that even patients with stage I breast cancer carry a high treatment workload40, and that this workload increases with disease stage. Insofar as treatment burden affects patients’ ability to manage their own care4, our results show that signals of this difficulty may be detected in medication-related note text using a wholly automated method.
The emergence of a topic related to transplantation and immunosuppression was surprising, considering that these do not represent standard current treatment approaches for breast cancer. We initially surmised one of two possible explanations: the non-trivial rate of secondary leukemias resulting from cytotoxic chemotherapeutics used to treat breast cancer38 and subsequent allogeneic stem cell transplant to treat them; or the use of autologous stem-cell transplant following high-dose chemotherapy, a regrettable approach popularized in the 1990s that took many years to debunk39. Subsequent examination of notes with large γ values for this topic showed that many of the highest-scoring notes were for patients with a leukemia presumed secondary to prior breast cancer treatment. We further discovered that patients’ first billing code for breast cancer often corresponded with a note date for initial work-up of the leukemia, indicating that this was the first mention of a prior, resolved breast cancer in medical notes at our institution. This may explain the sharp spike in this topic’s prevalence at year 0 from breast cancer “diagnosis” (Figure 6) and argues for using more precise techniques to determine patients’ initial diagnosis dates for a disease of interest.
Our results suggest that two-stage topic modeling produces nuanced, human-comprehensible latent topics that may be suitable to use as features for studying and predicting patient medication behavior. In addition, when examined in the context of a disease-specific time measure, these topics also reveal how mentions of patient medication behavior are affected by changes in a patient’s condition. Longitudinal changes in medication-related topic proportions might thus be applied to predict changes in patient medication behavior.
Limitations and future directions
Despite our notable findings, we acknowledge several limitations. Although we worked with a large patient cohort, this is a single-institution study that may be biased by geographic, temporal, and care delivery particularities. Additionally, we used a very liberal criterion of a single breast cancer code to identify the study cohort. While this increased the size and variety of the note corpus, it included notes from patients who did not have breast cancer, which weakened our conclusions about breast cancer patients. However, our final model produced a variety of high-quality topics concerning breast cancer even in the presence of these false positives. Future research should include a more robust phenotype definition as well as more temporal precision in resolving a patient’s actual date of diagnosis.
Technically, we noted that several poorly processed words entered our final topics, which might have been avoided with better preprocessing to remove note metadata. A lack of standardization of terms and of expansion/replacement of common medical acronyms (such as those used for timing or route of medications) also rendered our final topics less interpretable than they could have been. Nevertheless, these topics demonstrated the proposed two-stage pipeline’s ability to capture medication-specific behaviors and outcomes, supporting our overall conclusion. Future work should explore other ways of excluding metadata terms from clinical text corpora.
Another limitation that could be addressed in future research is that we did not consider multi-word phrases (e.g., bigrams, trigrams) in our models. Including bigrams would have made topics dealing with, say, side effects easier to detect, because they would include the term “side effect” rather than simply “effect”. However, incorporating n-grams substantially increases computational cost.
Finally, we found the subjective quality of the second-stage models to be highly sensitive to the specific topics selected from the first-stage filtering model for generating the medication-relatedness score, and therefore the particular subset of the corpus selected to train on. Inclusion of an uncorrelated topic with a group of correlated topics in calculating 𝛾M greatly reduced subjective topic quality, suggesting future sensitivity analyses on the extent to which the topic correlation affects the filtering model performance.
CONCLUSIONS
In this paper, we presented a two-stage topic modeling pipeline for extracting latent topics related to medication use in clinical notes for a cohort of breast cancer patients. We found relevant, human-interpretable latent topics capturing medication use and patient experience with several medical conditions. In addition to demonstrating the value of the two-stage approach to modeling, the inferred topics showed demographic and temporal associations that may be beneficial to future researchers who wish to extract information on patient medication use from EHRs.
ACKNOWLEDGMENTS
This publication was supported by the National Center for Advancing Translational Sciences (UL1TR000445) and the National Cancer Institute (R37CA237452) of the National Institutes of Health. We gratefully acknowledge Dr. Tom Lasko for help in gaining data access to the VUMC SD. We also acknowledge Julia Silge for her excellent tutorial on the stm package*.
Footnotes
* Available at https://juliasilge.com/blog/evaluating-stm/
APPENDIX
All topics from the best STM model (60 topics, trained on documents with 𝛾M>0.5) and from the filtering CTM model can be found at https://bit.ly/3xwarPO. The third worksheet contains a list of words of fewer than three characters that were included in the final token list.
REFERENCES
1. Zhong W, Maradit-Kremers H, St. Sauver JL, Yawn BP, Ebbert JO, Roger VL, et al. Age and Sex Patterns of Drug Prescribing in a Defined American Population. Mayo Clin Proc. 2013 Jul;88(7):697–707. doi: 10.1016/j.mayocp.2013.04.021.
2. Brown MT, Bussell JK. Medication Adherence: WHO Cares? Mayo Clin Proc. 2011 Apr;86(4):304–14. doi: 10.4065/mcp.2010.0575.
3. Gellad WF, Grenard JL, Marcum ZA. A Systematic Review of Barriers to Medication Adherence in the Elderly: Looking Beyond Cost and Regimen Complexity. Am J Geriatr Pharmacother. 2011 Feb 1;9(1):11–23. doi: 10.1016/j.amjopharm.2011.02.004.
4. Haynes RB, McDonald HP, Garg AX. Helping Patients Follow Prescribed Treatment: Clinical Applications. JAMA. 2002 Dec 11;288(22):2880–3. doi: 10.1001/jama.288.22.2880.
5. Andrade SE, Kahler KH, Frech F, Chan KA. Methods for evaluation of medication adherence and persistence using automated databases. Pharmacoepidemiol Drug Saf. 2006;15(8):565–74. doi: 10.1002/pds.1230.
6. Moen H, Peltonen L-M, Heimonen J, Airola A, Pahikkala T, Salakoski T, et al. Comparison of automatic summarisation methods for clinical free text notes. Artif Intell Med. 2016 Feb 1;67:25–37. doi: 10.1016/j.artmed.2016.01.003.
7. Koleck TA, Dreisbach C, Bourne PE, Bakken S. Natural language processing of symptoms documented in free-text narratives of electronic health records: a systematic review. J Am Med Inform Assoc. 2019 Feb 6;26(4):364–79. doi: 10.1093/jamia/ocy173.
8. Blei DM, Ng AY, Jordan MI. Latent Dirichlet Allocation. J Mach Learn Res. 2003;3(Jan):993–1022.
9. Jelodar H, Wang Y, Yuan C, Feng X, Jiang X, Li Y, et al. Latent Dirichlet allocation (LDA) and topic modeling: models, applications, a survey. Multimed Tools Appl. 2019 Jun 1;78(11):15169–211.
10. Meng Y, Speier W, Ong M, Arnold CW. HCET: Hierarchical Clinical Embedding With Topic Modeling on Electronic Health Records for Predicting Future Depression. IEEE J Biomed Health Inform. 2021 Apr;25(4):1265–72. doi: 10.1109/JBHI.2020.3004072.
11. Wang L, Miloslavsky E, Stone JH, Choi HK, Zhou L, Wallace ZS. Topic modeling to characterize the natural history of ANCA-Associated vasculitis from clinical notes: A proof of concept study. Semin Arthritis Rheum. 2021 Feb 1;51(1):150–7. doi: 10.1016/j.semarthrit.2020.10.012.
12. Wang L, Lakin J, Riley C, Korach Z, Frain LN, Zhou L. Disease Trajectories and End-of-Life Care for Dementias: Latent Topic Modeling and Trend Analysis Using Clinical Notes. AMIA Annu Symp Proc. 2018 Dec 5;2018:1056–65.
13. Rumshisky A, Ghassemi M, Naumann T, Szolovits P, Castro VM, McCoy TH, et al. Predicting early psychiatric readmission with natural language processing of narrative discharge summaries. Transl Psychiatry. 2016 Oct;6(10):e921. doi: 10.1038/tp.2015.182.
14. Korach ZT, Cato KD, Collins SA, Kang MJ, Knaplund C, Dykes PC, et al. Unsupervised Machine Learning of Topics Documented by Nurses about Hospitalized Patients Prior to a Rapid-Response Event. Appl Clin Inform. 2019 Oct;10(5):952–63. doi: 10.1055/s-0039-3401814.
15. Jo Y, Lee L, Palaskar S. Combining LSTM and Latent Topic Modeling for Mortality Prediction. 2017 Sep 8. Available from: http://arxiv.org/abs/1709.02842v1.
16. Zhang R, Pakhomov SVS, Arsoniadis EG, Lee JT, Wang Y, Melton GB. Detecting clinically relevant new information in clinical notes across specialties and settings. BMC Med Inform Decis Mak. 2017 Jul;17(2):15–22. doi: 10.1186/s12911-017-0464-y.
17. Kandula S, Curtis D, Hill B, Zeng-Treitler Q. Use of Topic Modeling for Recommending Relevant Education Material to Diabetic Patients. AMIA Annu Symp Proc. 2011;2011:674–82.
18. Zhang R, Pakhomov S, McInnes BT, Melton GB. Evaluating Measures of Redundancy in Clinical Texts. AMIA Annu Symp Proc. 2011;2011:1612–20.
19. Bai T, Chanda AK, Egleston BL, Vucetic S. Joint Learning of Representations of Medical Concepts and Words from EHR Data. Proc IEEE Int Conf Bioinforma Biomed. 2017 Nov;2017:764–9. doi: 10.1109/BIBM.2017.8217752.
20. Yin Z, Warner JL, Chen Q, Malin BA. Patient Messaging Content Associated with Initiating Hormonal Therapy after a Breast Cancer Diagnosis. AMIA Annu Symp Proc. 2020 Mar 4;2019:962–71.
21. Yin Z, Harrell M, Warner JL, Chen Q, Fabbri D, Malin BA. The therapy is making me sick: how online portal communications between breast cancer patients and physicians indicate medication discontinuation. J Am Med Inform Assoc. 2018 Nov 1;25(11):1444–51. doi: 10.1093/jamia/ocy118.
22. Beam AL, Kartoun U, Pai JK, Chatterjee AK, Fitzgerald TP, Shaw SY, et al. Predictive Modeling of Physician-Patient Dynamics That Influence Sleep Medication Prescriptions and Clinical Decision-Making. Sci Rep. 2017 Feb 9;7(1):42282. doi: 10.1038/srep42282.
23. Huang LH. Medication-taking behavior of the elderly. Kaohsiung J Med Sci. 1996 Jul;12(7):423–33.
24. Laws MB, Lee Y, Rogers WH, Beach MC, Saha S, Korthuis PT, et al. Provider-patient communication about adherence to anti-retroviral regimens differs by patient race and ethnicity. AIDS Behav. 2014 Jul;18(7):1279–87. doi: 10.1007/s10461-014-0697-z.
25. Kobayashi A, Tamura A, Ichihara T, Minagawa T. Factors associated with changes over time in medication-taking behavior up to 12 months after initial mild cerebral infarction onset. J Med Invest. 2017;64(1.2):85–95. doi: 10.2152/jmi.64.85.
26. Yang C, Qin W, Yu D, Li J, Zhang L. Medication Adherence and Associated Factors for Children With Tic Disorders in Western China: A Cross-Sectional Survey. Front Neurol. 2019 Nov 5;10. Available from: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6848256/
27. Xia L, Luo D, Zhang C, Wu Z. A Survey of Topic Models in Text Classification. In: 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD). 2019. p. 244–50.
28. Blei DM, Lafferty JD. A correlated topic model of Science. Ann Appl Stat. 2007 Jun;1(1):17–35.
29. Roberts ME, Tingley D, Stewart BM, Airoldi EM. The Structural Topic Model and Applied Social Science. Neural Inf Process Soc. 2013;4.
30. Roberts ME, Stewart BM, Tingley D. stm: An R Package for Structural Topic Models. J Stat Softw. 2019;91(2). Available from: http://www.jstatsoft.org/v91/i02/
31. Roden D, Pulley J, Basford M, Bernard G, Clayton E, Balser J, et al. Development of a Large-Scale De-Identified DNA Biobank to Enable Personalized Medicine. Clin Pharmacol Ther. 2008 Sep;84(3):362–9. doi: 10.1038/clpt.2008.89.
32. Bird S, Klein E, Loper E. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc.; 2009.
33. Schofield A, Magnusson M, Thompson L, Mimno D. Understanding Text Pre-Processing for Latent Dirichlet Allocation. 2017.
34. Cohen R, Elhadad M, Elhadad N. Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies. BMC Bioinformatics. 2013 Jan 16;14:10. doi: 10.1186/1471-2105-14-10.
35. Cohen R, Aviram I, Elhadad M, Elhadad N. Redundancy-Aware Topic Modeling for Patient Record Notes. PLoS ONE. 2014 Feb 13;9(2):e87555. doi: 10.1371/journal.pone.0087555.
36. Mimno D, Wallach HM, Talley E, Leenders M, McCallum A. Optimizing semantic coherence in topic models. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2011.
37. Bischof JM, Airoldi EM. Summarizing topical content with word frequency and exclusivity. In: Proceedings of the 29th International Conference on Machine Learning (ICML'12). Madison, WI, USA: Omnipress; 2012. p. 9–16.
38. Howard RA, Gilbert ES, Chen BE, Hall P, Storm H, Pukkala E, et al. Leukemia following breast cancer: an international population-based study of 376,825 women. Breast Cancer Res Treat. 2007 Nov 1;105(3):359–68. doi: 10.1007/s10549-006-9460-0.
39. Rettig RA, Jacobson PD, Farquhar CM, Aubry WM. False Hope: Bone Marrow Transplantation for Breast Cancer. Oxford University Press; 2007. 368 p.
40. Cheng AC, Levy MA. Measures of Treatment Workload for Patients With Breast Cancer. JCO Clin Cancer Inform. 2019 Feb 4;3:CCI.18.00122.