Abstract
Lifestyle modification, including diet, exercise, and tobacco cessation, is the first-line treatment of many disorders including hypertension, obesity, and diabetes. Lifestyle modification data are not easily retrieved or used in research due to their textual nature. This study addresses this knowledge gap using natural language processing to automatically identify lifestyle modification documentation from electronic health records. Electronic health record notes from hypertension patients were analyzed using an open-source natural language processing tool to retrieve assessment and advice regarding lifestyle modification. These data were classified as lifestyle modification assessment or advice and mapped to a coded standard ontology. Combined lifestyle modification (advice and assessment) recall was 99.27 percent, precision 94.44 percent, and correct classification 88.15 percent. Through extraction and transformation of narrative lifestyle modification data to coded data, this critical information can be used in research, metric development, and quality improvement efforts regarding care delivery for multiple medical conditions that benefit from lifestyle modification.
Keywords: electronic health records, health behavior, hypertension, lifestyle, natural language processing, text mining
Background and significance
National guidelines have recommended the use of lifestyle modification in the treatment and prevention of prevalent disorders plaguing the US today including hypertension, obesity, coronary artery disease, diabetes, peripheral vascular disease, and cancer.1-9 The top seven US health risks for combined disability and death identified in the Global Burden of Disease Study 2016 were tobacco, dietary risks, high body mass index, alcohol and drug abuse, high blood pressure, high fasting plasma glucose, and high cholesterol.10,11 These risks can be minimized through the implementation of lifestyle modifications (i.e. tobacco cessation, dietary changes, alcohol moderation, illicit drug use abstinence, increased aerobic exercise, and weight loss to achieve a healthy weight). The efficacy of addressing lifestyle in just one counseling session has been shown.12 For example, behavior change can be elicited by making patients aware of their elevated blood pressure readings after a provider encounter.13 Despite the effectiveness and importance of lifestyle modification in treating hypertension and many chronic illnesses, lifestyle modification is under-utilized and not easily measured, due to it being buried in narrative text rather than in consistent coded form (e.g. diagnosis or procedure codes).14-21 This limits accurate evaluation of the efficacy of these interventions, tracking of quality care metrics, and reimbursement. Evaluation of lifestyle modification includes both the assessment of lifestyle modification activities as reported by a patient or observed by a provider (e.g. “patient started running and weight is down”) and advice on lifestyle modification activities given by a provider to a patient (e.g. “recommend patient lose 30 pounds”). Systematic identification of lifestyle modification is the first step toward improving its use in clinical practice and establishing it as a quality metric. 
While some clinical information systems code limited individual behaviors (e.g. smoking history), much of this information continues to be recorded primarily in narrative form.
There is a need for automated methods that can facilitate the extraction and integration of lifestyle behavior factors for use in research. Historically, manual chart review has been used to abstract information from patient records, but this has proven to be a time- and labor-intensive process, making large-scale chart abstractions nearly impossible. To accomplish this task more efficiently, this study used natural language processing (NLP) software tools and processes that can automatically extract text-based information. NLP tools can process many thousands of notes per hour.22 This technology makes larger chart abstractions feasible and allows a more comprehensive evaluation of documentation of lifestyle modification.
NLP has been used to successfully extract data from electronic clinical records and applied in many fields for efficient and accurate chart abstraction.23-27 Some studies have explored the automated extraction of information on smoking status as an isolated finding.28-31 One study looked at NLP tool augmentation to extract cardiovascular risk identification.32 Another study used the MediClass NLP tool for extraction of information on weight management counseling in postpartum visits and showed extraction capabilities similar to human abstractors.33 A separate study extracted limited lifestyle modification documentation in evaluation of diabetes management.34 Multiple prior studies of lifestyle modification have used survey data, with potential bias and/or generalizability concerns. To our knowledge, this work is the first to evaluate lifestyle modification for hypertension in a large population using automated methods.
The primary objective of this study was to use an existing open-source natural language processing tool, cTAKES, along with rules and regular expressions on existing electronic health records to make previously invisible data on lifestyle modification documentation visible and ready for analysis. We used an existing data set that had been used in prior evaluations of hypertension treatment and clinical inertia.35 This data set was chosen because it could accomplish two goals: (1) evaluate the feasibility of automatic extraction of lifestyle modification (LM) using an open-source NLP tool allowing use and application to multiple chronic disease evaluations at different institutions, and (2) facilitate extension of prior work in this patient population to improve care of hypertension patients early in their disease process. Methods used in this study are designed so that they can be applied to many other patient populations including those with other chronic diseases such as obesity, coronary artery disease, diabetes, and peripheral vascular disease.
Methods and materials
Institutional and clinical settings: UW Health is the academic health system for the University of Wisconsin–Madison. UW Health adopted the Epic Systems Corporation’s electronic health record in 2004 and has created a Health Information Management Center, which is devoted to the integrity of system-wide electronic health record data and facilitating its use for improved patient care and health. All relevant full-text documents from patients meeting inclusion criteria were retrieved from the UW Health electronic health record. An overview diagram of this study’s steps is provided in Figure 1 with more details of the steps and methods following.
Step 1: retrieving and dividing data set
Study inclusion and exclusion criteria are detailed in Table 1.35
Table 1.
Inclusion criteria (from this retrospective patient population) |
|
Exclusion criteria |
|
The hypertension diagnosis criteria were based on JNC 7, reflecting the guidelines available when the hypertension data set was created. However, for the lifestyle modification focus of this analysis, more recent guidelines, including JNC 8,36-39 and the 2017 American Heart Association guidelines, reflect similar recommendations. LM concepts reflect all three sources. Blood pressure measurements were extracted from discrete fields within the EHR. Preexisting conditions were identified using ICD-9 codes. The 14,860-patient data set was divided into a 500-patient NLP tool refinement set and a 14,360-patient lifestyle modification retrieval set (Figure 2).
The subset of patient notes for NLP tool refinement was randomly selected using the sample function of Python’s random module (random.sample).40,41 This subset was composed of notes from throughout the outpatient clinical encounter including nursing notes, provider notes, patient instructions, nutrition consultation, and exercise consultation. The University of Wisconsin IRB approved this study. Baseline characteristics of the study population are listed in Table 2 with the 500-patient subset characteristics listed beside the characteristics of the entire 14,860-patient data set.
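The division of the data set into refinement and retrieval sets can be sketched as follows. This is a minimal illustration only, assuming patient identifiers are held in a list; the function name, the integer IDs, and the fixed seed are hypothetical and are not taken from the study's code.

```python
import random


def split_patients(patient_ids, refinement_size=500, seed=None):
    """Split patient IDs into a refinement subset and a retrieval set."""
    # The seed is an assumption added here so the split can be reproduced.
    rng = random.Random(seed)
    refinement = set(rng.sample(patient_ids, refinement_size))
    retrieval = [pid for pid in patient_ids if pid not in refinement]
    return sorted(refinement), retrieval


# Hypothetical integer IDs standing in for the 14,860 patients.
all_ids = list(range(14860))
refinement_ids, retrieval_ids = split_patients(all_ids, 500, seed=0)
print(len(refinement_ids), len(retrieval_ids))  # 500 14360
```

Sampling without replacement at the patient level keeps all notes from one patient in the same set, which avoids leakage between tool refinement and retrieval.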
Table 2.
Baseline characteristics | Entire data set: 14,860 patients | NLP refinement subset: 500 patients |
---|---|---|
Age in years (range, mean, median, SD) | 18–80, mean 49.21, median 49.00, SD 14.70 | 18–91, mean 49.47, median 50.00, SD 15 |
Gender | Number of patients (% of data set) | Number of patients (% of data set) |
Female | 7363 (49.55%) | 261 (52.20%) |
Male | 7497 (50.45%) | 239 (47.80%) |
Race/ethnicity | Number of patients (% of data set) | Number of patients (% of data set) |
African American/Black | 703 (4.73%) | 24 (4.80%) |
Asian | 216 (1.45%) | 7 (1.40%) |
American Indian/Alaska Native | 50 (0.34%) | 0 (0%) |
White | 13,140 (88.43%) | 438 (87.60%) |
Hispanic/Latino | 281 (1.89%) | 13 (2.6%) |
Other/Native Hawaiian/Pacific Islander/Multiple | 71 (0.48%) | 2 (0.40%) |
Unknown | 399 (2.69%) | 16 (3.20%) |
NLP: natural language processing; SD: standard deviation.
Step 2: creating lifestyle modification (LM) lexicon
An empirical method was used to create and iteratively enrich and refine the lifestyle modification terminology using four approaches: (1) literature review to identify related terms, including terms and concepts discussed in JNC 7 and 8; (2) existing ontology review (terms and their interconnections) to identify relevant terms including acronyms and abbreviations such as those in the SNOMED CT ontology, the National Center for Biomedical Ontology and the Consumer Health Vocabulary42; (3) domain expert collaboration to identify words, acronyms, abbreviations, and phrases relevant to hypertension and lifestyle modification; and (4) electronic health record note training. Identified terms and phrases from this fourfold process generated the initial list of relevant terms and phrases for extraction (Table 3).
Table 3.
Provider verb phrase terms requiring any lifestyle modification object: advice | Lifestyle modification object terms (may be part of provider or patient verb phrase) |
---|---|
advise/d/ing | alcohol |
avoid | cholesterol |
begin | diet |
change | dietician |
congratulated | exercise |
continue | fat/fats |
counsel/ed/ing | gym |
covered | health club |
recommend/recommended | physical activity |
refer/referral | salt |
Patient verb phrase terms requiring any lifestyle modification object: assessment | Declaration of lifestyle modification—requiring no LM object: assessment |
changed/changing | alcoholic |
continued/continues/continuing | tobacco use |
decreased/decreasing | change in weight |
eats/eating | cigarettes: |
improved/improving | cigars: |
increased/increasing | DASH/dash diet |
interested in | decreased weight |
motivated to | down ##/pounds/lb/lbs |
plans/planning to | healthy diet |
reports/reported | physically active |
LM: lifestyle modification.
Our goal was to iteratively extend the NLP tool to have additional capabilities to handle discourse. Terms and phrases were mapped to codes in the Unified Medical Language System (UMLS) and to semantic types based on the UMLS Semantic Network. Details are available as Supplemental Appendices A and B.
Step 3: pre-filtering to identify notes of interest
The total data set was composed of 14,860 patients’ EHR notes with an average of 56 notes per patient. Each note was composed of multiple sentences, some with multiple concepts of interest (lifestyle modification, family history). Many notes were not relevant to lifestyle modification, so pre-filtering for note types and departments likely to have documented lifestyle modification was performed (Figure 3 shows LM-relevant note types and departments agreed upon by three study physicians).
We included pediatric departments because the National Institutes of Health (NIH) defines children as those persons 18–21 years old and some patients in this age range may continue care in pediatric clinics. Gynecology was also included as a primary care clinic because many women see gynecologists as their primary care provider.
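The pre-filtering step can be illustrated with a short sketch. It assumes each note carries a note type and a department as metadata; the `Note` class and the two allow-lists below are hypothetical stand-ins, since the lists actually used are those agreed upon by the study physicians (Figure 3).

```python
from dataclasses import dataclass


@dataclass
class Note:
    note_id: int
    note_type: str
    department: str
    text: str


# Hypothetical allow-lists; the study's actual lists are given in Figure 3.
LM_NOTE_TYPES = {
    "provider note", "nursing note", "patient instructions",
    "nutrition consultation", "exercise consultation",
}
LM_DEPARTMENTS = {
    "family medicine", "internal medicine", "pediatrics", "gynecology",
}


def prefilter(notes):
    """Keep notes whose type and department are both on the allow-lists."""
    return [
        n for n in notes
        if n.note_type.lower() in LM_NOTE_TYPES
        and n.department.lower() in LM_DEPARTMENTS
    ]


notes = [
    Note(1, "Provider Note", "Family Medicine", "Counseled on diet."),
    Note(2, "Provider Note", "Radiology", "Chest film unremarkable."),
    Note(3, "Patient Instructions", "Gynecology", "Walk 30 minutes daily."),
]
kept_ids = [n.note_id for n in prefilter(notes)]
print(kept_ids)  # [1, 3]
```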
Step 4: parsing and diagnosis identification
After pre-filtering, 14,331 notes were in the NLP tool refinement set and 403,018 notes were in the LM retrieval set. These notes were processed using the Clinical Text Analysis and Knowledge Extraction System (cTAKES). This is an open-source NLP pipeline that processes clinical notes and identifies types of clinical named entities—drugs, diseases/disorders, signs/symptoms, anatomical sites and procedures.43 Each named entity has attributes for the text span, the ontology mapping code, context (e.g. family history of, current, unrelated to patient), and negated/not negated. Figure 4 shows components of a typical NLP pipeline (with study enhancements to improve relevant term and phrase identification depicted by the down-arrow boxes). The typical process starts with detection of each section of the text and then each sentence, followed by the identification of sentence tokens (e.g. words, dates, numbers) in the sentence. A part-of-speech is assigned to tokens (e.g. noun, preposition, noun-phrase). The Named Entity Recognition component implements a dictionary lookup, so that each entity is mapped to a concept from the tool dictionary. Post-coordination then combines multiple concepts into a single one (e.g. DASH + diet).
Notes were further processed using Python code to retrieve notes with diagnoses likely to have LM (e.g. obesity, diabetes—details of diagnosis filter terms available by request as Supplemental Appendix C).
Steps 5-8: retrieving LM, filtering out notes with negated or LM terms not related to patient, and mapping
This study initially tried to use cTAKES as the sole method to identify relevant LM, but this approach resulted in poor precision (67.15%) and poor recall (62.48%) because:
Searching for semantic types of interest (TUIs) was too broad and extracted many irrelevant items (e.g. “dietary history” = TUI T033 = “Finding.” Many irrelevant terms are “Findings” such as “sweaty palms” which was pulled as a relevant retrieval when searching based on TUI alone).
Using concept unique identifiers (CUIs) was too restrictive due to identifying only a specific term and inability to identify phrases with LM (e.g. “exercise”=CUI C0015259. If searching on this CUI, relevant terms such as “swimming” with CUI C0039003 were missed).
Using cTAKES’ retrieval and dictionary lookup effectively identified nouns, but missed verbs and verb phrases of LM documentation (e.g. retrieved “diet,” missed “counseled on diet changes”).
Therefore, this study used a combined approach to processing the notes. cTAKES was used for parsing the notes into sections and sentences, but instead of using the parsed parts of speech identified by cTAKES for ontology mapping, Python code with regular expressions and rules was used to process each sentence to identify lifestyle modification terms and phrases from the created lexicon. This enhanced the natural language processing workflow to facilitate extraction of data related to lifestyle modification, including verbs and the social and behavioral factors that historically have been difficult to extract. These data were mapped to corresponding terms within the SNOMED CT ontology using the Unified Medical Language System (UMLS).
For example, a provider recommendation of “cut down on alcohol intake” was retrieved using the study’s enhanced NLP methods resulting in lifestyle modification advice retrieval: “alcohol consumption counseling” in SNOMED CT, with UMLS CUI = C1531491 and TUI = T058 (Health Care Activity). Without enhancements to NLP, no match for this phrase was retrieved or mapped by cTAKES to the SNOMED CT ontology.
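A sentence-level search for a provider verb phrase followed by an LM object might look like the sketch below. It uses only the subset of the lexicon shown in Table 3; the 40-character window between verb and object, the pattern layout, and the function name are assumptions made for illustration and are not the study's actual rules.

```python
import re

# Subset of the Table 3 lexicon; the full lexicon is in Supplemental Appendix A.
PROVIDER_VERBS = (
    r"advis(?:e|ed|ing)|avoid|begin|change|congratulated|continue|"
    r"counsel(?:ed|ing)?|covered|recommend(?:ed)?|refer(?:ral)?"
)
LM_OBJECTS = (
    r"alcohol|cholesterol|diet|dietician|exercise|fats?|gym|"
    r"health club|physical activity|salt"
)

# Verb, then at most 40 characters within the same clause (a hypothetical
# window size), then an LM object term.
ADVICE_PATTERN = re.compile(
    rf"\b(?P<verb>{PROVIDER_VERBS})\b[^.;]{{0,40}}?\b(?P<obj>{LM_OBJECTS})\b",
    re.IGNORECASE,
)


def find_lm_advice(sentence):
    """Return (verb, object) pairs for provider verb phrases with an LM object."""
    return [
        (m.group("verb").lower(), m.group("obj").lower())
        for m in ADVICE_PATTERN.finditer(sentence)
    ]


print(find_lm_advice("Counseled on diet changes."))  # [('counseled', 'diet')]
print(find_lm_advice("Patient has sweaty palms."))   # []
```

Requiring both a verb and an object is what separates “counseled on diet changes” from a bare mention of “diet”, which is the gap described for dictionary lookup alone.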
In order to capture context-sensitive phrases, we used semantic and syntactic rules, as well as regular expressions. Due to many notes containing unusual syntactic expressions impeding retrieval of context-sensitive lifestyle modification, existing note punctuation was modified. Specifically, when a sentence about LM contained a colon (:), the NLP processor read this as a hard stop and classified concepts after it as belonging to the next sentence. The colon mark was replaced with a comma to allow capture of concepts after the colon mark as contextually related to the first phrase. This improved retrieval of the entire lifestyle modification context. For example, prior to modification, cTAKES identified the negated history of “Patient smoking: no.” as a positive history of patient tobacco use: “Patient smoking. No.” After modification, the intended meaning was retrieved: “patient history negative for smoking.” Tokenization per cTAKES was not modified, but cTAKES only identifies nouns and noun phrases and this study also needed to identify verbs and verb phrases. Therefore, the algorithm searched the entire sentence using regular expressions for LM object terms and phrases (Figure 5 and details of LM lexicon available as Supplemental Appendix A).
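The colon replacement can be sketched in a few lines. The restriction to colons followed by whitespace is an assumption added here so that clock times such as 10:30 are left alone; the study does not say how such cases were handled.

```python
import re


def replace_colons(text):
    """Replace a colon that is followed by whitespace with a comma."""
    return re.sub(r":(?=\s)", ",", text)


print(replace_colons("Patient smoking: no."))  # Patient smoking, no.
print(replace_colons("Seen at 10:30 today."))  # Seen at 10:30 today.
```

With the comma in place, a sentence splitter that treats colons as hard stops keeps “no” in the same sentence as “Patient smoking”, so the negation can be applied to the right concept.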
Negation identification was performed using negated regular expressions already employed to extract LM terms and phrases. This approach allowed for more efficient and thorough negation identification of verbs and verb phrases that were not identified by cTAKES’ negation module. Attribution of a concept, as pertaining to the patient or other person for LM terms/phrases and diagnoses, was expanded to include, in addition to mother and father, grandparents, grandmother, grandfather, roommate, spouse, husband, wife, son, daughter, partner, parent, friend, significant other, coworker, brother, sister, aunt, and uncle. Family history retrieval using name of relation (e.g. mom, sister) within rules and regular expressions, along with mapping schemes to SNOMED CT, was created specifically for this project. Mapping of terms to the SNOMED CT hierarchy was restricted to parent and child concepts, and their CUIs, to reduce granularity and facilitate effective use of this coded data in future machine learning projects that required less dimensionality. For example, “brother with non-insulin-dependent diabetes mellitus” was mapped to “FH: diabetes mellitus,” a child of the parent concept “FH: metabolic disorder,” and not mapped to the grandchild concept “FH: diabetes mellitus type 2.” This allowed all diabetics to be binned under one code. Details of family history mapping are available as Supplemental Appendix D.
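Attribution and coarse family history mapping can be illustrated as follows. The relation terms come from the list above; the single mapping rule, the function names, and the decision to return the first relation found are hypothetical simplifications, not the rules in Supplemental Appendix D.

```python
import re

RELATIONS = [
    "mother", "father", "grandparent", "grandmother", "grandfather",
    "roommate", "spouse", "husband", "wife", "son", "daughter", "partner",
    "parent", "friend", "significant other", "coworker", "brother",
    "sister", "aunt", "uncle", "mom",
]
# Word boundaries prevent "son" from matching inside "person";
# the optional "s" covers simple plurals such as "grandparents".
RELATION_RE = re.compile(
    r"\b(" + "|".join(re.escape(r) for r in RELATIONS) + r")s?\b",
    re.IGNORECASE,
)

# Hypothetical rule table: sentence pattern -> coarse family history label.
FH_RULES = [
    (re.compile(r"diabet", re.IGNORECASE), "FH: diabetes mellitus"),
]


def attribution(sentence):
    """Return the first relation term found, or 'patient' if none is present."""
    m = RELATION_RE.search(sentence)
    return m.group(1).lower() if m else "patient"


def map_family_history(sentence):
    """Map a sentence about a relative to a coarse FH label, else None."""
    if attribution(sentence) == "patient":
        return None
    for pattern, label in FH_RULES:
        if pattern.search(sentence):
            return label
    return None


print(map_family_history("Brother with non-insulin-dependent diabetes mellitus."))
# FH: diabetes mellitus
print(map_family_history("Patient has type 2 diabetes."))  # None
```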
This study also classified lifestyle modification words and phrases as “assessment” and “advice” as was done in a prior analysis of videotaped encounters involving lifestyle modification.44 Classification as assessment or advice was based on logic rules that included key words. This allowed for evaluation of LM documentation as either provider documentation of LM activities that the patient reported (e.g. patient reported exercising = assessment) or provider documentation of advice that she or he offered to the patient (e.g. recommended exercise = advice). Matching of LM terms to SNOMED CT was further refined with specific LM terms and phrases grouped under overarching concepts such as exercise education (advice) or exercise history (assessment). Details of advice/assessment phrases available as Supplemental Appendix B. This process continued until performance was close to reproducing the manually coded set of terms and phrases, and additional modifications to the system minimally altered NLP tool performance. The training/testing tasks and iterative process details are shown in Table 4.
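The keyword-based classification can be sketched with two verb lists drawn from Table 3. Checking the advice list before the assessment list, and returning None when neither matches, are assumptions made for this illustration; the study's full logic rules are in Supplemental Appendix B.

```python
import re

# Provider verb terms (advice) and patient verb terms (assessment), Table 3.
ADVICE_RE = re.compile(
    r"\b(?:advis(?:e|ed|ing)|avoid|begin|change|congratulated|continue|"
    r"counsel(?:ed|ing)?|covered|recommend(?:ed)?|refer(?:ral)?)\b",
    re.IGNORECASE,
)
ASSESSMENT_RE = re.compile(
    r"\b(?:changed|changing|continued|continues|continuing|decreased|"
    r"decreasing|eats|eating|improved|improving|increased|increasing|"
    r"interested in|motivated to|plans to|planning to|reports|reported)\b",
    re.IGNORECASE,
)


def classify(phrase):
    """Label an LM phrase as 'advice', 'assessment', or None."""
    if ADVICE_RE.search(phrase):
        return "advice"
    if ASSESSMENT_RE.search(phrase):
        return "assessment"
    return None


print(classify("recommended exercise"))         # advice
print(classify("patient reported exercising"))  # assessment
```

The word boundaries make the rules tense-sensitive: “continue” counts as advice while “continued” counts as assessment.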
Table 4.
Annotation task | Total number of training notes manually annotated and used in training (batch counts) | Number of iterations (some batches trained on multiple times) | Total number of testing notes manually annotated and used in testing | Annotators’ initials |
---|---|---|---|---|
Diagnosis filter | 1000 (500 × 2) | 2 | 500 | L.G., K.S. |
Family history retrieval | 1200 (300 × 4) | 4 | 1500 | E.A.M., K.S. |
Family history CUI mapping | 1200 (300 × 4) | 4 | 1500 | E.A.M., K.S. |
Lifestyle modification retrieval | 1200 (300 × 4) | 12 | 1500 | E.A.M., K.S. |
Lifestyle modification CUI mapping | 1200 (300 × 4) | 4 | 1500 | E.A.M., K.S. |
CUI: concept unique identifier.
Overfitting was discussed by researchers (K.S., E.A.M., Y.S.) after each iteration to determine whether the retrieved terms/phrases from that iteration reflected relevant and generalizable results or included concepts or terms reflective only of this patient sample. All three members reached consensus in determining whether changes made to the tool were to be kept or represented overfitting and should not be retained in the tool development process. This iterative process resulted in retrieval of textual LM documentation and transformation into coded data ready for future statistical and machine learning analyses (Figure 6).
Validation of the extended tool
The augmented NLP process was applied to the testing data set, with the manual annotations serving as the gold standard. Data set divisions and uses are detailed in Figure 7, and Table 4 describes the researchers and numbers of notes employed in each test set evaluation.
Two sets of domain experts were employed to establish validity in each test task. One team worked on validation of the diagnosis filter and consisted of one clinical data analyst trained in linguistics and healthcare terminologies and one physician (L.G., K.S.). The other team consisted of two physicians who worked on validation of the LM retrieval and CUI mapping, and FH retrieval and CUI mapping, and LM classification as assessment or advice (E.A.M., K.S.). Each validation task used the randomly selected test set of patients’ records that each team’s members manually abstracted independently. From these manual abstractions, inter-annotator agreements were calculated using percent agreement and kappa coefficient calculations. Testing results of the enhanced NLP process are reported using precision, recall, and F measurements in Table 5. For this study, precision is defined as the number of terms correctly extracted divided by the sum of the number of terms correctly extracted and the number of terms incorrectly extracted. Formally this is: Precision = TP/(TP + FP) where TP denotes true positive (a term that was correctly extracted) and FP denotes false positive (a term that was extracted as relevant but should not have been). Recall is the number of terms correctly extracted divided by the sum of the number of terms correctly extracted and the number of terms that should have been extracted but were missed. Formally this is: Recall = TP/(TP + FN) where FN denotes false negative (a term that was not extracted but should have been). The F-measure is the harmonic mean of precision and recall with F = 2[(precision × recall)/(precision + recall)]. In addition to lifestyle modification retrieval and CUI mapping, our process was able to identify attribution and mapping to family history CUIs, and LM documentation type (i.e. assessment or advice).
Success of classification of LM documentation as advice or assessment was evaluated as the percentage of LM terms and phrases correctly classified. The unit of measure was each LM concept (e.g. if a sentence contained two LM concepts, with “advised weight loss” correctly classified and “patient reports gaining weight” incorrectly classified, then one concept was counted as a correct classification and one as an incorrect classification).
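The three retrieval measures defined above translate directly into code. The counts in this sketch are hypothetical and chosen only to exercise the functions; the final line checks that the reported combined LM recall (99.27 percent) and precision (94.44 percent) give the reported F-measure.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * (p * r) / (p + r)


# Hypothetical counts, not taken from the study.
tp, fp, fn = 90, 10, 5
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 4), round(r, 4), round(f_measure(p, r), 4))  # 0.9 0.9474 0.9231

# Reported combined LM retrieval values, expressed as fractions.
print(round(100 * f_measure(0.9444, 0.9927), 2))  # 96.79
```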
Table 5.
Measurement: testing of NLP tool refinement
NLP task | Recall | Precision | F-measure | Initial annotator % agreement | Initial IAA kappa (SE, 95% CI) | % Correct classification as advice or assessment | Correct CUI mapping |
---|---|---|---|---|---|---|---|
Diagnosis filter retrieval | 98.70% | 100% | 99.35% | 97.80% | 0.920 (0.024, 0.874–0.967) | N/A | N/A |
Family history retrieval | 95.24% | 81.63% | 87.90% | 96.15% | 0.923 (0.044, 0.838–1.000) | N/A | N/A |
Family history CUI mapping | N/A | N/A | N/A | 97.96% | 0.935 (0.064, 0.808–1.000) | N/A | 81.63% |
Combined LM retrieval | 99.27% | 94.44% | 96.79% | 96.84% | 0.847 (0.066, 0.717–0.977) | N/A | N/A |
Combined LM, CUI mapping | N/A | N/A | N/A | 98.59% | 0.868 (0.092, 0.687–1.00) | 88.15% | 93.66% |
NLP: natural language processing; IAA: inter-annotator agreement; LM: lifestyle modification; SE: standard error; CI: confidence interval; CUI: concept unique identifier.
N/A means measurement not applicable to task.
Results
Results from testing the NLP tool refinement process for combined lifestyle modification retrieval were excellent, with 99.27 percent recall, 94.44 percent precision, and an F-measure of 96.79 percent. CUI mapping for lifestyle modification was also very good (93.66 percent correct), and 88.15 percent of phrases were correctly classified as advice or assessment. These results, along with the excellent diagnosis filter testing results and family history retrieval and mapping results, are shown in detail in Table 5. Inter-annotator agreement was also excellent, as shown by initial agreement percentages (range 96.15%–98.59%) and kappa scores (range 0.847–0.935) (Table 5). In the rare occurrence of reviewers extracting a different number of terms, the consensus-agreed-upon terms were used as the gold standard to calculate inter-annotator agreement. After consensus discussions, 100 percent agreement was reached between annotators for each task.
Each specific type of lifestyle modification retrieval recall and precision is detailed in Table 6. Overall retrieval testing results for each type of lifestyle modification were very good.
Table 6.
Specific LM type recall and precision
NLP retrieval of LM advice | Recall | Precision | NLP retrieval of LM assessment | Recall | Precision |
---|---|---|---|---|---|
Combined LM advice | 97.82% | 97.82% | Combined LM assessment | 98.80% | 91.21% |
Dietary management education, guidance, and counseling | 93.75% | 93.75% | Dietary history | 100% | 100% |
Exercise education | 100% | 100% | Exercise history | 100% | 100% |
Patient advised about weight management | 100% | 100% | Weight finding | 100% | 96.55% |
Smoking cessation assistance | 100% | 100% | Smoking assessment | 100% | 92.31% |
Alcohol counseling | 100% | 100% | Alcohol use assessment | 87.50% | 87.50% |
Drug addiction counseling | 100% | 100% | Drug use assessment | 100% | 100% |
NLP: natural language processing; LM: lifestyle modification.
Of the 14,360 patients in the LM retrieval set, 11,252 patients (78.36%) had notes documenting lifestyle modification. Each patient had an average of 56 notes in the initial data set. From the total 809,097 notes in the LM retrieval set (NLP refinement set removed), after filters and processes described in Figure 1 were applied, 47,838 notes had at least one documentation of lifestyle modification. Many notes contained more than one documented LM activity. Specific lifestyle modification activities and their CUIs are detailed in Table 7 with patient and note counts for each type of LM.
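The proportions implied by these counts can be checked with simple arithmetic. The variable names are illustrative; all four counts are the ones stated in the paragraph above.

```python
patients_total = 14_360
patients_with_lm = 11_252
notes_total = 809_097
notes_with_lm = 47_838

patient_share = 100 * patients_with_lm / patients_total
note_share = 100 * notes_with_lm / notes_total
notes_per_patient = notes_total / patients_total

print(f"{patient_share:.2f}% of patients had LM documented")  # 78.36%
print(f"{note_share:.2f}% of notes contained LM")             # 5.91%
print(f"{notes_per_patient:.1f} notes per patient")           # 56.3
```

The note-level share is far lower than the patient-level share because a patient counts as documented if any one of his or her notes mentions lifestyle modification.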
Table 7.
SNOMED CT concept | Concept unique identifier code | Patient counts | Note counts |
---|---|---|---|
Lifestyle modification advice—SNOMED CT concept | |||
Dietary management education, guidance, and counseling | C1828150 | 7589 | 19,963 |
Exercise education | C0582396 | 7349 | 17,829 |
Patient advised about weight management | C3697318 | 5268 | 11,373 |
Smoking cessation assistance | C1692317 | 2139 | 4463 |
Alcohol consumption counseling | C1531491 | 1701 | 3053 |
Drug addiction counseling | C0199403 | 81 | 106 |
Lifestyle modification assessment—SNOMED CT concept | |||
Dietary history | C0425401 | 3949 | 7552 |
Exercise history | C1287528 | 5912 | 12,526 |
Weight finding | C1265588 | 5358 | 15,749 |
Smoking assessment | C3853073 | 4719 | 10,647 |
Assessment of alcohol use | C4076406 | 4928 | 10,712 |
Assessment of drug use | C4076408 | 915 | 1544 |
For advice, diet and exercise advice were most frequently documented. For assessment, exercise and weight assessments were most frequently documented. Counts for advice on exercise were greater than counts for assessment, suggesting that providers are offering advice on exercise regardless of patients’ current exercise status. Counts for drug abuse counseling and assessment were low. This was a hypertension population and drug use assessment and counseling is not a recommended lifestyle modification intervention for hypertension. Drug abuse assessment and counseling were included in the design of this NLP tool enhancement for comprehensiveness and facilitation of future training and testing in populations with higher numbers of drug abuse assessment and counseling.
Outcome
Our modified process using an augmented open-source natural language processing tool was successful in identifying lifestyle modification documentation in electronic health records of hypertension patients at an academic medical center. Given the importance of lifestyle modification interventions for multiple medical issues, this is an important and innovative step in care transparency for future comparative effectiveness studies, outcome analyses, and efforts in care improvement. To date, most studies evaluating lifestyle modification as a medical treatment rely on surveys and self-reports which are inherently vulnerable to reporting and recall bias.45,46 With increasing emphasis placed on the need for LM to treat multiple chronic diseases, our study offers more objective and comprehensive measurement of LM care delivery via EHR analysis than previous studies. It is encouraging that 78 percent of patients in this study had LM addressed at least once during the study period, but there is room for improvement in how often LM is addressed with each patient. The percentage of notes containing LM was low at 5.9 percent, suggesting that although providers are addressing LM with many patients, providers are not repeatedly reviewing LM with patients despite the need for behavior modification for treatment of many diseases. As more efforts are made to improve lifestyle modification interventions, this study has shown that LM documentation can be automatically extracted from EHRs, thus offering increased identification of actual use of lifestyle modification in the care of hypertension and multiple chronic disorders in the future (Figure 8).47
Challenges and limitations
Limitations of this study included the secondary use of EHR data that can present data quality issues (e.g. missing data, reporting bias, recording bias). Caveats and challenges in secondary use of EHR data have been well documented.48-51 One model of data quality defined completeness as, “the extent to which data are of sufficient breadth, depth, and scope for the task at hand.”52 This study’s ability to accurately retrieve and classify LM from this EHR data set confirms that this data set is sufficient for this task. A second challenge was this study’s use of an open-source NLP tool with requirements for customization. Initial lifestyle modification data retrieval attempts using only cTAKES were unsuccessful with poor recall and precision but, after iterative development, the final testing results were successful in LM retrieval and classification as assessment or advice. This study overcame a key challenge in NLP evaluation of care delivery: the inability to retrieve verbs and verb phrases. cTAKES alone could not accurately identify verbs, verb phrases, or verb-tense-specific classification of a term as advice or assessment. For example, “patient started walking” is assessment and documentation of a patient reported LM. This is different from “recommended patient start walking 30 minutes per day” which is provider LM advice. cTAKES alone could not retrieve or distinguish these two different concepts, but using a combination of regular expressions, rules, and key words, these critical concepts centered on verb phrases were accurately retrieved and classified. However, inherent limitations in generalizability and scalability are present with this current approach.
Future development
Future research will attempt to improve and augment cTAKES and its dictionary to extract and map verb phrases directly. This approach will minimize the use of rules and regular expressions and make this work more generalizable and scalable. We also plan to extend lifestyle modification mapping to ICD-10 and other dictionaries of interest. This work will be made available to researchers in related medical areas that use LM as a treatment and could be used to support evaluation of current and future initiatives such as “Exercise Is Medicine” and “Healthy People 2030.”53-55 Another area for future development is the assessment of the quality of lifestyle modification counseling, using a more granular extraction of lifestyle modification concepts and counseling details to better understand best practices.56
Conclusion
This study successfully extracted lifestyle modification documentation from EHR notes. Its methods and future planned expansion of these methods could be used in studies involving multiple chronic medical conditions. This is an important step in better understanding and quantifying the use of lifestyle modification as a prevention and treatment modality for many disorders. This information can be used in future outcome and comparative effectiveness research and inform metric development for lifestyle modification documentation and counseling. The automatic identification and mapping of terms, especially verbs, related to care delivery is a major innovation that can allow further evaluation and improvement in care delivery models and treatment approaches to multiple chronic illnesses.
Acknowledgements
The authors are grateful to the Health Innovation Program at the University of Wisconsin–Madison for assistance with data acquisition. K.S. and Y.S. contributed equally to this work and should be listed as first authors.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article. Funding for this project for investigators Shoenbill, Song, and Mendonca was provided by the Clinical and Translational Science Award (CTSA) program, through the NIH National Center for Advancing Translational Sciences (NCATS), grant UL1TR002373 (A. Brazier, PI); and by the UW-Madison Office of the Vice Chancellor for Research and Graduate Education Research-2014 Fall Research Competition Award, “Predictors of Lifestyle Modification in Hypertension: A Computational Analysis” (E. Mendonca, PI). Additional funding for investigator Shoenbill was provided by University of North Carolina Chapel Hill start-up funding (K. Shoenbill, PI); and by NLM Grant 5T15LM007359 to the Computation and Informatics in Biology and Medicine Training Program (M. Craven, PI).
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
Contributor Information
Kimberly Shoenbill, University of North Carolina at Chapel Hill, USA.
Yiqiang Song, University of Wisconsin-Madison, USA.
Lisa Gress, University of Wisconsin-Madison, USA.
Heather Johnson, University of Wisconsin-Madison, USA.
Maureen Smith, University of Wisconsin-Madison, USA.
Eneida A Mendonca, University of Wisconsin-Madison, USA.
References
- 1. Benjamin EJ, Virani SS, Callaway CW, et al. Heart disease and stroke statistics—2018 update: a report from the American Heart Association. Circulation 2018; 137: e67–e492.
- 2. Eckel RH, Jakicic JM, Ard JD, et al. 2013 AHA/ACC guideline on lifestyle management to reduce cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Circulation 2014; 129: S76–S99.
- 3. Jensen MD, Ryan DH, Apovian CM, et al. 2013 AHA/ACC/TOS guideline for the management of overweight and obesity in adults: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines and the Obesity Society. Circulation 2014; 129: S102–S138.
- 4. Cerezo C, Sequra J, Praga M, et al. Guidelines updates in the treatment of obesity or metabolic syndrome and hypertension. Curr Hypertens Rep 2013; 15: 196–203.
- 5. Fleg JL, Forman DE, Berra K, et al. Secondary prevention of atherosclerotic cardiovascular disease in older adults: a scientific statement from the American Heart Association. Circulation 2013; 128: 2422–2446.
- 6. American Diabetes Association. Standards of medical care in diabetes—2014. Diabetes Care 2014; 37: S14–S80.
- 7. Anderson JL, Halperin JL, Albert NM, et al. Management of patients with peripheral artery disease (compilation of 2005 and 2011 ACCF/AHA guideline recommendations): a report of the American College of Cardiology Foundation/American Heart Association Task Force on Practice Guidelines. Circulation 2013; 127: 1425–1443.
- 8. Whelton PK, Carey RM, Aronow WS, et al. 2017 ACC/AHA/AAPA/ABC/ACPM/AGS/APhA/ASH/ASPC/NMA/PCNA Guideline for the Prevention, Detection, Evaluation, and Management of High Blood Pressure in Adults: Executive Summary: A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. J Am Coll Cardiol 2018; 71(19): 2199–2269. doi: 10.1016/j.jacc.2017.11.005.
- 9. Islami F, Goding Sauer A, Miller KD, et al. Proportion and number of cancer cases and deaths attributable to potentially modifiable risk factors in the United States. CA Cancer J Clin 2018; 68: 31–54.
- 10. Institute for Health Metrics. Institute for Health Metrics country profile: United States, www.healthdata.org
- 11. Mokdad AH, Ballestros K, Echko M, et al. The state of US health, 1990–2016. JAMA 2018; 319: 1444–1472.
- 12. Schoenthaler A, Luerassi L, Silver S, et al. Comparative effectiveness of a practice-based comprehensive lifestyle intervention vs. single session counseling in hypertensive blacks. Am J Hypertens 2016; 29: 280–287.
- 13. Pu J, Chewning BA, Johnson HM, et al. Health behavior change after blood pressure feedback. PLoS ONE 2015; 10: e0141217.
- 14. Lanier JB, Bury DC and Richardson SW. Diet and physical activity for cardiovascular disease prevention. Am Fam Physician 2016; 93: 919–924.
- 15. Weintraub WS, Daniels SR, Burke LE, et al. Value of primordial and primary prevention for cardiovascular disease: a policy statement from the American Heart Association. Circulation 2011; 124: 967–990.
- 16. Lin JSO, Connor E, Evans CV, et al. Behavioral counseling to promote a healthy lifestyle in persons with cardiovascular risk factors: a systematic review for the U.S. Preventive Services Task Force. Ann Intern Med 2014; 161: 568–578.
- 17. Patnode CD, Evans CV, Senger CA, et al. Behavioral counseling to promote a healthful diet and physical activity for cardiovascular disease prevention in adults without known cardiovascular disease risk factors: updated evidence report and systematic review for the US Preventive Services Task Force. JAMA 2017; 318: 175–193.
- 18. Lin J, Zhuo X, Bardenheier B, et al. Cost-effectiveness of the 2014 U.S. Preventive Services Task Force (USPSTF) recommendations for intensive behavioral counseling interventions for adults with cardiovascular risk factors. Diabetes Care 2017; 40: 640–646.
- 19. Mozaffarian D. Dietary and policy priorities for cardiovascular disease, diabetes, and obesity: a comprehensive review. Circulation 2016; 133: 187–225.
- 20. Tajeu GS, Booth JN III, Colantonio LD, et al. Incident cardiovascular disease among adults with blood pressure <140/90 mm Hg. Circulation 2017; 136: 798–812.
- 21. Sacks FM, Lichtenstein AH, Wu JHY, et al. Dietary fats and cardiovascular disease: a presidential advisory from the American Heart Association. Circulation 2017; 136: e1–e23.
- 22. Turchin A, Kolatkar NS, Grant RW, et al. Using regular expressions to abstract blood pressure and treatment intensification information from the text of physician notes. J Am Med Inform Assoc 2006; 13: 691–695.
- 23. Mendonca EA, Haas J, Shagina L, et al. Extracting information on pneumonia in infants using natural language processing of radiology reports. J Biomed Inform 2005; 38: 314–321.
- 24. Mendonca EA, Cimino JJ and Johnson S. Using narrative reports to support a digital library. Proc AMIA Symp 2001: 458–462.
- 25. Murff HJ, FitzHenry F, Matheny ME, et al. Automated identification of postoperative complications within an electronic medical record using natural language processing. JAMA 2011; 306: 848–855.
- 26. Pakhomov S, Weston SA, Jacobsen SJ, et al. Electronic medical records for clinical research: application to the identification of heart failure. Am J Manag Care 2007; 13: 281–288.
- 27. Salmasian H, Freedberg DE and Friedman C. Deriving comorbidities from medical records using natural language processing. J Am Med Inform Assoc 2013; 20: e239–e242.
- 28. Stevens V, Bailey S, Hazlehurst B, et al. PS1-1d: use of CER hub to evaluate outcomes of smoking cessation services, a behavioral treatment. Clin Med Res 2013; 11: 144.
- 29. Hazlehurst B, Frost R, Sitting D, et al. MediClass: a system for detecting and classifying encounter-based clinical events in any electronic medical record. J Am Med Inform Assoc 2005; 12: 517–529.
- 30. Wicentowski R and Sydes MR. Using implicit information to identify smoking status in smoke-blind medical discharge summaries. J Am Med Inform Assoc 2008; 15: 29–31.
- 31. Lindholm C, Adsit R, Bain P, et al. A demonstration project for using the electronic health record to identify and treat tobacco users. WMJ 2010; 109: 335–340.
- 32. Khalifa A and Meystre S. Adapting existing natural language processing resources for cardiovascular risk factors identification in clinical notes. J Biomed Inform 2015; 58: S128–S132.
- 33. Hazlehurst BL, Lawrence JM, Donahoo WT, et al. Automating assessment of lifestyle counseling in electronic health records. Am J Prev Med 2014; 46: 457–464.
- 34. Hosomura N, Goldberg SI, Shubina M, et al. Electronic documentation of lifestyle counseling and glycemic control in patients with diabetes. Diabetes Care 2015; 38: 1326–1332.
- 35. Johnson HM, Thorpe CT, Bartels CM, et al. Undiagnosed hypertension among young adults with regular primary care use. J Hypertens 2014; 32: 65–74.
- 36. James PA, Oparil S, Carter BL, et al. 2014 evidence-based guideline for the management of high blood pressure in adults: report from the panel members appointed to the Eighth Joint National Committee (JNC 8). JAMA 2014; 311: 507–520.
- 37. Chobanian AV. Seventh report of the Joint National Committee on prevention, detection, evaluation, and treatment of high blood pressure. Hypertension 2003; 42: 1206–1252.
- 38. Shrout T, Rudy DW and Piascik MT. Hypertension update, JNC8 and beyond. Curr Opin Pharmacol 2017; 33: 41–46.
- 39. Merai R, Siegel C, Rakotz M, et al. CDC grand rounds: a public health approach to detect and control hypertension. MMWR Morb Mortal Wkly Rep 2016; 65: 1261–1264.
- 40. Wichmann BA and Hill ID. Algorithm AS 183: an efficient and portable pseudo-random number generator. J Roy Stat Soc C-App 1982; 31: 188–190.
- 41. Matsumoto M and Nishimura T. Mersenne twister: a 623-dimensionally equidistributed uniform pseudorandom number generator. ACM T Model Comput S 1998; 8: 3–30.
- 42. Zeng QT and Tse T. Exploring and developing consumer health vocabularies. J Am Med Inform Assoc 2006; 13: 24–29.
- 43. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010; 17: 507–513.
- 44. Milder IE, Blokstra A, de Groot J, et al. Lifestyle counseling in hypertension-related visits—analysis of video-taped general practice visits. BMC Fam Pract 2008; 9: 58.
- 45. Sreedhara M, Silfee VJ, Rosal MC, et al. Does provider advice to increase physical activity differ by activity level among US adults with cardiovascular disease risk factors? Fam Pract 2018; 35: 420–425.
- 46. Jackson EA, Krishnan S, Meccone N, et al. Perceived quality of care and lifestyle counseling among patients with heart disease. Clin Cardiol 2010; 33: 765–769.
- 47. Bruun Larsen L, Soendergaard J, Halling A, et al. A novel approach to population-based risk stratification, comprising individualized lifestyle intervention in Danish general practice to prevent chronic diseases: results from a feasibility study. Health Informatics J 2017; 23: 249–259.
- 48. Weiskopf NG, Hripcsak G, Swaminathan S, et al. Defining and measuring completeness of electronic health records for secondary use. J Biomed Inform 2013; 46: 830–836.
- 49. Weiskopf NG and Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc 2013; 20: 144–151.
- 50. Botsis T, Hartvigsen G, Chen F, et al. Secondary use of EHR: data quality issues and informatics opportunities. Summit on Translat Bioinforma 2010; 2010: 1–5.
- 51. Hersh WR, Weiner MG, Embi PJ, et al. Caveats for the use of operational electronic health record data in comparative effectiveness research. Med Care 2013; 51: S30–S37.
- 52. Wang RY and Strong DM. Beyond accuracy: what data quality means to data consumers. J Manage Inform Syst 1996; 12: 5–33.
- 53. Cowan RE. Exercise is medicine initiative: physical activity as a vital sign and prescription in adult rehabilitation practice. Arch Phys Med Rehabil 2016; 97: S232–S237.
- 54. Sperling LS, Sandesara PB and Kim JH. Exercise is medicine: proof … and possibilities? JACC Cardiovasc Imaging 2017; 10: 1469–1471.
- 55. HealthyPeople.Gov. Office of Disease Prevention and Health Promotion, https://www.healthypeople.gov/2020/About-Healthy-People/Development-Healthy-People-2030/Framework (accessed 28 October 2017).
- 56. Stonerock GL and Blumenthal JA. Role of counseling to promote adherence in healthy lifestyle medicine: strategies to improve exercise adherence and enhance physical activity. Prog Cardiovasc Dis 2017; 59: 455–462.