Author manuscript; available in PMC: 2021 Mar 1.
Published in final edited form as: Health Informatics J. 2019 Feb 22;26(1):388–405. doi: 10.1177/1460458218824742

Natural language processing of lifestyle modification documentation

Kimberly Shoenbill 1, Yiqiang Song 2, Lisa Gress 3, Heather Johnson 4, Maureen Smith 5, Eneida A Mendonca 6
PMCID: PMC6722039  NIHMSID: NIHMS1037217  PMID: 30791802

Abstract

Lifestyle modification, including diet, exercise, and tobacco cessation, is the first-line treatment of many disorders including hypertension, obesity, and diabetes. Lifestyle modification data are not easily retrieved or used in research due to their textual nature. This study addresses this knowledge gap using natural language processing to automatically identify lifestyle modification documentation from electronic health records. Electronic health record notes from hypertension patients were analyzed using an open-source natural language processing tool to retrieve assessment and advice regarding lifestyle modification. These data were classified as lifestyle modification assessment or advice and mapped to a coded standard ontology. Combined lifestyle modification (advice and assessment) recall was 99.27 percent, precision 94.44 percent, and correct classification 88.15 percent. Through extraction and transformation of narrative lifestyle modification data to coded data, this critical information can be used in research, metric development, and quality improvement efforts regarding care delivery for multiple medical conditions that benefit from lifestyle modification.

Keywords: electronic health records, health behavior, hypertension, lifestyle, natural language processing, text mining

Background and significance

National guidelines have recommended the use of lifestyle modification in the treatment and prevention of disorders prevalent in the US today, including hypertension, obesity, coronary artery disease, diabetes, peripheral vascular disease, and cancer.1-9 The top seven US health risks for combined disability and death identified in the Global Burden of Disease Study 2016 were tobacco, dietary risks, high body mass index, alcohol and drug abuse, high blood pressure, high fasting plasma glucose, and high cholesterol.10,11 These risks can be minimized through the implementation of lifestyle modifications (i.e. tobacco cessation, dietary changes, alcohol moderation, illicit drug use abstinence, increased aerobic exercise, and weight loss to achieve a healthy weight). The efficacy of addressing lifestyle in just one counseling session has been shown.12 For example, behavior change can be elicited by making patients aware of their elevated blood pressure readings after a provider encounter.13 Despite the effectiveness and importance of lifestyle modification in treating hypertension and many chronic illnesses, lifestyle modification is under-utilized and not easily measured because its documentation is buried in narrative text rather than recorded in consistent coded form (e.g. diagnosis or procedure codes).14-21 This limits accurate evaluation of the efficacy of these interventions, tracking of quality care metrics, and reimbursement. Evaluation of lifestyle modification includes both the assessment of lifestyle modification activities as reported by a patient or observed by a provider (e.g. “patient started running and weight is down”) and advice on lifestyle modification activities given by a provider to a patient (e.g. “recommend patient lose 30 pounds”). Systematic identification of lifestyle modification is the first step toward improving its use in clinical practice and establishing it as a quality metric. While some clinical information systems code limited individual behaviors (e.g. smoking history), much of this information continues to be recorded primarily in narrative form.

There is a need for automated methods that can facilitate the extraction and integration of lifestyle behavior factors for use in research. Historically, manual chart review has been used to abstract information from patient records, but this has proven to be a time- and labor-intensive process, making large-scale chart abstractions nearly impossible. To accomplish this task more efficiently, this study used natural language processing (NLP) software tools and processes that can automatically extract text-based information. NLP tools can process many thousands of notes per hour.22 This technology makes larger chart abstractions feasible and allows a more comprehensive evaluation of documentation of lifestyle modification.

NLP has been used to successfully extract data from electronic clinical records and applied in many fields for efficient and accurate chart abstraction.23-27 Some studies have explored the automated extraction of information on smoking status as an isolated finding.28-31 One study looked at NLP tool augmentation to extract cardiovascular risk identification.32 Another study used the MediClass NLP tool for extraction of information on weight management counseling in postpartum visits and showed extraction capabilities similar to human abstractors.33 A separate study extracted limited lifestyle modification documentation in evaluation of diabetes management.34 And multiple prior studies of lifestyle modification have used survey data with potential bias and/or generalizability concerns. To our knowledge, this work is the first to evaluate lifestyle modification for hypertension in a large population using automated methods.

The primary objective of this study was to use an existing open-source natural language processing tool, cTAKES, along with rules and regular expressions on existing electronic health records to make previously invisible data on lifestyle modification documentation visible and ready for analysis. We used an existing data set that had been used in prior evaluations of hypertension treatment and clinical inertia.35 This data set was chosen because it could accomplish two goals: (1) evaluate the feasibility of automatic extraction of lifestyle modification (LM) using an open-source NLP tool allowing use and application to multiple chronic disease evaluations at different institutions, and (2) facilitate extension of prior work in this patient population to improve care of hypertension patients early in their disease process. Methods used in this study are designed so that they can be applied to many other patient populations including those with other chronic diseases such as obesity, coronary artery disease, diabetes, and peripheral vascular disease.

Methods and materials

Institutional and clinical settings: UW Health is the academic health system for the University of Wisconsin–Madison. UW Health adopted the Epic Systems Corporation’s electronic health record in 2004 and has created a Health Information Management Center, which is devoted to the integrity of system-wide electronic health record data and facilitating its use for improved patient care and health. All relevant full-text documents from patients meeting inclusion criteria were retrieved from the UW Health electronic health record. An overview diagram of this study’s steps is provided in Figure 1 with more details of the steps and methods following.

Figure 1.

Overview of study steps.

Step 1: retrieving and dividing data set

Study inclusion and exclusion criteria are detailed in Table 1.35

Table 1.

Inclusion and exclusion criteria.

Inclusion criteria (from this retrospective patient population)
  • Adult patients ⩾ 18 years old “managed” at a UW Health practice between 1 January 2008 and 31 December 2011. “Managed” is defined as having at least two billable office encounters in an outpatient, non-urgent care primary care setting, or one primary care encounter and one office encounter in an urgent care setting (regardless of diagnosis code)

  • Adult patients (regardless of gender, race, or ethnicity) meeting criteria for the diagnosis of hypertension defined as follows by guidelines from the Seventh Report of the Joint National Committee on the Prevention, Detection, Evaluation and Treatment of High Blood Pressure.36,37 Hypertension is defined as:
    • ⩾ three separate elevated blood pressures within a 2-year period, measured at least 30 days apart with: systolic ⩾ 140 mmHg or diastolic ⩾ 90 mmHg
    • or ⩾ two elevated blood pressures within a 2-year period, measured at least 30 days apart with: systolic ⩾ 160 mmHg or diastolic ⩾ 100 mmHg
Exclusion criteria
  • Patients pregnant during the retrospective study period were excluded 1 year before, during, and 1 year after pregnancy

  • Patients with a preexisting diagnosis of hypertension including essential hypertension, hypertensive heart disease, hypertensive renal disease, hypertensive heart and renal disease, secondary hypertension, or any anti-hypertensive medication prescription

  • Children under the age of 18 (because we employed guidelines on lifestyle modification for adults)

  • Prisoners (as a vulnerable population) were excluded as IRB approval was not obtained for inclusion of this population due to their limited freedom to choose lifestyle activities
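
The blood pressure criteria in Table 1 can be read as a small date-window check. The sketch below is an illustration only, not the study's code: the function names, the (date, systolic, diastolic) tuple layout, the reading of a "2-year period" as 730 days, and the reading of "at least 30 days apart" as a minimum gap between successive qualifying readings are all assumptions.

```python
from datetime import date


def has_spaced_readings(dates, needed, min_gap_days=30, window_days=730):
    """Return True if `needed` dates fit in one window with the minimum gap.

    Assumption: the window is anchored at an elevated reading and successive
    qualifying readings must be at least `min_gap_days` apart.
    """
    dates = sorted(dates)
    for i, start in enumerate(dates):
        count, last = 1, start
        for d in dates[i + 1:]:
            if (d - start).days > window_days:
                break
            # Greedily take the earliest reading that respects the gap.
            if (d - last).days >= min_gap_days:
                count += 1
                last = d
        if count >= needed:
            return True
    return False


def meets_hypertension_criteria(readings):
    """`readings` is a list of (date, systolic, diastolic) tuples."""
    elevated = [d for d, systolic, diastolic in readings
                if systolic >= 140 or diastolic >= 90]
    markedly_elevated = [d for d, systolic, diastolic in readings
                         if systolic >= 160 or diastolic >= 100]
    return (has_spaced_readings(elevated, needed=3)
            or has_spaced_readings(markedly_elevated, needed=2))


# Hypothetical patients, for illustration only.
patient_a = [(date(2009, 1, 1), 142, 85),
             (date(2009, 2, 15), 138, 92),
             (date(2009, 6, 1), 150, 88)]
patient_b = [(date(2009, 1, 1), 165, 95),
             (date(2009, 1, 11), 170, 98)]
print(meets_hypertension_criteria(patient_a))  # True
print(meets_hypertension_criteria(patient_b))  # False
```

The greedy scan keeps the earliest reading that satisfies the gap, which maximizes how many spaced readings fit inside a window under this interpretation.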

The hypertension diagnosis criteria were based on JNC 7, reflecting the guidelines available when the hypertension data set was created. However, for the lifestyle modifications that are the focus of this analysis, more recent guidelines, including JNC 8 and the 2017 American Heart Association guidelines, make similar recommendations.36-39 LM concepts reflect all three sources. Blood pressure measurements were extracted from discrete fields within the EHR. Preexisting conditions were identified using ICD9 codes. The 14,860 patient data set was divided into a 500 patient NLP tool refinement set and a 14,360 patient lifestyle modification retrieval set (Figure 2).

Figure 2.

Data set divisions with initial note counts.

The subset of patient notes for NLP tool refinement was randomly selected using the sample function of Python’s random module (“random.sample”).40,41 This subset was composed of notes from throughout the outpatient clinical encounter including nursing notes, provider notes, patient instructions, nutrition consultation, and exercise consultation. The University of Wisconsin IRB approved this study. Baseline characteristics of the study population are listed in Table 2, with the 500 patient subset shown beside the entire 14,860 patient data set.
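
As a minimal sketch of this kind of split, the snippet below draws a 500 patient refinement set with random.sample and leaves the rest as the retrieval set. The patient identifiers, the seed, and the function name are hypothetical; the study does not report how its sampling call was parameterized.

```python
import random


def split_patients(patient_ids, refinement_size=500, seed=0):
    """Randomly pick a refinement subset; everything else is the retrieval set."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    refinement = set(rng.sample(sorted(patient_ids), refinement_size))
    retrieval = [pid for pid in patient_ids if pid not in refinement]
    return sorted(refinement), retrieval


# Placeholder identifiers standing in for the 14,860 patients.
all_ids = list(range(14_860))
refinement_ids, retrieval_ids = split_patients(all_ids)
print(len(refinement_ids), len(retrieval_ids))  # 500 14360
```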

Table 2.

Study subjects’ baseline characteristics.

Baseline characteristics | Entire data set: 14,860 patients | NLP refinement subset: 500 patients
Age (years, mean, median, standard deviation) | 18–80, mean 49.21, median 49.00, SD 14.70 | 18–91, mean 49.47, median 50.00, SD 15
Gender | Number of patients (% of data set) | Number of patients (% of data set)
 Female | 7363 (49.55%) | 261 (52.20%)
 Male | 7497 (50.45%) | 239 (47.80%)
Race/ethnicity | Number of patients (% of data set) | Number of patients (% of data set)
 African American/Black | 703 (4.73%) | 24 (4.80%)
 Asian | 216 (1.45%) | 7 (1.40%)
 American Indian/Alaska Native | 50 (0.34%) | 0 (0%)
 White | 13,140 (88.43%) | 438 (87.60%)
 Hispanic/Latino | 281 (1.89%) | 13 (2.6%)
 Other/Native Hawaiian/Pacific Islander/Multiple | 71 (0.48%) | 2 (0.40%)
 Unknown | 399 (2.69%) | 16 (3.20%)

NLP: natural language processing; SD: standard deviation.

Step 2: creating lifestyle modification (LM) lexicon

An empirical method was used to create and iteratively enrich and refine the lifestyle modification terminology using four approaches: (1) literature review to identify related terms, including terms and concepts discussed in JNC 7 and 8; (2) existing ontology review (terms and their interconnections) to identify relevant terms including acronyms and abbreviations such as those in the SNOMED CT ontology, the National Center for Biomedical Ontology and the Consumer Health Vocabulary42; (3) domain expert collaboration to identify words, acronyms, abbreviations, and phrases relevant to hypertension and lifestyle modification; and (4) electronic health record note training. Identified terms and phrases from this fourfold process generated the initial list of relevant terms and phrases for extraction (Table 3).

Table 3.

Example of terms and phrases in lifestyle modification lexicon.

Provider verb phrase terms requiring any lifestyle modification object: advice | Lifestyle modification object terms (may be part of provider or patient verb phrase)
advise/d/ing | alcohol
avoid | cholesterol
begin | diet
change | dietician
congratulated | exercise
continue | fat/fats
counsel/ed/ing | gym
covered | health club
recommend/recommended | physical activity
refer/referral | salt

Patient verb phrase terms requiring any lifestyle modification object: assessment | Declaration of lifestyle modification (requiring no LM object): assessment
changed/changing | alcoholic
continued/continues/continuing | tobacco use
decreased/decreasing | change in weight
eats/eating | cigarettes:
improved/improving | cigars:
increased/increasing | DASH/dash diet
interested in | decreased weight
motivated to | down ##/pounds/lb/lbs
plans/planning to | healthy diet
reports/reported | physically active

LM: lifestyle modification.

Our goal was to iteratively extend the NLP tool with additional capabilities to handle discourse. Terms and phrases were mapped to codes in the Unified Medical Language System (UMLS) and to semantic types based on the UMLS semantic net. Details are available as Supplemental Appendices A and B.

Step 3: pre-filtering to identify notes of interest

The total data set was composed of 14,860 patients’ EHR notes with an average of 56 notes per patient. Each note was composed of multiple sentences, some with multiple concepts of interest (lifestyle modification, family history). Many notes were not relevant to lifestyle modification, so pre-filtering for note types and departments likely to have documented lifestyle modification was performed (Figure 3 shows LM-relevant note types and departments agreed upon by three study physicians).

Figure 3.

Note types and departments evaluated.

We included pediatric departments because the National Institutes of Health (NIH) defines children as those persons 18–21 years old and some patients in this age range may continue care in pediatric clinics. Gynecology was also included as a primary care clinic because many women see gynecologists as their primary care provider.
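
A pre-filter of this kind can be sketched as a simple membership test on note metadata. The note types, department names, and dictionary keys below are placeholders chosen for illustration; the actual lists are those shown in Figure 3, which are not reproduced here.

```python
# Hypothetical allow-lists; the study's real lists are given in its Figure 3.
RELEVANT_NOTE_TYPES = {"progress note", "patient instructions", "nursing note"}
RELEVANT_DEPARTMENTS = {"family medicine", "internal medicine",
                        "pediatrics", "gynecology"}


def prefilter(notes):
    """Keep notes whose type and department are both on the allow-lists."""
    return [
        note for note in notes
        if note["note_type"].lower() in RELEVANT_NOTE_TYPES
        and note["department"].lower() in RELEVANT_DEPARTMENTS
    ]


# Made-up notes for illustration.
notes = [
    {"id": 1, "note_type": "Progress Note", "department": "Family Medicine"},
    {"id": 2, "note_type": "Operative Report", "department": "Surgery"},
    {"id": 3, "note_type": "Patient Instructions", "department": "Gynecology"},
]
kept_ids = [note["id"] for note in prefilter(notes)]
print(kept_ids)  # [1, 3]
```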

Step 4: parsing and diagnosis identification

After pre-filtering, 14,331 notes were in the NLP tool refinement set and 403,018 notes were in the LM retrieval set. These notes were processed using the Clinical Text Analysis and Knowledge Extraction System (cTAKES). This is an open-source NLP pipeline that processes clinical notes and identifies types of clinical named entities—drugs, diseases/disorders, signs/symptoms, anatomical sites and procedures.43 Each named entity has attributes for the text span, the ontology mapping code, context (e.g. family history of, current, unrelated to patient), and negated/not negated. Figure 4 shows components of a typical NLP pipeline (with study enhancements to improve relevant term and phrase identification depicted by the down-arrow boxes). The typical process starts with detection of each section of the text and then each sentence, followed by the identification of sentence tokens (e.g. words, dates, numbers) in the sentence. A part-of-speech is assigned to tokens (e.g. noun, preposition, noun-phrase). The Named Entity Recognition component implements a dictionary lookup, so that each entity is mapped to a concept from the tool dictionary. Post-coordination then combines multiple concepts into a single one (e.g. DASH + diet).

Figure 4.

Augmented cTAKES processing steps.
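
To make the typical pipeline stages concrete, here is a toy Python sketch of sentence detection, tokenization, dictionary lookup, and post-coordination. It does not use or imitate the cTAKES API: the dictionary, the "TOY:" concept identifiers, and all function names are invented for illustration, and part-of-speech tagging is omitted.

```python
import re

# Invented dictionary and concept identifiers (not UMLS codes).
TOY_DICTIONARY = {"dash": "TOY:DASH", "diet": "TOY:DIET",
                  "exercise": "TOY:EXERCISE"}
# Pairs of neighbouring concepts that combine into a single concept.
POST_COORDINATION = {("TOY:DASH", "TOY:DIET"): "TOY:DASH_DIET"}


def split_sentences(text):
    """Naive sentence detection: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Return word and number tokens."""
    return re.findall(r"[A-Za-z]+|\d+", sentence)


def lookup(tokens):
    """Dictionary lookup: map each known token to its concept identifier."""
    return [TOY_DICTIONARY[t.lower()] for t in tokens
            if t.lower() in TOY_DICTIONARY]


def post_coordinate(concepts):
    """Merge neighbouring concepts in the concept list when a pair is known."""
    merged, i = [], 0
    while i < len(concepts):
        pair = tuple(concepts[i:i + 2])
        if pair in POST_COORDINATION:
            merged.append(POST_COORDINATION[pair])
            i += 2
        else:
            merged.append(concepts[i])
            i += 1
    return merged


def run_pipeline(text):
    return [post_coordinate(lookup(tokenize(s))) for s in split_sentences(text)]


result = run_pipeline("Patient follows DASH diet. Recommend daily exercise.")
print(result)  # [['TOY:DASH_DIET'], ['TOY:EXERCISE']]
```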

Notes were further processed using Python code to retrieve notes with diagnoses likely to have LM (e.g. obesity, diabetes—details of diagnosis filter terms available by request as Supplemental Appendix C).

Steps 5-8: retrieving LM, filtering out notes with negated or LM terms not related to patient, and mapping

This study initially tried to use cTAKES as the sole method to identify relevant LM, but this approach resulted in poor precision (67.15%) and poor recall (62.48%) because:

  1. Searching for semantic types of interest (TUIs) was too broad and extracted many irrelevant items (e.g. “dietary history” = TUI T033 = “Finding.” Many irrelevant terms are “Findings” such as “sweaty palms” which was pulled as a relevant retrieval when searching based on TUI alone).

  2. Using concept unique identifiers (CUIs) was too restrictive due to identifying only a specific term and inability to identify phrases with LM (e.g. “exercise”=CUI C0015259. If searching on this CUI, relevant terms such as “swimming” with CUI C0039003 were missed).

  3. Using cTAKES’ retrieval and dictionary lookup effectively identified nouns, but missed verbs and verb phrases of LM documentation (e.g. retrieved “diet,” missed “counseled on diet changes”).
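
The first two problems in the list above can be illustrated with a toy comparison of filtering strategies. The TUI and CUI values are the ones quoted in the list; the candidate records, the lexicon pattern, and the function names are assumptions made for this sketch, and fields the text does not state are simply left out.

```python
import re

# Candidate extractions, using only the codes quoted in the text above.
candidates = [
    {"text": "dietary history", "tui": "T033"},
    {"text": "sweaty palms", "tui": "T033"},
    {"text": "exercise", "cui": "C0015259"},
    {"text": "swimming", "cui": "C0039003"},
]

# Hypothetical lexicon pattern standing in for the study's LM lexicon.
LEXICON = re.compile(r"\b(?:diet\w*|exercis\w*|swim\w*)\b", re.IGNORECASE)


def by_tui(items, tui):
    """Semantic-type filter: broad, so unrelated findings come along."""
    return [c["text"] for c in items if c.get("tui") == tui]


def by_cui(items, cui):
    """Single-concept filter: narrow, so related activities are missed."""
    return [c["text"] for c in items if c.get("cui") == cui]


def by_lexicon(items):
    """Lexicon filter: matches the surface text against curated terms."""
    return [c["text"] for c in items if LEXICON.search(c["text"])]


print(by_tui(candidates, "T033"))      # ['dietary history', 'sweaty palms']
print(by_cui(candidates, "C0015259"))  # ['exercise']
print(by_lexicon(candidates))          # ['dietary history', 'exercise', 'swimming']
```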

Therefore, this study used a combined approach to processing the notes. cTAKES was used for parsing the notes into sections and sentences, but instead of using the parsed parts of speech identified by cTAKES for ontology mapping, Python code with regular expressions and rules was used to process each sentence to identify lifestyle modification terms and phrases from the created lexicon. This enhanced the natural language processing workflow to facilitate extraction of data related to lifestyle modification, including verbs as well as social and behavioral factors that historically have been difficult to extract. These data were mapped to corresponding terms within the SNOMED CT ontology using the Unified Medical Language System (UMLS).

For example, a provider recommendation of “cut down on alcohol intake” was retrieved using the study’s enhanced NLP methods, resulting in lifestyle modification advice retrieval: “alcohol consumption counseling” in SNOMED CT, UMLS CUI = C1531491 and TUI = Health Care Activity T058. Without enhancements to NLP, no match for this phrase was retrieved or mapped by cTAKES to the SNOMED CT ontology.

In order to capture context-sensitive phrases, we used semantic and syntactic rules, as well as regular expressions. Due to many notes containing unusual syntactic expressions impeding retrieval of context-sensitive lifestyle modification, existing note punctuation was modified. Specifically, when a sentence about LM contained a colon (:), the NLP processor read this as a hard stop and classified concepts after it as belonging to the next sentence. The colon mark was replaced with a comma to allow capture of concepts after the colon mark as contextually related to the first phrase. This improved retrieval of the entire lifestyle modification context. For example, prior to modification, cTAKES identified the negated history of “Patient smoking: no.” as a positive history of patient tobacco use: “Patient smoking. No.” After modification, the intended meaning was retrieved: “patient history negative for smoking.” Tokenization per cTAKES was not modified, but cTAKES only identifies nouns and noun phrases and this study also needed to identify verbs and verb phrases. Therefore, the algorithm searched the entire sentence using regular expressions for LM object terms and phrases (Figure 5 and details of LM lexicon available as Supplemental Appendix A).

Figure 5.

Example of regular expression to retrieve exercise.
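
The colon replacement and sentence-wide regular expression search described above might look roughly like the following. This is a sketch, not the expression shown in Figure 5: the pattern, the inclusion of walking terms, and the function names are assumptions, and the colon is replaced in every sentence rather than only in LM sentences.

```python
import re

# Hypothetical exercise pattern built from a few lexicon-style terms.
EXERCISE_RE = re.compile(
    r"\b(?:exercis(?:e|es|ed|ing)|physical(?:ly)? activ(?:e|ity)"
    r"|gym|health club|walk(?:s|ed|ing)?)\b",
    re.IGNORECASE,
)


def normalize_colons(sentence):
    """Replace colons with commas so a colon is not read as a hard stop."""
    return sentence.replace(":", ",")


def find_exercise_terms(sentence):
    """Search the whole sentence, so verb forms are found as well as nouns."""
    normalized = normalize_colons(sentence)
    return [m.group(0).lower() for m in EXERCISE_RE.finditer(normalized)]


print(normalize_colons("Patient smoking: no."))
# Patient smoking, no.
print(find_exercise_terms("Exercise: walking 30 minutes daily at the gym."))
# ['exercise', 'walking', 'gym']
```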

Negation identification was performed using negated regular expressions already employed to extract LM terms and phrases. This approach allowed for more efficient and thorough negation identification of verbs and verb phrases that were not identified by cTAKES’ negation module. Attribution of a concept, as pertaining to the patient or other person for LM terms/phrases and diagnoses, was expanded to include, in addition to mother and father, grandparents, grandmother, grandfather, roommate, spouse, husband, wife, son, daughter, partner, parent, friend, significant other, coworker, brother, sister, aunt, and uncle. Family history retrieval using name of relation (e.g. mom, sister) within rules and regular expressions, along with mapping schemes to SNOMED CT, was created specifically for this project. Mapping of terms to the SNOMED CT hierarchy was restricted to parent and child concepts, and their CUIs, to reduce granularity and facilitate effective use of this coded data in future machine learning projects that required less dimensionality. For example, “brother with non-insulin-dependent diabetes mellitus” was mapped to “FH: diabetes mellitus,” a child of the parent concept “FH: metabolic disorder,” and not mapped to the grandchild concept “FH: diabetes mellitus type 2.” This allowed all diabetics to be binned under one code. Details of family history mapping are available as Supplemental Appendix D.
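
A rough sketch of rule-based negation, attribution, and family history roll-up follows. The relation terms come from the paragraph above; the negation cue list, the function names, and the roll-up table (labels only, no codes) are assumptions made for illustration and are not the study's rules.

```python
import re

# Hypothetical negation cues; the study's negated expressions are not listed.
NEGATION_RE = re.compile(
    r"\b(?:no|not|denies|denied|never|without|negative for)\b", re.IGNORECASE)

# Relation terms named in the text, plus the "mom" example.
RELATIVES = [
    "mother", "mom", "father", "grandparents", "grandmother", "grandfather",
    "roommate", "spouse", "husband", "wife", "son", "daughter", "partner",
    "parent", "friend", "significant other", "coworker", "brother", "sister",
    "aunt", "uncle",
]
RELATIVE_RE = re.compile(r"\b(" + "|".join(RELATIVES) + r")\b", re.IGNORECASE)

# Roll a grandchild concept up to its child-level label (labels only).
FH_ROLLUP = {"FH: diabetes mellitus type 2": "FH: diabetes mellitus"}


def describe_mention(sentence):
    """Flag negation and decide who the statement is about."""
    relative = RELATIVE_RE.search(sentence)
    return {
        "negated": bool(NEGATION_RE.search(sentence)),
        "experiencer": relative.group(1).lower() if relative else "patient",
    }


def roll_up(label):
    """Return the coarser label when one is defined, else the label itself."""
    return FH_ROLLUP.get(label, label)


print(describe_mention("Patient smoking, no."))
# {'negated': True, 'experiencer': 'patient'}
print(describe_mention("Brother with non-insulin-dependent diabetes mellitus"))
# {'negated': False, 'experiencer': 'brother'}
print(roll_up("FH: diabetes mellitus type 2"))
# FH: diabetes mellitus
```

Word boundaries keep "non-insulin-dependent" from triggering the "no" cue and keep "grandmother" from matching as "mother".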

This study also classified lifestyle modification words and phrases as “assessment” and “advice” as was done in a prior analysis of videotaped encounters involving lifestyle modification.44 Classification as assessment or advice was based on logic rules that included key words. This allowed for evaluation of LM documentation as either provider documentation of LM activities that the patient reported (e.g. patient reported exercising = assessment) or provider documentation of advice that she or he offered to the patient (e.g. recommended exercise = advice). Matching of LM terms to SNOMED CT was further refined with specific LM terms and phrases grouped under overarching concepts such as exercise education (advice) or exercise history (assessment). Details of advice/assessment phrases available as Supplemental Appendix B. This process continued until performance was close to reproducing the manually coded set of terms and phrases, and additional modifications to the system minimally altered NLP tool performance. The training/testing tasks and iterative process details are shown in Table 4.
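
The key-word logic for separating advice from assessment could be sketched as below. The term lists are a subset adapted from the examples in Table 3, and the rule order, the stem matching on "exercis", and the function name are assumptions; the study's full rules are in its Supplemental Appendix B.

```python
import re


def _compile(alternatives):
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


# Subsets of the Table 3 example terms.
ADVICE_RE = _compile(
    r"advis(?:e|ed|ing)|avoid|begin|change|congratulated|continue"
    r"|counsel(?:ed|ing)?|covered|recommend(?:ed)?|refer(?:ral)?")
ASSESSMENT_RE = _compile(
    r"changed|changing|continued|continues|continuing|decreased|decreasing"
    r"|eats|eating|improved|improving|increased|increasing|interested in"
    r"|motivated to|plans to|planning to|reports|reported")
OBJECT_RE = _compile(
    r"alcohol|cholesterol|diet|dietician|exercis\w*|fats?|gym|health club"
    r"|physical activity|salt")
DECLARATION_RE = _compile(
    r"tobacco use|change in weight|dash diet|decreased weight|healthy diet"
    r"|physically active")


def classify(sentence):
    """Return 'advice', 'assessment', or None for one sentence."""
    has_object = OBJECT_RE.search(sentence)
    has_declaration = DECLARATION_RE.search(sentence)
    if not (has_object or has_declaration):
        return None  # no lifestyle modification content found
    if ADVICE_RE.search(sentence):
        return "advice"  # provider verb phrase plus an LM term
    if ASSESSMENT_RE.search(sentence) or has_declaration:
        return "assessment"  # patient verb phrase, or a bare declaration
    return None


for text in ["Recommended exercise 30 minutes daily",
             "Patient reported exercising",
             "Patient is physically active",
             "Sweaty palms noted"]:
    print(text, "->", classify(text))
```

Checking advice verbs first is a design choice of this sketch so that a phrase such as "recommended healthy diet" is not counted as a patient declaration.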

Table 4.

Annotation task iteration details.

Annotation task | Total number of training notes manually annotated and used in training (batch counts) | Number of iterations (some batches trained on multiple times) | Total number of testing notes manually annotated and used in testing | Annotators’ initials
Diagnosis filter | 1000 (500 × 2) | 2 | 500 | L.G., K.S.
Family history retrieval | 1200 (300 × 4) | 4 | 1500 | E.A.M., K.S.
Family history CUI mapping | 1200 (300 × 4) | 4 | 1500 | E.A.M., K.S.
Lifestyle modification retrieval | 1200 (300 × 4) | 12 | 1500 | E.A.M., K.S.
Lifestyle modification CUI mapping | 1200 (300 × 4) | 4 | 1500 | E.A.M., K.S.

CUI: concept unique identifier.

Overfitting was discussed by researchers (K.S., E.A.M., Y.S.) after each iteration to determine whether the retrieved terms/phrases from that iteration reflected relevant and generalizable results or included concepts or terms reflective only of this patient sample. All three members reached consensus in determining whether changes made to the tool were to be kept or represented overfitting and should not be retained in the tool development process. This iterative process resulted in retrieval of textual LM documentation and transformation into coded data ready for future statistical and machine learning analyses (Figure 6).

Figure 6.

Example of LM retrieval and transformation to coded data.

Validation of the extended tool

The augmented NLP process was applied to the testing data set, and its output was compared against the manually annotated gold standard. Data set divisions and uses are detailed in Figure 7, and Table 4 describes the researchers and numbers of notes employed in each test set evaluation.

Figure 7.

After pre-filtering for note type and department—data divisions and uses.

Two sets of domain experts were employed to establish validity in each test task. One team worked on validation of the diagnosis filter and consisted of one clinical data analyst trained in linguistics and healthcare terminologies and one physician (L.G., K.S.). The other team consisted of two physicians who worked on validation of the LM retrieval and CUI mapping, FH retrieval and CUI mapping, and LM classification as assessment or advice (E.A.M., K.S.). Each validation task used the randomly selected test set of patients’ records that each team’s members manually abstracted independently. From these manual abstractions, inter-annotator agreements were calculated using percent agreement and kappa coefficient calculations. Testing results of the enhanced NLP process are reported using precision, recall, and F measurements in Table 5. For this study, precision is defined as the number of terms correctly extracted divided by the sum of the number of terms correctly extracted and the number of terms incorrectly extracted. Formally this is: Precision = TP/(TP + FP) where TP denotes true positive (a term that was correctly extracted) and FP denotes false positive (a term that was extracted as relevant but should not have been). Recall is the number of terms correctly extracted divided by the sum of the number of terms correctly extracted and the number of terms that should have been extracted but were missed. Formally this is: Recall = TP/(TP + FN) where FN denotes false negative (a term that was not extracted but should have been). The F-measure is the harmonic mean of precision and recall with F = 2[(precision × recall)/(precision + recall)]. In addition to lifestyle modification retrieval and CUI mapping, our process was able to identify attribution and mapping to family history CUIs, and LM documentation type (i.e. assessment or advice). Success of classification of LM documentation as advice or assessment was evaluated using the percent of LM terms and phrases correctly classified. The unit of measure for classification as advice or assessment was each LM concept (e.g. if a sentence contained two LM concepts, with one being “advised weight loss” and correctly classified and the other being “patient reports gaining weight” and not correctly classified, then one concept would be counted as a correct classification and one as an incorrect classification).
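
The three measures defined above translate directly into code. In the sketch below the function names and the example counts are hypothetical; the final line recomputes the F-measure from the rounded combined LM precision and recall reported in Table 5, so values recomputed this way for other rows may differ from the table in the last digit.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * (p * r) / (p + r)


# Hypothetical counts, for illustration only.
tp, fp, fn = 90, 10, 5
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 4), round(r, 4), round(f_measure(p, r), 4))

# F-measure from the reported combined LM precision (94.44%) and recall (99.27%).
print(round(100 * f_measure(0.9444, 0.9927), 2))  # 96.79
```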

Table 5.

NLP augmentation validation performed on 1500 manually annotated notes from the 200 patient testing subset.

Measurement: testing of NLP tool refinement
NLP task | Recall | Precision | F-measure | Initial annotator % agreement | Initial IAA kappa (SE, 95% CI) | % Correct classification as advice or assessment | Correct CUI mapping
Diagnosis filter retrieval | 98.70% | 100% | 99.35% | 97.80% | 0.920 (0.024, 0.874–0.967) | N/A | N/A
Family history retrieval | 95.24% | 81.63% | 87.90% | 96.15% | 0.923 (0.044, 0.838–1.000) | N/A | N/A
Family history CUI mapping | N/A | N/A | N/A | 97.96% | 0.935 (0.064, 0.808–1.000) | N/A | 81.63%
Combined LM retrieval | 99.27% | 94.44% | 96.79% | 96.84% | 0.847 (0.066, 0.717–0.977) | N/A | N/A
Combined LM, CUI mapping | N/A | N/A | N/A | 98.59% | 0.868 (0.092, 0.687–1.00) | 88.15% | 93.66%

NLP: natural language processing; IAA: inter-annotator agreement; LM: lifestyle modification; SE: standard error; CI: confidence interval; CUI: concept unique identifier.

N/A means measurement not applicable to task.

Results

Results from testing the NLP tool refinement process for combined lifestyle modification retrieval were excellent, with 99.27 percent recall, 94.44 percent precision, and an F-measure of 96.79 percent. CUI mapping for lifestyle modification was also very good (93.66 percent correct), and 88.15 percent of phrases were correctly classified as advice or assessment. These results, along with the excellent diagnosis filter testing results and family history retrieval and mapping results, are shown in detail in Table 5. Inter-annotator agreement was also excellent, with high initial agreement percentages (range 96.15%–98.59%) and kappa scores (range 0.847–0.935) (Table 5). In the rare occurrence of reviewers extracting a different number of terms, the consensus-agreed-upon terms were used as the gold standard to calculate inter-annotator agreement. After consensus discussions, 100 percent agreement was reached between annotators for each task.
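
For readers unfamiliar with the agreement statistics, the sketch below computes percent agreement and a two-rater kappa from paired labels. It assumes the kappa reported here is Cohen's kappa, which the text does not state explicitly; the function names and the toy labels are made up.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Share of items on which the two annotators gave the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) when chance agreement p_e equals 1.
    """
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label]
                   for label in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy annotations: 1 = term extracted, 0 = term not extracted.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 1, 0, 1]
print(percent_agreement(annotator_1, annotator_2))  # 0.75
print(cohens_kappa(annotator_1, annotator_2))       # 0.5
```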

Each specific type of lifestyle modification retrieval recall and precision is detailed in Table 6. Overall retrieval testing results for each type of lifestyle modification were very good.

Table 6.

Recall and precision of specific types of LM concepts.

Specific LM type recall and precision
NLP retrieval of LM advice | Recall | Precision | NLP retrieval of LM assessment | Recall | Precision
Combined LM advice | 97.82% | 97.82% | Combined LM assessment | 98.80% | 91.21%
Dietary management education, guidance, and counseling | 93.75% | 93.75% | Dietary history | 100% | 100%
Exercise education | 100% | 100% | Exercise history | 100% | 100%
Patient advised about weight management | 100% | 100% | Weight finding | 100% | 96.55%
Smoking cessation assistance | 100% | 100% | Smoking assessment | 100% | 92.31%
Alcohol counseling | 100% | 100% | Alcohol use assessment | 87.50% | 87.50%
Drug addiction counseling | 100% | 100% | Drug use assessment | 100% | 100%

NLP: natural language processing; LM: lifestyle modification.

Of the 14,360 patients in the LM retrieval set, 11,252 patients (78.36%) had notes documenting lifestyle modification. Each patient had an average of 56 notes in the initial data set. From the total 809,097 notes in the LM retrieval set (NLP refinement set removed), after filters and processes described in Figure 1 were applied, 47,838 notes had at least one documentation of lifestyle modification. Many notes contained more than one documented LM activity. Specific lifestyle modification activities and their CUIs are detailed in Table 7 with patient and note counts for each type of LM.
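
The coverage figures follow from the counts in this paragraph. The short check below uses only those counts; the helper name is arbitrary.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


patients_with_lm, patients_total = 11_252, 14_360
notes_with_lm, notes_total = 47_838, 809_097

patient_share = percent(patients_with_lm, patients_total)
note_share = percent(notes_with_lm, notes_total)
print(f"{patient_share:.2f}% of patients")  # 78.36% of patients
print(f"{note_share:.1f}% of notes")        # 5.9% of notes
```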

Table 7.

Lifestyle modifications, concept unique identifiers (UMLS CUIs), and retrieval counts.

SNOMED CT concept | Concept unique identifier code | Patient counts | Note counts
Lifestyle modification advice
Dietary management education, guidance, and counseling | C1828150 | 7589 | 19,963
Exercise education | C0582396 | 7349 | 17,829
Patient advised about weight management | C3697318 | 5268 | 11,373
Smoking cessation assistance | C1692317 | 2139 | 4463
Alcohol consumption counseling | C1531491 | 1701 | 3053
Drug addiction counseling | C0199403 | 81 | 106
Lifestyle modification assessment
Dietary history | C0425401 | 3949 | 7552
Exercise history | C1287528 | 5912 | 12,526
Weight finding | C1265588 | 5358 | 15,749
Smoking assessment | C3853073 | 4719 | 10,647
Assessment of alcohol use | C4076406 | 4928 | 10,712
Assessment of drug use | C4076408 | 915 | 1544

For advice, diet and exercise advice were most frequently documented. For assessment, exercise and weight assessments were most frequently documented. Counts for advice on exercise were greater than counts for assessment, suggesting that providers are offering advice on exercise regardless of patients’ current exercise status. Counts for drug abuse counseling and assessment were low. This was a hypertension population and drug use assessment and counseling is not a recommended lifestyle modification intervention for hypertension. Drug abuse assessment and counseling were included in the design of this NLP tool enhancement for comprehensiveness and facilitation of future training and testing in populations with higher numbers of drug abuse assessment and counseling.

Outcome

Our modified process using an augmented open-source natural language processing tool was successful in identifying lifestyle modification documentation in electronic health records of hypertension patients at an academic medical center. Given the importance of lifestyle modification interventions for multiple medical issues, this is an important and innovative step in care transparency for future comparative effectiveness studies, outcome analyses, and efforts in care improvement. To date, most studies evaluating lifestyle modification as a medical treatment rely on surveys and self-reports which are inherently vulnerable to reporting and recall bias.45,46 With increasing emphasis placed on the need for LM to treat multiple chronic diseases, our study offers more objective and comprehensive measurement of LM care delivery via EHR analysis than previous studies. It is encouraging that 78 percent of patients in this study had LM addressed at least once during the study period, but there is room for improvement in how often LM is addressed with each patient. The percentage of notes containing LM was low at 5.9 percent, suggesting that although providers are addressing LM with many patients, providers are not repeatedly reviewing LM with patients despite the need for behavior modification for treatment of many diseases. As more efforts are made to improve lifestyle modification interventions, this study has shown that LM documentation can be automatically extracted from EHRs, thus offering increased identification of actual use of lifestyle modification in the care of hypertension and multiple chronic disorders in the future (Figure 8).47

Figure 8.

Chronic conditions that can be evaluated using these methods.
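
The patient-level and note-level percentages discussed above (78 percent of patients, 5.9 percent of notes) measure different things, and a small calculation can make the gap between them easier to see. The following is a minimal sketch, not the study's code: the function name `lm_coverage`, the `(patient_id, has_lm)` record layout, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def lm_coverage(notes):
    """Return (percent of patients with any LM note, percent of notes with LM).

    `notes` is an iterable of (patient_id, has_lm) pairs, one pair per note.
    This record layout is an assumption made for this sketch.
    """
    patient_has_lm = defaultdict(bool)
    total_notes = 0
    lm_notes = 0
    for patient_id, has_lm in notes:
        total_notes += 1
        lm_notes += bool(has_lm)
        # A patient counts as "addressed" if at least one note contains LM.
        patient_has_lm[patient_id] = patient_has_lm[patient_id] or bool(has_lm)
    if total_notes == 0:
        raise ValueError("no notes supplied")
    patient_pct = 100 * sum(patient_has_lm.values()) / len(patient_has_lm)
    note_pct = 100 * lm_notes / total_notes
    return patient_pct, note_pct


# Toy data (hypothetical): three patients, ten notes, two notes with LM.
toy_notes = [
    ("p1", True), ("p1", False), ("p1", False), ("p1", False),
    ("p2", False), ("p2", True), ("p2", False), ("p2", False),
    ("p3", False), ("p3", False),
]
patient_pct, note_pct = lm_coverage(toy_notes)
print(f"patients with LM: {patient_pct:.1f}%, notes with LM: {note_pct:.1f}%")
```

In the toy data, two of three patients have LM documented at least once, yet only two of ten notes contain it. This is the same pattern as in the study: broad patient coverage combined with infrequent repetition across notes.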

Challenges and limitations

Limitations of this study included the secondary use of EHR data, which can present data quality issues (e.g. missing data, reporting bias, recording bias). Caveats and challenges in secondary use of EHR data have been well documented.48-51 One model of data quality defined completeness as “the extent to which data are of sufficient breadth, depth, and scope for the task at hand.”52 This study’s ability to accurately retrieve and classify LM from this EHR data set indicates that the data set is sufficient for this task. A second challenge was this study’s use of an open-source NLP tool that required customization. Initial lifestyle modification data retrieval attempts using only cTAKES were unsuccessful, with poor recall and precision, but after iterative development the final testing results showed successful LM retrieval and classification as assessment or advice. This study overcame a key challenge in NLP evaluation of care delivery: the inability to retrieve verbs and verb phrases. cTAKES alone could not accurately identify verbs or verb phrases, or use verb tense to classify a term as advice or assessment. For example, “patient started walking” is assessment: documentation of a patient-reported LM. This is different from “recommended patient start walking 30 minutes per day,” which is provider LM advice. cTAKES alone could not retrieve or distinguish these two concepts, but using a combination of regular expressions, rules, and key words, these critical concepts centered on verb phrases were accurately retrieved and classified. However, this approach has inherent limitations in generalizability and scalability.
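
To illustrate the kind of rule described above, the following is a minimal sketch of a regular-expression classifier that separates advice from assessment by looking for provider cue verbs. It is not the study's rule set: the cue lists `ADVICE_CUES` and `LM_TERMS` and the function `classify_lm` are hypothetical, deliberately small, and do not involve cTAKES.

```python
import re

# Hypothetical provider cue verbs that signal advice (assumption for this sketch).
ADVICE_CUES = re.compile(
    r"\b(recommend(?:ed|s)?|advis(?:e|ed|es)|encourag(?:e|ed|es)|"
    r"counsel(?:ed|led|s)?|should)\b",
    re.IGNORECASE,
)

# Hypothetical lifestyle modification key words (assumption for this sketch).
LM_TERMS = re.compile(
    r"\b(walk(?:ing|s|ed)?|run(?:ning|s)?|exercis(?:e|es|ed|ing)|diet|weight|"
    r"smok(?:e|es|ed|ing)|tobacco|alcohol)\b",
    re.IGNORECASE,
)


def classify_lm(sentence: str) -> str:
    """Label a sentence as 'advice', 'assessment', or 'none'."""
    if not LM_TERMS.search(sentence):
        return "none"        # no lifestyle modification key word found
    if ADVICE_CUES.search(sentence):
        return "advice"      # a provider cue verb accompanies the LM term
    return "assessment"      # LM term without an advice cue


for text in (
    "patient started walking",
    "recommended patient start walking 30 minutes per day",
    "blood pressure 140/90",
):
    print(f"{text!r} -> {classify_lm(text)}")
```

A fixed key-word list like this one misses phrasings it has not seen, which reflects the generalizability and scalability limitation noted above.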

Future development

Future research will attempt to improve and augment cTAKES and its dictionary to extract and map verb phrases directly. This approach will minimize the use of rules and regular expressions and make this work more generalizable and scalable. We also plan to extend lifestyle modification mapping to ICD-10 and other dictionaries of interest. This work will be made available to researchers in related medical areas that use LM as a treatment and could be used to support evaluation of current and future initiatives such as “Exercise Is Medicine” and “Healthy People 2030.”53-55 Another area for future development could be the assessment of the quality of lifestyle modification counseling, with a more granular extraction of lifestyle modification concepts and counseling details to better understand best practices.56

Conclusion

This study successfully extracted lifestyle modification documentation from EHR notes. Its methods and future planned expansion of these methods could be used in studies involving multiple chronic medical conditions. This is an important step in better understanding and quantifying the use of lifestyle modification as a prevention and treatment modality for many disorders. This information can be used in future outcome and comparative effectiveness research and inform metric development for lifestyle modification documentation and counseling. The automatic identification and mapping of terms, especially verbs, related to care delivery is a major innovation that can allow further evaluation and improvement in care delivery models and treatment approaches to multiple chronic illnesses.

Supplementary Material

Appendix A
Appendix B
Appendix C
Appendix D

Acknowledgements

The authors are grateful to the Health Innovation Program at the University of Wisconsin–Madison for assistance with data acquisition. K.S. and Y.S. contributed equally to this work and should be listed as first authors.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Funding for this project for investigators Shoenbill, Song, and Mendonca was provided by the Clinical and Translational Science Award (CTSA) program, through the NIH National Center for Advancing Translational Sciences (NCATS), grant UL1TR002373 (A. Brazier, PI); and by the UW-Madison Office of the Vice Chancellor for Research and Graduate Education Research-2014 Fall Research Competition Award, “Predictors of Lifestyle Modification in Hypertension: A Computational Analysis” (E. Mendonca, PI). Additional funding for investigator Shoenbill was provided by University of North Carolina Chapel Hill start-up funding (K. Shoenbill, PI) and by NLM Grant 5T15LM007359 to the Computation and Informatics in Biology and Medicine Training Program (M. Craven, PI).

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

Contributor Information

Kimberly Shoenbill, University of North Carolina at Chapel Hill, USA.

Yiqiang Song, University of Wisconsin-Madison, USA.

Lisa Gress, University of Wisconsin-Madison, USA.

Heather Johnson, University of Wisconsin-Madison, USA.

Maureen Smith, University of Wisconsin-Madison, USA.

Eneida A Mendonca, University of Wisconsin-Madison, USA.

References

  • 1. Benjamin EJ, Virani SS, Callaway CW, et al. Heart disease and stroke statistics—2018 update: a report from the American Heart Association. Circulation 2018; 137: e67–e492.
  • 2. Eckel RH, Jakicic JM, Ard JD, et al. 2013 AHA/ACC guideline on lifestyle management to reduce cardiovascular risk: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines. Circulation 2014; 129: S76–S99.
  • 3. Jensen MD, Ryan DH, Apovian CM, et al. 2013 AHA/ACC/TOS guideline for the management of overweight and obesity in adults: a report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines and the Obesity Society. Circulation 2014; 129: S102–S138.
  • 4. Cerezo C, Sequra J, Praga M, et al. Guidelines updates in the treatment of obesity or metabolic syndrome and hypertension. Curr Hypertens Rep 2013; 15: 196–203.
  • 5. Fleg JL, Forman DE, Berra K, et al. Secondary prevention of atherosclerotic cardiovascular disease in older adults: a scientific statement from the American Heart Association. Circulation 2013; 128: 2422–2446.
  • 6. American Diabetes Association. Standards of medical care in diabetes—2014. Diabetes Care 2014; 37: S14–S80.
  • 7. Anderson JL, Halperin JL, Albert NM, et al. Management of patients with peripheral artery disease (compilation of 2005 and 2011 ACCF/AHA guideline recommendations): a report of the American College of Cardiology Foundation/American Heart Association Task Force on Practice Guidelines. Circulation 2013; 127: 1425–1443.
  • 8. Whelton PK, Carey RM, Aronow WS, et al. 2017 ACC/AHA/AAPA/ABC/ACPM/AGS/APhA/ASH/ASPC/NMA/PCNA Guideline for the Prevention, Detection, Evaluation, and Management of High Blood Pressure in Adults: Executive Summary: A Report of the American College of Cardiology/American Heart Association Task Force on Clinical Practice Guidelines. J Am Coll Cardiol 2018; 71(19): 2199–2269. doi: 10.1016/j.jacc.2017.11.005.
  • 9. Islami F, Goding Sauer A, Miller KD, et al. Proportion and number of cancer cases and deaths attributable to potentially modifiable risk factors in the United States. CA Cancer J Clin 2018; 68: 31–54.
  • 10. Institute for Health Metrics. Institute for Health Metrics country profile: United States, www.healthdata.org
  • 11. Mokdad AH, Ballestros K, Echko M, et al. The state of US health, 1990–2016. JAMA 2018; 319: 1444–1472.
  • 12. Schoenthaler A, Luerassi L, Silver S, et al. Comparative effectiveness of a practice-based comprehensive lifestyle intervention vs. single session counseling in hypertensive blacks. Am J Hypertens 2016; 29: 280–287.
  • 13. Pu J, Chewning BA, Johnson HM, et al. Health behavior change after blood pressure feedback. PLoS ONE 2015; 10: e0141217.
  • 14. Lanier JB, Bury DC and Richardson SW. Diet and physical activity for cardiovascular disease prevention. Am Fam Physician 2016; 93: 919–924.
  • 15. Weintraub WS, Daniels SR, Burke LE, et al. Value of primordial and primary prevention for cardiovascular disease: a policy statement from the American Heart Association. Circulation 2011; 124: 967–990.
  • 16. Lin JS, O’Connor E, Evans CV, et al. Behavioral counseling to promote a healthy lifestyle in persons with cardiovascular risk factors: a systematic review for the U.S. Preventive Services Task Force. Ann Intern Med 2014; 161: 568–578.
  • 17. Patnode CD, Evans CV, Senger CA, et al. Behavioral counseling to promote a healthful diet and physical activity for cardiovascular disease prevention in adults without known cardiovascular disease risk factors: updated evidence report and systematic review for the US Preventive Services Task Force. JAMA 2017; 318: 175–193.
  • 18. Lin J, Zhuo X, Bardenheier B, et al. Cost-effectiveness of the 2014 U.S. Preventive Services Task Force (USPSTF) recommendations for intensive behavioral counseling interventions for adults with cardiovascular risk factors. Diabetes Care 2017; 40: 640–646.
  • 19. Mozaffarian D. Dietary and policy priorities for cardiovascular disease, diabetes, and obesity: a comprehensive review. Circulation 2016; 133: 187–225.
  • 20. Tajeu GS, Booth JN III, Colantonio LD, et al. Incident cardiovascular disease among adults with blood pressure <140/90 mm Hg. Circulation 2017; 136: 798–812.
  • 21. Sacks FM, Lichtenstein AH, Wu JHY, et al. Dietary fats and cardiovascular disease: a presidential advisory from the American Heart Association. Circulation 2017; 136: e1–e23.
  • 22. Turchin A, Kolatkar NS, Grant RW, et al. Using regular expressions to abstract blood pressure and treatment intensification information from the text of physician notes. J Am Med Inform Assoc 2006; 13: 691–695.
  • 23. Mendonca EA, Haas J, Shagina L, et al. Extracting information on pneumonia in infants using natural language processing of radiology reports. J Biomed Inform 2005; 38: 314–321.
  • 24. Mendonca EA, Cimino JJ and Johnson S. Using narrative reports to support a digital library. Proc AMIA Symp 2001: 458–462.
  • 25. Murff HJ, FitzHenry F, Matheny ME, et al. Automated identification of postoperative complications within an electronic medical record using natural language processing. JAMA 2011; 306: 848–855.
  • 26. Pakhomov S, Weston SA, Jacobsen SJ, et al. Electronic medical records for clinical research: application to the identification of heart failure. Am J Manag Care 2007; 13: 281–288.
  • 27. Salmasian H, Freedberg DE and Friedman C. Deriving comorbidities from medical records using natural language processing. J Am Med Inform Assoc 2013; 20: e239–e242.
  • 28. Stevens V, Bailey S, Hazlehurst B, et al. PS1-1d: use of CER hub to evaluate outcomes of smoking cessation services, a behavioral treatment. Clin Med Res 2013; 11: 144.
  • 29. Hazlehurst B, Frost R, Sitting D, et al. MediClass: a system for detecting and classifying encounter-based clinical events in any electronic medical record. J Am Med Inform Assoc 2005; 12: 517–529.
  • 30. Wicentowski R and Sydes MR. Using implicit information to identify smoking status in smoke-blind medical discharge summaries. J Am Med Inform Assoc 2008; 15: 29–31.
  • 31. Lindholm C, Adsit R, Bain P, et al. A demonstration project for using the electronic health record to identify and treat tobacco users. WMJ 2010; 109: 335–340.
  • 32. Khalifa A and Meystre S. Adapting existing natural language processing resources for cardiovascular risk factors identification in clinical notes. J Biomed Inform 2015; 58: S128–S132.
  • 33. Hazlehurst BL, Lawrence JM, Donahoo WT, et al. Automating assessment of lifestyle counseling in electronic health records. Am J Prev Med 2014; 46: 457–464.
  • 34. Hosomura N, Goldberg SI, Shubina M, et al. Electronic documentation of lifestyle counseling and glycemic control in patients with diabetes. Diabetes Care 2015; 38: 1326–1332.
  • 35. Johnson HM, Thorpe CT, Bartels CM, et al. Undiagnosed hypertension among young adults with regular primary care use. J Hypertens 2014; 32: 65–74.
  • 36. James PA, Oparil S, Carter BL, et al. 2014 evidence-based guideline for the management of high blood pressure in adults: report from the panel members appointed to the Eighth Joint National Committee (JNC 8). JAMA 2014; 311: 507–520.
  • 37. Chobanian AV. Seventh report of the Joint National Committee on prevention, detection, evaluation, and treatment of high blood pressure. Hypertension 2003; 42: 1206–1252.
  • 38. Shrout T, Rudy DW and Piascik MT. Hypertension update, JNC8 and beyond. Curr Opin Pharmacol 2017; 33: 41–46.
  • 39. Merai R, Siegel C, Rakotz M, et al. CDC grand rounds: a public health approach to detect and control hypertension. MMWR Morb Mortal Wkly Rep 2016; 65: 1261–1264.
  • 40. Wichmann BA and Hill ID. Algorithm AS 183: an efficient and portable pseudo-random number generator. J Roy Stat Soc C-App 1982; 31: 188–190.
  • 41. Matsumoto M and Nishimura T. Mersenne twister: a 623-dimensionally equidistributed uniform pseudorandom number generator. ACM T Model Comput S 1998; 8: 3–30.
  • 42. Zeng QT and Tse T. Exploring and developing consumer health vocabularies. J Am Med Inform Assoc 2006; 13: 24–29.
  • 43. Savova GK, Masanz JJ, Ogren PV, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010; 17: 507–513.
  • 44. Milder IE, Blokstra A, de Groot J, et al. Lifestyle counseling in hypertension-related visits—analysis of video-taped general practice visits. BMC Fam Pract 2008; 9: 58.
  • 45. Sreedhara M, Silfee VJ, Rosal MC, et al. Does provider advice to increase physical activity differ by activity level among US adults with cardiovascular disease risk factors? Fam Pract 2018; 35: 420–425.
  • 46. Jackson EA, Krishnan S, Meccone N, et al. Perceived quality of care and lifestyle counseling among patients with heart disease. Clin Cardiol 2010; 33: 765–769.
  • 47. Bruun Larsen L, Soendergaard J, Halling A, et al. A novel approach to population-based risk stratification, comprising individualized lifestyle intervention in Danish general practice to prevent chronic diseases: results from a feasibility study. Health Informatics J 2017; 23: 249–259.
  • 48. Weiskopf NG, Hripcsak G, Swaminathan S, et al. Defining and measuring completeness of electronic health records for secondary use. J Biomed Inform 2013; 46: 830–836.
  • 49. Weiskopf NG and Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc 2013; 20: 144–151.
  • 50. Botsis T, Hartvigsen G, Chen F, et al. Secondary use of EHR: data quality issues and informatics opportunities. Summit on Translat Bioinforma 2010; 2010: 1–5.
  • 51. Hersh WR, Weiner MG, Embi PJ, et al. Caveats for the use of operational electronic health record data in comparative effectiveness research. Med Care 2013; 51: S30–S37.
  • 52. Wang RY and Strong DM. Beyond accuracy: what data quality means to data consumers. J Manage Inform Syst 1996; 12: 5–33.
  • 53. Cowan RE. Exercise is medicine initiative: physical activity as a vital sign and prescription in adult rehabilitation practice. Arch Phys Med Rehabil 2016; 97: S232–S237.
  • 54. Sperling LS, Sandesara PB and Kim JH. Exercise is medicine: proof … and possibilities? JACC Cardiovasc Imaging 2017; 10: 1469–1471.
  • 55. HealthyPeople.Gov. Office of Disease Prevention and Health Promotion, https://www.healthypeople.gov/2020/About-Healthy-People/Development-Healthy-People-2030/Framework (accessed 28 October 2017).
  • 56. Stonerock GL and Blumenthal JA. Role of counseling to promote adherence in healthy lifestyle medicine: strategies to improve exercise adherence and enhance physical activity. Prog Cardiovasc Dis 2017; 59: 455–462.
