AMIA Annual Symposium Proceedings. 2009 Nov 14;2009:553–557.

Early Warning and Risk Estimation methods based on Unstructured Text in Electronic Medical Records to Improve Patient Adherence and Care

Jakka Sairamesh 1, Ram Rajagopal 1, Ravi Nemana 2, Keith Argenbright 4
PMCID: PMC2815399  PMID: 20351916

Abstract

In this paper we present risk-estimation models and methods for early detection of patient non-adherence based on unstructured text in patient records. The primary objectives are to perform early interventions on patients at risk of non-adherence and improve outcomes. We analyzed over 1.1 million visit notes corresponding to 30,095 cancer patients, spread across 12 years of oncology practice. Our risk analysis, based on a rich risk-factor dictionary, revealed that a staggering 30% of the patients were estimated to be at a high risk of non-adherence. Our risk classification showed that 2 distinct patient groups, between 26 and 38 (mean risk score, r=0.77, s=0.22), and 75 and 90 (r=0.81, s=0.19) years of age respectively, exhibited the highest risk of non-adherence when compared to the rest. The dominant risk-factors for these two groups, not surprisingly, included psychosocial (e.g. depression, lack of support), medical (e.g. side-effects such as pain) and financial issues (e.g. costs of treatment).

Introduction

Sound medical diagnosis, treatment and follow-up care are crucial for a patient's quality of life and survival. Equally crucial is the ability of the patient to adhere [1][10] to the prescribed regimen of care, yet patients often misunderstand, fail to carry out or even ignore medical advice. This deviation from the recommended and expected clinical path can dramatically increase costs of care [2], rehospitalizations [3][4], adverse outcomes, and the chance of preventable death, with estimates of hospitalization costs as high as $13.5 billion annually [6][7].

Understanding and modeling patient non-adherence [1][9][10] during treatment within the newer and more complex process of clinical care poses a range of challenges. This paper illustrates methods to glean risk-factors from text written in patient-visit records, and to identify patients at risk of non-adherence before they decide to veer off their treatment regimens. Our primary aim is to glean nuggets of information (leading indicators or signals) embedded in the visit notes as clues to the risk-factors causing a patient to not adhere to a treatment regimen. These nuggets can provide valuable early warnings to clinicians on when to intervene.

Even though clinicians document nearly every patient encounter, detecting the ability of a patient to adhere to care recommendations is difficult. We define a patient as adherent when he or she follows the prescribed treatments or care that health professionals recommend. Clinical documentation is diverse and can include images, audio and multimedia, but a vast fraction of clinical information is semi-structured and unstructured text. Increasingly this heterogeneously structured clinical documentation is in digital form (EMR), growing 63% from 2001 to 2006 among clinical practices [11][12][13]. The volume of documentation is so great that it is impractical to sift through large numbers of patient charts to determine non-adherent activity. Semantic differences, variation in documentation styles, and heterogeneous documentation traditions (e.g. physicians vs. social workers) make manual evaluation of non-adherence difficult. Clinical documentation contains substantial linguistic, stylistic and other semantic variations such as codes, spelling errors, abbreviations, etc. that introduce additional complexity to detecting adherence risk factors. For example, the concept of “treatment side effects” can be represented by the following keywords: “nausea,” “headache” (or “h/a”), “GI distress,” “GI issues,” “fatigue,” “tired,” and so on.
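The keyword example above can be illustrated with a minimal sketch that maps surface forms found in a note to a single concept label. The table `CONCEPT_SYNONYMS` and the function `find_concepts` are hypothetical names introduced here for illustration; the whole-term matching rule is an assumption, not a description of the system used in the study.

```python
import re

# Hypothetical synonym table: concept label -> surface forms seen in notes.
CONCEPT_SYNONYMS = {
    "treatment side effects": [
        "nausea", "headache", "h/a", "gi distress", "gi issues",
        "fatigue", "tired",
    ],
}

def find_concepts(note, synonyms=CONCEPT_SYNONYMS):
    """Return the set of concept labels whose surface forms occur in the note."""
    text = note.lower()
    found = set()
    for concept, terms in synonyms.items():
        for term in terms:
            # Match whole terms only, so "tired" does not match "retired".
            pattern = r"(?<![a-z0-9])" + re.escape(term) + r"(?![a-z0-9])"
            if re.search(pattern, text):
                found.add(concept)
                break
    return found
```

A note such as "Pt c/o H/A and feels tired." would map to the single concept "treatment side effects", while "Patient is retired." would map to nothing.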

Text processing and mining offer well-described methods that can be used to model, identify, classify, and extract a large majority of the clinically documented risk indicators of non-adherence that correlate to the known non-adherence risk factors found in the literature. Text mining in health care is not new; however, its applications are limited to a few areas, and our literature search uncovered few text mining methods for modeling patient adherence behavior [14].

Clues to patient non-adherence

Identifying the dominant risk factors from patient records on what causes patients to miss appointments, cancel often, not show up at all or deviate from a treatment regimen is a challenge. For cancer patients the treatment cycles can vary depending on the disease stage and patient characteristics. Long treatment cycles may potentially cause a patient to undergo changes in lifestyle, habits, and social support, which can negatively impact their adherence.

Causal relationships

The root-causes for non-adherence may include a combination of psychosocial, financial, family and other factors [8][9][11] which can influence a patient to not adhere to treatment. In addition, the risk-factors can be interlinked. For example, costs of co-payment and medication during treatment can influence patients to not adhere to the prescribed treatment regimen. In reality the root cause could be that patients have severe medication related side-effects causing them to miss their daily jobs or take leave and this can be a burden on finances. Identifying such causal relationships can be very challenging.

Early warnings and rules

Identifying the emerging issues automatically by analyzing the clinician notes for each patient can be a challenge given that the notes captured during treatment can be hard to decipher and correlate across multiple clinicians. In addition, they may be hard to glean unless proper information and clinical rules about a patient diagnosis, family history and social structure are employed during analysis. The main challenge is in extracting risk-factors and generating alerts [5] about patient behavior from millions of patient notes, which is typically the case in medium to large sized clinics with multiple years of oncology practice.

Our Methodology

We propose an early-warning methodology for analyzing unstructured and structured elements in patient records in order to develop actionable information for clinicians to improve care and reduce non-adherence.

Dictionary

The first step in our methodology is the creation of a medical dictionary to enable the extraction of risk factors from clinician notes. An initial raw risk-factor dictionary is created by automatically analyzing millions of records (in deidentified form). The following are the tasks (see Figure 2) of the dictionary creation process:

  • Raw risk-factor dictionary creation: This is an initial version of the dictionary of terms that includes nouns, risk factors, adjectives, location names, medical terms, medical codes and other language-specific terms. It also includes domain-specific synonyms such as terms, abbreviations, misspellings and code-words that have the same meaning. These terms have to be pruned and merged for further analysis.

  • Refined dictionary creation: Once the nouns, location lists and names are taken out from the dictionary, what are left are the words and phrases (including medication information) that represent risk factors for analysis. The factors manifest themselves in various forms (e.g. medical phrases, terms and specialized codes). The extracted raw-dictionary is further pruned by a human medical expert, and a refined risk-factor dictionary (RFD) is created based upon clinician defined rules for early detection.
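The two dictionary tasks above can be sketched as a small pruning step. This is only an illustration under stated assumptions: the category labels, the `canonical_of` synonym map and the function name `refine_dictionary` are hypothetical, and in the paper the final pruning is done by a human medical expert rather than by code.

```python
def refine_dictionary(raw_terms, excluded_categories, canonical_of):
    """Prune a raw term dictionary and merge synonyms.

    raw_terms: dict mapping each extracted term to a category label.
    excluded_categories: categories to drop (e.g. names, locations).
    canonical_of: dict mapping a variant (abbreviation, misspelling)
        to its canonical risk-factor term.
    Returns a dict mapping each kept surface form to its canonical term.
    """
    refined = {}
    for term, category in raw_terms.items():
        if category in excluded_categories:
            continue  # nouns, location lists and names are taken out
        key = term.lower().strip()
        refined[key] = canonical_of.get(key, key)
    return refined
```

For instance, with "Dallas" tagged as a location and "tired" mapped to "fatigue", the refined dictionary keeps "tired" (as "fatigue") and drops "Dallas".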

Figure 2:


Multi-attribute model and risk algorithms

Our analysis of patient records has shown that visit information can contain anywhere from 10 to 1000 words describing the patient condition, vital signs and at times social information. A sample paragraph from a real patient visit note is shown in Figure 1, where, in addition to the medical condition, word combinations such as “lives alone” or “family lives far away” or “does not have transport often” provide clues to social and economic risk factors related to the lack of support and logistical help that a cancer patient faces. We employed nearest-neighbor and distance-vector algorithms to compare one or more visit records across patients. We look for similarity in the extracted risk-factors by comparing the nearness of two phrases or word combinations. We model a visit note or document as a vector of important words: D(i) = F(W, F, A), where W is a bag of concepts (words, word combinations and phrases) extracted from document i that represent risk factors, F is the corresponding vector of occurrence counts of the concepts, and A is the vector of weights associated with the concepts (e.g. severity of fatigue). We compare the vectors of concepts across multiple documents and find similarities and correlations useful for analysis.
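As a rough illustration of comparing two visit notes through their concept vectors, the sketch below combines occurrence counts (F) with per-concept weights (A) and measures closeness with cosine similarity. The choice of cosine similarity is an assumption made for this example; the paper states only that nearest-neighbor and distance-vector algorithms were used. The function names and the default weight of 1.0 are likewise hypothetical.

```python
import math
from collections import Counter

def note_vector(concepts, weights):
    """Build a weighted concept vector for one note.

    concepts: list of risk-factor concepts extracted from the note (W).
    weights: dict concept -> weight (A); missing concepts default to 1.0.
    """
    counts = Counter(concepts)  # occurrence counts (F)
    return {c: n * weights.get(c, 1.0) for c, n in counts.items()}

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[c] * v[c] for c in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Two notes that share no extracted concepts score 0.0, and a note compared with itself scores 1.0 up to rounding.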

Figure 1:


Sample clinician note from a visit record (documented in the EMR).

Behavioral Modeling of Patients at risk

The next step in the methodology is to extract the potential risk factors from the patient visit records and apply certain rules to derive the distribution (Pareto histogram) of the dominant risk factors across the patients. To do this, clinical rules are constructed by a medical expert looking for early-warning signals in the patient information, such as medication, side-effects, psychosocial issues, insurance, job loss, family support and other criteria. For example, if a patient has a history of depression or has taken anti-depression medications, the patient is likely to be at a high risk of non-adherence. We employed algorithms to classify, group and estimate future patient behavior based on risk factors and patient attributes (e.g. age, disease type, stage, smoking history, family history and other relevant characteristics). In order to model patient non-adherence behavior, we considered a multi-attribute representation of the risk factors. This model consists of several risk factors that are represented by their severity, strength and corresponding numerical weights. The severity of each risk factor is obtained by interviewing clinicians and other medical staff who have empirical and medical knowledge.
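The depression example above suggests how an expert-defined clinical rule might be expressed in code. The sketch below is hypothetical: the rule names, the trigger terms and the `early_warnings` function are invented for illustration and do not reproduce the rules used in the study.

```python
# Hypothetical trigger terms for the depression rule described in the text.
DEPRESSION_TERMS = {"depression", "antidepressant"}

def rule_depression(factors):
    """History of depression or anti-depression medication."""
    return bool(DEPRESSION_TERMS & factors)

def rule_lack_of_support(factors):
    """Social clues such as the patient living alone."""
    return "lives alone" in factors or "family lives far away" in factors

# Rule name -> predicate over the set of factors extracted for a patient.
RULES = {
    "history of depression": rule_depression,
    "lack of support": rule_lack_of_support,
}

def early_warnings(factors):
    """Return the names of all rules triggered by the extracted factors."""
    factor_set = set(factors)
    return sorted(name for name, rule in RULES.items() if rule(factor_set))
```

Keeping each rule as a named predicate makes it straightforward for a clinician to review, add or retire rules without touching the extraction step.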

We provide an approach (shown in Figures 3 and 4) to modeling adherence behavior by computing the overall current risk of non-adherence, R(i), and the probability of non-adherence, P(i), of a patient “i”. Consider R(i,j,k) as the current estimated risk of non-adherence of a patient “i” due to risk factor “j” during a visit “k” to the clinic for treatment. P(i) is the estimated probability that patient “i” will not adhere to a treatment regimen. R(i) is based on a weighted risk scoring function across all risk factors and visit records for a patient. We first train the algorithms using a large set of records, and then we validate the model and algorithms on a smaller blinded set of visit records. Any errors (based on R-squared and chi-squared tests) are then fed back to refine the model.
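Because the exact scoring equations appear only in Figure 3, the following is merely one plausible reading of a weighted risk score across risk factors and visits. It assumes each R(i,j,k) lies in [0, 1], takes a weight-normalized sum per visit, and averages over visits; the function name `overall_risk` and this particular aggregation are assumptions, not the authors' formula.

```python
def overall_risk(visit_risks, weights):
    """Aggregate per-visit, per-factor risks into one score for a patient.

    visit_risks: list with one dict per visit k, mapping risk factor j
        to the estimated risk R(i, j, k) in [0, 1].
    weights: dict mapping risk factor j to its severity weight.
    Returns a score in [0, 1] (assumed aggregation: weighted mean per
    visit, then a plain mean across visits).
    """
    total_weight = sum(weights.values())
    if not visit_risks or total_weight == 0:
        return 0.0
    per_visit = [
        sum(weights.get(j, 0.0) * r for j, r in visit.items()) / total_weight
        for visit in visit_risks
    ]
    return sum(per_visit) / len(per_visit)
```

With weights of 3.0 for pain and 1.0 for cost, a patient with maximal pain risk at both of two visits and maximal cost risk at the second would score (0.75 + 1.0) / 2 = 0.875.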

Figure 3:


Risk assessment equations and probabilities

Figure 4:


Methodology for early-warning and leading indicator analysis

Risk estimation and alerts

The next step is to compute the risk of non-adherence for each patient based on the risk factors gleaned from the medical records. First, a risk matrix is constructed for each patient, with the patient visit identifiers as rows and the risk factors as columns. If a patient visit is influenced by multiple risk factors, the risk matrix for that patient will include a value of “W(ij)” for that visit (row “i”) in each of the corresponding dominant risk factor columns (“j”). Total risk is computed by a non-linear weighted scoring model, where weights are assigned to the risk factors based on severity. We then compute the probability of non-adherence for each patient “i”, which is defined by equation (5). The estimates for the values “b(j)” and W(ij) are obtained using non-linear regression functions (e.g. the logit function, shown in Figure 3, equation 5) based on the training data set. We can then compute the probability of non-adherence, P(i).

$$P(i) = \frac{e^{\,b_0 + \sum_j b(j)\,R(i,j)}}{1 + e^{\,b_0 + \sum_j b(j)\,R(i,j)}} \qquad (5)$$
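The logit form of P(i) can be evaluated with a few lines of code. In this sketch the intercept `b0`, the coefficients `b` and the per-factor risks are placeholders supplied by the caller; in the paper they are estimated from the training data, and no fitted values are reported. The sum over risk factors j is assumed from the fact that P(i) depends only on the patient.

```python
import math

def non_adherence_probability(b0, b, risks):
    """Logistic probability of non-adherence for one patient.

    b0: intercept.
    b: dict mapping risk factor j to its coefficient b(j).
    risks: dict mapping risk factor j to the patient's risk R(i, j);
        factors missing from this dict contribute zero.
    """
    z = b0 + sum(b[j] * risks.get(j, 0.0) for j in b)
    # Numerically stable evaluation of e^z / (1 + e^z).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

When every term in the exponent is zero the function returns 0.5, and the probability rises monotonically with any risk whose coefficient is positive.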

A sample 4x4 risk matrix is shown below for a patient who visited a clinic 4 times for treatment. The 4 risk factors were gleaned from the unstructured visit notes. Each row is a patient visit, and each column is a risk factor gleaned from that visit.

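$$\text{Risk matrix example:}\quad \begin{pmatrix} W_{11} & 0 & W_{13} & W_{14} \\ W_{21} & W_{23} & 0 & 0 \\ 0 & W_{32} & W_{33} & 0 \\ 0 & 0 & 0 & W_{44} \end{pmatrix}$$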
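A risk matrix of this shape can be assembled from per-visit extractions as in the sketch below. The factor names and the numeric values are invented for illustration; only the layout (visits as rows, risk factors as columns, zero where a factor was not observed) follows the text.

```python
def build_risk_matrix(visits, factors):
    """Build a visits-by-factors risk matrix as a list of rows.

    visits: list with one dict per visit, mapping each risk factor
        observed at that visit to its value W(ij).
    factors: ordered list of risk-factor names (the columns).
    Factors not observed at a visit are recorded as 0.0.
    """
    return [[visit.get(f, 0.0) for f in factors] for visit in visits]

# Hypothetical patient with 4 visits and 4 risk factors.
factors = ["fatigue", "pain", "cost", "lives alone"]
visits = [
    {"fatigue": 0.6, "cost": 0.3, "lives alone": 0.5},
    {"fatigue": 0.7, "pain": 0.4},
    {"pain": 0.5, "cost": 0.4},
    {"lives alone": 0.8},
]
matrix = build_risk_matrix(visits, factors)
```

Each row of `matrix` can then be passed to whatever scoring function is used to turn a visit's factors into a risk value.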

Risk indicators

The next step in the methodology is to apply certain rules to derive the Pareto distribution of the dominant risk factors across the patient demographics. The overall risk of patient non-adherence is computed using a weighted scoring function, where regression (linear and non-linear) and probabilistic methods are employed to compute risk based on risk factors and patient attributes (e.g. age, disease type, stage, smoking history, family history and others).

Dominant Risk Factors

We then extract the potential risk factors from the patient notes and apply certain rules to extract the dominant risk-factors across the patient demographics. The dominance can be based on frequency of occurrence of risk-factors in the text or simply severity of the risk factor. This includes applying text analysis, regression (linear and non-linear) and classification methods to classify patient behavior based on risk factors and patient attributes (e.g. age, disease type, stage, smoking history, family history and other relevant characteristics).
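Ranking risk factors by dominance, whether by frequency alone or by frequency scaled by severity, can be sketched as follows. The function name `dominant_factors`, the per-patient counting rule (each factor counted once per patient) and the multiplication of frequency by severity are assumptions chosen for this example.

```python
from collections import Counter

def dominant_factors(patient_factors, severity=None, top=3):
    """Rank risk factors across patients, Pareto style.

    patient_factors: list with one iterable of risk factors per patient.
    severity: optional dict factor -> severity weight; when given, the
        ranking score is frequency times severity (default weight 1.0).
    top: number of factors to return.
    """
    # Count each factor at most once per patient.
    counts = Counter(f for factors in patient_factors for f in set(factors))

    def score(factor):
        weight = severity.get(factor, 1.0) if severity else 1.0
        return counts[factor] * weight

    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda f: (-score(f), f))
    return ranked[:top]
```

Passing a severity table lets a rare but serious factor outrank a common but mild one.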

Validation

Finally, a validation step is done to examine the performance of the algorithms on the patient records. First, the algorithms are trained on a large enough training set (around 60% of the data set). Once trained, the algorithms are then tested on a “testing set” (20% of the data) consisting of patient records different from the training set to verify that the risk factors extracted are indeed the dominant factors for patient non-adherence (as shown in Figure 4). A third set of patient records (the remaining 20%), called the validation set, is selected for analysis and error detection in the algorithms (e.g. false positives).
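The 60/20/20 partition described above can be sketched as a shuffled split. Splitting on patient identifiers (so that no patient appears in more than one set) and the fixed random seed are assumptions made for this example; the paper does not describe how the partition was drawn.

```python
import random

def split_patients(patient_ids, seed=0):
    """Split patient identifiers into training, testing and validation sets.

    Roughly 60% go to training and 20% to testing; the remainder
    (about 20%) forms the validation set. The three sets are disjoint.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # reproducible shuffle
    n_train = len(ids) * 60 // 100
    n_test = len(ids) * 20 // 100
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    validation = ids[n_train + n_test:]
    return train, test, validation
```

Using integer arithmetic for the cut points avoids rounding surprises, and any leftover records fall into the validation set.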

Figure 4 shows an approach to the validation, testing and alert generation based on a variety of risk factors (e.g. treatment cycle, side effects, costs, social status and others) and patient attributes (e.g. age and others). We considered 70 risk factors in evaluating the risk scores. Using these risk-factors we computed the risk score for all of the patients and classified them into groups by age. In Figure 5 we show the computed risk scores for each of the age groups from 24 to 98 years of age. Patients 65 years of age and above showed a higher risk of non-adherence when compared to the rest. Surprisingly, patients between 24 and 40 years of age also exhibited a high risk of non-adherence.

Figure 5:


Average risk score by age-group (exponential moving average), normalized between 0 and 1.0
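The smoothing and normalization named in the Figure 5 caption can be sketched as below. The smoothing factor `alpha`, the min-max normalization and both function names are assumptions; the paper does not state which parameters or normalization were used.

```python
def exponential_moving_average(values, alpha=0.3):
    """Smooth a sequence (e.g. mean risk score per age group, in age order).

    Each output is alpha * current value + (1 - alpha) * previous output;
    the first output equals the first input.
    """
    smoothed = []
    for value in values:
        if not smoothed:
            smoothed.append(value)
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

def normalize(values):
    """Rescale values linearly so the minimum is 0.0 and the maximum 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]
```

A larger `alpha` follows the raw per-age scores more closely, while a smaller one gives a smoother curve across neighboring age groups.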

Case studies

We present two case-studies of cancer patient behavior, where early detection could have helped to target early interventions:

  • A patient (61 years of age) was diagnosed with stage-2 colorectal cancer (neoplasm of the rectosigmoid colon). During the early stages of the treatment, side-effects left the patient too fatigued to handle his/her daily job. The patient complained about fatigue and pain during the visits, but not much was done by the clinicians. Within a few months after the 3rd visit the patient discontinued treatment. This patient had complained about weakness and lack of energy to continue work very early on during treatment. Had these signals been taken into serious consideration, they could have enabled early intervention before the patient decided to drop off the treatment regimen.

  • An elderly patient (over 65 years of age), diagnosed with breast cancer, had a prolonged treatment cycle spanning 5 years, during which the patient experienced several side-effects that caused missed appointments and non-adherence to medications. Not much was done early on by the clinicians to spot the issues and enable interventions. Figure 6 illustrates a spider chart (Pareto model) of risk-factors for this patient along the outer circle of the spider web diagram. The strength of the risk-factors is shown by the solid line (closer to the outer circle of the web). The inner line shows the average strength of the risk factors of patients in the same age-group and disease area.

Figure 6:


Dominant Risk factors (outer ring)

Conclusion

We proposed a novel, real-time methodology that uses data-mining and text processing of unstructured text embedded in electronic medical records to detect whether patients are at risk of non-adherence to treatment regimens. We presented a rigorous mathematical model and methodology to extract and analyze risk-factors of non-adherence by using a rich risk-factor dictionary. Our results indicated that over 30% of the patients were likely to drop off treatment based on several risk factors. We also showed that two distinct patient groups (ages less than 40 and over 75) were at a high risk of non-adherence compared to other age groups. Further work needs to be done in leveraging the gleaned risk-factors to target interventions over multiple patient groups for improving adherence, care and quality of life for the patients.

Figure 7:


Risk score for each visit (normalized between 0 and 0.5).

References

  • 1. DiMatteo, et al. “50 Years of adherence research.” 2006.
  • 2. Blanchard Christina G, et al. “Physician Behaviors, Patient Perceptions, and Patient Characteristics as Predictors of Satisfaction of Hospitalized Adult Cancer Patients.” Cancer. 1990;65:186–192.
  • 3. Razavi Amir R, Gill Hans, Åhlfeldt Hans, Shahsavar Nosrat. “Non-adherence with a postmastectomy radiotherapy guideline: Decision tree and cause analysis.” BMC Medical Informatics and Decision Making. 2002.
  • 4. Chakrabarthi S, et al. “Fast and Accurate Text Classification via Multiple Linear Discriminant Projections.” VLDB Journal. 2003.
  • 5. Sairamesh J, et al. “Reducing Business Surprises through Proactive real-time sensing and Alert management.” Proceedings of the USENIX workshop on End-to-End, Sense and Respond Systems, Seattle; June 5th, 2005.
  • 6. Department of Health and Human Services, Centers for Disease Control. Chronic Disease Overview. Available at http://www.cdc.gov/nccdphp/overview.htm#2
  • 7. Department of Health and Human Services, Center for Disease Control. Chronic Disease Overview. Accessed October 28, 2008. http://www.cdc.gov/nccdphp/overview.htm
  • 8. Burke LE, Ockene IS, editors. Adherence in Healthcare and Research. Armonk, NY: Futura; 2001. pp. 3–21.
  • 9. DiMatteo MR, et al. “Physicians’ Characteristics Influence Patients’ Adherence to Medical Treatment: Results from the Medical Outcomes Study.” Health Psychology. 1993 Mar;12(1):93–102. doi: 10.1037/0278-6133.12.2.93.
  • 10. DiMatteo MR. “Variations in Patients’ Adherence to Medical Recommendations: A Quantitative Review of 50 Years of Research.” Medical Care. 2004 Mar;42(3):200–209. doi: 10.1097/01.mlr.0000114908.90348.f9.
  • 11. National Academies, Institute of Medicine Report. “To Err is Human.” National Academies Press; Washington DC: 1999.
  • 12. National Academies, Institute of Medicine Report. “Cancer Care for the Whole Patient: Meeting Psychosocial Health Needs.” National Academies Press; Washington DC: 2007.
  • 13. Starr P. The Social Transformation of American Medicine. Basic Books; 1983. pp. 355–360.
  • 14. Sairamesh J, et al. “Early Warning Systems for Patient Care.” Proceedings of DM-HI. 2008.

Articles from AMIA Annual Symposium Proceedings are provided here courtesy of American Medical Informatics Association
