Abstract
In a retrospective secondary-use EHR study identifying a cohort of non-valvular atrial fibrillation (NVAF) patients, chart abstraction was performed by two pairs of clinicians to create a gold standard for the risk measures CHA2DS2-VASc and HAS-BLED. Inter-rater reliability within each pair of clinicians for NVAF and the outcomes of interest was variable, ranging from extremely low to high agreement. To assess the chart abstraction process, a survey and a focus group were conducted. Survey findings revealed patterns of difficulty in assessing certain items dealing with temporality and social data. The focus group raised issues concerning the quality and completeness of EHR data, including missing encounters, truncated notes, and low granularity. It also raised the issue of the usability of the data system, the Clinical Data Viewer, which did not mirror a live EHR and made it difficult to record outcomes. Finally, the focus group found it difficult to infer certain outcomes, such as severity, from the provided data. These factors produced differences in clinician-rated outcomes.
Keywords: Medical chart review, electronic health record, qualitative research, human factors
Introduction
The key to improving a clinician chart abstraction process is to apply system design principles from human factors engineering and qualitative research design. These disciplines formally study the clinician's interactions with the process to gain insight and perspective that can guide future research, improve outcomes, and advance the system [1].
A major technological system involved in the chart abstraction process is the electronic health record (EHR) system. As of 2015, 84% of non-federal U.S. hospitals had adopted a basic EHR system, providing a wealth of data that can facilitate secondary retrospective observational research [2; 3]. Secondary use of EHR data “can enhance health care experiences for individuals, expand knowledge about disease and appropriate treatments, [and] strengthen understanding about the effectiveness and efficiency of our health care systems” [4]. However, various papers have outlined the challenges associated with using EHR data for research, including lack of completeness, lack of an entire patient record, and low granularity of data [5–9].
Another challenge is that an EHR contains two forms of data: structured data, the small proportion of fixed-field data codified with terminologies like ICD-9, and unstructured data, or free text [2]. Traditionally, acquiring unstructured clinical data has required human abstraction of patient records [2]. Methods have been developed to automatically extract information from free text locked in EHRs. One such method, high throughput phenotyping natural language processing (HTP-NLP), takes textual input and produces a version of the text annotated with ontologies and terminologies, such as SNOMED-CT, and with term-level negation and uncertainty, combining the related annotations into Compositional Expressions [2; 10; 11]. However, to assess the accuracy, sensitivity, and specificity of the automatic extraction of data using HTP-NLP, a gold standard built by clinician abstraction still needs to be done [9; 10; 12].
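To make the shape of this output concrete, the sketch below shows one hypothetical way such annotations could be represented; the classes, fields, and example concept codes are illustrative assumptions, not the actual HTP-NLP data model or API.

```python
# Hypothetical annotation shapes only -- not the actual HTP-NLP API.
# Shows a term-level annotation carrying a terminology code plus
# negation/uncertainty flags, grouped into a compositional expression.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TermAnnotation:
    text_span: str           # surface text, e.g. "atrial fibrillation"
    snomed_ct_code: str      # terminology concept identifier (illustrative)
    negated: bool = False    # e.g. "no evidence of ..."
    uncertain: bool = False  # e.g. "possible ..."

@dataclass
class CompositionalExpression:
    annotations: List[TermAnnotation] = field(default_factory=list)

expr = CompositionalExpression(annotations=[
    TermAnnotation("atrial fibrillation", "49436004"),
    TermAnnotation("mitral stenosis", "79619009", negated=True),
])
```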
EHR data were used in a randomized retrospective study of patients with non-valvular atrial fibrillation (NVAF). Atrial fibrillation is associated with an increased risk of stroke and heart failure and with increased healthcare costs; NVAF in particular carries a five times greater risk of stroke [13; 14]. Therefore, classifying disease status for the components of the CHA2DS2-VASc score, a stroke risk scheme that combines the CHADS2 score with additional moderate risk factors, and the HAS-BLED score for major bleeding risk is important [15]. The objectives of the study were cohort identification, assessment of appropriate treatment with an oral anticoagulant (OAC), and the use of automated methods to acquire this information by way of structured data (ICD-9 codes) and unstructured data (HTP-NLP).
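For concreteness, the sketch below shows how dichotomous components such as those abstracted in this study roll up into the two scores using the published point weights (stroke/TIA and age ≥ 75 carry two points in CHA2DS2-VASc; each positive HAS-BLED component carries one); the function and argument names are illustrative, not taken from the study instrument.

```python
# Illustrative roll-up of dichotomous components into the two scores,
# using the published point weights; argument names are hypothetical.

def cha2ds2_vasc(chf, htn, age_75_plus, dm, stroke_tia, vascular,
                 age_65_74, female):
    """CHA2DS2-VASc: stroke/TIA and age >= 75 score 2 points; the rest 1."""
    return (chf + htn + 2 * age_75_plus + dm + 2 * stroke_tia
            + vascular + age_65_74 + female)

def has_bled(htn, renal, liver, stroke, bleeding, labile_inr,
             elderly, drugs, alcohol):
    """HAS-BLED: one point per positive component (renal/liver and
    drugs/alcohol are each scored separately)."""
    return sum([htn, renal, liver, stroke, bleeding, labile_inr,
                elderly, drugs, alcohol])

# Example: a 78-year-old woman with hypertension and a prior stroke.
print(cha2ds2_vasc(chf=0, htn=1, age_75_plus=1, dm=0, stroke_tia=1,
                   vascular=0, age_65_74=0, female=1))  # -> 6
```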
The construction of a gold standard for this study through clinician abstraction found variable inter-rater reliability, assessed using Cohen’s kappa. To better understand these results, it is crucial to understand the clinicians’ perspective on the chart abstraction process. To explain the outcomes and the use of the Clinical Data Viewer, the technological system that supplied the de-identified charts to the clinicians, a sequential explanatory mixed-methods study was conducted with two components, a structured survey and a focus group [1; 16–18]. The survey assessed the difficulty of categorizing each outcome as positive or negative and was given prior to the focus group to identify additional outcomes that needed to be considered in the investigator’s questions. The focus group, held with the clinicians involved in the chart abstraction process, was built around broad and focal questions concerning the process; data were collected on individual opinions and group interaction to clarify individual and shared perspectives on the process and to allow corrections and consensus to be reached on the topics [19; 20].
1. Methods
After IRB approval, patient data were extracted from the AllScripts EHR for the UBMD faculty practice. An NVAF cohort of patients aged 18–90 with a diagnosis of Atrial Fibrillation or Atrial Flutter was included in the study. Patients who received OAC therapy for an indication other than NVAF, had a mechanical prosthetic valve, hemodynamically significant mitral stenosis or aortic stenosis, were pregnant, had transient AF due to reversible conditions, or had active infective endocarditis were excluded.
After excluding patients with clinical notes of fewer than 20 characters, to ensure that substantive clinical notes existed, 96,681 patients were assessed for atrial fibrillation using HTP-NLP and ICD-9 coding, producing 3,448 cases by HTP-NLP and 3,155 cases by ICD-9. After the exclusion criteria above were applied, 1,849 cases assessed as NVAF by both ICD-9 and HTP-NLP and 873 cases assessed by HTP-NLP alone remained. A random sample of 150 (Sample 1) was taken from the 873 cases and a random sample of 150 (Sample 2) was taken from the 1,849 cases.
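The selection and sampling steps can be pictured as follows; this is a minimal sketch under stated assumptions (hypothetical record fields and an arbitrary seed), not the study's actual pipeline code.

```python
# Minimal sketch of the cohort selection and sampling described above.
# Record fields ('note_text', 'nvaf_nlp', 'nvaf_icd9') and the seed are
# hypothetical; the study's actual pipeline is not published here.
import random

def build_samples(patients, sample_size=150, seed=2017):
    # Require substantive notes (the study used a 20-character minimum).
    eligible = [p for p in patients if len(p["note_text"]) >= 20]
    # Flags assume the exclusion criteria have already been applied.
    both = [p for p in eligible if p["nvaf_nlp"] and p["nvaf_icd9"]]
    nlp_only = [p for p in eligible if p["nvaf_nlp"] and not p["nvaf_icd9"]]
    rng = random.Random(seed)
    sample1 = rng.sample(nlp_only, sample_size)  # from the 873 NLP-only cases
    sample2 = rng.sample(both, sample_size)      # from the 1,849 dual-flagged cases
    return sample1, sample2
```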
The clinical chart abstraction was performed by four clinicians recruited from the State University of New York Clinical Informatics fellowship program, two for each sample. The first pair of clinician informaticians was a Surgeon (Clinician 1) and an Internist Nephrologist (Clinician 2); the second pair was an Internist (Clinician 3) and a Surgical Pathologist (Clinician 4). Participants had a range of work experience, practiced in urban and rural settings, and two of the four physicians had previous experience with chart abstraction. Specific demographic characteristics were not collected, to protect participant confidentiality and to meet exemption from institutional review board approval.
The clinicians coded the data independently. Multiple encounters with clinical notes, problem lists, lab observations, medications, and allergies were provided in a Clinical Data Viewer with access to only de-identified data. The Clinical Data Viewer allowed viewing of the data through a secure URL, making access to the data easier. Clinicians had to import the research IDs and, for each research ID, click on the encounters present in the record. Once they clicked on an encounter, the notes, medications, and other data would populate the viewer. In a box on the right-hand side of the screen, observations outlined in the instrument could be coded for the patient by clicking on a drop-down menu, then clicking on the outcome grouping (atrial fibrillation, CHA2DS2-VASc score, HAS-BLED score, and other outcomes, such as treated with OAC) and checking the relevant box. Selecting the boxes for each outcome grouping and clicking submit automatically populated each patient’s dichotomous outcomes into a database. Clinicians remained blinded to the specific research questions and hypotheses concerning these measures. As a group with the principal investigator, two cases were coded together to provide a tutorial on using the Clinical Data Viewer and on coding the outcomes consistently. This method of chart review adheres to the majority of the current guidelines for chart reviews [21; 22].

Statistical analysis of the agreement between each pair of clinicians was done using R 3.3.2. Agreement was assessed with Cohen’s kappa, a measure of inter-rater reliability that accounts for agreement by chance, and 95% bootstrapped confidence intervals.
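The analysis itself was done in R 3.3.2, as noted above; for concreteness, the following is a minimal Python sketch of the same computation, a percentile bootstrap around Cohen's kappa, with illustrative data.

```python
# Minimal sketch: Cohen's kappa with a percentile bootstrap 95% CI,
# mirroring the analysis described above (which used R 3.3.2).
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_with_ci(rater1, rater2, n_boot=2000, seed=0):
    """Cohen's kappa for two raters plus a percentile bootstrap 95% CI."""
    r1, r2 = np.asarray(rater1), np.asarray(rater2)
    kappa = cohen_kappa_score(r1, r2)
    rng = np.random.default_rng(seed)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(r1), len(r1))  # resample patients with replacement
        b = cohen_kappa_score(r1[idx], r2[idx])
        if not np.isnan(b):  # skip degenerate resamples with a single class
            boot.append(b)
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return kappa, (lo, hi)

# Example: dichotomous codes from two raters over ten patients.
k, (lo, hi) = kappa_with_ci([1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
                            [1, 0, 1, 0, 0, 0, 1, 1, 1, 1])
print(f"kappa={k:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```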
The range in inter-rater reliability was investigated with a mixed-methods analysis comprising two components. The structured survey component, using a 5-point Likert scale (very difficult, difficult, neutral, easy, or very easy), was given to assess the difficulty of evaluating each outcome in the instrument for each clinician. The survey can be acquired from the first author. The focus group occurred in person from 9:30 a.m. to 11:00 a.m., four months after the final data abstraction. All four clinicians who participated in the chart abstractions were in attendance. The session was moderated by the first author, who has an MS in Statistics, has prior work experience in survey assessments, and is a PhD candidate in Biomedical Informatics [20; 23]. In addition, the investigator completed appropriate human subjects research training. The session was dictated, and field notes were created immediately following the focus group. The focus group dictation was subjected to content analysis using inductive analysis [23]. No a priori hypotheses were made, and findings arose directly from the raw data [20; 23]. Topics discussed were the quality and content of data sources, the use of the Clinical Data Viewer, outcomes that were hard to operationalize and required assumptions to be made, and reasons certain outcomes were difficult. Data were analyzed and categorized into three themes. The data and respective quotes were provided to all study participants so they could confirm validity. The participants provided feedback and indicated that the results and interpretations were valid. Two recommended changes were incorporated.
2. Results
2.1. Inter-rater reliability between Clinicians
Inter-rater reliability was measured between the clinicians in each pair using Cohen’s kappa with bootstrapped 95% CIs for NVAF, CHA2DS2-VASc score components, and HAS-BLED score components, all of which are dichotomous; results are shown in Table 1. Depending on the variable, inter-rater reliability ranged from no agreement to strong agreement and was often not consistent across the two samples, as with Hypertension. Inter-rater reliability is not included for treatment and other components but can be requested from the first author.
Table 1.
Cohen’s kappa and bootstrapped 95% CI for inter-rater reliability between each set of clinicians for Samples 1 and 2
| Variable | Sample 1 | Sample 2 |
|---|---|---|
| NVAF | 0.58 (0.443, 0.706) | 0.522 (0.233, 0.747) |
| CHA2DS2-VASc score | | |
| Congestive Heart Failure | 0.424 (0.268, 0.571) | 0.688 (0.562, 0.802) |
| Hypertension and HBP | 0.707 (0.571, 0.824) | 0.21 (0.043, 0.383) |
| Diabetes Mellitus | 0.812 (0.701, 0.907) | 0.811 (0.706, 0.901) |
| Stroke/TIA | 0.543 (0.35, 0.713) | 0.838 (0.713, 0.937) |
| Vascular Disease | 0.3 (0.171, 0.433) | 0.517 (0.389, 0.641) |
| HAS-BLED score | | |
| Hypertension | 0.591 (0.442, 0.727) | 0.273 (0.094, 0.454) |
| Renal Disease | 0.516 (0.356, 0.667) | 0.850 (0.753, 0.934) |
| Liver Disease | 0 | 0.277 (0, 0.677) |
| Stroke | 0.505 (0.306, 0.681) | 0.815 (0.684, 0.922) |
| Disposition to Bleeding | −0.013 (−0.036, 0) | −0.011 (−0.023, 0) |
| Labile INR | −0.023 (−0.040, 0) | −0.01 (−0.023, 0) |
| Medication Bleeding | −0.087 (−0.194, 0.01) | 0.355 (0.171, 0.524) |
| Alcohol (8 drinks a week) | −0.024 (−0.040, 0) | 0.231 (−0.029, 0.658) |
2.2. Survey Results
The survey identified outcomes that the clinicians felt were primarily easy or difficult to assess. Diagnosis of atrial fibrillation; history of hypertension, high blood pressure, and diabetes mellitus; medication usage predisposing to bleeding; and treated with an OAC were categorized as easy to assess by all clinicians (100% marked “easy” or “very easy”). 75% of clinicians said it was easy to assess thrombocytopenia with a platelet count below 50,000, and 75% said it was easy or very easy to assess history of stroke, history of thromboembolism, and history of renal disease. All clinicians (100%) rated clinically significant gastrointestinal (GI) bleed within the last 6 months as very difficult to assess, and all rated alcohol use greater than 8 drinks a week as very difficult or difficult to assess. 75% of clinicians found it difficult or very difficult to assess major surgical procedure within 30 days, bleeding diathesis, hemorrhagic disorder, current intraarticular bleeding, and any active bleed. Ratings for the remaining instrument outcomes were spread across the scale.
2.3. Focus Group Results
Three themes arose in the focus group. The first was a lack of completeness of data, low granularity, and missing encounters. Clinicians felt that some data were truncated, so that complete notes were not provided for a patient. For example, Clinician 4 described a patient who presented with a history of coughing for two weeks, but no diagnosis was made and the cough was not mentioned in subsequent encounters. Most of the outcome observations came from the problem list, labs, and medication list. Although “history of” outcomes could be assessed, which is consistent with the inter-rater reliability and survey results, the clinicians universally found it much more difficult to assess outcomes that required time indexing. Medication dates, lab dates, and encounter dates were not provided, and therefore the temporal context of the outcome was lost. Clinician 2 stated that when looking at the outcome “Clinically significant GI bleed within last 6 months”, the clinician didn’t know if “it was six months ago or a year ago.” In addition, assumptions had to be made when assessing outcomes because descriptions of disease severity were lacking. Most often, if a medication was present for a disease, the clinician coded the disease as chronic.
Not all of the inclusion/exclusion criteria, especially the process to follow when data were absent, were provided to the clinicians. Confusion surrounded the outcomes history of stroke, history of vascular disease, and history of hypertension, primarily when specific data and higher granularity were absent. For instance, coding a stroke properly required knowing its type: observing an ischemic stroke would also cause vascular disease to be marked, while observing a hemorrhagic stroke would also cause hypertension to be marked. More often than not, vascular disease, which had weak to moderate inter-rater reliability across the samples, was inferred when a stroke was observed. Clinician 1 stated, “As soon as you saw stroke, you marked vascular.” Likewise, Clinician 3 stated, “If it is mentioned in the note that it is hemorrhagic stroke then it was not marked as vascular disease. If there was no specification, it was a luck of the draw, leading to uncertainties in marking.” Social data, such as alcohol use of greater than 8 drinks a week, were also often omitted from the EHR. Alcohol use was rarely in the patient history, although the clinicians reported that it would suddenly appear in an encounter problem list and then not occur in subsequent encounter notes. This was the reason so few cases of alcohol use were observed (3 in Sample 1 and 7 in Sample 2).
The second theme concerned usability and needed improvements in the technological design of the Clinical Data Viewer [1; 17]. There was consensus among all clinicians that the data viewer would freeze, was not intuitive, and took an excessive time to load. An added step required them to clear all outcome check boxes manually when moving on to a new patient; otherwise, the outcomes for the previous patient would remain and be coded incorrectly for the new patient. This created confusion over whether checked boxes belonged to the previous patient or the current one, adding unnecessary inconsistency and errors. The viewer was especially unintuitive in the menu for OAC observations: the outcome “No OAC but should be on OAC” triggered another drop-down menu with a long list of medical and social outcomes, even though this type of information was included within an encounter. In addition, the clinicians recommended that the viewer notify them when a case had been submitted. Lacking such a notification, the clinicians settled on a consensual validation process that required them to re-load the patient ID, click on the patient, click on the outcome grouping, and submit; they would then either receive the message “This has already been submitted” or, if the message did not appear, have to redo the outcome grouping. Each clinician also kept a separate list of research IDs and tracked completed cases manually. Finally, the clinicians remarked that they wished they had continual access to a patient’s outcomes and could change some of the submitted outcomes; once a case was submitted, it was locked and no edits could be made.
The third theme was a feeling of disconnect from the patient. The clinicians felt as if their job, according to Clinician 3, was to “mainly read the free text and learn about the patient in a detached way.” Clinician 2 felt the same way, stating “there is an emotional aspect to knowledge, you have age [and other related outcomes, allowing you to] put the patient together, forming a picture in your mind.” The Clinical Data Viewer also did not provide basic demographic characteristics such as gender, age, and race, leaving a gap in the picture of the patient.
3. Discussion and Conclusions
The study identified three key themes that led to poor inter-rater reliability in clinician abstraction. Only a fraction of a patient’s lifetime of care will be housed in any one EHR, and prior studies have similarly highlighted the lack of completeness, granularity, and temporality [9]. A systematic review showed that, of 98 studies, 55% used additional non-EHR sources of data and 40% supplemented EHR data with patient-reported data [6]. As with the relationship between stroke, vascular disease, and hypertension here, other studies found that study variables had to be manually inferred by putting different pieces of the record together, highlighting the issue of granularity [5; 6]. In a survival analysis study of pancreatic cancer, the issue of temporality was also prominent, it being “difficult to determine the exact period for medical interventions or events, e.g. the duration of chemotherapy treatments” [5]. To improve patient care, either at the point of care by ensuring problem lists, medications, labs, and allergies are up to date, or through secondary research, EHR data need to become more complete and more granular, with important events and outcomes time-indexed.
In addition, assessing human factors and conducting a usability study, which evaluates how a particular process or product works for individuals, is an important aspect of any research study involving manual EHR abstraction [18]. Beyond data quality, the Clinical Data Viewer created additional problems for the clinicians that could have led to errors and low inter-rater reliability. When creating the Clinical Data Viewer, user-centered software design, which involves users in the earliest phases of software design and testing, should have been employed [18].
For reliable annotation of free-text data in an EHR, the extraction chart should mirror the live chart as closely as possible; data can otherwise be lost or obscured in the process, leading to errors in human abstraction. Although chart abstraction for research is flawed, it provides valuable outcome information and can be seen as a good depiction of the data actually in the EHR, offering a sound benchmark against which to assess the accuracy of automated methods such as HTP-NLP.
Acknowledgments
Funding for this project was supported by the National Center for Advancing Translational Sciences of the National Institutes of Health under Award Number UL1TR001412. This study was also supported in part by an NIH NCI/VA BD-STEP Fellowship in Big Data Science and funded in part by a grant from Pfizer, Inc.
References
- 1. Beuscart-Zephir MC, Elkin P, Pelayo S, Beuscart R. The human factors engineering approach to biomedical informatics projects: state of the art, results, benefits and challenges. Yearb Med Inform. 2007:109–127.
- 2. Elkin PL, Trusko BE, Koppel R, Speroff T, Mohrer D, Sakji S, Gurewitz I, Tuttle M, Brown SH. Secondary use of clinical data. Stud Health Technol Inform. 2010;155:14–29.
- 3. Henry J, Pylypchuk Y, Searcy T, Patel V. Adoption of electronic health record systems among U.S. non-federal acute care hospitals: 2008–2015. ONC Data Brief. 2016;(35).
- 4. Safran C, Bloomrosen M, Hammond WE, Labkoff S, Markel-Fox S, Tang PC, Detmer DE. Toward a national framework for the secondary use of health data: an American Medical Informatics Association white paper. J Am Med Inform Assoc. 2007;14(1):1–9. doi:10.1197/jamia.M2273.
- 5. Botsis T, Hartvigsen G, Chen F, Weng C. Secondary use of EHR: data quality issues and informatics opportunities. AMIA Jt Summits Transl Sci Proc. 2010:1–5.
- 6. Hersh WR, Weiner MG, Embi PJ, Logan JR, Payne PR, Bernstam EV, Lehmann HP, Hripcsak G, Hartzog TH, Cimino JJ, et al. Caveats for the use of operational electronic health record data in comparative effectiveness research. Med Care. 2013;51(8 Suppl 3):S30–37. doi:10.1097/MLR.0b013e31829b1dbd.
- 7. Root causes underlying challenges to secondary use of data. AMIA Annual Symposium Proceedings. American Medical Informatics Association; 2011.
- 8. Weiskopf NG, Hripcsak G, Swaminathan S, Weng C. Defining and measuring completeness of electronic health records for secondary use. J Biomed Inform. 2013;46(5):830–836. doi:10.1016/j.jbi.2013.06.010.
- 9. Richesson R, Horvath M, Rusincovitch S. Clinical research informatics and electronic health record data. Yearb Med Inform. 2014;9(1):215. doi:10.15265/IY-2014-0009.
- 10. Elkin PL, Froehling DA, Wahner-Roedler DL, Brown SH, Bailey KR. Comparison of natural language processing biosurveillance methods for identifying influenza from encounter notes. Ann Intern Med. 2012;156(1 Pt 1):11–18. doi:10.7326/0003-4819-156-1-201201030-00003.
- 11. Schlegel DR, Crowner C, Lehoullier F, Elkin PL. HTP-NLP: a new NLP system for high throughput phenotyping.
- 12. NLP-based identification of pneumonia cases from free-text radiological reports. AMIA; 2008.
- 13. Haim M, Hoshen M, Reges O, Rabi Y, Balicer R, Leibowitz M. Prospective national study of the prevalence, incidence, management and outcome of a large contemporary cohort of patients with incident non-valvular atrial fibrillation. J Am Heart Assoc. 2015;4(1). doi:10.1161/JAHA.114.001486.
- 14. Wolf PA, Abbott RD, Kannel WB. Atrial fibrillation as an independent risk factor for stroke: the Framingham Study. Stroke. 1991;22(8):983–988. doi:10.1161/01.str.22.8.983.
- 15. Furie KL, Goldstein LB, Albers GW, Khatri P, Neyens R, Turakhia MP, Turan TN, Wood KA, American Heart Association Stroke Council, Council on Quality of Care and Outcomes Research, et al. Oral antithrombotic agents for the prevention of stroke in nonvalvular atrial fibrillation: a science advisory for healthcare professionals from the American Heart Association/American Stroke Association. Stroke. 2012;43(12):3442–3453. doi:10.1161/STR.0b013e318266722a.
- 16. Pelayo S, Anceaux F, Rogalski J, Elkin P, Beuscart-Zephir MC. A comparison of the impact of CPOE implementation and organizational determinants on doctor-nurse communications and cooperation. Int J Med Inform. 2013;82(12):e321–330. doi:10.1016/j.ijmedinf.2012.09.001.
- 17. Beuscart-Zephir MC, Aarts J, Elkin PL. Human factors engineering for healthcare IT clinical applications. Int J Med Inform. 2010;79(4):223–224. doi:10.1016/j.ijmedinf.2010.01.010.
- 18. Elkin PL. Human factors engineering in HI: so what? who cares? and what's in it for you? Healthc Inform Res. 2012;18(4):237–241. doi:10.4258/hir.2012.18.4.237.
- 19. Morgan DL. Practical strategies for combining qualitative and quantitative methods: applications to health research. Qual Health Res. 1998;8(3):362–376. doi:10.1177/104973239800800307.
- 20. Tong A, Sainsbury P, Craig J. Consolidated criteria for reporting qualitative research (COREQ): a 32-item checklist for interviews and focus groups. Int J Qual Health Care. 2007;19(6):349–357. doi:10.1093/intqhc/mzm042.
- 21. Vassar M, Holzmann M. The retrospective chart review: important methodological considerations. J Educ Eval Health Prof. 2013;10:12. doi:10.3352/jeehp.2013.10.12.
- 22. Sarkar S, Seshadri D. Conducting record review studies in clinical practice. J Clin Diagn Res. 2014;8(9):JG01–04. doi:10.7860/JCDR/2014/8301.4806.
- 23. Thomas DR. A general inductive approach for analyzing qualitative evaluation data. Am J Eval. 2006;27(2):237–246.
