Abstract
Integrative medicine, including complementary and alternative medicine (CAM), has become more available through mainstream health providers. Acupuncture is one of the most widely used CAM therapies, though its efficacy for treating various conditions requires further investigation. To assist with such investigations, we set out to identify acupuncture patient cohorts using a nationwide clinical data repository. Acupuncture patients were identified using both structured data and unstructured free text notes. Using structured data consisting of CPT codes, 44,960 acupuncture patients were identified. Using unstructured free text clinical notes, we trained a support vector classifier with 86% accuracy and were able to identify an additional 101,628 acupuncture patients not identified through structured data (a 226% increase). In addition, characteristics of the patients identified through structured and unstructured data were compared, which showed differences in geographic locations and medical service usage patterns. Patients identified with structured data displayed a consistently higher use of the Veterans Health Administration (VHA) medical system.
Introduction
Over the past decade, integrative medicine has gained increasing attention from providers and researchers. Compared to traditional healthcare, integrative medicine’s emphasis on a partnership between patients and clinicians takes a holistic view of patients’ health and well being, and incorporates complementary and alternative medicine (CAM) approaches such as acupuncture and massage into treatment options. Many large hospitals now provide some form of integrative health services to their patients.
At the same time, the safety and effectiveness of many CAM treatments are not sufficiently understood. For instance, acupuncture is widely practiced to relieve pain and treat certain health problems, but debate on its effectiveness continues in the literature. Witt et al. evaluated the clinical and economic effectiveness of acupuncture for chronic low back pain in a large randomized controlled trial (RCT).(1) They demonstrated that acupuncture in addition to routine care considerably improved clinical outcomes and was relatively cost-effective. A systematic review of RCTs of acupuncture for pain was published by Linde et al. It included thirteen trials (3,025 patients) with a variety of pain conditions and found a small analgesic effect from acupuncture, hardly distinguishable from bias.(2) Another systematic review of 23 RCTs on the effectiveness of acupuncture for nonspecific lower back pain by Yuan et al. showed moderate evidence that acupuncture is more effective than no treatment, and strong evidence of no significant difference between acupuncture and sham acupuncture, for short-term pain relief.(3) This review concluded that acupuncture can be a useful supplement to other forms of conventional therapy for nonspecific lower back pain, but the effectiveness of acupuncture compared with conventional therapies requires further investigation. Considering that acupuncture is one of the most studied CAM modalities, these uncertain results indicate that more research is needed to ascertain the efficacy of CAM practices.
Secondary analysis of electronic medical records (EMR) is a powerful approach to study treatment safety and effectiveness. At the Veterans Health Administration (VHA), we have begun leveraging its nationwide EMR repository to study the use of acupuncture to manage pain and control other symptoms like nausea. A critical step in EMR secondary analysis is cohort identification.
In this paper, we describe our effort to identify a cohort of patients who had undergone acupuncture treatments while receiving care from the VHA. Both structured data and unstructured data were used. To understand the impact of data source on the resultant cohorts, the cohorts identified by the two methods were compared in terms of size and patient characteristics.
Materials and Methods
Data Source
Data for this study was procured through the Veterans Informatics and Computing Infrastructure (VINCI), VHA. The VHA comprises 152 medical facilities in addition to 1,400 clinics that are community-based and tailored to serve individuals on an outpatient basis, Vet Centers, community living centers, and Domiciles. In total, these facilities employ over 53,000 healthcare professionals who provide their services to over 8.3 million veterans on an annual basis. VINCI is a collaboration between the Office of Research and Development and the Office of Information and Technology in the U.S. Department of Veterans Affairs (VA), providing data and infrastructure needs of the VHA research community. VINCI provides access to structured and unstructured health information originating from the VISTA electronic health record system, and includes data for over 17 million patients. We identified patients receiving acupuncture treatments through structured as well as unstructured data using the process outlined in Figure 1.
Figure 1.
Acupuncture Patient Cohort Identification from Structured Data (SD) and Unstructured Data (UD)
Cohort Identification Using Structured Data
VHA offers many forms of CAM treatments from acupuncture to sweat lodge. Patients receiving specific treatments within the VHA system can be identified through Current Procedural Terminology (CPT) codes identifying specific patient procedures. Acupuncture treatments are represented by CPT codes 97780, 97781, 97810, 97811, 97813, and 97814.
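The CPT-based selection can be illustrated with a minimal sketch. The record layout (pairs of patient identifier and CPT code) and the function name are assumptions made for illustration; the actual VINCI tables are not described here.

```python
# CPT codes listed in the text as representing acupuncture treatments.
ACUPUNCTURE_CPT_CODES = {"97780", "97781", "97810", "97811", "97813", "97814"}


def structured_cohort(procedure_records):
    """Return the set of patient IDs with at least one acupuncture CPT code.

    `procedure_records` is a hypothetical iterable of (patient_id, cpt_code)
    pairs; the real source table layout is an assumption.
    """
    return {pid for pid, cpt in procedure_records if cpt in ACUPUNCTURE_CPT_CODES}


# Toy records, invented for illustration only.
records = [
    ("p1", "97810"),  # acupuncture code
    ("p1", "99213"),  # unrelated code
    ("p2", "80053"),  # unrelated code
    ("p3", "97814"),  # acupuncture code
]
sd_cohort = structured_cohort(records)
```

A patient enters the cohort on the first matching code, so repeated treatments do not change membership.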
Many non-standard treatments can be identified through the locations of patient visits. In the VHA, clinic “Stop Codes” are included in the outpatient visit records to indicate the clinic or work group providing specific services. We were, however, only able to identify a single location for acupuncture services using the “Stop Code”. Since acupuncture services are widespread in the VHA system, we resorted to CPT codes for their identification.
Cohort Identification Using Free Text Data
Structured data has been shown to be insufficient for cohort identification in many cases (4). Some patients receiving acupuncture will not have corresponding CPT codes assigned for various reasons. For example, many patients obtain treatment from non-VHA providers, particularly when VHA clinics offering a specific therapy are not available in the geographic area of the patient. In some cases, their VHA clinicians do not prescribe or authorize the treatments. Although they may not be recorded by CPT codes, many VHA healthcare providers do ask Veterans about the non-VHA treatments they are receiving and document them in narrative clinical notes. Thus, we searched unstructured, free text clinical notes for mentions of acupuncture.
Free text clinical notes have been shown to be rich in medical information that can be accessed using natural language processing techniques (5–7). Searching of the unstructured notes was accomplished using the Voogo search engine, which was developed specifically for searching structured and unstructured data within VINCI. Using Voogo, patients with clinical documents containing the string “acupuncture” were identified. Snippets of text containing acupuncture, including surrounding context, were extracted and manually annotated to identify if the snippets were positive, negative, or prescribed (if the snippet described a recommendation) for use of acupuncture treatment by the patient. A support vector machine (SVM) was trained for automated acupuncture text classification. Using text classification results, patients were classified as positive for acupuncture treatment use if they had at least one positive snippet; prescribed if they had no positive snippets but at least one prescribed snippet; or negative if they had only negative snippets.
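The snippet extraction step described above might look roughly like the following sketch. It does not reproduce Voogo, whose interface is not described here; the function name and the context window size are assumptions.

```python
import re


def extract_snippets(note_text, term="acupuncture", window=60):
    """Return text windows around each case-insensitive mention of `term`.

    `window` is the number of context characters kept on each side; the
    value is an assumption, since the text does not state the context size.
    """
    snippets = []
    for match in re.finditer(re.escape(term), note_text, flags=re.IGNORECASE):
        start = max(0, match.start() - window)
        end = min(len(note_text), match.end() + window)
        snippets.append(note_text[start:end].strip())
    return snippets


# Toy note, invented for illustration only.
note = (
    "Veteran reports chronic low back pain. Receives acupuncture weekly "
    "from a community provider. Denies nausea."
)
snippets = extract_snippets(note, window=30)
```

Each snippet would then be labeled manually or by the trained classifier as positive, negative, or prescribed.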
Patients with a positive history of acupuncture use identified through unstructured data are referred to as UD; those identified through structured data are referred to as SD.
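The patient-level rule stated above (positive if any snippet is positive, otherwise prescribed if any snippet is prescribed, otherwise negative) can be sketched as follows. The input format and function name are assumptions for illustration.

```python
from collections import defaultdict


def classify_patients(snippet_labels):
    """Collapse snippet-level labels into one label per patient.

    `snippet_labels` is a hypothetical iterable of (patient_id, label) pairs,
    where label is "positive", "prescribed", or "negative".
    """
    labels_by_patient = defaultdict(set)
    for patient_id, label in snippet_labels:
        labels_by_patient[patient_id].add(label)

    patient_labels = {}
    for patient_id, labels in labels_by_patient.items():
        if "positive" in labels:
            # At least one positive snippet.
            patient_labels[patient_id] = "positive"
        elif "prescribed" in labels:
            # No positive snippet, but at least one prescribed snippet.
            patient_labels[patient_id] = "prescribed"
        else:
            # Only negative snippets.
            patient_labels[patient_id] = "negative"
    return patient_labels


# Toy labels, invented for illustration only.
toy = [
    ("p1", "negative"), ("p1", "positive"),
    ("p2", "prescribed"), ("p2", "negative"),
    ("p3", "negative"),
]
patient_labels = classify_patients(toy)
ud_cohort = {pid for pid, lab in patient_labels.items() if lab == "positive"}
```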
Comparing Cohorts from Structured and Free Text Data
We compared the two cohorts (UD and SD) to determine the distribution of patients identifiable only from SD, only from UD, or both. The UD and SD patients were then compared and contrasted for geographic location, gender, age, and most frequent medical procedures, diagnoses, and prescriptions.
A list of the 25 most common procedures was constructed by combining the 21 most frequent Current Procedural Terminology (CPT) codes from UD patients and the 21 most frequent CPT codes from SD patients (the number of codes taken from each cohort was chosen by trial and error to obtain a combined number of 25). Similarly, a list of the 25 most common diagnoses was constructed by combining the 23 most frequent International Classification of Diseases version 9 (ICD-9) codes from each of the UD and SD cohorts. For prescriptions, the 24 most frequent drug names from each of the UD and SD cohorts were combined into a set of 25 unique drug names. We then determined the proportions of UD and SD patients receiving these most frequent procedures, diagnoses, and prescriptions.
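The trial-and-error choice of how many codes to take from each cohort can be expressed as a small search. This is a sketch under assumptions: the function names are hypothetical, and the search simply returns the largest per-cohort count whose union does not exceed the target size.

```python
from collections import Counter


def top_codes(codes, k):
    """Return the set of the k most frequent codes in `codes`."""
    return {code for code, _ in Counter(codes).most_common(k)}


def combined_top_list(ud_codes, sd_codes, target):
    """Return (k, union) for the largest k whose union of per-cohort top-k
    codes contains at most `target` distinct codes."""
    best_k, best_union = 0, set()
    max_k = max(len(set(ud_codes)), len(set(sd_codes)))
    for k in range(1, max_k + 1):
        union = top_codes(ud_codes, k) | top_codes(sd_codes, k)
        if len(union) > target:
            break
        best_k, best_union = k, union
    return best_k, best_union


# Toy code lists, invented for illustration only.
ud = ["A"] * 5 + ["B"] * 4 + ["C"] * 3 + ["D"] * 2
sd = ["A"] * 5 + ["C"] * 4 + ["E"] * 3 + ["B"] * 2
k, combined = combined_top_list(ud, sd, target=4)
```

In the toy data, taking the top 3 codes from each cohort yields 4 distinct codes, mirroring how 21 codes per cohort yielded 25 distinct procedures.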
Results
Using CPT codes, 44,960 patients were identified as receiving acupuncture treatment using structured data (SD). For identification of acupuncture using unstructured data (UD), 1,245,753 documents mentioning acupuncture were identified, representing 400,350 patients. A total of 297 snippets were classified as positive, prescribed, or negative by annotators, with an inter-rater reliability kappa score of 0.74. Since the kappa is relatively low, consensus was reached through discussion to create the reference standard. A support vector machine (SVM) was trained using these snippets and validated with 10-fold cross validation. This resulted in the ability to classify text with an overall accuracy of 0.862 (precision 0.883, recall 0.743, and f1-measure 0.785) (Table 1). Using the SVM classification model, 140,525 patients were identified as positive for acupuncture use. The SD and UD cohorts were compared and found to intersect in 38,897 patients, so that an additional 101,628 (226%) patients were identified using UD that were not identifiable using SD (Figure 2).
Table 1.
Confusion matrix for acupuncture classifier.
Acupuncture Classifier | Reference Standard: Yes | Reference Standard: No |
---|---|---|
Yes | 77 | 13 |
No | 24 | 183 |

Metric | Value |
---|---|
Precision | 88.3% |
Recall | 74.3% |
F1 Measure | 78.5% |
Accuracy | 86.2% |
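For reference, the standard definitions of the listed metrics can be applied directly to the four cell counts. This sketch assumes that rows are classifier output and columns are the reference standard, and that "Yes" is the positive class. Values computed from a single pooled matrix need not equal the percentages listed with the table, which may have been averaged across cross-validation folds (an assumption; the averaging method is not stated).

```python
# Cell counts read from Table 1, assuming rows = classifier output and
# columns = reference standard, with "Yes" as the positive class.
tp, fp = 77, 13   # classifier Yes: reference Yes / reference No
fn, tn = 24, 183  # classifier No:  reference Yes / reference No

total = tp + fp + fn + tn

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of true positives that are recovered
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

for name, value in [("precision", precision), ("recall", recall),
                    ("f1", f1), ("accuracy", accuracy)]:
    print(f"{name}: {value:.3f}")
```

The four cells sum to the 297 annotated snippets.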
Figure 2.
Distribution of patients between groups identifiable by structured data (SD), unstructured data (UD), or both (SD + UD).
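The group sizes in Figure 2 follow from the three reported counts by simple set arithmetic, sketched below. Only the SD total, the UD total, and the intersection are taken from the text; the variable names are illustrative.

```python
# Counts reported in the Results.
sd_total = 44_960    # patients identified from structured data
ud_total = 140_525   # patients identified from unstructured data
both = 38_897        # patients identified by both methods

ud_only = ud_total - both          # 101,628 patients found only in notes
sd_only = sd_total - both          # 6,063 patients found only by CPT code
combined = sd_only + both + ud_only

increase_over_sd = ud_only / sd_total   # relative gain from adding UD
sd_missing_from_ud = sd_only / sd_total
ud_missing_from_sd = ud_only / ud_total

print(f"UD only: {ud_only:,}  SD only: {sd_only:,}  combined: {combined:,}")
print(f"Increase over SD: {increase_over_sd:.0%}")
print(f"SD patients not in UD: {sd_missing_from_ud:.0%}")
print(f"UD patients not in SD: {ud_missing_from_sd:.0%}")
```

The resulting proportions (about 226%, 13%, and 72%) match the figures quoted in the Results and Discussion.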
We compared the geographic locations of UD and SD patients. Overall, patients congregated around major population centers. There were some differences in the distributions, however. There was a much higher proportion of UD patients in the northwest region of Oregon, and a higher proportion of SD patients in the New York City metropolitan region (Figure 3).
Figure 3.
Geographic distribution of acupuncture patients identified from structured data (SD) and unstructured data (UD).
The age distributions of UD and SD patients were essentially similar (Figure 4). The mean age was 56.4 (stdev. 15.2) for UD patients and 56.1 (stdev. 14.6) for SD patients. Because of the large sample size, the difference was statistically significant (p < 0.00 by Student's t-test); however, this small difference is not clinically meaningful.
Figure 4.
Density of age distribution for acupuncture patients identified from unstructured data (UD) and structured data (SD).
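The contrast between statistical and clinical significance in the age comparison can be illustrated from the summary statistics alone. This sketch assumes the full cohort sizes (140,525 UD and 44,960 SD patients) were the sample sizes used and treats the cohorts as independent even though they overlap; neither detail is stated in the text. It computes a pooled-variance Student's t statistic and Cohen's d as an effect size.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))


def student_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample Student's t statistic with pooled variance."""
    sp = pooled_sd(sd1, n1, sd2, n2)
    return (mean1 - mean2) / (sp * math.sqrt(1 / n1 + 1 / n2))


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference (Cohen's d)."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)


# Means and standard deviations from the text; group sizes are assumed.
ud = (56.4, 15.2, 140_525)
sd = (56.1, 14.6, 44_960)

t_stat = student_t(*ud, *sd)
d = cohens_d(*ud, *sd)
print(f"t = {t_stat:.2f}, Cohen's d = {d:.3f}")
```

Under these assumptions the t statistic is large because of the sample size, while the effect size is close to zero, which is consistent with a difference that is detectable but not clinically meaningful.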
We compared the percentages of UD and SD patients receiving the most common procedures, diagnoses, and prescriptions (Figure 5). Although the percentages are broadly similar, a higher percentage of SD patients overall received the measured procedures, diagnoses, and prescriptions. For some procedures the higher SD percentage is more pronounced, namely metabolic panel total calcium, patient evaluation, therapeutic exercises, and comprehensive metabolic panels. Conversely, UD patients show higher percentages for assays of quantitative blood glucose, alanine aminotransferase, and urea nitrogen. For diagnoses, SD patients also have a higher percentage in most cases; exceptions include unspecified reason for consultation and unspecified tobacco use disorder. Prescriptions continue the trend, with SD percentages higher in all cases. We also examined the per-patient average number of all procedures, diagnoses, prescriptions, and visits in the two groups. This analysis confirmed that SD patients had a higher rate of use in all cases (Table 2).
Figure 5.
Percent of UD and SD patients with the most frequent procedures (by CPT code), diagnoses (by ICD9 code), and prescriptions, and gender distribution of UD and SD patients.
Table 2.
Average per-patient procedure, diagnosis, prescription, and outpatient visit rates for UD and SD patients.
Cohort | Procedures per Patient | Diagnoses per Patient | Prescriptions per Patient | Visits per Patient |
---|---|---|---|---|
UD Patients | 634 | 473 | 202 | 483 |
SD Patients | 724 | 568 | 232 | 562 |
Overall, both diagnoses and prescriptions indicate the presence of pain and pain management. Diagnoses of lumbago (low back pain), unspecified back pain, and cervicalgia (neck pain) are frequent, as are pain medications such as hydrocodone, gabapentin, cyclobenzaprine, naproxen, tramadol, oxycodone, codeine, etc.
We also compared the gender distribution (Figure 5). There was a higher percent of females in SD patients (14%) as opposed to UD patients (12%), both of which reflect the expected minority of females in the veteran population.
Discussion
In this study we identified patients in the VHA system being treated with acupuncture. We identified the cohorts using structured data and unstructured full-text data. We used CPT codes to identify patients for the structured data cohort, and SVM classification of unstructured free-text clinical notes to identify patients in the unstructured data cohort. There was a large overlap in the two sets, with only 13% of structured data patients not also being present in the unstructured data set. However, 72% of the unstructured data patients were not present in the structured data set, demonstrating the ability to significantly enlarge the set of identified acupuncture patients by using unstructured data. Our study shows that while it is feasible to identify acupuncture cohorts through structured and unstructured data independently, combining the two approaches can maximize the cohort size. This finding is consistent with findings reported by prior studies (8–11), but we show a more dramatic increase due to this medical domain not being traditionally included in electronic health records.
Aside from increasing the cohort size, combining structured and unstructured data can lead to a more representative patient population. Some prior studies compared sensitivity and specificity of different cohort identification methods, while we compared the cohorts. In comparing the cohort characteristics, we found a high degree of similarity but also some meaningful differences. Geographically, large acupuncture patient populations tend to locate in or near large metropolitan centers. The unstructured data cohort had a much higher proportion in the northwest region of Oregon, and those in the structured data cohort were proportionally more highly represented in the New York City metropolitan area. Some large metropolitan centers showed low acupuncture populations from either method. This suggests a variance in practice and/or documentation, although there are many other possibilities that will require further study to identify.
The distribution of ages in the two cohorts showed no clinically meaningful difference, with the mean patient age at the time of treatment being about 56 years in both groups. The gender representation in the two cohorts was also very similar. The rankings of the most frequent medical procedures, diagnoses, and prescriptions were very similar between the two cohorts; however, consistently higher percentages of patients in the structured data cohort received each procedure, diagnosis, and prescription. A frequent application of acupuncture treatment is pain management (12), which is reflected in the frequent use of pain management prescriptions and diagnoses related to pain conditions in both cohorts.
Our data suggest that the patients in the structured data cohort had consistently higher rates of procedures, diagnoses, and prescriptions in general, not only in the most frequent sets. Patients in the structured data set also had a higher average outpatient visit rate. This indicates a difference in medical resource usage pattern between the two cohorts, with those in the structured data cohort consistently displaying higher use of VHA resources. This may indicate that patients identified only through unstructured data are relatively healthy or relying less on VHA as the sole provider.
Acknowledgments
This work is funded by VA grants CHIR HIR 08-374 and VINCI HIR-08-204.
References
- 1. Witt CM, Jena S, Selim D, Brinkhaus B, Reinhold T, Wruck K, et al. Pragmatic randomized trial evaluating the clinical and economic effectiveness of acupuncture for chronic low back pain. American journal of epidemiology. 2006;164(5):487–96. doi: 10.1093/aje/kwj224.
- 2. Linde K, Vested Madsen M, Gøtzsche PC, Hrobjartsson A. Acupuncture Treatment for Pain: Systematic Review of Randomised Clinical Trials with Acupuncture, Placebo Acupuncture, and no Acupuncture Groups. Deutsche Zeitschrift für Akupunktur. 2010;53(2):40–1. doi: 10.1136/bmj.a3115.
- 3. Yuan J, Purepong N, Kerr DP, Park J, Bradbury I, McDonough S. Effectiveness of acupuncture for low back pain: a systematic review. Spine. 2008;33(23):E887–E900. doi: 10.1097/BRS.0b013e318186b276.
- 4. Jacobson BC, Gerson LB. The inaccuracy of ICD-9-CM Code 530.2 for identifying patients with Barrett’s esophagus. Diseases of the esophagus. 2008;21(5):452–6. doi: 10.1111/j.1442-2050.2007.00800.x.
- 5. Lin J, Jiao T, Biskupiak JE, McAdam-Marx C. Application of electronic medical record data for health outcomes research: a review of recent literature. Expert review of pharmacoeconomics & outcomes research. 2013;13(2):191–200. doi: 10.1586/erp.13.7.
- 6. Friedlin J, McDonald CJ. A natural language processing system to extract and code concepts relating to congestive heart failure from chest radiology reports. AMIA Annual Symposium proceedings. 2006:269–73.
- 7. Friedlin J, Grannis S, Overhage JM. Using natural language processing to improve accuracy of automated notifiable disease reporting. AMIA Annual Symposium proceedings. 2008:207–11.
- 8. Liao KP, Cai T, Gainer V, Goryachev S, Zeng-treitler Q, Raychaudhuri S, et al. Electronic medical records for discovery research in rheumatoid arthritis. Arthritis care & research. 2010;62(8):1120–7. doi: 10.1002/acr.20184.
- 9. Li L, Chase HS, Patel CO, Friedman C, Weng C. Comparing ICD9-encoded diagnoses and NLP-processed discharge summaries for clinical trials pre-screening: a case study. AMIA Annual Symposium proceedings. 2008:404–8.
- 10. Friedlin J, Overhage M, Al-Haddad MA, Waters JA, Aguilar-Saavedra JR, Kesterson J, et al. Comparing Methods for Identifying Pancreatic Cancer Patients Using Electronic Data Sources. AMIA 2010 Symposium. 2010.
- 11. Elkin PL, Froehling D, Wahner-Roedler D, Trusko B, Welsh G, Ma H, et al. NLP-based identification of pneumonia cases from free-text radiological reports. AMIA Annual Symposium proceedings. 2008:172–6.
- 12. Vickers AJ, Cronin AM, Maschino AC, Lewith G, MacPherson H, Foster NE, et al. Acupuncture for chronic pain: individual patient data meta-analysis. Archives of internal medicine. 2012;172(19):1444–53. doi: 10.1001/archinternmed.2012.3654.