ABSTRACT
Objective The objective of this project was to use statistical techniques to determine the completeness and accuracy of data migrated during an electronic health record conversion.
Methods Data validation during migration consists of mapped record testing and validation of a sample of the data for completeness and accuracy. We statistically determined a randomized sample size for each data type based on the desired confidence level and error limits.
Results The only error identified in the post go-live period was a failure to migrate some clinical notes, which was unrelated to the validation process. No errors in the migrated data were found during the 12-month post-implementation period.
Conclusions We have demonstrated that, compared with the typical industry approach, a statistical approach to determining sample size for data validation can ensure consistent confidence levels while maximizing the efficiency of the validation process during a major electronic health record conversion.
Keywords: Data migration, data validation, electronic medical record, electronic health record, implementation, applied clinical informatics
INTRODUCTION
During transition from a legacy electronic health record (EHR) to a new system, clinical data migration is one of the major challenges that organizations must address.1,2 There is a paucity of literature on best practices for these processes, especially in health care. Data migration is a risky process with evidence from the computer industry suggesting that 38% of large data migration projects run over budget or are not delivered on time.3 In addition to project budget and timeline concerns, failure to migrate clinical data into the active EHR may also create significant risks to patient safety and provider efficiency.1,2,4,5
If the decision is made to migrate large amounts of clinical data, the data must be checked for completeness and to ensure that no errors were introduced during the migration process. The process of data testing and validation significantly adds to the time, cost, and complexity of the data migration process,6 but is critical to maintaining the integrity of patient data and ensuring high-quality care and patient safety. To mitigate the time and expense of data validation, a subset of the data is commonly used in the validation process. Traditionally, validation volumes for this sampling have been subjectively determined (e.g., a 15% sample size).6,7 In industries outside of health care, several novel approaches have been described for automating and facilitating data migration and validation, including validation sampling techniques,6,8 but techniques such as these have never previously been described in a clinical setting. In this manuscript, we describe our statistical approach to determining validation sample sizes to ensure efficient data validation during our EHR transition while gaining a better understanding of the confidence levels about the error rates in the migration.
METHODS
Setting
Lucile Packard Children's Hospital Stanford (LPCHS) is a 303-bed, freestanding, quaternary care, academic children’s hospital. On May 4, 2014, LPCHS converted from a comprehensively implemented vendor EHR (Cerner, Kansas City, MO, USA) with computerized provider order entry,9 clinical decision support,10,11 and clinical documentation,12,13 to another vendor EHR (Epic Systems, Verona, WI, USA). During this conversion, 10 major categories of data from over 300 000 patient records were migrated from the legacy system to the new system. The data types selected for migration and validation included provider notes, patient measurements (height, weight, and head circumference), laboratory results (both internal and external), radiology reports, specialized study reports (EKG, EEG, Echocardiogram), and pathology reports. All provider notes and patient measurements since implementation of the legacy system in 2004 were migrated. For the other data types, 2.5 years of data were migrated.
An EHR physician advisory group was involved in the decision about what data to migrate. Part of the decision to do such a large-scale data migration was driven by the fact that participation in a regional health information exchange was a key factor in the decision to switch vendors, and only data native in the new system would be accessible to clinical partners through the EHR-integrated health information exchange tool, Care Everywhere.14,15 In addition to the data migrated into the new system, users were able to view historical data natively through an embedded link to the legacy system (via a contextually-aware web-based portal with single sign-on), very similar to what has been previously described by Gettinger and Csatari.1
Validation Methodology
We designed our data migration validation methodology to meet several objectives. The first was to ensure that all desired data had been correctly selected from the legacy system and converted to the new system with no data missing in the new system and no extra data points in the new system. The second objective was to ensure that all converted data displayed accurately in the new system, with identical content displaying in the correct location and format. The final objective was to maximize efficiency and effectiveness, while achieving high confidence in data migration accuracy.
Validation of data comprised two separate processes. The first was mapped record testing – confirming that each unique data value in each data type was properly configured in the new system (e.g., height, weight, and head circumference were three separate data values within the growth chart data type). This process was applied to all types of results, reports, and document types.
The second process was manual validation of a sample of each data type for completeness (correct selection of all the values of a particular data type for an individual patient) and accuracy. Instead of using the traditional method of selecting an arbitrary sample size, we calculated the patient sample size and validation resources required based on the desired level of statistical guarantees, as described below.
Once the sample size was determined for each data type, the patients were sampled uniformly at random from all the patients with at least one record of that data type. By selecting the records through randomization, we increased the likelihood that errors affecting a significant portion of the population would appear in our sample. For each sampled patient, the validation team checked that all the records of the specified data type were appropriately selected and that the contents of each record were accurate. If any errors were identified, the cause of the error was addressed and the population was re-sampled, as described below, until no errors were found.
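The uniform sampling step described above can be sketched as follows. This is an illustrative sketch rather than the team's actual tooling; the patient identifiers, population size, and sample size below are hypothetical:

```python
import random

def sample_patients(patient_ids, sample_size, seed=None):
    """Draw a uniform random sample of patients for validation.

    Each patient with at least one record of the data type has an equal
    chance of selection, so errors affecting a meaningful fraction of
    the population are likely to appear in the sample.
    """
    rng = random.Random(seed)
    return rng.sample(patient_ids, sample_size)

# Hypothetical population: 10 000 patients with at least one lab result.
patients = [f"MRN{i:06d}" for i in range(10_000)]
validation_set = sample_patients(patients, 460, seed=42)
print(len(validation_set))  # 460
```

Seeding the generator makes the draw reproducible for audit purposes, while `random.sample` guarantees each patient is selected at most once.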
Calculation of Sample Size
Leadership at Stanford Children’s Health (including information services and clinical administrators, physicians, clinical informaticists, and a statistician) chose a comfortable level of certainty (a 98% confidence level), then chose an acceptable detectable error limit on selection errors. Our interface engines had archived 2.5 years of HL7 transactions from every production interface to our legacy EHR, which could be replayed and loaded into our new system. For the data types converted from these archived production interface feeds, production experience suggested that any errors in selection would affect >5% of the population, so the error limit was set to 5% for those data types. For data types that were not part of the HL7 interface feeds, manual reports were created in the legacy system to extract the data, which were then loaded into the new system. For this manual extraction from the legacy EHR, a more conservative 2–4% error limit was used. We computed the validation sample size that would allow errors affecting this fraction of the population to be detected with the specified certainty. The dotted line in Figure 1 shows the chosen level of certainty; the chosen acceptable detectable error limit specifies the point where the sample-size curve crosses that line. See Appendix A for details of the calculations.
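Appendix A holds the exact derivation, but the core calculation can be sketched under a standard detection model: if a fraction p of the population is affected by an error, a uniform random sample of n records misses it with probability at most (1 − p)^n, so the smallest adequate sample solves (1 − p)^n ≤ 1 − c for confidence level c. A minimal sketch assuming this model (the retest adjustment described below would enlarge these numbers somewhat):

```python
import math

def sample_size(error_limit, confidence):
    """Smallest n such that an error affecting at least `error_limit`
    of the population is detected with probability >= `confidence`:
    solve (1 - error_limit)^n <= 1 - confidence for n."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - error_limit))

# 98% confidence level with the 5% error limit used for interface-fed data:
print(sample_size(0.05, 0.98))  # 77
# The more conservative 2% limit used for manually extracted data types:
print(sample_size(0.02, 0.98))  # 194
```

Note that the required sample size depends only on the error limit and confidence level, not on the population size, which is why a few hundred sampled records can validate millions of data elements.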
Figure 1:
Illustration of method for selecting the sample size. The dotted line shows the 95% detection level. If the smaller sample was used (solid line curve) we could detect errors affecting 5% of the population or more at the 95% detection level (black dot). If a larger sample was used (dashed line curve), then we could detect errors affecting 2% of the population at the 95% detection level.
Additionally, we wanted to allow for retesting if we found any errors. To avoid the decreased statistical guarantees introduced by retesting repeatedly until no errors were found, we made our intervals hold simultaneously across all retests: we reserved half of the remaining test level at each repetition of testing. This made the sample sizes somewhat larger, but kept our validation statistically valid even if we repeatedly restarted the validation.
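The level-halving schedule can be sketched as follows: each round of testing spends half of the still-unspent failure probability, so the total across any number of retests stays within the overall budget (α/2 + α/4 + … ≤ α). Both helpers below are illustrative sketches under the detection model, not the exact appendix procedure:

```python
import math

def retest_levels(alpha, max_retests):
    """Failure-probability budget per testing round: half of whatever
    remains, so the total over all rounds never exceeds alpha."""
    levels, remaining = [], alpha
    for _ in range(max_retests):
        remaining /= 2
        levels.append(remaining)
    return levels

def round_sample_size(error_limit, fail_prob):
    """Smallest n with (1 - error_limit)^n <= fail_prob, i.e. the sample
    size needed for this round's share of the failure budget."""
    return math.ceil(math.log(fail_prob) / math.log(1 - error_limit))

alpha = 0.02  # overall 98% confidence
print(retest_levels(alpha, 3))        # [0.01, 0.005, 0.0025]
print(round_sample_size(0.01, 0.01))  # 459 records in the first round
```

Each successive retest needs a somewhat larger sample, since it must meet a stricter failure probability; with α = 0.02 and a 1% error limit, the first round already requires 459 records, consistent in magnitude with the report sample sizes in Table 2.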
To compare this data sampling process to traditional data sampling processes, we manually validated 15% of the records for one data type – echocardiograms. We then calculated the error rate for this entire population of records and compared that to the estimated error rate calculated for the statistical sampling approach.
RESULTS
Table 1 shows the confidence interval and acceptable error limit on selection for each migrated data type. The last column of Table 1 shows the calculated error limit on the accuracy of each result based on the specified confidence level and error limit on selection. The estimated times for validation using our statistical method for calculating a sample size vs. the estimated time for validation using a more standard 15% sample size are shown in Table 2. We calculated that our sampling approach saved 56 000, 86 000, and 115 000 man-hours compared to a 10, 15, and 20% sample, respectively.
Table 1:
For each data type that was migrated, the total volume of data elements and sub-populations is shown with the specified confidence interval, the specified error limit on selection errors, and the calculated error limit on accuracy.
| Data Type | Total Volume (Data Elements) | No. of Sub-Populations | Confidence Level (%) | Error Limit on Selection Errors (%) | Calculated Error Limit on Accuracy (%) |
|---|---|---|---|---|---|
| ECHO Reports | 23 000 | 1 | 98 | 3 | 1 |
| EKG Reports | 22 000 | 1 | 98 | 3 | 1 |
| EEG Reports | 3300 | 1 | 98 | 2 | 1 |
| Historic Report Indicators | 242 800 | 1 | 98 | 2 | 0.7 |
| Lab Results | 2 877 000 | 4 | 98 | 5 | 0.2 |
| Lab Results (External) | 250 300 | 1 | 98 | 5 | 0.2 |
| Pathology Reports | 34 200 | 1 | 98 | 4 | 1 |
| Patient Measurements | 2 471 000 | 1 | 98 | 2 | 0.2 |
| Provider Clinical Notes | 2 112 300 | 2 | 98 | 5 | 0.6 |
| Radiology reports | 117 600 | 1 | 98 | 3 | 1 |
Table 2:
Sample sizes and estimated validation times using the statistical method for calculating a sample size vs. the estimated time for validation using a 15% sample size. Estimated time and cost savings (assuming a cost of $70/h for validation) of the statistical sampling approach are shown in the last two columns.
| Data Type | Sample Size Calculated by Statistical Sampling Approach (# results) | 15% Sample Size (No. results) | Manual Validation Time Per Data Element (min) | Estimated Validation Time for Statistical Sampling Approach (h) | Estimated Validation Time for 15% Sample Size (h) | Estimated Time Saved (h) | Estimated Cost Savings at $70/h ($) |
|---|---|---|---|---|---|---|---|
| ECHO Reports | 459 | 3450 | 10 | 77 | 576 | 499 | 34 930 |
| EKG Reports | 459 | 3300 | 10 | 77 | 549 | 472 | 33 040 |
| EEG Reports | 460 | 495 | 10 | 77 | 84 | 7 | 490 |
| Historic Report Indicators | 690 | 36 420 | 2 | 23 | 1215 | 1192 | 83 440 |
| Lab Results | 13 421 | 431 550 | 5 | 1118 | 18 525 | 17 407 | 1 218 490 |
| Lab Results (External) | 2303 | 37 545 | 3 | 115 | 1878 | 1763 | 123 410 |
| Pathology Report | 461 | 5130 | 10 | 77 | 2940 | 2863 | 200 410 |
| Patient Measurements | 2763 | 370 650 | 1 | 46 | 6177 | 6131 | 429 170 |
| Provider Clinical Notes | 1695 | 316 845 | 5 | 283 | 52 809 | 52 526 | 3 676 820 |
| Radiology reports | 461 | 17 640 | 15 | 77 | 2940 | 2863 | 200 410 |
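The savings columns in Table 2 follow directly from the validation-time estimates: time saved is the difference between the two approaches, and cost savings is that difference at the $70/h rate. A quick arithmetic check using the ECHO Reports row:

```python
RATE = 70  # $/h validation cost, per Table 2

def savings(stat_hours, pct15_hours):
    """Time saved (h) and cost savings ($) of the statistical approach
    relative to a 15% sample, given the two validation-time estimates."""
    saved = pct15_hours - stat_hours
    return saved, saved * RATE

saved_h, saved_usd = savings(77, 576)  # ECHO Reports row of Table 2
print(saved_h, saved_usd)  # 499 34930
```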
After go-live, our clinicians recognized that a subset of the clinical documents (approximately 6000 notes) had not originally been extracted from the legacy systems. These data were missed because the records were not initially included in the migration scope, unrelated to the statistical validation process. Once identified, these data were migrated using the same process with minimal additional effort. No other data errors were identified through clinical practice in the 12-month post go-live period.
To compare the traditional data sampling approach for validation of migrated data with the statistical sampling approach, 15% of the echocardiogram reports (4932 reports for 1533 patients) were manually validated. This validation resulted in the identification of one new error – an addendum to a report that had not been migrated from the legacy system to the new system. This error rate is within the error limits calculated for our statistically determined sample size.
DISCUSSION
This project demonstrates that a more rational approach to validation of migrated data during an EHR conversion can provide clear quality assurance metrics while significantly minimizing labor and associated costs.
This statistical approach provides two unique contributions to data migration validation in a clinical setting. First, we randomly selected our validation samples so that they would be statistically representative of the whole population. While validation is often done on a subset of the records during migration, it is usually not done in a randomized fashion. Without randomization, it would be difficult to draw careful statistical conclusions from the validation results.
Secondly, using the results of the randomization-based validation, we are able to make claims about the worst-case number of records affected by errors in the population. Because we only examined a small subset of the records, we cannot state with complete certainty that there are no records with errors. Instead, we are able to attach a probabilistic statement in terms of a limit on the fraction of errors and a confidence level. For a given error limit and confidence level, the statement takes the form: “If the true fraction of errors in the population were above [error limit], the probability of discovering one of those errors in our sampling would be at least [confidence level].”
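The quoted guarantee can be made concrete with a short calculation: treating the n sampled records as independent draws (a mild simplification of sampling without replacement), the probability of catching at least one error when the true error fraction is p is 1 − (1 − p)^n. The numbers below are illustrative:

```python
def detection_probability(error_fraction, sample_size):
    """P(at least one error appears in the sample), under the
    independent-draws (binomial) detection model."""
    return 1 - (1 - error_fraction) ** sample_size

# If >= 1% of the population were erroneous and 459 records were sampled:
p = detection_probability(0.01, 459)
print(round(p, 3))  # 0.99
```

For any error fraction above the limit, the detection probability only increases, so the guarantee is worst-case at the limit itself.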
During the 12-month period after go-live, the only unexpected data migration problem was that approximately 6000 clinical documents were not initially migrated because they were not included in the original scope. No errors in the migrated data have been identified. An important limitation of these results is that clinicians may not be aware of data errors and may not report those they find, so while it is reassuring that no errors in migrated data have been identified in 12 months post go-live, there could still be errors that were not brought to our attention. Manual validation of a more traditional 15% sample size of echocardiogram reports showed an error rate within the error limits calculated for our statistically determined sample size, suggesting that our approach is at least as good as the traditional approach.
While this sampling process decreases the time and costs associated with data validation, the limitation of this approach is that there remains a small risk of not detecting errors in the data migration process. Note that nonrandom validations of the same number of records would face similar risks of nondetection, but those limitations would not be quantifiable. A limitation of our randomization process is that the accuracy checks were made on records of the same patients that were examined for selection errors. This decision was made because it was a significantly more efficient workflow. A consequence of that decision is that even though patients were selected uniformly at random for selection validation, individual results were not themselves sampled uniformly at random for accuracy validation. That could be a limitation if accuracy errors clustered within individual patients. However, in previous experience, errors had not shown that type of clustering behavior.
We chose a “comfortable” level of certainty in our data sampling process. Since this type of statistical approach to sampling in validation of healthcare data has not been previously published, the appropriate level of certainty in this process has not been established. There are obvious patient safety risks with any level of uncertainty in integrity of migrated data. However, it is not logistically or economically feasible to manually validate every piece of data in a large-scale data migration project such as this. Balanced with the patient safety and workflow risks of leaving historic patient data stranded in a legacy system,1,2 as well as the cost for additional effort, we felt that the small level of uncertainty determined using our method was appropriate. Ideally, future research and development will lead to more efficient, automated processes for health care data validation that could further mitigate the risk associated with this process.
CONCLUSION
A statistical approach to selecting appropriate sample sizes during data migration can provide clear error limits on migrated data, while maximizing efficiency of the sampling and validation process during an EHR conversion. Given the quantity of EHR conversions now occurring in the United States alone, this approach has the potential to save millions of dollars in cost, and provide a higher degree of data confidence compared to the current industry standard.
Supplementary Material
ACKNOWLEDGEMENTS
We would like to thank Ed Kopetsky, Tom Maher, and the Stanford Children’s Information Services Physician Advisory Group for their guidance and support during this project.
FUNDING
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
FINANCIAL DISCLOSURE
The authors of this manuscript have no financial relationships relevant to this article to disclose.
COMPETING INTERESTS
None.
CONTRIBUTORS
Dr N.M.P. and Dr C.A.L. conceptualized and designed the project, designed the data collection instruments, drafted the initial manuscript, and approved the final manuscript as submitted. Dr M.J.G.S. designed and carried out the statistical analyses, critically reviewed and revised the manuscript, and approved the final manuscript as submitted. Mr. W.C. conceptualized and designed the project, critically reviewed and revised the manuscript, and approved the final manuscript as submitted. Ms. C.Y. and Ms. E.M. conceptualized and designed the project, participated in designing the data collection instruments, coordinated and supervised the data collection, critically reviewed and revised the manuscript, and approved the final manuscript as submitted.
REFERENCES
1. Gettinger A, Csatari A. Transitioning from a legacy EHR to a commercial, vendor-supplied, EHR: one academic health system’s experience. Appl Clin Inform. 2012;3:367–376.
2. West S. Need versus cost: understanding EHR data migration options. J Med Pract Manage. 2013;29:181–183.
3. Howard P. Data Migration Customer Survey. London: 2014. http://www.bloorresearch.com/dlfile/data-migration-customer-survey-2216.pdf. Accessed November 8, 2015.
4. Payne T, Fellner J, Dugowson C, et al. Use of more than one electronic medical record system within a single health care organization. Appl Clin Inform. 2012;3:462–474.
5. Michel J, Hsiao A, Fenick A. Using a scripted data entry process to transfer legacy immunization data while transitioning between electronic medical record systems. Appl Clin Inform. 2014;5:284–298.
6. Paygude P, Devale PR. Automated data validation testing tool for data migration quality assurance. Int J Mod Eng Res. 2013;3:599–603.
7. Kelly C, Nelms C. Roadmap to checking data migration. Comput Secur. 2003;22:506–510.
8. Hegadi RS. A study on sampling techniques for data testing. Int J Comput Sci Commun. 2012;3:13–16.
9. Longhurst CA, Parast L, Sandborg CI, et al. Decrease in hospital-wide mortality rate after implementation of a commercially sold computerized physician order entry system. Pediatrics. 2010;126:14–21.
10. Adams ES, Longhurst CA, Pageler N, et al. Computerized physician order entry with decision support decreases blood transfusions in children. Pediatrics. 2011;127:e1112–e1119.
11. Pageler NM, Franzon D, Longhurst CA, et al. Embedding time-limited laboratory orders within computerized provider order entry reduces laboratory utilization. Pediatr Crit Care Med. 2013;14:413–419.
12. Patel SJ, Longhurst CA, Lin A, et al. Integrating the home management plan of care for children with asthma into an electronic medical record. Jt Comm J Qual Patient Saf. 2012;38:359–365.
13. Hahn JS, Bernstein JA, McKenzie RB, et al. Rapid implementation of inpatient electronic physician documentation at an academic hospital. Appl Clin Inform. 2012;3:175–185.
14. Kaelber DC, Waheed R, Einstadter D, et al. Use and perceived value of health information exchange: one public healthcare system’s experience. Am J Manag Care. 2013;19:SP337–SP343.
15. Winden TJ, Boland LL, Frey NG, et al. Care Everywhere, a point-to-point HIE tool: utilization and impact on patient care in the ED. Appl Clin Inform. 2014;5:388–401.