Abstract
Objective
The quantity of patient data in healthcare is increasing exponentially. While big data and artificial intelligence have emerged across many fields, in healthcare such rapid development is hindered by numerous factors. Chief among them, health-care software developed decades ago could not foresee the demands of modern data processing and analysis. We present the challenges, remedies, and steps of efficient patient data integration that have been co-developed with clinicians at Lenval Children's University Hospital in Nice, France.
Methods
In collaboration with pediatricians, we created an integration framework that consolidates relevant historical patient data (from the past 10 years) for research purposes. The clinical data presented in this study were collected between 2012 and 2021 in the Lenval Children's University Hospital Pediatric Emergency Department.
Results
We present the architecture of a clinical data warehouse (CDW) and demonstrate its use. The CDW can also host doctors' notes, which are a key element for creating large language models that can help predict patient outcomes and provide critical information to health-care professionals. We also conducted several tests on the utilization of this new CDW, recorded multiple challenges in data integration, and offered three suggestions for software design. The CDW we created represents a solid foundation for future machine learning models of patient flow, hospital economics, and studies on rare diseases at CHU-Lenval.
Conclusion
Although the integration framework is grounded in pediatrics, the challenges discussed and the proposed remedies are relevant for software development across medical specializations. Our recommendations for software design can help with the future secondary use of Electronic Health Records.
Keywords: Secondary use of data, data mining, software design, clinical data warehouse, electronic health records, data scrubbing
Introduction
Hospitals and clinics collect and store large amounts of patient data that are heterogeneous, stored in non-standard formats, and demand extensive storage space.1,2 Until the second decade of the twenty-first century, it was difficult to use these data in research, mainly because of the imbalance between the extensive quantity of data and the limited computing power available. However, with the development of powerful processing units and Big Data, large volumes of patient data have gradually become accessible for novel analyses. Electronic Health Records (EHR) have been used successfully in various health-care facilities; for example, in the U.S., the use of EHR in hospital settings increased from 9% to 96% between 2008 and 2015. 3 With novel techniques in artificial intelligence (AI) (e.g. artificial neural networks 4 ), EHR offer a rich, large-scale, and affordable source of information. 5 In recent years, health and medical data have been predicted to grow exponentially 6 and are measured in tera-, peta-, and yottabytes.7–9 As the amount of data increases, the use of third-party computing solutions (mostly cloud-based) has also risen. 10 Every year, new hardware and software products and new companies enter the growing market. 11 This increase in data sources creates new opportunities for novel research, as EHRs contain underutilized information relevant to clinical research. 12
Digitalization in healthcare has advanced technologically during the past 20 years, but AI has only recently become a part of everyday healthcare. 13 Although AI holds great promise, it imposes requirements that the data must meet (e.g. data need to have sufficient magnitude for meaningful analytics, be valid for the purpose, and be prepared for modeling, including the pretreatment of missing values). Whether an EHR fulfills these requirements often traces back to when an individual hospital decided to start using it. This situation is linked to two principal challenges. First, the secondary use of patient data was never incorporated into the software lifecycle.3,14 As noted by Kim et al. (2019), the original reason for keeping health records (in paper format) was secondary use (i.e. research and education), whereas one of the primary reasons for the creation of EHR is billing.3,14 The historical purpose of EHR poses a second challenge: historical patient data are not necessarily compatible with new and recent data. While integrating historical data into recent data sources offers numerous advantages and insights for the medical field, inconsistency arises when these sources are combined.15,16
Characteristics of clinical pediatric data
Clinical data are generally characterized by heterogeneity, data availability, and complexity.8,14,17 In addition, pediatric clinical data have unique characteristics. 18 First, the number of visits to emergency departments (EDs) has been rising steadily for both adult and pediatric patients over the past decades. 19 Second, most congenital conditions present during childhood rather than adulthood, so processes created for general medicine do not necessarily fit pediatrics. Finally, pediatric healthcare ends at the age of 18 years, after which the patient is usually transferred to an adult practitioner. All these characteristics contribute to the specific challenges of clinical patient data that we encountered in this study.
Data heterogeneity is a common feature of clinical data2,5,8,18,20–22 and is caused by numerous factors, such as the patient's unique physiology, the need for medical specializations, and cultural and regional regulations and restrictions. 20 Data have also been collected primarily to identify patients, to store medical records, and to support health-care billing.23–25 Because the secondary use of data has only recently become important, the standardization and interoperability of clinical data vary with technical and medical practices, patients, and hospital sites. For example, large hospitals may admit more patients with respiratory problems than local hospitals in rural areas do. Because clinical data and metadata lack common standards, 26 clinical data combine multiple variables of mixed data types.22,27
Data availability refers to the sensitive character of clinical data. Owing to this sensitivity, clinical data are considered high-risk, and their protection and access are regulated by law and local policies. These regulations make clinical data challenging to computationally process, analyze, and share across medical networks and between organizations. It is also unclear what level of sensitive data can lead to an individual's harm; for example, metadata such as sex, age, and postal code combined are sufficient to identify a patient in 87% of cases. 28 However, direct and applicable guidelines for the safe processing of sensitive data are scarce, even as attacks on health-care networks increase, 29 so data owners must be certain of how their data are stored and used. Therefore, clinical open-data platforms are limited, restricted, and anonymized,30,31 which in turn reduces data accuracy (i.e. age vs. age groups) and usability. 14
Data complexity is linked to the variety of formats (i.e. numerical, text, images, and signals) that occur in medical data and to the multiple platforms within a hospital.8,32 In this study, four separate software platforms were used to collect and process patient data: 1) the patient's laboratory data (Clinicom, InterSystems®), 33 2) the patient's imaging data (VHM, visionHM®), 34 3) the patient's prescriptions (ORBIS, Agfa®), 35 and 4) Terminal Urgences® (Innovation e-Sante Sud group), 36 which stores the patient's personal details, diagnostic data, and medication during admission. Because these four platforms are not interconnected, researchers have only a limited ability to access patient data efficiently. In this study, we address these three challenges and present the developmental steps necessary to make public data repositories available for pediatric research.
The secondary use of clinical data has been proposed only recently,23–25 and not all medical and health-care systems have been implemented to comply with this use. 23 To use large-scale, longitudinal, and heterogeneous clinical data, prior research has investigated various approaches for the automatic integration of clinical data for secondary use. Yao et al. 37 created an integrated system with Hadoop to execute distributed MapReduce algorithms, whereas Schultz et al. 38 used Microsoft SQL Server in their validation study. In addition, cloud-computing approaches are promising: Sobeslav et al. 39 provided a well-detailed view of the benefits and threats of cloud computing with respect to data sensitivity and patient privacy. The most common approach to integrating clinical data efficiently has been the “flattened table” format, in which each row presents one instance for training a model. 14 However, this approach is limited because each patient may have multiple measurements obtained during the same admission.
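The limitation of the flattened-table format can be illustrated with a minimal sketch (the column names below are hypothetical, not the actual export schema): flattening repeated measurements into one row per admission forces an aggregation choice and discards repeats.

```python
import pandas as pd

# Hypothetical long-format vital-sign data: several measurements per admission.
long_df = pd.DataFrame({
    "admission_id": [1, 1, 1, 2],
    "timestamp": pd.to_datetime(
        ["2021-03-01 10:00", "2021-03-01 10:00",
         "2021-03-01 11:30", "2021-03-02 09:15"]),
    "variable": ["heart_rate", "resp_rate", "heart_rate", "heart_rate"],
    "value": [95.0, 22.0, 88.0, 110.0],
})

# A "flattened" table keeps one row per admission, which forces an
# aggregation choice (here: the first recorded value) and drops the
# repeated heart-rate measurement for admission 1.
flat = (long_df.sort_values("timestamp")
        .pivot_table(index="admission_id", columns="variable",
                     values="value", aggfunc="first")
        .reset_index())
print(flat)
```

Admission 1 had two heart-rate readings (95 and 88 bpm), but the flattened row can keep only one of them, which is precisely the information loss the long (one-row-per-observation) format avoids.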
A similar line of research was published by Deshpande et al. (2020). 40 They concluded that the main difficulties involved working with null values, different timestamp formats, and errors in the values; missing data content was between 1% and 31% depending on the dataset. They structured the data cleaning process as follows: replacing missing category contents in medical reports; removing errors; resolving inconsistencies in dates and ages; and substituting abbreviations through medical dictionaries and ontologies.
However, these approaches cannot be applied wholesale in healthcare, as each hospital's software landscape and data formats differ.
This study was conducted at Lenval Children's University Hospital in Nice, the fourth-largest pediatric hospital in France, with nearly 60,000 admissions annually. 41 To manage patient data, the hospital has been using the Terminal Urgences® (TU) 42 software as an EHR management system since 2010, 43 including a pediatric triage tool (pediaTRI). 44 In addition, the hospital employs dedicated software for laboratory, 45 therapeutic prescription, 46 and imaging data. 34 The aim of this study was to automatically integrate these data streams to create a clinical data warehouse (CDW).
TU was developed in 2002 and is used at more than 60 sites in France. 42 However, to the best of our knowledge, no previous attempt has been made to clean and integrate its clinical data from the past 10 years. The aim of this study was to integrate the clinical data with the administrative data, which the TU software stores separately. Administrative data included all basic patient details (i.e. TU ID, name, address, admission times, and chief complaint). Clinical data are more complex and comprise all the given care, diagnostics, medications, chief complaints, and measurements taken during admission. The data used in this study were obtained from the pediatric ED; Figure 1 illustrates these variables and their occurrences. In accordance with the laws governing “non-interventional clinical research” in France (namely articles L.1121-1 and R.1121-2 of the public health code), informed written consent and ethics committee authorization were not necessary for this no-effect observational study collecting anonymized data on patient management. Since patient data were collected in our hospital using different software packages, the respective copyright holders authorized the use of the names of these software packages for our study.
Figure 1.
Variable incidence rate and processing success rate.
The main goal of this study was to unify heterogeneous large-scale clinical data and integrate the raw data to produce standardized secondary data, which could later be used to improve the existing pediatric triage tool and allow the prediction of resource utilization. Pursuing this goal also allowed us to record the challenges that can occur when integrating and transforming medical data. At this stage, we refer to these data as large-scale data (as opposed to Big Data) because the data come from a single hospital site and are structurally simple.7,37 However, the data share other features with Big Data, such as veracity (uncertainty) and variety.8,21
In this work, we aimed to create an open-source software solution that automatically integrates (i.e. reads, cleans, and processes) multiple data streams into a single database that enables efficient secondary use of data and open-science research, e.g. applications of deep learning. 31 Instead of relying on a service provider, the database was created using PostgreSQL and data processing was performed using Python.
Contribution
We created a research database with nearly all the variables available in the EHR and established a framework for the automation of clinical data integration. To the best of our knowledge, this is the first study to collect and integrate 10 years of pediatric clinical data from a hospital in France. We also integrated several other data streams, such as imaging, and made the data machine readable for use with machine learning and, later, for example, with large language models (LLMs). While more and more health-care data are becoming available for research, pediatric data remain scarce, and our study helps close this gap. During the study, we also found that we can help overcome existing issues in the integration and secondary use of clinical data. The study's point of view is software design, which is usually overlooked when it comes to data integration.
Methods
In this section, we describe the heterogeneous, large-scale, and longitudinal data collection (summarized in Table 1), how its challenges were resolved, and how the data were unified with contemporary standards (Proprietary, legacy health-care systems and machine-unreadable clinical data Section) to establish a database suitable for open-science research in pediatrics (Software design and future work Section).
Table 1.
Types of data collected.
| Data type | Description | Data source |
|---|---|---|
| Structured administrative patient data | Main patient data consisting of patient details and admission details, such as timestamps of when the patient was seen by a triage nurse or a doctor; the chief complaint is also included. | Terminal Urgences |
| Non-structured clinical data | Clinical data consisting of all measurements along with any given medication, care, and diagnostics. | Terminal Urgences |
| Structured imaging data | Binary data for each possible imaging option. | VHM |
Terminal Urgences: software used in our Pediatric Emergency Department for electronic medical records. VHM, visionHM®: software used in our Pediatric Emergency Department for imaging.
The clinical data presented in this study were collected at Lenval Children's University Hospital (CHU-Lenval, Nice, France) between 2012 and 2021 and comprised over half a million admissions (575,207) in the CHU-Lenval Pediatric Emergency Department, with nearly 200,000 unique patients (Table 2). The data processing initially took place in 2021 and was complex, labor intensive, and time consuming, and hence out of the reach of hospital clinicians. The resulting PostgreSQL database aims to democratize clinicians' access to complex patient data.
Table 2.
Number of data collected in our Pediatric Emergency Department from 2012 to 2021.
| Data category | Unique admissions in raw data | Unique admissions in CDW | Unique patients in raw data | Unique patients in CDW |
|---|---|---|---|---|
| TU administrative data | 575207 | 567589 | 191284 | 190569 |
| TU clinical data | 565501 | 564916 | Not applicablea | 189957 |
| Imaging data | 133517 | 128381 | 116838 | 80159 |
| Total | 1274225 | 1260886 | 308122 | 460685 |
aAll patient details were excluded from the clinical data export.
CDW: clinical data warehouse; TU: Terminal Urgences software; Imaging data: imaging collected with VHM software.
Terminal Urgences (TU®) data: administrative data
Administrative data were sourced from the Terminal Urgences patient management system and included all the administrative variables, timestamps of admissions, medical triage, and chief complaint (the main reason for consultation, based on the triage nurse's interpretation of the patient's symptoms). 47 Data exports were split into smaller chunks, that is, three data exports per year. Administrative data were structured such that each patient's admission was represented as a single row. During pre-processing, columns with identifying patient information were removed with custom Python scripts. Next, new interval time variables (i.e. the time between admission and the first medical examination) were calculated before importing records into the database. We excluded records in which the timestamp, admission ID, or patient ID did not match the expected format.
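A minimal sketch of this pre-processing step is shown below, assuming hypothetical column names rather than the actual TU® export schema: identifying columns are dropped, the waiting-time interval is derived, and records whose admission ID does not match the nine-digit criterion are excluded.

```python
import pandas as pd

# Columns removed before import (hypothetical names; the real export differs).
IDENTIFYING = ["name", "address"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # 1) De-identification: drop identifying patient columns.
    df = df.drop(columns=IDENTIFYING, errors="ignore")
    # 2) Derived interval variable: minutes from admission to first exam.
    df["wait_minutes"] = (
        pd.to_datetime(df["first_exam_ts"]) - pd.to_datetime(df["admission_ts"])
    ).dt.total_seconds() / 60
    # 3) Keep only rows whose admission ID has the expected nine digits.
    return df[df["admission_id"].astype(str).str.fullmatch(r"\d{9}")]

raw = pd.DataFrame({
    "admission_id": ["123456789", "BAD-ID"],
    "name": ["A", "B"],
    "address": ["x", "y"],
    "admission_ts": ["2021-01-01 10:00", "2021-01-01 11:00"],
    "first_exam_ts": ["2021-01-01 10:45", "2021-01-01 11:20"],
})
clean = preprocess(raw)
```

The second record is discarded because its ID fails the format check, mirroring the exclusion rule described above.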
The main challenges in processing the administrative data were the text representation of the data and the lack of error checking of input values in the TU® software. For example, the pediatric triage tool in use has a five-level system for grading patients; however, a nurse can add an option that upgrades the level, marked by an asterisk (*). There were also values that we were unable to retain, as when a cell contained a question mark. For some other cases, such as a patient's vital sign being out of the normal range, we needed the knowledge of clinical practitioners to decide how to proceed, as they base their decisions on a combination of factors.
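The asterisk and question-mark handling can be sketched with a small parser; the exact rules used in the study were defined together with clinicians, so the function below is only illustrative.

```python
import re

def parse_triage(raw: str):
    """Parse a free-text triage level such as '3', '3*', or '?'.

    Returns (level, upgraded): the five-level grade as an int and a flag
    for the nurse's asterisk upgrade, or (None, False) when the entry is
    unrecoverable (question marks or out-of-range values).
    """
    raw = raw.strip()
    if "?" in raw:
        return None, False          # unrecoverable entry, dropped
    m = re.fullmatch(r"([1-5])(\*?)", raw)
    if not m:
        return None, False          # not a valid five-level grade
    return int(m.group(1)), m.group(2) == "*"

print(parse_triage("3*"))  # (3, True)
```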
Clinical data
Clinical data were sourced from the PediaTRI tool, integrated into TU® in 2011. 44 Clinical data from TU® were heterogeneous and unstructured and included detailed information about the measurements, diagnostics, medications, chief complaint, and care conducted during admission. During processing, variables and units were stored separately, resulting in one row per observation. Owing to the unstructured nature of the clinical data, we developed dedicated practices for data mining and standardization. Table 3 lists all the variables gathered from the raw clinical data and the prime variables selected according to the clinicians' recommendations. Clinical data before 2016 did not include care, chief complaints, medications, or diagnostics. The year 2011 was excluded because the PediaTRI tool was integrated in the middle of that year.
Table 3.
List of existing variables that could potentially be integrated into our CDW.
| Included into CDW | Excluded from CDW |
|---|---|
| Diagnostics | Ketonemiaa |
| Care | Diuresis (liters)a |
| Chief complaint | HemoCue point-of-care (g/dl)a |
| Medications | Height (meters)a |
| Blood pressure (mm Hg) | PASi/PADi/PAMiae |
| Urine dipstick | Head circumference (meters)a |
| Capillary blood glucose | SAT O2cf |
| Painb | Capillary refill time (seconds)d |
| Heart rate (bpm) | |
| Respiratory rate (cpm) | |
| Peripheral oxygen saturation (%) | |
| Glasgow Coma scale | |
| Capillary refill time | |
| Temperature (degree celsius) | |
| Weight (kg) |
Diagnostics, care, chief complaint, and medications have only been exported and included starting in 2016.
aVariable was excluded because it is rarely measured and used in diagnosis.
bEvendol Score if age < 7 years, analog visual scale or digital pain scale if >= 7 years.
cVariable was excluded because it was used to store information not related to the actual variable.
dVariable was excluded because it was only being used since 2021.
eSystolic/Diastolic/Average blood pressure.
fOxygen saturation.
CDW: clinical data warehouse.
Vital sign variables (such as blood pressure) were heterogeneous and inconsistent. Care, medication, diagnostics, and chief complaint details were recorded as numerical, categorical, or textual values. We processed each vital sign variable separately and computationally checked the values for potential format and content errors. For example, a professional caregiver wrote “9” instead of “90” in the blood pressure field owing to time constraints. Similarly, string keywords and abbreviations were inconsistent across the data. All such inconsistencies were resolved using custom-made Python scripts.
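A hedged sketch of such a content check is shown below, in the spirit of the lambda-and-mapping remedy listed in Table 4; the plausibility bounds are illustrative, not the clinically validated ones used in the study.

```python
import pandas as pd

# Illustrative plausibility bounds per vital sign (not clinical reference
# ranges): (low, high) limits outside which a value is flagged.
PLAUSIBLE = {"heart_rate": (30, 250), "systolic_bp": (40, 200)}

def flag_out_of_range(df: pd.DataFrame) -> pd.DataFrame:
    # Map a lambda over each vital-sign column to mark implausible values.
    for var, (lo, hi) in PLAUSIBLE.items():
        df[f"{var}_valid"] = df[var].map(
            lambda v: pd.notna(v) and lo <= v <= hi)
    return df

vitals = pd.DataFrame({"heart_rate": [95, 300], "systolic_bp": [9, 110]})
checked = flag_out_of_range(vitals)
```

The systolic value 9 (the “9 instead of 90” case above) is flagged as invalid, so it can be routed to clinicians for a decision rather than silently imported.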
Imaging data
Imaging data were obtained through the TU® interface for patients who underwent imaging during admission. The original imaging data comprised binary variables related to imaging procedures, such as ultrasound, MRI, X-ray, and CT, and their corresponding image area and position. The data were exported to a structured *.csv format. During pre-processing, all patient-sensitive data were removed.
The main objective was to unify heterogeneous large-scale clinical data. We developed and adopted numerous data standards that comply with current practices in data science analyses and open-science access. Identifiable patient demographics (i.e. address and name) were removed, and records were linked through the patient's admission ID, a unique numerical value for each admission. We also incorporated the naming conventions of medical standards (e.g. by integrating the ICD-10 code table into the CDW).
We designed and implemented a PostgreSQL database that allowed us to import all the processed clinical data.
We then developed an integration process to associate all relevant measurements with each admission. The challenge was that multiple measurements of different signals, such as heart rate and respiratory rate, could share the same timestamp. We therefore created an additional identifier that combined all measurements taken simultaneously into a single entry (row). Each patient's data were thus represented by multiple rows with a unique combination of admission ID and timestamp, which served as the primary key in the clinical data tables. Finally, we included an ICD-10 table with the corresponding values, explanations, and severity for each ICD-10 code. In this study, we successfully integrated the TU and VHM data, whereas ORBIS and Clinicom are still in the process of integration. Figure 2 illustrates the workflow from raw data to structured data in the relational database. The figure shows the four main hospital systems, the total number of admissions in the raw data, and the number of unique admission records after the ETL process, which is the amount imported into the CDW.
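The composite primary key can be illustrated with a minimal, runnable sketch. The study used PostgreSQL, but an in-memory SQLite database behaves the same way for this purpose; the table and column names below are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Composite primary key: one row per (admission, timestamp) combination,
# holding all measurements taken at that moment.
con.execute("""
    CREATE TABLE vital_signs (
        admission_id TEXT NOT NULL,
        measured_at  TEXT NOT NULL,
        heart_rate   REAL,
        resp_rate    REAL,
        PRIMARY KEY (admission_id, measured_at)
    )
""")
# Simultaneous heart-rate and respiratory-rate measurements share one row.
con.execute("INSERT INTO vital_signs VALUES "
            "('123456789', '2021-03-01T10:00', 95, 22)")
# A second row for the same admission and timestamp is rejected by the key.
try:
    con.execute("INSERT INTO vital_signs VALUES "
                "('123456789', '2021-03-01T10:00', 88, NULL)")
except sqlite3.IntegrityError:
    print("duplicate (admission_id, timestamp) rejected")
```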
Figure 2.
Data processing workflow from the source exported data up to the integration in the CDW. CDW: clinical data warehouse.
It was challenging to characterize or standardize missing values in this data collection. For example, a patient's record could lack some signals because those vital signs were not relevant and were intentionally omitted. Although such records were correct and complete from a clinical perspective, they were incomplete in terms of machine learning. 40 Since missing data were mainly associated with particular vital sign variables (e.g. respiratory rate), we decided to encode the missing values as “nulls” in the database and leave it to physicians to decide how to treat them. For example, the respiratory rate is one of the most useful variables in the ED, but it is also often unmeasured.48,49
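Encoding the gaps as nulls rather than imputing them lets a researcher first inspect missingness per variable before modeling; a short sketch with illustrative column names:

```python
import numpy as np
import pandas as pd

# Toy extract where missing vital signs are stored as nulls (NaN).
vitals = pd.DataFrame({
    "heart_rate": [95.0, 110.0, np.nan],
    "resp_rate": [22.0, np.nan, np.nan],   # often unmeasured in practice
})
# Share of missing values per variable, computed directly from the nulls.
missing_share = vitals.isna().mean()
print(missing_share)
```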
Infrastructure
Data processing was conducted on Microsoft Windows Server 2019 Standard x64 (4096 MB memory, 2.1 GHz dual CPU). Data processing was performed according to the fundamental principles of writing modular code 50 in Python (version 3.8.10) using the Miniconda 51 distribution. The database was developed using PostgreSQL version 13. 52
Results
Longitudinal clinical data are complex, heterogeneous, and sensitive, which hinders data mining and big-data insights. The original EHR collection consisted of 750 MB spanning 10 years of clinical records from 2012 to 2021 (376,307 unique patients when counted with at most one visit per year; 191,284 unique patients overall; and 575,207 unique admissions). After data cleaning, standardization, and encoding, the data tables contained 863 MB, which were imported into the database (1963 MB inside the database, including testing tables for two other data streams). The resulting database contained 1,385,048 rows of clinical data entries for vital signs. Table 2 provides a detailed overview of each category of data.
In processing the data, we developed 32 functions to automatically handle data inconsistencies, which corresponded to typos (eight cases), invalid data formats, 12 software-related issues, 28 and others. 2 The final Python library included eight modules and 67 functions (including functions that did not alter the data).
The entire data collection was processed, of which 54% required advanced cleaning or data transformation (as described in the Terminal Urgences (TU®) data: administrative data Section and illustrated in Table 4). With the assistance of three pediatric clinicians in the pediatric ED (EF, EB, and AT), the most informative data were retained, the least informative data were dropped, and missing, inconsistent, or otherwise challenging data entries were recovered.
Table 4.
Data processing challenges.
| Data challenges | Occurrence | Remedy |
|---|---|---|
| Question marks behind the value | High | Regular expressions |
| Alphanumerical in numerical variable | High | Data cleaning using a system of regular expressions to remove characters in the numerical values |
| Unwanted characters | High | Regular expressions to remove unwanted characters |
| Non-matching timestampsa | High | Date and time functions and later SQL language |
| Values not in correct range | Medium | Lambda function and mapping |
| Duplicated records and missing identification values | Low | Removing duplicated records and / or records that cannot be identified |
| Spelling differences | Low | Regular expressions and mapping |
| Duplicated character | Low | Regular expressions and / or mapping |
Occurrence is defined as high when more than 80% of the variables in non-structured files (clinical data) were treated for the challenge; medium: 26–79%; low: < 25%.
aNon-matching refers to the fact that the datasets used different datetime formats.
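Normalizing the differing datetime formats noted in Table 4 can be sketched as follows; the two format strings are examples, not the exact formats used in the hospital exports.

```python
from datetime import datetime

# Candidate formats tried in order (illustrative, not the real export formats).
FORMATS = ("%d/%m/%Y %H:%M", "%Y-%m-%d %H:%M:%S")

def parse_ts(raw: str):
    """Return a datetime parsed with the first matching format, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    return None  # left null and flagged for review

print(parse_ts("01/03/2021 10:00").isoformat())  # 2021-03-01T10:00:00
```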
A total of 10.5% of the source data could not be recovered; however, this was largely due to the addition of two new variables, capillary refill time and urine dipstick, to the TU® software after 2016. Excluding these variables, the share of unrecoverable data was 5.5%. In the processing of the administrative data, 0.6% of the records were lost because the values did not correspond to the variable under consideration or obligatory values were missing. A total of 128 admissions were considered possible duplicates; in these cases, the admission and/or patient ID values did not match the nine-numerical-character criterion. For the clinical data, we investigated the most informative variables and inspected them for null values before and after data processing. Table 5 covers the target variables that were most informative for clinical practice, whereas the heatmap (Figure 1) shows all variables in the CDW and their corresponding incidence and processing success rates.
Table 5.
Most informative target variables.
| Variable | Measurements in raw data | Unique admissionsa | Unique admissions in CDWa | % of unrecovered data during processing |
|---|---|---|---|---|
| Heart rate | 442123 | 356656 | 356054 | 0.17% |
| Respiratory rate | 123365 | 81008 | 80290 | 0.89% |
| Blood pressure | 198712 | 174730 | 139479 | 20.17% |
| O2 Saturation | 379413 | 301904 | 301120 | 0.26% |
| Glasgow score | 457776 | 438508 | 438397 | 0.03% |
| Weight | 566498 | 557514 | 544810 | 2.28% |
Table only shows the most relevant vital sign values out of the total of 11 vital signs in the CDW.
aWith occurrence of the measurement.
CDW: clinical data warehouse.
As we developed the processing framework, we evaluated the processing time required for 1 year of data. Administrative and imaging data were already structured; therefore, their processing was faster than that of the clinical data. After optimization, the processing time for the clinical data was approximately 13 min per year of data. Table 6 shows the total duration for administrative, clinical, and imaging data, along with the volume of processed data in each category. Execution times were measured using the Python time library.
Table 6.
Processing time.
| Data category | Data size [MB] | Processing time [s] | Size after processing [MB] |
|---|---|---|---|
| Administrative data | 26.9 ± 2.7 | 73.0 ± 21.5 | 42.3 ± 6.3 |
| Clinical data | 38.0 ± 11.7 | 807.4 ± 196.8 | 34.8 ± 13.6 |
| Imaging data | 5.8 ± 0.5 | 11.5 ± 1.6 | 5.9 ± 0.5 |
| Average total | 23.6 ± 5.9 | 297.6 ± 107.4 | 27.2 ± 6.7 |
Administrative data size and processing time do not include year 2019 because of a missing raw data file.
The high standard deviation for the clinical data arises because the data before 2016 do not include care, chief complaint, medications, or diagnostics.
Each year contained one file of each data type to process.
Pediatric clinical data warehouse
The final CDW consisted of 14 tables with 761 columns and 1,385,048 rows of vital sign data, representing 567,781 admissions. A total of 567,656 admissions (99.98%) had an ID value that matched the criteria for nine numerical characters and could be considered as valid admissions. Figure 3 depicts the conceptual architecture of the CDW.
Figure 3.
Conceptual architecture of the clinical data warehouse (CDW).
We also conducted several tests on the utilization of this new CDW. Figure 4 shows a use case with the patient pathway and intervals between the initial triage and the first medical examination and discharge from the hospital. We also tested the usability of the data to train machine learning algorithms to help with hospital resource planning.
Figure 4.
Timeframe from admission to triage nurse to first clinical exam and exit. *PED = Pediatric Emergency Department. **IOA = Admission Nurse (certified triage nurses do not exist in France). Admission: number of patients admitted to the Pediatric Emergency Department, presented as a histogram. The line presents the average time spent by patients between two checkpoints.
Discussion
The secondary use of clinical data has gradually become a new norm in healthcare. 53 However, the proprietary and legacy systems used in hospitals have rarely been future-proofed or designed to comply with the open-science paradigm. Our study presents the first attempt to efficiently integrate EHR in pediatrics and, to our knowledge, the first to integrate data streams from Terminal Urgences® (TU®). The integration framework is grounded in a tight collaboration between data scientists, clinical experts, and on-site physicians, who iteratively worked with patient data and validated the outcomes of integration. Although the study was based in one of the largest pediatric hospitals in France, similar systems, clinical data, integration challenges, and processes are applicable to other contemporary health-care systems.
Overall, the framework allowed us to integrate and preserve 99% of the administrative data. This finding was consistent with the missing content reported by Deshpande et al. 2020. 40 Most data processing and integration challenges occur in clinical data, owing to their unstructured and heterogeneous characteristics. Here, we reflect on how data-related challenges in contemporary health-care systems can be resolved and improved through software design.
Proprietary, legacy health-care systems and machine-unreadable clinical data
Not all health-care systems comply with current state-of-the-art software development, as found in operating systems or games. 54 Health-care systems such as Terminal Urgences® have been developed and regulated as medical devices 55 to perform their primary purpose (i.e. EHR for emergency physicians working in EDs). Consequently, these systems are rigid, and data collection is not designed to be statistically readable. These systems produce heterogeneous data that are error prone and unsuitable for secondary use. Despite their richness, patient records in these forms cannot simply be processed and analyzed in third-party software.
We observed this in the integration of administrative data, where mistakes and non-numerical values in the postal code and admission ID raised validation errors. With clinical data, these problems grew exponentially, all stemming from the simple fact that no value or spell check of the input was implemented in the proprietary software. In addition, the data formats and available variables changed over time. For example, “capillary refill time” only became available in 2021; before that, clinicians and nurses used another variable to store this information.
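Such administrative errors can be caught early with lightweight validation before the records reach the warehouse. The sketch below is illustrative: the field names and the five-digit postal-code rule are our assumptions, not specifics of the TU® export.

```python
import re

# Hypothetical validators for two administrative fields; the field names
# and formats are illustrative assumptions, not specifics of the TU export.
POSTAL_CODE_RE = re.compile(r"^\d{5}$")   # French postal codes: exactly 5 digits
ADMISSION_ID_RE = re.compile(r"^\d+$")    # assumed purely numerical admission IDs

def validate_admin_record(record: dict) -> list:
    """Return a list of human-readable validation errors (empty if clean)."""
    errors = []
    if not POSTAL_CODE_RE.match(record.get("postal_code", "")):
        errors.append(f"invalid postal_code: {record.get('postal_code')!r}")
    if not ADMISSION_ID_RE.match(record.get("admission_id", "")):
        errors.append(f"invalid admission_id: {record.get('admission_id')!r}")
    return errors

print(validate_admin_record({"postal_code": "06000", "admission_id": "12345"}))  # []
print(validate_admin_record({"postal_code": "O6000", "admission_id": "12a45"}))  # 2 errors
```

Running such checks at data entry, rather than years later during integration, would have prevented most of the exceptions we had to handle manually.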
Owing to data sensitivity and patient safety, 56 health-care systems undergo the scrutiny of medical device regulations. Every major update requires validation and approval by the associated national authorities. Consequently, the development and maintenance of these systems are expensive. These systems also often follow an existing centralized architecture, which means that secure access to the data can be challenging. 57
Owing to the slow-paced development of health-care systems, clinical data rarely follow the machine-readable standards that are common in other data science fields. Clinical data often lack unified formatting, standardization, interpretable headers, standard comma separators, and class labels in the first column. Without precise and lengthy integration work by multiple clinicians and data scientists, these data would be unusable for data mining and machine learning. Current machine learning methods are unprepared for the diverse portfolio of missing values, misplaced types, 23 and seemingly missing values that we encountered (see the Methods section).
Efficient collection of patient data for secondary use is not possible without major updates to the proprietary and legacy systems. Based on our findings, the following main functionalities should become the norm in legacy health-care systems: (1) input verification of numerical values in vital-sign variables, (2) an agreed-upon and unified timestamp format, and (3) corresponding labels between the software and its data representation. All of these would greatly improve data collection and processing, and would also mitigate the current challenge of duplicated admissions (e.g. admissions over mobile apps). At a higher level, a “general comment” section provides the required flexibility to add meta-information.
Unwritten standards and human (non-) errors in clinical data
Not all mistakes and anomalies in clinical data can be corrected as easily as typos or factual errors. Each clinical facility, department, and team establishes unwritten standards that leak into the patient data. For example, it became standard practice that when a parent makes an appointment at the hospital, a clinician preadmits the pediatric patient before their arrival; the patient is then officially admitted on arrival. When a patient is admitted to another hospital unit or treatment unit, the number of admissions increases further. This practice has created the problem of multiple admissions and potential duplicates within the system. As illustrated in Table 2, we observed fewer admissions and fewer patient data in the clinical data than in the administrative data. These admissions (with similar yet unique IDs) cannot simply be ruled out as errors, because the first admission contains administrative data, whereas the second contains additional information and the actual diagnostic data. In the French clinical system, these issues fall under the topic of identitovigilance, which refers to ensuring that the correct patient is treated. Better system integration lowers both the possibility of errors and the loss of information.
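Because neither record can be discarded, such near-duplicate admissions have to be merged rather than deduplicated. A minimal sketch of the merge policy we describe (later record wins for any field it actually fills in; field names are hypothetical):

```python
def merge_admissions(preadmission: dict, admission: dict) -> dict:
    """Merge a preadmission record with its later admission record.

    The later record wins for any field it actually fills in; empty or
    missing fields fall back to the preadmission values. The field names
    used below are illustrative assumptions.
    """
    merged = dict(preadmission)
    for key, value in admission.items():
        if value not in (None, ""):
            merged[key] = value
    return merged

pre = {"patient_id": "P1", "postal_code": "06000", "diagnosis": ""}
adm = {"patient_id": "P1", "diagnosis": "bronchiolitis"}
print(merge_admissions(pre, adm))
# {'patient_id': 'P1', 'postal_code': '06000', 'diagnosis': 'bronchiolitis'}
```

In practice, the rule for matching a preadmission with its admission (similar IDs, same patient, close timestamps) must be validated with the clinical team, as it was in our integration.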
Similarly, unwritten clinical practices for measuring vital signs can compromise data standards. For example, it has become common to omit the trailing zero in a patient's blood pressure. Although this practice is acceptable in hospitals, clinicians and researchers outside the medical field may classify these values as erroneous. Likewise, the practice of using other variables to write down additional metadata may be standardized within a clinical team, yet other clinicians may view these values as unreliable. One way to respect the established standards while ensuring data quality is to keep track of the normal ranges of vital signs (i.e. a control table) and compare input values against it. During data processing, we implemented minimum and maximum thresholds, which resulted in fewer far-out and/or possibly incorrect values in the CDW. Overall, numerous international standards for clinical data exist; however, they are not always adopted by clinical staff and software developers,4,58 which hinders the efficient secondary use of patient data.
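A control table of this kind can be sketched as follows; the bounds below are broad placeholder ranges for illustration, and in our case the actual thresholds were set together with the clinical team.

```python
# Hypothetical control table of plausible vital-sign ranges; the exact
# bounds are placeholders and would be agreed upon with clinicians.
CONTROL_TABLE = {
    "heart_rate": (40, 230),         # beats/min, broad pediatric range
    "temperature": (30.0, 43.0),     # degrees Celsius
    "respiratory_rate": (10, 90),    # breaths/min
}

def within_range(variable: str, value: float) -> bool:
    """Compare an input value against the control table's min/max thresholds."""
    low, high = CONTROL_TABLE[variable]
    return low <= value <= high

print(within_range("heart_rate", 120))   # True
print(within_range("temperature", 385))  # False: likely a missing decimal point (38.5)
```

Out-of-range values are flagged rather than silently dropped, so a clinician can decide whether 385 is a typo for 38.5 or a genuinely invalid entry.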
Software design and future work
Software development for healthcare has requirements that differ from those in other fields. Consequently, software designed for healthcare is often ill-suited for the secondary use of data and for integration with other systems, devices, or hospital clinics. During this study, most data issues were linked to obstacles that could have been prevented in the original software design (28 of the 32 correction functions were software-specific).
To improve the current state of software design, stronger emphasis should be placed on (1) quality control of heterogeneous data, machine readability, and standardization of data formats, and (2) backward data-format compatibility when transitioning between systems and versions, which would ensure that patient data remain future-proof for secondary use. Although these are software- and data-oriented decisions, we argue that clinicians are indispensable to these tasks and should be incorporated into the design cycle.
While data heterogeneity is common in life-science data, 17 we observed heterogeneity within the same variable, which raised errors during data adoption. For example, all clinical data (except timestamps) could be stored as text. Owing to unsuitable variable types and missing checks for the most common errors (summarized in Table 4), the system allowed obviously incorrect data (e.g. a written comment) to be entered into a natively numerical variable (e.g. heart rate).
It is important to note that health-care software is designed and implemented by software developers but used by health-care workers. However, it is the health-care workers (not the software developers) who understand how patient values and vitals are captured and stored. For example, blood saturation might miss the trailing zero to save writing time. Pain values were recorded as a combination of the patient's age and the pain score (e.g. 8/10). Blood pressure comprised three values with prefixes signaling the type of measurement (e.g. “PAM: 85” stood for “mean arterial pressure 85”). Although these values are meaningful and comprehensible for trained clinicians, they pose considerable challenges for secondary use and automatic data processing. We suggest that in this type of software, only the permitted values can be placed in the variables, and that any other information is placed in its own text box. We also suggest that if a value is out of range, the software should display a pop-up window asking the user to confirm that the value is correct.
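During integration, such conventions can be recovered with small, explicit parsers. The sketch below handles the two notations mentioned above; the regular expressions are our assumptions and do not claim to cover the full range of conventions found in the data.

```python
import re

# Illustrative parsers for the free-text conventions described above;
# the prefixes and patterns are assumptions, not an exhaustive grammar.
PAM_RE = re.compile(r"^PAM:\s*(\d+)$")     # "PAM: 85" -> mean arterial pressure
SLASH_RE = re.compile(r"^(\d+)/(\d+)$")    # "8/10" -> two combined numbers

def parse_mean_pressure(raw: str):
    """Extract the numeric value from a 'PAM: <n>' string, else None."""
    m = PAM_RE.match(raw.strip())
    return int(m.group(1)) if m else None

def split_combined_field(raw: str):
    """Split an 'a/b' pain field into its two numbers without interpreting them."""
    m = SLASH_RE.match(raw.strip())
    return (int(m.group(1)), int(m.group(2))) if m else None

print(parse_mean_pressure("PAM: 85"))   # 85
print(split_combined_field("8/10"))     # (8, 10)
```

Each parser returns `None` on an unrecognized value, so unparsed entries can be routed to clinicians for review instead of being silently coerced.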
Hence, software design directly affects the quality and proportion of valid patient data suitable for analysis. These design choices also indirectly influence clinical research, because invalid or missing data can introduce imbalance, or suppress or completely omit important patient cases from the population.
Similarly, missing backward data-format compatibility, even within the same health-care system, leads to unnecessary data omission. For example, the system and format of the administrative data were updated several times over the years, resulting in numerous exceptions in the patient data. Although such changes are understandably part of software development, software design should include functionality for backward compatibility. Such future-proofing would directly benefit longitudinal data collection in health-care settings.
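One pragmatic form of backward compatibility is a parser that accepts every historical format a field has ever used. The legacy timestamp formats below are hypothetical examples; in our case, the real list was assembled while auditing the historical exports.

```python
from datetime import datetime

# Hypothetical formats from successive software versions; the real list
# would be assembled while auditing the historical exports.
LEGACY_FORMATS = [
    "%Y-%m-%dT%H:%M:%S",   # current export format
    "%d/%m/%Y %H:%M",      # an older version
    "%d-%m-%y %H:%M:%S",   # an even older version
]

def parse_any_timestamp(raw: str) -> datetime:
    """Try the current format first, then fall back through legacy ones."""
    for fmt in LEGACY_FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp format: {raw!r}")

print(parse_any_timestamp("01/06/2021 14:30"))  # 2021-06-01 14:30:00
```

Keeping the format list ordered from newest to oldest makes the common case fast while still preserving records from every system version.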
Finally, a major issue in healthcare is the compatibility of interconnected software. Hospitals are equipped with dozens of devices that are rarely manufactured by the same company, and in practice, none of these companies is willing or able to share data with the others for free. This means that the only way to integrate the data is to export it from all of these devices and then process and integrate it within the hospital itself.
With the increasing use of large language models (LLMs), there is a need for accessible and standardized data for model training. 59 We have established a method to integrate different data sources and create a centralized data source. Once doctors' notes are integrated, they can be used to train LLMs and to create a hybrid model that combines free text with other variables. Furthermore, this can contribute to LLM-based prediction software for health-care applications. For example, Jablonka et al. (2024) envisioned a chemist being able to ask “If I change the metal in my metal-organic framework, will it be stable in water?”, 60 which could expedite the development process. Similarly, an aim in healthcare could be a model that a caregiver could ask “If I change medicine X to medicine Y, will it help my patient?” or “If I give this medicine to my patient, could it help?”.
Conclusion
Secondary use of patient data has recently emerged with the widespread use of EHR. Health-care providers, businesses, and researchers have aimed to apply intelligent data mining to the rich plethora of administrative and clinical data. However, the journey toward the secondary use of patient data has been turbulent. Missing guidelines, undocumented proprietary formats, and the lack of data standards and backward compatibility, to name a few, have been the main obstacles for data-driven medical solutions.
As more stakeholders aim to use EHR, we have presented a framework, the obstacles, and solutions for the automatic integration of patients’ legacy data. To the best of our knowledge, the established CDW has resulted in one of the largest pediatric data collections in France (in terms of admissions and health variables). Indeed, the integration process is complex and effortful, yet necessary to preserve large quantities of rich patient data.
In the process of creating the CDW, we systematically targeted and solved underlying challenges in EHR that had not been previously recognized. We characterized the data exceptions resulting from cultural and departmental health-care backgrounds, personnel errors, software types, versions, and naming conventions. To prevent the loss of valuable patient records, we established a laborious rule-based integration and consulted clinical practitioners to restore data quality. Although effortful, such an integrated approach is necessary when dealing with patient data. The curated CDW represents a solid foundation for future machine learning models of patient flow, hospital economics, and studies on rare diseases at CHU-Lenval.
Acknowledgements
I am grateful to all the members of the team who helped to achieve this goal in the form of this research paper. I cannot imagine how this would have been possible without the help of Hana (Vrzáková) and Antoine (Tran), who gave enormous support throughout the process. I also wish to thank Hervé (Haas) for providing the concept on which this research was built, and both Emmas for their support during the review process.
Footnotes
ORCID iD: Valo Petri https://orcid.org/0009-0003-1096-7784
Ethical approval: Research on retrospective data such as ours does not require compliance to the French Law Number 2012–300 of 5 March 2012 relating to the research involving human participants, as modified by the Order Number 2016–800 of 16 June 2016. In this context, it does not require approval from the French competent authority (Agence Nationale de Sécurité du Médicament et des Produits de Santé, ANSM) nor from the French ethics committee (Comités de Protection des Personnes, CPP). Thus, informed written consent was not necessary for this no-effect observational study collecting anonymized data on patient management. Patient data were anonymized using a specific patient numbering procedure for the study. Our study was reported to the National Data Protection Authority (Health Data Hub, N° F20220415105250).
Contributorship: VP and TA contributed equally on the writing and validation of the article. TA also shared the duty of the review together with BE, HH, FE, and VH. Visualization, data processing, and data analysis were done by VP. Main investigators of the study were BE and FE. HH shared the conceptualization with TA. Supervision and editing were done by VH.
Funding: The authors received no financial support for the research, authorship, and/or publication of this article.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability: The raw data underlying this study are not publicly available, in compliance with the European Union's GDPR. For cybersecurity reasons, the processed data (CDW) are available only upon request.
References
- 1.Statista [Internet] . [cited 2021 Nov 18]. Healthcare data volume globally 2020 forecast. Available from: https://www.statista.com/statistics/1037970/global-healthcare-data-volume/.
- 2.Vidal ME, Jozashoori S, Sakor A. Semantic Data Integration Techniques for Transforming Big Biomedical Data into Actionable Knowledge. :4.
- 3.Kim E, Rubinstein SM, Nead KT, et al. The evolving use of electronic health records (EHR) for research. Semin Radiat Oncol 2019; 29: 354–361. [DOI] [PubMed] [Google Scholar]
- 4.Ross MK, Wei W, Ohno-Machado L. “Big Data” and the electronic health record. Yearb Med Inform 2014; 23: 97–104. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Martin-Sanchez FJ, Aguiar-Pulido V, Lopez-Campos GH, et al. Secondary use and analysis of big data collected for patient care: contribution from the IMIA working group on data mining and big data analytics. Yearb Med Inform 2017; 26: 28–37. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Borgi T, Zoghlami N, Abed M, et al. Big data for operational efficiency of transport and logistics: a review. In: 6th IEEE international conference on advanced logistics and transport (ICALT), Bali, Indonesia, 2017, pp.113–120. [Google Scholar]
- 7.Hermon R, Williams PAH. Big data in healthcare: what is it used for? In: Proceedings of the 3rd Australian eHealth Informatics and Security Conference, 1–3 December 2014, Edith Cowan University, Western Australia.
- 8.Ristevski B, Chen M. Big data analytics in medicine and healthcare. J Integr Bioinforma [Internet] 2018. Sep 25 [cited 2021 May 28]; 15: 1–5. Available from: https://www.degruyter.com/document/doi/10.1515/jib-2017-0030/html. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Archenaa J, Anita EAM. A Survey of Big Data Analytics in Healthcare and Government. Procedia Comput Sci 2015; 50: 408–13. [Google Scholar]
- 10.Segarra C, Muntane E, Lemay M, et al. Secure Stream Processing for Medical Data. In: 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) [Internet], Berlin, Germany: IEEE; 2019 [cited 2021 May 28]. p. 3450–3. Available from: https://ieeexplore.ieee.org/document/8856334/. [DOI] [PubMed] [Google Scholar]
- 11. More healthcare companies go public in 2020 than prior 5 years [Internet]. [cited 2022 Jan 22]. Available from: https://www.spglobal.com/marketintelligence/en/news-insights/latest-news-headlines/more-healthcare-companies-go-public-in-2020-than-prior-5-years-62197223.
- 12.Shah SM, Khan RA. Secondary use of electronic health record: opportunities and challenges. IEEE Access 2020; 8: 136947–65. [Google Scholar]
- 13.Jiang F, Jiang Y, Zhi H, et al. Artificial intelligence in healthcare: past, present and future. Stroke Vasc Neurol 2017; 2: 230–243. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Data Preparation Framework for Preprocessing Clinical Data in Data Mining. :5. [PMC free article] [PubMed] [Google Scholar]
- 15.Smith CL, Thomas Z, Enas N, et al. Leveraging historical data into oncology development programs: two case studies of phase 2 Bayesian augmented control trial designs. Pharm Stat 2020; 19: 276–290. [DOI] [PubMed] [Google Scholar]
- 16.Leighton C, Patrick M. Using Large Datasets for Population-based Health Research. In: Principles and Practice of Clinical Research [Internet]. Elsevier; 2012 [cited 2022 Feb 3]. p. 371–9. Available from: https://linkinghub.elsevier.com/retrieve/pii/B978012382167600028X.
- 17.Fillinger S, de la Garza L, Peltzer A, et al. Challenges of big data integration in the life sciences. Anal Bioanal Chem 2019; 411: 6791–6800. [DOI] [PubMed] [Google Scholar]
- 18.Bennett TD, Callahan TJ, Feinstein JA, et al. Data science for child health. J Pediatr 2019; 208: 12–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Urgences | Direction de la recherche, des études, de l’évaluation et des statistiques [Internet]. [cited 2023 Apr 23]. Available from: https://drees.solidarites-sante.gouv.fr/urgences.
- 20.OWKIN [Internet] . 2020 [cited 2021 Nov 20]. Overcoming the challenge of Heterogeneity in Healthcare. Available from: https://owkin.com/federated-learning/heterogeneity-in-healthcare/.
- 21.Andreu-Perez J, Poon CCY, Merrifield RD, et al. Big data for health. IEEE J Biomed Health Inform 2015; 19: 1193–1208. [DOI] [PubMed] [Google Scholar]
- 22.Daemen A, Timmerman D, Van den Bosch T, et al. Improved modeling of clinical data with kernel methods. Artif Intell Med 2012; 54: 103–114. [DOI] [PubMed] [Google Scholar]
- 23.Botsis T, Hartvigsen G, Chen F, et al. Secondary use of EHR: data quality issues and informatics opportunities. Summit Transl Bioinforma 2010; 2010: 1–5. [PMC free article] [PubMed] [Google Scholar]
- 24.Silversides A. Privacy concerns raised over “secondary use” of health records. Can Med Assoc J 2009; 181: E287–E287. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Kosseim P, Brady M. Policy by procrastination: secondary use of electronic health records for health research purposes. McGill J Law Health 2008; 2: 6–43. [Google Scholar]
- 26.Shin SY, Kim WS, Lee JH. Characteristics desired in clinical data warehouse for biomedical research. Healthc Inform Res 2014; 20: 109–116. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Miller JB. Big data and biomedical informatics: preparing for the modernization of clinical neuropsychology. Clin Neuropsychol 2019; 33: 287–304. [DOI] [PubMed] [Google Scholar]
- 28.Sweeney L. Simple demographics often identify people uniquely. Pittsburgh: Carnegie Mellon University; 2000. :34.
- 29. 2020 Healthcare Data Breach Report: 25% Increase in Breaches in 2020 [Internet]. HIPAA Journal. 2021 [cited 2021 Nov 22]. Available from: https://www.hipaajournal.com/2020-healthcare-data-breach-report-us/
- 30.Tao Z, Weber GM, Yu YW. Expected 10-anonymity of HyperLogLog sketches for federated queries of clinical data repositories. Bioinformatics 2021; 37: i151–i160. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Xiao C, Choi E, Sun J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. J Am Med Inform Assoc 2018; 25: 1419–1428. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.European Federation for Medical Informatics. Seamless care, safe care: the challenges of interoperability and patient safety in health care: proceedings of the EFMI Special Topic Conference, June 2–4, 2010, Reykjavik, Iceland. IOS Press; 2010. 240 p. [PubMed]
- 33.France [Internet] . [cited 2024 Feb 18]. Gestion de bases de donnée et information santé | InterSystems. Available from: https://www.intersystems.com/fr/.
- 34. VisionHM, Dapsys [Internet]. Available from: https://www.visionhm.com/
- 35.Baten B. Agfa Radiology Solutions France. [cited 2024 Feb 18]. Radiographie numérique : Les solutions DR & CR optimisées par MUSICA. Available from: https://medimg.agfa.com/france/
- 36. Présentation TU [Internet]. Terminal Urgences. [cited 2024 Feb 18]. Available from: https://tgs.ies-sud.fr/presentation-tu/
- 37.Yao Q, Tian Y, Li PF, et al. Design and development of a medical big data processing system based on hadoop. J Med Syst 2015; 39: 23. [DOI] [PubMed] [Google Scholar]
- 38.Schultz RF, Sharathkumar A, Kwon S, et al. Implementation of automatic data extraction from an enterprise database warehouse (EDW) for validating pediatric VTE decision rule: a prospective observational study in a critical care population. J Thromb Thrombolysis 2020; 50: 782–789. [DOI] [PubMed] [Google Scholar]
- 39.Sobeslav V, Maresova P, Krejcar O, et al. Use of cloud computing in biomedicine. J Biomol Struct Dyn 2016; 34: 2688–2697. [DOI] [PubMed] [Google Scholar]
- 40.Deshpande P, Rasin A, Tchoua R, et al. Enhancing Recall Using Data Cleaning for Biomedical Big Data. In: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS) [Internet], Rochester, MN, USA: IEEE; 2020 [cited 2021 Jun 23]. p. 265–70. Available from: https://ieeexplore.ieee.org/document/9182943/. [Google Scholar]
- 41.Tran A, Hérissé AL, Isoardo M, et al. Evaluation of compliance with early postbirth follow-up and unnecessary visits to the paediatric emergency department: a prospective observational study at the lenval children’s hospital in nice. BMJ Open 2022; 12: e056476. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42. Terminal Urgences [Internet]. Available from: https://tgs.ies-sud.fr/sites/
- 43.Demonchy D, Haas H, Gillet Vittori L, et al. Un circuit court pour désengorger les services d’accueil des urgences pédiatriques. Arch Pédiatrie 2015; 22: 247–254. [DOI] [PubMed] [Google Scholar]
- 44.Tran A, Valo P, Rouvier C, et al. Validation of the computerized pediatric triage tool, pediaTRI, in the pediatric emergency department of lenval children’s hospital in nice: a cross-sectional observational study. Front Pediatr. 2022; 10: 840181. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Clinicom, Siemens [Internet]. Available from: https://www.intersystems.com/fr/ressources/detail/trakcare-clinicom-service-cdri/
- 46. ORBIS, AGFA Healthcare [Internet]. Available from: https://global.agfahealthcare.com/
- 47.Gallin JI, Ognibene FP. Principles and Practice of Clinical Research [Internet]. San Diego, UNITED STATES: Elsevier Science & Technology, 2012. [cited 2021 Nov 22], Available from: http://ebookcentral.proquest.com/lib/uef-ebooks/detail.action?docID=913767. [Google Scholar]
- 48.Marjanovic N, Mimoz O, Guenezan J. An easy and accurate respiratory rate monitor is necessary. J Clin Monit Comput 2020; 34: 221–222. [DOI] [PubMed] [Google Scholar]
- 49.Njeru CM, Ansermino JM, Macharia WM, et al. Variability of respiratory rate measurements in neonates- every minute counts. BMC Pediatr 2022; 22: 16. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Hunt A, Thomas D. The pragmatic programmer: from journeyman to master. Reading, Mass: Addison-Wesley, 2000, 321 p. [Google Scholar]
- 51. Miniconda — conda documentation [Internet]. [cited 2023 May 7]. Available from: https://docs.conda.io/en/latest/miniconda.html.
- 52.PostgreSQL [Internet] . 2023 [cited 2023 May 7]. PostgreSQL. Available from: https://www.postgresql.org/.
- 53.Boonstra A, Broekhuis M. Barriers to the acceptance of electronic medical records by physicians from systematic review to taxonomy and interventions. BMC Health Serv Res 2010; 10: 231. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Mandl KD, Kohane IS. Escaping the EHR trap — the future of health IT. N Engl J Med 2012; 366: 2240–2242. [DOI] [PubMed] [Google Scholar]
- 55.Radley-Gardner O, Beale H, Zimmermann R. (eds). Fundamental Texts On European Private Law [Internet]. Hart Publishing; 2016 [cited 2022 Apr 11]. Available from: http://www.bloomsburycollections.com/book/fundamental-texts-on-european-private-law-1.
- 56.Géczy P. Big data characteristics. 2014;11.
- 57.Khan AA, Yang J, Laghari AA, et al. BAIoT-EMS: consortium network for small-medium enterprises management system with blockchain and augmented intelligence of things. Eng Appl Artif Intell 2025; 141: 109838. [Google Scholar]
- 58.Parvaiz MA, Subramanian A, Kendall NS. The use of abbreviations in medical records in a multidisciplinary world–an imminent disaster. Commun Med 2008; 5: 25–33. [DOI] [PubMed] [Google Scholar]
- 59.Ott S, Hebenstreit K, Liévin V, et al. Thoughtsource: a central hub for large language model reasoning data. Sci Data 2023; 10: 528. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Jablonka KM, Schwaller P, Ortega-Guerrero A, et al. Leveraging large language models for predictive chemistry. Nat Mach Intell 2024; 6: 161–169. [Google Scholar]




