1. Technical process and data stewardship
Research using structured healthcare data should be conducted according to the FAIR data principles of findability, accessibility, interoperability, and reusability.13
Important considerations are transparency about who performed the coding, which coding system was used, and the purpose of the coding (reimbursement, diagnosis, etc.).
Clear and consistent identification and description of the sources of EHR data. Code lists and phenotyping algorithms can be described in detail and published, ideally before a study commences (for example on a coding repository or open-source archive). The minimum data required to meet the definitions will depend on the use case and can be reported to enhance transparency, in addition to the rationale for why certain decisions were made (for example, why one code was chosen over another, or what the effect would be if data collection periods were changed).
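As a minimal sketch of what a published, versioned code list might look like, the following assumes a phenotype is shared as a small structured artifact; the phenotype name, codes, and field names are invented for illustration, not drawn from any real published algorithm:

```python
# Hypothetical published code list for a phenotype. All codes and metadata
# fields below are invented for illustration; a real code list would cite a
# specific terminology release (e.g. ICD-10, SNOMED CT) and be archived
# before the study commences.

PHENOTYPE = {
    "name": "type_2_diabetes",
    "version": "1.0.0",          # versioned so published analyses are reproducible
    "coding_system": "ICD-10",
    "codes": ["E11", "E11.9"],   # illustrative codes only
    "purpose": "diagnosis",      # why the codes were recorded (vs. reimbursement)
    "rationale": "E10 excluded to avoid capturing type 1 diabetes",
}

def matches_phenotype(patient_codes, phenotype=PHENOTYPE):
    """Return True if any recorded code falls under the phenotype's code list."""
    return any(
        code == c or code.startswith(c + ".")
        for code in patient_codes
        for c in phenotype["codes"]
    )
```

Publishing the rationale field alongside the codes makes the decision trail (why one code was chosen over another) auditable without re-contacting the authors.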
Validation at local, regional and global levels. Evidence demonstrating how algorithms have been externally validated, and also what quality assessment was performed on the research findings, for example on the accuracy, completeness and timeliness of the data.14 Data quality rules can be used to assess coded data and allow comparisons across institutions and countries.15
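A minimal sketch of what such data quality rules might look like in practice, assuming coded records are carried as Python dicts with invented field names (`code`, `event_date`, `coded_date`):

```python
# Illustrative data quality rules for coded EHR records. The record layout
# and field names are assumptions made for this sketch; real rule sets (e.g.
# those used for cross-institution comparisons) are far more extensive.

from datetime import date

def completeness(records, field):
    """Fraction of records with a non-missing value for `field`."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get(field) is not None) / len(records)

def timeliness(records, max_lag_days=30):
    """Fraction of records coded within `max_lag_days` of the clinical event."""
    dated = [r for r in records if r.get("coded_date") and r.get("event_date")]
    if not dated:
        return 0.0
    on_time = sum(
        1 for r in dated
        if (r["coded_date"] - r["event_date"]).days <= max_lag_days
    )
    return on_time / len(dated)
```

Running the same rules over datasets from different institutions yields comparable quality scores, which is what enables the cross-country comparisons described above.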
Reporting of methods used for data pre-processing and data linkage. This includes the methods used to assess the quality of linkage and the results of any data pre-processing and linkage (with provision of false positive and false negative rates, comparisons of linked and unlinked data, and any sensitivity analyses).16 A flow diagram showing the processes for cleaning and linking different coding sources and datasets can aid understanding of the study design.
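The false positive and false negative rates mentioned above can be computed by comparing a linkage run against a manually reviewed gold standard subset. A hedged sketch, assuming linked pairs are represented as sets of identifier tuples (a representation chosen for this illustration):

```python
# Sketch of linkage quality assessment. `linked_pairs` is the set of record
# pairs joined by the linkage algorithm; `true_pairs` is a gold standard set
# from manual review. The tuple representation is an assumption for this
# example, not a prescribed format.

def linkage_error_rates(linked_pairs, true_pairs):
    """Return (false_positive_rate, false_negative_rate) for a linkage run."""
    false_pos = linked_pairs - true_pairs    # linked, but should not have been
    false_neg = true_pairs - linked_pairs    # true links the algorithm missed
    fp_rate = len(false_pos) / len(linked_pairs) if linked_pairs else 0.0
    fn_rate = len(false_neg) / len(true_pairs) if true_pairs else 0.0
    return fp_rate, fn_rate
```

Reporting both rates, rather than a single accuracy figure, lets readers judge whether linkage errors could bias the comparison of linked and unlinked data.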
Reporting of the governance framework underpinning the study from a technical/data stewardship standpoint. This includes a clear purpose for data gathering and the parameters and time limit of consent, clear mechanisms for data processing (“what happens with my data”), and a description of what the data can and cannot be used for (i.e. the mandate given for research).
2. Data security and privacy
Working towards a new, sustainable mandate from the public and patients to use their health data may require moving away from abstract rules and regulations and towards more constructive governance, in which trust is a central concept. The trust of patients and the public in research institutions and in science is pivotal because of the liberties they grant researchers to use their data; these liberties are the product of a social licence grounded in that trust.
Gaining this trust would benefit from understanding what society and stakeholders expect from scientists conducting health data research, with engagement of stakeholders from the concept stage. Co-creation of data governance based on inclusion of patient/public communities and dialogue with researchers is crucial for ethical and sustainable governance, and to translate expectations into scientific research and scientific output.
Researchers and big data consortia have to be mindful that trustworthiness comes with the duty to act in ethically responsible ways. This concerns two areas: first, competence in data handling (meaning that systems are in place to ensure data protection and there is a framework of rules and regulations for data sharing); and second, the motivation behind the data analysis. Ongoing dialogue can ensure that public values continue to be aligned with the governance structures of health data research projects. Questions arise as to how to measure success at implementing public values into research, and what levels of public support are sufficient to grant a mandate for data usage.
Complex organisational structures may be less important for this trust than is often asserted. Complicated rules and regulations may do more harm than good in establishing the conditions for public trust in big data health research to flourish, and can be counterproductive, especially when a social licence has not been adequately established.17
Embracing values such as transparency, reciprocity, inclusivity and service to the common good. These values can be embedded into the governance framework of big data health research.18 This calls for a narrative in which researchers and research consortia can be held accountable, so that patients and the wider public remain consistently willing to place their trust in health research projects.
Governance could be aided by developing a framework for accountability. This includes clear distinctions between anonymised, pseudonymised and aggregate data, explained in plain language to participants and users, and a clear distinction between primary and secondary use of data sources.
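The distinction between pseudonymised and aggregate data can be made concrete with a toy sketch; the key, identifiers, and field names below are all invented, and real pseudonymisation rests on governance of key custody, not merely on hashing:

```python
# Toy illustration of pseudonymised vs. aggregate data. The secret key and
# identifier format are hypothetical; in practice the key is held by a data
# custodian under a governance framework, which is what makes the mapping
# reversible only by authorised parties.

import hashlib
import hmac

SECRET_KEY = b"held-by-data-custodian-only"   # hypothetical custodian key

def pseudonymise(patient_id: str) -> str:
    """Replace a direct identifier with a keyed hash; stable, so records still link."""
    return hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).hexdigest()

def aggregate(ages):
    """Aggregate data retains no per-person rows at all."""
    return {"n": len(ages), "mean_age": sum(ages) / len(ages) if ages else None}
```

The plain-language explanation follows directly: pseudonymised records can still be linked per person (and re-identified by the key holder), whereas aggregate outputs contain no individual-level rows.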
3. Publications using structured healthcare data
Accountability for the source of data and how the data have been collected (traceability). As with data security, a framework of accountability would enable editorial teams in medical journals to be aware of the technical processes prior to data analysis.
Sharing of data, codes and algorithms used to analyse datasets. Similar to the requirement for pre-registration of clinical trials and pre-publication of protocols, journals could restrict publication where the coding within a study is not shared.
Demonstration of data validity and robust analysis. The FDA and EMA already suggest independent checking or accreditation of data sources; this accreditation could be provided to editors to increase their confidence in data quality.
Balancing the speed of publication against requirements for data validation. Prompt publication (for example of results with immediate public health implications) needs to be balanced against validation of data sources to ensure authenticity.
Scientific advice committees with experts in big data analytics to aid journal editorial teams. The skill set required of editors and reviewers for studies using structured healthcare data is not the same as statistical or clinical trials experience; expertise in EHR data and the respective coding systems could add value to the journal review process.
Widening gap between the knowledge of physicians and the advanced methodologies used in big data papers. Medical/graduate students and practising clinicians, as well as hospital managers and leadership, need training in health data management and analysis. This is important to build a digital workforce with increased capacity and capability to translate publications using new approaches to improve patient care.19
4. Addressing the needs of regulators, reimbursement authorities and clinical practice guidelines
EHR-based trials have the potential to generate reliable and cost-efficient results. Each type of trial and each type of clinical question needs to be considered in its individual context, including the circumstances under which a particular type of EHR process could assist in answering questions about a particular intervention, and with what limitations.
Further research may help explore cases in which EHR studies produce valuable evidence, and when they might be flawed. This will give regulators confidence in future EHR studies, and enable guideline taskforces to appraise evidence appropriately.
Quality standards will help to ensure that the information recorded in EHR systems represents real events without bias. This will enable confidence that trials using EHRs can produce reliable results on efficacy and safety, and could include examination of the validity of both data sources and data analyses.
Source data validation to report on appropriate computational phenotypes. This could be supported by an independent adjudication committee to examine a subset of the EHR and confirm outcome events. The use of AI techniques could facilitate larger validation studies by automated extraction of supporting text from clinical notes. Such validation exercises can be pre-registered, for example in the form of a Study-Within-A-Trial.20 Another possibility is for researchers to provide consented and anonymised gold standard cases to benchmark against, or for data from devices used to verify codes (such as lead fractures). The value of synthetic datasets for validation, which mimic real data, needs further exploration.
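The comparison against adjudicated outcome events reduces, in its simplest form, to reporting positive predictive value and sensitivity of the computational phenotype. A hedged sketch, assuming cases are represented as sets of patient identifiers (a representation chosen for this example):

```python
# Sketch of validating a computational phenotype against outcomes confirmed
# by an independent adjudication committee. The set-of-identifiers layout is
# an assumption for this illustration, not a prescribed data format.

def validate_phenotype(algorithm_positive, adjudicated_positive):
    """Return (PPV, sensitivity) of the algorithm against adjudicated cases.

    Both arguments are sets of patient identifiers flagged as cases.
    """
    true_pos = algorithm_positive & adjudicated_positive
    ppv = len(true_pos) / len(algorithm_positive) if algorithm_positive else 0.0
    sensitivity = (
        len(true_pos) / len(adjudicated_positive) if adjudicated_positive else 0.0
    )
    return ppv, sensitivity
```

Pre-registering the validation subset and these two target metrics is what a Study-Within-A-Trial design formalises.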
Mixed model approaches to collect data on particular endpoints. These may be valuable for situations where the EHR does not reliably collect relevant data, for example where patients and/or clinicians are asked for information, or data are collected via wearable devices or telemonitoring. In some cases, parallel monitoring of patients alongside the EHR study may provide additional confidence (for example to identify serious unexpected adverse events). Technological advances in EHR systems will help, such as the ability to retrieve EHR data on a daily basis to support clinical trials.21
Taking advantage of the many real world data initiatives to support new research. Government agencies, regulators, charities and professional bodies have initiated programmes for better use of real-world data that can support further activity and dissemination.