The Lancet Digital Health. 2022 Jul 19;4(8):e567–e568. doi: 10.1016/S2589-7500(22)00122-4

Data provenance and integrity of health-care systems data for clinical trials

Macey L Murray a,b,c, Sharon B Love a,b, James R Carpenter a,b,e, Suzanne Hartley d, Martin J Landray b,c,g, Marion Mafham b,c,f, Mahesh K B Parmar a,b, Heather Pinches c, Matthew R Sydes a,b,h; on behalf of The Healthcare Systems Data for Clinical Trials Collaborative Group
PMCID: PMC9296098  PMID: 35868811

The need to run clinical trials quickly and efficiently is well recognised, never more so than during the push to find life-saving treatments and preventative measures for COVID-19. Late-phase clinical trials can take many years and are expensive because of the immense effort needed to deliver them. There are various ways in which the conduct of clinical trials could be improved; one is through judicious use of data already collected in health-care interactions. These data might be known as health-care systems data, routinely collected health-care data (RCHD), or real-world data. We describe here how one key roadblock to the use of these data can be removed.

In the UK, 50% of trials funded by the National Institute for Health and Care Research (to 2019) planned to access and use RCHD [1]. However, records of the data released to trials by registries between 2013 and 2018 show that fewer than 5% of UK-based randomised clinical trials obtained RCHD [2]. As notable examples, the RECOVERY [3] and PRINCIPLE [4] trials of potential treatments for COVID-19 have each successfully harnessed the benefits of RCHD, which simplified multi-site data collection through centralised collation and aided identification of potential trial participants [5]. Trialists more widely intend to use RCHD throughout a trial, from study design and recruitment to outcome ascertainment and post-trial follow-up [1, 2, 5, 6]; successful high-profile example trials will encourage further uptake.

Regulatory issues are a major challenge to wider use of RCHD in trials. The Medicines and Healthcare products Regulatory Agency in the UK and the US Food and Drug Administration recognise the potential value of RCHD for clinical trials that support regulatory decisions, and each has published draft guidance.

Trial sponsors nevertheless need to demonstrate that all the data used in the trial, including the RCHD, are reliable, complete, and relevant. This involves assessing data provenance and integrity, and the validity (diagnostic value) and suitability of routine datasets for trial measures (eg, outcomes, exposures, and covariates) [7]. Data provenance is the detailed record of the origins of the data and of the processes and methods by which they are produced. Data integrity is defined as the extent to which all data are complete, consistent, accurate, and reliable throughout the data lifecycle [8]. Although there is a standard endorsed by regulators for assessing systems that create or capture electronic clinical data as source (that is, original records) [9], there has been insufficient guidance on how to assess centrally curated RCHD.

Therefore, we developed a process to ascertain and document the provenance and integrity of RCHD. Our in-depth report has recently been made publicly available [9]. This report sets out the methods that we applied to the two NHS Digital data assets most requested by trialists: the Admitted Patient Care dataset of Hospital Episode Statistics (HES APC) and the Civil Registration of Deaths (CRD) [9].

The provenance and integrity of these two datasets were evaluated in three key stages: first, collection and transfer of data from health-care systems to NHS Digital's systems; second, centralised processing and curation to form the validated dataset; and, finally, linkage and extraction for trialists and the sponsor. At each stage, we reviewed the tools and systems used, and the controlled processes for managing data, data lineage, and access arrangements. Throughout, we sought advice on the level of detail required for documentation from the Medicines and Healthcare products Regulatory Agency, which provided helpful feedback during the development process.
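As an illustration only, the idea of documenting data lineage across the three stages described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the process used in our report or by NHS Digital: the `LineageStep` record, the `fingerprint` and `chain_is_unbroken` helpers, and the toy byte strings are all hypothetical. The sketch assumes that a fingerprint of the dataset is recorded as it enters and leaves each stage, so that a reviewer can confirm each stage received exactly what the previous stage produced.

```python
import hashlib
from dataclasses import dataclass, replace


def fingerprint(payload: bytes) -> str:
    """Return a SHA-256 hex digest identifying one version of a dataset."""
    return hashlib.sha256(payload).hexdigest()


@dataclass(frozen=True)
class LineageStep:
    """Hypothetical lineage record for one stage of the data lifecycle."""
    stage: str        # name of the stage
    input_hash: str   # fingerprint of the data entering the stage
    output_hash: str  # fingerprint of the data leaving the stage


def chain_is_unbroken(steps: list) -> bool:
    """True if every stage received exactly what the previous stage produced."""
    return all(
        earlier.output_hash == later.input_hash
        for earlier, later in zip(steps, steps[1:])
    )


# Toy byte strings standing in for the dataset at each point in its lifecycle.
source = fingerprint(b"records held in the health-care system")
submitted = fingerprint(b"records received by the data collator")
curated = fingerprint(b"validated dataset")
extract = fingerprint(b"linked extract for the trial")

steps = [
    LineageStep("collection and transfer", source, submitted),
    LineageStep("central processing and curation", submitted, curated),
    LineageStep("linkage and extraction", curated, extract),
]

# Simulate an undocumented change between curation and extraction:
# the third stage now reports an input that no earlier stage produced.
tampered = steps[:2] + [
    replace(steps[2], input_hash=fingerprint(b"altered dataset"))
]

print(chain_is_unbroken(steps))     # True
print(chain_is_unbroken(tampered))  # False
```

A record of this kind covers only one aspect of lineage. The tools and systems used, the controlled processes, and the access arrangements at each stage would still need to be documented separately.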

By investigating the data lifecycles of HES APC and CRD, we have demonstrated that their curation is robust, and handled with appropriate controls and automation. We are confident that the data can be considered as equivalent to high-quality transcribed versions of the original source data, and so are sufficiently reliable for use in clinical trials.

Our detailed approach has clear implications for the design, conduct, and analysis of clinical trials. We have demonstrated that these two key health-care systems datasets have the provenance and data integrity needed for use in clinical trials that support regulatory submissions. This approach is relevant to industry as well as academia. Greater use of RCHD should change many aspects of trial conduct, including the way trials are monitored, and will probably decrease their carbon footprint.

Our work on two health-care systems datasets is only the initial step. The integrity and provenance of each routinely collected dataset that might be used in clinical trials should now be systematically assessed and clearly documented using the same approach. We strongly encourage all data collators to share and maintain the necessary documentation in a manner similar to that which we have started for HES APC and CRD.

Trialists must also record the relevance of RCHD (validity and suitability) in their trial protocol and Trial Master File, and we suggest a process of curation and documentation of these choices [10]. Further work is needed to compare the use of RCHD with traditional trial-specific data collection methods, so that trialists can choose which approach to use for each data item, accounting for availability, completeness, timeliness, latency, and cost. Such assessments can be achieved through studies-within-a-trial in existing trials.

In conclusion, RCHD have the potential to transform the conduct of clinical trials, but trial sponsors need confirmation of the integrity and provenance of these data to satisfy regulatory-grade standards. We have demonstrated the process for two important datasets and now urge data providers to take the necessary steps to facilitate this for all relevant datasets. These steps will make trials more efficient and consequently lead to faster improvements in health care for all.

MM declares research grants from Novartis and Novo Nordisk unrelated to this manuscript. MJL declares grants from the Industrial Strategy Challenge Fund, HDR UK, National Institute for Health and Care Research Oxford Biomedical Research Centre, and Medical Research Council (MRC) Population Health Research Unit unrelated to this manuscript. MRS declares research grants from Astellas, Clovis Oncology, Janssen, Pfizer, Novartis, and Sanofi-Aventis unrelated to this manuscript; and speaker fees from Lilly Oncology and Janssen unrelated to this manuscript. SBL, MLM, JRC, SH, MKBP, and HP declare no competing interests. MLM, SBL, JRC, MKBP, and MRS acknowledge funding from HDR UK (HDR-9005) and MRC (MC_UU_00004/07, MC_UU_00004/08, and MC_UU_00004/09).

References


Articles from The Lancet. Digital Health are provided here courtesy of Elsevier
