A pipeline to further enhance quality, integrity and reusability of the NCCID clinical data

Anna Breger; Ian Selby; Michael Roberts; Judith Babar; Effrossyni Gkrania-Klotsas; Jacobus Preller; Lorena Escudero Sánchez; AIX-COVNET Collaboration; James H F Rudd; John A D Aston; Jonathan R Weir-McCall; Evis Sala; Carola-Bibiane Schönlieb

doi:10.1038/s41597-023-02340-7

2023 Jul 27;10:493. doi: 10.1038/s41597-023-02340-7

PMCID: PMC10374610 PMID: 37500661

Abstract

The National COVID-19 Chest Imaging Database (NCCID) is a centralized UK database of thoracic imaging and corresponding clinical data. It is made available by the National Health Service Artificial Intelligence (NHS AI) Lab to support the development of machine learning tools focused on Coronavirus Disease 2019 (COVID-19). A bespoke cleaning pipeline for NCCID, developed by NHSx, was introduced in 2021. We present an extension to the original cleaning pipeline for the clinical data of the database. It has been adjusted to correct additional systematic inconsistencies in the raw data, such as those affecting patient sex, oxygen levels and date values. The most important changes are discussed in this paper, whilst the code and further explanations are made publicly available on GitLab. The suggested cleaning will allow users worldwide to work with more consistent data when developing machine learning tools, without requiring expert medical knowledge. In addition, the paper highlights some of the challenges of working with clinical multi-center data and includes recommendations for similar future initiatives.

Subject terms: Predictive markers, Diagnostic markers, Prognostic markers

Introduction

First introduced in 2020, in response to the emergence of Coronavirus Disease 2019 (COVID-19) as a global pandemic, the National COVID-19 Chest Imaging Database (NCCID)¹ is a centralized database that contains computed tomography (CT), chest X-ray (CXR) and magnetic resonance (MR) images collected from over 20,000 patients across the UK. The aim of NCCID is to support a better understanding of the SARS-CoV-2 virus and to help in the development of machine learning tools, amongst others, to enable optimized care for hospitalized patients. At participating hospitals, data was requested for all patients with a positive COVID-19 reverse transcription polymerase chain reaction (RT-PCR or simply PCR) test as well as for a random sample of patients who tested negative. The number of negative patients was similar to the number of positive cases in the first disease ‘waves’ of 2020 and 2021, but has since increased.

Requests to access the NCCID can be made online, giving access to the comprehensive clinical data available in addition to pseudonymized Digital Imaging and Communication In Medicine (DICOM) image files. A large range of clinical features (including patient demographics, past medical history, laboratory results and outcomes) are provided for patients with a positive test for COVID-19 as well as selected features for patients with a negative PCR.

A cleaning pipeline for the NCCID clinical data was published on GitHub in 2021 (https://github.com/nhsx/nccid-cleaning/tree/v0.3.0) by NHSx (which has since been incorporated into NHS England) with the database now being managed by the NHS AI Lab. We will subsequently refer to it as the NHSx pipeline. The packaged cleaning pipeline includes remapping, clipping and rescaling of features that appear to contain messy data. The data was collected during a time of high pressure and resource demands, therefore inconsistencies are expected and some systematic errors can be traced back.

Cleaning of inconsistent or defective data is of high importance in order to allow reasonable usage for diagnostic analyses and the development of high-quality machine learning tools. The urgent need for COVID-19 models that provide a fast, automated prognosis for clinical triage has fortunately decreased in the last year. Nevertheless, the tremendous amount of COVID-19 related data collected around the globe offers uniquely powerful possibilities. When systematically stored and cleaned, such large amounts of medical data provide an outstanding opportunity for the development of machine learning tools in the medical domain beyond the pandemic, for example to provide models that can easily be adapted to diverse diseases and/or to enhance the understanding of deep learning related approaches.

As mentioned above, a major drawback of data collection during the COVID-19 pandemic was the scarcity of resources (e.g. time), resulting in inconsistent and out of range data. This complicated the use of multi-center datasets for non-medical experts as judging the plausibility of medical values requires expert knowledge. With our pipeline, we aim to contribute to the sustainability of data collection and to enable non-experts to work with the NCCID clinical data. This work will inform future multi-center projects, akin to NCCID, by providing examples of the challenges that are encountered when using such medical data as well as giving specific recommendations for dataset curators and developers.

The outline of the paper is as follows. In Section 2 we state the most important improvements obtained when using our pipeline in comparison to the raw data or the data cleaned by the NHSx pipeline. In Section 3 the importance of these results is discussed, whereas in Section 4 the methods used to identify the issues and implement the cleaning functions are explained. The data is available online upon request at the website of the NHS AI Lab (https://nhsx.github.io/covid-chest-imaging-database/) and our cleaning pipeline (including the required components from the NHSx pipeline) is available open-source on GitLab (https://gitlab.developers.cam.ac.uk/maths/cia/covid-19-projects/nccidxclean). Information about every function in the pipeline is provided in detail in our online documentation (https://maths.uniofcam.dev/cia/covid-19-projects/nccidxclean/).

Results

In this section we provide examples of the improvements that are achieved when using our cleaning pipeline in comparison to the raw data and the NHSx pipeline. In total, at the time of final development (November 2022), 21,253 patient cases were available from 25 hospital groups in England and Wales. 6,931 patients at 23 of the hospital groups had a positive PCR test with 2 organisations having only submitted patients who tested negative. We abbreviate the names of the hospitals following the key in Table 1 to increase the readability of the manuscript.

Table 1.

Codes used to represent the hospitals and the 23 NHS Trusts (England)/Health Boards (Wales).

Code	Hospital(s)	Submitting Center
A	Ashford and St Peter’s Hospital	Ashford and St Peters Hospitals NHS Foundation Trust
B	Unknown	Betsi Cadwaladr University Health Board
C	Princess Royal Hospital	Brighton and Sussex University Hospitals NHS Trust
D	Royal Sussex County Hospital	Brighton and Sussex University Hospitals NHS Trust
E	Unknown	Brighton and Sussex University Hospitals NHS Trust
F	Prince Charles Hospital	Cwm Taf Morgannwg University Health Board
G	Royal Glamorgan Hospital	Cwm Taf Morgannwg University Health Board
H	Ysbyty Cwm Cynon	Cwm Taf Morgannwg University Health Board
I	George Eliot Hospital	George Eliot Hospital NHS Trust
J	Basingstoke and North Hampshire Hospital	Hampshire Hospitals NHS Foundation Trust
K	Royal Hampshire County Hospital	Hampshire Hospitals NHS Foundation Trust
L	Charing Cross Hospital	Imperial College Healthcare NHS Trust
M	Hammersmith Hospital	Imperial College Healthcare NHS Trust
N	St. Mary’s Hospital	Imperial College Healthcare NHS Trust
O	Unknown	Leeds Teaching Hospitals NHS Trust
P	Liverpool Heart and Chest Hospital	Liverpool Heart and Chest NHS Foundation Trust
Q	Ealing Hospital	London North West University Healthcare NHS Trust
R	Northwick Park Hospital	London North West University Healthcare NHS Trust
S	Norfolk & Norwich University Hospital	Norfolk and Norwich University Hospitals NHS Foundation Trust
T	Unknown	Oxford University Hospitals NHS Foundation Trust
U	Unknown	Royal Cornwall Hospitals NHS Trust
V	Royal Surrey County Hospital	Royal Surrey NHS Foundation Trust
W	Royal United Hospital	Royal United Hospitals Bath NHS Foundation Trust
X	Unknown	Sandwell and West Birmingham Hospitals NHS Trust
Y	Sheffield Children’s Hospital	Sheffield Childrens NHS Foundation Trust
Z	Unknown	Sheffield Teaching Hospitals NHS Foundation Trust
α	St George’s Hospital	St Georges University Hospitals NHS Foundation Trust
β	Musgrove Park Hospital	Taunton and Somerset NHS Foundation Trust
γ	The Walton Centre	The Walton Centre NHS Foundation Trust
δ	Unknown	University Hospitals of Leicester NHS Trust
ε	West Suffolk Hospital	West Suffolk NHS Foundation Trust

The ‘Submitting Center’ in the NCCID corresponds to the organization submitting the data (the NHS Trust or Health Board), which may operate multiple hospitals. When possible, individual hospitals were used in the analysis to allow more detailed insights such as identification of inconsistencies between hospitals of the same submitting center.

Microsoft Excel spreadsheets were used for the collection of the clinical data, filled in by each submitting center and then uploaded via a web portal. The templates are available online in their original form (https://medphys.royalsurrey.nhs.uk/nccid/guidance.php), containing 68 data fields for positive patients and 7 for negative patients. Consequently, the vast majority of the fields only apply to patients who had a positive COVID-19 diagnosis and a huge amount of the data will always be missing by design when the full data is aggregated. To combat this, we will only discuss and show the results for the positive cohort from this point onward. In the following we will refer to the individual data fields as features.

The development of our cleaning pipeline was guided by structural inconsistencies, e.g. date formatting errors and implausible biomedical values. The application of the pipeline leads to a reduced number of missing values in comparison to the raw and the NHSx cleaned data, see Fig. 1. The correction of all features is described in detail in the documentation available on GitLab, and in this paper we focus on the description of 10 representative examples. The underlying methods are discussed in detail in Section 4.

Fig. 1 — Number of missing values (left) and dates (right) in the raw data, after cleaning with the NHSx and the extended pipeline.

Categorical features

Some systematic inconsistencies appeared in the categorical features including the handling of ‘Unknown’ versus blank entries, categorical confusions such as ‘F/M’ or ‘1/2’ versus ‘0/1’ for Sex, and misspellings such as ‘1es’ instead of ‘1’. See Fig. 2 for our corrections regarding Sex, past medical history (PMH) of Cardiovascular (CVS) Disease and PMH Hypertension.

Fig. 2 — Adjustment of the categorical features *Sex*, *PMH CVS Diseases* and *PMH Hypertension*. ‘NaN’ refers to missing data values.
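The kinds of categorical inconsistencies listed above (letter codes, misspellings, blank versus ‘Unknown’ entries) can be illustrated with a minimal sketch. The alias tables and the function name `normalise_category` are hypothetical and are not taken from the published pipeline; the real coding scheme of each NCCID feature should be checked against the templates and the online documentation.

```python
from typing import Dict, Optional

# Illustrative alias tables (assumptions, not the pipeline's real lookups).
SEX_ALIASES: Dict[str, str] = {"f": "F", "female": "F", "m": "M", "male": "M"}
YES_NO_ALIASES: Dict[str, str] = {
    "1": "1", "1es": "1", "yes": "1",
    "0": "0", "no": "0",
    "unknown": "Unknown",
}


def normalise_category(raw: Optional[object], aliases: Dict[str, str]) -> Optional[str]:
    """Return the canonical label for a raw entry, or None if blank or unrecognised."""
    if raw is None:
        return None
    key = str(raw).strip().lower()
    if not key:
        return None  # a blank cell stays missing and is not merged with 'Unknown'
    return aliases.get(key)
```

A lookup of this kind cannot resolve numeric coding conflicts such as ‘1/2’ versus ‘0/1’ for Sex, because the same code may carry different meanings at different centers; those cases need a per-center decision.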

Without using individualized DICOM information, the number of patients classified as female increased from 1,162 to 2,064 (78% increase) when compared to the NHSx pipeline, while the number of males remained 4,712. With additional updates from DICOM metadata, the number of patients classified as either male or female increased from 6,776 to 6,853 (1.1% increase).

For PMH Hypertension, 1,038 ‘Unknown’ values have been retained, in contrast to the NHSx pipeline, which removed them and left the entries blank. Distinguishing between unknown and missing values (here referred to as ‘NaN’) is important as they represent two distinct types of data that can have different implications for analysis and interpretation. ‘NaN’ values refer to data that is absent or unavailable, for reasons such as data entry errors, data loss, or data that was not collected. ‘Unknown’ values refer to data where the value is unknown or undefined due to the nature of the data or the data collection process. We found that the ‘Unknown’ values in the PMH Hypertension feature showed quite different statistical behavior compared to the original ‘NaN’ values, in particular:

77% of patients with ‘NaN’ were male, compared to 62% of ‘Unknowns’,
34% of patients with ‘NaN’ had a history of lung disease, compared to 13% of ‘Unknowns’,
37% of patients with ‘NaN’ had a history of Chronic Kidney Disease (CKD), compared to 10% of ‘Unknowns’,
3% of patients with ‘NaN’ value required intubation, compared to 10% of ‘Unknowns’.
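A comparison of this kind can be sketched as follows, assuming each patient is a dictionary in which a missing entry is stored as `None` and an explicit ‘Unknown’ is stored as a string. The field names and the toy records are invented for illustration and do not reproduce the NCCID figures.

```python
from typing import Dict, List, Optional


def share(records: List[Dict], group_value: Optional[str], flag: str) -> Optional[float]:
    """Fraction of patients in one hypertension group for whom `flag` is true."""
    group = [r for r in records if r["pmh_hypertension"] == group_value]
    if not group:
        return None
    return sum(1 for r in group if r[flag]) / len(group)


# Toy records (invented): None marks a missing entry, 'Unknown' an explicit one.
records = [
    {"pmh_hypertension": None, "male": True},
    {"pmh_hypertension": None, "male": True},
    {"pmh_hypertension": None, "male": False},
    {"pmh_hypertension": None, "male": True},
    {"pmh_hypertension": "Unknown", "male": True},
    {"pmh_hypertension": "Unknown", "male": False},
    {"pmh_hypertension": "1", "male": True},
]

nan_share = share(records, None, "male")
unknown_share = share(records, "Unknown", "male")
```

If the two groups were merged into a single blank category, differences such as those listed above would no longer be visible to downstream users.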

The proportion of patients categorized as having PMH CVS Disease fell from 58% to 23%. This is a result of disregarding all values for hospital X within this feature, since ‘1’ was exclusively entered, which would implausibly suggest that all of their 1,872 patients had previously suffered a myocardial infarction. Additionally, the PMH CVS Disease (included in Fig. 2) and PMH Lung Disease features have been collapsed to include only ‘Yes’, ‘No’ and ‘Unknown’ values. Our reasoning for this loss of information is discussed in the Methods section. However, the user can opt to retain the original categories by changing an input parameter when running the extended pipeline. Beyond these examples, similar adjustments have been applied to Ethnicity and all other categorical PMH and medication features.
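One way to detect a situation like that of hospital X is to flag hospitals that only ever report a single value for a feature. The sketch below is an assumption about how such a check could look, not the pipeline's implementation; the function names, the field names and the `min_count` threshold are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def constant_hospitals(rows: List[Dict], feature: str, min_count: int = 50) -> Set[str]:
    """Hospitals that entered exactly one distinct value for `feature`."""
    values = defaultdict(set)
    counts = defaultdict(int)
    for row in rows:
        value = row.get(feature)
        if value is None:
            continue  # missing entries carry no information about the coding
        values[row["hospital"]].add(value)
        counts[row["hospital"]] += 1
    # Require a minimum number of entries so that small sites are not flagged by chance.
    return {h for h in values if len(values[h]) == 1 and counts[h] >= min_count}


def blank_feature(rows: List[Dict], feature: str, hospitals: Set[str]) -> List[Dict]:
    """Return copies of the rows with `feature` set to None for the given hospitals."""
    return [
        {**row, feature: None} if row["hospital"] in hospitals else dict(row)
        for row in rows
    ]
```

A flag of this kind is only a prompt for review: whether a constant column is implausible depends on the feature and should be judged with clinical input.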

Dates

Inconsistent use of UK versus US date formats caused errors in the date entries themselves, which were compounded by errors introduced through file formatting. Using the NHSx pipeline, there were an additional 388 missing date values after cleaning when compared to the raw data. With our extended pipeline, the number of missing dates fell by 1,087 (3.6%) when compared to the NHSx pipeline and by 699 (2.3%) when compared to the raw data, see Fig. 1.

For three of the date features (Date of Admission, Date of Acquisition of 1st RT-PCR and Date of Positive Covid Swab), which were expected in US format (mm/dd/yyyy), 3 hospitals (α, γ, δ) were found to have submitted dates in UK format (dd/mm/yyyy). In the original cleaning pipeline, 67% of affected dates were still processed correctly, but in the remaining 33% of cases the error went undetected and 1,664 incorrect dates were obtained. In Fig. 3 (top) the amount of time between the affected PCR date and the closest imaging date has been computed for each patient. When applying our pipeline, it can be observed that large time gaps were eliminated at the affected hospitals. In Fig. 3 (bottom) the number of values on each day of the month is shown for the three date-related features. The spikes on days 1 and 4 in the plot of the NHSx cleaned data (left) have been removed.

Fig. 3 — (Top) Mean time of smallest time lag between a PCR test and corresponding imaging for each patient. (Bottom) Distribution of days throughout the month for the collected dates.
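The UK versus US ambiguity can be illustrated with a simple heuristic: dates whose day is larger than 12 parse in only one of the two formats, so they can be used to vote on the format a hospital most likely used. This is a minimal sketch under that assumption and is not necessarily how the extended pipeline decides; the function names are hypothetical.

```python
from datetime import date, datetime
from typing import List, Optional

US_FORMAT = "%m/%d/%Y"
UK_FORMAT = "%d/%m/%Y"


def try_parse(text: str, fmt: str) -> Optional[date]:
    """Parse `text` with `fmt`, returning None instead of raising on failure."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


def infer_format(date_strings: List[str]) -> str:
    """Vote on the format using entries that are valid in only one of the two."""
    us_only = uk_only = 0
    for text in date_strings:
        us = try_parse(text, US_FORMAT)
        uk = try_parse(text, UK_FORMAT)
        if us is not None and uk is None:
            us_only += 1
        elif uk is not None and us is None:
            uk_only += 1
    # Ties fall back to the US format that the templates expected.
    return UK_FORMAT if uk_only > us_only else US_FORMAT


def parse_hospital_dates(date_strings: List[str]) -> List[Optional[date]]:
    """Parse all dates of one hospital with the format inferred for that hospital."""
    fmt = infer_format(date_strings)
    return [try_parse(text, fmt) for text in date_strings]
```

Applying the vote per hospital and per feature, rather than per entry, avoids silently swapping day and month for ambiguous entries such as 01/04/2020. A plausibility check against an independent date, such as the imaging date used in Fig. 3, can then confirm the choice.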

Numerical features

The features Creatinine, Troponin and D-dimer obtained from blood tests and the fraction of inspired oxygen parameter FiO₂ have been corrected according to biological and technical threshold values as well as unit inconsistencies, see Fig. 4 for the comparison of the original and updated numerical values. Around 6% of Troponin I values that would have been lost in the original pipeline because they contained a less-than symbol (‘<’) were retained, whilst 186 values (26%) were truncated to a value of 10. Creatinine values for 22 samples (0.87%) were thresholded and the units were corrected for 8 (0.32%) values. As we see in Fig. 4, there are no longer any FiO₂ values below 21% and the distribution of the D-dimer results is now unimodal. The spike in FiO₂ just above zero likely represented values being expressed as a decimal rather than a percentage, which have now been corrected.

Fig. 4 — Adjustment of the blood test features and administered oxygen levels.
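Two of the corrections described above can be sketched in a few lines: rescaling FiO₂ values that were entered as a fraction, and retaining laboratory values that carry a less-than symbol. The function names and the exact bounds are assumptions chosen for illustration; the thresholds actually used are given in the pipeline documentation.

```python
from typing import Optional, Tuple


def clean_fio2(value: Optional[float]) -> Optional[float]:
    """Return FiO2 as a percentage between 21 and 100, or None if implausible."""
    if value is None:
        return None
    if 0.21 <= value <= 1.0:
        return value * 100.0  # assumed to be a fraction, e.g. 0.5 for 50%
    if 21.0 <= value <= 100.0:
        return value
    return None  # below room air or above 100%: treated as missing in this sketch


def parse_lab_value(raw: object) -> Tuple[Optional[float], bool]:
    """Parse entries such as '12.5' or '<10' into (value, is_below_detection_limit)."""
    text = str(raw).strip()
    censored = text.startswith("<")
    if censored:
        text = text[1:].strip()
    try:
        return float(text), censored
    except ValueError:
        return None, False
```

Keeping the censoring flag alongside the number lets a user decide later whether to use the detection limit itself, a fraction of it, or to treat the value separately.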

For patient oxygen levels (originally the PaO2 feature), 4 hospitals submitted partial pressures of oxygen (PaO₂), 17 provided oxygen saturations (SpO₂), and 1 provided a mixture of both. The PaO₂ values have been converted to the more commonly measured SpO₂, including a correction of units, see Fig. 5. In total, 149 values in the new SpO2 feature were inferred from likely PaO₂ values (8.5% of completed entries).

Fig. 5 — Box-plots of original *PaO2* values (left) and our *SpO2* output demonstrating more consistent values (right).
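This section does not state which conversion from PaO₂ to SpO₂ was used. As an illustration only, the sketch below uses the Severinghaus approximation of the oxygen dissociation curve, which is a swapped-in assumption and may differ from the pipeline's method. The function name and the explicit `unit` parameter are hypothetical.

```python
KPA_TO_MMHG = 7.50062  # 1 kPa expressed in mmHg


def pao2_to_spo2(pao2: float, unit: str = "mmHg") -> float:
    """Estimate oxygen saturation (%) from PaO2 with the Severinghaus approximation."""
    if unit == "kPa":
        pao2 = pao2 * KPA_TO_MMHG
    elif unit != "mmHg":
        raise ValueError(f"unsupported unit: {unit}")
    if pao2 <= 0:
        raise ValueError("PaO2 must be positive")
    # S = 1 / (23400 / (P^3 + 150 P) + 1), with P in mmHg and S as a fraction
    saturation = 1.0 / (23400.0 / (pao2 ** 3 + 150.0 * pao2) + 1.0)
    return 100.0 * saturation
```

For a PaO₂ of 100 mmHg this approximation gives a saturation of roughly 98%. The approximation assumes standard pH and temperature, so inferred values of this kind are estimates rather than measurements.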

Discussion

Due to the unprecedented pressure on resources in hospitals during the data collection period of the NCCID, many inconsistencies and systematic errors are found in the data. Some of these issues can be traced back and many have been corrected by the original NHSx pipeline. However, a number of important issues still remain which can reduce the relevance and generalizability of subsequently developed models, as we will discuss shortly. These have been identified and corrected by our extended pipeline where possible. Here, we discuss some of the key issues, their consequences and the impact of solving them. Although specific to NCCID, similar issues will be applicable to other such collaborative datasets. As such, we will subsequently provide recommendations on how data curation and cleaning may be improved in similar future projects.

Fairness, reproducibility and interpretability

Understanding the data used to train machine learning models is fundamental to ensuring ethical and fair algorithms. This is of paramount importance in the medical domain since outputs may directly influence clinical decision making and patient care. As stated in², data “truthfulness” includes understanding of how complete and detailed given data are, what information they contain, how accurately they reflect the true physical situation as well as measures of variance and bias. A certain level of bias is unavoidable in any dataset; however, its impact may be mitigated by gaining a thorough understanding of the data and any potential sources of bias^3,4. For example, if there is selection bias in the data collection or model development, it may be necessary to test the model on additional data to ensure that it is adequately evaluated on all groups prior to deployment^3,5. Major generalization failures of machine learning models have been reported, especially when relying on age and sex^6,7. Amongst others, diagnostic COVID-19 models have been shown to under-diagnose female and younger patients⁸ when they are under-represented in the training data⁹.

All papers published in the first year of the COVID-19 pandemic failed to satisfy sufficient criteria regarding reproducibility and interpretability for clinical use, despite the urgent need for tools to assist clinicians^4,10. Although this was partly a consequence of hidden biases within the data, it was further compounded by failures to adequately perform data cleaning and/or standardization and to describe the models in sufficient detail¹¹.