Abstract
Background
Manual curation of real-world data (RWD) for patients with ovarian cancer is complex and costly. We set up a novel collaboration between informatics and clinical teams to automate data curation at scale. This enabled integrated and timely access to RWD across all ovarian cancer patients treated within a tertiary gynaecological cancer centre of the UK National Health Service, setting the basis for research and operational use.
Materials and methods
The collaboration defined high-yield, accessible data which were pulled into tables representing various clinical domains followed by a systematic integration, cleaning and analysis within the iCARE Secure Data Environment.
Results
We curated data for 1581 patients diagnosed between 1 January 2014 and 31 December 2022. We showed that referrals to the specialist tumour board consistently increased over time while baseline characteristics did not change significantly. The number of patients receiving a new line of therapy decreased in 2020, the first year of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) outbreak. Data robustness was supported using multivariate survival modelling demonstrating the expected impact of known prognostic factors. There was a paucity of available data for some variables (e.g. ethnicity) while others lacked a consistent storage mechanism within source systems (genomic data).
Conclusions
Automated curation and analysis of RWD is possible at scale, in real time. Analysis yielded clinical findings consistent with the prevalent literature and showed evolution of treatment practice. While not all unstructured data could be explored, we demonstrate that automated curation of clinically important real-world variables is feasible and can yield robust data for both research and operational purposes.
Key words: ovarian cancer, real-world data, automation
Highlights
• RWD are expensive to manually curate at scale.
• Programmatic curation of electronic health record data is feasible.
• Such data are robust and outcomes reflect clinically expected patterns.
• The data show a change in treatment practice over time.
• The data have purpose for both research and operational use cases.
Introduction
Ovarian cancer has the highest mortality of any gynaecological malignancy, causing over 35 000 deaths in Europe in 2022.1, 2, 3 It is usually diagnosed at an advanced, incurable stage; even with optimal treatment, it frequently recurs, and 5-year survival is <50%.4
Real-world data (RWD), routinely collected health data including from electronic health records, registries and wearable technology, can supplement pre-clinical and clinical trial data to improve outcomes. Real-world evidence (RWE) generated from such data can inform detection of care inequality, development of prognostic models or the generation of hypotheses for prospective studies.5,6 The European Organisation for Research and Treatment of Cancer has recently outlined its strategy to contribute to the generation of robust RWE, while the European Society for Medical Oncology has published its guidance for reporting RWE (ESMO Guidance for Reporting Oncology real-World evidence [ESMO-GROW]).7,8
The largest and most well-established RWD sources are registries whose original scope was to investigate cancer incidence and burden on healthy populations facilitating evaluation of screening outcomes, quality of care and regional disparities.9, 10, 11
One example is the UK’s Ovarian Cancer Audit Feasibility Pilot, followed by the ongoing National Ovarian Cancer Audit (NOCA).10 Key findings included improvement in survival since 2001, national variation in stage distribution and variation in survival. Certain variables, however, including details of specific systemic anticancer therapies (SACTs), were not captured. Furthermore, the initial audit findings were made available a few years after data capture, highlighting a lag often seen with large-scale registry data reports. Changes in practice, such as the uptake of new therapies, need to be captured in near real time to be useful for operational purposes and for influencing care pathways.12 Ultimately, a more detailed understanding of all aspects of the ovarian cancer treatment pathway is required. While population registries serve a purpose, their limitations include those intrinsic to the type of registry and those deriving from the way the data are collected.13, 14, 15, 16
To mitigate some of these limitations, clinical quality registries (CQRs) systematically collect detailed multi-site data using standard procedures, enabling provision of clinical quality indicators for benchmarked reporting to participating sites.17 Data may be updated continuously or in short intervals to drive care and quality improvement. They may adhere to best practice in conformity with the principles of ‘registry science’ ensuring they are designed, operated and utilised robustly, adding credence to their goal of care and quality improvement.18
While CQRs have potential to directly impact care, historically RWD have been collected manually, a labour-intensive, inefficient and error-prone process; this precludes frequent inspection of data and timely assessment of change. Patient-related data are present across several electronic health record (EHR) systems, in both structured and unstructured formats, further compounding the problem.
Here, the clinical and informatics teams at the Imperial College Healthcare National Health Service (NHS) Trust (ICHT) collaborated to collect such data programmatically. We then analysed data via semi-automated pipelines, allowing efficient and repeatable inspection of data in the future. We present our approach to curation and analysis of these RWD and demonstrate how we can use them to monitor trends over time and variation across different patient groups.
Materials and methods
Design of the study and patient involvement
A collaboration between clinical experts and the informatics team was established to define clinically important variables (Supplementary Appendix S1, available at https://doi.org/10.1016/j.esmorw.2025.100150) and establish the feasibility of data capture.
A collaboration was established with the National Institute for Health Research (NIHR) Biomedical Research Centre Imperial Patient Experience Research Centre (PERC) to involve patients in the design and conduct of this study. Those who had lived experience of ovarian cancer personally, or as a carer, were invited. Online sessions were held to understand how these individuals thought the team could use (anonymised) Imperial Clinical Analytics, Research and Evaluation (iCARE) data to improve care for all patients with ovarian cancer. They also commented on their perceptions of inequalities in care and data-enabled mitigation.19 Insights were used to shape and improve this project.
Data curation
Structured and unstructured data from multiple EHR systems across the Trust’s information technology infrastructure, captured during direct care at ICHT, were imported into the iCARE Secure Data Environment (SDE). This environment is approved to curate identifiable data, anonymise and provide access for clinical analytics, research and evaluation. It is hosted within the NHS infrastructure. Integration was achieved via linked server connections (for systems including Oracle Cerner Millennium® and Somerset Cancer Register) while SACT reports were manually submitted as comma-separated value files via secure file transfer protocol, before being loaded into the SDE.
Data were queried using SQL. This produced structured data and free-text extracts which were stored across multiple tables. Free-text data underwent standardisation including case conversion and removal of special characters (e.g. ‘Stage-III’ -> ‘Stage III’).
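The standardisation step can be illustrated with a minimal sketch. The study carried out this step in SQL; Python is used here only for illustration, and the function names `standardise` and `match_key` are hypothetical. The sketch assumes that special characters are replaced by spaces and that case conversion is applied only when values are compared.

```python
import re


def standardise(text: str) -> str:
    """Replace special characters with spaces and collapse repeated whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


def match_key(text: str) -> str:
    """Case-converted form, used only for comparing values (an assumption)."""
    return standardise(text).upper()


print(standardise("Stage-III"))   # Stage III
print(match_key("stage_iii"))     # STAGE III
```

Note that this simple character class would also strip accented letters, so a production rule set would need a wider definition of permitted characters.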
Free-text curation was carried out using SQL-based named entity recognition techniques. Methods included the following: (i) Key–value pair extraction, (ii) Rule-based text parsing, (iii) Context-aware extraction. De-identified data were accessed for research within the iCARE SDE research account. Data were pulled into seven tables, designed during collaboration between the iCARE informatics and clinical teams (Supplementary Appendix S1, available at https://doi.org/10.1016/j.esmorw.2025.100150). Data for validation were accessed in approved, access limited, secure locations by the direct care team.
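The three free-text methods listed above can be sketched as follows. This is an illustration under stated assumptions, not the study's rule set: the study used SQL, whereas this sketch uses Python regular expressions, and every pattern, the 40-character context window and the function name `extract_stage` are hypothetical.

```python
import re
from typing import Optional

# Hypothetical patterns; the actual SQL rules are not given in the text.
KEY_VALUE = re.compile(r"\bFIGO\s+stage\s*[:=]\s*(IV|III|II|I)[ABC]?\b", re.IGNORECASE)
RULE_BASED = re.compile(r"\bstage\s+(IV|III|II|I)[ABC]?\b", re.IGNORECASE)
# A negation cue counts only if no full stop separates it from the match.
NEGATION = re.compile(r"\b(no evidence of|not|excluded|unlikely)\b[^.]*$", re.IGNORECASE)


def extract_stage(note: str) -> Optional[str]:
    """Return the FIGO stage numeral (substage letter dropped) or None."""
    # (i) Key-value pair extraction: an explicit "FIGO stage: X" field wins.
    match = KEY_VALUE.search(note)
    if match:
        return match.group(1).upper()
    # (ii) Rule-based parsing: any "stage X" mention in running text.
    for match in RULE_BASED.finditer(note):
        # (iii) Context-aware extraction: skip mentions preceded by a negation.
        preceding = note[max(0, match.start() - 40):match.start()]
        if not NEGATION.search(preceding):
            return match.group(1).upper()
    return None


print(extract_stage("No evidence of stage IV disease. Consistent with stage II."))  # II
```

Trying the most specific rule first and falling back to looser rules keeps the explicit fields authoritative while still recovering values from narrative text.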
Structured and unstructured data were curated from the following EHR systems:
(i) Somerset Cancer Register (SCR): demographic and diagnostic data [including Fédération Internationale de Gynécologie et d’Obstétrique (FIGO) stage, Systematized Nomenclature of Medicine - Clinical Terms (SNOMED) histology and Eastern Cooperative Oncology Group performance status (PS)].
(ii) Oracle Cerner Millennium® (CM): demographic data, surgical data, outpatient pharmacy encounters (oral SACT), visit data, date of death (updated through NHS Spine).
(iii) ARIA® oncology information system (Siemens and Varian): SACT data.
(iv) North West London Pathology: haematology, biochemistry, tumour markers, histology reports.
The cohort was selected from patients ≥18 years old with first recording of the following International Statistical Classification of Diseases and Related Health Problems (ICD) codes between 1 January 2014 and 31 December 2022:
(i) C56X: malignant neoplasm of the ovary
(ii) C57: malignant neoplasm of other and unspecified female genital organs
(iii) C48: malignant neoplasm of retroperitoneum and peritoneum
(iv) D39: neoplasm of uncertain or unknown behaviour of female genital organs
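The cohort definition above can be expressed as a short filter. This is a sketch under assumptions: the record layout (`Diagnosis`), the prefix matching on ICD codes and the reading that the first qualifying recording must itself fall within the study window are illustrative choices, not details confirmed by the text.

```python
from dataclasses import dataclass
from datetime import date

ICD_PREFIXES = ("C56", "C57", "C48", "D39")
WINDOW_START, WINDOW_END = date(2014, 1, 1), date(2022, 12, 31)


@dataclass
class Diagnosis:  # hypothetical record layout
    patient_id: str
    icd_code: str
    recorded_on: date
    age_at_recording: int


def select_cohort(records):
    """Keep each patient's first qualifying recording if it meets the criteria."""
    first = {}
    for rec in records:
        if not rec.icd_code.startswith(ICD_PREFIXES):
            continue
        current = first.get(rec.patient_id)
        if current is None or rec.recorded_on < current.recorded_on:
            first[rec.patient_id] = rec
    return {
        pid: rec
        for pid, rec in first.items()
        if WINDOW_START <= rec.recorded_on <= WINDOW_END and rec.age_at_recording >= 18
    }
```

Taking the earliest recording before applying the date window means a patient first recorded in 2013 is excluded even if the same code reappears later, which matches the "first recording" wording.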
Once this cohort was selected from SCR data, demographic and other data were linked from CM. Date of death, at month granularity to comply with privacy requirements, was curated from the NHS Spine, a central digital infrastructure supporting health care in England.20 For those with no date recorded, the date of last known contact was recorded instead (Supplementary Appendix S2, available at https://doi.org/10.1016/j.esmorw.2025.100150).
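As an illustration of how a month-granularity date of death and a last-contact date could be turned into a follow-up time and an event flag, consider the sketch below. The actual censoring rules are described in Supplementary Appendix S2; placing the death at mid-month, converting days to months with 30.44 and the function name `follow_up` are assumptions made only for this example.

```python
from datetime import date
from typing import Optional, Tuple

DAYS_PER_MONTH = 30.44  # assumed average month length


def follow_up(
    diagnosis: date,
    death_month: Optional[Tuple[int, int]],  # (year, month) or None
    last_contact: date,
) -> Tuple[float, bool]:
    """Return (follow-up in months, death observed?)."""
    if death_month is not None:
        year, month = death_month
        end = date(year, month, 15)  # assumption: place death at mid-month
        event = True
    else:
        end = last_contact  # censor at last known contact
        event = False
    months = max((end - diagnosis).days, 0) / DAYS_PER_MONTH
    return months, event
```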
Where structured data for stage, histology or PS were missing, unstructured data from CM and SCR were queried using rule-based algorithms applied to encounters, multi-disciplinary team (MDT) documentation, imaging and histology reports using free text entered 6 months before or after diagnosis. The value closest to the date of diagnosis was kept. ‘Null’ values were entered where data were missing. Rule-based free-text curation was also applied to surgical reports, aiming to extract information about timing of cytoreductive surgery.
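The "closest value to diagnosis" rule can be sketched as follows. The 183-day approximation of 6 months, the input layout and the function name `closest_value` are assumptions for illustration.

```python
from datetime import date

WINDOW_DAYS = 183  # assumed approximation of 6 months


def closest_value(diagnosis_date, candidates):
    """candidates: list of (document_date, extracted_value) pairs.

    Returns the value dated closest to diagnosis within the window,
    or None (the 'Null' case) when nothing usable was extracted.
    """
    in_window = [
        (abs((doc_date - diagnosis_date).days), value)
        for doc_date, value in candidates
        if value is not None and abs((doc_date - diagnosis_date).days) <= WINDOW_DAYS
    ]
    if not in_window:
        return None
    return min(in_window, key=lambda pair: pair[0])[1]
```

For example, with a diagnosis on 1 June 2018, a value from a report dated 20 June 2018 is preferred over one dated 15 January 2018, and a report from March 2019 is ignored because it falls outside the window.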
Finally, SACT data from ARIA were curated by the Trust’s pharmacy team and linked to the dataset.
Data were de-identified and pseudonymised according to the ICHT Data Protection office-approved de-identification process. Once quality assurance processes were completed, access to data was provided within the ICHT iCARE SDE. Access is limited to approved researchers, all access to data is audited, and data extraction is prevented except through iCARE airlock procedures, in which iCARE review ensures that only aggregated, anonymised data can be removed.
Data analysis—pre-processing
Analysis used R (version 4.3.2). Firstly, data were cleaned. Although data were curated programmatically, the source data had largely been inputted manually and were therefore subject to error and missingness. Numerical fields were checked for nonsensical data and distributions plotted for inspection. Categorical data were grouped into appropriate categories with rules defined by a clinical expert (LT, Supplementary Appendix S1, available at https://doi.org/10.1016/j.esmorw.2025.100150), while rule-based free-text comments were also categorised through manual inspection.
Ovarian cancer is managed within expert centres in the UK with referrals coming from other hospitals (the units) according to geographic proximity. Expert centres provide surgical expertise and receive referrals for second and third opinions. Given referral complexity, we defined three cohorts and analysed them separately where indicated:
• Cohort 1: all patients (n = 1581)
• Cohort 2: patients diagnosed and treated at the expert/tertiary/central Trust network of hospitals (n = 827)
• Cohort 3: patients referred from units outside the immediate expert Trust network (n = 754); patients usually receive surgery at the expert centre but SACT at the referring unit
Survival analysis was conducted using the survival (version 3.5) and survminer (version 0.4.9) packages while forest plots were generated using forestplot (version 3.1.1).
Ethics approval: The iCARE research database was given favourable ethics approval by the South West - Central Bristol Research Ethics Committee (reference 21/SW/0120; IRAS project ID 282093). This project was approved by the NIHR Imperial Biomedical Research Centre Data Access and Prioritisation committee and de-identified data were accessed within the ICHT iCARE SDE. All members involved in the project underwent training in data security awareness and signed the ICHT Terms of Use for iCARE Data.
Results
Data for 2596 records corresponding to 2506 unique patients with an MDT-recorded new diagnosis under the relevant ICD codes (see ‘Materials and methods’) between 3 January 2014 and 31 December 2022 were identified from Trust EHR systems. After validation of curation method (Supplementary Appendix S3, available at https://doi.org/10.1016/j.esmorw.2025.100150) duplicates, recurrent episodes and records with non-epithelial or ambiguous histology (Supplementary Figure S1, available at https://doi.org/10.1016/j.esmorw.2025.100150) were removed. A total of 1581 unique records remained for the analysis set in cohort 1. For cohort 2 (tertiary referral centre network only), 827 patients remained in the analysis set while 754 patients were in the set for cohort 3.
Overview of patient, tumour and treatment characteristics and trends
Supplementary Figure S2, available at https://doi.org/10.1016/j.esmorw.2025.100150, shows the number of MDT-recorded new diagnoses (diagnoses) per year, partitioned by referral centre. The number of diagnoses showed a steady upward trend since 2014, driven by referrals from outside the immediate Trust network (from cohort 3).
Baseline demographics are shown for all years combined (Table 1) and (for cohort 1 only) per year (Supplementary Table S1, available at https://doi.org/10.1016/j.esmorw.2025.100150). Overall, median age was 62 years [interquartile range (IQR) 53-72 years] with no consistent trend seen over time. The median body mass index was 25.2 (IQR 22-29.4), which also showed no trend with time. Most patients had a PS of 0 or 1 (70.2% combined). Ethnicity was missing or not stated in 42.8% of cases; the most commonly recorded ethnicities were Caucasian (36.4%) and South Asian (11.1%). Among those with known ethnicity, proportions remained unchanged over the observation period (chi-square test, P > 0.05).
Table 1.
Demographic and diagnostic characteristics at baseline
| Demographics | Cohort 1 (n = 1581) | Cohort 2 (n = 827) | Cohort 3 (n = 754) | Tumour characteristics | Cohort 1 (n = 1581) | Cohort 2 (n = 827) | Cohort 3 (n = 754) |
|---|---|---|---|---|---|---|---|
| Median age, years (IQR) | 62 (53-72) | 62 (52-72) | 63 (62-64) | Stage | | | |
| Performance status, n (%) | | | | I | 195 (12.3) | 131 (15.8) | 64 (8.5) |
| 0 | 539 (34.1) | 293 (35.4) | 246 (32.6) | II | 162 (10.2) | 99 (12.0) | 63 (8.4) |
| 1 | 571 (36.1) | 315 (38.1) | 256 (34) | III | 747 (47.2) | 354 (42.8) | 393 (52.1) |
| 2 | 128 (8.1) | 86 (10.4) | 42 (5.6) | IV | 354 (22.4) | 178 (21.5) | 176 (23.3) |
| 3 | 46 (2.9) | <30 (3.6) | <25 (3.3) | Not available | 123 (7.8) | 65 (7.9) | 58 (7.7) |
| 4 | <10 (<2) | <10 (<2) | <10 (<2) | Histology | | | |
| Not available | 288 (18.2) | 97 (11.7) | 191 (25.3) | High-grade serous | 926 (58.6) | 470 (56.8) | 456 (60.5) |
| Smoking, n (%) | | | | Low-grade serous | 77 (4.9) | 39 (4.7) | 38 (5) |
| Never smoked | 11 (0.7) | <10 (<2) | <10 (<2) | Clear cell | 82 (5.2) | 54 (6.5) | 28 (3.7) |
| Non-smoker | 337 (21.3) | 234 (28.3) | 103 (13.7) | Carcinosarcoma | 48 (3) | 22 (2.7) | 26 (3.4) |
| Ex-smoker | 74 (4.7) | 45 (5.4) | 29 (3.8) | Endometrioid, G1 | 30 (1.9) | <25 (<3) | <10 (<2) |
| Smoker | 64 (4) | <50 (6) | <30 (<4.1) | Endometrioid, G2 | 48 (3) | 34 (4.1) | 14 (1.9) |
| Not available | 1095 (69.3) | 501 (60.6) | 594 (78.8) | Endometrioid, G3 | 26 (1.6) | <20 (2.4) | <10 (<1.3) |
| Median BMI (IQR) | 25.2 (22-29.4) | 25.5 (22.2-29.5) | 24.6 (21.7-29.3) | Mucinous, G1 | 19 (1.2) | <15 (<1.8) | <10 (<1) |
| Ethnicity, n (%) | | | | Mucinous, G2 | 18 (1.1) | <15 (<1.8) | <10 (<1.3) |
| Caucasian | 576 (36.4) | 348 (42.1) | 228 (30.2) | Mucinous, G3 | <15 (<1) | <10 (<1.2) | <10 (<1.3) |
| South Asian | 175 (11.1) | 117 (14.1) | 58 (7.7) | Null histology | 292 (18.5) | 138 (16.7) | 154 (20.4) |
| Other ethnicity | 107 (6.8) | 80 (9.7) | 27 (3.6) | Other | <10 (<1) | <10 (<1.2) | <10 (<1.3) |
| African, Caribbean | 47 (3) | 36 (4.4) | 11 (1.5) | | | | |
| Not stated | 676 (42.8) | 246 (29.7) | 430 (57) | | | | |
Demographic and diagnostic details of all years combined. ‘Other’ histology includes undifferentiated carcinoma and mixed tumours. ‘Null’ histology includes confirmed serous, endometrioid and mucinous carcinoma where concomitant grade was unavailable. Smoker of any nicotine products.
BMI, body mass index (available for 1028 patients); IQR, interquartile range.
Details of tumour characteristics are given in Table 1. Of those with known stage/histology, the majority had either stage III (47.2%) or IV (22.4%) disease and serous histology (71.8% of those with known histology). Within the serous sub-cohort (n = 1001), 926 (92.5%) cases were high-grade serous cancers. The data over time are represented in Supplementary Table S2 and visually in Supplementary Figure S3, available at https://doi.org/10.1016/j.esmorw.2025.100150. There was no obvious temporal trend for histology or stage. These data covered the period of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic.
Pathology
We interrogated albumin, cancer antigen (CA)-125, C-reactive protein (CRP), haemoglobin (Hb) and platelet values closest to diagnosis before initiation of treatment (Supplementary Table S3, available at https://doi.org/10.1016/j.esmorw.2025.100150). The median albumin value in the overall cohort was in the normal range (36 g/l) while median haemoglobin was 118 g/l, slightly below the lower limit of the reference range (120 g/l); both exhibited a trend of falling values with increasing age, PS and FIGO stage. Conversely, median CRP was elevated (27.5 mg/l) but showed little relationship to FIGO stage, which contrasted with median CA 125 and platelet count, both of which increased with higher stage.
Treatment
Systemic anticancer therapy (SACT)
Data on SACT were only available from the tertiary unit, within accessible chemotherapy prescribing records from cohort 2. Between 1 April 2014 and 31 July 2022, 504 patients received 7119 cycles of SACT.
Yearly administrations of SACT regimens were categorised from 1 August 2014 to 31 July 2015 up until 1 August 2021 to 31 July 2022. The number of cycles prescribed per year, including visualisation of the 10 most commonly prescribed regimens overall, is shown in Figure 1A. The number of cycles prescribed for our cohort increased year on year as patients progressed through lines of therapy. To this end, prescriptions for second-line regimens and beyond [such as carboplatin and pegylated liposomal doxorubicin (C-PLD) and weekly paclitaxel (PW)] increased with time. We also saw emergence of poly (ADP-ribose) polymerase (PARP) inhibitor prescribing over time.
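The yearly categorisation described above (1 August to 31 July) amounts to assigning each cycle date to a treatment year. A minimal sketch follows; the function name `treatment_year` and the example records are hypothetical.

```python
from collections import Counter
from datetime import date


def treatment_year(cycle_date: date) -> str:
    """Label a cycle with its 1 August to 31 July treatment year, e.g. '2014/2015'."""
    start = cycle_date.year if cycle_date.month >= 8 else cycle_date.year - 1
    return f"{start}/{start + 1}"


# Hypothetical cycle records: (regimen, date administered)
cycles = [
    ("PC", date(2015, 7, 31)),
    ("PC", date(2015, 8, 1)),
    ("C-PLD", date(2016, 2, 10)),
]
cycles_per_year = Counter(treatment_year(d) for _, d in cycles)
print(cycles_per_year["2014/2015"], cycles_per_year["2015/2016"])  # 1 2
```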
Figure 1.
Trends in systemic treatment over time. (A) Number of cycles prescribed per year. The 10 most frequent regimens are specifically highlighted. (B) Number of patients receiving their first course of SACT shown by year. The 10 most frequent regimens are highlighted. (C) PARP inhibitor cycles over time. Note 2022 data only complete until 31 July. PARP, poly (ADP-ribose) polymerase; SACT, systemic anticancer therapy.
To also explore treatment trends for primary systemic treatment, we looked at the first course of SACT received by each patient. As shown (Figure 1B), there was a general increase in the number receiving their first course of SACT between 2014/2015 and 2018/2019. There was a sharp drop in 2019/2020, possibly attributable to the SARS-CoV-2 outbreak (51 first cycles that year), and an increase in line with pre-2020 trends thereafter (65 first cycles in 2020/2021 and 70 in the year after). Three-weekly carboplatin in combination with weekly or three-weekly paclitaxel (PC) remained the most common first regimen received every year.
The first documented prescriptions of PARP inhibitors were in 2017 (14 cycles in 2016/2017). This increased year on year to 298 cycles in 2019/2020, 467 in 2020/2021 and 671 for 2021/2022. Although each individual inhibitor (olaparib, rucaparib and niraparib) demonstrated its own trend, the total number of prescriptions over the time period studied was similar for each.
Surgery
A total of 1096 patients in cohort 1 underwent 1540 procedures between 17 March 2015 and 19 March 2023. After filtering procedure codes not related to the diagnosis of ovarian cancer, 1296 procedures in 1068 patients remained. Algorithm-extracted surgeon comments were used to manually assign each record into seven surgical categories (Supplementary Table S4, available at https://doi.org/10.1016/j.esmorw.2025.100150, Figure 2A and B). The most common surgery types were primary cytoreductive surgery (PCS, n = 367, 28.3%) followed by interval cytoreductive surgery (ICS, n = 324, 25%) and 253 (19.5%) where the timing of cytoreduction could not be inferred from the extracted comments. Temporal trends for PCS and ICS are shown in Figure 2C demonstrating an increase in ICS over time.
Figure 2.
Distribution of surgical procedures over time. (A) Distribution of category proportions in cohort 1. (B) Distribution of category proportions in cohort 1 by year. (C) Trends in primary (PCS) and interval (ICS) cytoreduction over time. (D) Trends in ICS over time, by cohort (data from 2015 not shown to comply with reporting guidance). ICS, interval cytoreductive surgery; PCS, primary cytoreductive surgery.
In cohort 2, PCS was the most common type of procedure (214, 29%) in contrast to cohort 3, where ICS was most common (187, 34%). The number of ICS procedures rose over time, driven by the increased referrals from cohort 3 (Figure 2D) while numbers remained stable from cohort 2.
Survival outcome
We examined overall survival (OS) data using mortality information from CM (updated from the NHS Spine) combined with local structured data to inform censoring (Supplementary Appendix S2, available at https://doi.org/10.1016/j.esmorw.2025.100150). In all patients (cohort 1, n = 1581) there were 623 survival events with 342 events in cohort 2 (tertiary centre network, n = 827). Median follow-up time was 2.6 years (31 months) in cohort 1 versus 3.8 years (45 months) in cohort 2. We therefore carried out survival analysis in cohort 2 where a longer follow-up time and the enhanced completeness of treatment details enabled more accurate estimation of outcomes. Here, we looked at short-term landmark survival estimated by the Kaplan–Meier analysis. Estimated 1-year survival in cohort 2 was 85.0% (82.5%-87.6%) (Figure 3A). Split by stage, 1-year survival was 98.3% (stage I), 95.1% (stage II), 88.3% (stage III) and 73.8% (stage IV) (Figure 3B, Supplementary Table S5, available at https://doi.org/10.1016/j.esmorw.2025.100150). The endometrioid subtype was associated with the most favourable 1-year OS (93.4%) while carcinosarcoma had the least favourable outcome (80.2%) (Supplementary Table S5, available at https://doi.org/10.1016/j.esmorw.2025.100150).
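For readers unfamiliar with landmark estimates, the Kaplan–Meier product-limit calculation can be sketched in a few lines. The study used the R survival package; this pure-Python version is only an illustration, and the function names and the toy data are hypothetical rather than study results.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times: follow-up per patient; events: 1 if death observed, 0 if censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def survival_at(curve, landmark):
    """Survival probability at a landmark time (step function lookup)."""
    value = 1.0
    for t, s in curve:
        if t > landmark:
            break
        value = s
    return value


# Toy data in months (not study data)
curve = kaplan_meier([3, 8, 12, 14, 20, 24], [1, 0, 1, 0, 1, 0])
print(survival_at(curve, 12))  # 0.625
```

Censored patients leave the risk set without reducing the survival estimate, which is why the 12-month value here is 5/6 × 3/4 rather than a simple proportion of patients alive.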
Figure 3.
Overall survival in Central Trust (cohort 2) estimated by the Kaplan–Meier analysis. (A) Kaplan–Meier plot showing overall survival. Dashed lines show median survival. (B) 1- and 2-year overall survival estimates per FIGO stage. FIGO, Fédération Internationale de Gynécologie et d’Obstétrique.
We next looked at the univariate and multivariate impact of prognostic factors on OS, using Cox regression. As shown in Supplementary Table S5 and Figure S4, available at https://doi.org/10.1016/j.esmorw.2025.100150, and Figure 4, known prognostic factors remained statistically significant in both univariate and multivariate analyses (including FIGO stage, histology, PS, albumin and age). The trend towards worse survival with increasing CA 125 and platelet count seen in univariate analyses was not seen in multivariate analysis (Figure 4).
Figure 4.
Multivariate Cox hazard ratios for overall survival. Given low numbers, mucinous histologies were aggregated regardless of grade. CI, confidence interval; HGSOC, High-Grade Serous Ovarian Cancer; LGSOC, Low-Grade Serous Ovarian Cancer; ECOG PS, Eastern Cooperative Oncology Group performance status; FIGO, Fédération Internationale de Gynécologie et d’Obstétrique; HR, hazard ratio; IMD, index of multiple deprivation decile.
Finally, we carried out sensitivity analyses by excluding patients with ‘Null’ histology. Median follow-up in this subset of cohort 2 was 3.94 years, with 1-year survival at 88% (85.6%-90.5%) and stage-specific 1-year survival of 100%, 95.1%, 89.3% and 75.7% in stage I-IV disease, respectively. Importantly, in multivariate analysis FIGO stage, histology, PS, albumin and age remained significantly associated with survival.
Discussion
We demonstrated that, through collaboration between clinical and informatics teams, semi-automated curation and analysis of routinely collected health care data at scale across multiple clinical systems is possible at our Trust and able to generate robust data that can be used to analyse treatment patterns and survival outcomes for patients with ovarian cancer.
Our data have utility for the rapid display of simple operational metrics, for the monitoring of clinical trends and for understanding real-world OS and its determinants. Operationally, we showed that referrals to our Trust increased by >200% over the last 8 years. Investigation of prognostic variables showed consistency with factors widely accepted to have the biggest impact on prognosis including stage, histology and PS.21 These data suggest that, using our methods, routine NHS data can generate robust and meaningful locally relevant results and also highlight its potential for future novel research.
Importantly, these methods enable assessment of treatments (including surgery and SACT). Data showed that patients treated within the Trust have a higher percentage of PCS than those referred externally, that there was a decrease in systemic treatment during the SARS-CoV-2 pandemic and that PARP inhibitor prescriptions rose linearly in the years after their introduction. These results highlight the importance of a service adapting to evidence-driven changes in practice. Ultimately, such data allow us to follow outcomes in response to new therapy practices, while enabling departments to assess their responsiveness to clinical advances and aiding practice benchmarking.
Currently, we largely inform our patients using trial data and National Statistics.22,23 These data are often not representative of the local population and are often old. A recurring theme from our Patient and Public Involvement and Engagement (PPIE) work was that patients want statistics more relevant to them, especially regarding outcomes and treatment pathways.19 These methods have the potential to inform patients in this way using aggregated contemporary, granular and local RWD.
Our semi-automated methods scale more efficiently than manual curation and allow timely data updates. With manual methods, each iteration requires substantial effort, and limited resources often leave the data stale.
While the advantages are substantial, we recognise that there are inherent limitations too. Structured histological diagnoses were obtained from data curated in multi-disciplinary tumour boards. This relies on the quality of data entry. Accuracy measured against the gold standard of manual curation of free-text pathology and clinical reports was 96%. Conversely, rule-based curation of free-text histology reports was more accurate (99%) suggesting a role for such data in future iterations.
Secondly, missing data were variably present across all fields, most noticeably for ethnicity (43%) but also for histology. To avoid biasing the cohort by excluding patients with potentially poorer prognosis, we included patients with ‘Null’ histology after validation suggested that the majority truly had epithelial ovarian cancer. To mitigate some uncertainty, we carried out sensitivity analysis for survival by excluding ‘Null’ cases. While landmark estimates increased slightly, the relationship between prognostic factors (e.g. stage and PS) and survival was unaltered in both univariate and multivariate sensitivity analyses. Nonetheless, missing data are problematic, especially when data are not missing at random. Future mitigation strategies include local education to improve source documentation and improvement of free-text extraction by widening document scope (e.g. beyond histology reports) and using more sophisticated methods (e.g. large language models). Assessing genomic data is particularly challenging due to data fragmentation and lack of standardised mechanisms for reporting and storage of results. We are working on delivering reliable results for genomic testing (including homologous recombination deficiency) using our methodology.
Finally, we recognise that our methods are limited by the fact that they require collaboration between clinical and informatics teams, assuming that both are adequately resourced and enabled to do this. In an increasingly constrained health care system, this is not always possible. Beyond the above, generic limitations of RWD including lack of source data standardisation, availability and timeliness also apply to our methods.
With the increasing digitisation of health care data, improvements in infrastructure and continued development of artificial intelligence capabilities, RWD curation will evolve. While we used SQL-enabled, rule-based curation, the rapid evolution of deep learning methods including large language models to curate text is already leading to more sophisticated curation workflows.24,25 These methods have the potential to enable more complete curation while also widening variable scope, encompassing more complex variables. Nonetheless, such methods are not ready for widespread adoption, requiring significant compute power and being prone to hallucinations and privacy concerns.26
Beyond methods of curation, efforts to standardise data models to enable interoperability and analytic consistency between data sources are paramount. Such efforts, in the UK, have been exemplified by the NATCAN project27 while international efforts include those led by the Observational Medical Outcomes Partnership (OMOP) and Observational Health Data Sciences and Informatics (OHDSI) consortia.28 Ultimately, high-quality data capture and development of standardised interoperable models synergise with the shift of clinical registries away from simply being observational data collection systems to being powerful tools that drive health care quality improvement, so-called ‘clinical quality registries’.29 These registries are underpinned by adherence to ‘registry science’ including best practice in governance processes, data variable management and quality assurance. However, despite their emergence, recent research has suggested that knowledge translation from such registries (i.e. applying the data for intervention) is lacking.30
Thus, as our local registry develops, emphasis will be on evolving curation methods, optimising source data, ensuring interoperability through standardisation and focusing on knowledge translation to improve outcomes. To truly understand outcomes and treatments, future projects should focus on combining datasets nationally and internationally. We hope the methods used in this project will be transferable to other centres using different EHRs.
Conclusion
The digitisation of health data globally represents a transformative opportunity for the health care sector. We demonstrate the feasibility and value of leveraging RWD to gain insights into patient care, particularly in oncology. By employing our data curation and analysis techniques, we have overcome traditional challenges associated with data fragmentation and manual curation processes. Our findings not only corroborate established clinical knowledge but also open avenues for novel research.
The success of our project underscores the importance of collaborative efforts between clinical teams and data specialists. Such partnerships are vital in navigating the complexities of RWD and unlocking its full potential for research and operational improvement.
Acknowledgments
Funding
This work was supported by the Imperial College Health Charity’s Innovate at Imperial programme [grant number II2122_16]. This research was enabled by the iCARE secure data environment and used the iCARE team and data resources. The research was supported by the National Institute for Health Research (NIHR) Imperial Biomedical Research Centre [grant number NIHR203323]. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care. GG was supported by an ESMO translational research fellowship.
Disclosure
AS was a full-time employee of Imperial College Healthcare NHS Foundation Trust at the time of study conception, data extraction and exploration and manuscript drafting. He is currently an employee of Flatiron Health UK, an independent subsidiary of the Roche group. GG has received honoraria as invited speaker for Pharma&, MIT and Collage SPA. LT has served on advisory boards for AstraZeneca, Clovis Oncology and GSK; has received consulting fees and fees for lectures and presentations from AstraZeneca, GSK and MSD; and is a steering group committee member for GSK-sponsored studies for endometrial and ovarian cancer. IM has served on advisory boards for Roche; has received consultancy fees from AstraZeneca, BioNTech, Roche, Clovis Oncology, OncoC4 and Pharma&; and has received honoraria for presentations from GSK. CF has received consulting fees from AstraZeneca, MSD, GSK, Roche, Ethicon, Oncoinvent and Medtronic. None of the above disclosures is related to the submitted manuscript. All other authors have declared no conflicts of interest.
Supplementary data
Data processing schema. C57, C48, D39 & C56X refer to ICD-10 diagnostic codes. CM, Cerner Millennium; DD, Demographic Data; Orals, oral anti-cancer medication.
Referral patterns per year a) Split by cohort (see methods) & b) Split by individual unit (hospital). ‘Other’ includes hospitals with the least referral volume over the study interval.
Trends in a) Ethnicity, b) Histology and c) Stage over time in overall cohort (Cohort 1). Figures show percentage of each level for each variable per year (2014–2022), missing data included. Serous, mucinous and endometrioid histology include those where grade is unknown.
Univariate Cox hazard ratios for overall survival for a) Demographics, b) Tumour characteristics and c) Blood parameters. PS, ECOG performance score. IMD, Index of Multiple Deprivation decile. Given low numbers, mucinous histologies were aggregated regardless of grade.
References
- 1. European Cancer Information System. Available at: https://ecis.jrc.ec.europa.eu/explorer.php
- 2. Dalmartello M., La Vecchia C., Bertuccio P., et al. European cancer mortality predictions for the year 2022 with focus on ovarian cancer. Ann Oncol. 2022;33(3):330–339. doi: 10.1016/j.annonc.2021.12.007.
- 3. Wojtyła C., Bertuccio P., Giermaziak W., et al. European trends in ovarian cancer mortality, 1990–2020 and predictions to 2025. Eur J Cancer. 2023;194. doi: 10.1016/j.ejca.2023.113350.
- 4. Cancer Research UK. Ovarian cancer survival statistics - Cancer Research UK. Available at: https://www.cancerresearchuk.org/health-professional/cancer-statistics/statistics-by-cancer-type/ovarian-cancer/survival#heading-Three
- 5. Gillessen S., Sauvé N., Collette L., et al. Predicting outcomes in men with metastatic Nonseminomatous Germ Cell Tumors (NSGCT): results from the IGCCCG Update Consortium. J Clin Oncol. 2021;39(14):1563–1574. doi: 10.1200/JCO.20.03296.
- 6. Greenberg P.L., Tuechler H., Schanz J., et al. Revised international prognostic scoring system for myelodysplastic syndromes. Blood. 2012;120(12):2454–2465. doi: 10.1182/blood-2012-03-420489.
- 7. Saesen R., Van Hemelrijck M., Bogaerts J., et al. Defining the role of real-world data in cancer clinical research: the position of the European Organisation for Research and Treatment of Cancer. Eur J Cancer. 2023;186:52–61. doi: 10.1016/j.ejca.2023.03.013.
- 8. Castelo-Branco L., Pellat A., Martins-Branco D., et al. ESMO Guidance for Reporting Oncology real-World evidence (GROW). Ann Oncol. 2023;34(12):1097–1112. doi: 10.1016/j.annonc.2023.10.001.
- 9. Friedman S., Negoita S. History of the Surveillance, Epidemiology, and End Results (SEER) Program. J Natl Cancer Inst Monogr. 2024;2024(65):105–109. doi: 10.1093/jncimonographs/lgae033.
- 10. NATCAN. National Ovarian Cancer Audit (NOCA). Available at: https://www.natcan.org.uk/audits/ovarian/ Accessed March 2025.
- 11. Minicozzi P., Innos K., Sánchez M.J., et al. Quality analysis of population-based information on cancer stage at diagnosis across Europe, with presentation of stage-specific cancer survival estimates: a EUROCARE-5 study. Eur J Cancer. 2017;84:335–353. doi: 10.1016/j.ejca.2017.07.015.
- 12. Ledermann J.A., Matias-Guiu X., Amant F., et al. ESGO–ESMO–ESP consensus conference recommendations on ovarian cancer: pathology and molecular biology and early, advanced and recurrent disease. Ann Oncol. 2024;35(3):248–266. doi: 10.1016/j.annonc.2023.11.015.
- 13. Azoulay L. Rationale, strengths, and limitations of real-world evidence in oncology: a Canadian review and perspective. Oncologist. 2022;27(9):e731–e738. doi: 10.1093/oncolo/oyac114.
- 14. Gehrmann J., Beyan O. Data Quality in Medical Real-World Data – An Oncological Use Case. Studies in Health Technology and Informatics. IOS Press; 2024.
- 15. Bray F., Parkin D.M. Evaluation of data quality in the cancer registry: principles and methods. Part I: comparability, validity and timeliness. Eur J Cancer. 2009;45(5):747–755. doi: 10.1016/j.ejca.2008.11.032.
- 16. Parkin D.M., Bray F. Evaluation of data quality in the cancer registry: principles and methods. Part II: completeness. Eur J Cancer. 2009;45(5):756–764. doi: 10.1016/j.ejca.2008.11.033.
- 17. Ahern S., Hopper I., Evans S.M. Clinical quality registries for clinician-level reporting: strengths and limitations. Med J Aust. 2017;206(10):427–429. doi: 10.5694/mja16.00659.
- 18. Labkoff S.E., Quintana Y., Rozenblit L. Identifying the capabilities for creating next-generation registries: a guide for data leaders and a case for “registry science.” J Am Med Inform Assoc. 2024;31(4):1001–1008. doi: 10.1093/jamia/ocae024.
- 19. Piggin M.J.H., Baker K., Tookman L., Glampson B., Samani A., Nkolobe B.; on behalf of the NIHR Imperial Biomedical Research Centre. Insight Report: “Innovative, automated use of real-world healthcare data to improve outcomes for patients with ovarian cancer” project online public involvement session. 2022.
- 20. NHS Digital. Mandating mortality data updates onto the Personal Demographics Service. Available at: https://digital.nhs.uk/about-nhs-digital/corporate-information-and-documents/directions-and-data-provision-notices/data-provision-notices-dpns/mortality-data-flows Accessed March 2025.
- 21. Winter W.E., Maxwell G.L., Tian C., et al. Prognostic factors for stage III epithelial ovarian cancer: a Gynecologic Oncology Group Study. J Clin Oncol. 2007;25(24):3621–3627. doi: 10.1200/JCO.2006.10.2517.
- 22. Cancer Research UK. Survival for Ovarian Cancer. Available at: https://www.cancerresearchuk.org/about-cancer/ovarian-cancer/survival Accessed March 2025.
- 23. SEER. Cancer Stat Facts: Ovarian Cancer. Available at: https://seer.cancer.gov/statfacts/html/ovary.html Accessed March 2025.
- 24. Kehl K.L., Jee J., Pichotta K., et al. Shareable artificial intelligence to extract cancer outcomes from electronic health records for precision oncology research. Nat Commun. 2024;15:9787. doi: 10.1038/s41467-024-54071-x.
- 25. Cohen A.B., Adamson B., Larch J.K., Amster G. Large language model extraction of PD-L1 biomarker testing details from electronic health records. AI Precision Oncol. 2025. doi: 10.1089/aipo.2024.0043.
- 26. Gartlehner G., Kahwati L., Nussbaumer-Streit B., et al. From promise to practice: challenges and pitfalls in the evaluation of large language models for data extraction in evidence synthesis. BMJ Evid Based Med. 2024. doi: 10.1136/bmjebm-2024-113199.
- 27. NATCAN. Key COSD Data Items. Available at: https://www.natcan.org.uk/resources/key-cosd-data-items/
- 28. Wang L., Wen A., Fu S., et al. Adoption of the OMOP CDM for cancer research using real-world data: current status and opportunities. medRxiv [Preprint]. 2024. 2024.08.23.24311950.
- 29. Stirling R., Melder A., Eyles E., Reich M., Dawkins P. Cancer clinical quality registries: a systematic review of the utilisation and effectiveness of knowledge translation interventions. J Thorac Oncol. 2022;17(suppl 9):S255–S256.
- 30. Parker K.J., Hickman L.D., Ferguson C. The science of clinical quality registries. Eur J Cardiovasc Nurs. 2023;22(2):220–225. doi: 10.1093/eurjcn/zvad008.