AMIA Annual Symposium Proceedings. 2022 Feb 21;2021:716–725.

Validation of Real-World Data-based Endpoint Measures of Cancer Treatment Outcomes

Qian Li 1, Hansi Zhang 1, Zhaoyi Chen 1, Yi Guo 1, Thomas J George Jr 1, Yong Chen 2, Fei Wang 3, Jiang Bian 1,*
PMCID: PMC8861715  PMID: 35308944

Abstract

Recently, there has been a growing interest in using real-world data (RWD) to generate real-world evidence that complements clinical trials. To quantify treatment effects, it is important to develop meaningful RWD-based endpoints. In cancer trials, two real-world endpoints are of particular interest: real-world overall survival (rwOS) and real-world time to next treatment (rwTTNT). In this work, we identified ways to calculate these real-world endpoints with structured electronic health record (EHR) data and validated them against the gold-standard measurements derived from linked EHR and tumor registry (TR) data. In addition, we examined and reported data quality issues, especially inconsistencies between the EHR and TR data. Using a survival model, we show that the presence of next treatment was not significantly associated with rwOS, but patients who had longer rwTTNT had longer rwOS, validating the use of rwTTNT as a real-world surrogate marker for measuring cancer endpoints.

Introduction

In clinical trials, an endpoint is a “precisely defined variable intended to reflect an outcome of interest that is statistically analyzed to address a particular research question,”1 and is usually characterized by the type of research questions or outcomes the trials aim to assess. In cancer trials, the most common outcome-based endpoints are overall survival and measurements of tumor burden, such as tumor response rate and progression-free survival,2,3 that are normally used to measure how well a treatment has worked. A number of endpoints have emerged that can be used as surrogate markers for “duration of clinical benefit,”4,5 such as time to next treatment (TTNT). Although clinical trials, especially randomized clinical trials, are considered the gold standard for generating clinical evidence about a treatment and its outcomes, they are expensive, time-consuming, and require the recruitment of sufficient participants, which is often difficult6,7. Further, trial results are often not generalizable to patients treated in real-world settings due to issues such as overly restrictive eligibility criteria8. These issues are especially prevalent in cancer trials,6,7 and approximately 97% of oncology trials ultimately fail9.

Recently, there has been a growing interest in using real-world data (RWD) to generate real-world evidence (RWE) that complements the results of clinical trials. The term RWD, widely promoted by the U.S. Food and Drug Administration (FDA), refers to data collected from sources outside of conventional research settings, including electronic health records (EHRs), administrative claims, disease registries, and billing data, among others10,11. These RWD sources contain detailed, longitudinally tracked patient information such as disease status, treatment, comorbidities, and concurrent treatments. The information generated from RWD can provide valuable RWE about how patients are treated in real-world clinical settings that can inform therapeutic development, outcomes research, patient care, safety surveillance, and comparative effectiveness studies12. In order to quantify the treatment effects, however, it is necessary to develop meaningful RWD-based endpoints.

Another important RWD source in cancer research is tumor registry (TR) data, which is often manually extracted from cancer patients’ medical charts (i.e., EHRs). Variables related to endpoints such as overall survival (OS), cause of death, and the date when the first line of treatment was begun can be reliably obtained from TR data. However, there are a number of gaps in TR data: it lacks detailed information about patients’ other characteristics (e.g., comorbidities) and longitudinal information about patients’ cancer treatment trajectories,13 and it does not capture all the cancer patients in the health system, for various practical reasons (e.g., state or national TR reporting requirements and reporting delays due to the labor-intensive manual abstraction process)14. Patient data missing from the TR leads to various issues for research studies (e.g., reduced sample size leading to reduced power of the estimates). For RWD-based research studies, therefore, using linked raw EHR and TR data is ideal.

Literature on real-world endpoints is emerging and a number of endpoints have been proposed, including real-world overall survival (rwOS), real-world time to next treatment (rwTTNT), and real-world progression-free survival (rwPFS)15–17. Nevertheless, the ability of researchers to use RWD to extract these endpoints remains an area of active discussion. For example, rwTTNT can be calculated based on structured EHR data alone, using the dates of various procedures and diagnoses, while rwPFS requires information on the tumor itself, which is often only available in unstructured clinical text (e.g., pathology reports).

Two real-world endpoints are particularly of interest. The first is rwOS, the time from the date of cancer diagnosis or treatment initiation (depending on the type of cancer or the study aims) to the date of death, end of follow-up, or last contact; the second is rwTTNT, the time from the initiation of the first course of cancer-directed treatment to the initiation of the next line of therapy (i.e., “subsequent treatment” in the case of recurrence or progression)15,18. Although rwTTNT derived from RWD sources has not been examined extensively, it can provide critical insight on the real-world performance of cancer treatments16,19. For example, rwTTNT can be used to estimate progression-free survival and the effectiveness of cancer treatment5.

In this work, we aimed to identify ways to calculate real-world endpoints with structured EHR data and validate these endpoints against the gold-standard measurements of these endpoints derived from linked EHR and TR data. We focused on two real-world endpoints, rwOS and rwTTNT, in stage I—III colon cancer patients. We used RWD from the OneFlorida Clinical Research Consortium, a Patient-Centered Outcomes Research Institute (PCORI)-funded clinical data research network contributing to the national Patient-Centered Clinical Research Network (PCORnet).

Methods

Overview. Our primary analysis goals were to assess the alignment between EHR and TR data, describe the data gaps between the two, and assess the validity of real-world endpoints derived from EHR and TR data. To achieve these objectives, we first needed to understand colon cancer patients’ cancer care pathways using RWD. A conceptual stage I—III colon cancer patient timeline is shown in Figure 1.

Figure 1.

Stage I-III Colon cancer patient treatment timeline.

Our analysis involved five major parts: we (1) identified the events and associated dates of colon cancer diagnoses, death or last contact, and colon cancer treatments from both EHR and TR data; (2) explored the discrepancies between the events and event dates recorded in the EHR and TR data (e.g., patients identified as dead in the EHR but not in the TR, or vice versa); (3) computed the rwOS; (4) identified the starting points of patients’ first course of cancer treatments and subsequent treatments and computed the rwTTNT; and (5) examined the associations between the rwOS and (i) the presence of a subsequent treatment and (ii) rwTTNT. We hypothesized that patients who had a subsequent treatment would have shorter rwOS compared to those who did not and that among patients who had subsequent treatment, those who had a longer rwTTNT would have a longer rwOS than those who had a shorter rwTTNT. The typical first course of cancer treatment and subsequent treatment choices include surgery, radiation therapy, and chemotherapy. As we focused on stage I–III colon cancer patients, we considered only systemic chemotherapy as depicted in clinical guidelines (see section below for details), and the next line of treatment refers to the intervention with a new regimen after initial chemotherapy20.

Data source and study population. We used the linked EHR and TR data from the OneFlorida network. The OneFlorida network contains linked, robust, longitudinal, patient-level RWD from approximately 15 million Floridians, including data from Medicaid claims, TR, vital statistics, and EHRs from its clinical partners. As one of the clinical research networks contributing to PCORnet, OneFlorida includes 12 healthcare organizations that provide care through 4,100 physicians, 914 clinical practices, and 22 hospitals covering all 67 Florida counties. The OneFlorida data follows the PCORnet Common Data Model (CDM) and includes patient demographics, enrollment status, vital signs, conditions, encounters, diagnoses, procedures, prescribing and dispensing records, lab results, etc. We extended the CDM to incorporate TR data, which follows the North American Association of Central Cancer Registries (NAACCR) standards21. Currently, OneFlorida includes TR data from three partners that maintain records of documented neoplasms (typically malignant) in their local hospital TRs. The TR records are linked with patients’ EHRs in the OneFlorida data.

The selection of the study cohort is illustrated in Figure 2. We first identified patients diagnosed with colon cancer in either the EHR or the TR data. Patients were grouped into (A) patients with colon cancer diagnoses in both EHR and TR data, (B) patients with colon cancer diagnoses only in the EHR data, and (C) patients with colon cancer diagnoses only in the TR data. To ensure data quality, and because the stage information was only available discretely in the TR data, we restricted our analyses to Group A. As stage 0 patients need more information to confirm their cancer diagnoses and treatment plans for stage IV patients are more complex, we further restricted the study cohort to patients with stage I–III colon cancer. We also excluded patients with unknown or missing stage information and patients who were diagnosed with colon cancer before 2012 in the TR, since our EHR data only included records dating after 2012.
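The grouping and eligibility rules above can be sketched with set operations. This is a minimal illustration, not the authors' pipeline: the function names, the patient-ID inputs, and the choice to treat a TR diagnosis year of 2012 itself as eligible are assumptions made for the example.

```python
# Hypothetical sketch of the cohort split described above.
ELIGIBLE_STAGES = {"I", "II", "III"}


def split_cohort(ehr_ids, tr_ids):
    """Split patient IDs into Groups A, B, and C by data source."""
    ehr, tr = set(ehr_ids), set(tr_ids)
    return {
        "A": ehr & tr,  # colon cancer diagnosis in both EHR and TR
        "B": ehr - tr,  # diagnosis only in the EHR
        "C": tr - ehr,  # diagnosis only in the TR
    }


def is_eligible(stage, tr_dx_year, first_ehr_year=2012):
    """Keep stage I-III patients whose TR diagnosis is not before 2012.

    Assumption: a diagnosis in 2012 itself counts as covered by the EHR.
    A missing or unknown stage (e.g., None) is excluded.
    """
    return stage in ELIGIBLE_STAGES and tr_dx_year >= first_ehr_year


groups = split_cohort(["p1", "p2", "p3"], ["p2", "p3", "p4"])
```

Only patients in `groups["A"]` who also pass `is_eligible` would enter the analysis under these assumptions.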

Figure 2.

Selection of the study cohort.

Determination of cancer cases. To extract colon cancer patients, we used the International Classification of Disease, Ninth/Tenth Revision, Clinical Modification (ICD-9/10-CM) codes 153.* and C18.* for the EHR data and the International Classification of Disease for Oncology, 3rd Edition (ICD-O-3) codes C18.0 through C18.9 for the TR data. In the EHR data, a colon cancer patient’s onset date was defined as the earliest encounter date with the colon cancer diagnosis.
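As an illustration of the EHR case definition, the following sketch matches the ICD-9-CM 153.* and ICD-10-CM C18.* patterns and takes the earliest matching encounter date as the onset date. The code-type labels `"ICD9"` and `"ICD10"`, the tuple layout, and the assumption that codes are stored with a decimal point are hypothetical choices for the example.

```python
from datetime import date


def is_colon_cancer_dx(code, code_type):
    """Return True for ICD-9-CM 153.* or ICD-10-CM C18.* codes.

    Assumes dotted codes (e.g., "153.4", "C18.7"); undotted storage
    would need a different match.
    """
    code = code.strip().upper()
    if code_type == "ICD9":
        return code == "153" or code.startswith("153.")
    if code_type == "ICD10":
        return code == "C18" or code.startswith("C18.")
    return False


def ehr_onset_date(diagnoses):
    """Earliest encounter date carrying a colon cancer diagnosis.

    diagnoses: iterable of (code, code_type, encounter_date) tuples.
    Returns None when no colon cancer diagnosis is present.
    """
    dates = [d for code, ctype, d in diagnoses if is_colon_cancer_dx(code, ctype)]
    return min(dates) if dates else None


example = [
    ("C18.7", "ICD10", date(2016, 5, 2)),
    ("153.4", "ICD9", date(2014, 9, 18)),
    ("C20", "ICD10", date(2013, 1, 5)),  # rectal cancer, not matched
]
onset = ehr_onset_date(example)
```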

Determination of patients’ last contact or death dates to calculate rwOS. We calculated rwOS as the length of the period from the date of the first colon cancer diagnosis to the death or last contact date. In OneFlorida, the death records in the EHR data come from two sources: deaths are either recorded directly in the EHR by the health system (e.g., inpatient deaths or deaths reported to the health system by relatives) or extracted from the Social Security Administration (SSA)’s Death Master File (DMF) and from public and private obituaries through a privacy-preserving record linkage process run by a third-party vendor, Datavant22. To protect privacy, the death dates in Datavant’s death data contain only the month and year; thus, we imputed the death date to the first day of the month. If a patient did not have a death record, we assumed the patient to be alive and used the patient’s last encounter date in the EHR system as the last contact date.

TRs typically contain a cancer patient’s vital status (i.e., alive or dead) and the death date (if dead) or the last contact date (if alive). However, in our TR data, the last contact date information was missing. To determine patients’ death or last contact date, we combined EHR and TR vital status and event dates as follows: If patients were indicated to be dead in any of the data, we treated them as dead; if they had death dates in both the EHR and TR data, we used the one from the TR data. If patients were indicated to be alive in both datasets, we treated them as alive and used the last contact date from the EHR data.
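The rules for combining EHR and TR vital status can be sketched as follows. This is a hedged illustration: the function names are hypothetical, a missing death date is represented by `None`, and the conversion from days to months (365.25 / 12 days per month) is an assumption, since the text does not state the conversion used.

```python
from datetime import date


def impute_month_only(year, month):
    """Impute a month-and-year death date to the first day of the month."""
    return date(year, month, 1)


def resolve_endpoint(ehr_death, tr_death, ehr_last_encounter):
    """Return (is_dead, event_date) from the combined EHR and TR data.

    Dead in either source counts as dead; when both sources have a death
    date, the TR date is used. Otherwise the patient is treated as alive
    and the last EHR encounter date is the last contact date.
    """
    if tr_death is not None:
        return True, tr_death
    if ehr_death is not None:
        return True, ehr_death
    return False, ehr_last_encounter


def rwos_months(dx_date, event_date, days_per_month=30.4375):
    """rwOS in months; the days-per-month constant is an assumption."""
    return (event_date - dx_date).days / days_per_month


ehr_death = impute_month_only(2015, 3)  # month-only death record
status = resolve_endpoint(ehr_death, date(2015, 3, 14), date(2015, 2, 1))
```

A negative value from `rwos_months` would flag a death or last contact date that precedes the diagnosis date, a data quality problem of the kind the Results describe.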

Summarization of colon cancer treatments to determine rwTTNT. We computed rwTTNT as the length of the period from the beginning of the first course of treatment to the beginning of the subsequent treatment. To identify the colon cancer treatment, we used the EHR data, as TR data does not contain patients’ longitudinal treatment records. We reviewed the National Comprehensive Cancer Network (NCCN) Clinical Practice Guidelines in Oncology23 and found that treatments for stage I–III colon cancer include surgery, chemotherapy, and radiation therapy. For surgery, we identified the Current Procedural Terminology (CPT) and Healthcare Common Procedure Coding System (HCPCS) codes and extracted surgery-related procedures from the EHR data. To identify chemo- and radiation therapies, we used the Cancer Therapy Look-up Tables developed by the Cancer Research Network (CRN)24.

As we focused on chemotherapy after surgery, we first identified all chemotherapy occurrences in the EHR data and then distinguished between the first course of treatment and subsequent treatments. We defined the beginning of the first course of treatment as the earliest chemotherapy record in the EHR after the patient’s colon cancer diagnosis date. We implemented two rules for identifying subsequent treatments: (1) if two adjacent chemotherapy occurrences had a gap between them of over 90 days, we considered the former occurrence as the end of the first course of treatment, and the latter occurrence as the beginning of subsequent treatment; and (2) if the patient switched to a new colon cancer treatment regimen, we considered the onset of the new regimen as the beginning of the subsequent treatment. If a patient had multiple subsequent treatments, we considered only the first occurrence in our analysis.
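The two rules can be expressed as a single pass over a patient's date-ordered chemotherapy records. The sketch below is a simplification under stated assumptions: each record already carries a regimen label (how regimens are derived from drug codes is not shown), a regimen switch is detected by comparing adjacent labels, and records on the diagnosis date itself are kept. All names are hypothetical.

```python
from datetime import date


def first_subsequent_treatment(chemo_records, dx_date, gap_days=90):
    """Return (first_course_start, subsequent_start or None).

    chemo_records: list of (administration_date, regimen_label) tuples.
    Rule 1: a gap of more than gap_days between adjacent records.
    Rule 2: a change in regimen label between adjacent records.
    Only the first subsequent treatment is returned.
    """
    records = sorted(
        (r for r in chemo_records if r[0] >= dx_date), key=lambda r: r[0]
    )
    if not records:
        return None, None
    first_start = records[0][0]
    for (prev_date, prev_reg), (cur_date, cur_reg) in zip(records, records[1:]):
        if (cur_date - prev_date).days > gap_days or cur_reg != prev_reg:
            return first_start, cur_date
    return first_start, None


def rwttnt_days(first_start, subsequent_start):
    """rwTTNT in days between the two treatment start dates."""
    return (subsequent_start - first_start).days


gap_case = [
    (date(2015, 1, 1), "FOLFOX"),
    (date(2015, 1, 15), "FOLFOX"),
    (date(2015, 6, 1), "FOLFOX"),  # 137 days after the previous record
]
start, nxt = first_subsequent_treatment(gap_case, dx_date=date(2014, 12, 1))
```

Exposing `gap_days` as a parameter makes it straightforward to rerun the classification with a different wash-out window.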

Statistical analyses. We first computed the confusion matrices for events (i.e., presence of colon cancer diagnosis and vital status) obtained from the EHR data vs. the TR data. We computed kappa coefficients for each event from the EHR and the TR data. We also compared the dates of these events between the EHR and TR data.

Further, to assess the validity of rwTTNT, we built two Cox proportional hazards models to examine the association between subsequent treatment and overall survival, one that considered whether the patients had subsequent treatments or not, and a second that considered the rwTTNT. For both models, the outcome was the rwOS determined by both the EHR and the TR data. We controlled for age at diagnosis, sex, race/ethnicity, cancer stage at diagnosis, Charlson Comorbidity Index (CCI) (based on diagnosis records from EHR data25), and smoking status.
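For reference, the standard form of the Cox proportional hazards model is written out below. The notation is ours rather than the authors': $\mathbf{z}$ stands for the adjustment covariates listed above, $S$ is an indicator for the presence of subsequent treatment (first model), and $T_{\text{NT}}$ is the rwTTNT (second model, fitted only on patients with a subsequent treatment).

```latex
\begin{align*}
\text{Model 1:}\quad h(t \mid S, \mathbf{z}) &= h_0(t)\,\exp\!\left(\beta_S S + \boldsymbol{\gamma}^{\top}\mathbf{z}\right),
  & \mathrm{HR}_S &= e^{\beta_S} \\
\text{Model 2:}\quad h(t \mid T_{\text{NT}}, \mathbf{z}) &= h_0(t)\,\exp\!\left(\beta_T T_{\text{NT}} + \boldsymbol{\gamma}^{\top}\mathbf{z}\right),
  & \mathrm{HR}_T &= e^{\beta_T}
\end{align*}
```

Here $h_0(t)$ is the unspecified baseline hazard. A hazard ratio above 1 corresponds to a higher hazard of death (shorter rwOS), and a hazard ratio below 1 to a lower hazard (longer rwOS).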

All data processing procedures were conducted using Python and statistical analyses were performed with SAS, version 9.4 (SAS, Cary, NC, USA).

Results

Demographic characteristics. Our final study cohort (Group A) included 1,372 stage I—III colon cancer patients. The patients’ characteristics are summarized in Table 1. The mean age at colon cancer diagnosis was 65.2 years old. Men and women were represented approximately equally (50.7% and 49.3%, respectively). The majority of the patients were non-Hispanic Whites (NHW). There were more stage III patients (42.6%) than stage II (33.7%) or stage I (23.7%). Most of the patients (81.0%) had no comorbidities defined by CCI. At baseline, 41.4% of the patients had never smoked, and 58.6% were current or previous smokers.

Table 1.

Study cohort demographics.

| Demographic characteristic (N = 1,372) | Mean / N | SD / % |
|---|---|---|
| Age at diagnosis, years | 65.2 | 13.8 |
| Sex | | |
| Female | 676 | 49.3% |
| Male | 696 | 50.7% |
| Race/Ethnicity | | |
| Non-Hispanic White | 821 | 59.8% |
| Non-Hispanic Black | 230 | 16.8% |
| Hispanic | 198 | 14.4% |
| Other | 123 | 9.0% |
| Colon cancer stage at diagnosis | | |
| I | 325 | 23.7% |
| II | 462 | 33.7% |
| III | 585 | 42.6% |
| Charlson Comorbidity Index at diagnosis | | |
| None (0) | 1112 | 81.0% |
| Mild (1–2) | 117 | 8.5% |
| Moderate (3–4) | 62 | 4.5% |
| Severe (≥ 5) | 81 | 5.9% |
| Smoking status at diagnosis | | |
| Never | 568 | 41.4% |
| Ever | 804 | 58.6% |

Cancer diagnosis comparison. Although all 1,372 patients had records of colon cancer diagnosis in both the EHR and the TR data, there were differences in the diagnosis dates between the EHR and the TR data as shown in Table 2. The majority of the discrepancies (73.4%) were of less than 1 month. The TR diagnosis dates tended to be earlier than the EHR dates: 16.2% of patients had a TR date that was 1 to 3 months earlier than their EHR date, and 9.0% had a TR date that was more than 3 months earlier. There were very few patients whose diagnoses were recorded earlier in the EHR than the TR: 0.9% were 1 to 3 months earlier and 0.5% were more than 3 months earlier.

Table 2.

Differences in colon cancer diagnosis dates between EHR and TR data.

| Colon cancer diagnosis date difference (N = 1,372) | Frequency | Percent |
|---|---|---|
| EHR record is 1–3 months earlier than TR record | 12 | 0.87% |
| EHR record is >3 months earlier than TR record | 7 | 0.51% |
| TR record is 1–3 months earlier than EHR record | 222 | 16.18% |
| TR record is >3 months earlier than EHR record | 124 | 9.04% |
| Within 1 month | 1007 | 73.40% |
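The categories in Table 2 can be reproduced with a small binning function. This sketch assumes a 30-day month and assigns a difference of exactly three months to the 1–3 month bin; neither convention is stated in the text, and the function name is hypothetical.

```python
from datetime import date


def classify_dx_gap(ehr_date, tr_date, days_per_month=30):
    """Bin the EHR-vs-TR diagnosis date difference into Table 2 categories.

    Assumption: one month is 30 days, and exactly 3 months falls in the
    1-3 months bin.
    """
    diff = (ehr_date - tr_date).days  # positive when the TR date is earlier
    months = abs(diff) / days_per_month
    if months < 1:
        return "Within 1 month"
    earlier, later = ("TR", "EHR") if diff > 0 else ("EHR", "TR")
    span = "1–3 months" if months <= 3 else ">3 months"
    return f"{earlier} record is {span} earlier than {later} record"


label = classify_dx_gap(date(2015, 3, 1), date(2015, 1, 1))
```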

Although our statistical analyses focused on Group A patients, we further explored Group B and Group C patients’ cancer diagnosis records, as shown in Table 3. For patients in Group B (i.e., patients who had colon cancer diagnoses only in the EHR data), more than 50% (N = 1,051) of the patients were diagnosed with rectal cancer. For patients in Group C (i.e., patients who had colon cancer diagnoses only in the TR data), 75% (N = 602) of the patients’ first cancer diagnoses were before 2012, which would not have been captured in our EHR data. Among the remaining 25% (N = 195) of patients, many of their first cancer diagnoses were secondary malignancies.

Table 3.

Cancer-related diagnoses for Group B and C patients.

Cancer-related diagnosis in TR data for Group B patients (N = 2,061)

| ICD-O-3 site | Frequency (%) |
|---|---|
| C20.* (Malignant neoplasm of rectum) | 692 (33.58%) |
| C19.* (Malignant neoplasm of rectosigmoid junction) | 359 (17.42%) |
| C34.* (Malignant neoplasm of bronchus and lung) | 118 (5.88%) |

Cancer-related diagnosis after 2012 in EHR for Group C patients (N = 195)

| ICD-9/10 code | Frequency (%) |
|---|---|
| C78.* (Secondary malignant neoplasm of respiratory and digestive organs) | 36 (18.46%) |
| C19.* (Malignant neoplasm of rectosigmoid junction) | 26 (13.33%) |
| C7A.* (Malignant neuroendocrine tumors) | 19 (9.74%) |

Cancer-related diagnosis before 2012 in EHR for Group C patients (N = 602)*

* Patients whose colon cancer diagnoses were recorded as pre-2012 in the TR would not have their diagnosis captured in our EHR.

Vital status and death date comparison. In Group A, there were 248 (18.08%) patients with death information in the EHR data, while 272 (19.83%) patients were deceased according to the TR data. The confusion matrix is shown in Table 4. The kappa coefficient was 0.66 (95% CI: 0.61–0.71). The two datasets agreed that 75.8% of patients were alive and 13.7% were dead; among patients with inconsistent vital status, 4.37% were alive in the TR but deceased in the EHR and 6.12% were deceased in the TR but alive in the EHR.

Table 4.

Confusion matrix for vital status in EHR and TR data.

| EHR vital status | TR alive, n (%) | TR dead, n (%) | Total, n (%) |
|---|---|---|---|
| Alive | 1040 (75.8%) | 84 (6.12%) | 1124 (81.92%) |
| Dead | 60 (4.37%) | 188 (13.7%) | 248 (18.08%) |
| Total | 1100 (80.17%) | 272 (19.83%) | 1372 (100%) |
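The reported kappa coefficient can be checked directly from the four cell counts in Table 4 using the standard Cohen's kappa formula. The function below is an illustrative sketch (the name and argument order are our own), not the SAS procedure the authors used.

```python
def cohen_kappa_2x2(both_alive, ehr_alive_tr_dead, ehr_dead_tr_alive, both_dead):
    """Cohen's kappa for a 2x2 agreement table of EHR vs. TR vital status."""
    a, b, c, d = both_alive, ehr_alive_tr_dead, ehr_dead_tr_alive, both_dead
    n = a + b + c + d
    p_observed = (a + d) / n
    # Chance agreement from the row (EHR) and column (TR) marginal totals.
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)


kappa = cohen_kappa_2x2(1040, 84, 60, 188)
print(round(kappa, 2))  # 0.66
```

The result matches the kappa of 0.66 reported in the text.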

We compared the death dates for those recorded as deceased in both EHR and TR data. Among those 188 patients, only one patient had a death date that differed between the TR and the EHR data by over 1 month.

Figure 3 shows the distribution of rwOS in months. There were five patients with death or last contact dates that were earlier than their colon cancer diagnosis dates; this was likely due to data entry errors or other data quality-related issues. We excluded these patients from the statistical modeling. The maximum rwOS time was 101 months and the median was 22.6 months.

Using the EHR data, we identified 174 (12.68%) patients who had subsequent treatments. Table 5 shows the presence of subsequent treatment across different cancer stages. Among stage I patients, 5.85% had subsequent treatments. For the stage II and stage III patients, the percentages were 10.82% and 17.95%, respectively. The distribution of rwTTNT is shown in Figure 4. The average time to next treatment was 9.1 months with a standard deviation of 10.6 months and the median was 6.3 months.

Table 5.

Presence of subsequent treatment by colon cancer stage.

Presence of Subsequent Treatment
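| Colon cancer stage at diagnosis | No subsequent treatment | Subsequent treatment | Total |
|---|---|---|---|
| I | 306 (94.15%) | 19 (5.85%) | 325 |
| II | 412 (89.18%) | 50 (10.82%) | 462 |
| III | 480 (82.05%) | 105 (17.95%) | 585 |
| Total | 1198 | 174 | 1372 |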

Figure 4.

Distribution of real-world time to next treatment, in months.

Association between presence of subsequent treatment/rwTTNT and rwOS. Table 6 shows the estimates of the Cox proportional hazards model for rwOS using the presence of subsequent treatments as a predictor. Age at diagnosis was significantly inversely associated with rwOS, with a hazard ratio (HR) of 1.038. Sex was not a significant predictor of rwOS at the 0.05 level (HR 1.246, p = 0.0530), although the point estimate suggests that males had shorter rwOS than females. Compared to non-Hispanic Whites, non-Hispanic Blacks and Hispanics had significantly lower hazards, with HRs of 0.659 and 0.623, respectively. Cancer stage at diagnosis was significantly associated with rwOS: stage I and II patients had much longer rwOS compared to stage III patients, with HRs of 0.454 and 0.517, respectively. Patients with CCI > 0 had shorter rwOS compared to those with no comorbidities (HR 1.724). Smoking status at baseline was not associated with rwOS. The presence of subsequent treatment was not significantly associated with rwOS (HR 1.203, p = 0.2173).

Table 6.

Estimates of real-world overall survival with presence of subsequent treatments as a predictor.

| Parameter | Comparison | Hazard ratio | 95% CI lower | 95% CI upper | p-value |
|---|---|---|---|---|---|
| Age | | 1.038 | 1.028 | 1.048 | <.0001+ |
| Sex | Male vs. Female | 1.246 | 0.997 | 1.556 | 0.0530# |
| Race-ethnicity | NHB vs. NHW | 0.659 | 0.473 | 0.918 | 0.0138+ |
| Race-ethnicity | Hispanic vs. NHW | 0.623 | 0.425 | 0.915 | 0.0157+ |
| Race-ethnicity | Other vs. NHW | 0.911 | 0.600 | 1.383 | 0.6614 |
| Colon cancer stage at diagnosis | I vs. III | 0.454 | 0.332 | 0.619 | <.0001+ |
| Colon cancer stage at diagnosis | II vs. III | 0.517 | 0.399 | 0.670 | <.0001+ |
| Charlson comorbidity index at diagnosis | > 0 vs. = 0 | 1.724 | 1.343 | 2.214 | <.0001+ |
| Smoking status at diagnosis | Ever vs. Never | 1.027 | 0.805 | 1.311 | 0.8288 |
| Presence of subsequent treatment | Yes vs. No | 1.203 | 0.897 | 1.614 | 0.2173 |

+ statistically significant at 0.05 level; # close to statistically significant at 0.05 level.

Table 7 shows the estimates of the Cox proportional hazards model for rwOS with rwTTNT as a predictor on patients who had subsequent treatments. Real-world TTNT itself was statistically significant; it had an HR of 0.973, which means patients with longer rwTTNT had a longer rwOS.
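To make the rwTTNT hazard ratio easier to read, a per-unit HR from a Cox model can be rescaled to a larger increment by exponentiation. The sketch below assumes that rwTTNT entered the model as a continuous, linear term measured in months (the unit used when its distribution is reported); the text does not state the unit explicitly, so the 12-month figure is illustrative only.

```python
import math


def scale_hazard_ratio(hr_per_unit, units):
    """Rescale a per-unit hazard ratio to a multi-unit increment.

    Under a log-linear Cox term, HR(k units) = exp(k * ln(HR per unit)).
    """
    return math.exp(math.log(hr_per_unit) * units)


# Assumption: the reported HR of 0.973 is per additional month of rwTTNT.
hr_12_months = scale_hazard_ratio(0.973, 12)
print(round(hr_12_months, 2))  # 0.72
```

Under that assumption, each additional 12 months of rwTTNT would correspond to roughly a 28% lower hazard of death.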

Table 7.

Estimates of Cox proportional hazards model for real-world overall survival with real-world time to next treatment as a predictor.

| Parameter | Comparison | Hazard ratio | 95% CI lower | 95% CI upper | p-value |
|---|---|---|---|---|---|
| Age | | 1.005 | 0.982 | 1.029 | 0.6568 |
| Sex | Male vs. Female | 1.049 | 0.618 | 1.780 | 0.8599 |
| Race-ethnicity | NHB vs. NHW | 0.611 | 0.315 | 1.185 | 0.1451 |
| Race-ethnicity | Hispanic vs. NHW | 0.546 | 0.165 | 1.805 | 0.3212 |
| Race-ethnicity | Other vs. NHW | 0.791 | 0.273 | 2.289 | 0.6653 |
| Colon cancer stage at diagnosis | I vs. III | 0.362 | 0.119 | 1.096 | 0.0721# |
| Colon cancer stage at diagnosis | II vs. III | 0.539 | 0.286 | 1.016 | 0.0559# |
| Charlson comorbidity index at diagnosis | > 0 vs. = 0 | 1.165 | 0.633 | 2.142 | 0.6241 |
| Smoking status at diagnosis | Ever vs. Never | 0.746 | 0.367 | 1.517 | 0.4181 |
| rwTTNT | | 0.973 | 0.948 | 0.999 | 0.0452+ |

+ statistically significant at 0.05 level; # close to statistically significant at 0.05 level.

Figure 6 shows the Kaplan–Meier (KM) survival plots of rwOS for the patients with subsequent treatment only, stratified by sex, race-ethnicity, stage, and CCI. Because of the small sample size, the differences between the survival curves for each stratum, except the colon cancer stage, are small.
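For readers unfamiliar with how such curves are built, the following is a minimal pure-Python sketch of the Kaplan–Meier product-limit estimator. It is not the authors' SAS code, the toy data are invented for illustration, and stratified curves would simply apply the function to each subgroup separately.

```python
def kaplan_meier(durations, events):
    """Product-limit survival estimate at each distinct event time.

    durations: follow-up times (e.g., rwOS in months).
    events: 1 if the patient died at that time, 0 if censored.
    Returns a list of (time, survival_probability) pairs.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # deaths and censored both leave the risk set
    return curve


# Invented toy data: four patients, one censored at month 8.
curve = kaplan_meier([5, 8, 8, 12], [1, 0, 1, 1])
```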

Discussion and Conclusion

In this study, we used RWD from linked EHR and TR to identify and validate two real-world endpoints, rwOS and rwTTNT, in stage I–III colon cancer patients. We identified the necessary events on colon cancer patients’ treatment timelines, including date of cancer diagnosis, vital status and last contact date, and presence and dates of cancer treatments, in order to establish the two real-world endpoints. We assessed the discrepancies in these events and related dates between the EHR and the TR data. Using the longitudinal records in the EHR data, we differentiated the first course of cancer treatment (focusing on chemotherapy) from any subsequent courses of treatment and used the associated dates to compute rwTTNT. We showed that the presence of subsequent treatments was not significantly associated with overall survival. We also showed that longer rwTTNT was significantly associated with longer rwOS.

There were discrepancies between our EHR and TR data, revealing various potential data quality issues with RWD. First, the records of colon cancer diagnosis did not always line up between the EHR and TR data; some patients had a colon cancer diagnosis recorded in only the EHR (Group B) or only the TR (Group C). Possible reasons for such discrepancies between the EHR and the TR could be misdiagnosis (e.g., patients with rectal cancer were initially misdiagnosed with colon cancer and then later determined to have rectal cancer, which was reported to the TR); reporting latency (there is typically a delay of more than 6 months in reporting to the TR because of the manual abstraction process); or patient continuity issues that lead to EHR data continuity issues (e.g., patients seeking care across different health systems). The discrepancies between the EHR data and the TR data in the dates of cancer diagnosis were, however, smaller than we expected. For those patients with consistent colon cancer diagnoses in both datasets (Group A, our final study cohort), over 70% had a difference in their diagnosis dates of less than 1 month. The differences between the EHR death dates and the TR death dates were likewise negligible: only 1 patient had a difference in death dates of over 1 month. We did not anticipate that the last date of contact information for surviving patients would be missing from the TR data. Although RWD has data quality issues in completeness and accuracy, linking RWD from multiple sources such as EHRs and TRs could provide a more accurate depiction of patients’ care history and health status.

In terms of real-world endpoints, past studies such as Stewart et al. (2019)15 have only shown a positive correlation (a simple Spearman’s rank-order correlation) between rwTTNT and rwOS, without controlling for other covariates. In our study, we built two Cox proportional hazards models for survival analysis and controlled for a number of important risk factors such as age, sex, race-ethnicity, cancer stage, and CCI. In the first model, we dichotomized the rwTTNT as having vs. not having a second course of treatment, which led to a larger sample size. Thus, we had more statistical power to model and examine the relationship between rwTTNT and rwOS. However, in this Cox model, the hazard ratio of having vs. not having a second course of treatment was not statistically significant, showing no significant association between presence of next treatment and rwOS. The crossing KM curves for the presence vs. absence of subsequent treatments (Figure 5) showed that the effects of having or not having subsequent treatments vary across time. Further in-depth studies are needed to investigate the association, for example by adding an interaction between presence of next treatment and cancer stage. In the second Cox model, we modeled the relationship between the actual length of the rwTTNT and patients’ overall survival and obtained a statistically significant result. As the Cox model is commonly used in modeling clinical trial results, being able to run Cox models with real-world endpoints and RWD is significant in that RWD-based analyses can generate results compatible with clinical trials. The ability to generate valid surrogate markers, such as the rwTTNT that we investigated in this study, provides health outcomes and comparative effectiveness research investigators with a valuable new tool to leverage as large collections of RWD become increasingly available.

Our study is not without limitations. First, our sample size was relatively small, especially for the second Cox model on patients who had subsequent treatments. The cohort may not represent the stage I–III colon cancer population well. This could be a potential reason for non-Hispanic Whites having a higher hazard (i.e., shorter OS) compared to non-Hispanic Blacks and Hispanics. As additional sites contribute their TR data to OneFlorida, we can expand the study cohort to achieve higher statistical power. Second, cancer stage information is currently only available discretely in our TR data; however, it is also prevalent in unstructured documents stored in EHRs (e.g., pathology reports). If only EHR data are available, we can explore advanced natural language processing (NLP) tools to unlock critical information such as cancer stage and other tumor characteristics. Third, the current rules for identifying the first course of treatment and subsequent treatments can be improved. For example, our rule defines a 90-day wash-out period to differentiate the end of the first course of treatment from the beginning of the subsequent treatment. We assumed that the 90-day gap could eliminate the effects of the previous treatment. However, different drugs have different wash-out periods, leading to potential misclassification of the different treatment courses. For example, some trials26 have defined a one-month wash-out period to eliminate the effects of the previous treatment for colon cancer. We can potentially further decompose the rules for each type of colon cancer chemotherapy. More fine-grained rules or computable phenotypes should be developed and validated for future studies.

In summary, we calculated rwTTNT and rwOS from EHR and TR data. We assessed the agreement between EHR and TR data on the colon cancer diagnosis, treatment, and survival-related events. We showed that rwTTNT is positively correlated with rwOS. Thus, rwTTNT could be an important surrogate marker for studies that utilize RWD and ultimately help to optimize colon cancer treatment to extend overall survival.

Figure 3.

Distribution of real-world overall survival, in months.

Figure 5.

Kaplan–Meier survival plots of real-world overall survival, stratified by (a) sex, (b) race-ethnicity, (c) colon cancer stage, (d) Charlson comorbidity index (CCI), and (e) presence of next treatment.

Figure 6.

Kaplan–Meier survival plots of real-world overall survival for the patients with subsequent treatment only, stratified by (a) sex, (b) race-ethnicity, (c) colon cancer stage, and (d) Charlson comorbidity index (CCI).

Acknowledgments

This work was supported in part by NIH grants R01CA246418, R21AG068717, and R21CA245858 and the OneFlorida Clinical Research Consortium (CDRN-1501-26692) funded by PCORI. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or PCORI.

References

