An Automated, High-Throughput Platform to Generate a High-Reliability, Comprehensive Rectal Cancer Database

Neal Bhutiani; Mahmoud MG Yousef; Abdelrahman Yousef; Mohammad Zeineddine; Mark Knafl; Olivia Ratliff; Uditha P Fernando; Anastasia Turin; Fadl A Zeineddine; Jeff Jin; Kristin Alfaro-Munoz; Drew Goldstein; George J Chang; Scott Kopetz; John Paul Shen; Abhineet Uppal

doi:10.1200/CCI.23.00219

Author manuscript; available in PMC: 2025 May 30.

Published in final edited form as: JCO Clin Cancer Inform. 2024 May;8:e2300219. doi: 10.1200/CCI.23.00219


PMCID: PMC12123765 NIHMSID: NIHMS2076867 PMID: 38759125

Abstract

Purpose:

Dynamic operations platforms allow for cross-platform data extraction, integration, and analysis, though application of these platforms to large-scale oncology enterprises has not been described. This study presents a pipeline for automated, high-fidelity extraction, integration, and validation of cross-platform oncology data in patients undergoing treatment for rectal cancer at a single, high-volume institution.

Patients and Methods:

A dynamic operations platform was used to identify patients with rectal cancer treated at MD Anderson Cancer Center between 2016 and 2022 who had MRI and preoperative treatment details available in the electronic health record (EHR). Demographic, clinicopathologic, tumor mutation, radiographic, and treatment data were extracted from the EHR using a methodology adaptable to any disease site. Data accuracy was assessed by manual review. Accuracy before and after implementation of synoptic reporting was determined for MRI data.

Results:

A total of 516 patients with localized rectal cancer were included. In the era after institutional adoption of synoptic reports, the dynamic operations platform extracted T (tumor) category data from the EHR with 95% accuracy compared to 87% prior to the use of synoptic reports, and N (lymph node) category with 88% compared to 58%. Correct extraction of pelvic sidewall adenopathy was 94% compared to 78% and EMVI accuracy was 99% compared to 89%. Neoadjuvant chemotherapy and radiation data was 99% accurate for patients who had synoptic data sources.

Conclusions:

Utilizing dynamic operations platforms enables automated cross-platform integration of multi-parameter oncology data with high fidelity in patients undergoing multimodality treatment for rectal cancer. These pipelines can be adapted to other solid tumors and, together with standardized reporting, can increase efficiency in clinical research and the translation of actionable findings towards optimizing patient outcomes.

Keywords: rectal cancer, bioinformatics, database construction, synoptic reporting

Summary

Key objective:

This work sought to demonstrate the feasibility and methodology of a pipeline for automated, high-fidelity extraction, integration, and validation of cross-platform oncology data in patients undergoing treatment for rectal cancer.

Knowledge generated:

A dynamic operations platform enabled extraction of T (tumor) and N (lymph node) category, pelvic sidewall adenopathy, extramural vascular invasion (EMVI), and chemotherapy and radiation data with excellent accuracy, and accuracy was improved in the setting of synoptic reports. Dynamic operations platforms, together with standardized reporting, can increase efficiency in clinical research and the translation of actionable findings towards optimizing patient outcomes.

Introduction

Over the last decade, cancer care has grown increasingly complex with the development of an ever-increasing number of targeted and immune therapies, as well as molecular biomarkers to guide treatment strategies.¹ This drive towards precision medicine has been accompanied by an increase in the volume of patient-level data generated through diagnostic and therapeutic strategies, including data captured over many discrete platforms.^2,3 Indeed, these data accompanied by sophisticated interpretation, potentially aided by artificial intelligence techniques, hold the promise for the development of individualized, targeted treatment strategies that aim to optimize outcomes for each individual patient.⁴ However, the current standard-of-care treatment of rectal cancer incorporates few if any elements of precision medicine, and significant challenges remain to implement a more personalized approach. While powerful analytic tools allow for identification of cross-modality predictors of treatment response and prolonged survival among patient populations, using these tools requires integrating large quantities of data from many sources into a single, unified database.^5,6 Conventional methods for data capture and entry struggle to achieve such integration in a timely and resource-efficient manner.⁷ Dynamic operations platforms, broadly speaking, represent software platforms that incorporate flowing data, programming languages, and reusable software components to aggregate, analyze, and operationalize data for research and/or clinical purposes.⁸ These platforms allow for cross-platform extraction, integration into a single database, and performance of analytics in real time using a web-based interface.⁸ However, to date, work describing how to effectively utilize these platforms for multidisciplinary, large-scale oncology applications remains limited.

The objective of this study was to develop and present a pipeline for automated, high-fidelity extraction, integration, and validation of cross-platform longitudinal oncology data in patients undergoing treatment for rectal cancer at a single, high-volume institution. In doing so, our group aimed to provide a framework for application of this methodology across disease sites for the broader oncology community.

Methods

Recognizing both the value of aggregating ‘real world’ data from a high-volume institution such as The University of Texas MD Anderson Cancer Center (MD Anderson; see Supplemental Figure 1 for annual new patients seen for colon and rectal cancer at the institution)⁶ and the fact that this volume of data makes manual review of individual charts impractical, we chose to build the database on the Palantir Foundry software platform (Syntropy, Cambridge, MA).^8,9 The Foundry platform aids in the extraction, integration, analysis, and transformation of clinical data, allowing the many elements of the Electronic Health Record (EHR; here Epic (Epic Systems, Verona, WI)) to be merged into unified datasets amenable to research analyses.¹⁰ All data were stored on a HIPAA-compliant server on the MD Anderson network for privacy protection.

Cohort selection

Under an MD Anderson Institutional Review Board (IRB) approved protocol, the Palantir Foundry software system was used to query the MD Anderson EHR to identify all patients with a diagnosis of rectal cancer seen at MD Anderson between 2016 and 2022. Patients with preoperative rectal magnetic resonance imaging (MRI) for initial staging at our institution were included. At our institution, high-quality rectal MRI represents an integral part of staging and treatment planning for all patients with rectal cancer. Without an initial rectal MRI, we are unable to verify accurate staging and, additionally, are unable to accurately and completely assess treatment response and other parameters. As such, patients without dedicated rectal MRI for initial staging were excluded from the database. Patients with non-adenocarcinoma pathologies such as anal squamous cell carcinoma or neuroendocrine carcinoma were also excluded, as were patients with metastatic or recurrent rectal adenocarcinoma (Figure 1).

Figure 1 – Flow diagram of patient selection for inclusion in study cohort
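The inclusion and exclusion criteria above can be written as a single predicate. The sketch below is illustrative only: the record fields and the `eligible` function are hypothetical names chosen for this example, not fields of the institutional EHR or of the Foundry pipeline.

```python
from dataclasses import dataclass


@dataclass
class PatientRecord:
    # Hypothetical fields chosen for illustration.
    patient_id: str
    year_seen: int
    histology: str
    has_staging_rectal_mri: bool
    metastatic_or_recurrent: bool


def eligible(p: PatientRecord) -> bool:
    """Apply the cohort criteria described in the text."""
    return (
        2016 <= p.year_seen <= 2022
        and p.has_staging_rectal_mri
        and p.histology == "adenocarcinoma"
        and not p.metastatic_or_recurrent
    )


# Toy records, not study data.
records = [
    PatientRecord("A", 2019, "adenocarcinoma", True, False),
    PatientRecord("B", 2019, "squamous cell carcinoma", True, False),
    PatientRecord("C", 2021, "adenocarcinoma", False, False),
    PatientRecord("D", 2015, "adenocarcinoma", True, False),
]
cohort = [p.patient_id for p in records if eligible(p)]
```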

Data Extraction

Data for this study were extracted from a variety of sources at MD Anderson, including structured data from the EHR, the radiology system (PACS IntelliSpace Radiology, Philips Healthcare, Houston, TX), and institutional databases (e.g. radiology reports, pathology reports, medication administration record), as well as unstructured data in notes. These data elements are synced daily to the Foundry platform. For synoptic data, which are presented in a text-based format that does not require interpretation based on context, regular expressions (Regex) were used, and more advanced methods such as natural language processing (NLP) were not deemed necessary. For non-synoptic reports that required context-based interpretation, NLP was used to extract the data. An internally developed NLP annotator, a parser driven by text parsing rules and dictionaries of treatment agents, dates, and numbers, was used as part of the NLP effort. This annotator was developed using the IBM Content Analytics Studio (now Watson Explorer Analytics Studio, IBM, Armonk, NY). For records that could not be processed by NLP, Regex was used as a supplementary tool for text extraction.

Demographic, Clinicopathologic, and Mutation Data

Patient age, gender, race and ethnicity were extracted from discrete data elements within the EHR. Pathology data was automatically extracted from the EHR. Pathology reports were based on a standardized template utilized by the College of American Pathologists.¹¹ Tumor mutation data, when available, was automatically extracted from the EHR.

MRI Analysis

Synoptic MRI reports were processed using a sequential regular expression algorithm that divided reports into sections based on a pre-existing template (Supplemental Figures 2, 3). These sections were further processed through individual regular expressions to extract T category, N status, extramural vascular invasion (EMVI) descriptions, pelvic sidewall (PSW) node description, distance from the anal verge, and threatened circumferential resection margin (CRM). MRI reports prior to 2018 were in a non-synoptic format, and manual review was performed to extract the variables of interest from these reports. Results from synoptic reports were manually reviewed to extract any uncategorized data and ensure data fidelity.
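As an illustration of the sequential regular expression approach, the following Python sketch first splits a synoptic report into labeled sections and then applies a field-specific pattern to each section. The report text, section labels, and patterns are hypothetical stand-ins; the institutional template (Supplemental Figures 2, 3) and the expressions actually used in the study are not reproduced here.

```python
import re

# Hypothetical synoptic report; the real template wording is an assumption.
REPORT = """\
T CATEGORY: T3
N CATEGORY: N+
EMVI: Present
PELVIC SIDEWALL NODES: None suspicious
DISTANCE FROM ANAL VERGE: 7.5 cm
CRM: Threatened
"""

# Step 1: split the report into "LABEL: value" sections.
SECTION_RE = re.compile(r"^(?P<label>[A-Z][A-Z ]+):\s*(?P<value>.*)$", re.MULTILINE)

# Step 2: one pattern per field, applied only to that field's section.
FIELD_PATTERNS = {
    "T CATEGORY": re.compile(r"\bT(?:[0-4]|x)\b", re.IGNORECASE),
    "N CATEGORY": re.compile(r"\bN(?:0|\+|x)", re.IGNORECASE),
    "EMVI": re.compile(r"\b(?:present|absent)\b", re.IGNORECASE),
    "DISTANCE FROM ANAL VERGE": re.compile(r"\d+(?:\.\d+)?(?=\s*cm)"),
}


def parse_report(text):
    """Return a dict of extracted values keyed by section label."""
    sections = {
        m["label"].strip(): m["value"].strip() for m in SECTION_RE.finditer(text)
    }
    parsed = {}
    for label, value in sections.items():
        pattern = FIELD_PATTERNS.get(label)
        if pattern is None:
            # No dedicated pattern: keep the raw text for manual review.
            parsed[label] = value
            continue
        match = pattern.search(value)
        parsed[label] = match.group(0) if match else None
    return parsed


parsed = parse_report(REPORT)
```

Restricting each pattern to its own section is what makes simple expressions workable: a pattern such as the N category expression never sees free text from other fields.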

CT Analysis

Initial staging CTs were defined as CTs of the chest, abdomen and pelvis performed within two weeks of the staging MRI. These reports were processed using sequential regular expression algorithms to extract mentions of liver or hepatic metastases, lung or pulmonary metastases, retroperitoneal lymphadenopathy and peritoneal metastases or carcinomatosis. For each site, categorization algorithms were developed with iterative manual review of identified terms to define radiographic presence or absence of a metastasis. Initial staging status was then calculated using the American Joint Committee on Cancer (AJCC) 8th edition categories for colorectal adenocarcinoma. The impression section of each report was then manually reviewed to audit categorization and ensure data fidelity.
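A minimal sketch of this two-step logic follows. Site-specific patterns flag mentions, a simple negation check stands in for the iteratively refined categorization algorithms, and the flagged sites are mapped to AJCC 8th edition M categories (M1a: one distant site without peritoneal disease; M1b: two or more sites without peritoneal disease; M1c: peritoneal metastasis). The patterns and negation cues are illustrative assumptions, not the dictionaries used in the study.

```python
import re

# Hypothetical site patterns built from the terms named in the text.
SITE_PATTERNS = {
    "liver": re.compile(r"\b(?:liver|hepatic)\s+metasta(?:sis|ses|tic)", re.I),
    "lung": re.compile(r"\b(?:lung|pulmonary)\s+metasta(?:sis|ses|tic)", re.I),
    "distant_nodes": re.compile(r"\bretroperitoneal\s+(?:lymph)?adenopathy", re.I),
    "peritoneum": re.compile(r"\bperitoneal\s+metasta(?:sis|ses)|\bcarcinomatosis", re.I),
}

# Assumed negation cues; a cue earlier in the same sentence negates the mention.
NEGATION = re.compile(r"\b(?:no|without|negative for)\b[^.]*$", re.I)


def sites_present(impression):
    """Return the set of sites with a non-negated mention."""
    found = set()
    for sentence in re.split(r"(?<=\.)\s+", impression):
        for site, pattern in SITE_PATTERNS.items():
            m = pattern.search(sentence)
            if m and not NEGATION.search(sentence[: m.start()]):
                found.add(site)
    return found


def m_category(sites):
    """Map flagged sites to an AJCC 8th edition M category."""
    if "peritoneum" in sites:
        return "M1c"
    if len(sites) >= 2:
        return "M1b"
    if len(sites) == 1:
        return "M1a"
    return "M0"


example = (
    "Two hepatic metastases in segment 7. "
    "No pulmonary metastases. No peritoneal carcinomatosis."
)
example_sites = sites_present(example)
```

Rules this simple will misclassify hedged or indeterminate findings, which is consistent with the manual audit of the impression section described above.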

Neoadjuvant therapy extraction

Utilizing the diverse sources of data that the Foundry system allows investigators to explore, the neoadjuvant therapy that patients received prior to their surgeries was identified as follows. Neoadjuvant therapy was divided into radiotherapy and chemotherapy. Radiotherapy data (total dose, fractionation) for patients treated at MD Anderson (MDACC) were extracted in automated fashion from the EHR. Data for patients treated at outside facilities were extracted using sequential regular expressions, a method successfully employed by other groups, within the Foundry system (Supplemental Figure 4) to process the patients’ clinical notes that were written by radiation oncologists prior to the use of a synoptic treatment reporting system at MDACC.^12–14 Several methods were used to categorize the regimen utilized for neoadjuvant chemotherapy. For patients treated at MDACC, pharmacy delivery data and associated treatment plans were identified for each patient. Chemotherapy regimens given within six months of rectal surgery were included. For patients treated outside MDACC, neoadjuvant therapy data were extracted using NLP to identify patients treated with 5-fluorouracil (5-FU), oxaliplatin (OX), and/or irinotecan (IR). Any remaining patients had data extracted by manual review of the EHR. In both cases, after fully automated extraction was attempted, missing data were assessed for causes of incompleteness and/or inaccuracy, and automated extraction methods were updated in an iterative fashion. As a final step, the quality of the automatically extracted data was systematically evaluated.
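Two of these steps lend themselves to a short sketch: pulling total dose and fractionation out of a free-text radiation oncology note, and keeping only chemotherapy given within six months before surgery. The pattern, the function names, and the choice of 183 days as "six months" are assumptions made for illustration; the expressions used in the study are not reproduced here.

```python
import re
from datetime import date, timedelta

# Hypothetical pattern for phrases such as "50.4 Gy in 28 fractions".
DOSE_RE = re.compile(
    r"(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>c?Gy)\s*(?:in|/)\s*"
    r"(?P<fractions>\d+)\s*(?:fractions?|fx)",
    re.IGNORECASE,
)


def parse_radiotherapy(note):
    """Return total dose (Gy) and fraction count, or None if not found."""
    m = DOSE_RE.search(note)
    if m is None:
        return None
    dose = float(m["dose"])
    if m["unit"].lower() == "cgy":
        dose /= 100.0  # convert cGy to Gy
    return {"total_dose_gy": dose, "fractions": int(m["fractions"])}


def is_neoadjuvant(chemo_date, surgery_date, window_days=183):
    """True if chemotherapy was given before surgery and within the window."""
    return timedelta(0) < surgery_date - chemo_date <= timedelta(days=window_days)


long_course = parse_radiotherapy("Completed 50.4 Gy in 28 fractions at an outside facility.")
short_course = parse_radiotherapy("Received 2500 cGy / 5 fx.")
```

Notes that return `None` are the ones that would fall through to manual review.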

Data validation

To validate data automatically extracted using the Foundry platform, charts of all patients (516) were manually reviewed with respect to radiographic and pathologic data. All data fields were identified in each chart, and the values obtained by automated extraction were compared to those obtained by manual review by two board-certified surgeons, one in General Surgery and one in General Surgery, Surgical Oncology, and Colon and Rectal Surgery. Any fields with discrepancies were flagged and coded as incorrectly entered by the automated process so the accuracy of the automated process could be evaluated. Review of patient charts only compared manually extracted values to automatically extracted values; no independent interpretation of data, such as interpretation of radiology imaging or pathology slides, was performed.
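The comparison described here amounts to a field-by-field check of automated values against the manual reference. A minimal sketch, using hypothetical field names and toy values rather than study data:

```python
def field_accuracy(automated, manual, fields):
    """Return per-field accuracy and the patient IDs flagged as discrepant.

    Manual review is treated as the reference; a missing automated value
    counts as a discrepancy.
    """
    report = {}
    for field in fields:
        flagged = sorted(
            pid
            for pid, chart in manual.items()
            if automated.get(pid, {}).get(field) != chart.get(field)
        )
        total = len(manual)
        accuracy = (total - len(flagged)) / total if total else None
        report[field] = {"accuracy": accuracy, "flagged": flagged}
    return report


# Toy data, not study data.
manual = {
    "p1": {"mri_t": "T3", "mri_n": "N+"},
    "p2": {"mri_t": "T2", "mri_n": "N0"},
    "p3": {"mri_t": "T4", "mri_n": "N+"},
    "p4": {"mri_t": "T3", "mri_n": "N0"},
}
automated = {
    "p1": {"mri_t": "T3", "mri_n": "N+"},
    "p2": {"mri_t": "T2", "mri_n": "N+"},  # wrong N category
    "p3": {"mri_t": "T4"},                 # N category not extracted
    "p4": {"mri_t": "T3", "mri_n": "N0"},
}
report = field_accuracy(automated, manual, ["mri_t", "mri_n"])
```

Running the same function separately on reports from before and after synoptic reporting would give the kind of era-level comparison shown for the MRI fields.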

Results

A total of 516 patients with localized rectal cancer met inclusion criteria and were included in the study. Using a dynamic operations platform pipeline, clinicodemographic, operative, imaging, pathology, laboratory, molecular, chemotherapy, and radiation treatment data were extracted from the EHR and integrated into a database for analysis (Figure 2). The majority of patients had T3 tumors both on pre-operative rectal MRI and final pathology and received either preoperative chemoradiation or total neoadjuvant therapy (both systemic chemotherapy and radiation) (Table 1). Among patients who received preoperative radiation therapy, most received long course (5 week) chemoradiation, and among those who received preoperative chemotherapy, FOLFOX was the most common regimen.

Figure 2 – Illustration of workflow using Foundry platform for data acquisition, database construction, and data evaluation

Table 1 – Clinicopathologic details of study cohort automatically extracted from electronic health record

Patient Demographics	Study cohort (n=516)	%

Male gender	296	57.40%

Race

White or Caucasian	400	77.50%
Black or African American	36	7.00%
Asian	30	5.80%
American Indian or Alaska Native	2	0.40%
Other	45	8.70%
Declined to Answer	3	0.60%

Ethnicity

Not Hispanic or Latino	427	82.70%
Hispanic or Latino	84	16.30%
Declined to Answer	5	1.00%

Age at diagnosis	56 (17–92)



MRI Data

T category

T0	4	0.80%
T1	2	0.40%
T2	126	24.40%
T2/T3	21	4.10%
T3	274	53.00%
T4	79	15.30%
Tx	10	2.00%

N category

N0	87	16.90%
N+	302	58.50%
Nx	127	24.60%

Extramural vascular invasion	164	31.80%

Pelvic sidewall adenopathy	69	13.40%

Mesorectal fascia involvement	155	30.10%

Anal sphincter involvement	63	12.20%

Distance from anal verge (cm)	7 (0–29)



CT Data

Stage IV at diagnosis

0	413	80.00%
a	66	12.80%
b	4	0.80%
c	2	0.40%
Missing	31	6.00%

Liver metastases at diagnosis	33	6.40%

Lung metastases at diagnosis	40	7.80%

Distant lymph node metastases at diagnosis	1	0.20%

Peritoneal metastases at diagnosis	2	0.40%



Mutational Data

KRAS

Wildtype	93	18.00%
Mutation	58	11.20%
Not-tested	365	70.80%

BRAF

Wildtype	142	27.50%
Mutation	10	1.90%
Not-tested	364	70.60%

TP53

Wildtype	44	8.50%
Mutation	96	18.60%
Not-tested	376	72.90%

PTEN

Wildtype	132	25.50%
Mutation	8	1.60%
Not-tested	376	72.90%

FBXW7

Wildtype	114	22.10%
Mutation	26	5.00%
Not-tested	376	72.90%

SMAD4

Wildtype	125	24.20%
Mutation	14	2.70%
Not-tested	377	73.10%

PIK3CA

Wildtype	130	25.20%
Mutation	17	3.30%
Not-tested	369	71.50%

BRCA2

Wildtype	128	24.80%
Mutation	6	1.20%
Not-tested	382	74.00%

ARID1A

Wildtype	115	22.30%
Mutation	6	1.20%
Not-tested	395	76.60%

NOTCH1

Wildtype	136	26.40%
Mutation	4	0.80%
Not-tested	376	72.80%

NOTCH3

Wildtype	78	15.10%
Mutation	2	0.40%
Not-tested	436	84.50%

ATM

Wildtype	130	25.20%
Mutation	8	1.60%
Not-tested	378	73.20%

NRAS

Wildtype	144	27.80%
Mutation	6	1.20%
Not-tested	366	71.00%

ERBB2

Wildtype	132	25.60%
Mutation	7	1.40%
Not-tested	377	73.00%

KIT

Wildtype	139	27.00%
Mutation	0	0%
Not-tested	377	73.00%

EGFR

Wildtype	138	26.80%
Mutation	1	0.20%
Not-tested	377	73.00%



Preoperative Therapy

Preoperative regimen

Conventional chemoradiation	147	28.50%
Total Neoadjuvant Therapy	185	35.90%
Chemotherapy only	30	5.80%
Radiation therapy only	10	1.90%
Other regimen	5	1.00%
No neoadjuvant	139	26.90%

Preoperative chemotherapy

FOLFOX	95
FOLFOX+BEVACIZUMAB	10
FOLFOX, CAPECITABINE	51
FOLFOXIRI	16
FOLFOXIRI, CAPECITABINE	5
FOLFOXIRI, BEVACIZUMAB	4
XELOX	12
Other	27

Preoperative radiation

Long course chemoradiation	230	44.60%
Short course radiation	112	21.70%



Pathology Data

pT

pT0	34	6.60%
pT1	57	11.00%
pT2	114	22.10%
pT3	268	51.90%
pT4	39	7.60%
pTx	1	0.20%
Missing	3	0.60%

pN

pN0	312	60.50%
pN1	131	25.40%
pN2	61	11.80%
pNx	9	1.70%
Missing	3	0.60%

Grade

1	7	1.40%
2	448	86.80%
3	57	11.00%
x	2	0.40%
Missing	2	0.40%

Lymphovascular invasion	228	44.20%

Perineural invasion	187	36.20%

Negative margins	510	98.80%

Status of mesorectal excision

Complete	139	26.90%
Incomplete	15	2.90%
Nearly complete	59	11.50%
Missing	303	58.70%

Tumor Regression Grade

0	36	7.00%
1	70	13.60%
2	200	38.80%
3	62	12.00%
NA	148	28.70%



Follow-up

Recurrence	27	5.20%

Time to recurrence (days; median, range)	777 (93–2410)

Death	22	4.30%


Assessment of accuracy of data extraction from MRI reports revealed a marked difference between the eras before (2016–2018) and after (2018–present) synoptic MRI reporting (Figure 3). In the era after institutional adoption of synoptic reports, the dynamic operations platform was able to extract T (tumor) status data from the EHR with 95% accuracy compared to 87% prior to the use of synoptic reports, and N (lymph node) status with 88% accuracy compared to 58%. Correct extraction of pelvic sidewall adenopathy was 94% compared to 78%, and EMVI extraction accuracy was 99% compared to 89%. Lymph node status was the least accurately coded field.

Figure 3 – Accuracy assessment of data extraction from MRI reports before and after synoptic reporting with respect to T category (a), N category (b), pelvic sidewall adenopathy (c), and extramural vascular invasion (EMVI) (d), respectively. In each panel, the top diagram represents the era preceding synoptic reporting while the bottom diagram represents the era after synoptic reporting was adopted. In each diagram, the top flow represents accurate data while the bottom flow represents inaccurate data.

With respect to staging, the dynamic operations platform accurately captured both radiographic (clinical) and pathologic staging information. Of the 671 patients with Stage I-III disease, 590 (88%) were accurately categorized despite the lack of synoptic CT reports. Accuracy for categorized Stage IV patients was lower at 44% (101/229). Pathologic T status was accurate in 98.8% of cases, and pathologic N status was accurate in 93.4% of cases. It is important to note that, throughout the entire study period, pathology at MDACC was reported using the synoptic College of American Pathologists (CAP) pathology report.¹¹

Neoadjuvant chemotherapy and radiation data extraction were similarly precise. Among patients receiving neoadjuvant chemotherapy at MDACC, 265 (72%) patients were categorized by direct data extraction from the synoptically reported pharmacy delivery data and associated treatment plans. The remaining patients’ neoadjuvant therapy data were extracted using natural language processing (NLP), yielding another 43 (12%) patients. The remaining 61 (16%) were extracted by manual review of the EHR (Supplemental Figure 4). The recall for the patients who received their therapy at MDACC and had synoptic data sources was calculated by manually reviewing the patients who were not included through this data source for their medication receiving facility, and recall was found to be 99%. NLP methods were validated against 527 occurrences of chemotherapy regimens in notes and yielded 99% accuracy for medication names but only 80% accuracy for date of chemotherapy infusion when compared against 85 date occurrences in notes (Table 2).

Table 2 – Accuracy assessment of neoadjuvant chemotherapy data

	Completely automated extraction^*	Partially identified by the NLP-built dataset^**	Completely manual extraction^***
Number of patients	265	43	61
Percent of total received neoadjuvant chemotherapy	72%	12%	16%
Accuracy of automated data extraction	Precision: 99% Recall: 99%	Precision: Medication names 99% Dates of receiving medications 80% Recall: Medication names 99% Dates of receiving medications 85%	N/A


^* 265 patients received neoadjuvant chemotherapy inside MDA or from our pharmacy (254); for these patients the information is automatically inserted from the pharmacy system, so medication information and dates were accurate. 231 patients had their treatment plan formulated inside MDA and synoptically reported by the GI medical oncologist, so medication information and dates were also accurate.

^** 43 patients had neoadjuvant chemotherapy identified with dates in their clinical notes but received the chemotherapy in an outside facility.

^*** 61 patients received neoadjuvant chemotherapy in an outside facility and either didn’t have dates with their regimens in the NLP dataset or weren’t included in the NLP dataset.
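For reference, precision is the share of extracted items that are correct, and recall is the share of true items that were extracted. The sketch below computes both from sets of (patient, agent) pairs; the pairs are toy values chosen for illustration and are not study data.

```python
def precision_recall(extracted, reference):
    """Compute precision and recall of extracted items against a reference set."""
    extracted, reference = set(extracted), set(reference)
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy data, not study data.
extracted_pairs = {("p1", "5-FU"), ("p1", "OX"), ("p2", "IR")}
reference_pairs = {("p1", "5-FU"), ("p1", "OX"), ("p2", "5-FU"), ("p3", "OX")}
precision, recall = precision_recall(extracted_pairs, reference_pairs)
```

Reporting the two measures separately, as in Table 2, distinguishes an extractor that misses regimens from one that records regimens that were never given.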

Meanwhile, among patients receiving radiation therapy, data was extracted in a fully automated manner for 274 (80%) patients. Data for 28 (8%) patients was extracted utilizing regular expression codes (Supplemental Figure 5). Data for a final cohort of 41 (12%) patients were extracted manually from the EHR, and these patients mostly received their radiotherapy in an outside facility (Supplemental Figure 6). Data was 99% accurate for patients who had synoptic data sources and 86% accurate for those with data extracted using regular expression codes within the dynamic operations platform (Table 3).

Table 3 – Accuracy assessment of neoadjuvant radiation data

	Completely automated extraction^*	Partially identified by the Foundry system^** (extracted from notes using regex)	Partially identified by the Foundry system^** (manually extracted)	Completely manual extraction^***
Number of patients	274	28	19	22
Percent of total received neoadjuvant radiotherapy	80%	8%	6%	6%
Accuracy of automated data extraction	99%	86%	N/A	N/A


^* 274 patients received neoadjuvant radiotherapy and were recorded in the BROCADE interventions database. These patients received their radiotherapy at MDACC (reported in synoptic fashion by the radiation oncologist).

^** 47 patients had radiation oncology notes but were not entered in the BROCADE system; they received the therapy in an outside facility, with the management plan formulated at our institution.

^*** 22 patients received neoadjuvant radiotherapy at an external institution and did not have radiation oncology notes at our institution.

Discussion

Herein, we demonstrate the construction of a platform that allows for highly accurate automated data extraction across multiple data fields for patients with rectal cancer. This platform allows for seamless integration of clinicodemographic, radiographic, treatment, mutational, and pathologic data into a data repository for query and analysis. While used here in the context of rectal cancer, this platform can be applied across disease sites with site-specific modifications. As a result, it allows for highly efficient data capture that can be applied for real-time interrogation of complex patient-level data. Indeed, the real power of this platform lies in its ability to efficiently integrate multiple data elements from across modalities of cancer care with high fidelity. As the volume and complexity of both multimodal cancer therapy and the patient-level data generated in the context of cancer care increase, maintaining data repositories has grown increasingly resource-intensive.⁷ However, effectively utilizing these data holds the potential to evaluate treatment patterns and outcomes in real time. These analyses can then inform iterative changes to patient care aimed at refining and optimizing outcomes for cancer patients. Use of a platform like the one presented here has the potential to allow a larger number of healthcare systems to unlock the potential held by large-scale patient-level data by automating the process of acquiring, synthesizing, and analyzing data already present within the electronic health record. It remains important to note, however, that accuracy of this automated approach hinges, in large part, on implementation of standardized reporting for each of the elements comprising a given patient’s cancer care.

Synoptic reporting in clinical notes greatly increases the efficiency of data extraction using word analysis algorithms, as demonstrated by the improved accuracy of MRI staging information after implementation of synoptic notes for this modality, the high accuracy of pathologic staging data extraction from synoptic reports throughout the time period studied and the relatively low accuracy of identifying stage IV disease using non-synoptic CT reports.^15–17 In addition, synoptic reporting allows clinicians to rapidly identify pertinent data during clinical care, thus aligning the needs of research and clinical efforts. Reasons for improved efficiency include decreased variability in wording and decreased variation in location of the pertinent field in the report. Synoptic reporting has been increasingly implemented by the College of American Pathologists and the Commission on Cancer, including for operative reports.^11,18 This has allowed for improved data-gathering for quality monitoring and will become more widespread as electronic health records allow for implementation of these reports. Towards this end, our group is currently in the process of implementing synoptic operative reports in addition to synoptic pathology and radiology reporting (Supplemental Figure 7).

Data extraction algorithms used in this study relied primarily on regular expressions (RegEx), a standardized and widely implemented set of text pattern-matching methods available in most computer programming languages.^12–14 Features of this method include relative ease of implementation, widespread availability as open-source software and consistent output across a wide variety of text. RegEx is limited in that it is not an adaptive algorithm and thus cannot be trained in a specific lexicon. Thus, implementation requires an iterative approach with a human programmer and can be time-consuming for non-synoptic reports. Natural language processing (NLP) is a broad term to describe adaptive algorithms that can be trained to categorize text using statistical approaches. These approaches allow for variation in human grammar prevalent in non-synoptic reports, though they require large volumes of text for accurate training. In addition, the output is provided in a probabilistic framework and classification errors can be difficult to debug. Both methods ultimately require human expertise to ensure that categorization of data is accurate, and tradeoffs between accuracy and sensitivity are tuned to the purposes of the research goals.

It remains important to note that the ideas presented herein are generalizable to other disease processes and hospitals/clinics even without the use of integration systems, with modification for specific clinical scenarios. They can be developed for other institutions using open-source programming languages, as the platform used in this work primarily serves as a data pipeline mechanism to integrate these languages into a single user interface. The dictionary of terms used in the regular expressions would have to be different for different body sites and disease processes (as evidenced by work in areas such as heart failure and early detection of multiple sclerosis) due to different treatment options, staging, etc.^19–21 These dictionaries represent parameters to the database creation code. Using this approach at different hospitals would face similar dictionary/synonym problems, which could be solved in very much the same way. The ideas presented here could be implemented with various individual open-source SQL databases, using Python, SQL, or R to do the data joins and regular expressions needed. The added value conferred by the system used in the present study is the integration of all of these tools and the ability to query data lineage to see all transformations. There are also several tools within the dynamic operations platform ecosystem used in this work that could reduce the coding needed, which allows much faster project completion. These tools are not necessary, and equivalent functionality could be created by a development team at other institutions using the aforementioned open-source programs. Finally, additional platforms, some open-source, are currently available as alternatives to the platform used for the work presented in this manuscript, including FIDDLE, MIMIC, and Oracle-based relational databases, that can assist with database integration and data processing.^22–24
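To make the open-source route concrete, the following sketch joins two tables in an in-memory SQLite database and applies a dictionary-driven regular expression to the joined notes, with the dictionary passed in as a parameter so that it can be swapped for another disease site. The table layout, column names, and synonym dictionary are hypothetical and chosen only for illustration.

```python
import re
import sqlite3

# Hypothetical synonym dictionary; a different disease site would pass its own.
AGENT_DICTIONARY = {
    "5-FU": ["5-fluorouracil", "5-FU", "fluorouracil"],
    "OX": ["oxaliplatin"],
    "IR": ["irinotecan"],
}


def compile_dictionary(dictionary):
    """Build one case-insensitive pattern per dictionary entry."""
    return {
        code: re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
        for code, terms in dictionary.items()
    }


def extract_agents(patients, notes, dictionary):
    """Join notes to cohort patients and return agents mentioned per patient."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE patients (patient_id TEXT PRIMARY KEY, diagnosis TEXT)")
    conn.execute("CREATE TABLE notes (patient_id TEXT, note_text TEXT)")
    conn.executemany("INSERT INTO patients VALUES (?, ?)", patients)
    conn.executemany("INSERT INTO notes VALUES (?, ?)", notes)
    patterns = compile_dictionary(dictionary)
    rows = conn.execute(
        "SELECT p.patient_id, n.note_text FROM patients p "
        "JOIN notes n ON n.patient_id = p.patient_id ORDER BY p.patient_id"
    )
    agents = {}
    for patient_id, text in rows:
        hits = {code for code, pattern in patterns.items() if pattern.search(text)}
        agents.setdefault(patient_id, set()).update(hits)
    conn.close()
    return agents


# Toy data, not study data.
patients = [("p1", "rectal adenocarcinoma"), ("p2", "rectal adenocarcinoma")]
notes = [
    ("p1", "Received oxaliplatin and 5-fluorouracil."),
    ("p2", "Started Irinotecan."),
    ("p3", "Received oxaliplatin."),  # not in the cohort, dropped by the join
]
agents_by_patient = extract_agents(patients, notes, AGENT_DICTIONARY)
```

What this sketch does not provide is lineage tracking across transformations, which is the added value attributed above to the integrated platform.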

This study should be interpreted in light of several limitations. This represents a single-institution study with a mature EHR and emphasis on standardized reporting (though not universal). Additionally, most patients treated at the institution received most or all aspects of their multidisciplinary care at MDACC, allowing for efficient and accurate data capture in the EHR. Similar accuracy may not be possible in centers in which patients receive imaging and treatment at multiple different locations given the heterogeneity of reporting patterns that may be present at those facilities and the challenges with integrating that data into a single EHR. Moreover, this effort involved multiple individuals with experience using the platform and associated programming tools. Finally, it bears noting that this work focuses on describing this platform as well as its construction and capabilities rather than on outcome data. Additional work is ongoing utilizing the database and platform described in this manuscript to evaluate outcomes in specific sub-populations of our cohort. This outcome-focused work will report more detailed, granular outcomes in specific populations undergoing treatment for rectal cancer at our center.

Conclusions

As oncology practitioners incorporate increasing amounts of data to provide personalized, multidisciplinary cancer care, research and clinical databases will require more advanced, semi-automated methods to extract and process data. Utilizing dynamic operations platforms enables automated cross-platform integration of multi-parameter oncology data with high fidelity in patients undergoing multimodality treatment for rectal cancer. These pipelines can be readily adapted to a variety of solid tumors. Through the implementation of synoptic reports that also increase clinical efficiency, rapidly evolving machine learning algorithms and careful auditing of the extracted information, clinical research can be made more efficient. This will reduce the lag time between development of research questions, analysis of data and ultimate deployment in clinical practice for meaningful improvement of patient outcomes.

Supplementary Material


Supplemental Figure 1: Annual number of new colon and rectal cancer patients seen at The University of Texas MD Anderson Cancer Center

Supplemental Figure 2: Templates for baseline MRI reporting

Supplemental Figure 3: Template for follow-up MRI reporting

Supplemental Figure 4: Flow diagram for data acquisition for neoadjuvant chemotherapy

Supplemental Figure 5: Examples of regular expression codes for radiotherapy data extraction

Supplemental Figure 6: Flow diagram for data acquisition for neoadjuvant radiation

Supplemental Figure 7: Template for robotic low anterior resection operative note

NIHMS2076867-supplement-Supplemental.pdf (294.1 KB, PDF)

Figure 4 – Data extraction process for neoadjuvant chemotherapy.

Figure 5 – Data extraction process for neoadjuvant radiation therapy.

* Dataset that contains all interventions received by the patients, including radiotherapy.

** Dataset that contains all the patients’ notes.

Acknowledgements

This work was supported by the Col. Daniel Connelly Memorial Fund, the National Cancer Institute (K22 CA234406 to J.P.S. and Cancer Center Support Grant P30 CA016672), the Cancer Prevention & Research Institute of Texas (RR180035 to J.P.S.; J.P.S. is a CPRIT Scholar in Cancer Research), and a Conquer Cancer Career Development Award (CDA-7604125121 to J.P.S.). Any opinions, findings, and conclusions expressed in this material are those of the author(s) and do not necessarily reflect those of the American Society of Clinical Oncology® or Conquer Cancer. Additionally, this work was enabled by The University of Texas MD Anderson Cancer Center Context Engine and the Context Engine Team. The Context Engine is MD Anderson’s institutional Data Management System and Digital Architecture.

Footnotes

Author Disclosure Statement: The authors have no competing interests to report.

References

1. Davis AA, McKee AE, Kibbe WA, Villaflor VM. Complexity of Delivering Precision Medicine: Opportunities and Challenges. American Society of Clinical Oncology Educational Book. 2018(38):998–1007.
2. Schlick CJR, Castle JP, Bentrem DJ. Utilizing Big Data in Cancer Care. Surgical Oncology Clinics. 2018;27(4):641–652.
3. Jiang Y-Z, Liu Y, Xiao Y, et al. Molecular subtyping and genomic profiling expand precision medicine in refractory metastatic triple-negative breast cancer: the FUTURE trial. Cell Research. 2021;31(2):178–186.
4. Elemento O, Leslie C, Lundin J, Tourassi G. Artificial intelligence in cancer research, diagnosis and therapy. Nature reviews Cancer. 2021;21(12):747–752.
5. Tsai CJ, Riaz N, Gomez SL. Big Data in Cancer Research: Real-World Resources for Precision Oncology to Improve Cancer Care Delivery. Seminars in Radiation Oncology. 2019;29(4):306–310.
6. Booth CM, Karim S, Mackillop WJ. Real-world data: towards achieving the achievable in cancer care. Nat Rev Clin Oncol. 2019;16(5):312–325.
7. Vassar M, Holzmann M. The retrospective chart review: important methodological considerations. J Educ Eval Health Prof. 2013;10:12.
8. Alfaro-Munoz K, Hallatt G, Sookprasong J, et al. Building a data foundation: How MD Anderson and Palantir are partnering to accelerate research and improve patient care. Journal of Clinical Oncology. 2019;37(15_suppl):e18077–e18077.
9. Goldstein JB, Beird H, Zhang J, et al. Tackling “big data” for accelerating cancer research. Journal of Clinical Oncology. 2016;34(15_suppl):e23160–e23160.
10. Kothari AN, Trans AT, Caudle AS, et al. Universal preoperative SARS-CoV-2 testing can facilitate safe surgical treatment during local COVID-19 surges. Br J Surg. 2021;108(1):e24–e26.
11. Washington MK, Berlin J, Branton P, et al. Protocol for the examination of specimens from patients with primary carcinoma of the colon and rectum. Archives of pathology & laboratory medicine. 2009;133(10):1539–1551.
12. Bui DD, Zeng-Treitler Q. Learning regular expressions for clinical text classification. Journal of the American Medical Informatics Association : JAMIA. 2014;21(5):850–857.
13. Fabacher T, Godet J, Klein D, Velten M, Jegu J. Machine learning application for incident prostate adenocarcinomas automatic registration in a French regional cancer registry. International journal of medical informatics. 2020;139:104139.
14. Murtaugh MA, Gibson BS, Redd D, Zeng-Treitler Q. Regular expression-based learning to extract bodyweight values from clinical notes. Journal of biomedical informatics. 2015;54:186–190.
15. Renshaw AA, Mena-Allauca M, Gould EW, Sirintrapun SJ. Synoptic Reporting: Evidence-Based Review and Future Directions. JCO Clinical Cancer Informatics. 2018(2):1–9.
16. Sluijter CE, van Lonkhuijzen LR, van Slooten HJ, Nagtegaal ID, Overbeek LI. The effects of implementing synoptic pathology reporting in cancer diagnosis: a systematic review. Virchows Arch. 2016;468(6):639–649.
17. Srigley JR, McGowan T, MacLean A, et al. Standardized synoptic cancer pathology reporting: A population-based approach. Journal of surgical oncology. 2009;99(8):517–524.
18. The American College of Surgeons Cancer Surgery Standards Program. Operative Standards Toolkit. 2023; https://www.facs.org/quality-programs/cancer-programs/cancer-surgery-standards-program/cssp-operative-standards-toolkit/. Accessed 9/12/2023.
19. Moore CR, Jain S, Haas S, et al. Ascertaining Framingham heart failure phenotype from inpatient electronic health record data using natural language processing: a multicentre Atherosclerosis Risk in Communities (ARIC) validation study. BMJ open. 2021;11(6):e047356.
20. Wu J, Roy J, Stewart WF. Prediction Modeling Using EHR Data: Challenges, Strategies, and a Comparison of Machine Learning Approaches. 2010;48(6):S106–S113.
21. Chase HS, Mitrani LR, Lu GG, Fulgieri DJ. Early recognition of multiple sclerosis using natural language processing of the electronic health record. BMC medical informatics and decision making. 2017;17(1):24.
22. Tang S, Davarmanesh P, Song Y, Koutra D, Sjoding MW, Wiens J. Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data. Journal of the American Medical Informatics Association. 2020;27(12):1921–1934.
23. Wang S, McDermott MBA, Chauhan G, Ghassemi M, Hughes MC, Naumann T. MIMIC-Extract: a data extraction, preprocessing, and representation pipeline for MIMIC-III. Proceedings of the ACM Conference on Health, Inference, and Learning; 2020; Toronto, Ontario, Canada.
24. Hernandez-Boussard T, Kourdis PD, Seto T, et al. Mining Electronic Health Records to Extract Patient-Centered Outcomes Following Prostate Cancer Treatment. AMIA Annual Symposium proceedings AMIA Symposium. 2017;2017:876–882.


