Modeling Cancer Clinical Trials Using HL7 FHIR to Support Downstream Applications: A Case Study with Colorectal Cancer Data

Nansu Zong; Daniel J Stone; Deepak K Sharma; Andrew Wen; Chen Wang; Yue Yu; Ming Huang; Sijia Liu; Hongfang Liu; Qian Shi; Guoqian Jiang

doi:10.1016/j.ijmedinf.2020.104308

Author manuscript; available in PMC: 2022 Jan 1.

Published in final edited form as: Int J Med Inform. 2020 Oct 22;145:104308. doi: 10.1016/j.ijmedinf.2020.104308

PMCID: PMC7736510 NIHMSID: NIHMS1646103 PMID: 33160272

Abstract

Background and Objective:

Identification and standardization of data elements used in clinical trials may control and reduce the cost and errors during the operational process, and enable seamless data exchange between electronic data capture (EDC) systems and Electronic Health Record (EHR) systems. This study presents a methodology to comprehensively capture the clinical trial data element needs.

Materials and Methods:

Case report forms (CRFs) for clinical trial data collection were used to approximate the clinical information needs, which were then mapped to semantically equivalent fields within an existing FHIR cancer profile. For items without a semantically equivalent field, we considered these items to be information needs that cannot be represented in current standards and proposed extensions to support these needs.

Results:

We successfully identified 62 discrete items from a preliminary survey of 43 base questions in four CRFs used in colorectal cancer clinical trials, in which 28 items are modeled with FHIR extensions and their associated responses for colorectal cancer. We achieved promising results in the data population of the CRFs with average Precision 98.5%, Recall 96.2%, and F-measure 96.8% for all base questions. We also demonstrated the auto-filled answers in CRFs can be used to discover patient subgroups using a topic modeling approach.

Conclusion:

CRFs can be considered as a proxy for representing information needs for their respective cancer types. Mining the information needs can serve as a valuable resource for expanding existing standards to ensure they can comprehensively represent relevant clinical data without loss of granularity.

Keywords: HL7 FHIR, Colorectal Cancer, Clinical Trials, Case Report Forms, Patient Subgrouping

1. Introduction

Clinical trials are experiments and observations designed to study the response of human participants to biomedical or behavioral interventions. Patient data is needed throughout the various stages of a clinical trial (e.g., planning, conduct, and evaluation). An enormous volume of records from Electronic Health Record (EHR) systems needs to be reviewed and processed to capture the data using Electronic Data Capture (EDC) systems. This tedious, inaccurate, and costly process [1, 2] raises the need for an efficient way to improve data exchange and communication between EDC and EHR systems [2, 3]. Identification and standardization of the important data elements is one of the best approaches to control and reduce the cost and errors created during the operational process, as those elements can simplify the data exchange between the systems [1]. Researchers are motivated to identify common data elements in clinical trials through a diverse set of therapeutic cases [1, 4–6]. For example, the National Cancer Institute (NCI) has developed common data elements originating from the reports of Phase 3 cancer trials to standardize data representation and facilitate data interchange between research organizations [6, 7]. These pre-specified data elements are often employed in multiple data models with standardized frameworks. Such efforts include the cancer profiles developed by the Clinical Data Interchange Standards Consortium (CDISC) [8], the Clinical Information Modeling Initiative (CIMI) community [9], and the Royal College of Pathologists of Australasia (RCPA) / HL7 Australia [10].

Case report forms (CRFs) are questionnaires that collect the information regulated by the study protocols, and they approximate the information needs of scientists seeking to answer hypothetical questions [11, 12]. By extension, we hypothesize that the data elements necessary to comprehensively represent any given cancer can potentially be captured by mining the CRFs of associated clinical trials. This knowledge can then be used to construct a more comprehensive and standardized cancer data model to represent the data elements that are needed for the trials, which naturally fills the gap between the data consumers (e.g., EDC systems) and providers (e.g., EHR systems) and improves data exchange and communication in support of diverse downstream applications.

The objective of this study is thus to extract common data elements from CRFs and to build a data model that inherits from and reuses existing resources (e.g., data models and CRFs) to better address the needs of EDC systems for cancer trials. In a previous study [13], we demonstrated that adopting an existing FHIR-based cancer profile for modeling pathological reports, the Australian Colorectal Cancer Profile (ACP) [10] built on the Fast Healthcare Interoperability Resources (FHIR), can enable the automation of CRF data population. However, ACP is designed to model pathological reports (e.g., synoptic reports), and this previous study only covered a limited number of questions for the demonstration. Therefore, a data model that can handle diverse data sources (e.g., patient, diagnosis, medication, surgical, order, lab test, and pathological reports) and provide sufficient coverage to address the comprehensive questions from multiple CRFs should be studied. In this study, we focused on the following two aspects as new contributions: 1) we extended ACP to comprehensively cover more data elements of both structured (e.g., lab tests) and unstructured data (e.g., surgical reports) from EHR systems needed in CRFs, and 2) we explored more downstream applications based on the utilization of standardized data elements of the extended data model. As a proof of concept, we conducted a case study based on the CRFs for a real colorectal cancer trial in the Alliance for Clinical Trials in Oncology (Alliance) [14]. We built a colorectal cancer data model with an exploration of the autonomous capture of the data elements from 71 questions in four CRFs and populated the model with the medical records of 331 colon cancer patients from Mayo Clinic. We developed two downstream applications for the evaluation: 1) data population of CRFs, and 2) patient subgrouping. We achieved promising results with an average precision of 98.5%, recall of 96.2%, and F-measure of 96.8% for the 43 base questions generalized from CRFs. We demonstrated the subgrouping of the patients based on the auto-filled answers to support the efficient allocation of the cohorts for analysis.

2. Materials

We extracted the patient information from the Mayo Clinic data warehouse, known as the Unified Data Platform (UDP) [15]. Two types of data, structured and unstructured, were sourced. For structured data, information about patients, diagnoses, medications, surgeries, orders, and lab tests was collected. For unstructured data, surgical and pathological reports were obtained to collect the cancer-related data. In practice, we used a semi-structured form known as a synoptic report as our main data source for the pathological data elements, with the full-form original pathological reports as a supplement. A synoptic report is a templated pathology report that follows the College of American Pathologists (CAP) guidance [16] on the inclusion of data elements and the general definition of templated values [17]. The protocol for clinical data access was approved by the Mayo Clinic Institutional Review Board.

RCPA has adopted structured cancer reporting to build the standard data models [10], in which a colorectal cancer model known as ACP represents the data elements in pathological reports [18]. With a logical model, ACP defines a set of concepts and their values mainly with five core elements, “preAnalytic”, “macro”, “micro”, “ancillaryTests”, and “synthesisOverview”. The elements in the logical model are further represented with the FHIR resources, DiagnosticReport and Observation, to form a FHIR-based data model. ACP was adopted in our previous study [13]. In this study, we extended ACP with the data elements extracted from four CRFs of a real-world colorectal cancer trial, which is to study how the patients with resected stage III colon cancer are affected by the drugs oxaliplatin, fluorouracil, and leucovorin with or without cetuximab [14]. The CRFs contain the forms of registration/randomization eligibility checklist, adjuvant on-study, anatomic pathology review, and follow up, containing 71 questions in total.

3. Method

We captured important data elements based on the information needs in CRFs and constructed a FHIR-based data model that extends ACP to facilitate EDC for downstream applications. The framework consists of three main steps. First, both structured and unstructured data are extracted from the UDP. For the unstructured data (e.g., surgical and pathological reports), each data element and its values are harvested. For synoptic reports, a total of 25 data elements, such as primary tumor, are directly obtained. For structured data, data elements such as RESULT_VAL are obtained from the database schema. Second, we collected CRFs to capture the commonly used data elements from all the questions, and developed a data model that extends the ACP to organize these elements. The data elements are further analyzed to identify their sources as either structured or unstructured data in the EHR. Lastly, the data model is populated with the values obtained in the extract, transform, and load (ETL) process, based on manually created mappings, to form a FHIR-based data profile.

3.1. Consolidating a FHIR-based data model

To tackle colorectal cancer, we adapted the ACP colorectal cancer profile as a base model from which to construct our logical model for data representation. As Figure 2 shows, “PreAnalytic”, “Macro”, “Micro”, “AncillaryTests”, and “SynthesisOverview” are adapted (highlighted in the orange box in Figure 2). “PreAnalytic” represents the information collected prior to specimen receipt at the laboratory. “Macro”, “Micro”, and “AncillaryTests” capture macroscopic, microscopic, and ancillary test findings, respectively. “SynthesisOverview” is used to record synthesis information. To enrich “PreAnalytic”, “Macro”, and “Micro”, we added sub-elements that are captured as common data elements from CRFs (green box), such as “newPrimary” for new primary tumor information and “recurrence” for recurrence information of the tumors. In addition, for the 28 captured common data elements (green box) that cannot be modeled with the ACP colorectal cancer profile, we created new elements in the logical model. Specifically, “LaboratoryTest” is created for lab tests, “Medication” for treatment, and “Surgery” for surgical information.

Figure 2. — The proposed data model for colorectal cancer-related trials. The data elements inherited from the original ACP are in orange boxes, the new in green boxes, and adopted in red boxes.

To represent our logical model, we adopted FHIR resources to capture the concepts and value sets defined in the logical model. The mappings for the atomic data elements of the original ACP model are inherited directly (http://hl7.org.au/fhir/rcpa/cmap.html#summary). The newly developed elements are directly represented (highlighted in the red box in Figure 2) by the attributes defined in the Resources of FHIR Release 4 (R4) [19].
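As a rough illustration of how one of the newly added elements could be carried by a FHIR R4 resource, the sketch below builds an Observation for Colorectal.laboratoryTest.absoluteNeutrophilCount as a plain Python dictionary. The field names (`resourceType`, `status`, `code`, `subject`, `effectiveDateTime`, `valueQuantity`) are standard R4 Observation attributes; the code system URL, the local code, the patient identifier, and the function name are hypothetical and are not taken from the study.

```python
import json


def neutrophil_observation(patient_id, value, unit, collected):
    """Build a FHIR R4 Observation (as a dict) for an absolute neutrophil count."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {
            "coding": [{
                # Hypothetical local code system; a real profile would bind a standard code.
                "system": "http://example.org/fhir/CodeSystem/colorectal-lab",
                "code": "absoluteNeutrophilCount",
                "display": "Absolute neutrophil count",
            }]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": collected,                   # maps to the element's .date
        "valueQuantity": {"value": value, "unit": unit},  # maps to the element's .value
    }


obs = neutrophil_observation("example-001", 3.2, "10*9/L", "2015-06-01")
print(json.dumps(obs, indent=2))
```

Keeping the value and the collection date in a single resource mirrors the paired `.value` and `.date` elements of the logical model.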

3.2. Data population based on the proposed model

To populate the data for CRFs, a mapping is established between the data elements (i.e., schema) in the source datasets (i.e., database tables and synoptic reports) and the atomic data elements defined in the proposed data model. In total, we identified 62 mappings to link the cancer model elements with the schema across the eight sources. As Table 1 shows, the originally defined atomic data elements of ACP are mainly designed to represent pathological information and are thus mapped to the elements in synoptic reports; for example, Colorectal.micro.involvedMargins is mapped to “Surgical Margins”. The extended elements in the proposed model are designed to represent the information from medication orders and from surgical, radiological, and lab testing results. In practice, we implemented some simple logical rules to obtain the values that populate the data model with patient records. For example, Colorectal.laboratoryTest.absoluteNeutrophilCount.value is obtained from lab tests with either standard or locally used concept codes referring to neutrophil count tests.

Table 1.

Map source atomic data elements to the proposed cancer model.

Cancer Model Element	Source	Element
Colorectal.micro.colonoscopyAssessmentDate	Orders Table (UDP)	ORDER_DATE (ORDER_NAME=“colonoscopy”)

Colorectal.micro.extramuralTumourDeposits¹	Synoptic Report	Tumor Deposits
Colorectal.micro.extramuralVeinInvasion¹	Synoptic Report	Lymphovascular Invasion
Colorectal.micro.histoConfDistMetastases¹	Synoptic Report	Distant Metastasis
Colorectal.micro.histoConfDistMetastasesSite¹	Synoptic Report	Distant Metastasis
Colorectal.micro.histologicalGrade	Synoptic Report	Histologic Grade
Colorectal.micro.hostLymphoidResponse	Pathology Report	DIAGNOSIS
Colorectal.micro.intramuralVeinInvasion¹	Synoptic Report	Lymphovascular Invasion
Colorectal.micro.involvedMargins¹	Synoptic Report	Surgical Margins
Colorectal.micro.lymphNodeInvolvement¹	Synoptic Report	Lymphovascular Invasion
Colorectal.micro.lymphNodesDetails.numExamined	Synoptic Report	Number examined (total)
Colorectal.micro.lymphNodesDetails.numPos	Synoptic Report	Number involved (total)
Colorectal.micro.marginsMicroClearance¹	Synoptic Report	Surgical Margins
Colorectal.micro.maxDegreeLocalInvasion	Synoptic Report	Microscopic Tumor Extension
Colorectal.micro.neoadjuvantTherapy¹	Synoptic Report	Treatment Effect
Colorectal.micro.nonperitonealisedCircumMargin¹	Synoptic Report	Surgical Margins
Colorectal.micro.perineuralInvasion¹	Synoptic Report	Perineural Invasion
Colorectal.micro.polypDetails¹	Synoptic Report	Type of Polyp Tumor Arises From
Colorectal.micro.proximalOrDistalResectionMargins¹	Synoptic Report	Surgical Margins
Colorectal.micro.smallVesselInvasion¹	Synoptic Report	Lymphovascular Invasion
Colorectal.micro.tumourType	Synoptic Report	Histologic Type
Colorectal.micro.venousSmallVesselInvasion¹	Synoptic Report	Lymphovascular Invasion

Colorectal.macro.depositNumber	Synoptic Report	Tumor Deposits
Colorectal.macro.intactnessOfMesorectum¹	Synoptic Report	Macroscopic Intactness of Mesorectum
Colorectal.macro.invasion	Radiology Table (UDP)	RADIOLOGY_TEST_DESCRIPTION
	Synoptic Report	Microscopic Tumor Extension
Colorectal.macro.maxTumourDiameter	Synoptic Report	Tumor Size
Colorectal.macro.distNonperitonCircumMargin	Synoptic Report	Surgical Margins
Colorectal.macro.natureAndSiteOfBlocks	Pathology Report	BLOCK SUMMARY
Colorectal.macro.otherMacroComments¹	Synoptic Report	Specimen
Colorectal.macro.polyps	Diagnosis Table (UDP)	DIAGNOSIS_NAME=(“intestinal polyposis syndrome”, “gastrointestinal polyposis syndrome”)
Colorectal.macro.tumourPerforation	Synoptic Report	Macroscopic Tumor Perforation
Colorectal.macro.tumourSite	Synoptic Report	Tumor Site

Colorectal.preAnalytic.adherence	Synoptic Report	Microscopic Tumor Extension
Colorectal.preAnalytic.clinicalAssessmentDate	Diagnosis Table (UDP)	DIAGNOSIS_DATE
Colorectal.preAnalytic.newPrimary	Radiology Table (UDP)	RADIOLOGY_REPORT
Colorectal.preAnalytic.newPrimaryDate	Radiology Table (UDP)	RADIOLOGY_DATE
Colorectal.preAnalytic.recurrence	Synoptic Report	Microscopic Tumor Extension / Comment
Colorectal.preAnalytic.recurrenceDate	Synoptic Report	NOTE_DATE
Colorectal.preAnalytic.tumourLocation	Synoptic Report	Tumor Site
Colorectal.preAnalytic.typeOfOperation¹	Synoptic Report	Procedure
Colorectal.preAnalytic.clinicalObstruction	Synoptic Report	GROSS DESCRIPTION

Colorectal.synthesisOverview.tumourStageM¹	Synoptic Report	Distant Metastasis
Colorectal.synthesisOverview.tumourStageN	Synoptic Report	Regional lymph nodes
Colorectal.synthesisOverview.tumourStageT	Synoptic Report	Primary tumor
Colorectal.synthesisOverview.tumourStagingSystem¹	Synoptic Report	Pathologic Staging (AJCC, 7th edition)
Colorectal.synthesisOverview.overarchingComment	Synoptic Report	Comment

Colorectal.laboratoryTest.absoluteNeutrophilCount.date¹	Lab-test Table (UDP)	LAB_COLLECTION_DATE
Colorectal.laboratoryTest.absoluteNeutrophilCount.value	Lab-test Table (UDP)	RESULT_VAL(LAB_DESCRIPTION=“Neutrophils” | “Neutrophils Absolute” | “Absolute Neutrophil Count”)
Colorectal.laboratoryTest.bilirubin.date¹	Lab-test Table (UDP)	LAB_COLLECTION_DATE
Colorectal.laboratoryTest.bilirubin.value	Lab-test Table (UDP)	RESULT_VAL(LAB_DESCRIPTION=“Bilirubin” | “Bilirubin S”)
Colorectal.laboratoryTest.creatinine.date¹	Lab-test Table (UDP)	LAB_COLLECTION_DATE
Colorectal.laboratoryTest.creatinine.value	Lab-test Table (UDP)	RESULT_VAL(LAB_DESCRIPTION=“Creatinine” | “Creatinine S” | “Creatinine P” | “Creatinine U”)
Colorectal.laboratoryTest.Hgb.date¹	Lab-test Table (UDP)	LAB_COLLECTION_DATE
Colorectal.laboratoryTest.Hgb.value	Lab-test Table (UDP)	RESULT_VAL(LAB_DESCRIPTION=“Hgb” | “Hemoglobin”)
Colorectal.laboratoryTest.plateletCount.date¹	Lab-test Table (UDP)	LAB_COLLECTION_DATE
Colorectal.laboratoryTest.plateletCount.value	Lab-test Table (UDP)	RESULT_VAL(LAB_DESCRIPTION=“Platelet” | “Platelet Count” | “Platelet Estimate”)
Colorectal.laboratoryTest.serumPregnancy.date	Lab-test Table (UDP)	LAB_COLLECTION_DATE
Colorectal.laboratoryTest.serumPregnancy.value	Lab-test Table (UDP)	RESULT_VAL(LAB_DESCRIPTION=“HCG” | “Pregnancy Test”)

Colorectal.surgery.resectionExtent	Surgical Procedures Table (UDP)	SURGICAL_PROCEDURE_DESCRIPTION=“Biopsy” | “Polypectomy” | “Excision” | “Colectomy” | “Resection”
Colorectal.surgery.type	Surgical Procedures Table (UDP)	SURGICAL_PROCEDURE_DESCRIPTION=“Laparoscopy” | “Open Approach”
Colorectal.surgery.date	Surgical Procedures Table (UDP)	SURGICAL_PROCEDURE_DATE

Colorectal.subject.vitalStatus	Patient Table (UDP)	PATIENT_DECEASED_FLAG

Colorectal.medication.treatment.code	Orders Table (UDP)	ORDER_DESCRIPTION=“Leucovorin” | “Fluorouracil” | “Oxaliplatin” | “Cetuximab”
Colorectal.medication.treatment.unit¹	Orders Table (UDP)	ORDER_DOSE_UNITS
Colorectal.medication.treatment.value¹	Orders Table (UDP)	ORDER_DOSE_AMOUNT

¹ These mappings cannot be validated by the 43 base questions in Table 3.
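The lab-test rules in Table 1 can be sketched as a lookup from LAB_DESCRIPTION values to model elements. The snippet below is a minimal illustration covering a subset of the rows: the description strings are copied from Table 1, while the dictionary-based row layout, the function name, and the choice to keep the most recent result per element are assumptions made for this example rather than details reported in the study.

```python
# LAB_DESCRIPTION values are copied from Table 1; everything else is illustrative.
LAB_RULES = {
    "absoluteNeutrophilCount": {"Neutrophils", "Neutrophils Absolute",
                                "Absolute Neutrophil Count"},
    "bilirubin": {"Bilirubin", "Bilirubin S"},
    "plateletCount": {"Platelet", "Platelet Count", "Platelet Estimate"},
    "serumPregnancy": {"HCG", "Pregnancy Test"},
}


def populate_lab_elements(rows):
    """Map lab-test rows onto Colorectal.laboratoryTest.* elements.

    Assumption: when a test was repeated, the most recent result wins.
    Dates are ISO strings, so sorting them as text is chronological.
    """
    model = {}
    for row in sorted(rows, key=lambda r: r["LAB_COLLECTION_DATE"]):
        for element, descriptions in LAB_RULES.items():
            if row["LAB_DESCRIPTION"] in descriptions:
                prefix = f"Colorectal.laboratoryTest.{element}"
                model[f"{prefix}.value"] = row["RESULT_VAL"]
                model[f"{prefix}.date"] = row["LAB_COLLECTION_DATE"]
    return model


rows = [
    {"LAB_DESCRIPTION": "Platelet Count", "RESULT_VAL": 210, "LAB_COLLECTION_DATE": "2015-03-01"},
    {"LAB_DESCRIPTION": "Platelet Count", "RESULT_VAL": 180, "LAB_COLLECTION_DATE": "2015-04-01"},
    {"LAB_DESCRIPTION": "Sodium", "RESULT_VAL": 140, "LAB_COLLECTION_DATE": "2015-04-01"},
]
model = populate_lab_elements(rows)
print(model)
```

Rows whose description matches no rule (here, the sodium test) are simply ignored, so the populated model only ever contains elements defined in the logical model.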

3.3. EDC-based downstream applications for colorectal cancer

We implemented the proposed framework on the clinical records of 331 Mayo Clinic patients from 2013 to 2019, identified by a search using colorectal cancer-related ICD-9 codes and filtered in compliance with the research authorization policies of Mayo Clinic. Based on this list of patients, we collected 1226 synoptic reports. Two downstream applications were developed to evaluate the model: 1) data population of CRFs, and 2) patient subgrouping based on the populated CRFs.

3.3.1. Application (1) – data population of CRFs

We summarized the 57 questions in all the CRFs [14] to remove redundancy resulting in 43 base questions. For example, “Primary Site(s)” in the form “adjuvant on-study” and “Primary Site(s)” in the form “anatomic pathology review” are generalized into the question Q(24) “Primary Site(s)”. We mapped the questions to the atomic data elements of the proposed data model to enable the population of the questions. For example, to answer the question Q(16) “Extent of resection”, we mapped Colorectal.surgery.resectionExtent. Detailed mapping is available in Table 2. Please note, in practice, we generated the raw questions of the application to reduce the bias that may be caused by the different annotating experiences of the domain experts in the evaluation.

Table 2.

Map base questions of CRFs to the data elements in the proposed cancer model.

ID	Question	Value	Element
1	Patient’s vital status	Yes / No (Patient is alive)	Colorectal.subject.vitalStatus
2	Hemoglobin (Hgb)	Yes / No (Hemoglobin >= 9 g/dL)	Colorectal.laboratoryTest.hgb.value
3	Absolute neutrophil count	Yes / No (Absolute neutrophil count >= LNL)	Colorectal.laboratoryTest.absoluteNeutrophilCount.value
4	Absolute neutrophil count LNL	Quantitative	Colorectal.laboratoryTest.absoluteNeutrophilCount.LNL
5	Creatinine	Yes / No (Creatinine <= 1.5 × UNL)	Colorectal.laboratoryTest.creatinine.value
6	Creatinine UNL	Quantitative	Colorectal.laboratoryTest.creatinine.UNL
7	Platelet count	Yes / No (Platelet count >= 100,000/uL)	Colorectal.laboratoryTest.plateletCount.value
8	Total bilirubin	Yes / No (Total bilirubin <= 1.5 × UNL)	Colorectal.laboratoryTest.bilirubin.value
9	Total bilirubin UNL	Quantitative	Colorectal.laboratoryTest.bilirubin.UNL
10	Negative serum pregnancy test	Positive / Negative	Colorectal.laboratoryTest.serumPregnancy.value
11	Negative serum pregnancy test date	DateTime	Colorectal.laboratoryTest.serumPregnancy.date
12	Assigned treatment (medication)	Oxaliplatin / Fluorouracil / Leucovorin / Cetuximab	Colorectal.medication.treatment.code
13	Associated diseases	Yes / No (is polyposis syndrome)	Colorectal.macro.polyps
14	Clinical assessment date	DateTime	Colorectal.preAnalytic.clinicalAssessmentDate
15	Colonoscopy Date	DateTime	Colorectal.micro.colonoscopyAssessmentDate
16	Extent of resection	Biopsy / Polypectomy / Bowel resection / Local excision / Indeterminate	Colorectal.surgery.resectionExtent
17	Type of procedure	Open approach / Laparoscopic	Colorectal.surgery.type
18	Surgery date	DateTime	Colorectal.surgery.date
19	Site of pathologically Confirmed invasion	Bladder/ Prostate/ Vagina/ Liver/ Seminal vesicles/ Pelvic (other than above)/ Ovary/ Ureter/ Peritoneum/ Uterus	Colorectal.macro.invasion
20	New primary cancer or MDS (myelodysplastic syndrome)	Yes / No	Colorectal.preAnalytic.newPrimary
21	Date of diagnosis for new primary cancer	DateTime	Colorectal.preAnalytic.newPrimaryDate
22	First progression (or recurrence)	Yes / No	Colorectal.preAnalytic.recurrence
23	Date of first recurrence or progression	DateTime	Colorectal.preAnalytic.recurrenceDate
24	Primary site(s)	Cecum / Transverse colon / Sigmoid colon / Ascending colon / Splenic flexure / Hepatic flexure / Descending colon	Colorectal.preAnalytic.tumourLocation/Colorectal.macro.tumourSite
25	Tumor size	Narrative	Colorectal.macro.maxTumourDiameter
26	Bowel perforation	Present / Absent	Colorectal.macro.tumourPerforation
27	Histologic type	Signet ring cell adenocarcinoma / Signet ring cell carcinoma / High grade neuroendocrine carcinoma / Mucinous adenocarcinoma / No residual carcinoma / Adenocarcinoma / Medullary carcinoma / Squamous cell carcinoma	Colorectal.micro.tumourType
28	Histology	High (poorly differentiated or undifferentiated) / Low (well or moderately differentiated)	Colorectal.micro.histologicalGrade
29	Comments	Narrative	Colorectal.synthesisOverview.overarchingComment
30	Adherence	Yes / No	Colorectal.preAnalytic.adherence
31	Number of deposits	Quantitative	Colorectal.macro.depositNumber
32	Disease extent	Tumor invades submucosa (PT1) / Tumor invades muscularis propria (PT2) / Tumor invades through the muscularis propria into the subserosa, or into nonperitonealized pericolic or perirectal tissue (PT3) / The tumor has grown into the surface of the visceral peritoneum, which means it has grown through all layers of the colon (PT4a) / The tumor has grown into or has attached to other organs or structures (PT4b) / Primary tumor cannot be assessed (TX)	Colorectal.micro.maxDegreeLocalInvasion / Colorectal.synthesisOverview.tumourStageT
33	Regional lymph node involvement	No regional lymph node metastases (PN0) / Metastases in 1 to 3 regional lymph nodes (PN1) / Metastases in 4 or more regional lymph nodes (PN2) / Regional lymph nodes cannot be assessed (PNX)	Colorectal.synthesisOverview.tumourStageN
34	Number of lymph nodes examined	Quantitative	Colorectal.micro.lymphNodesDetails.numExamined
35	Positive lymph nodes	Absent / Present	Colorectal.micro.lymphNodesDetails.numPos
36	Distance to closest longitudinal margin	Narrative	Colorectal.macro.distNonperitonCircumMargin
37	Bowel obstruction	Absent / Present	Colorectal.preAnalytic.clinicalObstruction
38	Blocks	Narrative	Colorectal.macro.natureAndSiteOfBlocks
39	Stools	Yes / No, patient has a colostomy/ileostomy	Colorectal.preAnalytic.stool
40	Multiple primary malignant tumors?	Yes / No	Colorectal.macro.maligantTumorNumber
41	Deposits type	Discrete/ Irregular/ Both discrete and irregular	Colorectal.macro.depositType
42	Residual adjacent adenoma?	Yes / No	Colorectal.macro.residualAdjacentAdenoma
43	Host lymphoid response	(Check all that apply) Crohn’s like (2 or more lymphoid aggregates per slide, often associated with germinal centers adjacent to tumor) / Peritumoral, mild (distinct rim or cap of lymphocytes at tumor-parenchyma interface) / Intratumoral, marked (>4 tumor infiltrating lymphocytes/HPF)	Colorectal.micro.hostLymphoidResponse
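Several of the Yes / No questions in Table 2 are threshold checks on lab values. The following sketch shows how such answers might be derived for Q2, Q3, Q5, Q7, and Q8; the thresholds come from Table 2, whereas the input dictionary keys, the helper names, and the assumption that all values already share the units of their limits are hypothetical.

```python
def at_least(value, limit):
    """Answer 'Yes' when value >= limit."""
    return "Yes" if value >= limit else "No"


def at_most(value, limit):
    """Answer 'Yes' when value <= limit."""
    return "Yes" if value <= limit else "No"


def eligibility_answers(labs):
    """Derive Yes/No answers from lab values and their reference limits.

    LNL / UNL are the lower / upper normal limits (Q4, Q6, Q9 in Table 2).
    """
    return {
        "Q2": at_least(labs["hgb"], 9.0),                                 # Hgb >= 9 g/dL
        "Q3": at_least(labs["anc"], labs["anc_lnl"]),                     # ANC >= LNL
        "Q5": at_most(labs["creatinine"], 1.5 * labs["creatinine_unl"]),  # <= 1.5 x UNL
        "Q7": at_least(labs["platelets"], 100_000),                       # >= 100,000/uL
        "Q8": at_most(labs["bilirubin"], 1.5 * labs["bilirubin_unl"]),    # <= 1.5 x UNL
    }


answers = eligibility_answers({
    "hgb": 10.2, "anc": 1.4, "anc_lnl": 1.5,
    "creatinine": 1.0, "creatinine_unl": 1.2,
    "platelets": 250_000,
    "bilirubin": 2.0, "bilirubin_unl": 1.2,
})
print(answers)
```

In this made-up record, the neutrophil count falls below its lower limit and the bilirubin exceeds 1.5 times its upper limit, so Q3 and Q8 are answered "No" while the other three are "Yes".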

To evaluate the quality of the generated answers, we randomly split the patients into seven groups and asked seven subject matter experts in medical informatics (N.Z., Y.Y., M.M., A.W., D.S., S.L., and D.S.) to answer the base questions based on the patient records. The seven experts were approved for data access by the Mayo Clinic Institutional Review Board. Ten randomly selected patient records were annotated by all reviewers, and the inter-rater reliability kappa scores were calculated. For each question, standard answers were generated from the reliable annotators (average kappa score of 0.96); annotations from experts with low inter-rater kappa scores were filtered out if they were determined to be mostly inaccurate upon review. Please note that the annotations for the randomly selected patients were merged into the evaluation following the same filtering rule. The results were evaluated with three metrics: Precision, Recall, and F-measure.
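The evaluation metrics can be sketched as follows. This is a minimal illustration, assuming that precision and recall are computed per question from counts of correct, generated, and expected answers, and that agreement between two annotators is measured with Cohen's kappa; the text does not state which kappa variant was used, and the function names and example counts are hypothetical.

```python
from collections import Counter


def precision_recall_f(n_correct, n_generated, n_expected):
    """Precision, recall, and F-measure (harmonic mean) from answer counts."""
    precision = n_correct / n_generated if n_generated else 0.0
    recall = n_correct / n_expected if n_expected else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two annotators' marginal label rates.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical counts: every generated answer is right, but two of three are missed.
p, r, f = precision_recall_f(n_correct=1, n_generated=1, n_expected=3)
kappa = cohen_kappa(["Yes", "Yes", "No", "No"], ["Yes", "No", "No", "No"])
print(round(p, 3), round(r, 3), round(f, 3), round(kappa, 3))
```

The example counts show how perfect precision combined with low recall pulls the F-measure down, the pattern a question with few positive cases can produce.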

3.3.2. Application (2) – discovery of patient subgroups

With the proposed FHIR-based data model, we can extract the data elements and values for each patient to generate the subgroups with the patients sharing the same clinical features. In practice, we standardized and utilized the raw answers with the categorical values of the base questions generated in Application (1) as features to explore the patient cohorts based on the patient subgrouping.

We adopted the one-topic-per-document Dirichlet Multinomial Mixture (DMM) model [20] to cluster the patients. DMM is a topic model designed for short texts, which assumes that each document belongs to exactly one topic. Specifically, we treated each patient $p$ as a document and each categorical answer to a base question as a word $a_{p_i}$. The DMM model then draws a topic $z_p$ for each patient as $z_p \sim \mathrm{Multinomial}(\theta)$, where $\theta \sim \mathrm{Dirichlet}(\alpha)$, and draws each categorical answer as $a_{p_i} \sim \mathrm{Multinomial}(\phi^{z_p})$, where $\phi^{z_p} \sim \mathrm{Dirichlet}(\beta)$. In practice, this task was carried out with the jLDADMM library [21].
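To make the one-topic-per-patient assumption concrete, the sketch below implements a small collapsed Gibbs sampler for a DMM in Python. It is an illustration only and not the jLDADMM implementation used in the study: the function name, the hyperparameter defaults, the number of iterations, and the "question=answer" token encoding are all assumptions.

```python
import random
from collections import Counter


def gibbs_dmm(docs, n_topics, alpha=0.1, beta=0.1, n_iters=30, seed=0):
    """Assign exactly one topic to each document (patient) by collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab_size = len({tok for doc in docs for tok in doc})
    doc_count = [0] * n_topics                        # documents per topic
    tok_total = [0] * n_topics                        # tokens per topic
    tok_count = [Counter() for _ in range(n_topics)]  # per-topic token counts
    assign = []

    def move(doc, k, sign):
        """Add (sign=+1) or remove (sign=-1) a document from topic k."""
        doc_count[k] += sign
        tok_total[k] += sign * len(doc)
        for tok in doc:
            tok_count[k][tok] += sign

    for doc in docs:                                  # random initialisation
        k = rng.randrange(n_topics)
        assign.append(k)
        move(doc, k, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            move(doc, assign[d], -1)                  # exclude the current document
            weights = []
            for k in range(n_topics):
                w = doc_count[k] + alpha              # topic popularity term
                seen = Counter()
                for i, tok in enumerate(doc):         # token likelihood term
                    w *= (tok_count[k][tok] + beta + seen[tok]) / (
                        tok_total[k] + vocab_size * beta + i)
                    seen[tok] += 1
                weights.append(w)
            assign[d] = rng.choices(range(n_topics), weights=weights)[0]
            move(doc, assign[d], +1)
    return assign


# Hypothetical patients, each encoded as a list of "question=answer" tokens.
patients = [
    ["Q35=Absent", "Q33=PN0", "Q16=Local excision"],
    ["Q35=Absent", "Q33=PN0", "Q17=Laparoscopy"],
    ["Q35=Present", "Q33=PN1", "Q12=Oxaliplatin"],
    ["Q35=Present", "Q33=PN1", "Q12=Fluorouracil"],
]
groups = gibbs_dmm(patients, n_topics=4)
print(groups)
```

Because every patient receives a single topic label, the output can be read directly as a subgroup assignment. The plain product of probabilities is adequate for short token lists like these; longer documents would call for log-space arithmetic.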

4. Results

4.1. Population of CRFs

We evaluated the data model for the downstream application of generating the response for the base questions. An average F-measure of 0.968 was obtained as shown in Table 3. The data elements of the proposed model are collected from the following two parts, 1) the elements that existed in ACP and 2) elements that are required and inferred from the questions of CRFs. We found that 27 mappings cannot be validated by the base questions (refer to Table 1). We also failed to generate answers or consistently annotate Questions (39–42) due to a lack of sufficient data to identify and extract the corresponding elements from the target data sources.

Table 3.

Precision, Recall, and F-measure of the automatically generated answers for the base questions.

ID	Question	P	R	F	ID	Question	P	R	F
1	Patient’s vital status	1.000	1.000	1.000	23	Date of first recurrence or progression	1.000	1.000	1.000
2	Hemoglobin (Hgb)	1.000	1.000	1.000	24	Primary site(s)²	1.000	1.000	1.000
3	Absolute neutrophil count	1.000	1.000	1.000	25	Tumor size	1.000	1.000	1.000
4	Absolute neutrophil count LNL	1.000	1.000	1.000	26	Bowel perforation²	1.000	1.000	1.000
5	Creatinine	1.000	1.000	1.000	27	Histologic type	1.000	1.000	1.000
6	Creatinine UNL	1.000	1.000	1.000	28	Histology	1.000	1.000	1.000
7	Platelet count	1.000	1.000	1.000	29	Comments²	1.000	0.955	0.977
8	Total bilirubin	1.000	1.000	1.000	30	Adherence	0.909	0.769	0.833
9	Total bilirubin UNL	1.000	1.000	1.000	31	Number of deposits	1.000	0.997	0.998
10	Negative serum pregnancy test	1.000	1.000	1.000	32	Disease extent²	1.000	0.994	0.997
11	Negative serum pregnancy test date	1.000	1.000	1.000	33	Regional lymph node involvement	1.000	0.995	0.997
12	Assigned treatment (medication)	1.000	1.000	1.000	34	Number of lymph nodes examined²	1.000	0.997	0.999
13	Associated diseases	1.000	0.952	0.976	35	Positive lymph nodes²	1.000	0.989	0.994
14	Clinical assessment date	1.000	1.000	1.000	36	Distance to closest longitudinal margin	1.000	0.697	0.822
15	Colonoscopy Date	0.997	1.000	0.998	37	Bowel obstruction²	1.000	1.000	1.000
16	Extent of resection	0.890	0.973	0.930	38	Blocks	1.000	1.000	1.000
17	Type of procedure²	0.997	0.985	0.991	39	Stools	-	-	-
18	Surgery date²	0.997	1.000	0.998	40	Multiple primary malignant tumors?	-	-	-
19	Site of pathologically Confirmed invasion	1.000	1.000	1.000	41	Deposits type	-	-	-
20	New primary cancer or MDS (myelodysplastic syndrome)	0.833	1.000	0.909	42	Residual adjacent adenoma?	-	-	-
21	Date of diagnosis for new primary cancer	0.833	1.000	0.909	43	Host lymphoid response	0.946	0.889	0.917

22	First progression (or recurrence)	1.000	0.333	0.500	Overall Average		0.985	0.962	0.968

² Please note that these questions were re-evaluated with the newly randomly selected patients in this study, and thus the results are slightly different from our previous work [13].

On structured portions of the evaluation, there were no discrepancies found that would indicate any issues with the underlying data such as mis-entered values of the wrong scale, and the generated answers yielded perfect results. For unstructured portions of the record, excerpts pertaining to the questions might be stated multiple times in the record with slight variances in wording (e.g. benign vs negative) or with a slightly different interpretation of the underlying facts -- a count of 15 negative lymph nodes might be noted as a count of 18 in a later section of the same synoptic report. Annotators often had stylistic or personal differences that might point to an identical locality of the record but with varying start or end windows for the selection of annotation span, which causes the majority of the discrepancies in results.

4.2. Discovery of patient subgroups

We list the distribution of the categorical answers for the base questions in Table 4. Most questions with binary answers (e.g., Yes vs. No) have highly imbalanced distributions (e.g., Q1, Q2, Q3, Q5, Q7, Q8, Q13, Q17, Q22, Q30, Q37). For the questions with more than two possible answers, the answers are much more balanced, except for Q16 and Q27.
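One simple way to quantify this imbalance is the share of the most frequent answer per question. The sketch below applies it to three questions using counts copied from Table 4; the `majority_share` helper is illustrative and is not a measure reported in the study.

```python
def majority_share(counts):
    """Fraction of answers that fall in the most frequent category."""
    return max(counts.values()) / sum(counts.values())


# Answer counts copied from Table 4.
table4 = {
    "Q1": {"Yes": 284, "No": 47},
    "Q22": {"Yes": 2, "No": 329},
    "Q24": {"Transverse colon": 27, "Hepatic flexure": 9, "Cecum": 52,
            "Descending colon": 12, "Sigmoid colon": 11,
            "Ascending colon": 62, "Splenic flexure": 8},
}

shares = {q: round(majority_share(c), 3) for q, c in table4.items()}
print(shares)
```

A share close to 1 (as for Q22) means the question barely discriminates between patients, whereas the multi-answer Q24 spreads patients across several categories.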

Table 4.

Distribution of the categorical answers for the base questions.

Question	Value	Frequency	Question	Value	Frequency
Q1	Yes	284	Q19	Peritoneum	9
	No	47		Ovary	1
				Vagina	2
				Bladder	3
				Pelvic	1
				Prostate	2

Q2	Yes	307	Q22	Yes	2
	No	23		No	329

Q3	Yes	38	Q24	Transverse colon	27
	No	267		Hepatic flexure	9
				Cecum	52
				Descending colon	12
				Sigmoid colon	11
				Ascending colon	62
				Splenic flexure	8

Q5	Yes	319	Q26	Present	15
	No	11		Absent	5

Q7	Yes	319	Q27	Signet ring cell adenocarcinoma	4
	No	11		Signet ring cell carcinoma	2
				High grade neuroendocrine carcinoma	1
				Mucinous adenocarcinoma	19
				No residual carcinoma	3
				Adenocarcinoma	306
				Medullary carcinoma	3
				Squamous cell carcinoma	1

Q8	Yes	275	Q28	Low	252
	No	6

Q10	Positive	1	Q30	Yes	11
	Negative	3		No	320

Q12	Oxaliplatin	74	Q32	PT4a	24
	Fluorouracil	77		PT4b	21
	Leucovorin	66		PT3	157
	Cetuximab	3		PT1	43
				PT0	86

Q13	Yes	17	Q33	PN1	86
	No	314		PN0	196
				PN2	40
				PNx	9

Q16	Bowel resection	208	Q35	Absent	214
	Local excision	84		Present	117
	Biopsy	1

Q17	Laparoscopy	52	Q37	Absent	328
	Open approach	237		Present	3


We tested different numbers of clusters and obtained the best result with four. As Figure 3 (a) shows, Groups zero (green) and three (red) are well distinguished from the other two groups. Although Groups one (blue) and two (orange) are entangled, Group two lies closer to the center and Group one toward the side. We observed that Groups one and three mainly represent the patients receiving oxaliplatin, fluorouracil, leucovorin, and cetuximab (i.e., solid nodes). Groups zero and two comprise the remaining patients, separated by the answers to Q35, Q33, Q32, Q16, Q24, and Q17.
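
To make the grouping step more concrete, the sketch below shows one plausible way to turn a patient's auto-filled answers into a token "document" for a topic model and to read a group off a document-topic distribution. The token format, the function names, and the rule that a patient belongs to the most probable topic are all assumptions made for illustration; the probabilities are invented and are not results from the study.

```python
from typing import Dict, List, Union

Answer = Union[str, List[str]]


def answers_to_document(answers: Dict[str, Answer]) -> str:
    """Turn one patient's categorical answers into a space-separated token string."""
    tokens = []
    for question in sorted(answers):
        value = answers[question]
        values = value if isinstance(value, list) else [value]
        for v in values:
            # One token per (question, answer) pair, e.g. "Q33_PN0".
            tokens.append(f"{question}_{v.replace(' ', '_')}")
    return " ".join(tokens)


def dominant_topic(topic_probs: List[float]) -> int:
    """Index of the most probable topic, used here as the patient's group."""
    return max(range(len(topic_probs)), key=lambda k: topic_probs[k])


# Hypothetical patient; the answer values are drawn from Table 4.
patient = {
    "Q35": "Absent",
    "Q33": "PN0",
    "Q16": "Local excision",
    "Q12": ["Oxaliplatin", "Fluorouracil"],
}
print(answers_to_document(patient))

# Hypothetical document-topic distribution over four topics.
print(dominant_topic([0.10, 0.15, 0.70, 0.05]))
```

Prefixing each answer with its question number keeps identical values from different questions (such as "Absent" for Q35 and Q37) as distinct tokens.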

We summarize the subgroups as follows (see Figure 3 (b)):

Group zero: “Absent” for positive lymph nodes (Q35), “PN0” for regional lymph node involvement (Q33), “PT0” and “PT1” for disease extent (Q32), “Local excision” for extent of resection (Q16), “Ascending colon” for primary site(s) (Q24), and “Laparoscopy” for type of procedure (Q17).

Group one: taking oxaliplatin, fluorouracil, leucovorin, and cetuximab (Q12), and with “PN1” for regional lymph node involvement (Q33).

Group two: with “PN2” and “PN1” for regional lymph node involvement (Q33), “Cecum” for primary site(s) (Q24), “PT4a” for disease extent (Q32), and “Present” for positive lymph nodes (Q35).

Group three: taking oxaliplatin, fluorouracil, leucovorin, and cetuximab (Q12), and with “No” for vital status (Q1).

5. Discussion and conclusion

In this study, we designed and developed a framework for capturing common data elements from CRFs in order to identify clinical information needs and to extend a FHIR-based data model as necessary to meet those needs. We developed the corresponding ETL process to generate the FHIR-based representation from the source data, and the proposed model can be extended to handle new data elements and data sources. We also developed two downstream applications that can be adapted for reuse (see the readme file on the project GitHub page). The methodology and data model provided in this study help ensure that the data elements needed by clinical trial-related applications can be adopted in a standards-based way.

Despite the value of the work demonstrated in this article, several issues need further discussion. First, from all the forms in the trial [14], our domain experts (Q.S. and G.J.) selected the four CRFs based on three criteria, 1) importance/representativeness, 2) coverage, and 3) feasibility, across the phases of patient registration, trial conduct, and evaluation. Because our selection relied mostly on the domain experts' understanding of and experience with these criteria, it may not be robust or free of bias when other organizations conduct a similar study. Second, we have not investigated the data management performance (e.g., storage and search) for data represented by the proposed data model, which serves as a data reservoir for clinical trials. Third, our validation covers only the base questions, which leaves a gap for the real questions. The answer to a base question in our study serves as the foundation for addressing questions with more complicated specifications. For example, the question “has the patient had a documented clinical assessment for this cancer since the submission of the previous follow-up form?” is answered based on the dates of clinical assessment. We agree that the design of such rules is critical to the success of populating real CRFs. Lastly, the proposed methodology and model are based on element extraction from CRFs and FHIR resources, which is generically adaptable for colorectal cancer trials. However, because CRFs may vary across cancer types and existing cancer models available for adaptation are lacking, generalizing the approach to other cancers could be more challenging.
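
The follow-up question quoted above illustrates how a real CRF question could be derived from base answers. The sketch below is a minimal example of such a rule, assuming the base answer is a list of clinical assessment dates and that the submission date of the previous follow-up form is known; the function name, the "Yes"/"No" encoding, and the strict "after" comparison are assumptions for illustration, not rules taken from the study.

```python
from datetime import date
from typing import Iterable


def assessed_since_previous_form(
    assessment_dates: Iterable[date], previous_form_date: date
) -> str:
    """Answer "Yes" if any assessment is dated after the previous form submission."""
    # Assumption: an assessment on the submission day itself does not count.
    has_new = any(d > previous_form_date for d in assessment_dates)
    return "Yes" if has_new else "No"


# Hypothetical base answers for one patient.
assessments = [date(2019, 1, 5), date(2019, 6, 1)]
print(assessed_since_previous_form(assessments, date(2019, 3, 1)))
print(assessed_since_previous_form(assessments, date(2019, 7, 1)))
```

Boundary choices such as whether same-day assessments count are exactly the kind of specification detail that would need to be agreed on with the trial team before real CRFs are populated.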

To address these limitations, we plan the following future work: 1) developing a systematic method and criteria for CRF selection that cover the important data elements in a trial while ensuring robustness and unbiasedness for adaptation; 2) exploring graph-based storage for the graph representation of FHIR (e.g., FHIR RDF [22]); 3) expanding the base questions to validate the remaining mappings and developing logical rules to answer real questions from base responses; and 4) adapting the proposed method and developing cancer models for diverse cancers.

The resulting data model and demonstration application are publicly available on the project GitHub website at https://github.com/BD2KOnFHIR/CancerTrialByFHIR.

Figure 1. — A FHIR-based framework of data modeling for clinical trials and downstream applications.

Summary

We extended an existing data model, the Australian Colorectal Cancer Profile (ACP), to capture the data elements extracted from Case report forms (CRFs) needed in clinical trials.
We populated the data model with both structured and unstructured data from Electronic Health Record (EHR) systems.
We explored clinical trial-related downstream applications that can be automated with the utilization of standardized data elements.

Highlights

The data elements are captured from cancer clinical trial case report forms (CRFs).
A FHIR-based cancer data model is constructed as an extension of an existing cancer profile.
A data population application for CRFs using FHIR-based cancer data is developed and evaluated.
A patient subgroup discovery application is developed with the FHIR-based cancer data as input.
CRFs serve as a proxy for representing information needs for their respective cancer types.

Acknowledgments

This study is supported in part by the funding from the NIH BOND (K99 GM135488) and BD2K (U01 HG009450) grants. The authors thank Mr. Grahame Grieve for his guidance on the access of the Australian colorectal cancer profile.

Footnotes

Conflict of Interest

None

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

1. Bruland P, McGilchrist M, Zapletal E, et al. Common data elements for secondary use of electronic health record data for clinical trial execution and serious adverse event reporting. BMC medical research methodology 2016;16(1):159.
2. Nahm ML, Pieper CF, Cunningham MM. Quantifying data quality for clinical trials using electronic data capture. PloS one 2008;3(8):e3049.
3. El Emam K, Jonker E, Sampson M, Krleža-Jerić K, Neisa A. The use of electronic data capture tools in clinical trials: Web-survey of 259 Canadian trials. Journal of medical Internet research 2009;11(1):e8.
4. Crabb DW, Bataller R, Chalasani NP, et al. Standard definitions and common data elements for clinical trials in patients with alcoholic hepatitis: recommendation from the NIAAA Alcoholic Hepatitis Consortia. Gastroenterology 2016;150(4):785–90.
5. Ghitza UE, Gore-Langton RE, Lindblad R, Shide D, Subramaniam G, Tai B. Common data elements for substance use disorders in electronic health records: the NIDA Clinical Trials Network experience. Addiction 2013;108(1):3–8.
6. The Common Data Element Dictionary-a standard nomenclature for the reporting of Phase 3 cancer clinical trial data. Proceedings 14th IEEE Symposium on Computer-Based Medical Systems CBMS 2001; 2001. IEEE.
7. Nadkarni PM, Brandt CA. The common data elements for cancer research: remarks on functions and structure. Methods of information in medicine 2006;45(06):594–601.
8. CDISC Published User Guides. 2019. https://www.cdisc.org/standards/therapeutic-areas.
9. HL7 FHIR Implementation Guide: Breast Cancer Data, Release 1 - US Realm (Draft for Comment 2). 2019. http://build.fhir.org/ig/HL7/us-breastcancer/.
10. HL7 Australia Implementation Guide. 2014. http://fhir.hl7.org.au/fhir/rcpa/index.html.
11. Grimes DA, Hubacher D, Nanda K, Schulz KF, Moher D, Altman DG. The Good Clinical Practice guideline: a bronze standard for clinical research. The Lancet 2005;366(9480):172–74.
12. Bellary S, Krishnankutty B, Latha M. Basics of case report form designing in clinical research. Perspectives in clinical research 2014;5(4):159.
13. Zong N, Wen A, Stone DJ, et al. Developing an FHIR-Based Computational Pipeline for Automatic Population of Case Report Forms for Colorectal Cancer Clinical Trials Using Electronic Health Records. JCO Clinical Cancer Informatics 2020;4:201–09.
14. Alberts SR, Sargent DJ, Nair S, et al. Effect of oxaliplatin, fluorouracil, and leucovorin with or without cetuximab on survival among patients with resected stage III colon cancer: a randomized trial. Jama 2012;307(13):1383–93.
15. Kaggal VC, Elayavilli RK, Mehrabi S, et al. Toward a learning health-care system–knowledge delivery at the point of care empowered by big data and NLP. Biomedical informatics insights 2016;8:BII S37977.
16. Cancer Protocol Templates. 2019. https://www.cap.org/cancerprotocols.
17. Srigley JR, McGowan T, MacLean A, et al. Standardized synoptic cancer pathology reporting: A population-based approach. Journal of surgical oncology 2009;99(8):517–24.
18. Brown AS, Patel CJ. A standard database for drug repositioning. Scientific data 2017;4:170029.
19. HL7.org. HL7 FHIR R4. 2018. http://hl7.org/fhir/R4/.
20. Nigam K, McCallum AK, Thrun S, Mitchell T. Text classification from labeled and unlabeled documents using EM. Machine learning 2000;39(2–3):103–34.
21. Nguyen DQ. jLDADMM: A Java package for the LDA and DMM topic models. arXiv preprint arXiv:1808.03835 2018.
22. FHIR RDF Specification. 2016. http://w3c.github.io/hcls-fhir-rdf/spec/.
