Author manuscript; available in PMC: 2022 Jan 1.
Published in final edited form as: Int J Med Inform. 2020 Oct 22;145:104308. doi: 10.1016/j.ijmedinf.2020.104308

Modeling Cancer Clinical Trials Using HL7 FHIR to Support Downstream Applications: A Case Study with Colorectal Cancer Data

Nansu Zong 1, Daniel J Stone 1, Deepak K Sharma 1, Andrew Wen 1, Chen Wang 1, Yue Yu 1, Ming Huang 1, Sijia Liu 1, Hongfang Liu 1, Qian Shi 1, Guoqian Jiang 1,*
PMCID: PMC7736510  NIHMSID: NIHMS1646103  PMID: 33160272

Abstract

Background and Objective:

Identification and standardization of the data elements used in clinical trials may control and reduce cost and errors during the operational process, and enable seamless data exchange between electronic data capture (EDC) systems and electronic health record (EHR) systems. This study presents a methodology to comprehensively capture the data element needs of clinical trials.

Materials and Methods:

Case report forms (CRFs) for clinical trial data collection were used to approximate the clinical information needs, and each information need was then mapped to a semantically equivalent field within an existing FHIR cancer profile. Items without a semantically equivalent field were considered information needs that cannot be represented in current standards, and we proposed extensions to support them.

Results:

We successfully identified 62 discrete items from a preliminary survey of 43 base questions in four CRFs used in colorectal cancer clinical trials, in which 28 items are modeled with FHIR extensions and their associated responses for colorectal cancer. We achieved promising results in the data population of the CRFs with average Precision 98.5%, Recall 96.2%, and F-measure 96.8% for all base questions. We also demonstrated the auto-filled answers in CRFs can be used to discover patient subgroups using a topic modeling approach.

Conclusion:

CRFs can be considered as a proxy for representing information needs for their respective cancer types. Mining the information needs can serve as a valuable resource for expanding existing standards to ensure they can comprehensively represent relevant clinical data without loss of granularity.

Keywords: HL7 FHIR, Colorectal Cancer, Clinical Trials, Case Report Forms, Patient Subgrouping

1. Introduction

Clinical trials are experiments and observations designed to study the response of human participants to biomedical or behavioral interventions. Patient data are needed throughout the stages of a clinical trial (e.g., planning, conduct, and evaluation). An enormous volume of records from Electronic Health Record (EHR) systems must be reviewed and processed to capture the data in Electronic Data Capture (EDC) systems. This tedious, error-prone, and costly process [1, 2] creates a need for an efficient way to improve data exchange and communication between EDC and EHR systems [2, 3]. Identifying and standardizing the important data elements is one of the best approaches to control and reduce the cost and errors created during the operational process, as those elements can simplify data exchange between the systems [1]. Researchers are therefore motivated to identify common data elements in clinical trials across a diverse set of therapeutic cases [1, 4–6]. For example, the National Cancer Institute (NCI) has developed common data elements originating from the reports of Phase 3 cancer trials to standardize data representation and facilitate data interchange between research organizations [6, 7]. These pre-specified data elements are often employed in multiple data models with standardized frameworks. Such efforts include the cancer profiles developed by the Clinical Data Interchange Standards Consortium (CDISC) [8], the Clinical Information Modeling Initiative (CIMI) community [9], and the Royal College of Pathologists of Australasia (RCPA) / HL7 Australia [10].

Case report forms (CRFs) are questionnaires that collect the information regulated by study protocols, and they approximate the information needs of scientists to answer hypothetical questions [11, 12]. By extension, we hypothesize that the data elements necessary to comprehensively represent any given cancer can potentially be captured by mining the CRFs of associated clinical trials. This knowledge can then be used to construct a more comprehensive and standardized cancer data model that represents the data elements needed for the trials, which fills the gap between data consumers (e.g., EDC systems) and data providers (e.g., EHR systems) and improves data exchange and communication in support of diverse downstream applications.

The objective of this study is thus to extract common data elements from CRFs and build a data model that inherits and reuses existing resources (e.g., data models and CRFs) to better address the needs of EDC systems for cancer trials. In a previous study [13], we demonstrated that adopting the Australian Colorectal Cancer Profile (ACP) [10], an existing cancer profile for modeling pathological reports based on the Fast Healthcare Interoperability Resources (FHIR), can enable automated population of CRF data. However, ACP is designed to model pathological reports (e.g., synoptic reports), and the previous study covered only a limited number of questions for demonstration. A data model is therefore needed that can handle diverse data sources (e.g., patient, diagnosis, medication, surgical, order, lab test, and pathological reports) and provide sufficient coverage to address the comprehensive questions from multiple CRFs. In this study, we focused on two new contributions: 1) we extended ACP to cover more of the data elements needed in CRFs, from both structured data (e.g., lab tests) and unstructured data (e.g., surgical reports) in EHR systems, and 2) we explored more downstream applications that use the standardized data elements of the extended data model. As a proof of concept, we conducted a case study based on the CRFs of a real colorectal cancer trial in the Alliance for Clinical Trials in Oncology (Alliance) [14]. We built a colorectal cancer data model by exploring the automated capture of data elements from 71 questions in four CRFs, and populated the model with the medical records of 331 colon cancer patients from Mayo Clinic. We developed two downstream applications for the evaluation: 1) data population of CRFs, and 2) patient subgrouping. We achieved promising results, with an average precision of 98.5%, recall of 96.2%, and F-measure of 96.8% for the 43 base questions generalized from the CRFs. We also demonstrated subgrouping of the patients based on the auto-filled answers to support the efficient allocation of cohorts for analysis.

2. Materials

We extracted the patient information from the Mayo Clinic data warehouse known as the Unified Data Platform (UDP) [15]. Two types of data, structured and unstructured, were sourced. For structured data, information about the patient, diagnoses, medications, surgeries, orders, and lab tests, among others, was collected. For unstructured data, surgical and pathological reports were obtained to collect the cancer-related data. In practice, we used a semi-structured form known as a synoptic report as our main source of pathological data elements, with the full-form original pathological reports as a supplement. A synoptic report is a templated pathology report that follows the College of American Pathologists (CAP) guidance [16] on the inclusion of data elements and the general definition of templated values [17]. The protocol for clinical data access was approved by the Mayo Clinic Institutional Review Board.

RCPA has adopted structured cancer reporting to build the standard data models [10], in which a colorectal cancer model known as ACP represents the data elements in pathological reports [18]. With a logical model, ACP defines a set of concepts and their values mainly with five core elements, “preAnalytic”, “macro”, “micro”, “ancillaryTests”, and “synthesisOverview”. The elements in the logical model are further represented with the FHIR resources, DiagnosticReport and Observation, to form a FHIR-based data model. ACP was adopted in our previous study [13]. In this study, we extended ACP with the data elements extracted from four CRFs of a real-world colorectal cancer trial, which is to study how the patients with resected stage III colon cancer are affected by the drugs oxaliplatin, fluorouracil, and leucovorin with or without cetuximab [14]. The CRFs contain the forms of registration/randomization eligibility checklist, adjuvant on-study, anatomic pathology review, and follow up, containing 71 questions in total.

3. Method

We captured important data elements based on the information needs in CRFs and constructed a FHIR-based data model that extends ACP to facilitate EDC for downstream applications. The framework has three main steps. First, both structured and unstructured data are extracted from the UDP. For the unstructured data (e.g., surgical and pathological reports), each data element and its values are harvested. For synoptic reports, a total of 25 data elements, such as primary tumor, are obtained directly. For structured data, data elements such as RESULT_VAL are obtained from the database schema. Second, we collected CRFs to capture the commonly used data elements from all the questions, and developed a data model that extends the ACP to organize these elements. The data elements are further analyzed to identify whether their source in the EHR is structured or unstructured data. Last, the data model is populated with the values obtained in the extract, transform, and load (ETL) process, using manually created mappings, to form a FHIR-based data profile.

3.1. Consolidating a FHIR-based data model

To tackle colorectal cancer, we adapted the ACP colorectal cancer profile as the base model from which to construct our logical model for data representation. As Figure 2 shows, “PreAnalytic”, “Macro”, “Micro”, “AncillaryTests”, and “SynthesisOverview” are adapted (orange boxes in Figure 2). “PreAnalytic” represents the information collected prior to specimen receipt at the laboratory. “Macro”, “Micro”, and “AncillaryTests” cover macroscopic, microscopic, and ancillary test findings, respectively. “SynthesisOverview” records synthesis information. To enrich “PreAnalytic”, “Macro”, and “Micro”, we added sub-elements captured as common data elements from the CRFs (green boxes), such as “newPrimary” for new primary tumor information and “recurrence” for tumor recurrence information. In addition, for the 28 common data elements (green boxes) that cannot be modeled with the ACP colorectal cancer profile, we created new elements in the logical model: “LaboratoryTest” for lab tests, “Medication” for treatment, and “Surgery” for surgical information.

Figure 2.


The proposed data model for colorectal cancer-related trials. Data elements inherited from the original ACP are in orange boxes, newly added elements are in green boxes, and the adopted FHIR representations are in red boxes.

To represent our logical model, we adopted FHIR resources to capture the concepts and value sets defined in the logical model. The mappings for the atomic data elements of the original ACP model are inherited directly (http://hl7.org.au/fhir/rcpa/cmap.html#summary). The newly developed elements are directly represented (highlighted in the red box in Figure 2) by the attributes defined in the Resources of FHIR Release 4 (R4) [19].
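As an illustration of how one of the newly added elements might be carried by an R4 resource, the sketch below builds an Observation-shaped dictionary for the absolute neutrophil count. Only the general Observation fields (status, code, subject, effectiveDateTime, valueQuantity) follow FHIR R4; the helper name, the code system URL, the code, the patient identifier, and the sample values are hypothetical and are not taken from the study.

```python
import json

# Hypothetical local code system; the study's actual codes are not given in the text.
LOCAL_SYSTEM = "http://example.org/fhir/CodeSystem/colorectal-crf"


def lab_observation(patient_id, element, value, unit, collected):
    """Build a minimal FHIR R4 Observation-shaped dict for one lab element."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{"system": LOCAL_SYSTEM, "code": element}]},
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": collected,
        "valueQuantity": {"value": value, "unit": unit},
    }


# Hypothetical patient and result, for illustration only.
obs = lab_observation(
    "example-001", "absoluteNeutrophilCount", 2.1, "10*9/L", "2019-01-15"
)
print(json.dumps(obs, indent=2))
```

A real profile would typically bind the code to a standard terminology and validate the resource against the profile; both steps are omitted here.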

3.2. Data population based on the proposed model

To populate the data for CRFs, a mapping is established between the data elements (i.e., schema) in the source datasets (i.e., database tables and synoptic reports) and the atomic data elements defined in the proposed data model. In total, we identified 62 mappings that link the cancer model elements with the schema across the eight sources. As Table 1 shows, the atomic data elements originally defined in ACP are mainly designed to represent pathological information and are thus mapped to elements in synoptic reports; for example, Colorectal.micro.involvedMargins is mapped to “Surgical Margins”. The extended elements in the proposed model are designed to represent information from medication orders and from surgical, radiological, and lab testing results. In practice, we implemented simple logical rules to obtain the values that populate the data model from patient records. For example, Colorectal.laboratoryTest.absoluteNeutrophilCount.value is obtained from lab tests with either standard or locally used concept codes referring to neutrophil count tests.

Table 1.

Map source atomic data elements to the proposed cancer model.

| Cancer Model Element | Source | Element |
|---|---|---|
| Colorectal.micro.colonscopyAssessmentDate | Orders Table (UDP) | ORDER_DATE (ORDER_NAME-“colonscopy”) |
| Colorectal.micro.extramuralTumourDeposits¹ | Synoptic Report | Tumor Deposits |
| Colorectal.micro.extramuralVeinInvasion¹ | Synoptic Report | Lymphovascular Invasion |
| Colorectal.micro.histoConfDistMetastases¹ | Synoptic Report | Distant Metastasis |
| Colorectal.micro.histoConfDistMetastasesSite¹ | Synoptic Report | Distant Metastasis |
| Colorectal.micro.histologicalGrade | Synoptic Report | Histologic Grade |
| Colorectal.micro.hostLymphoidResponse | Pathology Report | DIAGNOSIS |
| Colorectal.micro.intramuralVeinInvasion¹ | Synoptic Report | Lymphovascular Invasion |
| Colorectal.micro.involvedMargins¹ | Synoptic Report | Surgical Margins |
| Colorectal.micro.lymphNodeInvolvement¹ | Synoptic Report | Lymphovascular Invasion |
| Colorectal.micro.lymphNodesDetails.numExamined | Synoptic Report | Number examined (total) |
| Colorectal.micro.lymphNodesDetails.numPos | Synoptic Report | Number involved (total) |
| Colorectal.micro.marginsMicroClearance¹ | Synoptic Report | Surgical Margins |
| Colorectal.micro.maxDegreeLocalInvasion | Synoptic Report | Microscopic Tumor Extension |
| Colorectal.micro.neoadjuvantTherapy¹ | Synoptic Report | Treatment Effect |
| Colorectal.micro.nonperitonealisedCircumMargin¹ | Synoptic Report | Surgical Margins |
| Colorectal.micro.perineuralInvasion¹ | Synoptic Report | Perineural Invasion |
| Colorectal.micro.polypDetails¹ | Synoptic Report | Type of Polyp Tumor Arises From |
| Colorectal.micro.proximalOrDistalResectionMargins¹ | Synoptic Report | Surgical Margins |
| Colorectal.micro.smallVesselInvasion¹ | Synoptic Report | Lymphovascular Invasion |
| Colorectal.micro.tumourType | Synoptic Report | Histologic Type |
| Colorectal.micro.venousSmallVesselInvasion¹ | Synoptic Report | Lymphovascular Invasion |
| Colorectal.macro.depositNumber | Synoptic Report | Tumor Deposits |
| Colorectal.macro.intactnessOfMesorectum¹ | Synoptic Report | Macroscopic Intactness of Mesorectum |
| Colorectal.macro.invasion | Radiology Table (UDP) | RADIOLOGY_TEST_DESCRIPTION |
| Colorectal.macro.invasion | Synoptic Report | Microscopic Tumor Extension |
| Colorectal.macro.maxTumourDiameter | Synoptic Report | Tumor Size |
| Colorectal.macro.distNonperitonCircumMargin | Synoptic Report | Surgical Margins |
| Colorectal.macro.natureAndSiteOfBlocks | Pathology Report | BLOCK SUMMARY |
| Colorectal.macro.otherMacroComments¹ | Synoptic Report | Specimen |
| Colorectal.macro.polyps | Diagnosis Table (UDP) | DIAGNOSIS_NAME=(“intestinal polyposis syndrome”, “gastrointestinal polyposis syndrome”) |
| Colorectal.macro.tumourPerforation | Synoptic Report | Macroscopic Tumor Perforation |
| Colorectal.macro.tumourSite | Synoptic Report | Tumor Site |
| Colorectal.preAnalytic.adherence | Synoptic Report | Microscopic Tumor Extension |
| Colorectal.preAnalytic.clinicalAssessmentDate | Diagnosis Table (UDP) | DIAGNOSIS_DATE |
| Colorectal.preAnalytic.newPrimary | Radiology Table (UDP) | RADIOLOGY_REPORT |
| Colorectal.preAnalytic.newPrimaryDate | Radiology Table (UDP) | RADIOLOGY_DATE |
| Colorectal.preAnalytic.recurrence | Synoptic Report | Microscopic Tumor Extension / Comment |
| Colorectal.preAnalytic.recurrenceDate | Synoptic Report | NOTE_DATE |
| Colorectal.preAnalytic.tumourLocation | Synoptic Report | Tumor Site |
| Colorectal.preAnalytic.typeOfOperation¹ | Synoptic Report | Procedure |
| Colorectal.preAnalytic.clinicalObstruction | Synoptic Report | GROSS DESCRIPTION |
| Colorectal.synthesisOverview.tumourStageM¹ | Synoptic Report | Distant Metastasis |
| Colorectal.synthesisOverview.tumourStageN | Synoptic Report | Regional lymph nodes |
| Colorectal.synthesisOverview.tumourStageT | Synoptic Report | Primary tumor |
| Colorectal.synthesisOverview.tumourStagingSystem¹ | Synoptic Report | Pathologic Staging (AJCC, 7th edition) |
| Colorectal.synthesisOverview.overarchingComment | Synoptic Report | Comment |
| Colorectal.laboratoryTest.absoluteNeutrophilCount.date¹ | Lab-test Table (UDP) | LAB_COLLECTION_DATE |
| Colorectal.laboratoryTest.absoluteNeutrophilCount.value | Lab-test Table (UDP) | RESULT_VAL (LAB_DESCRIPTION = “Neutrophils” or “Neutrophils Absolute” or “Absolute Neutrophil Count”) |
| Colorectal.laboratoryTest.bilirubin.date¹ | Lab-test Table (UDP) | LAB_COLLECTION_DATE |
| Colorectal.laboratoryTest.bilirubin.value | Lab-test Table (UDP) | RESULT_VAL (LAB_DESCRIPTION = “Bilirubin” or “Bilirubin S”) |
| Colorectal.laboratoryTest.creatinine.date¹ | Lab-test Table (UDP) | LAB_COLLECTION_DATE |
| Colorectal.laboratoryTest.creatinine.value | Lab-test Table (UDP) | RESULT_VAL (LAB_DESCRIPTION = “Creatinine” or “Creatinine S” or “Creatinine P” or “Creatinine U”) |
| Colorectal.laboratoryTest.Hgb.date¹ | Lab-test Table (UDP) | LAB_COLLECTION_DATE |
| Colorectal.laboratoryTest.Hgb.value | Lab-test Table (UDP) | RESULT_VAL (LAB_DESCRIPTION = “Hgb” or “Hemoglobin”) |
| Colorectal.laboratoryTest.plateletCount.date¹ | Lab-test Table (UDP) | LAB_COLLECTION_DATE |
| Colorectal.laboratoryTest.plateletCount.value | Lab-test Table (UDP) | RESULT_VAL (LAB_DESCRIPTION = “Platelet” or “Platelet Count” or “Platelet Estimate”) |
| Colorectal.laboratoryTest.serumPregnancy.date | Lab-test Table (UDP) | LAB_COLLECTION_DATE |
| Colorectal.laboratoryTest.serumPregnancy.value | Lab-test Table (UDP) | RESULT_VAL (LAB_DESCRIPTION = “HCG” or “Pregnancy Test”) |
| Colorectal.surgery.resectionExtent | Surgical Procedures Table (UDP) | SURGICAL_PROCEDURE_DESCRIPTION = “Biopsy” or “Polypectomy” or “Excision” or “Colectomy” or “Resection” |
| Colorectal.surgery.type | Surgical Procedures Table (UDP) | SURGICAL_PROCEDURE_DESCRIPTION = “Laparoscopy” or “Open Approach” |
| Colorectal.surgery.date | Surgical Procedures Table (UDP) | SURGICAL_PROCEDURE_DATE |
| Colorectal.subject.vitalStatus | Patient Table (UDP) | PATIENT_DECEASED_FLAG |
| Colorectal.medication.treatment.code | Orders Table (UDP) | ORDER_DESCRIPTION-“Leucovorin” or “Fluorouracil” or “Oxaliplatin” or “Cetuximab” |
| Colorectal.medication.treatment.unit¹ | Orders Table (UDP) | ORDER_DOSE_UNITS |
| Colorectal.medication.treatment.value¹ | Orders Table (UDP) | ORDER_DOSE_AMOUNT |

¹ The mapping cannot be validated by the 43 base questions in Table 3.
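The lab-test rows of Table 1 amount to filtering rows by LAB_DESCRIPTION and reading RESULT_VAL. A minimal sketch of such a rule follows, assuming lab rows arrive as dictionaries keyed by the column names shown in Table 1. The function name, the sample rows, and the choice to keep the most recent result are assumptions for illustration; the text does not say how multiple results per patient are resolved.

```python
# Description strings per model element, copied from lab-test rows of Table 1.
LAB_RULES = {
    "Colorectal.laboratoryTest.absoluteNeutrophilCount": {
        "Neutrophils", "Neutrophils Absolute", "Absolute Neutrophil Count"},
    "Colorectal.laboratoryTest.bilirubin": {"Bilirubin", "Bilirubin S"},
    "Colorectal.laboratoryTest.plateletCount": {
        "Platelet", "Platelet Count", "Platelet Estimate"},
}


def populate_lab_elements(rows):
    """Map lab rows to model elements, keeping the latest matching row.

    Keeping the latest row is an assumption made for this sketch.
    """
    out = {}
    for element, descriptions in LAB_RULES.items():
        matches = [r for r in rows if r["LAB_DESCRIPTION"] in descriptions]
        if not matches:
            continue  # element stays unpopulated for this patient
        # ISO-formatted date strings sort chronologically.
        latest = max(matches, key=lambda r: r["LAB_COLLECTION_DATE"])
        out[element + ".value"] = latest["RESULT_VAL"]
        out[element + ".date"] = latest["LAB_COLLECTION_DATE"]
    return out


rows = [  # hypothetical rows for one patient
    {"LAB_DESCRIPTION": "Platelet Count", "RESULT_VAL": 210000,
     "LAB_COLLECTION_DATE": "2018-03-01"},
    {"LAB_DESCRIPTION": "Platelet", "RESULT_VAL": 185000,
     "LAB_COLLECTION_DATE": "2018-06-12"},
    {"LAB_DESCRIPTION": "Bilirubin S", "RESULT_VAL": 0.8,
     "LAB_COLLECTION_DATE": "2018-06-12"},
]
populated = populate_lab_elements(rows)
```

Keeping the description strings in a single lookup table keeps the rules declarative, so adding a locally used test name means editing data rather than logic.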

3.3. EDC-based downstream applications for colorectal cancer

We implemented the proposed framework on the clinical records of 331 Mayo Clinic patients from 2013 to 2019, identified by a search using colorectal cancer-related ICD-9 codes and filtered in compliance with the research authorization policies of Mayo Clinic. For these patients, we collected 1226 synoptic reports. Two downstream applications were developed to evaluate the model: 1) data population of CRFs, and 2) patient subgrouping based on the populated CRFs.

3.3.1. Application (1) – data population of CRFs

We summarized the 57 questions in all the CRFs [14] to remove redundancy, resulting in 43 base questions. For example, “Primary Site(s)” in the form “adjuvant on-study” and “Primary Site(s)” in the form “anatomic pathology review” are generalized into the question Q(24) “Primary Site(s)”. We mapped the questions to the atomic data elements of the proposed data model to enable the population of the questions. For example, to answer the question Q(16) “Extent of resection”, we mapped it to Colorectal.surgery.resectionExtent. The detailed mapping is available in Table 2. Note that, in practice, we generated the raw questions of the application to reduce the bias that may be caused by the different annotating experiences of the domain experts in the evaluation.

Table 2.

Map base questions of CRFs to the data elements in the proposed cancer model.

| ID | Question | Value | Element |
|---|---|---|---|
| 1 | Patient’s vital status | Yes / No (Patient is alive) | Colorectal.subject.vitalStatus |
| 2 | Hemoglobin (Hgb) | Yes / No (Hemoglobin >= 9 g/dL) | Colorectal.laboratoryTest.hgb.value |
| 3 | Absolute neutrophil count | Yes / No (Absolute neutrophil count >= LNL) | Colorectal.laboratoryTest.absoluteNeutrophilCount.value |
| 4 | Absolute neutrophil count LNL | Quantitative | Colorectal.laboratoryTest.absoluteNeutrophilCount.LNL |
| 5 | Creatinine | Yes / No (Creatinine <= 1.5 × UNL) | Colorectal.laboratoryTest.creatinine.value |
| 6 | Creatinine UNL | Quantitative | Colorectal.laboratoryTest.creatinine.UNL |
| 7 | Platelet count | Yes / No (Platelet count >= 100,000/uL) | Colorectal.laboratoryTest.plateletCount.value |
| 8 | Total bilirubin | Yes / No (Total bilirubin <= 1.5 × UNL) | Colorectal.laboratoryTest.bilirubin.value |
| 9 | Total bilirubin UNL | Quantitative | Colorectal.laboratoryTest.bilirubin.UNL |
| 10 | Negative serum pregnancy test | Positive / Negative | Colorectal.laboratoryTest.serumPregnancy.value |
| 11 | Negative serum pregnancy test date | DateTime | Colorectal.laboratoryTest.serumPregnancy.date |
| 12 | Assigned treatment (medication) | Oxaliplatin / Fluorouracil / Leucovorin / Cetuximab | Colorectal.medication.treatment.code |
| 13 | Associated diseases | Yes / No (is polyposis syndrome) | Colorectal.macro.polyps |
| 14 | Clinical assessment date | DateTime | Colorectal.preAnalytic.clinicalAssessmentDate |
| 15 | Colonoscopy date | DateTime | Colorectal.micro.colonoscopyAssessmentDate |
| 16 | Extent of resection | Biopsy / Polypectomy / Bowel resection / Local excision / Indeterminate | Colorectal.surgery.resectionExtent |
| 17 | Type of procedure | Open approach / Laparoscopic | Colorectal.surgery.type |
| 18 | Surgery date | DateTime | Colorectal.surgery.date |
| 19 | Site of pathologically confirmed invasion | Bladder / Prostate / Vagina / Liver / Seminal vesicles / Pelvic (other than above) / Ovary / Ureter / Peritoneum / Uterus | Colorectal.macro.invasion |
| 20 | New primary cancer or MDS (myelodysplastic syndrome) | Yes / No | Colorectal.preAnalytic.newPrimary |
| 21 | Date of diagnosis for new primary cancer | DateTime | Colorectal.preAnalytic.newPrimaryDate |
| 22 | First progression (or recurrence) | Yes / No | Colorectal.preAnalytic.recurrence |
| 23 | Date of first recurrence or progression | DateTime | Colorectal.preAnalytic.recurrenceDate |
| 24 | Primary site(s) | Cecum / Transverse colon / Sigmoid colon / Ascending colon / Splenic flexure / Hepatic flexure / Descending colon | Colorectal.preAnalytic.tumourLocation / Colorectal.macro.tumourSite |
| 25 | Tumor size | Narrative | Colorectal.macro.maxTumourDiameter |
| 26 | Bowel perforation | Present / Absent | Colorectal.macro.tumourPerforation |
| 27 | Histologic type | Signet ring cell adenocarcinoma / Signet ring cell carcinoma / High grade neuroendocrine carcinoma / Mucinous adenocarcinoma / No residual carcinoma / Adenocarcinoma / Medullary carcinoma / Squamous cell carcinoma | Colorectal.micro.tumourType |
| 28 | Histology | High (poorly differentiated or undifferentiated) / Low (well or moderately differentiated) | Colorectal.micro.histologicalGrade |
| 29 | Comments | Narrative | Colorectal.synthesisOverview.overarchingComment |
| 30 | Adherence | Yes / No | Colorectal.preAnalytic.adherence |
| 31 | Number of deposits | Quantitative | Colorectal.macro.depositNumber |
| 32 | Disease extent | Tumor invades submucosa (PT1) / Tumor invades muscularis propria (PT2) / Tumor invades through the muscularis propria into the subserosa, or into nonperitonealized pericolic or perirectal tissue (PT3) / The tumor has grown into the surface of the visceral peritoneum, which means it has grown through all layers of the colon (PT4a) / The tumor has grown into or has attached to other organs or structures (PT4b) / Primary tumor cannot be assessed (TX) | Colorectal.micro.maxDegreeLocalInvasion / Colorectal.synthesisOverview.tumourStageT |
| 33 | Regional lymph node involvement | No regional lymph node metastases (PN0) / Metastases in 1 to 3 regional lymph nodes (PN1) / Metastases in 4 or more regional lymph nodes (PN2) / Regional lymph nodes cannot be assessed (PNX) | Colorectal.synthesisOverview.tumourStageN |
| 34 | Number of lymph nodes examined | Quantitative | Colorectal.micro.lymphNodesDetails.numExamined |
| 35 | Positive lymph nodes | Absent / Present | Colorectal.micro.lymphNodesDetails.numPos |
| 36 | Distance to closest longitudinal margin | Narrative | Colorectal.macro.distNonperitonCircumMargin |
| 37 | Bowel obstruction | Absent / Present | Colorectal.preAnalytic.clinicalObstruction |
| 38 | Blocks | Narrative | Colorectal.macro.natureAndSiteOfBlocks |
| 39 | Stools | Yes / No, patient has a colostomy/ileostomy | Colorectal.preAnalytic.stool |
| 40 | Multiple primary malignant tumors? | Yes / No | Colorectal.macro.maligantTumorNumber |
| 41 | Deposits type | Discrete / Irregular / Both discrete and irregular | Colorectal.macro.depositType |
| 42 | Residual adjacent adenoma? | Yes / No | Colorectal.macro.residualAdjacentAdenoma |
| 43 | Host lymphoid response (check all that apply) | Crohn’s like (2 or more lymphoid aggregates per slide, often associated with germinal centers adjacent to tumor) / Peritumoral, mild (distinct rim or cap of lymphocytes at tumor-parenchyma interface) / Intratumoral, marked (>4 tumor infiltrating lymphocytes/HPF) | Colorectal.micro.hostLymphoidResponse |
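Several base questions in Table 2 (Q2, Q5, Q7, Q8) are Yes/No answers derived by comparing a populated lab value with a fixed threshold or with a multiple of the upper normal limit (UNL). The sketch below shows one way such answers could be derived from populated elements. The function name, the dictionary layout, the handling of missing values, and the sample patient are assumptions; only the thresholds and element names come from Table 2.

```python
def answer_threshold_questions(elements):
    """Derive Yes/No answers for Q2, Q5, Q7, and Q8 from populated elements.

    A question whose inputs are missing is answered with None.
    """
    def get(name):
        return elements.get("Colorectal.laboratoryTest." + name)

    def yes_no(value, limit, at_least):
        if value is None or limit is None:
            return None
        passed = value >= limit if at_least else value <= limit
        return "Yes" if passed else "No"

    def scaled(limit, factor):
        return None if limit is None else factor * limit

    return {
        2: yes_no(get("hgb.value"), 9.0, at_least=True),                # g/dL
        5: yes_no(get("creatinine.value"),
                  scaled(get("creatinine.UNL"), 1.5), at_least=False),
        7: yes_no(get("plateletCount.value"), 100_000, at_least=True),  # per uL
        8: yes_no(get("bilirubin.value"),
                  scaled(get("bilirubin.UNL"), 1.5), at_least=False),
    }


sample = {  # hypothetical values for one patient
    "Colorectal.laboratoryTest.hgb.value": 8.4,
    "Colorectal.laboratoryTest.creatinine.value": 1.1,
    "Colorectal.laboratoryTest.creatinine.UNL": 1.2,
    "Colorectal.laboratoryTest.plateletCount.value": 152_000,
}
answers = answer_threshold_questions(sample)
```

Returning None instead of "No" for missing inputs keeps an unanswerable question distinct from a failed eligibility criterion.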

To evaluate the quality of the generated answers, we randomly split the patients into seven groups and asked seven subject matter experts in medical informatics (N.Z., Y.Y., M.M., A.W., D.S., S.L., and D.S.) to answer the base questions based on the patient records. The seven experts were approved for data access by the Mayo Clinic Institutional Review Board. Ten randomly selected patient records were annotated by all reviewers, and inter-rater reliability kappa scores were calculated. For each question, standard answers were generated from reliable annotators (average kappa score of 0.96); annotations from experts with low inter-rater kappa scores were filtered out if determined upon review to be mostly inaccurate. The annotations for the ten randomly selected patients were merged in the evaluation following the same filtering rule. The results were evaluated with three metrics: Precision, Recall, and F-measure.
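For reference, the three metrics can be computed from counts of true positives, false positives, and false negatives, as in the sketch below. How a generated answer is scored against the standard answer is not detailed in the text, so the counts used here are hypothetical and serve only to show the arithmetic.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from raw counts.

    F-measure is the harmonic mean of precision and recall (F1).
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts for a single question.
p, r, f = precision_recall_f(tp=10, fp=1, fn=3)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.909 0.769 0.833
```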

3.3.2. Application (2) – discovery of patient subgroups

With the proposed FHIR-based data model, we can extract the data elements and values for each patient to generate the subgroups with the patients sharing the same clinical features. In practice, we standardized and utilized the raw answers with the categorical values of the base questions generated in Application (1) as features to explore the patient cohorts based on the patient subgrouping.

We adopted the one-topic-per-document Dirichlet Multinomial Mixture (DMM) model [20] to cluster the patients. DMM is a topic model designed for short texts, which assumes that each document belongs to exactly one topic. Specifically, we treated each patient $p$ as a document and each categorical answer to a base question as a word $a_{pi}$. The model then draws a topic for each patient as $z_p \sim \mathrm{Multinomial}(\theta)$, where $\theta \sim \mathrm{Dirichlet}(\alpha)$, and each categorical answer as $a_{pi} \sim \mathrm{Multinomial}(\phi_{z_p})$, where $\phi_{z_p} \sim \mathrm{Dirichlet}(\beta)$. In practice, we performed this task with the jLDADMM library [21].
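To make the generative description concrete, the following is a minimal pure-Python sketch of a collapsed Gibbs sampler for DMM. It is not the jLDADMM implementation used in the study; the function name, hyperparameter defaults, the "Q=answer" token format, and the toy patients are all assumptions for illustration.

```python
import math
import random
from collections import Counter


def dmm_gibbs(docs, n_topics, alpha=0.1, beta=0.1, n_iters=50, seed=0):
    """Assign exactly one topic to each document with collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    m = [0] * n_topics                         # documents per topic
    n = [0] * n_topics                         # tokens per topic
    nw = [Counter() for _ in range(n_topics)]  # token counts per topic
    z = []
    for doc in docs:                           # random initial assignment
        k = rng.randrange(n_topics)
        z.append(k)
        m[k] += 1
        n[k] += len(doc)
        nw[k].update(doc)
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            k_old = z[d]                       # remove document d from its topic
            m[k_old] -= 1
            n[k_old] -= len(doc)
            nw[k_old].subtract(doc)
            log_p = []
            for k in range(n_topics):
                # Topic prior; the shared normalizer is dropped.
                lp = math.log(m[k] + alpha)
                seen = Counter()
                for w in doc:                  # token likelihood, numerator
                    lp += math.log(nw[k][w] + beta + seen[w])
                    seen[w] += 1
                for i in range(len(doc)):      # token likelihood, denominator
                    lp -= math.log(n[k] + vocab_size * beta + i)
                log_p.append(lp)
            top = max(log_p)                   # stabilize before exponentiating
            weights = [math.exp(lp - top) for lp in log_p]
            k_new = rng.choices(range(n_topics), weights=weights)[0]
            z[d] = k_new                       # add document d to the new topic
            m[k_new] += 1
            n[k_new] += len(doc)
            nw[k_new].update(doc)
    return z


# Toy patients encoded as "question=answer" tokens (hypothetical data).
patients = [
    ["Q35=Absent", "Q33=PN0", "Q17=Laparoscopy"],
    ["Q35=Absent", "Q33=PN0", "Q16=Local excision"],
    ["Q35=Present", "Q33=PN2", "Q24=Cecum"],
    ["Q35=Present", "Q33=PN1", "Q24=Cecum"],
]
labels = dmm_gibbs(patients, n_topics=2)
```

Because each patient receives a single topic, the sampler reassigns whole documents rather than individual tokens, which is the property that distinguishes DMM from LDA.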

4. Results

4.1. Population of CRFs

We evaluated the data model for the downstream application of generating the response for the base questions. An average F-measure of 0.968 was obtained as shown in Table 3. The data elements of the proposed model are collected from the following two parts, 1) the elements that existed in ACP and 2) elements that are required and inferred from the questions of CRFs. We found that 27 mappings cannot be validated by the base questions (refer to Table 1). We also failed to generate answers or consistently annotate Questions (39–42) due to a lack of sufficient data to identify and extract the corresponding elements from the target data sources.

Table 3.

Precision, Recall, and F-measure of the automatically generated answers for the base questions.

| ID | Question | P | R | F |
|---|---|---|---|---|
| 1 | Patient’s vital status | 1.000 | 1.000 | 1.000 |
| 2 | Hemoglobin (Hgb) | 1.000 | 1.000 | 1.000 |
| 3 | Absolute neutrophil count | 1.000 | 1.000 | 1.000 |
| 4 | Absolute neutrophil count LNL | 1.000 | 1.000 | 1.000 |
| 5 | Creatinine | 1.000 | 1.000 | 1.000 |
| 6 | Creatinine UNL | 1.000 | 1.000 | 1.000 |
| 7 | Platelet count | 1.000 | 1.000 | 1.000 |
| 8 | Total bilirubin | 1.000 | 1.000 | 1.000 |
| 9 | Total bilirubin UNL | 1.000 | 1.000 | 1.000 |
| 10 | Negative serum pregnancy test | 1.000 | 1.000 | 1.000 |
| 11 | Negative serum pregnancy test date | 1.000 | 1.000 | 1.000 |
| 12 | Assigned treatment (medication) | 1.000 | 1.000 | 1.000 |
| 13 | Associated diseases | 1.000 | 0.952 | 0.976 |
| 14 | Clinical assessment date | 1.000 | 1.000 | 1.000 |
| 15 | Colonoscopy Date | 0.997 | 1.000 | 0.998 |
| 16 | Extent of resection | 0.890 | 0.973 | 0.930 |
| 17 | Type of procedure² | 0.997 | 0.985 | 0.991 |
| 18 | Surgery date² | 0.997 | 1.000 | 0.998 |
| 19 | Site of pathologically Confirmed invasion | 1.000 | 1.000 | 1.000 |
| 20 | New primary cancer or MDS (myelodysplastic syndrome) | 0.833 | 1.000 | 0.909 |
| 21 | Date of diagnosis for new primary cancer | 0.833 | 1.000 | 0.909 |
| 22 | First progression (or recurrence) | 1.000 | 0.333 | 0.500 |
| 23 | Date of first recurrence or progression | 1.000 | 1.000 | 1.000 |
| 24 | Primary site(s)² | 1.000 | 1.000 | 1.000 |
| 25 | Tumor size | 1.000 | 1.000 | 1.000 |
| 26 | Bowel perforation² | 1.000 | 1.000 | 1.000 |
| 27 | Histologic type | 1.000 | 1.000 | 1.000 |
| 28 | Histology | 1.000 | 1.000 | 1.000 |
| 29 | Comments² | 1.000 | 0.955 | 0.977 |
| 30 | Adherence | 0.909 | 0.769 | 0.833 |
| 31 | Number of deposits | 1.000 | 0.997 | 0.998 |
| 32 | Disease extent² | 1.000 | 0.994 | 0.997 |
| 33 | Regional lymph node involvement | 1.000 | 0.995 | 0.997 |
| 34 | Number of lymph nodes examined² | 1.000 | 0.997 | 0.999 |
| 35 | Positive lymph nodes² | 1.000 | 0.989 | 0.994 |
| 36 | Distance to closest longitudinal margin | 1.000 | 0.697 | 0.822 |
| 37 | Bowel obstruction² | 1.000 | 1.000 | 1.000 |
| 38 | Blocks | 1.000 | 1.000 | 1.000 |
| 39 | Stools | - | - | - |
| 40 | Multiple primary malignant tumors? | - | - | - |
| 41 | Deposits type | - | - | - |
| 42 | Residual adjacent adenoma? | - | - | - |
| 43 | Host lymphoid response | 0.946 | 0.889 | 0.917 |
| | Overall Average | 0.985 | 0.962 | 0.968 |

² Please note, the questions are re-evaluated with the new randomly selected patients in this study, and thus the results are slightly different from our previous work [13].

On structured portions of the evaluation, there were no discrepancies found that would indicate any issues with the underlying data such as mis-entered values of the wrong scale, and the generated answers yielded perfect results. For unstructured portions of the record, excerpts pertaining to the questions might be stated multiple times in the record with slight variances in wording (e.g. benign vs negative) or with a slightly different interpretation of the underlying facts -- a count of 15 negative lymph nodes might be noted as a count of 18 in a later section of the same synoptic report. Annotators often had stylistic or personal differences that might point to an identical locality of the record but with varying start or end windows for the selection of annotation span, which causes the majority of the discrepancies in results.

4.2. Discovery of patient subgroups

Table 4 lists the distribution of the categorical answers for the base questions. Most questions with binary answers (e.g., Yes vs. No) have highly imbalanced distributions (e.g., Q1, Q2, Q3, Q5, Q7, Q8, Q13, Q17, Q22, Q30, Q37). For the questions with multiple answers, the answers are much more balanced, except for Q16 and Q27.

Table 4.

Distribution of the categorical answers for the base questions.

Question  Value                                 Frequency

Q1        Yes                                   284
          No                                    47

Q2        Yes                                   307
          No                                    23

Q3        Yes                                   38
          No                                    267

Q5        Yes                                   319
          No                                    11

Q7        Yes                                   319
          No                                    11

Q8        Yes                                   275
          No                                    6

Q10       Positive                              1
          Negative                              3

Q12       Oxaliplatin                           74
          Fluorouracil                          77
          Leucovorin                            66
          Cetuximab                             3

Q13       Yes                                   17
          No                                    314

Q16       Bowel resection                       208
          Local excision                        84
          Biopsy                                1

Q17       Laparoscopy                           52
          Open approach                         237

Q19       Peritoneum                            9
          Ovary                                 1
          Vagina                                2
          Bladder                               3
          Pelvic                                1
          Prostate                              2

Q22       Yes                                   2
          No                                    329

Q24       Transverse colon                      27
          Hepatic flexure                       9
          Cecum                                 52
          Descending colon                      12
          Sigmoid colon                         11
          Ascending colon                       62
          Splenic flexure                       8

Q26       Present                               15
          Absent                                5

Q27       Signet ring cell adenocarcinoma       4
          Signet ring cell carcinoma            2
          High grade neuroendocrine carcinoma   1
          Mucinous adenocarcinoma               19
          No residual carcinoma                 3
          Adenocarcinoma                        306
          Medullary carcinoma                   3
          Squamous cell carcinoma               1

Q28       Low                                   252

Q30       Yes                                   11
          No                                    320

Q32       PT4a                                  24
          PT4b                                  21
          PT3                                   157
          PT1                                   43
          PT0                                   86

Q33       PN1                                   86
          PN0                                   196
          PN2                                   40
          PNx                                   9

Q35       Absent                                214
          Present                               117

Q37       Absent                                328
          Present                               3
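To make the notion of imbalance in Table 4 concrete, the share of the most frequent answer can be computed per question. The snippet below is a minimal sketch using counts copied from Table 4; the helper name `majority_share` is ours, and the choice of majority share as an imbalance measure is an illustrative assumption rather than a metric used in the study.

```python
# Answer counts copied from Table 4 for a handful of questions.
counts = {
    "Q1": {"Yes": 284, "No": 47},
    "Q22": {"Yes": 2, "No": 329},
    "Q37": {"Absent": 328, "Present": 3},
    "Q32": {"PT4a": 24, "PT4b": 21, "PT3": 157, "PT1": 43, "PT0": 86},
    "Q27": {
        "Signet ring cell adenocarcinoma": 4,
        "Signet ring cell carcinoma": 2,
        "High grade neuroendocrine carcinoma": 1,
        "Mucinous adenocarcinoma": 19,
        "No residual carcinoma": 3,
        "Adenocarcinoma": 306,
        "Medullary carcinoma": 3,
        "Squamous cell carcinoma": 1,
    },
}

def majority_share(answer_counts: dict) -> float:
    """Fraction of responses that fall on the most frequent answer."""
    total = sum(answer_counts.values())
    return max(answer_counts.values()) / total

shares = {q: round(majority_share(c), 3) for q, c in counts.items()}
for question, share in shares.items():
    print(question, share)
```

Under this measure the binary questions Q1, Q22, and Q37 have majority shares of about 0.86, 0.99, and 0.99, the multi-answer question Q32 is far more balanced at about 0.47, and Q27 is dominated by "Adenocarcinoma" at about 0.90, in line with its being named as an exception.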

We tested different numbers of clusters and found that four clusters gave the best result. As Figure 3 (a) shows, Groups zero (green) and three (red) are well separated from the other two groups. Groups one (blue) and two (orange) are entangled, although Group two lies closer to the center and Group one toward the side. We observed that Groups one and three mainly represent the patients who received oxaliplatin, fluorouracil, leucovorin, and cetuximab (i.e., solid nodes). Groups zero and two comprise the remaining patients, separated by their answers to Q35, Q33, Q32, Q16, Q24, and Q17.

Figure 3.


Clustering patients based on DMM and the Group (i.e., topic) explanation.

We summarize the subgroups as follows (see Figure 3 (b)):

Group zero: “Absent” for positive lymph nodes (Q35), “PN0” for regional lymph node involvement (Q33), “PT0” and “PT1” for disease extent (Q32), “Local excision” for extent of resection (Q16), “Ascending colon” for primary site(s) (Q24), and “Laparoscopy” for type of procedure (Q17).

Group one: taking oxaliplatin, fluorouracil, leucovorin, and cetuximab (Q12), and with “PN1” for regional lymph node involvement (Q33).

Group two: with “PN2” and “PN1” for regional lymph node involvement (Q33), “Cecum” for primary site(s) (Q24), “PT4a” for disease extent (Q32), and “Present” for positive lymph nodes (Q35).

Group three: taking oxaliplatin, fluorouracil, leucovorin, and cetuximab (Q12), and with “No” for vital status (Q1).
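For readers unfamiliar with DMM clustering, the following is a minimal sketch of a collapsed Gibbs sampler for a Dirichlet Multinomial Mixture (often called GSDMM), in which each patient is treated as a short "document" and each question-answer pair as a "word". It is not the implementation used in this study (the reference list includes the Java package jLDADMM [21]). The function name `gsdmm`, the hyperparameter defaults, the `Qn=answer` token encoding, and the four toy patients are all assumptions made for illustration and are not study data.

```python
import math
import random

def gsdmm(docs, n_clusters, vocab_size, alpha=0.1, beta=0.1, n_iters=20, seed=0):
    """Collapsed Gibbs sampling for a Dirichlet Multinomial Mixture.

    docs: list of lists of integer token ids (one list per patient).
    Returns one cluster label per document.
    """
    rng = random.Random(seed)
    labels = [rng.randrange(n_clusters) for _ in docs]
    doc_count = [0] * n_clusters                                 # documents per cluster
    token_count = [0] * n_clusters                               # tokens per cluster
    word_count = [[0] * vocab_size for _ in range(n_clusters)]   # token histogram per cluster

    def update(d, k, sign):
        """Add (sign=+1) or remove (sign=-1) document d from cluster k."""
        doc_count[k] += sign
        token_count[k] += sign * len(docs[d])
        for w in docs[d]:
            word_count[k][w] += sign

    for d, k in enumerate(labels):
        update(d, k, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            update(d, labels[d], -1)  # exclude d from its current cluster
            log_p = []
            for k in range(n_clusters):
                # Cluster popularity term; the shared normalizer is omitted.
                lp = math.log(doc_count[k] + alpha)
                seen = {}
                for i, w in enumerate(doc):
                    j = seen.get(w, 0)  # earlier occurrences of w in this doc
                    lp += math.log(word_count[k][w] + beta + j)
                    lp -= math.log(token_count[k] + vocab_size * beta + i)
                    seen[w] = j + 1
                log_p.append(lp)
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            labels[d] = rng.choices(range(n_clusters), weights=weights)[0]
            update(d, labels[d], +1)
    return labels

# Toy patients (invented): each answer becomes one token.
patients = [
    ["Q35=Absent", "Q33=PN0", "Q16=Local excision"],
    ["Q35=Absent", "Q33=PN0", "Q17=Laparoscopy"],
    ["Q35=Present", "Q33=PN2", "Q24=Cecum"],
    ["Q35=Present", "Q33=PN1", "Q24=Cecum"],
]
vocab = {tok: i for i, tok in enumerate(sorted({t for p in patients for t in p}))}
docs = [[vocab[t] for t in p] for p in patients]
labels = gsdmm(docs, n_clusters=4, vocab_size=len(vocab))
print(labels)
```

The returned labels are arbitrary cluster ids, and some of the `n_clusters` clusters may end up empty; in this model `n_clusters` acts as an upper bound on the number of groups. Interpreting a group, as in the summary above, then amounts to inspecting which tokens have the highest counts in each cluster.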

5. Discussion and conclusion

In this study, we designed and developed a framework for capturing common data elements from CRFs in order to identify clinical information needs and to extend a FHIR-based data model as necessary to meet those needs. We developed the corresponding ETL process to generate the FHIR-based representation from the source data, and the proposed model can be extended to handle new data elements and data sources. We also developed two downstream applications that can be adapted (see the readme file on the project GitHub page). The methodology and data model provided in this study support standards-based adaptation of the data elements needed by clinical trial-related applications.

Despite the value of the work demonstrated in this article, a number of issues need further discussion. First, from all the forms in the trial [14], our domain experts (Q.S. and G.J.) selected the four CRFs based on three criteria, 1) importance/representativeness, 2) coverage, and 3) feasibility, for the phases of patient registration, trial conduct, and evaluation. Because our selection relies mostly on the domain experts' understanding of and experience with these criteria, it may not be robust or free of bias when other organizations conduct a similar study. Second, we have not investigated the data management performance (e.g., storage and search) for data represented by the proposed data model, which serves as a data reservoir for clinical trials. Third, our validation covers only the base questions, leaving a gap for the real questions. The answer to a base question in our study serves as the foundation for addressing questions with complicated specifications. For example, the question “has the patient had a documented clinical assessment for this cancer since the submission of the previous follow-up form?” is based on the dates of clinical assessment. We agree that the design of such rules is critical to the success of populating real CRFs. Lastly, the proposed methodology and model are based on element extraction from CRFs and FHIR resources, and are generically adaptable for colorectal cancer trials. However, because CRFs may vary across cancer types and existing cancer models available for adaptation are lacking, it could be more challenging to generalize the approach to other cancers.
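The follow-up-form example above can be sketched as a simple rule over a base answer. The snippet below is a hypothetical illustration only: the function name, the use of plain dates, and the choice to count only assessments dated strictly after the previous form submission are our assumptions, not rules defined in this study.

```python
from datetime import date

def assessed_since_previous_form(assessment_dates, previous_form_date):
    """Answer the real question from the base answer (assessment dates).

    Assumption: an assessment counts only if it is dated strictly after
    the submission date of the previous follow-up form.
    """
    if any(d > previous_form_date for d in assessment_dates):
        return "Yes"
    return "No"

# Hypothetical base answer: dates of documented clinical assessments.
assessments = [date(2019, 1, 10), date(2019, 6, 2)]

print(assessed_since_previous_form(assessments, date(2019, 3, 1)))  # Yes
print(assessed_since_previous_form(assessments, date(2019, 7, 1)))  # No
```

Boundary conventions such as whether an assessment on the submission date itself counts are exactly the kind of specification detail that such rules would need to fix for each real question.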

To address these limitations, we plan the following future work: 1) the development of a systematic method and criteria for CRF selection that cover the important data elements in a trial while ensuring robustness and unbiasedness for adaptation, 2) an exploration of graph-based storage for the graph representation of FHIR (e.g., FHIR RDF [22]), 3) an expansion of the base questions to validate the remaining mappings, together with the development of logical rules to answer real questions from base responses, and 4) an adaptation of the proposed method and the development of cancer models for diverse cancers.

The resulting data model and demonstration application are publicly available on the project GitHub website at https://github.com/BD2KOnFHIR/CancerTrialByFHIR.

Figure 1.


A FHIR-based framework of data modeling for clinical trials and downstream applications.

Summary

  • We extended an existing data model, the Australian Colorectal Cancer Profile (ACP), to capture the data elements extracted from case report forms (CRFs) needed in clinical trials.

  • We populated the data model with both structured and unstructured data from Electronic Health Record (EHR) systems.

  • We explored clinical trial-related downstream applications that can be automated with the utilization of standardized data elements.

Highlights

  • The data elements are captured from cancer clinical trial case report forms (CRFs).

  • A FHIR-based cancer data model is constructed as an extension of an existing cancer profile.

  • A data population application for CRFs using FHIR-based cancer data is developed and evaluated.

  • A patient subgroup discovery application is developed with the FHIR-based cancer data as input.

  • CRFs serve as a proxy for representing information needs for their respective cancer types.

Acknowledgments

This study was supported in part by funding from the NIH BOND (K99 GM135488) and BD2K (U01 HG009450) grants. The authors thank Mr. Grahame Grieve for his guidance on accessing the Australian colorectal cancer profile.

Footnotes

Conflict of Interest

None

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  • 1. Bruland P, McGilchrist M, Zapletal E, et al. Common data elements for secondary use of electronic health record data for clinical trial execution and serious adverse event reporting. BMC medical research methodology 2016;16(1):159.
  • 2. Nahm ML, Pieper CF, Cunningham MM. Quantifying data quality for clinical trials using electronic data capture. PloS one 2008;3(8):e3049.
  • 3. El Emam K, Jonker E, Sampson M, Krleža-Jerić K, Neisa A. The use of electronic data capture tools in clinical trials: Web-survey of 259 Canadian trials. Journal of medical Internet research 2009;11(1):e8.
  • 4. Crabb DW, Bataller R, Chalasani NP, et al. Standard definitions and common data elements for clinical trials in patients with alcoholic hepatitis: recommendation from the NIAAA Alcoholic Hepatitis Consortia. Gastroenterology 2016;150(4):785–90.
  • 5. Ghitza UE, Gore-Langton RE, Lindblad R, Shide D, Subramaniam G, Tai B. Common data elements for substance use disorders in electronic health records: the NIDA Clinical Trials Network experience. Addiction 2013;108(1):3–8.
  • 6. The Common Data Element Dictionary-a standard nomenclature for the reporting of Phase 3 cancer clinical trial data. Proceedings 14th IEEE Symposium on Computer-Based Medical Systems CBMS 2001; 2001. IEEE.
  • 7. Nadkarni PM, Brandt CA. The common data elements for cancer research: remarks on functions and structure. Methods of information in medicine 2006;45(06):594–601.
  • 8. CDISC Published User Guides. 2019. https://www.cdisc.org/standards/therapeutic-areas.
  • 9. HL7 FHIR Implementation Guide: Breast Cancer Data, Release 1 - US Realm (Draft for Comment 2). 2019. http://build.fhir.org/ig/HL7/us-breastcancer/.
  • 10. HL7 Australia Implementation Guide. 2014. http://fhir.hl7.org.au/fhir/rcpa/index.html.
  • 11. Grimes DA, Hubacher D, Nanda K, Schulz KF, Moher D, Altman DG. The Good Clinical Practice guideline: a bronze standard for clinical research. The Lancet 2005;366(9480):172–74.
  • 12. Bellary S, Krishnankutty B, Latha M. Basics of case report form designing in clinical research. Perspectives in clinical research 2014;5(4):159.
  • 13. Zong N, Wen A, Stone DJ, et al. Developing an FHIR-Based Computational Pipeline for Automatic Population of Case Report Forms for Colorectal Cancer Clinical Trials Using Electronic Health Records. JCO Clinical Cancer Informatics 2020;4:201–09.
  • 14. Alberts SR, Sargent DJ, Nair S, et al. Effect of oxaliplatin, fluorouracil, and leucovorin with or without cetuximab on survival among patients with resected stage III colon cancer: a randomized trial. Jama 2012;307(13):1383–93.
  • 15. Kaggal VC, Elayavilli RK, Mehrabi S, et al. Toward a learning health-care system–knowledge delivery at the point of care empowered by big data and NLP. Biomedical informatics insights 2016;8:BII S37977.
  • 16. Cancer Protocol Templates. 2019. https://www.cap.org/cancerprotocols.
  • 17. Srigley JR, McGowan T, MacLean A, et al. Standardized synoptic cancer pathology reporting: A population-based approach. Journal of surgical oncology 2009;99(8):517–24.
  • 18. Brown AS, Patel CJ. A standard database for drug repositioning. Scientific data 2017;4:170029.
  • 19. HL7.org. HL7 FHIR R4. 2018. http://hl7.org/fhir/R4/.
  • 20. Nigam K, McCallum AK, Thrun S, Mitchell T. Text classification from labeled and unlabeled documents using EM. Machine learning 2000;39(2–3):103–34.
  • 21. Nguyen DQ. jLDADMM: A Java package for the LDA and DMM topic models. arXiv preprint arXiv:1808.03835 2018.
  • 22. FHIR RDF Specification. 2016. http://w3c.github.io/hcls-fhir-rdf/spec/.
