Abstract
Background and Objective:
Identifying and standardizing the data elements used in clinical trials may control and reduce cost and errors during the operational process, and enable seamless data exchange between electronic data capture (EDC) systems and electronic health record (EHR) systems. This study presents a methodology to comprehensively capture the data element needs of clinical trials.
Materials and Methods:
Case report forms (CRFs) for clinical trial data collection were used to approximate the clinical information needs, and each information need was then mapped to a semantically equivalent field within an existing FHIR cancer profile. Items without a semantically equivalent field were considered information needs that cannot be represented in current standards, and we proposed extensions to support them.
Results:
We successfully identified 62 discrete items from a preliminary survey of 43 base questions in four CRFs used in colorectal cancer clinical trials, in which 28 items are modeled with FHIR extensions and their associated responses for colorectal cancer. We achieved promising results in the data population of the CRFs with average Precision 98.5%, Recall 96.2%, and F-measure 96.8% for all base questions. We also demonstrated the auto-filled answers in CRFs can be used to discover patient subgroups using a topic modeling approach.
Conclusion:
CRFs can be considered as a proxy for representing information needs for their respective cancer types. Mining the information needs can serve as a valuable resource for expanding existing standards to ensure they can comprehensively represent relevant clinical data without loss of granularity.
Keywords: HL7 FHIR, Colorectal Cancer, Clinical Trials, Case Report Forms, Patient Subgrouping
1. Introduction
Clinical trials are experiments and observations designed to study the response of human participants to biomedical or behavioral interventions. Patient data are needed throughout the stages of a clinical trial (e.g., planning, conduct, and evaluation). An enormous volume of records from Electronic Health Record (EHR) systems needs to be reviewed and processed to capture the data in Electronic Data Capture (EDC) systems. This tedious, error-prone, and costly process [1, 2] raises the need for an efficient way to improve data exchange and communication between EDC and EHR systems [2, 3]. Identifying and standardizing the important data elements is one of the best approaches to control and reduce the cost and errors created during the operational process, as those elements can simplify the data exchange between the systems [1]. Researchers have therefore been motivated to identify common data elements in clinical trials across a diverse set of therapeutic cases [1, 4–6]. For example, the National Cancer Institute (NCI) has developed common data elements originating from the reports of Phase 3 cancer trials to standardize data representation and facilitate data interchange between research organizations [6, 7]. These pre-specified data elements are often employed in multiple data models with standardized frameworks. Such efforts include the cancer profiles developed by the Clinical Data Interchange Standards Consortium (CDISC) [8], the Clinical Information Modeling Initiative (CIMI) community [9], and the Royal College of Pathologists of Australasia (RCPA) / HL7 Australia [10].
Case report forms (CRFs) are questionnaires that collect the information regulated by study protocols, and they approximate the information needs of scientists answering hypothetical questions [11, 12]. By extension, we hypothesize that the data elements necessary to comprehensively represent any given cancer can potentially be captured by mining the CRFs of associated clinical trials. This knowledge can then be used to construct a more comprehensive and standardized cancer data model that represents the data elements needed for the trials, which naturally fills the gap between data consumers (e.g., EDC systems) and providers (e.g., EHR systems) and improves data exchange and communication in support of diverse downstream applications.
The objective of this study is thus to extract common data elements from CRFs and to build a data model that inherits from and reuses existing resources (e.g., data models and CRFs) to better address the needs of EDC systems for cancer trials. In a previous study [13], we demonstrated that adopting an existing cancer profile for modeling pathological reports, the Australian Colorectal Cancer Profile (ACP) [10] based on the Fast Healthcare Interoperability Resources (FHIR), can enable automated data population of CRFs. However, ACP is designed to model pathological reports (e.g., synoptic reports), and the previous study covered only a limited number of questions for the demonstration. Therefore, a data model that can handle diverse data sources (e.g., patient, diagnosis, medication, surgical, order, lab test, and pathological reports) and provides sufficient coverage to address the comprehensive questions from multiple CRFs should be studied. In this study, we focused on the following two aspects as new contributions: 1) we extended ACP to comprehensively cover more data elements of both structured data (e.g., lab tests) and unstructured data (e.g., surgical reports) from EHR systems needed in CRFs, and 2) we explored more downstream applications based on the standardized data elements of the extended data model. As a proof of concept, we conducted a case study based on the CRFs of a real colorectal cancer trial in the Alliance for Clinical Trials in Oncology (Alliance) [14]. We built a colorectal cancer data model by exploring the autonomous capture of the data elements from 71 questions in four CRFs, and populated the model with the medical records of 331 colon cancer patients from Mayo Clinic. We developed two downstream applications for the evaluation: 1) data population of CRFs, and 2) patient subgrouping.
We achieved promising results with an average precision of 98.5%, recall of 96.2%, and F-measure of 96.8% for the 43 base questions generalized from CRFs. We demonstrated the subgrouping of the patients based on the auto-filled answers to support the efficient allocation of the cohorts for analysis.
2. Materials
We extracted the patient information from the Mayo Clinic data warehouse known as the Unified Data Platform (UDP) [15]. Two types of data, structured and unstructured, were sourced. For structured data, we collected information about patients, diagnoses, medications, surgical procedures, orders, and lab tests. For unstructured data, we obtained surgical and pathological reports to collect the cancer-related data. In practice, we used a semi-structured form known as a synoptic report as our main source of pathological data elements, with the full-form original pathological reports as a supplement. A synoptic report is a templated pathology report that follows the College of American Pathologists (CAP) guidance [16] on the inclusion of data elements and the general definition of templated values [17]. The protocol for clinical data access was approved by the Mayo Clinic Institutional Review Board.
RCPA has adopted structured cancer reporting to build standard data models [10], in which a colorectal cancer model known as ACP represents the data elements in pathological reports [18]. With a logical model, ACP defines a set of concepts and their values mainly through five core elements: “preAnalytic”, “macro”, “micro”, “ancillaryTests”, and “synthesisOverview”. The elements in the logical model are further represented with the FHIR resources DiagnosticReport and Observation to form a FHIR-based data model. ACP was adopted in our previous study [13]. In this study, we extended ACP with the data elements extracted from four CRFs of a real-world colorectal cancer trial, which studies how patients with resected stage III colon cancer are affected by the drugs oxaliplatin, fluorouracil, and leucovorin with or without cetuximab [14]. The four CRFs are the registration/randomization eligibility checklist, adjuvant on-study, anatomic pathology review, and follow-up forms, containing 71 questions in total.
3. Method
We captured important data elements based on the information needs in CRFs and constructed a FHIR-based data model that extends ACP to facilitate EDC for downstream applications. The framework consists of three main steps. First, both structured and unstructured data were extracted from the UDP. For the unstructured data (e.g., surgical and pathological reports), each data element and its values were harvested. For synoptic reports, a total of 25 data elements, such as primary tumor, were obtained directly. For structured data, data elements such as RESULT_VAL were obtained from the database schema. Second, we collected CRFs to capture the commonly used data elements from all the questions, and we developed a data model that extends ACP to organize these elements. The data elements were further analyzed to identify whether their source in the EHR is structured or unstructured data. Lastly, the data model was populated with the values obtained in the extract, transform, and load (ETL) process, using manually created mappings, to form a FHIR-based data profile.
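As a rough illustration of harvesting data elements from a synoptic report, the sketch below reads label and value pairs from report text. It assumes a hypothetical layout in which every data element sits on its own line as `Label: value`; the actual report layout and the extraction code used in this study are not described here, and `parse_synoptic` is an illustrative name only.

```python
import re

# Hypothetical layout: one "Label: value" pair per line of the synoptic report.
FIELD_PATTERN = re.compile(r"\s*([^:]+?)\s*:\s*(.+?)\s*$")

def parse_synoptic(text):
    """Return a dict of {label: value} for every 'Label: value' line."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_PATTERN.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields

# Made-up report text; the labels are taken from the synoptic elements named in Table 1.
report = """Tumor Site: Cecum
Histologic Grade: Low grade
Primary tumor: pT3
"""
elements = parse_synoptic(report)
```

Lines without a colon are skipped, so free-text sections of a report would need separate handling.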
3.1. Consolidating a FHIR-based data model
To tackle colorectal cancer, we adapted the ACP colorectal cancer profile as the base model from which to construct our logical model for data representation. As Figure 2 shows, “PreAnalytic”, “Macro”, “Micro”, “AncillaryTests”, and “SynthesisOverview” are adapted (highlighted in the orange box in Figure 2). “PreAnalytic” represents the information collected prior to specimen receipt at the laboratory. “Macro”, “Micro”, and “AncillaryTests” cover the macroscopic, microscopic, and ancillary test findings, respectively. “SynthesisOverview” is used to record synthesis information. To enrich “PreAnalytic”, “Macro”, and “Micro”, we added sub-elements captured as common data elements from CRFs (green box), such as “newPrimary” for new primary tumor information and “recurrence” for tumor recurrence information. In addition, for the 28 captured common data elements (green box) that cannot be modeled with the ACP colorectal cancer profile, we created new elements in the logical model. Specifically, “LaboratoryTest” is created for lab tests, “Medication” for treatment, and “Surgery” for surgical information.
Figure 2.
The proposed data model for colorectal cancer-related trials. The data elements inherited from the original ACP are in orange boxes, the new in green boxes, and adopted in red boxes.
To represent our logical model, we adopted FHIR resources to capture the concepts and value sets defined in the logical model. The mappings for the atomic data elements of the original ACP model are inherited directly (http://hl7.org.au/fhir/rcpa/cmap.html#summary). The newly developed elements are directly represented (highlighted in the red box in Figure 2) by the attributes defined in the Resources of FHIR Release 4 (R4) [19].
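To make the FHIR representation concrete, the following sketch builds a minimal FHIR R4 Observation for a lab-test element as a plain Python dictionary. It is an illustration only: the helper name `lab_observation`, the patient identifier, the free-text code, and the value are all hypothetical, and the profile's actual extensions and terminology bindings are not reproduced.

```python
import json

def lab_observation(patient_id, code_text, value, unit, collected):
    """Build a minimal FHIR R4 Observation as a plain dict (illustrative only)."""
    return {
        "resourceType": "Observation",
        "status": "final",                              # required by FHIR R4
        "code": {"text": code_text},                    # required; free text, no coded concept assumed
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": collected,                 # e.g. the lab collection date
        "valueQuantity": {"value": value, "unit": unit},
    }

# Hypothetical values standing in for one absolute neutrophil count result.
obs = lab_observation(
    patient_id="example-001",
    code_text="Absolute Neutrophil Count",
    value=2.1,
    unit="10*9/L",
    collected="2018-05-14",
)
print(json.dumps(obs, indent=2))
```

A real profile would typically bind `code` to a standard or local terminology rather than free text.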
3.2. Data population based on the proposed model
To populate the data for CRFs, we established mappings between the data elements (i.e., schema) in the source datasets (i.e., database tables and synoptic reports) and the atomic data elements defined in the proposed data model. In total, we identified 62 mappings that link the cancer model elements with the schema across the eight sources. As Table 1 shows, the originally defined atomic data elements of ACP are mainly designed to represent pathological information and are thus mapped to the elements in synoptic reports; for example, Colorectal.micro.involvedMargins is mapped to “Surgical Margins”. The extended elements in the proposed model are designed to represent the information from medication orders and from surgical, radiological, and lab testing results. In practice, we implemented simple logical rules to obtain the values that populate the data model with patient records. For example, Colorectal.laboratoryTest.absoluteNeutrophilCount.value is obtained from lab tests with either standard or locally used concept codes referring to neutrophil count tests.
Table 1.
Map source atomic data elements to the proposed cancer model.
Cancer Model Element | Source | Element |
---|---|---|
Colorectal.micro.colonscopyAssessmentDate | Orders Table (UDP) | ORDER_DATE (ORDER_NAME-“colonscopy”) |
Colorectal.micro.extramuralTumourDeposits1 | Synoptic Report | Tumor Deposits |
Colorectal.micro.extramuralVeinInvasion1 | Synoptic Report | Lymphovascular Invasion |
Colorectal.micro.histoConfDistMetastases1 | Synoptic Report | Distant Metastasis |
Colorectal.micro.histoConfDistMetastasesSite1 | Synoptic Report | Distant Metastasis |
Colorectal.micro.histologicalGrade | Synoptic Report | Histologic Grade |
Colorectal.micro.hostLymphoidResponse | Pathology Report | DIAGNOSIS |
Colorectal.micro.intramuralVeinInvasion1 | Synoptic Report | Lymphovascular Invasion |
Colorectal.micro.involvedMargins1 | Synoptic Report | Surgical Margins |
Colorectal.micro.lymphNodeInvolvement1 | Synoptic Report | Lymphovascular Invasion |
Colorectal.micro.lymphNodesDetails.numExamined | Synoptic Report | Number examined (total) |
Colorectal.micro.lymphNodesDetails.numPos | Synoptic Report | Number involved (total) |
Colorectal.micro.marginsMicroClearance1 | Synoptic Report | Surgical Margins |
Colorectal.micro.maxDegreeLocalInvasion | Synoptic Report | Microscopic Tumor Extension |
Colorectal.micro.neoadjuvantTherapy1 | Synoptic Report | Treatment Effect |
Colorectal.micro.nonperitonealisedCircumMargin1 | Synoptic Report | Surgical Margins |
Colorectal.micro.perineuralInvasion1 | Synoptic Report | Perineural Invasion |
Colorectal.micro.polypDetails1 | Synoptic Report | Type of Polyp Tumor Arises From |
Colorectal.micro.proximalOrDistalResectionMargins1 | Synoptic Report | Surgical Margins |
Colorectal.micro.smallVesselInvasion1 | Synoptic Report | Lymphovascular Invasion |
Colorectal.micro.tumourType | Synoptic Report | Histologic Type |
Colorectal.micro.venousSmallVesselInvasion1 | Synoptic Report | Lymphovascular Invasion |
Colorectal.macro.depositNumber | Synoptic Report | Tumor Deposits |
Colorectal.macro.intactnessOfMesorectum1 | Synoptic Report | Macroscopic Intactness of Mesorectum |
Colorectal.macro.invasion | Radiology Table (UDP) | RADIOLOGY_TEST_DESCRIPTION |
 | Synoptic Report | Microscopic Tumor Extension |
Colorectal.macro.maxTumourDiameter | Synoptic Report | Tumor Size |
Colorectal.macro.distNonperitonCircumMargin | Synoptic Report | Surgical Margins |
Colorectal.macro.natureAndSiteOfBlocks | Pathology Report | BLOCK SUMMARY |
Colorectal.macro.otherMacroComments1 | Synoptic Report | Specimen |
Colorectal.macro.polyps | Diagnosis Table (UDP) | DIAGNOSIS_NAME=(“intestinal polyposis syndrome”, “gastrointestinal polyposis syndrome”) |
Colorectal.macro.tumourPerforation | Synoptic Report | Macroscopic Tumor Perforation |
Colorectal.macro.tumourSite | Synoptic Report | Tumor Site |
Colorectal.preAnalytic.adherence | Synoptic Report | Microscopic Tumor Extension |
Colorectal.preAnalytic.clinicalAssessmentDate | Diagnosis Table (UDP) | DIAGNOSIS_DATE |
Colorectal.preAnalytic.newPrimary | Radiology Table (UDP) | RADIOLOGY_REPORT |
Colorectal.preAnalytic.newPrimaryDate | Radiology Table (UDP) | RADIOLOGY_DATE |
Colorectal.preAnalytic.recurrence | Synoptic Report | Microscopic Tumor Extension / Comment |
Colorectal.preAnalytic.recurrenceDate | Synoptic Report | NOTE_DATE |
Colorectal.preAnalytic.tumourLocation | Synoptic Report | Tumor Site |
Colorectal.preAnalytic.typeOfOperation1 | Synoptic Report | Procedure |
Colorectal.preAnalytic.clinicalObstruction | Synoptic Report | GROSS DESCRIPTION |
Colorectal.synthesisOverview.tumourStageM1 | Synoptic Report | Distant Metastasis |
Colorectal.synthesisOverview.tumourStageN | Synoptic Report | Regional lymph nodes |
Colorectal.synthesisOverview.tumourStageT | Synoptic Report | Primary tumor |
Colorectal.synthesisOverview.tumourStagingSystem1 | Synoptic Report | Pathologic Staging (AJCC, 7th edition) |
Colorectal.synthesisOverview.overarchingComment | Synoptic Report | Comment |
Colorectal.laboratoryTest.absoluteNeutrophilCount.date1 | Lab-test Table (UDP) | LAB_COLLECTION_DATE |
Colorectal.laboratoryTest.absoluteNeutrophilCount.value | Lab-test Table (UDP) | RESULT_VAL(LAB_DESCRIPTION=“Neutrophils” | “Neutrophils Absolute” | “Absolute Neutrophil Count”) |
Colorectal.laboratoryTest.bilirubin.date1 | Lab-test Table (UDP) | LAB_COLLECTION_DATE |
Colorectal.laboratoryTest.bilirubin.value | Lab-test Table (UDP) | RESULT_VAL(LAB_DESCRIPTION=“Bilirubin” | “Bilirubin S”) |
Colorectal.laboratoryTest.creatinine.date1 | Lab-test Table (UDP) | LAB_COLLECTION_DATE |
Colorectal.laboratoryTest.creatinine.value | Lab-test Table (UDP) | RESULT_VAL(LAB_DESCRIPTION=“Creatinine” | “Creatinine S” | “Creatinine P” | “Creatinine U”) |
Colorectal.laboratoryTest.Hgb.date1 | Lab-test Table (UDP) | LAB_COLLECTION_DATE |
Colorectal.laboratoryTest.Hgb.value | Lab-test Table (UDP) | RESULT_VAL(LAB_DESCRIPTION=“Hgb” | “Hemoglobin”) |
Colorectal.laboratoryTest.plateletCount.date1 | Lab-test Table (UDP) | LAB_COLLECTION_DATE |
Colorectal.laboratoryTest.plateletCount.value | Lab-test Table (UDP) | RESULT_VAL(LAB_DESCRIPTION=“Platelet” | “Platelet Count” | “Platelet Estimate”) |
Colorectal.laboratoryTest.serumPregnancy.date | Lab-test Table (UDP) | LAB_COLLECTION_DATE |
Colorectal.laboratoryTest.serumPregnancy.value | Lab-test Table (UDP) | RESULT_VAL(LAB_DESCRIPTION=“HCG” | “Pregnancy Test”) |
Colorectal.surgery.resectionExtent | Surgical Procedures Table (UDP) | SURGICAL_PROCEDURE_DESCRIPTION=“Biopsy” | “Polypectomy” | “Excision” | “Colectomy” | “Resection” |
Colorectal.surgery.type | Surgical Procedures Table (UDP) | SURGICAL_PROCEDURE_DESCRIPTION=“Laparoscopy” | “Open Approach” |
Colorectal.surgery.date | Surgical Procedures Table (UDP) | SURGICAL_PROCEDURE_DATE |
Colorectal.subject.vitalStatus | Patient Table (UDP) | PATIENT_DECEASED_FLAG |
Colorectal.medication.treatment.code | Orders Table (UDP) | ORDER_DESCRIPTION-“Leucovorin” | “Fluorouracil” | “Oxaliplatin” | “Cetuximab” |
Colorectal.medication.treatment.unit1 | Orders Table (UDP) | ORDER_DOSE_UNITS |
Colorectal.medication.treatment.value1 | Orders Table (UDP) | ORDER_DOSE_AMOUNT |
1 Elements marked with a trailing “1” are mappings that cannot be validated by the 43 base questions in Table 3.
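The kind of simple logical rule used for these mappings can be sketched as follows for the absolute neutrophil count row of Table 1. The column names and lab descriptions come from the table; the function name, the sample rows, and the choice to keep the most recent matching result are assumptions made for illustration, not the study's documented rule.

```python
# Lab descriptions that Table 1 associates with the absolute neutrophil count.
ANC_DESCRIPTIONS = {"Neutrophils", "Neutrophils Absolute", "Absolute Neutrophil Count"}

def map_absolute_neutrophil_count(lab_rows):
    """Map lab-test rows to the two ANC elements of the cancer model.

    Assumption: when several rows match, the most recent one is used.
    Dates are ISO strings, so string comparison orders them correctly.
    """
    matches = [row for row in lab_rows if row["LAB_DESCRIPTION"] in ANC_DESCRIPTIONS]
    if not matches:
        return None
    latest = max(matches, key=lambda row: row["LAB_COLLECTION_DATE"])
    return {
        "Colorectal.laboratoryTest.absoluteNeutrophilCount.value": latest["RESULT_VAL"],
        "Colorectal.laboratoryTest.absoluteNeutrophilCount.date": latest["LAB_COLLECTION_DATE"],
    }

# Made-up rows shaped like the Lab-test Table (UDP) columns named in Table 1.
rows = [
    {"LAB_DESCRIPTION": "Neutrophils Absolute", "RESULT_VAL": 1.8, "LAB_COLLECTION_DATE": "2017-03-02"},
    {"LAB_DESCRIPTION": "Hemoglobin", "RESULT_VAL": 11.4, "LAB_COLLECTION_DATE": "2017-03-02"},
    {"LAB_DESCRIPTION": "Neutrophils", "RESULT_VAL": 2.4, "LAB_COLLECTION_DATE": "2017-06-20"},
]
mapped = map_absolute_neutrophil_count(rows)
```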
3.3. EDC-based downstream applications for colorectal cancer
We implemented the proposed framework on the clinical records of 331 Mayo Clinic patients from 2013 to 2019, identified by a search using colorectal cancer-related ICD-9 codes and filtered to comply with the research authorization policies of Mayo Clinic. For these patients, we collected 1226 synoptic reports. Two downstream applications were developed to evaluate the model: 1) data population of CRFs, and 2) patient subgrouping based on the populated CRFs.
3.3.1. Application (1) – data population of CRFs
We summarized the 57 questions in all the CRFs [14] to remove redundancy, resulting in 43 base questions. For example, “Primary Site(s)” in the form “adjuvant on-study” and “Primary Site(s)” in the form “anatomic pathology review” are generalized into the question Q(24) “Primary Site(s)”. We mapped the questions to the atomic data elements of the proposed data model to enable their population. For example, to answer the question Q(16) “Extent of resection”, we mapped it to Colorectal.surgery.resectionExtent. The detailed mapping is available in Table 2. Note that, in practice, we generated the raw questions for the application to reduce the bias that may be caused by the different annotating experiences of the domain experts in the evaluation.
Table 2.
Map base questions of CRFs to the data elements in the proposed cancer model.
ID | Question | Value | Element |
---|---|---|---|
1 | Patient’s vital status | Yes / No (Patient is alive) | Colorectal.subject.vitalStatus |
2 | Hemoglobin (Hgb) | Yes / No (Hemoglobin >= 9 g/dL) | Colorectal.laboratoryTest.hgb.value |
3 | Absolute neutrophil count | Yes / No (Absolute neutrophil count >= LNL) | Colorectal.laboratoryTest.absoluteNeutrophilCount.value |
4 | Absolute neutrophil count LNL | Quantitative | Colorectal.laboratoryTest.absoluteNeutrophilCount.LNL |
5 | Creatinine | Yes / No (Creatinine <= 1.5 × UNL) | Colorectal.laboratoryTest.creatinine.value |
6 | Creatinine UNL | Quantitative | Colorectal.laboratoryTest.creatinine.UNL |
7 | Platelet count | Yes / No (Platelet count >= 100,000/uL) | Colorectal.laboratoryTest.plateletCount.value |
8 | Total bilirubin | Yes / No (Total bilirubin <= 1.5 × UNL) | Colorectal.laboratoryTest.bilirubin.value |
9 | Total bilirubin UNL | Quantitative | Colorectal.laboratoryTest.bilirubin.UNL |
10 | Negative serum pregnancy test | Positive / Negative | Colorectal.laboratoryTest.serumPregnancy.value |
11 | Negative serum pregnancy test date | DateTime | Colorectal.laboratoryTest.serumPregnancy.date |
12 | Assigned treatment (medication) | Oxaliplatin / Fluorouracil / Leucovorin / Cetuximab | Colorectal.medication.treatment.code |
13 | Associated diseases | Yes / No (is polyposis syndrome) | Colorectal.macro.polyps |
14 | Clinical assessment date | DateTime | Colorectal.preAnalytic.clinicalAssessmentDate |
15 | Colonoscopy Date | DateTime | Colorectal.micro.colonoscopyAssessmentDate |
16 | Extent of resection | Biopsy / Polypectomy / Bowel resection / Local excision / Indeterminate | Colorectal.surgery.resectionExtent |
17 | Type of procedure | Open approach / Laparoscopic | Colorectal.surgery.type |
18 | Surgery date | DateTime | Colorectal.surgery.date |
19 | Site of pathologically Confirmed invasion | Bladder/ Prostate/ Vagina/ Liver/ Seminal vesicles/ Pelvic (other than above)/ Ovary/ Ureter/ Peritoneum/ Uterus | Colorectal.macro.invasion |
20 | New primary cancer or MDS (myelodysplastic syndrome) | Yes / No | Colorectal.preAnalytic.newPrimary |
21 | Date of diagnosis for new primary cancer | DateTime | Colorectal.preAnalytic.newPrimaryDate |
22 | First progression (or recurrence) | Yes / No | Colorectal.preAnalytic.recurrence |
23 | Date of first recurrence or progression | DateTime | Colorectal.preAnalytic.recurrenceDate |
24 | Primary site(s) | Cecum / Transverse colon / Sigmoid colon / Ascending colon / Splenic flexure / Hepatic flexure / Descending colon | Colorectal.preAnalytic.tumourLocation/Colorectal.macro.tumourSite |
25 | Tumor size | Narrative | Colorectal.macro.maxTumourDiameter |
26 | Bowel perforation | Present / Absent | Colorectal.macro.tumourPerforation |
27 | Histologic type | Signet ring cell adenocarcinoma / Signet ring cell carcinoma / High grade neuroendocrine carcinoma / Mucinous adenocarcinoma / No residual carcinoma / Adenocarcinoma / Medullary carcinoma / Squamous cell carcinoma | Colorectal.micro.tumourType |
28 | Histology | High (poorly differentiated or undifferentiated) / Low (well or moderately differentiated) | Colorectal.micro.histologicalGrade |
29 | Comments | Narrative | Colorectal.synthesisOverview.overarchingComment |
30 | Adherence | Yes / No | Colorectal.preAnalytic.adherence |
31 | Number of deposits | Quantitative | Colorectal.macro.depositNumber |
32 | Disease extent | Tumor invades submucosa (PT1) / Tumor invades muscularis propria (PT2) / Tumor invades through the muscularis propria into the subserosa, or into nonperitonealized pericolic or perirectal tissue (PT3) / The tumor has grown into the surface of the visceral peritoneum, which means it has grown through all layers of the colon (PT4a) / The tumor has grown into or has attached to other organs or structures (PT4b) / Primary tumor cannot be assessed (TX) | Colorectal.micro.maxDegreeLocalInvasion / Colorectal.synthesisOverview.tumourStageT |
33 | Regional lymph node involvement | No regional lymph node metastases (PN0) / Metastases in 1 to 3 regional lymph nodes (PN1) / Metastases in 4 or more regional lymph nodes (PN2) / Regional lymph nodes cannot be assessed (PNX) | Colorectal.synthesisOverview.tumourStageN |
34 | Number of lymph nodes examined | Quantitative | Colorectal.micro.lymphNodesDetails.numExamined |
35 | Positive lymph nodes | Absent / Present | Colorectal.micro.lymphNodesDetails.numPos |
36 | Distance to closest longitudinal margin | Narrative | Colorectal.macro.distNonperitonCircumMargin |
37 | Bowel obstruction | Absent / Present | Colorectal.preAnalytic.clinicalObstruction |
38 | Blocks | Narrative | Colorectal.macro.natureAndSiteOfBlocks |
39 | Stools | Yes / No, patient has a colostomy/ileostomy | Colorectal.preAnalytic.stool |
40 | Multiple primary malignant tumors? | Yes / No | Colorectal.macro.maligantTumorNumber |
41 | Deposits type | Discrete/ Irregular/ Both discrete and irregular | Colorectal.macro.depositType |
42 | Residual adjacent adenoma? | Yes / No | Colorectal.macro.residualAdjacentAdenoma |
43 | Host lymphoid response (check all that apply) | Crohn’s like (2 or more lymphoid aggregates per slide, often associated with germinal centers adjacent to tumor) / Peritumoral, mild (distinct rim or cap of lymphocytes at tumor-parenchyma interface) / Intratumoral, marked (>4 tumor infiltrating lymphocytes/HPF) | Colorectal.micro.hostLymphoidResponse |
To evaluate the quality of the generated answers, we randomly split the patients into seven groups and asked seven subject matter experts (N.Z., Y.Y., M.M., A.W., D.S., S.L., and D.S.) majoring in medical informatics to answer the base questions based on the patient records. The seven experts were approved for data access by the Mayo Clinic Institutional Review Board. Ten randomly selected patient records were annotated by all reviewers, and the inter-rater reliability kappa scores were calculated. For each question, standard answers were generated from reliable annotators (average kappa score of 0.96); the annotations of experts with low kappa inter-rater reliability scores were filtered out if they were determined to be mostly inaccurate upon review. Note that the annotations of the ten randomly selected patients were merged into the evaluation following the same filtering rule. The results were evaluated with the following metrics: Precision, Recall, and F-measure.
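Several base questions in Table 2 reduce to a threshold check on a lab value. The sketch below derives the Yes / No answers for Q2, Q3, Q5, Q7, and Q8 from those thresholds; the function name, the dictionary keys, and the sample values are hypothetical, and units are assumed to match the thresholds as written in the table.

```python
def answer_lab_questions(labs):
    """Return {question ID: "Yes"/"No"} for the threshold questions of Table 2.

    `labs` holds one value per test plus the lower/upper normal limits
    (LNL/UNL); all key names are illustrative.
    """
    def yes_no(condition):
        return "Yes" if condition else "No"

    return {
        2: yes_no(labs["hgb_g_dl"] >= 9),                              # Hemoglobin >= 9 g/dL
        3: yes_no(labs["anc"] >= labs["anc_lnl"]),                     # ANC >= LNL
        5: yes_no(labs["creatinine"] <= 1.5 * labs["creatinine_unl"]), # Creatinine <= 1.5 x UNL
        7: yes_no(labs["platelets_per_ul"] >= 100_000),                # Platelets >= 100,000/uL
        8: yes_no(labs["bilirubin"] <= 1.5 * labs["bilirubin_unl"]),   # Bilirubin <= 1.5 x UNL
    }

# Made-up lab values for one patient.
answers = answer_lab_questions({
    "hgb_g_dl": 10.2,
    "anc": 1.2, "anc_lnl": 1.5,
    "creatinine": 1.0, "creatinine_unl": 1.2,
    "platelets_per_ul": 90_000,
    "bilirubin": 0.8, "bilirubin_unl": 1.2,
})
```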
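For reference, the three metrics can be computed as in the sketch below. The exact counting rules of the evaluation are not spelled out in the text, so this assumes that a generated answer counts as correct when it equals the expert's standard answer for the same patient; `precision_recall_f` and `f_measure` are illustrative names.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f(generated, reference):
    """Compare generated answers with standard answers, both keyed by patient.

    Assumption: an answer is a true positive when it matches the standard
    answer exactly; patients missing from `generated` lower recall only.
    """
    true_pos = sum(1 for key, value in generated.items() if reference.get(key) == value)
    precision = true_pos / len(generated) if generated else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    return precision, recall, f_measure(precision, recall)

# Toy example with made-up answers for four patients.
generated = {"p1": "Yes", "p2": "No", "p3": "Yes"}
reference = {"p1": "Yes", "p2": "Yes", "p3": "Yes", "p4": "No"}
p, r, f = precision_recall_f(generated, reference)
```

As a consistency check, `f_measure(0.909, 0.769)` rounds to 0.833 and `f_measure(1.0, 0.333)` rounds to 0.500, which agrees with the F values reported for Q30 and Q22 in Table 3.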
3.3.2. Application (2) – discovery of patient subgroups
With the proposed FHIR-based data model, we can extract the data elements and values for each patient to generate the subgroups with the patients sharing the same clinical features. In practice, we standardized and utilized the raw answers with the categorical values of the base questions generated in Application (1) as features to explore the patient cohorts based on the patient subgrouping.
We adopted the one-topic-per-document Dirichlet Multinomial Mixture (DMM) model [20] to cluster the patients. DMM is a topic model designed for short texts, which assumes that each document belongs to exactly one topic. Specifically, we modeled each patient as $p$ and each categorical answer to a base question as $a_i$, and reformulated the DMM model as follows: a topic $z_p$ is drawn for each patient as $z_p \sim \mathrm{Multinomial}(\theta)$, where $\theta \sim \mathrm{Dirichlet}(\alpha)$, and each categorical answer is drawn as $a_i \sim \mathrm{Multinomial}(\phi_{z_p})$, where $\phi_{z_p} \sim \mathrm{Dirichlet}(\beta)$. In practice, we implemented this task with the jLDADMM library [21].
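To show how a one-topic-per-document mixture assigns each patient to a single group, the following is a minimal collapsed Gibbs sampler for DMM written from the standard formulation of the model. It is not the jLDADMM implementation; the function name, hyperparameter defaults, and the toy patients (each a list of "question=answer" tokens) are assumptions for illustration.

```python
import math
import random
from collections import Counter

def fit_dmm(docs, n_topics, alpha=0.1, beta=0.1, n_iters=30, seed=0):
    """Collapsed Gibbs sampling for a one-topic-per-document DMM.

    docs: one token list per patient, e.g. ["Q1=Yes", "Q17=Open approach"].
    Returns one topic index per document.
    """
    rng = random.Random(seed)
    vocab_size = len({tok for doc in docs for tok in doc})
    z = [rng.randrange(n_topics) for _ in docs]           # current topic of each document
    docs_in_topic = [0] * n_topics                         # documents assigned to each topic
    tokens_in_topic = [0] * n_topics                       # total tokens assigned to each topic
    token_counts = [Counter() for _ in range(n_topics)]    # per-topic token frequencies

    def update(d, k, sign):
        """Add (sign=+1) or remove (sign=-1) document d from topic k."""
        docs_in_topic[k] += sign
        tokens_in_topic[k] += sign * len(docs[d])
        for tok in docs[d]:
            token_counts[k][tok] += sign

    for d, k in enumerate(z):
        update(d, k, +1)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            update(d, z[d], -1)                            # exclude d from the counts
            log_p = []
            for k in range(n_topics):
                lp = math.log(docs_in_topic[k] + alpha)    # topic popularity term
                seen = Counter()
                for i, tok in enumerate(doc):              # token likelihood term
                    lp += math.log(token_counts[k][tok] + beta + seen[tok])
                    lp -= math.log(tokens_in_topic[k] + vocab_size * beta + i)
                    seen[tok] += 1
                log_p.append(lp)
            top = max(log_p)                               # stabilise before exponentiating
            weights = [math.exp(lp - top) for lp in log_p]
            z[d] = rng.choices(range(n_topics), weights=weights)[0]
            update(d, z[d], +1)
    return z

# Toy patients with made-up answers.
patients = [["Q1=Yes", "Q17=Open approach"]] * 3 + [["Q1=No", "Q17=Laparoscopy"]] * 3
labels = fit_dmm(patients, n_topics=2)
```

Because the sampler is stochastic, the group indices depend on the seed; only the partition of patients into groups is meaningful.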
4. Results
4.1. Population of CRFs
We evaluated the data model for the downstream application of generating the response for the base questions. An average F-measure of 0.968 was obtained as shown in Table 3. The data elements of the proposed model are collected from the following two parts, 1) the elements that existed in ACP and 2) elements that are required and inferred from the questions of CRFs. We found that 27 mappings cannot be validated by the base questions (refer to Table 1). We also failed to generate answers or consistently annotate Questions (39–42) due to a lack of sufficient data to identify and extract the corresponding elements from the target data sources.
Table 3.
Precision, Recall, and F-measure of the automatically generated answers for the base questions.
ID | Question | P | R | F | ID | Question | P | R | F |
---|---|---|---|---|---|---|---|---|---|
1 | Patient’s vital status | 1.000 | 1.000 | 1.000 | 23 | Date of first recurrence or progression | 1.000 | 1.000 | 1.000 |
2 | Hemoglobin (Hgb) | 1.000 | 1.000 | 1.000 | 24 | Primary site(s)2 | 1.000 | 1.000 | 1.000 |
3 | Absolute neutrophil count | 1.000 | 1.000 | 1.000 | 25 | Tumor size | 1.000 | 1.000 | 1.000 |
4 | Absolute neutrophil count LNL | 1.000 | 1.000 | 1.000 | 26 | Bowel perforation2 | 1.000 | 1.000 | 1.000 |
5 | Creatinine | 1.000 | 1.000 | 1.000 | 27 | Histologic type | 1.000 | 1.000 | 1.000 |
6 | Creatinine UNL | 1.000 | 1.000 | 1.000 | 28 | Histology | 1.000 | 1.000 | 1.000 |
7 | Platelet count | 1.000 | 1.000 | 1.000 | 29 | Comments2 | 1.000 | 0.955 | 0.977 |
8 | Total bilirubin | 1.000 | 1.000 | 1.000 | 30 | Adherence | 0.909 | 0.769 | 0.833 |
9 | Total bilirubin UNL | 1.000 | 1.000 | 1.000 | 31 | Number of deposits | 1.000 | 0.997 | 0.998 |
10 | Negative serum pregnancy test | 1.000 | 1.000 | 1.000 | 32 | Disease extent2 | 1.000 | 0.994 | 0.997 |
11 | Negative serum pregnancy test date | 1.000 | 1.000 | 1.000 | 33 | Regional lymph node involvement | 1.000 | 0.995 | 0.997 |
12 | Assigned treatment (medication) | 1.000 | 1.000 | 1.000 | 34 | Number of lymph nodes examined2 | 1.000 | 0.997 | 0.999 |
13 | Associated diseases | 1.000 | 0.952 | 0.976 | 35 | Positive lymph nodes2 | 1.000 | 0.989 | 0.994 |
14 | Clinical assessment date | 1.000 | 1.000 | 1.000 | 36 | Distance to closest longitudinal margin | 1.000 | 0.697 | 0.822 |
15 | Colonoscopy Date | 0.997 | 1.000 | 0.998 | 37 | Bowel obstruction2 | 1.000 | 1.000 | 1.000 |
16 | Extent of resection | 0.890 | 0.973 | 0.930 | 38 | Blocks | 1.000 | 1.000 | 1.000 |
17 | Type of procedure2 | 0.997 | 0.985 | 0.991 | 39 | Stools | - | - | - |
18 | Surgery date2 | 0.997 | 1.000 | 0.998 | 40 | Multiple primary malignant tumors? | - | - | - |
19 | Site of pathologically Confirmed invasion | 1.000 | 1.000 | 1.000 | 41 | Deposits type | - | - | - |
20 | New primary cancer or MDS (myelodysplastic syndrome) | 0.833 | 1.000 | 0.909 | 42 | Residual adjacent adenoma? | - | - | - |
21 | Date of diagnosis for new primary cancer | 0.833 | 1.000 | 0.909 | 43 | Host lymphoid response | 0.946 | 0.889 | 0.917 |
22 | First progression (or recurrence) | 1.000 | 0.333 | 0.500 | | Overall Average | 0.985 | 0.962 | 0.968 |
2 Questions marked with a trailing “2” were re-evaluated with the newly, randomly selected patients in this study, and thus their results are slightly different from our previous work [13].
On structured portions of the evaluation, there were no discrepancies found that would indicate any issues with the underlying data such as mis-entered values of the wrong scale, and the generated answers yielded perfect results. For unstructured portions of the record, excerpts pertaining to the questions might be stated multiple times in the record with slight variances in wording (e.g. benign vs negative) or with a slightly different interpretation of the underlying facts -- a count of 15 negative lymph nodes might be noted as a count of 18 in a later section of the same synoptic report. Annotators often had stylistic or personal differences that might point to an identical locality of the record but with varying start or end windows for the selection of annotation span, which causes the majority of the discrepancies in results.
4.2. Discovery of patient subgroups
We have listed the distribution of the categorical answers for the base questions in Table 4. Most questions with binary answers (e.g., Yes vs. No) have highly imbalanced distribution (e.g., Q1, Q2, Q3, Q5, Q7, Q8, Q13, Q17, Q22, Q30, Q37). For the questions with multiple answers, the answers are much more balanced except Q16 and Q27.
Table 4.
Distribution of the categorical answers for the base questions.
| Question | Value | Frequency | Question | Value | Frequency |
|---|---|---|---|---|---|
| Q1 | Yes | 284 | Q19 | Peritoneum | 9 |
| | No | 47 | | Ovary | 1 |
| | | | | Vagina | 2 |
| | | | | Bladder | 3 |
| | | | | Pelvic | 1 |
| | | | | Prostate | 2 |
| Q2 | Yes | 307 | Q22 | Yes | 2 |
| | No | 23 | | No | 329 |
| Q3 | Yes | 38 | Q24 | Transverse colon | 27 |
| | No | 267 | | Hepatic flexure | 9 |
| | | | | Cecum | 52 |
| | | | | Descending colon | 12 |
| | | | | Sigmoid colon | 11 |
| | | | | Ascending colon | 62 |
| | | | | Splenic flexure | 8 |
| Q5 | Yes | 319 | Q26 | Present | 15 |
| | No | 11 | | Absent | 5 |
| Q7 | Yes | 319 | Q27 | Signet ring cell adenocarcinoma | 4 |
| | No | 11 | | Signet ring cell carcinoma | 2 |
| | | | | High grade neuroendocrine carcinoma | 1 |
| | | | | Mucinous adenocarcinoma | 19 |
| | | | | No residual carcinoma | 3 |
| | | | | Adenocarcinoma | 306 |
| | | | | Medullary carcinoma | 3 |
| | | | | Squamous cell carcinoma | 1 |
| Q8 | Yes | 275 | Q28 | Low | 252 |
| | No | 6 | | | |
| Q10 | Positive | 1 | Q30 | Yes | 11 |
| | Negative | 3 | | No | 320 |
| Q12 | Oxaliplatin | 74 | Q32 | PT4a | 24 |
| | Fluorouracil | 77 | | PT4b | 21 |
| | Leucovorin | 66 | | PT3 | 157 |
| | Cetuximab | 3 | | PT1 | 43 |
| | | | | PT0 | 86 |
| Q13 | Yes | 17 | Q33 | PN1 | 86 |
| | No | 314 | | PN0 | 196 |
| | | | | PN2 | 40 |
| | | | | PNx | 9 |
| Q16 | Bowel resection | 208 | Q35 | Absent | 214 |
| | Local excision | 84 | | Present | 117 |
| | Biopsy | 1 | | | |
| Q17 | Laparoscopy | 52 | Q37 | Absent | 328 |
| | Open approach | 237 | | Present | 3 |
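One simple way to quantify the imbalance described for Table 4 is the share of responses taken by the most frequent answer. The sketch below computes that share for a few questions using frequencies copied from the table; the `majority_share` helper is a hypothetical name introduced here and is not a measure used in the study.

```python
def majority_share(freqs):
    """Fraction of all recorded answers that fall on the most frequent value."""
    total = sum(freqs.values())
    return max(freqs.values()) / total

# Frequencies copied from Table 4
table4 = {
    "Q1": {"Yes": 284, "No": 47},
    "Q22": {"Yes": 2, "No": 329},
    "Q24": {
        "Transverse colon": 27, "Hepatic flexure": 9, "Cecum": 52,
        "Descending colon": 12, "Sigmoid colon": 11,
        "Ascending colon": 62, "Splenic flexure": 8,
    },
    "Q27": {
        "Signet ring cell adenocarcinoma": 4, "Signet ring cell carcinoma": 2,
        "High grade neuroendocrine carcinoma": 1, "Mucinous adenocarcinoma": 19,
        "No residual carcinoma": 3, "Adenocarcinoma": 306,
        "Medullary carcinoma": 3, "Squamous cell carcinoma": 1,
    },
}

for question, freqs in table4.items():
    print(f"{question}: majority share = {majority_share(freqs):.3f}")
```

A share close to 1 (as for Q22 and Q27) indicates that nearly all patients gave the same answer, whereas Q24 spreads its answers across several primary sites.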
We tested different numbers of clusters and found that four gave the best result. As Figure 3 (a) shows, Groups zero (green) and three (red) are well separated from the other two groups. Although Groups one (blue) and two (orange) are entangled, Group two lies closer to the center and Group one lies more to the side. We observed that Groups one and three mainly represent the patients receiving oxaliplatin, fluorouracil, leucovorin, and cetuximab (i.e., solid nodes). Groups zero and two comprise the remaining patients, separated by their answers to Q35, Q33, Q32, Q16, Q24, and Q17.
Figure 3.
Clustering patients based on DMM and the Group (i.e., topic) explanation.
We summarize the subgroups as follows (see Figure 3 (b)):
Group zero: “Absent” for positive lymph nodes (Q35), “PN0” for regional lymph node involvement (Q33), “PT0” and “PT1” for disease extent (Q32), “Local excision” for extent of resection (Q16), “Ascending colon” for primary site(s) (Q24), and “Laparoscopy” for type of procedure (Q17).
Group one: taking oxaliplatin, fluorouracil, leucovorin, and cetuximab (Q12), and with “PN1” for regional lymph node involvement (Q33).
Group two: with “PN2” and “PN1” for regional lymph node involvement (Q33), “Cecum” for primary site(s) (Q24), “PT4a” for disease extent (Q32), and “Present” for positive lymph nodes (Q35).
Group three: taking oxaliplatin, fluorouracil, leucovorin, and cetuximab (Q12), and with “No” for vital status (Q1).
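For readers unfamiliar with DMM, the following is a minimal sketch of a Dirichlet Multinomial Mixture fitted by collapsed Gibbs sampling, in which each patient is a "document" assigned to exactly one topic (group). It is not the implementation used in this study, which relied on an existing DMM package [21]; the token encoding (e.g., "Q35=Absent"), the toy patients, and the hyperparameters `alpha`, `beta`, and `n_iters` are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter

def gsdmm(docs, n_topics=4, alpha=0.1, beta=0.1, n_iters=30, seed=0):
    """Collapsed Gibbs sampling for a DMM: one topic per document."""
    rng = random.Random(seed)
    vocab_size = len({tok for doc in docs for tok in doc})
    doc_counts = [Counter(doc) for doc in docs]
    topic_docs = [0] * n_topics                        # documents per topic
    topic_tokens = [0] * n_topics                      # tokens per topic
    topic_word = [Counter() for _ in range(n_topics)]  # token counts per topic

    def move(d, k, sign):
        """Add (sign=+1) or remove (sign=-1) document d from topic k."""
        topic_docs[k] += sign
        topic_tokens[k] += sign * len(docs[d])
        for tok, c in doc_counts[d].items():
            topic_word[k][tok] += sign * c

    assign = [rng.randrange(n_topics) for _ in docs]
    for d, k in enumerate(assign):
        move(d, k, +1)

    for _ in range(n_iters):
        for d in range(len(docs)):
            move(d, assign[d], -1)
            log_p = []
            for k in range(n_topics):
                # Unnormalized log probability of assigning document d to topic k
                lp = math.log(topic_docs[k] + alpha)
                for tok, c in doc_counts[d].items():
                    for j in range(c):
                        lp += math.log(topic_word[k][tok] + beta + j)
                for i in range(len(docs[d])):
                    lp -= math.log(topic_tokens[k] + vocab_size * beta + i)
                log_p.append(lp)
            peak = max(log_p)
            weights = [math.exp(lp - peak) for lp in log_p]
            assign[d] = rng.choices(range(n_topics), weights=weights)[0]
            move(d, assign[d], +1)
    return assign, topic_word

# Toy patients encoded as question=answer tokens (hypothetical data)
patients = (
    [["Q35=Absent", "Q33=PN0", "Q16=Local excision"]] * 10
    + [["Q35=Present", "Q33=PN2", "Q24=Cecum"]] * 10
)
groups, topic_word = gsdmm(patients, n_topics=4)
print(Counter(groups))
```

The most frequent tokens in each entry of `topic_word` play the role of the group explanations summarized above, for example through `topic_word[k].most_common(3)`.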
5. Discussion and conclusion
In this study, we designed and developed a framework for capturing common data elements from CRFs in order to identify clinical information needs and to extend a FHIR-based data model as necessary to meet those needs. We developed the corresponding ETL process to generate the FHIR-based representation from the source data, and the proposed model can be extended to handle new data elements and data sources. We also developed two downstream applications for adaptation (see the readme file on the project GitHub page). The methodology and data model provided in this study are intended to make the data elements needed by clinical trial-related applications available in a standardized, adaptable form.
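As a rough illustration of what representing a CRF item through a FHIR extension can look like, the sketch below builds a FHIR R4 Observation as a plain Python dictionary and attaches one extension. The extension URL, the item slug, the example answer, and the helper names are hypothetical and are not taken from the published model; only the general resource shape (`resourceType`, `status`, `code`, `subject`, and an `extension` entry with `url` and a `value[x]` element) follows FHIR R4 [19].

```python
import json

# Hypothetical base URL for illustration only; not a published profile
EXT_BASE = "http://example.org/fhir/StructureDefinition"

def make_extension(item_slug, value):
    """Build a simple FHIR extension element carrying a string value."""
    return {"url": f"{EXT_BASE}/{item_slug}", "valueString": value}

def crf_item_to_observation(patient_id, item_label, item_slug, answer):
    """Wrap a CRF answer with no equivalent standard field in an Observation."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"text": item_label},
        "subject": {"reference": f"Patient/{patient_id}"},
        "extension": [make_extension(item_slug, answer)],
    }

# Hypothetical patient identifier and answer
observation = crf_item_to_observation(
    patient_id="example-001",
    item_label="Host lymphoid response",
    item_slug="host-lymphoid-response",
    answer="Mild to moderate",
)
print(json.dumps(observation, indent=2))
```

Keeping the CRF-specific value inside an extension leaves the base resource valid for systems that do not know the extension, which can simply ignore it.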
Despite the value of this work, several issues need further discussion. First, from all the forms in the trial [14], our domain experts (Q.S. and G.J.) selected the four CRFs based on three criteria, 1) importance/representativeness, 2) coverage, and 3) feasibility, for the phases of patient registration, trial conduct, and evaluation. Because the selection relies mostly on the domain experts' understanding of and experience with these criteria, it may not be robust or free of bias when other organizations conduct a similar study. Second, we have not investigated the management performance (e.g., storage and search) of data represented by the proposed data model, which serves as a data reservoir for clinical trials. Third, our validation covers only the base questions, leaving a gap for the real questions. The answer to a base question in our study serves as the foundation for answering questions with more complicated specifications. For example, the question “has the patient had a documented clinical assessment for this cancer since the submission of the previous follow-up form?” is answered from the dates of clinical assessment. We agree that the design of such rules is critical to the successful population of real CRFs. Lastly, the proposed methodology and model are based on element extraction from CRFs and FHIR resources, which is generically adaptable for colorectal cancer trials. However, because CRFs vary across cancer types and existing cancer models for adaptation are lacking, generalizing the approach to other cancers could be more challenging.
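The follow-up question quoted above can be sketched as a rule over base responses. The function name, the example dates, and the choice to treat "since" as strictly after the previous form date are assumptions; the study does not specify these rules.

```python
from datetime import date

def assessed_since_previous_form(assessment_dates, previous_form_date):
    """Answer the real question from the base response (clinical assessment dates).

    Returns "Yes" if any assessment is dated strictly after the previous
    follow-up form was submitted, otherwise "No".
    """
    if any(d > previous_form_date for d in assessment_dates):
        return "Yes"
    return "No"

# Hypothetical base responses for one patient
assessments = [date(2019, 1, 15), date(2019, 7, 2)]
print(assessed_since_previous_form(assessments, date(2019, 4, 1)))
print(assessed_since_previous_form(assessments, date(2019, 8, 1)))
```

Whether an assessment recorded on the same day as the previous form should count is the kind of specification detail that such rules would need to settle for each CRF.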
To address these limitations, we plan the following future work: 1) the development of a systematic method and criteria for CRF selection that cover the important data elements in a trial while ensuring robustness and unbiasedness for adaptation, 2) an exploration of graph-based storage for the graph representation of FHIR (e.g., FHIR RDF [22]), 3) an expansion of the base questions to validate the remaining mappings, together with the development of logical rules to answer real questions from base responses, and 4) an adaptation of the proposed method and the development of cancer models for diverse cancers.
The resulting data model and demonstration application are publicly available on the project GitHub page at https://github.com/BD2KOnFHIR/CancerTrialByFHIR.
Figure 1.
A FHIR-based framework of data modeling for clinical trials and downstream applications.
Summary
We extended an existing data model, the Australian Colorectal Cancer Profile (ACP), to capture the data elements extracted from case report forms (CRFs) needed in clinical trials.
We populated the data model with both structured and unstructured data from Electronic Health Record (EHR) systems.
We explored clinical trial-related downstream applications that can be automated with the utilization of standardized data elements.
Highlights
The data elements are captured from cancer clinical trial case report forms (CRFs).
A FHIR-based cancer data model is constructed as an extension of an existing cancer profile.
A data population application for CRFs using FHIR-based cancer data is developed and evaluated.
A patient subgroup discovery application is developed with the FHIR-based cancer data as input.
CRFs serve as a proxy for representing information needs for their respective cancer types.
Acknowledgments
This study is supported in part by funding from the NIH BOND (K99 GM135488) and BD2K (U01 HG009450) grants. The authors thank Mr. Grahame Grieve for his guidance on accessing the Australian colorectal cancer profile.
Footnotes
Conflict of Interest
None
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
References
1. Bruland P, McGilchrist M, Zapletal E, et al. Common data elements for secondary use of electronic health record data for clinical trial execution and serious adverse event reporting. BMC medical research methodology 2016;16(1):159.
2. Nahm ML, Pieper CF, Cunningham MM. Quantifying data quality for clinical trials using electronic data capture. PloS one 2008;3(8):e3049.
3. El Emam K, Jonker E, Sampson M, Krleža-Jerić K, Neisa A. The use of electronic data capture tools in clinical trials: Web-survey of 259 Canadian trials. Journal of medical Internet research 2009;11(1):e8.
4. Crabb DW, Bataller R, Chalasani NP, et al. Standard definitions and common data elements for clinical trials in patients with alcoholic hepatitis: recommendation from the NIAAA Alcoholic Hepatitis Consortia. Gastroenterology 2016;150(4):785–90.
5. Ghitza UE, Gore-Langton RE, Lindblad R, Shide D, Subramaniam G, Tai B. Common data elements for substance use disorders in electronic health records: the NIDA Clinical Trials Network experience. Addiction 2013;108(1):3–8.
6. The Common Data Element Dictionary-a standard nomenclature for the reporting of Phase 3 cancer clinical trial data. Proceedings 14th IEEE Symposium on Computer-Based Medical Systems CBMS 2001; 2001. IEEE.
7. Nadkarni PM, Brandt CA. The common data elements for cancer research: remarks on functions and structure. Methods of information in medicine 2006;45(06):594–601.
8. CDISC Published User Guides. 2019. https://www.cdisc.org/standards/therapeutic-areas.
9. HL7 FHIR Implementation Guide: Breast Cancer Data, Release 1 - US Realm (Draft for Comment 2). 2019. http://build.fhir.org/ig/HL7/us-breastcancer/.
10. HL7 Australia Implementation Guide. 2014. http://fhir.hl7.org.au/fhir/rcpa/index.html.
11. Grimes DA, Hubacher D, Nanda K, Schulz KF, Moher D, Altman DG. The Good Clinical Practice guideline: a bronze standard for clinical research. The Lancet 2005;366(9480):172–74.
12. Bellary S, Krishnankutty B, Latha M. Basics of case report form designing in clinical research. Perspectives in clinical research 2014;5(4):159.
13. Zong N, Wen A, Stone DJ, et al. Developing an FHIR-Based Computational Pipeline for Automatic Population of Case Report Forms for Colorectal Cancer Clinical Trials Using Electronic Health Records. JCO Clinical Cancer Informatics 2020;4:201–09.
14. Alberts SR, Sargent DJ, Nair S, et al. Effect of oxaliplatin, fluorouracil, and leucovorin with or without cetuximab on survival among patients with resected stage III colon cancer: a randomized trial. JAMA 2012;307(13):1383–93.
15. Kaggal VC, Elayavilli RK, Mehrabi S, et al. Toward a learning health-care system–knowledge delivery at the point of care empowered by big data and NLP. Biomedical informatics insights 2016;8:BII S37977.
16. Cancer Protocol Templates. 2019. https://www.cap.org/cancerprotocols.
17. Srigley JR, McGowan T, MacLean A, et al. Standardized synoptic cancer pathology reporting: A population-based approach. Journal of surgical oncology 2009;99(8):517–24.
18. Brown AS, Patel CJ. A standard database for drug repositioning. Scientific data 2017;4:170029.
19. HL7.org. HL7 FHIR R4. 2018. http://hl7.org/fhir/R4/.
20. Nigam K, McCallum AK, Thrun S, Mitchell T. Text classification from labeled and unlabeled documents using EM. Machine learning 2000;39(2–3):103–34.
21. Nguyen DQ. jLDADMM: A Java package for the LDA and DMM topic models. arXiv preprint arXiv:1808.03835 2018.
22. FHIR RDF Specification. 2016. http://w3c.github.io/hcls-fhir-rdf/spec/.