Abstract
Automated extraction of patient trial eligibility for clinical research studies can increase enrollment while reducing the time and money spent. We have developed a modular trial eligibility pipeline that includes patient-batched processing and an internal webservice backed by a uimaFIT pipeline. This work is the first part of a multi-phase approach that will add note-batched processing, the ability to query trials matching patients or patients matching trials, and an external alignment engine to connect patients to trials.
Keywords: Clinical Trial, Natural Language Processing, Machine Learning
Introduction
The success or failure of clinical research studies depends heavily on patient enrollment. Achieving sufficient levels of enrollment is an expensive endeavor in terms of both time and money [1]. Unfortunately, patients are commonly not enrolled in relevant trials because of their doctor’s lack of awareness rather than because the patient fails to meet eligibility criteria [2]. Natural Language Processing (NLP) and machine learning (ML) can be used to automatically extract relevant evidence from a patient’s electronic health record (EHR) regarding their eligibility to participate in a clinical research study. Importantly, NLP and ML can glean these details from both the structured fields (easily accessible through common database queries) and the unstructured clinical notes, which often lie untouched. Automated eligibility criteria extraction has been shown to significantly decrease enrollment time [3].
Building on our prior experience developing an eligibility criteria extraction engine for breast cancer trials [4] and competing in the 2018 n2c2 Shared Task on Cohort Selection [5][6], we have found a gap in the discussion and documentation of trial eligibility approaches, especially with respect to the balance between component reuse and the customization needed to accommodate an ever-changing list of trials. We are in the first of three phases to evaluate and distribute a modular architecture for extracting trial eligibility criteria from EHRs.
Modularity comes in various forms and abstraction levels in the architecture. First, within the clinical research domain, new trials are constantly introduced and old trials are phased out. As such, the architecture needs to accommodate a formalism for describing criteria and their relationship to trials to allow trials to be easily added and removed from the monitoring cycle without heavy re-engineering of the core application.
Second, the constant turnover of trials requires that the extraction engine be able to accommodate new concepts. As such, there must exist modularity in the engine itself to allow for the easy extension of the core application.
Third, treating the patient as an individual is the ultimate goal of care, but documentation of that care occurs in more quantal units. As such, the architecture needs to be able to integrate extractions from individual notes (i.e., these quantal units) and structured data into a single picture of eligibility.
We propose that, just as automated trial eligibility surveillance can decrease enrollment time and costs over fully manual efforts, trial eligibility surveillance systems balancing a stable core application against the three needs described above decrease development time and costs over fully bespoke applications. For the first phase, we have focused on developing towards the first two needs, as described below.
Methods
Figure 1 shows the high-level architecture of our initial phase implementation. At the heart of our architecture is an Apache UIMA pipeline [7]. UIMA (Unstructured Information Management Architecture) is designed to be a highly modular system for processing documents. With uimaFIT, specific modules can be activated for a given document as determined on-the-fly based on general program arguments or properties of the document itself. Several clinical data-oriented information extraction tools already exist for UIMA, including cTAKES (clinical Text Analysis and Knowledge Extraction System) [8], which we also partly use in our application. This system extracts eligibility criteria from clinical notes and aligns those criteria with trials for optimal pairing of patient to trial. (Further details can be seen in Figures 2 and 3, below.)
Figure 1. High-Level Architecture of a Service-Based Pipeline Treating Patients as the Primary Pivot
Figure 2. Six Conceptual Classes of Pipeline Modules in Use
Figure 3. Configurable paths through uimaFIT. Light gray modules are not used by default but can be called parametrically.
The most significant deviation from a standard UIMA pipeline is our use of the OMOP Common Data Model (CDM) [9] as our underlying data model for eligibility criteria. The two implications of this formalism are shown in Figure 2. First, we developed a module to standardize all other concept representations created by upstream modules into an OMOP CDM representation.
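To make the standardization step concrete, the following is a minimal sketch written in Python for brevity rather than in the Java/uimaFIT code of the system itself. It assumes that upstream modules emit annotations tagged with a source vocabulary and code, and that a lookup table maps those pairs onto OMOP concept identifiers. The class names, the `CONCEPT_MAP` table, the source codes, and the concept ids are all hypothetical placeholders, not real OMOP identifiers and not the module described above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class UpstreamAnnotation:
    """A concept mention as some upstream module might emit it (assumed shape)."""
    source_vocabulary: str
    source_code: str
    text: str
    negated: bool = False


@dataclass(frozen=True)
class OmopConcept:
    """Simplified stand-in for an OMOP CDM record, not the full CDM schema."""
    domain: str
    concept_id: int
    source_text: str
    negated: bool


# Hypothetical mapping table. Codes and ids are placeholders only.
CONCEPT_MAP = {
    ("UMLS", "CUI_PLACEHOLDER_1"): ("Condition", 900001),
    ("LOCAL", "marker_positive"): ("Measurement", 900002),
}


def to_omop(annotation: UpstreamAnnotation) -> Optional[OmopConcept]:
    """Map one upstream annotation to the common representation, or None if unmapped."""
    key = (annotation.source_vocabulary, annotation.source_code)
    mapped = CONCEPT_MAP.get(key)
    if mapped is None:
        return None
    domain, concept_id = mapped
    return OmopConcept(domain, concept_id, annotation.text, annotation.negated)


def normalize(annotations: Iterable[UpstreamAnnotation]) -> List[OmopConcept]:
    """Normalize a batch, silently dropping annotations with no known mapping."""
    converted = (to_omop(a) for a in annotations)
    return [c for c in converted if c is not None]
```

Dropping unmapped annotations is a choice made for this sketch; a production module might instead log them or keep them under a catch-all concept.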
Second, we developed eligibility criteria aggregator modules for each trial, which programmatically filter documents just as OHDSI’s ATLAS tool allows users to define cohorts. The parallels between these implementations are intended to eventually allow for migration of definitions between the two.
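A per-trial aggregator of this kind can be pictured as a set of inclusion and exclusion rules applied to the normalized concepts found in a document. The sketch below is in Python for brevity, whereas the system itself is a Java uimaFIT pipeline. The `TrialCriteria` class, the trial name, the criterion names, and the concept ids are hypothetical, and the any-of/none-of logic is an assumed simplification of what a cohort definition can express.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass
class TrialCriteria:
    """Declarative description of one trial (hypothetical formalism)."""
    trial_id: str
    # criterion name -> concept ids; any one present satisfies the criterion
    inclusion: Dict[str, FrozenSet[int]]
    # criterion name -> concept ids; any one present triggers the exclusion
    exclusion: Dict[str, FrozenSet[int]]


def evaluate(trial: TrialCriteria, document_concepts: Set[int]) -> dict:
    """Report, criterion by criterion, whether a document's concepts fit a trial."""
    inclusion_met = {
        name: bool(ids & document_concepts)
        for name, ids in trial.inclusion.items()
    }
    exclusion_hit = {
        name: bool(ids & document_concepts)
        for name, ids in trial.exclusion.items()
    }
    # Eligible only if every inclusion criterion is met and no exclusion fires.
    eligible = all(inclusion_met.values()) and not any(exclusion_hit.values())
    return {
        "trial": trial.trial_id,
        "eligible": eligible,
        "inclusion": inclusion_met,
        "exclusion": exclusion_hit,
    }
```

Because each trial is plain data, adding or retiring a trial in this sketch means adding or removing one `TrialCriteria` instance and leaves `evaluate` untouched.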
A webservice sits in front of our UIMA NLP application to allow users to batch process patient records. As an early and partial implementation of our system, we currently only support treating a document as a self-contained representation of a patient. A patient’s eligibility is determined based on the contents of a single document rather than on the accumulation of evidence over time.
Results
A simple webservice was developed in Java using the Spring framework. It accepts a single document or a batch of documents such that each document contains the entire collection of notes associated with a given patient.
General guidance on processing documents can be passed to the engine via the parameters of document-type and annotators-list. The former is used when the document fits squarely into a predefined, limited set of known document types (e.g., ‘cte’ type documents are processed as per the cancer trial eligibility pipeline we developed prior to this work). The latter allows the user to specify additional modules to run on the documents beyond the standard core pipeline. Figure 3 depicts a simplified view into several common flows through the pipeline.
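How the two parameters might combine into a module list can be sketched as follows. This is an illustrative Python sketch, not the Java uimaFIT code. Apart from the ‘cte’ document type and the two parameter names taken from the text, every module name and the `CORE_PIPELINE`, `TYPE_PIPELINES`, and `OPTIONAL_MODULES` tables are hypothetical.

```python
from typing import Iterable, List, Optional

# Hypothetical module names standing in for the analysis engines in the pipeline.
CORE_PIPELINE = ["sentence_splitter", "concept_extractor", "omop_normalizer"]

# Predefined flows for known document types (only 'cte' comes from the text).
TYPE_PIPELINES = {
    "cte": CORE_PIPELINE + ["cte_aggregator"],
}

# Modules that are off by default but may be requested per call.
OPTIONAL_MODULES = {"section_detector", "negation_detector"}


def build_pipeline(
    document_type: Optional[str] = None,
    annotators_list: Iterable[str] = (),
) -> List[str]:
    """Return the ordered module names to run for one request."""
    if document_type is None:
        modules = list(CORE_PIPELINE)
    elif document_type in TYPE_PIPELINES:
        modules = list(TYPE_PIPELINES[document_type])
    else:
        raise ValueError(f"unknown document-type: {document_type!r}")

    for name in annotators_list:
        if name not in OPTIONAL_MODULES:
            raise ValueError(f"unknown annotator: {name!r}")
        # Appending at the end is a simplification; a real pipeline would
        # need to respect ordering dependencies between modules.
        if name not in modules:
            modules.append(name)
    return modules
```

For example, `build_pipeline("cte", ["negation_detector"])` yields the predefined ‘cte’ flow with one extra module appended.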
Because processing time is unpredictable, the webservice does not block while results are being computed. Instead, the user must re-query the webservice with a provided batch-key (which uniquely identifies a batch) to retrieve a JSON file containing the breakdown of eligibility by trial and criterion for each patient.
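The submit-then-poll interaction can be sketched with an in-memory job store. This Python sketch is an illustration under stated assumptions rather than the Spring service itself: the `BatchService` class, its method names, the status strings, and the shape of the JSON payload are all hypothetical, and a thread pool stands in for the NLP pipeline.

```python
import json
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class BatchService:
    """Accepts batches, returns a key immediately, and reports results on request."""

    def __init__(self, process: Callable[[str], dict], workers: int = 2):
        self._process = process  # stand-in for running the pipeline on one document
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs: Dict[str, List[Future]] = {}

    def submit(self, documents: List[str]) -> str:
        """Queue every document and return a batch-key without waiting."""
        batch_key = uuid.uuid4().hex
        self._jobs[batch_key] = [
            self._pool.submit(self._process, doc) for doc in documents
        ]
        return batch_key

    def poll(self, batch_key: str) -> str:
        """Return a JSON string with the status and, once finished, the results."""
        futures = self._jobs.get(batch_key)
        if futures is None:
            return json.dumps({"status": "unknown_batch"})
        if not all(f.done() for f in futures):
            return json.dumps({"status": "processing"})
        # Results keep submission order. An exception raised while processing
        # a document would propagate from result() in this sketch.
        return json.dumps(
            {"status": "done", "patients": [f.result() for f in futures]}
        )

    def shutdown(self) -> None:
        """Wait for queued work to finish and release the worker threads."""
        self._pool.shutdown(wait=True)
```

A caller would invoke `submit` once, keep the returned key, and call `poll` until the status is no longer `processing`.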
Discussion
The next phase of development will be to accumulate extracted information about a patient across notes and structured data and to improve the pipeline’s flexibility. Webservice query parameters cannot yet fully control the system. Finally, trial eligibility criteria are currently determined programmatically. Ideally, this alignment between patient information and trial would be done externally and be less rigidly defined (e.g., through a spreadsheet, which we have experimented with through OHDSI’s ATLAS, or using machine learning algorithms).
We have consolidated the development wisdom from separate applications built for similar tasks within the domain of eligibility surveillance. The shared components will reduce development and upkeep costs and will help us better clarify our abstract representations of trial criteria, as experimented with using OHDSI’s ATLAS tool.
Acknowledgements
Work in part supported by pilot research funding, Hollings Cancer Center’s Cancer Center Support Grant P30 CA138313, by NIH/NCATS 5UL1TR001450-03, and by the SmartState endowment.
References
- [1] Sung NS, Crowley WF Jr., Genel M, et al., Central challenges facing the national clinical research enterprise, JAMA. 289 (2003) 1278–1287.
- [2] Somkin CP, Ackerson L, Husson G, et al., Effect of medical oncologists’ attitudes on accrual to clinical trials in a community setting, J. Oncol. Pract 9 (2013) e275–83.
- [3] Penberthy L, Brown R, Puma F, and Dahman B, Automated matching software for clinical trials eligibility: measuring efficiency and flexibility, Contemp. Clin. Trials 31 (2010) 207–217.
- [4] Meystre SM, Heider PM, Kim Y, et al., Automatic Trial Eligibility Surveillance: Pilot Study Focused on Breast Cancer, in: AMIA Summits, San Francisco, CA, USA, (forthcoming).
- [5] N2C2: National NLP Clinical Challenges, (n.d.). https://n2c2.dbmi.hms.harvard.edu/ (accessed November 19, 2018).
- [6] Heider PM, Kim Y, AAlAbdulsalam AK, et al., Hybrid Approaches for Automated Clinical Trial Cohort Selection, in: 2018 N2c2 Shar Task Work., San Francisco, CA, USA, 2018.
- [7] Ferrucci D, and Lally A, UIMA: An architectural approach to unstructured information processing in the corporate research environment, Nat. Lang. Eng 10 (2004) 327–348.
- [8] Savova G, Masanz J, Ogren P, et al., Mayo Clinic Clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications, JAMIA. 17 (2010) 507–513.
- [9] Hripcsak G, Duke JD, Shah NH, et al., Observational Health Data Sciences and Informatics (OHDSI): Opportunities for Observational Researchers, Stud. Health Technol. Inform 216 (2015) 574–578.



