AMIA Summits on Translational Science Proceedings
2024 May 31;2024:125–134.

Automating Clinical Trial Matches Via Natural Language Processing of Synthetic Electronic Health Records and Clinical Trial Eligibility Criteria

Victor M Murcia 1,2, Vinod Aggarwal 3,4, Nikhil Pesaladinne 5, Ram Thammineni 5, Nhan Do 1, Gil Alterovitz 2, Rafael B Fricks 1,2
PMCID: PMC11141802  PMID: 38827083

Abstract

Clinical trials are critical to many medical advances; however, recruiting patients remains a persistent obstacle. Automated clinical trial matching could expedite recruitment across all trial phases. We detail our initial efforts towards automating the matching process by linking realistic synthetic electronic health records to clinical trial eligibility criteria using natural language processing methods. We also demonstrate how the Sørensen-Dice Index can be adapted to quantify match quality between a patient and a clinical trial.

Introduction

Clinical trials are a key step in advancing medical practices by performing controlled tests of new ways to treat, detect, or prevent a variety of diseases and conditions1,2. They have led to more effective medications3, better accessibility to care4, and enhanced imaging techniques5. At each stage of a clinical trial, patients are key stakeholders; yet, despite the numerous benefits, clinical trials struggle to meet their patient recruitment goals. This leads to delays, increased operational costs, underpowered results, and study cancellations6. Many factors contribute to recruitment obstacles, including socioeconomic or cultural factors and the complexity of medical science. One of the top factors is that both patients and providers may not be aware of ongoing clinical trials6. Matching patients to clinical trials requires dedicated effort to review patient records for potentially eligible candidates, which is a laborious, time-consuming, and manual process. There is a need to match more quickly, accurately, and efficiently, while ensuring that patient health data is securely handled. By improving trial matching, we increase the options available to patients, reduce the burden on healthcare providers, and accelerate trial-mediated clinical research7 and drug discovery8.

Medical records contain a wealth of information about the history of a patient’s health, including demographics, medicines, medical problems, test results, immunizations, and allergies, among many others9. The development of electronic health records (EHRs) in recent decades not only transformed how patients and providers access that information but also opened the door to programmatic access. Furthermore, the use of medical codes and ontologies such as the International Classification of Diseases, Tenth Revision (ICD-10)10, Current Procedural Terminology (CPT)11, Logical Observation Identifiers Names and Codes (LOINC)12, and Anatomical Therapeutic Chemical (ATC)13, amongst others, has brought consistency in expressing concepts in health records that allows for the development of robust applications. As such, EHRs constitute a major source of empirical data that is key to both advancing and accelerating clinical and biomedical research. The potential of programmatic EHR access has not been fully realized for clinical trials. For example, one of the most widely utilized clinical trial registries, clinicaltrials.gov14, currently contains 463,924 clinical trials. On this site, a patient can see clinical trials by searching for a specific condition and refining the results using filters like age, trial location, and recruitment status. Though filters can narrow the results, interpreting the eligibility criteria remains difficult. Eligibility criteria can be highly specific, requiring precise medical terminology or exact lab values. Searching often also requires manual data entry, introducing the potential for error, and can be overwhelming for users. For these reasons, a tool that automatically parses EHRs and uses them to identify relevant clinical trials would be of immense value for clinical researchers, medical providers, and patients alike.

Natural Language Processing (NLP) has tremendous potential in healthcare due to the vast amounts of unstructured text produced in provider notes, such as interpretations of lab results or nuances of a diagnosis, that provide a richly detailed account of a patient’s medical history. Similarly, the eligibility criteria of clinical trials contain the concepts and attributes required to classify a patient as eligible or not for a study. Therefore, the information in EHRs combined with the clinical trial eligibility criteria contains all the information necessary to automate the matching process. Furthermore, since NLP methods excel at processing vast amounts of text data, the time required to find and form sufficiently large patient cohorts to start or continue a clinical trial could be greatly reduced through the automation of eligibility assessments and automated trial recommendations.

We detail our approach towards developing an automated TrialMatcher application that uses the information contained within EHRs to find matching clinical trials for veterans receiving treatment within the U.S. Department of Veterans Affairs. Though the methodology discussed here is focused on a patient-centered matching approach, the algorithm could be adapted for trial-centered matching to serve medical researchers as well. First, we discuss the initial prototype that used synthetic data via the Synthea platform. Then, we describe the implementation of the Sørensen-Dice Index (SDI) as a similarity metric to quantify match quality, as well as the usage of the clinicaltrials.gov API for this application. Finally, we demonstrate improved match quality by evaluating on a synthetic dataset more representative of the VA patient population. Through the SDI, patients can be recommended clinical trials for which they have a high similarity score. We also show how the SDI can be used as a loss function that allows us to gauge the effectiveness of the matcher in distinguishing patients belonging to cohorts made for different clinical trials.

Methods

Design of the TrialMatcher Algorithm

The general workflow for the TrialMatcher algorithm is diagrammed in Figure 1. TrialMatcher employs NLP pipelines and a variety of pretrained NER models (Table 1) found in spacy, scispacy16, and HuggingFace. These different models extract different entity types. For instance, Models 1-4 simply extract entities without more specific labeling. However, models like en_ner_jnlpba_md extract and categorize entities using labels such as DNA, CELL_TYPE, CELL_LINE, RNA, and PROTEIN. Data processing, analysis, and visualization were carried out using standard Python modules. The data preprocessing and feature engineering steps include standardization of orthographic information (e.g., capitalization), abbreviation detection, part-of-speech (POS) tagging, detection of spelling variants and errors (i.e., fuzzy matching), and generation of semantic information (i.e., linking terms to UMLS concept unique identifiers).

Figure 1. Process diagram for TrialMatcher

Table 1.

NER models used for prototyping TrialMatcher along with their published F1 score in entity extraction

# Model Name Architecture F1 (ENT)
1 en_core_web_lg Tok2Vec 0.85
2 en_core_web_trf transformer 0.90
3 en_core_sci_lg Tok2Vec 0.70
4 en_core_sci_scibert BERT 0.68
5 en_ner_craft_md Tok2Vec 0.77
6 en_ner_jnlpba_md Tok2Vec 0.72
7 en_ner_bc5cdr_md Tok2Vec 0.85
8 en_ner_bionlp13cg_md Tok2Vec 0.77
9 biomedical-ner-all BERT 0.92
10 ClinicalNER transformer 0.82

All source code used in this project is provided in a GitHub repository (https://github.com/victormurcia/Clinical-Trial-Matcher).

Initial TrialMatcher Prototype

TrialMatcher was initially prototyped using the publicly available Synthetic Suicide Prevention (SSP) Dataset with Social Determinants of Health17 published by the U.S. Department of Veterans Affairs. This dataset contains 10,000 synthetically generated Veteran patient records made with the platform Synthea and encompasses over 500 clinical concepts. It comprises numerical and textual information regarding patient allergies, procedures, conditions, devices, immunizations, demographics, lab results, medications, and insurance data.

The workflow applied to the SSP dataset is shown in Figure 1. The entire set of 10,000 patients comprises the population data in this case. Because the information for each patient was spread across multiple rows, the data was transformed into wide format and all dataframes comprising the dataset were merged using the unique patient ID. This resulted in data being placed in lists that contained all allergies, medications, conditions, etc. found in the patient profile. The full patient profile for any patient in the dataset could then be readily extracted. Next, the textual descriptions for the patient were preprocessed through a spacy pipeline involving tokenization, part-of-speech tagging, a sentence recognizer, a dependency parser, a lemmatizer, and an abbreviation detector for biomedical text from the scispacy library, finishing with an NER component that used one of the models shown in Table 1. The entities extracted by the NER can then be linked to clinical concepts through the UMLS API via named entity linking (NEL); these links are important because they could enable a developer or a patient to query only for clinical trials that deal with conditions, medications, procedures, etc., or a combination thereof.
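The row-to-profile transformation described above can be sketched in plain Python (the paper uses pandas dataframes; the field names and record layout below are illustrative, not the SSP dataset's actual schema):

```python
from collections import defaultdict

def build_profiles(rows):
    """Collapse long-format records (one row per clinical event) into
    one wide-format profile per patient, with list-valued fields.

    `rows` are (patient_id, field, value) triples; the field names used
    here are hypothetical stand-ins for the SSP dataset's columns."""
    profiles = defaultdict(lambda: defaultdict(list))
    for patient_id, field, value in rows:
        profiles[patient_id][field].append(value)
    # Freeze the nested defaultdicts into plain dicts.
    return {pid: dict(fields) for pid, fields in profiles.items()}

rows = [
    ("p1", "conditions", "hypertension"),
    ("p1", "conditions", "type 2 diabetes"),
    ("p1", "medications", "metformin"),
    ("p2", "allergies", "penicillin"),
]
profiles = build_profiles(rows)
# profiles["p1"]["conditions"] -> ["hypertension", "type 2 diabetes"]
```

With pandas, the same collapse is a `groupby` on the patient ID with a list aggregation, followed by merges across the per-topic dataframes.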

The extracted entities serve to define the medical profile of a patient, and each term can be used to query for clinical trials via the clinicaltrials.gov API. Through this API, one can generate the same results as through the browser interface, save them as a .csv file, and load them into a pandas dataframe. This means that any filters concerning age, sex, location, study phase, and recruitment status can be readily applied to the query to refine the results. Applying these filters is important since it is a straightforward way to narrow down the number of trials that will need to be parsed, which can significantly speed up the time it takes to provide results. Included in the clinical trial query output are textual descriptions of the eligibility criteria for each trial. Eligibility is defined by a set of inclusion and exclusion criteria that appear in a standardized format; hence, regular expressions can be used to split each trial’s matching criteria into its respective components. Both inclusion and exclusion criteria can then undergo the same NLP pipeline described earlier. This process generates a set of concepts that define the patient profile and a set of concepts for each clinical trial, whose similarity can be quantified using similarity metrics like the Sørensen-Dice Index (SDI) (Equation 1).
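The regular-expression split of an eligibility block into its two components might look like the following sketch. It assumes the common "Inclusion Criteria:" / "Exclusion Criteria:" headings used on clinicaltrials.gov; real records vary in formatting, so this is illustrative rather than the paper's exact expressions:

```python
import re

def split_criteria(eligibility_text):
    """Split a clinicaltrials.gov eligibility block into inclusion and
    exclusion bullet lists (a simplified sketch)."""
    parts = re.split(r"Exclusion Criteria:", eligibility_text,
                     flags=re.IGNORECASE)
    inclusion_text = re.sub(r".*?Inclusion Criteria:", "", parts[0],
                            flags=re.IGNORECASE | re.DOTALL)
    exclusion_text = parts[1] if len(parts) > 1 else ""

    def bullets(block):
        # Split on newlines with an optional leading bullet marker.
        return [b.strip()
                for b in re.split(r"\n\s*[-*\u2022]?\s*", block)
                if b.strip()]

    return bullets(inclusion_text), bullets(exclusion_text)

text = """Inclusion Criteria:
- 18 years of age or older
- Diagnosis of current anxiety disorder
Exclusion Criteria:
- High-risk suicidality
"""
inc, exc = split_criteria(text)
# inc -> ['18 years of age or older', 'Diagnosis of current anxiety disorder']
# exc -> ['High-risk suicidality']
```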

$\mathrm{SDI}(A, B) = \dfrac{2\,|A \cap B|}{|A| + |B|}$

Equation 1. Equation for SDI. A is the set of terms defining the patient profile and B is the set of terms defining the clinical trial eligibility criteria.
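Equation 1 translates directly into a small set-based function; the example terms below are hypothetical:

```python
def sdi(a, b):
    """Sørensen-Dice index between two term collections (Equation 1)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention for two empty profiles
    return 2 * len(a & b) / (len(a) + len(b))

patient = {"anxiety disorder", "hazardous alcohol use", "hypertension"}
trial = {"anxiety disorder", "hazardous alcohol use", "informed consent"}
score = sdi(patient, trial)  # 2*2 / (3+3) = 0.666...
```

Because the comparison is an exact set intersection, surface-form mismatches (e.g., an abbreviation versus its expansion) score zero, which is the behavior discussed in the Results.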

Refined TrialMatcher

TrialMatcher was further refined on a set of patient cohorts generated via the MDClone ADAMS platform (MDClone Ltd., Be’er Sheva, Israel). The platform was chosen to generate synthetic data because it uses a computational derivation approach in which a computational model of “real-world” data is used to produce a novel and multidimensional dataset that preserves the statistical features (i.e., distributions and covariances) of the original data18. In other words, because the generation process is connected to the Clinical Data Warehouse (CDW) of the VA, the generated synthetic data closely resembles the distribution of VA patients and is therefore representative of the veteran population while protecting individual patient privacy. The patient cohorts were constructed using the eligibility criteria of 11 different clinical trials dealing with substance use disorders (SUD), cardiovascular diseases (CVD), infectious disease (ID), and chronic diseases (CD), which are all clinical and research priorities of the VA (Table 2). In addition, the clinical trials were selected according to their perceived parsing difficulty, based on the total number of inclusion and exclusion criteria, the heterogeneity of variable types (i.e., discrete, continuous, ordinal, and nominal) present in the eligibility criteria, and the vagueness or specificity of their wording. The Sample Size column represents the number of synthetic patients generated as the cohort for that study. Finally, the Attributes column refers to the unique elements that define each patient in the cohort based on the eligibility criteria for each trial.

Table 2.

Basic properties of initial set of 11 clinical trials used to build synthetic patient cohorts.


For each generated synthetic clinical trial cohort, three patient labels were used, namely −1, 0, and 1, corresponding to non-matching, unknown-match, and perfectly matched patients. For a patient to be labeled as a non-match, they had to have all attributes discussed in the exclusion criteria and none of the attributes present in the inclusion criteria. For a perfectly matched patient, they had to have all attributes discussed in the inclusion criteria and none of the attributes present in the exclusion criteria. For unknown matches, the synthetic patients were not constrained to have any particular entry for the queried attributes.

As an example, in clinical trial NCT04871100, the inclusion criteria require a patient to be 18 years of age or older, meet diagnostic criteria for a current anxiety disorder, and endorse hazardous alcohol use, while the exclusion criteria require the patient to not exhibit high-risk suicidality, psychotic symptoms, or cognitive impairment, in addition to not needing acute medically supervised detoxification. A perfectly matched patient could then be made by ensuring that the patient has a current anxiety disorder, denoted by an ICD-10 listing in their profile with codes F40, F41, F42, or F43 made within a time window of 6 months from the query date, while hazardous alcohol use could be denoted via specific ICD-10 listings within F10 and/or patient responses to alcohol-related screening questions found in their EHR; together these would satisfy the inclusion criteria. For the exclusion criteria, we would need to query for patients that do not have high-risk suicidality, gauged by the absence of an ICD-10 listing of R45.851 or specific responses in screening questions. The absence of psychotic symptoms could be ensured by selecting patients that do not have ICD-10 listings in their profile within the F30, F31, F32, F33, F34, and F39 categories that deal specifically with psychotic symptoms; cognitive impairment could be avoided by selecting patients that do not have specific ICD-10 listings in the R41 category; and the need for medically supervised detoxification could be avoided by excluding patient profiles containing ICD-10 code HZ2ZZZZ. A similar procedure would then be carried out to generate non-matching patients as well as patients with an unknown match.

The algorithm for calculating the SDI between the synthetic patient cohorts and clinical trials has seen several notable modifications since the initial prototype. For instance, it was possible to have finer control of the synthetic data, which allowed for the construction of patient cohorts with varying degrees of similarity to the trial at hand. This in turn informed how high an SDI the algorithm should output, thus creating a feedback loop through which the matching performance could be improved. The generation of synthetic data ensures that the original patient data is properly deidentified, which is accomplished through a combination of censoring and replacement. As such, it is not always possible to obtain the exact entries associated with a patient, which prevents nuances (i.e., severity) concerning lab interpretations, diagnoses, etc. from being extracted. To address this, each patient attribute was carefully constructed under the assumption that a missing entry corresponds to the absence of that condition, procedure, health factor, medication, etc. from a patient’s medical history. Under this assumption, entries could be represented as Booleans such that non-missing entries correspond to True for a patient having that attribute and missing entries correspond to False for a patient not having that attribute. The patient profile for each cohort could then be represented as a Python dictionary instead of a list. Dictionaries, also known as associative arrays, are sets of key-value pairs in which each key must be unique. Hence, the patient attributes that act as the column names in the dataframe are used as the dictionary keys, while the Boolean entry associated with each attribute for the current patient serves as the value for that key.
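The missing-entry-means-absent convention can be sketched as follows; the attribute names are illustrative, not the actual MDClone cohort columns:

```python
def boolean_profile(record, attributes):
    """Convert a (possibly sparse) synthetic record into a Boolean
    patient profile. A missing or None entry is taken to mean the
    patient does NOT have that attribute, per the stated assumption."""
    return {attr: record.get(attr) is not None for attr in attributes}

# Hypothetical attribute list and record; the ICD-10 code is only an
# example value showing that any non-missing entry maps to True.
attributes = ["anxiety_disorder_dx", "hazardous_alcohol_use", "egfr_below_60"]
record = {"anxiety_disorder_dx": "F41.1", "hazardous_alcohol_use": None}
profile = boolean_profile(record, attributes)
# {'anxiety_disorder_dx': True, 'hazardous_alcohol_use': False,
#  'egfr_below_60': False}
```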

For the clinical trial, the profile is also expressed as a dictionary. The keys are the entities extracted by the NER model, and the values are set to True if the entities were extracted from the inclusion criteria and to False if they were extracted from the exclusion criteria. The assumption made here is that any concepts present in the inclusion criteria are required for a patient to be eligible and that any concepts present in the exclusion criteria would preclude the patient from being eligible. This assumption simplifies the process, but it is not universally true. For example, a common requirement found in inclusion criteria is “Able to provide informed consent”. However, there are instances where the opposite sentence is found in the exclusion criteria in a variant such as “Unable to provide informed consent”. The NER models can generally extract “informed consent” from the sentence readily; however, the qualifier pertaining to the statement is not necessarily associated with it. This can be handled using negations implemented through NegEx patterns involving pseudo negations, preceding negations, following negations, and terminations, which can be readily included through a library such as negspacy. This option is still under development, so the clinical trial profiles were constructed under the assumption discussed earlier. In addition to the issue concerning negations, the extraction of numerical quantities associated with specific entities (e.g., accepted values for a lab result or valid systolic blood pressure measurements) is a notoriously challenging task for NER models. However, through a combination of regular expressions and libraries like extractacy, numerical quantities can be linked to the named entities extracted through NER. This feature is also still under development, so for now the True and False values of numerical requirements were set manually based on the conditions stipulated by the clinical trial.

Having reworked the data structures for the patient profile and the clinical trial profile, the SDI can now be calculated as an aggregate score (Equation 2). This is done by calculating an SDI for each pair of matching keys in the two dictionaries (the SDI will always be either 0 or 1 in this case), summing the SDI scores for each key, and then dividing by the total number of keys. It is also important to note that the scores are defined with respect to the clinical trial profile. This is done because the patient profile will generally have many more attributes than those listed in the clinical trial eligibility criteria, which, if considered, would unnecessarily lower the SDI.

$\mathrm{SDI}_{\mathrm{agg}} = \dfrac{1}{N} \sum_{i=1}^{N} \dfrac{2\,|A_i \cap B_i|}{|A_i| + |B_i|}$

Equation 2. Equation for aggregated SDI. A is the set defining the patient profile, B is the set defining the clinical trial profile, N is the total number of keys in the clinical trial dictionary, and the sum is carried out over each key i.
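Because both profiles are Boolean dictionaries, each per-key term of Equation 2 reduces to an equality check between singleton values (1 when the values agree, 0 otherwise). A minimal sketch of the aggregate score follows, with hypothetical keys and with missing patient keys defaulting to False per the assumption above:

```python
def aggregated_sdi(patient, trial):
    """Aggregate SDI (Equation 2) over the keys of the clinical trial
    profile. Per key, the SDI of two singleton Boolean sets is 1 when
    the patient's value equals the trial's required value, else 0;
    keys absent from the patient profile count as False."""
    if not trial:
        return 0.0
    hits = sum(1 for key, required in trial.items()
               if patient.get(key, False) == required)
    return hits / len(trial)

# Hypothetical trial profile: two inclusion concepts (True) and one
# exclusion concept (False).
trial = {"anxiety disorder": True, "hazardous alcohol use": True,
         "high-risk suicidality": False}
patient = {"anxiety disorder": True, "hazardous alcohol use": False}
aggregated_sdi(patient, trial)  # 2/3: the suicidality key defaults to
                                # False, which matches the requirement
```

Note that the score is normalized by the trial's key count, so extra patient attributes never dilute the match, mirroring the choice described above.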

Results

Initial TrialMatcher Prototype

Figure 2a shows a frequency plot of the calculated SDI between a set of Synthea patients and the clinical trials matched based on the extracted entities in their profiles. The SDI values were consistently low (under a 15% match) for a given patient-clinical trial pair. Low SDI values partially result from non-exact entity matches between the two lists. For example, the clinical trial profile may contain a requirement for “Post Traumatic Stress Disorder”, but the patient profile may instead contain it as “PTSD”. Even though these two terms represent the same concept, an exact match is required for the SDI to consider them equal. In addition, numerical results, abbreviations, and negations were not handled well in the initial prototype, which contributed to the poor matching performance. Furthermore, there is evidence in the literature that Synthea19 can generate patient cohorts with atypical prevalence or comorbidities (e.g., Synthea patients were more likely to suffer kidney failure and undergo diabetes-related amputations relative to the national average). These factors drove us to explore alternative synthesis methods.
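The abbreviation failure mode above can be reproduced with a simple string-similarity check; the small synonym map shown is purely illustrative, not the pipeline's actual mechanism:

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """Character-level fuzzy match (a stand-in for the pipeline's fuzzy
    matching). Tolerates small spelling differences but, as shown
    below, fails for abbreviations like PTSD."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

ABBREVIATIONS = {"ptsd": "post traumatic stress disorder"}  # illustrative

def normalize(term):
    """Expand known abbreviations before comparison."""
    return ABBREVIATIONS.get(term.lower(), term.lower())

similar("PTSD", "Post Traumatic Stress Disorder")             # False
similar(normalize("PTSD"), "Post Traumatic Stress Disorder")  # True
similar("Post Traumatic Stress Disorder",
        "post-traumatic stress disorder")                     # True
```

This is one reason concept-level linking (e.g., mapping both surface forms to the same UMLS concept identifier) is preferable to raw string comparison.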

Figure 2. a) Distribution of SDI for a set of synthetic veterans from the initial TrialMatcher prototype and their matched clinical trials. b) SDI distribution and c) matching decision visualization for the refined TrialMatcher algorithm showing how the entities and corresponding values for a patient in the cohort were used to calculate an SDI for NCT05602727.

Refined TrialMatcher

The distribution of SDIs for the patient cohort generated for NCT05602727 can be seen in Figure 2b. Through the inclusion of better synthetic data, better processing methods, and the restructuring of both the patient and clinical trial dictionaries, the SDIs between patients and clinical trials were vastly improved: up to nearly an 80% match can be obtained for certain trials. The histogram in Figure 2b shows the frequencies of SDIs for patients labeled as perfect matches (dark purple, right side), non-matches (beige bars, left side), and unknown matches (pink, middle) for clinical trial NCT05602727. It is worth observing that the patients labeled as non-matches, based on their attributes as they pertain to the eligibility criteria for this trial, had SDIs of 0. In addition, the highest SDIs were obtained for patients labeled as perfect matches. For these patients, an SDI of 1 was expected. This was the case in NCT05301348, for instance; however, in NCT05602727 the highest SDI for the perfectly matched patients was 0.79. This lower-than-expected score arose for three reasons. First, over-tokenization or misclassification by the model occurred, as in the case of “curative therapy”, which the NER model split into two separate terms. Second, some extracted entities could not be represented in the cohort data (e.g., allergy for this clinical trial pertains to an exclusion criterion stating that a patient ‘Has a known allergy or intolerance to the active or inert ingredients in MK-1942.’ The active ingredients for this medication were not found and so could not be included in the attributes). Third, some mismatched keys could not be consolidated via fuzzy matching (e.g., “drug dependency or abuse” was extracted from the clinical trial criteria, and although the attribute ‘drug dependency’ was present in the cohort, fuzzy matching was unable to resolve this discrepancy).
Figure 2c showcases the matching results between the entities extracted from the clinical trial eligibility criteria and the attributes comprising the patient profile. This plot allows one to rapidly assess which features were matched well (green bars), missing (gray bars), or non-matched (red bars). Knowing this is useful for patients and providers alike since it gives insight into how the algorithm performed the matching process.

We were also interested in the classification capability of the matcher. To probe this, 7 new patient cohorts were created based on the eligibility criteria of clinical trials dealing with cardiovascular diseases and posttraumatic stress disorder (Table 3). For this experiment, the biomedical-ner-all20 model was used. Unlike the cohorts discussed previously, however, each cohort contained all the attributes required across all clinical trial criteria, which encompassed 195 unique attributes. Each patient could then have an SDI readily calculated, which could be used alongside their cohort label to calculate ROC curves as shown in Figure 3.
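The paper does not give its ROC implementation; as a minimal sketch, the AUC for a trial can be computed from (SDI, cohort label) pairs via the rank-sum (Mann-Whitney) identity, treating members of the trial's cohort as positives:

```python
def auc_from_scores(scores, labels):
    """AUC as the probability that a randomly chosen positive outscores
    a randomly chosen negative, with ties counted as 0.5.
    labels: 1 = patient from this trial's cohort, 0 = patient from
    another cohort. A sketch; a library ROC routine would be used in
    practice."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one score from each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Well-separated SDI distributions give AUC 1.0; identical
# distributions give 0.5, as seen for the broadly written trials.
auc_from_scores([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])  # 1.0
auc_from_scores([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])  # 0.5
```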

Table 3.

Clinical trials used for ROC analysis.

# TRIAL ID Cohort Size Main Topic Human Entities NER Entities AUC Max SDI
1 NCT05866666 658,584 Heart Disease 15 9 0.51 0.78
2 NCT04828070 24,384 Heart Disease 21 12 0.92 0.42
3 NCT05746559 2,875 Heart Disease 34 21 1.00 0.19
4 NCT04452500 584,033 PTSD 88 83 0.78 0.51
5 NCT04597190 990,819 PTSD 17 15 0.65 0.53
6 NCT03509909 989,615 PTSD 17 21 0.42 0.52
7 NCT05194930 487,153 PTSD 24 20 0.36 0.30

Figure 3.

Figure 3.

ROC curves of clinical trials using synthetic patient match labels and calculated SDI to gauge classification performance of the refined TrialMatcher

For trials that had highly specific inclusion requirements, the AUC scores suggest that these cohorts are highly separable from patients present in other cohorts. As an example, NCT04828070 defines eligible patients based on criteria such as: must be female, aged 18-55 years, must be pregnant, and must have a diagnosis of cardiovascular disease. The AUC score for NCT04828070 was 0.92, which is likely due to the pregnancy requirement. None of the other trials required pregnancy, which enables a clear distinction between patients from this cohort and all the others. On the other hand, NCT05866666 has very broadly written inclusion criteria (able to provide informed consent, over 21 years old, history of several common cardiac diseases) with highly specific exclusion criteria (heart transplant, use of an LVAD, use of a Holter monitor, surgical scars and wounds that may not be well documented in the EHR). Consequently, the AUC score for this trial was 0.51, which indicates that patients created to match this trial cannot be reliably distinguished from other cohorts based on synthetic EHR data alone.

Discussion

Clinical trials require patients to meet defined and posted eligibility criteria as a first screening for enrollment. Unfortunately, patients face a variety of challenges in finding suitable trials and consequently trials are often unable to reach their enrollment goals. Matching tools that alleviate this burden would reduce this barrier to running clinical trials and therefore impact patient wellbeing and medical research more broadly.

Existing Clinical Trial Matching Tools

There are a few tools in use right now that can help patients find relevant clinical trials. We surveyed some notable tools in use and identified three common limitations in current applications: 1) reliance on manual entry or questionnaires, 2) a narrow focus on trial matching, and 3) limited availability to outside users.

Most clinical trial finders in current use require manual entry of patient data or deploy a questionnaire, the answers to which help the system filter out a select few applicable trials. Examples of these tools include ResearchMatch Trials Today (RMTT)21, DQuest22, and the Janssen Global Trial Finder23. Questionnaires may ask for a combination of medical, geographic, and demographic information. In effect, this approach serves as a guided search of clinical trial repositories using select details of patient health information. These approaches face tradeoffs between questionnaire complexity and optimal trial matching, where it is typically better to err on the side of showing more results than eliminating potentially useful information. Depending on the implementation and ontology used, this can result in a system that does not differentiate between critical differences in conditions being investigated by trials, such as “breast cancer” and “HER2+ breast cancer.” General terms can also yield multiple trials addressing sequelae of a primary condition, such as when “leukemia” returns over 2,600 results. Finally, some tools primarily filter trials by location. All these systems place the burden on patients and their caregivers to craft their responses for effectiveness and ultimately rely on manual review.

Other tools partially address the manual burden with a narrow scope of focus. These still employ questionnaires but self-select for a subset of trials, thereby providing a more tailored set of recommendations. One example is the COVID-19 Trial Finder24, which employs a five-question questionnaire to pair patients to trials. It sources its clinical trial options from ClinicalTrials.gov by querying all the trials indexed with “COVID-19” as their condition and yields 79.76% precision in matching trial locations for COVID-19. A near 80% success rate is compelling, but this system does not include and is not scalable to other conditions. Similarly, the Fox Trial Finder25, developed by the Michael J. Fox Foundation, focuses primarily on Parkinson’s Disease but also covers trials on Multiple System Atrophy, Progressive Supranuclear Palsy, Lewy Body Dementia, and Corticobasal Degeneration. Narrow-focus tools may provide a more efficient search but serve a limited audience in potentially low-prevalence conditions; the Fox Trial Finder, for example, had reported only 91,000 uses between its launch in 2011 and February 2020.

Finally, while all trial finders examined rely on data from ClinicalTrials.gov, not all are as persistent or publicly available. At the time of writing, DQuest does not appear to be publicly accessible, and the COVID-19 Trial Finder is no longer available online. Our own solution currently shares this limitation; the preceding VA Clinical Trial Selector project is available as an open-source repository but is not currently a hosted web application. Applications that directly share patient health records also introduce security concerns that are circumvented by a questionnaire. Otherwise, the solution either needs to be hosted on internal health network systems to avoid sharing patient health data, or a process for sharing records with the trial matching tool needs to be established. The first alternative is feasible at the VA due to the nature of our integrated health record. The application described here is intended for deployment within the VA, and as such existing security and privacy protocols will be enforced to protect patient data. However, the methods described here should be generalizable for other medical institutions to adopt while applying the necessary patient information protections.

TrialMatcher

The initial results demonstrated by the refined TrialMatcher are promising; however, there are still plenty of improvements that could be made to ensure that the matching process is accurate. For instance, the assumption regarding the Boolean nature of concepts found in the inclusion and exclusion criteria needs to be addressed. This could be handled through the inclusion of negation and affirmation methods in the processing pipeline that could more accurately inform the required value for a given attribute based on the eligibility criteria. Both negation and affirmation will be particularly important components to include here since they can completely change the meaning of a sentence. As mentioned earlier, a common requirement in clinical trials is for the patient to “Be able to provide informed consent”. NER models can generally extract informed consent quite readily; however, to properly assess the truth value of each criterion we need to be able to a) extract the entity, b) detect negations/affirmations, and c) identify whether the entity was extracted from the inclusion or exclusion criteria. In the example of “Be able to provide informed consent”, if the requirement is found within the inclusion criteria, then the extracted entity would be “informed consent” and we have an affirmation cue expressed via the word “able”. Given that, the value in the clinical trial profile for informed consent would be True. On the other hand, if that same sentence were found in the exclusion criteria, then the value for that entity in the clinical trial profile would be False. This is something that is currently being developed.
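Steps a)-c) above can be sketched with a toy negation-cue list standing in for a NegEx-style component such as negspacy; the cue list and function are illustrative only:

```python
NEGATION_CUES = ("unable", "not ", "no ", "without")  # simplistic cue list

def criterion_truth_value(criterion, section):
    """Assign the required Boolean for an extracted entity: detect a
    negation cue, then flip the value implied by the section
    ('inclusion' -> True, 'exclusion' -> False). A toy stand-in for a
    full NegEx implementation."""
    negated = any(cue in criterion.lower() for cue in NEGATION_CUES)
    required = (section == "inclusion")
    return required != negated  # XOR: a negation cue flips the requirement

criterion_truth_value("Able to provide informed consent", "inclusion")    # True
criterion_truth_value("Unable to provide informed consent", "exclusion")  # True
criterion_truth_value("Unable to provide informed consent", "inclusion")  # False
```

Note that the first two calls agree, capturing the observation that an inclusion criterion and the negated form of the same sentence in the exclusion criteria impose the same requirement.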

Furthermore, the extraction and association of numerical quantities needs to be improved. Though extracting numbers can be done readily with simple regular expressions, reliably identifying quantities like dosages, survey scores, measurements, and lab results is paramount for accurate matching. Currently, the patient cohorts were constructed such that the truth value associated with these numerical quantities could be readily established. As an example, one of the criteria in NCT04452500 excludes patients with an eGFR less than 60 mL/min. The synthetic cohorts could be constructed so that this condition was either met or not, which allowed the truth value for this quantity to be established. In practice, however, the EHR of a patient may contain eGFR values that are below or above that requirement, or even missing. In this scenario, the value extracted from the eligibility criteria must be compared against the value found in the EHR before a truth value can be assigned. Libraries like extractacy can help towards this end, and there are NER models trained to aid with the extraction and classification of these types of values. This is another feature that is currently being developed.

The tested NER models can extract many of the entities required to characterize each clinical trial; however, crucial information is often missing, so it would be beneficial to finetune an NER model using the eligibility criteria as training data. For example, in Figure 2c, any ‘Missing’ attributes are marked as such because the NER model failed to extract that attribute from the patient profile. It is crucial that all relevant medical concepts be extracted from both the eligibility criteria and the patient profile to ensure that an accurate match score is reported. For instance, in clinical trial NCT05746559, one of the requirements in the inclusion criteria is ‘Planned non-emergent sternotomy with CPB procedure’. Information regarding sternotomy procedures is present among the attributes in the patient profiles, yet the NER model did not extract ‘sternotomy’ as an entity. This is problematic because this key concept dictates a patient’s ultimate eligibility for the trial. Training an additional NER layer on top of a transformer model like SciBERT or PubMedBERT could be sufficient to increase the matching performance, and finetuning a model on the clinical trial eligibility criteria could also help ensure that all the relevant entities are properly extracted.
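The comparison of an extracted numerical threshold against an EHR value, as in the eGFR example above, can be sketched with a small regular-expression lexicon. The comparator mapping and helper functions below are simplified assumptions for illustration, not the extractacy-based pipeline under development:

```python
import re
from typing import Optional, Tuple

# Hypothetical, simplified lexicon of textual comparators.
COMPARATORS = {
    "less than": "<", "greater than": ">", "at least": ">=", "at most": "<=",
    "<": "<", ">": ">", "<=": "<=", ">=": ">=",
}

def extract_threshold(criterion: str) -> Optional[Tuple[str, float]]:
    """Pull a comparator and numeric threshold from an eligibility criterion,
    e.g. 'eGFR less than 60 mL/min' -> ('<', 60.0)."""
    pattern = r"(less than|greater than|at least|at most|<=|>=|<|>)\s*(\d+(?:\.\d+)?)"
    m = re.search(pattern, criterion, flags=re.IGNORECASE)
    if not m:
        return None
    return COMPARATORS[m.group(1).lower()], float(m.group(2))

def criterion_met(ehr_value: Optional[float], op: str, threshold: float) -> Optional[bool]:
    """Compare an EHR lab value against the extracted threshold.
    Returns None when the value is missing from the record."""
    if ehr_value is None:
        return None
    return {"<": ehr_value < threshold, ">": ehr_value > threshold,
            "<=": ehr_value <= threshold, ">=": ehr_value >= threshold}[op]

# Exclusion criterion from NCT04452500: eGFR less than 60 mL/min
op, thr = extract_threshold("eGFR less than 60 mL/min")
print(criterion_met(55.0, op, thr))  # True: a patient with eGFR 55 meets the exclusion
```

A missing lab value propagates as None rather than a guess, so the matcher can surface it as a ‘Missing’ attribute instead of silently assuming eligibility.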

The SDI serves as an easy-to-explain similarity metric. Though other similarity metrics like the Jaccard Index (JI) and Cosine Similarity (CS) were explored in the early prototype stages, it will be interesting to revisit them, for instance as loss functions when testing the matching performance. Furthermore, the eligibility threshold still needs to be defined. A simple solution would allow either the patient or the clinical trial coordinator to adjust the eligibility threshold freely and be presented with their options accordingly. However, it could also be desirable to determine the eligibility threshold automatically, for example by using ROC curves or Precision-Recall curves in conjunction with evaluation metrics like the geometric mean or Youden’s J statistic to tune the decision threshold. Finally, synthetic data has allowed this initial prototype to be developed; however, the amount of censoring present in the synthetic patient cohorts, in addition to the difficulties associated with generating unstructured text, means that original patient data is required to further enhance the performance of this application. For instance, clinical trial eligibility criteria may require a patient to have a mild, moderate, or severe case of a condition, which is information that synthetic data cannot readily convey. Including these details will further capture the unique nuances that define each patient. Other functionalities would be interesting to explore via NLP in the context of clinical trials, such as trial recommender systems and abstractive summarization via an appropriate Large Language Model (LLM). Nevertheless, the work done here will serve as the foundation for subsequent efforts in developing an application that facilitates patient recruitment in clinical trials.
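A minimal sketch of a set-based SDI and a Youden’s J threshold search follows; the attribute names and example profiles are hypothetical, and this generic formulation may differ from the paper’s exact adaptation of the index:

```python
def sorensen_dice(patient: dict, trial: dict) -> float:
    """Set-based Sørensen-Dice Index, 2|A∩B| / (|A| + |B|), where an
    attribute counts as shared when both profiles assign it the same value."""
    shared = {k for k in patient.keys() & trial.keys() if patient[k] == trial[k]}
    denom = len(patient) + len(trial)
    return 2 * len(shared) / denom if denom else 0.0

def youden_threshold(scores, labels):
    """Choose the decision threshold maximizing Youden's J
    (sensitivity + specificity - 1) over candidate match scores."""
    best_t, best_j = 0.0, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l)
        fn = sum(1 for s, l in zip(scores, labels) if s < t and l)
        tn = sum(1 for s, l in zip(scores, labels) if s < t and not l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and not l)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        j = sens + spec - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t

# Hypothetical Boolean attribute profiles: two of three attributes agree.
patient = {"informed_consent": True, "sternotomy": True, "egfr_ok": False}
trial = {"informed_consent": True, "sternotomy": True, "egfr_ok": True}
print(round(sorensen_dice(patient, trial), 3))  # 0.667
```

Given match scores and ground-truth eligibility labels for a validation cohort, `youden_threshold` would return the cutoff that best separates eligible from ineligible patients, which is one way the automatic threshold selection described above could be realized.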

Conclusions

An NLP-powered clinical trial matcher that uses EHRs and clinical trial eligibility criteria would facilitate and expedite the search for applicable clinical trials. The work presented here with TrialMatcher showcases how EHRs, synthetic data, NLP techniques, and similarity metrics can be combined towards the development of such an application. Continued development will require finetuning NER models on a labeled dataset of clinical trial eligibility criteria, which is currently being constructed, as well as transitioning to original patient data in order to validate the methods shown here. Beyond testing these methods with original data, the utility of this application will ultimately be determined by factors such as patient reports and surveys indicating ease of finding clinical trials, and shorter recruitment times for clinical trials. The work shown in this paper delineates how this application would accurately match patients to a variety of clinical trials with minimal effort on their part, while providing patients not only a match score via the Sørensen-Dice Index, but also a traceable explanation of the exact criteria that make them eligible for a study.

Figures & Table

References

  1. Umscheid CA, Margolis DJ, Grossman CE. Key Concepts of Clinical Trials: A Narrative Review. Postgrad Med. 2011 Sep 13;123(5):194–204. doi: 10.3810/pgm.2011.09.2475.
  2. Kandi V, Vadakedath S. Clinical Trials and Clinical Research: A Comprehensive Review. Cureus. 2023 Feb 16.
  3. Götzinger F, Kunz M, Lauder L, Böhm M, Mahfoud F. Arterial hypertension - clinical trials update 2022. Hypertension Research. 2022 Jul 13;45(7):1140–6. doi: 10.1038/s41440-022-00931-2.
  4. Alhalel J, Francone N, Post S, O’Brian CA, Simon MA. How Should Representation of Subjects With LEP Become More Equitable in Clinical Trials? AMA J Ethics. 2022 Apr 1;24(4):E319–325. doi: 10.1001/amajethics.2022.319.
  5. Soliman MAS, Kelahan LC, Magnetta M, Savas H, Agrawal R, Avery RJ, et al. A Framework for Harmonization of Radiomics Data for Multicenter Studies and Clinical Trials. JCO Clin Cancer Inform. 2022 Nov;6:e2200023. doi: 10.1200/CCI.22.00023.
  6. Chaudhari N, Ravi R, Gogtay NJ, Thatte UM. Recruitment and retention of the participants in clinical trials: Challenges and solutions. Perspect Clin Res. 2020;11(2):64–9. doi: 10.4103/picr.PICR_206_19.
  7. Osawa EA, Rhodes A, Landoni G, Galas FRBG, Fukushima JT, Park CHL, et al. Effect of Perioperative Goal-Directed Hemodynamic Resuscitation Therapy on Outcomes Following Cardiac Surgery. Crit Care Med. 2016 Apr;44(4):724–33. doi: 10.1097/CCM.0000000000001479.
  8. Harrer S, Shah P, Antony B, Hu J. Artificial Intelligence for Clinical Trial Design. Trends Pharmacol Sci. 2019 Aug;40(8):577–91. doi: 10.1016/j.tips.2019.05.005.
  9. Evans RS. Electronic Health Records: Then, Now, and in the Future. Yearb Med Inform. 2016 May 20;Suppl 1(Suppl 1):S48–61. doi: 10.15265/IYS-2016-s006.
  10. Hirsch JA, Nicola G, McGinty G, Liu RW, Barr RM, Chittle MD, et al. ICD-10: History and Context. American Journal of Neuroradiology. 2016 Apr;37(4):596–9. doi: 10.3174/ajnr.A4696.
  11. Dotson P. CPT® Codes: What Are They, Why Are They Necessary, and How Are They Developed? Adv Wound Care (New Rochelle). 2013 Dec;2(10):583–7. doi: 10.1089/wound.2013.0483.
  12. McDonald CJ, Huff SM, Suico JG, Hill G, Leavelle D, Aller R, et al. LOINC, a Universal Standard for Identifying Laboratory Observations: A 5-Year Update. Clin Chem. 2003 Apr 1;49(4):624–33. doi: 10.1373/49.4.624.
  13. Miller GC, Britt H. A new drug classification for computer systems: the ATC extension code. Int J Biomed Comput. 1995 Oct;40(2):121–4. doi: 10.1016/0020-7101(95)01135-2.
  14. McCray AT, Ide NC. Design and Implementation of a National Clinical Trials Registry. Journal of the American Medical Informatics Association. 2000 May 1;7(3):313–23. doi: 10.1136/jamia.2000.0070313.
  15. Khurana D, Koli A, Khatter K, Singh S. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl. 2023 Jan 14;82(3):3713–44. doi: 10.1007/s11042-022-13428-4.
  16. Neumann M, King D, Beltagy I, Ammar W. ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing. In: Proceedings of the 18th BioNLP Workshop and Shared Task. Stroudsburg, PA, USA: Association for Computational Linguistics; 2019. pp. 319–27.
  17. Hall D, Li V, Shaikh Y, Gramigna M, Walonoski J. Synthetic Patient Records for Veteran Suicide Modeling & Analytics. 2020 Aug.
  18. Foraker RE, Yu SC, Gupta A, Michelson AP, Pineda Soto JA, Colvin R, et al. Spot the difference: comparing results of analyses from real patient data and synthetic derivatives. JAMIA Open. 2021 Feb 15;3(4):557–66. doi: 10.1093/jamiaopen/ooaa060.
  19. Walonoski J, Kramer M, Nichols J, Quina A, Moesel C, Hall D, et al. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. Journal of the American Medical Informatics Association. 2018 Mar 1;25(3):230–8. doi: 10.1093/jamia/ocx079.
  20. Raza S, Reji DJ, Shajan F, Bashir SR. Large-scale application of named entity recognition to biomedicine and epidemiology. PLOS Digital Health. 2022 Dec 7;1(12):e0000152. doi: 10.1371/journal.pdig.0000152.
  21. Pulley JM, Jerome RN, Bernard GR, Olson EJ, Tan J, Wilkins CH, et al. Connecting the public with clinical trial options: The ResearchMatch Trials Today tool. J Clin Transl Sci. 2018 Aug;2(4):253–7. doi: 10.1017/cts.2018.327.
  22. Liu C, Yuan C, Butler AM, Carvajal RD, Li ZR, Ta CN, et al. DQueST: dynamic questionnaire for search of clinical trials. Journal of the American Medical Informatics Association. 2019 Nov 1;26(11):1333–43. doi: 10.1093/jamia/ocz121.
  23. Global Trial Finder: Why It Just Got Easier to Enroll in a Janssen Clinical Study | Johnson & Johnson [Internet]. [cited 2023 Sep 12]. Available from: https://www.jnj.com/latest-news/global-trial-finder-why-it-just-got-easier-to-enroll-in-%20a-janssen-clinical-study.
  24. Sun Y, Butler A, Lin F, Liu H, Stewart LA, Kim JH, et al. The COVID-19 Trial Finder. Journal of the American Medical Informatics Association. 2021 Mar 1;28(3):616–21. doi: 10.1093/jamia/ocaa304.
  25. Fox Trial Finder | Parkinson’s Disease [Internet]. [cited 2023 Sep 12]. Available from: https://www.michaeljfox.org/trial-finder.
