Abstract
Recently, biotechnology and pharmaceutical industries have made strides to adopt and implement Natural Language Processing (NLP) to address challenges faced when extracting and synthesizing high volumes of information found in unstructured and semistructured text. Here we present, and provide a summary of the findings from, a use case where NLP and text mining methodologies were used to extract clinical trial data from ClinicalTrials.gov for mRNA cancer vaccines.
METHODS
Text mining from ClinicalTrials.gov
To demonstrate the utility of NLP and text mining, we present a use case where we implemented these methodologies as a scalable solution to search for, and extract information from, planned, ongoing, and completed clinical research studies related to mRNA cancer vaccines on ClinicalTrials.gov. To this end, we utilized the Linguamatics I2E NLP platform which provides text analytics solutions based on keyword searches. This platform allows for retrieval of relevant documents, identification and extraction of key information, and summarization of the data in a structured format for researchers to review and analyze. Described here is a framework that was implemented for an mRNA cancer vaccine use case (cutoff: July 2022; Figure 1).
FIGURE 1.
Natural Language Processing (NLP) Framework for mining data from ClinicalTrials.gov. Twenty‐seven unique clinical trials were extracted and included in the final analysis.
The text mining tool I2E (Linguamatics) was utilized to retrieve specific trials from ClinicalTrials.gov based upon predefined inclusion and exclusion criteria (IEC). Specifically, the query was implemented with AND relationships, so that a trial had to satisfy all of the following: (1) “Accepts Healthy Volunteers: No”; (2) must include “treatment” and exclude “prevention” or “Health Services Research” in the Study Design section; AND (3) include all neoplasms except for leukemia, lymphoma, plasma cell neoplasms, and hematologic neoplasms. Specific cancer names/types utilized in IEC #3 were obtained from the National Cancer Institute (NCI) Thesaurus (NCIt) and are listed in Table S1. 1 Five hundred fifty‐one clinical trials met the IEC and were extracted (Table S2), with the earliest trial dating back to October 1997. For these 551 trials, metadata and dosing information were extracted from the text included in the sections “Intervention,” “Detailed Description,” and “Study Arm” for each respective study. Next, using a rule‐based method, paragraphs satisfying the condition (Cancer vaccine name AND [dose OR route of administration OR duration OR frequency]) were retrieved, and dose, route of administration, duration, and/or frequency were determined and extracted based upon the patterns predefined in the I2E (Linguamatics) platform.
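The rule‐based paragraph condition described above (Cancer vaccine name AND [dose OR route of administration OR duration OR frequency]) can be illustrated with a minimal Python sketch. This is not the I2E query language or the patterns predefined in that platform, which are not given here; the vaccine name `ExampleVax-1`, the regular expressions, and the function `extract_dosing` are all hypothetical and chosen only to show the logic.

```python
import re

# Hypothetical vaccine name list; a real run would use a curated vocabulary.
VACCINE_NAMES = ["ExampleVax-1"]

# Hypothetical, deliberately simple patterns (NOT the predefined I2E patterns).
PATTERNS = {
    "dose": re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|ug)\b", re.IGNORECASE),
    "route": re.compile(
        r"\b(?:intramuscular|intravenous|intradermal|subcutaneous|intranodal)(?:ly)?\b",
        re.IGNORECASE,
    ),
    "duration": re.compile(
        r"\bfor (?:up to )?\d+ (?:days?|weeks?|months?|years?)\b", re.IGNORECASE
    ),
    "frequency": re.compile(
        r"\bevery \d+ (?:days?|weeks?|months?)\b|\b(?:daily|weekly|monthly)\b",
        re.IGNORECASE,
    ),
}


def extract_dosing(paragraph, vaccine_names=VACCINE_NAMES):
    """Return extracted fields if the paragraph names a vaccine AND contains
    at least one of dose, route, duration, or frequency; otherwise None."""
    lowered = paragraph.lower()
    names = [name for name in vaccine_names if name.lower() in lowered]
    if not names:
        return None  # the vaccine-name half of the AND condition fails

    found = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(paragraph)
        if match:
            found[field] = match.group(0)
    if not found:
        return None  # none of the OR-ed dosing attributes is present

    return {"vaccines": names, **found}


# Invented example sentence, for illustration only.
example = "ExampleVax-1 100 mcg is given intramuscularly every 3 weeks for up to 12 months."
result = extract_dosing(example)
```

Fixed patterns of this kind are easy to audit, but they only capture phrasings that were anticipated when the rules were written, which is why manual review of the extracted records remains necessary.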
The study team reviewed the NLP‐mined clinical trials and their data to exclude trials that were not investigating an mRNA cancer vaccine or that provided insufficient detail to determine the therapeutic modality/indication, and then cleaned and restructured the remaining data into a more usable format. Ultimately, 27 clinical trials (~5% of the 551 trials) were identified and used for all subsequent analyses.
USE CASE: APPLYING NLP AND TEXT MINING TO SUMMARIZE THE CLINICAL TRIAL LANDSCAPE FOR MRNA CANCER VACCINES
We chose mRNA cancer vaccines as our use case because they are an exciting new therapeutic modality that has recently increased in popularity due to its versatility and its ability to be produced at scale at relatively low cost. 2 , 3 For oncology indications, a significant advantage of mRNA cancer vaccines is that the mRNA sequence can easily be edited depending on the application of interest. This enables the exploration of personalized vaccines, simplifies testing across indications, and ultimately provides a platform approach with a relatively consistent manufacturing process across many potential products. 2 , 3
A total of 27 clinical trials were identified where at least one mRNA cancer vaccine was administered (or planned to be administered) in patients diagnosed with cancer. Among these, four trials are reported as “completed,” eight are reported as “recruiting,” seven are reported as “active, not recruiting,” two are reported as “not yet recruiting,” four are reported as “terminated” or “withdrawn,” and two are reported to have unknown recruitment status.
Among the 27 mRNA cancer vaccine trials, 37.0% (10/27) are phase I, 18.5% (5/27) are phase I/II, 22.2% (6/27) are phase II, and 3.7% (1/27) are phase II/III studies (Figure S1a). 18.5% (5/27) of the trials do not report their current clinical trial phase; however, three of these five trials mention and/or include study arms where dose escalation and/or dose expansion are being investigated. Interestingly, no pivotal trials were observed for this therapeutic modality, reflecting the early stages of development for mRNA cancer vaccines.
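As a quick consistency check on the phase breakdown above, the reported counts can be converted to percentages in a few lines of Python. The counts are taken directly from the text; the variable and function names are illustrative only.

```python
# Trial counts per phase, as reported in the text for the 27 trials.
phase_counts = {
    "Phase I": 10,
    "Phase I/II": 5,
    "Phase II": 6,
    "Phase II/III": 1,
    "Not reported": 5,
}

total = sum(phase_counts.values())


def percent(count, total):
    """Share of trials as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


phase_percent = {phase: percent(n, total) for phase, n in phase_counts.items()}

for phase, n in phase_counts.items():
    print(f"{phase}: {n}/{total} = {phase_percent[phase]:.1f}%")
```

The counts sum to 27, and the rounded shares agree with the percentages quoted in the text.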
The number of patients enrolled in each trial varied considerably based on the study phase as well as the number of cohorts included in the study design (Figure S1a). In general, the enrollment size for phase I and/or phase I/II trials ranges from 10 to 272 patients. Focusing on the primary outcome measures reported, the phase I trials primarily focus on safety, whereas the phase I/II trials focus on safety and early treatment response. For phase II studies, the enrollment size ranges from 35 to 201 patients. Notably, one phase II study was withdrawn prior to enrolling any patients. Finally, for the one phase II/III trial, the enrollment size is 665 patients.
In addition, the route of administration of the mRNA cancer vaccines also varied considerably across the different clinical trials (Figure S1b). The diversity observed in the route of administration suggests that the optimal route may not yet be established and may depend upon many factors, including the type of cancer being targeted. Additional literature searches outside of ClinicalTrials.gov are needed for the eight vaccines that did not report their route of administration on their trial page. Additionally, only a limited number of trials (9/27) reported the mRNA vaccine dose administered to the patients (Table S3).
The mRNA cancer vaccines are being evaluated and investigated in a broad range of cancer types (Figure S2). The most common indications include non‐small cell lung cancer (7/27), colorectal cancer (6/27), melanoma (5/27), gastric and/or esophageal cancer (5/27), and solid tumors (5/27).
Although several studies are evaluating mRNA vaccines in the single agent setting, most studies have at least one study arm where the vaccines are being administered in combination with other therapeutic agents. Cancer immunotherapy drugs, including anti‐PD‐L1/anti‐PD1 antibodies and cytotoxic T lymphocyte antigen 4 (CTLA‐4)‐blocking antibodies, are the most commonly used combination agents. Sixteen of the 27 trials have at least one arm where the vaccine is being co‐administered with one or more cancer immunotherapy drug(s).
Interestingly, several trials are also evaluating administration of multiple mRNA vaccines concurrently (Table S3). This approach of co‐administering two mRNA vaccines may be necessary when one of the mRNA vaccines is not able to be administered repeatedly due to the immunogenicity of the vector, which is the case for GRT‐C901 and GRT‐C903. 4 Alternatively, a two mRNA vaccine approach may aim to combine one vaccine, which encodes a shared tumor antigen and is more of an “off the shelf” product, together with a vaccine that is personalized to the patient's tumor neoantigens in order to achieve a more diverse T cell response.
DISCUSSION
Although it is useful to survey and evaluate completed, ongoing, and planned clinical trials, manual curation of the more than 400,000 study records posted on ClinicalTrials.gov can be tedious and error prone. New technologies, such as NLP, which combine computational linguistics and artificial intelligence, can help provide structured information from unstructured free text in an automatic and efficient manner. NLP‐based technology has proven its utility in areas such as disease prevention and treatment support, where it has been used to help identify risk factors, predict disease progression, summarize patient information, and significantly reduce clinicians' workload for manual chart review. 5 , 6 More specifically to the work presented here, the continuous registration of new clinical trials, in addition to the large volume of existing registered trials, makes it difficult to efficiently retrieve quality data for analysis over time. Fortunately, text mining tools make it feasible to systematically identify trials for a given analysis. Mining clinical trial data with NLP to extract and summarize information helps to (1) survey the clinical trial landscape across various indications, sponsors, and study phases, and (2) enable decision making for ongoing and future clinical studies with respect to trial design, treatment duration, stratification factors, inclusion/exclusion criteria, and primary and key secondary end points. NLP provides a tool to collect and aggregate data in an efficient and systematic manner, thereby allowing for faster turnaround times and fewer errors. In the future, we envision leveraging a similar approach to compare clinical study designs across modalities and indications to provide insights into similarities and differences between how these therapeutics are evaluated in the clinic.
The use case presented here highlights the application and utility of NLP and text mining tools (Linguamatics I2E) to summarize and gain insights from a large database, such as ClinicalTrials.gov. Combined with data cleaning and curation by the study team, these tools identified 551 clinical trials, from which this work rapidly and efficiently summarized 27 relevant mRNA cancer vaccine trials. Implementing Named Entity Recognition capabilities not only allowed us to overcome difficulties often incurred when manually identifying acronyms and synonyms for biomedical terminologies, but also increased the precision and recall of identifying cancer vaccine trials. In addition, the automatic extraction and standardization of biomedical terminologies reduced the time and resources needed for curating data into a structured format. However, one limitation of the Linguamatics I2E tool is that it relies heavily on manually defined patterns, which could miss information that is not covered by a predefined rule. Additional advances are needed for this methodology to reach its full potential and reduce manual interventions; furthermore, the field would benefit from formal benchmarking across methods.
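For readers less familiar with the terms, precision and recall of a trial‐retrieval step can be computed against a manually curated reference set as in the minimal Python sketch below. No precision or recall values are reported in this work; the trial identifiers `T1` to `T5` and the function `precision_recall` are hypothetical and serve only to make the definitions concrete.

```python
def precision_recall(retrieved, relevant):
    """Precision = share of retrieved trials that are relevant.
    Recall = share of relevant trials that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = retrieved & relevant
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up identifiers, for illustration only.
retrieved_by_query = {"T1", "T2", "T3", "T4"}
curated_relevant = {"T2", "T4", "T5"}

precision, recall = precision_recall(retrieved_by_query, curated_relevant)
# Two of the four retrieved trials are relevant, and two of the three
# relevant trials were retrieved.
```

A formal benchmark of the kind called for above would require such a curated reference set that is built independently of the query being evaluated.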
Additionally, although our methodology improved efficiency, the information available on ClinicalTrials.gov only provides a general overview, and missing data limit the information and trends that can be extracted from the database. For example, ~30% of the studies did not report a route of administration on their ClinicalTrials.gov page, limiting interpretation of any trends on the optimal or most common route of administration. Furthermore, only a handful of trials reported the vaccine dose or dosing frequency used in their study, both of which are key aspects for the development of a successful therapeutic. Additionally, the ClinicalTrials.gov treatment information often only contained the product name, with no additional description of the therapeutic and how to categorize the drug species. Thus, the study team had to identify which cancer treatments qualified as mRNA cancer vaccines by manually reviewing the treatment names and searching through literature and company websites for the desired descriptions. Although these missing data could be manually curated through literature searches, the work here aims to present our findings using NLP technologies while also highlighting the shortcomings of both ClinicalTrials.gov and text mining/NLP technologies. The increasing popularity of text mining tools is a great motivator for enriching publicly available databases to enable more impactful data extraction for metadata analysis.
Overall, this work highlights how novel technologies such as text mining and NLP can enable faster, more efficient research and presents a use case where these methodologies were implemented to provide a high‐level overview of the mRNA cancer vaccine clinical trial landscape.
FUNDING INFORMATION
No funding was received for this work.
CONFLICT OF INTEREST STATEMENT
All authors are employees and stockholders of Roche/Genentech, Inc.
Supporting information
Figure S1
Figure S2
Table S1
Table S2
Table S3
ACKNOWLEDGMENTS
The authors would like to acknowledge Benjamin Wu and Chi‐Chung Li for their subject expertise and review of the manuscript. We thank Anshin BioSolutions Corporation for medical writing support, which was provided under the direction of the authors.
Vora B, Kuruvilla D, Kim C, Wu M, Shemesh CS, Roth GA. Applying Natural Language Processing to ClinicalTrials.gov: mRNA cancer vaccine case study. Clin Transl Sci. 2023;16:2417‐2420. doi: 10.1111/cts.13648
Bianca Vora, Denison Kuruvilla, and Chloe Kim contributed equally to this work.
REFERENCES
- 1. Sioutos N, de Coronado S, Haber MW, Hartel FW, Shaiu WL, Wright LW. NCI thesaurus: a semantic model integrating cancer‐related clinical and molecular information. J Biomed Inform. 2007;40(1):30‐43. doi: 10.1016/j.jbi.2006.02.013
- 2. Pardi N, Hogan MJ, Weissman D. Recent advances in mRNA vaccine technology. Curr Opin Immunol. 2020;65:14‐20. doi: 10.1016/j.coi.2020.01.008
- 3. Chaudhary N, Weissman D, Whitehead KA. mRNA vaccines for infectious diseases: principles, delivery and clinical translation. Nat Rev Drug Discov. 2021;20(11):817‐838. doi: 10.1038/s41573-021-00283-5
- 4. Wolfson B, Franks SE, Hodge JW. Stay on target: reengaging cancer vaccines in combination immunotherapy. Vaccines (Basel). 2021;9(5):509. doi: 10.3390/vaccines9050509
- 5. Aramaki E, Wakamiya S, Yada S, Nakamura Y. Natural language processing: from bedside to everywhere. Yearb Med Inform. 2022;31(1):243‐253. doi: 10.1055/s-0042-1742510
- 6. Wu HY, Shendre A, Zhang S, et al. Translational knowledge discovery between drug interactions and pharmacogenetics. Clin Pharmacol Ther. 2020;107(4):886‐902. doi: 10.1002/cpt.1745