Journal of the American Medical Informatics Association: JAMIA
Editorial. 2019 Oct 15;26(11):1161–1162. doi: 10.1093/jamia/ocz174

New approaches to cohort selection

Amber Stubbs 1, Özlem Uzuner 2,3,4
PMCID: PMC7647252  PMID: 31613362

Cohort selection for clinical trials is a critical component of modern medicine, yet it remains one of the most difficult, time-consuming, and expensive aspects of testing new treatments and interventions. Each clinical trial defines inclusion and exclusion criteria that describe the patient population required for the trial to accurately determine the efficacy of the treatment. These criteria can be broad, limited only to specific ages or genders, or very specific, requiring that certain medications be taken within a given time period or that patients have certain intentions (eg, an intention to become pregnant).

While a simple database search can often identify patients of the right age, or even those with particular diagnoses or test results, the more complex criteria often require study staff to manually examine records to identify qualified patients. Alternatively, studies may rely on patients to seek out the trial or to be directed to it by their doctors, both of which may lead to representation bias and misleading conclusions for the trial.1,2
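As a minimal sketch of the kind of simple structured search described above, the following Python example filters patients by age range and a coded diagnosis. The `Patient` record, its field names, and the example criteria are illustrative assumptions, not drawn from any system discussed in this issue.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Patient:
    # Hypothetical structured record; real EHR schemas differ.
    patient_id: str
    age: int
    diagnoses: Set[str] = field(default_factory=set)


def structured_prefilter(
    patients: List[Patient], min_age: int, max_age: int, required_diagnosis: str
) -> List[Patient]:
    """Keep patients whose structured fields satisfy the simple criteria."""
    return [
        p
        for p in patients
        if min_age <= p.age <= max_age and required_diagnosis in p.diagnoses
    ]


# Invented example data, for illustration only.
patients = [
    Patient("p1", 54, {"type 2 diabetes"}),
    Patient("p2", 17, {"type 2 diabetes"}),
    Patient("p3", 61, {"hypertension"}),
]

eligible = structured_prefilter(patients, 18, 75, "type 2 diabetes")
eligible_ids = [p.patient_id for p in eligible]
```

A filter like this can only narrow the candidate pool. Criteria that depend on timing, intent, or free-text documentation are not expressible against such fields, which is why manual review is still required.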

For our field of medical informatics, leveraging natural language processing (NLP) tools to aid in patient identification means that we can potentially identify a larger number of qualified patients and remove the bias of self-selection from the clinical trial process, as well as greatly reduce the time and cost of patient selection. This special issue of the Journal of the American Medical Informatics Association explores ways that NLP and machine learning can aid in cohort selection.

The 2018 National NLP Clinical Challenges (n2c2) shared task featured a track on cohort selection; Stubbs et al3 provide an overview of the task itself and results from participating teams. Participants were asked to match patients to criteria, with the criteria representing varying complexities and instances of measurement detection, inference, temporal reasoning, and expert knowledge. Participants used a variety of approaches for the task: most teams used rule-based or hybrid rule/machine learning approaches, and some teams included medical professionals. Overall, teams that included medical professionals tended to perform well, and no particular type of criterion proved difficult for a majority of the top-performing systems. These results suggest that NLP systems could be used reliably to assist in cohort selection, and that future researchers could focus on examining system performance on more complex criteria.
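To make the idea of rule-based measurement detection concrete, here is a minimal Python sketch that searches a free-text note for a laboratory value and checks it against a range. The criterion (an HbA1c value between 6.5 and 9.5), the regular expression, and the function name are hypothetical choices for illustration; this is not the method of any participating team.

```python
import re

# Hypothetical pattern: the term "HbA1c" or "A1c", up to 20 non-digit
# characters on the same line, then a number such as "7" or "7.2".
HBA1C_PATTERN = re.compile(
    r"\b(?:hba1c|a1c)\b[^0-9\n]{0,20}(\d{1,2}(?:\.\d+)?)",
    re.IGNORECASE,
)


def meets_hba1c_criterion(note: str, low: float = 6.5, high: float = 9.5) -> bool:
    """Return True if any detected HbA1c value lies within [low, high]."""
    values = [float(m.group(1)) for m in HBA1C_PATTERN.finditer(note)]
    return any(low <= v <= high for v in values)


# Invented example notes, for illustration only.
in_range = meets_hba1c_criterion("HbA1c was 7.2% on last visit.")
out_of_range = meets_hba1c_criterion("A1c: 10.4 in March.")
no_mention = meets_hba1c_criterion("No diabetes labs recorded.")
```

A single pattern like this ignores negation, dates, and unit variations. Handling those cases is where hand-built knowledge resources and clinical expertise become labor-intensive.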

One of the n2c2 participant teams, Vydiswaran et al,4 utilized a hybrid combination of “pattern-based, knowledge-intensive, and feature weighting techniques.” The authors note that “[…] the best-performing approaches for many cohort identification criteria relied heavily on knowledge resources and decision lists,” and that the creation of these resources is labor-intensive. Their system was one of the top-ranked systems in the shared task.

In contrast, shared task participants Segura-Bedmar and Raez5 explored how various machine learning techniques, such as a convolutional neural network, a recurrent neural network, a convolutional neural network–recurrent neural network hybrid architecture, and “a fully connected feedforward layer,” affected the system output. They found that the hybrid models performed better than the others and that, in most systems, the fully connected feedforward layer improved results, suggesting that deep learning is a potential way forward in the realm of NLP for cohort selection.

Outside of the n2c2 shared task, Hernandez-Boussard et al6 explored whether “real-world data” can be used to draw conclusions about patients and their matching of selection criteria, and compared traditional query techniques on structured data to NLP and machine learning techniques on unstructured data. They found that, with the exception of medications, searching structured data led to poor results compared to the results from training machine learning systems on unstructured data. This suggests that unstructured data contain more relevant information on patients than structured data when matching selection criteria, and that standard database queries are insufficient for selecting cohorts.

Matching patients to criteria is not the only aspect of successful clinical trial recruitment: the patients must also be recruited and agree to join the study. Gligorijevic et al7 explored the impact of investigators on the success of a study, and developed a system to predict which clinical investigators are more likely to perform well in terms of meeting recruitment goals. Their research suggests new ways, outside of NLP on unstructured records, that machine learning can be used to optimize cohort selection.

The articles in this special issue take a variety of approaches to exploring the issue of cohort selection, and we hope that the results found here can be used to make clinical trial research more timely and cost-effective.

References

  • 1. Mann CJ. Observational research methods. Research design II: cohort, cross sectional, and case-control studies. Emerg Med J 2003; 20 (1): 54–60.
  • 2. Geneletti S, Richardson S, Best N. Adjusting for selection bias in retrospective, case-control studies. Biostatistics 2009; 10 (1): 17–31.
  • 3. Stubbs A, Filannino M, Soysal E, et al. Cohort selection for clinical trials: n2c2 2018 shared task track 1. J Am Med Inform Assoc 2019; doi: 10.1093/jamia/ocz163.
  • 4. Vydiswaran VGV, Strayhorn A, Zhao X, et al. Hybrid bag of approaches to characterize selection criteria for cohort identification. J Am Med Inform Assoc 2019; doi: 10.1093/jamia/ocz079.
  • 5. Segura-Bedmar I, Raez P. Cohort selection for clinical trials using deep learning models. J Am Med Inform Assoc 2019; doi: 10.1093/jamia/ocz139.
  • 6. Hernandez-Boussard T, Monda KL, Crespo BC, Riskin D. Real-world evidence in cardiovascular medicine: assuring data validity in electronic health record-based studies. J Am Med Inform Assoc 2019; doi: 10.1093/jamia/ocz119.
  • 7. Gligorijevic J, Gligorijevic D, Pavlovski M, et al. Optimizing clinical trials recruitment via deep learning. J Am Med Inform Assoc 2019; doi: 10.1093/jamia/ocz064.

Articles from Journal of the American Medical Informatics Association : JAMIA are provided here courtesy of Oxford University Press
