PLoS One. 2024 Mar 11;19(3):e0300109. doi: 10.1371/journal.pone.0300109

Enhancing site selection strategies in clinical trial recruitment using real-world data modeling

Lars Hulstaert 1,*, Isabell Twick 1, Khaled Sarsour 2, Hans Verstraete 3
Editor: Krit Pongpirul
PMCID: PMC10927105  PMID: 38466688

Abstract

Slow patient enrollment or failing to enroll the required number of patients is a disruptor of clinical trial timelines. To meet planned trial recruitment, site selection strategies are used during clinical trial planning to identify the research sites that are most likely to recruit a sufficiently high number of subjects within trial timelines. We developed a machine learning approach that outperforms baseline methods in ranking research sites by their expected recruitment in future studies. Indication-level historical recruitment and real-world data are used in the machine learning approach to predict patient enrollment at site level. We define covariates based on published recruitment hypotheses and examine the effect of these covariates in predicting patient enrollment. We compare the performance of a linear and a non-linear machine learning model with common industry baselines constructed from historical recruitment data. Performance of the methodology is evaluated and reported for two disease indications, inflammatory bowel disease and multiple myeloma, both of which are actively being pursued in clinical development. We validate recruitment hypotheses by reviewing the covariates' relationship with patient recruitment. For both indications, the non-linear model significantly outperforms the baselines and the linear model on the test set. In this paper, we present a machine learning approach to site selection that incorporates site-level recruitment and real-world patient data. The model ranks research sites by predicting the number of recruited patients, and our results suggest that the model can improve site ranking compared to common industry baselines.

Introduction

Background

Slow patient enrollment or failing to enroll the required number of patients is a disruptor of clinical trial timelines, leading to potential delays in drug approval, underpowered studies, the need to include additional study sites, or even trial terminations [1–3]. Site selection, the process in which research sites (healthcare organizations and their associated investigators) are chosen to participate in clinical trials, is critical to enable timely recruitment.

Common site selection strategies use past trial data to assess how well a site would perform in a prospective clinical trial, and different standardized and objective methods have been developed across industry and academia [4–8]. These methods include analyzing factors such as prior trial participation and performance, which are interrogated through database searches in investigator, site, and enrollment data sources. In certain cases, this process is complemented with epidemiologic and geographical analyses to create short lists of research sites that have both relevant research experience and direct access to a sufficiently large target patient population [5,9,10]. Research sponsors subsequently use site-level feasibility questionnaires to obtain estimates of expected recruitment for shortlisted sites, which often results in an overestimation of their ability to recruit patients in trials [11]. It is hypothesized that this overestimation occurs because investigators do not have full protocol information and have limited time for a thorough trial feasibility assessment [6].

Estimating site-level trial performance is a complex problem, further complicated by an increasingly competitive trial landscape and complex clinical trial designs [12]. Table 1 summarizes factors that have been reported in the literature to influence site recruitment performance. A common challenge in site selection strategies is aggregating the collected information to make final decisions. The range of variables that impact the performance of a site requires an effective measurement of the trade-offs that influence prospective recruitment performance [3].

Table 1. An overview of variables published in the literature that are hypothesized to drive site recruitment performance.

Variable | References
Past site & investigator trial performance | [3,6–8,13–16]
Past site & investigator trial experience | [3,13–15,17]
Study population, procedures and treatments performed at hospital | [2,3,6,7,14,16]
Time required to enroll the first subject in past studies | [6,7,13,14]
Duration of recruitment period | [17]
Site research capabilities and infrastructure | [14–16]
Number of specialists and dedicated research staff | [2,6,7,14,15]
Time available for research | [6,7,14]
Interest in research and publication track record | [3,15]
Language proficiency | [15]

Quantitative research in this field is limited by the volume of clinical trial data needed to generate meaningful recruitment insights. Typically, the impact of the reported site-level factors on recruitment performance is either not validated, validated only on a small sample of studies, or validated only with feasibility questionnaire data from a single study. The low power observed in these analyses signals that a cross-study analysis is needed to yield generalizable results [17].

To address these challenges, different statistical and machine learning methods have been developed to estimate trial recruitment. Approaches differ in whether they predict enrollment at study [18,19] or study-site level [20,21], and whether enrollment is predicted at the start of the trial [18,20,21] or over the course of a trial's enrollment duration [20]. Several study-related factors associated with trial enrollment have been studied (e.g., trial design, phase, sponsor, disease indication, competing trials recruiting similar patients, and investigator experience and characterization) [20,21]. Existing study-site models examine variables that describe a small number of site-specific factors such as research experience, but these modeling approaches and their data are not tailored to the study indication and population. The use of electronic health record and claims data to characterize the available patient population, for example, remains limited [1,8].

Objective

The goal of this work is to predict the number of patients enrolled at a clinical trial site, before the start of a trial’s enrollment phase, using a broad set of indication-specific and site level characteristics. We explore a machine learning method that considers research experience, historical performance, patient availability and other site and investigator factors to make site-level enrollment predictions. The model predictions can be used in the operational planning phase prior to the start of a study when potential study sites are selected.

The methodology is validated in inflammatory bowel disease (IBD) and multiple myeloma (MM). Given the limited availability of real-world data sources with large-scale hospital coverage outside of the US, the analysis is limited to predicting patient enrollment at US research sites. The approach aims to address the following research questions:

  • How do approaches that leverage a broad range of (study)-site variables compare to baseline site selection strategies?

  • Does the use of non-linear models boost the generalization performance vs. a simple linear Poisson regression model? Does the model benefit from capturing non-linear relations between covariates and the model target?

To allow for comparison with previously published model results, the models are also compared in their ability to identify the bottom and top 30% performing investigators [21].

Materials and methods

Data sources

The data used in this work is sourced from different systems that contain structured data related to studies, research sites, investigators, and patient populations. The following section provides a description of the data that is collected from these data sources. A summary of the data sources is provided in Table 2, describing the data type, provider, coverage & time frame.

Table 2. An overview of the different data sources with a description of the data, coverage, and time frame.

Data type & provider | Description | Coverage & time frame
Real-world data; Komodo Health | Claims database containing open and closed claims, covering inpatient, outpatient, emergency department, and institutional encounters. | High coverage of the US population with healthcare insurance coverage (Commercial, Managed Medicaid, Medicaid, Medicare Advantage, Dual Eligible); 300M+ patients, 2016–present.
Enrollment data; DrugDev DataQuerySystem (DQS) | Trial recruitment database containing site-level patient recruitment data across clinical studies. | DQS covers pharmaceutical, sponsor-led trials, provided voluntarily as part of the DrugDev Consortium; 178k+ studies, 1990–present.
Public study data; sourced from ClinicalTrials.gov and aggregated by Komodo Health | Trial database prepared by Komodo Health by linking clinical trials from ClinicalTrials.gov to healthcare providers using NLP-based provider matching. | Site- and investigator-level study participation data for privately and publicly funded clinical studies conducted in the US; 440k+ studies, initiated from 2000 onward.
Research publication data; sourced from PubMed and aggregated by Komodo Health | Publication database prepared by Komodo Health by linking publications on PubMed to healthcare providers using NLP-based provider matching. | Information on published manuscripts at investigator level, covering biomedical literature from MEDLINE, life science journals, and online books referenced by the National Library of Medicine; 35M+ citations, from 1996 onward.

Enrollment data from the DrugDev DataQuerySystem (DQS) is used to compute study-site level recruitment variables. DQS is a data platform that allows trial sponsors to share information on clinical trial recruitment and is used to capture study performance variables at site level, such as the site open date, first and last subject enrolled dates, the enrollment duration, and the number of patients who enrolled in a trial. The data is available for pharmaceutical trials across different disease indications. New data is made available monthly through DQS; sponsors publish enrollment data from their systems, consolidated into a common data format, once a study has been finalized.

For each site selection exercise, enrollment data is collected from DQS for so-called benchmark studies within a given indication. Benchmark studies are defined by manually reviewing the available studies within a given indication. The selection is further refined to ensure that benchmark trials are similar to prospective studies in terms of study phase, target indication, eligibility criteria, study duration, and type of intervention.

In Fig 1, the benchmark studies that have been used across the two exercises are visualized across study phase and study indication. The enrollment data, in terms of number of patients enrolled and enrollment months, is shown in Fig 2. The site distribution across US states is shown in Fig 3, as well as the site open year distribution. The complete list of benchmark studies for each exercise is provided in S1 File.

Fig 1. Overview of number of benchmark studies across phase and indication.

An overview of the number of benchmark studies across study phase and study indication for IBD and MM, respectively.

Fig 2. Overview of number of patients enrolled and number of enrollment months across study-site combinations.

An overview of the number of patients enrolled and number of enrollment months per site across the benchmark studies for IBD and MM, respectively.

Fig 3. Overview of number of sites across US states and open year.

An overview of the number of sites across US states and site open year for IBD and MM, respectively.

Data from the Komodo Healthcare map, a real-world data (RWD) source with significant geographical coverage in the US, is used to characterize the study population. This data contains longitudinal patient claims data with information on prescribed drugs, diagnoses, procedures, and treatments.

Linkage of patients across longitudinal data (e.g., instances where a patient is treated at multiple institutions during a follow-up period and across healthcare providers) is performed by Komodo Health prior to sharing the data. Based on these complete patient journeys we can characterize the referral patterns across healthcare providers. Additional tables are provided by Komodo that describe provider level publications and trial participation. These tables have been processed in such a way that they can easily be linked to healthcare providers based on the National Provider Identifier (NPI) system [22].

Patient cohorts are created from common trial eligibility criteria of the benchmark studies to mimic the target population of the prospective and benchmark studies. Exact replication of the target patient population is often not possible with the available claims data: patient outcomes and lab measurement results, for example, are typically not available in claims data, while they are often part of a trial's eligibility criteria. Cohort definitions are therefore designed to mimic the general target patient population across benchmark studies. Patient populations are specified through an observation period, specialist type, patient age, diagnosis, drug, and procedure codes. Inclusion criteria of the benchmark studies are used to define a superset of relevant diagnosis, drug, and procedure codes. These codes define a patient cohort that represents the broad patient population that is eligible for the benchmark studies. The cohort definitions for each exercise are shared in S2 File. The publication and trial databases are filtered only at indication level, to capture the breadth of the research experience and interest of the healthcare organization (HCO).
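
To make the cohort construction concrete, the sketch below shows how such a filter over a claims table might look in Python with pandas. It is a minimal illustration, not the authors' pipeline: the column names, code prefixes, and age threshold are hypothetical placeholders, and the actual cohort definitions are those in S2 File.

```python
import pandas as pd

# Hypothetical code list and observation window; the real definitions are in S2 File.
DX_PREFIXES = ("K50", "K51")                      # e.g. ICD-10-CM prefixes for IBD (assumed)
OBS_START, OBS_END = "2016-01-01", "2022-12-31"   # cohort observation period (assumed)

def build_cohort(claims: pd.DataFrame) -> pd.Series:
    """Return unique patient ids forming the broad eligible population.

    `claims` is assumed to have columns: patient_id, dx_code, claim_date, age.
    """
    in_window = claims["claim_date"].between(OBS_START, OBS_END)
    has_dx = claims["dx_code"].str.startswith(DX_PREFIXES)
    is_adult = claims["age"] >= 18
    return claims.loc[in_window & has_dx & is_adult, "patient_id"].drop_duplicates()
```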

Real-world data is available at patient level but is aggregated to HCO level. Different covariates are extracted at HCO level to characterize the available study population, procedures, treatments, staff, publications, and trial experience. The RWD and recruitment data sources are linked at HCO level. As DQS and Komodo use different HCO identifiers, manual validation is performed to ensure that each HCO is correctly linked across the data sources.

The real-world data accessed for this study was deidentified in accordance with the Health Insurance Portability and Accountability Act, and no personal health information was extracted. Therefore, no informed consent or institutional review board approval was required for this study.

Outcome of interest and covariates

The outcome of interest, the enrollment at study-site level, is defined as the total number of recruited patients at a given site for a given study. A summary of the enrollment characteristics of the two exercises is provided in S1 Table. Covariates are constructed from enrollment and real-world data to characterize a site within the context of a study. From the enrollment data, historical performance and experience variables are generated, such as covariates that summarize historical recruitment over a time window, the enrollment period, and the year when the site opened for a given study. The real-world data is used to build a set of indication-specific covariates that characterize study population, treatment, staff (physicians and specialists), referrals, publication, and trial participation.

Table 3 shows the different covariates that are generated and how they are constructed from the respective data sources. Two types of covariates exist: those that characterize the site (site level covariates), which are assumed to remain static over time and do not differ across benchmark studies, and those that change over time and are unique at the study-site level (study-site level covariates). The covariate creation process was largely hypothesis driven, following the hypotheses in Table 1. Covariate outliers outside the [0%, 97.5%] interval are clipped to the nearest boundary value; right-side clipping is used because the covariates are gamma distributed. Missing values are imputed with 0.
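
A minimal sketch of this preprocessing step, assuming the covariates sit in a numeric pandas DataFrame (illustrative, not the authors' code):

```python
import pandas as pd

def preprocess_covariates(covariates: pd.DataFrame) -> pd.DataFrame:
    """Clip each covariate at its 97.5th percentile and impute missing values with 0."""
    upper = covariates.quantile(0.975)            # per-column right-side clipping bound
    return covariates.clip(upper=upper, axis=1).fillna(0)
```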

Table 3. Overview of the covariates.

Covariate | Variable description & construction | Source & type
Patients per site (target) | The number of recruited patients at a given site for a given study. | DQS; study-site level
Enrollment months | Number of months a given site enrolled patients in the study, defined as the time between the site open date and the date of the last subject enrolled in the study. | DQS; study-site level
Site open year | The year when the site opened in the study. | DQS; study-site level
Median/EWMA of patients per site per month in the last 5 years | The median or exponentially weighted moving average (EWMA) of the number of patients per month at a site across the studies the site participated in over the last 5 years; the time window is indexed on the site open date for a given study. | DQS; study-site level
Median/EWMA rank in enrolled patients in the last 5 years | The median or EWMA of the rank of the site with respect to number of patients across studies the site participated in over the last 5 years; the time window is indexed on the site open date for a given study. | DQS; study-site level
Median/EWMA/sum of number of enrolled patients in the last 5 years | The median, EWMA, or sum of the number of patients recruited across studies the site participated in over the last 5 years; the time window is indexed on the site open date for a given study. | DQS; study-site level
Number of claims | Number of unique claims (as defined by the diagnosis codes) at the site across the observation period. | RWD; site level
Number of patients | Number of unique patients with a relevant claim (as defined by the diagnosis codes) at the site across the observation period. | RWD; site level
Number of treated patients | Number of unique patients with a relevant claim (as defined by the diagnosis codes) that received treatment (as defined by medication and procedure codes) at the site across the observation period. | RWD; site level
Number of visits | Number of unique visits with a relevant claim (as defined by the diagnosis codes) at the site across the observation period. | RWD; site level
Visits per patient | Number of unique visits per patient with a relevant claim (as defined by the diagnosis codes) at the site across the observation period. | RWD; site level
Number of physicians | Number of unique physicians associated with relevant claims (as defined by the diagnosis codes) at the site across the observation period. | RWD; site level
Number of treating physicians | Number of unique physicians associated with relevant claims (as defined by the diagnosis codes) that treated patients (as defined by medication and procedure codes) at the site across the observation period. | RWD; site level
Number of specialists | Number of unique specialists (as defined by a list of specialists of interest) at the site across the observation period. | RWD; site level
Patient flow difference | Difference between the number of incoming and outgoing patient referrals (counted over referrals with a relevant claim as defined by the diagnosis codes) at the site across the observation period. | RWD; site level
Site flow difference | Difference between the number of sites that refer patients to the site and the number of sites that the site refers patients to. | RWD; site level
Number of publications | Maximum number of publications in the indication of interest, grouped across the physicians at the site. | PubMed; site level
Number of ongoing trials | Number of ongoing trials at the site. | ClinicalTrials.gov; site level
Number of Janssen trials | Number of Janssen trials at the site. | ClinicalTrials.gov; site level
Trial experience | A binary variable indicating whether a site had past trial experience. | ClinicalTrials.gov; site level
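
As an illustration of how the study-site level recruitment covariates above could be computed, the sketch below derives the EWMA of patients per month over a site's studies in the 5 years preceding the site open date. The schema (site_id, site_open_date, patients, enrollment_months) and the EWMA span are assumptions for illustration only.

```python
import pandas as pd

def ewma_patients_per_month(history: pd.DataFrame,
                            site_id: str,
                            open_date: pd.Timestamp) -> float:
    """EWMA of patients/month across a site's studies in the 5 years before open_date."""
    window_start = open_date - pd.DateOffset(years=5)
    past = history[
        (history["site_id"] == site_id)
        & (history["site_open_date"] >= window_start)
        & (history["site_open_date"] < open_date)
    ].sort_values("site_open_date")
    if past.empty:
        return 0.0  # consistent with imputing missing covariate values with 0
    rate = past["patients"] / past["enrollment_months"]
    return rate.ewm(span=5).mean().iloc[-1]       # span is an illustrative choice
```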

Although real-world data represents a broad set of patients that are potentially eligible for trial participation at any given time, its covariates are not aligned at the study-site level. While temporal alignment of RWD and recruitment data is possible based on the claim dates and the enrollment period of a site in each study, the real-world data is available only from 2016 onwards, while the benchmark studies start as early as 2006. As such, the cohort observation period is used instead to characterize the real-world clinical practice of a site. The variability in yearly calculations of the site level RWD covariates across the available data is sufficiently small, allowing them to be approximated as constant when averaged across the cohort observation period. Before 2016, it is not possible to validate this assumption, which has the potential to introduce bias in the RWD covariates for studies conducted before 2016.

Proposed approach

To predict site performance based on enrollment and real-world data covariates, different machine learning models have been developed. The number of recruited patients at study-site level is a discrete count that is assumed to follow a Poisson distribution. The machine learning problem is therefore defined as a Poisson regression problem in which the enrollment months represent the exposure period.
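
This formulation can be sketched with statsmodels, which supports an exposure term for Poisson GLMs. The snippet below uses synthetic data and is illustrative only; the exact model specification may differ from the authors'.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                         # site covariates (synthetic)
months = rng.uniform(6, 24, size=200)                 # enrollment months (exposure)
y = rng.poisson(np.exp(0.3 * X[:, 0]) * months / 12)  # synthetic enrollment counts

# log(exposure) enters the linear predictor as a fixed offset
model = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson(), exposure=months)
print(model.fit().summary())
```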

We use a random train (80%) and test (20%) data split at site level to avoid the potential of a data distribution bias and a corresponding impact on model generalization capabilities. The use of study-specific variables is limited to ensure generalizability across studies and to limit data leakage. A similar site-level approach is used for cross-validation, with 5-fold cross-validation groups. In line with the Poisson modeling objective, models are compared on different regression metrics: mean absolute error (MAE), root mean squared error (RMSE), and the Spearman correlation coefficient are evaluated on both the train and test sets. The coefficient of determination (R2) is provided as a reference, as the models are optimized for their ability to rank sites using the Spearman correlation coefficient.
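
A sketch of how these metrics might be computed with scikit-learn and SciPy (the paper does not publish its evaluation code, so this is an assumed implementation):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """MAE, RMSE, Spearman correlation, and R2 between actual and predicted counts."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
        "Spearman": spearmanr(y_true, y_pred).correlation,
        "R2": r2_score(y_true, y_pred),
    }
```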

We also assess whether the models succeed in identifying the top and bottom 30% of research sites in terms of enrollment, to allow for comparison with the results provided in prior work [21].

The regression outputs are converted into a ranked list, and sites are grouped into two classes per task: top 30% versus bottom 70% for the top-site task, and bottom 30% versus top 70% for the bottom-site task. This group assignment is performed with both the actual and predicted enrollment counts, yielding the actual and predicted labels. We use the area-under-the-curve (AUC) classification metric to compare the different models on these classification tasks.
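
One possible implementation of this conversion, with the predicted counts acting as the classification score (illustrative; the 30% threshold follows the definition above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def top_bottom_auc(y_true, y_pred, frac=0.3, top=True):
    """AUC for identifying the top (or bottom) `frac` of sites by actual enrollment.

    Labels come from the actual counts; the predicted counts act as the ranking
    score (negated for the bottom task, so lower predictions rank first).
    """
    cutoff = np.quantile(y_true, 1 - frac if top else frac)
    labels = (y_true >= cutoff) if top else (y_true <= cutoff)
    scores = y_pred if top else -np.asarray(y_pred)
    return roc_auc_score(labels, scores)
```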

As there are no guidelines for the systematic evaluation of site selection methods, the performance of new methods is compared to the median historical enrollment as a baseline. With this baseline, referred to as the median baseline, the median of the enrollment in the train set is used to predict the enrollment of sites in the test set.

To reflect the common industry practice of using historical performance, we add a baseline method based on site-level historical enrollment, referred to as the site level baseline. With this baseline, the median of the historical enrollment of a site is used to predict the enrollment of the site in other studies. If no historical enrollment data is available for a site, we impute the historical enrollment with the median historical enrollment in the train set.
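
The two baselines reduce to a few lines of pandas; a sketch under an assumed schema with columns site_id and patients:

```python
import pandas as pd

def median_baseline(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """Predict the global median train-set enrollment for every test site."""
    return pd.Series(train["patients"].median(), index=test.index)

def site_level_baseline(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """Predict each site's median historical enrollment, falling back to the global median."""
    per_site = train.groupby("site_id")["patients"].median()
    return test["site_id"].map(per_site).fillna(train["patients"].median())
```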

Covariate selection and model training

For each exercise, a linear Poisson regression model and two non-linear machine learning models, a RandomForest and an XGBoost model (v.1.7.2) [23], are trained and compared with the median and site level baselines. We considered other non-linear models but did not observe a significant difference in performance. The open-source framework Tune (v.0.1.5) [24] is used to train and perform hyperparameter tuning on the non-linear models. The range of hyperparameters over which the models are optimized is shared in S4 File, as well as the optimal set of hyperparameters for each experiment. In the hyperparameter optimization framework, a new set of hyperparameters is randomly sampled in each experiment and evaluated using cross-validation. Across 128 experiments, the set of optimal hyperparameters is identified for a given dataset.
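
The search loop might look as follows with a Ray 1.x-style Tune API (the paper used Tune v0.1.5, whose interface differs; the search ranges here are illustrative, the actual grid is in S4 File, and X_train/y_train are assumed to be defined). For brevity the sketch omits the exposure offset, which XGBoost's count:poisson objective can receive as a base margin of log(enrollment months).

```python
import xgboost as xgb
from ray import tune
from sklearn.model_selection import cross_val_score

def train_xgb(config):
    model = xgb.XGBRegressor(objective="count:poisson", **config)
    score = cross_val_score(model, X_train, y_train, cv=5,
                            scoring="neg_mean_absolute_error").mean()
    tune.report(mae=-score)   # report cross-validated MAE for this sample

analysis = tune.run(
    train_xgb,
    num_samples=128,          # 128 randomly sampled configurations, as in the paper
    config={
        "max_depth": tune.randint(2, 10),
        "learning_rate": tune.loguniform(1e-3, 0.3),
        "n_estimators": tune.randint(50, 500),
    },
)
print(analysis.get_best_config(metric="mae", mode="min"))
```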

The SHapley Additive exPlanations (SHAP) [25] algorithm is used to estimate the importance of the covariates and to determine the partial dependency relationship between covariates and enrollment. Manual covariate selection is performed by assessing covariate importance with the model trained on all covariates using the training data. Covariates with a variable importance, as defined by the covariate's mean SHAP value, below 0.005 are removed from the covariate set. For each model, the selected set of covariates is listed in S3 File and is a subset of the full set of covariates described in Table 3. To assess whether accuracy differences across modeling approaches are statistically significant, a dependent t-test for paired samples is conducted on the models' absolute errors.
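
A sketch of this SHAP-based selection step, assuming a fitted XGBoost model (fitted_model) and a covariate DataFrame (X_train); the 0.005 threshold is the one quoted above:

```python
import numpy as np
import shap

explainer = shap.TreeExplainer(fitted_model)      # fitted XGBoost model, assumed
shap_values = explainer.shap_values(X_train)      # shape: (n_samples, n_covariates)

# Covariate importance = mean absolute SHAP value over the training rows.
importance = np.abs(shap_values).mean(axis=0)
selected = X_train.columns[importance >= 0.005]   # drop covariates below the threshold
```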

Results

The model performance results across the different indications are shared in Table 4 for the test dataset. Train model performance results are provided in S2 Table for completeness. The different performance metrics are computed between the target and predicted enrollment. We compare a simple median baseline site selection strategy with a more advanced site level baseline. Finally, we compare the use of non-linear models with simpler linear modeling methods.

Table 4. Performance metrics are computed between the actual and predicted enrollment.

Indication | Model | Test R2 | Test Spearman correlation coefficient | Test RMSE | Test MAE | Test Top 30% AUC | Test Bottom 30% AUC
IBD | Median Baseline | - | - | 2.54 | 1.76 | 0.50 | 0.50
IBD | Site Baseline | - | 0.13 | 3.16 | 2.14 | 0.51 | 0.55
IBD | Linear Model | 0.14 | 0.44 | 2.33 | 1.70 | 0.77 | 0.77
IBD | Random Forest | 0.19 | 0.46 | 2.26 | 1.64 | 0.78 | 0.81
IBD | XGBoost | 0.22 | 0.46 | 2.22 | 1.60 | 0.78 | 0.84
MM | Median Baseline | - | - | 3.58 | 2.31 | 0.50 | 0.50
MM | Site Baseline | - | 0.16 | 4.07 | 2.71 | 0.57 | 0.73
MM | Linear Model | 0.10 | 0.34 | 3.24 | 2.52 | 0.71 | 0.76
MM | Random Forest | 0.15 | 0.35 | 3.20 | 2.40 | 0.75 | 0.84
MM | XGBoost | 0.13 | 0.37 | 3.19 | 2.40 | 0.74 | 0.79

Significance testing (a paired Student's t-test) was applied to assess the significance of performance differences between the models, as measured by the mean absolute error. For each experiment, the non-linear XGBoost model had a mean absolute error that was significantly lower than that of the linear model (p-value < 0.001). As no significant performance difference was observed among the non-linear models, only the XGBoost models are studied further.
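
The test itself is a one-liner with SciPy; a sketch assuming per-site predictions pred_xgb and pred_linear on a common test set:

```python
import numpy as np
from scipy.stats import ttest_rel

# Paired t-test on per-observation absolute errors of two models on the same test set.
abs_err_xgb = np.abs(y_test - pred_xgb)
abs_err_lin = np.abs(y_test - pred_linear)
t_stat, p_value = ttest_rel(abs_err_xgb, abs_err_lin)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```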

We use Shapley values [25] to estimate covariate importance in the model, shown in Fig 4. We also assess the relationship the model has learned between study-site level enrollment and the covariates of interest. Partial dependence plots, computed from the Shapley values of the XGBoost models, are used to visualize the relationship of a model covariate with the target variable. We explore the relationships between all selected covariates and the model target in Figs 5 and 6.
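
Such a dependence plot can be produced directly from the SHAP values computed in the covariate selection step above (the covariate name here is illustrative):

```python
import shap

# Partial dependence of the model output on one covariate, derived from the SHAP
# values and X_train of the selection sketch; "Number of specialists" is illustrative.
shap.dependence_plot("Number of specialists", shap_values, X_train,
                     interaction_index=None)
```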

Fig 4. Covariate importance of the selected covariates for the XGBoost models.

The mean SHAP value represents the average impact of a covariate on the model output magnitude.

Fig 5. Covariate dependence plots for the IBD XGBoost model.

The SHAP value represents the impact on the model output of a given covariate value.

Fig 6. Covariate dependence plots for the MM XGBoost model.

The SHAP value represents the impact on the model output of a given covariate value.

Discussion

The proposed modeling approach is versatile and applicable across indications whenever sufficient benchmark studies are available and the study population can be defined as a RWD cohort. The non-linear models improve the ability to rank sites by expected enrollment, visible in both the increase in Spearman correlation coefficient and the increase in AUC on the test set, compared to the linear model and the baselines. The ability to generate an accurate site level ranking allows trial organizers to identify and prioritize top performing sites.

Comparing our results to the results of earlier data-driven site selection methodologies [21] is not straightforward due to variability in study context (single indication vs. multiple indications), methodology (site vs. investigator ranking), and evaluation approach (random vs. study split). Comparing the top and bottom 30% AUCs of our methods with the published results, we observe that the average AUC on the test set (0.79) is higher than prior results (0.75), while our approach maintains a high level of interpretability regarding the relationship of study recruitment with the different covariates. Although the R2 remains low to modest, the models have been optimized for their ability to rank sites, as expressed through the Spearman correlation coefficient and AUC.

Across the two experiments, there are important differences in the key covariates, highlighting the fact that different factors play a role in recruitment depending on the indication. For instance, in trials targeting newly diagnosed patients, as is the case for IBD, the research site must wait for patients to become available; in such cases, the recruitment period is an important covariate. On the other hand, for indications where patients are already undergoing treatment, such as MM, covariates that characterize the research setting, including the number of specialists, publications, and ongoing trials, are key.

Regardless of the indication, past research experience, past high research performance, and a high number of patients are consistently strong positive indicators of recruitment potential, in line with previous research findings [1–17]. The site open year covariate captures the recruitment trend in a disease area over time and provides insight into the level of trial recruitment activity. In the case of IBD, competition in the trial landscape has increased greatly, as highlighted by the high importance of this covariate. While the insights gained from these machine learning models are specific to each indication, they can serve to inform future trial designs and recruitment strategies.

The proposed site selection methodology represents a notable advancement; however, challenges with respect to data availability remain. The utility of real-world data for site selection relies on its availability across large geographical areas; at present, this approach is only viable in the United States. Moreover, because US claims data is only available from 2016 onwards, the data cannot always be aligned to the study period of interest. Furthermore, the absence of large-scale linkage between claims data and electronic health records, lab, and genomic data poses challenges for the replication of study cohorts.

While expected recruitment is an important consideration in site selection strategies, it should not be the sole determinant in trial planning. Other factors, such as the overall experience of collaborating with a research site and its research capabilities, must also be considered. Additionally, sites with a diverse patient population need to be considered to improve the representativeness of the study population of clinical trials, and consequently the validity and generalizability of clinical trial results. Nonetheless, within the United States, several barriers to diversity in clinical trial participation still exist [26]. Therefore, new and diverse research sites, in addition to historically strong performing ones, need to be considered during site selection to ensure novel therapies are more broadly accessible geographically and across underrepresented populations.

Conclusion

This work demonstrated empirically the importance of real-world data in predicting the patient recruitment of research sites in clinical trials. To the best of our knowledge, this is the first study that leverages machine learning methods and indication-level real-world data for site level enrollment prediction. This study contributes to an improved understanding and quantitative validation of the factors that are critical to predicting site-level study recruitment, and provides a data-driven decision support system to help select and assess research sites for a proposed trial.

Supporting information

S1 File. List of benchmark studies per indication.

(DOCX)

S2 File. Real-world data cohort definition per indication.

(DOCX)

S3 File. Selected set of covariates per indication.

(DOCX)

S4 File. XGBoost hyperparameter grid and final model hyperparameters per indication.

(DOCX)

S1 Table. Summary of the enrollment statistics across the two experiments.

(DOCX)

S2 Table. Model performance on train set across the two experiments.

(DOCX)


Data Availability

The data underlying this article were provided by third parties (Komodo Health & IQVIA) under license and cannot be shared publicly. The source data for this study were licensed by Johnson & Johnson from Komodo Health and IQVIA, and hence we are not allowed to share the licensed data publicly. However, the same data used in this study are available for purchase by contracting with the database owners, Komodo Health (contact at: https://www.komodohealth.com/) and IQVIA (contact at: https://dqs.drugdev.com/help/contactUs). The authors did not have any special access privileges that other parties who license the data and contract with Komodo Health and IQVIA would not have. Further, in our efforts to enable use and reproducibility of the prediction model, we have provided detailed supporting material on covariates, and model hyperparameter settings.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1. Desai M. Recruitment and retention of participants in clinical studies: Critical issues and challenges. Perspect Clin Res 2020;11:51–3. doi: 10.4103/picr.PICR_6_20
  • 2. Demaerschalk BM, Brown RD, Roubin GS, Howard VJ, Cesko E, Barrett KM, et al. Factors Associated With Time to Site Activation, Randomization, and Enrollment Performance in a Stroke Prevention Trial. Stroke 2017;48:2511–8. doi: 10.1161/STROKEAHA.117.016976
  • 3. Fogel DB. Factors associated with clinical trials that fail and opportunities for improving the likelihood of success: A review. Contemp Clin Trials Commun 2018;11:156–64. doi: 10.1016/j.conctc.2018.08.001
  • 4. Hurtado-Chong A, Joeris A, Hess D, Blauth M. Improving site selection in clinical studies: a standardised, objective, multistep method and first experience results. BMJ Open 2017;7:e014796. doi: 10.1136/bmjopen-2016-014796
  • 5. Potter JS, Donovan D, Weiss RD, Gardin J, Lindblad B, Wakim P, et al. Site selection in community-based clinical trials for substance use disorders: Strategies for effective site selection. Am J Drug Alcohol Abuse 2011;37:400–7. doi: 10.3109/00952990.2011.596975
  • 6. Dombernowsky T, Haedersdal M, Lassen U, Thomsen SF. Criteria for site selection in industry-sponsored clinical trials: a survey among decision-makers in biopharmaceutical companies and clinical research organizations. Trials 2019;20:708. doi: 10.1186/s13063-019-3790-9
  • 7. The Role of Retail Pharmacies in the Evolving Landscape of Clinical Research. Appl Clin Trials Online 2023. https://www.appliedclinicaltrialsonline.com/view/the-role-of-retail-pharmacies-in-the-evolving-landscape-of-clinical-research (accessed April 20, 2023).
  • 8. Laaksonen N, Bengtström M, Axelin A, Blomster J, Scheinin M, Huupponen R. Clinical trial site identification practices and the use of electronic health records in feasibility evaluations: An interview study in the Nordic countries. Clin Trials 2021;18:724–31. doi: 10.1177/17407745211038512
  • 9. Miseta E. Bring Down The Cost Of Clinical Trials With Improved Site Selection. n.d. https://www.clinicalleader.com/doc/bring-down-the-cost-of-clinical-trials-with-improved-site-selection-0001 (accessed April 20, 2023).
  • 10. Luo J, Chen W, Wu M, Weng C. Systematic data ingratiation of clinical trial recruitment locations for geographic-based query and visualization. Int J Med Inf 2017;108:85–91. doi: 10.1016/j.ijmedinf.2017.10.003
  • 11. Barnard KD, Dent L, Cook A. A systematic review of models to predict recruitment to multicentre clinical trials. BMC Med Res Methodol 2010;10:63. doi: 10.1186/1471-2288-10-63
  • 12. Bakhshi A, Senn S, Phillips A. Some issues in predicting patient recruitment in multi-centre clinical trials. Stat Med 2013;32:5458–68. doi: 10.1002/sim.5979
  • 13. Getz. Predicting Successful Site Performance. n.d. https://www.appliedclinicaltrialsonline.com/view/predicting-successful-site-performance (accessed April 20, 2023).
  • 14. Gheorghiade M, Vaduganathan M, Greene SJ, Mentz RJ, Adams KF, Anker SD, et al. Site selection in global clinical trials in patients hospitalized for heart failure: perceived problems and potential solutions. Heart Fail Rev 2014;19:135–52. doi: 10.1007/s10741-012-9361-8
  • 15. Gehring M, Taylor R, Mellody M, Casteels B, Piazzi A, Gensini G, et al. Factors influencing clinical trial site selection in Europe: The Survey of Attitudes towards Trial sites in Europe (the SAT-EU Study). BMJ Open 2013;3:e002957. doi: 10.1136/bmjopen-2013-002957
  • 16. Huang GD, Bull J, McKee KJ, Mahon E, Harper B, Roberts JN. Clinical trials recruitment planning: A proposed framework from the Clinical Trials Transformation Initiative. Contemp Clin Trials 2018;66:74–9. doi: 10.1016/j.cct.2018.01.003
  • 17. van den Bor RM, Grobbee DE, Oosterman BJ, Vaessen PWJ, Roes KCB. Predicting enrollment performance of investigational centers in phase III multi-center clinical trials. Contemp Clin Trials Commun 2017;7:208–16. doi: 10.1016/j.conctc.2017.07.004
  • 18. Bieganek C, Aliferis C, Ma S. Prediction of clinical trial enrollment rates. PLoS One 2022;17:e0263193. doi: 10.1371/journal.pone.0263193
  • 19. Zhang X, Long Q. Stochastic modeling and prediction for accrual in clinical trials. Stat Med 2010;29:649–58. doi: 10.1002/sim.3847
  • 20. Liu J, Allen PJ, Benz L, Blickstein D, Okidi E, Shi X. A Machine Learning Approach for Recruitment Prediction in Clinical Trial Design. 2021.
  • 21. Gligorijevic J, Gligorijevic D, Pavlovski M, Milkovits E, Glass L, Grier K, et al. Optimizing clinical trials recruitment via deep learning. J Am Med Inform Assoc 2019;26:1195–202. doi: 10.1093/jamia/ocz064
  • 22. NPPES. n.d. https://nppes.cms.hhs.gov/#/ (accessed June 19, 2023).
  • 23. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. 2016.
  • 24. Liaw R, Liang E, Nishihara R, Moritz P, Gonzalez JE, Stoica I. Tune: A Research Platform for Distributed Model Selection and Training. 2018. doi: 10.48550/arXiv.1807.05118
  • 25. Lundberg SM, Lee S-I. A Unified Approach to Interpreting Model Predictions. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al., editors. Adv Neural Inf Process Syst 30. Curran Associates, Inc.; 2017, p. 4765–74.
  • 26. Clark LT, Watkins L, Piña IL, Elmer M, Akinboboye O, Gorham M, et al. Increasing Diversity in Clinical Trials: Overcoming Critical Barriers. Curr Probl Cardiol 2019;44:148–72. doi: 10.1016/j.cpcardiol.2018.11.002

Decision Letter 0

Krit Pongpirul

18 Jan 2024

PONE-D-23-38747
Enhancing site selection strategies in clinical trial recruitment using real-world data modeling
PLOS ONE

Dear Dr. Hulstaert,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Mar 03 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Krit Pongpirul, MD, MPH, PhD.

Academic Editor

PLOS ONE

Journal Requirements:

1. When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for stating the following in the Competing Interests/Financial Disclosure* (delete as necessary) section: 

All authors are employees of Janssen Research and Development, a unit of Johnson and Johnson family of companies. The work on this study was part of their employment. All authors hold pension rights from the company and own stock options. This does not alter our adherence to PLOS ONE policies on sharing data and materials.

   

We note that one or more of the authors are employed by a commercial company: Janssen Research and Development, a unit of Johnson and Johnson family of companies.

a. Please provide an amended Funding Statement declaring this commercial affiliation, as well as a statement regarding the Role of Funders in your study. If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study. You can update author roles in the Author Contributions section of the online submission form.

Please also include the following statement within your amended Funding Statement. 

“The funder provided support in the form of salaries for authors, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.”

If your commercial affiliation did play a role in your study, please state and explain this role within your updated Funding Statement. 

b. Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc.  

Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to  PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests) . If this adherence statement is not accurate and  there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared.

Please include both an updated Funding Statement and Competing Interests Statement in your cover letter. We will change the online submission form on your behalf.

3. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments:

Please address the comments raised by both reviewers.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

********** 

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

********** 

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

********** 

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

********** 

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: -- Summary --

The authors present a ML-based model for ranking clinical trial sites by their expected recruitment potential for clinical trial site selection. They employ historical recruitment data combined with real-world claims data and additional public data sources for model training and evaluation. Models are trained, evaluated and modeling approaches compared for two diseases indications.

-- Comments --

- Definition of patient cohorts for benchmark studies: Suppl. file S2 indicates that multiple clinical studies were used as reference for the identified patient cohorts. Please comment on how the cohort inclusion criteria for the presented 'experiments' were derived from the inclusion criteria of these clinical studies. Presumably, inclusion criteria were not identical for all clinical studies, how was consensus reached for the experiments?

- Mismatch between RWD cohort observation period and recruitment data: "As such, the cohort observation period is used instead to characterize the real-world clinical practice of a site, and it is assumed to remain constant over time". Please discuss the potential biases that may be introduced by estimating site level covariates for the recruitment into studies performed much earlier (about 15-20 years, 1990 onwards) from rather recent RWD (2016 onwards).

- Manual covariate selection: What thresholds have been applied and which covariates were selected/removed? Does fig 4 show only the selected covariates?

- Authors state in the discussion that 'The non-linear model performance significantly improves the ability to rank the sites by expected enrollment ...', however, the methods section indicates that statistical significance testing has been performed on the MAE which is not a direct measure of ranking performance. Please explain.

- Fig 1-3: Please indicate the unit of 'frequency' in figs 1-3. 'Studies per year' in fig 1? But 'number of studies' in fig 2 and fig 3?

- Table 4: The reported results are labeled as 'Test'. Presumably this refers to the 20% of the 80/20 split mentioned in line 217. Please clarify the nomenclature. It would be interesting to see how robust these performance estimates are for different sample splits.

- Table S6 does not contain train model performances as mentioned in line 265.

Reviewer #2: This is a comprehensive machine learning study for site selection using trial data and real world data. A linear Poisson regression model and a non-linear XGBoost model were trained and compared with the baseline method. It clearly shows that XGBoost outperforms other methods. Shapley values were used to estimate covariate importance in the model. It helps interpret the relationship of study recruitment with the different covariates. Interestingly it highlights that different factors play a role in recruitment for different indications such as IBD and MM. Together these methods demonstrate the value of machine learning models in improving site selections.

Since there are many non-linear models besides XGBoost, it would be interesting to add a few more models to test how different algorithms perform in the two indications. Since the training/test data are ready, this seems straightforward. But if there are practical reasons why other algorithms were not selected, you may simply add the explanation.

********** 

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Xiong Liu

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Mar 11;19(3):e0300109. doi: 10.1371/journal.pone.0300109.r002

Author response to Decision Letter 0


29 Jan 2024

Addressed reviewer comments in 'Response to Reviewers' letter. Please let us know if additional changes or revisions are required.

Reviewer 1 Comments

Definition of patient cohorts for benchmark studies: Suppl. file S2 indicates that multiple clinical studies were used as reference for the identified patient cohorts. Please comment on how the cohort inclusion criteria for the presented 'experiments' were derived from the inclusion criteria of these clinical studies. Presumably, inclusion criteria were not identical for all clinical studies, how was consensus reached for the experiments?


Updated line 166:

Inclusion criteria of benchmark studies are used to define a superset of relevant diagnosis, drugs, and procedures codes. These codes define a patient cohort that represents the broad patient population that is eligible for the benchmark studies.

Mismatch between RWD cohort observation period and recruitment data: "As such, the cohort observation period is used instead to characterize the real-world clinical practice of a site, and it is assumed to remain constant over time". Please discuss the potential biases that may be introduced by estimating site level covariates for the recruitment into studies performed much earlier (about 15-20 years, 1990 onwards) from rather recent RWD (2016 onwards).

Updated line 212:

The variability in yearly calculations of the site level RWD covariates across the available data is sufficiently small, allowing them to be approximated as constant when averaged across the cohort observation period. Before 2016 it is not possible to validate this hypothesis which has the potential to introduce data bias in RWD covariates for studies conducted before 2016.

Manual covariate selection: What thresholds have been applied and which covariates were selected/removed? Does fig 4 show only the selected covariates?

Updated line 269:

Covariates with a variable importance, as defined by the covariate mean SHAP value, that is below 0.005 are removed from the covariate set. For each model, the selected set of covariates is defined in S4 File, which is a subset of the full set of covariates described in Table 3.

Fig 4 description is adapted to “Covariate importance of the selected covariates for the XGBoost models”

Authors state in the discussion that 'The non-linear model performance significantly improves the ability to rank the sites by expected enrollment ...', however, the methods section indicates that statistical significance testing has been performed on the MAE which is not a direct measure of ranking performance. Please explain.

Updated line 306 and 320 to reflect the type of statistical testing that was applied.

Fig 1-3: Please indicate the unit of 'frequency' in figs 1-3. 'Studies per year' in fig 1? But 'number of studies' in fig 2 and fig 3?



Descriptions of Fig 1-3 are adapted (line 138 to 145):

Fig 1. Overview of number of benchmark studies across phase and indication. An overview of the number of benchmark studies across study phase and study indication for resp. IBD and MM.

Fig 2. Overview of the number of patients enrolled and number of enrollment months across study-site combinations. An overview of the number of patients enrolled and the number of enrollment months per site across the benchmark studies for IBD and MM, respectively.

Fig 3. Overview of the number of sites across US states and open year. An overview of the number of sites across US states and site open year for IBD and MM, respectively.

Table 4: The reported results are labeled as 'Test'. Presumably this refers to the 20% of the 80/20 split mentioned in line 217. Please clarify the nomenclature. It would be interesting to see how robust these performance estimates are for different sample splits.

Updated line 224:

We use a random train (80%) and test (20%) data split at site level to avoid potential data distribution bias and its corresponding impact on the model's generalization capabilities.
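A minimal sketch of one way to implement such a site-level split, assuming it means that all study-site rows from a given site land in the same fold, is shown below using scikit-learn's GroupShuffleSplit; the DataFrame schema is hypothetical.

```python
# Illustrative site-level 80/20 split: grouping by site ensures no site
# contributes rows to both train and test.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

data = pd.DataFrame({                      # hypothetical study-site rows
    "site_id":  [1, 1, 2, 3, 3, 4, 5],
    "study_id": ["A", "B", "A", "A", "B", "B", "A"],
    "enrolled": [12, 8, 3, 7, 5, 9, 2],
})

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(data, groups=data["site_id"]))
train, test = data.iloc[train_idx], data.iloc[test_idx]
assert set(train["site_id"]).isdisjoint(test["site_id"])
```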

Table S6 does not contain train model performances as mentioned in line 265.

Updated Table S6 to contain the train model performance.

Reviewer 2 Comments

Since there are many non-linear models besides XGBoost, it would be interesting to add a few more models to test how different algorithms perform in the two indications. Since the training/test data are ready, this seems straightforward. But if there are practical reasons why other algorithms were not selected, you may simply add the explanation.

Updated Table 4 with results for RandomForest experiments.

Updated line 253:

For each exercise, a linear Poisson regression model and two non-linear machine learning models, a RandomForest and an XGBoost model (v.1.7.2), are trained and compared with the median and site baselines.

We considered other non-linear models but did not observe a significant difference in performance.
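For orientation, the sketch below mirrors that comparison on synthetic data: a Poisson GLM, a RandomForest, and an XGBoost regressor scored by MAE against a median baseline. The data, hyperparameters, and the omission of the site baseline are simplifying assumptions, not the paper's setup.

```python
# Illustrative model comparison on synthetic enrollment counts.
import numpy as np
import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))
y = rng.poisson(lam=np.exp(0.8 * X[:, 0] + 0.3))   # synthetic counts

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

models = {
    "poisson": PoissonRegressor(max_iter=300),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=1),
    "xgboost": xgb.XGBRegressor(n_estimators=200, random_state=1),
}
baseline = np.full_like(y_te, np.median(y_tr), dtype=float)
print("median baseline MAE:", mean_absolute_error(y_te, baseline))
for name, m in models.items():
    pred = m.fit(X_tr, y_tr).predict(X_te)
    print(name, "MAE:", mean_absolute_error(y_te, pred))
```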

Attachment

Submitted filename: Response to Reviewers.docx

pone.0300109.s007.docx (56.2KB, docx)

Decision Letter 1

Krit Pongpirul

23 Feb 2024

Enhancing site selection strategies in clinical trial recruitment using real-world data modeling

PONE-D-23-38747R1

Dear Dr. Hulstaert,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Krit Pongpirul, MD, MPH, PhD.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Your responses to the comments from both reviewers are satisfactory. Please address the optional comments during the proof.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thank you for addressing the comments. Please consider the following questions and recommendations below optional:

- You mention in your answers (update line 212) that "The variability in the yearly calculations of the site-level RWD covariates across the available data is sufficiently small, allowing them to be approximated as constant when averaged across the cohort observation period. Before 2016, it is not possible to validate this hypothesis, which has the potential to introduce bias in the RWD covariates for studies conducted before 2016."

-> Readers may be interested in how this hypothesis has been verified for the period from 2016 onward.

- Thank you for clarifying the meaning of 'frequency' in the captions of Figs 1-3.

-> Modifying the figures' y-axis accordingly would greatly facilitate the reading in my opinion.

Reviewer #2: The authors have adequately addressed my comments. The paper is now in a good shape for publication.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Xiong Liu

**********

Acceptance letter

Krit Pongpirul

29 Feb 2024

PONE-D-23-38747R1

PLOS ONE

Dear Dr. Hulstaert,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Assoc. Prof. Dr. Krit Pongpirul

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. List of benchmark studies per indication.

    (DOCX)

    pone.0300109.s001.docx (22.1KB, docx)
    S2 File. Real-world data cohort definition per indication.

    (DOCX)

    pone.0300109.s002.docx (39.8KB, docx)
    S3 File. Selected set of covariates per indication.

    (DOCX)

    pone.0300109.s003.docx (21.5KB, docx)
    S4 File. XGBoost hyperparameter grid and final model hyperparameters per indication.

    (DOCX)

    pone.0300109.s004.docx (23.2KB, docx)
    S1 Table. Summary of the enrollment statistics across the two experiments.

    (DOCX)

    pone.0300109.s005.docx (21.5KB, docx)
    S2 Table. Model performance on train set across the two experiments.

    (DOCX)

    pone.0300109.s006.docx (26KB, docx)
    Attachment

    Submitted filename: Response to Reviewers.docx

    pone.0300109.s007.docx (56.2KB, docx)

    Data Availability Statement

    The data underlying this article were provided by third parties (Komodo Health & IQVIA) under license and cannot be shared publicly. The source data for this study were licensed by Johnson & Johnson from Komodo Health and IQVIA, and hence we are not allowed to share the licensed data publicly. However, the same data used in this study are available for purchase by contracting with the database owners, Komodo Health (contact at: https://www.komodohealth.com/) and IQVIA (contact at: https://dqs.drugdev.com/help/contactUs). The authors did not have any special access privileges that other parties who license the data and contract with Komodo Health and IQVIA would not have. Further, in our efforts to enable use and reproducibility of the prediction model, we have provided detailed supporting material on covariates, and model hyperparameter settings.

