PLOS One. 2025 Oct 21;20(10):e0334858. doi: 10.1371/journal.pone.0334858

Machine learning detects hidden treatment response patterns only in the presence of comprehensive clinical phenotyping

Stephen D Auger 1,2,3,*, Gregory Scott 1,2,3
Editor: Ziheng Wang
PMCID: PMC12539744  PMID: 41118359

Abstract

Inferential statistics traditionally used in clinical trials can miss relationships between clinical phenotypes and treatment responses. We simulated a randomised clinical trial to explore how gradient boosting (XGBoost) machine learning compares with traditional analysis when ‘ground truth’ treatment responsiveness depends on the interaction of multiple phenotypic variables. As expected, traditional analysis detected a significant treatment benefit (outcome measure change from baseline = 4.23; 95% CI 3.64–4.82). However, recommending treatment based upon this evidence would lead to 56.3% of patients failing to respond. In contrast, machine learning correctly predicted treatment response in 97.8% (95% CI 96.6–99.1) of patients, with model interrogation showing the critical phenotypic variables and the values determining treatment response had been identified. Importantly, when a single variable was omitted, accuracy dropped to 69.4% (95% CI 65.3–73.4). This proof of principle underscores the significant potential of machine learning to maximise the insights derived from clinical research studies. However, the effectiveness of machine learning in this context is highly dependent on the comprehensive capture of phenotypic data.

Introduction

The clinical phenotype refers to the observable characteristics of a disease, and is a crucial indicator of both its presence and how it manifests in an individual. A clinical phenotype includes numerous variables, such as a patient’s age and other demographic factors, co-morbidities, symptoms, and examination findings. Clinical phenotypes can be expanded by measuring and understanding more variables, including genetic, physical, environmental, radiological, electrophysiological, biochemical, and molecular characteristics, i.e., so-called ‘deep phenotyping’. An important motivation for clinical phenotyping is the idea that, somewhere within the phenotype, are variables that inform how a disease can be optimally treated within an individual.

For example, the management of headache depends upon numerous interacting variables within the clinical phenotype. These variables include whether the headache is migrainous, the presence and type of aura, and the presence of co-morbidities, like gastritis, asthma, or anxiety [1]. Within broad diagnostic categories, there are often distinct clinical phenotypes, which can be driven by different underlying disease processes that require tailored management approaches. Consequently, the optimal treatment strategy for a young female in full-time work with menstrual migraine and aura differs significantly from that of a retired older male with migrainous cervicogenic headache. Even single variables within an individual’s phenotype can change an otherwise effective treatment into something likely to cause harm, e.g., carbamazepine as a treatment of epilepsy in people of Han Chinese ethnicity [2].

For new treatments, particularly of heterogeneous diseases, it can take decades to determine whether patients with a given clinical phenotype will benefit or, in fact, derive no benefit at all [3]. Without this understanding, groups of patients may receive treatments that provide no benefit and/or may cause harm. For example, aspirin was used for decades to prevent cardiovascular disease before evidence showed its benefit in primary prevention was minimal and that additional clinical variables needed to be considered to balance the benefits for high-risk patients against the associated bleeding risks [4,5]. Many similar examples exist, where targeting treatments based on specific clinical phenotypes has become necessary, such as beta-blockers in heart failure [6,7], calcium channel blockers in hypertension [8,9], and antidepressants in major depressive disorder [10]. In each case, certain patient groups have historically unknowingly received ineffective treatments for many years. Understanding and managing the heterogeneity of clinical phenotypes is therefore critical for personalised and effective care. However, building the evidence base to support this level of precision medicine requires time and, crucially, appropriate methodologies that can deconstruct this phenotypic heterogeneity.

Randomised controlled trials (RCTs) are among the highest levels of evidence in clinical research. Designing, funding, conducting, and publishing RCTs takes many years or even decades [11]. Given this investment, it is crucial to maximise the information gained from RCTs, to realise the greatest benefits for patients. Advances in machine learning (ML) and other multivariate analysis techniques offer opportunities to identify complex relationships between clinical phenotype and treatment response, which the inferential statistical methods traditionally used for RCT analysis may miss [12,13]. Indeed, a growing amount of clinical research includes the use of ML techniques, and considers their current limitations when applied in clinical settings [14,15]. The effectiveness of ML models, as for all methodologies, critically depends on the quality and quantity of the data available to them [16].

In this study, we investigated how both the analysis approach and characterisation of clinical phenotype within an RCT influence our ability to infer underlying ‘ground truth’. Using simulated data representative of typical RCTs, we conducted a novel, formal comparison contrasting the effectiveness of traditional statistical methods versus ML approaches at identifying the key factors and interactions driving treatment response. We simulated a clinical cohort undergoing a RCT, providing ground truth information about how variables within the clinical phenotype determine responses to the treatment. We compared this ground truth information with the conclusions that would be drawn by investigators when analysing the RCT data using traditional inferential statistics. We then examined the capability of ML to uncover additional insights from the same data. We used XGBoost (XGB), a form of gradient boosting ML which has demonstrated state-of-the-art performance on a range of problems involving complex, non-linear interactions between variables [17,18], and has been applied in a variety of clinical settings [19–22]. We evaluated the additional benefits of XGB analysis, including the ability to reveal the phenotypic variables and values which critically determine treatment response. Finally, we examined how the comprehensiveness of clinical phenotyping might impact conclusions when using XGB analysis, considering the effects both of data deficiency and excess.

Results

Creation of simulated clinical cohort data and determinants of ‘ground truth’ treatment response

We created simulated clinical cohort data, modelling a group of 1000 patients with a disease for which a new treatment improves outcomes in those with certain clinical phenotypes, but not others. These patients are subsequently enrolled in a parallel group randomised placebo-controlled trial (1:1 randomisation). (See Methods for full details of the cohort’s characteristics and modelling of treatment response).

For each patient, the clinical phenotype consists of multiple arbitrary clinical variables, including illustrative variables for patient age, sex, and several additional clinical variables. Both binary and continuous variables are simulated; binary variables being analogous to, e.g., sex or presence/absence of a symptom, and continuous variables analogous to, e.g., age or a numerical symptom severity score.

There are three such clinical variables that critically determine which patients are responsive to the treatment: ‘X’, ‘Y’ and ‘Z’. X and Y are continuous; Z is binary. This scenario is analogous to, e.g., real-world clinical variables such as (respectively) bone mineral density, age, and sex, which jointly help determine which bone protection interventions are suitable for a given clinical phenotype [23]. Importantly, the investigators in our scenario have no knowledge of this ground truth information about which variables determine treatment response.

A simulated non-linear relationship between these three variables (X, Y, Z) determines which patients will be responsive to the treatment, illustrated in Fig 1 (orange indicates treatment responsive conditions, blue indicates not responsive), and described as follows: Any patient with a value of X above 95 is responsive to the treatment, no matter whether Z is present (left plot) or absent (right plot). For lower values of X (X<=95), if Z is present (left plot) and Y is between 50 and 90 then the patient is treatment responsive. If Z is absent (right plot) and X is between 90 and 95, then a patient is treatment responsive if Y is between 50 and 90.
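As a minimal sketch, the response rule just described can be written as a simple predicate. The variable names follow the text; the exact handling of boundary values (inclusive versus strict inequalities at 50, 90 and 95) is an assumption here, as the text does not specify it:

```python
def is_responsive(x: float, y: float, z: bool) -> bool:
    """Ground-truth treatment response rule from the simulation.

    x, y: continuous clinical variables; z: binary clinical variable.
    Boundary handling (inclusive vs. strict) is an illustrative assumption.
    """
    if x > 95:
        # High X: responsive regardless of Y and Z
        return True
    if z and 50 <= y <= 90:
        # X <= 95 with Z present: responsive when Y is in the 50-90 band
        return True
    if (not z) and 90 < x <= 95 and 50 <= y <= 90:
        # Z absent: responsive only when X is in (90, 95] and Y in the band
        return True
    return False
```

For example, a patient with X = 92, Y = 70 and Z absent is responsive (the third branch), whereas the same patient with Y = 40 is not.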

Fig 1. Investigation outline.


Clinical data were generated for a simulated cohort of 1000 patients. The top panel shows the seven clinical phenotype variables. The range of values for each variable is shown in brackets. The plots indicate the combination of values of the critical variables determining treatment responsiveness: orange areas indicate values associated with patients being responsive and blue areas indicate non-response to treatment. The left and right plots indicate X and Y values when Z is present or absent respectively. By this arrangement, 43.7% of patients are responsive and 56.3% not responsive to treatment. Patients were randomly assigned to a treatment or placebo group (1:1 randomisation), with trial outcomes based on their true responsiveness and assigned group. The numbers in the ‘Treatment’ and ‘Placebo’ boxes represent the change in outcome measures (mean ± standard deviation) for each group, according to their true responsiveness. We performed traditional inferential statistical analysis on the trial outcome data, to obtain estimates of effect size (mean change) and precision (95% confidence intervals). Then, machine learning (ML) analysis with XGBoost was conducted to predict individual patient treatment responses and to identify which clinical phenotype variables influenced these predictions. Finally, we assessed how data deficiencies and excesses impact ML analysis. CI = confidence interval.

With these conditions, 43.7% of patients are potentially responsive to treatment; 56.3% are not. Additional information is captured in further variables, but importantly, these have no bearing on treatment response (Fig 1). To aid interpretation, we denote these uninformative variables with numbers (i.e., ‘V1’ and ‘V2’) rather than the letters which denote the treatment response-determining variables above.

Treatment response, or non-response, is reflected in a distinct, continuous, outcome measure with arbitrary units. Patients responsive to the treatment will, on average, have a positive change in the outcome measure after receiving treatment; patients not responsive to treatment will, on average, have no change in the outcome measure with treatment. Non-responsive patients include those receiving placebo and those who receive treatment yet who are not responsive to it. To reflect the clinical reality of trial outcome measures having a degree of variability, the change in the trial outcome measure is modelled as being drawn from one of two normal distributions, of mean +10 or mean 0, for treatment responsive and non-responsive patients, respectively. Both distributions have standard deviation 3.
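The outcome model described above amounts to sampling from one of two normal distributions. A minimal numpy sketch (the group counts here are illustrative, taken from the 43.7%/56.3% split):

```python
import numpy as np

rng = np.random.default_rng(0)
n_responsive, n_nonresponsive = 437, 563  # illustrative counts from the split

# Responders on treatment: change in outcome measure drawn from N(10, 3)
responsive_change = rng.normal(loc=10, scale=3, size=n_responsive)

# Non-responders (treated non-responders and all placebo patients): N(0, 3)
nonresponsive_change = rng.normal(loc=0, scale=3, size=n_nonresponsive)
```

With standard deviation 3, a responder's sampled change occasionally falls below a non-responder's, which is why the trial outcome is a noisy surrogate for true responsiveness.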

We simulated an RCT in which the 1000 patients are randomised to receive either the treatment or a placebo in a 1:1 ratio. The value for each patient’s change in the trial outcome measure is determined by their ground truth treatment responsivity (above) and their treatment/placebo group allocation. Patients of all clinical phenotypes were eligible for inclusion in the study. The RCT was rigorously conducted and well-powered, without loss to follow-up. The investigators collected and have available to them the complete information about all the variables described above (Fig 1), but not the ground truth treatment responsiveness – this information is what the trial aims to infer.

Analysis using traditional inferential statistics

The RCT was designed to investigate whether a new treatment improves a clinical outcome measure versus placebo. There is clinical cohort data for 1000 patients with the phenotypes described above. Following randomisation, 509 patients were allocated to receive treatment, 491 to receive placebo. Using traditional statistical methods, absolute changes in the outcome measure for the treatment group are compared with the placebo group. These results are illustrated with Forest plots, including subgroup analyses, in Fig 2. This traditional style of reporting of the outcome for each intervention group, including the estimated effect sizes and associated precision, aligns with consensus recommendations for presenting primary RCT outcomes, as outlined in the CONSORT statement [24].

Fig 2. Forest plots illustrating mean and 95% confidence interval change in the outcome measure in patients receiving treatment compared with placebo.


The bottom plot shows results across the entire cohort, with different subgroups plotted above that. CI = confidence interval.

There is a significant improvement in the outcome measure in patients who received the treatment compared with placebo (mean change 4.23, 95% CI 3.64 to 4.82). All subgroups, defined by the seven phenotypic variables, showed improved outcomes with the treatment compared with placebo. The size of treatment-related benefit varied according to the values of the critical response-determining variables (X, Y and Z), but not others (age, sex, V1 and V2) (Fig 2).

If we assume that changes in the outcome measure of 5 or above are clinically meaningful, the RCT evidence estimates that the number needed to treat (NNT) to achieve this clinically meaningful benefit is 2.62. This favourable NNT suggests that the new treatment is an effective one, although it would be important to weigh its benefits against any potential adverse effects, not modelled here. Importantly, based upon the results of this RCT, the new treatment appears to be an efficacious option for all patient groups.
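The NNT arithmetic is the reciprocal of the absolute risk difference for achieving a change of 5 or more between arms. The sketch below re-creates a cohort along the lines of the paper's rule, but the variable ranges are assumptions, so the resulting NNT is illustrative rather than a reproduction of the reported 2.62:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Assumed ranges for the response-determining variables
X = rng.uniform(50, 100, n)
Y = rng.uniform(30, 110, n)
Z = rng.integers(0, 2, n).astype(bool)

# Ground-truth response rule from the simulation description
truth = (X > 95) | (Z & (Y >= 50) & (Y <= 90)) | (
    ~Z & (X > 90) & (X <= 95) & (Y >= 50) & (Y <= 90))

treated = rng.integers(0, 2, n).astype(bool)  # 1:1 randomisation
delta = np.where(truth & treated,
                 rng.normal(10, 3, n),        # responders on treatment
                 rng.normal(0, 3, n))         # everyone else

# Risk of a clinically meaningful change (>= 5) in each arm
p_treatment = (delta[treated] >= 5).mean()
p_placebo = (delta[~treated] >= 5).mean()
nnt = 1 / (p_treatment - p_placebo)  # number needed to treat
```

Note that the placebo arm has a small non-zero "response" rate purely from outcome noise (the upper tail of N(0, 3) crossing 5), which the NNT calculation correctly nets out.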

Analysis of the same data using machine learning

Despite the apparently positive RCT results, the ground truth is that 563 of the 1000 patients are not responsive to treatment (Fig 1). Clinical application of the RCT-based evidence from the previous section (and of any further RCTs analysed in the same way) could result in 56.3% of patients being prescribed a treatment from which they receive no benefit, while being exposed to any side effects and risks associated with the treatment. These 56.3% of patients are unknowingly disadvantaged by the RCT analysis based on traditional statistics. These disadvantaged patients disproportionately include those with lower X values (non-responsive patient mean 70.3, SD 12.1 versus responsive patient mean 79.8, SD 15.2, p for difference <0.0001), higher Y values (non-responsive patient mean 71.1, SD 20.1 versus responsive patient mean 68.6, SD 12.5, p = 0.02) and patients without Z (proportion of non-responsive patients with Z 0.255, SD 0.43 versus responsive patients 0.817, SD 0.39, p < 0.0001). Therefore, even with a perfect collection of all the phenotypic variables necessary to determine which patients are treatment responsive, most patients nonetheless receive ineffective treatment, with systematic disadvantage to patients with low X values, high Y values or those for whom Z is not present.

ML analysis could help reveal insights which traditional analysis methods do not. Using the exact same simulated data as in the traditional RCT analysis described above, we next examined the capabilities of XGB ML analysis.

We first used XGB to identify whether it is possible to predict which patients benefit from treatment based upon their clinical phenotype, and who should avoid ineffective treatment. (See Methods for full details including five-fold cross-validation and XGB model parameters.) This analysis included only patients allocated to the treatment arm. We defined treatment responsiveness as an outcome measure of 5 or above, and non-response as values below 5. Using simulated data allowed us to compare the XGB predictions with the ground truth treatment response, in addition to the trial’s outcome measure, which may not always reflect true responsiveness. This comparison of XGB predictions with ground truth helps assess the generalisability of predictions beyond this individual trial’s outcome data. Importantly, model training relied solely on data available to investigators, with the ground truth used exclusively for evaluation of prediction accuracy, not training.
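This step can be sketched as follows. scikit-learn's `GradientBoostingClassifier` stands in for XGBoost here (the published analysis used XGBoost itself), and the variable ranges, the distributions of the uninformative variables, and all model parameters are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(50, 100, n)       # assumed range
Y = rng.uniform(30, 110, n)       # assumed range
Z = rng.integers(0, 2, n).astype(bool)
age = rng.uniform(18, 90, n)      # uninformative
sex = rng.integers(0, 2, n)       # uninformative
V1 = rng.normal(0, 1, n)          # uninformative
V2 = rng.integers(0, 2, n)        # uninformative

# Ground-truth response rule (used only for evaluation, never as a feature)
truth = (X > 95) | (Z & (Y >= 50) & (Y <= 90)) | (
    ~Z & (X > 90) & (X <= 95) & (Y >= 50) & (Y <= 90))

treated = rng.integers(0, 2, n).astype(bool)  # 1:1 randomisation
delta = np.where(truth & treated, rng.normal(10, 3, n), rng.normal(0, 3, n))

# Train only on the treatment arm; label = outcome change of 5 or more
features = np.column_stack([X, Y, Z, age, sex, V1, V2])[treated]
labels = (delta[treated] >= 5).astype(int)

model = GradientBoostingClassifier(random_state=0)
pred = cross_val_predict(model, features, labels, cv=5)  # five-fold CV

accuracy_vs_outcome = (pred == labels).mean()
accuracy_vs_truth = (pred == truth[treated].astype(int)).mean()
```

Because the trial outcome is a noisy surrogate, `accuracy_vs_outcome` and `accuracy_vs_truth` can differ; comparing the two is what distinguishes generalisable signal from overfitting to outcome noise.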

Fig 3 shows the classification performance metrics for the XGB analysis compared with trial outcomes and ground truth responsivity. The XGB analysis predictions were 92.5% (95% CI 90.3–94.8) accurate according to the trial outcome data (Fig 3A) and 97.8% (95% CI 96.6–99.1) accurate according to ground truth (Fig 3B). The fact that XGB demonstrated higher accuracy compared to ground truth labels than the trial’s outcome measure suggests that the model did not overfit to the noisy surrogate outcome data. Rather, XGB identified patterns in the underlying clinical phenotype data which offer generalisable insights that more accurately reflect the true treatment response. A power analysis revealed that the method was highly robust and a success criterion of >90% accuracy against ground truth was met in 100% of 500 simulation runs, resulting in a statistical power of 100% to detect the simulated effect under these conditions.

Fig 3. Confusion matrices with classification metrics for predictions of treatment response using XGB.


Predictions of treatment response using XGB analysis are compared with the treatment response apparent in the trial outcome measure (A) and the ground truth (B). Orange shading denotes treatment responsive (suggested in the trial outcome or from ground truth) and blue shading denotes non-treatment responsive cells; bold orange/blue denote correct treatment allocation according to the outcome; light orange/blue denote inappropriate treatment allocation. PPV: positive predictive value; NPV: negative predictive value.

We also considered another commonly used form of ML analysis, logistic regression (LR). Using LR, predictions were 78.4% (95% CI 74.8–82.0) accurate according to the trial outcome data and 82.1% (78.8–85.5) accurate according to ground truth. The lower accuracy of LR compared to XGB is commonly found for non-linear multivariate interactions, such as in this analysis [17].
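The LR comparison can be sketched on the same kind of simulated data. Again scikit-learn stands in for the paper's implementation, the uninformative variables are collapsed into generic noise columns, and all data-generation details are assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(50, 100, n)            # assumed range
Y = rng.uniform(30, 110, n)            # assumed range
Z = rng.integers(0, 2, n).astype(bool)
noise = rng.normal(0, 1, (n, 4))       # stand-ins for age, sex, V1, V2

truth = (X > 95) | (Z & (Y >= 50) & (Y <= 90)) | (
    ~Z & (X > 90) & (X <= 95) & (Y >= 50) & (Y <= 90))
treated = rng.integers(0, 2, n).astype(bool)
delta = np.where(truth & treated, rng.normal(10, 3, n), rng.normal(0, 3, n))

features = np.hstack([np.column_stack([X, Y, Z]), noise])[treated]
labels = (delta[treated] >= 5).astype(int)
truth_treated = truth[treated].astype(int)

# Logistic regression: a single linear decision boundary
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_accuracy = (cross_val_predict(lr, features, labels, cv=5)
               == truth_treated).mean()

# Gradient boosting: can represent the non-linear X/Y/Z interaction
gb = GradientBoostingClassifier(random_state=0)
gb_accuracy = (cross_val_predict(gb, features, labels, cv=5)
               == truth_treated).mean()
```

The Y dependence is non-monotone (responsive only inside a 50–90 band), which a linear model cannot represent on the raw features; this is the kind of structure that accounts for LR's lower accuracy.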

Even with perfect measurement of key variables, RCT analyses can miss crucial patterns. Conclusions drawn from traditional RCT analysis in the previous section could lead to 43.7% of patients benefitting from treatment, with the remaining majority of patients receiving an ineffective treatment. In contrast, XGB analysis of data from only half as many patients (specifically, those in the treatment group rather than the placebo group, which provides no insight into treatment responses) accurately predicted patients being treatment responsive or non-responsive in 97.8% of cases.

Additional benefits of a machine learning analysis

The XGB approach enables interrogation of the model using SHAP (SHapley Additive exPlanations) values, to identify phenotypic variables which are influential in determining the model’s predictions about treatment response [25]. Here, each SHAP value indicates how much an input variable contributed to the model’s prediction of each patient’s treatment response, shown in Fig 4. A large positive SHAP value indicates an influence towards a prediction of treatment responsive, whereas a large negative value indicates an influence towards a prediction of non-responsive. The colour of the plots indicates the value of the associated variable: dark red represents higher values of a continuous variable or the presence of a binary variable, while white indicates lower values of the continuous variable or the absence of the binary variable. All seven phenotypic variables are ranked from highest to lowest influence on model predictions, based on the total magnitude of SHAP values across all predictions. This ranking reflects the relative importance of each variable in influencing the model’s decisions. We see that the three true response-determining variables – X, Y and Z – are identified as being the ones which most strongly influence the model outputs (Fig 4).
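A full SHAP analysis requires the `shap` package. As a lighter sketch of the same interrogation idea, permutation importance (a coarser, global importance measure) should likewise rank X, Y and Z above the uninformative variables; all data-generation details below are assumptions, with scikit-learn standing in for XGBoost:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(50, 100, n)            # assumed range
Y = rng.uniform(30, 110, n)            # assumed range
Z = rng.integers(0, 2, n).astype(bool)
age = rng.uniform(18, 90, n)           # uninformative
sex = rng.integers(0, 2, n)            # uninformative
V1 = rng.normal(0, 1, n)               # uninformative
V2 = rng.integers(0, 2, n)             # uninformative

truth = (X > 95) | (Z & (Y >= 50) & (Y <= 90)) | (
    ~Z & (X > 90) & (X <= 95) & (Y >= 50) & (Y <= 90))
treated = rng.integers(0, 2, n).astype(bool)
delta = np.where(truth & treated, rng.normal(10, 3, n), rng.normal(0, 3, n))

names = ["X", "Y", "Z", "age", "sex", "V1", "V2"]
features = np.column_stack([X, Y, Z, age, sex, V1, V2])[treated]
labels = (delta[treated] >= 5).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(features, labels)

# Permutation importance: accuracy drop when each feature is shuffled
result = permutation_importance(model, features, labels,
                                n_repeats=10, random_state=0)
ranked = [names[i] for i in np.argsort(result.importances_mean)[::-1]]
```

Unlike SHAP, this gives a single global score per variable rather than per-patient attributions, so it recovers the ranking in Fig 4 but not the value-level detail explored in Fig 5.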

Fig 4. SHAP (SHapley Additive exPlanations) values for each feature in the ML model.


Each point on the plot represents a variable’s SHAP value for each of the 509 patients receiving treatment for whom predictions of treatment response were made. Colours represent the value of the variable from low (white) to high (dark red). Variables are ordered by their impact on model outputs, from highest (top) to lowest (bottom), based on the sum of SHAP value magnitudes across all predictions.

From Fig 4, we also observe that certain values of the X variable have a substantial impact on the model’s predictions of treatment responsiveness, as evidenced by large positive SHAP values. The dark red shading of these points indicates that high values of X have large impact on the model predicting a patient as being treatment responsive. To explore this relationship further, in Fig 5, we plotted X values (x-axis) for each patient against their corresponding SHAP values (y-axis), coloured by the XGB prediction (orange for treatment-responsive, blue for non-responsive). Overlaid histograms show a summary of the proportion of responsive (orange) and non-responsive (blue) predictions for the corresponding X values. Fig 5A shows that X values greater than 90 are associated with very high SHAP values, indicating a strong influence on the model’s output being that the patient is treatment responsive. Specifically, X values above 95 almost always predict treatment responsiveness, while those between 90 and 95 still frequently result in a prediction of treatment responsiveness. This demonstrates that X values above 90 are highly influential in predicting treatment responsiveness. Conversely, when X values are below 90, their impact on the model’s predictions is diminished.

Fig 5. Scatter plots of variable values versus SHAP values to interpret how specific variable ranges influence treatment predictions. A: Scatter plot of X values on the x-axis versus SHAP values (left y-axis).


SHAP values reflect the importance of each X value in predicting treatment response, with higher positive values indicating greater importance for predicting treatment responsiveness. The colour of plots indicates what the prediction was, with orange indicating a prediction of treatment responsive, and blue non-responsive. Overlaid histograms show the proportion of predictions (right y-axis) for different X values: orange bars denote treatment-responsive predictions, and blue bars denote non-treatment-responsive predictions. B: Scatter plot of Y values on the x-axis versus SHAP values, filtered to include only instances where X is below 90. This plot demonstrates the importance of Y values in the model’s predictions, where lower negative SHAP values suggest higher importance for predicting non-responsive to treatment. Histograms overlaid on the scatter plot represent the proportion of predictions for different Y values, with orange bars for treatment-responsive and blue bars for non-treatment-responsive predictions. C: Scatter plot of Z values on the x-axis versus SHAP values, with data filtered to include only Y values between 50 and 90, as well as X values below 90. SHAP values indicate the importance of Z values in the model’s predictions, considering the constraints on X and Y. The histograms show the proportion of predictions for Z values, with orange bars representing treatment-responsive predictions and blue bars representing non-treatment-responsive predictions.

Fig 4 also reveals that certain Y values significantly affect predictions of non-treatment responsiveness, as indicated by large negative SHAP values. Fig 5B examines the role of Y values when X is not influential (i.e., X is below 90). It shows that Y values below 50 or above 90 are linked to large negative SHAP values, suggesting these Y values are strong predictors of non-responsiveness in cases where X is below 90. Consequently, most predictions in these scenarios are for non-responsiveness. Additionally, Fig 4 indicates that Z generally has a significant influence on model predictions. Fig 5C focuses on cases where X is below 90 and Y is between 50 and 90. It shows that the presence of Z is associated with high positive SHAP values, correlating with a high likelihood of treatment responsiveness. Conversely, the absence of Z is associated with low negative SHAP values, indicating non-responsiveness. Thus when X is below 90 and Y is between 50 and 90, Z’s presence predicts treatment responsiveness, while its absence suggests non-responsiveness.

In summary:

  • X values above 90, especially those above 95, are strong predictors of treatment responsiveness.

  • When X is below 90, Y values below 50 or above 90 suggest non-responsiveness.

  • For X below 90 and Y between 50 and 90, Z’s presence indicates treatment responsiveness, whereas its absence indicates non-responsiveness.

This analysis clarifies how X, Y, and Z values contribute to the model’s predictions of treatment response. All these values align closely with the ground truth (see Fig 1).

With XGB, it is straightforward to identify key variables contributing to the model’s accurate predictions of treatment response, as well as important values of these variables.

Impact of data deficiency upon the machine learning analysis

We next considered an alternative scenario, which is identical to the original RCT, in the same patients, except that one response-determining clinical variable, Z, was not collected by the investigators. With this single data deficiency, the accuracies of the XGB model’s predictions were 68.8% (95% CI 64.7–72.8) and 69.4% (95% CI 65.3–73.4) relative to trial outcomes and ground truth, respectively (S1 Fig).
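The data-deficiency experiment can be sketched by deleting the Z column and re-running the cross-validated model; accuracy should fall substantially. As before, scikit-learn stands in for XGBoost and all simulation details are assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(50, 100, n)            # assumed range
Y = rng.uniform(30, 110, n)            # assumed range
Z = rng.integers(0, 2, n).astype(bool)
noise = rng.normal(0, 1, (n, 4))       # stand-ins for age, sex, V1, V2

truth = (X > 95) | (Z & (Y >= 50) & (Y <= 90)) | (
    ~Z & (X > 90) & (X <= 95) & (Y >= 50) & (Y <= 90))
treated = rng.integers(0, 2, n).astype(bool)
delta = np.where(truth & treated, rng.normal(10, 3, n), rng.normal(0, 3, n))

labels = (delta[treated] >= 5).astype(int)
truth_treated = truth[treated].astype(int)

full = np.hstack([np.column_stack([X, Y, Z]), noise])[treated]
without_z = np.hstack([np.column_stack([X, Y]), noise])[treated]  # Z not collected

model = GradientBoostingClassifier(random_state=0)
accuracy_full = (cross_val_predict(model, full, labels, cv=5)
                 == truth_treated).mean()
accuracy_without_z = (cross_val_predict(model, without_z, labels, cv=5)
                      == truth_treated).mean()
```

Patients with X below 90 and Y in the 50–90 band are responsive if and only if Z is present, so with Z missing the model faces an irreducible ambiguity in exactly that region, capping its achievable accuracy.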

In this scenario, the XGB predictions still outperform treatment recommendations drawn from traditional RCT-based analysis (i.e., where only 43.7% of patients receive appropriate treatment). However, a single piece of missing clinical information results in far less accurate predictions by the XGB model compared with analysis with complete information (Fig 3). This finding highlights the importance of comprehensive clinical data collection to realise the potential of ML analysis techniques. If the binary variable Z had not been accounted for and characterised during data collection, investigators would inevitably be blind to its importance, regardless of the sophistication of the analytical methods used.

Impact of data excess upon the machine learning analysis

To address the problem of potential data deficiency, collection of additional variables is necessary. However, this solution must be balanced against the potential drawbacks of data excess. To evaluate the potential disadvantages of gathering more clinical data, we finally considered scenarios in which many more variables are collected and incorporated into the analysis.

In our previous RCT scenarios, only three of the seven clinical variables (X, Y, Z) were meaningful for understanding treatment response, while the other four (Age, Sex, V1, V2) were ‘noise’ that XGB models had to filter from the true ‘signal’. We conducted similar XGB analyses, but in addition to these seven clinical variables, we progressively introduced further noisy variables, to evaluate the ability of XGB analysis to accurately detect true signals amidst increasing noise. These noisy variables were generated as a mix of binary and continuous types, with some covarying with X, Y or Z, while others remained completely independent of the existing variables. (For full details, see Methods). While all other aspects of the analysis approach and XGB model parameters remained unchanged, the clinical cohort was resampled to include these new noisy variables.
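One way to generate such variables, mixing independent continuous, independent binary, and covarying columns, is sketched below; all distributional choices (noise scales, flip probability, the four-way cycling) are illustrative assumptions rather than the paper's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(50, 100, n)            # assumed range
Y = rng.uniform(30, 110, n)            # assumed range
Z = rng.integers(0, 2, n)

def make_noisy_variables(k: int) -> np.ndarray:
    """Return an (n, k) array of extra variables: a cycling mix of
    independent continuous, independent binary, X-correlated continuous,
    and noisy copies of Z."""
    cols = []
    for i in range(k):
        kind = i % 4
        if kind == 0:
            cols.append(rng.normal(0, 1, n))                  # independent continuous
        elif kind == 1:
            cols.append(rng.integers(0, 2, n).astype(float))  # independent binary
        elif kind == 2:
            cols.append(X + rng.normal(0, 10, n))             # covaries with X
        else:
            flip = rng.random(n) < 0.2                        # covaries with Z
            cols.append(np.where(flip, 1 - Z, Z).astype(float))
    return np.column_stack(cols)

extra = make_noisy_variables(100)
all_features = np.hstack([np.column_stack([X, Y, Z]), extra])
```

Scaling `k` up towards 10,000 and re-fitting at each step reproduces the shape of the experiment: the informative columns stay fixed while the feature matrix widens with correlated and uncorrelated noise.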

Fig 6 shows the accuracy of XGB treatment response predictions (based upon trial outcomes, in orange, and compared to ground truth, in green) as the number of additional noisy variables increases up to 10,000. Despite the increasing noise, XGB analysis consistently maintained high classification accuracy, with no observable decline in accuracy up to 10,000 variables.

Fig 6. Scatter plot depicting the accuracy of XGB treatment response predictions in relation to the number of noisy variables added to the original seven clinical variables.


Orange plots indicate accuracy compared with trial outcomes, while green plots indicate accuracy based upon ground truth.

The XGB analysis predictions again more closely mirrored ground truth treatment response, indicating that the model identified generalisable patterns rather than overfitting to the larger amount of noisy, uninformative information.

With appropriate use of an XGB classifier, even the addition of up to 10,000 noisy clinical variables did not reduce classification accuracy, create spurious associations, or obscure true underlying effects.

Discussion

Our analyses highlight two crucial points for clinical research: the potential benefits of ML analysis over traditional analysis approaches, and the necessity for comprehensive clinical phenotyping, to fully realise these benefits.

Crucial insights can be missed on account of how data are analysed. XGB detected critical information about the relationship between clinical phenotype and treatment response, information that traditional inferential methods had overlooked. Conclusions drawn from standard analysis of our simulated RCT could lead to just 43.7% of patients receiving appropriate treatment, and incorrect treatment would systematically disadvantage specific patient groups (those with low X, high Y, or absent Z). Subgroup analyses can identify broad differences in the primary outcome effect sizes based on individual clinical variables. However, as shown in Fig 2, these analyses may still indicate significant improvements compared to baseline when there is heterogeneity within the subgroups or multivariate relationships. Just because we cannot detect discrimination, it does not mean it is not there; the problem of “unknown unknowns” persists. In contrast, XGB analysis of the same RCT data correctly predicted treatment response in 97.8% of patients, removing the discrimination of these patient groups.

Not all forms of ML may fully capture complex relationships. Here, we observed that LR had lower accuracy (82.1%) than XGB. LR may fail to detect nonlinearities, which likely accounts for its reduced accuracy in predicting treatment responses [17], highlighting that it is not simply a matter of “ML is good, non-ML is bad”. Indeed, inferential statistics offers several advantages over ML, such as providing clearer insights into causality and being more robust to overfitting [17]. Our recommendation is to integrate traditional inferential statistics from RCTs with ML analysis tailored to the data and specific research question(s). In many circumstances, ML approaches may not outperform traditional methods, especially if treatment response is related to simple, univariate and/or linear trends; if datasets are very small where complex models risk overfitting; or when strong theoretical knowledge already provides a robust explanatory framework. However, while RCT statistics provide valuable insights in many circumstances, ML can uncover more complex, multivariate, and non-linear patterns within the data that standard methods may overlook, thereby maximising the research value derived from clinical datasets.

To fully realise the benefits of ML analysis for patients, clinical phenotypes need to be characterised comprehensively. Here, the exclusion or inclusion of a single phenotypic variable (Z) determined whether the accuracy of XGB treatment predictions was 69.4% or 97.8%, respectively. This 28.4% difference is larger than the 19.4% difference between random chance predictions (50%) and using ML for analysis of incompletely characterised clinical phenotypes (69.4%). This underscores the need for “deep phenotyping”, capturing detailed and diverse clinical data [26–28].

Clinical phenotypes are the most direct indicator of how disease is manifesting in patients. Many conditions sharing a diagnostic label comprise a highly heterogeneous set of underlying pathologies, and treatment options for these conditions can be similarly diverse. For example, a range of autoimmune conditions are treated with various combinations of steroids, intravenous immunoglobulins, plasma exchange, monoclonal antibodies, methotrexate, azathioprine and mycophenolate mofetil. Similar diversity in disease processes and available treatment options is present for conditions like epilepsy and headache. In clinical practice, we collect detailed information to understand how, when, where, and why different symptom combinations manifest, which ultimately guides patient care. However, this level of detail is often lacking in research settings, where there is instead a tendency to simplify and isolate a few factors, often with strict exclusion criteria and randomisation, to minimise the influence of comorbidities and other variables. The precise phenotypic variables which might be critical will vary with the specific circumstances of an investigation, but for one specific example, highly impactful and rigorous trials in stroke medicine often reduce a complex clinical syndrome to a quantified NIH Stroke Scale (NIHSS) score, time since symptom onset, and a limited number of comorbidities such as diabetes, atrial fibrillation, and hypertension [29–31]. However, critical nuances in the clinical history, such as previous recurrent episodes suggestive of cerebral amyloid angiopathy (which increases the risk of haemorrhagic complications from thrombolysis), may be overlooked in initial trials and can take several years to become apparent [32].

Collecting and analysing detailed patient data has traditionally presented challenges for clinical research. With univariate statistical analysis, multiple comparisons across subgroups increase the risk of identifying spurious associations and necessitate adjustments to statistical significance thresholds, such as Bonferroni correction or false discovery rate control, which can obscure genuine trends [33]. Consequently, interventional studies using traditional inferential methods often limit data collection and/or analysis to a small subset of variables to verify the similarity between treatment and control groups at baseline. While this approach helps control confounding factors, it can produce findings that are less representative of the broader clinical population, and resulting evidence on the most effective treatments for specific patient subgroups is, therefore, limited.

ML analysis can instead allow variance across diverse clinical features to be embraced, studied and understood. The XGB analysis effectively identified and filtered relevant variables, providing insights which more accurately reflected ground truth than the imperfect surrogate trial outcome measure. Even in the presence of 10,000 additional pieces of noisy, covarying clinical information, the XGB analysis remained sensitive to true trends, without making spurious associations.

Collecting more detailed phenotype data from existing patients can also be less resource-intensive than recruiting additional patients. Recruiting a single new patient involves multiple time-consuming steps, including consent, eligibility screening, assessments, treatment, monitoring, and follow-up. In contrast, gathering additional phenotypic variables for each patient potentially takes seconds, or minutes, even across an entire cohort. If these variables are extracted automatically from an electronic health record (EHR), it may require no additional researcher time at all.

Advances in natural language processing techniques enable large-scale extraction of clinical information from EHRs [34–37], and there is a growing use of information captured during routine clinical care and stored within EHRs for research [38–41]. Utilising data from EHRs and routine clinical practice could allow researchers to compare different management strategies with outcomes in patient populations that better reflect real-world settings, providing larger-scale insights at lower cost. The most accurate reflection of a broad clinical population is the clinical population itself. However, the feasibility and insights possible from this form of research will entirely depend upon how comprehensively phenotypes and clinical outcomes are documented in routine care.

ML analysis can also help inform whether a more detailed clinical phenotype is required. In our analysis, the much lower accuracy of XGB predictions when a single, key, variable was missing, indicates that treatment effects are not fully understood, and gathering more phenotypic information could be beneficial. Conversely, very high prediction accuracy indicates that treatment responses are well understood, and that further data collection might yield diminishing returns, suggesting that research efforts could be more effectively focused elsewhere. With traditional inferential analysis, the same missing information does not directly affect the primary outcome measure; the analysis is “blind” to the existence and impact of such “data gaps”.

A common concern with ML models is their lack of interpretability, which can impede clinicians’ understanding and trust in their outputs, limiting adoption into clinical practice. However, we have demonstrated that thorough interrogation of a trained XGB ML model can reveal which variables, and their specific values, determine clinical treatment response. This information could aid in the development of treatment algorithms similar to those used for mitigating cardiovascular disease risk, which take into account specific ranges of values for blood pressure, cholesterol levels, weight, height, and various other variables [42]. This interrogative approach to ML can highlight critical aspects of a disease that warrant further attention, in clinical practice, research, and model development.

We advocate that a post-hoc, data-driven ML analysis, such as those described here, should become a routine part of any clinical research study with a suitably detailed and large dataset. While XGB can function effectively with smaller datasets of a few hundred patients, as illustrated here, its predictive power is enhanced with larger and more diverse datasets. As the number of clinical phenotype variables increases, the ‘curse of dimensionality’ demands more patients to ensure reliable results, as the data required for accurate generalisation grows exponentially [43,44]. XGB typically performs best with thousands to hundreds of thousands of samples, enabling it to capture complex relationships and nuances within the data [18,45]. This approach could help reveal novel, hypothesis-generating information and lead to clinically meaningful insights which might otherwise have been missed or taken decades to become apparent.

ML analysis need not be complex or resource intensive. All the analyses presented here were run within a minute on a single desktop computer. Our analyses serve as a clear example of how a straightforward gradient boosting ML classifier (XGB) can be effectively used in clinical research to achieve high predictive performance, versatility, scalability, and interpretability, to produce generalisable results. While this specific example used a binary classification predicting treatment response, the XGB algorithm can be easily adapted to perform regression, ranking, or other forms of prediction [18]. Applying these techniques effectively in practice requires systematic hyperparameter tuning (e.g., grid search, random search, Bayesian optimisation, or elastic net regularisation for LR) and sensitivity analysis to refine model performance. Although a detailed exploration is outside this work’s scope, practical guidance on these techniques is readily available in other resources [46,47].

Quality evaluations of clinical trials should arguably place greater emphasis on both how data are analysed and the extent of clinical phenotyping, the two topics we examined here. The purpose of evaluating trial quality is to establish a trial’s ability to reliably identify key information. Our findings clearly indicate that both the analysis methods used and the extent of clinical phenotyping directly influence a study’s ability to identify such key information. Various frameworks exist to evaluate clinical trial quality [24,48–51]. Assessments currently centre around details such as randomisation, blinding, follow-up, analysis, and reporting of a pre-registered primary outcome. There is emphasis on ensuring patient sample sizes are sufficiently large to produce accurate estimates of population averages. As the number of patients in a study increases, the ability to detect smaller effect sizes improves, allowing identification of subtle associations between variables. However, once an appropriately powered sample size is reached, further increases yield diminishing returns. In our simulations, expanding the sample size to millions of patients would not have significantly changed the conclusions, but a single missing piece of clinical phenotype information greatly limited the insights gleaned. Rather than considering whether this type of insight is being missed, current evaluation frameworks generally focus on clinical data collection only as a means to confirm treatment arm similarity at baseline [50], for reporting of baseline characteristics, or ancillary analyses including subgroup evaluations [24]. Our traditional RCT analysis was conducted in a manner which could score “full marks” on relevant current evaluation metrics. However, without using XGB analysis and capturing Z, the treatment recommendations from the RCT are a poor representation of the ground truth, causing the majority of patients to receive inappropriate treatment. Current trial evaluation quality criteria credit a study’s ability to detect very small effect sizes more than the ability to detect the much larger, impactful effects demonstrated here.

Another key aspect of evaluating trial quality is identifying and minimising potential sources of bias. We have already discussed how, in our analysis, traditional inferential methods resulted in systematic discrimination against certain patient groups. Limiting the breadth of data collection can introduce bias by systematically disadvantaging certain patient groups, and the omission of important information may lead to trial outcomes that misrepresent these patients in direct proportion to how much they deviate from the mean of the uncollected data [52].

In clinical situations that appear to have a simple binary endpoint, such as survival or complete resolution of symptoms from an acute infection, an ML model’s utility can come from considering more granular, secondary outcomes that capture the patient’s recovery trajectory. For example, even when a treatment ensures patient survival, there is often significant heterogeneity in the time to symptomatic resolution, the incidence of treatment-related adverse events, or the duration of hospitalisation. By defining the ‘treatment effect’ in terms of these more nuanced metrics, ML analysis can still identify patient phenotypes associated with more or less favourable responses. The approach’s core strength lies in potentially helping explain any form of clinically relevant, treatment-related variance.

This work should be viewed as an important proof of principle and baseline upon which greater complexity can be built. We used a single, simplified scenario as a starting point, but it has limitations. We modelled data as complete, though missing data are common in real-world scenarios; XGB models can, however, handle missing data effectively. ML models, particularly with high-dimensional data, risk overfitting, which can limit generalisability [16]. To address this, we tuned XGBoost parameters to reduce overfitting while preserving accuracy, ensuring robust, generalisable trends. Using a tree depth of three prioritises simplicity and interpretability, while avoiding overfitting and ensuring meaningful, low-order interactions can be captured. This shallow structure establishes a baseline against which future iterations can be compared, where deeper trees or more complex models could explore higher-dimensional, non-linear relationships if needed. Smaller cohorts and effect sizes could make it challenging for ML to reliably detect trends, so ML analysis is not suitable for all clinical research studies. This work demonstrates what can be achieved with a few hundred well-characterised cases, rather than the tens of thousands of data points required for some more complex, computationally intensive ML analyses. While this study focused on a single primary endpoint at one time point, ML has the potential to incorporate more nuanced analyses, such as time-course data.

ML techniques can deliver highly accurate, explainable, and generalisable predictions by analysing intricate interactions among multiple clinical variables. When appropriate analysis techniques are used, every piece of new clinical information collected has the potential to unlock new understanding of a disease which could benefit patients. There should be a greater drive to more comprehensively capture how diseases manifest with better clinical data, to enable patients to benefit from the potential insights which ML makes possible.

Methods

Creation of simulated clinical cohort data and determinants of ‘ground truth’ treatment response in a randomised controlled trial

For each member of the simulated clinical cohort of 1000 patients, we generated a set of arbitrary clinical phenotype variables. To minimise assumptions about the data’s underlying distributions, each variable was modelled independently and drawn from uniform distributions:

  • Age – an integer uniformly distributed between 18 and 100.

  • Sex – a binary variable with equal probability for the outcomes 0 and 1.

  • V1 – a binary variable with equal probability for the outcomes 0 and 1.

  • V2 – an integer uniformly distributed between 0 and 200.

  • X – an integer uniformly distributed between 50 and 100.

  • Y – an integer uniformly distributed between 40 and 100.

  • Z – a binary variable with equal probability for the outcomes 0 and 1.

These variables represented both binary and continuous data to reflect heterogeneous types of data used in clinical research. The lower and upper bounds for each variable were selected arbitrarily.
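As a minimal sketch, the cohort described above can be generated with NumPy. The seed, dictionary layout, and the assumption that the stated upper bounds are inclusive are ours; the code in the linked GitHub repository is authoritative.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility
n = 1000

# Each variable is drawn independently from a uniform distribution,
# matching the ranges listed above (upper bounds assumed inclusive;
# rng.integers excludes the high endpoint, hence the +1).
cohort = {
    "Age": rng.integers(18, 101, size=n),
    "Sex": rng.integers(0, 2, size=n),
    "V1":  rng.integers(0, 2, size=n),
    "V2":  rng.integers(0, 201, size=n),
    "X":   rng.integers(50, 101, size=n),
    "Y":   rng.integers(40, 101, size=n),
    "Z":   rng.integers(0, 2, size=n),
}
```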

Whether or not a patient would be responsive to the treatment was determined by a simulated non-linear multivariable relationship between X, Y and Z, illustrated in Fig 1 and explained in the associated text of the results section. In summary, patients were responsive to the treatment if X was 95 or above; if Z was present and Y was between 50 and 90; or if Y was between 50 and 90 and X was between 90 and 95.
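In code, this ground-truth rule reduces to a short boolean expression. The exact inclusivity of the boundaries is our assumption from the text; the repository code is definitive.

```python
def is_responsive(x: int, y: int, z: int) -> bool:
    """Ground-truth treatment responsiveness from phenotype variables X, Y, Z."""
    if x >= 95:                           # high X alone confers response
        return True
    if 50 <= y <= 90 and z == 1:          # mid-range Y plus Z present
        return True
    if 50 <= y <= 90 and 90 <= x < 95:    # mid-range Y plus near-threshold X
        return True
    return False
```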

For the simulated RCT, all members of the cohort were randomly assigned in a 1:1 ratio to receive either the treatment or placebo. Trial outcome data were then drawn from either of two distributions:

  • A ‘responsive’ distribution – for patients receiving treatment who are treatment responsive, where the outcome measure change was drawn from a distribution with mean 10 and standard deviation 3.

  • A ‘non-responsive’ distribution – for patients receiving placebo or treatment which they are not responsive to, where the trial outcome measure change was drawn from a distribution with mean 0 and standard deviation 3.
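Assuming the two distributions above are normal (the text specifies only their means and standard deviations), outcome generation can be sketched as follows; the function name and seed are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def trial_outcome(treated: bool, responsive: bool, rng=rng) -> float:
    """Change-from-baseline outcome for one simulated patient."""
    if treated and responsive:
        return rng.normal(10, 3)  # 'responsive' distribution
    return rng.normal(0, 3)       # placebo, or treatment without response

# Example: simulated outcomes for 5000 treated responders
samples = np.array([trial_outcome(True, True) for _ in range(5000)])
```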

For the analysis considering data deficiency, the Z variable was omitted from the data used for model training in the XGB analysis. All other values and variables were kept unchanged.

For the analysis considering data excess, additional ‘noisy’ variables were generated, starting with 5 noisy variables and increasing by 100 at each step, up to a total of 10,000 variables. One-sixth of the noisy variables were continuous and covaried with X. This covariance was established by taking the patient’s X value and adding a noise signal randomly drawn from a normal distribution with a mean of 0 and a standard deviation of 10. Another sixth of the noisy variables were continuous and covaried with Y, following the same method. Additionally, another sixth of the noisy variables were binary and covaried with Z, where the covariance was defined as a random selection of two-thirds of the values matching the patient’s Z value. The remaining half of the noisy variables were independent of all other variables, comprising either binary variables (randomly drawn from a probability distribution where the variable was present in 10% of patients and absent in 90%) or continuous variables (random integers between 18 and 100). The clinical cohort was resampled for this analysis, using the exact same distributions and definitions as above.
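The two covariance mechanisms described above can be sketched as follows. Variable names are ours, and our reading of "two-thirds of the values matching" (matched entries copy Z, the remainder are flipped) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.integers(50, 101, size=n)   # stand-in for the cohort's X values
z = rng.integers(0, 2, size=n)      # stand-in for the cohort's Z values

# Continuous noisy variable covarying with X: X plus N(0, 10) noise
noisy_x = x + rng.normal(0, 10, size=n)

# Binary noisy variable covarying with Z: a random two-thirds of entries
# copy the patient's Z value; the remainder take the opposite value
match = rng.random(n) < 2 / 3
noisy_z = np.where(match, z, 1 - z)
```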

Analysis of the randomised controlled trial data using traditional inferential statistics

All statistical analyses illustrated in Fig 2 were performed using the NumPy toolbox [53] in Python version 3.11.5. The sample size of 1000 patients provides 80% power to detect an effect size of 1.064. All outcome end points are reported in terms of the mean absolute change in the outcome variable and 95% confidence intervals for that change in the treatment group versus placebo. The subgroup confidence intervals have not been adjusted for multiple comparisons, which limits the ability to infer definitive treatment effects based upon them in isolation. Forest plots were produced using the Python forestplot toolbox [54]. All mean and 95% confidence interval values are reported to 2 decimal places.

Machine learning analyses

We conducted XGB ML analyses to examine the relationship between the simulated clinical phenotype variables and treatment responsiveness. The XGB models were trained to process all the variables known about each patient to make predictions about whether a patient is treatment responsive. In real life, clinical investigators would not be aware of ground truth treatment response and would have to rely upon surrogate trial outcome data. For all model training, we therefore used only the trial outcome as a surrogate of treatment response and did not include ground truth information. The XGB models were trained to predict whether patients had a change in the outcome variable of 5 or more, based upon information about all the clinical phenotype variables. Ground truth information was only used later, in evaluation of predictions made using the trial outcome data.

Data preprocessing for all the XGB analyses used the same steps each time. Non-binary data were scaled by removing the mean and scaling to unit variance, to ensure that each variable contributed equally to the model’s training.

An XGBoost binary classifier was initialised with a predefined set of hyperparameters [18], and the same model parameters were used in every analysis. The maximum depth of each tree was set to 3 to help prevent overfitting by limiting the complexity of the model. The learning rate, which determines the step size at each iteration, was set to 0.1 to balance model accuracy and computational efficiency. The number of trees (boosting rounds) in the model was 100. The subsample parameter was set to 0.8, meaning that 80% of the training data was randomly sampled to grow each tree. Similarly, 80% of the features were randomly sampled when creating each tree, introducing randomness and reducing overfitting. These pre-defined hyperparameters were chosen to ensure performance and reliability of the models in each of the XGB analyses and to minimise the risk of overfitting.

To ensure robust model evaluation, a five-fold cross-validation approach was implemented using the KFold method in all the XGB analyses. Out-of-fold predictions for each sample were obtained and these predictions on unseen data were used as the model prediction data. Predictions on the unseen data were compared with true treatment response (according to either trial outcome data values above/below five or ground truth) and sorted into confusion matrices (Fig 3 and S1 Fig) indicating the number of true positive (top left), false positive (top right), false negative (bottom left) and true negative (bottom right) classifications. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated for each classification. The overall accuracy of predictions is indicated in the bottom right corner of each confusion matrix. 95% confidence intervals for the overall accuracy estimates were calculated using the normal approximation of the binomial distribution.
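The out-of-fold evaluation scheme and the normal-approximation confidence interval can be sketched as below. For brevity, this sketch substitutes a generic scikit-learn classifier and synthetic data for the XGB model and simulated cohort; only the evaluation logic is the point here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict

X, y = make_classification(n_samples=1000, n_features=7, random_state=0)

# Each sample is predicted exactly once, by a model that never saw it
cv = KFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=cv)

cm = confusion_matrix(y, y_pred)  # rows: true class; columns: predicted class
acc = (y == y_pred).mean()

# 95% CI for accuracy via the normal approximation to the binomial
n = len(y)
half_width = 1.96 * np.sqrt(acc * (1 - acc) / n)
ci = (acc - half_width, acc + half_width)
```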

To quantify the statistical power and robustness of our analytical approach, the entire data generation and XGB analysis pipeline was repeated for 500 independent runs. For each run, a “successful detection” was defined as the model achieving an accuracy greater than 90% when classifying against the known ground truth.

To interpret XGB model predictions, SHAP (SHapley Additive exPlanations) values were computed for each prediction to determine how much each variable had impacted it [25]. The SHAP values were then used to generate a summary plot, visualising feature importance with a dot plot (Fig 4).

Fig 4 revealed that variables X, Y, and Z were the most influential in determining model predictions. To understand which specific values of these key variables contributed to treatment response, we first examined X, which, in some instances, had a large impact on predictions, indicated by high positive SHAP values above 2 in Fig 4. We generated a scatter plot of X values on the x-axis and their corresponding SHAP values on the y-axis (Fig 5A). This plot showed that X had a less pronounced effect when its value was below 90. Next, we identified that certain Y values were associated with negative predictions, as indicated by SHAP values below −2 in Fig 4. To investigate further, we plotted Y values against SHAP values for all patients when X was not influential, i.e., X values below 90 (Fig 5B). This plot revealed that Y had a reduced impact when its values ranged between 50 and 90. Finally, for patients with X values below 90 and Y values between 50 and 90, we plotted Z values on the x-axis against SHAP values on the y-axis to explore Z’s effect on predictions (Fig 5C).

For the LR analysis, default model parameters from the scikit-learn toolbox [47] were used. Data pre-processing followed the same steps as described above for the XGB analyses, with the exception that interaction terms between all combinations of variables were included for the LR model. This adjustment was made to account for potential interactions between the variables, ensuring the LR model’s ability to accurately detect multivariate effects was not excessively compromised. L2 regularisation was applied to minimise overfitting. The optimisation problem was solved using the ‘lbfgs’ algorithm. Tolerance for stopping criteria was set to 0.0001 and the maximum number of iterations to 100.
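A sketch of this LR configuration using scikit-learn is shown below. We use `PolynomialFeatures` to generate pairwise interaction terms on synthetic stand-in data; whether the study included higher-order interactions as well is not something this sketch asserts, and the repository code remains authoritative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=500, n_features=7, random_state=0)

# Scale features, add pairwise interaction terms, then fit L2-regularised
# LR with the lbfgs solver, tol=1e-4 and max_iter=100 as described above
lr = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(penalty="l2", solver="lbfgs", tol=1e-4, max_iter=100),
)
lr.fit(X, y)
```

With 7 input variables, the interaction expansion yields 7 original features plus 21 pairwise products, i.e., 28 features in total.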

Supporting information

S1 Fig. Confusion matrices with classification metrics for predictions of treatment response using XGB, when variable Z is removed from consideration.

Predictions of treatment response using XGB analysis are compared with the treatment response apparent in the trial outcome measure (A) and the ground truth (B). Orange shading denotes treatment responsive (suggested in the trial outcome or from ground truth) and blue shading denotes non-treatment responsive cells; bold orange/blue denote correct treatment allocation according to the outcome; light orange/blue denote inappropriate treatment allocation. PPV: positive predictive value; NPV: negative predictive value.

(DOCX)


Data Availability

All the datasets generated and analysed during the current study are available in the GitHub repository, https://github.com/stepdaug/ML-and-clinical-phenotyping.

Funding Statement

SDA is funded by a UK National Institute for Health and Care Research (NIHR) Clinical Lectureship and acknowledges infrastructure support for this research from the NIHR Imperial Biomedical Research Centre (BRC). GS is funded by the National Institute for Health Research [Advanced Fellowship]. The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care.

References

  • 1. Ferrari MD, Goadsby PJ, Burstein R, Kurth T, Ayata C, Charles A, et al. Migraine. Nat Rev Dis Primers. 2022;8(1):2. doi: 10.1038/s41572-021-00328-4
  • 2. Ferrell PB Jr, McLeod HL. Carbamazepine, HLA-B*1502 and risk of Stevens-Johnson syndrome and toxic epidermal necrolysis: US FDA recommendations. Pharmacogenomics. 2008;9(10):1543–6. doi: 10.2217/14622416.9.10.1543
  • 3. Hanney SR, Castle-Clarke S, Grant J, Guthrie S, Henshall C, Mestre-Ferrandiz J, et al. How long does biomedical research take? Studying the time taken between biomedical and health research and its translation into products, policy, and practice. Health Res Policy Syst. 2015;13:1. doi: 10.1186/1478-4505-13-1
  • 4. Berger JS. Aspirin for Primary Prevention-Time to Rethink Our Approach. JAMA Netw Open. 2022;5(4):e2210144. doi: 10.1001/jamanetworkopen.2022.10144
  • 5. Kim JH, Shim MJ, Lee S-Y, Oh J, Kim SH. Aspirin for Primary Prevention of Cardiovascular Disease. J Lipid Atheroscler. 2019;8(2):162–72. doi: 10.12997/jla.2019.8.2.162
  • 6. Writing Committee Members, Yancy CW, Jessup M, Bozkurt B, Butler J, Casey DE Jr, et al. 2013 ACCF/AHA guideline for the management of heart failure: a report of the American College of Cardiology Foundation/American Heart Association Task Force on practice guidelines. Circulation. 2013;128(16):e240-327. doi: 10.1161/CIR.0b013e31829e8776
  • 7. McMurray JJV, Adamopoulos S, Anker SD, Auricchio A, Böhm M, Dickstein K, et al. ESC Guidelines for the diagnosis and treatment of acute and chronic heart failure 2012: The Task Force for the Diagnosis and Treatment of Acute and Chronic Heart Failure 2012 of the European Society of Cardiology. Developed in collaboration with the Heart Failure Association (HFA) of the ESC. Eur Heart J. 2012;33(14):1787–847. doi: 10.1093/eurheartj/ehs104
  • 8. Gu A, Yue Y, Desai RP, Argulian E. Racial and ethnic differences in antihypertensive medication use and blood pressure control among US adults with hypertension: The national health and nutrition examination survey, 2003 to 2012. Circ Cardiovasc Qual Outcomes. 2017;10. doi: 10.1161/CIRCOUTCOMES.116.003166
  • 9. Wright JT Jr, Dunn JK, Cutler JA, Davis BR, Cushman WC, Ford CE, et al. Outcomes in hypertensive black and nonblack patients treated with chlorthalidone, amlodipine, and lisinopril. JAMA. 2005;293(13):1595–608. doi: 10.1001/jama.293.13.1595
  • 10. Thase ME, Denko T. Pharmacotherapy of mood disorders. Annu Rev Clin Psychol. 2008;4:53–91. doi: 10.1146/annurev.clinpsy.2.022305.095301
  • 11. Ross JS, Tse T, Zarin DA, Xu H, Zhou L, Krumholz HM. Publication of NIH funded trials registered in ClinicalTrials.gov: cross sectional analysis. BMJ. 2012;344:d7292. doi: 10.1136/bmj.d7292
  • 12. Read KL, Kendall PC, Carper MM, Rausch JR. Statistical Methods for Use in the Analysis of Randomized Clinical Trials Utilizing a Pretreatment, Posttreatment, Follow-up (PPF) Paradigm. 2013 [cited 11 Sep 2024]. doi: 10.1093/OXFORDHB/9780199793549.013.0014
  • 13. Mulder R, Singh AB, Hamilton A, Das P, Outhred T, Morris G, et al. The limitations of using randomised controlled trials as a basis for developing treatment guidelines. Evid Based Ment Health. 2018;21(1):4–6. doi: 10.1136/eb-2017-102701
  • 14. Weissler EH, Naumann T, Andersson T, Ranganath R, Elemento O, Luo Y, et al. The role of machine learning in clinical research: transforming the future of evidence generation. Trials. 2021;22(1):537. doi: 10.1186/s13063-021-05489-x
  • 15. Ghassemi M, Naumann T, Schulam P, Beam AL, Chen IY, Ranganath R. A Review of Challenges and Opportunities in Machine Learning for Health. AMIA Jt Summits Transl Sci Proc. 2020;2020.
  • 16. Auger SD, Jacobs BM, Dobson R, Marshall CR, Noyce AJ. Big data, machine learning and artificial intelligence: a neurologist’s guide. Pract Neurol. 2020. doi: 10.1136/PRACTNEUROL-2020-002688
  • 17. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. 2009 [cited 23 Sep 2024]. doi: 10.1007/978-0-387-84858-7
  • 18. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. p. 785–94. doi: 10.1145/2939672.2939785
  • 19. Zheng J, Li J, Zhang Z, Yu Y, Tan J, Liu Y, et al. Clinical Data based XGBoost Algorithm for infection risk prediction of patients with decompensated cirrhosis: a 10-year (2012-2021) Multicenter Retrospective Case-control study. BMC Gastroenterol. 2023;23(1):310. doi: 10.1186/s12876-023-02949-3
  • 20. Montomoli J, Romeo L, Moccia S, Bernardini M, Migliorelli L, Berardini D, et al. Machine learning using the extreme gradient boosting (XGBoost) algorithm predicts 5-day delta of SOFA score at ICU admission in COVID-19 patients. J Intensive Med. 2021;1(2):110–6. doi: 10.1016/j.jointm.2021.09.002
  • 21. Moore A, Bell M. XGBoost, A Novel Explainable AI Technique, in the Prediction of Myocardial Infarction: A UK Biobank Cohort Study. Clin Med Insights Cardiol. 2022;16:11795468221133611. doi: 10.1177/11795468221133611
  • 22. Hou N, Li M, He L, Xie B, Wang L, Zhang R, et al. Predicting 30-days mortality for MIMIC-III patients with sepsis-3: a machine learning approach using XGboost. J Transl Med. 2020;18(1):462. doi: 10.1186/s12967-020-02620-5
  • 23. Unnanuntana A, Gladnick BP, Donnelly E, Lane JM. The assessment of fracture risk. J Bone Joint Surg Am. 2010;92(3):743–53. doi: 10.2106/JBJS.I.00919
  • 24. Schulz KF, Altman DG, Moher D, CONSORT Group. CONSORT 2010 statement: updated guidelines for reporting parallel group randomised trials. BMJ. 2010;340:c332. doi: 10.1136/bmj.c332
  • 25. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems 30. 2017. doi: 10.5555/3295222.3295230
  • 26. Robinson PN. Deep phenotyping for precision medicine. Hum Mutat. 2012;33(5):777–80. doi: 10.1002/humu.22080
  • 27. Weng C, Shah NH, Hripcsak G. Deep phenotyping: Embracing complexity and temporality-Towards scalability, portability, and interoperability. J Biomed Inform. 2020;105:103433. doi: 10.1016/j.jbi.2020.103433
  • 28. Schork NJ. Personalized medicine: Time for one-person trials. Nature. 2015;520(7549):609–11. doi: 10.1038/520609a
  • 29. Xiong Y, Campbell BCV, Schwamm LH, Meng X, Jin A, Parsons MW, et al. Tenecteplase for Ischemic Stroke at 4.5 to 24 Hours without Thrombectomy. N Engl J Med. 2024;391:203–12. doi: 10.1056/NEJMOA2402980
  • 30. Hacke W, Kaste M, Bluhmki E, Brozman M, Dávalos A, Guidetti D, et al. Thrombolysis with alteplase 3 to 4.5 hours after acute ischemic stroke. N Engl J Med. 2008;359(13):1317–29. doi: 10.1056/NEJMoa0804656
  • 31. National Institute of Neurological Disorders and Stroke rt-PA Stroke Study Group. Tissue plasminogen activator for acute ischemic stroke. N Engl J Med. 1995;333(24):1581–7. doi: 10.1056/NEJM199512143332401
  • 32. McCarron MO, Nicoll JAR. Cerebral amyloid angiopathy and thrombolysis-related intracerebral haemorrhage. Lancet Neurol. 2004;3(8):484–92. doi: 10.1016/S1474-4422(04)00825-7
  • 33. Vickerstaff V, Omar RZ, Ambler G. Methods to adjust for multiple comparisons in the analysis and sample size calculation of randomised controlled trials with multiple primary outcomes. BMC Med Res Methodol. 2019;19(1):129. doi: 10.1186/s12874-019-0754-4
  • 34. Kraljevic Z, Bean D, Shek A, Bendayan R, Hemingway H, Yeung JA, et al. Foresight-a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. Lancet Digit Health. 2024;6(4):e281–90. doi: 10.1016/S2589-7500(24)00025-6
  • 35.Yang X, Chen A, PourNejatian N, Shin HC, Smith KE, Parisien C, et al. A large language model for electronic health records. NPJ Digit Med. 2022;5(1):194. doi: 10.1038/s41746-022-00742-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Hossain E, Rana R, Higgins N, Soar J, Barua PD, Pisani AR, et al. Natural Language Processing in Electronic Health Records in relation to healthcare decision-making: A systematic review. Comput Biol Med. 2023;155:106649. doi: 10.1016/j.compbiomed.2023.106649 [DOI] [PubMed] [Google Scholar]
  • 37.Lee RY, Kross EK, Torrence J, Li KS, Sibley J, Cohen T, et al. Assessment of Natural Language Processing of Electronic Health Records to Measure Goals-of-Care Discussions as a Clinical Trial Outcome. JAMA Netw Open. 2023;6(3):e231204. doi: 10.1001/jamanetworkopen.2023.1204 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Sauer CM, Chen L-C, Hyland SL, Girbes A, Elbers P, Celi LA. Leveraging electronic health records for data science: common pitfalls and how to avoid them. Lancet Digit Health. 2022;4(12):e893–8. doi: 10.1016/S2589-7500(22)00154-6 [DOI] [PubMed] [Google Scholar]
  • 39.Moynihan D, Monaco S, Ting TW, Narasimhalu K, Hsieh J, Kam S, et al. Cluster analysis and visualisation of electronic health records data to identify undiagnosed patients with rare genetic diseases. Sci Rep. 2024;14(1):5056. doi: 10.1038/s41598-024-55424-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Komorowski M, Celi LA, Badawi O, Gordon AC, Faisal AA. The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nat Med. 2018;24(11):1716–20. doi: 10.1038/s41591-018-0213-5 [DOI] [PubMed] [Google Scholar]
  • 41.Jacoba CMP, Celi LA, Silva PS. Biomarkers for Progression in Diabetic Retinopathy: Expanding Personalized Medicine through Integration of AI with Electronic Health Records. Semin Ophthalmol. 2021;36(4):250–7. doi: 10.1080/08820538.2021.1893351 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Hippisley-Cox J, Coupland C, Brindle P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. BMJ. 2017;357:j2099. doi: 10.1136/bmj.j2099 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Trunk GV. A problem of dimensionality: a simple example. IEEE Trans Pattern Anal Mach Intell. 1979;1(3):306–7. doi: 10.1109/tpami.1979.4766926 [DOI] [PubMed] [Google Scholar]
  • 44.Berisha V, Krantsevich C, Hahn PR, Hahn S, Dasarathy G, Turaga P, et al. Digital medicine and the curse of dimensionality. NPJ Digit Med. 2021;4(1):153. doi: 10.1038/s41746-021-00521-5 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Bentéjac C, Csörgő A, Martínez-Muñoz G. A comparative analysis of gradient boosting algorithms. Artif Intell Rev. 2021;54:1937–67. doi: 10.1007/S10462-020-09896-5 [DOI] [Google Scholar]
  • 46.Probst P, Wright MN, Boulesteix AL. Hyperparameters and tuning strategies for random forest. In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2019. doi: 10.1002/widm.1301 [DOI] [Google Scholar]
  • 47.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12. [Google Scholar]
  • 48.Higgins JPT, Altman DG, Gøtzsche PC, Jüni P, Moher D, Oxman AD, et al. The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials. BMJ. 2011;343:d5928. doi: 10.1136/bmj.d5928 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Jadad AR, Moore RA, Carroll D, Jenkinson C, Reynolds DJ, Gavaghan DJ, et al. Assessing the quality of reports of randomized clinical trials: is blinding necessary?. Control Clin Trials. 1996;17(1):1–12. doi: 10.1016/0197-2456(95)00134-4 [DOI] [PubMed] [Google Scholar]
  • 50.Nasa P, Jain R, Juneja D. Delphi methodology in healthcare research: How to decide its appropriateness. World J Methodol. 2021;11(4):116–29. doi: 10.5662/wjm.v11.i4.116 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Berger VW, Alperson SY. A general framework for the evaluation of clinical trial quality. Rev Recent Clin Trials. 2009;4(2):79–88. doi: 10.2174/157488709788186021 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Nelson A, Nachev P. Machine Learning in Practice—Clinical Decision Support, Risk Prediction, Diagnosis. In: Clinical Applications of Artificial Intelligence in Real-World Data. Springer International Publishing. 2023. p. 231–45. doi: 10.1007/978-3-031-36678-9_15 [DOI] [Google Scholar]
  • 53.Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585(7825):357–62. doi: 10.1038/s41586-020-2649-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.S. LSY, Aguilera JQ, Shapiro A. LSYS/forestplot: Mplot. Zenodo. 2024. doi: 10.5281/zenodo.10544056 [DOI] [Google Scholar]

Decision Letter 0

Matthew Chin Heng Chua

24 Apr 2025

Dear Dr. Auger,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Jun 08 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org . When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Matthew Chin Heng Chua

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“SDA is funded by a UK National Institute for Health and Care Research (NIHR) Clinical Lectureship. GS is funded by the National Institute for Health Research [Advanced Fellowship].

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care.”

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“The author(s) received no specific funding for this work.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

4. Please remove your figures from within your manuscript file, leaving only the individual TIFF/EPS image files, uploaded separately. These will be automatically included in the reviewers’ PDF.

5. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available? (See the PLOS Data policy.)

Reviewer #1: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

**********

Reviewer #1: In this work, a randomised clinical trial has been simulated to compare gradient boosting (XGBoost) machine learning with traditional analysis when “ground truth” treatment responsiveness depends on the interaction of multiple phenotypic variables. Traditional analysis detected an outcome measure change from baseline of 4.23 (95% CI 3.64–4.82). In contrast, machine learning correctly predicted treatment response in 97.8% (95% CI 96.6–99.1) of patients. Notably, accuracy dropped to 69.4% (95% CI 65.3–73.4) when a single variable was omitted. These results indicate that machine learning could maximise the insights derived from clinical research studies. Overall, it is meaningful work, but several points should be clarified.

1. The introduction did not sufficiently highlight the innovation and superiority of this work. The authors are encouraged to clarify the innovation and superiority of their work.

2. Clarify methodological generalizability and limitations of ml approaches. While the discussion provides a strong case for the superiority of XGB in this particular context, the manuscript would benefit from a more balanced and critical evaluation of the generalizability of these findings. For example, the authors could elaborate on the types of datasets or clinical conditions where ML approaches like XGB may not outperform traditional methods, as well as practical constraints such as data availability, model interpretability, or computational requirements. This would strengthen the manuscript by providing a more nuanced view of ML’s utility across diverse clinical settings.

3. Elaborate on the nature and impact of clinical phenotyping. The manuscript emphasizes the importance of “comprehensive clinical phenotyping,” but does not provide sufficient detail about what this entails in practice. The authors are encouraged to elaborate on the specific phenotypic variables critical to the model’s success and discuss how such detailed phenotyping can be realistically obtained in real-world clinical trials. Providing concrete examples or frameworks would enhance the translational relevance and practical guidance for future research.

4. The simulation assumes linear effects for non-critical variables (V1/V2), which may oversimplify real-world clinical complexity. For instance, age or biomarkers rarely exhibit uniform noise distributions. Introducing realistic correlations or nonlinear interactions among variables would enhance ecological validity. Additionally, XGBoost hyperparameters (e.g., max depth=3) are justified but lack sensitivity analysis. Reporting performance variation with deeper trees or alternative tuning strategies (e.g., grid search) would strengthen confidence in model robustness. Subgroup analyses in Figure 2 lack multiplicity adjustments, inflating Type I error risks. Acknowledging this limitation and proposing Bonferroni corrections or false discovery rate control would improve rigor.

5. The term "hidden treatment patterns" risks overstating novelty, as interactions are often hypothesized in clinical research. Clarify whether the "ground truth" represents simulated relationships or reflects established biological mechanisms (e.g., pharmacogenomics). Confusion matrices (Figure 3) use a colorblind-unfriendly red/green palette, risking misinterpretation. Replacing these with patterned fills or dual-labeling (e.g., text annotations) would improve accessibility. Reporting additional metrics like Cohen’s kappa or area under the receiver operating characteristic curve (AUROC) would contextualize accuracy improvements over chance expectations.

6. The study fails to connect findings to real-world clinical scenarios where phenotypic complexity limits trial generalizability (e.g., oncology, neurology). For example, how might missing variables like comorbidities or genomic markers affect ML predictions in diverse populations? Overstating ML’s ability to "maximize insights" without discussing computational costs, reproducibility challenges, or ethical implications (e.g., algorithmic bias in treatment allocation) undermines clinical utility. Underexplored potential biases (e.g., over-reliance on correlated variables like V1/V2) and the absence of temporal dynamics (longitudinal outcomes) limit translational relevance.

7. Claiming "data excess has no penalty" risks overgeneralization, as sparser signals or smaller datasets may suffer from noise-induced overfitting. Temper conclusions with caveats about scalability. The manuscript lacks engagement with existing literature on ML in clinical trials (e.g., SHAP/XGBoost applications in healthcare). A brief review contextualizing contributions (e.g., comparing to rule-based decision trees) would strengthen its novelty.

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #1: Yes:  Xiaohai Zheng

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

PLoS One. 2025 Oct 21;20(10):e0334858. doi: 10.1371/journal.pone.0334858.r002

Author response to Decision Letter 1


2 May 2025

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Formatting has been amended to meet all the formatting requirements described within the links provided.

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, we expect all author-generated code to be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

We have reviewed the policy and confirm all code will be shared in accordance with the policies.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript:

“SDA is funded by a UK National Institute for Health and Care Research (NIHR) Clinical Lectureship. GS is funded by the National Institute for Health Research [Advanced Fellowship].

The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care.”

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form.

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

“The author(s) received no specific funding for this work.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

As advised, we have provided the amended Funding Statement in the Resubmission cover letter and removed the statement from Acknowledgements.

4. Please remove your figures from within your manuscript file, leaving only the individual TIFF/EPS image files, uploaded separately. These will be automatically included in the reviewers’ PDF.

We have removed all figures from the manuscript file.

5. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

We have added the Supporting Figure caption to the end of the Manuscript file as advised.

We thank the reviewer for their helpful comments and appreciate the chance to refine our manuscript. We have carefully considered each suggestion and below are our point-by-point responses, detailing the revisions incorporated into the manuscript based on the valuable feedback.

Reviewer #1: In this work, a randomised clinical trial has been simulated to compare gradient boosting (XGBoost) machine learning with traditional analysis when “ground truth” treatment responsiveness depends on the interaction of multiple phenotypic variables. Traditional analysis detected an outcome measure change from baseline of 4.23 (95% CI 3.64–4.82). In contrast, machine learning correctly predicted treatment response in 97.8% (95% CI 96.6–99.1) of patients. Notably, accuracy dropped to 69.4% (95% CI 65.3–73.4) when a single variable was omitted. These results indicate that machine learning could maximise the insights derived from clinical research studies. Overall, it is meaningful work, but several points should be clarified.

1. The introduction did not sufficiently highlight the innovation and superiority of this work. The authors are encouraged to clarify the innovation and superiority of their work.

We have now revised the Introduction (final paragraph) to explicitly highlight the core, novel aspect of the current work, as follows:

“In this study, we investigated how both the analysis approach and characterisation of clinical phenotype within an RCT influences our ability to infer underlying 'ground truth'. Using simulated data representative of typical RCTs, we conducted a novel, formal comparison contrasting the effectiveness of traditional statistical methods versus ML approaches at identifying the key factors and interactions driving treatment response.”

When coupled with the added context regarding existing clinical ML applications (addressing comment 7, below), we believe the manuscript now articulates its specific place and contribution within the field more clearly.

2. Clarify methodological generalizability and limitations of ml approaches. While the discussion provides a strong case for the superiority of XGB in this particular context, the manuscript would benefit from a more balanced and critical evaluation of the generalizability of these findings. For example, the authors could elaborate on the types of datasets or clinical conditions where ML approaches like XGB may not outperform traditional methods, as well as practical constraints such as data availability, model interpretability, or computational requirements. This would strengthen the manuscript by providing a more nuanced view of ML’s utility across diverse clinical settings.

We agree with the need for a balanced discussion regarding the generalisability and limitations of ML approaches beyond these specific findings with XGB. In addition to the existing acknowledgement of the distinct strengths of traditional inferential statistics (paragraph 3, sentence 4), we have now explicitly incorporated the recommendation to elaborate on the types of datasets or clinical conditions where ML approaches may not outperform traditional methods (later in the same third paragraph):

“In many circumstances, ML approaches may not outperform traditional methods, especially if treatment response is related to simple, univariate and/or linear trends; if datasets are very small where complex models risk overfitting; or when strong theoretical knowledge already provides a robust explanatory framework.”

Regarding the importance of considering practical constraints, the Discussion addresses multiple key components of these practical considerations: challenges related to data availability and collection (Discussion paragraph 8), considerations surrounding model interpretability (Results section 4 and Discussion paragraph 11), and computational requirements (Discussion paragraph 13).

Together, these sections provide a balanced and comprehensive summary of some key considerations and limitations with ML relative to traditional inferential approaches.

3. Elaborate on the nature and impact of clinical phenotyping. The manuscript emphasizes the importance of “comprehensive clinical phenotyping,” but does not provide sufficient detail about what this entails in practice. The authors are encouraged to elaborate on the specific phenotypic variables critical to the model’s success and discuss how such detailed phenotyping can be realistically obtained in real-world clinical trials. Providing concrete examples or frameworks would enhance the translational relevance and practical guidance for future research.

We agree with the reviewer on the importance of clearly outlining what is meant by 'comprehensive clinical phenotyping' in practice. To ensure this concept is clear from the outset, the second and third sentences of the Introduction orient the reader with an explanation of our use of the term 'clinical phenotype' and how such phenotypes might be practically expanded to be more comprehensive.

Regarding the specific phenotypic variables critical to the model’s success in this study, there are two aspects to this: the predefined 'ground truth' variables and those identified empirically by the analysis. We hope the following locations clearly provide this detail: The 'ground truth' variables underpinning our simulations are presented visually in Figure 1 and described textually in the first sections of the Results and Methods. Subsequently, the variables identified as most important by the XGBoost analysis are presented and discussed in the Results section associated with Figures 4 and 5. Please let us know if there is any specific aspect which requires further explanation.

We acknowledge the practical challenges of acquiring detailed phenotypic data in real-world clinical trials. The specifics are highly dependent on the context of each study, making comprehensive coverage in a general discussion difficult. We address these practical considerations across several paragraphs in the Discussion (specifically 4, 5, 6, and 8), covering aspects such as utilising existing clinical records and the potential need for enhanced data capture structures. To provide a concrete example, as requested, paragraph 5 includes a description using stroke trials as an example comparing a relatively reductive NIHSS score with nuanced information which might be available within a clinical history (e.g., that which might indicate the presence of potential cerebral amyloid angiopathy) and impact treatment outcomes. Prompted by the reviewer's feedback, we have added clearer emphasis in paragraph 5:

“The precise phenotypic variables which might be critical will vary dependent upon the specific circumstances of an investigation, but for one specific example, highly impactful and rigorous trials in stroke medicine often reduce a complex clinical syndrome to a quantified NIH Stroke Scale (NIHSS) score, time since symptom onset, and a limited number of comorbidities such as diabetes, atrial fibrillation, and hypertension. However, critical nuances in the clinical history, such as previous recurrent episodes which might indicate cerebral amyloid angiopathy, which can increase the risk of haemorrhagic complications from thrombolysis, may be overlooked in initial trials and can take several years to become apparent.”

4. The simulation assumes linear effects for non-critical variables (V1/V2), which may oversimplify real-world clinical complexity. For instance, age or biomarkers rarely exhibit uniform noise distributions. Introducing realistic correlations or nonlinear interactions among variables would enhance ecological validity. Additionally, XGBoost hyperparameters (e.g., max depth=3) are justified but lack sensitivity analysis. Reporting performance variation with deeper trees or alternative tuning strategies (e.g., grid search) would strengthen confidence in model robustness. Subgroup analyses in Figure 2 lack multiplicity adjustments, inflating Type I error risks. Acknowledging this limitation and proposing Bonferroni corrections or false discovery rate control would improve rigor.

Regarding the reviewer's concerns about the potential oversimplification of assuming purely linear effects for the V1/V2 non-critical variables, we acknowledge that real-world clinical variables often exhibit complex, non-linear relationships and non-uniform noise distributions. While V1/V2 were kept simple for illustrative clarity, we designed the broader simulation involving up to 10,000 further non-critical variables specifically to address this concern by incorporating a variety of covariance structures, including both linear and non-linear interactions. We believe this approach significantly enhances the ecological validity beyond the baseline V1/V2 example. We recognise, as the reviewer implies, that comprehensively modelling all potential real-world biological complexity is an immense challenge, likely varying greatly between specific clinical scenarios. We believe the current simulation strikes a reasonable balance, providing a robust demonstration of the core concepts while acknowledging that specific applications would necessitate tailoring the modelled interactions to different specific contexts.
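For illustration only (this is not the study's actual simulation code, and all variable names and coefficients are hypothetical), the mix of covariance structures described above can be sketched as follows:

```python
# Illustrative sketch of simulating non-critical variables with varied
# covariance structures: one linearly correlated with a latent factor,
# one with a nonlinear dependence on it, and one of pure independent noise.
import numpy as np

rng = np.random.default_rng(42)
n = 1000                                    # number of simulated patients

latent = rng.normal(size=n)                          # shared latent factor
v_linear = 0.8 * latent + 0.6 * rng.normal(size=n)   # linear correlation
v_nonlinear = np.sin(latent) + 0.1 * rng.normal(size=n)  # nonlinear dependence
v_noise = rng.normal(size=n)                         # independent noise

# A simple correlation coefficient captures the linear relationship,
# whereas tree-based models can additionally exploit the nonlinear one.
corr_linear = np.corrcoef(latent, v_linear)[0, 1]
```

In a full simulation, many such variables with differing structures can be stacked into a feature matrix alongside the critical 'ground truth' variables.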

Our intention with the current manuscript was primarily to provide a proof-of-concept introduction to these ML methods for a predominantly clinical audience, hence our initial use of common default hyperparameters (like max depth = 3) to maintain focus on the core message. However, we are in full agreement that rigorous hyperparameter optimisation is a crucial step in developing deployable ML models. Therefore, following the reviewer's suggestion, we have now added text explaining this point to the Discussion paragraph 13:

“Applying these techniques effectively in practice requires systematic hyperparameter tuning (e.g., grid search, random search, Bayesian optimisation) and sensitivity analysis to refine model performance. Although a detailed exploration is outside this work's scope, practical guidance on these techniques is readily available in other resources.”

The new text is referenced to guide interested readers towards practical resources for exploring these optimisation strategies in greater depth.
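As a concrete illustration of the tuning strategies mentioned in the added Discussion text, the sketch below runs an exhaustive grid search with cross-validation. It is a minimal, hypothetical example using scikit-learn's `GridSearchCV` with a logistic-regression stand-in; the dataset, grid values, and estimator are illustrative and are not the manuscript's actual model or hyperparameters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic classification data standing in for trial phenotype variables
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Exhaustive search over a small, illustrative hyperparameter grid,
# scored by 5-fold cross-validated accuracy
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Random search and Bayesian optimisation follow the same fit-and-score pattern but sample the grid stochastically or adaptively rather than exhaustively.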

Regarding multiplicity adjustments, we agree with the reviewer’s point, and this motivated our choice to present results in Figure 2 with only 95% confidence intervals rather than using significance of non-corrected statistical tests. We have fully incorporated the reviewer’s recommendation to propose Bonferroni corrections or false discovery rate control in the Discussion (paragraph 6):

Using univariate statistical analysis, multiple comparisons across subgroups increases the risk of identifying spurious associations, and necessitates adjustments to statistical significance thresholds such as Bonferroni correction or false discovery rate control, which can obscure genuine trends.

It is interesting to note that even with the potentially inflated type I error risk, traditional methods still did not detect effects that are clear with the ML analysis.
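The two correction procedures named in the Discussion text can be sketched in a few lines of NumPy. This is a generic illustration of the standard definitions (Bonferroni and the Benjamini–Hochberg step-up procedure), not code from the manuscript; the p-values are made up.

```python
import numpy as np

def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    p = np.asarray(pvals, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjustment controlling the false discovery rate."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity from the largest p-value downwards
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.32]
print(bonferroni(pvals))          # [0.005 0.04  0.195 0.205 1.   ]
print(benjamini_hochberg(pvals))  # [0.005   0.02    0.05125 0.05125 0.32   ]
```

Bonferroni controls the family-wise error rate and is more conservative; Benjamini–Hochberg controls the expected proportion of false discoveries, retaining more power across many subgroup tests.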

5. The term "hidden treatment patterns" risks overstating novelty, as interactions are often hypothesized in clinical research. Clarify whether the "ground truth" represents simulated relationships or reflects established biological mechanisms (e.g., pharmacogenomics). Confusion matrices (Figure 3) use a colorblind-unfriendly red/green palette, risking misinterpretation. Replacing these with patterned fills or dual-labeling (e.g., text annotations) would improve accessibility. Reporting additional metrics like Cohen’s kappa or area under the receiver operating characteristic curve (AUROC) would contextualize accuracy improvements over chance expectations.

Regarding use of the term ‘hidden treatment patterns’: We agree that interactions are often hypothesised in clinical research. However, our intention with this work is to consider the specific scenario in which complex relationships that do not form part of a priori hypotheses can be uncovered through data-driven methods applied post hoc, as discussed in paragraph 12 of the Discussion.

Regarding clarification of what the ‘ground truth’ represents; we confirm this was indeed simulated to create a controlled environment for evaluating the methods. We have now added explicit statements clarifying this in both the Results (first section, paragraph 4):

“A simulated non-linear relationship between these three variables (X, Y, Z) determines which patients will be responsive to the treatment”

and Methods (section 1, paragraph 4):

“Whether or not a patient would be responsive to the treatment was determined by a simulated non-linear multivariable relationship”

Regarding metric reporting, we carefully considered the most appropriate metrics for illustrating the key points within the intended clinical context and opted to use confusion matrices. This choice was motivated by several factors: confusion matrices directly reflect performance at specific decision thresholds.
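The threshold-level summary that a confusion matrix provides, including the PPV and NPV reported in the figures, can be sketched as follows. This is a generic illustration of the standard definitions with made-up labels, not the manuscript's code.

```python
import numpy as np

def confusion_summary(y_true, y_pred):
    """Confusion-matrix counts plus PPV and NPV at a fixed decision threshold."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn,
            "ppv": tp / (tp + fp),   # positive predictive value
            "npv": tn / (tn + fn)}   # negative predictive value

print(confusion_summary([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))
```

Unlike AUROC, which integrates over all thresholds, these counts describe performance at the single operating point a clinician would actually use.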

Attachment

Submitted filename: Response to reviewers.docx

pone.0334858.s002.docx (230.2KB, docx)

Decision Letter 1

Ziheng Wang

7 Aug 2025

Dear Dr. Auger,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Sep 21 2025 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols . Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols .

We look forward to receiving your revised manuscript.

Kind regards,

Ziheng Wang

Academic Editor

PLOS ONE

Journal Requirements:

1. If the reviewer comments include a recommendation to cite specific previously published works, please review and evaluate these publications to determine whether they are relevant and should be cited. There is no requirement to cite these works unless the editor has indicated otherwise. 

Additional Editor Comments:

Reviewer 3 raises a concern regarding the lack of a Methods section. However, this information appears to be included in the latter part of the manuscript. The authors may consider relocating or emphasizing this section earlier in the manuscript to ensure better visibility and avoid potential confusion for readers.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

Reviewer #3: All comments have been addressed

Reviewer #4: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Partly

Reviewer #3: Partly

Reviewer #4: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Yes

Reviewer #4: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

Reviewer #1: The authors have meticulously addressed all the concerns raised in the initial review. I recommend that this manuscript be accepted for publication in PLOS One.

Reviewer #2: The authors took the reviewers' comments into account.

However, the database simulation seems unrealistic.

First, it is impossible to have a database based only on gender and age as sociodemographic factors and on only five variables.

Furthermore, the age variable generally has a normal distribution. It is illogical to treat it as a uniform distribution and assign the same probability to a 100-year-old patient as to an 18-year-old patient.

How did the authors also choose the lower and upper bounds for each simulated variable?

We are in the era of open science, and there are certainly several databases in the context studied, drawn from the real world. Why didn't the authors consider concretizing their work by testing it on a real database (such as the dbGaP example)?

An important criterion is missing: model power

For the LR model, why was the choice of L2 penalized? Since this is a simulation, an LR elastic model should have been used to determine the exact values of L1 and L2 that are preferable according to the simulated cohort.

Several studies have validated the power of XGBoost to predict the phenomena studied on real databases. However, an interpretability problem remains compared to its recognized flexibility. This problem is also present in the simulation carried out in this article. What is the added value of this work?

It is important to note that the figures are of low resolution. Please increase it.

Reviewer #3: The methodology section used in this paper is missing. A well-developed and structured section is mandatory.

From the introduction section, we move on to the results section; without the methodology section, it is impossible to complete the article.

Result section:

Title (lines 106-108): Too long as a title

Fig 1. Investigation outline (line 133):

For Figure 1: what is this explanatory and interpretative paragraph? Make the interpretations in your text, then put “Figure 1” in parentheses at the end of the paragraph.

Title: “An analysis of the randomised control trial data using traditional inferential statistics” (line 173): Too long as a title

Fig 2 (line 185-187) : is't title ?!

The same remarks for the rest of the figures and paragraphs

Revise the presentation of your text so that the information is easy for the reader to read.

Reviewer #4: The revised manuscript "Machine learning detects hidden treatment response patterns only in the presence of comprehensive clinical phenotyping" is well written and an important contribution to the body of knowledge.

The authors responded well to all the criticisms and concerns raised by the reviewers and I feel the revised manuscript is fit for publication.

However, one minor observation is that the machine learning model as developed by the authors is generally more applicable to diseases without a definitive cause(s), for example degenerative conditions or chronic non-communicable diseases. I wonder how the model would be applied, for example, in case of an infection or infectious disease like malaria with a known cause that can be detected in a laboratory and used as a guide by clinicians in determining therapy? Because with most infections, once they are effectively treated, the symptoms like headache, vomiting, diarrhoea etc, disappear. In other words, can the authors provide a scope for the diseases where the model can best be applied?

**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #1: Yes:  Xiaohai Zheng

Reviewer #2: No

Reviewer #3: No

Reviewer #4: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/ . PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org . Please note that Supporting Information files do not need this step.

PLoS One. 2025 Oct 21;20(10):e0334858. doi: 10.1371/journal.pone.0334858.r004

Author response to Decision Letter 2


11 Sep 2025

See also attached formatted document with full response to all reviewer comments.

Reviewer #1: The authors have meticulously addressed all the concerns raised in the initial review. I recommend that this manuscript be accepted for publication in PLOS One.

We thank the reviewer again for their previous comments. We are pleased they are satisfied with the amendments.

Reviewer #2: The authors took the reviewers' comments into account.

However, the database simulation seems unrealistic.

First, it is impossible to have a database based only on gender and age as sociodemographic factors and on only five variables.

Furthermore, the age variable generally has a normal distribution. It is illogical to treat it as a uniform distribution and assign the same probability to a 100-year-old patient as to an 18-year-old patient.

How did the authors also choose the lower and upper bounds for each simulated variable?

We are in the era of open science, and there are certainly several databases in the context studied, drawn from the real world. Why didn't the authors consider concretizing their work by testing it on a real database (such as the dbGaP example)?

We thank the reviewer for their comments.

Regarding choice of distributions and bounds, the variables were designed to be arbitrary placeholders to demonstrate a principle. We used uniform distributions for their simplicity and to make the fewest possible assumptions as a basis for generating random numbers. The variable labels "age" and "sex" were chosen purely to anchor the reader with familiar concepts for continuous and categorical data, respectively. We are not making any claims about clinical effects in specific age groups or modelling real-world demographics. Our choice of a uniform distribution for the variable labelled "age" was to demonstrate an effect of randomly generated numbers within a continuous variable, not to model the distribution of human age.

To make this clearer to the reader, we have revised the manuscript. We have added the following lines to the Methods section [lines 582 - 585]:

“…we generated a set of arbitrary clinical phenotype variables. To minimise assumptions about the data's underlying distributions, each variable was modelled independently and drawn from uniform distributions.”

We have also clarified the arbitrary nature of the variable bounds by adding this sentence [line 594 - 595]:

“The lower and upper bounds for each variable were selected arbitrarily.”

Finally, we reinforce this point in the Results section with the following new text [lines 111 - 112]:

"For each patient, the clinical phenotype consists of multiple arbitrary clinical variables, including illustrative variables for..."

Regarding the number of variables, we intentionally restricted the number of variables to the minimum required to illustrate our core points. In drafting this manuscript, we found that models with more variables made the specific effects we were trying to highlight difficult for the reader to follow, without adding any new conceptual insights. The purpose was clarity of demonstration. To show that these principles are not limited to overly simplified scenarios, we included the analysis in Figure 6, which demonstrates that the same principles hold true in a model with much greater complexity, including a larger number of variables and covariance structures.

Regarding the use of a real-world database, the core purpose of this study was to demonstrate that certain machine learning techniques can detect treatment effects (and non-effects) that might be missed by traditional inferential statistics. We therefore chose simulated data so that we could program specific, known effects into the data and then definitively assess whether a given analytical technique can or cannot detect them. This provides a clear, unambiguous benchmark of the method's capabilities, which is not possible with a real-world dataset. Our aim here is to establish a principle of analytical sensitivity, which can then be applied to real-world data in future work.

We believe these revisions make the illustrative, non-realistic nature of our simulation explicit, directly addressing the reviewer's concerns and clarifying the methodological rationale for our readers.

An important criterion is missing: model power

We thank the reviewer for raising the important point about model power. This is an important criterion for establishing the reliability of the method, and we have now performed a new dedicated analysis to address it.

The analysis presented in Figure 3 was intended as a deep dive into the model's classification performance on a single, representative instance of our simulation. It demonstrates how the model works by more accurately identifying the ground truth than the noisy trial outcome but it does not establish the reliability of this success.

To address this, we have now conducted a formal power analysis to quantify the probability that our XGB approach will successfully detect the true treatment effect under the simulated conditions.

We defined "successful detection" in a given simulation run as the model achieving an accuracy of >90% against the known ground truth labels. We then ran our entire simulation and analysis pipeline 500 times, recording a success or failure for each run based on this criterion.

The results of this analysis show that the model's power was 100%. Specifically, in 500 out of 500 independent simulations, the XGB model achieved >90% accuracy against the ground truth, demonstrating that our method is highly reliable and not the result of a single, fortuitous outcome.

We have added details of this power analysis to the manuscript. In the Methods Machine learning analysis section [lines 676 - 679]:

“To quantify the statistical power and robustness of our analytical approach, the entire data generation and XGB analysis pipeline was repeated for 500 independent runs. For each run, a "successful detection" was defined as the model achieving an accuracy greater than 90% when classifying against the known ground truth.”

In the Results, Analysis of the same randomised control trial data using machine learning section [lines 237 - 239]:

“A power analysis revealed that the method was highly robust and a success criterion of >90% accuracy against ground truth was met in 100% of 500 simulation runs, resulting in a statistical power of 100% to detect the simulated effect under these conditions.”
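The simulation-based power estimate described above reduces to a simple loop: repeat the pipeline, count how often the success criterion is met, and report the proportion. The sketch below shows only that estimation logic; `run_pipeline` is a stand-in that draws a hypothetical accuracy value rather than executing the authors' actual data-generation and XGB pipeline.

```python
import numpy as np

def run_pipeline(rng):
    """Stand-in for one full simulate-and-fit run; returns model accuracy
    against ground truth. Here a hypothetical draw, not the real pipeline."""
    return rng.normal(0.978, 0.005)

def estimate_power(n_runs=500, threshold=0.90, seed=0):
    """Proportion of runs meeting the success criterion (accuracy > threshold)."""
    rng = np.random.default_rng(seed)
    successes = sum(run_pipeline(rng) > threshold for _ in range(n_runs))
    return successes / n_runs

print(estimate_power())  # fraction of 500 runs exceeding 90% accuracy
```

With the real pipeline substituted into `run_pipeline`, this loop yields exactly the 500-run power figure reported in the revised Results.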

For the LR model, why was the choice of L2 penalized? Since this is a simulation, an LR elastic model should have been used to determine the exact values of L1 and L2 that are preferable according to the simulated cohort.

Our choice of a standard L2-penalised model was a deliberate methodological decision, designed to create a robust and interpretable baseline for comparison against our primary XGB ML technique.

Our primary goal for the LR analysis was to create a strong, traditional statistical competitor. To this end, we did more than just fit a simple model; we explicitly enhanced its capability by including interaction terms between all combinations of variables. By giving the LR model the capacity to see these interactions, we sought to ensure a fair comparison with XGB.

With all variables and their interactions included, the primary risk to the model is overfitting due to multicollinearity and large coefficient values. L2 regularisation is precisely the standard and appropriate tool for managing this risk. It penalises large coefficients to prevent any single feature from dominating, which is exactly what is needed here. Therefore, a standard L2 penalty is the most parsimonious and directly applicable choice. We used the default, well-vetted parameters from the scikit-learn library. This approach represents a standard, reproducible, and widely understood implementation of logistic regression. Introducing a complex hyperparameter tuning process for the baseline model (as required by Elastic Net) would add a layer of optimisation that could obscure the core comparison of this study, which is focused on the fundamental differences between the analytical approaches, not on fine-tuning (and potential overfitting) a specific baseline model.

Our implemented LR model, enhanced with interaction terms and appropriately regularised with a standard L2 penalty, represents a powerful and fair baseline. It was intentionally configured this way to robustly test the hypothesis rather than being optimised for a single data instance. We believe this choice is methodologically sound and best serves the illustrative purpose of our study.
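A minimal sketch of the baseline configuration described above follows: all pairwise interaction terms, standardised features, and scikit-learn's default L2-penalised logistic regression. The data and the interaction-driven ground truth are invented for illustration; this is not the manuscript's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 5))
# Hypothetical ground truth driven by an interaction between two variables
y = ((X[:, 0] * X[:, 1]) > 25).astype(int)

# Interaction terms for all variable pairs, then a standard L2-penalised LR
# (scikit-learn's default penalty) as the enhanced traditional baseline
model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    StandardScaler(),
    LogisticRegression(penalty="l2", max_iter=1000),
)
model.fit(X, y)
print(round(model.score(X, y), 3))
```

Because the expansion includes the product term itself, the L2-penalised model can capture this pairwise interaction; the point of the comparison is relationships (and higher-order structure) that even an interaction-augmented linear baseline misses.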

We have included mention of LR elastic models alongside discussion about other forms of hyperparameter tuning in the Discussion section [lines 513 - 517]:

“Applying these techniques effectively in practice requires systematic hyperparameter tuning (e.g., grid search, random search, Bayesian optimisation, LR elastic models) and sensitivity analysis to refine model performance. Although a detailed exploration is outside this work's scope, practical guidance on these techniques is readily available in other resources …”

Several studies have validated the power of XGBoost to predict the phenomena studied on real databases. However, an interpretability problem remains compared to its recognized flexibility. This problem is also present in the simulation carried out in this article. What is the added value of this work?

The reviewer is correct that while the predictive power of models like XGB is well established, their perceived lack of interpretability has been a major barrier to clinical adoption. Addressing this very challenge is one of the central contributions of our work, i.e., demonstrating how to meet the clinical need for transparent, interpretable insights, rather than solely pursuing predictive performance.

While many studies stop after reporting high accuracy, our paper's main purpose is to demonstrate a practical workflow for how this "black box" can be “opened” and its findings made useful to a clinician. We illustrate this through our application of post-hoc interpretability techniques, which allow us to move beyond a simple prediction and understand the drivers of that prediction.

We show how to determine the key factors driving the model's predictions across the entire cohort. For example, our analysis of feature importance in Figure 4 identifies which clinical phenotypes the model relied on most heavily, enabling clinicians to see whether biologically plausible factors are driving results and providing potentially hypothesis generating novel insights.

The added value is not necessarily in proving, once again, that XGB is powerful. The added value is in providing an illustrative template of how it can complement and directly address the flaws in standard inferential techniques used for clinical research. We demonstrate a practical methodology to translate a powerful but opaque prediction into a transparent, patient-specific insight that could guide clinical decision-making and hypothesis generation. Our work is intended to serve as a guide for researchers and clinicians on how to navigate the interpretability challenge and benefit from the full potential of these advanced models in medicine.

Our work also provides a practical demonstration of the data requirements needed for these methods to be reliable. We illustrate that the model's insights are highly sensitive to the completeness of the input data. The absence of even a single key feature can drastically alter the model’s performance, and the resulting clinical insights. This finding elevates the importance of comprehensive and robust patient phenotyping from merely "good practice" to an essential prerequisite for the safe and effective deployment of these advanced models.

It is important to note that the figures are of low resolution. Please increase it.

Thank you for noting the resolution issue. We have addressed this by providing a new, high-resolution version of Figure 4. All other figures have now been rendered at the maximum resolution and dimensions permitted by the journal's submission guidelines.

Reviewer #3: The methodology section used in this paper is missing. A well-developed and structured section is mandatory.

From the introduction section, we move on to the results section; without the methodology section, it is impossible to complete the article.

We direct the reviewer to the Methods section starting from page 14. The manuscript is structured in keeping with the journal’s manuscript organisation guidelines (https://journals.plos.org/plosone/s/submission-guidelines) whereby the Materials and Methods, Results and Discussion sections can be presented in any order. In the first paragraph of the Results section, we direct the reader to the Methods section for full methodological details.

Result section:

Title (lines 106-108): Too long as a title

We have revised this title to be more concise while retaining its core message. It now reads:

“Creation of simulated clinical cohort data and determinants of ‘ground truth’ treatment response”

Fig 1. Investigation outline (line 133):

For Figure 1: what is this explanatory and interpretative paragraph? Make the interpretations in your text, then put “Figure 1” in parentheses at the end of the paragraph.

We have ensured that the Figure 1 legend adheres strictly to the journal’s figure caption requirements (https://journals.plos.org/plosone/s/figures#loc-captions), which mandate that captions must be self-sufficient and fully understandable without reference to the main text. To comply with the requirement to 'Describe each part of a multipart figure,' the current legend structure is necessary. Moving descriptive text elsewhere would compromise its standalone clarity and violate these guidelines.

Title: “An analysis of the randomised control trial data using traditional inferential statistics” (line 173): Too long as a title

We have revised the section heading for brevity, ensuring it remains a clear guide to the content while maintaining consistency with other headings. It now reads:

“Analysis using traditional inferential statistics”

To maintain consistency, we also changed the title of the section which follows to:

“Analysis of the same data using machine learning”

Fig 2 (line 185-187) : is't title ?!

The same remarks for the rest of the figures and paragraphs

We have amended the caption for Figure 2 to clearly separate out a title and additional description in the legend. Upon reviewing all remaining figures, we confirmed that they were already consistent with this correct format, each featuring a distinct title. The issue was therefore seemingly isolated to Figure 2, which has now been rectified.

We were uncertain how the remark “is't (sic) title ?!” applies to 'paragraphs,' but if there is a specific formatting issue we have overlooked, we would be happy to address it.

Revise the presentation of your text so that the information is easy for the reader to read.

While we were encouraged that other reviewers complimented the writing, we recognise that clarity is paramount. In light of your feedback, we have carefully re-read the manuscript. It would be helpful to know whether any lack of clarity stems from the writing itself or from missing context that is detailed more fully in the Methods section, which the reviewer’s first comment suggests they may not have been aware is included in the manuscript. We would be grateful if you could point to any specific passages that are unclear, and we will happily revise them.

Reviewer #4: The revised manuscript "Machine learning detects hidden treatment response patterns only in the presence of comprehensive clinical phenotyping" is well written and an important contribution to the body of knowledge.

The authors responded well to all the criticisms and concerns raised by the reviewers and I feel the revised manuscript is fit for publication.

We thank the reviewer for their comments.

However, one minor observation is that the machine learning model as developed by the authors is generally more applicable to diseases without a definitive cause(s), for example degenerative conditions or chronic non-communicable diseases.

Attachment

Submitted filename: Response to reviewer comments.docx

pone.0334858.s003.docx (30KB, docx)

Decision Letter 2

Ziheng Wang

2 Oct 2025

Machine learning detects hidden treatment response patterns only in the presence of comprehensive clinical phenotyping

PONE-D-25-17285R2

Dear Dr. Auger,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager® and clicking the ‘Update My Information’ link at the top of the page. For questions related to billing, please contact billing support.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Ziheng Wang

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #2: Yes

**********

Reviewer #2: The authors have meticulously addressed all the concerns raised in my review. I recommend that this manuscript be accepted for publication in PLOS One.


**********

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Reviewer #2: No

**********

Acceptance letter

Ziheng Wang

PONE-D-25-17285R2

PLOS ONE

Dear Dr. Auger,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

You will receive further instructions from the production team, including instructions on how to review your proof when it is ready. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few days to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

You will receive an invoice from PLOS for your publication fee after your manuscript has reached the completed accept phase. If you receive an email requesting payment before acceptance or for any other service, this may be a phishing scheme. Learn how to identify phishing emails and protect your accounts at https://explore.plos.org/phishing.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Ziheng Wang

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Confusion matrices with classification metrics for predictions of treatment response using XGB, when variable Z is removed from consideration.

    Predictions of treatment response using XGB analysis are compared with the treatment response apparent in the trial outcome measure (A) and the ground truth (B). Orange shading denotes treatment responsive (suggested in the trial outcome or from ground truth) and blue shading denotes non-treatment responsive cells; bold orange/blue denote correct treatment allocation according to the outcome; light orange/blue denote inappropriate treatment allocation. PPV: positive predictive value; NPV: negative predictive value.
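As background for the metrics reported in this figure, the following is a minimal illustrative sketch (not the authors' code; the counts and function name are hypothetical) of how positive and negative predictive values are derived from the cells of a binary confusion matrix:

```python
# PPV and NPV from a 2x2 confusion matrix of predicted vs actual
# treatment response. The counts used below are illustrative only,
# not values from the study.
def ppv_npv(tp, fp, tn, fn):
    """Return (PPV, NPV) from true/false positive and negative counts."""
    ppv = tp / (tp + fp)  # predicted responders who truly respond
    npv = tn / (tn + fn)  # predicted non-responders who truly do not respond
    return ppv, npv

ppv, npv = ppv_npv(tp=45, fp=5, tn=40, fn=10)
print(round(ppv, 2), round(npv, 2))  # 0.9 0.8
```

In practice these quantities are typically computed directly from prediction arrays (e.g., with scikit-learn's confusion matrix utilities), but the cell-count arithmetic above is what the figure's PPV/NPV annotations summarise.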

    (DOCX)

    pone.0334858.s001.docx (523KB, docx)
    Attachment

    Submitted filename: Response to reviewers.docx

    pone.0334858.s002.docx (230.2KB, docx)
    Attachment

    Submitted filename: Response to reviewer comments.docx

    pone.0334858.s003.docx (30KB, docx)

    Data Availability Statement

    All the datasets generated and analysed during the current study are available in the GitHub repository, https://github.com/stepdaug/ML-and-clinical-phenotyping.

