Hepatobiliary Surgery and Nutrition
Editorial
2025 Sep 26;14(5):847–850. doi: 10.21037/hbsn-2025-455

Can artificial intelligence deliver on the promise of precision medicine?

Dimitris Bertsimas 1, Chen Lin 2, Samuel Singer 3, Georgios Antonios Margonis 1,3
PMCID: PMC12521024  PMID: 41104215

The National Cancer Institute defines precision medicine as the use of specific information about a patient’s tumor to help make a diagnosis, plan treatment, find out how well treatment is working, or make a prognosis. Of these, guiding individualized treatment decisions is arguably the most ambitious and consequential aim—and also the one most fraught with methodological challenges.

The vision of tailoring treatment to individual patients is not new. As early as 1965, Sir Austin Bradford Hill, the architect of modern randomized controlled trials (RCTs), remarked that RCTs “do not tell the doctor what he wants to know” (1). He acknowledged the limitations of population-level averages and emphasized the need to identify “what is the most likely outcome when this drug is given to a particular patient”. This recognition—of heterogeneity in treatment effects—laid the conceptual foundation for what we now call precision medicine.

Despite this early insight, the clinical implementation of precision medicine has been constrained by two fundamental barriers. First, unobserved confounding biases the estimation of treatment effects in observational data, making it difficult to draw causal inferences. For instance, several studies and meta-analyses have reported that wider surgical margins in resections of colorectal liver metastases (CRLMs) are associated with improved survival (2). However, this association may not reflect a true causal effect of wider margins, but rather the influence of unmeasured factors such as more indolent tumor biology or technically simpler tumors that happen to permit wider resections (3,4). Second, even in the absence of confounding—as in RCTs—we still lack a robust and objective framework for stratifying patients into subgroups to compare differential treatment effects. Historically, subgroup analyses in RCTs have relied on clinician-selected variables analyzed one at a time—a practice prone to multiple testing errors, confirmation bias, and oversimplification of the true drivers of treatment heterogeneity (5,6).

Attempts to use real-world data (RWD) for precision medicine predate the artificial intelligence (AI) era but have generally required strong, and often unverifiable, assumptions. For example, one can compare treated and untreated patients—or two treatments—in an observational study only if the direction of unobserved confounding is favorable, i.e., if the treatment group with better outcomes is also the one that would have been expected to do worse due to clinical selection bias. This rare alignment can lend credibility to causal claims despite residual confounding.

A case in point is our 2018 study at Johns Hopkins examining outcomes of anatomical versus non-anatomical liver resections for CRLMs, stratified by KRAS mutation status (7). We found that anatomical resections conferred a survival benefit only among patients with KRAS-mutated tumors. Although unobserved confounding was possible—since anatomic resections are typically performed on patients with more extensive disease—the direction of bias would have worked against our findings. That is, if anything, patients undergoing anatomical resections should have had worse outcomes. The fact that they didn’t lends credibility to a true treatment benefit, at least within the KRAS-mutated subgroup. This was a fortunate confluence: the presence of a biologically meaningful stratification variable (KRAS) and the likelihood that any bias would dilute, rather than exaggerate, the observed benefit.

But such ideal conditions are rare. In most clinical settings, we lack both reliable biomarkers and a deep enough biological understanding to confidently define subgroups for treatment personalization. Even when high-dimensional data such as genomics are available, they often introduce new challenges: the number of features (e.g., genes or mutations) may even exceed the number of patients, making it statistically difficult to separate true causal signals from random noise. This problem—known as the “curse of dimensionality”—is a major barrier to discovering meaningful treatment effect heterogeneity using real-world datasets. For example, a typical observational study may include a few hundred patients but thousands of genomic features, making it easy for a model to detect spurious associations that do not generalize. Without careful regularization, external validation, or biological prior knowledge, subgroup discovery in such settings is more likely to mislead than inform.
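To make this concrete, the toy simulation below (our own illustration, with purely synthetic data, not drawn from any cited study) screens 5,000 random "mutations" against an outcome that is, by construction, unrelated to all of them; at a conventional p<0.05 threshold, roughly 250 features are nonetheless flagged as "significant".

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_patients, n_features = 200, 5000                      # far more features than patients
    X = rng.integers(0, 2, size=(n_patients, n_features))   # random "mutation" calls
    y = rng.normal(size=n_patients)                         # outcome unrelated to any feature

    # Naive univariate screening of every feature against the outcome.
    p_values = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(n_features)])
    n_hits = int((p_values < 0.05).sum())
    print(f"'significant' features at p < 0.05: {n_hits} "
          f"(about {int(0.05 * n_features)} expected by chance alone)")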

This is where AI is often expected to make a transformative impact. But to truly enable precision medicine, AI must do more than just handle high-dimensional data. It must tackle the two core challenges: (I) generate clinically trustworthy patient stratifications, and (II) adjust for—or at least be robust to—unobserved confounding.

AI is well-positioned to address the first challenge. Tree-based algorithms such as Optimal Policy Trees (OPTs) and interpretable meta-learners like the X-learner can discover patient subgroups in a data-driven yet transparent way (8,9). Unlike conventional subgroup analyses, these methods partition patients using objective criteria and allow for complex interaction effects. When properly regularized and validated, they offer a principled and interpretable way to personalize treatment recommendations. Their transparent nature is critical: clinicians are unlikely to trust or adopt treatment guidance that comes from a black-box model, especially when the stakes involve altering standard care.
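As a rough sketch of how such a data-driven yet transparent pipeline can be assembled, the code below implements the X-learner of Künzel et al. (9) with standard scikit-learn components and then distils the estimated effects into a shallow decision tree of candidate subgroups. The variable names and the final tree step are our own illustrative choices; Optimal Policy Trees rely on dedicated optimization software that is not reproduced here, and this sketch is not the implementation used in the cited studies.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeRegressor, export_text

    def x_learner_cate(X, treated, y):
        """Estimate conditional average treatment effects (X-learner); X, treated, y are numpy arrays."""
        t, c = treated == 1, treated == 0
        # Stage 1: separate outcome models for treated and control patients.
        mu1 = RandomForestRegressor(random_state=0).fit(X[t], y[t])
        mu0 = RandomForestRegressor(random_state=0).fit(X[c], y[c])
        # Stage 2: impute individual treatment effects, then model them in each arm.
        tau1 = RandomForestRegressor(random_state=0).fit(X[t], y[t] - mu0.predict(X[t]))
        tau0 = RandomForestRegressor(random_state=0).fit(X[c], mu1.predict(X[c]) - y[c])
        # Stage 3: blend the two estimates, weighted by the propensity score.
        e = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
        return e * tau0.predict(X) + (1 - e) * tau1.predict(X)

    # Distil the estimates into a shallow, inspectable tree of candidate subgroups,
    # e.g. (with hypothetical X, treated, y, feature_names):
    # rules = DecisionTreeRegressor(max_depth=2).fit(X, x_learner_cate(X, treated, y))
    # print(export_text(rules, feature_names=feature_names))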

At this point, it is important to acknowledge a broader tension between AI-driven approaches and the foundational goals of precision medicine. AI models, particularly those trained on large and heterogeneous datasets, are often optimized to generalize across broad populations. As such, they may risk diluting the individual-level nuance that precision medicine aspires to capture. This raises a legitimate concern: can algorithms trained on population-level data truly provide recommendations that are meaningful at the level of the individual patient? We argue that this apparent contradiction can be reconciled. Methods like OPTs and X-learners aim to bridge this gap by identifying subgroups—intermediate strata between the global and the individual—that reflect clinically and biologically meaningful variation. In doing so, they retain the generalizability of AI while offering tailored, interpretable rules that bring us closer to the ideal of individualized care. Thus, while AI starts from “big data”, its output can still support “small data” decisions—those that matter for a specific patient in a specific clinical context.

However, even the best AI-based subgrouping method cannot solve the problem of unobserved confounding on its own. Most methods in causal machine learning—like targeted maximum likelihood estimation, inverse probability weighting, or double machine learning—still rely on the assumption that all confounders are measured (10). If a key prognostic variable is missing, no amount of clever modeling will recover the true treatment effect. This is a sobering limitation, and one that needs to be addressed before AI can be reliably used for treatment recommendation in observational data. The field of causal inference acknowledges this limitation, often proceeding by assuming away unmeasured confounding entirely (11). This may be justifiable in some settings, especially when rich data are available and the selection process is well understood. But in many areas of clinical medicine—particularly in surgery and oncology—our mechanistic understanding is incomplete, and important drivers of patient selection and outcomes may remain unmeasured or even unknown.
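A simulated toy example (ours, not from any cited dataset) makes the point explicit: inverse probability weighting recovers the true null treatment effect when the confounder is supplied to the propensity model, yet remains biased when only the measured covariate is used, however large the sample.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n = 20000
    x = rng.normal(size=n)                                # measured prognostic covariate
    u = rng.normal(size=n)                                # unmeasured confounder (e.g., tumor biology)
    p_treat = 1 / (1 + np.exp(-(0.7 * x + 0.7 * u)))      # sicker patients are treated more often
    treated = rng.binomial(1, p_treat)
    y = x + u + rng.normal(size=n)                        # true treatment effect is exactly zero

    def ipw_ate(cov, t, y):
        # Hajek-style IPW estimate; unbiased only if `cov` contains every confounder.
        e = LogisticRegression(max_iter=1000).fit(cov, t).predict_proba(cov)[:, 1]
        return (np.average(y[t == 1], weights=1 / e[t == 1])
                - np.average(y[t == 0], weights=1 / (1 - e[t == 0])))

    print("adjusting for x and u:", round(ipw_ate(np.c_[x, u], treated, y), 2))  # close to the true value of 0
    print("adjusting for x only: ", round(ipw_ate(x[:, None], treated, y), 2))   # remains biased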

To mitigate this, recent methodological innovations have aimed to explicitly model or adjust for unobserved confounding. Our team at Massachusetts Institute of Technology (MIT) has proposed two frameworks in this space. The first addresses overtreatment by identifying patients who may not need a therapy known to be effective—focusing on safely withholding treatment when the expected benefit is minimal (12). This framework relies on observational cohorts that capture the natural history of disease to estimate what would happen without treatment. The second framework shifts the focus from treatment eligibility to treatment utility, aiming to identify patients who do receive treatment but are unlikely to benefit from it (13). This approach requires RCTs to establish baseline prognosis and causal treatment effect. While promising, both frameworks depend on the availability of disease-specific data: the first on high-quality real-world cohorts, the second on relevant RCTs. This underscores the need for future methods that can generalize beyond existing trials and observational datasets—especially in rare diseases, underrepresented populations, or ethically constrained settings where RCTs are not feasible.

On the diagnostic side, AI has been far more successful. In radiology and pathology, for example, AI models routinely extract complex patterns from images and have demonstrated performance on par with—or even exceeding—human experts in some tasks. The reasons for this success are clear: (I) imaging data are rich and structured; (II) the labels are often objective (e.g., biopsy-confirmed cancer); and (III) interpretability is not always required in the same way it is for treatment decision-making. As a result, black-box models have gained traction and have even received regulatory approval in various diagnostic applications.

In prognostication, we return to familiar challenges. Risk prediction is often used to inform treatment decisions, clinical trial eligibility, and surveillance strategies. For example, adjuvant chemotherapy after CRLM resection may be recommended for patients deemed “high-risk”, even if the relative treatment effect is constant across patients, because the absolute benefit increases with baseline risk. Here, calibration becomes essential: we need predicted risks to match observed outcomes so that clinicians can make sound judgments based on expected absolute benefit.
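A back-of-the-envelope calculation (with illustrative numbers of our choosing) shows why this matters: under the same relative risk reduction, a high-risk patient stands to gain far more in absolute terms than a low-risk one.

    # Constant relative effect, very different absolute benefit (illustrative figures).
    relative_risk_reduction = 0.25                  # assumed identical for every patient
    for baseline_risk in (0.10, 0.60):              # "low-risk" vs "high-risk" patient
        arr = baseline_risk * relative_risk_reduction
        print(f"baseline risk {baseline_risk:.0%}: absolute risk reduction {arr:.1%}, "
              f"number needed to treat about {1 / arr:.0f}")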

Unfortunately, many machine learning models—especially complex ones like random forests and gradient boosting machines—are poorly calibrated. A model might rank patients correctly (high C-index) but fail to predict accurate absolute risks. This calibration problem has been referred to as the “Achilles’ heel” of machine learning in clinical medicine, and it remains a major barrier to the widespread use of AI-based prognostic tools (14). To overcome this, future frameworks must explicitly target calibration, which depends on two key conditions: (I) similarity in the distribution of prognostic factors between the training and validation cohorts (i.e., limited covariate shift); and (II) consistency in the relationship between these factors and outcomes (i.e., limited concept shift), the latter often driven by unmeasured prognostic variables.
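The short simulation below (our own, with synthetic risks) illustrates the distinction: applying a monotone distortion to perfectly calibrated predictions leaves the C-index untouched, because patient ranking is preserved, while the gap between predicted and observed risks widens sharply.

    import numpy as np
    from sklearn.calibration import calibration_curve
    from sklearn.metrics import roc_auc_score, brier_score_loss

    rng = np.random.default_rng(2)
    true_risk = rng.uniform(0.05, 0.60, size=5000)                   # well-calibrated predictions
    events = rng.binomial(1, true_risk)                              # outcomes drawn from those risks
    distorted = true_risk**3 / (true_risk**3 + (1 - true_risk)**3)   # same ranking, wrong scale

    for label, pred in [("well calibrated", true_risk), ("distorted", distorted)]:
        observed, mean_predicted = calibration_curve(events, pred, n_bins=5)
        print(f"{label:>15}: C-index {roc_auc_score(events, pred):.3f}, "
              f"Brier score {brier_score_loss(events, pred):.3f}, "
              f"observed-minus-predicted per bin {np.round(observed - mean_predicted, 2)}")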

In conclusion, AI holds great promise for precision medicine—but only if it is deployed with methodological rigor and domain awareness. For treatment decisions, interpretability and confounding adjustment are essential. For prognostic tools, calibration and generalizability must be prioritized. And across all applications, we must remain vigilant about the limitations of our data and models. The future of precision medicine will not be powered by AI alone, but by thoughtful integration of AI into a framework that respects both statistical principles and clinical realities.

Supplementary

The article’s supplementary files are as follows:

hbsn-14-05-847-coif.pdf (529.9KB, pdf)
DOI: 10.21037/hbsn-2025-455

Acknowledgments

None.

Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Footnotes

Provenance and Peer Review: This article was commissioned by the editorial office, HepatoBiliary Surgery and Nutrition. The article has undergone external peer review.

Funding: None.

Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://hbsn.amegroups.com/article/view/10.21037/hbsn-2025-455/coif). The authors have no conflicts of interest to declare.

References

1. Hill AB. Reflections on the controlled trial. Ann Rheum Dis 1966;25:107-13. doi: 10.1136/ard.25.2.107.
2. Margonis GA, Sergentanis TN, Ntanasis-Stathopoulos I, et al. Impact of Surgical Margin Width on Recurrence and Overall Survival Following R0 Hepatic Resection of Colorectal Metastases: A Systematic Review and Meta-analysis. Ann Surg 2018;267:1047-55. doi: 10.1097/SLA.0000000000002552.
3. D'Angelica MI. Positive Margins After Resection of Metastatic Colorectal Cancer in the Liver: Back to the Drawing Board? Ann Surg Oncol 2017;24:2432-3.
4. Bertsimas D, Margonis GA, Sujichantararat S, et al. Using Artificial Intelligence to Find the Optimal Margin Width in Hepatectomy for Colorectal Cancer Liver Metastases. JAMA Surg 2022;157:e221819. doi: 10.1001/jamasurg.2022.1819.
5. Brookes ST, Whitley E, Egger M, et al. Subgroup analyses in randomized trials: risks of subgroup-specific analyses; power and sample size for the interaction test. J Clin Epidemiol 2004;57:229-36. doi: 10.1016/j.jclinepi.2003.08.009.
6. Brookes ST, Whitley E, Peters TJ, et al. Subgroup analyses in randomised controlled trials: quantifying the risks of false-positives and false-negatives. Health Technol Assess 2001;5:1-56. doi: 10.3310/hta5330.
7. Margonis GA, Buettner S, Andreatos N, et al. Anatomical Resections Improve Disease-free Survival in Patients With KRAS-mutated Colorectal Liver Metastases. Ann Surg 2017;266:641-9. doi: 10.1097/SLA.0000000000002367.
8. Bertsimas D, Margonis GA, Sujichantararat S, et al. Interpretable artificial intelligence to optimise use of imatinib after resection in patients with localised gastrointestinal stromal tumours: an observational cohort study. Lancet Oncol 2024;25:1025-37. doi: 10.1016/S1470-2045(24)00259-6.
9. Künzel SR, Sekhon JS, Bickel PJ, et al. Metalearners for estimating heterogeneous treatment effects using machine learning. Proc Natl Acad Sci U S A 2019;116:4156-65. doi: 10.1073/pnas.1804597116.
10. Austin PC, Stuart EA. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Stat Med 2015;34:3661-79. doi: 10.1002/sim.6607.
11. Hernán MA. Methods of Public Health Research - Strengthening Causal Inference from Observational Data. N Engl J Med 2021;385:1345-8. doi: 10.1056/NEJMp2113319.
12. Bertsimas D, Koulouras AG, Margonis GA. The R.O.A.D. to precision medicine. NPJ Digit Med 2024;7:307. doi: 10.1038/s41746-024-01291-6.
13. Bertsimas D, Koulouras A, Nagata H, et al. The R.O.A.D. to clinical trial emulation. Res Sq 2025. [Preprint].
14. Van Calster B, McLernon DJ, van Smeden M, et al. Calibration: the Achilles heel of predictive analytics. BMC Med 2019;17:230. doi: 10.1186/s12916-019-1466-7.
