Editorial
Medical Science Educator. 2024 Aug 13;34(6):1565–1570. doi: 10.1007/s40670-024-02137-2

Preparing Physicians of the Future: Incorporating Data Science into Medical Education

Rishi M Shah, Kavya M Shah, Piroz Bahar, Cornelius A James
PMCID: PMC11699019  PMID: 39758456

Abstract

The recent excitement surrounding artificial intelligence (AI) in health care underscores the importance of physician engagement with new technologies. Future clinicians must develop a strong understanding of data science (DS) to further enhance patient care. However, DS remains largely absent from medical school curricula, even though it is recognized as vital by medical students and residents alike. Here, we evaluate the current DS landscape in medical education and illustrate its impact in medicine through examples in pathology classification and sepsis detection. We also explore reasons for the exclusion of DS and propose solutions to integrate it into existing medical education frameworks.

Keywords: Data science, Medical education, Evidence-based medicine, Clinical informatics

Introduction

As the excitement surrounding artificial intelligence (AI)–powered large language models (LLMs)—most notably the public release of OpenAI’s ChatGPT in November 2022—continues to grow, health care systems are moving swiftly to understand the transformative impact that these technologies will have on patient care, research, and the education of future clinicians [1]. AI-based LLMs make use of many techniques from the field of data science (DS), a broad discipline focused on curating, analyzing, and deriving meaning from vast heterogeneous datasets. DS presents a holistic approach to data analysis by combining principles from domains such as statistics, mathematics, AI, and machine learning (ML) [2]. Like medicine, DS is concerned with gleaning actionable insights from noisy and diverse sources of information. Perhaps unsurprisingly, then, it is becoming increasingly evident that a comprehensive understanding of DS will be vital for physicians navigating the digital future of medicine.

As the amount of available digitized health data continues to increase, the prevalence of DS in medicine can be gauged by the volume of medical research employing DS methods over time. A PubMed search with “data science” as the Title/Abstract term shows a surge in the number of papers mentioning DS published from 2012 to 2023, indicating sustained interest in the development of new health care–related DS methodologies (Fig. 1). Publications with “machine learning” or “artificial intelligence” as the Title/Abstract terms show a similar increase over the same period. For reference, the number of publications related to medical education has increased at a much slower rate. Further, a combined search using the Title/Abstract terms “data science” and “medical education” between 2012 and 2023 yielded only 17 publications, indicating that the promise of DS in clinical practice has yet to be realized.

Fig. 1.

Published papers in the past 11 years as listed on the US National Library of Medicine (PubMed) using “artificial intelligence,” “data science,” “machine learning,” and “medical education” as Title/Abstract terms. The actual user queries were: (i) “artificial intelligence” [Title/Abstract] and (“2012/01/01”[PDAT]: “2023/12/31”[PDAT]); (ii) “data science” [Title/Abstract] and (“2012/01/01”[PDAT]: “2023/12/31”[PDAT]); (iii) “machine learning” [Title/Abstract] and (“2012/01/01”[PDAT]: “2023/12/31”[PDAT]); and (iv) “medical education” [Title/Abstract] and (“2012/01/01”[PDAT]: “2023/12/31”[PDAT])
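The four queries in the caption share one template, so they can be generated programmatically. The sketch below is illustrative only: the helper names `build_query` and `count_url` are hypothetical, the Boolean operator is written as uppercase `AND`, and the URL targets the public NCBI E-utilities ESearch endpoint with `rettype=count`. It only builds the request strings and does not contact PubMed; producing a per-year trend like the one in Fig. 1 would additionally require narrowing the date window to one year per request.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (no request is sent in this sketch).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

TERMS = ["artificial intelligence", "data science",
         "machine learning", "medical education"]


def build_query(term, start="2012/01/01", end="2023/12/31"):
    """Return a Title/Abstract query limited to a publication-date window."""
    return f'"{term}"[Title/Abstract] AND ("{start}"[PDAT] : "{end}"[PDAT])'


def count_url(term):
    """Return a URL that would ask ESearch for the number of matching records."""
    params = {"db": "pubmed", "term": build_query(term), "rettype": "count"}
    return f"{ESEARCH}?{urlencode(params)}"


for term in TERMS:
    print(build_query(term))
```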

DS holds the potential to improve hospitals’ operational decision-making. Centralized data-driven tools that are predictive, learn continually, and deliver optimal prescriptive recommendations to hospital systems have been shown to reduce health care costs, decrease wait times, increase capacity using existing infrastructure, boost revenue, and improve overall patient outcomes [3]. As ML and AI algorithms continue to become ubiquitous in health care, it is imperative for medical schools to implement a structured program for students to learn the fundamentals of DS, to ensure that future physicians can leverage the incredible power of computation for medicine on their own. If a robust infrastructure for DS instruction in medical education is not developed, clinicians will not be adequately prepared to harness “big data” to improve their practice, resulting in an overreliance on the guidance of non-clinical stakeholders, whose values may not always be well-aligned.

Using Data Science to Improve Medicine

While the “hype” around AI and ML in health care is certainly justified, it is important to note that a solid understanding of fundamental DS skills is a prerequisite to understanding the design and application of AI and ML algorithms in medical settings. For example, clinicians will need to first learn how to collect, organize, and visualize large amounts of data before feeding this information into AI algorithms that mimic the human decision-making process or ML models that generate predictions. Through the following use cases in pathology classification and sepsis detection, we show how recent AI and ML technologies effectively apply the principles of DS to solve critical problems in medicine.

Currently, the gold standard in pathology is a detailed, time-intensive manual examination of histopathology slides to diagnose conditions like cancer. While this process generally provides accurate diagnoses that ultimately guide treatment decisions, it demands extensive expertise and resources. Moreover, molecular assays that seek to create more personalized cancer therapies can also be costly and inefficient. For instance, conducting high-throughput genetic screens to identify specific gene variants for targeted treatments can be an expensive endeavor that produces results not applicable to individuals with highly heterogeneous cancers. In contrast, contemporary ML-based computer vision methods that automate the analysis of histopathology slides present immense potential for swift, replicable, and inexpensive clinical and molecular diagnoses [4, 5]. While the costs of initial implementation and maintenance of such methods can be substantial, these systems can lead to long-term cost savings by streamlining diagnostic workflows and reducing the need for manual labor, repeated scans, and chemical staining while still enhancing diagnostic accuracy [6, 7]. These automated computational pathology platforms are powered by deep learning algorithms and accurately predict tumor grade and evaluate mutational subtypes and gene expression signatures across a variety of cancers, while also surpassing the diagnostic accuracy of expert pathologists. The integration of various quality control measures, such as the removal of images of poor quality and partitioning of data into training, validation, and test sets, along with the synthesis of computer vision and deep learning techniques, presents an excellent application of the principles of DS to construct a robust analysis pipeline. Companies like PathAI and Tempus, devoted to developing models for improving classification processes, are emerging as key stakeholders in pathology [8].
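The quality control and data partitioning steps mentioned above can be sketched in a few lines. This is a minimal illustration, not the pipeline used by any cited platform: the slide records, the `quality` score, the 0.5 threshold, and the 70/15/15 split are all assumptions made for the example.

```python
import random


def quality_filter(slides, min_quality=0.5):
    """Drop slide images whose (hypothetical) quality score is below a threshold."""
    return [s for s in slides if s["quality"] >= min_quality]


def split(items, train_pct=70, val_pct=15, seed=0):
    """Shuffle reproducibly, then partition into training, validation, and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains is held out for testing
    return train, val, test


# Synthetic records standing in for digitized slides (not real data).
slides = [{"id": i, "quality": (i % 10) / 10} for i in range(100)]
kept = quality_filter(slides)
train_set, val_set, test_set = split(kept)
print(len(kept), len(train_set), len(val_set), len(test_set))
```

In practice, slides from the same patient are usually kept within a single partition so that the test set measures performance on unseen patients; the sketch above splits individual records for brevity.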

DS has also dramatically improved the detection of sepsis, a potentially life-threatening complication that affects nearly 1.7 million people each year [9]. As sepsis features symptoms such as fever, confusion, and discomfort, all of which can overlap with other conditions, it is often difficult to diagnose. Early detection of sepsis can dramatically improve a patient’s prognosis, which necessitates the development of accurate and automated systems. In 2022, Henry et al. developed the Targeted Real-time Early Warning System (TREWS), an ML model that identified signs and symptoms of sepsis hours before traditional methods and reduced the likelihood of a patient dying from sepsis by 20% [9]. This model is a paragon of DS in medicine, as it processed and synthesized a patient’s medical history, vitals, symptoms, and lab results in real time to accurately identify 82% of sepsis cases presented. DS-driven medical technologies like TREWS will require regular re-evaluation to ensure continued accuracy and reliability. In some instances, external validation and prospective studies will be necessary, especially when deploying models in settings where they were not originally developed. This validation process will help ensure the generalizability of the models, an important component of high-quality DS implementation. Nevertheless, the importance of DS for improving pathology classification and sepsis detection illustrates the transformative nature of the field in shaping health care.
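A figure such as "82% of sepsis cases identified" is a sensitivity, and re-evaluating a deployed model means recomputing this and related metrics on new data. The sketch below shows the arithmetic with invented counts chosen only so that sensitivity equals 0.82; they are not results from TREWS or any other study.

```python
def sensitivity(tp, fn):
    """Fraction of true cases that the model flagged."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of non-cases that the model correctly left unflagged."""
    return tn / (tn + fp)


def positive_predictive_value(tp, fp):
    """Fraction of alerts that correspond to true cases."""
    return tp / (tp + fp)


# Hypothetical confusion-matrix counts for 1,000 encounters (illustrative only).
tp, fn, fp, tn = 82, 18, 90, 810

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
print(f"PPV         = {positive_predictive_value(tp, fp):.2f}")
```

With these made-up counts, fewer than half of the alerts are true cases even though sensitivity is high, which is one reason external validation in a new setting, where disease prevalence may differ, matters.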

The Need for Medical Student Training in Data Science

Despite its potential to transform health care, DS remains largely absent from undergraduate medical education (UME). There are a few potential explanations for this absence. One key reason is that there are currently no explicit accreditation requirements related to DS [10]. Additionally, as more medical schools implement condensed pre-clerkship curricula, a greater emphasis is placed on preparing students for licensing exams and ensuring proficiency in established core competencies, neither of which currently includes DS [11, 12]. Recently, however, Russell et al. outlined a list of AI-related competencies for health professionals, focusing on the responsible and effective use of AI as a clinical tool, and Seth et al. proposed a DS curriculum for UME that aligns with existing competencies established by the Association of American Medical Colleges (AAMC), including systems-based practice and patient care [13, 14]. In the same vein, the Accreditation Council for Graduate Medical Education has established a joint initiative with the AAMC to design a new set of shared fundamental competencies in six areas, including practice-based learning and improvement and systems-based practice. These competencies are to be implemented in medical schools, with the goal of updating curricula to incorporate current trends in health care such as DS [15].

Precedent for including DS in medical education already exists. Biostatistics, a field closely related to DS, is a required or elective course offered at virtually all (n = 144/145) medical schools in the US [16]. Biostatistical tools enable students to formulate insightful inferences and create detailed descriptions of biological information such as DNA microarray data. However, biostatistics courses typically do not teach data engineering methods such as feature selection and dimensionality reduction that make DS robust, especially when dealing with large amounts of data. An important advantage of broadly integrating DS into UME in lieu of a standalone biostatistics course is that DS can equip students to make powerful predictions and uncover patterns in messy and unfiltered datasets, which are common in medical settings, a skill that is highly pertinent to health care today.

While some clinicians may be aware of the applications of DS to medicine, their training does not prepare them to take advantage of DS-related practices. A Stanford University School of Medicine survey of 523 US physicians and 210 medical students and residents found a stark dichotomy between the innovations that health care professionals perceive as being valuable to patients and their level of preparedness to implement these innovations in practice (Fig. 2) [17]. The same survey found that 92 (44%) students and 120 (23%) physicians plan to take courses on advanced statistics and DS to help them prepare for new trends in health care, illustrating a growing recognition of the potential benefits that DS tools can offer in improving patient care, medical research, and health care delivery. Additionally, a survey of the New England Journal of Medicine Catalyst Insights Council found that health care executives, clinical leaders, and clinicians identified DS as the second most important skill physicians of the future will need to succeed [18]. Taken together, these survey results show that in an increasingly data-driven health care landscape, providing high-quality DS training to medical students will be vital for them to effectively apply analytic tools in a clinical setting to deliver better outcomes for their patients.

Fig. 2.

Discrepancies between the perceived benefit of and preparedness to apply various DS-related innovations to medical practice. S&R, students and residents; P, physicians

Integrating DS into Medical Education Curricula

Given the increasing prevalence of its interdisciplinary applications in health care, it is essential for medical schools to integrate DS into their curricula, rather than treat it as an additional curricular element [19]. By incorporating DS into existing courses that fulfill the requirements for the “Scientific Method/Clinical/Translational Research” accreditation standard as set forth by the Liaison Committee on Medical Education, students can gain practical experience in working with real-world data they are likely to encounter in their clinical practice [20]. Several content areas within existing courses at many medical schools such as evidence-based medicine (EBM) and health systems science—including, but not limited to, population health, epidemiology, and precision medicine—lend themselves well to the integration of DS.

For example, when learning about population health and epidemiology, students can build decision tree classifiers to identify risk factors for diseases such as diabetes and hypertension or apply principal component analysis to visualize the progression and spread of pathogens. In pharmacology courses, students can learn how DS tools are used to analyze large datasets for drug discovery and predict patient responses to medications. Anesthesiology curricula can be strengthened by integrating modules that explore how machine learning algorithms can analyze intraoperative data for real-time risk prediction and how extensive patient datasets can be used to refine post-operative care protocols for specific patient subpopulations. When learning about precision medicine, students can use DS to extract information from sources such as medical databases and electronic health records, and to develop pipelines for assessing individual patient data, including genetic information, social determinants of health, and medical history, to guide personalized treatment plans.
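As one possible classroom exercise, the core step of a decision tree classifier, choosing the single split that best separates cases from non-cases, can be written without any libraries. The sketch below assumes a tiny invented dataset of [age, BMI] pairs with a binary diabetes label; the values and function names are hypothetical and carry no clinical meaning.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best single split."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds are midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= thr]
            right = [y for r, y in zip(rows, labels) if r[j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, thr, score)
    return best


# Invented toy data: each row is [age in years, body mass index].
rows = [[45, 22], [50, 24], [38, 23], [60, 31], [41, 33], [55, 35]]
labels = [0, 0, 0, 1, 1, 1]  # 1 = has the condition (illustrative only)

feature, threshold, impurity = best_split(rows, labels)
print(feature, threshold, impurity)
```

A full decision tree applies this search recursively to each resulting subgroup; in coursework, students would more likely use an established library implementation than write the search by hand.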

DS can also be integrated as an extension of commonly taught statistical methods in medical education. For example, regression analysis, a technique routinely taught in medical curricula, can be extended to incorporate DS methods such as random forests and support vector machines for predictive modeling. Similarly, instruction on survival analysis can be expanded upon via common DS techniques such as data regularization and dimensionality reduction to develop simple and generalized models to predict survival rates.
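To make the idea of regularization concrete, the sketch below extends ordinary least-squares regression with a ridge (L2) penalty for a single predictor. It is a simplified stand-in chosen for brevity, not a survival model: with an unpenalized intercept, the ridge slope is Sxy / (Sxx + lambda), so lambda = 0 recovers ordinary regression. The data and the function name `ridge_slope` are assumptions for illustration.

```python
def ridge_slope(x, y, lam):
    """Slope of a one-predictor ridge regression with an unpenalized intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    # The penalty lam inflates the denominator, shrinking the slope toward zero.
    return sxy / (sxx + lam)


# Invented data lying exactly on the line y = 2x (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

for lam in (0, 10, 100):
    print(lam, ridge_slope(x, y, lam))
```

Larger penalties give smaller coefficients, trading some fit on the training data for a simpler model that may generalize better.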

By incorporating DS principles into existing courses, medical students can develop essential skills in data analysis, interpretation, and application, empowering them to leverage AI and ML models to provide more effective patient care. These skills encompass crafting a precise clinical question using frameworks like PICO (Problem/Population, Intervention, Comparison, Outcome) and conducting thorough reviews of published studies to ensure study design validity, both central tenets of EBM [21]. To promote hands-on learning experiences in DS, medical schools should focus on developing robust infrastructure, including DS laboratories and high-performance computing resources [22]. Moreover, medical schools should encourage validation of DS tools through collaborative research, which will ensure the reliability and relevance of these systems in medical practice. Incorporating data literacy modules can further help address misinformation and prepare students for ethical DS use in health care [23]. An effective integration of DS in medical school curricula will equip future physicians with the necessary tools to navigate the evolving landscape of health care, where DS is already being utilized to improve diagnostic accuracy, treatment efficacy, and health outcomes.

Conclusion

Currently, training related to DS is insufficient across the continuum of medical education. While some students, residents, and practicing clinicians acknowledge the timeliness and importance of utilizing DS in health care, there remains a need for broader awareness and integration of this content to train physicians. Recent advancements in harnessing the power of DS, particularly in applications for rapid and accurate diagnoses of life-threatening conditions like sepsis and precise classification of pathological specimens, demonstrate its undeniable value. Medical schools must now prioritize the integration of DS into existing medical education curricula, forging a path to prepare future health care providers to adapt to trends in the era of the data-driven physician.

Author Contribution

Rishi M. Shah and Kavya M. Shah planned the manuscript. Rishi M. Shah prepared the figures. Rishi M. Shah, Kavya M. Shah, and Piroz Bahar wrote the manuscript. Cornelius A. James edited the manuscript and provided supervision. Kavya M. Shah submitted the manuscript.

Data Availability

The data underlying this article were derived from sources in the public domain: https://www.ncbi.nlm.nih.gov/pubmed and https://med.stanford.edu/dean/healthtrends.html.

Declarations

Competing Interests

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388:1233–9. 10.1056/NEJMsr22141842.
2. Li R, Kumar A, Chen JH. How chatbots and large language model artificial intelligence systems will reshape modern medicine: fountain of creativity or Pandora’s box? JAMA Intern Med. 2023;183:596–7. 10.1001/jamainternmed.2023.18351.
3. Agrawal S. Why hospitals need better data science. Harvard Business Review. 19 October 2017. https://hbr.org/2017/10/why-hospitals-need-better-data-science.
4. Diao JA, Wang JK, Chui WF, et al. Human-interpretable image features derived from densely mapped cancer pathology slides predict diverse molecular phenotypes. Nat Commun. 2021;12:1613. 10.1038/s41467-021-21896-9.
5. Sangha V, Nargesi AA, Dhingra LS, et al. Detection of left ventricular systolic dysfunction from electrocardiographic images. Circulation. 2023;148:765–77. 10.1101/2022.06.04.22276000.
6. Verghese G, Lennerz JK, Ruta D, et al. Computational pathology in cancer diagnosis, prognosis, and prediction - present day and prospects. J Pathol. 2023;260(5):551–63. 10.1002/path.6163.
7. Jahn SW, Plass M, Moinfar F. Digital pathology: advantages, limitations and emerging perspectives. J Clin Med. 2020;9(11):3697. 10.3390/jcm9113697.
8. Santosh K, Gaur L. Artificial intelligence and machine learning in public healthcare: opportunities and societal impact. Springer. 2021. 10.1007/978-981-16-6768-8.
9. Henry KE, Adams R, Parent C, et al. Factors driving provider adoption of the TREWS machine learning-based early warning system and its effects on sepsis treatment timing. Nat Med. 2022;28:1447–54. 10.1038/s41591-022-01895-z.
10. Obermeyer Z, Lee TH. Lost in thought - the limits of the human mind and the future of medicine. N Engl J Med. 2017;377:1209–11. 10.1056/NEJMp1705348.
11. Murphy B. Medical schools ponder move to shorter pre-clerkship curriculum. American Medical Association, 17 November 2021. https://www.ama-assn.org/medical-students/clinical-rotations/medical-schools-ponder-move-shorter-pre-clerkship-curriculum (accessed 4 December 2023).
12. Rider EA, Nawotniak RH. A practical guide to teaching and assessing the ACGME core competencies. 2nd ed. HCPro, Inc; 2010.
13. Russell RG, Lovett-Novak L, Patel M, et al. Competencies for the use of artificial intelligence–based tools by health care professionals. Acad Med. 2022;98:348–56. 10.1097/acm.0000000000004963.
14. Seth P, Hueppchen N, Miller SD, et al. Data science as a core competency in undergraduate medical education in the age of artificial intelligence in health care. JMIR Med Educ. 2023;9:e46344. 10.2196/46344.
15. Accreditation Council for Graduate Medical Education. New joint initiative launches to create common set of foundational competencies. ACGME Home, 26 July 2022. https://www.acgme.org/newsroom/2022/7/new-joint-initiative-launches-to-create-common-set-of-foundational-competencies/ (accessed 4 December 2023).
16. Association of American Medical Colleges. Number of medical schools including topic in required or elective courses: biostatistics. Liaison Committee on Medical Education (LCME) Annual Questionnaire Part II; 2016–2017. https://www.aamc.org/data-reports/curriculum-reports/data/curriculum-topics-required-and-elective-courses-medical-school-programs (accessed 4 December 2023).
17. Stanford Medicine. The rise of the data-driven physician. Stanford Medicine Health Trends Report; 2020. https://med.stanford.edu/dean/healthtrends.html (accessed 4 December 2023).
18. Mohta NS, Johnston SC. Medical education in need of a 2020 revamp. NEJM Catalyst. 2020;1(3). 10.1056/cat.20.0202.
19. Ötleş E, James CA, Lomis KD, Woolliscroft JO. Teaching artificial intelligence as a fundamental toolset of medicine. Cell Rep Med. 2022;3:100824. 10.1016/j.xcrm.2022.100824.
20. Standards, Publications, & Notification Forms. Liaison Committee on Medical Education. https://lcme.org/publications/ (accessed 12 April 2023).
21. Huang X, Lin J, Demner-Fushman D. Evaluation of PICO as a knowledge representation for clinical questions. AMIA Annu Symp Proc. 2006;359–363.
22. Matheny ME, Whicher D, Thadaney IS. Artificial intelligence in health care: a report from the National Academy of Medicine. JAMA. 2020;323(6):509–10. 10.1001/jama.2019.21579.
23. Topol EJ. Deep medicine: how artificial intelligence can make healthcare human again. Basic Books; 2019.


Articles from Medical Science Educator are provided here courtesy of Springer