Author manuscript; available in PMC: 2024 Jan 18.
Published in final edited form as: J Ambul Care Manage. 2023 Jan 18;46(2):114–120. doi: 10.1097/JAC.0000000000000453

Machine learning and healthcare: Potential benefits and issues

J Graham Atkinson 1, Elizabeth G Atkinson 2
PMCID: PMC9974552  NIHMSID: NIHMS1861160  PMID: 36649491

Introduction and terminology

The precise definitions of machine learning (ML) and artificial intelligence (AI) are subject to debate. In some discussions ML is treated as entirely contained within AI; in others the two are considered to overlap without either being wholly included in the other. Certainly, AI encompasses some activities that are broader than ML. For the purposes of this paper, we consider ML to involve having a computer derive information from an enormous data set, while AI involves not only that activity but also, for example, learning from the application of a set of rules. Examples of learning from a set of rules are the recently developed AlphaGo program for playing the game Go (Silver, 2016) and the DeepMind chess program (Silver, 2018). The projects we discuss in this paper, which we suggest have the potential to advance medical treatment, are simpler ML applications: they use massive data sets to identify trends and uncover potential medical problems that the necessarily smaller-scale controlled trials underpinning the approval of drugs and equipment would not be powered to detect.

Interest in machine learning and artificial intelligence has exploded in recent years, as witnessed by the burgeoning number of papers returned by PubMed (the National Institutes of Health database of publications) searches using these terms. Figure 1 shows the numbers of papers appearing in searches using the criteria “machine learning” and “artificial intelligence”, by year of publication. In the 1990s the bars are barely discernible, while in 2021 there were over 20,000 papers involving ML and over 10,000 involving AI. 2022 was not included because the year was still incomplete at the time these data were generated, but interim results suggest that its numbers will substantially exceed those from 2021. The explosion in work on machine learning has been both facilitated and necessitated by the accumulation of big data and the increasing availability of computational power and tools to analyze it. By “big data” we mean large-scale data sets of thousands to millions of entries paired with metadata, often collected without a specific downstream analytic goal in mind. These may be, for example, administrative or billing datasets from healthcare systems, genomic and phenotypic datasets such as biobanks, or even data tracking interactions on commercial or social websites.

Figure 1. Counts of publications in the PubMed database using the terms “Artificial Intelligence” (AI) or “Machine Learning” (ML) over time.
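
For readers who wish to reproduce such counts, the sketch below queries PubMed year by year through NCBI's E-utilities via Biopython. The exact search fields used to generate Figure 1 are not stated in the text, so the Title/Abstract query string here is an assumption.

```python
# Illustrative sketch, assuming Biopython is installed and that a
# Title/Abstract search approximates the query behind Figure 1
# (the authors' exact search fields are not stated).
from Bio import Entrez

Entrez.email = "your.name@example.org"  # NCBI asks for a contact address

def pubmed_count(term: str, year: int) -> int:
    """Count PubMed records matching `term` published in `year`."""
    query = f'"{term}"[Title/Abstract] AND {year}[PDAT]'
    handle = Entrez.esearch(db="pubmed", term=query, retmax=0)
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"])

for year in range(1990, 2022):
    print(year,
          pubmed_count("machine learning", year),
          pubmed_count("artificial intelligence", year))
```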

It is also becoming more widely appreciated that there are potential ethical concerns regarding the use of AI and ML. In particular, there are major concerns that already existing disparities in healthcare and other areas could be perpetuated, and even exacerbated, by applications developed using ML on databases that incorporate the impact of those disparities. NIST (Schwartz et al., 2022) and UNESCO (UNESCO, 2021) have taken leading roles in drawing attention to these issues and have started to develop standards and criteria for best practices when running ML. The NIST report classifies bias into three types (statistical, human, and systemic), gives examples of each, and discusses the harm they can cause. The UNESCO report is more broadly focused on ethical issues, for example, AI in autonomous vehicles and the use of AI in judicial systems.

History and background on machine learning

AI has been subject to several boom-and-bust cycles over the past decades. In the 1940s and 1950s there was much excitement about its potential applications and perceived ability to revolutionize science and medicine, following and driven by McCulloch and Pitts's foundational model of the artificial neuron (McCulloch, Pitts, 1943). The perceptron, which built on this model, is what would now be described as a single-layer artificial neural network; it acted as a binary classifier, deciding whether a given input belonged to a particular class. Physical machine implementations of perceptrons were developed by Frank Rosenblatt at the Cornell Aeronautical Laboratory in 1958 (Rosenblatt, 1958). However, in 1969 Marvin Minsky and Seymour Papert (Minsky, Papert, 1988 for the expanded edition) published a book in which they proved that there were severe limitations to the types of classification that could be done using perceptrons and suggested (but did not prove) that adding layers to the networks would not overcome these limitations. This severely depressed enthusiasm for AI until it was realized in the 1980s that the limitations could be overcome by training multilayer networks with backpropagation. In recent years there has been a surge in applications of AI in a variety of areas, with widespread success particularly in speech and image recognition, extending even to self-driving vehicles.

Machine learning systems look for relationships, usually linear, between variables in massive data sets. The relationships can hold across the entire set of observations or within subsets, for example with respect to phenotypes. As the sizes of datasets grow, particularly for genomic datasets where individual records can include hundreds of millions of variables, the use of automated systems for the discovery of correlations has become increasingly necessary. However, finding correlations between variables, or between independent variables and phenotypes of medical interest, is just the first step in converting data into meaningful deductions or treatments. A vital next step is to determine whether the relationships are causal and, even more difficult, what the underlying mechanisms are. These factors are highly important for identifying potentially actionable molecular targets for the development of new medicines on the genetics front, and for understanding the direct impact of behaviors or lab measurements when interpreting and conveying individual patients’ risks.
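
As a concrete illustration of this first step, the following minimal sketch scans a large feature matrix for variables correlated with a phenotype. The data are simulated, and the variable counts and effect sizes are our own illustrative assumptions, not drawn from any study discussed here.

```python
# Simulated correlation scan: which of many variables track a phenotype?
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_variables = 5_000, 2_000

X = rng.standard_normal((n_subjects, n_variables))  # e.g., labs, genotypes
phenotype = 0.4 * X[:, 7] - 0.3 * X[:, 42] + rng.standard_normal(n_subjects)

# Pearson correlation of every variable with the phenotype, all at once.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (phenotype - phenotype.mean()) / phenotype.std()
r = Xc.T @ yc / n_subjects

for j in np.argsort(-np.abs(r))[:5]:
    print(f"variable {j}: r = {r[j]:+.3f}")  # variables 7 and 42 should surface
```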

The optimal dataset design for ML is also not the typical collection strategy for biomedical cohorts. ML excels when many variables are available for making predictions, even if it is not clear which variables contribute most to the prediction. In traditional cohort design, measuring large numbers of variables may not be practical, or may be unattractive to clinicians because clear inferences cannot be made about each individual variable. As such, the objective of data collection changes when ML strategies are intended for use. This also makes large datasets with high numbers of shallowly phenotyped variables ideal settings for ML, such as biobanks, which typically contain thousands of pieces of metadata about patients, and massive compilations of EHR records.

Examples of uses of ML in biomedical research

Thanks to its ability to detect subtle patterns and make inferences about large volumes of data that are challenging to interpret individually, ML is poised to transform biomedical research. One area ripe for mining by ML is the massive genomic biobank dataset, which combines an extremely high volume of samples/patients (typically hundreds of thousands to millions) with large numbers of phenotypes, surveys, and medical record metadata (typically thousands of traits) and immense volumes of genomic information; for sequencing-based biobanks, such as the All of Us Biobank spearheaded by the NIH (The “All of Us” Research Program Investigators, 2019), there can be dozens or even hundreds of millions of individual genomic variants available for study. Given the immense scale of such data, portions of which are likely to interact non-linearly with each other, it is extremely challenging to design informed models manually. However, ML techniques can readily ingest big data in the service of classifying subjects, a typical use case in a biomedical setting. Specifically, the goal of ML-based biomedical research is often to stratify the patient pool into two, or occasionally more, bins based on a large number of attributes of each individual. By training an ML algorithm on, for example, the millions of genetic loci present across samples with and without a particular illness, ML models can learn subtle genetic indicators that may predispose patients to get sick, have earlier disease onset, or have more severe health outcomes. The current primary use case in preventive medicine, therefore, is to identify patients who may be at elevated risk, thereby affording time for earlier interventions or screenings.
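
A minimal sketch of the case/control stratification just described follows. The simulated genotype matrix, the choice of scikit-learn, and the use of logistic regression are our illustrative assumptions, not the pipeline of any particular biobank study.

```python
# Train a classifier on many genetic-style features to separate
# cases from controls, then evaluate on held-out subjects.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p = 10_000, 500
genotypes = rng.integers(0, 3, size=(n, p)).astype(float)  # 0/1/2 allele counts

# A handful of loci carry true risk; the rest are noise.
effects = np.zeros(p)
effects[:10] = 0.25
risk = genotypes @ effects
case = rng.random(n) < 1 / (1 + np.exp(-(risk - risk.mean())))

X_tr, X_te, y_tr, y_te = train_test_split(genotypes, case, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```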

ML has already been employed across a wide variety of biomedical settings. For example, it has been used in psychiatric research to predict patients’ moods based on their behaviors (Wang et al., 2014), to infer ancestry from genetic information (Maples et al., 2013), to design assistive robots for patients with disabilities that react to patient intentions inferred from neuronal activity (Corbett et al., 2012; Yu et al., 2007), in oncology to classify patients as at high versus low risk of rapid disease progression (Kourou et al., 2015), and even to identify medical patterns associated with higher healthcare costs (Oliveai, 2021).

Correlation vs causality

The next issue is how to follow up once an interesting correlation has been found. A double-edged sword of ML is that while it excels at making predictions about biomedical data, the underlying model is often not readily interpretable. While prediction is often a primary goal of research, it is also often important for researchers to fully understand the features driving a prediction, as well as any error or noise affecting the results. In such cases, ML may provide a useful benchmark for an upper bound on predictive performance against which more interpretable algorithms may be compared.
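
The benchmarking idea can be made concrete with a short sketch: fit a flexible black-box model as an upper bound on achievable prediction, then ask how close a more interpretable model comes. The dataset and both model choices below are illustrative assumptions.

```python
# Compare a black-box benchmark against an interpretable model.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = [
    ("gradient boosting (black box)", GradientBoostingClassifier(random_state=0)),
    ("logistic regression (interpretable)", LogisticRegression(max_iter=5000)),
]
for name, model in models:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean cross-validated AUC = {auc:.3f}")
```

If the interpretable model approaches the black-box benchmark, little predictive performance is sacrificed for the ability to inspect the features driving the prediction.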

A key question that should be addressed when interpreting patterns identified by ML is whether there is a causal relationship between the variables, or whether they are instead reflections of some common underlying cause (Figure 2). For example, for a given disease, a particular protein may be strongly associated with illness, but is the presence of this protein a symptom of the disease, a primary cause, or an intermediate cause? Answering these questions is likely to require considerable additional study. Further, even after a causal relationship is identified, the results may not be sufficient to enable or persuade medical practitioners to modify the standard of care to implement the findings. Several statistical strategies are used for causal inference. In the field of genetics, a technique known as Mendelian Randomization has come into popularity in the context of large genomic datasets; it assesses whether individual genetic variants themselves contribute to disease or whether their apparent relationship to disease is confounded by another factor. A classic example of a potential indirect relationship is the observation that nicotine receptor genetic variants are very strongly associated with lung cancer. These variants, however, may not be directly involved in any oncological process but rather influence the likelihood that an individual becomes a habitual smoker upon trying smoking, which directly increases the risk of lung cancer.

Figure 2. Directed Acyclic Graph illustrating the potential confounding of a perceived causal relationship. The arrows represent causal links between variables. The confounder variable may or may not be directly observable in the dataset.
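
The core logic of Mendelian Randomization can be illustrated with a small simulation using the simple Wald-ratio estimator. The effect sizes and variable names below are illustrative assumptions, and real MR analyses rest on further conditions (valid instruments, no pleiotropy) not modeled here.

```python
# Simulated confounding as in Figure 2, recovered with a genetic instrument.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

Z = rng.integers(0, 3, n).astype(float)         # genetic instrument (allele count)
U = rng.standard_normal(n)                      # unobserved confounder
X = 0.5 * Z + 1.0 * U + rng.standard_normal(n)  # exposure (e.g., a behavior)
Y = 0.3 * X + 1.0 * U + rng.standard_normal(n)  # outcome; true causal effect = 0.3

naive = np.cov(X, Y)[0, 1] / np.var(X, ddof=1)  # confounded regression slope
beta_zx = np.cov(Z, X)[0, 1] / np.var(Z, ddof=1)
beta_zy = np.cov(Z, Y)[0, 1] / np.var(Z, ddof=1)
wald = beta_zy / beta_zx                        # instrument-based estimate

print(f"naive slope: {naive:.3f} (biased upward by the confounder)")
print(f"Wald ratio:  {wald:.3f} (close to the true 0.3)")
```

Because the genetic variant is assigned at conception, it is unaffected by the confounder, which is why the ratio of its effects on outcome and exposure isolates the causal pathway.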

To identify potential treatments, it may be necessary to determine the mechanism by which the disease is being caused and to find methods to block, sidestep, or otherwise interfere with that mechanism. For novel potential therapies this would typically require intensive functional validation of putative causal loci, controlled tests, and sophisticated statistical analysis, though there are occasionally opportunities for repurposing existing medications that target relevant shared biological pathways. In all cases, careful consideration of potential mediators of the phenotype is highly important to ensure appropriate interpretation of outcomes.

Another important concept in the interpretation of ML results is overfitting, which can often be invisible to the researchers. If used without due care, ML may produce classifiers that appear to perform impressively on the dataset used for training but do not generalize to other data sources. Several strategies have been designed to address overfitting in ML algorithms, a primary one being regularization. Regularization penalizes the weights assigned to data features, with the goal of reducing model complexity and eliminating features that do not contribute substantially to the signal. Such processes can aid in feature selection for more generalizable ML models. Selecting appropriate features of the data to train on is of paramount importance, as is the related effort of removing noisy and irrelevant information; this keeps the most important and useful signals in the data robust and generalizable.
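
A brief sketch of regularization, assuming simulated data and an L1 (lasso) penalty as one common choice: the penalty shrinks uninformative coefficients to exactly zero, trimming features that do not contribute to signal.

```python
# L1 regularization prunes noise features and preserves held-out performance.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n, p = 500, 200                    # many features relative to samples
X = rng.standard_normal((n, p))
y = X[:, :5] @ np.array([2.0, -1.5, 1.0, 0.8, -0.5]) + rng.standard_normal(n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
lasso = LassoCV(cv=5).fit(X_tr, y_tr)  # penalty strength chosen by cross-validation

kept = np.flatnonzero(lasso.coef_)
print(f"features retained: {len(kept)} of {p}")  # mostly the 5 true signals
print(f"held-out R^2: {lasso.score(X_te, y_te):.3f}")
```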

ML and health disparities

Another significant and sensitive topic is the potential impact of machine learning on racial and socioeconomic disparities. If machine learning algorithms are trained on datasets that embody existing disparities in treatment or diagnosis, there is a risk that these disparities will be perpetuated unless they are recognized and accounted for in the analysis. This issue is of sufficient concern that the National Institute of Standards and Technology (NIST) in March of 2022 issued a report titled “Towards a Standard for Identifying and Managing Bias in Artificial Intelligence” (Schwartz et al., 2022). One particularly concerning instance involves the algorithm used to prioritize patients on the waiting list to receive a kidney transplant, discussed in lay terms in an article by Tom Simonite in Wired Magazine whose title expresses the concern: “How an Algorithm Blocked Kidney Transplants to Black Patients” (Simonite, 2020). Though the problems resulting from the algorithm were not due to any deliberate attempt at racial discrimination, the outcomes were nonetheless discriminatory. David G. Robinson (Robinson, 2022) expands on the problems of prioritizing kidney transplants in his recent book “Voices in the Code”.

As most existing large-scale datasets are extremely Eurocentric, models trained on current resources will not accurately capture genetic or phenotypic diversity representative of the global population, and as such may yield higher accuracy and utility in patients of European descent than in other populations (Sirugo et al., 2019; Martin et al., 2019). Recruiting or aggregating more representative sample sets as training resources would be a key first step toward tackling such equity issues in predictive performance.
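
One practical way to surface this kind of inequity, sketched below, is to evaluate a trained classifier separately within each population group rather than only in aggregate. The function and the commented usage are placeholders for illustration, not a prescribed auditing standard.

```python
# Per-group evaluation: does the model discriminate equally well everywhere?
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_by_group(y_true, y_score, groups):
    """Report discrimination (AUC) within each population group."""
    for g in np.unique(groups):
        mask = groups == g
        auc = roc_auc_score(y_true[mask], y_score[mask])
        print(f"group {g}: n = {mask.sum():>5}, AUC = {auc:.3f}")

# Hypothetical usage, assuming y_test, a fitted model, and per-subject
# ancestry labels exist:
# auc_by_group(y_test, model.predict_proba(X_test)[:, 1], ancestry_labels)
```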

Summary

Having raised various warnings throughout this perspective, it is necessary to say that ML is a juggernaut that cannot, and should not, be stopped. Our intention is not to foment fear regarding the utility of ML; indeed, it holds great promise for bettering patient outcomes across biomedical science. Rather, we aim to contextualize it and provide some caveats and cautions to help avoid common pitfalls. ML is being used in myriad ways in health care, with particularly robust adoption in image processing, and has had some remarkable successes in radiology. Additional areas in which it is likely to be highly valuable include the identification of patients at elevated risk for diseases, the identification of rare and unexpected complications of medical procedures, the identification of rare adverse effects of drugs, and the characterization of drug interactions. These are areas in which controlled trials are unlikely to be effective at identifying problems because, by the nature of their design, they often have too small an enrollment to detect rare events. However, once a potential issue is identified by an ML algorithm, the next step should be additional targeted analysis to confirm or refute the effect and, if possible, to identify the mechanisms involved.

Sources of funding:

EGA is supported by the National Institute of Mental Health (K01 MH121659), the Caroline Wiess Law Fund for Research in Molecular Medicine, and the ARCO Foundation Young Teacher-Investigator Fund at Baylor College of Medicine.

Footnotes

No conflicts of interest exist.

Bibliography

  1. Corbett EA, Perreault EJ, Kording KP. (2012) Decoding with limited neural data: A mixture of time-warped trajectory models for directional reaches. Journal of Neural Engineering 9:036002.
  2. Kourou K, Exarchos TP, Exarchos KP, Karamouzis MV, Fotiadis DI. (2015) Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal 13:8–17.
  3. Maples BK, Gravel S, Kenny EE, Bustamante CD. (2013) RFMix: A discriminative modeling approach for rapid and robust local-ancestry inference. American Journal of Human Genetics 93(2):278–288.
  4. Martin AR, Kanai M, Kamatani Y, Okada Y, Neale BM, Daly MJ. (2019) Clinical use of current polygenic risk scores may exacerbate health disparities. Nature Genetics 51(4):584–591.
  5. McCulloch WS, Pitts W. (1943) A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics 5:115–137.
  6. Minsky ML, Papert SA. (1988) Perceptrons: An Introduction to Computational Geometry, Expanded Edition. MIT Press.
  7. Oliveai. (2021) Machine Learning in Healthcare Applications. October 2021. https://discover.oliveai.com/rs/541-CSN-882/images/Machine-Learning-in-Healthcare-Applications-WhitePaper_.pdf
  8. Pearl J. (2009) Causality. Cambridge University Press.
  9. Robinson DG. (2022) Voices in the Code. Russell Sage Foundation.
  10. Rosenblatt F. (1959) Two theorems of statistical separability in the perceptron. In: Proceedings of a Symposium on the Mechanization of Thought Processes. Her Majesty’s Stationery Office, London, pp. 421–456.
  11. Schwartz R, Vassilev A, Greene K, et al. (2022) Towards a Standard for Identifying and Managing Bias in Artificial Intelligence. NIST Special Publication 1270, 86 pages, March 2022.
  12. Silver D, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.
  13. Silver D, et al. (2018) A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science 362(6419):1140–1144.
  14. Simonite T. (2020) How an Algorithm Blocked Kidney Transplants to Black Patients. Wired, October 26, 2020.
  15. Sirugo G, Williams SM, Tishkoff SA. (2019) The missing diversity in human genetics studies. Cell 177:26–31.
  16. The “All of Us” Research Program Investigators. (2019) The “All of Us” Research Program. New England Journal of Medicine 381:668–676.
  17. UNESCO. (2021) Recommendation on the Ethics of Artificial Intelligence. 24 November 2021. https://www.unesco.org/en/artificial-intelligence/recommendation-ethics
  18. Wang R, Chen F, Chen Z, Li T, Harari G, Tignor S, Zhou X, Ben-Zeev D, Campbell AT. (2014) StudentLife: Assessing mental health, academic performance and behavioral trends of college students using smartphones. In: Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, September 13–17, Seattle.
  19. Zhou Y, Wang F, Tang J, Nussinov R, Cheng F. (2020) Artificial intelligence in COVID-19 drug repurposing. Lancet Digital Health 2:e667–e676.
  20. Yu BM, Kemere C, Santhanam G, Afshar A, Ryu SI, Meng TH, Sahani M, Shenoy KV. (2007) Mixture of trajectory models for neural decoding of goal-directed movements. Journal of Neurophysiology 97:3763–3780.
