Abstract
With the fast development of artificial intelligence (AI) and its applications in medicine, it is often said that the era of intelligent medicine is arriving, if it has not already arrived. While there is no doubt that AI-centred intelligent medicine will transform current healthcare, it is necessary to test and re-test medical AI (MAI) products before they are implemented in the real world. To ensure safety, accuracy and efficiency, it is imperative that MAIs undergo stress tests in a systematic and comprehensive manner, where stress tests subject MAIs to workloads and environments beyond the tests carried out by their developers. Such stress tests may identify potential bottlenecks or failures of MAIs, which can be fed back to developers to optimise the products. To avoid bias and ensure fairness, stress tests should be prepared and administered by an independent body.
Keywords: Artificial intelligence, Machine Learning, Deep Learning, Software
Introduction
Medical artificial intelligence (MAI) has been developing rapidly, and MAI products now exist for almost every specialty of medicine.1 Yet, there has been limited trust in MAIs in practice.2 One important reason behind this limited trust is the lack of comprehensive evaluations of MAIs,3 especially in real-world settings.4 5 This matters because healthcare is closely tied to individuals’ well-being, and there is a low tolerance for error.6 Absent or insufficient evaluations of MAIs may be due to prohibitive financial costs, high human-resource costs, inaccessibility of real-world data or a combination of these factors. Though MAIs generally undergo various tests at developers’ sites and at a number of institutions before their release, such tests may not be comprehensive enough to fully assess a product’s robustness and reproducibility. High robustness and reproducibility offer confidence not only to clinical users but also to patients,7 improving the acceptability of MAIs in the real world.8 9 While there is consensus that MAIs should strive for high robustness and reproducibility,10 11 guidelines on how to comprehensively assess MAIs are lacking. This challenge becomes increasingly pressing as today’s MAIs move towards multimodality, multicentre deployment and high complexity, a trend that only adds to the difficulty of assessing them.12
We believe that comprehensive assessment of MAIs requires good design, longitudinal deployment and coordination among MAI developers, users, regulatory bodies and other stakeholders, in addition to the design of the MAIs themselves. In assessing MAIs, we reason that it is imperative for MAIs to go through meticulous stress tests before their adoption into healthcare practice.
What are stress tests?
Stress tests are a series of systematic and rigorous examinations designed to assess the robustness, accuracy and performance of an MAI under real-world conditions. The purpose of stress tests is to identify potential weaknesses and errors in the MAI before its adoption in practice. These tests evaluate how the MAI responds to various real-world stressors, including but not limited to increased workload, input errors, data variability and rare clinical conditions. Key aspects evaluated during stress testing include the MAI’s sensitivity, specificity, self-explainability and test-retest reliability.
Stress tests should evaluate the robustness, accuracy, consistency and operational resilience of an MAI. For example, these tests should assess the MAI when it suddenly receives more requests than average. If an MAI typically handles 10 requests per minute and suddenly receives 20 or 30 requests in 1 min, the MAI should be tested for the accuracy of its output and for any notable delays in its decision-making. Stress tests should evaluate whether the MAI can self-detect an incorrect type or format of input, for example, a CT image fed to an MAI that expects an ultrasound picture. These tests should also evaluate the MAI on inputs of unusually large or small size, for example, an extra-long ECG record or an unusually small patient gene profile. In addition, stress tests assess how an MAI functions under stress, that is, its resilience to missing values, out-of-distribution data and unexpected user interaction.
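A minimal sketch of the burst-load scenario described above could look as follows. The `mai_predict` function is a hypothetical stand-in for a real MAI inference call; the harness fires twice the typical load concurrently and checks that every request is answered, the outputs are well formed and the total time stays within a latency budget.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def mai_predict(case_id: int) -> str:
    """Stand-in for a real MAI inference call (hypothetical)."""
    time.sleep(0.01)  # simulate model latency
    return f"report-{case_id}"

def burst_load_test(n_requests: int, max_latency_s: float) -> dict:
    """Fire n_requests concurrently and summarise completeness,
    output validity and total latency."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_requests) as pool:
        results = list(pool.map(mai_predict, range(n_requests)))
    elapsed = time.perf_counter() - start
    return {
        "all_answered": len(results) == n_requests,
        "outputs_valid": all(r.startswith("report-") for r in results),
        "within_latency": elapsed <= max_latency_s,
        "elapsed_s": round(elapsed, 3),
    }

# Double the typical load (e.g. from 10 to 20 concurrent requests).
summary = burst_load_test(n_requests=20, max_latency_s=5.0)
```

A real harness would replace the stub with a network call to the MAI under test and would also verify the clinical content of each output, not merely its format.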
Design of stress tests
We propose to classify stress tests at two scales. The first scale is global stress tests, which emphasise uniformity such that all the MAIs go through the same tests. These tests use the same test cases for assessing different MAIs with uniform criteria. Performances of MAIs are therefore comparable at a level field. A state or national body should design a standardised flow of stress test for MAIs, depending on the nature of a particular MAI, for example, whether it is a product specialised on a single type of clinical input or a general-purpose product that works on various inputs.
The second scale is local stress tests, which emphasise institution-specific evaluation such that an MAI should function as expected at a particular hospital, taking data generated locally by the hospital and producing reasonable outputs. These tests are necessary because hospitals may use different protocols in their practice. For example, two hospitals may acquire images at different resolutions; a good MAI designed to handle varying resolutions should function as expected at both.
Good stress tests should include test cases for excessive load beyond the normal functional capacity of an MAI, extreme and rare clinical cases, out-of-distribution test data and real cases that contain missing or incorrect information. Besides the comprehensiveness of the tests themselves, an important and perhaps more critical requirement is that the tests should not allow an MAI to memorise or overfit to the tests themselves. In other words, unlike some other fields in which a product can undergo a test multiple times without compromising the objectivity of the test, stress tests for MAIs and other AI products may become compromised if such products are able to learn the specific questions or cases in the tests. From this perspective, a good stress test should encompass a sufficiently large number of questions or cases to reduce the risk that an MAI product can figure them out.
Creation of a data reserve
As AI is a data-intensive technology, it is overall beneficial to provide as much data as possible for AI training. However, as there are hundreds of AI developers and an even larger number of parties that adapt existing AI models to specific applications, it is nearly impossible to be certain which data have not been seen by AI. Though a single development team may be able to properly separate its data into training, validation and test data sets, there is no guarantee that its test set has not been, or will not be, used as a training set by another group. When evaluating or comparing different MAIs for the same functionality, there needs to be an independent data set that can serve as test data and that has not been used for training any of the MAI algorithms being evaluated. It is therefore important for data creators such as hospitals, healthcare institutions, insurance companies and other stakeholders to place an embargo on some data sets so that they are not used in AI development and training by any entity. These data should be collected and stored by a single body to constitute a data reserve, which can then be used for uniform and objective testing of all MAIs. This reserve functions as a safeguard against inadvertent data exposure and ensures the integrity of AI testing across developers, versions of MAI products and various implementations.
A data reserve should meet the following minimum requirements:
Longitudinal in data creation, such that the creation of data should span over a minimum number of years. This ensures that the data were generated by different models of medical instruments, various ways of collection and via different protocols.
Diverse in demographics, such that the patients from which data were collected should be of demographic diversity. This ensures that the data are representative of the affected population.
Comprehensive in data types, such that the reserve should comprise various types of data like imaging scans, laboratory results, electronic health records, genetic profiles and patient-reported outcomes. This breadth of data ensures that the data reserve can test MAIs over a comprehensive array of information sources.
Inclusive in clinical conditions, such that the reserve should comprise as many diseases and clinical conditions as possible, including both common and rare diseases. This inclusiveness ensures that the data reserve can approximate realistic clinical visits to a high degree.
These requirements dictate the quality of the data reserve from the perspective of being a good test; however, we should be careful not to alter the data in the reserve, intentionally or unintentionally, in ways that void the very purpose of testing MAIs in a realistic manner. Alterations to avoid include, but are not limited to:
Filling in missing data, which should be handled by the MAIs during testing, such that MAIs follow predetermined procedures to either fill in missing data or exclude such data from the analysis, and such operations are recorded and made available to human users.
Cleaning up data, which should be managed by MAIs or their built-in preprocessing steps, such that MAIs clean up duplicated entries, restore misplaced data, fix other data entry mistakes and record the steps taken so users can trace back the operations.
Rearranging incorrectly organised time-series data, which should be checked by MAIs, such that wrongly ordered time-series data are restored according to the order of data creation.
Normalising or otherwise standardising data, which should be handled by MAIs, such that raw data are normalised or standardised by the same procedure, and
Correcting wrong entries in data, which should be detected by MAIs, such that ranges and units of data are checked by MAIs before analysis, and unreasonable data are flagged to users for them to determine whether the data should be accepted for clinical decision-making.
We note that the alterations or processing listed above are often required before data can be used by AI. However, such processing does not represent the real-world scenario at a hospital, and in stress tests it should not be performed beforehand on the reserved data.
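The requirement that MAIs record their own preprocessing steps can be sketched with a small audit-log structure. All names here are illustrative, not a prescribed interface: the point is that every imputation or flag is logged so human users can trace back the operations.

```python
from dataclasses import dataclass, field

@dataclass
class AuditedRecord:
    """Wraps a patient record and logs every preprocessing step an
    MAI applies, so users can trace back the operations.
    (Hypothetical structure; field names are illustrative.)"""
    values: dict
    log: list = field(default_factory=list)

    def impute(self, key, default):
        """Fill a missing value by a predetermined rule and log it."""
        if self.values.get(key) is None:
            self.values[key] = default
            self.log.append(f"imputed {key}={default}")

    def flag_out_of_range(self, key, lo, hi):
        """Flag an unreasonable value for the user instead of
        silently correcting it."""
        v = self.values.get(key)
        if v is not None and not (lo <= v <= hi):
            self.log.append(f"flagged {key}={v} outside [{lo}, {hi}]")

record = AuditedRecord({"age": None, "heart_rate": 400})
record.impute("age", 50)                         # missing value handled by the MAI
record.flag_out_of_range("heart_rate", 30, 250)  # unreasonable entry flagged, not fixed
```

Note that the out-of-range heart rate is flagged rather than corrected, consistent with leaving the final acceptance decision to the clinical user.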
An important consideration for the data reserve is multi-institutional patients, who have visited more than one healthcare facility or may visit a different facility in the future. Such cases require careful coordination among hospitals, insurance companies and other healthcare management authorities, and prompt updating of the data reserve, either to include the information from the other institutions in the reserve or to mark the medical records at the other institutions for embargo so they will not be used for training or testing any MAIs.
Transparency, confidentiality, consistency and randomisation in stress tests
We emphasise that the purpose of stress tests is to help developers and users implement a robust MAI system; therefore, transparency is necessary to maintain the objectivity and comprehensiveness of the tests. Transparent reporting of test results enables stakeholders to identify potential weaknesses, address shortcomings and implement improvements to enhance the robustness and safety of MAIs. Reports of test results may give the aggregated performance of an MAI, such as sensitivity, specificity and other metrics. Reports may also give aggregated feedback on an MAI’s use of data modalities and, whenever appropriate, on its performance according to the distribution of test data, such as its performance on certain age segments or demographic groups.
However, as MAIs are highly adaptive, it is also imperative that stress tests not be circumvented by MAIs memorising the tests, thus undermining the validity of the results. To mitigate this risk, we need to balance transparency and confidentiality in reporting stress test results. Here, confidentiality refers to keeping the statistical characteristics of the data reserve or any test datasets from being accessed or guessed by MAIs or developers. For example, if the data reserve has more older patients than younger ones and, furthermore, if an MAI senses this imbalance in age distribution, the MAI may tune itself so that its performance on older patients improves at the expense of its performance on younger patients. This may result in an overall improved score for the MAI but does not necessarily mean that the MAI has improved intrinsically. Therefore, statistical characteristics of the data reserve should be protected.
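One way to balance transparency and confidentiality is for the test body to return only aggregate metrics, never per-case results or label distributions. A minimal sketch of such a reporting function, under the assumption of a binary diagnostic task, could be:

```python
def aggregate_report(y_true, y_pred):
    """Return only aggregate metrics, never per-case results or the
    label distribution of the reserve, so the test data's statistical
    characteristics cannot be reverse-engineered from the report.
    (Sketch; a real interface would add further safeguards.)"""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else None,
        "specificity": tn / (tn + fp) if tn + fp else None,
    }

# Toy labels and predictions for illustration only.
report = aggregate_report([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

Deliberately omitting per-case verdicts from the report is what prevents a test taker from reconstructing which cases it got wrong and adapting to them.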
Consistency and randomisation are two factors to consider in designing and administering stress tests. On the one hand, it is important to provide MAIs with the same tests, so that not only are all MAIs compared on a level field, but upgrades of an MAI are evaluated against the same benchmark as its previous version was. For this purpose, stress tests should offer the same or largely the same case composition over time. On the other hand, it is equally important that stress tests offer different challenges to MAIs each time to avoid the tests becoming stagnant. As a result, stress tests will likely have two parts, with one part being the same for all MAIs to take and re-take, and the other part being randomised each time an MAI re-takes the tests.
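The two-part composition described above can be sketched as follows. The case identifiers and pool are hypothetical; the key properties are that the fixed part is identical across runs while the rotating part is drawn fresh from the reserve, reproducibly given a seed so the administrator can audit any run.

```python
import random

def compose_test_set(fixed_cases, rotating_pool, n_random, seed=None):
    """Combine an unchanging benchmark part with a freshly sampled
    randomised part, forming a two-part stress test."""
    rng = random.Random(seed)  # seeded so the administrator can audit a run
    random_part = rng.sample(rotating_pool, n_random)
    return list(fixed_cases) + random_part

fixed = ["case-A", "case-B", "case-C"]       # same for every MAI, every run
pool = [f"case-{i}" for i in range(100)]     # reserve cases rotated per run
run1 = compose_test_set(fixed, pool, n_random=5, seed=1)
run2 = compose_test_set(fixed, pool, n_random=5, seed=2)
```

Changing the seed between administrations varies the rotating part, while the fixed part keeps results comparable across MAIs and across versions of one MAI.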
Evolution of the stress tests
While MAIs are constantly improving, stress tests and the data reserve should also evolve. The stress test administrators should maintain a record of the test results, especially tracking the cases that are incorrectly analysed by MAIs. In addition, for administrators to double-check the ground truth of such cases, it is necessary to keep and mark those cases, possibly classifying them into different difficulty levels depending on how many MAIs analyse them incorrectly. In future tests, challenging cases at each difficulty level should be included, although randomly, in the test sets.
The size of stress test data should be sufficiently large for two reasons. The first is to have enough statistical power for assessing the improvement in an MAI, for example, establishing statistical significance on whether a newer MAI is indeed better than its old version. The second is to avoid the contents of the tests being deduced by the test takers, either by MAIs or by their developers. Data in the reserve should be periodically examined and updated, with some data retired and new data added. Retired data, however, should not be made available to developers and researchers, as access to even a portion of the data reserve may help an AI model score better in future tests, and that improvement in performance is not trustworthy.
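As one illustration of the statistical-power point, two versions of an MAI evaluated on the same test cases form paired outcomes, for which an exact McNemar test on the discordant pairs is a standard choice. This is a sketch, not a prescribed analysis; the counts below are hypothetical.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar test on the discordant pairs:
    b = cases only the old version got right,
    c = cases only the new version got right.
    A small p-value suggests the two versions genuinely differ."""
    n = b + c
    if n == 0:
        return 1.0
    # Two-sided exact binomial tail under H0: discordance is 50/50.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts from a large test set: the new version fixes 40
# cases the old one missed, while breaking only 15.
p = mcnemar_exact(b=15, c=40)
```

With too few test cases the discordant counts shrink and such a test loses power, which is precisely why the reserve must be large enough to distinguish a genuine upgrade from noise.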
As MAIs are either general-purpose or tailor-made for a clinical specialty, their stress tests will differ. For a general-purpose MAI, the tests should cover as many clinical conditions as possible, yet it is important to avoid bias in the distribution of conditions. For an MAI targeting a clinical specialty, its tests should include not only the disease of interest but also healthy controls, patients with other conditions relevant for differential diagnosis and patients with the targeted disease who also have comorbidities.
Being highly non-linear, an MAI may experience substantial changes in its performance with adjustments to its design, fine-tuning of hyperparameters or addition of training data. Under stress tests, it is important that the performance of an MAI after upgrades is comprehensively assessed. This includes, but is not limited to, achieving equal or better sensitivity, specificity and test-retest reliability.
Stress tests also evolve with the stakeholder landscape, which changes over time. Many factors influence the stakeholder landscape, including but not limited to the legal framework for MAI adoption, regulation on safety of MAI products, data privacy, liability and reimbursement policies. Ethical guidance is another important factor influencing the stakeholder landscape as it addresses bias, fairness, transparency and accountability in MAI development and use. When such factors change, it is likely that our requirements and design for sound stress tests need to change in response.
Conclusions
Because of the high impact that an MAI may exert on an individual’s healthcare, an MAI needs to undergo comprehensive stress tests before being deployed. These tests, on the one hand, need to fully evaluate an MAI’s performance, such as sensitivity, specificity and explainability, while, on the other hand, they need to maintain a sufficient degree of confidentiality to prevent MAIs from adapting to the tests. It is important to create and maintain a state or national test case reserve that keeps test cases confidential, only accepts MAI connections via a specified interface and returns only aggregated test results, such as sensitivity, specificity and test-retest reliability. While data are important for training AI and good training data are constantly sought, the data that cause an AI product to fail are, in some sense, even more valuable. For this reason, the data reserve should maintain the correct and incorrect decisions made by various MAIs and periodically review these results, especially the cases that cause incorrect decisions by more than one MAI. While the main purpose of stress tests is to evaluate MAIs before their practical use, it is equally important that challenging cases found via the tests be appropriately communicated to clinical end users, so that human intervention is recommended when a case with similar patterns presents itself to clinicians in the real world. Whether challenging cases should be made available to MAI developers deserves further discussion. It is possible that challenging cases should be collected and categorised into a class serving as a special test set for MAIs.
Footnotes
Funding: The work of HC was supported by the Zhongnanshan Medical Foundation of Guangdong Province, China (ZNSXS-202300001).
Patient consent for publication: Not applicable.
Ethics approval: Not applicable.
Provenance and peer review: Not commissioned; externally peer reviewed.
References
- 1. Rajpurkar P, Chen E, Banerjee O, et al. AI in health and medicine. Nat Med. 2022;28:31–8. doi: 10.1038/s41591-021-01614-0.
- 2. Hatherley JJ. Limits of trust in medical AI. J Med Ethics. 2020;46:478–81. doi: 10.1136/medethics-2019-105935.
- 3. Stupple A, Singerman D, Celi LA. The reproducibility crisis in the age of digital medicine. NPJ Digit Med. 2019;2:2. doi: 10.1038/s41746-019-0079-z.
- 4. Calisto FM, Santiago C, Nunes N, et al. BreastScreening-AI: Evaluating medical intelligent agents for human-AI interactions. Artif Intell Med. 2022;127:102285. doi: 10.1016/j.artmed.2022.102285.
- 5. Brown A, Tomasev N, Freyberg J, et al. Detecting shortcut learning for fair medical AI using shortcut testing. Nat Commun. 2023;14:4314. doi: 10.1038/s41467-023-39902-7.
- 6. Hutson M. Artificial intelligence faces reproducibility crisis. Science. 2018;359:725–6. doi: 10.1126/science.359.6377.725.
- 7. Haibe-Kains B, Adam GA, Hosny A, et al. Transparency and reproducibility in artificial intelligence. Nature. 2020;586:E14–6. doi: 10.1038/s41586-020-2766-y.
- 8. Laine RF, Arganda-Carreras I, Henriques R, et al. Avoiding a replication crisis in deep-learning-based bioimage analysis. Nat Methods. 2021;18:1136–44. doi: 10.1038/s41592-021-01284-3.
- 9. Walsh I, Fishman D, Garcia-Gasulla D, et al. DOME: recommendations for supervised machine learning validation in biology. Nat Methods. 2021;18:1122–7. doi: 10.1038/s41592-021-01205-4.
- 10. Burnell R, Schellaert W, Burden J, et al. Rethink reporting of evaluation results in AI. Science. 2023;380:136–8. doi: 10.1126/science.adf6369.
- 11. Jones DT. Setting the standards for machine learning in biology. Nat Rev Mol Cell Biol. 2019;20:659–60. doi: 10.1038/s41580-019-0176-5.
- 12. Wagner SJ, Matek C, Shetab Boushehri S, et al. Make deep learning algorithms in computational pathology more reproducible and reusable. Nat Med. 2022;28:1744–6. doi: 10.1038/s41591-022-01905-0.
