European Journal of Human Genetics. 2022 Feb 7;30(9):993–995. doi: 10.1038/s41431-022-01061-6

Good quality practices for artificial intelligence in genetics

Timothé Ménard
PMCID: PMC9437024  PMID: 35132174

Background

“Artificial Intelligence” (AI) based systems developed for healthcare continue to make headlines, with new studies released on a weekly basis [1]. In genetics, a number of AI use cases have been published, from improving the performance of gene sequencing [2, 3] to developing early detection tools for genetic conditions [3, 4]. However, the vast majority of AI solutions published to date have failed to demonstrate their applicability in the real world [5, 6], and the genetics discipline is no exception [5]. At the time of writing, no AI-based algorithm for genetic diagnosis had been approved by the Food and Drug Administration (FDA) [7]. Lack of end-to-end quality often limits the potential of new AI-based genetic tools [4]. Moreover, recent studies have shown that AI systems are often biased and can increase inequality [8], which could have consequences for patients with genetic conditions.

In the area of Software as a Medical Device (SaMD) and in clinical drug development, a number of guidance documents and good practices for AI have been released by regulators. Notably, the FDA, Health Canada, and the UK’s Medicines and Healthcare products Regulatory Agency recently issued a joint guidance document providing 10 principles for Good Machine Learning Practice (GMLP) [9]. These guiding principles are intended to promote safe, effective, and high-quality medical devices that use Artificial Intelligence and Machine Learning (AI/ML). By leveraging emerging guidance on the use of AI/ML in other healthcare domains, there is an opportunity to discuss what good quality practice could look like for AI in genetics (of note, AI/ML solutions may apply to SaMD as a standalone solution, or to improve existing diagnostic tests). We provide an overview of the quality considerations applicable to the use of AI in genetics for clinical diagnostics, to help researchers build reproducible solutions that facilitate adoption, increase trust and transparency, and provide tangible benefits to patients and society. We considered the GMLP guidance [9], the latest edition of the foundational textbook on statistical learning by James et al. [10], and a clinical quality strategy developed to assess the integrity of genomic data inferred using statistical learning [11]. We did not include other AI guidance, to avoid redundancy. In addition to providing a high-level set of quality considerations, our objective was to start the debate and go beyond the hype [1, 12] while setting the scene for scalable and trustworthy AI solutions in genetics.

Prerequisites

First and foremost, we need to clarify what is behind the term AI. There are various definitions, but all converge on the concept of some form of intelligence expressed by machines and the ability to mimic the cognitive functions of humans [1]. While the term AI is often used in genetics scientific publications [4], it is more appropriate to use Machine Learning (a subset of AI in which algorithms learn from data to produce useful insights and predictions on previously unseen data), or to use the original term Statistical Learning, as many of the algorithms applied were invented years (sometimes decades) ago [10]. Furthermore, some caution should be applied when using the term “AI”, as it might contribute to inflated expectations among patients and their physicians [1, 12].

A key question that must be addressed before considering an AI/ML-based system is the intended use [9, 11], from which all of the requirements discussed below are derived. In genetics, many AI/ML models are developed for diagnostic purposes [3] and therefore require strict quality gatekeeping: if the models are intended to be used by physicians, the outputs must be of the highest quality, as they will have a direct impact on patients and their families [4, 11].

Data quality

The reliability of the output of AI/ML models depends on the quality of the input data [2, 5, 10]. In layman’s terms: garbage in, garbage out. To ensure the suitability and relevance of the data for AI/ML model training, a prospective data acquisition and selection strategy is critical [10, 11]. As the data are gathered and then curated, full transparency in methodology is required, e.g., describing which variables are collected, which are excluded, and why [4, 10]. In genetics, it is also crucial to ensure that datasets are not biased and that they adequately represent the patient population [4], as genetic diseases differ in prevalence and can be heterogeneous in their genotypes and/or phenotypes. This should be reflected in the datasets used to train and validate AI/ML models; otherwise, the model will be flawed. For very rare genetic conditions, building a dataset large and representative enough to train an AI/ML model is difficult, so any AI/ML model developed in this setting should be thoroughly evaluated for its robustness.
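As a minimal sketch of what such a representativeness check could look like, the snippet below compares subgroup proportions in a training cohort against the prevalence expected in the target population. The column names, subgroup labels, expected proportions, and the 50% under-representation threshold are all hypothetical illustrations, not recommended values.

```python
import pandas as pd

# Hypothetical cohort table: one row per patient, with a binary label
# indicating whether the pathogenic variant was detected.
cohort = pd.DataFrame({
    "ancestry": ["EUR", "EUR", "AFR", "EAS", "AFR", "EUR", "SAS", "EUR"],
    "variant_detected": [1, 0, 0, 1, 0, 0, 1, 0],
})

# Compare subgroup proportions in the training data against the
# prevalence expected in the target patient population (assumed values).
expected = {"EUR": 0.45, "AFR": 0.20, "EAS": 0.20, "SAS": 0.15}
observed = cohort["ancestry"].value_counts(normalize=True)

for group, target in expected.items():
    seen = observed.get(group, 0.0)
    # Arbitrary illustrative rule: flag subgroups at less than half
    # their expected share of the dataset.
    flag = "UNDER-REPRESENTED" if seen < 0.5 * target else "ok"
    print(f"{group}: observed {seen:.2f} vs expected {target:.2f} -> {flag}")

# Also inspect label balance per subgroup, since rare conditions may
# cluster in specific genotypes or phenotypes.
print(cohort.groupby("ancestry")["variant_detected"].mean())
```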

Validation

On top of the computerized system and clinical validation required for any diagnostic tool, any AI/ML model requires validation to ensure its accuracy and reproducibility. Many publications (including in genetics) do not share the full code and data in an open-source manner, although doing so would foster trust and transparency. Documentation and an audit trail should be made available to enable researchers to verify the performance of the AI/ML model [9].

There are different strategies to validate AI/ML models, all well described by James et al. [10]. The classical approach splits the data into training, validation, and test sets: the model is trained on one subset (training set, ~80%), then evaluated and tuned (validation set, ~10%), and the final evaluation is performed on held-out data (test set, ~10%) [10]. It is critical to ensure that no observation ends up in more than one set; otherwise, data leakage would compromise the integrity of the output [10]. In genetics, where datasets are often imbalanced, i.e., the biomarker or the disease to be detected constitutes a rare event, it is often more appropriate to use k-fold cross-validation [10] instead. To summarize, the validation strategy needs to be fit-for-purpose, should account for the nature of the disease or the biomarker, and should be documented and transparent.
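A minimal sketch of this approach using scikit-learn is shown below, with a synthetic dataset standing in for real genetic data. Stratified folds preserve the rare-event class ratio in every fold, and each sample appears in exactly one held-out fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for an imbalanced genetic dataset: ~5% of
# samples carry the variant of interest.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

# StratifiedKFold keeps the class ratio constant across folds, and each
# sample is held out exactly once -- no leakage between folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

# Average precision (area under the precision-recall curve) is more
# informative than accuracy on imbalanced data.
scores = cross_val_score(model, X, y, cv=cv, scoring="average_precision")
print(f"AP per fold: {np.round(scores, 3)}; mean = {scores.mean():.3f}")
```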

Performance evaluation

The performance of AI/ML models should be evaluated against a reference dataset based on the best available methods [9] (applicable when an AI/ML model replaces an existing solution or is added to an existing pipeline), and relevant metrics should be used. For example, for rare genetic diseases where datasets are highly imbalanced (e.g., only a few patients carrying a variant among many who do not), using accuracy as a metric would be inappropriate: a model could achieve very high accuracy by always predicting the majority class (in the example above, by always predicting that patients do not carry the variant). This is a known phenomenon called the accuracy paradox [10]. A well-suited approach would instead evaluate precision, recall, the Area Under the Receiver Operating Characteristic curve (AUC), and a confusion matrix (i.e., false positive rates, true positive rates, etc.). While the AUC is suitable for imbalanced datasets, one of its limitations is that it does not reveal how many healthy people are incorrectly flagged with a disease for every sick patient correctly flagged, which is essential for weighing the costs of false positives, false negatives, and total testing against the benefit of true positive detection. The unfavorable outcome of such an analysis is the reason why the true benefits of mammography and PSA screening are still being debated for early detection of breast and prostate cancer [13, 14]. This is also why recent AI/ML guidelines [9, 10] ask for the confusion matrix as one of the metrics for model performance. As experts of their models, researchers should be able to pick the optimal threshold value to construct a confusion matrix that reflects the best cost-benefit ratio.
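The sketch below illustrates the accuracy paradox and the threshold trade-off on a synthetic imbalanced dataset (a hypothetical stand-in for a rare-variant classification task); the thresholds swept are arbitrary examples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: ~5% positives.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print(f"AUC = {roc_auc_score(y_te, proba):.3f}")

# The accuracy paradox: an "always negative" rule scores ~95% accuracy
# while detecting no variant carriers at all.
print(f"Majority-class accuracy = {(y_te == 0).mean():.3f}")

# Sweep the decision threshold and report the confusion matrix, making
# the false-positive / false-negative trade-off explicit.
for t in (0.1, 0.3, 0.5):
    pred = (proba >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
    print(f"t={t}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"precision={precision_score(y_te, pred, zero_division=0):.2f} "
          f"recall={recall_score(y_te, pred):.2f}")
```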

Once these metrics are calculated, they need to be put in perspective, i.e., what do they mean in the real world for patients and their physicians? For example, detecting potential germline pathogenic variants from tumor-only sequencing data with a high rate of false positives would trigger unnecessary stress and anxiety for patients while they wait for germline confirmatory testing [11]. In this example, focus should be placed on the performance of the Human-AI team [9]; e.g., genetic counselors should be involved to properly convey the test implications to their patients.

Of note, performance metrics should be reported transparently, as they are not always fully disclosed in scientific publications [4, 11].

Explainability

A challenge with advanced AI/ML models is that they often lack transparency [5, 6]. They are sometimes referred to as black boxes, where no one understands how the models work or how outputs were generated [1]. Less “fancy” models based on well-established statistical methods sometimes perform as well as AI/ML [15], and have the advantage of being easily explainable. The trade-off between interpretability and performance should be assessed. Of note, some commercialized AI/ML SaMD may involve proprietary algorithms, and therefore details on how the model works might not be available. However, clear instructions (including limitations, performance metrics, and how to interpret them) should be provided; otherwise, the model is unlikely to be trusted by patients and their physicians [9].
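One simple way to assess this trade-off is to benchmark an interpretable baseline against a more opaque model on the same task, as in the sketch below (synthetic data, scikit-learn assumed); if the performance gap is negligible, the explainable model is the safer choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Compare an easily explainable model against a more opaque one on the
# same synthetic task.
X, y = make_classification(n_samples=2000, n_features=15, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")

# The logistic model's coefficients give a direct, auditable account of
# how each input feature pushes the prediction up or down.
lr = LogisticRegression(max_iter=1000).fit(X, y)
top = np.argsort(np.abs(lr.coef_[0]))[::-1][:3]
print("Most influential features:", top, lr.coef_[0][top].round(2))
```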

Quality assurance for AI in genetics

Once an AI/ML model is deployed, it is essential to continue to monitor its performance and applicability in the real world [9]. Some of the key risk areas for AI/ML models in genetics include data privacy (ensuring patient privacy cannot be compromised), bias (e.g., selection bias from a dataset that does not reflect the population), and model drift (i.e., when the target population evolves away from the dataset used for training). An oversight mechanism should be in place, ideally with humans retaining ultimate control [9]. A good approach would be to build quality assurance into an AI/ML model, e.g., with audits conducted on a risk-based cadence to detect potential issues so that models can be continuously improved [9, 11]. By embedding good quality practice in the development of AI/ML models for genetics, researchers, patients, and their physicians will benefit from reliable and trustworthy AI solutions.
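As one illustration of drift monitoring, the sketch below compares the distribution of a single input feature between training and post-deployment data using a two-sample Kolmogorov–Smirnov test. The feature, sample sizes, and alert threshold are all hypothetical; a real system would monitor many features on a risk-based cadence, with human review of every alert.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Reference distribution: a feature (e.g., a sequencing quality score)
# as seen in training; "live" post-deployment data has drifted.
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.4, scale=1.1, size=1000)

# A two-sample Kolmogorov-Smirnov test flags distribution shift between
# training and production data.
stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # arbitrary illustrative alert threshold
    print(f"Possible drift (KS={stat:.3f}, p={p_value:.2e}); "
          "escalate to a human reviewer.")
else:
    print("No significant drift detected.")
```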

Acknowledgements

The content review was performed by Joanne Donald and Chris Ganter.

Competing interests

TM was employed by F. Hoffmann-La Roche at the time this short communication was completed; however, it was written independently of his employment.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Emanuel EJ, Wachter RM. Artificial intelligence in healthcare: will the value match the hype? JAMA. 2019;321:2281–2. doi: 10.1001/jama.2019.4914.
2. Libbrecht M, Noble W. Machine learning applications in genetics and genomics. Nat Rev Genet. 2015;16:321–32. doi: 10.1038/nrg3920.
3. Dias R, Torkamani A. Artificial intelligence in clinical and genomic diagnostics. Genome Med. 2019;11:70. doi: 10.1186/s13073-019-0689-8.
4. Ménard T. Correspondence on “Artificial intelligence–assisted phenotype discovery of fragile X syndrome in a population-based sample” by Movaghar et al. Genet Med. 2021. doi: 10.1016/j.gim.2021.10.022.
5. Andaur Navarro CL, Damen JAA, Takada T, Nijman SWJ, Dhiman P, Ma J, et al. Risk of bias in studies on prediction models developed using supervised machine learning techniques: systematic review. BMJ. 2021;375:n2281. doi: 10.1136/bmj.n2281.
6. Freeman K, Geppert J, Stinton C, Todkill D, Johnson S, Clarke A, et al. Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy. BMJ. 2021;374:n1872. doi: 10.1136/bmj.n1872.
7. FDA-approved A.I.-based algorithms. https://medicalfuturist.com/fda-approved-ai-based-algorithms/. Accessed 22 Jan 2022.
8. Leslie D, Mazumder A, Peppin A, Wolters MK. Does “AI” stand for augmenting inequality in the era of covid-19 healthcare? BMJ. 2021;372:n304. doi: 10.1136/bmj.n304.
9. US Food and Drug Administration (FDA), UK Medicines and Healthcare products Regulatory Agency (MHRA) and Health Canada. Good machine learning practice for medical device development: guiding principles. 2021. https://www.fda.gov/medical-devices/software-medical-device-samd/good-machine-learning-practice-medical-device-development-guiding-principles. Accessed 09 Nov 2021.
10. James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning. 2nd ed. New York: Springer; 2021. doi: 10.1007/978-1-0716-1418-1.
11. Ménard T, Rolo D, Koneswarakantha B. Clinical quality in cancer research: strategy to assess data integrity of germline variants inferred from tumor-only testing sequencing data. Pharmaceut Med. 2021;35:225–33. doi: 10.1007/s40290-021-00399-4.
12. Strickland E. IBM Watson, heal thyself: how IBM overpromised and underdelivered on AI healthcare. IEEE Spectr. 2019;4:24–31. doi: 10.1109/MSPEC.2019.8678513.
13. Gøtzsche PC. Mammography screening: truth, lies, and controversy. Lancet. 2012;380:218. doi: 10.1016/s0140-6736(12)61216-1.
14. Tabayoyong W, Abouassaly R. Prostate cancer screening and the associated controversy. Surg Clin N Am. 2015;95:1023–39. doi: 10.1016/j.suc.2015.05.001.
15. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12–22. doi: 10.1016/j.jclinepi.2019.02.004.
