Abstract
Background
Analyzing fundus images with deep learning techniques is promising for screening systemic diseases. However, the quality of the rapidly increasing number of studies has been variable, and systematic evaluation is lacking.
Objective
To systematically review all articles that aimed to predict systemic parameters and conditions using fundus images and deep learning, to assess their performance, and to provide suggestions that would enable translation into clinical practice.
Methods
Two major electronic databases (MEDLINE and EMBASE) were searched up to August 22, 2023, with the keywords ‘deep learning’ and ‘fundus’. Studies using deep learning and fundus images to predict systemic parameters were included and assessed in four aspects: study characteristics, transparent reporting, risk of bias, and clinical availability. Transparent reporting was assessed with the TRIPOD statement, and risk of bias with PROBAST.
Results
4969 articles were identified through the systematic search, of which thirty-one were included in the review. A variety of vascular and non-vascular diseases could be predicted from fundus images; diabetes and related diseases (19%), sex (22%), and age (19%) were the most common outcomes. Most of the studies focused on developed countries. According to TRIPOD, reporting on sample size determination and missing data handling was insufficient, and full access to datasets and code was also under-reported. According to PROBAST, 1/31 (3.2%) studies were classified as having a low risk of bias overall, whereas 30/31 (96.8%) were classified as having a high risk of bias. 5/31 (16.1%) studies used prospective external validation cohorts, and only two (6.5%) described model calibration. The number of publications per year increased markedly from 2018 to 2023. However, only two models (6.5%) reached the device stage, and no model has been applied in clinical practice.
Conclusion
Deep learning on fundus images has shown great potential for predicting systemic conditions in clinical situations. Further work is needed to improve the methodology and clinical application.
Subject terms: Eye manifestations, Prognosis
Introduction
There is a strong association between the ocular and systemic vasculature [1], and the retina is also part of the central nervous system. The network of blood vessels and nerves in the fundus is the only part of the human body where the microcirculation and nerve terminals can be directly observed [2]. There is increasing evidence that many common diseases, and even systemic parameters (such as age, sex, and smoking status), can be predicted from characteristic changes in the fundus of the eye [3, 4]. Fundus imaging is now recognized as a non-invasive and promising approach for early diagnosis and prediction, and could be used in general health care [5, 6]. However, because most predictions relied on ophthalmologists’ judgment and experience, the results were susceptible to some degree of subjectivity [7, 8]. Only a few diseases with well-defined characteristics and straightforward grading standards have reached a prediction consensus, which severely limits the broader use of fundus images beyond eye diseases.
With the advent of mathematical algorithms and the availability of big datasets, deep learning-based algorithms have shown significant advances in analyzing fundus images [9, 10]. By integrating big data and self-supervised training, deep learning algorithms can identify and analyze fundus biomedical features below the threshold of human observers [11]. In medicine, artificial intelligence has shown promising applications in many fields, such as radiology [12, 13], dermatology [14, 15], pathology [16, 17], and ophthalmology [18]. It can support clinical decision-making and can even exhibit remarkable ability beyond human analysis.
Many deep learning algorithms have been developed in recent years, with wide variation in applications, reporting, methodology, and clinical availability, which may lead to poor clinical translation [10]. To be implemented in clinical practice, a deep learning algorithm needs to be judged at low risk of bias, which requires all relevant information to be transparently reported. It also needs to be an available and applicable model, rather than an idealized mathematical algorithm. Therefore, evaluating these models against the best open standards to promote clinical translation is urgently required.
The present study aims to systematically review all studies predicting systemic parameters and diseases using fundus images and deep learning. We assessed each study’s characteristics, adherence to both methodological and reporting standards, and the models’ clinical availability. We critically appraised the models’ methodology and propose possible solutions that would enable the translation of deep learning into clinical practice.
Methods
The review protocol was registered in the online PROSPERO database (registration number CRD42022344932) before search execution. Our systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, with a checklist displayed in Supplementary File 1.
Study identification and inclusion criteria
Two major electronic databases (MEDLINE (OVID) and Embase) were searched from database creation until August 22, 2023, with no additional filters used. Two researchers (YTL and RHZ) screened and cross-checked all the literature.
We included studies that aimed to use fundus images to predict systemic diseases. We applied no restrictions on study design, data source, or type of patient-related health outcome. We selected publications for review if they satisfied the following inclusion criteria: English language, patient-related outcomes, adult human participants, assessment of a deep learning algorithm for analyzing fundus photographs, and systemic diseases among the outcomes. A deep learning algorithm was defined as a computational model composed of multiple processing layers that learn representations of data with multiple levels of abstraction. Eligible publications needed to describe the development or validation of at least one deep learning model aimed at individualized prediction or diagnosis.
Exclusion criteria were: (1) commentaries, letters to the editor, editorials, and meeting abstracts; and (2) publications that used machine learning only to enhance the reading of images or signals or to extract vascular parameters, because our study aimed to alleviate data inefficiency by deriving information directly from fundus photographs rather than assessing the predictive ability of specific caliber parameters. Additional articles were retrieved by manually scrutinizing the reference lists of relevant publications. Figure 1 shows the PRISMA flow chart of study records. The inclusion and exclusion criteria of PICO (Patient, Intervention, Comparison, Outcome) are detailed in Supplementary File 2.
Fig. 1. PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow chart of study records.
n number; AI artificial intelligence.
Study selection and data extraction
After the removal of clearly irrelevant records, two researchers (YTL and RHZ) independently screened abstracts for potentially eligible studies. Full-text reports were then assessed for eligibility, with disagreements resolved by consensus. Two researchers extracted data from study reports independently and in duplicate for each eligible analysis, with disagreements resolved by a third reviewer (WBW). Data were systematically extracted from each study into a predesigned spreadsheet and analyzed post hoc using pivot tables. We extracted data from each article in four aspects.
Study characteristics
Study characteristics were evaluated in both technical and dataset aspects. Technical aspects assessed the algorithm, including the network architectures, model performance metrics, and visualization techniques. Dataset aspects included disease information, dataset characteristics, external validation, and the photographs’ mydriatic status, granularity, pre-processing, and gradability.
Reporting transparency
This aspect evaluates the transparency of reporting. Based on the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) statement [19, 20], published in The BMJ in 2015, we used a modified TRIPOD list of 29 items (see Supplementary File 3), the same items used in a previous methodology review [21]. That review focused on imaging applications of artificial intelligence, in line with the aim of our article, and discussed in detail the changes to and inclusion of the TRIPOD items (Supplementary File 3). We examined each item, and all authors approved the final version (Supplementary File 5). This aspect assessed whether studies broadly conformed to the reporting recommendations of TRIPOD, not the detailed granularity required for a full assessment of adherence. Overall group adherence was calculated as a percentage, and adherence groups were created based on reporting of 0–33.3%, 33.4–66.6%, or 66.7–100%. This classification method has been used in other quality assessment articles [22].
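To make the adherence grouping concrete, the following minimal Python sketch (with hypothetical item names and toy data, not the actual extraction spreadsheet) computes per-item adherence across studies and assigns each item to one of the three tertile groups:

```python
# Hypothetical illustration of the tertile grouping described above.
def tertile(pct: float) -> str:
    """Map a percentage adherence to the tertile groups used in this review."""
    if pct <= 33.3:
        return "bottom tertile (0-33.3%)"
    if pct <= 66.6:
        return "middle tertile (33.4-66.6%)"
    return "upper tertile (66.7-100%)"

# Each row is one study; True means the (hypothetical) TRIPOD item was reported.
studies = [
    {"sample_size": False, "missing_data": False, "data_source": True},
    {"sample_size": True,  "missing_data": False, "data_source": True},
    {"sample_size": True,  "missing_data": True,  "data_source": True},
]

for item in studies[0]:
    pct = 100 * sum(s[item] for s in studies) / len(studies)
    print(f"{item}: {pct:.1f}% adherence -> {tertile(pct)}")
```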
Risk of bias
This aspect assesses the risk of bias in each included article. Based on PROBAST (Prediction model Risk Of Bias ASsessment Tool) [23, 24], we used a modified version covering all items except applicability and predictor variables, the same items used in a previously published methodology review [21]. The items were discussed by the co-authors and approved as the final version. PROBAST contains 20 signaling questions across four domains (participants, predictors, outcomes, and analysis) to allow assessment of the risk of bias in predictive modeling studies. We did not assess applicability (which we cover through the TRIPOD items and the models’ clinical availability aspect) or predictor variables (these are less relevant in deep learning studies on fundus imaging; see Supplementary File 3). For studies whose code or data were not publicly available, the authors were e-mailed for access in order to assess the generalizability and availability of data and code.
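As reflected in Fig. 2, ratings were aggregated with PROBAST’s standard rule: a study is rated low risk overall only when every assessed domain is low risk, while a single high-risk domain yields a high overall rating. A minimal sketch of this rule (domain names follow our modified assessment, which excludes the predictors domain):

```python
# Sketch of the PROBAST overall-rating rule used for the assessed domains.
def overall_risk(domains: dict) -> str:
    ratings = set(domains.values())    # each value: "low", "high", or "unclear"
    if "high" in ratings:
        return "high"                  # any high-risk domain -> high overall
    if "unclear" in ratings:
        return "unclear"
    return "low"                       # low overall only if all domains are low

print(overall_risk({"participants": "low", "outcomes": "low", "analysis": "high"}))
# -> "high"
```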
Models’ clinical availability
The models’ clinical availability aspect evaluates the maturity of each model in order to assess its clinical feasibility. We used the evaluation methods of a well-recognized article [22], which focused on breast cancer prognostics and designed a series of items to assess model maturity. After screening, we considered this framework applicable to our article; a 2022 review evaluating the methodology of artificial intelligence for mechanical ventilation [25] used the same evaluation items. We assessed the level of clinical availability of each model. Studies were categorized into the following groups: (i) ‘mathematics into algorithm’ (authors propose a new algorithmic construct with some feasibility testing, with no prospective validation), (ii) ‘algorithm into model’ (authors develop a new model based on retrospective data with some prospective feasibility testing, without comparison against any clinical standard), (iii) ‘model into device’ (validation of a previously developed model, ideally against an existing gold standard, without any clinical application), and (iv) ‘device into practice’ (prospective clinical evaluation of deployed devices).
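For readers who prefer an explicit ordering, the four levels can be written as a trivial Python sketch (the level names follow the categories above; the numeric ordering is our own illustrative convention, not part of the original framework):

```python
# Minimal sketch: the four clinical availability levels as an ordered scale.
from enum import IntEnum

class ClinicalAvailability(IntEnum):
    MATHEMATICS_INTO_ALGORITHM = 1  # new algorithm, feasibility testing, no prospective validation
    ALGORITHM_INTO_MODEL = 2        # model built on retrospective data, some prospective feasibility
    MODEL_INTO_DEVICE = 3           # validation of a previously developed model, no clinical application
    DEVICE_INTO_PRACTICE = 4        # prospective clinical evaluation of a deployed device

# Comparing maturity is then a simple ordering test.
assert ClinicalAvailability.MODEL_INTO_DEVICE > ClinicalAvailability.ALGORITHM_INTO_MODEL
```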
Results
Study selection
We identified 4969 articles through the systematic search. After removing duplicates, 4365 abstracts were screened. After exclusions, 31 articles were included in our review (Fig. 1).
Study characteristics
Diabetes and related diseases (19%), sex (22%), and age (19%) attracted the highest research interest. Twelve studies (39%) predicted more than one kind of disease. The three countries with the most published studies were China (26%), the USA (16%), and Singapore (16%), together representing more than half of the included studies.
In technical aspects, 25 studies (81%) reported using existing network architectures, while the other six (19%) developed customized networks. Twenty-eight studies (90%) used the AUC as the model performance metric, but only two articles (6%) considered model calibration. Twenty studies (65%) provided heatmaps or saliency maps to visualize the learning process, most of which drew information from the vessels and optic disc.
In dataset aspects, about half of the studies (52%) were multicenter, primarily with a retrospective design (71%); few were conducted with prospective external validation (16%). Only 11 studies (35%) had an independent validation dataset. More detailed information on each model is displayed in Supplementary File 4.
Reporting transparency
The models’ performance against TRIPOD is shown in Table 1. Regarding basic study information, about half of the studies reported information on participant characteristics (52%), study setting (68%), and eligibility criteria (61%). Regarding study design, only 55% of the studies explained how the sample size was determined, and 77% did not report how missing data were handled. 77% of the studies reported internal validation, and fewer conducted an external validation (42%). Regarding model performance, most studies (90%) reported the AUC as a measurement, while few (6%) reported calibration. Regarding model availability and clinical application, freely available code and data were reported in no more than 35% of published studies. We e-mailed the 31 corresponding authors for access to code and data: four e-mails were automatically returned because of non-existent accounts; one author provided access to code and data on request; five authors could not comply for ethical or regulatory reasons; and 21 authors did not respond. More than half of the studies (55%) did not relate their results to actual clinical situations, mainly because the models were immature or because their performance showed no improvement over traditional systemic factors.
Table 1.
Models’ performance in TRIPOD.
| Bottom tertile (0–33.3%) | Middle tertile (33.4–66.6%) | Upper tertile (66.7–100%) |
|---|---|---|
| Calibration (16) [6%] | Data freely available (21) [35%] | Confidence interval (16) [68%] |
| Code freely available (21) [26%] | Protocol (21) [42%] | Study dates (4b) [71%] |
|  | External validation (10c) [42%] | Study setting-center (5a) [71%] |
|  | Clinically relevant (20) [45%] | Defined outcome (6a) [74%] |
|  | Missing data (13b) [45%] | Internal validation (10b) [77%] |
|  | Participants characteristics (13b) [52%] | Objective (3) [84%] |
|  | Sample size (8) [55%] | Participants flow (13a) [90%] |
|  | Eligibility criteria (12) [58%] | Model performance AUC (10d) [90%] |
|  |  | Funding (22) [94%] |
|  |  | Data source (4a) [97%] |
|  |  | Background (3a) [97%] |
The respective transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) reference is displayed in round brackets, and the percentage adherence in square brackets.
Risk of bias
Only one study (3%) was classified as having a low risk of bias overall, whereas 30 (97%) were classified as having a high risk of bias (Fig. 2). Only 22% were considered at low risk of bias in the analysis domain. The high risk of bias was mainly attributable to inappropriate handling of missing data and unclear inclusion and exclusion criteria. Four articles (13%) reported using multiple imputation for missing data, 10 (32%) removed patients with missing data, and 17 (55%) did not report any handling of missing data. Eleven articles (35%) did not mention inclusion and exclusion criteria. 52% of studies considered the risk of model overfitting. The majority of studies did not mention calibration (94%) or the complexity of the data (93%).
Fig. 2. Risk of bias assessed using Prediction model Risk Of Bias Assessment Tool (PROBAST) reporting standards for all included studies.

Low indicates the percentage of studies that did not receive any high-risk rating for a particular category. High denotes those that achieved a high-risk rating in at least one criterion for a specific category.
Models’ clinical availability
The number of publications per year increased significantly from 2018 to 2023, as depicted in Fig. 3. Models predicting diabetes and cardiovascular disease reached higher availability levels than those for other outcomes (Fig. 4). Nevertheless, the number of models with high clinical availability remained limited: only two models reached the ‘model into device’ stage, and no ‘device into practice’ technology was identified.
Fig. 3. Temporal distribution illustrating the number and models’ clinical availability of studies included in this review.

There has been a significant rise in the number of artificial intelligence algorithms used to develop prediction models in the last five years. There has been no consistent shift towards device creation and subsequent evaluation in clinical practice.
Fig. 4. Distribution illustrating the number and clinical availability of studies in 7 different prediction outcomes.

Diabetes and cardiovascular disease were the outcomes with the most articles at higher availability levels. No ‘device into practice’ technology was identified.
Discussion
Our review systematically assessed the methodology of recently published deep learning models using fundus images as predictors. We assessed each article in four aspects: general study characteristics, adherence to reporting standards, risk of bias, and models’ clinical availability. The vital need for, and quickly growing body of, deep learning models has created great expectations of applying models in clinical practice. However, implementation without rigorous evidence could have adverse consequences for patients. Our findings suggest that more work can be done to improve study design, reporting transparency, and clinical availability.
Principal findings
Five key findings emerged from our review. Firstly, few studies used prospective datasets for validation. Because most articles evaluated their deep learning models in simulated environments, performance in real clinical situations is difficult to judge. A higher C-index may not reflect clinical benefit, and relying on it alone may introduce unpredictable risks, which may be an important cause of the limited number of widely used models [25]. Whether models with claimed “superior” performance can genuinely assist doctors still needs further evaluation, especially in prospective, real-world clinical settings.
Secondly, more external validation cohorts are needed [26]. If an algorithm is designed specifically to aid junior clinicians or non-specialists in underdeveloped regions with limited medical facilities, rather than experts in well-equipped hospitals, the datasets should be correspondingly representative. The limited number of multicenter studies reduced confidence in the studies’ validity and applicability. Individual cities in individual countries contributed a disproportionately large share of the studies, and this uneven geographical distribution limits generalizability to diverse ethnicities, populations, and healthcare systems [27]. More models should be validated through multicenter, prospective, external validation in the future. Considering that deep learning has grown rapidly over the last seven years, and that prospective studies take at least one to two years to conduct [28], it is reasonable to expect high-quality trials to appear in large numbers over the next decade.
Thirdly, most studies examined model performance only via the C-index. The C-index alone, however, provides no information on model calibration, one of the most critical performance measures for correctly estimating the risk of a future event [29, 30], and calibration was rarely reported. A model may have near-perfect discrimination, but if it is poorly calibrated it is still unreliable from a clinical standpoint [25]. The sparse use of calibration measures may be partly explained by their recent introduction, and this suboptimal reporting diminishes the literature’s clinical utility.
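To illustrate why the C-index (AUC) alone cannot reveal miscalibration, the short sketch below (using simulated data, not data from any included study) applies a monotone compression to well-calibrated risk predictions: the ranking, and hence the AUC, is unchanged, but the predicted probabilities no longer match observed event rates:

```python
# Simulated illustration: discrimination (AUC) is invariant to monotone
# rescaling of predictions, while calibration is not.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
p_true = rng.uniform(0, 1, 5000)          # perfectly calibrated risk predictions
y = rng.binomial(1, p_true)               # observed binary outcomes
p_squashed = 0.5 + 0.2 * (p_true - 0.5)   # same ranking, compressed toward 0.5

print(roc_auc_score(y, p_true) == roc_auc_score(y, p_squashed))  # True: same AUC

for name, p in [("calibrated", p_true), ("miscalibrated", p_squashed)]:
    frac_pos, mean_pred = calibration_curve(y, p, n_bins=10)
    # Well-calibrated predictions keep observed frequency close to mean prediction.
    print(name, "max calibration gap:", np.abs(frac_pos - mean_pred).max().round(3))
```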
Fourthly, the limited availability of datasets and code makes it difficult to assess the reproducibility of deep learning research. Reproducible research has become a pressing issue, and more data and code sharing should be encouraged. Some authors may have intellectual property concerns about sharing code or datasets; however, once a reasonable request for code or data has been examined and approved, authors should ensure that their algorithms are non-proprietary and available for scrutiny.
Fifthly, authors often give overly optimistic estimates of model performance, despite most studies having limitations in design, reporting, transparency, and risk of bias. Overpromising results are easily misinterpreted by the media and the public. Describing the studies’ shortcomings and stating the quality of the evidence in the abstract and conclusion can help prevent tendentious reporting, and better design with more transparent reporting should ultimately facilitate innovation, validation, and translation. The problem also arises when authors are unaware of methodological shortcomings in their own articles; without targeted guidelines, recognizing a model’s deficiencies is challenging. Methodology guidelines for deep learning models are therefore urgently needed.
Study limitations
Our findings must be considered in light of several limitations. Firstly, the guidelines for evaluating reporting and methodology (TRIPOD and PROBAST) were designed for prediction models rather than AI algorithms. After expert discussion, only AI-associated items were evaluated and displayed, which may underestimate the bias. Secondly, PROBAST contains several items requiring subjective judgment, and borderline results may affect the overall interpretation. Our clinical availability framework was based on expert opinion [22]; it has been applied to other health domains [21, 22, 25] and has demonstrated potential value for assessing methodology and transparent reporting. Thirdly, the evaluation of PROBAST and TRIPOD may be subjective. All co-authors were trained in the knowledge and application of the methodology evaluation to minimize differences in standards between reviewers. Before extraction began, all co-authors extracted five articles meeting the inclusion criteria, and items on which opinions differed were discussed and reconciled to ensure uniform extraction standards across all articles. Lastly, we performed no quantitative analysis. Because of the wide range of evaluated outcomes and diseases, the number of articles in each disease category was small; a quantitative synthesis can be pursued when more deep learning models are released.
Conclusion
In conclusion, our review analyzed 31 articles that used deep learning to predict systemic information from fundus images. Current weaknesses should be addressed, as they could limit clinical impact. Rigorous adherence to standards and clinically oriented design are essential for clinical application and reproducibility, and will facilitate the translation of algorithmic data science into feasible tools and improved patient care.
Summary
What was known before:
Evidence has shown a strong association between the ocular and systemic vasculature. Analyzing fundus images with deep learning techniques is promising for screening systemic diseases.
The quality of the rapidly growing body of studies was variable and lacked systematic evaluation, which severely limits the models’ generalization.
What this study adds:
Our work focused on articles that used systemic parameters and diseases as outcomes, rather than any single-system disease, which makes our research a potential reference for primary screening.
It is the first review to apply PROBAST, TRIPOD, and a maturity assessment to deep learning models based on fundus images.
We assessed the quality and applicability of the models, evaluated their performance, and proposed solutions that would enable the translation of deep learning into clinical practice.
Author contributions
YTL and DL conceived the study and designed the study protocol. YTL, RHZ and WBW executed the search and extracted data. YTL performed the initial analysis of data, with all authors contributing to interpretation of data. All authors contributed to critical revision of the manuscript for important intellectual content and approved the final version. WBW is the study guarantor. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.
Funding
This work was supported by National Natural Science Foundation of China (82220108017, 82141128), the Capital Health Research and Development of Special (2020–1–2052), and the Science & Technology Project of Beijing Municipal Science & Technology Commission (Z201100005520045, Z181100001818003).
Data availability
Raw data are available on request from the corresponding author.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Yitong Li, Ruiheng Zhang, Li Dong.
Supplementary information
The online version contains supplementary material available at 10.1038/s41433-023-02914-0.
References
- 1. Liew G, Gopinath B, White AJ, Burlutsky G, Yin Wong T, Mitchell P. Retinal vasculature fractal and stroke mortality. Stroke. 2021;52:1276–82. doi: 10.1161/STROKEAHA.120.031886.
- 2. Patton N, Aslam TM, MacGillivray T, Deary IJ, Dhillon B, Eikelboom RH, et al. Retinal image analysis: concepts, applications and potential. Prog Retin Eye Res. 2006;25:99–127. doi: 10.1016/j.preteyeres.2005.07.001.
- 3. Forster RB, Garcia ES, Sluiman AJ, Grecian SM, McLachlan S, MacGillivray TJ, et al. Retinal venular tortuosity and fractal dimension predict incident retinopathy in adults with type 2 diabetes: the Edinburgh Type 2 Diabetes Study. Diabetologia. 2021;64:1103–12. doi: 10.1007/s00125-021-05388-5.
- 4. Wong TY, Knudtson MD, Klein R, Klein BE, Meuer SM, Hubbard LD. Computer-assisted measurement of retinal vessel diameters in the Beaver Dam Eye Study: methodology, correlation between eyes, and effect of refractive errors. Ophthalmology. 2004;111:1183–90. doi: 10.1016/j.ophtha.2003.09.039.
- 5. Thom S, Stettler C, Stanton A, Witt N, Tapp R, Chaturvedi N, et al. Differential effects of antihypertensive treatment on the retinal microcirculation: an Anglo-Scandinavian Cardiac Outcomes Trial substudy. Hypertension. 2009;54:405–8. doi: 10.1161/HYPERTENSIONAHA.109.133819.
- 6. Czakó C, Kovács T, Ungvari Z, Csiszar A, Yabluchanskiy A, Conley S, et al. Retinal biomarkers for Alzheimer’s disease and vascular cognitive impairment and dementia (VCID): implication for early diagnosis and prognosis. Geroscience. 2020;42:1499–525. doi: 10.1007/s11357-020-00252-7.
- 7. Gamble L, Mash AJ, Burdan T, Ruiz RS, Spivey BE. Ophthalmology (eye physician and surgeon) manpower studies for the United States. Part IV: Ophthalmology manpower distribution 1983. Ophthalmology. 1983;90:47a–64a. doi: 10.1016/S0161-6420(83)80032-3.
- 8. Yuan M, Chen W, Wang T, Song Y, Zhu Y, Chen C, et al. Exploring the growth patterns of medical demand for eye care: a longitudinal hospital-level study over 10 years in China. Ann Transl Med. 2020;8:1374. doi: 10.21037/atm-20-2939.
- 9. Celi LA, Mark RG, Stone DJ, Montgomery RA. “Big data” in the intensive care unit. Closing the data loop. Am J Respir Crit Care Med. 2013;187:1157–60. doi: 10.1164/rccm.201212-2311ED.
- 10. Futoma J, Simons M, Panch T, Doshi-Velez F, Celi LA. The myth of generalisability in clinical research and machine learning in health care. Lancet Digit Health. 2020;2:e489–e492. doi: 10.1016/S2589-7500(20)30186-2.
- 11. Mookiah MRK, Hogg S, MacGillivray TJ, Prathiba V, Pradeepa R, Mohan V, et al. A review of machine learning methods for retinal blood vessel segmentation and artery/vein classification. Med Image Anal. 2021;68:101905. doi: 10.1016/j.media.2020.101905.
- 12. van Leeuwen KG, Schalekamp S, Rutten M, van Ginneken B, de Rooij M. Artificial intelligence in radiology: 100 commercially available products and their scientific evidence. Eur Radiol. 2021;31:3797–804. doi: 10.1007/s00330-021-07892-z.
- 13. Auffermann WF, Gozansky EK, Tridandapani S. Artificial intelligence in cardiothoracic radiology. AJR Am J Roentgenol. 2019;212:997–1001.
- 14. Jones OT, Matin RN, van der Schaar M, Prathivadi Bhayankaram K, Ranmuthu CKI, Islam MS, et al. Artificial intelligence and machine learning algorithms for early detection of skin cancer in community and primary care settings: a systematic review. Lancet Digit Health. 2022;4:e466–e476. doi: 10.1016/S2589-7500(22)00023-1.
- 15. Phillips M, Marsden H, Jaffe W, Matin RN, Wali GN, Greenhalgh J, et al. Assessment of accuracy of an artificial intelligence algorithm to detect melanoma in images of skin lesions. JAMA Netw Open. 2019;2:e1913436. doi: 10.1001/jamanetworkopen.2019.13436.
- 16. Nabi J. Artificial intelligence can augment global pathology initiatives. Lancet. 2018;392:2351–2. doi: 10.1016/S0140-6736(18)32209-8.
- 17. Bera K, Schalper KA, Rimm DL, Velcheti V, Madabhushi A. Artificial intelligence in digital pathology - new tools for diagnosis and precision oncology. Nat Rev Clin Oncol. 2019;16:703–15. doi: 10.1038/s41571-019-0252-y.
- 18. Ting DSJ, Foo VH, Yang LWY, Sia JT, Ang M, Lin H, et al. Artificial intelligence for anterior segment diseases: emerging applications in ophthalmology. Br J Ophthalmol. 2021;105:158–68. doi: 10.1136/bjophthalmol-2019-315651.
- 19. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015;350:g7594. doi: 10.1136/bmj.g7594.
- 20. Moons KG, Altman DG, Reitsma JB, Ioannidis JP, Macaskill P, Steyerberg EW, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015;162:W1–73. doi: 10.7326/M14-0698.
- 21. Nagendran M, Chen Y, Lovejoy CA, Gordon AC, Komorowski M, Harvey H, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ. 2020;368:m689. doi: 10.1136/bmj.m689.
- 22. Corti C, Cobanaj M, Marian F, Dee EC, Lloyd MR, Marcu S, et al. Artificial intelligence for prediction of treatment outcomes in breast cancer: systematic review of design, reporting standards, and bias. Cancer Treat Rev. 2022;108:102410. doi: 10.1016/j.ctrv.2022.102410.
- 23. Wolff RF, Moons KGM, Riley RD, Whiting PF, Westwood M, Collins GS, et al. PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med. 2019;170:51–58. doi: 10.7326/M18-1376.
- 24. Moons KGM, Wolff RF, Riley RD, Whiting PF, Westwood M, Collins GS, et al. PROBAST: a tool to assess risk of bias and applicability of prediction model studies: explanation and elaboration. Ann Intern Med. 2019;170:W1–W33. doi: 10.7326/M18-1377.
- 25. Gallifant J, Zhang J, Del Pilar Arias Lopez M, Zhu T, Camporota L, Celi LA, et al. Artificial intelligence for mechanical ventilation: systematic review of design, reporting standards, and bias. Br J Anaesth. 2022;128:343–51. doi: 10.1016/j.bja.2021.09.025.
- 26. Lee AY, Yanagihara RT, Lee CS, Blazes M, Jung HC, Chee YE, et al. Multicenter, head-to-head, real-world validation study of seven automated artificial intelligence diabetic retinopathy screening systems. Diabetes Care. 2021;44:1168–75. doi: 10.2337/dc20-1877.
- 27. Kaushal A, Altman R, Langlotz C. Geographic distribution of US cohorts used to train deep learning algorithms. JAMA. 2020;324:1212–3. doi: 10.1001/jama.2020.12067.
- 28. Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw. 2015;61:85–117. doi: 10.1016/j.neunet.2014.09.003.
- 29. Blaha MJ. The critical importance of risk score calibration: time for transformative approach to risk score validation? J Am Coll Cardiol. 2016;67:2131–4. doi: 10.1016/j.jacc.2016.03.479.
- 30. Laukkanen JA, Kunutsor SK. Is ‘re-calibration’ of standard cardiovascular disease (CVD) risk algorithms the panacea to improved CVD risk prediction and prevention? Eur Heart J. 2019;40:632–4. doi: 10.1093/eurheartj/ehy726.