Abstract
Background
Analyzing fundus images with deep learning techniques is promising for screening systemic diseases. However, the quality of the rapidly increasing number of studies has been variable, and systematic evaluation is lacking.
Objective
To systematically review all articles that aimed to predict systemic parameters and conditions using fundus images and deep learning, to assess their performance, and to provide suggestions that would enable translation into clinical practice.
Methods
Two major electronic databases (MEDLINE and EMBASE) were searched up to August 22, 2023, with the keywords ‘deep learning’ and ‘fundus’. Studies using deep learning and fundus images to predict systemic parameters were included and assessed in four aspects: study characteristics, transparent reporting, risk of bias, and clinical availability. Transparent reporting was assessed with the TRIPOD statement, and risk of bias with PROBAST.
Results
4969 articles were identified through the systematic search, of which thirty-one were included in the review. A variety of vascular and non-vascular diseases could be predicted from fundus images; diabetes and related diseases (19%), sex (22%), and age (19%) were the most common outcomes. Most of the studies focused on developed countries. According to TRIPOD, reporting on sample size determination and missing data handling was insufficient, and full access to datasets and code was also under-reported. According to PROBAST, 1/31 (3.2%) studies were classified as having a low risk of bias overall, whereas 30/31 (96.8%) were classified as having a high risk of bias. 5/31 (16.1%) studies used prospective external validation cohorts, and only two (6.5%) described model calibration. The number of publications per year increased markedly from 2018 to 2023. However, only two models (6.5%) reached the device stage, and no model has been applied in clinical practice.
Conclusion
Deep learning on fundus images has shown great potential for predicting systemic conditions in clinical situations. Further work is needed to improve the methodology and clinical application.
Subject terms: Eye manifestations, Prognosis
Introduction
There is a strong association between the ocular and systemic vasculature [1], and the retina is also part of the central nervous system. The network of blood vessels and nerves in the fundus is the only part of the human body where the microcirculation and nerve terminals can be directly observed [2]. There is increasing evidence that many common diseases, and even systemic parameters (such as age, sex, and smoking status), can be predicted from characteristic changes in the fundus of the eye [3, 4]. Fundus imaging is now recognized as a non-invasive and promising approach for early diagnosis and prediction, and could be used in general health care [5, 6]. However, because most predictions relied on ophthalmologists’ judgment and experience, the results were susceptible to some degree of subjectivity [7, 8]. Only a few diseases with well-defined characteristics and straightforward grading standards have reached a prediction consensus, which severely limits the broader use of fundus images beyond eye diseases.
With the advent of mathematical algorithms and the availability of big datasets, deep learning-based algorithms have shown significant advances in analyzing fundus images [9, 10]. By integrating big data and self-supervised training, deep learning algorithms can identify and analyze fundus biomedical features below the threshold of human observers [11]. In medicine, artificial intelligence has shown promising applications in many fields, such as radiology [12, 13], dermatology [14, 15], pathology [16, 17], and ophthalmology [18]. It can support clinical decision-making and can even exhibit remarkable ability beyond human analysis.
Many deep learning algorithms have been developed in recent years, with wide variation in applications, reporting, methodology, and clinical availability, which may lead to poor clinical translation [10]. To be implemented in clinical practice, a deep learning algorithm needs to be judged at low risk of bias, which requires all relevant information to be transparently reported. It also needs to be an available and applicable model, rather than an idealized mathematical algorithm. Therefore, evaluating these models against the best open standards to promote clinical translation is urgently required.
The present study aims to systematically review all studies predicting systemic parameters and diseases using fundus images and deep learning. We assessed each study’s characteristics, adherence to both methodological and reporting standards, and the models’ clinical availability. We critically appraised the models’ methodology and propose possible solutions that would enable the translation of deep learning into clinical practice.
Methods
The review protocol was registered in the online PROSPERO database (registration number CRD42022344932) before search execution. Our systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, with a checklist displayed in Supplementary File 1.
Study identification and inclusion criteria
Two major electronic databases (MEDLINE (OVID) and Embase) were searched from database creation until August 22, 2023, with no additional filters used. Two researchers (YTL and RHZ) screened and cross-checked all the literature.
We included studies that aimed to use fundus images to predict systemic diseases. We applied no restrictions on study design, data source, or type of patient-related health outcome. We selected publications for review if they satisfied the following inclusion criteria: English language, patient-related outcomes, adult human participants, assessment of a deep learning algorithm for analyzing fundus photographs, and systemic diseases among the outcomes. A deep learning algorithm was defined as a computational model composed of multiple processing layers that learn representations of data with multiple levels of abstraction. Eligible publications needed to describe the development or validation of at least one deep learning model aimed at individualized prediction or diagnosis.
Exclusion criteria were: (1) commentaries, letters to the editor, editorials, and meeting abstracts; and (2) publications that used machine learning only to enhance the reading of images or signals or to extract vascular parameters, because our study aimed to alleviate data inefficiency by deriving information directly from fundus photographs rather than assessing the predictive ability of specific caliber parameters. Additional articles were retrieved by manually scrutinizing the reference lists of relevant publications. Figure 1 shows the PRISMA flow chart of study records. The inclusion and exclusion criteria of PICO (Patient, Intervention, Comparison, Outcome) are detailed in Supplementary File 2.
Fig. 1. PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow chart of study records.
n number; AI artificial intelligence.
Study selection and data extraction
After the removal of clearly irrelevant records, two researchers (YTL and RHZ) independently screened abstracts for potentially eligible studies. Full-text reports were then assessed for eligibility, with disagreements resolved by consensus. Two researchers extracted data from study reports independently and in duplicate for each eligible analysis, with disagreements resolved by a third reviewer (WBW). Data were systematically extracted from each study into a predesigned spreadsheet and analyzed post hoc using pivot tables. We extracted data from each article in four aspects.
Study characteristics
Study characteristics were evaluated in both technical and dataset aspects. Technical aspects assessed the algorithm, including the network architectures, model performance metrics, and visualization techniques. Dataset aspects included disease information, dataset characteristics, external validation, and the photographs’ mydriatic status, granularity, pre-processing, and gradability.
Reporting transparency
This aspect evaluates the transparency of reporting. Based on the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) statement [19, 20], published in The BMJ in 2015, we used a modified TRIPOD list of 29 items (see Supplementary File 3), the same items used in a previous methodology review [21]. That review focused on imaging applications of artificial intelligence, in line with the aim of our article, and discussed in detail the changes to and inclusion of the TRIPOD items (Supplementary File 3). We examined each item, and all authors approved the final version (Supplementary File 5). This aspect assessed whether studies broadly conformed to the reporting recommendations of TRIPOD, not the detailed granularity required for a full assessment of adherence. Overall group adherence was calculated as a percentage, and adherence groups were created based on reporting of 0–33.3%, 33.4–66.6%, or 66.7–100%. This classification method has been used in other quality assessment articles [22].
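To make the adherence grouping concrete, the following minimal Python sketch (with hypothetical item names and toy data, not the actual extraction spreadsheet) computes per-item adherence across studies and assigns each item to one of the three tertile groups:

```python
# Hypothetical illustration of the tertile grouping described above.
def tertile(pct: float) -> str:
    """Map a percentage adherence to the tertile groups used in this review."""
    if pct <= 33.3:
        return "bottom tertile (0-33.3%)"
    if pct <= 66.6:
        return "middle tertile (33.4-66.6%)"
    return "upper tertile (66.7-100%)"

# Each row is one study; True means the (hypothetical) TRIPOD item was reported.
studies = [
    {"sample_size": False, "missing_data": False, "data_source": True},
    {"sample_size": True,  "missing_data": False, "data_source": True},
    {"sample_size": True,  "missing_data": True,  "data_source": True},
]

for item in studies[0]:
    pct = 100 * sum(s[item] for s in studies) / len(studies)
    print(f"{item}: {pct:.1f}% adherence -> {tertile(pct)}")
```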
Risk of bias
This aspect assesses the risk of bias in each included article. Based on PROBAST (Prediction model Risk Of Bias ASsessment Tool) [23, 24], we used a modified version covering all items except applicability and predictor variables, the same items used in a previously published methodology review [21]. The items were discussed by the co-authors and approved as the final version. PROBAST contains 20 signaling questions across four domains (participants, predictors, outcomes, and analysis) to allow assessment of the risk of bias in predictive modeling studies. We did not assess applicability (which we cover through the TRIPOD items and the models’ clinical availability aspect) or predictor variables (these are less relevant in deep learning studies on fundus imaging; see Supplementary File 3). For studies whose code or data were not publicly available, the authors were e-mailed for access in order to assess the generalizability and availability of data and code.
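As reflected in Fig. 2, ratings were aggregated with PROBAST’s standard rule: a study is rated low risk overall only when every assessed domain is low risk, while a single high-risk domain yields a high overall rating. A minimal sketch of this rule (domain names follow our modified assessment, which excludes the predictors domain):

```python
# Sketch of the PROBAST overall-rating rule used for the assessed domains.
def overall_risk(domains: dict) -> str:
    ratings = set(domains.values())    # each value: "low", "high", or "unclear"
    if "high" in ratings:
        return "high"                  # any high-risk domain -> high overall
    if "unclear" in ratings:
        return "unclear"
    return "low"                       # low overall only if all domains are low

print(overall_risk({"participants": "low", "outcomes": "low", "analysis": "high"}))
# -> "high"
```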
Models’ clinical availability
The models’ clinical availability aspect evaluates the maturity of each model in order to assess its clinical feasibility. We used the evaluation methods of a well-recognized article [22], which focused on breast cancer prognostics and designed a series of items to assess model maturity. After screening, we considered this framework applicable to our article; a 2022 review evaluating the methodology of artificial intelligence for mechanical ventilation [25] used the same evaluation items. We assessed the level of clinical availability of each model. Studies were categorized into the following groups: (i) ‘mathematics into algorithm’ (authors propose a new algorithmic construct with some feasibility testing, with no prospective validation), (ii) ‘algorithm into model’ (authors develop a new model based on retrospective data with some prospective feasibility testing, without comparison against any clinical standard), (iii) ‘model into device’ (validation of a previously developed model, ideally against an existing gold standard, without any clinical application), and (iv) ‘device into practice’ (prospective clinical evaluation of deployed devices).
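For readers who prefer an explicit ordering, the four levels can be written as a trivial Python sketch (the level names follow the categories above; the numeric ordering is our own illustrative convention, not part of the original framework):

```python
# Minimal sketch: the four clinical availability levels as an ordered scale.
from enum import IntEnum

class ClinicalAvailability(IntEnum):
    MATHEMATICS_INTO_ALGORITHM = 1  # new algorithm, feasibility testing, no prospective validation
    ALGORITHM_INTO_MODEL = 2        # model built on retrospective data, some prospective feasibility
    MODEL_INTO_DEVICE = 3           # validation of a previously developed model, no clinical application
    DEVICE_INTO_PRACTICE = 4        # prospective clinical evaluation of a deployed device

# Comparing maturity is then a simple ordering test.
assert ClinicalAvailability.MODEL_INTO_DEVICE > ClinicalAvailability.ALGORITHM_INTO_MODEL
```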
Results
Study selection
We identified 4969 articles through the systematic search. After removing duplicates, 4365 abstracts were screened. After exclusions, 31 articles were included in our review (Fig. 1).
Study characteristics
Diabetes and related diseases (19%), sex (22%), and age (19%) attracted the highest research interest. Twelve studies (39%) predicted more than one kind of disease. The three countries with the most published studies were China (26%), the USA (16%), and Singapore (16%), together representing more than half of the included studies.
In technical aspects, 25 studies (81%) reported using existing network architectures, while the other six (19%) developed customized networks. Twenty-eight studies (90%) used the AUC as the model performance metric, but only two articles (6%) considered model calibration. Twenty studies (65%) provided heatmaps or saliency maps to visualize the learning process, most of which drew information from the vessels and optic disc.
In dataset aspects, about half of the studies (52%) were multicenter, primarily with a retrospective design (71%); few were conducted with prospective external validation (16%). Only 11 studies (35%) had an independent validation dataset. More detailed information on each model is displayed in Supplementary File 4.
Reporting transparency
The models’ performance against TRIPOD is shown in Table 1. Regarding basic study information, about half of the studies reported information on participant characteristics (52%), study setting (68%), and eligibility criteria (61%). Regarding study design, only 55% of the studies explained how the sample size was determined, and 77% did not report how missing data were handled. 77% of the studies reported internal validation, and fewer conducted an external validation (42%). Regarding model performance, most studies (90%) reported the AUC as a measurement, while few (6%) reported calibration. Regarding model availability and clinical application, freely available code and data were reported in no more than 35% of published studies. We e-mailed the 31 corresponding authors for access to code and data: four e-mails were automatically returned because of non-existent accounts; one author provided access to code and data on request; five authors could not comply for ethical or regulatory reasons; and 21 authors did not respond. More than half of the studies (55%) did not relate their results to actual clinical situations, mainly because the models were immature or because their performance showed no improvement over traditional systemic factors.
Table 1.
Models’ performance in TRIPOD.
| Bottom tertile (0–33.3%) | Middle tertile (33.4–66.6%) | Upper tertile (66.7–100%) |
|---|---|---|
| Calibration (16) [6%] | Data freely available (21) [35%] | Confidence interval (16) [68%] |
| Code freely available (21) [26%] | Protocol (21) [42%] | Study dates (4b) [71%] |
|  | External validation (10c) [42%] | Study setting-center (5a) [71%] |
|  | Clinically relevant (20) [45%] | Defined outcome (6a) [74%] |
|  | Missing data (13b) [45%] | Internal validation (10b) [77%] |
|  | Participants characteristics (13b) [52%] | Objective (3) [84%] |
|  | Sample size (8) [55%] | Participants flow (13a) [90%] |
|  | Eligibility criteria (12) [58%] | Model performance AUC (10d) [90%] |
|  |  | Funding (22) [94%] |
|  |  | Data source (4a) [97%] |
|  |  | Background (3a) [97%] |
The respective transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) reference is displayed in round brackets, and the percentage adherence in square brackets.
Risk of bias
Only one study (3%) was classified as having a low risk of bias overall, whereas 30 (97%) were classified as having a high risk of bias (Fig. 2). Only 22% were considered at low risk of bias in the analysis domain. The high risk of bias was mainly attributable to inappropriate handling of missing data and unclear inclusion and exclusion criteria. Four articles (13%) reported using multiple imputation for missing data, 10 (32%) removed patients with missing data, and 17 (55%) did not report any handling of missing data. Eleven articles (35%) did not mention inclusion and exclusion criteria. 52% of studies considered the risk of model overfitting. The majority of studies did not mention calibration (94%) or the complexity of the data (93%).
Fig. 2. Risk of bias assessed using Prediction model Risk Of Bias Assessment Tool (PROBAST) reporting standards for all included studies.

Low indicates the percentage of studies that did not receive any high-risk rating for a particular category. High denotes those that achieved a high-risk rating in at least one criterion for a specific category.
Models’ clinical availability
The number of publications per year increased significantly from 2018 to 2023, as depicted in Fig. 3. Models predicting diabetes and cardiovascular disease reached higher availability levels than those for other outcomes (Fig. 4). Nevertheless, the number of models with high clinical availability remained limited: only two models reached the ‘model into device’ stage, and no ‘device into practice’ technology was identified.
Fig. 3. Temporal distribution illustrating the number and models’ clinical availability of studies included in this review.

There has been a significant rise in the number of artificial intelligence algorithms used to develop prediction models in the last five years. There has been no consistent shift towards device creation and subsequent evaluation in clinical practice.
Fig. 4. Distribution illustrating the number and clinical availability of studies in 7 different prediction outcomes.

Diabetes and cardiovascular disease were the outcomes with the most articles at higher availability levels. No ‘device into practice’ technology was identified.
Discussion
Our review systematically assessed the methodology of recently published deep learning models using fundus images as predictors. We assessed each article in four aspects: general study characteristics, adherence to reporting standards, risk of bias, and models’ clinical availability. The vital need for, and quickly growing body of, deep learning models has created great expectations of applying models in clinical practice. However, implementation without rigorous evidence could have adverse consequences for patients. Our findings suggest that more work can be done to improve study design, reporting transparency, and clinical availability.
Principal findings
Five key findings emerged from our review. Firstly, few studies used prospective datasets for validation. Because most articles evaluated their deep learning models in simulated environments, performance in real clinical situations is difficult to judge. A higher C-index may not reflect clinical benefit, and relying on it alone may introduce unpredictable risks, which may be an important cause of the limited number of widely used models [25]. Whether models with claimed “superior” performance can genuinely assist doctors still needs further evaluation, especially in prospective, real-world clinical settings.
Secondly, more external validation cohorts are needed [26]. If an algorithm is designed specifically to aid junior clinicians or non-specialists in underdeveloped regions with limited medical facilities, rather than experts in well-equipped hospitals, the datasets should be correspondingly representative. The limited number of multicenter studies reduced confidence in the studies’ validity and applicability. Individual cities in individual countries contributed a disproportionately large share of the studies, and this uneven geographical distribution limits generalizability to diverse ethnicities, populations, and healthcare systems [27]. More models should be validated through multicenter, prospective, external validation in the future. Considering that deep learning has grown rapidly over the last seven years, and that prospective studies take at least one to two years to conduct [28], it is reasonable to expect high-quality trials to appear in large numbers over the next decade.
Thirdly, most studies examined model performance only via the C-index. The C-index alone, however, provides no information on model calibration, one of the most critical performance measures for correctly estimating the risk of a future event [29, 30], and calibration was rarely reported. A model may have near-perfect discrimination, but if it is poorly calibrated it is still unreliable from a clinical standpoint [25]. The sparse use of calibration measures may be partly explained by their recent introduction, and this suboptimal reporting diminishes the literature’s clinical utility.
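To illustrate why the C-index (AUC) alone cannot reveal miscalibration, the short sketch below (using simulated data, not data from any included study) applies a monotone compression to well-calibrated risk predictions: the ranking, and hence the AUC, is unchanged, but the predicted probabilities no longer match observed event rates:

```python
# Simulated illustration: discrimination (AUC) is invariant to monotone
# rescaling of predictions, while calibration is not.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
p_true = rng.uniform(0, 1, 5000)          # perfectly calibrated risk predictions
y = rng.binomial(1, p_true)               # observed binary outcomes
p_squashed = 0.5 + 0.2 * (p_true - 0.5)   # same ranking, compressed toward 0.5

print(roc_auc_score(y, p_true) == roc_auc_score(y, p_squashed))  # True: same AUC

for name, p in [("calibrated", p_true), ("miscalibrated", p_squashed)]:
    frac_pos, mean_pred = calibration_curve(y, p, n_bins=10)
    # Well-calibrated predictions keep observed frequency close to mean prediction.
    print(name, "max calibration gap:", np.abs(frac_pos - mean_pred).max().round(3))
```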
Fourthly, the limited availability of datasets and code makes it difficult to assess the reproducibility of deep learning research. Reproducible research has become a pressing issue, and more data and code sharing should be encouraged. Some authors may have intellectual property concerns about sharing code or datasets; however, once a reasonable request for code or data has been examined and approved, authors should ensure that their algorithms are non-proprietary and available for scrutiny.
Fifthly, authors often give overly optimistic estimates of model performance, despite most studies having limitations in design, reporting, transparency, and risk of bias. Overpromising results are easily misinterpreted by the media and the public. Describing the studies’ shortcomings and stating the quality of the evidence in the abstract and conclusion can help prevent tendentious reporting, and better design with more transparent reporting should ultimately facilitate innovation, validation, and translation. The problem also arises when authors are unaware of methodological shortcomings in their own articles; without targeted guidelines, recognizing a model’s deficiencies is challenging. Methodology guidelines for deep learning models are therefore urgently needed.
Study limitations
Our findings must be considered in light of several limitations. Firstly, the guidelines for evaluating reporting and methodology (TRIPOD and PROBAST) were designed for prediction models rather than AI algorithms. After expert discussion, only AI-associated items were evaluated and displayed, which may underestimate the bias. Secondly, PROBAST contains several items requiring subjective judgment, and borderline results may affect the overall interpretation. Our clinical availability framework was based on expert opinion [22]; it has been applied to other health domains [21, 22, 25] and has demonstrated potential value for assessing methodology and transparent reporting. Thirdly, the evaluation of PROBAST and TRIPOD may be subjective. All co-authors were trained in the knowledge and application of the methodology evaluation to minimize differences in standards between reviewers. Before extraction began, all co-authors extracted five articles meeting the inclusion criteria, and items on which opinions differed were discussed and reconciled to ensure uniform extraction standards across all articles. Lastly, we performed no quantitative analysis. Because of the wide range of evaluated outcomes and diseases, the number of articles in each disease category was small; a quantitative synthesis can be pursued when more deep learning models are released.
Conclusion
In conclusion, our review analyzed 31 articles that used deep learning to predict systemic information from fundus images. Current weaknesses should be addressed, as they could limit clinical impact. Rigorous adherence to standards and clinically oriented design are essential for clinical application and reproducibility, and will facilitate the translation of algorithmic data science into feasible tools and improved patient care.
Summary
What was known before:
Evidence has shown a strong association between the ocular and systemic vasculature. Analyzing fundus images with deep learning techniques is promising for screening systemic diseases.
The quality of the rapidly growing body of studies was variable and lacked systematic evaluation, which severely limits the models’ generalization.
What this study adds:
Our work focused on articles that used systemic parameters and diseases as outcomes, rather than any single-system disease, which makes our research a potential reference for primary screening.
It is the first review to apply PROBAST, TRIPOD, and a maturity assessment to deep learning models based on fundus images.
We assessed the quality and applicability of the models, evaluated their performance, and proposed solutions that would enable the translation of deep learning into clinical practice.
Author contributions
YTL and DL conceived the study and designed the study protocol. YTL, RHZ and WBW executed the search and extracted data. YTL performed the initial analysis of data, with all authors contributing to interpretation of data. All authors contributed to critical revision of the manuscript for important intellectual content and approved the final version. WBW is the study guarantor. The corresponding author attests that all listed authors meet authorship criteria and that no others meeting the criteria have been omitted.
Funding
This work was supported by National Natural Science Foundation of China (82220108017, 82141128), the Capital Health Research and Development of Special (2020–1–2052), and the Science & Technology Project of Beijing Municipal Science & Technology Commission (Z201100005520045, Z181100001818003).
Data availability
Raw data are available on request from the corresponding author.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Yitong Li, Ruiheng Zhang, Li Dong.
Supplementary information
The online version contains supplementary material available at 10.1038/s41433-023-02914-0.
References
- 1. Liew G, Gopinath B, White AJ, Burlutsky G, Yin Wong T, Mitchell P. Retinal vasculature fractal and stroke mortality. Stroke. 2021;52:1276–82. doi: 10.1161/STROKEAHA.120.031886.
- 2. Patton N, Aslam TM, MacGillivray T, Deary IJ, Dhillon B, Eikelboom RH, et al. Retinal image analysis: concepts, applications and potential. Prog Retin Eye Res. 2006;25:99–127. doi: 10.1016/j.preteyeres.2005.07.001.
- 3. Forster RB, Garcia ES, Sluiman AJ, Grecian SM, McLachlan S, MacGillivray TJ, et al. Retinal venular tortuosity and fractal dimension predict incident retinopathy in adults with type 2 diabetes: the Edinburgh Type 2 Diabetes Study. Diabetologia. 2021;64:1103–12. doi: 10.1007/s00125-021-05388-5.
- 4. Wong TY, Knudtson MD, Klein R, Klein BE, Meuer SM, Hubbard LD. Computer-assisted measurement of retinal vessel diameters in the Beaver Dam Eye Study: methodology, correlation between eyes, and effect of refractive errors. Ophthalmology. 2004;111:1183–90. doi: 10.1016/j.ophtha.2003.09.039.
- 5. Thom S, Stettler C, Stanton A, Witt N, Tapp R, Chaturvedi N, et al. Differential effects of antihypertensive treatment on the retinal microcirculation: an Anglo-Scandinavian Cardiac Outcomes Trial substudy. Hypertension. 2009;54:405–8. doi: 10.1161/HYPERTENSIONAHA.109.133819.
- 6. Czakó C, Kovács T, Ungvari Z, Csiszar A, Yabluchanskiy A, Conley S, et al. Retinal biomarkers for Alzheimer’s disease and vascular cognitive impairment and dementia (VCID): implication for early diagnosis and prognosis. Geroscience. 2020;42:1499–525. doi: 10.1007/s11357-020-00252-7.
- 7. Gamble L, Mash AJ, Burdan T, Ruiz RS, Spivey BE. Ophthalmology (eye physician and surgeon) manpower studies for the United States. Part IV: Ophthalmology manpower distribution 1983. Ophthalmology. 1983;90:47a–64a. doi: 10.1016/S0161-6420(83)80032-3.
- 8. Yuan M, Chen W, Wang T, Song Y, Zhu Y, Chen C, et al. Exploring the growth patterns of medical demand for eye care: a longitudinal hospital-level study over 10 years in China. Ann Transl Med. 2020;8:1374. doi: 10.21037/atm-20-2939.
- 9. Celi LA, Mark RG, Stone DJ, Montgomery RA. “Big data” in the intensive care unit. Closing the data loop. Am J Respir Crit Care Med. 2013;187:1157–60. doi: 10.1164/rccm.201212-2311ED.
- 10. Futoma J, Simons M, Panch T, Doshi-Velez F, Celi LA. The myth of generalisability in clinical research and machine learning in health care. Lancet Digit Health. 2020;2:e489–e492. doi: 10.1016/S2589-7500(20)30186-2.
- 11. Mookiah MRK, Hogg S, MacGillivray TJ, Prathiba V, Pradeepa R, Mohan V, et al. A review of machine learning methods for retinal blood vessel segmentation and artery/vein classification. Med Image Anal. 2021;68:101905. doi: 10.1016/j.media.2020.101905.
- 12. van Leeuwen KG, Schalekamp S, Rutten M, van Ginneken B, de Rooij M. Artificial intelligence in radiology: 100 commercially available products and their scientific evidence. Eur Radiol. 2021;31:3797–804. doi: 10.1007/s00330-021-07892-z.
- 13. Auffermann WF, Gozansky EK, Tridandapani S. Artificial intelligence in cardiothoracic radiology. AJR Am J Roentgenol. 2019;212:997–1001.
- 14. Jones OT, Matin RN, van der Schaar M, Prathivadi Bhayankaram K, Ranmuthu CKI, Islam MS, et al. Artificial intelligence and machine learning algorithms for early detection of skin cancer in community and primary care settings: a systematic review. Lancet Digit Health. 2022;4:e466–e476. doi: 10.1016/S2589-7500(22)00023-1.
- 15. Phillips M, Marsden H, Jaffe W, Matin RN, Wali GN, Greenhalgh J, et al. Assessment of accuracy of an artificial intelligence algorithm to detect melanoma in images of skin lesions. JAMA Netw Open. 2019;2:e1913436. doi: 10.1001/jamanetworkopen.2019.13436.
- 16. Nabi J. Artificial intelligence can augment global pathology initiatives. Lancet. 2018;392:2351–2. doi: 10.1016/S0140-6736(18)32209-8.
- 17. Bera K, Schalper KA, Rimm DL, Velcheti V, Madabhushi A. Artificial intelligence in digital pathology - new tools for diagnosis and precision oncology. Nat Rev Clin Oncol. 2019;16:703–15. doi: 10.1038/s41571-019-0252-y.
- 18. Ting DSJ, Foo VH, Yang LWY, Sia JT, Ang M, Lin H, et al. Artificial intelligence for anterior segment diseases: emerging applications in ophthalmology. Br J Ophthalmol. 2021;105:158–68. doi: 10.1136/bjophthalmol-2019-315651.
- 19. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015;350:g7594. doi: 10.1136/bmj.g7594.
- 20. Moons KG, Altman DG, Reitsma JB, Ioannidis JP, Macaskill P, Steyerberg EW, et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med. 2015;162:W1–73. doi: 10.7326/M14-0698.
- 21. Nagendran M, Chen Y, Lovejoy CA, Gordon AC, Komorowski M, Harvey H, et al. Artificial intelligence versus clinicians: systematic review of design, reporting standards, and claims of deep learning studies. BMJ. 2020;368:m689. doi: 10.1136/bmj.m689.
- 22. Corti C, Cobanaj M, Marian F, Dee EC, Lloyd MR, Marcu S, et al. Artificial intelligence for prediction of treatment outcomes in breast cancer: systematic review of design, reporting standards, and bias. Cancer Treat Rev. 2022;108:102410. doi: 10.1016/j.ctrv.2022.102410.
- 23. Wolff RF, Moons KGM, Riley RD, Whiting PF, Westwood M, Collins GS, et al. PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Ann Intern Med. 2019;170:51–58. doi: 10.7326/M18-1376.
- 24. Moons KGM, Wolff RF, Riley RD, Whiting PF, Westwood M, Collins GS, et al. PROBAST: a tool to assess risk of bias and applicability of prediction model studies: explanation and elaboration. Ann Intern Med. 2019;170:W1–W33. doi: 10.7326/M18-1377.
- 25. Gallifant J, Zhang J, Del Pilar Arias Lopez M, Zhu T, Camporota L, Celi LA, et al. Artificial intelligence for mechanical ventilation: systematic review of design, reporting standards, and bias. Br J Anaesth. 2022;128:343–51. doi: 10.1016/j.bja.2021.09.025.
- 26. Lee AY, Yanagihara RT, Lee CS, Blazes M, Jung HC, Chee YE, et al. Multicenter, head-to-head, real-world validation study of seven automated artificial intelligence diabetic retinopathy screening systems. Diabetes Care. 2021;44:1168–75. doi: 10.2337/dc20-1877.
- 27. Kaushal A, Altman R, Langlotz C. Geographic distribution of US cohorts used to train deep learning algorithms. JAMA. 2020;324:1212–3. doi: 10.1001/jama.2020.12067.
- 28. Schmidhuber J. Deep learning in neural networks: an overview. Neural Netw. 2015;61:85–117. doi: 10.1016/j.neunet.2014.09.003.
- 29. Blaha MJ. The critical importance of risk score calibration: time for transformative approach to risk score validation? J Am Coll Cardiol. 2016;67:2131–4. doi: 10.1016/j.jacc.2016.03.479.
- 30. Laukkanen JA, Kunutsor SK. Is ‘re-calibration’ of standard cardiovascular disease (CVD) risk algorithms the panacea to improved CVD risk prediction and prevention? Eur Heart J. 2019;40:632–4. doi: 10.1093/eurheartj/ehy726.