Abstract
Artificial intelligence (AI), especially deep learning, has the potential to fundamentally alter clinical radiology. AI algorithms, which excel at quantifying complex patterns in data, have shown remarkable progress in applications ranging from self-driving cars to speech recognition. The AI application within radiology, known as radiomics, can provide detailed quantifications of the radiographic characteristics of underlying tissues. This information can be used throughout the clinical care path to improve diagnosis and treatment planning, as well as to assess treatment response. This tremendous potential for clinical translation has led to a vast increase in the number of research studies being conducted in the field, a number that is expected to rise sharply in the future. Many studies have reported robust and meaningful findings; however, a growing number also suffer from flawed experimental or analytical designs. Such errors can not only result in invalid discoveries but may also lead others to perpetuate similar flaws in their own work. This perspective article aims to increase awareness of the issue, identify potential reasons why this is happening, and provide a path forward.
Keywords: Radiology, Medical Imaging, Data Science, Deep Learning, Artificial Intelligence, Radiomics, Radiogenomics
Breakthroughs in artificial intelligence (AI) have the potential to fundamentally alter medical image analysis as well as the clinical practice of radiology. These methods excel at identifying complex patterns in images and at using this information to guide clinical decisions (1–3). AI encompasses quantitative image analysis, also known as radiomics (4–9), which involves either the application of predefined engineered algorithms (that often rely on input from expert radiologists) or the use of deep learning technologies that can automatically learn feature representations from example data (4). Consequently, AI is expected to play a key role in automating clinical tasks that presently can only be done by human experts (1,2). Such applications may aid the radiologist in making reproducible clinical assessments, thereby increasing the speed and accuracy of radiologic interpretation, and may assist the reader in situations that are difficult for human observers to interpret, such as predicting the malignancy of a particular lesion or the response to a particular therapy based on the patient’s total tumor burden (10–12).
The potential of AI has resulted in an explosion of investigations applying data science to radiologic research. The magnitude of this transformation is reflected in the large number of research studies published in recent years. Numerous papers have been published describing the automated detection of abnormalities (also known as CADe) (13–15), others the automated quantification of diseases (also known as CADx) (16–19), and still others linking radiologic data with genomic data (also known as imaging-genomics or radiogenomics) to define genotype-phenotype interactions (6,20–24). With promising results from these early studies and the increasing availability of imaging data (including large retrospective datasets with clinical endpoints), we expect radiomic investigations to continue to grow rapidly in number and complexity in the coming years.
Many examples of robust discoveries have emerged from studies with stringent analytical and experimental designs. However, a number of studies, including many published in high-impact journals, suffer from (avoidable) flaws in experimental design or analytical methodology that could potentially invalidate their findings. Among the most egregious examples are studies that employ imaging datasets too small to power a significant finding, e.g., hundreds of parameters evaluated in only a couple dozen samples. Others lack independent validation and present models that are trained and validated on the same data. Still others suffer from “information leakage” due to improper separation of training and validation datasets; a common example is selecting features from the same data used to evaluate performance. Errors are also made in statistical analyses, such as improper correction for multiple testing (or a failure to correct at all), which can lead to over-optimistically low p-values, or the reporting of incorrect statistical outcome measures. Such studies give rise to findings that cannot be replicated, ultimately weakening the perception of data science applications in radiology and threatening the credibility of other investigations in the field.
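To make the leakage pitfall concrete, consider the following minimal sketch. It is my own illustration rather than an analysis from any of the studies cited here, and it assumes Python with scikit-learn and purely synthetic data standing in for a typical radiomics matrix: many candidate features, few patients, and an outcome carrying no real signal. Selecting features on the full dataset before cross-validation inflates the apparent performance; nesting the selection inside the cross-validation loop removes the bias.

```python
# Minimal sketch of "information leakage" (illustrative assumptions only:
# synthetic data, arbitrary feature counts, generic scikit-learn estimators).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))   # 40 patients, 500 radiomic features of pure noise
y = rng.integers(0, 2, size=40)  # random binary outcome: the true AUC is 0.5

# WRONG: features are selected on the full dataset, so information from the
# held-out folds leaks into the model before cross-validation even starts.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
leaky_auc = cross_val_score(LogisticRegression(), X_leaky, y,
                            cv=5, scoring="roc_auc").mean()

# RIGHT: feature selection is wrapped inside the cross-validation loop via a
# Pipeline, so it is re-fit on the training folds only.
pipe = Pipeline([("select", SelectKBest(f_classif, k=10)),
                 ("clf", LogisticRegression())])
honest_auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()

print(f"leaky AUC:  {leaky_auc:.2f}")   # typically well above 0.5 despite pure noise
print(f"honest AUC: {honest_auc:.2f}")  # hovers around chance, as it should
```

On pure noise, the leaky estimate typically reports an AUC well above chance, exactly the kind of over-optimistic result described above, while the pipeline-based estimate stays near 0.5.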
These problems occur for a plethora of reasons, ranging from inadequate training to inadequate example studies to inadequate reporting and sharing. First, many researchers working in the medical imaging and radiology field have little or no formal background in data analysis and biostatistics. This gap in understanding creates a blind spot in which investigators fail to recognize what is required for good study design, or which methods are most appropriate for arriving at sound analytical results with a high likelihood of being replicated in an independent study. While human readers are well equipped to correlate a small number of imaging features with a limited number of pathologic findings, the thousands of sometimes non-intuitive imaging parameters that current technology can extract from imaging data, compounded by the complex relationships among them, require more sophisticated analytical methods.
Further, many of the journal editors and reviewers assigned to evaluate and critique these scientific studies are experts in radiology but have no expertise in data analysis methods. Consequently, mistakes in the analytical pipeline may go unnoticed during the editorial and review process, resulting in publications whose findings are not fully supported by the data or cannot be reproduced. This scenario sets up a vicious cycle in which other readers also fail to recognize experimental or analytical flaws, mistake the reported results for truth, and repeat the methodological errors in their own work. Indeed, a number of proof-of-concept radiomic studies with methodological shortcomings have been published, and one can now see these same errors repeated by others.
As is true for other scientific fields, current mechanisms for correcting published studies are inadequate. Although study flaws in published works may be quickly recognized by investigators with quantitative training or experience, those studies are rarely publicly challenged. There are few incentives to call into question the results of published papers, as doing so can cause great animosity. Nevertheless, there are some positive examples where a careful review of published studies can help us understand how to do better science. Chalkidou et al (25) performed an independent re-analysis of several published imaging studies and found that the majority had significant issues with the experimental design, potentially leading to incorrect results. It is important for the community to take notice of and support re-analyses such as these and to recognize their value in advancing our science.
A straightforward approach to advancing the quality of analytics in quantitative imaging is to create a culture in which there is a willingness to share primary data and software code. Much can be done to stimulate this culture by making data sharing a requirement for publication. Indeed, the technical aspects of sharing radiologic data are often feasible, and initiatives exist that can support investigators in this process, such as The Cancer Imaging Archive (TCIA) (26). Proper sharing ensures that the results of any single study can be recapitulated, since other investigators can test the reproducibility of the main findings. It also facilitates the rapid development of new, more robust methods as more data become available for integrated analyses.
As is true in other fields, such as genomics, editors and reviewers should demand the publication of all data (including medical images), code, and results to ensure full reproducibility. This level of disclosure is consistent with the guidelines of many scientific journals for other types of studies and reflects the NIH’s requirements for data sharing and reproducible research. Integrating these “best practices” into quantitative imaging will help ensure the quality and reliability of our studies and will increase the strength and influence of our work, as others use and cite our data and software.
As the saying goes, “the devil is in the details.” This is especially true in data science, where confounding errors are easy to generate but hard to detect, and where expertise and experience are required to identify and resolve them. The most important steps investigators must take are to acknowledge their limitations, know when to ask for external expert advice, and recognize that a poorly designed and analyzed study is of little value in the long run. Better awareness and education in data science and statistical methodologies will increase the overall quality of discoveries within radiology.
It is also important to establish guidelines that help investigators avoid common pitfalls and that recommend analysis strategies for medical imaging data. Guidelines covering data acquisition, data normalization, the development of robust features and models, and rigorous statistical analyses will increase the quality of these studies and allow imaging data to be better integrated with other data sources, such as clinical and genomic data.
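As a concrete illustration of the multiple-testing pitfall noted earlier, the sketch below (again using synthetic data and illustrative numbers, not results from any cited study) runs one univariate test per feature on pure noise. Roughly 5% of features appear “significant” at p < 0.05 before correction, while Benjamini-Hochberg false discovery rate control, here via statsmodels, correctly eliminates essentially all of them.

```python
# Minimal sketch of multiple-testing correction (illustrative assumptions:
# synthetic noise features, two groups drawn from the same distribution).
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
n_features = 1000
group_a = rng.normal(size=(25, n_features))  # 25 patients per group,
group_b = rng.normal(size=(25, n_features))  # with no true group difference

# One t-test per feature: under the null, ~5% fall below p < 0.05 by chance.
pvals = ttest_ind(group_a, group_b, axis=0).pvalue
print("uncorrected 'significant' features:", int(np.sum(pvals < 0.05)))  # ~50

# Benjamini-Hochberg FDR control: with no real signal, essentially none survive.
rejected, pvals_corrected, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("FDR-corrected significant features:", int(rejected.sum()))  # expected: 0
```

Reporting only the uncorrected count would manufacture dozens of spurious imaging biomarkers from noise, which is precisely the failure mode highlighted by Chalkidou and colleagues (25).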
While the points raised here may seem overly critical of radiology research, ours is not the first field to face such challenges. The most direct analogy, in my view, is the field of genomics, where early studies were underpowered, poorly analyzed, and non-replicable. As faith in genomic assays began to wane, the community came together and recognized the need for better standards for experimental design, reproducible research, data and code sharing, and good analytical practices (27–35). The widespread adoption of these practices by academic leaders and scholarly journals has led to genomic assays that are far more reproducible between laboratories, a necessity for potential clinical application.
Fortunately, there is growing appreciation of these issues and of the importance of better training in quantitative methodologies. Data science is becoming an important subject at leading radiology and image analysis conferences, and educational seminars are stimulating learning and the acquisition of new skills. It is likely that the knowledge base will continue to grow for both investigators and editors, improving the overall quality of new research studies.
If we do this right and continue to emphasize the importance of data science training, I believe our field has a bright future. We will never rid ourselves of all mistakes; however, if we avoid the major pitfalls, we can improve our credibility. This will not only lead to good science but could also ultimately reshape clinical radiology practice. Most important, it will lead to improved treatment and better outcomes for the patients we serve.
TRANSLATIONAL RELEVANCE
The future of radiology is bright, as novel artificial intelligence (AI) algorithms have the potential to increase the clinical efficacy and quality of radiology while making it faster and cheaper. Indeed, a large number of research groups are investigating how AI can improve radiologic quantifications, and several compelling discoveries have already been made. Unfortunately, for a variety of reasons, there has also been a large increase in studies with serious experimental design flaws. Prime examples include the use of small datasets that are underpowered or lack independent validation, and flawed statistical analyses such as improper multiple-testing correction. These errors can lead to invalid discoveries that are not replicable, weaken the perception of the field and the credibility of its investigations, and may even slow the clinical introduction of new technologies.
Acknowledgments
We acknowledge financial support from the National Institutes of Health (NIH-USA U24CA194354 and NIH-USA U01CA190234), and would like to thank Dr. Lawrence H. Schwartz and Dr. John Quackenbush for their insightful and helpful comments.
Footnotes
Declaration of Interests: H.J.W.L.A. declares no competing interests.
References
- 1. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44. doi: 10.1038/nature14539.
- 2. Rusk N. Deep learning. Nat Methods. 2015;13:35.
- 3. Lewis-Kraus G. The Great AI Awakening. New York Times; 2016.
- 4. Aerts HJWL. The Potential of Radiomic-Based Phenotyping in Precision Medicine: A Review. JAMA Oncol. 2016;2:1636–42. doi: 10.1001/jamaoncol.2016.2631.
- 5. Gillies RJ, Kinahan PE, Hricak H. Radiomics: Images Are More than Pictures, They Are Data. Radiology. 2015:151169. doi: 10.1148/radiol.2015151169.
- 6. Aerts HJWL, Velazquez ER, Leijenaar RTH, Parmar C, Grossmann P, Carvalho S, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun. 2014;5:4006. doi: 10.1038/ncomms5006.
- 7. Yip SSF, Aerts HJWL. Applications and limitations of radiomics. Phys Med Biol. 2016;61:R150–66. doi: 10.1088/0031-9155/61/13/R150.
- 8. Shaikh FA, Kolowitz BJ, Awan O, Aerts HJ, von Reden A, Halabi S, et al. Technical Challenges in the Clinical Application of Radiomics. JCO Clin Cancer Inform. 2017:1–8. doi: 10.1200/CCI.17.00004.
- 9. Lambin P, Rios-Velazquez E, Leijenaar R, Carvalho S, van Stiphout RGPM, Granton P, et al. Radiomics: extracting more information from medical images using advanced feature analysis. Eur J Cancer. 2012;48:441–6. doi: 10.1016/j.ejca.2011.11.036.
- 10. Jha S, Topol EJ. Adapting to Artificial Intelligence: Radiologists and Pathologists as Information Specialists. JAMA. 2016;316:2353–4. doi: 10.1001/jama.2016.17438.
- 11. Beam AL, Kohane IS. Translating Artificial Intelligence Into Clinical Care. JAMA. 2016;316:2368–9. doi: 10.1001/jama.2016.17217.
- 12. Liu Y, Balagurunathan Y, Atwater T, Antic S, Li Q, Walker RC, et al. Radiological Image Traits Predictive of Cancer Status in Pulmonary Nodules. Clin Cancer Res. 2017;23:1442–9. doi: 10.1158/1078-0432.CCR-15-3102.
- 13. Shin H-C, Roth HR, Gao M, Lu L, Xu Z, Nogues I, et al. Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning. IEEE Trans Med Imaging. 2016;35:1285–98. doi: 10.1109/TMI.2016.2528162.
- 14. Summers RM. Deep Learning and Computer-Aided Diagnosis for Medical Image Processing: A Personal Perspective. Advances in Computer Vision and Pattern Recognition. 2017:3–10.
- 15. Firmino M, Morais AH, Mendoça RM, Dantas MR, Hekis HR, Valentim R. Computer-aided detection system for lung cancer in computed tomography scans: review and future prospects. Biomed Eng Online. 2014;13:41. doi: 10.1186/1475-925X-13-41.
- 16. Amir GJ, Lehmann HP. After Detection: The Improved Accuracy of Lung Cancer Assessment Using Radiologic Computer-aided Diagnosis. Acad Radiol. 2015. doi: 10.1016/j.acra.2015.10.014.
- 17. Gupta S, Chyn PF, Markey MK. Breast cancer CADx based on BI-RADS™ descriptors from two mammographic views. Med Phys. 2006;33:1810–7. doi: 10.1118/1.2188080.
- 18. Gruszauskas NP, Drukker K, Giger ML. Robustness Studies of Ultrasound CADx in Breast Cancer Diagnosis. In: Advances in Bioinformatics and Biomedical Engineering. p. 1–22.
- 19. Lo S-CB, Freedman MT, Kinnard L, Makariou E. Mammographic CADx system using an image library with an intelligent agent: A pattern matching approach. In: Medical Imaging 2006: Image Processing. 2006. doi: 10.1117/12.654667.
- 20. O’Connor JPB, Aboagye EO, Adams JE, Aerts HJWL, Barrington SF, Beer AJ, et al. Imaging biomarker roadmap for cancer studies. Nat Rev Clin Oncol. 2017;14:169–86. doi: 10.1038/nrclinonc.2016.162.
- 21. Rios Velazquez E, Parmar C, Liu Y, Coroller TP, Cruz G, Stringfield O, et al. Somatic Mutations Drive Distinct Imaging Phenotypes in Lung Cancer. Cancer Res. 2017. doi: 10.1158/0008-5472.CAN-17-0122.
- 22. Grossmann P, Stringfield O, El-Hachem N, Bui MM, Rios Velazquez E, Parmar C, et al. Defining the biological basis of radiomic phenotypes in lung cancer. eLife. 2017;6. doi: 10.7554/eLife.23421.
- 23. Gutman DA, Dunn WD Jr, Grossmann P, Cooper LAD, Holder CA, Ligon KL, et al. Somatic mutations associated with MRI-derived volumetric features in glioblastoma. Neuroradiology. 2015;57:1227–37. doi: 10.1007/s00234-015-1576-7.
- 24. Aerts HJWL, Grossmann P, Tan Y, Oxnard GG, Rizvi N, Schwartz LH, et al. Defining a Radiomic Response Phenotype: A Pilot Study using targeted therapy in NSCLC. Sci Rep. 2016;6:33860. doi: 10.1038/srep33860.
- 25. Chalkidou A, O’Doherty MJ, Marsden PK. False Discovery Rates in PET and CT Studies with Texture Features: A Systematic Review. PLoS One. 2015;10:e0124165. doi: 10.1371/journal.pone.0124165.
- 26. Prior F, Smith K, Sharma A, Kirby J, Tarbox L, Clark K, et al. The public cancer radiology imaging collections of The Cancer Imaging Archive. Sci Data. 2017;4:170124. doi: 10.1038/sdata.2017.124.
- 27. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, et al. Minimum information about a microarray experiment (MIAME): toward standards for microarray data. Nat Genet. 2001;29:365–71. doi: 10.1038/ng1201-365.
- 28. Berrar DP, Dubitzky W, Granzow M. A Practical Approach to Microarray Data Analysis. Springer Science & Business Media; 2007.
- 29. Stekel D. Data Standards, Storage and Sharing. In: Microarray Bioinformatics. p. 231–52.
- 30. Quackenbush J. Microarray data normalization and transformation. Nat Genet. 2002;32:496–501. doi: 10.1038/ng1032.
- 31. Quackenbush J. Computational analysis of microarray data. Nat Rev Genet. 2001;2:418–27. doi: 10.1038/35076576.
- 32. Hegde P, Qi R, Abernathy K, Gay C, Dharap S, Gaspard R, et al. A concise guide to cDNA microarray analysis. Biotechniques. 2000;29:548–50, 552–4, 556 passim. doi: 10.2144/00293bi01.
- 33. Allison DB, Cui X, Page GP, Sabripour M. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet. 2006;7:55–65. doi: 10.1038/nrg1749.
- 34. Lee ML, Kuo FC, Whitmore GA, Sklar J. Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations. Proc Natl Acad Sci U S A. 2000;97:9834–9. doi: 10.1073/pnas.97.18.9834.
- 35. Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet. 2010;11:733–9. doi: 10.1038/nrg2825.