Author manuscript; available in PMC 2021 Jan 1.
Published in final edited form as: Clin Radiol. 2019 Apr 28;75(1):7–12. doi: 10.1016/j.crad.2019.04.002

Open access image repositories: high-quality data to enable machine learning research

F Prior a,*, J Almeida b, P Kathiravelu c, T Kurc d, K Smith a, T J Fitzgerald e, J Saltz d
PMCID: PMC6815686  NIHMSID: NIHMS1528096  PMID: 31040006

Abstract

Originally motivated by the need for research reproducibility and data reuse, large-scale, open access information repositories have become key resources for training and testing of advanced machine learning applications in biomedical and clinical research. To be of value, such repositories must provide large, high-quality data sets, where quality is defined as minimising variance due to data collection protocols and data misrepresentations. Curation is the key to quality. We have constructed a large public access image repository, The Cancer Imaging Archive, dedicated to the promotion of open science to advance the global effort to diagnose and treat cancer. Drawing on this experience, and on our experience in applying machine learning techniques to the analysis of radiology and pathology image data, we review the requirements placed on such information repositories by state-of-the-art machine learning applications and how these requirements can be met.

INTRODUCTION

Imaging data and, in particular, quantitative features extracted by image analysis have been identified as a critical source of information for classification (imaging phenotypes), cancer diagnosis, and tracking response to therapy.1–4 Radiomics and pathomics, in which quantitative features are algorithmically extracted from radiology and pathology imaging studies, provide valuable diagnostic and prognostic indicators of cancer.5–14,15,16 Machine learning algorithms have shown promising results in the analysis of digitised tissue specimens, with better performance than most traditional image analysis techniques.17–19 Deep learning-based pipelines are also gaining increased acceptance and use in radiology.14,20,21 Identifying quantitative imaging phenotypes across scales through the use of machine learning is a rapidly evolving approach to improving our understanding of cancer biology.1,2

In 2010, the US National Cancer Institute recognised that open access to radiological images and other supporting data, such as demographics, outcomes, and clinical trial data, was required to promote cancer research. A contract was issued to create The Cancer Imaging Archive (TCIA), which has supported open science and cancer research since it went live in 2011 by acquiring, curating, hosting, and distributing collections of multi-modal information.22,23 TCIA adheres to the FAIR principles (Findable, Accessible, Interoperable, Reusable).24 To make data findable and accessible, TCIA provides a web user interface, a visual query-based user interface,25 and an application programming interface (API)26 to identify and retrieve images and related data. TCIA also serves as a data publisher27 and uses digital object identifiers (DOIs)27 to reference data collections in the literature and enable direct retrieval of the referenced data. All data within TCIA have been fully de-identified using tools and procedures validated to comply with US and international privacy laws.28 Although the number of on-line image repositories is growing rapidly (e.g.,29–35), to the authors' knowledge no other data management service supports the same range of image data types, rich metadata, robust curation processes, and breadth of human and computer access methods as TCIA.
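As a concrete illustration of the API-based access pattern, the short Python sketch below walks the find-then-retrieve path: list collections, locate image series, and download one series. The endpoint names and parameters follow TCIA's publicly documented v4 query API, and the collection name is only an example; both should be verified against the current TCIA documentation.

```python
# A minimal sketch of programmatic access to TCIA via its public REST API.
# Endpoint names (getCollectionValues, getSeries, getImage) follow the
# commonly documented v4 query API; check the current TCIA documentation
# before relying on them.
import requests

BASE = "https://services.cancerimagingarchive.net/services/v4/TCIA/query"

# 1. List the public collections hosted by the archive.
collections = requests.get(f"{BASE}/getCollectionValues",
                           params={"format": "json"}).json()
print([c["Collection"] for c in collections[:5]])

# 2. Find image series in one collection (the collection name is an example).
series = requests.get(f"{BASE}/getSeries",
                      params={"Collection": "TCGA-GBM",
                              "Modality": "MR",
                              "format": "json"}).json()

# 3. Download one series as a zip of DICOM files.
uid = series[0]["SeriesInstanceUID"]
resp = requests.get(f"{BASE}/getImage", params={"SeriesInstanceUID": uid})
with open("series.zip", "wb") as f:
    f.write(resp.content)
```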

The rapid growth of quantitative image analysis based on machine learning has extended the mission of TCIA to include providing data for the training and testing of new algorithms.36 This has allowed us to better understand the limitations that existing data resources impose on new algorithm development.

QUANTITATIVE IMAGE ANALYSIS BY MACHINE LEARNING

The use of computers to aid in the detection of regions of clinical interest in images, and in diagnosis, was introduced in the 1960s.37 The field drew heavily on work in computer vision and became known as CAD (computer-aided diagnosis,38 computer-aided detection39). Since the first system was approved by the US Food and Drug Administration (FDA) in 1998,40 CAD has been part of radiology, with many approved systems in clinical practice.40,41 CAD research generated many of the machine learning tools we use today. CAD also extended to pathology image analysis.42,43 In spite of years of research and development, the number of clinically successful CAD products with FDA approval is rather limited.44

The new emphasis in medical imaging research is radiomics,10,11 in which features are extracted from images either by quantitative measurement of objects of interest in the image12,16 or by deep learning algorithms that learn features to support classification and risk assessment.14,21 Image-derived features are considered more useful for cross-scale analyses, e.g., radiogenomics.20,45,46 Machine learning is also being used to remove image artefacts and enhance image quality.44

Deep learning models are increasingly employed in imaging-based quantitative analysis pipelines. As a specific case of machine learning, deep learning algorithms learn informative representations directly from the data, unlike classic machine learning approaches in which human experts perform the feature engineering.19 In pathology applications, such models are employed to segment tumours, to identify niches in tissue such as tumour-infiltrating lymphocytes,13 or to segment epithelial versus stromal tissue regions.47 In other cases, the models are directly employed to predict patient outcome or response to treatment.48,49

Machine learning and other branches of artificial intelligence have found numerous applications in diagnostic decision support beyond radiomics and pathomics.50 These applications include patient similarity analysis51 and cognitive assistants.50

DATA REQUIREMENTS FOR MACHINE LEARNING

Deep learning methods require large, representative, and accurately annotated training datasets to train robust models and achieve acceptable performance.40 Goodfellow et al.52 proposed the following rule of thumb: "a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labelled examples per category and will match or exceed human performance when trained with a dataset containing at least 10 million labelled examples." Why are such large datasets required? For any machine learning algorithm to perform well enough to be clinically useful, it must be trained on data that appropriately represent the variance in the human population, in the presentation of disease (target and comorbidities), and in the data collection systems themselves.10,40,53–55 Such large data requirements pose a challenge to the medical imaging research community and form the basis for the call for large-scale data sharing.10,56–58

Given the massive clinical image collections available in healthcare systems around the world, acquiring large datasets would not seem to pose a problem. Although some argue that clinical data must be used so that algorithms explicitly deal with the huge inconsistency in acquisition protocols inherent in such data,10 it is widely held that imaging must be of sufficient quality, and acquired with uniform parameters as in a clinical trial, to ensure that conclusions drawn from artificial intelligence can be validated.55 Whether the data come from clinical practice or clinical trials, large-scale data sharing is essential,12,40,56 and the problem that collections do not truly represent the population, i.e., the lack of healthy controls, remains.

Supervised learning techniques, commonly used in radiomic/pathomic studies, are hampered by the lack of labelled data for training and testing.44 In general, training data must be created manually by human experts, resulting in high cost and a limited volume of high-quality training (and testing) datasets.44,56 It is well established that agreement among human experts is poor, both for the identification of objects of interest and for analysis tasks such as segmentation.10,40 Manual segmentations are also often imprecise.10,56
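Inter-observer agreement of this kind is commonly quantified with overlap measures such as the Dice coefficient. The short Python sketch below shows the basic computation, with synthetic masks standing in for real expert contours.

```python
# A small illustration of quantifying inter-observer agreement on a
# segmentation task using the Dice coefficient; the masks are synthetic.
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """Dice overlap between two binary masks (1.0 = perfect agreement)."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom

# Two hypothetical expert contours of the same lesion.
expert1 = np.zeros((128, 128), dtype=bool); expert1[40:80, 40:80] = True
expert2 = np.zeros((128, 128), dtype=bool); expert2[45:85, 38:78] = True
print(f"Dice agreement: {dice(expert1, expert2):.3f}")
```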

The reproducibility of radiomics/pathomics analyses is difficult to assess owing to a lack of standards for validating results.12,58 Fundamentally, there is no reference standard of truth against which to evaluate algorithms.44,59 One approach widely used in computer science is the establishment of open access benchmark databases against which to compare algorithm performance.43 On the other hand, the FDA has questioned whether testing data should be sequestered (i.e., retained as private, proprietary information), particularly for evaluating algorithms that require FDA approval.56 Thus, intellectual property concerns limit access to valuable datasets. Several authors have discussed the problems of defining ground truth standards in medical imaging.60,61 Approaches to modelling truth by combining observations from multiple observers (machine and human) attempt to create standards that do not penalise algorithms that outperform the human observer.62,63
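As a minimal illustration of the multiple-observer idea, the sketch below estimates a consensus mask by per-pixel majority vote. This is a deliberately simple stand-in for principled estimators such as STAPLE,62 which additionally weight observers by their estimated performance.

```python
# A much simpler stand-in for STAPLE: estimate a reference mask by
# per-pixel majority vote over several observers' binary segmentations.
import numpy as np

def majority_vote(masks: list[np.ndarray]) -> np.ndarray:
    """Label a pixel as foreground when more than half the observers do."""
    stack = np.stack([m.astype(np.uint8) for m in masks])
    return stack.sum(axis=0) > (len(masks) / 2)

# Three hypothetical observers with increasing per-pixel error rates.
rng = np.random.default_rng(0)
truth = np.zeros((64, 64), dtype=bool); truth[20:40, 20:40] = True
observers = [truth ^ (rng.random(truth.shape) < p) for p in (0.02, 0.05, 0.15)]
consensus = majority_vote(observers)
print("consensus foreground pixels:", int(consensus.sum()))
```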

There are some unique challenges in developing high-quality training datasets in digital pathology. First, tissue images capture much denser information than many other imaging methods. State-of-the-art digitising microscopes can scan a whole tissue slide at resolutions ranging from 30,000×30,000 pixels to over 100,000×100,000 pixels. A whole-slide tissue image can contain more than a million cells, nuclei, and other cellular-level structures; the sheer number of objects makes it impossible to manually segment and annotate each and every one. Second, there is heterogeneity across tissue specimens and even within a specimen. A whole-slide tissue specimen may contain multiple types of region: normal tissue, tumour, stroma, etc. The development of representative datasets for machine/deep learning has to take this heterogeneity into account. Third, tissue specimens go through a processing phase before they are imaged. Variations in tissue preparation lead to artefacts in images, such as tissue folds, poor staining, and pen markings. In addition, there are variations in image quality and characteristics, such as sharpness and colour profiles, across digitising microscopes. Even when images are obtained with the same digitising microscope, there can be artefacts such as out-of-focus or blurred regions, and the image capture performance of a microscope can degrade over time, requiring tuning at intervals. All these factors may lead to batch effects, i.e., images of the same tissue type coming from different source sites may show significant variations in colour, sharpness, etc. Consistent and unified protocols are necessary to reduce artefacts in the image datasets used to generate training data. Lastly, there is no universally accepted standard image file format in digital pathology, despite ongoing efforts.64–66 Almost every vendor in the digital pathology imaging domain has its own file format; there can even be variations between the file formats of different digitising microscopes from the same vendor. Libraries have been developed that can parse certain vendor formats,67,68 and tools and applications have been developed to view and annotate pathology image data.68–71 Nevertheless, the lack of a standard data format remains a major challenge in curating training datasets.
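For example, OpenSlide67 hides several of these vendor formats behind a single API. The sketch below (the file path and tile coordinates are placeholders) reads slide metadata and extracts a single tile, since images of this size cannot be loaded whole.

```python
# Reading a vendor whole-slide image with OpenSlide, which parses several
# proprietary formats behind one API. The file path is a placeholder.
import openslide

slide = openslide.OpenSlide("example.svs")
print("full resolution:", slide.dimensions)          # can exceed 100,000 x 100,000 px
print("pyramid levels:", slide.level_count, slide.level_dimensions)

# Whole-slide images are too large to load at once; read one tile instead.
# (x, y) addresses level 0; the tile is returned as an RGBA PIL image.
tile = slide.read_region((30_000, 30_000), 0, (1024, 1024)).convert("RGB")
tile.save("tile.png")
slide.close()
```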

CAREFUL IMAGE CURATION PERMITS DATA AGGREGATION

Careful curation and strict quality-control processes are two key activities essential to the success of any information resource. When data adhere to a standard (e.g., Digital Imaging and Communications in Medicine [DICOM]72), de-identification can be performed unambiguously and completely in compliance with that standard (i.e., DICOM Standard PS 3.15-2011, Part 15: Security and System Management Profiles), making it possible to generate publicly shareable data at large scale. Following industry best practices, TCIA uses a standards-based approach to de-identification of DICOM objects to ensure that all managed objects are free of protected health information (PHI). The TCIA de-identification process ensures that the HIPAA de-identification standard is met by following the Safe Harbor Method, as defined in section 164.514(b)(2) of the Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy Rule. TCIA redacts PHI while retaining as much data as possible to preserve the scientific usability of the data.73
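To make the tag-level mechanics concrete, the Python sketch below uses the pydicom library to blank a handful of identifying attributes. It is an illustration only, not TCIA's pipeline, and nowhere near a complete PS 3.15 or Safe Harbor implementation, which must also handle UIDs, dates, the full attribute profile, and PHI burned into the pixel data.

```python
# Illustrative tag-level de-identification with pydicom. This touches only
# a few attributes and is NOT a complete PS 3.15 / HIPAA Safe Harbor
# implementation; production curation (as in TCIA/Posda) covers the full
# profile, private tags, UIDs, and burned-in pixel annotations.
import pydicom

def crude_deidentify(path_in: str, path_out: str, subject_id: str) -> None:
    ds = pydicom.dcmread(path_in)
    # Replace direct identifiers with a study-assigned pseudonym.
    ds.PatientName = subject_id
    ds.PatientID = subject_id
    # Remove a few obvious PHI-bearing attributes if present.
    for tag in ("PatientBirthDate", "PatientAddress", "OtherPatientIDs",
                "ReferringPhysicianName", "InstitutionName"):
        if tag in ds:
            delattr(ds, tag)
    # Drop all private tags, whose content is vendor-defined.
    ds.remove_private_tags()
    ds.save_as(path_out)
```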

TCIA and other research projects74 use the open source Posda framework75,76 to implement curation workflows for all DICOM-defined objects. Posda incorporates DICOM validation rules and maintains an up-to-date dictionary of DICOM private tags; these underpin DICOM validation and guide the de-identification processes. Posda supports redaction of PHI; validation and correction of linkages to referenced objects; correction of inconsistencies at the DICOM series, study, and patient levels; analysis and correction of DICOM encoding errors; provenance tracking; and prioritisation of multiple data streams.

During curation, it is important to ensure that data objects are usable and properly linked to supporting data; for example, DICOM structure sets must be properly linked to the associated imaging. It is also important to ensure that the same imaging data are not represented within an archive (or across archives) under different subject IDs. TCIA has received submissions of the same data from different research groups with different IDs assigned. This duplication was discovered using digests of the pixel data, and the data were harmonised.
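The pixel-digest idea can be sketched in a few lines of Python with pydicom: hashing only the PixelData element lets identical images surface even when header fields such as patient IDs differ. A production check, as in Posda, would also need to account for differing transfer syntaxes, e.g., by hashing decompressed pixels.

```python
# Sketch of duplicate detection by digesting only the pixel data, so the
# same images resurface even when patient IDs and other headers differ.
import hashlib
from pathlib import Path
import pydicom

def pixel_digest(path: Path) -> str:
    """SHA-256 of the raw PixelData element of one DICOM file."""
    ds = pydicom.dcmread(path)
    return hashlib.sha256(ds.PixelData).hexdigest()

def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under `root` that share an identical pixel digest."""
    seen: dict[str, list[Path]] = {}
    for f in root.rglob("*.dcm"):
        seen.setdefault(pixel_digest(f), []).append(f)
    return {h: fs for h, fs in seen.items() if len(fs) > 1}
```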

One mechanism for creating large datasets from smaller ones is through a collection of digital object identifiers (DOIs). If all accessible datasets have been published with DOIs, one can search within resources such as DataCite or EZID to find and retrieve large datasets from wherever they are hosted. When publishing results from machine learning algorithms, DOIs should be referenced to facilitate validation of results.
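For instance, DataCite exposes a public REST search endpoint over registered DOIs; the sketch below issues a keyword query. The URL and response fields follow DataCite's documented JSON:API conventions, and the query string is illustrative.

```python
# Finding DOI-registered datasets through the DataCite REST API. Response
# fields follow DataCite's JSON:API format; check its current documentation.
import requests

resp = requests.get("https://api.datacite.org/dois",
                    params={"query": "cancer imaging", "page[size]": 5})
for rec in resp.json()["data"]:
    attrs = rec["attributes"]
    titles = attrs.get("titles") or [{}]
    # rec["id"] is the DOI itself, resolvable via https://doi.org/<id>.
    print(rec["id"], "-", titles[0].get("title", ""))
```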

Not all image data are acquired in DICOM format. Metadata play a significant role in ensuring open and easy access to such image data, enabling access to and sharing of, for example, multi-dimensional microscopy image data stored in a variety of proprietary file formats.77

CAPTURING LABELLED DATA

In addition to collecting large volumes of image data and cross-linking compatible data from multiple collections, it is essential to collect and publicly distribute labelled data. The generation of labelled data remains a roadblock. Two approaches have shown promising results: crowdsourcing, and augmentation combined with synthetic data generation. Crowdsourcing employs large populations of experts (e.g., at national meetings) or citizen scientists to perform data labelling tasks, including segmentation.78,79 The Crowds Cure Cancer project,80 conducted at RSNA 2017 and 2018, used the experts attending this professional conference to label image data provided by TCIA. With potentially hundreds of participants labelling each dataset, the variance in label estimates can be adequately modelled. Challenge competitions are another source of data labelled by consensus.36 Augmentation and synthetic data generation are potential approaches to alleviating the cost of generating training and test datasets (see the sketch below).81 Nevertheless, tools and methods are needed to review and interactively refine results from a machine/deep learning algorithm applied to datasets consisting of hundreds to thousands of images.70
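As a minimal example of augmentation, the sketch below expands one labelled 2D image into its eight rotation/flip variants. Which transforms are label-preserving is task-dependent; flips, for instance, may be inappropriate where laterality matters.

```python
# A minimal data-augmentation sketch: each labelled image yields several
# geometrically transformed copies, stretching a scarce labelled set.
import numpy as np

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Return the 8 dihedral variants (rotations + flips) of a 2D image."""
    variants = []
    for k in range(4):                       # 0, 90, 180, 270 degree rotations
        rotated = np.rot90(image, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # plus a horizontal flip of each
    return variants

image = np.arange(16, dtype=np.float32).reshape(4, 4)
print(len(augment(image)), "training examples from one labelled image")
```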

FUTURE DEVELOPMENTS

Management of labelled training and test sets, and of the resulting image-derived features, raises special curation and validation issues. Currently, standards and standard operating processes for the representation, curation, evaluation, and sharing of high-quality labelled datasets are in the early stages of development and leverage the curation pipelines developed for imaging data.82 Basic processes, such as checking whether all required metadata elements are included and whether the dataset can be parsed correctly, can be adopted from the image curation pipelines.

Centralised open access information repositories provide a single point for search and retrieval, but are hampered by variations in international regulations and policies governing patient privacy and data sharing, and they potentially limit access through the need to physically move large quantities of data over the internet. Augmenting centralised repositories with distributed, loosely federated databases allows more local access and control. Such a federated environment should still provide access to data through a minimum set of ad hoc, community-accepted interfaces and a minimum set of metadata elements, so that data of interest can be searched, retrieved, and interpreted. Open access to data is also often limited by organisational policies on data access and sharing. Research has proposed hybrid on-demand data integration from distributed data sources, and service-based data sharing that avoids replicating content, for biomedical research data.83 We posit that such approaches will complement open access image repositories in further empowering machine learning research.

CONCLUSION

Machine learning algorithms require access to large amounts of data for training and testing, drawn from multiple cross-linked, multi-modal archives containing radiology, pathology, genomics, and radiomics data, together with other metadata such as survival data. These data must be easy to query so that the correct information can be retrieved. If the results of machine learning are to be generalisable, the data on which they are based must be of sufficient quality, or acquired with uniform parameters as in a clinical trial, and publicly available so that the algorithms, models, and conclusions can be tested and validated by the research community. No reference standard of truth exists against which to validate new machine learning based quantitative analyses; validation sets with estimates of the variance of the labels, together with synthetic data, are the best approximations currently available. Systems such as The Cancer Imaging Archive are capable of managing these data and making them freely available to the research community.

Highlights.

  • Machine learning algorithms, in particular deep learning methods, have shown promising results in both radiology and pathology image analysis, but the number of clinically successful AI products with FDA approval is limited.

  • Access to appropriate data for training, testing and evaluation is a key limitation to the field.

  • Open access information repositories such as The Cancer Imaging Archive support the collection and curation of both the large datasets and the labelled data needed for training and testing machine learning algorithms.

ACKNOWLEDGEMENTS

This work was supported in part by the National Cancer Institute, National Institutes of Health contract no. HHSN261200800001E, subcontract 16X011; National Cancer Institute 1U01CA187013 and 1U24CA215109.

Declaration of interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

REFERENCES

1. Thrall JH. Personalized medicine. Radiology. 2004;231(3):613–616.
2. Thrall JH. Trends and developments shaping the future of diagnostic medical imaging: 2015 Annual Oration in Diagnostic Radiology. Radiology. 2016;279(3):660–666.
3. Herold CJ, Lewin JS, Wibmer AG, et al. Imaging in the age of precision medicine: summary of the proceedings of the 10th Biannual Symposium of the International Society for Strategic Studies in Radiology. Radiology. 2015:150709.
4. Bi WL, Hosny A, Schabath MB, et al. Artificial intelligence in cancer imaging: clinical challenges and applications. CA Cancer J Clin. 2019;69:127–157.
5. Cooper L, Kong J, Gutman D, et al. An integrative approach for in silico glioma research. IEEE Trans Biomed Eng. 2010;57(10):2617–2621.
6. Cooper LA, Kong J, Gutman DA, et al. Integrated morphologic analysis for the identification and characterization of disease subtypes. J Am Med Inform Assoc. 2012;19(2):317–323.
7. Aerts HJ, Velazquez ER, Leijenaar RT, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun. 2014;5:4006.
8. Parmar C, Leijenaar RT, Grossmann P, et al. Radiomic feature clusters and prognostic signatures specific for lung and head & neck cancer. Sci Rep. 2015;5:11044.
9. Parmar C, Rios Velazquez E, Leijenaar R, et al. Robust radiomics feature quantification using semiautomatic volumetric segmentation. PLoS One. 2014;9(7):e102107.
10. Kumar V, Gu Y, Basu S, et al. Radiomics: the process and the challenges. Magn Reson Imaging. 2012;30(9):1234–1248.
11. Lambin P, Rios-Velazquez E, Leijenaar R, et al. Radiomics: extracting more information from medical images using advanced feature analysis. Eur J Cancer. 2012;48(4):441–446.
12. Gillies RJ, Kinahan PE, Hricak H. Radiomics: images are more than pictures, they are data. Radiology. 2016;278(2):563–577.
13. Saltz J, Gupta R, Hou L, et al. Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell Rep. 2018;23(1):181.
14. Causey JL, Zhang J, Ma S, et al. Highly accurate model for prediction of lung nodule malignancy with CT scans. Sci Rep. 2018;8(1):9286.
15. Singanamalli A, Rusu M, Sparks RE, et al. Identifying in vivo DCE MRI markers associated with microvessel architecture and Gleason grades of prostate cancer. J Magn Reson Imaging. 2016;43(1):149–158.
16. Kalpathy-Cramer J, Mamomov A, Zhao B, et al. Radiomics of lung nodules: a multi-institutional study of robustness and agreement of quantitative imaging features. Tomography. 2016;2(4):430–437.
17. Janowczyk A, Madabhushi A. Deep learning for digital pathology image analysis: a comprehensive tutorial with selected use cases. J Pathol Inform. 2016;7:29.
18. Litjens G, Kooi T, Bejnordi BE, et al. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88.
19. Shen D, Wu G, Suk H-I. Deep learning in medical image analysis. Annu Rev Biomed Eng. 2017;19:221–248.
20. Napel S, Mu W, Jardim-Perassi BV, Aerts HJ, Gillies RJ. Quantitative imaging of cancer in the postgenomic era: radio(geno)mics, deep learning, and habitats. Cancer. 2018;124(24):4633–4649.
21. Hosny A, Parmar C, Coroller TP, et al. Deep learning for lung cancer prognostication: a retrospective multi-cohort radiomics study. PLoS Med. 2018;15(11):e1002711.
22. Prior FW, Clark K, Commean P, et al. TCIA: an information resource to enable open science. Conf Proc IEEE Eng Med Biol Soc. 2013;2013:1282–1285.
23. Clark K, Vendt B, Smith K, et al. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J Digit Imaging. 2013;26(6):1045–1057.
24. Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018.
25. Commean PK, Rathmell JM, Clark KW, Maffitt DR, Prior FW. A query tool for investigator access to the data and images of the National Lung Screening Trial. J Digit Imaging. 2015:1–9.
26. Kathiravelu P, Sharma A. MEDIator: a data sharing synchronization platform for heterogeneous medical image archives. In: Workshop on Connected Health at Big Data Era (BigCHat'15), co-located with the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2015). ACM, 2015.
27. Prior F, Smith K, Sharma A, et al. The public cancer radiology imaging collections of The Cancer Imaging Archive. Sci Data. 2017;4:170124.
28. Bennett W, Smith K, Jarosz Q, Nolan T, Bosch W. Reengineering workflow for curation of DICOM datasets. J Digit Imaging. 2018:1–9.
29. Toga AW. The clinical value of large neuroimaging data sets in Alzheimer's disease. Neuroimaging Clin N Am. 2012;22(1):107.
30. Grethe JS, Baru C, Gupta A, et al. Biomedical informatics research network: building a national collaboratory to hasten the derivation of new understanding and treatment of disease. Stud Health Technol Inform. 2005;112:100–110.
31. Marcus D, Harwell J, Olsen T, et al. Informatics and data mining tools and strategies for the Human Connectome Project. Front Neuroinform. 2011;5(4):1–12.
32. Marcus DS, Wang TH, Parker J, Csernansky JG, Morris JC, Buckner RL. Open Access Series of Imaging Studies (OASIS): cross-sectional MRI data in young, middle aged, nondemented, and demented older adults. J Cogn Neurosci. 2007;19(9):1498–1507.
33. Hall D, Huerta MF, McAuliffe MJ, Farber GK. Sharing heterogeneous data: the national database for autism research. Neuroinformatics. 2012;10(4):331–339.
34. Korfiatis PD, Kline TL, Blezek DJ, Langer SG, Ryan WJ, Erickson BJ. MIRMAID: a content management system for medical image analysis research. RadioGraphics. 2015;35(5):1461–1468.
35. Roelofs E, Dekker A, Meldolesi E, van Stiphout RGPM, Valentini V, Lambin P. International data-sharing for radiotherapy research: an open-source based infrastructure for multicentric clinical data mining. Radiother Oncol. 2014;110(2):370–374.
36. Kalpathy-Cramer J, Freymann JB, Kirby JS, Kinahan PE, Prior FW. Quantitative Imaging Network: data sharing and competitive algorithm validation leveraging The Cancer Imaging Archive. Transl Oncol. 2014;7(1):147–152.
37. Lodwick GS. Computer-aided diagnosis in radiology: a research plan. Invest Radiol. 1966;1(1):72–80.
38. Boyer B, Balleyguier C, Granat O, Pharaboz C. CAD in questions/answers: review of the literature. Eur J Radiol. 2009;69(1):24–33.
39. Ciatto S, Del Turco MR, Burke P, Visioli C, Paci E, Zappa M. Comparison of standard and double reading and computer-aided detection (CAD) of interval cancers at prior negative screening mammograms: blind review. Br J Cancer. 2003;89(9):1645.
40. Giger ML, Chan HP, Boone J. Anniversary paper: history and status of CAD and quantitative image analysis: the role of Medical Physics and AAPM. Med Phys. 2008;35(12):5799–5820.
41. Doi K. Computer-aided diagnosis in medical imaging: historical review, current status and future potential. Comput Med Imaging Graph. 2007;31(4-5):198–211.
42. AlZubaidi AK, Sideseq FB, Faeq A, Basil M. Computer aided diagnosis in digital pathology application: review and perspective approach in lung cancer classification. In: 2017 Annual Conference on New Trends in Information & Communications Technology Applications (NTICT). Piscataway: IEEE, 2017; pp. 219–224.
43. Xing F, Yang L. Robust nucleus/cell detection and segmentation in digital pathology and microscopy images: a comprehensive review. IEEE Rev Biomed Eng. 2016;9:234–263.
44. Chan S, Siegel EL. Will machine learning end the viability of radiology as a thriving medical specialty? Br J Radiol. 2018;91:20180416.
45. Colen R, Foster I, Gatenby R, et al. NCI Workshop report: clinical and computational requirements for correlating imaging phenotypes with genomics signatures. Transl Oncol. 2014;7(5):556–569.
46. Saltz J, Almeida J, Gao Y, et al. Towards generation, management, and exploration of combined radiomics and pathomics datasets for cancer research. In: AMIA 2017 Joint Summits on Translational Science, San Francisco. Bethesda, MD: AMIA, 2017.
47. Xu J, Luo X, Wang G, Gilmore H, Madabhushi A. A deep convolutional neural network for segmenting and classifying epithelial and stromal regions in histopathological images. Neurocomputing. 2016;191:214–223.
48. Mobadersany P, Yousefi S, Amgad M, et al. Predicting cancer outcomes from histology and genomics using convolutional networks. Proc Natl Acad Sci U S A. 2018:201717139.
49. Muhammad H, Häggström I, Klimstra DS, Fuchs TJ. Survival modeling of pancreatic cancer with radiology using convolutional neural networks. In: Simulation, Image Processing, and Ultrasound Systems for Assisted Diagnosis and Navigation. Cham: Springer, 2018; pp. 187–192.
50. Syeda-Mahmood T. Role of big data and machine learning in diagnostic decision support in radiology. J Am Coll Radiol. 2018;15(3):569–576.
51. Syeda-Mahmood T, Wang F, Beymer D, Amir A, Richmond M, Hashmi S. AALIM: multimodal mining for cardiac decision support. In: Computers in Cardiology. Piscataway: IEEE, 2007; pp. 209–212.
52. Goodfellow I, Bengio Y, Courville A. Deep learning. Cambridge, MA: MIT Press, 2016.
53. Petrick N, Sahiner B, Armato SG III, et al. Evaluation of computer-aided detection and diagnosis systems. Med Phys. 2013;40(8):087001.
54. de Bruijne M. Machine learning approaches in medical image analysis: from detection to diagnosis. Med Image Anal. 2016;33:94–97.
55. Thrall JH, Li X, Li Q, et al. Artificial intelligence and machine learning in radiology: opportunities, challenges, pitfalls, and criteria for success. J Am Coll Radiol. 2018;15(3):504–508.
56. Gallas BD, Chan H-P, D'Orsi CJ, et al. Evaluating imaging and computer-aided detection and diagnosis devices at the FDA. Acad Radiol. 2012;19(4):463–477.
57. Singer DS, Jacks T, Jaffee E. A US "Cancer Moonshot" to accelerate cancer research. Science. 2016;353(6304):1105–1106.
58. Lambin P, Leijenaar RT, Deist TM, et al. Radiomics: the bridge between medical imaging and personalized medicine. Nat Rev Clin Oncol. 2017;14(12):749.
59. Kohli M, Prevedello LM, Filice RW, Geis JR. Implementing machine learning in radiology practice and research. AJR Am J Roentgenol. 2017;208(4):754–760.
60. Hipp JD, Smith SC, Sica J, et al. Tryggo: Old Norse for truth: the real truth about ground truth: new insights into the challenges of generating ground truth maps for WSI CAD algorithm evaluation. J Pathol Inform. 2012;3:8.
61. Dodd LE, Wagner RF, Armato SG III, et al. Assessment methodologies and statistical issues for computer-aided diagnosis of lung nodules in computed tomography: contemporary research topics relevant to the Lung Image Database Consortium. Acad Radiol. 2004;11(4):462–475.
62. Warfield SK, Zou KH, Wells WM. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Trans Med Imaging. 2004;23(7):903.
63. Cholleti SR, Goldman SA, Blum A, et al. Veritas: combining expert opinions without labeled data. Int J Artif Intell Tools. 2009;18(5):633–651.
64. Herrmann MD, Clunie DA, Fedorov A, et al. Implementing the DICOM standard for digital pathology. J Pathol Inform. 2018;9:37.
65. Kalinski T, Zwonitzer R, Rossner M, Hofmann H, Roessner A, Guenther T. Digital Imaging and Communications in Medicine (DICOM) as standard in digital pathology. Histopathology. 2012;61(1):132–134.
66. Singh R, Chubb L, Pantanowitz L, Parwani A. Standardization in digital pathology: Supplement 145 of the DICOM standards. J Pathol Inform. 2011;2:23.
67. Goode A, Gilbert B, Harkes J, Jukic D, Satyanarayanan M. OpenSlide: a vendor-neutral software foundation for digital pathology. J Pathol Inform. 2013;4:27.
68. Allan C, Burel JM, Moore J, et al. OMERO: flexible, model-driven data management for experimental biology. Nat Methods. 2012;9(3):245–253.
69. Bankhead P, Loughrey MB, Fernandez JA, et al. QuPath: open source software for digital pathology image analysis. Sci Rep. 2017;7(1):16878.
70. Saltz J, Sharma A, Iyer G, et al. A containerized software system for generation, management, and exploration of features from whole slide tissue images. Cancer Res. 2017;77(21):e79–e82.
71. Gutman DA, Khalilia M, Lee S, et al. The Digital Slide Archive: a software platform for management, integration, and analysis of histology for cancer research. Cancer Res. 2017;77(21):e75–e78.
72. DICOM. Digital Imaging and Communications in Medicine (DICOM). Sup 187: Preclinical Small Animal Imaging Acquisition Context. Rosslyn, VA: NEMA, 2015.
73. Moore SM, Maffitt DR, Smith KE, et al. De-identification of medical images with retention of scientific research value. RadioGraphics. 2015;35(3):727–735.
74. NIOSH. Chest Image Repository. Atlanta, GA: CDC, 2011.
75. Bennett W, Matthews J, Bosch W. SU-GG-T-262: open-source tool for assessing variability in DICOM data. Med Phys. 2010;37(6):3245.
76. Rosenstein BS, Capala J, Efstathiou JA, et al. How will big data improve clinical and basic research in radiation therapy? Int J Radiat Oncol Biol Phys. 2016;95(3):895–904.
77. Linkert M, Rueden CT, Allan C, et al. Metadata matters: access to image data in the real world. J Cell Biol. 2010;189(5):777–782.
78. Irshad H, Montaser-Kouhsari L, Waltz G, et al. Crowdsourcing image annotation for nucleus detection and segmentation in computational pathology: evaluating experts, automated methods, and the crowd. In: Pacific Symposium on Biocomputing. Singapore: World Scientific, 2014; pp. 294–305.
79. Maier-Hein L, Mersmann S, Kondermann D, et al. Can masses of non-experts train highly accurate image classifiers? In: International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer, 2014; pp. 438–445.
80. Kalpathy-Cramer J, Beers A, Mamonov A, et al. Crowds Cure Cancer: data collected at the RSNA 2017 annual meeting. Little Rock, AR: The Cancer Imaging Archive, 2018.
81. Hou L, Agarwal A, Samaras D, Kurc TM, Gupta RR, Saltz JH. Unsupervised histopathology image synthesis. arXiv:1712.05021, 2017.
82. Saltz J, Sharma A, Iyer G, et al. A containerized software system for generation, management, and exploration of features from whole slide tissue images. Cancer Res. 2017;77(21):e79–e82.
83. Kathiravelu P, Sharma A, Galhardas H, Van Roy P, Veiga L. On-demand big data integration: a hybrid ETL approach for reproducible scientific research. arXiv:1804.08985, 2018.
