Definition of the data of interest |
Specifying which data should be collected and annotated for the project, in terms of the imaging modality, protocol, anatomy, pathology, and clinical question relevant to the application. |
Data collection and de-identification |
Acquiring the imaging data from a source such as a PACS, a DICOM server, or a public repository. The data should be representative of the target population and environment. Data must be de-identified by removing any personal or sensitive information that could identify patients or institutions, and the process must comply with the applicable ethical and legal regulations, such as HIPAA or GDPR. |
Annotation |
Labelling the data with the information that is needed for the machine learning task, such as bounding boxes, polygons, masks, or tags. A standard protocol or guideline for annotation should be followed. The annotation must be accurate, consistent, and complete. Either custom computer programs or existing software, free or proprietary, may be used to facilitate this process. |
Curation |
Reviewing and validating the annotated data and resolving any errors or discrepancies. Multiple experts or consensus methods may be employed to check the quality and reliability of the annotations. Software tools can be used to manage and monitor the annotation process. |
Storage |
Storing and organising the annotated data in a format that is suitable for machine learning, such as DICOM, NIfTI, or PNG. Data must be stored securely while remaining accessible to the machine learning framework and model. Specific software tools can be employed to track and version the data. |
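
As an illustration of the curation step, the following is a minimal sketch of one possible consensus check, assuming case-level tags from several annotators. It uses a strict majority vote to derive a consensus label and Cohen's kappa to measure agreement between two annotators. The function names, case identifiers, and labels are hypothetical and chosen only for this example; a real project would follow its own annotation guideline and may prefer other agreement measures.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of annotators, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def cohen_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters who labelled the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(rater_a)
    # Fraction of cases on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical case-level tags from three annotators (illustrative data only).
annotations = {
    "case_001": ["nodule", "nodule", "nodule"],
    "case_002": ["nodule", "normal", "nodule"],
    "case_003": ["normal", "nodule", "mass"],
}

# Consensus label per case; cases without a strict majority go back for review.
consensus = {case: majority_label(votes) for case, votes in annotations.items()}
needs_review = [case for case, label in consensus.items() if label is None]
```

In this sketch, cases without a strict majority are collected in `needs_review` so that they can be returned to the experts for discussion, while `cohen_kappa` can be applied to any pair of annotators to monitor how reliably the annotation guideline is being followed.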