Table 3. Step-by-step recommendations.
1. Gather meta-information about the data collection and annotation processes to reconstruct the full story of the dataset
2. Establish the predictive causal direction: does the image cause the prediction target or vice versa? If annotations are scarce and image → target, semi-supervised learning may be futile, while data augmentation remains a viable alternative
3. Identify any evidence of mismatch between datasets (Table 1). When applicable, importance reweighting is a common mitigation strategy; see further specific advice in the text
   • If causal (image → target): population shift, annotation shift
   • If anticausal (target → image): prevalence shift, manifestation shift
4. Verify what types of differences in image acquisition are expected, if any. Consider applying data harmonisation techniques and domain adaptation (if test images are available)
5. Determine whether the data collection was biased with respect to the population of interest, and whether selection was based on the images, the targets or both (Table 2). Refer to dataset shift guidance for mitigating the resulting biases
6. Draw the full causal diagram including postulated direction, shifts and selections
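
Step 3 names importance reweighting as a common mitigation strategy. As an illustration only, the following minimal sketch shows one way it could look for prevalence shift in an anticausal task, where each training sample is weighted by the ratio of target-population to training-set class prevalence. It assumes the target prevalences are known, which in practice would usually have to be estimated; the function name `prevalence_weights` and the numbers in the example are hypothetical and do not come from the text.

```python
import numpy as np


def prevalence_weights(train_labels, target_prevalence):
    """Per-sample weights w(y) = p_target(y) / p_train(y).

    Hypothetical helper: assumes integer class labels 0..K-1 and that the
    class prevalences of the target population are known in advance.
    """
    train_labels = np.asarray(train_labels)
    target_prevalence = np.asarray(target_prevalence, dtype=float)

    # Empirical class prevalence in the training set.
    counts = np.bincount(train_labels, minlength=len(target_prevalence))
    if np.any(counts == 0):
        raise ValueError("every class must appear in the training labels")
    train_prevalence = counts / counts.sum()

    # One weight per class, then looked up for each training sample.
    class_weights = target_prevalence / train_prevalence
    return class_weights[train_labels]


# Illustrative numbers: a balanced training set, but the condition is
# assumed to affect only 10% of the target population.
labels = np.array([0] * 50 + [1] * 50)
weights = prevalence_weights(labels, target_prevalence=[0.9, 0.1])
```

The resulting weights could be passed as per-sample weights to a training loss or an evaluation metric. This sketch addresses prevalence shift only; the other kinds of mismatch listed in step 3 call for different corrections.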