2020 Nov 19;31(6):3909–3922. doi: 10.1007/s00330-020-07417-0

Table 2.

Example of applying the checklist to the research article “Automated cardiovascular magnetic resonance imaging analysis with fully convolutional networks” by Bai et al [17]

1. Which clinical problem is being solved?
The study by Bai et al is focused on the fully automated determination of left ventricular ejection fraction (LVEF) and right ventricular ejection fraction (RVEF) on cardiovascular magnetic resonance imaging data, without the need for contour drawing by human experts. Application of ML can reduce the time and human burden of LVEF determination, which can easily take 10–15 min per subject.
2. Choice of ML model
The investigators used a DL approach with a fully convolutional neural network architecture consisting of 16 layers. Separate networks were trained for three commonly acquired anatomical orientations (short axis as well as vertical and horizontal long axes). Detailed parameters regarding network training are provided. The most important measure to look for with regard to overfitting is the use of a strictly separate or “hold out” test dataset (see item 4 below). There was no explicit mention of any other measures taken to avoid overfitting.
3. Sample size motivation
The DL algorithm was developed using a convenience sample of 4875 subjects participating in the UK Biobank study. No formal sample size calculation was provided. A clear statistical analysis plan is provided in the materials and methods section of the paper. The algorithm was subsequently applied in a study comparing LVEF in normal versus obese subjects. Each of these groups consisted of a further 867 patients, also selected from the UK Biobank study. For this second study, too, no formal sample size calculation was provided.
4. Specification of study design and training, validation, and testing datasets
A random sample of the British population participating in the UK Biobank study was used. As such, this is a retrospective cross-sectional analysis focused on understanding variations in the LVEF in the general population. Detailed inclusion and exclusion criteria were provided. The number of patients used for training, validation, and testing was 3975, 300, and 600 for the short axis segmentation algorithm; 3823, 300, and 600 for the vertical long axis algorithm; and 3782, 300, and 600 for the horizontal long axis algorithm. The investigators do not explicitly mention whether the test dataset was kept separate from the development and validation datasets. The developed algorithms were tested in 1734 additional UK Biobank participants. No external validation outside the UK Biobank was performed.
5. Standard of reference
The standard of reference consisted of the manual annotations of endocardial and epicardial contours in three anatomical orientations by 8 separate expert annotators. Their level of training and experience is not explicitly mentioned. The investigators also do not mention the number of cases annotated by each individual annotator. Three principal investigators oversaw the annotators, although the investigators do not explicitly mention what exactly their role was. Annotators were blinded to the output of the machine learning algorithms.
6. Reporting of results
To assess the accuracy of the algorithms’ segmentations, the Dice metric, Hausdorff distance, and mean contour distance were calculated, using manual annotations as the standard of reference. In addition, the automatically generated LVEF, RVEF, and the underlying end-diastolic and end-systolic volumes of the left and right ventricles, as well as left ventricular (LV) myocardial mass, were compared to the reference standard.
7. Are the results explainable?
The ML algorithms’ outputs can be visually assessed when overlaid on the obtained cardiac MR images, so the end result is easy to verify by human experts. There was no mention of any experiments to investigate the algorithms’ internal logic. However, the DL architecture used in this study has been extensively described by others.
8. Can the results be applied in a clinical setting?
Because this study concerns a random sample of the British population, the reported results only apply to this group of subjects. The investigators did not test the algorithm in a hospital setting. Based on this study, its accuracy in patients with suspected or known cardiovascular disease is unknown. Nevertheless, the algorithm is capable of running in the hospital on relatively standard computer hardware in combination with a GPU.
9. Is the performance reproducible and generalizable?
Because the UK Biobank contains cardiac MR images from multiple different scanners and sites, this study provides strong evidence of the generalizability of the algorithms’ performance across different MR hardware platforms and MR scanner operators. However, a standardized image acquisition protocol was used, which does not necessarily correspond to routine clinical practice. Because the algorithm was not tested on non-UK Biobank cardiac MR images, we do not know its performance outside of this domain. Human expert interobserver variation was assessed by comparing contours drawn by three expert observers. Finally, the automatically generated contours for 250 randomly selected test subjects were visually assessed by two experienced image analysts.
10. Is there any evidence that the model has an effect on patient outcomes?
The investigators focused on the development of an algorithm for automated ventricular ejection fraction measurement. Patient outcomes were not studied.
11. Is the code available?
The cardiac MR data including the segmentations are available upon request for health-related research in the public interest. The software code is available on GitHub. It is unclear if the algorithm needs to be retrained with new data.
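
To make two of the quantities named in items 1 and 6 concrete, the following is a minimal Python sketch, not the code of Bai et al. It shows the standard definition of the Dice metric for two binary segmentation masks and the standard ejection fraction formula computed from end-diastolic and end-systolic volumes. The function names, the toy masks, and the example volumes are assumptions chosen for illustration only; they are not data from the study.

```python
def dice_coefficient(pred_mask, ref_mask):
    """Dice overlap of two binary masks: 2 * |A and B| / (|A| + |B|).

    Masks are given as nested lists (rows of 0/1 values) of equal shape.
    """
    pred = [bool(v) for row in pred_mask for v in row]
    ref = [bool(v) for row in ref_mask for v in row]
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of pixels")
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(pred) + sum(ref)
    if total == 0:
        # Assumption: two empty masks are treated as perfect agreement.
        return 1.0
    return 2.0 * overlap / total


def ejection_fraction(edv_ml, esv_ml):
    """Ejection fraction in percent: (EDV - ESV) / EDV * 100."""
    if edv_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return (edv_ml - esv_ml) / edv_ml * 100.0


# Hypothetical toy masks: automated contour covers 4 pixels,
# manual reference covers 6 pixels, and they share 4 pixels.
auto_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
manual_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]

dice = dice_coefficient(auto_mask, manual_mask)  # 2 * 4 / (4 + 6)
lvef = ejection_fraction(150.0, 60.0)            # (150 - 60) / 150 * 100

print(f"Dice: {dice:.2f}")
print(f"LVEF: {lvef:.1f}%")
```

A Dice value of 1 means the automated and manual masks coincide exactly, and 0 means they do not overlap at all. The Hausdorff and mean contour distances mentioned in item 6 are distance-based measures and are not covered by this sketch.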