The backbone convolutional neural network developed for landmark
detection has a U-Net structure. Additional layers can be inserted into
both the downsampling and upsampling branches, and additional blocks
into each layer. The output layer produces per-pixel scores, which are
passed through a softmax function. For landmark detection on
long-axis images, data from three views were used together to train one
model. As shown in the input illustration, every minibatch was assembled
from images randomly selected across the three views and used for
backpropagation. A total of four layers with three or four blocks per layer
were used in this experiment. The output tensor shapes were reported in
the format [B, C, H, W], where B is the minibatch size, C is the number
of channels, and H and W are the image height and width.
Input images have one channel for image intensity, and the output has
four channels: one for each of the three landmarks plus one for the
background. The output illustration plots the three color-coded landmark
channels and omits the background channel.
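The architecture described above can be sketched in PyTorch. This is a minimal illustration under stated assumptions, not the authors' implementation: the channel widths, two-convolution blocks, and pooling/upsampling choices are hypothetical, while the four encoder layers, one input channel, four output channels, and per-pixel softmax follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    """One block: two 3x3 convolutions with ReLU (block design is assumed)."""

    def __init__(self, cin, cout):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
            nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU())

    def forward(self, x):
        return self.conv(x)


class UNet(nn.Module):
    """U-Net with a downsampling and an upsampling branch joined by skips.

    `widths` (four layers, hypothetical channel counts) sets the depth;
    more layers or blocks could be inserted, as the text notes.
    """

    def __init__(self, cin=1, n_out=4, widths=(16, 32, 64, 128)):
        super().__init__()
        self.enc = nn.ModuleList()
        prev = cin
        for w in widths:                      # downsampling branch
            self.enc.append(Block(prev, w))
            prev = w
        self.dec = nn.ModuleList()
        for w in reversed(widths[:-1]):       # upsampling branch
            self.dec.append(Block(prev + w, w))
            prev = w
        self.head = nn.Conv2d(prev, n_out, 1)  # per-pixel scores

    def forward(self, x):
        skips = []
        for i, blk in enumerate(self.enc):
            x = blk(x)
            if i < len(self.enc) - 1:
                skips.append(x)               # skip connection
                x = F.max_pool2d(x, 2)        # downsample by 2
        for blk in self.dec:
            x = F.interpolate(x, scale_factor=2, mode="nearest")
            x = blk(torch.cat([x, skips.pop()], dim=1))
        return self.head(x)


net = UNet()
scores = net(torch.randn(2, 1, 64, 64))   # [B, C, H, W] = [2, 4, 64, 64]
probs = torch.softmax(scores, dim=1)      # softmax over the channel dim
```

The softmax over the channel dimension turns the four per-pixel scores into probabilities for the three landmark channels plus the background channel.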
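The mixed-view minibatch assembly can be sketched in plain Python. The container layout, view names, and sample encoding below are all hypothetical; only the idea of drawing each minibatch from images mixed across the three views comes from the text.

```python
import random


def mixed_view_batches(views, batch_size, rng=None):
    """Pool samples from all views and yield shuffled minibatches.

    `views` maps a view name to its list of samples (assumed layout), so
    each minibatch mixes randomly selected images across the views.
    """
    rng = rng or random.Random(0)
    pool = [(name, s) for name, samples in views.items() for s in samples]
    rng.shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]


# Hypothetical example: integers stand in for images from three views.
views = {"view_a": list(range(5)),
         "view_b": list(range(5, 10)),
         "view_c": list(range(10, 15))}
batches = list(mixed_view_batches(views, batch_size=4))
```

Each yielded minibatch would then be stacked into a [B, 1, H, W] tensor for one backpropagation step.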