The architecture of the proposed SSLLN with 15 convolutional layers. The network takes different CMR volumes as input, applies a branch of convolutions, learns image features from fine to coarse levels, concatenates the multi-scale features, and finally predicts the probability maps of the segmentation and the landmarks simultaneously. These probability maps, together with the ground-truth segmentation labels and landmark locations, are then used in the loss function (1), which is minimised via stochastic gradient descent. Here #S, #A, #C, #LK and GT denote the number of volume slices, the number of activation maps, the number of anatomies, the number of landmarks, and ground truth, respectively.
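
The two steps named in the caption, multi-scale concatenation and simultaneous prediction under a single loss, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Equation (1) is not reproduced in this caption, so the sketch assumes a per-voxel categorical cross-entropy for each output, combined through a hypothetical `weight` parameter. The function names `concat_multiscale` and `joint_loss` are likewise illustrative, and the feature levels are assumed to have already been resampled to a common resolution.

```python
import math


def concat_multiscale(feature_levels):
    """Concatenate per-voxel channel vectors from several feature levels.

    Each level is a list of per-voxel channel lists; all levels are assumed
    to have been resampled to the same number of voxels beforehand.
    """
    return [sum(voxel_channels, []) for voxel_channels in zip(*feature_levels)]


def softmax(logits):
    """Numerically stable softmax over one voxel's class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_cross_entropy(logits_per_voxel, labels):
    """Mean negative log-probability of the ground-truth class."""
    losses = [-math.log(softmax(logits)[label])
              for logits, label in zip(logits_per_voxel, labels)]
    return sum(losses) / len(losses)


def joint_loss(seg_logits, seg_labels, lm_logits, lm_labels, weight=1.0):
    """Assumed joint objective: segmentation term plus weighted landmark term."""
    return (mean_cross_entropy(seg_logits, seg_labels)
            + weight * mean_cross_entropy(lm_logits, lm_labels))


# Toy example: two voxels, a fine level with 2 channels and a coarse level with 1.
fine = [[1.0, 2.0], [3.0, 4.0]]
coarse = [[5.0], [6.0]]
features = concat_multiscale([fine, coarse])  # 3 channels per voxel

# Uninformative (all-zero) scores for one voxel: 3 segmentation classes, 2 landmark classes.
loss = joint_loss([[0.0, 0.0, 0.0]], [1], [[0.0, 0.0]], [0])
```

With uniform scores each head assigns equal probability to every class, so the toy loss reduces to log 3 + log 2. In a real network both heads would read the same concatenated features, and the gradient of this single scalar would drive the stochastic gradient descent updates.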