Convolutional neural network (CNN) encoder architectures and performance comparisons. The representational power of the four most commonly used CNN feature extractors was examined: DenseNet, ResNet, EDSR, and VGG. The CNN encoder of Masked-LMCTrans (multimodality coattentional convolutional neural network transformer) was replaced with each of the four structures in turn, and the resulting models were named “Encoder-Trans.” (A–D) Architecture and constituent operations of the DenseNet, ResNet, EDSR, and VGG encoders, respectively. (E) Violin plots show model performance with 95% CIs. P values for comparisons between the DenseNet encoder and each of the other three encoders were less than .05 for the peak signal-to-noise ratio (PSNR) and visual information fidelity (VIF) metrics, indicating statistically significant differences. BN = batch normalization, conv = convolution, EDSR = enhanced deep super-resolution network, ReLU = rectified linear unit, SSIM = structural similarity index measure, VGG = Visual Geometry Group.
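For readers unfamiliar with the PSNR metric compared in panel E, the following is a minimal sketch of its standard definition, PSNR = 10·log10(MAX² / MSE). It is not the authors' evaluation code: the function name `psnr`, the `data_range` parameter, and the use of flat lists of pixel values normalized to [0, 1] are assumptions made for illustration.

```python
import math


def psnr(reference, test, data_range=1.0):
    """Peak signal-to-noise ratio (in dB) between two flat sequences of pixel values.

    `data_range` is the maximum possible pixel value (assumed 1.0 here,
    i.e. images normalized to [0, 1]); this is an illustrative choice.
    """
    if len(reference) != len(test) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Mean squared error between the reference and test pixel values.
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)

    # Identical images have zero error, so PSNR is unbounded.
    if mse == 0:
        return math.inf

    return 10.0 * math.log10(data_range ** 2 / mse)
```

A higher PSNR indicates that the test image is closer to the reference. For example, a uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01 and a PSNR of about 20 dB.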