Abstract
Introduction
Generative AI shows strong potential for data augmentation and privacy-preserving data sharing [1,2], and recent methods can generate realistic 3D images [3]. Cardiac cine MRI (CMR), however, is inherently 4D (3D+t), requiring models to capture both spatial structure and temporal dynamics, and the high computational demand of modeling full 4D data makes direct generation particularly challenging. Existing methods typically generate 2D slices or frames independently, via latent interpolation [4], segmentation conditioning [5], or attention modules added post hoc to model spatial and temporal relations [6]. Relying on independently encoded slices, however, limits global context integration and can introduce spatiotemporal discontinuities.
Purpose
We propose a 4D generative framework that synthesizes full 4D CMR in a single pass without slice/frame-wise generation (Fig. 1). It preserves global anatomy and temporal coherence by leveraging a compact spatiotemporal vector-quantized variational autoencoder (VQ-VAE) on 2D+t slices and training a latent diffusion model on 3D+t volumes.
Methods
We used short-axis cine CMR data from the ACDC [7] and M&M2 [8] datasets, comprising 390 training and 108 test volumes. Images were resampled to 1×1 mm in-plane resolution and 10 mm slice thickness, and standardized to 32 temporal frames via cyclic repetition. Our framework combines a compact 3D VQ-VAE, which projects the data into a lower-dimensional latent space, with a diffusion model operating in that space. The autoencoder was trained on individual 2D+t slices using H×W×T crops of size 384×384×32, with perceptual, adversarial, entropy, and codebook normalization losses to ensure reconstruction fidelity and efficient latent use. For diffusion training, the latent representations of six consecutive slices were stacked along the feature dimension to form 3D+t inputs of size 48×H/4×W/4×T/4. The diffusion model was trained with a Huber loss. At inference, a full 4D latent volume is synthesized in a single pass and decoded depthwise to reconstruct the cine CMR.
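The shape bookkeeping behind the stacked latent input can be sketched as follows. This is a minimal illustration, not the trained model: the encoder is a stand-in (a 4×4×4 average pool tiled to the channel dimension), and the per-slice latent channel count of 8 is an assumption inferred from 48 channels / 6 slices.

```python
import numpy as np

# Hypothetical sizes inferred from the abstract: each 2D+t slice crop is
# H x W x T = 384 x 384 x 32; the VQ-VAE downsamples space and time by 4,
# and we ASSUME 8 latent channels per slice (48 / 6 stacked slices).
H, W, T, n_slices, latent_ch = 384, 384, 32, 6, 8

def encode_slice(slice_2dt: np.ndarray) -> np.ndarray:
    """Stand-in for the trained VQ-VAE encoder: average-pool 4x4x4 blocks
    and tile to `latent_ch` channels, purely to illustrate the shapes."""
    h, w, t = slice_2dt.shape
    pooled = slice_2dt.reshape(h // 4, 4, w // 4, 4, t // 4, 4).mean(axis=(1, 3, 5))
    return np.broadcast_to(pooled, (latent_ch, h // 4, w // 4, t // 4)).copy()

# Encode six consecutive slices and stack along the feature dimension,
# yielding the 48 x H/4 x W/4 x T/4 diffusion input described above.
slices = [np.random.rand(H, W, T).astype(np.float32) for _ in range(n_slices)]
latents = [encode_slice(s) for s in slices]
diffusion_input = np.concatenate(latents, axis=0)
print(diffusion_input.shape)  # (48, 96, 96, 8)
```

Stacking along the feature dimension (rather than a spatial axis) lets the diffusion model attend to all six slices jointly at every spatiotemporal location, which is what enables single-pass, globally consistent synthesis.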
Results
Spatial realism was evaluated using the Fréchet Inception Distance (FID) across depth: for each depth slice, features were extracted over time using a pre-trained 2D network [9]. Real-to-real comparisons yielded a mean FID of 0.29 ± 0.06, while synthetic-to-real comparisons yielded 1.67 ± 0.24. To assess temporal consistency, we segmented the left ventricle across the cardiac cycle [10], confirming that the synthesized volumes exhibit realistic cardiac-cycle dynamics (Fig. 2).
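The per-depth FID computation can be sketched as below. This is a generic Fréchet-distance implementation on placeholder features, not the evaluation pipeline from [9]: feature extraction is replaced by random arrays, and the matrix square root is computed via the symmetric identity Tr[(S1·S2)^(1/2)] = Tr[(S1^(1/2)·S2·S1^(1/2))^(1/2)] so that plain `eigh` suffices.

```python
import numpy as np

def fid(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two feature sets,
    each of shape (n_samples, n_features)."""
    mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Symmetric square root of cov_a via eigendecomposition.
    w, v = np.linalg.eigh(cov_a)
    sqrt_a = (v * np.sqrt(np.clip(w, 0, None))) @ v.T
    # Eigenvalues of sqrt_a @ cov_b @ sqrt_a are >= 0 up to rounding.
    w2 = np.linalg.eigvalsh(sqrt_a @ cov_b @ sqrt_a)
    tr_sqrt = np.sqrt(np.clip(w2, 0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2 * tr_sqrt)

# Per-depth evaluation in the spirit of the abstract: for one depth slice,
# compare pooled real vs. synthetic feature distributions (placeholders here).
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(200, 16))
synth_feats = rng.normal(loc=0.5, size=(200, 16))  # shifted -> nonzero FID
print(f"FID: {fid(real_feats, synth_feats):.3f}")
```

In the actual protocol, this distance would be computed once per depth slice over time-pooled network features, and the resulting per-depth scores averaged to give the reported mean ± standard deviation.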
Conclusion
We present a generative model that synthesizes full 4D cine CMR in a single pass using latent diffusion, avoiding slice/frame-wise generation and ensuring high spatial and temporal consistency. Our approach leverages a spatiotemporal autoencoder to directly encode 4D structure into the latent space. Quantitative and qualitative results confirm realism and physiological plausibility. This work establishes a strong foundation for applications in data augmentation and privacy-preserving clinical workflows.


