Entropy. 2023 Oct 20;25(10):1469. doi: 10.3390/e25101469

Figure 1.

Overview: Our approach predicts the next frame μ_t of a video autoregressively, along with an additive correction y_t^0 generated by a denoising process. Detailed model: Two convolutional RNNs (blue and red arrows) operate on a frame sequence x_{0:t−1} to predict the most likely next frame μ_t (blue box) and a context vector for a denoising diffusion model. The diffusion model is trained to model the scaled residual y_t^0 = (x_t − μ_t)/σ conditioned on the temporal context. At generation time, the generated residual is added to the next-frame estimate μ_t to produce the next frame as x_t = μ_t + σ y_t^0.
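The residual arithmetic in the caption can be sketched as follows. This is a minimal toy illustration of the training target y_t^0 = (x_t − μ_t)/σ and the generation rule x_t = μ_t + σ y_t^0; the functions `predict_next_frame` and `sample_residual` are hypothetical placeholders standing in for the paper's convolutional RNN predictor and conditional diffusion sampler, not the actual networks.

```python
import numpy as np

SIGMA = 2.0  # residual scaling factor sigma (a hyperparameter; value chosen arbitrarily here)

def predict_next_frame(frames):
    """Placeholder for the RNN's most-likely next-frame prediction mu_t.
    Toy stand-in: repeat the last observed frame."""
    return frames[-1]

def sample_residual(frames):
    """Placeholder for the diffusion model's sampled residual y_t^0,
    conditioned on the temporal context. Toy stand-in: zero correction."""
    return np.zeros_like(frames[-1])

def residual_target(x_t, mu_t, sigma=SIGMA):
    # Training target for the diffusion model: y_t^0 = (x_t - mu_t) / sigma
    return (x_t - mu_t) / sigma

def generate_next_frame(frames, sigma=SIGMA):
    # Generation rule from the caption: x_t = mu_t + sigma * y_t^0
    mu_t = predict_next_frame(frames)
    y0_t = sample_residual(frames)
    return mu_t + sigma * y0_t

# Toy frame sequence x_{0:2}: three 4x4 "frames"
frames = [np.full((4, 4), float(i)) for i in range(3)]
x_next = generate_next_frame(frames)
# With a zero residual, the generated frame equals the RNN estimate mu_t.
```

Scaling the residual by σ before diffusion training and rescaling at generation keeps the diffusion model's target in a normalized range, while the additive structure lets the RNN carry the deterministic part of the prediction.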