Table 2.
Architectural differences between the proposed model sizes.
| Model | Total params. (millions) | Encoder params. (millions) | Extra depth | Hidden channels | Latent size |
|---|---|---|---|---|---|
| Small model | 0.443 | 0.285 | 0 | 16, 32, 64 | 128 |
| Medium model | 0.979 | 0.617 | 0 | 32, 64, 128 | 128 |
| Large model | 1.463 | 1.007 | 2 | 32, 64, 128 | 128 |
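
As a quick check on the table, the following minimal sketch computes what share of each model's parameters sits in the encoder. The parameter counts are copied from Table 2; the helper names (`encoder_share`, `non_encoder_params`) are hypothetical, and treating the non-encoder remainder as the decoder is an assumption, since the table does not itemize it.

```python
# Parameter counts (in millions) copied from Table 2.
MODELS = {
    "Small": {"total": 0.443, "encoder": 0.285},
    "Medium": {"total": 0.979, "encoder": 0.617},
    "Large": {"total": 1.463, "encoder": 1.007},
}


def encoder_share(total: float, encoder: float) -> float:
    """Fraction of all parameters that belong to the encoder."""
    return encoder / total


def non_encoder_params(total: float, encoder: float) -> float:
    """Parameters outside the encoder (assumed here to be the decoder)."""
    return total - encoder


for name, p in MODELS.items():
    share = encoder_share(p["total"], p["encoder"])
    rest = non_encoder_params(p["total"], p["encoder"])
    print(f"{name}: encoder share {share:.1%}, non-encoder {rest:.3f} M")
```

Under these numbers the encoder accounts for roughly 63% to 69% of each model, so the remaining third or so of the parameters is not exercised when only the encoder is run.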
Note that during inference, only the encoder network of the VAE model is needed. Moreover, only the newly acquired image has to be passed through the encoder to obtain its latent representation, since the latent vectors of previous images can be loaded rather than recomputed.
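
The reuse of previously computed latent vectors can be sketched as a simple cache. This is an illustrative sketch only, not the implementation used here: `LatentCache` and `toy_encoder` are hypothetical names, the cache is an in-memory dictionary standing in for whatever storage is actually used, and the toy encoder returns two summary statistics instead of a 128-dimensional latent vector.

```python
from typing import Callable, Dict, List, Optional, Sequence

Latent = List[float]


class LatentCache:
    """Stores latent vectors by image ID so each image is encoded only once."""

    def __init__(self, encode: Callable[[Sequence[float]], Latent]) -> None:
        self._encode = encode
        self._store: Dict[str, Latent] = {}
        self.encoder_calls = 0  # counts how often the encoder actually runs

    def get(self, image_id: str, image: Optional[Sequence[float]] = None) -> Latent:
        """Return the latent for image_id, encoding the image only if it is new."""
        if image_id not in self._store:
            if image is None:
                raise KeyError(f"no cached latent for {image_id!r} and no image given")
            self._store[image_id] = self._encode(image)
            self.encoder_calls += 1
        return self._store[image_id]


def toy_encoder(image: Sequence[float]) -> Latent:
    """Stand-in for the VAE encoder (assumption): mean and max of the pixels."""
    return [sum(image) / len(image), max(image)]


cache = LatentCache(toy_encoder)
previous = cache.get("scan_001", [1.0, 3.0])   # encoded when first acquired
current = cache.get("scan_002", [2.0, 6.0])    # newly acquired image: encoded now
previous_again = cache.get("scan_001")         # loaded from the cache, no encoding
print(cache.encoder_calls)
```

With this arrangement the encoder runs once per acquired image, regardless of how many earlier images the new one is compared against.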