Table 2. Works on the use of world models for prediction.
| Category | Authors | Input | Output | Learning | Description |
|---|---|---|---|---|---|
| AD | Karlsson et al. [160] | Images, point clouds | Semantic point clouds | SSL | Uses a hierarchical VAE to construct a world model; generates pseudo-complete states and matches them with partial observations to predict future states. |
| | Hu et al. [165] | Videos, text, actions | Videos | SSL | Uses an autoregressive transformer to construct a world model and leverages DINO, a self-supervised image model, to tokenize images. |
| | Wang et al. [166] | Images, HDMap, 3D boxes, text, actions | Videos, actions | SSL | First learns a representation of the structured traffic information; prediction is then formalized as a generative probabilistic model. |
| | Zhang et al. [157] | Point clouds, actions | Point clouds | UL/SSL | Uses a discrete diffusion model, implemented as a spatiotemporal transformer, for point cloud prediction; leverages VQ-VAE to tokenize sensor observations. |
| | Zheng et al. [163] | 3D occupancy scene | Scene, ego-vehicle motion | SSL | Constructs a 3D occupancy space and trains a world model to predict the next scene from previous scenarios in an autoregressive manner; uses VQ-VAE to discretize the 3D occupancy scene into tokens. |
| | Min et al. [164] | Image–LiDAR pairs | 4D GO | UL/SSL | Proposes a spatial–temporal world model for unified autonomous-driving pretraining. |
| | Bogdoll et al. [162] | Actions, point clouds, images | Point clouds, images, 3D OG | UL/SSL | Learns a sensor-agnostic 3D occupancy representation from raw data and predicts future states conditioned on actions. |
| VP | Finn et al. [174] | Videos | Videos | UL/SSL | Proposes interacting with the world under unsupervised conditions and develops an action-conditioned model for video prediction. |
| | Wu et al. [176] | Videos | Videos | UL/SSL | Leverages a pre-trained object-centric model to extract object slots from each frame; the slots are then forwarded to a transformer to predict future slots. |
| | Wang et al. [177] | Images, videos, text, actions | Videos | SSL | Maps visual inputs into discrete tokens using VQ-GAN, then predicts the masked tokens with a transformer. |
AD, autonomous driving; VP, visual prediction; GO, geometric occupancy; OG, occupancy grids.
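Several entries in the table share one pattern: observations are first discretized into tokens via a learned codebook (VQ-VAE in Zhang et al. [157] and Zheng et al. [163], VQ-GAN in Wang et al. [177]), and a sequence model then predicts future tokens. A minimal pure-Python sketch of this quantize-then-predict loop follows; the codebook and the `predict_next` transition rule are toy placeholders standing in for the learned encoder and transformer of the cited works, not their actual models.

```python
# Illustrative sketch of the "tokenize, then predict" world-model pattern
# (vector-quantized codebook + autoregressive next-token prediction).
# Codebook entries and dynamics here are toy placeholders.

def quantize(obs, codebook):
    """Map a continuous observation vector to the index of the nearest
    codebook entry (the vector-quantization step)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(obs, codebook[i]))

def rollout(obs_seq, codebook, predict_next, horizon):
    """Tokenize the observed frames, then autoregressively append
    `horizon` future tokens using `predict_next` (a stand-in for the
    learned sequence model)."""
    tokens = [quantize(o, codebook) for o in obs_seq]
    for _ in range(horizon):
        tokens.append(predict_next(tokens))
    return tokens

# Toy codebook of 2-D "observations" and a dummy cyclic transition rule.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
predict_next = lambda toks: (toks[-1] + 1) % len(codebook)  # placeholder dynamics

tokens = rollout([(0.1, -0.2), (0.9, 0.1)], codebook, predict_next, horizon=3)
print(tokens)  # two observed tokens followed by three predicted tokens
```

In the surveyed systems the codebook is learned jointly with an encoder–decoder (VQ-VAE/VQ-GAN) and `predict_next` is a transformer conditioned on past tokens and, in the action-conditioned variants, on ego actions; the control flow, however, matches this sketch.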