Table 2. Works on the use of world models for prediction.
| Category | Authors | Input | Output | Learning | Description |
|---|---|---|---|---|---|
| AD | Karlsson et al. [160] | Images, point clouds | Semantic point clouds | SSL | Uses a hierarchical VAE to construct a world model; generates pseudo-complete states and matches them with partial observations to predict future states. |
| | Hu et al. [165] | Videos, text, actions | Videos | SSL | Uses an autoregressive transformer to construct a world model and leverages DINO, a self-supervised image model, to tokenize images. |
| | Wang et al. [166] | Images, HDMap, 3D boxes, text, actions | Videos, actions | SSL | First learns a representation of the structured traffic information; prediction is then formalized as a generative probabilistic model. |
| | Zhang et al. [157] | Point clouds, actions | Point clouds | UL/SSL | Uses a discrete diffusion model, implemented as a spatiotemporal transformer, for point cloud prediction; leverages VQ-VAE to tokenize sensor observations. |
| | Zheng et al. [163] | 3D occupancy scene | Scene, ego-vehicle motion | SSL | Constructs a 3D occupancy space and trains a world model to predict the next scene from previous scenarios in an autoregressive manner; uses VQ-VAE to discretize the 3D occupancy scene into tokens. |
| | Min et al. [164] | Image–LiDAR pairs | 4D GO | UL/SSL | Proposes a spatial–temporal world model for unified autonomous-driving pretraining. |
| | Bogdoll et al. [162] | Actions, point clouds, images | Point clouds, images, 3D OG | UL/SSL | Learns a sensor-agnostic 3D occupancy representation from raw data and predicts future states conditioned on actions. |
| VP | Finn et al. [174] | Videos | Videos | UL/SSL | Proposes interacting with the world under unsupervised conditions and develops an action-conditioned model for video prediction. |
| | Wu et al. [176] | Videos | Videos | UL/SSL | Leverages a pre-trained object-centric model to extract object slots from each frame; the slots are then forwarded to a transformer to predict future slots. |
| | Wang et al. [177] | Images, videos, text, actions | Videos | SSL | Maps visual inputs into discrete tokens using VQ-GAN, then predicts the masked tokens with a transformer. |
AD, autonomous driving; VP, visual prediction; GO, geometric occupancy; OG, occupancy grids.
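Several entries in the table share one pattern: observations are first discretized into tokens via a learned codebook (VQ-VAE in Zhang et al. [157] and Zheng et al. [163], VQ-GAN in Wang et al. [177]), and a sequence model then predicts future tokens. A minimal pure-Python sketch of this quantize-then-predict loop follows; the codebook and the `predict_next` transition rule are toy placeholders standing in for the learned encoder and transformer of the cited works, not their actual models.

```python
# Illustrative sketch of the "tokenize, then predict" world-model pattern
# (vector-quantized codebook + autoregressive next-token prediction).
# Codebook entries and dynamics here are toy placeholders.

def quantize(obs, codebook):
    """Map a continuous observation vector to the index of the nearest
    codebook entry (the vector-quantization step)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(obs, codebook[i]))

def rollout(obs_seq, codebook, predict_next, horizon):
    """Tokenize the observed frames, then autoregressively append
    `horizon` future tokens using `predict_next` (a stand-in for the
    learned sequence model)."""
    tokens = [quantize(o, codebook) for o in obs_seq]
    for _ in range(horizon):
        tokens.append(predict_next(tokens))
    return tokens

# Toy codebook of 2-D "observations" and a dummy cyclic transition rule.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
predict_next = lambda toks: (toks[-1] + 1) % len(codebook)  # placeholder dynamics

tokens = rollout([(0.1, -0.2), (0.9, 0.1)], codebook, predict_next, horizon=3)
print(tokens)  # two observed tokens followed by three predicted tokens
```

In the surveyed systems the codebook is learned jointly with an encoder–decoder (VQ-VAE/VQ-GAN) and `predict_next` is a transformer conditioned on past tokens and, in the action-conditioned variants, on ego actions; the control flow, however, matches this sketch.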