2024 Jul 16;7:0399. doi: 10.34133/research.0399

Table 2. Works on the use of world models for prediction

| Task | Authors | Input | Output | Learning | Description |
|------|---------|-------|--------|----------|-------------|
| AD | Karlsson et al. [160] | Images, point clouds | Semantic point clouds | SSL | Uses a hierarchical VAE to construct a world model; generates pseudo-complete states and matches them with partial observations to predict future states. |
| AD | Hu et al. [165] | Videos, text, actions | Videos | SSL | Uses an autoregressive transformer to construct a world model and leverages DINO, a self-supervised image model, to tokenize images. |
| AD | Wang et al. [166] | Images, HDMap, 3D boxes, text, actions | Videos, actions | SSL | First builds a comprehension of the structured traffic information; prediction is then formalized as a generative probabilistic model. |
| AD | Zhang et al. [157] | Point clouds, actions | Point clouds | UL/SSL | Uses a discrete diffusion model (a spatiotemporal transformer) for point cloud prediction and leverages a VQ-VAE to tokenize sensor observations. |
| AD | Zheng et al. [163] | 3D occupancy scene | Scene, ego-vehicle motion | SSL | Constructs a 3D occupancy space and trains a world model to predict the next scene from previous scenarios in an autoregressive manner; uses a VQ-VAE to discretize the 3D occupancy scene into tokens. |
| AD | Min et al. [164] | Image–LiDAR pairs | 4D GO | UL/SSL | Proposes a spatial–temporal world model for unified autonomous driving pretraining. |
| AD | Bogdoll et al. [162] | Actions, point clouds, images | Point clouds, images, 3D OG | UL/SSL | Leverages raw data to learn a sensor-agnostic 3D occupancy representation and predicts future states conditioned on actions. |
| VP | Finn et al. [174] | Videos | Videos | UL/SSL | Interacts with the world under unsupervised conditions and develops an action-conditioned model for video prediction. |
| VP | Wu et al. [176] | Videos | Videos | UL/SSL | Leverages a pretrained object-centric model to extract object slots from each frame; the slots are then forwarded to a transformer to predict future slots. |
| VP | Wang et al. [177] | Images, videos, text, actions | Videos | SSL | Maps visual inputs into discrete tokens using VQ-GAN, then predicts the masked tokens with a Transformer. |

AD, autonomous driving; VP, visual prediction; GO, geometric occupancy; OG, occupancy grids.
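Several of the works above (e.g., Zhang et al. [157], Zheng et al. [163], and Wang et al. [177]) share a common recipe: quantize continuous sensor observations into discrete tokens via a learned codebook (VQ-VAE/VQ-GAN), then predict future tokens with a sequence model. The sketch below illustrates only that recipe, not any specific cited system: the codebook is random rather than trained, and a bigram count model stands in for the autoregressive transformer. All names and sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical codebook: K embedding vectors of dimension D. In the cited
# works this comes from a trained VQ-VAE/VQ-GAN; here it is random.
K, D = 8, 4
codebook = rng.normal(size=(K, D))

def tokenize(features):
    """Quantize each feature vector to the index of its nearest codebook entry."""
    # features: (N, D) continuous observations (e.g., patch embeddings)
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=-1)  # (N,) integer tokens

def detokenize(tokens):
    """Map discrete tokens back to their codebook vectors (decoder input)."""
    return codebook[tokens]

def fit_bigram(token_seq, k=K):
    """Toy 'world model': next-token distribution from bigram counts.
    Real systems learn this with an autoregressive transformer."""
    counts = np.ones((k, k))  # add-one smoothing
    for a, b in zip(token_seq[:-1], token_seq[1:]):
        counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Synthetic observation stream -> tokens -> greedy next-state prediction
obs = rng.normal(size=(32, D))
tokens = tokenize(obs)
P = fit_bigram(tokens)
next_token = int(P[tokens[-1]].argmax())
predicted_state = detokenize(np.array([next_token]))  # shape (1, D)
```

The point of the discretization step is that prediction becomes a classification problem over a finite vocabulary, which is what lets these works reuse standard autoregressive or masked-token training objectives.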