Research. 2024 Jul 16;7:0399. doi: 10.34133/research.0399

Prospective Role of Foundation Models in Advancing Autonomous Vehicles

Jianhua Wu 1, Bingzhao Gao 1,2, Jincheng Gao 1, Jianhao Yu 1, Hongqing Chu 1,*, Qiankun Yu 3, Xun Gong 4, Yi Chang 4, H Eric Tseng 5, Hong Chen 6,7,*, Jie Chen 2,7
PMCID: PMC11249913  PMID: 39015204

Abstract

With the development of artificial intelligence and breakthroughs in deep learning, large-scale foundation models (FMs), such as the generative pre-trained transformer (GPT) and Sora, have achieved remarkable results in many fields including natural language processing and computer vision. The application of FMs in autonomous driving holds considerable promise. For example, they can contribute to enhancing scene understanding and reasoning. By pre-training on rich linguistic and visual data, FMs can understand and interpret various elements in a driving scene, and provide cognitive reasoning to give linguistic and action instructions for driving decisions and planning. Furthermore, FMs can augment data based on their understanding of driving scenarios, providing feasible versions of those rare scenes in the long-tail distribution that are unlikely to be encountered during routine driving and data collection. Such augmentation can subsequently improve the accuracy and reliability of autonomous driving systems. Another testament to the potential of FM applications lies in world models, exemplified by the DREAMER series, which showcase the ability to comprehend physical laws and dynamics. Learning from massive data under the paradigm of self-supervised learning, world models can generate unseen yet plausible driving environments, facilitating improved prediction of road users’ behaviors and the off-line training of driving strategies. In this paper, we synthesize the applications and future trends of FMs in autonomous driving. By utilizing the powerful capabilities of FMs, we strive to tackle the potential issues stemming from the long-tail distribution in autonomous driving, consequently advancing overall safety in this domain.

Introduction

Autonomous driving, as one of the most challenging tasks in artificial intelligence, has received considerable attention. The conventional autonomous driving system adopts a modular development strategy [1,2], whereby perception, prediction, and planning are developed separately and integrated into the vehicle. However, the information transmitted between modules is limited, with some information lost at each interface. Furthermore, errors accumulate during propagation, and the computational efficiency of modular transmission is relatively low. These factors collectively result in poor model performance. To further reduce error and improve computational efficiency, researchers have in recent years attempted to train models in an end-to-end manner [3,4]. End-to-end means that the model takes inputs directly from the sensor data and directly outputs control decisions for the vehicle. While some progress has been made, these models still mainly rely on supervised learning (SL) with manually labeled data. Given the ever-changing driving scenarios of the real world, it is challenging to cover all potential situations with only limited labeled data. The result is a model with poor generalization ability that struggles to adapt to complex and changeable real-world corner cases.

In recent years, the emergence of foundation models (FMs) has provided new ideas to address this gap. An FM is commonly understood as a large-scale machine learning model trained on diverse data and applicable to various downstream tasks, which need not be directly related to its original training objective. The term was coined at Stanford University in August 2021 to denote “any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks” [5]. Examples of FMs include bidirectional encoder representations from transformers (BERT) [6] and GPT-4 [7] in natural language processing (NLP), and Sora [8] in computer vision (CV). Most FMs are constructed on pre-existing architectures; for example, BERT and GPT-4 are based on the Transformer [9], and Sora is founded on the Diffusion Transformer [10].

Different from traditional deep learning, FMs can learn directly from massive unlabeled data (e.g., videos, images, natural language, etc.) through self-supervised pre-training, thereby acquiring stronger generalization ability and emergent abilities (thought to have already appeared in large language models [LLMs]). Based on this, after fine-tuning with a small amount of supervised data, FMs can be rapidly adapted and migrated to downstream tasks such as autonomous driving. With the strong comprehension, inference, and generalization ability imparted by self-supervised pre-training, FMs are expected to break the bottleneck of traditional models, enabling the autonomous driving system to better understand and adapt to complex traffic environments, thus providing a safer and more reliable autonomous driving experience.

Emergent abilities

Alongside the notion of an FM, Bommasani et al. [5] describe the emergence characteristic, or emergent ability, of FMs: “an ability is emergent if it is not present in smaller models but is present in larger models”. For instance, the adaptability of a language model (LM) to diverse downstream tasks, a novel behavior not directly tied to its initial training, seems to emerge abruptly once the model scales beyond some threshold, transforming it into an LLM [11].

Currently, the emergent abilities of FMs are mainly observed in LLMs. Figure 1 [12] illustrates that as the model size, dataset size, and amount of training compute (floating-point operations) increase, the loss of an LLM decreases, providing support for large-scale model training. Figure 2 [11] shows that once the number of parameters reaches a certain level, the capabilities of LLMs take a qualitative leap, exhibiting emergent abilities across different tasks.

Fig. 1.

Scaling laws [12].

Fig. 2.

Emergent abilities of LLMs [11]. (A) to (H) represent different downstream tasks. (A) 3-digit addition/subtraction, 2-digit multiplication. (B) Transliteration from the international phonetic alphabet. (C) Recovering scrambled words. (D) Persian question-answering. (E) Answering questions truthfully. (F) Mapping conceptual domains. (G) Massive multi-task language understanding. (H) Word in context semantic understanding. Each point is a separate LLM. The dotted line represents random performance.

The emergent abilities of LLMs are well represented by in-context learning (ICL) [11,13], which, strictly speaking, can be regarded as a subclass of prompt tuning. ICL is the capability of LLMs to learn within a specific contextual environment; the main idea is to learn from analogies [14]. ICL, or prompt learning, enables LLMs to achieve excellent performance in a specific context without any parameter tuning.

One particular type of ICL is chain-of-thought (CoT) prompting. Users can break down complex problems into a series of reasoning steps as input to the LLM, enabling it to perform complex reasoning tasks [15]. Although emergent abilities are commonly found in LLMs, there is currently no compelling explanation for why they appear the way they do.
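The difference between a plain in-context prompt and a CoT prompt can be sketched as follows. This is a minimal illustration, not code from any of the cited works; the worked exemplar and questions are hypothetical, and in practice the resulting string would be sent to an LLM.

```python
# Sketch: a direct prompt vs. a chain-of-thought (CoT) prompt.
# The exemplar below is a made-up few-shot example that demonstrates
# intermediate reasoning steps, nudging the model to emit its own.

def direct_prompt(question: str) -> str:
    """Plain in-context prompt: question in, answer expected out."""
    return f"Q: {question}\nA:"

def cot_prompt(question: str) -> str:
    """CoT prompt: one worked example shows step-by-step reasoning."""
    exemplar = (
        "Q: A driver travels 30 km at 60 km/h, then 20 km at 40 km/h. "
        "How long is the trip?\n"
        "A: First leg: 30 / 60 = 0.5 h. Second leg: 20 / 40 = 0.5 h. "
        "Total: 0.5 + 0.5 = 1 h. The answer is 1 hour.\n"
    )
    return exemplar + f"Q: {question}\nA: Let's think step by step."

prompt = cot_prompt("A car travels 90 km at 45 km/h; how long does it take?")
```

The CoT prompt requires no parameter updates at all: the reasoning pattern is conveyed entirely through the input context, which is what makes it a form of ICL.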

Park et al. [16] introduced generative agents that simulate real human behaviors, perform daily activities based on pre-input settings, and store daily memories in natural language. The authors connected the generative agents to an LLM to create a small society of 25 intelligent agents, retrieved memories with the LLM, and used its emergent abilities to plan the agents’ behaviors. In the experiment, the agents exhibited more and more social behaviors beyond their scripted routines, clearly demonstrating the LLM’s emergent intelligence.

Pre-training

The implementation of FMs is based on transfer learning and scaling [5]. The idea of transfer learning [17,18] is to apply the knowledge learned in one mission to another. In deep learning, transfer learning is implemented in 2 stages, pre-training and fine-tuning. FMs are pre-trained with massive data. After obtaining the pre-trained model, a specific dataset is selected for fine-tuning to adapt to different downstream tasks.

Pre-training is the foundation for FMs to obtain emergent abilities. By being pre-trained on massive data, FMs acquire basic understanding and generative capability. Pre-training tasks include SL, self-supervised learning (SSL), etc. [19]. Early pre-training relied on SL, especially in CV. To meet the training needs of neural networks, some large-scale supervised datasets, such as ImageNet [20], were built. However, SL also has a notable drawback: large-scale data labeling is required. As model sizes and parameter counts grew, this drawback became more pronounced. In NLP, since labeling text is much more difficult than labeling images, SSL has gradually been favored by scholars because it requires no labeling.

Self-supervised learning

SSL allows learning feature representations from unlabeled data for subsequent tasks. The distinguishing feature of SSL is that it does not require manual labels, but instead generates labels automatically from the unlabeled data samples.

SSL usually involves 2 main processes [21]: (a) Self-supervised training phase: the model is trained to solve a designed pretext task, with pseudo labels generated automatically from data properties. This phase is designed to let the model learn a generic representation of the data. (b) Downstream task application phase: after self-supervised training, the knowledge learned by the model can be further applied to actual downstream tasks. Downstream tasks use SL methods and include semantic segmentation [22], target detection [23], and sentiment analysis [24]. Thanks to self-supervised training, the model’s generalization ability and convergence speed on downstream tasks are greatly improved.
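The self-supervised training phase (a) can be illustrated with the classic rotation-prediction pretext task, which is not specific to this paper but shows how pseudo labels arise from the data itself: each unlabeled image is rotated by a multiple of 90 degrees, and the rotation index becomes the free label the model must predict. A minimal sketch with toy 2D grids standing in for images:

```python
# Pseudo-labels generated automatically from unlabeled data:
# the rotation applied to each sample IS its label.
import random

def rotate90(img):
    """Rotate a 2D grid (list of lists) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def make_pretext_samples(images):
    """Turn unlabeled images into (input, pseudo_label) pairs."""
    samples = []
    for img in images:
        k = random.randrange(4)        # pseudo-label: number of 90° turns
        rotated = img
        for _ in range(k):
            rotated = rotate90(rotated)
        samples.append((rotated, k))   # model must recover k from pixels
    return samples

unlabeled = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
pretext = make_pretext_samples(unlabeled)
```

Solving this task forces the model to learn orientation-sensitive visual features, which is the generic representation later reused in phase (b).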

SSL methods generally fall into 3 categories [25]: generative-based, contrastive-based, and adversarial-based. Generative-based methods first encode the input data with an encoder and then use a decoder to reconstruct the original form of the data; the model learns by minimizing the reconstruction error. Generative-based methods include auto-regressive models, auto-encoding models, etc. [26]. Contrastive-based methods construct positive and negative samples via pretext tasks and learn by comparing the similarity of a sample with its positive and negative counterparts; such methods include SimCLR [27] and others. Adversarial-based methods consist of a generator and a discriminator: the generator produces fake samples, while the discriminator is trained to distinguish these fake samples from real ones [25]; a typical example is the generative adversarial network [28].
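The contrastive-based objective can be made concrete with an InfoNCE-style loss, the form used by SimCLR; the sketch below is a simplified single-pair illustration (no encoder, no augmentation pipeline), with hand-picked 2D embeddings standing in for learned representations.

```python
# Contrastive learning signal: an anchor embedding should be more
# similar to its positive (another view of the same sample) than
# to negatives (other samples). InfoNCE = -log softmax over sims.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: -log softmax of the positive similarity."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    exps = [math.exp(s / temperature) for s in sims]
    return -math.log(exps[0] / sum(exps))

anchor   = [1.0, 0.0]
positive = [0.9, 0.1]   # augmented view of the same sample
negative = [0.0, 1.0]   # a different sample
loss = info_nce_loss(anchor, positive, [negative])
```

When the anchor is already close to its positive, the loss is near zero; swapping the roles of positive and negative makes it large, which is exactly the gradient signal that pulls views of the same sample together.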

Pretext tasks of SSL

The pretext tasks can also be referred to as self-supervised tasks as they rely on the data itself to generate labels. These tasks are designed to make the model learn representations that are relevant to a specific task, thereby better handling downstream tasks.

In CV, methods of designing pretext tasks according to data attributes fall into 4 main categories [21]: generation-based, context-based, free semantic label-based, and cross-modal-based. Among them, generation-based approaches mainly involve image or video generation tasks [29,30]; context-based pretext tasks are mainly designed leveraging contextual features of images or videos, such as contextual similarity, spatial structure, temporal structure, etc. [31–33]; in free semantic label-based pretext tasks, the network is trained leveraging automatically generated semantic labels [34]; and cross-modal-based pretext tasks consider multiple modalities such as vision and voice [35].

In NLP, the most common pretext tasks include [36] center and neighbor word prediction, next and neighbor sentence prediction, autoregressive language modeling, sentence permutation, masked language modeling, etc. For example, the Word2Vec [37] model uses center word prediction as its pretext task, while the BERT model uses next sentence prediction and masked language modeling. These models learn representations of the corpus and are then applied to downstream tasks.
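The masked-language-modeling pretext task can be sketched in a few lines. This is an illustrative simplification of BERT-style masking (real implementations also sometimes keep or randomly replace the selected token); the sentence and mask ratio are arbitrary choices for the example.

```python
# MLM pretext task: some tokens are replaced with [MASK], and the
# original tokens become automatically generated training targets.
import random

def mask_tokens(tokens, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    rng = random.Random(seed)          # seeded for reproducibility
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_ratio:
            targets[i] = tok           # label comes from the data itself
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, targets

sent = "the car slows down before the intersection".split()
masked, targets = mask_tokens(sent, mask_ratio=0.3)
```

The model is then trained to predict each entry of `targets` from the surrounding unmasked context, with no human annotation involved.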

Fine-tuning

Fine-tuning is the process of further training on a specific task based on an already trained model, to adapt it to the specific data and requirements of the task. Typically, a model that has been pre-trained on large-scale data is used as a foundational model, and then it is fine-tuned on a specific task to improve performance. Currently, in the field of LLMs, fine-tuning methods include 2 main approaches: instruction tuning and alignment tuning [38].

Instruction fine-tuning aims to fine-tune pre-trained models on a collection of datasets described by instructions [39]. It generally includes 2 phases: first, instances of instruction formatting are collected or created; then, these instances are used to fine-tune the model. Instruction fine-tuning allows LLMs to exhibit strong generalization ability on previously unseen tasks. Models obtained after pre-training and fine-tuning work well in most cases; however, some special cases may occur. An LLM, for example, may fabricate false information or retain biased information from its corpus. To avoid such problems, the concept of human-aligned fine-tuning was proposed, with the goal of making the model’s behaviors conform to human expectations [40]. In contrast to instruction fine-tuning, this kind of alignment requires the consideration of completely different standards.

The GPT family is a typical FM, and its training process likewise includes pre-training and fine-tuning. Taking ChatGPT as an example, its pre-training uses self-supervised pre-training [41]. Given an unsupervised corpus, a standard language modeling objective is optimized via maximum likelihood estimation (MLE). GPT uses a multi-layer transformer decoder architecture [42], yielding a pre-trained model.
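The language modeling objective referenced here can be written out explicitly. Following the standard formulation from the original GPT setup (the notation below is the conventional one, not reproduced from this paper), given an unsupervised token corpus $\mathcal{U} = \{u_1, \ldots, u_n\}$, MLE maximizes

```latex
L(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```

where $k$ is the size of the context window and $\Theta$ are the parameters of the transformer decoder, which models each conditional next-token probability.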

The fine-tuning phase of ChatGPT consists of the following 3 steps [40]. First, supervised fine-tuning (SFT) is performed on the pre-trained model; second, comparison data are collected to train a reward model; and third, the SFT model is fine-tuned to maximize the reward leveraging the proximal policy optimization (PPO) algorithm [43]. The last 2 steps together constitute reinforcement learning from human feedback (RLHF) [44].
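The second step, training the reward model from pairwise comparisons, is commonly fit with a Bradley–Terry-style logistic loss. The sketch below is a hedged illustration of that loss only (no neural network); the scalar rewards stand in for a learned reward model's outputs on a chosen and a rejected response.

```python
# Reward-model loss for one human-labeled comparison pair:
# -log sigmoid(r_chosen - r_rejected). The model is pushed to
# score the human-preferred response higher.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    """Bradley–Terry / logistic loss on a preference pair."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# If the reward model already ranks the preferred response higher,
# the loss is small; if the ranking is inverted, it is large.
good_order = preference_loss(2.0, 0.5)
bad_order = preference_loss(0.5, 2.0)
```

In step 3, the scalar reward produced this way becomes the optimization target that PPO maximizes while keeping the policy close to the SFT model.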

Abilities of FMs in autonomous driving

The ultimate goal of autonomous driving is a driving system that can completely replace human driving; the basic criterion of evaluation is to drive like a human driver, which places very high demands on the reasoning ability of autonomous driving models. FMs based on large-scale data learning have powerful reasoning and generalization abilities and thus hold great potential for autonomous driving. FMs can be used to enhance scenario understanding, give language-guided commands, and generate driving actions. In addition, the powerful generative capability of FMs can be exploited for data augmentation, including extending existing autonomous driving datasets and directly generating driving scenarios. In particular, world models (a type of FM) can learn the inner workings of the physical world and predict future driving scenarios, which is of substantial importance for autonomous driving.

Consequently, it was deemed appropriate to conduct a comprehensive review of the applications of FMs in autonomous driving. This paper provides that review.

• In the “Supervised End-to-End Autonomous Driving” section, a brief overview of the latest supervised end-to-end autonomous driving is provided, to offer the reader a better background understanding.

• The “Human-Like Driving Based on Language and Vision Models” section reviews the applications of language and vision FMs in enhancing autonomous driving.

• The “Prediction of Autonomous Driving Based on World Models” section reviews the applications of world models in the exploration of the field of autonomous driving.

• The “Data Augmentation Based on Foundation Models” section reviews the applications of FMs in data augmentation.

Building on the preceding overview, the “Conclusion and Future Directions” section presents the challenges and future directions for enhancing autonomous driving with FMs.

Supervised End-to-End Autonomous Driving

The “pre-training + fine-tuning” research idea in autonomous driving did not appear only after the introduction of large models; it has been studied for a long time under a more familiar term: end-to-end autonomous driving. In the past few years, scholars have optimized pretraining backbones in various ways, including the transformer architecture and SSL methods. Note that a pretraining backbone here refers to a model that transforms each modal input into a usable feature representation for downstream tasks (such as target detection, trajectory prediction, and decision planning). Many research attempts have also been made to develop end-to-end frameworks based on the transformer architecture, with excellent results. Hence, to summarize the application of foundation models in autonomous driving more comprehensively, we believe it is necessary to introduce research on end-to-end autonomous driving based on pretraining backbones. In this section, we summarize the latest research on pretraining backbones within end-to-end autonomous driving solutions. The pipeline for such methods is briefly illustrated in Fig. 3.

Fig. 3.

The pipeline diagram for the supervised end-to-end autonomous driving system with a pretraining backbone. Multi-modal sensing information is input to the pretraining backbone to extract features, after which it enters into the framework of autonomous driving algorithms built by various methods to realize tasks, such as planning/control, to accomplish end-to-end autonomous driving tasks.

Pretraining backbone

In end-to-end modeling, feature extraction of low-level information from raw data determines the potential of subsequent model performance to a certain extent, and an excellent pretraining backbone can endow the model with more powerful feature learning capability.

Pretraining convolutional networks such as ResNet [45] and VGGNet [46] are the most widely used backbones for visual feature extraction in end-to-end models. These networks are often trained on target detection or segmentation tasks to extract generalized feature information, and their competitive performance has been verified in many works. ViT [47] first applied the transformer architecture to image processing and achieved excellent classification results. The transformer is well suited to handling large-scale data, with a comparatively simple architecture and fast inference. Its self-attention mechanism is very suitable for processing time-series data: it enables modeling and prediction of the temporal motion trajectories of objects in the environment and is conducive to the fusion of heterogeneous data from multiple sources, such as LiDAR point clouds, images, and maps.
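The self-attention operation at the core of these transformer backbones is scaled dot-product attention. As a minimal sketch (plain Python lists instead of the batched tensor operations used in practice, and toy 2D query/key/value vectors):

```python
# Scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
# Each query produces a weighted mix of the value vectors, with weights
# given by its similarity to each key.
import math

def softmax(xs):
    m = max(xs)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)        # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two key/value positions.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
ctx = attention(Q, K, V)
```

Because the weights depend only on pairwise similarities, the same mechanism applies unchanged whether the positions index image patches, time steps of a trajectory, or tokens from different sensor modalities, which is why it lends itself to multi-source fusion.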

Another class of pretraining backbones, represented by LSS [48], BEVDet [49], BEVFormer [50], BEVerse [51], etc., extracts the images captured by surround-view cameras and converts them into bird’s eye view (BEV) features through model learning, indexing local image features from 2-dimensional (2D) viewpoints into 3D space. In recent years, BEV has attracted extensive interest due to its ability to describe the driving scene more accurately. The use of BEV features as pretraining backbone outputs is not limited to cameras: the extraction and fusion of multi-modal BEV features, represented by BEVFusion [52], has further widened the field of view available to autonomous driving systems. It should be pointed out, however, that although the transformer architecture brings great performance enhancement, these backbones still construct pre-trained models with SL methods, which rely on massive labeled data, and data quality greatly affects the final result of the model.

In both the camera and point cloud processing domains, some works implement the pretraining backbone with unsupervised or SSL methods. Wu et al. [53] proposed the PPGeo model, which uses a large number of unlabeled driving videos to accomplish pre-training of the visual encoder in 2 stages and can be adapted to different downstream end-to-end autonomous driving tasks. Sautier et al. [54] proposed BEVContrast for self-supervision of 3D backbones on automotive LiDAR point clouds; it defines contrasts at the level of 2D cells in the BEV plane, retaining the simplicity of PointContrast [55] while maintaining good performance on downstream driving tasks. The “masking + reconstruction” style of SSL is also considered an effective way of modeling the world: Yang et al. [56] proposed UniPAD, implemented with SSL methods based on masked autoencoding and 3D rendering. A portion of the multi-modal data is randomly masked and transformed into voxel space, where RGB or depth predictions are generated by rendering techniques in 3D space, with the original images serving as supervision for the generated outputs. The flexibility of the approach enables good integration into both 2D and 3D frameworks, and downstream tasks such as depth estimation, target detection, and segmentation fine-tuned on the model perform superiorly.

Supervised end-to-end autonomous driving models

Early work on end-to-end autonomous driving models was mainly based on various types of deep neural networks constructed through imitation learning [57–61] or reinforcement learning [62–64]. Chen et al. [3] analyzed the key challenges facing end-to-end autonomous driving from a methodological perspective, pointing out the future trend of empowering end-to-end autonomous driving with foundation models such as the transformer. Some scholars have tried to build end-to-end autonomous driving systems with transformers and obtained competitive results. Examples include Transfuser [65,66], NEAT (NEural ATtention fields for end-to-end autonomous driving) [67], Scene Transformer [68], PlanT [69], Gatformer [70], FusionAD [71], UniAD [72], VAD (Vectorized scene representation for efficient Autonomous Driving) [73], GenAD [74], and a host of other end-to-end frameworks developed on the transformer architecture.

Chitta et al. [65,66] proposed Transfuser, which takes RGB images and BEV views from LiDAR as inputs, uses multiple transformers to fuse the feature maps, and predicts the trajectory points for the next 4 steps with a single-layer gated recurrent unit (GRU) network, followed by longitudinal and lateral proportional–integral–derivative (PID) controllers to control the vehicle. NEAT [67] further mapped the BEV scene to trajectory points and semantic information, then used an intermediate attention map to compress high-dimensional image features, allowing the model to focus on driving-relevant regions and ignore information irrelevant to the driving task. PlanT, proposed by Renz et al. [69], used simple object-level representations (vehicles and roads) as inputs to a transformer encoder and used speed prediction of surrounding vehicles as an auxiliary task for predicting future waypoint trajectories. UniAD, proposed by Hu et al. [72], enhanced the design of the decoder and integrated the full stack of autonomous driving tasks into a unified framework to improve driving performance, although still relying on different sub-networks for each task. This work won the CVPR 2023 Best Paper Award, reflecting academic recognition of the end-to-end autonomous driving paradigm. However, these models often require intensive computation. For this reason, Jiang et al. [73] proposed a method that fully vectorizes the driving scenario and learns instance-level structural information to improve computing efficiency. In contrast to the previous modular end-to-end planning, Zheng et al. [74] proposed a generative end-to-end framework, modeling autonomous driving as a trajectory generation problem.
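The PID controllers used for low-level tracking in pipelines like Transfuser's can be sketched generically. This is a textbook PID, not the paper's implementation; the gains and time step below are illustrative assumptions, not values from the cited work.

```python
# Generic PID controller: output = Kp*e + Ki*∫e dt + Kd*de/dt.
# In a Transfuser-style pipeline, one instance tracks target speed
# (longitudinal) and another tracks heading error (lateral).
class PID:
    def __init__(self, kp, ki, kd, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, error):
        self.integral += error * self.dt
        deriv = 0.0 if self.prev_error is None else \
            (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# Longitudinal example with illustrative gains: error = target - current speed.
ctrl = PID(kp=0.5, ki=0.1, kd=0.05)
throttle = ctrl.step(error=2.0)   # positive error -> accelerate
```

The division of labor is the key design point: the learned network only has to emit waypoints, while the PID layer turns tracking error on those waypoints into smooth throttle and steering commands.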

Moreover, Drive Anywhere, proposed by Wang et al. [75], not only realizes end-to-end multi-modal autonomous driving but also incorporates an LLM to provide driving decisions based on representations that can be queried through images and text. Dong et al. [76] generated image-based action commands and explanations by building a transformer-based feature extraction model. Jin et al. [77] proposed the ADAPT model, which directly outputs vehicle control signals together with inferred language descriptions through an end-to-end model. This is the first driving action captioning architecture based on an action-aware transformer. It accomplishes the driving control task while adding natural language narratives to guide the decision-making and action process of the control module. It also helps the user track the vehicle’s state and surrounding environment at all times and better understand the basis of the actions taken by the autonomous driving system, improving the interpretability of its decisions. This provides a glimpse of the transformer architecture’s potential to improve the interpretability of end-to-end driving decisions.

Human-Like Driving Based on Language and Vision Models

With remarkable research progress in LLMs such as BERT, GPT-4, and Llama [78]; vision language models (VLMs) such as CLIP [79], ALIGN [80], and BLIP-2 [81]; and multi-modal large language models (M-LLMs) such as GPT-4V [82], LLaVA [83], and Gemini [84], as well as other FMs, their powerful reasoning capabilities are considered to have ushered in a new dawn for the realization of artificial general intelligence [85], with a marked and far-reaching impact on all aspects of society. In autonomous driving, language and vision FMs likewise show great potential: they are expected to improve the understanding and reasoning ability of autonomous driving models on driving scenarios and to realize human-like driving.

We introduce research on using language and vision FMs to enhance the autonomous driving system’s understanding of driving scenarios, as well as reasoning to give language-guided instructions and driving actions, as illustrated in Fig. 4. Related work on enhancing the understanding of driving scenarios is presented in the “Understanding of driving scenarios” section, on reasoning to give language-guided instructions in the “Language-guided instructions” section, and on reasoning to generate driving actions in the “Generation of actions” section.

Fig. 4.

The pipeline diagram for enhancing autonomous driving leveraging FMs, where FMs refer to language models and vision models. FMs can learn perceptual information and utilize their powerful ability to understand the driving scenarios and reason to give language-guided instructions and driving actions to enhance autonomous driving.

Understanding of driving scenarios

The study by Vasudevan et al. [86] found that the ability of the model to comprehend the scene and localize objects can be effectively enhanced by acquiring verbal descriptions and gaze estimation. Li et al. [87] proposed an image captioning model that generates high-level semantic information to improve its comprehension of the traffic scene. Their work verified that linguistic and visual features can effectively enhance the comprehension of driving scenarios.

Sriram et al. [88] proposed an autonomous navigation framework that combines semantic segmentation results with natural language commands. The framework has been verified to drive a vehicle effectively in the CARLA simulator and on the KITTI dataset [89]. Elhafsi et al. [90] identified semantic anomalies by converting observed visual information into natural language descriptions and passing them to an LLM to exploit its powerful reasoning capabilities. In the context of VLM applications, Chen et al. [91] transferred image and text features to a 3D point cloud network based on CLIP to enhance the model’s understanding of the 3D scene. Romero et al. [92] constructed a video analytics system based on VIVA [93], an extended model of CLIP, to improve query accuracy through the powerful comprehension of VLMs. Tian et al. [94] employed a VLM to describe and analyze driving scenarios, thereby enhancing their understanding. Beyond direct enhancement of scene data, perceptual features have also been explored for enhancement. Pan et al. [95] designed an ego-car prompt to enhance the obtained BEV features using the LM in CLIP. Dewangan et al. [96] proposed enhancing BEV maps by detecting the features of each object in the BEV through VLMs (BLIP-2 [81], MiniGPT-4 [97], and InstructBLIP [98]) and obtaining a language-enhanced BEV map through linguistic characterization. However, existing VLMs are constrained to the 2D domain, lacking the capacity for spatial awareness and long-horizon extrapolation. To address this issue, Zhou et al. [99] proposed the embodied language model (ELM), which enhances the understanding of driving scenarios over long time horizons and across space by using diverse pre-training data and selecting adaptive tokens.

Language-guided instructions

Here, we review studies that give linguistic instructions through FMs, mainly descriptive instructions such as “Red light ahead, you should slow down” or “Intersection ahead, please pay attention to pedestrians”. Ding et al. [100] used a visual encoder to encode video data, which was then fed into an LLM to generate corresponding driving scenario descriptions and suggestions. In particular, this work also proposed a method that fuses high-resolution feature maps and the resulting high-resolution information into M-LLMs to further enhance the model’s recognition, interpretation, and localization capabilities. Fu et al. [101] explored the potential of leveraging LLMs to comprehend driving environments like a human being, utilizing the LLaMA-Adapter [102] to describe the scene data and then giving linguistic commands via GPT-3.5. Wen et al. [103] proposed DiLu, a knowledge-driven paradigm built on previous work that can make decisions based on common-sense knowledge and accumulate experience. In particular, the article pointed out that DiLu possesses the ability to acquire experience directly from real-world data, which has potential for the practical deployment of autonomous driving systems. To further improve the safety of LLM-based autonomous driving, Wang et al. [104] used an MPC-based verifier to evaluate and provide feedback on trajectory planning, and then fused prompt learning to enable the LLM to perform in-context safety learning, improving the overall safety and reliability of autonomous driving. To enrich the data input and obtain more accurate scene information, Wang et al. [105] utilized a multi-modal LLM to enable the autonomous driving system to obtain linguistic commands; to bridge the gap between linguistic commands and vehicle control commands, this work performed an alignment operation on decision states.

The aforementioned works operate mostly on datasets and in simulation environments, but there has also been exploratory work on real-vehicle testing. Wayve proposed LINGO-1 [106], a self-driving interaction model built on a vision–language–action foundation model that can explain itself and answer visual questions while driving. It introduced human driving experience, so that it can explain various causal elements in driving scenarios through natural language descriptions, acquire feature information in the driving scenario in a human-like way, and learn and give interactive language commands. Cui et al. [107] innovatively placed the LLM in the cloud, entered human commands, and leveraged the reasoning ability of the LLM to generate executable code instructions. However, this work suffers from latency issues and has room for improvement with respect to the real-time requirements of autonomous driving.

The pipeline for incorporating LLMs into autonomous driving systems in recent research is summarized in general terms in Fig. 4; it is mainly implemented through scene understanding, high-level semantic decision-making, and trajectory planning. In this subsection, we summarize the advanced decision-making applications and argue that the research processes share some similarities. To illustrate how they work more clearly, we use DriveMLM [105], a typical recent research work, as an example for further illustration in Fig. 5.

Fig. 5.

For the application of LLMs to autonomous driving system decision-making, a typical pipeline is shown in this figure, referenced from DriveMLM [105].

DriveMLM simulates the behavioral planning module of a modular autopilot system by using an M-LLM, which performs closed-loop autonomous driving in a realistic simulator based on processed perceptual information and command requirements. DriveMLM also generates natural language explanations of its driving decisions, thereby increasing the transparency and trustworthiness of the system.
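A minimal sketch of the decision-state alignment idea follows, assuming an illustrative vocabulary of discrete speed and path states; the actual DriveMLM state definitions differ, and the matching logic here is deliberately simplistic.

```python
# Sketch of mapping an M-LLM's free-form decision onto the discrete
# behavioural states a modular planner consumes. State names are illustrative.

SPEED_STATES = ("accelerate", "decelerate", "stop", "keep")
PATH_STATES = ("left_change", "right_change", "keep_lane")

def align_decision(llm_answer: str) -> tuple[str, str]:
    """Extract (speed_state, path_state) tokens from a linguistic decision."""
    speed = next((s for s in SPEED_STATES if s in llm_answer), "keep")
    path = next((p for p in PATH_STATES if p in llm_answer), "keep_lane")
    return speed, path

state = align_decision("decelerate and stay in lane, a pedestrian is crossing")
```

A production system would constrain the LLM's output format (or fine-tune it) rather than parse free text, but the principle of grounding language in a fixed decision vocabulary is the same.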

Generation of actions

As described in the “Language-guided instructions” section, academia and industry have attempted to embed GPT linguistic knowledge into autonomous driving decisions, enhancing performance through linguistic instructions and thereby promoting the application of FMs to autonomous driving. Long before FMs made breakthroughs in the LLM field, some works attempted to improve autonomous driving performance with similar research ideas. For example, the MP3 framework proposed by Casas et al. [108] used high-level semantic information as a decision training guide, which, together with sensory data, constitutes the input for building algorithms that realize motion prediction.

Research on the application of LLMs in autonomous driving is on the rise, and the GPT series, as the most successful variant of the transformer architecture, may bring breakthroughs that improve comprehensive performance at multiple levels. LLMs represent FMs empowering autonomous driving at the level of linguistic knowledge; however, linguistic descriptions and reasoning cannot be directly applied by an autonomous driving system. For a large model to be truly deployed on the vehicle, its output must eventually take the form of planning or control instructions; i.e., FMs should ultimately empower autonomous driving at the action level. Nevertheless, quantizing linguistic decisions into action commands, such as planning and control, that are usable by the autonomous driving system still faces great challenges. Some scholars have made preliminary explorations, but there is still much room for development. Moreover, some scholars have explored constructing autonomous driving models through a GPT-like approach that directly outputs trajectories and even control commands based on an LLM. In Table 1, we provide a brief overview of some representative works.

Table 1.

Works on the use of LLMs for generating autonomous driving planning and control

| Authors | Input | Output | Learning | Description |
| --- | --- | --- | --- | --- |
| Sha et al. [109] | Prompt of scenario | Control action sequences | SL | This work enables navigation and localization as well as further trajectory planning through visual perception and verbal commands. |
| Omama et al. [111] | OSM maps with descriptions, LiDAR, camera | Location, trajectory | SL | This work utilizes a Bayesian state estimation model leveraging visual-linguistic features to generate global paths, plan trajectories, and control vehicles to complete navigation. |
| Keysan et al. [114] | Scene raster, text prompt | Trajectory | SL | This work encodes the driving scene and text prompt with pre-trained models dedicated for each modality, finally sifting through the set of trajectories to find the target trajectory. |
| Seff et al. [120] | Multi-modal scene | Motion prediction | SL | This work uses a single standard linguistic modeling objective to learn multi-modal distributions for predicting the future behaviors of traffic participants. |
| Mao et al. [121] | Perception, ego-states, trajectory, goal | Thoughts, driving decisions, trajectory | SL | This work represents the inputs and outputs as linguistic tokens and utilizes the LLM to generate driving trajectories and provide explanations for decision-making. |
| Wang et al. [113] | Scene information | Driving decisions, trajectory | SL | In the pre-training stage, this work trains a causal transformer for driving scenario prediction and decision-making. In the fine-tuning stage, it adapts to motion planning and accurate BEV generation. |
| Xu et al. [116] | Video, text prompt, control signal | Decision, control signal | SL | This work tokenizes video sequences, text, and control signals to build the model, which can generate responses to human inquiries and predict control signals. |
| Sima et al. [115] | Video, text question | Scene description, decision, trajectory | SL | Based on Graph Visual Question Answering (GVQA), this work realizes structured reasoning for perception, prediction, and planning through suitable quizzes for human-like autonomous driving. |
| Shao et al. [119] | Camera, LiDAR, text prompt | Trajectory | SL | This work accomplishes end-to-end autonomous driving by interacting with dynamic environments through multi-modal multi-view sensor data and language commands. |
| Ma et al. [123] | Video, text prompt, control signal | Scene description, prediction, trajectory | SL | This work employs a Grounded-CoT process to enhance the model’s reasoning capabilities. This work also integrates 4 different tasks to facilitate the model’s comprehensive understanding of complex driving scenarios. |

Sha et al. [109] proposed LanguageMPC, which employs GPT-3.5 as a decision-making module for complex autonomous driving scenarios that require human common-sense comprehension. By designing cognitive pathways for integrated reasoning in the LLM, they proposed algorithms to transform LLM decisions into actionable driving control commands, improving the vehicle’s ability to handle complex driving behaviors. Jain et al. [110] achieved navigation, localization, and further trajectory planning with the help of visual perception and explicit verbal commands. Omama et al. [111] constructed ALT-Pilot, a multi-modal map-based navigation and localization method that can navigate to arbitrary destinations without high-definition LiDAR maps, demonstrating that off-the-shelf visual LMs can be used to construct linguistically enhanced terrain maps. Pan et al. [95] proposed the VLP method to improve the contextual reasoning for visual perception and motion planning of an autonomous driving system with the powerful reasoning capability of the LLM in the training phase, and achieved excellent performance in the open-loop end-to-end motion planning task.
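LanguageMPC's idea of grounding an LLM's high-level decision in low-level controller parameters can be caricatured as a lookup from decision labels to cost weights. The profile names and numeric values below are assumptions for illustration, not the paper's actual parameterization.

```python
# Illustrative sketch: translating an LLM's discrete decision into MPC-style
# cost weights. Decision names and weight values are hypothetical.

MPC_PROFILES = {
    # decision: (speed_tracking_weight, comfort_weight, target_speed_mps)
    "assertive_merge": (1.0, 0.2, 15.0),
    "yield":           (0.3, 1.0, 5.0),
    "normal_follow":   (0.6, 0.6, 12.0),
}

def decision_to_mpc_params(llm_decision: str):
    """Fall back to a conservative profile for unrecognized decisions."""
    return MPC_PROFILES.get(llm_decision, MPC_PROFILES["yield"])

params = decision_to_mpc_params("normal_follow")
```

The fallback to a conservative profile reflects the safety concern raised throughout this section: linguistic outputs may be malformed, so the mapping layer must fail safe.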

Some scholars have also attempted to construct autonomous driving models directly through a GPT-like approach, i.e., leveraging LLMs to construct an end-to-end autonomous driving planner, which directly outputs predicted trajectories, path planning, and even control commands, intending to effectively improve the ability of autonomous driving models to generalize to unknown driving scenarios.

Pallagani et al. [112] constructed Plansformer, which is both an LLM and a planner, showing the great potential of an LLM fine-tuned as a planner across a variety of planning tasks. Wang et al. [113] constructed the BEVGPT model, which takes information about the current road surroundings as input and outputs a sequence that includes future vehicle decision instructions and spatial paths that self-driving vehicles can follow.

Some works [114–119] took both text prompts and information about the current road surroundings as inputs and then output textual responses or interpretations together with a sequence of future vehicle decision instructions and spatial paths that a self-driving vehicle can follow. Among them, Cui et al. [117] utilized GPT-4 with natural language descriptions and environmental perception data as inputs to make the LLM directly output driving decisions and operation commands. Furthermore, they conducted experiments on highway overtaking and lane-changing scenarios in Ref. [118] to compare driving decisions provided by the LLM under different prompts, and the study showed that chain-of-thought prompting helps the LLM make better driving decisions.

Some scholars have also tried different ideas. Seff et al. [120] proposed MotionLM, which treats motion prediction as a language modeling task: continuous trajectories are represented as discrete sequences of motion tokens, and a single standard language modeling objective is used to learn multi-modal distributions for predicting the future behaviors of road network participants. Mao et al. [121] proposed the GPT-Driver model, which reformulates the motion-planning task as a language modeling problem by representing the inputs and outputs of the planner as linguistic tokens and leveraging the LLM to generate driving trajectories through linguistic descriptions of coordinate positions. Furthermore, they [122] proposed Agent Driver, which used an LLM to introduce a general-purpose library of tools accessible via function calls, a cognitive memory of common-sense and empirical knowledge for decision-making, and a reasoning engine capable of CoT reasoning, task planning, motion planning, and self-reflection, achieving a more nuanced, human-like approach to autonomous driving. Ma et al. [123] proposed Dolphins, which is capable of performing tasks such as scene understanding, behavior prediction, and trajectory planning. This work demonstrates the ability of a visual LM to comprehensively understand complex, open-world long-tail driving scenarios and solve a range of AV tasks, as well as emergent human-like capabilities, including gradient-free in-context adaptation and reflective error recovery.
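The core move shared by MotionLM and GPT-Driver, representing planner inputs and outputs as linguistic tokens, can be illustrated by a simple round-trip encoding of waypoints as text. The exact token format below is an assumption, not either paper's tokenizer.

```python
# Sketch of trajectory <-> text round-tripping, the basic mechanism behind
# "planning as language modeling". The "(x,y)" token format is hypothetical.

def trajectory_to_text(waypoints):
    """Encode (x, y) waypoints, in metres, as a token string."""
    return " ".join(f"({x:.2f},{y:.2f})" for x, y in waypoints)

def text_to_trajectory(text):
    """Parse a generated token string back into waypoints."""
    pts = []
    for tok in text.split():
        x, y = tok.strip("()").split(",")
        pts.append((float(x), float(y)))
    return pts

traj = [(0.0, 0.0), (1.25, 0.1), (2.5, 0.35)]
assert text_to_trajectory(trajectory_to_text(traj)) == traj
```

Fixing the number of decimal places bounds the token vocabulary, which is one reason these works quantize coordinates before feeding them to the LLM.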

Considering the scale challenges of visual language models (VLMs), Chen et al. [124], based on the idea that digital vector modalities are more compact than image data, fused vectorized 2D scene representations with pre-trained LLMs to improve the LLM’s ability to interpret and reason about the integrated driving situation, yielding scene interpretation and vehicle control commands. Tian et al. [94] proposed DriveVLM, which, through the CoT mechanism, not only generates descriptions and analyses of the scenes in image sequences to guide driving decisions but also enables trajectory planning in conjunction with the traditional autonomous driving pipeline. This work also offers possible solutions to the challenges inherent in VLMs in terms of spatial reasoning and computation, realizing an effective transition between existing autopilot approaches and large-model-based approaches.

As in the previous subsection, for research applying LLMs to the direct generation of trajectory planning for autonomous driving systems, we take LMDrive [119], a typical recent research work, as an example in Fig. 6 to illustrate more clearly how it works. LMDrive is based on the CARLA simulator, and the model training consists of 2 phases: pre-training and instruction fine-tuning. In the pre-training phase, prediction heads are added to the vision encoder to perform pre-training tasks. After pre-training is completed, the prediction heads are discarded and the vision encoder is frozen. In the instruction fine-tuning stage, a navigation instruction and a notice instruction are configured for each driving segment; the visual tokens are processed through the time series of instruction encoding by LLaMA and, together with the textual tokens, are input into the LLM to obtain the prediction tokens. After a 2-layer MLP adapter, the outputs are the planned future trajectory of the ego vehicle and a flag indicating whether the instruction has been completed; the planned trajectory closes the simulation loop through lateral and longitudinal PID controllers.
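The planned trajectory is ultimately tracked by longitudinal and lateral PID controllers. The generic incremental PID below, with illustrative gains and a crude first-order vehicle model, sketches the longitudinal half only; it is not LMDrive's actual controller.

```python
# Generic PID sketch for tracking a planner's target speed. Gains and the
# toy vehicle model are illustrative assumptions.

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# Longitudinal control: drive speed toward the planner's target speed.
pid = PID(kp=0.5, ki=0.05, kd=0.1)
speed, target, dt = 0.0, 10.0, 0.1
for _ in range(400):
    throttle = pid.step(target - speed, dt)
    # Crude vehicle model: saturated throttle yields at most 3 m/s^2.
    speed += max(min(throttle, 1.0), -1.0) * 3.0 * dt
```

The lateral controller works identically but on cross-track error, steering toward the planned waypoints instead of a target speed.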

Fig. 6.

For the application of LLMs to autonomous driving system planning, a typical pipeline is shown in this figure, referenced from LMDrive [119].

This type of research idea is much closer to human driving than pure knowledge embedding into an autonomous driving model. With the development of large models, it may become one of the main development directions in the future. Motion planning, one of the fundamental topics in the field of intelligent robotics [125], is important for quantizing linguistic decisions into action commands, such as planning and even control, available to autonomous driving systems through LLMs. However, it should be noted that the reliability of these new frameworks remains questionable due to the unresolved pitfalls of large models themselves, such as “hallucinations” (LLMs may generate content that conflicts with source or factual information). Specific details about the problems of large models themselves and the challenges they inherit in autonomous driving will be discussed in detail in the “Conclusion and Future Directions” section.

Prediction of Autonomous Driving Based on World Models

World models (WMs) refer to mental models of the world. They can be interpreted as a type of artificial intelligence model that encompasses a holistic understanding or representation of the environment in which it operates, and that is capable of simulating the environment to make predictions or decisions. The term “world models” has been mentioned in connection with reinforcement learning in recent literature [126,127]. This concept has also gained attention in autonomous driving because of its capacity to comprehend and articulate the dynamics of the driving environment, as detailed below. In his position paper, LeCun [128] pointed out that the learning capability of humans and animals may be rooted in their capacity to learn world models, allowing them to internalize and understand how the world works. He noted that humans and animals can acquire a vast amount of background knowledge about the functioning of the world by observing a small number of events, whether related or unrelated to the task at hand. The idea of the world model can be traced back to Dyna, proposed by Sutton [129] in 1991, which observes the state of the world and takes appropriate actions to learn interactively with the world [130]. Dyna is essentially a form of reinforcement learning under supervised conditions. Since then, researchers have made many further attempts. Ha and Schmidhuber [126] learned with an unsupervised approach, using a variational autoencoder (VAE) to encode input features and a recurrent neural network (RNN) to learn the evolution of the state. Hafner et al. [131] proposed the recurrent state space model (RSSM), which combined reinforcement learning to realize multi-step prediction that integrates stochasticity and determinism. Based on the RSSM architecture, Hafner et al. successively proposed DreamerV1 [132], DreamerV2 [133], and DreamerV3 [134], which learn in latent variables to realize image prediction and generation. Gao et al. [135] considered that the implicit states contained redundant information and extended the Dreamer-series framework by proposing the semantic masked recurrent world model (SEM2) to learn driving-relevant states. Hu et al. [136] removed prediction rewards and proposed a model-based imitation learning (MILE) method to predict future states.
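The Dreamer-style loop of encoding an observation, rolling a latent state forward, and decoding imagined observations can be sketched with toy linear maps standing in for trained networks; all weights below are illustrative and carry no learned content.

```python
# Minimal deterministic caricature of "imagination" in a latent world model:
# encode -> repeatedly apply latent dynamics -> decode imagined observations.
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(3, 4)) * 0.1   # observation (4,) -> latent (3,)
W_dyn = np.eye(3) * 0.9                 # toy latent transition
W_dec = rng.normal(size=(4, 3)) * 0.1   # latent -> imagined observation

def imagine(obs, horizon):
    z = W_enc @ obs
    rollout = []
    for _ in range(horizon):
        z = W_dyn @ z               # predict the next latent ("dream" step)
        rollout.append(W_dec @ z)   # decode an imagined observation
    return np.stack(rollout)

traj = imagine(np.ones(4), horizon=5)
```

A real RSSM additionally splits the latent into stochastic and deterministic parts and trains all three maps jointly, but planning in the compact latent space rather than pixel space is the idea this sketch is meant to convey.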

It can be seen that world models are closely related to reinforcement learning, imitation learning, and deep generative models. However, utilizing world models in reinforcement learning and imitation learning generally requires labeled data, and both the SEM2 and MILE approaches mentioned above are conducted within a supervised paradigm. There have also been attempts to combine reinforcement learning with unsupervised learning (UL) to address the limitations of labeled data [137,138]. Owing to their close relationship with SSL, deep generative models have become increasingly popular, and researchers in this field have made many attempts. In the following, we mainly review exploratory applications of generative world models in autonomous driving; the pipeline is illustrated in Fig. 7. The “Deep generative models” section introduces the principles of various types of deep generative models and their applications in generating driving scenarios, the “Generative methods” section introduces the applications of generative world models in autonomous driving, and the “Non-generative methods” section introduces a class of non-generative methods.

Fig. 7.

The pipeline diagram for enhancing autonomous driving with world models. The world models first learn the intrinsic evolutionary patterns by observing the traffic environment and then enhance autonomous driving by hooking up different decoders adapted to different driving tasks.

Deep generative models

Deep generative models generally include VAEs [139,140], generative adversarial networks (GANs) [28,141], flow models [142,143], and autoregressive models (ARs) [144–146].

VAEs combine the ideas of autoencoders and probabilistic graphical models to learn underlying data structures and generate new samples. Rempe et al. [147] used a VAE to learn prior distributions of traffic scenarios and simulate the generation of accident-prone scenarios. GANs consist of a generator and a discriminator, which compete with and enhance each other through adversarial training, ultimately achieving the goal of generating realistic samples. Kim et al. [148] used a GAN model to observe sequences of unlabeled video frames and their associated action pairs to simulate a dynamic traffic environment. Flow models generate similar data samples by transforming simple prior distributions into complex posterior distributions through a series of invertible transformations. Kumar et al. [149] used a flow model to achieve multi-frame video prediction. ARs are a class of sequence analysis methods based on the autocorrelation within sequence data, describing the relationship between the present and the past; their parameters are usually estimated with least squares or maximum likelihood estimation. For example, GPT uses maximum likelihood estimation for model parameter training. Feng et al. [150] achieved the generation of future vehicle trajectories based on autoregressive iterations. Swerdlow et al. [151] implemented street-view image generation based on an autoregressive transformer. The diffusion model, which learns a gradual denoising process starting from pure noise, is the new SOTA among current deep generative models thanks to its strong generative performance. Works such as [152–154] demonstrated that the diffusion model has a strong ability to understand complex scenarios and that video diffusion models can generate higher-quality videos. Works such as [155,156] utilized the diffusion model to generate complex and diverse driving scenarios.
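The forward (noising) half of a diffusion model has a simple closed form. The sketch below samples q(x_t | x_0) for scalar data under an assumed linear beta schedule, showing how the signal coefficient decays toward zero, i.e. toward pure noise, as t grows; schedule constants are illustrative.

```python
# Toy forward diffusion for scalar data: x_t = sqrt(abar_t) x0
# + sqrt(1 - abar_t) * noise, with abar_t the cumulative product of (1 - beta).
import math
import random

def forward_diffuse(x0, t, T=1000, beta_min=1e-4, beta_max=0.02, rng=random):
    """Sample x_t and return (x_t, abar_t) under a linear beta schedule."""
    abar = 1.0
    for s in range(t):
        beta = beta_min + (beta_max - beta_min) * s / (T - 1)
        abar *= 1.0 - beta
    noise = rng.gauss(0.0, 1.0)
    return math.sqrt(abar) * x0 + math.sqrt(1.0 - abar) * noise, abar

xt, abar = forward_diffuse(1.0, t=1000)
# As t -> T, abar -> ~0, so x_t is dominated by Gaussian noise.
```

Training then amounts to teaching a network to predict the added noise at a random t; generation runs the learned denoising steps in reverse, from pure noise back to data.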

Generative methods

Based on the powerful capabilities of deep generative models, leveraging deep generative models as world models to learn driving scenarios to enhance autonomous driving has become popular. The following section will review the applications of leveraging deep generative models as world models in autonomous driving. In Table 2, we provide a brief overview of some representative works.

Table 2.

Works on the use of world models for prediction

| Task | Authors | Input | Output | Learning | Description |
| --- | --- | --- | --- | --- | --- |
| AD | Karlsson et al. [160] | Images, point clouds | Semantic point clouds | SSL | This work utilizes a hierarchical VAE to construct a world model, generates pseudo-complete states, and matches them with partial observations to predict future states. |
| AD | Hu et al. [165] | Videos, text, actions | Videos | SSL | This work utilizes an autoregressive transformer to construct a world model and leverages DINO, a self-supervised image model, to tokenize images. |
| AD | Wang et al. [166] | Images, HDMap, 3D box, text, actions | Videos, actions | SSL | This work obtains comprehension of the structured traffic information. Then, the prediction is formalized into a generative probabilistic model. |
| AD | Zhang et al. [157] | Point clouds, actions | Point clouds | UL/SSL | This work utilizes a discrete diffusion model for point cloud prediction, which is a spatiotemporal transformer. This work leverages VQ-VAE to tokenize sensor observations. |
| AD | Zheng et al. [163] | 3D occupancy scene | Scene, ego-vehicle motion | SSL | By constructing a 3D occupancy space, a world model is trained to predict the next scene from previous scenarios in an autoregressive manner. This work utilizes VQ-VAE for discretizing the 3D occupancy scene into tokens. |
| AD | Min et al. [164] | Image-LiDAR pairs | 4D GO | UL/SSL | This work proposes a spatial–temporal world model for unified autonomous driving pretraining. |
| AD | Bogdoll et al. [162] | Actions, point clouds, images | Point clouds, images, 3D OG | UL/SSL | This work leverages raw data to learn a sensor-agnostic 3D occupancy representation and predicts future states conditional on actions. |
| VP | Finn et al. [174] | Videos | Videos | UL/SSL | This work proposed to interact with the world under unsupervised conditions and develops an action-conditioned model for video prediction. |
| VP | Wu et al. [176] | Videos | Videos | UL/SSL | This work leverages a pre-trained object-centric model to extract object slots from each frame. These slots are then forwarded to a transformer and used to predict future slots. |
| VP | Wang et al. [177] | Images, videos, text, actions | Videos | SSL | Visual inputs are mapped into discrete tokens using VQ-GAN, and then the masked tokens are predicted using a transformer. |

AD, autonomous driving; VP, visual prediction; GO, geometric occupancy; OG, occupancy grids.

Point cloud involved models

Zhang et al. [157] built on MaskGIT [158] and recast it into a discrete diffusion model for point cloud prediction. This method utilized VQ-VAE [159] to tokenize the observation data for label-free learning. Karlsson et al. [160] used a hierarchical VAE to construct a world model, used latent variable prediction and adversarial modeling to generate pseudo-complete states, matched partial observations with pseudo-complete observations to predict future states, and evaluated the approach on the KITTI-360 dataset [161]. In particular, it utilized pre-trained vision-based semantic segmentation models to infer from raw images. Bogdoll et al. [162] constructed MUVO, a multi-modal autonomous generative world model, leveraging raw images and LiDAR data to learn a geometric representation of the world. Conditioned on actions, this model achieved 3D occupancy prediction and can be directly applied to downstream tasks (e.g., planning). Similarly, Zheng et al. [163] used VQ-VAE to tokenize the 3D occupancy scene and constructed a 3D occupancy space to learn a world model that can predict the motion of the ego vehicle and the evolution of the driving scenario. To obtain finer-grained scene information, Min et al. [164] used unlabeled image-LiDAR pairs for pre-training to construct a world model that can generate 4D geometric occupancy.
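The VQ-VAE tokenization step shared by several of these works reduces, at inference time, to a nearest-neighbour lookup in a learned codebook. The toy codebook and features below are illustrative; a trained model would learn both the encoder and the codebook entries.

```python
# Vector-quantization lookup: map each continuous feature vector to the
# index of its nearest codebook entry. Codebook values are illustrative.
import numpy as np

def vq_tokenize(features, codebook):
    """Return, for each feature row, the index of the closest codebook row."""
    # (N, 1, D) - (1, K, D) -> (N, K) squared distances
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
feats = np.array([[0.9, 1.1], [0.1, -0.2]])
tokens = vq_tokenize(feats, codebook)  # nearest entries: indices 1 and 0
```

Once sensor observations are discrete tokens, the same sequence models used for text (masked or autoregressive transformers) apply unchanged, which is exactly why these works adopt it.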

Image-based models

To address the challenges of predicting future changes in driving scenarios, Wayve proposed a generative world model, GAIA-1 [165]. GAIA-1 used a transformer as the world model to learn and predict the next states of the input video, text, and action signals, and then generated realistic driving scenarios. For learning from video streams, GAIA-1 adopted SSL, enabling it to learn from large-scale data and obtain a comprehensive understanding of driving scenarios. Wang et al. [166] devised a 2-stage training strategy: initially, a diffusion model was employed to learn driving scenarios and gain an understanding of structured traffic information; subsequently, a video prediction task was used to construct a world model, designated DriveDreamer. Notably, by integrating historical driving behaviors, this approach enables the generation of future driving actions. Zhao et al. [167] constructed DriveDreamer-2 on top of the DriveDreamer framework by integrating an LLM, which generates the corresponding agent trajectories based on user descriptions, and HDMap information to controllably generate driving videos. Wang et al. [168] generated driving videos by jointly modeling future multi-views and multi-frames, which greatly improved the consistency of the generated results, and end-to-end motion planning was built on this basis.

In the industry, at the 2023 CVPR Autonomous Driving Workshop, Tesla researcher Ashok Elluswamy presented their work in utilizing a generative large model to generate future driving scenarios [169]. In the demonstration, it was seen that the videos generated by Tesla’s generative large model were very close to those captured from real vehicles. It also can generate annotation-like semantic information, indicating that the model also has some semantic-level understanding and reasoning capabilities. Tesla named their work “Learning a General World Model”, and it can be seen that their understanding is to build a generalized world model. By learning from a large amount of visual data captured from real vehicles, Tesla intends to build a large-scale FM for autonomous driving, which can understand the dynamic evolution of the world.

Visual prediction

Vision is one of the most direct and effective means by which humans acquire information about the world, because the feature information contained in image data is extremely rich. Numerous previous works [132–134,138,170] have accomplished the task of image generation through the world model, demonstrating that world models have good understanding and reasoning abilities for image data. However, these works mainly focused on image generation and are still lacking in video prediction tasks, which better represent the dynamic evolution of the world. Video prediction requires a deeper understanding of world evolution and also provides stronger guidance for downstream tasks. The research works [160,165] both effectively predicted and generated future traffic scenarios, where SSL may be key. Previous work has explored this as well. Wichers et al. [171] trained a model on raw images and proposed a hierarchical long-term video prediction method combining low-level pixel space and high-level feature space (e.g., landmarks), achieving longer video prediction than the work in [134]. Endo et al. [172] constructed a model under the SSL paradigm for predicting future traffic scenarios from single-frame images. Based on a denoising diffusion model with probabilistic conditional scores, Voleti et al. [173] trained the model by randomly masking past or future frames without labels, which allowed block-by-block autoregression to generate videos of arbitrary length. Finn et al. [174] proposed to physically interact with the world under unsupervised conditions and realized video prediction by predicting the distribution of pixel motion. Micheli et al. [175] verified the effectiveness of leveraging an autoregressive transformer as a world model and achieved the prediction of game images by training the parameters through SSL. Wu et al. [176] constructed an object-centered world model to learn complex spatiotemporal interactions between objects and generated future predictions of high visual quality.

Inspired by LLMs, Wang et al. [177] considered world modeling as unsupervised visual sequence modeling: visual input is mapped into discrete tokens using VQ-GAN [178], and a spatiotemporal transformer then predicts the masked tokens to learn the physical evolutionary patterns within them, thus gaining the ability to generate videos of various scenarios. Analogous to LLM tokens, OpenAI researchers transformed visual data into patches to build the video generation model Sora. To address the high dimensionality of visual data, they compressed the visual data into a lower-dimensional latent space and then generated a latent representation in this space through diffusion; this representation was then mapped back to pixel space to realize video generation. By learning from Internet-scale data, Sora realizes the scaling law in the video domain and can generate coherent high-definition videos from diverse prompts. In the same year, Google proposed Genie [179], a generative interactive model trained on unlabeled Internet gaming videos. In particular, Genie proposed a latent action model to infer latent actions between frames and constructed a codebook for latent actions through training. To use the model, the user selects the initial frame and a specified latent action, and future frames are generated autoregressively. As the model size and batch size increase, Genie also demonstrates scaling results. In contrast, Sora is designed to generate video content with high fidelity, variable duration, and variable resolution. While not as advanced in video quality as Sora, Genie is optimized for building generative interactive environments in which the user can manipulate frame by frame to generate video.
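Sora-style patches are the visual analogue of text tokens. The sketch below cuts a (T, H, W) clip into non-overlapping spatiotemporal patches; the patch sizes are chosen arbitrarily for illustration and are not those of any published model.

```python
# Cut a video clip into flattened spatiotemporal patches ("visual tokens").
import numpy as np

def to_patches(video, pt=2, ph=4, pw=4):
    """Split a (T, H, W) array into flattened (pt, ph, pw) patches."""
    T, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw)
    v = v.transpose(0, 2, 4, 1, 3, 5)   # group the three patch axes together
    return v.reshape(-1, pt * ph * pw)  # (num_patches, patch_dim)

clip = np.arange(4 * 8 * 8, dtype=np.float32).reshape(4, 8, 8)
patches = to_patches(clip)  # 2 * 2 * 2 = 8 patches, each of dimension 32
```

In a full pipeline these patches would first pass through a learned compressor into latent space; the point here is only that video, like text, becomes a flat sequence of tokens a transformer can model.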

The preceding studies demonstrate the efficacy of world models in enhancing autonomous driving. World models can be directly embedded into autonomous driving models to accomplish various driving tasks. Furthermore, there are explorations of learning to build general world models from large-scale visual data, such as Sora and Genie. These FMs can be utilized for data generation (to be discussed in the “Data Augmentation Based on Foundation Models” section). In addition, based on FMs’ generalization ability, they can be employed to perform a multitude of downstream tasks, or even be utilized to simulate the world.

Non-generative methods

In contrast to generative world models, LeCun [128] elaborated a different conception of the world model by proposing the Joint Embedding Predictive Architecture (JEPA), based on an energy-based model. This is a non-generative, self-supervised architecture: it does not predict the output y directly from the input x, but encodes x as sx to predict sy in representation space, as illustrated in Fig. 8. The advantage is that it does not have to predict all the information about y and can eliminate irrelevant details.
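The difference between the two objectives can be shown with a toy example: a generative model pays its loss in input space, while a JEPA-style model pays it in representation space, where irrelevant detail has already been discarded. The fixed "encoder" below is a stand-in for a learned network, and the signals are illustrative.

```python
# Toy contrast of input-space (generative) vs representation-space (JEPA)
# losses. The hand-written encoder keeps coarse statistics, dropping detail.
import numpy as np

def encode(x):
    # Stand-in encoder: summarizes a signal by its mean and spread.
    return np.array([x.mean(), x.std()])

x = np.array([1.0, 2.0, 3.0, 4.0])
y = x + np.array([0.05, -0.05, 0.05, -0.05])   # same content, noisy detail

generative_loss = ((x - y) ** 2).mean()             # penalizes every pixel
jepa_loss = ((encode(x) - encode(y)) ** 2).mean()   # compares representations
```

Because the noise barely moves the mean and spread, the representation-space loss is far smaller: the JEPA objective is free to ignore details of y that are unpredictable or irrelevant, which is exactly the advantage stated above.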

Fig. 8.

Comparison of the architecture of generative and non-generative methods [184]. (A) Generative architectures reconstruct a signal y from a compatible signal x using a decoder network conditioned on additional (possibly latent) variables z. (B) Joint-embedding predictive architectures predict the embeddings of a signal y from a compatible signal x using a predictor network conditioned on additional (possibly latent) variables z.

Since its proposal, the JEPA architecture has been applied by several scholars to different domains with excellent performance. In the graph domain, Skenderi et al. [180] proposed Graph-JEPA, which partitions the input graph into subgraphs and predicts the representation of a target subgraph from a context subgraph; it achieves excellent performance on both graph classification and regression problems. In the audio domain, Fei et al. [181] proposed A-JEPA, which applies the masked-modeling principle to audio and has been experimentally shown to perform well on speech and audio classification tasks. Sun et al. proposed JEP-KD [182], which employs an advanced knowledge distillation method to enhance visual speech recognition (VSR) and narrow the performance gap between VSR and automatic speech recognition.

In the field of CV, Bardes et al. [183] proposed MC-JEPA, which combines the JEPA architecture with an SSL approach to jointly learn optical flow and content features from video; it performs well on a variety of tasks, including optical flow estimation and image and video segmentation. Meta [184] proposed I-JEPA for learning highly semantic image representations without relying on hand-crafted data augmentation. Combining I-JEPA with Vision Transformers yields strong downstream performance on a variety of tasks, including linear classification, object counting, and depth prediction. Building on I-JEPA, Meta extended JEPA to the video domain with V-JEPA [185], which combines mask prediction with the JEPA architecture to train a family of models whose SSL objective is feature prediction. Experimental results show that these models perform excellently on a range of downstream CV tasks, including action recognition, action classification, and object classification.

To date, no literature has been identified that directly applies JEPA to autonomous driving. Nevertheless, it holds great potential. First, instead of predicting video in pixel space, non-generative world models predict features in representation space, which discards many irrelevant details. In the scene prediction task of autonomous driving, for example, we care most about the future motion of other traffic participants on the current road; vehicles not on that road, such as those on a parallel elevated road, need not have their trajectories predicted. A JEPA-style model can drop these irrelevant details and thereby reduce the complexity of the problem. In addition, V-JEPA has demonstrated the ability to learn features from video; trained on a sufficiently large corpus of driving videos, it could plausibly be applied to tasks such as generating driving scenarios and predicting future environmental states.

Data Augmentation Based on FMs

As deep learning continues to evolve, the performance of FMs built on the pre-training and fine-tuning paradigm keeps improving, and FMs are spearheading the transition from rule-driven to data-driven learning. Data are the key ingredient of model learning: a substantial quantity of data is used to train an autonomous driving model and endow it with comprehension and decision-making abilities across diverse driving scenarios. Nevertheless, collecting real-world data is time-consuming and laborious, so data augmentation is crucial for improving the generalization ability of autonomous driving models.

Data augmentation must address 2 aspects: on the one hand, how to obtain large-scale data, so that the data fed to the autonomous driving system are diverse and extensive; on the other hand, how to obtain data of the highest possible quality, so that the data used to train and test autonomous driving models are accurate and reliable. Related work has accordingly pursued 2 directions: enriching the content and scenario features of existing datasets, and generating driving scenarios at multiple levels through simulation. In the following, we review related work on FM-based data augmentation: the “Expansion of autonomous driving datasets” section covers dataset extension, and the “Generation of driving scenarios” section covers scenario generation. Table 3 provides a brief overview of some representative works.

Table 3.

Works on data augmentation

| | Authors | Input | Output | Learning |
| --- | --- | --- | --- | --- |
| Expand dataset | Qian et al. [193] | Images, text, point clouds | Q–A pairs | SL |
| | Wu et al. [194] | Images, text | Object–text pairs | SL |
| | Zhou et al. [196] | Images, text | Labeled data | SL |
| Generate scenarios | Marathe et al. [210] | Objects, scenarios, weather condition | Multi-weather images | SL |
| | Yang et al. [206] | Text, BEV sketch, multi-view noise | Street-view images | SL |
| | Li et al. [207] | Layouts, frames, optical flow prior | Multi-view videos | SL |
| | Wen et al. [208] | Text, BEV sequence | Multi-view videos | SL |
| | Chen et al. [124] | Objects, text | Trajectory | SL |
| | Zhong et al. [211] | Text, scenarios with noise | Traffic scenarios | SL |
| | Wang et al. [75] | Images, knowledge | Latent space simulation | SL |
| | Jin et al. [212] | Text, simulation of urban driving | Driving maneuvers | SL |
| | Zhao et al. [167] | Text | Videos | SL |

Q–A, question–answer.

Expansion of autonomous driving datasets

Existing autonomous driving datasets are mostly built by recording sensor data and then labeling them. The features obtained in this way are usually low level, existing mainly as numerical representations, which is insufficient to capture the visuospatial characteristics of driving scenarios. Natural language descriptions are regarded as an effective way to enrich scene representation [79]: Flickr30k [186], RefCOCO [187], RefCOCOg [188], and CLEVR-Ref [189] use concise natural language descriptions to identify the corresponding visual regions in an image. Talk2Car [190] fused image, radar, and LiDAR data to construct the first object-referral dataset containing language commands for self-driving cars, although it allows only one object to be referenced at a time. CityFlow-NL [191] constructed a dataset for multi-target tracking through natural language descriptions, and Refer-KITTI [192] achieved tracking of arbitrary targets by leveraging natural language queries in the corresponding task.

FMs offer new ways to enrich and expand autonomous driving datasets, thanks to their advanced semantic understanding, reasoning, and interpretation capabilities. Qian et al. [193] created NuScenes-QA, a visual question-answering dataset for 3D multi-view driving scenarios, by encoding question descriptions with an LM and obtaining answers through feature fusion with sensor data. Impactful progress has also been made with natural language prompts: Wu et al. [194] constructed the dataset NuPrompt by capturing and combining natural language elements and then invoking an LLM to generate descriptions. The dataset provides a finer match between 3D instances and each prompt, which helps to characterize objects in driving images more accurately. Sima et al. [115] took the interactions of traffic elements into account and constructed Graph Visual Question Answering by extending the nuScenes dataset [195] with BLIP-2, which better clarifies the logical dependencies between objects and the hierarchy of driving tasks. Beyond directly extending datasets, some scholars have integrated the CoT capability of LLMs with the cross-modal capability of vision models to build OpenAnnotate3D [196], an automatic annotation system for multi-modal 3D data. Expanding datasets with the advanced understanding, reasoning, and interpretation capabilities of FMs can help better assess the interpretability and controllability of autonomous driving systems, thereby improving their safety and reliability. A comparison of some representative work is shown in Table 4.
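As a hedged illustration of how QA pairs can be derived from scene annotations, in the spirit of datasets such as NuScenes-QA (the annotation fields and question templates below are invented for the example and are not the actual pipeline):

```python
from collections import Counter

# Invented toy annotations: each object has a category and a motion flag.
annotations = [
    {"category": "car", "moving": True},
    {"category": "car", "moving": False},
    {"category": "pedestrian", "moving": True},
]

def generate_qa(objects):
    """Derive question-answer pairs from object annotations via templates."""
    counts = Counter(o["category"] for o in objects)
    qa = []
    # Counting template, one question per category.
    for cat, n in sorted(counts.items()):
        qa.append((f"How many {cat}s are in the scene?", str(n)))
    # Motion-status template.
    moving = sum(o["moving"] for o in objects)
    qa.append(("Are any objects moving?", "yes" if moving else "no"))
    return qa

pairs = generate_qa(annotations)
print(pairs[0])  # ('How many cars are in the scene?', '2')
```

In real systems, an LLM would typically paraphrase and diversify such template questions rather than emit them verbatim.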

Table 4.

Comparison of extended datasets. “-” means unavailable

| Dataset | Source | Based FMs | Modality | 3D | Multi-views | Videos | Frames | QA pairs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RefCOCO [188] | COCO | None | Image, referring expression | × | × | - | 26,711 | - |
| Refer-KITTI [192] | KITTI | None | Image, point cloud, object referral | × | × | 18 | 6,650 | - |
| Talk2Car [190] | nuScenes | None | Image, point cloud, object driving command | ✓ | × | - | 9,217 | - |
| nuPrompt [194] | nuScenes | GPT-3.5 | Image, point cloud, question answering | ✓ | ✓ | 850 | 34,149 | 35k |
| DriveLM-nuScenes [115] | nuScenes | BLIP-2 | Image, point cloud, question answering | ✓ | ✓ | - | 4,871 | 443k |

Generation of driving scenarios

The diversity of driving scenarios is of considerable importance for autonomous driving. To obtain good generalization ability, autonomous driving models must learn from a wide variety of scenarios. In reality, however, driving scenarios follow a long-tailed distribution (a distribution in which a few scenario types occur very frequently while a vast number of distinct scenarios occur only rarely, so that a significant share of the probability mass lies in the tail). The “long-tail problem” of autonomous vehicles is that they handle frequently encountered situations well but fail to cope with corner cases, i.e., rare or extreme situations. Addressing the long-tail problem hinges on obtaining as many corner cases as possible, yet restricting collection to real scenarios is inefficient: in CODA [197], a corner-case mining effort, only 1,057 valid samples were found among 1 million data points.
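A back-of-the-envelope calculation using the CODA figure above shows why passive collection is so inefficient: with roughly 1,057 corner cases per 1 million samples, thousands of samples are needed just to observe a single rare event with high probability.

```python
import math

# Per-sample corner-case hit rate implied by CODA: ~1,057 in 1 million.
p = 1057 / 1_000_000

# Probability of at least one corner case in n i.i.d. samples: 1 - (1 - p)^n.
# Samples needed to reach 99% probability of seeing one rare event:
n_99 = math.ceil(math.log(0.01) / math.log(1 - p))
print(round(p, 6), n_99)
```

And this only covers one generic rare event; distinct corner-case types each have far lower individual rates, which is the motivation for actively generating such scenarios instead.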

Given the above, generating large-scale, high-quality driving scenario data requires the capacity to actively create a multitude of driving scenarios. Traditional methodologies fall into 2 primary categories: rule-based and data-driven. Rule-based approaches [198–201] rely on predefined rules, characterize complex environments inadequately, simulate only relatively simple environments, and generalize poorly. In contrast, data-driven approaches [202–205] train on driving data, enabling continuous learning and adaptation; however, they typically require a substantial quantity of labeled data, which impedes further development of driving scenario generation, and they lack controllability, making them unsuitable for customized generation. Recently, FMs have achieved considerable success, and generating higher-quality driving scenarios with FMs has attracted significant research attention. On the one hand, the powerful understanding and reasoning capabilities of FMs can enhance the diversity and accuracy of generated data; on the other hand, diverse prompts can be designed for controllable generation.

Based on LLMs and VLMs

In response to the fact that some long-tailed scenarios can never be collected from multi-view footage, Yang et al. [206] fused textual cues, BEV sketches, and multi-view noise in BEVControl, a 2-stage generative network for synthesizing realistic street-scene images; however, BEVControl models foreground and background detail insufficiently. To address the difficulty of obtaining large-scale BEV representations, Li et al. [207] developed DrivingDiffusion, a spatiotemporally consistent diffusion framework that autoregressively generates realistic multi-view videos controlled by 3D layouts; introducing local cue inputs into the vision model effectively enhances the quality of the generated data. For controllable generation, Wen et al. [208] integrated text prompts, image conditions, and BEV sequences in a controllable module to improve the controllability of driving scenario generation. Gao et al. [209] designed 3D geometric control that fuses text prompts with camera pose, road map, and object-box controls to generate diverse road scenarios.

Given the powerful understanding and reasoning abilities of LLMs and VLMs, embedding them directly or using them to guide scenario generation has also become a research hotspot. Marathe et al. [210] efficiently generated a dataset covering 16 weather extremes by prompting a VLM; however, the pre-selected fixation in data selection limits the model's extensibility. Chen et al. [124] combined the numerical vector modality with natural language by pairing control commands collected by a reinforcement learning agent with questions answered by an LLM, directly constructing new data. Zhong et al. [211] proposed CTG++, a scene-level diffusion-based, language-guided traffic simulation model that generates instruction-compliant, realistic, and controllable traffic scenarios. Wang et al. [75] used natural language descriptions as conceptual representations integrated with an LLM, leveraging its powerful common-sense reasoning to enrich the complexity of the generated scenarios. The behavior of human drivers is also an important part of driving scenarios: Jin et al. [212] proposed SurrealDriver, an LLM-based generative driving-agent simulation framework for urban environments. By analyzing and learning from real driving data, SurrealDriver captures drivers' behavior patterns and decision-making processes and generates behavior sequences similar to those observed in real driving.

Based on world models

To achieve controllable generation of driving scenarios, Wang et al. [166] combine text prompts with structured traffic constraints to guide pixel generation from text descriptions. To obtain more accurate dynamic information, Wang et al. [168] incorporate driving actions into a controllable architecture, using text descriptors, layouts, and ego actions to control video generation. However, these approaches introduce additional structural information, which limits the interactivity of the model. To address this issue, Zhao et al. [167] propose a novel approach that combines an LLM with a world model: the LLM converts user queries into agent trajectories, which are used to generate an HD map that in turn guides the generation of driving videos.

Efficient and accurate controllable generation of driving scenarios can be achieved with FMs, providing diverse training data that are important for improving the generalization ability of autonomous driving systems. A comparison of some representative work is shown in Table 5. Furthermore, the generated driving scenarios can be used to evaluate different autonomous driving models, testing and validating their performance. With the emergence of large-scale FMs such as Sora and Genie, new possibilities also open up for generating autonomous driving videos: models need not be restricted to the driving domain, but can instead transfer models trained in the general vision domain. Although the technology remains imperfect, we believe that with future breakthroughs it may become possible to generate whatever driving scenarios are needed, and ultimately to learn a world model that truly simulates the world.

Table 5.

Video generation performance on the nuScenes dataset. “-” means unavailable. FID and FVD measure image and video quality, respectively.

| Method | Based FMs | Multi-view | Multi-frame | FID↓ | FVD↓ |
| --- | --- | --- | --- | --- | --- |
| BEVGen [151] | None | ✓ | × | 25.54 | - |
| DriveGAN [148] | None | × | ✓ | 73.4 | 502.3 |
| MagicDrive [209] | CLIP | ✓ | × | 16.20 | - |
| Panacea [208] | CLIP | ✓ | ✓ | 16.96 | 139 |
| DriveDreamer [166] | WM | × | ✓ | 52.6 | 452.0 |
| DriveDreamer-2 [167] | GPT-3.5 and WM | ✓ | ✓ | 11.2 | 55.7 |
| Drive-WM [168] | WM | ✓ | ✓ | 15.8 | 122.7 |
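The FID metric reported above is the Fréchet distance between Gaussians fitted to feature activations of real and generated images. A simplified numpy sketch, assuming diagonal covariances so the matrix square root becomes elementwise (full implementations use a proper matrix square root of the covariance product):

```python
import numpy as np

def fid_diag(mu1, var1, mu2, var2):
    """Fréchet distance between two Gaussians with diagonal covariances:
    ||mu1 - mu2||^2 + sum(var1 + var2 - 2*sqrt(var1*var2))."""
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return float(mean_term + cov_term)

rng = np.random.default_rng(0)
# Stand-ins for feature activations of "real" and "generated" images.
real = rng.normal(loc=0.0, scale=1.0, size=(1000, 16))
fake = rng.normal(loc=0.5, scale=1.0, size=(1000, 16))

score = fid_diag(real.mean(0), real.var(0), fake.mean(0), fake.var(0))
same = fid_diag(real.mean(0), real.var(0), real.mean(0), real.var(0))
print(score > same)  # identical distributions score (near) zero
```

Lower is better: a mean shift or covariance mismatch between the real and generated feature distributions both inflate the score.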

Conclusion and Future Directions

This paper provides a comprehensive overview of the application of FMs to autonomous driving. In the “Human-like Driving Based on Language and Vision Models” section, recent works on the application of FMs such as LLMs and VLMs to autonomous driving are summarized in detail. In the “Prediction of Autonomous Driving Based on World Models” section, we present an exploratory application of the world models to the field of autonomous driving. In the “Data Augmentation Based on Foundation Models” section, recent works on data augmentation of the FMs are detailed. Overall, the FMs can effectively assist autonomous driving in terms of both augmenting the data and optimizing the model.

To evaluate the effectiveness of FMs in autonomous driving, Table 6 compares FM-based methods with traditional methods on motion planning. Because LLMs and VLMs are relatively mature, methods built on them improve autonomous driving performance across the board. In contrast, WM-based approaches are still being explored, with relatively little work published; nevertheless, as the preceding analysis shows, world models excel at learning the evolutionary laws of the physical world and hold great potential for improving autonomous driving.

Table 6.

Motion planning performance on the nuScenes validation dataset

| Method | Based FMs | L2 (m)↓ 1 s / 2 s / 3 s / Avg. | Collision (%)↓ 1 s / 2 s / 3 s / Avg. |
| --- | --- | --- | --- |
| ST-P3 [61] | None | 1.33 / 2.11 / 2.90 / 2.11 | 0.23 / 0.62 / 1.27 / 0.71 |
| UniAD [72] | None | 0.48 / 0.96 / 1.65 / 1.03 | 0.05 / 0.17 / 0.71 / 0.31 |
| VAD [73] | None | 0.41 / 0.70 / 1.05 / 0.72 | 0.07 / 0.17 / 0.41 / 0.22 |
| GenAD [74] | None | 0.36 / 0.83 / 1.55 / 0.91 | 0.06 / 0.23 / 1.00 / 0.43 |
| GPT-Driver [121] a | LLM | 0.21 / 0.43 / 0.79 / 0.48 | 0.16 / 0.27 / 0.63 / 0.35 |
| GPT-Driver [121] b | LLM | 0.20 / 0.42 / 0.72 / 0.44 | 0.14 / 0.25 / 0.60 / 0.33 |
| Agent-Driver [122] a | LLM | 0.22 / 0.65 / 1.34 / 0.74 | 0.02 / 0.13 / 0.48 / 0.21 |
| DriveVLM-Dual [94] a | VLM | 0.17 / 0.37 / 0.63 / 0.39 | 0.08 / 0.18 / 0.35 / 0.20 |
| DriveVLM-Dual [94] c | VLM | 0.15 / 0.29 / 0.48 / 0.31 | 0.05 / 0.08 / 0.17 / 0.10 |
| VLP-UniAD [95] a | LLM | 0.36 / 0.68 / 1.19 / 0.74 | 0.03 / 0.12 / 0.32 / 0.16 |
| VLP-VAD [95] c | LLM | 0.30 / 0.53 / 0.84 / 0.55 | 0.01 / 0.07 / 0.38 / 0.15 |
| OccWorld-O [163] b | WM | 0.43 / 1.08 / 1.99 / 1.17 | 0.07 / 0.38 / 1.35 / 0.60 |
| Drive-WM [168] | WM | 0.43 / 0.77 / 1.20 / 0.80 | 0.10 / 0.21 / 0.48 / 0.26 |

a Results of perception and prediction from UniAD. b Results of perception and prediction from dataset annotations. c Results of perception and prediction from VAD.
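The L2 metric in Table 6 can be read as the displacement between the planned and ground-truth ego positions at each horizon, averaged over the 1 s, 2 s, and 3 s horizons (an assumed reading of the standard nuScenes protocol; the trajectories below are made up for illustration):

```python
import numpy as np

def l2_by_horizon(plan, gt):
    """plan, gt: (T, 2) arrays of (x, y) ego positions at 1 s intervals.
    Returns the per-horizon L2 displacement error in meters."""
    return np.linalg.norm(plan - gt, axis=1)

# Made-up planned vs. ground-truth positions at 1 s, 2 s, 3 s.
plan = np.array([[0.0, 1.0], [0.0, 2.0], [0.0, 3.3]])
gt   = np.array([[0.0, 1.1], [0.0, 2.2], [0.0, 3.0]])

errs = l2_by_horizon(plan, gt)  # per-horizon L2 (m)
avg = errs.mean()               # the "Avg." column
print(np.round(errs, 2), round(avg, 2))
```

The collision column is computed analogously, as the fraction of planned trajectories that intersect another agent's occupied space within each horizon.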

Challenges and future directions

Nevertheless, it is evident from previous studies that FM-based autonomous driving is not yet sufficiently mature. Several factors contribute: FMs suffer from hallucination [213,214]; learning video, a high-dimensional continuous modality, remains limited; deployment is hampered by inference latency [215,216]; and potential ethical implications and societal impacts must also be considered.

Hallucination

In autonomous driving, hallucination mainly manifests as misrecognition, such as erroneous target detection, which may cause serious safety accidents. Hallucinations arise chiefly from limited samples in the dataset or from unbalanced or noisy data; stability and generalization ability therefore need to be strengthened through data expansion and adversarial training.

Real-world deployment

As previously discussed, the majority of current research on FMs in autonomous driving relies on open-source dataset experiments [95,121] or closed-loop experiments in simulation environments [105,119], which is insufficient for real-time considerations. Additionally, some studies [215,216] have highlighted that large models incur nontrivial inference latency, which could lead to significant safety concerns in autonomous driving applications. To further explore the real-time applicability of FMs in autonomous driving, we conducted an experiment [217]: we fine-tuned LLaMA-7B [78] with low-rank adaptation (LoRA) [218], so that the fine-tuned LLM can reason to generate driving language commands. To verify real-time performance in driving scenarios, we ran inference on a single A800 GPU and a single RTX 3080 GPU; generating 6 tokens took 0.9 s and 1.2 s, respectively, indicating that in-vehicle deployment of FMs is feasible. Meanwhile, the DriveVLM [94] work by Tian et al. also achieves second-level inference on the NVIDIA Orin platform, further supporting the feasibility of in-vehicle FMs. In the future, with improvements in edge and in-vehicle computing [219], it may become possible to move gradually toward roadside, vehicle-side, or hybrid deployment modes, further improving real-time responsiveness and privacy protection.
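The latency measurement above amounts to timing an autoregressive generation loop. A hedged sketch with a stub in place of the LLM forward pass (the stub and its token values are placeholders, not the fine-tuned model):

```python
import time

def generate_tokens(step_fn, n_tokens):
    """Time an autoregressive loop: one call to step_fn per generated token."""
    tokens = []
    start = time.perf_counter()
    for _ in range(n_tokens):
        tokens.append(step_fn(tokens))  # in practice: one LLM forward pass
    elapsed = time.perf_counter() - start
    return tokens, elapsed

# Stand-in "model": returns the current sequence length as the next token.
stub_model = lambda prefix: len(prefix)

tokens, seconds = generate_tokens(stub_model, 6)
print(len(tokens), seconds >= 0.0)
```

Per-token latency (`seconds / 6`) is the quantity that ultimately bounds how quickly driving commands can be emitted on a given GPU.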

AI alignment

The penetration of FMs into various industries, including autonomous driving, is a major trend. Nevertheless, as related research advances, so do the risks to human society. Advanced AI systems exhibiting undesirable behaviors (e.g., deception) are a cause for concern, especially in areas such as autonomous driving that bear directly on personal safety, and require serious discussion and reflection. In response, AI alignment has been proposed and is actively being developed. The objective of AI alignment is to align the behaviors of AI systems with human intentions and values; it focuses on the goals of AI systems rather than their capabilities [220]. AI alignment facilitates risk control, operational robustness, adherence to human ethics, and interpretability when advanced AI systems are deployed across domains [221]. This is a substantial body of research spanning numerous AI-related disciplines; as this paper concentrates on autonomous driving, we do not delve into the specifics of risk causes and solutions here. In autonomous driving, it is important that, while promoting the application of FMs, researchers establish reasonable technical ethics guided by AI alignment, attending to algorithmic fairness, data privacy, system security, and the human–machine relationship. Furthermore, it is essential to unify technological development with social values so as to avoid potential ethical and social risks.

Visual emergent abilities

FMs have shown remarkable emergent abilities with model scaling and demonstrated success in NLP. In autonomous driving, however, this line of research faces additional open challenges owing to limited available data and extended context lengths, which hinder the understanding of macroscopic driving scenes and complicate long-term planning. Driving video is a high-dimensional continuous modality with an extremely large amount of data (several orders of magnitude more than textual data). Training large vision models thus requires embedding enough video frames to reason about complex dynamic scenarios, which in turn demands more robust network structures and training strategies. Bai et al. [222] proposed a 2-stage approach in which images are converted into discrete tokens to form “visual sentences” and then predicted autoregressively, similar to the standard approach for LMs [13]. Another promising solution may lie in world models: as described in the “Prediction of Autonomous Driving Based on World Models” section, world models can learn the intrinsic evolutionary laws of the world by observing a small number of events, whether or not relevant to the task. However, world models also face limitations in exploratory applications; the uncertainty of their predictions, and the question of what kind of data captures the intrinsic laws of how the world works, still warrant further exploration.

In conclusion, although there are many challenges to be solved in applying FMs to autonomous driving, its potential has already begun to take shape. In the future, we will continue to monitor the progress of FMs applied to autonomous driving.

Acknowledgments

Funding: This work was supported in part by the National Nature Science Foundation of China (nos. 62373289, 62273256, and 62088101) and the Fundamental Research Funds for the Central Universities.

Author contributions: H. Chen, H. Chu, B.G., and J.W. conceived and designed the study. J.W., J.G., and J.Y. wrote the manuscript draft. X.G., Y.C., Q.Y., and H.E.T gave some helpful suggestions. H. Chen, H. Chu, B.G., and J.C. revised the manuscript. The authors read and approved the final manuscript.

Competing interests: The authors declare that they have no competing interests.

References

  • 1.Yurtsever E, Lambert J, Carballo A, Takeda K. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access. 2020;8:58443–58469. [Google Scholar]
  • 2.Grigorescu S, Trasnea B, Cocias T, Macesanu G. A survey of deep learning techniques for autonomous driving. J Field Robot. 2020;37(3):362–386. [Google Scholar]
  • 3.Chen L, Wu P, Chitta K, Jaeger B, Geiger A, Li H. End-to-end autonomous driving: Challenges and frontiers. arXiv. 2023. 10.48550/arXiv.2306.16927 [DOI]
  • 4.Chib PS, Singh P. Recent advancements in end-to-end autonomous driving using deep learning: A survey. IEEE Trans Intell Veh. 2023;9(1):103–118. [Google Scholar]
  • 5.Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, Bernstein MS, Bohg J, Bosselut A, Brunskill E, et al. On the opportunities and risks of foundation models. arXiv. 2021. 10.48550/arXiv.2108.07258 [DOI]
  • 6.Kenton JDMWC, Toutanova LK. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019. 2019. p. 4171–4186.
  • 7.OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S. et al. Gpt-4 technical report. arXiv. 2023. 10.48550/arXiv.2303.08774 [DOI]
  • 8.Brooks T, Peebles B, Holmes C, DePue W, Guo Y, Jing L, Schnurr D, Taylor J, Luhman T, Luhman E, et al. Video generation models as world simulators; 2024. https://openai.com/research/video-generation-models-as-world-simulators
  • 9.Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention is all you need. arXiv. 2017. 10.48550/arXiv.1706.03762 [DOI]
  • 10.Peebles W, Xie S. Scalable diffusion models with transformers. arXiv. 2022. 10.48550/arXiv.2212.09748 [DOI]
  • 11.Wei J, Tay Y, Bommasani R, Raffel C, Zoph B, Borgeaud S, Yogatama D, Bosma M, Zhou D, Metzler D, Chi EH, Hashimoto T, Vinyals, et al. Emergent abilities of large language models. arXiv. 2022. 10.48550/arXiv.2206.07682 [DOI]
  • 12.Kaplan J, McCandlish S, Henighan T, Brown TB, Chess B, Child R, Gray S, Radford A, Wu J, Amodei D. Scaling laws for neural language models. arXiv. 2020. 10.48550/arXiv.2001.08361 [DOI]
  • 13.Brown T, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A. et al. Language models are few-shot learners. arXiv. 2020. 10.48550/arXiv.2005.14165 [DOI]
  • 14.Dong Q, Li L, Dai D, Zheng C, Wu Z, Chang B, Sun X, Xu J, Sui Z. A survey for in-context learning. arXiv. 2022. 10.48550/arXiv.2301.00234 [DOI]
  • 15.Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, Chi E, Le Q, Zhou D. et al. Chain-of-thought prompting elicits reasoning in large language models. arXiv. 2020. 10.48550/arXiv.2201.11903 [DOI]
  • 16.Park JS, O’Brien JC, Cai CJ, Morris MR, Liang P, Bernstein MS. Generative agents: Interactive simulacra of human behavior. arXiv. 2023. 10.48550/arXiv.2304.03442 [DOI]
  • 17.Thrun S. Lifelong learning algorithms. In: Learning to learn. Boston (MA): Springer; 1998. p. 181–209. [Google Scholar]
  • 18.Pan SJ, Yang Q. A survey on transfer learning. IEEE Trans Knowl Data Eng. 2009;22(10):1345–1359. [Google Scholar]
  • 19.Qiu X, Sun T, Xu Y, Shao Y, Dai N, Huang X. Pre-trained models for natural language processing: A survey. Sci China Technol Sci. 2020;63:1872–1897. [Google Scholar]
  • 20.Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Adv Neural Inf Proces Syst. 2012;25. [Google Scholar]
  • 21.Jing L, Tian Y. Self-supervised visual feature learning with deep neural networks: A survey. IEEE Trans Pattern Anal Mach Intell. 2020;43(11):4037–4058. [DOI] [PubMed] [Google Scholar]
  • 22.Hoyer L, Dai D, Chen Y, Koring A, Saha S, Van Gool L. Three ways to improve semantic segmentation with self-supervised depth estimation. Paper presented at: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021 Jun 20–25; Nashville, TN.
  • 23.Liu S, Li Z, Sun J. Self-emd: Self-supervised object detection without imagenet. arXiv. 2020. 10.48550/arXiv.2011.13677 [DOI]
  • 24.Yu W, Xu H, Yuan Z, Wu J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. Proc AAAI Conf Artif Intell. 2021;35(12):10790–10797. [Google Scholar]
  • 25.Zhang K, Wen Q, Zhang C, Cai R, Jin M, Liu Y, Zhang J, Liang Y, Pang G, Song D, et al. Self-supervised learning for time series analysis: Taxonomy, progress, and prospects. arXiv. 2023. 10.48550/arXiv.2306.10125 [DOI] [PubMed]
  • 26.Liu X, Zhang F, Hou Z, Mian L, Wang Z, Zhang J, Tang J. Self-supervised learning: Generative or contrastive. IEEE Trans Knowl Data Eng. 2023;35(1):857–876. [Google Scholar]
  • 27.Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. In: International conference on machine learning. PMLR; 2020. p. 1597–1607.
  • 28.Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. arXiv. 2014. 10.48550/arXiv.1406.2661 [DOI]
  • 29.Zhang R, Isola P, and Efros AA. Colorful image colorization. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14. Springer; 2016. p. 649–666.
  • 30.Srivastava N, Mansimov E, Salakhudinov R. Unsupervised learning of video representations using lstms. In: International conference on machine learning. PMLR; 2015. p. 843–852.
  • 31.Misra I, Zitnick CL, Hebert M. Shuffle and learn: Unsupervised learning using temporal order verification. In: Computer Vision–ECCV 2016: 14th European conference, Amsterdam, the Netherlands, October 11–14, 2016, Proceedings, Part I 14. Springer; 2016. p. 527–544.
  • 32.Doersch C, Gupta A, Efros AA. Unsupervised visual representation learning by context prediction. Paper presented at: 2015 IEEE International Conference on Computer Vision (ICCV); 2015 Dec 07–13; Santiago, Chile.
  • 33.Li D, Hung WC, Huang JB, Wang S, Ahuja N, Yang MH. Unsupervised visual representation learning by graph-based consistent constraints. In: Computer Vision–ECCV 2016: 14th European conference, Amsterdam, the Netherlands, October 11–14, 2016, Proceedings, Part IV 14. Springer; 2016. p. 678–694.
  • 34.Pathak D, Girshick R, Dollár P, Darrell T, Hariharan B. Learning features by watching objects move. Paper presented at: IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21–26; Honolulu, HI.
  • 35.Sayed N, Brattoli B, Ommer B. Cross and learn: Cross-modal self-supervision. In: Pattern Recognition: 40th German Conference, GCPR 2018, Stuttgart, Germany, October 9–12, 2018, Proceedings 40. Springer; 2019. p. 228–243.
  • 36.Jaiswal A, Babu AR, Zadeh MZ, Banerjee D, Makedon F. A survey on contrastive self-supervised learning. Technologies. 2020;9(1):2. [Google Scholar]
  • 37.Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv. 2013. 10.48550/arXiv.1301.3781 [DOI]
  • 38.Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, Min Y, Zhang B, Zhang J, Dong Z, et al. A survey of large language models. arXiv. 2023. 10.48550/arXiv.2303.18223 [DOI]
  • 39.Wei J, Bosma M, Zhao VY, Guu K, Yu AW, Lester B, Du N, Dai AM, Le QV. Finetuned language models are zero-shot learners. Paper presented at: International Conference on Learning Representations; 2021; Vienna, Austria.
  • 40.Ouyang L, Wu J, Jiang X, Almeida D, Wainwright CL, Mishkin P, Zhang C, Agarwal S, Slama K, Ray A, et al. Training language models to follow instructions with human feedback. arXiv. 2022. 10.48550/arXiv.2203.02155 [DOI]
  • 41.Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language understanding by generative pre-training. OpenAI Technical Report. 2018.
  • 42.Liu PJ, Saleh M, Pot E, Goodrich B, Sepassi R, Kaiser L, Shazeer N. Generating wikipedia by summarizing long sequences. Paper presented at: International Conference on Learning Representations; 2018; Vancouver, Canada.
  • 43.Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. Proximal policy optimization algorithms. arXiv. 2017. 10.48550/arXiv.1707.06347 [DOI]
  • 44.Christiano PF, Leike J, Brown TB, Martic M, Legg S, Amodei D. Deep reinforcement learning from human preferences. Paper presented at: Proceedings of the 31st International Conference on Neural Information Processing Systems; 2017 Dec 4–9; Long Beach, CA.
  • 45.He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2016. p. 770–778.
  • 46.Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. Paper presented at: 3rd International Conference on Learning Representations (ICLR); 2015 May 7–9; San Diego, CA.
  • 47.Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Dehghani M, Minderer M, Heigold G, Gelly S, et al. An image is worth 16x16 words: Transformers for image recognition at scale. Paper presented at: International Conference on Learning Representations; 2021.
  • 48.Philion J, Fidler S. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In: Computer Vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16. Springer; 2020. p. 194–210.
  • 49.Huang J, Huang G, Zhu Z, Ye Y, Du D. BEVDet: High-performance multi-camera 3D object detection in bird-eye-view. arXiv. 2022. 10.48550/arXiv.2112.11790 [DOI]
  • 50.Li Z, Wang W, Li H, Xie E, Sima C, Lu T, Qiao Y, Dai J. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In: European conference on computer vision. Springer; 2022. p. 1–18.
  • 51.Zhang Y, Zhu Z, Zheng W, Huang J, Huang G, Zhou J, Lu J. Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving. arXiv. 2022. 10.48550/arXiv.2205.09743 [DOI]
  • 52.Liang T, Xie H, Yu K, Xia Z, Lin Z, Wang Y, Tang T, Wang B, Tang Z. Bevfusion: A simple and robust lidar-camera fusion framework. Adv Neural Inf Proces Syst. 2022;35:10421–10434. [Google Scholar]
  • 53.Wu P, Chen L, Li H, Jia X, Yan J, Qiao Y. Policy pre-training for end-to-end autonomous driving via self-supervised geometric modeling. arXiv. 2023. 10.48550/arXiv.2301.01006 [DOI]
  • 54.Sautier C, Puy G, Boulch A, Marlet R, Lepetit V. BEVContrast: self-supervision in BEV space for automotive lidar point clouds. arXiv. 2023. 10.48550/arXiv.2310.17281 [DOI]
  • 55.Xie S, Gu J, Guo D, Qi CR, Guibas L, Litany O. Pointcontrast: Unsupervised pre-training for 3D point cloud understanding. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16. Springer; 2020. p. 574–591.
  • 56.Yang H, Zhang S, Huang D, Wu X, Zhu H, He T, Tang S, Zhao H, Qiu Q, Lin B, et al. UniPAD: A universal pre-training paradigm for autonomous driving. arXiv. 2023. 10.48550/arXiv.2310.08370 [DOI]
  • 57.Bojarski M, Del Testa D, Dworakowski D, Firner B, Flepp B, Goyal P, Jackel LD, Monfort M, Muller U, Zhang J, et al. End to end learning for self-driving cars. arXiv. 2016. 10.48550/arXiv.1604.07316 [DOI]
  • 58.Eraqi HM, Moustafa MN, Honer J. End-to-end deep learning for steering autonomous vehicles considering temporal dependencies. arXiv. 2017. 10.48550/arXiv.1710.03804 [DOI]
  • 59.Xu H, Gao Y, Yu F, Darrell T. End-to-end learning of driving models from large-scale video datasets. In: Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE; 2017. p. 2174–2182.
  • 60.Codevilla F, Müller M, López A, Koltun V, Dosovitskiy A. End-to-end driving via conditional imitation learning. In: 2018 IEEE international conference on robotics and automation (ICRA). IEEE; 2018. p. 4693–4700.
  • 61.Hu S, Chen L, Wu P, Li H, Yan J, Tao D. St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning. In: European Conference on Computer Vision. Springer; 2022. p. 533–549.
  • 62.Liang X, Wang T, Yang L, Xing E. Cirl: Controllable imitative reinforcement learning for vision-based self-driving. Paper presented at: Proceedings of the European conference on computer vision (ECCV); 2018 Jul 10; Munich, Germany.
  • 63.Toromanoff M, Wirbel E, Moutarde F. End-to-end model-free reinforcement learning for urban driving using implicit affordances. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE; 2020. p. 7153–7162.
  • 64.Zhang Z, Liniger A, Dai D, Yu F, Van Gool L. End-to-end urban driving by imitating a reinforcement learning coach. In: Proceedings of the IEEE/CVF international conference on computer vision. IEEE; 2021. p. 15222–15232.
  • 65.Prakash A, Chitta K, Geiger A. Multi-modal fusion transformer for end-to-end autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; 2021. p. 7077–7087.
  • 66.Chitta K, Prakash A, Jaeger B, Yu Z, Renz K, Geiger A. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Trans Pattern Anal Mach Intell. 2022;45(11):12878–12895. [DOI] [PubMed] [Google Scholar]
  • 67.Chitta K, Prakash A, Geiger A. Neat: Neural attention fields for end-to-end autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE; 2021. p. 15793–15803.
  • 68.Ngiam J, Caine B, Vasudevan V, Zhang Z, Chiang H-TL, Ling J, Roelofs R, Bewley A, Liu C, Venugopal A, et al. Scene transformer: A unified architecture for predicting multiple agent trajectories. arXiv. 2021. 10.48550/arXiv.2106.08417 [DOI]
  • 69.Renz K, Chitta K, Mercea OB, Koepke AS, Akata Z, Geiger A. PlanT: Explainable planning transformers via object-level representations. In: Conference on Robot Learning. PMLR; 2023. p. 459–470.
  • 70.Zhang K, Feng X, Wu L, He Z. Trajectory prediction for autonomous driving using spatial-temporal graph attention transformer. IEEE Trans Intell Transp Syst. 2022;23(11):22343–22353. [Google Scholar]
  • 71.Ye T, Jing W, Hu C, Huang S, Gao L, Li F, Wang J, Guo K, Xiao W, Mao W, et al. Fusionad: Multi-modality fusion for prediction and planning tasks of autonomous driving. arXiv. 2023. 10.48550/arXiv.2308.01006 [DOI]
  • 72.Hu Y, Yang J, Chen L, Li K, Sima C, Zhu X, Chai S, Du S, Lin T, Wang W, et al. Planning-oriented autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; 2023. p. 17853–17862.
  • 73.Jiang B, Chen S, Xu Q, Liao B, Zhou H, Zhang Q, Liu W, Huang C, Wang X. Vad: Vectorized scene representation for efficient autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. IEEE; 2023. p. 8340–8350.
  • 74.Zheng W, Song R, Guo X, Chen L. GenAD: Generative end-to-end autonomous driving. arXiv. 2024. 10.48550/arXiv.2402.11502 [DOI]
  • 75.Wang TH, Maalouf A, Xiao W, Ban Y, Amini A, Rosman G, Karaman S, Rus D. Drive anywhere: Generalizable end-to-end autonomous driving with multi-modal foundation models. arXiv. 2023. 10.48550/arXiv.2310.17642 [DOI]
  • 76.Dong J, Chen S, Zong S, Chen T, Labi S. Image transformer for explainable autonomous driving system. In: 2021 IEEE international intelligent transportation systems conference (ITSC). IEEE; 2021. p. 2732–2737.
  • 77.Jin B, Liu X, Zheng Y, Li P, Zhao H, Zhang T, Zheng Y, Zhou G, Liu J. Adapt: Action-aware driving caption transformer. arXiv. 2023. 10.48550/arXiv.2302.00673 [DOI]
  • 78.Touvron H, Lavril T, Izacard G, Martinet X, Lachaux M-A, Lacroix T, Roziere B, Goyal N, Hambro E, Azhar F, et al. Llama: Open and efficient foundation language models. arXiv. 2023. 10.48550/arXiv.2302.13971 [DOI]
  • 79.Radford A, Kim JW, Hallacy C, et al. Learning transferable visual models from natural language supervision. In: International conference on machine learning. PMLR; 2021. p. 8748–8763.
  • 80.Jia C, Yang Y, Xia Y, Chen Y-T, Parekh Z, Pham H, Le QV, Sung Y, Li Z, Duerig T. Scaling up visual and vision-language representation learning with noisy text supervision. In: International conference on machine learning. PMLR; 2021. p. 4904–4916.
  • 81.Li J, Li D, Savarese S, Hoi S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv. 2023. 10.48550/arXiv.2301.12597 [DOI]
  • 82.Yang Z, Li L, Lin K, Wang J, Lin C-C, Liu Z, Wang L. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv. 2023. 10.48550/arXiv.2309.17421 [DOI]
  • 83.Liu H, Li C, Wu Q, Lee YJ. Visual instruction tuning. arXiv. 2023. 10.48550/arXiv.2304.08485 [DOI]
  • 84.Gemini Team Google, Anil R, Borgeaud S, Alayrac J-B, Yu J, Soricut R, Schalkwyk J, Dai AM, Hauth A, Millican K, et al. Gemini: A family of highly capable multimodal models. arXiv. 2023. 10.48550/arXiv.2312.11805 [DOI]
  • 85.Bostrom N. Ethical issues in advanced artificial intelligence. In: Science fiction and philosophy: From time travel to superintelligence. Hoboken (NJ): Wiley; 2003. p. 277–284.
  • 86.Vasudevan AB, Dai D, Van Gool L. Object referring in videos with language and human gaze. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE; 2018. p. 4129–4138.
  • 87.Li W, Qu Z, Song H, Wang P, Xue B. The traffic scene understanding and prediction based on image captioning. IEEE Access. 2020;9:1420–1427. [Google Scholar]
  • 88.Sriram N, Maniar T, Kalyanasundaram J, Gandhi V, Bhowmick B, Krishna KM. Talk to the vehicle: Language conditioned autonomous navigation of self driving cars. In: 2019 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE; 2019. p. 5284–5290.
  • 89.Geiger A, Lenz P, Stiller C, Urtasun R. Vision meets robotics: The Kitti dataset. Int J Robot Res. 2013;32(11):1231–1237. [Google Scholar]
  • 90.Elhafsi A, Sinha R, Agia C, Schmerling E, Nesnas IA, Pavone M. Semantic anomaly detection with large language models. Auton Robot. 2023:1–21.
  • 91.Chen R, Liu Y, Kong L, Zhu X, Ma Y, Li Y, Hou Y, Qiao Y, Wang W. CLIP2Scene: Towards label-efficient 3D scene understanding by CLIP. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; 2023. p. 7020–7030.
  • 92.Romero F, Winston C, Hauswald J, Zaharia M, Kozyrakis C. Zelda: Video analytics using vision-language models. arXiv. 2023. 10.48550/arXiv.2305.03785 [DOI]
  • 93.Romero F, Hauswald J, Partap A, Kang D, Zaharia M, Kozyrakis C. Optimizing video analytics with declarative model relationships. Proc VLDB Endow. 2022;16(3):447–460. [Google Scholar]
  • 94.Tian X, Gu J, Li B, Liu Y, Hu C, Wang Y, Zhan K, Jia P, Lang X, Zhao X. DriveVLM: The convergence of autonomous driving and large vision-language models. arXiv. 2024. 10.48550/arXiv.2402.12289 [DOI]
  • 95.Pan C, Yaman B, Nesti T, Mallik A, Allievi AG, Velipasalar S, Ren L. VLP: Vision language planning for autonomous driving. arXiv. 2024. 10.48550/arXiv.2401.05577 [DOI]
  • 96.Dewangan V, Choudhary T, Chandhok S, Priyadarshan S, Jain A, Singh AK, Srivastava S, Jatavallabhula KM, Krishna KM. Talk2BEV: Language-enhanced Bird’s-eye view maps for autonomous driving. arXiv. 2023. 10.48550/arXiv.2310.02251 [DOI]
  • 97.Zhu D, Chen J, Shen X, Li X, Elhoseiny M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv. 2023. 10.48550/arXiv.2304.10592 [DOI]
  • 98.Dai W, Li J, Li D, Tiong AMH, Zhao J, Wang W, Li B, Fung P, Hoi S. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv. 2023. 10.48550/arXiv.2305.06500 [DOI]
  • 99.Zhou Y, Huang L, Bu Q, Zeng J, Li T, Qiu H, Zhu H, Guo M, Qiao Y, Li H. Embodied understanding of driving scenarios. arXiv. 2024. 10.48550/arXiv.2403.04593 [DOI]
  • 100.Ding X, Han J, Xu H, Zhang W, Li X. HiLM-D: Towards high-resolution understanding in multimodal large language models for autonomous driving. arXiv. 2023. 10.48550/arXiv.2309.05186 [DOI]
  • 101.Fu D, Li X, Wen L, Dou M, Cai P, Shi B, Qiao Y. Drive like a human: Rethinking autonomous driving with large language models. arXiv. 2023. 10.48550/arXiv.2307.07162 [DOI]
  • 102.Zhang R, Han J, Liu C, Gao P, Zhou A, Hu X, Yan S, Lu P, Li H, Qiao Y. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv. 2023. 10.48550/arXiv.2303.16199 [DOI]
  • 103.Wen L, Fu D, Li X, Cai X, Ma T, Cai P, Dou M, Shi B, He L, Qiao Y. DiLu: A knowledge-driven approach to autonomous driving with large language models. arXiv. 2023. 10.48550/arXiv.2309.16292 [DOI]
  • 104.Wang Y, Jiao R, Lang C, Huang C, Wang Z, Yang Z, Zhu Q. Empowering autonomous driving with large language models: A safety perspective. arXiv. 2023. 10.48550/arXiv.2312.00812 [DOI]
  • 105.Wang W, Xie J, Hu C, Zhou H, Fan J, Tong W, Wen Y, Wu S, Deng H, Li Z, et al. DriveMLM: Aligning multi-modal large language models with behavioral planning states for autonomous driving. arXiv. 2023. 10.48550/arXiv.2312.09245 [DOI]
  • 106.Wayve. LINGO-1: Exploring natural language for autonomous driving. [accessed 14 Sep 2023] https://wayve.ai/thinking/lingo-natural-language-autonomous-driving/
  • 107.Cui C, Yang Z, Zhou Y, Ma Y, Lu J, Wang Z. Large language models for autonomous driving: Real-world experiments. arXiv. 2023. 10.48550/arXiv.2312.09397 [DOI]
  • 108.Casas S, Sadat A, Urtasun R. MP3: A unified model to map, perceive, predict and plan. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2021. p. 14398–14407.
  • 109.Sha H, Mu Y, Jiang Y, Chen L, Xu C, Luo P, Li SE, Tomizuka M, Zhang W, Ding M. LanguageMPC: Large language models as decision makers for autonomous driving. arXiv. 2023. 10.48550/arXiv.2310.03026 [DOI]
  • 110.Jain K, Chhangani V, Tiwari A, Krishna KM, Gandhi V. Ground then navigate: Language-guided navigation in dynamic scenes. In: IEEE International Conference on Robotics and Automation (ICRA). IEEE; 2023. p. 4113–4120.
  • 111.Omama M, Inani P, Paul P, Yellapragada SC, Jatavallabhula KM, Chinchala S, Krishna M. ALT-Pilot: Autonomous navigation with language augmented topometric maps. arXiv. 2023. 10.48550/arXiv.2310.02324 [DOI]
  • 112.Pallagani V, Muppasani BC, Murugesan K. Plansformer: Generating multi-domain symbolic plans using transformers. 2023. https://openreview.net/forum?id=uvSQ8WhWHQ
  • 113.Wang P, Zhu M, Lu H, Zhong H, Chen X, Shen S, Wang X, Wang Y. BEVGPT: Generative pre-trained large model for autonomous driving prediction, decision-making, and planning. arXiv. 2023. 10.48550/arXiv.2310.10357 [DOI]
  • 114.Keysan A, Look A, Kosman E, Gursun G, Wagner J, Yao Y, Rakitsch B. Can you text what is happening? Integrating pre-trained language encoders into trajectory prediction models for autonomous driving. arXiv. 2023. 10.48550/arXiv.2309.05282 [DOI]
  • 115.Sima C, Renz K, Chitta K, Chen L, Zhang H, Xie C, Luo P, Geiger A, Li H. DriveLM: Driving with graph visual question answering. arXiv. 2023. 10.48550/arXiv.2312.14150 [DOI]
  • 116.Xu Z, Zhang Y, Xie E, Zhao Z, Guo Y, Wong K-YW, Zhao H. DriveGPT4: Interpretable end-to-end autonomous driving via large language model. arXiv. 2023. 10.48550/arXiv.2310.01412 [DOI]
  • 117.Cui C, Ma Y, Cao X, Ye W, Wang Z. Drive as you speak: Enabling human-like interaction with large language models in autonomous vehicles. arXiv. 2023. 10.48550/arXiv.2309.10228 [DOI]
  • 118.Cui C, Ma Y, Cao X, Ye W, Wang Z. Receive, reason, and react: Drive as you say with large language models in autonomous vehicles. arXiv. 2023. 10.48550/arXiv.2310.08034 [DOI]
  • 119.Shao H, Hu Y, Wang L, Waslander SL, Liu Y, Li H. LMDrive: Closed-loop end-to-end driving with large language models. arXiv. 2023. 10.48550/arXiv.2312.07488 [DOI]
  • 120.Seff A, Cera B, Chen D, Ng M, Zhou A, Nayakanti N, Refaat KS, Al-Rfou R, Sapp B. MotionLM: Multi-agent motion forecasting as language modeling. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). IEEE; 2023. p. 8579–8590.
  • 121.Mao J, Qian Y, Zhao H, Wang Y. GPT-Driver: Learning to drive with GPT. arXiv. 2023. 10.48550/arXiv.2310.01415 [DOI]
  • 122.Mao J, Ye J, Qian Y, Pavone M, Wang Y. A language agent for autonomous driving. arXiv. 2023. 10.48550/arXiv.2311.10813 [DOI]
  • 123.Ma Y, Cao Y, Sun J, Pavone M, Xiao C. Dolphins: Multimodal language model for driving. arXiv. 2023. 10.48550/arXiv.2312.00438 [DOI]
  • 124.Chen L, Sinavski O, Hunermann J, Karnsund A, Willmott AJ, Birch D, Maund D, Shotton J. Driving with llms: Fusing object-level vector modality for explainable autonomous driving. arXiv. 2023. 10.48550/arXiv.2310.01957 [DOI]
  • 125.Wulker C, Ruan S, Chirikjian GS. Quantizing Euclidean motions via double-coset decomposition. Research. 2019;2019: Article 1608396. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 126.Ha D, Schmidhuber J. Recurrent world models facilitate policy evolution. Adv Neural Inf Proces Syst. 2018;31.
  • 127.Levine S. Understanding the world through action. arXiv. 2021. 10.48550/arXiv.2110.12543 [DOI]
  • 128.LeCun Y. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. OpenReview. 2022.
  • 129.Sutton RS. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bull. 1991;2(4):160–163. [Google Scholar]
  • 130.Fan C, Yao L, Zhang J, Zhen Z, Wu X. Advanced reinforcement learning and its connections with brain neuroscience. Research. 2023;6:0064. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 131.Hafner D, Lillicrap T, Fischer I, Villegas R, Ha D, Lee H, Davidson J. Learning latent dynamics for planning from pixels. In: International conference on machine learning. PMLR; 2019. p. 2555–2565.
  • 132.Hafner D, Lillicrap T, Ba J, Norouzi M. Dream to control: Learning behaviors by latent imagination. arXiv. 2019. 10.48550/arXiv.1912.01603 [DOI]
  • 133.Hafner D, Lillicrap T, Norouzi M, Ba J. Mastering atari with discrete world models. arXiv. 2020. 10.48550/arXiv.2010.02193 [DOI]
  • 134.Hafner D, Pasukonis J, Ba J, Lillicrap T. Mastering diverse domains through world models. arXiv. 2023. 10.48550/arXiv.2301.04104 [DOI]
  • 135.Gao Z, Mu Y, Chen C, Duan J, Li SE, Luo P, Lu Y. Enhance sample efficiency and robustness of end-to-end urban autonomous driving via semantic masked world model. arXiv. 2022. 10.48550/arXiv.2210.04017 [DOI]
  • 136.Hu A, Corrado G, Griffiths N, Murez Z, Gurau C, Yeo H, Kendall A, Cipolla R, Shotton J. Model-based imitation learning for urban driving. Adv Neural Inf Proces Syst. 2022;35:20703–20716. [Google Scholar]
  • 137.Sekar R, Rybkin O, Daniilidis K, Abbeel P, Hafner D, Pathak D. Planning to explore via self-supervised world models. In: International Conference on Machine Learning. PMLR; 2020. p. 8583–8592.
  • 138.Seo Y, Lee K, James SL, Abbeel P. Reinforcement learning with action-free pre-training from videos. In: International Conference on Machine Learning. PMLR; 2022. p. 19561–19579.
  • 139.Kingma DP, Welling M. Auto-encoding variational Bayes. arXiv. 2013. 10.48550/arXiv.1312.6114 [DOI]
  • 140.Rezende DJ, Mohamed S, Wierstra D. Stochastic backpropagation and approximate inference in deep generative models. In: International conference on machine learning. PMLR; 2014. p. 1278–1286.
  • 141.Mirza M, Osindero S. Conditional generative adversarial nets. arXiv. 2014. 10.48550/arXiv.1411.1784 [DOI]
  • 142.Dinh L, Krueger D, Bengio Y. Nice: Non-linear independent components estimation. arXiv. 2014. 10.48550/arXiv.1410.8516 [DOI]
  • 143.Dinh L, Sohl-Dickstein J, Bengio S. Density estimation using Real NVP. Paper presented at: International Conference on Learning Representations; 2016; San Juan, Puerto Rico.
  • 144.Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. Adv Neural Inf Proces Syst. 2020;33:6840–6851. [Google Scholar]
  • 145.Van den Oord A, Kalchbrenner N, Espeholt L, Vinyals O, Graves A, Kavukcuoglu K. Conditional image generation with pixelcnn decoders. Adv Neural Inf Proces Syst. 2016;29. [Google Scholar]
  • 146.Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE; 2022. p. 10684–10695.
  • 147.Rempe D, Philion J, Guibas LJ, Fidler S, Litany O. Generating useful accident-prone driving scenarios via a learned traffic prior. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; 2022. p. 17305–17315.
  • 148.Kim SW, Philion J, Torralba A, Fidler S. Drivegan: Towards a controllable high-quality neural simulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; 2021. p. 5820–5829.
  • 149.Kumar M, Babaeizadeh M, Erhan D, Finn C, Levine S, Dinh L, Kingma D. Videoflow: A flow-based generative model for video. arXiv. 2019. 10.48550/arXiv.1903.01434 [DOI]
  • 150.Feng L, Li Q, Peng Z, Tan S, Zhou B. Trafficgen: Learning to generate diverse and realistic traffic scenarios. In: 2023 IEEE international conference on robotics and automation (ICRA). IEEE; 2023. p. 3567–3575.
  • 151.Swerdlow A, Xu R, Zhou B. Street-view image generation from a bird’s-eye view layout. IEEE Robot Autom Lett. 2024.
  • 152.Singer U, Polyak A, Hayes T, Yin X, An J, Zhang S, Hu Q, Yang H, Ashual O, Gafni O. Make-a-video: Text-to-video generation without text-video data. Paper presented at: The Eleventh International Conference on Learning Representations; 2023; Kigali, Rwanda.
  • 153.Harvey W, Naderiparizi S, Masrani V, Weilbach C, Wood F. Flexible diffusion modeling of long videos. Adv Neural Inf Proces Syst. 2022;35:27953–27965. [Google Scholar]
  • 154.Yang R, Srivastava P, Mandt S. Diffusion probabilistic modeling for video generation. Entropy. 2023;25(10):1469. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 155.Zhong Z, Rempe D, Xu D, Chen Y, Veer S, Che T, Ray B, Pavone M. Guided conditional diffusion for controllable traffic simulation. In: 2023 IEEE international conference on robotics and automation (ICRA). IEEE; 2023. p. 3560–3566.
  • 156.Pronovost E, Wang K, Roy N. Generating driving scenes with diffusion. arXiv. 2023. 10.48550/arXiv.2305.18452 [DOI]
  • 157.Zhang L, Xiong Y, Yang Z, Casas S, Hu R, Urtasun R. Learning unsupervised world models for autonomous driving via discrete diffusion. arXiv. 2023. 10.48550/arXiv.2311.01017 [DOI]
  • 158.Chang H, Zhang H, Jiang L, Liu C, Freeman WT. Maskgit: Masked generative image transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; 2022. p. 11315–11325.
  • 159.Van Den Oord A, Vinyals O, Kavukcuoglu K. Neural discrete representation learning. Adv Neural Inf Proces Syst. 2017;30.
  • 160.Karlsson R, Carballo A, Fujii K, Ohtani K, Takeda K. Predictive world models from real-world partial observations. arXiv. 2023. 10.48550/arXiv.2301.04783 [DOI]
  • 161.Liao Y, Xie J, Geiger A. KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d. IEEE Trans Pattern Anal Mach Intell. 2022;45:3292–3310. [DOI] [PubMed] [Google Scholar]
  • 162.Bogdoll D, Yang Y, Zöllner JM. MUVO: A multimodal generative world model for autonomous driving with geometric representations. arXiv. 2023. 10.48550/arXiv.2311.11762 [DOI]
  • 163.Zheng W, Chen W, Huang Y, Zhang B, Duan Y, Lu J. OccWorld: Learning a 3D occupancy world model for autonomous driving. arXiv. 2023. [DOI]
  • 164.Min C, Zhao D, Xiao L, Nie Y, Dai B. UniWorld: Autonomous driving pre-training via world models. arXiv. 2023. [DOI]
  • 165.Hu A, Russell L, Yeo H, Murez Z, Fedoseev G, Kendall A, Shotton J, Corrado G. Gaia-1: A generative world model for autonomous driving. arXiv. 2023. 10.48550/arXiv.2309.17080 [DOI]
  • 166.Wang X, Zhu Z, Huang G, Chen X, and Lu J. Drivedreamer: Towards real-world-driven world models for autonomous driving. arXiv. 2023. 10.48550/arXiv.2309.09777 [DOI]
  • 167.Zhao G, Wang X, Zhu Z, Chen X, Huang G, Bao X, Wang X. DriveDreamer-2: LLM-enhanced world models for diverse driving video generation. arXiv. 2024. 10.48550/arXiv.2403.06845 [DOI]
  • 168.Wang Y, He J, Fan L, Li H, Chen Y, Zhang Z. Driving into the future: Multiview visual forecasting and planning with world model for autonomous driving. arXiv. 2023. 10.48550/arXiv.2311.17918 [DOI]
  • 169.Tesla. Building foundation models for autonomy. Presentation at: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) workshop. https://www.youtube.com/watch?v=6xXb_uT7ts [accessed 30 June 2023].
  • 170.Chen C, Yoon J, Wu YF, Ahn S. TransDreamer: Reinforcement learning with transformer world models. In: Deep RL Workshop NeurIPS 2021. 2021.
  • 171.Wichers N, Villegas R, Erhan D, Lee H. Hierarchical long-term video prediction without supervision. In: International Conference on Machine Learning. PMLR; 2018. p. 6038–6046.
  • 172.Endo Y, Kanamori Y, Kuriyama S. Animating landscape: Self-supervised learning of decoupled motion and appearance for single-image video synthesis. ACM Trans Graph. 2019;38(6):1–19. [Google Scholar]
  • 173.Voleti V, Jolicoeur-Martineau A, Pal C. MCVD-masked conditional video diffusion for prediction, generation, and interpolation. Adv Neural Inf Proces Syst. 2022;35:23371–23385. [Google Scholar]
  • 174.Finn C, Goodfellow I, Levine S. Unsupervised learning for physical interaction through video prediction. Adv Neural Inf Proces Syst. 2016;29.
  • 175.Micheli V, Alonso E, Fleuret F. Transformers are sample-efficient world models. Paper presented at: Deep Reinforcement Learning Workshop NeurIPS 2022; 2022; New Orleans, LA.
  • 176.Wu Z, Dvornik N, Greff K, Kipf T, Garg A. SlotFormer: Unsupervised visual dynamics simulation with object-centric models. Paper presented at: The Eleventh International Conference on Learning Representations; 2023; Kigali, Rwanda.
  • 177.Wang X, Zhu Z, Huang G, Wang B, Chen X, Lu J. WorldDreamer: Towards general world models for video generation via predicting masked tokens. arXiv. 2024. 10.48550/arXiv.2401.09985 [DOI]
  • 178.Esser P, Rombach R, Ommer B. Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE; 2021. p. 12873–12883.
  • 179.Bruce J, Dennis M, Edwards A, Parker-Holder J, Shi Y, Hughes E, Lai M, Mavalankar A, Steigerwald R, Apps C, et al. Genie: Generative interactive environments. arXiv. 2024. 10.48550/arXiv.2402.15391 [DOI]
  • 180.Skenderi G, Li H, Tang J, Cristani M. Graph-level representation learning with joint-embedding predictive architectures. arXiv. 2023. 10.48550/arXiv.2309.16014 [DOI]
  • 181.Fei Z, Fan M, Huang J. A-JEPA: Joint-embedding predictive architecture can listen. arXiv. 2023. 10.48550/arXiv.2311.15830 [DOI]
  • 182.Sun C, Yang H, Qin B. JEP-KD: Joint-embedding predictive architecture based knowledge distillation for visual speech recognition. arXiv. 2024. 10.48550/arXiv.2403.18843 [DOI]
  • 183.Bardes A, Ponce J, LeCun Y. Mc-jepa: A joint-embedding predictive architecture for self-supervised learning of motion and content features. arXiv. 2023. 10.48550/arXiv.2307.12698 [DOI]
  • 184.Assran M, Duval Q, Misra I, Bojanowski P, Vincent P, Rabbat M, Lecun Y, Ballas N. Self-supervised learning from images with a joint-embedding predictive architecture. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; 2023. p. 15619–15629.
  • 185.Bardes A, Garrido Q, Ponce J, Chen X, Rabbat M, LeCun Y, Assran M, Ballas N. Revisiting feature prediction for learning visual representations from video. arXiv. 2024. 10.48550/arXiv.2404.08471 [DOI]
  • 186.Young P, Lai A, Hodosh M, Hockenmaier J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans Assoc Comput Linguist. 2014;2:67–78. [Google Scholar]
  • 187.Yu L, Poirson P, Yang S, Berg AC, Berg TL. Modeling context in referring expressions. In: Computer Vision–ECCV 2016: 14th European conference, Amsterdam, the Netherlands, October 11–14, 2016, proceedings, part II 14. Springer; 2016. p. 69–85.
  • 188.Mao J, Huang J, Toshev A, Camburu O, Yuille AL, Murphy K. Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. IEEE; 2016. p. 11–20.
  • 189.Hu R, Andreas J, Darrell T, Saenko K. Explainable neural computation via stack neural module networks. Paper presented at: Proceedings of the European conference on computer vision (ECCV); 2018; Munich, Germany.
  • 190.Deruyttere T, Vandenhende S, Grujicic D, Van Gool L, Moens MF. Talk2Car: Taking control of your self-driving car. Paper presented at: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019; Hong Kong, China.
  • 191.Feng Q, Ablavsky V, Sclaroff S. Cityflow-nl: Tracking and retrieval of vehicles at city scale by natural language descriptions. arXiv. 2021. 10.48550/arXiv.2101.04741 [DOI]
  • 192.Wu D, Han W, Wang T, Dong X, Zhang X, Shen J. Referring multi-object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; 2023. p. 14633–14642.
  • 193.Qian T, Chen J, Zhuo L, Jiao Y, Jiang YG. NuScenes-QA: A multi-modal visual question answering benchmark for autonomous driving scenario. arXiv. 2023. 10.48550/arXiv.2305.14836 [DOI]
  • 194.Wu D, Han W, Wang T, Liu Y, Zhang X, Shen J. Language prompt for autonomous driving. arXiv. 2023. [DOI]
  • 195.Caesar H, Bankiti V, Lang AH, Vora S, Liong VE, Xu Q, Krishnan A, Pan Y, Baldan G, Beijbom O. nuScenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. IEEE; 2020. p. 11621–11631.
  • 196.Zhou Y, Cai L, Cheng X, Gan Z, Xue X, Ding W. OpenAnnotate3D: Open-vocabulary auto-labeling system for multi-modal 3D data. arXiv. 2023. 10.48550/arXiv.2310.13398 [DOI]
  • 197.Li K, Chen K, Wang H, Hong W, Ye C, Han J, Chen Y, Zhang W, Xu C, Yeung D-T, et al. Coda: A real-world road corner case dataset for object detection in autonomous driving. In: European Conference on Computer Vision. Springer; 2022. p. 406–423.
  • 198.Treiber M, Hennecke A, Helbing D. Congested traffic states in empirical observations and microscopic simulations. Phys Rev E. 2000;62:1805. [DOI] [PubMed] [Google Scholar]
  • 199.Fellendorf M, Vortisch P. Microscopic traffic flow simulator VISSIM. In: Fundamentals of traffic simulation. New York (NY): Springer; 2010. p. 63–93.
  • 200.Dosovitskiy A, Ros G, Codevilla F, Lopez A, Koltun V. CARLA: An open urban driving simulator. In: Conference on robot learning. PMLR; 2017. p. 1–16.
  • 201.Lopez PA, Behrisch M, Bieker-Walz L, Erdmann J, Flötteröd Y-P, Hilbrich R, Lücken L, Rummel J, Wagner P, Wiessner E. Microscopic traffic simulation using sumo. In: 2018 21st international conference on intelligent transportation systems (ITSC). IEEE; 2018. p. 2575–2582.
  • 202.Caesar H, Kabzan J, Tan KS, Fong WK, Wolff E, Lang A, Fletcher L, Beijbom O, Omari S. nuPlan: A closed-loop ml-based planning benchmark for autonomous vehicles. arXiv. 2021. 10.48550/arXiv.2106.11810 [DOI]
  • 203.Li Q, Peng Z, Feng L, Zhang Q, Xue Z, Zhou B. Metadrive: Composing diverse driving scenarios for generalizable reinforcement learning. IEEE Trans Pattern Anal Mach Intell. 2022;45:3461–3475. [DOI] [PubMed] [Google Scholar]
  • 204.Vinitsky E, Lichtle N, Yang X, Amos B, Foerster J. Nocturne: A scalable driving benchmark for bringing multi-agent learning one step closer to the real world. Adv Neural Inf Proces Syst. 2022;35:3962–3974. [Google Scholar]
  • 205.Gulino C, Fu J, Luo W, Tucker G, Bronstein E, Lu Y, Harb J, Pan X, Wang Y, Chen X, et al. Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research. Paper presented at: Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track; 2023; New Orleans (LA).
  • 206.Yang K, Ma E, Peng J, Guo Q, Lin D, Yu K. BEVControl: Accurately controlling streetview elements with multi-perspective consistency via BEV sketch layout. arXiv. 2023. 10.48550/arXiv.2308.01661 [DOI]
  • 207.Li X, Zhang Y, Ye X. DrivingDiffusion: Layout-guided multi-view driving scene video generation with latent diffusion model. arXiv. 2023. 10.48550/arXiv.2310.07771 [DOI]
  • 208.Wen Y, Zhao Y, Liu Y, Jia F, Wang Y, Luo C, Zhang C, Wang T, Sun X, Zhang X. Panacea: Panoramic and controllable video generation for autonomous driving. arXiv. 2023. 10.48550/arXiv.2311.16813 [DOI]
  • 209.Gao R, Chen K, Xie E, Hong L, Li Z, Yeung D-Y, Xu Q. MagicDrive: Street view generation with diverse 3D geometry control. arXiv. 2024. 10.48550/arXiv.2310.02601 [DOI]
  • 210.Marathe A, Ramanan D, Walambe R, Kotecha K. WEDGE: A multi-weather autonomous driving dataset built from generative vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; 2023. p. 3317–3326.
  • 211.Zhong Z, Rempe D, Chen Y, Ivanovic B, Cao Y, Xu D, Pavone M, Ray B. Language-guided traffic simulation via scene-level diffusion. arXiv. 2023. 10.48550/arXiv.2306.06344 [DOI]
  • 212.Jin Y, Shen X, Peng H, Liu X, Qin J, Li J, Xie J, Gao P, Zhou G, Gong J. SurrealDriver: Designing generative driver agent simulation framework in urban contexts based on large language model. arXiv. 2023. 10.48550/arXiv.2309.13193 [DOI]
  • 213.Zhang Y, Li Y, Cui L, Cai D, Liu L, Fu T, Huang X, Zhao E, Wang L, Luu AT, et al. Siren’s song in the AI ocean: A survey on hallucination in large language models. arXiv. 2023. 10.48550/arXiv.2309.01219 [DOI]
  • 214.Liu B, Ash JT, Goel S, Krishnamurthy A, Zhang C. Exposing attention glitches with flip-flop language modeling. arXiv. 2023. 10.48550/arXiv.2306.00946 [DOI]
  • 215.Pope R, Douglas S, Chowdhery A, Devlin J, Bradbury J, Levskaya A, Heek J, Xiao K, Agrawal S, Dean J. Efficiently scaling transformer inference. Proc Machine Learning Syst. 2023;5.
  • 216.Weng L. Large Transformer Model Inference Optimization. 2023. [accessed 10 Jan 2023] https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
  • 217.Wang Y, Huang Z, Zheng Y, et al. Drive as veteran: Fine-tuning of an onboard large language model for highway autonomous driving. In: 2024 IEEE Intelligent Vehicles Symposium (IV). 2024; Jeju Island, Korea. [Google Scholar]
  • 218.Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, Wang L, Chen W. Lora: Low-rank adaptation of large language models. arXiv. 2021. 10.48550/arXiv.2106.09685 [DOI] [Google Scholar]
  • 219.Xu M, Niyato D, Zhang H, Kang J, Xiong Z, Mao S, Han Z. Sparks of generative pretrained transformers in edge intelligence for the metaverse: Caching and inference for mobile artificial intelligence-generated content services. IEEE Veh Technol Mag. 2023;18(4):35–44. [Google Scholar]
  • 220.Leike J, Krueger D, Everitt T, Martic M, Maini V, Legg S. Scalable agent alignment via reward modeling: A research direction. arXiv. 2018. 10.48550/arXiv.1811.07871 [DOI]
  • 221.Ji J, Qiu T, Chen B, Zhang B, Lou H, Wang K, Duan Y, He Z, Zhou J, Zhang Z, et al. AI alignment: A comprehensive survey. arXiv. 2024. 10.48550/arXiv.2310.19852 [DOI]
  • 222.Bai Y, Geng X, Mangalam K, Bar A, Yuille A, Darrell T, Malik J, Efros AA. Sequential modeling enables scalable learning for large vision models. arXiv. 2023. 10.48550/arXiv.2312.00785 [DOI]

Articles from Research are provided here courtesy of American Association for the Advancement of Science (AAAS) and Science and Technology Review Publishing House