Philosophical Transactions of the Royal Society B: Biological Sciences
2022 Dec 13;378(1869):20210447. doi: 10.1098/rstb.2021.0447

Learning robotic navigation from experience: principles, methods and recent results

Sergey Levine 1, Dhruv Shah 1
PMCID: PMC9745865  PMID: 36511408

Abstract

Navigation is one of the most heavily studied problems in robotics and is conventionally approached as a geometric mapping and planning problem. However, real-world navigation presents a complex set of physical challenges that defies simple geometric abstractions. Machine learning offers a promising way to go beyond geometry and conventional planning, allowing for navigational systems that make decisions based on actual prior experience. Such systems can reason about traversability in ways that go beyond geometry, accounting for the physical outcomes of their actions and exploiting patterns in real-world environments. They can also improve as more data is collected, potentially providing a powerful network effect. In this article, we present a general toolkit for experiential learning of robotic navigation skills that unifies several recent approaches, describe the underlying design principles, summarize experimental results from several of our recent papers, and discuss open problems and directions for future work.

This article is part of the theme issue ‘New approaches to 3D vision’.

Keywords: robotics, navigation, learning, reinforcement learning, machine learning

1. Introduction

Navigation represents one of the most heavily studied topics in robotics [1]. It is often approached in terms of mapping and planning: constructing a geometric representation of the world from observations, then planning through this model using motion planning algorithms [2–4]. However, such geometric approaches abstract away significant physical and semantic aspects of the navigation problem that, in practice, make a range of real-world situations difficult to handle (see figure 1). These challenges require special handling, resulting in complex systems with many components. Some works have sought to incorporate machine learning techniques either to learn navigational skills from simulation or to learn perception systems for navigation from human-provided labels. In this article, we instead argue that learned navigational models, trained directly on real-world experience rather than human-provided labels or simulators, provide the most promising long-term direction for a general solution to navigation. We refer to such learning approaches as experiential learning, because they learn directly from past experience of performing real-world navigation. As we will discuss in §2, such methods relate closely to reinforcement learning (RL).

Figure 1.

Learning-based methods can handle situations that violate the assumptions of geometric methods: sometimes obstacles that geometrically appear to block the robot’s path, such as tall grass, are actually traversable (a), and sometimes seemingly solid ground is actually not traversable, as in the case of mud or sand traps (b). Unlike geometry-based methods [5], which plan through 3D reconstructions of the environment (c), experiential learning methods [6] learn to determine from raw sensory observations which features are traversable and which are not (d). This, together with their ability to improve as more data is collected, makes such techniques a powerful choice for real-world navigation.

Geometry-based methods for navigation, based on mapping and planning, are appealing in large part because they simplify the navigation problem into a concise geometric abstraction: if the 3D shape of the environment can be inferred from observations, this can be used to construct an accurate geometric model, a path to the destination can be planned within this model, and that path can then be executed in the real world. However, although some idealized environments fit neatly into this geometric abstraction, real-world settings have a tendency to confound it. Obstacles are not always rigid impassable barriers (e.g. tall grass), and areas that appear geometrically passable might not be (e.g. mud, foliage, etc.). Real-world environments also exhibit patterns that are not used by purely geometric approaches: roads often (but not always) intersect at right angles, city blocks tend to be of equal size, and buildings are often rectangular. Such patterns can lead to convenient shortcuts and intuitive behaviours that are often exploited by humans.

Machine learning can offer an appealing toolkit for addressing these complex situations and exploiting such patterns, but the many different ways of utilizing machine learning for navigation come with very different trade-offs. In this article, we will focus specifically on experiential learning, where a robot learns how to navigate directly from real-world navigation data. We can contrast this with four other types of approaches: (1) methods that utilize learning to handle semantic aspects of navigation, typically based on computer vision with human-provided labels [7–10]; (2) methods that utilize learning to assist in 3D mapping, which is then integrated into standard geometric pipelines [11–18]; (3) methods that utilize RL in simulated environments and then employ transfer learning or domain adaptation [19–22]; (4) methods that use human-provided demonstrations to learn navigational policies [23–28].

Methods that utilize learning only for semantic (1) or geometric (2) perception do not address the limitation of geometry-based methods detailed above, namely their failure to capture the physical meaning of traversability and navigational affordances, though they can significantly improve the performance of geometric methods and address their limitations in regard to semantics. Such techniques can help to make conventional mapping and planning pipelines more effective by endowing them with semantics or more accurate 3D reconstruction. However, like conventional mapping techniques, they do not attempt to directly predict the physical outcome of a robot’s actions. This stands in contrast to experiential learning methods that directly learn which observations correspond to traversable or untraversable terrains or obstacles, though a number of works in robotic perception have incorporated elements of experiential learning, for example, for learning to classify traversability [29–33].

Methods based on simulation (3) are limited in that they rely on the fidelity of the simulator to learn about the situations a robot might encounter in the real world. Although simulation methods can significantly simplify the engineering of navigational systems, in the end they kick the can down the road: instead of manually adding special cases (tall grass, mud, etc.) into the standard mapping pipeline, we must instead model all such possible conditions in the simulator. Sometimes it might be easier to simulate some phenomenon and then learn how to handle it than to design a controller for it directly. However, human insight is still needed to identify the phenomena to simulate, and human engineering is needed to build such simulations, in contrast to methods that learn from real data and therefore learn about how the world actually works. Indeed, in other domains where machine learning methods have been successfully deployed in real-world products and applications—computer vision, NLP, speech recognition, etc. [34]—such methods utilize real data precisely because such data provides the best final performance in the real world with the least amount of effort.

Methods based on human-provided demonstrations (4), which have a long history in robotic navigation [23–28,35,36], have the benefit of learning about the world as it really is, but carry a heavy price: the performance of the system is entirely limited by the number of demonstrations that are provided and does not improve with more use. In contrast, experiential learning methods [6,37–40], which may also utilize demonstration data in combination with the robot’s own experience and, crucially, do not assume that all of the provided data is good (i.e. it should not be imitated blindly), offer the most appealing combination of benefits. Such methods handle the world the way it really is, learning traversability and navigational affordances directly from experience, improving as more data is collected, and removing the need for an expert human engineer to model the long tail of scenarios and special conditions that a robot might encounter in the real world.

Algorithms that learn robotic policies from experience often employ ‘end-to-end’ learning methods [41,42]. This can either mean that the robot learns the task directly from final task outcome feedback, or that it learns directly from raw sensory perception. Both have appealing benefits, but the former in particular is a critical strength of experiential learning: only by associating actual real-world trajectories with actual real-world outcomes can a robot acquire navigational skills that are not vulnerable to the ‘leaky abstractions’ that afflict other manually designed techniques. For example, the abstraction of geometry does not capture that tall grass is traversable. The abstraction of a simulator that does not model wheel slip does not capture that wheels can become stuck in mud. By learning about real outcomes from real data, such issues can be eliminated.

At the same time, as we will discuss in §4, learned navigation systems can (and should) still employ modularity and compositionality to solve temporally extended tasks. Indeed, we will argue that effective learning systems, like conventional mapping and planning methods, should still be divided into two parts: a memory or ‘mental map’ of their environment, and a high-level planning algorithm that uses this mental map to choose a route. Conventional methods simply choose specific abstractions, such as meshes or points in Cartesian space, to represent this map, whereas learning-based methods learn a suitable abstraction from data. These learned abstractions are grounded in the things that are actually important for real-world traversability, and they improve as the robot gathers more and more experience in the environment.

The goal of this article is to provide a high-level tutorial on how navigational systems can be trained on real-world data, provide pointers to relevant recent works, and present the overall architecture that a navigational system learned from experience should have. The remainder of this article will focus on providing a high-level summary of navigation via experiential learning, algorithms for learning low-level navigational skills from data, algorithms for composing these skills to solve temporally extended navigation problems, and a brief discussion of several of our recent works that provide experimental evidence for the viability of these approaches.

2. An overview of experiential learning for navigation

The central principle behind experiential learning is to learn from the actual experience of attempting (and succeeding or failing) to perform a given task, as opposed to learning from human-provided supervision, such as semantic labels (e.g. road vs. not road) or demonstrations. Perhaps the best-known framework for experiential learning is RL [43], which formulates the problem in terms of learning to maximize reward signals through active online exploration. However, we will make a distinction between the principle of experiential learning—learning how to perform a task using experience—and the methodology prescribed by RL. This is because the primary benefits really come from the use of experience rather than the specific choice of algorithm (RL or otherwise). The particular methods in the case studies in §5 use simple supervised learning, though they can be seen as a particularly naïve version of offline RL [44] and could likely utilize more advanced and modern offline RL methods as well.

We can use ot to denote the robot’s observation at time t, at to denote its commanded action (e.g. steering and throttle commands), and τ = {o1, a1, …, oH, aH} to denote a trajectory (i.e. a trial obtained by running the robot). The algorithm is provided with a dataset of trajectories D={τi}, which it uses to learn. This can be done either offline, where a static dataset consisting of previously collected data is provided and the algorithm learns entirely from this dataset, or it can be done online, where the policy explores the environment, appends the resulting experience to D, and periodically retrains the policy. The critical ingredient is the use of real trial data, not whether or not this data is collected online. The power of experiential learning comes from using real experience to understand which trajectories are possible, and which are not. For example, if D contains a trajectory that successfully drives through tall grass, the robot can learn that tall grass is traversable. Traversals that are not seen in the data (e.g. there is no trajectory where the robot drives through a wall) should be assumed to be impossible. Of course, this presumes a high degree of coverage in the dataset, and additional online exploration can be helpful here.
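To make this data interface concrete, the following Python sketch shows one way such a trajectory dataset might be represented; the class and field names are illustrative assumptions rather than the data structures used in the systems discussed later.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Trajectory:
    """One trial tau = {o1, a1, ..., oH, aH} obtained by running the robot."""
    observations: List[np.ndarray]  # o_t, e.g. camera images
    actions: List[np.ndarray]       # a_t, e.g. (steering, throttle) commands

@dataclass
class ExperienceDataset:
    """Dataset D = {tau_i} of previously collected trajectories."""
    trajectories: List[Trajectory] = field(default_factory=list)

    def add(self, traj: Trajectory) -> None:
        # In the online setting, newly collected experience is appended to D
        # and the policy is periodically retrained on the growing dataset.
        self.trajectories.append(traj)
```

In the offline setting, such a dataset is simply loaded from previously logged trials and never grows during training.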

It is likely that ultimately the full benefit of experiential learning will be unlocked by combining offline and online training, as they offer complementary benefits. The central benefit of offline training is the ability to reuse large and diverse navigational datasets. In the same way that state-of-the-art models in computer vision [45] and NLP [46] achieve remarkable generalization by training on huge datasets, effective navigational systems will work best when trained on large previously collected datasets, which would be impractical to recollect online for every experiment. At the same time, a major strength of such methods is to continue to improve as more data is collected, particularly for real-world deployments where such methods can benefit from a network effect: as more robots are deployed, more data is collected, the robots become more capable, and it becomes possible to deploy more of them in more settings.

To define the task, we can assume that something in ot indicates task completion. For example, the task might be defined by a goal og, or by a goal location, where the location is part of ot. More generally, it can be defined by some reward function r(ot) or goal set og ∈ G. We will assume for now that it is defined by a single goal og, though this requirement can be relaxed. The specific question that the learned model must be able to answer then becomes: given the current observation ot and some goal og, which action at should the robot take to eventually reach og?

RL [43] and imitation learning [23,35] offer viable solutions to this problem by learning policies of the form π(at | ot, og), as we will discuss in the next section. However, it is difficult to directly learn fully reactive policies that can reach very distant goals. Instead, we can decompose the navigation problem hierarchically: the robot should build some sort of ‘mental map’ of its surroundings, plan through this mental map, and utilize low-level navigational skills to execute this plan. Such skills might, for example, know how to navigate around a muddy puddle, cut across a grassy field, or go through a doorway in a building. But they do not reason about the longer-horizon structure of the plan, and therefore do not require memory. The role of π(at | ot, og) is to represent such skills and, as we will discuss in the next two sections, also to provide abstractions that can be used to build the higher level ‘mental map’. This higher level, discussed in §4, can either be an explicit search algorithm, or can be defined implicitly as part of a memory-based (e.g. recurrent) neural network model [17,47–52]. This hierarchy is also present in the standard mapping and planning approach, where the geometric map represents the robot’s ‘memory’, but the abstraction (3D points) is chosen manually. Viewed in this way, a central benefit of the experiential learning approach is to learn low-level skills π(at | ot, og) that represent navigational affordances, and then build up its higher level mapping and planning mechanisms in terms of the capabilities of these skills.
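The hierarchical decomposition sketched above can be summarized structurally as follows; this is a minimal Python skeleton under assumed interfaces, with the high-level planner deferred to §4.

```python
import numpy as np

class NavigationSystem:
    """Skeleton of the hierarchy: a low-level goal-conditioned skill pi(a_t | o_t, o_g),
    a 'mental map' of previously seen observations, and a high-level planner over it."""

    def __init__(self, policy, distance_fn):
        self.policy = policy            # short-horizon skill pi(a_t | o_t, o_g)
        self.distance_fn = distance_fn  # D(o_t, o_g): predicted cost/feasibility (section 3)
        self.landmarks = []             # stored observations acting as the mental map

    def step(self, o_t: np.ndarray, o_goal: np.ndarray) -> np.ndarray:
        self.landmarks.append(o_t)
        # High level: pick a nearby waypoint that moves the robot toward the distant goal.
        o_waypoint = self.plan_waypoint(o_t, o_goal)
        # Low level: the reactive skill only needs to reach the nearby waypoint.
        return self.policy(o_t, o_waypoint)

    def plan_waypoint(self, o_t, o_goal):
        raise NotImplementedError  # realized by graph search over landmarks (section 4)
```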

3. Learning policies from data

Training π(at | ot, og) can be framed as maximizing the probability that π reaches og, minimizing the time it takes to reach og, or optimizing some other metric. In practice, methods for learning π(at | ot, og) include goal-conditioned imitation [53–57] and RL [58–68] which, though seemingly different conceptually, can be cast into the same framework. An algorithm for training π(at | ot, og) must provide an objective function JD(π), which factorizes over the dataset:

J_D(\pi) = \sum_{\tau \in D} \sum_{t=1}^{H} J_{o_t, a_t, \tau}(\pi).

We slightly abuse notation to index time steps in τ as ot, at. In the case of supervised learning, the per-time-step objective is given by the maximum-likelihood objective JML:

J^{ML}_{o_t, a_t, \tau}(\pi) = \mathbb{E}_{o_g \sim g(\tau, t)}\left[\log \pi(a_t \mid o_t, o_g)\right],

where g(τ, t) is a relabelling distribution that selects future observations in τ as possible goals. For example, g(τ, t) might uniformly sample all ot′ with t′ > t, or select oH or ot+K (see figure 2). The general idea is to train the policy to imitate the actions in the trajectory when conditioned on the current observation and future observations in that same trajectory. RL algorithms typically use either an expected Q-value objective or a weighted likelihood objective given by

J^{Q}_{o_t, a_t, \tau}(\pi) = \mathbb{E}_{o_g \sim g(\tau, t),\, a \sim \pi(a \mid o_t, o_g)}\left[Q^{\pi}(o_t, a, o_g)\right]

and

J^{W}_{o_t, a_t, \tau}(\pi) = \mathbb{E}_{o_g \sim g(\tau, t)}\left[w(o_t, a_t, o_g)\, \log \pi(a_t \mid o_t, o_g)\right],

respectively. In the case of RL, g(τ, t) can select as goals future time steps in τ, as in the case of supervised learning (‘positives’), but can also mix in observations sampled from other trajectories (‘negatives’) that are less likely to be reached, since the Q-function or weight will tell the policy that these ‘negative’ goals have low values. Prior works have discussed a wide range of different relabelling strategies and their tradeoffs [58,59,65,68]. The expected Q objective JQ is typically used by standard actor-critic methods such as DDPG and SAC [69,70], as well as offline RL methods such as CQL [71]. The Q-function, in this case, is trained via Bellman error minimization on the same data, with offline RL methods typically including some explicit regularization to avoid out-of-distribution actions. The weighted likelihood objective JW is used by a number of offline RL methods, such as AWR, AWAC and CRR [72–74], which utilize it to avoid out-of-distribution action queries. Typically, the weight w(ot, at, og) is chosen to be larger for actions with large Q-values. For example, AWAC uses the weight w(ot, at, og) = exp(Qπ(ot, at, og) − Vπ(ot, og)). Further technical details can be found in prior work on goal-conditioned imitation [53–57], standard online RL [59–63,65,69,70] and offline RL [68,71–74]. For the purpose of this article, note that all three loss functions have a similar structure: they all involve selecting goals og using some relabelling function g(τ, t), and they all involve somehow training π(at | ot, og) to favour those actions that reach og, either directly using the actions that actually led to og in the data in the case of JML, or by favouring actions that have a high value for og according to a separate learned Q-function.
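To illustrate this common structure, the sketch below implements goal relabelling together with a JML-style maximum-likelihood loss and a JW-style advantage-weighted loss in PyTorch. The interfaces (a policy that returns a torch distribution, separate Q and V networks) are assumptions made for the example, not the implementations used in the works cited above.

```python
import random
import torch

def relabel_goal(traj, t):
    """g(tau, t): sample a future observation in the same trajectory as the goal o_g.
    Assumes t is not the final time step of the trajectory."""
    t_goal = random.randrange(t + 1, len(traj.observations))
    return traj.observations[t_goal]

def max_likelihood_loss(policy, traj, t):
    """JML: imitate the action a_t taken in the data, conditioned on (o_t, o_g)."""
    o_t, a_t = traj.observations[t], traj.actions[t]
    o_g = relabel_goal(traj, t)
    dist = policy(o_t, o_g)             # assumed to return a torch.distributions object
    return -dist.log_prob(a_t).mean()   # -log pi(a_t | o_t, o_g)

def weighted_likelihood_loss(policy, q_fn, v_fn, traj, t, beta=1.0):
    """JW with an AWAC-style weight w = exp((Q - V) / beta) (illustrative)."""
    o_t, a_t = traj.observations[t], traj.actions[t]
    o_g = relabel_goal(traj, t)
    with torch.no_grad():
        w = torch.exp((q_fn(o_t, a_t, o_g) - v_fn(o_t, o_g)) / beta)
    dist = policy(o_t, o_g)
    return -(w * dist.log_prob(a_t)).mean()
```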

Figure 2.

Low-level navigational policies can be trained from data by extracting tuples (ot, at, og), where ot is an observation along a trajectory, at is the corresponding action and og is an eventual goal that can be reached successfully after taking at in ot. Both supervised learning and RL-based techniques can do this by using some sort of relabelling function g(τ, t) to select og during training from the remainder of the trajectory τ from which ot was taken.

As discussed in the previous section, π(at | ot, og) by itself will not necessarily be effective at reaching distant goals, and perhaps more importantly, it does not maintain a memory of the environment, does not attempt to map it and remember the locations of landmarks, and does not perform explicit planning (though the process of training the Q-function arguably performs amortized planning via dynamic programming during training). Therefore, it is generally only effective for short-horizon goals. In the case of navigation tasks studied in prior work with such approaches, this typically means goals that are within line of sight of the robot, or within a few tens of metres of its present location [6,38–40,64,75], though some works have explored extensions to enable significantly longer-range control in some settings, including through the use of memory and recurrence [48,51]. As a side note, ot in general may not represent a Markovian state of the system, but only an observation. The use of recurrence mitigates this issue [76], but if we only require the policies to represent short-range skills, this issue often does not cause severe problems.

In the next section, we will discuss how planning and memory can be incorporated into a complete navigational method that uses the policies π(at | ot, og) as local controllers. This will require an additional object besides the policy itself: an evaluation or distance function D(ot, og) that additionally predicts how long π(at | ot, og) will actually take to reach og from ot (and if it will succeed at all). As discussed in prior work [58], this distance function can be extracted from a value function learned with RL. If we choose the reward to be −1 for all time steps when the goal is not reached (i.e. r(ot, at, og) = −δ(ot ≠ og)) and γ = 1, we have D(ot, og) = −Vπ(ot, og), though in practice it is convenient to use γ < 1. With supervised learning, this quantity can be learned by regressing onto the distances in the dataset, using the loss Eog∼g(τ,t)[(D(ot, og) − (t′ − t))²], where t′ is the time step in τ corresponding to og.
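A minimal sketch of this supervised distance regression, reusing the same relabelling idea, might look as follows; the dist_fn interface is an assumption made for illustration.

```python
import random
import torch
import torch.nn.functional as F

def distance_regression_loss(dist_fn, traj, t):
    """Regress D(o_t, o_g) onto the number of steps t' - t actually taken in the data."""
    t_goal = random.randrange(t + 1, len(traj.observations))   # o_g ~ g(tau, t)
    o_t, o_g = traj.observations[t], traj.observations[t_goal]
    predicted = dist_fn(o_t, o_g)                               # scalar estimate of D(o_t, o_g)
    target = torch.tensor(float(t_goal - t)).reshape(predicted.shape)
    return F.mse_loss(predicted, target)
```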

4. Planning and high-level decision making

Navigation is not just a reactive process, where a robot observes a snapshot of its environment and chooses an action. While the explicit process of exact geometric reconstruction in classic navigational methods may be obviated by learning from data, any effective navigational method likely must still retain, either explicitly or implicitly, a similar overall structure: it should acquire and remember the overall shape of the environment (though perhaps not in terms of precise geometrical detail), and it must plan through this environment to reach the final destination, while reasoning about parts of the environment that are not currently visible but were observed before and stored in memory. Indeed, it has been extensively verified experimentally that humans and animals maintain ‘mental maps’ of environments that they visit frequently [77], and these mental maps are more likely topological than precise geometric reconstructions. Crucially, such mental maps depend on abstractions of the environment. Precise geometric maps use coordinates of vertices or points as abstractions, but these abstractions are more detailed than necessary for high-level navigation, where we would like to make decisions like ‘turn left at the light’, and allow our low-level skills (as described in the previous section) to take care of carrying out such decisions. Thus, the problem of building effective mental maps hinges on acquiring effective abstractions.

A powerful idea in learning-based navigation is that the low-level policies π(at | ot, og) and their distance functions D(ot, og) can provide us with such abstractions. The basic principle is that D(ot, og) can describe the connectivity between different observations in the environment. Given input observations of two landmarks, D(ot, og) can tell us if the robot’s low-level policy π(at|ot, og) can travel between them, which induces a graph that describes the connectivity of previously observed landmarks, as illustrated in figure 3. Thus, storing a subset of previously seen observations represents the robot’s memory of its environment, and the graph induced by edge weights obtained from D(oi, oj) for each pair (oi, oj) of stored observations then represents a kind of ‘mental map’—an abstract counterpart to the geometric map constructed by conventional SLAM (simultaneous localisation and mapping) algorithms. Searching through this graph can reveal efficient paths between any pair of landmarks. Critically, this mental map is not based on the geometric shape of the environment, but rather its connectivity according to the robot’s current navigational capabilities, as described by π(at | ot, og) and D(ot, og). Particularly in the RL setting, where D(ot, og) corresponds to the value function of π(at | ot, og), this makes it clear that the robot’s low-level capabilities effectively inform its high-level abstraction, defining both the representation of its memory (i.e. the graph) and its mechanism for high-level planning (i.e. search on this graph).
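To illustrate how a learned distance function can induce such a graph, here is a simplified sketch that uses landmark indices as nodes and D as edge weights; the thresholding scheme and the networkx representation are assumptions for the example, not the exact construction used in the cited systems.

```python
import networkx as nx

def build_mental_map(landmarks, dist_fn, max_dist=20.0):
    """Connect stored landmark observations whose predicted traversal cost is small."""
    graph = nx.DiGraph()
    graph.add_nodes_from(range(len(landmarks)))
    for i, o_i in enumerate(landmarks):
        for j, o_j in enumerate(landmarks):
            if i == j:
                continue
            d = float(dist_fn(o_i, o_j))   # predicted steps for pi to drive from o_i to o_j
            if d < max_dist:               # keep only edges the low-level skill can execute
                graph.add_edge(i, j, weight=d)
    return graph
```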

Figure 3.

Planning with learned policies. Learned distance functions, which correspond to value functions for a low-level policy, describe the connectivity structure in the environment. This provides an abstraction for high-level planning informed by the capabilities of low-level skills.

Using policies and their value functions as abstractions for search and planning has been explored in a number of prior works, both in the case of RL (where distances are value functions) [64,78] and in the case of supervised learning (where distances are learned with regression) [38,75,79–81], and we refer the reader to these prior works for technical details. However, the important ingredient is not necessarily the use of graph search, but rather the use of low-level skills to form abstractions for high-level skills. For example, it is possible to dispense with the graph entirely and instead optimize over the goals using a high-level model via trajectory optimization or tree search [78,81]. Indeed, it is entirely possible that amortized higher-level planning methods (i.e. higher-level RL or other learned models) might, in the long run, prove more effective than classic graph search, or the two paradigms might be combined, for example, by using differentiable search methods as in the case of (a hierarchical variant of) value iteration networks [47] or other related methods [48,51]. The key point is that the higher-level mapping and planning process, whether learned or not, should operate on abstractions that are informed by the (learned) capabilities of the robot.

The graph described in figure 3 can be utilized for planning in a number of different ways. In the simplest case, the current observation ot and the final desired goal og are simply connected to the graph by using D(oi, oj), a graph search algorithm determines the next waypoint along the shortest path ow, and then π(at | ot, ow) is used to select the action [64]. However, real-world navigation problems often require more sophisticated approaches because (1) the environment might not have been previously explored, and therefore requires simultaneously constructing the ‘mental map’ and planning paths that move the robot toward the goal; (2) the goal might be specified with other information besides the target observation. In general, observations in a new environment might be added to the robot’s memory (‘mental map’), connecting them to a growing graph, and each time the robot might replan a path toward the goal. When the path to the goal cannot be determined because the environment has not been explored sufficiently, the robot might choose to explore a new location [39], or might use some sort of heuristic informed by side information, such as the spatial coordinate of the target, or even an overhead map [40]. The latter also provides a natural avenue for introducing other goal specification modalities: while the mental map is built in terms of the robot’s observations, the final goal can be specified in terms of any function of this observation, including potentially its GPS coordinates [40].
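Continuing the earlier sketch, the simplest planning loop described here (connect the current observation and the goal to the graph, search it, and hand the next waypoint to the low-level policy) might look roughly as follows; the interfaces and threshold are again illustrative assumptions.

```python
import networkx as nx

def next_waypoint(graph, landmarks, dist_fn, o_t, o_goal, max_dist=20.0):
    """Connect o_t and o_goal to the mental map, search it, and return the next waypoint o_w."""
    g = graph.copy()
    g.add_node('current')
    g.add_node('goal')
    for i, o_i in enumerate(landmarks):
        d_in = float(dist_fn(o_t, o_i))
        d_out = float(dist_fn(o_i, o_goal))
        if d_in < max_dist:
            g.add_edge('current', i, weight=d_in)
        if d_out < max_dist:
            g.add_edge(i, 'goal', weight=d_out)
    # Dijkstra shortest path; raises an exception if the goal is currently unreachable.
    path = nx.shortest_path(g, source='current', target='goal', weight='weight')
    next_node = path[1]
    return o_goal if next_node == 'goal' else landmarks[next_node]   # fed to pi(a_t | o_t, o_w)
```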

5. Experimental case studies

In this section, we will discuss selected recent works that develop experiential learning systems for robotic navigation, shown in figures 4 and 5: BADGR [6], which learns short-horizon navigational skills from autonomous exploration data; LaND [37], which extends BADGR to incorporate semantics in order to navigate sidewalks; ViNG [38], which incorporates topological ‘mental maps’ as described in the previous section; and the latter's two extensions: RECON [39] and ViKiNG [40], which incorporate the ability to explore new environments and utilize overhead maps, respectively.

Figure 4.

Learning outdoor navigation. BADGR [6] learns short-horizon skills from randomly collected data to minimize the risk of collision (a) or choose paths with minimum bumpiness (to stay on paved trails) (b). LaND [37] extends this system to also learn from human-provided disengagement commands, thus learning the semantics of driving on sidewalks (c) without explicit labels or rules. ViNG [38] utilizes goal-conditioned skills and combines them into long-horizon plans by building ‘mental maps’ from prior experience in a given environment to reach a series of visually indicated goals for the task of autonomous mail delivery (d).

Figure 5.

Searching a novel environment for a user-specified goal image (inset), RECON [39] incrementally builds a topological ‘mental map’ of landmarks (white) by sampling latent subgoals and navigating to them (blue path). Subsequent traversals use this mental map to reach the goal quickly (red).

(a) Learning low-level navigational skills

Low-level navigational skills can be learned using model-free RL, model-based RL, or supervised learning. We illustrate a few variations in figure 4. All of these methods only use forward-facing monocular cameras, without depth sensing, GPS or LIDAR. BADGR [6] employs a partially model-based method for training the low-level skill, predicting various task-specific metrics based on an observation and a candidate sequence of actions. Data for this method is collected using a randomized policy. The model is trained from this autonomously collected data, analogously to the models in §3, and can predict the probability that actions will lead to collision (visualized in figure 4a), the expected bumpiness of the terrain (figure 4b), and the location that the robot will reach, which is used to navigate to goals. LaND [37], shown in figure 4c, further extends this method to also predict disengagements from a human safety monitor, with data collected by attempting to navigate real-world sidewalks. By taking actions that avoid expected future disengagements, the robot implicitly learns social conventions and rules, such as staying on sidewalks and avoiding driveways, which enables the LaND system to effectively navigate real-world sidewalks in the city of Berkeley, California. These case studies illustrate how experiential learning can enable robotic navigation with a variety of objectives that accommodate user preferences and semantic or social rules. The training labels for these objective terms are provided during data collection either automatically via on-board sensors (as with collision and bumpiness) or from human interventions that occur naturally during execution (as in the case of LaND).
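A heavily simplified sketch of this kind of action-conditioned prediction and replanning loop is given below; the predictive_model interface, the random candidate sampling, and the cost weights are assumptions for illustration, not the actual BADGR implementation.

```python
import numpy as np

def plan_action(predictive_model, o_t, num_candidates=64, horizon=8,
                collision_weight=10.0, bumpiness_weight=1.0):
    """Score random candidate action sequences by predicted collision and bumpiness."""
    best_cost, best_actions = float('inf'), None
    for _ in range(num_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, 2))   # (steering, throttle)
        # Assumed interface: the model predicts per-step collision probability and bumpiness.
        p_collision, bumpiness = predictive_model(o_t, actions)
        cost = collision_weight * np.sum(p_collision) + bumpiness_weight * np.sum(bumpiness)
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions[0]   # execute the first action, then replan (MPC-style)
```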

The above methods focus on low-level skills. Figure 4d illustrates ViNG [38], which integrates low-level skills, in this case represented by a goal-conditioned policy π(at | ot, og) and distance model D(ot, og) that are trained using the supervised learning loss in §3, with a high-level navigation strategy that builds a ‘mental map’ graph over previously seen observations in the current environment, as detailed in §4. Note that this ‘map’ does not use any explicit localization: it is constructed entirely from images previously observed in the environment. The visualization in figure 4d uses GPS tags, but these are not available to the algorithm and are only used for illustration. ViNG uses goals specified by the user as images (i.e. photographs of the desired destination), and requires the robot to have previously driven through the environment to collect images that can be used to build the mental map that the system plans over. Note, however, that the low-level model π(at | ot, og) is trained on data from many different environments, constituting about 40 h of total experience, while the experience needed to build the map in each environment is comparatively more modest (comprising tens of minutes). In the next two subsections, we will describe how this can be extended to also handle novel environments.

(b) Searching in novel environments

The methods discussed above do not give us a way to reach goals in previously unseen environments—this would require the robot to physically search a novel environment for the desired goal. In conventional navigation systems, this is done by simultaneously mapping the environment and updating the plan on the fly. Experiential learning systems can also do this, building up their ‘mental map’ as they explore the new environment.

As with ViNG, we can first train the low-level policy π(at | ot, og) and distance model D(ot, og) on data from many different environments (RECON [39], described here, uses the same dataset). However, exploring a new environment requires being able to propose new feasible goals that have not yet been visited, rather than simply planning over a set of previously observed landmarks. This requires the learned low-level models to support an additional operation: sampling a new feasible subgoal that can be reached from the current observation. Sampling entire images, though feasible with a generative model, is technically complex. Instead, RECON employs a low-dimensional latent representation of feasible subgoals, learned via a variational information bottleneck (VIB) [82]. Specifically, a latent goal embedding is computed according to a conditional encoder q(zg | og, ot), where conditioning on both og and ot causes zg to represent a kind of (latent) change in state. The VIB formulation provides us with both a trained encoder q(zg | og, ot), which we can use to then train π(at | ot, zg) and D(ot, zg), and a prior distribution p0(zg) that can be used to sample random latent goals. VIB training optimizes for random samples zg ∼ p0(zg) to correspond to feasible random goals—essentially random nearby locations that are reachable from ot.
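Conceptually, exploration with such a latent goal model only requires sampling from the prior and conditioning the policy on the sampled latent, as in the minimal sketch below; the unit Gaussian prior and the policy interface are assumptions for illustration.

```python
import torch

def sample_latent_subgoal(latent_dim=8):
    """Sample z_g ~ p0(z_g); with a VIB the prior is typically a unit Gaussian."""
    return torch.randn(latent_dim)

def exploration_action(policy, o_t, latent_dim=8):
    """Propose a random feasible subgoal in latent space and act toward it: pi(a_t | o_t, z_g)."""
    z_g = sample_latent_subgoal(latent_dim)
    return policy(o_t, z_g), z_g
```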

The ability to sample random subgoals zg is used by RECON in combination with a fringe exploration algorithm, which serves as the high-level planner. RECON keeps track of how often the vicinity of each landmark in the graph has been visited and, if the robot cannot plan a path directly to the final goal, it plans to reach the ‘fringe’ of the current graph, defined as landmarks with low visitation counts, and from there samples random goals zg ∼ p0(zg). This causes the robot to seek out rarely visited locations and explore further from there, as illustrated in figure 5 (blue path). After searching through the environment once, the robot can then reuse the mental map to reach the same or other goals much more quickly.
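A minimal sketch of this kind of frontier selection based on visitation counts is shown below; the threshold and tie-breaking rule are illustrative assumptions rather than RECON's exact criterion.

```python
def choose_fringe_landmark(graph, visit_counts, fringe_threshold=2):
    """Pick a rarely visited landmark (the 'fringe' of the mental map) to explore from."""
    fringe = [n for n in graph.nodes if visit_counts.get(n, 0) <= fringe_threshold]
    candidates = fringe if fringe else list(graph.nodes)
    # Pick the least-visited candidate and sample random latent subgoals from there.
    return min(candidates, key=lambda n: visit_counts.get(n, 0))
```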

(c) Learning to navigate with side information

Specifying the goal solely using an image can be limiting for more complex navigational tasks, where the robot must drive a considerable distance and simply searching the entire environment is impractical. To extend experiential learning to such settings, ViKiNG [40] further incorporates the ability to use GPS and overhead maps (e.g. satellite images or road schematics) as ‘heuristics’ (in the A* sense) into the high-level planning process. This can be seen as somewhat analogous to how humans navigate new environments by using both geographic knowledge (e.g. from a paper map) and first-person observations, combined with patterns learned from their experience [83].

ViKiNG trains an additional heuristic model that receives as input the image of the overhead map, the approximate goal location (obtained via a noisy GPS measurement), and a query location, and predicts a heuristic estimate of the feasibility of reaching the goal from that location. This estimate is learned from data via a variant of contrastive learning [84]. It is then included in the search process as a heuristic, analogously to how heuristics are used in A* search, though with a modification to account for the fact that the robot is carrying out a physical search of the environment, and therefore should also take into consideration the time it would take for it to travel to the best current graph node from its current location.
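The sketch below illustrates this kind of heuristic-guided physical search, scoring candidate landmarks by the cost of physically reaching them plus a learned heuristic; the heuristic_fn interface and the additive scoring are assumptions for illustration, not the exact ViKiNG search procedure.

```python
def select_search_node(graph, landmarks, dist_fn, heuristic_fn, o_t, overhead_map, goal_gps):
    """Score graph nodes by (physical travel cost to reach them now) + (learned heuristic).

    heuristic_fn is an assumed interface mapping (overhead_map, goal_gps, landmark) to an
    estimate of how promising that landmark is as a stepping stone toward the goal.
    """
    best_node, best_score = None, float('inf')
    for n in graph.nodes:
        travel_cost = float(dist_fn(o_t, landmarks[n]))
        heuristic = float(heuristic_fn(overhead_map, goal_gps, landmarks[n]))
        score = travel_cost + heuristic
        if score < best_score:
            best_node, best_score = n, score
    return best_node
```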

In an experimental evaluation, ViKiNG is able to extract useful heuristics from satellite images and road schematics, and can navigate to destinations that are up to 2 km away from the starting location in new, previously unseen environments, using low-level policies and heuristic models trained on data from other environments. Evaluated environments include hiking trails, city roads in Berkeley and Richmond in California, suburban neighbourhoods, and office parks. Figure 6 shows one such experiment, where the robot successfully uses satellite image hints to navigate to a goal 1.2 km away without any human interventions.

Figure 6.

Kilometre-scale navigation using geographic hints. ViKiNG [40] can use a satellite image to perform informed search in large real-world environments (a), which involves navigating on paved roads (b) and through a dense patch of trees (c), while showing complex behaviour such as backtracking upon encountering a dead end (d).

Note that the information from GPS and overhead maps is used merely as heuristics in the high-level planning algorithm, and not directly incorporated in the observation space for the low-level navigational skills. This illustrates an important principle of the low-level versus high-level decomposition for such learning-based methods: both the low-level and high-level components can utilize learning and benefit from patterns in the environment, but they serve inherently different purposes. The low level deals with local traversability, while the high level aims to determine which paths are more likely to lead to the destination. Note also that the approach for learning the heuristic model is fairly general, and could potentially be extended in future work to incorporate other types of hints, such as textual directions.

6. Prospects for the future and concluding remarks

We discussed how experiential learning can be used to address robotic navigation problems by learning how to traverse real-world environments from real-world data. In contrast to conventional methods based on mapping and planning, methods that learn from experience can learn about how the robot actually interacts with the world, directly inferring which terrain features and obstacles are traversable and which ones are not, and developing a grounded representation of the navigational affordances of the current robot in the real world. However, much like how conventional mapping and planning methods build an internal abstract model of the world and then use it for planning, learning-based methods also, implicitly or explicitly, construct such a model out of their experience in each environment. As we discuss in §4, however, in contrast to the hand-designed abstractions in geometric methods (e.g. 3D points or vertices), learning-based methods acquire these abstractions based on the capabilities of the learned skills. Thus, robots with different capabilities will end up using different abstractions, and the representations of the ‘mental maps’ that result from such abstractions are not geometric, but rather describe the connectivity of the environment in terms of the robot’s capabilities.

Such methods for robotic navigation have a number of key advantages. Besides grounding the robot’s inferences about traversability in actual experience, they can benefit from large and diverse datasets collected over the entirety of the robot’s lifetime. In fact, they can, in principle, even incorporate data from other robots to further improve generalization [85]. Furthermore, and perhaps most importantly, such methods can continue to improve as more data is collected. In contrast to learning-based methods that utilize human-provided labels, such as imitation learning [23] and many computer vision approaches [7–10], experiential learning methods do not require any additional manual effort to be able to include more experience in the training process, so every single trajectory executed by the robot can be used for further finetuning its learned models. Therefore, such approaches will benefit richly from scale: the more robots are out there navigating in real-world environments, the more data will be gathered, and the more powerful their navigational capabilities will become. In the long run, this might become one of the largest benefits of such methods.

Of course, such approaches are not without their limitations. A major benefit of hand-designed abstractions, such as those used by geometric methods, is that the designer has a good understanding of what goes on inside the abstracted model. It is easy to examine a geometric reconstruction to determine if it is good, and it is comparatively easy to design an effective planning algorithm if it only needs to plan through geometric maps constructed by a given mapping algorithm (rather than a real and unpredictable environment). But such abstractions suffer considerable error when applied to real-world settings that violate their assumptions. Learning-based methods, in contrast, are much more firmly grounded in the real world, but because of this, their representations are as messy as the real world itself, making the learned representations difficult to interpret and debug. The dependence of these representations on the data also makes the construction and curation of the dataset a critical part of the design process. While workflows for evaluating, debugging, and troubleshooting supervised learning methods are mature and generally quite usable, learning-based control methods are still difficult to troubleshoot. For example, there is no equivalent to a ‘validation set’ in learning-based control, because a learned policy will encounter a different data distribution when it is executed in the environment than it saw during training. While some recent works have sought to develop workflows, for instance, for offline RL methods [86], such research is still in its infancy, and more robust and reliable standards and workflows are needed.

Safety and robustness are also major challenges. In some sense, these challenges follow immediately from the previously mentioned difficulties in regard to interpretability and troubleshooting: ultimately, a method that always works is always safe, but a method that sometimes fails can be unsafe if it is unclear when such failures will occur, which makes it difficult to implement mitigating measures. Therefore, approaches that improve validation of learning-based methods will likely also improve their safety. Non-learning-based methods often have more clearly defined assumptions. This can make enforcing safety constraints easier in environments where those assumptions are not violated, or where it is easy to detect such violations. However, this can present a significant barrier to real-world applications: a SLAM method that assumes static scenes can work well for indoor navigation, but is not viable, for example, for autonomous driving. The most challenging open-world settings could violate all simplifying assumptions, which might simply leave no other choice except for learning-based methods. This makes it all the more important to develop effective techniques for uncertainty estimation, out-of-distribution robustness, and intelligent control under uncertainty, which are all currently active areas of research with many open problems [87–89].

In the end, learning-based methods for robotic navigation offer a set of features that are very difficult to obtain in any other way: they provide for navigational systems that are grounded in the real-world capabilities of the robot, make it possible to utilize raw sensory inputs, improve as more data is gathered, and can accomplish all this with systems that, in terms of overall engineering, are often simpler and more compact than hand-designed mapping and planning approaches, once we account for the additional features and extensions that the latter require to handle all the edge cases that violate their assumptions. Methods based on experiential learning are still in their early days: although the basic techniques are decades old, their real-world applicability has only become feasible in recent years with the advent of effective deep neural network models. However, their benefits may make such approaches the standard for robotic navigation in the future.

Data accessibility

This article has no additional data.

Authors' contributions

S.L.: writing—original draft, writing—review and editing; D.S.: writing—original draft, writing—review and editing.

All authors gave final approval for publication and agreed to be held accountable for the work performed therein.

Conflict of interest declaration

We declare we have no competing interests.

Funding

The research discussed in this article was partially supported by ARL DCIST CRA W911NF-17-2-0181, the DARPA Assured Autonomy Program, and the DARPA RACER Program, as well as Berkeley AI Research and Berkeley DeepDrive.

References

  • 1.Siciliano B, Khatib O, Kröger T. 2008. Springer handbook of robotics, vol. 200. Berlin, Germany: Springer. [Google Scholar]
  • 2.Thrun S. 2007. Simultaneous localization and mapping. In Robotics and cognitive approaches to spatial mapping (eds Jefferies ME, Yeap WK), pp. 13-41. Berlin, Germany: Springer. [Google Scholar]
  • 3.Cadena C, Carlone L, Carrillo H, Latif Y, Scaramuzza D, Neira J, Reid I, Leonard JJ. 2016. Past, present, and future of simultaneous localization and mapping: toward the robust-perception age. IEEE Trans. Robot. 32, 1309-1332. ( 10.1109/TRO.2016.2624754) [DOI] [Google Scholar]
  • 4.Bresson G, Alsayed Z, Yu L, Glaser S. 2017. Simultaneous localization and mapping: a survey of current trends in autonomous driving. IEEE Trans. Intell. Veh. 2, 194-220. ( 10.1109/TIV.2017.2749181) [DOI] [Google Scholar]
  • 5.Agha A, et al. 2021. NeBula: quest for robotic autonomy in challenging environments; team CoSTAR at the DARPA subterranean challenge. arXiv. ( 10.48550/arXiv.2103.11470) [DOI]
  • 6.Kahn G, Abbeel P, Levine S. 2021. BADGR: an autonomous self-supervised learning-based navigation system. IEEE Robot. Autom. Lett. 6, 1312-1319. ( 10.1109/LRA.2021.3057023) [DOI] [Google Scholar]
  • 7.Chen C, Seff A, Kornhauser A, Xiao J. 2015. DeepDriving: learning affordance for direct perception in autonomous driving. In Proc. IEEE Int. Conf. on Computer Vision, pp. 2722–2730. New York, NY: IEEE.
  • 8.Armeni I, Sener O, Zamir AR, Jiang H, Brilakis I, Fischer M, Savarese S. 2016. 3D semantic parsing of large-scale indoor spaces. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 1534–1543. New York, NY: IEEE.
  • 9.Janai J, Güney F, Behl A, Geiger A. 2020. Computer vision for autonomous vehicles: problems, datasets and state of the art. Found. Trends Comput. Graph. Vis. 12, 1-308. ( 10.1561/0600000079) [DOI] [Google Scholar]
  • 10.Feng D, Haase-Schütz C, Rosenbaum L, Hertlein H, Glaeser C, Timm F, Wiesbeck W, Dietmayer K. 2020. Deep multi-modal object detection and semantic segmentation for autonomous driving: datasets, methods, and challenges. IEEE Trans. Intell. Transp. Syst. 22, 1341-1360. ( 10.1109/TITS.2020.2972974) [DOI] [Google Scholar]
  • 11.Liu F, Shen C, Lin G, Reid I. 2015. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell. 38, 2024-2039. ( 10.1109/TPAMI.2015.2505283) [DOI] [PubMed] [Google Scholar]
  • 12.Garg R, Bg VK, Carneiro G, Reid I. 2016. Unsupervised CNN for single view depth estimation: geometry to the rescue. In Eur. Conf. on Computer Vision, pp. 740–756. Cham, Switzerland: Springer.
  • 13.Tateno K, Tombari F, Laina I, Navab N. 2017. CNN-SLAM: real-time dense monocular SLAM with learned depth prediction. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 6243–6252. New York, NY: IEEE.
  • 14.DeTone D, Malisiewicz T, Rabinovich A. 2017. Toward geometric deep SLAM. arXiv. ( 10.48550/arXiv.1707.07410) [DOI]
  • 15.Yang N, Wang R, Stuckler J, Cremers D. 2018. Deep virtual stereo odometry: leveraging deep depth prediction for monocular direct sparse odometry. In Proc. Eur. Conf. on Computer Vision (ECCV), pp. 817–833. Cham, Switzerland: Springer.
  • 16.DeTone D, Malisiewicz T, Rabinovich A. 2018. Superpoint: self-supervised interest point detection and description. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition Workshops, pp. 224–236. New York, NY: IEEE.
  • 17.Chaplot DS, Salakhutdinov R, Gupta A, Gupta S. 2020. Neural topological SLAM for visual navigation. In Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition, pp. 12875–12884. New York, NY: IEEE.
  • 18.Krishna Murthy J, Saryazdi S, Iyer G, Paull L. 2020. gradSLAM: dense SLAM meets automatic differentiation. arXiv. ( 10.48550/arXiv.1910.10672) [DOI]
  • 19.Sadeghi F, Levine S. 2016. CAD2RL: real single-image flight without a single real image. arXiv. ( 10.48550/arxiv.1611.04201) [DOI]
  • 20.Pan X, You Y, Wang Z, Lu C. 2017. Virtual to real reinforcement learning for autonomous driving. arXiv. (doi:10.48550/arxiv.1704.03952)
  • 21.Müller M, Dosovitskiy A, Ghanem B, Koltun V. 2018. Driving policy transfer via modularity and abstraction. arXiv. (doi:10.48550/arxiv.1804.09364)
  • 22.Xia F, Zamir AR, He Z, Sax A, Malik J, Savarese S. 2018. Gibson Env: real-world perception for embodied agents. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 9068–9079. New York, NY: IEEE.
  • 23.Pomerleau DA. 1989. ALVINN: an autonomous land vehicle in a neural network. In Advances in neural information processing systems, vol. 1 (ed. Touretzky DS). Burlington, MA: Morgan Kaufmann. [Google Scholar]
  • 24.Silver D, Bagnell JA, Stentz A. 2010. Learning from demonstration for autonomous navigation in complex unstructured terrain. Int. J. Robot. Res. 29, 1565-1592. ( 10.1177/0278364910369715) [DOI] [Google Scholar]
  • 25.Codevilla F, Müller M, López A, Koltun V, Dosovitskiy A. 2018. End-to-end driving via conditional imitation learning. In 2018 IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 4693–4700. New York, NY: IEEE.
  • 26.Bansal M, Krizhevsky A, Ogale A. 2018. ChauffeurNet: learning to drive by imitating the best and synthesizing the worst. arXiv. ( 10.48550/arxiv.1812.03079) [DOI]
  • 27.Sauer A, Savinov N, Geiger A. 2018. Conditional affordance learning for driving in urban environments. In Proc. 2nd Conf. on Robot Learning (PLMR vol. 87), pp. 237–252. MLR Press.
  • 28.Codevilla F, Santana E, López AM, Gaidon A. 2019. Exploring the limitations of behavior cloning for autonomous driving. In Proc. IEEE/CVF Int. Conf. on Computer Vision, pp. 9329–9338. New York, NY: IEEE.
  • 29.Hirose N, Sadeghian A, Goebel P, Savarese S. 2017. To go or not to go? A near unsupervised learning approach for robot navigation. arXiv. ( 10.48550/arxiv.1709.05439) [DOI]
  • 30.Kahn G, Villaflor A, Ding B, Abbeel P, Levine S. 2018. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. In 2018 IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 5129–5136. New York, NY: IEEE.
  • 31.Wellhausen L, Ranftl R, Hutter M. 2020. Safe robot navigation via multi-modal anomaly detection. IEEE Robot. Autom. Lett. 5, 1326-1333. ( 10.1109/LRA.2020.2967706) [DOI] [Google Scholar]
  • 32.Palazzo S, Guastella DC, Cantelli L, Spadaro P, Rundo F, Muscato G, Giordano D, Spampinato C. 2020. Domain adaptation for outdoor robot traversability estimation from RGB data with safety-preserving loss. In 2020 IEEE/RSJ Int. Conf. on Intelligent Robots and Systems (IROS), pp. 10014–10021. New York, NY: IEEE.
  • 33.Lee H, Chung W. 2021. A self-training approach-based traversability analysis for mobile robots in urban environments. In 2021 IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 3389–3394. New York, NY: IEEE.
  • 34.LeCun Y, Bengio Y, Hinton G. 2015. Deep learning. Nature 521, 436-444. ( 10.1038/nature14539) [DOI] [PubMed] [Google Scholar]
  • 35.Schaal S. 1999. Is imitation learning the route to humanoid robots? Trends Cogn. Sci. 3, 233-242. ( 10.1016/S1364-6613(99)01327-3) [DOI] [PubMed] [Google Scholar]
  • 36.Bagnell JA. 2015. An invitation to imitation. Technical Report. Pittsburgh, PA: Robotics Institute, Carnegie-Mellon University.
  • 37.Kahn G, Abbeel P, Levine S. 2021. LaND: learning to navigate from disengagements. IEEE Robot. Autom. Lett. 6, 1872-1879. ( 10.1109/LRA.2021.3060404) [DOI] [Google Scholar]
  • 38.Shah D, Eysenbach B, Kahn G, Rhinehart N, Levine S. 2021. ViNG: learning open-world navigation with visual goals. In IEEE Int. Conf. on Robotics and Automation (ICRA). New York, NY: IEEE.
  • 39.Shah D, Eysenbach B, Rhinehart N, Levine S. 2021. Rapid exploration for open-world navigation with latent goal models. In 5th Ann. Conf. on Robot Learning, London, UK. RSS Foundation.
  • 40.Shah D, Levine S. 2022. ViKiNG: vision-based kilometer-scale navigation with geographic hints. arXiv. ( 10.48550/arxiv.2202.11271) [DOI]
  • 41.Levine S, Finn C, Darrell T, Abbeel P. 2016. End-to-end training of deep visuomotor policies. J. Mach. Learn. Res. 17, 1334-1373. [Google Scholar]
  • 42.Xu H, Gao Y, Yu F, Darrell T. 2017. End-to-end learning of driving models from large-scale video datasets. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 2174–2182. New York, NY: IEEE.
  • 43.Sutton RS, Barto AG. 2018. Reinforcement learning: an introduction. Cambridge, MA: MIT Press. [Google Scholar]
  • 44.Levine S, Kumar A, Tucker G, Fu J. 2020. Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv. ( 10.48550/arxiv.2005.01643) [DOI]
  • 45.Krizhevsky A, Sutskever I, Hinton GE. 2012. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097-1105. ( 10.1145/3065386) [DOI] [Google Scholar]
  • 46.Devlin J, Chang M-W, Lee K, Toutanova K. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv. ( 10.48550/arxiv.1810.04805) [DOI]
  • 47.Tamar A, Wu Y, Thomas G, Levine S, Abbeel P. 2016. Value iteration networks. Adv. Neural Inf. Process. Syst. (NIPS) 29. San Mateo, CA: Morgan Kaufmann. [Google Scholar]
  • 48.Gupta S, Davidson J, Levine S, Sukthankar R, Malik J. 2017. Cognitive mapping and planning for visual navigation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 2616–2625. New York, NY: IEEE.
  • 49.Zhang J, Tai L, Liu M, Boedecker J, Burgard W. 2017. Neural slam: learning to explore with external memory. arXiv. ( 10.48550/arxiv.1706.09520) [DOI]
  • 50.Amos B, Jimenez I, Sacks J, Boots B, Kolter JZ. 2018. Differentiable MPC for end-to-end planning and control. Adv. Neural Inf. Process. Syst. (NeurIPS) 31. San Mateo, CA: Morgan Kaufmann. [Google Scholar]
  • 51.Mirowski P, et al. 2018. Learning to navigate in cities without a map. Adv. Neural Inf. Process. Syst. (NeurIPS) 31. San Mateo, CA: Morgan Kaufmann. [Google Scholar]
  • 52.Chaplot DS, Gandhi D, Gupta S, Gupta A, Salakhutdinov R. 2020. Learning to explore using active neural slam. arXiv. ( 10.48550/arxiv.2004.05155) [DOI]
  • 53.Ghosh D, Gupta A, Reddy A, Fu J, Devin C, Eysenbach B, Levine S. 2019. Learning to reach goals via iterated supervised learning. arXiv. ( 10.48550/arxiv.1912.06088) [DOI]
  • 54.Lynch C, Khansari M, Xiao T, Kumar V, Tompson J, Levine S, Sermanet P. 2020. Learning latent plans from play. In Conf. on Robot Learning (PLMR vol. 100), pp. 1113–1132. MLR Press.
  • 55.Dasari S, Gupta A. 2020. Transformers for one-shot visual imitation. arXiv. ( 10.48550/arxiv.2011.05970) [DOI]
  • 56.Emmons S, Eysenbach B, Kostrikov I, Levine S. 2021. The essential elements of offline RL via supervised learning. In Int. Conf. on Learning Representations. La Jolla, CA: ICLR.
  • 57.Yang R, Lu Y, Li W, Sun H, Fang M, Du Y, Li X, Han L, Zhang C. 2022. Rethinking goal-conditioned supervised learning and its connection to offline RL. arXiv. ( 10.48550/arxiv.2202.04478) [DOI]
  • 58.Kaelbling LP. 1993. Learning to achieve goals. In Int. Joint Conf. on Artificial Intelligence, pp. 1094–1099. IJCAI.
  • 59.Andrychowicz M, et al. 2017. Hindsight experience replay. Adv. Neural Inf. Process. Syst. (NIPS) 30. San Mateo, CA: Morgan Kaufmann. [Google Scholar]
  • 60.Veeriah V, Oh J, Singh S. 2018. Many-goals reinforcement learning. arXiv. ( 10.48550/arxiv.1806.09605) [DOI]
  • 61.Nair AV, Pong V, Dalal M, Bahl S, Lin S, Levine S. 2018. Visual reinforcement learning with imagined goals. Adv. Neural Inf. Process. Syst. (NeurIPS) 31. San Mateo, CA: Morgan Kaufmann. [Google Scholar]
  • 62.Warde-Farley D, Van de Wiele T, Kulkarni T, Ionescu C, Hansen S, Mnih V. 2018. Unsupervised control through non-parametric discriminative rewards. arXiv. ( 10.48550/arxiv.1811.11359) [DOI]
  • 63.Pong VH, Dalal M, Lin S, Nair A, Bahl S, Levine S. 2019. Skew-Fit: state-covering self-supervised reinforcement learning. arXiv. ( 10.48550/arxiv.1903.03698) [DOI]
  • 64.Eysenbach B, Salakhutdinov RR, Levine S. 2019. Search on the replay buffer: bridging planning and reinforcement learning. Adv. Neural Inf. Process. Syst. (NeurIPS) 32. San Mateo, CA: Morgan Kaufmann. [Google Scholar]
  • 65.Eysenbach B, Salakhutdinov R, Levine S. 2020. C-learning: learning to achieve goals via recursive classification. arXiv. ( 10.48550/arxiv.2011.08909) [DOI]
  • 66.Colas C, Karch T, Sigaud O, Oudeyer P-Y. 2020. Intrinsically motivated goal-conditioned reinforcement learning: a short survey. arXiv. ( 10.48550/arxiv.2012.09830) [DOI]
  • 67.Chane-Sane E, Schmid C, Laptev I. 2021. Goal-conditioned reinforcement learning with imagined subgoals. In Proc. 38th Int. Conf. on Machine Learning (PMLR vol. 139), pp. 1430–1440. MLR Press.
  • 68.Chebotar Y, et al. 2021. Actionable models: unsupervised offline reinforcement learning of robotic skills. arXiv. ( 10.48550/arxiv.2104.07749) [DOI]
  • 69.Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, Silver D, Wierstra D. 2015. Continuous control with deep reinforcement learning. arXiv. ( 10.48550/arxiv.1509.02971) [DOI]
  • 70.Haarnoja T, Zhou A, Abbeel P, Levine S. 2018. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proc. 36th Int. Conf. on Machine Learning (PMLR vol. 80), pp. 1861–1870. MLR Press.
  • 71.Kumar A, Zhou A, Tucker G, Levine S. 2020. Conservative Q-learning for offline reinforcement learning. arXiv. ( 10.48550/arxiv.2006.04779) [DOI]
  • 72.Peng XB, Kumar A, Zhang G, Levine S. 2019. Advantage-weighted regression: simple and scalable off-policy reinforcement learning. arXiv. ( 10.48550/arxiv.1910.00177) [DOI]
  • 73.Nair A, Gupta A, Dalal M, Levine S. 2020. AWAC: accelerating online reinforcement learning with offline datasets. arXiv. ( 10.48550/arxiv.2006.09359) [DOI]
  • 74.Wang Z, et al. 2020. Critic regularized regression. Adv. Neural Inf. Process. Syst. (NeurIPS) 33, 7768-7778. San Mateo, CA: Morgan Kaufmann. [Google Scholar]
  • 75.Savinov N, Dosovitskiy A, Koltun V. 2018. Semi-parametric topological memory for navigation. arXiv. ( 10.48550/arxiv.1803.00653) [DOI]
  • 76.Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, Harley T, Silver D, Kavukcuoglu K. 2016. Asynchronous methods for deep reinforcement learning. In Proc. 33rd Int. Conf. on Machine Learning (PMLR vol. 48), pp. 1928–1937. MLR Press.
  • 77.Gould P, White R. 2012. Mental maps. Abingdon, UK: Routledge. [Google Scholar]
  • 78.Nasiriany S, Pong V, Lin S, Levine S. 2019. Planning with goal-conditioned policies. Adv. Neural Inf. Process. Syst. (NeurIPS) 32. San Mateo, CA: Morgan Kaufmann. [Google Scholar]
  • 79.Emmons S, Jain A, Laskin M, Kurutach T, Abbeel P, Pathak D. 2020. Sparse graphical memory for robust planning. Adv. Neural Inf. Process. Syst. (NeurIPS) 33, 5251-5262. San Mateo, CA: Morgan Kaufmann. [Google Scholar]
  • 80.Beeching E, Dibangoye J, Simonin O, Wolf C. 2020. Learning to plan with uncertain topological maps. In Eur. Conf. on Computer Vision, pp. 473–490. Cham, Switzerland: Springer.
  • 81.Ichter B, Sermanet P, Lynch C. 2020. Broadly-exploring, local-policy trees for long-horizon task planning. arXiv. ( 10.48550/arxiv.2010.06491) [DOI]
  • 82.Alemi AA, Fischer I, Dillon JV, Murphy K. 2016. Deep variational information bottleneck. arXiv. ( 10.48550/arxiv.1612.00410) [DOI]
  • 83.Wiener JM, Büchner SJ, Hölscher C. 2009. Taxonomy of human wayfinding tasks: a knowledge-based approach. Spat. Cogn. Comput. 9, 152-165. ( 10.1080/13875860902906496) [DOI] [Google Scholar]
  • 84.van den Oord A, Li Y, Vinyals O. 2019. Representation learning with contrastive predictive coding. arXiv. ( 10.48550/arxiv.1807.03748) [DOI]
  • 85.Kang K, Kahn G, Levine S. 2021. Hierarchically integrated models: learning to navigate from heterogeneous robots. In 5th Ann. Conf. on Robot Learning, London, UK. MLR Press.
  • 86.Kumar A, Singh A, Tian S, Finn C, Levine S. 2021. A workflow for offline model-free robotic reinforcement learning. arXiv. ( 10.48550/arxiv.2109.10813) [DOI]
  • 87.Gal Y, Ghahramani Z. 2016. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proc. 33rd Int. Conf. on Machine Learning (PMLR vol. 48), pp. 1050–1059. MLR Press.
  • 88.Guo C, Pleiss G, Sun Y, Weinberger KQ. 2017. On calibration of modern neural networks. In Proc. 34th Int. Conf. on Machine Learning (PMLR vol. 70), pp. 1321–1330. MLR Press.
  • 89.Lakshminarayanan B, Pritzel A, Blundell C. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. Adv. Neural Inf. Process. Syst. (NIPS) 30. San Mateo, CA: Morgan Kaufmann. [Google Scholar]
