Abstract
With the widespread application of artificial intelligence (AI), the perception, understanding, decision-making, and control capabilities of autonomous systems have improved significantly in recent years. When autonomous systems must deliver both accuracy and transferability, several AI methods, such as adversarial learning, reinforcement learning (RL), and meta-learning, show powerful performance. Here, we review learning-based approaches in autonomous systems from the perspectives of accuracy and transferability. Accuracy means that a well-trained model shows good results during the testing phase, in which the testing set shares the same task or data distribution with the training set. Transferability means that a well-trained model retains good accuracy when transferred to other testing domains. Firstly, we introduce some basic concepts of transfer learning and then present some preliminaries of adversarial learning, RL, and meta-learning. Secondly, we review the accuracy or transferability or both of these approaches to show the advantages of adversarial learning, such as generative adversarial networks, in typical computer vision tasks in autonomous systems, including image style transfer, image super-resolution, image deblurring/dehazing/rain removal, semantic segmentation, depth estimation, pedestrian detection, and person re-identification. We furthermore review the performance of RL and meta-learning, in terms of accuracy or transferability or both, in autonomous systems, involving pedestrian tracking, robot navigation, and robotic manipulation. Finally, we discuss several challenges and future topics for the use of adversarial learning, RL, and meta-learning in autonomous systems.
Keywords: autonomous systems, artificial intelligence, transferability, deep learning, generative adversarial networks, reinforcement learning, meta-learning
The Bigger Picture
Accuracy and transferability are critical to the perception and decision-making tasks of autonomous systems. The focus of several learning-based perception and decision-making methods has gradually evolved from accuracy to transferability. This survey summarizes the perception and decision-making tasks of autonomous systems from the perspectives of accuracy and transferability. We introduce transfer learning and some preliminaries of adversarial learning, reinforcement learning, and meta-learning. Then, we review several perception and decision-making tasks of autonomous systems from the perspectives of accuracy or transferability or both. Finally, we discuss several challenges and future directions for using adversarial learning, reinforcement learning, and meta-learning in autonomous systems.
Accuracy and transferability are critical to the perception and decision-making tasks of autonomous systems. This paper reviews the perception and decision-making tasks of autonomous systems from the perspectives of accuracy and transferability. This survey summarizes some learning-based methods and discusses several challenges and future topics concerning complex multi-task learning, domain adaptation, and model transferability in autonomous systems.
Main Text
Introduction
Artificial intelligence (AI) has been widely used in art, government, healthcare, games, and economics due to its powerful learning ability. Especially after the representative AI algorithm AlphaGo defeated the world champion in Go,1 people have been paying more attention to AI. Understanding the behavior of AI agents is very important in promoting the technology.2 With the rise of deep learning (DL) algorithms, the upgrading of hardware, and the availability of big data, AI technology has made huge progress in recent years.3 Autonomous systems powered by AI, including unmanned vehicles, robotic manipulators, and drones, have been widely used in various industries and in daily life, such as intelligent transportation,4 intelligent logistics,5 and service robots.6 Due to the limitations of current computer perception and decision-making technologies in terms of accuracy and transferability, autonomous systems still have much room for improvement in complex, intelligent tasks. Because DL can capture high-dimensional data features,3 DL-based algorithms are widely used in the perception and decision-making tasks of autonomous systems. There are a number of typical tasks related to perception and decision-making for autonomous systems, such as image super-resolution (SR),7,8 image deblurring/dehazing/rain removal,9, 10, 11 semantic segmentation,12,13 depth estimation,14,15 pedestrian detection,16 person re-identification (re-ID),17 pedestrian tracking,18 robot navigation,19,20 and robotic manipulation.21,22 However, most DL-based models have good accuracy but poor transferability, i.e., they are usually effective only on testing datasets with the same data distribution or task.
When a well-trained model is transferred to other datasets or real-world tasks, the accuracy usually declines drastically, which means that the transferability is poor; thus, transferability has to be taken into account for practical applications.23 As a result, current vision perception and decision-making methods cannot be used directly in actual autonomous systems. Transfer learning improves the transferability of models between different domains, i.e., a well-trained model can achieve good accuracy when applied to other testing domains.
Recently, since adversarial learning, such as generative adversarial networks (GANs), has shown promising results in image generation, a number of GANs-based methods have been proposed and have achieved breakthroughs in the aforementioned computer vision tasks.24, 25, 26, 27 In the field of AI, GANs have become increasingly important due to their powerful generation and domain adaptation capabilities.28 GANs have attracted increasing attention since they were proposed by Goodfellow et al.29 in 2014. A GAN is a generative model that introduces adversarial learning between a generator and a discriminator, in which the generator creates data to deceive the discriminator while the discriminator distinguishes whether its input comes from real data or generated data. The generator and discriminator are iteratively optimized in the game, and finally reach a Nash equilibrium.30 In particular, when considering a well-trained model for different datasets or real scenes, GANs can be used for domain-transfer tasks by virtue of their ability to capture high-frequency features to generate sharp images.31 Although some learning-based models mainly focus on the aspect of accuracy,7,12,14 GANs have demonstrated satisfactory results for various complex image fields in autonomous systems and other related fields, such as text-to-image generation,32,33 image style transfer,24,34 SR,26 image deblurring,27 image rain removal,35,36 object detection,37,38 semantic segmentation,24,39,40 pedestrian detection,41 person re-ID,42 and video generation.43
Meanwhile, as a powerful tool for decision-making and control, reinforcement learning (RL) has been extensively studied in recent years because it is suitable for decision-making tasks in complex environments.44,45 However, when the input data are high-dimensional such as images, sounds, and videos, it is difficult to solve the problem only with RL. With the help of deep neural networks (DNNs), deep RL (DRL), which combines the high-dimensional perceptual ability of DL with the decision-making ability of RL, has achieved promising results recently in various fields of application, such as obstacle avoidance,46,47 robot navigation,48,49 robotic manipulation,50,51 video target tracking,18,52 game playing,53,54 and drug testing.55,56 However, DRL tends to require a large number of trials and needs to specify a reward function to define a certain task.57 The former is time-consuming and the latter is significantly difficult when training from scratch. To tackle these problems, the idea of “learn to learn,” called meta-learning, has emerged.58 Compared with DRL, meta-learning makes the learning methods more transferable and efficient by utilizing previous experience to guide the learning of new tasks across domains. Therefore, meta-learning methods perform well especially in environments lacking data, such as image recognition,59 classification,60 robot navigation,61 and robotic arm control.62
With the development of DL, learning-based perception and decision-making algorithms for autonomous systems have become a hot research topic. Among the reviews of autonomous systems, Tang et al.63 introduced the applications of learning-based methods in perception and decision-making for autonomous systems. Gui et al.28 gave a detailed overview of various GANs methods from the perspectives of algorithms, theories, and applications. Arulkumaran et al.64 detailed the core algorithms of DRL and the advantages of RL for visual understanding tasks. Unlike previous surveys, we focus on reviewing learning-based approaches in the perception and decision-making tasks of autonomous systems from the perspectives of accuracy or transferability, or both.
The organization of this review is arranged as follows. The next section introduces transfer learning and one of its related machine-learning techniques, domain adaptation, and presents the basic concepts of adversarial learning, RL, and meta-learning. Following this, we survey some recent developments by exploring various learning-based approaches in autonomous systems, taking into account accuracy or transferability or both of these concepts. We then summarize some trends and challenges for autonomous systems, followed by our conclusions. The abbreviations used in this review are listed in Table 1.
Table 1.
Summary of Abbreviations in This Review
| Abbreviation | Full Name |
|---|---|
| AC | actor-critic |
| AI | artificial intelligence |
| cGANs | conditional generative adversarial networks |
| CNNs | convolutional neural networks |
| CycleGAN | cycle-consistent adversarial network |
| DL | deep learning |
| DNNs | deep neural networks |
| DQN | deep Q network |
| DRL | deep reinforcement learning |
| GANs | generative adversarial networks |
| GAIL | generative adversarial imitation learning |
| HR | high-resolution |
| IRL | inverse reinforcement learning |
| LR | low-resolution |
| LSTM | long short-term memory |
| MAML | model-agnostic meta-learning |
| re-ID | re-identification |
| RL | reinforcement learning |
| SR | super-resolution |
| TCN | temporal convolution network |
Preliminaries
Learning-based methods are used in various perception and decision-making tasks of autonomous systems, such as image style transfer, image SR, image deblurring/dehazing/rain removal, semantic segmentation, depth estimation, pedestrian detection, person re-ID, pedestrian tracking, robot navigation, and robotic manipulation. However, most traditional learning-based methods usually achieve good accuracy on the testing set with the same distribution or the same task. In recent years, with the research on the transferability of models, several typical learning-based methods have been widely used, such as adversarial learning and meta-learning.
Overview of the Section
Focusing on the transferability of models, transfer learning is introduced first in this section; it aims to give a well-trained model good transferability, i.e., the model can be transferred to other testing sets and still achieve good accuracy. We then introduce several typical learning-based methods that concentrate on improving accuracy, transferability, or both: adversarial learning, RL, and meta-learning. In the perception tasks of autonomous systems, adversarial learning methods such as GANs can provide good accuracy or transferability or both. In the decision-making tasks of autonomous systems, RL and meta-learning are often used to improve the accuracy or transferability of the system.
Transfer Learning
Transfer learning is a research topic that investigates how to improve a learner in one target domain by exploiting more easily obtained data from source domains.65 In other words, the domains, tasks, and distributions used in training and testing can be different. Therefore, transfer learning saves a great deal of time and labeling cost across the various scenarios of machine-learning applications. According to the different relationships between domains, source tasks, and target tasks, transfer learning can be categorized into three subsettings: inductive transfer learning, transductive transfer learning, and unsupervised transfer learning.66 The definitions of and differences between these transfer learning settings are presented in detail in Table 2.
Table 2.
Definitions and Differences between Three Transfer Learning Settings
| Transfer Learning Settings | Source and Target Domains | Source and Target Tasks | Source Domain Labels | Target Domain Labels |
|---|---|---|---|---|
| Inductive transfer learning | the same/different but related | different but related | available/unavailable | available |
| Transductive transfer learning | different but related | the same | available | unavailable |
| Unsupervised transfer learning | the same/different but related | different but related | unavailable | unavailable |
Copyright 2009, IEEE. Reprinted, with permission, from Pan and Yang.66
Domain Adaptation
There are many machine-learning techniques that are connected to transfer learning,66 for example, domain adaptation,67 related to transductive transfer learning, and multi-task learning68 and self-taught learning,69 related to inductive transfer learning. Here, we focus on domain adaptation, whereby the source and target domains share the same feature spaces while the feature distributions are different but related. The difference between domain adaptation and transductive transfer learning is that domain adaptation leverages labeled data in the source domain to learn a classifier for the target domain, where the target domain is either fully unlabeled (unsupervised domain adaptation) or has few labeled samples (semi-supervised domain adaptation).70 Domain adaptation is promising for the transferability of perception tasks of autonomous systems because it efficiently reduces the domain shift among different datasets arising from synthetic versus real images,71 different weather conditions,72 different lighting conditions,73 or different seasons,74 among others. Domain adaptation for visual applications includes shallow and deep methods.31 Some research studies shallow domain-adaptation methods, which mainly comprise homogeneous and heterogeneous domain adaptation, according to whether the source data and target data have the same representation.67,75,76 Readers who want to learn more about shallow domain adaptation methods are referred to the studies by Csurka31 and Patel et al.,77 and the references therein. In this review, we mainly focus on deep domain adaptation methods, including traditional DL67,78,79 and adversarial learning.80, 81, 82
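To make the domain-adaptation idea concrete, the sketch below aligns second-order statistics in the spirit of correlation alignment (CORAL): source features are whitened and then re-colored with the target covariance, so a classifier trained on the aligned source data sees target-like statistics. The data, dimensions, and function names are our own toy choices, not from any cited implementation.

```python
import numpy as np

def coral_align(Xs, Xt, eps=1e-6):
    """Align source features Xs to target features Xt by matching means and covariances."""
    # Center both domains.
    Xs_c = Xs - Xs.mean(axis=0)
    Xt_c = Xt - Xt.mean(axis=0)
    # Covariances, lightly regularized so the matrix square roots exist.
    Cs = np.cov(Xs_c, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt_c, rowvar=False) + eps * np.eye(Xt.shape[1])

    def mat_pow(C, p):
        # Symmetric matrix power via eigendecomposition.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(np.clip(w, eps, None) ** p) @ V.T

    # Whiten the source, then re-color with the target covariance; restore target mean.
    Xs_aligned = Xs_c @ mat_pow(Cs, -0.5) @ mat_pow(Ct, 0.5)
    return Xs_aligned + Xt.mean(axis=0)

rng = np.random.default_rng(0)
Xs = rng.normal(size=(500, 3)) @ np.diag([1.0, 5.0, 0.2])   # toy "source" features
Xt = rng.normal(size=(500, 3)) @ np.diag([2.0, 1.0, 1.0])   # toy "target" features
Xa = coral_align(Xs, Xt)
# After alignment, the source covariance closely matches the target covariance.
```

This second-order matching is one of the simplest shallow alternatives to the adversarial feature alignment discussed later in this review.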
Adversarial Learning
Early adversarial learning modeled the learner and the adversary as players in a competitive two-player game.83 Subsequently, adversarial learning has been formulated in different forms, such as a Bayesian game,84 a sequential game,85 a bilevel optimization problem,86 and so forth.87 With the popularity of DNNs, Goodfellow et al.29 applied adversarial learning to generative tasks, i.e., GANs. This model is widely used in various fields of autonomous systems.
Generative Adversarial Networks
As a powerful learning-based method for computer vision tasks, adversarial learning not only improves the accuracy but also helps improve the transferability of the model by reducing the differences between the training and testing domain distributions.80 GANs are architectures that use adversarial learning methods for generative tasks.28 The framework includes two models, a generator G and a discriminator D, as shown in Figure 1. G maps samples from a prior noise distribution $p_z(z)$ to fake data $G(z)$, and D outputs a single scalar characterizing whether its input comes from the training data x or the generated data $G(z)$. G and D play against each other, promote each other, and finally reach a Nash equilibrium.30 G and D play a two-player minimax game with the value function $V(D, G)$:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))] \quad \text{(Equation 1)}$$
where $V(D, G)$ is a binary cross-entropy objective, which encourages D to classify samples as real or fake. In Equation 1, D tries to maximize $V(D, G)$, G tries to minimize it, and the game ends at a saddle point.30
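The minimax objective can be checked numerically on toy discrete distributions. The sketch below evaluates $V(D, G)$ for fixed generator and data distributions (arbitrary values of our own choosing) and verifies that the optimal discriminator derived by Goodfellow et al.,29 $D^*(x) = p_{\text{data}}(x)/(p_{\text{data}}(x) + p_g(x))$, beats random discriminators:

```python
import numpy as np

# Toy discrete distributions over 4 outcomes for the real data and the generator.
p_data = np.array([0.4, 0.3, 0.2, 0.1])
p_g    = np.array([0.1, 0.2, 0.3, 0.4])

def value(D):
    """V(D, G) = E_{x~p_data}[log D(x)] + E_{x~p_g}[log(1 - D(x))]."""
    return np.sum(p_data * np.log(D)) + np.sum(p_g * np.log(1.0 - D))

# The optimal discriminator for a fixed generator (Goodfellow et al., 2014).
D_star = p_data / (p_data + p_g)

# Numerical check: random discriminators never exceed the optimum.
rng = np.random.default_rng(0)
best_random = max(value(rng.uniform(0.01, 0.99, size=4)) for _ in range(10000))
assert value(D_star) >= best_random
# If p_g matched p_data, D* would be 1/2 everywhere and V would equal -log 4,
# the value at the Nash equilibrium.
```

Note how each component of $D^*$ independently maximizes its term of the sum, which is why the per-outcome ratio formula gives the global maximizer.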
Figure 1.
Generative Adversarial Networks and Several Typical Variants
(A) Generative adversarial networks. Copyright (2018) IEEE. Reprinted, with permission, from Creswell et al.88
(B) Conditional generative adversarial networks. From Mirza and Osindero.89
(C) Cycle-consistent adversarial networks. Copyright (2017) IEEE. Reprinted, with permission, from Zhu et al.24
Conditional Generative Adversarial Networks
In the original generative model, since the prior comes from the noise distribution $p_z(z)$, the mode of the generated data cannot be controlled.30 Mirza and Osindero89 then proposed conditional GANs (cGANs), in which some extra information y is fed to both the generator and the discriminator such that the data generation process can be guided, as shown in Figure 1. Note that y can be class labels or any other kind of auxiliary information. Compared with Equation 1, the objective function of cGANs is as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x\mid y)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z\mid y)))] \quad \text{(Equation 2)}$$
Cycle-Consistent Adversarial Networks
Unlike models tailored for specific tasks, such as GANs and cGANs, cycle-consistent adversarial networks (CycleGANs) use a unified framework for various image tasks, which makes the framework simple and effective.24 Zhu et al.24 proposed CycleGAN to learn image translation between a source domain X and a target domain Y with unpaired training examples $\{x_i\}_{i=1}^{N} \in X$ and $\{y_j\}_{j=1}^{M} \in Y$, in which N and M represent the total numbers of samples in the source and target domains, as shown in Figure 1. The framework includes two generators $G: X \to Y$ and $F: Y \to X$, and two discriminators $D_X$ and $D_Y$, where $D_X$ distinguishes between images x and translated images $F(y)$; similarly, $D_Y$ distinguishes between images y and translated images $G(x)$. The output of the mapping G is $\hat{y} = G(x)$, and the output of the mapping F is $\hat{x} = F(y)$. Zhu et al. express the adversarial loss for the generator G and the discriminator $D_Y$ as follows:
$$\mathcal{L}_{\text{GAN}}(G, D_Y, X, Y) = \mathbb{E}_{y\sim p_{\text{data}}(y)}[\log D_Y(y)] + \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log(1 - D_Y(G(x)))] \quad \text{(Equation 3)}$$
They similarly define the adversarial loss for the generator F and the discriminator $D_X$ as $\mathcal{L}_{\text{GAN}}(F, D_X, Y, X)$. Based on the adversarial loss, they proposed a cycle-consistency loss to encourage $F(G(x)) \approx x$ and $G(F(y)) \approx y$. The cycle-consistency loss is expressed as:
$$\mathcal{L}_{\text{cyc}}(G, F) = \mathbb{E}_{x\sim p_{\text{data}}(x)}[\lVert F(G(x)) - x \rVert_1] + \mathbb{E}_{y\sim p_{\text{data}}(y)}[\lVert G(F(y)) - y \rVert_1] \quad \text{(Equation 4)}$$
The full objective of CycleGAN is
$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{\text{GAN}}(G, D_Y, X, Y) + \mathcal{L}_{\text{GAN}}(F, D_X, Y, X) + \lambda \mathcal{L}_{\text{cyc}}(G, F) \quad \text{(Equation 5)}$$
where λ is a hyperparameter used to control the relative importance of the adversarial loss and the cycle-consistency loss.
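The role of the cycle-consistency term can be illustrated with a toy 1-D "translation" of our own devising (simple linear maps standing in for learned generators): when G and F are mutual inverses, Equation 4 is numerically zero, and any mismatch inflates it.

```python
import numpy as np

# Toy 1-D "domains": G maps X -> Y and F maps Y -> X. Pretend the true
# translation is y = 2x + 1, so a perfectly consistent pair is:
G = lambda x: 2.0 * x + 1.0
F = lambda y: (y - 1.0) / 2.0

def cycle_loss(G, F, xs, ys):
    """L_cyc(G, F) = E_x ||F(G(x)) - x||_1 + E_y ||G(F(y)) - y||_1."""
    forward  = np.mean(np.abs(F(G(xs)) - xs))   # x -> G(x) -> F(G(x)) should return to x
    backward = np.mean(np.abs(G(F(ys)) - ys))   # y -> F(y) -> G(F(y)) should return to y
    return forward + backward

rng = np.random.default_rng(0)
xs, ys = rng.normal(size=100), rng.normal(size=100)
perfect = cycle_loss(G, F, xs, ys)                  # ~0: the mappings invert each other
broken  = cycle_loss(G, lambda y: y / 2.0, xs, ys)  # an inconsistent F inflates the loss
```

This is exactly the pressure Equation 5 applies during training: the adversarial terms make each translation look realistic, while the cycle term forbids translations that cannot be undone.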
As a powerful generative model, the GAN has many variants, obtained by modifying loss functions or network architectures and used for various computer vision tasks. In this review, we mainly focus on the problems of scene transfer and task transfer in autonomous systems using GANs, including image style transfer, image SR, image deblurring/dehazing/rain removal, semantic segmentation, depth estimation, pedestrian detection, and person re-ID.
Reinforcement Learning
Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment.90 In the RL framework, an agent interacts with the environment and chooses an action in the given state of the environment in order to maximize its long-term reward.91 RL algorithms can be classified into two kinds, model-based and model-free algorithms.92 Model-based RL learns a transition model that allows the environment to be simulated without directly interacting with the environment.64 Model-based methods include guided policy search (GPS)50 and model-based value expansion.93 In contrast, model-free RL uses the experience of states and environments directly to generate actions.94 Model-free methods include the deep Q network (DQN),95 the deep deterministic policy gradient (DDPG) method,96 the dynamic policy programming (DPP) method,97 and the asynchronous advantage actor-critic (A3C) method.61 Model-free algorithms can learn complex tasks but tend to be sample inefficient, while model-based algorithms are more sample efficient but usually have difficulty scaling to complicated tasks.48 As RL methods are applied more widely, problems arise: model-based algorithms may no longer be applicable to more complex tasks, while model-free algorithms need more training data. Moreover, when the given environment changes or training data are insufficient, RL methods often need to train the model from scratch, which is inefficient and inaccurate. Therefore, RL methods are limited when generalizing to different tasks and domains.48 In this review, we mainly focus on several modifications of RL methods, such as amending the network structure98,99 and optimizing the way of training,100,101 to enable the model to learn new tasks accurately in the same domain or transferably across domains.
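As a minimal model-free example, the sketch below runs tabular Q-learning (a simpler tabular relative of the DQN mentioned above) on a toy five-state corridor; the environment, reward, and hyperparameters are our own illustrative choices.

```python
import numpy as np

# Model-free tabular Q-learning on a 5-state corridor: start at state 0,
# reward 1 for reaching the terminal state 4.
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
alpha, gamma, eps = 0.5, 0.9, 0.3   # learning rate, discount, exploration rate
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    s2 = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
    done = s2 == n_states - 1
    return s2, (1.0 if done else 0.0), done

for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection (trial and error).
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r, done = step(s, a)
        # Q-learning update: bootstrap with the best next-state value.
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not done) - Q[s, a])
        s = s2

greedy_policy = np.argmax(Q, axis=1)  # should choose "right" in every non-terminal state
```

The learned values decay geometrically with distance from the goal (Q of "right" approaches 1, 0.9, 0.81, ... moving away from state 4), which is the discounting structure the update rule enforces.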
Meta-Learning
Meta-learning, or "learning to learn," uses previous knowledge and experience to guide the learning of new tasks, equipping the model with the ability to learn across domains.102 The goal of meta-learning is to train a model that can quickly adapt to a new task using only a few data points and training iterations.60 Similar to transfer learning, meta-learning improves the learner's generalization ability in a multi-task setting. Unlike transfer learning, however, meta-learning focuses on the sampling of both data and tasks. Meta-learning models are therefore trained on a large number of tasks, which enables them to learn new tasks from only a few examples. Meta-learning methods can be divided into three categories: recurrent models, metric learning, and learning optimizers.103
Recurrent models are trained by various methods, such as long short-term memory (LSTM)104 and temporal convolution network (TCN),105 to acquire the dataset sequentially and then process new inputs from the task. LSTM104 processes data sequentially and figures out its own learning strategy from scratch. Moreover, TCN105 uses convolution structures to capture long-range temporal patterns, whose framework is simpler and more accurate than LSTM.
Metric learning is a way to calculate the similarity between two targets from different tasks. For a specific task, an input target is assigned to the category with the greatest similarity, judged by a metric distance function.106 It has been widely used for few-shot learning,107 in which the data belong to a large number of categories, some categories are unknown at training time, and the training samples of each category are particularly few.108 These characteristics are consistent with those of meta-learning. Four typical networks have been proposed for metric learning: the Siamese network,109 the prototypical network,103 the matching network,110 and the relation network.111
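A minimal numerical sketch of the metric-learning idea, in the spirit of prototypical networks:103 class prototypes are the means of support examples, and queries are assigned to the nearest prototype. The 2-D Gaussian clusters and the identity "embedding" are our own simplifications; a real prototypical network would embed inputs with a learned network first.

```python
import numpy as np

def prototypes(support_x, support_y):
    """One prototype per class: the mean of that class's support embeddings."""
    classes = np.unique(support_y)
    return classes, np.stack([support_x[support_y == c].mean(axis=0) for c in classes])

def classify(query_x, classes, protos):
    """Assign each query to the class with the nearest prototype."""
    # Squared Euclidean distance from each query to each class prototype.
    d = ((query_x[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[np.argmin(d, axis=1)]

rng = np.random.default_rng(0)
# Few-shot episode: 3 classes, 5 support shots each, drawn around distinct centers.
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
support_x = np.concatenate([c + 0.3 * rng.normal(size=(5, 2)) for c in centers])
support_y = np.repeat([0, 1, 2], 5)
classes, protos = prototypes(support_x, support_y)

queries = centers + 0.3 * rng.normal(size=(3, 2))   # one query near each center
pred = classify(queries, classes, protos)
```

Because classification reduces to a distance computation, entirely new classes can be handled at test time simply by averaging a handful of support examples, which is what makes this family attractive for few-shot settings.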
In learning optimizers, a meta-learner learns how to update the learner so that the learner can learn the task efficiently.112 This approach has been extensively studied to obtain better optimization of neural networks. Combined with RL113 or imitation learning,62 meta-learning is able to learn new policies accurately or adapt to new tasks effectively. Model-agnostic meta-learning (MAML)60 is a representative and popular meta-learning optimization method, which updates via stochastic gradient descent (SGD).114 It adapts quickly to new tasks because no assumptions are made about the form of the model and no extra parameters are introduced for meta-learning. MAML includes base-model learners and a meta-learner. Each base-model learner learns a specific task, and the meta-learner learns the average performance θ of multiple specific tasks as the initialization parameters for a new task.60 As shown in Figure 2, the model is represented by a parameterized function $f_\theta$ with parameters θ. When adapting to a new task $\mathcal{T}_i$ drawn from a distribution over tasks $p(\mathcal{T})$, the model's parameters θ are updated to $\theta_i'$, which is computed by one or more gradient descent updates on task $\mathcal{T}_i$. Moreover, $\mathcal{L}_{\mathcal{T}_i}$ represents the loss function for task $\mathcal{T}_i$, and the step size α is regarded as a hyperparameter. For example, consider one gradient update on task $\mathcal{T}_i$:
$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta) \quad \text{(Equation 6)}$$
Figure 2.
Diagram of the MAML Algorithm, which Optimizes for a Representation θ That Can Quickly Adapt to New Tasks
From Finn et al.60
The model parameters θ are trained by optimizing the performance of the adapted function $f_{\theta_i'}$ with parameters $\theta_i'$, corresponding to the following problem:
$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'}) = \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}\!\left(f_{\theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)}\right) \quad \text{(Equation 7)}$$
When extending MAML to the imitation learning setting, the model's input $x_t$ is the agent's observation sampled at time t, whereas the output $a_t$ is the agent's action taken at time t. A demonstration trajectory can be represented as $\tau := \{x_1, a_1, \ldots, x_T, a_T\}$, using a mean squared error loss as a function of the policy parameters φ as follows:
$$\mathcal{L}_{\mathcal{T}_i}(f_\phi) = \sum_{\tau^{(j)} \sim \mathcal{T}_i} \sum_t \left\lVert f_\phi\left(x_t^{(j)}\right) - a_t^{(j)} \right\rVert_2^2 \quad \text{(Equation 8)}$$
During meta-training, several demonstrations are sampled as training tasks. The demonstrations are used to compute $\theta_i'$ for each task by gradient descent with Equation 6 and to compute the gradient of the meta-objective via Equation 7 with the loss in Equation 8. During meta-testing, only a single demonstration of a new task $\mathcal{T}_i$ is used, updating with SGD; the model is thus updated to acquire a policy for that task.115
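The inner loop (Equation 6) and outer loop (Equation 7) can be sketched end to end on a toy problem of our own (1-D linear regression tasks with task-specific slopes, not the benchmarks used in the cited papers). To keep the sketch dependency-free, the meta-gradient is taken by finite differences rather than by backpropagating through the inner update.

```python
import numpy as np

# Toy MAML: each task is a 1-D regression y = w * x with its own slope w,
# and the model is f_theta(x) = theta * x with a single scalar parameter.
rng = np.random.default_rng(0)
alpha, beta = 0.1, 0.05               # inner step size and meta step size

def task_loss(theta, w, xs):
    return np.mean((theta * xs - w * xs) ** 2)

def inner_update(theta, w, xs):
    # Equation 6: one gradient step on the task loss.
    grad = np.mean(2 * (theta - w) * xs ** 2)
    return theta - alpha * grad

def meta_loss(theta, tasks):
    # Equation 7: loss of the *adapted* parameters, summed over sampled tasks.
    return sum(task_loss(inner_update(theta, w, xs), w, xs) for w, xs in tasks)

theta = 0.0
for _ in range(200):
    # Sample a batch of tasks (slope w and a few data points each).
    tasks = [(rng.uniform(-2, 2), rng.normal(size=20)) for _ in range(5)]
    # Meta-gradient by central finite differences.
    eps = 1e-5
    g = (meta_loss(theta + eps, tasks) - meta_loss(theta - eps, tasks)) / (2 * eps)
    theta -= beta * g

# Meta-testing: one inner step from the meta-learned theta lowers the loss
# on an unseen task.
w_new, xs = 1.5, rng.normal(size=20)
pre_loss = task_loss(theta, w_new, xs)
adapted_loss = task_loss(inner_update(theta, w_new, xs), w_new, xs)
```

In this toy setting the meta-learner drives θ toward the mean task slope, so a single inner step from θ already makes good progress on any new slope, which is the "fast adaptation" behavior MAML is designed to produce.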
The Relationship between Adversarial Learning, RL, and Meta-Learning
RL116 is a framework in which agents learn policies that achieve maximum returns or specific goals through interaction with the environment. Pfau and Vinyals117 discussed the connection between GANs and actor-critic (AC) methods. AC is a kind of RL method that learns the policy and the value function simultaneously. To be specific, the actor network chooses the proper action in a continuous action space, while the critic network implements a single-step update, which improves the learning efficiency.118 Pfau and Vinyals117 argued that GANs can be viewed as an AC approach in an environment where the actor cannot influence the rewards. RL and GANs have been integrated for various tasks, such as real-time point cloud shape completion119 and image synthesis.120
In the field of RL, learning a cost function that explains observed behavior is called inverse reinforcement learning (IRL).121 The policy distribution in IRL can be regarded as the data distribution of the generator in GANs, and the reward in IRL can be regarded as the discriminator in GANs. However, IRL learns a cost function to explain expert behavior but cannot directly tell the learner how to act, which leads to high running costs. Ho and Ermon122 proposed generative adversarial imitation learning (GAIL), which combines GANs with imitation learning, employing GANs to fit the distributions of states and actions that define expert behavior. GAIL significantly improves performance in large-scale and high-dimensional planning problems.123
Introducing meta-learning into RL methods yields meta-RL methods,124 which equip the model to solve new problems more efficiently by utilizing experience from prior tasks. A meta-RL model is trained over a distribution of different but related tasks, and during testing it is able to learn to solve a new task quickly by developing a new RL algorithm.125 Several meta-RL algorithms utilize past experience to achieve good performance on new tasks. For example, MAML60 and Reptile126 are typical methods for updating model parameters and optimizing model weights; MAESN (model-agnostic exploration with structured noise)127 can learn structured action noise from prior experience; and the evolved policy gradient128 defines the policy gradient loss function as a temporal convolution over previous experience. Moreover, when dealing with unlabeled training data, unsupervised meta-RL methods129 effectively acquire accelerated RL procedures without manual task design, such as collecting and labeling data. Therefore, both supervised and unsupervised meta-RL can transfer previous task information to new tasks across domains.
Autonomous Systems Meet Accuracy and Transferability
Computer vision and robot control tasks are critical to autonomous systems. Currently there is a variety of learning-based methods for perception and decision-making tasks. As mentioned at the beginning of the previous section (Preliminaries), most traditional learning-based methods show good accuracy on the same data distribution or task but suffer from poor transferability; specifically, when a well-trained model is applied to different scenarios, its accuracy often decreases sharply. This is due to the obvious domain gap between different datasets. Therefore, domain adaptation between different domains is very important for autonomous systems.
Overview of the Section
In this section, we mainly focus on learning-based methods in the perception and decision-making tasks of autonomous systems, from the perspectives of accuracy or transferability or both, covering image style transfer, image SR, image deblurring/dehazing/rain removal, semantic segmentation, depth estimation, other geometric information (surface normal and optical flow) prediction, pedestrian detection/re-ID/tracking, robot navigation, and robotic manipulation. Although some traditional DL-based methods mainly focus on improving the accuracy of the model, in recent years methods for the above visual tasks have gradually placed more emphasis on transferability, using adversarial learning, RL, and meta-learning. We summarize some typical computer vision tasks and robot control tasks in autonomous systems in Tables 3 and 4, including their training manners, loss functions, learning methods, and experimental platforms. As shown in Table 3, the training manner of some computer vision tasks has gradually changed from supervised to unsupervised, and their loss functions have shifted from accuracy toward transferability between domains. Table 4 indicates that for robot control tasks, informative simulation environments and flexible practice platforms help to transfer information accurately across domains.
Table 3.
Summary of Methods for Computer Visual Tasks in Autonomous Systems
| Year | Reference | Task | Multi-Task | GANs-Based | Supervision^a | Loss^b |
|---|---|---|---|---|---|---|
| 2016 | Gatys et al.130 | style transfer | | | supervised | C |
| 2016 | Johnson et al.131 | style transfer | | | supervised | B |
| 2017 | Li et al.132 | style transfer | | | supervised | C |
| 2017 | Pix2Pix34 | style transfer | | | supervised | A, E |
| 2017 | CycleGAN24 | style transfer | | | unsupervised | A, D |
| 2019 | DLOW133 | style transfer | | | unsupervised | A, D |
| 2019 | INIT134 | style transfer | | | unsupervised | A, C |
| 2014 | SRCNN7 | super-resolution | | | supervised | F |
| 2015 | SRCNN8 | super-resolution | | | supervised | F |
| 2016 | FSRCNN135 | super-resolution | | | supervised | F |
| 2016 | Johnson et al.131 | super-resolution | | | supervised | B |
| 2017 | SRGAN26 | super-resolution | | | supervised | A, F |
| 2017 | EnhanceNet136 | super-resolution | | | supervised | A, B, F |
| 2018 | ZSSR137 | super-resolution | | | unsupervised | E |
| 2018 | ESRGAN138 | super-resolution | | | supervised | A, B, E |
| 2018 | CinCGAN139 | super-resolution | | | unsupervised | A, D, F |
| 2019 | Soh et al.140 | super-resolution | | | supervised | A, C, F |
| 2020 | Gong et al.141 | super-resolution | | | unsupervised | A, D, E |
| 2018 | DeblurGAN27 | image deblurring | | | supervised | A, B |
| 2019 | DeblurGAN-v2142 | image deblurring | | | supervised | A, E, F |
| 2019 | Dr-Net143 | image deblurring | | | supervised | A, E |
| 2018 | Li et al.144 | image dehazing | | | supervised | A, B, E |
| 2018 | Cycle-Dehaze145 | image dehazing | | | unsupervised | A, D |
| 2019 | Kim et al.146 | image dehazing | | | supervised | A, D, E, F |
| 2019 | CDNet147 | image dehazing | | | unsupervised | A, D |
| 2020 | Sharma et al.148 | image dehazing | | | supervised | A, B, E, F |
| 2018 | Qian et al.35 | image rain removal | | | supervised | A, B, F |
| 2019 | Li et al.149 | image rain removal | | | supervised | A, B, F |
| 2019 | ID-CGAN36 | image rain removal | | | supervised | A, B, E |
| 2020 | AI-GAN150 | image rain removal | | | supervised | A, F |
| 2016 | Hoffman et al.71 | semantic segmentation | | | unsupervised | F, G |
| 2017 | SegNet13 | semantic segmentation | | | supervised | F |
| 2017 | Mask R-CNN151 | instance segmentation | | | supervised | F |
| 2017 | CyCADA74 | semantic segmentation | | | unsupervised | A, D, F |
| 2018 | FCAN152 | semantic segmentation | | | unsupervised | A, F |
| 2018 | Hu et al.153 | instance segmentation | | | partially supervised^c | F |
| 2018 | Hong et al.39 | semantic segmentation | | | unsupervised | A, F, G |
| 2019 | CrDoCo154 | semantic segmentation | | | unsupervised | A, C, D, F |
| 2019 | CLAN155 | semantic segmentation | | | unsupervised | A, F |
| 2019 | Li et al.156 | semantic segmentation | | | self-supervised | A, B, C, F |
| 2020 | Erkent et al.157 | semantic segmentation | | | unsupervised | A, F |
| 2014 | Eigen et al.14 | depth estimation | | | supervised | F |
| 2015 | Eigen et al.15 | depth estimation | | | supervised | F |
| 2015 | Liu et al.158 | depth estimation | | | supervised | F |
| 2018 | Atapour-Abarghouei et al.25 | depth estimation | | | supervised | A, C |
| 2019 | ASM159 | depth estimation | | | supervised | F |
| 2019 | CrDoCo154 | depth estimation | | | unsupervised | A, C, D, F |
| 2019 | GASDA160 | depth estimation | | | unsupervised | A, D, F |
| 2020 | ARC161 | depth estimation | | | supervised | A, B, C, D, F |
| 2013 | ConvNet162 | pedestrian detection | | | unsupervised | F |
| 2015 | TA-CNN16 | pedestrian detection | | | supervised | F |
| 2017 | SAF R-CNN163 | pedestrian detection | | | supervised | F |
| 2019 | Kim et al.41 | pedestrian detection | | | unsupervised | A, E, F |
| 2018 | SPGAN42 | person re-ID | | | unsupervised | A, D, F |
| 2018 | CamStyle164 | person re-ID | | | unsupervised | A, D, F |
| 2019 | ATNet165 | person re-ID | | | unsupervised | A, D, F |
| 2017 | ADNet166 | pedestrian tracking | | | supervised | F |
| 2017 | Supancic et al.18 | pedestrian tracking | | | supervised | –^d |
| 2018 | Chen et al.167 | pedestrian tracking | | | supervised | E |
| 2019 | ConvNet-LSTM52 | pedestrian tracking | | | supervised | –^d |
^a For models whose references do not explicitly state whether they are supervised, this review treats models that require paired images as supervised and models that do not require paired images as unsupervised.
^b We classify the loss functions into several classes. “A” represents adversarial (GAN) loss; “B” perceptual loss; “C” reconstruction loss; “D” cycle-consistency loss; “E” pixel-wise loss; “F” specific task loss, such as depth loss and semantic loss; “G” domain transfer loss, such as domain adversarial loss and domain classifier loss.
^c Partially supervised learning problems refer to training on a combination of strong and weak labels.153 According to Schwenker and Trentin,168 partially supervised learning includes active learning, general semi-supervised learning, semi-supervised learning with graphs, partially supervised learning in ensembles, and multiple classifier systems.
^d RL-based methods mainly focus on reward and action instead of a loss function.
Table 4.
Summary of Traditional RL/Meta-Learning Methods for Scenario-Transfer Tasks
| Year | Reference | Task | RL Method | Meta-Learning Method | Simulation Platform | Practice Platform |
|---|---|---|---|---|---|---|
| 2016 | Sadeghi et al.169 | UAV navigation | F | – | 3D CAD environment | Parrot Bebop |
| 2017 | Tai et al.170 | robot navigation | I | – | V-REP | Turtlebot |
| 2017 | Zhang et al.171 | robot navigation | H | – | maze-like 3D environment | Robotino |
| 2017 | Polvara et al.99 | UAV navigation | H | – | Gazebo | Parrot AR Drone 2 |
| 2017 | Zhu et al.61 | robot navigation | K | B | AI2-THOR | SCITOS |
| 2018 | Banino et al.172 | robot navigation | K | A | multi-room 2D environment | none |
| 2018 | Faust et al.22 | robot navigation | I | – | simulated building plans | differential drive robot |
| 2019 | Zhu et al.173 | robot navigation | K | A | SUNCG | Matterport3D |
| 2019 | Niroui et al.174 | robot navigation | K | A | Turtlebot Stage simulator | Turtlebot |
| 2019 | Wortsman et al.175 | robot navigation | – | A, C, E | AI2-THOR | none |
| 2019 | Jabri et al.176 | robot navigation | – | E | ViZDoom | none |
| 2019 | Koch et al.177 | UAV navigation | I, O, N | – | GymFC | none |
| 2020 | Gaudet et al.178 | UAV navigation | – | E | Mars and asteroid landing simulation | none |
| 2015 | Zhang et al.179 | robotic manipulation | H | – | none | Baxter arm |
| 2016 | Levine et al.50 | robotic manipulation | L | – | MuJoCo | PR2 robot |
| 2017 | Gu et al.180 | robotic manipulation | M | – | MuJoCo | 7-DoF arm |
| 2017 | Finn et al.115 | robotic manipulation | – | C, D | MuJoCo | 7-DoF PR2 arm |
| 2018 | Haarnoja et al.181 | robotic manipulation | G | – | MuJoCo | 7-DoF Sawyer arm |
| 2018 | Zhu et al.182 | robotic manipulation | N | A | MuJoCo | Jaco robot arm |
| 2018 | Zeng et al.101 | robotic manipulation | H | – | V-REP | UR5 robot arm, etc. |
| 2018 | Yu et al.183 | robotic manipulation | – | C, D | MuJoCo | 7-DoF PR2 arm, etc. |
| 2019 | Yu et al.184 | robotic manipulation | N, O, J | C | MuJoCo | none |
| 2019 | Zeng et al.185 | robotic manipulation | – | B | none | Amazon Robotics Challenge |
| 2019 | Tsurumine et al.186 | robotic manipulation | P | – | n-DoF simulated manipulator | 15-DoF humanoid robot |
| 2020 | Singh et al.187 | robotic manipulation | – | D | Bullet physics engine | none |
We classify the meta-learning methods into several classes. “A” represents recurrent network; “B” metric network; “C” MAML; “D” meta-imitation learning; “E” meta-RL. Similarly, we classify the RL methods: “F” represents fitted Q-iteration; “G” soft Q-learning; “H” DQN; “I” DDPG; “J” soft actor-critic; “K” A3C; “L” GPS; “M” asynchronous NAF (normalized advantage function);188 “N” PPO (proximal policy optimization);189 “O” TRPO (trust region policy optimization);190 “P” DPP (dynamic policy programming). DoF, degrees of freedom.
Image Style Transfer
Images can be transferred between different styles, which helps the perception and decision-making algorithms of autonomous systems generalize across scenarios. Autonomous systems inevitably face image style transfer problems arising from seasonal conversion,74 varying weather conditions,72 or day-night conversion.73 In particular, transferring training data from night to day, rainy to sunny, or winter to summer is especially challenging and interesting, since most autonomous systems perceive better under good lighting and weather conditions than in harsh environments. The task of image style transfer is to render the content of a source-domain image in the style of the target domain while keeping that content intact.130 Style transfer can also serve as a data augmentation strategy that extends the range of lighting and weather changes, thus further improving the transferability of the model.191 Moreover, using image style transfer to bridge the gap from simulated environments to the real world is very useful for semantic segmentation, robot navigation, and grasping tasks, because training directly in the real world may incur higher experimental costs due to possible hardware damage.191 Traditional methods for style transfer mainly rely on non-parametric techniques that manipulate the pixels of the image (e.g., Efros and Freeman,192 Hertzmann et al.193). Although these traditional methods have achieved good results in style transfer, they are limited to low-level image features, which permit texture transfer but not semantic transfer.130
Traditional DL-Based Style Transfer
Convolutional neural networks (CNNs) have been used for image style transfer, since they have achieved impressive results in numerous visual perception areas. Gatys et al.130 first proposed using CNNs (pre-trained VGG networks) to separate content and style in natural images, and then combined the content of one image with the style of another into a new image to achieve artistic style transfer. This work opened up a new viewpoint for style transfer using DNNs. To reduce the computational burden, Johnson et al.131 proposed using a perceptual loss instead of a per-pixel loss for the image style transfer task. This method achieves results similar to those of Gatys et al.130 while being three orders of magnitude faster. Although Gatys et al.130 used Gram matrices to represent the artistic style of an image, subsequent improvements did not investigate the underlying principles in depth. Li et al.132 first treated neural style transfer as a domain adaptation problem and theoretically showed that matching Gram matrices is equivalent to minimizing a specific maximum mean discrepancy, so the second-order interaction in the Gram matrix is not essential for style transfer. In addition, Chen et al.194 presented stereo neural style transfer, which can be used in emerging technologies such as three-dimensional (3D) movies and virtual reality. This approach seems promising for improving the perception accuracy of autonomous systems because the transferred results contain more stereo information about the scene.
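To make the Gram-matrix style representation concrete, the following is a minimal NumPy sketch (not the authors' implementation; the feature-map shape, normalization, and function names are illustrative assumptions). The Gram matrix captures channel-wise correlations of a CNN feature map, and the style loss of Gatys et al.130 compares the Gram matrices of the generated image and the style image.

```python
import numpy as np

def gram_matrix(features):
    # features: one CNN feature map of shape (C, H, W)
    c, h, w = features.shape
    f = features.reshape(c, h * w)   # flatten the spatial dimensions
    return f @ f.T / (h * w)         # (C, C) channel-correlation matrix

def style_loss(generated, style):
    # squared Frobenius distance between the two Gram matrices
    g, s = gram_matrix(generated), gram_matrix(style)
    return float(np.sum((g - s) ** 2))

# Identical feature maps have identical style statistics.
feat = np.random.rand(8, 16, 16)
print(style_loss(feat, feat))  # 0.0
```

Because the spatial dimensions are summed out, the Gram matrix discards layout and keeps only texture statistics, which is why matching it transfers style rather than content.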
GANs-Based Style Transfer
Traditional CNN-based methods minimize the Euclidean distance between predicted and ground-truth pixels, which may cause blurry results.34 GANs can be used for image style transfer and can produce more realistic images.34 Isola et al.34 used cGANs to transfer image style, and their experiments showed that cGANs (with an L1 loss) not only yield satisfactory results for style transfer but also produce reasonable results for a wide variety of problems such as semantic segmentation and background removal. However, this method requires paired image samples, which are often difficult to obtain in practice. To address this issue, Zhu et al.24 proposed CycleGAN to learn image translation between domains from unpaired examples, as shown in Figure 3. As mentioned in Preliminaries, the CycleGAN framework includes two generators and two discriminators to achieve mutual translation between the source and target domains. The main insight of CycleGAN is to preserve the key attributes between the input and the translated image by using a cycle-consistency loss. At almost the same time, DiscoGAN195 and DualGAN196 adopted similar cycle-consistency ideas to achieve cross-domain image transfer. To improve CycleGAN in terms of semantic alignment at the feature level, Hoffman et al.74 proposed CyCADA, which combines domain adaptation with cycle-consistent adversarial training and uniformly considers feature-level and pixel-level adversarial domain adaptation together with cycle-consistency constraints. CyCADA has achieved satisfactory results on challenging tasks, such as synthetic-to-real and seasonal conversion, which is very important for the generalization of autonomous systems. It was shown that CyCADA has better transferability than the original CycleGAN model.
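The cycle-consistency idea can be sketched in a few lines of NumPy (an illustrative toy, with stand-in "generators"; the L1 form follows the CycleGAN formulation): translating an image to the other domain and back should reconstruct the original in both directions.

```python
import numpy as np

def cycle_consistency_loss(x, y, G, F):
    # G: translator X -> Y, F: translator Y -> X (CycleGAN's two generators)
    forward = np.mean(np.abs(F(G(x)) - x))   # x -> Y -> back to X
    backward = np.mean(np.abs(G(F(y)) - y))  # y -> X -> back to Y
    return float(forward + backward)

# Toy translators that are exact inverses give zero cycle loss.
G = lambda img: img + 1.0   # stand-in for the X -> Y generator
F = lambda img: img - 1.0   # stand-in for the Y -> X generator
x, y = np.zeros((4, 4)), np.ones((4, 4))
print(cycle_consistency_loss(x, y, G, F))  # 0.0
```

In training, this term is added to the two adversarial losses, so the generators are pushed to produce target-domain-looking images while remaining (approximately) invertible.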
Since methods such as CycleGAN and CyCADA can only translate between two domains, a separate model must be trained for each pair of domains when handling multi-domain translation tasks, which limits their wide application. To address this, Choi et al.197 proposed StarGAN to perform image translation among multiple domains using a single generator and discriminator. StarGAN takes both the image and its domain label as input and learns to translate the input image into the corresponding target domain. To further improve adaptive image style transfer, Gong et al.133 proposed the domain flow generation (DLOW) model, which generates a series of intermediate domains to bridge two different domains. This method may be helpful for gradual changes, such as time of day or season, because it can generate a continuous sequence of intermediate samples ranging from the source to the target domain. Recent image translation works have focused on the semantic consistency of images rather than image style and content. Royer et al.198 proposed XGAN, an unsupervised semantic style transfer framework for many-to-many mappings. They used domain adaptation techniques to constrain the shared embedding and proposed a semantic consistency loss, acting on both translation directions, as a form of self-supervision. This method generalizes well even when there is a large domain shift between the two domains. In addition, to capture fine-grained local information, Shen et al.134 proposed an instance-aware image-to-image translation approach, which spatially applies instance and global styles to the target image, as shown in Figure 3. Similarly, image style transfer at the instance level was considered by Ma et al.199 and Mo et al.200
Figure 3.
Generative Adversarial Networks for Image Style Transfer
(A) Results of instance-level day→night translation. Copyright (2019) IEEE. Reprinted, with permission, from Shen et al.134
(B) Results of seasonal conversion. Copyright (2017) IEEE. Reprinted, with permission, from Zhu et al.24
See also Table 3.
As a data augmentation strategy, image style transfer can render a scene under various lighting conditions, various weather conditions, simulated versions of real-world environments, and so forth. It thereby helps autonomous systems perform perception and decision-making tasks as if under better conditions, and it effectively reduces hardware losses in the real-world environment, which is critical for autonomous systems. Although many traditional DL-based models have achieved good style transfer results, with the advent of GANs much research has been built on GANs, and recent developments have focused on instance-level style transfer. We believe that future works should address image style transfer in more complex scenes, such as changing the style of a specified instance without changing the background style in a wild environment. Future research should also improve both the accuracy of style transfer and the speed of the overall process, striving for real-time performance with good accuracy. Beyond the style transfer task, we next consider increasing the resolution of images, i.e., the SR task.
Super-resolution
SR is a challenging visual perception task that generates high-resolution (HR) images from low-resolution (LR) inputs.201 SR is crucial for high-level understanding of the environment in autonomous systems; for example, it is helpful in constructing dense maps. In this subsection, we first discuss recent developments in SR that focus on accuracy, and then summarize developments that consider transferability.
A number of methods are dedicated to improving image quality, such as single-image interpolation202 and image restoration,203 but it is worth pointing out that these differ from SR. Single-image interpolation usually cannot restore high-frequency details.202 Image restoration often uses methods such as image sharpening, whereby the input and output images remain the same size even though the output quality improves.203 SR not only improves the output quality but also increases the number of pixels per unit area, i.e., the size of the image increases.201 In some cases, image SR can be regarded as a form of image enhancement.204 Recently, a large number of SR methods have been proposed, such as interpolation-based205 and reconstruction-based206 methods. Farsiu et al.207 reviewed the advances and challenges of traditional SR methods.
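The distinction above can be seen in a toy NumPy example (purely illustrative, with a naive upsampler of our own naming): interpolation increases the pixel count but adds no new high-frequency information, which is exactly the gap learned SR methods try to close.

```python
import numpy as np

def nearest_upsample(img, scale):
    # duplicate each pixel scale x scale times: more pixels, no new detail
    return np.repeat(np.repeat(img, scale, axis=0), scale, axis=1)

lr = np.array([[1.0, 2.0],
               [3.0, 4.0]])
hr = nearest_upsample(lr, 2)
print(hr.shape)  # (4, 4)
```

The output is four times larger but contains exactly the same information as the input; a learned SR model must instead hallucinate plausible detail from priors learned on training data.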
Traditional DL-Based SR
Several traditional DL-based methods without adversarial learning, mostly CNN-based, have been studied for SR. Dong et al.7 considered using CNNs to handle SR tasks in an end-to-end manner. They presented the SR convolutional neural network (SRCNN), which requires little pre-/post-processing beyond optimization. They also confirmed that DL provides better quality and speed for SR than the sparse coding method208 and the K-SVD-based method,209 although SRCNN only uses information from the luminance channel. Dong et al.8 then extended SRCNN to process three color channels simultaneously to improve the accuracy of SR results. To address the poor real-time performance of SRCNN, Dong et al.135 used a compact hourglass-shaped CNN structure to accelerate it. In fact, most learning-based SR methods use a per-pixel loss between the output image and the ground-truth image.7,8 Johnson et al.131 considered the use of a perceptual loss, which reconstructs details better than the per-pixel loss. Note that the aforementioned SR methods often rely on specific training data; under non-ideal imaging conditions with noise or compression artifacts, they usually fail to provide good SR results. Therefore, Shocher et al.137 proposed “zero-shot” SR (ZSSR), which does not rely on prior training. To the best of our knowledge, ZSSR is the first unsupervised CNN-based SR method, and it achieves reasonable SR results under complex or unknown imaging conditions. However, because very blurry LR images lack the internal patch recurrence that ZSSR exploits, ZSSR is less effective on such inputs. Taking this issue into account, Zhang et al.210 proposed a deep plug-and-play SR framework for LR images with arbitrary blur kernels, which is flexible and effective in dealing with very blurry LR images. Recent trends in SR also include SR for stereo images211 and 3D appearance.212
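The per-pixel loss used by SRCNN-style methods, and the PSNR metric by which SR accuracy is usually reported, can be written down directly; the NumPy sketch below is a schematic illustration (function names and the [0, 1] intensity range are our assumptions), not any particular paper's code.

```python
import numpy as np

def per_pixel_mse(sr, hr):
    # mean squared error over all pixels: the loss SRCNN-style models minimize
    return float(np.mean((sr - hr) ** 2))

def psnr(sr, hr, max_val=1.0):
    # peak signal-to-noise ratio, the standard accuracy metric for SR
    mse = per_pixel_mse(sr, hr)
    return float("inf") if mse == 0 else float(10 * np.log10(max_val ** 2 / mse))
```

Because PSNR is a monotone function of the per-pixel MSE, models trained on this loss score well on PSNR yet can still look over-smoothed, which is the motivation for the perceptual loss of Johnson et al.131
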
GANs-Based SR
In addition to traditional DL-based SR methods, GANs show promising results in SR. Using GANs for SR has the advantage of bringing the generated results closer to the natural image manifold, which may improve the accuracy of the results.26 The representative work on GANs-based SR (SRGAN) was presented by Ledig et al.,26 who combined a content loss with an adversarial loss by training a GAN. This method is capable of reconstructing photo-realistic natural images for an upscaling factor of 4×. Although SRGAN achieves good SR results, it does not consider the local matching of texture statistics, which may limit further improvement of the SR results. Addressing this point, Sajjadi et al.136 focused on creating realistic textures and proposed EnhanceNet, which combines adversarial training, a perceptual loss, and a newly proposed texture transfer loss to achieve HR results with realistic textures. To further improve the accuracy of SRGAN, Wang et al.138 extended it to ESRGAN by introducing residual-in-residual dense blocks and improving the discriminator and the perceptual loss. ESRGAN consistently produces better visual quality and more natural textures than SRGAN,26 as shown in Figure 4.
Figure 4.
Generative Adversarial Networks for SR and Image Deblurring/Dehazing/Rain Removal
(A) Super-resolution results for SRGAN, ESRGAN, and the ground truth. Reprinted, with permission, from Wang et al.138
(B) Image deblurring results. Copyright (2019) IEEE. Reprinted, with permission, from Kupyn et al.142
(C) Image dehazing results. Copyright (2019) IEEE. Reprinted, with permission, from Dudhane and Murala.147
(D) Image rain removal results. Copyright (2018) IEEE. Reprinted, with permission, from Qian et al.35
See also Table 3.
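The SRGAN generator objective discussed above combines a content term with a lightly weighted adversarial term. The NumPy sketch below is schematic (the function and argument names are ours; the 10^-3 weighting follows the SRGAN paper), not the reference implementation.

```python
import numpy as np

def srgan_generator_loss(content_loss, d_fake, adv_weight=1e-3):
    # content_loss: e.g., per-pixel MSE or a VGG feature distance on the SR output
    # d_fake: discriminator probabilities D(G(LR)) for a batch of generated images
    adversarial = float(-np.mean(np.log(d_fake + 1e-12)))  # -log D(G(LR))
    return content_loss + adv_weight * adversarial
```

The small weight keeps the content term dominant: the adversarial term only nudges the output toward the natural image manifold rather than driving the optimization.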
HR images are conducive to improving the accuracy of perception tasks in autonomous systems. However, autonomous systems may encounter more complicated situations, such as when HR datasets are unavailable or the input LR images are noisy and blurry, which means that SR cannot be trained with paired data. Inspired by the cycle consistency of CycleGAN, Yuan et al.139 tackled these issues with a cycle-in-cycle network (CinCGAN), which consists of two CycleGANs. The first CycleGAN maps LR images to a clean LR space, where appropriate denoising/deblurring is applied to the original LR input. A separately well-trained deep model is then stacked on top to up-sample the intermediate results to the desired size, and adversarial learning fine-tunes the network in an end-to-end manner. The second CycleGAN contains the first and realizes the mapping from the original LR to HR. CinCGAN achieves results comparable with those of the supervised method.135 Most SR methods trained on synthetic datasets are not effective in the real world. SRGAN and EnhanceNet increase perceptual quality by enhancing textures, which may produce fake details and unnatural artifacts; Soh et al.140 therefore focused on the naturalness of the results to reconstruct realistic HR images. Considering the transferability of the model further, Gong et al.141 proposed to reduce the domain shift between synthetic and real-world data by aligning feature distributions while performing SR. Specifically, they learn real-world SR from a set of unpaired LR and HR images, achieving satisfactory SR results on both paired and unpaired datasets. It is difficult to directly extend image SR methods to video SR; recent developments include using the same framework to implement both image and video SR,213 and real-time video SR using GANs.214
Image SR increases the resolution of images, which helps to improve the accuracy of perception tasks. Although various SR models focus on improving accuracy, recent works have addressed the transferability of the model, such as the transfer from synthetic datasets to real-world data. Future works may consider combining the SR task with other tasks so that one model can accomplish multiple tasks including SR. Beyond image SR, we next consider image restoration, such as image deblurring/dehazing/rain removal.
Image Deblurring, Image Dehazing, and Image Rain Removal
Autonomous systems often encounter poor weather conditions, such as rain and fog, as well as blurry images caused by poor shooting conditions or fast-moving objects. It is well recognized that the accuracy of computer vision tasks heavily depends on the quality of input images. Hence, it is of great importance to study image deblurring/dehazing/rain removal for autonomous systems, which makes high-level understanding tasks such as semantic segmentation and depth estimation possible in practical applications. Note that although some image deblurring/dehazing/rain removal methods use image enhancement algorithms,215,216,217 these tasks are more relevant to image restoration than to image enhancement. According to Maini and Aggarwal,218 image enhancement aims to improve the viewer's perception of an image in a way that improves the information content and emphasizes image features. Image restoration aims to recover the clean image from its noisy/corrupt counterpart, where the corruption may include motion blur and noise.219 Therefore, image deblurring/dehazing/rain removal can be regarded as image restoration tasks. When adversarial learning, such as GANs, is used for these tasks, it can not only generate realistic images to improve the accuracy of restoration but also improve the transferability of the models by considering the transfer from synthetic datasets to real-world images.
Image Deblurring
Image blur, which heavily affects the understanding of the surroundings, is widely observed in autonomous systems. To tackle image deblurring, several traditional DL-based methods without adversarial learning have been proposed.9,220,221 Given the convincing performance of GANs in preserving image textures and creating realistic images, and inspired by image-to-image translation with GANs, Kupyn et al.27 regarded image deblurring as a special image-to-image translation task. They proposed DeblurGAN, an end-to-end deblurring method based on cGANs. This method considers both accuracy and transferability: DeblurGAN improves deblurring results and is 5-fold faster than the approach of Nah et al.221 on both synthetic and real-world blurry images. Kupyn et al.142 further improved DeblurGAN by adding a feature pyramid network to the generator and adopting a double-scale discriminator, yielding DeblurGAN-v2. DeblurGAN-v2 achieves better accuracy than DeblurGAN while being 10- to 100-fold faster than competitors, which makes it applicable to real-time video deblurring, as shown in Figure 4. Recently, Aljadaany et al.143 presented Dr-Net, which combines Douglas-Rachford iterations and Wasserstein GAN222 to solve image deblurring without knowing the specific blur kernel. In addition, Lu et al.223 extracted content and blur features separately from blurred images to encode the blur features accurately into the deblurring framework, and utilized a cycle-consistency loss to preserve the content structure of the original images. Considering that stereo cameras are commonly used on unmanned aerial vehicles, Zhou et al.224 focused on the deblurring of stereo images.
Image Dehazing
Haze is a typical weather phenomenon with poor visibility, which forms a major obstacle for computer vision applications. Image dehazing aims to recover the clear scene reflection, atmospheric light color, and transmission map from input images.145 In recent years, a series of learning-based image dehazing methods have been proposed.10,225,226 Although these methods do not require prior information, their dependence on intermediate parameters and models may severely degrade the quality of dehazed images. To reduce the effects of intermediate parameters and to establish image dehazing methods with good transferability, a series of GANs-based methods have been proposed. Li et al.144 tackled image dehazing based on cGAN. Unlike the basic cGAN, the generator in this method adopts an encoder-decoder architecture, which helps it capture more useful features to generate realistic results. The addition of the cGAN enables the method of Li et al.144 to achieve ideal results on both synthetic datasets and real-world hazy images. Considering transferability across different scenarios and datasets as well as independence from paired images, Engin et al.145 proposed the Cycle-Dehaze network built on CycleGAN. This approach adds a cyclic perceptual-consistency loss alongside the cycle-consistency loss, thereby achieving image dehazing across datasets with unpaired images. Similar bidirectional GANs for dehazing have also been studied by Kim et al.146 It is difficult for the Cycle-Dehaze network to reconstruct real scene information without color distortion. Therefore, Dudhane and Murala147 proposed the cycle-consistent generative adversarial network (CDNet), which utilizes the optical model to infer the haze distribution from depth information. CDNet ensures that the haze-free scene is recovered without color distortion. The image dehazing results of Cycle-Dehaze and CDNet are shown in Figure 4.
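The optical model commonly used in such depth-aware dehazing work is the atmospheric scattering model I = J·t + A·(1 − t), with transmission t = exp(−β·d) for scene depth d. The NumPy sketch below is an idealized illustration (parameter names and default values are ours, and real methods must estimate A, β, and the depth rather than receive them): it shows how depth links haze density to the clean scene, and how the model inverts when its parameters are known.

```python
import numpy as np

def add_haze(clean, depth, airlight=1.0, beta=0.5):
    # atmospheric scattering model: I = J * t + A * (1 - t)
    t = np.exp(-beta * depth)          # transmission decays with depth
    return clean * t + airlight * (1.0 - t)

def dehaze(hazy, depth, airlight=1.0, beta=0.5):
    # invert the model (idealized: airlight, beta, and depth are known)
    t = np.exp(-beta * depth)
    return (hazy - airlight * (1.0 - t)) / t
```

Deeper pixels are pulled harder toward the airlight color, which is why distant objects wash out first and why depth cues are valuable for dehazing.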
Most image dehazing methods only consider objects at a single scale-space, which makes dehazed images suffer from blurriness and halo artifacts. Sharma et al.148 considered improving both the accuracy and transferability of image dehazing, and presented an approach that removes haze based on the per-pixel difference between the Laplacians of Gaussians of hazy images and the original haze-free images in a scale-space. The model showed compelling results from simulated datasets to real-world images, and from indoor to outdoor scenes. Recent developments in image dehazing also target different channels, such as the color channel227 and the dark channel,228 as well as multi-scale networks.229
Image Rain Removal
Image rain removal is a challenging task because the size, number, and shape of raindrops are usually uncertain and difficult to learn. A number of methods have been proposed for image rain removal, but most require stereo image pairs,230 image sequences,231 or motion-based images.232 Eigen et al.11 proposed a single-image rain removal method, but it is limited to relatively sparse and small raindrops.
To improve the accuracy of image rain removal, and given the outstanding performance of GANs on image inpainting and completion problems, a series of GANs-based methods have been developed for image rain removal. Qian et al.35 tackled heavy raindrop removal from a single image using an attentive GAN, which injects an attention map into both the generator and the discriminator. The generator produces the attention map through an attention-recurrent network and, together with the input image, generates a raindrop-free image; the discriminator evaluates the validity of the generation both globally and locally. The rain removal results of Eigen et al.11 and Qian et al.35 are shown in Figure 4. Nevertheless, this method is not suitable for torrential rain and is limited to raindrop removal. Heavy rain, strongly visible streaks, and dense rain accumulation make the scene far less visible. Li et al.149 considered the heavy-rain situation and introduced an integrated two-stage CNN that removes rain streaks and rain accumulation simultaneously. In the first, physics-based stage, a streak-aware decomposition module decomposes entangled rain streaks and rain accumulation to extract joint features. The second, refinement stage utilizes a cGAN that takes the reconstructed map from the previous stage as input and generates the final clean image. This method considers the transferability between synthetic datasets and real-world images, and has achieved convincing results in both synthetic and real heavy-rain scenarios. To improve the stability of GANs and reduce artifacts introduced by GANs in output images, Zhang et al.36 proposed an image de-raining conditional generative adversarial network (ID-CGAN), which uses a multi-scale discriminator to leverage features from different scales to determine whether the de-rained image comes from real or generated data.
ID-CGAN has obtained satisfactory rain removal results on both synthetic datasets and real-world images. Jin et al.150 observed that existing methods may over-smooth de-rained images and addressed the problem from the perspective of feature disentanglement. They introduced an asynchronous interactive GAN (AI-GAN), which not only achieves good rain removal results but also has strong generalization capabilities, making it useful for image/video encoding, action recognition, and person re-ID.
Image deblurring/dehazing/rain removal tasks help to extract more useful information from bad-weather scenes, which helps autonomous systems perceive the scene better. We focus on introducing GANs-based models, which improve the accuracy or transferability, or both, of these tasks. Future works may treat image deblurring/dehazing/rain removal as a preprocessing stage for perception and then integrate a deeper model to achieve scene perception. After introducing tasks including image style transfer, image SR, and image deblurring/dehazing/rain removal, we now consider high-level perception tasks such as semantic segmentation.
Semantic Segmentation
In emerging autonomous systems, such as autonomous driving and indoor navigation, scene understanding is achieved by means of semantic segmentation. Semantic segmentation is a pixel-level prediction method that classifies each pixel into different categories corresponding to their labels, such as airplanes, cars, traffic signs, or even backgrounds.233 In addition, instance segmentation combines semantic segmentation and object detection to further distinguish individual object instances in the scene.151 Some traditional DL-based methods without adversarial learning have been proposed and have achieved good accuracy in semantic segmentation13,152 and instance segmentation.151,153 In practice, such pixel-level semantic annotations are usually expensive to obtain. Because the semantic labels of synthetic datasets are easy to obtain, it is helpful to train semantic segmentation on labeled synthetic datasets and then transfer the results to real-world applications. Due to the domain shift between synthetic datasets and real-world images, it is worth exploring how to transfer a model trained on synthetic datasets to real-world images. With this in mind, adversarial learning is used to implement domain adaptation and improve the transferability of the model. As with the other computer vision tasks in this review, the trend is now moving from improving accuracy to enhancing transferability. In this subsection, we review semantic segmentation and instance segmentation tasks, focusing on accuracy or transferability, or both.
Traditional DL-Based Semantic Segmentation
Traditional DL-based semantic segmentation algorithms are mainly based on end-to-end convolutional network frameworks. To the best of our knowledge, Long et al.12 were the first to train an end-to-end fully convolutional network (FCN) for semantic segmentation. The main insight is to replace fully connected layers with fully convolutional layers that output spatial maps. In addition, they defined a skip architecture to enhance the segmentation results. More importantly, the framework accepts input images of arbitrary size and produces correspondingly sized output. This work is well recognized as a milestone for semantic segmentation using DL. However, because the encoder network of this method has a large number of trainable parameters, the overall size of the network is large, which makes FCN difficult to train. Badrinarayanan et al.13 proposed SegNet, which has significantly fewer trainable parameters and can be trained end to end using SGD. A key property of SegNet is that the decoder performs non-linear upsampling using the pooling indices computed in the max-pooling step of the corresponding encoder, which eliminates the need to learn upsampling. Building on encoder-decoder networks such as SegNet, the DeepLab family uses multi-scale contextual information to enrich semantic information; for example, DeepLabv3+234 combines a spatial pyramid pooling module with an encoder-decoder structure for semantic segmentation. In addition, depthwise separable convolution is applied to both the atrous spatial pyramid pooling and the decoder module to make the encoder-decoder network faster and stronger.
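SegNet's index-based upsampling can be illustrated in a few lines: the encoder's max pooling records where each maximum came from, and the decoder places values back at those positions, so no upsampling weights are learned. This is a minimal numpy sketch of that mechanism (2x2 windows, single channel); real implementations operate on batched tensors.

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pooling that records argmax positions (SegNet-style sketch)."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2))
    idx = np.zeros((h // 2, w // 2), dtype=int)
    for i in range(h // 2):
        for j in range(w // 2):
            patch = x[2*i:2*i+2, 2*j:2*j+2]
            k = patch.argmax()                    # position within the 2x2 window
            pooled[i, j] = patch.flat[k]
            idx[i, j] = (2*i + k // 2) * w + (2*j + k % 2)  # flat index in x
    return pooled, idx

def max_unpool(pooled, idx, shape):
    """Decoder unpooling: place each value back at its recorded index,
    leaving all other positions zero -- nothing to learn."""
    out = np.zeros(shape)
    out.flat[idx.ravel()] = pooled.ravel()
    return out

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 2., 2.],
              [3., 0., 1., 0.]])
p, idx = max_pool_with_indices(x)
y = max_unpool(p, idx, x.shape)   # sparse map with maxima at original positions
```

In a full SegNet decoder, this sparse map is then densified by trainable convolutions.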
The accuracy of unsupervised semantic segmentation is usually worse than that of supervised methods, while supervised semantic segmentation often requires a lot of manual labeling, which is very costly. Note that a synthetic dataset built with computer simulation, such as Grand Theft Auto,235 can automatically provide a large number of semantic labels, which is very important for improving the accuracy of a semantic segmentation model. However, due to the domain shift between the synthetic dataset and the real-world scene, it is necessary to consider domain adaptation in the semantic segmentation task. To address the domain gap problem and improve the transferability of the model, Hoffman et al.71 proposed a domain adaptation framework with FCN for semantic segmentation, as shown in Figure 5. This method aligns both global and local features through specific adaptation techniques. It makes full use of the label information of the synthetic dataset and successfully transfers the results from a synthetic dataset to the real scene, achieving satisfactory semantic segmentation results in practical applications. A similar combination of FCN with domain adaptation was presented by Zhang et al.,152 whose fully convolutional adaptation networks also successfully explored domain adaptation for semantic segmentation. The model combines appearance adaptation networks and representation adaptation networks to synthesize images for domain adaptation at both the visual appearance level and the representation level. Recent developments also involve 3D semantic segmentation236,237 and 3D instance segmentation.238
Figure 5.
Generative Adversarial Networks for Semantic Segmentation and Multi-Task
(A) CycleGAN for semantic segmentation. Copyright (2017) IEEE. Reprinted, with permission, from Zhu et al.24
(B) Qualitative results on adaptation from cities in SYNTHIA fall to cities in SYNTHIA winter. From Hoffman et al.71
(C) Multi-task results including semantic segmentation (top row), depth prediction (middle row), and optical flow estimation (bottom row). Copyright (2019) IEEE. Reprinted, with permission, from Chen et al.154
See also Table 3.
Traditional DL-Based Instance Segmentation
The more challenging task is instance segmentation, which combines both object detection and semantic segmentation.151 Li et al.239 first proposed an end-to-end fully convolutional method for instance-aware semantic segmentation. However, the method produced spurious edges on overlapping instances. He et al.151 proposed Mask R-CNN, which is a classic instance segmentation algorithm. Mask R-CNN is easy to train and generalizes well to other tasks, achieving breakthrough results in instance segmentation, bounding-box object detection, and person keypoint detection. This method includes two stages. The first stage proposes candidate object bounding boxes. The second stage predicts the class and box offset in parallel and outputs a binary mask for each region of interest. Mask R-CNN implements instance segmentation in a supervised manner, which requires very expensive pixel-level mask labels. In view of this, Hu et al.153 proposed a solution to large-scale instance segmentation by developing a partially supervised learning paradigm, in which only a small portion of the training data has instance masks while the rest has only box annotations. This method has demonstrated exciting new research directions in large-scale instance segmentation.
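A basic building block behind evaluating methods such as Mask R-CNN is mask overlap: predicted instance masks are matched to ground-truth masks by intersection-over-union before computing metrics like mask AP. A minimal sketch of mask IoU (the function name and toy masks are illustrative):

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean instance masks, the core
    overlap measure behind instance-segmentation metrics such as mask AP."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

pred = np.zeros((4, 4), bool); pred[1:3, 1:3] = True   # predicted instance
gt = np.zeros((4, 4), bool);   gt[1:4, 1:4] = True     # ground-truth instance
iou = mask_iou(pred, gt)   # 4 overlapping pixels out of 9 in the union
```

A prediction is typically counted as correct when this IoU exceeds a threshold such as 0.5.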
GANs-Based Semantic Segmentation
GANs are flexible enough to reduce the differences between the segmentation result and the ground truth, and can further improve the accuracy of semantic segmentation results without manual labeling in some cases.240 Regarding the use of GANs for semantic segmentation, the typical methods are Pix2Pix34 and CycleGAN.24 The semantic segmentation result of CycleGAN is shown in Figure 5. There are several variants based on Pix2Pix and CycleGAN.74,241,242 These methods not only achieve satisfactory results in image style transfer but also work well in semantic segmentation. Most subsequent adversarial domain-adaptive semantic segmentation methods improve on CycleGAN and Pix2Pix in training stability and transferability by refining loss functions or network layers. Hong et al.39 proposed a cGAN-based method for semantic segmentation, integrating a cGAN into the FCN framework to reduce the gap between source and target domains. In practical tasks, objects often appear occluded, which brings great challenges to the perception tasks of autonomous systems. To solve this problem, Ehsani et al.38 proposed SeGAN, which jointly generates the appearance and segmentation mask for the invisible and visible regions of objects. Luo et al.155 further considered a joint distribution at the category level, different from global alignment strategies such as CycleGAN; they proposed a category-level adversarial network to enhance local semantic consistency under global feature alignment. Note that traditional semantic segmentation methods may suffer from unsatisfactory image-to-image conversion quality: once the conversion fails, nothing can be done to obtain satisfactory results in the subsequent segmentation stage.
Li et al.156 tackled this problem by introducing a bidirectional learning framework with self-supervised learning, in which the translation and segmentation adaptation models promote each other in a closed loop. The segmentation adaptation model was trained on both synthetic and real-world datasets, which improved segmentation performance on real-world data. In addition, Erkent and Laugier157 considered a semantic segmentation method that adapts to different weather conditions and achieves satisfactory accuracy without the need to label the weather conditions of the source or target domain.
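The cycle-consistency idea at the heart of CycleGAN and its segmentation-oriented variants can be shown numerically: mapping a sample to the other domain and back should reproduce the original, and the L1 discrepancy of the round trip is penalized. In this toy sketch the two "generators" are simple linear maps (real ones are CNNs); everything here is an illustrative assumption.

```python
import numpy as np

# Toy "generators" between domains: G maps source -> target, F target -> source.
# They are exact inverses here, so the cycle loss is (numerically) zero.
G = lambda x: 2.0 * x + 1.0        # source -> target
F = lambda y: 0.5 * (y - 1.0)      # target -> source

def cycle_loss(x, y):
    """L1 cycle-consistency: x -> G(x) -> F(G(x)) should return to x,
    and y -> F(y) -> G(F(y)) should return to y."""
    return np.mean(np.abs(F(G(x)) - x)) + np.mean(np.abs(G(F(y)) - y))

x = np.array([0.1, 0.4, 0.9])      # unpaired source-domain samples
y = np.array([1.2, 2.0])           # unpaired target-domain samples
loss = cycle_loss(x, y)
```

Because no paired supervision enters this loss, it is what allows training on unpaired synthetic and real images.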
Semantic segmentation, as a high-level perception task of autonomous systems, predicts the semantic information of each pixel with a specific class label. Because collecting labeled real-world datasets for early supervised algorithms is expensive, many algorithms consider the transferability between synthetic datasets and real-world data. Recent developments include GANs-based semantic segmentation in more complex environments, such as bad weather conditions. Meanwhile, we consider instance segmentation based on GANs to be an open question. In addition to semantic segmentation, depth estimation is another high-level perception task of autonomous systems, in which it is very challenging to estimate the depth value of each pixel in an image.
Depth Estimation
Depth estimation is an important task that helps autonomous systems understand the 3D geometry of environments at a high level. A series of classical and learning-based methods have been proposed to estimate depth based on motion243 or stereo images,244 which are computationally expensive. As is widely known, due to the lack of complete 3D scene information, estimating depth from a single image is an ill-posed task.158 For the monocular depth estimation task, a series of traditional DL-based algorithms without adversarial learning have been proposed to improve the accuracy of the model. However, considering that it is expensive to collect well-annotated datasets for depth estimation tasks, it is appealing to use adversarial learning methods, such as GANs, to achieve domain adaptation from synthetic datasets to real-world images. Such adaptive methods improve the transferability of the model, so that a model trained on synthetic datasets can be transferred well to real-world images. Here, we introduce traditional DL-based depth estimation frameworks as well as methods that improve the transferability of depth estimation models by introducing adversarial learning.
Traditional DL-Based Depth Estimation
Traditional DL-based depth estimation methods mainly focus on improving the accuracy of the results by using deep convolutional frameworks. Eigen et al.14 first proposed using a neural network to estimate depth from a single image in an end-to-end manner, pioneeringly showing that neural networks are promising for single-image depth estimation. This framework consists of two components: the first roughly estimates the global depth structure, and the second refines this global prediction using local information. Considering the continuous property of monocular depth values, depth estimation can be transformed into a learning problem over a continuous conditional random field (CRF). Liu et al.158 presented a deep convolutional neural field model for single monocular depth estimation, which combined a deep CNN with a continuous CRF. This method achieved good results on both indoor and outdoor datasets. To reduce the dependence on supervised signals and improve the transferability between different domains, unsupervised domain adaptation methods were presented for depth estimation by Nath Kundu et al.245 Other developments considering optical flow, camera pose, and intrinsic parameters from monocular video for depth estimation can be found in Chen et al.246 By considering the intrinsic parameters of the camera, similar to Gordon et al.,247 accurate depth information can be extracted from any video.
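A widely used training objective in this line of work is the scale-invariant log loss introduced by Eigen et al., which discounts errors that amount to a global rescaling of the depth map. A minimal numpy sketch (with the commonly used weighting lambda = 0.5; the toy depth values are illustrative):

```python
import numpy as np

def scale_invariant_loss(pred, gt, lam=0.5):
    """Scale-invariant log loss of Eigen et al.:
    (1/n) * sum(d_i^2) - (lam/n^2) * (sum(d_i))^2, with d = log(pred) - log(gt).
    With lam = 1 a global scale error is forgiven entirely; lam = 0.5
    penalizes it only partially."""
    d = np.log(pred) - np.log(gt)
    n = d.size
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / n ** 2

gt = np.array([1.0, 2.0, 4.0])
loss_exact = scale_invariant_loss(gt, gt)            # perfect prediction
loss_scaled = scale_invariant_loss(2.0 * gt, gt)     # off by a factor of 2
loss_scaled_forgiven = scale_invariant_loss(2.0 * gt, gt, lam=1.0)
```

Here `loss_scaled` equals 0.5 * (log 2)^2, i.e. half of what a plain log-RMSE would charge for the constant-factor error, while with `lam=1.0` the scale error costs nothing.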
GANs-Based Depth Estimation
For the depth estimation task, it is too expensive to collect well-annotated image datasets. An appealing alternative is unsupervised domain adaptation via GANs, transferring from synthetic datasets to real-world images. Atapour-Abarghouei et al.25 took advantage of adversarial domain adaptation to train a depth estimation model in a synthetic city environment and transferred it to real scenes. The framework consists of two stages. In the first stage, a depth estimation model is trained on a dataset captured in the virtual environment. In the second stage, the proposed method transfers synthetic-style images into real-world ones to reduce the domain discrepancy. Although this method considers the transfer from a synthetic city environment to real-world scenes, it ignores the specific geometric structure of the images in the target domain, which is important for improving the accuracy of depth estimation. Motivated by this problem, Zhao et al.160 proposed a geometry-aware symmetric domain adaptation network (GASDA), which produces high-quality results for both image style transfer and depth estimation. GASDA is based on CycleGAN,24 performing synthetic-to-realistic and realistic-to-synthetic translations simultaneously with a geometric consistency loss on real stereo images. Zhao et al.161 further considered high-level domain transformation, i.e., mixing a large number of synthetic images with a small number of real-world images. They proposed the attend-remove-complete (ARC) method, which learns to attend to, remove, and complete challenging regions. The ARC method ultimately makes good use of synthetic data to generate accurate depth estimates.
Depth Estimation via Joint Tasks Learning
Each pixel in an image usually carries a surface normal orientation vector and a semantic label, and surface normal prediction, semantic segmentation, and depth estimation are all related to the geometry of objects, which makes it possible to train these different structured prediction tasks in a consistent manner. To the best of our knowledge, there are several works that apply a single model to multiple related tasks. Note that for different tasks, the model should be fine-tuned15,248 or use different loss functions.154,159 Applying a single model to multiple related tasks through fine-tuning or different loss functions shows that the model has good transferability. Eigen et al.15 developed a more general network for depth estimation and applied it to other computer vision tasks, such as surface normal estimation and per-pixel semantic labeling. It is worth noting that Eigen et al. used a single framework for depth estimation, surface normal estimation, and semantic segmentation with only fine-tuning, improving their previous network14 by considering a third scale at a higher resolution. Considering that GANs perform well in structured prediction spaces, Hwang et al.159 proposed adversarial structure matching (ASM), which trains a structured prediction network through an adversarial process. This method achieved ideal results on monocular depth estimation, semantic segmentation, and surface normal prediction. Although the ASM model has good transferability across multiple tasks through different loss functions, its limitation is that each specific task requires its own specified dataset, and the model cannot be generalized to other datasets. To overcome this limitation, Chen et al.154 embedded pixel-level domain adaptation into the depth estimation task. Specifically, they proposed CrDoCo, a pixel-level adversarial domain-adaptive algorithm for dense prediction tasks.
The core idea of this method is that although the image styles of the two domains may differ during the domain-transfer process, the task predictions (e.g., depth estimation) should be exactly the same. Since CrDoCo is a pixel-level framework for dense prediction, it can be applied to semantic segmentation, depth prediction, and optical flow estimation, as shown in Figure 5. CrDoCo adapts to multiple tasks simply by adjusting its loss function, and it also shows good transferability between different datasets for a specific task.
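The cross-domain consistency idea can be made concrete with a small sketch: the per-pixel task predictions for an image and for its style-translated counterpart are compared with a divergence, and disagreement is penalized. The symmetric KL used below is one reasonable choice of divergence for illustration; the exact term and all names here are assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def consistency_loss(logits_src_style, logits_tgt_style, eps=1e-8):
    """Cross-domain consistency sketch: the task prediction for an image and
    for its style-translated counterpart should match; here measured with a
    symmetric KL divergence between per-pixel class distributions."""
    p = softmax(logits_src_style)
    q = softmax(logits_tgt_style)
    kl_pq = (p * np.log((p + eps) / (q + eps))).sum(-1)
    kl_qp = (q * np.log((q + eps) / (p + eps))).sum(-1)
    return (kl_pq + kl_qp).mean()

# per-pixel class logits for 3 pixels and 4 classes, under the two styles
a = np.array([[2., 0., 0., 0.], [0., 3., 0., 0.], [0., 0., 1., 0.]])
same = consistency_loss(a, a)          # identical predictions -> zero penalty
diff = consistency_loss(a, a[::-1])    # disagreeing predictions -> positive
```

Minimizing such a term pushes the task network to produce the same dense prediction regardless of which domain style the input carries.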
Depth estimation helps autonomous systems understand the 3D structure of the surrounding scene. The transferability of depth estimation includes not only the transfer of synthetic to real-world data but also the transfer of indoor to outdoor environments. Since depth, surface normals, and semantic labels are all related to object geometric information, recent works have also considered improving the accuracy of depth estimation by utilizing the interconnection between different tasks. We believe that future works should include the consideration of depth estimation under poor weather and light conditions. Following the autonomous systems perception tasks reviewed above, we now introduce pedestrian detection, re-ID, and tracking tasks involved in autonomous systems.
Pedestrian Detection, Re-identification, and Tracking
Pedestrian detection, re-ID, and tracking are very important for autonomous systems, especially for autonomous driving. Related works on pedestrian detection have mainly focused on improving the accuracy of the results, with various developments addressing challenging visual conditions such as nighttime and occlusion. Person re-ID, which is more complicated than pedestrian detection, requires matching pedestrians across disjoint camera views. Traditional learning-based methods of person re-ID mainly focused on improving the accuracy of results, while recent GANs-based algorithms have focused on transferability between domains. More complicated still, some developments consider locating targets over a time sequence, i.e., video tracking. The RL-based pedestrian tracking methods focus on not only accuracy but also transferability of the algorithms. In this subsection, we review pedestrian detection, re-ID, and tracking tasks, focusing on accuracy or transferability, or both.
Pedestrian Detection
In recent years, pedestrian detection has been widely taken into account in autonomous systems, especially for autonomous driving and robot movement.249,250 Pedestrian detection methods are generally divided into two categories: models based on hand-crafted features and deep models.16 Various models based on hand-crafted features have been proposed in the past few decades.251, 252, 253 Although these models have made good progress, they fail to extract semantic information. Sermanet et al.162 used sparse convolutional feature hierarchies for pedestrian detection, in a network named ConvNet. The network first performs layer-wise training on the whole multi-stage system and then uses the labeled data to fine-tune the complete architecture for the detection task. Although ConvNet learns features from training data, it treats pedestrian detection as a single binary classification task, which may confuse positive and negative samples. Therefore, Tian et al.16 proposed a task-assistant CNN (TA-CNN), which can learn features from multiple tasks and multiple datasets. TA-CNN combines semantic tasks, including pedestrian attributes and scene attributes, to optimize pedestrian detection results. To further improve the accuracy of pedestrian detection in natural scenes, Li et al.163 observed that the large variance in pedestrian scale may produce dramatically different features at different spatial scales. They therefore developed a scale-aware fast R-CNN (SAF R-CNN) framework, which combines a large-size subnetwork and a small-size subnetwork and uses a scale-aware weighting mechanism to deal with pedestrians of various sizes in scenes.163 Although SAF R-CNN can detect pedestrian instances of different scales, it does not consider factors such as illumination conditions.
To solve the problem of pedestrian detection under challenging illumination conditions at nighttime, Kim et al.41 used adversarial learning for cross-spectral pedestrian detection in an unpaired setting. This method uses adversarial learning to make the color and thermal features of salient regions containing pedestrians similar, thereby improving the accuracy of pedestrian detection at nighttime. Recent developments in pedestrian detection include tiny-scale pedestrian detection254 and occluded pedestrian detection.255
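Detectors in the R-CNN family discussed above all share one post-processing step: greedy non-maximum suppression (NMS), which keeps the highest-scoring box and removes overlapping duplicates. A minimal sketch (box format and threshold are the usual conventions; the toy boxes are illustrative):

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box against an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    discard boxes overlapping it above the threshold, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        order = order[1:][iou(boxes[i], boxes[order[1:]]) < thresh]
    return keep

# two near-duplicate detections of one pedestrian plus a distinct one
boxes = np.array([[0, 0, 10, 20], [1, 1, 11, 21], [30, 30, 40, 50]], float)
kept = nms(boxes, np.array([0.9, 0.8, 0.7]))   # duplicate box 1 is suppressed
```

Occlusion-aware detectors often replace or soften this step, since heavily overlapping pedestrians are exactly the case greedy NMS suppresses by mistake.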
Person Re-identification
A similar but more difficult task than pedestrian detection, person re-ID requires matching pedestrians across disjoint camera views. At present, there are several learning-based methods focusing on person re-ID.17,256,257 However, these methods have poor transferability, i.e., person re-ID models trained on one domain usually fail to generalize well to another. Considering that CycleGAN shows impressive transferability using unpaired images, Deng et al.42 introduced the similarity preserving cycle-consistent generative adversarial network (SPGAN), an unsupervised domain adaptation approach that generates samples that not only have the target domain style but also preserve the underlying ID information. This method showed that applying domain adaptation to person re-ID can achieve competitive accuracy. Taking into account data augmentation across different cameras, Zhong et al.164 introduced camera style (CamStyle) adaptation. CamStyle smoothes disparities in camera styles by transferring labeled training image styles to each camera to augment the training set, which helps to learn pedestrian descriptors with camera-invariant properties and improves re-ID accuracy. The above approaches, such as SPGAN42 and CamStyle,164 treated the domain gap as a black box and attempted to bridge it with a single style transformer. Liu et al.165 proposed a novel adaptive transfer network (ATNet), which investigates the root causes of the domain gap. ATNet realizes domain transfer for person re-ID by decomposing complicated cross-domain transfers and transferring features through separate sub-GANs. Recently, a theory-based analysis by Song et al.258 bridged the theoretical gap between unsupervised domain adaptation and re-ID tasks. Recent developments in person re-ID also involve considering occluded parts259 and different visual factors such as viewpoint, pose, illumination, and background.260
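At inference time, re-ID ultimately reduces to ranking gallery descriptors from one camera by similarity to a query descriptor from another; the learned, ideally camera-invariant, embedding does the heavy lifting. A toy sketch of this matching step with cosine similarity (the 3-D descriptors and function name are purely illustrative):

```python
import numpy as np

def match_across_cameras(query_feats, gallery_feats):
    """Toy re-ID matching: rank gallery descriptors from a second camera by
    cosine similarity to each query descriptor. In practice the descriptors
    come from a trained embedding network."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sim = q @ g.T                      # cosine similarity matrix
    return sim.argmax(axis=1)          # best gallery match per query

# hypothetical descriptors; persons 0 and 1 are seen by both cameras
query = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.1]])
gallery = np.array([[0.0, 0.9, 0.2], [0.9, 0.2, 0.0]])
matches = match_across_cameras(query, gallery)   # query 0 -> gallery 1, etc.
```

Style-transfer approaches such as SPGAN and CamStyle aim precisely at making these descriptors stable across the camera-induced appearance shift.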
Pedestrian Tracking
Video tracking goes beyond person re-ID in that it needs to locate the target over a sequence of time, which is difficult because of occlusions and other tracking obstacles. When it comes to accuracy and transferability, RL-based methods in pedestrian tracking are concerned with whether the action in each frame is discrete or continuous and whether the labeled bounding boxes at each frame are limited or not. Tracking pedestrians by searching a series of discrete actions in each frame is one solution. Yun et al.166 proposed the action-decision network (ADNet) to generate actions that find the location and size of the target object in a new frame. ADNet is updated by performing tracking simulation on the training sequences and utilizing action dynamics with the help of RL. After pre-training ADNet with supervised learning, online adaptation is applied to the network to accommodate appearance changes or deformation of the target during tracking test sequences. Therefore, the pre-trained ADNet features can be transferred to a new frame during online adaptation. When the bounding boxes at each frame are limited, the algorithm can still be trained successfully and transferred to new frames with the help of larger training datasets. Supancic and Ramanan18 used RL to train trackers with more limited supervision on far more massive datasets. The results illustrated that the algorithm can track pedestrians in a never-before-seen video, and the video can be used both for evaluating the current tracker and for training the tracker for future use. In brief, the learning structure is informative, and the features contained in the video can be transferred to another training process. Furthermore, to improve training efficiency and accuracy, Chen et al.167 introduced a real-time AC framework that exploits a continuous action space for visual tracking.
For online tracking, the “actor” model provides an offline dynamic search strategy to locate the target object in each frame efficiently by only one action output, and the “critic” model acts as a verification module to make the tracker more robust. The real-time performance of the trackers is better than that of state-of-the-art methods such as MDNet261 and ADNet.166 Similar to Chen et al.,167 Luo et al.52 used the same RL method to deal with continuous tracking problems. Moreover, they introduced an environment augmentation technique, i.e., virtual environments named ViZDoom,262 to boost the tracker’s generalization ability.
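The discrete-action formulation behind ADNet-style trackers is easy to sketch: a learned policy repeatedly picks one of a small set of actions that nudge the previous frame's bounding box until a stop action ends the sequence. The action set, step fractions, and box format below are simplified assumptions for illustration, not the exact design of ADNet.

```python
def apply_action(box, action, move_frac=0.03, scale_frac=0.05):
    """ADNet-style discrete tracking step (simplified sketch): each action
    nudges the bounding box (x, y, w, h); 'stop' keeps it fixed and ends
    the per-frame action sequence."""
    x, y, w, h = box
    dx, dy = move_frac * w, move_frac * h
    if action == "left":    x -= dx
    elif action == "right": x += dx
    elif action == "up":    y -= dy
    elif action == "down":  y += dy
    elif action == "bigger":
        w *= 1 + scale_frac; h *= 1 + scale_frac
    elif action == "smaller":
        w *= 1 - scale_frac; h *= 1 - scale_frac
    # "stop": leave the box unchanged
    return (x, y, w, h)

box = (50.0, 40.0, 100.0, 200.0)
for a in ["right", "right", "down", "smaller", "stop"]:
    box = apply_action(box, a)   # the policy network would choose each action
```

Continuous-action frameworks such as the AC tracker of Chen et al. replace this action sequence with a single continuous displacement output per frame, which is what makes them faster.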
Current methods of pedestrian detection focus on improving the accuracy of the detection results, while the GANs-based person re-ID methods mentioned in this survey center on improving the transferability of the algorithm. The RL-based pedestrian tracking methods concentrate on achieving both accuracy and transferability. Future works should include pedestrian detection, re-ID, and tracking in severe occlusion situations. In addition, future research may also involve pedestrian detection, re-ID, and tracking tasks at the semantic level. Beyond perception tasks, we now introduce some decision-making tasks, such as robot navigation. Robot navigation focuses on navigating a robot to a target or around obstacles, considering accuracy or transferability, or both.
Robot Navigation
Robot navigation, which has recently become a crucial and popular topic in autonomous systems, mainly focuses on navigating a robot to a target position or around obstacles in a known/unknown environment. Here we consider whether the trained model can accurately learn task features or successfully transfer previous information to new tasks or domains. A variety of RL and meta-learning methods, such as DQN,171 LSTM structures,263 and MAML,127 can accurately or transferably handle the changes arising from the environment or task when using a previously trained model. As shown in Table 4, we summarize the RL and meta-learning methods that handle accurate learning and domain-transfer tasks in robot and unmanned aerial vehicle (UAV) navigation. As for the experimental platform, AI2-THOR (the house of interactions)61 performs well due to its shared task features and datasets, ensuring that learned skills transfer to new tasks. Moreover, meta-learning methods usually exhibit more satisfactory transferability than RL methods when training and testing data are lacking, by means of extracting or memorizing previous training data in simulation.
RL-Based Robot Navigation
To improve training efficiency and accuracy, dividing a single task into several subtasks and training them separately is one solution. Polvara et al.99 proposed two distinct DQNs, called double DQNs, to train two subtasks: landmark detection and vertical landing. Because each subtask is trained separately and simultaneously, training efficiency and accuracy were improved to an extent. Moreover, training the model with various auxiliary tasks, such as pixel control,264 reward prediction,265 and value function replay,100 also helps the robot adapt to the target faster and more accurately.
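Each of the DQNs above is trained on the standard Q-learning regression target, y = r + gamma * max_a Q(s', a), with bootstrapping cut at terminal transitions. A minimal numpy sketch of the target computation for one minibatch (the toy numbers are illustrative; a real agent would draw them from a replay buffer):

```python
import numpy as np

def dqn_targets(rewards, dones, next_q, gamma=0.99):
    """Standard DQN regression targets for a minibatch:
    y = r + gamma * max_a Q_target(s', a), zeroing the bootstrap term at
    terminal transitions. Each subtask network (e.g. landmark detection or
    vertical landing) would be trained on its own such targets."""
    return rewards + gamma * (1.0 - dones) * next_q.max(axis=1)

rewards = np.array([0.0, 1.0, -0.5])
dones = np.array([0.0, 1.0, 0.0])                     # second transition ends an episode
next_q = np.array([[0.2, 0.5], [0.9, 0.1], [0.0, 0.4]])  # Q_target(s', a)
y = dqn_targets(rewards, dones, next_q)
```

The online network is then fit by minimizing the squared error between its Q(s, a) and these targets.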
To equip the model with better transferability when encountering a new situation, task features53,171 and training policies47 can be transferred to novel tasks in the same domain or across domains. Parisotto et al.53 and Rusu et al.54 transferred useful features among different ATARI games, and the corresponding features were then utilized to train a new ATARI game in the same domain. In addition, when dealing with tasks whose real-world trials are usually time-consuming or expensive, the characteristics of tasks can be transferred across domains effectively. Zhang et al.171 proposed a DQN shared between tasks in order to learn informative task features, which can be transferred from simulation to the real world. Similarly, as shown in Figure 6, Sadeghi and Levine169 proposed a novel realistic translation network, which transforms virtual image inputs into real images with a similar scene structure. Moreover, policies can be transferred from simulation to simulation. Similar to Polvara et al.,99 the primary training policy of Sadeghi and Levine169 can be divided into several secondary policies, each of which acquires a certain behavior. These behaviors are then combined to train the primary policy, which helps to make the primary policy more transferable across domains. Chen et al.47 used AC networks to train the secondary policies as well as the primary policy. In navigation, the primary behavior learned by a high-degree-of-freedom robot is to navigate straight to the target within a sample environment. Chen et al. then randomized the non-essential aspects of every secondary behavior, such as the appearance, positions, and number of obstacles in the scene, to improve the generalization ability of the final policy.
Figure 6.
UAV Indoor Navigation via DRL Algorithm
The DRL algorithm is trained entirely in a simulated 3D CAD model and generalized to real indoor flight environments. From Sadeghi and Levine.169 See also Table 4.
Due to the sampling constraints of model-free RL methods and the transfer limitations of model-based RL methods mentioned in Preliminaries, it is difficult to equip a model with good transferability and sampling efficiency at the same time. An easy way to handle this contradiction is to combine model-free methods with model-based ones. Kahn et al.48 used a generalized computation graph to learn navigation policies from scratch by instantiating the graph at specific points between the model-free and model-based extremes. Therefore, the algorithm not only learns high-dimensional tasks but also has promising sampling efficiency.
Meta-Learning-Based Robot Navigation
RL-based methods tend to need sufficient training data to acquire transferability. When a new task has insufficient data during training and testing, meta-learning methods can still make the model transferable across domains. Firstly, recurrent models such as the LSTM structure alleviate the long-term dependency problem of sequential data and can act as an optimizer that learns an optimization method for gradient descent models. Mirowski et al.266 proposed a multi-city navigation network with an LSTM structure, which was used to encode and encapsulate region-specific features and structures in order to add multiple paths in each city. After training in multiple cities, the network proved sufficiently versatile. Moreover, metric learning can be utilized to extract image information and generalize specific information, which is helpful in navigation. Zhu et al.61 combined a Siamese network with an AC network to navigate the robot to the target using only 3D images. The Siamese network captures and compares the special characteristics of the observation image and the target image; the joint representation of the images is then kept in scene-specific layers, and the AC network uses the features in these layers to generate the policy and value outputs for navigation. In summary, the deep Siamese AC network shares parameters across different tasks and domains so that the model can be generalized across targets and scenes. Even though the models trained by these two meta-learning methods achieve both accuracy and transferability, they still need plenty of data to be retrained when encountering new cross-domain tasks. To fine-tune a new model with little data, MAML is beneficial. Finn et al.60 verified that MAML performs well in 2D navigation and locomotion simulation compared with traditional policy gradient algorithms.
It was shown that MAML could learn a model that adapts much more quickly in a single gradient update while continuing to improve with additional updates without overfitting. When the training process is unsupervised, MAML is not directly applicable and needs to be adjusted, for example by constructing a reward function during the meta-training process129 or labeling data using clustering methods.267 Wortsman et al.175 proposed a self-adaptive visual navigation (SAVN) method derived from MAML to learn adaptation to new environments without any supervision. Specifically, SAVN optimizes two objective functions: a self-supervised interaction loss and a navigation loss. During training, the interaction and navigation gradients are back-propagated through the network, and the parameters of the self-supervised loss are updated at the end of each episode using navigation gradients, as in MAML. During testing, the parameters of the interaction loss remain fixed while the rest of the network is updated using interaction gradients. Therefore, MAML equips the model with good transferability in an unsupervised environment.
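The fast-adaptation idea behind MAML can be illustrated with a toy first-order variant on quadratic "reach-a-goal" losses that loosely mimic the 2D navigation setting; the functions, goals, and hyperparameters below are invented for illustration and are not from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(theta, goal):
    # Gradient of the per-task loss 0.5 * ||theta - goal||^2: each toy "task"
    # is reaching a different 2D goal point.
    return theta - goal

def fomaml(theta, goals, alpha=0.1, beta=0.05, steps=100):
    """First-order MAML: adapt with one inner gradient step per task, then
    move the shared initialization along the post-adaptation gradients."""
    for _ in range(steps):
        outer_grad = np.zeros_like(theta)
        for goal in goals:
            theta_adapted = theta - alpha * loss_grad(theta, goal)  # inner step
            outer_grad += loss_grad(theta_adapted, goal)            # outer grad
        theta = theta - beta * outer_grad / len(goals)              # meta-update
    return theta

# For quadratic losses the learned initialization approaches the mean goal,
# from which one gradient step adapts quickly to any individual task.
goals = [rng.normal(size=2) for _ in range(8)]
theta0 = fomaml(np.zeros(2), goals)
```

The meta-update moves the initialization so that a single inner gradient step already lands close to each task optimum, which is the "adapts in a single gradient update" property mentioned above.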
RL and meta-learning methods help the robot navigate to targets and avoid obstacles. When using RL methods, separating tasks or adding auxiliary tasks during the training process improves the accuracy of the navigation results. Moreover, there are many ways to improve transferability in robot navigation, including task transfer, parameter transfer, and policy transfer. Compared with RL methods, meta-learning methods promote transferability well, especially when the training and testing data are limited. In the future, with the popularity of MAML, we believe that MAML will become capable of handling more complex tasks in reality and achieving more satisfactory transferability by combining with state-of-the-art methods such as metric learning and LSTM structures. After outlining robot navigation tasks in autonomous systems, we now focus on another robotic issue, robotic manipulation.
Robotic Manipulation
We now turn to transferability in robotic manipulation, organized by the domain-transfer tasks implemented with various robotic arms. Compared with robot navigation, robotic manipulation mainly considers precise control of robotic arms with multiple degrees of freedom. RL methods enable the robotic arm to transfer between different environments and tasks by means of special inputs268 and reformed training networks.269 Moreover, meta-learning and imitation learning can be utilized to handle difficult tasks with a few or even a single demonstration during the meta-testing process, in the same domain or across domains,115,183 in order to speed up the learning process and transfer previous task features. Table 4 summarizes RL and meta-learning methods for domain-transfer robotic manipulation problems. As experimental platforms, MuJoCo (multi-joint dynamics with contact)270 and the PR2 arm are popular because robotic arms with multiple degrees of freedom and shared information achieve better accuracy and transferability. Moreover, compared with RL, meta-learning is capable of training with fewer data and adapting more quickly to new tasks to acquire model transferability.
RL-Based Robotic Manipulation
To improve the transferability of robotic arm systems, synthetic data as input179 and separately trained networks98 are possible RL-based solutions. Synthetic inputs help to transfer experience learned from different settings in simulation to the real world. Zhang et al.179 were the first to build a three-joint robot arm that learned control via DQN merely from raw-pixel images without any prior knowledge. The robot arm reaches the target in the real world successfully only when it takes as input synthetic images generated by a 2D simulator according to real-time joint angles. The input of synthetic images thus bridges the gap between simulation and the real world, thereby improving transferability. Moreover, when the data are limited and cannot be synthesized, DQN can be divided into perception and control modules that are trained separately; the perception skills and the controller obtained from simulation are then transferable.98 Similarly, DQN can also train several networks and combine the learned experience. Zeng et al.101 used DQN to jointly train two fully convolutional networks mapping from visual observations to actions. Experience transfers between the robot pushing and grasping processes, and thus these synergies are learned. To compare some popular RL methods in terms of generalization ability in robotic manipulation, Quillen et al.271 evaluated simulated benchmark tasks in which robot arms grasp random targets, comparing several DRL algorithms, such as double Q-learning (DQL), DDPG, path consistency learning, and Monte Carlo (MC) policy evaluation. In the experiment, the trained robot arms coped with grasping unseen targets. The results revealed that DQL performs better than the other algorithms in low-data regimes and is relatively more robust to the choice of hyperparameters. When data become plentiful, MC policy evaluation achieves slightly better performance.
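The decoupled update that gives double Q-learning its robustness can be sketched in tabular form on a toy chain task. The environment, rewards, and hyperparameters here are invented for illustration; the DQL evaluated by Quillen et al. uses deep networks in place of the two tables.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy chain: states 0..4, actions {0: left, 1: right}, reward 1 at state 4.
n_states, n_actions, goal = 5, 2, 4
qa = np.zeros((n_states, n_actions))
qb = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.3

def step(s, a):
    s2 = max(0, min(goal, s + (1 if a == 1 else -1)))
    return s2, float(s2 == goal)

for _ in range(3000):
    s = int(rng.integers(0, goal))
    for _ in range(500):  # cap episode length
        q = qa[s] + qb[s]
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(q))
        s2, r = step(s, a)
        # Key idea of double Q-learning: one table selects the next action,
        # the other evaluates it, which reduces overestimation bias.
        if rng.random() < 0.5:
            qa[s, a] += alpha * (r + gamma * qb[s2, int(np.argmax(qa[s2]))] - qa[s, a])
        else:
            qb[s, a] += alpha * (r + gamma * qa[s2, int(np.argmax(qb[s2]))] - qb[s, a])
        s = s2
        if s == goal:
            break

policy = np.argmax(qa + qb, axis=1)  # greedy policy: move right toward the goal
```

After training, the greedy policy moves right in every non-terminal state, and the two tables agree on the value of the optimal action near the goal.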
MAML-Based Robotic Manipulation
In robotic manipulation, traditional RL methods tend to need plenty of training data, and even when they can transfer to new tasks or domains, their generalization ability is often poor.180,272 MAML combined with imitation learning is able to utilize past experience across different tasks or domains and can learn new skills from a very small number of demonstrations in various fields of application. Duan et al.62 let the robot arm generate demonstrations itself in simulation, i.e., the input and output samples were collected by the robot arm itself. The inputs of the model are the position information of each block rather than images or videos. They first sampled a demonstration from one of the training tasks, then sampled a pair of observation and action from a second demonstration of the same task. Given the first demonstration and the second observation, the network was trained to output the corresponding action. In the manipulation network, a soft attention structure allows the model to generalize to conditions and tasks that are unseen in the training data. Finn et al.115 instead used visual inputs from raw pixels as demonstrations. The model requires significantly fewer prior demonstrations in training and merely one demonstration in testing to learn new skills effectively. Moreover, it not only performs well in simulation but also works on a real robotic arm system. MAML is modified into a two-head architecture, which makes the algorithm flexible for both learning to adapt the policy parameters and learning from the expert demonstration. Therefore, the number of demonstrations needed for an individual task is reduced by sharing data across tasks. Taking robot arm pushing as an example, during the training process the robot arm sees various pushing demonstrations that contain different objects, and each object may have a different mass and friction.
In the testing process, the robot arm needs to push an object that it has never seen during training; it must learn which object to push and how to push it from merely one demonstration. As shown in Figure 7, compared with Finn et al.,115 Yu et al.183 increased the difficulty of imitation learning by using only a single video demonstration from a human as input, whereby the robot arm needs to accomplish the same work as in Finn et al.115 through domain adaptation. The authors proposed a domain-adaptive meta-learning method that transfers data from human demonstrations to robot arm demonstrations. MAML was utilized to deal with the setting of learning from video demonstrations of humans. Because behavior is cloned across domains, the loss function also needs to be reconstructed, and a TCN is used to construct the loss network in the MAML structure in the robotic arm domain. Specifically, the robot arm learns a set of initial parameters in the video domain; then, after one or a few steps of gradient descent on merely one human demonstration, the robot arm is able to perform the new task effectively. Recently, based on the study by Finn et al.,115 Singh et al.187 improved the one-shot imitation model by using additional autonomously collected data instead of manually collected data. Notably, they put forward an embedding network to judge whether two demonstration embeddings are close to each other. Using metric learning, they compute the Euclidean distance between two videos; if the videos are close, the demonstrations are regarded as belonging to the same task. Therefore, demonstrations from the same task are treated as autonomously collected data that can be used for training on different tasks.
Figure 7.
Demonstrations and Robotic Actions in Simulation and Real World
(A) Robot demonstrations used for meta-imitation learning. From Finn et al.115
(B) Human and robot demonstrations used for meta-imitation learning with large domain shift. From Yu et al.183
See also Table 4.
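The metric-learning test described above, checking whether two demonstration embeddings lie within a threshold Euclidean distance, can be sketched as follows; the embeddings and the threshold `tau` are invented for illustration and do not come from the cited implementation.

```python
import numpy as np

def same_task(emb_a, emb_b, tau=1.0):
    """Return True if two demonstration embeddings are within Euclidean
    distance tau of each other (tau is a hypothetical threshold)."""
    return float(np.linalg.norm(np.asarray(emb_a) - np.asarray(emb_b))) < tau

# Toy demonstration embeddings: two similar pushing demos and one grasping demo.
push_demo_1 = np.array([0.9, 0.1, 0.0])
push_demo_2 = np.array([1.0, 0.0, 0.1])
grasp_demo = np.array([0.0, 2.0, 1.5])

# Demonstrations whose embeddings are close are grouped into the same task and
# reused as autonomously collected training data.
grouped = same_task(push_demo_1, push_demo_2)   # True: same pseudo-task
separate = same_task(push_demo_1, grasp_demo)   # False: different tasks
```

Grouped demonstrations can then be pooled as extra training data for that task without manual labeling, which is the point of the autonomous data collection scheme.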
In robotic manipulation tasks, synthetic data as input and separately trained networks are RL-based ideas to equip robotic arms with transferability. Moreover, MAML with imitation learning performs well in task and domain transfer with relatively few data. In the future, training with unlabeled data will be a trend, requiring autonomous systems to label the training data by unsupervised methods. On the other hand, the testing demonstrations in meta-imitation learning will become far fewer, perhaps even none, so that an accurate and transferable model is needed.
Discussion and Future Works
This review shows the powerful effects of traditional DL, adversarial learning, RL, and meta-learning on complex visual and control tasks in autonomous systems. In particular, some traditional DL-based methods may not guarantee accuracy when transferred to another domain; however, adversarial learning, RL, and meta-learning are able to handle transferability well. Adversarial learning approaches such as GANs produce better, clearer, and more transferable results than other traditional DL-based methods, while meta-learning methods, alone or combined with RL and imitation learning, tend to achieve efficiency, transferability, or both.
Discussion
In this review, we introduce several typical perception and decision-making tasks of autonomous systems from the perspectives of accuracy and transferability. Since autonomous systems may achieve better perception accuracy in well-lit environments than in harsh ones, we first introduce image style transfer, which can change the training data from night to day, rain to sun, and so forth. Moreover, image style transfer can realize the transfer of synthetic datasets to real-world images, which greatly reduces the hardware cost and wear caused by directly deploying autonomous systems in real scenes. We then review image enhancement and image restoration; autonomous systems usually involve tasks such as image SR and image deblurring/dehazing/rain removal, and we review recent developments from the perspectives of accuracy and transferability. When the image quality reaches a good perceptible state, we consider two typical high-level perception tasks of autonomous systems, i.e., semantic segmentation and depth estimation. Since obtaining ground-truth labels is difficult for these two tasks, various methods have been proposed for transferability between synthetic datasets and real-world data. In addition, we review the tasks of pedestrian detection, person re-ID, and pedestrian tracking, which are often involved in autonomous systems. Among them, pedestrian detection mainly aims to improve the accuracy of detection results; person re-ID is a similar but more difficult task, requiring the matching of pedestrians across disjoint camera views; and pedestrian tracking pays attention to transferring network and video features and promoting framework accuracy. Furthermore, we consider two perception and decision-making tasks of robotic systems, i.e., robot navigation and robotic manipulation. Robot navigation tasks focus on accurately learning task features and transferring information across tasks or domains by RL or meta-learning.
Robotic manipulation deals with domain-transfer tasks that require more precise robotic arm control. Both tasks have simulation platforms, physical platforms, or both, with which to verify accuracy and transferability.
Future Works
There are still important challenges and future works worthy of our attention. In this subsection, we summarize some trends and challenges for autonomous systems.
GANs with Good Stability, Quick Convergence, and Controllable Mode
GANs employ the gradient descent method to iteratively update the generator and discriminator in order to solve a minimax game. In this game, the competition between the generator and discriminator may make training unstable and difficult to converge, and can even cause mode collapse. Although there are some preliminary studies aimed at mitigating these deficiencies of GANs,273,274 there is still much room for improvement in terms of mode diversity and real-time performance. In addition, controlling the mode of data enhancement remains an open question. How to make the generated data modes controllable through additional conditions while keeping the model stable, so as to achieve purposeful data enhancement, in particular for computer vision tasks in autonomous systems, is an interesting future research direction.
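The convergence difficulty shows up even in the simplest "Dirac-GAN" toy problem, a standard didactic example (not from the cited works): the real data are a point mass at zero, the generator outputs a single point theta, and the discriminator is linear, so simultaneous gradient steps on the minimax value rotate around the equilibrium and slowly diverge instead of converging.

```python
import numpy as np

def simultaneous_updates(theta, psi, h=0.1, steps=200):
    """Dirac-GAN toy: D(x) = psi * x, value V = psi * theta (WGAN-style).
    The discriminator ascends V while the generator descends it; the
    equilibrium is (theta, psi) = (0, 0)."""
    norms = []
    for _ in range(steps):
        g_theta, g_psi = psi, theta               # grads of V w.r.t. theta, psi
        theta, psi = theta - h * g_theta, psi + h * g_psi  # simultaneous step
        norms.append(float(np.hypot(theta, psi)))  # distance from equilibrium
    return norms

# Starting off-equilibrium, the distance grows by sqrt(1 + h^2) every step:
# the iterates spiral outward, i.e., plain gradient play does not converge.
norms = simultaneous_updates(1.0, 0.0)
```

Each step multiplies the distance to the equilibrium by exactly sqrt(1 + h^2) > 1, which is why stabilization techniques (regularization, alternative losses, careful scheduling) are needed in practice.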
GANs for Complex Multi-Tasking
Although GANs have achieved great results in some typical computer vision tasks of autonomous systems, extending them to more complex multi-task settings remains difficult. Since some visual tasks are often related to each other, it may be possible to seamlessly reuse supervision between related tasks or solve different tasks in one system without adding complexity.275 For example, it is promising to train a general-purpose network that, with only fine-tuning, can be used for multi-task image restoration in bad weather conditions, such as image rain removal, snow removal, dehazing, seasonal change, and light adjustment. In addition, in severe rain and foggy weather, performing image SR while simultaneously removing rain or haze is challenging. In short, the use of GANs for more complex multi-tasking remains an open question and is worth exploring.
GANs for More Challenging Domain Adaptation
In autonomous systems, transferability is important for computer vision tasks. Although some works introduce GANs into domain adaptation to improve domain transfer,25,276 there is still much room for development. When considering more diverse domains, more differentiated cross-domains, and cross-style domains, such as road scenarios in different countries, existing methods often cannot guarantee good transferability among these domains. However, GANs, having shown unprecedented effectiveness in domain transfer, are promising for developing more diverse domain adaptation. Studying the further use of GANs for more differentiated cross-domain transferability is therefore of interest.
Multi-Modal, Multi-Task, and Multi-Agent RL
Most RL methods in applications focus primarily on visual input only. However, when considering information from multiple modalities, such as voice, text, and video, agents can better understand the scenes, and the experimental performance will be more accurate and satisfactory.49,277 Moreover, in multi-task RL models, the agent is simultaneously trained on both auxiliary tasks and target tasks,264,265 so that the agent has the ability to transfer experience between tasks. Furthermore, thanks to the distributed nature of multi-agent systems, multi-agent RL can achieve learning efficiency by sharing experience through communication, teaching, and imitation.278
Meta-Learning for Unsupervised Tasks
Traditional meta-learning involves supervised learning during both training and testing, whereby both training data and testing data are labeled. However, if we use unlabeled training data, in other words, when no reward is generated in training, how can we still achieve good results on specific tasks during testing? Leveraging unsupervised embeddings to automatically construct tasks or losses for unsupervised meta-learning is one solution,129,175,267 after which the training tasks for meta-learning are constructed. Therefore, meta-learning can be extended to a wider range of unsupervised applications. It will be interesting to use unsupervised meta-learning methods on more realistic task distributions so that the agent can explore and adapt to new tasks more intelligently and the model can solve real-world tasks more effectively.
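The clustering-based task-construction idea can be sketched as follows: cluster unlabeled embeddings, treat the cluster indices as pseudo-labels, and sample N-way K-shot tasks from them. All function names, the toy k-means initialization, and the data below are illustrative assumptions, not the cited method's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(x, k, iters=20):
    """Minimal k-means; for this sketch, initialize with two far-apart points."""
    centers = x[[0, len(x) - 1]].astype(float).copy()
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(0)
    return labels

def make_few_shot_task(x, labels, n_way=2, k_shot=1, n_query=1):
    """Sample an N-way K-shot task using cluster indices as pseudo-labels."""
    classes = rng.choice(np.unique(labels), n_way, replace=False)
    support, query = [], []
    for c in classes:
        idx = rng.choice(np.where(labels == c)[0], k_shot + n_query, replace=False)
        support += [(x[i], c) for i in idx[:k_shot]]
        query += [(x[i], c) for i in idx[k_shot:]]
    return support, query

# Unlabeled embeddings: two well-separated blobs stand in for two latent tasks.
x = np.concatenate([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
labels = kmeans(x, k=2)
support, query = make_few_shot_task(x, labels)
```

A meta-learner such as MAML can then be trained on a stream of such pseudo-tasks exactly as if they were supervised few-shot tasks.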
The Application Performance of RL and Meta-Learning
To deal with the differences between simulation environments and real scenes, tasks or networks can be transferred using RL or meta-learning. However, most existing algorithms with good performance in simulation may not perform as well in the real world,279 which limits the applications of models trained in simulation. Therefore, content-rich and flexible simulation frameworks, namely physics engines such as AI2-THOR,61 MuJoCo,270 and GymFC,177 synthetic datasets such as SUNCG,171 and robot operating platforms such as V-REP (virtual robot experiment platform),280 help to preserve learned information in greater detail so that performance transfers well to the real world.115,183 In the future, more informative simulation environments and more flexible real-world platforms will narrow the gap between simulation and the real world, thereby making models more accurate and transferable. For example, a humanoid robotic hand with multiple degrees of freedom is able to deal with tasks more accurately,281 and 3D simulation involving shared tasks, simulators, and datasets helps ensure that learned skills can be transferred successfully to reality.282 In sum, with high similarity between simulation and real-world platforms, various high-complexity applications trained in simulation can be accurately transferred into practice.
Conclusion
In this review, we aim to contribute to the evolution of autonomous systems by exploring the impacts of accuracy, transferability, or both on complex computer vision tasks and decision-making problems. To this end, we mainly focus on basic, challenging perception and decision-making tasks in autonomous systems, such as image style transfer, image SR, image deblurring/dehazing/rain removal, semantic segmentation, depth estimation, pedestrian detection, person re-ID, pedestrian tracking, robot navigation, and robotic manipulation. We introduce some basic concepts and methods of transfer learning and its special case, domain adaptation, then briefly discuss three typical adversarial learning networks, namely GANs, cGANs, and CycleGAN. We also present some basic concepts of RL, explain the idea of meta-learning, and discuss the relationship between adversarial learning, RL, and meta-learning. Additionally, we analyze some typical DL methods, focus on the powerful performance of GANs in computer vision tasks, and discuss RL and meta-learning methods for robot control tasks in both simulation and the real world. Moreover, we provide summary tables of learning-based methods for different tasks in autonomous systems, including the training manners, loss functions of models, and experimental platforms for visual and robot control tasks. Finally, we discuss the main challenges and future works from the perspectives of perception and decision-making in autonomous systems by considering accuracy and transferability.
Acknowledgments
The authors would like to thank the Editor-in-Chief, Scientific Editor, and anonymous referees for their helpful comments and suggestions, which have greatly improved this paper. This work was supported by the National Key Research and Development Program of China under grant 2018YFC0809302; the National Natural Science Foundation of China under grant nos. 61988101, 61751305, and 61673176; and by the Program of Introducing Talents of Discipline to Universities (the 111 Project) under grant B17017.
Author Contributions
Ideas Design, Y.T. and C. Zhang; Reference Collection, Y.T., C. Zhang, and J.W.; Writing – Original Draft, C. Zhang and J.W.; Writing – Review & Editing, Y.T., G.G.Y., C. Zhao, Q.S., J.K., and F.Q.
References
- 1.Silver D., Schrittwieser J., Simonyan K., Antonoglou I., Huang A., Guez A., Hubert T., Baker L., Lai M., Bolton A. Mastering the game of go without human knowledge. Nature. 2017;550:354–359. doi: 10.1038/nature24270. [DOI] [PubMed] [Google Scholar]
- 2.Rahwan I., Cebrian M., Obradovich N., Bongard J., Bonnefon J.-F., Breazeal C., Crandall J.W., Christakis N.A., Couzin I.D., Jackson M.O. Machine behaviour. Nature. 2019;568:477–486. doi: 10.1038/s41586-019-1138-y. [DOI] [PubMed] [Google Scholar]
- 3.LeCun Y., Bengio Y., Hinton G. Deep learning. Nature. 2015;521:436–444. doi: 10.1038/nature14539. [DOI] [PubMed] [Google Scholar]
- 4.Ferdowsi A., Challita U., Saad W. Deep learning for reliable mobile edge analytics in intelligent transportation systems: an overview. IEEE Vehicular Technology Mag. 2019;14:62–70. [Google Scholar]
- 5.McFarlane D., Giannikas V., Lu W. Intelligent logistics: involving the customer. Comput. Industry. 2016;81:105–115. [Google Scholar]
- 6.J. Forlizzi and C. DiSalvo, (2006). Service robots in the domestic environment: a study of the roomba vacuum in the home. In Proceedings of the 1st ACM SIGCHI/SIGART Conference on Human-robot Interaction, pp. 258–265.
- 7.C. Dong, C.C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, 2014, pp. 184–199.
- 8.Dong C., Loy C.C., He K., Tang X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern. Anal. Mach. Intell. 2015;38:295–307. doi: 10.1109/TPAMI.2015.2439281. [DOI] [PubMed] [Google Scholar]
- 9.J. Sun, W. Cao, Z. Xu, and J. Ponce, Learning a convolutional neural network for non-uniform motion blur removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 769–777.
- 10.Cai B., Xu X., Jia K., Qing C., Tao D. DehazeNet: an end-to-end system for single image haze removal. IEEE Trans. Image Process. 2016;25:5187–5198. doi: 10.1109/TIP.2016.2598681. [DOI] [PubMed] [Google Scholar]
- 11.D. Eigen, D. Krishnan, and R. Fergus, “Restoring an image taken through a window covered with dirt or rain. In Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 633–640.
- 12.J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440. [DOI] [PubMed]
- 13.Badrinarayanan V., Kendall A., Cipolla R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern. Anal. Mach. Intell. 2017;39:2481–2495. doi: 10.1109/TPAMI.2016.2644615. [DOI] [PubMed] [Google Scholar]
- 14.D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems 27 (NIPS 2014), 2014, pp. 2366–2374.
- 15.D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2650–2658.
- 16.Y. Tian, P. Luo, X. Wang, and X. Tang, Pedestrian detection aided by deep learning semantic tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5079–5087.
- 17.M. Ye, A.J. Ma, L. Zheng, J. Li, and P.C. Yuen,Dynamic label graph matching for unsupervised video re-identification. In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5142–5150.
- 18.J. Supancic III and D. Ramanan, “Tracking as online decision-making: learning a policy from streaming videos with reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 322–331.
- 19.Kober J., Bagnell J.A., Peters J. Reinforcement learning in robotics: a survey. Int. J. Robotics Res. 2013;32:1238–1274. [Google Scholar]
- 20.Polydoros A.S., Nalpantidis L. Survey of model-based reinforcement learning: applications on robotics. J. Intell. Robotic Syst. 2017;86:153–173. [Google Scholar]
- 21.Gupta A., Devin C., Liu Y., Abbeel P., Levine S. Learning invariant feature spaces to transfer skills with reinforcement learning. arXiv. 2017 1703.02949. [Google Scholar]
- 22.A. Faust, K. Oslund, O. Ramirez, A. Francis, L. Tapia, M. Fiser, and J. Davidson, “Prm-rl: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 5113–5120.
- 23.C. Tan, F. Sun, T. Kong, W. Zhang, C. Yang, and C. Liu, A survey on deep transfer learning. In International Conference on Artificial Neural Networks, 2018, pp. 270–279.
- 24.J.-Y. Zhu, T. Park, P. Isola, and A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
- 25.A. Atapour-Abarghouei and T.P. Breckon, Real-time monocular depth estimation using synthetic data with domain adaptation via image style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2800–2810.
- 26.C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
- 27.O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, DeblurGAN: blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8183–8192.
- 28.Gui J., Sun Z., Wen Y., Tao D., Ye J. A review on generative adversarial networks: algorithms, theory, and applications. arXiv. 2020 2001.06937. [Google Scholar]
- 29.Goodfellow I., Pouget-Abadie J., Mirza M., Xu B., Warde-Farley D., Ozair S., Courville A., Bengio Y. Generative adversarial nets. Advances in Neural Information Processing Systems. 2014:2672–2680. [Google Scholar]
- 30.Goodfellow I. NIPS 2016 tutorial: generative adversarial networks. arXiv. 2016 1701.00160. [Google Scholar]
- 31.Csurka G. Domain adaptation for visual applications: a comprehensive survey. arXiv. 2017 1702.05374. [Google Scholar]
- 32.Reed S., Akata Z., Yan X., Logeswaran L., Schiele B., Lee H. Generative adversarial text to image synthesis. arXiv. 2016 1605.05396. [Google Scholar]
- 33.T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1316–1324.
- 34.P. Isola, J.-Y. Zhu, T. Zhou, and A.A. Efros, Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
- 35.R. Qian, R.T. Tan, W. Yang, J. Su, and J. Liu, Attentive generative adversarial network for raindrop removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2482–2491.
- 36.Zhang H., Sindagi V., Patel V.M. Image de-raining using a conditional generative adversarial network. IEEE Trans. Circuits Syst. Video Technol. 2019 [Google Scholar]
- 37.J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan, “Perceptual generative adversarial networks for small object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1222–1230.
- 38.K. Ehsani, R. Mottaghi, and A. Farhadi, SeGAN: segmenting and generating the invisible. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6144–6153.
- 39.W. Hong, Z. Wang, M. Yang, and J. Yuan, Conditional generative adversarial network for structured domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1335–1344.
- 40.S. Sankaranarayanan, Y. Balaji, A. Jain, S. Nam Lim, and R. Chellappa, “Learning from synthetic data: addressing domain shift for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3752–3761.
- 41.M. Kim, S. Joung, K. Park, S. Kim, and K. Sohn, “Unpaired cross-spectral pedestrian detection via adversarial feature learning. In 2019 IEEE International Conference on Image Processing (ICIP), 2019, pp. 1650–1654.
- 42.W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao, “Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 994–1003.
- 43.C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, 2016, pp. 613–621.
- 44.Jaradat M.A.K., Al-Rousan M., Quadan L. Reinforcement based mobile robot navigation in dynamic environment. Robotics and Computer-Integrated Manufacturing. 2011;27:135–149.
- 45.N. Kohl and P. Stone, Policy gradient reinforcement learning for fast quadrupedal locomotion. In IEEE International Conference on Robotics and Automation (ICRA '04), vol. 3, 2004, pp. 2619–2624.
- 46.Xie L., Wang S., Markham A., Trigoni N. Towards monocular vision based obstacle avoidance through deep reinforcement learning. arXiv. 2017;1706.09829.
- 47.X. Chen, A. Ghadirzadeh, J. Folkesson, M. Björkman, and P. Jensfelt, Deep reinforcement learning to acquire navigation skills for wheel-legged robots in complex environments. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 3110–3116.
- 48.G. Kahn, A. Villaflor, B. Ding, P. Abbeel, and S. Levine, Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 1–8.
- 49.P. Anderson, Q. Wu, D. Teney, J. Bruce, M. Johnson, N. Sünderhauf, I. Reid, S. Gould, and A. van den Hengel, Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3674–3683.
- 50.Levine S., Finn C., Darrell T., Abbeel P. End-to-end training of deep visuomotor policies. J. Machine Learn. Res. 2016;17:1334–1373.
- 51.V. Mnih, A.P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 2016, pp. 1928–1937.
- 52.Luo W., Sun P., Zhong F., Liu W., Zhang T., Wang Y. End-to-end active object tracking and its real-world deployment via reinforcement learning. IEEE Trans. Pattern. Anal. Mach. Intell. 2020;42:1317–1332. doi: 10.1109/TPAMI.2019.2899570.
- 53.Parisotto E., Ba J.L., Salakhutdinov R. Actor-mimic: deep multitask and transfer reinforcement learning. arXiv. 2015;1511.06342.
- 54.Rusu A.A., Colmenarejo S.G., Gulcehre C., Desjardins G., Kirkpatrick J., Pascanu R., Mnih V., Kavukcuoglu K., Hadsell R. Policy distillation. arXiv. 2015;1511.06295.
- 55.Olivecrona M., Blaschke T., Engkvist O., Chen H. Molecular de-novo design through deep reinforcement learning. J. Cheminformatics. 2017;9:48. doi: 10.1186/s13321-017-0235-x.
- 56.Popova M., Isayev O., Tropsha A. Deep reinforcement learning for de novo drug design. Sci. Adv. 2018;4:eaap7885. doi: 10.1126/sciadv.aap7885.
- 57.Botvinick M., Ritter S., Wang J.X., Kurth-Nelson Z., Blundell C., Hassabis D. Reinforcement learning, fast and slow. Trends Cogn. Sci. 2019;23:408–422. doi: 10.1016/j.tics.2019.02.006.
- 58.Vilalta R., Drissi Y. A perspective view and survey of meta-learning. Artif. Intelligence Rev. 2002;18:77–95.
- 59.Koch G., Zemel R., Salakhutdinov R. Siamese neural networks for one-shot image recognition. ICML Deep Learning Workshop. 2015;2.
- 60.C. Finn, P. Abbeel, and S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 2017, pp. 1126–1135.
- 61.Y. Zhu, R. Mottaghi, E. Kolve, J.J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 3357–3364.
- 62.Duan Y., Andrychowicz M., Stadie B., Ho O.J., Schneider J., Sutskever I., Abbeel P., Zaremba W. One-shot imitation learning. Advances in Neural Information Processing Systems. 2017:1087–1098.
- 63.Tang Y., Zhao C., Wang J., Zhang C., Sun Q., Zheng W., Du W., Qian F., Kurths J. An overview of perception and decision-making in autonomous systems in the era of learning. arXiv. 2020;2001.02319. doi: 10.1109/TNNLS.2022.3167688.
- 64.Arulkumaran K., Deisenroth M.P., Brundage M., Bharath A.A. A brief survey of deep reinforcement learning. arXiv. 2017;1708.05866.
- 65.Weiss K., Khoshgoftaar T.M., Wang D. A survey of transfer learning. J. Big Data. 2016;3:9.
- 66.Pan S.J., Yang Q. A survey on transfer learning. IEEE Trans. Knowledge Data Eng. 2009;22:1345–1359.
- 67.Pan S.J., Tsang I.W., Kwok J.T., Yang Q. Domain adaptation via transfer component analysis. IEEE Trans. Neural Networks. 2010;22:199–210. doi: 10.1109/TNN.2010.2091281.
- 68.T. Evgeniou and M. Pontil, Regularized multi-task learning. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 109–117.
- 69.R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng, Self-taught learning: transfer learning from unlabeled data. In Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 759–766.
- 70.Ganin Y., Lempitsky V. Unsupervised domain adaptation by backpropagation. arXiv. 2014;1409.7495.
- 71.Hoffman J., Wang D., Yu F., Darrell T. FCNs in the wild: pixel-level adversarial and constraint-based adaptation. arXiv. 2016;1612.02649.
- 72.M. Wulfmeier, A. Bewley, and I. Posner, Addressing appearance change in outdoor robotics with adversarial domain adaptation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 1551–1558.
- 73.S. Bak, P. Carr, and J.-F. Lalonde, Domain adaptation through synthesis for unsupervised person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 189–205.
- 74.Hoffman J., Tzeng E., Park T., Zhu J.-Y., Isola P., Saenko K., Efros A.A., Darrell T. CyCADA: cycle-consistent adversarial domain adaptation. arXiv. 2017;1711.03213.
- 75.Y. Zhu, Y. Chen, Z. Lu, S.J. Pan, G.-R. Xue, Y. Yu, and Q. Yang, Heterogeneous transfer learning for image classification. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
- 76.Yang L., Jing L., Yu J., Ng M.K. Learning transferred weights from co-occurrence data for heterogeneous transfer learning. IEEE Trans. Neural Networks Learn. Syst. 2015;27:2187–2200. doi: 10.1109/TNNLS.2015.2472457.
- 77.Patel V.M., Gopalan R., Li R., Chellappa R. Visual domain adaptation: a survey of recent advances. IEEE Signal Process. Mag. 2015;32:53–69.
- 78.Chen M., Xu Z., Weinberger K., Sha F. Marginalized denoising autoencoders for domain adaptation. arXiv. 2012;1206.4683.
- 79.Long M., Cao Y., Wang J., Jordan M.I. Learning transferable features with deep adaptation networks. arXiv. 2015;1502.02791. doi: 10.1109/TPAMI.2018.2868685.
- 80.E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7167–7176.
- 81.Ganin Y., Ustinova E., Ajakan H., Germain P., Larochelle H., Laviolette F., Marchand M., Lempitsky V. Domain-adversarial training of neural networks. J. Machine Learn. Res. 2016;17:2096–3030.
- 82.Q. Chen, Y. Liu, Z. Wang, I. Wassell, and K. Chetty, Re-weighted adversarial adaptation network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7976–7985.
- 83.N. Dalvi, P. Domingos, S. Sanghai, and D. Verma, Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 99–108.
- 84.M. Großhans, C. Sawade, M. Brückner, and T. Scheffer, Bayesian games for adversarial regression problems. In International Conference on Machine Learning, 2013, pp. 55–63.
- 85.Brückner M., Kanzow C., Scheffer T. Static prediction games for adversarial learning problems. J. Machine Learn. Res. 2012;13:2617–2654.
- 86.S. Mei and X. Zhu, Using machine teaching to identify optimal training-set attacks on machine learners. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- 87.Dasgupta P., Collins J.B., McCarrick M. Playing to learn better: repeated games for adversarial learning with multiple classifiers. arXiv. 2020;2002.03924.
- 88.Creswell A., White T., Dumoulin V., Arulkumaran K., Sengupta B., Bharath A.A. Generative adversarial networks: an overview. IEEE Signal Process. Mag. 2018;35:53–65.
- 89.Mirza M., Osindero S. Conditional generative adversarial nets. arXiv. 2014;1411.1784.
- 90.Kaelbling L.P., Littman M.L., Moore A.W. Reinforcement learning: a survey. J. Artif. Intelligence Res. 1996;4:237–285.
- 91.Sutton R.S., Barto A.G. Reinforcement Learning: An Introduction. MIT Press; 2018.
- 92.Geffner H. Model-free, model-based, and general intelligence. arXiv. 2018;1806.02308.
- 93.Feinberg V., Wan A., Stoica I., Jordan M.I., Gonzalez J.E., Levine S. Model-based value estimation for efficient model-free reinforcement learning. arXiv. 2018;1803.00101.
- 94.Gläscher J., Daw N., Dayan P., O’Doherty J.P. States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron. 2010;66:585–595. doi: 10.1016/j.neuron.2010.04.016.
- 95.Mnih V., Kavukcuoglu K., Silver D., Graves A., Antonoglou I., Wierstra D., Riedmiller M. Playing Atari with deep reinforcement learning. arXiv. 2013;1312.5602.
- 96.Lillicrap T.P., Hunt J.J., Pritzel A., Heess N., Erez T., Tassa Y., Silver D., Wierstra D. Continuous control with deep reinforcement learning. arXiv. 2015;1509.02971.
- 97.Azar M.G., Gómez V., Kappen H.J. Dynamic policy programming. J. Machine Learn. Res. 2012;13:3207–3245.
- 98.Zhang F., Leitner J., Milford M., Corke P. Modular deep Q networks for sim-to-real transfer of visuo-motor policies. arXiv. 2016;1610.06781.
- 99.Polvara R., Patacchiola M., Sharma S., Wan J., Manning A., Sutton R., Cangelosi A. Autonomous quadrotor landing using deep reinforcement learning. arXiv. 2017;1709.03339.
- 100.Mirowski P., Pascanu R., Viola F., Soyer H., Ballard A.J., Banino A., Denil M., Goroshin R., Sifre L., Kavukcuoglu K. Learning to navigate in complex environments. arXiv. 2016;1611.03673.
- 101.A. Zeng, S. Song, S. Welker, J. Lee, A. Rodriguez, and T. Funkhouser, Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 4238–4245.
- 102.Vanschoren J. Meta-learning: a survey. arXiv. 2018;1810.03548.
- 103.Li Z., Zhou F., Chen F., Li H. Meta-SGD: learning to learn quickly for few-shot learning. arXiv. 2017;1707.09835.
- 104.Hochreiter S., Schmidhuber J. Long short-term memory. Neural Comput. 1997;9:1735–1780. doi: 10.1162/neco.1997.9.8.1735.
- 105.C. Lea, M.D. Flynn, R. Vidal, A. Reiter, and G.D. Hager, Temporal convolutional networks for action segmentation and detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 156–165.
- 106.Kulis B. Metric learning: a survey. Foundations Trends Mach. Learn. 2013;5:287–364.
- 107.Snell J., Swersky K., Zemel R. Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems. 2017:4077–4087.
- 108.Wang Y., Kwok J., Ni L., Yao Q. Generalizing from a few examples: a survey on few-shot learning. arXiv. 2019;1904.05046.
- 109.S. Chopra, R. Hadsell, and Y. LeCun, Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, 2005, pp. 539–546.
- 110.Vinyals O., Blundell C., Lillicrap T., Kavukcuoglu K., Wierstra D. Matching networks for one shot learning. Advances in Neural Information Processing Systems. 2016:3630–3638.
- 111.F. Sung, Y. Yang, L. Zhang, T. Xiang, P.H. Torr, and T.M. Hospedales, Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1199–1208.
- 112.Lee Y., Choi S. Gradient-based meta-learning with learned layerwise metric and subspace. arXiv. 2018;1801.05558.
- 113.Rakelly K., Zhou A., Quillen D., Finn C., Levine S. Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv. 2019;1903.08254.
- 114.Bottou L. Large-scale machine learning with stochastic gradient descent. In: Lechevallier Y., Saporta G., editors. Proceedings of COMPSTAT’2010. Springer; 2010. pp. 177–186.
- 115.Finn C., Yu T., Zhang T., Abbeel P., Levine S. One-shot visual imitation learning via meta-learning. arXiv. 2017;1709.04905.
- 116.Sutton R.S., Barto A.G. Reinforcement Learning: An Introduction. MIT Press; 1998.
- 117.Pfau D., Vinyals O. Connecting generative adversarial networks and actor-critic methods. arXiv. 2016;1610.01945.
- 118.V.R. Konda and J.N. Tsitsiklis, Actor-critic algorithms. In Advances in Neural Information Processing Systems, 2000, pp. 1008–1014.
- 119.M. Sarmad, H.J. Lee, and Y.M. Kim, RL-GAN-Net: a reinforcement learning agent controlled GAN network for real-time point cloud shape completion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5898–5907.
- 120.Ganin Y., Kulkarni T., Babuschkin I., Eslami S., Vinyals O. Synthesizing programs for images using reinforced adversarial learning. arXiv. 2018;1804.01118.
- 121.Ng A.Y., Russell S.J. Algorithms for inverse reinforcement learning. ICML ’00: Proceedings of the 17th International Conference on Machine Learning. 2000:663–670.
- 122.Ho J., Ermon S. Generative adversarial imitation learning. Advances in Neural Information Processing Systems. 2016:4565–4573.
- 123.Li Y., Song J., Ermon S. InfoGAIL: interpretable imitation learning from visual demonstrations. Advances in Neural Information Processing Systems. 2017:3812–3822.
- 124.S. Hochreiter, A.S. Younger, and P.R. Conwell, Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, 2001, pp. 87–94.
- 125.Wang J.X., Kurth-Nelson Z., Tirumala D., Soyer H., Leibo J.Z., Munos R., Blundell C., Kumaran D., Botvinick M. Learning to reinforcement learn. arXiv. 2016;1611.05763.
- 126.Nichol A., Schulman J. Reptile: a scalable metalearning algorithm. arXiv. 2018;1803.02999. https://openai.com/blog/reptile/
- 127.Gupta A., Mendonca R., Liu Y., Abbeel P., Levine S. Meta-reinforcement learning of structured exploration strategies. Advances in Neural Information Processing Systems. 2018:5302–5311.
- 128.Houthooft R., Chen Y., Isola P., Stadie B., Wolski F., Ho O.J., Abbeel P. Evolved policy gradients. Advances in Neural Information Processing Systems. 2018:5400–5409.
- 129.Gupta A., Eysenbach B., Finn C., Levine S. Unsupervised meta-learning for reinforcement learning. arXiv. 2018;1806.04640.
- 130.L.A. Gatys, A.S. Ecker, and M. Bethge, Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2414–2423.
- 131.J. Johnson, A. Alahi, and L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 2016, pp. 694–711.
- 132.Li Y., Wang N., Liu J., Hou X. Demystifying neural style transfer. arXiv. 2017;1701.01036.
- 133.R. Gong, W. Li, Y. Chen, and L.V. Gool, DLOW: domain flow for adaptation and generalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2477–2486.
- 134.Z. Shen, M. Huang, J. Shi, X. Xue, and T.S. Huang, Towards instance-level image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3683–3692.
- 135.C. Dong, C.C. Loy, and X. Tang, Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, 2016, pp. 391–407.
- 136.M.S. Sajjadi, B. Scholkopf, and M. Hirsch, EnhanceNet: single image super-resolution through automated texture synthesis. In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4491–4500.
- 137.A. Shocher, N. Cohen, and M. Irani, “Zero-shot” super-resolution using deep internal learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3118–3126.
- 138.X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, Y. Qiao, and C. Change Loy, ESRGAN: enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018.
- 139.Y. Yuan, S. Liu, J. Zhang, Y. Zhang, C. Dong, and L. Lin, Unsupervised image super-resolution using cycle-in-cycle generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 701–710.
- 140.J.W. Soh, G.Y. Park, J. Jo, and N.I. Cho, Natural and realistic single image super-resolution with explicit natural manifold discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8122–8131.
- 141.Gong D., Sun W., Shi Q., Hengel A.v.d., Zhang Y. Learning to zoom-in via learning to zoom-out: real-world super-resolution by generating and adapting degradation. arXiv. 2020;2001.02381. doi: 10.1109/TIP.2021.3049951.
- 142.O. Kupyn, T. Martyniuk, J. Wu, and Z. Wang, DeblurGAN-v2: deblurring (orders-of-magnitude) faster and better. In Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8878–8887.
- 143.R. Aljadaany, D.K. Pal, and M. Savvides, Douglas-Rachford networks: learning both the image prior and data fidelity terms for blind image deconvolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10235–10244.
- 144.R. Li, J. Pan, Z. Li, and J. Tang, Single image dehazing via conditional generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8202–8211.
- 145.D. Engin, A. Genç, and H. Kemal Ekenel, Cycle-dehaze: enhanced CycleGAN for single image dehazing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 825–833.
- 146.G. Kim, J. Park, S. Ha, and J. Kwon, Bidirectional deep residual learning for haze removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 46–54.
- 147.A. Dudhane and S. Murala, CDNet: single image de-hazing using unpaired adversarial training. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 2019, pp. 1147–1155.
- 148.P. Sharma, P. Jain, and A. Sur, Scale-aware conditional generative adversarial network for image dehazing. In The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 2355–2365.
- 149.R. Li, L.-F. Cheong, and R.T. Tan, Heavy rain image restoration: integrating physics model and conditional adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1633–1642.
- 150.Jin X., Chen Z., Li W. AI-GAN: asynchronous interactive generative adversarial network for single image rain removal. Pattern Recogn. 2020;100:107143.
- 151.K. He, G. Gkioxari, P. Dollár, and R. Girshick, Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
- 152.Y. Zhang, Z. Qiu, T. Yao, D. Liu, and T. Mei, Fully convolutional adaptation networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6810–6818.
- 153.R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick, Learning to segment every thing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4233–4241.
- 154.Y.-C. Chen, Y.-Y. Lin, M.-H. Yang, and J.-B. Huang, CrDoCo: pixel-level domain transfer with cross-domain consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1791–1800.
- 155.Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang, Taking a closer look at domain shift: category-level adversaries for semantics consistent domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2507–2516.
- 156.Y. Li, L. Yuan, and N. Vasconcelos, Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6936–6945.
- 157.Erkent Ö., Laugier C. Semantic segmentation with unsupervised domain adaptation under varying weather conditions for autonomous vehicles. IEEE Robotics Automation Lett. 2020:1–8.
- 158.Liu F., Shen C., Lin G., Reid I. Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern. Anal. Mach. Intell. 2015;38:2024–2039. doi: 10.1109/TPAMI.2015.2505283.
- 159.J.-J. Hwang, T.-W. Ke, J. Shi, and S.X. Yu, Adversarial structure matching for structured prediction tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4056–4065.
- 160.S. Zhao, H. Fu, M. Gong, and D. Tao, Geometry-aware symmetric domain adaptation for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9788–9798.
- 161.Zhao Y., Kong S., Shin D., Fowlkes C. Domain decluttering: simplifying images to mitigate synthetic-real domain shift and improve depth estimation. arXiv. 2020;2002.12114.
- 162.P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun, Pedestrian detection with unsupervised multi-stage feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3626–3633.
- 163.Li J., Liang X., Shen S., Xu T., Feng J., Yan S. Scale-aware fast R-CNN for pedestrian detection. IEEE Trans. Multimedia. 2017;20:985–996.
- 164.Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, Camera style adaptation for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5157–5166.
- 165.J. Liu, Z.-J. Zha, D. Chen, R. Hong, and M. Wang, Adaptive transfer network for cross-domain person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7202–7211.
- 166.S. Yun, J. Choi, Y. Yoo, K. Yun, and J. Young Choi, Action-decision networks for visual tracking with deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2711–2720.
- 167.B. Chen, D. Wang, P. Li, S. Wang, and H. Lu, Real-time ‘actor-critic’ tracking. In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 318–334.
- 168.Schwenker F., Trentin E. Pattern classification and clustering: a review of partially supervised learning approaches. Pattern Recognition Lett. 2014;37:4–14.
- 169.Sadeghi F., Levine S. CAD2RL: real single-image flight without a single real image. arXiv. 2016;1611.04201.
- 170.L. Tai, G. Paolo, and M. Liu, Virtual-to-real deep reinforcement learning: continuous control of mobile robots for mapless navigation. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 31–36.
- 171.J. Zhang, J.T. Springenberg, J. Boedecker, and W. Burgard, Deep reinforcement learning with successor features for navigation across similar environments. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017, pp. 2371–2378.
- 172.Banino A., Barry C., Uria B., Blundell C., Lillicrap T., Mirowski P., Pritzel A., Chadwick M.J., Degris T., Modayil J. Vector-based navigation using grid-like representations in artificial agents. Nature. 2018;557:429–433. doi: 10.1038/s41586-018-0102-6.
- 173.F. Zhu, L. Zhu, and Y. Yang, Sim-real joint reinforcement transfer for 3D indoor navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11388–11397.
- 174.Niroui F., Zhang K., Kashino Z., Nejat G. Deep reinforcement learning robot for search and rescue applications: exploration in unknown cluttered environments. IEEE Robotics Automation Lett. 2019;4:610–617.
- 175.M. Wortsman, K. Ehsani, M. Rastegari, A. Farhadi, and R. Mottaghi, Learning to learn how to learn: self-adaptive visual navigation using meta-learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6750–6759.
- 176.Jabri A., Hsu K., Gupta A., Eysenbach B., Levine S., Finn C. Unsupervised curricula for visual meta-reinforcement learning. Advances in Neural Information Processing Systems (NeurIPS 2019). 2019. http://papers.nips.cc/paper/9238-unsupervised-curricula-for-visual-meta-reinforcement-learning
- 177.Koch W., Mancuso R., West R., Bestavros A. Reinforcement learning for UAV attitude control. ACM Trans. Cyber-Physical Syst. 2019;3:1–21.
- 178.Gaudet B., Linares R., Furfaro R. Adaptive guidance and integrated navigation with reinforcement meta-learning. Acta Astronautica. 2020. doi: 10.13140/RG.2.2.24778.52164.
- 179.Zhang F., Leitner J., Milford M., Upcroft B., Corke P. Towards vision-based deep reinforcement learning for robotic motion control. arXiv. 2015;1511.03791.
- 180.S. Gu, E. Holly, T. Lillicrap, and S. Levine, Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 3389–3396.
- 181.T. Haarnoja, V. Pong, A. Zhou, M. Dalal, P. Abbeel, and S. Levine, Composable deep reinforcement learning for robotic manipulation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 6244–6251.
- 182.Zhu Y., Wang Z., Merel J., Rusu A., Erez T., Cabi S., Tunyasuvunakool S., Kramár J., Hadsell R., de Freitas N. Reinforcement and imitation learning for diverse visuomotor skills. arXiv. 2018;1802.09564.
- 183.Yu T., Abbeel P., Levine S., Finn C. One-shot hierarchical imitation learning of compound visuomotor tasks. arXiv. 2018;1810.11043.
- 184.Yu T., Quillen D., He Z., Julian R., Hausman K., Finn C., Levine S. Meta-World: a benchmark and evaluation for multi-task and meta reinforcement learning. arXiv. 2019;1910.10897.
- 185.A. Zeng, S. Song, K.-T. Yu, E. Donlon, F.R. Hogan, M. Bauza, D. Ma, O. Taylor, M. Liu, E. Romo, et al., Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 1–8.
- 186.Tsurumine Y., Cui Y., Uchibe E., Matsubara T. Deep reinforcement learning with smooth policy update: application to robotic cloth manipulation. Robotics Autonomous Syst. 2019;112:72–83.
- 187.Singh A., Jang E., Irpan A., Kappler D., Dalal M., Levine S., Khansari M., Finn C. Scalable multi-task imitation learning with autonomous improvement. arXiv. 2020;2003.02636.
- 188.S. Gu, T. Lillicrap, I. Sutskever, and S. Levine, Continuous deep Q-learning with model-based acceleration. In International Conference on Machine Learning, 2016, pp. 2829–2838.
- 189.Schulman J., Wolski F., Dhariwal P., Radford A., Klimov O. Proximal policy optimization algorithms. arXiv. 2017;1707.06347.
- 190.J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, Trust region policy optimization. In International Conference on Machine Learning, 2015, pp. 1889–1897.
- 191.Shorten C., Khoshgoftaar T.M. A survey on image data augmentation for deep learning. J. Big Data. 2019;6:60.
- 192.A.A. Efros and W.T. Freeman, Image quilting for texture synthesis and transfer. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, 2001, pp. 341–346.
- 193.A. Hertzmann, C.E. Jacobs, N. Oliver, B. Curless, and D.H. Salesin, Image analogies. In Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, 2001, pp. 327–340.
- 194.D. Chen, L. Yuan, J. Liao, N. Yu, and G. Hua, Stereoscopic neural style transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6654–6663.
- 195.T. Kim, M. Cha, H. Kim, J.K. Lee, and J. Kim, Learning to discover cross-domain relations with generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, 2017, pp. 1857–1865.
- 196.Z. Yi, H. Zhang, P. Tan, and M. Gong, DualGAN: unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2849–2857.
- 197.Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797.
- 198.Royer A., Bousmalis K., Gouws S., Bertsch F., Mosseri I., Cole F., Murphy K. XGAN: unsupervised image-to-image translation for many-to-many mappings. In: Singh R., Vatsa M., Patel V.M., Ratha N., editors. Domain Adaptation for Visual Understanding. Springer; 2020. pp. 33–49.
- 199.S. Ma, J. Fu, C. Wen Chen, and T. Mei, DA-GAN: instance-level image translation by deep attention generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5657–5666.
- 200.Mo S., Cho M., Shin J. InstaGAN: instance-aware image-to-image translation. arXiv. 2018;1812.10889.
- 201.Park S.C., Park M.K., Kang M.G. Super-resolution image reconstruction: a technical overview. IEEE Signal Process. Mag. 2003;20:21–36.
- 202.Hou H., Andrews H. Cubic splines for image interpolation and digital filtering. IEEE Trans. Acoust. Speech, Signal Process. 1978;26:508–517.
- 203.K. Zhang, W. Zuo, S. Gu, and L. Zhang, Learning deep CNN denoiser prior for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3929–3938.
- 204.X. Ding, Y. Wang, Z. Liang, J. Zhang, and X. Fu, Towards underwater image enhancement using super-resolution convolutional neural networks. In International Conference on Internet Multimedia Computing and Service, 2017, pp. 479–486.
- 205.Keys R. Cubic convolution interpolation for digital image processing. IEEE Trans. Acoust. Speech, Signal Process. 1981;29:1153–1160.
- 206.J. Sun, Z. Xu, and H.-Y. Shum, Image super-resolution using gradient profile prior. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
- 207.Farsiu S., Robinson D., Elad M., Milanfar P. Advances and challenges in super-resolution. Int. J. Imaging Syst. Technology. 2004;14:47–57.
- 208.Yang J., Wright J., Huang T.S., Ma Y. Image super-resolution via sparse representation. IEEE Trans. Image Process. 2010;19:2861–2873. doi: 10.1109/TIP.2010.2050625.
- 209.R. Zeyde, M. Elad, and M. Protter, On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, 2010, pp. 711–730.
- 210.K. Zhang, W. Zuo, and L. Zhang, “Deep plug-and-play super-resolution for arbitrary blur kernels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1671–1681.
- 211.L. Wang, Y. Wang, Z. Liang, Z. Lin, J. Yang, W. An, and Y. Guo, Learning parallax attention for stereo image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 250–12 259.
- 212.Y. Li, V. Tsiminaki, R. Timofte, M. Pollefeys, and L.V. Gool, 3D appearance super-resolution with deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9671–9680.
- 213.Brifman A., Romano Y., Elad M. Unified single-image and video super-resolution via denoising algorithms. IEEE Trans. Image Process. 2019;28:6063–6076. doi: 10.1109/TIP.2019.2924173. [DOI] [PubMed] [Google Scholar]
- 214.Lucas A., Lopez-Tapia S., Molina R., Katsaggelos A.K. Generative adversarial networks and perceptual losses for video super-resolution. IEEE Trans. Image Process. 2019;28:3312–3327. doi: 10.1109/TIP.2019.2895768. [DOI] [PubMed] [Google Scholar]
- 215.Joshi N., Matusik W., Adelson E.H., Kriegman D.J. Personal photo enhancement using example images. ACM Trans. Graph. 2010;29:12–21. [Google Scholar]
- 216. R. Chen, Image dehazing based on image enhancement algorithm. In 5th International Conference on Information Engineering for Mechanics and Materials, 2015.
- 217. Fu X., Huang J., Ding X., Liao Y., Paisley J. Clearing the skies: a deep network architecture for single-image rain removal. IEEE Trans. Image Process. 2017;26:2944–2956. doi: 10.1109/TIP.2017.2691802.
- 218. Maini R., Aggarwal H. A comprehensive review of image enhancement techniques. arXiv preprint arXiv:1003.4053, 2010.
- 219. J. Kuruvilla, D. Sukumaran, A. Sankar, and S.P. Joy, A review on image processing and image segmentation. In International Conference on Data Mining and Advanced Computing (SAPIENCE), 2016, pp. 198–203.
- 220. Schuler C.J., Hirsch M., Harmeling S., Schölkopf B. Learning to deblur. IEEE Trans. Pattern Anal. Mach. Intell. 2015;38:1439–1451. doi: 10.1109/TPAMI.2015.2481418.
- 221. S. Nah, T. Hyun Kim, and K. Mu Lee, Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3883–3891.
- 222. Gulrajani I., Ahmed F., Arjovsky M., Dumoulin V., Courville A.C. Improved training of Wasserstein GANs. Advances in Neural Information Processing Systems. 2017:5767–5777.
- 223. B. Lu, J.-C. Chen, and R. Chellappa, Unsupervised domain-specific deblurring via disentangled representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10225–10234.
- 224. S. Zhou, J. Zhang, W. Zuo, H. Xie, J. Pan, and J.S. Ren, DAVANet: stereo deblurring with view aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 10996–11005.
- 225. W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M.-H. Yang, Single image dehazing via multi-scale convolutional neural networks. In European Conference on Computer Vision, 2016, pp. 154–169.
- 226. Zhang H., Sindagi V., Patel V.M. Joint transmission map estimation and dehazing using deep networks. arXiv preprint arXiv:1708.00581, 2017.
- 227. Ancuti C.O., Ancuti C., De Vleeschouwer C., Sbert M. Color channel transfer for image dehazing. IEEE Signal Process. Lett. 2019;26:1413–1417.
- 228. Golts A., Freedman D., Elad M. Unsupervised single image dehazing using dark channel prior loss. IEEE Trans. Image Process. 2019;29:2692–2701. doi: 10.1109/TIP.2019.2952032.
- 229. X. Liu, Y. Ma, Z. Shi, and J. Chen, GridDehazeNet: attention-based multi-scale network for image dehazing. In Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7314–7323.
- 230. A. Yamashita, Y. Tanaka, and T. Kaneko, Removal of adherent waterdrops from images acquired with stereo camera. In 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2005, pp. 400–405.
- 231. A. Yamashita, I. Fukuchi, and T. Kaneko, Noises removal from image sequences acquired with moving camera by estimating camera motion from spatio-temporal information. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2009, pp. 3794–3801.
- 232. You S., Tan R.T., Kawakami R., Mukaigawa Y., Ikeuchi K. Adherent raindrop modeling, detection, and removal in video. IEEE Trans. Pattern Anal. Mach. Intell. 2015;38:1721–1733. doi: 10.1109/TPAMI.2015.2491937.
- 233. Garcia-Garcia A., Orts-Escolano S., Oprea S., Villena-Martinez V., Garcia-Rodriguez J. A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857, 2017.
- 234. L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
- 235. S.R. Richter, V. Vineet, S. Roth, and V. Koltun, Playing for data: ground truth from computer games. In European Conference on Computer Vision, 2016, pp. 102–118.
- 236. Q.-H. Pham, B.-S. Hua, T. Nguyen, and S.-K. Yeung, Real-time progressive 3D semantic segmentation for indoor scenes. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), 2019, pp. 1089–1098.
- 237. Z. Liang, M. Yang, L. Deng, C. Wang, and B. Wang, Hierarchical depthwise graph convolutional neural network for 3D semantic segmentation of point clouds. In 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 8152–8158.
- 238. J. Lahoud, B. Ghanem, M. Pollefeys, and M.R. Oswald, 3D instance segmentation via multi-task metric learning. In Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9256–9266.
- 239. Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2359–2367.
- 240. Luc P., Couprie C., Chintala S., Verbeek J. Semantic segmentation using adversarial networks. arXiv preprint arXiv:1611.08408, 2016.
- 241. Liu M.-Y., Breuel T., Kautz J. Unsupervised image-to-image translation networks. Advances in Neural Information Processing Systems. 2017:700–708.
- 242. Z. Murez, S. Kolouri, D. Kriegman, R. Ramamoorthi, and K. Kim, Image to image translation for domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4500–4509.
- 243. Ullman S. The interpretation of structure from motion. Proc. R. Soc. Lond. B Biol. Sci. 1979;203:405–426. doi: 10.1098/rspb.1979.0006.
- 244. Scharstein D., Szeliski R. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. Int. J. Comput. Vis. 2002;47:7–42.
- 245. J. Nath Kundu, P. Krishna Uppala, A. Pahuja, and R. Venkatesh Babu, AdaDepth: unsupervised content congruent adaptation for depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2656–2665.
- 246. Y. Chen, C. Schmid, and C. Sminchisescu, Self-supervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7063–7072.
- 247. A. Gordon, H. Li, R. Jonschkowski, and A. Angelova, Depth from videos in the wild: unsupervised monocular depth learning from unknown cameras. In Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8977–8986.
- 248. A. Mousavian, H. Pirsiavash, and J. Košecká, Joint semantic segmentation and depth estimation with deep convolutional networks. In 2016 Fourth International Conference on 3D Vision (3DV), 2016, pp. 611–619.
- 249. B. Leibe, E. Seemann, and B. Schiele, Pedestrian detection in crowded scenes. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 878–885.
- 250. Y. Tian, P. Luo, X. Wang, and X. Tang, Deep learning strong parts for pedestrian detection. In Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1904–1912.
- 251. N. Dalal and B. Triggs, Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, 2005, pp. 886–893.
- 252. Nam W., Dollár P., Han J.H. Local decorrelation for improved pedestrian detection. Advances in Neural Information Processing Systems. 2014:424–432.
- 253. J. Cao, Y. Pang, and X. Li, Pedestrian detection inspired by appearance constancy and shape symmetry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1316–1324.
- 254. R. Yin, Multi-resolution generative adversarial networks for tiny-scale pedestrian detection. In 2019 IEEE International Conference on Image Processing (ICIP), 2019, pp. 1665–1669.
- 255. Xie J., Pang Y., Cholakkal H., Anwer R.M., Khan F.S., Shao L. PSC-Net: learning part spatial co-occurrence for occluded pedestrian detection. arXiv preprint arXiv:2001.09252, 2020.
- 256. J. Wang, X. Zhu, S. Gong, and W. Li, Transferable joint attribute-identity deep learning for unsupervised person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2275–2284.
- 257. Fan H., Zheng L., Yan C., Yang Y. Unsupervised person re-identification: clustering and fine-tuning. ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 2018;14:1–18.
- 258. Song L., Wang C., Zhang L., Du B., Zhang Q., Huang C., Wang X. Unsupervised domain adaptive re-identification: theory and practice. Pattern Recogn. 2020;102:107173.
- 259. R. Hou, B. Ma, H. Chang, X. Gu, S. Shan, and X. Chen, VRSTC: occlusion-free video person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7183–7192.
- 260. X. Sun and L. Zheng, Dissecting person re-identification from the viewpoint of viewpoint. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 608–617.
- 261. H. Nam and B. Han, Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4293–4302.
- 262. M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski, ViZDoom: a Doom-based AI research platform for visual reinforcement learning. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), 2016, pp. 1–8.
- 263. A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, Social LSTM: human trajectory prediction in crowded spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 961–971.
- 264. Bellemare M., Dabney W., Dadashi R., Taiga A.A., Castro P.S., Le Roux N., Schuurmans D., Lattimore T., Lyle C. A geometric perspective on optimal representations for reinforcement learning. Advances in Neural Information Processing Systems. 2019:4360–4371.
- 265. Jaderberg M., Mnih V., Czarnecki W.M., Schaul T., Leibo J.Z., Silver D., Kavukcuoglu K. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
- 266. Mirowski P., Grimes M., Malinowski M., Hermann K.M., Anderson K., Teplyashin D., Simonyan K., Kavukcuoglu K., Zisserman A., Hadsell R. Learning to navigate in cities without a map. Advances in Neural Information Processing Systems. 2018:2419–2430.
- 267. Hsu K., Levine S., Finn C. Unsupervised learning via meta-learning. arXiv preprint arXiv:1810.02334, 2018.
- 268. Kompella V.R., Stollenga M., Luciw M., Schmidhuber J. Continual curiosity-driven skill acquisition from high-dimensional video inputs for humanoid robots. Artif. Intell. 2017;247:313–335.
- 269. Andrychowicz M., Wolski F., Ray A., Schneider J., Fong R., Welinder P., McGrew B., Tobin J., Abbeel P., Zaremba W. Hindsight experience replay. Advances in Neural Information Processing Systems. 2017:5048–5058.
- 270. E. Todorov, T. Erez, and Y. Tassa, MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 5026–5033.
- 271. D. Quillen, E. Jang, O. Nachum, C. Finn, J. Ibarz, and S. Levine, Deep reinforcement learning for vision-based robotic grasping: a simulated comparative evaluation of off-policy methods. In 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 6284–6291.
- 272. Kalashnikov D., Irpan A., Pastor P., Ibarz J., Herzog A., Jang E., Quillen D., Holly E., Kalakrishnan M., Vanhoucke V. QT-Opt: scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
- 273. Che T., Li Y., Jacob A.P., Bengio Y., Li W. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.
- 274. A. Ghosh, V. Kulharia, V.P. Namboodiri, P.H. Torr, and P.K. Dokania, Multi-agent diverse generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8513–8521.
- 275. A.R. Zamir, A. Sax, W. Shen, L.J. Guibas, J. Malik, and S. Savarese, Taskonomy: disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3712–3722.
- 276. K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3722–3731.
- 277. X. Wang, Q. Huang, A. Celikyilmaz, J. Gao, D. Shen, Y.-F. Wang, W.Y. Wang, and L. Zhang, Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6629–6638.
- 278. Buşoniu L., Babuška R., De Schutter B. Multi-agent reinforcement learning: an overview. In: Srinivasan D., Jain L.C., editors. Innovations in Multi-Agent Systems and Applications – 1. Springer; 2010, pp. 183–221.
- 279. Li Y. Deep reinforcement learning: an overview. arXiv preprint arXiv:1701.07274, 2017.
- 280. E. Rohmer, S.P. Singh, and M. Freese, V-REP: a versatile and scalable robot simulation framework. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2013, pp. 1321–1326.
- 281. Andrychowicz O.M., Baker B., Chociej M., Jozefowicz R., McGrew B., Pachocki J., Petron A., Plappert M., Powell G., Ray A. Learning dexterous in-hand manipulation. Int. J. Robotics Res. 2020;39:3–20.
- 282. M. Savva, A. Kadian, O. Maksymets, Y. Zhao, E. Wijmans, B. Jain, J. Straub, J. Liu, V. Koltun, J. Malik, et al., Habitat: a platform for embodied AI research. In Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9339–9347.