Abstract
Current computer vision tasks based on deep learning require a huge amount of annotated data for model training and testing, especially dense estimation tasks such as optical flow segmentation and depth estimation. In practice, manual labeling for dense estimation tasks is very difficult or even impossible, and the scenes covered by a dataset are often restricted to a small range, which dramatically limits the development of the community. To overcome this deficiency, we propose a synthetic dataset generation method that yields an expandable dataset without burdensome manual work. Using this method, we construct a dataset called MineNavi, which contains video footage from the first-person view of an aircraft, matched with accurate ground truth for depth estimation in aircraft navigation applications. We also provide quantitative experiments showing that pre-training on our MineNavi dataset can improve the performance of a depth estimation model and speed up its convergence on real-scene data. Since the synthetic dataset has an effect similar to that of a real-world dataset when training deep models, we finally conduct experiments on MineNavi with unsupervised monocular depth estimation (UMDE) deep learning models to demonstrate the impact of various dataset factors, such as lighting conditions and motion mode, aiming to explore what makes this kind of model train better.
Subject terms: Aerospace engineering, Electrical and electronic engineering
Introduction
In recent years, machine learning based depth estimation methods, which rely heavily on labeled datasets, have achieved satisfying performance. However, the scarcity of available labeled data and the high costs of data acquisition and annotation limit the quantity and variety of existing deep learning methods. Although the problem of data shortage can be partly solved by unsupervised learning methods that use only sparse or even no annotated data, ground truth is still needed in experiments for evaluating or testing the generalization performance of a model. Thus, it is still of great significance to obtain a sufficient number of images with accurate and dense depth information.
Common real-world data acquisition methods are not feasible for depth estimation, especially for aircraft visual navigation, because humans cannot manually produce pixel-wise annotations. Building a virtual world with digital simulation technology to generate synthetic datasets as an intermediate domain may be the most feasible way to generate and label data at the current stage. However, recently released synthetic datasets1–4 are not flexible enough to suit different needs (e.g., they have fixed resolution, limited scenes, low data diversity and huge volume), so it is difficult to apply them to dense estimation tasks in large-scale environments, especially depth estimation in aircraft navigation.
Therefore, in this paper, we propose a simple and expandable synthetic dataset generation method and construct a custom dataset called MineNavi (Fig. 1). This dataset generation method not only solves the problem of the high cost of real-world data acquisition, but also narrows the gap between the training domain and the target domain by customizing the synthetic scene to resemble the target domain. Besides, unlike conventional studies that adjust models on a fixed dataset to match or surpass state-of-the-art methods under certain evaluation metrics, we analyze how changes in the dataset influence the models. This is significant because it can not only verify how well the models generalize across environments, but also guide the construction of real-world datasets. In addition, to explore the impact of various dataset factors on depth estimation models, our MineNavi dataset contains dense depth maps and surface normal vectors of objects. This helps us observe the performance of depth estimation models under different dataset factors, such as camera ego-motion, lighting and motion patterns. Our experiments show that these variations in the training set may significantly affect model performance. Finally, unlike the KITTI dataset5, which targets autonomous driving, our dataset is mainly oriented to depth estimation of large-scale scenes from an aircraft view, which can not only support the development of 3D scene reconstruction6 but also provide training data and testbeds for autonomous aircraft with scene perception7.
Figure 1.
The MineNavi dataset provides image sequences, depth maps, surface normal maps and camera 6 DoF poses in large-scale scenes (with over 576 m depth).
Our contributions are as follows. First, we propose an open synthetic dataset generation method and construct MineNavi for large-scale depth estimation applications. Second, we design experiments to report the performance of baseline models pre-trained on the MineNavi dataset, and reveal the influence of various dataset factors on depth estimation models. The MineNavi dataset is available on the Kaggle platform8.
MineNavi: a synthetic dataset of large scale scenes
Using MineCraft to construct a dataset is not a novel idea in the computer vision community9; here we utilize it for depth estimation in aircraft visual navigation. Our data generation method contains four steps: map loading, camera path setting, shader and lighting condition setting, and ground-truth acquisition. Figure 2 shows the pipeline of the data generation process.
Figure 2.
Data generation pipeline. We use open source tools together with our own tools to achieve efficient data generation.
Environment features, such as the scene structure and lighting conditions, affect the performance of depth estimation models, and particular dynamic parameters, such as moving targets in the environment and the ego-motion of the aircraft, also play important roles in benchmark datasets for model training and evaluation. Accordingly, each image frame of the dataset can be parameterized as:
| 1 |
where represents the 6 DoF camera motion paths, n is the quantified timestamp of the path, M is the map of the scene, s is the shader that renders the world, L(t, w) is the lighting condition. t is the time in a day, and w indicates the weather conditions.
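The parameterization above can be mirrored in code as a small record type. The following is a minimal sketch rather than the authors' tooling: the class `FrameSpec`, the function `enumerate_frames`, the field names and the path identifier `path_000` are hypothetical, and only the factor names (path, timestamp, map, shader, time of day, weather) come from the text.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class FrameSpec:
    """Hypothetical record of the factors that determine one rendered frame."""
    path_id: str      # which 6 DoF camera path is followed
    n: int            # quantized timestamp along that path
    map_name: str     # scene map M
    shader: str       # shader s
    time_of_day: str  # t in L(t, w)
    weather: str      # w in L(t, w)


def enumerate_frames(path_id, num_steps, map_name, shaders, lighting):
    """Yield one FrameSpec per (shader, lighting, timestamp) combination."""
    for shader, (time_of_day, weather) in product(shaders, lighting):
        for n in range(num_steps):
            yield FrameSpec(path_id, n, map_name, shader, time_of_day, weather)


# Illustrative values only; they are not a record of how MineNavi was rendered.
specs = list(enumerate_frames(
    path_id="path_000",
    num_steps=50,
    map_name="AudiaCity",
    shaders=["raw", "sildurs-middle", "sildurs-high"],
    lighting=[("noon", "sunny"), ("night", "sunny")],
))
print(len(specs))  # 3 shaders * 2 lighting conditions * 50 timestamps = 300
```

Keeping every factor in one immutable record makes it straightforward to regenerate the same path under a different shader or lighting condition, which is the kind of controlled variation the later experiments rely on.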
Scene construction
Although many works2,10 build scenes with 3D modeling software such as Blender and Maya, the construction of large-scale 3D scenes is still a relatively time-consuming and laborious task. Besides, limited scene diversity leads to over-fitting of the models. The MineCraft community11 offers extremely rich scene maps, and users can freely build the required scenes to generate a specific dataset. Since aircraft navigation usually involves large-scale scenes, the negative effects of the jagged features of objects in MineCraft can be ignored.
To increase the diversity of the data, we use different shaders and lighting conditions. MineNavi combines the in-game time and weather system with different light and shadow styles to generate data in multiple styles.
Block-based construction in MineNavi is simple and flexible. To build a more refined scene, users can use plug-ins to adjust the size of the blocks and create more complex objects (see Fig. 3).
Figure 3.
Virtual world constructed in MineCraft. Top: the open virtual world AudiaCity that we used to build our dataset. Bottom: users can achieve higher-resolution scenes or buildings by applying plug-ins that shrink the blocks.
Camera paths setting
Based on previous studies12–14, we have found that unsupervised monocular depth estimation methods are very sensitive to the camera motion in the training data.
Therefore, we develop different camera paths and generate corresponding datasets for experiments.
Unlike lighting and other factors that can be quantified as a scalar, a moving camera has 6 continuous degrees of freedom.
Therefore, for a training triplet, we propose a quantitative scalar λ, i.e., the quasi-axis rate, to generate datasets of motion paths according to λ, and analyze the pros and cons of the data under different λ. The quasi-axis rate λ can be formulated as:
| 2 |
| 3 |
| 4 |
| 5 |
is the rotation angle of the camera visual axis at time n, calculated from , and is the camera position vector at time n. When λ = 1, the camera moves parallel to the visual axis; when λ = 0, the camera moves perpendicular to the visual axis.
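To make the two limiting cases concrete, the sketch below computes a quasi-axis-style rate from a sequence of camera positions and visual-axis directions. It assumes the rate is the mean absolute cosine of the angle between each inter-frame displacement and the visual axis, so that motion along the axis gives 1 and motion perpendicular to it gives 0. This formula, the function name `quasi_axis_rate` and its arguments are illustrative assumptions and are not necessarily the definition used in Eqs. (2)–(5).

```python
import numpy as np


def quasi_axis_rate(positions, view_dirs, eps=1e-12):
    """Mean |cos| of the angle between camera displacement and visual axis.

    positions: (N, 3) camera positions at consecutive timestamps.
    view_dirs: (N, 3) visual-axis direction at each timestamp.
    Returns a scalar in [0, 1]. This is an assumed formula for illustration.
    """
    positions = np.asarray(positions, dtype=float)
    view_dirs = np.asarray(view_dirs, dtype=float)
    steps = positions[1:] - positions[:-1]   # displacement between frames
    axes = view_dirs[:-1]                    # visual axis at the start of each step
    dots = np.abs(np.sum(steps * axes, axis=1))
    norms = np.linalg.norm(steps, axis=1) * np.linalg.norm(axes, axis=1)
    return float(np.mean(dots / np.maximum(norms, eps)))


# Forward flight: the camera looks along +x and also moves along +x.
forward_positions = [[float(i), 0.0, 0.0] for i in range(5)]
forward_dirs = [[1.0, 0.0, 0.0]] * 5
forward_rate = quasi_axis_rate(forward_positions, forward_dirs)

# Orbit: the camera circles the origin while looking at the centre.
angles = np.linspace(0.0, 2.0 * np.pi, 361)
orbit_positions = np.stack(
    [np.cos(angles), np.sin(angles), np.zeros_like(angles)], axis=1
)
orbit_rate = quasi_axis_rate(orbit_positions, -orbit_positions)

print(forward_rate, orbit_rate)
```

Under this assumption the forward flight evaluates to 1, while the finely sampled orbit stays close to 0; it is not exactly 0 because each discrete chord is only approximately tangent to the circle.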
In MineNavi, we can set the key points manually or automatically by using Aperture15 and obtain the full path matched with the image sequence through an interpolation algorithm (see Fig. 4). The generated path is highly dynamic, and the pose transformation is much larger than in typical real-world data captured by UAVs.
Figure 4.
The accurate camera 6 DoF pose at any timestamp in the path is obtained by exporting the key points of the Aperture path and interpolating between them.
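The interpolation step can be illustrated with a short sketch. The text does not state which interpolation scheme is used, so the example assumes linear interpolation for position and spherical linear interpolation (slerp) for orientation stored as unit quaternions in (w, x, y, z) order. The functions `slerp` and `interpolate_pose` and the keyframe layout are hypothetical and do not reflect Aperture's export format.

```python
import numpy as np


def slerp(q0, q1, u):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    q0 = np.asarray(q0, dtype=float) / np.linalg.norm(q0)
    q1 = np.asarray(q1, dtype=float) / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:        # flip one quaternion to follow the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:     # nearly identical rotations: normalized linear blend
        q = q0 + u * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - u) * theta) * q0 + np.sin(u * theta) * q1) / np.sin(theta)


def interpolate_pose(key_times, key_positions, key_quats, t):
    """Return (position, quaternion) at time t from keyframes sorted by time."""
    key_times = np.asarray(key_times, dtype=float)
    key_positions = np.asarray(key_positions, dtype=float)
    t = float(np.clip(t, key_times[0], key_times[-1]))
    i = int(np.searchsorted(key_times, t, side="right")) - 1
    i = min(max(i, 0), len(key_times) - 2)   # index of the segment containing t
    u = (t - key_times[i]) / (key_times[i + 1] - key_times[i])
    position = (1.0 - u) * key_positions[i] + u * key_positions[i + 1]
    return position, slerp(key_quats[i], key_quats[i + 1], u)


# Three illustrative keyframes: fly 10 units along x while yawing 90 degrees
# about z, then fly 10 units along y with a fixed heading.
half = np.sqrt(0.5)
times = [0.0, 1.0, 2.0]
positions = [[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [10.0, 10.0, 0.0]]
quats = [[1.0, 0.0, 0.0, 0.0], [half, 0.0, 0.0, half], [half, 0.0, 0.0, half]]

mid_position, mid_quat = interpolate_pose(times, positions, quats, 0.5)
end_position, end_quat = interpolate_pose(times, positions, quats, 2.0)
print(mid_position, mid_quat)
```

Interpolating orientation on the quaternion sphere rather than per Euler angle avoids discontinuities when an angle wraps around, which matters for paths with large pose changes.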
Moving objects in the scene
Dynamic objects in a practical environment may have a great influence on unsupervised monocular depth estimation methods. Many previous works16–18 have focused on how to remove the negative effects of dynamic objects in monocular depth estimation, but due to the very limited datasets, the progress is far from satisfactory. To simulate the influence of moving objects in the synthetic dataset on depth estimation, the construction of a scene containing moving objects can be further parameterized as:
| 6 |
where the additional term is the path of the i-th moving object in the scene. Each dynamic object can be modeled as a custom shape in Blender or other 3D software, and its path can be set by using BlockBuster19. Note that we have not yet included moving objects in our proposed dataset because of their negative effects on depth estimation models.
Generating ground-truth annotations
The shader can perform color mapping on the 3D information of the scene, which acts as a ground-truth as shown in Fig. 1.
We use the DepthMap rendering plug-in to export the corresponding error-free, pixel-level dense depth map that matches the image in sequences. In addition, we provide a surface normal rendering plug-in SurfMap to support surface normal estimation tasks.
Thus, the dataset construction method proposed in this study can generate a large number of customized datasets at a very low cost.
Datasets building
We constructed several datasets with MineNavi, as shown in Table 1. MNv1.0 contains 40 scenes and a total of 2000 images (50 images per scene) rendered by the sildurs-middle shader in sunny weather at noon. MNv1.1 is identical to MNv1.0 except that it uses the sildurs-high shader as the renderer. MNv1.2 has raw, sildurs-middle and sildurs-high renders. MNv1.3 is the largest dataset we have built, containing three blur conditions (low, middle, high), five lighting conditions (morning, noon, afternoon, night, rain), 324 scenes, and a total of 168,200 images. Compared with MNv1, MNv2 differs mainly in the motion patterns and scenes, which are centered around a certain central point with three λ motions (see Table 1).
Table 1.
MineNavi datasets built in this study.
| Dataset | Num. of images | Rendering quality | Num. of scenes | Lighting conditions | Motion blur | λ |
|---|---|---|---|---|---|---|
| MNv1.0 | 2000 | Middle | 40 | Noon | – | 1 |
| MNv1.1 | 2000 | High | 40 | Noon | – | 1 |
| MNv1.2 | 9600 | All | 40 | All | – | 1 |
| MNv1.3 | 162,000 | High | 324 | All | All | 1 |
| MNv2.0 | 16,200 | High | 40 | Noon | – | 1 |
| MNv2.1 | 16,200 | High | 40 | Noon | – | |
| MNv2.2 | 16,200 | High | 40 | Noon | – | 0 |
Experiments
In this section, we verify the feasibility and credibility of the MineNavi dataset for training depth estimation models, and explore the impact of dataset variations on unsupervised monocular depth estimation models. We will demonstrate that (1) a depth estimation model can improve its generalization through pre-training on MineNavi, and (2) it is worthwhile to examine how various factors of the dataset influence the model. We prepare monodepth2 (ref. 20) and its two variants, monodepth2-3D and monodepth2-3Ds, as the test models on our proposed datasets. We also present the Sequential Heat-map of Photometric-error Histogram (SHPH) to verify intuitively whether an image sequence is compatible with depth estimation model training.
MDE models
Unsupervised monocular depth estimation methods12–14,21–24 usually contain a single-view depth network and a multi-view pose network to compute the depth. Following the same principle, we use monodepth2 (ref. 20) and its variants as baseline test models, which are shown in Fig. 5.
Figure 5.
For spatiotemporal feature learning, we build a 3D encoder and apply it to monodepth2 to build monodepth2-3D and monodepth2-3Ds.
Inspired by spatio-temporal methods in scene understanding25,26, the first variant of monodepth2 is monodepth2-3D, in which the encoder is replaced with a 3D encoder to use training frames more efficiently; it can enrich the features by extracting temporal information from multiple images17,27,28. Moreover, as mentioned in previous work29, if there is structural similarity among candidate tasks, it is reasonable to assign just one encoder to extract shared features and recover the required information with task-oriented decoders20,21. Thus, the second variant, monodepth2-3Ds, uses a single encoder to extract the mixed features for both the depth and the pose estimation networks.
Apply MineNavi to MDE models
We present two variant models obtained by changing the encoder, and apply them within the supervised and unsupervised (monodepth2) training frameworks on MNv1.0, MNv1.1 and MNv1.2. Table 2 shows the results of models with single-frame and multi-frame input; the models on MN achieve similar or even better results by simply replacing the encoder from ResNet18 to 3D-ResNet18. Depth information is embedded in the multi-frame image sequence, which can help the model recover depth better. The qualitative results are shown in Fig. 6.
Table 2.
Quantitative results on MN datasets. The first six models are trained with supervision and the rest are unsupervised (monodepth2). AbsRel, SqRel, RMS and RMSlog are error metrics; the δ columns are accuracy metrics.
| Model | Dataset | AbsRel | SqRel | RMS | RMSlog | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|
| ResNet18 | MNv1.0 | 0.198 | 1.318 | 5.679 | 0.268 | 0.731 | 0.923 | 0.972 |
| 3D-ResNet18 | MNv1.0 | 0.194 | 1.197 | 5.453 | 0.259 | 0.749 | 0.932 | 0.973 |
| ResNet18 | MNv1.1 | 0.207 | 1.479 | 5.832 | 0.274 | 0.721 | 0.920 | 0.968 |
| 3D-ResNet18 | MNv1.1 | 0.181 | 1.159 | 5.372 | 0.250 | 0.761 | 0.935 | 0.976 |
| ResNet18 | MNv1.2 | 0.142 | 0.707 | 4.455 | 0.181 | 0.846 | 0.961 | 0.986 |
| 3D-ResNet18 | MNv1.2 | 0.129 | 0.675 | 4.384 | 0.172 | 0.861 | 0.966 | 0.987 |
| Monodepth2 | MNv1.1 | 0.212 | 1.426 | 7.054 | 0.295 | 0.706 | 0.902 | 0.959 |
| Monodepth2-3D | MNv1.1 | 0.245 | 1.833 | 5.919 | 0.273 | 0.750 | 0.925 | 0.965 |
| Monodepth2 | MNv1.2 | 0.170 | 0.974 | 5.782 | 0.211 | 0.798 | 0.941 | 0.977 |
| Monodepth2-3D | MNv1.2 | 0.165 | 1.170 | 5.965 | 0.211 | 0.800 | 0.942 | 0.977 |
| Monodepth2-3Ds | MNv1.2 | 0.160 | 0.991 | 5.899 | 0.208 | 0.809 | 0.943 | 0.977 |
The best results in each dataset are shown in bold.
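For readers unfamiliar with the error and accuracy columns, the sketch below computes them using the definitions that are standard in the monocular depth estimation literature. It assumes the tables follow those standard definitions; the function name `depth_metrics`, the `min_depth` cut-off and the absence of any median scaling or depth capping are assumptions, since the evaluation protocol is not spelled out in the text.

```python
import numpy as np


def depth_metrics(pred, gt, min_depth=1e-3):
    """Standard depth error and accuracy metrics over valid ground-truth pixels."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > min_depth                      # ignore pixels without ground truth
    pred = np.maximum(pred[valid], min_depth)   # keep the logarithm well defined
    gt = gt[valid]
    ratio = np.maximum(pred / gt, gt / pred)    # symmetric ratio used by the δ metrics
    return {
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),
        "sq_rel": float(np.mean((pred - gt) ** 2 / gt)),
        "rms": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "rms_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "delta_1": float(np.mean(ratio < 1.25)),
        "delta_2": float(np.mean(ratio < 1.25 ** 2)),
        "delta_3": float(np.mean(ratio < 1.25 ** 3)),
    }


# Toy example: two exact predictions and one that is off by a factor of two.
metrics = depth_metrics(pred=[1.0, 2.0, 8.0], gt=[1.0, 2.0, 4.0])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Lower is better for the four error metrics, and higher is better for the three δ accuracies, which measure the fraction of pixels whose predicted depth is within a given ratio of the ground truth.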
Figure 6.
Qualitative results of different models in MineNavi.
Generalization of MineNavi
We pre-train the models on MineNavi datasets with linear camera paths. To evaluate the influence of data diversity on performance, we use MNv1.0, MNv1.1 and MNv1.2 for model pre-training. For comparison, we also prepare models trained from scratch and models pre-trained on ImageNet30. Although a model pre-trained on ImageNet with a classification task differs structurally from a model pre-trained on a task similar to the target task, ImageNet pre-training is still the most popular approach in depth estimation.
Fine-tune on KITTI
We fine-tune monodepth2, monodepth2-3D and monodepth2-3Ds on KITTI for 10 epochs, starting from scratch or from weights pre-trained on ImageNet, MNv1.2 or MNv1.3. Note that we have removed the mask mechanism and reduced the number of epochs to simplify training without affecting the final conclusion, so the results may differ from those of the original monodepth2 (ref. 20).
Table 3 shows that monodepth2 and monodepth2-3D pre-trained on ImageNet perform better than those pre-trained on MNv1.2 or trained from scratch, but worse than those pre-trained on MNv1.3. MineNavi thus generalizes well to KITTI. As mentioned before, MNv1.2 and MNv1.3 differ only in lighting conditions and data volume. Therefore, the diversity of lighting conditions effectively improves the generalization capability of the models.
Table 3.
Quantitative results of various MDE models on KITTI with different pre-training datasets. The best results are bolded and the second best are underlined. Since we fine-tune for only 10 epochs and without the masking mechanism, the results differ from those in the original monodepth2 paper (ref. 20). AbsRel, SqRel, RMS and RMSlog are error metrics; the δ columns are accuracy metrics.
| Model | Pre-training dataset | AbsRel | SqRel | RMS | RMSlog | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|
| Monodepth2 | Scratch | 0.141 | 1.117 | 4.797 | 0.205 | 0.839 | 0.948 | 0.980 |
| Monodepth2 | ImageNet | 0.135 | 1.007 | 4.668 | 0.200 | 0.845 | 0.950 | 0.980 |
| Monodepth2 | MNv1.2 | 0.138 | 1.095 | 4.722 | 0.204 | 0.843 | 0.949 | 0.979 |
| Monodepth2 | MNv1.3 | 0.130 | 1.055 | 4.630 | 0.196 | 0.856 | 0.953 | 0.980 |
| Monodepth2-3D | Scratch | 0.170 | 1.453 | 5.758 | 0.247 | 0.775 | 0.916 | 0.963 |
| Monodepth2-3D | ImageNet | 0.163 | 1.857 | 5.529 | 0.233 | 0.807 | 0.930 | 0.968 |
| Monodepth2-3D | MNv1.2 | 0.167 | 2.468 | 5.671 | 0.230 | 0.820 | 0.932 | 0.966 |
| Monodepth2-3D | MNv1.3 | 0.153 | 1.639 | 5.356 | 0.224 | 0.820 | 0.936 | 0.970 |
| Monodepth2-3Ds | Scratch | 0.161 | 1.660 | 5.449 | 0.228 | 0.805 | 0.930 | 0.970 |
| Monodepth2-3Ds | ImageNet | 0.158 | 2.447 | 5.511 | 0.220 | 0.839 | 0.938 | 0.970 |
| Monodepth2-3Ds | MNv1.2 | 0.165 | 2.361 | 5.535 | 0.229 | 0.820 | 0.934 | 0.968 |
| Monodepth2-3Ds | MNv1.3 | 0.158 | 2.133 | 5.555 | 0.223 | 0.829 | 0.936 | 0.969 |
For monodepth2-3Ds, the model pre-trained on ImageNet performs better than those pre-trained on the other datasets. This is mainly because excessive noise in KITTI, e.g., moving objects, deteriorates the robustness of the shared encoder, whereas the large amount of data in ImageNet can make the model more robust31. Note that although the MineNavi dataset is much smaller than ImageNet, it is competitive with ImageNet for depth estimation model training. The qualitative results are shown in Fig. 7, which matches Table 3. The depth maps obtained by the MN pre-trained models have sharper edges than those of the ImageNet pre-trained models, which also indicates that a model generalizes better to similar tasks when the pre-training task is unified with the target task. We also provide fine-tuning curves in Fig. 8, which show the value of our MineNavi dataset for generalization.
Figure 7.
Qualitative results of different models in KITTI. For these models, we used different pre-training weights and an identical fine-tuning process. The results for each model are, from top to bottom: trained from scratch, ImageNet, MNv1.2 and MNv1.3.
Figure 8.
Fine-tuning curves of three test models on KITTI. Solid curves denote the accuracy metric of depth estimation and dashed curves denote the training loss.
Fine-tune on FPV
Since there is no ground-truth in the FPV dataset, we have to compute the distances between the models in different domains based on the loss value29. The closer the migration distance is, the better the pre-training dataset can be generalized to the target domain.
Comparing the loss curves of the different pre-trained models fine-tuned on FPV in Fig. 9, the MineNavi pre-trained models converge faster than the others. The reason is probably that the MineNavi dataset is closer to the FPV dataset than ImageNet in terms of environment scenes. Moreover, compared with the ImageNet pre-trained model, which is trained through the classification task, the MineNavi pre-trained model, which is trained through depth estimation, has learned a geometric representation32 during pre-training, which makes the model converge faster when the target task has structural similarity29 with the source task. Note that, with the continuous expansion of the dataset, MineNavi can achieve more satisfactory performance.
Figure 9.
Fine-tuning curves of three test models on FPV.
Factors that affect the training of MDE
Owing to the expandable nature of the MineNavi dataset, we can easily generate customized datasets with different variation factors to avoid over-fitting. This is also a helpful way to discover how dataset factors affect the models. Thus, we conduct experiments to explore how factors in the dataset affect MDE models, including the shader, lighting conditions, motion blur, ego-motion and velocity of the training image sequence.
Impact of shaders
The MineNavi dataset can render image sequences sampled on the same path through different shaders, which allows us to quantitatively evaluate the impact of the synthetic world design and of rendering quality on algorithm performance. We apply Sildurs33 to adjust the image rendering quality and build three training subsets of MNv1.2: Raw, middle-sildurs and high-sildurs. All of them are captured in an identical scene with linear camera motion and comprise about 10,000 images. The only difference among them is the shader setting: Raw is rendered with no shader, middle-sildurs uses the Sildurs shader at middle performance and high-sildurs uses the high-performance shader. We apply an encoder with random initial weights to monodepth2 and train it on each of the three datasets. We then cross-evaluate the trained models, i.e., evaluate every model on all datasets. The quantitative results are shown in Table 4.
Table 4.
Accuracy on the generated datasets under different rendering shaders. Here AbsRel and δ < 1.25 are used as the error and accuracy indicators. The best result in each row has been underlined and the optimal result has been bolded.
| Train set \ Test set (AbsRel / δ < 1.25) | Raw | Middle-sildurs | High-sildurs |
|---|---|---|---|
| Raw | 0.207 / 0.689 | 0.326 / 0.524 | 0.311 / 0.528 |
| Middle-sildurs | 0.436 / 0.425 | 0.148 / 0.813 | 0.158 / 0.774 |
| High-sildurs | 0.439 / 0.430 | 0.156 / 0.778 | 0.143 / 0.816 |
The results show that as the rendering quality of the training scenes improves, the performance of the depth estimation model improves accordingly. Besides, compared with a model trained on low-texture data and tested on rendered data, a model trained on rendered data and tested on low-texture data gives a worse result. This is consistent with the view that rendering quality promotes the robustness of the model during training.
Lighting conditions
Previous studies20,34 show that, during depth estimation model training, low-texture areas caused by insufficient lighting or overexposure produce problematic pixels in depth estimation.
To further explore the impact of lighting conditions on the depth estimation model, we train models with random initial weights on five subsets of MNv1.3 (see Fig. 10) under different lighting conditions: morning, noon, afternoon, night and rainy day. Quantitative AbsRel results are shown in Fig. 11. With morning and noon lighting, the three test models achieve similar results. However, as the lighting in the training data becomes dim (afternoon, night, rainy), the three models deteriorate significantly. This can be attributed primarily to the fact that adequate lighting makes the colors of pixels more diverse, so the error map is close to a uniform distribution. Note that in the afternoon the performance of the models drops dramatically, even below that at night, when the lighting is dimmer. We suspect that the reason is that the captured images contain too many problematic pixels caused by lens flare, which is strongest in the afternoon compared with the other lighting conditions. SHPH results for the sequences collected under different lighting conditions and different camera paths are shown in row 3 of Fig. 10. Clear lighting conditions evidently yield an even distribution in the SHPH.
Figure 10.
Sequences under different lighting conditions. Photometric error map (row 2) and SHPH map (row 3).
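The idea behind a sequential heat map of photometric-error histograms can be sketched as follows. The exact construction of the SHPH is not given in the text, so this example assumes that the photometric error is the mean absolute intensity difference between consecutive frames (without the view synthesis step a depth model would apply) and that each row of the heat map is the normalized histogram for one frame pair. The function names `photometric_error` and `shph`, the bin count and the value range are hypothetical.

```python
import numpy as np


def photometric_error(frame_a, frame_b):
    """Per-pixel mean absolute intensity difference between two frames."""
    a = np.asarray(frame_a, dtype=float)
    b = np.asarray(frame_b, dtype=float)
    err = np.abs(a - b)
    # Average over the colour channel if the frames are H x W x C.
    return err.mean(axis=-1) if err.ndim == 3 else err


def shph(frames, bins=32, value_range=(0.0, 1.0)):
    """Stack normalized error histograms of consecutive frame pairs.

    Returns an array of shape (len(frames) - 1, bins); row i is the histogram
    of the photometric error between frame i and frame i + 1.
    """
    rows = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        error = photometric_error(prev, curr)
        hist, _ = np.histogram(error, bins=bins, range=value_range)
        rows.append(hist / max(hist.sum(), 1))
    return np.stack(rows)


# Toy sequence with intensities in [0, 1]: two identical frames, then a jump.
frames = [
    np.zeros((4, 4, 3)),
    np.zeros((4, 4, 3)),
    np.full((4, 4, 3), 0.51),
]
heat = shph(frames)
print(heat.shape)  # (2, 32)
```

Plotting `heat` as an image, with time on one axis and error magnitude on the other, gives a heat map in which an even spread along the error axis corresponds to the "even distribution" discussed in the text.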
Figure 11.

Models trained with datasets that vary in lighting conditions show different AbsRel.
Impact of motion blur
The motion of the camera also affects the stability of the SHPH. As shown in Fig. 12, the distribution of the photometric error map gradually becomes more even as the motion blur increases. In our experiment, four datasets with different motion blur are built. The quantitative results of monodepth2 are shown in Table 6. Since motion blur has a great impact on the SHPH, we suspect that adding a certain amount of motion blur to the sequences is an effective way to overcome noise and introduce robustness. This is reflected in the SHPH: appropriate motion blur makes the SHPH more stable, which makes view synthesis easier for the depth estimation model (see Fig. 12).
Figure 12.
At close to the ground ( m) a histogram statistical result of the inter-frame error and its sequence photometric error heat map.
Table 6.
Accuracy of monodepth2 under different levels of motion blur.
| Train set \ Test set (AbsRel / δ < 1.25) | None | Low blur | Middle blur | High blur |
|---|---|---|---|---|
| None | 0.221 / 0.731 | 0.232 / 0.705 | 0.235 / 0.703 | 0.237 / 0.704 |
| Low blur | 0.237 / 0.676 | 0.203 / 0.731 | 0.199 / 0.748 | 0.201 / 0.752 |
| Middle blur | 0.203 / 0.746 | 0.177 / 0.782 | 0.174 / 0.811 | 0.179 / 0.808 |
| High blur | 0.253 / 0.646 | 0.213 / 0.692 | 0.197 / 0.729 | 0.191 / 0.754 |
The best result in each row is underlined and the optimal result is bolded.
Table 7 shows the performance of the two variants of monodepth2 on MineNavi datasets with different motion blur. The two models trained on motion-blurred datasets perform significantly better than those trained on the dataset without blur.
Table 7.
Motion blur test for monodepth2-3D and monodepth2-3Ds. The best result in each row is underlined and the optimal result is bolded.
| Model | Train set \ Test set (AbsRel / δ < 1.25) | None | Low blur | Middle blur | High blur |
|---|---|---|---|---|---|
| Monodepth2-3D | None | 0.199 / 0.768 | 0.216 / 0.760 | 0.217 / 0.758 | 0.218 / 0.752 |
| Monodepth2-3D | Low blur | 0.200 / 0.730 | 0.185 / 0.764 | 0.184 / 0.765 | 0.186 / 0.764 |
| Monodepth2-3D | Middle blur | 0.194 / 0.734 | 0.177 / 0.775 | 0.175 / 0.781 | 0.176 / 0.785 |
| Monodepth2-3D | High blur | 0.200 / 0.730 | 0.186 / 0.763 | 0.180 / 0.774 | 0.181 / 0.778 |
| Monodepth2-3Ds | None | 0.219 / 0.739 | 0.231 / 0.738 | 0.235 / 0.737 | 0.233 / 0.735 |
| Monodepth2-3Ds | Low blur | 0.223 / 0.704 | 0.207 / 0.734 | 0.206 / 0.736 | 0.205 / 0.737 |
| Monodepth2-3Ds | Middle blur | 0.228 / 0.691 | 0.208 / 0.726 | 0.207 / 0.731 | 0.207 / 0.734 |
| Monodepth2-3Ds | High blur | 0.246 / 0.669 | 0.226 / 0.704 | 0.224 / 0.706 | 0.223 / 0.707 |
Besides, we also introduce varying lighting conditions into the experiments. As shown in Table 5, for the variant models of monodepth2, the darker the lighting conditions, the worse the performance, which is consistent with the results of the preceding experiments. Note that in the afternoon the performance of the models drops dramatically, even below that at night, when the lighting is dimmer. We suspect that the reason is that the captured images contain too many problematic pixels caused by lens flare, which is strongest in the afternoon compared with the other lighting conditions (Supplementary Information).
Table 5.
Quantitative results in MineNavi under different lighting conditions. AbsRel, SqRel, RMS and RMSlog are error metrics; the δ columns are accuracy metrics.
| Model | Dataset index | AbsRel | SqRel | RMS | RMSlog | δ < 1.25 | δ < 1.25² | δ < 1.25³ |
|---|---|---|---|---|---|---|---|---|
| Monodepth2 | Morning | 0.276 | 8.491 | 36.672 | 0.289 | 0.694 | 0.885 | 0.947 |
| Monodepth2 | Noon | 0.276 | 11.284 | 49.412 | 0.321 | 0.665 | 0.858 | 0.931 |
| Monodepth2 | Afternoon | 0.463 | 26.176 | 67.480 | 0.534 | 0.375 | 0.647 | 0.803 |
| Monodepth2 | Night | 0.391 | 18.674 | 57.397 | 0.432 | 0.470 | 0.748 | 0.875 |
| Monodepth2 | Rain | 0.547 | 30.597 | 54.880 | 0.520 | 0.395 | 0.671 | 0.818 |
| Monodepth2-3D | Morning | 0.243 | 7.469 | 36.867 | 0.263 | 0.737 | 0.903 | 0.957 |
| Monodepth2-3D | Noon | 0.297 | 11.108 | 46.373 | 0.324 | 0.622 | 0.852 | 0.936 |
| Monodepth2-3D | Afternoon | 0.374 | 13.967 | 45.221 | 0.391 | 0.478 | 0.779 | 0.910 |
| Monodepth2-3D | Night | 0.359 | 15.065 | 49.514 | 0.392 | 0.490 | 0.780 | 0.907 |
| Monodepth2-3D | Rain | 0.481 | 24.776 | 57.464 | 0.492 | 0.386 | 0.673 | 0.836 |
| Monodepth2-3Ds | Morning | 0.277 | 10.086 | 41.467 | 0.290 | 0.697 | 0.883 | 0.950 |
| Monodepth2-3Ds | Noon | 0.290 | 11.167 | 46.460 | 0.315 | 0.656 | 0.865 | 0.936 |
| Monodepth2-3Ds | Afternoon | 0.311 | 11.901 | 43.584 | 0.322 | 0.618 | 0.859 | 0.942 |
| Monodepth2-3Ds | Night | 0.351 | 14.853 | 52.417 | 0.384 | 0.507 | 0.790 | 0.906 |
| Monodepth2-3Ds | Rain | 0.479 | 26.123 | 67.677 | 0.535 | 0.349 | 0.621 | 0.793 |
Impact of ego-motion variance
The ego-motion of the camera in the video affects depth estimation model training. Because camera ego-motion is continuous, it is not easy to explore the impact of this factor. In this section, we build three datasets, i.e., MNv2.0, MNv2.1 and MNv2.2, which vary in motion mode and correspond to linear motion (λ = 1), overhead cruising motion, and circular motion (λ = 0), respectively. In addition, the motion speed can be controlled by the number of interval frames of each training triplet in the datasets, and each dataset is provided at three velocities. We test the different motion modes with the models, and the quantitative results are shown in Table 8. As λ decreases, the performance of the test model also decreases, and the velocity of the training triplets also significantly affects the performance of the test model. According to the previous analysis, the reason is probably that training triplets with larger λ and appropriate velocity have an even distribution in the SHPH, and hence achieve better performance.
Table 8.
Model performance in different ego-motion modes.
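| Train/test sets (AbsRel / δ < 1.25) | Velocities | | |
|---|---|---|---|
| Linear motion (λ = 1) | 0.143 / 0.816 | 0.141 / 0.818 | 0.152 / 0.806 |
| Overhead cruising motion | 0.240 / 0.472 | 0.240 / 0.476 | 0.252 / 0.458 |
| Circular motion (λ = 0) | 0.280 / 0.414 | 0.473 / 0.234 | 0.481 / 0.248 |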
The best result in each column is underlined and the optimal result is bolded.
Velocity of training sequence
We find that varying the sampling frequency of the training sequence can greatly affect the performance of the model. This matters because the faster the sampling camera moves, the larger the photometric differences between two adjacent frames, which makes the model difficult to train. Figure 13 shows the qualitative results of models that vary in the velocity of the training sequence and in the encoder.
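Controlling the apparent velocity through the frame interval of each training triplet can be sketched in a few lines. The helper `make_triplets` below is hypothetical: it assumes a triplet consists of a target frame plus one frame `interval` steps before and after it, which is a common layout for unsupervised monocular depth training and may differ from the authors' data loader.

```python
def make_triplets(num_frames, interval=1):
    """Return (previous, target, next) index triplets spaced `interval` frames apart."""
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    return [
        (i - interval, i, i + interval)
        for i in range(interval, num_frames - interval)
    ]


# Doubling the interval doubles the camera displacement between the frames of
# a triplet, which emulates a camera moving twice as fast along the same path.
slow = make_triplets(num_frames=5, interval=1)
fast = make_triplets(num_frames=5, interval=2)
print(slow)  # [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
print(fast)  # [(0, 2, 4)]
```

A larger interval also reduces the number of usable triplets per sequence, so comparisons across velocities should take the resulting difference in training set size into account.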
Figure 13.
Qualitative model results in MineNavi. First row: monodepth2. Second row: monodepth2-3D. From left to right, the velocity of training sequence is getting lower.
Discussion
This paper proposes a method for constructing synthetic datasets of large-scale scenes at low cost and with effectively unlimited volume, including surface normals, depth, and the 6 DoF paths of the camera’s ego-motion. This dataset generation method offers a way to overcome the difficulty of data collection in some dense estimation tasks. For the depth estimation task in aircraft navigation, we construct several datasets. According to the experimental results, our proposed dataset generation method can serve as an intermediate domain for depth estimation. The data-to-model experiments reveal that future work should not only focus on the innovation of the models, but also pay more attention to the factors in the dataset that affect the models.
Author contributions
X.W. and W.L. wrote the main manuscript text and prepared all figures. M.Y. and B.L. were mainly responsible for revising the language of the article. All authors reviewed the manuscript.
Funding
This study was supported by the Project of Sichuan Science and Technology Department (No. 2022YFG0153) and the Funding from Sichuan University (No. GSJDJS2021010).
Data availability
All MineNavi code used to generate the datasets in this analysis can be found at https://github.com/xdr940/MineNavi. The datasets generated and analysed during the current study are available at https://www.kaggle.com/datasets/xdr940/minenavi.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
The online version contains supplementary material available at 10.1038/s41598-022-26613-0.