Skip to main content
Sensors (Basel, Switzerland) logoLink to Sensors (Basel, Switzerland)
. 2025 Jun 17;25(12):3790. doi: 10.3390/s25123790

Depth from 2D Images: Development and Metrological Evaluation of System Uncertainty Applied to Agricultural Scenarios

Bernardo Lanza 1, Cristina Nuzzi 1,*, Simone Pasinetti 1
Editor: Ben Hamza1
PMCID: PMC12197065  PMID: 40573674

Abstract

This article describes the development, experimental validation, and uncertainty analysis of a simple-to-use model for monocular depth estimation based on optical flow. The idea is deeply rooted in the agricultural scenario, for which vehicles that move around the field are equipped with low-cost cameras. In the experiment, the camera was mounted on a robot moving linearly at five different constant speeds looking at the target measurands (ArUco markers) positioned at different depths. The acquired data was processed and filtered with a moving average window-based filter to reduce noise in the estimated apparent depths of the ArUco markers and in the estimated optical flow image speeds. Two methods are proposed for model validation: a generalized approach and a complete approach that separates the input data according to their image speed to account for the exponential nature of the proposed model. The practical result obtained by the two analyses is that, to reduce the impact of uncertainty on depth estimates, it is best to have image speeds higher than 500–800 px/s. This is obtained by either moving the camera faster or by increasing the camera’s frame rate. The best-case scenario is achieved when the camera moves at 0.50–0.75 m/s and the frame rate is set to 60 fps (effectively reduced to 20 fps after filtering). As a further contribution, two practical examples are provided to offer guidance for untrained personnel in selecting the camera’s speed and camera characteristics. The developed code is made publicly available on GitHub.

Keywords: optical flow, uncertainty evaluation, measurement science, precision agriculture, computer vision

1. Introduction

Outdoor depth estimation can be achieved in several ways, depending on the hardware and software adopted. For applications requiring accuracy and dense depth reconstruction, LiDAR devices are the best choice [1]. However, the price of the hardware is high according to the depth accuracy provided, and oftentimes, such devices require the users to move slower than desired to ensure the correct acquisition of the point clouds (e.g., to deal with light interference in the best way possible). In addition, LiDAR systems require users to adopt dedicated hardware to keep up with the acquisition and storage needs of the device. Luckily, not every outdoor application requires the data density provided by LiDAR systems. In these cases, a viable and low-cost choice is the adoption of depth cameras leveraging stereo vision. However, the quality of the output (both the color and depth images) is often insufficient, especially when employed in applications where the camera moves fast, introducing defects due to the shutter and acquisition speed. In the case of high-speed movements, a good color camera paired with high-end optics is necessary to cope with the speed of the moving scenario to be acquired. Starting from good-quality images, the depth estimation problem could be addressed by computer vision (CV) techniques. One of the most popular methods is optical flow (OF), first developed in 1981 [2,3]. OF is defined as the apparent motion of objects in a sequence of images caused by the relative motion between the scene captured in them and the camera. Therefore, the problem encompasses several variables such as ambient illumination, object texture, and difficult geometrical shapes for which occlusions may happen, producing incorrect OF estimates [4,5,6]. In general, OF is obtained as a map of vectors indicating, for each pixel in the image, the corresponding apparent motion and its intensity. OF is used to estimate the speed and depth of both slow and fast phenomena and is applied in several fields, such as healthcare [7], robotics and moving vehicles in general [8], industry [9], and agriculture [10,11]. Modern approaches for solving OF issues exploit Deep Learning (DL) to improve the estimates, especially in the case of depth estimation [12,13]. Other recent advancements in the field of monocular depth estimation are presented in [14,15,16,17,18,19,20,21,22,23,24]. These works extensively adopt DL models to estimate depth from monocular images without employing OF and are considered state-of-the-art by the CV community. However, their depth output is not stable and it is not based on a measurement, not to mention the computational requirements needed to run those models on mobile and embedded devices. Some of those models do not produce outputs in metric coordinates either, making it difficult for untrained personnel to use those models in real in-field applications for which near real-time computation is required. Focusing on the agricultural sector, however, the general problem of OF estimation can be simplified since the measurement environment is more constrained. Farming vehicles, such as tractors and fruit-picking robots, move linearly along the rows of the field at a constant speed, which is not higher than 5 km/h in the case of large and heavy tractors and close to 2 m/s for agricultural robots. In addition, modern tractors are being designed to be autonomously driven, requiring accurate tractor–row distance estimation to adapt to the specific field they work on [25].

This article describes a simplified version of OF inspired by the Structure from Motion (SfM) CV problem, which allows the reconstruction of complex 3D structures from 2D multi-view images of the scene collected by a moving camera [26]. Traditional SfM techniques are computationally intensive and hardware-demanding, typically adopted for applications such as land reconstruction, architecture, and construction. Therefore, in this paper, a custom model that is easier to understand for end-users was designed. The developed model was validated by means of a practical laboratory experiment mimicking the agricultural scenario, thus showing how the model performs at different camera speeds and relative camera–object depths. In particular, the focus of this article stands on the models’ validation and uncertainty analysis, which ultimately provides ready-to-use information for the end-user (e.g., the farmer) to both choose the right camera for the target application and design the measurement setup and constraints. In fact, it is generally challenging to couple depth estimates with measurement uncertainty, typically treated as a confidence measure [27] that does not adhere to the “Guide to the Expression of Uncertainty” manual [28,29]. Moreover, monocular depth estimation research is mostly rooted in the CV community, which focuses more on the software and mathematical aspects of the problem and less on practical needs for the untrained end-user to obtain a reliable depth measure, such as which hardware to choose according to specific characteristics and what are the limitations of the measurement setup to keep in mind. The findings of this work could be beneficial for several research areas, including yield estimation [30,31] and the wood–fruits–leaf canopy volume estimation from 2D images [32]. Finally, the code used for this work is made publicly available on GitHub at [33].

2. Materials and Methods

2.1. Equipment and Experimental Setup

The experimental setup was designed considering the agricultural scenario, in which tractors and other vehicles move linearly at a constant speed. This is a constrained scenario that allows for some simplifications. The idea is that, by mounting a camera on the moving vehicle, it is possible to acquire a sequence of images for which the temporal and spatial relationships are known. This is possible because (i) the vehicle’s speed is known thanks to its encoder, and (ii) the acquired frames are coupled with acquisition timestamps. In addition, by mounting the camera orthogonal to the vehicle movement direction so that it looks at the canopy, its reference frame primarily moves along its Y axis. By focusing solely on the Y–Z plane with the motion vector confined within it, these assumptions allow for a simplified model implementation.

A scheme of the acquisition setup used for this work is shown in Figure 1, depicting a robot, a color camera, and a custom-made target. The moving vehicle is simulated by a robot (Universal Robots UR10e) mounted upside-down in the workspace described in [34]. On the robot’s end-effector a color camera (GoPro Hero 11 Black) was mounted, tasked with the acquisition of color images during movement. The robot was programmed to move horizontally at 5 constant speeds (S1=0.25 m/s, S2=0.50 m/s, S3=0.75 m/s, S4=0.94 m/s, and S5=0.97 m/s) with a horizontal travel distance of 1.45 m. The speeds tested were selected according to the common operating speed of farming vehicles. The camera has a 1/1.9” CMOS sensor with a resolution of 5599×4927 px and F2.5 lens aperture. To increase the camera frame rate, during acquisition the camera was set to a resolution of 2704×1520 px with an aspect ratio of 4:3, allowing it to acquire frames at 60 frames per second (fps). The lens distortion coefficients and the internal camera matrix (containing the position of the image center CX and CY [px] as well as the focal length dimension in both directions fX and fY [px]) were computed by performing a calibration procedure using a chessboard pattern [35]. It is worth noting that the pixels of the chosen camera have a square shape; hence, fX=fY=f.

Figure 1.

Figure 1

Graphical scheme of the setup adopted in this work, highlighting relevant parameters and sizes of the involved equipment.

The custom-made target measurand is made of an aluminum bar (metallic support) placed at a height of H=1.3 m from the ground, on which 5 smaller bars of different sizes (B1=10 cm, B2=20 cm, B3=40 cm, B4=55 cm, and B5=80 cm) were fixed orthogonally. The target was crafted to simulate agricultural rows where plants grow at different depths than the guiding canopy. The metallic support was positioned at 4 different locations from the robot reference frame (RF), D1=115 cm, D2=135 cm, D3=155 cm, and D4=165 cm. These distances were chosen to simulate different working conditions (e.g., vehicles moving at different distances from the plants growing in the row, and different row distances).

On top of each bar, an ArUco marker [36,37,38] of size 45×45 cm was positioned to be orthogonal to the camera during acquisition (M1, M2, M3, M4, and M5). ArUco markers are square matrices of black and white cells that easily represent a location and an orientation at the same time according to the deformation of the matrix pattern. Each ArUco marker is unique and procedurally generated by the related library, which includes algorithms and functions to retrieve the information of a specific marker from an image. In this work, the ArUco markers serve as the depth ground truth. The ArUco OpenCV library was employed to extract the markers’ depth [36,37,38]. Images taken from the acquisition setup (with the used ArUco marker shown) can be found in Figure 2.

Figure 2.

Figure 2

Example of the setup and the equipment adopted in this work. ArUco markers are highlighted in yellow.

2.2. Data Acquisition

A total of 20 tests were conducted (4 target positions D ×5 camera velocity S). Each camera acquisition is composed of a positive and a negative camera movement along the Y axis. Instead of saving each acquired frame independently, a more computationally efficient solution was defined. A single video in MP4 format was recorded, with each video including frames related to a specific combination of D and S, for both movements along the +Y axis and Y axis. The customized algorithm detailed in Section 2.4 extracts individual frames from the video, splits movements along +Y axis and Y axis, and computes the distance from the camera (depth Z m) of each ArUco marker. Subsequently, the displacement along the Y axis (ΔY px) of the marker’s center is calculated between two consecutive frames to apply the OF algorithm and estimate the depth.

The number of frames in a single trial varies depending on the robot’s (and thus the camera’s) velocity, as the robot follows a fixed path, with an average of 300 frames. Overall, a total of 60,000 data points were collected (5 robot speeds × 2 robot movement directions × 4 relative distances of the metallic support from the camera × 5 ArUco markers at different depths × 300 frames).

2.3. Model Definition

The model definition is based on the scheme of Figure 3 where the projection of a moving point of interest onto the camera plane and the object plane is represented.

Figure 3.

Figure 3

Scheme depicting the projection of a point from the real world to the image plane assuming a pinhole camera model.

Imagine a point P (with coordinates relative to the world’s reference frame, WRF) that moves in the object plane along the Y axis from position P0 to position P1 (displacement equal to ΔY). In the camera’s reference frame (CRF), the projection p of the point P on the camera plane moves from position p0 to position p1 (displacement equal to δY). This displacement is corrected for lens distortion through a camera calibration process (such as [35] or similar approaches). Considering the object plane and the image plane placed at a distance from WRF equal to z and f, respectively, there are two similar triangles (P0P1WRF) and (p0p1WRF). Based on their similarity properties, the equation is obtained as follows:

ΔYδy=Zf, (1)

where ΔY is the displacement of the point in the WRF [m], δy is the displacement of the point in the image plane [px], Z is the depth corresponding to the point [m], and f is the focal length of the camera [px].

By applying Equation (1), knowing the camera focal length f, it is possible to estimate the point’s P depth (Z) based on p pixel displacement in the image plane δy, given its relative displacement with respect to the WRF ΔY.

To accurately calculate δy with respect to a specific point or object within an image, several challenges arise in detection and tracking. The detection challenges include reliably identifying specific points or objects despite varying lighting conditions and similar-looking objects. Ensuring consistency in detecting the same points across multiple frames is crucial for accurate displacement calculation.

In tracking, the main issue is maintaining the continuity of the detected object across successive frames, which requires robust algorithms capable of handling rapid movements, changes in scale, and rotation. Additionally, tracking algorithms may experience drift over time, leading to deviations from the actual position due to cumulative errors.

Advanced techniques are typically employed to address these challenges. Feature extraction methods provide robust features invariant to changes in scale, rotation, and illumination, aiding in detection and tracking. Optical flow (OF) algorithms calculate the apparent motion of objects between consecutive frames, directly providing δy by analyzing pixel displacement. Marker-based tracking, using fiducial markers, offers a reliable method as these markers are easily detectable and their positions can be accurately determined.

Integrating these methods improves the accuracy and reliability of δy calculations. This displacement can be related to the original speed of the point P tracked according to the formula S=V·T, where S represents the space traversed by the tracked point, V refers to the movement velocity, and T=t2t1 refers to the time between two consecutive frames. Two versions of this relationship can be defined accordingly:

ΔY=Vcamera[m/s]·T, (2)
δY=Vimage[px/s]·T, (3)

where Vcamera is the point P velocity in the object plane, Vimage is the point P velocity in the image frame, and T is the time between two consecutive frames. Therefore, the speed of the object in the WRF (Vcamera) is related to the apparent speed of the object in the CRF (Vimage).

By substituting these relations into Equation (1) and simplifying for T (purposely considered equal for the two terms), Z can be obtained directly from point P velocities (in the object plane and in the image plane):

Z=Vcamera·fYVimage. (4)

This equation will be referred to as the “analytical model” definition.

In this work, the aim is to simplify the relationship for Z estimation using only the OF output, treating Vcamera as a constant rather than an unknown variable. By assuming a constant Vcamera during movement, the idea is to obtain a general parameter K that allows end users to estimate Z with sufficient confidence for the target application using the data output of OF to obtain Vimage. Following this, the “experimental model” definition is as follows:

Z=KVimage, (5)

where Vimage [px/s] is obtained through the OF algorithm and parameter K [m·px/s] is unknown.

2.4. Depth Computation

The v=120 videos of the tests (see Section 2.2) were processed leveraging a variety of open-source functions available on the OpenCV library. The processing code was developed in Python and is publicly available at the link provided in reference [33]. Each video was analyzed independently, and the frames contained in it were not saved on disk; instead, to save space, a data extraction procedure was applied to analyze the contents of the frames and save the outputs in separate files in CSV format.

Each video sequence v contains a set of frames i acquired at time ti. Each frame was processed individually following a two-block operational procedure illustrated in Figure 4 (first block in blue, and second block in pink). The first block involves calculating the pose of the markers thanks to a set of image processing analyses (yellow block inside the blue block of Figure 4), followed by the second block, which focuses on computing the optical flow for each marker. Examples of the frames registered by the camera during the experiments are shown in Figure 5.

Figure 4.

Figure 4

Graphical scheme of the processing procedure divided into two operational blocks (blue and pink). The image processing step (yellow) is repeated for each image i=1N in the video sequence v, producing an intermediate output table. Then, the OF algorithm is applied to image couples, starting from image i=2 and tracking changes in the markers’ centers (obtained from the first operational block in blue).

Figure 5.

Figure 5

Examples taken from the video sequences v at different speeds. Please note that the closer marker (M5) moves outside the camera’s field of view when the robot is in the leftmost end position, due to the setup’s positioning in the laboratory with respect to the available space. (a) S1=0.25 m/s, (b) S2=0.50 m/s, (c) S3=0.75 m/s, (d) S5=0.95 m/s.

The image processing block (yellow) structure is the following:

  1. Conversion to grayscale. The image of the current frame Ii is converted from color to grayscale to facilitate the subsequent operations.

  2. Brightness and contrast correction. The image’s contrast α and brightness β are optimized to improve the image’s quality. These parameters are defined to ensure that the markers can be detected with sufficient accuracy by the ArUco library functions. The best values were experimentally found and are equal to α=2 and β=5 according to the general illumination of the working conditions of the experimental setup. To read further about how these two parameters are defined and used in OpenCV, please refer to the official documentation in [39].

  3. ArUco markers finding. The five markers m in the image Ii are identified using a specific set of functions in the OpenCV ArUco library [36,37,38].

  4. Computation of apparent depth. Using the ArUco library, the depth Zi,m of each marker is calculated, providing an estimate of the marker’s Z coordinate relative to the camera’s reference system.

  5. Computation of markers’ centers. For each detected marker m, its center coordinates on the image plane, Ci,m=(xi,m,yi,m), are calculated. Coordinates x and y are different from the 3D coordinates (X,Y,Z), which involve marker and camera calibration since they are solely related to the marker’s detection on the image plane.

Outputs of the first operational block are the ground truth coordinates of the ArUco markers acquired, (xi,m,yi,m,Zi,m) for each acquired image Ii.

The second operational block (pink) was defined to compute pixel displacements between consecutive frames through OF. To do so, starting from the intermediate output table, the OF algorithm was applied on pairs of images Ii1 and Ii (thus beginning the counter from image i=2), using as tracked markers the centers of the markers Ci,m detected before. This produces another value called OFi,m representing the pixel displacement δY between consecutive images in px/mm.

For each video v=120 the overall procedure produces a table Tabv of N1 rows × 11 columns (time instant ti plus the 5 Z-coordinates of the markers’ centers Zi,m and the 5 OFs of each marker’s center OFi,m). Each Tabv, containing the OF data stream over time, is saved in a CSV file.

2.5. Signal Synchronization and Filtering

Considering the analytical model definition described by Equation (4), in the experiments Vcamera was equal to the robot’s speed. However, even if the robot is theoretically actuated at a constant speed, some portions of its movement include acceleration and deceleration (at the start, during changing direction, and at the end of the movement). Ideally, the robot’s movement during the experiment and the camera acquisition should be synchronized. However, the robot signal was obtained from its encoders, which produce denser data compared to the data obtained from the image-based analysis of Section 2.4. Thus, the two data streams are not synchronized and contain a different number of points. To address this issue and obtain synchronized data streams, the following procedure was applied. Please note that in the following notation, we will refer to the data stream containing the robot speed over time using Rv and to the image speed over time (computed as the OF of the 5 ArUco markers’ centers’ positions in the image plane) using Tabv, for a specific acquisition video v=120:

  1. The robot’s signal obtained from the encoders represents its position in its coordinate reference system. The robot’s speed over time Rv is obtained by computing the first derivative of the signal.

  2. Rv and Tabv are filtered to remove portions where the robot was not moving and, consequently, to separate between its positive (+Y) and negative (Y) movements. To do so, the algorithm searched for OF and speed absolute values in Tabv and Rv, respectively, below the 5th percentile of the overall values. This threshold, determined iteratively to minimize data loss while eliminating noise during stationary phases, ensures that even minimal detected movements are treated as stationary. These rows correspond to the initial and final moments of acquisition when the robot was not moving; hence, they are removed from both Tabv and Rv (red portions in the top image of Figure 6).

  3. To find the instant when the robot changes its moving direction (from +Y to Y), the software scans Tabv and Rv to select the rows where OF and robot speed values change from positive to negative. The turning point (yellow portions in the top image of Figure 6) is identified as the point where the corresponding signal goes under the 5th percentile of the values in Tabv and Rv, respectively (without the elements already filtered out from the previous step). Then, the negative stream is rectified, obtaining a signal always in the positive quadrant for both OF and robot speed values.

  4. The robot’s signal is sub-sampled to the same number of points as the corresponding OF signal of the experiment. Then, for each data point of the OF data stream Tabv, the algorithm searches the temporally nearest neighbor of the corresponding robot’s data stream Rv. The iterative procedure outputs several values as many points in the data streams and then computes their average Tshift, representing the temporal shift between the two signals. The temporal correction Tshift is then applied to Rv.

  5. The two data streams include acceleration and deceleration components (i.e., not a constant speed) that must be removed to obtain only data corresponding to movements at a constant speed. This step is applied on both Tabv and Rv at the same time. The points to be removed are selected iteratively by removing parts at the beginning and at the end of the original curve and computing the linear regression with the obtained curve. The procedure removes the initial and final 1% of the whole curve first and iteratively increases the removal percentage up to 20% (with steps of 1%). For each curve obtained, a linear fit is computed coupled with the corresponding R2. The result is a function of the distribution of R2 values with respect to the percentage of removed data. The ideal constant velocity segment corresponds to the portion of the whole curve with the greatest R2. It was experimentally found that the ideal value for sub-sampling is 16% for all acquisitions since R2 does not change notably afterward. An example of this procedure is graphically shown in the bottom image of Figure 6, where the program iteratively selects portions of the data (depicted with colored bars to highlight the portion of data considered, in pairs) until the optimal portion is selected (in the figure, this is the black one).

Figure 6.

Figure 6

Graphical example of the filtering procedure applied on the data of the ArUco marker M3 at speed S1=0.25 m/s. (top) Removal of the portions with no robot movement (red) at the initial and final stages of the movement and corresponding to the change in direction (yellow). (bottom) Example depicting the selection process of the ideal portion of data corresponding to a movement at constant velocity.

At the end of this procedure, the data in each Tabv is merged according to the robot’s speed. The result is a total of 5 tables called Datas, comprehending, for each ArUco marker, all the Ps captured points obtained during the experiment for that specific robot’s speed (subscript s refers to the five robot speeds, ranging from 1 to 5 and equal to S1=0.25 m/s, S2=0.50 m/s, S3=0.75 m/s, S4=0.95 m/s, and S5=0.97 m/s). Please note that the number of Ps depends on the robot’s speed because the quantity of captured points varies according to it (more points for slower speeds).

2.6. Window-Based Filtering

The data in Datas resulting from the processing described in Section 2.5 is significantly noisy, especially at higher speeds. This is primarily due to the vibrations of the camera and sudden environmental disturbances (e.g., light changes, electrical noise, and optical aberrations) that are inherent in real-world conditions, especially in agricultural scenarios where light scattering effects appear on the plants’ canopy. These factors introduce noise and uncertainty into the acquired images, which in turn affect the OF output, a well-known issue in the literature [4,5]. Moreover, in those scenarios, it is quite common to track specific objects during acquisition, for example, fruits [40,41], by using DL models such as object detectors, including in the uncertainty also the contribution of incorrect prediction or bounding box positioning. To mitigate this issue, the data in Datas are filtered again using a window-based moving average technique. In addition, this approach effectively reduces the impact of incorrect DL predictions in the case of fruit tracking.

The filter was applied on each Datas, and considering the data of each ArUco marker individually, the best results were obtained by setting the window size equal to 3, resulting in a sub-sampling of the data corresponding to an “effective” camera acquisition speed of 20 fps (in contrast with the original 60 fps).

In the subsequent analysis, results will be shown for data with and without the application of the window-based moving average filter, for a total of k=110 (5 robot speeds S1...S5, with and without filtering) experimental models represented by tables Datak. Accordingly, the quantity of captured points contained in each Datak will now be called Pk since their number is further reduced after filtering. An example of the windowing effect can be seen in Figure 7.

Figure 7.

Figure 7

Comparison between the analytical model (black line) and the experimental model (red line) for robot speed S3=0.75 m/s. The experimental model is fitted to the points to obtain parameter K. The green triangles correspond to (a) the acquired marker points, and (b) the filtered points.

2.7. Model Validation and Uncertainty Estimation

Each data set Datak contains a set of p=1Pk points represented by a certain Vimage obtained as the OF value of the ArUco marker m (m=15), OFi,m, and a certain depth value Zi,m, computed for each acquired image i in the data set Datak as described in Section 2.4. For simplicity, the points of a data set are generally represented by the pair (Vp,k, Zp,k), a notation that takes into account all m marker points contained in a specific Datak. Therefore, after conducting the pre-processing and filtering steps described in Section 2, the resulting k=110 data sets Datak are used to verify if the experimental model of Equation (5) is in agreement with the analytical one in Equation (4). To do so, it is first necessary to obtain an estimation of the parameter K for each data set, namely Kk. For this purpose, the data set points are simply used as the input for a curve-fitting method implemented in Python 3.10 by the function “curve_fit” of the SciPy package [42], producing a Kk for each data set. Using the known information about the robot speed (one of the 5 tested speeds in our experiment S1S5, which can be considered equal to Vcamera since the camera is rigidly mounted on the robot’s end-effector), the focal length of the camera (fY), and the acquired Vimage of each point (Vp,k), it is straightforward to also apply the analytical model in Equation (4) and compare the two. Figure 7 shows the resulting comparison of the two models for the original data and for the filtered data corresponding to the robot speed S3=0.75 m/s (the results for the other robot speeds are similar and are omitted for brevity). In both cases, the experimental and analytical models are overlapped (red dashed line versus black solid line), meaning that the experimental model was correctly defined and that the experimental procedure was properly conducted. However, by observing the curves, a major issue arises due to the exponential nature of the models. In fact, data points corresponding to Vimage close to 0 px/s are distributed over a wider range of possible values, resulting in high variability. This effect is strongly reduced when Vimage is higher than 500–800 px/s according to the camera’s speed, for which the model shows acceptable variability for the target application. Due to this issue, the problem of uncertainty estimation of the experimental model is challenging to tackle. In the following analysis, two approaches are proposed to obtain the model’s uncertainty: a “generalized approach” and a “complete approach”. A scheme of the analysis conducted to estimate the model’s uncertainty is shown in Figure 8.

Figure 8.

Figure 8

Graphical scheme of the uncertainty estimation procedure, divided into three blocks: (1) Monte Carlo generation, which is the starting point for both consequent approaches, (2) the generalized approach, giving a general uncertainty value for each tested model k, and (3) the complete approach, which computes uncertainty values for each tested model k by also dividing data points by velocity group, thus accounting for their different characteristics.

Both approaches are based on a common starting point based on a Monte Carlo generation of simulated data points. The procedure, called “Monte Carlo generation” in Figure 8, is as follows:

  1. Original data distributions definition. In real-world applications, the values of each point’s speed and depth are affected by uncertainty. To validate the proposed experimental model, their variability was empirically set to σV=1 px/s and σZ=0.005 m, respectively. These two values were estimated considering the overall data acquired and the ArUco markers documentation [36,37,38]. Now, considering a certain set of points Datak, for each point (Vp,k, Zp,k) contained in it, two Gaussian distributions were built using the data point’s actual values as the mean, μV and μZ, and the variabilities defined before, σV and σZ, as the distribution’s spread.

  2. Synthetic data generation. From the distributions of V and Z, a total of 10,000 simulated data points (V^p,k, Z^p,k) were generated for each original point. This process produces a table Synthk composed of 10,000 rows and Pk columns. The synthetic data generation procedure was repeated for each tested robot’s velocity S1S5, for both original and filtered data, obtaining a total of k=110 tables Synthk.

  3. Estimation of parameter K. Each row r=110,000 of Synthk now contains Pk synthetic points. Therefore, it is possible to estimate parameter K for that specific row, Kr,k, by fitting the experimental model to the data in the row. Repeating this process for all rows produces 10,000 estimated values of Kr,k.

2.8. Uncertainty Estimation Using the Generalized Approach

The following steps are conducted after the “Monte Carlo generation” procedure described in Section 2.7 and are graphically shown in Figure 8 in the block called “Generalized approach”.

Given the 10 tables Synthk and the corresponding Kr,k for each row r=110,000, for each point p=1Pk in the row, it is possible to compute Z˜r,p,k, which is the depth value output produced by applying the experimental model in Equation (5). Then, residuals εr,p,k of each point are calculated as the absolute difference between the synthetic depth Z^r,p,k and the depth obtained from the experimental model, Z˜r,p,k:

Z˜r,p,k=Kr,kV^r,p,k, (6)
εr,p,k=Z^r,p,kZ˜r,p,k, (7)

where Z^r,p,k and V^r,p,k are the depth and the image speed of a synthetic data point p in row r, respectively, and Kr,k is the experimental model parameter K estimated for each row r=110,000 of the data set Datak at the end of the “Monte Carlo generation” procedure.

Then, for each row r, it is possible to compute the sample standard deviation σr,k as

σr,k=pPεr,p,k2P1, (8)

where P corresponds to the original number of data points Pk in the columns of table Synthk.

Finally, we compute the root mean square (RMS) of all the σr,k values, which is the final uncertainty for each model, uk=RMSk:

RMSk=mean(σr,k2). (9)

2.9. Uncertainty Estimation Using the Complete Approach

The following steps are conducted after the “Monte Carlo generation” procedure described at the start of Section 2.7 and are graphically shown in Figure 8 in the block called “Complete approach”.

After generating the 10 tables Synthk and obtaining the r=110,000 model parameters Kr,k, the experimental model in Equation (5) is now estimated for each robot speed, for a total of k=110 models, Modelk. Using as the input the estimated Kr,k and the original image velocity data of each point Vp,k (given from the OF), the model outputs a depth value Z˜p,k. Repeating this step for all the points in all Datak tables produces new data tables Modelk, in which each point is described as the pair (Vp,k, Z˜p,k).

As already discussed at the start of Section 2.7, given the exponential nature of the model, the data points for which V was close to 0 px/s demonstrated a behavior very different from those belonging to the latter part of the graph where V was close to 3000 px/s. To analyze this issue, the data points have been split into groups according to their Vp,k value. For all Modelk tables, the data points were divided into groups according to the value of Vp,k. By considering the range of possible values of Vp,k for that specific Modelk, groups were created with a step of 30 px/s (e.g., group 1 contained points with Vp,k in the range [10, 40) px/s, group 2 with Vp,k in the range [40, 70), etc.). This resulted in a certain number of groups g=1G according to the specific table Modelk considered. The RMSk values described by Equation (9) are now calculated on the data belonging to each group g for each model k; namely, σg,r,k is obtained by using Equation (8) and εg,r,p,k, as in Equation (6). This produces group-separated RMS values for the specific experimental model k considered, RMSg,k. The final uncertainty in this approach is composed of two contributions for each model k and it is separated by velocity group g: one is given by the RMSg,k, and the other is given by the standard deviation σZ˜g,k of the depth values obtained by applying the experimental model and is divided by group, namely Z˜g,p,k. The computation of the group-separated uncertainty for each model k is given by the following:

ug,k=RMSg,k2+σZ˜g,k2. (10)

3. Results and Discussion

The resulting uncertainties for the generalized approach are shown in Table 1 for both original and filtered data. The effect of the window-based filter is evident for S1 and S2, for which the overall uncertainty is reduced by 50%, while for S3 the effective uncertainty reduction is only 20%. In the case of S4 and S5, the effect is even less evident, reducing the uncertainty of just 0.01 m in both cases (5%). This effect is related to the number of frames acquired according to the robot’s speed; in fact, since the camera’s acquisition rate is the same in all tests (60 fps), at lower speeds, more images are acquired representing the same scene; thus, there is not sufficient displacement in between two consecutive pictures for OF to work well. By sub-sampling the data, the effect is to virtually reduce the camera’s acquisition rate to 20 fps, incrementing the spatial difference between two consecutive frames and thus improving OF estimation. In addition, after applying the filtering, the overall uncertainty for S1=0.25 m/s is the same as the one obtained for S3=0.75 m/s. Generally, best-case scenarios are obtained for S2=0.50 m/s and for S3=0.75 m/s. These speeds are acceptable for the agricultural scenario considered. Figure 9 shows the histograms of the distribution of values σr,k from which the final uncertainty RMSk=uk is computed, for the best-case scenarios (models with S2 and S3), both with no filter and with a filter applied. All the other models were omitted for brevity since the resulting histograms are the same.

Table 1.

Summary of resulting generalized uncertainty for all tested robot speeds S1S5. Data is shown for both the original data (u15) and for the data resulting from the window-based filtering procedure (u610).

Robot Speed [m/s] u15 [m] u610 [m]
0.25 0.15 0.07
0.50 0.08 0.04
0.75 0.09 0.07
0.95 0.20 0.19
0.97 0.22 0.21

Figure 9.

Figure 9

Histograms showing the distribution of values σr,k (blue bars) and the resulting RMSk (red dashed line). (a,b) Histograms of models with k=2 and k=7 (S2=0.50 m/s) with no filtering and filtering applied, respectively. (c,d) Histograms of models with k=3 and k=8 (S3=0.75 m/s) with no filtering and filtering applied, respectively.

As for the complete approach, the results of RMSg,k and σZ˜g,k are shown in Figure 10a and Figure 10c, respectively, for models with k=15 (no filtering), and in Figure 10b and Figure 10d, respectively, for models with k=610 (filtering applied). Obviously, the groups of each model are not comparable since they depend on the numerosity of points belonging to the group, which in turn depends on the total number of points Pk in the original data set Datak. In Figure 10e,f the contribution of RMSg,3 and σZ˜g,3 towards the computation of the total uncertainty ug,3, for S3=0.75 m/s in both the unfiltered and filtered cases (models with k=3 and k=8, respectively), is displayed. The results for the other robot speeds S1, S2, S4, and S5 are similar, so they were omitted for brevity. It is evident that almost all the contribution is due to RMSg,k, and its trend indicates that estimating the depth of a given point from OF is not robust in the first area of the graph where the speed of the point (OF) is lower than 500–800 px/s according to the camera’s speed. However, for point speed higher than this value, the overall uncertainty is reduced to less than 20 cm.

Figure 10.

Figure 10

(a,b) Graphs showing the RMSg,k values for models with k=15 (no filtering) and with k=610 (filtering applied), respectively. (c,d) Graphs showing the σZ˜g,k values for models with k=15 (no filtering) and with k=610 (filtering applied), respectively. (e,f) Graphs showing, for S3=0.75 m/s, the contribution of RMSg,3 (blue dashed line with “X” markers) and σZ˜g,3 (red dotted line with “squared” markers) towards the computation of the total uncertainty ug,3 (green solid line with “circular” markers), for k=3 (no filtering) and k=8 (filtering applied), respectively.

Measurement uncertainty on the computation of depth is even less than 100 mm for robot speeds equal to S2=0.50 m/s and S3=0.75 m/s, while it increases for the other three (see Figure 10a,c). This is interesting because it highlights that moving at a very low speed (S1=0.25 m/s =0.9 km/h) gives similar uncertainty values to those obtained when moving at higher speeds (S4=0.95 m/s =3.4 km/h and S5=0.97 m/s =3.5 km/h). This effect can be explained by how depth is estimated from OF. When the robot moves too slowly (Vcamera is too low), there is not sufficient pixel difference between consecutive pairs of images for OF to produce an accurate estimation of Vimage=OFi,m, while the accuracy of the ArUco markers’ apparent depth Zi,m is higher with lower uncertainty due to a more stable image frame. On the other hand, when the robot moves faster, the estimation of Vimage improves until the amount of pixel difference between consecutive images is too high for OF to produce a valid estimation since the two images could be so different from each other that the point matching fails. Consequently, the accuracy in the estimation of apparent depth becomes less accurate when Vcamera increases. Evidently, by applying the window-based filtering, the variability of OF estimation is reduced, and the overall depth estimation is improved despite losing data points (see Figure 10b,d). This drawback is acceptable for the target application (agriculture) and other applications for which sampling one data point in every three is not an issue.

To conclude, the sub-sampling of data points is closely related to the camera acquisition speed (fps): given the issue of OF not being able to produce reliable outputs when the image pairs’ difference is too low, adopting cameras with high fps is not the best choice. On the other hand, considering the possibility of sub-sampling the acquired frames, choosing a camera with less than 30 fps could lead to similar issues. In both cases, the outcome may be similar to the model produced for S1. It is worth noting that a few points are missing in the region of 2000–2300 px/s in Figure 10e,f. This is due to poor detection of some markers happening for V3 and the lack of detected points for that Vimage speed range in the case of V1 and V2, as can be noted in Figure 10a,b. Poor detections may occur if the acquired image is too blurred (as is the case for V3) or if illumination conditions create aberrations on the ArUco markers, resulting in incorrect or missing detections. However, considering the points present for the speed range higher than 2300 px/s, it is safe to say that the overall trend of the two curves in Figure 10e,f is unaltered.

3.1. Limitations of the Method

It is worth stressing that the present article describes the theory of the measurement principle validated through a laboratory experiment. Therefore, the choices made to design the setup and its constraints (Section 2.1) refer to agricultural scenarios in general but can be easily generalized to other environments as well (e.g., industry and autonomous driving). An important limitation of this approach is that the validation was only conducted indoors; hence, other impacting parameters typical of the outdoor environment were not studied in detail. These are as follows: (1) The illumination of the scene, which must be sufficient to see the points of interest in the image but, at the same time, not too bright to avoid aberrations or light spots that may interfere with the recognition of the targets. This is especially important since bright spots and blurred areas of the targets produce incorrect or missing detections. To this aim, an interesting design choice is the addition of an illumination system, for example, in the case of industrial or dark scenarios, to improve the clarity and contrast of image details. (2) The eventual vibrations of the moving vehicle, which may produce incorrect estimations of the OF values and should therefore be compensated either during acquisition or afterwards. (3) Adverse meteorological conditions that may interfere with data acquisition, including dirt. However, assessing the impact of these factors on the method and eventually proposing algorithmic solutions to overcome them requires a more in-depth outdoor study, which will be conducted in the future as a further advancement of the proposed method.

Concerning the inference time, e.g., the time required for a new pair of consecutive frames to be processed by the proposed model and pipeline to produce a final depth estimation, some considerations should be made. (1) The inference can be applied only to pairs of frames because OF is calculated on their differences; thus, we should consider the time required by the camera to acquire and save two frames (either on disk or in volatile memory). This may be a crucial bottleneck, especially if the acquisition hardware is an embedded platform, slowing down acquisition to even 10 fps instead of the typical 30 fps of most common devices. (2) At least one point of interest should be detected in both images of the pair. In the case of our experiment, the points of interest were the ArUco markers, but for other applications, it can be something else. This detection requires dedicated software, which may be AI-based instead of classical computer vision techniques (as is the case for the ArUco markers library adopted in this work). The ArUco markers were detected at a speed of approximately 100 ms per image. Keep in mind that AI-based software is often slower, requiring even 500 ms per image. (3) OF should be computed on the points of interest detected in the pair of images, which is a fast and well-known algorithm that benefits from GPU acceleration if available, requiring only a few ms per pair of images. (4) The final step is the application of the proposed model to obtain the depth estimation in meters according to the estimated Vimage speed obtained, which is a simple formula requiring a few ns to be computed. All things considered, we experimentally found that for our setup the average inference time is approximately 300 ms.

3.2. Practical Examples

In the next sections, two examples provide some practical conclusions for agronomists and researchers aiming at conducting on-the-go depth measurements using the proposed method, taking into consideration the uncertainty behavior explained above. In particular, the examples explain (1) how to choose the correct vehicle speed given a specific camera, and (2) how to choose the correct camera for the target application given the vehicle speed (Vimage).

3.2.1. Choose the Correct Vehicle Speed Given a Specific Camera

Let us consider a camera with a sensor size of SX×SY=7.2×5.4 mm, pixel resolution of PRX×PRY=1600×1200 px, and 60 fps acquisition speed. The camera is fixed on a rigid case mounted on the side of a vehicle, so the pixels vary along the X axis of the camera (assuming negligible vibrations, so no movement along Y). From the results obtained and discussed in the previous section, it is assumed the minimum and maximum acceptable image velocity are Vimagemin=500 px/s and Vimagemax to keep measurement uncertainty on estimated depth values acceptable for the target application.

Consider adopting a maximum vehicle speed equal to Vvehiclemax=4·Vvehiclemin. In these conditions, the camera obtains a pair of frames after a time dT=1/fps=1/60 s, so the amount of pixels that change in two consecutive images according to the chosen speed is calculated as Cpx=Vimage·dT. Using the minimum and maximum image speeds, it results that Cpxmin8 px and Cpxmax33 px.

Considering the target application of an agriculture scenario where a tractor moves in between two rows, it is known that the inter-row distance is typically set to 2 m, so it can be reasonably assumed that the working distance WD, which is the camera–object distance, is WD800 mm. By using the following formula, we can calculate the camera’s field of view (FoV) along the X axis:

FoVmm=SX·WDf, (11)

where f is the focal length of the chosen optics. Several options exist on the market, and the choice depends on the magnification effect desired according to the operating WD, recalling from the basics of digital photography [43] that higher values of f correspond to a higher magnification and a narrow FoV, while lower values correspond to the opposite case. In this example, let us assume f=8 mm, corresponding to FoVmm=720 mm. Now it is possible to calculate the millimeters-to-pixels resolution ratio Rs as follows:

Rs=FoVmmPRX. (12)

A low value of Rs corresponds to a higher number of pixels needed to represent a millimeter, thus producing image details with a higher number of pixels. Having a sufficient number of pixels to represent the smallest target object in the image is key to successful computer vision applications. For this example, it results in Rs=0.45 mm/px. Then, the minimum and maximum vehicle speeds can be obtained using the following formulas:

Vvehiclemin=Rs·Vimagemin, (13)
Vvehiclemax=Rs·4·Vimagemin. (14)

This results in Vvehiclemin=225 mm/s =0.23 m/s =0.81 km/h and Vvehiclemax=0.90 m/s =3.24 km/h. Given the described camera, the tractor must maintain its velocity between Vvehiclemin and Vvehiclemax to obtain the most reliable depth measurements.

3.2.2. Choose the Correct Camera for the Target Application Given the Vehicle Speed

In the second scenario, the vehicle speed and the working distance are already defined and the question is about the choice of the right camera for the application. For the sake of the example, let us consider Vvehicle=5 km/h =1.39 m/s =1388.89 mm/s and WD=1000 mm. Hence, by inverting Equation (13) and considering Vimagemin=500 px/s (which is the optimal image velocity resulting from our experiments), the result is Rs=2.78 mm/px. Let us assume that, for the target application, at least FoVmm=1000 mm must be viewed by the camera. Inverting Equation (12) results in PRX=360 px. The end-user is now encouraged to search among the plethora of available cameras on the market with a pixel resolution along X at least equal to 360 px, a requirement easily fulfilled nowadays. After selecting the best product for its needs, it is straightforward to also select the optics by using Equation (11) to find the value of f. Once again, the resulting value is the optimal one obtained from mathematical formulations; thus, it may differ from what is available on the market. The process of finding the right camera and optics may be a long one since it requires adjustments and comparisons with several products.

Finally, two considerations should be made about possible aberrations appearing in the acquired frames when using cameras for on-the-go acquisitions. The first issue is motion blur, an effect appearing when the vehicle moves too fast and the camera acquires too slowly. In this case, it is recommended to use cameras with high fps. The second issue is about the camera’s shutter. Low-cost cameras are typically equipped with a rolling shutter; however, this results in the upper and bottom portions of the image corresponding to two different scenes, an issue especially for fast movements (a typical example is the picture of a fan). To avoid this issue, it is recommended to choose cameras equipped with a global shutter.

4. Conclusions

The presented work deals with the topic of depth estimation from a moving monocular 2D camera, leveraging optical flow. Starting from the classic analytical model used to estimate depth from moving images, an experimental model that is easier to apply for end-users is proposed and validated. The experimental setup comprises a robot that simulates the moving vehicle, on which the camera is mounted. The target measurand is a rigid frame with five bars of different lengths mounted on it to simulate objects positioned at different depths. On top of each bar, an ArUco marker was fixed. A total of five experiments were conducted by actuating the robot at five speeds, each time recording a video that contains both positive and negative robot motions. The developed software analyzes the videos to extract the ArUco markers’ apparent depth (Z) and compute the optical flow (Vimage) from pairs of images. A window-based moving average filter was developed and applied to the acquired data to reduce noise and improve the final uncertainty.

The core of this work lies in the metrological validation of the system and the computation of measurement uncertainty, for which two approaches are proposed as follows: the generalized approach and the complete approach. In the case of the generalized approach, the best-case scenario is obtained for robot speeds equal to S2=0.50 m/s and S3=0.75 m/s, for which the corresponding uncertainty on depth estimation is u2=0.08 m (no filter) and u7=0.04 m (filter applied) for S2, and u3=0.09 m (no filter) and u8=0.07 m (filter applied) for S3. The complete approach separates the depth data according to their speed Vimage to deal with the exponential nature of the models, producing group-separated results for each model with and without filtering. Again, the best case scenario is obtained for robot speed equal to S2 or S3.

Another interesting conclusion useful for end-users is that the experiments highlighted that low image speeds increase the total uncertainty when the pixel displacement between two consecutive images is insufficient, thus reducing the robustness of the optical flow algorithm. This effect directly impacts both the vehicle speed and the camera’s acquisition rate. Generally, it is shown that for image speeds higher than 500–800 px/s, uncertainty on depth estimation drops below 200 mm. This outcome is especially useful for applications such as on-the-go depth measurements in scenarios where 100 to 200 mm uncertainty is acceptable, such as agriculture. In fact, in this specific context, it is common to have low-cost cameras mounted on moving vehicles and the main questions are as follows: “Given a certain camera, at which speed should the vehicle move to get stable depth readings?” and “Which camera should be bought if the vehicle moves at a certain speed?” To answer these questions, two practical examples are presented and discussed, showing how the proposed work can be effectively employed by end-users in the two most common scenarios.

To conclude, the presented work is a stepping stone towards the development of reliable, easy-to-use, and low-cost embedded measuring systems suitable for in-field measurements for a plethora of applications, especially in agriculture. Future work will be devoted to the optimization of the presented methodology by combining it with modern Deep Learning models; for example, for fruit counting and measurements.

Acknowledgments

The authors wish to thank Matteo Lancini and Davide Botturi for their aid and support during the development of this work.

Author Contributions

Conceptualization, B.L. and S.P.; methodology, B.L. and C.N.; software, B.L.; validation, B.L., C.N. and S.P.; formal analysis, B.L. and C.N.; investigation, B.L.; resources, C.N.; data curation, C.N.; writing—original draft preparation, B.L. and C.N.; writing—review and editing, S.P.; visualization, C.N.; supervision, S.P.; project administration, S.P. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data set is available upon request from the authors. The developed code used for this work is made publicly available on GitHub (see reference [33]).

Conflicts of Interest

The authors declare no conflicts of interest.

Funding Statement

This work was supported by the European Union, FSE-REACT-EU, PON “Research and Innovation 2014–2020”, D.M. 1061/2021, contract number DOT1346224-8.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

References

  • 1.Wang Z., Menenti M. Challenges and Opportunities in Lidar Remote Sensing. Front. Remote Sens. 2021;2:641723. doi: 10.3389/frsen.2021.641723. [DOI] [Google Scholar]
  • 2.Horn B.K., Schunck B.G. Determining optical flow. Artif. Intell. 1981;17:185–203. doi: 10.1016/0004-3702(81)90024-2. [DOI] [Google Scholar]
  • 3.Lucas B.D., Kanade T. An Iterative Image Registration Technique with an Application to Stereo Vision; Proceedings of the IJCAI’81: 7th international joint conference on Artificial intelligence; Vancouver, BC, Canada. 24–28 August 1981; pp. 674–679. [Google Scholar]
  • 4.Savian S., Elahi M., Tillo T. Deep Biometrics. Springer International Publishing; Cham, Switzerland: 2020. Optical Flow Estimation with Deep Learning, a Survey on Recent Advances; pp. 257–287. [DOI] [Google Scholar]
  • 5.Barron J.L., Fleet D.J., Beauchemin S.S. Performance of optical flow techniques. Int. J. Comput. Vis. 1994;12:43–77. doi: 10.1007/BF01420984. [DOI] [Google Scholar]
  • 6.Wu J., Liu S., Wang Z., Zhang X., Guo R. Dynamic depth estimation of weakly textured objects based on light field speckle projection and adaptive step length of optical flow method. Measurement. 2023;214:112834. doi: 10.1016/j.measurement.2023.112834. [DOI] [Google Scholar]
  • 7.Benameur N., Kraim T., Arous Y., Benabdallah N. The Assessment of left ventricular Function in MRI using the detection of myocardial borders and optical flow approaches: A Review. Int. J. Cardiovasc. Pract. 2017;2:73–75. doi: 10.21859/ijcp-030101. [DOI] [Google Scholar]
  • 8.Chao H., Gu Y., Napolitano M. A Survey of Optical Flow Techniques for Robotics Navigation Applications. J. Intell. Robot. Syst. 2014;73:361–372. doi: 10.1007/s10846-013-9923-6. [DOI] [Google Scholar]
  • 9.Diamond D., Heyns P., Oberholster A. Accuracy evaluation of sub-pixel structural vibration measurements through optical flow analysis of a video sequence. Measurement. 2017;95:166–172. doi: 10.1016/j.measurement.2016.10.021. [DOI] [Google Scholar]
  • 10.Nagano S., Moriyuki S., Wakamori K., Mineno H., Fukuda H. Leaf-Movement-Based Growth Prediction Model Using Optical Flow Analysis and Machine Learning in Plant Factory. Front. Plant Sci. 2019;10:227. doi: 10.3389/fpls.2019.00227. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Srokosz P.E., Bujko M., Bocheńska M., Ossowski R. Optical flow method for measuring deformation of soil specimen subjected to torsional shearing. Measurement. 2021;174:109064. doi: 10.1016/j.measurement.2021.109064. [DOI] [Google Scholar]
  • 12.Ranftl R., Vineet V., Chen Q., Koltun V. Dense Monocular Depth Estimation in Complex Dynamic Scenes; Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Los Alamitos, CA, USA. 27–30 June 2016; pp. 4058–4066. [DOI] [Google Scholar]
  • 13.Saxena S., Herrmann C., Hur J., Kar A., Norouzi M., Sun D., Fleet D.J. The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation. arXiv. 2023 doi: 10.48550/arXiv.2306.01923.2306.01923 [DOI] [Google Scholar]
  • 14.Lasinger K., Ranftl R., Schindler K., Koltun V. Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer. arXiv. 2019 doi: 10.48550/arXiv.1907.01341.1907.01341 [DOI] [PubMed] [Google Scholar]
  • 15.Bhat S.F., Birkl R., Wofk D., Wonka P., Müller M. ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth. arXiv. 2023 doi: 10.48550/arXiv.2302.12288.2302.12288 [DOI] [Google Scholar]
  • 16.Li Z., Bhat S.F., Wonka P. PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation. arXiv. 2023 doi: 10.48550/arXiv.2312.02284.2312.02284 [DOI] [Google Scholar]
  • 17.Ke B., Obukhov A., Huang S., Metzger N., Daudt R.C., Schindler K. Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation. arXiv. 2024 doi: 10.48550/arXiv.2312.02145.2312.02145 [DOI] [Google Scholar]
  • 18.Memon Z.W., Chen Y., Zhang H. An Adaptive Threshold-Based Pixel Point Tracking Algorithm Using Reference Features Leveraging the Multi-State Constrained Kalman Filter Feature Point Triangulation Technique for Depth Mapping the Environment. Sensors. 2025;25:2849. doi: 10.3390/s25092849. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Ghasemieh A., Kashef R. Advanced Monocular Outdoor Pose Estimation in Autonomous Systems: Leveraging Optical Flow, Depth Estimation, and Semantic Segmentation with Dynamic Object Removal. Sensors. 2024;24:8040. doi: 10.3390/s24248040. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Wang S., Xu S., Ma Z., Wang D., Li W. A Systematic Solution for Moving-Target Detection and Tracking While Only Using a Monocular Camera. Sensors. 2023;23:4862. doi: 10.3390/s23104862. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Shimada T., Nishikawa H., Kong X., Tomiyama H. Pix2Pix-Based Monocular Depth Estimation for Drones with Optical Flow on AirSim. Sensors. 2022;22:2097. doi: 10.3390/s22062097. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Cui X.Z., Feng Q., Wang S.Z., Zhang J.H. Monocular Depth Estimation with Self-Supervised Learning for Vineyard Unmanned Agricultural Vehicle. Sensors. 2022;22:721. doi: 10.3390/s22030721. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Gao T., Li M., Xue L., Bao J., Lian H., Li T., Shi Y. Height-Variable Monocular Vision Ranging Technology for Smart Agriculture. IEEE Access. 2023;11:92847–92856. doi: 10.1109/ACCESS.2023.3305964. [DOI] [Google Scholar]
  • 24.Zhang L., Hao Q., Mao Y., Su J., Cao J. Beyond Trade-Off: An Optimized Binocular Stereo Vision Based Depth Estimation Algorithm for Designing Harvesting Robot in Orchards. Agriculture. 2023;13:1117. doi: 10.3390/agriculture13061117. [DOI] [Google Scholar]
  • 25.Ponnambalam V.R., Bakken M., Moore R.J.D., Glenn Omholt Gjevestad J., Johan From P. Autonomous Crop Row Guidance Using Adaptive Multi-ROI in Strawberry Fields. Sensors. 2020;20:5249. doi: 10.3390/s20185249. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Ozyesil O., Voroninski V., Basri R., Singer A. A Survey of Structure from Motion. arXiv. 2017 doi: 10.48550/arXiv.1701.08493.1701.08493 [DOI] [Google Scholar]
  • 27.Mac Aodha O., Humayun A., Pollefeys M., Brostow G.J. Learning a confidence measure for optical flow. IEEE Trans. Pattern Anal. Mach. Intell. 2013;35:1107–1120. doi: 10.1109/TPAMI.2012.171. [DOI] [PubMed] [Google Scholar]
  • 28.Evaluation of Measurement Data—Guide to the Expression of Uncertainty in Measurement. Joint Committee for Guides in Metrology; Sevres, France: 2008. [(accessed on 1 January 2024)]. Available online: https://www.bipm.org/documents/20126/2071204/JCGM_100_2008_E.pdf/cb0ef43f-baa5-11cf-3f85-4dcd86f77bd6. [Google Scholar]
  • 29.Guide to the Expression of Uncertainty in Measurement—Part 6: Developing and Using Measurement Models. Joint Committee for Guides in Metrology; Sevres, France: 2020. [(accessed on 1 January 2024)]. Available online: https://www.bipm.org/documents/20126/2071204/JCGM_GUM_6_2020.pdf/d4e77d99-3870-0908-ff37-c1b6a230a337. [Google Scholar]
  • 30.Lanza B., Botturi D., Gnutti A., Lancini M., Nuzzi C., Pasinetti S. A Stride Toward Wine Yield Estimation from Images: Metrological Validation of Grape Berry Number, Radius, and Volume Estimation. Sensors. 2024;24:7305. doi: 10.3390/s24227305. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Botturi D., Gnutti A., Nuzzi C., Lanza B., Pasinetti S. STEWIE: eSTimating grapE berries number and radius from images using a Weakly supervIsed nEural network; Proceedings of the 2023 IEEE International Workshop on Metrology for Agriculture and Forestry (MetroAgriFor); Pisa, Italy. 6–8 November 2023; pp. 277–282. [DOI] [Google Scholar]
  • 32.Lanza B., Nuzzi C., Botturi D., Pasinetti S. First Step Towards Embedded Vision System for Pruning Wood Estimation; Proceedings of the 2023 IEEE International Workshop on Metrology for Agriculture and Forestry (MetroAgriFor); Pisa, Italy. 6–8 November 2023; pp. 757–762. [DOI] [Google Scholar]
  • 33.Lanza B. Depth Estimation from Optical Flow in Agricultural Fields. 2024. [(accessed on 24 May 2025)]. Available online: https://github.com/bernardolanza93/DepthFromOpticalFlow.git.
  • 34.Umbrico A., Orlandini A., Cesta A., Faroni M., Beschi M., Pedrocchi N., Scala A., Tavormina P., Koukas S., Zalonis A., et al. Design of Advanced Human–Robot Collaborative Cells for Personalized Human–Robot Collaborations. Appl. Sci. 2022;12:6839. doi: 10.3390/app12146839. [DOI] [Google Scholar]
  • 35.Zhang Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 2000;22:1330–1334. doi: 10.1109/34.888718. [DOI] [Google Scholar]
  • 36.Garrido-Jurado S., Muñoz-Salinas R., Madrid-Cuevas F., Marín-Jiménez M. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognit. 2014;47:2280–2292. doi: 10.1016/j.patcog.2014.01.005. [DOI] [Google Scholar]
  • 37.Garrido-Jurado S., Muñoz-Salinas R., Madrid-Cuevas F., Medina-Carnicer R. Generation of fiducial marker dictionaries using Mixed Integer Linear Programming. Pattern Recognit. 2016;51:481–491. doi: 10.1016/j.patcog.2015.09.023. [DOI] [Google Scholar]
  • 38.Romero-Ramirez F.J., Muñoz-Salinas R., Medina-Carnicer R. Speeded up detection of squared fiducial markers. Image Vis. Comput. 2018;76:38–47. doi: 10.1016/j.imavis.2018.05.004. [DOI] [Google Scholar]
  • 39.Huamán A. OpenCV Official Documentation. Changing the Contrast and Brightness of an Image. 2016. [(accessed on 4 May 2024)]. Available online: https://docs.opencv.org/4.x/d3/dc1/tutorial_basic_linear_transform.html.
  • 40.Gené-Mola J., Felip-Pomés M., Net-Barnés F., Morros J.R., Miranda J.C., Arnó J., Asín L., Lordan J., Ruiz-Hidalgo J., Gregorio E. Video-Based Fruit Detection and Tracking for Apple Counting and Mapping; Proceedings of the 2023 IEEE International Workshop on Metrology for Agriculture and Forestry (MetroAgriFor); Pisa, Italy. 6–8 November 2023; pp. 301–306. [DOI] [Google Scholar]
  • 41.Ferrer-Ferrer M., Ruiz-Hidalgo J., Gregorio E., Vilaplana V., Morros J.R., Gené-Mola J. Simultaneous fruit detection and size estimation using multitask deep neural networks. Biosyst. Eng. 2023;233:63–75. doi: 10.1016/j.biosystemseng.2023.07.010. [DOI] [Google Scholar]
  • 42.Scipy Official Documentation. Curve Fit Function of Scipy Optimize Library. 2022. [(accessed on 4 May 2024)]. Available online: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html.
  • 43.Rowlands D.A. Physics of Digital Photography. 2nd ed. IOP Publishing; Bristol, UK: 2020. pp. 2053–2563. [DOI] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The data set is available upon request from the authors. The developed code used for this work is made publicly available on GitHub (see reference [33]).


Articles from Sensors (Basel, Switzerland) are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)

RESOURCES