Learning to soar in turbulent environments

Gautam Reddy; Antonio Celani; Terrence J Sejnowski; Massimo Vergassola

doi:10.1073/pnas.1606075113

. 2016 Aug 1;113(33):E4877–E4884. doi: 10.1073/pnas.1606075113

Learning to soar in turbulent environments

Gautam Reddy ^a, Antonio Celani ^b, Terrence J Sejnowski ^c,^d,¹, Massimo Vergassola ^a

PMCID: PMC4995969 PMID: 27482099

Significance

Thermals are ascending currents that typically extend from the ground up to the base of the clouds. Birds and gliders piggyback thermals to fly with a reduced expenditure of energy, for example, during migration, and to extend their flying range. Flow in the thermals is highly turbulent, which poses the challenge of the orientation in strongly fluctuating environments. We combine numerical simulations of atmospheric flow with reinforcement learning methods to identify strategies of navigation that can cope with and even exploit turbulent fluctuations. Specifically, we show how the strategies evolve as the level of turbulent fluctuations increase, and we identify those sensorimotor cues that are effective at directing turbulent navigation.

Keywords: thermal soaring, turbulence, navigation, reinforcement learning

Abstract

Birds and gliders exploit warm, rising atmospheric currents (thermals) to reach heights comparable to low-lying clouds with a reduced expenditure of energy. This strategy of flight (thermal soaring) is frequently used by migratory birds. Soaring provides a remarkable instance of complex decision making in biology and requires a long-term strategy to effectively use the ascending thermals. Furthermore, the problem is technologically relevant to extend the flying range of autonomous gliders. Thermal soaring is commonly observed in the atmospheric convective boundary layer on warm, sunny days. The formation of thermals unavoidably generates strong turbulent fluctuations, which constitute an essential element of soaring. Here, we approach soaring flight as a problem of learning to navigate complex, highly fluctuating turbulent environments. We simulate the atmospheric boundary layer by numerical models of turbulent convective flow and combine them with model-free, experience-based, reinforcement learning algorithms to train the gliders. For the learned policies in the regimes of moderate and strong turbulence levels, the glider adopts an increasingly conservative policy as turbulence levels increase, quantifying the degree of risk affordable in turbulent environments. Reinforcement learning uncovers those sensorimotor cues that permit effective control over soaring in turbulent environments.

Migrating birds and gliders use upward wind currents in the atmosphere to gain height while minimizing the energy cost of propulsion by the flapping of the wings or engines (1, 2). This mode of flight, called soaring, has been observed in a variety of birds. For instance, birds of prey use soaring to maintain an elevated vantage point in their search for food (3); migrating storks exploit soaring to cover large distances in their quest for greener pastures (4). Different forms of soaring have been observed. Of particular interest here is thermal soaring, where a bird gains height by using warm air currents (thermals) formed in the atmospheric boundary layer. For both birds and gliders, a crucial part of the thermal soaring is to identify a thermal and to find and maintain its core, where the lift is typically the largest. Once migratory birds have climbed up to the top of a thermal, they glide down to the next thermal and repeat the process, a migration strategy that strongly reduces energy costs (4). Soaring strategies are also important for technological applications, namely, the development of autonomous gliders that can fly large distances with minimal energy consumption (5).

Thermals arise as ascending convective plumes driven by the temperature gradient created due to the heating of the earth’s surface by the sun (6). Hydrodynamic instabilities and processes that lead to the formation of a thermal inevitably give rise to a turbulent environment characterized by strong, erratic fluctuations (7, 8). Birds or gliders attempting to find and maintain a thermal face the challenge of identifying the potentially long-lived and large-scale wind fluctuations amid a noisy turbulent background. The structure of turbulence is highly complex, with fluctuations occurring at many different scales and long-ranged correlations in space and time (9, 10). We thereby expect nontrivial correlations between the large-scale convective plumes and the locally fluctuating quantities. Thermal soaring is a particularly interesting example of navigation within turbulent flows, because the velocity amplitudes of a glider or bird are of the same order of magnitude as the fluctuating flow they are immersed in.

It has been frequently observed and attested by glider pilots that birds are able to identify and navigate thermals more accurately than human pilots endowed with modern instrumentation (11). It is an open problem, however, what sensorimotor cues are available to birds and how they are exploited, which constitutes a major motivation for the present study.

An active agent navigating a turbulent environment has to gather information about the fluctuating flow while simultaneously using the flow to ascend. Thus, the problem faced by the agent bears similarities to the general problem of balancing exploration and exploitation in uncertain environments, which has been well studied in the reinforcement learning framework (12). The general idea of reinforcement learning is to selectively reinforce actions that are highly rewarding and thereby have the reinforced actions chosen when the situation reoccurs. The solution to a reinforcement learning problem typically yields a behavioral policy that is approximately optimal, where optimality is defined in the sense of maximizing the reward function used to train the agent.

The previous description suggests that reinforcement learning methods are poised to deliver effective strategies of soaring flight. Past applications are indeed promising, yet they have considered the soaring problem in unrealistically simplified situations, with no turbulence or with fluctuations modeled as Gaussian white noise. Ref. 13 considered the learning problem associated with finding the center of a stationary thermal without turbulence, and used a neural-based algorithm to recover the empirical rules proposed by Reichmann (14) to locate the core of the thermal. Other attempts (15, 16) have used neural networks and Q-learning to find strategies to center a turbulence-free thermal. Akos et al. (17) show that these simple rules fail even in the presence of modest velocity fluctuations modeled as Gaussian white noise, and express the need for strategies that could work in realistic turbulent flows.

Here, we enforce realistic aerodynamic constraints on the flight of gliders and train them in complex turbulent environments by using reinforcement learning algorithms. We show that the glider finds an effective strategy for soaring, and we identify sensorimotor cues that are most relevant for guiding turbulent navigation. Our soaring strategy is effective even in the presence of strong fluctuations. The predicted strategy of flight lends itself to field experiments with remote-controlled gliders and to comparisons with the behavior of soaring birds.

Models

We first describe the models used for the simulation of the atmospheric boundary layer flow, the mechanics of flight, and the reinforcement learning algorithms that we have used. The next section will then present the corresponding results.

Modeling the Turbulent Environment.

Key physical aspects of the flow in the convective boundary layer are governed by Rayleigh–Bénard convection (see ref. 9 for a review). The corresponding equations are derived from the Navier–Stokes equations with coupled temperature and velocity fields simplified using the Boussinesq approximation. The dimensionless Rayleigh–Bénard equations read as follows:

\frac{\partial u}{\partial t} + u \cdot \nabla u = - \nabla P + {(\frac{Pr}{Ra})}^{1 / 2} \nabla^{2} u + θ \hat{z},

[1]

\frac{\partial θ}{\partial t} + u \cdot \nabla θ = \frac{1}{{(Pr Ra)}^{1 / 2}} \nabla^{2} θ,

[2]

where $u, θ$ , and P are the velocity, temperature, and pressure fields, respectively. The vertical direction coincides with the z axis. The temperature appears in the dynamics of the velocity field as a buoyant forcing term. The equations contain two dimensionless quantities that determine the qualitative behavior of the flow: the Rayleigh number, Ra, and the Prandtl number, Pr. When Ra is beyond a critical value $\sim 10^{3}$ , the thermally generated buoyancy drives the flow toward instability. In this regime, the flow is characterized by large-scale convective cells and turbulent eddies at every length scale. In the atmosphere, the Rayleigh number can reach up to Ra = 10¹⁵ to 10²⁰. In such high-Rayleigh number regimes, the flow is strongly turbulent and numerical simulations of convection in the atmosphere are thus plagued by the same limitations of simulating fully developed turbulent flows. We performed direct numerical simulations of Rayleigh–Bénard convection at Ra = $10^{8}$ using the Gerris Flow Solver (18) (see Supporting Information for more details about the grid and the numerical scheme). Our test arena is a 3D cubical box of side length 1 km in physical units. We impose periodic boundary conditions on the lateral walls and no-slip on the floor and the ceiling of the box. The floor is fixed at a high temperature (which is rescaled to $θ = 1$ ), and the ceiling is fixed at $θ = 0$ .

A small, random perturbation in the flow quickly leads to an instability and to the formation of coherent thermal plumes within the chamber. Snapshots of the velocity and temperature fields at the statistically stationary state are shown in Fig. 1A. The statistical properties of the flow are consistent with those observed in previous works (19, 20), particularly the Nusselt number (which measures the ratio of convective to conductive heat transfer) and the mean temperature and velocity field profiles (Fig. S1).

Fig. 1. — Snapshots of the vertical velocity (A) and the temperature fields (B) in our numerical simulations of 3D Rayleigh–Bénard convection. For the vertical velocity field, the red and blue colors indicate regions of large upward and downward flow, respectively. For the temperature field, the red and blue colors indicate regions of high and low temperature, respectively. Notice that the hot and cold regions drive the upward and downward branches of the convective cell, in agreement with the basic physics of convection. (C) The force-body diagram of flight with no thrust, that is, without any engine or flapping of wings. The figure also shows the bank angle μ (blue), the angle of attack α (green), and the glide angle γ (red). (D) The range of horizontal speeds and climb rates accessible by controlling the angle of attack. At small angles of attack, the glider moves fast but also sinks fast, whereas at larger angles, the glider moves and sinks more slowly. If the angle of attack is too high, at about $16 °$ , the glider stalls, leading to a sudden drop in lift. The vertical black dashed line shows the fixed angle of attack for most of the simulations (*Results*, *Control over the Angle of Attack*).

Fig. S1. — Some additional observables for the RB simulations, which were not shown in the main text for the sake of saving space. The four panels show (A) the root-mean-square (rms) velocity; (B) the horizontal and vertical rms velocities; (C) the mean temperature; (D) the profile of the Nusselt number vs. height. The behavior of these observables is consistent with results of previous numerical simulations, namely those in refs. 19 and 20.

To test the robustness of our learned policies of flight with respect to the modeling of turbulence, we also considered an alternative to the Rayleigh–Bénard flow. Specifically, we considered a kinematic model of turbulence that extends the one in ref. 21 to the inhomogeneous case relevant for the atmospheric boundary layer (Methods). Results for the kinematic model confirm the robustness of our conclusions and the learned policy has similar features in both flows (Supporting Information and Figs. S2–S4). Below, we shall focus on the simulations of the Rayleigh–Bénard flow described above.

Fig. S2. — Properties of the flow for the kinematic model of turbulence. A shows the mean-squared velocity profile, scaling initially as $z^{2 / 3}$ (*Inset*) and then transitioning to a constant behavior. These behaviors are analogous to those observed in the free convection layer and the mixed layer of the ABL. The constant value of the rms velocity in the mixed layer is defined as $u_{rms}$ . B shows that two initially proximal particles released in the flow separate as $r^{2} \sim t^{3}$ (black dashed line); that is, the Richardson’s superdiffusive law is well captured by our model. Small deviations are due to finite-size effects, and the observed exponent is 2.7 (blue solid line).

Fig. S4. — Learned policies for the kinematic model of turbulence. A and B show the learned policies at ${\hat{u}}_{rms} = 0$ and ${\hat{u}}_{rms} = 5$ . The policies are largely similar to ones for the RB flow that were presented in the main text. C shows a heat map of the optimal bank angles for negative $a_{z}$ and $τ < 0$ . We normalize the acceleration as ${\hat{a}}_{z} = a_{z} Δ t / v_{glider}$ , where Δt = 1 s and $v_{glider}$ is the airspeed of the glider. D shows a simplified version of the heat map, which can be interpreted as a “fluctuation filter,” analogously to what was discussed in the main text. The data points are the values of $(- {\hat{a}}_{z}, {\hat{u}}_{rms})$ corresponding to an optimal bank angle of $12.5 °$ . The “large” and “small” markers respectively denote values that require a strong bank action or relatively small values that are interpreted as fluctuations and are filtered out by the policy of flight.

Glider Mechanics.

A bird or glider flying in the flow described above with a fixed, stretched-out wing is safely assumed to be in mechanical equilibrium, except for centripetal forces while turning (22, 23). A glider with weight W traveling with velocity $v$ experiences a lift force L perpendicular to its velocity and a drag force D antiparallel to its velocity (see Fig. 1C for a force body diagram). The glider has no engine and thus generates no thrust. The magnitudes of the lift and the drag depend on the speed v, the angle of attack α, the density of air ρ and the surface area S of the wing as follows: $L = (1 / 2) ρ S v^{2} C_{L} (α)$ and $D = (1 / 2) ρ S v^{2} C_{D} (α)$ . The glide angle γ, which is the angle between the velocity and its projection on the horizontal, determines the ratio of the climb rate $v_{c} (< 0)$ and the horizontal speed $v_{⊥}$ . Balancing the forces on the glider, and accounting for the centripetal acceleration, the velocity of the glider and its turning rate are obtained as follows:

\tan γ = \frac{- v_{c}}{v_{⊥}} = \frac{D}{L cos μ} = \frac{C_{D} (α)}{C_{L} (α) cos μ};

[3]

\ddot{y} = g \cos γ tan μ; v^{2} = \frac{2 m g \sin γ}{ρ S C_{D} (α)} .

[4]

Here, $\ddot{y}$ is the centripetal acceleration. The ratio $m g / S$ is called the wing loading of the glider (22). The kinematics of a glider is therefore set by the wing loading and the dependence of the lift and the drag coefficients on the angle of attack. The general features of the lift and drag coefficient curves for a typical symmetric airfoil are described in ref. 24; the resulting dependence of the velocity on the angle of attack is shown in Fig. 1B. The glider can be maneuvered by controlling the angle of attack, which changes the speed and climb rate of the glider, or by banking the glider to turn.

The Learning Algorithm.

To identify effective strategies of soaring flight in turbulent flows, we used the reinforcement learning algorithm state–action–reward–state–action (SARSA) (12). Historically, the algorithm was inspired by the theory of animal learning, and its model-free nature allows for learning previously unknown strategies driven by feedback on performance (25).

Reinforcement learning problems are typically posed in the framework of a Markov decision process (MDP). In an MDP, the agent traverses a state space with transition probabilities that depend only on the current state s and the immediate next state $s'$ , as for a Markov process. The transition probabilities can be influenced by taking actions at each time step. After every action, the agent is given some reward $r (s, s', a)$ , which depends on the states s and $s'$ and the chosen action a. The ultimate goal of reinforcement learning algorithms is to find the optimal policy $π^{∗}$ , that is, to find the probability of choosing action a given the state s. The optimal policy maximizes for each state s the sum of discounted future rewards $V_{π_{s}^{a}} (s) = 〈 r_{0} 〉 + β 〈 r_{1} 〉 + β^{2} 〈 r_{2} 〉 + \dots$ , where $〈 r_{i} 〉$ is the expected reward after i steps, β is the discount factor ( $0 \leq β 1$ ), and the sum above obviously depends on the policy $π_{s}^{a}$ . When β is close to zero, the optimal policy greedily maximizes the expected immediate reward, leading to a purely exploitative strategy. As β gets closer to unity, later rewards contribute significantly and more exploratory strategies are preferred.

The SARSA algorithm finds the optimal policy by estimating for every state–action pair its Q function defined as the expected sum of future rewards given the current state s and the action a. At each step, the Q function is updated as follows:

Q (s, a) \to Q (s, a) + η (r + β Q (s', a') - Q (s, a)),

[5]

where r is the received reward and η is the learning rate. The update is made online and does not require any prior model of the flow or the flight. This feature is particularly relevant in modeling decision-making processes in animals. When the algorithm is close to convergence, the Q function approaches the solution to Bellman’s dynamic programming equations (12). The policy $π_{s}^{a}$ , which encodes the probability of choosing action a at state s, approaches the optimal one $π^{∗}$ and is obtained from the Q function via a Boltzmann-like expression:

π_{s}^{a} \propto \exp (- \hat{Q} (s, a) / τ_{temp}),

[6]

\hat{Q} (s, a) = \frac{\max_{a'} Q (s, a') - Q (s, a)}{\max_{a'} Q (s, a') - \min_{a'} Q (s, a')} .

[7]

Here, $τ_{temp}$ is an effective “temperature”: when $τ_{temp} ≫ 1$ , actions are only weakly dependent on the associated Q function ; conversely, for $τ_{temp}$ small, the policy greedily chooses the action with the largest Q. The temperature parameter is initially chosen large and lowered as training progresses to create an annealing effect, thereby preventing the policy from getting stuck in local extrema. Parameters used in our simulations can be found in Table S1.

Table S1.

Values for the parameters used in our simulations and training of the glider

Label	Description	Value
Ra	Rayleigh number	10⁸
Pr	Prandtl number	0.7
$z_{i}$	Inversion height	1 km
$m g / S$	Wing loading	10 N/m²
η	Learning rate	0.1
γ	Discount factor (fixed)	0.98
$τ_{temp}$	Softmax “temperature” (early stages)	2.0
$τ_{temp}$	Softmax “temperature” (later stages)	0.2
_C	Constant in reward function	5
l	Wingspan	10 m
$Δ t$	Time step	1 s
$v_{glider}$	Glider airspeed (at fixed α)	4 m/s
α	Angle of attack (fixed)	9°
$a_{z}^{thresh}$	Threshold for vertical wind acceleration	0.05 m/s²
$τ^{thresh}$	Threshold for torque	1 m²/s

Open in a new tab

In the sequel, we shall qualify the policy identified by SARSA as optimal. It should be understood, however, that the SARSA algorithm (as other reinforcement learning algorithms) typically identifies an approximately optimal policy and “approximately” is skipped only for the sake of conciseness.

Results

Sensorimotor Cues and Reward Function for Effective Learning.

Key aspects of the learning for the soaring problem are the sensorimotor cues that the glider can sense (state space) and the choice of the reward used to train the glider to ascend quickly. As the state and action spaces are continuous and high-dimensional, it is necessary to discretize them, which we realize here by a standard lookup table representation. The height ascended per trial, averaged over different realizations of the flow, serves as our performance criterion.

The glider is allowed control over its angle of attack and its bank angle (Fig. 1B). Control over the angle of attack features two regimes: (i) at small angles of attack, the horizontal speed is large and the climb rate is small (the glider sinks quickly); (ii) at large angles of attack but below the stall angle, the horizontal speed is small, whereas the climb rate is large. The bank angle controls the heading of the glider, and we allow for a range of variation between $- 15 °$ and $15 °$ . Exploring various possibilities, we found that three actions are minimally sufficient: increasing, decreasing, or preserving the angle of attack and the bank angle. The angle of attack and bank angle were incremented/decremented in steps of $2.5 °$ and $5 °$ , respectively. In summary, the glider can choose $3^{2}$ possible actions to control its navigation in response to the sensorimotor cues described hereafter.

Our rationale in the choice of the state space was trying to minimize biological or electronic sensory devices necessary for control. We tested different combinations of local sensorimotor cues that could be indicative of the existence of a thermal. These were the vertical wind velocity $u_{z}$ , the vertical wind acceleration $a_{z}$ , torques τ, the local temperature θ, and their 16 possible combinations. Namely, if $u$ denotes the local wind speed, we define the wind acceleration as $a_{z} = (u_{z}^{(t)} - u_{z}^{(t - 1)}) / Δ t$ and the “torques” as $τ = (u_{z +} - u_{z -}) l$ , where $u_{z +}$ and $u_{z -}$ are the vertical wind velocities at the left and the right wing, l is the wingspan of the glider, and $Δ t$ is the step used for time discretization (see below). After experimentation with various architectures, we found that a lookup table structure with three states per observable, corresponding to positive high, negative high, and small values, ensures good performance. The chosen thresholds, $a_{z}^{thresh}$ and $τ^{thresh}$ , that demarcate the large and small values in our scheme are listed in Table S1.

As for the reward function, we found that a purely global reward, that is, awarded at the end of a trial without any local guidance, does not propagate easily to early state–action pairs for realistically long trials. Eligibility traces (12), which maintain a memory of past state–action pairs and their rewards, did not alleviate the issue. For gliders or migrating birds, a fall can be extremely disadvantageous, and we account for this by having a glider that touches the surface receive a large negative reward as a penalty. After a broad exploration of various choices, we heuristically found that best soaring performances are obtained by a local-in-time reward that linearly combines the vertical wind velocity and the wind acceleration achieved at the subsequent time step, that is, $R = u_{z} + C a_{z}$ (see Table S1 for the chosen value). We observe that performance does not change significantly for a wide range of values of C.

Flight Training.

The glider is first trained on a set of trials and its performance is then tested on 500 trials. Trials consist of independent statistical realizations of the turbulent flow. The glider flight is discretized by time steps Δt = 1 s, which is an estimate for the control times of the glider and the timescales of the turbulent eddies at the size of the glider. Each trial lasts for 2.5 min, which is roughly one-half the relaxation time of the large-scale convective flow at steady state. The duration captures the order of magnitude of the typical time, $\sim 10$ min, for birds to reach the base of the clouds.

The velocity relative to the ground of the glider is $u + v$ , where $u$ and $v$ are the contributions due to the wind and the glider velocity, respectively. If $u_{rms}$ is the root-mean-squared speed of the flow and $v_{glider}$ is the typical airspeed of the glider, we introduce their dimensionless ratio ${\hat{u}}_{rms} = u_{rms} / v_{glider}$ . At small ${\hat{u}}_{rms}$ , fluctuations are weak. Conversely, at large ${\hat{u}}_{rms}$ , the glider has less time to react to rapidly changing velocities; that is, the environment is strongly fluctuating. Moreover, in that regime, the glider is carried away by the flow and the amount of control the glider has over its trajectory is reduced. We expect that the policy of flight learned by the glider will differ between the regimes of weak and strong fluctuations.

Learning in Different Flow Regimes.

A qualitative sense of the efficiency of the training in a fluctuating regime is illustrated in Fig. 2. The trajectories go from random paths to the spirals that are characteristic of the thermal soaring flights of birds and gliders. Fig. 3A quantifies the significant improvement in performance due to training and shows that training for a few hundred trials suffices for convergence with negligible overfitting for larger training sets. To compare performance in flows of different mean speeds, we train and test gliders in flows with varying $u_{rms}$ . Fig. 3B shows the gain in height as a function of ${\hat{u}}_{rms}$ . As expected, we observe two regimes: (i) for weak and moderate fluctuations, ${\hat{u}}_{rms} ≲ 1$ , the gain in height follows a rapidly increasing trend; (ii) for strong fluctuations, ${\hat{u}}_{rms} ≳ 1$ , gains still increase but more slowly. Because the ascended height depends on the flow speed, Fig. 3B also shows the soaring efficiency χ, defined as the difference between $Δ h ({\hat{u}}_{rms})$ and $Δ h (0)$ divided by $w_{rms} Δ T$ , where $w_{rms}$ is the rms vertical speed of the flow and ΔT = 150 s is the duration of a trial (Supporting Information and Fig. S1 for the value of w_rms). If the glider did not attempt to selectively find upward currents, χ would vanish, whereas $χ = 1$ corresponds to a glider perfectly capturing vertical currents. As the flow speed increases, the efficiency shows a downward trend that reflects the increasing difficulty in control due to higher levels of fluctuations.

Fig. 3. — The soaring performance of flight policies and the sensed sensorimotor cues. (A) The learning curve for two different turbulent fluctuation levels, as quantified by the ratio ${\hat{u}}_{rms}$ of the rms variations of the flow and the airspeed of the glider. The two values ${\hat{u}}_{rms} = 0.5$ (red) and ${\hat{u}}_{rms} = 1.5$ (green) show the increase in the average ascended height per trial with the size of the training set. The training saturates after $\approx 250$ trials. The green and red dotted lines show the learning curves of 20 individual gliders. (B) The panel shows the average height ascended for different ${\hat{u}}_{rms}$ (blue). We also plot the soaring efficiency $χ ({\hat{u}}_{rms})$ as defined in the text. The efficiency takes into account the stronger ascending velocities that are a priori available when ${\hat{u}}_{rms}$ increases. The difficulty is of course that higher velocities are also associated to stronger fluctuations. The efficiency indeed shows a downward trend that reflects the increasing difficulty in control as fluctuations increase. (C) A comparison of the average gain in height for different combinations of sensorimotor cues (vertical acceleration $a_{z}$ and velocity $v_{z}$ , torque τ and temperature θ) sensed by the glider. Vertical wind velocities and temperature also give minor contributions compared with the performance of vertical wind acceleration and torque. The third bar includes the performance when the control of the angle of attack α is included as a possible action. The contribution is marginal and the convergence is actually slowed down so that the final performance after a finite number of training trials is slightly inferior to the first bar. The error bars show the 95% confidence interval of the average gain in height. (D) The relative improvement in height gained with respect to a greedy strategy, that is, with discount factor $β = 0$ . A reinforcement learning policy that is not greedy, that is, $β \neq 0$ , has significantly better performance, demonstrating that long-term planning improves soaring. For A, B, and D, the error bars are smaller than the symbol size and are not shown.

The performance of different gliders soaring simultaneously within the same flow does not vary significantly, indicating that an ensemble of gliders learn a uniquely optimal policy. The performance over different realizations for a single glider varies wildly, with a SD of the final height of the same magnitude as the final height itself when ${\hat{u}}_{rms} \approx 1$ . Despite this wide variation, the number of failures (i.e., the glider touches the ground) always decreases rapidly to almost zero with the number of training trials.

Role of Wind Acceleration and Torques.

Our learning procedure allows us to test the possible local sensorimotor cues that give good soaring performance. For each cue, we define a mean level and upper and lower thresholds symmetrically around the mean value. The performance was found to be largely independent of the chosen thresholds.

In Fig. 3C, we show a comparison between the performance of a few different combinations of the cues. We found that the pairing of vertical wind acceleration and torques, gauged in terms of the average height ascended per trial, works best (results in Fig. 3 A and B are obtained using this pair). Intuitively, the combination of vertical wind acceleration and torques provides information on the gradient of the vertical wind velocity in two complementary directions, thus allowing the glider to decide between turning or continuing along the same path. Conversely, the vertical wind velocity does indicate the strength of a thermal, but it does not guide the glider to the core of the thermal. The pair acceleration and torque allows the glider to climb the thermal toward the core and also detect the edge of a thermal so that the glider can stay within the core. The resulting pattern within a thermal is a spiral that occurs solely from actions based on local observables and minimal memory use. Temperature fails to improve performance, which could be intuited as the temperature field is highly intermittent and is itself a convoluted function of the turbulent velocity (26, 27).

Control over the Angle of Attack.

Fig. 3C shows that control over the angle of attack does not influence significantly the performance in climbing an individual thermal. The angle of attack should play an important role, however, in other situations, namely, during cross-country races or bird migration, where gliders need to cover large horizontal distances and control over the horizontal speed and sink rate is needed (11, 28, 29). To verify this expectation, we considered a simple test case of a glider traversing, without turning, a 2D track consisting of a series of ascending or descending columns of air with turbulence added on top. We found that control over the angle of attack indeed improves the gain in height (Supporting Information and Fig. S5) and the glider learns to increase its pace during phases of descent while slowing down during periods of ascending currents. We expect that the differing roles of the angle of attack for soaring between and within thermals holds true for birds as well, a prediction that can be tested in field experiments.

Fig. S5. — Control over angle of attack during interthermal flight. A shows the vertical wind velocity profile $u_{z} (x)$ on the X axis. Each episode consists of the glider starting from the origin and traversing a fixed total distance of 0.7 km constrained along the X axis. The upward and downward currents have a parabolic profile of width 0.2 km such that the peak vertical velocities are 10 and −10 m/s, respectively. B shows the improvement in performance as training progresses, showing that the glider indeed learns to modulate its angle of attack for greater ascent.

In the sequel, we shall analyze the soaring in a single thermal. We fix then for simplicity the angle of attack at $\sim 9 °$ (where the climb rate is the largest; Fig. 1B), and the pair acceleration–torque as sensorimotor cues sensed by the glider (Fig. 3C).

Dependence on the Temporal Discounting.

The performance of the glider as a function of the temporal discount factor β is shown in Fig. 3D. The gain in height increases as the effective time horizon ${(1 - β)}^{- 1}$ grows, reaches a maximum at $\approx 100$ s, and then slowly declines. The best time horizon is comparable with the timescale of the flow patterns at the height reached by the glider. This demonstrates that long-term planning is crucial for soaring and the importance of a relatively long-term strategy to effectively use the ascending thermals.

Optimal Flight Policy.

The Q function learned by the SARSA algorithm defines the optimal state–action policy via Eq. 6. An optimal policy associates the choice of an action to the pair acceleration–torque $(a_{z}, τ)$ . The optimal action is chosen among the three options: (i) increase the bank angle μ by $5 °$ ; (ii) decrease μ by $5 °$ ; (iii) keep μ unchanged. In Fig. 4A, we show a comparison between the policy for the two regimes of weak and strong fluctuations.

The policies in Fig. 4 have a few intuitive features that are preserved at different flow speeds. For instance, when the glider experiences a negative wind acceleration, the optimal action is to sharply bank toward the side of the wing that experiences larger lift. When the glider experiences a large positive acceleration and no torque, the glider continues flying along its current path. Despite these similarities, the policies exhibit marked differences, which we proceed to analyze.

For each $a_{z}, τ$ pair, it is useful to consider its preferred angles (the green circles in Fig. 4), that is, those angles that the policy leads to if the pair $a_{z}, τ$ is maintained fixed. We observe that the preferred bank angles of gliders trained in a strong flow are relatively moderate, and the policy in general is more conservative. Consider, for instance, the case of zero torque and zero acceleration (column 5 of the policies in Fig. 4). The optimal bank action in the weak flow regime is to turn as much as possible, in contrast to the policy in the strong flow regime, which is to not turn. Another interesting qualitative difference is when the glider experiences negative acceleration and significant torque on the right wing (column 1 of the policies in Fig. 4). In the weak flow regime, if the glider is already banked to the left (negative bank angles), the policy is to bank further left to complete a full circle. In the strong flow regime, the policy is once again more conservative, preferring to not risk the full turn.

A policy becoming more conservative and risk averse as fluctuations increase is consistent with the balance of exploration and exploitation (12). In a noisy environment, where a wrong decision can lead to highly negative consequences, we expect an active agent to play safe and tend to gather more information before taking action. In a turbulent environment, we expect the glider to exploit (avoid) only significantly large positive (negative) fluctuations along its trajectory while filtering out transient, small-scale fluctuations. In the next subsection, we shall further confirm this expectation by tracking the changes in the optimal policy with the flow speed and extracting a few general principles of the optimal flight policy.

Optimal Bank Angles.

To quantify the description of the optimal policy shown in Fig. 4A, we consider the distributions of the bank angle μ given the acceleration $a_{z}$ and torque τ in the previous time step, that is, $Pr (μ^{(t + 1)} | a_{z}^{(t)}, τ^{(t)})$ . We define the optimal bank angle as follows:

μ_{opt} (a_{z}, τ) = \underset{μ^{t + 1}}{\arg \max} Pr (μ^{t + 1} | a_{z}^{t}, τ^{t}),

[8]

and we are interested in the variations of the optimal bank angle with the turbulence level ${\hat{u}}_{rms}$ . We use a bicubic spline interpolation to smooth the probability distributions and thereby obtain smoothened values for $μ_{opt}$ .

To create a higher resolution in $a_{z}$ , we expand our state space by creating finer divisions in the vertical wind accelerations. Note that the performance with an expanded state space is not significantly better than the one with just three states. Fig. 4 shows a heat map of the optimal bank angles at different $a_{z} < 0$ and $τ < 0$ . For every $a_{z}$ , $μ_{opt}$ drops from the maximum value of $15^{\circ}$ to a value closer to zero as ${\hat{u}}_{rms}$ increases. Note that, because $τ < 0$ , the optimal angles are biased toward being positive. We define a threshold on the optimal bank angles at $12.5 °$ , which empirically corresponds to the point where the optimal bank angles drop most rapidly as ${\hat{u}}_{rms}$ increases. Above (below) the threshold, the angles are considered “high” (“low”). The threshold on the optimal bank angle defined a cutoff on $- a_{z}$ and thereby an effective “fluctuation filter.”

We interpret the fluctuation filter above as follows: at a particular ${\hat{u}}_{rms}$ , if the glider encounters a fluctuation with $- a_{z}$ above the cutoff, the glider interprets the fluctuation as significant, that is, as the large-scale downward branch of a convective cell, and banks away. Conversely, fluctuations below the cutoff are ignored. In other words, the cutoff defined above gives the level that identifies significantly large fluctuations that require action. Similar behaviors are obtained for ( $a_{z} < 0, τ = 0)$ and $τ > 0$ is symmetric with respect to the case $τ < 0$ just discussed. Conversely, for $a_{z} > 0$ , the glider maintains a bank angle close to zero unless it experiences an exceptionally large torque. These simple principles are the key for effective soaring in fluctuating turbulent environments.

Discussion

We have shown that reinforcement learning methods cope with strong turbulent fluctuations and identify effective policies of navigation in turbulent flow. Previous works neglected turbulence, which is an essential and unavoidable feature of natural flow. The learned policies dramatically improve the gain of height and the rapidity of climbing within thermals, even when turbulent fluctuations are strong and the glider has reduced control due to its being transported by the flow.

We deliberately kept simple the sensorimotor cues that the glider can sense to guide its flight. In particular, possible cues were local in space and time for two reasons: keep the closest contact with what birds are likely to sense and minimize the mechanical instrumentation needed for the control of autonomously flying vehicles. In the same spirit, we kept simple the parametrization of the learned policies, by using a relatively coarse discretization of the space of states and actions.

Turbulence has indeed a major impact upon the policy of flight. We explicitly presented how the learned policies of flight modify as the level of turbulence increases. In particular, we quantified the increase of the threshold on the cues needed for the glider to change its parameters of control. We also discussed the simple principles that the policy follows to filter out transient, small-scale turbulent fluctuations, and identify the level of the sensorimotor cues that requires actions that modify the parameters of flight of the glider.

We found that the bank angle of the glider is the main control for navigation within a single thermal, which is the main interest of the current work. However, we also considered a very simplified setting mimicking the flight between multiple thermals, and there we found that control of the angle of attack is important. Interthermals flight is of major interest for birds’ migration and glider pilots. MacCready (28) determined the optimal speed to maximize the average cross-country speed as a function of the glider’s rate of sink and the velocity of ascent within the thermals. The resulting instrument (the so-called MacCready speed ring) is commonly used by glider pilots with various supplementary empirical prescriptions, which typically tend to be risk averse. MacCready’s prediction was also recently compared with the behavior of various birds (29) along their thermal-dense migratory routes. Their behavior was found to differ from the prediction; namely, a more conservative policy was observed, with slower but less sinking paths that reduce the probability of dramatic losses of height. One possible cause for more conservative policies relates to the uncertainties on the location and the velocity of ascent within the thermals, which was previously considered in the literature (30). Another possible reason suggested by our results is turbulence along the interthermal paths, which is neglected in MacCready’s and subsequent arguments. Our methodology can be adapted to realistically model interthermal conditions, and future work will assess the role of turbulence in the policy of interthermal flight.

We identified torque and vertical accelerations as the local sensorimotor cues that most effectively guide turbulent navigation. Temperature was specifically shown to yield minor gains. The robustness of our results with respect to the modeling of turbulence strongly suggests that the conclusion applies to natural conditions; a sensor of temperature could then be safely spared in the instrumentation for autonomous flying vehicles. More generally, it will be of major interest to implement our predicted policy on remotely controlled gliders and test their flight performance in field experiments. Thanks to our choices discussed above, the mechanical instrumentation needed for control is minimal and can be hosted on commercial gliders without perturbing their aerodynamics. Finally, our flight policy and the nature of the sensorimotor cues that we identified, provide predictions that can be compared with the behavior of soaring birds and could shed light on the decision processes that enable them to perform their soaring feats.

Methods

Our kinematic model of turbulence extends the one in ref. 21 to the inhomogeneous case relevant for the atmospheric boundary layer. We can thereby statistically reproduce the Kolmogorov and Richardson laws (10) and the velocity profile of the atmospheric boundary layer (6). The atmospheric boundary layer on a sunny day extends to an inversion height $z_{i} \sim$ 1 km and mainly consists of two layers—the free convection layer, extending up to $0.1 z_{i}$ , and the mixed layer (6). The rms of velocity fluctuations varies with the height z as $〈 δ {(u^{kin})}^{2} (z) 〉 \sim z^{2 / 3}$ in the free convection layer and is statistically constant in the mixed layer. To reproduce these statistics, we decomposed the velocity field at height z into contributions from fields of different integral length scales $l_{n}$ :

u^{kin} (x_{⊥}, z, t) = \sum_{l_{n} > z} c_{n} u^{kin} (x_{⊥}, z, t | l_{n}),

[9]

where $x_{⊥}$ are the two horizontal components of the position. The velocity field at each length scale $ℓ_{n}$ is specified in spatial wavenumbers $k$ as follows:

u^{kin} (x_{⊥}, z, t | l_{n}) = \int {\hat{u}}_{n}^{kin} (k, t) e^{i k . x} d^{3} k,

[10]

where the individual Fourier components ${\hat{u}}_{n}^{kin} (k, t)$ are modeled as a Ornstein–Uhlenbeck process (21). The corresponding diffusion constant is set such that the spatial energy spectrum follows the Kolmogorov five-thirds law $E (k) \sim k^{- 5 / 3}$ , where $k = | k |$ . The power law energy spectrum gives rise to long-range spatial correlations with fluctuations at every length scale up to $l_{n}$ . The relaxation time of each mode is given by the Kolmogorov scaling $τ_{k} \sim k^{- 2 / 3}$ (10). The coefficients $c_{n}$ and the integral length scales $l_{n}$ are chosen to reproduce the velocity profile of the boundary layer (see Supporting Information for details). We accounted for the mean ascending current within the thermals by superposing a Gaussian-shaped mean vertical velocity on top of the fluctuations.

Modeling the Atmospheric Boundary Layer

Conditions ideal for thermal soaring typically occur during a sunny day, when a strong temperature gradient between the surface of the Earth and the top of the atmospheric boundary layer (ABL) drive strong convective flow. The soaring flight of birds and gliders primarily occurs within this convective boundary layer. The mechanical and thermal forces within the boundary layer generate turbulence characterized by strongly fluctuating wind velocities. We simulated those turbulent conditions in two different ways: (i) a direct numerical simulation of Rayleigh–Bénard (RB) convection, which captures the basic physical mechanisms of thermal formation; (ii) a kinematic model of turbulent fluctuations that reproduces the statistical features of turbulence in the convective boundary layer. The second model accurately captures the Kolmogorov and Richardson laws, and the mean velocity profile of the ABL. The RB flow allows us to explore the role of temperature as a cue for orientation in turbulent environments.

RB Convective Flow.

Our simulations involve the numerical integration of Navier–Stokes equations with coupled velocity and temperature fields simplified by the Boussinesq approximation. When the Rayleigh number Ra is beyond a critical value of $\sim 10^{3}$ , the thermally generated buoyancy drives instabilities in the flow. In this regime, the flow is characterized by large-scale convective cells and turbulent eddies at every length scale. In the atmosphere, the Rayleigh number can reach up to Ra = 10¹⁵ to 10²⁰. In such high-Rayleigh number regimes, the flow is strongly turbulent—numerical simulations of convection in the atmosphere are thus plagued by the same limitations of simulating fully developed turbulent flows. We simulated 3D RB convection with a Rayleigh number Ra = $10^{8}$ and a Prandtl number Pr = 0.7 using the Gerris Flow Solver (18). The floor and the ceiling of the cubical simulation box are no-slip and are fixed at temperatures of unity and zero, respectively. We impose periodic boundary conditions on the side walls. The equations involved are the perturbed velocity ( $u$ ), temperature (θ) field equations about the mean field as follows (19):

\frac{\partial u}{\partial t} + u \cdot \nabla u = - \nabla P + {(\frac{Pr}{Ra})}^{1 / 2} \nabla^{2} u + θ \hat{z},

[S1]

\frac{\partial θ}{\partial t} + u \cdot \nabla θ = \frac{1}{{(PrRa)}^{1 / 2}} \nabla^{2} θ,

[S2]

along with the incompressibility condition ( $\nabla \cdot u = 0$ ). For accurate simulations, the grid spacing in the bulk $δ_{b}$ should be chosen smaller than the Kolmogorov viscous length scale η of the flow. If the side of the cubical box is h, the Kolmogorov scale can be approximated by $η / h \approx π {(Pr / RaNu)}^{1 / 4}$ (19). The Nusselt number Nu is defined as the ratio of convective to conductive heat transfer in the flow. At our parameter values, the Nusselt number can be approximated by $Nu \approx 0.124 {Ra}^{0.309}$ (19). We thereby obtain $64 η / h \approx 0.8$ , and thus we use a spacing of $δ_{b} / h = 1 / 64$ within the bulk. The grid spacing is required to be smaller at the no-slip boundaries due to the formation of the thermal and viscous boundary layers; the Grotzbach criterion (31) suggests three to five grid points within the boundary layers for accurate numerical simulations. The thickness of the thermal boundary layer can be approximated by $δ_{T} / h \approx 1 / 2 Nu$ (20) and the viscous boundary layer thickness is $δ_{v} = Pr δ_{T}$ . This gives $δ_{T} / h = 0.016$ . We found that using a grid spacing of $h / 256$ within the boundary layers is sufficient to ensure stability of the numerical integration scheme and proper resolution of the fields.

In summary, our setup consists of a cubical grid symmetric about the center and the mesh size is: $h / 256$ up to a height of $0.025 h$ , $h / 128$ from $0.025 h$ to $0.05 h$ , and finally $h / 64$ in the bulk. In Fig. S1, we show the velocity and temperature profiles of the flow. Also shown is the Nusselt number defined as follows (9):

Nu = (〈 u_{z} θ 〉 - κ 〈 \partial_{z} θ 〉) / κ Δ θ / h = {(PrRa)}^{1 / 2} 〈 u_{z} θ 〉 - 〈 \partial_{z} θ 〉,

[S3]

where $κ = {(PrRa)}^{- 1 / 2}$ is the effective thermal conductivity after rescaling, $Δ θ = 1$ is the (rescaled to unity) temperature difference between the hot and cold plates, and $h = 1$ is the (rescaled to unity) distance between the plates. The numerically obtained value $Nu \approx 32$ matches well with previous values in the literature (figure 1A in ref. 20).

A Kinematic Model of the Convective Boundary Layer.

The ABL is the lowest region of the atmosphere and extends up to a height of about 1–2 km. Above the ABL, the flow is nearly geostrophic, that is, winds flow along isobars due to the balance of pressure gradient forces and the Coriolis force. On a sunny day, the boundary layer is characterized by convective flows and is roughly structured in four layers as follows:

•
Surface layer: This is typically a thin layer of a few meters dominated by shear forces. The wind velocity profile has a logarithmic dependence on the height z, that is, $u (z) \sim \log (z / z_{0})$ , where $z_{0}$ depends on the surface roughness.
•
Free convection layer: This is a matching layer between the surface and the mixing layers. In this layer, the velocity profile features a Kolmogorov scaling $u^{2} (z) \sim z^{2 / 3}$ . The layer extends from a few meters up to $0.1 z_{i}$ , where $z_{i}$ is the inversion height, the top of the ABL.
•
Mixed layer: In this layer, shear forces are negligible and the surface is irrelevant. Convective mixing forces the velocity profile, temperature, and velocity correlation lengths to be uniform with height. The layer extends from 0.1 $z_{i}$ to $z_{i}$ .
•
Inversion layer: The top of the ABL has a capping inversion layer characterized by cold temperatures, strong winds, and clouds.

In our simulations, we resolve the free convection layer and the mixed layer with the inversion height $z_{i} \sim 1$ km. We use a kinematic model of turbulence that extends the one in ref. 21 to the inhomogeneous case and statistically reproduces the Kolmogorov and Richardson laws and the velocity profile of the ABL. A Gaussian ascending core is added on top of the turbulent fluctuations and provides a mean, z-independent, ascending flow:

u_{z}^{thermal} \propto e^{- \frac{{(r_{⊥} - r_{⊥}^{center})}^{2}}{2 R^{2}}},

[S4]

where $r_{⊥}$ is the 2D position vector in the horizontal plane, $r_{⊥}^{center}$ is the location of the center of the thermal, and R is its radius.

The fluctuating field is a composition of flows of different integral length scales. The velocity at height z has contributions from flows of length scales $l_{n}$ greater than z:

u (r_{⊥}, z, t) = \sum_{l_{n} > z} c_{n} u (r_{⊥}, z, t | l_{n}),

[S5]

where $u (r_{⊥}, z, t | l_{n})$ is the velocity contribution due to the flow with length scale $l_{n}$ . Amplitudes are normalized to have $〈 u (r_{⊥}, l_{n}, t | l_{n}) . u (r_{⊥}, l_{n}, t | l_{n}) 〉 = 1$ . Hereafter, $u (r_{⊥}, z, t | l_{n})$ will be denoted by $u_{n} (r_{⊥}, z, t)$ . We expand in a spatial discrete Fourier transform with spatial frequencies $k$ :

u_{n} (r_{⊥}, z, t) = \sum_{k} {\hat{u}}_{n} (k_{⊥}, k_{z}, t) e^{i (k_{⊥} \cdot r_{⊥} + k_{z} z)} .

[S6]

The energy spectrum of velocity fluctuations at relevant scales follows the Kolmogorov law $E (k) \sim k^{- 5 / 3}$ , where $k = | k |$ . From the Kolmogorov law, we require the spatial energy spectrum $E (k) = 4 π k^{2} 〈 {| {\hat{u}}_{n} (k_{⊥}, k_{z}, t) |}^{2} 〉 \sim k^{- 5 / 3}$ , and we have $〈 {| {\hat{u}}_{n} (k_{⊥}, k_{z}, t) |}^{2} 〉 \sim k^{- 11 / 3}$ for $k > 1 / l_{n}$ . The time evolution of the real part (and similarly for the independent imaginary part) of each mode is modeled as an Ornstein–Uhlenbeck process as follows:

d {\hat{u}}_{n R} (k_{⊥}, k_{z}, t) = - {\hat{u}}_{n R} (k_{⊥}, k_{z}, t) (\frac{d t}{τ_{n k}}) + σ_{n k} d W_{t},

[S7]

where $d W_{t}$ is a Wiener process, $τ_{n k}$ is a timescale that depends on n and k, and $σ_{n k}$ is the amplitude of fluctuations. The equation above ensures that the temporal correlations of each mode decay with time scale $τ_{n k}$ , which obeys the dimensional Kolmogorov scaling $τ_{n k} \sim {(l_{n} k)}^{- 2 / 3}$ . At steady state, the energy of mode $k$ equals $〈 {| {\hat{u}}_{n} (k_{⊥}, k_{z}, t) |}^{2} 〉 = 3 σ_{n k}^{2} τ_{n k}$ . Because the contribution due to a flow of length scale $l_{n}$ is required to cut off at height $l_{n}$ and vanishes at $z = 0$ , we impose the supplementary condition that the Fourier expansion of $u_{n} (r_{⊥}, z, t)$ has only sinusoidal contributions in the vertical direction. We thereby have two conditions on the Fourier components:

{\hat{u}}_{n} (- k_{⊥}, - k_{z}, t) = {\hat{u}}_{n}^{*} (k_{⊥}, k_{z}, t); {\hat{u}}_{n} (k_{⊥}, - k_{z}, t) = - {\hat{u}}_{n} (k_{⊥}, k_{z}, t) .

[S8]

The first condition enforces that the flow is real and the second one enforces the vanishing at the ground.

The two conditions (Eq. S8) can be used to reorganize the sum in Eq. S6 and elementary calculations lead to the following expression for the two-point velocity correlation function:

〈 u_{n} (r_{⊥} + l_{⊥}, z, t) \cdot u_{n} (r_{⊥}, z, t) 〉 = 16 \sum^{\sim} \sin^{2} (k_{z} z) [〈 {\hat{u}}_{n R} (k_{⊥}, k_{z}, t) \cdot {\hat{u}}_{n R} (k_{⊥}, k_{z}, t) 〉 \cos (k_{⊥} \cdot l_{⊥})] .

[S9]

The sum $\sum^{\sim}$ runs over the set of independent wave vectors [which is restricted by the two conditions (Eq. S8)]. The real and imaginary components of the modes are Gaussian and independent of each other. The Kolmogorov scaling of the amplitudes of the modes mentioned above finally gives $〈 u_{n} (r_{⊥}, z, t) \cdot u_{n} (r_{⊥}, z, t) 〉 \sim z^{2 / 3}$ . Due to the imposed sinusoidal constraints, the scaling flattens out around $z = l_{n} / 4$ .

The extent of the simulation lattice depends on the integral length scale $l_{n}$ of the flow. The lattice for the nth flow has dimensions $4 z_{i} \times 4 z_{i} \times l_{n}$ with each dimension discretized into 64 points. The spatial frequencies are of the form $(p / 4 z_{i}, q / 4 z_{i}, r / l_{n})$ , where $p, q, r = 0,1,2, \dots, 63$ . To integrate the Fourier components, we use the standard stochastic Runge–Kutta update rule:

{\hat{u}}_{n k} \to {\hat{u}}_{n k} (1 - δ + \frac{1}{2} δ^{2}) + a_{n k} N (0, \sqrt{δ}) - a_{n k} [ρ N (0, \sqrt{δ}) + \sqrt{1 - ρ^{2}} N (0, \sqrt{δ'})],

[S10]

where $a_{n k} = σ_{n k} \sqrt{τ_{n k}}$ , $N (μ, σ)$ is a normal random variable with mean μ and variance $σ^{2}$ , and the notation ${\hat{u}}_{n k}$ indicates that the same update rule holds for the real and imaginary parts of the modes with $| k | = k$ . At steady state, $〈 {\hat{u}}_{n R} (k_{⊥}, k_{z}, t) \cdot {\hat{u}}_{n R} (k_{⊥}, k_{z}, t) 〉 = 3 a_{n k}^{2} / 2$ . Here, $δ = d t_{k} / τ_{n k}$ and $δ' = δ^{3} / (12 - 3 δ^{2})$ . It follows from Eq. S9 that

〈 u_{n} (r_{⊥}, z, t) \cdot u_{n} (r_{⊥}, z, t) 〉 = 24 \sum^{\sim} \sin^{2} (k_{z} z) a_{n k}^{2},

[S11]

with $a_{n k}^{2} \propto k^{- 11 / 3}$ . The length scales are spaced logarithmically, that is, $l_{n} = l_{1} λ^{n - 1}$ and the contribution $n = N$ with the largest length scale has $l_{N} = 4 z_{i}$ . By numerically calculating the velocity magnitudes from Eq. S11, we scale the value of the coefficients $c_{n}$ to obtain the behavior that is appropriate for the convective boundary layer. We pick $N = 8$ , and the parameters $c_{n}$ (Eq. S5) used for the kinematic turbulence model are as follows: c₁, 1.1; c₂, 0.8; c₃, 0.7; c₄, 0.7; c₅, 0.6; c₆, 0.6; c₇, 0.5; c₈, 2.0. The full velocity field (Eq. S5) at an arbitrary point is obtained by interpolating the contribution from each flow. Finally, although our simulation constructs a cubic box of size $4 z_{i}$ , we constrain ourselves to the first quarter in the vertical direction for the symmetry reasons mentioned above. The resulting velocity profile and the Richardson superdiffusive law are shown in Fig. S2.

Learning to Soar: Kinematic Model

In the main text, we described our results for RB flow. In this section, we detail the results for the kinematic model described above. The gliders are trained using the same learning procedure and glider mechanics presented in the main text. The upshot is that learned policies are similar for the two flows and the main features of the flying policies discussed in the main text apply to the synthetic flow as well.

Setup.

We consider a 3D setting as described in the previous section. The core mean flow (Eq. S4) is centered at the origin with radius R = 0.25 km, and the maximal velocity at the center is set to 5 m/s. Turbulent fluctuations of magnitude $u_{rms}^{kin}$ and the statistics described above are added on top of the Gaussian core. The fluctuations have long-range spatial correlations, and the longest (and slowest) modes relax on timescales comparable to those of each ascent. The glider starts from the edge of the Gaussian thermal, facing away from its center, at an initial distance $r_{⊥}^{init} = 2 R$ , and attempts to find the center of the thermal amid turbulent fluctuations. The groundspeed of a glider has three contributions:

u^{ground} = v^{glider} + u^{thermal} + u^{kin},

[S12]

where $v^{glider}$ is the air velocity of the glider (see Models, Glider Mechanics, in the main text), $u^{thermal}$ is the contribution due to the mean Gaussian core, and $u^{kin}$ is the contribution due to the turbulent fluctuations. The airspeed and heading of the glider are controlled by the angle of attack and the bank angle of the glider (see main text). In this setting, we have three velocity scales: the contribution from the mean Gaussian core, the airspeed of the glider, and the magnitude of fluctuations. Correspondingly, we have three regimes: (I) the weak fluctuations regime, where the magnitude of fluctuations is smaller than the mean contribution of the core; (II) the strong fluctuations regime, where the mean is masked by fluctuations; and (III) the extreme fluctuations regime, when the fluctuations are larger than the airspeed of the glider. As a measure of the level of fluctuations, we define ${\hat{u}}_{rms} = u_{rms}^{kin} / u^{thermal} (r_{⊥} = r_{⊥}^{init})$ , that is, the ratio between turbulent fluctuations and the thermal velocity at the starting point. In terms of ${\hat{u}}_{rms}$ , the three regimes correspond to ${\hat{u}}_{rms} < 1$ , $1 < {\hat{u}}_{rms} < 6$ , and ${\hat{u}}_{rms} > 6$ . We expect, as in the case of the RB flow, that the policy learned by the glider differs in the regimes of weak fluctuations and strong/extreme fluctuations. The turbulent fluctuations and glider flight are resolved with time steps of 1 s each, as for the results described in the main text. We used precisely the same architecture for the reinforcement learning algorithm as in the case of RB flow (see main text).

Results.

As for RB flow, we found that the vertical wind acceleration $a_{z}$ and the torques τ are the best mechanical cues to guide turbulent navigation. The angle of attack does not improve performance for reasons similar to those presented in the main text. We fix the set of observables to $a_{z}$ and τ, and the glider has control over its bank angle. Fig. S3 shows the learning curve and the average gain in height for different values of ${\hat{u}}_{rms}$ . As anticipated, three regimes can be distinguished: (I) when ${\hat{u}}_{rms} < 1$ , the contribution of the Gaussian core to the wind velocity is dominant, which makes its location and climbing relatively easy and allows for a large gain in height. (II) At $1 < {\hat{u}}_{rms} < 6$ , fluctuations dominate and the glider is forced to learn how to exploit the fluctuations to soar. (III) When the fluctuations exceed the airspeed of the glider, the glider is carried away by the strong flow and progressively loses control over its trajectory. We observe indeed a declining gain in height for these extreme values of fluctuations.

Fig. S3. — Training and learning for the kinematic model of turbulence. A shows the learning curves for ${\hat{u}}_{rms} = 0,2.25, 4.5, 6.75, 9$ . Training saturates in less than 100 episodes and exhibits no overfitting. B shows the performance in terms of the gain in height for different ${\hat{u}}_{rms}$ . Vertical lines separate the three regimes of weak (I), strong (II), and extreme (III) fluctuations. The gain in height scales differently in the three regimes: (I) the glider climbs the thermal using the mean ascending flow, (II) the glider uses the turbulent fluctuations to ascend, and (III) the gain in height declines as the glider is carried away by the fluctuations that dominate its velocity, which results in loss of control.

Sample policies for ${\hat{u}}_{rms} = 0$ and ${\hat{u}}_{rms} = 5$ are shown in Fig. S4. Comparison with Fig. 4 of the main text shows that the qualitative features of the policies in the two flows are very much the same; the arguments presented there directly apply to this case. The optimal bank angles can be obtained for each $(a_{z}, τ)$ pair by finding the mode of the distribution $Pr (μ_{t + 1}) | a_{z}, τ)$ . Fig. S4 shows the optimal bank angles for the case of negative $a_{z}$ and $τ < 0$ . The sharp change in policy occurs at the boundary between regimes II and III; note that these correspond to the weak and strong flow regimes for the RB flow. The scaling of the boundary between “large” and “small” fluctuations ( $- {\hat{a}}_{z}$ ) is qualitatively similar but its nonlinear profile quantitatively differs from the linear one that we found for the RB flow. In Fig. S4D, we also show a simplified version of the heat map extended to a larger range. To obtain this figure, we proceed as in the main text, that is, defined a cutoff in the optimal bank angle at $12.5 °$ that separates out the “large” and “small” fluctuation regions. The diverse colors shown in the figure correspond to simulations with expanded bin sizes in the tiled representation of $a_{z}$ .

Control over Angle of Attack During Interthermal Flight

As elucidated in the main text, for the task of finding and centering a single thermal, control over angle of attack offers minor improvement in the performance of the glider. However, angle of attack is expected to be relevant during interthermal flight, that is, when the glider needs to travel large distances quickly while avoiding the dangerous case of missing the thermal, losing height, and crashing to the ground. The importance of angle of attack for the full case of interthermal flight in realistic turbulent conditions is a direction for future work (see Discussion in the main text). In this section, we considered a very simplified setting of interthermal flight and verified that control over angle of attack does indeed offer significant advantages.

We consider a glider constrained to moving in the X–Z plane, where Z is the vertical, for reasons that will be clear momentarily. The glider faces an X-dependent vertical velocity profile consisting of a downward current followed by a symmetric upward current as shown in Fig. S5A. The net gain in height provided by the currents for a glider moving at a constant speed is zero. However, by modulating its angle of attack to slow down during regions of updraft and increasing its pace during regions of downdraft, the glider can achieve a net positive gain in height. The bank angle does not play any role here as the motion is constrained to the X–Z plane, which is the reason why we selected this geometry.

We used a reward function $\propto (u_{z} - v_{c}) / v_{⊥}$ , where $u_{z}, v_{c}$ , and $v_{⊥}$ are the vertical wind velocity, climb rate, and horizontal speed, respectively. The reward function encourages the glider to seek smaller horizontal speeds while ascending and larger horizontal speeds while sinking. The state space in this case includes the angle of attack discretized to seven states between $2.5 °$ and $17.5 °$ , and the vertical wind velocity discretized to 25 bins between $- 12$ and 12 m/s. At each step, the glider has an option of increasing by $2.5 °$ , decreasing by $2.5 °$ , or maintaining the same angle of attack. The bank angle is fixed at zero degrees. Fig. S5B shows the net gain in height relative to a glider moving at fixed speed with the number of training episodes.

Even though the setting considered here is extremely simplified, results show the advantage provided by the control of the angle of attack. Quantifying the advantage and identifying the corresponding policy of control for realistic turbulent flows encountered during long-distance migration or cross-country glider competitions is the subject of ongoing work.

Acknowledgments

We are grateful to A. Libchaber for numerous discussions on convective flow. This work was supported by Simons Foundation Grant 340106 (to M.V.).

Footnotes

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1606075113/-/DCSupplemental.

References

1.Cone CD., Jr Thermal soaring of birds. Am Sci. 1962;50(1):180–209. [Google Scholar]
2.Pennycuick CJ. Thermal soaring compared in three dissimilar tropical bird species, Fregata magnificens, Pelecanus occidentalis and Coragyps atratus. J Exp Biol. 1983;102:307–325. [Google Scholar]
3.Ehrlich P, Dobkin D, Wheye D. 1988. The Birder’s Handbook: A Field Guide to the Natural History of North American Birds. (Simon and Schuster, New York)
4.Newton I. The Migration Ecology of Birds. Academic; London: 2007. [Google Scholar]
5.Allen MJ. 2007. Guidance and control of an autonomous soaring UAV. Proceedings of the AIAA Aerospace Sciences Meeting (American Institute of Aeronautics and Astronautics, Reston, VA), AIAA Paper 2007-867.
6.Garrat JR. 1994. The Atmospheric Boundary Layer. Cambridge Atmospheric and Space Science Series (Cambridge Univ Press, Cambridge, UK)
7.Lenschow DH, Stephens PL. The role of thermals in the convective boundary layer. Boundary-Layer Meteorol. 1980;19(4):509–532. [Google Scholar]
8.Young GS. Convection in the atmospheric boundary layer. Earth Sci Rev. 1988;25(3):179–198. [Google Scholar]
9.Ahlers G, Grossmann S, Lohse D. Heat transfer and large scale dynamics in turbulent Rayleigh–Benard convection. Rev Mod Phys. 2009;81:503. [Google Scholar]
10.Frisch U. Turbulence: The Legacy of A. N. Kolmogorov. Cambridge Univ Press; Cambridge, UK: 1995. [Google Scholar]
11.Akos Z, Nagy M, Vicsek T. Comparing bird and human soaring strategies. Proc Natl Acad Sci USA. 2008;105(11):4139–4143. doi: 10.1073/pnas.0707711105. [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Sutton RS, Barto AG. Reinforcement Learning: An Introduction. MIT Press; Cambridge, MA: 1998. [Google Scholar]
13.Wharington J, Herszberg I. 1998. Control of a high endurance unmanned aerial vehicle. Proceedings of the 21st Congress of International Council of the Aeronautical Sciences (International Council of the Aeronautical Sciences, Bonn, Germany), Paper 98-3.7.1.
14.Reichmann H. Cross-Country Soaring. Thomson Publications; Santa Monica, CA: 1988. [Google Scholar]
15.Lawrance NRJ, et al. Long endurance autonomous flight for unmanned aerial vehicles. AerospaceLab. 2014;8(5):1–15. [Google Scholar]
16.Woodbury T, Dunn C, Valasek J. 2014. Autonomous soaring using reinforcement learning for trajectory generation. Proceedings of the AIAA Aerospace Sciences Meeting (American Institute of Aeronautics and Astronautics, Reston, VA), AIAA Paper 2014-0990.
17.Akos Z, Nagy M, Leven S, Vicsek T. Thermal soaring flight of birds and unmanned aerial vehicles. Bioinspir Biomim. 2010;5(4):045003. doi: 10.1088/1748-3182/5/4/045003. [DOI] [PubMed] [Google Scholar]
18.Popinet S. Gerris: A tree-based adaptive solver for the incompressible Euler equations in complex geometries. J Comput Phys. 2003;190:572–600. [Google Scholar]
19.Verzicco R, Camussi R. Numerical experiments on strongly turbulent thermal convection in a slender cylindrical cell. J Fluid Mech. 2003;477:19–49. [Google Scholar]
20.Verzicco R, Camussi R. Prandtl number effects in convective turbulence. J Fluid Mech. 1999;383:55–73. [Google Scholar]
21.Fung JCH, Hunt JCR, Malik NA, Perkins RJ. Kinematic simulation of homogeneous turbulence by unsteady random Fourier modes. J Fluid Mech. 1992;236:281–318. [Google Scholar]
22.von Mises R. Theory of Flight. McGraw Hill; New York: 1945. [Google Scholar]
23.Anderson JR., Jr . Introduction to Flight. McGraw Hill; New York: 1978. [Google Scholar]
24.von Karman T. Aerodynamics. McGraw-Hill; New York: 1963. [Google Scholar]
25.Tesauro G. Temporal difference learning and TD-Gammon. Commun ACM. 1995;38:58–68. [Google Scholar]
26.Shraiman BI, Siggia ED. Scalar turbulence. Nature. 2000;405(6787):639–646. doi: 10.1038/35015000. [DOI] [PubMed] [Google Scholar]
27.Falkovich G, Gawedzki K, Vergassola M. Particles and fields in fluid turbulence. Rev Mod Phys. 2001;73:913–975. [Google Scholar]
28.MacCready PBJ. Optimum airspeed selector. Soaring. 1958;1958(1):10–11. [Google Scholar]
29.Horvitz N, et al. The gliding speed of migrating birds: Slow and safe or fast and risky? Ecol Lett. 2014;17(6):670–679. doi: 10.1111/ele.12268. [DOI] [PubMed] [Google Scholar]
30.Cochrane JH. MacCready theory with uncertain lift and limited altitude. Technical Soaring. 1999;23:88–96. [Google Scholar]
31.Grotzbach G. Spatial resolution requirements for direct numerical simulation of the Rayleigh–Bénard convection. J Comput Phys. 1983;49:241–264. [Google Scholar]

[r1] 1.Cone CD., Jr Thermal soaring of birds. Am Sci. 1962;50(1):180–209. [Google Scholar]

[r2] 2.Pennycuick CJ. Thermal soaring compared in three dissimilar tropical bird species, Fregata magnificens, Pelecanus occidentalis and Coragyps atratus. J Exp Biol. 1983;102:307–325. [Google Scholar]

[r3] 3.Ehrlich P, Dobkin D, Wheye D. 1988. The Birder’s Handbook: A Field Guide to the Natural History of North American Birds. (Simon and Schuster, New York)

[r4] 4.Newton I. The Migration Ecology of Birds. Academic; London: 2007. [Google Scholar]

[r5] 5.Allen MJ. 2007. Guidance and control of an autonomous soaring UAV. Proceedings of the AIAA Aerospace Sciences Meeting (American Institute of Aeronautics and Astronautics, Reston, VA), AIAA Paper 2007-867.

[r6] 6.Garrat JR. 1994. The Atmospheric Boundary Layer. Cambridge Atmospheric and Space Science Series (Cambridge Univ Press, Cambridge, UK)

[r7] 7.Lenschow DH, Stephens PL. The role of thermals in the convective boundary layer. Boundary-Layer Meteorol. 1980;19(4):509–532. [Google Scholar]

[r8] 8.Young GS. Convection in the atmospheric boundary layer. Earth Sci Rev. 1988;25(3):179–198. [Google Scholar]

[r9] 9.Ahlers G, Grossmann S, Lohse D. Heat transfer and large scale dynamics in turbulent Rayleigh–Benard convection. Rev Mod Phys. 2009;81:503. [Google Scholar]

[r10] 10.Frisch U. Turbulence: The Legacy of A. N. Kolmogorov. Cambridge Univ Press; Cambridge, UK: 1995. [Google Scholar]

[r11] 11.Akos Z, Nagy M, Vicsek T. Comparing bird and human soaring strategies. Proc Natl Acad Sci USA. 2008;105(11):4139–4143. doi: 10.1073/pnas.0707711105. [DOI] [PMC free article] [PubMed] [Google Scholar]

[r12] 12.Sutton RS, Barto AG. Reinforcement Learning: An Introduction. MIT Press; Cambridge, MA: 1998. [Google Scholar]

[r13] 13.Wharington J, Herszberg I. 1998. Control of a high endurance unmanned aerial vehicle. Proceedings of the 21st Congress of International Council of the Aeronautical Sciences (International Council of the Aeronautical Sciences, Bonn, Germany), Paper 98-3.7.1.

[r14] 14.Reichmann H. Cross-Country Soaring. Thomson Publications; Santa Monica, CA: 1988. [Google Scholar]

[r15] 15.Lawrance NRJ, et al. Long endurance autonomous flight for unmanned aerial vehicles. AerospaceLab. 2014;8(5):1–15. [Google Scholar]

[r16] 16.Woodbury T, Dunn C, Valasek J. 2014. Autonomous soaring using reinforcement learning for trajectory generation. Proceedings of the AIAA Aerospace Sciences Meeting (American Institute of Aeronautics and Astronautics, Reston, VA), AIAA Paper 2014-0990.

[r17] 17.Akos Z, Nagy M, Leven S, Vicsek T. Thermal soaring flight of birds and unmanned aerial vehicles. Bioinspir Biomim. 2010;5(4):045003. doi: 10.1088/1748-3182/5/4/045003. [DOI] [PubMed] [Google Scholar]

[r18] 18.Popinet S. Gerris: A tree-based adaptive solver for the incompressible Euler equations in complex geometries. J Comput Phys. 2003;190:572–600. [Google Scholar]

[r19] 19.Verzicco R, Camussi R. Numerical experiments on strongly turbulent thermal convection in a slender cylindrical cell. J Fluid Mech. 2003;477:19–49. [Google Scholar]

[r20] 20.Verzicco R, Camussi R. Prandtl number effects in convective turbulence. J Fluid Mech. 1999;383:55–73. [Google Scholar]

[r21] 21.Fung JCH, Hunt JCR, Malik NA, Perkins RJ. Kinematic simulation of homogeneous turbulence by unsteady random Fourier modes. J Fluid Mech. 1992;236:281–318. [Google Scholar]

[r22] 22.von Mises R. Theory of Flight. McGraw Hill; New York: 1945. [Google Scholar]

[r23] 23.Anderson JR., Jr . Introduction to Flight. McGraw Hill; New York: 1978. [Google Scholar]

[r24] 24.von Karman T. Aerodynamics. McGraw-Hill; New York: 1963. [Google Scholar]

[r25] 25.Tesauro G. Temporal difference learning and TD-Gammon. Commun ACM. 1995;38:58–68. [Google Scholar]

[r26] 26.Shraiman BI, Siggia ED. Scalar turbulence. Nature. 2000;405(6787):639–646. doi: 10.1038/35015000. [DOI] [PubMed] [Google Scholar]

[r27] 27.Falkovich G, Gawedzki K, Vergassola M. Particles and fields in fluid turbulence. Rev Mod Phys. 2001;73:913–975. [Google Scholar]

[r28] 28.MacCready PBJ. Optimum airspeed selector. Soaring. 1958;1958(1):10–11. [Google Scholar]

[r29] 29.Horvitz N, et al. The gliding speed of migrating birds: Slow and safe or fast and risky? Ecol Lett. 2014;17(6):670–679. doi: 10.1111/ele.12268. [DOI] [PubMed] [Google Scholar]

[r30] 30.Cochrane JH. MacCready theory with uncertain lift and limited altitude. Technical Soaring. 1999;23:88–96. [Google Scholar]

[r31] 31.Grotzbach G. Spatial resolution requirements for direct numerical simulation of the Rayleigh–Bénard convection. J Comput Phys. 1983;49:241–264. [Google Scholar]

PERMALINK

Learning to soar in turbulent environments

Gautam Reddy

Antonio Celani

Terrence J Sejnowski

Massimo Vergassola

Series information

Significance

Abstract

Models

Modeling the Turbulent Environment.

Fig. 1.

Fig. S1.

Fig. S2.

Fig. S4.

Glider Mechanics.

The Learning Algorithm.

Table S1.

Results

Sensorimotor Cues and Reward Function for Effective Learning.

Flight Training.

Learning in Different Flow Regimes.

Fig. 2.

Fig. 3.

Role of Wind Acceleration and Torques.

Control over the Angle of Attack.

Fig. S5.

Dependence on the Temporal Discounting.

Optimal Flight Policy.

Fig. 4.

Optimal Bank Angles.

Discussion

Methods

Modeling the Atmospheric Boundary Layer

RB Convective Flow.

A Kinematic Model of the Convective Boundary Layer.

Learning to Soar: Kinematic Model

Setup.

Results.

Fig. S3.

Control over Angle of Attack During Interthermal Flight

Acknowledgments

Footnotes

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases