Philosophical Transactions of the Royal Society B: Biological Sciences
2023 May 29;378(1881):20220195. doi: 10.1098/rstb.2022.0195

Bridging adaptive management and reinforcement learning for more robust decisions

Melissa Chapman 1, Lily Xu 2, Marcus Lapeyrolerie 1, Carl Boettiger 1
PMCID: PMC10225849  PMID: 37246377

Abstract

From out-competing grandmasters in chess to informing high-stakes healthcare decisions, emerging methods from artificial intelligence are increasingly capable of making complex and strategic decisions in diverse, high-dimensional and uncertain situations. But can these methods help us devise robust strategies for managing environmental systems under great uncertainty? Here we explore how reinforcement learning (RL), a subfield of artificial intelligence, approaches decision problems through a lens similar to adaptive environmental management: learning through experience to gradually improve decisions with updated knowledge. We review where RL holds promise for improving evidence-informed adaptive management decisions even when classical optimization methods are intractable and discuss technical and social issues that arise when applying RL to adaptive management problems in the environmental domain. Our synthesis suggests that environmental management and computer science can learn from one another about the practices, promises and perils of experience-based decision-making.

This article is part of the theme issue ‘Detecting and attributing the causes of biodiversity change: needs, gaps and solutions’.

Keywords: adaptive management, artificial intelligence, reinforcement learning, deep learning, biodiversity, decision-making

1. Introduction

Given the urgency of environmental crises and the impending risk of crossing planetary tipping points, developing management strategies in the face of uncertainty is increasingly critical. Decision-making under uncertainty, central to most contemporary environmental policy and practice, is referenced in contexts that span scales and systems, ranging from multilateral initiatives to halt biodiversity loss [1] and mitigate climate change [2] to local measures for habitat protection [3] and water allocation [4]. But uncertainty, and the possibility of triggering regime shifts, persist as challenges to effective conservation decision-making [5], motivating conservation science, policy and practice to focus not only on making decisions given uncertainty but iteratively reducing uncertainty through adaptive decision-making.

Since the coining of the term adaptive management in the 1970s [6], attempts to apply the paradigm of ‘learning while doing’ have proliferated across the environmental domain [7,8]. But the theoretical underpinnings of adaptive management—reducing uncertainty around a discrete set of autonomous models (systems that do not explicitly depend on independent variables) or model parameter values over time to take actions that maximize a notion of expected utility—have proven to be difficult in application [9,10], buckling under the problem complexity and sociopolitical nuance within which real environmental decisions are made [11,12]. Notably, the decision-theoretic methods commonly used to solve adaptive management problems (e.g. Bayesian model updating and dynamic programming) generally assume that uncertainties not only are known but can be precisely quantified in probabilities [13]. This, in turn, hinges on the assumption that models of environmental systems are identifiable and autonomous, which is often not true, particularly in the case of systems with tipping points or in the context of rapid environmental changes due to anthropogenic pressures [14].

In the light of the limitations of decision-theoretic methods, multiple alternative approaches have emerged to inform the management of complex systems under uncertainty (e.g. scenario planning and resilience thinking [5]). For example, statistical early warning signals of tipping points (e.g. critical slowing down) allow us to punt on the issue of model misspecification. Notably, early warning signals, scenario planning, and other ‘resilience thinking’ approaches all shift away from quantifying a decision policy (e.g. a model that suggests a fishing quota of X metric tons) and towards classification (e.g. ‘the system is/is not approaching a tipping point’, or ‘scenario B is preferable to scenario C’). Hereafter, we will call these ‘classification’ approaches. Despite long-standing calls to better integrate decision theory and classification approaches [13], classification has continued to retreat from the decision problem altogether, focusing instead on advanced computational tools—like deep learning—to better predict complex system dynamics while avoiding the need to explicitly consider actions or the expected utility of a given strategy. For example, Bury et al. [15] use machine learning to classify critical transitions into four possible classes but do not suggest action strategies given a belief state in the system dynamics.
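To make the ‘classification’ style concrete: a common early warning indicator is rising lag-1 autocorrelation in a monitoring time series (critical slowing down). The sketch below is a hypothetical illustration on a simulated AR(1) series whose memory slowly increases as it nears a transition; the system and constants are our own construction, not an example from the cited studies.

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 autocorrelation, a standard critical-slowing-down indicator."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

def rolling_ews(series, window):
    """Rolling early-warning indicator over a sliding window."""
    return [lag1_autocorr(series[i - window:i]) for i in range(window, len(series) + 1)]

# Toy system: an AR(1) process whose coefficient drifts toward 1,
# mimicking critical slowing down before a tipping point.
rng = np.random.default_rng(0)
n = 2000
phi = np.linspace(0.1, 0.95, n)     # slowly increasing memory
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi[t] * x[t - 1] + rng.normal()

ews = rolling_ews(x, window=400)
print(f"early indicator: {ews[0]:.2f}, late indicator: {ews[-1]:.2f}")
```

Note that the indicator rises without the analyst ever fitting or selecting a mechanistic model of the system, which is exactly the appeal, and the limitation, discussed above: the output is a classification signal, not a decision policy.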

At first glance, focusing on system classification is compelling: if we can identify a proximate tipping point or predict a threshold response in a system—regardless of the exact underlying model (e.g. [15])—surely a manager can leverage that knowledge to compose a good management strategy? But is washing our hands of devising quantitative decision strategies for complex systems really a good idea? Heuristic approaches to decision-making notably fail to design effective sequential decision strategies (in contrast to iterative or static decision strategies) compared with formal approaches [16]. This leads us to ask whether the limitations of heuristic decision-making are preferable to those faced by traditional model-based decision theory [17].

If we can predict thresholds more successfully without models, can we also derive quantitative decision strategies for those systems more effectively without models (and the assumptions of model-based decision-theoretic approaches)? Within the field of computer science, a rapidly emerging class of machine learning algorithms has proven remarkably effective at making complex sequential decisions without first learning a model of the system [18]. Interestingly, these model-free reinforcement learning (RL) algorithms often mimic what any good manager does in the same situation: forgoing learning over an explicit set of process-based models and relying instead on knowledge from successes and failures experienced by repeatedly making decisions. But unlike a human manager, RL algorithms can process near real-time and high-dimensional data as well as learn strategies by interacting with a wide diversity of simulations and scenarios [19] that would be infeasible for a human to process. Moreover, RL algorithms show promise to more effectively manage systems with unknown tipping points than do fixed ‘rule-of-thumb’ strategies [20].

In this paper, we first highlight the limitations and assumptions of decision-theoretic approaches that have driven a wedge between the theory and practice of adaptive environmental management. We then explore examples of successful RL applications, highlighting which specific RL methods and concepts might provide a promising path forward to overcome some limitations of model-based adaptive management (table 1). Despite the promise of these emerging methods, the problem of making strategic decisions in non-identifiable and/or non-autonomous (systems subjected to external inputs) systems is not easily solved by even the best algorithms. We simultaneously suggest that RL research might benefit from the insights gleaned from trying to tackle pressing environmental problems (table 1). Finally, we emphasize that while RL approaches may, at last, allow algorithms to help human decision-makers better grapple with real-world complexities, such a transition will raise new challenges for equity, governance and accountability.

Table 1.

Emerging methods and approaches in reinforcement learning (RL) have the potential to help overcome a myriad of challenges in adaptive management, from making effective decisions in complex environments over long planning horizons to learning strategies in high-stakes situations. The following methods help define a research agenda for RL researchers seeking to contribute to environmental management. MDP, Markov decision process.

adaptive management challenge RL methods and concepts citation(s)
dealing with uncertainty
 Data used to train RL agents may not be precise or generalize to real-world settings. Robust RL aims to learn policies that perform well across a large class of possible environments, including environments that may not have been explicitly encountered during training. [21–23]
 When data are insufficient to reliably predict the outcome of an action, decision-makers wish to understand the degree of uncertainty. Uncertainty quantification measures and attempts to reduce uncertainty in predictive systems. These uncertainty estimates can be used to constrain the RL policy to avoid taking actions with high uncertainty in outcome, related to safe RL. [24–26]
 A given model does not precisely describe the true dynamics of the ecological system, regardless of what values are used to instantiate the parameters. Model misspecification refers to RL settings in which the Markov decision process used to model the environment does not describe reality. Model misspecification is typically addressed with robust RL, including model-free approaches. [27–29]
 For a given set of observed data, multiple sets of parameters used to instantiate the model may have the same probability distribution of being the best-fit model. Non-identifiability describes the challenge of learning and planning in the presence of unmeasured state variables (confounders). [30,31]
 The reward function may be unknown in advance. Inverse RL attempts to recover the reward function given an optimal policy and environment dynamics. [32,33]
limited opportunities to interact with real-world, high-stakes settings
 Must learn optimal policies using historical data, without collecting new data. This challenge arises when data collection is expensive—as is often the case in conservation management. Offline RL (or batch RL) learns the best policy possible given historical observations (a static dataset) without exploration. Requires off-policy evaluation to estimate the performance of policies that were never enacted in the historical data. [34]
 In high-stakes settings, managers wish to be risk-averse to avoid potentially catastrophic settings, such as unintentionally wiping out the population of one species. RL safety trains an RL policy by limiting explorative actions to those that are unlikely to reach very bad states, for example by imposing additional constraints or avoiding actions with high uncertainty. [35,36]
 Decision-makers wish to know why a policy calls for a specific action. Explainability in RL aims to provide human-understandable explanations for why an RL method recommends a specific action, in contrast to ‘black box’ methods. [37,38]
 Decision-makers want to have human experts oversee and possibly override an RL agent's decisions. Human-in-the-loop systems treat humans as experts and defer to these human experts to make decisions when the RL agent is highly uncertain. [39,40]
complex environments with long planning horizons
 The underlying rules of the environment may be changing over time, owing to exogenous factors such as climate change or socioeconomic drivers. Non-stationarity refers to changes over time in the underlying dynamics of the MDP. In dynamic settings, the state constantly changes, but with non-stationarity the transition dynamics and rewards shift as well. [41,42]
 Management decisions may be highly complex, thus difficult to learn from scratch. Curriculum learning trains an agent on progressively harder tasks, using transfer learning to build off knowledge learned from previous tasks in subsequent tasks. [43,44]
 Management decisions may require planning challenging multi-step tasks over long time horizons. Hierarchical RL decomposes long-horizon tasks into more tractable subtasks. [45]
 After we receive a reward, we want to know which action(s) were critical to that outcome. This challenge is exacerbated by delayed feedback. Credit assignment evaluates the utility of individual actions over a long sequence of steps. [46–48]
 Rewards may be sparse, especially with delayed feedback, and the benefit of intermediate actions may not be immediately obvious. Reward shaping provides more gradual, localized feedback to guide the policy toward high-reward states. [49–51]
multiple stakeholders, multiple agents
 Multiple stakeholders may each have their own objective. Multi-objective RL learns optimal decisions in the face of multiple conflicting objectives. This challenge is most salient when the relative weighting (importance) of the objectives is not known. [52,53]
 Multiple agents may be acting in an environment simultaneously. Multi-agent RL trains agents to act in the presence of other agents. In a cooperative setting, these agents share a goal but in a competitive setting, these agents have non-aligned goals that may be in conflict. [54,55]
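To make one row of the table concrete: the simplest and most common approach to the multi-objective setting is linear scalarization, which collapses a vector of per-objective rewards into a single reward using stakeholder weights, and which assumes those weights are known. The actions and payoff numbers below are hypothetical.

```python
import numpy as np

def scalarize(reward_vec, weights):
    """Weighted sum of per-objective rewards (assumes known weights)."""
    return float(np.dot(reward_vec, weights))

# Hypothetical (economic, ecological) payoffs for two candidate actions.
rewards = {"harvest": np.array([1.0, -0.3]),
           "rest":    np.array([0.0,  0.4])}

for w_econ in (0.9, 0.5, 0.2):                    # sweep stakeholder weightings
    w = np.array([w_econ, 1 - w_econ])
    best = max(rewards, key=lambda a: scalarize(rewards[a], w))
    print(f"economic weight {w_econ}: prefer '{best}'")
```

The sweep shows why the unknown-weights case flagged in the table is the hard one: the preferred action flips as the relative importance of the objectives changes.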

2. Adaptive management: translating theory to practice

Given the limitations of model-based approaches to adaptive management (box 1), the practice of adaptive management—and its integration into policy—has deviated from formal definitions of decision theory (figure 1). Rather, implemented adaptive management strategies often take ‘rule-of-thumb’ approaches, substituting human judgment and the experience of decision-makers in place of computationally intensive models [7,61]. While more loosely defined and human-informed adaptive management processes overcome some of the technical shortcomings of model-based adaptive management (e.g. by allowing consideration of greater system complexity and the social context of decisions), these approaches also face several limitations. Beyond the myriad of potential non-technical risks, such as goal slippage, manager turnover and sensitivity to political asymmetries and institutional change [62], ‘rules of thumb’ face technical constraints. For example, the exponentially increasing amount of monitoring data from automated sensors [63] quickly becomes impractical for humans to integrate into a heuristic decision process.

Box 1. Examples of model-based adaptive management and their key challenges.

(A) adaptive flow regimes: Tallapoosa River
 For nearly a decade, the United States Geological Survey (USGS) implemented adaptive management on the Tallapoosa River to determine optimal flows under multiple competing stakeholder objectives. The iterative decision model included annual monitoring of ecosystem indicators to vary flow regimes optimally. Despite the adaptive decision framework, the best management strategies to provide both adequate hydrological and thermal habitats while satisfying recreational values remain a central controversy in the system [56]. objective: ensure the conservation of at-risk species and meet ecosystem service objectives
learning: passive learning of ecosystem dynamics and the valuation of ecosystem services
states: discrete set of ecosystem indicators, ecosystem values
actions: four flow allocation strategies
key challenges faced: multiple competing objectives, noisy observations, delayed feedbacks, stochasticity, high dimensionality
(B) adaptive harvest management: waterfowl
 Since 1995, the United States Fish and Wildlife Service has used an adaptive management framework to regulate duck harvests. Harvest quotas rely on an iterative cycle of monitoring, assessment and decision-making. Based on monitoring data, managers continually refine models of the relationships between hunting regulation, harvests and waterfowl abundance. While significant updates to the model weights have occurred over the past 20 years, non-stationarity due to global change challenges the current methodology [57]. objectives: sustainable harvest of waterfowl
learning: passive learning of population dynamics in response to harvest regimes
states: waterfowl abundance
actions: annual harvest quotas
key challenges faced: non-stationarity, delayed feedbacks, stochasticity, multiple objectives
(C) adaptive harvest management: red knots (Calidris canutus rufa) and horseshoe crabs (Limulus polyphemus)
 Following increased harvest of horseshoe crabs in the Delaware Bay during the 1990s, migratory shorebird populations declined steeply. Recognizing this decline, the fisheries commission began regulating the horseshoe crab harvest. Proposed adaptive management frameworks focus on two competing models of red knot population dynamics and horseshoe crab harvests, seeking to iteratively improve harvest policies for both objectives [58]. objective: ensure the conservation of at-risk species while also meeting harvest objectives
learning: passive learning of multi-species dynamics
states: red knot abundance and fecundity, horseshoe crab abundance
actions: harvest quota for horseshoe crabs
key challenges faced: model set limitations, dimensionality, delayed feedbacks, multiple objectives

Figure 1.

Figure 1.

(a) Decision-theoretic approaches to adaptive management—what we refer to as ‘model-based’ adaptive management—formulate problems as Markov decision processes (MDPs) with unknown state transition probabilities (for a helpful overview of MDPs for ecology, see [16,59]). At each time step t, the agent takes an action on the environment, and the environment transitions from the current state st to the next state st+1 and provides the agent with a reward r. After taking an action, the agent observes the system state and rewards to update its belief of the underlying dynamics of that system to inform the decision strategy (policy) that selects the next action. (b) While theory provides a formalized way to ‘learn’ about systems while acting on them, in practice, adaptive management decisions are usually made by humans who may or may not use the output of models to support decisions but often rely on experience with the complex systems and sociopolitical context not captured by stylized system models. (c) Like classical model-based optimization approaches, the task for a reinforcement learning agent is to learn a ‘policy’ that maps the state of the system to an action the agent should take to maximize the expected sum of future rewards [60]. This can be done by interacting with historical data or simulators. However, the process by which reinforcement learning (RL) learns optimal policies is fundamentally different from (a). (d) Unlike the classical methods used in model-based adaptive management (AM), RL allows action strategies to be developed without learning over a predetermined set of potential models. (Online version in colour.)

Thorough reviews of the application [7,8], theory [64] and legal implications [65] of adaptive management already exist in the literature. In figure 1a,b, we summarize the adaptive management paradigm. It should be noted that while adaptive management is not appropriate in all settings, the benefits of leveraging adaptive over non-adaptive approaches are particularly apparent in contexts where there is deep uncertainty about system dynamics—a setting in which RL might hold the most promise. In the remainder of this section, we illustrate, through examples, the key gaps between adaptive management as it is expressed in theory and the requirements for its effective implementation in practice. While decision-makers' deviations from scientific and/or algorithmic recommendations and definitions of adaptive management (figure 1b) are often viewed with scepticism, we explore how this deviation frequently reflects the inability of models integrated into decision-theoretic frameworks to sufficiently account for real-world complexity and challenges (box 1).

Take, for example, the Tallapoosa River, where an upstream hydropower dam had uncertain effects on the integrity of the river's species-rich ecological community, prompting a decade-long adaptive management effort [9,56] (box 1A). Monitoring how multiple ecological indicators and stakeholder preferences respond to flow alterations was used to iteratively update a model (a state–space transition model) of the system, which was integrated into a decision-theoretic framework to optimize flow regimes [56]. While the river monitoring effort held promise to instruct more responsive and informed decisions, the high-dimensional system (which included multiple species and ecosystem indicators, recreational interests, dam revenue and seasonal temperature) and nearly unlimited potential flow regime strategies were reduced to three system ‘states’ and four potential flow regime ‘actions’ over which the formal adaptive management problem learned and optimized [56]. Would a more realistic representation of the system, at least mirroring the scope of monitoring data collected, and a more extensive set of possible actions have allowed the solutions to capture nuanced and temporal tradeoffs between human use and ecological demands in this system?

Unfortunately, what adaptive management models add in social and ecological complexity they often lose in the number of models and parameters over which they learn owing to computational constraints. For example, in the case of adaptively managing horseshoe crab harvests in the Delaware Bay (box 1C), proposed frameworks included multi-species population models and a broader set of potential actions but considered only two competing models over which to reduce uncertainty and derive decision strategies [58]. Beyond the challenges of capturing complexity and uncertainty of system states and actions, model-based adaptive management frameworks face limitations to inform decision-making in the context of global environmental change [64]. For example, in the case of adaptively setting quotas for waterfowl harvest in the USA (box 1B), the increasingly non-autonomous (non-stationary) nature of the system (owing to climate change not being included as part of the population model) is beginning to limit the applicability of the model-based decision-theoretic methodology to devising decision strategies [57]. Of course, climate dynamics could be integrated into the models. However, this further complicates both the model and the computational requirements to solve for optimal decision strategies using dynamic programming, as well as introduces additional uncertainty.
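The computational constraint described above can be made concrete: tabular dynamic programming enumerates a discretized state space that is the Cartesian product of its variables, so each added variable multiplies the table size, and a single sweep of value iteration examines on the order of |S|²|A| transitions. The variable counts and discretization levels below are purely illustrative.

```python
# Illustrative cost of tabular dynamic programming (the 'curse of
# dimensionality'). Numbers are hypothetical, not from any cited study.
def n_states(levels_per_var, n_vars):
    """Size of a discretized state space: levels^variables."""
    return levels_per_var ** n_vars

def sweep_cost(n_s, n_actions):
    """Transitions examined in one value-iteration sweep: O(|S|^2 |A|)."""
    return n_s * n_s * n_actions

for n_vars in (1, 3, 6):
    s = n_states(20, n_vars)
    print(f"{n_vars} variables x 20 levels -> {s:,} states, "
          f"~{sweep_cost(s, 4):,} operations per sweep")
```

Three variables at a modest 20 levels each already give 8000 states and hundreds of millions of operations per sweep, which is why formal adaptive management problems are so often reduced to a handful of states and actions.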

In each of the cases in box 1, human decision-makers wrestle with a growing number of complexities, competing needs and pressures left unanswered by the existing decision-theoretic adaptive management paradigms. These mismatches limit the utility of existing model-based optimization approaches to the decision-making process and risk unintended consequences of relying on best-fit models in decision processes. Importantly, proposals to leverage adaptive management for higher-dimensional environmental problems, such as climate mitigation [60] and protected area design and management [66,67], are likely only to widen the gap between model-based theory and the realities of management decisions in practice. While classification approaches enable skirting the decision problem altogether, less constrained approximate methods for finding optimal strategies have also been suggested for dealing with high-dimensional natural resource management problems for decades [68]. In the following sections, we explore how emerging methods from deep RL might allow us to leverage the benefits of both heuristic and computational approaches to adaptive management, improving our capacity to manage systems under uncertainty (table 1 and figure 1c).

3. Reinforcement learning as model-free adaptive management

RL has proven better than humans at making strategic, adaptive and complex decisions across a diversity of problems and domains. While RL's most cited feats are in the context of games (e.g. chess) [43] and robotic tasks [69], RL algorithms are increasingly used to solve planning problems across a variety of noisy and uncertain real-world settings, from healthcare [70] and energy systems [71] to biological systems [72] and economic policy [73].

The problem set-up for RL closely mirrors model-based adaptive management (figure 1). Like classical model-based optimization approaches [16], an RL agent aims to learn a ‘policy’ (decision strategy) that maps the state of the system to the best action to take to maximize the expected sum of future rewards [20]. However, the process by which RL learns optimal policies can be fundamentally different. Unlike classical dynamic programming methods leveraged in adaptive management examples from box 1, which require specifying state transition matrices (state–space models of the system over which to learn) [16] (box 1, figure 1a,d), model-free RL allows action strategies to be developed without a predetermined model (figure 1c), bypassing the need to iteratively learn a ‘best model’ of system dynamics altogether (figure 1d). Importantly, RL learns action strategies through experience, which can include simulated experience, experience derived from historical data, and/or real-world experience.
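The contrast with the dynamic-programming approaches of box 1 can be sketched in a few lines. The two-state ‘harvest’ environment below is a hypothetical toy of our own design, not any system from box 1: a tabular Q-learning agent improves its policy purely from observed transitions and rewards, and is never given (and never estimates) a transition matrix.

```python
import numpy as np

# Hypothetical two-state harvest MDP (an illustrative stand-in):
# state 0 = depleted stock, state 1 = healthy stock;
# action 0 = rest, action 1 = harvest.
def step(state, action):
    if state == 1 and action == 1:
        return 0, 1.0          # harvesting a healthy stock: big reward, depletes it
    if state == 0 and action == 1:
        return 0, 0.2          # harvesting a depleted stock: small reward, stays depleted
    return 1, 0.0              # resting lets the stock recover (or stay healthy)

rng = np.random.default_rng(1)
Q = np.zeros((2, 2))           # Q[state, action], learned without a transition model
gamma, alpha, eps = 0.9, 0.1, 0.2
state = 1
for _ in range(20_000):
    action = int(rng.integers(2)) if rng.random() < eps else int(Q[state].argmax())
    nxt, reward = step(state, action)
    # model-free temporal-difference update: no transition matrix required
    Q[state, action] += alpha * (reward + gamma * Q[nxt].max() - Q[state, action])
    state = nxt

policy = Q.argmax(axis=1)
print(f"policy: depleted->{'rest' if policy[0] == 0 else 'harvest'}, "
      f"healthy->{'rest' if policy[1] == 0 else 'harvest'}")
```

The agent converges on ‘rest when depleted, harvest when healthy’, the far-sighted strategy, even though harvesting a depleted stock yields a small immediate reward; nothing about the environment's dynamics was supplied to the learner.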

First, RL can learn from simulated experience. A notable example is Atari, where researchers achieved human-level performance across dozens of Atari video games by training RL agents over millions of time steps (decisions) [19] (box 2A). Instead of fine-tuning a new model for each of the games, a single RL agent using the same neural network architecture and hyperparameters was applied to 49 different Atari games, reaching performance comparable to a professional human game player across a majority of the games. Without the goal of learning a single system transition matrix, the RL agent learns generic concepts that allow strategic decision-making in a diversity of settings. Beyond gameplay, simulators can also model physical environments such as atmospheric wind conditions, which were used to build a simulator for training an RL agent to navigate super-pressure balloons in the stratosphere (box 2B). In the context of adaptive management, simulators could be designed to describe water flow, species population dynamics or a changing climate. Leveraging simulated experience to learn policies for adaptive management problems could provide a means to integrate more complex system dynamics and action sets, such as in non-autonomous environments.
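As a sketch of what such a simulator might look like, the minimal fishery model below is entirely hypothetical: the logistic growth rule, tipping-point threshold and constants are assumptions rather than anything drawn from the cited studies. An RL agent would be trained against its step interface, just as agents are trained against game or wind simulators.

```python
import numpy as np

class FisherySim:
    """Minimal, hypothetical fishery simulator with a tipping point
    (illustrative only; the growth model and constants are assumptions)."""
    def __init__(self, r=0.8, K=1.0, tip=0.15, sigma=0.02, seed=0):
        self.r, self.K, self.tip, self.sigma = r, K, tip, sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.pop = 0.75 * self.K
        return self.pop

    def step(self, quota):
        """Apply a harvest quota; return (next_state, reward, done)."""
        harvest = min(quota, self.pop)
        self.pop -= harvest
        # logistic growth, with collapse dynamics below the tipping point
        if self.pop < self.tip:
            self.pop *= 0.5
        else:
            self.pop += self.r * self.pop * (1 - self.pop / self.K)
        self.pop = max(0.0, self.pop + self.sigma * self.rng.normal())
        return self.pop, harvest, self.pop <= 1e-3

sim = FisherySim()
state, reward, done = sim.step(quota=0.1)
print(f"stock after one step: {state:.2f}")
```

Because the simulator can be reset and perturbed millions of times, an agent can experience collapses that would be unacceptable to trigger in a real fishery, which is precisely the appeal of simulated experience for high-stakes systems.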

Box 2.

Examples of reinforcement learning applications and the key challenges they address.

(A) playing strategic games: Atari
 Dealing with high-dimensional inputs to make effective decisions across different tasks and situations remains a key challenge for RL. Atari consists of 49 distinct video games with visual inputs and has become a go-to benchmark for developing RL algorithms. Using deep Q-learning, a single RL agent was trained across dozens of games to outcompete human players [19]. key challenges addressed: high-dimensional input, diverse tasks, long horizon
objective: maximize score across a set of Atari games
state: 84 × 84 × 4 stacked grey-scale frames, preprocessed from colour video at 60 Hz
action: discrete, variable for each game
reward: rescaled game scores
RL approach: deep Q-learning
(B) flight control: stratospheric balloons
 Stratospheric balloons are high-altitude balloons that can remain 15–60 km above sea level for months at a time. These balloons carry up to 1.1 tons of payload, typically tools used for weather forecasting, satellite navigation, atmospheric chemistry experiments and testing new space technology. Navigation in these high-altitude settings is dependent on stratospheric winds, of which relevant meteorological data are sparse, and solar availability, which is needed to charge the battery. Model-free reinforcement learning enabled effective navigation over the Pacific Ocean over 39 days, using distributed Q-learning [74]. key challenges addressed: incomplete data, noisy observations, unreliable solar availability, safe navigation, long planning horizon
objective: navigate a super-pressure balloon to float near weather station
state: 1083 wind variables and 16 ambient variables
action: discrete (ascend, descend, stay)
reward: distance from a weather station, with maximum reward within 50 km of the station
RL approach: model-free Q-learning with experience replay
(C) clinical decision-making: sepsis treatment
 Sepsis is a life-threatening excessive immune response to an infection that may lead to organ failure or death. Treating sepsis involves a complex mix of antibiotics, corticosteroids, timing and dosage of drugs, and intravenous fluids. The treatment regimen for patients in the intensive care unit (ICU) must be customized to each individual patient based on that patient's response to medical interventions. Deep Q-learning was used to learn treatment policies estimated to reduce patient mortality by 1.8–3.7% [75]. key challenges addressed: continuous state, sparse reward signals, stochasticity, delayed feedback, interpretability
objective: improve patient survival
state: continuous; 48 values of demographics, physiological data and vital signs
action: discrete. 5 × 5 intervention options with different amounts of IV fluid and vasopressor dosage
reward: weighted sum of indicators of patient health including the extent of organ failure and changes in blood pressure
RL approach: dueling double deep Q-learning

While RL allows us to relax assumptions required by most model-based adaptive management frameworks (e.g. MDPs), managing non-autonomous systems remains a hard problem for both computer and environmental science. We do not suggest that RL will readily overcome this challenge, but rather that in complex socio-ecological settings (e.g. the Tallapoosa River, box 1A), RL might outperform both human heuristics and model-based adaptive management methods through exploring a set of complex simulated environments millions of times, accruing orders of magnitude more feedback than a system that could only interact with the real environment [76].

Notably, simulated experience is not the only way RL can learn decision strategies. Deep RL has been shown to learn effective policies from historical data in the absence of a simulator by stitching together trajectories of system observations (offline RL) [34,77]. For example, in hospitals—which often have decades of historical records about patient status, treatment and outcomes—this offline approach was taken to learn individualized treatments for sepsis patients (box 2C). Of course, the extensive data available in some societal domains like healthcare are less common in environmental systems. In some environmental challenges, such as water and air quality monitoring, sensors are already constantly taking samples, potentially allowing the impact of specific actions to be tracked. While this level of monitoring data is not yet available for many other ecological management problems, improved sensing of everything from vegetation dynamics to species occurrence [78] is moving the field in that direction, likely making RL-based management more feasible.
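A minimal sketch of this offline setting, using tabular fitted Q-iteration (one simple batch RL method) on a hypothetical two-state harvest system of our own design: the learner sees only a fixed log of past transitions, assumed here to come from a uniform-random behaviour policy, and never interacts with the system.

```python
import numpy as np

# Hypothetical two-state harvest dynamics standing in for a historical
# monitoring dataset: state 0 = depleted, state 1 = healthy;
# action 0 = rest, action 1 = harvest.
def step(state, action):
    if state == 1 and action == 1: return 0, 1.0   # big harvest depletes the stock
    if state == 0 and action == 1: return 0, 0.2   # harvesting while depleted
    return 1, 0.0                                  # resting recovers/keeps the stock

rng = np.random.default_rng(2)
log, s = [], 1
for _ in range(5_000):                             # behaviour policy: uniform random
    a = int(rng.integers(2))
    s2, r = step(s, a)
    log.append((s, a, r, s2))
    s = s2

Q = np.zeros((2, 2))
gamma = 0.9
for _ in range(100):                               # fitted Q-iteration sweeps
    target_sum = np.zeros((2, 2))
    count = np.zeros((2, 2))
    for (s, a, r, s2) in log:
        target_sum[s, a] += r + gamma * Q[s2].max()
        count[s, a] += 1
    Q = target_sum / np.maximum(count, 1)          # regress targets onto (s, a)

print("offline policy:", Q.argmax(axis=1))         # expect [0, 1]: rest when depleted
```

The recovered policy ('rest when depleted, harvest when healthy') was never enacted as such in the log; it is stitched together from fragments of random behaviour, which is the core promise and the core evaluation difficulty of offline RL.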

A key challenge in the environmental domain will be off-policy evaluation, to estimate the performance of policies that were never observed in the offline data [79]. A large body of techniques for off-policy evaluation has been developed for RL for observational healthcare [80]. Additional observations collected through real-world (online) experience may then be used to improve the policy further or ‘adaptively’ update policies while taking actions.
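The simplest off-policy evaluation idea is importance sampling: reweight each logged outcome by how much more (or less) likely the target policy is to have produced it than the behaviour policy. The one-step, bandit-style set-up below is a hypothetical illustration with made-up probabilities; the trajectory-level estimators used in practice build on the same reweighting principle but are considerably more involved.

```python
import numpy as np

rng = np.random.default_rng(3)

n_actions = 3
true_reward = np.array([0.2, 0.5, 0.8])   # unknown to the evaluator
behaviour = np.array([0.6, 0.3, 0.1])     # logging policy (probabilities known)
target = np.array([0.1, 0.2, 0.7])        # policy we want to evaluate

# logged data: actions drawn from the behaviour policy, with noisy rewards
actions = rng.choice(n_actions, size=50_000, p=behaviour)
rewards = true_reward[actions] + 0.1 * rng.normal(size=actions.size)

# importance-sampling estimate: reweight each logged reward by pi_target/pi_behaviour
weights = target[actions] / behaviour[actions]
v_hat = float(np.mean(weights * rewards))
v_true = float(np.dot(target, true_reward))
print(f"IS estimate: {v_hat:.3f}  (true value: {v_true:.3f})")
```

The estimate is unbiased, but the weights grow large whenever the target policy favours actions the behaviour policy rarely took, inflating variance; in environmental records, where past management was often conservative and repetitive, this is exactly the regime to expect.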

Emerging methods in RL have the potential to address more than just the issue of dimensionality and non-autonomous dynamics in natural resource management problems. From making effective decisions in environments with sparse rewards to addressing systems with multiple competing objectives, in table 1, we map adaptive management challenges to RL methods that might help address them.

4. Possibilities and pitfalls of applying reinforcement learning to adaptive management problems

If real-world complexity has forced the practice of adaptive management beyond the reach of theory, emerging paradigms of RL appear to at last be putting such challenges within reach of algorithms (table 1). But just because this may be possible, is it really a good idea? Adapting RL for adaptive management could open possibilities [20,76] but also introduces new pitfalls while re-surfacing age-old concerns about algorithmic decision processes [81]. We divide these possibilities and perils into three themes. First, an RL approach entails a conceptual shift in learning, towards something based on heuristics and experience rather than rigorous mathematical theorems; this can recapitulate some of the benefits, but also the shortcomings, of human decision-making. Second, RL still faces the same challenge as any objective-based decision-making: accurately defining the task at hand. RL overcomes some computational constraints but still requires defining a scope of possible rewards, states and actions; we refer to this as world-making. Third, deep RL's technical and computational needs may limit its application to the largest technology institutions with access to these resources; this stakeholder shift exacerbates potential ethical and political consequences. Here we outline both the technical and social components of these opportunities and challenges across the three main themes of (1) learning, (2) world-making, and (3) shifting stakeholders (figure 2). We hope both the RL and adaptive management communities recognize these challenges and focus on addressing them when developing and implementing environmental decision-making systems.

Figure 2.

Traditional adaptive management relies on modelling the environment using Markov decision processes, which mirrors a ‘model-based’ approach to RL (green pathway, left). Model-free RL (grey pathway, centre) eschews learning an intermediate model and instead directly estimates the reward for taking specific actions in a given state. As we outline, reinforcement learning brings both promise and new challenges to adaptive management across learning, world-making and shifting stakeholders, which all affect different components of the RL pipeline. (Online version in colour.)

(a) . Learning

The central difference between model-free deep RL and the theory of adaptive management regards learning. In adaptive management, learning is defined as reducing uncertainty over parameters or candidate predictive models1 of the underlying ecological processes (figure 1a). Learning is expressed in terms of quantitatively precise probability distributions and realized through mathematically precise theorems such as Bayes' rule to dictate how model ‘beliefs’ (probability distributions) narrow in response to actions and new information (figure 1a). By contrast, a human manager does not necessarily need a predictive model of the process to adjust a decision (figure 1b). It is possible to propose a policy without a model based on experience alone. For example, if the estimated waterfowl population (box 1B) decreased too much last year compared to the year before, it's probably a good idea to lower the harvest quota this year. Of course, the theory may give the same answer with more quantitative precision—how much to lower the quota (and also just how much to re-adjust ‘belief’ probabilities towards some more pessimistic growth rates of the species)—but that answer is only as good as the models it considers. The manager's experience may factor in variables ignored by the models—a harvest quota of zero may be socially or politically unacceptable, while past experience of ups and downs may provide an experienced manager with a notion for the right size of adjustment with nary an equation [82,83].
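
The belief-updating step of model-based adaptive management described above can be sketched in a few lines. Here, hypothetically, two candidate population models (optimistic versus pessimistic growth rate) receive Bayes-rule weight updates after a single observed decline; the Gaussian observation model and all numbers are illustrative.

```python
import numpy as np

def bayes_update(prior, likelihoods):
    """One step of Bayes' rule over a discrete set of candidate models."""
    posterior = prior * likelihoods
    return posterior / posterior.sum()

def gaussian_likelihood(obs, pred, sigma=10.0):
    # Unnormalized Gaussian observation model (normalization cancels above).
    return np.exp(-0.5 * ((obs - pred) / sigma) ** 2)

prior = np.array([0.5, 0.5])      # equal belief in both candidate models
growth = np.array([1.10, 0.95])   # optimistic vs pessimistic growth rate
last_pop, this_pop = 100.0, 96.0  # an observed decline
posterior = bayes_update(prior, gaussian_likelihood(this_pop, growth * last_pop))
# Belief narrows towards the pessimistic model after the decline.
```

The posterior then feeds back into the next harvest decision; the manager's rule of thumb shortcuts this machinery but also forgoes its quantitative precision.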

Model-free RL capitalizes on this experience-driven approach to decision-making. The RL agent does not need to predict future states; it decides what action to take given only experience from past states and resulting rewards (or costs). The RL researcher places an untrained agent in a novel environment (usually a computer simulation, e.g. in the Atari example (box 2A)), in which the agent takes exploratory actions while adjusting its policy to improve long-term reward. From repeated simulations over hundreds of thousands of episodes, the agent will extensively explore the space of actions and outcomes. The process is sometimes compared to a newborn first exploring the world around them. Like the newborn, this RL agent is not entirely naive—the researcher must select among a myriad of specific algorithms each with very different approaches to solving the RL problem. The researcher, just like the parent of a newborn, may present a modified system of rewards and costs to coax along the desired learning more efficiently in a process called reward shaping (table 1). Reward shaping becomes particularly useful when there is a big payoff only after a long sequence of actions (e.g. rewarding distance to the end of a maze rather than only completion).

When trained in a single environment, the strategies that RL agents learn rarely generalize to even small deviations from past experience. The agent will often overfit to the smallest details—the units of measurement, the duration of the particular episode. More generally applicable strategies can be found by presenting the agent with a wide variety of environments. For example, one pervasive challenge in learning from simulated experience is the ‘sim2real’ gap, the difference between an RL agent's performance in a simulated environment versus a real environment [84]. Robust RL techniques may help close the sim2real gap and avoid overfitting [85] (table 1). A wealth of emerging approaches seek to improve generalizability. Curriculum learning algorithms seek to provide the most efficient way to interleave different environments (table 1). In adversarial learning, a second agent seeks to learn and propose alterations to the environment that are most likely to fool the focal agent into poor performance.
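
One simple way to probe the sim2real concern is to evaluate a candidate policy across environments whose parameters are randomized rather than fixed, reporting worst-case as well as average performance. The logistic harvest model and parameter ranges below are purely illustrative.

```python
import numpy as np

# Domain-randomization-style evaluation: instead of scoring a harvest policy
# in one calibrated simulator, sample the growth rate r anew for each episode
# and report both average and worst-case return. Dynamics are illustrative
# (logistic recruitment with a proportional harvest rule).
def simulate(policy, r, K=100.0, x0=50.0, steps=50):
    x, total = x0, 0.0
    for _ in range(steps):
        h = policy(x)               # harvest decided from current stock
        x = max(x - h, 0.0)
        x += r * x * (1 - x / K)    # logistic recruitment
        total += h
    return total

def evaluate_randomized(policy, n=200, seed=1):
    rng = np.random.default_rng(seed)
    returns = [simulate(policy, r=rng.uniform(0.2, 0.8)) for _ in range(n)]
    return float(np.mean(returns)), float(np.min(returns))

mean_ret, worst_ret = evaluate_randomized(lambda x: 0.1 * x)  # fixed 10% harvest rule
```

Training (rather than just evaluating) across such sampled environments is the essence of domain randomization; adversarial learning instead searches for the parameter settings that minimize the worst-case return.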

Most large-scale ecological simulation systems still fall short of capturing the many processes involved [76], yet they already lie beyond the reach of the dynamic programming methods of classical adaptive management. Collectively, advances in the ecological realism of simulations and in computational RL methods make it feasible to train intelligent agents across a wide variety of simulated environments. Historical observations can only be paired with historical actions, and thus never give an agent much insight into the outcomes of actions not taken, but they can nevertheless be used to supplement and ground-truth training based on simulation (table 1). A more fraught question concerns the role of RL-based learning in real-world contexts. Like the distinction between active and passive adaptive management, RL is typically divided between ‘training’ and ‘evaluation’ loops. In the training loop, the RL agent explores its action space to discover and adjust its decision strategy (active learning). In the evaluation loop, acting is essentially passive, with the RL agent seeking to maximize expected utility without updating its decision strategy. Evaluation need not always be passive in RL (especially in ‘low-stakes’ real-world scenarios, such as a physical robot learning to walk or handle objects), but this split mirrors the general preference of managers for passive adaptive management in high-stakes scenarios.

(b) . World-making

While RL might allow quantitative adaptive management to consider more realistic state and action spaces, reducing the numerical constraints on problems only reframes the issue of distilling a complex environment: how do we bound an environmental state, define management objectives and determine a set of available actions while ensuring these represent environmental realities and the values of those most affected by the decisions? How do we create sufficiently realistic simulations, or decide on the appropriate data streams with which to train algorithms? RL might expand the scope and range of problems that we can solve, but it does not remove the sociopolitical considerations inherent in how those problems are defined.

Let us imagine reformulating the waterfowl harvest problem as an RL problem. We could simply create a simulation of the states and their responses to actions in alignment with the current model-based formalization (box 1B): the action space is an annual harvest quota to maximize expected long-term yields, and the state space is a one-dimensional representation of waterfowl stock. But given fewer constraints, we might represent the waterfowl stock as part of a larger ecological (or socio-ecological) system responding to human land use, climatic shifts and weather extremes, overcoming shortcomings of current methods such as their limited ability to represent non-autonomous systems. Even if the action space remained the same (harvest quota), the algorithm's optimal policy would change. The action space could similarly be expanded to better capture the possibility of decisions (e.g. a temporally and spatially dynamic closure rather than a single harvest quota), changing not only the policy and its impact on resource users, but the underlying system trajectory. These changes seem benign, if not beneficial, but it is easy to envision how the imagination and values of the algorithm designer shape not only the conceptualization of the environment (state spaces), but also the solutions derived and actions taken, which ultimately feed back to create the reality of that system [81]. This leaves us questioning which parts of the system are included in the simulations, and how that choice might shift the distribution of benefits and costs.

Barriers to the adoption of adaptive management strategies arise not only from a lack of realism in system formalization or capacity to deal with complexity, but also from disagreement over whose values are represented in the decision objectives and the potential risks of following algorithmic suggestions [86]. RL does not sidestep these issues, but methods such as multi-objective RL (table 1) can learn optimal decisions in the face of multiple conflicting objectives, and inverse RL can help align the values encoded in formalized rewards with real-world values (value alignment) [32,33] (table 1). Additionally, reward shaping (table 1) can help ensure that RL agents do not myopically favour actions that yield short-term gain over long-term benefit.
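
A minimal sketch of the simplest multi-objective device, linear scalarization: the reward is vector-valued (here, hypothetically, harvest yield and a persistence score), and explicit weights make the trade-off between objectives visible rather than implicit in a single ad hoc reward. Names and numbers are illustrative.

```python
import numpy as np

def scalarize(rewards, weights):
    """Linear scalarization of a vector-valued reward."""
    r, w = np.asarray(rewards, float), np.asarray(weights, float)
    return float(r @ w)

# Hypothetical one-step vector reward: (harvest yield, persistence score).
step_reward = (4.0, 0.8)
pro_yield = scalarize(step_reward, (1.0, 0.1))         # weights favouring yield
pro_conservation = scalarize(step_reward, (0.1, 1.0))  # weights favouring persistence
```

Sweeping the weight vector and re-solving traces out an approximation of the Pareto front between objectives, giving stakeholders a menu of policies rather than a single answer.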

Even in light of technical methods to improve the alignment of values and capture multiple objectives, the issues of world-making are social and normative at their core. Specifying states, actions and rewards in RL applications will necessarily reflect both the epistemic and contextual values of the developer [81,87]. This raises the question: who has the power and capacity to define problems and develop RL algorithms?

(c) . Shifting stakeholders

Given both the technical expertise and computational requirements needed to train RL algorithms, industry (specifically big tech) involvement in the development and deployment of these methods is commonplace across environmental application domains [88]. The shift from government-maintained and managed algorithms—as is currently the case in most environmental adaptive management contexts (such as waterfowl adaptive management; box 1B)—to industry-maintained algorithms would create a new set of actors in environmental regulatory processes. Because political and financial concerns may influence the design of RL environments and agents, developing transparent and inclusive participatory processes will be critical to the ethical and equitable development and application of RL to adaptive management problems.

Beyond shifting power and creating new environmental actors, RL-derived environmental decisions risk undermining trust in environmental governance systems by increasing the ambiguity of who is accountable for future environmental degradation [81]. If a decision is derived from an algorithm that relies on trial and error rather than clearly mapping to a model choice, are poor outcomes anyone's fault? Moreover, if an RL agent continues to learn and adapt while interacting with the system (adaptive management, by definition), how do we ensure that its policies are meaningfully overseen [87]? In this way, RL differs from more transparent model-based methods in the relative lack of capacity to query solutions, potentially obscuring biases and compliance with regulations [89]. Lessons from RL applications to other safety-critical domains, such as nuclear fusion [90], and tools from the explainable AI subfield [91,92], might help mitigate these issues (table 1). However, problems of explainability and safety become even more pronounced when RL is proposed for controlling less identifiable and high-dimensional systems, as is the case in many environmental management contexts.

Ethical AI principles provide some guidance to procedure and practice to ensure safe application of algorithms. But these guidelines, like the algorithms themselves, are primarily developed in the Global North, notably missing perspectives from Central and South America, Africa and Central Asia [93]. Moreover, ethical guidelines rarely address the many dimensions of power implicated in world-making; not only the power to make decisions or define objectives, but power to set the agendas (e.g. defining objectives, state and action spaces) and shift ideologies [81]. Applying decolonial theories to AI application and development, as discussed in [94], might help address the shortcomings of AI ethics and recentre the importance of power and representation in procedural and development processes.

While the technical synergies and differences between RL and model-based adaptive management methods are outlined throughout this paper, simultaneously considering the parallels among AI ethics [93], science and technology studies [95] and political ecology [96] is critical when contemplating applications of RL to safety-critical real-world domains like environmental management.

5. Conclusion

To bridge the gap between the science and practice of adaptive management there is a need for decision-centred methods that capture the complexity and uncertainty of ecosystems. Advances in deep learning have positioned RL as a promising approach to solve sequential problems under uncertainty, while sidestepping the need to define a set of candidate models or effectively refine our belief in those models. Here we highlight recent advances in RL methods that overcome several limitations—such as high-dimensional spaces, imperfect models and lack of accurate simulators—that have prevented adaptive management from moving beyond theory in complex situations. Simultaneously we underline key priorities for RL—such as robustness, safety and multi-objective rewards—to enable its effective and responsible deployment for ecological decision-making.

Endnotes

1

Some authors distinguish between model uncertainty that is ‘structural’ in nature, e.g. if recruitment follows a Ricker-shaped curve or a Beverton–Holt-shaped curve, versus uncertainty that is only of a ‘parametric’ nature—e.g. the value of initial growth rate r in a Ricker model. In practice, the lines are blurry as it is often possible for a structurally flexible enough model to represent both families of curves in terms of the choice of some additional parameters. In fact, the deep neural networks underlying most of modern machine learning including RL-based methods owe their success to being precisely such highly flexible function approximators. The key observation of model-free RL is that the functions we seek to approximate are not the process itself—the probability from any possible current state to any possible future state under any possible action—but rather, the often smaller map between possible states to the space of possible actions—the ‘policy function’ or ‘value function’ the manager should adopt.

Data accessibility

This article has no additional data.

Authors' contributions

M.C. and L.X.: conceptualization, investigation, resources, writing—original draft, writing—review and editing; M.L.: conceptualization, resources, writing—review and editing; C.B.: conceptualization, resources, writing—original draft, writing—review and editing.

All authors gave final approval for publication and agreed to be held accountable for the work performed herein.

Conflict of interest declaration

We declare we have no competing interests.

Funding

This material is based upon work supported by the National Science Foundation (grant no. DBI-1942280). L.X. was supported by a Google PhD Fellowship.

References

  • 1.Perino A, et al. 2022. Biodiversity post-2020: closing the gap between global targets and national-level implementation. Conserv. Lett. 15, e12848. ( 10.1111/conl.12848) [DOI] [Google Scholar]
  • 2.Arvai J, et al. 2006. Adaptive management of the global climate problem: bridging the gap between climate research and climate policy. Clim. Change 78, 217-225. ( 10.1007/s10584-006-9094-6) [DOI] [Google Scholar]
  • 3.Wilhere GF. 2002. Adaptive management in habitat conservation plans. Conserv. Biol. 16, 20-29. ( 10.1046/j.1523-1739.2002.00350.x) [DOI] [PubMed] [Google Scholar]
  • 4.Melis TS, Walters CJ, Korman J. 2015. Surprise and opportunity for learning in Grand Canyon: the Glen Canyon dam adaptive management program. Ecol. Soc. 20, 22. ( 10.5751/ES-07621-200322) [DOI] [Google Scholar]
  • 5.Polasky S, Carpenter SR, Folke C, Keeler B. 2011. Decision-making under great uncertainty: environmental management in an era of global change. Trends Ecol. Evol. 26, 398-404. ( 10.1016/j.tree.2011.04.007) [DOI] [PubMed] [Google Scholar]
  • 6.Walters CJ, Hilborn R. 1978. Ecological optimization and adaptive management. Annu. Rev. Ecol. Syst. 9, 157-188. ( 10.1146/annurev.es.09.110178.001105) [DOI] [Google Scholar]
  • 7.Rist L, Campbell BM, Frost P. 2013. Adaptive management: where are we now? Environ. Conserv. 40, 5-18. ( 10.1017/S0376892912000240) [DOI] [Google Scholar]
  • 8.Westgate MJ, Likens GE, Lindenmayer DB. 2013. Adaptive management of biological systems: a review. Biol. Conserv. 158, 128-139. ( 10.1016/j.biocon.2012.08.016) [DOI] [Google Scholar]
  • 9.Williams BK, Brown ED. 2014. Adaptive management: from more talk to real action. Environ. Manage. 53, 465-479. ( 10.1007/s00267-013-0205-7) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.McLain RJ, Lee RG. 1996. Adaptive management: promises and pitfalls. Environ. Manage. 20, 437-448. ( 10.1007/BF01474647) [DOI] [PubMed] [Google Scholar]
  • 11.Williams BK, Brown ED. 2018. Double-loop learning in adaptive management: the need, the challenge, and the opportunity. Environ. Manage. 62, 995-1006. ( 10.1007/s00267-018-1107-5) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Williams BK, Brown ED. 2016. Technical challenges in the application of adaptive management. Biol. Conserv. 195, 255-263. ( 10.1016/j.biocon.2016.01.012) [DOI] [Google Scholar]
  • 13.Fischer J, Peterson GD, Gardner TA, Gordon LJ, Fazey I, Elmqvist T, Felton A, Folke C, Dovers S. 2009. Integrating resilience thinking and optimisation for conservation. Trends Ecol. Evol. 24, 549-554. ( 10.1016/j.tree.2009.03.020) [DOI] [PubMed] [Google Scholar]
  • 14.Scheffer M, Carpenter S, Foley JA, Folke C, Walker B. 2001. Catastrophic shifts in ecosystems. Nature 413, 591-596. ( 10.1038/35098000) [DOI] [PubMed] [Google Scholar]
  • 15.Bury TM, Sujith RI, Pavithran I, Scheffer M, Lenton TM, Anand M, Bauch CT. 2021. Deep learning for early warning signals of tipping points. Proc. Natl Acad. Sci. USA 118, e2106140118. ( 10.1073/pnas.2106140118) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Marescot L, Chapron G, Chadès I, Fackler PL, Duchamp C, Marboutin E, Gimenez O. 2013. Complex decisions made simple: a primer on stochastic dynamic programming. Methods Ecol. Evol. 4, 872-884. ( 10.1111/2041-210X.12082) [DOI] [Google Scholar]
  • 17.Boettiger C. 2022. The forecast trap. Ecol. Lett. 25, 1655-1664. ( 10.1111/ele.14024) [DOI] [PubMed] [Google Scholar]
  • 18.Sutton RS, Barto AG. 2018. Reinforcement learning: an introduction. New York, NY: MIT press. [Google Scholar]
  • 19.Mnih V, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 529-533. ( 10.1038/nature14236) [DOI] [PubMed] [Google Scholar]
  • 20.Lapeyrolerie M, Chapman MS, Norman KEA, Boettiger C. 2022. Deep reinforcement learning for conservation decisions. Methods Ecol. Evol. 13, 2649-2662. ( 10.1111/2041-210X.13954) [DOI] [Google Scholar]
  • 21.Pinto L, Davidson J, Sukthankar R, Gupta A. 2017. Robust adversarial reinforcement learning. PMLR 70, 2817–2826. ( 10.48550/arXiv.1703.02702) [DOI] [Google Scholar]
  • 22.Xu L, Perrault A, Fang F, Chen H, Tambe M. 2021. Robust reinforcement learning under Minimax regret for green security. PMLR 161, 257–267. [Google Scholar]
  • 23.Rigter M, Lacerda B, Hawes N. 2021. Minimax regret optimisation for robust planning in uncertain Markov decision processes. In Proc. AAAI Conf. Artificial Intelligence 35, pp. 11 930–11 938. ( 10.1609/aaai.v35i13.17417) [DOI]
  • 24.Abdar M, et al. 2021. A review of uncertainty quantification in deep learning: techniques, applications and challenges. Inform. Fusion 76, 243-297. (https://www.sciencedirect.com/science/article/pii/S1566253521001081) [Google Scholar]
  • 25.Kahn G, Villaflor A, Pong V, Abbeel P, Levine S. 2017. Uncertainty-aware reinforcement learning for collision avoidance. arXiv, 1702.01182. ( 10.48550/arXiv.1702.01182) [DOI] [Google Scholar]
  • 26.Henaff M, Canziani A, LeCun Y. 2019. Model-predictive policy learning with uncertainty regularization for driving in dense traffic. arXiv, 1901.02705. ( 10.48550/arXiv.1901.02705) [DOI] [Google Scholar]
  • 27.Mankowitz DJ, et al. 2019. Robust reinforcement learning for continuous control with model misspecification. arXiv, 1906.07516. ( 10.48550/arXiv.1906.07516) [DOI] [Google Scholar]
  • 28.Roy A, Xu H, Pokutta S. 2017. Reinforcement learning under model mismatch. In Proc. 31st Conf. Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. Adv. Neur. Inform. Process. Syst. 30.
  • 29.Bagnell JA, Ng AY, Schneider JG. 2001. Solving uncertain Markov decision processes. Pittsburgh, PA: Carnegie Mellon University, Robotics Institute.
  • 30.Oberst M, Sontag D. 2019. Counterfactual off-policy evaluation with gumbel-max structural causal models. In Int. Conf. Machine Learning, Long Beach, California, 9–15 June 2019. PMLR 97, 4881–4890.
  • 31.Fu Z, Qi Z, Zhaoran Wang Z, Yang Z, Xu Y, Kosorok MR. 2022. Offline reinforcement learning with instrumental variables in confounded Markov decision processes. arXiv, 2209.08666. ( 10.48550/arXiv.2209.08666) [DOI] [Google Scholar]
  • 32.Ng AY, Russell S. 2000. Algorithms for inverse reinforcement learning. ICML 1, 2.
  • 33.Arora S, Doshi P. 2021. A survey of inverse reinforcement learning: challenges, methods and progress. Artif. Intell. 297, 103500. ( 10.1016/j.artint.2021.103500) [DOI] [Google Scholar]
  • 34.Levine S, Kumar A, Tucker G, Fu J. 2020. Offline reinforcement learning: tutorial, review, and perspectives on open problems. arXiv, 2005.01643. ( 10.48550/arXiv.2005.01643) [DOI] [Google Scholar]
  • 35.Garcıa J, Fernández F. 2015. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16, 1437-1480. [Google Scholar]
  • 36.Munos R, Stepleton T, Harutyunyan A, Bellemare M. 2016. Safe and efficient off-policy reinforcement learning. Adv. Neural Inform. Process. Syst. 29. ( 10.48550/arXiv.1606.02647) [DOI] [Google Scholar]
  • 37.Juozapaitis Z, Koul A, Fern A, Erwig M, Doshi-Velez F. 2019. Explainable reinforcement learning via reward decomposition. In IJCAI/ECAI Workshop Explainable Artificial Intelligence, Tokyo, Japan, 29 October 2019.
  • 38.Wells L, Bednarz T. 2021. Explainable AI and reinforcement learning—a systematic review of current approaches and trends. Front. Artif. Intell. 4, 550030. ( 10.3389/frai.2021.550030) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Mandel T, Liu YE, Brunskill E, Popović Z. 2017. Where to add actions in human-in-the-loop reinforcement learning. In Proc 31st AAAI Conf. Artificial Intelligence, San Francisco, California, USA, 4–9 February 2017, vol. 31. ( 10.1609/aaai.v31i1.10945) [DOI]
  • 40.Hadfield-Menell D, Dragan A, Abbeel P, Russell S. 2016. Cooperative inverse reinforcement learning. arXiv, 1606.03137. ( 10.48550/arXiv.1606.03137) [DOI] [Google Scholar]
  • 41.Da Silva BC, Basso EW, Bazzan AL, Engel PM. 2006. Dealing with non-stationary environments using context detection. In ICML '06: Proc. 23rd Int. Conf. Machine Learning, Pittsburgh, Pennsylvania, USA, 25–29 June 2006, pp. 217–224. ( 10.1145/1143844.1143872) [DOI]
  • 42.Chandak Y, Theocharous G, Shankar S, White M, Mahadevan S, Thomas P. 2020. Optimizing for the future in non-stationary mdps. PMLR 119, 1414–1425. [Google Scholar]
  • 43.Silver D, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 484-489. ( 10.1038/nature16961) [DOI] [PubMed] [Google Scholar]
  • 44.Narvekar S, Peng B, Leonetti M, Sinapov J, Taylor ME, Stone P. 2020. Curriculum learning for reinforcement learning domains: a framework and survey. arXiv, 2003.04960. ( 10.48550/arXiv.2003.04960) [DOI] [Google Scholar]
  • 45.Pateria S, Subagdja B, Tan A, Quek C. 2021. Hierarchical reinforcement learning: a comprehensive survey. ACM Comput. Surv. 54, 109. ( 10.1145/3453160) [DOI] [Google Scholar]
  • 46.Ke NR, Goyal A, Bilaniuk O, Binas J, Mozer MC, Pal C, Bengio Y. 2018. Sparse attentive backtracking: temporal credit assignment through reminding. Adv. Neural Inform. Process. Syst. 31. ( 10.48550/arXiv.1809.03702) [DOI] [Google Scholar]
  • 47.Hung C-C, Lillicrap T, Abramson J, Wu Y, Mirza M, Carnevale F, Ahuja A, Wayne G. 2019. Optimizing agent behavior over long time scales by transporting value. Nat. Commun. 10, 5223. ( 10.1038/s41467-019-13073-w) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Harutyunyan A, et al. 2019. Hindsight credit assignment. Adv. Neural Inform. Process. Syst. 32. ( 10.48550/arXiv.1912.02503) [DOI] [Google Scholar]
  • 49.Laud AD. 2004. Theory and application of reward shaping in reinforcement learning. PhD thesis, University of Illinois at Urbana-Champaign. [Google Scholar]
  • 50.Hu Y, Wang W, Jia H, Wang Y, Chen Y, Hao J, Wu F, Fan C. 2020. Learning to utilize shaping rewards: a new approach of reward shaping. Adv. Neural Inform. Process. Syst. 33, 15 931-15 941. ( 10.48550/arXiv.2011.02669) [DOI] [Google Scholar]
  • 51.Marom O, Rosman B. 2018. Belief reward shaping in reinforcement learning. Proc. AAAI Conf. Artif. Intell. 32. ( 10.1609/aaai.v32i1.11741) [DOI] [Google Scholar]
  • 52.Liu C, Xu X, Hu D. 2014. Multiobjective reinforcement learning: a comprehensive overview. IEEE Trans. Syst. Man Cybern. Syst. 45, 385-398. ( 10.1109/TSMC.2014.2358639) [DOI] [PubMed] [Google Scholar]
  • 53.Mossalam H, Assael YM, Roijers DM, Whiteson S. 2016. Multi-objective deep reinforcement learning. arXiv, 1610.02707. ( 10.48550/arXiv.1610.02707) [DOI] [Google Scholar]
  • 54.Shoham Y, Powers R, Grenager T. 2003. Multi-agent reinforcement learning: a critical survey. Technical Report. Stanford, CA: Stanford University.
  • 55.Zhang K, Yang Z, Başar T. 2021. Multi-agent reinforcement learning: a selective overview of theories and algorithms. In Handbook of reinforcement learning and control (eds G Kyriakos, YW Vamvoudakis, FL Lewis, D Cansever), pp. 321-384. Cham, Switzerland: Springer. [Google Scholar]
  • 56.Irwin ER, Freeman MC, Peterson J, Kennedy KD, Lloyd MC, Coffman KO, Kosnicki E, Hess T. 2019. Adaptive management of flows from RL Harris dam (Tallapoosa River, Alabama) – stakeholder process and use of biological monitoring data for decision making. Open-File Rep. US Geol. Surv., no. 2019-1026.
  • 57.Johnson FA, Boomer GS, Williams BK, Nichols JD, Case DJ. 2015. Multilevel learning in the adaptive management of waterfowl harvests: 20 years and counting. Wildl. Soc. Bull. 39, 9-19. ( 10.1002/wsb.518) [DOI] [Google Scholar]
  • 58.McGowan CP, et al. 2011. Multispecies modeling for adaptive management of horseshoe crabs and red knots in the Delaware Bay. Nat. Resour. Model. 24, 117-156. ( 10.1111/j.1939-7445.2010.00085.x) [DOI] [Google Scholar]
  • 59.Chadès I, Nicol S, Rout TM, Péron M, Dujardin Y, Pichancourt JB, Hastings A, Hauser CE. 2017. Optimization methods to solve adaptive management problems. Theor. Ecol. 10, 1-20. ( 10.1007/s12080-016-0313-0) [DOI] [Google Scholar]
  • 60.Ogden AE, Innes JL. 2009. Application of structured decision making to an assessment of climate change vulnerabilities and adaptation options for sustainable forest management. Ecol. Soc. 14, 11. ( 10.5751/ES-02771-140111) [DOI] [Google Scholar]
  • 61.Jacobson C, Hughey KFD, Allen WJ, Rixecker S, Carter RW. 2009. Toward more reflexive use of adaptive management. Soc. Nat. Resour. 22, 484-495. ( 10.1080/08941920902762321) [DOI] [Google Scholar]
  • 62.Biber E. 2013. Adaptive management and the future of environmental law. Akron Law Rev. 46, 4. [Google Scholar]
  • 63.Hino M, Benami E, Brooks N. 2018. Machine learning for environmental monitoring. Nat. Sustain. 1, 583-588. ( 10.1038/s41893-018-0142-9) [DOI] [Google Scholar]
  • 64.Williams BK. 2011. Adaptive management of natural resources—framework and issues. J. Environ. Manage. 92, 1346-1353. ( 10.1016/j.jenvman.2010.10.041) [DOI] [PubMed] [Google Scholar]
  • 65.Ruhl JB, Fischman RL. 2010. Adaptive management in the courts. MN Law Rev. 95, 424. [Google Scholar]
  • 66.Kingsford RT, Biggs HC, Pollard SR. 2011. Strategic adaptive management in freshwater protected areas and their rivers. Biol. Conserv. 144, 1194-1203. ( 10.1016/j.biocon.2010.09.022) [DOI] [Google Scholar]
  • 67.Agrawal A. 2000. Adaptive management in transboundary protected areas: the Bialowieza National Park and Biosphere Reserve as a case study. Environ. Conserv. 27, 326-333. ( 10.1017/S0376892900000370) [DOI] [Google Scholar]
  • 68.Fonnesbeck CJ. 2005. Solving dynamic wildlife resource optimization problems using reinforcement learning. Nat. Resour. Model. 18, 1-40. ( 10.1111/j.1939-7445.2005.tb00147.x) [DOI] [Google Scholar]
  • 69.Won D-O, Müller K-R, Lee S-W. 2020. An adaptive deep reinforcement learning framework enables curling robots with human-like performance in real-world conditions. Sci. Robot. 5, eabb9764. ( 10.1126/scirobotics.abb9764) [DOI] [PubMed] [Google Scholar]
  • 70.Yu C, Liu J, Nemati S, Yin G. 2021. Reinforcement learning in healthcare: a survey. ACM Comput. Surv. 55, 5. ( 10.1145/3477600) [DOI] [Google Scholar]
  • 71.Perera ATD, Kamalaruban P. 2021. Applications of reinforcement learning in energy systems. Renew. Sustain. Energy Rev. 137, 110618. ( 10.1016/j.rser.2020.110618) [DOI] [Google Scholar]
  • 72.Neftci EO, Averbeck BB. 2019. Reinforcement learning in artificial and biological systems. Nat. Mach. Intell. 1, 133-143. ( 10.1038/s42256-019-0025-4) [DOI] [Google Scholar]
  • 73.Zheng S, Trott A, Srinivasa S, Parkes DC, Socher R. 2022. The AI economist: taxation policy design via two-level deep multiagent reinforcement learning. Sci. Adv. 8, eabk2607. ( 10.1126/sciadv.abk2607) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 74.Bellemare MG, Candido S, Castro PS, Gong J, Machado MC, Moitra S, Ponda SS, Wang Z. 2020. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature 588, 77-82. ( 10.1038/s41586-020-2939-8) [DOI] [PubMed] [Google Scholar]
  • 75.Raghu A, Komorowski M, Ahmed I, Celi L, Szolovits P, Ghassemi M. 2017. Deep reinforcement learning for sepsis treatment. arXiv, 1711.09602. ( 10.48550/arXiv.1711.09602) [DOI] [Google Scholar]
  • 76.Urban MC, et al. 2022. Coding for life: designing a platform for projecting and protecting global biodiversity. BioScience 72, 91-104. ( 10.1093/biosci/biab099) [DOI] [Google Scholar]
  • 77.Kostrikov I, Nair A, Levine S. 2021. Offline reinforcement learning with implicit Q-learning. arXiv, 2110.06169. ( 10.48550/arXiv.2110.06169) [DOI] [Google Scholar]
  • 78.Oliver RY, et al. 2023. Camera trapping expands the view into global biodiversity and its change. Phil. Trans. R. Soc. B 378, 20220232. ( 10.1098/rstb.2022.0232) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Jiang N, Li L. 2016. Doubly robust off-policy value evaluation for reinforcement learning. PMLR 48, 652–661. [Google Scholar]
  • 80.Gottesman O, et al. 2018. Evaluating reinforcement learning algorithms in observational health settings. arXiv, 1805.12298. ( 10.48550/arXiv.1805.12298) [DOI] [Google Scholar]
  • 81.Scoville C, Chapman M, Amironesei R, Boettiger C. 2021. Algorithmic conservation in a changing climate. Curr. Opin. Environ. Sustain. 51, 30-35. ( 10.1016/j.cosust.2021.01.009) [DOI] [Google Scholar]
  • 82.McDonald-Madden EVE, Baxter PW, Possingham HP. 2008. Subpopulation triage: how to allocate conservation effort among populations. Conserv. Biol. 22, 656-665. ( 10.1111/j.1523-1739.2008.00918.x) [DOI] [PubMed] [Google Scholar]
  • 83.Chadès I, Martin TG, Nicol S, Burgman MA, Possingham HP, Buckley YM. 2011. General rules for managing and surveying networks of pests, diseases, and endangered species. Proc. Natl Acad. Sci. USA 108, 8323-8328. ( 10.1073/pnas.1016846108) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Höfer S, et al. 2021. Sim2Real in robotics and automation: applications and challenges. IEEE Trans. Autom. Sci. Eng. 18, 398-400. ( 10.1109/TASE.2021.3064065) [DOI] [Google Scholar]
  • 85.Thomas P, Brunskill E. 2016. Data-efficient off-policy policy evaluation for reinforcement learning. PMLR 48, 2139–2148. [Google Scholar]
  • 86.Allen CR, Gunderson LH. 2011. Pathology and failure in the design and implementation of adaptive management. J. Environ. Manage. 92, 1379-1384. ( 10.1016/j.jenvman.2010.10.063) [DOI] [PubMed] [Google Scholar]
  • 87.Gilbert TK, Dean S, Lambert N, Zick T, Snoswell A et al. 2022 doi: 10.48550/arXiv.2204.10817. Reward reports for reinforcement learning.arXiv, 2204.10817. ( ) [DOI] [Google Scholar]
  • 88.Verdegem P. 2022. Dismantling AI capitalism: the commons as an alternative to the power concentration of Big Tech. AI Soc. 2022, 1-11. ( 10.1007/s00146-022-01437-8) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89.Puiutta E, Veith E. 2020. Explainable reinforcement learning: a survey. In Int. Cross-Domain Conf. Machine Learning and Knowledge Extraction, Dublin, Ireland, 25–28 August. Cham, Switzerland: Springer.
  • 90.Degrave J, et al. 2022. Magnetic control of tokamak plasmas through deep reinforcement learning. Nature 602, 414-419. ( 10.1038/s41586-021-04301-9) [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 91.Roscher R, Bohn B, Duarte MF, Garcke J. et al. 2020. Explainable machine learning for scientific insights and discoveries. IEEE Access 8, 42 200-42 216. ( 10.1109/ACCESS.2020.2976199) [DOI] [Google Scholar]
  • 92.Bhatt U, et al. 2020. Explainable machine learning in deployment. In FAT* '20: Proc. 2020 Conf. Fairness, Accountability, and Transparency, New York, NY, 27–30 January 2020, pp. 648–657. ( ) [DOI]
  • 93.Jobin A, Ienca M, Vayena E. 2019. The global landscape of AI ethics guidelines. Nat. Mach. Intell. 1, 389-399. ( 10.1038/s42256-019-0088-2) [DOI] [Google Scholar]
  • 94.Mohamed S, Png M-T, Isaac W. 2020. Decolonial AI: decolonial theory as sociotechnical foresight in artificial intelligence. Philos. Technol. 33, 659-684. ( 10.1007/s13347-020-00405-8) [DOI] [Google Scholar]
  • 95.Bareis J, Katzenbach C. 2022. Talking AI into being: the narratives and imaginaries of national AI strategies and their performative politics. Sci. Technol. Hum. Values 47, 855-881. ( 10.1177/01622439211030007) [DOI] [Google Scholar]
  • 96.Nost E, Colven E. 2022. Earth for AI: a political ecology of data-driven climate initiatives. Geoforum 130, 23-34. ( 10.1016/j.geoforum.2022.01.016) [DOI] [Google Scholar]

Associated Data
Data Availability Statement

This article has no additional data.

