Author manuscript; available in PMC: 2022 Jun 7.
Published in final edited form as: IEEE Aerospace Conference, 2021. doi: 10.1109/AERO50100.2021.9438267

Exploring Transfers between Earth-Moon Halo Orbits via Multi-Objective Reinforcement Learning

Christopher J Sullivan 1, Natasha Bosanac 2, Rodney L Anderson 3, Alinda K Mashiku 4, Jeffrey R Stuart 5
PMCID: PMC8753611  NIHMSID: NIHMS1754419  PMID: 35028651

Abstract

Multi-Reward Proximal Policy Optimization, a multi-objective deep reinforcement learning algorithm, is used to examine the design space of low-thrust trajectories for a SmallSat transferring between two libration point orbits in the Earth-Moon system. Using Multi-Reward Proximal Policy Optimization, multiple policies are simultaneously and efficiently trained on three distinct trajectory design scenarios. Each policy is trained to create a unique control scheme based on the trajectory design scenario and assigned reward function. Each reward function is defined using a set of objectives that are scaled via a unique combination of weights to balance guiding the spacecraft to the target mission orbit, incentivizing faster flight times, and penalizing propellant mass usage. Then, the policies are evaluated on the same set of perturbed initial conditions in each scenario to generate the propellant mass usage, flight time, and state discontinuities from a reference trajectory for each control scheme. The resulting low-thrust trajectories are used to examine a subset of the multi-objective trade space for the SmallSat trajectory design scenario. By autonomously constructing the solution space, insights into the required propellant mass, flight time, and transfer geometry are rapidly achieved.

1. INTRODUCTION

Small satellites, or SmallSats, offer an enticing and low-cost platform for mission designers, scientists, and engineers developing mission concepts for expanding humanity’s knowledge of the Solar System. Riding as secondary payloads or on smaller launch vehicles, SmallSats may provide a low-cost option for achieving a variety of targeted science, technology, and exploration objectives within and beyond low Earth orbit (LEO). For instance, the Mars Cube One (MarCO) spacecraft, two 13.5 kg SmallSats, were deployed on an interplanetary trajectory to serve as communication relays for the InSight lander, while also demonstrating the potential for spacecraft with this form factor to operate beyond LEO [1,2]. Additional SmallSat missions planned for destinations beyond LEO include the Lunar IceCube, LunaH-Map, and NEA Scout missions, each expanding the boundaries and capabilities of SmallSat operations [3–5]. Following deployment, a SmallSat equipped with a low-thrust propulsion system may adjust its trajectory to reach a variety of mission orbits in cislunar space. The propulsion system may also be used to transfer between orbits for mission concepts that leverage multiple vantage points.

Low-thrust trajectories exist as solutions to a high-dimensional, multi-objective optimization problem. Furthermore, the existence and properties of feasible solutions are strongly influenced by the hardware specifications, mission objectives, mission constraints, deployment conditions, power limitations, and other operational requirements. Understanding the design space of low-thrust trajectories for SmallSats as the spacecraft and mission parameters evolve enables trades between multiple objectives such as reducing the divergence from a reference trajectory and limiting propellant mass requirements over a reasonable flight time. The solutions to a multi-objective problem are often studied via the associated Pareto front, reflecting the set of nondominated solutions [6]. One step towards uncovering the global Pareto front is to first develop point solutions that lie within the multi-objective solution space. These point solutions offer general insights into the trajectory design space, and may be iterated upon to develop locally optimal solutions or even a globally optimal solution. However, designing one feasible, low-thrust trajectory in the complex gravitational environment of cislunar space may be time-consuming and computationally-intensive.
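To make the notion of nondominated point solutions concrete, the short sketch below filters a set of candidate solutions, each scored by objectives to be minimized, down to its Pareto-optimal subset. The function name, the NumPy-based implementation, and the example objective values are illustrative assumptions rather than material from the paper.

```python
import numpy as np

def nondominated(points):
    """Return the Pareto-optimal subset of an (N, M) array of minimization objectives."""
    points = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(points):
        # p is dominated if some other point is no worse in every objective and strictly
        # better in at least one objective
        dominated = np.any(np.all(points <= p, axis=1) & np.any(points < p, axis=1))
        if not dominated:
            keep.append(i)
    return points[keep]

# Example: three (propellant mass [kg], flight time [days]) point solutions
print(nondominated([[2.0, 30.0], [1.5, 35.0], [2.5, 36.0]]))   # the third point is dominated
```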

Machine learning offers a promising candidate approach for producing solutions in high-dimensional, multi-modal solution spaces. Deep reinforcement learning (DRL) has surged in popularity within the mission design community for recovering complex solutions in high-dimensional design applications [7–15]. Multi-objective deep reinforcement learning (MODRL) algorithms offer an efficient method for uncovering multiple solutions with varying geometries and objective prioritizations while reducing the required computational resources and time [7,16,17]. This paper focuses on using an MODRL algorithm to simultaneously train multiple policies to autonomously recover a subset of the solution space for low-thrust trajectories between libration point orbits in the Earth-Moon system.

In this paper, Multi-Reward Proximal Policy Optimization (MRPPO) is used to examine the design space of low-thrust trajectories for a SmallSat transferring between two orbits in the Earth-Moon Circular Restricted Three-Body Problem (CR3BP). MRPPO is an MODRL algorithm that enables the recovery of a multitude of policies that span a multi-objective solution space. MRPPO leverages the advantages of single-objective DRL algorithms while reducing the required computational resources and increasing the stability and performance during the training process. Using MRPPO, multiple policies, each with distinct weights scaling the competing objectives, are simultaneously trained. Each policy is trained to create a unique control scheme based on the assigned reward function: a weighted combination of objectives that guide the spacecraft to the target mission orbit, minimize deviations from a reference trajectory, and penalize propellant mass usage.

MRPPO is used in this paper to train policies to guide a low-thrust-enabled SmallSat in three transfer design scenarios in the Earth-Moon CR3BP: (1) from an L1 Lyapunov orbit to an L2 Lyapunov orbit of equal energy, (2) from an L1 northern halo orbit to an L2 southern halo orbit of equal energy, and (3) from an L1 northern halo orbit to a higher-energy L2 southern halo orbit. The first scenario serves as a verification test due to the availability of insights from dynamical systems theory; specifically, a known natural connection between the two periodic orbits. The second and third scenarios offer more complex evaluations where no direct, natural solutions exist. For each trajectory design scenario, the trained policies are evaluated on a common set of perturbed initial conditions randomly drawn along the initial periodic orbit to generate insights into the characteristics of each control scheme. Using the maximum, minimum, and mean values of these characteristics, regions of the multi-objective space are studied. By autonomously constructing a subset of the multi-objective solution space, insights into the required propellant mass, flight time, and trajectory geometry are rapidly achieved; such information is valuable during spacecraft design and mission concept development.

2. DYNAMICAL MODEL

Spacecraft motion in the Earth-Moon system is often approximated using simpler dynamical models prior to higher-fidelity analyses. One model, the CR3BP, has been used extensively in trajectory design to model the natural and low-thrust-enabled motion of a spacecraft in cislunar space. In the CR3BP, fundamental dynamical structures may be recovered that describe the natural flow of motion within multi-body systems. Spacecraft trajectories may be designed to leverage these natural dynamical structures and conserve propellant mass. This section describes both the formulation of the natural and low-thrust-enabled CR3BP equations of motion, and the natural dynamical structures used in the trajectory design scenarios explored within this paper.

Circular Restricted Three-Body Problem

The natural trajectory of a spacecraft in cislunar space is studied using the CR3BP. In the CR3BP, two primary masses, denoted $M_1$ and $M_2$, govern the motion of a third body of negligible mass, e.g., the spacecraft. These two primary masses are modeled as point masses assumed to follow circular orbits about their mutual barycenter. Then, a rotating frame, $(\hat{x}, \hat{y}, \hat{z})$, is defined to co-rotate with the two primary bodies such that $\hat{x}$ is directed from the Earth to the Moon, $\hat{z}$ is aligned with the orbital angular momentum vectors of the Earth and Moon, and $\hat{y}$ completes the right-handed set of axes [18]. Additionally, to aid the numerical integration scheme and the convergence properties of the DRL algorithm, the state of the spacecraft is nondimensionalized using the characteristic quantities of the Earth-Moon system. In the Earth-Moon system, the characteristic length $l^*$ is set equal to the semi-major axis of the Moon's assumed circular orbit with respect to the Earth, the characteristic time $t^*$ is defined to set the orbital period of the Moon to $2\pi$, and the sum of the primary masses is used to define the characteristic mass $m^* = M_1 + M_2$ [19]. An additional parameter that significantly influences the availability and geometry of the natural dynamical structures within a multi-body system is the mass ratio, calculated as $\mu = M_2/(M_1 + M_2)$. The values of these characteristic quantities in the Earth-Moon CR3BP are summarized in Table 1. Then, the nondimensional state of a spacecraft in the Earth-Moon CR3BP is denoted $\bar{x} = [\bar{d}, \bar{v}]$, where $\bar{d}$ and $\bar{v}$ represent the position and velocity vectors in the rotating frame with respect to the Earth-Moon barycenter. Using these definitions, the equations of motion that govern the natural dynamics of a spacecraft in the Earth-Moon CR3BP, formulated in the rotating frame, are written as

$$\ddot{x} - 2\dot{y} = \frac{\partial U^*}{\partial x}, \qquad \ddot{y} + 2\dot{x} = \frac{\partial U^*}{\partial y}, \qquad \ddot{z} = \frac{\partial U^*}{\partial z} \tag{1}$$

where the spacecraft's distances to the primaries are $d_1 = \sqrt{(x+\mu)^2 + y^2 + z^2}$ and $d_2 = \sqrt{(x-1+\mu)^2 + y^2 + z^2}$, and $U^* = \frac{1}{2}(x^2 + y^2) + (1-\mu)/d_1 + \mu/d_2$ is the pseudopotential function of the system [20]. These equations of motion admit a constant of integration, denoted the Jacobi constant, $C_J = 2U^* - \dot{x}^2 - \dot{y}^2 - \dot{z}^2$, which is an energy-type value that is theoretically constant along any natural trajectory in the CR3BP.

Table 1:

Earth-Moon system characteristic quantities.

Parameter Value
Characteristic length, l* 384,400 km
Characteristic time, t* 375,132 s
Characteristic mass, m* 6.0477 × 10^24 kg
Mass ratio, μ 1.2151 × 10^-2
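For concreteness, a minimal numerical sketch of Eq. (1) and the Jacobi constant is given below, using the Table 1 mass ratio. The function names, the SciPy integrator, and the sample initial state are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from scipy.integrate import solve_ivp

MU = 1.2151e-2  # Earth-Moon mass ratio from Table 1

def cr3bp_eom(t, state, mu=MU):
    """Natural CR3BP equations of motion, Eq. (1), in the nondimensional rotating frame."""
    x, y, z, vx, vy, vz = state
    d1 = np.sqrt((x + mu)**2 + y**2 + z**2)      # distance to the Earth
    d2 = np.sqrt((x - 1 + mu)**2 + y**2 + z**2)  # distance to the Moon
    # Partial derivatives of the pseudopotential U*
    Ux = x - (1 - mu)*(x + mu)/d1**3 - mu*(x - 1 + mu)/d2**3
    Uy = y - (1 - mu)*y/d1**3 - mu*y/d2**3
    Uz = -(1 - mu)*z/d1**3 - mu*z/d2**3
    return [vx, vy, vz, 2*vy + Ux, -2*vx + Uy, Uz]

def jacobi_constant(state, mu=MU):
    """Energy-like integral C_J = 2U* - v^2 along a natural trajectory."""
    x, y, z, vx, vy, vz = state
    d1 = np.sqrt((x + mu)**2 + y**2 + z**2)
    d2 = np.sqrt((x - 1 + mu)**2 + y**2 + z**2)
    u_star = 0.5*(x**2 + y**2) + (1 - mu)/d1 + mu/d2
    return 2*u_star - (vx**2 + vy**2 + vz**2)

# Example: propagate an illustrative state near the L1 region for one time unit
sol = solve_ivp(cr3bp_eom, (0.0, 1.0), [0.82, 0.0, 0.05, 0.0, 0.15, 0.0],
                rtol=1e-12, atol=1e-12)
print(jacobi_constant(sol.y[:, -1]))   # approximately conserved along the natural arc
```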

Natural Dynamical Structures

A variety of natural dynamical structures that exist in the Earth-Moon CR3BP are often incorporated into the trajectory design process, including equilibrium points, periodic orbits, and hyperbolic invariant manifolds [18]. A multi-body system described by the CR3BP admits five equilibrium points, denoted L1-L5, that exist throughout the system [20]. In the CR3BP, various periodic orbit families exist near the equilibrium points [18]. Of particular interest are the Earth-Moon L1 and L2 equilibrium points, which possess favorable characteristics for the long-term placement of spacecraft; for instance, the Lunar Gateway is currently expected to be located in an L2 southern halo orbit [21]. In addition to the L2 southern halo orbit family, additional periodic orbit families of interest include the planar L1 Lyapunov, planar L2 Lyapunov, and L1 northern halo families, which may serve as potential destinations for future missions operating in the cislunar regime. Figure 1 displays a subset of members from the L1 Lyapunov and halo families using dashed and solid arcs, respectively, in the Earth-Moon rotating frame; each orbit is shaded according to the value of the Jacobi constant. Figure 2 displays members of the associated L2 orbit families using a similar configuration. In this paper, members of these periodic orbit families are used to define initial and final orbits within each of the three trajectory design scenarios.

Figure 1:

Members of the Earth-Moon L1 Lyapunov and northern halo orbit families depicted using dashed and solid arcs, respectively, and shaded by Jacobi constant.

Figure 2:

Members of the Earth-Moon L2 Lyapunov and southern halo orbit families depicted using dashed and solid arcs, respectively, and shaded by Jacobi constant.

Hyperbolic invariant manifolds of periodic orbits act as natural transport mechanisms and are often used in trajectory design strategies that leverage dynamical systems theory [18, 19,22]. For a periodic orbit that possesses stable and unstable modes, trajectories on the unstable manifold naturally depart the orbit while trajectories on a stable manifold asymptotically approach the orbit [18]. Intersections of the stable and unstable manifold structures correspond to heteroclinic connections that connect two periodic orbits of equal Jacobi constant [18]. One example of a heteroclinic connection, requiring no propellant mass usage, is depicted in Fig. 3 between L1 and L2 Lyapunov orbits with CJ = 3.15. In this figure, the initial L1 Lyapunov orbit is displayed as a black dashed arc, the final L2 Lyapunov orbit is plotted as a solid black arc, and the heteroclinic connection is displayed in blue. Although these natural transfers only exist between two periodic orbits of equal Jacobi constant when the manifold structures intersect in the phase space, nonintersecting stable and unstable manifold structures may lie close to a low-thrust transfer with limited propellant mass requirements [22].

Figure 3:

A natural, heteroclinic transfer from an L1 Lyapunov orbit to an L2 Lyapunov orbit.

Low-Thrust Propulsion in the CR3BP

The CR3BP equations of motion are modified to incorporate the acceleration imparted by the low-thrust propulsion system and the propellant mass usage. In this paper, the low-thrust propulsion system is modeled using three variable thrust, constant specific impulse, low-thrust engines with control authority along the $\hat{x}$, $\hat{y}$, and $\hat{z}$ axes. The acceleration supplied by the low-thrust engines is written as

$$\bar{a} = \frac{T}{m_{s/c}} u_x \hat{x} + \frac{T}{m_{s/c}} u_y \hat{y} + \frac{T}{m_{s/c}} u_z \hat{z} = a_x \hat{x} + a_y \hat{y} + a_z \hat{z} \tag{2}$$

where $\bar{u} = [u_x, u_y, u_z]$ is the control vector defined in the rotating frame; $T$ is the thrust magnitude of each low-thrust engine, nondimensionalized using the spacecraft's initial wet mass and the system's characteristic length and time; and $m_{s/c}$ is the spacecraft mass, nondimensionalized by the initial wet mass [23]. In addition, the decrement in the spacecraft mass due to the propellant mass usage by the low-thrust engines is recorded by the mass flow rate equation [23]. Then, the equations of motion for a low-thrust-enabled spacecraft in the Earth-Moon CR3BP are written as

$$\ddot{x} - 2\dot{y} = \frac{\partial U^*}{\partial x} + a_x, \qquad \ddot{y} + 2\dot{x} = \frac{\partial U^*}{\partial y} + a_y, \qquad \ddot{z} = \frac{\partial U^*}{\partial z} + a_z, \qquad \dot{m}_{s/c} = -\frac{T}{I_{sp}\, g_0} \tag{3}$$

where the specific impulse of the low-thrust engines is denoted $I_{sp}$ and $g_0$ is the gravitational acceleration measured at the surface of the Earth [23]. These differential equations are implemented in the DRL environment presented within this paper to define the dynamics governing a spacecraft in a multi-body system.
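A sketch of the low-thrust-augmented dynamics in Eqs. (2) and (3) appears below. The nondimensionalization constants, the 180 kg wet mass from Section 4, and the throttle-dependent mass flow are assumptions stated in the comments, not an exact reproduction of the authors' environment.

```python
import numpy as np

MU = 1.2151e-2              # mass ratio, Table 1
L_STAR = 3.844e8            # characteristic length, m (384,400 km)
T_STAR = 375132.0           # characteristic time, s
M_WET = 180.0               # assumed initial wet mass, kg
ISP, G0 = 3000.0, 9.80665   # specific impulse (s) and sea-level gravity (m/s^2)

def low_thrust_eom(t, state, u, thrust_newtons=0.5, mu=MU):
    """Eq. (3): state = [x, y, z, vx, vy, vz, m], with m and u nondimensional."""
    x, y, z, vx, vy, vz, m = state
    d1 = np.sqrt((x + mu)**2 + y**2 + z**2)
    d2 = np.sqrt((x - 1 + mu)**2 + y**2 + z**2)
    Ux = x - (1 - mu)*(x + mu)/d1**3 - mu*(x - 1 + mu)/d2**3
    Uy = y - (1 - mu)*y/d1**3 - mu*y/d2**3
    Uz = -(1 - mu)*z/d1**3 - mu*z/d2**3
    # Per-axis thrust nondimensionalized by the initial wet mass and l*/t*^2, as in Eq. (2)
    t_nd = thrust_newtons * T_STAR**2 / (M_WET * L_STAR)
    ax, ay, az = t_nd * np.asarray(u) / m
    # Mass decrement; scaling by the commanded throttle magnitudes is an assumption here
    mdot = -(thrust_newtons * float(np.sum(np.abs(u))) / (ISP * G0)) * T_STAR / M_WET
    return [vx, vy, vz, 2*vy + Ux + ax, -2*vx + Uy + ay, Uz + az, mdot]
```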

3. DEEP REINFORCEMENT LEARNING

In the astrodynamics community, DRL has recently been applied to a variety of optimization problems including within multi-body trajectory design scenarios [7–11]. DRL algorithms are a type of RL algorithm that use deep neural networks to learn the optimal state-action relationship for an environment and set of objectives [24]. The mathematical representation for the set of objectives for a scenario is designated as the reward function, whereby higher rewards correspond to more desirable behavior. Generally, RL algorithms use an agent to explore an unknown environment by performing actions at states and collecting rewards based on those state-action pairs and a defined reward function [25,26]. The goal of the algorithm is to maximize the long-term reward, mathematically defined as the value function for the environment [27]. DRL methods leverage deep neural networks as universal function approximators because deep neural networks are able to learn the complex, high-dimensional state-action relationships within an environment much more robustly and efficiently than traditional RL techniques [28]. As a result, DRL techniques may be applied to a variety of complex, high-dimensional applications to successfully uncover locally optimal behavior [24,29].

One DRL technique that demonstrates favorable convergence properties in sensitive, chaotic environments is Proximal Policy Optimization (PPO), which trains a single policy to map optimal actions to every state within the environment [30–32]. PPO incorporates a soft constraint on the size of the updates to the policy, which inhibits large updates from destabilizing the policy. In chaotic environments, limiting the update size provides more robust and stable convergence properties. Additionally, each policy is modeled using two deep neural networks in an actor-critic structure. In an actor-critic structure, the actor neural network learns the optimal mapping of actions to states for the environment and defined reward function while the critic neural network learns which states in the environment produce the highest long-term reward [33]. Further, the deep neural networks act as universal function approximators, offering robust convergence properties for recovering highly abstract state-action mappings in complex dynamical environments and supporting the recovery of locally optimal behavior. The deep neural networks are initialized with no knowledge of the dynamical model or the assigned reward function; they must update their weights using the state-action-reward experiences gathered from the environment. MRPPO, a recently developed MODRL algorithm, uses PPO as a foundation within a multiple-reward-function framework. MRPPO retains PPO's advantageous convergence properties in chaotic environments and enables multiple policies, each with distinct reward functions and neural networks, to be trained simultaneously, thereby reducing the required computational resources and training time.

Multi-Reward Proximal Policy Optimization

MRPPO trains multiple policies, each assigned a distinct reward function corresponding to a unique prioritization of objectives, by sharing the environmental data across all policies [7]. The environmental data encompasses the state-action propagation data generated from the dynamical model via numerical integration; in multi-body environments, this step often consumes a significant portion of the total training time for a policy. To reduce the required computation time when training multiple policies, the propagation data generated by one policy is shared with the other policies as soon as it is generated. Then, each policy may compute its own reward for that shared state-action pair, allowing the policies to efficiently learn distinct behaviors within the environment. Sharing the propagation data aids exploration for all policies within the environment and tends to produce more stable behavior within each policy [7].

The structure of MRPPO enables multiple policies to share state-action propagation data from the environment while also allowing each policy to learn to maximize the cumulative reward from its assigned reward function. Figure 4 depicts a conceptual representation of MRPPO where $N$ policies, denoted $\pi_i$ for $i = 1, \dots, N$, each control $k_i$ agents within the environment with reward functions $r_{i,t}(\bar{s}_t, \bar{u}_t)$ [7]. At each time step, the policy receives the current state of an agent, $\bar{s}_t$, and outputs an action to perform at that state, $\bar{u}_t$. The dynamical model, i.e., the black box in Figure 4 labeled low-thrust-enabled CR3BP, is used to propagate that state-action pair forward in time to generate a new state, $\bar{s}_{t+1}$. The state-action transition is saved within a shared memory, depicted by the pink block, for all policies, and the new state of the agent, $\bar{s}_{t+1}$, is input into the controlling policy to develop a new action to perform. Once a defined number of state-action pairs have been collected, the state-action pairs are used to compute the rewards for each policy's reward function in the yellow boxes in Figure 4. The policies in the blue boxes are then updated using the unique state-action-reward experiences, and this process is continued until a specified termination condition is reached.

Figure 4:

MRPPO training $N$ policies with $k_i$ agents and reward functions $r_{i,t}(\bar{s}_t, \bar{u}_t)$ within a shared dynamical model.
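The data-sharing pattern in Figure 4 can be summarized with the schematic sketch below: every transition generated by any policy's agent is stored once and then scored separately by each policy's reward function before the PPO updates. The policy and environment callables are placeholders, and the single-agent-per-policy simplification is an assumption made for brevity.

```python
import numpy as np

def mrppo_rollout(policies, reward_fns, env_step, init_states, horizon=4096):
    """policies[i]: state -> action; reward_fns[i]: (s, u, s_next) -> scalar reward."""
    shared_memory = []                             # transitions shared across all policies
    states = [np.array(s, dtype=float) for s in init_states]
    for _ in range(horizon):
        for i, policy in enumerate(policies):
            u = policy(states[i])                  # actor output for this policy's agent
            s_next = env_step(states[i], u)        # one propagation through the dynamics
            shared_memory.append((states[i], u, s_next))
            states[i] = s_next
    # Each policy labels every shared transition with its own reward, then would run its
    # PPO (clipped surrogate) update on the resulting state-action-reward batch.
    return [[(s, u, s2, r_fn(s, u, s2)) for (s, u, s2) in shared_memory]
            for r_fn in reward_fns]
```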

Several hyperparameters govern MRPPO and influence the resulting behavior of the policies. In particular, the low-thrust-enabled CR3BP is a chaotic environment; thus, the hyperparameters must be selected to facilitate stable and steady learning for the policies while still encouraging exploration of the environment. Table 2 displays the hyperparameters used in this investigation. These values are selected using the results of a hyperparameter grid search conducted by Sullivan and Bosanac for MRPPO in a multi-body trajectory design scenario that demonstrated stable training behavior in a chaotic environment [7]. Additionally, the actor and critic neural networks for each policy are structured using the parameters outlined in Table 3 [7,32]. Using MRPPO, multiple policies modeled using deep neural networks are trained on a multi-body trajectory design scenario to produce locally optimal behavior for a set of defined reward functions.

Table 2:

Selected MRPPO Hyperparameters.

Hyperparameter Value
Environmental Steps, τ 4096
Epochs, E 5
Mini Batches, M 4
Discount Factor, γ 0.95
GAE Factor, λ 0.85
Clipping Parameter, ε 2 × 10^-3
Value Function Coefficient, c1 0.5
Entropy Coefficient, c2 1 × 10^-4
Learning Rate, lr 1 × 10^-4
Maximum Gradient Norm, cmax 0.5

Table 3:

Structure of the Neural Networks.

Parameter Value
Actor Hidden Layers 2
Actor Hidden Nodes per Layer 64
Critic Hidden Layers 2
Critic Hidden Nodes per Layer 1024
Weight Initialization Scheme Orthogonal
Activation Function Hyperbolic Tangent
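A minimal sketch of actor and critic networks consistent with Table 3 is shown below. The use of PyTorch, the 13-dimensional observation from Eq. (4), and the Gaussian policy head with a learned log standard deviation are assumptions, since the paper does not specify the software framework.

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden_widths, out_dim):
    """Tanh MLP with orthogonal weight initialization, per Table 3."""
    layers, last = [], in_dim
    for width in hidden_widths:
        layers += [nn.Linear(last, width), nn.Tanh()]
        last = width
    layers.append(nn.Linear(last, out_dim))
    net = nn.Sequential(*layers)
    for module in net:
        if isinstance(module, nn.Linear):
            nn.init.orthogonal_(module.weight)
            nn.init.zeros_(module.bias)
    return net

state_dim, action_dim = 13, 3                    # Eq. (4) observation, 3-axis thrust command
actor = mlp(state_dim, [64, 64], action_dim)     # mean of the commanded thrust components
critic = mlp(state_dim, [1024, 1024], 1)         # state-value estimate
log_std = nn.Parameter(torch.zeros(action_dim))  # assumed shared log standard deviation

# Sampling one action; the environment later clips each component to [-1, 1]
obs = torch.zeros(state_dim)
action = torch.distributions.Normal(actor(obs), log_std.exp()).sample()
```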

4. RECOVERING TRANSFERS USING MRPPO

MRPPO is used to uncover a region of the solution space for three trajectory design scenarios in the Earth-Moon CR3BP. Specifically, eight policies are trained within each scenario to guide a low-thrust-enabled SmallSat toward a target periodic orbit via a transfer that lies near a reference path constructed from discontinuous, natural manifold arcs. In this investigation, several implementation factors influence the training of the policies for each scenario, such as the state definition, action definition, reward function formulation, weights of conflicting objectives, SmallSat parameters, and the reference trajectory; each of these components is defined within this section.

State and Action Definitions

Using MRPPO, policies are trained to construct mappings for locally optimal actions at every state in the environment. The selected actions maximize the long-term reward within the environment for the assigned reward functions. First, the state is defined as

$$\bar{s}_t = [\bar{d}_t, \bar{v}_t, m_{s/c}, \Delta\bar{d}_t, \Delta\bar{v}_t] \tag{4}$$

The spacecraft's current position, velocity, and mass are denoted $\bar{d}_t$, $\bar{v}_t$, and $m_{s/c}$, respectively. In addition, $\Delta\bar{d}_t$ and $\Delta\bar{v}_t$ are the differences in the position and velocity vectors between the spacecraft's state and the closest state along the reference trajectory or periodic orbits. Then, the generated action vector, $\bar{u}_t$, is a 3 × 1 vector whose components are constrained to have magnitudes less than or equal to 1 and that specifies the thrust vector in the Earth-Moon rotating frame. If a policy outputs an action vector with a component whose magnitude exceeds 1, that component is clipped to −1 or 1. The action vector is scaled using the maximum available thrust of each engine and is assumed to be constant in the Earth-Moon rotating frame while being used as the control for the spacecraft. This state-action pair is propagated forward in time in the low-thrust-enabled CR3BP for $\Delta t = 3 \times 10^{-2}$ nondimensional time units, or approximately 3 hours.
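The observation in Eq. (4) and the component-wise clipping can be sketched as follows; the reference_states array (the discretized orbits and reference trajectory) and the position-space nearest-neighbor lookup are illustrative assumptions.

```python
import numpy as np

def build_observation(spacecraft_state, reference_states):
    """spacecraft_state = [x, y, z, vx, vy, vz, m]; reference_states is an (N, 6) array."""
    spacecraft_state = np.asarray(spacecraft_state, dtype=float)
    d, v, m = spacecraft_state[:3], spacecraft_state[3:6], spacecraft_state[6]
    # Closest stored state, measured here in position space (an assumption)
    idx = int(np.argmin(np.linalg.norm(reference_states[:, :3] - d, axis=1)))
    delta_d = d - reference_states[idx, :3]
    delta_v = v - reference_states[idx, 3:6]
    return np.concatenate([d, v, [m], delta_d, delta_v])

def clip_action(raw_action):
    """Limit any component with magnitude greater than 1 to -1 or 1; the result is later
    scaled by the maximum per-axis thrust inside the dynamics."""
    return np.clip(np.asarray(raw_action, dtype=float), -1.0, 1.0)
```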

Reward Function Formulation

The reward function formulated for a trajectory design scenario captures the objectives and goals inherent to the multi-objective optimization problem. In this paper, the objectives include minimizing the position and velocity differences with respect to the initial orbit, final orbit, and potentially discontinuous reference trajectory while also minimizing propellant mass usage and flight time; these objectives often conflict [34, 35]. To guide the spacecraft from the initial orbit to the final orbit using a transfer with a specific geometry, the reward function is formulated as

$$r_t(\bar{s}_t, \bar{u}_t) = 10 - \kappa(\bar{s}_t)\,|\Delta\bar{d}_{t+1}| - \kappa(\bar{s}_t)\,|\Delta\bar{v}_{t+1}| - c_m\left(m_{s/c}(t) - m_{s/c}(t+\Delta t)\right) + \Omega \tag{5}$$

Using this structure, the position and velocity difference terms capture the state difference objectives measured with respect to the closest state along the initial orbit, reference trajectory, or final orbit; the propellant mass term measures the propellant mass usage due to the most recent action, $\bar{u}_t$; and the final term, $\Omega$, is associated with activating the termination conditions. This reward function formulation admits a multi-objective trade space whereby decreasing the position or velocity difference requires propellant mass usage.

A reference trajectory, assembled from natural arcs in the Earth-Moon CR3BP, is used in this initial proof of concept to aid the policies in learning low-thrust, propellant-optimal paths between periodic orbits [22]. However, incorporating a reference trajectory does bias the policies to learn the given transfer geometry and precludes the policies from learning other transfer geometries. Ongoing work is focusing on limiting the influence of the reference trajectory on the solutions developed by MRPPO.

To incentivize the spacecraft to transfer from the initial orbit to the final orbit while following the reference trajectory, the state difference components in the reward function formulation are augmented with a state-dependent coefficient. The state coefficient, $\kappa(\bar{s}_t)$, is a piecewise function dependent upon whether the closest state to the spacecraft lies along the initial orbit, reference trajectory, or final orbit. The periodic orbits are each discretized into 1,000 states equally spaced in time while the states along the reference trajectory are drawn every $1 \times 10^{-4}$ nondimensional time units, enabling the spacecraft to measure the state differences to within a small error. This coefficient is formulated as

$$\kappa(\bar{s}_t) = \begin{cases} 100 & \text{Initial Orbit} \\ 100 - \dfrac{100\,(K^{l} - 1)}{K - 1} & \text{Reference Trajectory} \\ 1 & \text{Final Orbit} \end{cases}$$

where the second leg of the function is an exponential-like decrease from 100 to 1, $K$ is set to 1000, and $l$ is the index of the closest state drawn from the reference trajectory divided by the total number of states along the reference trajectory. The piecewise nature of the function encourages the SmallSat to transfer from the initial orbit to the final orbit without biasing the policies to depart immediately if the SmallSat is not in the vicinity of the reference trajectory. Of course, the reward function formulation used in this paper does bias the policies to follow the reference trajectory even if slightly more propellant-optimal transfers exist.

Once a spacecraft state is initialized in the environment, the action output from the neural networks is applied and the spacecraft state is propagated forward in time. This process is repeated for additional time steps until a termination condition is reached. Three termination conditions and associated values of Ω exist for each scenario: (1) the spacecraft drifts further than 10,000 km away from any state along the initial orbit, final orbit, or reference trajectory where Ω = −100, (2) the maximum number of time steps along a single trajectory, set as 300 steps or equivalently, 39.08 days, is reached such that Ω = 0, and (3) the spacecraft approaches within 384 km and 5 m/s of the closest state in position space along the final orbit where Ω = 1,000.

To train policies that recover distinct behavior within this multi-objective solution space, the coefficient of the propellant mass usage term in the reward function is varied between policies. By increasing the coefficient, the policies are encouraged to conserve propellant mass at the expense of larger position and velocity differences. Specifically, the mass coefficients for the eight policies are selected as $c_m$ = [0, 4, 8, 12, 16, 20, 24, 28]. In each trajectory design scenario, Policy 1 corresponds to $c_m = 0$ with no penalty on propellant mass usage while Policy 8 corresponds to $c_m = 28$, with the highest penalty on propellant mass usage. The values for the mass coefficient are selected to approximately scale the magnitude of the propellant mass usage term in the reward function to be on the order of the other reward function terms, $10^0$. Then, the policies are trained to guide a low-thrust-enabled SmallSat from the initial orbit to the final orbit while learning behavior that maximizes their assigned reward functions, enabling exploration of the multi-objective solution space.
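A hedged sketch of Eq. (5) with the piecewise coefficient and the per-policy mass coefficients is shown below; the region labels, the handling of the termination bonus Ω, and the leading constant follow the reconstruction above and are assumptions rather than the authors' exact implementation.

```python
import numpy as np

K = 1000.0
MASS_COEFFS = [0, 4, 8, 12, 16, 20, 24, 28]   # c_m for Policies 1 through 8

def kappa(region, l=0.0):
    """Piecewise state coefficient; region is 'initial', 'reference', or 'final', and
    l in [0, 1] indexes progress along the reference trajectory."""
    if region == "initial":
        return 100.0
    if region == "reference":
        return 100.0 - 100.0*(K**l - 1.0)/(K - 1.0)   # exponential-like decay toward ~1
    return 1.0

def reward(delta_d, delta_v, mass_used, region, l, c_m, omega=0.0):
    """Eq. (5)-style per-step reward; omega is the termination bonus or penalty."""
    k = kappa(region, l)
    return (10.0 - k*np.linalg.norm(delta_d) - k*np.linalg.norm(delta_v)
            - c_m*mass_used + omega)

# Example: Policy 8 (c_m = 28) scoring a small step that reaches the final orbit
r = reward(np.array([1e-3, 0.0, 0.0]), np.array([1e-4, 0.0, 0.0]),
           mass_used=2e-5, region="final", l=1.0, c_m=MASS_COEFFS[7], omega=1000.0)
```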

Spacecraft Initialization

In this paper, the spacecraft is modeled as a 180 kg ESPA-class SmallSat equipped with three low-thrust engines aligned with the axes of the Earth-Moon rotating frame. Each engine is assumed to possess variable thrust capabilities with a maximum available thrust of $T = 0.5$ N and a constant specific impulse of $I_{sp} = 3000$ s. Thus, the maximum available thrust magnitude for the spacecraft is $T_{max} = 0.866$ N. Additionally, throughout training, the initial conditions for the spacecraft state are randomly drawn from anywhere along the initial orbit to facilitate developing robust control schemes. Then, perturbations are drawn from a Gaussian distribution with zero mean and a standard deviation of $1 \times 10^{-3}$ nondimensional units; these perturbations are added to the position and velocity components of the initial condition to reflect off-nominal states. Once the policies are trained, they are evaluated on a common set of 1,000 perturbed initial conditions for each trajectory design scenario to facilitate comparisons between the policies and characterize each control scheme.
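Drawing the perturbed initial conditions described above might look like the sketch below, where a state is sampled from the discretized initial orbit and zero-mean Gaussian noise with a 1 × 10^-3 standard deviation is added to the position and velocity components; the orbit_states placeholder and the seeded generator are illustrative.

```python
import numpy as np

def sample_initial_condition(orbit_states, sigma=1e-3, wet_mass_nd=1.0, rng=None):
    """orbit_states is an (N, 6) array of states discretized along the initial orbit."""
    rng = np.random.default_rng() if rng is None else rng
    base = orbit_states[rng.integers(len(orbit_states))]
    perturbed = base + rng.normal(0.0, sigma, size=6)      # off-nominal position and velocity
    return np.concatenate([perturbed, [wet_mass_nd]])      # append the nondimensional mass

# A common evaluation set of 1,000 perturbed initial conditions (placeholder orbit states)
rng = np.random.default_rng(0)
orbit_states = np.zeros((1000, 6))
evaluation_set = np.stack([sample_initial_condition(orbit_states, rng=rng)
                           for _ in range(1000)])
```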

Trajectory Design Scenarios

Three transfer scenarios in the Earth-Moon CR3BP are defined using distinct combinations of initial and final periodic orbits: (1) from an L1 Lyapunov orbit to an L2 Lyapunov orbit with equal values of the Jacobi constant, (2) from an L1 northern halo orbit to an L2 southern halo orbit with equal values of the Jacobi constant, and (3) from an L1 northern halo orbit to an L2 southern halo orbit with distinct values of the Jacobi constant. Table 4 displays the Jacobi constants for the initial and final periodic orbits in each transfer scenario. In the first scenario, the reference trajectory uses a continuous heteroclinic connection. This scenario also serves as a verification for demonstrating that MRPPO may train multiple policies that facilitate continuous low-thrust transfers between periodic orbits. In the second and third scenarios, no direct, heteroclinic connection exists between the selected orbits. Thus, the reference trajectory is defined using a discontinuous path formed using arcs along the unstable and stable manifolds associated with the initial and final orbits; these arcs intersect in position, but not in velocity at a hyperplane defined as Σ : x = 1 − μ. Incorporating natural manifold arcs into the reference trajectory encourages the policies to produce paths with low propellant usage between the two orbits within the Earth-Moon CR3BP [22]. However, using manifold arcs to construct a reference trajectory that guides the policies does bias the resulting solutions and, therefore, limits the region of the multi-objective space that is examined.

Table 4:

Characteristics of the Transfer Design Scenarios.

Scenario Bounding Orbits CJ
#1 From: L1 Lyapunov 3.15
To: L2 Lyapunov 3.15

#2 From: L1 Northern Halo 3.11
To: L2 Southern Halo 3.11

#3 From: L1 Northern Halo 3.15
To: L2 Southern Halo 3.11

5. RESULTS

Eight policies are trained using MRPPO to develop distinct control schemes for a low-thrust-enabled SmallSat transferring between periodic orbits in the Earth-Moon system. Each policy is assigned a reward function that uniquely balances the objectives for the multi-objective solution space. Once the policies are trained, they are evaluated on a common set of 1,000 initial conditions, randomly drawn along the initial periodic orbit with additional induced perturbations in their position and velocity components. Then, the solution space is developed by computing the mean characteristics of the evaluated trajectories for each policy. Specifically, the relationship between the propellant mass usage and the position and velocity differences between the evaluation trajectory and the reference states, $|\Delta\bar{d}|$ and $|\Delta\bar{v}|$, is examined to explore the recovered regions of the multi-objective trade space for each scenario. Additionally, the unique behavior recovered by select policies is analyzed to generate insights into the multi-modal nature of the high-dimensional, multi-objective solution space.
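The trade-space summaries described here could be assembled as in the short sketch below, which computes the mean, minimum, and maximum propellant mass usage and the mean state differences over the evaluation trajectories for each policy; the array layout is an assumption.

```python
import numpy as np

def summarize(per_policy_results):
    """per_policy_results[i] is an (n_traj, 3) array of [mass_used, |dd|, |dv|] values."""
    rows = []
    for i, data in enumerate(per_policy_results):
        mass = data[:, 0]
        rows.append({"policy": i + 1,
                     "mass_mean": mass.mean(), "mass_min": mass.min(), "mass_max": mass.max(),
                     "dd_mean": data[:, 1].mean(), "dv_mean": data[:, 2].mean()})
    return rows
```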

Lyapunov Orbit Transfers

The first trajectory design scenario is designed to verify that MRPPO successfully trains policies that efficiently guide SmallSats along a transfer between periodic orbits in the Earth-Moon system. In this scenario, two Lyapunov orbits are connected via a heteroclinic connection, which also serves as the reference trajectory for the policies. While a natural transfer does exist between these two orbits, the policies are expected to use propellant mass to: (i) correct the initial state discontinuities from the dynamical structures caused by the perturbed initial conditions, (ii) transfer between the periodic orbits, and (iii) decrease the flight time when possible. Then, the trained policies are evaluated on a fixed set of perturbed initial conditions.

Eight policies with varying propellant mass usage penalties are trained using MRPPO on this multi-body trajectory design scenario to uncover a subset of the multi-objective solution space. The resulting trajectories for Policies 1 and 8, the policies with no penalty on propellant mass usage and the highest penalty on propellant mass usage, respectively, are depicted in Figs. 5a and 5b. In these figures, the initial L1 Lyapunov orbit, final L2 Lyapunov orbit, and reference trajectory are denoted by black arcs while the evaluation trajectories are shaded in a variety of hues for clarity. Then, Fig. 6a depicts the relationship between the propellant mass usage and the position difference of the spacecraft measured with respect to the closest state along either the initial orbit, final orbit, or reference trajectory. In this figure, the circles are shaded according to their respective policy, as described in the legend, and denote the mean propellant mass usage and mean position difference across all 1,000 evaluation trajectories while the bars represent the minimum and maximum values for any single trajectory. Across the policies, the propellant mass usage tends to decrease as the position difference increases, which is attributable to the policies possessing distinct mass coefficients in their reward functions. This relationship is also exhibited in Fig. 6b, which displays the relationship between the propellant mass usage and velocity difference using the same configuration as Fig. 6a. Some of the resulting trajectories require less than 1.25 kg of propellant mass to perform this transfer, but the majority of transfers require 1.5 kg to 2 kg of propellant mass for a SmallSat with an initial wet mass of 180 kg.

Figure 5:

1,000 evaluation trajectories departing an L1 Lyapunov orbit to an L2 Lyapunov orbit with an equal Jacobi constant in Scenario 1 controlled using: (a) Policy 1 and (b) Policy 8.

Figure 6:

Relationship between the propellant mass usage and the position and velocity differences for the L1 Lyapunov to L2 Lyapunov transfer in Scenario 1.

The trajectory with the largest perturbation from the initial L1 Lyapunov orbit is selected to further compare Policies 1 and 8. The selected trajectories for Policies 1 and 8 are displayed in Figs. 7a and 7b, where the trajectories are shaded blue, the initial orbit, final orbit, and reference trajectory are denoted with black arcs, and the thrust actions are represented by red arrows. Analysis of these figures reveals that both policies perform actions at the locations where the reference trajectory intersects the periodic orbits. Figures 8a and 8b depict the thrust magnitude throughout the trajectory for each policy: both policies tend to perform small actions except at the critical points along the trajectory where the reference trajectory intersects the periodic orbits. Further, when the spacecraft intersects the final periodic orbit in position space, Policy 1, which has no penalty for propellant mass usage, immediately performs a large action to correct the velocity difference and reach the termination hyperspheres. Conversely, Policy 8, which has the largest penalty for propellant mass usage, follows the final periodic orbit for a longer time, allowing the natural dynamics and low-thrust engines to gradually decrease the state discontinuities. Figures 9 and 10 display the differences in the position and velocity components along the trajectories, measured with respect to the closest state in position space along the final orbit, for Policies 1 and 8, respectively. In these figures, the black dashed lines represent the values for the termination hyperspheres: 384 km for the position difference and 5 m/s for the velocity difference. These figures demonstrate that Policy 1 successfully uses a large thrust action to intercept both the position and velocity difference hyperspheres simultaneously. In contrast, Policy 8 enters the position difference hypersphere once, departs, and then re-enters it after the natural dynamics allow the trajectory to approach the velocity difference hypersphere without large thrust actions.

Figure 7:

Trajectory from a highly perturbed initial condition departing an L1 Lyapunov orbit to an L2 Lyapunov orbit with an equal Jacobi constant in Scenario 1 controlled using: (a) Policy 1 and (b) Policy 8.

Figure 8:

Thrust magnitude along an evaluation trajectory in Scenario 1 commanded using: (a) Policy 1 and (b) Policy 8.

Figure 9:

Position and velocity differences measured with respect to the closest state on the final orbit for a single trajectory in Scenario 1 commanded by Policy 1.

Figure 10:

Position and velocity differences measured with respect to the closest state on the final orbit for a single trajectory in Scenario 1 commanded by Policy 8.

Transfers between Halo Orbits with Equal Jacobi Constants

In the second multi-body trajectory design scenario, an L1 northern halo orbit and an L2 southern halo orbit with an equal Jacobi constant are selected to develop a subset of the multi-objective solution space for low-thrust transfers. The reference trajectory is formulated using two manifold arcs emanating from the halo orbits that intersect in position near perilune, but not in velocity. The discontinuity in the reference trajectory necessitates the application of low-thrust acceleration both to correct the initial perturbations along the initial orbit and to facilitate a continuous transfer of similar geometry between the periodic orbits. Additionally, by training the policies using MRPPO, locally optimal behavior is recovered using the same reward functions and training parameters as the previous example.

Using MRPPO, eight policies, each assigned a distinct reward function, are trained to recover low-thrust transfers in this complex multi-body trajectory design scenario. Figures 11a and 11b depict a planar view of the evaluation trajectories for Policies 1 and 8, respectively, using the same color configuration as Fig. 5. The trajectories associated with Policy 1 tend to resemble the reference trajectory much more closely in position space than the solutions associated with Policy 8. This relationship is detailed in Fig. 12a, where the policies with higher penalties on propellant mass usage possess larger position differences from the reference trajectory, highlighting the multi-objective trade space for this scenario. Additionally, Fig. 12b displays the relationship between the propellant mass usage and velocity difference, demonstrating a consistent trade-off between decreasing the state differences from the reference trajectory and decreasing the propellant mass usage. Further, Policies 7 and 8 exhibit smaller differences in their behavior, each requiring approximately 2.5 kg of propellant mass.

Figure 11:

1,000 evaluation trajectories departing an L1 northern halo orbit to an L2 southern halo orbit with an equal Jacobi constant in the Earth-Moon CR3BP in Scenario 2 controlled using: (a) Policy 1 and (b) Policy 8.

Figure 12:

Relationship between the state discontinuities and propellant mass usage for the L1 northern halo to L2 southern halo transfer in Scenario 2.

Analyzing the trajectory characteristics associated with distinct policies for the same highly perturbed initial condition generates additional insights into the behavior of each control scheme. Figures 13a and 13b depict the same evaluation trajectory for Policies 1 and 8, respectively, with the thrust directions illustrated using red arrows along the trajectory. The thrust magnitude at each time step is depicted in Figs. 14a and 14b. These figures demonstrate that Policy 1 consistently produces larger thrust vectors throughout the transfer than Policy 8, a result of the distinct mass coefficients in each reward function. Additionally, the thrust magnitude history for each policy resembles a bang-bang-like controller, with the spacecraft performing an action with a large control magnitude at one time step followed by multiple time steps with much smaller control magnitudes.

Figure 13:

Trajectory from a highly perturbed initial condition departing an L1 northern halo orbit to an L2 southern halo orbit with an equal Jacobi constant in Scenario 2 controlled using: (a) Policy 1 and (b) Policy 8.

Figure 14:

Thrust magnitude along an evaluation trajectory in Scenario 2 commanded using: (a) Policy 1 and (b) Policy 8.

The influence of the actions on the energy of the spacecraft is further indicated by the Jacobi constant along each trajectory, depicted in Figs. 15a and 15b, where the solid blue line denotes the Jacobi constant of the periodic orbits. For this initial condition, Policy 1 produces much larger changes in the Jacobi constant along the trajectory than Policy 8, reflecting the mass coefficient in each policy's reward function.

Figure 15:

Jacobi constant along an evaluation trajectory in Scenario 2 commanded using: (a) Policy 1 and (b) Policy 8.

Additional insight is generated by examining the distinct behavior of Policies 1 and 8 as the selected trajectory approaches the final orbit. The position and velocity differences measured with respect to the final orbit are depicted in blue in Figs. 16a and 16b on a logarithmic scale for Policy 1 where the convergence hyperspheres are represented by a dashed black line. Similarly, Figs. 17a and 17b display the position and velocity differences for Policy 8. Both policies demonstrate similar convergence to the final orbit; both policies intersect the velocity hypersphere prior to the position hypersphere. The distinct control selections and transfer geometries developed by each policy may be attributed to the objectives that each policy learns to maximize in this multi-objective solution space.

Figure 16:

Position and velocity differences measured with respect to the closest state on the final orbit for a single trajectory in Scenario 2 commanded by Policy 1.

Figure 17:

Position and velocity differences measured with respect to the closest state on the final orbit for a single trajectory in Scenario 2 commanded by Policy 8.

Transfers between Halo Orbits of Distinct Jacobi Constants

The third trajectory design scenario requires the policies to learn to transfer from a lower-energy L1 northern halo orbit to a higher-energy L2 southern halo orbit using a reference trajectory with a large velocity discontinuity at x = 1 − μ. The eight policies are trained using the same reward functions as in the previous two examples to develop multiple regions within the multi-objective solution space. The evaluation trajectories for Policies 1 and 8 are displayed in Figs. 18a and 18b using the same configuration as Fig. 5, demonstrating that both policies learned to guide the low-thrust-enabled SmallSat to the target orbit along a transfer that resembles the reference trajectory. Further, the relationships between the propellant mass usage and the position and velocity differences for all policies are depicted in Figs. 19a and 19b, respectively. Similar to the previous two scenarios, as the propellant mass usage increases, the policies tend to follow the reference trajectory more closely. However, in this scenario, Policies 6, 7, and 8 produce similar propellant mass usage and state discontinuity behavior, reflecting that the mass coefficients in their assigned reward functions do not encourage additional propellant mass conservation. Further, the mean propellant mass usage across all policies is significantly less than that of the policies developed in the second scenario.

Figure 18:

1,000 evaluation trajectories departing an L1 northern halo orbit to an L2 southern halo orbit with a distinct Jacobi constant in the Earth-Moon CR3BP in Scenario 3 controlled using: (a) Policy 1 and (b) Policy 8.

Figure 19:

Relationship between the state discontinuities and propellant mass usage for the L1 northern halo to L2 southern halo transfer with distinct Jacobi constants in Scenario 3.

Examining a single evaluation trajectory for Policies 1 and 8 generates additional insights into the characteristics of the solutions in this trajectory design scenario. Figures 20a and 20b display a single evaluation trajectory associated with the most highly perturbed initial condition for Policies 1 and 8, respectively. Additionally, Figs. 21a and 21b depict the Jacobi constant along the trajectory for both policies in blue, where the black dashed line represents the Jacobi constant of the initial periodic orbit and the solid black line denotes the Jacobi constant of the final periodic orbit. Both policies tend to maintain the spacecraft's Jacobi constant approximately equal to that of one of the periodic orbits and limit excursions outside of the bounds formed by the two orbits' Jacobi constants. In Figure 22, the thrust magnitudes along both trajectories reveal that Policy 1 generates two long stretches of flight time with near-zero thrust magnitudes. Otherwise, Policy 1 generates bang-bang-like control vectors, culminating in large control usage at the discontinuity in the reference trajectory. Similarly, Policy 8 tends to produce actions with near-zero thrust magnitudes throughout the majority of the transfer except near the discontinuity in the reference trajectory. The position and velocity differences, measured with respect to the final orbit, are depicted in Figs. 23 and 24 for Policies 1 and 8, respectively. Both policies tend to intersect the velocity difference hypersphere prior to decreasing the position difference to match the final periodic orbit.

Figure 20:

Trajectory from a highly perturbed initial condition departing an L1 northern halo orbit to an L2 southern halo orbit with a distinct Jacobi constant in Scenario 3 controlled using: (a) Policy 1 and (b) Policy 8.

Figure 21:

Jacobi constant along an evaluation trajectory in Scenario 3 commanded using: (a) Policy 1 and (b) Policy 8.

Figure 22:

Thrust magnitude along an evaluation trajectory in Scenario 3 commanded using: (a) Policy 1 and (b) Policy 8.

Figure 23:

Position and velocity differences measured with respect to the closest state on the final orbit for a single trajectory in Scenario 3 commanded by Policy 1.

Figure 24:

Position and velocity differences measured with respect to the closest state on the final orbit for a single trajectory in Scenario 3 commanded by Policy 8.

6. CONCLUSION

MRPPO, an MODRL algorithm, is used to train multiple policies in three libration point orbit transfer scenarios in the Earth-Moon system. This approach uncovers a subset of the multi-objective solution space for a low-thrust-enabled SmallSat connecting two periodic orbits via a low-thrust transfer with a desired geometry. First, eight policies are trained on an L1 Lyapunov orbit to L2 Lyapunov orbit transfer as a verification for uncovering the solution space. Then, two more complex scenarios, focusing on departing an L1 northern halo orbit and approaching an L2 southern halo orbit at equal and at distinct energies, are evaluated using this approach. By varying the scaling of objectives in the assigned reward functions, the policies learn to recover trajectories that maximize the long-term reward in each trajectory design scenario: balancing deviations from the periodic orbits and reference trajectory, propellant mass usage, flight time, and convergence to the final periodic orbit. Policy 1 is trained to minimize the state discontinuities from the periodic orbits and reference trajectory without any penalty for propellant mass usage while Policies 2 through 8 possess increasing penalties on propellant mass usage. Once the policies are trained, they are evaluated on a common set of perturbed initial conditions to generate insights into the recovered solutions. For each trajectory design scenario, the policy with no penalty on propellant mass usage tends to follow the reference trajectory more closely while the policy with the highest penalty on propellant mass usage tends to conserve significantly more propellant mass at the expense of larger deviations from the reference trajectory. In addition, this paper demonstrates the capability of MRPPO for rapidly developing low-thrust trajectories in the vicinity of a discontinuous reference trajectory within a multi-body trajectory design scenario.

ACKNOWLEDGMENTS

This work was supported by a NASA Space Technology Research Fellowship. Part of the research has been carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. The High Performance Computing resources used in this investigation were provided by funding from the JPL Office of the Chief Information Officer. Part of this research was performed at the University of Colorado Boulder.

BIOGRAPHY


Christopher J. Sullivan is a fourth-year Ph.D. student in the Smead Department of Aerospace Engineering Sciences at the University of Colorado Boulder. His research focuses on leveraging multi-objective deep reinforcement learning and dynamical systems theory to efficiently explore the trajectory design space for low-thrust-enabled missions in multi-body systems. He is supported by a NASA Space Technology Research Fellowship.


Natasha Bosanac is an Assistant Professor in the Colorado Center for Astrodynamics Research and the Smead Department of Aerospace Engineering Sciences at the University of Colorado Boulder. Her research focuses on applications of dynamical systems theory and machine learning to astrodynamics and celestial mechanics in multi-body systems. She received her M.S. and Ph.D. in Aerospace Engineering from Purdue University in 2012 and 2016, respectively, and her Bachelor's degree from the Massachusetts Institute of Technology in 2010.


Rodney L. Anderson received his Ph.D. and M.S. in Aerospace Engineering Sciences from the University of Colorado at Boulder and his Bachelor’s degree from North Carolina State University. He is currently a Technologist in the Mission Design and Navigation section at the Jet Propulsion Laboratory, California Institute of Technology where he has worked since 2010. His research interests are focused on the application of dynamical systems theory to astrodynamics, mission design, navigation, and celestial mechanics.


Alinda K. Mashiku received her undergraduate degree from The Ohio State University in 2007, and both her MSAAE and Ph.D. from Purdue University in 2009 and 2013, respectively. She is currently the deputy program manager and technical lead for the NASA Conjunction Assessment Risk Analysis (CARA) program at the NASA Goddard Space Flight Center. CARA is responsible for conjunction assessment and risk analysis, providing operations support for NASA's uncrewed spacecraft in Earth orbit. Her academic and professional research interests include hybrid systems estimation, nonlinear Bayesian estimation theory, statistical orbit determination, fault detection and identification, and data compression using information measures and Bayesian machine learning for neural networks.


Jeffrey R. Stuart received his B.S., M.S., and Ph.D. in Aeronautics & Astronautics from Purdue University in 2008, 2011, and 2014, respectively. He is currently technical staff in Mission Design & Navigation at the Jet Propulsion Laboratory, California Institute of Technology. He is the MDNav Lead for the SunRISE formation flying mission and works on a variety of projects from early formulation to flight operations. His interests include automated trajectory design, advanced navigation techniques, formation flying, combinatorial optimization, and visual analytic methods.

Contributor Information

Christopher J. Sullivan, Colorado Center for Astrodynamics, Smead Aerospace Engineering Sciences, University of Colorado Boulder, 429 UCB, Boulder, CO 80303

Natasha Bosanac, Colorado Center for Astrodynamics, Smead Aerospace Engineering Sciences, University of Colorado Boulder, 429 UCB, Boulder, CO 80303.

Rodney L. Anderson, Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Dr., Pasadena, CA 91109.

Alinda K. Mashiku, Navigation and Mission Design Branch, NASA Goddard Space Flight Center, 8800 Greenbelt Rd., Greenbelt, MD 20771.

Jeffrey R. Stuart, Jet Propulsion Laboratory, California Institute of Technology, 4800 Oak Grove Dr., Pasadena, CA 91109.

REFERENCES

[1] Schoolcraft J, Klesh A, and Werne T, "MarCO: Interplanetary mission development on a CubeSat scale," in Space Operations: Contributions from the Global Community. Springer, 2017, pp. 221–231.
[2] NASA, MarCO (Mars Cube One), 2020, https://solarsystem.nasa.gov/missions/mars-cube-one/in-depth/; accessed 20 September 2020.
[3] Bosanac N, Cox A, Howell K, and Folta DC, "Trajectory design for a cislunar CubeSat leveraging dynamical systems techniques: The Lunar IceCube mission," Acta Astronautica, pp. 283–296, 2018.
[4] Genova AL and Dunham DW, "Trajectory design for the lunar polar hydrogen mapper mission," 27th AAS/AIAA Space Flight Mechanics Meeting, San Antonio, TX, 2017.
[5] Johnson L, Castillo-Rogez J, Dervan J, and McNutt L, "Near Earth Asteroid (NEA) Scout," 4th International Symposium on Solar Sailing, Kyoto, Japan, 2017.
[6] Moffaert KV and Nowé A, "Multi-objective reinforcement learning using sets of Pareto dominating policies," Journal of Machine Learning Research, vol. 15, no. 107, pp. 3663–3692, 2014. [Online]. Available: http://jmlr.org/papers/v15/vanmoffaert14a.html
[7] Sullivan CJ and Bosanac N, "Using multi-objective deep reinforcement learning to uncover a Pareto front in multi-body trajectory design," in AAS/AIAA Astrodynamics Specialist Conference, 2020.
[8] ―, "Using reinforcement learning to design a low-thrust approach into a periodic orbit in a multi-body system," in 30th AIAA/AAS Space Flight Mechanics Meeting, 2020.
[9] Miller D and Linares R, "Low-thrust optimal control via reinforcement learning," in 29th AAS/AIAA Space Flight Mechanics Meeting, 2019, pp. 1–18.
[10] LaFarge NB, Miller D, Howell KC, and Linares R, "Guidance for closed-loop transfers using reinforcement learning with application to libration point orbits," in AIAA SciTech 2020 Forum, 2020, p. 0458.
[11] Zavoli A and Federici L, "Reinforcement learning for low-thrust trajectory design of interplanetary missions," AAS/AIAA Astrodynamics Specialist Conference, 2020.
[12] Scorsoglio A, Furfaro R, Linares R, Massari M, et al., "Actor-critic reinforcement learning approach to relative motion guidance in near-rectilinear orbit," Advances in the Astronautical Sciences, pp. 1–20, 2019.
[13] Das-Stuart A, Howell KC, and Folta DC, "Rapid trajectory design in complex environments enabled by reinforcement learning and graph search strategies," Acta Astronautica, vol. 171, pp. 172–195, 2020.
[14] Cheng L, Wang Z, and Jiang F, "Real-time control for fuel-optimal moon landing based on an interactive deep reinforcement learning algorithm," Astrodynamics, vol. 3, no. 4, pp. 375–386, 2019.
[15] Harris A, Teil T, and Schaub H, "Spacecraft decision-making autonomy using deep reinforcement learning," in 29th AAS/AIAA Space Flight Mechanics Meeting, 2019, pp. 1–19.
[16] Nguyen TT, Nguyen ND, Vamplew P, Nahavandi S, Dazeley R, and Lim CP, "A multi-objective deep reinforcement learning framework," Engineering Applications of Artificial Intelligence, vol. 96, p. 103915, 2020.
[17] Tajmajer T, "Modular multi-objective deep reinforcement learning with decision values," in 2018 Federated Conference on Computer Science and Information Systems. IEEE, 2018, pp. 85–93.
[18] Koon WS, Lo MW, Marsden JE, and Ross SD, Dynamical Systems, the Three-Body Problem and Space Mission Design. Marsden Books, 2006.
[19] Parker JS and Anderson RL, Low-Energy Lunar Trajectory Design. Hoboken, NJ: John Wiley and Sons, 2014, pp. 95–113, doi: 10.1002/9781118855065.
[20] Szebehely V, Theory of Orbits: The Restricted Problem of Three Bodies. London, UK: Academic Press, 1967.
[21] Davis DC, Phillips SM, Howell KC, Vutukuri S, and McCarthy BP, "Stationkeeping and transfer trajectory design for spacecraft in cislunar space," 2017 AAS/AIAA Astrodynamics Specialist Conference, Stevenson, WA, 2017.
[22] Anderson RL and Lo MW, "Role of invariant manifolds in low-thrust trajectory design," Journal of Guidance, Control, and Dynamics, vol. 32, no. 6, pp. 1921–1930, 2009, doi: 10.2514/1.37516.
[23] Mingotti G, Topputo F, and Bernelli-Zazzera F, "Low-energy, low-thrust transfers to the Moon," Celestial Mechanics and Dynamical Astronomy, vol. 105, no. 1–3, p. 61, 2009.
[24] Mnih V, Kavukcuoglu K, Silver D, Rusu A, Veness J, Bellemare M, Graves A, Riedmiller M, Fidjeland A, Ostrovski G, Petersen S, Beattie C, Sadik A, Antonoglou I, King H, Kumaran D, Wierstra D, Legg S, and Hassabis D, "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, 2015.
[25] Minsky M, "Steps toward artificial intelligence," Proceedings of the IRE, vol. 49, no. 1, pp. 8–30, 1961, doi: 10.1109/JRPROC.1961.287775.
[26] Barto AG, "Reinforcement learning and dynamic programming," 6th IFAC/IFIP/IFORS/IEA Symposium on Analysis, Design and Evaluation of Man-Machine Systems, Cambridge, MA, vol. 28, no. 15, pp. 407–412, 1995.
[27] Schulman J, Moritz P, Levine S, Jordan M, and Abbeel P, "High-dimensional continuous control using generalized advantage estimation," arXiv preprint arXiv:1506.02438, 2015.
[28] Hornik K, Stinchcombe M, and White H, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[29] Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, Dieleman S, Grewe D, Nham J, Kalchbrenner N, Lillicrap T, Leach M, Kavukcuoglu K, Graepel T, and Hassabis D, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[30] Schulman J, Wolski F, Dhariwal P, Radford A, and Klimov O, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[31] Henderson P, Islam R, Bachman P, Pineau J, Precup D, and Meger D, "Deep reinforcement learning that matters," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[32] Andrychowicz M, Raichuk A, Stańczyk P, Orsini M, Girgin S, Marinier R, Hussenot L, Geist M, Pietquin O, Michalski M, Gelly S, and Bachem O, "What matters in on-policy reinforcement learning? A large-scale empirical study," 2020.
[33] Konda VR and Tsitsiklis JN, "Actor-critic algorithms," in Advances in Neural Information Processing Systems 12, Denver, CO, pp. 1008–1014, 2000.
[34] Peng H, Chen B, and Wu Z, "Multi-objective transfer to libration-point orbits via the mixed low-thrust and invariant-manifold approach," Nonlinear Dynamics, vol. 77, no. 1–2, pp. 321–338, 2014.
[35] Shah V, Beeson R, and Coverstone V, "A method for optimizing low-energy transfers in the Earth-Moon system using global transport and genetic algorithms," in AIAA/AAS Astrodynamics Specialist Conference, 2016.
