Scientific Reports. 2023 Feb 2;13:1925. doi: 10.1038/s41598-023-28582-4

Safe reinforcement learning under temporal logic with reward design and quantum action selection

Mingyu Cai 1, Shaoping Xiao 2, Junchao Li 2, Zhen Kan 3
PMCID: PMC9894922  PMID: 36732441

Abstract

This paper proposes an advanced Reinforcement Learning (RL) method incorporating reward shaping, safety value functions, and a quantum action selection algorithm. The method is model-free and can synthesize a finite policy that maximizes the probability of satisfying a complex task. Although RL is a promising approach, it suffers from unsafe traps and sparse rewards and becomes impractical when applied to real-world problems. To improve safety during training, we introduce a concept of safety values, which results in a model-based adaptive scenario due to online updates of transition probabilities. On the other hand, a high-level complex task is usually formulated via formal languages, including Linear Temporal Logic (LTL). Another novelty of this work is the use of an Embedded Limit-Deterministic Generalized Büchi Automaton (E-LDGBA) to represent an LTL formula. The obtained deterministic policy can generalize tasks over both infinite and finite horizons. We design an automaton-based reward, and the theoretical analysis shows that an agent following the optimal policy accomplishes the task specification with maximum probability. Furthermore, a reward-shaping process is developed to avoid sparse rewards and facilitate RL convergence while keeping the optimal policies invariant. In addition, inspired by quantum computing, we propose a quantum action selection algorithm to replace the existing ε-greedy algorithm and balance exploration and exploitation during learning. Simulations demonstrate that the proposed framework achieves good performance, dramatically reducing the number of visits to unsafe states while converging to optimal policies.

Subject terms: Mechanical engineering, Electrical and electronic engineering

Introduction

Markov Decision Processes (MDPs) usually model motion planning subject to stochastic uncertainties and have been applied to represent many engineering systems1. On the other hand, complex high-level specifications can be expressed via formal languages2, including Linear and Signal Temporal Logic (LTL and STL). Specifically, automaton-assisted control synthesis for general MDPs has attracted growing attention for achieving tasks formulated in LTL. When the MDP model is fully known, a common LTL-based objective is to balance maximizing the task satisfaction probability and reducing the total expected cost3,4.

We are in the era of Artificial Intelligence (AI). Advanced AI techniques, including Machine Learning (ML) and Evolutionary Algorithms (EAs), have been utilized in various science and engineering disciplines. Many pioneering works have been done in materials science, complex systems, robotics, and more. As one of ML's subsets, Reinforcement Learning (RL) does not require pre-gathered data for supervision. Indeed, RL is a sequential decision-making process in which the agent learns optimal control policies by gathering experience from an unknown environment5, sometimes modeled as an uncertain MDP.

In RL, safety means that the agent avoids visiting undesirable states during exploration. Although standard RL algorithms can ensure convergence to optimal policies, they generally provide no safety guarantees on what happens during the learning process. For example, a mobile robot that is not allowed to enter a control room may still visit it under ε-greedy exploration during learning. In other words, the agent does not intend to explore safely while learning the optimal policy that maximizes the collected reward. Therefore, safe RL has attracted increasing attention, and various methods have been proposed6. However, most existing methods7–10 either hold strong assumptions about the dynamic models or only focus on minimizing the risk of violating a safety specification without considering complex high-level specifications.

Several works on abstraction-based scenarios of MDPs have been proposed in the past few years. For example, some employed LTL formulas to specify the instructions for a control agent to learn optimal strategies, i.e., optimal policies. Specifically, many works11–15 designed automaton-based rewards so that model-free RL agents could find optimal policies satisfying LTL specifications with probabilistic guarantees. However, none of them addresses the critical safety issues during training.

Li et al.16 designed a robustness-based automaton combined with control barrier functions to facilitate learning, but the tasks introduced in16 were considered over finite horizons only. A related work17 utilized a Limit-Deterministic Generalized Büchi Automaton (LDGBA)18 to represent LTL specifications for learning enhancement and proposed a model-based safe padding technique to prevent the system from entering bad states. However, as shown in our previous study15, directly applying an LDGBA with purely positional policies, i.e., deterministic policies, might fail to achieve some tasks because there is no record of which accepting sets have been visited. An accepting set consists of automaton states that satisfy the acceptance condition of an LDGBA.

On the other hand, there is growing interest in quantum supremacy because a quantum computing algorithm, demonstrated on a quantum computer, can offer significant speedup compared to the best possible algorithm on a classical computer. Moreover, the integration of machine learning and quantum computing19, called Quantum Machine Learning (QML)20, investigates how to encode classical data in quantum states and leverage the superposition properties of quantum systems for solving specific problems. In particular, Quantum Neural Networks (QNNs), including Quantum Convolutional Neural Networks (QCNNs), have become an exciting topic for researchers.

Quantum neural networks are computational neural network models, mostly feed-forward networks21, in which quantum bits in quantum neurons process and pass the information. In addition, inspired by Convolutional Neural Networks (CNNs), QCNNs were developed and evaluated for recognizing quantum states encoded from symmetry-protected topological phases22. A typical application of QML was image recognition23 utilizing QCNN (or QNN) and Quantum Boolean Image Processing (QBIP)24. Furthermore, QNNs have been implemented in Deep Reinforcement Learning (DRL) to enhance learning outcomes25.

For certain problems, such as optimization26, quantum computing is much faster than classical computing. Therefore, we can intuitively expect that RL and quantum computing will join forces to make faster AI. Saggio et al.27 utilized a quantum communication channel between an agent and the environment to speed up the RL process. Their method generated a quantum state that was a superposition of rewarded and non-rewarded action sequences at each quantum epoch. After the environment flipped the sign of the winning action sequence, a quantum algorithm was applied to improve the chance of selecting the best action sequence. Such a hybrid AI has been shown to speed up the RL process by 60%. However, their approach was only applied to Deterministic Strictly Epochal (DSE) learning scenarios.

In another work, Dong and co-workers28 utilized MDP states (actions) in conventional RL as eigenbases to generate the state (action) spaces in Quantum Reinforcement Learning (QRL) by superposition. The eigenbases, i.e., eigenstates or eigenactions, serve as orthogonal bases in a Hilbert space corresponding to the generated quantum system. To update the probability of eigenactions by Grover's algorithm, the collected reward and state value functions are needed to determine the number of iterations. In addition, in their approach, each eigenaction state has a duplicated copy to prevent memory loss after selecting an action. Ganger and Hu29 extended Dong's work28 by using state-action value functions, also known as Q values, instead of state value functions in QRL.

In this paper, we propose several advanced RL techniques for motion planning. One of our contributions is to extend our previous results15 by presenting a provably correct reward design and developing model-based safe padding. We encode LTL specifications over an infinite horizon into an Embedded LDGBA (E-LDGBA) that records unvisited accepting sets to enable the application of deterministic policies. By using the shaping process for dense rewards, rigorous analysis shows that optimizing the expected return of the shaped reward scheme is equivalent to maximizing the probability of satisfying the LTL task.

Secondly, assuming the ability of local observation, model-based padding can effectively add a "shield" that prevents the agent from entering sink components with a probabilistic bound and maintains safety during the learning process. We propose a concept of safety value functions, including state and action safety values, to estimate an agent's probability of entering a safe state. Combining RL's conventional value functions with the proposed safety value functions maximizes both the agent's safety and task satisfaction.

Some other works also consider safety issues in RL. For example, Fernandez-Gauna et al.30 defined Undesirable Terminal States (UTS) as terminal states associated with negative rewards. Once the agent reached a UTS, the corresponding constraints were violated, and the environment was reset to its initial state. Another work31 presented an approach for provably safe learning, introducing Justified Speculative Control (JSC), which combined verified runtime monitoring with RL: if the system was accurately modeled, only safe actions were taken; otherwise, an available action was selected randomly. Differing from those works, we propose safety value functions to quantify how well the agent will avoid unsafe states starting from the current state or taking the current action. The action selection depends on both the conventional RL value function and the newly introduced safety value function.

In addition, this work improves several aspects of the results of17. First, we employ a novel automaton structure, the E-LDGBA, which has been verified to accept the same language as an LDGBA, to make up for the drawbacks of LDGBA. Then, we develop a potential function in a reward-shaping process to maintain a dense reward, and the obtained optimal policy still satisfies the tasks with maximum probability. Furthermore, by dividing the LTL formula into two parts, one of which defines the safety properties, we show that the model-based safe padding via safety values leaves the original optimal convergence invariant.

Finally, inspired by Grover's algorithm and quantum gate/measurement noises, we propose a quantum action selection algorithm to substitute the ε-greedy action selection in conventional RL. Unlike the work in27, the quantum state representing the action space in our method is created by the superposition of available actions at the current MDP state instead of action sequences at each episode27. Moreover, this quantum state is locally generated at each learning step, and no duplication is needed for global updates as in the works of28 and29. It shall be noted that no "perfect" copy of an arbitrary unknown quantum state is available, according to the no-cloning theorem32.

The organization of this paper is described below. Section "Problem formulation" formulates MDP, RL, LTL, automata, and the problem definition. After introducing the E-LDGBA, Section "Automaton-based reward design" describes an automaton-based reward design and a reward-shaping process. Section "Safety value functions" proposes safety value functions, and quantum action selection is described in Section "Quantum action selection". Then, two examples are included in Section "Simulations and discussions", followed by the conclusion and future work.

Problem formulation

Quantum computing

In classical computing, i.e., binary computing, a classical bit is a binary piece of information that can only take one of two possible values or states, for example, logic states 0 or 1. A qubit, short for quantum bit, is the quantum equivalent of a classical bit. Differing from a classical bit, a qubit can represent one of two basic states or any combination of them. Given the most common basis, |0⟩ and |1⟩, corresponding to the logic states 0 and 1 of a classical bit, a qubit in a superposition state can be expressed as

|q⟩ = α₀|0⟩ + α₁|1⟩    (1)

where α₀ and α₁ are complex coefficients. When we measure a qubit in its superposition state, i.e., Equation (1), the qubit collapses into one of the basis states, e.g., |0⟩ or |1⟩, with probability |α₀|² or |α₁|², respectively. In addition, |α₀|² + |α₁|² = 1 shall be satisfied.
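As a concrete illustration, the collapse probabilities in Equation (1) can be simulated numerically. The sketch below (plain NumPy; the coefficient values are illustrative choices, not from the paper) prepares a qubit with α₀ = 1/√2 and α₁ = i/√2 and samples a measurement outcome.

```python
import numpy as np

# A qubit |q> = a0|0> + a1|1> stored as a length-2 complex vector.
# These coefficients are illustrative; any pair with |a0|^2 + |a1|^2 = 1 works.
a0, a1 = 1 / np.sqrt(2), 1j / np.sqrt(2)
q = np.array([a0, a1])

# Measurement probabilities |a0|^2 and |a1|^2; they must sum to 1.
probs = np.abs(q) ** 2
assert np.isclose(probs.sum(), 1.0)

# Simulated measurement: the qubit collapses to |0> or |1>
# with probabilities |a0|^2 and |a1|^2, respectively.
rng = np.random.default_rng(0)
outcome = rng.choice([0, 1], p=probs)
print(probs, outcome)
```

Here both outcomes are equally likely, since |α₀|² = |α₁|² = 1/2 despite the complex phase on α₁.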

Similar to Equation (1), a quantum state in an n-qubit system can be written as

|ψ⟩ₙ = |q₁q₂…qₙ⟩ = Σ_{k=0}^{2ⁿ−1} α_k|k⟩    (2)

where |k⟩ are the basic states of the n-qubit system, and α_k are the corresponding complex coefficients satisfying Σ_{k=0}^{2ⁿ−1} |α_k|² = 1. When measured, the n-qubit state collapses into one of the basic states |k⟩ with probability |α_k|².

In quantum computing, a quantum register in the n-qubit system is utilized to carry a superposition of n-qubit basic states. Compared to an n-bit classical register, which can store any one of 2ⁿ possible numbers, an n-qubit register can store any combination of the 2ⁿ numbers. As the quantum version of a classical logic gate, a quantum gate operates on a quantum register, usually initialized as |00...0⟩, to evolve its state during the quantum computation.

One of the standard single-qubit quantum gates is the Hadamard gate (H), which produces a superposition of equal parts |0⟩ and |1⟩ when operating on either |0⟩ or |1⟩, as shown below.

H|0⟩ = (1/√2)|0⟩ + (1/√2)|1⟩,  H|1⟩ = (1/√2)|0⟩ − (1/√2)|1⟩    (3)

Indeed, the right-hand sides of Equation (3) form another set of basic states, named |+⟩ and |−⟩, which are also commonly used.

In addition, among the Pauli gates (X, Y, Z) and the phase gates (S and T), the X gate is the quantum analog of the classical NOT gate with respect to |0⟩ and |1⟩, i.e., X|0⟩ = |1⟩ and X|1⟩ = |0⟩. A commonly used two-qubit gate is the CNOT (or CX) gate, which provides the quantum equivalent of the classical XOR gate. CNOT stands for controlled-NOT, and the gate is always applied to two qubits: if the first qubit (the control) is |1⟩, a NOT operation is applied to the second qubit (the target); otherwise, the target qubit remains the same. For example,

CNOT|01⟩ = |01⟩,  CNOT|11⟩ = |10⟩    (4)
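Because these gates are small unitary matrices, their action on basis states can be checked directly. The following sketch (an illustration using NumPy matrices, not the paper's formalism) verifies Equations (3) and (4) numerically.

```python
import numpy as np

# Single-qubit gates as 2x2 matrices on the basis {|0>, |1>}.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard gate
X = np.array([[0, 1], [1, 0]])                  # Pauli-X (quantum NOT)

ket0 = np.array([1.0, 0.0])
ket1 = np.array([0.0, 1.0])

# Eq. (3): H|0> = (|0> + |1>)/sqrt(2) = |+>, H|1> = (|0> - |1>)/sqrt(2) = |->
plus, minus = H @ ket0, H @ ket1

# X flips the basis states: X|0> = |1>, X|1> = |0>
assert np.allclose(X @ ket0, ket1) and np.allclose(X @ ket1, ket0)

# CNOT on two qubits (basis order |00>, |01>, |10>, |11>):
# flips the target (second) qubit iff the control (first) qubit is |1>.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])
ket01 = np.kron(ket0, ket1)   # |01>
ket11 = np.kron(ket1, ket1)   # |11>
ket10 = np.kron(ket1, ket0)   # |10>
assert np.allclose(CNOT @ ket01, ket01)   # Eq. (4): CNOT|01> = |01>
assert np.allclose(CNOT @ ket11, ket10)   # Eq. (4): CNOT|11> = |10>
print(plus, minus)
```

Two-qubit states are built with the Kronecker product, so the 4×4 CNOT matrix acts on length-4 vectors.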

Markov decision process

Mathematically, an MDP, denoted by M = (S, A, p_S, s₀, Π, L), consists of a finite state space S and a finite action space A. In addition, A(s) denotes the set of actions available to the agent at state s. The transition function, p_S : S × A × S → [0, 1], defines the probability of the agent moving from one state (s) to another (s′) after taking an action (a). In particular, Σ_{s′} p_S(s, a, s′) = 1 shall be satisfied. Usually, the transition probability p_S describes the motion uncertainties in an RL problem. Π is a set of atomic propositions. The labeling function, L : S → 2^Π, where 2^Π is the power set of Π, assigns a subset of Π to each state. Sometimes, there exists an initial state s₀ ∈ S from which the agent starts.

When an action function, ξ : S → A, is deterministic, it outputs an action (a) at a given state (s). The MDP evolves step by step after performing action ξᵢ = a at step i (i ≥ 0). Consequently, a sequence of actions ξ = ξ₀ξ₁… defines the control policy and generates a path s = s₀s₁s₂… over M, where the transition probability p_S(sᵢ, aᵢ, sᵢ₊₁) is positive for all i. For a stationary policy, we have ξᵢ = ξ for all i. On the other hand, if ξᵢ = ξ(sᵢ) depends on the current state sᵢ only, the control policy is memoryless. Otherwise, the policy has a finite memory of the history of visited states, i.e., ξᵢ = ξ(..., sᵢ₋₁, sᵢ).

When the environment is fully observable and the agent carries out a simple go-to-goal task, each decision relies on the current state only (the Markov property), so the control policy is memoryless. However, one focus of this study is developing a framework to handle complex tasks by introducing a product of the MDP and an LTL-induced automaton. The induced control policy is usually a finite-memory one, i.e., the action selection depends on the current and past states.
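A tabular MDP of this kind can be sketched with plain dictionaries. The states, actions, transition probabilities, and labels below are illustrative placeholders, not the paper's grid-world.

```python
# A minimal tabular MDP sketch M = (S, A, p_S, s0, Pi, L).
# All names and numbers here are illustrative.
states = ["s0", "s1", "s2"]
actions = {"s0": ["right"], "s1": ["right"], "s2": []}   # A(s)

# p[(s, a)] maps successor states s' to p_S(s, a, s'); each row sums to 1,
# modeling motion uncertainty.
p = {
    ("s0", "right"): {"s1": 0.9, "s0": 0.1},
    ("s1", "right"): {"s2": 0.8, "s1": 0.2},
}

# Labeling function L: S -> 2^Pi over atomic propositions Pi = {"goal"}.
labels = {"s0": set(), "s1": set(), "s2": {"goal"}}

# Sanity check: every transition row is a probability distribution.
for (s, a), dist in p.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
print(labels["s2"])
```

This dictionary layout mirrors the tuple definition above: `p` plays the role of p_S, and `labels` the role of L.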

Reinforcement learning

The environment in an RL problem can be expressed by an MDP as defined in Section "Markov decision process". An agent learns the policy by interacting with the environment, and learning is an iterative process. For example, Fig. 1 shows that the agent decides which action to take after identifying the present state. Once the selected action is conducted, the agent receives feedback, i.e., the reward, and observes the next state for deciding on the following action.

Figure 1. The agent interacts with the environment in a reinforcement learning problem.

The reward function is one of the key elements in an RL problem. Generally, a reward function can be denoted by Λ(s, a, s′), which generates the reward an agent receives after it takes an action (a) at the current state (s) and reaches another state (s′). In addition, a discount function γ(s, a, s′) ∈ [0, 1] is usually employed, and the expected discounted return an agent can receive under policy ξ after starting from state s is defined as

U^ξ(s) = E^ξ[ Σ_{i=0}^{∞} γⁱ(sᵢ, aᵢ, sᵢ₊₁) · Λ(sᵢ, aᵢ, sᵢ₊₁) | s₀ = s ]    (5)

Equation (5) is also referred to as the utility or the state value function under policy ξ. In an RL problem, the state value function U(s) quantifies how well the agent can reach a goal state over the long run, starting from state s. The learning objective is to find an optimal policy, ξ* = argmax_ξ U^ξ(s), which maximizes the state value at each state. In other words, an optimal policy guides the agent in deciding on the best action to accomplish the task. When the environment is not fully known, i.e., the transition probability in its MDP model is unknown, model-free RL methods with tabular approaches are usually utilized for finite state and action spaces.

The methods in RL can be categorized as policy-based or value-based. Value-based RL methods directly solve for the value functions and converge to the optimal ones. Q-learning33 is one of them; it solves action values, or Q values, instead of state values. A Q value is a function of state and action, i.e., Q(s, a), and represents the expected return an agent can collect under the current policy after taking action a at state s. Once the optimal action values have converged, we can obtain an optimal policy via greedy action selection. It shall be noted that state values are related to action values by U(s) = max_a Q(s, a).

Since Q-learning does not require a state-to-state transition function, it is a model-free RL method. When an RL problem has finite state/action spaces, a Q table is adopted in the naïve Q-learning method to store the value of every action at each state. Therefore, the best action at each state can be determined by searching for the highest Q value in the table. This method employs Monte Carlo simulations to converge Q values. Learning usually takes many episodes; each episode consists of many steps or ends once the agent accomplishes the task. At each step, after taking an action (a), the agent moves from one state (s) to another (s′). Then, the Q value of action a at state s can be updated via the Bellman equation5.

Q_new(s, a) = Q(s, a) + α[Λ(s, a, s′) + γ(s, a, s′) max_{a′} Q(s′, a′) − Q(s, a)]    (6)

where max_{a′} Q(s′, a′) represents the highest action value at state s′.

Equations (5) and (6) include generalized definitions of the reward function Λ(s, a, s′) and the discount factor γ(s, a, s′). Indeed, in a commonly used formulation5,33, both are functions of the current state only, i.e., Λ(s) and γ(s), respectively. In addition, a properly designed learning rate, α, is essential. If α is large, convergence is fast but sometimes unstable, and non-optimal value functions may be reached. On the other hand, although a smaller α results in a smoother and more stable convergence, the procedure may be slower. It is practical to employ an adaptive learning-rate scheme that starts with a large learning rate and decreases it over iterations.
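The update in Equation (6), combined with ε-greedy exploration, can be sketched in a few lines. The chain environment, hyperparameters, and episode count below are illustrative choices, not the paper's setup.

```python
import random

# Tabular Q-learning sketch on a 4-state chain (illustrative environment).
# Actions: 0 = left, 1 = right; reaching state 3 yields reward 1 and ends
# the episode. Constant reward/discount, i.e., Lambda(s) and gamma.
n_states, gamma, alpha, eps = 4, 0.9, 0.5, 0.3
Q = [[0.0, 0.0] for _ in range(n_states)]
rng = random.Random(0)

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

for _ in range(500):                          # episodes
    s = 0
    while s != n_states - 1:
        # epsilon-greedy action selection
        if rng.random() < eps:
            a = rng.randrange(2)
        else:
            a = max((0, 1), key=lambda act: Q[s][act])
        s2, r = step(s, a)
        # Bellman update, Eq. (6): Q <- Q + alpha*(r + gamma*max_a' Q(s',a') - Q)
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print(Q[0])
```

After training, the greedy action at every state is "right", and Q[2][1] approaches the optimal value 1 while Q[1][1] and Q[0][1] approach 0.9 and 0.81, reflecting the discount.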

Deep neural networks are usually employed to approximate value functions if the state and/or action spaces in an RL problem are large or continuous. The methods are then called Deep Reinforcement Learning (DRL) methods. The commonly used methods include Deep Q Network (DQN)34, a variation of Q-learning, and Proximal Policy Optimization (PPO)35, a policy-based method.

Linear temporal logic

Instead of simple go-to-goal tasks, this paper considers complex high-level tasks that LTL, a formal language, can describe. Specifically, using LTL, we can specify and verify properties of how a world changes over time. The properties of interest, given high-level specifications of a system, include safety (i.e., something bad never happens), liveness (i.e., something good eventually occurs), and fairness (i.e., independent subsystems can make progress).

An LTL formula can be built from atomic propositions Π, Boolean operators, including True, negation (¬), and conjunction (∧), and two temporal operators, next (◯) and until (U)2. Its syntax can be inductively defined as

ϕ := True | a | ϕ₁ ∧ ϕ₂ | ¬ϕ | ◯ϕ | ϕ₁ U ϕ₂,    (7)

where a ∈ Π is an atomic proposition. The temporal operators define time-dependent properties as the system evolves. For example, the formula ◯ϕ can be read as "ϕ is true at the next state," while ϕ₁ U ϕ₂ reads as "ϕ₁ is true at each state until ϕ₂ becomes true at some future state."

A word is defined as an infinite sequence o = o₀o₁… with oᵢ ∈ 2^Π. Let ⊨ denote the satisfaction relation. Consequently, words can be used to interpret the semantics of LTL formulas, defined as below.

o ⊨ True
o ⊨ α  ⟺  α ∈ o[0]
o ⊨ ϕ₁ ∧ ϕ₂  ⟺  o ⊨ ϕ₁ and o ⊨ ϕ₂
o ⊨ ¬ϕ  ⟺  o ⊭ ϕ
o ⊨ ◯ϕ  ⟺  o[1:] ⊨ ϕ
o ⊨ ϕ₁ U ϕ₂  ⟺  ∃ t₁ s.t. o[t₁:] ⊨ ϕ₂ and ∀ t₂ ∈ [0, t₁), o[t₂:] ⊨ ϕ₁

In addition to the standard operators introduced in Equation (7), other propositional and temporal logic operators can be derived, including False ≡ ¬True, disjunction (∨), implication (→), always (□), and eventually (◊)2. It shall be noted that the basic formulas with the other temporal operators, □ϕ and ◊ϕ, can be read as "ϕ is always true in the future" and "ϕ could be true sometime in the future," respectively. Thus, in an RL problem, an LTL formula can describe whether or not a set of infinite traces (i.e., sequences of MDP states) satisfies the user-specified task(s).
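The semantics above can be prototyped over a finite prefix of a word. The sketch below is a deliberate simplification: true LTL semantics are defined over infinite words, and the nested-tuple encoding of formulas is an illustrative choice, not the paper's machinery.

```python
# LTL-style checking over a FINITE prefix of a word (illustrative sketch).
# A word is a list of sets of atomic propositions, e.g. [{"T1"}, set(), {"T2"}].

def sat(word, i, phi):
    """Check word[i:] |= phi for formulas encoded as nested tuples."""
    op = phi[0]
    if op == "true":  return True
    if op == "atom":  return phi[1] in word[i]
    if op == "not":   return not sat(word, i, phi[1])
    if op == "and":   return sat(word, i, phi[1]) and sat(word, i, phi[2])
    if op == "next":  return i + 1 < len(word) and sat(word, i + 1, phi[1])
    if op == "until": # phi1 U phi2: phi2 eventually holds, phi1 holds before
        return any(sat(word, t, phi[2]) and
                   all(sat(word, t2, phi[1]) for t2 in range(i, t))
                   for t in range(i, len(word)))
    raise ValueError(op)

# Derived operators: eventually phi == True U phi; always phi == not eventually not phi.
def eventually(phi): return ("until", ("true",), phi)
def always(phi):     return ("not", eventually(("not", phi)))

word = [{"T1"}, set(), {"T2"}]
print(sat(word, 0, eventually(("atom", "T2"))))        # True
print(sat(word, 0, always(("not", ("atom", "Us")))))   # True
```

The two derived operators mirror the identities ◊ϕ ≡ True U ϕ and □ϕ ≡ ¬◊¬ϕ stated above.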

Example 1

Given a grid-world, as shown in Fig. 2, an MDP M can be defined with S = {s₀, s₁, ..., s₈} and Π = {T₁, T₂, T₃, Us}. Tᵢ, where i ∈ {1, 2, 3}, are labeled on the states of interest, while Us is labeled on an unsafe state. Considering a user-specified task, "First T₁, then T₂, finally T₃, and never Us," we can write an LTL formula as φ = ◊(T₁ ∧ ◊(T₂ ∧ ◊T₃)) ∧ □¬Us. This task can be satisfied by, for example, a word o = T₁*T₁T₂T₃T₃*, where * matches the preceding symbol zero or more times (up to infinitely many). Consequently, a set of infinite traces satisfies the task defined in this example if it can generate the above-expressed word.

Figure 2. A grid-world with states of interest labeled by Tᵢ and unsafe states labeled by Us.

In the above example, the solution consists of any policy that can generate traces satisfying the LTL-formulated task(s) while maximizing the expected return in Equation (5). In other words, the LTL formula acts as a constraint in finding the optimal policy. However, such a constraint cannot be directly implemented in conventional RL problems. Instead, a finite-state automaton is usually employed to represent the LTL formula. Then, the optimal policy can be achieved via RL with automaton theory and model checking2.

Limit-deterministic generalized Büchi automaton

As discussed above, an LTL formula can specify a complex high-level task. Then, task satisfaction can be evaluated by an automaton, such as an LDGBA18, through model checking.

Definition 1

(LDGBA) An LDGBA, A = (Q, Σ, δ, q₀, F), consists of a finite set of states Q and a finite alphabet Σ = 2^Π over a set of atomic propositions Π. The transition function, δ : Q × (Σ ∪ {ϵ}) → 2^Q, allows the LDGBA to change its state when taking an input symbol (α ∈ Σ) or none (α ∈ {ϵ}). In addition, q₀ ∈ Q is the initial automaton state. F = {F₁, F₂, …, F_f} represents the set of accepting sets of the automaton, with Fᵢ ⊆ Q for i = 1, …, f. The state set Q of an LDGBA can be divided into a deterministic set and a non-deterministic set, i.e., Q_D and Q_N, respectively. Such a partition satisfies the following:

  • The LDGBA transitions in the deterministic set are total, i.e., |δ(q, α)| = 1 for α ∈ Σ, and such transitions are allowed within this set only. Therefore, after consuming an input symbol α at an automaton state q ∈ Q_D, the resulting automaton state satisfies δ(q, α) ⊆ Q_D;

  • The state transitions without any input symbol, i.e., ϵ-transitions, are only valid for transitions from the non-deterministic set (Q_N) to the deterministic set (Q_D); therefore, they are not allowed in Q_D; and

  • All the accepting sets are subsets of the deterministic set.

A run of an LDGBA can be written as q = q₀q₁…. Let inf(q) denote the set of states that occur infinitely often in q. If inf(q) ∩ Fᵢ ≠ ∅ for all i ∈ {1, …, f}, we say that q satisfies the LDGBA acceptance condition, or the LDGBA accepts q. Example 2 demonstrates an LDGBA representing the LTL formula in Example 1. We recommend Owl36 to readers for more details about automaton generation.

Example 2

In Example 1, the LTL formula for the user-specified task is ϕ = ◊(T₁ ∧ ◊(T₂ ∧ ◊T₃)) ∧ □¬Us. Figure 3 shows the LTL-induced LDGBA, which has only one accepting set, F₁ = {q₃}. A run of the LDGBA, for example, q = q₀q₁q₂(q₃)^ω, is accepted because inf(q) ∩ F₁ = {q₃} ≠ ∅.

Figure 3. An LDGBA for the LTL formula ϕ = ◊(T₁ ∧ ◊(T₂ ∧ ◊T₃)) ∧ □¬Us.

Problem formulation

It was discussed above that an LTL formula ϕ can describe the task specifications to be performed by an agent. Given a policy ξ = ξ₀ξ₁… that the agent learns from MDP M, it generates a path, denoted by s_ξ = s₀…sᵢsᵢ₊₁…, with p_S(sᵢ, aᵢ, sᵢ₊₁) > 0. Correspondingly, a sequence of labels (i.e., a trace) can be derived from the labeling function as L(s_ξ) = l₀l₁… where lᵢ = L(sᵢ). If this trace satisfies the task, i.e., L(s_ξ) ⊨ ϕ, the probability that the agent accomplishes the task is expressed as

Pr_M^ξ(ϕ) = Pr_M^ξ( L(s_ξ) ⊨ ϕ | s_ξ ∈ S_ξ )    (8)

where S_ξ is defined as the set of admissible paths generated by policy ξ, starting from the initial state s₀.

Assumption 1

At least one deterministic policy exists such that the agent can accomplish the task with a non-zero probability by following this policy.

Assumption 1 states that the agent can always find a policy from which an induced trace fully satisfies the user-specified LTL task. This assumption is mild and has been widely utilized11,14,37. This paper considers LTL tasks of the form ϕ = ϕ_g ∧ ϕ_safe, where ϕ_g provides a general form of high-level tasks, and ϕ_safe represents the safety requirement. Consequently, we define the problem as follows.

Problem 1

This RL problem considers (1) a user-specified task formulated via LTL as ϕ = ϕ_g ∧ ϕ_safe and (2) an MDP M in which the transition function is unknown. The objective is to find a deterministic policy ξ* that maximizes the probability of task satisfaction, i.e., ξ* = argmax_ξ Pr_M^ξ(ϕ_g), while maintaining safety ϕ_safe during the learning process.

To address Problem 1, Section "Automaton‑based reward design" proposes an automaton-based reward design to guide the agent in learning the optimal policy on the product of MDP and automaton. Then, Section "Safety value functions" develops a safe padding technique by introducing safety value functions to promote safety during the learning process. In addition, a quantum action selection technique is proposed in Section "Quantum action selection" to substitute the ε-greedy algorithm for the balance of exploration and exploitation in learning.

Automaton-based reward design

Previous work15 has shown that directly utilizing LDGBA may fail to find deterministic policies satisfying LTL specifications. This issue can be addressed by designing an E-LDGBA.

Embedded LDGBA

To keep track of unvisited accepting sets in an LDGBA, we introduce a tracking-frontier set T during model checking. In addition, a Boolean variable, B, is employed to indicate the satisfaction of the accepting condition in each round, where one round is defined as all accepting sets having been visited. The tracking-frontier set is initialized as T = F, and B as False. Then, (T, B) = f_V(q, T) is updated synchronously during the learning process as:

f_V(q, T) = { (T ∖ {F_j}, False),  if q ∈ F_j and F_j ∈ T;
              (F ∖ {F_j}, True),   if q ∈ F_j and T = ∅;
              (T, False),          otherwise.    (9)

Definition 2

(E-LDGBA) Considering an LDGBA A = (Q, Σ, δ, q₀, F) and a Boolean variable B, we can define the corresponding E-LDGBA as A̅ = (Q̅, Σ, δ̅, q̅₀, F̅, f_V, T, B), with T = F and B = False initially. The automaton states are augmented, i.e., Q̅ = Q × 2^F with q̅ = (q, T). In the transition function, δ̅ : Q̅ × (Σ ∪ {ϵ}) → 2^Q̅, the finite alphabet Σ is the same as in the LDGBA. In addition, the non-deterministic set Q̅_N, the deterministic set Q̅_D, and the ϵ-transitions from Q̅_N to Q̅_D can be constructed correspondingly. Similarly, the set of accepting sets of the E-LDGBA is F̅ = {F̅₁, F̅₂, …, F̅_f}, where F̅_j = {(q, T) ∈ Q̅ | q ∈ F_j ∧ F_j ∈ T}, j = 1, …, f. In particular, each transition q̅′ = δ̅(q̅, σ̅), where σ̅ ∈ Σ ∪ {ϵ}, shall satisfy the following: (1) the corresponding transition in the LDGBA is valid, i.e., q′ = δ(q, σ̅); and (2) T and B are synchronously updated via (T, B) = f_V(q, T) at each transition, as defined in Eq. (9).

We abuse the tuple structure in Definition 2 because the tracking-frontier set T and the Boolean variable B are synchronously updated after each transition. While the frontier set T is updated, the Boolean variable B indicates whether the current state belongs to at least one accepting set that has not been visited within the current round. This design is essential to guide the agent to visit all accepting sets once per round, over infinitely many rounds, based on user-specified tasks.
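The frontier update f_V can be sketched directly from Equation (9). The automaton states and accepting sets below are illustrative; accepting sets are modeled as frozensets so that the frontier T can contain them as elements.

```python
# Sketch of the tracking-frontier update f_V of Eq. (9) (illustrative data).
# T holds the accepting sets not yet visited this round; F is the full family.

def f_V(q, T, F):
    hit = [Fj for Fj in F if q in Fj]
    for Fj in hit:
        if Fj in T:
            return T - {Fj}, False      # mark F_j as visited this round
    if hit and not T:
        return F - {hit[0]}, True       # round complete: reset the frontier
    return T, False                     # no progress on the frontier

F = {frozenset({"q1"}), frozenset({"q3"})}
T = set(F)
T, B = f_V("q1", T, F)      # visits accepting set {q1}
T, B = f_V("q3", T, F)      # visits {q3}; T becomes empty
T, B = f_V("q3", T, F)      # T empty and q3 accepting: new round, B = True
print(T, B)
```

After the third call the frontier has been reset to F minus the set just visited, and B flags that a round was completed, matching the middle case of Equation (9).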

In the rest of this paper, given an LTL formula ϕ, we utilize A_ϕ and A̅_ϕ to represent an LDGBA and its corresponding E-LDGBA, respectively. Assume that L(A_ϕ) ⊆ Σ^ω is the language accepted by A_ϕ; in other words, this language is the set of all infinite words that satisfy the LTL formula ϕ2. If L(A̅_ϕ) ⊆ Σ^ω is the language accepted by A̅_ϕ, we have the following lemma.

Lemma 1

Given an LTL formula ϕ, we can generate two automata: an LDGBA A_ϕ = (Q, Σ, δ, q₀, F) and its corresponding E-LDGBA A̅_ϕ = (Q̅, Σ, δ̅, q̅₀, F̅, f_V, T, B). It then follows that

L(A̅_ϕ) = L(A_ϕ).    (10)

The proof of Lemma 1 is in our previous work15. Therefore, we can employ E-LDGBA to uphold task satisfaction. In addition, we define a set of sink components in an E-LDGBA as below.

Definition 3

A set of non-accepting sink components of an E-LDGBA, Q̅_sink ⊆ Q̅, is a group of automaton states from which none of the accepting states can be reached.

In addition, after defining the LTL-formulated task as ϕ = ϕ_g ∧ ϕ_safe, we can always find a set of states Q̅_unsafe ⊆ Q̅_sink associated with the violation of ϕ_safe.

Embedded product MDP (EP-MDP)

Definition 4

(EP-MDP) Considering an MDP M = (S, A, p_S, s₀, Π, L) and an LTL-converted E-LDGBA A̅_ϕ = (Q̅, Σ, δ̅, q̅₀, F̅, f_V, T, B), we can construct an embedded product MDP (EP-MDP), P = M × A̅_ϕ = (X, U^P, p^P, x₀, F^P, f_V, T, B), which consists of a set of labeled states X = S × 2^Π × Q̅, with x = (s, l, q̅) = (s, l, (q, T)) ∈ X and l = L(s), and a set of actions U^P = A ∪ {ϵ}. Corresponding to the restriction in the E-LDGBA, ϵ-transitions are valid only from x = (s, l, q̅) with q̅ ∈ Q̅_N to x′ = (s, l, q̅′) with q̅′ ∈ Q̅_D. The transition probability p^P : X × U^P × X → [0, 1] equals (1) p_S(s, a, s′) if δ̅(q̅, l) = q̅′ and u^P = a ∈ A(s); (2) 1 if u^P ∈ {ϵ}, q̅′ ∈ δ̅(q̅, ϵ), and (s′, l′) = (s, l); and (3) 0 otherwise. In addition, x₀ = (s₀, l₀, q̅₀) is the initial state, and F^P = {F₁^P, F₂^P, …, F_f^P} is the set of accepting sets, where F_j^P = {(s, l, q̅) ∈ X | q̅ ∈ F̅_j}, j = 1, …, f. After each transition is completed, T and B are synchronously updated via (T, B) = f_V(q, T) in Equation (9).

It shall be noted that the EP-MDP P captures the interactions between all possible paths over the associated MDP and all words recognized by the corresponding E-LDGBA. Suppose a policy π over P generates an infinite path xπ = x0 … xi xi+1 …. This path is accepting if inf(xπ) ∩ FiP ≠ ∅ for all i = 1, …, f, where inf(xπ) denotes the set of states visited infinitely often along xπ. Such an accepting path xπ induces a policy ξ over M that satisfies the LTL-specified task ϕ. Moreover, P then has at least one accepting maximum end component (AMEC)2, and reaching an AMEC is equivalent to satisfying the task ϕ.
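The on-the-fly product construction implied by Definition 4 can be sketched as follows. Here `mdp_step`, `labeling`, `automaton_step`, and `frontier_update` are hypothetical interfaces standing in for pS, L, δ¯, and fV; ϵ-transitions are omitted for brevity, and this is a sketch rather than the authors' implementation.

```python
def product_step(mdp_step, labeling, automaton_step, frontier_update, x, a):
    """Enumerate EP-MDP successors of product state x under MDP action a.

    mdp_step(s, a)           -> list of (s_next, prob) pairs
    labeling(s)              -> label of an MDP state
    automaton_step(q, label) -> next automaton state under the label
    frontier_update(q, T)    -> (T_next, B), the update of Equation (9)
    x = (s, label, (q, T))   -> current product state
    """
    s, _, (q, T) = x
    successors = []
    for s_next, prob in mdp_step(s, a):
        l_next = labeling(s_next)
        q_next = automaton_step(q, l_next)
        # Synchronously update the tracking frontier after the move.
        T_next, B = frontier_update(q_next, T)
        successors.append(((s_next, l_next, (q_next, frozenset(T_next))), prob, B))
    return successors
```

Because successors are generated lazily from the current state, the full product state space never needs to be stored explicitly.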

As mentioned above, this study decomposes an LTL-formulated task into ϕ = ϕg ∧ ϕsafe, and we can define a set of unsafe states as Xunsafe = {x = (s, l, q¯) ∈ X | s ∈ S, q¯ ∈ Q¯unsafe}. Also, let Prπx(AccP) denote the probability that policy π satisfies the EP-MDP's acceptance condition when starting from state x. Then, the maximum probability of satisfying the acceptance condition of P is defined as Prmaxx(AccP) = maxπ Prπx(AccP). Consequently, Problem 1 can be rephrased as below.

Problem 2

Consider an MDP M and an LTL task ϕ with a corresponding EP-MDP P whose transition probabilities are unknown. The objective is to find a policy π that satisfies the EP-MDP's acceptance condition with maximum probability, i.e., Prπx(AccP) = Prmaxx(AccP), while avoiding Xunsafe during the learning process.

A base reward design is discussed in the following subsection. Such a design enables the RL agent to find the optimal policy achieving the maximum probability of task satisfaction. Then, we further improve the reward density via reward shaping with a potential function.

Base reward

In an EP-MDP P, all accepting states can be grouped into the set FUP = {x ∈ X | x ∈ FiP for some i ∈ {1, …, f}}. Given the reward function Λ(s, a, s′) defined over the MDP M, after each transition (x, uP, x′) in the corresponding EP-MDP, the agent receives a reward: (1) R(x, uP, x′) = Λ(s, a, s′) if q¯′ = δ¯(q¯, l) and uP = a ∈ A(s); and (2) R(x, uP, x′) = 0 otherwise. The discount factor in the EP-MDP can be defined similarly. In this study, we consider both functions depending on the state only. Consequently, inspired by14, the reward and discount factor functions are designed as

R(x) = 1 − rF if x ∈ FUP, and 0 otherwise, (11)

γ(x) = rF if x ∈ FUP, and γF otherwise, (12)

where γF is the discount factor applied when the agent does not reach an accepting state after a valid transition, and rF is the discount factor applied otherwise. In addition, rF is a function of γF satisfying limγF→1⁻ rF(γF) = 1 and limγF→1⁻ (1 − γF)/(1 − rF(γF)) = 0. Based on the proof in14, a state value then represents the probability that the agent accomplishes the specified task starting from that state.
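A minimal sketch of the base reward and state-dependent discount in Equations (11) and (12), assuming the accepting product states are collected in a Python set (the function names are ours):

```python
def base_reward(x, accepting_states, r_f):
    """Base reward R(x) from Equation (11): 1 - r_f on accepting
    product states, 0 elsewhere."""
    return 1.0 - r_f if x in accepting_states else 0.0

def discount(x, accepting_states, r_f, gamma_f):
    """State-dependent discount gamma(x) from Equation (12): r_f on
    accepting product states, gamma_f elsewhere."""
    return r_f if x in accepting_states else gamma_f
```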

Reward shaping

The reward function designed above is always zero for any transition between states x ∉ FUP. For instance, given an LDGBA Aϕ as shown in Fig. 3, we can obtain its corresponding E-LDGBA A¯ϕ and construct the EP-MDP P = M × A¯ϕ. It can then be observed that any transition between product states x = (s, l, (q1, T)) and x′ = (s′, l′, (q2, T)) yields a zero reward for any MDP states s, s′ ∈ S. We therefore propose a potential function Φ(x): X → ℝ to increase the reward density and redesign the reward function as

R(x, uP, x′) = R(x) + γ(x)·Φ(x′) − Φ(x) (13)

Given an EP-MDP P = M × A¯ϕ = (X, UP, pP, x0, FP, fV, T, B), there exists a set of accepting automaton states Q¯F = {q¯ ∈ Q¯ | q¯ ∈ F¯i for some i ∈ {1, …, f}}. If the agent visits a product state x = (s, l, (q1, T)) = (s, l, q¯) whose associated automaton state q¯ belongs to Q¯ \ (Q¯F ∪ {q¯0} ∪ Q¯sink) and has not been visited before, the agent receives a positive reward. This modification strengthens the guidance toward task satisfaction during learning because, by model checking, every automaton state in Q¯ \ (Q¯F ∪ {q¯0} ∪ Q¯sink), i.e., every state that can reach an accepting set, has to be explored starting from the initial automaton state q¯0.

We design another tracking-frontier set TΦ to track the unvisited automaton states in TΦ0 = Q¯ \ (Q¯F ∪ {q¯0} ∪ Q¯sink). It shall be noted that TΦ differs from T, defined in Section "Embedded LDGBA", which tracks unvisited accepting sets. Initially, TΦ is set to TΦ0. Then, it is updated as below at each transition from (s, l, q¯) to (s′, l′, q¯′) after taking action uP.

fΦ(q¯′, TΦ) = TΦ \ {q¯′} if q¯′ ∈ TΦ; TΦ0 \ {q¯′} if B = True; and TΦ otherwise. (14)

After B becomes True in fV, i.e., the agent has visited all accepting sets, TΦ is reset to TΦ0. Consequently, the potential function Φ(x) at x = (s, l, q¯) is proposed as:

Φ(x) = 1 − rF if q¯ ∈ TΦ, and 0 otherwise, (15)

where rF is the discount parameter defined in Section "Base reward". According to Equation (15), the potential function equals 1 − rF for unvisited automaton states and 0 for visited ones. This design enhances the efficiency of exploration. Based on the shaped reward in Equation (13), the expected return that an agent collects from state x ∈ X under policy π can be expressed as

Uπ(x) = Eπ[ ∑i=0∞ ( ∏j=0i−1 γ(xj) ) · R(xi, uP, xi+1) | x0 = x ]. (16)
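The frontier update in Equation (14), the potential in Equation (15), and the shaped reward in Equation (13) can be sketched as follows; the function and variable names are ours, not the authors', and the case ordering in the frontier update follows Equation (14).

```python
def update_phi_frontier(q_next, T_phi, T_phi0, B):
    """Tracking-frontier update f_Phi from Equation (14)."""
    if q_next in T_phi:
        return T_phi - {q_next}
    if B:                           # all accepting sets visited: reset
        return T_phi0 - {q_next}
    return T_phi

def potential(x, T_phi, r_f):
    """Potential Phi(x) from Equation (15): 1 - r_f when the automaton
    component of x is still in the frontier, 0 otherwise."""
    _, _, (q, _) = x
    return 1.0 - r_f if q in T_phi else 0.0

def shaped_reward(R_x, gamma_x, phi_next, phi_cur):
    """Shaped reward from Equation (13)."""
    return R_x + gamma_x * phi_next - phi_cur
```

Because the shaping term is potential-based, it changes only the reward density, not which policy is optimal.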

Theorem 1

Given an EP-MDP P = M × A¯ϕ, by selecting γF → 1⁻ and applying the shaped reward function (13), the optimal policy π maximizes the expected return in Equation (16). This policy also maximizes the probability of task satisfaction, i.e., Prπx(AccP) = Prmaxx(AccP).

Proof

First, Theorems 1-3 of15 verified that, by applying the base reward design in Equations (11) and (12), optimizing the expected return guarantees that the optimal policy satisfies the specified LTL task with the maximum probability. Then, the work of38 showed that such an optimal policy remains invariant under the potential-based shaped reward in Equation (13).

In brief, we proposed a reward-shaping scheme to overcome the sparse-reward issue in an EP-MDP while guaranteeing that the agent learns the optimal policy maximizing the probability of task satisfaction.

Safety value functions

State and action safety values

Given an MDP M = (S, A, pS, s0, Π, L), suppose the agent is located at the current state s ∈ S. We assume that the agent can only observe the label of its current state but records the safety status of the states it has visited as

us(s) = 0 if L(s) ∩ Lus ≠ ∅, and 1 otherwise, (17)

where Lus ⊆ Π is the set of unsafe labels, a subset of the set of atomic propositions Π.

The transition probability pS is assumed to be unknown to the agent. However, the agent can estimate the transition dynamics from its observation history. Specifically, the agent records the number of times it executes action a at state s, i.e., N(s, a), and the number of times it reaches the next state s′ after taking a at s, i.e., N(s, a, s′). Consequently, the agent's current belief about its transition dynamics can be expressed below via the Maximum Likelihood Estimate (MLE)39 of the mean transition probability function.

p~S(s, a, s′) = N(s, a, s′) / N(s, a) (18)
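A minimal sketch of the count-based MLE in Equation (18); the class name and interface are ours:

```python
from collections import defaultdict

class TransitionEstimator:
    """Maximum-likelihood estimate of transition probabilities
    (Equation (18)) from observed transitions."""

    def __init__(self):
        self.n_sa = defaultdict(int)    # N(s, a)
        self.n_sas = defaultdict(int)   # N(s, a, s')

    def observe(self, s, a, s_next):
        """Record one observed transition (s, a) -> s_next."""
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1

    def prob(self, s, a, s_next):
        """p~S(s, a, s') = N(s, a, s') / N(s, a); 0 if (s, a) is unseen."""
        n = self.n_sa[(s, a)]
        return self.n_sas[(s, a, s_next)] / n if n else 0.0
```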

A state safety value function Vs(s) is introduced here to represent the minimum probability of the agent moving to a safe state after taking an action. It can be estimated via the Bellman update17,40:

Vs(s) = mina∈A(s) ∑s′ p~S(s, a, s′)·us(s′). (19)

Then, an action safety value function, representing the maximum probability of the agent staying safe after taking action a at state s, is defined as

Qs(s, a) = ∑s′ p~S(s, a, s′)·Vs(s′). (20)
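Equations (19) and (20) can be sketched as follows for a discrete state space. Here `p_tilde` is the estimated transition function of Equation (18), while `u_s` and `V_s` are dictionaries of safety labels and state safety values; these interfaces are our illustrative assumptions.

```python
def state_safety_value(s, actions, states, p_tilde, u_s):
    """V_s(s) from Equation (19): the minimum over available actions of
    the expected safety label of the successor state."""
    return min(
        sum(p_tilde(s, a, s2) * u_s[s2] for s2 in states)
        for a in actions(s)
    )

def action_safety_value(s, a, states, p_tilde, V_s):
    """Q_s(s, a) from Equation (20): the expected state safety value of
    the successor under the estimated dynamics."""
    return sum(p_tilde(s, a, s2) * V_s[s2] for s2 in states)
```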

When environment uncertainty is considered, a labeling probability function exists such that a state may be labeled "unsafe" with some probability. Consequently, Equation (17) must be rewritten as below to update the state's safety status at every visit.

us(s) ← us(s) − us(s)/(Ns(s) + 1) if L(s) ∩ Lus ≠ ∅, and us(s) ← us(s) + (1.0 − us(s))/(Ns(s) + 1) otherwise, (21)

where Ns(s) is the total number of times the agent has visited state s; it is updated afterward as Ns(s) ← Ns(s) + 1.
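The update in Equation (21) is a standard incremental average toward 0 (unsafe) or 1 (safe); a sketch with hypothetical names:

```python
def update_safety_label(u, n_visits, observed_unsafe):
    """Running-average update of u_s(s) from Equation (21) under
    probabilistic labels. Returns (new_u, new_n_visits)."""
    target = 0.0 if observed_unsafe else 1.0
    u = u + (target - u) / (n_visits + 1)   # incremental mean of observations
    return u, n_visits + 1
```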

In addition, it shall be noted that the safety value functions can be extended to a continuous state space by revising Equations (19) and (20) as

Vs(s) = mina∈A(s) (1/|Ω|) ∫Ω P¯S(s, a, s′) us(s′) ds′ (22)

and

Qs(s, a) = (1/|Ω|) ∫Ω P¯S(s, a, s′) Vs(s′) ds′ (23)

where P¯S(s, a, s′) can be predicted by an artificial neural network in addition to the Q-networks if deep Q-learning41 is used. Similar to the Q-networks, this transition probability network can be trained and updated from a collection of experiences. In this paper, we consider discrete state spaces only.

Safe reinforcement learning

Similar to Equation (6), in Q-learning33 on an EP-MDP P = (X, UP, pP, x0, FP, fV, T, B), the Q values are updated as below after the agent takes action uP and moves from state x to state x′.

Q(x, uP) ← (1 − α)Q(x, uP) + α[R(x, uP, x′) + γ(x)·maxu¯P∈UP(x′) Q(x′, u¯P)] (24)
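A sketch of the update in Equation (24) with the state-dependent discount γ(x); the dictionary-based Q-table is our illustrative choice, not the authors' implementation.

```python
def q_update(Q, x, u, r, x_next, actions_next, alpha, gamma_x):
    """One Q-learning update on the EP-MDP (Equation (24)).
    Q is a dict keyed by (state, action); gamma_x is the state-dependent
    discount gamma(x) from Equation (12)."""
    best_next = max(Q.get((x_next, u2), 0.0) for u2 in actions_next)
    Q[(x, u)] = (1 - alpha) * Q.get((x, u), 0.0) + alpha * (r + gamma_x * best_next)
    return Q
```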

The action safety value function in (20) can be incorporated into the Q-value for safe learning. It shall be noted that ϵ-transitions do not influence the generated policy. Therefore, the action selection rule during learning in an EP-MDP P is proposed as:

  • if ϵ ∈ UP(x), where x = (s, l, (q, T)), select action uP = ϵ.

  • if UP(x) = A(s), select action uP(x) ∈ A(s) based on the ε-greedy algorithm as
    uP(x) = argmaxa∈A(s) [Q(x, a) + β·Qs(s, a)] with probability 1 − ε, or any action a ∈ A(s) with probability ε, (25)
    where β ∈ [0, 1] is a bias parameter that weights the importance of action safety when selecting actions.

According to the reward function employed in Equation (11), an action value Q(s, a) represents the probability of the agent accomplishing the specified task by taking action a at state s. Since both terms represent probabilities, it is natural to combine the action value function and the action safety value function linearly for decision-making in Equation (25). For other definitions of reward functions, Equation (25) might need to be revised17.
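The safety-biased ε-greedy rule in Equation (25) can be sketched as follows (dictionary-based Q and Q_s tables are our illustrative assumption):

```python
import random

def select_action(x, s, actions, Q, Q_s, beta, eps, rng=random):
    """Safety-biased epsilon-greedy selection (Equation (25)): with
    probability 1 - eps pick the action maximizing Q(x, a) + beta * Q_s(s, a),
    otherwise pick a uniformly random action."""
    if rng.random() < eps:
        return rng.choice(list(actions))
    return max(actions,
               key=lambda a: Q.get((x, a), 0.0) + beta * Q_s.get((s, a), 0.0))
```

Setting beta = 0 recovers plain ε-greedy over the task Q-values, while larger beta biases exploration toward actions believed to be safe.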

The safe RL algorithm is described in Algorithm 1, where the product states are generated on the fly based on Definition 4. It shall be noted that the action safety values in Equation (20) can either be globally evaluated at the beginning of each learning episode (using the collected data as prior knowledge), as shown in Algorithm 1, or locally updated at the current state and its neighboring states at each step. Our simulations showed that either approach dramatically reduces the number of times the agent encounters unsafe states. (Algorithm 1 appears as a figure in the original publication.)

Quantum action selection

Grover’s algorithm

Grover’s algorithm42, sometimes referred to as Grover Search, quantum search, or quantum database search, is a quantum algorithm for searching through a non-ordered list. It can be used to invert a function, to which there are many possible solutions, but only one is correct. Grover’s algorithm can find the solution out of the pack much faster than any classical algorithm. Searching through the database with N items using a classical approach requires O(N) evaluations (on average N/2 evaluations). However, Grover’s algorithm can complete such a search using only O(N) evaluations.

The search problem considered here consists of three steps: marking the desired basis state within the 2n-dimensional space represented by an n-qubit quantum state, applying Grover's algorithm to the quantum state, and finding the desired basis state with a single measurement. First, we prepare an n-qubit quantum state, initialized as |q1q2…qn⟩ = |00…0⟩, by applying an H gate to each qubit,

|ψ⟩ = H⊗n|00…0⟩ = (1/√(2n)) ∑k=02n−1 |k⟩, (26)

which results in an equal superposition of all 2n basis states.

We use an 'oracle' operator Ud to mark the desired basis state, e.g., |d⟩, in the above superposition |ψ⟩, as described in43. With the assistance of ancilla qubits, this operator utilizes X gates and an nth-order CNOT gate to flip the sign of the desired state. It shall be noted that the nth-order CNOT gate uses n qubits as the control qubits with an ancilla qubit as the target. Consequently, the n-qubit quantum state becomes

|ϕ⟩ = Ud|ψ⟩ = (1/√(2n)) ∑k≠d |k⟩ − (1/√(2n))|d⟩. (27)

Next, a Grover diffusion operation is conducted to amplify the amplitude of the desired basis state. Mathematically, the Grover diffusion operation is equivalent to a reflection about the average amplitude of all 2n basis states, and it can be expressed as the Grover diffusion operator43, Us = 2|ψ⟩⟨ψ| − I, acting on the n-qubit state |ϕ⟩ in Equation (27). Since |ψ⟩⟨ψ| alone is not a unitary operator, it is not directly physically realizable; however, it has been demonstrated that H gates and the oracle operator can be used to realize the Grover diffusion operator43. Theoretically, the optimal number of Grover iterations, each consisting of an oracle and a diffusion operation, is r ≈ π/(4θ) − 1/2 (rounded to the nearest integer), where sin θ = 1/√N and N = 2n. When N ≫ 1, it can be approximated as π√N/4.

The last step is to measure the resulting n-qubit quantum state and find the desired basis state. For a two-qubit system, after one Grover iteration, the quantum state becomes

UsUdH⊗2|00⟩ = −|d⟩, (28)

so the desired basis state |d⟩ is theoretically found with certainty. For a three-qubit system, after two Grover iterations, the quantum state becomes

(UsUd)2H⊗3|000⟩ = −0.08839 ∑k≠d |k⟩ + 0.97227|d⟩. (29)

Therefore, there is a 5.5% probability of obtaining a basis state other than |d⟩.
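The amplitudes in Equation (29) can be reproduced with a small state-vector simulation, sketched below (our code, not the authors'; the result may differ from Equation (28) by a global phase, which depends on the oracle convention):

```python
import numpy as np

def grover_amplitudes(n_qubits, marked, iterations):
    """State-vector simulation of Grover's algorithm: start from the
    uniform superposition |psi>, then alternate the oracle (a sign flip
    on the marked basis state) with the diffusion U_s = 2|psi><psi| - I."""
    N = 2 ** n_qubits
    psi = np.full(N, 1.0 / np.sqrt(N))         # H gates applied to |0...0>
    state = psi.copy()
    for _ in range(iterations):
        state[marked] *= -1.0                      # oracle U_d
        state = 2.0 * (psi @ state) * psi - state  # diffusion U_s
    return state
```

For n_qubits = 3, marked = 0, and two iterations, the returned amplitudes match Equation (29): about 0.97227 on the marked state and −0.08839 elsewhere.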

Quantum action selection

Here we propose a quantum action selection technique that replaces ε-greedy action selection from the quantum computing point of view. According to the number of available actions, M, n qubits are utilized to form a superposition of 2n basis states representing the discrete actions of an MDP problem. M and n are related by the following inequality:

M ≤ 2n < 2M. (30)

Take the example of a mobile robot moving in a grid world: if the robot can move "up", "down", "left", and "right", those actions are represented by the two-qubit basis states |00⟩, |10⟩, |01⟩, and |11⟩, respectively. After applying the H gate on each qubit of the initial quantum state |00⟩ as in Equation (26), we obtain the equal superposition (1/2)(|00⟩ + |10⟩ + |01⟩ + |11⟩), meaning that each action has the same probability of being selected if we measure this two-qubit state once. To select the best action, Grover's algorithm must be applied.

Based on the combination of action values and action safety values in Equation (25), the basis state corresponding to the action with the highest combined value is marked by the oracle operation in Equation (27). Then, the amplitude of this basis state is amplified via the Grover diffusion operation, so the probability of selecting the best action when measuring the quantum state is higher than that of the others. However, as demonstrated in the previous subsection, one Grover iteration on the equal superposition state (1/2)(|00⟩ + |10⟩ + |01⟩ + |11⟩) reduces it to the marked basis state with an amplitude of −1. In other words, only the best action is selected when measuring this quantum state, as shown in Equation (28), which is equivalent to greedy action selection.

In another case, a three-qubit system is needed if the robot can perform eight actions. After applying the quantum action selection technique described above, the marked action always has a high probability, 94.5%, of being selected. Consequently, the balance of exploration and exploitation is not achieved, especially at the beginning of learning.
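The overall selection step can be sketched as follows: encode the M actions in n qubits per Equation (30), mark the basis state of the highest-valued action, apply Grover iterations, and sample an action from the squared amplitudes. This is our classical state-vector emulation of the idea, not a circuit-level implementation.

```python
import math
import numpy as np

def quantum_select(values, rng, iterations=1):
    """Select an action index by emulating quantum action selection.

    values: combined scores Q(x, a) + beta * Q_s(s, a) per Equation (25)
    rng:    a numpy random Generator used for the simulated measurement
    """
    M = len(values)
    n = max(1, math.ceil(math.log2(M)))        # qubits per Equation (30)
    N = 2 ** n
    psi = np.full(N, 1.0 / np.sqrt(N))         # uniform superposition
    state = psi.copy()
    marked = int(np.argmax(values))            # best action's basis state
    for _ in range(iterations):
        state[marked] *= -1.0                      # oracle
        state = 2.0 * (psi @ state) * psi - state  # diffusion
    probs = state ** 2
    probs = probs[:M] / probs[:M].sum()        # drop padding basis states
    return int(rng.choice(M, p=probs))
```

With four actions and one iteration, the distribution collapses onto the marked action, reproducing the greedy behavior described above; noise must then be added to restore exploration.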

It shall be noted that the above discussions are theoretical and were only validated on quantum simulators running on classical computers. Real quantum computers suffer from decoherence and noise, so the measured probabilities deviate from the theoretical ones. Continuing the above example of a two-qubit system, if the best action is "up", the basis state |00⟩ is marked. After one Grover iteration, the resulting quantum action state is −|00⟩, which is then measured with 1024 shots. We executed the simulation in three different ways: (1) using the quantum simulator of qiskit44, an open-source software development kit (SDK), on a classical computer; (2) using ibmq_Belem (an IBM 5-qubit quantum machine) as the backend from a classical computer; and (3) using ibmq_Belem through the IBM Quantum Composer online. The calculated probabilities are shown in Fig. 4.

Figure 4.

Figure 4

The measurement probability of −|00⟩ obtained by using (1) a quantum simulator on a classical computer, (2) a quantum machine as the backend from a classical computer, and (3) a quantum machine directly.

It can be seen that the quantum simulator (on a classical computer) does not suffer from the various error sources, because those errors can be corrected with a small amount of extra storage and logic on the classical computer. Larger errors are induced when using a quantum machine as the backend than when directly using a quantum machine. Although some quantum error correction algorithms exist, they consume many qubits, leaving only a few for the actual computation. Therefore, due to the limited number of qubits on existing quantum computers, no hardware platform can currently perform robust error correction for large-scale quantum computation. As mentioned in45, quantum technologists are putting their efforts into more accurate quantum gates and, eventually, fully error-tolerant quantum computing.

However, our quantum action selection method takes advantage of gate noise and measurement errors to balance exploration and exploitation when using the quantum simulator on a classical computer. We employ the noise model from qiskit to implement depolarizing and measurement noise. The first noise model introduces imperfections in quantum gate operations, replacing the state of any qubit with a completely random state with probability pgate; our method applies this noise model to the H, X, and CNOT gates. The second noise model flips between |0⟩ and |1⟩ immediately before measurement with probability pmeas. It shall be noted that pgate and pmeas are heuristic parameters; for the 2-qubit quantum states in our simulations, we recommend initializing pgate = pmeas = 0.1 and then reducing both to 0.01. The action selection probabilities corresponding to the example in Fig. 4 at different noise levels are shown in Fig. 5.
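The qualitative effect of the two noise sources on the selection probabilities can be illustrated with a toy classical model; this is our simplified stand-in, not qiskit's actual noise channels.

```python
import numpy as np

def noisy_selection_probs(ideal_probs, p_gate, p_meas, n_qubits):
    """Toy model of how gate and measurement noise spread probability
    mass over all basis states. Depolarizing noise mixes the ideal
    distribution with the uniform one; measurement noise flips each
    qubit independently with probability p_meas."""
    N = len(ideal_probs)
    probs = np.asarray(ideal_probs, dtype=float)
    # Depolarizing: with probability p_gate the state is replaced by a
    # completely random one.
    probs = (1 - p_gate) * probs + p_gate / N
    # Measurement bit flips: mass moves between outcomes differing in k
    # bits with weight p_meas^k * (1 - p_meas)^(n - k).
    noisy = np.zeros(N)
    for i in range(N):
        for j in range(N):
            k = bin(i ^ j).count("1")
            noisy[j] += probs[i] * (p_meas ** k) * ((1 - p_meas) ** (n_qubits - k))
    return noisy
```

For a two-qubit state ideally collapsed onto one action, p_meas = 0.1 alone already leaves the marked action with probability 0.81, with the remaining mass spread over the other actions, which is how the noise restores exploration.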

Figure 5.

Figure 5

The measurement probability with three different noise levels.

The flowchart of the quantum action selection method is summarized in Algorithm 2. We implement the developed quantum action selection technique in our safe RL method (Algorithm 1), resulting in a Quantum Safe Q-learning (QSQ-learning) method. Algorithm 2 can also be implemented in any other RL algorithm to replace ε-greedy action selection. (Algorithm 2 appears as a figure in the original publication.)

Simulations and discussions

We apply the developed QSQ-learning algorithm to motion planning in two grid-world examples. The noise in the quantum simulator is set to 0.1 and then slowly decreased to 0.01 for action selection. The reward calculation uses rF = 0.99 and γF = 0.99999 in Equations (11) and (12).

Motion planning with safe absorbing states

Figure 6 shows a grid world with safe absorbing states13. A mobile robot can take four actions: left, up, right, and down. The robot receives a reward if it reaches states labeled a or b. Also, three states, marked with circles, are safe absorbing states: once the robot reaches one of them, it stays in this state no matter which action is taken. States labeled c are unsafe, and "Obs" represents an obstacle. In this example, the objective of motion planning is to find a policy such that the robot eventually always visits a safe absorbing state while avoiding unsafe states. This task can be specified as the following LTL formula.

φ1 = ◇□(a ∨ b) ∧ □¬c (31)

It shall be noted that state (1, 2) (labeled a) is not a safe absorbing state. Therefore, although the robot receives a reward for visiting this state, it moves away after taking any action. We first consider deterministic actions, i.e., there is only a single next state when the robot takes an action at the current state. We conducted five simulations with safety value functions and another five without, and obtained the same optimal policies. Figure 7 illustrates the state values and an optimal policy. It can be seen that every state, except the unsafe and obstacle states, guarantees a 100% probability of satisfying the task if the robot starts from that state and follows the optimal policy. It is worth mentioning that there is more than one optimal policy for motion planning in this example.

Figure 6.

Figure 6

The grid-world with safe absorbing states.

Figure 7.

Figure 7

The schemes of (a) state values and (b) an optimal policy.

We conduct the simulations with and without safety value functions for comparison. Each simulation takes 10,000 episodes with 200 steps per episode. We investigate the number of times the robot visits unsafe states while learning the optimal policies. It is found that implementing safety value functions dramatically reduces the number of visits to unsafe states. Specifically, without safety value functions, the robot visits unsafe states an average of 950 times, whereas it visits unsafe states only 177 times on average when both safety values and Q-values are considered in decision-making.

The state safety values and the maximum action safety values are shown in Fig. 8. A state safety value represents the minimum probability of the agent transitioning from the current state to a safe state. For example, the robot can move from state (0, 1) to the unsafe state (0, 2) by taking action right, so the state safety value at state (0, 1) is 0. However, by taking action down, the robot moves from state (0, 1) to the safe state (1, 1). Consequently, the maximum action safety value at state (0, 1) is 1.00, as shown in Fig. 8(b).

Figure 8.

Figure 8

The results of (a) state safety values and (b) the maximum action safety values.

We also consider a scenario with action uncertainties due to actuator malfunction, so the considered MDP is stochastic. After an action is selected, the robot moves in the desired direction with a probability of 80% and to each side direction with a probability of 10%. Figure 9 illustrates the estimated state values and state safety values after one simulation of the developed QSQ-learning. Theoretically, the state values, i.e., the maximal task-satisfaction probabilities, shall be 1.0, 0.9, or 0.8 at all non-unsafe states due to the action uncertainties. For example, the robot can take action right at state (1, 2) to reach the safe absorbing state (1, 3), so the probability of fulfilling the task requirement is 80%. The estimated state values in Fig. 9a from our simulation agree with the theoretical predictions.

Figure 9.

Figure 9

The estimated (a) state values and (b) state safety values for stochastic actions.

On the other hand, the state safety value at state (1, 2) in Fig. 9b is expected to be 0.2 because the robot has a minimum probability of 20% of reaching a safe state after taking action up or down (the estimated safety value at this state is 0.17). One optimal policy obtained is shown in Fig. 10. The state values indicate that at least one path generated from the optimal policy allows the robot to accomplish the task. We investigate the evolution of the number of times the robot visits unsafe states during learning and compare the results with those from Q-learning without safety value functions, as shown in Fig. 11. It can be seen that implementing safety value functions reduces the number of visits to unsafe states by up to 87%.

Figure 10.

Figure 10

The generated optimal policy.

Figure 11.

Figure 11

The evolution of visiting unsafe states.

Unlike a previous work17, in which the agent was able to observe the safety statuses of its neighboring states, here it is assumed that the agent can observe only the current state's safety status. Therefore, the agent has to visit some unsafe states during exploration, especially at the beginning of learning, to gather enough information for calculating the safety value functions. Nevertheless, considering safety in action selection significantly reduces the number of visits to unsafe states compared to conventional RL methods.

Slippery grid-world

In this example, a robot moves in a 10×10 grid world, shown in Fig. 12, where the robot can "slip" to an adjacent state with a probability of 15% when taking an action. The robot starts from the initial state (marked "S" in Fig. 12) and tries to visit goal1 (cyan) and then goal2 (yellow) infinitely often while avoiding unsafe states (red). The task can be specified as the following LTL formula.

φ2 = □◇(goal1 U goal2) ∧ □¬unsafe (32)

We conducted 10 simulations each for Q-learning (without safety value functions) and the developed QSQ-learning. Each simulation consists of 1000 episodes with 4000 steps per episode. Two trajectories induced from the optimal policies learned via Q-learning, for the first round of visiting goal1 and then goal2, are illustrated in Fig. 12. In addition, Fig. 13 shows two trajectories generated from the optimal policies learned via QSQ-learning. For each learning method, we record the maximum and minimum numbers of times the robot visits unsafe states during the simulations and list them in Table 1. It can be seen that the robot visits unsafe states far fewer times when safety value functions are implemented in learning.

Figure 12.

Figure 12

The generated trajectories from the optimal policy learned via Q-learning without safety value functions.

Figure 13.

Figure 13

The generated trajectories from the optimal policy learned via QSQ-learning.

Table 1.

The number of times the robot visits unsafe states during learning.

                                   Maximum times   Minimum times
Q-learning without safety values   160,937         157,192
QSQ-learning                       2,294           962

Conclusions and future work

This paper presents a safe reinforcement learning method for finding RL control policies that satisfy LTL specifications over finite and infinite horizons. The developed reward-shaping process improves the reward density and guides LTL satisfaction with maximum probability, while the safety value functions exploit the properties of the E-LDGBA and maintain safe exploration without affecting the original probabilistic guarantees. In addition, the quantum action selection technique provides an alternative approach to balancing exploration and exploitation during RL while taking advantage of quantum computing.

The state labels, including safety labels, are atomic propositions. In real-world problems, the agent can acquire the labels based on the collected information via perception sensors and its prior knowledge base. The safety value function proposed in this paper represents the probability of safely reaching the next state after taking a selected action. Such a concept can be extended to calculate the expected probability of reaching safe states after taking action and following the current policy within a finite or infinite horizon. In addition, the idea of calculating safety value functions for a continuous state space, proposed as Equations (22) and (23), will be refined in future studies.

Quantum computing is powerful because the quantum algorithms, such as Grover’s search algorithm, have lower computational complexities than their classical equivalents. However, current quantum computers are relatively small (up to 433 qubits in the largest quantum computer built by IBM) and noisy (not fault-tolerant). Indeed, we are in the era of Noisy Intermediate-Scale Quantum (NISQ)45. Consequently, Variational Quantum-Classical (VQC) algorithms have become popular in deploying quantum algorithms on near-term quantum devices. In VQC algorithms, classical computers perform the overall computation task on information they acquire from running calculations on a quantum computer. Our quantum safe RL approach adopts the same strategy: the learning process is conducted on a classical computer while the action is selected via quantum computing. The proposed quantum action selection algorithm results in parameterized quantum circuits, which are relatively small, short-lived, and thus suitable for NISQ computers. Although quantum action selection is performed on the quantum simulator in our simulations, it can be run on a quantum machine as the backend (see Fig. 4).

Although quantum algorithms theoretically promise substantial speedups over their classical counterparts, we do not expect faster computation for our simulation examples, even using a quantum machine for action selection, because the action spaces are small. However, we expect our method to take advantage of the fundamental properties of quantum physics, especially superposition, when handling problems with large or continuous action spaces. For example, our previous work1 studied motion planning for Mars exploration in a large-scale continuous environment. Our method could solve the same problems by applying a fine-scale discretization to the action space, since an n-qubit system can represent 2n actions. The method can also be extended to the discretization of a continuous state space with quantum states.

In addition, the proposed quantum action selection method can also be applied to policy-based RL methods, in which actions are predicted with various probabilities. In this case, the basis state corresponding to the action with the highest predicted probability would be marked. Considering the above-mentioned motivations and challenges, further research is needed to extend this method to problems with continuous state and action spaces.

Author contributions

All authors contributed to the study conception and design. Problem definitions, theoretical proofs, and simulations were performed by M.C. and S.X. The first draft of the manuscript was written by M.C. and S.X. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript and agreed to its publication.

Funding Information

Xiao would like to thank the US Department of Education (ED#P116S210005) for supporting this research.

Data availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Cai M, Hasanbeig M, Xiao S, Abate A, Kan Z. Modular deep reinforcement learning for continuous motion planning with temporal logic. IEEE Robot. Autom. Lett. 2021;6(4):7973–7980. doi: 10.1109/LRA.2021.3101544. [DOI] [Google Scholar]
  • 2.Baier C, Katoen J-P. Principles of model checking. Cambridge: The MIT Press; 2008. [Google Scholar]
  • 3.Guo M, Zavlanos MM. Probabilistic motion planning under temporal tasks and soft constraints. IEEE Trans. Autom. Control. 2018;63(12):4051–4066. doi: 10.1109/TAC.2018.2799561. [DOI] [Google Scholar]
  • 4.Cai M, Li Z, Gao H, Xiao S, Kan Z. Optimal probabilistic motion planning with potential infeasible LTL constraints. IEEE Trans. Autom. Control. 2023;68(1):301–316. doi: 10.1109/TAC.2021.3138704. [DOI] [Google Scholar]
  • 5.Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cambridge: The MIT Press; 2018. [Google Scholar]
  • 6.Garcıa J, Fernández F. A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 2015;16(1):1437–1480. [Google Scholar]
  • 7.Moldovan, T.M., & Abbeel, P. Safe exploration in markov decision processes. arXiv preprint arXiv:1205.4810 (2012).
  • 8.Alshiekh, M., Bloem, R., Ehlers, R., Könighofer, B., Niekum, S., & Topcu, U. Safe reinforcement learning via shielding. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018).
  • 9.Cheng, R., Orosz, G., Murray, R.M., & Burdick, J.W. End-to-end safe reinforcement learning through barrier functions for safety-critical continuous control tasks. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 3387–3395 (2019).
  • 10.Wen M, Topcu U. Constrained cross-entropy method for safe reinforcement learning. IEEE Trans. Autom. Control. 2021;66(7):3123–3127. doi: 10.1109/TAC.2020.3015931. [DOI] [Google Scholar]
  • 11.Hahn, E.M., Perez, M., Schewe, S., Somenzi, F., Trivedi, A., & Wojtczak, D. Omega-regular objectives in model-free reinforcement learning. In: Int. Conf. Tools Alg. Constr. Anal. Syst., pp. 395–412 (2019). Springer
  • 12.Cai M, Peng H, Li Z, Kan Z. Learning-based probabilistic LTL motion planning with environment and motion uncertainties. IEEE Trans. Autom. Control. 2021;66(5):2386–2392. doi: 10.1109/TAC.2020.3006967. [DOI] [Google Scholar]
  • 13.Hasanbeig, M., Kantaros, Y., Abate, A., Kroening, D., Pappas, G.J., Lee, I.: Reinforcement learning for temporal logic control synthesis with probabilistic satisfaction guarantees. In: Proc. IEEE Conf. Decis. Control, pp. 5338–5343 (2019). IEEE
  • 14.Bozkurt, A.K., Wang, Y., Zavlanos, M.M., & Pajic, M. Control synthesis from linear temporal logic specifications using model-free reinforcement learning. In: Int. Conf. Robot. Autom., pp. 10349–10355 (2020). IEEE.
  • 15.Cai, M., Xiao, S., Li, B., Li, Z., & Kan, Z. Reinforcement learning based temporal logic control with maximum probabilistic satisfaction. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 806–812 (2021). 10.1109/ICRA48506.2021.9561903.
  • 16.Li, X., Serlin, Z., Yang, G. & Belta, C. A formal methods approach to interpretable reinforcement learning for robotic planning. Sci. Robot.4(37), (2019). [DOI] [PubMed]
  • 17.Hasanbeig, M., Abate, A., & Kroening, D. Cautious reinforcement learning with logical constraints. AAMAS’20: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, 483–491 (2020).
  • 18.Sickert, S., Esparza, J., Jaax, S., & Křetínskỳ, J. Limit-deterministic Büchi automata for linear temporal logic. In: Int. Conf. Comput. Aided Verif., pp. 312–332 (2016). Springer.
  • 19.Nielsen, M.A., & Chuang, I.L. Quantum Computation and Quantum Information, 10th edn. Cambridge University Press, New York (2010). 10.1017/CBO9780511976667.
  • 20.Biamonte J, Wittek P, Pancotti N, Rebentrost P, Wiebe N, Lloyd S. Quantum machine learning. Nature. 2017;549(7671):195–202. doi: 10.1038/nature23474. [DOI] [PubMed] [Google Scholar]
  • 21.Beer K, Bondarenko D, Farrelly T, Osborne TJ, Salzmann R, Scheiermann D, Wolf R. Training deep quantum neural networks. Nat. Commun. 2020;11(1):1–6. doi: 10.1038/s41467-020-14454-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Cong I, Choi S, Lukin MD. Quantum convolutional neural networks. Nat. Phys. 2019;15(12):1273–1278. doi: 10.1038/s41567-019-0648-8. [DOI] [Google Scholar]
  • 23.Iyengar, S.S., Kumar, L.K.J., & Mastriani, M. Analysis of five techniques for the internal representation of a digital image inside a quantum processor (2020) arXiv:2008.01081.
  • 24.Li Y, Zhou R-G, Xu R, Luo J, Hu W. A quantum deep convolutional neural network for image recognition. Quantum Sci. Technol. 2020;5(4):044003. doi: 10.1088/2058-9565/AB9F93. [DOI] [Google Scholar]
  • 25.Hu W, Hu J, Hu W, Hu J. Reinforcement learning with deep quantum neural networks. J. Quantum Inf. Sci. 2019;9(1):1–14. doi: 10.4236/JQIS.2019.91001. [DOI] [Google Scholar]
  • 26.Denchev VS, Boixo S, Isakov SV, Ding N, Babbush R, Smelyanskiy V, Martinis J, Neven H. What is the computational value of finite-range tunneling? Phys. Rev. X. 2016;6(3):031015. doi: 10.1103/PhysRevX.6.031015. [DOI] [Google Scholar]
  • 27.Saggio V, Asenbeck BE, Hamann A, Strömberg T, Schiansky P, Dunjko V, Friis N, Harris NC, Hochberg M, Englund D, Wölk S, Briegel HJ, Walther P. Experimental quantum speed-up in reinforcement learning agents. Nature. 2021;591(7849):229–233. doi: 10.1038/s41586-021-03242-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Dong D, Chen C, Li H, Tarn TJ. Quantum reinforcement learning. IEEE Trans. Syst. Man Cybern. B Cybern. 2008;38(5):1207–1220. doi: 10.1109/TSMCB.2008.925743. [DOI] [PubMed] [Google Scholar]
  • 29.Ganger M, Hu W. Quantum multiple Q-learning. Int. J. Intell. Sci. 2019;09(01):1–22. doi: 10.4236/IJIS.2019.91001. [DOI] [Google Scholar]
  • 30.Fernandez-Gauna B, Graña M, Lopez-Guede JM, Etxeberria-Agiriano I, Ansoategui I. Reinforcement Learning endowed with safe veto policies to learn the control of Linked-Multicomponent Robotic Systems. Inf. Sci. 2015;317:25–47. doi: 10.1016/J.INS.2015.04.005. [DOI] [Google Scholar]
  • 31.Fulton N, Platzer A. Safe reinforcement learning via formal methods: Toward safe control through proof and learning. Proc. AAAI Conf. Artif. Intell. 2018;32(1):6485–6492. doi: 10.1609/AAAI.V32I1.12107. [DOI] [Google Scholar]
  • 32.Wootters W, Zurek W. A single quantum cannot be cloned. Nature. 1982;299:802–803. doi: 10.1038/299802a0. [DOI] [Google Scholar]
  • 33.Watkins CJ, Dayan P. Q-learning. Mach. Learn. 1992;8(3–4):279–292. doi: 10.1007/BF00992698. [DOI] [Google Scholar]
  • 34.Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M.A. Playing atari with deep reinforcement learning. CoRR (2013) arXiv:1312.5602.
  • 35.Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. Proximal policy optimization algorithms. arXiv:1707.06347 (2017) [cs.LG].
  • 36.Kretínský, J., Meggendorfer, T., & Sickert, S. Owl: A library for ω-words, automata, and LTL. In: Autom. Tech. Verif. Anal., pp. 543–550 (2018). 10.1007/978-3-030-01090-4_34.
  • 37.Sadigh, D., Kim, E.S., Coogan, S., Sastry, S.S., & Seshia, S.A. A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications. In: Proc. IEEE Conf. Decis. Control., pp. 1091–1096 (2014).
  • 38.Ng, A.Y., Harada, D., & Russell, S. Policy invariance under reward transformations: Theory and application to reward shaping. In: ICML, vol. 99, pp. 278–287 (1999).
  • 39.Dempster AP, Laird NM, B RD. Maximum likelihood from incomplete data via the em algorithm. J. Roy. Stat. Soc.: Ser. B (Methodol.) 1977;39(1):1–22. [Google Scholar]
  • 40.Bertsekas DP, Tsitsiklis JN. Neuro-dynamic Programming. Belmont, MA: Athena scientific; 1996. [Google Scholar]
  • 41.Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., & Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In: Balcan, M.F., Weinberger, K.Q. (eds.) Proceedings of The 33rd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 48, pp. 1928–1937. PMLR, New York, New York, USA (2016). https://proceedings.mlr.press/v48/mniha16.html.
  • 42.Grover LK. Quantum mechanics helps in searching for a needle in a Haystack. Phys. Rev. Lett. 1997;79(2):325. doi: 10.1103/PhysRevLett.79.325. [DOI] [Google Scholar]
  • 43.Koch, D., Wessing, L., & Alsing, P.M. Introduction to Coding Quantum Algorithms: A Tutorial Series Using Qiskit (2019) arXiv:1903.04359.
  • 44.Qiskit. https://qiskit.org/ Accessed 2021-08-10.
  • 45.Preskill J. Quantum Computing in the NISQ era and beyond. Quantum. 2018;2:79. doi: 10.22331/q-2018-08-06-79. [DOI] [Google Scholar]


Data Availability Statement

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.


Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
