Abstract
In this paper, we consider the batch mode reinforcement learning setting, where the central problem is to learn from a sample of trajectories a policy that satisfies or optimizes a performance criterion. We focus on the continuous state space case for which usual resolution schemes rely on function approximators either to represent the underlying control problem or to represent its value function. As an alternative to the use of function approximators, we rely on the synthesis of “artificial trajectories” from the given sample of trajectories, and show that this idea opens new avenues for designing and analyzing algorithms for batch mode reinforcement learning.
Keywords: Reinforcement Learning, Optimal Control, Artificial Trajectories, Function Approximators
1 Introduction
Optimal control problems arise in many real-life applications, such as engineering [40], medicine [41,35,34] or artificial intelligence [43]. Over the last decade, techniques developed by the Reinforcement Learning (RL) community [43] have become increasingly popular for addressing these problems. RL initially focused on how to design intelligent agents able to interact with their environment so as to optimize a given performance criterion [43]. Since the end of the nineties, many researchers have focused on a subproblem of RL: computing high-performance policies when the only information available on the environment is contained in a batch collection of trajectories. This subproblem of RL is referred to as batch mode RL [18].
Most of the techniques proposed in the literature for solving batch mode RL problems over large or continuous spaces combine value or policy iteration schemes from Dynamic Programming (DP) theory [2] with function approximators (e.g., radial basis functions, neural networks) representing (state-action) value functions [7]. These approximators have two main roles: (i) to offer a concise representation of state-action value functions defined over continuous spaces and (ii) to generalize the information contained in the finite sample of input data. Another, less studied family of algorithms adopts a two-stage process for solving batch mode RL problems: first, function approximators are trained to learn a model of the environment; afterwards, various optimization schemes (e.g., direct policy search, dynamic programming) are used to compute a policy which is (near-)optimal with respect to this model.
While successful in many studies, the use of function approximators for solving batch mode RL problems also has drawbacks. In particular, the black-box nature of this approach makes performance analysis very difficult, and hence severely hinders the design of new batch mode RL algorithms with a priori performance guarantees. Also, the policies inferred by these algorithms may have counter-intuitive properties. For example, in a deterministic framework, for a fixed initial state, even when the input sample contains a trajectory generated by an optimal policy starting from this initial state, there is no guarantee that a function approximator-based policy will reproduce this optimal behavior. This is surprising, since a simple "imitative learning" approach would have such a desirable property.
The above observations have led us to develop a new line of research based on the synthesis of "artificial trajectories" for addressing batch mode RL problems. In our approach, artificial trajectories are rebuilt from the tuples extracted from the given batch of trajectories with the aim of achieving an optimality property. In this paper, we revisit our work on this topic [19–23], with the objective of showing that these ideas open avenues for addressing many batch mode RL related problems. In particular, four algorithms that exploit artificial trajectories are presented. The first one computes an estimate of the performance of a given control policy [22]. The second one provides a way of computing performance guarantees in deterministic settings [19]. The third one leads to the computation of policies having high performance guarantees [20,23], and the fourth one is a sampling strategy for generating additional trajectories [21]. Finally, we highlight connections between the concept of artificial trajectory synthesis and other standard batch mode RL techniques.
The paper is organized as follows. First, Section 2 gives a brief review of the field of batch mode RL. Section 3 presents the batch mode RL setting adopted in this paper and several of the generic problems it raises. In Section 4, we present our new line of research articulated around the synthesis of artificial trajectories. Finally, Section 5 proposes to make the link between this paradigm and existing batch mode RL techniques, and Section 6 concludes.
2 Related Work
Batch mode RL techniques are probably rooted in the works of Bradtke and Barto [6] and Boyan [4] on the use of least-squares techniques in the context of Temporal Difference learning (LSTD) for estimating the return of control policies. These works were extended to optimal control problems by Lagoudakis and Parr [27], who introduced the Least-Squares Policy Iteration (LSPI) algorithm, which mimics the policy iteration algorithm of DP theory [2]. Several papers have proposed theoretical analyses of least-squares TD-based algorithms, for example Nedić and Bertsekas [36] and Lazaric et al. [29,30].
Another algorithm from DP theory, the value iteration algorithm, has also served as inspiration for designing batch mode RL algorithms. For example, Ormoneit and Sen [37] developed in 2002 a batch mode RL algorithm using kernel approximators, for which theoretical analyses are also provided. Reference [12] proposes an algorithm that combines value iteration with any type of regressor (e.g., regression trees, SVMs, neural networks). Reference [13] names this algorithm Fitted Q Iteration (FQI) and provides a careful empirical analysis of its performance when combined with ensembles of regression trees.
References [40,28] and [44] study the performance of the FQI algorithm with (deep) neural networks and CMACs (Cerebellar Model Articulation Controllers). The Regularized FQI algorithm uses penalized least-squares regression as function approximator to limit the model complexity of the original FQI algorithm [17]. Extensions of the FQI algorithm to continuous action spaces have also been proposed [1]. More theoretical works related to FQI have also been published [33,10].
Applications of these batch mode RL techniques have already led to promising results in robotics [38,3,45], power systems [14], image processing [15], water reservoir optimization [9,8], medicine [34,16,26] and driving assistance strategies [39].
3 Batch Mode RL: Formalization and Typical Problems
We consider a stochastic discrete-time system whose dynamics is given by
x_{t+1} = f(x_t, u_t, w_t), t = 0, 1, …, T − 1,
where xt belongs to a state space 𝒳 ⊂ ℝ^d (ℝ^d being the d–dimensional Euclidean space) and T ∈ ℕ\{0} denotes the finite optimization horizon. At every time t ∈ {0,…, T − 1}, the system can be controlled by taking an action ut ∈ 𝒰, and is subject to a random disturbance wt ∈ 𝒲 drawn according to a probability distribution p𝒲(·)¹. To each system transition from time t to t + 1 is associated a reward signal:
r_t = ρ(x_t, u_t, w_t).
Let h : {0, …, T − 1} × 𝒳 → 𝒰 be a control policy. When starting from a given initial state x0 and following the control policy h, an agent gets a random sum of rewards Rh(x0):
Rh(x0) = Σ_{t=0}^{T−1} ρ(x_t, h(t, x_t), w_t).
In RL, the classical performance criterion for evaluating a policy h is its expected T–stage return:
Definition 1 (Expected T–stage Return)
Jh(x0) = E_{w0,…,wT−1} [ Rh(x0) ],
but, when searching for risk-aware policies, it is also of interest to consider the Value-at-Risk criterion:
Definition 2 (T–stage Value at Risk)
Let b ∈ ℝ and c ∈ [0, 1[. The policy h is said to meet the (b, c) Value-at-Risk requirement when starting from x0 if the probability that the return Rh(x0) falls at or below the level b does not exceed c:
P( Rh(x0) ≤ b ) ≤ c.
The central problem of batch mode RL is to find a good approximation of a policy h* that optimizes one such performance criterion, given the fact that the functions f, ρ and p𝒲(·) are unknown, and thus not accessible to simulation. Instead, they are "replaced" by a batch collection of n ∈ ℕ\{0} elementary pieces of trajectories, defined according to the following process.
Definition 3 (Sample of transitions)
Let 𝒫n = {(x^l, u^l)}_{l=1}^{n} be a given set of state-action pairs. Consider the ensemble of samples of one-step transitions of size n that could be generated by complementing each pair (x^l, u^l) of 𝒫n by drawing for each l a disturbance signal w^l at random from p𝒲(·), and by recording the resulting values of ρ(x^l, u^l, w^l) and f(x^l, u^l, w^l). We denote by ℱn(𝒫n, w^1,…,w^n) one such "random" set of one-step transitions defined by a random draw of n disturbance signals w^l, l = 1 … n. We assume that we know one realization of the random set ℱn(𝒫n, w^1,…,w^n), which we denote by ℱn:
ℱn = {(x^l, u^l, r^l, y^l)}_{l=1}^{n},
where, for all l ∈ {1,…, n},
r^l = ρ(x^l, u^l, w^l) and y^l = f(x^l, u^l, w^l)
for some realizations of the disturbance process w^l ∼ p𝒲(·).
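To make the data layout concrete, the sketch below (Python) shows one way to store a sample ℱn as a list of one-step transitions (x^l, u^l, r^l, y^l). The one-dimensional toy dynamics and reward used to generate the tuples are illustrative assumptions, not the benchmark of this paper; the later sketches reuse this Transition layout (namedtuples support both attribute and tuple access).

```python
import random
from collections import namedtuple

# One one-step system transition (x^l, u^l, r^l, y^l).
Transition = namedtuple("Transition", ["x", "u", "r", "y"])

def toy_f(x, u, w):
    """Illustrative dynamics (not the benchmark used in this paper)."""
    return max(-1.0, min(1.0, x + 0.5 * u + w))

def toy_rho(x, u, w):
    """Illustrative reward function (not the benchmark used in this paper)."""
    return -abs(x) - 0.1 * u * u + w

def build_sample(state_action_pairs, eps=0.1, seed=0):
    """Complement each (x^l, u^l) with a disturbance w^l ~ U[-eps, eps] and
    record r^l = rho(x^l, u^l, w^l) and y^l = f(x^l, u^l, w^l)."""
    rng = random.Random(seed)
    sample = []
    for x, u in state_action_pairs:
        w = rng.uniform(-eps, eps)
        sample.append(Transition(x, u, toy_rho(x, u, w), toy_f(x, u, w)))
    return sample

if __name__ == "__main__":
    pairs = [(x, u) for x in (-0.5, 0.0, 0.5) for u in (-1.0, 0.0, 1.0)]
    F_n = build_sample(pairs)
    print(F_n[0])
```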
Notice first that the resolution of the central problem of finding a good approximation of an optimal policy h* is closely related to the problem of estimating the performance of a given policy. Indeed, once this latter problem is solved, the search for an optimal policy can in principle be reduced to an optimization problem over the set of candidate policies. We thus start by addressing the problem of characterizing the performance of a given policy.
It is sometimes desirable to be able to compute policies having good performance guarantees. Indeed, for many applications, even if it is perhaps not paramount to have a policy h which is very close to the optimal one, it is however crucial to be able to guarantee that the considered policy h leads to high-enough cumulated rewards. The problem of computing such policies will also be addressed later in this paper.
In many applications, one has the possibility to move away from a pure batch setting by carrying out a limited number of experiments on the real system in order to enrich the available sample of trajectories. We thus also consider the problem of designing strategies for generating optimal experiments for batch mode RL.
4 Synthesizing Artificial Trajectories
We first formalize the concept of artificial trajectories in Section 4.1. In Section 4.2, we detail, analyze and illustrate on a benchmark how artificial trajectories can be exploited for estimating the performances of policies. We focus in Section 4.3 on the deterministic case, and we show how artificial trajectories can be used for computing bounds on the performances of policies. Afterwards, we exploit these bounds for addressing two different problems: the first problem (Section 4.4) is to compute policies having good performance guarantees. The second problem (Section 4.5) is to design sampling strategies for generating additional system transitions.
4.1 Artificial Trajectories
Artificial trajectories are made of elementary pieces of trajectories (one-step system transitions) taken from the sample ℱn. Formally, an artificial trajectory is defined as follows:
Definition 4 (Artificial Trajectory)
An artificial trajectory is an (ordered) sequence of T one-step system transitions:
[(x^{l_0}, u^{l_0}, r^{l_0}, y^{l_0}), …, (x^{l_{T−1}}, u^{l_{T−1}}, r^{l_{T−1}}, y^{l_{T−1}})],
where l_t ∈ {1, …, n}, ∀t ∈ {0, …, T − 1}.
We give in Figure 1 an illustration of one such artificial trajectory.
Fig. 1.
An example of an artificial trajectory rebuilt from 4 one-step system transitions from ℱn.
Observe that one can synthesize n^T different artificial trajectories from the sample of transitions ℱn. In the rest of this paper, we present various techniques for extracting and exploiting "interesting" subsets of artificial trajectories.
4.2 Evaluating the Expected Return of a Policy
A major subproblem of batch mode RL is to evaluate the expected return Jh(x0) of a given policy h. Indeed, when such an oracle is available, the search for an optimal policy can in some sense be reduced to an optimization problem over the set of all candidate policies. When a model of the system dynamics, the reward function and the disturbance probability distribution is available, Monte Carlo estimation techniques can be run to estimate the performance of any control policy. This is, however, not possible in the batch mode setting. In this section, we detail an approach that estimates the performance of a policy by rebuilding artificial trajectories so as to mimic the behavior of the Monte Carlo estimator. We assume in this section (and also in Section 4.3) that the action space 𝒰 is continuous and normed.
4.2.1 Monte Carlo Estimation
The Monte Carlo (MC) estimator works in a model-based setting (i.e., in a setting where f, ρ and p𝒲(·) are known). It estimates Jh(x0) by averaging the returns of several (say p ∈ ℕ\{0}) trajectories generated by simulating the system from x0 using the policy h. More formally, the MC estimator of the expected return of the policy h when starting from the initial state x0 writes:
Definition 5 (Monte Carlo Estimator)
M^h_p(x0) = (1/p) Σ_{i=1}^{p} Σ_{t=0}^{T−1} ρ(x^i_t, h(t, x^i_t), w^i_t),
with, ∀t ∈ {0,…,T − 1}, ∀i ∈ {1,…,p}: x^i_0 = x0, x^i_{t+1} = f(x^i_t, h(t, x^i_t), w^i_t) and w^i_t ∼ p𝒲(·).
The bias and variance of the MC estimator are:
Proposition 1 (Bias and Variance of the MC Estimator)
E[M^h_p(x0)] − Jh(x0) = 0 and Var[M^h_p(x0)] = σ²_{Rh(x0)} / p,
where σ²_{Rh(x0)} denotes the assumed finite variance of Rh(x0, w0,…,wT−1):
σ²_{Rh(x0)} = Var_{w0,…,wT−1} [ Rh(x0) ].
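As a point of comparison for what follows, the model-based MC estimator of Definition 5 can be sketched as below (Python); the placeholder dynamics, reward, policy and disturbance distribution in the usage example are assumptions made only for illustration.

```python
import random

def mc_estimate(f, rho, policy, x0, T, p, draw_w, seed=0):
    """Average the returns of p trajectories simulated with the (known) model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(p):
        x, ret = x0, 0.0
        for t in range(T):
            u = policy(t, x)
            w = draw_w(rng)
            ret += rho(x, u, w)
            x = f(x, u, w)
        total += ret
    return total / p

if __name__ == "__main__":
    f = lambda x, u, w: max(-1.0, min(1.0, x + 0.5 * u + w))
    rho = lambda x, u, w: -abs(x) + w
    h = lambda t, x: -x                      # a simple Lipschitz feedback policy
    draw_w = lambda rng: rng.uniform(-0.1, 0.1)
    print(mc_estimate(f, rho, h, x0=-0.5, T=15, p=10, draw_w=draw_w))
```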
4.2.2 Model-free Monte Carlo Estimation
From a sample ℱn, our model-free MC (MFMC) estimator works by rebuilding p ∈ ℕ\{0} artificial trajectories. These artificial trajectories then serve as proxies for p "actual" trajectories that could be obtained by simulating the policy h on the given control problem. Our estimator averages the cumulated returns over these artificial trajectories to compute its estimate of the expected return Jh(x0). The main idea behind our method is to select the artificial trajectories so as to minimize the discrepancy of these trajectories with a classical MC sample that could be obtained by simulating the system with policy h.
To rebuild a sample of p artificial trajectories of length T starting from x0 and similar to trajectories that would be induced by a policy h, our algorithm uses each one-step transition in ℱn at most once; we thus assume that pT ≤ n. The p artificial trajectories of T one-step transitions are created sequentially. Every artificial trajectory is grown in length by selecting, among the sample of not yet used one-step transitions, a transition whose first two elements minimize the distance (using a distance metric Δ in 𝒳 × 𝒰) to the couple formed by the last element of the previously selected transition and the action induced by h at the end of this previous transition. Because we do not re-use one-step transitions, the disturbances associated with the selected transitions are i.i.d., which provides the MFMC estimator with interesting theoretical properties (see Section 4.2.3). This also ensures that the p rebuilt artificial trajectories are distinct.
Algorithm 1 MFMC algorithm to rebuild a set of p artificial trajectories of length T from a sample of n one-step transitions.
Input: ℱn = {(x^l, u^l, r^l, y^l)}_{l=1}^{n}, policy h, initial state x0, distance metric Δ, number of artificial trajectories p (with pT ≤ n);
Let G ← {1, …, n} be the set of indices of not yet used one-step transitions;
for i = 1 to p (extract an artificial trajectory) do
  t ← 0;
  x ← x0;
  while t < T do
    u ← h(t, x);
    l_t^i ← lowest index in G of a transition minimizing Δ((x^l, u^l), (x, u));
    x ← y^{l_t^i};
    G ← G \ {l_t^i}; \\ do not re-use transitions
    t ← t + 1;
  end while
end for
Return the set of indices {l_t^i}, i = 1 … p, t = 0 … T − 1.
A tabular version of the procedure for building the set of artificial trajectories is given in Algorithm 1. It returns a set of indices of one-step transitions {l_t^i}, i = 1 … p, t = 0 … T − 1, from ℱn based on h, x0, the distance metric Δ and the parameter p. Based on this set of indices, we define our MFMC estimate of the expected return of the policy h when starting from the initial state x0:
Definition 6 (Model-free Monte Carlo Estimator)
M̂^h_p(ℱn, x0) = (1/p) Σ_{i=1}^{p} Σ_{t=0}^{T−1} r^{l_t^i}.
Figure 2 illustrates the MFMC estimator. Note that the computation of the MFMC estimator M̂^h_p(ℱn, x0) has a linear complexity with respect to the cardinality n of ℱn, the number of artificial trajectories p and the optimization horizon T.
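A compact sketch of Algorithm 1 and of the estimator of Definition 6 is given below (Python, reusing the Transition layout of the earlier sketch). The sum-of-absolute-differences distance used as Δ is only one possible choice and is an assumption of this example.

```python
def mfmc_estimate(sample, policy, x0, T, p, dist=None):
    """MFMC estimate of J^h(x0): rebuild p artificial trajectories greedily
    (each one-step transition used at most once, hence p * T <= len(sample))
    and average their cumulated rewards."""
    assert p * T <= len(sample)
    if dist is None:
        # One possible distance metric Delta over the state-action space.
        dist = lambda x, u, xl, ul: abs(x - xl) + abs(u - ul)
    unused = set(range(len(sample)))
    total = 0.0
    for _ in range(p):
        x = x0
        for t in range(T):
            u = policy(t, x)
            # Lowest index among unused transitions minimizing Delta((x^l, u^l), (x, u)).
            l = min(unused, key=lambda i: (dist(x, u, sample[i].x, sample[i].u), i))
            unused.remove(l)
            total += sample[l].r      # accumulate r^{l_t^i}
            x = sample[l].y           # grow the artificial trajectory from y^{l_t^i}
    return total / p
```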
Fig. 2.
Rebuilding three 4-length trajectories for estimating the return of a policy.
4.2.3 Analysis of the MFMC Estimator
In this section we characterize some main properties of our estimator. To this end, we study the distribution of the estimator M̂^h_p(ℱn(𝒫n, w^1,…,w^n), x0), seen as a function of the random set ℱn(𝒫n, w^1,…,w^n). In order to characterize this distribution, we express its bias and its variance as a function of a measure of the density of the sample ℱn, defined by its "k–dispersion"; this is the smallest radius such that all Δ-balls in 𝒳 × 𝒰 of this radius contain at least k elements from ℱn. The use of this notion implies that the space 𝒳 × 𝒰 is bounded (when measured using the distance metric Δ).
The bias and variance characterization will be done under some additional assumptions detailed below. After that, we state the main theorems formulating these characterizations. Proofs are given in [22].
Assumption: Lipschitz continuity of the functions f, ρ and h
We assume that the dynamics f, the reward function ρ and the policy h are Lipschitz continuous, i.e., that there exist finite constants Lf, Lρ, Lh ∈ ℝ+ such that, ∀(x, x′, u, u′, w) ∈ 𝒳² × 𝒰² × 𝒲 and ∀t ∈ {0, …, T − 1}:
‖f(x, u, w) − f(x′, u′, w)‖𝒳 ≤ Lf (‖x − x′‖𝒳 + ‖u − u′‖𝒰),
|ρ(x, u, w) − ρ(x′, u′, w)| ≤ Lρ (‖x − x′‖𝒳 + ‖u − u′‖𝒰),
‖h(t, x) − h(t, x′)‖𝒰 ≤ Lh ‖x − x′‖𝒳,
where ‖·‖𝒳 and ‖·‖𝒰 denote the chosen norms over the spaces 𝒳 and 𝒰, respectively.
Assumption: 𝒳 × 𝒰 is bounded
We suppose that 𝒳 × 𝒰 is bounded when measured using the distance metric Δ.
Definition 7 (Distance Metric Δ)
We use the distance metric Δ over 𝒳 × 𝒰 defined by
Δ((x, u), (x′, u′)) = ‖x − x′‖𝒳 + ‖u − u′‖𝒰.
Given k ∈ ℕ\{0} with k ≤ n, we define the k–dispersion αk(ℱn) of the sample ℱn:
Definition 8 (k–Dispersion)
αk(ℱn) = sup_{(x,u) ∈ 𝒳×𝒰} Δk((x, u), ℱn),
where Δk((x, u), ℱn) denotes the distance of (x, u) to its k–th nearest neighbor (using the distance metric Δ) among the state-action pairs of the ℱn sample.
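The supremum in Definition 8 cannot be computed exactly in general; the sketch below (Python, same Transition layout and illustrative distance as before) approximates the k-dispersion by maximizing the k-th nearest-neighbor distance over a finite set of probe points, which only gives a lower estimate of αk(ℱn).

```python
def kth_nn_distance(sample, x, u, k, dist):
    """Distance of (x, u) to its k-th nearest neighbour among the (x^l, u^l) of the sample."""
    distances = sorted(dist(x, u, tr.x, tr.u) for tr in sample)
    return distances[k - 1]

def k_dispersion_estimate(sample, probe_points, k, dist=None):
    """Approximate alpha_k(F_n): maximum of the k-th nearest-neighbour distance
    over a finite set of probe state-action points (a lower estimate of the sup)."""
    if dist is None:
        dist = lambda x, u, xl, ul: abs(x - xl) + abs(u - ul)
    return max(kth_nn_distance(sample, x, u, k, dist) for (x, u) in probe_points)
```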
Definition 9 (Expected Value of M̂^h_p)
We denote by E^h_p(𝒫n, x0) the expected value of the MFMC estimator:
E^h_p(𝒫n, x0) = E_{w^1,…,w^n ∼ p𝒲} [ M̂^h_p(ℱn(𝒫n, w^1,…,w^n), x0) ].
We have the following theorem:
Theorem 1 (Bias Bound for M̂^h_p)
| E^h_p(𝒫n, x0) − Jh(x0) | ≤ C_T αpT(ℱn),
where C_T is a constant that depends only on the Lipschitz constants Lf, Lρ and Lh and on the optimization horizon T (its exact expression is given in [22]).
The proof of this result is given in [22]. This bound shows that the expected value of the MFMC estimator is guaranteed to be close to the target value Jh(x0) when the sample dispersion is small. Note that the pT–dispersion itself only depends on the sample ℱn and on the value of p (it increases with the number of trajectories used by our algorithm).
Definition 10 (Variance of M̂^h_p)
We denote by V^h_p(𝒫n, x0) the variance of the MFMC estimator, defined by
V^h_p(𝒫n, x0) = Var_{w^1,…,w^n ∼ p𝒲} [ M̂^h_p(ℱn(𝒫n, w^1,…,w^n), x0) ],
and we give the following theorem:
Theorem 2 (Variance Bound for M̂^h_p)
√(V^h_p(𝒫n, x0)) ≤ σ_{Rh(x0)}/√p + C′_T αpT(ℱn),
where C′_T is a constant that depends only on the Lipschitz constants Lf, Lρ and Lh and on T (its exact expression is given in [22]).
The proof of this theorem is given in [22]. We see that the variance of our MFMC estimator is guaranteed to be close to that of the classical MC estimator if the sample dispersion is small enough.
• Illustration
In this section, we illustrate the MFMC estimator on an academic problem. The state space 𝒳 and the action space 𝒰 are both equal to [−1, 1]. The disturbance wt is an element of the interval [−ε, ε] with ε = 0.1, and p𝒲 is the uniform probability distribution over this interval. The optimization horizon T is equal to 15, and the initial state of the system is set to x0 = −0.5. The system dynamics f, the reward function ρ and the policy h whose performance has to be evaluated are fixed Lipschitz continuous functions of their arguments.
For our first set of experiments, we choose to work with a value of p = 10, i.e., the MFMC estimator rebuilds 10 artificial trajectories to estimate Jh(−0.5). In these experiments, for different cardinalities nj, we build sets 𝒫nj = {(x^l, u^l)}_{l=1}^{nj} that uniformly cover the space 𝒳 × 𝒰. Then, for each nj, we generate 50 random sets ℱnj over 𝒫nj and run our MFMC estimator on each of these sets. The results of this first set of experiments are gathered in Figure 3. For every value of nj considered in our experiments, the 50 values computed by the MFMC estimator are concisely represented by a boxplot. The box has lines at the lower quartile, median, and upper quartile values. Whiskers extend from each end of the box to the adjacent values in the data within 1.5 times the interquartile range from the ends of the box. Outliers are data with values beyond the ends of the whiskers and are displayed with a red + sign. The squares represent an accurate estimate of Jh(−0.5) computed by running thousands of Monte Carlo simulations. As we observe, when the samples increase in size (which corresponds to a decrease of the pT–dispersion αpT(ℱnj)), the MFMC estimator is more likely to output accurate estimations of Jh(−0.5). As explained throughout this paper, there exist many similarities between the model-free MFMC estimator and the model-based MC estimator. These can be empirically illustrated by putting Figure 3 in perspective with Figure 4. This latter figure reports the results obtained by 50 independent runs of the MC estimator, each of these runs also using p = 10 trajectories. As expected, one can see that the MFMC estimator tends to behave similarly to the MC estimator when the cardinality of the sample increases.
Fig. 3.
Computations of the MFMC estimator with p = 10, for different cardinalities n of the sample of one-step transitions. For each cardinality, 50 independent samples of transitions have been generated. Squares represent Jh(x0).
Fig. 4.
Computations of the MC estimator with p = 10. 50 independent runs have been computed.
In our second set of experiments, we study the influence of the number of artificial trajectories p upon which the MFMC estimator bases its prediction. For each value pj = j², j = 1 … 10, we generate 50 samples of one-step transitions of cardinality 10,000 (built over the set 𝒫10000 defined in the first set of experiments) and use these samples to compute the MFMC estimator. The results are plotted in Figure 5. This figure shows that the bias of the MFMC estimator seems to be relatively small for small values of p and to increase with p. This is in accordance with Theorem 1, which bounds the bias with an expression that increases with p.
Fig. 5.
Computations of the MFMC estimator for different values of the number p of artificial trajectories extracted from a sample of n = 10,000 tuples. For each value of p, 50 independent samples of transitions have been generated. Squares represent Jh(x0).
In Figure 6, we have plotted the evolution of the values computed by the model-based MC estimator when the number of trajectories it considers in its prediction increases. While, for small numbers of trajectories, it behaves similarly to the MFMC estimator, the quality of its predictions steadily increases with p, which is not the case for the MFMC estimator, whose performance degrades once p crosses a threshold value. Notice that this threshold value could be made larger by increasing the size of the sample of one-step system transitions used as input of the MFMC algorithm.
Fig. 6.
Computations of the MC estimator for different values of the number of trajectories p. For each value of p, 50 independent runs of the MC estimator have been computed. Squares represent Jh(x0).
4.2.4 MFMC Estimation of the Value at Risk
In order to take into account the riskiness of policies, and not only their average performance, one may prefer to consider a Value-at-Risk type performance criterion instead of the expected return. Notice that this type of criterion has received more and more attention within the RL community over the last few years [11,31,32].
If we consider the p artificial trajectories that are rebuilt by the MFMC estimator, the Value-at-Risk criterion can be efficiently approximated as follows:
Definition 11 (Estimate of the T–stage Value at Risk)
Let b ∈ ℝ and c ∈ [0, 1[. The criterion of Definition 2 is evaluated on the empirical distribution of the returns of the p artificial trajectories, i.e., the policy h is estimated to meet the (b, c) requirement if the fraction of artificial trajectories whose return r^i falls at or below b does not exceed c, where r^i denotes the return of the i–th artificial trajectory:
r^i = Σ_{t=0}^{T−1} r^{l_t^i}.
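A plug-in reading of Definition 11 is sketched below (Python): the (b, c) requirement is checked on the empirical distribution of the artificial-trajectory returns r^1, …, r^p. The made-up return values in the usage example are purely illustrative.

```python
def meets_value_at_risk(returns, b, c):
    """Empirical (b, c) Value-at-Risk check: the fraction of artificial-trajectory
    returns r^i that fall at or below the level b must not exceed c."""
    fraction_at_or_below = sum(1 for r in returns if r <= b) / len(returns)
    return fraction_at_or_below <= c

if __name__ == "__main__":
    # Example with made-up returns of p = 10 artificial trajectories.
    returns = [1.2, 0.8, 1.1, 0.4, 1.3, 0.9, 1.0, 0.7, 1.2, 1.1]
    print(meets_value_at_risk(returns, b=0.5, c=0.1))  # True: 1 return out of 10 is <= 0.5
```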
4.3 Artificial Trajectories in the Deterministic Case: Computing Bounds
From this subsection to the end of Section 4, we assume a deterministic environment. More formally, we assume that the disturbance space is reduced to a single element 𝒲 = {0}, which concentrates the whole probability mass: p𝒲(0) = 1. We abusively use the convention:
f(x, u) = f(x, u, 0) and ρ(x, u) = ρ(x, u, 0), ∀(x, u) ∈ 𝒳 × 𝒰.
We still assume that the functions f, ρ and h are Lipschitz continuous. Observe that, in a deterministic context, only one trajectory is needed to compute Jh(x0) by Monte Carlo estimation. We have the following result:
Proposition 2 (Lower Bound from the MFMC)
Let [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T−1} be an artificial trajectory rebuilt by the MFMC algorithm when using the distance metric Δ. Then, we have
Jh(x0) ≥ Σ_{t=0}^{T−1} ( r^{l_t} − L_{Q_{T−t}} Δ((y^{l_{t−1}}, h(t, y^{l_{t−1}})), (x^{l_t}, u^{l_t})) ),
where the constants L_{Q_{T−t}} depend only on the Lipschitz constants Lf, Lρ and Lh (their exact expression is given in [19]) and y^{l_{−1}} = x0.
The proof of this result can be found in [19]. Since the previous result is valid for any artificial trajectory, we have:
Corollary 1 (Lower Bound from any Artificial Trajectory)
Let τ = [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T−1} be any artificial trajectory, and denote by L^h(τ, x0) the right-hand side of the bound above. Then,
Jh(x0) ≥ L^h(τ, x0).
This suggests identifying an artificial trajectory that maximizes the previous lower bound:
Definition 12 (Maximal Lower Bound)
L^h(ℱn, x0) = max_τ L^h(τ, x0),
where the maximum is taken over all artificial trajectories τ that can be rebuilt from ℱn. Note that, in the same way, a minimal upper bound can be computed:
Definition 13 (Minimal Upper Bound)
U^h(ℱn, x0) = min_τ U^h(τ, x0),
where U^h(τ, x0) is obtained by adding, rather than subtracting, the penalty terms L_{Q_{T−t}} Δ(·, ·) to the sum of rewards of the artificial trajectory τ.
Additionally, we can prove that both the lower and the upper bound are tight, in the sense that they both converge towards Jh(x0) when the dispersion of the sample of system transitions ℱn decreases towards zero.
Proposition 3 (Tightness of the Bounds)
There exists a constant C > 0, depending only on the Lipschitz constants and on T, such that
Jh(x0) − L^h(ℱn, x0) ≤ C α1(ℱn) and U^h(ℱn, x0) − Jh(x0) ≤ C α1(ℱn),
where α1(ℱn) denotes the 1–dispersion of the sample of system transitions ℱn.
This result is proved in [19]. Note that the computation of both the maximal lower bound and the minimal upper bound can be reformulated as a shortest path problem in a graph, for which the computational complexity is linear with respect to the optimization horizon T and quadratic with respect to the cardinality n of the sample of transitions ℱn.
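The sketch below (Python) computes the lower and upper bounds associated with a single artificial trajectory in this deterministic, continuous-action setting. The geometric-sum form assumed for the constants, L_{Q_k} = Lρ Σ_{i=0}^{k−1} (Lf(1 + Lh))^i, is our reading of [19] and should be checked against that reference; the distance Δ is again the illustrative sum of absolute differences.

```python
def penalty_constants(L_f, L_rho, L_h, T):
    """Assumed constants L_{Q_k} = L_rho * sum_{i<k} (L_f * (1 + L_h))**i, k = 1..T."""
    a = L_f * (1.0 + L_h)
    return [L_rho * sum(a ** i for i in range(k)) for k in range(1, T + 1)]

def trajectory_bounds(traj, policy, x0, L_f, L_rho, L_h, dist=None):
    """Lower and upper bounds on J^h(x0) computed from a single artificial
    trajectory traj = [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0..T-1}."""
    if dist is None:
        dist = lambda x, u, xl, ul: abs(x - xl) + abs(u - ul)
    T = len(traj)
    L_Q = penalty_constants(L_f, L_rho, L_h, T)    # L_Q[k-1] holds L_{Q_k}
    lower = upper = 0.0
    prev_y = x0
    for t, (x, u, r, y) in enumerate(traj):
        # Discrepancy between the end of the previous piece and the start of this one.
        gap = dist(prev_y, policy(t, prev_y), x, u)
        lower += r - L_Q[T - t - 1] * gap          # penalty constant L_{Q_{T-t}}
        upper += r + L_Q[T - t - 1] * gap
        prev_y = y
    return lower, upper
```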
4.3.1 Extension to Finite Action Spaces
The results given above can be extended to the case where the action space 𝒰 is finite (and thus discrete) by considering policies that are fully defined by a sequence of actions. Such policies can be qualified as "open-loop". Let Π be the set of open-loop policies:
Definition 14 (Open-loop Policies)
Π = { π : {0, …, T − 1} → 𝒰 }.
Given an open-loop policy π, the (deterministic) T–stage return of π writes:
J^π(x0) = Σ_{t=0}^{T−1} ρ(x_t, π(t)),
with x_0 = x0 and x_{t+1} = f(x_t, π(t)) for t ∈ {0, …, T − 2}.
In the context of a finite action space, the Lipschitz continuity of f and ρ reads: ∀(x, x′, u) ∈ 𝒳² × 𝒰,
‖f(x, u) − f(x′, u)‖𝒳 ≤ Lf ‖x − x′‖𝒳,
|ρ(x, u) − ρ(x′, u)| ≤ Lρ ‖x − x′‖𝒳.
Since the action space is not normed anymore, we also need to redefine the sample dispersion.
Definition 15 (Sample Dispersion)
We assume that the state space is bounded, and we define the sample dispersion α*(ℱn) as the smallest α such that, for every x ∈ 𝒳 and every action u ∈ 𝒰, there exists a transition (x^l, u^l, r^l, y^l) ∈ ℱn with u^l = u and ‖x − x^l‖𝒳 ≤ α.
Let π ∈ Π be an open-loop policy. We have the following result:
Proposition 4 (Lower Bound - Open-loop Policy π)
Let [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T−1} be an artificial trajectory such that
u^{l_t} = π(t), ∀t ∈ {0, …, T − 1}.
Then,
J^π(x0) ≥ Σ_{t=0}^{T−1} ( r^{l_t} − L_{Q_{T−t}} ‖y^{l_{t−1}} − x^{l_t}‖𝒳 ),
where y^{l_{−1}} = x0 and L_{Q_k} = Lρ Σ_{i=0}^{k−1} (Lf)^i.
A maximal lower bound can then be computed by maximizing the previous bound over the set of all artificial trajectories that satisfy the condition u^{l_t} = π(t) ∀t ∈ {0,…,T − 1}. In the following, we denote by 𝒯^π(ℱn) the set of artificial trajectories that satisfy this condition.
Then, we have:
Definition 16 (Maximal Lower Bound - Open-loop Policy π)
L^π(ℱn, x0) = max_{τ ∈ 𝒯^π(ℱn)} Σ_{t=0}^{T−1} ( r^{l_t} − L_{Q_{T−t}} ‖y^{l_{t−1}} − x^{l_t}‖𝒳 ).
Similarly, a minimal upper bound U^π(ℱn, x0) can also be computed:
Definition 17 (Minimal Upper Bound - Open-loop Policy π)
U^π(ℱn, x0) = min_{τ ∈ 𝒯^π(ℱn)} Σ_{t=0}^{T−1} ( r^{l_t} + L_{Q_{T−t}} ‖y^{l_{t−1}} − x^{l_t}‖𝒳 ).
Both bounds are tight in the following sense:
Proposition 5 (Tightness of the Bounds - Open-loop Policy π)
There exists a constant C > 0, depending only on Lf, Lρ and T, such that
J^π(x0) − L^π(ℱn, x0) ≤ C α*(ℱn) and U^π(ℱn, x0) − J^π(x0) ≤ C α*(ℱn).
The proofs of the above results are given in [20].
4.4 Artificial Trajectories for Computing Safe Policies
Like in Section 4.3.1, we still assume that the action space 𝒰 is finite, and we consider open-loop policies. To obtain a policy with good performance guarantees, we suggest finding an open-loop policy π̂*(ℱn, x0) such that:
π̂*(ℱn, x0) ∈ arg max_{π ∈ Π} L^π(ℱn, x0).
Recall that such an "open-loop" policy is optimized with respect to the initial state x0. Solving the above optimization problem can be seen as identifying an optimal rebuilt artificial trajectory and outputting as open-loop policy the sequence of actions taken along this artificial trajectory.
Finding such a policy can again be done in an efficient way by reformulating the problem as a shortest path problem in a graph. We provide in [20] an algorithm called CGRL (which stands for "Cautious approach to Generalization in RL") of complexity O(n²T) for finding such a policy. A tabular version of CGRL is given in Algorithm 2, and Figure 7 illustrates how the CGRL solution can be seen as a shortest path in a graph. We now give a theorem which shows the convergence of the policy π̂*(ℱn, x0) towards an optimal open-loop policy when the dispersion α*(ℱn) of the sample of transitions converges towards zero.
Fig. 7.
A graphical interpretation of the CGRL algorithm. The CGRL solution can be interpreted as a shortest path in a specific graph.
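A sketch of the dynamic program behind Algorithm 2 is given below (Python), for a one-dimensional state space where ‖·‖𝒳 is taken as the absolute value. It assumes the finite-action penalty constants L_{Q_k} = Lρ Σ_{i=0}^{k−1} Lf^i as reconstructed in Proposition 4, and returns the open-loop action sequence with the maximal lower bound in O(n²T) operations.

```python
def cgrl_actions(sample, x0, L_f, L_rho, T):
    """Open-loop action sequence maximizing the Lipschitz lower bound
    (finite actions, deterministic setting, one-dimensional states).
    sample[l] = (x^l, u^l, r^l, y^l)."""
    n = len(sample)
    # Assumed constants L_{Q_k} = L_rho * sum_{i<k} L_f**i; L_Q[k-1] holds L_{Q_k}.
    L_Q = [L_rho * sum(L_f ** i for i in range(k)) for k in range(1, T + 1)]
    B = [0.0] * n                                  # value of the best tail after time t+1
    best_next = [[0] * n for _ in range(T)]        # back-pointers D(i, t+1)
    for t in range(T - 2, -1, -1):
        A = [0.0] * n
        for i in range(n):
            y_i = sample[i][3]
            scores = [sample[j][2] - L_Q[T - t - 2] * abs(y_i - sample[j][0]) + B[j]
                      for j in range(n)]
            j0 = max(range(n), key=lambda j: scores[j])
            A[i] = scores[j0]
            best_next[t + 1][i] = j0               # best tuple at time t+1 if in tuple i at time t
        B = A
    # Best first transition (time 0) and backtracking of the action sequence.
    first = [sample[l][2] - L_Q[T - 1] * abs(x0 - sample[l][0]) + B[l] for l in range(n)]
    l = max(range(n), key=lambda i: first[i])
    actions = [sample[l][1]]
    for t in range(T - 1):
        l = best_next[t + 1][l]
        actions.append(sample[l][1])
    return actions
```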
Theorem 3 (Convergence of π̂*(ℱn, x0))
Let Π*(x0) be the set of optimal open-loop policies:
Π*(x0) = arg max_{π ∈ Π} J^π(x0),
Algorithm 2 CGRL algorithm.
Input: ℱn = {(x^l, u^l, r^l, y^l)}_{l=1}^{n}, initial state x0, Lipschitz constants Lf and Lρ, horizon T;
Initialization:
  D ← n × (T − 1) matrix initialized to zero;
  A ← n–dimensional vector initialized to zero;
  B ← n–dimensional vector initialized to zero;
Computation of the Lipschitz constants L_{Q_1}, …, L_{Q_T}:
  L_{Q_1} ← Lρ;
  for k = 2 … T do
    L_{Q_k} ← Lρ + Lf L_{Q_{k−1}};
  end for
t ← T − 2;
while t > −1 do
  for i = 1 … n do
    m0 ← max_{j ∈ {1,…,n}} ( r^j − L_{Q_{T−t−1}} ‖y^i − x^j‖𝒳 + B(j) );
    j0 ← the lowest index j achieving this maximum;
    A(i) ← m0;
    D(i, t + 1) ← j0; \\ best tuple at t + 1 if in tuple i at time t
  end for
  B ← A;
  t ← t − 1;
end while
Conclusion:
  S ← (T + 1)–length vector of actions initialized to zero;
  l ← arg max_{l ∈ {1,…,n}} ( r^l − L_{Q_T} ‖x0 − x^l‖𝒳 + B(l) ); \\ best lower bound
  S(1) ← u^l; \\ CGRL action for t = 0
  for t = 0 … T − 2 do
    l′ ← D(l, t + 1);
    S(t + 2) ← u^{l′}; \\ other CGRL actions
    l ← l′;
  end for
Return: S
and let us suppose that Π*(x0) ≠ Π (if Π*(x0) = Π, the search for an optimal policy is indeed trivial). We define the optimality gap
ε(x0) = min_{π ∈ Π \ Π*(x0)} ( max_{π′ ∈ Π} J^{π′}(x0) − J^π(x0) ) > 0.
Then, there exists a constant C > 0, depending only on Lf, Lρ and T, such that
C α*(ℱn) < ε(x0) ⟹ π̂*(ℱn, x0) ∈ Π*(x0).
The proof of this result is also given in [20].
• Illustration
We now illustrate the performance of the policy π̂*(ℱn, x0) computed by the CGRL algorithm on a variant of the puddle world benchmark introduced in [42]. In this benchmark, a robot whose goal is to collect high cumulated rewards navigates on a plane. A puddle stands in between the initial position of the robot and the high-reward area (see Figure 8). If the robot is in the puddle, it gets highly negative rewards. An optimal navigation strategy drives the robot around the puddle to reach the high-reward area. Two datasets of one-step transitions have been used in our example. The first set ℱ contains elements that uniformly cover the area of the state space that can be reached within T steps. The set ℱ′ has been obtained by removing from ℱ the elements corresponding to the highly negative rewards. The full specification of the benchmark and the exact procedure for generating ℱ and ℱ′ are given in [20]. In Figure 9, we have drawn the trajectory of the robot when following the CGRL policy. Every state encountered is represented by a white square. The plane upon which the robot navigates has been colored such that the darker the area, the smaller the corresponding rewards. In particular, the puddle area is colored in dark grey/black. We see that the CGRL policy drives the robot around the puddle to reach the high-reward area, which is represented by the light-grey circles.
Fig. 8.
The Puddle World benchmark. Starting from x0, an agent has to avoid the puddles and navigate towards the goal.
Fig. 9.
CGRL with ℱ.
Figure 10 represents the policy inferred from ℱ by using the (finite-time version of the) Fitted Q Iteration (FQI) algorithm combined with extremely randomized trees as function approximators [13]. The trajectories computed by the CGRL and FQI policies are very similar, and so are the sums of rewards obtained by following these two trajectories. However, when using ℱ′ rather than ℱ, the CGRL and FQI policies no longer lead to similar trajectories, as shown in Figures 11 and 12. Indeed, while the CGRL policy still drives the robot around the puddle to reach the high-reward area, the FQI policy makes the robot cross the puddle. In terms of optimality, this latter navigation strategy is much worse. The difference between the two navigation strategies can be explained as follows. The FQI algorithm behaves as if it were associating to areas of the state space that are not covered by the input sample the properties of the elements of this sample that are located in the neighborhood of these areas. This explains why it computes a policy that makes the robot cross the puddle. The same behavior could probably be observed with other algorithms that combine dynamic programming strategies with kernel-based approximators or averagers [5,25,37]. The CGRL policy generalizes the information contained in the dataset by assuming, given the initial state, the most adverse behavior for the environment according to its weak prior knowledge about the environment. This penalizes sequences of decisions that could drive the robot into areas not well covered by the sample, and explains why the CGRL policy drives the robot around the puddle when run with ℱ′.
Fig. 10.
FQI with ℱ.
Fig. 11.
CGRL with ℱ′.
Fig. 12.
FQI with ℱ′.
4.4.1 Taking Advantage of Optimal Trajectories
In this section, we give another result which shows that, when an optimal trajectory can be found in the sample of system transitions, the policy computed by the CGRL algorithm is also optimal.
Theorem 4 (Optimal Policies Computed from Optimal Trajectories)
Let π* ∈ Π*(x0) be an optimal open-loop policy. Let us assume that one can find in ℱn a sequence of T one-step system transitions [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T−1} such that
x^{l_0} = x0, u^{l_t} = π*(t) ∀t ∈ {0, …, T − 1} and x^{l_{t+1}} = y^{l_t} ∀t ∈ {0, …, T − 2}.
Let π̂*(ℱn, x0) be such that
π̂*(ℱn, x0) ∈ arg max_{π ∈ Π} L^π(ℱn, x0).
Then,
π̂*(ℱn, x0) ∈ Π*(x0).
Proof Let us write π̂* for π̂*(ℱn, x0) and prove the result by contradiction: assume that π̂* is not optimal. Since π* is optimal, one has:
J^{π*}(x0) > J^{π̂*}(x0). (1)
Let us now consider the lower bound on the return of the policy π* computed from the sequence of transitions [(x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})]_{t=0}^{T−1}. By construction of this sequence of transitions, all the distance terms ‖y^{l_{t−1}} − x^{l_t}‖𝒳 vanish, and we have:
L^{π*}(ℱn, x0) ≥ Σ_{t=0}^{T−1} r^{l_t} = J^{π*}(x0).
By definition of the policy π̂*, we have:
L^{π̂*}(ℱn, x0) ≥ L^{π*}(ℱn, x0) ≥ J^{π*}(x0). (2)
Since L^{π̂*}(ℱn, x0) is a lower bound on the return of π̂*, we have:
J^{π̂*}(x0) ≥ L^{π̂*}(ℱn, x0). (3)
Combining inequalities (2) and (3) yields J^{π̂*}(x0) ≥ J^{π*}(x0), a contradiction with inequality (1).
4.5 Rebuilding Artificial Trajectories for Designing Sampling Strategies
We suppose in this section that additional system transitions can be generated, and we detail hereafter a sampling strategy to select state-action pairs (x, u) for generating f(x, u) and ρ(x, u) so as to be able to discriminate rapidly – as new one-step transitions are generated – between optimal and non-optimal policies from Π. This strategy is directly based on the previously described bounds.
Before describing our proposed sampling strategy, let us introduce a few definitions. First, note that a policy can only be optimal given a set of one-step transitions ℱ if its upper bound is not lower than the lower bound of any element of Π. We qualify such policies as "candidate optimal policies given ℱ" and we denote by Π(ℱ, x0) the set of policies which satisfy this property:
Definition 18 (Candidate Optimal Policies Given ℱ)
Π(ℱ, x0) = { π ∈ Π | U^π(ℱ, x0) ≥ max_{π′ ∈ Π} L^{π′}(ℱ, x0) }.
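Given routines lower(π) and upper(π) returning L^π(ℱ, x0) and U^π(ℱ, x0) (for instance built on the bound sketches above), the candidate set of Definition 18 reduces to a simple filter. The sketch below (Python) enumerates Π explicitly, which is only feasible for small finite action spaces and short horizons.

```python
from itertools import product

def candidate_optimal_policies(actions, T, lower, upper):
    """Policies whose upper bound is not lower than the best lower bound
    (Definition 18). lower(pi) and upper(pi) are assumed to return
    L^pi(F, x0) and U^pi(F, x0) for an open-loop policy pi."""
    policies = list(product(actions, repeat=T))    # all open-loop policies Pi
    best_lower = max(lower(pi) for pi in policies)
    return [pi for pi in policies if upper(pi) >= best_lower]
```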
We also define the set of "compatible transitions given ℱ" as follows:
Definition 19 (Compatible Transitions Given ℱ)
A transition (x, u, r, y) ∈ 𝒳 × 𝒰 × ℝ × 𝒳 is said to be compatible with the set of transitions ℱ if it does not contradict the Lipschitz continuity of f and ρ, i.e., if for every transition (x^l, u^l, r^l, y^l) ∈ ℱ with u^l = u:
|r − r^l| ≤ Lρ ‖x − x^l‖𝒳 and ‖y − y^l‖𝒳 ≤ Lf ‖x − x^l‖𝒳.
We denote by 𝒞(ℱ) ⊂ 𝒳 × 𝒰 × ℝ × 𝒳 the set that gathers all transitions that are compatible with the set of transitions ℱ.
Our sampling strategy generates new one-step transitions iteratively. Given an existing set ℱm of m ∈ ℕ\{0} one-step transitions, made of the elements of the initial set ℱn and of the m − n one-step transitions generated during the first m − n iterations of the algorithm, it selects as next sampling point (x^{m+1}, u^{m+1}) ∈ 𝒳 × 𝒰 the point that minimizes, in the worst case over the transitions that are compatible with ℱm, the largest bound width among the candidate optimal policies at the next iteration:
(x^{m+1}, u^{m+1}) ∈ arg min_{(x,u) ∈ 𝒳×𝒰} max_{(x,u,r,y) ∈ 𝒞(ℱm)} max_{π ∈ Π(ℱm ∪ {(x,u,r,y)}, x0)} ( U^π(ℱm ∪ {(x,u,r,y)}, x0) − L^π(ℱm ∪ {(x,u,r,y)}, x0) ).
Based on the convergence properties of the bounds, we conjecture that the sequence (Π(ℱm, x0))_{m ∈ ℕ} converges towards the set of all optimal policies in a finite number of iterations:
∃m0 ∈ ℕ : ∀m ≥ m0, Π(ℱm, x0) = Π*(x0).
The analysis of the theoretical properties of this sampling strategy and its empirical validation are left for future work.
• Illustration
In order to illustrate how the bound-based sampling strategy detailed above allows discriminating among policies, we consider the following toy problem. The dynamics and reward functions are deterministic, and the state space is included in ℝ. The action space contains five distinct actions, one of which is +0.20. We consider a time horizon T = 3, which induces 5³ = 125 different open-loop policies. The initial state is set to x0 = −0.65. In this configuration, there is only one optimal policy, which consists in applying the action +0.20 three times.
In the following experiments, the min and max operators appearing in the sampling criterion, whose exact computation is of huge complexity, are approximated using purely random search (i.e., by randomly generating feasible points and taking the optimum over those points). We begin with a small sample of n = 5 transitions (one for each action) and iteratively augment it using our bound-based sampling strategy. We compare our strategy with a uniform sampling strategy (starting from the same initial sample of 5 transitions). We plot in Figure 13 the evolution of the empirical average number of candidate optimal policies (over 50 runs) with respect to the cardinality of the generated sample of transitions². We empirically observe that the bound-based sampling strategy discriminates among policies faster than the uniform sampling strategy. In particular, we observe that, on average, the bound-based strategy using 40 samples provides discriminating performances that are equivalent to those of the uniform sampling strategy using 80 samples, which represents a significant gain. Note that in this specific benchmark, one should sample 5 + 25 + 125 = 155 state-action pairs (by trying all possible policies) in order to be sure to discriminate all non-optimal policies.
Fig. 13.
Evolution of the average number of candidate optimal policies with respect to the cardinality of the generated samples of transitions using our bound-based sampling strategy and a uniform sampling strategy (empirical average over 50 runs).
4.6 Summary
We synthesize in Figure 14 the different settings and the corresponding results that have been presented in Section 4. These results are classified into two main categories: stochastic setting and deterministic setting. Within each setting, we indicate the context (continuous/finite action space) and the nature of each result (theoretical result, algorithmic contribution, empirical evaluation) using a color code.
Fig. 14.
A schematic presentation of the results of Section 4.
5 Towards a New Paradigm for Batch Mode RL
In this concluding section, we highlight some connections between the approaches based on synthesizing artificial trajectories and a more standard batch mode RL algorithm, the FQI algorithm [13], when it is used for policy evaluation. From a technical point of view, we consider again in this section the stochastic setting that was formalized in Section 3. The action space 𝒰 is continuous and normed, and we consider a given closed-loop, time-varying, Lipschitz continuous control policy h : {0, …, T − 1} × 𝒳 → 𝒰.
5.1 Fitted Q Iteration for Policy Evaluation
The finite-horizon FQI algorithm for policy evaluation (FQI-PE) works by recursively computing a sequence of state-action value functions Q̂^h_T, Q̂^h_{T−1}, …, Q̂^h_0 as follows:
Definition 20 (FQI-PE Algorithm)
• Q̂^h_T(x, u) = 0, ∀(x, u) ∈ 𝒳 × 𝒰.
• For t = T − 1 … 0, build the dataset
D = { ( (x^l, u^l), r^l + Q̂^h_{t+1}(y^l, h(t + 1, y^l)) ) }_{l=1}^{n}
(with the convention that the second term of the target is zero for t = T − 1) and use a regression algorithm RA to infer from D the function Q̂^h_t:
Q̂^h_t = RA(D).
The FQI-PE estimator of the expected return of the policy h is given by:
Definition 21 (FQI Estimator)
Ĵ^h_{FQI}(ℱn, x0) = Q̂^h_0(x0, h(0, x0)).
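A sketch of Definition 20 with a pluggable regressor is given below (Python). The fit/predict regressor interface is an assumption made for the example; any supervised learning method exposing it could be plugged in as RA.

```python
def fqi_policy_evaluation(sample, policy, x0, T, make_regressor):
    """FQI-PE sketch: fit Q^h_t for t = T-1..0 and return Q^h_0(x0, h(0, x0)).
    sample[l] = (x^l, u^l, r^l, y^l); make_regressor() must return an object
    with fit(inputs, targets) and predict(inputs) methods (assumed interface)."""
    q_next = None                                  # Q^h_T is identically zero
    for t in range(T - 1, -1, -1):
        inputs, targets = [], []
        for (x, u, r, y) in sample:
            cont = 0.0
            if q_next is not None:                 # add Q^h_{t+1}(y^l, h(t+1, y^l))
                cont = q_next.predict([(y, policy(t + 1, y))])[0]
            inputs.append((x, u))
            targets.append(r + cont)
        q_t = make_regressor()                     # the regression algorithm RA
        q_t.fit(inputs, targets)
        q_next = q_t
    return q_next.predict([(x0, policy(0, x0))])[0]
```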
5.2 FQI using k–Nearest Neighbor Regressors: an Artificial Trajectory Viewpoint
We propose in this section to use a k–Nearest Neighbors (k–NN) algorithm as the regression algorithm RA. In the following, for a given state-action couple (x, u) ∈ 𝒳 × 𝒰, we denote by l_i(x, u) the lowest index in ℱn of the i-th nearest one-step transition from the state-action couple (x, u) according to the distance metric Δ. Using this notation, the k–NN based FQI-PE algorithm for estimating the expected return of the policy h works as follows:
Definition 22 (k–NN FQI-PE Algorithm)
• Q̂^h_T(x, u) = 0, ∀(x, u) ∈ 𝒳 × 𝒰.
• For t = T − 1 … 0, ∀(x, u) ∈ 𝒳 × 𝒰,
Q̂^h_t(x, u) = (1/k) Σ_{i=1}^{k} ( r^{l_i(x,u)} + Q̂^h_{t+1}( y^{l_i(x,u)}, h(t + 1, y^{l_i(x,u)}) ) ).
The k–NN FQI-PE estimator of the expected return of the policy h is given by:
Ĵ^h_{kNN}(ℱn, x0) = Q̂^h_0(x0, h(0, x0)).
One can observe that, for a fixed initial state x0, the computation of the k–NN FQI-PE estimator of h works by identifying (k + k² + … + k^T) non-unique one-step transitions. These transitions are non-unique in the sense that some transitions can be selected several times during the process. In order to concisely denote the indexes of the one-step system transitions that are selected during the k–NN FQI-PE algorithm, we introduce the notation l^{i_0,i_1,…,i_t} to refer to the transition of index l_{i_t}( y^{l^{i_0,…,i_{t−1}}}, h(t, y^{l^{i_0,…,i_{t−1}}}) ) for i_0, …, i_t ∈ {1, …, k}, t ≥ 1, with l^{i_0} = l_{i_0}(x0, h(0, x0)). Using these notations, we illustrate the computation of the k–NN FQI-PE estimator in Figure 15. Then, we have the following result:
Fig. 15.
Illustration of the k–NN value iteration algorithm in terms of artificial trajectories.
Proposition 6 (k–NN FQI-PE Using Artificial Trajectories)
Ĵ^h_{kNN}(ℱn, x0) = (1/k^T) Σ_{(i_0,…,i_{T−1}) ∈ {1,…,k}^T} Σ_{t=0}^{T−1} r^{l^{i_0,…,i_t}},
where the set of rebuilt artificial trajectories [(x^{l^{i_0}}, u^{l^{i_0}}, r^{l^{i_0}}, y^{l^{i_0}}), …, (x^{l^{i_0,…,i_{T−1}}}, u^{l^{i_0,…,i_{T−1}}}, r^{l^{i_0,…,i_{T−1}}}, y^{l^{i_0,…,i_{T−1}}})], (i_0,…,i_{T−1}) ∈ {1,…,k}^T, is such that, ∀t ∈ {0, …, T − 1}, ∀(i_0, …, i_t) ∈ {1, …, k}^{t+1}, the couple (x^{l^{i_0,…,i_t}}, u^{l^{i_0,…,i_t}}) is the i_t–th nearest neighbor (in the sense of Δ) of (y^{l^{i_0,…,i_{t−1}}}, h(t, y^{l^{i_0,…,i_{t−1}}})) for t ≥ 1, and of (x0, h(0, x0)) for t = 0.
Proof We prove by induction on t the property H_t: for all (x, u) ∈ 𝒳 × 𝒰, Q̂^h_{T−t}(x, u) is the average, over the k^t index sequences (i_1, …, i_t) ∈ {1, …, k}^t, of the sums of rewards collected along the corresponding nested nearest-neighbor selections starting from (x, u).
Basis: according to the definition of the k–NN estimator, we have
Q̂^h_{T−1}(x, u) = (1/k) Σ_{i=1}^{k} r^{l_i(x,u)},
which proves H_1.
Induction step: let us assume that H_t is true for some t ∈ {1,…, T − 1}. According to the definition of the k–NN FQI-PE algorithm, we have
Q̂^h_{T−t−1}(x, u) = (1/k) Σ_{i=1}^{k} ( r^{l_i(x,u)} + Q̂^h_{T−t}( y^{l_i(x,u)}, h(T − t, y^{l_i(x,u)}) ) ).
Substituting the expression of Q̂^h_{T−t} given by H_t proves H_{t+1}.
The proof is completed by observing that Ĵ^h_{kNN}(ℱn, x0) = Q̂^h_0(x0, h(0, x0)) and that the nearest-neighbor matching property of the selected transitions stated in Proposition 6 directly comes from the use of k–NN function approximators.
The previous result shows that the estimate of the expected return of the policy h computed by the k–NN FQI-PE algorithm is the average of the returns of k^T artificial trajectories. These artificial trajectories are built from (k + k² + … + k^T) non-unique one-step system transitions from ℱn, which are also chosen by minimizing the distance between two successive one-step transitions.
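For concreteness, the k-NN instantiation of Definition 22 can be written as a direct recursion (Python sketch, same sample layout and illustrative distance as before). The recursion enumerates exactly the (k + k² + … + k^T) nearest-neighbor selections discussed above, so it is meant for illustration rather than efficiency.

```python
def knn_fqi_estimate(sample, policy, x0, T, k, dist=None):
    """k-NN FQI-PE estimate of J^h(x0), written as the recursion of Definition 22."""
    if dist is None:
        dist = lambda x, u, xl, ul: abs(x - xl) + abs(u - ul)

    def nearest_indices(x, u):
        # k lowest indices, ordered by increasing distance to (x, u).
        order = sorted(range(len(sample)),
                       key=lambda l: (dist(x, u, sample[l][0], sample[l][1]), l))
        return order[:k]

    def q(t, x, u):
        # q(t, x, u) plays the role of \hat{Q}^h_t(x, u).
        total = 0.0
        for l in nearest_indices(x, u):
            xl, ul, rl, yl = sample[l]
            tail = 0.0 if t + 1 == T else q(t + 1, yl, policy(t + 1, yl))
            total += rl + tail
        return total / k

    return q(0, x0, policy(0, x0))
```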
• Illustration. We empirically compare the MFMC estimator with the k–NN FQI-PE estimator on the toy problem presented in Section 4.2, but with a smaller time horizon T = 5. For a fixed cardinality n = 100, we consider all possible values of the parameters k (k ∈ {1,…, 100}, since there are at most n nearest neighbors) and p (p ∈ {1,…, 20}, since one can generate at most n/T different artificial trajectories without re-using transitions). For each value of p (resp. k), we generate 1000 samples of transitions using a uniform random distribution over the state-action space. For each sample, we run the MFMC estimator (resp. the k–NN FQI-PE estimator). As a baseline comparison, we also compute 1000 runs of the MC estimator for every value of p. Figure 16 (resp. 17 and 18) reports the obtained empirical average (resp. variance and mean squared error).
Fig. 16.
Empirical average observed for the MC estimator, the MFMC estimator and the k–NN FQI-PE estimator for different values of k and p (k ∈ {1,…, 100}, p ∈ {1,…, 20}, 1000 runs for each value of k, p).
We observe in Figure 16 that (i) the MFMC estimator with p ∈ {1,…, 3} is less biased than the k–NN FQI-PE estimator with any value of k ∈ {1,…, 100} and (ii) the bias of the MFMC estimator increases faster (with respect to p) than the bias of the k–NN FQI-PE estimator (with respect to k). The increase of the bias of the MFMC estimator with respect to p is suggested by Theorem 1, where an upper bound on the bias that increases with p is provided. This phenomenon seems to affect the k–NN FQI-PE estimator (with respect to k) to a lesser extent. In Figure 17, we observe that the k–NN FQI-PE estimator has a variance that is higher than that of the MFMC estimator for any k = p. This may be explained by the fact that for samples of n = 100 transitions, one-step transitions are often re-used by the k–NN FQI-PE estimator, which generates dependence between artificial trajectories. We finally plot in Figure 18 the observed empirical mean squared error (sum of the squared empirical bias and empirical variance) and observe that in our specific setting, the MFMC estimator offers for values of p ∈ {1,2,3,4} a better bias versus variance compromise than the k–NN FQI-PE estimator with any value of k.
Fig. 17.
Empirical variance observed for the MC estimator, the MFMC estimator and the k–NN FQI-PE estimator for different values of k and p (k ∈ {1,…, 100}, p ∈ {1,…, 20}, 1000 runs for each value of k, p).
Fig. 18.
Empirical mean square error observed for the MC estimator, the MFMC estimator and the k–NN FQI-PE estimator for different values of k and p (k ∈ {1,…, 100}, p ∈ {1,…, 20}, 1000 runs for each value of k, p).
5.3 Kernel-based and Other Averaging type Regression Algorithms
The results presented in Section 5.2 can be extended to the case where the FQI-PE algorithm is combined with kernel-based regressors and, in particular, tree-based regressors. In such a context, the sequence of functions Q̂^h_t is computed as follows:
Definition 23 (KB FQI-PE Algorithm)
• Q̂^h_T(x, u) = 0, ∀(x, u) ∈ 𝒳 × 𝒰.
• For t = T − 1 … 0, ∀(x, u) ∈ 𝒳 × 𝒰,
Q̂^h_t(x, u) = Σ_{l=1}^{n} k_{ℱn}((x, u), (x^l, u^l)) ( r^l + Q̂^h_{t+1}(y^l, h(t + 1, y^l)) ),
with
k_{ℱn}((x, u), (x^l, u^l)) = Φ( Δ((x, u), (x^l, u^l)) / b_n ) / Σ_{j=1}^{n} Φ( Δ((x, u), (x^j, u^j)) / b_n ),
where Φ : [0, 1] → ℝ+ is a univariate non-negative "mother kernel" function and b_n > 0 is the bandwidth parameter. We also suppose that Φ(x) = 0 ∀x > 1.
The KB estimator of the expected return of the policy h is given by:
Ĵ^h_{KB}(ℱn, x0) = Q̂^h_0(x0, h(0, x0)).
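The normalized kernel factors of Definition 23 can be sketched as follows (Python); the triangular mother kernel Φ(z) = max(0, 1 − z) and the distance Δ are illustrative assumptions of this example.

```python
def kernel_weights(sample, x, u, bandwidth, dist=None, phi=None):
    """Normalized factors k_F((x, u), (x^l, u^l)) = Phi(Delta / b_n) / sum_j Phi(Delta_j / b_n);
    transitions farther than the bandwidth receive a zero weight."""
    if dist is None:
        dist = lambda x, u, xl, ul: abs(x - xl) + abs(u - ul)
    if phi is None:
        phi = lambda z: max(0.0, 1.0 - z)          # triangular mother kernel (illustrative)
    raw = [phi(dist(x, u, xl, ul) / bandwidth) for (xl, ul, _, _) in sample]
    total = sum(raw)
    if total == 0.0:                               # no transition within the bandwidth
        return [0.0] * len(raw)
    return [w / total for w in raw]
```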
Given an initial state x0 ∈ 𝒳, the computation of the KB FQI-PE algorithm can also be interpreted as the identification of a set of one-step transitions from ℱn. At each time step t, all the one-step transitions (x^l, u^l, r^l, y^l) that are not farther than a distance b_n from the current state-action couple (x_t, h(t, x_t)) are selected and weighted with a distance-dependent factor; the other one-step transitions are weighted with a factor equal to zero. This process is iterated from the output y^l of each selected one-step transition. An illustration is given in Figure 19. The value returned by the KB estimator can be expressed as follows:
Fig. 19.
Illustration of the KB value iteration algorithm in terms of artificial trajectories.
Proposition 7
Ĵ^h_{KB}(ℱn, x0) = Σ_{(l_0,…,l_{T−1}) ∈ {1,…,n}^T} w(l_0, …, l_{T−1}) Σ_{t=0}^{T−1} r^{l_t},
with
w(l_0, …, l_{T−1}) = Π_{t=0}^{T−1} k_{ℱn}( (y^{l_{t−1}}, h(t, y^{l_{t−1}})), (x^{l_t}, u^{l_t}) ) and y^{l_{−1}} = x0.
Proof We prove by induction on t the property H_t: for all (x, u) ∈ 𝒳 × 𝒰, Q̂^h_{T−t}(x, u) is a weighted sum, over the index sequences (l_1, …, l_t) ∈ {1,…,n}^t, of the corresponding sums of rewards, the weight of a sequence being the product of the kernel factors collected along it.
Basis: one has
Q̂^h_{T−1}(x, u) = Σ_{l=1}^{n} k_{ℱn}((x, u), (x^l, u^l)) r^l,
which proves H_1.
Induction step: let us assume that H_t is true for some t ∈ {1,…, T − 1}. According to the KB FQI-PE algorithm, one has
Q̂^h_{T−t−1}(x, u) = Σ_{l=1}^{n} k_{ℱn}((x, u), (x^l, u^l)) ( r^l + Q̂^h_{T−t}( y^l, h(T − t, y^l) ) ).
Substituting the expression of Q̂^h_{T−t} given by H_t, and using the fact that the kernel factors k_{ℱn}(·, (x^j, u^j)) sum to one, proves H_{t+1}.
The proof is completed by observing that Ĵ^h_{KB}(ℱn, x0) = Q̂^h_0(x0, h(0, x0)).
One can observe from Proposition 7 that the KB estimate of the expected return of the policy h can be expressed as a weighted sum of the returns of n^T artificial trajectories. Each artificial trajectory [(x^{l_0}, u^{l_0}, r^{l_0}, y^{l_0}), …, (x^{l_{T−1}}, u^{l_{T−1}}, r^{l_{T−1}}, y^{l_{T−1}})] is weighted with the factor w(l_0, …, l_{T−1}). Note that some of these factors can be equal to zero. Similarly to the k–NN estimator, these artificial trajectories are also built from the T × n^T non-unique one-step system transitions from ℱn.
More generally, we believe that the notion of artificial trajectory could also be used to characterize other batch mode RL algorithms that rely on other kinds of “averaging” schemes [24].
6 Conclusion
In this paper we have revisited recent works based on the idea of synthesizing artificial trajectories in the context of batch mode reinforcement learning. This paradigm proves to be valuable for designing new algorithms and performance analysis techniques. We think that it is of interest to revisit in this light the existing batch mode reinforcement learning algorithms based on function approximators, in order to analyze their behavior and possibly create new variants presenting interesting performance guarantees.
Acknowledgments
Raphael Fonteneau is a Post-doctoral Fellow of the F.R.S.-FNRS. This paper presents research results of the European excellence network PASCAL2 and of the Belgian Network DYSCO, funded by the Interuniversity Attraction Poles Programme, initiated by the Belgian State, Science Policy Office. We also acknowledge financial support from NIH grants P50 DA10075 and R01 MH080015. The scientific responsibility rests with its authors.
Footnotes
1. Here the fundamental assumption is that wt is independent of wt−1, wt−2, …, w0 given xt and ut; to simplify all notations and derivations, we furthermore impose that the disturbance process is time-invariant and does not depend on the states and actions xt, ut.
2. We have chosen to represent the average results obtained over 50 runs for both sampling methods rather than the results obtained over one single run since (i) the variance of the results obtained by uniform sampling is high and (ii) the variance of the results obtained by the bound-based approach is also significant, since the procedures for approximating the min and max operators rely on a random number generator.
Contributor Information
Raphael Fonteneau, Email: raphael.fonteneau@ulg.ac.be, University of Liège, Belgium.
Susan A. Murphy, Email: samurphy@umich.edu, University of Michigan, USA.
Louis Wehenkel, Email: l.wehenkel@ulg.ac.be, University of Liège, Belgium.
Damien Ernst, Email: dernst@ulg.ac.be, University of Liège, Belgium.
References
- 1. Antos A, Munos R, Szepesvári C. Fitted Q-iteration in continuous action space MDPs. Advances in Neural Information Processing Systems 20 (NIPS). 2007.
- 2. Bellman R. Dynamic Programming. Princeton University Press; 1957.
- 3. Bonarini A, Caccia C, Lazaric A, Restelli M. Batch reinforcement learning for controlling a mobile wheeled pendulum robot. Artificial Intelligence in Theory and Practice II. 2008:151–160.
- 4. Boyan J. Technical update: Least-squares temporal difference learning. Machine Learning. 2005;49:233–246.
- 5. Boyan J, Moore A. Generalization in reinforcement learning: Safely approximating the value function. Advances in Neural Information Processing Systems 7 (NIPS). MIT Press; Denver, CO, USA: 1995. pp. 369–376.
- 6. Bradtke S, Barto A. Linear least-squares algorithms for temporal difference learning. Machine Learning. 1996;22:33–57.
- 7. Busoniu L, Babuska R, De Schutter B, Ernst D. Reinforcement Learning and Dynamic Programming using Function Approximators. Taylor & Francis CRC Press; 2010.
- 8. Castelletti A, Galelli S, Restelli M, Soncini-Sessa R. Tree-based reinforcement learning for optimal water reservoir operation. Water Resources Research. 2010;46.
- 9. Castelletti A, de Rigo D, Rizzoli A, Soncini-Sessa R, Weber E. Neuro-dynamic programming for designing water reservoir network management policies. Control Engineering Practice. 2007;15(8):1031–1038.
- 10. Chakraborty B, Strecher V, Murphy S. Bias correction and confidence intervals for fitted Q-iteration. Workshop on Model Uncertainty and Risk in Reinforcement Learning, NIPS; Whistler, Canada: 2008.
- 11. Defourny B, Ernst D, Wehenkel L. Risk-aware decision making and dynamic programming. Workshop on Model Uncertainty and Risk in Reinforcement Learning, NIPS; Whistler, Canada: 2008.
- 12. Ernst D, Geurts P, Wehenkel L. Iteratively extending time horizon reinforcement learning. European Conference on Machine Learning (ECML). 2003:96–107.
- 13. Ernst D, Geurts P, Wehenkel L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research. 2005;6:503–556.
- 14. Ernst D, Glavic M, Capitanescu F, Wehenkel L. Reinforcement learning versus model predictive control: a comparison on a power system problem. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics. 2009;39:517–529. doi: 10.1109/TSMCB.2008.2007630.
- 15. Ernst D, Marée R, Wehenkel L. Reinforcement learning with raw image pixels as state input. International Workshop on Intelligent Computing in Pattern Analysis/Synthesis (IWICPAS), Lecture Notes in Computer Science. 2006;4153:446–454.
- 16. Ernst D, Stan G, Goncalves J, Wehenkel L. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. Machine Learning Conference of Belgium and The Netherlands (BeNeLearn). 2006:65–72.
- 17. Farahmand A, Ghavamzadeh M, Szepesvári C, Mannor S. Regularized fitted Q-iteration: Application to planning. In: Girgin S, Loth M, Munos R, Preux P, Ryabko D, editors. Recent Advances in Reinforcement Learning, Lecture Notes in Computer Science. Vol. 5323. Springer Berlin Heidelberg; 2008. pp. 55–68.
- 18. Fonteneau R. Contributions to Batch Mode Reinforcement Learning. PhD thesis, University of Liège; 2011.
- 19. Fonteneau R, Murphy S, Wehenkel L, Ernst D. Inferring bounds on the performance of a control policy from a sample of trajectories. IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL); Nashville, TN, USA: 2009.
- 20. Fonteneau R, Murphy S, Wehenkel L, Ernst D. A cautious approach to generalization in reinforcement learning. Second International Conference on Agents and Artificial Intelligence (ICAART); Valencia, Spain: 2010.
- 21. Fonteneau R, Murphy S, Wehenkel L, Ernst D. Generating informative trajectories by using bounds on the return of control policies. Workshop on Active Learning and Experimental Design 2010 (in conjunction with AISTATS 2010). 2010.
- 22. Fonteneau R, Murphy S, Wehenkel L, Ernst D. Model-free Monte Carlo-like policy evaluation. Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR: W&CP 9; Chia Laguna, Sardinia, Italy: 2010. pp. 217–224.
- 23. Fonteneau R, Murphy SA, Wehenkel L, Ernst D. Towards min max generalization in reinforcement learning. Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, January 2010, Revised Selected Papers. Communications in Computer and Information Science (CCIS). Springer, Heidelberg; 2011. pp. 61–77.
- 24. Gordon G. Stable function approximation in dynamic programming. Twelfth International Conference on Machine Learning (ICML). 1995:261–268.
- 25. Gordon G. Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University; 1999.
- 26. Guez A, Vincent R, Avoli M, Pineau J. Adaptive treatment of epilepsy via batch-mode reinforcement learning. Innovative Applications of Artificial Intelligence (IAAI). 2008.
- 27. Lagoudakis M, Parr R. Least-squares policy iteration. Journal of Machine Learning Research. 2003;4:1107–1149.
- 28. Lange S, Riedmiller M. Deep learning of visual control policies. European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN); Brugge, Belgium: 2010.
- 29. Lazaric A, Ghavamzadeh M, Munos R. Finite-sample analysis of least-squares policy iteration. Technical report, SEQUEL (INRIA) Lille - Nord Europe; 2010.
- 30. Lazaric A, Ghavamzadeh M, Munos R. Finite-sample analysis of LSTD. International Conference on Machine Learning (ICML). 2010:615–622.
- 31. Morimura T, Sugiyama M, Kashima H, Hachiya H, Tanaka T. Nonparametric return density estimation for reinforcement learning. 27th International Conference on Machine Learning (ICML); Haifa, Israel: June 21-25, 2010.
- 32. Morimura T, Sugiyama M, Kashima H, Hachiya H, Tanaka T. Parametric return density estimation for reinforcement learning. 26th Conference on Uncertainty in Artificial Intelligence (UAI); Catalina Island, California, USA: July 8-11, 2010. pp. 368–375.
- 33. Munos R, Szepesvári C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research. 2008:815–857.
- 34. Murphy S. Optimal dynamic treatment regimes. Journal of the Royal Statistical Society, Series B. 2003;65(2):331–366.
- 35. Murphy S, Van Der Laan M, Robins J. Marginal mean models for dynamic regimes. Journal of the American Statistical Association. 2001;96(456):1410–1423. doi: 10.1198/016214501753382327.
- 36. Nedić A, Bertsekas DP. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dynamic Systems. 2003;13:79–110. doi: 10.1023/A:1022192903948.
- 37. Ormoneit D, Sen S. Kernel-based reinforcement learning. Machine Learning. 2002;49(2-3):161–178.
- 38. Peters J, Vijayakumar S, Schaal S. Reinforcement learning for humanoid robotics. Third IEEE-RAS International Conference on Humanoid Robots. 2003. pp. 1–20.
- 39. Pietquin O, Tango F, Aras R. Batch reinforcement learning for optimizing longitudinal driving assistance strategies. Computational Intelligence in Vehicles and Transportation Systems (CIVTS), 2011 IEEE Symposium on. IEEE; 2011. pp. 73–79.
- 40. Riedmiller M. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. Sixteenth European Conference on Machine Learning (ECML); Porto, Portugal: 2005. pp. 317–328.
- 41. Robins J. A new approach to causal inference in mortality studies with a sustained exposure period – application to control of the healthy worker survivor effect. Mathematical Modelling. 1986;7(9-12):1393–1512.
- 42. Sutton R. Generalization in reinforcement learning: Successful examples using sparse coding. Advances in Neural Information Processing Systems 8 (NIPS). MIT Press; Denver, CO, USA: 1996. pp. 1038–1044.
- 43. Sutton R, Barto A. Reinforcement Learning. MIT Press; 1998.
- 44. Timmer S, Riedmiller M. Fitted Q iteration with CMACs. IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL). IEEE; 2007. pp. 1–8.
- 45. Tognetti S, Savaresi S, Spelta C, Restelli M. Batch reinforcement learning for semi-active suspension control. Control Applications (CCA) & Intelligent Control (ISIC), 2009. IEEE; 2009. pp. 582–587.