Published in final edited form as: J Mach Learn Res. 2005 Jul;6:1073–1097.

A Generalization Error for Q-Learning

Susan A. Murphy

Abstract

Planning problems that involve learning a policy from a single training set of finite horizon trajectories arise in both social science and medical fields. We consider Q-learning with function approximation for this setting and derive an upper bound on the generalization error. This upper bound is in terms of quantities minimized by a Q-learning algorithm, the complexity of the approximation space and an approximation term due to the mismatch between Q-learning and the goal of learning a policy that maximizes the value function.

Keywords: multistage decisions, dynamic programming, reinforcement learning, batch data

1. Introduction

In many areas of the medical and social sciences the following planning problem arises. A training set or batch of n trajectories of T + 1 decision epochs is available for estimating a policy. A decision epoch at time t, t = 0, 1, ..., T, is composed of information observed at time t, Ot, an action taken at time t, At, and a reward, Rt. For example, there are currently a number of ongoing large clinical trials for chronic disorders in which, each time an individual relapses, the individual is re-randomized to one of several further treatments (Schneider et al., 2001; Fava et al., 2003; Thall et al., 2000). These are finite horizon problems with T generally quite small (T = 2–4) and with a known exploration policy. Scientists want to estimate the best “strategies,” i.e. policies, for managing the disorder. Alternatively, the training set of n trajectories may be historical; for example, data in which clinicians and their patients are followed, with information about disease process, treatment burden and treatment decisions recorded through time. Again the goal is to estimate the best policy for managing the disease. Alternatively, consider either catalog merchandising or charitable solicitation; information about the client, and whether or not a solicitation is made and/or the form of the solicitation, is recorded through time (Simester, Sun and Tsitsiklis, 2003). The goal is to estimate the best policy for deciding which clients should receive a mailing and the form of the mailing. These latter planning problems can be viewed as infinite horizon problems, but only T decision epochs per client are recorded. If T is large, the rewards are bounded and the dynamics are stationary Markovian, then this finite horizon problem provides an approximation to the discounted infinite horizon problem (Kearns, Mansour and Ng, 2000).

These planning problems are characterized by unknown system dynamics and thus can also be viewed as learning problems. Note there is no access to a generative model, no online simulation model, and no ability to conduct offline simulation. All that is available is the n trajectories of T + 1 decision epochs. One approach to learning a policy in this setting is Q-learning (Watkins, 1989), since the actions in the training set are chosen according to a (non-optimal) exploration policy; Q-learning is an off-policy method (Sutton and Barto, 1998). When the observables are vectors of continuous variables or are otherwise of high dimension, Q-learning must be combined with function approximation.

The contributions of this paper are as follows. First, a version of Q-learning with function approximation, suitable for learning a policy with one training set of finite horizon trajectories and a large observation space, is introduced; this “batch” version of Q-learning processes the entire training set of trajectories prior to updating the approximations to the Q-functions. An incremental implementation of batch Q-learning results in one-step Q-learning with function approximation. Second, performance guarantees for this version of Q-learning are provided. These performance guarantees do not assume that the system dynamics are Markovian. The performance guarantees are upper bounds on the average difference in value functions, or more specifically the average generalization error. Here the generalization error for batch Q-learning is defined analogously to the generalization error in supervised learning (Schapire et al., 1998); it is the average difference in value when using the optimal policy as compared to using the greedy policy (from Q-learning) in generating a separate test set. The performance guarantees are analogous to performance guarantees available in supervised learning (Anthony and Bartlett, 1999).

The upper bounds on the average generalization error permit an additional contribution. These upper bounds illuminate the mismatch between Q-learning with function approximation and the goal of finding a policy maximizing the value function (see the remark following Lemma 2 and the third remark following Theorem 2). This mismatch occurs because the Q-learning algorithm with function approximation does not directly maximize the value function; rather this algorithm approximates the optimal Q-function within the constraints of the approximation space in a least squares sense. This point is discussed at some length in section 3 of Tsitsiklis and van Roy (1997).

In the process of providing an upper bound on the average generalization error, finite sample bounds on the difference in average values resulting from different policies are derived. There are three terms in the upper bounds. The first term is a function of the optimization criterion used in batch Q-learning, the second term is due to the complexity of the approximation space and the last term is an approximation error due to the above mentioned mismatch. The second term, which is a function of the complexity of the approximation space, is similar in form to generalization error bounds derived for supervised learning with neural networks as in Anthony and Bartlett (1999). From the work of Kearns, Mansour, and Ng (1999, 2000) and Peshkin and Shelton (2002), we expect, and indeed find here, that the number of trajectories needed to guarantee a specified error level is exponential in the horizon time, T. The upper bound does not depend on the dimension of the observables Ot. This is in contrast to the results of Fiechter (1994, 1997), in which the upper bound on the average generalization error depends on the number of possible values for the observables.

A further contribution is that the upper bound on the average generalization error provides a mechanism for generalizing ideas from supervised learning to reinforcement learning. First, if the optimal Q-function belongs to the approximation space, then the upper bounds imply that batch Q-learning is a PAC reinforcement learning algorithm as in Fiechter (1994, 1997); see the first remark following Theorem 1. Second, the upper bounds provide a starting point for using structural risk minimization for model selection (see the second remark after Theorem 1).

In Section 2, we review the definition of the value function and Q-function for a (possibly non-stationary, non-Markovian) finite horizon decision process. In Section 3 we review batch Q-learning with function approximation when the learning algorithm must use a training set of n trajectories. In Section 4 the generalization error is expressed in terms of advantages, and in Section 5 we provide the two main results, both of which give the number of trajectories needed to achieve a given error level with a specified level of certainty.

2. Preliminaries

In the following we use upper case letters, such as O and A, to denote random variables and lower case letters, such as o and a, to denote realizations or values of the random variables. Each of the n trajectories is composed of the sequence {O0, A0, O1, . . . , AT, OT+1} where T is a finite constant. Define the history $\bar O_t = \{O_0, \dots, O_t\}$ and similarly $\bar A_t = \{A_0, \dots, A_t\}$. Each action At takes values in a finite, discrete action space $\mathcal{A}$ and Ot takes values in the observation space $\mathcal{O}$. The observation space may be multidimensional and continuous. The arguments below will not require the Markovian assumption that the value of Ot equals the state at time t. The rewards are $R_t = r_t(\bar O_t, \bar A_t, O_{t+1})$ for rt a reward function and for each 0 ≤ t ≤ T (if the Markov assumption holds then replace the history $\bar O_t$ with $O_t$ and $\bar A_t$ with $A_t$). We assume that the rewards are bounded, taking values in the interval [0, 1].
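As a concrete illustration of this data structure, the following shows one trajectory with T = 2 stored in a simple dictionary format; this format and its field names are illustrative assumptions, not part of the paper.

```python
# One trajectory {O_0, A_0, O_1, A_1, O_2, A_2, O_3} with T = 2, stored in a
# hypothetical dictionary format (field names are illustrative only).
trajectory = {
    "obs":     [0.7, 0.2, 0.9, 0.4],   # O_0, O_1, O_2, O_{T+1}: length T + 2
    "actions": [1, 0, 1],              # A_0, A_1, A_2: length T + 1, values in a finite action space
    "rewards": [0.0, 0.5, 1.0],        # R_0, R_1, R_2: length T + 1, each in [0, 1]
}
```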

We assume the trajectories are sampled at random according to a fixed distribution denoted by P. Thus the trajectories are generated by one fixed distribution. This distribution is composed of the unknown distribution of each Ot conditional on $(\bar O_{t-1}, \bar A_{t-1})$ (call these unknown conditional densities $\{f_0, \dots, f_{T+1}\}$) and an exploration policy for generating the actions. Denote the exploration policy by $p = \{p_0, \dots, p_T\}$, where the probability that action a is taken given history $\{\bar O_t, \bar A_{t-1}\}$ is $p_t(a \mid \bar O_t, \bar A_{t-1})$ (if the Markov assumption holds then, as before, replace $\bar O_t$ with $O_t$ and $\bar A_{t-1}$ with $A_{t-1}$). We assume that $p_t(a \mid \bar o_t, \bar a_{t-1}) > 0$ for each action $a \in \mathcal{A}$ and for each possible value $(\bar o_t, \bar a_{t-1})$; that is, at each time all actions are possible. Then the likelihood (under P) of the trajectory $\{o_0, a_0, o_1, \dots, a_T, o_{T+1}\}$ is

$$f_0(o_0)\, p_0(a_0 \mid o_0) \prod_{t=1}^{T} f_t(o_t \mid \bar o_{t-1}, \bar a_{t-1})\, p_t(a_t \mid \bar o_t, \bar a_{t-1})\, f_{T+1}(o_{T+1} \mid \bar o_T, \bar a_T).$$ (1)

Denote expectations with respect to the distribution P by an E.

Define a deterministic, but possibly non-stationary and non-Markovian, policy, π, as a sequence of decision rules, $\{\pi_0, \dots, \pi_T\}$, where the output of the time t decision rule, $\pi_t(\bar o_t, \bar a_{t-1})$, is an action. Let the distribution Pπ denote the distribution of a trajectory in which the policy π is used to generate the actions. Then the likelihood (under Pπ) of the trajectory $\{o_0, a_0, o_1, \dots, a_T, o_{T+1}\}$ is

$$f_0(o_0)\, \mathbf{1}_{a_0 = \pi_0(o_0)} \prod_{j=1}^{T} f_j(o_j \mid \bar o_{j-1}, \bar a_{j-1})\, \mathbf{1}_{a_j = \pi_j(\bar o_j, \bar a_{j-1})}\, f_{T+1}(o_{T+1} \mid \bar o_T, \bar a_T)$$ (2)

where for a predicate W , 1W is 1 if W is true and is 0 otherwise. Denote expectations with respect to the distribution Pπ by an Eπ.

Note that since (1) and (2) differ only in regard to the policy for generating actions, an expectation with respect to either P or Pπ that does not involve integration over the policy results in the same quantity. For example, E [Rt |Ot , At ] = Eπ [Rt |Ot , At ], for any policy π.

Let Π be the collection of all policies. In a finite horizon planning problem (permitting non-stationary, non-Markovian policies) the goal is to estimate a policy that maximizes $E_\pi\big[\sum_{j=0}^{T} R_j \mid O_0 = o_0\big]$ over π ∈ Π. If the system dynamics are Markovian and each $r_j(o_j, a_j, o_{j+1}) = \gamma^j r(o_j, a_j, o_{j+1})$ for r a bounded reward function and γ ∈ (0, 1) a discount factor, then this finite horizon problem provides an approximation to the discounted infinite horizon problem (Kearns, Mansour and Ng, 2000) for T large.

Given a policy, π, the value function for an observation, o0, is

$$V_\pi(o_0) = E_\pi\!\left[\sum_{j=0}^{T} R_j \,\middle|\, O_0 = o_0\right].$$

The t-value function for policy π is the value of the rewards summed from time t on and is

$$V_{\pi,t}(\bar o_t, \bar a_{t-1}) = E_\pi\!\left[\sum_{j=t}^{T} R_j \,\middle|\, \bar O_t = \bar o_t,\, \bar A_{t-1} = \bar a_{t-1}\right].$$

If the Markovian assumption holds then (ot , at−1) in the definition of Vπ,t is replaced by ot . Note that the time 0 value function is simply the value function (Vπ,0 = Vπ). For convenience, set Vπ,T +1 = 0. Then the value functions satisfy the following relationship:

$$V_{\pi,t}(\bar o_t, \bar a_{t-1}) = E_\pi\!\left[R_t + V_{\pi,t+1}(\bar O_{t+1}, \bar A_t) \,\middle|\, \bar O_t = \bar o_t,\, \bar A_{t-1} = \bar a_{t-1}\right]$$

for t = 0,...,T . The time t Q-function for policy π is

$$Q_{\pi,t}(\bar o_t, \bar a_t) = E\!\left[R_t + V_{\pi,t+1}(\bar O_{t+1}, \bar A_t) \,\middle|\, \bar O_t = \bar o_t,\, \bar A_t = \bar a_t\right].$$

(The subscript, π, can be omitted as this expectation is with respect to the distribution of Ot+1 given (Ot , At ), ft+1; this conditional distribution does not depend on the policy.) In Section 4 we express the difference in value functions for policy π˜ and policy π in terms of the advantages (as defined in Baird, 1993). The time t advantage is

$$\mu_{\pi,t}(\bar o_t, \bar a_t) = Q_{\pi,t}(\bar o_t, \bar a_t) - V_{\pi,t}(\bar o_t, \bar a_{t-1}).$$

The advantage can be interpreted as the gain in performance obtained by taking action at at time t and following policy π thereafter, as compared to following policy π from time t on.

The optimal value function V *(o) for an observation o is

$$V^*(o) = \max_{\pi \in \Pi} V_\pi(o)$$

and the optimal t-value function for history (ot , at−1) is

$$V_t^*(\bar o_t, \bar a_{t-1}) = \max_{\pi \in \Pi} V_{\pi,t}(\bar o_t, \bar a_{t-1}).$$

As is well-known, the optimal value functions satisfy the Bellman equations (Bellman, 1957)

$$V_t^*(\bar o_t, \bar a_{t-1}) = \max_{a_t \in \mathcal{A}} E\!\left[R_t + V_{t+1}^*(\bar O_{t+1}, \bar A_t) \,\middle|\, \bar O_t = \bar o_t,\, \bar A_t = \bar a_t\right].$$

Optimal, deterministic, time t decision rules must satisfy

$$\pi_t^*(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t \in \mathcal{A}} E\!\left[R_t + V_{t+1}^*(\bar O_{t+1}, \bar A_t) \,\middle|\, \bar O_t = \bar o_t,\, \bar A_t = \bar a_t\right].$$

The optimal time t Q-function is

$$Q_t^*(\bar o_t, \bar a_t) = E\!\left[R_t + V_{t+1}^*(\bar O_{t+1}, \bar A_t) \,\middle|\, \bar O_t = \bar o_t,\, \bar A_t = \bar a_t\right],$$

and thus the optimal time t advantage, which is given by

$$\mu_t^*(\bar o_t, \bar a_t) = Q_t^*(\bar o_t, \bar a_t) - V_t^*(\bar o_t, \bar a_{t-1}),$$

is always nonpositive and furthermore is maximized in $a_t$ at $a_t = \pi_t^*(\bar o_t, \bar a_{t-1})$.

3. Batch Q-Learning

We consider a version of Q-learning for use in learning a non-stationary, non-Markovian policy with one training set of finite horizon trajectories. The term “batch” Q-learning is used to emphasize that learning occurs only after the collection of the training set. The Q-functions are estimated using an approximator (e.g. neural networks, decision trees, etc.) (Bertsekas and Tsitsiklis, 1996; Tsitsiklis and van Roy, 1997) and then the estimated decision rules are the argmax of the estimated Q-functions. Let $\mathcal{Q}_t$ be the approximation space for the tth Q-function, e.g. $\mathcal{Q}_t = \{Q_t(\bar o_t, \bar a_t; \theta) : \theta \in \Theta\}$; θ is a vector of parameters taking values in a parameter space Θ which is a subset of a Euclidean space. For convenience set $\mathcal{Q}_{T+1}$ equal to zero and write $E_n f$ for the expectation of an arbitrary function, f, of a trajectory with respect to the probability obtained by choosing a trajectory uniformly from the training set of n trajectories (for example, $E_n[f(O_t)] = \frac{1}{n}\sum_{i=1}^{n} f(O_{it})$ for $O_{it}$ the tth observation in the ith trajectory). In batch Q-learning using dynamic programming and function approximation, solve the following backwards through time, t = T, T − 1, ..., 0, to obtain

$$\theta_t \in \arg\min_{\theta} E_n\!\left[R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}; \theta_{t+1}) - Q_t(\bar O_t, \bar A_t; \theta)\right]^2.$$ (3)

Suppose that the Q-functions are approximated by linear combinations of p features ($\mathcal{Q}_t = \{\theta^T q_t(\bar o_t, \bar a_t) : \theta \in \mathbb{R}^p\}$); then to achieve (3), solve backwards through time, t = T, T − 1, ..., 0,

$$0 = E_n\!\left[\left(R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}; \theta_{t+1}) - Q_t(\bar O_t, \bar A_t; \theta_t)\right) q_t(\bar O_t, \bar A_t)^T\right]$$ (4)

for $\theta_t$.
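To make the backward recursion in (3)–(4) concrete, the following is a minimal sketch in Python of batch Q-learning with linear function approximation. It assumes the hypothetical trajectory format shown earlier and hypothetical feature maps $q_t$; the names `batch_q_learning`, `trajectories` and `feature_fns` are illustrative, not from the paper. Each backward step solves the linear estimating equation (4) by least squares, with $Q_{T+1} \equiv 0$.

```python
import numpy as np

def batch_q_learning(trajectories, feature_fns, n_actions, T):
    """Sketch of batch Q-learning with linear function approximation: solve the
    estimating equation (4) backwards through time t = T, ..., 0 by least squares.

    trajectories: list of dicts with keys 'obs', 'actions', 'rewards' (hypothetical format).
    feature_fns:  feature_fns[t](obs_history, action_history, a_t) -> feature vector q_t.
    """
    thetas = [None] * (T + 1)
    for t in range(T, -1, -1):
        X, y = [], []
        for traj in trajectories:
            o, a, r = traj['obs'], traj['actions'], traj['rewards']
            # Backed-up target: R_t + max_{a'} Q_{t+1}(history_{t+1}, a'; theta_{t+1}),
            # with Q_{T+1} taken to be identically zero.
            if t == T:
                target = r[t]
            else:
                q_next = [feature_fns[t + 1](o[:t + 2], a[:t + 1], ap) @ thetas[t + 1]
                          for ap in range(n_actions)]
                target = r[t] + max(q_next)
            X.append(feature_fns[t](o[:t + 1], a[:t], a[t]))
            y.append(target)
        X, y = np.array(X), np.array(y)
        # Least-squares solution of E_n[(target - theta^T q_t) q_t^T] = 0.
        thetas[t], *_ = np.linalg.lstsq(X, y, rcond=None)
    return thetas
```

The greedy (estimated) decision rule at time t is then the argmax over actions of $\theta_t^T q_t$.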

An incremental implementation of (3) and (4), with updates between trajectories, results in one-step Q-learning (Sutton and Barto, 1998, pg. 148; put γ = 1, assume the Markov property and no need for function approximation). This is not surprising as Q-learning can be viewed as approximating least squares value iteration (Tsitsiklis and van Roy, 1996). To see the connection, consider the following generic derivation of an incremental update. Denote the ith example in a training set by Xi. Define $\hat\theta^{(n)}$ to be a solution of $\sum_{i=1}^{n} f(X_i, \theta) = 0$ for f a given p-dimensional vector of functions and each integer n. Using a Taylor series, expand $\sum_{i=1}^{n+1} f(X_i, \hat\theta^{(n+1)})$ in $\hat\theta^{(n+1)}$ about $\hat\theta^{(n)}$ to obtain a between-example update to $\hat\theta^{(n)}$:

$$\hat\theta^{(n+1)} \approx \hat\theta^{(n)} + \frac{1}{n+1}\left(E_{n+1}\!\left[-\frac{\partial}{\partial\theta} f(X, \theta)\Big|_{\theta=\hat\theta^{(n)}}\right]\right)^{-1} f(X_{n+1}, \hat\theta^{(n)}).$$

Replace $\frac{1}{n+1}\left(E_{n+1}\!\left[-\frac{\partial}{\partial\theta} f(X, \theta)\big|_{\theta=\hat\theta^{(n)}}\right]\right)^{-1}$ by a step-size $\alpha_n$ ($\alpha_n \to 0$ as $n \to \infty$) to obtain a general formula for the incremental implementation. Now consider an incremental implementation of (4) for each t = 0, ..., T. Then for each t, $X = (\bar O_{t+1}, \bar A_t)$, $\theta = \theta_t$ and

$$f(X, \theta_t) = \left(R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}; \hat\theta_{t+1}^{(n+1)}) - Q_t(\bar O_t, \bar A_t; \theta_t)\right) q_t(\bar O_t, \bar A_t)^T$$

is a vector of dimension p. The incremental update is

$$\hat\theta_t^{(n+1)} \approx \hat\theta_t^{(n)} + \alpha_n \left(R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}; \hat\theta_{t+1}^{(n+1)}) - Q_t(\bar O_t, \bar A_t; \hat\theta_t^{(n)})\right) q_t(\bar O_t, \bar A_t)^T$$

for t = 0,...,T . This is the one-step update of Sutton and Barto (1998, pg. 148) with γ = 1 and generalized to permit function approximation and nonstationary Q-functions and is analogous to the TD(0) update of Tsitsiklis and van Roy (1997) permitting non-Markovian, nonstationary value functions.
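A minimal sketch of this incremental (between-trajectory) update is given below, under the same assumptions as the earlier code block; the function and argument names are illustrative only.

```python
import numpy as np

def one_step_update(theta, theta_next, alpha, r_t, q_t_vec, q_next_all):
    """One incremental update of theta_t, following the display above with gamma = 1.

    q_t_vec:    feature vector q_t(O_t, A_t) for the observed action.
    q_next_all: array of feature vectors q_{t+1}(., a') for every candidate a'
                (pass an empty array when t = T, since Q_{T+1} is identically zero).
    """
    if len(q_next_all) > 0:
        backup = np.max(q_next_all @ theta_next)  # max_{a'} Q_{t+1}(.; theta_{t+1})
    else:
        backup = 0.0                              # Q_{T+1} identically zero
    td_error = r_t + backup - q_t_vec @ theta     # temporal-difference term
    return theta + alpha * td_error * q_t_vec     # step along the feature vector
```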

Denote the estimator of the optimal Q-functions based on the training data by $\hat Q_t$ for t = 0, ..., T (for simplicity, the dependence on n is omitted from the notation). The estimated policy, $\hat\pi$, satisfies $\hat\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} \hat Q_t(\bar o_t, \bar a_t)$ for each t. Note that members of the approximation space $\mathcal{Q}_t$ need not be “Q-functions” for any policy. For example, the Q-functions corresponding to the use of a policy π ($Q_{\pi,t}$, t = 0, ..., T) must satisfy

$$E\!\left[R_t + V_{\pi,t+1}(\bar O_{t+1}, \bar A_t) \,\middle|\, \bar O_t, \bar A_t\right] = Q_{\pi,t}(\bar O_t, \bar A_t)$$

where $V_{\pi,t+1}(\bar O_{t+1}, \bar A_t) = Q_{\pi,t+1}(\bar O_{t+1}, \bar A_t, a_{t+1})$ with $a_{t+1}$ set equal to $\pi_{t+1}(\bar O_{t+1}, \bar A_t)$. Q-learning does not impose this restriction on $\{\hat Q_t, t = 0, \dots, T\}$; indeed it may be that no member of the approximation space can satisfy this restriction. Nonetheless we refer to the $\hat Q_t$'s as estimated Q-functions. Note also that the approximation for the Q-functions, combined with the definition of the estimated decision rules as the argmax of the estimated Q-functions, places implicit restrictions on the set of policies that will be considered. In effect the space of interesting policies is no longer Π but rather $\Pi_{\mathcal{Q}} = \{\pi_\theta, \theta \in \Theta\}$ where $\pi_\theta = \{\pi_{0,\theta}, \dots, \pi_{T,\theta}\}$ and where each $\pi_{t,\theta}(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} Q_t(\bar o_t, \bar a_t; \theta)$ for some $Q_t \in \mathcal{Q}_t$.

4. Generalization Error

Define the generalization error of a policy π at an observation o0 as the average difference between the optimal value function and the value function resulting from the use of policy π in generating a separate test set. The generalization error of policy π at observation o0 can be written as

$$V^*(o_0) - V_\pi(o_0) = -E_\pi\!\left[\sum_{t=0}^{T} \mu_t^*(\bar O_t, \bar A_t) \,\middle|\, O_0 = o_0\right]$$ (5)

where Eπ denotes the expectation using the likelihood (2). So the generalization error can be expressed in terms of the optimal advantages evaluated at actions determined by policy π; that is, when each $A_t = \pi_t(\bar O_t, \bar A_{t-1})$. Thus the closer each optimal advantage $\mu_t^*(\bar O_t, \bar A_t)$, for $A_t = \pi_t(\bar O_t, \bar A_{t-1})$, is to zero, the smaller the generalization error. Recall that the optimal advantage $\mu_t^*(\bar O_t, \bar A_t)$ is zero when $A_t = \pi_t^*(\bar O_t, \bar A_{t-1})$. The display in (5) follows from Kakade's (2003, ch. 5) expression for the difference between the value functions for two policies.

Lemma 1

Given policies π˜ and π,

$$V_{\tilde\pi}(o_0) - V_\pi(o_0) = -E_\pi\!\left[\sum_{t=0}^{T} \mu_{\tilde\pi,t}(\bar O_t, \bar A_t) \,\middle|\, O_0 = o_0\right].$$

Set π̃ = π* to obtain (5). An alternative to Kakade's (2003) proof is as follows.

Proof. First note

$$V_\pi(o_0) = E_\pi\!\left[\sum_{t=0}^{T} R_t \,\middle|\, O_0 = o_0\right] = E_\pi\!\left[E_\pi\!\left[\sum_{t=0}^{T} R_t \,\middle|\, \bar O_T, \bar A_T\right] \,\middle|\, O_0 = o_0\right].$$ (6)

And $E_\pi\big[\sum_{t=0}^{T} R_t \mid \bar O_T, \bar A_T\big]$ is the expectation with respect to the distribution of $O_{T+1}$ given the history $(\bar O_T, \bar A_T)$; this is the density $f_{T+1}$ from Section 2 and $f_{T+1}$ does not depend on the policy used to choose the actions. Thus we may subscript E by either π or π̃ without changing the expectation: $E_\pi\big[\sum_{t=0}^{T} R_t \mid \bar O_T, \bar A_T\big] = E_{\tilde\pi}\big[\sum_{t=0}^{T} R_t \mid \bar O_T, \bar A_T\big] = \sum_{t=0}^{T-1} R_t + Q_{\tilde\pi,T}(\bar O_T, \bar A_T)$. The conditional expectation can be written as a telescoping sum:

$$E_\pi\!\left[\sum_{t=0}^{T} R_t \,\middle|\, \bar O_T, \bar A_T\right] = \sum_{t=0}^{T}\left[Q_{\tilde\pi,t}(\bar O_t, \bar A_t) - V_{\tilde\pi,t}(\bar O_t, \bar A_{t-1})\right] + \sum_{t=1}^{T}\left[R_{t-1} + V_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}) - Q_{\tilde\pi,t-1}(\bar O_{t-1}, \bar A_{t-1})\right] + V_{\tilde\pi,0}(O_0).$$

The first sum is the sum of the advantages. The second sum is a sum of temporal-difference errors; integrating the temporal-difference error with respect to the conditional distribution of Ot given (Ot − 1, At − 1), denoted by ft in Section 2, we obtain zero,

$$E\!\left[R_{t-1} + V_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}) \,\middle|\, \bar O_{t-1}, \bar A_{t-1}\right] = Q_{\tilde\pi,t-1}(\bar O_{t-1}, \bar A_{t-1})$$

(as before E denotes expectation with respect to (1); recall that expectations that do not integrate over the policy can be written either with an E or an Eπ). Substitute the telescoping sum into (6) and note that Vπ˜,0(O0)=Vπ˜(O0) to obtain the result. ▪

In the following lemma the difference between the value functions corresponding to two policies, π̃ and π, is bounded in terms of both the L1 and L2 distances between any functions {Q0, Q1, ..., QT} satisfying $\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} Q_t(\bar o_t, \bar a_t)$, t = 0, ..., T, and any functions $\{\tilde Q_0, \tilde Q_1, \dots, \tilde Q_T\}$ satisfying $\tilde\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} \tilde Q_t(\bar o_t, \bar a_t)$, t = 0, ..., T, together with the distance between the $\tilde Q_t$ and the Q-functions for policy π̃. We assume that there exists a positive constant L for which $p_t(a_t \mid \bar o_t, \bar a_{t-1}) \ge L^{-1}$ for each t and all pairs $(\bar o_t, \bar a_{t-1})$; if the stochastic decision rule $p_t$ were uniform then L would be the size of the action space.

Lemma 2

For all functions $Q_t$ satisfying $\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} Q_t(\bar o_t, \bar a_t)$, t = 0, ..., T, and all functions $\tilde Q_t$ satisfying $\tilde\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} \tilde Q_t(\bar o_t, \bar a_t)$, t = 0, ..., T, we have

$$|V_{\tilde\pi}(o_0) - V_\pi(o_0)| \le \sum_{t=0}^{T} 2L^{t+1} E\!\left[\left|Q_t(\bar O_t, \bar A_t) - \tilde Q_t(\bar O_t, \bar A_t)\right| \,\middle|\, O_0 = o_0\right] + \sum_{t=0}^{T} 2L^{t+1} E\!\left[\left|\tilde Q_t(\bar O_t, \bar A_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_t)\right| \,\middle|\, O_0 = o_0\right]$$

and

$$|V_{\tilde\pi}(o_0) - V_\pi(o_0)| \le \sum_{t=0}^{T} 2L^{(t+1)/2} \sqrt{E\!\left[\left(Q_t(\bar O_t, \bar A_t) - \tilde Q_t(\bar O_t, \bar A_t)\right)^2 \,\middle|\, O_0 = o_0\right]} + \sum_{t=0}^{T} 2L^{(t+1)/2} \sqrt{E\!\left[\left(\tilde Q_t(\bar O_t, \bar A_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_t)\right)^2 \,\middle|\, O_0 = o_0\right]},$$

where E denotes expectation with respect to the distribution generating the training sample (1).

Remark:

1. Note that in general $\arg\max_{a_t} Q_{\tilde\pi,t}(\bar o_t, \bar a_t)$ may not be $\tilde\pi_t$, thus we cannot choose $\tilde Q_t = Q_{\tilde\pi,t}$. However, if π̃ = π* then we can choose $\tilde Q_t = Q_t^*$ ($= Q_{\pi^*,t}$ by definition) and the second term in both upper bounds is equal to zero.

2. This result can be used to emphasize one aspect of the mismatch between estimating the optimal Q-function and the goal of learning a policy that maximizes the value function. Suppose $\tilde Q_t = Q_t^*$ and π̃ = π*. The generalization error satisfies

$$V^*(o_0) - V_\pi(o_0) \le \sum_{t=0}^{T} 2L^{(t+1)/2} \sqrt{E\!\left[\left(Q_t(\bar O_t, \bar A_t) - Q_t^*(\bar O_t, \bar A_t)\right)^2 \,\middle|\, O_0 = o_0\right]}$$

for Qt any function satisfying $\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} Q_t(\bar o_t, \bar a_t)$. Absent restrictions on the Qt's, this inequality cannot be improved in the sense that choosing each $Q_t = Q_t^*$ and $\pi_t = \pi_t^*$ yields 0 on both sides of the inequality. However an inequality in the opposite direction is not possible since, as was seen in Lemma 1, $V^*(o_0) - V_\pi(o_0)$ involves the Q-functions only through the advantages (see also (7) below with π̃ = π*). Thus for the difference in value functions to be small, the average difference between $Q_t(\bar o_t, \bar a_t) - \max_{a_t} Q_t(\bar o_t, \bar a_t)$ and $Q_t^*(\bar o_t, \bar a_t) - \max_{a_t} Q_t^*(\bar o_t, \bar a_t)$ must be small; this does not require that the average difference between $Q_t(\bar o_t, \bar a_t)$ and $Q_t^*(\bar o_t, \bar a_t)$ is small. The mismatch is not unexpected. For example, Baxter and Bartlett (2001) provide an example in which the approximation space for the value function includes a value function for which the greedy policy is optimal, yet the greedy policy found by temporal difference learning (TD(1)) performs very poorly.

Proof

Define $\mu_t(\bar o_t, \bar a_t) = Q_t(\bar o_t, \bar a_t) - \max_{a_t} Q_t(\bar o_t, \bar a_{t-1}, a_t)$ for each t; note that $\mu_t(\bar o_t, \bar a_t)$ evaluated at $a_t = \pi_t(\bar o_t, \bar a_{t-1})$ is zero. Start with the result of Lemma 1. Then note the difference between the value functions can be expressed as

$$V_{\tilde\pi}(o_0) - V_\pi(o_0) = \sum_{t=0}^{T} E_\pi\!\left[\mu_t(\bar O_t, \bar A_t) - \mu_{\tilde\pi,t}(\bar O_t, \bar A_t) \,\middle|\, O_0 = o_0\right]$$ (7)

since Pπ puts $a_t = \pi_t(\bar o_t, \bar a_{t-1})$ and $\mu_t(\bar o_t, \bar a_t) = 0$ for this value of at. When it is clear from the context, $\mu_t$ ($\mu_{\tilde\pi,t}$) is used as an abbreviation for $\mu_t(\bar O_t, \bar A_t)$ ($\mu_{\tilde\pi,t}(\bar O_t, \bar A_t)$) in the following. Also $Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, a_t)$ with $a_t$ replaced by $\tilde\pi_t(\bar O_t, \bar A_{t-1})$ is written as $Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, \tilde\pi_t)$. Consider the absolute value of the tth integrand in (7):

$$\left|\mu_t - \mu_{\tilde\pi,t}\right| = \left|Q_t(\bar O_t, \bar A_t) - \max_{a_t} Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_t) + Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, \tilde\pi_t)\right| \le \left|Q_t(\bar O_t, \bar A_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_t)\right| + \left|\max_{a_t} Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, \tilde\pi_t)\right|.$$

Since $\max_{a_t} \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t) = \tilde Q_t(\bar O_t, \bar A_{t-1}, \tilde\pi_t)$ and, for any functions h and h′, $|\max_{a_t} h(a_t) - \max_{a_t} h'(a_t)| \le \max_{a_t} |h(a_t) - h'(a_t)|$,

$$\left|\max_{a_t} Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, \tilde\pi_t)\right| \le \max_{a_t}\left|Q_t(\bar O_t, \bar A_{t-1}, a_t) - \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t)\right| + \left|\tilde Q_t(\bar O_t, \bar A_{t-1}, \tilde\pi_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, \tilde\pi_t)\right|.$$

We obtain |μt-μπ˜,t|

$$\begin{aligned}
&2\max_{a_t}\left|Q_t(\bar O_t, \bar A_{t-1}, a_t) - \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t)\right| + 2\max_{a_t}\left|\tilde Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, a_t)\right| \\
&\quad\le 2L\sum_{a_t}\left|Q_t(\bar O_t, \bar A_{t-1}, a_t) - \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t)\right| p_t(a_t \mid \bar O_t, \bar A_{t-1}) + 2L\sum_{a_t}\left|\tilde Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, a_t)\right| p_t(a_t \mid \bar O_t, \bar A_{t-1}).
\end{aligned}$$ (8)

Insert the above into (7) and use Lemma A1 to obtain |Vπ˜(o0)-Vπ(o0)|

$$\begin{aligned}
&2L\sum_{t=0}^{T}\Bigg\{E_\pi\!\left[\sum_{a_t}\left|Q_t(\bar O_t, \bar A_{t-1}, a_t) - \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t)\right| p_t(a_t \mid \bar O_t, \bar A_{t-1}) \,\middle|\, O_0 = o_0\right] + E_\pi\!\left[\sum_{a_t}\left|\tilde Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, a_t)\right| p_t(a_t \mid \bar O_t, \bar A_{t-1}) \,\middle|\, O_0 = o_0\right]\Bigg\} \\
&= 2L\sum_{t=0}^{T}\Bigg\{E\!\left[\left(\prod_{\ell=0}^{t-1}\frac{\mathbf{1}_{A_\ell = \pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\right)\left|Q_t - \tilde Q_t\right| \,\middle|\, O_0 = o_0\right] + E\!\left[\left(\prod_{\ell=0}^{t-1}\frac{\mathbf{1}_{A_\ell = \pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\right)\left|\tilde Q_t - Q_{\tilde\pi,t}\right| \,\middle|\, O_0 = o_0\right]\Bigg\} \\
&\le 2\sum_{t=0}^{T} L^{t+1} E\!\left[\left|Q_t - \tilde Q_t\right| \,\middle|\, O_0 = o_0\right] + 2\sum_{t=0}^{T} L^{t+1} E\!\left[\left|\tilde Q_t - Q_{\tilde\pi,t}\right| \,\middle|\, O_0 = o_0\right]
\end{aligned}$$

($Q_t$ and $Q_{\tilde\pi,t}$ are used as abbreviations for $Q_t(\bar O_t, \bar A_t)$ and $Q_{\tilde\pi,t}(\bar O_t, \bar A_t)$, respectively). This completes the proof of the first result.

Start from (8) and use Hölder’s inequality to obtain, |Vπ˜(o0)-Vπ(o0)|

$$\begin{aligned}
&2\sum_{t=0}^{T}\Big\{E_\pi\!\left[\max_{a_t}\left|Q_t(\bar O_t, \bar A_{t-1}, a_t) - \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t)\right| \,\middle|\, O_0 = o_0\right] + E_\pi\!\left[\max_{a_t}\left|\tilde Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, a_t)\right| \,\middle|\, O_0 = o_0\right]\Big\} \\
&\le 2\sum_{t=0}^{T}\Big\{\sqrt{E_\pi\!\left[\max_{a_t}\left|Q_t(\bar O_t, \bar A_{t-1}, a_t) - \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t)\right|^2 \,\middle|\, O_0 = o_0\right]} + \sqrt{E_\pi\!\left[\max_{a_t}\left|\tilde Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, a_t)\right|^2 \,\middle|\, O_0 = o_0\right]}\Big\} \\
&\le 2\sum_{t=0}^{T}\sqrt{L\,E_\pi\!\left[\sum_{a_t}\left|Q_t(\bar O_t, \bar A_{t-1}, a_t) - \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t)\right|^2 p_t(a_t \mid \bar O_t, \bar A_{t-1}) \,\middle|\, O_0 = o_0\right]} + 2\sum_{t=0}^{T}\sqrt{L\,E_\pi\!\left[\sum_{a_t}\left|\tilde Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, a_t)\right|^2 p_t(a_t \mid \bar O_t, \bar A_{t-1}) \,\middle|\, O_0 = o_0\right]}.
\end{aligned}$$

Now use Lemma A1 and the lower bound on the pt’s to obtain the result,

$$\begin{aligned}
|V_{\tilde\pi}(o_0) - V_\pi(o_0)| &\le 2\sum_{t=0}^{T}\Bigg\{\sqrt{L\,E\!\left[\prod_{\ell=0}^{t-1}\frac{\mathbf{1}_{A_\ell=\pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\left(Q_t - \tilde Q_t\right)^2 \,\middle|\, O_0 = o_0\right]} + \sqrt{L\,E\!\left[\prod_{\ell=0}^{t-1}\frac{\mathbf{1}_{A_\ell=\pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\left(\tilde Q_t - Q_{\tilde\pi,t}\right)^2 \,\middle|\, O_0 = o_0\right]}\Bigg\} \\
&\le 2\sum_{t=0}^{T} L^{(t+1)/2}\left\{\sqrt{E\!\left[\left(Q_t - \tilde Q_t\right)^2 \,\middle|\, O_0 = o_0\right]} + \sqrt{E\!\left[\left(\tilde Q_t - Q_{\tilde\pi,t}\right)^2 \,\middle|\, O_0 = o_0\right]}\right\}.
\end{aligned}$$

5. Finite Sample Upper Bounds on the Average Generalization Error

Traditionally the performance of a policy π is evaluated in terms of the maximum generalization error, $\max_o [V^*(o) - V_\pi(o)]$ (Bertsekas and Tsitsiklis, 1996). However, here we consider an average generalization error as in Kakade (2003) (see also Fiechter, 1997; Kearns, Mansour and Ng, 2000; Peshkin and Shelton, 2002); that is, $\int [V^*(o) - V_\pi(o)]\, dF(o)$ for a specified distribution F on the observation space. The choice of F with density f = f0 (f0 is the density of O0 in likelihoods (1) and (2)) is particularly appealing in the development of a policy in many medical and social science applications. In these cases, f0 represents the distribution of initial observations corresponding to a particular population of subjects. The goal is to produce a good policy for this population of subjects. In general, as in Kakade (2003), F may be chosen to incorporate domain knowledge concerning the steady state distribution of a good policy. If only a training set of trajectories is available for learning and we are unwilling to assume that the system dynamics are Markovian, then the choice of F is constrained by the following consideration. If the distribution of O0 in the training set (f0) assigns mass zero to an observation o′, then the training data will not be able to tell us anything about Vπ(o′). Similarly, if f0 assigns a very small positive mass to o′ then only an exceptionally large training set will permit an accurate estimate of Vπ(o′). Of course this will not be a problem for the average generalization error, as long as F also assigns very low mass to o′. Consequently, in our construction of the finite sample error bounds for the average generalization error, we will only consider distributions F for which the density of F, say f, satisfies $\sup_o \left|\frac{f(o)}{f_0(o)}\right| \le M$ for some finite constant M. In this case the average generalization error is bounded above by

$$\int \left[V^*(o) - V_\pi(o)\right] dF(o) \le M\, E\!\left[V^*(O_0) - V_\pi(O_0)\right] = -M\, E_\pi\!\left[\sum_{t=0}^{T} \mu_t^*(\bar O_t, \bar A_t)\right].$$

The equality is a consequence of (5) and the fact that the distribution of O0 is the same under likelihoods (1) and (2).

In the following theorem a non-asymptotic upper bound on the average generalization error is provided; this upper bound depends on the number of trajectories in the training set (n), the performance of the approximation on the training set, the complexity of the approximation space and of course on the confidence (δ) and accuracy (ɛ) demanded. The batch Q-learning algorithm minimizes quadratic forms (see (3)); thus we represent the performance of functions {Q0, Q1,..., QT} on the training set by these quadratic forms,

$$\mathrm{Err}_{n,Q_{t+1}}(Q_t) = E_n\!\left[R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_t(\bar O_t, \bar A_t)\right]^2$$

for each t (recall QT+1 is set to zero and En represents the expectation with respect to the probability obtained by choosing a trajectory uniformly from the training set).

The complexity of each $\mathcal{Q}_t$ space can be represented by its covering number (Anthony and Bartlett, 1999, pg. 148). Suppose $\mathcal{F}$ is a class of functions from a space, X, to ℝ. For a sequence $x = (x_1, \dots, x_n) \in X^n$, define $\mathcal{F}|_x$ to be the subset of $\mathbb{R}^n$ given by $\mathcal{F}|_x = \{(f(x_1), \dots, f(x_n)) : f \in \mathcal{F}\}$. Define the metric $d_p$ on $\mathbb{R}^n$ by $d_p(z, y) = \left(\frac{1}{n}\sum_{i=1}^{n}|z_i - y_i|^p\right)^{1/p}$ for p a positive integer (for p = ∞, define $d_\infty(z, y) = \max_{i=1,\dots,n}|z_i - y_i|$). Then $N(\varepsilon, \mathcal{F}|_x, d_p)$ is defined as the minimum cardinality of an ɛ-cover of $\mathcal{F}|_x$ with respect to the metric $d_p$. Next, given ɛ > 0, a positive integer n, the metric $d_p$ and a function class $\mathcal{F}$, the covering number for $\mathcal{F}$ is defined as

$$N_p(\varepsilon, \mathcal{F}, n) = \max\left\{N(\varepsilon, \mathcal{F}|_x, d_p) : x \in X^n\right\}.$$
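As a small illustration of this definition, the sketch below computes a greedy ɛ-cover of $\mathcal{F}|_x$ under $d_1$ for a toy class of threshold functions. Greedy covering only gives an upper bound on $N(\varepsilon, \mathcal{F}|_x, d_1)$, and the class, the points x and the function names are arbitrary illustrations, not from the paper.

```python
import numpy as np

def greedy_cover_size(points, eps, p=1):
    """Greedy upper bound on N(eps, F|x, d_p): size of an eps-cover of the rows of
    `points` (each row is (f(x_1), ..., f(x_n)) for one f in F) under the
    normalized metric d_p(z, y) = (1/n * sum |z_i - y_i|^p)^(1/p)."""
    centers = []
    for z in np.asarray(points, dtype=float):
        if not any(np.mean(np.abs(z - c) ** p) ** (1.0 / p) <= eps for c in centers):
            centers.append(z)          # z not yet covered; make it a new center
    return len(centers)

# Toy class: thresholds f_b(x) = 1{x >= b}, restricted to x = (0.1, ..., 1.0).
x = np.linspace(0.1, 1.0, 10)
F_restricted = [(x >= b).astype(float) for b in np.linspace(0, 1, 101)]
print(greedy_cover_size(F_restricted, eps=0.2))
```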

In the following theorem, $\mathcal{F} = \{\max_{a_{t+1}} Q_{t+1}(\bar o_{t+1}, \bar a_t, a_{t+1}) - Q_t(\bar o_t, \bar a_t) : Q_t \in \mathcal{Q}_t,\ Q_{t+1} \in \mathcal{Q}_{t+1},\ t = 0, \dots, T\}$ and $(x)_+$ is x if x > 0 and zero otherwise.

Theorem 1

Assume that the functions in $\mathcal{Q}_t$, t = 0, ..., T, are uniformly bounded. Suppose that there exists a positive constant, say L, for which $p_t(a_t \mid \bar o_t, \bar a_{t-1}) \ge L^{-1}$ for all $(\bar o_t, \bar a_{t-1})$ pairs, 0 ≤ t ≤ T. Then for ɛ > 0 and with probability at least 1 − δ over the random choice of the training set, every choice of functions $Q_j \in \mathcal{Q}_j$, j = 0, ..., T, with associated policy π defined by $\pi_j(\bar o_j, \bar a_{j-1}) \in \arg\max_{a_j} Q_j(\bar o_j, \bar a_j)$, and every choice of functions $\tilde Q_j \in \mathcal{Q}_j$, j = 0, ..., T, with associated policy π̃ defined by $\tilde\pi_j(\bar o_j, \bar a_{j-1}) \in \arg\max_{a_j} \tilde Q_j(\bar o_j, \bar a_j)$, satisfies the following bound:

$$\begin{aligned}
\int \left|V_{\tilde\pi}(o) - V_\pi(o)\right| dF(o) \le\ & 6ML^{1/2}\sum_{t=0}^{T}\left[\sum_{i=t}^{T} (16)^{i-t} L^{i}\left(\mathrm{Err}_{n,Q_{i+1}}(Q_i) - \mathrm{Err}_{n,Q_{i+1}}(\tilde Q_i)\right)_+\right]^{1/2} + 12ML^{1/2}\varepsilon \\
&+ 6ML^{1/2}\sum_{t=0}^{T}\sum_{i=t}^{T} (16)^{(i-t)/2} L^{i/2} \sqrt{E\!\left[\tilde Q_i(\bar O_i, \bar A_i) - Q_{\tilde\pi,i}(\bar O_i, \bar A_i)\right]^2}
\end{aligned}$$

for n satisfying

$$4(T+1)\, N_1\!\left(\frac{\varepsilon^2}{32M'(16L)^{T+2}}, \mathcal{F}, 2n\right) \exp\!\left\{\frac{-\varepsilon^4 n}{32(M')^2 (16L)^{2(T+2)}}\right\} \le \delta$$ (9)

and where M′ is a uniform upper bound on the absolute value of f ∈ F and E represents the expectation with respect to the distribution (1) generating the training set.

Remarks:

1. Suppose that $Q_t^* \in \mathcal{Q}_t$ for each t. Select $\tilde Q_t = Q_t^*$ and $\hat Q_t \in \arg\min_{Q_t \in \mathcal{Q}_t} \mathrm{Err}_{n,\hat Q_{t+1}}(Q_t)$, t = T, T−1, ..., 0 (recall $Q_{T+1}$ and $\hat Q_{T+1}$ are identically zero). Then with probability greater than 1 − δ we obtain

$$\int \left[V^*(o) - V_{\hat\pi}(o)\right] dF(o) \le 12ML^{1/2}\varepsilon$$ (10)

for all n satisfying (9). Thus, as long as the covering numbers for each $\mathcal{Q}_t$ (and thus for $\mathcal{F}$) do not grow too fast, estimating each Qt by minimizing $\mathrm{Err}_{n,\hat Q_{t+1}}(Q_t)$ yields a policy that consistently achieves the optimal value. Suppose the approximation spaces $\mathcal{Q}_t$, t = 0, ..., T, are feed-forward neural networks as in remark 4 below. In this case the training set size n sufficient for (10) to hold need only be polynomial in (1/δ, 1/ɛ), and batch Q-learning is a probably approximately correct (PAC) reinforcement learning algorithm as defined by Fiechter (1997). As shown by Fiechter (1997), this algorithm can be converted to an efficient on-line reinforcement learning algorithm (here the word on-line implies updating the policy between trajectories).

2. Even when $Q_t^*$ does not belong to $\mathcal{Q}_t$ we can add the optimal Q-function at each time t to the approximation space $\mathcal{Q}_t$, at a cost of no more than an increase of 1 in the covering number $N_1\!\left(\frac{\varepsilon^2}{32M'(16L)^{T+2}}, \mathcal{F}, 2n\right)$. If we do this the result continues to hold when we set π̃ to an optimal policy π* and set $\tilde Q_t = Q_t^*$ for each t; the generalization error satisfies

$$\int \left[V^*(o) - V_\pi(o)\right] dF(o) \le 6ML^{1/2}\sum_{t=0}^{T}\left[\sum_{i=t}^{T}(16)^{i-t}L^{i}\,\mathrm{Err}_{n,Q_{i+1}}(Q_i)\right]^{1/2} + 12ML^{1/2}\varepsilon$$

for all n satisfying (9). This upper bound is consistent with the practice of using a policy $\hat\pi$ for which $\hat\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} \hat Q_t(\bar o_t, \bar a_t)$ and $\hat Q_t \in \arg\min_{Q_t \in \mathcal{Q}_t} \mathrm{Err}_{n,\hat Q_{t+1}}(Q_t)$. Given that the covering numbers for the approximation space can be expressed in a sufficiently simple form (as in remark 4 below), this upper bound can be used to carry out model selection via structural risk minimization (Vapnik, 1982). That is, one might consider a variety of approximation spaces and use structural risk minimization to choose, based on the training data, which approximation space is best. The resulting upper bound on the average generalization error can be found by using the above result and Lemma 15.5 of Anthony and Bartlett (1999).

3. The restriction on n in (9) is due to the complexity associated with the approximation space (i.e. the $\mathcal{Q}_t$'s). The restriction is crude; to see this, note that if there were only a finite number of functions in $\mathcal{F}$ then n would need only satisfy

$$2(T+1)\,|\mathcal{F}|\exp\!\left\{\frac{-2\varepsilon^4 n}{(3M')^2(16L)^{2(T+2)}}\right\} = \delta$$

(use Hoeffding’s inequality; see Anthony and Bartlett, 1999, pg. 361) and thus for a given (ɛ, δ) we may set the number of trajectories in the training set equal to $n = \frac{(3M')^2(16L)^{2(T+2)}}{2\varepsilon^4}\ln\!\left(\frac{2(T+1)|\mathcal{F}|}{\delta}\right)$. This complexity term appears similar to that achieved by learning algorithms (e.g. see Anthony and Bartlett, 1999, pg. 21) or in reinforcement learning (e.g. Peshkin and Shelton, 2002); however, note that n is of the order ɛ−4 rather than the usual ɛ−2. The ɛ−4 term (instead of ɛ−2) is attributable to the fact that $\mathrm{Err}_{Q_{t+1}}(Q_t)$ is not only a function of Qt but also of Qt+1. However, further assumptions on the approximation space permit an improved result; see Theorem 2 below for one possible refinement. Note the needed training set size n depends exponentially on the horizon time T but not on the dimension of the observation space. This is not unexpected, as the upper bounds on the generalization error of both Kearns, Mansour and Ng (2000) and Peshkin and Shelton's (2002) policy search methods (the latter using a training set and importance sampling weights) also depend exponentially on the horizon time.
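The following small Python snippet evaluates the finite-class sample size formula above; the plugged-in numbers (two actions chosen uniformly, a short horizon, an arbitrary class size) are illustrative assumptions only.

```python
import math

def n_finite_class(eps, delta, T, L, M_prime, F_size):
    """Training set size from the finite-class bound above:
    n = (3 M')^2 (16 L)^(2(T+2)) / (2 eps^4) * ln(2 (T+1) |F| / delta)."""
    return (3 * M_prime) ** 2 * (16 * L) ** (2 * (T + 2)) / (2 * eps ** 4) \
        * math.log(2 * (T + 1) * F_size / delta)

# Arbitrary illustrative values: two actions chosen uniformly (L = 2), T = 2.
print(f"{n_finite_class(eps=0.1, delta=0.05, T=2, L=2, M_prime=2, F_size=10**6):.3e}")
```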

4. When F is infinite, we use covering numbers for the approximation space Qt and then appeal to Lemma A2 in the appendix to derive a covering number for F; this results in

$$N_1(\varepsilon, \mathcal{F}, n) \le (T+1)\max_{t=0,\dots,T} N_1\!\left(\frac{\varepsilon}{2|\mathcal{A}|}, \mathcal{Q}_t, |\mathcal{A}|n\right)^2.$$

One possible approximation space is based on feed-forward neural networks. From Anthony and Bartlett (1999) we have that if each $\mathcal{Q}_t$ is the class of functions computed by a feed-forward network with W weights and k computation units arranged in L layers, and each computation unit has a fixed piecewise-polynomial activation function with q pieces and degree no more than ℓ, then $N_1(\varepsilon, \mathcal{Q}_t, n) \le e(d+1)\left(\frac{2eM'}{\varepsilon}\right)^{d}$ where $d = 2(W+1)(L+1)\log_2\!\big(4(W+1)(L+1)q(k+1)/\ln 2\big) + 2(W+1)(L+1)^2\log_2(\ell+1) + 2(L+1)$. To see this, combine Anthony and Bartlett's Theorems 8.8, 14.1 and 18.4. They provide covering numbers for functions computed by other types of neural networks as well. A particularly simple neural network is an affine combination of a given set of p input features, i.e. $f(x) = \omega_0 + \sum_{i=1}^{p-1}\omega_i x_i$ for (1, x) a vector of p real-valued features and each $\omega_i \in \mathbb{R}$. Suppose each $\mathcal{Q}_t$ is a class of functions computed by this network. Then Theorems 11.6 and 18.4 of Anthony and Bartlett imply that $N_1(\varepsilon, \mathcal{Q}_t, n) \le e(p+1)\left(\frac{2eM'}{\varepsilon}\right)^{p}$. In this case

$$n \ge \frac{32(M')^4(16L)^{2(T+2)}}{\varepsilon^4}\log\!\left(\frac{4(T+1)^2 e^2 (p+1)^2\left(128\,e\,|\mathcal{A}|\,(M')^2\,(16L)^{T+2}\right)^{2p}}{\delta\,\varepsilon^{4p}}\right).$$

This number will be large for any reasonable accuracy, ɛ and confidence, δ.
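To see how large, the snippet below simply evaluates the displayed expression for one arbitrary set of illustrative values (the function name and all plugged-in numbers are assumptions for illustration only).

```python
import math

def n_affine_features(eps, delta, T, L, M_prime, p, A_size):
    """Evaluate the sample-size expression displayed above for the affine (linear)
    feature class; all plugged-in values below are arbitrary illustrations."""
    prefactor = 32 * M_prime ** 4 * (16 * L) ** (2 * (T + 2)) / eps ** 4
    inside = (4 * (T + 1) ** 2 * math.e ** 2 * (p + 1) ** 2
              * (128 * math.e * A_size * M_prime ** 2 * (16 * L) ** (T + 2)) ** (2 * p)
              / (delta * eps ** (4 * p)))
    return prefactor * math.log(inside)

# e.g. T = 2 decision points, two actions chosen uniformly (L = 2), p = 10 features.
print(f"{n_affine_features(eps=0.1, delta=0.05, T=2, L=2, M_prime=2, p=10, A_size=2):.3e}")
```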

Proof of Theorem 1

An upper bound on the average difference in value functions can be obtained from Lemma 2 by using Jensen's inequality and the assumption that the density of F, f, satisfies $\sup_o\left|\frac{f(o)}{f_0(o)}\right| \le M$ for some finite constant M:

$$\int \left|V_{\tilde\pi}(o) - V_\pi(o)\right| dF(o) \le M\sum_{t=0}^{T} 2L^{(t+1)/2}\sqrt{E\!\left[Q_t(\bar O_t, \bar A_t) - \tilde Q_t(\bar O_t, \bar A_t)\right]^2} + M\sum_{t=0}^{T} 2L^{(t+1)/2}\sqrt{E\!\left[\tilde Q_t - Q_{\tilde\pi,t}\right]^2}$$ (11)

where $\tilde Q_t$ and $Q_{\tilde\pi,t}$ are used as abbreviations for $\tilde Q_t(\bar O_t, \bar A_t)$ and $Q_{\tilde\pi,t}(\bar O_t, \bar A_t)$, respectively. In the following, an upper bound on each $E\!\left[Q_t - \tilde Q_t\right]^2$ is constructed.

The performance of the approximation on an infinite training set can be represented by

$$\mathrm{Err}_{Q_{t+1}}(Q_t) = E\!\left[R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_t\right]^2$$

for each t (recall $Q_{T+1} = 0$; also we abbreviate $Q_t(\bar O_t, \bar A_t)$ by $Q_t$ whenever no confusion may arise). The errors, Err's, can be used to provide an upper bound on the L2 norms of the Q-function differences by the following argument. Consider $\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{Q_{t+1}}(\tilde Q_t)$ for each t. Within each of these quadratic forms, add and subtract

$$Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, \tilde\pi_{t+1}) - Q_{\tilde\pi,t} - E\!\left[\max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, \tilde\pi_{t+1}) \,\middle|\, \bar O_t, \bar A_t\right].$$

In the above, $Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, \tilde\pi_{t+1})$ is defined as $Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, a_{t+1})$ with $a_{t+1}$ replaced by $\tilde\pi_{t+1}(\bar O_{t+1}, \bar A_t)$. Expand each quadratic form and use the fact that $E[R_t + Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, \tilde\pi_{t+1}) \mid \bar O_t, \bar A_t] = Q_{\tilde\pi,t}$. Cancelling common terms yields

$$E\!\left[Q_{\tilde\pi,t} - Q_t + E\!\left[\max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, \tilde\pi_{t+1}) \,\middle|\, \bar O_t, \bar A_t\right]\right]^2 - E\!\left[Q_{\tilde\pi,t} - \tilde Q_t + E\!\left[\max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, \tilde\pi_{t+1}) \,\middle|\, \bar O_t, \bar A_t\right]\right]^2.$$

Add and subtract Q˜t in the first quadratic form and expand. This yields

$$\begin{aligned}
\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{Q_{t+1}}(\tilde Q_t) =\ & E\!\left[\tilde Q_t - Q_t\right]^2 + 2E\!\left[\left(\tilde Q_t - Q_t\right)\left(\tilde Q_t - Q_{\tilde\pi,t}\right)\right] \\
&+ 2E\!\left[\left(\tilde Q_t - Q_t\right)\left(\max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - \max_{a_{t+1}} \tilde Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1})\right)\right] \\
&+ 2E\!\left[\left(\tilde Q_t - Q_t\right)\left(\max_{a_{t+1}} \tilde Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, \tilde\pi_{t+1})\right)\right].
\end{aligned}$$ (12)

Using arguments similar to those used around Equation (8) and using the fact that (x + y)² ≤ 2x² + 2y², we obtain

$$\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{Q_{t+1}}(\tilde Q_t) \ge E\!\left[Q_t - \tilde Q_t\right]^2 - 4\left(E\!\left[Q_t - \tilde Q_t\right]^2\left(E\!\left[\tilde Q_t - Q_{\tilde\pi,t}\right]^2 + L\,E\!\left[Q_{t+1} - \tilde Q_{t+1}\right]^2 + L\,E\!\left[\tilde Q_{t+1} - Q_{\tilde\pi,t+1}\right]^2\right)\right)^{1/2}.$$

Using this inequality we can now derive an upper bound on each $E\!\left[Q_t - \tilde Q_t\right]^2$ in terms of the Err's and the $E\!\left[\tilde Q_{t+1} - Q_{\tilde\pi,t+1}\right]^2$'s. Define

$$m_t = L^{-(T-t)}\,E\!\left[\tilde Q_t - Q_{\tilde\pi,t}\right]^2 \qquad\text{and}\qquad b_t = L^{-(T-t)}\,E\!\left[Q_t - \tilde Q_t\right]^2$$

and

$$e_t = L^{-(T-t)}\left(\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{Q_{t+1}}(\tilde Q_t)\right)$$

for t ≤ T and $b_{T+1} = m_{T+1} = e_{T+1} = 0$. We obtain

$$e_t \ge b_t - 4\sqrt{b_t\left(m_t + b_{t+1} + m_{t+1}\right)}.$$

Completing the square, reordering terms, squaring once again and using the inequality (x + y)² ≤ 2x² + 2y² yields $b_t \le 16(b_{t+1} + m_t + m_{t+1}) + 2e_t$ for t ≤ T. We obtain

$$b_{T-t} \le 2\sum_{i=0}^{t}(16)^i e_{T-t+i} + \sum_{i=1}^{t}(16)^i(16+1)\, m_{T-t+i} + 16\, m_{T-t}.$$

Inserting the definitions of bTt, eTt+i and reordering, yields

$$E\!\left[Q_t - \tilde Q_t\right]^2 \le 2\sum_{i=t}^{T}(16L)^{i-t}\left(\mathrm{Err}_{Q_{i+1}}(Q_i) - \mathrm{Err}_{Q_{i+1}}(\tilde Q_i)\right) + \sum_{i=t+1}^{T}(16)^{i-t}(16+1)\,L^{T-t} m_i + L^{T-t} m_t.$$ (13)

As an aside we can start from (12) and derive the upper bound,

$$\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{Q_{t+1}}(\tilde Q_t) \le E\!\left[Q_t - \tilde Q_t\right]^2 + 4L^{T-t}\sqrt{L^{-(T-t)}\,E\!\left[Q_t - \tilde Q_t\right]^2\left(m_t + L^{-(T-t-1)}E\!\left[Q_{t+1} - \tilde Q_{t+1}\right]^2 + m_{t+1}\right)}.$$

This, combined with (13), implies that minimizing each $\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{Q_{t+1}}(\tilde Q_t)$ in Qt is equivalent to minimizing each $E\!\left[Q_t - \tilde Q_t\right]^2$ in Qt, modulo the approximation terms mt for t = 0, ..., T.

Returning to the proof next note that

$$\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{Q_{t+1}}(\tilde Q_t) \le \left|\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{n,Q_{t+1}}(Q_t)\right| + \left|\mathrm{Err}_{Q_{t+1}}(\tilde Q_t) - \mathrm{Err}_{n,Q_{t+1}}(\tilde Q_t)\right| + \left(\mathrm{Err}_{n,Q_{t+1}}(Q_t) - \mathrm{Err}_{n,Q_{t+1}}(\tilde Q_t)\right)_+$$

where (x)+ is equal to x if x ≥ 0 and is equal to 0 otherwise. Note that if each Qt minimizes Errn,Qt+1 as in (3) then the third term is zero. Substituting into (13), we obtain

$$\begin{aligned}
E\!\left[Q_t - \tilde Q_t\right]^2 \le\ & 2\sum_{i=t}^{T}(16L)^{i-t}\Big(\left|\mathrm{Err}_{Q_{i+1}}(Q_i) - \mathrm{Err}_{n,Q_{i+1}}(Q_i)\right| + \left|\mathrm{Err}_{Q_{i+1}}(\tilde Q_i) - \mathrm{Err}_{n,Q_{i+1}}(\tilde Q_i)\right| + \left(\mathrm{Err}_{n,Q_{i+1}}(Q_i) - \mathrm{Err}_{n,Q_{i+1}}(\tilde Q_i)\right)_+\Big) \\
&+ \sum_{i=t+1}^{T}(16)^{i-t}(16+1)\,L^{i-t}\,E\!\left[\tilde Q_i - Q_{\tilde\pi,i}\right]^2 + E\!\left[\tilde Q_t - Q_{\tilde\pi,t}\right]^2.
\end{aligned}$$

Combine this inequality with (11); simplify the sums and use the fact that for x, y both nonnegative, $\sqrt{x+y} \le \sqrt{x} + \sqrt{y}$, to obtain that $\int|V_{\tilde\pi}(o) - V_\pi(o)|\,dF(o)$ is bounded above by

$$\begin{aligned}
& 6ML^{1/2}\sum_{t=0}^{T}\left[\sum_{i=t}^{T}(16)^{i-t}L^{i}\left(\mathrm{Err}_{n,Q_{i+1}}(Q_i) - \mathrm{Err}_{n,Q_{i+1}}(\tilde Q_i)\right)_+\right]^{1/2} \\
&\quad + 12ML^{1/2}(16L)^{(T+2)/2}\sqrt{\max_t \sup_{Q_t \in \mathcal{Q}_t,\, Q_{t+1} \in \mathcal{Q}_{t+1}}\left|\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{n,Q_{t+1}}(Q_t)\right|} \\
&\quad + 6ML^{1/2}\sum_{t=0}^{T}\sum_{i=t}^{T}(16)^{(i-t)/2}L^{i/2}\sqrt{E\!\left[\tilde Q_i - Q_{\tilde\pi,i}\right]^2}.
\end{aligned}$$

All that remains is to provide an upper bound on

$$P\!\left[\bigcup_{i=0}^{T}\left\{\text{for some } Q_t \in \mathcal{Q}_t,\ t = 0, \dots, T:\ \left|\mathrm{Err}_{Q_{i+1}}(Q_i) - \mathrm{Err}_{n,Q_{i+1}}(Q_i)\right| > \varepsilon'\right\}\right].$$

This probability is in turn bounded above by

$$\sum_{i=0}^{T} P\!\left[\text{for some } Q_t \in \mathcal{Q}_t,\ t = 0, \dots, T:\ \left|\mathrm{Err}_{Q_{i+1}}(Q_i) - \mathrm{Err}_{n,Q_{i+1}}(Q_i)\right| > \varepsilon'\right].$$

Anthony and Bartlett (1999, pg. 241) use Hoeffding’s inequality along with the classical techniques of symmetrization and permutation to provide the upper bound (see also van der Vaart and Wellner, 1996),

$$P\!\left[\text{for some } Q_t \in \mathcal{Q}_t,\ t = 0, \dots, T:\ \left|\mathrm{Err}_{Q_{i+1}}(Q_i) - \mathrm{Err}_{n,Q_{i+1}}(Q_i)\right| > \varepsilon'\right] \le 4 N_1\!\left(\frac{\varepsilon'}{32M'}, \mathcal{F}, 2n\right)\exp\!\left\{\frac{-(\varepsilon')^2 n}{32(M')^2}\right\}.$$

Put $\varepsilon = (16L)^{(T+2)/2}\sqrt{\varepsilon'}$, that is, $\varepsilon' = (16L)^{-(T+2)}\varepsilon^2$, to obtain the results of the theorem.

Suppose the Q functions are approximated by linear combinations of p features; for each t = 0,...,T, denote the feature vector by qt (ot, at ). The approximation space is then,

$$\mathcal{Q}_t = \left\{Q_t(\bar o_t, \bar a_t) = \theta^T q_t(\bar o_t, \bar a_t) : \theta \in \Theta\right\}$$

where Θ is a subset of $\mathbb{R}^p$. In this case, the batch Q-learning algorithm may be based on (4); we represent the performance of the functions {Q0, ..., QT} on the training set by

$$\widetilde{\mathrm{Err}}_{n,Q_{t+1}}(Q_t) = E_n\!\left[\left(R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_t(\bar O_t, \bar A_t)\right) q_t(\bar O_t, \bar A_t)\right]$$

for t = 0,...,T (recall En represents the expectation with respect to the probability obtained by choosing a trajectory uniformly from the training set). In this theorem

$$\mathcal{F}' = \bigcup_{i=1}^{p}\bigcup_{t=0}^{T}\left\{\left(r_t + \max_{a_{t+1}} Q_{t+1}(\bar o_{t+1}, \bar a_t, a_{t+1}; \theta_{t+1}) - Q_t(\bar o_t, \bar a_t; \theta_t)\right) q_{ti}(\bar o_t, \bar a_t) : \theta_t, \theta_{t+1} \in \Theta\right\}.$$

Define the functions $\{\bar Q_0, \dots, \bar Q_T\}$ and the policy $\bar\pi$ as follows. First define $\bar Q_T(\bar O_T, \bar A_T)$ to be the projection of $E[R_T \mid \bar O_T, \bar A_T]$ on the space spanned by $q_T$. Then set $\bar\pi_T(\bar o_T, \bar a_{T-1}) \in \arg\max_{a_T} \bar Q_T(\bar o_T, \bar a_T)$. Next, for t = T − 1, ..., 0, set $\bar Q_t(\bar O_t, \bar A_t)$ as the projection of $E[R_t + \bar Q_{t+1}(\bar O_{t+1}, \bar A_t, \bar\pi_{t+1}) \mid \bar O_t, \bar A_t]$ on the space spanned by $q_t$ (recall $\bar Q_{t+1}(\bar O_{t+1}, \bar A_t, \bar\pi_{t+1})$ is defined as $\bar Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1})$ with $a_{t+1}$ replaced by $\bar\pi_{t+1}(\bar O_{t+1}, \bar A_t)$), and set $\bar\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} \bar Q_t(\bar o_t, \bar a_t)$. These projections are with respect to P, the distribution which generated the trajectories in the training set (the likelihood is given in (1)).

Theorem 2

Suppose that there exists a positive constant, say L, for which $p_t(a_t \mid \bar o_t, \bar a_{t-1}) \ge L^{-1}$ for all $(\bar o_t, \bar a_{t-1})$, 0 ≤ t ≤ T. Suppose that for each t and all $x \in \mathbb{R}^p$, $x^T E[q_t q_t^T]\, x > \eta\,\|x\|^2$ where η > 0 (‖·‖ is the Euclidean norm). Also assume that Θ is a closed subset of $\{x \in \mathbb{R}^p : \|x\| \le M_\Theta\}$ and that for all (t, i) the ith component of the vector $q_t$ is pointwise bounded, $|q_{ti}| \le M_Q$ for $M_Q$ a constant. Then for ɛ > 0, with probability at least 1 − δ over the random choice of the training set, every choice of functions $Q_t \in \mathcal{Q}_t$ and functions $\tilde Q_t$, t = 0, ..., T, with associated policies π defined by $\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} Q_t(\bar o_t, \bar a_t)$ and π̃ defined by $\tilde\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} \tilde Q_t(\bar o_t, \bar a_t)$, respectively, satisfies the following bounds:

$$\sum_{t=0}^{T} L^{t+1}\, E\left|\bar Q_t(\bar O_t, \bar A_t) - Q_t(\bar O_t, \bar A_t)\right| \le \frac{\sqrt{p}\,M_Q}{\eta}\sum_{t=0}^{T} L^{t+1}\sum_{j=t}^{T}\left(\frac{L\,p\,M_Q^2}{\eta}\right)^{j-t}\left\|\widetilde{\mathrm{Err}}_{n,Q_{j+1}}(Q_j)\right\| + 4\varepsilon,

where E represents the expectation with respect to the distribution (1) generating the training set, and

$$\begin{aligned}
\int\left|V_{\tilde\pi}(o) - V_\pi(o)\right| dF(o) \le\ & \frac{2M\sqrt{p}\,M_Q}{\eta}\sum_{t=0}^{T} L^{t+1}\sum_{j=t}^{T}\left(\frac{L\,p\,M_Q^2}{\eta}\right)^{j-t}\left\|\widetilde{\mathrm{Err}}_{n,Q_{j+1}}(Q_j)\right\| + 8M\varepsilon \\
&+ 2M\sum_{t=0}^{T} L^{t+1}\, E\left|\bar Q_t(\bar O_t, \bar A_t) - \tilde Q_t(\bar O_t, \bar A_t)\right| + 2M\sum_{t=0}^{T} L^{t+1}\, E\left|\tilde Q_t(\bar O_t, \bar A_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_t)\right|
\end{aligned}$$

for n larger than

$$\left(\frac{C}{\varepsilon}\right)^2 \log\!\left(\frac{B}{\delta}\right)$$ (14)

where $C = 4\sqrt{2}\,M\,p^{T+1/2}\,M_Q^{2T+1}\,\eta^{-(T+1)}\,L^{T+1}$, M′ is a uniform upper bound on the absolute value of all $f \in \mathcal{F}'$, and $B = \varepsilon^{-2p}\,4^{6p+3}\,p^{2Tp+p+3}\,(T+1)^2\,e^{2p+2}\,(M')^{4p}\,|\mathcal{A}|^p\,M_Q^{(2T+1)2p}\,\eta^{-2p(T+1)}\,L^{2p(T+1)}$.

Remarks:

1. Define $\hat Q_t$ as a zero of $\widetilde{\mathrm{Err}}_{n,\hat Q_{t+1}}(Q_t)$, t = T, T−1, ..., 0 (recall that $\hat Q_{T+1}$ is identically zero). Suppose that $Q_t^* \in \mathcal{Q}_t$ for each t; in this case $\bar Q_t = Q_t^*$ for all t (we ignore sets of measure zero in this discussion). Then with probability greater than 1 − δ and with $\tilde\pi = \pi^*$, $\tilde Q_t = Q_t^*$, we obtain

$$\int\left[V^*(o) - V_{\hat\pi}(o)\right] dF(o) \le 8M\varepsilon$$

for all n satisfying (14). Thus estimating each Qt by solving $\widetilde{\mathrm{Err}}_{n,\hat Q_{t+1}}(Q_t) = 0$, t = T, ..., 0, yields a policy that consistently achieves the optimal value.

2. Again define $\hat Q_t$ as a zero of $\widetilde{\mathrm{Err}}_{n,\hat Q_{t+1}}(Q_t)$, t = T, T−1, ..., 0. Given two T+1 vectors of functions Q′ = {Q′0, …, Q′T} and Q = {Q0, …, QT}, define $d(Q', Q) = \sum_{t=0}^{T} L^{t+1} E\left|Q'_t(\bar O_t, \bar A_t) - Q_t(\bar O_t, \bar A_t)\right|$. Then the first result of Theorem 2 implies that $d(\bar Q, \hat Q)$ converges in probability to zero. From Lemma 2 we have that $\int|V_{\tilde\pi}(o) - V_\pi(o)|\,dF(o) \le 2M\,d(Q, \tilde Q) + 2M\,d(Q_{\tilde\pi}, \tilde Q)$, and thus $\int|V_{\tilde\pi}(o) - V_{\hat\pi}(o)|\,dF(o)$ is with high probability bounded above by $2M\,d(\bar Q, \tilde Q) + 2M\,d(Q_{\tilde\pi}, \tilde Q)$. Consequently the presence of the third and fourth terms in Theorem 2 is not surprising. It is unclear whether the “go-between” $\tilde Q_t$ is necessary.

3. Recall that the space of policies implied by the approximation spaces for the Q-functions is given by $\Pi_{\mathcal{Q}} = \{\pi_\theta, \theta \in \Theta\}$ where $\pi_\theta = \{\pi_{0,\theta}, \dots, \pi_{T,\theta}\}$ and where each $\pi_{t,\theta}(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} Q_t(\bar o_t, \bar a_t; \theta)$ for some $Q_t \in \mathcal{Q}_t$. Suppose that $\max_{\pi \in \Pi_{\mathcal{Q}}} \int V_\pi(o)\,dF(o)$ is achieved by some member of $\Pi_{\mathcal{Q}}$ and $\tilde\pi \in \arg\max_{\pi \in \Pi_{\mathcal{Q}}} \int V_\pi(o)\,dF(o)$. Ideally Q-learning would provide a policy that achieves the highest value as compared to other policies in $\Pi_{\mathcal{Q}}$ (as is the case with π̃). This is not necessarily the case. As discussed in the above remark, batch Q-learning yields estimated Q-functions for which $d(\bar Q, \hat Q)$ converges to zero. The policy π̄ may not produce a maximal value; that is, $\int [V_{\tilde\pi}(o) - V_{\bar\pi}(o)]\,dF(o)$ need not be zero (see also the remark following Lemma 2). Recall from Lemma 2 that $2M\,d(\bar Q, \tilde Q) + 2M\,d(\tilde Q, Q_{\tilde\pi})$ is an upper bound on this difference. It is not hard to see that $d(\tilde Q, Q_{\tilde\pi})$ is zero if and only if π̃ is the optimal policy; indeed the optimal Q-function would then belong to the approximation space. The Q-learning algorithm does not directly maximize the value function. As remarked in Tsitsiklis and van Roy (1997), the goal of the Q-learning algorithm is to construct an approximation to the optimal Q-function within the constraints imposed by the approximation space; this approximation is a projection when the approximation space is linear. Approximating the Q-function yields an optimal policy if the approximating class is sufficiently rich. Ormoneit and Sen (2002) consider a sequence of approximation spaces (kernel based spaces indexed by a bandwidth) and make assumptions on the optimal value function which guarantee that this sequence of approximation spaces is sufficiently rich (as the bandwidth decreases with increasing training set size) so as to approximate the optimal value function to any desired degree.

4. Again define $\hat Q_t$ as a zero of $\widetilde{\mathrm{Err}}_{n,\hat Q_{t+1}}(Q_t)$, t = T, T−1, ..., 0. Since $d(\bar Q, \hat Q)$ converges in probability to zero, one might think that $\int|V_{\bar\pi}(o) - V_{\hat\pi}(o)|\,dF(o)$ should be small as well. Referring to Lemma 1, we have that the difference in value functions $\int|V_{\bar\pi}(o) - V_{\hat\pi}(o)|\,dF(o)$ can be expressed as the sum over t of the expectation of $Q_{\bar\pi,t}(\bar O_t, \bar A_{t-1}, \hat\pi_t) - Q_{\bar\pi,t}(\bar O_t, \bar A_{t-1}, \bar\pi_t)$. However, $d(\bar Q, \hat Q)$ small does not imply that π̂ and π̄ will be close, nor does it imply that $Q_{\bar\pi,t}(\bar O_t, \bar A_{t-1}, \hat\pi_t) - Q_{\bar\pi,t}(\bar O_t, \bar A_{t-1}, \bar\pi_t)$ will be small. To see the former, consider an action space with 10 actions, 1, ..., 10, and $\hat Q_t(\bar o_t, \bar a_t) = 1$ for $a_t = 1, \dots, 9$, $\hat Q_t(\bar o_t, 10) = 1 + \frac{1}{2}\varepsilon$, and $\bar Q_t(\bar o_t, \bar a_t) = 1 - \frac{1}{2}\varepsilon$ for $a_t = 2, \dots, 10$, $\bar Q_t(\bar o_t, 1) = 1$. So $\bar Q_t$ and $\hat Q_t$ are uniformly less than ɛ apart, yet the arguments of their maxima are 1 and 10, respectively.
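The short Python snippet below just checks the 10-action example in this remark numerically; the chosen ɛ is arbitrary.

```python
import numpy as np

eps = 0.1
# Q-hat and Q-bar from the remark above (actions 1..10 indexed here as 0..9):
# they are uniformly within eps of each other, yet their greedy actions differ.
q_hat = np.ones(10)
q_hat[9] = 1 + eps / 2
q_bar = np.full(10, 1 - eps / 2)
q_bar[0] = 1.0
print(np.max(np.abs(q_hat - q_bar)) < eps)         # True: uniformly less than eps apart
print(np.argmax(q_bar) + 1, np.argmax(q_hat) + 1)  # 1 and 10: the greedy actions differ
```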

Proof of Theorem 2

Fix $Q_t = \theta_t^T q_t$, $\theta_t \in \Theta$, for t = 0, ..., T. Define an infinite training sample version of $\widetilde{\mathrm{Err}}_n$ as

$$\widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t) = E\!\left[\left(R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_t\right) q_t\right] = E\!\left[\left(\bar Q_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - \bar Q_{t+1}(\bar O_{t+1}, \bar A_t, \bar\pi_{t+1}) - Q_t\right) q_t\right]$$

where $Q_t$ is an abbreviation for $Q_t(\bar O_t, \bar A_t)$. To derive the last equality, recall that $\bar Q_t(\bar O_t, \bar A_t)$ is the projection of $E[R_t + \bar Q_{t+1}(\bar O_{t+1}, \bar A_t, \bar\pi_{t+1}) \mid \bar O_t, \bar A_t]$ on the space spanned by $q_t$. Since $\bar Q_t$ is a projection, we can write $\bar Q_t = \theta_{\bar\pi,t}^T q_t$ for some $\theta_{\bar\pi,t} \in \Theta$. Also we can write $Q_t = \theta_t^T q_t$ for some $\theta_t \in \Theta$. The $\widetilde{\mathrm{Err}}$'s provide a pointwise upper bound on the differences $|\bar Q_t - Q_t|$, as follows. Rearrange the terms in $\widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t)$, using the fact that $E[q_t q_t^T]$ is invertible, to obtain

$$\theta_{\bar\pi,t} - \theta_t = \left(E[q_t q_t^T]\right)^{-1}\widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t) - \left(E[q_t q_t^T]\right)^{-1}E\!\left[\left(\max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - \bar Q_{t+1}(\bar O_{t+1}, \bar A_t, \bar\pi_{t+1})\right) q_t\right].$$

Denote the Euclidean norm of a p dimensional vector x by ||x||. Then

$$\begin{aligned}
\left|(\theta_{\bar\pi,t} - \theta_t)^T q_t\right| &\le (1/\eta)\left\|\widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t)\right\|\,\|q_t\| + (1/\eta)\,\left\|E\!\left[\left|\max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - \bar Q_{t+1}(\bar O_{t+1}, \bar A_t, \bar\pi_{t+1})\right|\, q_t\right]\right\|\,\|q_t\| \\
&\le (1/\eta)\left\|\widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t)\right\|\,\|q_t\| + (1/\eta)\,L\,E\!\left[\left|Q_{t+1} - \bar Q_{t+1}\right|\,\|q_t\|\right]\,\|q_t\| \\
&\le (1/\eta)\sqrt{p}\,M_Q\left\|\widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t)\right\| + (1/\eta)\,L\,p\,M_Q^2\,E\!\left[\left|Q_{t+1} - \bar Q_{t+1}\right|\right]
\end{aligned}$$

for t ≤ T. To summarize,

$$E\left|\bar Q_t - Q_t\right| \le (1/\eta)\sqrt{p}\,M_Q\left\|\widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t)\right\| + (1/\eta)\,L\,p\,M_Q^2\,E\!\left[\left|Q_{t+1} - \bar Q_{t+1}\right|\right]$$

where $Q_t$ and $\bar Q_t$ are abbreviations for $Q_t(\bar O_t, \bar A_t)$ and $\bar Q_t(\bar O_t, \bar A_t)$, respectively, for each t.

As in the proof of Theorem 1, these inequalities can be solved for each E|Q¯t-Qt| to yield

$$\begin{aligned}
E\left|\bar Q_t - Q_t\right| &\le \frac{\sqrt{p}\,M_Q}{\eta}\sum_{j=t}^{T}\left(\frac{L\,p\,M_Q^2}{\eta}\right)^{j-t}\left\|\widetilde{\mathrm{Err}}_{Q_{j+1}}(Q_j)\right\| \\
&\le \frac{\sqrt{p}\,M_Q}{\eta}\sum_{j=t}^{T}\left(\frac{L\,p\,M_Q^2}{\eta}\right)^{j-t}\left\|\widetilde{\mathrm{Err}}_{n,Q_{j+1}}(Q_j) - \widetilde{\mathrm{Err}}_{Q_{j+1}}(Q_j)\right\| + \frac{\sqrt{p}\,M_Q}{\eta}\sum_{j=t}^{T}\left(\frac{L\,p\,M_Q^2}{\eta}\right)^{j-t}\left\|\widetilde{\mathrm{Err}}_{n,Q_{j+1}}(Q_j)\right\|.
\end{aligned}$$

Simplifying terms we obtain

$$\sum_{t=0}^{T} L^{t+1}\, E\left|\bar Q_t - Q_t\right| \le \frac{\sqrt{p}\,M_Q}{\eta}\sum_{t=0}^{T} L^{t+1}\sum_{j=t}^{T}\left(\frac{L\,p\,M_Q^2}{\eta}\right)^{j-t}\left\|\widetilde{\mathrm{Err}}_{n,Q_{j+1}}(Q_j)\right\| + 4\,p^{T+1/2}\,M_Q^{2T+1}\,\eta^{-(T+1)}\,L^{T+1}\max_t\left\|\widetilde{\mathrm{Err}}_{n,Q_{t+1}}(Q_t) - \widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t)\right\|.$$ (15)

Consider each component of each of the T + 1, p-dimensional vectors $\widetilde{\mathrm{Err}}_{n,Q_{i+1}}(Q_i) - \widetilde{\mathrm{Err}}_{Q_{i+1}}(Q_i)$, for an ɛ′ > 0:

$$P\!\left[\bigcup_{i=0}^{T}\bigcup_{j=1}^{p}\left\{\text{for some } \theta_i, \theta_{i+1} \in \Theta,\ Q_i \in \mathcal{Q}_i,\ Q_{i+1} \in \mathcal{Q}_{i+1}:\ \left|\widetilde{\mathrm{Err}}_{n,Q_{i+1}}(Q_i)_j - \widetilde{\mathrm{Err}}_{Q_{i+1}}(Q_i)_j\right| > \varepsilon'\right\}\right].$$

This probability is in turn bounded above by

$$\sum_{i=0}^{T}\sum_{j=1}^{p} P\!\left[\text{for some } \theta_i, \theta_{i+1} \in \Theta,\ Q_i \in \mathcal{Q}_i,\ Q_{i+1} \in \mathcal{Q}_{i+1}:\ \left|\widetilde{\mathrm{Err}}_{n,Q_{i+1}}(Q_i)_j - \widetilde{\mathrm{Err}}_{Q_{i+1}}(Q_i)_j\right| > \varepsilon'\right].$$

In Lemmas 17.2, 17.3, 17.5, Anthony and Bartlett (1999) provide an upper bound on the probability

$$P\!\left[\text{for some } f \in \mathcal{F}:\ \left|E_n(\ell_f) - E(\ell_f)\right| \ge \varepsilon\right]$$

where $\ell_f(x, y) = (y - f(x))^2$. These same lemmas (based on the classical arguments of symmetrization, permutation and reduction to a finite set) can be used for $f \in \mathcal{F}'$ since the functions in $\mathcal{F}'$ are uniformly bounded. Hence for each j = 1, ..., p and t = 0, ..., T,

$$P\!\left[\text{for some } \theta_t, \theta_{t+1} \in \Theta,\ Q_t \in \mathcal{Q}_t,\ Q_{t+1} \in \mathcal{Q}_{t+1}:\ \left|\widetilde{\mathrm{Err}}_{n,Q_{t+1}}(Q_t)_j - \widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t)_j\right| > \varepsilon'\right] \le 4N_1\!\left(\frac{\varepsilon'}{16M'}, \mathcal{F}', 2n\right)\exp\!\left\{\frac{-(\varepsilon')^2 n}{32(M')^2}\right\}.$$

Set $\varepsilon = p^{T+1/2}\,M_Q^{2T+1}\,\eta^{-(T+1)}\,L^{T+1}\,\varepsilon'$. Thus for n satisfying

$$4p(T+1)\, N_1\!\left(\frac{\varepsilon}{16M'\,p^{T+1/2}M_Q^{2T+1}\eta^{-(T+1)}L^{T+1}}, \mathcal{F}', 2n\right)\exp\!\left\{\frac{-\varepsilon^2 n}{32(M')^2\left(p^{T+1/2}M_Q^{2T+1}\eta^{-(T+1)}L^{T+1}\right)^2}\right\} \le \delta,$$ (16)

the first result of the theorem holds.

To simplify the constraint on n, we derive a covering number for F′ from covering numbers for the Qt’s. Apply Lemma A2 part 1, to obtain

$$N_1(\varepsilon, \mathcal{V}_{t+1}, n) \le N_1\!\left(\frac{\varepsilon}{|\mathcal{A}|}, \mathcal{Q}_{t+1}, |\mathcal{A}|n\right)$$

for $\mathcal{V}_{t+1} = \{\max_{a_{t+1}} Q_{t+1}(\bar o_{t+1}, \bar a_t, a_{t+1}) : Q_{t+1} \in \mathcal{Q}_{t+1}\}$. Next apply Lemma A2, parts 2 and 3, to obtain

$$N_1(\varepsilon, \mathcal{F}', n) \le \sum_{t=0}^{T-1} N_1\!\left(\frac{\varepsilon}{2|\mathcal{A}|M'}, \mathcal{Q}_{t+1}, |\mathcal{A}|n\right) N_1\!\left(\frac{\varepsilon}{2M'}, \mathcal{Q}_t, n\right) + N_1\!\left(\frac{\varepsilon}{M'}, \mathcal{Q}_T, n\right).$$

Theorems 11.6 and 18.4 of Anthony and Bartlett imply that $N_1(\varepsilon, \mathcal{Q}_t, n) \le e(p+1)\left(\frac{2e}{\varepsilon}\right)^{p}$ for each t. Combining this upper bound with (16) and simplifying the algebra yields (14).

Next Lemma 2 implies:

$$\int\left|V_{\tilde\pi}(o) - V_\pi(o)\right| dF(o) \le M\sum_{t=0}^{T} 2L^{t+1}\, E\left|Q_t - \bar Q_t\right| + M\sum_{t=0}^{T} 2L^{t+1}\, E\left|\bar Q_t - \tilde Q_t\right| + M\sum_{t=0}^{T} 2L^{t+1}\, E\left|\tilde Q_t - Q_{\tilde\pi,t}\right|.$$

This combined with the first result of the theorem implies the second result.

6. Discussion

Planning problems involving a single training set of trajectories are not unusual and can be expected to increase due to the widespread use of policies in the social and behavioral/medical sciences (see, for example, Rush et al., 2003; Altfeld and Walker, 2001; Brooner, and Kidorf, 2002); at this time these policies are formulated using expert opinion, clinical experience and/or theoretical models. However there is growing interest in formulating these policies using empirical studies (training sets). These training sets are collected under fixed exploration policies and thus while they allow exploration they do not allow exploitation, that is, online choice of the actions. If subjects are recruited into the study at a much slower rate than the calendar duration of the horizon, then it is possible to permit some exploitation; some of this occurs in the field of cancer research (Thall, Sung and Estey, 2002).

This paper considers the use of Q-learning with dynamic programming and function approximation for this planning purpose. However, the mismatch between Q-learning and the goal of learning a policy that maximizes the value function has serious consequences and emphasizes the need to use all available science in choosing the approximation space. Often the available behavioral or psychosocial theories provide qualitative information concerning the importance of different observations. In addition, these theories are often represented graphically via directed acyclic graphs. However, information at the level of the form of the conditional distributions connecting the nodes in the graphs is mostly unavailable. Also, due to the complexity of the problems, there are often unknown missing common causes of different nodes in the graphs. See http://neuromancer.eecs.umich.edu/dtr for more information and references. Methods that can use this qualitative information to minimize the mismatch are needed.

Acknowledgments

We would like to acknowledge the help of the reviewers and of Min Qian in improving this paper. Support for this project was provided by the National Institutes of Health (NIDA grants K02 DA15674 and P50 DA10075 to the Methodology Center).

Appendix A

Recall that the distributions P and Pπ differ only with regard to the policy (see (1) and (2)). Thus the following result is unsurprising. Let $f(\bar O_{T+1}, \bar A_T)$ be a (measurable) nonnegative function; then $E_\pi f$ can be expressed in terms of an expectation with respect to the distribution P if we assume that $p_t(a_t \mid \bar o_t, \bar a_{t-1}) > 0$ for each $(\bar o_t, \bar a_t)$ pair and each t. The presence of the $p_\ell$'s in the denominator below represents the price we pay because we only have access to training trajectories with distribution P; we do not have access to trajectories from distribution Pπ.

Lemma A1 Assume that $P_\pi[p_0(A_0 \mid S_0) > 0] = 1$ and $P_\pi[p_t(A_t \mid \bar O_t, \bar A_{t-1}) > 0] = 1$ for t = 1, ..., T. For any (measurable) nonnegative function $g(\bar O_t, \bar A_t)$, the P-probability that

$$E_\pi\!\left[g(\bar O_t, \bar A_t) \,\middle|\, S_0\right] = E\!\left[\left(\prod_{\ell=0}^{t}\frac{\mathbf{1}_{A_\ell = \pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\right) g(\bar O_t, \bar A_t) \,\middle|\, S_0\right]$$

is one for t = 0,...,T.

Proof: We need only prove that

$$E\!\left[h(S_0)\, E_\pi\!\left[g(\bar O_t, \bar A_t) \,\middle|\, S_0\right]\right] = E\!\left[h(S_0)\, E\!\left[\left(\prod_{\ell=0}^{t}\frac{\mathbf{1}_{A_\ell=\pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\right) g(\bar O_t, \bar A_t) \,\middle|\, S_0\right]\right]$$

for any (measurable) nonnegative function, h. Consider the two likelihoods ((1) and (2)) for a trajectory up to time t. Denote the dominating measure for the two likelihoods for the trajectory up to time t as λt. By assumption,

$$\begin{aligned}
&\int h(s_0)\, g(\bar o_t, \bar a_t)\left(\prod_{\ell=0}^{t}\frac{\mathbf{1}_{a_\ell=\pi_\ell}}{p_\ell(a_\ell \mid \bar o_\ell, \bar a_{\ell-1})}\right) f_0(s_0)\, p_0(a_0 \mid s_0)\prod_{j=1}^{t} f_j(o_j \mid \bar o_{j-1}, \bar a_{j-1})\, p_j(a_j \mid \bar o_j, \bar a_{j-1})\, d\lambda_t(\bar o_t, \bar a_t) \\
&\quad = \int h(s_0)\, g(\bar o_t, \bar a_t)\, f_0(s_0)\, \mathbf{1}_{a_0=\pi_0(s_0)}\prod_{j=1}^{t} f_j(o_j \mid \bar o_{j-1}, \bar a_{j-1})\, \mathbf{1}_{a_j=\pi_j(\bar o_j, \bar a_{j-1})}\, d\lambda_t(\bar o_t, \bar a_t).
\end{aligned}$$

By definition the left hand side is $E\!\left[h(S_0)\, g(\bar O_t, \bar A_t)\left(\prod_{\ell=0}^{t}\frac{\mathbf{1}_{A_\ell=\pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\right)\right]$ and the right hand side is $E_\pi[h(S_0)\, g(\bar O_t, \bar A_t)]$. Expressing both sides as the expectation of a conditional expectation, we obtain

$$E_\pi\!\left[h(S_0)\, E_\pi\!\left[g(\bar O_t, \bar A_t) \,\middle|\, S_0\right]\right] = E\!\left[h(S_0)\, E\!\left[g(\bar O_t, \bar A_t)\left(\prod_{\ell=0}^{t}\frac{\mathbf{1}_{A_\ell=\pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\right) \,\middle|\, S_0\right]\right].$$

Note that the distribution of S0 is the same regardless of how the actions are chosen, that is the distribution of S0 is the same under both P and Pπ. Thus

$$E\!\left[h(S_0)\, E_\pi\!\left[g(\bar O_t, \bar A_t) \,\middle|\, S_0\right]\right] = E\!\left[h(S_0)\, E\!\left[g(\bar O_t, \bar A_t)\left(\prod_{\ell=0}^{t}\frac{\mathbf{1}_{A_\ell=\pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\right) \,\middle|\, S_0\right]\right].$$
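A minimal Monte Carlo sketch of the reweighting identity in Lemma A1, on a toy one-step problem with two actions, is given below; the policy, test function and exploration probabilities are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def pi(o):                      # a fixed deterministic policy (illustrative)
    return int(o > 0.5)

def g(o, a):                    # an arbitrary nonnegative test function
    return (o + 1.0) * (a + 1.0)

# Data generated under the exploration policy p (both action probabilities > 0).
n = 200_000
O = rng.uniform(size=n)
p_explore = np.array([0.3, 0.7])
A = rng.choice(2, size=n, p=p_explore)

# Weight 1{A = pi(O)} / p(A | O) converts exploration data into an estimate of E_pi[g].
weights = (A == np.vectorize(pi)(O)) / p_explore[A]
lhs = np.mean(g(O, np.vectorize(pi)(O)))   # direct expectation under pi
rhs = np.mean(weights * g(O, A))           # importance-weighted exploration data
print(round(lhs, 3), round(rhs, 3))        # the two estimates should roughly agree
```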

Lemma A2 For p, q, r, s, N positive integers and MF, MG, MΘ positive reals, define the following classes of real valued functions,

$$\begin{aligned}
\mathcal{H} &\subseteq \left\{h(x, a) : x \in \mathbb{R}^p,\ a \in \{1, \dots, N\}\right\} \\
\mathcal{F} &\subseteq \left\{f(x) : x \in \mathbb{R}^q,\ \sup_x |f(x)| \le M_F\right\} \\
\mathcal{G} &\subseteq \left\{g(x, y) : x \in \mathbb{R}^q,\ y \in \mathbb{R}^r,\ \sup_{x,y} |g(x, y)| \le M_G\right\}
\end{aligned}$$

and

$$\Theta \subseteq \left\{\theta \in \mathbb{R}^s : \max_{i=1,\dots,s} |\theta_i| \le M_\Theta\right\}.$$

The following hold.

1. If V = {maxa h(x, a) : h ∈ H} then N1(ɛ, V, n) ≤ N1(ɛ/N, H, Nn).

2. For |a| ≤ 1, |b| ≤ 1, if V = {a f (x) + bg(x, y) : f ∈ F, g ∈ G} then N1(ɛ, V, n) ≤ N1(ɛ/2, F, n)N1(ɛ/2, G, n).

3. If V = F ∪ G then N1(ɛ,V, n) ≤ N1(ɛ,F, n) + N1(ɛ, G, n).

4. If V = {θ1 f1(x) + ... + θs fs(x) : fi ∈ F, (θ1, ..., θs) ∈ Θ} then $N_1(\varepsilon, V, n) \le e(s+1)\left(\frac{4esM_\Theta M_F}{\varepsilon}\right)^s N_1\!\left(\frac{\varepsilon}{4sM_\Theta}, \mathcal{F}, n\right)^s$.

Proof. We prove 1 and 4; the proofs of 2 and 3 are straightforward and are omitted. Consider 1. Given $(x_1, \dots, x_n)$, the ɛ-covering number for the class of points in $\mathbb{R}^{Nn}$, $\{(h(x_i, a) : i = 1, \dots, n,\ a = 1, \dots, N) : h \in \mathcal{H}\}$, is bounded above by $N_1(\varepsilon, \mathcal{H}, Nn)$. Note that for $(z_{ia},\ i = 1, \dots, n,\ a = 1, \dots, N)$,

$$\frac{1}{n}\sum_{i=1}^{n}\left|\max_{a=1,\dots,N} h(x_i, a) - \max_{a=1,\dots,N} z_{ia}\right| \le \frac{1}{n}\sum_{i=1}^{n}\max_{a=1,\dots,N}\left|h(x_i, a) - z_{ia}\right| \le \frac{1}{n}\sum_{i=1}^{n}\sum_{a=1}^{N}\left|h(x_i, a) - z_{ia}\right|.$$

Thus the ɛ-covering number for the class of points in $\mathbb{R}^n$, $\{(\max_{a=1,\dots,N} h(x_i, a) : i = 1, \dots, n) : h \in \mathcal{H}\}$, is bounded above by $N_1(\varepsilon, \mathcal{H}, Nn)$. Using the definition of covering numbers for classes of functions, we obtain $N_1(\varepsilon, V, n) \le N_1(\varepsilon/N, \mathcal{H}, Nn)$.

Next consider 4. Put $x = (x_1, \dots, x_n)$ (each $x_i \in \mathbb{R}^q$) and $f(x_i) = (f_1(x_i), \dots, f_s(x_i))^T$. Then there exist $\{z_1, \dots, z_N\}$ ($N = N_1(\varepsilon/(4sM_\Theta), \mathcal{F}, n)$; $z_j \in \mathbb{R}^n$) that form the centers of an ɛ/(4sMΘ)-cover for $\mathcal{F}$. To each $z_j$ we can associate an $f \in \mathcal{F}$, say $f_j^*$, so that $\{f_1^*, \dots, f_N^*\}$ form the centers of an ɛ/(2sMΘ)-cover for $\mathcal{F}$. Then given $\{f_1, \dots, f_s\} \subseteq \mathcal{F}$ there exists, for each j = 1, ..., s, an index $j^* \in \{1, \dots, N\}$ so that $\max_{1\le j\le s}\max_{1\le i\le n} |f_j(x_i) - f^*_{j^*}(x_i)| \le \varepsilon/(2sM_\Theta)$. Then

$$\frac{1}{n}\sum_{i=1}^{n}\left|\sum_{j=1}^{s}\theta_j\left(f_j(x_i) - f^*_{j^*}(x_i)\right)\right| \le \varepsilon/2.$$

Define $\mathcal{F}^* = \left\{\sum_{j=1}^{s}\theta_j f_j^{*} : (\theta_1, \dots, \theta_s) \in \Theta\right\}$. Theorems 11.6 and 18.4 of Anthony and Bartlett (1999) imply that $N_1(\varepsilon/2, \mathcal{F}^*, n) \le e(s+1)\left(\frac{4esM_\Theta M_F}{\varepsilon}\right)^s$. These two combine to yield the result.

References

  1. Altfeld M, Walker BD. Less is more? STI in acute and chronic HIV-1 infection. Nature Medicine. 2001;7:881–884.
  2. M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge, UK: Cambridge University Press, 1999.
  3. L. Baird. Advantage updating. Technical Report WL-TR-93-1146, Wright-Patterson Air Force Base, 1993.
  4. Baxter J, Bartlett PL. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research. 2001;15:319–350.
  5. R. E. Bellman. Dynamic Programming. Princeton: Princeton University Press, 1957.
  6. D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
  7. Brooner RK, Kidorf M. Using behavioral reinforcement to improve methadone treatment participation. Science and Practice Perspectives. 2002;1:38–48.
  8. Fava M, Rush AJ, Trivedi MH, Nierenberg AA, Thase ME, Sackeim HA, Quitkin FM, Wisniewski S, Lavori PW, Rosenbaum JF, Kupfer DJ. Background and rationale for the sequenced treatment alternatives to relieve depression (STAR*D) study. Psychiatric Clinics of North America. 2003;26(3):457–494.
  9. C. N. Fiechter. Efficient reinforcement learning. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory (COLT 1994), pages 88–97, New Brunswick, NJ, 1994.
  10. C. N. Fiechter. Expected mistake bound model for on-line reinforcement learning. In Proceedings of the Fourteenth International Conference on Machine Learning, Douglas H. Fisher (Ed.), pages 116–124, Nashville, Tennessee, 1997.
  11. S. M. Kakade. On the Sample Complexity of Reinforcement Learning. Ph.D. thesis, University College London, 2003.
  12. Kearns M, Mansour Y, Ng AY. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning. 1999;49(2–3):193–208.
  13. M. Kearns, Y. Mansour and A. Y. Ng. Approximate planning in large POMDPs via reusable trajectories. In Advances in Neural Information Processing Systems 12, MIT Press, 2000.
  14. Ormoneit D, Sen S. Kernel-based reinforcement learning. Machine Learning. 2002;49(2–3):161–178.
  15. L. Peshkin and C. R. Shelton. Learning from scarce experience. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), Claude Sammut, Achim G. Hoffmann (Eds.), pages 498–505, Sydney, Australia, 2002.
  16. Rush AJ, Crismon ML, Kashner TM, Toprac MG, Carmody TJ, Trivedi MH, Suppes T, Miller AL, Biggs MM, Shores-Wilson K, Witte BP, Shon SP, Rago WV, Altshuler KZ, TMAP Research Group. Texas medication algorithm project, phase 3 (TMAP-3): Rationale and study design. Journal of Clinical Psychiatry. 2003;64(4):357–369.
  17. Schneider LS, Tariot PN, Lyketsos CG, Dagerman KS, Davis KL, Davis S, Hsiao JK, Jeste DV, Katz IR, Olin JT, Pollock BG, Rabins PV, Rosenheck RA, Small GW, Lebowitz B, Lieberman JA. National Institute of Mental Health clinical antipsychotic trials of intervention effectiveness (CATIE). American Journal of Geriatric Psychiatry. 2001;9(4):346–360.
  18. Schapire RE, Bartlett P, Freund Y, Lee WS. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics. 1998;26(5):1651–1686.
  19. D. I. Simester, P. Sun, and J. N. Tsitsiklis. Dynamic catalog mailing policies. Unpublished manuscript, available electronically at http://web.mit.edu/jnt/www/Papers/P-03-sun-catalog-rev2.pdf, 2003.
  20. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
  21. Thall PF, Millikan RE, Sung HG. Evaluating multiple treatment courses in clinical trials. Statistics in Medicine. 2000;19:1011–1028.
  22. Thall PF, Sung HG, Estey EH. Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials. Journal of the American Statistical Association. 2002;97:29–39.
  23. Tsitsiklis JN, Van Roy B. Feature-based methods for large scale dynamic programming. Machine Learning. 1996;22:59–94.
  24. Tsitsiklis JN, Van Roy B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control. 1997;42(5):674–690.
  25. A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, New York, 1996.
  26. C. J. C. H. Watkins. Learning from Delayed Rewards. Ph.D. thesis, Cambridge University, 1989.
