Published in final edited form as: J Mach Learn Res. 2005 Jul;6:1073–1097.

A Generalization Error for Q-Learning

Susan A. Murphy

Abstract

Planning problems that involve learning a policy from a single training set of finite horizon trajectories arise in both social science and medical fields. We consider Q-learning with function approximation for this setting and derive an upper bound on the generalization error. This upper bound is in terms of quantities minimized by a Q-learning algorithm, the complexity of the approximation space and an approximation term due to the mismatch between Q-learning and the goal of learning a policy that maximizes the value function.

Keywords: multistage decisions, dynamic programming, reinforcement learning, batch data

1. Introduction

In many areas of the medical and social sciences the following planning problem arises. A training set or batch of n trajectories of T + 1 decision epochs is available for estimating a policy. A decision epoch at time t, t = 0, 1, ..., T, is composed of information observed at time t, Ot, an action taken at time t, At, and a reward, Rt. For example, there are currently a number of ongoing large clinical trials for chronic disorders in which, each time an individual relapses, the individual is re-randomized to one of several further treatments (Schneider et al., 2001; Fava et al., 2003; Thall et al., 2000). These are finite horizon problems with T generally quite small (T = 2–4) and with a known exploration policy. Scientists want to estimate the best “strategies,” i.e. policies, for managing the disorder. Alternatively, the training set of n trajectories may be historical; for example, data in which clinicians and their patients are followed, with information about disease process, treatment burden and treatment decisions recorded through time. Again the goal is to estimate the best policy for managing the disease. Alternatively, consider either catalog merchandising or charitable solicitation; information about the client, and whether or not a solicitation is made and/or the form of the solicitation, is recorded through time (Simester, Sun and Tsitsiklis, 2003). The goal is to estimate the best policy for deciding which clients should receive a mailing and the form of the mailing. These latter planning problems can be viewed as infinite horizon problems, but only T decision epochs per client are recorded. If T is large, the rewards are bounded and the dynamics are stationary Markovian, then this finite horizon problem provides an approximation to the discounted infinite horizon problem (Kearns, Mansour and Ng, 2000).

These planning problems are characterized by unknown system dynamics and thus can also be viewed as learning problems. Note there is no access to a generative model, no online simulation model, and no ability to conduct offline simulation. All that is available is the n trajectories of T + 1 decision epochs. One approach to learning a policy in this setting is Q-learning (Watkins, 1989), since the actions in the training set are chosen according to a (non-optimal) exploration policy; Q-learning is an off-policy method (Sutton and Barto, 1998). When the observables are vectors of continuous variables or are otherwise of high dimension, Q-learning must be combined with function approximation.

The contributions of this paper are as follows. First, a version of Q-learning with function approximation, suitable for learning a policy with one training set of finite horizon trajectories and a large observation space, is introduced; this “batch” version of Q-learning processes the entire training set of trajectories prior to updating the approximations to the Q-functions. An incremental implementation of batch Q-learning results in one-step Q-learning with function approximation. Second, performance guarantees for this version of Q-learning are provided. These performance guarantees do not assume that the system dynamics are Markovian. The performance guarantees are upper bounds on the average difference in value functions, or more specifically the average generalization error. Here the generalization error for batch Q-learning is defined analogously to the generalization error in supervised learning (Schapire et al., 1998); it is the average difference in value when using the optimal policy as compared to using the greedy policy (from Q-learning) in generating a separate test set. The performance guarantees are analogous to performance guarantees available in supervised learning (Anthony and Bartlett, 1999).

The upper bounds on the average generalization error permit an additional contribution. These upper bounds illuminate the mismatch between Q-learning with function approximation and the goal of finding a policy maximizing the value function (see the remark following Lemma 2 and the third remark following Theorem 2). This mismatch occurs because the Q-learning algorithm with function approximation does not directly maximize the value function; rather this algorithm approximates the optimal Q-function within the constraints of the approximation space in a least squares sense. This point is discussed at some length in section 3 of Tsitsiklis and van Roy (1997).

In the process of providing an upper bound on the average generalization error, finite sample bounds on the difference in average values resulting from different policies are derived. There are three terms in the upper bounds. The first term is a function of the optimization criterion used in batch Q-learning, the second term is due to the complexity of the approximation space and the last term is an approximation error due to the above mentioned mismatch. The second term, which is a function of the complexity of the approximation space, is similar in form to generalization error bounds derived for supervised learning with neural networks as in Anthony and Bartlett (1999). From the work of Kearns, Mansour, and Ng (1999, 2000) and Peshkin and Shelton (2002), we expect, and indeed find here, that the number of trajectories needed to guarantee a specified error level is exponential in the horizon time, T. The upper bound does not depend on the dimension of the observables Ot. This is in contrast to the results of Fiechter (1994, 1997), in which the upper bound on the average generalization error depends on the number of possible values for the observables.

A further contribution is that the upper bound on the average generalization error provides a mechanism for generalizing ideas from supervised learning to reinforcement learning. First, if the optimal Q-function belongs to the approximation space, then the upper bounds imply that batch Q-learning is a PAC reinforcement learning algorithm as in Fiechter (1994, 1997); see the first remark following Theorem 1. Second, the upper bounds provide a starting point for using structural risk minimization for model selection (see the second remark after Theorem 1).

In Section 2, we review the definition of the value function and Q-function for a (possibly non-stationary, non-Markovian) finite horizon decision process. In Section 3 we review batch Q-learning with function approximation when the learning algorithm must use a training set of n trajectories. In Section 4 the generalization error is expressed in terms of advantages, and in Section 5 we provide the two main results, both of which give the number of trajectories needed to achieve a given error level with a specified level of certainty.

2. Preliminaries

In the following we use upper case letters, such as O and A, to denote random variables and lower case letters, such as o and a, to denote realizations or values of the random variables. Each of the n trajectories is composed of the sequence {O0, A0, O1, . . . , AT, OT+1} where T is a finite constant. Define the history $\bar O_t = \{O_0, \dots, O_t\}$ and similarly $\bar A_t = \{A_0, \dots, A_t\}$. Each action At takes values in a finite, discrete action space $\mathcal{A}$ and Ot takes values in the observation space $\mathcal{O}$. The observation space may be multidimensional and continuous. The arguments below will not require the Markovian assumption that the value of Ot equals the state at time t. The rewards are $R_t = r_t(\bar O_t, \bar A_t, O_{t+1})$ for rt a reward function and for each 0 ≤ t ≤ T (if the Markov assumption holds then replace the history $\bar O_t$ with $O_t$ and $\bar A_t$ with $A_t$). We assume that the rewards are bounded, taking values in the interval [0, 1].
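As a concrete illustration of this data structure, the following shows one trajectory with T = 2 stored in a simple dictionary format; this format and its field names are illustrative assumptions, not part of the paper.

```python
# One trajectory {O_0, A_0, O_1, A_1, O_2, A_2, O_3} with T = 2, stored in a
# hypothetical dictionary format (field names are illustrative only).
trajectory = {
    "obs":     [0.7, 0.2, 0.9, 0.4],   # O_0, O_1, O_2, O_{T+1}: length T + 2
    "actions": [1, 0, 1],              # A_0, A_1, A_2: length T + 1, values in a finite action space
    "rewards": [0.0, 0.5, 1.0],        # R_0, R_1, R_2: length T + 1, each in [0, 1]
}
```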

We assume the trajectories are sampled at random according to a fixed distribution denoted by P. Thus the trajectories are generated by one fixed distribution. This distribution is composed of the unknown distribution of each Ot conditional on $(\bar O_{t-1}, \bar A_{t-1})$ (call these unknown conditional densities $\{f_0, \dots, f_{T+1}\}$) and an exploration policy for generating the actions. Denote the exploration policy by $p = \{p_0, \dots, p_T\}$, where the probability that action a is taken given history $\{\bar O_t, \bar A_{t-1}\}$ is $p_t(a \mid \bar O_t, \bar A_{t-1})$ (if the Markov assumption holds then, as before, replace $\bar O_t$ with $O_t$ and $\bar A_{t-1}$ with $A_{t-1}$). We assume that $p_t(a \mid \bar o_t, \bar a_{t-1}) > 0$ for each action $a \in \mathcal{A}$ and for each possible value $(\bar o_t, \bar a_{t-1})$; that is, at each time all actions are possible. Then the likelihood (under P) of the trajectory $\{o_0, a_0, o_1, \dots, a_T, o_{T+1}\}$ is

$$f_0(o_0)\, p_0(a_0 \mid o_0) \prod_{t=1}^{T} f_t(o_t \mid \bar o_{t-1}, \bar a_{t-1})\, p_t(a_t \mid \bar o_t, \bar a_{t-1})\, f_{T+1}(o_{T+1} \mid \bar o_T, \bar a_T).$$ (1)

Denote expectations with respect to the distribution P by an E.

Define a deterministic, but possibly non-stationary and non-Markovian, policy, π, as a sequence of decision rules, $\{\pi_0, \dots, \pi_T\}$, where the output of the time t decision rule, $\pi_t(\bar o_t, \bar a_{t-1})$, is an action. Let the distribution Pπ denote the distribution of a trajectory in which the policy π is used to generate the actions. Then the likelihood (under Pπ) of the trajectory $\{o_0, a_0, o_1, \dots, a_T, o_{T+1}\}$ is

$$f_0(o_0)\, \mathbf{1}_{a_0 = \pi_0(o_0)} \prod_{j=1}^{T} f_j(o_j \mid \bar o_{j-1}, \bar a_{j-1})\, \mathbf{1}_{a_j = \pi_j(\bar o_j, \bar a_{j-1})}\, f_{T+1}(o_{T+1} \mid \bar o_T, \bar a_T)$$ (2)

where for a predicate W , 1W is 1 if W is true and is 0 otherwise. Denote expectations with respect to the distribution Pπ by an Eπ.

Note that since (1) and (2) differ only in regard to the policy for generating actions, an expectation with respect to either P or Pπ that does not involve integration over the policy results in the same quantity. For example, E [Rt |Ot , At ] = Eπ [Rt |Ot , At ], for any policy π.

Let Π be the collection of all policies. In a finite horizon planning problem (permitting non-stationary, non-Markovian policies) the goal is to estimate a policy that maximizes $E_\pi\big[\sum_{j=0}^{T} R_j \mid O_0 = o_0\big]$ over π ∈ Π. If the system dynamics are Markovian and each $r_j(o_j, a_j, o_{j+1}) = \gamma^j r(o_j, a_j, o_{j+1})$ for r a bounded reward function and γ ∈ (0, 1) a discount factor, then this finite horizon problem provides an approximation to the discounted infinite horizon problem (Kearns, Mansour and Ng, 2000) for T large.

Given a policy, π, the value function for an observation, o0, is

$$V_\pi(o_0) = E_\pi\!\left[\sum_{j=0}^{T} R_j \,\middle|\, O_0 = o_0\right].$$

The t-value function for policy π is the value of the rewards summed from time t on and is

$$V_{\pi,t}(\bar o_t, \bar a_{t-1}) = E_\pi\!\left[\sum_{j=t}^{T} R_j \,\middle|\, \bar O_t = \bar o_t,\, \bar A_{t-1} = \bar a_{t-1}\right].$$

If the Markovian assumption holds then (ot , at−1) in the definition of Vπ,t is replaced by ot . Note that the time 0 value function is simply the value function (Vπ,0 = Vπ). For convenience, set Vπ,T +1 = 0. Then the value functions satisfy the following relationship:

$$V_{\pi,t}(\bar o_t, \bar a_{t-1}) = E_\pi\!\left[R_t + V_{\pi,t+1}(\bar O_{t+1}, \bar A_t) \,\middle|\, \bar O_t = \bar o_t,\, \bar A_{t-1} = \bar a_{t-1}\right]$$

for t = 0,...,T . The time t Q-function for policy π is

$$Q_{\pi,t}(\bar o_t, \bar a_t) = E\!\left[R_t + V_{\pi,t+1}(\bar O_{t+1}, \bar A_t) \,\middle|\, \bar O_t = \bar o_t,\, \bar A_t = \bar a_t\right].$$

(The subscript, π, can be omitted as this expectation is with respect to the distribution of Ot+1 given (Ot , At ), ft+1; this conditional distribution does not depend on the policy.) In Section 4 we express the difference in value functions for policy π˜ and policy π in terms of the advantages (as defined in Baird, 1993). The time t advantage is

$$\mu_{\pi,t}(\bar o_t, \bar a_t) = Q_{\pi,t}(\bar o_t, \bar a_t) - V_{\pi,t}(\bar o_t, \bar a_{t-1}).$$

The advantage can be interpreted as the gain in performance obtained by taking action at at time t and following policy π thereafter, as compared to following policy π from time t on.

The optimal value function V *(o) for an observation o is

$$V^*(o) = \max_{\pi \in \Pi} V_\pi(o)$$

and the optimal t-value function for history (ot , at−1) is

$$V_t^*(\bar o_t, \bar a_{t-1}) = \max_{\pi \in \Pi} V_{\pi,t}(\bar o_t, \bar a_{t-1}).$$

As is well-known, the optimal value functions satisfy the Bellman equations (Bellman, 1957)

$$V_t^*(\bar o_t, \bar a_{t-1}) = \max_{a_t \in \mathcal{A}} E\!\left[R_t + V_{t+1}^*(\bar O_{t+1}, \bar A_t) \,\middle|\, \bar O_t = \bar o_t,\, \bar A_t = \bar a_t\right].$$

Optimal, deterministic, time t decision rules must satisfy

$$\pi_t^*(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t \in \mathcal{A}} E\!\left[R_t + V_{t+1}^*(\bar O_{t+1}, \bar A_t) \,\middle|\, \bar O_t = \bar o_t,\, \bar A_t = \bar a_t\right].$$

The optimal time t Q-function is

$$Q_t^*(\bar o_t, \bar a_t) = E\!\left[R_t + V_{t+1}^*(\bar O_{t+1}, \bar A_t) \,\middle|\, \bar O_t = \bar o_t,\, \bar A_t = \bar a_t\right],$$

and thus the optimal time t advantage, which is given by

$$\mu_t^*(\bar o_t, \bar a_t) = Q_t^*(\bar o_t, \bar a_t) - V_t^*(\bar o_t, \bar a_{t-1}),$$

is always nonpositive and furthermore is maximized in $a_t$ at $a_t = \pi_t^*(\bar o_t, \bar a_{t-1})$.

3. Batch Q-Learning

We consider a version of Q-learning for use in learning a non-stationary, non-Markovian policy with one training set of finite horizon trajectories. The term “batch” Q-learning is used to emphasize that learning occurs only after the collection of the training set. The Q-functions are estimated using an approximator (e.g. neural networks, decision trees, etc.) (Bertsekas and Tsitsiklis, 1996; Tsitsiklis and van Roy, 1997) and then the estimated decision rules are the argmax of the estimated Q-functions. Let $\mathcal{Q}_t$ be the approximation space for the tth Q-function, e.g. $\mathcal{Q}_t = \{Q_t(\bar o_t, \bar a_t; \theta) : \theta \in \Theta\}$; θ is a vector of parameters taking values in a parameter space Θ which is a subset of a Euclidean space. For convenience set $\mathcal{Q}_{T+1}$ equal to zero and write $E_n f$ for the expectation of an arbitrary function, f, of a trajectory with respect to the probability obtained by choosing a trajectory uniformly from the training set of n trajectories (for example, $E_n[f(O_t)] = \frac{1}{n}\sum_{i=1}^{n} f(O_{it})$ for $O_{it}$ the tth observation in the ith trajectory). In batch Q-learning using dynamic programming and function approximation, solve the following backwards through time, t = T, T − 1, ..., 0, to obtain

$$\theta_t \in \arg\min_{\theta} E_n\!\left[R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}; \theta_{t+1}) - Q_t(\bar O_t, \bar A_t; \theta)\right]^2.$$ (3)

Suppose that the Q-functions are approximated by linear combinations of p features ($\mathcal{Q}_t = \{\theta^T q_t(\bar o_t, \bar a_t) : \theta \in \mathbb{R}^p\}$); then to achieve (3), solve backwards through time, t = T, T − 1, ..., 0,

$$0 = E_n\!\left[\left(R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}; \theta_{t+1}) - Q_t(\bar O_t, \bar A_t; \theta_t)\right) q_t(\bar O_t, \bar A_t)^T\right]$$ (4)

for $\theta_t$.
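To make the backward recursion in (3)–(4) concrete, the following is a minimal sketch in Python of batch Q-learning with linear function approximation. It assumes the hypothetical trajectory format shown earlier and hypothetical feature maps $q_t$; the names `batch_q_learning`, `trajectories` and `feature_fns` are illustrative, not from the paper. Each backward step solves the linear estimating equation (4) by least squares, with $Q_{T+1} \equiv 0$.

```python
import numpy as np

def batch_q_learning(trajectories, feature_fns, n_actions, T):
    """Sketch of batch Q-learning with linear function approximation: solve the
    estimating equation (4) backwards through time t = T, ..., 0 by least squares.

    trajectories: list of dicts with keys 'obs', 'actions', 'rewards' (hypothetical format).
    feature_fns:  feature_fns[t](obs_history, action_history, a_t) -> feature vector q_t.
    """
    thetas = [None] * (T + 1)
    for t in range(T, -1, -1):
        X, y = [], []
        for traj in trajectories:
            o, a, r = traj['obs'], traj['actions'], traj['rewards']
            # Backed-up target: R_t + max_{a'} Q_{t+1}(history_{t+1}, a'; theta_{t+1}),
            # with Q_{T+1} taken to be identically zero.
            if t == T:
                target = r[t]
            else:
                q_next = [feature_fns[t + 1](o[:t + 2], a[:t + 1], ap) @ thetas[t + 1]
                          for ap in range(n_actions)]
                target = r[t] + max(q_next)
            X.append(feature_fns[t](o[:t + 1], a[:t], a[t]))
            y.append(target)
        X, y = np.array(X), np.array(y)
        # Least-squares solution of E_n[(target - theta^T q_t) q_t^T] = 0.
        thetas[t], *_ = np.linalg.lstsq(X, y, rcond=None)
    return thetas
```

The greedy (estimated) decision rule at time t is then the argmax over actions of $\theta_t^T q_t$.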

An incremental implementation of (3) and (4), with updates between trajectories, results in one-step Q-learning (Sutton and Barto, 1998, pg. 148; put γ = 1, assume the Markov property and no need for function approximation). This is not surprising as Q-learning can be viewed as approximating least squares value iteration (Tsitsiklis and van Roy, 1996). To see the connection, consider the following generic derivation of an incremental update. Denote the ith example in a training set by Xi. Define $\hat\theta^{(n)}$ to be a solution of $\sum_{i=1}^{n} f(X_i, \theta) = 0$ for f a given p-dimensional vector of functions and each integer n. Using a Taylor series, expand $\sum_{i=1}^{n+1} f(X_i, \hat\theta^{(n+1)})$ in $\hat\theta^{(n+1)}$ about $\hat\theta^{(n)}$ to obtain a between-example update to $\hat\theta^{(n)}$:

$$\hat\theta^{(n+1)} \approx \hat\theta^{(n)} + \frac{1}{n+1}\left(E_{n+1}\!\left[-\frac{\partial}{\partial\theta} f(X, \theta)\Big|_{\theta=\hat\theta^{(n)}}\right]\right)^{-1} f(X_{n+1}, \hat\theta^{(n)}).$$

Replace $\frac{1}{n+1}\left(E_{n+1}\!\left[-\frac{\partial}{\partial\theta} f(X, \theta)\big|_{\theta=\hat\theta^{(n)}}\right]\right)^{-1}$ by a step-size $\alpha_n$ ($\alpha_n \to 0$ as $n \to \infty$) to obtain a general formula for the incremental implementation. Now consider an incremental implementation of (4) for each t = 0, ..., T. Then for each t, $X = (\bar O_{t+1}, \bar A_t)$, $\theta = \theta_t$ and

$$f(X, \theta_t) = \left(R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}; \hat\theta_{t+1}^{(n+1)}) - Q_t(\bar O_t, \bar A_t; \theta_t)\right) q_t(\bar O_t, \bar A_t)^T$$

is a vector of dimension p. The incremental update is

$$\hat\theta_t^{(n+1)} \approx \hat\theta_t^{(n)} + \alpha_n \left(R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}; \hat\theta_{t+1}^{(n+1)}) - Q_t(\bar O_t, \bar A_t; \hat\theta_t^{(n)})\right) q_t(\bar O_t, \bar A_t)^T$$

for t = 0,...,T . This is the one-step update of Sutton and Barto (1998, pg. 148) with γ = 1 and generalized to permit function approximation and nonstationary Q-functions and is analogous to the TD(0) update of Tsitsiklis and van Roy (1997) permitting non-Markovian, nonstationary value functions.
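A minimal sketch of this incremental (between-trajectory) update is given below, under the same assumptions as the earlier code block; the function and argument names are illustrative only.

```python
import numpy as np

def one_step_update(theta, theta_next, alpha, r_t, q_t_vec, q_next_all):
    """One incremental update of theta_t, following the display above with gamma = 1.

    q_t_vec:    feature vector q_t(O_t, A_t) for the observed action.
    q_next_all: array of feature vectors q_{t+1}(., a') for every candidate a'
                (pass an empty array when t = T, since Q_{T+1} is identically zero).
    """
    if len(q_next_all) > 0:
        backup = np.max(q_next_all @ theta_next)  # max_{a'} Q_{t+1}(.; theta_{t+1})
    else:
        backup = 0.0                              # Q_{T+1} identically zero
    td_error = r_t + backup - q_t_vec @ theta     # temporal-difference term
    return theta + alpha * td_error * q_t_vec     # step along the feature vector
```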

Denote the estimator of the optimal Q-functions based on the training data by $\hat Q_t$ for t = 0, ..., T (for simplicity, the dependence on n is omitted from the notation). The estimated policy, $\hat\pi$, satisfies $\hat\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} \hat Q_t(\bar o_t, \bar a_t)$ for each t. Note that members of the approximation space $\mathcal{Q}_t$ need not be “Q-functions” for any policy. For example, the Q-functions corresponding to the use of a policy π ($Q_{\pi,t}$, t = 0, ..., T) must satisfy

$$E\!\left[R_t + V_{\pi,t+1}(\bar O_{t+1}, \bar A_t) \,\middle|\, \bar O_t, \bar A_t\right] = Q_{\pi,t}(\bar O_t, \bar A_t)$$

where $V_{\pi,t+1}(\bar O_{t+1}, \bar A_t) = Q_{\pi,t+1}(\bar O_{t+1}, \bar A_t, a_{t+1})$ with $a_{t+1}$ set equal to $\pi_{t+1}(\bar O_{t+1}, \bar A_t)$. Q-learning does not impose this restriction on $\{\hat Q_t, t = 0, \dots, T\}$; indeed it may be that no member of the approximation space can satisfy this restriction. Nonetheless we refer to the $\hat Q_t$'s as estimated Q-functions. Note also that the approximation for the Q-functions, combined with the definition of the estimated decision rules as the argmax of the estimated Q-functions, places implicit restrictions on the set of policies that will be considered. In effect the space of interesting policies is no longer Π but rather $\Pi_{\mathcal{Q}} = \{\pi_\theta, \theta \in \Theta\}$ where $\pi_\theta = \{\pi_{0,\theta}, \dots, \pi_{T,\theta}\}$ and where each $\pi_{t,\theta}(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} Q_t(\bar o_t, \bar a_t; \theta)$ for some $Q_t \in \mathcal{Q}_t$.

4. Generalization Error

Define the generalization error of a policy π at an observation o0 as the average difference between the optimal value function and the value function resulting from the use of policy π in generating a separate test set. The generalization error of policy π at observation o0 can be written as

$$V^*(o_0) - V_\pi(o_0) = -E_\pi\!\left[\sum_{t=0}^{T} \mu_t^*(\bar O_t, \bar A_t) \,\middle|\, O_0 = o_0\right]$$ (5)

where Eπ denotes the expectation using the likelihood (2). So the generalization error can be expressed in terms of the optimal advantages evaluated at actions determined by policy π; that is, when each $A_t = \pi_t(\bar O_t, \bar A_{t-1})$. Thus the closer each optimal advantage $\mu_t^*(\bar O_t, \bar A_t)$, for $A_t = \pi_t(\bar O_t, \bar A_{t-1})$, is to zero, the smaller the generalization error. Recall that the optimal advantage $\mu_t^*(\bar O_t, \bar A_t)$ is zero when $A_t = \pi_t^*(\bar O_t, \bar A_{t-1})$. The display in (5) follows from Kakade's (2003, ch. 5) expression for the difference between the value functions for two policies.

Lemma 1

Given policies π˜ and π,

$$V_{\tilde\pi}(o_0) - V_\pi(o_0) = -E_\pi\!\left[\sum_{t=0}^{T} \mu_{\tilde\pi,t}(\bar O_t, \bar A_t) \,\middle|\, O_0 = o_0\right].$$

Set π̃ = π* to obtain (5). An alternative to Kakade's (2003) proof is as follows.

Proof. First note

$$V_\pi(o_0) = E_\pi\!\left[\sum_{t=0}^{T} R_t \,\middle|\, O_0 = o_0\right] = E_\pi\!\left[E_\pi\!\left[\sum_{t=0}^{T} R_t \,\middle|\, \bar O_T, \bar A_T\right] \,\middle|\, O_0 = o_0\right].$$ (6)

And $E_\pi\big[\sum_{t=0}^{T} R_t \mid \bar O_T, \bar A_T\big]$ is the expectation with respect to the distribution of $O_{T+1}$ given the history $(\bar O_T, \bar A_T)$; this is the density $f_{T+1}$ from Section 2 and $f_{T+1}$ does not depend on the policy used to choose the actions. Thus we may subscript E by either π or π̃ without changing the expectation: $E_\pi\big[\sum_{t=0}^{T} R_t \mid \bar O_T, \bar A_T\big] = E_{\tilde\pi}\big[\sum_{t=0}^{T} R_t \mid \bar O_T, \bar A_T\big] = \sum_{t=0}^{T-1} R_t + Q_{\tilde\pi,T}(\bar O_T, \bar A_T)$. The conditional expectation can be written as a telescoping sum:

$$E_\pi\!\left[\sum_{t=0}^{T} R_t \,\middle|\, \bar O_T, \bar A_T\right] = \sum_{t=0}^{T}\left[Q_{\tilde\pi,t}(\bar O_t, \bar A_t) - V_{\tilde\pi,t}(\bar O_t, \bar A_{t-1})\right] + \sum_{t=1}^{T}\left[R_{t-1} + V_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}) - Q_{\tilde\pi,t-1}(\bar O_{t-1}, \bar A_{t-1})\right] + V_{\tilde\pi,0}(O_0).$$

The first sum is the sum of the advantages. The second sum is a sum of temporal-difference errors; integrating the temporal-difference error with respect to the conditional distribution of Ot given (Ot − 1, At − 1), denoted by ft in Section 2, we obtain zero,

$$E\!\left[R_{t-1} + V_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}) \,\middle|\, \bar O_{t-1}, \bar A_{t-1}\right] = Q_{\tilde\pi,t-1}(\bar O_{t-1}, \bar A_{t-1})$$

(as before E denotes expectation with respect to (1); recall that expectations that do not integrate over the policy can be written either with an E or an Eπ). Substitute the telescoping sum into (6) and note that Vπ˜,0(O0)=Vπ˜(O0) to obtain the result. ▪

In the following lemma the difference between the value functions corresponding to two policies, π̃ and π, is bounded in terms of both the L1 and L2 distances between any functions {Q0, Q1, ..., QT} satisfying $\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} Q_t(\bar o_t, \bar a_t)$, t = 0, ..., T, and any functions $\{\tilde Q_0, \tilde Q_1, \dots, \tilde Q_T\}$ satisfying $\tilde\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} \tilde Q_t(\bar o_t, \bar a_t)$, t = 0, ..., T, together with the distance between the $\tilde Q_t$ and the Q-functions for policy π̃. We assume that there exists a positive constant L for which $p_t(a_t \mid \bar o_t, \bar a_{t-1}) \ge L^{-1}$ for each t and all pairs $(\bar o_t, \bar a_{t-1})$; if the stochastic decision rule $p_t$ were uniform then L would be the size of the action space.

Lemma 2

For all functions $Q_t$ satisfying $\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} Q_t(\bar o_t, \bar a_t)$, t = 0, ..., T, and all functions $\tilde Q_t$ satisfying $\tilde\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} \tilde Q_t(\bar o_t, \bar a_t)$, t = 0, ..., T, we have

$$|V_{\tilde\pi}(o_0) - V_\pi(o_0)| \le \sum_{t=0}^{T} 2L^{t+1} E\!\left[\left|Q_t(\bar O_t, \bar A_t) - \tilde Q_t(\bar O_t, \bar A_t)\right| \,\middle|\, O_0 = o_0\right] + \sum_{t=0}^{T} 2L^{t+1} E\!\left[\left|\tilde Q_t(\bar O_t, \bar A_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_t)\right| \,\middle|\, O_0 = o_0\right]$$

and

$$|V_{\tilde\pi}(o_0) - V_\pi(o_0)| \le \sum_{t=0}^{T} 2L^{(t+1)/2} \sqrt{E\!\left[\left(Q_t(\bar O_t, \bar A_t) - \tilde Q_t(\bar O_t, \bar A_t)\right)^2 \,\middle|\, O_0 = o_0\right]} + \sum_{t=0}^{T} 2L^{(t+1)/2} \sqrt{E\!\left[\left(\tilde Q_t(\bar O_t, \bar A_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_t)\right)^2 \,\middle|\, O_0 = o_0\right]},$$

where E denotes expectation with respect to the distribution generating the training sample (1).

Remark:

1. Note that in general $\arg\max_{a_t} Q_{\tilde\pi,t}(\bar o_t, \bar a_t)$ may not be $\tilde\pi_t$, thus we cannot choose $\tilde Q_t = Q_{\tilde\pi,t}$. However, if π̃ = π* then we can choose $\tilde Q_t = Q_t^*$ ($= Q_{\pi^*,t}$ by definition) and the second term in both upper bounds is equal to zero.

2. This result can be used to emphasize one aspect of the mismatch between estimating the optimal Q-function and the goal of learning a policy that maximizes the value function. Suppose $\tilde Q_t = Q_t^*$ and π̃ = π*. The generalization error satisfies

$$V^*(o_0) - V_\pi(o_0) \le \sum_{t=0}^{T} 2L^{(t+1)/2} \sqrt{E\!\left[\left(Q_t(\bar O_t, \bar A_t) - Q_t^*(\bar O_t, \bar A_t)\right)^2 \,\middle|\, O_0 = o_0\right]}$$

for Qt any function satisfying $\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} Q_t(\bar o_t, \bar a_t)$. Absent restrictions on the Qt's, this inequality cannot be improved in the sense that choosing each $Q_t = Q_t^*$ and $\pi_t = \pi_t^*$ yields 0 on both sides of the inequality. However an inequality in the opposite direction is not possible since, as was seen in Lemma 1, $V^*(o_0) - V_\pi(o_0)$ involves the Q-functions only through the advantages (see also (7) below with π̃ = π*). Thus for the difference in value functions to be small, the average difference between $Q_t(\bar o_t, \bar a_t) - \max_{a_t} Q_t(\bar o_t, \bar a_t)$ and $Q_t^*(\bar o_t, \bar a_t) - \max_{a_t} Q_t^*(\bar o_t, \bar a_t)$ must be small; this does not require that the average difference between $Q_t(\bar o_t, \bar a_t)$ and $Q_t^*(\bar o_t, \bar a_t)$ is small. The mismatch is not unexpected. For example, Baxter and Bartlett (2001) provide an example in which the approximation space for the value function includes a value function for which the greedy policy is optimal, yet the greedy policy found by temporal difference learning (TD(1)) performs very poorly.

Proof

Define $\mu_t(\bar o_t, \bar a_t) = Q_t(\bar o_t, \bar a_t) - \max_{a_t} Q_t(\bar o_t, \bar a_{t-1}, a_t)$ for each t; note that $\mu_t(\bar o_t, \bar a_t)$ evaluated at $a_t = \pi_t(\bar o_t, \bar a_{t-1})$ is zero. Start with the result of Lemma 1. Then note the difference between the value functions can be expressed as

$$V_{\tilde\pi}(o_0) - V_\pi(o_0) = \sum_{t=0}^{T} E_\pi\!\left[\mu_t(\bar O_t, \bar A_t) - \mu_{\tilde\pi,t}(\bar O_t, \bar A_t) \,\middle|\, O_0 = o_0\right]$$ (7)

since Pπ puts $a_t = \pi_t(\bar o_t, \bar a_{t-1})$ and $\mu_t(\bar o_t, \bar a_t) = 0$ for this value of at. When it is clear from the context, $\mu_t$ ($\mu_{\tilde\pi,t}$) is used as an abbreviation for $\mu_t(\bar O_t, \bar A_t)$ ($\mu_{\tilde\pi,t}(\bar O_t, \bar A_t)$) in the following. Also $Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, a_t)$ with $a_t$ replaced by $\tilde\pi_t(\bar O_t, \bar A_{t-1})$ is written as $Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, \tilde\pi_t)$. Consider the absolute value of the tth integrand in (7):

$$\left|\mu_t - \mu_{\tilde\pi,t}\right| = \left|Q_t(\bar O_t, \bar A_t) - \max_{a_t} Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_t) + Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, \tilde\pi_t)\right| \le \left|Q_t(\bar O_t, \bar A_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_t)\right| + \left|\max_{a_t} Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, \tilde\pi_t)\right|.$$

Since $\max_{a_t} \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t) = \tilde Q_t(\bar O_t, \bar A_{t-1}, \tilde\pi_t)$ and, for any functions h and h′, $|\max_{a_t} h(a_t) - \max_{a_t} h'(a_t)| \le \max_{a_t} |h(a_t) - h'(a_t)|$,

$$\left|\max_{a_t} Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, \tilde\pi_t)\right| \le \max_{a_t}\left|Q_t(\bar O_t, \bar A_{t-1}, a_t) - \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t)\right| + \left|\tilde Q_t(\bar O_t, \bar A_{t-1}, \tilde\pi_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, \tilde\pi_t)\right|.$$

We obtain |μt-μπ˜,t|

$$\begin{aligned}
&2\max_{a_t}\left|Q_t(\bar O_t, \bar A_{t-1}, a_t) - \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t)\right| + 2\max_{a_t}\left|\tilde Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, a_t)\right| \\
&\quad\le 2L\sum_{a_t}\left|Q_t(\bar O_t, \bar A_{t-1}, a_t) - \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t)\right| p_t(a_t \mid \bar O_t, \bar A_{t-1}) + 2L\sum_{a_t}\left|\tilde Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, a_t)\right| p_t(a_t \mid \bar O_t, \bar A_{t-1}).
\end{aligned}$$ (8)

Insert the above into (7) and use Lemma A1 to obtain |Vπ˜(o0)-Vπ(o0)|

$$\begin{aligned}
&2L\sum_{t=0}^{T}\Bigg\{E_\pi\!\left[\sum_{a_t}\left|Q_t(\bar O_t, \bar A_{t-1}, a_t) - \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t)\right| p_t(a_t \mid \bar O_t, \bar A_{t-1}) \,\middle|\, O_0 = o_0\right] + E_\pi\!\left[\sum_{a_t}\left|\tilde Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, a_t)\right| p_t(a_t \mid \bar O_t, \bar A_{t-1}) \,\middle|\, O_0 = o_0\right]\Bigg\} \\
&= 2L\sum_{t=0}^{T}\Bigg\{E\!\left[\left(\prod_{\ell=0}^{t-1}\frac{\mathbf{1}_{A_\ell = \pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\right)\left|Q_t - \tilde Q_t\right| \,\middle|\, O_0 = o_0\right] + E\!\left[\left(\prod_{\ell=0}^{t-1}\frac{\mathbf{1}_{A_\ell = \pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\right)\left|\tilde Q_t - Q_{\tilde\pi,t}\right| \,\middle|\, O_0 = o_0\right]\Bigg\} \\
&\le 2\sum_{t=0}^{T} L^{t+1} E\!\left[\left|Q_t - \tilde Q_t\right| \,\middle|\, O_0 = o_0\right] + 2\sum_{t=0}^{T} L^{t+1} E\!\left[\left|\tilde Q_t - Q_{\tilde\pi,t}\right| \,\middle|\, O_0 = o_0\right]
\end{aligned}$$

($Q_t$ and $Q_{\tilde\pi,t}$ are used as abbreviations for $Q_t(\bar O_t, \bar A_t)$ and $Q_{\tilde\pi,t}(\bar O_t, \bar A_t)$, respectively). This completes the proof of the first result.

Start from (8) and use Hölder’s inequality to obtain, |Vπ˜(o0)-Vπ(o0)|

$$\begin{aligned}
&2\sum_{t=0}^{T}\Big\{E_\pi\!\left[\max_{a_t}\left|Q_t(\bar O_t, \bar A_{t-1}, a_t) - \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t)\right| \,\middle|\, O_0 = o_0\right] + E_\pi\!\left[\max_{a_t}\left|\tilde Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, a_t)\right| \,\middle|\, O_0 = o_0\right]\Big\} \\
&\le 2\sum_{t=0}^{T}\Big\{\sqrt{E_\pi\!\left[\max_{a_t}\left|Q_t(\bar O_t, \bar A_{t-1}, a_t) - \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t)\right|^2 \,\middle|\, O_0 = o_0\right]} + \sqrt{E_\pi\!\left[\max_{a_t}\left|\tilde Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, a_t)\right|^2 \,\middle|\, O_0 = o_0\right]}\Big\} \\
&\le 2\sum_{t=0}^{T}\sqrt{L\,E_\pi\!\left[\sum_{a_t}\left|Q_t(\bar O_t, \bar A_{t-1}, a_t) - \tilde Q_t(\bar O_t, \bar A_{t-1}, a_t)\right|^2 p_t(a_t \mid \bar O_t, \bar A_{t-1}) \,\middle|\, O_0 = o_0\right]} + 2\sum_{t=0}^{T}\sqrt{L\,E_\pi\!\left[\sum_{a_t}\left|\tilde Q_t(\bar O_t, \bar A_{t-1}, a_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_{t-1}, a_t)\right|^2 p_t(a_t \mid \bar O_t, \bar A_{t-1}) \,\middle|\, O_0 = o_0\right]}.
\end{aligned}$$

Now use Lemma A1 and the lower bound on the pt’s to obtain the result,

$$\begin{aligned}
|V_{\tilde\pi}(o_0) - V_\pi(o_0)| &\le 2\sum_{t=0}^{T}\Bigg\{\sqrt{L\,E\!\left[\prod_{\ell=0}^{t-1}\frac{\mathbf{1}_{A_\ell=\pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\left(Q_t - \tilde Q_t\right)^2 \,\middle|\, O_0 = o_0\right]} + \sqrt{L\,E\!\left[\prod_{\ell=0}^{t-1}\frac{\mathbf{1}_{A_\ell=\pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\left(\tilde Q_t - Q_{\tilde\pi,t}\right)^2 \,\middle|\, O_0 = o_0\right]}\Bigg\} \\
&\le 2\sum_{t=0}^{T} L^{(t+1)/2}\left\{\sqrt{E\!\left[\left(Q_t - \tilde Q_t\right)^2 \,\middle|\, O_0 = o_0\right]} + \sqrt{E\!\left[\left(\tilde Q_t - Q_{\tilde\pi,t}\right)^2 \,\middle|\, O_0 = o_0\right]}\right\}.
\end{aligned}$$

5. Finite Sample Upper Bounds on the Average Generalization Error

Traditionally the performance of a policy π is evaluated in terms of the maximum generalization error, $\max_o [V^*(o) - V_\pi(o)]$ (Bertsekas and Tsitsiklis, 1996). However, here we consider an average generalization error as in Kakade (2003) (see also Fiechter, 1997; Kearns, Mansour and Ng, 2000; Peshkin and Shelton, 2002); that is, $\int [V^*(o) - V_\pi(o)]\, dF(o)$ for a specified distribution F on the observation space. The choice of F with density f = f0 (f0 is the density of O0 in likelihoods (1) and (2)) is particularly appealing in the development of a policy in many medical and social science applications. In these cases, f0 represents the distribution of initial observations corresponding to a particular population of subjects. The goal is to produce a good policy for this population of subjects. In general, as in Kakade (2003), F may be chosen to incorporate domain knowledge concerning the steady state distribution of a good policy. If only a training set of trajectories is available for learning and we are unwilling to assume that the system dynamics are Markovian, then the choice of F is constrained by the following consideration. If the distribution of O0 in the training set (f0) assigns mass zero to an observation o′, then the training data will not be able to tell us anything about Vπ(o′). Similarly, if f0 assigns a very small positive mass to o′ then only an exceptionally large training set will permit an accurate estimate of Vπ(o′). Of course this will not be a problem for the average generalization error, as long as F also assigns very low mass to o′. Consequently, in our construction of the finite sample error bounds for the average generalization error, we will only consider distributions F for which the density of F, say f, satisfies $\sup_o \left|\frac{f(o)}{f_0(o)}\right| \le M$ for some finite constant M. In this case the average generalization error is bounded above by

$$\int \left[V^*(o) - V_\pi(o)\right] dF(o) \le M\, E\!\left[V^*(O_0) - V_\pi(O_0)\right] = -M\, E_\pi\!\left[\sum_{t=0}^{T} \mu_t^*(\bar O_t, \bar A_t)\right].$$

The equality is a consequence of (5) and the fact that the distribution of O0 is the same under likelihoods (1) and (2).

In the following theorem a non-asymptotic upper bound on the average generalization error is provided; this upper bound depends on the number of trajectories in the training set (n), the performance of the approximation on the training set, the complexity of the approximation space and of course on the confidence (δ) and accuracy (ɛ) demanded. The batch Q-learning algorithm minimizes quadratic forms (see (3)); thus we represent the performance of functions {Q0, Q1,..., QT} on the training set by these quadratic forms,

$$\mathrm{Err}_{n,Q_{t+1}}(Q_t) = E_n\!\left[R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_t(\bar O_t, \bar A_t)\right]^2$$

for each t (recall QT+1 is set to zero and En represents the expectation with respect to the probability obtained by choosing a trajectory uniformly from the training set).

The complexity of each $\mathcal{Q}_t$ space can be represented by its covering number (Anthony and Bartlett, 1999, pg. 148). Suppose $\mathcal{F}$ is a class of functions from a space, X, to ℝ. For a sequence $x = (x_1, \dots, x_n) \in X^n$, define $\mathcal{F}|_x$ to be the subset of $\mathbb{R}^n$ given by $\mathcal{F}|_x = \{(f(x_1), \dots, f(x_n)) : f \in \mathcal{F}\}$. Define the metric $d_p$ on $\mathbb{R}^n$ by $d_p(z, y) = \left(\frac{1}{n}\sum_{i=1}^{n}|z_i - y_i|^p\right)^{1/p}$ for p a positive integer (for p = ∞, define $d_\infty(z, y) = \max_{i=1,\dots,n}|z_i - y_i|$). Then $N(\varepsilon, \mathcal{F}|_x, d_p)$ is defined as the minimum cardinality of an ɛ-cover of $\mathcal{F}|_x$ with respect to the metric $d_p$. Next, given ɛ > 0, a positive integer n, the metric $d_p$ and a function class $\mathcal{F}$, the covering number for $\mathcal{F}$ is defined as

$$N_p(\varepsilon, \mathcal{F}, n) = \max\left\{N(\varepsilon, \mathcal{F}|_x, d_p) : x \in X^n\right\}.$$
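As a small illustration of this definition, the sketch below computes a greedy ɛ-cover of $\mathcal{F}|_x$ under $d_1$ for a toy class of threshold functions. Greedy covering only gives an upper bound on $N(\varepsilon, \mathcal{F}|_x, d_1)$, and the class, the points x and the function names are arbitrary illustrations, not from the paper.

```python
import numpy as np

def greedy_cover_size(points, eps, p=1):
    """Greedy upper bound on N(eps, F|x, d_p): size of an eps-cover of the rows of
    `points` (each row is (f(x_1), ..., f(x_n)) for one f in F) under the
    normalized metric d_p(z, y) = (1/n * sum |z_i - y_i|^p)^(1/p)."""
    centers = []
    for z in np.asarray(points, dtype=float):
        if not any(np.mean(np.abs(z - c) ** p) ** (1.0 / p) <= eps for c in centers):
            centers.append(z)          # z not yet covered; make it a new center
    return len(centers)

# Toy class: thresholds f_b(x) = 1{x >= b}, restricted to x = (0.1, ..., 1.0).
x = np.linspace(0.1, 1.0, 10)
F_restricted = [(x >= b).astype(float) for b in np.linspace(0, 1, 101)]
print(greedy_cover_size(F_restricted, eps=0.2))
```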

In the following theorem, $\mathcal{F} = \{\max_{a_{t+1}} Q_{t+1}(\bar o_{t+1}, \bar a_t, a_{t+1}) - Q_t(\bar o_t, \bar a_t) : Q_t \in \mathcal{Q}_t,\ Q_{t+1} \in \mathcal{Q}_{t+1},\ t = 0, \dots, T\}$ and $(x)_+$ is x if x > 0 and zero otherwise.

Theorem 1

Assume that the functions in $\mathcal{Q}_t$, t = 0, ..., T, are uniformly bounded. Suppose that there exists a positive constant, say L, for which $p_t(a_t \mid \bar o_t, \bar a_{t-1}) \ge L^{-1}$ for all $(\bar o_t, \bar a_{t-1})$ pairs, 0 ≤ t ≤ T. Then for ɛ > 0 and with probability at least 1 − δ over the random choice of the training set, every choice of functions $Q_j \in \mathcal{Q}_j$, j = 0, ..., T, with associated policy π defined by $\pi_j(\bar o_j, \bar a_{j-1}) \in \arg\max_{a_j} Q_j(\bar o_j, \bar a_j)$, and every choice of functions $\tilde Q_j \in \mathcal{Q}_j$, j = 0, ..., T, with associated policy π̃ defined by $\tilde\pi_j(\bar o_j, \bar a_{j-1}) \in \arg\max_{a_j} \tilde Q_j(\bar o_j, \bar a_j)$, satisfies the following bound:

$$\begin{aligned}
\int \left|V_{\tilde\pi}(o) - V_\pi(o)\right| dF(o) \le\ & 6ML^{1/2}\sum_{t=0}^{T}\left[\sum_{i=t}^{T} (16)^{i-t} L^{i}\left(\mathrm{Err}_{n,Q_{i+1}}(Q_i) - \mathrm{Err}_{n,Q_{i+1}}(\tilde Q_i)\right)_+\right]^{1/2} + 12ML^{1/2}\varepsilon \\
&+ 6ML^{1/2}\sum_{t=0}^{T}\sum_{i=t}^{T} (16)^{(i-t)/2} L^{i/2} \sqrt{E\!\left[\tilde Q_i(\bar O_i, \bar A_i) - Q_{\tilde\pi,i}(\bar O_i, \bar A_i)\right]^2}
\end{aligned}$$

for n satisfying

$$4(T+1)\, N_1\!\left(\frac{\varepsilon^2}{32M'(16L)^{T+2}}, \mathcal{F}, 2n\right) \exp\!\left\{\frac{-\varepsilon^4 n}{32(M')^2 (16L)^{2(T+2)}}\right\} \le \delta$$ (9)

and where M′ is a uniform upper bound on the absolute value of f ∈ F and E represents the expectation with respect to the distribution (1) generating the training set.

Remarks:

1. Suppose that $Q_t^* \in \mathcal{Q}_t$ for each t. Select $\tilde Q_t = Q_t^*$ and $\hat Q_t \in \arg\min_{Q_t \in \mathcal{Q}_t} \mathrm{Err}_{n,\hat Q_{t+1}}(Q_t)$, t = T, T−1, ..., 0 (recall $Q_{T+1}$ and $\hat Q_{T+1}$ are identically zero). Then with probability greater than 1 − δ we obtain

$$\int \left[V^*(o) - V_{\hat\pi}(o)\right] dF(o) \le 12ML^{1/2}\varepsilon$$ (10)

for all n satisfying (9). Thus, as long as the covering numbers for each $\mathcal{Q}_t$ (and thus for $\mathcal{F}$) do not grow too fast, estimating each Qt by minimizing $\mathrm{Err}_{n,\hat Q_{t+1}}(Q_t)$ yields a policy that consistently achieves the optimal value. Suppose the approximation spaces $\mathcal{Q}_t$, t = 0, ..., T, are feed-forward neural networks as in remark 4 below. In this case the training set size n sufficient for (10) to hold need only be polynomial in (1/δ, 1/ɛ), and batch Q-learning is a probably approximately correct (PAC) reinforcement learning algorithm as defined by Fiechter (1997). As shown by Fiechter (1997), this algorithm can be converted to an efficient on-line reinforcement learning algorithm (here the word on-line implies updating the policy between trajectories).

2. Even when $Q_t^*$ does not belong to $\mathcal{Q}_t$ we can add the optimal Q-function at each time t to the approximation space $\mathcal{Q}_t$, at a cost of no more than an increase of 1 in the covering number $N_1\!\left(\frac{\varepsilon^2}{32M'(16L)^{T+2}}, \mathcal{F}, 2n\right)$. If we do this the result continues to hold when we set π̃ to an optimal policy π* and set $\tilde Q_t = Q_t^*$ for each t; the generalization error satisfies

$$\int \left[V^*(o) - V_\pi(o)\right] dF(o) \le 6ML^{1/2}\sum_{t=0}^{T}\left[\sum_{i=t}^{T}(16)^{i-t}L^{i}\,\mathrm{Err}_{n,Q_{i+1}}(Q_i)\right]^{1/2} + 12ML^{1/2}\varepsilon$$

for all n satisfying (9). This upper bound is consistent with the practice of using a policy $\hat\pi$ for which $\hat\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} \hat Q_t(\bar o_t, \bar a_t)$ and $\hat Q_t \in \arg\min_{Q_t \in \mathcal{Q}_t} \mathrm{Err}_{n,\hat Q_{t+1}}(Q_t)$. Given that the covering numbers for the approximation space can be expressed in a sufficiently simple form (as in remark 4 below), this upper bound can be used to carry out model selection via structural risk minimization (Vapnik, 1982). That is, one might consider a variety of approximation spaces and use structural risk minimization to choose, based on the training data, which approximation space is best. The resulting upper bound on the average generalization error can be found by using the above result and Lemma 15.5 of Anthony and Bartlett (1999).

3. The restriction on n in (9) is due to the complexity associated with the approximation space (i.e. the $\mathcal{Q}_t$'s). The restriction is crude; to see this, note that if there were only a finite number of functions in $\mathcal{F}$ then n would need only satisfy

$$2(T+1)\,|\mathcal{F}|\exp\!\left\{\frac{-2\varepsilon^4 n}{(3M')^2(16L)^{2(T+2)}}\right\} = \delta$$

(use Hoeffding’s inequality; see Anthony and Bartlett, 1999, pg. 361) and thus for a given (ɛ, δ) we may set the number of trajectories in the training set equal to $n = \frac{(3M')^2(16L)^{2(T+2)}}{2\varepsilon^4}\ln\!\left(\frac{2(T+1)|\mathcal{F}|}{\delta}\right)$. This complexity term appears similar to that achieved by learning algorithms (e.g. see Anthony and Bartlett, 1999, pg. 21) or in reinforcement learning (e.g. Peshkin and Shelton, 2002); however, note that n is of the order ɛ−4 rather than the usual ɛ−2. The ɛ−4 term (instead of ɛ−2) is attributable to the fact that $\mathrm{Err}_{Q_{t+1}}(Q_t)$ is not only a function of Qt but also of Qt+1. However, further assumptions on the approximation space permit an improved result; see Theorem 2 below for one possible refinement. Note the needed training set size n depends exponentially on the horizon time T but not on the dimension of the observation space. This is not unexpected, as the upper bounds on the generalization error of both Kearns, Mansour and Ng (2000) and Peshkin and Shelton's (2002) policy search methods (the latter using a training set and importance sampling weights) also depend exponentially on the horizon time.
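The following small Python snippet evaluates the finite-class sample size formula above; the plugged-in numbers (two actions chosen uniformly, a short horizon, an arbitrary class size) are illustrative assumptions only.

```python
import math

def n_finite_class(eps, delta, T, L, M_prime, F_size):
    """Training set size from the finite-class bound above:
    n = (3 M')^2 (16 L)^(2(T+2)) / (2 eps^4) * ln(2 (T+1) |F| / delta)."""
    return (3 * M_prime) ** 2 * (16 * L) ** (2 * (T + 2)) / (2 * eps ** 4) \
        * math.log(2 * (T + 1) * F_size / delta)

# Arbitrary illustrative values: two actions chosen uniformly (L = 2), T = 2.
print(f"{n_finite_class(eps=0.1, delta=0.05, T=2, L=2, M_prime=2, F_size=10**6):.3e}")
```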

4. When F is infinite, we use covering numbers for the approximation space Qt and then appeal to Lemma A2 in the appendix to derive a covering number for F; this results in

$$N_1(\varepsilon, \mathcal{F}, n) \le (T+1)\max_{t=0,\dots,T} N_1\!\left(\frac{\varepsilon}{2|\mathcal{A}|}, \mathcal{Q}_t, |\mathcal{A}|n\right)^2.$$

One possible approximation space is based on feed-forward neural networks. From Anthony and Bartlett (1999) we have that if each $\mathcal{Q}_t$ is the class of functions computed by a feed-forward network with W weights and k computation units arranged in L layers, and each computation unit has a fixed piecewise-polynomial activation function with q pieces and degree no more than ℓ, then $N_1(\varepsilon, \mathcal{Q}_t, n) \le e(d+1)\left(\frac{2eM'}{\varepsilon}\right)^{d}$ where $d = 2(W+1)(L+1)\log_2\!\big(4(W+1)(L+1)q(k+1)/\ln 2\big) + 2(W+1)(L+1)^2\log_2(\ell+1) + 2(L+1)$. To see this, combine Anthony and Bartlett's Theorems 8.8, 14.1 and 18.4. They provide covering numbers for functions computed by other types of neural networks as well. A particularly simple neural network is an affine combination of a given set of p input features, i.e. $f(x) = \omega_0 + \sum_{i=1}^{p-1}\omega_i x_i$ for (1, x) a vector of p real-valued features and each $\omega_i \in \mathbb{R}$. Suppose each $\mathcal{Q}_t$ is a class of functions computed by this network. Then Theorems 11.6 and 18.4 of Anthony and Bartlett imply that $N_1(\varepsilon, \mathcal{Q}_t, n) \le e(p+1)\left(\frac{2eM'}{\varepsilon}\right)^{p}$. In this case

$$n \ge \frac{32(M')^4(16L)^{2(T+2)}}{\varepsilon^4}\log\!\left(\frac{4(T+1)^2 e^2 (p+1)^2\left(128\,e\,|\mathcal{A}|\,(M')^2\,(16L)^{T+2}\right)^{2p}}{\delta\,\varepsilon^{4p}}\right).$$

This number will be large for any reasonable accuracy, ɛ and confidence, δ.
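To see how large, the snippet below simply evaluates the displayed expression for one arbitrary set of illustrative values (the function name and all plugged-in numbers are assumptions for illustration only).

```python
import math

def n_affine_features(eps, delta, T, L, M_prime, p, A_size):
    """Evaluate the sample-size expression displayed above for the affine (linear)
    feature class; all plugged-in values below are arbitrary illustrations."""
    prefactor = 32 * M_prime ** 4 * (16 * L) ** (2 * (T + 2)) / eps ** 4
    inside = (4 * (T + 1) ** 2 * math.e ** 2 * (p + 1) ** 2
              * (128 * math.e * A_size * M_prime ** 2 * (16 * L) ** (T + 2)) ** (2 * p)
              / (delta * eps ** (4 * p)))
    return prefactor * math.log(inside)

# e.g. T = 2 decision points, two actions chosen uniformly (L = 2), p = 10 features.
print(f"{n_affine_features(eps=0.1, delta=0.05, T=2, L=2, M_prime=2, p=10, A_size=2):.3e}")
```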

Proof of Theorem 1

An upper bound on the average difference in value functions can be obtained from Lemma 2 by using Jensen's inequality and the assumption that the density of F, f, satisfies $\sup_o\left|\frac{f(o)}{f_0(o)}\right| \le M$ for some finite constant M:

$$\int \left|V_{\tilde\pi}(o) - V_\pi(o)\right| dF(o) \le M\sum_{t=0}^{T} 2L^{(t+1)/2}\sqrt{E\!\left[Q_t(\bar O_t, \bar A_t) - \tilde Q_t(\bar O_t, \bar A_t)\right]^2} + M\sum_{t=0}^{T} 2L^{(t+1)/2}\sqrt{E\!\left[\tilde Q_t - Q_{\tilde\pi,t}\right]^2}$$ (11)

where $\tilde Q_t$ and $Q_{\tilde\pi,t}$ are used as abbreviations for $\tilde Q_t(\bar O_t, \bar A_t)$ and $Q_{\tilde\pi,t}(\bar O_t, \bar A_t)$, respectively. In the following, an upper bound on each $E\!\left[Q_t - \tilde Q_t\right]^2$ is constructed.

The performance of the approximation on an infinite training set can be represented by

$$\mathrm{Err}_{Q_{t+1}}(Q_t) = E\!\left[R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_t\right]^2$$

for each t (recall $Q_{T+1} = 0$; also we abbreviate $Q_t(\bar O_t, \bar A_t)$ by $Q_t$ whenever no confusion may arise). The errors, Err's, can be used to provide an upper bound on the L2 norms of the Q-function differences by the following argument. Consider $\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{Q_{t+1}}(\tilde Q_t)$ for each t. Within each of these quadratic forms, add and subtract

$$Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, \tilde\pi_{t+1}) - Q_{\tilde\pi,t} - E\!\left[\max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, \tilde\pi_{t+1}) \,\middle|\, \bar O_t, \bar A_t\right].$$

In the above, $Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, \tilde\pi_{t+1})$ is defined as $Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, a_{t+1})$ with $a_{t+1}$ replaced by $\tilde\pi_{t+1}(\bar O_{t+1}, \bar A_t)$. Expand each quadratic form and use the fact that $E[R_t + Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, \tilde\pi_{t+1}) \mid \bar O_t, \bar A_t] = Q_{\tilde\pi,t}$. Cancelling common terms yields

$$E\!\left[Q_{\tilde\pi,t} - Q_t + E\!\left[\max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, \tilde\pi_{t+1}) \,\middle|\, \bar O_t, \bar A_t\right]\right]^2 - E\!\left[Q_{\tilde\pi,t} - \tilde Q_t + E\!\left[\max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, \tilde\pi_{t+1}) \,\middle|\, \bar O_t, \bar A_t\right]\right]^2.$$

Add and subtract Q˜t in the first quadratic form and expand. This yields

$$\begin{aligned}
\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{Q_{t+1}}(\tilde Q_t) =\ & E\!\left[\tilde Q_t - Q_t\right]^2 + 2E\!\left[\left(\tilde Q_t - Q_t\right)\left(\tilde Q_t - Q_{\tilde\pi,t}\right)\right] \\
&+ 2E\!\left[\left(\tilde Q_t - Q_t\right)\left(\max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - \max_{a_{t+1}} \tilde Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1})\right)\right] \\
&+ 2E\!\left[\left(\tilde Q_t - Q_t\right)\left(\max_{a_{t+1}} \tilde Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_{\tilde\pi,t+1}(\bar O_{t+1}, \bar A_t, \tilde\pi_{t+1})\right)\right].
\end{aligned}$$ (12)

Using arguments similar to those used around Equation (8) and using the fact that (x + y)² ≤ 2x² + 2y², we obtain

$$\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{Q_{t+1}}(\tilde Q_t) \ge E\!\left[Q_t - \tilde Q_t\right]^2 - 4\left(E\!\left[Q_t - \tilde Q_t\right]^2\left(E\!\left[\tilde Q_t - Q_{\tilde\pi,t}\right]^2 + L\,E\!\left[Q_{t+1} - \tilde Q_{t+1}\right]^2 + L\,E\!\left[\tilde Q_{t+1} - Q_{\tilde\pi,t+1}\right]^2\right)\right)^{1/2}.$$

Using this inequality we can now derive an upper bound on each $E\!\left[Q_t - \tilde Q_t\right]^2$ in terms of the Err's and the $E\!\left[\tilde Q_{t+1} - Q_{\tilde\pi,t+1}\right]^2$'s. Define

$$m_t = L^{-(T-t)}\,E\!\left[\tilde Q_t - Q_{\tilde\pi,t}\right]^2 \qquad\text{and}\qquad b_t = L^{-(T-t)}\,E\!\left[Q_t - \tilde Q_t\right]^2$$

and

$$e_t = L^{-(T-t)}\left(\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{Q_{t+1}}(\tilde Q_t)\right)$$

for t ≤ T and $b_{T+1} = m_{T+1} = e_{T+1} = 0$. We obtain

$$e_t \ge b_t - 4\sqrt{b_t\left(m_t + b_{t+1} + m_{t+1}\right)}.$$

Completing the square, reordering terms, squaring once again and using the inequality (x + y)² ≤ 2x² + 2y² yields $b_t \le 16(b_{t+1} + m_t + m_{t+1}) + 2e_t$ for t ≤ T. We obtain

$$b_{T-t} \le 2\sum_{i=0}^{t}(16)^i e_{T-t+i} + \sum_{i=1}^{t}(16)^i(16+1)\, m_{T-t+i} + 16\, m_{T-t}.$$

Inserting the definitions of bTt, eTt+i and reordering, yields

$$E\!\left[Q_t - \tilde Q_t\right]^2 \le 2\sum_{i=t}^{T}(16L)^{i-t}\left(\mathrm{Err}_{Q_{i+1}}(Q_i) - \mathrm{Err}_{Q_{i+1}}(\tilde Q_i)\right) + \sum_{i=t+1}^{T}(16)^{i-t}(16+1)\,L^{T-t} m_i + L^{T-t} m_t.$$ (13)

As an aside we can start from (12) and derive the upper bound,

$$\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{Q_{t+1}}(\tilde Q_t) \le E\!\left[Q_t - \tilde Q_t\right]^2 + 4L^{T-t}\sqrt{L^{-(T-t)}\,E\!\left[Q_t - \tilde Q_t\right]^2\left(m_t + L^{-(T-t-1)}E\!\left[Q_{t+1} - \tilde Q_{t+1}\right]^2 + m_{t+1}\right)}.$$

This, combined with (13), implies that minimizing each $\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{Q_{t+1}}(\tilde Q_t)$ in Qt is equivalent to minimizing each $E\!\left[Q_t - \tilde Q_t\right]^2$ in Qt, modulo the approximation terms mt for t = 0, ..., T.

Returning to the proof next note that

$$\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{Q_{t+1}}(\tilde Q_t) \le \left|\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{n,Q_{t+1}}(Q_t)\right| + \left|\mathrm{Err}_{Q_{t+1}}(\tilde Q_t) - \mathrm{Err}_{n,Q_{t+1}}(\tilde Q_t)\right| + \left(\mathrm{Err}_{n,Q_{t+1}}(Q_t) - \mathrm{Err}_{n,Q_{t+1}}(\tilde Q_t)\right)_+$$

where (x)+ is equal to x if x ≥ 0 and is equal to 0 otherwise. Note that if each Qt minimizes Errn,Qt+1 as in (3) then the third term is zero. Substituting into (13), we obtain

$$\begin{aligned}
E\!\left[Q_t - \tilde Q_t\right]^2 \le\ & 2\sum_{i=t}^{T}(16L)^{i-t}\Big(\left|\mathrm{Err}_{Q_{i+1}}(Q_i) - \mathrm{Err}_{n,Q_{i+1}}(Q_i)\right| + \left|\mathrm{Err}_{Q_{i+1}}(\tilde Q_i) - \mathrm{Err}_{n,Q_{i+1}}(\tilde Q_i)\right| + \left(\mathrm{Err}_{n,Q_{i+1}}(Q_i) - \mathrm{Err}_{n,Q_{i+1}}(\tilde Q_i)\right)_+\Big) \\
&+ \sum_{i=t+1}^{T}(16)^{i-t}(16+1)\,L^{i-t}\,E\!\left[\tilde Q_i - Q_{\tilde\pi,i}\right]^2 + E\!\left[\tilde Q_t - Q_{\tilde\pi,t}\right]^2.
\end{aligned}$$

Combine this inequality with (11); simplify the sums and use the fact that for x, y both nonnegative, $\sqrt{x+y} \le \sqrt{x} + \sqrt{y}$, to obtain that $\int|V_{\tilde\pi}(o) - V_\pi(o)|\,dF(o)$ is bounded above by

$$\begin{aligned}
& 6ML^{1/2}\sum_{t=0}^{T}\left[\sum_{i=t}^{T}(16)^{i-t}L^{i}\left(\mathrm{Err}_{n,Q_{i+1}}(Q_i) - \mathrm{Err}_{n,Q_{i+1}}(\tilde Q_i)\right)_+\right]^{1/2} \\
&\quad + 12ML^{1/2}(16L)^{(T+2)/2}\sqrt{\max_t \sup_{Q_t \in \mathcal{Q}_t,\, Q_{t+1} \in \mathcal{Q}_{t+1}}\left|\mathrm{Err}_{Q_{t+1}}(Q_t) - \mathrm{Err}_{n,Q_{t+1}}(Q_t)\right|} \\
&\quad + 6ML^{1/2}\sum_{t=0}^{T}\sum_{i=t}^{T}(16)^{(i-t)/2}L^{i/2}\sqrt{E\!\left[\tilde Q_i - Q_{\tilde\pi,i}\right]^2}.
\end{aligned}$$

All that remains is to provide an upper bound on

$$P\!\left[\bigcup_{i=0}^{T}\left\{\text{for some } Q_t \in \mathcal{Q}_t,\ t = 0, \dots, T:\ \left|\mathrm{Err}_{Q_{i+1}}(Q_i) - \mathrm{Err}_{n,Q_{i+1}}(Q_i)\right| > \varepsilon'\right\}\right].$$

This probability is in turn bounded above by

$$\sum_{i=0}^{T} P\!\left[\text{for some } Q_t \in \mathcal{Q}_t,\ t = 0, \dots, T:\ \left|\mathrm{Err}_{Q_{i+1}}(Q_i) - \mathrm{Err}_{n,Q_{i+1}}(Q_i)\right| > \varepsilon'\right].$$

Anthony and Bartlett (1999, pg. 241) use Hoeffding’s inequality along with the classical techniques of symmetrization and permutation to provide the upper bound (see also van der Vaart and Wellner, 1996),

$$P\!\left[\text{for some } Q_t \in \mathcal{Q}_t,\ t = 0, \dots, T:\ \left|\mathrm{Err}_{Q_{i+1}}(Q_i) - \mathrm{Err}_{n,Q_{i+1}}(Q_i)\right| > \varepsilon'\right] \le 4 N_1\!\left(\frac{\varepsilon'}{32M'}, \mathcal{F}, 2n\right)\exp\!\left\{\frac{-(\varepsilon')^2 n}{32(M')^2}\right\}.$$

Put $\varepsilon = (16L)^{(T+2)/2}\sqrt{\varepsilon'}$, that is, $\varepsilon' = (16L)^{-(T+2)}\varepsilon^2$, to obtain the results of the theorem.

Suppose the Q functions are approximated by linear combinations of p features; for each t = 0,...,T, denote the feature vector by qt (ot, at ). The approximation space is then,

$$\mathcal{Q}_t = \left\{Q_t(\bar o_t, \bar a_t) = \theta^T q_t(\bar o_t, \bar a_t) : \theta \in \Theta\right\}$$

where Θ is a subset of $\mathbb{R}^p$. In this case, the batch Q-learning algorithm may be based on (4); we represent the performance of the functions {Q0, ..., QT} on the training set by

$$\widetilde{\mathrm{Err}}_{n,Q_{t+1}}(Q_t) = E_n\!\left[\left(R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_t(\bar O_t, \bar A_t)\right) q_t(\bar O_t, \bar A_t)\right]$$

for t = 0,...,T (recall En represents the expectation with respect to the probability obtained by choosing a trajectory uniformly from the training set). In this theorem

$$\mathcal{F}' = \bigcup_{i=1}^{p}\bigcup_{t=0}^{T}\left\{\left(r_t + \max_{a_{t+1}} Q_{t+1}(\bar o_{t+1}, \bar a_t, a_{t+1}; \theta_{t+1}) - Q_t(\bar o_t, \bar a_t; \theta_t)\right) q_{ti}(\bar o_t, \bar a_t) : \theta_t, \theta_{t+1} \in \Theta\right\}.$$

Define the functions $\{\bar Q_0, \dots, \bar Q_T\}$ and the policy $\bar\pi$ as follows. First define $\bar Q_T(\bar O_T, \bar A_T)$ to be the projection of $E[R_T \mid \bar O_T, \bar A_T]$ on the space spanned by $q_T$. Then set $\bar\pi_T(\bar o_T, \bar a_{T-1}) \in \arg\max_{a_T} \bar Q_T(\bar o_T, \bar a_T)$. Next, for t = T − 1, ..., 0, set $\bar Q_t(\bar O_t, \bar A_t)$ as the projection of $E[R_t + \bar Q_{t+1}(\bar O_{t+1}, \bar A_t, \bar\pi_{t+1}) \mid \bar O_t, \bar A_t]$ on the space spanned by $q_t$ (recall $\bar Q_{t+1}(\bar O_{t+1}, \bar A_t, \bar\pi_{t+1})$ is defined as $\bar Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1})$ with $a_{t+1}$ replaced by $\bar\pi_{t+1}(\bar O_{t+1}, \bar A_t)$), and set $\bar\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} \bar Q_t(\bar o_t, \bar a_t)$. These projections are with respect to P, the distribution which generated the trajectories in the training set (the likelihood is given in (1)).

Theorem 2

Suppose that there exists a positive constant, say L, for which $p_t(a_t \mid \bar o_t, \bar a_{t-1}) \ge L^{-1}$ for all $(\bar o_t, \bar a_{t-1})$, 0 ≤ t ≤ T. Suppose that for each t and all $x \in \mathbb{R}^p$, $x^T E[q_t q_t^T]\, x > \eta\,\|x\|^2$ where η > 0 (‖·‖ is the Euclidean norm). Also assume that Θ is a closed subset of $\{x \in \mathbb{R}^p : \|x\| \le M_\Theta\}$ and that for all (t, i) the ith component of the vector $q_t$ is pointwise bounded, $|q_{ti}| \le M_Q$ for $M_Q$ a constant. Then for ɛ > 0, with probability at least 1 − δ over the random choice of the training set, every choice of functions $Q_t \in \mathcal{Q}_t$ and functions $\tilde Q_t$, t = 0, ..., T, with associated policies π defined by $\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} Q_t(\bar o_t, \bar a_t)$ and π̃ defined by $\tilde\pi_t(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} \tilde Q_t(\bar o_t, \bar a_t)$, respectively, satisfies the following bounds:

$$\sum_{t=0}^{T} L^{t+1}\, E\left|\bar Q_t(\bar O_t, \bar A_t) - Q_t(\bar O_t, \bar A_t)\right| \le \frac{\sqrt{p}\,M_Q}{\eta}\sum_{t=0}^{T} L^{t+1}\sum_{j=t}^{T}\left(\frac{L\,p\,M_Q^2}{\eta}\right)^{j-t}\left\|\widetilde{\mathrm{Err}}_{n,Q_{j+1}}(Q_j)\right\| + 4\varepsilon,

where E represents the expectation with respect to the distribution (1) generating the training set, and

$$\begin{aligned}
\int\left|V_{\tilde\pi}(o) - V_\pi(o)\right| dF(o) \le\ & \frac{2M\sqrt{p}\,M_Q}{\eta}\sum_{t=0}^{T} L^{t+1}\sum_{j=t}^{T}\left(\frac{L\,p\,M_Q^2}{\eta}\right)^{j-t}\left\|\widetilde{\mathrm{Err}}_{n,Q_{j+1}}(Q_j)\right\| + 8M\varepsilon \\
&+ 2M\sum_{t=0}^{T} L^{t+1}\, E\left|\bar Q_t(\bar O_t, \bar A_t) - \tilde Q_t(\bar O_t, \bar A_t)\right| + 2M\sum_{t=0}^{T} L^{t+1}\, E\left|\tilde Q_t(\bar O_t, \bar A_t) - Q_{\tilde\pi,t}(\bar O_t, \bar A_t)\right|
\end{aligned}$$

for n larger than

$$\left(\frac{C}{\varepsilon}\right)^2 \log\!\left(\frac{B}{\delta}\right)$$ (14)

where $C = 4\sqrt{2}\,M\,p^{T+1/2}\,M_Q^{2T+1}\,\eta^{-(T+1)}\,L^{T+1}$, M′ is a uniform upper bound on the absolute value of all $f \in \mathcal{F}'$, and $B = \varepsilon^{-2p}\,4^{6p+3}\,p^{2Tp+p+3}\,(T+1)^2\,e^{2p+2}\,(M')^{4p}\,|\mathcal{A}|^p\,M_Q^{(2T+1)2p}\,\eta^{-2p(T+1)}\,L^{2p(T+1)}$.

Remarks:

1. Define $\hat Q_t$ as a zero of $\widetilde{\mathrm{Err}}_{n,\hat Q_{t+1}}(Q_t)$, t = T, T−1, ..., 0 (recall that $\hat Q_{T+1}$ is identically zero). Suppose that $Q_t^* \in \mathcal{Q}_t$ for each t; in this case $\bar Q_t = Q_t^*$ for all t (we ignore sets of measure zero in this discussion). Then with probability greater than 1 − δ and with $\tilde\pi = \pi^*$, $\tilde Q_t = Q_t^*$, we obtain

$$\int\left[V^*(o) - V_{\hat\pi}(o)\right] dF(o) \le 8M\varepsilon$$

for all n satisfying (14). Thus estimating each Qt by solving $\widetilde{\mathrm{Err}}_{n,\hat Q_{t+1}}(Q_t) = 0$, t = T, ..., 0, yields a policy that consistently achieves the optimal value.

2. Again define $\hat Q_t$ as a zero of $\widetilde{\mathrm{Err}}_{n,\hat Q_{t+1}}(Q_t)$, t = T, T−1, ..., 0. Given two T+1 vectors of functions Q′ = {Q′0, …, Q′T} and Q = {Q0, …, QT}, define $d(Q', Q) = \sum_{t=0}^{T} L^{t+1} E\left|Q'_t(\bar O_t, \bar A_t) - Q_t(\bar O_t, \bar A_t)\right|$. Then the first result of Theorem 2 implies that $d(\bar Q, \hat Q)$ converges in probability to zero. From Lemma 2 we have that $\int|V_{\tilde\pi}(o) - V_\pi(o)|\,dF(o) \le 2M\,d(Q, \tilde Q) + 2M\,d(Q_{\tilde\pi}, \tilde Q)$, and thus $\int|V_{\tilde\pi}(o) - V_{\hat\pi}(o)|\,dF(o)$ is with high probability bounded above by $2M\,d(\bar Q, \tilde Q) + 2M\,d(Q_{\tilde\pi}, \tilde Q)$. Consequently the presence of the third and fourth terms in Theorem 2 is not surprising. It is unclear whether the “go-between” $\tilde Q_t$ is necessary.

3. Recall that the space of policies implied by the approximation spaces for the Q-functions is given by $\Pi_{\mathcal{Q}} = \{\pi_\theta, \theta \in \Theta\}$ where $\pi_\theta = \{\pi_{0,\theta}, \dots, \pi_{T,\theta}\}$ and where each $\pi_{t,\theta}(\bar o_t, \bar a_{t-1}) \in \arg\max_{a_t} Q_t(\bar o_t, \bar a_t; \theta)$ for some $Q_t \in \mathcal{Q}_t$. Suppose that $\max_{\pi \in \Pi_{\mathcal{Q}}} \int V_\pi(o)\,dF(o)$ is achieved by some member of $\Pi_{\mathcal{Q}}$ and $\tilde\pi \in \arg\max_{\pi \in \Pi_{\mathcal{Q}}} \int V_\pi(o)\,dF(o)$. Ideally Q-learning would provide a policy that achieves the highest value as compared to other policies in $\Pi_{\mathcal{Q}}$ (as is the case with π̃). This is not necessarily the case. As discussed in the above remark, batch Q-learning yields estimated Q-functions for which $d(\bar Q, \hat Q)$ converges to zero. The policy π̄ may not produce a maximal value; that is, $\int [V_{\tilde\pi}(o) - V_{\bar\pi}(o)]\,dF(o)$ need not be zero (see also the remark following Lemma 2). Recall from Lemma 2 that $2M\,d(\bar Q, \tilde Q) + 2M\,d(\tilde Q, Q_{\tilde\pi})$ is an upper bound on this difference. It is not hard to see that $d(\tilde Q, Q_{\tilde\pi})$ is zero if and only if π̃ is the optimal policy; indeed the optimal Q-function would then belong to the approximation space. The Q-learning algorithm does not directly maximize the value function. As remarked in Tsitsiklis and van Roy (1997), the goal of the Q-learning algorithm is to construct an approximation to the optimal Q-function within the constraints imposed by the approximation space; this approximation is a projection when the approximation space is linear. Approximating the Q-function yields an optimal policy if the approximating class is sufficiently rich. Ormoneit and Sen (2002) consider a sequence of approximation spaces (kernel based spaces indexed by a bandwidth) and make assumptions on the optimal value function which guarantee that this sequence of approximation spaces is sufficiently rich (as the bandwidth decreases with increasing training set size) so as to approximate the optimal value function to any desired degree.

4. Again define $\hat Q_t$ as a zero of $\widetilde{\mathrm{Err}}_{n,\hat Q_{t+1}}(Q_t)$, t = T, T−1, ..., 0. Since $d(\bar Q, \hat Q)$ converges in probability to zero, one might think that $\int|V_{\bar\pi}(o) - V_{\hat\pi}(o)|\,dF(o)$ should be small as well. Referring to Lemma 1, we have that the difference in value functions $\int|V_{\bar\pi}(o) - V_{\hat\pi}(o)|\,dF(o)$ can be expressed as the sum over t of the expectation of $Q_{\bar\pi,t}(\bar O_t, \bar A_{t-1}, \hat\pi_t) - Q_{\bar\pi,t}(\bar O_t, \bar A_{t-1}, \bar\pi_t)$. However, $d(\bar Q, \hat Q)$ small does not imply that π̂ and π̄ will be close, nor does it imply that $Q_{\bar\pi,t}(\bar O_t, \bar A_{t-1}, \hat\pi_t) - Q_{\bar\pi,t}(\bar O_t, \bar A_{t-1}, \bar\pi_t)$ will be small. To see the former, consider an action space with 10 actions, 1, ..., 10, and $\hat Q_t(\bar o_t, \bar a_t) = 1$ for $a_t = 1, \dots, 9$, $\hat Q_t(\bar o_t, 10) = 1 + \frac{1}{2}\varepsilon$, and $\bar Q_t(\bar o_t, \bar a_t) = 1 - \frac{1}{2}\varepsilon$ for $a_t = 2, \dots, 10$, $\bar Q_t(\bar o_t, 1) = 1$. So $\bar Q_t$ and $\hat Q_t$ are uniformly less than ɛ apart, yet the arguments of their maxima are 1 and 10, respectively.
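The short Python snippet below just checks the 10-action example in this remark numerically; the chosen ɛ is arbitrary.

```python
import numpy as np

eps = 0.1
# Q-hat and Q-bar from the remark above (actions 1..10 indexed here as 0..9):
# they are uniformly within eps of each other, yet their greedy actions differ.
q_hat = np.ones(10)
q_hat[9] = 1 + eps / 2
q_bar = np.full(10, 1 - eps / 2)
q_bar[0] = 1.0
print(np.max(np.abs(q_hat - q_bar)) < eps)         # True: uniformly less than eps apart
print(np.argmax(q_bar) + 1, np.argmax(q_hat) + 1)  # 1 and 10: the greedy actions differ
```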

Proof of Theorem 2

Fix $Q_t = \theta_t^T q_t$, $\theta_t \in \Theta$, for t = 0, ..., T. Define an infinite training sample version of $\widetilde{\mathrm{Err}}_n$ as

$$\widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t) = E\!\left[\left(R_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - Q_t\right) q_t\right] = E\!\left[\left(\bar Q_t + \max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - \bar Q_{t+1}(\bar O_{t+1}, \bar A_t, \bar\pi_{t+1}) - Q_t\right) q_t\right]$$

where $Q_t$ is an abbreviation for $Q_t(\bar O_t, \bar A_t)$. To derive the last equality, recall that $\bar Q_t(\bar O_t, \bar A_t)$ is the projection of $E[R_t + \bar Q_{t+1}(\bar O_{t+1}, \bar A_t, \bar\pi_{t+1}) \mid \bar O_t, \bar A_t]$ on the space spanned by $q_t$. Since $\bar Q_t$ is a projection, we can write $\bar Q_t = \theta_{\bar\pi,t}^T q_t$ for some $\theta_{\bar\pi,t} \in \Theta$. Also we can write $Q_t = \theta_t^T q_t$ for some $\theta_t \in \Theta$. The $\widetilde{\mathrm{Err}}$'s provide a pointwise upper bound on the differences $|\bar Q_t - Q_t|$, as follows. Rearrange the terms in $\widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t)$, using the fact that $E[q_t q_t^T]$ is invertible, to obtain

$$\theta_{\bar\pi,t} - \theta_t = \left(E[q_t q_t^T]\right)^{-1}\widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t) - \left(E[q_t q_t^T]\right)^{-1}E\!\left[\left(\max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - \bar Q_{t+1}(\bar O_{t+1}, \bar A_t, \bar\pi_{t+1})\right) q_t\right].$$

Denote the Euclidean norm of a p dimensional vector x by ||x||. Then

$$\begin{aligned}
\left|(\theta_{\bar\pi,t} - \theta_t)^T q_t\right| &\le (1/\eta)\left\|\widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t)\right\|\,\|q_t\| + (1/\eta)\,\left\|E\!\left[\left|\max_{a_{t+1}} Q_{t+1}(\bar O_{t+1}, \bar A_t, a_{t+1}) - \bar Q_{t+1}(\bar O_{t+1}, \bar A_t, \bar\pi_{t+1})\right|\, q_t\right]\right\|\,\|q_t\| \\
&\le (1/\eta)\left\|\widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t)\right\|\,\|q_t\| + (1/\eta)\,L\,E\!\left[\left|Q_{t+1} - \bar Q_{t+1}\right|\,\|q_t\|\right]\,\|q_t\| \\
&\le (1/\eta)\sqrt{p}\,M_Q\left\|\widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t)\right\| + (1/\eta)\,L\,p\,M_Q^2\,E\!\left[\left|Q_{t+1} - \bar Q_{t+1}\right|\right]
\end{aligned}$$

for t ≤ T. To summarize,

$$E\left|\bar Q_t - Q_t\right| \le (1/\eta)\sqrt{p}\,M_Q\left\|\widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t)\right\| + (1/\eta)\,L\,p\,M_Q^2\,E\!\left[\left|Q_{t+1} - \bar Q_{t+1}\right|\right]$$

where $Q_t$ and $\bar Q_t$ are abbreviations for $Q_t(\bar O_t, \bar A_t)$ and $\bar Q_t(\bar O_t, \bar A_t)$, respectively, for each t.

As in the proof of Theorem 1, these inequalities can be solved for each E|Q¯t-Qt| to yield

$$\begin{aligned}
E\left|\bar Q_t - Q_t\right| &\le \frac{\sqrt{p}\,M_Q}{\eta}\sum_{j=t}^{T}\left(\frac{L\,p\,M_Q^2}{\eta}\right)^{j-t}\left\|\widetilde{\mathrm{Err}}_{Q_{j+1}}(Q_j)\right\| \\
&\le \frac{\sqrt{p}\,M_Q}{\eta}\sum_{j=t}^{T}\left(\frac{L\,p\,M_Q^2}{\eta}\right)^{j-t}\left\|\widetilde{\mathrm{Err}}_{n,Q_{j+1}}(Q_j) - \widetilde{\mathrm{Err}}_{Q_{j+1}}(Q_j)\right\| + \frac{\sqrt{p}\,M_Q}{\eta}\sum_{j=t}^{T}\left(\frac{L\,p\,M_Q^2}{\eta}\right)^{j-t}\left\|\widetilde{\mathrm{Err}}_{n,Q_{j+1}}(Q_j)\right\|.
\end{aligned}$$

Simplifying terms we obtain

$$\sum_{t=0}^{T} L^{t+1}\, E\left|\bar Q_t - Q_t\right| \le \frac{\sqrt{p}\,M_Q}{\eta}\sum_{t=0}^{T} L^{t+1}\sum_{j=t}^{T}\left(\frac{L\,p\,M_Q^2}{\eta}\right)^{j-t}\left\|\widetilde{\mathrm{Err}}_{n,Q_{j+1}}(Q_j)\right\| + 4\,p^{T+1/2}\,M_Q^{2T+1}\,\eta^{-(T+1)}\,L^{T+1}\max_t\left\|\widetilde{\mathrm{Err}}_{n,Q_{t+1}}(Q_t) - \widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t)\right\|.$$ (15)

Consider each component of each of the T + 1, p-dimensional vectors $\widetilde{\mathrm{Err}}_{n,Q_{i+1}}(Q_i) - \widetilde{\mathrm{Err}}_{Q_{i+1}}(Q_i)$, for an ɛ′ > 0:

$$P\!\left[\bigcup_{i=0}^{T}\bigcup_{j=1}^{p}\left\{\text{for some } \theta_i, \theta_{i+1} \in \Theta,\ Q_i \in \mathcal{Q}_i,\ Q_{i+1} \in \mathcal{Q}_{i+1}:\ \left|\widetilde{\mathrm{Err}}_{n,Q_{i+1}}(Q_i)_j - \widetilde{\mathrm{Err}}_{Q_{i+1}}(Q_i)_j\right| > \varepsilon'\right\}\right].$$

This probability is in turn bounded above by

$$\sum_{i=0}^{T}\sum_{j=1}^{p} P\!\left[\text{for some } \theta_i, \theta_{i+1} \in \Theta,\ Q_i \in \mathcal{Q}_i,\ Q_{i+1} \in \mathcal{Q}_{i+1}:\ \left|\widetilde{\mathrm{Err}}_{n,Q_{i+1}}(Q_i)_j - \widetilde{\mathrm{Err}}_{Q_{i+1}}(Q_i)_j\right| > \varepsilon'\right].$$

In Lemmas 17.2, 17.3, 17.5, Anthony and Bartlett (1999) provide an upper bound on the probability

$$P\!\left[\text{for some } f \in \mathcal{F}:\ \left|E_n(\ell_f) - E(\ell_f)\right| \ge \varepsilon\right]$$

where $\ell_f(x, y) = (y - f(x))^2$. These same lemmas (based on the classical arguments of symmetrization, permutation and reduction to a finite set) can be used for $f \in \mathcal{F}'$ since the functions in $\mathcal{F}'$ are uniformly bounded. Hence for each j = 1, ..., p and t = 0, ..., T,

$$P\!\left[\text{for some } \theta_t, \theta_{t+1} \in \Theta,\ Q_t \in \mathcal{Q}_t,\ Q_{t+1} \in \mathcal{Q}_{t+1}:\ \left|\widetilde{\mathrm{Err}}_{n,Q_{t+1}}(Q_t)_j - \widetilde{\mathrm{Err}}_{Q_{t+1}}(Q_t)_j\right| > \varepsilon'\right] \le 4N_1\!\left(\frac{\varepsilon'}{16M'}, \mathcal{F}', 2n\right)\exp\!\left\{\frac{-(\varepsilon')^2 n}{32(M')^2}\right\}.$$

Set $\varepsilon = p^{T+1/2}\,M_Q^{2T+1}\,\eta^{-(T+1)}\,L^{T+1}\,\varepsilon'$. Thus for n satisfying

$$4p(T+1)\, N_1\!\left(\frac{\varepsilon}{16M'\,p^{T+1/2}M_Q^{2T+1}\eta^{-(T+1)}L^{T+1}}, \mathcal{F}', 2n\right)\exp\!\left\{\frac{-\varepsilon^2 n}{32(M')^2\left(p^{T+1/2}M_Q^{2T+1}\eta^{-(T+1)}L^{T+1}\right)^2}\right\} \le \delta,$$ (16)

the first result of the theorem holds.

To simplify the constraint on n, we derive a covering number for F′ from covering numbers for the Qt’s. Apply Lemma A2 part 1, to obtain

$$N_1(\varepsilon, \mathcal{V}_{t+1}, n) \le N_1\!\left(\frac{\varepsilon}{|\mathcal{A}|}, \mathcal{Q}_{t+1}, |\mathcal{A}|n\right)$$

for $\mathcal{V}_{t+1} = \{\max_{a_{t+1}} Q_{t+1}(\bar o_{t+1}, \bar a_t, a_{t+1}) : Q_{t+1} \in \mathcal{Q}_{t+1}\}$. Next apply Lemma A2, parts 2 and 3, to obtain

$$N_1(\varepsilon, \mathcal{F}', n) \le \sum_{t=0}^{T-1} N_1\!\left(\frac{\varepsilon}{2|\mathcal{A}|M'}, \mathcal{Q}_{t+1}, |\mathcal{A}|n\right) N_1\!\left(\frac{\varepsilon}{2M'}, \mathcal{Q}_t, n\right) + N_1\!\left(\frac{\varepsilon}{M'}, \mathcal{Q}_T, n\right).$$

Theorems 11.6 and 18.4 of Anthony and Bartlett imply that $N_1(\varepsilon, \mathcal{Q}_t, n) \le e(p+1)\left(\frac{2e}{\varepsilon}\right)^{p}$ for each t. Combining this upper bound with (16) and simplifying the algebra yields (14).

Next Lemma 2 implies:

$$\int\left|V_{\tilde\pi}(o) - V_\pi(o)\right| dF(o) \le M\sum_{t=0}^{T} 2L^{t+1}\, E\left|Q_t - \bar Q_t\right| + M\sum_{t=0}^{T} 2L^{t+1}\, E\left|\bar Q_t - \tilde Q_t\right| + M\sum_{t=0}^{T} 2L^{t+1}\, E\left|\tilde Q_t - Q_{\tilde\pi,t}\right|.$$

This combined with the first result of the theorem implies the second result.

6. Discussion

Planning problems involving a single training set of trajectories are not unusual and can be expected to increase due to the widespread use of policies in the social and behavioral/medical sciences (see, for example, Rush et al., 2003; Altfeld and Walker, 2001; Brooner, and Kidorf, 2002); at this time these policies are formulated using expert opinion, clinical experience and/or theoretical models. However there is growing interest in formulating these policies using empirical studies (training sets). These training sets are collected under fixed exploration policies and thus while they allow exploration they do not allow exploitation, that is, online choice of the actions. If subjects are recruited into the study at a much slower rate than the calendar duration of the horizon, then it is possible to permit some exploitation; some of this occurs in the field of cancer research (Thall, Sung and Estey, 2002).

This paper considers the use of Q-learning with dynamic programming and function approximation for this planning purpose. However, the mismatch between Q-learning and the goal of learning a policy that maximizes the value function has serious consequences and emphasizes the need to use all available science in choosing the approximation space. Often the available behavioral or psychosocial theories provide qualitative information concerning the importance of different observations. In addition, these theories are often represented graphically via directed acyclic graphs. However, information at the level of the form of the conditional distributions connecting the nodes in the graphs is mostly unavailable. Also, due to the complexity of the problems, there are often unknown missing common causes of different nodes in the graphs. See http://neuromancer.eecs.umich.edu/dtr for more information and references. Methods that can use this qualitative information to minimize the mismatch are needed.

Acknowledgments

We would like to acknowledge the help of the reviewers and of Min Qian in improving this paper. Support for this project was provided by the National Institutes of Health (NIDA grants K02 DA15674 and P50 DA10075 to the Methodology Center).

Appendix A

Recall that the distributions P and Pπ differ only with regard to the policy (see (1) and (2)). Thus the following result is unsurprising. Let $f(\bar O_{T+1}, \bar A_T)$ be a (measurable) nonnegative function; then $E_\pi f$ can be expressed in terms of an expectation with respect to the distribution P if we assume that $p_t(a_t \mid \bar o_t, \bar a_{t-1}) > 0$ for each $(\bar o_t, \bar a_t)$ pair and each t. The presence of the $p_\ell$'s in the denominator below represents the price we pay because we only have access to training trajectories with distribution P; we do not have access to trajectories from distribution Pπ.

Lemma A1 Assume that $P_\pi[p_0(A_0 \mid S_0) > 0] = 1$ and $P_\pi[p_t(A_t \mid \bar O_t, \bar A_{t-1}) > 0] = 1$ for t = 1, ..., T. For any (measurable) nonnegative function $g(\bar O_t, \bar A_t)$, the P-probability that

$$E_\pi\!\left[g(\bar O_t, \bar A_t) \,\middle|\, S_0\right] = E\!\left[\left(\prod_{\ell=0}^{t}\frac{\mathbf{1}_{A_\ell = \pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\right) g(\bar O_t, \bar A_t) \,\middle|\, S_0\right]$$

is one for t = 0,...,T.

Proof: We need only prove that

$$E\!\left[h(S_0)\, E_\pi\!\left[g(\bar O_t, \bar A_t) \,\middle|\, S_0\right]\right] = E\!\left[h(S_0)\, E\!\left[\left(\prod_{\ell=0}^{t}\frac{\mathbf{1}_{A_\ell=\pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\right) g(\bar O_t, \bar A_t) \,\middle|\, S_0\right]\right]$$

for any (measurable) nonnegative function, h. Consider the two likelihoods ((1) and (2)) for a trajectory up to time t. Denote the dominating measure for the two likelihoods for the trajectory up to time t as λt. By assumption,

$$\begin{aligned}
&\int h(s_0)\, g(\bar o_t, \bar a_t)\left(\prod_{\ell=0}^{t}\frac{\mathbf{1}_{a_\ell=\pi_\ell}}{p_\ell(a_\ell \mid \bar o_\ell, \bar a_{\ell-1})}\right) f_0(s_0)\, p_0(a_0 \mid s_0)\prod_{j=1}^{t} f_j(o_j \mid \bar o_{j-1}, \bar a_{j-1})\, p_j(a_j \mid \bar o_j, \bar a_{j-1})\, d\lambda_t(\bar o_t, \bar a_t) \\
&\quad = \int h(s_0)\, g(\bar o_t, \bar a_t)\, f_0(s_0)\, \mathbf{1}_{a_0=\pi_0(s_0)}\prod_{j=1}^{t} f_j(o_j \mid \bar o_{j-1}, \bar a_{j-1})\, \mathbf{1}_{a_j=\pi_j(\bar o_j, \bar a_{j-1})}\, d\lambda_t(\bar o_t, \bar a_t).
\end{aligned}$$

By definition the left hand side is $E\!\left[h(S_0)\, g(\bar O_t, \bar A_t)\left(\prod_{\ell=0}^{t}\frac{\mathbf{1}_{A_\ell=\pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\right)\right]$ and the right hand side is $E_\pi[h(S_0)\, g(\bar O_t, \bar A_t)]$. Expressing both sides as the expectation of a conditional expectation, we obtain

$$E_\pi\!\left[h(S_0)\, E_\pi\!\left[g(\bar O_t, \bar A_t) \,\middle|\, S_0\right]\right] = E\!\left[h(S_0)\, E\!\left[g(\bar O_t, \bar A_t)\left(\prod_{\ell=0}^{t}\frac{\mathbf{1}_{A_\ell=\pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\right) \,\middle|\, S_0\right]\right].$$

Note that the distribution of S0 is the same regardless of how the actions are chosen, that is the distribution of S0 is the same under both P and Pπ. Thus

$$E\!\left[h(S_0)\, E_\pi\!\left[g(\bar O_t, \bar A_t) \,\middle|\, S_0\right]\right] = E\!\left[h(S_0)\, E\!\left[g(\bar O_t, \bar A_t)\left(\prod_{\ell=0}^{t}\frac{\mathbf{1}_{A_\ell=\pi_\ell}}{p_\ell(A_\ell \mid \bar O_\ell, \bar A_{\ell-1})}\right) \,\middle|\, S_0\right]\right].$$
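A minimal Monte Carlo sketch of the reweighting identity in Lemma A1, on a toy one-step problem with two actions, is given below; the policy, test function and exploration probabilities are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def pi(o):                      # a fixed deterministic policy (illustrative)
    return int(o > 0.5)

def g(o, a):                    # an arbitrary nonnegative test function
    return (o + 1.0) * (a + 1.0)

# Data generated under the exploration policy p (both action probabilities > 0).
n = 200_000
O = rng.uniform(size=n)
p_explore = np.array([0.3, 0.7])
A = rng.choice(2, size=n, p=p_explore)

# Weight 1{A = pi(O)} / p(A | O) converts exploration data into an estimate of E_pi[g].
weights = (A == np.vectorize(pi)(O)) / p_explore[A]
lhs = np.mean(g(O, np.vectorize(pi)(O)))   # direct expectation under pi
rhs = np.mean(weights * g(O, A))           # importance-weighted exploration data
print(round(lhs, 3), round(rhs, 3))        # the two estimates should roughly agree
```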

Lemma A2 For p, q, r, s, N positive integers and MF, MG, MΘ positive reals, define the following classes of real valued functions,

$$\begin{aligned}
\mathcal{H} &\subseteq \left\{h(x, a) : x \in \mathbb{R}^p,\ a \in \{1, \dots, N\}\right\} \\
\mathcal{F} &\subseteq \left\{f(x) : x \in \mathbb{R}^q,\ \sup_x |f(x)| \le M_F\right\} \\
\mathcal{G} &\subseteq \left\{g(x, y) : x \in \mathbb{R}^q,\ y \in \mathbb{R}^r,\ \sup_{x,y} |g(x, y)| \le M_G\right\}
\end{aligned}$$

and

$$\Theta \subseteq \left\{\theta \in \mathbb{R}^s : \max_{i=1,\dots,s} |\theta_i| \le M_\Theta\right\}.$$

The following hold.

1. If V = {maxa h(x, a) : h ∈ H} then N1(ɛ, V, n) ≤ N1(ɛ/N, H, Nn).

2. For |a| ≤ 1, |b| ≤ 1, if V = {a f (x) + bg(x, y) : f ∈ F, g ∈ G} then N1(ɛ, V, n) ≤ N1(ɛ/2, F, n)N1(ɛ/2, G, n).

3. If V = F ∪ G then N1(ɛ,V, n) ≤ N1(ɛ,F, n) + N1(ɛ, G, n).

4. If V = {θ1 f1(x) + ... + θs fs(x) : fi ∈ F, (θ1, ..., θs) ∈ Θ} then $N_1(\varepsilon, V, n) \le e(s+1)\left(\frac{4esM_\Theta M_F}{\varepsilon}\right)^s N_1\!\left(\frac{\varepsilon}{4sM_\Theta}, \mathcal{F}, n\right)^s$.

Proof. We prove 1 and 4; the proofs of 2 and 3 are straightforward and are omitted. Consider 1. Given $(x_1, \dots, x_n)$, the ɛ-covering number for the class of points in $\mathbb{R}^{Nn}$, $\{(h(x_i, a) : i = 1, \dots, n,\ a = 1, \dots, N) : h \in \mathcal{H}\}$, is bounded above by $N_1(\varepsilon, \mathcal{H}, Nn)$. Note that for $(z_{ia},\ i = 1, \dots, n,\ a = 1, \dots, N)$,

$$\frac{1}{n}\sum_{i=1}^{n}\left|\max_{a=1,\dots,N} h(x_i, a) - \max_{a=1,\dots,N} z_{ia}\right| \le \frac{1}{n}\sum_{i=1}^{n}\max_{a=1,\dots,N}\left|h(x_i, a) - z_{ia}\right| \le \frac{1}{n}\sum_{i=1}^{n}\sum_{a=1}^{N}\left|h(x_i, a) - z_{ia}\right|.$$

Thus the ɛ-covering number for the class of points in $\mathbb{R}^n$, $\{(\max_{a=1,\dots,N} h(x_i, a) : i = 1, \dots, n) : h \in \mathcal{H}\}$, is bounded above by $N_1(\varepsilon, \mathcal{H}, Nn)$. Using the definition of covering numbers for classes of functions, we obtain $N_1(\varepsilon, V, n) \le N_1(\varepsilon/N, \mathcal{H}, Nn)$.

Next consider 4. Put $x = (x_1, \dots, x_n)$ (each $x_i \in \mathbb{R}^q$) and $f(x_i) = (f_1(x_i), \dots, f_s(x_i))^T$. Then there exist $\{z_1, \dots, z_N\}$ ($N = N_1(\varepsilon/(4sM_\Theta), \mathcal{F}, n)$; $z_j \in \mathbb{R}^n$) that form the centers of an ɛ/(4sMΘ)-cover for $\mathcal{F}$. To each $z_j$ we can associate an $f \in \mathcal{F}$, say $f_j^*$, so that $\{f_1^*, \dots, f_N^*\}$ form the centers of an ɛ/(2sMΘ)-cover for $\mathcal{F}$. Then given $\{f_1, \dots, f_s\} \subseteq \mathcal{F}$ there exists, for each j = 1, ..., s, an index $j^* \in \{1, \dots, N\}$ so that $\max_{1\le j\le s}\max_{1\le i\le n} |f_j(x_i) - f^*_{j^*}(x_i)| \le \varepsilon/(2sM_\Theta)$. Then

$$\frac{1}{n}\sum_{i=1}^{n}\left|\sum_{j=1}^{s}\theta_j\left(f_j(x_i) - f^*_{j^*}(x_i)\right)\right| \le \varepsilon/2.$$

Define $\mathcal{F}^* = \left\{\sum_{j=1}^{s}\theta_j f_j^{*} : (\theta_1, \dots, \theta_s) \in \Theta\right\}$. Theorems 11.6 and 18.4 of Anthony and Bartlett (1999) imply that $N_1(\varepsilon/2, \mathcal{F}^*, n) \le e(s+1)\left(\frac{4esM_\Theta M_F}{\varepsilon}\right)^s$. These two combine to yield the result.

References

  1. Altfeld M, Walker BD. Less is more? STI in acute and chronic HIV-1 infection. Nature Medicine. 2001;7:881–884.
  2. M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge, UK: Cambridge University Press, 1999.
  3. L. Baird. Advantage updating. Technical Report WL-TR-93-1146, Wright-Patterson Air Force Base, 1993.
  4. Baxter J, Bartlett PL. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research. 2001;15:319–350.
  5. R. E. Bellman. Dynamic Programming. Princeton: Princeton University Press, 1957.
  6. D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
  7. Brooner RK, Kidorf M. Using behavioral reinforcement to improve methadone treatment participation. Science and Practice Perspectives. 2002;1:38–48.
  8. Fava M, Rush AJ, Trivedi MH, Nierenberg AA, Thase ME, Sackeim HA, Quitkin FM, Wisniewski S, Lavori PW, Rosenbaum JF, Kupfer DJ. Background and rationale for the sequenced treatment alternatives to relieve depression (STAR*D) study. Psychiatric Clinics of North America. 2003;26(3):457–494.
  9. C. N. Fiechter. Efficient reinforcement learning. In Proceedings of the Seventh Annual ACM Conference on Computational Learning Theory (COLT 1994), pages 88–97, New Brunswick, NJ, 1994.
  10. C. N. Fiechter. Expected mistake bound model for on-line reinforcement learning. In Proceedings of the Fourteenth International Conference on Machine Learning, Douglas H. Fisher (Ed.), pages 116–124, Nashville, Tennessee, 1997.
  11. S. M. Kakade. On the Sample Complexity of Reinforcement Learning. Ph.D. thesis, University College London, 2003.
  12. Kearns M, Mansour Y, Ng AY. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning. 1999;49(2–3):193–208.
  13. M. Kearns, Y. Mansour and A. Y. Ng. Approximate planning in large POMDPs via reusable trajectories. In Advances in Neural Information Processing Systems 12, MIT Press, 2000.
  14. Ormoneit D, Sen S. Kernel-based reinforcement learning. Machine Learning. 2002;49(2–3):161–178.
  15. L. Peshkin and C. R. Shelton. Learning from scarce experience. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), Claude Sammut, Achim G. Hoffmann (Eds.), pages 498–505, Sydney, Australia, 2002.
  16. Rush AJ, Crismon ML, Kashner TM, Toprac MG, Carmody TJ, Trivedi MH, Suppes T, Miller AL, Biggs MM, Shores-Wilson K, Witte BP, Shon SP, Rago WV, Altshuler KZ, TMAP Research Group. Texas medication algorithm project, phase 3 (TMAP-3): Rationale and study design. Journal of Clinical Psychiatry. 2003;64(4):357–369.
  17. Schneider LS, Tariot PN, Lyketsos CG, Dagerman KS, Davis KL, Davis S, Hsiao JK, Jeste DV, Katz IR, Olin JT, Pollock BG, Rabins PV, Rosenheck RA, Small GW, Lebowitz B, Lieberman JA. National Institute of Mental Health clinical antipsychotic trials of intervention effectiveness (CATIE). American Journal of Geriatric Psychiatry. 2001;9(4):346–360.
  18. Schapire RE, Bartlett P, Freund Y, Lee WS. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics. 1998;26(5):1651–1686.
  19. D. I. Simester, P. Sun, and J. N. Tsitsiklis. Dynamic catalog mailing policies. Unpublished manuscript, available electronically at http://web.mit.edu/jnt/www/Papers/P-03-sun-catalog-rev2.pdf, 2003.
  20. R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
  21. Thall PF, Millikan RE, Sung HG. Evaluating multiple treatment courses in clinical trials. Statistics in Medicine. 2000;19:1011–1028.
  22. Thall PF, Sung HG, Estey EH. Selecting therapeutic strategies based on efficacy and death in multicourse clinical trials. Journal of the American Statistical Association. 2002;97:29–39.
  23. Tsitsiklis JN, Van Roy B. Feature-based methods for large scale dynamic programming. Machine Learning. 1996;22:59–94.
  24. Tsitsiklis JN, Van Roy B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control. 1997;42(5):674–690.
  25. A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, New York, 1996.
  26. C. J. C. H. Watkins. Learning from Delayed Rewards. Ph.D. thesis, Cambridge University, 1989.
